Artificial Intelligence is powerful. It accelerates development, generates scripts, and reduces operational effort.

But this article is not about replacing humans with AI.

It’s about how deep understanding of your system — combined with AI — can directly save you money. And sometimes, quite a lot.

When Deep Observability Really Matters

If you don’t have unlimited money and want your business to scale sustainably, you must understand:

  • What each component does
  • Why it exists
  • How it behaves over time
  • What exactly you’re paying for

The internet is full of advice about AWS cost optimization. Almost every article repeats the same checklist:

  • Optimize compute
  • Pick the correct instance type
  • Right-size your database
  • Turn on autoscaling
  • Buy reserved instances

That’s not wrong — but it’s generic. What you rarely see are practical examples where understanding business logic and system behavior makes the real difference.

Let me show you a real example.

The Problem: Redis Evictions

We were using AWS ElastiCache (Redis) in production. We configured CloudWatch alarms for the following metrics (a sample alarm definition is sketched right after the list):

  • Memory utilization
  • CPU utilization
  • Evictions
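
For reference, a minimal sketch of what such an eviction alarm can look like with the AWS CLI; the alarm name, cluster ID, region, account, and SNS topic are placeholders, not our actual configuration:

# Hypothetical example: alert as soon as any key is evicted from the node.
# All identifiers and thresholds below are illustrative values.
aws cloudwatch put-metric-alarm \
  --alarm-name "redis-evictions" \
  --namespace "AWS/ElastiCache" \
  --metric-name "Evictions" \
  --dimensions Name=CacheClusterId,Value=my-redis-cluster \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts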

Our eviction policy was allkeys-lru, meaning:

When memory is full, the least recently used keys are evicted.
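
On ElastiCache this policy is set through a parameter group rather than redis.conf. A hedged example (the parameter group name is a placeholder):

# Illustrative only: "my-redis-params" stands in for the parameter group
# attached to the cluster; ElastiCache does not expose redis.conf or CONFIG SET.
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"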

We initially provisioned:

  • Instance type: cache.m6g.large
  • Memory: 13.07 GiB
  • vCPU: 2
  • Network: Up to 10 Gbps
  • Architecture: Graviton2

Everything worked perfectly — until the first eviction alarm.

First Reaction: Scale Up

Memory was clearly insufficient, so we scaled up to:

  • Instance type: cache.m6g.xlarge
  • Memory: 26.32 GiB
  • vCPU: 4
  • Network: Up to 10 Gbps
  • Architecture: Graviton2

Cost doubled. The system stabilized:

  • Cache hit rate: ~90%
  • No significant traffic changes
  • No new major clients

Everything seemed fine — for two weeks. Then eviction alarms returned.

Something Was Wrong

Memory usage was constantly growing.

[Chart: Redis memory usage constantly growing over time]

That should not happen in a healthy cache. So I asked the development team:

What is the TTL of our Redis keys?

The answer:

Indefinitely.

Even with no significant traffic increase, memory had doubled again. That meant one thing: something was accumulating.

Investigation Phase

We created a ticket and isolated the environment:

  1. Spun up an additional container
  2. Used sticky sessions (to avoid impacting production load)
  3. Installed the Redis CLI
  4. Counted all keys

Result:

redis-cli --scan | wc -l
1,937,337 keys

That’s a lot. Then we checked the top keys and noticed a pattern:

<client>:2.3.3:entity:values:paragraph:<number>

Example:

<client>:2.3.3:entity:values:paragraph:676437

Every key contained a semantic version.
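
To see how many keys each release had left behind, a quick aggregation over the version segment was enough. A sketch, assuming every key follows the <client>:<version>:... layout shown above (the field position is an assumption):

# Count keys per version segment (second ":"-separated field) and show the
# largest groups. Assumes the <client>:<version>:... key layout above.
redis-cli --scan | awk -F: '{print $2}' | sort | uniq -c | sort -rn | head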

Root Cause

We discovered:

  • The backend only uses keys for the current release version
  • Old release keys remain in Redis
  • They are never reused
  • They are never deleted
  • TTL = infinite

So every deployment added a new full layer of cache:

Deployment 7.30 → Keys stored
Deployment 7.31 → New keys stored
Deployment 7.32 → More keys stored

Old versions just sit there consuming memory. This was not a scaling problem — it was a logic problem. And no autoscaling would fix it.

Possible Solutions

We evaluated options:

  1. Purge the entire cache now: the problem returns within a month
  2. Ignore the eviction alarms: we lose control and observability
  3. Flush the cache before every deployment: causes a database load spike
  4. Clean up old-version keys after deployment: controlled and version-aware

We chose Option 4.

Constraints

We needed a solution that:

  • Does not block deployment
  • Does not run inside the main application container
  • Deletes only old version keys
  • Is fast

First Attempt (Too Slow)

# Keep only the current release's keys; UNLINK everything else,
# one key (and one redis-cli round-trip) at a time.
redis-cli --scan \
  | grep -v '2\.20\.2_' \
  | while read -r k; do
      redis-cli UNLINK "$k"
    done

It worked, but:

  • ⏱ 1–2 hours of execution time (a separate redis-cli process and network round-trip for every key)
  • Operationally unacceptable

Smarter Approach: Lua Inside Redis

We realized we could run a Lua script directly inside Redis, server-side, which removes the per-key network round-trips that made the first attempt so slow.

Problem:

  • No one in the team knew Lua
  • Learning it would take 1–2 days
  • Would we use it again? Probably not

Instead of spending days learning the syntax, we used AI to generate the Lua script in about five minutes, then tested it.

Results:

  • Stage: ~7 minutes
  • Production: ~15 minutes

Clean. Controlled. Version-aware. Exactly what we needed.
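
Our script itself isn't reproduced here, but a minimal sketch of the same idea looks roughly like this: a tiny Lua script that processes one SCAN batch per call (so Redis is never blocked for long), driven by a small shell loop that passes in the cursor and the version to keep. The file name, batch size, and version argument are assumptions, and it presumes Redis 5+, where scripts replicate by effects (required because SCAN is non-deterministic):

# cleanup.lua: drop every key that does NOT contain the current version.
# One SCAN batch per invocation keeps each server-side call short.
cat > cleanup.lua <<'LUA'
-- ARGV[1] = SCAN cursor, ARGV[2] = current version substring to keep
local res = redis.call("SCAN", ARGV[1], "COUNT", 5000)
for _, key in ipairs(res[2]) do
  if not string.find(key, ARGV[2], 1, true) then
    redis.call("UNLINK", key)   -- non-blocking delete
  end
end
return res[1]                   -- next cursor; "0" means the scan is complete
LUA

# Drive the cursor from the shell; the comma separates KEYS from ARGV.
cursor=0
while :; do
  cursor=$(redis-cli --eval cleanup.lua , "$cursor" "2.3.3" | tr -d '"')
  [ "$cursor" = "0" ] && break
done

Because each EVAL handles only one batch, the cache keeps serving traffic while old keys are removed.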

Deployment Architecture

Final workflow:

After deploying a new version using a Blue/Green strategy, we run a canary verification to ensure stability. Once everything is confirmed healthy, a separate cleanup task is triggered. This task executes a Lua script inside Redis to remove outdated versioned keys — keeping only the current release cache and preventing unnecessary memory growth.
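
The pipeline itself is out of scope for this article, so here is only a sketch of what that cleanup trigger might look like, assuming an ECS-style setup (cluster, task definition, scripts, and version argument are all placeholders):

# Hypothetical post-deploy hook: runs only after the Blue/Green switch and
# the canary verification have passed, then launches the cleanup as a
# one-off task outside the application containers.
set -euo pipefail

NEW_VERSION="$1"                  # e.g. "2.3.3", injected by the pipeline

./canary_check.sh "$NEW_VERSION"  # placeholder for the canary verification step

aws ecs run-task \
  --cluster production \
  --task-definition redis-cleanup \
  --overrides "{\"containerOverrides\":[{\"name\":\"cleanup\",\"command\":[\"/cleanup.sh\",\"$NEW_VERSION\"]}]}"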

The Result

Since we stopped storing old cache, we scaled Redis back down:

  • Before: cache.m6g.xlarge
  • After: cache.m6g.large

  • Cost reduced by 50%
  • Cache hit rate remained stable (~90%)
  • Operational performance unchanged

[Chart: Redis cost reduction after optimization]

Financial Impact

Across environments:

  • Monthly cost: $500
  • Yearly: $500 × 12 = $6,000

After optimization:

  • $3,000 per year

Saved: $3,000 annually from one logical insight.

No new infrastructure. No complex redesign. No overengineering. Just understanding.

Key Takeaways

1. Work as One Team

Ask questions. Challenge assumptions. Brainstorm together.

2. Understand End-to-End Flow

Scaling is not always the answer. Sometimes the issue is architectural.

3. Don’t Be Afraid to Use AI

AI wrote the Lua script. Human reasoning found the real problem. That’s the right balance.

4. Use Community Knowledge

Forums (Reddit, StackOverflow) are gold.

5. Stay Curious

Even if it takes 1–2 extra hours to deeply understand something — it might save thousands.

Final Thought

AI can generate scripts. But only a human who understands the business logic can ask:

“Why is memory constantly growing?”

Technology scales. AI accelerates. But mindset saves money.