Artificial Intelligence is powerful. It accelerates development, generates scripts, and reduces operational effort.

But this article is not about replacing humans with AI.

It’s about how deep understanding of your system — combined with AI — can directly save you money. And sometimes, quite a lot.

When Deep Observability Really Matters

If you don’t have unlimited money and want your business to scale sustainably, you must understand:

  • What each component does
  • Why it exists
  • How it behaves over time
  • What exactly you’re paying for

The internet is full of advice about AWS cost optimization. Almost every article repeats the same checklist:

  • Optimize compute
  • Pick the correct instance type
  • Right-size your database
  • Turn on autoscaling
  • Buy reserved instances

That’s not wrong — but it’s generic. What you rarely see are practical examples where understanding business logic and system behavior makes the real difference.

Let me show you a real example.

The Problem: Redis Evictions

We were using AWS ElastiCache (Redis) in production. We configured CloudWatch alarms for the following metrics (a sample alarm definition is sketched right after the list):

  • Memory utilization
  • CPU utilization
  • Evictions
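
For reference, a minimal sketch of what such an eviction alarm can look like with the AWS CLI; the alarm name, cluster ID, region, account, and SNS topic are placeholders, not our actual configuration:

# Hypothetical example: alert as soon as any key is evicted from the node.
# All identifiers and thresholds below are illustrative values.
aws cloudwatch put-metric-alarm \
  --alarm-name "redis-evictions" \
  --namespace "AWS/ElastiCache" \
  --metric-name "Evictions" \
  --dimensions Name=CacheClusterId,Value=my-redis-cluster \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts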

Our eviction policy was allkeys-lru, meaning:

When memory is full, the least recently used keys are evicted.
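
On ElastiCache this policy is set through a parameter group rather than redis.conf. A hedged example (the parameter group name is a placeholder):

# Illustrative only: "my-redis-params" stands in for the parameter group
# attached to the cluster; ElastiCache does not expose redis.conf or CONFIG SET.
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"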

We initially provisioned:

  • Instance type: cache.m6g.large
  • Memory: 13.07 GiB
  • vCPU: 2
  • Network: Up to 10 Gbps
  • Architecture: Graviton2

Everything worked perfectly — until the first eviction alarm.

First Reaction: Scale Up

Memory was clearly insufficient, so we scaled up to:

  • Instance type: cache.m6g.xlarge
  • Memory: 26.32 GiB
  • vCPU: 4
  • Network: Up to 10 Gbps
  • Architecture: Graviton2

Cost doubled. The system stabilized:

  • Cache hit rate: ~90%
  • No significant traffic changes
  • No new major clients

Everything seemed fine — for two weeks. Then eviction alarms returned.

Something Was Wrong

Memory usage was constantly growing.

[Chart: Redis memory usage constantly growing over time]

That should not happen in a healthy cache. So I asked the development team:

What is the TTL of our Redis keys?

The answer:

Indefinitely.

Even with no significant traffic increase, memory had doubled again. That meant one thing: something was accumulating.

Investigation Phase

We created a ticket and isolated the environment:

  1. Spun up an additional container
  2. Used sticky sessions (to avoid impacting production load)
  3. Installed the Redis CLI
  4. Counted all keys

Result:

redis-cli --scan | wc -l
1,937,337 keys

That’s a lot. Then we checked the top keys and noticed a pattern:

<client>:2.3.3:entity:values:paragraph:<number>

Example:

<client>:2.3.3:entity:values:paragraph:676437

Every key contained a semantic version.
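
To see how many keys each release had left behind, a quick aggregation over the version segment was enough. A sketch, assuming every key follows the <client>:<version>:... layout shown above (the field position is an assumption):

# Count keys per version segment (second ":"-separated field) and show the
# largest groups. Assumes the <client>:<version>:... key layout above.
redis-cli --scan | awk -F: '{print $2}' | sort | uniq -c | sort -rn | head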

Root Cause

We discovered:

  • The backend only uses keys for the current release version
  • Old release keys remain in Redis
  • They are never reused
  • They are never deleted
  • TTL = infinite

So every deployment added a new full layer of cache:

Deployment 7.30 → Keys stored
Deployment 7.31 → New keys stored
Deployment 7.32 → More keys stored

Old versions just sit there consuming memory. This was not a scaling problem — it was a logic problem. And no autoscaling would fix it.

Possible Solutions

We evaluated options:

  1. Purge the entire cache now: the problem returns within a month
  2. Ignore the eviction alarms: we lose control and observability
  3. Flush the cache before every deployment: causes a database load spike
  4. Clean up old-version keys after deployment: controlled and version-aware

We chose Option 4.

Constraints

We needed a solution that:

  • Does not block deployment
  • Does not run inside the main application container
  • Deletes only old version keys
  • Is fast

First Attempt (Too Slow)

# Keep only the current release's keys; UNLINK everything else,
# one key (and one redis-cli round-trip) at a time.
redis-cli --scan \
  | grep -v '2\.20\.2_' \
  | while read -r k; do
      redis-cli UNLINK "$k"
    done

It worked, but:

  • ⏱ 1–2 hours of execution time (a separate redis-cli process and network round-trip for every key)
  • Operationally unacceptable

Smarter Approach: Lua Inside Redis

We realized we could run a Lua script directly inside Redis, server-side, which removes the per-key network round-trips that made the first attempt so slow.

Problem:

  • No one in the team knew Lua
  • Learning it would take 1–2 days
  • Would we use it again? Probably not

Instead of spending days learning the syntax, we used AI to generate the Lua script in about five minutes, then tested it.

Results:

  • Stage: ~7 minutes
  • Production: ~15 minutes

Clean. Controlled. Version-aware. Exactly what we needed.
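
Our script itself isn't reproduced here, but a minimal sketch of the same idea looks roughly like this: a tiny Lua script that processes one SCAN batch per call (so Redis is never blocked for long), driven by a small shell loop that passes in the cursor and the version to keep. The file name, batch size, and version argument are assumptions, and it presumes Redis 5+, where scripts replicate by effects (required because SCAN is non-deterministic):

# cleanup.lua: drop every key that does NOT contain the current version.
# One SCAN batch per invocation keeps each server-side call short.
cat > cleanup.lua <<'LUA'
-- ARGV[1] = SCAN cursor, ARGV[2] = current version substring to keep
local res = redis.call("SCAN", ARGV[1], "COUNT", 5000)
for _, key in ipairs(res[2]) do
  if not string.find(key, ARGV[2], 1, true) then
    redis.call("UNLINK", key)   -- non-blocking delete
  end
end
return res[1]                   -- next cursor; "0" means the scan is complete
LUA

# Drive the cursor from the shell; the comma separates KEYS from ARGV.
cursor=0
while :; do
  cursor=$(redis-cli --eval cleanup.lua , "$cursor" "2.3.3" | tr -d '"')
  [ "$cursor" = "0" ] && break
done

Because each EVAL handles only one batch, the cache keeps serving traffic while old keys are removed.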

Deployment Architecture

Final workflow:

After deploying a new version using a Blue/Green strategy, we run a canary verification to ensure stability. Once everything is confirmed healthy, a separate cleanup task is triggered. This task executes a Lua script inside Redis to remove outdated versioned keys — keeping only the current release cache and preventing unnecessary memory growth.
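
The pipeline itself is out of scope for this article, so here is only a sketch of what that cleanup trigger might look like, assuming an ECS-style setup (cluster, task definition, scripts, and version argument are all placeholders):

# Hypothetical post-deploy hook: runs only after the Blue/Green switch and
# the canary verification have passed, then launches the cleanup as a
# one-off task outside the application containers.
set -euo pipefail

NEW_VERSION="$1"                  # e.g. "2.3.3", injected by the pipeline

./canary_check.sh "$NEW_VERSION"  # placeholder for the canary verification step

aws ecs run-task \
  --cluster production \
  --task-definition redis-cleanup \
  --overrides "{\"containerOverrides\":[{\"name\":\"cleanup\",\"command\":[\"/cleanup.sh\",\"$NEW_VERSION\"]}]}"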

The Result

Since we stopped storing old cache, we scaled Redis back down:

  • Before: cache.m6g.xlarge
  • After: cache.m6g.large

  • Cost reduced by 50%
  • Cache hit rate remained stable (~90%)
  • Operational performance unchanged

[Chart: Redis cost reduction after optimization]

Financial Impact

Across environments:

  • Monthly cost: $500
  • Yearly: $500 × 12 = $6,000

After optimization:

  • $3,000 per year

Saved: $3,000 annually from one logical insight.

No new infrastructure. No complex redesign. No overengineering. Just understanding.

Key Takeaways

1. Work as One Team

Ask questions. Challenge assumptions. Brainstorm together.

2. Understand End-to-End Flow

Scaling is not always the answer. Sometimes the issue is architectural.

3. Don’t Be Afraid to Use AI

AI wrote the Lua script. Human reasoning found the real problem. That’s the right balance.

4. Use Community Knowledge

Forums (Reddit, StackOverflow) are gold.

5. Stay Curious

Even if it takes 1–2 extra hours to deeply understand something — it might save thousands.

Final Thought

AI can generate scripts. But only a human who understands the business logic can ask:

“Why is memory constantly growing?”

Technology scales. AI accelerates. But mindset saves money.