When building production-grade enterprise solutions on AWS, we try to follow all the best practices (let’s call it the 5P approach). It’s great when we can build super simple solutions, but in most cases even a “simple” solution eventually becomes very complex: many services communicating with each other, multiple caching layers, different customer requirements, and so on.
It gets even more complicated when we deal with serverless and edge solutions. There are many limitations that are not obvious at first—especially if you haven’t faced them before. And when everything is already built, you can’t just abandon it. Instead, you start inventing custom workarounds (“crutches”).
I personally don’t like using such things in production, but sometimes there is simply no other option.
The good thing is that AWS provides a huge number of tools to solve problems. Sometimes all you need to do is wait and follow the AWS roadmap.
I’m pretty sure the AWS Lambda team will read this post and remove the current Lambda limitation I’ll talk about in the next section.
Enough Preamble
I think I’ve already caught your attention.
Let’s move on to the real problem. 🙂
The Architecture
We have a complex image optimization microservice that works like a charm. In short, it’s a multi-tenant, enterprise-grade environment that optimizes images on the fly.
Components
- CloudFront — CDN and edge caching
- CloudFront Function — request/response manipulation
- Origin Group with fallback:
  - S3 bucket (primary)
  - Lambda custom origin (fallback)
- Signed cookies with custom policy for authentication
- Three caching layers:
  - CloudFront edge cache
  - Browser cache
  - Backend cache
Request Flow
- Frontend sends a request with query parameters
- CloudFront triggers the custom origin (Lambda)
- Image is transformed by Lambda
- Transformed image is stored in S3 and returned to the user
- Next request for the same image:
  - Served directly from S3 (faster)
- Subsequent requests:
  - Served directly from CloudFront cache (fastest)
So far, so good. ✅
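To make the setup concrete, here is a minimal CDK sketch of the distribution with the origin group. It is not the production code: construct names, the Lambda origin domain, and the behavior settings are placeholders, and the CloudFront Function and signed cookies are omitted for brevity.

```typescript
import { App, Stack } from 'aws-cdk-lib';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3 from 'aws-cdk-lib/aws-s3';

const app = new App();
const stack = new Stack(app, 'ImageOptimizationStack');

// Primary origin: the bucket where transformed images are stored.
const imagesBucket = new s3.Bucket(stack, 'ImagesBucket');

// Fallback origin: the image-transforming Lambda exposed over HTTPS
// (the domain is a placeholder, e.g. a Lambda function URL).
const lambdaOrigin = new origins.HttpOrigin('transform.example.com');

// Origin group: try S3 first, fall back to the Lambda when the object
// does not exist in the bucket yet.
const originGroup = new origins.OriginGroup({
  primaryOrigin: new origins.S3Origin(imagesBucket),
  fallbackOrigin: lambdaOrigin,
  fallbackStatusCodes: [403, 404],
});

new cloudfront.Distribution(stack, 'ImageCdn', {
  defaultBehavior: {
    origin: originGroup,
    viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
  },
});
```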
The First Problem: Lambda Response Size Limit
Here’s where things get interesting.
Lambda cannot directly respond with a payload larger than 6 MB.
That’s a hard limit, and it’s a problem for our image optimization service.
The Initial Solution
Our approach was straightforward:
- Check the transformed image size
- If size > 6 MB:
  - Store the image in S3
  - Return a 302 redirect with a pre-signed URL
But wait—the image can’t be public for security reasons.
Solution: return a pre-signed URL with limited validity.
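Here is a rough sketch of that logic, assuming a Node.js Lambda with the AWS SDK v3. The bucket, key, content type, and the 5-minute expiry are illustrative, not the production values.

```typescript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3 = new S3Client({});
const SIX_MB = 6 * 1024 * 1024;

export async function respondWithImage(bucket: string, key: string, image: Buffer) {
  if (image.length <= SIX_MB) {
    // Small enough to return inline (base64-encoded for a Lambda proxy response).
    return {
      statusCode: 200,
      headers: { 'Content-Type': 'image/webp' },
      isBase64Encoded: true,
      body: image.toString('base64'),
    };
  }

  // Too large for a direct Lambda response: the image is already stored in S3,
  // so redirect to a short-lived pre-signed URL instead.
  const presignedUrl = await getSignedUrl(
    s3,
    new GetObjectCommand({ Bucket: bucket, Key: key }),
    { expiresIn: 300 }, // 5 minutes of validity; the exact value is an assumption
  );

  return {
    statusCode: 302,
    headers: { Location: presignedUrl },
    body: '',
  };
}
```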
The New Problem: Caching Pre-signed URLs
Here’s the catch: we can’t cache a pre-signed URL in:
- Browser cache
- CloudFront cache
Why? Because pre-signed URLs have expiration times, and caching them would serve expired URLs to users.
So we added cache-control headers:
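Roughly like this, on the 302 response returned by the Lambda (the exact directive set is an assumption based on the description in this post):

```typescript
// Attach Cache-Control directives to the 302 so neither the browser nor
// CloudFront keeps the short-lived pre-signed URL around.
export function redirectToPresignedUrl(presignedUrl: string) {
  return {
    statusCode: 302,
    headers: {
      Location: presignedUrl,
      'Cache-Control': 'no-cache, no-store',
    },
    body: '',
  };
}
```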
This works perfectly for the first request (Lambda → User).
But here’s the issue: the second request comes directly from S3. If we don’t add the same headers to the S3 object, we’ll have caching problems.
We fixed that too.
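The fix is just object metadata, set when the transformed image is written to S3 (bucket, key, and content type below are placeholders):

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

// Store the transformed image with the same Cache-Control metadata, so the
// follow-up request served from S3 carries the same directives.
export async function storeTransformedImage(bucket: string, key: string, image: Buffer) {
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: image,
    ContentType: 'image/webp',
    CacheControl: 'no-cache, no-store',
  }));
}
```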
Everything seemed under control. 👍
Cache Invalidation Fun
Now we needed to clean CloudFront cache only for images larger than 6 MB, without invalidating everything (which would be expensive and slow).
Our Approach
- Use S3 Inventory to get object metadata
- Query with Athena to find large objects (see the sketch after this list)
- Find all objects > 6 MB
- Invalidate only those paths in CloudFront
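The Athena part looked roughly like this; the inventory database/table names and the results location are assumptions, and polling for completion is omitted.

```typescript
import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena';

const athena = new AthenaClient({});

// S3 Inventory exposes object size in bytes; filter for anything over 6 MB.
const FIND_LARGE_OBJECTS = `
  SELECT key
  FROM s3_inventory.images_bucket_inventory
  WHERE size > 6 * 1024 * 1024
`;

export async function startLargeObjectQuery(): Promise<string> {
  const { QueryExecutionId } = await athena.send(new StartQueryExecutionCommand({
    QueryString: FIND_LARGE_OBJECTS,
    QueryExecutionContext: { Database: 's3_inventory' },           // assumed database
    ResultConfiguration: { OutputLocation: 's3://my-athena-results/' }, // placeholder
  }));
  // Waiting for the query and paginating the results is left out for brevity.
  return QueryExecutionId!;
}
```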
We discovered about 3,000 files that needed invalidation.
The CloudFront Limitation
“How would you handle 3,000 invalidations?”
Put all paths into the CloudFront console and click Invalidate?
❌ No.
CloudFront allows invalidating only 3 paths at a time when using wildcards (*).
Our Solution
We built a custom script:
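Something along these lines (the distribution ID, batch size, and pacing are assumptions; the point is submitting the exact paths in batches instead of one giant wildcard invalidation):

```typescript
import { CloudFrontClient, CreateInvalidationCommand } from '@aws-sdk/client-cloudfront';

const cloudfront = new CloudFrontClient({});
const DISTRIBUTION_ID = 'E1234567890ABC'; // placeholder
const BATCH_SIZE = 500;                   // assumed batch size, well under the per-request limit

export async function invalidatePaths(paths: string[]) {
  for (let i = 0; i < paths.length; i += BATCH_SIZE) {
    const batch = paths.slice(i, i + BATCH_SIZE);
    await cloudfront.send(new CreateInvalidationCommand({
      DistributionId: DISTRIBUTION_ID,
      InvalidationBatch: {
        CallerReference: `large-images-${Date.now()}-${i}`, // must be unique per request
        Paths: { Quantity: batch.length, Items: batch },
      },
    }));
    // Naive pacing to stay clear of invalidation rate limits.
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}
```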
It worked perfectly. ✅
Everything Is Fine… Until It Isn’t
Two weeks later, our client reported random issues: images sometimes disappeared from the cache unexpectedly.
Debugging Hell
Initial reproduction attempts:
- ✅ Different browsers
- ✅ Incognito mode
- ❌ Nothing reproduced
One hour later—finally reproduced the issue! 🎯
I spent two hours investigating:
- CloudWatch logs
- X-Ray traces
- CloudFront metrics
The S3 Logs Revelation
S3 server logs were tricky to analyze because:
- Many requests returned non-200 status codes
- We use an origin group with fallback, so some failures are expected
After filtering logs and excluding CloudFront user agents, I discovered something interesting:
- Many requests came directly from browsers (not CloudFront)
- These requests had non-200 status codes
Why were browsers making direct requests?
Because after a 302 response from CloudFront, the browser follows the redirect and makes the next request directly to the pre-signed S3 URL.
The Smoking Gun
I checked the failing requests:
- Request time: 2025-12-08 14:30
- Response: expired pre-signed URL
Cache-Control vs CloudFront Reality
I double-checked:
- Custom origin headers: no-cache, no-store
Meaning:
- No browser cache
- No CloudFront cache
Then I checked the CDK CloudFront cache policy:
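Roughly this (a sketch of the relevant part, not the full production stack; the construct names and query-string settings are placeholders):

```typescript
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';

const app = new App();
const stack = new Stack(app, 'ImageOptimizationStack');

// Custom cache policy for the image behavior.
new cloudfront.CachePolicy(stack, 'ImageCachePolicy', {
  defaultTtl: Duration.seconds(0),
  queryStringBehavior: cloudfront.CacheQueryStringBehavior.all(),
});
```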
Looks fine, right?
Default TTL = 0
Disclaimer
You should always read AWS documentation—especially the Warnings section.
It was not obvious that even with no-store and no-cache, a 302 response can still be cached, and that the behavior depends on minTtl.
From AWS docs:
If your minimum TTL is greater than 0, CloudFront uses the cache policy’s minimum TTL, even if the Cache-Control: no-cache, no-store, and/or private directives are present in the origin headers.
That was exactly my case.
The CDK Surprise
Another hour of debugging, and I finally checked the CloudFront UI. What do I see?
[CloudFront console screenshot: the cache policy TTL shows 365, not the 0 set in CDK]
Wait… what?
Who changed this?
Obviously me—I’m the only one working on this. No one hacked the account (hopefully).
But how?
I work only with CDK and never change production manually.
I open CDK, fix it, deploy again. Deployment successful.
Test again.
Same problem.
CloudFront UI still shows:
[CloudFront console screenshot: the TTL still shows 365]
At this point: OMG.
I create a test CDK stack.
Set defaultTtl = 0.
CloudFront UI still shows 365.
Try:
- defaultTtl = 0
- minTtl = 0
❌ Deployment fails due to validation.
No documentation. No warnings. No explanation.
For now:
- I change it manually
- I add a big warning note
Preventing This in the Future
Every problem should not only be fixed, but also prevented.
So I added additional monitoring, beyond existing alarms and dashboards:
- Monitor S3 access logs
- If the number of non-200 responses not coming from CloudFront exceeds a threshold → alert
There is no native AWS solution for this, so I built one (sketched after the list below):
- Step Functions
- Lambda
- Athena query
- Slack notification
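Here is a sketch of the core check, assuming the standard Athena table for S3 server access logs, a Node.js 18+ Lambda, and a Slack incoming webhook. The table name, threshold, time window, and environment variable are placeholders, and in the real setup Step Functions orchestrates the query and the waiting.

```typescript
import {
  AthenaClient,
  StartQueryExecutionCommand,
  GetQueryExecutionCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-athena';

const athena = new AthenaClient({});
const THRESHOLD = 50;                                     // assumed alert threshold
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL!; // assumed env var

// Count non-200 S3 access-log entries from the last hour that did not come
// from CloudFront (the origin-facing user agent is "Amazon CloudFront").
const QUERY = `
  SELECT count(*) AS failures
  FROM s3_access_logs.images_bucket_logs
  WHERE httpstatus <> '200'
    AND useragent NOT LIKE '%Amazon CloudFront%'
    AND parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z') > now() - interval '1' hour
`;

export async function handler() {
  const { QueryExecutionId } = await athena.send(new StartQueryExecutionCommand({
    QueryString: QUERY,
    ResultConfiguration: { OutputLocation: 's3://my-athena-results/' }, // placeholder
  }));

  // Naive polling; in practice Step Functions handles the wait and retries.
  let state = 'RUNNING';
  while (state === 'RUNNING' || state === 'QUEUED') {
    await new Promise((resolve) => setTimeout(resolve, 2000));
    const { QueryExecution } = await athena.send(
      new GetQueryExecutionCommand({ QueryExecutionId }),
    );
    state = QueryExecution?.Status?.State ?? 'FAILED';
  }
  if (state !== 'SUCCEEDED') throw new Error(`Athena query ended in state ${state}`);

  const results = await athena.send(new GetQueryResultsCommand({ QueryExecutionId }));
  // Row 0 is the header row; row 1 holds the count.
  const failures = Number(results.ResultSet?.Rows?.[1]?.Data?.[0]?.VarCharValue ?? '0');

  if (failures > THRESHOLD) {
    await fetch(SLACK_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: `S3 returned ${failures} non-200 responses (not from CloudFront) in the last hour`,
      }),
    });
  }
}
```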
This solution is:
- Cost-effective
- Relatively simple
As of writing this:
- Almost 2 weeks in production
- No false alarms
- I even caught a path traversal attack
Key Takeaway
Honestly? I don’t really know.
I believe I did everything correctly. There was no obvious reason to verify CloudFront configuration after a CDK deployment.
I hope this post:
- Helps someone avoid the same mistake
- Reaches someone who can explain why this happens
Because even now, this still feels like a riddle to me.
