When building production-grade enterprise solutions on AWS, we try to follow all the best practices (let’s call it the 5P approach). It’s great when we can build super simple solutions, but in most cases even a “simple” solution eventually becomes very complex: many services communicating with each other, multiple caching layers, different customer requirements, and so on.

It gets even more complicated when we deal with serverless and edge solutions. There are many limitations which are not obvious at first, especially if you haven’t faced them before. And when everything is already built, you can’t just abandon it. Instead, you start inventing custom workarounds (what we’d call “crutches”).

I personally don’t like using such things in production, but sometimes there is simply no other option.

The good thing is that AWS provides a huge number of tools to solve problems. Sometimes all you need to do is wait and follow the AWS roadmap.

I’m pretty sure the AWS Lambda team will read this post and remove the Lambda limitation I’ll talk about below.

Enough Preamble

I think I’ve already caught your attention.
Let’s move on to the real problem. 🙂

The Architecture

We have a complex image optimization microservice that works like a charm. In short, it’s a multi-tenant, enterprise-grade environment that optimizes images on the fly.

Components

  • CloudFront — CDN and edge caching
  • CloudFront Function — request/response manipulation
  • Origin Group with fallback:
    • S3 bucket (primary)
    • Lambda custom origin (fallback)
  • Signed cookies with custom policy for authentication
  • Three caching layers:
    • CloudFront edge cache
    • Browser cache
    • Backend cache

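To make the origin setup concrete, here is a minimal CDK sketch of the S3-primary / Lambda-fallback origin group. It is written with CDK for Python to match the other snippets in this post; the construct names and the Lambda URL domain are placeholders, and the real stack also wires up the CloudFront Function, signed cookies, and cache policies.

from aws_cdk import Stack
from aws_cdk import aws_cloudfront as cloudfront
from aws_cdk import aws_cloudfront_origins as origins
from aws_cdk import aws_s3 as s3
from constructs import Construct

class ImageCdnStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "OptimizedImages")

        # Origin group: S3 is the primary origin, the Lambda custom origin is the fallback.
        # The Lambda URL domain below is a placeholder.
        origin_group = origins.OriginGroup(
            primary_origin=origins.S3Origin(bucket),
            fallback_origin=origins.HttpOrigin("abc123.lambda-url.eu-west-1.on.aws"),
            fallback_status_codes=[403, 404],  # S3 misses fall through to the Lambda
        )

        cloudfront.Distribution(
            self, "ImageCdn",
            default_behavior=cloudfront.BehaviorOptions(origin=origin_group),
        )
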
Request Flow

  1. Frontend sends a request with query parameters
  2. On a cache miss, CloudFront forwards the request to the origin group: S3 (the primary) doesn’t have the image yet, so it falls back to the Lambda custom origin
  3. Image is transformed by Lambda
  4. Transformed image is stored in S3 and returned to the user
  5. Next request for the same image:
    • Served directly from S3 (faster)
  6. Subsequent requests:
    • Served directly from CloudFront cache (fastest)

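A minimal sketch of the Lambda side of steps 2-4, assuming the function sits behind a Function URL (proxy-style event); the bucket name is a placeholder and transform_image() is a hypothetical helper that does the actual optimization:

import base64

import boto3

s3 = boto3.client("s3")
BUCKET = "optimized-images"  # placeholder

def handler(event, context):
    # Function URL (payload v2) event: path plus query parameters, e.g. /photos/cat.jpg?w=800
    key = event["rawPath"].lstrip("/")  # the real code also folds query params into the cache key
    params = event.get("queryStringParameters") or {}

    transformed = transform_image(key, params)  # hypothetical helper returning the optimized bytes

    # Store the result so the next request for the same image is served straight from S3
    s3.put_object(Bucket=BUCKET, Key=key, Body=transformed, ContentType="image/webp")

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "image/webp"},
        "body": base64.b64encode(transformed).decode(),
        "isBase64Encoded": True,
    }
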
So far, so good. ✅

The First Problem: Lambda Response Size Limit

Here’s where things get interesting.

Lambda cannot directly respond with a payload larger than 6 MB.
That’s a hard limit, and it’s a problem for our image optimization service.

The Initial Solution

Our approach was straightforward:

  1. Check the transformed image size
  2. If size > 6 MB:
    • Store the image in S3
    • Return a 302 redirect with a pre-signed URL

But wait—the image can’t be public for security reasons.

Solution: return a pre-signed URL with limited validity.

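A rough sketch of that branch, with a placeholder bucket name and a hypothetical inline_image_response() helper for the small-image case:

import boto3

s3 = boto3.client("s3")
BUCKET = "optimized-images"            # placeholder
MAX_LAMBDA_RESPONSE = 6 * 1024 * 1024  # Lambda's synchronous response payload limit

def respond(key: str, body: bytes) -> dict:
    if len(body) <= MAX_LAMBDA_RESPONSE:
        return inline_image_response(body)  # hypothetical helper: return the image inline as before

    # Too big for a Lambda response: store it and redirect to a short-lived pre-signed URL
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="image/webp")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,  # limited validity; the object itself stays private
    )
    return {"statusCode": 302, "headers": {"Location": url}}
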
The New Problem: Caching Pre-signed URLs

Here’s the catch: we can’t cache a pre-signed URL in:

  • Browser cache
  • CloudFront cache

Why? Because pre-signed URLs have expiration times, and caching them would serve expired URLs to users.

So we added cache-control headers:

Cache-Control: no-store, no-cache

This works perfectly for the first request (Lambda → User).

But here’s the issue: the second request comes directly from S3. If we don’t add the same headers to the S3 object, we’ll have caching problems.

We fixed that too.

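Roughly, by storing the object with Cache-Control metadata, which S3 then returns as a response header on every GET, including the one the browser makes after following the redirect (names below are placeholders):

import boto3

s3 = boto3.client("s3")

def store_uncacheable(bucket: str, key: str, body: bytes) -> None:
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="image/webp",
        CacheControl="no-store, no-cache",  # S3 sends this back as a header when the object is fetched
    )
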
Everything seemed under control. 👍

Cache Invalidation Fun

Now we needed to clean CloudFront cache only for images larger than 6 MB, without invalidating everything (which would be expensive and slow).

Our Approach

  1. Use S3 Inventory to get object metadata
  2. Query with Athena to find large objects
  3. Find all objects > 6 MB
  4. Invalidate only those paths in CloudFront

We discovered about 3,000 files that needed invalidation.

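Steps 1-3 boil down to a single Athena query over the inventory table. A rough boto3 sketch, with placeholder database, table, and output-location names (the inventory configuration must include the Size field):

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT key
FROM s3_inventory.optimized_images_inventory
WHERE size > 6 * 1024 * 1024
"""

def start_large_objects_query() -> str:
    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "s3_inventory"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/large-images/"},
    )
    # Poll get_query_execution() / get_query_results() with this id to collect the keys
    return response["QueryExecutionId"]
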
The CloudFront Limitation

“How would you handle 3,000 invalidations?”

Put all paths into the CloudFront console and click Invalidate?

No.

CloudFront allows invalidating only 3 paths at a time when using wildcards (*).

Our Solution

We built a custom script:

# Simplified version of our invalidation script (get_large_image_paths_from_athena is defined elsewhere)
import time

import boto3

cloudfront = boto3.client("cloudfront")
distribution_id = "YOUR_DISTRIBUTION_ID"  # placeholder

def chunks(items, size):
    """Yield successive fixed-size batches of paths."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def invalidate_large_images():
    paths = get_large_image_paths_from_athena()  # "/"-prefixed paths of objects > 6 MB

    for batch in chunks(paths, 3):  # CloudFront limit
        cloudfront.create_invalidation(
            DistributionId=distribution_id,
            InvalidationBatch={
                'Paths': {
                    'Quantity': len(batch),
                    'Items': batch
                },
                'CallerReference': str(time.time())  # must be unique per request
            }
        )
        time.sleep(1)  # Rate limiting

It worked perfectly. ✅

Everything Is Fine… Until It Isn’t

Two weeks later, our client reported random issues: images sometimes disappeared from the cache unexpectedly.

Debugging Hell

Initial reproduction attempts:

  • ✅ Different browsers
  • ✅ Incognito mode
  • ❌ Nothing reproduced

One hour later—finally reproduced the issue! 🎯

I spent two hours investigating:

  • CloudWatch logs
  • X-Ray traces
  • CloudFront metrics

The S3 Logs Revelation

S3 server logs were tricky to analyze because:

  • Many requests returned != 200 status codes
  • We use an origin group with fallback, so some failures are expected

After filtering logs and excluding CloudFront user agents, I discovered something interesting:

  • Many requests came directly from browsers (not CloudFront)
  • These requests had status codes != 200

Why were browsers making direct requests?

Because after a 302 response from CloudFront, the browser follows the redirect and makes the next request directly to the pre-signed S3 URL.

The Smoking Gun

I checked the failing requests:

  • Request time: 2025-12-08 14:30
  • Response: expired pre-signed URL

Cache-Control vs CloudFront Reality

I double-checked:

  • Custom origin headers
  • no-cache
  • no-store

Meaning:

  • No browser cache
  • No CloudFront cache

Then I checked the CDK CloudFront cache policy:

defaultTtl: cdk.Duration.seconds(0),
maxTtl: cdk.Duration.days(365),
minTtl: cdk.Duration.days(365),

Looks fine, right?

Default TTL = 0

Disclaimer

You should always read AWS documentation—especially the Warnings section.

It was not obvious that even with no-store and no-cache, a 302 response can still be cached, and that the behavior depends on minTtl.

From AWS docs:

If your minimum TTL is greater than 0, CloudFront uses the cache policy’s minimum TTL, even if the Cache-Control: no-cache, no-store, and/or private directives are present in the origin headers.

That was exactly my case.

The CDK Surprise

Another hour of debugging, and I finally checked the CloudFront UI. What do I see?

defaultTtl: cdk.Duration.seconds(365),
maxTtl: cdk.Duration.days(365),
minTtl: cdk.Duration.days(365),

Wait… what?

Who changed this?

Obviously me—I’m the only one working on this. No one hacked the account (hopefully).

But how?

I work only with CDK and never change production manually.

I open CDK, fix it, deploy again. Deployment successful.

Test again.

Same problem.

CloudFront UI still shows:

defaultTtl: cdk.Duration.seconds(365),
maxTtl: cdk.Duration.days(365),
minTtl: cdk.Duration.days(365),

At this point: OMG.

I create a test CDK stack. Set defaultTtl = 0.

CloudFront UI still shows 365.

Try:

  • defaultTtl = 0
  • minTtl = 0

❌ Deployment fails due to validation.

No documentation. No warnings. No explanation.

For now:

  • I change the cache policy manually in the CloudFront console
  • I add a big warning note

Preventing This in the Future

Every problem should not only be fixed, but also prevented.

So I added additional monitoring, beyond existing alarms and dashboards:

  • Monitor S3 access logs
  • If the number of != 200 responses not coming from CloudFront exceeds a threshold → alert

There is no native AWS solution for this, so I built one:

  • Step Functions
  • Lambda
  • Athena query
  • Slack notification

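A minimal sketch of the alerting step; the Step Functions wiring is omitted, the Athena table follows the AWS-documented access-logs layout, and the table name, threshold, and Slack webhook URL are placeholders (I also assume the state machine passes the query result count into the Lambda event):

import json
import urllib.request

# Roughly the query the Athena step runs over the S3 server access logs
QUERY = """
SELECT count(*) AS direct_errors
FROM s3_access_logs_db.image_bucket_logs
WHERE httpstatus <> '200'
  AND useragent NOT LIKE '%Amazon CloudFront%'
  AND parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z') > now() - interval '1' hour
"""

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
THRESHOLD = 50  # non-200 responses served directly from S3 per hour

def handler(event, context):
    direct_errors = int(event["direct_errors"])  # passed in by the state machine (assumption)
    if direct_errors > THRESHOLD:
        message = {"text": f":warning: {direct_errors} non-200 responses went directly to S3 in the last hour"}
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(message).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)
    return {"direct_errors": direct_errors, "alerted": direct_errors > THRESHOLD}
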
This solution is:

  • Cost-effective
  • Relatively simple

As of writing this:

  • Almost 2 weeks in production
  • No false alarms
  • I even caught a path traversal attack

Key Takeaway

Honestly? I don’t really know.

I believe I did everything correctly. There was no obvious reason to verify CloudFront configuration after a CDK deployment.

I hope this post:

  • Helps someone avoid the same mistake
  • Finds someone who can explain why this happens

Because even now, this still feels like a riddle to me.