Sometimes the most beautiful cost optimizations break production in ways you never expect.

The Beginning: A Simple Optimization

We had a web application running on Amazon ECS with AWS Fargate. The app processed background jobs: taking files, generating archives, and returning results to users asynchronously.

Then we noticed: EFS costs were much higher than expected.

After checking billing, the culprit was clear: read/write operations, not storage size. Most of those operations came from temporary files created and deleted during archive generation.

The fix seemed obvious: move the transient workload from EFS to the container's /tmp (ephemeral storage).
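The change amounts to doing all archive work on container-local disk instead of a network mount. A minimal sketch, with an illustrative function name and payload shape (not our actual code):

```python
import tempfile
import zipfile
from pathlib import Path

def build_archive(files: dict) -> bytes:
    """Build a zip archive entirely on container-local ephemeral storage.

    `files` maps archive member names to raw contents (bytes).
    """
    # TemporaryDirectory lands on local disk (normally /tmp), so every
    # read/write here is local I/O, not a billable EFS operation.
    with tempfile.TemporaryDirectory() as workdir:
        archive_path = Path(workdir) / "result.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for name, data in files.items():
                zf.writestr(name, data)
        # Read the finished archive back before the temp dir is cleaned up.
        return archive_path.read_bytes()
```

The archive bytes can then be uploaded to S3 or streamed back to the user; nothing ever touches EFS.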

Results:

  • ✅ ~70% cost reduction
  • ✅ No network storage dependency
  • ✅ Cleaner architecture

We deployed. Production started acting strange.

What We Didn’t See

The async flow:

Web request → Notification → Lambda → Command inside running container

The critical detail: Background commands executed inside the same container handling HTTP traffic.

No dedicated worker. No isolation. Just “another process” in the web task.

What went wrong:

1. Lambda always targeted the first running task
2. All async jobs hit one container
3. Ephemeral storage filled up
4. Web server in that container failed
5. Load balancer kept routing traffic to it
6. Users got random errors

The cost optimization worked. The architecture didn’t.

The Real Problems

Problem 1: Hidden Coupling

Async logic inside web containers meant shared resources:

| Resource | Impact |
| --- | --- |
| CPU | Jobs starve web requests |
| Memory | OOM kills affect both |
| Storage | Ephemeral storage fills; web fails |
| Lifecycle | Deploys kill in-flight jobs |
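The storage coupling in particular was invisible until it failed. A guard that would have surfaced it is a headroom check before accepting each job; a sketch, with an illustrative threshold:

```python
import shutil

def has_headroom(path: str = "/tmp",
                 min_free_bytes: int = 512 * 1024 * 1024) -> bool:
    """Return True if `path` has at least `min_free_bytes` available.

    Checking before accepting a job turns "disk silently fills and the
    co-located web server dies" into an explicit, observable rejection.
    """
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes

if __name__ == "__main__":
    # Reject or requeue instead of letting /tmp silently fill up.
    print("accepting jobs:", has_headroom())
```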

Problem 2: No True Decoupling

No durable queue. No guaranteed delivery. No retry strategy. No job ownership.

Just direct triggering. It looked asynchronous but wasn’t architecturally decoupled.

Problem 3: Autoscaling Chaos

When autoscaling or deployments terminated tasks, running jobs disappeared. No state. No retry. No visibility.

Why Not Lambda or Batch?

| Solution | Why not? |
| --- | --- |
| Lambda | 6 MB response payload limit; storage constraints; cold starts; execution time limits |
| AWS Batch | 2-3 min startup latency; image pull overhead; overkill for 1-5 second tasks; cost-inefficient for high-frequency jobs |

We needed fast startup, isolation from web traffic, scalability, durability, and cost awareness.

The Solution: Dedicated Worker Architecture

High-Level Architecture:

Worker Service Architecture

Flow:

┌─────────┐      ┌─────┐      ┌─────┐      ┌────────────┐
│   Web   │─────▶│ SNS │─────▶│ SQS │─────▶│  Worker    │
│ Service │      │     │      │     │      │  Service   │
└─────────┘      └─────┘      └─────┘      └─────┬──────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │ DynamoDB + S3│
                                          └──────────────┘
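On the web side, the only responsibility left is publishing a small job descriptor to SNS (SQS is subscribed to the topic). A boto3 sketch; the payload fields and topic wiring are illustrative:

```python
import json
import uuid

def build_job_message(job_type: str, s3_input_key: str) -> dict:
    """Build the job payload published to SNS.

    Keeping the payload small (IDs and S3 keys, never file contents)
    stays well under the 256 KB SNS/SQS message size limit.
    """
    return {
        "job_id": str(uuid.uuid4()),
        "job_type": job_type,
        "input_key": s3_input_key,
    }

def publish_job(topic_arn: str, message: dict) -> str:
    """Publish the job descriptor; returns the SNS message ID."""
    import boto3  # deferred so the pure helper works without AWS deps

    sns = boto3.client("sns")
    resp = sns.publish(TopicArn=topic_arn, Message=json.dumps(message))
    return resp["MessageId"]
```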

Queue-Based Decoupling

Introduced Amazon SQS for durable job management:

✓ Guaranteed delivery
✓ Automatic retries
✓ Dead-letter queue for failures
✓ Back-pressure control
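Most of those guarantees are queue configuration rather than application code. A sketch of the attributes involved, with illustrative timeouts and a hypothetical DLQ ARN:

```python
import json

def queue_attributes(dlq_arn: str, max_receives: int = 3) -> dict:
    """SQS attributes for a worker queue with retry + DLQ semantics.

    A message that fails delivery `max_receives` times is moved to the
    dead-letter queue instead of looping forever.
    """
    return {
        "VisibilityTimeout": "120",             # > worst-case job duration
        "ReceiveMessageWaitTimeSeconds": "20",  # long polling
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receives),
        }),
    }

def create_job_queue(name: str, dlq_arn: str) -> str:
    import boto3
    sqs = boto3.client("sqs")
    resp = sqs.create_queue(QueueName=name,
                            Attributes=queue_attributes(dlq_arn))
    return resp["QueueUrl"]
```

The visibility timeout doubles as the retry mechanism: a worker that dies mid-job simply never deletes the message, and SQS redelivers it.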

Dedicated Worker Service

Separate ECS service with:

| Feature | Benefit |
| --- | --- |
| Independent scaling | 1-10 tasks based on queue depth |
| Isolated resources | CPU/memory tuned for jobs |
| Separate lifecycle | Deployments don't kill jobs |
| Dedicated logs | Clear observability |
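The worker itself is a small long-polling loop. A minimal sketch, assuming one job per message; the class and handler names are illustrative:

```python
import json
import signal

class Worker:
    """Long-polling SQS consumer with graceful shutdown.

    On SIGTERM (ECS task stop / deployment) the loop finishes the
    in-flight message, then exits, so deploys no longer kill jobs.
    """

    def __init__(self, handler):
        self.handler = handler
        self.running = True
        try:
            signal.signal(signal.SIGTERM, self._stop)
        except ValueError:
            pass  # not on the main thread

    def _stop(self, *_):
        self.running = False

    def process(self, body: str) -> bool:
        """Handle one message body; True means safe to delete."""
        job = json.loads(body)
        self.handler(job)
        return True

    def run(self, queue_url: str):
        import boto3
        sqs = boto3.client("sqs")
        while self.running:
            resp = sqs.receive_message(
                QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                if self.process(msg["Body"]):
                    # Delete only after success; otherwise the visibility
                    # timeout expires and SQS redelivers (automatic retry).
                    sqs.delete_message(
                        QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
                    )
```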

State Management

| Component | Purpose |
| --- | --- |
| DynamoDB | Job tracking & status |
| S3 | Results storage |
| EFS | Only where necessary |
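A sketch of what "job tracking" looks like in practice, with hypothetical table, bucket, and attribute names:

```python
import time
from typing import Optional

def status_update_params(table: str, job_id: str, status: str,
                         result_key: Optional[str] = None) -> dict:
    """Build update_item arguments that record a job's current state."""
    expr = "SET #s = :s, updated_at = :t"
    values = {":s": {"S": status}, ":t": {"N": str(int(time.time()))}}
    if result_key:
        expr += ", result_key = :r"
        values[":r"] = {"S": result_key}
    return {
        "TableName": table,
        "Key": {"job_id": {"S": job_id}},
        "UpdateExpression": expr,
        # `status` is a DynamoDB reserved word, so it needs an alias.
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": values,
    }

def complete_job(job_id: str, result: bytes, bucket: str, table: str):
    """Upload the result to S3, then mark the job DONE in DynamoDB."""
    import boto3
    key = f"results/{job_id}.zip"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=result)
    boto3.client("dynamodb").update_item(
        **status_update_params(table, job_id, "DONE", key)
    )
```

Because state lives in DynamoDB and results in S3, the web service can answer "is my job done?" without ever talking to a worker.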

AWS Well-Architected Alignment

| Pillar | Implementation |
| --- | --- |
| Reliability | Durable queue, DLQ, retries, graceful termination |
| Performance | Auto-scaling based on queue depth, independent tuning |
| Cost | Ephemeral storage where appropriate, no over-provisioning |
| Operations | Infrastructure as Code, dedicated monitoring, clear ownership |
| Security | Isolated IAM roles, least privilege, private networking |
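The queue-depth scaling can be made concrete. In production this is a target-tracking policy on the SQS `ApproximateNumberOfMessagesVisible` metric; the arithmetic it approximates looks like this (job duration and drain target are illustrative):

```python
import math

def desired_tasks(visible_messages: int, seconds_per_job: float = 3.0,
                  target_drain_seconds: float = 60.0,
                  min_tasks: int = 1, max_tasks: int = 10) -> int:
    """Enough workers to drain the current backlog within
    `target_drain_seconds`, clamped to the service's 1-10 task range.
    """
    needed = math.ceil(visible_messages * seconds_per_job / target_drain_seconds)
    return max(min_tasks, min(max_tasks, needed))
```

For example, 100 queued 3-second jobs and a 60-second drain target call for 5 tasks; an empty queue idles at the 1-task floor.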

Key Takeaways

The lesson: The solution wasn’t just separation — it was architectural decoupling.

Sometimes a 2-second background task requires a properly designed distributed system.

Why? Production doesn’t fail on obvious parts. It fails on invisible assumptions.

Before:

Web Container = HTTP Server + Background Jobs
→ Shared resources
→ Tight coupling
→ No guarantees

After:

Web Service → Queue → Worker Service
→ Isolated resources
→ Decoupled architecture
→ Guaranteed delivery

The difference between “it works” and “it survives production”:

Don’t just make systems asynchronous. Make them decoupled.