When you run a web application, not everything should happen inside the main server. Some tasks are simply too heavy.
Whether it is importing massive datasets, rebuilding caches, generating PDFs, or syncing with external APIs, these jobs can take minutes, sometimes even hours. If you run them inside the main web server, users face long delays, requests time out, and the whole application grinds to a halt.
To avoid this, the standard engineering pattern is straightforward:
Web App → SQS Queue → Background Worker
The web app drops a job into the queue, and a dedicated worker picks it up in the background. This keeps the user-facing application fast, clean, and stable.
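For illustration, the producer side of this pattern might look like the sketch below. It assumes a boto3-style SQS client is passed in; the function name, the queue URL, and the job schema (`job_id`, `type`, `payload`) are hypothetical and not taken from our codebase.

```python
import json
import uuid


def enqueue_job(sqs, queue_url, job_type, payload):
    """Serialize a job and hand it to the queue; returns the generated job id.

    `sqs` is assumed to be a boto3 SQS client, e.g. boto3.client("sqs").
    The message schema here is illustrative only.
    """
    job_id = str(uuid.uuid4())
    body = json.dumps({"job_id": job_id, "type": job_type, "payload": payload})
    # The web request returns as soon as the message is accepted by SQS;
    # the heavy work happens later in the background worker.
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return job_id
```

The web handler only pays for one small API call, which is what keeps the user-facing request fast.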
But solving one problem often introduces another: cost.
The Price of Scale
If you run background workers 24/7, you pay for idle compute time when the queue is empty. Naturally, the next step is autoscaling:
- Queue is full? Spin up more workers.
- Queue is empty? Terminate workers to save money.
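The two rules above boil down to sizing the fleet from the queue backlog. A minimal sketch of that arithmetic follows; the function name and the `messages_per_worker`, `min_workers`, and `max_workers` parameters are assumptions for illustration. In practice the decision is usually made by an autoscaling policy reacting to a queue-depth metric rather than by application code.

```python
import math


def desired_worker_count(visible_messages, messages_per_worker,
                         min_workers=0, max_workers=10):
    """Return how many workers a given backlog calls for, within fleet limits."""
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    # Round up so a partial batch still gets a worker.
    wanted = math.ceil(visible_messages / messages_per_worker)
    # Clamp to the allowed fleet size.
    return max(min_workers, min(max_workers, wanted))
```

With `min_workers=0`, an empty queue yields zero workers, which is exactly the scale-in behavior that saves money and, as the rest of this post shows, needs care.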
In our architecture, we used a separate AWS ECS Fargate service for our background jobs. The setup looked perfectly logical:
Backend → Amazon SQS → ECS Fargate Worker
When there was work, ECS scaled up. When the work dried up, autoscaling removed the idle containers. Simple, right?
Not quite.
The Problem: Death by Scale-In
Our worker was completely healthy. It wasn’t crashing, it wasn’t stuck in an infinite loop, and it wasn’t leaking memory. Yet, AWS autoscaling killed it anyway.
Worse, it killed it at the worst possible moment: right as the worker picked up a brand-new job.
Here is the exact sequence of events that led to the failure:
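Worker Idle (Unprotected) → Autoscaling Selects Task for Scale-In → Worker Receives Message → Enable Protection (Too Late) → SIGTERM → Job Killed Mid-Run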
The result was painful. A heavy database operation started but never finished, leaving data in an incomplete state. Furthermore, because the worker had already grabbed the SQS message, that message stayed invisible until its visibility timeout expired, so no other worker could retry it right away.
Why This Happened (The Core Mistake)
We were using ECS Task Protection. This AWS feature lets a task tell ECS: “Do not terminate this task right now; it is doing critical work.”
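One way a task can toggle this protection is by calling the ECS agent's task protection endpoint from inside the container. The sketch below uses only the standard library and assumes the `ECS_AGENT_URI` environment variable that ECS injects into the container; the helper names are hypothetical, and this post does not say whether we used this endpoint or the equivalent AWS API call.

```python
import json
import os
import urllib.request


def build_protection_request(enabled, expires_in_minutes=None, agent_uri=None):
    """Build (without sending) the PUT request that toggles scale-in protection."""
    agent_uri = agent_uri or os.environ["ECS_AGENT_URI"]
    body = {"ProtectionEnabled": enabled}
    # An expiry only makes sense when protection is being turned on.
    if enabled and expires_in_minutes is not None:
        body["ExpiresInMinutes"] = expires_in_minutes
    return urllib.request.Request(
        url=f"{agent_uri}/task-protection/v1/state",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_task_protection(enabled, expires_in_minutes=None):
    """Send the request; this only works from inside a running ECS task."""
    request = build_protection_request(enabled, expires_in_minutes)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read())
```

Splitting the request construction from the network call keeps the payload easy to inspect outside of ECS.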
Our mistake was in when we toggled that protection. Our original application logic followed this lifecycle:
Receive Message → Enable Protection → Process Job → Disable Protection
At first glance, this looks correct. You protect the worker while it has work, and unprotect it when it’s done.
However, this created a dangerous timing gap. The worker was entirely unprotected while waiting for a new message. During that exact window, ECS Autoscaling could select the task for termination. Once ECS fires a SIGTERM and begins the shutdown process, enabling task protection is useless. Task protection only prevents a task from being selected for termination; it cannot cancel a shutdown that is already underway.
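The timing gap can be shown with a toy model. The class below is a stand-in written for this post, not real ECS behavior: it only encodes the rule from the paragraph above, that protection blocks selection but never cancels a shutdown already in progress.

```python
class SimulatedTask:
    """Toy stand-in for an ECS task (hypothetical, for illustration only)."""

    def __init__(self):
        self.protected = False
        self.stopping = False

    def enable_protection(self):
        # Protection is recorded, but it never clears an in-progress shutdown.
        self.protected = True

    def try_scale_in(self):
        # Scale-in may only select tasks that are unprotected at this instant.
        if not self.protected:
            self.stopping = True
        return self.stopping


def original_lifecycle(scale_in_fires_while_idle):
    """Replay Receive Message -> Enable Protection -> Process Job."""
    task = SimulatedTask()
    if scale_in_fires_while_idle:
        task.try_scale_in()      # the worker is idle and unprotected here
    task.enable_protection()     # a message arrived, so protection is requested
    return "killed mid-job" if task.stopping else "job completes"
```

The outcome depends only on whether the scale-in decision lands inside the idle window, which is what makes this a race condition rather than a consistent failure.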
The Fix: Flipping the Lifecycle
We inverted the logic. Instead of protecting the worker only while it processes a job, we protect it whenever it is active and ready to accept work.
The new lifecycle looks like this:
Protected & Active → Poll SQS → Process → Poll Again → Queue Empty → Unprotect
Now, autoscaling can still scale down the fleet to save money, but it only targets workers that have truly been idle for a safe period, not workers actively looking for work.
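A minimal sketch of the inverted loop is shown below. The `receive`, `process`, and `set_protection` callables and the `idle_limit_seconds` threshold are assumptions made for illustration; our actual idle period is not stated here. The sketch also assumes the worker stops polling once it drops protection, so it never waits for messages while unprotected.

```python
import time


def run_worker(receive, process, set_protection,
               idle_limit_seconds=300, clock=time.monotonic):
    """Poll while protected; drop protection only after a sustained idle period."""
    set_protection(True)          # protect BEFORE the first poll
    last_work = clock()
    while True:
        message = receive()       # e.g. an SQS long poll; None when the queue is empty
        if message is not None:
            process(message)
            last_work = clock()   # any completed job resets the idle timer
            continue
        if clock() - last_work >= idle_limit_seconds:
            break                 # idle long enough to be a safe scale-in target
    set_protection(False)         # unprotect only after polling has stopped
```

Note that ECS task protection expires after a set period (two hours by default), so a long-lived worker would also need to renew it; that renewal is left out of the sketch.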
Adding Two More Safety Nets
To make the architecture entirely resilient, we also implemented two defensive coding mechanisms:
- SIGTERM Awareness: Before a worker pulls or starts a new job, it checks if it has received a termination signal. If the container is already shutting down, it refuses to touch new work.
- Immediate SQS Release: If a worker accidentally pulls a message right as it receives a SIGTERM, it immediately changes that specific message’s visibility timeout to 0. This drops the message straight back into the queue so another healthy worker can process it instantly, rather than waiting for a 15-minute timeout.
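Both safety nets can be sketched together. The code below assumes a boto3-style SQS client; the names `shutting_down`, `install_sigterm_handler`, and `poll_once` are hypothetical, and the return strings exist only to make the control flow visible.

```python
import signal
import threading

# Set once SIGTERM arrives; checked before and after every poll.
shutting_down = threading.Event()


def install_sigterm_handler():
    """Call once from the main thread at worker startup."""
    signal.signal(signal.SIGTERM, lambda signum, frame: shutting_down.set())


def poll_once(sqs, queue_url, process):
    """Handle at most one message; returns 'skipped', 'empty', 'released' or 'done'."""
    if shutting_down.is_set():
        return "skipped"          # already shutting down: refuse new work
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    messages = response.get("Messages", [])
    if not messages:
        return "empty"
    receipt = messages[0]["ReceiptHandle"]
    if shutting_down.is_set():
        # SIGTERM landed during the long poll: make the message visible again now.
        sqs.change_message_visibility(
            QueueUrl=queue_url, ReceiptHandle=receipt, VisibilityTimeout=0
        )
        return "released"
    process(messages[0]["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
    return "done"
```

The second check matters because a long poll can block for up to 20 seconds in this sketch, which is plenty of time for a termination signal to arrive between the first check and the message being returned.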
Summary of the Shift
Before
Idle & Unprotected → Receive Message → Protect → Process → Unprotect
- The Flaw: The worker is highly vulnerable in the gap between processing jobs.
After
Protected & Active → Poll SQS → Process → Poll Again → Queue Empty → Unprotect
- The Benefit: The worker is safe anytime it is capable of receiving data.
The Takeaway
Autoscaling isn’t dangerous, SQS isn’t flawed, and ECS Task Protection works exactly as intended. The failure was entirely ours for mismanaging the worker lifecycle.
If your backend workers are polling and ready to accept work at any moment, they must be protected before the message arrives. Once AWS starts the shutdown sequence, you can’t tell the infrastructure, “Wait, I’m busy now.” It was a small race condition, but it carried a massive production impact.
