An external event pushed ad traffic far above normal levels.
Autoscaling tuned for baseline load couldn’t keep up. New PODs hadn’t become Ready before existing healthy PODs collapsed, and the shock propagated to the next system.
The immediate read was straightforward: traffic spike plus scaling lag. But once we unpacked the retrospective, the real problem sat elsewhere. When the filtering component alone went down, the primary and the fallback went down together.
How the Cascade Unfolded
The filtering component gave way first. Traffic grew faster than HPA could spin up new PODs. CPU on existing healthy PODs crossed the limit before new replicas were Ready, and Readiness Failed events stacked up. Scale-out was still in progress while the existing PODs fell apart in parallel. That was the first link in the cascade.
The primary ad system was next. It calls the filtering component when picking ad candidates. As the filtering component degraded, those calls piled up as TIMEOUTs, and the accumulated failures eventually dragged the primary ad system’s own state into a bad place. It didn’t recover on its own — a manual restart was required.
The inference component came third. Once the primary ad system recovered, the requests that had been blocked released all at once. Load that had dropped sprang back to normal in a single jump, and 5xx responses appeared while HPA caught up.
The cascade kept extending even as each upstream piece recovered. The filtering component recovered while the primary ad system was still stuck. The primary ad system recovered, and the inference component staggered. Read as a timeline, it looked like three separate incidents. The underlying flow was one.
Diagnosis — Shared Dependency and a Single Point of Failure
Two surface causes stand out: the sudden load created by an external event, and autoscaling that couldn’t match its pace.
Stopping there points the follow-up work toward “faster HPA, faster alerts, faster manual response.” All valid. All answers to the symptoms.
Pushing the retrospective one step further surfaced a different picture. Our ad system has a filtering component, and the primary ad system depends on it. We also kept a fallback system for failover. It sits idle most of the time and activates only when the primary can’t respond.
Inside that fallback, however, the filtering logic was wired to call the filtering component’s API directly to avoid duplicating logic.
That one wire changed everything.
Both the primary and the fallback were tied to the same filtering component. The filtering component was a single point of failure, and when that one point went down, both went down with it.
A fallback that shares dependencies cannot absorb the primary’s load. It pushes additional traffic through the same struggling dependency and amplifies the cascade. The fallback server we’d stood up separately turned out to be effectively a second primary, sharing the same single point of failure.
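The shared dependency described above is the kind of thing a small script can catch before an outage does. Here is a minimal sketch, assuming a hand-maintained dependency map. The service names (`primary-ads`, `fallback-ads`, `filtering`, `inference`) and both helper functions are hypothetical labels for illustration, and the edge from the primary to inference is an assumption read off the cascade order rather than a confirmed call graph.

```python
from typing import Dict, List, Set

# Hypothetical dependency map; the names are illustrative labels only.
DEPENDENCIES: Dict[str, List[str]] = {
    "primary-ads": ["filtering", "inference"],  # inference edge is assumed
    "fallback-ads": ["filtering"],              # the one wire described above
    "filtering": [],
    "inference": [],
}


def transitive_deps(service: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every service reachable from `service` by following call edges."""
    seen: Set[str] = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def shared_deps(primary: str, fallback: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Dependencies that would take the primary and the fallback down together."""
    return transitive_deps(primary, graph) & transitive_deps(fallback, graph)


print(sorted(shared_deps("primary-ads", "fallback-ads", DEPENDENCIES)))
# prints ['filtering']
```

A non-empty result means the fallback cannot be counted on when that dependency fails. Run in CI against a real service inventory, a check like this would flag the wire at review time.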
There’s another layer. Why did that single point fall over so quickly? The filtering component runs CPU-bound work — ad candidate evaluation, filtering — and Node’s single-threaded event loop doesn’t fit that profile well. CPU limits arrived faster on each POD, and that’s part of why the cascade’s first link started as quickly as it did.
The real problem of this outage was not the external event, and not the autoscaling lag. The filtering component was a single point of failure, the fallback was tied to it, and the point itself was structured to hit its CPU ceiling fast.
Recovery
With the diagnosis clear, the fix split into three paths. One separates the two sides by removing the fallback’s shared dependency. Another hardens the single point itself so it doesn’t collapse all at once. The third lifts the throughput of the point so the limit arrives later. None substitutes for the others. All three are needed to keep the single point of failure from becoming a cascade.
Removing the Fallback’s Filtering Dependency
The first path is to give the fallback its own filtering logic.
Operational cost goes up. Duplicating the logic and its data into the fallback adds synchronization work. The original choice to call the filtering component’s API came from that cost trade-off, and the decision wasn’t unreasonable at the time.
This outage shifted the weight on that trade-off. The savings had eaten into what makes a fallback a fallback: we had traded the reason for its existence against its operating cost. Looking again, one side is clearly heavier.
One option worth exploring is hosting the fallback’s filtering on serverless infrastructure. Idle is the fallback’s normal state, so serverless’s zero idle cost matches its profile. Independence comes back without the full operational tax.
Rate Limiting on the Filtering Component
The second path hardens the filtering component itself.
The first link of the cascade was the pattern of existing healthy PODs collapsing before scale-out finishes. Autoscaling is reactive by design — it kicks in after load arrives, so there’s always a gap when a spike hits. During that gap, the healthy PODs need protection from being dragged to their limit.
Rate limiting fills that role. It sheds a portion of requests until new PODs are Ready, keeping the healthy ones from being pushed past their threshold. A circuit breaker can do something similar by cutting requests to a struggling dependency once certain conditions hold. Either way, the goal is to keep the single point of failure from collapsing all at once.
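One common way to shed a portion of requests is a token bucket. This is a minimal single-process sketch, not the limiter we run. The class name, the rate, and the burst size are illustrative assumptions, and the injectable `clock` exists only so the behavior can be shown without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit up to `burst` requests at once, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst          # start full so normal traffic is unaffected
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Demo with a fake clock: 25 requests land in the same instant.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=5, burst=10, clock=lambda: fake_now[0])
admitted_in_spike = sum(bucket.allow() for _ in range(25))
fake_now[0] = 1.0  # one second later the bucket has refilled 5 tokens
admitted_after_1s = sum(bucket.allow() for _ in range(25))
print(admitted_in_spike, admitted_after_1s)  # prints 10 5
```

The shed requests fail fast instead of queueing on a POD that is already near its CPU limit, which is the protection the gap before new PODs are Ready calls for. A limiter shared across PODs would need external state. This sketch does not cover that.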
If fallback independence separates the two sides, rate limiting hardens the point itself. The single point of failure gets addressed from both directions.
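The circuit breaker mentioned in this section can be sketched in the same spirit. This is a simplified, single-threaded illustration under stated assumptions: the class names, the threshold, and the reset window are hypothetical, and a production breaker would also limit how many trial calls pass in the half-open state.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_sec:
                raise CircuitOpenError("dependency is cooling down")
            # Reset window elapsed: fall through and allow a trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens it.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_filtering() -> None:
    # Stand-in for a call to the struggling filtering component.
    raise TimeoutError("filtering component timed out")


now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, reset_after_sec=5.0, clock=lambda: now[0])
outcomes = []
for _ in range(5):
    try:
        breaker.call(call_filtering)
    except CircuitOpenError:
        outcomes.append("rejected fast")
    except TimeoutError:
        outcomes.append("timed out")
print(outcomes)
# prints ['timed out', 'timed out', 'timed out', 'rejected fast', 'rejected fast']
```

Placed in the caller, a breaker like this stops failures from accumulating as TIMEOUTs against a dependency that is already down.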
Runtime Reassessment
The third path lifts the throughput of the point itself.
Reviewing the filtering component’s workload makes clear that Node was a mismatch for it. CPU-heavy evaluation and filtering dominate the work, and a single-threaded event loop turns each in-flight request into a delay for the next.
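A small model makes the queueing effect concrete. It assumes every request arrives at the same instant and needs a fixed slice of uninterrupted CPU. The function name and the numbers (8 requests, 20 ms each) are made up for illustration and are not measurements from the filtering component. The model also ignores I/O waits and options such as worker threads.

```python
import heapq
from typing import List


def completion_times_ms(n_requests: int, cpu_ms: float, workers: int) -> List[float]:
    """Completion time of each request when all arrive at t=0 and each needs
    `cpu_ms` of uninterrupted CPU on one of `workers` execution threads."""
    free_at = [0.0] * workers        # min-heap: when each worker becomes free
    heapq.heapify(free_at)
    done: List[float] = []
    for _ in range(n_requests):
        start = heapq.heappop(free_at)   # earliest available worker
        finish = start + cpu_ms
        done.append(finish)
        heapq.heappush(free_at, finish)
    return done


single = completion_times_ms(8, 20.0, workers=1)   # one event loop
parallel = completion_times_ms(8, 20.0, workers=4)  # four threads on four cores
print(single[-1], parallel[-1])  # prints 160.0 40.0
```

With one thread, the eighth request waits behind seven others and finishes at 160 ms. With four threads it finishes at 40 ms. The parallel case only holds if the POD actually has the cores to run those threads.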
For more throughput on the same POD budget, a systems-language runtime fits better, and a reassessment of the runtime was in progress. If rate limiting keeps the single point from collapsing all at once, reconsidering the runtime raises the ceiling of that point. The two reinforce each other from different angles.
The migration stayed at the review stage, but given that the cascade’s first link came from hitting the CPU limit, the direction still reads as a valid option.
Remaining Follow-ups
A few operational changes round out the picture. Switching HPA’s target from CPU utilization to request count catches load increases earlier. CPU is a signal of the result of load; request count is closer to the cause. The change moves detection upstream.
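The difference shows up in HPA's documented scaling rule, desired = ceil(current replicas × current metric / target metric). The sketch below applies that rule to one made-up scenario. The replica count, the 70% CPU target, and the 100 requests-per-second target are illustrative numbers, not our configuration. It also assumes each POD's CPU request equals its limit, so measured utilization cannot rise much above 100%, and it ignores HPA's tolerance and stabilization settings.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA rule: scale in proportion to how far the metric is from target."""
    return math.ceil(current_replicas * current_value / target_value)


current = 10  # PODs running when traffic suddenly triples

# CPU-based target: throttled PODs report roughly 100% utilization at most,
# so the metric understates how far load has outrun capacity.
by_cpu = desired_replicas(current, current_value=100.0, target_value=70.0)

# Request-based target: each POD now sees 300 rps against a 100 rps target,
# so the full size of the spike is visible in a single evaluation.
by_requests = desired_replicas(current, current_value=300.0, target_value=100.0)

print(by_cpu, by_requests)  # prints 15 30
```

Under these assumptions the CPU signal asks for 15 PODs where the request signal asks for 30, because a saturated CPU metric hides the size of the spike.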
Manual scale-out by editing k8s configuration directly is slow. Routing it through a Slack bot trims that time significantly.
Neither is as fundamental as the three paths above, but both shorten the subsequent links of the cascade.
What I Took Away
The external event that day was only a trigger. Autoscaling’s limits were real. Above all of it sat the structure we had built. The filtering component was a single point of failure, the fallback was tied to it, and the point itself was structured to hit its CPU ceiling fast.
Following the cascade as a timeline pulls the answer toward “faster HPA, faster alerts.” The retrospective’s value sat in the question one step beyond that: why didn’t the fallback stop the cascade? From there came the diagnosis that the filtering component had been the single point of failure for both sides all along. Pushing once more brought the question of why that point collapsed so quickly. The gap between surface diagnosis and structural diagnosis was where the lesson lived.
A single point of failure becomes a cascade when the fallback shares the same dependency. Separating the two sides, hardening the point itself, and raising its throughput is how the system gets stronger.
References
- Ad Fallback Server Design — The original design retrospective of the fallback system. This outage came from a single line in that design: the fallback’s dependency on the filtering component.
- Rate Limiting — Protection layer and algorithm patterns that could have blocked the cascade at its starting point.
- Circuit Breaker — Circuit Breaker + Bulkhead pattern for blocking cascades on shared dependencies.