The deploy was two days old and metrics had been calm the whole time. That evening we removed the old cache refresh batch and cleared the lingering cache; external ad responses stopped. That was when we realized the change from two days earlier had never actually taken effect.
Timeline
On 11-25 we shipped what was supposed to be a switch to the new cache module. Monitoring stayed normal, user traffic looked fine. But because of a package version-lock mistake, the code referencing the new module was there while the dependency was still pinned to the old one. The system was reading the old cache keys, and a separate batch kept refreshing the old cache, so nothing looked broken.
At 18:20 on 11-27 we removed the old cache refresh batch. The cache stayed valid while the TTL ran out; at 18:39 we manually cleared what was left, and old cache misses started immediately. Nothing was writing to the new cache keys, so the system could not fetch campaign data. External ad responses stopped at 18:40, an external SSP team reported the outage at 19:23, and recovery completed at 20:05.
Root Cause
A deploy being done and the change being applied are two different facts. Until that evening, monitoring had only been reporting “deploy finished, system normal.” “Did we actually switch to the new module?” was not captured by any metric.
The signal in this case was obvious in hindsight. The lookup rate on the new cache keys should have been non-zero, and the lookup rate on the old keys should have been trending to zero. Comparing the two right after the 11-25 deploy would have surfaced the broken state immediately. But that metric did not exist. “Operating correctly on the old cache” and “operating correctly on the new cache” were indistinguishable at the metric level.
Two secondary factors delayed detection. When campaign data was empty, the system still returned HTTP 200; the internal placements’ Fallback mechanism absorbed the alarm signal. Together they kept the dashboard looking calm right after the cache misses began. They were not the root cause though. The root cause had been sitting quietly since two days earlier.
Retrospective
I think post-deploy monitoring should lead with “did the change actually take effect?” before “is the main metric steady?” Dependency version hashes, new cache key lookup rates, old cache key lookup rates — whatever form it takes, before-and-after needs to be separable as a metric. Without it, the broken state goes quiet and blows up on its own schedule.
Looking back, two days of calm was the most dangerous signal in this incident. A flat metric line looks the same whether the change took effect or never did.