The cache had plenty of CPU and memory headroom. Network throughput was the bottleneck.

Multiple ad servers periodically fetched campaign configuration from the cache. Campaign settings rarely changed, but the system pulled the entire dataset every cycle regardless of whether anything had been modified. As the number of servers grew, network throughput approached the instance’s baseline limit, so downscaling the instance was off the table.
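To see why full refresh traffic grows with the fleet, here is a rough back-of-the-envelope sketch. The function name and every number in it are hypothetical, chosen only for illustration; they are not measurements from this system.

```python
def full_refresh_mbps(servers: int, dataset_mb: float, interval_s: float) -> float:
    """Estimate cache egress in Mbit/s when every server pulls the full dataset each cycle."""
    # Each cycle moves servers * dataset_mb megabytes; multiply by 8 to get megabits.
    return servers * dataset_mb * 8 / interval_s


# Hypothetical figures: traffic scales linearly with the number of servers.
small_fleet = full_refresh_mbps(servers=10, dataset_mb=50, interval_s=10)
large_fleet = full_refresh_mbps(servers=40, dataset_mb=50, interval_s=10)
```

The dataset size and interval stay the same in both calls, so the only thing driving the increase is the server count.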

Data Separation

Looking at the cached campaign data, I found three different types bundled together.

Metadata. Campaign metadata and targeting conditions change infrequently. They only update when an advertiser modifies a campaign.

State data. Budget consumption updates with every ad impression. It must always reflect the latest state.

Shared data. Ad creatives can be shared across multiple campaigns. Including them within campaign data creates duplication.

I separated all three. Metadata and shared data switched to incremental refresh. State data continued refreshing every cycle.
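The separation can be sketched roughly as follows. This is a minimal illustration, not the actual schema: the field names (`targeting`, `budget_spent`, `creatives`) and the `split_campaign` helper are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CampaignMetadata:
    campaign_id: str
    name: str
    targeting: dict
    creative_ids: list  # references to shared creatives instead of embedded copies


@dataclass
class CampaignState:
    campaign_id: str
    budget_spent: float  # changes with every impression


def split_campaign(bundled: dict):
    """Split one bundled campaign record into metadata, state, and shared creatives."""
    # Key creatives by id so several campaigns can point at one stored copy.
    creatives = {c["id"]: c for c in bundled["creatives"]}
    metadata = CampaignMetadata(
        campaign_id=bundled["id"],
        name=bundled["name"],
        targeting=bundled["targeting"],
        creative_ids=sorted(creatives),
    )
    state = CampaignState(campaign_id=bundled["id"], budget_spent=bundled["budget_spent"])
    return metadata, state, creatives
```

With this split, a budget update touches only the small state record, and a creative shared by many campaigns is stored and transferred once.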

Incremental Refresh

Switching from full refresh to incremental refresh requires knowing what has changed.

A batch job fetches the latest data from the database, then compares it against what is stored in the cache. Only items whose content differs are written to the cache, and each write records a change timestamp in a dedicated change index. On the read side, services consult that index and fetch only the items changed since their last refresh.

flowchart LR
    subgraph Write ["Write Path"]
        DB["DB"] --> BATCH["Batch"]
        BATCH --> CMP{"Compare<br/>with cache"}
        CMP -->|"Changed"| WRITE["Update cache<br/>+ record timestamp"]
        CMP -->|"Same"| SKIP["Skip"]
    end
    subgraph Read ["Read Path"]
        SVC["Service"] --> TS{"Changes since<br/>last refresh?"}
        TS -->|"Yes"| FETCH["Fetch changes only"]
        TS -->|"No"| LOCAL["Keep local data"]
    end

Separating write and read paths was the key. The batch writes only changes. Services read only changes. The detailed principles of this pattern are documented separately.
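Both paths can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production code: `Cache` is an in-memory stand-in rather than a real cache client, a logical version counter stands in for the change timestamps, and deletions are not handled. All class and function names are hypothetical.

```python
import json


class Cache:
    """In-memory stand-in for the shared cache (hypothetical, not a real client)."""

    def __init__(self):
        self.items = {}         # key -> serialized content
        self.change_index = {}  # key -> version at which the item last changed
        self.version = 0        # logical clock standing in for change timestamps


def write_changes(cache: Cache, latest: dict) -> int:
    """Write path: store only items whose content differs from what the cache holds."""
    written = 0
    for key, value in latest.items():
        payload = json.dumps(value, sort_keys=True)  # stable form for comparison
        if cache.items.get(key) == payload:
            continue  # same content: skip the write
        cache.version += 1
        cache.items[key] = payload
        cache.change_index[key] = cache.version
        written += 1
    return written


class Service:
    """Read path: keep a local copy and pull only what changed since the last refresh."""

    def __init__(self):
        self.local = {}
        self.last_seen = 0

    def refresh(self, cache: Cache) -> int:
        changed = [k for k, v in cache.change_index.items() if v > self.last_seen]
        for key in changed:
            self.local[key] = json.loads(cache.items[key])
        self.last_seen = cache.version
        return len(changed)


cache, service = Cache(), Service()
write_changes(cache, {"c1": {"name": "A"}, "c2": {"name": "B"}})
service.refresh(cache)  # first refresh pulls both items
write_changes(cache, {"c1": {"name": "A"}, "c2": {"name": "B"}})
service.refresh(cache)  # nothing changed, so nothing is fetched
```

In a real deployment the change index would live in the cache itself, and the read path would need to account for clock skew or concurrent writes between reading the index and recording the last refresh point.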

Result

Network throughput dropped significantly. During cycles with no changes, almost no data was transmitted. The cache instance could be downscaled to a smaller type.

Looking back, this work started with accurately identifying the bottleneck. Confirming that the constraint was network throughput, not CPU or memory, pointed directly to data separation and incremental refresh.

Reference