Rate Limiting

Rate limiting is the mechanism that keeps healthy instances from exhausting their resources before autoscaling can react. When a traffic spike arrives faster than new instances become ready, it rejects some requests early so the instances already serving traffic stay within their limits.

How to count is the algorithm question. Where to count is the protection layer question. The two look independent, but the protection layer narrows which algorithms are actually available. That is why the layer has to be chosen before the algorithm.

Protection Layer

Three layers are candidates: L4, L7, and Application. As the layer moves inward, identification becomes more precise and the choice of algorithms widens with it.

L4 (Load Balancer) sits furthest out. It counts at the TCP connection level, with identification limited to roughly the client IP. Processing cost is low, but the counting granularity is coarse, so only simple algorithms apply. Identification gets coarser still when clients are behind NAT, because many clients then share one public IP.

L7 (Gateway or Sidecar) operates at the HTTP level. Application identifiers like headers, paths, and tokens can serve as counting keys. Counts split by user and by API, so the algorithm choice widens. In a microservice environment, a sidecar (Envoy and the like) is the first candidate.

The Application layer sees business context. Decisions such as different limits per user tier, protection for specific endpoints, or splitting by authenticated token type become possible. It is the most precise layer and also the heaviest, with the added cost of keeping counters consistent across instances.

The three layers form a clear trade-off. Outer layers identify coarsely but cost little, while inner ones grow more precise and more expensive.

The reason the layer choice narrows the algorithm candidate space lies in identification precision and counting granularity. At L4, where only IP/connection-level identification is possible, a precise algorithm like Sliding Window loses its precision advantage because the identification key is ambiguous. At Application, where authenticated user-level identification is possible, every algorithm operates meaningfully. Identification precision decides the algorithm’s utility itself. The algorithm comparison in the next section is a follow-up decision after this layer choice.
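To make the ambiguous-key point concrete, here is a minimal sketch of what each layer can use as a counting key. The `Request` fields, the `counting_key` function, and the sample values are all hypothetical and exist only for this illustration; real load balancers, gateways, and applications expose this information in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical fields, chosen only for this illustration.
    source_ip: str
    path: str
    api_token: str
    user_tier: str


def counting_key(layer: str, req: Request) -> tuple:
    """Return the most precise counting key the given layer can see."""
    if layer == "L4":
        # Only connection-level information is visible.
        return (req.source_ip,)
    if layer == "L7":
        # HTTP-level identifiers: a token header and the path.
        return (req.api_token, req.path)
    if layer == "Application":
        # Business context (here, the user tier) is available as well.
        return (req.api_token, req.path, req.user_tier)
    raise ValueError(f"unknown layer: {layer}")


# Two different users behind the same NAT gateway.
alice = Request("203.0.113.7", "/orders", "token-alice", "premium")
bob = Request("203.0.113.7", "/orders", "token-bob", "free")

same_at_l4 = counting_key("L4", alice) == counting_key("L4", bob)
same_at_l7 = counting_key("L7", alice) == counting_key("L7", bob)
```

At L4 both users collapse into one counter, so however precisely the algorithm counts, it is counting the wrong thing. From L7 inward the two users get separate counters.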

Algorithms

The widely used algorithms are Token Bucket, Fixed Window, Leaky Bucket, and Sliding Window. They split into two groups by whether bursts are allowed.

Burst-Allowing Group

Token Bucket keeps a bucket where tokens refill at a constant rate, and each request consumes a token. Pass if tokens exist, reject otherwise. Quiet periods let tokens accumulate up to the bucket's capacity, which allows short bursts of at most that size. It fits workloads that are quiet most of the time but spike briefly.
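The refill-and-consume loop can be sketched in a few lines. This is a minimal single-process illustration, not any library's implementation: the class name, the parameter names, and the choice to start with a full bucket are assumptions, and the caller passes the current time explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Token bucket: refill at a constant rate, spend one token per request."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # assumption: the bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(4)]        # burst of 4 at t=0
after_idle = [bucket.allow(now=10.0) for _ in range(4)]  # burst of 4 after a quiet period
```

In both bursts the first three requests pass and the fourth is rejected: the quiet period refills the bucket, but never beyond its capacity.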

Fixed Window counts requests inside fixed, non-overlapping time windows and resets the count at each window boundary. Bursts are allowed inside a window. It is the simplest to implement, with one weakness: a burst can straddle the boundary. Requests piled at the end of one window and the start of the next all pass, so up to twice the limit can get through in a span much shorter than one window.
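The boundary weakness is easiest to see by running it. The sketch below is a minimal single-process counter with hypothetical names; the limit of 5 requests per 10 seconds is an arbitrary choice for the demonstration.

```python
class FixedWindow:
    """Fixed window: count per window, reset when the window changes."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.window_id = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window_sec)
        if window_id != self.window_id:
            # A new window has started: the count resets in one step.
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindow(limit=5, window_sec=10)
end_of_window = [limiter.allow(now=9.9) for _ in range(5)]
start_of_next = [limiter.allow(now=10.0) for _ in range(5)]
passed = sum(end_of_window) + sum(start_of_next)
```

All ten requests pass even though they arrive within 0.1 seconds of each other, which is double the nominal limit of 5 per 10 seconds.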

Burst-Removing Group

Leaky Bucket models a queue that drains at a constant rate. Even if input arrives in bursts, the output rate stays flat, and requests that arrive while the queue is full are rejected. When downstream can only handle a fixed rate and you must feed it at that pace, it becomes the first candidate. Calls sent to an external payment gateway are a typical case.
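A minimal sketch of the queue form follows, under stated assumptions: the class and method names are hypothetical, the caller supplies the clock, and `drain` returns each released request together with the time slot it was assigned so the flattening is visible. A production limiter would normally drain from a timer or worker rather than on demand.

```python
from collections import deque


class LeakyBucket:
    """Leaky bucket as a bounded queue drained at a constant rate."""

    def __init__(self, capacity: int, leak_interval_sec: float):
        self.capacity = capacity
        self.leak_interval_sec = leak_interval_sec
        self.queue = deque()
        self.next_leak = 0.0  # earliest time the next request may leave

    def offer(self, request, now: float) -> bool:
        if len(self.queue) >= self.capacity:
            return False  # queue full: reject
        if not self.queue:
            # An idle bucket must not bank unused slots from the past.
            self.next_leak = max(self.next_leak, now)
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Release every queued request whose slot is due, one per interval."""
        released = []
        while self.queue and self.next_leak <= now:
            released.append((self.next_leak, self.queue.popleft()))
            self.next_leak += self.leak_interval_sec
        return released


bucket = LeakyBucket(capacity=3, leak_interval_sec=1.0)
accepted = [bucket.offer(r, now=0.0) for r in "abcd"]  # burst of 4 at t=0
first = bucket.drain(now=0.0)
later = bucket.drain(now=2.0)
```

The burst of four is cut to three by the queue capacity, and the three accepted requests leave in slots 0.0, 1.0, and 2.0, one per interval regardless of how they arrived.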

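Sliding Window moves the time window along with the current time as it counts, so every decision looks at exactly the most recent window's worth of requests. With no fixed boundary, the boundary-burst problem of Fixed Window disappears. Precision is highest, but in its log form it must store each request timestamp separately, making memory and computation the heaviest of the four.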
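The log form described above can be sketched as a queue of timestamps. Names and parameters are hypothetical, and the same limit of 5 per 10 seconds is used so the result can be compared with the boundary case of Fixed Window.

```python
from collections import deque


class SlidingWindowLog:
    """Sliding window log: keep one timestamp per accepted request."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have slid out of the window ending at `now`.
        cutoff = now - self.window_sec
        while self.stamps and self.stamps[0] <= cutoff:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


limiter = SlidingWindowLog(limit=5, window_sec=10)
end_of_window = [limiter.allow(now=9.9) for _ in range(5)]
just_after = [limiter.allow(now=10.0) for _ in range(5)]
```

The five requests at 9.9 pass and the five at 10.0 are all rejected, because the window ending at 10.0 still contains the earlier five. The cost is visible in `stamps`: memory grows with the limit, one entry per accepted request per key.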

The choice between the two groups depends on whether downstream can absorb bursts. If it can, the burst-allowing group buys operational simplicity. If it cannot, the burst-removing group guarantees flattening.

Tool Mapping

Not every layer-algorithm combination is possible in practice. Real tools have settled on certain combinations.

Layer	Representative Tool	Natural Algorithm	Notes
L4	Nginx (`limit_req`)	Leaky Bucket	Request-rate limit, typically keyed by client IP
L7 Sidecar	Istio / Envoy	Token Bucket	HTTP header-based identification
Application	Resilience4j (Java)	Cycle-based	Business context aware
Application	Bucket4j (Java)	Token Bucket	Distributed backend support

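Nginx’s limit_req module runs as a Leaky Bucket. Strictly speaking, it belongs to Nginx’s HTTP module and limits requests per key rather than TCP connections; it is grouped with L4 here because it usually runs at the edge keyed by client IP, which gives it L4-like identification precision. Requests are forwarded at a fixed rate, which corresponds directly to Leaky’s output flattening. The burst parameter additionally lets a bounded number of excess requests queue instead of being rejected, absorbing short input bursts.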

In Istio / Envoy, the local rate limit filter is configured as a Token Bucket, because allowing per-client bursts identified by HTTP headers is a common requirement in gateway environments. The sidecar itself provides both local mode (within a single instance) and global mode (delegating the decision to an external rate limit service, RLS).

Resilience4j’s RateLimiter module runs on cycle-based counting. At every limitRefreshPeriod, the available permissions are reset to limitForPeriod. Unlike Token Bucket, where tokens accumulate gradually, the count refreshes in a single step at each cycle boundary. It fits scenarios where simple cycle counting is enough for method-level protection, and it ships as part of the same component set as Circuit Breaker, Retry, and others.

Bucket4j is a library dedicated to Token Bucket. It supports sharing counters across distributed environments through backends like Redis, which makes it a candidate when protection has to cover the whole cluster rather than a single JVM.

Combinations that have settled in practice: Leaky dominates at L4; Token dominates in L7 sidecars and distributed Application protection (Bucket4j). Cycle-based tools like Resilience4j fit single-JVM scenarios where simple counting is enough. Sliding Window is missing from the table because tools rarely default to it — it tends to be custom built or layered on top of a distributed counter.

Decision Order

A single flow emerges from the layout above. Traffic shape does not decide algorithms in isolation; the layer narrows them first, then traffic shape narrows them further within what the layer allows.

Need protection based on business context → Application layer → Token Bucket
HTTP-level identification is enough → L7 → Token Bucket (local or global)
Connection-level protection is enough → L4 → Leaky Bucket

On top of that, whether bursts are allowed becomes the final refinement for the algorithm.
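The flow above can be written down as a small function. This is only an illustrative reading of the three rules: the function name, its parameters, and the `revisit` flag are hypothetical, and the flag encodes the final refinement as a simple check on whether the default algorithm's burst group matches what downstream can take.

```python
# Grouping taken from the Algorithms section.
BURST_ALLOWING = {"Token Bucket", "Fixed Window"}


def decide(needs_business_context: bool,
           http_identity_enough: bool,
           downstream_absorbs_bursts: bool) -> tuple:
    """Pick the layer first, then the default algorithm for that layer."""
    if needs_business_context:
        layer, algorithm = "Application", "Token Bucket"
    elif http_identity_enough:
        layer, algorithm = "L7", "Token Bucket"
    else:
        layer, algorithm = "L4", "Leaky Bucket"

    # Final refinement: a burst-allowing default in front of a downstream
    # that cannot absorb bursts is a signal to revisit the algorithm.
    revisit = algorithm in BURST_ALLOWING and not downstream_absorbs_bursts
    return layer, algorithm, revisit


choice = decide(needs_business_context=False,
                http_identity_enough=True,
                downstream_absorbs_bursts=False)
```

Here `choice` is `("L7", "Token Bucket", True)`: the layer is settled first, and the burst question only comes up afterwards, within what that layer offers.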

When placing protection in front of a dependency that needs it, starting from the algorithm comparison gets the order backwards: the layer, and the tools available at that layer, narrow the options before any algorithm is chosen. The layer decision has to come first for the algorithm’s candidate space to open up.

References

Ad system outage retrospective — shared dependencies and a single point of failure — A real case of how rate limiting could have blocked the starting point of a cascade.
