A rate limiter is a mechanism used to control the number of requests or operations a user, client, or system can perform within a specific time window.
Its primary purpose is to ensure fair usage of resources, prevent abuse, and protect backend systems from being overwhelmed by sudden spikes in traffic.
Example: If a system allows a maximum of 100 requests per minute, any request beyond that limit within the same minute would either be throttled (delayed) or rejected outright, often with an HTTP 429 Too Many Requests response.
In this article, we will dive into the system design of a distributed rate limiter and explore the 5 most commonly used rate-limiting algorithms with examples, pros, and cons.
1. Requirements
Before diving into the architecture, let's outline the functional and non-functional requirements:
1.1 Functional Requirements
Per-User Rate Limiting: Enforce a fixed number of requests per user or API key within a defined time window (e.g., 100 requests per minute). Excess requests should be rejected with an HTTP 429 Too Many Requests response.
Global Enforcement: Limits must be enforced consistently across all nodes in a distributed environment. Users shouldn't bypass limits by switching servers.
Multi-Window Support: Apply limits across multiple time granularities simultaneously (e.g., per second, per minute, per hour) to prevent abuse over short and long bursts.
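For illustration, here is a minimal sketch of how such multi-window, per-tier limits might be expressed as configuration. The structure, tier names, and numbers are assumptions chosen for the example, not values from any specific product:

import dataclasses

# Hypothetical configuration: each tier gets limits at several
# granularities, and a request must satisfy all of them.
RATE_LIMITS = {
    "free": {
        "per_second": 5,
        "per_minute": 100,
        "per_hour": 2_000,
    },
    "premium": {
        "per_second": 50,
        "per_minute": 1_000,
        "per_hour": 30_000,
    },
}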
1.2 Non-Functional Requirements
To be usable at scale, our distributed rate-limiter must meet several critical non-functional goals:
Scalability: The system should scale horizontally to handle massive request volumes and growing user counts.
Low Latency: Rate limit checks should be fast, ideally adding no more than a few milliseconds per request.
High Availability: The rate-limiter should continue working even under heavy load or node failures. There should be no single point of failure.
Strong Consistency: All nodes should have a consistent view of each user’s request counts. This prevents a client from bypassing limits by routing requests through different servers.
High Throughput: The system should support a large number of operations per second and serve many concurrent clients without significant performance degradation.
2. High-Level Architecture
The rate limiter acts as a middleware layer between the client and the backend servers. Its job is to inspect incoming requests and enforce predefined usage limits (e.g., 100 requests per minute per user or IP).
To apply these limits effectively, the rate limiter must track request counts for each client. These counts are often maintained across multiple time windows, such as per second, per minute, or per hour.
Using a traditional relational database for this purpose is generally unsuitable due to:
High latency: Relational databases involve disk I/O, which introduces delays on every read/write.
Concurrency bottlenecks: Handling thousands of concurrent updates (e.g., one per incoming request) can lead to locks and race conditions.
Limited throughput: RDBMSs are not optimized for high-frequency, real-time counter updates.
An in-memory data store like Redis is a far better fit for rate limiting use cases because it offers:
Sub-millisecond latency for both reads and writes
Atomic operations like INCR, INCRBY, and EXPIRE, ensuring safe concurrent updates without race conditions
TTL (Time-to-Live) support, allowing counters to reset automatically at the end of each time window (e.g., after 60 seconds for a per-minute limit)
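As an illustration, here is a minimal fixed-window check built on those Redis primitives using redis-py. The key-naming scheme and the check_rate_limit helper are assumptions for this sketch; a production version would typically wrap the increment and expiry in a Lua script or pipeline so the two commands execute atomically:

import time
import redis

r = redis.Redis(host="localhost", port=6379)

def check_rate_limit(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Return True if the request is allowed under a fixed per-window limit."""
    # One counter per client per window, e.g. "rl:user42:28457093"
    window = int(time.time() // window_seconds)
    key = f"rl:{client_id}:{window}"

    count = r.incr(key)          # atomic increment; creates the key at 1 if missing
    if count == 1:
        r.expire(key, window_seconds)  # counter cleans itself up via TTL

    return count <= limit

Because INCR both creates and increments the counter atomically, concurrent requests from different application servers cannot race each other into an inconsistent count.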
Request Lifecycle
Here’s how the rate limiter fits into the flow of an incoming request:
Client sends request to an endpoint of the application.
The rate limiter middleware performs several checks:
Identifies the client (via IP, token, or API key)
Looks up the current request count in Redis (or in-memory cache)
Applies any tier-specific rules (e.g., free vs premium users)
If the count exceeds the allowed threshold, the request is rejected with HTTP 429 Too Many Requests.
If the count is within the limit, the counter is incremented and the request proceeds to the backend service.
Periodically, counters expire via TTL or are reset based on window granularity.
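A sketch of how this lifecycle might look as application middleware, reusing the check_rate_limit helper from the snippet above. The Request/Response types, the lookup_tier helper, and the tier limits are hypothetical stand-ins for whatever your framework and user store provide:

from dataclasses import dataclass, field

@dataclass
class Request:
    headers: dict = field(default_factory=dict)
    remote_addr: str = ""

@dataclass
class Response:
    body: str
    status: int = 200

TIER_LIMITS = {"free": 100, "premium": 1_000}   # requests per minute (assumed values)

def lookup_tier(client_id: str) -> str:
    # Hypothetical helper; in practice this might hit a cache or database.
    return "free"

def rate_limit_middleware(request: Request, next_handler) -> Response:
    # 1. Identify the client (API key if present, otherwise IP address)
    client_id = request.headers.get("X-API-Key") or request.remote_addr

    # 2. Apply tier-specific rules (free vs premium)
    limit = TIER_LIMITS.get(lookup_tier(client_id), TIER_LIMITS["free"])

    # 3. Look up and increment the counter in Redis
    if not check_rate_limit(client_id, limit=limit, window_seconds=60):
        return Response("Too Many Requests", status=429)

    # 4. Within the limit: hand the request on to the backend service
    return next_handler(request)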
Many modern applications delegate rate limiting to edge components such as API gateways or reverse proxies, which can efficiently enforce limits before traffic reaches backend services. However, for this discussion, we will focus on designing a standalone rate limiter that is integrated into or called by application servers directly.
3. Design Deep Dive
3.1 Single-Node Rate Limiting
For small-scale applications with low traffic and a single application server, rate limiting can be implemented entirely in-memory, without relying on external systems like Redis. This approach is lightweight, fast, and easy to set up.
You maintain a simple hash map (dictionary) in the application process where:
Keys represent client identifiers (e.g., user ID, API key, or IP address)
Values represent request counts within the current time window
For each incoming request:
Check if the user exists in the map
If not, create a new entry with a count of 1
If the user exists, increment their counter
Compare the count against the defined rate limit
If the count is within the limit, allow the request; otherwise, reject it
You can also add a time-based mechanism (e.g., timestamps or TTL logic) to reset counters after each time window.
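A minimal in-process sketch of this idea, using fixed windows that reset via stored timestamps. The class and method names are illustrative, and a real implementation would need a lock (or per-request serialization) to be thread-safe:

import time

class InMemoryRateLimiter:
    """Fixed-window, per-client counters held in a plain dictionary."""

    def __init__(self, limit: int = 100, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # client_id -> (window_start, count)

    def allow(self, client_id: str) -> bool:
        now = time.time()
        window_start, count = self.counters.get(client_id, (now, 0))

        # Reset the counter once the current window has elapsed
        if now - window_start >= self.window_seconds:
            window_start, count = now, 0

        if count >= self.limit:
            self.counters[client_id] = (window_start, count)
            return False

        self.counters[client_id] = (window_start, count + 1)
        return True

For example, limiter = InMemoryRateLimiter(limit=100) followed by repeated limiter.allow("user-42") calls returns True until the 101st call within the same minute.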
Despite its simplicity, this approach comes with critical drawbacks that make it unsuitable for production environments at scale:
Single Point of Failure (SPOF): If the server crashes, all in-memory counters are lost. After a restart, the system "forgets" users' recent request history, potentially allowing them to exceed their limits until the counters rebuild.
No Horizontal Scalability: The rate limiter lives on a single node so it doesn’t scale with traffic.
Unbounded Memory Growth: Without proper eviction or TTL logic, memory usage can grow unbounded over time, especially if you're tracking many users or long-duration windows.
Now, let's explore two common strategies to implement rate limiting in a distributed environment.