API Rate Limiting: Complete Implementation Guide
Without rate limiting, a single misbehaving client can bring your entire API to its knees. This guide covers every algorithm, header convention, and implementation pattern you need to protect your service and give clients actionable feedback when they hit limits.
Why Every API Needs Rate Limiting
Imagine you launch a public API with no rate limits. Within days, one of three things happens: a scraper hammers your endpoints and exhausts your database connection pool; a buggy client gets into an infinite retry loop and floods your logs; or a competitor drains your free tier to build their product. Rate limiting prevents all three scenarios.
Beyond abuse prevention, rate limiting serves a second purpose: fairness. On shared infrastructure, one heavy tenant should not degrade service for everyone else. Rate limits enforce service-level expectations and make your SLAs defensible.
Real-world examples illustrate why this matters:
- GitHub API: 5,000 requests/hour for authenticated users, 60/hour for unauthenticated. Exceeding this returns 403 Forbidden with an X-RateLimit-Reset header.
- Stripe API: ~100 read requests/second per API key. Exceeding returns 429 Too Many Requests with a Retry-After header.
- Twitter/X API: Tiered limits per endpoint; search endpoints have tighter limits than timeline reads because they are computationally more expensive.
The pattern is consistent: pick a window, set a limit, return a 429 with helpful headers when the limit is exceeded.
The Four Core Algorithms
There are four widely-used algorithms, each with different characteristics for burst tolerance, memory use, and implementation complexity.
1. Fixed Window Counter
The simplest approach: count requests in a fixed time window (e.g. 100 requests per minute). When the counter reaches the limit, reject additional requests until the window resets.
# Pseudocode: fixed window with Redis
key = "ratelimit:{user_id}:{current_minute}"
count = INCR key
if count == 1:
    EXPIRE key 60  # set the TTL once, when the window's key is created
if count > 100:
    return 429
Problem: A client can make 100 requests in the last second of window N, then 100 more in the first second of window N+1: 200 requests in two seconds, double the intended rate. This is the "boundary burst" problem.
2. Sliding Window Log
Store a timestamp for every request in a sorted set. To check the limit, count entries within the last N seconds. This eliminates boundary bursts but uses more memory because each request costs one entry.
# Pseudocode: sliding window log with a Redis sorted set
now = current_timestamp_ms()
window_start = now - 60000  # 60 seconds ago
key = "ratelimit:{user_id}"
# Remove entries that have aged out of the window
ZREMRANGEBYSCORE key 0 window_start
# Count what remains; reject if already at the limit
count = ZCARD key
if count >= 100:
    return 429
# Record the allowed request and refresh the key's TTL
ZADD key now now
EXPIRE key 60
This is precise but memory-intensive for high-traffic APIs: with a high limit and 1,000 allowed requests/second in a 60-second window, a single client's key can hold 60,000 entries.
3. Token Bucket
The most popular algorithm for production APIs. Each client has a "bucket" that holds tokens. Each request consumes one token. Tokens refill at a constant rate up to the bucket capacity. This naturally allows short bursts while enforcing an average rate.
# Token bucket implementation in Python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max burst size
        self.tokens = capacity          # start full
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self, tokens=1):
        now = time.time()
        # Add tokens based on elapsed time
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True   # request allowed
        return False      # rate limited
Example: a bucket with capacity 20 and refill rate 10/s allows a burst of 20 requests immediately, then sustains 10 requests/second indefinitely. This matches how real users behave: occasional bursts, low average.
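A quick sketch of how the class above might be wired into a request path; handle_request and reject_with_429 are hypothetical placeholders for your framework's handler and error response:
# Hypothetical usage of the TokenBucket above: burst of 20, then 10/s sustained
bucket = TokenBucket(capacity=20, refill_rate=10)

def on_request():
    if bucket.consume():
        return handle_request()   # hypothetical: process the request normally
    return reject_with_429()      # hypothetical: return a 429 response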
4. Leaky Bucket
Requests flow into a queue (the "bucket") and are processed at a constant rate. If the queue is full, new requests are dropped. Unlike token bucket, there is no burst allowance - output rate is perfectly smooth. This is useful for rate-limiting outbound API calls (e.g. calling a third-party service at exactly 10 req/s) rather than inbound traffic.
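To make the contrast with token bucket concrete, here is a minimal leaky bucket sketch in Python, assuming a single daemon thread drains a bounded queue at a fixed rate; the class and method names are illustrative, not from any particular library:
# Minimal leaky-bucket sketch for smoothing outbound calls
import queue
import threading
import time

class LeakyBucket:
    def __init__(self, rate_per_s, max_queue=100):
        self.interval = 1.0 / rate_per_s          # gap between processed requests
        self.queue = queue.Queue(maxsize=max_queue)
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, fn):
        try:
            self.queue.put_nowait(fn)
            return True                            # queued for processing
        except queue.Full:
            return False                           # bucket overflow: drop

    def _drain(self):
        while True:
            fn = self.queue.get()
            fn()                                   # process one request...
            time.sleep(self.interval)              # ...at a perfectly smooth rate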
What to Rate Limit On
The granularity of your rate limit key matters as much as the algorithm:
- IP address: Easiest to implement, but shared IPs (NAT, corporate proxies) will cause false positives. Never use IP as the only dimension for authenticated APIs.
- API key / user ID: The right choice for authenticated APIs. Each consumer gets their own bucket independent of where requests originate.
- Endpoint: Different endpoints have different costs. Rate limit /search more aggressively than /ping. Use composite keys: {user_id}:{endpoint}.
- Account + IP combined: Rate limit on user ID for authenticated paths, fall back to IP for unauthenticated paths like login (to prevent credential stuffing).
Rate limiting your login endpoint by IP (e.g. 10 attempts per 15 minutes) is one of the simplest and most effective defenses against brute force and credential stuffing attacks.
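As a concrete illustration of that advice, here is a sketch of a per-IP login limiter using the fixed window pattern from earlier; redis-py is assumed, and the key name and limits are illustrative:
# Sketch: per-IP fixed window for a login endpoint (10 attempts / 15 min)
import redis

r = redis.Redis()

def login_allowed(ip: str) -> bool:
    key = f"ratelimit:login:{ip}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 15 * 60)  # TTL starts with the first attempt
    return count <= 10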
Standard Response Headers
Returning helpful headers with every response (not just 429s) lets clients self-throttle before hitting limits. The de facto standard headers are:
# Successful response headers
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1711120060 # Unix timestamp when limit resets
# On 429 Too Many Requests
HTTP/1.1 429 Too Many Requests
Retry-After: 30 # seconds to wait
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711120060
The IETF is standardizing RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset (without the X- prefix) in the draft RateLimit header fields spec; the 429 status code itself is defined in RFC 6585. GitHub, Stripe, and most major APIs still use the X-RateLimit-* convention for backwards compatibility.
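Clients can use these headers to self-throttle before ever receiving a 429. A minimal sketch, assuming the X-RateLimit-* convention shown above and the Python requests library:
# Sketch: proactive client-side throttling from rate-limit headers
import time
import requests

def fetch_with_self_throttle(url):
    resp = requests.get(url)
    remaining = int(resp.headers.get('X-RateLimit-Remaining', 1))
    reset = int(resp.headers.get('X-RateLimit-Reset', 0))
    if remaining == 0:
        # Sleep until the window resets instead of provoking a 429
        time.sleep(max(0, reset - time.time()))
    return resp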
Implementing Rate Limiting in Node.js (Express)
The express-rate-limit package provides a production-ready fixed window limiter with Redis support:
const express = require('express');
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const redis = require('redis');

const app = express();
const client = redis.createClient({ url: process.env.REDIS_URL });
client.connect(); // node-redis v4 clients must connect before use
// Global limiter: 100 req/15min per IP
const globalLimiter = rateLimit({
windowMs: 15 * 60 * 1000, // 15 minutes
max: 100,
standardHeaders: true, // Return RateLimit-* headers
legacyHeaders: false,
store: new RedisStore({
sendCommand: (...args) => client.sendCommand(args),
}),
handler: (req, res) => {
res.status(429).json({
error: 'Too Many Requests',
retryAfter: res.getHeader('Retry-After'),
message: 'You have exceeded the rate limit. Please wait before retrying.'
});
}
});
// Stricter limiter for auth endpoints
const authLimiter = rateLimit({
windowMs: 15 * 60 * 1000,
max: 10,
message: { error: 'Too many login attempts. Try again in 15 minutes.' }
});
app.use('/api/', globalLimiter);
app.use('/api/auth/login', authLimiter);
Nginx Rate Limiting (No Code Required)
For many use cases, Nginx's built-in limit_req module is sufficient and requires no application-level changes:
# Define a zone: 10MB shared memory, 10 req/s per IP
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
server {
location /api/ {
# Allow bursts of up to 20 excess requests, processed immediately (nodelay)
limit_req zone=api burst=20 nodelay;
limit_req_status 429;
# Custom JSON body for 429 responses
error_page 429 /rate_limit.json;
location = /rate_limit.json {
    default_type application/json;
    return 429 '{"error":"Too Many Requests","retryAfter":1}';
}
}
}
This uses the leaky bucket algorithm internally. The burst parameter sets queue size; nodelay processes queued requests immediately rather than spacing them out.
Handling Rate Limits as a Client
If you are consuming a rate-limited API, your client code needs to handle 429 responses gracefully rather than crashing or hammering the server harder:
import time
import requests

def api_request_with_retry(url, headers, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 429:
            # Respect the Retry-After header if present; default to 60s
            retry_after = int(response.headers.get('Retry-After', 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            continue
        # Raise for 4xx/5xx errors; any 2xx response falls through
        response.raise_for_status()
        return response.json()
    raise Exception(f"Failed after {max_retries} retries")
Key rules for well-behaved API clients:
- Always check for 429 and implement exponential backoff
- Read Retry-After or X-RateLimit-Reset and wait the specified duration; do not guess
- Monitor X-RateLimit-Remaining and slow down proactively before hitting zero
- Cache responses where possible to reduce unnecessary requests
- Use a request queue with concurrency control for bulk operations
Distributed Rate Limiting
When your API runs on multiple servers, in-memory rate limiters fail because each instance maintains its own counter. A request hitting server A and server B appears as separate clients to each. The solution is a shared backing store:
- Redis: The standard choice. INCR is atomic, sliding windows can use sorted sets, and Redis Cluster supports sharding for very high throughput (see the sketch after this list).
- Memcached: Workable for simple counters, but it lacks sorted sets and server-side scripting, so anything beyond a fixed window is awkward; use Redis instead.
- API Gateway: AWS API Gateway, Kong, Cloudflare, and Nginx Plus all implement distributed rate limiting natively, offloading the concern from your application entirely.
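One subtlety with Redis-backed distributed limiting: the counter increment and its TTL must be set atomically, or concurrent app instances can race between INCR and EXPIRE. A sketch using a server-side Lua script via redis-py; the key name and limits are illustrative:
# Sketch: atomic fixed-window check shared across app servers
import redis

r = redis.Redis()

# INCR and EXPIRE run atomically inside Redis, so multiple
# instances sharing this script cannot race between the two commands.
WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""
check_window = r.register_script(WINDOW_LUA)

def allow(user_id, limit=100, window_s=60):
    count = check_window(keys=[f"ratelimit:{user_id}"], args=[window_s])
    return count <= limit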
Tiered Rate Limits by Plan
Most commercial APIs offer different limits per subscription tier. The cleanest implementation uses a middleware that looks up the user's plan and applies the appropriate limit:
const PLAN_LIMITS = {
free: { windowMs: 60000, max: 60 },
pro: { windowMs: 60000, max: 600 },
enterprise: { windowMs: 60000, max: 6000 }
};
async function tieredRateLimiter(req, res, next) {
  const user = await getUser(req.headers.authorization);
  const plan = user?.plan || 'free';
  const { windowMs, max } = PLAN_LIMITS[plan];
  // Fall back to IP when the request is unauthenticated
  const key = `ratelimit:${user?.id || req.ip}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.pexpire(key, windowMs);
  res.set('X-RateLimit-Limit', max);
  res.set('X-RateLimit-Remaining', Math.max(0, max - count));
  if (count > max) {
    return res.status(429).json({ error: 'Rate limit exceeded', plan });
  }
  next();
}
FAQ
What HTTP status code should I return for rate limiting?
429 Too Many Requests is the correct status code, defined in RFC 6585. Some older APIs return 403 Forbidden (GitHub used to) or 503 Service Unavailable, but 429 is the universally understood choice. Always include a Retry-After header so clients know when to retry.
Should I rate limit by IP or by API key?
For authenticated APIs, always rate limit by user ID or API key - not IP. IP-based limiting punishes users behind shared IPs (offices, universities, VPNs). Use IP-based limiting only for unauthenticated endpoints like login and registration to prevent abuse before authentication occurs.
What is the difference between rate limiting and throttling?
The terms are often used interchangeably, but there is a subtle distinction: rate limiting enforces a hard ceiling and rejects requests that exceed it (returns 429). Throttling slows down excess requests by delaying them rather than rejecting them (the leaky bucket's queuing behavior). In practice, most systems use "rate limiting" to mean either approach.
How do I rate limit without Redis in a serverless environment?
Serverless functions are stateless, so you need an external store. Options include: Upstash Redis (HTTP-based Redis designed for serverless, no persistent connections), DynamoDB with atomic counters, or a managed API gateway (Cloudflare Workers, AWS API Gateway) that handles rate limiting before your function is invoked. In-memory limiting in serverless is unreliable because each invocation may run on a fresh instance.
What is a good default rate limit to start with?
A common starting point for authenticated REST APIs is 100 requests per minute per user for general endpoints and 10 requests per 15 minutes for authentication endpoints. Monitor your actual usage distribution in production - the 99th percentile of legitimate users should be well below your limit. If legitimate users are regularly hitting the limit, raise it or add plan tiers rather than punishing real usage.
Should rate limits reset at fixed times or on a rolling basis?
Rolling windows (sliding window algorithm) are fairer because there is no "reset rush" where clients hammer the API immediately after the window resets. Fixed windows are simpler to implement and explain. For most APIs, fixed windows per minute or per hour are a practical choice. Stripe and GitHub both use fixed windows and communicate the reset time clearly via headers.
Use our free HTTP Status Codes reference to look up any status code, including 429 and all related rate-limiting codes.
Usman has 10+ years of experience securing enterprise infrastructure, managing high-traffic servers, and building zero-knowledge security tools. Read more about the author.