Rate Limit Management
Implement rate limit monitoring and handling strategies
Rate limits protect provider infrastructure and ensure fair resource distribution. Effective rate limit handling is essential for production reliability.
Introduction
When you push an inference workload past a provider's per-minute quota, requests start coming back as 429s — and a naive client will either crash, retry in tight loops (making the throttle worse), or silently lose user-facing traffic. If you've shared quota across many concurrent callers, you've likely seen a single un-governed worker starve every other request route on the same key. By the end of this lesson you'll be able to read a provider's RPM/TPM ceilings, implement retry logic that respects retry-after headers with jitter, and add a proactive tracker that throttles before the provider does.
Key Terminology
- Requests per minute (RPM) — the count of API calls allowed in a 60-second window; the first ceiling most callers hit when fanning out short prompts.
- Tokens per minute (TPM) — the total input + output tokens permitted per minute; the binding constraint for long-context or high-output workloads.
- Exponential backoff — a retry strategy where wait time grows multiplicatively (e.g. 4s, 8s, 16s); prevents thundering-herd retries from re-saturating a recovering provider.
- Jitter — a small random offset added to backoff delays so concurrent retriers don't synchronize and re-collide on the same tick.
- Retry-after header — a 429 response field from the provider indicating the minimum wait before the next attempt; honoring it shortens recovery time and avoids unnecessary backoff inflation.
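To make the backoff and jitter terms concrete, here is a minimal sketch of how a single retry delay could be computed. The function name `backoff_delay` and its parameters (`base`, `cap`, `jitter`) are illustrative assumptions, not part of any provider SDK.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 4.0, cap: float = 60.0,
                  jitter: float = 1.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed).

    Hypothetical helper: doubles the wait each attempt (4s, 8s, 16s, ...),
    caps it, then adds a small random offset so parallel retriers drift apart.
    """
    source = rng or random                       # injectable for deterministic tests
    exponential = min(cap, base * (2 ** attempt))
    return exponential + source.uniform(0.0, jitter)
```

With the defaults, attempts 0, 1, and 2 wait roughly 4, 8, and 16 seconds plus up to one second of jitter, and later attempts level off at the 60-second cap.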
Concepts
The rate-limit decision flow
Every outbound call passes through the same gate: check local capacity, send if room exists, react to a 429 if not. Modeling it as a state machine makes the retry/backoff branches explicit instead of buried in try/except.
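One way to sketch that state machine is shown below. The state names and the `next_state` function are assumptions made for illustration; the lesson does not prescribe a specific set of states, and non-429 errors are deliberately left out of scope.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    CHECK_CAPACITY = auto()  # consult the local tracker
    WAIT_LOCAL = auto()      # no local room: sleep, then re-check
    SEND = auto()            # issue the request
    BACKOFF = auto()         # provider returned 429: wait before retrying
    DONE = auto()            # terminal: response received
    FAILED = auto()          # terminal: retries exhausted


def next_state(state: State, *, has_capacity: bool = False,
               status: Optional[int] = None,
               attempts_left: int = 0) -> State:
    """Pure transition function for the hypothetical flow above."""
    if state is State.CHECK_CAPACITY:
        return State.SEND if has_capacity else State.WAIT_LOCAL
    if state in (State.WAIT_LOCAL, State.BACKOFF):
        return State.CHECK_CAPACITY          # always re-check before sending
    if state is State.SEND:
        if status == 429:
            return State.BACKOFF if attempts_left > 0 else State.FAILED
        return State.DONE                    # other statuses not modeled here
    return state                             # DONE and FAILED are terminal
```

Keeping the transition function pure (no sleeping, no I/O) makes each branch easy to unit-test separately from the code that actually waits and sends.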
The two limit dimensions
RPM and TPM are the two ceilings most workloads hit first, but providers typically enforce several at once, and exceeding any one of them returns a 429:
- RPM — request count per minute.
- TPM — token count (input + output) per minute.
- TPD — daily token quota on lower tiers.
- Concurrent requests — simultaneous in-flight calls.
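Sample ceilings to calibrate your expectations (illustrative figures; limits vary by model and tier and change over time, so confirm them against your provider's current documentation):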
- Anthropic (typical tier): 4,000 RPM, 400,000 TPM
- OpenAI (Tier 2): 500 RPM, 30,000 TPM
- Gemini (free tier): 15 RPM, 1,000,000 TPM
A workload sending 2k-token prompts at 50 RPM demands 2,000 × 50 = 100,000 tokens per minute. That is more than three times OpenAI Tier 2's 30,000 TPM ceiling while using only 10% of its 500 RPM ceiling, so RPM-only tracking is insufficient.
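The arithmetic in that example can be checked with a short sketch. The helper name `binding_ceiling` is hypothetical, and it assumes every request consumes the same number of tokens, which real traffic rarely does.

```python
def binding_ceiling(tokens_per_request: int, target_rpm: int,
                    rpm_limit: int, tpm_limit: int) -> dict:
    """Report which ceiling binds first for a uniform workload (illustrative)."""
    demanded_tpm = tokens_per_request * target_rpm
    # Highest request rate the token ceiling alone would allow.
    max_rpm_by_tokens = tpm_limit // tokens_per_request
    return {
        "demanded_tpm": demanded_tpm,
        "binding": "TPM" if max_rpm_by_tokens < rpm_limit else "RPM",
        "sustainable_rpm": min(rpm_limit, max_rpm_by_tokens),
    }


print(binding_ceiling(2_000, 50, rpm_limit=500, tpm_limit=30_000))
# → {'demanded_tpm': 100000, 'binding': 'TPM', 'sustainable_rpm': 15}
```

Under these numbers the workload can sustain only 15 requests per minute, far below both the 50 RPM it wants and the 500 RPM the request ceiling would allow.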
Reactive retry vs. proactive throttling
Reactive retry (catch 429, sleep, try again) is the floor — every client needs it for transient bursts. Proactive throttling (track usage, refuse to send if you'd exceed) is the ceiling — it prevents you from generating 429s in the first place, which matters because providers may impose progressively harsher penalties on accounts that 429 frequently. Combine both: track locally, retry on the residual mismatch (see Code Walkthrough).
Code Walkthrough
Building on the decision flow and dual-ceiling model from Concepts, the two snippets below implement the reactive layer (retry with backoff + jitter, honoring retry-after) and the proactive layer (a sliding-window RPM/TPM tracker called before each send).
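```python
import random
import time

# Assumption: `client` is an Anthropic SDK client, so the 429 exception is
# anthropic.RateLimitError; substitute your provider's equivalent.
from anthropic import RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)


@retry(retry=retry_if_exception_type(RateLimitError),  # retry 429s only
       stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, min=4, max=60))
def call_with_retry(client, **kwargs):
    try:
        return client.messages.create(**kwargs)
    except RateLimitError as e:
        # Provider's hint in seconds; falls back to 60 when the header is absent.
        retry_after = float(e.response.headers.get("retry-after", 60))
        time.sleep(retry_after + random.uniform(0, 1))  # up to 1 s of jitter
        raise  # let tenacity count this attempt
```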
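The @retry decorator caps attempts at 5 with exponential waits between 4 and 60 seconds. RateLimitError is the provider SDK's 429 exception and must be imported from it. Inside, we read the provider's retry-after hint (falling back to 60 seconds when the header is absent), add up to 1 second of jitter, sleep, then re-raise so tenacity records the failure and applies its own backoff before the next attempt. The two waits stack: the explicit sleep runs first and tenacity's exponential wait follows, so the total delay per retry is the sum of both. Re-raising rather than returning is the key: swallowing the exception would hide the failure from tenacity, and no retry would happen.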
```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RateLimitTracker:
    requests_per_minute: int
    tokens_per_minute: int
    _requests: deque = field(default_factory=deque)  # request timestamps
    _tokens: deque = field(default_factory=deque)    # (timestamp, tokens) pairs

    def can_make_request(self, estimated_tokens: int) -> tuple[bool, float]:
        """Return (allowed, seconds_to_wait) for a request of the given size."""
        now = datetime.now()
        cutoff = now - timedelta(minutes=1)
        # Drop entries that have aged out of the 60-second window.
        while self._requests and self._requests[0] < cutoff:
            self._requests.popleft()
        while self._tokens and self._tokens[0][0] < cutoff:
            self._tokens.popleft()
        used_tokens = sum(t for _, t in self._tokens)
        # Block when the request count is full or the tokens would exceed quota.
        if (len(self._requests) >= self.requests_per_minute or
                used_tokens + estimated_tokens > self.tokens_per_minute):
            # Wait until the oldest recorded request leaves the window.
            wait = ((self._requests[0] + timedelta(minutes=1) - now).total_seconds()
                    if self._requests else 60.0)
            return False, max(0.0, wait)
        return True, 0.0

    def record_request(self, tokens: int) -> None:
        now = datetime.now()
        self._requests.append(now)
        self._tokens.append((now, tokens))
```
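The tracker keeps two deques, one of request timestamps and one of (timestamp, tokens) pairs, and trims entries older than 60 seconds on every check. can_make_request returns (allowed, wait_seconds); when blocked, callers sleep for wait_seconds and then check again. After each successful response, call record_request with the actual token count so the window stays accurate. Because usage is recorded only once a response arrives, requests still in flight are not counted, so concurrent callers can overshoot slightly; the reactive retry layer absorbs that residual.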
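The check, sleep, send, record sequence described above could be wired together roughly as follows. This is a sketch under assumptions: `throttled_call` and its `send` callable (returning a response plus the actual token count) are hypothetical names, and `tracker` is any object exposing `can_make_request` and `record_request` as the tracker above does.

```python
import time
from typing import Any, Callable, Tuple


def throttled_call(tracker: Any,
                   send: Callable[[], Tuple[Any, int]],
                   estimated_tokens: int,
                   sleep: Callable[[float], None] = time.sleep,
                   max_waits: int = 10) -> Any:
    """Wait for local capacity, send once, then record actual usage.

    `send` is a hypothetical zero-argument callable that performs the API
    call and returns (response, actual_tokens_used).
    """
    for _ in range(max_waits):
        allowed, wait_seconds = tracker.can_make_request(estimated_tokens)
        if allowed:
            response, actual_tokens = send()
            tracker.record_request(actual_tokens)  # record real, not estimated, usage
            return response
        sleep(wait_seconds)                        # blocked: wait, then re-check
    raise TimeoutError("no local capacity after max_waits checks")
```

Bounding the loop with `max_waits` keeps a mis-sized estimate (for example, one larger than the whole TPM quota) from stalling the caller forever. Injecting `sleep` lets tests run without real delays.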
You'll know it works when: synthetic load at 110% of your RPM ceiling produces zero provider 429s (the tracker absorbs them), and a forced 429 (e.g. shared-key burst) recovers within the retry-after window without exceeding 5 attempts.
Do's and Don'ts
Having just walked through the tracker and retry decorator, the rules below distill the choices most likely to bite you in production.
Do's
- ✓Do honor retry-after: the provider's hint reflects when capacity actually frees up, which a fixed backoff floor can only guess at.
- ✓Do track tokens, not just requests: TPM is the binding ceiling for long-context workloads, and RPM-only counters miss it entirely.
- ✓Do add jitter to every backoff: without it, parallel retriers re-collide on the same tick and amplify the throttle.
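Honoring retry-after robustly means parsing it defensively. The HTTP specification allows the header to carry either a number of seconds or an HTTP-date, so a small helper along the following lines can cover both. The name `parse_retry_after` is an assumption for illustration, and accepting fractional seconds is a leniency beyond what the specification requires.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))          # delay-seconds form, e.g. "120"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)    # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:                    # treat naive dates as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Returning None for a missing or malformed header lets the caller fall back to its own backoff schedule instead of crashing inside the error handler.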
Don'ts
- ✗Don't retry indefinitely: cap attempts (5 is a sensible default) so a sustained outage surfaces as an error instead of a silent stall.
- ✗Don't share one tracker across threads or processes without synchronization: unsynchronized updates to the deques, or per-process copies that cannot see each other's traffic, undercount usage and let bursts through.
- ✗Don't ignore 429 frequency in metrics: repeated throttling signals that quota or batching is mis-sized; fix the cause, not the retry loop.
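As one possible way to surface 429 frequency, the sketch below keeps a rolling window of recent response codes and flags when the throttled share crosses a threshold. The class name `ThrottleRateMonitor`, the window size, and the 5% alert ratio are all illustrative assumptions; a production setup would more likely export a counter to an existing metrics system.

```python
from collections import deque


class ThrottleRateMonitor:
    """Tracks the share of recent responses that were 429s (hypothetical helper)."""

    def __init__(self, window: int = 200, alert_ratio: float = 0.05):
        self._outcomes = deque(maxlen=window)  # True for each 429, else False
        self.alert_ratio = alert_ratio

    def observe(self, status_code: int) -> None:
        self._outcomes.append(status_code == 429)

    @property
    def ratio(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        return self.ratio > self.alert_ratio
```

Calling `observe` with every response status, including those inside the retry loop, keeps retried 429s visible instead of letting a successful final attempt hide them.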