Build an operational runbook for the LiteLLM gateway

You will build operational documentation and runbooks for the LiteLLM gateway. Document the architecture, configuration reference, and troubleshooting guide. Build executable runbook steps for the top 5 failure scenarios specific to this system. Implement health check and readiness endpoints. Create a monitoring dashboard showing key operational metrics with alerting thresholds.

Build operational runbooks and architecture documentation for the LiteLLM gateway

Introduction

In production, a LiteLLM gateway that lacks written runbooks becomes a liability the moment the on-call engineer isn't the person who built it. Provider rate-limit storms, fallback exhaustion, and Redis OOM events all carry recovery steps that must be captured before an incident, not improvised during one. Architecture documentation closes the same gap for engineers onboarding to the system. By the end of this lesson, you'll be able to produce the four non-negotiable documents — architecture overview, configuration reference, troubleshooting runbooks, and SLA contract — that turn a working gateway into a maintainable, production-grade service.

Key Terminology

Operational Runbook — A pre-written, step-by-step response document for a specific, named failure mode (such as a provider 429 storm or Redis OOM) that on-call engineers execute during an incident, eliminating the need to improvise recovery steps under pressure.
SLA Contract — A document (docs/sla.md) that declares the numeric thresholds the gateway must meet — max_p95_overhead_ms, min_availability_pct, and max_cost_per_call_usd — and that mirrors the sla: block in litellm.yaml so the written agreement and the enforced configuration always describe the same numbers.
Architecture Overview — A document combining a Mermaid traffic-flow diagram with a 200-word narrative that traces each hop a request takes through the gateway: ingress entry, virtual-key authentication and RBAC, model-list routing, Redis cache lookup, and provider pool dispatch on a cache miss.
Configuration Reference — A table-based document that maps every field in litellm.yaml — including model_list, fallbacks, allowed_fails, cooldown_seconds, cache.ttl, and the sla: block — to its default value, valid range, and behavioral effect on the gateway.
Fallback Chain — The ordered sequence of alternative providers declared under fallbacks in litellm.yaml, governed by allowed_fails and cooldown_seconds, that the gateway activates to reroute traffic when the primary provider exceeds its error threshold.
Provider 429 Storm — A failure mode where one provider's rate-limit error rate exceeds a threshold over a short observation window, requiring diagnosis of pool size versus provider rate-limit mismatches and mitigation by shifting traffic to the fallback chain or cooling down the pool.
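
The "error rate exceeds a threshold over a short observation window" idea in the last definition can be sketched as a sliding-window counter. This is an illustrative model only, not how LiteLLM detects rate limiting internally. The class name and the default window, threshold, and minimum-traffic values are hypothetical placeholders.

```python
from collections import deque


class RateLimitStormDetector:
    """Flags a 429 storm when the share of 429s in a sliding window is too high."""

    def __init__(self, window_seconds=60.0, threshold=0.2, min_requests=20):
        # All three defaults are hypothetical placeholders, not recommendations.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, was_429) pairs, oldest first

    def record(self, timestamp, status_code):
        """Record one provider response and drop events that left the window."""
        self._events.append((timestamp, status_code == 429))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def rate(self):
        """Fraction of responses in the current window that were 429s."""
        if not self._events:
            return 0.0
        return sum(1 for _, is_429 in self._events if is_429) / len(self._events)

    def is_storm(self):
        """True when traffic is sufficient and the 429 rate exceeds the threshold."""
        return len(self._events) >= self.min_requests and self.rate() > self.threshold
```

The minimum-request guard keeps a single 429 on an otherwise idle provider from paging anyone; whatever values you settle on belong in the 429-storm runbook next to the alert definition.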

Concepts

Why Runbooks Must Be Written Before an Incident

When a LiteLLM gateway is built by one engineer and operated by a team, the knowledge gap between builder and on-call responder is the primary operational risk. Provider rate-limit storms, fallback exhaustion, and Redis OOM events are predictable failure modes — they will happen. Without a runbook, the engineer woken at 3 AM must reconstruct the correct mitigation from memory or source code, extending the incident window and increasing the chance of a compounding mistake. The architecture overview closes the same gap for engineers who inherit or onboard onto the system: without a diagram and narrative, the traffic path from ingress through virtual-key authentication to provider pool dispatch must be reverse-engineered from configuration files alone.

Documentation written reactively — during or after an incident — captures adrenaline-filtered memory, not the clean mental model the builder holds before the incident occurs. The discipline is to produce the four documents while the system is well-understood and stable.

What Each Document Is Responsible For

The four-document framework maps directly to the four questions any operator will ask. The architecture overview answers "how does this system work?": a traffic-flow diagram paired with a narrative gives any engineer a working mental model before they touch the system. The configuration reference answers "what does each setting do?": a table documenting every field in litellm.yaml with its default, range, and behavioral effect prevents operators from cargo-culting a template without understanding the consequences. The runbook collection answers "what do I do when X breaks?": one file per named failure mode lets on-call engineers navigate from an alert directly to the correct recovery procedure without reading a monolithic ops guide. The SLA contract answers "how do I know when it's fixed?": it declares numeric thresholds that define what "working" means, independent of what the system currently measures (see Code Walkthrough).
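
To make the "how do I know when it's fixed?" question concrete, here is a minimal sketch that compares measured values against the SLA thresholds. It assumes measurements arrive in a dict keyed by the same field names as the sla: block, and that the max_ and min_ prefixes mark upper and lower bounds. The function name and the measured numbers are hypothetical.

```python
def sla_violations(measured, sla):
    """Return human-readable violations of the SLA thresholds by measured values."""
    violations = []
    for key, limit in sla.items():
        if key not in measured:
            violations.append(f"{key}: no measurement")
            continue
        value = measured[key]
        # Keys prefixed max_ are upper bounds; keys prefixed min_ are lower bounds.
        if key.startswith("max_") and value > limit:
            violations.append(f"{key}: {value} exceeds limit {limit}")
        elif key.startswith("min_") and value < limit:
            violations.append(f"{key}: {value} is below floor {limit}")
    return violations


# Thresholds copied from the sla: block used in this lesson.
SLA = {
    "max_p95_overhead_ms": 80,
    "min_availability_pct": 0.999,
    "max_cost_per_call_usd": 0.05,
}

# Hypothetical post-incident measurements.
measured = {
    "max_p95_overhead_ms": 95,
    "min_availability_pct": 0.9995,
    "max_cost_per_call_usd": 0.03,
}

print(sla_violations(measured, SLA))
# ['max_p95_overhead_ms: 95 exceeds limit 80']
```

An empty list is the signal that recovery is complete; anything else names the threshold that still fails.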

A documentation set missing any of the four leaves a gap: understanding the system, tuning it safely, recovering from known failures, or confirming that recovery has actually succeeded.

Configuration and Documentation as a Coupled System

The most common documentation failure is drift: the sla: block in litellm.yaml declares max_p95_overhead_ms: 80, but docs/sla.md still references 100 ms from an earlier agreement. When those numbers diverge, on-call engineers cannot know which is authoritative, the machine-readable contract or the human-readable one. The rule is that both change in the same commit: the sla: block in litellm.yaml and docs/sla.md are two representations of one contract, not independent documents.
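
A drift check of this kind could run in CI. The sketch below is one possible shape, using only the standard library. It assumes the sla: block is a flat list of numeric fields and that docs/sla.md states each threshold as `field | value` in a table or as `field: value` in prose; both layouts and all function names are assumptions, not part of LiteLLM.

```python
import re


def parse_sla_block(yaml_text):
    """Extract the flat key: number pairs indented under a top-level `sla:` key."""
    values = {}
    in_block = False
    for line in yaml_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line.startswith((" ", "\t")):
            # Any top-level key either opens or closes the sla block.
            in_block = line.split("#")[0].strip() == "sla:"
            continue
        if in_block:
            match = re.match(r"\s+(\w+):\s*(\d+(?:\.\d+)?)\s*(?:#.*)?$", line)
            if match:
                values[match.group(1)] = float(match.group(2))
    return values


def parse_sla_doc(markdown_text):
    """Find `field | value` or `field: value` mentions of SLA fields in the doc."""
    pattern = re.compile(r"`?((?:max|min)_\w+)`?\s*[:|]\s*(\d+(?:\.\d+)?)")
    return {name: float(num) for name, num in pattern.findall(markdown_text)}


def sla_drift(yaml_text, markdown_text):
    """Return {field: (yaml_value, doc_value)} for every field that disagrees."""
    config, doc = parse_sla_block(yaml_text), parse_sla_doc(markdown_text)
    return {
        key: (config.get(key), doc.get(key))
        for key in sorted(set(config) | set(doc))
        if config.get(key) != doc.get(key)
    }
```

With the 80 ms versus 100 ms example above, sla_drift would report max_p95_overhead_ms with both values, and a field present on only one side shows up paired with None.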

The same coupling applies to runbooks and failure modes. Naming runbooks by failure mode — 429-storm.md, fallback-exhaustion.md, cache-exhaustion.md, cert-expiry.md — rather than by solution keeps the alert-to-runbook path direct and makes each file discoverable from the monitoring alert that fires it. When a new failure mode is identified, it earns its own file rather than a paragraph appended to an existing document, keeping the runbook set both extensible and navigable.
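
The alert-to-runbook path can be kept explicit with a small lookup table. In this sketch the alert names on the left are hypothetical (use whatever your monitoring system emits); the file names are the ones used in this lesson.

```python
from pathlib import PurePosixPath

RUNBOOK_DIR = PurePosixPath("docs/runbooks")

# Alert names are hypothetical; file names follow the failure-mode convention.
ALERT_RUNBOOKS = {
    "Provider429Storm": "429-storm.md",
    "FallbackChainExhausted": "fallback-exhaustion.md",
    "RedisCacheExhausted": "cache-exhaustion.md",
    "TlsCertExpiringSoon": "cert-expiry.md",
}


def runbook_for(alert_name):
    """Return the repository-relative runbook path for an alert, or raise if unmapped."""
    try:
        return str(RUNBOOK_DIR / ALERT_RUNBOOKS[alert_name])
    except KeyError:
        raise LookupError(f"no runbook mapped for alert {alert_name!r}") from None


def unmapped_runbooks(existing_files):
    """Runbook files that no alert links to yet."""
    return sorted(set(existing_files) - set(ALERT_RUNBOOKS.values()))


print(runbook_for("Provider429Storm"))  # docs/runbooks/429-storm.md
```

Raising on an unmapped alert is deliberate: an alert with no runbook is a documentation gap worth surfacing before an incident rather than during one.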

Code Walkthrough

Building on the four-document framework from the Concepts section, the walkthrough below shows how each artifact takes concrete shape in your gateway repository.

Architecture overview

The Mermaid diagram below captures the traffic path every request follows through the gateway. Embed it in docs/architecture.md alongside a 200-word narrative that describes each hop: request entry at the ingress, authentication and scope enforcement via virtual keys, model-list routing, the Redis cache short-circuit, and provider pool dispatch on a cache miss.

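[Diagram: ingress → virtual-key authentication and RBAC → model-list routing → Redis cache lookup → provider pool dispatch on a cache miss]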
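
The hops named above can also be traced in a few lines of code, which may help when drafting the narrative. This is a toy model under stated assumptions, not LiteLLM's implementation: a plain dict stands in for Redis, a dict of key-to-allowed-models stands in for virtual keys and RBAC, and handle_request and call_provider are hypothetical names.

```python
def handle_request(api_key, model_name, prompt, *, keys, model_list, cache, call_provider):
    """Walk one request through the documented hops and return (response, trace)."""
    trace = ["ingress"]

    # Virtual-key authentication and scope (RBAC) check.
    allowed_models = keys.get(api_key)
    if allowed_models is None or model_name not in allowed_models:
        trace.append("auth: rejected")
        return None, trace
    trace.append("auth: ok")

    # Model-list routing: map the public model name to a provider model.
    provider_model = model_list[model_name]
    trace.append(f"route: {provider_model}")

    # Cache lookup short-circuits the provider call on a hit.
    cache_key = (provider_model, prompt)
    if cache_key in cache:
        trace.append("cache: hit")
        return cache[cache_key], trace
    trace.append("cache: miss")

    # Provider pool dispatch on a cache miss; the result is stored for next time.
    response = call_provider(provider_model, prompt)
    cache[cache_key] = response
    trace.append("provider: dispatched")
    return response, trace
```

The returned trace lists the hops in order, which is the same sequence the 200-word narrative should describe.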

Configuration reference

Store the canonical gateway configuration in litellm.yaml. The snippet below covers all four required sections: model definitions, fallback chain, cache settings, and the SLA contract. Every field that appears here should have a corresponding row in your configuration reference table documenting its default, valid range, and effect on gateway behavior.

```yaml
# litellm.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
fallbacks:
  - {gpt-4o: [claude-sonnet]}
allowed_fails: 5
cooldown_seconds: 60
cache:
  type: redis
  ttl: 300
sla:
  max_p95_overhead_ms: 80
  min_availability_pct: 0.999
  max_cost_per_call_usd: 0.05
```
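
One way to keep the reference table complete is to compare the field paths in the configuration against the first column of the table. The sketch below assumes the configuration has already been parsed into a dict by a YAML parser of your choice, and that the reference is a Markdown table whose first column holds dotted field paths such as cache.ttl. Both function names are hypothetical.

```python
def flatten_keys(config, prefix=""):
    """Yield dotted field paths such as 'cache.ttl' for every key in a nested config."""
    for key, value in config.items():
        path = f"{prefix}{key}"
        yield path
        if isinstance(value, dict):
            yield from flatten_keys(value, prefix=f"{path}.")


def undocumented_fields(config, reference_markdown):
    """Config field paths that have no table row in the configuration reference."""
    documented = set()
    for line in reference_markdown.splitlines():
        if not line.lstrip().startswith("|"):
            continue
        # The first cell of each table row is treated as the field path.
        cells = [cell.strip().strip("`") for cell in line.strip().strip("|").split("|")]
        if cells and cells[0]:
            documented.add(cells[0])
    return sorted(set(flatten_keys(config)) - documented)
```

Lists such as model_list are treated as a single field here, so per-entry keys like litellm_params would need their own handling if you want them checked as well.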

Runbook placement and SLA linkage

Each runbook from the Concepts section — provider 429 storm, fallback exhaustion, cache exhaustion, and certificate expiry — becomes a standalone file under docs/runbooks/. Naming files by failure mode (429-storm.md, fallback-exhaustion.md, cache-exhaustion.md, cert-expiry.md) lets on-call engineers jump directly to the right document from an alert link without navigating a monolithic ops guide.

The SLA contract (docs/sla.md) ties the set together: it declares the thresholds that the sla: block in litellm.yaml enforces, so the configuration and the written contract always describe the same numbers. When either changes, both must be updated in the same commit.

Confirm that your repository contains all four documents — docs/architecture.md, docs/configuration-reference.md, a docs/runbooks/ directory with one file per failure mode, and docs/sla.md — and that the sla: values in litellm.yaml match the limits declared in docs/sla.md.
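
The presence half of that confirmation is easy to automate. A minimal sketch, assuming the repository layout described in this lesson; the function name is hypothetical.

```python
from pathlib import Path

REQUIRED_FILES = [
    "docs/architecture.md",
    "docs/configuration-reference.md",
    "docs/sla.md",
]
REQUIRED_RUNBOOKS = [
    "429-storm.md",
    "fallback-exhaustion.md",
    "cache-exhaustion.md",
    "cert-expiry.md",
]


def missing_documents(repo_root):
    """Return the repository-relative paths of required documents that do not exist."""
    root = Path(repo_root)
    expected = REQUIRED_FILES + [f"docs/runbooks/{name}" for name in REQUIRED_RUNBOOKS]
    return [path for path in expected if not (root / path).is_file()]
```

Calling missing_documents(".") from the repository root returns an empty list when all seven files are in place, which makes it straightforward to wire into a pre-merge check.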

Do's and Don'ts

The following Do's and Don'ts distill the material above into practice.

Do's

✓ Do name each runbook file after its failure mode (429-storm.md, fallback-exhaustion.md, cache-exhaustion.md, cert-expiry.md under docs/runbooks/). Alert links can then point directly to the right document, so an on-call engineer under pressure never has to navigate a monolithic ops guide to find recovery steps.
✓ Do keep the sla: block in litellm.yaml and docs/sla.md in sync within the same commit. The YAML enforces thresholds like max_p95_overhead_ms: 80 and min_availability_pct: 0.999 at runtime, and the written contract declares the same numbers; letting them diverge means engineers and the gateway disagree on what "passing" looks like.
✓ Do embed the Mermaid traffic-path diagram in docs/architecture.md alongside a 200-word narrative that explicitly labels each hop (ingress, virtual-key auth and RBAC, model_list routing, Redis cache lookup, and provider pool dispatch), so an onboarding engineer can trace a request end-to-end without reading source code.

Don'ts

✗ Don't write runbooks during an incident. Provider rate-limit storms, fallback exhaustion, and Redis OOM events require recovery steps documented in advance; improvising them under pressure is exactly the gap a docs/runbooks/ directory exists to close.
✗ Don't omit a row in the configuration reference table for any field that appears in litellm.yaml. Fields like allowed_fails, cooldown_seconds, and cache.ttl have non-obvious interactions; leaving their defaults, valid ranges, and behavioral effects undocumented forces every operator to rediscover them by trial and error.
✗ Don't treat the four documents (docs/architecture.md, docs/configuration-reference.md, docs/runbooks/, docs/sla.md) as optional polish. A gateway without them becomes a liability the moment the engineer who built it is not on call, which makes this documentation as load-bearing as the litellm.yaml configuration itself.
