Prerequisites

  • Familiarity with RESTful APIs and HTTP request/response patterns
  • Working knowledge of Kubernetes deployments, services, and ConfigMaps
  • Experience with at least one LLM provider SDK (OpenAI, Anthropic, or Gemini)
  • Understanding of API key management and environment variables
  • Basic knowledge of load balancing and reverse proxy concepts
  • Completion of Chapter 20 (structured extraction pipelines calling multiple providers)

Learning Goals

By the end of this chapter, you will be able to:

  1. Deploy LiteLLM proxy in K8s

    • Deploy LiteLLM as a Kubernetes service with a model_list configuration that routes requests to OpenAI, Anthropic, and Gemini -- with virtual keys per department, RBAC with model-level access controls, and rate limits that enforce per-key usage quotas.
    • Configure a LiteLLM proxy deployment with config.yaml specifying model_list entries for OpenAI GPT-4o, Anthropic Claude, and Gemini Pro
    • Set up virtual keys via the /key/generate endpoint that issue department-specific API keys with spend limits and model restrictions
    • Implement RBAC roles that control which models each key can access, with admin, internal, and external permission tiers
    • Configure rate limits per virtual key using max_parallel_requests, tpm_limit, and rpm_limit to prevent any single department from exhausting shared provider quotas
  2. Configure virtual keys and RBAC

    • Build a key management system that generates, tracks, and enforces virtual keys with fine-grained access controls -- enabling per-department cost attribution, model-level permissions, and real-time spend tracking through LiteLLM's admin API (a minimal key-generation sketch follows this list).
    • Generate virtual keys with metadata tags for department, project, and cost center attribution
    • Configure model-level access lists per key so that each department only sees the models approved for their use cases
    • Set up spend limits with max_budget and budget_duration to enforce monthly cost caps per department
    • Query the /spend/logs endpoint to build cost attribution dashboards showing per-key, per-model, and per-department spend
  3. Route to Responses API and all providers

    • Configure LiteLLM to route requests across OpenAI (including the Responses API), the Anthropic Messages API, and the Gemini GenerateContent API -- with model aliases, load balancing across deployments, and provider-specific parameter translation.
    • Define model aliases that map user-friendly names like best and fast to specific provider deployments
    • Configure weighted load balancing across multiple deployments of the same model (e.g., Azure OpenAI regions)
    • Route OpenAI Responses API requests through LiteLLM with proper parameter translation
    • Handle provider-specific features like Anthropic's extended thinking and Gemini's grounding through LiteLLM passthrough
  4. Implement failover and circuit breakers

    • Configure automatic failover chains that reroute requests when providers fail, with circuit breakers that detect unhealthy providers and temporarily remove them from the routing pool -- ensuring high availability even during provider outages.
    • Configure fallback chains using LiteLLM's fallbacks parameter (e.g., OpenAI to Anthropic to Gemini)
    • Set up circuit breakers with allowed_fails and cooldown_time to detect and isolate unhealthy providers
    • Implement health check endpoints that verify provider connectivity before routing requests
    • Test failover behavior by simulating provider outages and measuring recovery time
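
The sketch below ties the first two goals together: it calls the proxy's `/key/generate` endpoint with the admin master key to mint a department-scoped virtual key that carries a model access list, a recurring budget, and per-key rate limits. The proxy URL, master key, model names, and metadata values are placeholders, and the exact set of accepted fields can vary by LiteLLM version.

```python
import requests

# Placeholder in-cluster service URL and admin master key -- substitute your own.
PROXY_URL = "http://litellm-proxy.ai-gateway.svc.cluster.local:4000"
MASTER_KEY = "sk-master-key"

# Mint a virtual key for one department: two approved models, a $5,000 budget
# that resets every 30 days, and per-key rate limits.
resp = requests.post(
    f"{PROXY_URL}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "models": ["gpt-4o", "claude-sonnet"],  # model-level access list
        "max_budget": 5000,                     # USD spend cap
        "budget_duration": "30d",               # window over which the cap is evaluated
        "rpm_limit": 500,                       # requests per minute
        "tpm_limit": 200_000,                   # tokens per minute
        "max_parallel_requests": 20,            # concurrent request cap
        "metadata": {"department": "marketing", "cost_center": "CC-1042"},
    },
    timeout=30,
)
resp.raise_for_status()
virtual_key = resp.json()["key"]
print(f"Issued virtual key: {virtual_key}")
```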

Key Terminology

LiteLLM
An open-source Python library and proxy server that provides a unified OpenAI-compatible interface to 100+ LLM providers. LiteLLM translates request and response formats between the standard OpenAI chat completions format and provider-specific APIs, enabling applications to call any provider through a single endpoint without provider-specific SDK code.
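As a minimal illustration of that unified interface, the same `litellm.completion` call can target different providers just by changing the model string; the model identifiers below are examples, and the matching provider API keys are expected in the environment.

```python
import litellm

messages = [{"role": "user", "content": "Summarize what an AI gateway does in one sentence."}]

# Same call shape for both providers -- LiteLLM translates to each provider's native API.
# Requires OPENAI_API_KEY and ANTHROPIC_API_KEY in the environment.
openai_resp = litellm.completion(model="openai/gpt-4o", messages=messages)
anthropic_resp = litellm.completion(model="anthropic/claude-3-5-sonnet-20241022", messages=messages)

# Both responses come back in the OpenAI chat-completions shape.
print(openai_resp.choices[0].message.content)
print(anthropic_resp.choices[0].message.content)
```
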
AI Gateway
A centralized proxy service that sits between applications and LLM providers, handling authentication, routing, rate limiting, cost tracking, failover, and observability. An AI gateway decouples application code from provider-specific details, similar to how an API gateway manages microservice traffic.
LiteLLM Proxy
The server mode of LiteLLM that runs as a standalone HTTP service (typically in a Docker container or Kubernetes pod) exposing OpenAI-compatible endpoints. The proxy accepts requests at `/chat/completions`, `/completions`, and `/embeddings`, routes them to configured providers, and returns translated responses. The proxy is the deployment unit for production AI gateways.
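Because the proxy speaks the OpenAI wire format, an application can point the stock `openai` SDK at it -- only the base URL and key change. The service URL, virtual key, and model name below are placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com.
client = OpenAI(
    base_url="http://litellm-proxy.ai-gateway.svc.cluster.local:4000",  # placeholder proxy URL
    api_key="sk-dept-virtual-key",  # a virtual key issued by the proxy, not a provider key
)

resp = client.chat.completions.create(
    model="gpt-4o",  # must match a model_name in the proxy's model_list
    messages=[{"role": "user", "content": "Hello through the gateway"}],
)
print(resp.choices[0].message.content)
```
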
Virtual Key
An API key generated by LiteLLM that maps to one or more underlying provider API keys. Virtual keys enable per-department or per-application key issuance without sharing provider credentials. Each virtual key has its own rate limits, spend caps, model access controls, and usage tracking, while the actual provider authentication uses shared master keys configured on the proxy.
model_list
The core configuration section in LiteLLM's **config.yaml** that defines available models and their provider mappings. Each entry in the model_list specifies a model name (what clients request), a **litellm_params** block with the actual provider model identifier and API credentials, and optional routing parameters like weight and priority.
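The same entry structure can be exercised programmatically through litellm's Router class, which backs the proxy; the sketch below mirrors what a config.yaml model_list would declare. Model identifiers and environment variable names are illustrative.

```python
import os

from litellm import Router

# Each entry pairs the name clients request with litellm_params pointing at the real deployment.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "openai/gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "claude-sonnet",
            "litellm_params": {
                "model": "anthropic/claude-3-5-sonnet-20241022",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ]
)

# Clients address models by model_name; the router resolves the provider details.
resp = router.completion(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```
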
RBAC (Role-Based Access Control)
A permission system in LiteLLM that assigns roles to virtual keys, controlling which models and endpoints each key can access. RBAC roles include admin (full access including key management), internal (access to internal models and endpoints), and external (restricted to specified models only). RBAC prevents unauthorized model usage and enforces organizational access policies.
Fallback Chain
An ordered list of alternative model deployments that LiteLLM tries when the primary model fails. Configured via the **fallbacks** parameter, a fallback chain like **["openai/gpt-4o", "anthropic/claude-sonnet", "gemini/gemini-pro"]** means LiteLLM attempts OpenAI first, falls back to Anthropic on failure, and then to Gemini. Fallback chains provide high availability across provider outages.
Circuit Breaker
A reliability pattern that monitors provider failure rates and temporarily removes unhealthy providers from the routing pool. LiteLLM's circuit breaker triggers after **allowed_fails** consecutive failures, marking the provider as unhealthy for **cooldown_time** seconds. During cooldown, requests route to healthy providers. After cooldown expires, the provider is retried. Circuit breakers prevent cascading failures from repeatedly hitting a down provider.
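A fallback chain and circuit-breaker thresholds can be sketched together on the Router; the proxy's config.yaml accepts the same fallbacks, allowed_fails, and cooldown_time settings. Model identifiers and threshold values below are illustrative.

```python
import os

from litellm import Router


def entry(name: str, model: str, key_env: str) -> dict:
    # Illustrative helper that builds one model_list entry.
    return {"model_name": name,
            "litellm_params": {"model": model, "api_key": os.environ[key_env]}}


router = Router(
    model_list=[
        entry("gpt-4o", "openai/gpt-4o", "OPENAI_API_KEY"),
        entry("claude-sonnet", "anthropic/claude-3-5-sonnet-20241022", "ANTHROPIC_API_KEY"),
        entry("gemini-pro", "gemini/gemini-1.5-pro", "GEMINI_API_KEY"),
    ],
    # Fallback chain: try gpt-4o first, then claude-sonnet, then gemini-pro.
    fallbacks=[{"gpt-4o": ["claude-sonnet", "gemini-pro"]}],
    num_retries=2,      # retry transient failures before failing over
    allowed_fails=3,    # failures tolerated before a deployment is circuit-broken
    cooldown_time=60,   # seconds an unhealthy deployment stays out of the routing pool
)

resp = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Which deployment served this request?"}],
)
```
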
Rate Limiting
Per-key constraints that limit request throughput to prevent any single consumer from exhausting shared resources. LiteLLM supports **rpm_limit** (requests per minute), **tpm_limit** (tokens per minute), and **max_parallel_requests** (concurrent request cap). Rate limits are enforced at the virtual key level, enabling differentiated service tiers.
Spend Tracking
LiteLLM's built-in cost accounting that tracks token usage and estimated cost per request, per key, per model, and per time period. Spend data is accessible via the `/spend/logs` endpoint and can be aggregated for cost attribution dashboards. Spend limits (**max_budget**) trigger automatic request rejection when a key exceeds its budget.
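A minimal sketch of pulling spend data for one virtual key, assuming the `/spend/logs` endpoint accepts an api_key filter and returns rows carrying model and spend fields (those names are assumptions that may vary by LiteLLM version); the URL and keys are placeholders.

```python
from collections import defaultdict

import requests

PROXY_URL = "http://litellm-proxy.ai-gateway.svc.cluster.local:4000"  # placeholder
MASTER_KEY = "sk-master-key"                                          # placeholder

# Fetch per-request spend records for a single virtual key.
resp = requests.get(
    f"{PROXY_URL}/spend/logs",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    params={"api_key": "sk-dept-virtual-key"},  # filter to one department's key (assumed filter)
    timeout=30,
)
resp.raise_for_status()

# Aggregate estimated cost per model for a simple attribution view.
per_model = defaultdict(float)
for row in resp.json():
    per_model[row.get("model", "unknown")] += float(row.get("spend", 0) or 0)
print(dict(per_model))
```
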
Model Alias
A user-friendly name that maps to a specific model deployment in the model_list. For example, aliasing **best** to `openai/gpt-4o` and **fast** to `anthropic/claude-haiku` lets applications request models by capability tier rather than specific provider identifiers. Aliases enable model upgrades without application code changes -- update the alias mapping, and all applications automatically use the new model.
Load Balancing
The distribution of requests across multiple deployments of the same model. LiteLLM supports weighted round-robin and least-connections load balancing across model_list entries with the same model name. Load balancing across Azure OpenAI regions or multiple API keys for the same provider increases throughput beyond single-deployment rate limits.
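In the model_list, load balancing is expressed by giving two deployments the same model_name; an alias is just another entry whose name is a capability tier rather than a provider identifier. The Azure deployment names, endpoints, and the optional weight parameter below are illustrative.

```python
import os

from litellm import Router

router = Router(
    model_list=[
        # Two deployments share the name "gpt-4o": the router load-balances across them.
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o-eastus",                   # illustrative Azure deployment
                "api_base": os.environ["AZURE_EASTUS_ENDPOINT"],
                "api_key": os.environ["AZURE_EASTUS_KEY"],
                "weight": 2,                                      # optional: favor this region 2:1
            },
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o-westus",
                "api_base": os.environ["AZURE_WESTUS_ENDPOINT"],
                "api_key": os.environ["AZURE_WESTUS_KEY"],
                "weight": 1,
            },
        },
        # Alias-style entry: clients ask for "fast"; the gateway decides what backs it.
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "anthropic/claude-3-5-haiku-20241022",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="simple-shuffle",  # weighted random pick across same-name deployments
)
```
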
Cooldown Time
The duration in seconds that a circuit-broken provider remains excluded from the routing pool. After **cooldown_time** elapses, LiteLLM sends a test request to the provider. If the test succeeds, the provider is restored to the routing pool. Typical cooldown times range from 30 to 300 seconds depending on the expected provider recovery time.
Health Check
An automated probe that verifies a provider deployment is responsive and functioning correctly. LiteLLM can perform periodic health checks against each model_list entry, testing connectivity and response quality. Unhealthy deployments are temporarily removed from routing, similar to Kubernetes liveness probes for pods.
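From outside the proxy, the `/health` endpoint exercises the configured deployments on demand; the URL, master key, and the exact response fields shown are assumptions that may vary by version.

```python
import requests

PROXY_URL = "http://litellm-proxy.ai-gateway.svc.cluster.local:4000"  # placeholder
MASTER_KEY = "sk-master-key"                                          # placeholder

# Ask the proxy to probe every configured deployment.
resp = requests.get(
    f"{PROXY_URL}/health",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    timeout=60,
)
resp.raise_for_status()
report = resp.json()

# The report separates deployments that responded from those that did not.
print("healthy:", report.get("healthy_endpoints", []))
print("unhealthy:", report.get("unhealthy_endpoints", []))
```
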
Provider Translation
The process by which LiteLLM converts OpenAI-format requests into provider-specific formats and converts provider-specific responses back to OpenAI format. Translation handles differences in message structure (Anthropic's system message handling), parameter names (temperature, max_tokens), and response formats (streaming chunk structure). Provider translation is what enables the single-endpoint abstraction.
Budget Duration
The time window over which a virtual key's spend limit is evaluated. Setting **budget_duration** to **monthly** resets the spend counter on the first of each month. Combined with **max_budget**, this enforces recurring cost caps -- a key with a $5,000 monthly budget can spend up to $5,000 each month before requests are rejected. Budget durations can be set to daily, weekly, monthly, or custom intervals.
Proxy Config (config.yaml)
The YAML configuration file that defines the LiteLLM proxy's behavior, including the model_list, general settings (like master key and database URL), litellm_settings (like fallbacks and circuit breaker parameters), and router settings (like routing strategy and retry policies). The config file is the single source of truth for proxy behavior and is typically mounted as a Kubernetes ConfigMap.
Responses API
OpenAI's newer API format (distinct from chat completions) that supports stateful conversations, tool use, and structured outputs through a different endpoint and request format. LiteLLM can route Responses API requests to OpenAI while translating equivalent functionality to other providers, enabling applications to use the Responses API format while maintaining multi-provider failover.
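A hedged sketch of the client side, assuming the proxy version in use exposes a Responses API route; the service URL, virtual key, and model name are placeholders.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm-proxy.ai-gateway.svc.cluster.local:4000",  # placeholder proxy URL
    api_key="sk-dept-virtual-key",                                      # virtual key issued by the proxy
)

# Responses API request shape (OpenAI SDK). The proxy forwards it to OpenAI or
# translates equivalent functionality for other providers, per this chapter.
resp = client.responses.create(
    model="gpt-4o",
    input="List three benefits of routing LLM traffic through a gateway.",
)
print(resp.output_text)
```
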
Observability
The combination of logging, metrics, and tracing that provides visibility into gateway operations. LiteLLM integrates with Langfuse, Prometheus, and custom callbacks to emit per-request logs (model, tokens, latency, cost), aggregate metrics (requests per second, error rates, p99 latency), and distributed traces. Observability is essential for capacity planning, cost management, and incident response.
Retry Policy
The configuration that determines how LiteLLM handles transient failures before triggering failover. Retry policies specify the number of retries (**num_retries**), the timeout per request, and the backoff strategy. Retries handle transient network issues and rate limit responses (HTTP 429) without failing over to an alternative provider, reserving failover for sustained outages.
