GenAI Inference Engineering
Architect multi-provider LLM gateways, implement semantic caching and batch optimization, monitor provider SLAs, and optimize inference costs.
Verifiable skill graph
12 skill groups · each becomes a signed node on your graph.
Every lab you pass issues a signed W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph. The badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.
Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.
Design and operate LLM gateways: routing across providers, request hedging, circuit breakers, failover chains, latency-based routing, rate-limit middleware, and provider SLA/health tracking that drives routing. The LiteLLM / OpenRouter layer.
Optimize inference spend: token counting and per-provider cost, prompt-caching APIs (Anthropic cache_control, OpenAI automatic prefix caching, Gemini context caching), batch APIs, semantic caching, model tiering/cascading, and cache-friendly prompt structure.
Measure and improve request latency: TTFT, p50/p95/p99 profiling, SSE/token streaming, streaming-vs-blocking trade-offs, throughput tuning, prefix-cache reasoning, and latency-driven model selection.
Keep a multi-provider gateway within provider limits: TPM/RPM ceilings, token-bucket admission control, backpressure and queueing under burst, 429 budgeting, multi-key/account/region capacity planning, and load/capacity testing.
Treat the gateway as the security boundary: PII detection/redaction before third-party egress, prompt-injection and jailbreak filtering, output moderation/guardrails, per-tenant key isolation and secrets, audit logging, and data-residency/zero-retention routing.
Make model/provider swaps safe: offline eval harnesses, golden-set regression tests, canary and shadow traffic on swaps, and A/B quality comparison gating routing decisions.
Instrument LLM calls with OpenTelemetry, Langfuse, Logfire, and Prometheus: latency/error/cost telemetry, production sampling, and signal emission. (Quality adjudication lives in Eval & Swap-Safety.)
Enforce typed outputs and reliable tool calling: JSON mode, Pydantic-validated schemas, function calling, multi-tool orchestration, parallel tool calls, and error recovery on malformed outputs.
Apply reasoning models (o-series, extended thinking), test-time compute (CoT/ToT/best-of-N), context-window and KV/prefix-cache economics, and reasoning-token budget trade-offs.
Call OpenAI, Anthropic, and Gemini SDKs in production: auth, retry with backoff, response parsing, error recovery, and unified multi-provider interfaces.
Production-grade Python applied to inference engineering: async/await, Pydantic models, dataclasses, decorators, context managers, generators, pytest. The language fluency required to ship LLM-backed services.
Ship the gateway as a service: containerize, deploy, autoscale on RPS/latency (HPA), and manage secrets/config for a stateless I/O-bound LLM proxy.
What you'll ship in production
Core responsibilities this discipline prepares you for.
1. Design LLM gateway infrastructure: routing requests across providers
- Deploy and configure LiteLLM gateway on Kubernetes with provider routing rules and load balancing
- Manage API key rotation, failover policies, and per-provider request distribution
- Validate gateway behavior under failover scenarios and measure routing latency overhead
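The failover behavior described above can be sketched without any gateway product. The following is a minimal, framework-independent illustration, not LiteLLM's implementation: `CircuitBreaker`, `call_with_failover`, and the `(name, callable)` provider pairs are hypothetical names introduced here, and the thresholds are placeholder values.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a probe after `cooldown_s`."""
    max_failures: int = 3      # placeholder threshold
    cooldown_s: float = 30.0   # placeholder cooldown
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a request through once the cooldown has elapsed.
        return now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now


def call_with_failover(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
    breakers: Dict[str, CircuitBreaker],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, str]:
    """Try providers in priority order, skipping any whose breaker is open."""
    errors = []
    for name, call in providers:
        breaker = breakers.setdefault(name, CircuitBreaker())
        if not breaker.allow(clock()):
            errors.append(f"{name}: circuit open")
            continue
        try:
            result = call(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            breaker.record_failure(clock())
            errors.append(f"{name}: {exc}")
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Passing the clock as a parameter keeps the breaker testable under simulated failover without sleeping, which is also a convenient way to measure how much overhead the routing layer itself adds.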
2. Optimize request latency through caching, batching, and streaming
- Implement semantic caching with Redis using embedding similarity for cache key matching
- Build request batching strategies and streaming-first response patterns
- Benchmark cache hit rates, measure P50/P95 latency improvements, and tune eviction policies
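To make "embedding similarity for cache key matching" concrete, here is a minimal in-memory sketch. It is a stand-in for a Redis-backed store, not a Redis client: `SemanticCache`, its `threshold` default, and the caller-supplied `embed` function are assumptions for illustration, and the linear scan would be replaced by a vector index in practice.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a cached response when a stored prompt is similar enough."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.9):
        self.embed = embed          # hypothetical embedding function supplied by the caller
        self.threshold = threshold  # placeholder; tune against measured hit rate and quality
        self.entries: List[Tuple[Sequence[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))
```

The threshold is the main tuning knob: set it too low and unrelated prompts share answers, set it too high and the hit rate collapses to exact-match behavior.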
3. Implement structured output extraction from LLMs with type safety
- Use Pydantic AI for type-safe LLM interactions with schema-validated outputs
- Build structured extraction pipelines with Instructor and DSPy for programmatic optimization
- Validate extraction accuracy across providers and measure schema conformance rates
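The validate-and-retry loop behind typed extraction can be sketched with the standard library alone. This is a stand-in for what Pydantic-based tools automate, not their API: the `Invoice` schema, `parse_invoice`, `extract_with_retry`, and the `call_model` callable are all hypothetical names for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    """Hypothetical target schema."""
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Parse and validate model output; raises ValueError on any mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    vendor, total = data.get("vendor"), data.get("total")
    if not isinstance(vendor, str):
        raise ValueError("vendor must be a string")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    return Invoice(vendor=vendor, total=float(total))


def extract_with_retry(call_model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> Invoice:
    """Re-prompt with the validation error until the output conforms."""
    last_error = None
    for _ in range(max_attempts):
        suffix = "" if last_error is None else (
            f"\nPrevious output was invalid: {last_error}. Return valid JSON only."
        )
        raw = call_model(prompt + suffix)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Counting how often the first attempt passes validation, per provider, gives a simple schema conformance rate.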
4. Build cost attribution and FinOps dashboards tracking token spend
- Track token costs per team, model, and feature using Langfuse cost attribution
- Build Grafana dashboards for cost visualization with Prometheus budget alerting
- Implement cost optimization through semantic caching, model tiering, and prompt compression
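Per-team cost attribution reduces to multiplying token counts by a price table and grouping the results. The sketch below assumes usage records are already collected; the model names and prices in `PRICES_PER_MTOK` are placeholders rather than real provider pricing, and `Usage`, `request_cost`, and `cost_by_team` are hypothetical names, not Langfuse APIs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical USD prices per 1M tokens; placeholders, not real provider pricing.
PRICES_PER_MTOK = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}


@dataclass(frozen=True)
class Usage:
    team: str
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(usage: Usage, prices: Dict[str, Dict[str, float]] = PRICES_PER_MTOK) -> float:
    """Cost of one request in USD, pricing input and output tokens separately."""
    price = prices[usage.model]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000


def cost_by_team(records: Iterable[Usage],
                 prices: Dict[str, Dict[str, float]] = PRICES_PER_MTOK) -> Dict[str, float]:
    """Sum request costs per team."""
    totals: Dict[str, float] = defaultdict(float)
    for usage in records:
        totals[usage.team] += request_cost(usage, prices)
    return dict(totals)
```

The same grouping works for model or feature by swapping the key, and the totals can be exported as metrics for dashboards and budget alerts.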
5. Monitor inference quality metrics in production
- Instrument LLM calls with OpenTelemetry spans capturing latency, tokens, and error rates
- Set up Logfire for Python-native tracing and Prometheus for P50/P95/P99 latency monitoring
- Configure alerting rules that detect latency spikes and diagnose root causes from distributed traces
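As a reference point for the P50/P95/P99 figures above, percentiles can be computed directly from raw latency samples. This sketch uses the nearest-rank method over an in-memory list; `percentile` and `latency_summary` are hypothetical helper names. Prometheus typically estimates quantiles from histogram buckets instead, so its numbers will differ slightly from exact values like these.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize request latencies (milliseconds) at p50, p95, and p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Tail percentiles matter more than the mean here: a handful of slow provider calls can leave the p50 unchanged while pushing the p99 well past an alerting threshold.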
6. Implement intelligent routing: route queries to model tiers based on complexity
- Build RouteLLM semantic routing with model cascading: cheap models for simple queries, expensive models for complex ones
- Configure complexity-based dispatch logic with fallback chains across providers
- Demonstrate 60%+ cost savings while maintaining output quality on standardized test datasets
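Complexity-based dispatch can be illustrated with a toy scoring function. This is not how RouteLLM decides (it uses trained routers); the keyword heuristic in `complexity_score`, the tier thresholds, and the model names passed to `pick_tier` are all assumptions chosen only to show the shape of the dispatch logic.

```python
from typing import List, Tuple

# Hypothetical signals that a prompt needs a stronger model.
REASONING_KEYWORDS = ("prove", "derive", "analyze", "step by step", "refactor")


def complexity_score(prompt: str) -> float:
    """Toy heuristic in [0, 1]: longer prompts and reasoning keywords score higher."""
    text = prompt.lower()
    length_part = min(len(text.split()) / 200, 1.0)
    keyword_part = 1.0 if any(k in text for k in REASONING_KEYWORDS) else 0.0
    return 0.5 * length_part + 0.5 * keyword_part


def pick_tier(prompt: str, tiers: List[Tuple[float, str]]) -> str:
    """Return the first tier whose score ceiling covers the prompt.

    `tiers` is a list of (max_score, model_name) sorted by ascending max_score.
    """
    score = complexity_score(prompt)
    for max_score, model in tiers:
        if score <= max_score:
            return model
    return tiers[-1][1]  # fall back to the most capable tier
```

For example, with `tiers = [(0.3, "cheap-model"), (1.0, "expensive-model")]`, a short factual question stays on the cheap tier while a prompt containing "step by step" is escalated. Any savings claim depends on validating the router against a quality benchmark, since a misrouted hard query costs more than it saves.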
7. Manage API rate limits and quotas across providers
- Build rate limiting middleware in FastAPI with per-endpoint and per-user throttling
- Configure LiteLLM quota management with per-team token budgets and key rotation policies
- Validate graceful degradation behavior under sustained load with provider quota exhaustion
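The admission-control core of rate limiting middleware is commonly a token bucket. Below is a minimal sketch that is independent of FastAPI; `TokenBucket`, `try_acquire`, and `retry_after` are hypothetical names, and a middleware built on it would typically respond with HTTP 429 and a Retry-After header when `try_acquire` returns False.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate_per_s` tokens per second up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity        # start full so an initial burst is admitted
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise leave the bucket unchanged."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens would be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

The `cost` parameter lets the same bucket model either request limits (cost 1 per call) or token limits (cost equal to the estimated token count), one bucket per user, endpoint, or provider key.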
8. Deploy inference services on K8s with scaling and health checks
- Configure Kubernetes Deployments with readiness/liveness probes tailored for LLM services
- Set up Horizontal Pod Autoscaler with custom metrics for token throughput scaling
- Validate zero-downtime rolling updates under active inference load
Curriculum
7 courses · each builds on previous goals