GenAI Inference Engineering

Architect multi-provider LLM gateways, implement semantic caching and batch optimization, monitor provider SLAs, and optimize inference costs.

12 skill groups · 7 courses · 614 goals · ~588 hrs

Verifiable skill graph

12 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph. The badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

01
Multi-Provider Gateways, Routing & Failover

Design and operate LLM gateways: routing across providers, request hedging, circuit breakers, failover chains, latency-based routing, rate-limit middleware, and provider SLA/health tracking that drives routing. The LiteLLM / OpenRouter layer.
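
As a rough illustration of how circuit breakers and a failover chain fit together, here is a minimal Python sketch. The names `CircuitBreaker`, `Provider`, and `complete_with_failover` are hypothetical and not taken from LiteLLM or OpenRouter; each provider is assumed to be a plain callable that raises on failure.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, then allows trials after `cooldown` seconds."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: once the cooldown has elapsed, trial requests are allowed.
        # A failed trial re-opens the breaker and restarts the cooldown.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


@dataclass
class Provider:
    name: str
    call: Callable[[str], str]  # hypothetical provider client: prompt in, text out
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)


def complete_with_failover(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try providers in order, skipping any whose breaker is open."""
    errors: List[str] = []
    for provider in providers:
        if not provider.breaker.allow():
            errors.append(f"{provider.name}: circuit open")
            continue
        try:
            result = provider.call(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            provider.breaker.record_failure()
            errors.append(f"{provider.name}: {exc!r}")
            continue
        provider.breaker.record_success()
        return provider.name, result
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Injecting the clock keeps the breaker testable without real waiting. A production gateway would likely also track latency and error rates per provider and feed them into the routing order.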

02
Cost, Caching & Batch Optimization

Optimize inference spend: token counting and per-provider cost, prompt-caching APIs (Anthropic cache_control, OpenAI automatic prefix caching, Gemini context caching), batch APIs, semantic caching, model tiering/cascading, and cache-friendly prompt structure.
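
To make the semantic-caching idea concrete, the sketch below keeps responses in memory and returns a cached one when a new request's embedding is close enough to a stored one. It assumes the caller already has embedding vectors from some embedding model; `SemanticCache` and the 0.9 threshold are illustrative choices, not a library API or a recommended setting.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory cache keyed by embedding similarity instead of exact text match."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed cutoff; tune against real traffic
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Only a sufficiently close neighbour counts as a hit.
        return best_response if best_score >= self.threshold else None

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))
```

The linear scan is fine for a demonstration. At scale, the lookup would typically move to a vector index, and entries would need eviction and expiry policies.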

03
Latency, Streaming & Throughput

Measure and improve request latency: TTFT, p50/p95/p99 profiling, SSE/token streaming, streaming-vs-blocking trade-offs, throughput tuning, prefix-cache reasoning, and latency-driven model selection.
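
For a sense of what TTFT and percentile profiling involve, here is a small Python sketch. It uses the nearest-rank percentile definition (one of several in common use) and treats any iterable of text chunks as a stand-in for a streamed response; `percentile` and `time_stream` are hypothetical helper names.

```python
import math
import time
from typing import Iterable, Sequence, Tuple


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_stream(chunks: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a stream and return (ttft_seconds, total_seconds, chunk_count).

    For an empty stream, TTFT is reported as the total elapsed time.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    return (total if ttft is None else ttft), total, count
```

Collecting `ttft` and `total` per request and passing the lists to `percentile(..., 50)`, `percentile(..., 95)`, and `percentile(..., 99)` gives the usual p50/p95/p99 view.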

04
Rate-Limit, Quota & Capacity Management

Keep a multi-provider gateway within provider limits: TPM/RPM ceilings, token-bucket admission control, backpressure and queueing under burst, 429 budgeting, multi-key/account/region capacity planning, and load/capacity testing.
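
Token-bucket admission control can be sketched in a few lines of Python. The class below is a hypothetical single-process version. To enforce a tokens-per-minute ceiling, the refill rate would be set to TPM / 60 and each request would be charged its estimated token count.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit a request only if enough budget has accumulated."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec      # budget refilled per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock
        self.tokens = capacity        # start with a full bucket
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True and deduct `cost` if budget allows, otherwise False."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

A caller that receives `False` can queue the request, shed it, or route it to another key or region. That decision is where backpressure policy lives. A gateway running multiple replicas would need shared state (for example in Redis) rather than this in-process counter.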

05
Gateway Security & Data Governance

Treat the gateway as the security boundary: PII detection/redaction before third-party egress, prompt-injection and jailbreak filtering, output moderation/guardrails, per-tenant key isolation and secrets, audit logging, and data-residency/zero-retention routing.

06
Eval & Swap-Safety

Make model/provider swaps safe: offline eval harnesses, golden-set regression tests, canary and shadow traffic on swaps, and A/B quality comparison gating routing decisions.

07
Operational Observability

Instrument LLM calls with OpenTelemetry, Langfuse, Logfire, and Prometheus: latency/error/cost telemetry, production sampling, and signal emission. (Quality adjudication lives in Eval & Swap-Safety.)

08
Structured Output & Tool Use

Enforce typed outputs and reliable tool calling: JSON mode, Pydantic-validated schemas, function calling, multi-tool orchestration, parallel tool calls, and error recovery on malformed outputs.
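
Error recovery on malformed outputs often follows a validate-then-re-ask loop. The sketch below shows the pattern using only the standard library as a stand-in for Pydantic validation. `Invoice`, `parse_invoice`, and `extract_with_repair` are hypothetical names, and `ask_model` is assumed to be any callable that takes a prompt and returns raw model text.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Invoice:
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Parse and validate model output; raises ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    vendor, total = data.get("vendor"), data.get("total")
    if not isinstance(vendor, str) or not vendor:
        raise ValueError("vendor must be a non-empty string")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    return Invoice(vendor=vendor, total=float(total))


def extract_with_repair(ask_model: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> Invoice:
    """Re-ask the model with the validation error until the output parses."""
    message = prompt
    last_error: Optional[ValueError] = None
    for _ in range(max_attempts):
        raw = ask_model(message)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the error back so the model can correct its own output.
            message = f"{prompt}\n\nYour previous reply was invalid ({exc}). Reply with JSON only."
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

With Pydantic, `parse_invoice` would collapse to a model class plus one validation call. The retry loop around it stays the same.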

09
Reasoning & Inference-Time Techniques

Apply reasoning models (o-series, extended thinking), test-time compute (CoT/ToT/best-of-N), context-window and KV/prefix-cache economics, and reasoning-token budget trade-offs.

10
Hosted LLM API Integration

Call OpenAI, Anthropic, and Gemini SDKs in production: auth, retry with backoff, response parsing, error recovery, and unified multi-provider interfaces.
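
Retry with backoff is commonly implemented as capped exponential backoff with jitter. The following is a minimal sketch, not any SDK's built-in retry logic. `TransientError` is a hypothetical exception standing in for whatever rate-limit or timeout errors a given provider client raises.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Placeholder for retryable failures such as HTTP 429 or 5xx responses."""


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `fn`, retrying on `retry_on` errors with full-jitter exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the cap to avoid retry storms.
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

Errors outside `retry_on`, such as authentication failures or invalid requests, propagate immediately, since retrying them only adds latency.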

11
Python for Inference Engineering

Production-grade Python applied to inference engineering: async/await, Pydantic models, dataclasses, decorators, context managers, generators, pytest. The language fluency required to ship LLM-backed services.

12
Deploy & Scale the Gateway

Ship the gateway as a service: containerize, deploy, autoscale on RPS/latency (HPA), and manage secrets/config for a stateless I/O-bound LLM proxy.

What you'll ship in production

Core responsibilities this discipline prepares you for.

  1. Design LLM gateway infrastructure, routing requests across providers

    • Deploy and configure a LiteLLM gateway on Kubernetes with provider routing rules and load balancing
    • Manage API key rotation, failover policies, and per-provider request distribution
    • Validate gateway behavior under failover scenarios and measure routing latency overhead
  2. Optimize request latency through caching, batching, and streaming

    • Implement semantic caching with Redis using embedding similarity for cache key matching
    • Build request batching strategies and streaming-first response patterns
    • Benchmark cache hit rates, measure P50/P95 latency improvements, and tune eviction policies
  3. Implement structured output extraction from LLMs with type safety

    • Use Pydantic AI for type-safe LLM interactions with schema-validated outputs
    • Build structured extraction pipelines with Instructor and DSPy for programmatic optimization
    • Validate extraction accuracy across providers and measure schema conformance rates
  4. Build cost attribution and FinOps dashboards tracking token spend

    • Track token costs per team, model, and feature using Langfuse cost attribution
    • Build Grafana dashboards for cost visualization with Prometheus budget alerting
    • Implement cost optimization through semantic caching, model tiering, and prompt compression
  5. Monitor inference quality metrics in production

    • Instrument LLM calls with OpenTelemetry spans capturing latency, tokens, and error rates
    • Set up Logfire for Python-native tracing and Prometheus for P50/P95/P99 latency monitoring
    • Configure alerting rules that detect latency spikes and diagnose root causes from distributed traces
  6. Implement intelligent routing, sending queries to model tiers based on complexity

    • Build RouteLLM semantic routing with model cascading: cheap models for simple queries, expensive models for complex ones
    • Configure complexity-based dispatch logic with fallback chains across providers
    • Demonstrate 60%+ cost savings while maintaining output quality on standardized test datasets
  7. Manage API rate limits and quotas across providers

    • Build rate limiting middleware in FastAPI with per-endpoint and per-user throttling
    • Configure LiteLLM quota management with per-team token budgets and key rotation policies
    • Validate graceful degradation behavior under sustained load with provider quota exhaustion
  8. Deploy inference services on K8s with scaling and health checks

    • Configure Kubernetes Deployments with readiness/liveness probes tailored for LLM services
    • Set up Horizontal Pod Autoscaler with custom metrics for token throughput scaling
    • Validate zero-downtime rolling updates under active inference load

Curriculum

7 courses · each builds on previous goals
