LLMOps Engineering

Monitor hallucination rates and token costs, operate guardrails and eval gates, manage prompt versioning and canary deployments.

12 skill groups · 8 courses · 869 goals · ~348 hrs

Verifiable skill graph

12 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

01
Output Quality Eval & Gating

Is the answer good? Hallucination/groundedness detection, eval-gate pipelines that block bad deploys, quality-drift detection, RAGAS/DeepEval, LLM-as-judge (and validating the judge), and RAG retrieval-quality monitoring.
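As a rough illustration of an eval gate that blocks a bad deploy, the sketch below compares a candidate's evaluation scores against minimum thresholds. The metric names, threshold values, and the `eval_gate` helper are hypothetical; in practice the scores would come from an evaluator such as RAGAS or DeepEval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: list[str]


def eval_gate(scores: dict[str, float], thresholds: dict[str, float]) -> GateResult:
    """Fail the gate if any required metric is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return GateResult(passed=not failures, failures=failures)


# Hypothetical scores for a candidate prompt version.
result = eval_gate(
    scores={"groundedness": 0.91, "answer_relevancy": 0.78},
    thresholds={"groundedness": 0.85, "answer_relevancy": 0.80},
)
print(result.passed, result.failures)
```

A CI job could exit non-zero when `result.passed` is false, which is one simple way to stop the deploy step from running.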

02
Cost & Token FinOps

Keep it cheap: per-request cost attribution, token-budget controllers and gates, cache economics, model-tier arbitrage, batch scheduling, and capacity forecasting — driving spend down (per-tenant chargeback belongs to the platform engineer).
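Per-request cost attribution can be sketched in a few lines, assuming token counts are already logged per request. The model names and per-million-token prices below are made up for illustration; real prices differ by provider and change over time.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens (not real provider pricing).
PRICES = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under the price table above."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def attribute_by_feature(requests: list[dict]) -> dict[str, float]:
    """Sum request costs per feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for r in requests:
        totals[r["feature"]] += request_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)


usage_log = [
    {"feature": "search", "model": "small-model", "input_tokens": 2000, "output_tokens": 400},
    {"feature": "search", "model": "large-model", "input_tokens": 1000, "output_tokens": 500},
    {"feature": "summaries", "model": "large-model", "input_tokens": 8000, "output_tokens": 1000},
]
costs = attribute_by_feature(usage_log)
```

The same aggregation works for any tag on the request (team, tenant, model); only the grouping key changes.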

03
Incident Response, On-Call & Resilience

Keep it not-on-fire: incident command, triage and post-mortems, runbook automation, on-call and alerting hygiene, error-budget policy, and provider-outage failover/fallback drills.
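An error-budget policy rests on simple arithmetic: the SLO target determines how many failures a window can absorb. The following is a minimal sketch with illustrative numbers; the function name and the "freeze releases at zero" rule in the comment are assumptions, not a prescribed policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left in the window (negative means overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures


# A 99.9% SLO over 100,000 requests allows about 100 failures.
remaining = error_budget_remaining(slo_target=0.999, total_requests=100_000, failed_requests=25)

# One possible policy: freeze risky releases once the budget is exhausted.
freeze_releases = remaining <= 0
```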

04
Compliance & Governance Ops

Keep it compliant: audit engines and trails, EU AI Act / SOC 2 / HIPAA / GDPR controls, policy gates, bias and fairness monitoring, and regulatory reporting.

05
Prompt/Model Lifecycle & Release Safety

Safely change what's in prod: prompt/model versioning and registry, provenance/lineage, canary analysis, automated rollback triggers, progressive delivery, feature flags, and shadow/replay testing.
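One way to picture an automated rollback trigger is a comparison of the canary's error rate against the baseline's. The sketch below is a simplified assumption of how such a check might look; the parameter names, the minimum-sample guard, and the default limits are hypothetical, and production canary analysis usually adds statistical tests and quality metrics.

```python
def should_rollback(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    min_samples: int = 200,
    max_relative_increase: float = 0.5,
    abs_floor: float = 0.01,
) -> bool:
    """Return True when the canary's error rate clearly exceeds the allowed limit."""
    if canary_total < min_samples:
        return False  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Allow a relative increase over baseline, but never a limit below abs_floor,
    # so a near-zero baseline does not trigger on a single canary error.
    limit = max(baseline_rate * (1 + max_relative_increase), abs_floor)
    return canary_rate > limit


# Baseline at 0.2% errors; canary at 3% over 400 requests.
decision = should_rollback(baseline_errors=20, baseline_total=10_000,
                           canary_errors=12, canary_total=400)
```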

06
Service Observability & SLOs

Is the service up, fast, and within SLO? OpenTelemetry tracing, Langfuse, Prometheus/Grafana, latency/throughput/error telemetry, and SLI/SLO engines.

07
Security Ops & Guardrails

Operate the safety layer in production: runtime guardrails with canary/kill-switch, key rotation, PII detection, prompt-injection monitoring, content-safety filters, and red-team automation.

08
CI/CD & GitOps for AI

Ship safely and repeatably: GitHub Actions/ArgoCD GitOps, Infrastructure-as-Code, secrets management, environment promotion, and deployment automation.

09
Container & Kubernetes for LLMOps

Operate the serving substrate: Helm, multi-tenant isolation and quota, HPA, gateway/guardrail sidecars, and namespace policy.

10
Retrieval Substrate Health (Ops)

Keep the retrieval substrate up in production: health checks, failure detection, recovery, and capacity/staleness alarms — operating it, not building it.

11
Hosted LLM API Integration

Baseline LLM access in ops code: provider SDKs, multi-provider gateways, smart routers, and provider failover.
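Provider failover can be illustrated without any SDK by treating each provider as a plain callable and trying them in priority order. Everything named here (`complete_with_failover`, `AllProvidersFailed`, the two stub providers) is hypothetical; a real gateway would also add timeouts, retries, and health tracking.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an error."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad catch is deliberate for this sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real SDK calls.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def stable_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


served_by, text = complete_with_failover(
    "hi", [("primary", flaky_primary), ("secondary", stable_secondary)]
)
```

Swapping `flaky_primary` for a stub that raises on a schedule is a cheap way to rehearse an outage drill before involving a real provider.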

12
Python for LLMOps

Production Python for ops tooling: async/await, Pydantic, typing, dataclasses, pytest, and error handling.

What you'll ship in production

Core responsibilities this discipline prepares you for.

  1. Design CI/CD pipelines for LLM application deployment

    • Build ArgoCD GitOps workflows with Helm-based deployments and environment promotion
    • Implement canary and blue-green rollout strategies with automated quality-based rollback
    • Wire complete CI/CD pipelines that trigger rollbacks when evaluation metrics degrade
  2. Monitor LLM systems in production: latency, errors, costs, quality

    • Instrument with OpenTelemetry and Langfuse v3 for OTEL-native distributed tracing
    • Build Grafana dashboards and use Logfire for Python application monitoring and alerting
    • Set up monitoring stacks that detect anomalies, fire alerts, and enable trace-based root cause analysis
  3. Manage LLM gateway operations: key rotation, failover, quota management

    • Operate LiteLLM gateway: API key lifecycle management, provider health monitoring, per-team quotas
    • Handle zero-downtime model version switching with traffic draining and validation
    • Simulate provider outages and quota exhaustion to validate failover and degradation behavior
  4. Implement FinOps practices: cost attribution, budgets, and optimization

    • Track token costs by team, feature, and model with Prometheus-based budget alerting
    • Implement cost optimization through semantic caching, model tiering, and prompt compression
    • Build FinOps dashboards that demonstrate measurable cost reduction across optimization strategies
  5. Build continuous evaluation pipelines for production LLM quality

    • Run RAGAS and DeepEval evaluation pipelines alongside production traffic as shadow evaluators
    • Set up Langfuse-based quality tracking with automated quality gates and threshold alerting
    • Detect quality degradation in real time and trigger automated alerts when scores drop below baselines
  6. Detect and respond to prompt attacks and safety incidents in production

    • Monitor NeMo Guardrails operationally for prompt injection and jailbreak detection patterns
    • Classify incident severity and execute structured response workflows with containment procedures
    • Simulate attack scenarios end-to-end: detection, triage, remediation, and post-incident analysis
  7. Manage data quality for RAG systems: freshness, drift, accuracy

    • Monitor embedding drift and retrieval accuracy with continuous RAGAS evaluation
    • Set up automated reindexing triggers and stale content detection pipelines
    • Build monitoring for live RAG systems that detects quality degradation and triggers reindexing workflows
  8. Implement capacity planning: predict demand and right-size deployments

    • Forecast token demand using historical usage patterns and run load tests for LLM services
    • Model SLA capacity requirements and configure KEDA-based autoscaling policies
    • Run load tests that predict capacity requirements and validate SLA compliance under variable traffic
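The capacity-planning item above comes down to a sizing calculation once peak demand and per-replica throughput are known. A minimal sketch, assuming throughput is measured in tokens per second from a load test; the function name, the 70% utilization target, and the example numbers are illustrative assumptions.

```python
import math


def replicas_needed(
    peak_tokens_per_sec: float,
    tokens_per_sec_per_replica: float,
    target_utilization: float = 0.7,
) -> int:
    """Replicas required to serve peak demand while keeping headroom per replica."""
    if tokens_per_sec_per_replica <= 0:
        raise ValueError("tokens_per_sec_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_throughput = tokens_per_sec_per_replica * target_utilization
    return max(1, math.ceil(peak_tokens_per_sec / usable_throughput))


# Forecast peak of 12,000 tokens/s; each replica sustained 2,500 tokens/s under load test.
replica_count = replicas_needed(peak_tokens_per_sec=12_000, tokens_per_sec_per_replica=2_500)
```

The result can serve as the upper bound for an autoscaling policy, with the forecast peak revisited as usage data accumulates.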

Curriculum

8 courses · each builds on previous goals
