GenAI Platform Engineering

Build internal GenAI developer platforms with self-service capabilities, multi-tenancy, RBAC, and CI/CD for model, prompt, and guardrail pipelines.

12 skill groups · 9 courses · 795 goals · ~339 hrs

Verifiable skill graph

12 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

Self-Service Developer Platform & Golden Paths (IDP)

Build the paved road: a self-service internal developer platform — portals, service catalogs, templates and scaffolding, and self-service provisioning so other engineers ship AI without filing tickets.

Multi-Tenancy, Isolation & RBAC

Make one platform safely serve many teams: multi-tenant architecture, namespace and resource isolation, noisy-neighbor prevention, per-tenant limits, SSO/identity, RBAC, and secrets-as-a-service.

Self-Service Deployment Control Plane

Build the deployment control plane others consume: a self-service paved road for promoting models, prompts, and guardrails, with registry/versioning and GitOps offered as a platform capability — not operating one team's pipeline.

Multi-Tenant Kubernetes & Control-Plane Engineering

Engineer the platform layer on Kubernetes: operators, CRDs and controllers, the control-plane backend behind self-service, multi-cluster, GPU/accelerator scheduling, and platform IaC (Crossplane/Terraform).

Model-Serving & Inference-Endpoints (Platform Service)

Offer model serving as a platform product: provision inference endpoints and model-serving-as-a-service, internal model gateways, and autoscaling serving infra that tenants request on demand.

Multi-Tenant Observability (Platform Service)

Give every tenant observability out of the box: per-tenant dashboards, traces, and metrics the platform provides, the shared stack teams plug into, and fleet-wide plus platform-self SLOs.

Platform Governance, Policy-as-Code & Audit

Enforce the rules platform-wide: policy-as-code and admission policies (OPA/Kyverno), an audit engine every tenant inherits, and governance gates baked into the control plane — preventive, not operational compliance.

Multi-Tenant Quota & Chargeback

Run the platform's economics: per-tenant cost attribution, chargeback/showback, cross-tenant quota enforcement, and capacity planning — billing and bounding tenants, not reducing model spend.

Eval & Benchmark Infrastructure (Platform Service)

Host eval as a self-serve capability: shared eval/benchmark harnesses, golden-set storage, and gate scheduling that tenants plug their own metrics into — the infrastructure, not the eval science.

Data & Vector Infra (Platform Service)

Provision data infra as a platform offering: managed vector stores and data-infra golden paths tenants self-serve — not building the corpus/embedding pipelines themselves.

Hosted LLM API Integration

Baseline provider access in platform code: LLM/embedding SDK calls, auth, and retries.

Python for Platform Engineering

Production Python for platform code: async, typing, Pydantic, and error handling.

What you'll ship in production

Core responsibilities this discipline prepares you for.

1. Build the internal GenAI platform, enabling developers to deploy LLM applications self-service
- Design platform APIs with golden path templates and self-service provisioning workflows
- Build developer portals with pre-approved LLM configurations, guardrails, and monitoring included
- Wire end-to-end self-service: from app registration to deployed inference endpoint with observability
2. Design multi-tenant infrastructure with namespace isolation and RBAC
- Implement Kubernetes namespace isolation with RBAC policies and resource quotas per tenant
- Automate tenant provisioning with network policies and admission controllers
- Validate tenant isolation by enforcing resource limits under concurrent multi-team workloads
3. Implement CI/CD pipelines with GitOps for GenAI applications
- Set up ArgoCD GitOps for declarative deployment from Git push to production rollout
- Build GitHub Actions workflows with act for local CI and Helm chart packaging
- Wire complete GitOps pipelines with Kustomize overlays for dev/staging/production environments
4. Manage data infrastructure: databases, caches, and message queues on K8s
- Deploy PostgreSQL + pgvector, Redis, Kafka, Neo4j, and MinIO as Kubernetes-native services
- Configure backup/restore, horizontal scaling, and monitoring for each data component
- Benchmark throughput and failover behavior for each infrastructure component under load
5. Build autoscaling for GenAI workloads using event-driven scaling and batch job queuing
- Configure KEDA for event-driven pod autoscaling based on queue depth, HTTP rate, and custom metrics
- Set up Kueue for Kubernetes-native batch job scheduling with priorities and fair quotas
- Validate auto-scaling policies under burst GenAI workloads with realistic traffic patterns
6. Provision infrastructure-as-code using K8s-native tooling
- Declare infrastructure as Kubernetes custom resources with Crossplane providers
- Manage databases, storage, and networking declaratively through kubectl apply
- Verify reconciliation behavior by modifying infrastructure state and observing self-healing
7. Implement full-stack observability across the GenAI platform
- Build unified observability with Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing
- Add Logfire for Python application tracing and Langfuse for LLM-specific cost and quality monitoring
- Wire a unified observability stack spanning infrastructure, application, and LLM inference layers
8. Operate LLM gateways as platform infrastructure
- Manage LiteLLM gateway operations: API key lifecycle, per-team cost tracking, and provider health
- Handle model version migration and zero-downtime provider switching
- Operate a production gateway serving multiple internal teams with isolated quotas and routing
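
As a rough illustration of the per-tenant isolation described in responsibility 2, the sketch below builds a Namespace, a ResourceQuota, and a RoleBinding as plain Python dicts that could be dumped to JSON and applied with kubectl. The function name `tenant_manifests`, the `tenant-` namespace prefix, the `platform/tenant` label, the quota values, and the `<tenant>-devs` group are all hypothetical; `edit` is one of Kubernetes' default user-facing ClusterRoles.

```python
import json


def tenant_manifests(tenant: str, cpu: str = "8", memory: str = "16Gi") -> list[dict]:
    """Return Namespace, ResourceQuota, and RoleBinding manifests for one tenant."""
    ns = f"tenant-{tenant}"  # hypothetical naming convention
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns, "labels": {"platform/tenant": tenant}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": ns},
        # Illustrative caps only; real values depend on cluster capacity.
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory, "pods": "50"}},
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "tenant-editors", "namespace": ns},
        "subjects": [
            {"kind": "Group", "name": f"{tenant}-devs", "apiGroup": "rbac.authorization.k8s.io"}
        ],
        # Binding a ClusterRole through a RoleBinding scopes it to this namespace.
        "roleRef": {"kind": "ClusterRole", "name": "edit", "apiGroup": "rbac.authorization.k8s.io"},
    }
    return [namespace, quota, binding]


if __name__ == "__main__":
    print(json.dumps(tenant_manifests("search"), indent=2))
```

Generating all three objects from one function keeps the namespace, quota, and access grant in step for every tenant, which is the property a provisioning workflow would want to preserve.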

Curriculum

9 courses · each builds on previous goals

Course · Goals

Python Essentials for Agent Builders · 62 goals

Your Dev Environment · 4 goals

Navigate filesystem with terminal
Manage files from command line
Set up VS Code
Configure terminal in VS Code

Python, Git & Package Management · 6 goals

Install and verify Python
Write hello world script
Use Python REPL
Initialize Git repository
Track changes with Git
Install packages with pip

Variables & Basic Types · 5 goals

Create and name variables
Work with strings
Work with numbers
Work with booleans
Format with f-strings

Control Flow · 4 goals

Make decisions with if/elif/else
Iterate with for loops
Repeat with while loops
Control loop execution

Functions · 5 goals

Define and call functions
Use parameters
Return values
Document with docstrings
Understand scope

Modules & Imports · 4 goals

Import standard library
Create custom modules
Understand Python path
Create packages

Lists & Tuples · 5 goals

Create and access lists
Modify lists
Slice lists
Use list comprehensions
Work with tuples

Dictionaries & Sets · 5 goals

Create and access dicts
Modify dictionaries
Iterate over dicts
Work with nested dicts
Use sets

Classes & Dataclasses · 5 goals

Understand class basics
Create dataclasses
Add methods
Use default values
Basic inheritance

Files, JSON & Error Handling · 5 goals

Read and write files
Work with JSON
Use pathlib
Handle exceptions
Create custom exceptions

Basic Testing · 4 goals

Use assert statements
Create test functions
Run pytest
Test classes

Environment Variables & Configuration · 5 goals

Understand environment variables
Use .env files
Load with python-dotenv
Handle missing variables
Organize configuration

Decorators & Context Managers · 5 goals

Understand decorators
Write simple decorators
Use context managers
Write context managers
Combine patterns

LLM Foundations for Agent Builders · 55 goals

Generators & Iterators · 5 goals

Understand iteration
Create generators
Use generator expressions
Build data pipelines
Use itertools

Async Programming Basics · 5 goals

Understand async concepts
Write async functions
Run concurrent operations
Use async context managers
Handle async exceptions

Type Hints & Pydantic · 5 goals

Add basic type hints
Use typing generics
Create Pydantic models
Validate API data
Configure Pydantic

Data Pipelines & Transformations · 5 goals

Build functional pipelines
Work with tabular data
Transform data shapes
Process LLM data formats
Optimize for performance

HTTP Clients & httpx · 5 goals

Make GET requests
Make POST requests
Use async httpx
Handle errors
Use sessions

Your First LLM Call · 5 goals

Set up credentials
Install Gemini SDK
Make first API call
Parse response
Handle API errors

Sampling Parameters & Output Control · 5 goals

Understand temperature
Use top-p sampling
Implement determinism
Control output length
Use structured output

Embeddings & Semantic Search · 5 goals

Understand embeddings
Generate embeddings
Calculate similarity
Build simple search
Compare embedding models
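
To make the "calculate similarity" and "build simple search" goals concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The toy two-dimensional vectors and the names `cosine_similarity` and `search` are illustrative assumptions; real embeddings would come from an embedding model and have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: list[float], corpus: dict[str, list[float]], top_k: int = 2):
    """Rank document ids by similarity to the query vector, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings (assumption for illustration).
    corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
    print(search([1.0, 0.1], corpus))
```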

RAG Fundamentals · 5 goals

Understand RAG pattern
Chunk documents
Build retrieval pipeline
Compose RAG prompts
Evaluate RAG quality
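
The "chunk documents" goal can be sketched as fixed-size character chunking with overlap. This is one simple strategy among many, not necessarily the one the course uses; the function name `chunk_text` and the default sizes are assumptions.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of `size`, each sharing `overlap` chars with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", size=4, overlap=2))
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.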

Cost Awareness & Token Economics · 5 goals

Understand pricing models
Calculate request costs
Compare provider costs
Identify cost drivers
Basic cost optimization
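
For the "calculate request costs" goal, per-request cost is usually input tokens times the input price plus output tokens times the output price. The sketch below assumes per-million-token pricing; the `Pricing` class and the prices shown are made-up placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in USD per one million tokens (hypothetical values supplied by the caller)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under simple linear per-token pricing."""
    total = input_tokens * pricing.input_per_million + output_tokens * pricing.output_per_million
    return total / 1_000_000


if __name__ == "__main__":
    example = Pricing(input_per_million=0.50, output_per_million=1.50)  # placeholder prices
    print(f"${request_cost(example, input_tokens=2000, output_tokens=500):.5f}")
```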

Retry Patterns with Tenacity · 5 goals

Understand retry need
Use tenacity basics
Implement exponential backoff
Handle specific exceptions
Combine with async
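
The chapter itself uses the tenacity library; to show the underlying idea without that dependency, the sketch below is a hand-rolled, standard-library stand-in for exponential backoff that retries only specific exceptions. The function name `retry_with_backoff`, its defaults, and the 10% jitter are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(fn, *, attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(); on a retryable exception wait base_delay * 2**n (capped), then try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Small random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry_with_backoff(flaky, base_delay=0.01))
```

Passing `sleep` as a parameter is a testing convenience: a test can record the delays instead of actually waiting.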

Kubernetes Essentials for GenAI · 72 goals

Containerizing LLM Applications · 6 goals

Write a Python app that calls the Gemini API and returns structured responses
Write a Dockerfile and build a container image for the LLM app
Run the containerized LLM app with environment-based configuration
Use Docker Compose to run the LLM app with supporting services
Tag images with semantic versions and push to a container registry
Debug containers with exec, logs, and inspect

Your Kubernetes Cluster & First LLM Pod · 6 goals

Understand K8s architecture and connect to your vCluster
Deploy the LLM app as your first Kubernetes pod
Organize workloads with namespaces
Use labels and selectors to organize and query resources
Understand pod lifecycle and restart policies
Master kubectl debugging: exec, logs, describe, port-forward

Services & the LLM Chat Backend · 6 goals

Create a ClusterIP service to expose the LLM chat API internally
Deploy a multi-tier LLM chat application
Compare service types: ClusterIP, NodePort, LoadBalancer
Master DNS-based service discovery in Kubernetes
Understand endpoints and traffic routing
Debug service connectivity problems

Deployments, Scaling & Rolling Updates · 6 goals

Create a Deployment for the LLM chat API
Scale LLM app replicas to handle concurrent requests
Perform a rolling update with zero downtime
Roll back a broken deployment
Compare deployment strategies: RollingUpdate vs Recreate
Manage deployment lifecycle with kubectl rollout

ConfigMaps & Secrets for LLM Apps · 6 goals

Create ConfigMaps for LLM app settings
Mount ConfigMaps as files for complex configuration
Store LLM proxy credentials securely in Secrets
Manage per-environment configuration for dev, staging, and prod
Handle configuration updates and rolling restarts
Debug configuration issues in LLM app pods

Persistent Storage & StatefulSets · 6 goals

Create PersistentVolumeClaims for durable storage
Deploy PostgreSQL as a StatefulSet
Connect the LLM chat API to PostgreSQL for conversation persistence
Deploy Redis as a StatefulSet for LLM response caching
Understand StatefulSet scaling and ordering guarantees
Manage PVC lifecycle: expansion, snapshots, and cleanup

Multi-Container Pods: Sidecars & Init Containers · 6 goals

Add an LLM proxy sidecar to the chat API pod
Use init containers for database setup and config loading
Share data between containers via emptyDir volumes
Implement the ambassador pattern for multi-model LLM routing
Add a logging and metrics sidecar to the LLM app
Debug multi-container pods

Resource Management & Cost Optimization · 6 goals

Set resource requests and limits for the LLM chat API
Understand QoS classes and their impact on eviction
Enforce resource defaults with LimitRanges
Cap namespace resource usage with ResourceQuotas
Right-size LLM app containers based on actual usage
Diagnose OOMKilled and CPU throttling issues

Packaging with Helm & Kustomize · 6 goals

Create a Helm chart for the LLM chat application
Parameterize the chart with values.yaml for each environment
Manage Helm release lifecycle: install, upgrade, rollback
Use Kustomize bases and overlays for the LLM app
Use Kustomize patches and generators
Compare Helm vs Kustomize for different deployment scenarios

Networking, Ingress & TLS · 6 goals

Expose the LLM chat API via an Ingress resource
Add TLS to the Ingress for HTTPS access
Isolate services with NetworkPolicies
Configure Ingress annotations for production traffic
Understand K8s networking: pod IPs, CNI, and service routing
Debug networking and connectivity issues

Health Probes, Autoscaling & Self-Healing · 6 goals

Add liveness and readiness probes to the LLM chat API
Configure startup probes for containers with slow initialization
Scale the chat API automatically with HPA based on CPU
Create PodDisruptionBudgets for safe maintenance
Implement health check patterns for LLM-dependent services
Combine autoscaling, probes, and PDBs for a resilient LLM service

RBAC, Security & K8s Troubleshooting · 6 goals

Create RBAC roles for the LLM chat application
Enforce Pod Security Standards
Apply SecurityContext for defense in depth
Debug CrashLoopBackOff and OOMKilled failures
Use kubectl debug and ephemeral containers for live debugging
Troubleshoot LLM-specific issues: timeouts, proxy errors, stale connections

Web APIs for GenAI Engineers · 60 goals

FastAPI Fundamentals · 6 goals

Create a FastAPI application with path operations
Define Pydantic request and response models
Implement dependency injection for shared resources
Build CRUD endpoints with proper HTTP semantics
Configure OpenAPI documentation with examples
Handle errors with custom exception handlers

Async Python for APIs · 6 goals

Convert sync endpoints to async with proper await patterns
Implement background tasks for non-blocking operations
Execute concurrent API calls with asyncio.gather
Manage application lifecycle with lifespan handlers
Build async generators for streaming responses
Control concurrency with semaphores and throttling
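
The goals on `asyncio.gather` and semaphore-based throttling combine naturally: launch every call at once, but let only a bounded number run concurrently. In this sketch `call_model` is a hypothetical stand-in that sleeps instead of calling a real provider, and `run_bounded` is an illustrative name.

```python
import asyncio


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a provider call; sleeps to simulate network I/O."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def run_bounded(prompts: list[str], limit: int = 2) -> list[str]:
    """Run call_model for every prompt with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(prompt: str) -> str:
        async with semaphore:  # waits here while `limit` calls are already running
            return await call_model(prompt)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(run_bounded(["a", "b", "c"])))
```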

Database Integration · 6 goals

Configure SQLAlchemy async engine with connection pooling
Define ORM models with relationships and constraints
Create and manage database migrations with Alembic
Implement repository pattern for data access
Build transactional endpoints with session lifecycle
Implement filtering, sorting, and full-text search

Authentication & Authorization · 6 goals

Implement user registration with password hashing
Build OAuth2 password flow with JWT tokens
Implement API key authentication for services
Enforce role-based access control with permissions
Build token refresh and revocation
Compose multiple auth strategies into dependencies

Real-time Streaming · 6 goals

Build SSE endpoint for streaming LLM responses
Implement WebSocket endpoint with connection lifecycle
Build WebSocket connection manager for broadcasting
Handle backpressure and slow clients
Implement heartbeat and automatic reconnection
Build real-time notification system with Redis pub/sub

Resilience Patterns · 6 goals

Implement rate limiting with Redis sliding window
Build circuit breaker for LLM provider calls
Configure retry logic with tenacity
Isolate critical paths with bulkhead semaphores
Build fallback responses for degraded mode
Combine resilience patterns into middleware stack
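
A circuit breaker for provider calls can be sketched as a small state machine: after enough consecutive failures it rejects calls immediately, then allows one trial call once a cool-down has passed. The class names, thresholds, and injectable `clock` below are assumptions for illustration, not the course's implementation.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping provider call")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open spares a degraded provider from extra load and gives callers a quick signal to switch to a fallback response.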

API Gateway & Routing · 6 goals

Build reverse proxy with path-based routing
Implement load balancing across backend instances
Transform requests and responses through the gateway
Aggregate responses from multiple backends
Implement service discovery with health checking
Build gateway authentication and request enrichment

Testing & Documentation · 6 goals

Write async endpoint tests with httpx.AsyncClient
Build database fixtures with transaction rollback
Mock external services for deterministic tests
Implement contract tests for API consumers
Measure test coverage and set quality gates
Generate rich OpenAPI documentation with examples

API Versioning & Evolution · 6 goals

Implement URL-based API versioning with routers
Build header-based version negotiation
Manage deprecation with Sunset and Warning headers
Build request and response adapters for version translation
Detect breaking changes automatically
Generate API changelogs from schema diffs

Deployment & Observability · 6 goals

Build production Docker images with multi-stage builds
Deploy to Kubernetes with health check probes
Instrument endpoints with Prometheus metrics
Implement distributed tracing with OpenTelemetry
Build structured logging with correlation IDs
Create Grafana dashboards for API monitoring

Data Infrastructure for GenAI · 60 goals

PostgreSQL & pgvector · 6 goals

Install pgvector and create vector-enabled tables
Build HNSW and IVFFlat indexes for fast similarity search
Perform similarity search with distance operators
Build hybrid search combining vectors and metadata
Integrate pgvector with SQLAlchemy ORM
Build a semantic search API endpoint

Advanced PostgreSQL · 6 goals

Partition tables by range for time-series AI data
Query and index JSONB for AI metadata
Build full-text search for document retrieval
Optimize queries with EXPLAIN ANALYZE
Implement connection pooling with PgBouncer
Build a database migration strategy for AI schemas

Redis for Caching & Sessions · 6 goals

Implement cache-aside pattern for LLM responses
Build session storage for multi-turn conversations
Use Redis pub/sub for real-time event broadcasting
Implement distributed locks for concurrent operations
Build rate limiting with Redis sorted sets
Monitor Redis performance and memory

MinIO Object Storage · 6 goals

Create MinIO buckets and configure access policies
Upload and download with presigned URLs
Implement versioned storage for datasets
Build multipart uploads for large model files
Build content-addressable storage for embeddings
Monitor MinIO health and storage metrics

Kafka Fundamentals · 6 goals

Create topics with partition and replication strategies
Produce messages with key-based partitioning
Consume messages with consumer groups
Implement message serialization with schemas
Handle delivery guarantees and idempotency
Monitor Kafka with consumer lag metrics

Event-Driven Architectures · 6 goals

Build an event-driven inference pipeline
Implement event sourcing for prediction audit trails
Design dead letter queues for failed processing
Build stream processing for real-time enrichment
Implement the saga pattern for multi-step workflows
Implement event replay for reprocessing

Neo4j Graph Database · 6 goals

Model a knowledge graph with nodes and relationships
Write Cypher queries for graph traversal
Build GraphRAG pipeline combining graph and LLM
Integrate Neo4j with Python using async driver
Build a knowledge graph update pipeline
Monitor Neo4j performance and queries

Data Pipeline Orchestration · 6 goals

Define Argo Workflow templates for data processing
Build DAG workflows with parallel execution
Pass artifacts between workflow steps
Implement retry and error handling strategies
Build reusable WorkflowTemplates
Schedule pipelines with CronWorkflows

Data Quality & Validation · 6 goals

Build schema validation for AI datasets
Implement embedding quality checks
Create data profiling reports
Build Great Expectations validation suites
Implement data quality gates in pipelines
Monitor data quality metrics over time

Data Infrastructure Operations · 6 goals

Deploy data services on Kubernetes with StatefulSets
Configure Prometheus monitoring for data services
Implement automated PostgreSQL backup and restore
Build MinIO backup and replication
Build auto-scaling for data services
Build operational runbooks and incident response

DevOps Foundations for GenAI · 48 goals

Git Workflows for AI Teams · 6 goals

Implement trunk-based development for AI projects
Configure branch protection and status checks
Build PR templates for prompt and model changes
Manage merge conflicts in AI data files
Implement pre-commit hooks and automated dependency updates
Version AI artifacts with Git tags and releases

CI Pipelines with GitHub Actions · 6 goals

Compare CI platforms: GitHub Actions vs Tekton vs Dagger
Build AI artifact validation CI checks
Implement matrix builds for multi-provider testing
Build reusable workflow templates
Configure CI to run on GKE self-hosted runners
Optimize CI pipeline performance

Container Image CI/CD · 6 goals

Build optimized Docker images for AI applications
Automate image builds with GitHub Actions
Scan images with Trivy, generate SBOMs with Syft, and assess SLSA level
Manage image lifecycle and garbage collection
Sign images with Cosign and enforce Binary Authorization on GKE
Build multi-architecture images for GKE

ArgoCD & GitOps · 6 goals

Install ArgoCD and deploy first application
Configure sync policies for automated deployment
Detect and resolve configuration drift
Compare GitOps controllers: ArgoCD ApplicationSet vs Flux CD
Implement sync waves and hooks for ordered deployment
Implement ArgoCD RBAC and multi-tenancy

Infrastructure as Code · 6 goals

Build Helm charts with Skaffold local development workflow
Use Kustomize overlays for environment management
Enforce policies with OPA Gatekeeper and test with Conftest
Test Helm charts before deployment
Implement Helmfile for multi-chart orchestration
Build chart versioning and release pipeline

Secrets Management · 6 goals

Store hosted LLM API keys in Google Secret Manager
Deploy External Secrets Operator for auto-sync
Implement automatic secret rotation
Prevent secrets from leaking in logs and manifests
Secure environment variable injection patterns
Audit and monitor secret access

Deployment Strategies · 6 goals

Configure rolling updates with health checks
Implement blue-green and progressive delivery with Argo Rollouts
Automate canary deployments with Flagger
Implement automated rollback on SLO breach
Configure PodDisruptionBudgets for availability
Build deployment dashboards and change tracking

DevOps for AI Artifacts · 6 goals

Version prompt templates as code
Build evaluation CI for prompt changes
Implement evaluation-gated deployment
Deploy guardrails as versioned configuration
Build A/B testing infrastructure for prompts
Implement model configuration drift detection

GenAI Operations · 300 goals

GenAI Failure Catalog · 6 goals

Classify GenAI failures into five categories: provider, quality, cost, security, and data failures
Instrument a multi-provider LLM gateway to detect each failure category
Build typed failure event models that feed into alerting and incident management
Measure baseline failure rates across OpenAI, Anthropic, and Google providers
Implement failure prediction from leading indicators
Create failure impact assessment system

GenAI SLI Framework · 6 goals

Define latency SLIs: TTFT, tokens-per-second, end-to-end response time across providers
Define quality SLIs: faithfulness, hallucination rate, format compliance, retrieval precision
Define cost SLIs: cost-per-request, cost-per-token, cache hit rate, budget burn rate
Instrument all SLIs with Prometheus metrics and Langfuse traces
Build SLI aggregation and reporting API
Implement SLI validation and testing

GenAI SLO Engine · 6 goals

Define SLO targets for latency, quality, and cost SLIs with business-justified thresholds
Compute error budgets and track consumption over rolling windows
Build multi-window burn-rate alerts that detect SLO violations before budget exhaustion
Create SLO status dashboards showing budget remaining and projected exhaustion
Implement SLO negotiation framework
Build cross-SLO dependency tracking
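
The error-budget and multi-window burn-rate goals rest on simple arithmetic: the budget is 1 minus the SLO target, and the burn rate is the observed error rate divided by that budget. The sketch below pages only when both a long and a short window burn fast. The function names are illustrative, and the default threshold of 14.4 is a value often quoted for a 1-hour/5-minute window pair against a 30-day SLO; treat it as an assumption to tune.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows, given as (errors, total), burn at or above the threshold.

    The long window shows the problem is significant; the short window shows it
    is still happening, so recovered incidents stop alerting quickly.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


if __name__ == "__main__":
    # With a 99% target, a 2% error rate consumes budget about twice as fast as allowed.
    print(burn_rate(errors=20, total=1000, slo_target=0.99))
```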

GenAI Toil Analyzer · 6 goals

Identify GenAI-specific toil patterns: manual model updates, prompt tweaking, provider failover, cache invalidation
Measure toil using time-tracking instrumentation and classify by automation potential
Automate the highest-impact toil items with operational scripts and scheduled workflows
Track toil reduction over time with team-level reporting
Build automation testing and validation
Create toil reduction roadmap generator

GenAI Launch Readiness · 6 goals

Define operational readiness criteria specific to GenAI services
Build automated readiness checks that verify infrastructure, monitoring, and runbook completeness
Implement launch gate enforcement that blocks deployment without readiness sign-off
Create readiness dashboards and historical tracking for continuous improvement
Implement progressive readiness rollout
Create readiness automation toolkit

GenAI Runbook Engine · 6 goals

Write structured runbooks for the top 5 GenAI failure modes identified in Ch 1
Build executable runbook steps that link to operational APIs and scripts
Implement runbook testing that validates each step works as documented
Track runbook usage and effectiveness metrics
Build runbook recommendation engine
Implement cross-runbook orchestration

GenAI Helm Charts · 6 goals

Package LiteLLM, Langfuse, Prometheus, and Grafana as Helm Charts with Production-Ready Defaults
Create values files with environment-specific overrides for dev, staging, and production
Implement Helm chart testing and linting with ct and helm unittest
Build chart dependency management for the full GenAI stack
Implement Helm chart documentation generation
Build chart rollback and recovery procedures

Dev/Staging/Prod Environments · 6 goals

Create K8s namespaces for dev/staging/prod with resource quotas and LimitRanges
Build Kustomize overlays for environment-specific configuration
Implement environment parity verification that detects configuration drift
Deploy the full GenAI stack to all three environments
Implement environment configuration validation
Create environment lifecycle automation

GitOps Control Plane · 6 goals

Deploy Argo CD to vCluster with Application and ApplicationSet CRDs
Implement GitOps sync for all GenAI components across dev/staging/prod
Configure drift detection and auto-remediation for configuration consistency
Build Argo CD RBAC for team-based access control
Implement performance optimization for GitOps with Argo CD
Build operational documentation for GitOps with Argo CD

GenAI CI/CD Pipelines · 6 goals

Build Argo Workflow templates for GenAI artifact CI/CD
Implement pipeline stages: lint, validate, eval, promote
Create artifact-specific pipelines for prompts, models, and RAG configs
Monitor pipeline health with observability metrics
Implement performance optimization for CI/CD pipelines for GenAI artifacts
Build operational documentation for CI/CD pipelines for GenAI artifacts

GenAI Secret Manager · 6 goals

Deploy External Secrets Operator with GCP Secret Manager provider
Implement per-environment provider key isolation with
Build secret sync monitoring and alerting for rotation compliance
Create emergency secret rotation procedures with zero-downtime key swap
Implement performance optimization for secret management for GenAI
Build operational documentation for secret management for GenAI

Self-Service Environment API · 6 goals

Build API for on-demand feature environment provisioning with full GenAI stack
Implement environment lifecycle management with TTL and auto-cleanup
Create environment templates with pre-configured GenAI stack components
Monitor environment usage and resource consumption across all feature environments
Implement performance optimization for developer self-service environments
Build operational documentation for developer self-service environments

Pipeline Health Dashboard · 6 goals

Instrument Argo Workflow metrics for comprehensive pipeline observability
Build promotion velocity tracking across dev, staging, and production environments
Implement pipeline failure analysis with root cause categorization
Create pipeline health dashboards with bottleneck detection and DORA metrics
Implement performance optimization for pipeline observability
Build operational documentation for pipeline observability

Prompt Registry · 6 goals

Build Immutable Prompt Storage with Content-Addressable Versioning
Implement Promotion Gates Between Dev, Staging, and Production
Create Prompt Diff and Review Workflow
Track Prompt Lineage Across All Deployments
Implement performance optimization for immutable prompt registry
Build operational documentation for immutable prompt registry
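
Content-addressable versioning means a prompt's version id is derived from its content, so identical content always maps to the same id and stored entries never change. A minimal in-memory sketch follows; the `PromptRegistry` class, its fields, and the choice of SHA-256 over canonical JSON are assumptions for illustration.

```python
import hashlib
import json


class PromptRegistry:
    """In-memory, append-only store keyed by the SHA-256 of each prompt's content."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def put(self, template: str, model: str) -> str:
        """Store a prompt and return its content address (hex digest)."""
        # sort_keys gives a canonical byte form, so equal content hashes equally.
        payload = json.dumps({"template": template, "model": model}, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        self._store.setdefault(digest, payload)  # never overwrite: entries are immutable
        return digest

    def get(self, digest: str) -> dict:
        """Return the stored prompt record for a content address."""
        return json.loads(self._store[digest])


if __name__ == "__main__":
    registry = PromptRegistry()
    version = registry.put("Summarize: {text}", model="example-model")  # hypothetical model name
    print(version[:12], registry.get(version))
```

Because the id is a pure function of the content, promoting a prompt between environments can be expressed as moving a pointer to a digest rather than copying text.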

Model Lifecycle Manager · 6 goals

Deploy MLflow on vCluster for Model Registry and Experiment Tracking
Implement Model Versioning with Stage Promotion Gates
Build Model Deprecation Workflow with Consumer Notification
Track Experiments with Cost and Quality Metrics for Data-Driven Model Selection
Implement performance optimization for model registry and lifecycle
Build operational documentation for model registry and lifecycle

Progressive Delivery Engine · 6 goals

Deploy Argo Rollouts with Canary Strategy for LiteLLM Model Config Changes
Implement Shadow Deployments for Risk-Free Model Comparison in Production
Build Automated Rollback on Quality Regression During Canary Progression
Monitor Canary vs Baseline Quality, Latency, and Cost Metrics in Real-Time
Implement performance optimization for canary and shadow deployments
Build operational documentation for canary and shadow deployments

Eval Gate Pipeline · 6 goals

Build Promptfoo Eval Suites for Pre-Promotion Quality Verification
Implement Eval Gates in Argo Workflows That Block Promotion on Failure
Create Golden Test Sets for Regression Detection
Track Eval Pass Rates and Gate Effectiveness Metrics
Implement performance optimization for automated eval gates
Build operational documentation for automated eval gates

AI Feature Flag System · 6 goals

Build a Feature Flag Service for AI Configuration with Redis-Backed Storage
Implement Percentage-Based Rollouts for Model and Prompt Changes
Create Kill Switches for Rapid AI Behavior Reversion During Incidents
Track Feature Flag Impact on Quality and Cost Metrics
Implement performance optimization for feature flags for AI behaviors
Build operational documentation for feature flags for AI behaviors
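
Percentage-based rollouts are commonly implemented by hashing a stable identifier into a bucket from 0 to 99 and enabling the flag when the bucket falls below the rollout percentage. The sketch below shows that idea without the Redis-backed storage the chapter builds; the function name `in_rollout` and the flag and user names are hypothetical.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically decide whether user_id is inside the first `percent` of buckets."""
    # Hashing flag and user together gives each flag its own independent bucketing.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    enabled = sum(in_rollout("new-prompt-v2", f"user-{i}", 20) for i in range(10_000))
    print(f"{enabled} of 10000 users see the new prompt")
```

Because a user's bucket never changes, raising the percentage only adds users; nobody who already has the new behavior loses it mid-rollout.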

RAG Release Pipeline · 6 goals

Build a Release Pipeline for Embedding Model Swaps with Dual-Index Strategy
Implement Chunking Strategy Changes with A/B Comparison Using RAGAS Metrics
Create an Index Migration Workflow with Zero-Downtime Cutover
Validate RAG Releases with Retrieval Quality Metrics Before and After
Implement performance optimization for RAG pipeline release management
Build operational documentation for RAG pipeline release management

Distributed LLM Tracer · 6 goals

Deploy an OpenTelemetry Collector with Langfuse Exporter
Instrument Multi-Provider Request Chains with Parent-Child Trace Spans
Build Trace Correlation Across RAG Retrieval, LLM Inference, and Guardrail Processing
Create Trace-Based Latency Analysis Dashboards with Drill-Down Capability
Implement performance optimization for end-to-end LLM tracing
Build operational documentation for end-to-end LLM tracing

Quality Drift Detector · 6 goals

Implement Output Quality Drift Detection with Rolling Window Comparison
Build Embedding Drift Detection Using Distribution Divergence Metrics
Detect Retrieval Relevance Degradation with RAGAS-Based Monitoring
Configure Automated Alerts for Each Drift Type with Severity Classification
Implement performance optimization for quality drift detection
Build operational documentation for quality drift detection

GenAI Alert System · 6 goals

Configure Alertmanager with GenAI-Specific Routing Rules and Severity Classification
Deploy Grafana OnCall for On-Call Schedules, Escalation Policies, and Incident Lifecycle
Implement Alert Deduplication and Grouping for Noisy GenAI Metrics
Build Alert Effectiveness Tracking to Reduce Alert Fatigue
Implement performance optimization for alerting strategy
Build operational documentation for alerting strategy

GenAI Dashboard Suite · 6 goals

Build operational dashboard with SLO status, active incidents, and system health overview
Create business dashboard with usage, cost, and adoption metrics for stakeholders
Build compliance dashboard with guardrail activity, audit coverage, and policy status
Implement dashboard-as-code with Grafana provisioning for version-controlled dashboards
Implement performance optimization for dashboard engineering
Build operational documentation for dashboard engineering

Provider SLA Tracker (6 goals)

Implement per-provider availability tracking with synthetic probes
Build provider degradation detection using quality and latency SLIs
Create automated escalation chains for provider issues with status page integration
Track provider SLA compliance for vendor management and contract negotiation
Implement performance optimization for provider SLA monitoring
Build operational documentation for provider SLA monitoring

Cross-Env Comparator (6 goals)

Implement cross-environment metric comparison for quality regression detection
Build staging-to-prod quality correlation analysis for deployment confidence
Create environment drift detection for configuration parity monitoring
Monitor promotion impact by comparing pre/post metrics across environments
Implement performance optimization for cross-environment observability
Build operational documentation for cross-environment observability

AI Incident Commander (6 goals)

Define LLM-specific incident severity classification with impact-based criteria
Build incident lifecycle management with role assignments and status tracking
Create communication templates for AI-specific incidents targeting different audiences
Track incident metrics with MTTD, MTTA, MTTR and trend analysis
Implement performance optimization for LLM incident response framework
Build operational documentation for LLM incident response framework

Runbook Automation Engine (6 goals)

Build alert-to-runbook routing that triggers automated remediation workflows
Implement human approval gates for high-impact remediation steps
Create runbook execution auditing with step-by-step logging and outcome tracking
Track automation coverage and success rates across all runbook types
Implement performance optimization for automated runbook execution
Build operational documentation for automated runbook execution

AI Post-Mortem Engine (6 goals)

Build structured post-mortem templates for GenAI failure modes
Implement timeline reconstruction from Langfuse traces and Prometheus metrics
Create action item tracking with follow-through verification
Analyze post-mortem trends to identify systemic issues
Implement performance optimization for AI failure post-mortems
Build operational documentation for AI failure post-mortems

LLM Chaos Lab (6 goals)

Deploy Chaos Mesh in vCluster
Build Provider Failover Chaos Experiments
Create Cache Invalidation Chaos Experiments
Implement Quality Degradation Injection
Implement performance optimization for chaos engineering for LLM providers
Build operational documentation for chaos engineering for LLM providers

Pipeline Chaos Experiments (6 goals)

Create Embedding Pipeline Chaos Experiments
Build Ingestion Interruption Tests
Implement Index Corruption Detection and Recovery Validation
Track Chaos Experiment Results and Improvement Trends
Implement performance optimization for pipeline failure chaos
Build operational documentation for pipeline failure chaos

GenAI Game Day (6 goals)

Design multi-failure game day scenarios for GenAI platforms
Build game day orchestration that chains chaos experiments with time delays
Implement game day scoring: response time, runbook adherence, communication quality
Create game day retrospectives with improvement tracking
Implement performance optimization for game day operations
Build operational documentation for game day operations

Cost Attribution Engine (6 goals)

Instrument per-request cost tracking across all pipeline stages
Build cost attribution to teams, projects, and use cases
Create cost allocation models for shared infrastructure components
Implement cost anomaly detection with automated investigation
Implement performance optimization for full-stack cost attribution
Build operational documentation for full-stack cost attribution

Token Budget Controller (6 goals)

Configure LiteLLM Virtual Keys with Per-Team Budget Limits
Implement Per-Request Token Limits
Build Budget Alerting at 50%, 80%, and 100% Thresholds with Escalation
Create Budget Override Workflows for Emergency Usage Beyond Limits
Implement performance optimization for token budget enforcement
Build operational documentation for token budget enforcement

Cache Economics Analyzer (6 goals)

Deploy Redis Semantic Cache and Measure Hit Rate vs Cost Savings
Compare Provider Caching Strategies for OpenAI, Anthropic, and Google
Build Cost-Benefit Analysis with Break-Even Calculations
Recommend Optimal Caching Mix Per Use Case
Implement performance optimization for caching ROI analysis
Build operational documentation for caching ROI analysis

Batch API Scheduler (6 goals)

Implement workload classification: real-time vs batch-eligible based on latency requirements
Build Batch API job scheduling with priority queues and SLA tracking
Create batch job monitoring with completion time SLAs and failure handling
Measure and report cost savings from batch routing vs synchronous requests
Implement performance optimization for Batch API optimization
Build operational documentation for Batch API optimization

Capacity Forecaster (6 goals)

Build token demand forecasting using historical usage patterns and trend analysis
Implement embedding volume projection for storage and compute planning
Create cost projection models for budget planning cycles
Track forecast accuracy and improve models over time with feedback loops
Implement performance optimization for capacity forecasting
Build operational documentation for capacity forecasting

FinOps Governance Platform (6 goals)

Build Showback and Chargeback Reports per Team and Project with Full Cost Transparency
Create Executive FinOps Dashboards with Trend Analysis for Leadership
Implement Cost Governance Policies with Automated Enforcement
Generate Monthly FinOps Reviews with Optimization Recommendations
Implement performance optimization for FinOps reporting and governance
Build operational documentation for FinOps reporting and governance

Multi-Tenant GenAI Platform (6 goals)

Automate tenant onboarding with namespace provisioning and secret management
Implement namespace isolation with network policies and resource quotas
Build noisy-neighbor detection that identifies tenants causing resource contention
Create tenant operations dashboards with per-tenant health visibility
Implement performance optimization for multi-tenant platform operations
Build operational documentation for multi-tenant platform operations

AI Developer Platform (6 goals)

Build self-service deployment workflows with approval gates for AI artifacts
Create golden path templates for common GenAI patterns
Implement internal tool marketplace for reusable AI components
Build developer experience metrics and platform analytics
Implement performance optimization for the internal developer platform for AI
Build operational documentation for the internal developer platform for AI

GenAI Ops Maturity Assessor (6 goals)

Define GenAI operational maturity model with five levels across eight capability areas
Build automated maturity assessment that evaluates current operational state
Generate improvement roadmaps with prioritized actions based on assessment results
Track maturity progression over time with milestone tracking
Implement performance optimization for operational maturity model
Build operational documentation for operational maturity model

LLM-Specific Observability (6 goals)

Hallucination Detection Pipeline
Semantic Drift Monitor
Response Quality Scorer
LLM Observability Dashboard
Quality Degradation Root Cause
LLM Observability Capstone

Token Cost FinOps (6 goals)

Cost Tracking Pipeline
Budget Management System
Chargeback Reporting
Cost Optimization Engine
FinOps Drill-Down Dashboard
FinOps Capstone

OpenTelemetry for Agentic Systems (6 goals)

LLM Call Instrumentation
Agent Execution Tracing
Custom Agent Attributes
Langfuse Trace Integration
K8s Distributed Tracing
Tracing Capstone

Eval Gates in CI/CD (6 goals)

Eval Suite Design
CI Pipeline Integration
Eval Result Storage
Change-Type-Specific Gates
Eval Failure Workflow
Eval Gates Capstone

Prompt Versioning and Deployment (6 goals)

Operational Prompt Registry
Canary Prompt Deployment
Automated Prompt Rollback
Prompt Deployment Pipeline
Multi-Prompt Coordination
Prompt Ops Capstone

Guardrail Operations (6 goals)

Guardrail Deployment
Guardrail Tuning
Guardrail Versioning
Effectiveness Monitoring
Guardrail Incident Response
Guardrail Ops Capstone

AI Incident Response (6 goals)

AI Incident Taxonomy
Automated Detection
AI Incident Runbooks
Post-Incident Review
Escalation Path Design
Incident Response Capstone

Multi-Tenant AI Platform Ops (6 goals)

Tenant Isolation
Quota Management
Noisy-Neighbor Detection
Per-Tenant SLA Monitoring
Tenant Self-Service
Multi-Tenant Ops Capstone

AI Governance Compliance Ops (6 goals)

EU AI Act Controls
NIST AI RMF Implementation
SOC2 AI Controls
Compliance Dashboard
Audit Preparation Workflow
Compliance Ops Capstone

Platform Operations Capstone (6 goals)

Platform Integration
Unified Ops Dashboard
Operational Readiness
Operational Lifecycle Demo
Operations Automation
Platform Ops Capstone Report

GenAI Eval Safety Governance (18 goals)

EU AI Act Compliance (6 goals)

Classify AI systems under EU AI Act risk categories
Implement the Feb 2025 AI literacy requirements
Build technical documentation for GPAI compliance
Implement risk management system
Build human oversight mechanisms
Track EU AI Act enforcement timeline compliance

Compliance Frameworks (6 goals)

Implement NIST AI RMF Govern and Map functions
Implement NIST AI RMF Measure and Manage functions
Build ISO 42001 AI management system documentation
Create unified governance dashboard with Credo AI Agent Registry
Implement comprehensive audit trail
Build compliance automation and alerting

End-to-End Eval, Safety & Governance Pipeline (6 goals)

Build evaluation benchmark gate
Build safety testing gate
Build red-team gate with automated adversarial testing
Build compliance evidence gate
Orchestrate the full pipeline with Argo Workflows
Build pipeline dashboard and deploy to production

AI Developer Platform Engineering (120 goals)

Internal Developer Platform Vision (6 goals)

Design service catalog data model and golden path templates
Build platform configuration management service
Build service catalog REST API with search and filtering
Integrate platform with Kubernetes cluster discovery
Build platform health dashboard with Prometheus metrics
Deploy platform control plane with Helm and ArgoCD

Platform API & Service Mesh (6 goals)

Design resource provisioning API with async workflows
Implement service registry with health-checked endpoints
Build inter-service communication with retry and timeout
Integrate service mesh with Kubernetes endpoints
Implement API versioning and backward compatibility
Deploy distributed tracing across platform services

Developer Self-Service Portal (6 goals)

Build resource provisioning request forms and validation
Implement real-time provisioning status with WebSockets
Build team workspace dashboard API
Implement request approval workflows for elevated access
Build portal search and resource discovery
Deploy portal with SSO and session management

LLM Gateway as Platform Service (6 goals)

Deploy LiteLLM gateway with multi-model routing config
Implement per-tenant API key management and rate limits
Build semantic caching layer for LLM responses
Add request logging with PII redaction pipeline
Configure provider fallback and load balancing
Monitor gateway latency and token usage with Prometheus

Model Registry & Catalog (6 goals)

Design model metadata schema with version lineage
Build model registration API with validation gates
Implement benchmark result storage and comparison
Build model approval workflow with compliance checks
Track model usage statistics and deprecation timelines
Deploy registry and integrate with LLM gateway

Multi-Tenant Architecture (6 goals)

Design tenant data model with namespace isolation strategy
Implement K8s namespace provisioning with quota enforcement
Build network policies for tenant traffic isolation
Implement tenant-aware database connection pooling
Build cross-tenant resource sharing with access grants
Deploy multi-tenant infrastructure with Helm overrides

RBAC & Access Control (6 goals)

Design RBAC model with roles, permissions, and scopes
Implement permission checking middleware for APIs
Build role assignment with team hierarchy inheritance
Implement fine-grained resource-level access control
Build audit logging for all permission changes
Deploy RBAC with policy-as-code validation

Resource Quota Management (6 goals)

Design quota model for tokens, compute, and storage
Build real-time quota enforcement middleware
Implement quota reservation for long-running jobs
Build quota usage dashboards with alert thresholds
Build quota adjustment workflows with approval
Integrate quota system with gateway and provisioner

Cost Allocation & Chargeback (6 goals)

Design cost allocation model with per-request pricing
Build cost tracking pipeline from gateway metrics
Implement budget management with spending alerts
Create chargeback reports with department breakdowns
Build cost optimization recommendations engine
Deploy cost dashboards with Grafana

Onboarding Automation (6 goals)

Design onboarding pipeline with ordered step execution
Build namespace provisioner with secrets and configmaps
Implement default resource allocation and gateway access
Build onboarding status tracker with rollback
Build team offboarding with resource cleanup
Deploy onboarding system with ArgoCD integration

Agent Runtime as Platform Service (6 goals)

Design agent execution model with sandboxed pods
Build agent job submission and scheduling API
Implement execution sandboxing with seccomp and AppArmor
Build execution log streaming and artifact capture
Build agent runtime auto-scaling and queue depth metrics
Deploy agent runtime with K8s Job controller

Tool Registry & MCP Hub (6 goals)

Design tool registry model with MCP server metadata
Build tool registration and discovery API
Implement tool versioning with compatibility tracking
Build access control for tool visibility per team
Build tool health monitoring and usage analytics
Deploy MCP hub with Helm and agent integration

Vector DB as Platform Service (6 goals)

Design vector DB provisioning with schema isolation
Build collection management API with CRUD and bulk insert
Implement automated backup and point-in-time recovery
Build index tuning advisor from query patterns
Build usage metering for vector operations
Deploy managed pgvector with Helm StatefulSet

Evaluation Platform Service (6 goals)

Design evaluation suite model with benchmark definitions
Build evaluation job runner with parallel execution
Implement benchmark comparison and regression detection
Create evaluation leaderboards per model and team
Build automated regression testing on model updates
Deploy evaluation platform with Helm and Grafana

Prompt Engineering Workspace (6 goals)

Design prompt registry with version history and metadata
Build prompt editor API with variable interpolation
Implement prompt A/B testing with traffic splitting
Build prompt approval workflow with evaluation gate
Build prompt usage analytics and cost tracking
Deploy prompt workspace integrated with gateway

Platform Monitoring & SLAs (6 goals)

Define SLI/SLO/SLA models for platform services
Build SLO tracking with error budget calculations
Implement composite health scoring per service
Build SLA breach detection and escalation pipeline
Build platform status page with incident tracking
Deploy SLA monitoring with Grafana dashboards

Compliance & Audit (6 goals)

Design immutable audit log with hash-chain verification
Build compliance rule engine with declarative policies
Implement data residency enforcement per tenant
Build compliance report generation API
Build real-time compliance violation alerting
Deploy compliance system with retention policies

Platform Change Management (6 goals)

Design change request model with impact classification
Build change approval workflow with reviewer gates
Implement blast radius analysis for platform changes
Build automated rollback with pre-change snapshots
Build change calendar with freeze window enforcement
Deploy change management with ArgoCD hooks

Platform Analytics & Adoption (6 goals)

Design adoption metrics with activation funnels
Build developer journey analytics from onboarding to prod
Implement feature usage heatmaps across services
Build developer satisfaction survey and NPS tracking
Build executive analytics dashboard with platform KPIs
Deploy analytics pipeline with Grafana dashboards

AI Platform Capstone (6 goals)

Integrate platform services into unified control plane API
Build end-to-end onboarding through first LLM call
Implement cross-service dependency health monitor
Build unified governance dashboard with compliance status
Build platform disaster recovery and failover
Deploy complete platform with Helm umbrella chart

What you'll ship in production

Build the internal GenAI platform

Design multi-tenant infrastructure

Implement CI/CD pipelines

Manage data infrastructure

Build autoscaling for GenAI workloads

Provision infrastructure-as-code

Implement full-stack observability

Operate LLM gateways

Curriculum