GenAI Platform Engineering

Build internal GenAI developer platforms with self-service capabilities, multi-tenancy, RBAC, and CI/CD for model, prompt, and guardrail pipelines.

12 skill groups · 9 courses · 795 goals · ~339 hrs

Verifiable skill graph

12 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

01
Self-Service Developer Platform & Golden Paths (IDP)

Build the paved road: a self-service internal developer platform — portals, service catalogs, templates and scaffolding, and self-service provisioning so other engineers ship AI without filing tickets.

02
Multi-Tenancy, Isolation & RBAC

Make one platform safely serve many teams: multi-tenant architecture, namespace and resource isolation, noisy-neighbor prevention, per-tenant limits, SSO/identity, RBAC, and secrets-as-a-service.

03
Self-Service Deployment Control Plane

Build the deployment control plane others consume: a self-service paved road for promoting models, prompts, and guardrails, with registry/versioning and GitOps offered as a platform capability — not operating one team's pipeline.

04
Multi-Tenant Kubernetes & Control-Plane Engineering

Engineer the platform layer on Kubernetes: operators, CRDs and controllers, the control-plane backend behind self-service, multi-cluster, GPU/accelerator scheduling, and platform IaC (Crossplane/Terraform).

05
Model-Serving & Inference-Endpoints (Platform Service)

Offer model serving as a platform product: provision inference endpoints and model-serving-as-a-service, internal model gateways, and autoscaling serving infra that tenants request on demand.

06
Multi-Tenant Observability (Platform Service)

Give every tenant observability out of the box: per-tenant dashboards, traces, and metrics provided by the platform, a shared stack that teams plug into, and SLOs for both the tenant fleet and the platform itself.

07
Platform Governance, Policy-as-Code & Audit

Enforce the rules platform-wide: policy-as-code and admission policies (OPA/Kyverno), an audit engine every tenant inherits, and governance gates baked into the control plane — preventive, not operational compliance.

08
Multi-Tenant Quota & Chargeback

Run the platform's economics: per-tenant cost attribution, chargeback/showback, cross-tenant quota enforcement, and capacity planning — billing and bounding tenants, not reducing model spend.

09
Eval & Benchmark Infrastructure (Platform Service)

Host eval as a self-serve capability: shared eval/benchmark harnesses, golden-set storage, and gate scheduling that tenants plug their own metrics into — the infrastructure, not the eval science.

10
Data & Vector Infra (Platform Service)

Provision data infra as a platform offering: managed vector stores and data-infra golden paths tenants self-serve — not building the corpus/embedding pipelines themselves.

11
Hosted LLM API Integration

Baseline provider access in platform code: LLM/embedding SDK calls, auth, and retries.

12
Python for Platform Engineering

Production Python for platform code: async, typing, Pydantic, and error handling.
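
As a rough illustration of what groups 11 and 12 describe (provider calls with retries, written as typed async Python with explicit error handling), here is a minimal sketch using only the standard library. `call_with_retries` and `TransientProviderError` are hypothetical names invented for this example; a real provider SDK raises its own exception types, and the delay values are assumptions rather than recommendations.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientProviderError(Exception):
    """Hypothetical stand-in for a retryable provider failure (e.g. HTTP 429 or 503)."""


async def call_with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Await `call`, retrying transient failures with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = 0

    async def flaky_completion() -> str:
        """Fake provider call that fails twice, then succeeds."""
        global attempts
        attempts += 1
        if attempts < 3:
            raise TransientProviderError("rate limited")
        return "ok"

    print(asyncio.run(call_with_retries(flaky_completion, base_delay=0.01)))
```

Only errors of the designated transient type are retried; anything else propagates immediately, which keeps authentication or validation failures from being retried pointlessly.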

What you'll ship in production

Core responsibilities this discipline prepares you for.

  1. Build the internal GenAI platform, enabling developers to deploy LLM applications self-service

    • Design platform APIs with golden path templates and self-service provisioning workflows
    • Build developer portals with pre-approved LLM configurations, guardrails, and monitoring included
    • Wire end-to-end self-service: from app registration to deployed inference endpoint with observability
  2. Design multi-tenant infrastructure with namespace isolation and RBAC

    • Implement Kubernetes namespace isolation with RBAC policies and resource quotas per tenant
    • Automate tenant provisioning with network policies and admission controllers
    • Validate tenant isolation by enforcing resource limits under concurrent multi-team workloads
  3. Implement CI/CD pipelines with GitOps for GenAI applications

    • Set up ArgoCD GitOps for declarative deployment from Git push to production rollout
    • Build GitHub Actions workflows with act for local CI and Helm chart packaging
    • Wire complete GitOps pipelines with Kustomize overlays for dev/staging/production environments
  4. Manage data infrastructure: databases, caches, and message queues on K8s

    • Deploy PostgreSQL + pgvector, Redis, Kafka, Neo4j, and MinIO as Kubernetes-native services
    • Configure backup/restore, horizontal scaling, and monitoring for each data component
    • Benchmark throughput and failover behavior for each infrastructure component under load
  5. Build autoscaling for GenAI workloads using event-driven scaling and batch job queuing

    • Configure KEDA for event-driven pod autoscaling based on queue depth, HTTP rate, and custom metrics
    • Set up Kueue for Kubernetes-native batch job scheduling with priorities and fair quotas
    • Validate autoscaling policies under burst GenAI workloads with realistic traffic patterns
  6. Provision infrastructure-as-code using K8s-native tooling

    • Declare infrastructure as Kubernetes custom resources with Crossplane providers
    • Manage databases, storage, and networking declaratively through kubectl apply
    • Verify reconciliation behavior by modifying infrastructure state and observing self-healing
  7. Implement full-stack observability across the GenAI platform

    • Build unified observability with Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing
    • Add Logfire for Python application tracing and Langfuse for LLM-specific cost and quality monitoring
    • Wire a unified observability stack spanning infrastructure, application, and LLM inference layers
  8. Operate LLM gateways as platform infrastructure

    • Manage LiteLLM gateway operations: API key lifecycle, per-team cost tracking, and provider health
    • Handle model version migration and zero-downtime provider switching
    • Operate a production gateway serving multiple internal teams with isolated quotas and routing
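
To make the tenant-provisioning responsibility above more concrete (namespace isolation, per-tenant resource quotas, and network policies), the following is a minimal sketch that builds the corresponding Kubernetes manifests as plain Python dictionaries. `TenantSpec`, `tenant_manifests`, the `tenant-` namespace prefix, the label key, and the default quota values are all assumptions made for illustration; the manifest fields themselves follow the standard `Namespace`, `ResourceQuota`, and `NetworkPolicy` schemas.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantSpec:
    """Hypothetical description of one tenant; field names are illustrative."""
    name: str
    cpu_requests: str = "8"
    memory_requests: str = "32Gi"
    max_pods: int = 50
    gpus: int = 0


def tenant_manifests(tenant: TenantSpec) -> list[dict]:
    """Build Namespace, ResourceQuota, and default-deny NetworkPolicy manifests."""
    namespace = f"tenant-{tenant.name}"  # assumed naming convention
    labels = {"platform.example.com/tenant": tenant.name}  # assumed label key
    hard = {
        "requests.cpu": tenant.cpu_requests,
        "requests.memory": tenant.memory_requests,
        "pods": str(tenant.max_pods),
    }
    if tenant.gpus:
        # Extended resources are limited through the "requests." prefix.
        hard["requests.nvidia.com/gpu"] = str(tenant.gpus)
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": namespace, "labels": labels}},
        {"apiVersion": "v1", "kind": "ResourceQuota",
         "metadata": {"name": "tenant-quota", "namespace": namespace},
         "spec": {"hard": hard}},
        # An empty podSelector with no ingress rules denies all inbound traffic.
        {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
         "metadata": {"name": "default-deny-ingress", "namespace": namespace},
         "spec": {"podSelector": {}, "policyTypes": ["Ingress"]}},
    ]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": tenant_manifests(TenantSpec(name="search", gpus=2))}
    print(json.dumps(bundle, indent=2))
```

The printed JSON could be piped into `kubectl apply -f -`; in a real platform the same objects would more likely be produced by a controller or a GitOps template than by a script.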

Curriculum

9 courses · each builds on previous goals
