GenAI Platform Engineering
Build internal GenAI developer platforms with self-service capabilities, multi-tenancy, RBAC, and CI/CD for model, prompt, and guardrail pipelines.
Verifiable skill graph
12 skill groups · each becomes a signed node on your graph.
Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.
Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.
Build the paved road: a self-service internal developer platform — portals, service catalogs, templates and scaffolding, and self-service provisioning so other engineers ship AI without filing tickets.
Make one platform safely serve many teams: multi-tenant architecture, namespace and resource isolation, noisy-neighbor prevention, per-tenant limits, SSO/identity, RBAC, and secrets-as-a-service.
Build the deployment control plane others consume: a self-service paved road for promoting models, prompts, and guardrails, with registry/versioning and GitOps offered as a platform capability — not operating one team's pipeline.
Engineer the platform layer on Kubernetes: operators, CRDs and controllers, the control-plane backend behind self-service, multi-cluster, GPU/accelerator scheduling, and platform IaC (Crossplane/Terraform).
Offer model serving as a platform product: provision inference endpoints and model-serving-as-a-service, internal model gateways, and autoscaling serving infra that tenants request on demand.
Give every tenant observability out of the box: per-tenant dashboards, traces, and metrics the platform provides, the shared stack teams plug into, and fleet-wide plus platform-self SLOs.
Enforce the rules platform-wide: policy-as-code and admission policies (OPA/Kyverno), an audit engine every tenant inherits, and governance gates baked into the control plane — preventive, not operational compliance.
Run the platform's economics: per-tenant cost attribution, chargeback/showback, cross-tenant quota enforcement, and capacity planning — billing and bounding tenants, not reducing model spend.
Host eval as a self-serve capability: shared eval/benchmark harnesses, golden-set storage, and gate scheduling that tenants plug their own metrics into — the infrastructure, not the eval science.
Provision data infra as a platform offering: managed vector stores and data-infra golden paths tenants self-serve — not building the corpus/embedding pipelines themselves.
Baseline provider access in platform code: LLM/embedding SDK calls, auth, and retries.
Production Python for platform code: async, typing, Pydantic, and error handling.
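As a rough illustration of the last two skill groups (provider calls with retries, written as typed async Python), the sketch below retries a transient failure with exponential backoff and jitter. It uses only the standard library. `TransientProviderError` and `call_with_retries` are hypothetical names introduced here, not part of any provider SDK; real SDKs raise their own error types and often ship their own retry settings.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientProviderError(Exception):
    """Hypothetical stand-in for a retryable provider failure (e.g. HTTP 429 or 503)."""


async def call_with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Await `call`, retrying transient failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Backoff doubles each attempt, is capped at max_delay,
            # and is randomized (full jitter) to avoid synchronized retries.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A caller would wrap the actual SDK coroutine in a zero-argument function, for example `await call_with_retries(lambda: client_call(prompt))`, where `client_call` is whatever async function the platform's provider wrapper exposes.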
What you'll ship in production
Core responsibilities this discipline prepares you for.
1. Build the internal GenAI platform, enabling developers to deploy LLM applications self-service
- Design platform APIs with golden path templates and self-service provisioning workflows
- Build developer portals with pre-approved LLM configurations, guardrails, and monitoring included
- Wire end-to-end self-service: from app registration to deployed inference endpoint with observability
2. Design multi-tenant infrastructure with namespace isolation and RBAC
- Implement Kubernetes namespace isolation with RBAC policies and resource quotas per tenant
- Automate tenant provisioning with network policies and admission controllers
- Validate tenant isolation by enforcing resource limits under concurrent multi-team workloads
3. Implement CI/CD pipelines with GitOps for GenAI applications
- Set up ArgoCD GitOps for declarative deployment from Git push to production rollout
- Build GitHub Actions workflows for Helm chart packaging, using act to run CI locally
- Wire complete GitOps pipelines with Kustomize overlays for dev/staging/production environments
4. Manage data infrastructure: databases, caches, and message queues on K8s
- Deploy PostgreSQL + pgvector, Redis, Kafka, Neo4j, and MinIO as Kubernetes-native services
- Configure backup/restore, horizontal scaling, and monitoring for each data component
- Benchmark throughput and failover behavior for each infrastructure component under load
5. Build autoscaling for GenAI workloads using event-driven scaling and batch job queuing
- Configure KEDA for event-driven pod autoscaling based on queue depth, HTTP rate, and custom metrics
- Set up Kueue for Kubernetes-native batch job scheduling with priorities and fair quotas
- Validate autoscaling policies under burst GenAI workloads with realistic traffic patterns
6. Provision infrastructure-as-code using K8s-native tooling
- Declare infrastructure as Kubernetes custom resources with Crossplane providers
- Manage databases, storage, and networking declaratively through kubectl apply
- Verify reconciliation behavior by modifying infrastructure state and observing self-healing
7. Implement full-stack observability across the GenAI platform
- Build unified observability with Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing
- Add Logfire for Python application tracing and Langfuse for LLM-specific cost and quality monitoring
- Wire a unified observability stack spanning infrastructure, application, and LLM inference layers
8. Operate LLM gateways as platform infrastructure
- Manage LiteLLM gateway operations: API key lifecycle, per-team cost tracking, and provider health
- Handle model version migration and zero-downtime provider switching
- Operate a production gateway serving multiple internal teams with isolated quotas and routing
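To make the tenant-provisioning responsibility above (namespace isolation, RBAC, and per-tenant resource quotas) more concrete, here is a minimal sketch that renders the Kubernetes objects for one tenant as plain dictionaries. The manifest shapes follow the standard `Namespace`, `ResourceQuota`, and `RoleBinding` kinds and the built-in `edit` ClusterRole. Everything else is an assumption for illustration: the `TenantSpec` fields, the `tenant-` and `team-` naming scheme, the `platform.example.com/tenant` label, and the GPU quota key, which presumes the NVIDIA device plugin exposes `nvidia.com/gpu`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TenantSpec:
    """Hypothetical tenant request; field names and defaults are illustrative."""
    name: str
    cpu: str = "8"
    memory: str = "32Gi"
    gpus: int = 0


def tenant_manifests(tenant: TenantSpec) -> list[dict[str, Any]]:
    """Render the Namespace, ResourceQuota, and RoleBinding for one tenant."""
    namespace = f"tenant-{tenant.name}"  # assumed naming convention
    rbac_group = "rbac.authorization.k8s.io"
    return [
        {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {
                "name": namespace,
                "labels": {"platform.example.com/tenant": tenant.name},
            },
        },
        {
            # Caps the total resources all pods in the namespace may request.
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": "tenant-quota", "namespace": namespace},
            "spec": {
                "hard": {
                    "requests.cpu": tenant.cpu,
                    "requests.memory": tenant.memory,
                    "requests.nvidia.com/gpu": str(tenant.gpus),
                }
            },
        },
        {
            # Grants the tenant's identity-provider group edit rights
            # inside its own namespace only.
            "apiVersion": f"{rbac_group}/v1",
            "kind": "RoleBinding",
            "metadata": {"name": "tenant-edit", "namespace": namespace},
            "subjects": [
                {"kind": "Group", "name": f"team-{tenant.name}", "apiGroup": rbac_group}
            ],
            "roleRef": {"kind": "ClusterRole", "name": "edit", "apiGroup": rbac_group},
        },
    ]
```

The returned dictionaries can be serialized to YAML or JSON and applied by whatever the platform uses for rollout (a GitOps repository, a controller, or `kubectl apply`). Network policies and admission rules would be further objects in the same list.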
Curriculum
9 courses · each builds on previous goals