Learning Goals

  1. Map organizational AI workloads to platform service requirements by conducting a systematic discovery process that translates business-level AI initiatives into concrete infrastructure, compute, and service demands that your internal developer platform must satisfy.

    • Classify AI workloads across a maturity spectrum—from experimental notebooks and batch inference jobs to real-time serving endpoints and fine-tuning pipelines—so that each workload type maps to a distinct resource profile covering GPU allocation, memory quotas, storage class, and network egress policies.
    • Build a workload intake framework that captures critical metadata for every AI project entering the platform, including model framework (PyTorch, TensorFlow, JAX), expected inference latency SLA, data residency constraints, and upstream data pipeline dependencies, ensuring the platform team can provision environments without back-and-forth with application developers.
    • Construct a service requirements matrix that cross-references workload categories against platform capabilities such as secret management, model registry integration, feature store connectivity, and observability stack compatibility, producing a gap analysis that directly feeds your platform roadmap and prioritization decisions.
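The cross-referencing step above can be sketched in a few lines of Python. The workload categories, capability names, and current platform offerings below are illustrative assumptions, not a prescribed taxonomy; the point is that a gap analysis falls out of a simple set difference per category.

```python
# Sketch of a service requirements matrix: workload categories cross-referenced
# against platform capabilities, yielding a gap list that feeds the roadmap.
# All category and capability names here are illustrative assumptions.

# Capabilities each workload category requires (assumed examples).
REQUIREMENTS: dict[str, set[str]] = {
    "batch-training":   {"gpu-pool", "model-registry", "secret-management"},
    "online-inference": {"gpu-pool", "model-registry", "observability", "autoscaling"},
    "rag-retrieval":    {"feature-store", "observability", "secret-management"},
}

# Capabilities the platform currently offers (assumed).
PLATFORM_CAPABILITIES: set[str] = {"gpu-pool", "model-registry", "secret-management"}

def gap_analysis(requirements: dict[str, set[str]],
                 offered: set[str]) -> dict[str, set[str]]:
    """Return, per workload category, the required capabilities not yet offered."""
    return {
        category: missing
        for category, required in requirements.items()
        if (missing := required - offered)
    }

gaps = gap_analysis(REQUIREMENTS, PLATFORM_CAPABILITIES)
# batch-training is fully covered and drops out; the other two surface gaps
# (observability, autoscaling, feature-store) for prioritization.
```

Keeping the matrix as data rather than prose makes the gap analysis reproducible every time a new workload category or platform capability is added.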
  2. Design a service catalog with golden paths for common AI workflows that gives development teams a curated, opinionated set of production-ready templates while retaining enough flexibility for advanced users to customize components without abandoning the platform entirely.

    • Define golden path templates for the five most common AI workflow patterns—batch training, online inference, A/B model evaluation, data preprocessing pipelines, and RAG-based retrieval services—each template encoding organizational best practices for logging, authentication, resource limits, and CI/CD integration so that a new team can ship to production within hours rather than weeks.
    • Structure the service catalog as a versioned, schema-validated registry where every catalog entry includes a machine-readable specification (owner, maturity tier, supported runtimes, required platform version), human-readable documentation, and a direct link to the golden path scaffolding command, enabling both programmatic consumption by platform tooling and manual browsing by engineers.
    • Implement catalog governance policies that enforce lifecycle management rules—draft, active, deprecated, retired—with automated notifications to consuming teams when a catalog entry transitions state, preventing teams from building on deprecated paths and ensuring the catalog remains a living, trustworthy resource rather than a stale wiki page.
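The lifecycle rules above lend themselves to a small state machine that catalog tooling can enforce before any entry changes state. The four states come from the text; the transition table and function names below are a sketch of one reasonable policy, not the only valid one.

```python
# Sketch of catalog-entry lifecycle governance: the draft/active/deprecated/
# retired states from the text, plus an assumed transition table that platform
# tooling checks before mutating an entry (and before notifying consumers).
from enum import Enum

class LifecycleState(str, Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"

# Allowed forward transitions; anything else is rejected outright.
ALLOWED: dict[LifecycleState, set[LifecycleState]] = {
    LifecycleState.DRAFT:      {LifecycleState.ACTIVE, LifecycleState.RETIRED},
    LifecycleState.ACTIVE:     {LifecycleState.DEPRECATED},
    LifecycleState.DEPRECATED: {LifecycleState.RETIRED},
    LifecycleState.RETIRED:    set(),  # terminal state
}

def transition(current: LifecycleState, target: LifecycleState) -> LifecycleState:
    """Validate a lifecycle transition; on success the caller fires notifications."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the policy as data means the same table can drive both the API-side validation and the automated notifications to consuming teams.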
  3. Build platform configuration models with Pydantic and PostgreSQL to establish a strongly-typed, version-controlled configuration layer that serves as the single source of truth for every resource, policy, and integration point managed by the platform control plane.

    • Design Pydantic models that represent the full hierarchy of platform entities—organizations, teams, environments, services, and deployment targets—using nested models, discriminated unions, and custom validators to enforce invariants such as "GPU quotas per team must not exceed the organization-level ceiling" or "staging environments must mirror production network policies" at configuration parse time rather than at deployment time.
    • Implement a PostgreSQL persistence layer with SQLAlchemy that stores configuration snapshots with full audit history, enabling point-in-time recovery of any platform configuration state and supporting diff-based change review workflows where platform operators approve configuration changes through pull requests before they propagate to the control plane.
    • Wire configuration validation into the platform API surface so that every mutation—whether initiated by a CLI tool, a self-service portal form, or a GitOps reconciliation loop—passes through the same Pydantic validation pipeline, guaranteeing that no invalid configuration reaches the database regardless of the entry point, and returning structured, actionable error messages that tell the caller exactly which field failed which constraint.
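The "GPU quotas per team must not exceed the organization-level ceiling" invariant mentioned above can be sketched as a nested Pydantic model (assuming Pydantic v2 and its `model_validator` API). Model and field names are illustrative; the point is that the constraint fails at parse time, before anything touches the database.

```python
# Sketch (assuming Pydantic v2) of a nested configuration model enforcing a
# cross-entity invariant at parse time. Model and field names are illustrative.
from pydantic import BaseModel, Field, ValidationError, model_validator

class TeamQuota(BaseModel):
    name: str
    gpu_limit: int = Field(ge=0)  # GPUs this team may consume

class Organization(BaseModel):
    name: str
    gpu_ceiling: int = Field(ge=0)  # organization-wide GPU ceiling
    teams: list[TeamQuota]

    @model_validator(mode="after")
    def quotas_within_ceiling(self) -> "Organization":
        total = sum(t.gpu_limit for t in self.teams)
        if total > self.gpu_ceiling:
            # Actionable error: names the failed constraint and the amounts.
            raise ValueError(
                f"team GPU quotas sum to {total}, exceeding ceiling {self.gpu_ceiling}"
            )
        return self
```

Because every entry point (CLI, portal, GitOps loop) constructs the same `Organization` model, the invariant holds regardless of where the mutation originates, and the `ValidationError` carries the field path and message back to the caller.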
  4. Implement a platform health dashboard with Prometheus and Grafana that provides platform operators and tenant teams with real-time visibility into platform capacity, service health, and SLA compliance across every layer of the internal developer platform.

    • Instrument the platform control plane to emit custom Prometheus metrics covering request throughput, configuration reconciliation latency, catalog lookup response times, and error rates by tenant, going beyond default infrastructure metrics to capture the platform-specific signals that actually predict developer experience degradation before it surfaces as support tickets.
    • Build Grafana dashboards organized into three tiers—executive summary (platform-wide adoption and reliability KPIs), operator view (per-service health, resource utilization heat maps, and reconciliation queue depth), and tenant view (team-specific quota consumption, deployment frequency, and pipeline success rates)—so that each audience sees exactly the information relevant to their decision-making without drowning in irrelevant detail.
    • Configure alerting rules in Prometheus Alertmanager that distinguish between platform-level incidents (control plane degradation, catalog service unavailability, certificate expiry) and tenant-level anomalies (individual team exceeding quota, single service experiencing elevated error rates), routing each alert class to the appropriate on-call rotation with contextual runbook links embedded directly in the alert annotation.
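The custom-metrics instrumentation described above might look like the following sketch using the standard `prometheus_client` library. The metric names, label sets, and helper function are illustrative assumptions rather than a fixed schema.

```python
# Sketch of control-plane instrumentation with prometheus_client: per-tenant
# request counters plus a reconciliation-latency histogram, exposed in the
# format a Prometheus server scrapes from /metrics. Names are illustrative.
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()

# Request throughput and error rate, broken down by tenant for per-team alerting.
REQUESTS = Counter(
    "platform_requests_total", "Control-plane API requests",
    ["tenant", "outcome"], registry=registry,
)

# Reconciliation latency as a histogram so Grafana can plot percentiles.
RECONCILE_SECONDS = Histogram(
    "platform_reconcile_duration_seconds", "Configuration reconciliation latency",
    registry=registry,
)

def observe_request(tenant: str, ok: bool) -> None:
    REQUESTS.labels(tenant=tenant, outcome="success" if ok else "error").inc()

observe_request("team-a", ok=True)
with RECONCILE_SECONDS.time():   # times the enclosed block
    pass                         # reconciliation work would happen here

exposition = generate_latest(registry).decode()  # what /metrics would serve
```

Keeping tenant as a label (rather than separate metrics per team) is what lets Alertmanager route tenant-level anomalies and platform-level incidents from the same underlying series.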
  5. Create platform architecture decision records and runbooks that codify the reasoning behind every significant platform design choice and provide step-by-step operational procedures for both routine maintenance and incident response, transforming tribal knowledge into durable organizational assets.

    • Author Architecture Decision Records (ADRs) using a structured template—title, status, context, decision, consequences, alternatives considered—for every major platform choice including control plane technology selection, multi-tenancy isolation model, secret management strategy, and golden path toolchain, storing them as versioned Markdown files in the platform repository so they appear in code review workflows and remain discoverable through standard search.
    • Develop operational runbooks for the ten most critical platform operations—scaling the control plane, rotating platform-wide credentials, onboarding a new tenant team, promoting a catalog entry from draft to active, recovering from a failed GitOps reconciliation, performing a database migration, updating Helm chart dependencies, handling GPU node pool exhaustion, investigating configuration drift, and executing a platform rollback—each runbook containing preconditions, exact commands with expected output, verification steps, and escalation criteria.
    • Establish a runbook testing cadence where the platform team executes every runbook against a staging environment at least once per quarter, validating that commands still work, outputs match expectations, and the documented escalation contacts are current, treating runbooks as living code that requires the same maintenance discipline as production services rather than write-once documentation artifacts.
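The quarterly cadence described above can be backed by a small freshness check that flags any runbook whose last verified execution has aged out. The 90-day window and the record shape below are assumptions for illustration.

```python
# Sketch of a runbook-freshness check: flag runbooks whose last verified run
# against staging is older than one quarter (90 days here, an assumption).
from datetime import date, timedelta

QUARTER = timedelta(days=90)

def stale_runbooks(last_tested: dict[str, date], today: date) -> list[str]:
    """Return runbook names not exercised against staging within one quarter."""
    return sorted(
        name for name, tested in last_tested.items() if today - tested > QUARTER
    )

# Hypothetical test history for two of the runbooks named in the text.
history = {
    "scale-control-plane": date(2024, 1, 10),
    "rotate-credentials": date(2024, 5, 2),
}
stale = stale_runbooks(history, today=date(2024, 6, 1))
# -> ["scale-control-plane"]: 143 days since its last verified run.
```

Running a check like this in CI turns "runbooks as living code" from a slogan into a failing build when the cadence slips.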
  6. Deploy the platform control plane with Helm and ArgoCD to establish a GitOps-driven deployment pipeline where every control plane component—from the service catalog API to the configuration management service—is declaratively defined, version-controlled, and automatically reconciled against the desired state in the cluster.

    • Structure the Helm chart for the platform control plane as an umbrella chart with sub-charts for each component (catalog service, configuration API, health aggregator, tenant provisioner), using Helm values files per environment (development, staging, production) to manage configuration differences while keeping the chart templates identical across environments, eliminating configuration drift caused by manual per-environment patching.
    • Configure ArgoCD Application resources that point to the platform Helm chart repository, enabling automated sync with configurable sync policies—auto-sync with self-heal for development environments, manual sync with diff preview for production—and leveraging ArgoCD sync waves to enforce deployment ordering so that database migrations complete before API servers start and API servers register before the health dashboard begins polling.
    • Implement a promotion workflow where control plane changes flow through development → staging → production with automated integration tests gating each promotion, ArgoCD ApplicationSets generating per-environment Application resources from a single template, and Slack or PagerDuty notifications firing at each promotion stage so the platform team maintains full awareness of what is running where without manually inspecting cluster state.
    • Validate the deployed control plane by running a post-deployment smoke test suite that exercises every critical path—creating a tenant, registering a catalog entry, deploying a golden path template, querying the health dashboard, and triggering a configuration validation error—confirming that the end-to-end platform workflow functions correctly after every deployment before declaring the release successful.
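A minimal harness for the smoke-test suite above treats each critical path as a named callable that raises on failure, so one broken path never hides the others. The check bodies below are stand-ins; real checks would call the control-plane API and assert on responses.

```python
# Sketch of a post-deployment smoke-test harness: run every critical-path
# check, recording pass/fail without aborting on the first failure.
from typing import Callable

def run_smoke_suite(checks: dict[str, Callable[[], None]]) -> dict[str, bool]:
    """Execute every check; a check passes unless it raises an exception."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception:
            results[name] = False
    return results

# Stand-in checks; real ones would hit the control-plane API.
def create_tenant() -> None:
    pass  # e.g. POST a tenant, then verify it appears in a listing

def register_catalog_entry() -> None:
    pass  # e.g. POST a catalog entry, then fetch it back

def failing_example() -> None:
    raise RuntimeError("simulated failing critical path")

results = run_smoke_suite({
    "create-tenant": create_tenant,
    "register-catalog-entry": register_catalog_entry,
    "failing-example": failing_example,
})
```

Gating the "release successful" declaration on every entry in `results` being true gives the promotion workflow an objective, scriptable exit criterion.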

Prerequisites

This chapter serves as the foundational entry point for the AI Developer Platform Engineering course. Before beginning, ensure you have completed the following:

  • Python 3.11+ installed with working knowledge of type hints, dataclasses, and async patterns
  • Docker and Kubernetes fundamentals including pod lifecycle, services, and Helm chart basics
  • PostgreSQL experience with schema design and basic SQL operations
  • Git proficiency for version-controlled infrastructure-as-code workflows
  • Cloud platform access to a Kubernetes cluster (local minikube or managed GKE/EKS) with kubectl configured

No prior chapters in this course are required.

Key Terminology

Internal Developer Platform (IDP)
A self-service layer built on top of infrastructure tooling that abstracts away operational complexity so product teams can provision, deploy, and observe AI workloads without filing tickets or waiting on platform engineers.
Service Catalog
A curated, versioned registry of pre-approved platform offerings—such as GPU compute pools, model-serving endpoints, and feature stores—that developers browse and instantiate through a self-service portal rather than ad-hoc infrastructure requests.
Golden Path
An opinionated, well-supported, and thoroughly documented default workflow that guides developers through the recommended way to accomplish a task—such as deploying an inference service or creating a training pipeline—without restricting them from deviating when justified.
Platform Control Plane
The centralized set of APIs, controllers, and reconciliation loops—often running as Kubernetes operators or ArgoCD applications—that accept declarative intent from developers and continuously drive the actual infrastructure state toward the desired state.
Platform Abstraction Layer
A boundary that translates high-level developer intent (for example, "I need a model-serving endpoint with autoscaling") into the specific infrastructure primitives (Kubernetes Deployments, HPA resources, Istio VirtualServices) required to fulfill that intent across heterogeneous cloud providers.
Developer Self-Service Portal
A web-based or CLI-driven interface—commonly built with Backstage, Port, or a custom React application—through which engineers discover catalog entries, provision resources, and inspect platform health without direct cluster access.
Platform Configuration Model
A strongly typed, validated schema—typically implemented with Pydantic models backed by PostgreSQL—that represents every tunable parameter of a platform resource, ensuring that invalid configurations fail fast at submission time rather than at deployment time.
Architecture Decision Record (ADR)
A lightweight, version-controlled document that captures the context, decision, and consequences of a significant platform design choice—such as selecting Helm over Kustomize for templating—so future engineers understand not just what was decided but why.
Backstage Software Catalog
An open-source developer portal framework originally created by Spotify that provides a standardized metadata model (catalog-info.yaml) for registering services, libraries, and infrastructure components into a unified, searchable inventory.
Platform Primitive
The smallest composable building block exposed by the platform—such as a managed database instance, a secrets vault namespace, or a GPU node pool—that golden paths combine into higher-order workflows for specific use cases.
Reconciliation Loop
A controller pattern—central to Kubernetes operators and GitOps tools like ArgoCD—that continuously compares the declared desired state in a Git repository or API with the observed actual state in the cluster, and applies corrective actions to eliminate drift.
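Stripped of Kubernetes machinery, one pass of that pattern can be sketched as a pure function from (desired, actual) to corrective actions; the state shape here (service name to replica count) is an illustrative assumption.

```python
# Minimal sketch of one reconciliation pass: diff declared desired state
# against observed actual state and emit the actions needed to close drift.
# A real controller applies the actions, then re-observes and loops.

def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return corrective actions for one pass of the loop."""
    actions: list[str] = []
    for name, replicas in desired.items():
        if name not in actual:
            actions.append(f"create {name} with {replicas} replicas")
        elif actual[name] != replicas:
            actions.append(f"scale {name} from {actual[name]} to {replicas}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")  # prune state not declared in Git
    return sorted(actions)

actions = reconcile(
    desired={"catalog-api": 3, "health-aggregator": 2},
    actual={"catalog-api": 1, "orphan-svc": 1},
)
```

Because the function is driven only by the diff, re-running it after convergence produces an empty action list, which is exactly the idempotence GitOps tools rely on.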
Helm Chart
A packaged collection of templatized Kubernetes manifests, default values, and lifecycle hooks that the platform control plane uses to version, distribute, and reproducibly install complex multi-resource applications like Prometheus stacks or model-serving gateways.
ArgoCD Application
A custom Kubernetes resource that binds a Git repository path and target cluster namespace together, enabling the GitOps controller to automatically sync, diff, and roll back platform component deployments whenever the declared manifests change.
Prometheus Metrics Endpoint
An HTTP path (conventionally `/metrics`) that platform services expose in OpenMetrics format, allowing the Prometheus server to scrape numeric time-series data—such as request latency histograms or GPU utilization gauges—at regular intervals for dashboarding and alerting.
Grafana Dashboard
A composable visualization panel backed by PromQL or LogQL queries that platform engineers assemble into health dashboards displaying real-time SLI data—request error rates, p99 latencies, and resource saturation—for every service in the catalog.
Platform API Gateway
A single ingress point—often implemented with Envoy, Kong, or an Istio ingress gateway—that enforces authentication, rate limiting, and routing policies for all control-plane and data-plane traffic entering the platform boundary.
Runbook
An operational playbook, stored alongside the code it supports, that provides step-by-step diagnostic and remediation procedures for known failure modes—such as a model-serving pod crash-looping due to OOM—so on-call engineers resolve incidents without tribal knowledge.
Service Level Indicator (SLI)
A quantitative measurement of a specific aspect of platform service health—such as the proportion of inference requests completed under 200 milliseconds—that feeds into SLO calculations and drives alerting thresholds on the platform health dashboard.
