Learning Goals
Goal 1: Map organizational AI workloads to platform service requirements
- Map organizational AI workloads to platform service requirements by conducting a systematic discovery process that translates business-level AI initiatives into concrete infrastructure, compute, and service demands that your internal developer platform must satisfy.
- Classify AI workloads across a maturity spectrum—from experimental notebooks and batch inference jobs to real-time serving endpoints and fine-tuning pipelines—so that each workload type maps to a distinct resource profile covering GPU allocation, memory quotas, storage class, and network egress policies.
- Build a workload intake framework that captures critical metadata for every AI project entering the platform, including model framework (PyTorch, TensorFlow, JAX), expected inference latency SLA, data residency constraints, and upstream data pipeline dependencies, ensuring the platform team can provision environments without back-and-forth with application developers.
- Construct a service requirements matrix that cross-references workload categories against platform capabilities such as secret management, model registry integration, feature store connectivity, and observability stack compatibility, producing a gap analysis that directly feeds your platform roadmap and prioritization decisions.
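The workload-to-resource-profile mapping described above can be sketched in Python. The workload categories follow the maturity spectrum listed earlier, but the `ResourceProfile` fields and the specific quota values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum


class WorkloadType(Enum):
    EXPERIMENTAL_NOTEBOOK = "experimental_notebook"
    BATCH_INFERENCE = "batch_inference"
    REALTIME_SERVING = "realtime_serving"
    FINE_TUNING = "fine_tuning"


@dataclass(frozen=True)
class ResourceProfile:
    gpu_count: int
    memory_gib: int
    storage_class: str
    allow_egress: bool


# Illustrative values only; real numbers come from your capacity planning.
RESOURCE_PROFILES: dict[WorkloadType, ResourceProfile] = {
    WorkloadType.EXPERIMENTAL_NOTEBOOK: ResourceProfile(0, 16, "standard", True),
    WorkloadType.BATCH_INFERENCE: ResourceProfile(1, 32, "standard", False),
    WorkloadType.REALTIME_SERVING: ResourceProfile(1, 16, "ssd", False),
    WorkloadType.FINE_TUNING: ResourceProfile(4, 128, "ssd", False),
}


def profile_for(workload: WorkloadType) -> ResourceProfile:
    """Resolve the resource profile the platform provisions for a workload type."""
    return RESOURCE_PROFILES[workload]
```

Keeping this mapping in one table makes the gap analysis mechanical: any workload category without a satisfiable profile is a platform roadmap item.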
Goal 2: Design a service catalog with golden paths for common AI workflows
- Design a service catalog with golden paths for common AI workflows that gives development teams a curated, opinionated set of production-ready templates while retaining enough flexibility for advanced users to customize components without abandoning the platform entirely.
- Define golden path templates for the five most common AI workflow patterns—batch training, online inference, A/B model evaluation, data preprocessing pipelines, and RAG-based retrieval services—each template encoding organizational best practices for logging, authentication, resource limits, and CI/CD integration so that a new team can ship to production within hours rather than weeks.
- Structure the service catalog as a versioned, schema-validated registry where every catalog entry includes a machine-readable specification (owner, maturity tier, supported runtimes, required platform version), human-readable documentation, and a direct link to the golden path scaffolding command, enabling both programmatic consumption by platform tooling and manual browsing by engineers.
- Implement catalog governance policies that enforce lifecycle management rules—draft, active, deprecated, retired—with automated notifications to consuming teams when a catalog entry transitions state, preventing teams from building on deprecated paths and ensuring the catalog remains a living, trustworthy resource rather than a stale wiki page.
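The draft/active/deprecated/retired lifecycle above can be enforced with a small state machine. The transition rules shown here are a plausible example, not a mandated policy; adapt them to your governance model:

```python
from enum import Enum


class CatalogState(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


# Hypothetical governance rules: e.g. an active entry must pass through
# deprecation (with consumer notification) before it can be retired.
ALLOWED_TRANSITIONS: dict[CatalogState, set[CatalogState]] = {
    CatalogState.DRAFT: {CatalogState.ACTIVE, CatalogState.RETIRED},
    CatalogState.ACTIVE: {CatalogState.DEPRECATED},
    CatalogState.DEPRECATED: {CatalogState.RETIRED},
    CatalogState.RETIRED: set(),
}


def transition(current: CatalogState, target: CatalogState) -> CatalogState:
    """Validate a lifecycle state change, raising on illegal transitions."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

The `transition` hook is also the natural place to fan out the automated notifications to consuming teams.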
Goal 3: Build platform configuration models with Pydantic and PostgreSQL
- Build platform configuration models with Pydantic and PostgreSQL to establish a strongly-typed, version-controlled configuration layer that serves as the single source of truth for every resource, policy, and integration point managed by the platform control plane.
- Design Pydantic models that represent the full hierarchy of platform entities—organizations, teams, environments, services, and deployment targets—using nested models, discriminated unions, and custom validators to enforce invariants such as "GPU quotas per team must not exceed the organization-level ceiling" or "staging environments must mirror production network policies" at configuration parse time rather than at deployment time.
- Implement a PostgreSQL persistence layer with SQLAlchemy that stores configuration snapshots with full audit history, enabling point-in-time recovery of any platform configuration state and supporting diff-based change review workflows where platform operators approve configuration changes through pull requests before they propagate to the control plane.
- Wire configuration validation into the platform API surface so that every mutation—whether initiated by a CLI tool, a self-service portal form, or a GitOps reconciliation loop—passes through the same Pydantic validation pipeline, guaranteeing that no invalid configuration reaches the database regardless of the entry point, and returning structured, actionable error messages that tell the caller exactly which field failed which constraint.
Goal 4: Implement a platform health dashboard with Prometheus and Grafana
- Implement a platform health dashboard with Prometheus and Grafana that provides platform operators and tenant teams with real-time visibility into platform capacity, service health, and SLA compliance across every layer of the internal developer platform.
- Instrument the platform control plane to emit custom Prometheus metrics covering request throughput, configuration reconciliation latency, catalog lookup response times, and error rates by tenant, going beyond default infrastructure metrics to capture the platform-specific signals that actually predict developer experience degradation before it surfaces as support tickets.
- Build Grafana dashboards organized into three tiers—executive summary (platform-wide adoption and reliability KPIs), operator view (per-service health, resource utilization heat maps, and reconciliation queue depth), and tenant view (team-specific quota consumption, deployment frequency, and pipeline success rates)—so that each audience sees exactly the information relevant to their decision-making without drowning in irrelevant detail.
- Configure alerting rules in Prometheus Alertmanager that distinguish between platform-level incidents (control plane degradation, catalog service unavailability, certificate expiry) and tenant-level anomalies (individual team exceeding quota, single service experiencing elevated error rates), routing each alert class to the appropriate on-call rotation with contextual runbook links embedded directly in the alert annotation.
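Custom control plane metrics like those listed above can be emitted with the official `prometheus_client` library. The metric names, labels, and the stand-in function bodies are assumptions for illustration:

```python
import time

from prometheus_client import Counter, Histogram

# Hypothetical platform-specific metrics; names follow Prometheus
# conventions (unit suffixes, _total for counters) but are examples only.
CATALOG_LOOKUPS = Counter(
    "platform_catalog_lookups_total",
    "Catalog lookup requests served by the control plane",
    ["tenant", "status"],
)
RECONCILE_LATENCY = Histogram(
    "platform_reconcile_duration_seconds",
    "Time spent reconciling a tenant's configuration",
)


def lookup_catalog_entry(tenant: str, entry_id: str) -> None:
    # ... real lookup logic would go here ...
    CATALOG_LOOKUPS.labels(tenant=tenant, status="ok").inc()


@RECONCILE_LATENCY.time()
def reconcile_tenant(tenant: str) -> None:
    time.sleep(0.01)  # stand-in for real reconciliation work
```

Labelling counters by tenant is what makes the tenant-view Grafana tier and the per-tenant alert routing possible later; the histogram feeds the reconciliation-latency panels in the operator view.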
Goal 5: Create platform architecture decision records and runbooks
- Create platform architecture decision records and runbooks that codify the reasoning behind every significant platform design choice and provide step-by-step operational procedures for both routine maintenance and incident response, transforming tribal knowledge into durable organizational assets.
- Author Architecture Decision Records (ADRs) using a structured template—title, status, context, decision, consequences, alternatives considered—for every major platform choice including control plane technology selection, multi-tenancy isolation model, secret management strategy, and golden path toolchain, storing them as versioned Markdown files in the platform repository so they appear in code review workflows and remain discoverable through standard search.
- Develop operational runbooks for the ten most critical platform operations—scaling the control plane, rotating platform-wide credentials, onboarding a new tenant team, promoting a catalog entry from draft to active, recovering from a failed GitOps reconciliation, performing a database migration, updating Helm chart dependencies, handling GPU node pool exhaustion, investigating configuration drift, and executing a platform rollback—each runbook containing preconditions, exact commands with expected output, verification steps, and escalation criteria.
- Establish a runbook testing cadence where the platform team executes every runbook against a staging environment at least once per quarter, validating that commands still work, outputs match expectations, and the documented escalation contacts are current, treating runbooks as living code that requires the same maintenance discipline as production services rather than write-once documentation artifacts.
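A minimal Markdown skeleton for the ADR template described above might look like the following; the section names come from the structure listed, while the title, number, and placeholder text are illustrative:

```markdown
# ADR-0007: Multi-tenancy isolation model

- Status: Accepted
- Date: YYYY-MM-DD

## Context
Why the decision was needed: constraints, requirements, and forces at play.

## Decision
The choice that was made, stated in one or two sentences.

## Consequences
What becomes easier or harder as a result, including accepted trade-offs.

## Alternatives considered
Each rejected option, with a one-line reason for rejection.
```

Storing these as numbered Markdown files in the platform repository, as the bullet above suggests, means every new ADR goes through the same code review workflow as the code it governs.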
Goal 6: Deploy the platform control plane with Helm and ArgoCD
- Deploy the platform control plane with Helm and ArgoCD to establish a GitOps-driven deployment pipeline where every control plane component—from the service catalog API to the configuration management service—is declaratively defined, version-controlled, and automatically reconciled against the desired state in the cluster.
- Structure the Helm chart for the platform control plane as an umbrella chart with sub-charts for each component (catalog service, configuration API, health aggregator, tenant provisioner), using Helm values files per environment (development, staging, production) to manage configuration differences while keeping the chart templates identical across environments, eliminating configuration drift caused by manual per-environment patching.
- Configure ArgoCD Application resources that point to the platform Helm chart repository, enabling automated sync with configurable sync policies—auto-sync with self-heal for development environments, manual sync with diff preview for production—and leveraging ArgoCD sync waves to enforce deployment ordering so that database migrations complete before API servers start and API servers register before the health dashboard begins polling.
- Implement a promotion workflow where control plane changes flow through development → staging → production with automated integration tests gating each promotion, ArgoCD ApplicationSets generating per-environment Application resources from a single template, and Slack or PagerDuty notifications firing at each promotion stage so the platform team maintains full awareness of what is running where without manually inspecting cluster state.
- Validate the deployed control plane by running a post-deployment smoke test suite that exercises every critical path—creating a tenant, registering a catalog entry, deploying a golden path template, querying the health dashboard, and triggering a configuration validation error—confirming that the end-to-end platform workflow functions correctly after every deployment before declaring the release successful.
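An ArgoCD `Application` for one environment of the umbrella chart might look roughly like the following manifest. The repository URL, chart path, project, and namespaces are placeholders, and the sync policy shown matches the staging/development posture described above (auto-sync with self-heal), not production:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-control-plane-staging
  namespace: argocd
spec:
  project: platform                     # placeholder AppProject
  source:
    repoURL: https://github.com/example-org/platform-charts.git  # placeholder
    targetRevision: main
    path: charts/control-plane          # umbrella chart with sub-charts
    helm:
      valueFiles:
        - values-staging.yaml           # per-environment values file
  destination:
    server: https://kubernetes.default.svc
    namespace: platform-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true                    # auto-sync; production would use manual sync
```

Deployment ordering is then enforced by annotating the templated resources themselves with `argocd.argoproj.io/sync-wave`, for example giving the database migration Job a lower wave number than the API server Deployment so migrations complete before the servers start.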
Prerequisites
This chapter serves as the foundational entry point for the AI Developer Platform Engineering course. Before beginning, ensure you have completed the following:
- Python 3.11+ installed with working knowledge of type hints, dataclasses, and async patterns
- Docker and Kubernetes fundamentals including pod lifecycle, services, and Helm chart basics
- PostgreSQL experience with schema design and basic SQL operations
- Git proficiency for version-controlled infrastructure-as-code workflows
- Cloud platform access to a Kubernetes cluster (local minikube or managed GKE/EKS) with kubectl configured
No prior chapters in this course are required.