Design service catalog data model and golden path templates
Define the core data structures for platform services, golden path templates, and service catalog entries. Build Pydantic models that represent the platform's offerings to development teams.
Design a service catalog with golden paths for common AI workflows
Introduction
Engineers often find that AI teams default to ad-hoc provisioning when no structured catalog exists—GPU spend grows ungoverned, model deployments skip monitoring instrumentation, and connection strings are hardcoded across dozens of services with no automated wiring between them. A service catalog paired with golden paths solves this by giving teams a curated registry of validated services and pre-composed workflows they can adopt without reinventing platform decisions from scratch. By the end of this lesson, you'll be able to design a catalog data model that enforces governance policies at registration time, define golden path step sequences that compose those services into end-to-end AI workflows, and understand how the provisioning orchestrator propagates inter-service configuration automatically through the platform control plane.
Key Terminology
- Service Catalog: A curated registry of platform-provided services, each with metadata describing its capabilities, SLAs, dependencies, and provisioning interface.
- Golden Path: A pre-validated, opinionated workflow template that composes multiple catalog services into an end-to-end pipeline for a specific use case.
- Service Template: A parameterized blueprint that generates infrastructure-as-code and configuration for a specific service instance.
- Platform Control Plane: The set of APIs and controllers that manage service lifecycle, configuration propagation, and health monitoring across the platform.
- Catalog Entry: A single service definition within the catalog, including its schema, version history, ownership metadata, and dependency graph.
Concepts
Connecting to Platform Configuration Management
The service catalog and golden path registry do not exist in isolation: they feed directly into the platform configuration management service described in the related lab objective. When the provisioning orchestrator deploys a service from a golden path step, it writes the resulting configuration (endpoints, credential references, resource allocations) to the configuration store, which persists values in PostgreSQL and caches them in Redis. Subsequent services in the golden path read their predecessors' configuration from this store, enabling automatic wiring.
For example, when the golden path provisions a vector database in step 1, the orchestrator writes the database's connection string and collection schema to the configuration store under a namespaced key like platform/team-alpha/vector-db/qdrant/connection. When step 2 provisions a model serving endpoint, the serving framework's startup configuration reads that same key to discover where to send embedding lookups. This pattern eliminates hardcoded connection strings and enables the platform to rotate credentials or migrate services without requiring consumer code changes.
The configuration management service should expose a gRPC or REST API with the following operations that map to catalog lifecycle events:
- PUT /config/{namespace}/{service_id}/{key}: Writes a configuration value during provisioning, with automatic Redis cache invalidation.
- GET /config/{namespace}/{service_id}/{key}: Reads a configuration value, served from Redis cache with PostgreSQL fallback.
- DELETE /config/{namespace}/{service_id}: Removes all configuration for a decommissioned service, triggered when a golden path step is rolled back.
- LIST /config/{namespace}: Enumerates all services configured within a team's namespace, powering the platform health dashboard's service inventory view. Because LIST is not a standard HTTP method, a REST API would typically expose this as GET /config/{namespace}.
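The semantics of these four operations can be sketched with an in-memory stand-in, where one dictionary plays the role of PostgreSQL and another plays Redis. The class name InMemoryConfigStore, its method names, and the example hostnames are hypothetical and not part of the lesson's platform. A real service would also need authentication, cache expiry, and concurrency control. The sketch only illustrates invalidate-on-write and cache-with-fallback reads.

```python
class InMemoryConfigStore:
    """Toy config store: `_db` stands in for PostgreSQL, `_cache` for Redis."""

    def __init__(self) -> None:
        self._db: dict[tuple[str, str, str], str] = {}
        self._cache: dict[tuple[str, str, str], str] = {}

    def put(self, namespace: str, service_id: str, key: str, value: str) -> None:
        """PUT: write to the durable store, then invalidate the cached copy."""
        full_key = (namespace, service_id, key)
        self._db[full_key] = value
        self._cache.pop(full_key, None)

    def get(self, namespace: str, service_id: str, key: str) -> str:
        """GET: serve from cache, falling back to the durable store on a miss."""
        full_key = (namespace, service_id, key)
        if full_key not in self._cache:
            # Raises KeyError if the key was never written.
            self._cache[full_key] = self._db[full_key]
        return self._cache[full_key]

    def delete_service(self, namespace: str, service_id: str) -> int:
        """DELETE: drop every key for one service; returns how many were removed."""
        doomed = [k for k in self._db if k[:2] == (namespace, service_id)]
        for full_key in doomed:
            del self._db[full_key]
            self._cache.pop(full_key, None)
        return len(doomed)

    def list_services(self, namespace: str) -> list[str]:
        """LIST: enumerate the service IDs configured in a namespace."""
        return sorted({sid for ns, sid, _ in self._db if ns == namespace})


# Usage with made-up values: two golden path steps record their outputs.
store = InMemoryConfigStore()
store.put("team-alpha", "vector-db/qdrant", "connection", "qdrant.example.internal:6333")
store.put("team-alpha", "model-serving/llm", "endpoint", "llm.example.internal:8000")
```

Invalidating on write, rather than updating the cache in place, keeps the durable store as the single source of truth: the next read repopulates the cache from it.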
Design Principles for AI-Specific Golden Paths
Building golden paths for AI workflows requires attention to concerns that do not arise in traditional microservice platforms:
- Encode GPU lifecycle management: Every golden path that provisions GPU workloads must include steps for quota validation, node pool selection, and automatic scale-down policies. The GPURequirement model in the catalog entry enforces declaration, but the golden path must also verify that the target cluster has sufficient GPU capacity before beginning provisioning. If capacity is insufficient, the path should return a clear error with a link to the capacity request workflow rather than failing mid-deployment.
- Version model artifacts alongside infrastructure: Golden paths for model serving must pin the model artifact version (e.g., an MLflow model URI with a specific run ID) in the provisioning parameters. This ensures that infrastructure and model versions are deployed atomically and can be rolled back together. Storing the model version in the configuration management service's PostgreSQL backend creates an auditable history of what model version ran on which infrastructure at which time.
- Include observability by default: Every golden path should include a monitoring step that provisions Prometheus scrape targets and Grafana dashboard definitions for the deployed services. AI workloads require specialized metrics beyond request latency, such as token throughput, batch queue depth, GPU utilization percentage, and inference drift scores. The monitoring step should use the catalog's MONITORING service type and inject pre-built dashboard JSON that the Grafana operator reconciles automatically.
- Support experiment-to-production promotion: Data scientists frequently develop models in experiment tracking environments and need a clear path to promote a validated experiment into a production serving deployment. A well-designed golden path provides a promote action that reads the experiment's metadata from the feature store, packages the model using the platform's standard container image builder, and deploys it through the same ArgoCD pipeline used for all production services. This closes the gap between experimentation and production, a point where many ML projects stall.
These principles ensure that golden paths are not merely convenience wrappers around Helm charts but encode the platform team's accumulated operational knowledge about running AI workloads reliably at scale. Each path should be versioned using semantic versioning, stored in a Git repository that ArgoCD watches, and tested through the platform's own CI pipeline before being published to the developer portal's catalog.
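The first principle, checking GPU capacity before provisioning starts, might look like the following sketch. All names here (ClusterCapacity, InsufficientGPUCapacity, check_gpu_capacity) and the example URL are hypothetical. How the platform actually obtains free-GPU counts is not covered in this lesson, so the capacity snapshot is simply passed in.

```python
from dataclasses import dataclass


@dataclass
class ClusterCapacity:
    """Hypothetical snapshot of free GPUs per GPU type in the target cluster."""
    free_gpus: dict[str, int]


class InsufficientGPUCapacity(Exception):
    """Raised before provisioning begins, carrying a link to request capacity."""

    def __init__(self, gpu_type: str, requested: int, available: int, request_url: str):
        self.request_url = request_url
        super().__init__(
            f"Need {requested} x {gpu_type} but only {available} free. "
            f"Request capacity at {request_url}"
        )


def check_gpu_capacity(
    capacity: ClusterCapacity, gpu_type: str, min_count: int, request_url: str
) -> None:
    """Fail fast if the cluster cannot satisfy the declared minimum GPU count."""
    available = capacity.free_gpus.get(gpu_type, 0)
    if available < min_count:
        raise InsufficientGPUCapacity(gpu_type, min_count, available, request_url)


# Usage with made-up numbers: two free GPUs satisfy a request for two.
snapshot = ClusterCapacity(free_gpus={"a100": 2})
check_gpu_capacity(snapshot, "a100", 2, "https://example.com/capacity-request")
```

Running this check as the first step of a golden path means a shortfall surfaces as one actionable error instead of a half-provisioned pipeline.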
Code Walkthrough
Building on the Key Terminology definitions of Catalog Entry and Platform Control Plane, the following implementation encodes catalog entries as Pydantic models that the provisioning orchestrator validates before deploying any service. The snippet covers the catalog side only; it does not define a golden path model.
```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, field_validator


class ServiceType(str, Enum):
    MODEL_SERVING = "model-serving"
    TRAINING_JOB = "training-job"
    VECTOR_DB = "vector-db"
    FEATURE_STORE = "feature-store"
    MONITORING = "monitoring"


class GPURequirement(BaseModel):
    gpu_type: str
    min_count: int = Field(ge=0, default=0)
    max_count: int = Field(ge=0, default=8)
    memory_gb: int = Field(ge=0, default=40)


class ServiceDependency(BaseModel):
    service_id: str
    version_constraint: str
    optional: bool = False


class ServiceCatalogEntry(BaseModel):
    service_id: str = Field(min_length=3, max_length=64)
    name: str
    service_type: ServiceType
    version: str = Field(pattern=r"^\d+\.\d+\.\d+$")
    owner_team: str
    description: str = Field(min_length=20)
    # validate_default=True makes the validator below run even when the
    # field is omitted; Pydantic v2 otherwise skips validators for defaults.
    gpu_requirements: Optional[GPURequirement] = Field(default=None, validate_default=True)
    dependencies: list[ServiceDependency] = Field(default_factory=list)
    helm_chart_ref: Optional[str] = None
    # datetime.utcnow is deprecated since Python 3.12; use an aware UTC timestamp.
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    deprecated: bool = False

    @field_validator("gpu_requirements")
    @classmethod
    def gpu_required_for_compute_services(cls, v, info):
        # service_type is declared before this field, so it is already in info.data
        # (unless it failed its own validation, in which case stype is None).
        stype = info.data.get("service_type")
        gpu_types = {ServiceType.MODEL_SERVING, ServiceType.TRAINING_JOB}
        if stype in gpu_types and v is None:
            raise ValueError(f"gpu_requirements mandatory for {stype.value}")
        return v
```
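ServiceType enumerates the AI workload categories the platform governs, using kebab-case values that align with Kubernetes label conventions. The GPURequirement sub-model makes hardware declarations explicit: gpu_type is required, while min_count, max_count, and memory_gb fall back to defaults. The field_validator on ServiceCatalogEntry rejects any model-serving or training-job entry that lacks GPU requirements, enforcing cost governance at schema validation time rather than at deploy time, which matters on platforms where unplanned GPU allocations create budget overruns. Note that Pydantic v2 skips field validators for omitted fields unless validate_default=True is set, so the field needs that flag for the check to catch entries that leave gpu_requirements out entirely.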
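The golden path side can be sketched in the same spirit. The shape below is one possible design, not the lesson's own model: GoldenPathStep, GoldenPathDefinition, and provisioning_order are hypothetical names, and standard-library dataclasses are used so the sketch runs without extra dependencies (a Pydantic version would follow the same structure). Step ordering relies on graphlib.TopologicalSorter from the Python standard library (3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class GoldenPathStep:
    """One step of a golden path: provisions a single catalog service."""
    step_id: str
    service_id: str  # refers to a catalog entry's service_id
    depends_on: list[str] = field(default_factory=list)  # step_ids that must run first


@dataclass
class GoldenPathDefinition:
    """A versioned, ordered composition of catalog services."""
    path_id: str
    version: str
    steps: list[GoldenPathStep]

    def provisioning_order(self) -> list[str]:
        """Return step_ids in an order that respects every depends_on edge."""
        known = {step.step_id for step in self.steps}
        for step in self.steps:
            missing = set(step.depends_on) - known
            if missing:
                raise ValueError(f"{step.step_id} depends on unknown steps: {sorted(missing)}")
        # TopologicalSorter expects {node: predecessors}; it raises CycleError on cycles.
        graph = {step.step_id: set(step.depends_on) for step in self.steps}
        return list(TopologicalSorter(graph).static_order())


# Usage: a made-up retrieval pipeline with three chained steps.
rag_path = GoldenPathDefinition(
    path_id="rag-pipeline",
    version="1.0.0",
    steps=[
        GoldenPathStep("monitoring", "grafana-dashboards", depends_on=["model-serving"]),
        GoldenPathStep("model-serving", "llm-endpoint", depends_on=["vector-db"]),
        GoldenPathStep("vector-db", "qdrant"),
    ],
)
order = rag_path.provisioning_order()
```

Declaring dependencies per step, instead of relying on list position, lets the orchestrator reject malformed paths (unknown steps, cycles) before any infrastructure is touched.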
When the orchestrator provisions each golden path step, it writes the resulting connection details to the configuration store so downstream steps can discover them automatically, eliminating hardcoded strings. The following helpers are client-side wrappers around the PUT and GET operations described in the Concepts section:
```python
import httpx


def write_service_config(
    namespace: str,
    service_id: str,
    key: str,
    value: str,
    config_api_base: str = "http://platform-config:8080",
) -> None:
    """Write a provisioned service's config for downstream golden path steps."""
    url = f"{config_api_base}/config/{namespace}/{service_id}/{key}"
    response = httpx.put(url, json={"value": value}, timeout=10.0)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions


def read_service_config(
    namespace: str,
    service_id: str,
    key: str,
    config_api_base: str = "http://platform-config:8080",
) -> str:
    """Read a predecessor step's config during golden path provisioning."""
    url = f"{config_api_base}/config/{namespace}/{service_id}/{key}"
    response = httpx.get(url, timeout=10.0)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return response.json()["value"]
```
After the vector database step completes, the orchestrator calls write_service_config("team-alpha", "vector-db/qdrant", "connection", connection_string). The model-serving step then calls read_service_config("team-alpha", "vector-db/qdrant", "connection") to discover the endpoint. This mirrors the namespaced key pattern described in the Concepts section and lets the platform rotate credentials without consumer code changes. Because the service_id "vector-db/qdrant" contains a slash, it occupies two path segments in the request URL, so the configuration API's routing has to account for that.
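To see what URL those calls produce, the path construction can be isolated in a small function. config_url is a hypothetical helper that reproduces the f-string used by the helpers above and adds an optional percent-encoded variant. Which of the two forms the configuration API expects is not stated in this lesson, so treat the encoded variant as an assumption to check against the actual service.

```python
from urllib.parse import quote


def config_url(
    namespace: str,
    service_id: str,
    key: str,
    config_api_base: str = "http://platform-config:8080",
    encode_service_id: bool = False,
) -> str:
    """Build the config API URL; optionally percent-encode slashes in service_id."""
    if encode_service_id:
        # safe="" forces "/" to be encoded as %2F so the ID stays one path segment.
        service_id = quote(service_id, safe="")
    return f"{config_api_base}/config/{namespace}/{service_id}/{key}"


raw_url = config_url("team-alpha", "vector-db/qdrant", "connection")
encoded_url = config_url("team-alpha", "vector-db/qdrant", "connection", encode_service_id=True)
```

The raw form yields a path with four segments after /config, matching the namespaced key layout, while the encoded form keeps the {namespace}/{service_id}/{key} template at exactly three segments.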
Verify by instantiating a ServiceCatalogEntry with service_type=ServiceType.MODEL_SERVING and gpu_requirements=None. The validator raises a ValueError, which Pydantic reports as a ValidationError (itself a ValueError subclass in Pydantic v2), confirming that the governance policy is enforced at catalog registration time before any infrastructure is touched.
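A minimal, self-contained version of that check is sketched below, assuming Pydantic v2. MiniCatalogEntry is a trimmed-down, hypothetical stand-in for ServiceCatalogEntry that keeps only the fields the validator touches, and failing_fields is an illustrative helper rather than part of the platform.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, field_validator


class ServiceType(str, Enum):
    MODEL_SERVING = "model-serving"
    VECTOR_DB = "vector-db"


class GPURequirement(BaseModel):
    gpu_type: str
    min_count: int = Field(ge=0, default=0)


class MiniCatalogEntry(BaseModel):
    """Trimmed-down stand-in for ServiceCatalogEntry (illustration only)."""
    service_id: str
    service_type: ServiceType
    # validate_default=True so the check also fires when the field is omitted.
    gpu_requirements: Optional[GPURequirement] = Field(default=None, validate_default=True)

    @field_validator("gpu_requirements")
    @classmethod
    def gpu_required_for_compute_services(cls, v, info):
        if info.data.get("service_type") is ServiceType.MODEL_SERVING and v is None:
            raise ValueError("gpu_requirements mandatory for model-serving")
        return v


def failing_fields(**fields) -> list[str]:
    """Return the names of fields that failed validation (empty list if valid)."""
    try:
        MiniCatalogEntry(**fields)
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


# A model-serving entry with gpu_requirements=None is rejected at construction.
rejected = failing_fields(
    service_id="llm-endpoint",
    service_type=ServiceType.MODEL_SERVING,
    gpu_requirements=None,
)
```

Collecting the failing field names, rather than just catching the exception, makes it easy to assert in a test that the rejection came from the GPU policy and not from an unrelated field.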
Do's and Don'ts
With the catalog model and golden path configuration propagation covered, the following rules distil the governance and wiring patterns into guidance you can apply directly when designing your own catalog entries and golden paths.
Do's
- ✓ Do declare gpu_requirements on every model-serving and training-job catalog entry: the field_validator on ServiceCatalogEntry rejects these service types when gpu_requirements is None, enforcing cost governance at schema validation time before any infrastructure is provisioned and preventing unplanned GPU budget overruns.
- ✓ Do use write_service_config / read_service_config with a consistent namespaced key pattern (e.g., "team-alpha" / "vector-db/qdrant" / "connection"): this lets each golden path step discover predecessor outputs automatically from the platform control plane instead of hardcoding connection strings across services.
- ✓ Do model catalog entries with a version field constrained to semver (^\d+\.\d+\.\d+$) and a deprecated flag: keeping entries versioned and deprecation-aware lets the orchestrator enforce ServiceDependency.version_constraint checks and gives teams a migration path without breaking existing golden path step sequences.
Don'ts
- ✗ Don't omit gpu_requirements and assume the orchestrator will supply defaults: GPURequirement is a mandatory sub-model for ServiceType.MODEL_SERVING and ServiceType.TRAINING_JOB; skipping it raises a ValueError at registration time, and silently relying on platform defaults is exactly the ad-hoc provisioning pattern the catalog is designed to eliminate.
- ✗ Don't hardcode connection strings in golden path step code: bypassing write_service_config / read_service_config breaks the automatic credential-propagation contract, meaning endpoint changes (including credential rotations) require manual updates across every consuming service rather than a single config-store write.
- ✗ Don't register a new AI workload under an ad-hoc service_type string outside the ServiceType enum: the enum's kebab-case values (model-serving, vector-db, etc.) align with Kubernetes label conventions and are the keys the GPU-requirement check branches on. Pydantic rejects any value outside the enum, and working around that (for example, by widening the field to a plain string) would bypass both the field_validator and the orchestrator's resource-quota logic.