Design service catalog data model and golden path templates

Define the core data structures for platform services, golden path templates, and service catalog entries. Build Pydantic models that represent the platform's offerings to development teams.

Design a service catalog with golden paths for common AI workflows

Introduction

Engineers often find that AI teams default to ad-hoc provisioning when no structured catalog exists—GPU spend grows ungoverned, model deployments skip monitoring instrumentation, and connection strings are hardcoded across dozens of services with no automated wiring between them. A service catalog paired with golden paths solves this by giving teams a curated registry of validated services and pre-composed workflows they can adopt without reinventing platform decisions from scratch. By the end of this lesson, you'll be able to design a catalog data model that enforces governance policies at registration time, define golden path step sequences that compose those services into end-to-end AI workflows, and understand how the provisioning orchestrator propagates inter-service configuration automatically through the platform control plane.

Key Terminology

Service Catalog: A curated registry of platform-provided services, each with metadata describing its capabilities, SLAs, dependencies, and provisioning interface.
Golden Path: A pre-validated, opinionated workflow template that composes multiple catalog services into an end-to-end pipeline for a specific use case.
Service Template: A parameterized blueprint that generates infrastructure-as-code and configuration for a specific service instance.
Platform Control Plane: The set of APIs and controllers that manage service lifecycle, configuration propagation, and health monitoring across the platform.
Catalog Entry: A single service definition within the catalog, including its schema, version history, ownership metadata, and dependency graph.

Concepts

Connecting to Platform Configuration Management

The service catalog and golden path registry do not exist in isolation: they feed directly into the platform configuration management service described in the related lab objective. When the provisioning orchestrator deploys a service from a golden path step, it writes the resulting configuration (endpoints, credential references, resource allocations) to PostgreSQL, with Redis as a cache in front of it. Subsequent services in the golden path read their predecessors' configuration from this store, which is what enables automatic wiring.

For example, when the golden path provisions a vector database in step 1, the orchestrator writes the database's connection string and collection schema to the configuration store under a namespaced key like platform/team-alpha/vector-db/qdrant/connection. When step 2 provisions a model serving endpoint, the serving framework's startup configuration reads that same key to discover where to send embedding lookups. This pattern eliminates hardcoded connection strings and enables the platform to rotate credentials or migrate services without requiring consumer code changes.

The configuration management service should expose a gRPC or REST API with the following operations that map to catalog lifecycle events:

PUT /config/{namespace}/{service_id}/{key} — Writes a configuration value during provisioning, with automatic Redis cache invalidation.
GET /config/{namespace}/{service_id}/{key} — Reads a configuration value, served from Redis cache with PostgreSQL fallback.
DELETE /config/{namespace}/{service_id} — Removes all configuration for a decommissioned service, triggered when a golden path step is rolled back.
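LIST /config/{namespace} — Enumerates all services configured within a team's namespace, powering the platform health dashboard's service inventory view. LIST is not an HTTP method, so a REST implementation would typically expose this as a GET on the namespace path.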
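The four operations above can be sketched with an in-memory stand-in. This is a minimal illustration of the cache-aside behaviour the list describes (write to the durable store, invalidate the cache, read through the cache with a fallback), not the platform's actual service. The ConfigStore class, its method names, and the endpoint value are hypothetical, and plain dicts stand in for PostgreSQL and Redis.

```python
from typing import Optional


class ConfigStore:
    """Cache-aside config store; dicts stand in for PostgreSQL and Redis."""

    def __init__(self) -> None:
        self._db: dict[str, str] = {}     # stand-in for the PostgreSQL table
        self._cache: dict[str, str] = {}  # stand-in for the Redis cache

    @staticmethod
    def _path(namespace: str, service_id: str, key: str = "") -> str:
        return "/".join(part for part in (namespace, service_id, key) if part)

    def put(self, namespace: str, service_id: str, key: str, value: str) -> None:
        path = self._path(namespace, service_id, key)
        self._db[path] = value       # durable write first
        self._cache.pop(path, None)  # then invalidate any cached copy

    def get(self, namespace: str, service_id: str, key: str) -> Optional[str]:
        path = self._path(namespace, service_id, key)
        if path in self._cache:      # cache hit
            return self._cache[path]
        value = self._db.get(path)   # cache miss: fall back to the database
        if value is not None:
            self._cache[path] = value
        return value

    def delete_service(self, namespace: str, service_id: str) -> int:
        """Remove every key of one service, e.g. when a step is rolled back."""
        prefix = self._path(namespace, service_id) + "/"
        doomed = [path for path in self._db if path.startswith(prefix)]
        for path in doomed:
            del self._db[path]
            self._cache.pop(path, None)
        return len(doomed)

    def list_services(self, namespace: str) -> list[str]:
        """Service IDs in a namespace; assumes the final key has no slash."""
        prefix = namespace + "/"
        return sorted(
            {path[len(prefix):].rsplit("/", 1)[0]
             for path in self._db if path.startswith(prefix)}
        )


store = ConfigStore()
# Step 1 (vector database) publishes its endpoint; the value is made up.
store.put("team-alpha", "vector-db/qdrant", "connection", "qdrant-v1:6333")
# Step 2 (model serving) discovers it without any hardcoded string.
endpoint = store.get("team-alpha", "vector-db/qdrant", "connection")
```

Invalidating on write, rather than updating the cache in place, keeps the durable store as the single source of truth: the next read repopulates the cache from it.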

Design Principles for AI-Specific Golden Paths

Building golden paths for AI workflows requires attention to concerns that do not arise in traditional microservice platforms:

Encode GPU lifecycle management — Every golden path that provisions GPU workloads must include steps for quota validation, node pool selection, and automatic scale-down policies. The GPURequirement model in the catalog entry enforces declaration, but the golden path must also verify that the target cluster has sufficient GPU capacity before beginning provisioning. If capacity is insufficient, the path should return a clear error with a link to the capacity request workflow rather than failing mid-deployment.
Version model artifacts alongside infrastructure — Golden paths for model serving must pin the model artifact version (e.g., an MLflow model URI with a specific run ID) in the provisioning parameters. This ensures that infrastructure and model versions are deployed atomically and can be rolled back together. Storing the model version in the configuration management service's PostgreSQL backend creates an auditable history of what model version ran on which infrastructure at which time.
Include observability by default — Every golden path should include a monitoring step that provisions Prometheus scrape targets and Grafana dashboard definitions for the deployed services. AI workloads require specialized metrics beyond request latency—token throughput, batch queue depth, GPU utilization percentage, and inference drift scores. The monitoring step should use the catalog's MONITORING service type and inject pre-built dashboard JSON that the Grafana operator reconciles automatically.
Support experiment-to-production promotion — Data scientists frequently develop models in experiment tracking environments and need a clear path to promote a validated experiment into a production serving deployment. A well-designed golden path provides a promote action that reads the experiment's metadata from the experiment tracking system, packages the model using the platform's standard container image builder, and deploys it through the same ArgoCD pipeline used for all production services. This closes the gap between experimentation and production, a common point at which ML projects stall.

These principles ensure that golden paths are not merely convenience wrappers around Helm charts but encode the platform team's accumulated operational knowledge about running AI workloads reliably at scale. Each path should be versioned using semantic versioning, stored in a Git repository that ArgoCD watches, and tested through the platform's own CI pipeline before being published to the developer portal's catalog.
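The first principle, checking GPU capacity before provisioning begins, can be sketched as a pre-flight function. This is an illustrative sketch under stated assumptions: GPURequest, InsufficientGPUCapacity, preflight_gpu_check, and the capacity request URL are hypothetical names, and the free-GPU counts are passed in as a plain dict rather than queried from a real cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPURequest:
    gpu_type: str
    min_count: int


class InsufficientGPUCapacity(RuntimeError):
    """Raised before provisioning starts, so there is nothing to roll back."""


# Hypothetical URL: the lesson does not name the capacity request workflow.
CAPACITY_REQUEST_URL = "https://portal.example.com/capacity-request"


def preflight_gpu_check(request: GPURequest, free_gpus: dict[str, int]) -> None:
    """Fail fast with an actionable message if the cluster cannot fit the request."""
    available = free_gpus.get(request.gpu_type, 0)
    if available < request.min_count:
        raise InsufficientGPUCapacity(
            f"need {request.min_count} x {request.gpu_type}, "
            f"only {available} free; request capacity at {CAPACITY_REQUEST_URL}"
        )


free = {"a100": 4}  # assumed snapshot of free GPUs per type
preflight_gpu_check(GPURequest("a100", 2), free)  # enough capacity: returns None
```

Running the check as the first action of the path means a shortfall surfaces as a single clear error instead of a half-provisioned pipeline.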

Code Walkthrough

Building on the Key Terminology definitions of Catalog Entry, Golden Path, and Platform Control Plane, the following implementation encodes catalog entries as Pydantic models that the provisioning orchestrator validates before deploying any service.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, field_validator


class ServiceType(str, Enum):
    MODEL_SERVING = "model-serving"
    TRAINING_JOB = "training-job"
    VECTOR_DB = "vector-db"
    FEATURE_STORE = "feature-store"
    MONITORING = "monitoring"


class GPURequirement(BaseModel):
    gpu_type: str  # no default: the hardware must be named explicitly
    min_count: int = Field(ge=0, default=0)
    max_count: int = Field(ge=0, default=8)
    memory_gb: int = Field(ge=0, default=40)


class ServiceDependency(BaseModel):
    service_id: str
    version_constraint: str
    optional: bool = False


class ServiceCatalogEntry(BaseModel):
    service_id: str = Field(min_length=3, max_length=64)
    name: str
    service_type: ServiceType  # declared before gpu_requirements so the validator can read it
    version: str = Field(pattern=r"^\d+\.\d+\.\d+$")
    owner_team: str
    description: str = Field(min_length=20)
    # validate_default=True makes the validator below run even when the field
    # is omitted; Pydantic v2 otherwise skips validators for default values.
    gpu_requirements: Optional[GPURequirement] = Field(default=None, validate_default=True)
    dependencies: list[ServiceDependency] = Field(default_factory=list)
    helm_chart_ref: Optional[str] = None
    # datetime.utcnow is deprecated since Python 3.12; use an aware UTC timestamp.
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    deprecated: bool = False

    @field_validator("gpu_requirements")
    @classmethod
    def gpu_required_for_compute_services(cls, v, info):
        # info.data holds the fields validated so far, including service_type.
        stype = info.data.get("service_type")
        gpu_types = {ServiceType.MODEL_SERVING, ServiceType.TRAINING_JOB}
        if stype in gpu_types and v is None:
            raise ValueError(f"gpu_requirements mandatory for {stype}")
        return v
```

ServiceType enumerates the AI workload categories the platform governs, using kebab-case values that align with Kubernetes label conventions. The GPURequirement sub-model makes hardware declarations explicit: gpu_type has no default, so any entry that declares GPU needs must name the hardware. The field_validator on ServiceCatalogEntry rejects any model-serving or training-job entry whose gpu_requirements is None, enforcing cost governance at schema validation time rather than at deploy time. The validator reads service_type from info.data, which works because service_type is declared before gpu_requirements. This policy matters on platforms where unplanned GPU allocations create budget overruns.
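The models above cover catalog entries but not the golden paths that compose them. A minimal sketch of what a GoldenPathDefinition could look like follows, assuming Pydantic v2. The field names (path_id, steps, step_id, depends_on) and the GoldenPathStep model are hypothetical, as are the service IDs in the usage example. The ordering rule it enforces (a step may only depend on steps listed before it) is one simple way to guarantee that predecessor configuration exists when a step runs.

```python
from pydantic import BaseModel, Field, model_validator


class GoldenPathStep(BaseModel):
    step_id: str
    service_id: str  # intended to match a ServiceCatalogEntry.service_id
    depends_on: list[str] = Field(default_factory=list)  # step_ids that must finish first


class GoldenPathDefinition(BaseModel):
    path_id: str
    version: str = Field(pattern=r"^\d+\.\d+\.\d+$")  # same semver rule as catalog entries
    steps: list[GoldenPathStep] = Field(min_length=1)

    @model_validator(mode="after")
    def steps_only_depend_on_earlier_steps(self):
        seen: set[str] = set()
        for step in self.steps:
            if step.step_id in seen:
                raise ValueError(f"duplicate step_id {step.step_id!r}")
            unknown = [dep for dep in step.depends_on if dep not in seen]
            if unknown:
                raise ValueError(
                    f"step {step.step_id!r} depends on {unknown}, "
                    "which are not defined earlier in the path"
                )
            seen.add(step.step_id)
        return self


# Hypothetical retrieval-augmented serving path: vector DB, then serving, then monitoring.
rag_path = GoldenPathDefinition(
    path_id="rag-serving",
    version="1.0.0",
    steps=[
        GoldenPathStep(step_id="vector-db", service_id="qdrant"),
        GoldenPathStep(step_id="serving", service_id="embedding-server",
                       depends_on=["vector-db"]),
        GoldenPathStep(step_id="monitoring", service_id="prometheus-stack",
                       depends_on=["serving"]),
    ],
)
```

Because list order doubles as execution order, the orchestrator can walk steps front to back without a separate topological sort. A richer design might allow arbitrary ordering and sort the graph instead.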

When the orchestrator provisions each golden path step, it writes the resulting connection details to the configuration store so downstream steps can discover them automatically, eliminating hardcoded strings. The following helpers implement the PUT and GET operations described in the Concepts section:

```python
import httpx


def write_service_config(
    namespace: str,
    service_id: str,
    key: str,
    value: str,
    config_api_base: str = "http://platform-config:8080",
) -> None:
    """Write a provisioned service's config for downstream golden path steps."""
    url = f"{config_api_base}/config/{namespace}/{service_id}/{key}"
    response = httpx.put(url, json={"value": value}, timeout=10.0)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions


def read_service_config(
    namespace: str,
    service_id: str,
    key: str,
    config_api_base: str = "http://platform-config:8080",
) -> str:
    """Read a predecessor step's config during golden path provisioning."""
    url = f"{config_api_base}/config/{namespace}/{service_id}/{key}"
    response = httpx.get(url, timeout=10.0)
    response.raise_for_status()  # a missing key fails here rather than returning None
    return response.json()["value"]
```

After the vector database step completes, the orchestrator calls write_service_config("team-alpha", "vector-db/qdrant", "connection", connection_string). The model-serving step then calls read_service_config("team-alpha", "vector-db/qdrant", "connection") to discover the endpoint. Because the service_id contains a slash, both calls resolve to the path /config/team-alpha/vector-db/qdrant/connection, which mirrors the namespaced key pattern described in the Concepts section. Since consumers look the value up rather than embedding it, the platform can rotate credentials without consumer code changes.

Verify by instantiating a ServiceCatalogEntry with service_type=ServiceType.MODEL_SERVING and gpu_requirements=None. Pydantic wraps the validator's ValueError in a ValidationError (itself a ValueError subclass in Pydantic v2), confirming that the governance policy is enforced at catalog registration time, before any infrastructure is touched.
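The check described above can be written as a small script. This sketch assumes Pydantic v2 and uses a trimmed copy of the models, keeping only the fields the validator touches. The registration_error helper and the example service IDs are hypothetical. The trimmed copy sets validate_default=True so that omitting gpu_requirements is caught as well as passing None explicitly, since Pydantic v2 does not run field validators on default values unless asked to.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, field_validator


class ServiceType(str, Enum):
    MODEL_SERVING = "model-serving"
    VECTOR_DB = "vector-db"


class GPURequirement(BaseModel):
    gpu_type: str
    min_count: int = Field(ge=0, default=0)


class ServiceCatalogEntry(BaseModel):
    # Trimmed copy: only the fields the GPU validator depends on.
    service_id: str = Field(min_length=3, max_length=64)
    service_type: ServiceType
    gpu_requirements: Optional[GPURequirement] = Field(default=None, validate_default=True)

    @field_validator("gpu_requirements")
    @classmethod
    def gpu_required_for_compute_services(cls, v, info):
        if info.data.get("service_type") is ServiceType.MODEL_SERVING and v is None:
            raise ValueError("gpu_requirements mandatory for model-serving")
        return v


def registration_error(**fields) -> Optional[str]:
    """Return the first validation message, or None if the entry is accepted."""
    try:
        ServiceCatalogEntry(**fields)
    except ValidationError as exc:
        return exc.errors()[0]["msg"]
    return None


# Rejected: a model-serving entry with gpu_requirements explicitly set to None.
rejected = registration_error(
    service_id="embed-server", service_type="model-serving", gpu_requirements=None
)
# Accepted: the same entry once the hardware is declared.
accepted = registration_error(
    service_id="embed-server",
    service_type="model-serving",
    gpu_requirements={"gpu_type": "a100", "min_count": 1},
)
```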

Do's and Don'ts

Having walked through the catalog model and golden path configuration propagation, the following imperatives distil the governance and wiring patterns into rules you can apply directly when designing your own catalog entries and golden paths.

Do's

✓Do declare gpu_requirements on every model-serving and training-job catalog entry — the field_validator on ServiceCatalogEntry rejects these service types when gpu_requirements is None, enforcing cost governance at schema validation time before any infrastructure is provisioned and preventing unplanned GPU budget overruns.
✓Do use write_service_config / read_service_config with a consistent namespaced key pattern (e.g., "team-alpha" / "vector-db/qdrant" / "connection") — this lets each golden path step discover predecessor outputs automatically from the platform control plane instead of hardcoding connection strings across services.
✓Do model catalog entries with a version field constrained to semver (^\d+\.\d+\.\d+$) and a deprecated flag — keeping entries versioned and deprecation-aware lets the orchestrator enforce ServiceDependency.version_constraint checks and gives teams a migration path without breaking existing golden path step sequences.

Don'ts

✗Don't omit gpu_requirements and assume the orchestrator will supply defaults — GPURequirement is a mandatory sub-model for ServiceType.MODEL_SERVING and ServiceType.TRAINING_JOB; skipping it raises a ValueError at registration time, and silently relying on platform defaults is exactly the ad-hoc provisioning pattern the catalog is designed to eliminate.
✗Don't hardcode connection strings in golden path step code — bypassing write_service_config / read_service_config breaks the automatic credential-propagation contract, meaning endpoint changes (including credential rotations) require manual updates across every consuming service rather than a single config-store write.
✗Don't register a new AI workload under an ad-hoc service_type string outside the ServiceType enum — the enum's kebab-case values (model-serving, vector-db, etc.) align with Kubernetes label conventions and are the values the GPU-requirement validator branches on. Pydantic rejects a string that is not an enum member, and loosening the field to a plain str to get around that would let a new type bypass both the field_validator and the orchestrator's resource-quota logic.
