Build a platform health dashboard with Prometheus metrics

Instrument the platform control plane with Prometheus counters, histograms, and gauges. Build Grafana dashboards that show service catalog usage, provisioning latency, and error rates.

Implement a platform health dashboard with Prometheus and Grafana

Introduction

Engineers often instrument individual services without realizing the platform infrastructure beneath them—namespace provisioners, cluster sync adapters, service catalog engines—can fail silently while every running workload reports green. A broken namespace provisioner blocks every new team from onboarding even as existing services hum along perfectly, and no application-level dashboard will surface that failure. This lesson teaches you how to build a platform health dashboard using Prometheus and Grafana that captures control plane functional health, workload aggregates, and golden path drift simultaneously, giving platform engineers the visibility they need to uphold the platform's self-service contract.

Key Terminology

  • Control Plane Functional Health — a measurement dimension that tracks whether platform infrastructure components—namespace provisioners, cluster sync adapters, and service catalog engines—can fulfill their operational contracts, distinct from application-level signals that report only on already-running workloads.
  • Golden Path Drift — the count of workloads that have diverged from their golden path template, captured by the platform_golden_path_drift_total Gauge and updated with .set() rather than .inc() because drift is a current-state measurement, not a cumulative event total.
  • Scrape Endpoint — the HTTP /metrics path opened by start_http_server(9090) that Prometheus polls on the interval defined in its scrape_configs; each control plane component runs its own PlatformMetricsCollector instance and exposes a separate endpoint for Prometheus to aggregate.
  • Metric Label Dimensions — the named key-value pairs registered on a metric (e.g., operation, catalog_entry, result on platform_catalog_operations_total) that allow Grafana panels to break down a single time series by any combination of those dimensions at query time.
  • PlatformMetricsCollector — a Python dataclass that centralizes metric registration for all platform control plane components, initializing Counter, Histogram, and Gauge instances in __post_init__ and exposing instrumentation methods like record_catalog_operation() and update_golden_path_drift().
  • OperationResult enum — a Python Enum whose string values ("success", "failure", "timeout") serve as the result label value on every catalog operation metric, ensuring label consistency across all instrumentation call sites and preventing cardinality explosion from free-form strings.

Concepts

The Blind Spot in Application-Level Dashboards

Application dashboards tell you whether running services are healthy. They cannot tell you whether the platform is capable of provisioning the next one. That gap creates a dangerous blind spot: a namespace provisioner can fail completely while every existing workload reports green, because namespace provisioning only executes during onboarding — it has no effect on services already running. Similarly, a cluster sync adapter can stop reconciling desired state while the workloads it last successfully configured continue operating normally.

This is the core justification for a dedicated platform health dashboard. Platform engineers must verify that the control plane can fulfill its promises (provisioning namespaces, synchronizing service catalog entries, distributing golden path templates) independently of whether the services it already provisioned are healthy. Without explicit instrumentation of control plane operations, platform incidents surface only when blocked teams file support tickets, by which point the engineers responsible for the platform can no longer act proactively.

Three Signal Dimensions and Why Each Is Necessary

A complete platform health dashboard covers three distinct dimensions, each answering a different question.

Control plane functional health asks whether platform operations succeed. Metrics like platform_catalog_operations_total and platform_namespace_provisioning_seconds directly instrument the operations the platform performs on behalf of developers. A spike in platform_catalog_operations_total{result="failure"} or a jump in the p99 of platform_namespace_provisioning_seconds signals a degraded platform early, often before a blocked team has filed a ticket.

Workload aggregates ask how many workloads the platform is currently managing and where. The platform_active_workloads Gauge, labeled by catalog_entry, cluster_name, and namespace, provides a live count that Grafana panels can group by service catalog entry or cluster to surface capacity and distribution patterns at a glance.

Golden path compliance asks whether running workloads still match the templates that defined them — connecting platform architecture decisions to runtime reality (see Code Walkthrough). Together, the three dimensions give platform engineers coverage that no single application dashboard can provide.
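As a rough sketch of how the three dimensions might map onto Grafana panels, the PromQL expressions below use only the metric and label names defined in this lesson. The panel keys and the 5m rate window are illustrative assumptions, not values from the lesson, and the expressions are held in a Python dict purely for convenience.

```python
# Illustrative PromQL expressions, one or two per signal dimension.
# Panel keys and the 5m window are assumptions; tune them to your scrape interval.
PANEL_QUERIES = {
    # Control plane functional health: share of catalog operations that failed.
    "catalog_failure_ratio": (
        'sum(rate(platform_catalog_operations_total{result="failure"}[5m]))'
        " / "
        "sum(rate(platform_catalog_operations_total[5m]))"
    ),
    # Control plane functional health: p99 provisioning latency per cluster.
    "provisioning_latency_p99": (
        "histogram_quantile(0.99, sum by (le, cluster_name) "
        "(rate(platform_namespace_provisioning_seconds_bucket[5m])))"
    ),
    # Workload aggregates: a Gauge is read directly, with no rate() wrapper.
    "active_workloads_by_entry": (
        "sum by (catalog_entry) (platform_active_workloads)"
    ),
    # Golden path compliance: current drift count per template and cluster.
    "golden_path_drift": (
        "sum by (golden_path, cluster_name) (platform_golden_path_drift_total)"
    ),
}

for panel, expr in PANEL_QUERIES.items():
    print(f"{panel}: {expr}")
```

Note that only the Counter and Histogram queries pass through rate(); the two Gauge queries aggregate the current value as-is.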

Counter vs. Gauge: Choosing the Right Metric Type

The choice between a Prometheus Counter and a Gauge is not stylistic: it determines both the semantics of the number and which PromQL functions make sense in a Grafana query against it.

A Counter only increases. It records the cumulative total of discrete events: catalog operations attempted, namespaces provisioned, sync runs completed. Grafana queries wrap Counters in rate() or increase() to derive a per-second rate or a window delta. platform_catalog_operations_total is a Counter because each catalog operation is a new event that should accumulate.

A Gauge can increase or decrease. It records a current state: how many workloads are active right now, how many have drifted from their golden path. platform_golden_path_drift_total is a Gauge precisely because drift count is a snapshot: if three workloads have drifted and all three are remediated, the count should fall to zero, which a Counter cannot express. (Prometheus naming conventions normally reserve the _total suffix for Counters, so the name alone should not be read as a sign that this metric accumulates.) Grafana stat panels display the current Gauge value directly without a rate() wrapper, making drift immediately readable as a single number on any dashboard.
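The difference can be seen in a few lines. The sketch below assumes the prometheus_client package is installed; the demo_ metric names and the private CollectorRegistry are illustrative choices made to keep the example isolated, not part of the lesson's collector.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

# A private registry keeps this demo separate from the default global one.
registry = CollectorRegistry()

ops = Counter(
    "demo_catalog_operations_total", "Catalog operations", ["result"],
    registry=registry,
)
drift = Gauge(
    "demo_golden_path_drift", "Drifted workloads", ["cluster_name"],
    registry=registry,
)

# Counter: every event adds to a running total.
for _ in range(3):
    ops.labels(result="success").inc()

# Gauge: each set() overwrites the previous reading, so the value can fall.
drift.labels(cluster_name="production").set(3)
drift.labels(cluster_name="production").set(0)  # all three remediated

# A Counter refuses to go down: a negative increment raises ValueError.
try:
    ops.labels(result="success").inc(-1)
    counter_rejected_decrease = False
except ValueError:
    counter_rejected_decrease = True

ops_value = registry.get_sample_value(
    "demo_catalog_operations_total", {"result": "success"}
)
drift_value = registry.get_sample_value(
    "demo_golden_path_drift", {"cluster_name": "production"}
)
print(ops_value, drift_value, counter_rejected_decrease)
```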

Code Walkthrough

With the three signal dimensions established (control plane functional health, workload aggregates, and golden path compliance), you can instrument each one with Prometheus metrics built for platform engineering concerns.

The prometheus_client library provides the metric primitives the control plane needs. The PlatformMetricsCollector class below registers a Counter for service catalog operations, Histograms for cluster sync duration and namespace provisioning latency, a Gauge for active workloads, and the platform_golden_path_drift_total Gauge from the Concepts section that tracks how many workloads have diverged from their golden path template. The OperationResult enum keeps label values consistent across every instrumentation call site.

```python
import time
from dataclasses import dataclass, field
from enum import Enum

from prometheus_client import Counter, Gauge, Histogram, start_http_server


class OperationResult(Enum):
    """Allowed values for the `result` label."""

    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


@dataclass
class PlatformMetricsCollector:
    """Central metrics registry for platform control plane components."""

    # Populated in __post_init__, so they are left out of the generated __init__.
    catalog_ops: Counter = field(init=False)
    sync_duration: Histogram = field(init=False)
    golden_path_drift: Gauge = field(init=False)
    provisioning_latency: Histogram = field(init=False)
    active_workloads: Gauge = field(init=False)

    def __post_init__(self):
        # Metrics register with the default global registry, so create only
        # one collector per process.
        self.catalog_ops = Counter(
            "platform_catalog_operations_total",
            "Total service catalog operations",
            ["operation", "catalog_entry", "result"],
        )
        self.sync_duration = Histogram(
            "platform_cluster_sync_duration_seconds",
            "Time spent synchronizing cluster state",
            ["cluster_name", "sync_type"],
            buckets=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
        )
        self.golden_path_drift = Gauge(
            "platform_golden_path_drift_total",
            "Workloads drifted from golden path template",
            ["golden_path", "cluster_name"],
        )
        self.provisioning_latency = Histogram(
            "platform_namespace_provisioning_seconds",
            "Namespace provisioning request latency",
            ["cluster_name", "tier"],
            buckets=[1.0, 5.0, 15.0, 30.0, 60.0, 120.0],
        )
        self.active_workloads = Gauge(
            "platform_active_workloads",
            "Currently active workloads by catalog entry",
            ["catalog_entry", "cluster_name", "namespace"],
        )

    def record_catalog_operation(
        self, operation: str, entry: str, result: OperationResult
    ):
        # One increment per catalog operation; result.value is the label string.
        self.catalog_ops.labels(
            operation=operation, catalog_entry=entry, result=result.value
        ).inc()

    def update_golden_path_drift(
        self, path_name: str, cluster: str, count: int
    ):
        # set() overwrites the previous reading: drift is current state.
        self.golden_path_drift.labels(
            golden_path=path_name, cluster_name=cluster
        ).set(count)


if __name__ == "__main__":
    # 9090 is also the Prometheus server's default port; pick another port
    # if Prometheus runs on the same host.
    start_http_server(9090)
    collector = PlatformMetricsCollector()
    collector.record_catalog_operation(
        "create", "ai-inference-v2", OperationResult.SUCCESS
    )
    collector.update_golden_path_drift("ml-serving", "production", 3)
    print("Metrics available at http://localhost:9090/metrics")

    # The HTTP server runs in a daemon thread, so keep the process alive
    # until interrupted; otherwise the endpoint disappears immediately.
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
```

The Counter carries three label dimensions—operation, catalog_entry, and result—so Grafana panels can break down catalog health by which operation type ran against which service catalog entry and whether it succeeded. The platform_golden_path_drift_total Gauge uses set() rather than inc() because drift is a current-state measurement, not a cumulative event count; Grafana stat panels display its current value directly, making it straightforward to build the drift detection panels described in the Concepts section. The start_http_server(9090) call opens the scrape endpoint that Prometheus polls on the interval defined in its scrape_configs; each control plane component runs its own instance of this collector and exposes its own /metrics path for Prometheus to aggregate.

To verify, run the snippet and, while it is still running, request http://localhost:9090/metrics. The response should include lines containing platform_catalog_operations_total and platform_golden_path_drift_total with the label values you passed.
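The same check can be scripted without opening a port. The sketch below is one possible approach, assuming prometheus_client is installed: it registers the two metrics on a private CollectorRegistry (an assumption made here so the check stays isolated from the default registry) and inspects the text that a /metrics request would return, using generate_latest().

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest

# Private registry so this check does not collide with the default one.
registry = CollectorRegistry()

catalog_ops = Counter(
    "platform_catalog_operations_total",
    "Total service catalog operations",
    ["operation", "catalog_entry", "result"],
    registry=registry,
)
golden_path_drift = Gauge(
    "platform_golden_path_drift_total",
    "Workloads drifted from golden path template",
    ["golden_path", "cluster_name"],
    registry=registry,
)

catalog_ops.labels(
    operation="create", catalog_entry="ai-inference-v2", result="success"
).inc()
golden_path_drift.labels(golden_path="ml-serving", cluster_name="production").set(3)

# generate_latest() renders the registry in the Prometheus text format.
exposition = generate_latest(registry).decode("utf-8")

# Keep only sample lines; lines starting with '#' are HELP/TYPE metadata.
sample_lines = [
    line for line in exposition.splitlines()
    if line and not line.startswith("#")
]

ops_lines = [
    line for line in sample_lines
    if line.startswith("platform_catalog_operations_total{")
]
drift_lines = [
    line for line in sample_lines
    if line.startswith("platform_golden_path_drift_total{")
]

for line in ops_lines + drift_lines:
    print(line)
```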

Do's and Don'ts

With the PlatformMetricsCollector instrumentation in place, the following practices keep those metrics actionable as the control plane scales across clusters and catalog entries.

Do's

  1. Do use Gauge.set() for platform_golden_path_drift_total: drift is a current-state snapshot, not a cumulative event count, so set() replaces the previous reading each time the collector recomputes drift. Using inc() instead would make the value grow monotonically and render Grafana stat panels useless as a real-time signal.
  2. Do instrument namespace provisioning and cluster sync as separate Histograms with cluster-scoped labels: platform_namespace_provisioning_seconds and platform_cluster_sync_duration_seconds each carry a cluster_name label so Grafana can surface per-cluster latency breakdowns. Collapsing both signals into a single metric, or dropping the label, hides which control plane component is degrading.
  3. Do run a dedicated start_http_server(9090) per control plane component and let Prometheus aggregate across scrape targets: each namespace provisioner, cluster sync adapter, and service catalog engine exposes its own /metrics endpoint. Centralizing all instrumentation in a single process makes it impossible to tell which control plane function is failing while application-level workload metrics are still green. Components that share a host each need a distinct port.
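The collector in the Code Walkthrough registers both Histograms but defines no method that records into them. A minimal sketch of how that recording might look follows, assuming prometheus_client is installed. The function names provision_namespace and record_sync, the label values "standard" and "full", the sleep that stands in for real work, and the private registry are all hypothetical.

```python
import time

from prometheus_client import CollectorRegistry, Histogram

# A private registry keeps this sketch isolated from the default global one.
registry = CollectorRegistry()

provisioning_latency = Histogram(
    "platform_namespace_provisioning_seconds",
    "Namespace provisioning request latency",
    ["cluster_name", "tier"],
    buckets=[1.0, 5.0, 15.0, 30.0, 60.0, 120.0],
    registry=registry,
)
sync_duration = Histogram(
    "platform_cluster_sync_duration_seconds",
    "Time spent synchronizing cluster state",
    ["cluster_name", "sync_type"],
    buckets=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
    registry=registry,
)


def provision_namespace(cluster: str, tier: str) -> None:
    """Hypothetical provisioner entry point, timed by the Histogram."""
    # time() measures the duration of the with-block and observes it.
    with provisioning_latency.labels(cluster_name=cluster, tier=tier).time():
        time.sleep(0.01)  # stand-in for the real provisioning work


def record_sync(cluster: str, sync_type: str, seconds: float) -> None:
    """Record a sync duration that was measured elsewhere."""
    sync_duration.labels(cluster_name=cluster, sync_type=sync_type).observe(seconds)


provision_namespace("production", "standard")
record_sync("production", "full", 2.0)

prov_count = registry.get_sample_value(
    "platform_namespace_provisioning_seconds_count",
    {"cluster_name": "production", "tier": "standard"},
)
sync_sum = registry.get_sample_value(
    "platform_cluster_sync_duration_seconds_sum",
    {"cluster_name": "production", "sync_type": "full"},
)
print(prov_count, sync_sum)
```

Because both Histograms carry cluster_name, a single slow cluster shows up in its own series instead of being averaged into a fleet-wide number.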

Don'ts

  1. Don't omit the result label from platform_catalog_operations_total: the Counter is declared with three label dimensions (operation, catalog_entry, result) precisely so Grafana panels can distinguish successful from failed catalog operations per entry. Without result, a flood of FAILURE or TIMEOUT outcomes is indistinguishable from healthy traffic.
  2. Don't reuse application-level service dashboards to infer control plane health: a broken namespace provisioner blocks every new team from onboarding while existing workloads stay green and report nothing wrong. Only dedicated platform metrics like platform_namespace_provisioning_seconds and platform_golden_path_drift_total surface these silent control plane failures.
  3. Don't use free-form strings as label values in place of OperationResult enum members: the OperationResult enum (SUCCESS, FAILURE, TIMEOUT) keeps label values consistent across every record_catalog_operation call site. Bypassing it with ad-hoc strings like "ok" or "err" silently creates new label series, so a query that filters on result="failure" undercounts and any error ratio built on it is wrong.
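One way to enforce the last point at runtime is to coerce every result through the enum before it reaches the label. The sketch below is a hypothetical stricter variant of record_catalog_operation, not the lesson's method; it assumes prometheus_client is installed and uses a private registry for isolation.

```python
from enum import Enum

from prometheus_client import CollectorRegistry, Counter


class OperationResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


# Private registry so the sketch does not touch the default global one.
registry = CollectorRegistry()
catalog_ops = Counter(
    "platform_catalog_operations_total",
    "Total service catalog operations",
    ["operation", "catalog_entry", "result"],
    registry=registry,
)


def record_catalog_operation(operation: str, entry: str, result) -> None:
    """Hypothetical variant that rejects label values outside the enum."""
    # OperationResult(...) accepts a member or one of its string values and
    # raises ValueError for anything else, such as "ok" or "err".
    validated = OperationResult(result)
    catalog_ops.labels(
        operation=operation, catalog_entry=entry, result=validated.value
    ).inc()


record_catalog_operation("create", "ai-inference-v2", OperationResult.SUCCESS)
record_catalog_operation("create", "ai-inference-v2", "success")  # same series

try:
    record_catalog_operation("create", "ai-inference-v2", "ok")
    rejected = False
except ValueError:
    rejected = True  # the ad-hoc string never becomes a label value

success_count = registry.get_sample_value(
    "platform_catalog_operations_total",
    {"operation": "create", "catalog_entry": "ai-inference-v2", "result": "success"},
)
print(success_count, rejected)
```

Failing loudly at the call site is a design choice; a gentler alternative would be to map unknown values to a single fallback label, at the cost of hiding the bad call site.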
