Build a platform health dashboard with Prometheus metrics
Instrument the platform control plane with Prometheus counters, histograms, and gauges. Build Grafana dashboards that show service catalog usage, provisioning latency, and error rates.
Implement a platform health dashboard with Prometheus and Grafana
Introduction
Engineers often instrument individual services without realizing the platform infrastructure beneath them—namespace provisioners, cluster sync adapters, service catalog engines—can fail silently while every running workload reports green. A broken namespace provisioner blocks every new team from onboarding even as existing services hum along perfectly, and no application-level dashboard will surface that failure. This lesson teaches you how to build a platform health dashboard using Prometheus and Grafana that captures control plane functional health, workload aggregates, and golden path drift simultaneously, giving platform engineers the visibility they need to uphold the platform's self-service contract.
Key Terminology
- Control Plane Functional Health: a measurement dimension that tracks whether platform infrastructure components (namespace provisioners, cluster sync adapters, and service catalog engines) can fulfill their operational contracts, distinct from application-level signals that report only on already-running workloads.
- Golden Path Drift: the count of workloads that have diverged from their golden path template, captured by the platform_golden_path_drift_total Gauge and updated with .set() rather than .inc() because drift is a current-state measurement, not a cumulative event total.
- Scrape Endpoint: the HTTP /metrics path opened by start_http_server(9090) that Prometheus polls on the interval defined in its scrape_configs; each control plane component runs its own PlatformMetricsCollector instance and exposes a separate endpoint for Prometheus to aggregate.
- Metric Label Dimensions: the label names declared on a metric (e.g., operation, catalog_entry, result on platform_catalog_operations_total). Each distinct combination of label values produces its own time series, which lets Grafana panels break down a single metric by any combination of those dimensions at query time.
- PlatformMetricsCollector: a Python dataclass that centralizes metric registration for all platform control plane components, initializing Counter, Histogram, and Gauge instances in __post_init__ and exposing instrumentation methods like record_catalog_operation() and update_golden_path_drift().
- OperationResult enum: a Python Enum whose string values ("success", "failure", "timeout") serve as the result label value on every catalog operation metric, ensuring label consistency across all instrumentation call sites and preventing cardinality explosion from free-form strings.
Concepts
The Blind Spot in Application-Level Dashboards
Application dashboards tell you whether running services are healthy. They cannot tell you whether the platform is capable of provisioning the next one. That gap creates a dangerous blind spot: a namespace provisioner can fail completely while every existing workload reports green, because namespace provisioning only executes during onboarding — it has no effect on services already running. Similarly, a cluster sync adapter can stop reconciling desired state while the workloads it last successfully configured continue operating normally.
This is the core justification for a dedicated platform health dashboard. Platform engineers must verify that the control plane can fulfill its promises — provision namespaces, synchronize service catalog entries, distribute golden path templates — independently of whether the services it already provisioned are healthy. Without explicit instrumentation of control plane operations, platform incidents surface only when blocked teams file support tickets, not when the engineers responsible for the platform can still act proactively.
Three Signal Dimensions and Why Each Is Necessary
A complete platform health dashboard covers three distinct dimensions, each answering a different question.
Control plane functional health asks whether platform operations succeed. Metrics like platform_catalog_operations_total and platform_namespace_provisioning_seconds directly instrument the operations the platform performs on behalf of developers. A spike in platform_catalog_operations_total{result="failure"} or a p99 latency jump in platform_namespace_provisioning_seconds signals a degraded platform before any downstream team is impacted.
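As a rough illustration of the two signals just described, the sketch below builds PromQL expressions that a Grafana panel might use. The helper names and the 5m default window are assumptions made for this example; the metric and label names come from this lesson, and rate(), sum by, and histogram_quantile() are standard PromQL.

```python
def catalog_failure_ratio(window: str = "5m") -> str:
    """PromQL: share of catalog operations that failed over `window`."""
    failed = (
        f'sum(rate(platform_catalog_operations_total{{result="failure"}}[{window}]))'
    )
    total = f"sum(rate(platform_catalog_operations_total[{window}]))"
    return f"{failed} / {total}"


def provisioning_p99(window: str = "5m") -> str:
    """PromQL: p99 namespace provisioning latency per cluster over `window`."""
    # Histograms expose cumulative *_bucket series carrying an `le` label.
    buckets = f"rate(platform_namespace_provisioning_seconds_bucket[{window}])"
    return f"histogram_quantile(0.99, sum by (le, cluster_name) ({buckets}))"


print(catalog_failure_ratio())
print(provisioning_p99("10m"))
```

The printed strings can be pasted into a Grafana query editor. The right window and alert threshold depend on how often the platform actually performs these operations.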
Workload aggregates ask how many workloads the platform is currently managing and where. The platform_active_workloads Gauge, labeled by catalog_entry, cluster_name, and namespace, provides a live count that Grafana panels can group by service catalog entry or cluster to surface capacity and distribution patterns at a glance.
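One way the platform_active_workloads Gauge could be kept current is sketched below. The inventory rows and the publish_active_workloads helper are hypothetical; the metric name and labels match this lesson, and a private CollectorRegistry is used only to keep the example isolated.

```python
from collections import Counter as TallyCounter

from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch separate from the default one.
registry = CollectorRegistry()
active_workloads = Gauge(
    "platform_active_workloads",
    "Currently active workloads by catalog entry",
    ["catalog_entry", "cluster_name", "namespace"],
    registry=registry,
)

# Hypothetical inventory rows: (catalog_entry, cluster_name, namespace).
inventory = [
    ("ai-inference-v2", "production", "team-a"),
    ("ai-inference-v2", "production", "team-a"),
    ("ai-inference-v2", "staging", "team-b"),
]


def publish_active_workloads(rows):
    """Set the gauge to the current count for each label combination."""
    for (entry, cluster, namespace), count in TallyCounter(rows).items():
        active_workloads.labels(
            catalog_entry=entry, cluster_name=cluster, namespace=namespace
        ).set(count)


publish_active_workloads(inventory)
```

A label combination that disappears from the inventory keeps its last value in this sketch, so a real updater would also need to zero or remove stale series.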
Golden path compliance asks whether running workloads still match the templates that defined them — connecting platform architecture decisions to runtime reality (see Code Walkthrough). Together, the three dimensions give platform engineers coverage that no single application dashboard can provide.
Counter vs. Gauge: Choosing the Right Metric Type
The choice between a Prometheus Counter and a Gauge is not stylistic — it determines both the semantics of the number and which Grafana query functions apply to it.
A Counter only increases. It records the cumulative total of discrete events: catalog operations attempted, namespaces provisioned, sync runs completed. Grafana queries wrap Counters in rate() or increase() to derive a per-second rate or a window delta. platform_catalog_operations_total is a Counter because each catalog operation is a new event that should accumulate.
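To make the idea behind increase() and rate() concrete, here is a simplified pure-Python sketch that works on raw counter samples. It is not how Prometheus implements these functions (Prometheus also extrapolates to the window boundaries); the function names and the sample data are assumptions for illustration.

```python
def simple_increase(samples):
    """Sum the growth between consecutive (timestamp, value) counter samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as new growth rather than a
    negative delta.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def simple_rate(samples):
    """Per-second rate: the increase divided by the time the samples span."""
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# Hypothetical scrapes 15 seconds apart; the counter resets before the last one.
scrapes = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0)]
print(simple_increase(scrapes))
print(simple_rate(scrapes))
```

Because a Counter never decreases on its own, any decrease between samples can safely be read as a restart. That property is what makes rate-style queries meaningful for Counters but not for Gauges.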
A Gauge can increase or decrease. It records a current state: how many workloads are active right now, how many have drifted from their golden path. platform_golden_path_drift_total is a Gauge because the drift count is a snapshot: if three workloads have drifted and all three are remediated, the count should fall to zero, which a Counter cannot express. Grafana stat panels can display the current Gauge value directly without a rate() wrapper, so drift reads as a single number on the dashboard. One caveat on naming: Prometheus conventions generally reserve the _total suffix for Counters, so a Gauge carrying it may surprise readers of the metric; this lesson keeps the name as written.
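The difference can be seen in a few lines with prometheus_client. This is a minimal sketch using a private CollectorRegistry so the values can be read back directly; the label values are the ones used elsewhere in this lesson.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()  # private registry for an isolated demo

catalog_ops = Counter(
    "platform_catalog_operations_total",
    "Total service catalog operations",
    ["operation", "catalog_entry", "result"],
    registry=registry,
)
drift = Gauge(
    "platform_golden_path_drift_total",
    "Workloads drifted from golden path template",
    ["golden_path", "cluster_name"],
    registry=registry,
)

ops_labels = {
    "operation": "create",
    "catalog_entry": "ai-inference-v2",
    "result": "success",
}
drift_labels = {"golden_path": "ml-serving", "cluster_name": "production"}

# Counter: every event adds to the running total.
catalog_ops.labels(**ops_labels).inc()
catalog_ops.labels(**ops_labels).inc()

# Gauge: each set() overwrites the previous reading, so the value can fall.
drift.labels(**drift_labels).set(3)
drift_before = registry.get_sample_value(
    "platform_golden_path_drift_total", drift_labels
)
drift.labels(**drift_labels).set(0)  # all three workloads remediated
drift_after = registry.get_sample_value(
    "platform_golden_path_drift_total", drift_labels
)

ops_total = registry.get_sample_value(
    "platform_catalog_operations_total", ops_labels
)
print(ops_total, drift_before, drift_after)
```

The Counter ends at two because both events accumulated, while the Gauge goes from three back to zero. Calling inc() on a Counter with a negative amount raises ValueError, which is the library enforcing the "only increases" rule.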
Code Walkthrough
Platform observability covers three distinct signal dimensions: control plane functional health, workload aggregates, and golden path compliance. Each can be instrumented with Prometheus metrics built specifically for platform engineering concerns.
The prometheus_client library provides the metric primitives the control plane needs. The PlatformMetricsCollector class below registers a Counter for service catalog operations, Histograms for cluster sync duration and namespace provisioning latency, a Gauge for active workloads, and the platform_golden_path_drift_total Gauge from the Concepts section that tracks how many workloads have diverged from their golden path template. The OperationResult enum keeps label values consistent across every instrumentation call site.
```python
import time
from dataclasses import dataclass, field
from enum import Enum

from prometheus_client import Counter, Gauge, Histogram, start_http_server


class OperationResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


@dataclass
class PlatformMetricsCollector:
    """Central metrics registry for platform control plane components."""

    # Metric handles are created in __post_init__, not passed in by callers.
    catalog_ops: Counter = field(init=False)
    sync_duration: Histogram = field(init=False)
    golden_path_drift: Gauge = field(init=False)
    provisioning_latency: Histogram = field(init=False)
    active_workloads: Gauge = field(init=False)

    def __post_init__(self):
        self.catalog_ops = Counter(
            "platform_catalog_operations_total",
            "Total service catalog operations",
            ["operation", "catalog_entry", "result"],
        )
        self.sync_duration = Histogram(
            "platform_cluster_sync_duration_seconds",
            "Time spent synchronizing cluster state",
            ["cluster_name", "sync_type"],
            buckets=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
        )
        self.golden_path_drift = Gauge(
            "platform_golden_path_drift_total",
            "Workloads drifted from golden path template",
            ["golden_path", "cluster_name"],
        )
        self.provisioning_latency = Histogram(
            "platform_namespace_provisioning_seconds",
            "Namespace provisioning request latency",
            ["cluster_name", "tier"],
            buckets=[1.0, 5.0, 15.0, 30.0, 60.0, 120.0],
        )
        self.active_workloads = Gauge(
            "platform_active_workloads",
            "Currently active workloads by catalog entry",
            ["catalog_entry", "cluster_name", "namespace"],
        )

    def record_catalog_operation(
        self, operation: str, entry: str, result: OperationResult
    ):
        # Pass the enum's string value so label values stay consistent.
        self.catalog_ops.labels(
            operation=operation, catalog_entry=entry, result=result.value
        ).inc()

    def update_golden_path_drift(
        self, path_name: str, cluster: str, count: int
    ):
        # set() replaces the previous reading; drift is a current-state value.
        self.golden_path_drift.labels(
            golden_path=path_name, cluster_name=cluster
        ).set(count)


if __name__ == "__main__":
    start_http_server(9090)
    collector = PlatformMetricsCollector()

    collector.record_catalog_operation(
        "create", "ai-inference-v2", OperationResult.SUCCESS
    )
    collector.update_golden_path_drift("ml-serving", "production", 3)
    print("Metrics available at http://localhost:9090/metrics")

    # start_http_server serves from a daemon thread, so keep the main thread
    # alive or the process exits before anything can scrape the endpoint.
    try:
        while True:
            time.sleep(60)
    except KeyboardInterrupt:
        pass
```
The Counter carries three label dimensions—operation, catalog_entry, and result—so Grafana panels can break down catalog health by which operation type ran against which service catalog entry and whether it succeeded. The platform_golden_path_drift_total Gauge uses set() rather than inc() because drift is a current-state measurement, not a cumulative event count; Grafana stat panels display its current value directly, making it straightforward to build the drift detection panels described in the Concepts section. The start_http_server(9090) call opens the scrape endpoint that Prometheus polls on the interval defined in its scrape_configs; each control plane component runs its own instance of this collector and exposes its own /metrics path for Prometheus to aggregate.
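Run the snippet and, while the process is still running, query http://localhost:9090/metrics. The response should include lines containing platform_catalog_operations_total and platform_golden_path_drift_total with the label values you passed. Port 9090 is also the default listen port of the Prometheus server itself, so pick a different port if both run on the same host.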
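The same check can be done without an HTTP server. The sketch below assumes a private CollectorRegistry and uses generate_latest from prometheus_client, which renders the text format that a /metrics endpoint serves; the metrics are re-declared here only so the example stands on its own.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest

registry = CollectorRegistry()  # private registry, separate from the default

catalog_ops = Counter(
    "platform_catalog_operations_total",
    "Total service catalog operations",
    ["operation", "catalog_entry", "result"],
    registry=registry,
)
drift = Gauge(
    "platform_golden_path_drift_total",
    "Workloads drifted from golden path template",
    ["golden_path", "cluster_name"],
    registry=registry,
)

catalog_ops.labels(
    operation="create", catalog_entry="ai-inference-v2", result="success"
).inc()
drift.labels(golden_path="ml-serving", cluster_name="production").set(3)

# Render the exposition text and drop the "# HELP" / "# TYPE" comment lines.
exposition = generate_latest(registry).decode("utf-8")
metric_lines = [
    line for line in exposition.splitlines() if line and not line.startswith("#")
]
for line in metric_lines:
    print(line)
```

Each printed line is a metric name, its label set in braces, and the current value. This approach is also convenient in unit tests for instrumentation code.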
Do's and Don'ts
Having walked through the PlatformMetricsCollector instrumentation, the following practices keep those metrics actionable as the control plane scales across clusters and catalog entries.
Do's
- ✓ Do use Gauge.set() for platform_golden_path_drift_total. Drift is a current-state snapshot, not a cumulative event count, so set() correctly replaces the previous reading each time the drift check runs; using inc() instead would cause the value to grow monotonically and make Grafana stat panels unreadable as a real-time signal.
- ✓ Do instrument namespace provisioning and cluster sync as separate Histograms with cluster-scoped labels. platform_namespace_provisioning_seconds and platform_cluster_sync_duration_seconds each carry a cluster_name label so Grafana can surface per-cluster latency breakdowns; collapsing both signals into a single metric or dropping the label hides the exact control plane component that is degrading.
- ✓ Do run a dedicated start_http_server(9090) per control plane component and let Prometheus aggregate across scrape targets. Each namespace provisioner, cluster sync adapter, and service catalog engine exposes its own /metrics endpoint; centralizing all instrumentation into a single process defeats per-component isolation and makes it impossible to isolate which control plane function is failing when application-level workload metrics are still green.
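The second practice above, separate Histograms that each carry a cluster_name label, might be exercised as in the sketch below. The collector class in this lesson does not define recording methods for its Histograms, so the provision_namespace function and the label values are hypothetical; .time() and .observe() are existing prometheus_client Histogram methods.

```python
import time

from prometheus_client import CollectorRegistry, Histogram

registry = CollectorRegistry()  # private registry for an isolated demo

provisioning_latency = Histogram(
    "platform_namespace_provisioning_seconds",
    "Namespace provisioning request latency",
    ["cluster_name", "tier"],
    buckets=[1.0, 5.0, 15.0, 30.0, 60.0, 120.0],
    registry=registry,
)
sync_duration = Histogram(
    "platform_cluster_sync_duration_seconds",
    "Time spent synchronizing cluster state",
    ["cluster_name", "sync_type"],
    buckets=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
    registry=registry,
)


def provision_namespace(cluster: str, tier: str) -> None:
    """Hypothetical provisioner call; the sleep stands in for real work."""
    # .time() observes the elapsed seconds when the with-block exits.
    with provisioning_latency.labels(cluster_name=cluster, tier=tier).time():
        time.sleep(0.01)


provision_namespace("production", "standard")

# A duration measured elsewhere can be recorded directly with observe().
sync_duration.labels(cluster_name="production", sync_type="full").observe(0.3)
```

Keeping the two Histograms separate lets each use buckets suited to its own latency range, which is why the provisioning buckets reach 120 seconds while the sync buckets start at 0.1.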
Don'ts
- ✗ Don't omit the result label from platform_catalog_operations_total. The Counter is declared with three label dimensions (operation, catalog_entry, result) precisely so Grafana panels can distinguish successful from failed catalog operations per entry; dropping result and relying solely on absence of success events prevents the dashboard from detecting a flood of FAILURE or TIMEOUT results from OperationResult enum values.
- ✗ Don't reuse application-level service dashboards to infer control plane health. A broken namespace provisioner blocks every new team from onboarding while existing workloads stay green and report nothing wrong; only dedicated platform metrics like platform_namespace_provisioning_seconds and platform_golden_path_drift_total surface these silent control plane failures.
- ✗ Don't use free-form strings as label values in place of OperationResult enum members. The OperationResult enum (SUCCESS, FAILURE, TIMEOUT) keeps label values consistent across every record_catalog_operation call site; bypassing it with ad-hoc strings like "ok" or "err" silently creates new label series that fragment Grafana queries and make rate calculations incorrect.
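The last point can be enforced at the boundary where raw strings enter the system. The to_result_label helper below is a hypothetical addition, not part of the lesson's collector; it relies only on the standard behavior that looking up an Enum by value raises ValueError for non-members.

```python
from enum import Enum


class OperationResult(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"


def to_result_label(raw: str) -> str:
    """Return a valid `result` label value, or raise on anything else.

    Hypothetical helper: strings such as "ok" or "err" are rejected here,
    so they never reach the metric and never create a stray series.
    """
    return OperationResult(raw.strip().lower()).value


print(to_result_label("SUCCESS"))

try:
    to_result_label("ok")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Failing loudly on an unknown value is one design choice; mapping unknown inputs to FAILURE would be another, at the cost of hiding the upstream inconsistency.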