Create ADR governance dashboard and compliance audit

You will build an `ADRGovernanceDashboard` that provides organization-wide visibility into architecture decision health, coverage gaps, and compliance status.

Implement `CoverageAnalyzer` with `compute_coverage()`, which scans the system's deployed services from a `deployed_services` PostgreSQL table and cross-references them against `architecture_decisions` to identify services with no corresponding ADRs. The table has columns `service_id VARCHAR(64) PRIMARY KEY`, `service_name VARCHAR(128)`, `team VARCHAR(64)`, `deployed_at TIMESTAMPTZ`, `technology_stack JSONB`, and `has_llm_integration BOOLEAN`. The method returns an `ADRCoverageReport` Pydantic model with `total_services: int`, `covered_services: int`, `coverage_pct: float`, and `uncovered_services: list[UncoveredService]`, where each entry has `service_name: str`, `team: str`, and a `risk_level: str` based on whether the service has LLM integration without documented decisions. Build a Grafana panel showing coverage percentage as a gauge with thresholds at 60% (yellow) and 80% (green), plus a table listing uncovered services sorted by risk.

Implement `ReviewWorkflowEngine` with `submit_for_review()`, which transitions an ADR from PROPOSED to PENDING_REVIEW and creates an entry in the `adr_reviews` table with columns `review_id VARCHAR(64) PRIMARY KEY`, `adr_id VARCHAR(64)`, `reviewer VARCHAR(64)`, `due_date TIMESTAMPTZ`, `status VARCHAR(16)` (PENDING, APPROVED, REJECTED, EXPIRED), `review_comment TEXT`, and `reviewed_at TIMESTAMPTZ`. Build `check_expiry()`, running as a daily scheduled task via an `asyncio` background task, which marks ADRs without review activity past `due_date` as EXPIRED and emits the `adr_reviews_expired_total` Prometheus counter. Implement `POST /api/v1/adrs/{adr_id}/approve` accepting `ApprovalRequest` with `reviewer: str` and `review_comment: str`, and `POST /api/v1/adrs/{adr_id}/reject` accepting `RejectionRequest` with `reviewer: str`, `review_comment: str`, and `required_changes: list[str]`.
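The coverage cross-reference that `compute_coverage()` performs can be sketched in memory before any SQL is written. This is a minimal sketch under stated assumptions: plain dataclasses stand in for the Pydantic models, `adr_service_ids` is a hypothetical set of service ids that already have at least one ADR (the brief does not say how `architecture_decisions` links to services), and the `"high"`/`"medium"` risk labels are illustrative values.

```python
from dataclasses import dataclass


@dataclass
class UncoveredService:
    service_name: str
    team: str
    risk_level: str


@dataclass
class ADRCoverageReport:
    total_services: int
    covered_services: int
    coverage_pct: float
    uncovered_services: list[UncoveredService]


def compute_coverage(services: list[dict], adr_service_ids: set[str]) -> ADRCoverageReport:
    """Cross-reference deployed services against the ids that have an ADR."""
    uncovered = [
        UncoveredService(
            service_name=s["service_name"],
            team=s["team"],
            # Assumed rule: an undocumented service with LLM integration is high risk.
            risk_level="high" if s["has_llm_integration"] else "medium",
        )
        for s in services
        if s["service_id"] not in adr_service_ids
    ]
    # High-risk rows first, then alphabetical, matching "sorted by risk".
    uncovered.sort(key=lambda u: (u.risk_level != "high", u.service_name))
    total = len(services)
    covered = total - len(uncovered)
    pct = round(100.0 * covered / total, 1) if total else 0.0
    return ADRCoverageReport(total, covered, pct, uncovered)


# Hypothetical rows shaped like the deployed_services table.
services = [
    {"service_id": "svc-1", "service_name": "billing", "team": "payments", "has_llm_integration": False},
    {"service_id": "svc-2", "service_name": "chat-assist", "team": "support", "has_llm_integration": True},
    {"service_id": "svc-3", "service_name": "search", "team": "platform", "has_llm_integration": False},
]
report = compute_coverage(services, adr_service_ids={"svc-3"})
```

In a real implementation the two inputs would come from queries against `deployed_services` and `architecture_decisions`; keeping the computation a pure function makes the gauge value easy to unit test.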
Build `ComplianceReportGenerator` with `generate_report()`, which aggregates total ADRs by status using `SELECT status, COUNT(*) FROM architecture_decisions GROUP BY status`, average time from PROPOSED to ACCEPTED using `AVG(accepted_at - created_at)`, coverage percentage, stale ADR count, and open conflicts count from the dependency graph. It outputs a `GovernanceComplianceReport` Pydantic model with `report_id: str`, `generated_at: datetime`, `total_adrs: int`, `adrs_by_status: dict[str, int]`, `avg_review_time_days: float`, `coverage_pct: float`, `stale_count: int`, `conflict_count: int`, and `governance_score: float`. Expose a `GET /api/v1/governance/report` endpoint returning the report.

Create a comprehensive Grafana dashboard with panels: ADR status distribution pie chart, coverage gauge, staleness trend line over 90 days, review pipeline funnel (PROPOSED -> PENDING -> APPROVED), and conflict count time series. Emit an `adr_governance_score` Prometheus gauge computed as a weighted average of coverage (40%), freshness (30%), and conflict-free ratio (30%).
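The 40/30/30 weighting can be worked through with a small function. This is a sketch, not the lesson's implementation: the brief gives the weights but not the component definitions, so the code assumes freshness is the share of ADRs that are not stale, the conflict-free ratio is one minus conflicts per ADR, and the score is reported on a 0 to 100 scale. The function name `governance_score` is hypothetical.

```python
def _clamp01(x: float) -> float:
    """Keep a ratio inside [0, 1] so bad inputs cannot push the score out of range."""
    return max(0.0, min(1.0, x))


def governance_score(
    coverage_pct: float, stale_count: int, conflict_count: int, total_adrs: int
) -> float:
    """Weighted average: coverage 40%, freshness 30%, conflict-free ratio 30%."""
    if total_adrs == 0:
        return 0.0  # assumed convention: no ADRs means no governance to score
    coverage = _clamp01(coverage_pct / 100.0)
    freshness = _clamp01(1.0 - stale_count / total_adrs)           # assumed definition
    conflict_free = _clamp01(1.0 - conflict_count / total_adrs)    # assumed definition
    return round(100.0 * (0.4 * coverage + 0.3 * freshness + 0.3 * conflict_free), 1)


# 80% coverage, 2 of 10 ADRs stale, 1 open conflict:
# 0.4 * 0.8 + 0.3 * 0.8 + 0.3 * 0.9 = 0.83
score = governance_score(coverage_pct=80.0, stale_count=2, conflict_count=1, total_adrs=10)
```

The returned value is what the `adr_governance_score` gauge would be set to on each report run.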

ADR governance dashboard and compliance audit

Introduction

When you build ADR machinery without surfacing it through two distinct readers — an operator who needs live state and an auditor who needs an immutable snapshot — the governance program collapses into a single live page nobody can cite three months later, and SOC 2 evidence collection becomes an archaeology project. Teams that conflate the live dashboard with the compliance artifact lose trust the first time two readers compare notes a week apart and see different numbers. By the end you'll be able to split a single compliance evaluator into a Prometheus scrape path for operators and a hashed JSON-plus-markdown artifact for auditors, and reason about why those two surfaces must never share a code path.

Key Terminology

ComplianceReport: The serializable audit artifact carrying findings, multi-axis fleet_counts, and a snapshot_hash; it renders to canonical JSON for CI archives and to markdown for the leadership email.
snapshot_hash: A SHA-256 digest computed over the sort-keyed report body so any reader can verify an archived artifact matches what leadership saw.
Operational/audit boundary: The hard separation between the live Prometheus exporter (source of truth for the operator) and the hashed JSON-plus-markdown artifact (source of truth for the auditor) — the two surfaces must never share a code path.

Concepts

Audit cadence and the operational/audit boundary

The full ComplianceAuditor.audit runs in two contexts. First, as a CI step on every merge to main that touches the ADR registry directory—the resulting JSON is uploaded as a build artifact and the markdown is posted as a PR comment so reviewers can see compliance impact before merge. Second, as a Kubernetes CronJob that fires nightly at 02:00 UTC, writes the hashed JSON to the audit archive bucket, and on the first of each month additionally renders the markdown into a leadership email rolled up by team and owner. The CronJob is the source of truth for the auditor; the live exporter is the source of truth for the operator.

The leadership email derives entirely from the audit report—not from the live exporter—so the numbers in the CEO's inbox match the JSON archived in the bucket and the hash printed at the bottom of the email. The rollup-by-owner table comes from fleet_counts["reviewer"] sorted descending, with each row annotated by how many of that reviewer's ADRs are SLA-breached. This single table is what surfaces the bottleneck engineer who is silently holding up half the governance program.
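The rollup described above can be sketched as a sort plus a lookup. The names `reviewer_rollup` and `render_rollup_markdown`, and the separate `breaches_by_reviewer` mapping, are assumptions for illustration; the lesson does not show how the email table is rendered.

```python
def reviewer_rollup(
    by_reviewer: dict[str, int], breaches_by_reviewer: dict[str, int]
) -> list[tuple[str, int, int]]:
    """Rows of (reviewer, ADR count, SLA-breached count), largest owner first."""
    rows = [(r, n, breaches_by_reviewer.get(r, 0)) for r, n in by_reviewer.items()]
    # Descending by ADR count; ties broken by name so the output is stable.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def render_rollup_markdown(rows: list[tuple[str, int, int]]) -> str:
    """Render the rollup as a markdown table for the leadership email."""
    lines = ["| Reviewer | ADRs | SLA-breached |", "| --- | --- | --- |"]
    lines += [f"| {r} | {n} | {b} |" for r, n, b in rows]
    return "\n".join(lines)


# Hypothetical counts taken from an audit report.
fleet_counts = {"reviewer": {"bob": 3, "alice": 7, "carol": 3}}
breaches = {"alice": 4, "carol": 1}

rows = reviewer_rollup(fleet_counts["reviewer"], breaches)
table = render_rollup_markdown(rows)
```

A reviewer with a large ADR count and a large breach count in the same row is the bottleneck the paragraph above describes.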

Operating discipline

Treat ComplianceReport.snapshot_hash as load-bearing. Any artifact that lands in the archive bucket without a verifiable hash is corrupt; the monthly email must print the hash so any reader can verify the JSON they pull from archive matches what leadership saw.
Never serve the auditor surface from the live exporter endpoint. If a regulator asks for last March's compliance state, point at the archived hashed JSON, not at a Grafana time-range query—Grafana retention windows lie about historical fidelity once you cross a downsampling boundary.
Keep the three core compliance checks (C-001, C-002, C-003) immutable across releases. Add new checks (C-004 onward) freely, but renaming or renumbering an existing check breaks every historical report comparison and every audit-trail query that filters by check_id.
Run the full audit nightly even though the leadership email is monthly. Nightly cadence catches regressions within twenty-four hours; monthly cadence is just the reporting rhythm. The two cadences are independent and must stay independent.
Alert on adr_review_sla_breach_total > 0 with a one-week-for-warning, twenty-four-hours-for-page severity ladder. Paging the on-call for a single SLA breach is noise; paging when breaches accumulate beyond a team's review capacity is signal.
Wire the exporter scrape path to time out at 500 ms. The aggregator is a pure function over an in-memory registry, so anything slower indicates the registry has grown beyond what fits in memory and needs a paginated cache layer—do not let a slow scrape silently drop metrics from the time series.
Store the audit archive bucket with object-lock enabled and a retention policy of at least seven years. The compliance report is evidence; evidence must outlive the engineers who generated it, and the bucket policy is what makes that promise enforceable rather than aspirational.
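The first rule in the list, that every archived artifact must carry a verifiable hash, implies a verification routine on the reader's side. The sketch below is one way it might look, assuming the archive stores the report JSON with `snapshot_hash` filled in and that the digest was originally taken with that field empty, using `sort_keys=True` and `indent=2` as in the walkthrough's `to_json()`. The function names and the sample report are hypothetical.

```python
import hmac
import json
from hashlib import sha256


def canonical_digest(body: dict) -> str:
    """SHA-256 over the sort-keyed JSON body with snapshot_hash blanked."""
    unsealed = {**body, "snapshot_hash": ""}
    # Serialization options must match the ones used when the report was sealed.
    canonical = json.dumps(unsealed, sort_keys=True, indent=2)
    return sha256(canonical.encode()).hexdigest()


def verify_snapshot(archived_json: str, printed_hash: str) -> bool:
    """True when the archived JSON matches the digest printed in the email."""
    body = json.loads(archived_json)
    # Constant-time comparison, and the embedded hash must agree as well.
    return (
        hmac.compare_digest(canonical_digest(body), printed_hash)
        and body.get("snapshot_hash") == printed_hash
    )


# Hypothetical archived report, sealed the same way before upload.
body = {
    "generated_at": "2024-03-01T02:00:00+00:00",
    "findings": [],
    "fleet_counts": {"reviewer": {"alice": 3}},
    "snapshot_hash": "",
}
body["snapshot_hash"] = canonical_digest(body)
archived = json.dumps(body, sort_keys=True, indent=2)
printed = body["snapshot_hash"]

# Simulate someone editing a count in the archived file.
tampered = archived.replace('"alice": 3', '"alice": 2')
```

Here `verify_snapshot(archived, printed)` succeeds and `verify_snapshot(tampered, printed)` fails, which is the property the object-locked bucket and the printed hash are meant to guarantee together.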

Code Walkthrough

Having just drawn the operational/audit boundary, you now implement the single evaluator that feeds both surfaces. The ComplianceAuditor answers machine-checkable questions (here: are quarterly review SLAs being honoured?) and emits a ComplianceReport that serializes deterministically to JSON for CI artifacts and to markdown for the leadership email. Two design points are load-bearing: the multi-axis fleet_counts (status, category, reviewer, SLA breaches) and the snapshot_hash that proves an archived artifact has not been tampered with. To stay short, the snippet runs only check C-002 and populates only the reviewer axis of fleet_counts; the markdown renderer and the remaining axes are omitted.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timedelta, timezone
from hashlib import sha256
import json


@dataclass
class ComplianceFinding:
    check_id: str
    description: str
    passed: bool
    failing_items: list[str] = field(default_factory=list)
    severity: str = "info"


@dataclass
class ComplianceReport:
    generated_at: datetime
    findings: list[ComplianceFinding]
    fleet_counts: dict[str, dict[str, int]]
    snapshot_hash: str = ""

    def to_json(self) -> str:
        # asdict() recurses into the nested ComplianceFinding dataclasses.
        body = asdict(self)
        body["generated_at"] = self.generated_at.isoformat()
        # sort_keys=True keeps the serialization canonical for hashing.
        return json.dumps(body, sort_keys=True, indent=2)


def _parse_utc(ts: str) -> datetime:
    # Treat naive timestamps as UTC so the subtraction in audit() never
    # mixes naive and aware datetimes (which raises TypeError).
    dt = datetime.fromisoformat(ts)
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


class ComplianceAuditor:
    def __init__(self, sla_days: int = 90):
        self.sla_days = sla_days

    def audit(self, adrs: list[dict]) -> ComplianceReport:
        # datetime.utcnow() is deprecated and returns a naive value.
        now = datetime.now(timezone.utc)
        # Reviewers who own at least one ADR past the review SLA.
        breached = {
            a["reviewer"] for a in adrs
            if (now - _parse_utc(a["last_reviewed_at"]))
            > timedelta(days=self.sla_days)
        }
        # ADR count per reviewer: the "reviewer" axis of fleet_counts.
        by_reviewer: dict[str, int] = {}
        for a in adrs:
            by_reviewer[a["reviewer"]] = by_reviewer.get(a["reviewer"], 0) + 1
        findings = [
            ComplianceFinding(
                "C-002",
                f"Every accepted ADR reviewed within {self.sla_days} days",
                not breached, sorted(breached), "high",
            ),
        ]
        report = ComplianceReport(
            generated_at=now,
            findings=findings,
            fleet_counts={"reviewer": by_reviewer},
        )
        # Hash the fully populated body while snapshot_hash is still "".
        report.snapshot_hash = sha256(report.to_json().encode()).hexdigest()
        return report
```

The rollup-by-owner table that the monthly leadership email prints is fleet_counts["reviewer"] sorted descending, with each row annotated by that reviewer's SLA-breach count. The breached set in the code above records only which reviewers have at least one overdue ADR, so the per-reviewer count has to be tallied separately from the same comparison. This table is what surfaces the bottleneck engineer silently holding up half the governance program.

Because snapshot_hash is computed over the canonical, sort-keyed JSON, the email can print that digest so any reader can verify that the JSON pulled from the archive bucket matches the report leadership saw. The digest is taken while snapshot_hash is still the empty string, so a verifier must blank that field in the archived JSON before re-hashing. The hash covers the whole body including generated_at, so it is a tamper-evidence seal for a single report, not a cross-run idempotency guarantee: two audit calls on an unchanged registry still differ in their generated_at timestamp and therefore produce different digests. To compare two runs for content drift, exclude generated_at from the hashed body (or diff the findings and fleet_counts directly) rather than comparing raw snapshot_hash values.
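The drift comparison suggested above can be sketched as a second digest that skips the volatile fields. The name `content_hash` and the choice to drop both `generated_at` and `snapshot_hash` are assumptions for illustration; the report bodies are hypothetical dictionaries shaped like the output of `to_json()`.

```python
import json
from hashlib import sha256

# Fields that change on every run regardless of registry content.
VOLATILE_FIELDS = ("generated_at", "snapshot_hash")


def content_hash(report_body: dict) -> str:
    """Digest over findings and fleet_counts only, for run-to-run comparison."""
    stable = {k: v for k, v in report_body.items() if k not in VOLATILE_FIELDS}
    return sha256(json.dumps(stable, sort_keys=True).encode()).hexdigest()


# Two nightly runs over an unchanged registry, then one after a change.
monday = {
    "generated_at": "2024-03-04T02:00:00+00:00",
    "findings": [{"check_id": "C-002", "passed": True, "failing_items": []}],
    "fleet_counts": {"reviewer": {"alice": 3, "bob": 1}},
    "snapshot_hash": "aaa",
}
tuesday = {**monday, "generated_at": "2024-03-05T02:00:00+00:00", "snapshot_hash": "bbb"}
drifted = {**tuesday, "fleet_counts": {"reviewer": {"alice": 4, "bob": 1}}}

unchanged = content_hash(monday) == content_hash(tuesday)
changed = content_hash(tuesday) != content_hash(drifted)
```

This digest complements snapshot_hash rather than replacing it: snapshot_hash seals one artifact, while the content digest answers whether anything other than the timestamp moved between runs.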

Do's and Don'ts

Do's

✓Do keep the Prometheus scrape path and the hashed JSON artifact on separate code paths — the ComplianceReport.to_json() method produces the canonical snapshot that gets archived and cited by auditors; if the live operator surface shares that same code path, the "immutable" artifact is no longer immutable and SOC 2 evidence collection collapses the first time two readers compare hashes a week apart.
✓Do compute snapshot_hash over the sort-keyed JSON body with sha256. Passing sort_keys=True to json.dumps inside to_json() makes the serialization deterministic for a given body, so the digest is a reliable tamper-evidence seal: any reader who blanks the snapshot_hash field in the archived JSON and recomputes the hash gets the same value the email printed. Because the body includes generated_at, the digest changes every run by design; to detect content drift between two runs, hash a body with generated_at excluded rather than comparing raw digests.
✓Do populate fleet_counts with a reviewer-keyed rollup and annotate each row with its SLA-breach count — the breached set from the ComplianceAuditor.audit loop is what surfaces the single engineer whose review backlog is silently blocking half the governance program; a status-only count hides that bottleneck entirely.

Don'ts

✗Don't share a single live state object between the operator dashboard and the auditor artifact — the concrete failure mode is that two readers pull the report a week apart, see different fleet_counts numbers because the live registry changed, and can no longer agree on what was true at audit time, destroying the credibility of the compliance record.
✗Don't omit sort_keys=True when serializing ComplianceReport before hashing. Python dicts preserve insertion order, so without sort_keys=True the same logical body serializes to a different JSON string whenever its keys were inserted in a different order (for example, after the report is rebuilt from parsed JSON by another tool). A reader recomputing the digest over the archived JSON would then fail to match, silently breaking the tamper-evidence guarantee. (Two runs already differ via generated_at; sort_keys is about reproducibility for a fixed body, not cross-run equality.)
✗Don't compute snapshot_hash before the report body is fully populated — the code in ComplianceAuditor.audit builds ComplianceReport first and then sets report.snapshot_hash = sha256(report.to_json().encode()).hexdigest(); hashing before fleet_counts or findings are attached produces a digest that doesn't match the complete artifact the archive bucket stores, and CI verification fails on every pull.
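The sort_keys point in the list above is easy to demonstrate. In this small sketch, two dictionaries that compare equal but were built in different key orders hash differently unless the keys are sorted; the `digest` helper and the sample data are illustrative only.

```python
import json
from hashlib import sha256


def digest(body: dict, *, sort_keys: bool) -> str:
    """SHA-256 of the JSON serialization, with or without canonical key order."""
    return sha256(json.dumps(body, sort_keys=sort_keys).encode()).hexdigest()


# Same logical content, different insertion order.
a = {"reviewer": {"alice": 2, "bob": 1}, "status": {"ACCEPTED": 3}}
b = {"status": {"ACCEPTED": 3}, "reviewer": {"bob": 1, "alice": 2}}

same_content = a == b
unsorted_match = digest(a, sort_keys=False) == digest(b, sort_keys=False)
sorted_match = digest(a, sort_keys=True) == digest(b, sort_keys=True)
```

Here `same_content` and `sorted_match` are true while `unsorted_match` is false, which is why the canonical form has to be fixed before the digest is taken.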
