Validate ADR decisions against production telemetry

You will build a `DecisionValidator` that continuously checks whether the assumptions behind accepted ADRs still hold by comparing them against live production telemetry.

Implement `ADRAssumption` as a Pydantic model with fields: `assumption_id: str`, `adr_id: str`, `description: str`, `metric_name: str`, `operator: ComparisonOperator` (LT, GT, LTE, GTE, EQ, BETWEEN), `threshold: float`, `upper_bound: Optional[float]` (for the BETWEEN operator), `measurement_window: timedelta`, `data_source: DataSource` (PROMETHEUS, POSTGRESQL, LANGFUSE).

Build `extract_assumptions()` that parses an accepted ADR's context and decision fields using `litellm.completion()` with Instructor to extract testable assumptions as structured `ADRAssumption` objects, returning `ExtractionResult` with `assumptions: list[ADRAssumption]`, `confidence: float`, `unextractable_claims: list[str]`.

Implement `validate_assumptions()` that queries Prometheus via `prometheus_api_client` for each assumption's `metric_name` over the `measurement_window`, using `custom_query()` for PromQL expressions, compares the result against the `threshold` using the specified `operator`, and returns a `ValidationResult` with `is_valid: bool`, `actual_value: float`, `deviation_pct: float`, `trend_direction: TrendDirection` (IMPROVING, STABLE, DEGRADING).

Build `StalenessDetector` with `check_staleness()` that runs `validate_assumptions()` on a configurable schedule (default every 6 hours via `check_interval_hours: int`) and marks ADRs as STALE when any assumption fails validation for `consecutive_failures_threshold` (default 3) consecutive checks. Implement `StalenessState` tracking `consecutive_failures: int`, `last_valid_at: datetime`, `staleness_score: float`.

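
The comparison step in `validate_assumptions()` can be sketched with the standard library alone. This is a minimal illustration, not the lesson's reference implementation: the enum values, the helper names `holds` and `deviation_pct`, the inclusive bounds for BETWEEN, and the deviation formula (signed percentage distance from the threshold) are all assumptions, since the brief names the fields but does not define them.

```python
import math
from enum import Enum
from typing import Optional


class ComparisonOperator(Enum):
    # Member names come from the brief; the string values are illustrative.
    LT = "lt"
    GT = "gt"
    LTE = "lte"
    GTE = "gte"
    EQ = "eq"
    BETWEEN = "between"


def holds(op: ComparisonOperator, actual: float, threshold: float,
          upper_bound: Optional[float] = None) -> bool:
    """Return True when the observed value satisfies the assumption."""
    if op is ComparisonOperator.BETWEEN:
        if upper_bound is None:
            raise ValueError("BETWEEN requires upper_bound")
        # Assumption: both ends of the range are inclusive.
        return threshold <= actual <= upper_bound
    if op is ComparisonOperator.LT:
        return actual < threshold
    if op is ComparisonOperator.GT:
        return actual > threshold
    if op is ComparisonOperator.LTE:
        return actual <= threshold
    if op is ComparisonOperator.GTE:
        return actual >= threshold
    # EQ: compare floats with a tolerance instead of exact equality.
    return math.isclose(actual, threshold)


def deviation_pct(actual: float, threshold: float) -> float:
    """Signed percentage distance of the observed value from the threshold."""
    if threshold == 0:
        raise ValueError("deviation is undefined for a zero threshold")
    return (actual - threshold) / abs(threshold) * 100.0
```

For a "p99 latency below 200ms" assumption observed at 220ms, `holds(ComparisonOperator.LT, 220.0, 200.0)` is False and `deviation_pct(220.0, 200.0)` is roughly 10.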
Store validation history in PostgreSQL `adr_validations` table with columns: `validation_id VARCHAR(64) PRIMARY KEY`, `adr_id VARCHAR(64) REFERENCES architecture_decisions(adr_id)`, `assumption_id VARCHAR(64)`, `checked_at TIMESTAMPTZ`, `is_valid BOOLEAN`, `actual_value FLOAT`, `deviation_pct FLOAT`, `trend_direction VARCHAR(16)`. Create index `idx_adr_validations_adr_id_checked_at` for efficient history queries.

Emit Prometheus metrics: `adr_validation_checks_total{adr_id,result}`, `adr_staleness_score{adr_id}` (0-1 gauge where 1 means all assumptions valid), `adr_assumption_deviation_pct{adr_id,assumption_id}`. Configure a Prometheus alerting rule, routed through Alertmanager, that fires an `ADRStale` alert with severity `warning` when `adr_staleness_score` stays below 0.5 for more than 30 minutes.

Build a `GET /api/v1/adrs/{adr_id}/validation` FastAPI endpoint returning the validation history and current staleness score. Build a `GET /api/v1/adrs/{adr_id}/assumptions` endpoint returning all extracted assumptions with their latest validation status.

Implement `DecisionEffectivenessScorecard` that aggregates validation results across all ADRs into a Grafana dashboard showing assumption pass rates per category, staleness trends over 30 days, and a ranked table of most-invalidated decisions.
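
One way to read the `adr_staleness_score` gauge is as the share of an ADR's assumptions that passed their latest check, which matches "1 means all assumptions valid". The sketch below assumes that interpretation; the brief does not give a formula. The helper names `staleness_score` and `render_gauge` are hypothetical, and a real service would more likely register the gauge through a Prometheus client library than format exposition lines by hand.

```python
def staleness_score(latest_results: dict[str, bool]) -> float:
    """Fraction of an ADR's assumptions whose latest check passed.

    Keys are assumption ids, values are the latest is_valid flags.
    """
    if not latest_results:
        # Assumption: an ADR with no testable assumptions is not stale.
        return 1.0
    return sum(latest_results.values()) / len(latest_results)


def render_gauge(adr_id: str, score: float) -> str:
    """Format one sample line in the Prometheus text exposition format."""
    return f'adr_staleness_score{{adr_id="{adr_id}"}} {score}'


# Hypothetical ADR with one passing and one failing assumption.
score = staleness_score({"latency-p99": True, "cost-per-1k": False})
line = render_gauge("ADR-012", score)
```

Under this reading, a score below 0.5 means more than half of the ADR's assumptions are currently failing, which is the condition the `ADRStale` alert watches.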

Validate architecture decisions against production telemetry to detect stale or invalidated assumptions

Introduction

Teams that ship a RAG pipeline, a hosted-model choice, or a latency budget rarely revisit those ADRs once the system is live — yet the production environment underneath them drifts daily. A model that was state of the art six months ago can be matched at 40% lower cost; an embedding service that comfortably met p99 latency at launch can quietly breach it after a provider reroute. When the architecture document and the running system disagree, the document loses silently until cost overruns, stale model choices, or compounding SLA misses surface downstream. By the end of this lesson you'll be able to encode the assumptions inside each ADR as testable predicates, evaluate them against live telemetry on a schedule, and route the stale ones into a governance review loop before they accumulate as architectural debt.

Key Terminology

ADR Assumption: A structured, machine-checkable predicate extracted from an Architecture Decision Record — a metric name, a comparison condition, a threshold, and a tolerance band — that can be evaluated against live telemetry.
Tolerance Band: The percentage margin beyond a threshold (10% by default, so a 200ms limit tolerates up to 220ms) inside which an observed metric is classified as DEGRADED rather than VIOLATED, preventing noisy false alarms from minor variance.
Staleness Threshold (max_violations_before_stale): The number of consecutive validation failures an assumption must accumulate before the surrounding ADR is marked stale and routed into the governance review workflow.

Concepts

Practical Deployment Considerations

When deploying telemetry validation in production, schedule validation runs at intervals aligned with your telemetry aggregation windows. If your metrics backend aggregates at 5-minute intervals, running validation every minute produces noisy results based on partial windows. A 15-minute validation cadence with 5-minute aggregated metrics provides stable readings while still catching rapid degradation.
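
The cadence rule above can be sketched as a small helper, assuming the aggregation window is known as a `timedelta`. The names `aligned_cadence` and `is_aligned`, and the default of three windows per check, are illustrative and not part of the lesson's API.

```python
from datetime import timedelta


def aligned_cadence(aggregation_window: timedelta,
                    windows_per_check: int = 3) -> timedelta:
    """Pick a validation interval that spans whole aggregation windows."""
    if aggregation_window <= timedelta(0):
        raise ValueError("aggregation_window must be positive")
    if windows_per_check < 1:
        raise ValueError("windows_per_check must be at least 1")
    return aggregation_window * windows_per_check


def is_aligned(cadence: timedelta, aggregation_window: timedelta) -> bool:
    """True when each validation run sees only complete aggregation windows."""
    return (cadence >= aggregation_window
            and cadence % aggregation_window == timedelta(0))


# 5-minute aggregation with the default of 3 windows gives a 15-minute cadence.
cadence = aligned_cadence(timedelta(minutes=5))
```

A one-minute cadence against five-minute aggregation fails `is_aligned`, which is the partial-window case the paragraph warns about.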

The max_violations_before_stale parameter requires tuning per assumption category. Model quality assumptions (BLEU score, human preference ratings) should tolerate more consecutive violations—perhaps 5 to 7—because quality metrics are inherently noisier. Cost assumptions should use a lower threshold of 2 to 3 because cost overruns compound rapidly and rarely self-correct.
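
Per-category tuning could be expressed as a lookup table consulted when each assumption is created. The category keys and the helper name below are hypothetical. The numbers sit inside the ranges suggested above, and the fallback of 3 mirrors the default used in the code walkthrough.

```python
# Illustrative staleness thresholds per assumption category.
MAX_VIOLATIONS_BY_CATEGORY = {
    "model_quality": 6,  # noisy metrics: tolerate 5 to 7 consecutive violations
    "cost": 2,           # overruns compound quickly: react after 2 to 3
}
DEFAULT_MAX_VIOLATIONS = 3


def max_violations_for(category: str) -> int:
    """Return the staleness threshold for a category, or the default."""
    return MAX_VIOLATIONS_BY_CATEGORY.get(category, DEFAULT_MAX_VIOLATIONS)
```

Keeping the values in one table makes the tuning visible in code review, so the thresholds do not end up scattered across individual assumption definitions.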

Store every ValidationResult in a time-series database alongside the raw telemetry. This historical record enables the recommendation engine to identify seasonal patterns—an assumption that violates every Monday morning due to batch processing load is not truly stale; it needs a different threshold for peak windows. Without this history, the governance dashboard generates review requests that architects learn to ignore, undermining the entire validation system.
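
A stored history makes the Monday-morning pattern detectable with a simple bucket count. The sketch below assumes you can load the timestamps of VIOLATED results for one assumption. The function name and both cutoffs (`min_occurrences`, `min_share`) are illustrative choices, not values from the lesson.

```python
from collections import Counter
from datetime import datetime


def recurring_violation_windows(violated_at: list[datetime],
                                min_occurrences: int = 3,
                                min_share: float = 0.5) -> list[tuple[int, int]]:
    """Return (weekday, hour) buckets that hold most of the violations.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    buckets = Counter((ts.weekday(), ts.hour) for ts in violated_at)
    total = len(violated_at)
    return sorted(
        bucket
        for bucket, count in buckets.items()
        if count >= min_occurrences and count / total >= min_share
    )


# Four Monday-morning violations plus one unrelated Wednesday afternoon.
history = [
    datetime(2024, 1, 1, 9, 5),
    datetime(2024, 1, 8, 9, 10),
    datetime(2024, 1, 15, 9, 0),
    datetime(2024, 1, 22, 9, 20),
    datetime(2024, 1, 10, 14, 0),
]
windows = recurring_violation_windows(history)
```

A non-empty result suggests the assumption needs a separate peak-window threshold rather than a governance review.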

Finally, integrate staleness alerts with your existing decision taxonomy. A stale assumption under the "model selection" category should route to the ML platform team, while a stale "hosting strategy" assumption routes to the infrastructure team. This routing, combined with the expiry tracking in the governance dashboard, ensures that stale decisions do not languish in a shared queue but reach the engineers with the authority and context to act on them.
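
The routing described above amounts to a mapping from taxonomy category to owning team. In this sketch the two categories come from the paragraph, while the team identifiers and the function name are hypothetical. Raising on an unmapped category is a design assumption: it forces the taxonomy to stay complete, so alerts never fall into a shared queue.

```python
# Category names follow the examples in the text; team ids are placeholders.
ROUTES = {
    "model selection": "ml-platform",
    "hosting strategy": "infrastructure",
}


def route_stale_assumption(category: str) -> str:
    """Return the owning team for a stale assumption's taxonomy category."""
    key = category.strip().lower()
    try:
        return ROUTES[key]
    except KeyError:
        # Fail loudly so an unmapped category is fixed in the taxonomy.
        raise LookupError(f"no owner configured for category {category!r}") from None
```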

Code Walkthrough

Now that you've seen how validation cadence, tolerance tuning, and staleness routing shape the production behavior of this system, the code below expresses those mechanics as two pieces: a typed assumption model and the validator that grades it against telemetry and emits a ValidationResult for governance.

Every architecture decision rests on assumptions. When an ADR selects a RAG pipeline over fine-tuning, it assumes retrieval latency stays below a threshold and that embedding cost stays within budget. The first step is extracting those assumptions from prose and encoding them as structured predicates — a metric to observe, a condition to assert, and a tolerance band that absorbs normal variance so minor fluctuations don't trigger false positives.

```python
from dataclasses import dataclass
from enum import Enum


class MetricCondition(Enum):
    LESS_THAN = "less_than"
    GREATER_THAN = "greater_than"
    WITHIN_RANGE = "within_range"


class AssumptionStatus(Enum):
    CONFIRMED = "confirmed"
    DEGRADED = "degraded"
    VIOLATED = "violated"
    UNKNOWN = "unknown"


@dataclass
class ADRAssumption:
    assumption_id: str
    adr_id: str
    metric_name: str
    condition: MetricCondition
    threshold: float
    tolerance_pct: float = 0.10  # width of the tolerance band (10%)
    status: AssumptionStatus = AssumptionStatus.UNKNOWN
    consecutive_violations: int = 0
    max_violations_before_stale: int = 3

    @property
    def is_stale(self) -> bool:
        # Stale only after a sustained run of violations, not a single one.
        return self.consecutive_violations >= self.max_violations_before_stale

    def effective_threshold(self) -> float:
        # Upper edge of the tolerance band for a "stay below" assumption.
        return self.threshold * (1.0 + self.tolerance_pct)
```

The tolerance_pct field defaults to 10%, so a 200ms threshold tolerates observed values up to 220ms before it counts as a violation, and is_stale only fires after max_violations_before_stale consecutive failures — keeping a single bad data point from triggering review.

With assumptions modeled as data, the validator takes an observed telemetry value, places it into the confirmed, degraded, or violated zone, updates the consecutive-violation counter, and returns a ValidationResult for the governance layer to persist.

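```python
from dataclasses import dataclass

# Continues from the previous snippet: ADRAssumption, AssumptionStatus, and
# MetricCondition are expected to be in scope.


@dataclass
class ValidationResult:
    assumption_id: str
    observed: float
    status: AssumptionStatus


def validate(assumption: ADRAssumption, observed: float) -> ValidationResult:
    threshold = assumption.threshold
    if assumption.condition is MetricCondition.LESS_THAN:
        confirmed = observed <= threshold
        tolerated = observed <= assumption.effective_threshold()
    elif assumption.condition is MetricCondition.GREATER_THAN:
        # Mirror the tolerance band below the threshold for "stay above" metrics.
        confirmed = observed >= threshold
        tolerated = observed >= threshold * (1.0 - assumption.tolerance_pct)
    else:
        # WITHIN_RANGE needs an upper bound, which ADRAssumption does not define.
        return ValidationResult(
            assumption.assumption_id, observed, AssumptionStatus.UNKNOWN
        )

    if confirmed:
        status = AssumptionStatus.CONFIRMED
        assumption.consecutive_violations = 0
    elif tolerated:
        # Inside the band: the counter is deliberately left unchanged.
        status = AssumptionStatus.DEGRADED
    else:
        status = AssumptionStatus.VIOLATED
        assumption.consecutive_violations += 1
    assumption.status = status
    return ValidationResult(assumption.assumption_id, observed, status)
```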

Store each ValidationResult in a time-series database alongside the raw telemetry so the governance workflow can distinguish a genuinely stale assumption from a recurring Monday-morning batch spike. You'll know it works when a metric inside the tolerance band returns DEGRADED, a metric beyond it returns VIOLATED, and three consecutive VIOLATED results flip is_stale to True, routing the ADR into review.
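
The acceptance criteria above can be checked with a condensed, self-contained replica of the walkthrough. The class and function names here (`Status`, `LatencyAssumption`, `grade`) are shortened stand-ins for illustration and cover only the less-than case; the sample values assume a 200ms threshold with the default 10% tolerance.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    DEGRADED = "degraded"
    VIOLATED = "violated"


@dataclass
class LatencyAssumption:
    threshold: float
    tolerance_pct: float = 0.10
    consecutive_violations: int = 0
    max_violations_before_stale: int = 3

    @property
    def is_stale(self) -> bool:
        return self.consecutive_violations >= self.max_violations_before_stale


def grade(a: LatencyAssumption, observed: float) -> Status:
    """Classify one observation and update the violation counter."""
    if observed <= a.threshold:
        a.consecutive_violations = 0
        return Status.CONFIRMED
    if observed <= a.threshold * (1.0 + a.tolerance_pct):
        return Status.DEGRADED  # counter is left unchanged
    a.consecutive_violations += 1
    return Status.VIOLATED


assumption = LatencyAssumption(threshold=200.0)
# DEGRADED readings are interleaved with three VIOLATED ones.
statuses = [grade(assumption, ms) for ms in (210.0, 250.0, 215.0, 260.0, 240.0)]
stale_after_run = assumption.is_stale

# A single healthy reading resets the counter.
recovered = grade(assumption, 190.0)
stale_after_recovery = assumption.is_stale
```

The interleaved DEGRADED readings do not reset the counter, so the third VIOLATED result marks the assumption stale. Only a CONFIRMED reading clears it.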

Do's and Don'ts

Do's

✓ Do encode each ADR's load-bearing assumptions as ADRAssumption predicates. Specifying metric_name, condition, and threshold turns prose like "retrieval latency stays below 200ms" into a testable fact that validate() can grade automatically against live telemetry on a schedule, so drift surfaces before it accumulates as architectural debt.
✓ Do tune tolerance_pct deliberately per assumption. The default 10% lets a 200ms threshold tolerate observed values up to 220ms via effective_threshold() before counting a violation. RAG latency assumptions with high natural variance need a wider band than cost assumptions with predictable baselines, so calibrate each one to absorb normal fluctuation without masking real drift.
✓ Do persist each ValidationResult to a time-series database alongside the raw telemetry. The governance review workflow needs both the validation status history and the underlying metric values to distinguish a genuinely stale assumption from a recurring batch-window spike, which the consecutive_violations counter absorbs because the next CONFIRMED result resets it.

Don'ts

✗ Don't route an ADR into governance review on a single VIOLATED result. consecutive_violations exists specifically so is_stale only flips to True after max_violations_before_stale (default 3) sustained breaches; treating one anomalous telemetry sample as a staleness signal floods the review queue with false positives and trains teams to ignore it.
✗ Don't leave ADR assumptions as prose sentences in the decision record. Without structuring them as ADRAssumption instances with explicit metric_name and condition fields, validate() has nothing to evaluate, so a model that alternatives now match at 40% lower cost, or an embedding service that has quietly breached its p99 budget, remains invisible until a downstream cost overrun or SLA miss surfaces it.
✗ Don't treat a DEGRADED result as equivalent to CONFIRMED. The validator intentionally skips resetting consecutive_violations when an observed value falls between threshold and effective_threshold(); a metric hovering chronically in the degraded zone is on a trajectory toward staleness, and artificially clearing the counter would let a persistently underperforming assumption evade governance indefinitely.
