GenBodha
by GenAI Engineers
GenAI Operations
Chapter 1

GenAI Failure Catalog

failure taxonomy, provider outages, quality degradation, hallucination detection, cost overruns, prompt injection, embedding drift, silent failures

Learning Path

Step 1

Reading Material

11 sections

Step 2

Knowledge Check

50 questions

Step 3

Hands-on Labs

6 labs


Hands-on Labs

Each objective has a coding lab that opens in VS Code in your browser

Objective 1

Build GenAI failure classification system

Goal

You will build a `FailureClassifier` that categorizes GenAI system failures into a structured taxonomy. Define five top-level failure categories as Pydantic models: `ProviderFailure` (HTTP 429 rate limits, 503 outages, timeout, malformed response), `QualityFailure` (hallucination detected, faithfulness below threshold, format non-compliance, language drift), `CostFailure` (budget exceeded, unexpected token spike, cache miss storm, batch job cost overrun), `SecurityFailure` (prompt injection detected, PII leakage in response, jailbreak attempt, unauthorized model access), `DataFailure` (embedding drift, retrieval quality drop, stale index, ingestion pipeline failure). Each model includes `failure_id`, `timestamp`, `severity` (P1-P4), `provider`, `detection_method`, `evidence`, and `impact_scope`. Deploy LiteLLM proxy in your vCluster and configure callbacks that intercept every LLM response. Build `classify_failure()` that inspects response headers, status codes, latency, and content to auto-classify failures. Store events in PostgreSQL `failure_events` table.
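A minimal sketch of the taxonomy and `classify_failure()`. It uses stdlib dataclasses in place of Pydantic models so it runs without dependencies, and the rules shown (status-code checks, a budget flag, a single injection pattern) are illustrative placeholders for the fuller header/latency/content inspection the objective describes:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class Severity(str, Enum):
    P1 = "P1"; P2 = "P2"; P3 = "P3"; P4 = "P4"

@dataclass
class FailureEvent:
    failure_id: str
    category: str            # ProviderFailure, QualityFailure, CostFailure, ...
    severity: Severity
    provider: str
    detection_method: str
    evidence: dict
    impact_scope: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def classify_failure(provider: str, status_code: int, latency_s: float,
                     body: str, budget_exceeded: bool = False) -> Optional[FailureEvent]:
    """Map raw response signals onto the top-level failure taxonomy."""
    if status_code == 429:
        return FailureEvent("f-429", "ProviderFailure", Severity.P2, provider,
                            "status_code", {"status": 429}, "single_provider")
    if status_code >= 500:
        return FailureEvent("f-5xx", "ProviderFailure", Severity.P1, provider,
                            "status_code", {"status": status_code}, "single_provider")
    if budget_exceeded:
        return FailureEvent("f-cost", "CostFailure", Severity.P3, provider,
                            "budget_check", {}, "tenant")
    if "ignore previous instructions" in body.lower():
        return FailureEvent("f-inj", "SecurityFailure", Severity.P1, provider,
                            "content_scan", {"pattern": "prompt_injection"}, "request")
    return None  # no failure detected
```

In the lab, this function would run inside a LiteLLM callback and the resulting events would be written to the `failure_events` table.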

Objective 2

Instrument multi-provider failure detection

Goal

You will build provider-specific failure detectors for OpenAI, Anthropic, and Google Gemini. Implement `OpenAIFailureDetector` that monitors: rate limit headers (`x-ratelimit-remaining-tokens`, `x-ratelimit-remaining-requests`), detects approaching limits (< 10% remaining), catches `openai.RateLimitError` and `openai.APITimeoutError`, and measures response latency percentiles. Implement `AnthropicFailureDetector` that monitors: `anthropic.RateLimitError`, overloaded errors (`529`), and tracks thinking token costs for Claude models with extended thinking. Implement `GeminiFailureDetector` that monitors: `google.api_core.exceptions.ResourceExhausted`, safety filter blocks (`finish_reason: SAFETY`), and context window overflows. Each detector emits Prometheus metrics: `llm_provider_errors_total{provider,error_type}`, `llm_provider_latency_seconds{provider,quantile}`, `llm_provider_rate_limit_remaining{provider,limit_type}`. Deploy all three providers through LiteLLM and verify detection with synthetic failure injection.
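A sketch of the rate-limit portion of `OpenAIFailureDetector`. The header names match the ones the objective lists; the 10% warning threshold and the nearest-rank percentile helper are simplifications (a real implementation would also catch the SDK exceptions and emit the Prometheus metrics named above):

```python
class OpenAIFailureDetector:
    WARN_FRACTION = 0.10  # warn when < 10% of the rate limit remains

    def __init__(self) -> None:
        self.latencies: list[float] = []

    def check_headers(self, headers: dict) -> list[str]:
        """Inspect x-ratelimit-* headers and flag limits that are nearly exhausted."""
        warnings = []
        for limit_type in ("tokens", "requests"):
            remaining = int(headers.get(f"x-ratelimit-remaining-{limit_type}", 0))
            limit = int(headers.get(f"x-ratelimit-limit-{limit_type}", 1))
            if remaining / limit < self.WARN_FRACTION:
                warnings.append(f"rate_limit_near:{limit_type}")
        return warnings

    def record_latency(self, seconds: float) -> None:
        self.latencies.append(seconds)

    def latency_percentile(self, q: float) -> float:
        """Nearest-rank percentile over recorded latencies (q in [0, 1])."""
        s = sorted(self.latencies)
        return s[min(len(s) - 1, int(q * len(s)))]
```

The Anthropic and Gemini detectors would follow the same shape, swapping in their provider-specific signals (HTTP 529 overloads, `finish_reason: SAFETY`, and so on).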

Objective 3

Measure baseline failure rates

Goal

You will build a `BaselineProfiler` that establishes normal failure rates across all providers by running a controlled traffic workload. Implement `run_baseline_profile()` that sends 1000 representative requests per provider (mix of simple, complex, and edge-case prompts) through LiteLLM, records every response including latency, token counts, error codes, and quality signals. Compute baseline statistics: p50/p95/p99 latency per provider, error rate per provider per error type, tokens-per-second throughput, and cost-per-request distribution. Store baselines in PostgreSQL `provider_baselines` table with `provider`, `metric_name`, `p50`, `p95`, `p99`, `measured_at`. Build a FastAPI endpoint `GET /api/v1/baselines` returning current baselines. Deploy a Grafana dashboard showing baseline metrics with green/yellow/red zones. Implement `detect_deviation()` that compares real-time metrics against stored baselines and flags when current values exceed 2x the p95 baseline.
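The baseline math can be sketched in a few lines. This uses a nearest-rank percentile and the 2x-p95 rule stated above; the baseline dict stands in for a row of the `provider_baselines` table:

```python
def percentiles(samples: list[float]) -> dict:
    """Nearest-rank p50/p95/p99 over a list of measurements."""
    s = sorted(samples)
    def pct(q: float) -> float:
        return s[min(len(s) - 1, int(q * len(s)))]
    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}

def detect_deviation(current_value: float, baseline: dict, factor: float = 2.0) -> bool:
    """Flag when a real-time metric exceeds `factor` times the stored p95 baseline."""
    return current_value > factor * baseline["p95"]
```

In the lab, `run_baseline_profile()` would feed the recorded latencies, error rates, and costs from the 1000-request workload into `percentiles()` per provider and metric.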

Objective 4

Build failure correlation engine

Goal

You will build a `FailureCorrelationEngine` that identifies relationships between failures across providers and pipeline stages. Implement temporal correlation: when failures cluster within a 5-minute window across multiple providers, flag as potential upstream issue (e.g., network problem, shared dependency). Implement causal correlation: link provider failures to downstream quality failures (e.g., OpenAI timeout -> fallback to weaker model -> quality drop). Build a correlation graph stored in PostgreSQL: nodes are failure events, edges represent temporal and causal links with confidence scores. Implement `correlate_failures()` that runs on each new failure event and attaches it to existing correlation clusters. Build a FastAPI endpoint `GET /api/v1/failures/correlations` returning active correlation clusters. Emit Prometheus metrics: `failure_correlation_clusters_active`, `failure_correlation_strength{cluster_type}`. Create a Grafana panel showing failure correlation timeline with linked events.
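The temporal half of the engine can be sketched as a single pass over time-ordered events: failures within the 5-minute window join the current cluster, and clusters spanning more than one provider are flagged as potential upstream issues. Events are plain dicts here, standing in for rows of the failure-event graph:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def correlate_failures(events: list[dict]) -> list[dict]:
    """Group failures into temporal clusters; multi-provider clusters
    suggest a shared upstream cause (network issue, common dependency)."""
    clusters: list[dict] = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if clusters and ev["ts"] - clusters[-1]["events"][-1]["ts"] <= WINDOW:
            clusters[-1]["events"].append(ev)
        else:
            clusters.append({"events": [ev]})
    for c in clusters:
        providers = {e["provider"] for e in c["events"]}
        c["suspected_upstream"] = len(providers) > 1
    return clusters
```

Causal links (e.g. timeout -> fallback -> quality drop) would be added as typed edges with confidence scores on top of these clusters.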

Objective 5

Implement failure prediction from leading indicators

Goal

You will build a `FailurePredictor` that uses leading indicators to predict failures before they impact users. Implement leading indicator detection: monitor rate limit header trends (remaining tokens decreasing toward zero), latency trend analysis (gradual latency increase precedes timeouts), error rate acceleration (error rate increasing faster than linear suggests imminent outage). Build prediction models: for each indicator, define thresholds and time horizons. Implement `predict_failure()` that evaluates all indicators every 60 seconds and emits predictions as `PredictedFailure` Pydantic models with `predicted_type`, `confidence`, `estimated_time_to_failure`, `recommended_action`. Deploy predictions as Prometheus metrics: `failure_prediction_score{provider,failure_type}`. Build preemptive actions: when prediction confidence > 0.8, automatically reduce traffic to the at-risk provider. Track prediction accuracy: compare predictions against actual failures to compute precision and recall.
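One leading indicator worked through as code: extrapolating the rate-limit header trend to an estimated time-to-exhaustion. The linear extrapolation and the confidence formula (confidence grows as the estimate approaches zero) are illustrative choices, not a prescribed model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictedFailure:
    predicted_type: str
    confidence: float
    estimated_time_to_failure: float  # seconds
    recommended_action: str

def predict_rate_limit_exhaustion(samples: list[tuple[float, int]],
                                  horizon_s: float = 300.0) -> Optional[PredictedFailure]:
    """samples: (t_seconds, remaining_tokens) observations, oldest first.
    Predict exhaustion when the linear trend hits zero inside the horizon."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    rate = (r0 - r1) / (t1 - t0)   # tokens consumed per second
    if rate <= 0:
        return None                # budget not shrinking; nothing to predict
    tte = r1 / rate                # seconds until remaining hits zero
    if tte > horizon_s:
        return None
    return PredictedFailure(
        predicted_type="rate_limit_exhaustion",
        confidence=max(0.0, 1.0 - tte / horizon_s),
        estimated_time_to_failure=tte,
        recommended_action="shift traffic to fallback provider",
    )
```

The latency-trend and error-acceleration indicators would produce `PredictedFailure` values the same way, and the 0.8 confidence gate above would trigger the preemptive traffic shift.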

Objective 6

Create failure impact assessment system

Goal

You will build a `FailureImpactAssessor` that quantifies the business impact of each failure event. Implement impact dimensions: `user_impact` (number of affected requests, affected users, failed responses), `cost_impact` (wasted tokens on failed requests, cost of fallback to more expensive provider), `quality_impact` (quality degradation during the failure window measured by faithfulness and hallucination metrics), `sla_impact` (error budget consumed by this failure). Build `assess_impact()` that computes all dimensions for a given failure event or correlation cluster. Store impact assessments in PostgreSQL `failure_impacts` table. Build a FastAPI endpoint `GET /api/v1/failures/{failure_id}/impact` returning the full impact assessment. Create impact-based prioritization: rank failures by total business impact to focus remediation on highest-impact patterns. Build a Grafana failure impact dashboard: top failures by impact, impact trend over time, and cost of failures per provider.
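A dependency-free sketch of `assess_impact()` covering three of the four dimensions; `quality_impact` needs the faithfulness and hallucination metrics from the failure window and is omitted here. The weights in `priority_score` are an arbitrary illustration of impact-based ranking, not a recommended formula:

```python
def assess_impact(failed_requests: int, affected_users: int, wasted_tokens: int,
                  usd_per_1k_tokens: float, errors_in_window: int,
                  error_budget: int) -> dict:
    """Quantify a failure event along user, cost, and SLA dimensions."""
    wasted_usd = wasted_tokens / 1000 * usd_per_1k_tokens
    budget_consumed = errors_in_window / error_budget
    return {
        "user_impact": {"failed_requests": failed_requests,
                        "affected_users": affected_users},
        "cost_impact": {"wasted_usd": round(wasted_usd, 4)},
        "sla_impact": {"error_budget_consumed": budget_consumed},
        # Illustrative weighting for impact-based prioritization.
        "priority_score": failed_requests + 100 * wasted_usd + 1000 * budget_consumed,
    }
```

Rows like this would land in the `failure_impacts` table and back both the `GET /api/v1/failures/{failure_id}/impact` endpoint and the Grafana impact dashboard.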