Each objective has a coding lab that opens in VS Code in your browser.
You will build comprehensive Promptfoo evaluation suites for GenAI artifact quality verification. Create eval configurations for each artifact type: a prompt eval (test prompt changes against golden Q&A pairs, verifying faithfulness, format compliance, and no regressions), a model eval (compare a new model config against the baseline on quality, latency, and cost), and a RAG eval (test retrieval changes against known-relevant document pairs using RAGAS metrics). Implement the eval suite structure: an `eval_configs/` directory with per-artifact YAML configs, a `golden_sets/` directory with curated test cases, and an `assertions/` directory with custom assertion functions. Build `run_eval()` that executes Promptfoo programmatically via `promptfoo eval --config` and parses results into an `EvalResult` Pydantic model with per-test-case pass/fail and aggregate scores.
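A minimal sketch of `run_eval()` under stated assumptions: it shells out to the Promptfoo CLI (`promptfoo eval --config ... --output ...`) and parses the JSON output. The `EvalResult` and `TestCaseResult` field names and the shape of Promptfoo's JSON output are illustrative and should be checked against the Promptfoo version in use.

```python
import json
import subprocess
import tempfile
from pathlib import Path

from pydantic import BaseModel


class TestCaseResult(BaseModel):
    description: str
    passed: bool
    score: float


class EvalResult(BaseModel):
    artifact_type: str
    cases: list[TestCaseResult]
    pass_rate: float
    aggregate_score: float


def run_eval(artifact_type: str, config_path: Path) -> EvalResult:
    """Run `promptfoo eval` for one artifact config and parse the JSON results."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "results.json"
        subprocess.run(
            ["promptfoo", "eval", "--config", str(config_path), "--output", str(out_path)],
            check=False,  # promptfoo exits non-zero when assertions fail; we still want the results
        )
        raw = json.loads(out_path.read_text())

    # The nested "results" shape below is an assumption about the output schema.
    cases = [
        TestCaseResult(
            description=r.get("description", ""),
            passed=bool(r.get("success")),
            score=float(r.get("score", 0.0)),
        )
        for r in raw.get("results", {}).get("results", [])
    ]
    pass_rate = sum(c.passed for c in cases) / len(cases) if cases else 0.0
    avg_score = sum(c.score for c in cases) / len(cases) if cases else 0.0
    return EvalResult(
        artifact_type=artifact_type,
        cases=cases,
        pass_rate=pass_rate,
        aggregate_score=avg_score,
    )
```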
You will integrate Promptfoo eval gates into Argo Workflow pipelines. Create an Argo Workflow step template `eval-gate` that runs the Promptfoo eval, parses the results, and fails the step (blocking promotion) if quality drops below threshold. Implement the gate logic: compute the delta between current eval scores and the baseline (stored from the last successful promotion), and block if any metric regresses by more than its allowed margin (configurable per metric: 2% for faithfulness, 1% for hallucination rate). Implement gate bypass for emergency deployments: require two approvers to override a failed gate, and log the override with a justification. Build gate result storage: save every eval run in PostgreSQL with `pipeline_id`, `artifact`, `scores`, `baseline_scores`, `gate_result`, and `timestamp`. Track `eval_gate_result{artifact_type,result}` and `eval_gate_bypass_total`.
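A minimal sketch of the regression check, assuming baseline scores are loaded from the last successful promotion. The per-metric margins follow the text above; the function name, the direction table, and the result shape are illustrative. Inside the `eval-gate` Argo step, the script would call this and exit non-zero when the gate does not pass, so the workflow step fails and promotion is blocked.

```python
# Allowed regression margin per metric; faithfulness may drop at most 2 points
# (0.02), hallucination rate may rise at most 1 point (0.01).
DEFAULT_MARGINS = {
    "faithfulness": 0.02,
    "hallucination_rate": 0.01,
}

# Metrics where a lower value is better (a rise counts as a regression).
LOWER_IS_BETTER = {"hallucination_rate"}


def gate_decision(
    current: dict[str, float],
    baseline: dict[str, float],
    margins: dict[str, float] = DEFAULT_MARGINS,
) -> tuple[bool, list[str]]:
    """Return (passed, reasons); a metric fails if it regresses past its margin."""
    reasons: list[str] = []
    for metric, margin in margins.items():
        if metric not in current or metric not in baseline:
            continue
        delta = current[metric] - baseline[metric]
        if metric in LOWER_IS_BETTER:
            regressed = delta > margin        # value went up too much
        else:
            regressed = -delta > margin       # value went down too much
        if regressed:
            reasons.append(f"{metric} regressed by {abs(delta):.3f} (margin {margin})")
    return (not reasons, reasons)


if __name__ == "__main__":
    import sys

    passed, reasons = gate_decision(
        {"faithfulness": 0.86, "hallucination_rate": 0.04},
        {"faithfulness": 0.90, "hallucination_rate": 0.03},
    )
    if not passed:
        print("gate blocked:", "; ".join(reasons))
        sys.exit(1)  # fail the Argo step to block promotion
```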
You will build eval gate effectiveness tracking and golden test set management. Implement eval effectiveness metrics: gate block rate (how often gates block promotion), false positive rate (gates that blocked although the change was actually fine, measured by manual override success), and false negative rate (changes that passed gates but caused production quality drops). Build golden test set management: `POST /api/v1/golden-sets/{artifact_type}/add` to add new test cases from production incidents (when a quality issue is found in production, add the failing case to prevent regression), and `GET /api/v1/golden-sets/{artifact_type}/coverage` to analyze test coverage across quality dimensions. Create an eval dashboard: gate activity timeline, pass/fail rate trend, eval duration trend, golden test set growth, and gate effectiveness scores.
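A hedged FastAPI sketch of the two golden-set endpoints. The route paths come from the text above; the storage layer (a JSONL file per artifact type under `golden_sets/`) and the `GoldenCase` request schema are illustrative assumptions.

```python
import json
from pathlib import Path

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
GOLDEN_ROOT = Path("golden_sets")


class GoldenCase(BaseModel):
    question: str
    expected_answer: str
    quality_dimension: str          # e.g. faithfulness, format, tone (assumed taxonomy)
    source_incident: str | None = None


@app.post("/api/v1/golden-sets/{artifact_type}/add")
def add_golden_case(artifact_type: str, case: GoldenCase) -> dict:
    """Append a new test case, e.g. a failing case captured from a production incident."""
    path = GOLDEN_ROOT / f"{artifact_type}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(case.model_dump_json() + "\n")
    return {"status": "added", "artifact_type": artifact_type}


@app.get("/api/v1/golden-sets/{artifact_type}/coverage")
def golden_coverage(artifact_type: str) -> dict:
    """Summarize how many test cases cover each quality dimension."""
    path = GOLDEN_ROOT / f"{artifact_type}.jsonl"
    cases = [json.loads(line) for line in path.read_text().splitlines()] if path.exists() else []
    by_dimension: dict[str, int] = {}
    for c in cases:
        by_dimension[c["quality_dimension"]] = by_dimension.get(c["quality_dimension"], 0) + 1
    return {"total_cases": len(cases), "by_dimension": by_dimension}
```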
You will build comprehensive testing and validation for the automated eval gates system. Implement `AutomatedEvalGatesTester`: define test scenarios that verify all critical paths work correctly under normal conditions, edge cases, and failure conditions. Build integration tests that verify the system integrates correctly with upstream and downstream components. Implement regression testing: maintain a test suite that runs on every configuration change to catch regressions. Build a `POST /api/v1/automated-eval-gates/test` API that triggers the full test suite and returns results. Run the tests on a schedule as Argo CronWorkflows. Track `test_pass_rate_{system}_total` and `test_duration_seconds`. Build a test results dashboard showing pass rates, flaky tests, and coverage.
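A minimal sketch of the tester, assuming each scenario is a named callable that raises `AssertionError` on failure; the scenario registry, the result shape, and the sample scenario are illustrative. In a real scenario the callable would exercise the gate, for example running an eval against a deliberately regressed config and asserting that the gate blocks.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ScenarioResult:
    name: str
    passed: bool
    duration_seconds: float
    error: str | None = None


@dataclass
class AutomatedEvalGatesTester:
    scenarios: dict[str, Callable[[], None]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[], None]) -> None:
        self.scenarios[name] = fn

    def run_all(self) -> list[ScenarioResult]:
        """Run every registered scenario and collect pass/fail plus duration."""
        results: list[ScenarioResult] = []
        for name, fn in self.scenarios.items():
            start = time.monotonic()
            try:
                fn()
                results.append(ScenarioResult(name, True, time.monotonic() - start))
            except AssertionError as exc:
                results.append(ScenarioResult(name, False, time.monotonic() - start, str(exc)))
        return results


def smoke_scenario() -> None:
    # Placeholder: a real scenario would call into the gate logic and assert on its decision.
    assert 1 + 1 == 2, "smoke check failed"


tester = AutomatedEvalGatesTester()
tester.register("smoke_check", smoke_scenario)
print(tester.run_all())
```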
You will build performance monitoring and optimization for the automated eval gates system. Implement `AutomatedEvalGatesOptimizer`: instrument all critical paths with latency histograms, identify bottlenecks using p95/p99 analysis, and implement optimizations. Build capacity analysis: measure maximum throughput under load, identify scaling limits, and document capacity thresholds. Implement performance SLOs: define acceptable latency and throughput targets, track compliance, and alert on degradation. Build performance benchmarking: run standardized benchmarks on every significant change to detect performance regressions. Track `performance_benchmark_result_{system}` and `performance_slo_compliance_{system}`. Create a performance dashboard with trend analysis.
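A hedged sketch of the instrumentation and SLO-compliance pieces using `prometheus_client`. The histogram name, step labels, and SLO targets are illustrative assumptions, not the exact metrics named above; the p95 calculation uses a simple nearest-rank approximation over benchmark samples.

```python
import math
import time
from contextlib import contextmanager

from prometheus_client import Histogram

# Latency histogram for critical eval-gate paths, labeled per step.
EVAL_GATE_LATENCY = Histogram(
    "eval_gate_step_duration_seconds",
    "Latency of critical eval-gate paths",
    ["step"],
)


@contextmanager
def timed(step: str):
    """Record the wall-clock duration of a critical path under the given step label."""
    start = time.monotonic()
    try:
        yield
    finally:
        EVAL_GATE_LATENCY.labels(step=step).observe(time.monotonic() - start)


# Assumed SLO targets: p95 latency per step, in seconds.
SLO_P95_SECONDS = {"run_eval": 120.0, "gate_decision": 1.0}


def p95(samples: list[float]) -> float:
    """Nearest-rank p95 over a list of latency samples."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]


def slo_compliant(step: str, samples: list[float]) -> bool:
    return p95(samples) <= SLO_P95_SECONDS.get(step, float("inf"))


# Usage: wrap a critical path, then check benchmark samples against the SLO.
with timed("gate_decision"):
    time.sleep(0.01)  # stand-in for real work
print(slo_compliant("gate_decision", [0.2, 0.4, 0.9]))
```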
You will build comprehensive operational documentation and runbooks for the automated eval gates system. Implement `AutomatedEvalGatesDocGenerator`: auto-generate architecture diagrams from deployed resources, a configuration reference from active configs, and API documentation from FastAPI OpenAPI specs. Build operational runbooks: document common operational tasks (scaling, configuration changes, troubleshooting), emergency procedures (failure recovery, rollback), and maintenance procedures (upgrades, data migrations). Implement documentation freshness tracking: compare when the documentation was last updated with when the system was last changed, and flag stale docs. Store the documentation in Git with version tracking. Build a `GET /api/v1/automated-eval-gates/docs` endpoint that serves the current documentation. Track `documentation_freshness_{system}` and `documentation_coverage_{system}`.
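A minimal sketch of the freshness check, assuming the docs and the system code live in the same Git repository; the `docs/` and `src/` paths are illustrative. It compares the last-commit timestamp of each doc against the last-commit timestamp of the system sources and flags docs that are older.

```python
import subprocess
from pathlib import Path


def last_commit_ts(path: Path) -> int:
    """Unix timestamp of the last commit touching `path` (0 if never committed)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) if out else 0


def stale_docs(doc_dir: Path = Path("docs"), system_dir: Path = Path("src")) -> list[Path]:
    """Docs whose last update predates the last change to the system they describe."""
    system_ts = last_commit_ts(system_dir)
    return [doc for doc in doc_dir.rglob("*.md") if last_commit_ts(doc) < system_ts]


if __name__ == "__main__":
    for doc in stale_docs():
        print(f"stale: {doc}")
```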