On This Page
Prerequisites
Before starting this chapter, you should have:
- Python 3.11+ with experience writing test suites and assertions
- Promptfoo: Understanding of eval configuration, providers, assertions, and test case structure
- Argo Workflows: DAG templates, exit handlers, and parameter passing between workflow steps
- YAML configuration: Writing and maintaining structured configuration files
- Quality evaluation concepts: Familiarity with LLM output quality dimensions (faithfulness, format compliance, safety)
- Chapter 16 completion: Progressive Delivery Engine that this chapter's eval gates feed into
You will need access to:
- Promptfoo CLI installed locally or as a container image in your vCluster
- Argo Workflows controller deployed on your vCluster
- At least one LLM provider API key for running evaluations
- PostgreSQL for storing eval results and gate history
Learning Goals
-
Build Promptfoo eval suites for pre-promotion quality verification
-
Build Promptfoo Eval Suites for Pre-Promotion Quality VerificationPromptfoo provides the evaluation engine that powers the gate pipeline.
-
You will create Promptfoo evaluation configurations that test model outputs across multiple quality dimensions.
-
You will configure assertion types including exact match, contains, llm-rubric (using an LLM judge), similar (semantic similarity), and is-json (structural validation).
-
You will organize suites by quality dimension — faithfulness, format compliance, safety, and relevance — so that failures map directly to actionable categories.
-
-
Implement eval gates in Argo Workflows that block promotion on failure
-
Implement Eval Gates in Argo Workflows That Block Promotion on FailureYou will build Argo Workflow templates that execute eval suites as workflow steps and use the results to make promotion decisions.
-
You will implement configurable thresholds and failure handling.
-
This matters because manual evaluation checks are inconsistent and slow.
-
-
Create golden test sets for regression detection
-
Create Golden Test Sets for Regression DetectionYou will design and build golden test sets — curated collections of representative inputs with expected outputs that serve as the ground truth for eval gates.
-
You will build sets that cover representative cases, edge cases, and known failure modes, measuring coverage across quality dimensions.
-
This matters because eval gates are only as good as their test sets.
-
-
Track eval pass rates and gate effectiveness metrics
-
Track Eval Pass Rates and Gate Effectiveness MetricsYou will build a metrics pipeline that tracks eval gate outcomes over time: pass rates per eval suite, per quality dimension, and per artifact type.
-
You will measure gate effectiveness by tracking false positive rates (gates that block good changes) and false negative rates (gates that pass bad changes that later cause incidents).
-
You will build a Grafana dashboard showing eval pipeline health and gate decision trends.
-