Prerequisites

Before starting this chapter, you should have:

Python 3.11+ with experience writing test suites and assertions
Promptfoo: Understanding of eval configuration, providers, assertions, and test case structure
Argo Workflows: DAG templates, exit handlers, and parameter passing between workflow steps
YAML configuration: Writing and maintaining structured configuration files
Quality evaluation concepts: Familiarity with LLM output quality dimensions (faithfulness, format compliance, safety)
Chapter 16 completion: Progressive Delivery Engine that this chapter's eval gates feed into

You will need access to:

Promptfoo CLI installed locally or as a container image in your vCluster
Argo Workflows controller deployed on your vCluster
At least one LLM provider API key for running evaluations
PostgreSQL for storing eval results and gate history

Learning Goals

Build Promptfoo eval suites for pre-promotion quality verification
- Build Promptfoo Eval Suites for Pre-Promotion Quality VerificationPromptfoo provides the evaluation engine that powers the gate pipeline.
- You will create Promptfoo evaluation configurations that test model outputs across multiple quality dimensions.
- You will configure assertion types including exact match, contains, llm-rubric (using an LLM judge), similar (semantic similarity), and is-json (structural validation).
- You will organize suites by quality dimension — faithfulness, format compliance, safety, and relevance — so that failures map directly to actionable categories.
Implement eval gates in Argo Workflows that block promotion on failure
- Implement Eval Gates in Argo Workflows That Block Promotion on FailureYou will build Argo Workflow templates that execute eval suites as workflow steps and use the results to make promotion decisions.
- You will implement configurable thresholds and failure handling.
- This matters because manual evaluation checks are inconsistent and slow.
Create golden test sets for regression detection
- Create Golden Test Sets for Regression DetectionYou will design and build golden test sets — curated collections of representative inputs with expected outputs that serve as the ground truth for eval gates.
- You will build sets that cover representative cases, edge cases, and known failure modes, measuring coverage across quality dimensions.
- This matters because eval gates are only as good as their test sets.
Track eval pass rates and gate effectiveness metrics
- Track Eval Pass Rates and Gate Effectiveness MetricsYou will build a metrics pipeline that tracks eval gate outcomes over time: pass rates per eval suite, per quality dimension, and per artifact type.
- You will measure gate effectiveness by tracking false positive rates (gates that block good changes) and false negative rates (gates that pass bad changes that later cause incidents).
- You will build a Grafana dashboard showing eval pipeline health and gate decision trends.

Key Terminology

Eval Gate

An automated checkpoint that runs standardized test suites and blocks promotion when quality falls below defined thresholds

Promptfoo

An open-source tool for evaluating LLM outputs against assertions including exact match, semantic similarity, LLM-as-judge, and structural validation

Golden Test Set

A curated collection of representative inputs with expected outputs serving as ground truth for quality evaluation

Quality Regression

A measurable decline in model output quality compared to the established baseline, detected by eval suite assertions

Eval Suite

A structured collection of test cases, providers, and assertions configured to evaluate a specific quality dimension

Test Assertion

A condition that an LLM output must satisfy, such as containing specific content, matching a format, or scoring above a threshold

Faithfulness Score

A quality metric measuring how accurately model outputs reflect the source material without hallucination

Format Compliance

A quality metric measuring how well model outputs conform to expected structural requirements (JSON schema, markdown format, etc.)

Pass Threshold

The minimum percentage of test cases that must pass for an eval suite to be considered successful

Gate Decision

The binary outcome (pass or block) produced by evaluating all suite results against their thresholds

Argo Workflow

A Kubernetes-native workflow engine for orchestrating complex multi-step processes with DAG dependencies

Promotion Block

The action of preventing an artifact from advancing to the next environment stage when eval gates fail

Quality Dimension

A specific aspect of output quality (faithfulness, format, safety, relevance) evaluated independently

Regression Detection

The process of identifying quality declines by comparing current eval scores against historical baselines

Baseline Score

The established quality level from previous successful evaluations, used as the comparison point for regression detection

Eval Metric

A quantitative measurement produced by an eval suite, such as pass rate, average score, or assertion failure count

Test Coverage

The degree to which golden test sets exercise the full range of expected inputs, edge cases, and failure modes

Eval Pipeline

The end-to-end workflow from triggering evaluation through result aggregation to gate decision

Gate Effectiveness

A meta-metric measuring how well an eval gate distinguishes between good and bad changes, tracked via false positive and false negative rates

False Positive Rate

The frequency at which an eval gate incorrectly blocks changes that would have been safe -- causes development friction

False Negative Rate

The frequency at which an eval gate incorrectly passes changes that later cause production quality incidents -- the most dangerous failure mode

LLM-as-Judge

An evaluation pattern where a separate LLM instance scores the output of the tested model against rubric criteria, enabling semantic evaluation beyond string matching

Eval Flakiness

Non-deterministic assertion results where the same input produces different pass/fail outcomes across runs, typically caused by LLM-based assertions with high temperature

Short-Circuit Evaluation

An optimization where cheap eval suites run first and expensive suites are skipped if early suites already fail, reducing total LLM API cost

Assertion Chain

Multiple assertions applied to a single test case output, where all must pass for the test case to succeed, enabling multi-dimensional quality checks

Test Set Versioning

The practice of incrementing a golden test set's version number on every modification, maintaining a clear history of how the test set evolved over time

Eval Cache

A store of previous eval results keyed by artifact content hash and test set version, enabling skip of re-evaluation when neither has changed

Gate Calibration

The ongoing process of adjusting pass thresholds and assertion sensitivity to maintain the right balance between catching regressions and allowing valid changes

On This Page

Prerequisites

Learning Goals

Key Terminology

On This Page