Prerequisites

Before starting this chapter, you should have:

  • Python 3.11+ with experience writing test suites and assertions
  • Promptfoo: Understanding of eval configuration, providers, assertions, and test case structure
  • Argo Workflows: DAG templates, exit handlers, and parameter passing between workflow steps
  • YAML configuration: Writing and maintaining structured configuration files
  • Quality evaluation concepts: Familiarity with LLM output quality dimensions (faithfulness, format compliance, safety)
  • Chapter 16 completion: Progressive Delivery Engine that this chapter's eval gates feed into

You will need access to:

  • Promptfoo CLI installed locally or as a container image in your vCluster
  • Argo Workflows controller deployed on your vCluster
  • At least one LLM provider API key for running evaluations
  • PostgreSQL for storing eval results and gate history

Learning Goals

  1. Build Promptfoo eval suites for pre-promotion quality verification

    • Build Promptfoo Eval Suites for Pre-Promotion Quality VerificationPromptfoo provides the evaluation engine that powers the gate pipeline.

    • You will create Promptfoo evaluation configurations that test model outputs across multiple quality dimensions.

    • You will configure assertion types including exact match, contains, llm-rubric (using an LLM judge), similar (semantic similarity), and is-json (structural validation).

    • You will organize suites by quality dimension — faithfulness, format compliance, safety, and relevance — so that failures map directly to actionable categories.

  2. Implement eval gates in Argo Workflows that block promotion on failure

    • Implement Eval Gates in Argo Workflows That Block Promotion on FailureYou will build Argo Workflow templates that execute eval suites as workflow steps and use the results to make promotion decisions.

    • You will implement configurable thresholds and failure handling.

    • This matters because manual evaluation checks are inconsistent and slow.

  3. Create golden test sets for regression detection

    • Create Golden Test Sets for Regression DetectionYou will design and build golden test sets — curated collections of representative inputs with expected outputs that serve as the ground truth for eval gates.

    • You will build sets that cover representative cases, edge cases, and known failure modes, measuring coverage across quality dimensions.

    • This matters because eval gates are only as good as their test sets.

  4. Track eval pass rates and gate effectiveness metrics

    • Track Eval Pass Rates and Gate Effectiveness MetricsYou will build a metrics pipeline that tracks eval gate outcomes over time: pass rates per eval suite, per quality dimension, and per artifact type.

    • You will measure gate effectiveness by tracking false positive rates (gates that block good changes) and false negative rates (gates that pass bad changes that later cause incidents).

    • You will build a Grafana dashboard showing eval pipeline health and gate decision trends.

Key Terminology

Eval Gate
An automated checkpoint that runs standardized test suites and blocks promotion when quality falls below defined thresholds
Promptfoo
An open-source tool for evaluating LLM outputs against assertions including exact match, semantic similarity, LLM-as-judge, and structural validation
Golden Test Set
A curated collection of representative inputs with expected outputs serving as ground truth for quality evaluation
Quality Regression
A measurable decline in model output quality compared to the established baseline, detected by eval suite assertions
Eval Suite
A structured collection of test cases, providers, and assertions configured to evaluate a specific quality dimension
Test Assertion
A condition that an LLM output must satisfy, such as containing specific content, matching a format, or scoring above a threshold
Faithfulness Score
A quality metric measuring how accurately model outputs reflect the source material without hallucination
Format Compliance
A quality metric measuring how well model outputs conform to expected structural requirements (JSON schema, markdown format, etc.)
Pass Threshold
The minimum percentage of test cases that must pass for an eval suite to be considered successful
Gate Decision
The binary outcome (pass or block) produced by evaluating all suite results against their thresholds
Argo Workflow
A Kubernetes-native workflow engine for orchestrating complex multi-step processes with DAG dependencies
Promotion Block
The action of preventing an artifact from advancing to the next environment stage when eval gates fail
Quality Dimension
A specific aspect of output quality (faithfulness, format, safety, relevance) evaluated independently
Regression Detection
The process of identifying quality declines by comparing current eval scores against historical baselines
Baseline Score
The established quality level from previous successful evaluations, used as the comparison point for regression detection
Eval Metric
A quantitative measurement produced by an eval suite, such as pass rate, average score, or assertion failure count
Test Coverage
The degree to which golden test sets exercise the full range of expected inputs, edge cases, and failure modes
Eval Pipeline
The end-to-end workflow from triggering evaluation through result aggregation to gate decision
Gate Effectiveness
A meta-metric measuring how well an eval gate distinguishes between good and bad changes, tracked via false positive and false negative rates
False Positive Rate
The frequency at which an eval gate incorrectly blocks changes that would have been safe -- causes development friction
False Negative Rate
The frequency at which an eval gate incorrectly passes changes that later cause production quality incidents -- the most dangerous failure mode
LLM-as-Judge
An evaluation pattern where a separate LLM instance scores the output of the tested model against rubric criteria, enabling semantic evaluation beyond string matching
Eval Flakiness
Non-deterministic assertion results where the same input produces different pass/fail outcomes across runs, typically caused by LLM-based assertions with high temperature
Short-Circuit Evaluation
An optimization where cheap eval suites run first and expensive suites are skipped if early suites already fail, reducing total LLM API cost
Assertion Chain
Multiple assertions applied to a single test case output, where all must pass for the test case to succeed, enabling multi-dimensional quality checks
Test Set Versioning
The practice of incrementing a golden test set's version number on every modification, maintaining a clear history of how the test set evolved over time
Eval Cache
A store of previous eval results keyed by artifact content hash and test set version, enabling skip of re-evaluation when neither has changed
Gate Calibration
The ongoing process of adjusting pass thresholds and assertion sensitivity to maintain the right balance between catching regressions and allowing valid changes

On This Page