Build a stratified evaluation dataset

You will create a structured evaluation dataset for testing a hosted LLM (OpenAI GPT-4o) across multiple task categories. Define task categories: classification, summarization, extraction, generation, and reasoning. For each category, create 20 test cases at three difficulty levels (easy, medium, hard) with fields: input, expected_output, category, difficulty, source, and metadata. Store the dataset as JSONL with one test case per line. Build a DatasetBuilder class that loads the JSONL, validates each row against a Pydantic schema, and reports coverage statistics: cases per category, difficulty distribution, and average input/output token counts.

Build evaluation datasets with stratified sampling across task categories and difficulty levels

Introduction

When you draw evaluation cases uniformly at random from a raw test pool, the resulting accuracy number lies — easy cases and over-represented task categories swamp the signal, and a model that fails badly on rare-but-critical strata still scores in the high 90s. The consequence is a hosted LLM that looks production-ready on the dashboard and ships a regression to real users. By the end of this lesson you will be able to partition a test-case pool by task category and difficulty, draw a balanced sample with auditable per-stratum counts, and filter contaminated cases out before locking the golden dataset.

Key Terminology

  • Stratum: A non-overlapping partition of the test-case pool defined by a (task category, difficulty) pair; sampling draws independently from each stratum to control per-cell counts.
  • Golden dataset: The immutable, versioned evaluation benchmark that survives stratified sampling and contamination filtering, scored against every model release.
  • Contamination ratio: The fraction of a test case's n-grams that also appear in a reference training corpus; cases at or above the configured threshold are excluded before the dataset is locked.
  • Provenance: A per-case label (e.g. "human" or "synthetic") that records how the case was authored, so dataset cards report the human-vs-synthetic composition.

Concepts

Why naive random sampling fails for LLM evaluation

Random sampling from a raw test-case pool introduces two systemic risks. First, category imbalance: if your pool contains 400 classification examples but only 30 generation examples, a random draw of 200 cases will virtually guarantee that generation capability is under-tested. Second, difficulty skew: easy examples tend to outnumber hard ones because annotators produce simple cases faster, so random sampling yields an artificially inflated accuracy metric. Stratified sampling eliminates both risks by partitioning the pool into strata—defined by the cross-product of task category and difficulty level—and drawing from each stratum independently. The result is a dataset where every (category, difficulty) cell has a controlled, auditable count.
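To make the category imbalance concrete, the sketch below computes how many generation cases a uniform draw of 200 is expected to contain. It assumes, purely for illustration, a pool holding only the two categories mentioned above (400 classification, 30 generation); the helper name expected_count is hypothetical.

```python
def expected_count(stratum_size: int, pool_size: int, draw_size: int) -> float:
    """Expected number of cases from one stratum in a uniform draw
    without replacement (the mean of a hypergeometric distribution)."""
    return draw_size * stratum_size / pool_size

# Illustrative pool: only the two categories from the example above.
pool = {"classification": 400, "generation": 30}
pool_size = sum(pool.values())
draw_size = 200

expected = {cat: expected_count(n, pool_size, draw_size) for cat, n in pool.items()}
for cat, value in expected.items():
    print(f"{cat}: {value:.1f} expected cases ({value / draw_size:.0%} of the draw)")
```

Under these assumptions the draw is expected to contain about 14 generation cases out of 200, spread across all three difficulty levels, which is too few to support a per-difficulty pass rate for that category.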

Handling under-represented strata and edge cases

Production evaluation pools almost always have sparse strata. Hard-difficulty generation cases are notoriously scarce because they require expert annotators who can craft prompts that push the model into subtle failure modes—hallucination, instruction drift, or style collapse. Three strategies fill the gap:

  • Adversarial example injection: Manually or programmatically craft edge cases that target known model weaknesses. For extraction tasks, this means ambiguous entity boundaries; for reasoning tasks, it means multi-hop questions with plausible distractors. Patronus Generative Simulators automate this by systematically probing failure surfaces and generating adversarial prompts at scale.

  • Synthetic data generation: NeMo Safe Synthesizer produces privacy-compliant synthetic cases with differential privacy guarantees, letting you augment sparse strata without leaking PII from your training or annotation pipelines. Synthetic cases must carry provenance="synthetic" so that downstream dataset cards accurately report the composition.

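  • Inter-annotator agreement re-calibration: When difficulty labels are unreliable, recompute them from per-case annotator agreement scores. Fleiss' kappa itself is a single statistic over a whole set of cases, so a natural per-case score is the per-item agreement it aggregates (the fraction of annotator pairs that agree on that case). Cases with agreement below 0.6 are promoted to hard; cases above 0.9 are demoted to easy. This re-calibration often redistributes cases across strata and can resolve sparsity without generating new data.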
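The re-calibration rule in the last bullet can be sketched as follows. This is a minimal illustration, assuming the per-case score is the fraction of annotator pairs that agree (the per-item term that Fleiss' kappa averages); the function names item_agreement and recalibrate are hypothetical, and cases between the two thresholds simply keep their current label.

```python
from typing import List

def item_agreement(label_counts: List[int]) -> float:
    """Fraction of annotator pairs that agree on one case.

    label_counts[j] is how many annotators chose label j for this case.
    """
    n = sum(label_counts)
    if n < 2:
        raise ValueError("need at least two annotations per case")
    agreeing_pairs = sum(c * (c - 1) for c in label_counts)
    return agreeing_pairs / (n * (n - 1))

def recalibrate(current: str, agreement: float,
                low: float = 0.6, high: float = 0.9) -> str:
    """Promote low-agreement cases to hard, demote high-agreement cases to easy."""
    if agreement < low:
        return "hard"
    if agreement > high:
        return "easy"
    return current  # between the thresholds: keep the existing label

# Five annotators per case, two candidate labels.
print(recalibrate("medium", item_agreement([5, 0])))  # unanimous
print(recalibrate("medium", item_agreement([3, 2])))  # split vote
```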

Assembling the golden dataset

Once stratified sampling and contamination filtering are complete, the surviving cases form the golden dataset—the immutable reference benchmark against which every model version is scored. The golden dataset must be accompanied by a dataset card that documents the stratification grid, per-stratum counts, contamination exclusion counts, provenance breakdown (human vs. synthetic), inter-annotator agreement statistics for difficulty labels, and the Git commit hash of the snapshot. This metadata enables reproducible evaluation across teams and over time—a requirement for any organization operating under governance frameworks where audit trails are non-negotiable.

The practical workflow is: run StratifiedDatasetBuilder.build to produce the balanced draw, pipe the result through ContaminationChecker.filter_dataset, verify that no stratum dropped below the minimum count post-filtering (re-augment with NeMo Safe Synthesizer if necessary), serialize the final list to a versioned JSONL file (one test case per line), and commit it alongside its dataset card. The dataset card itself is a structured markdown or JSON document that serves as the single source of truth for what the dataset contains, why it was built, and how it should be used, analogous to a model card but scoped to evaluation data. When automated dataset refresh pipelines detect staleness (covered in another goal), they trigger re-curation through this same pipeline, ensuring that every new version of the golden dataset inherits the same stratification guarantees and contamination safeguards.
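The serialization and dataset-card steps might look like the sketch below. It assumes cases are plain dictionaries written one JSON object per line, and it adds a SHA-256 content hash to the card as an extra integrity check; the hash, the card field names, and the helpers to_jsonl and build_card are illustrative assumptions, not part of the lesson's classes.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, List

def to_jsonl(cases: List[Dict]) -> str:
    """Serialize one case per line with sorted keys so the bytes are stable."""
    return "".join(json.dumps(case, sort_keys=True) + "\n" for case in cases)

def build_card(cases: List[Dict], contamination_excluded: int) -> Dict:
    """Assemble a minimal dataset card (hypothetical field names)."""
    payload = to_jsonl(cases).encode("utf-8")
    return {
        "num_cases": len(cases),
        "per_stratum": dict(Counter(
            f"{c['category']}::{c['difficulty']}" for c in cases)),
        "provenance": dict(Counter(c["provenance"] for c in cases)),
        "contamination_excluded": contamination_excluded,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }

cases = [
    {"case_id": "r-1", "category": "reasoning", "difficulty": "hard", "provenance": "human"},
    {"case_id": "r-2", "category": "reasoning", "difficulty": "hard", "provenance": "synthetic"},
    {"case_id": "g-1", "category": "generation", "difficulty": "easy", "provenance": "human"},
]
card = build_card(cases, contamination_excluded=1)
print(json.dumps(card, indent=2))
```

Recording the content hash next to the Git commit hash lets a reviewer confirm that the file being scored is byte-for-byte the one the card describes.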

Code Walkthrough

Defining the stratification axes

For hosted-LLM evaluation, two primary axes form the stratification grid:

  • Task category: The functional capability being tested. Common categories for a general-purpose model like GPT-4o include classification, summarization, extraction, generation, and reasoning. Each category exercises a different internal pathway—classification demands label precision, while generation demands coherence and fluency over longer outputs.

  • Difficulty level: A discrete scale (typically easy, medium, hard) that captures the cognitive or computational complexity of a test case. Difficulty is often operationalized through inter-annotator agreement: cases where annotators unanimously agree on the correct answer are easy, cases with partial disagreement are medium, and cases where even experts diverge are hard.
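The grid formed by these two axes can be enumerated directly. The sketch below uses itertools.product and the "category::difficulty" key format that the builder code later in this lesson uses; it is only meant to show the shape of the grid.

```python
from itertools import product

TASK_CATEGORIES = ["classification", "summarization", "extraction",
                   "generation", "reasoning"]
DIFFICULTY_LEVELS = ["easy", "medium", "hard"]

# One key per (category, difficulty) cell of the stratification grid.
strata_keys = [f"{cat}::{diff}"
               for cat, diff in product(TASK_CATEGORIES, DIFFICULTY_LEVELS)]

print(len(strata_keys))
print(strata_keys[:3])
```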

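The cross-product of five categories and three difficulty levels yields fifteen strata. A well-designed golden dataset allocates a minimum count per cell, typically 20 to 50 cases, so that every cell produces a usable pass-rate estimate. Those per-cell estimates are coarse: under the normal approximation, the worst-case margin of error at 95% confidence is about ±22 percentage points with 20 cases and about ±14 points with 50, and bringing it below five points takes roughly 385 cases per cell. Size the minimum count against the precision you actually need per cell.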
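A quick way to sanity-check a per-cell minimum is the normal approximation for a binomial proportion, where the margin of error is z * sqrt(p * (1 - p) / n). The sketch below assumes the worst case p = 0.5 and z = 1.96 for 95% confidence; the function names are hypothetical.

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a pass rate."""
    return z * math.sqrt(p * (1 - p) / n)

def cases_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest cell size whose margin of error is at most `margin`."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))

for n in (20, 30, 50):
    print(f"n={n}: +/- {margin_of_error(n):.1%}")
print("cases for +/- 5 points:", cases_needed(0.05))
```

The normal approximation is itself rough at small n or when pass rates sit near 0% or 100%, so treat these figures as planning estimates rather than exact intervals.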

Designing the test-case schema

Every test case in the evaluation dataset must carry enough metadata to support stratified draws, contamination detection, and reproducible dataset versioning. The minimum viable schema includes: the prompt text, the expected reference output, the task category label, the difficulty level, a unique case identifier for traceability, and a provenance field indicating whether the case is human-authored or synthetically generated (relevant when you later integrate NeMo Safe Synthesizer for privacy-compliant synthetic data or Patronus Generative Simulators for adversarial examples).

The following code defines an EvalCase dataclass and a StratifiedDatasetBuilder class that accepts raw test cases, buckets them into strata, and draws a balanced sample. The StratifiedDatasetBuilder.build method uses Python's random.sample for within-stratum draws and raises a ValueError when any stratum falls below the required minimum count, preventing silently under-powered evaluations.

```python
import random
from dataclasses import dataclass, field
from collections import defaultdict
from typing import List, Dict, Optional

TASK_CATEGORIES = ["classification", "summarization", "extraction",
                   "generation", "reasoning"]
DIFFICULTY_LEVELS = ["easy", "medium", "hard"]

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    reference_output: str
    category: str
    difficulty: str
    provenance: str = "human"
    metadata: Dict = field(default_factory=dict)

class StratifiedDatasetBuilder:
    def __init__(self, min_per_stratum: int = 30):
        self.min_per_stratum = min_per_stratum
        self._pool: List[EvalCase] = []
        self._strata: Dict[str, List[EvalCase]] = defaultdict(list)

    def add_cases(self, cases: List[EvalCase]) -> None:
        for case in cases:
            if case.category not in TASK_CATEGORIES:
                raise ValueError(f"Unknown category: {case.category}")
            if case.difficulty not in DIFFICULTY_LEVELS:
                raise ValueError(f"Unknown difficulty: {case.difficulty}")
            key = f"{case.category}::{case.difficulty}"
            self._strata[key].append(case)
            self._pool.append(case)

    def coverage_report(self) -> Dict[str, int]:
        report = {}
        for cat in TASK_CATEGORIES:
            for diff in DIFFICULTY_LEVELS:
                key = f"{cat}::{diff}"
                report[key] = len(self._strata.get(key, []))
        return report

    def build(self, per_stratum: Optional[int] = None,
              seed: int = 42) -> List[EvalCase]:
        target = per_stratum or self.min_per_stratum
        random.seed(seed)
        dataset = []
        for cat in TASK_CATEGORIES:
            for diff in DIFFICULTY_LEVELS:
                key = f"{cat}::{diff}"
                stratum = self._strata.get(key, [])
                if len(stratum) < target:
                    raise ValueError(
                        f"Stratum '{key}' has {len(stratum)} cases, "
                        f"need {target}. Add more cases or use "
                        f"synthetic augmentation."
                    )
                dataset.extend(random.sample(stratum, target))
        return dataset
```
  • The module imports bring in defaultdict from collections (which enables automatic list initialization for each stratum key) and field from dataclasses (which provides a factory for mutable default values).
  • TASK_CATEGORIES and DIFFICULTY_LEVELS define the two stratification axes as module-level constants. Centralizing these lists ensures that every downstream component references the same canonical set of categories and difficulty levels.
  • The EvalCase dataclass captures the full metadata envelope for a single test case. The provenance field defaults to "human" but accepts "synthetic" for cases generated by tools like NeMo Safe Synthesizer. The metadata dictionary uses field(default_factory=dict) to avoid the mutable-default-argument pitfall.
  • StratifiedDatasetBuilder.__init__ sets the minimum-per-stratum threshold and initializes both the flat pool and the stratum dictionary. The defaultdict(list) ensures that appending to a new stratum key never raises a KeyError.
  • add_cases validates each case against the canonical axes before inserting it into the appropriate stratum. Raising a ValueError on unknown categories enforces schema discipline at ingestion time rather than at evaluation time.
  • coverage_report iterates the full cross-product of categories and difficulty levels, returning the count for every cell. Strata with zero cases surface immediately, making gap analysis trivial before you commit to a build.
  • build is the core sampling routine. It seeds the random number generator for reproducibility—critical for dataset versioning—and draws exactly target cases from each stratum. If any stratum is under-populated, it raises a ValueError with an actionable message suggesting synthetic augmentation rather than silently degrading coverage.
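The sampling idea behind build can also be written as a standalone function. The variant below is a sketch, not the lesson's class: it takes cases as plain dictionaries, iterates only over strata that are present in the pool, and uses a local random.Random(seed) instance instead of random.seed(seed), so the draw stays reproducible without resetting the global generator that other code may share. The name stratified_sample is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List

def stratified_sample(cases: List[Dict], per_stratum: int,
                      seed: int = 42) -> List[Dict]:
    """Draw `per_stratum` cases from every (category, difficulty) bucket."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    strata = defaultdict(list)
    for case in cases:
        strata[(case["category"], case["difficulty"])].append(case)

    sample = []
    for key in sorted(strata):  # fixed iteration order keeps the draw reproducible
        bucket = strata[key]
        if len(bucket) < per_stratum:
            raise ValueError(
                f"Stratum {key} has {len(bucket)} cases, need {per_stratum}.")
        sample.extend(rng.sample(bucket, per_stratum))
    return sample

# Toy pool: 2 categories x 2 difficulty levels, 5 cases each.
pool = [{"case_id": f"{c}-{d}-{i}", "category": c, "difficulty": d}
        for c in ("extraction", "reasoning")
        for d in ("easy", "hard")
        for i in range(5)]

sample = stratified_sample(pool, per_stratum=2, seed=42)
print([case["case_id"] for case in sample])
```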

Contamination-aware dataset splits

A stratified dataset is useless if the evaluation cases have leaked into the model's training data. Contamination detection must happen before the final golden dataset is locked. The standard approach computes n-gram overlap between each evaluation prompt and a reference corpus of known training documents. Cases whose overlap reaches a threshold (typically 80% or more of the prompt's 8-grams found in the corpus) are flagged and excluded.

The following code implements a lightweight contamination checker using the ContaminationChecker class. The check method extracts word-level n-grams (lowercased, whitespace-tokenized) from each evaluation prompt, compares them against a pre-built set of training n-grams, and returns a contamination ratio. Cases where the ratio meets or exceeds the configurable threshold parameter are flagged with is_contaminated set to True, enabling downstream filtering before the dataset is finalized into reproducible, Git-tracked snapshots.

```python
from typing import List, Set, Tuple

# EvalCase is the dataclass defined in the previous snippet.

class ContaminationChecker:
    def __init__(self, training_corpus: List[str], n: int = 8,
                 threshold: float = 0.80):
        self.n = n
        self.threshold = threshold
        self.training_ngrams: Set[str] = set()
        for doc in training_corpus:
            self.training_ngrams.update(self._extract_ngrams(doc))

    def _extract_ngrams(self, text: str) -> Set[str]:
        tokens = text.lower().split()
        if len(tokens) < self.n:
            return set()
        return {" ".join(tokens[i:i + self.n])
                for i in range(len(tokens) - self.n + 1)}

    def check(self, case: EvalCase) -> Tuple[bool, float]:
        case_ngrams = self._extract_ngrams(case.prompt)
        if not case_ngrams:
            return False, 0.0
        overlap = case_ngrams & self.training_ngrams
        ratio = len(overlap) / len(case_ngrams)
        is_contaminated = ratio >= self.threshold
        return is_contaminated, ratio

    def filter_dataset(self, cases: List[EvalCase]) -> List[EvalCase]:
        clean = []
        for case in cases:
            contaminated, ratio = self.check(case)
            if not contaminated:
                clean.append(case)
            else:
                # Record why the case was excluded, for later auditing.
                case.metadata["contamination_ratio"] = ratio
        return clean
```
  • The Set and Tuple imports from typing provide explicit type annotations that improve readability in a shared codebase.
  • ContaminationChecker.__init__ pre-computes the full set of n-grams from the training corpus. Storing them in a set enables O(1) membership tests during the check phase. The n parameter defaults to 8, which balances between catching true overlaps and avoiding false positives from common phrases.
  • _extract_ngrams lowercases the text, tokenizes on whitespace, and generates all contiguous n-token windows. If the text is shorter than n tokens, it returns an empty set to avoid generating degenerate partial n-grams.
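  • check computes the intersection between the case's n-grams and the training set, then calculates the contamination ratio. The boolean is_contaminated is True when the ratio meets or exceeds the threshold, and False otherwise. A prompt shorter than n tokens yields no n-grams and is returned as (False, 0.0), so very short prompts are never flagged by this check.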
  • filter_dataset iterates through all cases and returns only the clean ones. Contaminated cases are not discarded silently: their metadata dictionary is annotated with the contamination ratio, enabling auditing through the dataset card. Because the annotated objects are left out of the returned list, keep a reference to the input list if the card needs to report exclusions. This transparency is essential for dataset versioning: every exclusion must be traceable.
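A small worked example shows how the ratio behaves. The sketch below reimplements the same word-level n-gram overlap as standalone functions and uses n = 3 instead of 8 so the sentences stay short; the function names and the toy corpus are illustrative assumptions.

```python
from typing import List, Set

def word_ngrams(text: str, n: int) -> Set[str]:
    """All contiguous n-word windows of the lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_ratio(prompt: str, corpus: List[str], n: int = 8) -> float:
    """Share of the prompt's n-grams that also occur somewhere in the corpus."""
    prompt_ngrams = word_ngrams(prompt, n)
    if not prompt_ngrams:
        return 0.0  # prompt shorter than n tokens: nothing to compare
    corpus_ngrams: Set[str] = set()
    for doc in corpus:
        corpus_ngrams |= word_ngrams(doc, n)
    return len(prompt_ngrams & corpus_ngrams) / len(prompt_ngrams)

corpus = ["The quick brown fox jumps over the lazy dog"]

# 3-grams of the prompt: "the quick brown", "quick brown fox", "brown fox sleeps".
# The first two occur in the corpus, the third does not.
ratio = contamination_ratio("the quick brown fox sleeps", corpus, n=3)
print(f"{ratio:.2f}")

# A two-word prompt has no 3-grams, so it is reported as clean.
print(contamination_ratio("quick fox", corpus, n=3))
```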

Do's and Don'ts

Do's

  1. Do call coverage_report() before invoking build() — it surfaces the per-cell counts across all 15 strata (5 categories × 3 difficulty levels) before you commit to a per_stratum target, giving you time to augment sparse cells with synthetic cases rather than hitting StratifiedDatasetBuilder.build's ValueError mid-pipeline when it is too late to course-correct.
  2. Do pass a fixed seed integer to StratifiedDatasetBuilder.build: calling random.seed(seed) before random.sample makes every within-stratum draw reproducible, so two engineers running the same builder on the same pool get identical golden datasets and pass-rate comparisons across model versions reflect evaluation-set identity, not sampling variance.
  3. Do populate the provenance field on every EvalCase with either "human" or "synthetic" — splitting pass-rate analysis by provenance exposes generalization gaps that a provenance-blind aggregate conceals: a model that scores 94% on human-authored reasoning cases but 71% on synthetically generated ones is failing in a way the overall number never reveals.
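Splitting pass rates by provenance takes only a few lines once each result carries the label. The sketch below assumes results arrive as (provenance, passed) pairs; that shape, the toy data, and the function name pass_rate_by_provenance are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def pass_rate_by_provenance(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Pass rate per provenance label from (provenance, passed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for provenance, passed in results:
        totals[provenance] += 1
        passes[provenance] += int(passed)
    return {label: passes[label] / totals[label] for label in totals}

# Toy results, not real measurements.
results = [("human", True), ("human", True), ("human", True), ("human", False),
           ("synthetic", True), ("synthetic", False)]

rates = pass_rate_by_provenance(results)
overall = sum(passed for _, passed in results) / len(results)
print(rates, f"overall={overall:.2f}")
```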

Don'ts

  1. Don't draw evaluation cases uniformly at random from the raw test pool — easy cases and over-represented task categories dominate the sample, so a model that fails catastrophically on hard::reasoning or hard::extraction strata still posts high-90s accuracy when most draws are easy::classification, producing a dashboard number that looks production-ready while a regression ships to real users.
  2. Don't catch or suppress the ValueError that StratifiedDatasetBuilder.build raises when a stratum is under-populated. The error fires precisely when a cell's case count falls below target (defaulting to min_per_stratum), and silently proceeding with whatever cases exist widens the confidence interval on that cell's pass rate beyond what the minimum-count design was sized for.
  3. Don't replace field(default_factory=dict) with a bare {} default in EvalCase. The dataclass decorator rejects a dict default with a ValueError at class-definition time, and workarounds that bypass that check (such as a hand-written __init__ with metadata={}) make every case instantiated without explicit metadata reference the same dictionary object, so mutating one case's metadata silently poisons all others and corrupts the auxiliary fields, such as contamination_ratio, that dataset versioning depends on.
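The behaviour behind the last point can be checked directly. In the sketch below, BadCase and GoodCase are hypothetical stand-ins for EvalCase reduced to two fields: the first shows that the dataclass decorator refuses a bare {} default, the second shows that default_factory gives every instance its own dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict

# A bare {} default is rejected when the class is defined.
try:
    @dataclass
    class BadCase:
        case_id: str
        metadata: Dict = {}
    rejected = False
except ValueError as exc:
    rejected = True
    print("dataclass refused the default:", exc)

@dataclass
class GoodCase:
    case_id: str
    metadata: Dict = field(default_factory=dict)  # fresh dict per instance

a, b = GoodCase("a"), GoodCase("b")
a.metadata["contamination_ratio"] = 0.9

print(rejected)    # the bare default never produced a usable class
print(b.metadata)  # b is unaffected by the write to a.metadata
```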
