Prerequisites
This chapter assumes proficiency in Python 3.10+, including working knowledge of dataclasses, pathlib, and typing generics. Students should have hands-on experience with Git branching and tagging workflows, as dataset versioning builds directly on these primitives. Familiarity with pandas DataFrames and basic statistical concepts—stratified sampling, distribution analysis, and inter-rater reliability metrics—is expected. Prior exposure to JSON Schema validation and at least one cloud object store (GCS or S3) will accelerate the infrastructure sections. No prior chapters in this course are required.
Learning Goals
- Build evaluation datasets with stratified sampling across task categories and difficulty levels to ensure that every hosted LLM evaluation produces results you can trust across the full distribution of production traffic rather than only the easy, well-represented cases. Senior engineers routinely discover that flat random sampling over-represents common query patterns—customer-support summarization, simple Q&A—while under-representing the tail of hard, domain-specific prompts that cause the most severe production failures. Stratified sampling corrects this by partitioning the evaluation population into non-overlapping strata defined by two or more axes—task category (summarization, extraction, reasoning, code generation, creative writing) and difficulty level (routine, moderate, adversarial)—and then drawing samples from each stratum according to a target allocation. The allocation can mirror production traffic proportions for a "representative" dataset, or it can deliberately over-sample rare-but-critical strata for a "stress-test" dataset; the choice depends on the evaluation objective, and a mature evaluation platform typically maintains both variants. The practical challenge is defining strata boundaries that are stable over time: if your task taxonomy shifts every quarter, historical comparisons break down, so you need a versioned taxonomy registry that maps every evaluation example to exactly one stratum and tracks reclassifications. You will implement this as a Python pipeline that reads raw candidate examples from a labeled pool, computes stratum membership using a deterministic classifier, applies proportional or disproportional allocation via numpy random sampling with a fixed seed, validates that minimum-per-stratum counts are met, and writes the resulting dataset to a versioned Parquet file with full provenance metadata (the sampling step is sketched in the code example that follows this goal's sub-items). The pipeline must raise a ValueError when any stratum falls below its minimum count rather than silently producing an imbalanced dataset, because downstream evaluation metrics computed on under-represented strata will have confidence intervals too wide to be actionable.
  - Design a two-axis stratification schema that partitions evaluation examples by task category and difficulty level simultaneously, producing a grid of strata such as "summarization × adversarial" or "code-generation × routine." You will learn how to define difficulty levels operationally—using proxy signals like prompt token count, required reasoning steps, or historical model failure rate—so that the strata are reproducible by any team member without subjective judgment calls. The schema is stored as a JSON configuration file that the sampling pipeline reads at runtime, enabling you to add new task categories or difficulty buckets without changing pipeline code.
  - Implement minimum-per-stratum guarantees with overflow redistribution so that the sampling algorithm never produces a stratum with fewer than a configurable floor (typically 30 examples for statistical power). When the candidate pool for a stratum is smaller than the floor, the pipeline logs a warning, marks that stratum as "pool-limited," and redistributes the surplus budget to the remaining strata proportionally. This ensures the total dataset size stays within your compute budget for evaluation while maximizing coverage. You will handle the edge case where the entire candidate pool is smaller than the sum of all stratum floors by raising a RuntimeError with a diagnostic message listing which strata are deficient and by how many examples.
  - Validate stratum balance post-sampling with chi-squared goodness-of-fit tests to confirm that the realized sample proportions match the target allocation within an acceptable tolerance. The pipeline computes a chi-squared statistic comparing observed versus expected counts per stratum and flags the dataset as "imbalanced" if the p-value falls below a threshold you configure (default 0.05). This automated check catches bugs in the sampling logic, label drift in the candidate pool, and silent changes to the taxonomy registry—all of which are common failure modes at scale. The validation result is written into the dataset's metadata sidecar so that downstream evaluation dashboards can surface balance warnings without re-running the sampling pipeline.
  - Generate dataset cards with stratum-level statistics that document the provenance, composition, and intended use of each evaluation dataset. The card includes the sampling seed, the taxonomy version, per-stratum counts with percentages, the chi-squared balance check result, the timestamp, and the Git commit hash of the pipeline code. This card follows an internal schema inspired by Hugging Face dataset cards but extended with fields specific to LLM evaluation—such as the target model family, the evaluation harness version, and the expected baseline score range. Producing the card automatically as part of the sampling pipeline ensures that no evaluation dataset ships without documentation, which is critical when audit teams or external regulators review your evaluation methodology.
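A minimal sketch of the sampling step described in this goal, assuming the candidate pool is a pandas DataFrame whose stratum column has already been assigned by the deterministic classifier; the function name sample_stratified, the allocation mapping, and the seed default are illustrative rather than the chapter's actual API:

```python
import numpy as np
import pandas as pd

def sample_stratified(
    pool: pd.DataFrame,
    allocation: dict[str, int],
    min_per_stratum: int = 30,
    seed: int = 1337,
) -> pd.DataFrame:
    """Draw a per-stratum sample with a fixed seed and hard minimum-count checks.

    `pool` must carry a `stratum` column produced by the deterministic
    classifier; `allocation` maps stratum name -> target example count.
    """
    rng = np.random.default_rng(seed)
    parts: list[pd.DataFrame] = []
    deficient: dict[str, int] = {}

    for stratum, target in allocation.items():
        candidates = pool[pool["stratum"] == stratum]
        if len(candidates) < min_per_stratum:
            # Record the shortfall instead of sampling: metrics computed on an
            # under-filled stratum have confidence intervals too wide to act on.
            deficient[stratum] = min_per_stratum - len(candidates)
            continue
        take = min(target, len(candidates))
        chosen = rng.choice(candidates.index.to_numpy(), size=take, replace=False)
        parts.append(pool.loc[chosen])

    if deficient:
        # Fail loudly rather than silently shipping an imbalanced dataset.
        raise ValueError(f"strata below the minimum count: {deficient}")

    sampled = pd.concat(parts).reset_index(drop=True)
    # In the full pipeline the seed, taxonomy version, and per-stratum counts
    # are written to a provenance sidecar next to the versioned Parquet file.
    return sampled
```

In the full pipeline this function would be followed by the Parquet write, the chi-squared balance check, and the dataset-card generation described in the sub-items above.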
- Implement dataset versioning with Git-based tracking and reproducible snapshots so that every evaluation run can be traced back to the exact dataset state that produced its scores, enabling reliable regression detection across model releases and preventing the silent dataset drift that invalidates longitudinal comparisons. In production LLM evaluation, datasets are living artifacts: annotators fix labels, new examples arrive from production logs, stale examples are retired, and taxonomy schemas evolve. Without disciplined versioning, a "v2" dataset might differ from "v1" by thousands of examples with no record of what changed, making it impossible to determine whether a score change reflects genuine model improvement or merely dataset mutation. Git-based versioning solves this by treating the dataset definition—the sampling configuration, the taxonomy schema, and the example manifest—as code that lives in a repository, while the heavyweight data payloads (Parquet files, audio clips, images) are tracked via Git LFS or DVC pointers that reference content-addressable storage. Each dataset version corresponds to a Git tag of the form eval-dataset/{dataset-name}/v{major}.{minor}.{patch}, where major bumps indicate schema changes, minor bumps indicate example additions or removals, and patch bumps indicate label corrections. The pipeline enforces semantic versioning rules: a label correction that does not change the set of example IDs must be a patch bump, while adding a new stratum changes the schema and therefore requires a major bump. Attempting to tag a version that violates these rules raises an AssertionError with a diff summary showing which rule was broken. Reproducibility requires more than version tags; it requires deterministic snapshot generation. Given a dataset version tag, the reproduce_snapshot.py script checks out the tagged commit, re-runs the sampling pipeline with the recorded seed and configuration, and compares the output byte-for-byte against the stored Parquet file using SHA-256 hashes. If the hashes diverge, the script exits with code 1 and prints a row-level diff showing which examples differ—a diagnostic that has saved teams weeks of debugging when a dependency upgrade silently changed random number generation behavior. You will also implement a lightweight dataset changelog generator that parses Git history between two version tags and produces a human-readable summary of added, removed, and relabeled examples grouped by stratum, which evaluation stakeholders review before approving a new dataset version for use in release-gate evaluations.
  - Configure DVC pipelines alongside Git for large-artifact tracking so that Parquet files, embedding caches, and audio samples are stored in cloud object storage (GCS or S3) while their content hashes live in the Git repository. You will learn how to structure the DVC stage graph so that re-running dvc repro after a configuration change automatically re-samples the dataset, regenerates the dataset card, and pushes the new artifacts to remote storage—all triggered by a single command. The DVC lock file captures the exact input hashes, parameter values, and output hashes, providing a machine-readable reproducibility receipt that complements the Git tag.
  - Enforce semantic version bumping rules in a pre-commit hook that inspects the diff between the staged dataset manifest and the previous version, classifies the change type (schema change, example set change, or label-only change), and verifies that the proposed version tag matches the required bump level (the classification rule is sketched in the code example that follows this goal's sub-items). If an engineer attempts to tag a minor bump for a change that actually modifies the stratum schema, the hook rejects the commit with a clear error message explaining that a major bump is required. This automation eliminates the most common versioning mistakes—under-bumping that causes downstream pipelines to load an incompatible schema, and over-bumping that fragments the version history unnecessarily.
  - Build a snapshot reproducibility test that runs in CI so that every pull request that modifies the sampling pipeline or taxonomy schema triggers an automated check confirming that the current dataset version can still be reproduced from its tagged configuration. The CI job checks out the latest version tag, runs the sampling pipeline in a Docker container with pinned dependency versions, and asserts hash equality against the stored artifact. This test catches dependency drift, non-deterministic code paths (such as using set iteration order instead of sorted lists), and accidental mutations to shared configuration files. When the test fails, the CI log includes the row-level diff and a suggested remediation—either pin the offending dependency or cut a new dataset version that acknowledges the change.
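The bump-classification rule referenced in the pre-commit-hook item above might look like the following sketch; the Manifest fields and function names are assumptions for illustration, and the diff summary carried by the AssertionError in the chapter's pipeline is reduced here to a one-line message:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Manifest:
    schema_fingerprint: str      # hash of the stratum/taxonomy schema
    example_ids: frozenset[str]  # IDs of every example in the dataset
    labels: dict[str, str]       # example ID -> label

def required_bump(prev: Manifest, new: Manifest) -> str:
    """Classify a manifest change into the semver bump level it requires."""
    if prev.schema_fingerprint != new.schema_fingerprint:
        return "major"           # schema change (e.g., a new stratum)
    if prev.example_ids != new.example_ids:
        return "minor"           # examples added or removed
    if prev.labels != new.labels:
        return "patch"           # label-only correction
    return "none"                # identical manifests need no new tag

def check_proposed_bump(prev: Manifest, new: Manifest, proposed: str) -> None:
    """Reject under- and over-bumped tags, mirroring the pre-commit hook."""
    needed = required_bump(prev, new)
    assert proposed == needed, (
        f"proposed bump '{proposed}' does not match required bump '{needed}'"
    )
```

A real hook would also report which schema fields, example IDs, or labels differ, which is the diff summary the goal describes.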
- Create dataset quality checks for balance, coverage, and contamination detection that run automatically before any evaluation dataset is approved for use, catching the subtle data quality issues that silently inflate or deflate model scores and lead to incorrect deployment decisions. Balance checks verify that stratum proportions match the target allocation, but quality goes far beyond balance. Coverage checks confirm that the dataset exercises every capability the evaluation is supposed to measure—for instance, that a "code generation" stratum actually contains examples spanning multiple programming languages, framework versions, and problem types rather than clustering around trivial "hello world" prompts. Contamination detection identifies evaluation examples that have leaked into the model's training data, which inflates scores and masks genuine capability gaps; this is one of the most consequential quality failures in LLM evaluation because contaminated benchmarks can make a weaker model appear to outperform a stronger one. Your contamination detector will implement two complementary strategies: n-gram overlap analysis that computes the fraction of 10-gram shingles in each evaluation example that appear in a reference corpus of known training data, and membership inference analysis that measures the model's per-token log-probability on each evaluation example and flags examples where the perplexity is suspiciously low relative to the stratum baseline. Both strategies produce a contamination score per example; examples exceeding a configurable threshold are quarantined into a "suspected contaminated" partition that is excluded from official evaluation runs but retained for analysis. The pipeline must handle the case where the reference training corpus is unavailable—common when evaluating third-party hosted models—by falling back to the membership inference strategy alone and annotating the dataset card with a warning that n-gram contamination detection was skipped. You will also implement inter-annotator agreement (IAA) checks for datasets with human-generated labels, computing Cohen's kappa for binary labels and Krippendorff's alpha for ordinal or nominal labels across annotator pairs, and rejecting any example where agreement falls below a configurable threshold. Low-agreement examples are routed to an adjudication queue rather than silently included, because ambiguous ground truth corrupts evaluation metrics in ways that are extremely difficult to diagnose after the fact.
  - Implement n-gram shingling for contamination detection using a sliding window of configurable width (default 10 tokens) that converts each evaluation example into a set of shingles, hashes them with xxhash for memory efficiency, and checks membership against a pre-built Bloom filter of training data shingles (the shingling and scoring steps are sketched in the code example that follows this goal's sub-items). The Bloom filter trades a small false-positive rate (configurable, default 1%) for dramatic memory savings—a corpus of several billion unique shingles can be represented in under 8 GB of memory. You will learn how to calibrate the Bloom filter's size and hash count parameters to hit your target false-positive rate, and how to interpret the resulting contamination scores in the context of your evaluation objectives.
  - Build a membership inference contamination detector that queries the hosted model's log-probability endpoint for each evaluation example, computes the mean per-token log-probability, and compares it against the distribution of log-probabilities for a held-out calibration set from the same stratum. Examples whose log-probability z-score exceeds a threshold (default 3.0) are flagged as suspected memorization. This approach works even when you have no access to the model's training data, making it the primary contamination detection strategy for third-party hosted models. You will handle API rate limits by batching requests with exponential backoff and caching results so that re-running the check after adding new examples only queries the model for the delta.
  - Compute inter-annotator agreement with automatic adjudication routing by pairing every annotator's labels with every other annotator's labels for the same example set, computing Cohen's kappa for each pair, and computing an overall Krippendorff's alpha across all annotators. Examples where no annotator pair agrees are routed to a senior adjudicator queue with the full annotation history attached. The pipeline produces an IAA report that breaks down agreement by stratum, enabling you to identify task categories where annotation guidelines are ambiguous and need revision—for example, if "reasoning" examples consistently have lower agreement than "summarization" examples, the reasoning rubric likely needs more concrete anchor examples.
  - Enforce coverage checks using a capability taxonomy that maps each evaluation example to one or more fine-grained capabilities (e.g., "Python list comprehension," "multi-hop reasoning over tables," "negation handling in sentiment analysis") and verifies that every leaf node in the taxonomy has at least a minimum number of examples. Coverage gaps are reported as structured warnings in the dataset card, with suggested remediation actions such as "add 12 more examples for capability 'negation handling in sentiment analysis' to reach the minimum of 30." This check prevents the common failure mode where a high-level stratum like "sentiment analysis" appears well-represented but actually contains no adversarial negation examples, leaving a critical capability untested.
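As a sketch of the shingling and scoring steps referenced above, the following uses an exact in-memory set in place of the hashed Bloom filter so the example stays self-contained; the 0.2 quarantine threshold is an arbitrary placeholder, since the chapter leaves the threshold configurable:

```python
def shingles(tokens: list[str], width: int = 10) -> set[tuple[str, ...]]:
    """Sliding-window n-gram shingles over a token sequence."""
    if len(tokens) < width:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)}

def contamination_score(
    example_tokens: list[str],
    training_shingles: set[tuple[str, ...]],
    width: int = 10,
) -> float:
    """Fraction of an example's shingles that also occur in the training corpus.

    The production detector hashes shingles (e.g., with xxhash) and tests
    membership against a Bloom filter sized for the target false-positive
    rate; a plain set keeps this sketch exact and dependency-free.
    """
    example_shingles = shingles(example_tokens, width)
    if not example_shingles:
        return 0.0
    hits = sum(1 for s in example_shingles if s in training_shingles)
    return hits / len(example_shingles)

def quarantine(scores: dict[str, float], threshold: float = 0.2) -> set[str]:
    """Example IDs whose contamination score exceeds the configured threshold."""
    return {example_id for example_id, score in scores.items() if score > threshold}
```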
- Build automated dataset refresh pipelines that detect staleness and trigger re-curation to ensure that evaluation datasets remain representative of current production traffic patterns, model capabilities, and domain knowledge rather than ossifying into static benchmarks that reward overfitting to yesterday's distribution. Dataset staleness is a pervasive problem in LLM evaluation: a dataset curated six months ago may over-represent query patterns that production traffic has since abandoned and under-represent new patterns driven by product changes, seasonal effects, or shifts in user behavior. Staleness detection requires comparing the current production traffic distribution against the distribution encoded in the evaluation dataset across multiple dimensions—task category proportions, prompt length distributions, entity frequency distributions, and temporal topic distributions. Your staleness detector will compute the Jensen-Shannon divergence (JSD) between the production and evaluation distributions for each dimension and trigger a re-curation alert when any dimension's JSD exceeds a configurable threshold (default 0.1 nats). The re-curation pipeline is not a simple re-run of the sampling pipeline; it must preserve continuity with previous dataset versions to maintain the validity of longitudinal score comparisons. This means the pipeline operates in "delta mode" by default: it identifies examples that are no longer representative (because their stratum's production proportion has dropped significantly), identifies gaps where new strata or expanded strata need more examples, and produces a candidate changeset that a human reviewer approves before the new version is tagged. The approved changeset is applied as a Git commit with a structured commit message that the changelog generator can parse, and the new version is tagged with an appropriate semantic version bump. You will also implement a scheduled runner—typically a daily or weekly cron job orchestrated by Airflow or Cloud Scheduler—that executes the staleness check against the latest production traffic logs and posts the results to a Slack channel or dashboard. When the staleness threshold is breached, the runner automatically generates the candidate changeset and opens a pull request for review, reducing the median time from staleness detection to dataset update from weeks (the typical manual cadence) to hours.
  - Compute Jensen-Shannon divergence across multiple distribution dimensions by extracting task-category histograms, prompt-length histograms, and entity-frequency histograms from both the production traffic logs and the evaluation dataset, normalizing them into probability distributions, and computing JSD for each pair (sketched in the code example that follows this goal's sub-items). You will learn why JSD is preferred over KL divergence for staleness detection—JSD is symmetric and always finite, whereas KL divergence is asymmetric and undefined when the evaluation distribution assigns zero probability to a category that appears in production. The staleness report includes per-dimension JSD values, a sparkline trend over the last 30 days, and a recommended action (no action, monitor, or re-curate) based on configurable thresholds.
  - Implement delta-mode re-curation that preserves longitudinal comparability by classifying each existing evaluation example as "retain" (still representative), "retire" (no longer representative), or "replace" (representative but label needs updating), and generating a pool of candidate additions to fill gaps in under-represented strata. The delta is computed by comparing each example's stratum membership and feature distribution against the current production baseline and applying a configurable retention threshold—examples whose feature vector distance from the production centroid exceeds the threshold are marked for retirement. Retired examples are not deleted; they are moved to an archive partition tagged with the dataset version in which they were retired, enabling historical analysis. This approach ensures that at least 70% of examples are shared between consecutive dataset versions, keeping longitudinal score comparisons meaningful.
  - Orchestrate scheduled staleness checks with automated pull request generation using a workflow engine (Airflow DAG or GitHub Actions workflow) that runs on a configurable schedule, pulls the latest production traffic sample from your data warehouse, executes the staleness detector, and conditionally triggers the re-curation pipeline. When re-curation is triggered, the workflow creates a new Git branch, commits the candidate changeset with the staleness report as the commit body, opens a pull request with the dataset card diff and stratum-level statistics in the PR description, and assigns it to the evaluation dataset owner for review. You will implement idempotency guards so that the workflow does not open duplicate pull requests if the previous one is still under review, and escalation logic that pings the on-call evaluation engineer if a staleness alert has been unaddressed for more than 48 hours.
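A compact sketch of the per-dimension JSD computation and the action mapping described above; the two histograms are assumed to be aligned over the same category order, and the 0.05 "monitor" threshold is an illustrative value alongside the 0.1-nat re-curation default:

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence in nats between two aligned count histograms."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # terms with a_i == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    # Always finite: wherever p or q is positive, the mixture m is positive too.
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def staleness_action(
    jsd_by_dimension: dict[str, float],
    recurate_at: float = 0.10,
    monitor_at: float = 0.05,
) -> str:
    """Map the worst per-dimension divergence to a recommended action."""
    worst = max(jsd_by_dimension.values())
    if worst >= recurate_at:
        return "re-curate"
    if worst >= monitor_at:
        return "monitor"
    return "no action"
```

For example, js_divergence applied to the task-category histograms of production traffic versus the evaluation dataset feeds directly into staleness_action along with the other dimensions.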
- Generate privacy-compliant synthetic evaluation data using NeMo Safe Synthesizer with differential privacy to create evaluation datasets that test model behavior on realistic but non-identifiable data, satisfying regulatory requirements (GDPR, CCPA, HIPAA) while maintaining the statistical properties that make evaluation results meaningful. Real production data is the gold standard for evaluation realism, but privacy regulations and internal data governance policies often prohibit using raw production data in evaluation datasets—especially when the data contains personally identifiable information (PII), protected health information (PHI), or confidential business data. Naive anonymization (replacing names with "[REDACTED]") distorts the linguistic patterns that models rely on, producing evaluation results that do not generalize to real traffic. NeMo Safe Synthesizer addresses this by generating synthetic data that is statistically similar to the real distribution while providing formal differential privacy guarantees—specifically, (ε, δ)-differential privacy, where ε controls the privacy-utility tradeoff and δ bounds the probability of catastrophic privacy failure. You will configure the synthesizer with an epsilon value appropriate for your threat model (typically ε ∈ [1, 10] for evaluation datasets, since lower epsilon provides stronger privacy but reduces data utility), a delta value of 1/n² where n is the number of real examples in the source dataset, and a noise mechanism (Gaussian or Laplace) that is applied during the synthesis process to ensure that no individual source example can be reverse-engineered from the synthetic output. The pipeline reads a privacy-labeled source dataset where each field is annotated with its sensitivity level (public, internal, confidential, restricted), applies field-level synthesis strategies (direct passthrough for public fields, differentially private synthesis for confidential fields, full replacement for restricted fields), and produces a synthetic dataset with an accompanying privacy attestation document that records the epsilon and delta values, the noise mechanism, the synthesis timestamp, and the hash of the source dataset. The attestation is cryptographically signed and stored alongside the dataset version tag so that compliance auditors can verify the privacy properties without accessing the source data. You will also implement utility validation that compares the synthetic dataset against the real dataset on a battery of statistical tests—distribution similarity per feature, inter-feature correlation preservation, and downstream evaluation score correlation—and rejects synthetic datasets whose utility score falls below a configurable threshold, ensuring that privacy is not achieved at the cost of evaluation validity.
  - Configure NeMo Safe Synthesizer's differential privacy parameters by selecting an epsilon value based on a formal privacy budget analysis that accounts for the number of times the source dataset will be used for synthesis (composition theorem), the sensitivity of the fields being synthesized, and the acceptable utility loss. You will learn how to use the Rényi Differential Privacy (RDP) accountant to track cumulative privacy spend across multiple synthesis runs and halt synthesis when the total epsilon exceeds the privacy budget, preventing the gradual erosion of privacy guarantees that occurs when synthetic datasets are regenerated frequently without budget tracking. The configuration is stored as a versioned YAML file that the synthesis pipeline reads, ensuring that privacy parameters are auditable and reproducible.
  - Implement field-level synthesis strategies based on sensitivity annotations that apply different synthesis techniques to different data fields: public fields (e.g., programming language, task category) are passed through unchanged; internal fields (e.g., anonymized user IDs, session timestamps) are perturbed with calibrated noise; confidential fields (e.g., prompt text containing business logic) are regenerated using a differentially private language model fine-tuned on the source distribution; and restricted fields (e.g., API keys, credentials that accidentally appeared in logs) are replaced with synthetic placeholders that preserve format but contain no real information. You will handle the common edge case where a field's sensitivity annotation is missing by defaulting to "restricted" and logging a warning, following the principle of maximum protection for unclassified data (the strategy dispatch is sketched in the code example that follows this goal's sub-items).
  - Validate synthetic data utility with statistical and downstream tests by computing the Wasserstein distance between real and synthetic distributions for each numeric feature, the Jensen-Shannon divergence for each categorical feature, and the Pearson correlation between inter-feature correlation matrices. The utility validation also includes a downstream evaluation test: it runs a reference model evaluation on both the real and synthetic datasets, computes the rank correlation (Spearman's ρ) between per-example scores, and requires ρ ≥ 0.85 for the synthetic dataset to be approved. If the utility threshold is not met, the pipeline suggests increasing epsilon (loosening privacy) or increasing the synthetic dataset size (improving statistical estimation) and logs the specific features or evaluation dimensions where utility degraded most, giving the engineer actionable guidance for the next synthesis iteration.
  - Generate cryptographically signed privacy attestation documents that accompany each synthetic dataset and provide a tamper-evident record of the privacy parameters, synthesis process, and source data lineage. The attestation includes the epsilon and delta values, the RDP accountant's cumulative privacy spend, the sensitivity annotations for each field, the synthesis model version, the source dataset hash (computed before synthesis so it can be verified without re-accessing the source data), and a timestamp. The document is signed using the team's GPG key and stored in the dataset's Git repository alongside the version tag. Compliance auditors can verify the signature, inspect the privacy parameters, and confirm that the stated differential privacy guarantees are consistent with the synthesis configuration—all without needing access to the original sensitive data. You will implement a verification script that auditors can run independently to validate the attestation chain from the synthetic dataset back to the privacy budget configuration.
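A sketch of the sensitivity-driven dispatch described in the field-level synthesis item above. It deliberately avoids the NeMo Safe Synthesizer API: the per-level strategies are stubs standing in for the real passthrough, noise-perturbation, DP-synthesis, and placeholder-replacement implementations, and only the default-to-restricted handling of unannotated fields is shown in full.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("synthesis")

# Stub strategies: in the real pipeline these call into the synthesizer and
# noise mechanisms rather than returning the input or a fixed placeholder.
def passthrough(value: str) -> str:
    return value                      # public fields are copied unchanged

def perturb(value: str) -> str:
    return value                      # internal fields get calibrated noise here

def dp_synthesize(value: str) -> str:
    return value                      # confidential fields are regenerated with DP here

def replace_placeholder(value: str) -> str:
    return "<synthetic>"              # restricted fields keep format, lose content

STRATEGIES: dict[str, Callable[[str], str]] = {
    "public": passthrough,
    "internal": perturb,
    "confidential": dp_synthesize,
    "restricted": replace_placeholder,
}

def synthesize_record(record: dict[str, str], sensitivity: dict[str, str]) -> dict[str, str]:
    """Apply the synthesis strategy selected by each field's sensitivity annotation."""
    out: dict[str, str] = {}
    for field, value in record.items():
        level = sensitivity.get(field)
        if level is None:
            # Missing annotations default to maximum protection, with a warning.
            logger.warning("field %r has no sensitivity annotation; treating as restricted", field)
            level = "restricted"
        out[field] = STRATEGIES[level](value)
    return out
```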