Each objective has a coding lab that opens in VS Code in your browser.
You will create a structured evaluation dataset for testing a hosted LLM (OpenAI GPT-4o) across multiple task categories. Define task categories: classification, summarization, extraction, generation, and reasoning. For each category, create 20 test cases spanning three difficulty levels (easy, medium, hard), each with the fields: input, expected_output, category, difficulty, source, and metadata. Store the dataset as JSONL with one test case per line. Build a DatasetBuilder class that loads the JSONL, validates each row against a Pydantic schema, and reports coverage statistics: cases per category, difficulty distribution, and average input/output token counts.
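A minimal sketch of what the schema and builder could look like, assuming the field names above; the `TestCase` model and the whitespace-based token proxy are illustrative choices, not part of the spec (swap in a real tokenizer such as tiktoken for accurate counts):

```python
import json
from collections import Counter
from typing import Literal

from pydantic import BaseModel, Field

class TestCase(BaseModel):
    input: str
    expected_output: str
    category: Literal["classification", "summarization", "extraction", "generation", "reasoning"]
    difficulty: Literal["easy", "medium", "hard"]
    source: str
    metadata: dict = Field(default_factory=dict)

class DatasetBuilder:
    def __init__(self, path: str):
        self.path = path
        self.cases: list[TestCase] = []

    def load(self) -> None:
        """Parse the JSONL file, validating every non-empty line against the schema."""
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    self.cases.append(TestCase.model_validate_json(line))

    def coverage(self) -> dict:
        """Cases per category, difficulty distribution, and average token counts."""
        n = len(self.cases) or 1
        return {
            "per_category": dict(Counter(c.category for c in self.cases)),
            "per_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            # Whitespace split is a rough proxy for tokens, not a real tokenizer.
            "avg_input_tokens": sum(len(c.input.split()) for c in self.cases) / n,
            "avg_output_tokens": sum(len(c.expected_output.split()) for c in self.cases) / n,
        }
```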
You will build a dataset versioning system using Git tags and content hashing. Create a DatasetVersioner class that computes a SHA-256 hash of the full JSONL file as the dataset fingerprint. On each update, create a Git tag: eval-dataset/v1.0.0-{short_hash}. Store version metadata in a dataset_versions.json file: version, hash, row_count, created_at, change_description. Implement diff tracking: compare two dataset versions and report added, removed, and modified test cases. Build a GET /datasets/{name}/versions API endpoint that lists all versions with their metadata. Implement rollback: checkout a previous dataset version by Git tag.
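One way to compute the fingerprint and record a version, as a hedged sketch: the `DatasetVersioner` method names below are hypothetical, and Git is shelled out to via subprocess rather than a Git library:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

class DatasetVersioner:
    def __init__(self, dataset_path: str, registry: str = "dataset_versions.json"):
        self.dataset_path = Path(dataset_path)
        self.registry = Path(registry)

    def fingerprint(self) -> str:
        """SHA-256 hash of the raw JSONL bytes, used as the dataset fingerprint."""
        return hashlib.sha256(self.dataset_path.read_bytes()).hexdigest()

    def tag_version(self, version: str, change_description: str) -> str:
        """Create the eval-dataset/v{version}-{short_hash} tag and log metadata."""
        full_hash = self.fingerprint()
        tag = f"eval-dataset/v{version}-{full_hash[:8]}"
        subprocess.run(["git", "tag", tag], check=True)  # assumes a Git checkout
        rows = sum(1 for ln in self.dataset_path.read_text(encoding="utf-8").splitlines() if ln.strip())
        versions = json.loads(self.registry.read_text()) if self.registry.exists() else []
        versions.append({
            "version": version,
            "hash": full_hash,
            "row_count": rows,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "change_description": change_description,
        })
        self.registry.write_text(json.dumps(versions, indent=2))
        return tag
```

Rollback then amounts to `git checkout <tag> -- <dataset_path>` against the stored tag name.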
You will build contamination detection to ensure evaluation datasets haven't leaked into LLM training data. Create a ContaminationDetector class that: (1) sends each test case input to OpenAI GPT-4o and Gemini Pro, (2) checks whether the model can reproduce the expected output verbatim (a contamination signal), (3) computes a contamination score: the percentage of test cases where the model output has >90% ROUGE-L overlap with the expected answer. Flag contaminated test cases for replacement. Build a quarantine workflow: contaminated cases are moved to a quarantine.jsonl file along with the contamination evidence. Generate a contamination report: total cases, contaminated count, and contamination rate per model and per category.
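The scoring core might look like the following sketch, assuming the rouge-score package and leaving the actual model call behind a callable you supply; `is_contaminated` and `contamination_rate` are illustrative names:

```python
from typing import Callable

from rouge_score import rouge_scorer

# ROUGE-L F-measure between the expected answer and the model's reproduction.
_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_contaminated(expected_output: str, model_output: str, threshold: float = 0.9) -> bool:
    """Flag a case when ROUGE-L overlap with the expected answer exceeds 90%."""
    return _scorer.score(expected_output, model_output)["rougeL"].fmeasure > threshold

def contamination_rate(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Percentage of cases a model reproduces; `generate` wraps one model's API call."""
    flagged = sum(is_contaminated(c["expected_output"], generate(c["input"])) for c in cases)
    return 100.0 * flagged / max(len(cases), 1)
```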
You will create an Argo Workflow pipeline that periodically refreshes the evaluation dataset, incorporating NVIDIA NeMo Safe Synthesizer for privacy-compliant synthetic data generation. The pipeline runs weekly and:

1. Checks dataset freshness by comparing the creation date of test cases against a 90-day threshold (see the sketch after this list).
2. Runs contamination detection on existing cases.
3. For cases containing sensitive data, uses NeMo Safe Synthesizer to generate privacy-compliant synthetic replacements with differential privacy guarantees: configure PII replacement, model training with optional DP, and synthetic data generation with quality/privacy validation.
4. Generates non-sensitive replacement test cases using Gemini Pro with category-specific prompts.
5. Validates new cases against the Pydantic schema.
6. Creates a new dataset version with the refreshed cases.

Additionally, integrate Patronus AI Generative Simulators to create adaptive simulation environments that generate new evaluation scenarios based on agent behavior; these 'living practice worlds' use Open Recursive Self-Improvement (ORSI) to continuously create novel test cases. Configure the pipeline as an Argo CronWorkflow running on GKE. Store pipeline artifacts (old dataset, new dataset, refresh report) in MinIO.
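The freshness check in step 1, run inside a workflow step container, could be as simple as the sketch below; the metadata.created_at and metadata.id fields are assumptions about where creation dates and identifiers live in your schema:

```python
import json
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(days=90)  # the 90-day refresh threshold

def stale_case_ids(dataset_path: str) -> list[str]:
    """Return IDs of test cases older than the freshness threshold.

    Assumes each row carries metadata.created_at as a timezone-aware
    ISO-8601 timestamp and a metadata.id; adjust to your actual schema.
    """
    now = datetime.now(timezone.utc)
    stale = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            case = json.loads(line)
            created = datetime.fromisoformat(case["metadata"]["created_at"])
            if now - created > STALENESS_THRESHOLD:
                stale.append(case["metadata"]["id"])
    return stale
```

In the CronWorkflow this would be one container step whose output feeds the contamination-detection and regeneration steps; the YAML wiring follows the standard Argo CronWorkflow spec.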
You will implement dataset cards following the Hugging Face model card pattern. Create a DatasetCard Pydantic model with sections: description, intended_use, composition (task breakdown, token statistics), collection_process (how cases were sourced), annotation_process (inter-annotator agreement scores), known_limitations, and version_history. Auto-generate the card from the dataset: compute statistics such as average token counts, category distribution, difficulty histogram, and estimated cost to run (based on OpenAI/Gemini pricing). Render the card as Markdown and store it alongside the dataset. Build a CLI tool: python eval_tools.py card generate --dataset classification_eval.jsonl.
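A possible shape for the card model, assuming the section names above; the nested `Composition` and `VersionEntry` types and the partial Markdown renderer are illustrative:

```python
from pydantic import BaseModel, Field

class Composition(BaseModel):
    task_breakdown: dict[str, int]       # cases per category
    difficulty_histogram: dict[str, int]
    avg_input_tokens: float
    avg_output_tokens: float
    estimated_run_cost_usd: float        # from published OpenAI/Gemini pricing

class VersionEntry(BaseModel):
    version: str
    change_description: str

class DatasetCard(BaseModel):
    description: str
    intended_use: str
    composition: Composition
    collection_process: str
    annotation_process: str              # include inter-annotator agreement scores
    known_limitations: list[str] = Field(default_factory=list)
    version_history: list[VersionEntry] = Field(default_factory=list)

    def to_markdown(self) -> str:
        """Render a few sections as Markdown; extend with one heading per field."""
        parts = [
            "# Dataset Card",
            "## Description\n" + self.description,
            "## Intended Use\n" + self.intended_use,
            "## Known Limitations\n" + "\n".join(f"- {item}" for item in self.known_limitations),
        ]
        return "\n\n".join(parts)
```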
You will create an annotation pipeline for human-labeled evaluation data. Build a FastAPI service that serves test cases for annotation: GET /annotate/next returns the next unlabeled case, and POST /annotate/{id} submits an annotation with quality_rating (1-5), correctness (pass/fail), annotator_id, and notes. Implement inter-annotator agreement (IAA): assign each case to 3 annotators and compute Fleiss' kappa. Cases with kappa < 0.6 are flagged for review and discussion. Store annotations in PostgreSQL with the schema: annotation_id, case_id, annotator_id, quality_rating, correctness, created_at. Build an annotation progress dashboard showing completion rate, IAA scores, and annotator performance.
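The IAA computation can lean on statsmodels; this sketch assumes every case has labels from the same number of annotators (three, per the assignment scheme above), and `iaa_kappa` is an illustrative name:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def iaa_kappa(labels_per_case: list[list[str]]) -> float:
    """Fleiss' kappa over a batch of cases, each labeled by the same number of raters.

    labels_per_case is a subjects x raters matrix of labels, e.g.
    [["pass", "pass", "fail"], ["fail", "fail", "fail"], ...].
    """
    data = np.array(labels_per_case)
    # aggregate_raters turns raw labels into a subjects x categories count table.
    table, _categories = aggregate_raters(data)
    return fleiss_kappa(table, method="fleiss")

if __name__ == "__main__":
    kappa = iaa_kappa([
        ["pass", "pass", "pass"],
        ["pass", "fail", "fail"],
        ["fail", "fail", "fail"],
    ])
    # Batches scoring below the 0.6 threshold go to review and discussion.
    print(f"Fleiss' kappa: {kappa:.2f}, flagged: {kappa < 0.6}")
```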