Each objective has a coding lab that opens in VS Code in your browser.
You will create a structured evaluation dataset for testing a hosted LLM (OpenAI GPT-4o) across multiple task categories. Define task categories: classification, summarization, extraction, generation, and reasoning. For each category, create 20 test cases spanning three difficulty levels (easy, medium, hard), each with the fields: input, expected_output, category, difficulty, source, and metadata. Store the dataset as JSONL with one test case per line. Build a DatasetBuilder class that loads the JSONL, validates each row against a Pydantic schema, and reports coverage statistics: cases per category, difficulty distribution, and average input/output token counts.
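A minimal sketch of what the schema and builder could look like, assuming the field names above; the `TestCase` model and the whitespace-based token proxy are illustrative choices, not part of the spec (swap in a real tokenizer such as tiktoken for accurate counts):

```python
import json
from collections import Counter
from typing import Literal

from pydantic import BaseModel, Field

class TestCase(BaseModel):
    input: str
    expected_output: str
    category: Literal["classification", "summarization", "extraction", "generation", "reasoning"]
    difficulty: Literal["easy", "medium", "hard"]
    source: str
    metadata: dict = Field(default_factory=dict)

class DatasetBuilder:
    def __init__(self, path: str):
        self.path = path
        self.cases: list[TestCase] = []

    def load(self) -> None:
        """Parse the JSONL file, validating every non-empty line against the schema."""
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    self.cases.append(TestCase.model_validate_json(line))

    def coverage(self) -> dict:
        """Cases per category, difficulty distribution, and average token counts."""
        n = len(self.cases) or 1
        return {
            "per_category": dict(Counter(c.category for c in self.cases)),
            "per_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            # Whitespace split is a rough proxy for tokens, not a real tokenizer.
            "avg_input_tokens": sum(len(c.input.split()) for c in self.cases) / n,
            "avg_output_tokens": sum(len(c.expected_output.split()) for c in self.cases) / n,
        }
```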
You will build a dataset versioning system using Git tags and content hashing. Create a DatasetVersioner class that computes a SHA-256 hash of the full JSONL file as the dataset fingerprint. On each update, create a Git tag: eval-dataset/v1.0.0-{short_hash}. Store version metadata in a dataset_versions.json file: version, hash, row_count, created_at, change_description. Implement diff tracking: compare two dataset versions and report added, removed, and modified test cases. Build a GET /datasets/{name}/versions API endpoint that lists all versions with their metadata. Implement rollback: checkout a previous dataset version by Git tag.
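One way to compute the fingerprint and record a version, as a hedged sketch: the `DatasetVersioner` method names below are hypothetical, and Git is shelled out to via subprocess rather than a Git library:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

class DatasetVersioner:
    def __init__(self, dataset_path: str, registry: str = "dataset_versions.json"):
        self.dataset_path = Path(dataset_path)
        self.registry = Path(registry)

    def fingerprint(self) -> str:
        """SHA-256 hash of the raw JSONL bytes, used as the dataset fingerprint."""
        return hashlib.sha256(self.dataset_path.read_bytes()).hexdigest()

    def tag_version(self, version: str, change_description: str) -> str:
        """Create the eval-dataset/v{version}-{short_hash} tag and log metadata."""
        full_hash = self.fingerprint()
        tag = f"eval-dataset/v{version}-{full_hash[:8]}"
        subprocess.run(["git", "tag", tag], check=True)  # assumes a Git checkout
        rows = sum(1 for ln in self.dataset_path.read_text(encoding="utf-8").splitlines() if ln.strip())
        versions = json.loads(self.registry.read_text()) if self.registry.exists() else []
        versions.append({
            "version": version,
            "hash": full_hash,
            "row_count": rows,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "change_description": change_description,
        })
        self.registry.write_text(json.dumps(versions, indent=2))
        return tag
```

Rollback then amounts to `git checkout <tag> -- <dataset_path>` against the stored tag name.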
You will build contamination detection to ensure evaluation datasets haven't leaked into LLM training data. Create a ContaminationDetector class that: (1) sends each test case input to OpenAI GPT-4o and Gemini Pro, (2) checks whether the model can reproduce the expected output verbatim (a contamination signal), (3) computes a contamination score: the percentage of test cases where the model output has >90% ROUGE-L overlap with the expected answer. Flag contaminated test cases for replacement. Build a quarantine workflow: contaminated cases are moved to a quarantine.jsonl file along with the contamination evidence. Generate a contamination report: total cases, contaminated count, and contamination rate per model and per category.
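The scoring core might look like the following sketch, assuming the rouge-score package and leaving the actual model call behind a callable you supply; `is_contaminated` and `contamination_rate` are illustrative names:

```python
from typing import Callable

from rouge_score import rouge_scorer

# ROUGE-L F-measure between the expected answer and the model's reproduction.
_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_contaminated(expected_output: str, model_output: str, threshold: float = 0.9) -> bool:
    """Flag a case when ROUGE-L overlap with the expected answer exceeds 90%."""
    return _scorer.score(expected_output, model_output)["rougeL"].fmeasure > threshold

def contamination_rate(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Percentage of cases a model reproduces; `generate` wraps one model's API call."""
    flagged = sum(is_contaminated(c["expected_output"], generate(c["input"])) for c in cases)
    return 100.0 * flagged / max(len(cases), 1)
```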
You will create an Argo Workflow pipeline that periodically refreshes the evaluation dataset, incorporating NVIDIA NeMo Safe Synthesizer for privacy-compliant synthetic data generation. The pipeline runs weekly and:

1. Checks dataset freshness by comparing the creation date of test cases against a 90-day threshold (see the sketch after this list).
2. Runs contamination detection on existing cases.
3. For cases containing sensitive data, uses NeMo Safe Synthesizer to generate privacy-compliant synthetic replacements with differential privacy guarantees: configure PII replacement, model training with optional DP, and synthetic data generation with quality/privacy validation.
4. Generates non-sensitive replacement test cases using Gemini Pro with category-specific prompts.
5. Validates new cases against the Pydantic schema.
6. Creates a new dataset version with the refreshed cases.

Additionally, integrate Patronus AI Generative Simulators to create adaptive simulation environments that generate new evaluation scenarios based on agent behavior; these 'living practice worlds' use Open Recursive Self-Improvement (ORSI) to continuously create novel test cases. Configure the pipeline as an Argo CronWorkflow running on GKE. Store pipeline artifacts (old dataset, new dataset, refresh report) in MinIO.
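The freshness check in step 1, run inside a workflow step container, could be as simple as the sketch below; the metadata.created_at and metadata.id fields are assumptions about where creation dates and identifiers live in your schema:

```python
import json
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(days=90)  # the 90-day refresh threshold

def stale_case_ids(dataset_path: str) -> list[str]:
    """Return IDs of test cases older than the freshness threshold.

    Assumes each row carries metadata.created_at as a timezone-aware
    ISO-8601 timestamp and a metadata.id; adjust to your actual schema.
    """
    now = datetime.now(timezone.utc)
    stale = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            case = json.loads(line)
            created = datetime.fromisoformat(case["metadata"]["created_at"])
            if now - created > STALENESS_THRESHOLD:
                stale.append(case["metadata"]["id"])
    return stale
```

In the CronWorkflow this would be one container step whose output feeds the contamination-detection and regeneration steps; the YAML wiring follows the standard Argo CronWorkflow spec.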
You will implement dataset cards following the Hugging Face model card pattern. Create a DatasetCard Pydantic model with sections: description, intended_use, composition (task breakdown, token statistics), collection_process (how cases were sourced), annotation_process (inter-annotator agreement scores), known_limitations, and version_history. Auto-generate the card from the dataset: compute statistics such as average token counts, category distribution, difficulty histogram, and estimated cost to run (based on OpenAI/Gemini pricing). Render the card as Markdown and store it alongside the dataset. Build a CLI tool: python eval_tools.py card generate --dataset classification_eval.jsonl.
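A possible shape for the card model, assuming the section names above; the nested `Composition` and `VersionEntry` types and the partial Markdown renderer are illustrative:

```python
from pydantic import BaseModel, Field

class Composition(BaseModel):
    task_breakdown: dict[str, int]       # cases per category
    difficulty_histogram: dict[str, int]
    avg_input_tokens: float
    avg_output_tokens: float
    estimated_run_cost_usd: float        # from published OpenAI/Gemini pricing

class VersionEntry(BaseModel):
    version: str
    change_description: str

class DatasetCard(BaseModel):
    description: str
    intended_use: str
    composition: Composition
    collection_process: str
    annotation_process: str              # include inter-annotator agreement scores
    known_limitations: list[str] = Field(default_factory=list)
    version_history: list[VersionEntry] = Field(default_factory=list)

    def to_markdown(self) -> str:
        """Render a few sections as Markdown; extend with one heading per field."""
        parts = [
            "# Dataset Card",
            "## Description\n" + self.description,
            "## Intended Use\n" + self.intended_use,
            "## Known Limitations\n" + "\n".join(f"- {item}" for item in self.known_limitations),
        ]
        return "\n\n".join(parts)
```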
You will create an annotation pipeline for human-labeled evaluation data. Build a FastAPI service that serves test cases for annotation: GET /annotate/next returns the next unlabeled case, and POST /annotate/{id} submits an annotation with quality_rating (1-5), correctness (pass/fail), annotator_id, and notes. Implement inter-annotator agreement (IAA): assign each case to 3 annotators and compute Fleiss' kappa. Cases with kappa < 0.6 are flagged for review and discussion. Store annotations in PostgreSQL with the schema: annotation_id, case_id, annotator_id, quality_rating, correctness, created_at. Build an annotation progress dashboard showing completion rate, IAA scores, and annotator performance.
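The IAA computation can lean on statsmodels; this sketch assumes every case has labels from the same number of annotators (three, per the assignment scheme above), and `iaa_kappa` is an illustrative name:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def iaa_kappa(labels_per_case: list[list[str]]) -> float:
    """Fleiss' kappa over a batch of cases, each labeled by the same number of raters.

    labels_per_case is a subjects x raters matrix of labels, e.g.
    [["pass", "pass", "fail"], ["fail", "fail", "fail"], ...].
    """
    data = np.array(labels_per_case)
    # aggregate_raters turns raw labels into a subjects x categories count table.
    table, _categories = aggregate_raters(data)
    return fleiss_kappa(table, method="fleiss")

if __name__ == "__main__":
    kappa = iaa_kappa([
        ["pass", "pass", "pass"],
        ["pass", "fail", "fail"],
        ["fail", "fail", "fail"],
    ])
    # Batches scoring below the 0.6 threshold go to review and discussion.
    print(f"Fleiss' kappa: {kappa:.2f}, flagged: {kappa < 0.6}")
```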