Detect dataset contamination and leakage
You will build contamination detection to ensure evaluation datasets haven't leaked into LLM training data. Create a ContaminationDetector class that: (1) sends each test case input to OpenAI GPT-4o and Gemini Pro, (2) checks if the model can reproduce the expected output verbatim (contamination signal), (3) computes a contamination score: percentage of test cases where the model output has >90% ROUGE-L overlap with the expected answer. Flag contaminated test cases for replacement. Build a quarantine workflow: contaminated cases are moved to a quarantine.jsonl file with the contamination evidence. Generate a contamination report: total cases, contaminated count, contamination rate per model, per category.
Introduction
A test case that appears in a model provider's training data produces inflated benchmark scores that disappear in production. This is dataset contamination — and for hosted LLMs whose training corpora you cannot inspect, it can only be detected probabilistically. The standard technique is to probe the model with a portion of each test case and measure how much of the expected continuation it reproduces verbatim. High verbatim overlap (ROUGE-L above 0.85) is strong evidence the model has seen the example before; the case must be quarantined and replaced with a fresh one before the dataset can be used as a fair benchmark.
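To make the probing idea concrete, here is a minimal sketch of how a test case might be cut into a probe prefix and a held-out continuation. The helper name `split_for_probe`, the 50% default cut, and the whitespace word split are all assumptions made for illustration; the lesson itself does not prescribe a split point.

```python
def split_for_probe(text: str, prefix_fraction: float = 0.5) -> tuple[str, str]:
    """Split text into (probe prefix, held-out continuation) on a word boundary.

    Hypothetical helper: the fraction and the whitespace tokenization are
    illustrative assumptions, not part of the lesson's detector.
    """
    if not 0.0 < prefix_fraction < 1.0:
        raise ValueError("prefix_fraction must be strictly between 0 and 1")
    words = text.split()
    if len(words) < 2:
        raise ValueError("need at least two words to split")
    # Clamp the cut so at least one word lands on each side.
    cut = min(max(1, int(len(words) * prefix_fraction)), len(words) - 1)
    return " ".join(words[:cut]), " ".join(words[cut:])


prefix, continuation = split_for_probe(
    "The quick brown fox jumps over the lazy dog near the river bank"
)
print(prefix)        # text sent to the model
print(continuation)  # text the model's output is compared against
```

The prefix is what gets sent to the model; the continuation is never shown to it and serves only as the reference for the overlap score.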
Key Terminology
- Contamination: A test case present in the model's training data, producing biased eval scores.
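- ROUGE-L: Overlap between two strings based on their longest common subsequence (LCS) of tokens. LCS length divided by the reference length gives recall, and divided by the candidate length gives precision; this lesson uses the F-measure of the two. Range 0–1.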
- Probe: A truncated prefix of a test case sent to the model with `temperature=0.0`; the model's continuation is compared to the expected output.
- Quarantine: An append-only file (`quarantine.jsonl`) where contaminated records are recorded for replacement.
- Replacement: A fresh test case generated to fill the slot of a quarantined one, preserving category/difficulty balance.
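The ROUGE-L definition can be worked through by hand. The sketch below computes an LCS-based F-measure from scratch, assuming lowercase whitespace tokenization and no stemming. The `rouge_score` package tokenizes differently and can apply stemming, so its values may not match this simplified version exactly; the function names here are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(expected: str, completion: str) -> float:
    """Simplified ROUGE-L F-measure (assumption: whitespace tokens, no stemming)."""
    ref = expected.lower().split()
    cand = completion.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the completion that matches
    recall = lcs / len(ref)      # share of the expected text that was reproduced
    return 2 * precision * recall / (precision + recall)


# One changed word out of six: LCS = 5, precision = recall = 5/6.
score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

In this example a single substituted word in a six-word answer already drops the score below 0.85, which shows how strict the threshold is for short outputs.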
Concepts
Operating discipline
- Always probe at temperature 0.0. Stochastic sampling produces different completions every run and makes scores meaningless.
- Use append mode (`a`) for quarantine. Overwriting destroys the audit trail of contamination history.
- Probe at least two providers per scan. A single provider's training data may not cover what others have seen; cross-provider probing catches more.
- Re-scan quarterly. Provider training cutoffs advance; new contamination appears as new model versions ship.
- Treat ROUGE-L > 0.85 as a strong signal but not proof. Some test cases are inherently low-entropy (e.g., "What is 2+2?") and produce high overlap without contamination. Spot-check before quarantining.
Code Walkthrough
Probe loop
```python
import asyncio
import json
from pathlib import Path

from openai import AsyncOpenAI
from rouge_score import rouge_scorer


class ContaminationDetector:
    def __init__(self, providers: dict[str, AsyncOpenAI], threshold: float = 0.85) -> None:
        # Each key is also passed as the `model` argument in probe().
        self._providers = providers
        self._threshold = threshold
        self._scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    async def probe(self, model: str, prefix: str) -> str:
        """Ask one model to continue the prefix at temperature 0.0."""
        client = self._providers[model]
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prefix}],
            temperature=0.0,
            max_tokens=300,
        )
        # message.content may be None; treat that as an empty completion.
        return resp.choices[0].message.content or ""

    async def score(self, prefix: str, expected: str) -> dict[str, float]:
        """Probe all providers concurrently; return ROUGE-L F-measure per model."""
        models = list(self._providers)
        completions = await asyncio.gather(*(self.probe(m, prefix) for m in models))
        return {
            model: self._scorer.score(expected, completion)["rougeL"].fmeasure
            for model, completion in zip(models, completions)
        }

    def is_contaminated(self, scores: dict[str, float]) -> bool:
        """Flag the case if any single provider meets or exceeds the threshold."""
        return any(s >= self._threshold for s in scores.values())
```
- The `ContaminationDetector` constructor takes a dict mapping each model name to its SDK client, plus the contamination threshold (default 0.85). The dict key doubles as the `model` argument sent to the API.
- `probe` calls a single model with the prefix at temperature 0.0. The deterministic setting is essential because non-deterministic decoding masks training-data overlap.
- `score` probes all providers concurrently and returns a per-provider ROUGE-L score against the expected continuation.
- `is_contaminated` flags a test case if any provider scores at or above the threshold; one provider memorizing the case is enough to bias cross-provider benchmarks.
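The per-provider scores can then be rolled up into the contamination report the lesson asks for (total cases, contaminated count, rate per model, rate per category). The following is a rough sketch only: `build_report`, the record layout with `category` and `scores` keys, and the model labels in the sample data are all assumptions for illustration.

```python
from collections import defaultdict


def build_report(results: list[dict], threshold: float = 0.85) -> dict:
    """Aggregate per-case scores into a contamination report.

    Assumed record shape (hypothetical): {"id", "category", "scores": {model: float}}.
    """
    model_hits: dict[str, int] = defaultdict(int)
    cat_total: dict[str, int] = defaultdict(int)
    cat_hits: dict[str, int] = defaultdict(int)
    contaminated = 0

    for record in results:
        cat_total[record["category"]] += 1
        case_flagged = False
        for model, value in record["scores"].items():
            hit = value >= threshold
            model_hits[model] += int(hit)
            case_flagged = case_flagged or hit
        if case_flagged:  # any single provider is enough to flag the case
            contaminated += 1
            cat_hits[record["category"]] += 1

    total = len(results)

    def rate(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "total_cases": total,
        "contaminated_count": contaminated,
        "contamination_rate": rate(contaminated, total),
        # Assumes every case was scored by every model, so the denominator is `total`.
        "rate_per_model": {m: rate(h, total) for m, h in model_hits.items()},
        "rate_per_category": {c: rate(cat_hits[c], n) for c, n in cat_total.items()},
    }


# Illustrative scores only, not real measurements.
sample = [
    {"id": "a", "category": "math", "scores": {"gpt-4o": 0.95, "gemini-pro": 0.40}},
    {"id": "b", "category": "math", "scores": {"gpt-4o": 0.30, "gemini-pro": 0.20}},
    {"id": "c", "category": "code", "scores": {"gpt-4o": 0.90, "gemini-pro": 0.88}},
    {"id": "d", "category": "code", "scores": {"gpt-4o": 0.10, "gemini-pro": 0.50}},
]
report = build_report(sample)
print(report)
```

Keeping the per-model rate separate from the overall rate makes it visible when one provider accounts for most of the flagged cases.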
Quarantine flow
The replacement step matters: simply removing contaminated cases shrinks the dataset and often distorts category/difficulty balance. Always generate a fresh case in the same (category, difficulty) cell, re-probe it, and only commit it once every provider scores it below the contamination threshold.
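A minimal sketch of the quarantine-then-replace flow described above, under stated assumptions: `quarantine_case` and `replace_case` are hypothetical helpers, the record fields are illustrative, and `generate` and `score_fn` are caller-supplied callables. `score_fn` is synchronous here for brevity; with the async `score` method from the walkthrough it would need to be awaited instead.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def quarantine_case(case: dict, scores: dict[str, float], path: Path) -> None:
    """Append the contaminated case and its evidence to the quarantine file."""
    record = {
        "case": case,
        "scores": scores,  # per-provider ROUGE-L evidence
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append mode keeps earlier quarantine records intact (audit trail).
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def replace_case(
    case: dict,
    generate: Callable[[str, str], dict],
    score_fn: Callable[[dict], dict[str, float]],
    threshold: float = 0.85,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Generate a fresh case in the same (category, difficulty) cell.

    Returns the first candidate that every provider scores below the
    threshold, or None if no candidate passes within max_attempts.
    """
    for _ in range(max_attempts):
        candidate = generate(case["category"], case["difficulty"])
        scores = score_fn(candidate)
        if all(s < threshold for s in scores.values()):
            return candidate
    return None
```

Returning `None` when no clean replacement is found lets the caller decide whether to retry with a different generator or leave the slot open, rather than silently committing a contaminated candidate.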
Do's and Don'ts
Do's
- ✓ Do set `temperature=0.0` in every `probe` call. Deterministic decoding is required because non-zero temperature introduces sampling noise that can mask verbatim training-data overlap, producing false negatives and letting contaminated cases survive into the active dataset.
- ✓ Do flag a test case as contaminated when any provider's ROUGE-L score meets or exceeds 0.85. A single provider that has memorized the expected continuation is sufficient to bias cross-provider benchmarks, so the `is_contaminated` check must use `any()`, not a majority or average.
- ✓ Do generate a replacement in the same `(category, difficulty)` cell and re-probe it before committing. Simply dropping quarantined cases shrinks the dataset and distorts category and difficulty balance; the replacement must itself score below the threshold on every provider before it is written to the active dataset.
Don'ts
- ✗ Don't skip the replacement step after quarantining a contaminated case. Appending to `quarantine.jsonl` without generating a fresh test case leaves a gap in the dataset's coverage; the resulting distribution shift makes benchmark results non-comparable to prior runs.
- ✗ Don't probe providers sequentially when scoring a single test case. The `score` method uses `asyncio.gather` to probe all providers in parallel; serializing those calls adds latency proportional to the number of providers and makes large-dataset contamination scans impractically slow.
- ✗ Don't treat a ROUGE-L score below 0.85 as proof the model has never seen the example. ROUGE-L is a probabilistic signal, not a forensic ground truth; lowering the threshold to catch marginal cases increases false positives, but raising it above 0.85 risks missing paraphrase-level memorization that still inflates scores.