Detect dataset contamination and leakage

You will build contamination detection to check whether evaluation datasets have leaked into LLM training data. Create a ContaminationDetector class that: (1) sends each test case input to OpenAI GPT-4o and Gemini Pro, (2) checks whether the model can reproduce the expected output verbatim (the contamination signal), and (3) computes a contamination score: the percentage of test cases where the model output reaches the ROUGE-L overlap threshold against the expected answer (0.85 by default in this lesson). Flag contaminated test cases for replacement. Build a quarantine workflow in which contaminated cases are moved to a quarantine.jsonl file together with the contamination evidence. Generate a contamination report covering total cases, contaminated count, and contamination rate per model and per category.

Introduction

A test case that appears in a model provider's training data produces inflated benchmark scores that disappear in production. This is dataset contamination — and for hosted LLMs whose training corpora you cannot inspect, it can only be detected probabilistically. The standard technique is to probe the model with a portion of each test case and measure how much of the expected continuation it reproduces verbatim. High verbatim overlap (ROUGE-L above 0.85) is strong evidence the model has seen the example before; the case must be quarantined and replaced with a fresh one before the dataset can be used as a fair benchmark.
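
As a minimal sketch of the "portion of each test case" idea, the helper below splits a text into a probe prefix and the expected continuation. The function name split_for_probe and the 50% default split are illustrative assumptions, not part of the lesson's detector; choose a split point that suits your dataset.

```python
def split_for_probe(text: str, prefix_fraction: float = 0.5) -> tuple[str, str]:
    """Split text into (prefix, expected_continuation) on whitespace tokens.

    Hypothetical helper: the split ratio is an assumption, and joining on
    single spaces normalizes the original whitespace.
    """
    tokens = text.split()
    # Always keep at least one token in the prefix so the probe is non-empty.
    cut = max(1, int(len(tokens) * prefix_fraction))
    prefix = " ".join(tokens[:cut])
    continuation = " ".join(tokens[cut:])
    return prefix, continuation


prefix, expected = split_for_probe("one two three four")
```

The prefix is what gets sent to the model; the continuation is what the model's output is compared against.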

Key Terminology

Contamination: A test case present in the model's training data, producing biased eval scores.
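ROUGE-L: Longest common subsequence (LCS) overlap between two token sequences. Precision is LCS length divided by the candidate length, recall is LCS length divided by the reference length, and the F-measure combines the two; this lesson uses the F-measure. Range 0–1.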
Probe: A truncated prefix of a test case sent to the model with temperature=0.0; the model's continuation is compared to the expected output.
Quarantine: An append-only file (quarantine.jsonl) where contaminated test cases are stored, along with their contamination evidence, until they are replaced.
Replacement: A fresh test case generated to fill the slot of a quarantined one, preserving category/difficulty balance.
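
To make the ROUGE-L definition concrete, here is a simplified pure-Python sketch of the F-measure computed from the longest common subsequence. It assumes lowercase whitespace tokenization and no stemming; the rouge_score package used later in this lesson applies its own tokenization and optional stemming, so its numbers can differ slightly. The function names are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(expected: str, completion: str) -> float:
    """Simplified ROUGE-L F-measure over lowercase whitespace tokens."""
    ref = expected.lower().split()
    cand = completion.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the completion that matches
    recall = lcs / len(ref)      # share of the expected text recovered
    return 2 * precision * recall / (precision + recall)


identical = rouge_l_f1("the cat sat on the mat", "the cat sat on the mat")
one_word_changed = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
```

With one of six words changed, the LCS is five tokens, so precision and recall are both 5/6 and the F-measure is about 0.83, just under the 0.85 threshold. Short expected outputs are therefore very sensitive to a single differing token.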

Concepts

Operating discipline

Always probe at temperature 0.0. Sampling at a higher temperature produces a different completion on each run, so scores are not reproducible and verbatim recall can be missed.
Use append mode ("a") for quarantine. Overwriting destroys the audit trail of contamination history.
Probe at least two providers per scan. A single provider's training data may not cover what others have seen; cross-provider probing catches more.
Re-scan quarterly. Provider training cutoffs advance; new contamination appears as new model versions ship.
Treat ROUGE-L > 0.85 as a strong signal but not proof. Some test cases are inherently low-entropy (e.g., "What is 2+2?") and produce high overlap without contamination. Spot-check before quarantining.
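
The append-only rule above can be sketched as a small writer for quarantine.jsonl. The record layout (case, scores, threshold, flagged_by, quarantined_at) and the function name quarantine_case are assumptions made for illustration; the lesson only requires that each quarantined case is stored with its contamination evidence.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def quarantine_case(
    path: Path, case: dict, scores: dict[str, float], threshold: float
) -> dict:
    """Append one contaminated case and its evidence to a JSONL file."""
    record = {
        "case": case,
        "scores": scores,        # per-model ROUGE-L evidence
        "threshold": threshold,  # threshold in force when the case was flagged
        "flagged_by": sorted(m for m, s in scores.items() if s >= threshold),
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    # Mode "a" appends; earlier records are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Storing the threshold alongside the scores keeps old records interpretable if the threshold is later changed.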

Code Walkthrough

Probe loop

```python
import asyncio
import json
from pathlib import Path

from openai import AsyncOpenAI
from rouge_score import rouge_scorer


class ContaminationDetector:
    def __init__(self, providers: dict[str, AsyncOpenAI], threshold: float = 0.85) -> None:
        # Keys are model names (passed as `model=` in probe); values are the
        # clients that serve those models.
        self._providers = providers
        self._threshold = threshold
        self._scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    async def probe(self, model: str, prefix: str) -> str:
        """Return the model's continuation of `prefix` at temperature 0.0."""
        client = self._providers[model]
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prefix}],
            temperature=0.0,
            max_tokens=300,
        )
        # content can be None (for example on a refusal); treat it as empty.
        return resp.choices[0].message.content or ""

    async def score(self, prefix: str, expected: str) -> dict[str, float]:
        """Probe every model concurrently and return ROUGE-L F-measure per model."""
        models = list(self._providers)
        completions = await asyncio.gather(*(self.probe(m, prefix) for m in models))
        scores: dict[str, float] = {}
        for model, completion in zip(models, completions):
            # RougeScorer.score takes (target, prediction).
            scores[model] = self._scorer.score(expected, completion)["rougeL"].fmeasure
        return scores

    def is_contaminated(self, scores: dict[str, float]) -> bool:
        """Flag the case if any model scores at or above the threshold."""
        return any(s >= self._threshold for s in scores.values())
```

The ContaminationDetector constructor takes a dict of model name → SDK client, plus the contamination threshold (default 0.85). The dict key is passed directly as the model argument of each API call, so it must be the model identifier the client expects.
probe calls a single model with the prefix at temperature 0.0. The deterministic setting is essential because sampling noise can mask training-data overlap.
score probes all providers concurrently and returns a per-provider ROUGE-L score against the expected continuation.
is_contaminated flags a test case if any provider scores at or above the threshold; one provider memorizing the case is enough to bias cross-provider benchmarks.
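
Once every test case has per-model scores, the report described at the top of the lesson (total cases, contaminated count, rate per model and per category) can be aggregated as in the sketch below. The input shape, a list of dicts with "category" and "scores" keys, and the function name build_report are assumptions for illustration; the model names are placeholders.

```python
from collections import defaultdict


def build_report(results: list[dict], threshold: float = 0.85) -> dict:
    """Aggregate per-case scores into a contamination report."""
    flagged_by_model: dict[str, int] = defaultdict(int)
    seen_by_model: dict[str, int] = defaultdict(int)
    flagged_by_cat: dict[str, int] = defaultdict(int)
    total_by_cat: dict[str, int] = defaultdict(int)
    contaminated = 0

    for r in results:
        total_by_cat[r["category"]] += 1
        case_flagged = False
        for model, s in r["scores"].items():
            seen_by_model[model] += 1
            if s >= threshold:  # same rule as is_contaminated
                flagged_by_model[model] += 1
                case_flagged = True
        if case_flagged:
            contaminated += 1
            flagged_by_cat[r["category"]] += 1

    total = len(results)
    return {
        "total_cases": total,
        "contaminated_count": contaminated,
        "contamination_rate": contaminated / total if total else 0.0,
        "rate_per_model": {m: flagged_by_model[m] / n for m, n in seen_by_model.items()},
        "rate_per_category": {c: flagged_by_cat[c] / n for c, n in total_by_cat.items()},
    }


results = [
    {"category": "math", "scores": {"model-a": 0.91, "model-b": 0.40}},
    {"category": "math", "scores": {"model-a": 0.30, "model-b": 0.20}},
    {"category": "code", "scores": {"model-a": 0.88, "model-b": 0.86}},
    {"category": "code", "scores": {"model-a": 0.10, "model-b": 0.50}},
]
report = build_report(results)
```

A case counts once toward the overall contaminated count even when several models flag it, while the per-model rates count each model separately.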

Quarantine flow

The replacement step matters: simply removing contaminated cases shrinks the dataset and often distorts category/difficulty balance. Always generate a fresh case in the same (category, difficulty) cell, re-probe it, and commit it only once every provider scores below the threshold.
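
The generate, re-probe, commit loop can be sketched as follows. This is an illustration under stated assumptions: generate and score are hypothetical callables you supply (for example, a wrapper around your case generator and a synchronous wrapper around the detector's score method), and max_attempts is an assumed safety limit.

```python
from typing import Callable, Optional


def find_clean_replacement(
    generate: Callable[[str, str], dict],
    score: Callable[[dict], dict[str, float]],
    category: str,
    difficulty: str,
    threshold: float = 0.85,
    max_attempts: int = 5,
) -> Optional[dict]:
    """Return a fresh case that scores below the threshold on every model.

    Returns None if no clean candidate is found within max_attempts.
    """
    for _ in range(max_attempts):
        candidate = generate(category, difficulty)  # same cell as the quarantined case
        scores = score(candidate)                   # re-probe the candidate
        if all(s < threshold for s in scores.values()):
            return candidate                        # safe to commit
    return None
```

Returning None instead of raising lets the caller decide whether to leave the slot open, retry later, or escalate for manual authoring.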

Do's and Don'ts

Do's

✓ Do set temperature=0.0 in every probe call. Non-zero temperature introduces sampling noise that can mask verbatim training-data overlap, producing false negatives that let contaminated cases survive into the active dataset.
✓ Do flag a test case as contaminated when any provider's ROUGE-L score meets or exceeds 0.85. A single provider that has memorized the expected continuation is enough to bias cross-provider benchmarks, so the is_contaminated check must use any(), not a majority or an average.
✓ Do generate a replacement in the same (category, difficulty) cell and re-probe it before committing. Simply dropping quarantined cases shrinks the dataset and distorts category and difficulty balance; the replacement must itself score below the threshold on every provider before it is written to the active dataset.

Don'ts

✗ Don't skip the replacement step after quarantining a contaminated case. Appending to quarantine.jsonl without generating a fresh test case leaves a gap in the dataset's coverage, and the resulting distribution shift makes benchmark results non-comparable to prior runs.
✗ Don't probe providers sequentially when scoring a single test case. The score method uses asyncio.gather to probe all providers in parallel; serializing those calls adds latency proportional to the number of providers and makes large-dataset contamination scans impractically slow.
✗ Don't treat a ROUGE-L score below 0.85 as proof that the model has never seen the example. ROUGE-L is a probabilistic signal, not forensic ground truth: lowering the threshold to catch marginal cases increases false positives, while raising it above 0.85 risks missing paraphrase-level memorization that still inflates scores.
