GenAI Safety & Evaluation Engineering
Design automated LLM evaluation pipelines, red-team GenAI systems, build bias detection and fairness benchmarks, and implement guardrails.
Verifiable skill graph
12 skill groups · each becomes a signed node on your graph.
Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.
Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.
Production-grade Python for eval tooling: async/await, Pydantic models for eval rubrics, typing, dataclasses, pytest harnesses, parametrized testing patterns.
Provider SDK integration in eval and safety code: judge models, multi-provider scoring, cross-model evaluation harnesses, multi-provider abstraction (LiteLLM).
Golden-set curation, dataset versioning, prompt-variant generation, edge-case mining, stratified sampling, dataset bias auditing, eval-set hygiene.
LLM-as-judge rubric design, position-bias correction, calibration against human raters, scoring functions, judge reproducibility, multi-judge ensembling.
RAGAS + DeepEval + TruLens pipelines, retrieval relevance + faithfulness + answer-relevancy metrics, cross-model comparison harnesses, model ranking.
Step-level agent eval, trajectory scoring, tool-call accuracy, plan-quality assessment, human-in-the-loop review gates, golden-trajectory datasets.
A/B testing for LLM systems, eval-driven CI/CD pipelines, quality gates in deployment, regression suites, prompt-variant champion/challenger.
Hallucination detectors (grounding checks, NLI-based), bias and fairness metrics, demographic-parity tests, disparate-impact analysis, continuous fairness monitoring.
Content moderation, PII detection + redaction, toxicity classifiers, sensitive-data filters, output sanitization, regulated-content classification.
Prompt injection defense, jailbreak detection, adversarial robustness testing, automated red teaming, OWASP LLM Top 10 + MITRE ATLAS, attack scenario engineering.
EU AI Act compliance, SOC2/HIPAA/GDPR frameworks, regulatory artifact generation, AI risk classification, governance committee workflows, end-to-end eval+safety+governance pipelines.
Langfuse + OpenTelemetry for eval traces, eval-pipeline dashboards, cost governance for eval runs, token budget controllers for judge models, eval-cost FinOps.
What you'll ship in production
Core responsibilities this discipline prepares you for.
1. Build automated evaluation pipelines to continuously measure LLM output quality
- Design evaluation harnesses with RAGAS, DeepEval, and NeMo Evaluator SDK for multi-metric scoring
- Create evaluation datasets with ground-truth annotations and run cross-provider comparisons
- Wire CI gates that automatically block deployments when faithfulness or relevance scores degrade
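One way to picture the CI gate described above is a small comparison between a stored baseline and the scores from the current eval run. This is a minimal sketch under stated assumptions: `GateResult`, `check_quality_gate`, the metric names, and the `max_drop` tolerance are hypothetical and do not come from RAGAS, DeepEval, or any other library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: dict[str, float]  # metric name -> drop below the baseline score


def check_quality_gate(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.02,  # hypothetical tolerance; tune per metric in practice
) -> GateResult:
    """Fail when a baseline metric is missing or drops by more than max_drop."""
    failures: dict[str, float] = {}
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None:
            # A metric that silently disappears is treated as a full regression.
            failures[metric] = base_score
            continue
        drop = base_score - score
        if drop > max_drop:
            failures[metric] = drop
    return GateResult(passed=not failures, failures=failures)


# Example run: faithfulness regressed, answer relevancy improved slightly.
baseline_scores = {"faithfulness": 0.90, "answer_relevancy": 0.85}
current_scores = {"faithfulness": 0.85, "answer_relevancy": 0.86}
result = check_quality_gate(baseline_scores, current_scores)
```

In a pipeline, the job would typically exit with a non-zero status when `result.passed` is false, which is what blocks the deployment step.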
2. Conduct red-team exercises: probe LLMs for vulnerabilities
- Automate adversarial testing with Garak for prompt injection, jailbreak, and data extraction probes
- Run multi-turn adversarial campaigns with Meta GOAT and DeepTeam for agent vulnerability testing
- Execute red-team campaigns against realistic systems, discover vulnerabilities, and write actionable findings
3. Implement production guardrails: content filters, PII detection, jailbreak prevention
- Configure NeMo Guardrails with Colang policy language, Llama Guard 4, and Prompt Guard 2
- Add Presidio for PII detection/redaction and Model Armor for Google-native content safety
- Layer multiple defenses, test against comprehensive attack suites, and quantify safety-vs-helpfulness tradeoffs
4. Design GenAI governance frameworks aligned with regulations
- Map EU AI Act risk classification and implement NIST AI RMF control frameworks
- Build OWASP LLM Top 10 mitigation strategies mapped to technical controls
- Create governance artifacts, conduct risk assessments, and build automated audit trail pipelines
5. Evaluate GenAI agent behavior: trajectory quality, tool selection accuracy
- Build trajectory scoring systems measuring tool selection accuracy and task completion quality
- Design human preference alignment tests and regression test suites for agent workflows
- Evaluate multi-step agent executions to identify failure modes and build targeted regression tests
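As a rough illustration of trajectory scoring, the sketch below compares an agent's tool calls against a golden trajectory, step by step, and blends the result with a task-completion flag. The function names, the position-wise matching rule, and the `tool_weight` default are assumptions made for this example; real harnesses often use richer alignment and also compare tool arguments.

```python
def tool_selection_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of steps where the agent called the same tool as the golden trajectory.

    Dividing by the longer sequence penalizes both skipped and extra steps.
    """
    if not expected and not actual:
        return 1.0
    matches = sum(e == a for e, a in zip(expected, actual))
    return matches / max(len(expected), len(actual))


def score_trajectory(
    expected: list[str],
    actual: list[str],
    task_completed: bool,
    tool_weight: float = 0.5,  # hypothetical weighting between the two signals
) -> float:
    """Weighted blend of tool-selection accuracy and task completion."""
    accuracy = tool_selection_accuracy(expected, actual)
    return tool_weight * accuracy + (1.0 - tool_weight) * float(task_completed)


# Hypothetical golden trajectory and one agent run that ends with an unneeded call.
golden = ["search", "fetch", "summarize"]
run = ["search", "fetch", "summarize", "email"]
accuracy = tool_selection_accuracy(golden, run)
score = score_trajectory(golden, run, task_completed=True)
```

Runs that score low on accuracy but still complete the task are useful candidates for targeted regression tests.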
6. Monitor bias, fairness, and hallucination rates in production
- Detect bias across protected attributes using statistical fairness metrics and disparity analysis
- Measure hallucination rates through ground-truth comparison and citation verification
- Implement continuous bias scanning, hallucination detection, and alerting for metric drift
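The statistical fairness metrics mentioned above can be sketched with two common measures: demographic parity difference and the disparate-impact ratio. The helper names and the sample data are hypothetical. An outcome of `1` stands for a favorable model decision, and treating an all-zero case as parity is a simplifying assumption.

```python
from collections import defaultdict


def selection_rates(groups: list[str], outcomes: list[int]) -> dict[str, float]:
    """Share of favorable outcomes (1) per group."""
    if len(groups) != len(outcomes):
        raise ValueError("groups and outcomes must have the same length")
    totals: dict[str, int] = defaultdict(int)
    positives: dict[str, int] = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        positives[group] += outcome
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(rates: dict[str, float]) -> float:
    """Gap between the highest and lowest group selection rate (0 means parity)."""
    return max(rates.values()) - min(rates.values())


def disparate_impact_ratio(rates: dict[str, float]) -> float:
    """Lowest selection rate divided by the highest (1 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group receives a favorable outcome; treated as parity here
    return min(rates.values()) / highest


# Hypothetical sample: group "a" gets 3 of 4 favorable outcomes, group "b" gets 1 of 4.
sample_groups = ["a"] * 4 + ["b"] * 4
sample_outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
rates = selection_rates(sample_groups, sample_outcomes)
parity_gap = demographic_parity_difference(rates)
impact_ratio = disparate_impact_ratio(rates)
```

A common rule of thumb flags a disparate-impact ratio below 0.8 for review. For continuous monitoring, the same functions could run over a sliding window of production decisions and alert when either value drifts.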
7. Build safety incident response processes for deployed GenAI systems
- Design safety monitoring dashboards with severity-based alert routing and escalation paths
- Build incident triage workflows with containment procedures and post-incident reporting templates
- Simulate safety incidents end-to-end and practice the full detection-to-resolution workflow
8. Design LlamaFirewall policies for agent safety
- Configure LlamaFirewall middleware for controlling agent tool access and output filtering rules
- Set up multi-agent safety boundaries with policy-based execution constraints
- Validate firewall policies against adversarial scenarios where agents attempt to bypass controls
Curriculum
7 courses · each builds on previous goals