GenAI Data Engineering

Build RAG data pipelines for ingestion, chunking, embedding, and indexing. Manage vector store operations and embedding model lifecycle.

11 skill groups · 8 courses · 542 goals · ~234 hrs

Verifiable skill graph

11 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

Embedding Pipelines & Re-Embedding Migrations

Run the embedding pipeline at corpus scale: batch generation, and the senior signal — re-embedding the whole corpus to a new model with zero downtime (dual-write, backfill, cutover), plus dimensionality/quantization and cost trade-offs.
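
The dual-write, backfill, cutover sequence can be sketched with plain dictionaries. This is a minimal in-memory illustration, not a production design: `MigratingIndex`, `embed_old`, and `embed_new` are hypothetical names, and a real migration would target an actual vector store and handle concurrent writers.

```python
class MigratingIndex:
    """In-memory stand-in for an index moving between embedding models (hypothetical)."""

    def __init__(self, embed_old, embed_new):
        self.embed_old = embed_old   # callable: text -> vector (current model)
        self.embed_new = embed_new   # callable: text -> vector (target model)
        self.docs = {}               # doc_id -> source text
        self.old = {}                # doc_id -> vector from the current model
        self.new = {}                # doc_id -> vector from the target model
        self.dual_write = False      # step 1: turn on before backfilling
        self.read_from = "old"       # flipped to "new" at cutover

    def upsert(self, doc_id, text):
        # Always keep the old index current so rollback stays possible.
        self.docs[doc_id] = text
        self.old[doc_id] = self.embed_old(text)
        if self.dual_write:
            self.new[doc_id] = self.embed_new(text)

    def backfill(self, batch_size=100):
        # Step 2: embed every document the new index is still missing.
        missing = [d for d in self.docs if d not in self.new]
        for start in range(0, len(missing), batch_size):
            for doc_id in missing[start:start + batch_size]:
                self.new[doc_id] = self.embed_new(self.docs[doc_id])
        return len(missing)

    def cutover(self):
        # Step 3: switch reads only once the new index covers the corpus.
        if self.new.keys() != self.docs.keys():
            raise RuntimeError("backfill incomplete; refusing to cut over")
        self.read_from = "new"

    def vector(self, doc_id):
        source = self.new if self.read_from == "new" else self.old
        return source[doc_id]
```

Reads keep hitting the old vectors until `cutover()` succeeds, which is what keeps the switch invisible to queries in this sketch.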

Document Ingestion & Parsing

Get heterogeneous sources in cleanly — where the job actually bleeds time: PDF/HTML/table/multimodal parsing, cleaning and normalization, dedup, and incremental/CDC sync with freshness handling.
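
As a rough illustration of dedup and incremental sync, the sketch below hashes normalized text, drops exact duplicates, and diffs the source against what is already indexed. It assumes a simple hash comparison rather than true CDC from a database log, and the helper names (`content_hash`, `drop_exact_duplicates`, `plan_sync`) are hypothetical.

```python
import hashlib


def content_hash(text):
    """Hash of whitespace-collapsed, lowercased text (an assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(docs):
    """Keep the first document seen for each content hash. docs: {doc_id: text}."""
    seen = set()
    kept = {}
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            kept[doc_id] = text
    return kept


def plan_sync(source_docs, indexed_hashes):
    """Compare source text against stored hashes.

    Returns (to_upsert, to_delete): new or changed doc IDs, and IDs that
    vanished from the source and should be removed from the index.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return to_upsert, to_delete
```

Only the IDs in `to_upsert` need to be re-parsed and re-embedded, which is where most of the saving in an incremental run would come from.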

Vector Store & Index Operations

Operate the index as a lifecycle: vector-DB ops, index construction and tuning (HNSW/IVF/PQ), reindex/rebuild/compaction/backup, sharding, and recall-vs-latency tuning at build time.

Retrieval Quality & Index Eval

Is the index good? precision@k / recall@k / MRR / nDCG, golden-retrieval-set construction, chunking-strategy tuning (tuned against these metrics), drift detection, and retrieval regression — measuring the index, not the answer.
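
For reference, the four metrics named above can be computed in a few lines. This is a minimal sketch assuming binary relevance (a document is either in the golden set for a query or not); the function names are illustrative.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (retrieved, relevant) pairs, one per query."""
    runs = list(runs)
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def ndcg_at_k(retrieved, relevant, k):
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

With `retrieved = ["d3", "d1", "d7", "d2"]` and `relevant = {"d1", "d2"}`, precision@2 and recall@2 are both 0.5, and the reciprocal rank is 0.5 because the first relevant hit sits at rank 2.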

Knowledge Graph Construction & GraphRAG

Build the graph when vectors aren't enough: entity/relation extraction, ontology and graph construction (Neo4j), GraphRAG retrieval, and graph+vector hybrid.

Hybrid Search & Reranking

Build and tune the retrieval stage offline: dense+sparse/BM25 fusion, reciprocal rank fusion, and cross-encoder reranking — tuned against a golden set, not per-request in a feature.
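
Reciprocal rank fusion is compact enough to show in full. The sketch below assumes each retriever returns an ordered list of document IDs and uses the conventional smoothing constant k = 60; the function name and the tie-breaking rule are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists by summing 1 / (k + rank) per document.

    rankings: iterable of lists, each ordered best-first (for example one
    from a dense retriever and one from BM25). Ties break on doc ID so the
    output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main reason this fusion is a common baseline before any reranker is tuned against the golden set.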

Data Flywheels & Continuous Improvement

Close the loop: feedback capture, re-embedding/re-index triggers, active learning, corpus expansion, and continuous retrieval eval.

Corpus Governance, PII & Deletion

Keep the corpus compliant: PII detection/redaction, lineage, access governance, and the hard problem — right-to-be-forgotten deletion from the index and every derived artifact.
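
One way to picture deletion from "every derived artifact" is a lineage map that records which keys each source document produced in each store. The sketch below is an assumption-heavy illustration: `forget_document`, the lineage layout, and the dict-backed stores are all hypothetical stand-ins for real chunk stores, vector indexes, and caches.

```python
def forget_document(doc_id, lineage, stores):
    """Delete every artifact derived from doc_id and report what was removed.

    lineage: {doc_id: {store_name: [artifact_key, ...]}}
    stores:  {store_name: dict-like store keyed by artifact_key}
    The lineage entry itself is dropped, so a second call is a no-op.
    """
    removed = {}
    for store_name, keys in lineage.pop(doc_id, {}).items():
        store = stores[store_name]
        removed[store_name] = [key for key in keys if key in store]
        for key in removed[store_name]:
            del store[key]
    return removed
```

The returned mapping doubles as an audit record of what was actually deleted, and artifacts never registered in the lineage map are exactly the ones such a scheme would miss.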

Pipeline Orchestration & Ops

Run the pipeline as a system: orchestration (Airflow/Dagster/Prefect), batch-vs-streaming, scheduling and backfills, data versioning, deployment/CI-CD for pipeline jobs, and monitoring/scaling.

Hosted LLM API Integration

Baseline provider access in pipeline code: embedding and LLM SDK calls, auth, batching, and retries.
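
The retry part of this group can be sketched with the standard library alone. The example assumes the provider SDK call is wrapped in a zero-argument function that raises a hypothetical `TransientError` on rate limits or timeouts; real SDKs raise their own exception types, which would be mapped accordingly.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for retryable failures such as rate limits or timeouts."""


def call_with_retries(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with jittered exponential backoff.

    The delay doubles per attempt, is capped at max_delay, and is scaled by a
    random factor in [0.5, 1.0] so parallel workers do not retry in lockstep.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; the curriculum itself covers the same pattern with the tenacity library.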

Python for Data Engineering

Production Python for pipeline code: async, Pydantic, data libraries, typing, and error handling.

What you'll ship in production

Core responsibilities this discipline prepares you for.

1. Build embedding pipelines: ingest, chunk, embed, and store in vector databases
- Select and benchmark embedding models across OpenAI and Gemini for domain-specific accuracy
- Implement chunking strategies (fixed, semantic, recursive) with batch embedding generation
- Build complete pipelines processing thousands of documents into pgvector with HNSW indexing
2. Design RAG data infrastructure: hybrid search and reranking
- Build BM25 + semantic hybrid search with LLM-as-reranker patterns using Gemini
- Implement semantic caching for throughput optimization and query result deduplication
- Construct hybrid search pipelines and benchmark retrieval quality with RAGAS precision-recall metrics
3. Build knowledge graph pipelines using Neo4j
- Extract entities from unstructured text and construct knowledge graphs with relationship typing
- Implement GraphRAG patterns and agentic Graph-RAG with MCP tool integration for graph traversal
- Build knowledge graphs from document corpora and query them with graph-aware retrieval agents
4. Process documents at scale: parsing, chunking, and quality filtering
- Process multi-format documents with Docling across PDF, HTML, and Office formats
- Apply intelligent context-preserving chunking and GPU-accelerated curation with NeMo Curator
- Build document processing pipelines that handle real-world messy data with quality filtering
5. Implement data quality controls: PII, dedup, compliance filtering
- Integrate Presidio for PII detection with custom entity recognizers and deduplication strategies
- Build compliance pipelines with content classification for regulated industries
- Construct quality gates that block non-compliant documents from entering the embedding pipeline
6. Orchestrate data pipelines with scheduling and failure recovery
- Use Argo Workflows for Kubernetes-native pipeline orchestration with DVC data versioning
- Build quality gates between pipeline stages with dead-letter queues and failure recovery patterns
- Wire multi-stage pipelines with automatic retry, checkpoint recovery, and quality validation gates
7. Monitor pipeline health: freshness, quality scores, embedding drift
- Instrument pipeline stages with OpenTelemetry and build Grafana dashboards for freshness and quality
- Monitor retrieval quality continuously with RAGAS evaluation and embedding drift detection
- Build monitoring for live pipelines that detects data quality degradation and triggers remediation
8. Design multi-tenant data isolation for enterprise RAG
- Build tenant-aware embedding pipelines with pgvector namespace isolation per customer
- Implement row-level security for vector search with per-tenant quality monitoring
- Verify tenant data isolation under concurrent multi-tenant queries with cross-tenant leakage tests
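
A cross-tenant leakage test can be illustrated without a database. The sketch below is an in-memory stand-in for the pgvector setup described above, not its API: `TenantVectorStore` and `leaked_ids` are hypothetical, the tenant filter plays the role that row-level security would play in PostgreSQL, and the probes run sequentially rather than concurrently.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantVectorStore:
    """Toy vector store where every row carries a tenant ID (hypothetical)."""

    def __init__(self):
        self._rows = []  # (tenant_id, doc_id, vector)

    def add(self, tenant_id, doc_id, vector):
        self._rows.append((tenant_id, doc_id, vector))

    def search(self, tenant_id, query, k=3):
        # Filter by tenant before scoring so other tenants' rows are never ranked.
        scored = [
            (cosine(query, vector), doc_id)
            for row_tenant, doc_id, vector in self._rows
            if row_tenant == tenant_id
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


def leaked_ids(store, tenant_id, owned_ids, probe_vectors, k=3):
    """Return any result IDs the tenant does not own, across all probe queries."""
    leaks = set()
    for probe in probe_vectors:
        leaks.update(d for d in store.search(tenant_id, probe, k) if d not in owned_ids)
    return leaks
```

A useful probe set includes vectors copied from other tenants' documents, since those are the queries most likely to surface a leak if the filter is ever bypassed.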

Curriculum

8 courses · each builds on previous goals

Course · Goals

Python Essentials for Agent Builders · 62 goals

Your Dev Environment · 4 goals

Navigate filesystem with terminal
Manage files from command line
Set up VS Code
Configure terminal in VS Code

Python, Git & Package Management · 6 goals

Install and verify Python
Write hello world script
Use Python REPL
Initialize Git repository
Track changes with Git
Install packages with pip

Variables & Basic Types · 5 goals

Create and name variables
Work with strings
Work with numbers
Work with booleans
Format with f-strings

Control Flow · 4 goals

Make decisions with if/elif/else
Iterate with for loops
Repeat with while loops
Control loop execution

Functions · 5 goals

Define and call functions
Use parameters
Return values
Document with docstrings
Understand scope

Modules & Imports · 4 goals

Import standard library
Create custom modules
Understand Python path
Create packages

Lists & Tuples · 5 goals

Create and access lists
Modify lists
Slice lists
Use list comprehensions
Work with tuples

Dictionaries & Sets · 5 goals

Create and access dicts
Modify dictionaries
Iterate over dicts
Work with nested dicts
Use sets

Classes & Dataclasses · 5 goals

Understand class basics
Create dataclasses
Add methods
Use default values
Basic inheritance

Files, JSON & Error Handling · 5 goals

Read and write files
Work with JSON
Use pathlib
Handle exceptions
Create custom exceptions

Basic Testing · 4 goals

Use assert statements
Create test functions
Run pytest
Test classes

Environment Variables & Configuration · 5 goals

Understand environment variables
Use .env files
Load with python-dotenv
Handle missing variables
Organize configuration

Decorators & Context Managers · 5 goals

Understand decorators
Write simple decorators
Use context managers
Write context managers
Combine patterns

LLM Foundations for Agent Builders · 60 goals

Generators & Iterators · 5 goals

Understand iteration
Create generators
Use generator expressions
Build data pipelines
Use itertools

Async Programming Basics · 5 goals

Understand async concepts
Write async functions
Run concurrent operations
Use async context managers
Handle async exceptions

Type Hints & Pydantic · 5 goals

Add basic type hints
Use typing generics
Create Pydantic models
Validate API data
Configure Pydantic

Data Pipelines & Transformations · 5 goals

Build functional pipelines
Work with tabular data
Transform data shapes
Process LLM data formats
Optimize for performance

HTTP Clients & httpx · 5 goals

Make GET requests
Make POST requests
Use async httpx
Handle errors
Use sessions

Your First LLM Call · 5 goals

Set up credentials
Install Gemini SDK
Make first API call
Parse response
Handle API errors

Tokenizer Internals · 5 goals

Understand tokenization basics
Learn BPE algorithm
Compare tokenizer types
Analyze cross-language efficiency
Count and optimize tokens

Sampling Parameters & Output Control · 5 goals

Understand temperature
Use top-p sampling
Implement determinism
Control output length
Use structured output

Embeddings & Semantic Search · 5 goals

Understand embeddings
Generate embeddings
Calculate similarity
Build simple search
Compare embedding models

RAG Fundamentals · 5 goals

Understand RAG pattern
Chunk documents
Build retrieval pipeline
Compose RAG prompts
Evaluate RAG quality

Cost Awareness & Token Economics · 5 goals

Understand pricing models
Calculate request costs
Compare provider costs
Identify cost drivers
Basic cost optimization

Retry Patterns with Tenacity · 5 goals

Understand retry need
Use tenacity basics
Implement exponential backoff
Handle specific exceptions
Combine with async

Kubernetes Essentials for GenAI · 60 goals

Containerizing LLM Applications · 6 goals

Write a Python app that calls the Gemini API and returns structured responses
Write a Dockerfile and build a container image for the LLM app
Run the containerized LLM app with environment-based configuration
Use Docker Compose to run the LLM app with supporting services
Tag images with semantic versions and push to a container registry
Debug containers with exec, logs, and inspect

Your Kubernetes Cluster & First LLM Pod · 6 goals

Understand K8s architecture and connect to your vCluster
Deploy the LLM app as your first Kubernetes pod
Organize workloads with namespaces
Use labels and selectors to organize and query resources
Understand pod lifecycle and restart policies
Master kubectl debugging: exec, logs, describe, port-forward

Services & the LLM Chat Backend · 6 goals

Create a ClusterIP service to expose the LLM chat API internally
Deploy a multi-tier LLM chat application
Compare service types: ClusterIP, NodePort, LoadBalancer
Master DNS-based service discovery in Kubernetes
Understand endpoints and traffic routing
Debug service connectivity problems

Deployments, Scaling & Rolling Updates · 6 goals

Create a Deployment for the LLM chat API
Scale LLM app replicas to handle concurrent requests
Perform a rolling update with zero downtime
Roll back a broken deployment
Compare deployment strategies: RollingUpdate vs Recreate
Manage deployment lifecycle with kubectl rollout

ConfigMaps & Secrets for LLM Apps · 6 goals

Create ConfigMaps for LLM app settings
Mount ConfigMaps as files for complex configuration
Store LLM proxy credentials securely in Secrets
Manage per-environment configuration for dev, staging, and prod
Handle configuration updates and rolling restarts
Debug configuration issues in LLM app pods

Persistent Storage & StatefulSets · 6 goals

Create PersistentVolumeClaims for durable storage
Deploy PostgreSQL as a StatefulSet
Connect the LLM chat API to PostgreSQL for conversation persistence
Deploy Redis as a StatefulSet for LLM response caching
Understand StatefulSet scaling and ordering guarantees
Manage PVC lifecycle: expansion, snapshots, and cleanup

Resource Management & Cost Optimization · 6 goals

Set resource requests and limits for the LLM chat API
Understand QoS classes and their impact on eviction
Enforce resource defaults with LimitRanges
Cap namespace resource usage with ResourceQuotas
Right-size LLM app containers based on actual usage
Diagnose OOMKilled and CPU throttling issues

Networking, Ingress & TLS · 6 goals

Expose the LLM chat API via an Ingress resource
Add TLS to the Ingress for HTTPS access
Isolate services with NetworkPolicies
Configure Ingress annotations for production traffic
Understand K8s networking: pod IPs, CNI, and service routing
Debug networking and connectivity issues

Health Probes, Autoscaling & Self-Healing · 6 goals

Add liveness and readiness probes to the LLM chat API
Configure startup probes for containers with slow initialization
Scale the chat API automatically with HPA based on CPU
Create PodDisruptionBudgets for safe maintenance
Implement health check patterns for LLM-dependent services
Combine autoscaling, probes, and PDBs for a resilient LLM service

RBAC, Security & K8s Troubleshooting · 6 goals

Create RBAC roles for the LLM chat application
Enforce Pod Security Standards
Apply SecurityContext for defense in depth
Debug CrashLoopBackOff and OOMKilled failures
Use kubectl debug and ephemeral containers for live debugging
Troubleshoot LLM-specific issues: timeouts, proxy errors, stale connections

Web APIs for GenAI Engineers · 54 goals

FastAPI Fundamentals · 6 goals

Create a FastAPI application with path operations
Define Pydantic request and response models
Implement dependency injection for shared resources
Build CRUD endpoints with proper HTTP semantics
Configure OpenAPI documentation with examples
Handle errors with custom exception handlers

Async Python for APIs · 6 goals

Convert sync endpoints to async with proper await patterns
Implement background tasks for non-blocking operations
Execute concurrent API calls with asyncio.gather
Manage application lifecycle with lifespan handlers
Build async generators for streaming responses
Control concurrency with semaphores and throttling

Database Integration · 6 goals

Configure SQLAlchemy async engine with connection pooling
Define ORM models with relationships and constraints
Create and manage database migrations with Alembic
Implement repository pattern for data access
Build transactional endpoints with session lifecycle
Implement filtering, sorting, and full-text search

Authentication & Authorization · 6 goals

Implement user registration with password hashing
Build OAuth2 password flow with JWT tokens
Implement API key authentication for services
Enforce role-based access control with permissions
Build token refresh and revocation
Compose multiple auth strategies into dependencies

Real-time Streaming · 6 goals

Build SSE endpoint for streaming LLM responses
Implement WebSocket endpoint with connection lifecycle
Build WebSocket connection manager for broadcasting
Handle backpressure and slow clients
Implement heartbeat and automatic reconnection
Build real-time notification system with Redis pub/sub

Resilience Patterns · 6 goals

Implement rate limiting with Redis sliding window
Build circuit breaker for LLM provider calls
Configure retry logic with tenacity
Isolate critical paths with bulkhead semaphores
Build fallback responses for degraded mode
Combine resilience patterns into middleware stack

API Gateway & Routing · 6 goals

Build reverse proxy with path-based routing
Implement load balancing across backend instances
Transform requests and responses through the gateway
Aggregate responses from multiple backends
Implement service discovery with health checking
Build gateway authentication and request enrichment

API Versioning & Evolution · 6 goals

Implement URL-based API versioning with routers
Build header-based version negotiation
Manage deprecation with Sunset and Warning headers
Build request and response adapters for version translation
Detect breaking changes automatically
Generate API changelogs from schema diffs

Deployment & Observability · 6 goals

Build production Docker images with multi-stage builds
Deploy to Kubernetes with health check probes
Instrument endpoints with Prometheus metrics
Implement distributed tracing with OpenTelemetry
Build structured logging with correlation IDs
Create Grafana dashboards for API monitoring

Data Infrastructure for GenAI · 60 goals

PostgreSQL & pgvector · 6 goals

Install pgvector and create vector-enabled tables
Build HNSW and IVFFlat indexes for fast similarity search
Perform similarity search with distance operators
Build hybrid search combining vectors and metadata
Integrate pgvector with SQLAlchemy ORM
Build a semantic search API endpoint

Advanced PostgreSQL · 6 goals

Partition tables by range for time-series AI data
Query and index JSONB for AI metadata
Build full-text search for document retrieval
Optimize queries with EXPLAIN ANALYZE
Implement connection pooling with PgBouncer
Build a database migration strategy for AI schemas

Redis for Caching & Sessions · 6 goals

Implement cache-aside pattern for LLM responses
Build session storage for multi-turn conversations
Use Redis pub/sub for real-time event broadcasting
Implement distributed locks for concurrent operations
Build rate limiting with Redis sorted sets
Monitor Redis performance and memory

MinIO Object Storage · 6 goals

Create MinIO buckets and configure access policies
Upload and download with presigned URLs
Implement versioned storage for datasets
Build multipart uploads for large model files
Build content-addressable storage for embeddings
Monitor MinIO health and storage metrics

Kafka Fundamentals · 6 goals

Create topics with partition and replication strategies
Produce messages with key-based partitioning
Consume messages with consumer groups
Implement message serialization with schemas
Handle delivery guarantees and idempotency
Monitor Kafka with consumer lag metrics

Event-Driven Architectures · 6 goals

Build an event-driven inference pipeline
Implement event sourcing for prediction audit trails
Design dead letter queues for failed processing
Build stream processing for real-time enrichment
Implement the saga pattern for multi-step workflows
Implement event replay for reprocessing

Neo4j Graph Database · 6 goals

Model a knowledge graph with nodes and relationships
Write Cypher queries for graph traversal
Build GraphRAG pipeline combining graph and LLM
Integrate Neo4j with Python using async driver
Build a knowledge graph update pipeline
Monitor Neo4j performance and queries

Data Pipeline Orchestration · 6 goals

Define Argo Workflow templates for data processing
Build DAG workflows with parallel execution
Pass artifacts between workflow steps
Implement retry and error handling strategies
Build reusable WorkflowTemplates
Schedule pipelines with CronWorkflows

Data Quality & Validation · 6 goals

Build schema validation for AI datasets
Implement embedding quality checks
Create data profiling reports
Build Great Expectations validation suites
Implement data quality gates in pipelines
Monitor data quality metrics over time

Data Infrastructure Operations · 6 goals

Deploy data services on Kubernetes with StatefulSets
Configure Prometheus monitoring for data services
Implement automated PostgreSQL backup and restore
Build MinIO backup and replication
Build auto-scaling for data services
Build operational runbooks and incident response

Enterprise LLM Customization · 102 goals

Enterprise Data Pipeline · 6 goals

Build DataPipeline with Instructor
Implement distillation data generation
Build a data quality dashboard
Build a data validation pipeline
Optimize data pipeline throughput
Build a data lineage tracker

Synthetic Data Factory · 6 goals

Build SyntheticFactory with DSPy
Use OpenAI Batch API for bulk generation
Filter and validate synthetic data
Build synthetic data quality evaluator
Optimize Batch API cost and throughput
Build synthetic data versioning

Fine-Tuned Enterprise Model · 6 goals

Fine-tune with OpenAI
Tune with Vertex AI
Build model comparison framework
Build fine-tuning hyperparameter search
Implement model distillation pipeline
Build model comparison report generator

RLVR-Trained Reasoning Model · 6 goals

Build programmatic graders
Train via RFT API
Analyze training dynamics
Build grader reliability testing
Optimize RFT training cost
Build training run monitoring dashboard

Reward Engineering Toolkit · 6 goals

Build composite reward functions
Build LLM-as-judge graders
Validate grader reliability with Promptfoo
Build reward function A/B testing
Optimize composite reward weighting
Build reward engineering documentation generator

Model Eval Dashboard · 6 goals

Configure Promptfoo eval suites
Use Batch API for bulk eval
Build regression detection
Build eval suite versioning and management
Optimize eval pipeline cost with sampling
Build eval regression root cause analyzer

Model Selection Engine · 6 goals

Build ModelSelector
Build cost-quality optimizer
Benchmark HuggingFace open models
Build model selection test harness
Optimize model selection latency
Build model catalog and recommendation engine

Production RAG Pipeline · 6 goals

Build EnterpriseRAGPipeline
Build RAGEvaluator with RAGAS
Build HybridRetriever with reranking
Build testing and evaluation for the production RAG pipeline
Optimize performance for the production RAG pipeline
Build operational monitoring for the production RAG pipeline

Vector Database Engineering · 6 goals

Build VectorDBBenchmark
Build MultiIndexManager
Build EmbeddingOptimizer
Build testing and evaluation for vector database engineering
Optimize performance for vector database engineering
Build operational monitoring for vector database engineering

LangGraph Agentic Orchestration · 6 goals

Build LangGraphAgent with stateful workflows
Build OrchestrationComparator
Build LangGraphRAGAgent
Build testing and evaluation for LangGraph agentic orchestration
Optimize performance for LangGraph agentic orchestration
Build operational monitoring for LangGraph agentic orchestration

GraphRAG & Knowledge Graphs · 6 goals

Build KnowledgeGraphBuilder
Build GraphRAGRetriever
Build HybridKnowledgeSearch
Build testing and evaluation for GraphRAG & knowledge graphs
Optimize performance for GraphRAG & knowledge graphs
Build operational monitoring for GraphRAG & knowledge graphs

Agent Memory & Stateful Systems · 6 goals

Build AgentMemorySystem
Build MemoryManager
Build StatefulAgentBenchmark
Build testing and evaluation for agent memory & stateful systems
Optimize performance for agent memory & stateful systems
Build operational monitoring for agent memory & stateful systems

Advanced RAG Patterns · 6 goals

Build AdaptiveRAG
Build RAGCIPipeline
Build ProductionRAGDashboard
Build testing and evaluation for advanced RAG patterns
Optimize performance for advanced RAG patterns
Build operational monitoring for advanced RAG patterns

Domain-Specific Fine-Tuning Pipelines · 6 goals

Training Data Preprocessor
Fine-Tuning Job Orchestrator
Model Evaluation Gate
Checkpoint and Version Management
Self-Service Pipeline API
Fine-Tuning Pipeline Capstone

Customer Knowledge Base RAG · 6 goals

Multi-Tenant Vector Store
Document Ingestion Pipeline
RAG Query Service
RAGAS Evaluation
Access Control Integration
Customer RAG Capstone

Data Quality for LLM Training · 6 goals

Format Validation Pipeline
Embedding Deduplication
Quality Scoring Model
Human Review Workflow
Data Quality Monitoring
Data Quality Capstone

Enterprise Prompt Management · 6 goals

Prompt Registry
Prompt A/B Testing
Prompt Governance
Prompt Performance Monitoring
Prompt Template Engine
Prompt Management Capstone

GenAI Data Pipelines · 108 goals

Document Ingestion with VLMs · 6 goals

Extract documents using Docling's unified multi-format parser
Process documents with VLM-based understanding using hosted APIs
Use Google Document AI for managed OCR and layout parsing
Design a unified document model normalizing all extraction outputs
Build a routing system selecting the optimal extraction method
Store extracted documents in GCS with PostgreSQL metadata tracking

Data Cleaning & Quality Agents · 6 goals

Implement exact-match deduplication using NeMo Curator
Build near-duplicate detection using MinHash and LSH
Implement content quality scoring with NeMo Curator filters
Deploy autonomous data quality agents for continuous monitoring
Design incremental cleaning for streaming document ingestion
Evaluate cleaning impact on downstream retrieval quality

Chunking & Contextual Retrieval · 6 goals

Implement recursive and semantic chunking as baselines
Build Anthropic's Contextual Retrieval pattern
Implement parent-child chunking for precision-with-context
Implement document-structure-aware chunking
Design chunk metadata schemas with context tracking
Benchmark all strategies using evaluation-driven selection

Context Engineering & LLM Enrichment · 6 goals

Extract structured metadata using Instructor with Pydantic schemas
Build LLM-powered document classification and tagging
Implement entity extraction and relationship identification
Generate chunk-level and document-level summaries
Implement multi-layer cost optimization for enrichment
Design context engineering patterns for enrichment pipelines

Multi-Format & Multimodal Processing · 6 goals

Extract and normalize tables using VLMs and traditional parsers
Process images with multimodal embeddings using Cohere Embed v4
Extract and classify code blocks with language detection
Build multimodal retrieval supporting text+image+table queries
Design multi-format storage strategies on GCS and PostgreSQL
Test multi-format pipelines against diverse document corpora

Embedding Model Selection & Benchmarking · 6 goals

Survey the 2026 hosted embedding API landscape
Build an embedding benchmarking framework
Analyze Matryoshka dimension tradeoffs for cost optimization
Evaluate Voyage 4's shared embedding space across model tiers
Implement an embedding abstraction layer with provider switching
Document model selection with evaluation evidence

Embedding Pipelines with Cost Controls · 6 goals

Build embedding pipelines with LiteLLM gateway routing
Implement multi-layer caching for embedding generation
Build incremental pipelines processing only changed documents
Track costs in real-time with Langfuse and enforce budgets
Implement rate limiting and retry logic for hosted APIs
Orchestrate pipelines as Argo Workflows with Kafka triggers

Vector Store Operations on AlloyDB · 6 goals

Configure AlloyDB with pgvector and ScaNN indexing
Tune HNSW index parameters for recall-latency tradeoffs
Implement vector store partitioning for multi-tenant data
Build zero-downtime reindexing for embedding model upgrades
Build vector store monitoring for index health
Load test and benchmark against self-managed alternatives

Hybrid Search, Reranking & Caching · 6 goals

Implement BM25 + vector hybrid search with rank fusion
Add reranking with Cohere Rerank 4 and NVIDIA NIM Reranker
Build ColBERT late interaction retrieval with RAGatouille
Build semantic caching using Redis LangCache
Implement RAG vs CAG decision routing
Evaluate the full retrieval stack end-to-end

Knowledge Graph Construction with LightRAG · 6 goals

Design graph ontology schemas for document domains
Build efficient extraction using LightRAG single-pass approach
Use Instructor for structured triple extraction with validation
Route extraction by complexity to cost-effective models
Implement graph ingestion merging triples into Neo4j
Evaluate knowledge graph completeness and accuracy

Entity Resolution & Linking · 6 goals

Build entity fingerprinting with name normalization
Implement embedding-based entity similarity with Voyage 4
Build LLM-powered coreference resolution
Design merge strategies for conflicting attributes
Implement cross-document entity linking in Neo4j
Evaluate entity resolution with precision, recall, F1

Agentic Graph-RAG Pipelines · 6 goals

Implement graph neighborhood retrieval from matched entities
Build agentic RAG with query decomposition and self-verification
Deploy LazyGraphRAG for cost-efficient graph-augmented retrieval
Use MCP for agent-tool integration in retrieval pipelines
Build query routing selecting retrieval mode per query
Evaluate agentic Graph-RAG against standard RAG

Knowledge Graph Maintenance · 6 goals

Implement graph snapshots and versioning with rollback
Build incremental update pipelines for document changes
Design schema evolution for new node and relationship types
Implement graph consistency validation
Build automated maintenance workflows with Argo
Monitor graph health with automated dashboards

Evaluation-Driven Quality Engineering · 6 goals

Generate synthetic evaluation datasets with RAGAS
Build the three-layer evaluation stack
Integrate DeepEval into CI/CD for automated quality gates
Deploy Arize Phoenix for real-time LLM observability
Instrument pipelines with OpenTelemetry GenAI conventions
Detect embedding drift and data freshness issues

PII Detection, Guardrails & Compliance · 6 goals

Implement Presidio regex and NER-based PII detection
Add NeMo Curator PII redaction for pipeline-scale detection
Build LLM-powered PII detection for context-dependent data
Deploy NeMo Guardrails for output safety and validation
Implement access controls and data masking
Build audit logging and compliance validation

Agentic Pipeline Orchestration · 6 goals

Design data pipelines as Dagster assets with lineage
Orchestrate K8s jobs with Argo Workflows and Kueue scheduling
Build event-driven triggers with Kafka and KEDA autoscaling
Version datasets with DVC backed by GCS
Connect pipeline agents via MCP for autonomous orchestration
Implement pipeline observability with OTel, Prometheus, Grafana

Data Flywheels & Continuous Improvement · 6 goals

Capture user feedback signals from retrieval interactions
Build feedback-driven evaluation with LLM-as-Judge labeling
Implement model cascading for cost reduction
Design A/B testing for pipeline configurations
Build automated improvement triggers from quality degradation
Measure flywheel effectiveness over iteration cycles

Production Capstone on GKE · 6 goals

Design end-to-end architecture on GKE Autopilot
Deploy infrastructure with Crossplane + Helm + Kustomize
Implement the integrated pipeline with quality gates
Build the full observability stack
Load test the system and define SLAs
Document runbooks and demonstrate end-to-end quality

GenAI Operations · 36 goals

Embedding Pipeline Ops · 6 goals

Deploy pgvector and build embedding ingestion pipeline with operational monitoring
Implement pipeline throughput tracking and failure detection with alerting
Build reprocessing workflow for failed or stale embeddings
Create pipeline health dashboards with freshness SLA tracking
Implement performance optimization for embedding pipeline operations
Build operational documentation for embedding pipeline operations

Vector Index Ops · 6 goals

Implement pgvector index maintenance with VACUUM and reindexing schedules
Deploy Qdrant and compare operational characteristics with pgvector
Build schema migration workflow for embedding dimension changes
Monitor index performance with query latency tracking and degradation detection
Implement performance optimization for vector index maintenance
Build operational documentation for vector index maintenance

Knowledge Graph Ops · 6 goals

Deploy Neo4j on vCluster with backup and restore procedures
Implement graph index management and query performance monitoring
Build knowledge graph freshness tracking with entity update pipelines
Create graph operations dashboards for health monitoring
Implement performance optimization for knowledge graph operations
Build operational documentation for knowledge graph operations

RAG Quality Monitor · 6 goals

Deploy RAGAS evaluation for production RAG quality monitoring on sampled traffic
Implement retrieval relevance tracking with precision and recall metrics
Build quality degradation alerting with root cause analysis
Compare retrieval quality across embedding models with Cohere Rerank
Implement performance optimization for RAG quality monitoring
Build operational documentation for RAG quality monitoring

Data Recovery Platform · 6 goals

Implement automated backup for pgvector, Qdrant, Neo4j, and PostgreSQL with scheduling
Build point-in-time recovery procedures with defined RTO and RPO targets
Create backup verification with data integrity checks and restore testing
Track backup compliance and recovery test results
Implement performance optimization for data backup and recovery
Build operational documentation for data backup and recovery

Data Quality Ops · 6 goals

Implement data freshness monitoring with staleness alerting across all stores
Build completeness checks for embedding coverage and knowledge graph gaps
Detect data poisoning and anomalous ingestion patterns
Create data quality scorecards with trend tracking
Implement performance optimization for data quality monitoring
Build operational documentation for data quality monitoring
