GenAI Data Engineering
Build RAG data pipelines for ingestion, chunking, embedding, and indexing. Manage vector store operations and embedding model lifecycle.
Verifiable skill graph
11 skill groups · each becomes a signed node on your graph.
Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.
Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.
Run the embedding pipeline at corpus scale: batch generation, and the senior signal — re-embedding the whole corpus to a new model with zero downtime (dual-write, backfill, cutover), plus dimensionality/quantization and cost trade-offs.
Get heterogeneous sources in cleanly — where the job actually bleeds time: PDF/HTML/table/multimodal parsing, cleaning and normalization, dedup, and incremental/CDC sync with freshness handling.
Operate the index as a lifecycle: vector-DB ops, index construction and tuning (HNSW/IVF/PQ), reindex/rebuild/compaction/backup, sharding, and recall-vs-latency tuning at build time.
Is the index good? precision@k / recall@k / MRR / nDCG, golden-retrieval-set construction, chunking-strategy tuning (tuned against these metrics), drift detection, and retrieval regression — measuring the index, not the answer.
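The ranking metrics named above can be sketched in a few lines. This is a minimal illustration that assumes binary relevance labels (a document is either in the golden set for a query or it is not); the function names and the document IDs in the usage note are hypothetical, not part of any particular evaluation library.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (ranked, relevant) pairs, one pair per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a single query with `ranked = ["d3", "d1", "d7", "d2"]` and `relevant = {"d1", "d2"}`, `precision_at_k(ranked, relevant, 3)` is 1/3 and `recall_at_k(ranked, relevant, 3)` is 1/2. Graded relevance would need a gain value per document instead of the set-membership check used here.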
Build the graph when vectors aren't enough: entity/relation extraction, ontology and graph construction (Neo4j), GraphRAG retrieval, and graph+vector hybrid.
Build and tune the retrieval stage offline: dense+sparse/BM25 fusion, reciprocal rank fusion, and cross-encoder reranking — tuned against a golden set, not per-request in a feature.
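Reciprocal rank fusion, mentioned above, combines ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is one plain way to write it, assuming each retriever returns an ordered list of document IDs; the function name and the sample IDs are hypothetical, and `k=60` is the constant commonly used for RRF rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID so the
    output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical output of a dense retriever and a BM25 retriever for one query.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it sidesteps the problem of dense similarity scores and BM25 scores living on different scales. Documents that appear in both lists (`b` and `c` here) move ahead of documents that appear in only one.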
Close the loop: feedback capture, re-embedding/re-index triggers, active learning, corpus expansion, and continuous retrieval eval.
Keep the corpus compliant: PII detection/redaction, lineage, access governance, and the hard problem — right-to-be-forgotten deletion from the index and every derived artifact.
Run the pipeline as a system: orchestration (Airflow/Dagster/Prefect), batch-vs-streaming, scheduling and backfills, data versioning, deployment/CI-CD for pipeline jobs, and monitoring/scaling.
Baseline provider access in pipeline code: embedding and LLM SDK calls, auth, batching, and retries.
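Batching and retries can be sketched without tying the code to any provider SDK. In the example below, `batched` and `call_with_retries` are hypothetical helpers, and the wrapped callable stands in for whatever embedding or LLM call the pipeline makes. The broad `except Exception` is an assumption for illustration only; real code would catch just the provider's transient errors, such as rate limits and timeouts.

```python
import random
import time


def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_retries(fn, *, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn() and retry on failure with exponential backoff plus jitter.

    The delay before retry n is base_delay * 2**(n - 1) plus a random amount
    of up to the same size. The final failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # assumption: narrow this to transient errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

A pipeline stage would typically loop over `batched(texts, n)` and wrap each provider call in `call_with_retries`, so a single rate-limited batch does not fail the whole run. The `sleep` parameter exists so tests can replace the real delay.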
Production Python for pipeline code: async, Pydantic, data libraries, typing, and error handling.
What you'll ship in production
Core responsibilities this discipline prepares you for.
1. Build embedding pipelines: ingest, chunk, embed, and store in vector databases
- Select and benchmark embedding models across OpenAI and Gemini for domain-specific accuracy
- Implement chunking strategies (fixed, semantic, recursive) with batch embedding generation
- Build complete pipelines processing thousands of documents into pgvector with HNSW indexing
2. Design RAG data infrastructure: hybrid search and reranking
- Build BM25 + semantic hybrid search with LLM-as-reranker patterns using Gemini
- Implement semantic caching for throughput optimization and query result deduplication
- Construct hybrid search pipelines and benchmark retrieval quality with RAGAS context precision and context recall metrics
3. Build knowledge graph pipelines using Neo4j
- Extract entities from unstructured text and construct knowledge graphs with relationship typing
- Implement GraphRAG patterns and agentic Graph-RAG with MCP tool integration for graph traversal
- Build knowledge graphs from document corpora and query them with graph-aware retrieval agents
4. Process documents at scale: parsing, chunking, and quality filtering
- Process multi-format documents with Docling across PDF, HTML, and Office formats
- Apply intelligent context-preserving chunking and GPU-accelerated curation with NeMo Curator
- Build document processing pipelines that handle real-world messy data with quality filtering
5. Implement data quality controls: PII, dedup, compliance filtering
- Integrate Presidio for PII detection with custom entity recognizers and deduplication strategies
- Build compliance pipelines with content classification for regulated industries
- Construct quality gates that block non-compliant documents from entering the embedding pipeline
6. Orchestrate data pipelines with scheduling and failure recovery
- Use Argo Workflows for Kubernetes-native pipeline orchestration with DVC data versioning
- Build quality gates between pipeline stages with dead-letter queues and failure recovery patterns
- Wire multi-stage pipelines with automatic retry, checkpoint recovery, and quality validation gates
7. Monitor pipeline health: freshness, quality scores, embedding drift
- Instrument pipeline stages with OpenTelemetry and build Grafana dashboards for freshness and quality
- Monitor retrieval quality continuously with RAGAS evaluation and embedding drift detection
- Build monitoring for live pipelines that detects data quality degradation and triggers remediation
8. Design multi-tenant data isolation for enterprise RAG
- Build tenant-aware embedding pipelines with pgvector namespace isolation per customer
- Implement row-level security for vector search with per-tenant quality monitoring
- Verify tenant data isolation under concurrent multi-tenant queries with cross-tenant leakage tests
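Of the chunking strategies listed under embedding pipelines above, the fixed-size variant is the simplest to illustrate. The sketch below is a minimal version that assumes chunks are measured in characters; production pipelines usually count tokens instead. The function name and default sizes are hypothetical.

```python
def chunk_fixed(text, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighbouring chunk
    when it is shorter than the overlap.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

For example, `chunk_fixed("abcdefghij", size=4, overlap=2)` yields four chunks, each sharing two characters with the next. Semantic and recursive strategies replace the fixed window with boundaries derived from the content, such as sentence or section breaks.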
Curriculum
8 courses · each builds on previous goals