GenAI Data Engineering

Build RAG data pipelines for ingestion, chunking, embedding, and indexing. Manage vector store operations and embedding model lifecycle.

11 skill groups · 8 courses · 542 goals · ~234 hrs

Verifiable skill graph

11 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

01
Embedding Pipelines & Re-Embedding Migrations

Run the embedding pipeline at corpus scale: batch generation, and the senior signal — re-embedding the whole corpus to a new model with zero downtime (dual-write, backfill, cutover), plus dimensionality/quantization and cost trade-offs.

02
Document Ingestion & Parsing

Get heterogeneous sources in cleanly — where the job actually bleeds time: PDF/HTML/table/multimodal parsing, cleaning and normalization, dedup, and incremental/CDC sync with freshness handling.
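As an illustration of the normalization and dedup step named above, here is a minimal sketch that drops exact duplicates after light canonicalization. The function names `normalize` and `dedup` are hypothetical, and the normalization rules (NFKC, whitespace collapsing, lowercasing) are assumptions; a production pipeline would likely add near-duplicate detection on top.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize text so trivially different copies hash the same."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip().lower()


def dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized content."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)  # the original, un-normalized text is preserved
    return unique
```

Hashing the normalized form while storing the original keeps the corpus faithful to its sources while still catching copies that differ only in spacing or case.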

03
Vector Store & Index Operations

Operate the index as a lifecycle: vector-DB ops, index construction and tuning (HNSW/IVF/PQ), reindex/rebuild/compaction/backup, sharding, and recall-vs-latency tuning at build time.

04
Retrieval Quality & Index Eval

Is the index good? Measure it with precision@k, recall@k, MRR, and nDCG; build golden retrieval sets; tune chunking strategy against those metrics; and catch drift and retrieval regressions. The target is the index, not the answer.
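The metrics named above can be sketched for a single query as follows. This is a minimal illustration with binary relevance, assuming `ranked` is a duplicate-free list of retrieved document IDs and `relevant` is the golden set for that query; the function names are hypothetical. MRR is the mean of `reciprocal_rank` over all queries in the golden set.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For example, with `ranked = ["a", "b", "c", "d"]` and `relevant = {"b", "d"}`, precision@2 and recall@2 are both 0.5, and the reciprocal rank is 0.5 because the first relevant hit is at position 2.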

05
Knowledge Graph Construction & GraphRAG

Build the graph when vectors aren't enough: entity/relation extraction, ontology and graph construction (Neo4j), GraphRAG retrieval, and graph+vector hybrid.

06
Hybrid Search & Reranking

Build and tune the retrieval stage offline: dense+sparse/BM25 fusion, reciprocal rank fusion, and cross-encoder reranking — tuned against a golden set, not per-request in a feature.
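Reciprocal rank fusion, mentioned above, can be sketched in a few lines. This is an illustrative version, not the course's implementation: the function name is hypothetical, and `k=60` is a commonly used smoothing constant that you would tune against your golden set.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because the fusion uses only ranks, it sidesteps the problem of dense similarity scores and BM25 scores living on incomparable scales.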

07
Data Flywheels & Continuous Improvement

Close the loop: feedback capture, re-embedding/re-index triggers, active learning, corpus expansion, and continuous retrieval eval.

08
Corpus Governance, PII & Deletion

Keep the corpus compliant: PII detection/redaction, lineage, access governance, and the hard problem — right-to-be-forgotten deletion from the index and every derived artifact.

09
Pipeline Orchestration & Ops

Run the pipeline as a system: orchestration (Airflow/Dagster/Prefect), batch-vs-streaming, scheduling and backfills, data versioning, deployment/CI-CD for pipeline jobs, and monitoring/scaling.

10
Hosted LLM API Integration

Baseline provider access in pipeline code: embedding and LLM SDK calls, auth, batching, and retries.

11
Python for Data Engineering

Production Python for pipeline code: async, Pydantic, data libraries, typing, and error handling.

What you'll ship in production

Core responsibilities this discipline prepares you for.

  1. Build embedding pipelines: ingest, chunk, embed, and store in vector databases

    • Select and benchmark embedding models across OpenAI and Gemini for domain-specific accuracy
    • Implement chunking strategies (fixed, semantic, recursive) with batch embedding generation
    • Build complete pipelines processing thousands of documents into pgvector with HNSW indexing
  2. Design RAG data infrastructure: hybrid search and reranking

    • Build BM25 + semantic hybrid search with LLM-as-reranker patterns using Gemini
    • Implement semantic caching for throughput optimization and query result deduplication
    • Construct hybrid search pipelines and benchmark retrieval quality with RAGAS context precision and context recall metrics
  3. Build knowledge graph pipelines using Neo4j

    • Extract entities from unstructured text and construct knowledge graphs with relationship typing
    • Implement GraphRAG patterns and agentic Graph-RAG with MCP tool integration for graph traversal
    • Build knowledge graphs from document corpora and query them with graph-aware retrieval agents
  4. Process documents at scale: parsing, chunking, and quality filtering

    • Process multi-format documents with Docling across PDF, HTML, and Office formats
    • Apply intelligent context-preserving chunking and GPU-accelerated curation with NeMo Curator
    • Build document processing pipelines that handle real-world messy data with quality filtering
  5. Implement data quality controls: PII, dedup, compliance filtering

    • Integrate Presidio for PII detection with custom entity recognizers and deduplication strategies
    • Build compliance pipelines with content classification for regulated industries
    • Construct quality gates that block non-compliant documents from entering the embedding pipeline
  6. Orchestrate data pipelines with scheduling and failure recovery

    • Use Argo Workflows for Kubernetes-native pipeline orchestration with DVC data versioning
    • Build quality gates between pipeline stages with dead-letter queues and failure recovery patterns
    • Wire multi-stage pipelines with automatic retry, checkpoint recovery, and quality validation gates
  7. Monitor pipeline health: freshness, quality scores, embedding drift

    • Instrument pipeline stages with OpenTelemetry and build Grafana dashboards for freshness and quality
    • Monitor retrieval quality continuously with RAGAS evaluation and embedding drift detection
    • Build monitoring for live pipelines that detects data quality degradation and triggers remediation
  8. Design multi-tenant data isolation for enterprise RAG

    • Build tenant-aware embedding pipelines with pgvector namespace isolation per customer
    • Implement row-level security for vector search with per-tenant quality monitoring
    • Verify tenant data isolation under concurrent multi-tenant queries with cross-tenant leakage tests

Curriculum

8 courses · each builds on previous goals
