Chapter 1

Document Ingestion with VLMs

Docling unified document parsingVLM document understanding vs traditional OCRGoogle Document AI Layout Parserunified document modelsGCS document storage pipelines

Learning Path

Hands-on Labs

Each objective has a coding lab that opens in VS Code in your browser

Objective 1

Extract documents using Docling's unified multi-format parser

Goal

Use Docling to parse PDF, DOCX, PPTX, and HTML into a unified structured representation. Leverage Granite-Docling-258M for layout analysis on CPU.

Objective 2

Process documents with VLM-based understanding using hosted APIs

Goal

Use GPT-4o vision and Gemini 2.5 vision APIs to directly understand document pages as images — extracting text, tables, and layout without OCR. Compare accuracy and cost vs traditional parsing.

Objective 3

Use Google Document AI for managed OCR and layout parsing

Goal

Configure Document AI processors (Enterprise OCR, Layout Parser with Gemini 3 Flash) for high-volume document processing. Handle 200+ languages and handwritten text.

Objective 4

Design a unified document model normalizing all extraction outputs

Goal

Create a format-agnostic document schema with typed content blocks and extraction method metadata. Build adapters for Docling, VLM, and Document AI outputs.

Objective 5

Build a routing system selecting the optimal extraction method

Goal

Route documents to the best extractor based on type, quality, and cost constraints. Use Docling for standard docs, VLMs for complex layouts, Document AI for high-volume.

Objective 6

Store extracted documents in GCS with PostgreSQL metadata tracking

Goal

Configure GCS buckets for raw and processed documents. Design PostgreSQL schemas for metadata, extraction lineage, and processing status.