Document Ingestion with VLMs
Learning Path
Hands-on Labs
Each objective has a coding lab that opens in VS Code in your browser
Extract documents using Docling's unified multi-format parser
Use Docling to parse PDF, DOCX, PPTX, and HTML into a unified structured representation. Leverage Granite-Docling-258M for layout analysis on CPU.
Process documents with VLM-based understanding using hosted APIs
Use GPT-4o vision and Gemini 2.5 vision APIs to directly understand document pages as images — extracting text, tables, and layout without OCR. Compare accuracy and cost vs traditional parsing.
Use Google Document AI for managed OCR and layout parsing
Configure Document AI processors (Enterprise OCR, Layout Parser with Gemini 3 Flash) for high-volume document processing. Handle 200+ languages and handwritten text.
Design a unified document model normalizing all extraction outputs
Create a format-agnostic document schema with typed content blocks and extraction method metadata. Build adapters for Docling, VLM, and Document AI outputs.
Build a routing system selecting the optimal extraction method
Route documents to the best extractor based on type, quality, and cost constraints. Use Docling for standard docs, VLMs for complex layouts, Document AI for high-volume.
Store extracted documents in GCS with PostgreSQL metadata tracking
Configure GCS buckets for raw and processed documents. Design PostgreSQL schemas for metadata, extraction lineage, and processing status.