Prerequisites
This chapter assumes you have completed the course onboarding and environment setup. You should have a working Python 3.11+ environment with pip and venv configured, along with an active Google Cloud Platform project with billing enabled. Familiarity with basic PDF structure (pages, fonts, bounding boxes) and REST API consumption patterns is expected. You should also have a GCS bucket provisioned and a PostgreSQL 15+ instance running locally or via Cloud SQL, as both are used throughout the hands-on exercises in this chapter.
Learning Goals
- Extract documents using Docling's unified multi-format parser
  - Extract text and structure from PDF, HTML, and DOCX using Docling's unified parser — Engineers will gain hands-on proficiency with IBM's Docling library, a unified document parsing framework that abstracts away format-specific extraction logic behind a single pipeline interface. Rather than maintaining separate parsers for each document type, Docling provides a `DocumentConverter` class that accepts PDF, HTML, DOCX, PPTX, and even image files through one consistent API surface. This goal focuses on understanding how Docling's internal pipeline stages—input parsing, layout analysis, table structure recognition, and output serialization—compose to produce a structured `DoclingDocument` object. You will learn how Docling's `PdfPipelineOptions` and `PipelineOptions` configuration classes control every stage of extraction, from OCR engine selection (EasyOCR, Tesseract, or a built-in hybrid) through table detection using transformer-based models like TableFormer. At the senior engineering level, the critical skill is not merely calling `converter.convert()` but understanding the performance implications of each pipeline stage, configuring batch processing for throughput, and handling the inevitable edge cases—scanned PDFs with mixed digital and image-based pages, DOCX files with deeply nested tables, and HTML pages with non-semantic markup that breaks structural extraction. You will also explore how Docling's output representation maps document hierarchy (titles, sections, paragraphs, tables, figures) into a tree structure that preserves reading order, enabling downstream chunking strategies that respect semantic boundaries rather than splitting mid-paragraph or mid-table. This is foundational to everything that follows in the chapter because the quality of your ingestion pipeline's structured output directly determines the quality of retrieval in any RAG system built on top of it.
  - Configure Docling's `DocumentConverter` with format-specific pipeline options, including `PdfPipelineOptions` with `do_ocr=True` for scanned documents, `do_table_structure=True` for table extraction via TableFormer, and `images_scale` for controlling resolution during rasterization. You will learn why the default configuration often produces suboptimal results on real-world enterprise documents—where mixed-mode PDFs contain both born-digital text layers and scanned appendices—and how to configure the `ocr_options` parameter with an `EasyOcrOptions` or `TesseractOcrOptions` instance to target only pages that lack an extractable text layer, avoiding the performance penalty of running OCR on pages that do not need it.
  - Process documents in batch using Docling's `convert_all()` method, streaming results to handle large document collections without excessive memory consumption. Production ingestion pipelines rarely process one document at a time; you will learn how Docling's internal batching works, how to set `max_num_pages` and `max_file_size` limits to prevent runaway processing on malformed files, and how to implement error handling around `ConversionResult` objects where individual documents may fail while the batch continues. You will also understand the difference between Docling's synchronous and asynchronous execution paths and when each is appropriate for pipeline architectures running on Cloud Run, Kubernetes Jobs, or local development environments.
  - Navigate and transform the `DoclingDocument` output structure, including its hierarchical representation of document elements—`TextItem`, `TableItem`, `PictureItem`, and `SectionHeaderItem`—to extract specific content types, flatten the tree for search indexing, or produce Markdown output using the built-in `doc.export_to_markdown()` serializer. You will learn how Docling assigns `prov` (provenance) metadata to each element, tracking the source page number, bounding box coordinates, and confidence scores, which is essential for building audit trails in regulated industries where you must prove that extracted text came from a specific region of a specific page in the source document.
  - Handle edge cases in multi-format extraction where the same logical document exists in PDF, HTML, and DOCX variants with structural differences. Docling's unified parser normalizes these into the same `DoclingDocument` schema, but normalization is not lossless—HTML documents may carry CSS class names that hint at semantic structure (e.g., `class="section-heading"`) that Docling's HTML backend preserves differently than a DOCX heading style or a PDF font-size heuristic. You will learn to write post-processing logic that reconciles these differences, ensuring that your downstream systems receive consistently structured content regardless of the source format.
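As a shape check for the tree-walking ideas above, here is a minimal sketch of flattening a hierarchical document tree into reading-order blocks for indexing. The `Node` class is a hypothetical stand-in used only for illustration; real code would walk a `DoclingDocument`'s items (`TextItem`, `TableItem`, `SectionHeaderItem`) instead.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a parsed document tree (illustration only).
@dataclass
class Node:
    kind: str                  # "section", "paragraph", "table", ...
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def flatten(node: Node, depth: int = 0) -> list[tuple[int, str, str]]:
    """Depth-first walk that preserves reading order, emitting
    (depth, kind, text) triples suitable for a search index."""
    out = []
    if node.text:
        out.append((depth, node.kind, node.text))
    for child in node.children:
        out.extend(flatten(child, depth + 1))
    return out

doc = Node("root", children=[
    Node("section", "Introduction", [
        Node("paragraph", "Docling parses PDF, HTML, and DOCX."),
        Node("table", "format | supported"),
    ]),
    Node("section", "Usage", [Node("paragraph", "Call converter.convert().")]),
])

blocks = flatten(doc)
```

Because the walk is depth-first over `children`, sibling order (and therefore reading order) survives the flattening, which is exactly the property that lets downstream chunking respect semantic boundaries.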
- Process documents with VLM-based understanding using hosted APIs
  - Compare traditional OCR pipelines with VLM-based document understanding using GPT-4o and Gemini vision — This goal moves beyond the mechanics of text extraction into the architectural decision that increasingly defines modern document ingestion: when to use traditional OCR-based pipelines and when to leverage vision-language models that understand document layout, semantics, and content simultaneously. Traditional OCR pipelines—Tesseract, Amazon Textract, Google Document AI—operate in stages: first they detect text regions, then they recognize characters, and finally a separate post-processing step attempts to reconstruct structure (tables, headers, reading order). VLMs like GPT-4o and Gemini 1.5 Pro with vision capabilities collapse these stages into a single inference pass where the model "reads" the document image and produces structured output directly. The trade-offs are profound and non-obvious. Traditional OCR is deterministic, fast, cheap at scale, and produces character-accurate output on clean documents. VLM-based extraction is probabilistic, slower, significantly more expensive per page, and can hallucinate content—but it excels at understanding complex layouts, interpreting handwritten annotations, extracting meaning from charts and diagrams, and handling documents where structure cannot be inferred from text position alone. Senior engineers must understand both approaches deeply enough to make the right architectural choice for each document type in their pipeline, and in many production systems, the answer is a hybrid approach where traditional OCR handles the common cases and VLM extraction is reserved for documents that fail quality checks or require semantic understanding beyond what positional heuristics can provide. You will learn to quantify the cost-accuracy-latency triangle for both approaches and build routing logic that directs documents to the appropriate extraction path.
  - Implement VLM-based document extraction by sending page images to GPT-4o and Gemini vision APIs with carefully engineered prompts that specify the desired output structure. You will learn that VLM document understanding is fundamentally a prompt engineering challenge—the same document image will produce wildly different outputs depending on whether you ask for "all text on this page" versus "extract this as structured JSON with sections, tables, and metadata." You will build extraction prompts that instruct the model to preserve table structure as Markdown or JSON arrays, maintain heading hierarchy, and flag low-confidence regions. Critically, you will learn to handle the token limits that constrain VLM extraction: a single high-resolution page image consumes 1,000-4,000 tokens of context depending on the model and `detail` parameter setting, meaning that a 100-page document requires either batched single-page calls or aggressive resolution reduction, each with its own accuracy implications.
  - Benchmark extraction quality across OCR and VLM approaches using precision, recall, and structural similarity metrics on a representative document corpus. You will learn to build evaluation harnesses that compare extracted text against ground truth using character error rate (CER) and word error rate (WER) for text accuracy, and custom structural metrics that measure whether tables were correctly detected, whether heading hierarchy was preserved, and whether reading order matches the source document. This is where many teams make poor architectural decisions—they test on clean, born-digital PDFs where traditional OCR achieves 99%+ accuracy and conclude that VLMs offer no benefit, without testing on the scanned invoices, handwritten forms, and complex multi-column layouts that actually drive their production error rates.
  - Build hybrid extraction pipelines with intelligent routing that scores each incoming document's complexity and routes it to the appropriate extraction backend. Simple born-digital PDFs with standard layouts go through Docling or Document AI for fast, cheap, deterministic extraction. Documents that fail quality heuristics—low OCR confidence scores, detected handwriting, complex multi-column layouts, or embedded charts—are escalated to VLM extraction. You will implement this routing using features like page image entropy, text layer presence detection, table density estimation, and OCR confidence score aggregation, creating a classification layer that optimizes the cost-accuracy trade-off across your entire document corpus rather than applying a one-size-fits-all extraction strategy.
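The routing idea in the last bullet can be sketched as a small pure-Python classifier. The feature names and thresholds below are assumptions for illustration, not values from the chapter; a production router would compute these features from page images and OCR output.

```python
def route_document(features: dict) -> str:
    """Pick an extraction backend from cheap per-document heuristics.

    Hypothetical features: has_text_layer (bool), ocr_confidence (0-1),
    has_handwriting (bool), chart_density (fraction of page area).
    """
    # No extractable text layer and weak OCR: escalate to a VLM.
    if not features.get("has_text_layer", False) and \
            features.get("ocr_confidence", 0.0) < 0.85:
        return "vlm"
    # Handwriting or chart-heavy pages defeat positional heuristics.
    if features.get("has_handwriting", False) or \
            features.get("chart_density", 0.0) > 0.30:
        return "vlm"
    # Common case: fast, cheap, deterministic OCR/parser path.
    return "ocr"
```

A classifier like this sits in front of the extractors, so the expensive, probabilistic VLM path is paid for only on the pages that actually need it.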
- Use Google Document AI for managed OCR and layout parsing
  - Use Google Document AI for managed OCR and layout parsing at scale — Google Document AI provides a fully managed, enterprise-grade document processing service that eliminates the operational burden of maintaining OCR infrastructure while offering specialized processors for different document types—invoices, receipts, contracts, W-2 forms, and general-purpose documents. This goal focuses on integrating Document AI into your ingestion pipeline as a production-grade extraction backend, understanding its processor architecture, batch processing capabilities, and the Layout Parser processor that goes beyond basic OCR to produce hierarchical document structure with paragraph grouping, table detection, and reading order inference. Unlike self-hosted solutions, Document AI scales automatically, handles concurrent requests through its batch API, and provides processor versioning so you can pin your pipeline to a specific model version while evaluating newer versions in parallel—a critical capability for regulated environments where extraction behavior must be reproducible. You will learn to use the Document AI Python client library (`google-cloud-documentai`) to create processors, submit documents for synchronous and asynchronous processing, and parse the resulting `Document` protobuf into your application's domain model. At the staff engineering level, the emphasis is on operational concerns: understanding Document AI's pricing model (per page, with different rates for different processor types), managing processor quotas, implementing retry logic with exponential backoff for transient failures, and monitoring extraction quality over time as Google updates its underlying models. You will also explore the Human-in-the-Loop (HITL) capabilities that allow you to route low-confidence extractions to human reviewers, creating a feedback loop that improves extraction accuracy for your specific document types.
  - Configure and deploy Document AI processors programmatically using the `google-cloud-documentai` client library, including creating processor instances via the `DocumentProcessorServiceClient`, selecting the appropriate processor type (OCR, Form Parser, Layout Parser, or specialized processors like Invoice Parser), and managing processor versions. You will learn why the Layout Parser processor is particularly valuable for ingestion pipelines—it produces a hierarchical document representation with detected blocks, paragraphs, lines, and tokens organized into a layout tree, along with table structures and key-value pairs, giving you richer structural information than the basic OCR processor while remaining format-agnostic across PDF, TIFF, GIF, and image inputs.
  - Implement batch processing with Document AI's `batch_process_documents()` method for high-volume ingestion, where documents are read from and results written to GCS buckets. You will learn to construct `BatchProcessRequest` objects with `GcsDocuments` input configurations, set `DocumentOutputConfig` to control whether results are written as JSON or protobuf, and monitor long-running operations using the `operation.result()` polling pattern or callback-based approaches. Batch processing is essential for any pipeline processing more than a few hundred documents per day, because synchronous processing is rate-limited and incurs higher per-page latency due to request overhead, while batch processing amortizes that overhead and benefits from Google's internal parallelization.
  - Parse Document AI's `Document` protobuf response into application-level data structures by navigating its `pages`, `entities`, `tables`, and `text_anchors` fields. The Document AI response format is powerful but complex—text content is stored in a single `document.text` string, and all structural elements reference back to this string via `text_anchor.text_segments` with `start_index` and `end_index` offsets. You will learn to write extraction utilities that dereference these anchors to retrieve actual text content, iterate over `page.tables` to reconstruct table cell content with row and column spans, and extract form field key-value pairs from `page.form_fields`. This parsing logic is the bridge between Document AI's generic output and your pipeline's unified document model, and getting it right is essential for maintaining extraction fidelity.
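The `start_index`/`end_index` offset scheme described above can be demonstrated without the client library. In this sketch plain dicts stand in for the protobuf's `text_anchor.text_segments`, and the sample text and anchors are invented for illustration; real code would read the same fields off the `Document` response.

```python
def anchor_text(document_text: str, text_anchor: dict) -> str:
    """Dereference a text anchor: join the slices of document.text that
    its segments point at. start_index defaults to 0 when omitted,
    mirroring proto3's default-value behavior."""
    return "".join(
        document_text[seg.get("start_index", 0):seg["end_index"]]
        for seg in text_anchor.get("text_segments", [])
    )

# Illustrative data: all structure references back into one text string.
doc_text = "Invoice #42\nTotal: $1,300.00\n"
heading = {"text_segments": [{"end_index": 11}]}  # start_index omitted -> 0
total = {"text_segments": [{"start_index": 12, "end_index": 28}]}
```

The same dereferencing utility serves every structural element (paragraphs, table cells, form fields), since they all anchor into the one `document.text` string rather than carrying their own copies of the content.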
- Design a unified document model normalizing all extraction outputs
  - Design a unified document model normalizing output across all formats and extraction methods — Production document ingestion pipelines rarely use a single extraction backend. Different document types, quality levels, and cost constraints drive the use of multiple extractors—Docling for local processing, Document AI for managed scale, VLMs for complex layouts—and each produces output in its own format with its own structural conventions. This goal addresses the critical architectural challenge of normalizing these heterogeneous outputs into a single, well-defined document model that downstream systems (chunking, embedding, search, RAG) can consume without knowing or caring which extractor produced the data. You will design a `UnifiedDocument` data model using Pydantic that captures the superset of structural information across all extractors: document metadata (source URI, format, page count, extraction timestamp, extractor identity), a hierarchical content tree (sections, paragraphs, tables, figures, lists), provenance tracking (which extractor produced each element, with confidence scores and bounding boxes), and lineage information (the chain of transformations from raw source to final structured output). This is a data modeling challenge that senior engineers encounter repeatedly across domains—it is the same pattern as designing a unified event schema for heterogeneous data sources or a canonical data model for an enterprise integration platform. The key principles are: capture everything that any downstream consumer might need, make the extraction source explicit rather than implicit, ensure round-trip fidelity so you can trace any piece of extracted text back to its source location, and version the schema so you can evolve it without breaking consumers.
  - Define a Pydantic-based document schema with `UnifiedDocument`, `DocumentPage`, `ContentBlock`, and `TableBlock` models that represent the full structural hierarchy of an extracted document. You will learn to use Pydantic's discriminated unions via a `block_type` literal field to handle the polymorphic nature of document content—where a content block might be a paragraph, heading, table, figure, or list item—while maintaining strict type safety and JSON serialization compatibility. The schema must handle both flat documents (single continuous text) and deeply nested structures (sections containing subsections containing tables containing nested lists), using a recursive `children` field pattern that mirrors the actual structure of complex documents without imposing an arbitrary depth limit.
  - Implement adapter classes that transform extractor-specific outputs into the unified model with a consistent interface: `DoclingAdapter`, `DocumentAIAdapter`, and `VLMAdapter`, each implementing a `to_unified_document()` method that accepts the raw extractor output and returns a `UnifiedDocument` instance. You will learn that the adapter pattern is not merely a mechanical transformation—each adapter must make normalization decisions. Docling produces heading levels (h1-h6) while Document AI produces block types with font-size metadata; your adapter must map these to a consistent heading level scheme. VLM output may include markdown formatting that must be parsed into structural elements. Table representations differ radically: Docling produces `TableData` objects with cell grids, Document AI uses `Table` protobufs with `HeaderRow` and `BodyRow` structures, and VLMs may produce Markdown tables or JSON arrays. Each adapter must normalize these into your canonical `TableBlock` representation while preserving as much fidelity as possible.
  - Add provenance and confidence metadata to every content block so downstream systems can make quality-aware decisions. Each `ContentBlock` in your unified model carries a `provenance` field containing the extractor name and version, the source page number, bounding box coordinates (when available), OCR confidence score (when available), and a `quality_flags` list that records any issues detected during extraction—such as low confidence, possible hallucination (for VLM extractors), truncated content, or unresolved unicode characters. This provenance metadata enables critical production capabilities: quality dashboards that track extraction accuracy across document types, automatic re-extraction routing where low-confidence blocks are sent to a higher-quality (but more expensive) extractor, and audit trails for compliance requirements in financial services, healthcare, and legal domains where you must demonstrate that extracted content faithfully represents the source document.
  - Version the unified schema using semantic versioning embedded in the document payload with a `schema_version` field and a migration strategy for handling documents extracted under older schema versions. In a production pipeline that processes millions of documents over months or years, schema evolution is inevitable—you will add new block types, extend metadata fields, or change normalization rules. You will learn to implement forward-compatible schema changes using Pydantic's `model_validator` with a version-aware deserialization path, so that documents extracted under schema v1.0 can still be loaded and optionally up-converted when your pipeline is running schema v1.2, without requiring a costly reprocessing of your entire document corpus.
- Store extracted documents in GCS with PostgreSQL metadata tracking and lineage
  - Store extracted documents in GCS with PostgreSQL metadata tracking and lineage — The final goal addresses the storage and metadata layer that makes your ingestion pipeline production-ready: persisting extracted document content to Google Cloud Storage as versioned JSON blobs, tracking document metadata and processing status in PostgreSQL, and maintaining lineage records that connect every extracted document back to its source file, extraction configuration, and processing history. This is the engineering discipline that separates a prototype ingestion script from a production data pipeline. GCS provides durable, cost-effective object storage with built-in versioning, lifecycle management, and fine-grained IAM access control—ideal for storing extracted document payloads that may range from kilobytes (a simple one-page form) to hundreds of megabytes (a thousand-page technical manual with embedded images). PostgreSQL serves as the metadata catalog, enabling fast queries across your document corpus: find all documents extracted from a specific source, retrieve all documents processed by a particular extractor version, identify documents that need re-extraction after an extractor upgrade, or list all documents with extraction confidence below a threshold. The lineage component tracks the provenance chain: source document URI → extraction job ID → extractor configuration (including model version and parameters) → unified document GCS path → any downstream derived artifacts (chunks, embeddings, search index entries). This lineage is not merely a nice-to-have audit feature—it is the mechanism that enables you to perform targeted re-extraction when you upgrade an extractor, roll back to previous extraction results when a new extractor version introduces regressions, and satisfy data governance requirements that mandate traceability from any piece of information back to its authoritative source. You will implement this storage layer using `google-cloud-storage` for GCS operations and SQLAlchemy for PostgreSQL interactions, with careful attention to consistency guarantees: the GCS write must succeed before the PostgreSQL metadata record is committed, and the pipeline must handle partial failures gracefully using idempotent writes and transactional metadata updates.
  - Implement a `DocumentStore` class that writes `UnifiedDocument` payloads to GCS with a deterministic path schema following the pattern `gs://{bucket}/documents/{source_type}/{date}/{document_id}/v{version}.json`, where each component serves a specific purpose: `source_type` enables prefix-based lifecycle policies (e.g., retain invoices for 7 years, ephemeral web scrapes for 90 days), `date` enables time-based partitioning for efficient listing and cleanup, `document_id` is a deterministic hash of the source URI for deduplication, and `version` tracks re-extractions of the same source document. You will learn to configure GCS object metadata headers to store quick-lookup fields (extractor name, page count, extraction timestamp) directly on the object, enabling metadata queries without downloading the full payload, and to set appropriate storage classes (Standard for recent documents, Nearline or Coldline for archival) using object lifecycle rules tied to the `date` prefix.
  - Design a PostgreSQL metadata schema with `documents`, `extraction_jobs`, and `lineage_edges` tables that form a complete tracking system for your document corpus. The `documents` table stores document-level metadata (source URI, document type, page count, current extraction version, quality score, GCS path) with indexes on frequently queried columns. The `extraction_jobs` table records every extraction attempt with its configuration (extractor type, model version, pipeline options), status (pending, running, completed, failed), timing information, and error details for failed jobs. The `lineage_edges` table implements a directed acyclic graph connecting source documents to extracted documents to derived artifacts, enabling both forward queries ("what was produced from this source?") and backward queries ("where did this extracted content come from?"). You will implement this schema using SQLAlchemy ORM models with appropriate constraints, indexes, and relationship definitions that enforce referential integrity while supporting the high-throughput insert patterns typical of batch ingestion pipelines.
  - Build an idempotent ingestion transaction that coordinates GCS writes with PostgreSQL metadata updates, ensuring that your pipeline can safely retry failed ingestions without creating duplicate or inconsistent records. The transaction flow is: generate a deterministic document ID from the source URI, check PostgreSQL for an existing record with the same source URI and extraction configuration, write the `UnifiedDocument` JSON to GCS using `blob.upload_from_string()` with an `if_generation_match` precondition for conditional writes, and then commit the PostgreSQL metadata record with the GCS path and extraction job reference. If the GCS write fails, no metadata record is created. If the PostgreSQL commit fails after a successful GCS write, the next retry will detect the existing GCS object and skip the upload, then attempt the metadata commit again. You will learn why this ordering matters—GCS writes are naturally idempotent (re-uploading the same content to the same path is a no-op), while PostgreSQL inserts require explicit conflict handling via `ON CONFLICT` clauses or application-level existence checks.
  - Implement lineage tracking that connects source documents through extraction to downstream artifacts using the `lineage_edges` table and a `LineageTracker` utility class. Every significant transformation in your pipeline—uploading a source document to GCS, running extraction, producing a unified document, generating chunks, creating embeddings—records a lineage edge connecting the input artifact to the output artifact with metadata about the transformation (tool name, version, parameters, timestamp). You will learn to query this lineage graph to answer operational questions that arise daily in production pipelines: "Why does this RAG response contain incorrect information?" traces back through the embedding, chunk, extracted document, and source file to identify whether the error originated in the source document or was introduced by extraction. "Which documents need re-extraction after upgrading Docling from v1.x to v2.x?" queries for all extraction edges using the old Docling version. This lineage capability transforms your ingestion pipeline from a black box into an observable, debuggable, and auditable system that senior engineers can operate with confidence in production.
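The deterministic path schema from the `DocumentStore` bullet can be sketched as a pair of pure functions. The bucket and source-type values below are illustrative, as is the 16-hex-character ID length; the point is that hashing the source URI makes re-ingestion of the same source land on the same prefix, with the `v{version}` component distinguishing re-extractions.

```python
import hashlib
from datetime import date

def document_id(source_uri: str) -> str:
    """Deterministic ID from the source URI: the same source always maps
    to the same ID, which is what makes retries and re-ingestion idempotent."""
    return hashlib.sha256(source_uri.encode("utf-8")).hexdigest()[:16]

def gcs_path(bucket: str, source_type: str, day: date,
             source_uri: str, version: int) -> str:
    """Build gs://{bucket}/documents/{source_type}/{date}/{document_id}/v{version}.json."""
    return (f"gs://{bucket}/documents/{source_type}/{day.isoformat()}/"
            f"{document_id(source_uri)}/v{version}.json")
```

A re-extraction of the same source bumps only the version component, so earlier payloads stay addressable while the PostgreSQL `documents` row points at the current one.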