Prerequisites
This chapter assumes you have completed the course onboarding and environment setup. You should have a working Python 3.11+ environment with pip and venv configured, along with an active Google Cloud Platform project with billing enabled. Familiarity with basic PDF structure (pages, fonts, bounding boxes) and REST API consumption patterns is expected. You should also have a GCS bucket provisioned and a PostgreSQL 15+ instance running locally or via Cloud SQL, as both are used throughout the hands-on exercises in this chapter.
Learning Goals
- Extract documents using Docling's unified multi-format parser
  - Extract text and structure from PDF, HTML, and DOCX using Docling's unified parser — Engineers will gain hands-on proficiency with IBM's Docling library, a unified document parsing framework that abstracts away format-specific extraction logic behind a single pipeline interface. Rather than maintaining separate parsers for each document type, Docling provides a `DocumentConverter` class that accepts PDF, HTML, DOCX, PPTX, and even image files through one consistent API surface. This goal focuses on understanding how Docling's internal pipeline stages—input parsing, layout analysis, table structure recognition, and output serialization—compose to produce a structured `DoclingDocument` object. You will learn how Docling's `PdfPipelineOptions` and `PipelineOptions` configuration classes control every stage of extraction, from OCR engine selection (EasyOCR, Tesseract, or a built-in hybrid) through table detection using transformer-based models like TableFormer. At the senior engineering level, the critical skill is not merely calling `converter.convert()` but understanding the performance implications of each pipeline stage, configuring batch processing for throughput, and handling the inevitable edge cases—scanned PDFs with mixed digital and image-based pages, DOCX files with deeply nested tables, and HTML pages with non-semantic markup that breaks structural extraction. You will also explore how Docling's output representation maps document hierarchy (titles, sections, paragraphs, tables, figures) into a tree structure that preserves reading order, enabling downstream chunking strategies that respect semantic boundaries rather than splitting mid-paragraph or mid-table. This is foundational to everything that follows in the chapter because the quality of your ingestion pipeline's structured output directly determines the quality of retrieval in any RAG system built on top of it.
  - Configure Docling's `DocumentConverter` with format-specific pipeline options, including `PdfPipelineOptions` with `do_ocr=True` for scanned documents, `do_table_structure=True` for table extraction via TableFormer, and `images_scale` for controlling resolution during rasterization. You will learn why the default configuration often produces suboptimal results on real-world enterprise documents—where mixed-mode PDFs contain both born-digital text layers and scanned appendices—and how to configure the `ocr_options` parameter with an `EasyOcrOptions` or `TesseractOcrOptions` instance to target only pages that lack an extractable text layer, avoiding the performance penalty of running OCR on pages that do not need it.
  - Process documents in batch using Docling's `convert_all()` method, streaming results to handle large document collections without excessive memory consumption. Production ingestion pipelines rarely process one document at a time; you will learn how Docling's internal batching works, how to set `max_num_pages` and `max_file_size` limits to prevent runaway processing on malformed files, and how to implement error handling around `ConversionResult` objects where individual documents may fail while the batch continues. You will also understand the difference between Docling's synchronous and asynchronous execution paths and when each is appropriate for pipeline architectures running on Cloud Run, Kubernetes Jobs, or local development environments.
  - Navigate and transform the `DoclingDocument` output structure, including its hierarchical representation of document elements—`TextItem`, `TableItem`, `PictureItem`, and `SectionHeaderItem`—to extract specific content types, flatten the tree for search indexing, or produce Markdown output using the built-in `doc.export_to_markdown()` serializer. You will learn how Docling assigns `prov` (provenance) metadata to each element, tracking the source page number, bounding box coordinates, and confidence scores, which is essential for building audit trails in regulated industries where you must prove that extracted text came from a specific region of a specific page in the source document.
  - Handle edge cases in multi-format extraction where the same logical document exists in PDF, HTML, and DOCX variants with structural differences. Docling's unified parser normalizes these into the same `DoclingDocument` schema, but normalization is not lossless—HTML documents may carry CSS class names that hint at semantic structure (e.g., `class="section-heading"`) that Docling's HTML backend preserves differently than a DOCX heading style or a PDF font-size heuristic. You will learn to write post-processing logic that reconciles these differences, ensuring that your downstream systems receive consistently structured content regardless of the source format.
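As a shape check for the tree-walking ideas above, here is a minimal sketch of flattening a hierarchical document tree into reading-order blocks for indexing. The `Node` class is a hypothetical stand-in used only for illustration; real code would walk a `DoclingDocument`'s items (`TextItem`, `TableItem`, `SectionHeaderItem`) instead.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a parsed document tree (illustration only).
@dataclass
class Node:
    kind: str                  # "section", "paragraph", "table", ...
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def flatten(node: Node, depth: int = 0) -> list[tuple[int, str, str]]:
    """Depth-first walk that preserves reading order, emitting
    (depth, kind, text) triples suitable for a search index."""
    out = []
    if node.text:
        out.append((depth, node.kind, node.text))
    for child in node.children:
        out.extend(flatten(child, depth + 1))
    return out

doc = Node("root", children=[
    Node("section", "Introduction", [
        Node("paragraph", "Docling parses PDF, HTML, and DOCX."),
        Node("table", "format | supported"),
    ]),
    Node("section", "Usage", [Node("paragraph", "Call converter.convert().")]),
])

blocks = flatten(doc)
```

Because the walk is depth-first over `children`, sibling order (and therefore reading order) survives the flattening, which is exactly the property that lets downstream chunking respect semantic boundaries.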
- Process documents with VLM-based understanding using hosted APIs
  - Compare traditional OCR pipelines with VLM-based document understanding using GPT-4o and Gemini vision — This goal moves beyond the mechanics of text extraction into the architectural decision that increasingly defines modern document ingestion: when to use traditional OCR-based pipelines and when to leverage vision-language models that understand document layout, semantics, and content simultaneously. Traditional OCR pipelines—Tesseract, Amazon Textract, Google Document AI—operate in stages: first they detect text regions, then they recognize characters, and finally a separate post-processing step attempts to reconstruct structure (tables, headers, reading order). VLMs like GPT-4o and Gemini 1.5 Pro with vision capabilities collapse these stages into a single inference pass where the model "reads" the document image and produces structured output directly. The trade-offs are profound and non-obvious. Traditional OCR is deterministic, fast, cheap at scale, and produces character-accurate output on clean documents. VLM-based extraction is probabilistic, slower, significantly more expensive per page, and can hallucinate content—but it excels at understanding complex layouts, interpreting handwritten annotations, extracting meaning from charts and diagrams, and handling documents where structure cannot be inferred from text position alone. Senior engineers must understand both approaches deeply enough to make the right architectural choice for each document type in their pipeline, and in many production systems, the answer is a hybrid approach where traditional OCR handles the common cases and VLM extraction is reserved for documents that fail quality checks or require semantic understanding beyond what positional heuristics can provide. You will learn to quantify the cost-accuracy-latency triangle for both approaches and build routing logic that directs documents to the appropriate extraction path.
  - Implement VLM-based document extraction by sending page images to GPT-4o and Gemini vision APIs with carefully engineered prompts that specify the desired output structure. You will learn that VLM document understanding is fundamentally a prompt engineering challenge—the same document image will produce wildly different outputs depending on whether you ask for "all text on this page" versus "extract this as structured JSON with sections, tables, and metadata." You will build extraction prompts that instruct the model to preserve table structure as Markdown or JSON arrays, maintain heading hierarchy, and flag low-confidence regions. Critically, you will learn to handle the token limits that constrain VLM extraction: a single high-resolution page image consumes 1,000-4,000 tokens of context depending on the model and `detail` parameter setting, meaning that a 100-page document requires either batched single-page calls or aggressive resolution reduction, each with its own accuracy implications.
  - Benchmark extraction quality across OCR and VLM approaches using precision, recall, and structural similarity metrics on a representative document corpus. You will learn to build evaluation harnesses that compare extracted text against ground truth using character error rate (CER) and word error rate (WER) for text accuracy, and custom structural metrics that measure whether tables were correctly detected, whether heading hierarchy was preserved, and whether reading order matches the source document. This is where many teams make poor architectural decisions—they test on clean, born-digital PDFs where traditional OCR achieves 99%+ accuracy and conclude that VLMs offer no benefit, without testing on the scanned invoices, handwritten forms, and complex multi-column layouts that actually drive their production error rates.
  - Build hybrid extraction pipelines with intelligent routing that scores each incoming document's complexity and routes it to the appropriate extraction backend. Simple born-digital PDFs with standard layouts go through Docling or Document AI for fast, cheap, deterministic extraction. Documents that fail quality heuristics—low OCR confidence scores, detected handwriting, complex multi-column layouts, or embedded charts—are escalated to VLM extraction. You will implement this routing using features like page image entropy, text layer presence detection, table density estimation, and OCR confidence score aggregation, creating a classification layer that optimizes the cost-accuracy trade-off across your entire document corpus rather than applying a one-size-fits-all extraction strategy.
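The routing idea in the last bullet can be sketched as a small pure-Python classifier. The feature names and thresholds below are assumptions for illustration, not values from the chapter; a production router would compute these features from page images and OCR output.

```python
def route_document(features: dict) -> str:
    """Pick an extraction backend from cheap per-document heuristics.

    Hypothetical features: has_text_layer (bool), ocr_confidence (0-1),
    has_handwriting (bool), chart_density (fraction of page area).
    """
    # No extractable text layer and weak OCR: escalate to a VLM.
    if not features.get("has_text_layer", False) and \
            features.get("ocr_confidence", 0.0) < 0.85:
        return "vlm"
    # Handwriting or chart-heavy pages defeat positional heuristics.
    if features.get("has_handwriting", False) or \
            features.get("chart_density", 0.0) > 0.30:
        return "vlm"
    # Common case: fast, cheap, deterministic OCR/parser path.
    return "ocr"
```

A classifier like this sits in front of the extractors, so the expensive, probabilistic VLM path is paid for only on the pages that actually need it.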
- Use Google Document AI for managed OCR and layout parsing
  - Use Google Document AI for managed OCR and layout parsing at scale — Google Document AI provides a fully managed, enterprise-grade document processing service that eliminates the operational burden of maintaining OCR infrastructure while offering specialized processors for different document types—invoices, receipts, contracts, W-2 forms, and general-purpose documents. This goal focuses on integrating Document AI into your ingestion pipeline as a production-grade extraction backend, understanding its processor architecture, batch processing capabilities, and the Layout Parser processor that goes beyond basic OCR to produce hierarchical document structure with paragraph grouping, table detection, and reading order inference. Unlike self-hosted solutions, Document AI scales automatically, handles concurrent requests through its batch API, and provides processor versioning so you can pin your pipeline to a specific model version while evaluating newer versions in parallel—a critical capability for regulated environments where extraction behavior must be reproducible. You will learn to use the Document AI Python client library (`google-cloud-documentai`) to create processors, submit documents for synchronous and asynchronous processing, and parse the resulting `Document` protobuf into your application's domain model. At the staff engineering level, the emphasis is on operational concerns: understanding Document AI's pricing model (per page, with different rates for different processor types), managing processor quotas, implementing retry logic with exponential backoff for transient failures, and monitoring extraction quality over time as Google updates its underlying models. You will also explore the Human-in-the-Loop (HITL) capabilities that allow you to route low-confidence extractions to human reviewers, creating a feedback loop that improves extraction accuracy for your specific document types.
  - Configure and deploy Document AI processors programmatically using the `google-cloud-documentai` client library, including creating processor instances via the `DocumentProcessorServiceClient`, selecting the appropriate processor type (OCR, Form Parser, Layout Parser, or specialized processors like Invoice Parser), and managing processor versions. You will learn why the Layout Parser processor is particularly valuable for ingestion pipelines—it produces a hierarchical document representation with detected blocks, paragraphs, lines, and tokens organized into a layout tree, along with table structures and key-value pairs, giving you richer structural information than the basic OCR processor while remaining format-agnostic across PDF, TIFF, GIF, and image inputs.
  - Implement batch processing with Document AI's `batch_process_documents()` method for high-volume ingestion, where documents are read from and results written to GCS buckets. You will learn to construct `BatchProcessRequest` objects with `GcsDocuments` input configurations, set `DocumentOutputConfig` to control whether results are written as JSON or protobuf, and monitor long-running operations using the `operation.result()` polling pattern or callback-based approaches. Batch processing is essential for any pipeline processing more than a few hundred documents per day, because synchronous processing is rate-limited and incurs higher per-page latency due to request overhead, while batch processing amortizes that overhead and benefits from Google's internal parallelization.
  - Parse Document AI's `Document` protobuf response into application-level data structures by navigating its `pages`, `entities`, `tables`, and `text_anchors` fields. The Document AI response format is powerful but complex—text content is stored in a single `document.text` string, and all structural elements reference back to this string via `text_anchor.text_segments` with `start_index` and `end_index` offsets. You will learn to write extraction utilities that dereference these anchors to retrieve actual text content, iterate over `page.tables` to reconstruct table cell content with row and column spans, and extract form field key-value pairs from `page.form_fields`. This parsing logic is the bridge between Document AI's generic output and your pipeline's unified document model, and getting it right is essential for maintaining extraction fidelity.
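The `start_index`/`end_index` offset scheme described above can be demonstrated without the client library. In this sketch plain dicts stand in for the protobuf's `text_anchor.text_segments`, and the sample text and anchors are invented for illustration; real code would read the same fields off the `Document` response.

```python
def anchor_text(document_text: str, text_anchor: dict) -> str:
    """Dereference a text anchor: join the slices of document.text that
    its segments point at. start_index defaults to 0 when omitted,
    mirroring proto3's default-value behavior."""
    return "".join(
        document_text[seg.get("start_index", 0):seg["end_index"]]
        for seg in text_anchor.get("text_segments", [])
    )

# Illustrative data: all structure references back into one text string.
doc_text = "Invoice #42\nTotal: $1,300.00\n"
heading = {"text_segments": [{"end_index": 11}]}  # start_index omitted -> 0
total = {"text_segments": [{"start_index": 12, "end_index": 28}]}
```

The same dereferencing utility serves every structural element (paragraphs, table cells, form fields), since they all anchor into the one `document.text` string rather than carrying their own copies of the content.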
- Design a unified document model normalizing all extraction outputs
  - Design a unified document model normalizing output across all formats and extraction methods — Production document ingestion pipelines rarely use a single extraction backend. Different document types, quality levels, and cost constraints drive the use of multiple extractors—Docling for local processing, Document AI for managed scale, VLMs for complex layouts—and each produces output in its own format with its own structural conventions. This goal addresses the critical architectural challenge of normalizing these heterogeneous outputs into a single, well-defined document model that downstream systems (chunking, embedding, search, RAG) can consume without knowing or caring which extractor produced the data. You will design a `UnifiedDocument` data model using Pydantic that captures the superset of structural information across all extractors: document metadata (source URI, format, page count, extraction timestamp, extractor identity), a hierarchical content tree (sections, paragraphs, tables, figures, lists), provenance tracking (which extractor produced each element, with confidence scores and bounding boxes), and lineage information (the chain of transformations from raw source to final structured output). This is a data modeling challenge that senior engineers encounter repeatedly across domains—it is the same pattern as designing a unified event schema for heterogeneous data sources or a canonical data model for an enterprise integration platform. The key principles are: capture everything that any downstream consumer might need, make the extraction source explicit rather than implicit, ensure round-trip fidelity so you can trace any piece of extracted text back to its source location, and version the schema so you can evolve it without breaking consumers.
  - Define a Pydantic-based document schema with `UnifiedDocument`, `DocumentPage`, `ContentBlock`, and `TableBlock` models that represent the full structural hierarchy of an extracted document. You will learn to use Pydantic's discriminated unions via a `block_type` literal field to handle the polymorphic nature of document content—where a content block might be a paragraph, heading, table, figure, or list item—while maintaining strict type safety and JSON serialization compatibility. The schema must handle both flat documents (single continuous text) and deeply nested structures (sections containing subsections containing tables containing nested lists), using a recursive `children` field pattern that mirrors the actual structure of complex documents without imposing an arbitrary depth limit.
  - Implement adapter classes that transform extractor-specific outputs into the unified model with a consistent interface: `DoclingAdapter`, `DocumentAIAdapter`, and `VLMAdapter`, each implementing a `to_unified_document()` method that accepts the raw extractor output and returns a `UnifiedDocument` instance. You will learn that the adapter pattern is not merely a mechanical transformation—each adapter must make normalization decisions. Docling produces heading levels (h1-h6) while Document AI produces block types with font-size metadata; your adapter must map these to a consistent heading level scheme. VLM output may include markdown formatting that must be parsed into structural elements. Table representations differ radically: Docling produces `TableData` objects with cell grids, Document AI uses `Table` protobufs with `HeaderRow` and `BodyRow` structures, and VLMs may produce Markdown tables or JSON arrays. Each adapter must normalize these into your canonical `TableBlock` representation while preserving as much fidelity as possible.
  - Add provenance and confidence metadata to every content block so downstream systems can make quality-aware decisions. Each `ContentBlock` in your unified model carries a `provenance` field containing the extractor name and version, the source page number, bounding box coordinates (when available), OCR confidence score (when available), and a `quality_flags` list that records any issues detected during extraction—such as low confidence, possible hallucination (for VLM extractors), truncated content, or unresolved unicode characters. This provenance metadata enables critical production capabilities: quality dashboards that track extraction accuracy across document types, automatic re-extraction routing where low-confidence blocks are sent to a higher-quality (but more expensive) extractor, and audit trails for compliance requirements in financial services, healthcare, and legal domains where you must demonstrate that extracted content faithfully represents the source document.
  - Version the unified schema using semantic versioning embedded in the document payload with a `schema_version` field and a migration strategy for handling documents extracted under older schema versions. In a production pipeline that processes millions of documents over months or years, schema evolution is inevitable—you will add new block types, extend metadata fields, or change normalization rules. You will learn to implement forward-compatible schema changes using Pydantic's `model_validator` with a version-aware deserialization path, so that documents extracted under schema v1.0 can still be loaded and optionally up-converted when your pipeline is running schema v1.2, without requiring a costly reprocessing of your entire document corpus.
- Store extracted documents in GCS with PostgreSQL metadata tracking and lineage
  - Store extracted documents in GCS with PostgreSQL metadata tracking and lineage — The final goal addresses the storage and metadata layer that makes your ingestion pipeline production-ready: persisting extracted document content to Google Cloud Storage as versioned JSON blobs, tracking document metadata and processing status in PostgreSQL, and maintaining lineage records that connect every extracted document back to its source file, extraction configuration, and processing history. This is the engineering discipline that separates a prototype ingestion script from a production data pipeline. GCS provides durable, cost-effective object storage with built-in versioning, lifecycle management, and fine-grained IAM access control—ideal for storing extracted document payloads that may range from kilobytes (a simple one-page form) to hundreds of megabytes (a thousand-page technical manual with embedded images). PostgreSQL serves as the metadata catalog, enabling fast queries across your document corpus: find all documents extracted from a specific source, retrieve all documents processed by a particular extractor version, identify documents that need re-extraction after an extractor upgrade, or list all documents with extraction confidence below a threshold. The lineage component tracks the provenance chain: source document URI → extraction job ID → extractor configuration (including model version and parameters) → unified document GCS path → any downstream derived artifacts (chunks, embeddings, search index entries). This lineage is not merely a nice-to-have audit feature—it is the mechanism that enables you to perform targeted re-extraction when you upgrade an extractor, roll back to previous extraction results when a new extractor version introduces regressions, and satisfy data governance requirements that mandate traceability from any piece of information back to its authoritative source. You will implement this storage layer using `google-cloud-storage` for GCS operations and SQLAlchemy for PostgreSQL interactions, with careful attention to consistency guarantees: the GCS write must succeed before the PostgreSQL metadata record is committed, and the pipeline must handle partial failures gracefully using idempotent writes and transactional metadata updates.
  - Implement a `DocumentStore` class that writes `UnifiedDocument` payloads to GCS with a deterministic path schema following the pattern `gs://{bucket}/documents/{source_type}/{date}/{document_id}/v{version}.json`, where each component serves a specific purpose: `source_type` enables prefix-based lifecycle policies (e.g., retain invoices for 7 years, ephemeral web scrapes for 90 days), `date` enables time-based partitioning for efficient listing and cleanup, `document_id` is a deterministic hash of the source URI for deduplication, and `version` tracks re-extractions of the same source document. You will learn to configure GCS object metadata headers to store quick-lookup fields (extractor name, page count, extraction timestamp) directly on the object, enabling metadata queries without downloading the full payload, and to set appropriate storage classes (Standard for recent documents, Nearline or Coldline for archival) using object lifecycle rules tied to the `date` prefix.
  - Design a PostgreSQL metadata schema with `documents`, `extraction_jobs`, and `lineage_edges` tables that form a complete tracking system for your document corpus. The `documents` table stores document-level metadata (source URI, document type, page count, current extraction version, quality score, GCS path) with indexes on frequently queried columns. The `extraction_jobs` table records every extraction attempt with its configuration (extractor type, model version, pipeline options), status (pending, running, completed, failed), timing information, and error details for failed jobs. The `lineage_edges` table implements a directed acyclic graph connecting source documents to extracted documents to derived artifacts, enabling both forward queries ("what was produced from this source?") and backward queries ("where did this extracted content come from?"). You will implement this schema using SQLAlchemy ORM models with appropriate constraints, indexes, and relationship definitions that enforce referential integrity while supporting the high-throughput insert patterns typical of batch ingestion pipelines.
  - Build an idempotent ingestion transaction that coordinates GCS writes with PostgreSQL metadata updates, ensuring that your pipeline can safely retry failed ingestions without creating duplicate or inconsistent records. The transaction flow is: generate a deterministic document ID from the source URI, check PostgreSQL for an existing record with the same source URI and extraction configuration, write the `UnifiedDocument` JSON to GCS using `blob.upload_from_string()` with an `if_generation_match` precondition for conditional writes, and then commit the PostgreSQL metadata record with the GCS path and extraction job reference. If the GCS write fails, no metadata record is created. If the PostgreSQL commit fails after a successful GCS write, the next retry will detect the existing GCS object and skip the upload, then attempt the metadata commit again. You will learn why this ordering matters—GCS writes are naturally idempotent (re-uploading the same content to the same path is a no-op), while PostgreSQL inserts require explicit conflict handling via `ON CONFLICT` clauses or application-level existence checks.
  - Implement lineage tracking that connects source documents through extraction to downstream artifacts using the `lineage_edges` table and a `LineageTracker` utility class. Every significant transformation in your pipeline—uploading a source document to GCS, running extraction, producing a unified document, generating chunks, creating embeddings—records a lineage edge connecting the input artifact to the output artifact with metadata about the transformation (tool name, version, parameters, timestamp). You will learn to query this lineage graph to answer operational questions that arise daily in production pipelines: "Why does this RAG response contain incorrect information?" traces back through the embedding, chunk, extracted document, and source file to identify whether the error originated in the source document or was introduced by extraction. "Which documents need re-extraction after upgrading Docling from v1.x to v2.x?" queries for all extraction edges using the old Docling version. This lineage capability transforms your ingestion pipeline from a black box into an observable, debuggable, and auditable system that senior engineers can operate with confidence in production.
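The deterministic path schema from the `DocumentStore` bullet can be sketched as a pair of pure functions. The bucket and source-type values below are illustrative, as is the 16-hex-character ID length; the point is that hashing the source URI makes re-ingestion of the same source land on the same prefix, with the `v{version}` component distinguishing re-extractions.

```python
import hashlib
from datetime import date

def document_id(source_uri: str) -> str:
    """Deterministic ID from the source URI: the same source always maps
    to the same ID, which is what makes retries and re-ingestion idempotent."""
    return hashlib.sha256(source_uri.encode("utf-8")).hexdigest()[:16]

def gcs_path(bucket: str, source_type: str, day: date,
             source_uri: str, version: int) -> str:
    """Build gs://{bucket}/documents/{source_type}/{date}/{document_id}/v{version}.json."""
    return (f"gs://{bucket}/documents/{source_type}/{day.isoformat()}/"
            f"{document_id(source_uri)}/v{version}.json")
```

A re-extraction of the same source bumps only the version component, so earlier payloads stay addressable while the PostgreSQL `documents` row points at the current one.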