Design a unified document model normalizing all extraction outputs
Create a format-agnostic document schema with typed content blocks and extraction method metadata. Build adapters for Docling, VLM, and Document AI outputs.
Introduction
When you build an ingestion pipeline that pulls from Docling, Document AI, and a vision-language model, each backend returns content in its own shape — bounding boxes here, page hierarchies there, freeform JSON over there — and every downstream consumer ends up branching on if extractor == "docling": … until the codebase fossilizes around extractor names. Teams that skip a unified document model spend the next quarter rewriting chunkers, embedders, and search indexers every time a new extractor ships. By the end you'll be able to design a canonical document schema with discriminated-union content blocks, provenance metadata, and a stable adapter boundary so downstream systems consume one shape regardless of which extractor produced it.
Key Terminology
- Unified document model: a single canonical schema (here, UnifiedDocument) that every extractor adapter targets, so downstream chunkers, embedders, and indexers consume one shape instead of branching on extractor name.
- Discriminated union: a polymorphic content-block representation where each variant carries a literal block_type tag (e.g. Literal["table"] on TableBlock) that Pydantic uses to deserialize mixed lists of blocks without ambiguity.
- Provenance metadata: per-block tracking of which extractor produced the block (name, version, page, bounding box, confidence, quality_flags), kept on the block itself so quality-aware processing can filter or re-route downstream.
- Extraction adapter: an isolated class (e.g. DoclingAdapter) whose only job is to translate one extractor's native output into UnifiedDocument, forming the stable boundary that decouples ingestion from extractor choice.
Concepts
The lesson teaches three ideas that fit together as one design.
One canonical schema, many adapters. The UnifiedDocument model is the contract every downstream consumer depends on; DoclingAdapter and any future DocumentAIAdapter / VLMAdapter exist only to map their backend's native output into that contract. Adding a new extractor never changes the schema or any consumer — it adds one adapter class and nothing else.
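As a rough sketch of that boundary, the adapter contract can be written down as a small Protocol. Everything named here is hypothetical and exists only for illustration: ExtractionAdapter, LineSplitAdapter, ingest, and the dataclass stand-in for UnifiedDocument (used so the snippet runs without Pydantic).

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class UnifiedDocument:
    """Dependency-free stand-in for the lesson's Pydantic model (assumption)."""
    source_uri: str
    extractor: str
    texts: list[str] = field(default_factory=list)


class ExtractionAdapter(Protocol):
    """Hypothetical contract: native extractor output in, UnifiedDocument out."""

    def to_unified_document(self, result: Any, source_uri: str) -> UnifiedDocument: ...


class LineSplitAdapter:
    """Toy adapter for a made-up extractor whose native output is one string."""
    EXTRACTOR_NAME = "line_split"

    def to_unified_document(self, result: Any, source_uri: str) -> UnifiedDocument:
        lines = [line.strip() for line in result.splitlines() if line.strip()]
        return UnifiedDocument(source_uri, self.EXTRACTOR_NAME, lines)


def ingest(adapter: ExtractionAdapter, result: Any, source_uri: str) -> UnifiedDocument:
    # Pipeline code depends on the Protocol only, never on a concrete extractor.
    return adapter.to_unified_document(result, source_uri)


doc = ingest(LineSplitAdapter(), "Title\n\nBody text\n", "file:///tmp/a.txt")
```

Supporting another backend under this sketch means writing one more class with the same method; ingest and everything after it stay untouched.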
Discriminated unions over free-form JSON. Real documents mix paragraphs, headings, tables, figures, and lists. Encoding each as a Pydantic model with a literal block_type tag lets a list[ContentBlock | TableBlock] deserialize correctly while still giving you typed access to variant-specific fields like TableBlock.headers or ContentBlock.level. Free-form dicts produce the same data but lose the validation and the IDE support.
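A minimal sketch of tag-based deserialization, assuming Pydantic v2. TextBlock is a simplified, hypothetical stand-in for ContentBlock, and the explicit Field(discriminator=...) annotation is one way to declare the union, not necessarily how the lesson's schema declares it.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class TextBlock(BaseModel):
    # Simplified stand-in for ContentBlock (assumption).
    block_type: Literal["paragraph", "heading"]
    text: str


class TableBlock(BaseModel):
    block_type: Literal["table"] = "table"
    headers: list[str]


# The discriminator tells Pydantic to choose the variant by its tag
# instead of trying each model in turn.
Block = Annotated[Union[TextBlock, TableBlock], Field(discriminator="block_type")]

blocks = TypeAdapter(list[Block]).validate_python([
    {"block_type": "heading", "text": "Results"},
    {"block_type": "table", "headers": ["metric", "value"]},
])
```

Each element comes back as the matching model, so blocks[1].headers is typed access rather than a dictionary lookup.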
Provenance is a first-class field, not a sidecar log. Because every block carries its own Provenance (extractor, version, page, bbox, confidence, quality_flags), downstream code can make per-block decisions — drop low-confidence VLM output, re-route flagged blocks for review, attribute citations in a RAG answer — without joining back to an external metadata store. The Code Walkthrough below shows exactly how each of these three ideas appears in the schema and how DoclingAdapter produces them.
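To illustrate a per-block decision, here is a small sketch that routes blocks using only the provenance they carry. The dataclasses are trimmed stand-ins for the lesson's models, and split_for_review and the 0.8 threshold are hypothetical choices, not recommendations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Provenance:
    extractor: str
    confidence: Optional[float] = None
    quality_flags: list[str] = field(default_factory=list)


@dataclass
class Block:
    text: str
    provenance: Provenance


def split_for_review(blocks, min_confidence=0.8):
    """Return (accepted, needs_review) using only block-local provenance."""
    accepted, needs_review = [], []
    for block in blocks:
        prov = block.provenance
        # A missing confidence is treated as "not scored", not as low.
        low = prov.confidence is not None and prov.confidence < min_confidence
        if low or prov.quality_flags:
            needs_review.append(block)
        else:
            accepted.append(block)
    return accepted, needs_review


blocks = [
    Block("Revenue grew.", Provenance("docling")),
    Block("Figure shows a cat.", Provenance("vlm", confidence=0.41)),
    Block("Total: 1,204", Provenance("vlm", confidence=0.93, quality_flags=["truncated"])),
]
accepted, needs_review = split_for_review(blocks)
```

No lookup against an external store is needed: the block itself says who produced it and how much to trust it.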
Code Walkthrough
Defining the Pydantic Schema
Use Pydantic models with discriminated unions to handle the polymorphic nature of document content:
```python
from datetime import datetime
from enum import Enum
from typing import Literal, Optional

from pydantic import BaseModel, Field


class BlockType(str, Enum):
    """Canonical content types the pipeline handles."""
    PARAGRAPH = "paragraph"
    HEADING = "heading"
    TABLE = "table"
    FIGURE = "figure"
    LIST_ITEM = "list_item"
    CODE = "code"


class Provenance(BaseModel):
    """Which extractor produced a block, where, and with what quality signals."""
    extractor: str
    extractor_version: str
    page_number: Optional[int] = None
    bbox: Optional[tuple[float, float, float, float]] = None
    confidence: Optional[float] = None
    quality_flags: list[str] = Field(default_factory=list)


class TableCell(BaseModel):
    text: str
    row_span: int = 1  # values above 1 mark a vertically merged cell
    col_span: int = 1  # values above 1 mark a horizontally merged cell


class TableBlock(BaseModel):
    block_type: Literal["table"] = "table"  # tag identifying this variant
    headers: list[str]
    rows: list[list[TableCell]]
    provenance: Provenance


class ContentBlock(BaseModel):
    block_type: BlockType
    text: str
    level: Optional[int] = None  # set by adapters for headings
    children: list["ContentBlock"] = Field(default_factory=list)
    provenance: Provenance


class DocumentPage(BaseModel):
    page_number: int
    blocks: list[ContentBlock | TableBlock]


class UnifiedDocument(BaseModel):
    document_id: str
    source_uri: str
    schema_version: str = "1.0"
    format: str
    page_count: int
    extraction_timestamp: datetime
    pages: list[DocumentPage]
    metadata: dict = Field(default_factory=dict)
```

- BlockType: The enum defines all content types your pipeline handles. Using an enum rather than free-form strings catches typos at validation time and enables exhaustive pattern matching in downstream processors.
- Provenance: This model tracks which extractor produced each block. The quality_flags list records issues detected during extraction, such as "low_confidence", "possible_hallucination", or "truncated", enabling quality-aware processing downstream.
- TableBlock: The block_type: Literal["table"] tag follows the discriminated union pattern so that Pydantic can deserialize mixed content block lists correctly. Headers are plain strings; rows use TableCell objects that support merged cells via row_span and col_span.
- ContentBlock: The children field supports recursive nesting, mirroring the hierarchical structure of real documents where sections contain subsections, which in turn contain paragraphs.
- DocumentPage and UnifiedDocument: The top-level UnifiedDocument captures document-level metadata, a list of pages each containing blocks, and a schema_version for evolution.
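The enum's two benefits, typo detection and exhaustive handling, can be sketched with the standard library alone. RENDERERS and render are hypothetical names, and the rendering rules are placeholders.

```python
from enum import Enum


class BlockType(str, Enum):
    PARAGRAPH = "paragraph"
    HEADING = "heading"
    TABLE = "table"
    FIGURE = "figure"
    LIST_ITEM = "list_item"
    CODE = "code"


# Hypothetical rendering rules, one per block type.
RENDERERS = {
    BlockType.PARAGRAPH: lambda text: text,
    BlockType.HEADING: lambda text: f"# {text}",
    BlockType.TABLE: lambda text: f"[table] {text}",
    BlockType.FIGURE: lambda text: f"[figure] {text}",
    BlockType.LIST_ITEM: lambda text: f"- {text}",
    BlockType.CODE: lambda text: f"    {text}",
}

# Fail at import time if a newly added BlockType has no handler.
missing = set(BlockType) - set(RENDERERS)
assert not missing, f"unhandled block types: {missing}"


def render(block_type: str, text: str) -> str:
    # BlockType(...) raises ValueError for a typo such as "heding".
    return RENDERERS[BlockType(block_type)](text)
```

With free-form strings, a misspelled tag would fall through silently; here it raises as soon as the value is converted to the enum.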
Building Extraction Adapters
Each adapter transforms extractor-specific output into the unified model:
```python
import hashlib
from datetime import datetime, timezone

# Assumes the schema models defined above (BlockType, Provenance, TableCell,
# TableBlock, ContentBlock, DocumentPage, UnifiedDocument) are in scope.


class DoclingAdapter:
    EXTRACTOR_NAME = "docling"
    EXTRACTOR_VERSION = "2.0"

    def to_unified_document(self, result, source_uri: str) -> UnifiedDocument:
        doc = result.document
        pages = []
        current_page_blocks = []
        current_page = 1

        for item, _level in doc.iterate_items():
            # Items without provenance stay on the page currently being built.
            page_num = item.prov[0].page_no if item.prov else current_page

            # On a page change, flush collected blocks, then always advance so
            # later blocks are never filed under a stale page number.
            if page_num != current_page:
                if current_page_blocks:
                    pages.append(DocumentPage(
                        page_number=current_page,
                        blocks=current_page_blocks,
                    ))
                    current_page_blocks = []
                current_page = page_num

            provenance = Provenance(
                extractor=self.EXTRACTOR_NAME,
                extractor_version=self.EXTRACTOR_VERSION,
                page_number=page_num,
                bbox=item.prov[0].bbox.as_tuple() if item.prov else None,
                confidence=None,  # this adapter maps no per-block confidence
            )

            class_name = item.__class__.__name__
            if class_name == "SectionHeaderItem":
                block = ContentBlock(
                    block_type=BlockType.HEADING,
                    text=item.text,
                    level=item.level,
                    provenance=provenance,
                )
            elif class_name == "TableItem":
                df = item.export_to_dataframe()
                block = TableBlock(
                    # Column labels may be integers; headers must be strings.
                    headers=[str(col) for col in df.columns],
                    rows=[
                        [TableCell(text=str(cell)) for cell in row]
                        for _, row in df.iterrows()
                    ],
                    provenance=provenance,
                )
            else:
                # Everything else falls back to a paragraph.
                block = ContentBlock(
                    block_type=BlockType.PARAGRAPH,
                    text=item.text if hasattr(item, "text") else "",
                    provenance=provenance,
                )

            current_page_blocks.append(block)

        # Flush the final page.
        if current_page_blocks:
            pages.append(DocumentPage(
                page_number=current_page,
                blocks=current_page_blocks,
            ))

        return UnifiedDocument(
            document_id=self._generate_id(source_uri),
            source_uri=source_uri,
            format=str(result.input.format),
            page_count=len(doc.pages) if hasattr(doc, "pages") else len(pages),
            extraction_timestamp=datetime.now(timezone.utc),
            pages=pages,
        )

    def _generate_id(self, uri: str) -> str:
        # Same URI in, same ID out.
        return hashlib.sha256(uri.encode()).hexdigest()[:16]
```

- Class constants: Each adapter declares its extractor identity as class constants, ensuring consistent provenance tracking across all documents processed by this adapter.
- Page grouping: Content blocks are grouped by page number. Docling's iterate_items() yields elements in reading order, but page transitions must be detected from provenance metadata.
- Provenance construction: Every block gets a Provenance object with the extractor name, version, page number, and bounding box from Docling's output.
- Type mapping: Docling element types map to unified model block types. SectionHeaderItem becomes a heading with its level preserved. TableItem is exported to a DataFrame and then converted to the canonical TableBlock format with proper TableCell objects. All other elements default to paragraphs.
- Final assembly: The UnifiedDocument is constructed with a deterministic ID derived from the source URI, so re-extracting the same source produces the same document ID.
The adapter pattern decouples your pipeline from any single extraction backend. Adding a new extractor requires writing one new adapter class without modifying any downstream consumer code.
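The two mechanical pieces of the adapter, grouping by page and deriving a deterministic ID, can be exercised in isolation. This is a standalone sketch: group_by_page and document_id are hypothetical helper names, and itertools.groupby is swapped in for the adapter's manual loop.

```python
import hashlib
from itertools import groupby


def group_by_page(tagged_blocks):
    """Group (page_number, block) pairs into consecutive per-page runs.

    Reading order is preserved because groupby only merges adjacent items.
    """
    return [
        (page, [block for _, block in run])
        for page, run in groupby(tagged_blocks, key=lambda pair: pair[0])
    ]


def document_id(source_uri: str) -> str:
    """Derive a 16-character ID from the URI, as the adapter's _generate_id does."""
    return hashlib.sha256(source_uri.encode()).hexdigest()[:16]


pages = group_by_page([(1, "Title"), (1, "Intro"), (2, "Table"), (3, "Notes")])
first_id = document_id("s3://bucket/report.pdf")
second_id = document_id("s3://bucket/report.pdf")
```

Because the ID depends only on the URI, two extractions of the same source collide on purpose, which is what lets a re-run overwrite rather than duplicate.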
Do's and Don'ts
Now that you have worked through the implementation, the practices below separate a durable approach from a fragile one.
Do's
- Tag every content-block variant with a Literal["..."] block_type so Pydantic can deserialize mixed list[ContentBlock | TableBlock] content unambiguously and consumers can dispatch on the tag.
- Make Provenance mandatory on every block (extractor name, version, page number, bbox where available, and quality_flags) so downstream re-ranking, debugging, and citation paths never have to join back to an external metadata store.
- Bump schema_version on UnifiedDocument whenever you add a block type or change a field's meaning, and teach consumers to branch on version (never on extractor).
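Branching on version rather than extractor might look like the sketch below. The schema change it handles is invented purely for illustration (a 2.x schema that renames level to depth), and heading_level, major_version, and SUPPORTED_MAJOR_VERSIONS are hypothetical names. It reads serialized blocks as plain dicts to stay dependency-free.

```python
from typing import Optional

SUPPORTED_MAJOR_VERSIONS = {1, 2}


def major_version(schema_version: str) -> int:
    """Extract the major component from a version string such as "1.0"."""
    return int(schema_version.split(".")[0])


def heading_level(block: dict, schema_version: str) -> Optional[int]:
    """Read a heading's depth across schema versions.

    Hypothetical change for illustration: version 2.x renames `level` to `depth`.
    """
    major = major_version(schema_version)
    if major not in SUPPORTED_MAJOR_VERSIONS:
        # Refuse documents from a schema this consumer was never taught.
        raise ValueError(f"unsupported schema_version: {schema_version}")
    return block.get("depth") if major >= 2 else block.get("level")
```

Rejecting unknown major versions outright is a deliberate choice in this sketch: a loud failure is easier to diagnose than a consumer silently reading the wrong field.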
Don'ts
- Don't let if extractor == "docling" branching leak past the adapter boundary. Once a UnifiedDocument exists, no chunker, embedder, or indexer should know which backend produced it.
- Don't drop unknown extractor element types into free-form dict blobs; map them to the closest canonical BlockType (default to PARAGRAPH) so the discriminated union stays exhaustive and validation stays meaningful.
- Don't make Provenance optional or store it in a sidecar log. Block-local provenance is the whole reason quality-aware processing works after normalization.
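One possible shape for the "closest canonical type" rule is a lookup table with a flagged fallback. SectionHeaderItem and TableItem come from the adapter above; the other element class names are assumptions about the extractor's output, and classify and the unmapped_element_type flag are hypothetical.

```python
from enum import Enum


class BlockType(str, Enum):
    PARAGRAPH = "paragraph"
    HEADING = "heading"
    TABLE = "table"
    FIGURE = "figure"
    LIST_ITEM = "list_item"
    CODE = "code"


ELEMENT_TO_BLOCK_TYPE = {
    "SectionHeaderItem": BlockType.HEADING,
    "TableItem": BlockType.TABLE,
    "TextItem": BlockType.PARAGRAPH,   # assumed class name
    "ListItem": BlockType.LIST_ITEM,   # assumed class name
    "CodeItem": BlockType.CODE,        # assumed class name
    "PictureItem": BlockType.FIGURE,   # assumed class name
}


def classify(class_name: str) -> tuple[BlockType, list[str]]:
    """Return the canonical block type plus any quality flags to record."""
    block_type = ELEMENT_TO_BLOCK_TYPE.get(class_name)
    if block_type is None:
        # Fall back to PARAGRAPH, but leave a trace in quality_flags.
        return BlockType.PARAGRAPH, [f"unmapped_element_type:{class_name}"]
    return block_type, []
```

Recording the fallback as a quality flag keeps the block inside the typed schema while still letting a reviewer find every element the adapter did not recognize.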