
Design a unified document model normalizing all extraction outputs

Create a format-agnostic document schema with typed content blocks and extraction method metadata. Build adapters for Docling, VLM, and Document AI outputs.


Introduction

When you build an ingestion pipeline that pulls from Docling, Document AI, and a vision-language model, each backend returns content in its own shape: bounding boxes from one, page hierarchies from another, freeform JSON from a third. Without a shared model, every downstream consumer ends up branching on if extractor == "docling": … and the codebase hardens around extractor names. Teams that skip a unified document model end up rewriting chunkers, embedders, and search indexers every time a new extractor ships. By the end of this lesson you'll be able to design a canonical document schema with discriminated-union content blocks, provenance metadata, and a stable adapter boundary, so downstream systems consume one shape regardless of which extractor produced it.

Key Terminology

Unified document model: a single canonical schema (here, UnifiedDocument) that every extractor adapter targets, so downstream chunkers, embedders, and indexers consume one shape instead of branching on extractor name.
Discriminated union: a polymorphic content-block representation where each variant carries a literal block_type tag (e.g. Literal["table"] on TableBlock) that Pydantic uses to deserialize mixed lists of blocks without ambiguity.
Provenance metadata: per-block tracking of which extractor produced the block (name, version, page, bounding box, confidence, quality_flags), kept on the block itself so quality-aware processing can filter or re-route downstream.
Extraction adapter: an isolated class (e.g. DoclingAdapter) whose only job is to translate one extractor's native output into UnifiedDocument, forming the stable boundary that decouples ingestion from extractor choice.

Concepts

The lesson teaches three ideas that fit together as one design.

One canonical schema, many adapters. The UnifiedDocument model is the contract every downstream consumer depends on; DoclingAdapter and any future DocumentAIAdapter / VLMAdapter exist only to map their backend's native output into that contract. Adding a new extractor never changes the schema or any consumer — it adds one adapter class and nothing else.
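
A minimal sketch of that boundary, assuming a hypothetical ExtractionAdapter protocol and a registry keyed by extractor name. UnifiedDocument here is a two-field dataclass stand-in for the lesson's Pydantic model, and EchoAdapter is a toy backend rather than a real extractor; only the shape of the contract is the point.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class UnifiedDocument:
    """Stand-in for the lesson's Pydantic model (illustration only)."""
    source_uri: str
    blocks: list[str] = field(default_factory=list)


class ExtractionAdapter(Protocol):
    """Hypothetical contract: native extractor output in, UnifiedDocument out."""

    def to_unified_document(self, result: Any, source_uri: str) -> UnifiedDocument:
        ...


class EchoAdapter:
    """Toy adapter that treats the native result as a list of text lines."""

    def to_unified_document(self, result: Any, source_uri: str) -> UnifiedDocument:
        return UnifiedDocument(
            source_uri=source_uri,
            blocks=[str(line) for line in result],
        )


# Hypothetical registry: supporting a new extractor means adding one entry.
ADAPTERS: dict[str, ExtractionAdapter] = {"echo": EchoAdapter()}


def normalize(extractor: str, result: Any, source_uri: str) -> UnifiedDocument:
    """The only place in the pipeline that looks at the extractor name."""
    return ADAPTERS[extractor].to_unified_document(result, source_uri)


doc = normalize("echo", ["Hello", "World"], "s3://bucket/report.pdf")
print(doc.blocks)
```

Everything downstream of normalize receives a UnifiedDocument and never sees the extractor key.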

Discriminated unions over free-form JSON. Real documents mix paragraphs, headings, tables, figures, and lists. Encoding each as a Pydantic model with a literal block_type tag lets a list[ContentBlock | TableBlock] deserialize correctly while still giving you typed access to variant-specific fields like TableBlock.headers or ContentBlock.level. Free-form dicts produce the same data but lose the validation and the IDE support.
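
One way to make the tag explicit is to declare it as the discriminator. The sketch below assumes Pydantic v2 and uses one small model per tag (ParagraphBlock and HeadingBlock are simplified, hypothetical variants, not the lesson's ContentBlock) so that Field(discriminator="block_type") can pick the variant directly from the tag.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class ParagraphBlock(BaseModel):
    block_type: Literal["paragraph"] = "paragraph"
    text: str


class HeadingBlock(BaseModel):
    block_type: Literal["heading"] = "heading"
    text: str
    level: int


class TableBlock(BaseModel):
    block_type: Literal["table"] = "table"
    headers: list[str]
    rows: list[list[str]]


# The discriminator tells Pydantic to read block_type first and validate
# against exactly one variant instead of trying each union member.
Block = Annotated[
    Union[ParagraphBlock, HeadingBlock, TableBlock],
    Field(discriminator="block_type"),
]

block_list = TypeAdapter(list[Block])

blocks = block_list.validate_python([
    {"block_type": "heading", "text": "Q3 results", "level": 1},
    {"block_type": "table", "headers": ["Region", "Revenue"], "rows": [["EMEA", "1.2M"]]},
])
print([type(b).__name__ for b in blocks])

# An unknown tag is rejected at validation time rather than passing through.
try:
    block_list.validate_python([{"block_type": "chart", "text": "?"}])
except ValidationError as exc:
    print("rejected:", exc.error_count(), "error(s)")
```

With a declared discriminator, validation errors point at the single variant named by the tag, which tends to be easier to debug than errors reported against every union member.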

Provenance is a first-class field, not a sidecar log. Because every block carries its own Provenance (extractor, version, page, bbox, confidence, quality_flags), downstream code can make per-block decisions — drop low-confidence VLM output, re-route flagged blocks for review, attribute citations in a RAG answer — without joining back to an external metadata store. The Code Walkthrough below shows exactly how each of these three ideas appears in the schema and how DoclingAdapter produces them.
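
As an illustration of a per-block decision, here is a small routing sketch. The dataclasses are dependency-free stand-ins for the Pydantic models, and route, MIN_CONFIDENCE, and REVIEW_FLAGS are hypothetical names and thresholds chosen for the example, not part of the lesson's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Provenance:
    extractor: str
    confidence: Optional[float] = None
    quality_flags: list[str] = field(default_factory=list)


@dataclass
class Block:
    text: str
    provenance: Provenance


# Hypothetical policy values; tune them per pipeline.
MIN_CONFIDENCE = 0.6
REVIEW_FLAGS = {"possible_hallucination", "truncated"}


def route(block: Block) -> str:
    """Return 'review', 'drop', or 'keep' using only block-local provenance."""
    prov = block.provenance
    if REVIEW_FLAGS.intersection(prov.quality_flags):
        return "review"
    # A missing confidence means "unknown", not "zero", so the block is kept.
    if prov.confidence is not None and prov.confidence < MIN_CONFIDENCE:
        return "drop"
    return "keep"


blocks = [
    Block("Revenue grew 12%.", Provenance("docling")),
    Block("Rev3nue gr3w", Provenance("vlm", confidence=0.31)),
    Block("The CEO said", Provenance("vlm", confidence=0.9, quality_flags=["truncated"])),
]
decisions = [route(b) for b in blocks]
print(decisions)
```

No lookup against an external store is needed: each decision reads only the fields carried on the block.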

Code Walkthrough

Defining the Pydantic Schema

Use Pydantic models with discriminated unions to handle the polymorphic nature of document content:

Code snippet python

from datetime import datetime
from enum import Enum
from typing import Literal, Optional

from pydantic import BaseModel, Field


class BlockType(str, Enum):
    """Canonical content types that every adapter maps into."""
    PARAGRAPH = "paragraph"
    HEADING = "heading"
    TABLE = "table"
    FIGURE = "figure"
    LIST_ITEM = "list_item"
    CODE = "code"


class Provenance(BaseModel):
    """Where a block came from and how much to trust it."""
    extractor: str
    extractor_version: str
    page_number: Optional[int] = None
    bbox: Optional[tuple[float, float, float, float]] = None
    confidence: Optional[float] = None
    quality_flags: list[str] = Field(default_factory=list)


class TableCell(BaseModel):
    text: str
    row_span: int = 1  # values above 1 represent merged cells
    col_span: int = 1


class TableBlock(BaseModel):
    block_type: Literal["table"] = "table"  # literal tag identifying this variant
    headers: list[str]
    rows: list[list[TableCell]]
    provenance: Provenance


class ContentBlock(BaseModel):
    block_type: BlockType
    text: str
    level: Optional[int] = None  # heading depth; None for non-headings
    children: list["ContentBlock"] = Field(default_factory=list)
    provenance: Provenance


class DocumentPage(BaseModel):
    page_number: int
    # The X | Y union syntax requires Python 3.10 or newer.
    blocks: list[ContentBlock | TableBlock]


class UnifiedDocument(BaseModel):
    document_id: str
    source_uri: str
    schema_version: str = "1.0"
    format: str
    page_count: int
    extraction_timestamp: datetime
    pages: list[DocumentPage]
    metadata: dict = Field(default_factory=dict)

BlockType: The enum defines all content types your pipeline handles. Using an enum rather than free-form strings catches typos at validation time and enables exhaustive pattern matching in downstream processors.
Provenance: The model tracks which extractor produced each block. The quality_flags list records issues detected during extraction ("low_confidence", "possible_hallucination", "truncated"), enabling quality-aware processing downstream.
TableCell and TableBlock: TableBlock carries the literal tag block_type: Literal["table"], which lets Pydantic tell it apart from ContentBlock when deserializing mixed block lists. As written, ContentBlock is tagged with the BlockType enum rather than a literal, so the union is resolved by Pydantic's ordinary union matching (a table payload has headers and rows but no text) rather than by a declared discriminator field. headers is a plain list of strings; rows holds TableCell objects, which support merged cells via row_span and col_span.
ContentBlock: Supports recursive nesting via the children field, mirroring the hierarchical structure of real documents where sections contain subsections, which in turn contain paragraphs.
DocumentPage and UnifiedDocument: DocumentPage holds one page's blocks. The top-level UnifiedDocument captures document-level metadata, the list of pages, and a schema_version for evolution.
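
To check that a mixed block list survives serialization, the sketch below round-trips a page through JSON. It assumes Pydantic v2 (model_dump_json and model_validate_json) and uses a trimmed copy of the schema's models so the example is self-contained; fields not needed for the demonstration are omitted.

```python
from enum import Enum
from typing import Literal, Optional, Union

from pydantic import BaseModel


class BlockType(str, Enum):
    PARAGRAPH = "paragraph"
    HEADING = "heading"
    TABLE = "table"


class Provenance(BaseModel):
    extractor: str
    extractor_version: str


class TableCell(BaseModel):
    text: str


class TableBlock(BaseModel):
    block_type: Literal["table"] = "table"
    headers: list[str]
    rows: list[list[TableCell]]
    provenance: Provenance


class ContentBlock(BaseModel):
    block_type: BlockType
    text: str
    level: Optional[int] = None
    provenance: Provenance


class DocumentPage(BaseModel):
    page_number: int
    blocks: list[Union[ContentBlock, TableBlock]]


prov = Provenance(extractor="docling", extractor_version="2.0")
page = DocumentPage(
    page_number=1,
    blocks=[
        ContentBlock(block_type=BlockType.HEADING, text="Results", level=1, provenance=prov),
        TableBlock(headers=["Metric"], rows=[[TableCell(text="0.92")]], provenance=prov),
    ],
)

# Serialize to JSON, then rebuild typed models from the raw string.
payload = page.model_dump_json()
restored = DocumentPage.model_validate_json(payload)
print([type(b).__name__ for b in restored.blocks])
```

Each block comes back as its own model class, so consumers can use isinstance checks or dispatch on block_type without inspecting raw dicts.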

Building Extraction Adapters

Each adapter transforms extractor-specific output into the unified model:

Code snippet python

import hashlib
from datetime import datetime, timezone

# Assumes the schema models defined above (BlockType, ContentBlock,
# DocumentPage, Provenance, TableBlock, TableCell, UnifiedDocument) are in scope.


class DoclingAdapter:
    EXTRACTOR_NAME = "docling"
    EXTRACTOR_VERSION = "2.0"

    def to_unified_document(self, result, source_uri: str) -> UnifiedDocument:
        doc = result.document
        pages = []
        current_page_blocks = []
        current_page = 1

        # iterate_items() yields (item, level) pairs in reading order.
        for item, _level in doc.iterate_items():
            prov = item.prov[0] if item.prov else None
            page_num = prov.page_no if prov else current_page

            # On a page change, flush the finished page. The page counter
            # advances even if the previous page produced no blocks.
            if page_num != current_page:
                if current_page_blocks:
                    pages.append(DocumentPage(
                        page_number=current_page,
                        blocks=current_page_blocks,
                    ))
                    current_page_blocks = []
                current_page = page_num

            provenance = Provenance(
                extractor=self.EXTRACTOR_NAME,
                extractor_version=self.EXTRACTOR_VERSION,
                page_number=page_num,
                bbox=prov.bbox.as_tuple() if prov else None,
                confidence=None,  # this adapter does not populate a score
            )

            class_name = item.__class__.__name__
            if class_name == "SectionHeaderItem":
                block = ContentBlock(
                    block_type=BlockType.HEADING,
                    text=item.text,
                    level=item.level,
                    provenance=provenance,
                )
            elif class_name == "TableItem":
                df = item.export_to_dataframe()
                block = TableBlock(
                    # Column labels can be integers when a table has no
                    # header row, so coerce them to strings for list[str].
                    headers=[str(col) for col in df.columns],
                    rows=[
                        [TableCell(text=str(cell)) for cell in row]
                        for _, row in df.iterrows()
                    ],
                    provenance=provenance,
                )
            else:
                # Everything else falls back to a paragraph.
                block = ContentBlock(
                    block_type=BlockType.PARAGRAPH,
                    text=getattr(item, "text", ""),
                    provenance=provenance,
                )

            current_page_blocks.append(block)

        # Flush the last page.
        if current_page_blocks:
            pages.append(DocumentPage(
                page_number=current_page,
                blocks=current_page_blocks,
            ))

        return UnifiedDocument(
            document_id=self._generate_id(source_uri),
            source_uri=source_uri,
            format=str(result.input.format),
            page_count=len(doc.pages) if hasattr(doc, "pages") else len(pages),
            extraction_timestamp=datetime.now(timezone.utc),
            pages=pages,
        )

    def _generate_id(self, uri: str) -> str:
        # Same URI in, same ID out.
        return hashlib.sha256(uri.encode()).hexdigest()[:16]

EXTRACTOR_NAME and EXTRACTOR_VERSION: Each adapter declares its extractor identity as class constants, ensuring consistent provenance tracking across all documents processed by this adapter.
Page grouping (the start of the loop): Content blocks are grouped by page number. Docling's iterate_items() yields elements in reading order, but page transitions must be detected from each item's provenance metadata. Items without provenance are assigned to the current page.
Provenance construction: A Provenance object is built for every block with the extractor name, version, page number, and bounding box from Docling's output. Confidence is left as None because this adapter does not populate it.
Type mapping (the if/elif/else chain): Docling element types are mapped to unified block types. SectionHeaderItem becomes a heading with its level preserved. TableItem is exported to a DataFrame and then converted to the canonical TableBlock format, with column labels coerced to strings and each cell wrapped in a TableCell. All other elements default to paragraphs.
Return value and _generate_id: The final UnifiedDocument gets a deterministic ID derived from the source URI, so re-extracting the same source produces the same document ID.

The adapter pattern decouples your pipeline from any single extraction backend. Adding a new extractor requires writing one new adapter class without modifying any downstream consumer code.
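
To illustrate what a second adapter might look like, here is a sketch for a vision-language model backend. The payload shape (an "elements" list with "type", "text", "page", and "score" keys), the TYPE_MAP labels, the version string, and the "unmapped_type" flag are all invented for this example and do not describe any real VLM API. The dataclasses are dependency-free stand-ins for the Pydantic models.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Provenance:
    extractor: str
    extractor_version: str
    page_number: Optional[int] = None
    confidence: Optional[float] = None
    quality_flags: list[str] = field(default_factory=list)


@dataclass
class ContentBlock:
    block_type: str
    text: str
    provenance: Provenance


class VLMAdapter:
    """Sketch only: the native payload shape is hypothetical."""

    EXTRACTOR_NAME = "vlm"
    EXTRACTOR_VERSION = "0.1"  # hypothetical version string
    # Hypothetical native labels mapped to canonical block types.
    TYPE_MAP = {"title": "heading", "text": "paragraph", "bullet": "list_item"}

    def to_blocks(self, payload: dict[str, Any]) -> list[ContentBlock]:
        blocks = []
        for element in payload.get("elements", []):
            native_type = element.get("type", "")
            flags = []
            if native_type not in self.TYPE_MAP:
                # Unknown labels fall back to paragraph but leave a trace.
                flags.append(f"unmapped_type:{native_type}")
            blocks.append(ContentBlock(
                block_type=self.TYPE_MAP.get(native_type, "paragraph"),
                text=element.get("text", ""),
                provenance=Provenance(
                    extractor=self.EXTRACTOR_NAME,
                    extractor_version=self.EXTRACTOR_VERSION,
                    page_number=element.get("page"),
                    confidence=element.get("score"),
                    quality_flags=flags,
                ),
            ))
        return blocks


payload = {"elements": [
    {"type": "title", "text": "Overview", "page": 1, "score": 0.97},
    {"type": "caption", "text": "Figure 1", "page": 1, "score": 0.55},
]}
vlm_blocks = VLMAdapter().to_blocks(payload)
print([(b.block_type, b.provenance.quality_flags) for b in vlm_blocks])
```

Unlike the Docling adapter, this one fills in confidence, which is what lets downstream code treat blocks from the two backends differently without knowing which adapter ran.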


Do's and Don'ts

Now that you have worked through the implementation, the practices below separate a durable approach from a fragile one.

Do's

Tag every content-block variant with a Literal["..."] block_type so Pydantic can deserialize mixed list[ContentBlock | TableBlock] content unambiguously and consumers can dispatch on the tag.
Make Provenance mandatory on every block — extractor name, version, page number, bbox where available, and quality_flags — so downstream re-ranking, debugging, and citation paths never have to join back to an external metadata store.
Bump schema_version on UnifiedDocument whenever you add a block type or change a field's meaning, and teach consumers to branch on version (never on extractor).
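
Branching on version can be sketched as a chain of migrations over serialized documents. Version "1.1" and its "language" field are hypothetical, invented only to give the migration something to do; upgrade and MIGRATIONS are illustrative names.

```python
CURRENT_SCHEMA_VERSION = "1.1"  # hypothetical next version


def _upgrade_1_0_to_1_1(doc: dict) -> dict:
    """Hypothetical migration: 1.1 is imagined to add a 'language' field."""
    upgraded = dict(doc)  # copy so the caller's dict is not mutated
    upgraded.setdefault("language", None)
    upgraded["schema_version"] = "1.1"
    return upgraded


# One entry per historical version, each producing the next version.
MIGRATIONS = {"1.0": _upgrade_1_0_to_1_1}


def upgrade(doc: dict) -> dict:
    """Apply migrations until the document reaches the current version."""
    while doc["schema_version"] != CURRENT_SCHEMA_VERSION:
        version = doc["schema_version"]
        if version not in MIGRATIONS:
            raise ValueError(f"unsupported schema_version: {version}")
        doc = MIGRATIONS[version](doc)
    return doc


old_doc = {"schema_version": "1.0", "document_id": "abc123"}
new_doc = upgrade(old_doc)
print(new_doc["schema_version"], old_doc["schema_version"])
```

Consumers call upgrade once at the boundary and then handle a single schema version, which keeps version checks out of chunkers and indexers.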

Don'ts

Don't let if extractor == "docling" branching leak past the adapter boundary. Once a UnifiedDocument exists, no chunker, embedder, or indexer should know which backend produced it.
Don't drop unknown extractor element types into free-form dict blobs; map them to the closest canonical BlockType (default to PARAGRAPH) so the discriminated union stays exhaustive and validation stays meaningful.
Don't make Provenance optional or store it in a sidecar log. Block-local provenance is the whole reason quality-aware processing works after normalization.
