Extract documents using Docling's unified multi-format parser

Use Docling to parse PDF, DOCX, PPTX, and HTML into a unified structured representation. Leverage Granite-Docling-258M for layout analysis on CPU.

Introduction

Teams that build retrieval-augmented generation pipelines hit the same wall on day one: every input format (PDF, DOCX, HTML, PPTX, scanned images) needs its own parser, with its own quirks and its own bug surface, and the wrappers around them rarely agree on an output schema. Docling collapses that into one extraction interface that always returns a uniformly shaped tree. Get this layer wrong and every downstream consumer (chunker, embedder, retriever, evaluator) inherits the inconsistency, so one unreliable parser degrades the whole RAG stack.

By the end of this lesson you will be able to configure Docling's DocumentConverter for production-grade PDF extraction with OCR and table detection, run it over mixed-format batches without aborting on individual failures, and navigate the resulting DoclingDocument tree to pull out content blocks, tables, and provenance metadata for downstream stages.

Key Terminology

  • DocumentConverter: Docling's single entry-point class that dispatches input files to format-specific pipelines and returns a uniformly shaped DoclingDocument regardless of source format.
  • DoclingDocument: the typed tree of content elements (TextItem, SectionHeaderItem, TableItem, PictureItem) Docling emits, each carrying prov metadata with page number and bounding box for grounding.
  • TableFormer: Docling's table-structure recognition model, configurable via TableFormerMode (FAST or ACCURATE) to trade extraction fidelity for speed on table-heavy PDFs.
  • PdfPipelineOptions: the per-format configuration object that toggles OCR, table-structure detection, image scaling, and other PDF-specific extraction behavior on a DocumentConverter instance.
  • ConversionStatus: the per-document outcome (such as SUCCESS, PARTIAL_SUCCESS, or FAILURE) carried in the status field of each ConversionResult that convert_all() yields, so batch jobs can branch on result quality without aborting on a single bad input.

Concepts

Docling's value is that it hides format-specific parsing behind one converter and one output schema, so the rest of a RAG pipeline can be written against a single tree shape. Three ideas drive the lesson:

  1. One configured converter, many formats. A DocumentConverter is initialized once with a format_options map that wires a pipeline (OCR settings, table-structure mode, image scale) to each InputFormat. The same instance then handles PDF, DOCX, HTML, PPTX, and image inputs without per-call branching in caller code.
  2. Batch processing with isolated failures. convert_all(..., raises_on_error=False) yields a ConversionResult per input, letting a malformed PDF surface as ConversionStatus.FAILURE while the rest of the batch keeps running — essential when ingesting heterogeneous document corpora at scale.
  3. A uniform tree with provenance. Every successful extraction produces a DoclingDocument whose iterate_items() walk yields typed nodes (TextItem, SectionHeaderItem, TableItem, PictureItem) with prov metadata. Downstream chunkers and retrievers consume one shape, and the provenance lets answers cite the exact page and bounding box they came from.
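To make the third idea concrete, here is a minimal sketch of turning provenance into a readable citation. It assumes the page number and bounding box have already been copied out of an item's prov entry into plain Python values; the Citation dataclass and format_citation helper are hypothetical names for illustration and are not part of Docling.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Citation:
    """Hypothetical container for provenance copied out of a document item."""
    source: str                                         # file name of the source document
    page: Optional[int]                                 # page number, if known
    bbox: Optional[Tuple[float, float, float, float]]   # four box coordinates, if known


def format_citation(c: Citation) -> str:
    """Render a citation string, degrading gracefully when provenance is missing."""
    if c.page is None:
        return c.source
    if c.bbox is None:
        return f"{c.source}, p. {c.page}"
    # Round coordinates so the citation stays short and stable.
    x0, y0, x1, y1 = (round(v) for v in c.bbox)
    return f"{c.source}, p. {c.page} [{x0}, {y0}, {x1}, {y1}]"


print(format_citation(Citation("report.pdf", 3, (72.0, 700.4, 540.0, 650.6))))
print(format_citation(Citation("notes.html", None, None)))
```

Keeping the box as an opaque 4-tuple avoids baking in a coordinate convention; a renderer that highlights the region would need to know the origin the extractor used.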

Code Walkthrough

This walkthrough covers two stages of the pipeline back-to-back. The first stage builds a configured DocumentConverter and runs a mixed-format batch through it; the second stage navigates the resulting DoclingDocument tree and normalizes heading hierarchies so downstream chunkers see consistent structure regardless of source format.

Configuring and Running the Converter

The converter takes per-format pipeline options that control OCR, table detection, and image resolution, then exposes convert_all() for batch processing. The snippet below configures PDF extraction for both born-digital and scanned inputs and runs it over a directory of PDFs, handling individual failures without aborting the batch:

```python
from pathlib import Path

from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# OCR settings: English only, and do not force OCR over the whole page.
ocr_options = EasyOcrOptions(
    lang=["en"],
    force_full_page_ocr=False,
)

# PDF pipeline: OCR plus TableFormer in accurate mode, with 2x image scaling.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=ocr_options,
    do_table_structure=True,
    table_structure_options=TableStructureOptions(mode=TableFormerMode.ACCURATE),
    images_scale=2.0,
)

# One converter instance, configured once and reused for the whole batch.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)

pdf_files = list(Path("/data/incoming/").glob("**/*.pdf"))
# raises_on_error=False reports failures per document instead of raising.
results = converter.convert_all(pdf_files, raises_on_error=False)

for result in results:
    if result.status == ConversionStatus.SUCCESS:
        doc = result.document
        markdown = doc.export_to_markdown()
        print(f"Processed {result.input.file}: {len(doc.pages)} pages")
    elif result.status == ConversionStatus.PARTIAL_SUCCESS:
        print(f"Partial: {result.input.file} - {len(result.errors)} errors")
    else:
        print(f"Failed: {result.input.file} - {result.errors}")
```
  • EasyOcrOptions(force_full_page_ocr=False) keeps the existing text layer and applies OCR only to the bitmap regions of a page rather than re-reading the whole page, so born-digital PDFs largely skip the OCR penalty while scanned pages still get processed.
  • PdfPipelineOptions enables OCR, table-structure detection using TableFormer in accurate mode, and 2× image resolution scaling. Higher images_scale values improve OCR fidelity on small text but increase memory consumption and per-page processing time.
  • The format_options dict maps input formats to format-specific pipelines. Add InputFormat.DOCX or InputFormat.HTML entries with their own options to extend the same converter instance to other inputs.
  • convert_all(..., raises_on_error=False) yields a ConversionResult per document and isolates failures so one malformed PDF does not abort the batch. Branch on ConversionStatus.{SUCCESS, PARTIAL_SUCCESS} to distinguish clean extractions from partial ones, and call export_to_markdown() to serialize the document tree with heading levels, table formatting, and figure references preserved.
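The branching described above can also be collected into buckets before anything is written downstream. The following is a minimal sketch, not Docling's own API: partition_by_status is a hypothetical helper, and FakeStatus plus the SimpleNamespace objects are stand-ins so the example runs without Docling installed. It relies only on each result exposing a status enum member with a name attribute.

```python
from collections import defaultdict
from enum import Enum
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def partition_by_status(results: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group results by the name of their status enum member."""
    buckets: Dict[str, List[Any]] = defaultdict(list)
    for result in results:  # iterating consumes the results one by one
        buckets[result.status.name].append(result)
    return dict(buckets)


class FakeStatus(Enum):
    """Stand-in for ConversionStatus so the sketch runs without Docling."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


# Hypothetical results; real ones would come from converter.convert_all(...).
demo_results = [
    SimpleNamespace(name="a.pdf", status=FakeStatus.SUCCESS),
    SimpleNamespace(name="b.pdf", status=FakeStatus.FAILURE),
    SimpleNamespace(name="c.pdf", status=FakeStatus.SUCCESS),
    SimpleNamespace(name="d.pdf", status=FakeStatus.PARTIAL_SUCCESS),
]

buckets = partition_by_status(demo_results)
for status_name, group in buckets.items():
    print(status_name, [r.name for r in group])
```

Grouping first makes the policy explicit: successes go to the chunker, partial results go to review, and failures go to a retry or quarantine queue.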

The DoclingDocument object exposes content as a tree of typed elements. Walking that tree gives you a flat, typed list of blocks you can feed to chunkers, embedders, or evaluators — and once the tree is in hand, you can also reconcile the structural differences that creep in across formats (PDF font-size heuristics, DOCX heading styles, HTML starting at <h2>):

```python
from docling_core.types.doc import DoclingDocument, SectionHeaderItem, TableItem


def extract_content_blocks(doc: DoclingDocument) -> list[dict]:
    """Flatten the document tree into a list of typed blocks with provenance."""
    blocks = []
    for item, level in doc.iterate_items():
        block = {
            "type": item.__class__.__name__,
            "text": getattr(item, "text", ""),
            "level": level,  # depth of the item in the document tree
            # Provenance may be empty, so guard before indexing.
            "page": item.prov[0].page_no if item.prov else None,
            "bbox": item.prov[0].bbox.as_tuple() if item.prov else None,
        }
        if isinstance(item, TableItem):
            # Materialize the table as a pandas DataFrame, then as a plain dict.
            block["table_data"] = item.export_to_dataframe().to_dict()
        blocks.append(block)
    return blocks


def normalize_heading_levels(doc: DoclingDocument) -> DoclingDocument:
    """Shift headings so the shallowest heading in the document is h1."""
    heading_levels_seen = {
        item.level
        for item, _ in doc.iterate_items()
        if isinstance(item, SectionHeaderItem)
    }
    if heading_levels_seen and min(heading_levels_seen) > 1:
        offset = min(heading_levels_seen) - 1
        for item, _ in doc.iterate_items():
            if isinstance(item, SectionHeaderItem):
                # Mutates the document in place.
                item.level = max(1, item.level - offset)
    return doc
```
  • iterate_items() yields each content element with its depth level in the document hierarchy. Element types include TextItem (paragraphs), SectionHeaderItem (headings), TableItem (tables), and PictureItem (figures). Each element carries prov (provenance) metadata with page number and bounding box, which is essential for citations, retrieval, and grounding.
  • Tables get special treatment: export_to_dataframe() materializes Docling's internal table representation as a pandas DataFrame, which you can serialize to JSON, CSV, or Parquet for downstream consumers.
  • normalize_heading_levels finds the minimum heading level used in the tree and subtracts the same offset from every heading, so the shallowest heading becomes h1 and relative nesting is preserved. It modifies the document in place and returns it. This makes structure-aware chunking deterministic across formats; without it, an HTML page starting at <h2> and a PDF starting at h1 produce subtly different chunk boundaries.
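The offset arithmetic behind normalize_heading_levels is easiest to see on plain integers. The sketch below uses a hypothetical shift_levels helper and made-up level sequences; it mirrors the logic of the function above without touching a DoclingDocument.

```python
from typing import List


def shift_levels(levels: List[int]) -> List[int]:
    """Shift heading levels so the shallowest becomes 1, keeping relative depth."""
    if not levels:
        return []
    offset = min(levels) - 1
    return [level - offset for level in levels]


# Illustrative outlines: an HTML page whose headings start at <h2>,
# and a PDF whose headings already start at level 1.
html_like = [2, 3, 3, 2, 4]
pdf_like = [1, 2, 2, 1, 3]

print(shift_levels(html_like))
print(shift_levels(pdf_like))
```

After the shift both outlines are identical, which is what lets a heading-aware chunker cut them at the same boundaries.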

You'll know it works when convert_all() yields at least one result whose status is ConversionStatus.SUCCESS, extract_content_blocks(doc) returns a non-empty list containing TextItem and SectionHeaderItem entries with populated page and bbox fields, and normalize_heading_levels(doc) produces a document whose shallowest heading is level 1 regardless of source format.
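Part of that checklist can be automated. The sketch below is one possible check over the block dictionaries produced by extract_content_blocks, assuming the same keys (type, page, bbox); find_block_problems is a hypothetical helper and the sample blocks are invented for illustration.

```python
from typing import Dict, List


def find_block_problems(blocks: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    if not blocks:
        return ["no blocks extracted"]
    problems = []
    types_seen = {block.get("type") for block in blocks}
    for required in ("TextItem", "SectionHeaderItem"):
        if required not in types_seen:
            problems.append(f"missing {required}")
    for index, block in enumerate(blocks):
        if block.get("page") is None or block.get("bbox") is None:
            problems.append(f"block {index} ({block.get('type')}) lacks provenance")
    return problems


# Invented sample data in the shape extract_content_blocks returns.
good_blocks = [
    {"type": "SectionHeaderItem", "text": "Intro", "page": 1, "bbox": (72, 700, 540, 680)},
    {"type": "TextItem", "text": "Body text.", "page": 1, "bbox": (72, 670, 540, 600)},
]
bad_blocks = [
    {"type": "TextItem", "text": "Orphan.", "page": None, "bbox": None},
]

print(find_block_problems(good_blocks))
print(find_block_problems(bad_blocks))
```

Returning a list of problems rather than raising on the first one makes the helper usable both in a test suite and as a logging step inside a batch job.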

Do's and Don'ts

Now that you have worked through the implementation, the practices below separate a durable approach from a fragile one.

Do's

  1. Initialize one DocumentConverter per worker process and reuse it across documents. TableFormer and OCR model loading dominates first-call latency, so constructing a converter per call repeats that cost for every document in the batch.
  2. Pass raises_on_error=False to convert_all() and branch explicitly on ConversionStatus.{SUCCESS, PARTIAL_SUCCESS, FAILURE} so one corrupt PDF does not abort an ingest of thousands.
  3. Preserve item.prov (page number and bounding box) when emitting blocks to chunkers and embedders — downstream retrieval and answer-grounding rely on it for citations.
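The first practice can be enforced with a small caching wrapper. This is a minimal sketch using the standard library only: once_per_process is a hypothetical helper, and build_expensive_converter is a stand-in factory that would, in a real worker, construct and return the configured DocumentConverter.

```python
from functools import lru_cache
from typing import Any, Callable, List


def once_per_process(factory: Callable[[], Any]) -> Callable[[], Any]:
    """Wrap a zero-argument factory so it runs at most once in this process."""
    @lru_cache(maxsize=1)
    def get() -> Any:
        return factory()
    return get


build_calls: List[int] = []  # records each time the factory actually runs


def build_expensive_converter() -> object:
    """Stand-in factory; a real one would build the configured converter."""
    build_calls.append(1)
    return object()


get_converter = once_per_process(build_expensive_converter)

first = get_converter()   # runs the factory
second = get_converter()  # served from the cache
print(first is second, len(build_calls))
```

Because the cache lives in module state, each worker process that calls get_converter() for the first time after it starts pays the construction cost once and reuses the instance afterwards.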

Don'ts

  1. Don't write per-format if pdf … elif docx … branches in caller code. Register format-specific PdfFormatOption / WordFormatOption entries in format_options and call convert() polymorphically.
  2. Don't crank images_scale above 2–3 by default; it multiplies memory and per-page time, and rarely improves OCR fidelity on already-readable text.
  3. Don't drop ConversionStatus.PARTIAL_SUCCESS results silently — inspect result.errors and decide per pipeline whether partial content is usable or should be quarantined.
