Extract documents using Docling's unified multi-format parser
Use Docling to parse PDF, DOCX, PPTX, and HTML into a unified structured representation. Leverage Granite-Docling-258M for layout analysis on CPU.
~25 min read
Introduction
Teams that build retrieval-augmented generation pipelines hit the same wall on day one: every input format — PDF, DOCX, HTML, PPTX, scanned images — needs its own parser, its own quirks, and its own bug surface, and the wrappers around them rarely agree on output schema. Docling collapses that into one extraction interface that always returns a uniformly shaped tree. Get this layer wrong and every downstream consumer (chunker, embedder, retriever, evaluator) inherits the inconsistency, so a single bad parser corrupts the whole RAG stack. By the end of this lesson you will be able to configure Docling's DocumentConverter for production-grade PDF extraction with OCR and table detection, run it over mixed-format batches without aborting on individual failures, and navigate the resulting DoclingDocument tree to pull out content blocks, tables, and provenance metadata for downstream stages.
Key Terminology
- DocumentConverter: Docling's single entry-point class that dispatches input files to format-specific pipelines and returns a uniformly shaped DoclingDocument regardless of source format.
- DoclingDocument: the typed tree of content elements (TextItem, SectionHeaderItem, TableItem, PictureItem) Docling emits, each carrying prov metadata with page number and bounding box for grounding.
- TableFormer: Docling's table-structure recognition model, configurable via TableFormerMode (FAST or ACCURATE) to trade extraction fidelity for speed on table-heavy PDFs.
- PdfPipelineOptions: the per-format configuration object that toggles OCR, table-structure detection, image scaling, and other PDF-specific extraction behavior on a DocumentConverter instance.
- ConversionStatus: the per-document outcome (SUCCESS, PARTIAL_SUCCESS, FAILURE) returned by convert_all() so batch jobs can branch on result quality without aborting on a single bad input.
Concepts
Docling's value is that it hides format-specific parsing behind one converter and one output schema, so the rest of a RAG pipeline can be written against a single tree shape. Three ideas drive the lesson:
- One configured converter, many formats. A DocumentConverter is initialized once with a format_options map that wires a pipeline (OCR settings, table-structure mode, image scale) to each InputFormat. The same instance then handles PDF, DOCX, HTML, PPTX, and image inputs without per-call branching in caller code.
- Batch processing with isolated failures. convert_all(..., raises_on_error=False) yields a ConversionResult per input, letting a malformed PDF surface as ConversionStatus.FAILURE while the rest of the batch keeps running. This is essential when ingesting heterogeneous document corpora at scale.
- A uniform tree with provenance. Every successful extraction produces a DoclingDocument whose iterate_items() walk yields typed nodes (TextItem, SectionHeaderItem, TableItem, PictureItem) with prov metadata. Downstream chunkers and retrievers consume one shape, and the provenance lets answers cite the exact page and bounding box they came from.
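Because one converter handles every registered format, the caller's only job before conversion is to gather the input paths. The sketch below shows one way to do that with the standard library alone. The helper name collect_inputs and the suffix allow-list are hypothetical, not part of Docling, and the list should match whatever formats your converter is actually configured for.

```python
from pathlib import Path

# Hypothetical allow-list (an assumption): keep it in sync with the formats
# registered on your converter.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".html"}


def collect_inputs(root: Path, suffixes: set[str] = SUPPORTED_SUFFIXES) -> list[Path]:
    """Return every file under root whose suffix is in the allow-list."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

The resulting list can be handed to a single convert_all() call, so no per-format branching is needed at the call site.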
Code Walkthrough
This walkthrough covers two stages of the pipeline back-to-back. The first stage builds a configured DocumentConverter and runs a mixed-format batch through it; the second stage navigates the resulting DoclingDocument tree and normalizes heading hierarchies so downstream chunkers see consistent structure regardless of source format.
Configuring and Running the Converter
The converter takes per-format pipeline options that control OCR, table detection, and image resolution, then exposes convert_all() for batch processing. The snippet below configures PDF extraction for both born-digital and scanned inputs and runs it over a directory of PDFs, handling individual failures without aborting the batch:
```python
from pathlib import Path

from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# OCR settings: English only, no forced full-page OCR.
ocr_options = EasyOcrOptions(
    lang=["en"],
    force_full_page_ocr=False,
)

# PDF pipeline: OCR plus table-structure detection at 2x image resolution.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=ocr_options,
    do_table_structure=True,
    table_structure_options={"mode": TableFormerMode.ACCURATE},
    images_scale=2.0,
)

# Build the converter once and reuse it for every document in the batch.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)

pdf_files = list(Path("/data/incoming/").glob("**/*.pdf"))
# raises_on_error=False reports failures per document instead of raising.
results = converter.convert_all(pdf_files, raises_on_error=False)

for result in results:
    if result.status == ConversionStatus.SUCCESS:
        doc = result.document
        markdown = doc.export_to_markdown()
        print(f"Processed {result.input.file}: {len(doc.pages)} pages")
    elif result.status == ConversionStatus.PARTIAL_SUCCESS:
        print(f"Partial: {result.input.file} - {len(result.errors)} errors")
    else:
        print(f"Failed: {result.input.file} - {result.errors}")
```
- EasyOcrOptions(force_full_page_ocr=False) keeps a page's existing text layer and applies OCR only where Docling finds image regions that need it, so born-digital PDFs largely avoid the OCR penalty while scanned pages still get processed.
- PdfPipelineOptions enables OCR, table-structure detection using TableFormer in accurate mode, and 2× image resolution scaling. Higher images_scale values improve OCR fidelity on small text but increase memory consumption and per-page processing time.
- The format_options dict maps input formats to format-specific pipelines. Add InputFormat.DOCX or InputFormat.HTML entries with their own options to extend the same converter instance to other inputs.
- convert_all(..., raises_on_error=False) yields a ConversionResult per document and isolates failures so one malformed PDF does not abort the batch. Branch on ConversionStatus.{SUCCESS, PARTIAL_SUCCESS} to distinguish clean extractions from partial ones, and call export_to_markdown() to serialize the document tree with heading levels, table formatting, and figure references preserved.
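The status branching described above often ends up as a bucketing step, so that successes, partial results, and failures can each be routed to their own handler. The following is a minimal sketch of that idea. The Status enum and the SimpleNamespace objects are stand-ins so the example runs without Docling installed, and partition_by_status is a hypothetical helper name. It relies only on each result exposing a status enum member, as the loop above already does.

```python
from collections import defaultdict
from enum import Enum
from types import SimpleNamespace


class Status(Enum):
    """Stand-in mirroring the three ConversionStatus outcomes named in the text."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


def partition_by_status(results):
    """Group results by the name of their status enum member."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result.status.name].append(result)
    return dict(buckets)


# Stand-in results; real ones would come from converter.convert_all(...).
fake_results = [
    SimpleNamespace(status=Status.SUCCESS, errors=[]),
    SimpleNamespace(status=Status.FAILURE, errors=["unreadable file"]),
    SimpleNamespace(status=Status.SUCCESS, errors=[]),
]
buckets = partition_by_status(fake_results)
print({name: len(items) for name, items in buckets.items()})
```

Since convert_all() yields results one at a time, bucketing them in a single pass like this avoids having to iterate the results twice.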
Navigating Output and Normalizing Across Formats
The DoclingDocument object exposes content as a tree of typed elements. Walking that tree gives you a flat, typed list of blocks you can feed to chunkers, embedders, or evaluators — and once the tree is in hand, you can also reconcile the structural differences that creep in across formats (PDF font-size heuristics, DOCX heading styles, HTML starting at <h2>):
```python
# DoclingDocument and its item classes are defined in the docling-core package.
from docling_core.types.doc import DoclingDocument, SectionHeaderItem, TableItem


def extract_content_blocks(doc: DoclingDocument) -> list[dict]:
    """Flatten the document tree into a list of typed block dicts."""
    blocks = []
    for item, level in doc.iterate_items():
        # Use the first provenance entry, when present, for page and bounding box.
        prov = item.prov[0] if item.prov else None
        block = {
            "type": type(item).__name__,
            "text": getattr(item, "text", ""),
            "level": level,
            "page": prov.page_no if prov else None,
            "bbox": prov.bbox.as_tuple() if prov else None,
        }
        if isinstance(item, TableItem):
            block["table_data"] = item.export_to_dataframe().to_dict()
        blocks.append(block)
    return blocks


def normalize_heading_levels(doc: DoclingDocument) -> DoclingDocument:
    """Shift headings so the document always starts at h1."""
    headers = [
        item for item, _ in doc.iterate_items()
        if isinstance(item, SectionHeaderItem)
    ]
    if headers:
        # Distance between the shallowest heading in use and level 1.
        offset = min(header.level for header in headers) - 1
        if offset > 0:
            for header in headers:
                header.level = max(1, header.level - offset)
    return doc
```
- iterate_items() yields each content element with its depth level in the document hierarchy. Element types include TextItem (paragraphs), SectionHeaderItem (headings), TableItem (tables), and PictureItem (figures). Each element carries prov (provenance) metadata with page number, bounding box, and character span, which is essential for citations, retrieval, and grounding.
- Tables get special treatment: export_to_dataframe() materializes Docling's internal table representation as a pandas DataFrame, which you can serialize to JSON, CSV, or Parquet for downstream consumers.
- normalize_heading_levels scans the tree for the minimum heading level used and shifts every heading up by that offset so the document starts at h1. This makes structure-aware chunking deterministic across formats. Without it, an HTML page starting at <h2> and a PDF starting at h1 produce subtly different chunk boundaries.
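The heading shift is easier to see on bare integers. This worked example applies the same offset arithmetic as normalize_heading_levels to a plain list of levels. shift_levels is a hypothetical helper for illustration only and does not touch a DoclingDocument.

```python
def shift_levels(levels: list[int]) -> list[int]:
    """Shift heading levels so the shallowest becomes 1, preserving relative depth."""
    if not levels:
        return []
    offset = min(levels) - 1
    return [max(1, level - offset) for level in levels]


# An HTML-like outline that starts at h2: the offset is 2 - 1 = 1.
print(shift_levels([2, 3, 3, 2, 4]))  # [1, 2, 2, 1, 3]
# An outline that already starts at h1 is left unchanged.
print(shift_levels([1, 2, 2]))  # [1, 2, 2]
```

Every level moves by the same offset, so the parent/child relationships between headings are preserved.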
You'll know it works when convert_all() returns at least one ConversionStatus.SUCCESS result, extract_content_blocks(doc) returns a non-empty list containing TextItem and SectionHeaderItem entries with populated page and bbox fields, and normalize_heading_levels(doc) produces a document whose first heading is level 1 regardless of source format.
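Once the flat block list exists, a structure-aware chunker can be as simple as starting a new section at every heading. The sketch below assumes blocks shaped like the dicts returned by extract_content_blocks (keys type, text, and page). group_by_section is a hypothetical helper, not a Docling API, and a production chunker would also need to handle size limits and tables.

```python
def group_by_section(blocks: list[dict]) -> list[dict]:
    """Group a flat block list into sections, starting a new one at each heading."""
    sections = []
    current = {"heading": None, "pages": set(), "texts": []}
    for block in blocks:
        if block["type"] == "SectionHeaderItem":
            # Flush the previous section if it collected anything.
            if current["heading"] is not None or current["texts"]:
                sections.append(current)
            current = {"heading": block["text"], "pages": set(), "texts": []}
        elif block["text"]:
            current["texts"].append(block["text"])
        # Track every page the section touches so citations stay possible.
        if block["page"] is not None:
            current["pages"].add(block["page"])
    if current["heading"] is not None or current["texts"]:
        sections.append(current)
    return sections


# Hand-written sample blocks standing in for real extraction output.
sample_blocks = [
    {"type": "TextItem", "text": "Preamble.", "page": 1},
    {"type": "SectionHeaderItem", "text": "Methods", "page": 1},
    {"type": "TextItem", "text": "We parse.", "page": 1},
    {"type": "TextItem", "text": "We chunk.", "page": 2},
]
sections = group_by_section(sample_blocks)
```

Text that appears before the first heading lands in a section whose heading is None, so nothing is silently dropped.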
Do's and Don'ts
Now that you have worked through the implementation, the practices below separate a durable approach from a fragile one.
Do's
- Initialize one DocumentConverter per worker process and reuse it across documents. TableFormer and OCR model loading dominate first-call latency, so constructing a converter per call makes batch jobs far slower.
- Pass raises_on_error=False to convert_all() and branch explicitly on ConversionStatus.{SUCCESS, PARTIAL_SUCCESS, FAILURE} so one corrupt PDF does not abort an ingest of thousands.
- Preserve item.prov (page number and bounding box) when emitting blocks to chunkers and embedders, because downstream retrieval and answer-grounding rely on it for citations.
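To illustrate why provenance is worth carrying through, here is a small sketch that turns a block's page and bbox fields into a citation string. format_citation and the file name are hypothetical, and the block shape is assumed to match the dicts produced by extract_content_blocks earlier in the lesson. The four bbox numbers are printed as-is, since their order depends on the coordinate origin of the bounding box.

```python
def format_citation(source: str, block: dict) -> str:
    """Build a human-readable citation from a block's page and bbox fields."""
    if block.get("page") is None:
        return f"{source} (location unknown)"
    citation = f"{source}, p. {block['page']}"
    bbox = block.get("bbox")
    if bbox is not None:
        coords = ", ".join(f"{value:.1f}" for value in bbox)
        citation += f", bbox=({coords})"
    return citation


block = {"page": 3, "bbox": (72.0, 700.5, 540.0, 650.0)}
print(format_citation("report.pdf", block))
# report.pdf, p. 3, bbox=(72.0, 700.5, 540.0, 650.0)
```

If a chunker drops page and bbox, this function can only return the "location unknown" form, which is the failure mode the practice above guards against.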
Don'ts
- Don't write per-format if pdf … elif docx … branches in caller code. Register format-specific PdfFormatOption / WordFormatOption entries in format_options and call convert() polymorphically.
- Don't crank images_scale above 2–3 by default; it multiplies memory and per-page time, and rarely improves OCR fidelity on already-readable text.
- Don't drop ConversionStatus.PARTIAL_SUCCESS results silently. Inspect result.errors and decide per pipeline whether partial content is usable or should be quarantined.
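The cost of a higher images_scale can be estimated with simple arithmetic, because pixel count grows with the square of the scale factor. The sketch below makes two assumptions that are not stated in the lesson: a scale of 1.0 renders one pixel per PDF point (72 DPI), and the page image is held as uncompressed 3-channel RGB. The page size is US Letter (612 × 792 points), and raster_bytes is a hypothetical helper.

```python
def raster_bytes(width_pt: float, height_pt: float, scale: float, channels: int = 3) -> int:
    """Estimate uncompressed bytes for one rendered page image.

    Assumes scale 1.0 maps one PDF point to one pixel (72 DPI).
    """
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels


# US Letter is 612 x 792 points; doubling the scale quadruples the pixel count.
for scale in (1.0, 2.0, 4.0):
    mib = raster_bytes(612, 792, scale) / 2**20
    print(f"scale={scale}: {mib:.1f} MiB per page image")
```

Actual memory use also depends on the OCR and layout models, so treat these numbers as a lower bound on the raster alone rather than a full profile.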