Build a routing system that selects the optimal extraction method

Route documents to the best extractor based on type, quality, and cost constraints. Use Docling for standard docs, VLMs for complex layouts, and Document AI for high-volume batches.

Introduction

When you process a pile of mixed PDFs — born-digital invoices, scanned contracts, image-heavy reports — sending every file to the same extractor either wastes money on simple documents or produces garbage on hard ones. Teams that skip routing usually discover the cost after their first month: a four-figure OCR bill for documents Docling could have handled for free, or empty rows in production tables because a scanned form went through a text-only extractor. The fix is a routing layer that inspects each document, picks the backend that will produce the best result at the lowest cost, and falls back when the primary choice fails. By the end of this lesson you'll be able to score a document by complexity, route it to one of several extraction backends (Docling, Document AI, or a VLM), and chain fallbacks so a single failure never drops a document.

Key Terminology

Extraction backend — a concrete adapter (Docling, Document AI, GPT-4o VLM, Gemini VLM) that turns raw PDF bytes into structured text; routing chooses which one to invoke per document.
Complexity score — a 0.0–1.0 number derived from cheap document features (text layer, tables, images, page count) that proxies "how hard is this to extract"; thresholds on this score drive routing decisions.
Routing config — the externalised set of thresholds and limits (VLM threshold, max pages for VLM, batch-size cutover) that lets you re-tune cost vs. accuracy without redeploying code.
Fallback chain — an ordered list of alternate backends to try when the primary fails or produces a result that fails quality checks; ensures no document is silently dropped.

Concepts

The routing system has three pieces: a feature extractor that runs in milliseconds, a decision function that maps features to a backend, and a fallback chain that handles failures. Together they form a pipeline where every document is handled by the cheapest backend that can produce acceptable quality.

Cheap feature extraction

Routing decisions must be cheap or the router itself becomes the bottleneck. Use PyMuPDF to inspect a document in under 100 ms per file: check for an extractable text layer (born-digital vs scanned), count table-like blocks, look for embedded images, and read the page count. These signals collapse into a single complexity score so downstream code only needs one threshold per backend (see Code Walkthrough).

Threshold-driven routing

Each backend has a distinct cost-and-capability profile: Docling is free but text-only, Document AI charges per page but handles scans and scales to large batches, VLMs cost the most per page but understand complex layouts and handwriting. The router encodes that hierarchy as two thresholds (VLM threshold higher than Document-AI threshold) plus two limits (max pages for VLM, batch-size cutover). Externalising these in a RoutingConfig lets you re-tune the cost curve without touching the routing logic.
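
As a rough illustration of externalising these values, the sketch below loads overrides from a JSON string and checks that the two thresholds keep their order. The field names and defaults mirror the RoutingConfig used in the walkthrough; the loader name `load_routing_config`, the JSON layout, and the validation rules are assumptions made for this example, not part of the lesson's code.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RoutingConfig:
    complexity_threshold_vlm: float = 0.6
    complexity_threshold_docai: float = 0.3
    max_pages_for_vlm: int = 20
    batch_size_threshold: int = 100


def load_routing_config(raw_json: str) -> RoutingConfig:
    """Hypothetical loader: apply JSON overrides, then validate the result."""
    overrides = json.loads(raw_json)
    known = {f.name for f in fields(RoutingConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown routing keys: {sorted(unknown)}")
    config = RoutingConfig(**overrides)
    # The VLM threshold has to sit above the Document AI threshold,
    # otherwise the cost hierarchy between the two backends is inverted.
    if not 0.0 <= config.complexity_threshold_docai < config.complexity_threshold_vlm <= 1.0:
        raise ValueError("expected 0 <= docai threshold < vlm threshold <= 1")
    if config.max_pages_for_vlm < 1 or config.batch_size_threshold < 1:
        raise ValueError("page and batch limits must be positive")
    return config


# Only the overridden keys need to appear; everything else keeps its default.
config = load_routing_config('{"complexity_threshold_vlm": 0.7, "max_pages_for_vlm": 10}')
print(config)
```

Rejecting unknown keys and an inverted threshold order at load time means a bad config change fails at startup instead of silently rerouting documents.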

Fallback chains for resilience

Any single backend will fail on some inputs: Docling chokes on scans, Document AI quotas spike during incidents, VLM providers rate-limit. Instead of dropping the document, the pipeline tries an ordered chain of alternates and accepts the first result that passes quality checks. The chain is keyed by primary backend because the best fallback depends on why the primary is likely to have failed: Docling → Document AI (the doc was probably scanned), VLM → other VLM (the first provider was probably rate-limited).
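
The "try in order, accept the first result that passes" idea can be sketched independently of any real backend. In the example below the three backends are stand-in functions and the quality check is a simple non-empty test; the helper name `first_acceptable` and all the stand-ins are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_acceptable(
    attempts: Sequence[Tuple[str, Callable[[], str]]],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (backend_name, result) for the first attempt that runs and passes the check."""
    last_error: Optional[Exception] = None
    for name, run in attempts:
        try:
            result = run()
        except Exception as exc:  # a failed backend must not stop the chain
            last_error = exc
            continue
        if is_acceptable(result):
            return name, result
        last_error = ValueError(f"{name} returned a result that failed the quality check")
    raise RuntimeError(f"all backends failed: {last_error}")


def failing_docling() -> str:
    raise IOError("no text layer")  # stand-in for a text-only extractor given a scan


def empty_document_ai() -> str:
    return ""  # returns without error, but with nothing usable


def working_vlm() -> str:
    return "Invoice 1042, total 310.00"


chosen, text = first_acceptable(
    [("docling", failing_docling), ("document_ai", empty_document_ai), ("vlm_gemini", working_vlm)],
    is_acceptable=lambda t: len(t.strip()) > 0,
)
print(chosen, text)
```

The first stand-in raises and the second returns an empty result, so both count as failures and the third backend's output is the one accepted.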

Code Walkthrough

The two snippets below turn the three concepts into working code: the first combines cheap feature extraction with threshold-driven routing, and the second wires the router into a pipeline with fallback chains.

Classify and route

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional

import fitz  # PyMuPDF


class ExtractionBackend(str, Enum):
    DOCLING = "docling"
    DOCUMENT_AI = "document_ai"
    VLM_GPT4O = "vlm_gpt4o"
    VLM_GEMINI = "vlm_gemini"


@dataclass
class DocumentFeatures:
    has_text_layer: bool
    page_count: int
    has_tables: bool
    has_images: bool
    estimated_complexity: float


def analyze_document(file_path: Path) -> DocumentFeatures:
    has_text = has_tables = has_images = False
    # The context manager closes the file even if a page fails to parse.
    with fitz.open(str(file_path)) as doc:
        page_count = doc.page_count
        for page in doc:
            # More than 50 extractable characters on any page counts as a text layer.
            if len(page.get_text().strip()) > 50:
                has_text = True
            if page.get_images():
                has_images = True
            # Heuristic: at least three text blocks with more than three lines
            # each on one page suggests tabular content.
            blocks = page.get_text("dict")["blocks"]
            table_like = sum(1 for b in blocks if b.get("lines") and len(b["lines"]) > 3)
            if table_like > 2:
                has_tables = True
    complexity = (
        (0.4 if not has_text else 0.0)
        + (0.3 if has_tables else 0.0)
        + (0.2 if has_images else 0.0)
        + (0.1 if page_count > 50 else 0.0)
    )
    return DocumentFeatures(
        has_text_layer=has_text,
        page_count=page_count,
        has_tables=has_tables,
        has_images=has_images,
        estimated_complexity=min(complexity, 1.0),
    )


@dataclass
class RoutingConfig:
    complexity_threshold_vlm: float = 0.6
    complexity_threshold_docai: float = 0.3
    max_pages_for_vlm: int = 20
    batch_size_threshold: int = 100


class ExtractionRouter:
    def __init__(self, config: Optional[RoutingConfig] = None):
        # Create the default per instance instead of sharing one default object.
        self.config = config if config is not None else RoutingConfig()

    def route(self, features: DocumentFeatures, batch_size: int = 1) -> ExtractionBackend:
        c = self.config
        # Scanned and complex: VLM for short documents, Document AI for long ones.
        if not features.has_text_layer and features.estimated_complexity >= c.complexity_threshold_vlm:
            if features.page_count <= c.max_pages_for_vlm:
                return ExtractionBackend.VLM_GPT4O
            return ExtractionBackend.DOCUMENT_AI
        # Large batches and moderately complex documents go to Document AI.
        if batch_size >= c.batch_size_threshold:
            return ExtractionBackend.DOCUMENT_AI
        if features.estimated_complexity >= c.complexity_threshold_docai:
            return ExtractionBackend.DOCUMENT_AI
        return ExtractionBackend.DOCLING
```

ExtractionBackend enumerates the targets; each maps to one adapter and one cost profile, so the router only needs to return an enum value.
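analyze_document is meant to stay under roughly 100 ms per file, but it does call get_text on every page, so its cost grows with page count; for very long documents, consider inspecting only a fixed sample of pages to stay within that budget.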
Complexity scoring: missing text layer is the dominant signal (+0.4) because it means OCR or VLM is required; tables (+0.3) and images (+0.2) add structural difficulty.
route() applies thresholds in priority order: scanned-and-complex documents go to a VLM (short docs) or Document AI (long docs); of the remainder, large batches go to Document AI to amortise API overhead, as does any document at or above the Document AI complexity threshold; everything else falls through to Docling.
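
To make the threshold arithmetic concrete, here is a compact restatement of the scoring weights and default thresholds from the snippet above, applied to four made-up documents. The `Features` tuple, the `score` and `pick_backend` helpers, and the example documents are illustrative stand-ins, not the lesson's classes.

```python
from typing import NamedTuple


class Features(NamedTuple):
    has_text_layer: bool
    has_tables: bool
    has_images: bool
    page_count: int


def score(f: Features) -> float:
    # Same weights as the walkthrough: 0.4 no text layer, 0.3 tables,
    # 0.2 images, 0.1 for documents longer than 50 pages.
    total = (
        (0.4 if not f.has_text_layer else 0.0)
        + (0.3 if f.has_tables else 0.0)
        + (0.2 if f.has_images else 0.0)
        + (0.1 if f.page_count > 50 else 0.0)
    )
    return min(total, 1.0)


def pick_backend(f: Features, batch_size: int = 1) -> str:
    # Default limits from RoutingConfig: 0.6 (VLM), 0.3 (Document AI),
    # 20 pages for VLM, batch cutover at 100.
    s = score(f)
    if not f.has_text_layer and s >= 0.6:
        return "vlm_gpt4o" if f.page_count <= 20 else "document_ai"
    if batch_size >= 100 or s >= 0.3:
        return "document_ai"
    return "docling"


# Hypothetical documents, described only by their cheap features.
examples = {
    "plain born-digital letter": Features(True, False, False, 2),
    "born-digital report with tables": Features(True, True, False, 12),
    "short scanned form": Features(False, False, True, 3),
    "long scanned contract": Features(False, False, True, 80),
}
for name, f in examples.items():
    print(f"{name}: score={score(f):.1f} -> {pick_backend(f)}")
```

The short scanned form scores 0.4 + 0.2, which lands on the 0.6 VLM boundary, and the tabled report lands on the 0.3 Document AI boundary, so both depend on the comparisons being `>=` rather than `>`.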

Wire fallbacks into the pipeline

```python
from pathlib import Path

# Continues from the previous snippet: ExtractionBackend, DocumentFeatures
# and ExtractionRouter are defined there.


class ExtractionPipeline:
    def __init__(self, router: ExtractionRouter, adapters: dict):
        self.router = router
        self.adapters = adapters  # maps ExtractionBackend -> adapter object
        # Ordered alternates to try when the primary backend fails.
        self.fallback_chain = {
            ExtractionBackend.DOCLING:     [ExtractionBackend.DOCUMENT_AI, ExtractionBackend.VLM_GEMINI],
            ExtractionBackend.DOCUMENT_AI: [ExtractionBackend.VLM_GPT4O,   ExtractionBackend.DOCLING],
            ExtractionBackend.VLM_GPT4O:   [ExtractionBackend.VLM_GEMINI,  ExtractionBackend.DOCUMENT_AI],
            ExtractionBackend.VLM_GEMINI:  [ExtractionBackend.VLM_GPT4O,   ExtractionBackend.DOCUMENT_AI],
        }

    def extract(self, file_path: Path, features: DocumentFeatures, batch_size: int = 1):
        # Pass batch_size through so the batch cutover in route() can apply.
        primary = self.router.route(features, batch_size)
        attempts = [primary] + self.fallback_chain.get(primary, [])
        last_error = None
        for backend in attempts:
            try:
                adapter = self.adapters[backend]
                raw = adapter.extract(file_path)
                doc = adapter.to_unified_document(raw, str(file_path))
                if self._passes_quality_check(doc):
                    return doc
                last_error = ValueError(f"quality check failed for {backend.value}")
            except Exception as e:  # any backend error moves on to the next alternate
                last_error = e
        raise RuntimeError(f"all backends failed for {file_path}: {last_error}") from last_error

    def _passes_quality_check(self, doc) -> bool:
        # Placeholder so the class runs as written: accepts any non-None
        # document. Replace with real checks such as text length or confidence.
        return doc is not None
```

The fallback chain is keyed by primary backend so each failure mode gets a sensible alternate — a Docling failure most likely means the doc was actually scanned, so Document AI is the right next try, not another text-only extractor.
extract() accepts the first result that both returns and passes quality checks; a returned-but-low-confidence result is treated as a failure and triggers the next backend.
Only when every backend in the chain is exhausted does the pipeline raise — no document is silently dropped.
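
The walkthrough does not spell out what _passes_quality_check should test. One plausible shape, assuming the unified document exposes its extracted text as a plain string, is a length floor plus a ratio of ordinary characters. The function name, both thresholds, and the character set below are assumptions for illustration only.

```python
def passes_quality_check(text: str, min_chars: int = 50, min_clean_ratio: float = 0.8) -> bool:
    """Hypothetical check: enough text, and most of it made of ordinary characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count letters, digits, whitespace and common punctuation as "clean";
    # replacement characters and symbol noise from a failed OCR pass are not.
    clean = sum(1 for ch in stripped if ch.isalnum() or ch.isspace() or ch in ".,;:!?()-/%$'\"")
    return clean / len(stripped) >= min_clean_ratio


good = "Invoice 1042 issued to Acme Ltd. Total due: 310.00 USD within 30 days of receipt."
garbled = "\ufffd\ufffd#@\ufffd^^~~\ufffd\ufffd||\ufffd" * 10
print(passes_quality_check(good), passes_quality_check(garbled), passes_quality_check("   "))
```

Both thresholds would need tuning against real documents: a short but valid receipt could fall under the length floor, and text in scripts with different punctuation may need a broader character set.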

You'll know it works when a small evaluation set of mixed PDFs (one born-digital invoice, one scanned form, one image-heavy report) routes to three different backends and all three return unified documents that pass _passes_quality_check.

Do's and Don'ts

The practices below separate a durable router from a fragile one.

Do's

✓ Externalise routing thresholds: keep RoutingConfig values in config, not constants, so you can re-tune the cost curve without a redeploy after a backend price change.
✓ Cap VLM page count: VLM cost scales linearly with pages; the max_pages_for_vlm limit prevents a single 200-page scan from blowing the budget.
✓ Keep feature extraction cheap: analysis must stay under ~100 ms per doc; if it grows beyond that the router becomes the bottleneck the pipeline was meant to avoid.
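
One way to notice when feature extraction drifts past its budget is to time each call and record the ones that run over. The decorator below is a minimal sketch: `within_budget`, the `slow_calls` list, and the two stand-in analysis functions are hypothetical, and a real system would more likely emit a log line or metric than append to a list.

```python
import time
from functools import wraps
from typing import Callable, List

slow_calls: List[str] = []  # illustrative sink for over-budget calls


def within_budget(budget_ms: float) -> Callable:
    """Hypothetical decorator: record any call that exceeds the time budget."""
    def decorate(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms > budget_ms:
                slow_calls.append(
                    f"{fn.__name__} took {elapsed_ms:.0f} ms (budget {budget_ms:.0f} ms)"
                )
            return result
        return wrapper
    return decorate


@within_budget(budget_ms=100)
def fast_analysis() -> str:
    return "features"


@within_budget(budget_ms=1)
def slow_analysis() -> str:
    time.sleep(0.02)  # stand-in for an analysis step that grew too expensive
    return "features"


fast_analysis()
slow_analysis()
print(slow_calls)
```

The decorator only observes; it never interrupts a slow call, so the document is still analysed and routed while the overrun is surfaced for tuning.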

Don'ts

✗ Don't fall back to the same failure mode: a Docling failure should not fall back to another text-only extractor; pick a backend that handles the input class Docling cannot.
✗ Don't skip quality checks on fallback results: a returned result that fails confidence checks is worse than a clean failure because it silently corrupts downstream tables.
✗ Don't route by file extension alone: .pdf covers both born-digital and scanned; inspect the text layer, not the suffix.
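
The first rule can be checked mechanically at startup. The sketch below assumes a capability map describing which input classes each backend handles (the map itself, the class names, and the `chain_problems` helper are illustrative assumptions) and flags any primary whose first fallback covers less than the primary does.

```python
from typing import Dict, List, Set

# Assumed capability map, for illustration only.
HANDLES: Dict[str, Set[str]] = {
    "docling": {"born_digital"},
    "document_ai": {"born_digital", "scanned"},
    "vlm_gpt4o": {"born_digital", "scanned", "complex_layout"},
    "vlm_gemini": {"born_digital", "scanned", "complex_layout"},
}

# Same ordering as the fallback chain in the walkthrough.
FALLBACKS: Dict[str, List[str]] = {
    "docling": ["document_ai", "vlm_gemini"],
    "document_ai": ["vlm_gpt4o", "docling"],
    "vlm_gpt4o": ["vlm_gemini", "document_ai"],
    "vlm_gemini": ["vlm_gpt4o", "document_ai"],
}


def chain_problems(fallbacks: Dict[str, List[str]], handles: Dict[str, Set[str]]) -> List[str]:
    """Return human-readable problems found in a fallback configuration."""
    problems: List[str] = []
    for primary, chain in fallbacks.items():
        if primary in chain:
            problems.append(f"{primary} falls back to itself")
        for backend in chain:
            if backend not in handles:
                problems.append(f"{primary} falls back to unknown backend {backend}")
        # The first alternate should cover at least every input class the
        # primary covers, so the same failure mode is not simply repeated.
        if chain and chain[0] in handles:
            missing = handles.get(primary, set()) - handles[chain[0]]
            if missing:
                problems.append(f"{primary} -> {chain[0]} loses coverage of {sorted(missing)}")
    return problems


print(chain_problems(FALLBACKS, HANDLES))
print(chain_problems({"document_ai": ["docling"]}, HANDLES))
```

Under this assumed map the walkthrough's chain passes, while a chain that sends Document AI failures straight to Docling is flagged because Docling does not cover scanned input.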
