Build a routing system selecting the optimal extraction method
Route documents to the best extractor based on type, quality, and cost constraints. Use Docling for standard docs, VLMs for complex layouts, Document AI for high-volume.
Introduction
When you process a pile of mixed PDFs — born-digital invoices, scanned contracts, image-heavy reports — sending every file to the same extractor either wastes money on simple documents or produces garbage on hard ones. Teams that skip routing usually discover the cost after their first month: a four-figure OCR bill for documents Docling could have handled for free, or empty rows in production tables because a scanned form went through a text-only extractor. The fix is a routing layer that inspects each document, picks the backend that will produce the best result at the lowest cost, and falls back when the primary choice fails. By the end of this lesson you'll be able to score a document by complexity, route it to one of several extraction backends (Docling, Document AI, or a VLM), and chain fallbacks so a single failure never drops a document.
Key Terminology
- Extraction backend — a concrete adapter (Docling, Document AI, GPT-4o VLM, Gemini VLM) that turns raw PDF bytes into structured text; routing chooses which one to invoke per document.
- Complexity score — a 0.0–1.0 number derived from cheap document features (text layer, tables, images, page count) that proxies "how hard is this to extract"; thresholds on this score drive routing decisions.
- Routing config — the externalised set of thresholds and limits (VLM threshold, max pages for VLM, batch-size cutover) that lets you re-tune cost vs. accuracy without redeploying code.
- Fallback chain — an ordered list of alternate backends to try when the primary fails or produces a result that fails quality checks; ensures no document is silently dropped.
Concepts
The routing system has three pieces: a feature extractor that runs in milliseconds, a decision function that maps features to a backend, and a fallback chain that handles failures. Together they form a pipeline where every document is handled by the cheapest backend that can produce acceptable quality.
Cheap feature extraction
Routing decisions must be cheap or the router itself becomes the bottleneck. Use PyMuPDF to inspect a document in under 100 ms per file: check for an extractable text layer (born-digital vs scanned), count table-like blocks, look for embedded images, and read the page count. These signals collapse into a single complexity score so downstream code only needs one threshold per backend (see Code Walkthrough).
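As a rough illustration of how the signals collapse into one number, the sketch below scores two documents with the same additive weights the walkthrough uses (+0.4, +0.3, +0.2, +0.1). The Signals container, the complexity_score name, and the rounding step are assumptions made for this example; it takes the signals as plain values so it runs without PyMuPDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Cheap per-document signals (hypothetical container for this sketch)."""
    has_text_layer: bool
    has_tables: bool
    has_images: bool
    page_count: int


def complexity_score(s: Signals, long_doc_pages: int = 50) -> float:
    """Collapse the signals into one 0.0-1.0 score using additive weights."""
    score = 0.0
    if not s.has_text_layer:
        score += 0.4  # scanned input: OCR or a VLM is required
    if s.has_tables:
        score += 0.3
    if s.has_images:
        score += 0.2
    if s.page_count > long_doc_pages:
        score += 0.1
    # Rounding (an addition in this sketch) hides float noise such as 0.6000000000000001.
    return round(min(score, 1.0), 2)


born_digital = Signals(has_text_layer=True, has_tables=False, has_images=False, page_count=3)
scanned_form = Signals(has_text_layer=False, has_tables=False, has_images=True, page_count=2)
print(complexity_score(born_digital))  # 0.0
print(complexity_score(scanned_form))  # 0.6
```

A scanned page with no other signals already scores 0.4, and a scan is usually stored as a page image, so it tends to land at 0.6.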
Threshold-driven routing
Each backend has a distinct cost-and-capability profile: Docling is free but text-only, Document AI charges per page but handles scans and scales to large batches, VLMs cost the most per page but understand complex layouts and handwriting. The router encodes that hierarchy as two thresholds (VLM threshold higher than Document-AI threshold) plus two limits (max pages for VLM, batch-size cutover). Externalising these in a RoutingConfig lets you re-tune the cost curve without touching the routing logic.
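One way to externalise the thresholds is to load them from a JSON document at startup. The sketch below is a minimal version under stated assumptions: the field names and defaults mirror the RoutingConfig in the walkthrough, while load_routing_config, the JSON format, and the validation rules are hypothetical additions for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RoutingConfig:
    complexity_threshold_vlm: float = 0.6
    complexity_threshold_docai: float = 0.3
    max_pages_for_vlm: int = 20
    batch_size_threshold: int = 100

    def __post_init__(self) -> None:
        # The VLM threshold must sit above the Document AI threshold.
        if not 0.0 <= self.complexity_threshold_docai < self.complexity_threshold_vlm <= 1.0:
            raise ValueError("expected 0 <= docai threshold < vlm threshold <= 1")
        if self.max_pages_for_vlm < 1 or self.batch_size_threshold < 1:
            raise ValueError("page and batch limits must be positive")


def load_routing_config(raw_json: str) -> RoutingConfig:
    """Build a RoutingConfig from a JSON object; missing keys keep their defaults."""
    data = json.loads(raw_json)
    known = {f.name for f in fields(RoutingConfig)}
    unknown = set(data) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring a misspelled key.
        raise ValueError(f"unknown routing keys: {sorted(unknown)}")
    return RoutingConfig(**data)


cfg = load_routing_config('{"complexity_threshold_vlm": 0.7, "max_pages_for_vlm": 10}')
print(cfg)
```

Validating the ordering of the two thresholds at load time catches a bad config before any document is routed with it.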
Fallback chains for resilience
Any single backend will fail on some inputs: Docling chokes on scans, Document AI quotas spike during incidents, VLM providers rate-limit. Instead of dropping the document, the pipeline tries an ordered chain of alternates and accepts the first result that passes quality checks. The chain is keyed by primary backend because the best fallback depends on why the primary is likely to have failed: Docling → Document AI (the doc was probably scanned), VLM → other VLM (the first provider was probably rate-limited).
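The "try each alternate, accept the first result that passes" behaviour can be sketched independently of any real backend. In the example below, first_acceptable, fake_run, and the string backend names are hypothetical stand-ins; real adapters would replace fake_run.

```python
from typing import Callable, Iterable, Optional


def first_acceptable(
    backends: Iterable[str],
    run: Callable[[str], str],
    accept: Callable[[str], bool],
) -> tuple[str, str]:
    """Return (backend, result) for the first backend whose result is accepted."""
    last_error: Optional[Exception] = None
    for name in backends:
        try:
            result = run(name)
        except Exception as exc:  # a failed backend must not stop the chain
            last_error = exc
            continue
        if accept(result):
            return name, result
        last_error = ValueError(f"quality check failed for {name}")
    raise RuntimeError(f"all backends failed: {last_error}")


def fake_run(name: str) -> str:
    # Stand-in for real adapters: Docling "fails" as if the file were scanned.
    if name == "docling":
        raise RuntimeError("no text layer")
    return f"text from {name}"


backend, text = first_acceptable(
    ["docling", "document_ai", "vlm_gemini"],
    fake_run,
    accept=lambda t: len(t) > 0,
)
print(backend, "|", text)  # document_ai | text from document_ai
```

The chain stops at Document AI here, so the more expensive VLM is never called.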
Code Walkthrough
The two snippets below implement the three concepts end to end: the first combines cheap feature extraction with threshold-driven routing, and the second wires the router into a pipeline with fallback chains.
Classify and route
```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional

import fitz  # PyMuPDF


class ExtractionBackend(str, Enum):
    DOCLING = "docling"
    DOCUMENT_AI = "document_ai"
    VLM_GPT4O = "vlm_gpt4o"
    VLM_GEMINI = "vlm_gemini"


@dataclass
class DocumentFeatures:
    has_text_layer: bool
    page_count: int
    has_tables: bool
    has_images: bool
    estimated_complexity: float


def analyze_document(file_path: Path) -> DocumentFeatures:
    has_text = has_tables = has_images = False
    doc = fitz.open(str(file_path))
    try:
        page_count = doc.page_count
        for page in doc:
            # More than 50 characters of text on any page counts as a text layer.
            if len(page.get_text().strip()) > 50:
                has_text = True
            if page.get_images():
                has_images = True
            # Heuristic: several multi-line text blocks on one page suggest a table.
            blocks = page.get_text("dict")["blocks"]
            table_like = sum(1 for b in blocks if b.get("lines") and len(b["lines"]) > 3)
            if table_like > 2:
                has_tables = True
    finally:
        doc.close()

    # Additive weights: a missing text layer dominates, then tables, images, length.
    complexity = (
        (0.4 if not has_text else 0.0)
        + (0.3 if has_tables else 0.0)
        + (0.2 if has_images else 0.0)
        + (0.1 if page_count > 50 else 0.0)
    )
    return DocumentFeatures(
        has_text_layer=has_text,
        page_count=page_count,
        has_tables=has_tables,
        has_images=has_images,
        estimated_complexity=min(complexity, 1.0),
    )


@dataclass
class RoutingConfig:
    complexity_threshold_vlm: float = 0.6
    complexity_threshold_docai: float = 0.3
    max_pages_for_vlm: int = 20
    batch_size_threshold: int = 100


class ExtractionRouter:
    def __init__(self, config: Optional[RoutingConfig] = None):
        # Create the default per instance rather than sharing one default object.
        self.config = config if config is not None else RoutingConfig()

    def route(self, features: DocumentFeatures, batch_size: int = 1) -> ExtractionBackend:
        c = self.config
        # 1. Scanned and complex: VLM for short documents, Document AI for long ones.
        if not features.has_text_layer and features.estimated_complexity >= c.complexity_threshold_vlm:
            if features.page_count <= c.max_pages_for_vlm:
                return ExtractionBackend.VLM_GPT4O
            return ExtractionBackend.DOCUMENT_AI
        # 2. Large batches go to Document AI regardless of complexity.
        if batch_size >= c.batch_size_threshold:
            return ExtractionBackend.DOCUMENT_AI
        # 3. Moderately complex documents go to Document AI.
        if features.estimated_complexity >= c.complexity_threshold_docai:
            return ExtractionBackend.DOCUMENT_AI
        # 4. Everything else is simple enough for Docling.
        return ExtractionBackend.DOCLING
```
- ExtractionBackend enumerates the targets; each maps to one adapter and one cost profile, so the router only needs to return an enum value.
- analyze_document is meant to stay cheap: it reads each page's text and image list through PyMuPDF and never renders a page or runs OCR. Whether it stays under 100 ms depends on page count, so measure it on your own documents.
- Complexity scoring: a missing text layer is the dominant signal (+0.4) because it means OCR or a VLM is required; tables (+0.3) and images (+0.2) add structural difficulty, and documents over 50 pages add +0.1.
- route() applies thresholds in priority order: scanned-and-complex goes to a VLM (short docs) or Document AI (long docs), large batches always go to Document AI to amortise API overhead, documents at or above the Document AI complexity threshold also go to Document AI, and everything else falls through to Docling.
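To see the priority order in action, the sketch below traces a few hand-written feature sets through a condensed restatement of route(). The pick_backend function and the sample cases are hypothetical; the thresholds are the RoutingConfig defaults from the snippet above.

```python
def pick_backend(
    has_text_layer: bool,
    complexity: float,
    page_count: int,
    batch_size: int = 1,
    vlm_threshold: float = 0.6,
    docai_threshold: float = 0.3,
    max_pages_for_vlm: int = 20,
    batch_size_threshold: int = 100,
) -> str:
    """Condensed restatement of the route() priority order, for tracing only."""
    if not has_text_layer and complexity >= vlm_threshold:
        return "vlm_gpt4o" if page_count <= max_pages_for_vlm else "document_ai"
    if batch_size >= batch_size_threshold:
        return "document_ai"
    if complexity >= docai_threshold:
        return "document_ai"
    return "docling"


cases = {
    "born-digital invoice": pick_backend(True, 0.0, 2),
    "scanned form, 3 pages": pick_backend(False, 0.6, 3),
    "scanned contract, 80 pages": pick_backend(False, 0.7, 80),
    "digital report with tables": pick_backend(True, 0.5, 12),
    "simple doc in a 500-file batch": pick_backend(True, 0.0, 1, batch_size=500),
}
for name, backend in cases.items():
    print(f"{name:32} -> {backend}")
```

The 80-page scan is complex enough for a VLM but exceeds the page cap, so it goes to Document AI instead.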
Wire fallbacks into the pipeline
```python
# Uses Path, ExtractionBackend, DocumentFeatures and ExtractionRouter from the previous snippet.
class ExtractionPipeline:
    def __init__(self, router: ExtractionRouter, adapters: dict):
        self.router = router
        self.adapters = adapters  # maps ExtractionBackend -> adapter object
        # Ordered alternates to try when the primary backend fails.
        self.fallback_chain = {
            ExtractionBackend.DOCLING: [ExtractionBackend.DOCUMENT_AI, ExtractionBackend.VLM_GEMINI],
            ExtractionBackend.DOCUMENT_AI: [ExtractionBackend.VLM_GPT4O, ExtractionBackend.DOCLING],
            ExtractionBackend.VLM_GPT4O: [ExtractionBackend.VLM_GEMINI, ExtractionBackend.DOCUMENT_AI],
            ExtractionBackend.VLM_GEMINI: [ExtractionBackend.VLM_GPT4O, ExtractionBackend.DOCUMENT_AI],
        }

    def extract(self, file_path: Path, features: DocumentFeatures):
        primary = self.router.route(features)
        attempts = [primary] + self.fallback_chain.get(primary, [])
        last_error = None
        for backend in attempts:
            try:
                adapter = self.adapters[backend]
                raw = adapter.extract(file_path)
                doc = adapter.to_unified_document(raw, str(file_path))
                # _passes_quality_check is not shown in this lesson; it is
                # assumed to be defined on this class.
                if self._passes_quality_check(doc):
                    return doc
                last_error = ValueError(f"quality check failed for {backend}")
            except Exception as e:
                # Record the failure and move on to the next backend.
                last_error = e
        # Every backend was tried; surface the last failure to the caller.
        raise RuntimeError(f"all backends failed for {file_path}: {last_error}") from last_error
```
- The fallback chain is keyed by primary backend so each failure mode gets a sensible alternate — a Docling failure most likely means the doc was actually scanned, so Document AI is the right next try, not another text-only extractor.
- extract() accepts the first result that both returns and passes quality checks; a returned-but-low-confidence result is treated as a failure and triggers the next backend.
- Only when every backend in the chain is exhausted does the pipeline raise, so no document is silently dropped.
You'll know it works when a small evaluation set of mixed PDFs (one born-digital invoice, one scanned form, one image-heavy report) routes to three different backends and all three return unified documents that pass _passes_quality_check.
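The lesson does not show what the quality check looks like, so the following is only one plausible shape for it. The UnifiedDocument fields, the passes_quality_check name, and both thresholds are assumptions for illustration; the real unified document and its confidence signal may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnifiedDocument:
    """Hypothetical shape; the real unified document may differ."""
    text: str
    page_count: int
    confidence: Optional[float] = None  # None when the backend reports no score


def passes_quality_check(
    doc: UnifiedDocument,
    min_chars_per_page: int = 50,   # placeholder threshold
    min_confidence: float = 0.7,    # placeholder threshold
) -> bool:
    """Reject empty or low-confidence extractions so the next backend is tried."""
    if doc.page_count <= 0:
        return False
    chars_per_page = len(doc.text.strip()) / doc.page_count
    if chars_per_page < min_chars_per_page:
        return False
    if doc.confidence is not None and doc.confidence < min_confidence:
        return False
    return True


good = UnifiedDocument(text="Invoice 1042 " * 20, page_count=1, confidence=0.93)
empty = UnifiedDocument(text="   ", page_count=4)
shaky = UnifiedDocument(text="x" * 500, page_count=1, confidence=0.41)
print(passes_quality_check(good), passes_quality_check(empty), passes_quality_check(shaky))
# True False False
```

A near-empty result from a text-only extractor on a scanned file fails the characters-per-page test, which is what pushes the document to the next backend in the chain.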
Do's and Don'ts
Now that you have worked through the implementation, the practices below separate a durable approach from a fragile one.
Do's
- Externalise routing thresholds: keep RoutingConfig values in config, not constants, so you can re-tune the cost curve without a redeploy after a backend price change.
- Cap VLM page count: VLM cost scales linearly with pages; the max_pages_for_vlm limit prevents a single 200-page scan from blowing the budget.
- Keep feature extraction cheap: analysis must stay under ~100 ms per doc; if it grows beyond that, the router becomes the bottleneck the pipeline was meant to avoid.
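One simple way to keep the analysis step honest about its budget is to time it and flag documents that exceed the limit. The helper names below (timed, over_budget) are hypothetical, and the lambda is a stand-in for a real analyze_document call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> tuple[T, float]:
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def over_budget(elapsed_ms: float, budget_ms: float = 100.0) -> bool:
    """True when a measurement exceeds the per-document analysis budget."""
    return elapsed_ms > budget_ms


# Stand-in for analyze_document(path); replace the lambda with the real call.
features, elapsed_ms = timed(lambda: {"page_count": 3})
if over_budget(elapsed_ms):
    print(f"feature extraction took {elapsed_ms:.1f} ms, over budget")
```

Logging these measurements per document makes it easy to spot when long files start to dominate routing time.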
Don'ts
- Don't fall back to the same failure mode: a Docling failure should not fall back to another text-only extractor; pick a backend that handles the input class Docling cannot.
- Don't skip quality checks on fallback results: a returned result that fails confidence checks is worse than a clean failure because it silently corrupts downstream tables.
- Don't route by file extension alone: .pdf covers both born-digital and scanned documents; inspect the text layer, not the suffix.