Intermediate

Document Processing Pipeline

Combine OCR, classification, and large language models to build intelligent document processing systems that extract structured data from invoices, contracts, receipts, and other document types at scale.

What Is Intelligent Document Processing (IDP)?

Intelligent Document Processing (IDP) is a multi-model approach that transforms unstructured documents — scanned PDFs, photographs of receipts, handwritten forms, multi-page contracts — into structured, actionable data. Unlike traditional OCR that simply converts images to text, IDP combines vision models for text extraction, classification models to identify document types, and large language models to understand context, extract entities, and answer questions about the content.

The global IDP market is projected to exceed $5 billion by 2027, driven by enterprises that still process millions of paper and PDF documents daily. Banks process loan applications. Insurance companies handle claims. Healthcare organizations digitize patient records. Legal teams review contracts. Every one of these workflows benefits from a multi-model pipeline that can read, classify, understand, and extract data from documents automatically.

💡
Why multi-model? No single AI model handles the full document processing pipeline well. OCR models excel at text extraction but cannot understand meaning. LLMs understand language brilliantly but cannot read pixels from a scanned page. Classification models route documents efficiently but cannot extract specific fields. By combining all three, you get a system far more capable than any single model.

The Document Processing Pipeline

A production IDP pipeline follows six stages, each handled by a specialized model or component:

Stage | Task | Models / Tools | Output
----- | ---- | -------------- | ------
1. Ingest | Accept documents from email, upload, scan, API | File parsers, PDF libraries | Raw file bytes + metadata
2. OCR / Extract | Convert images and scanned pages to text | Tesseract, PaddleOCR, Azure Doc Intelligence, AWS Textract | Raw text + bounding boxes + confidence scores
3. Classify | Identify document type (invoice, contract, receipt, etc.) | BERT, DistilBERT, custom classifiers | Document type label + confidence
4. Parse | Extract structured fields based on document type | Layout models, template matching, regex | Key-value pairs (vendor, amount, date, etc.)
5. Enrich | Summarize, extract entities, answer questions, validate | Claude, GPT-4, Gemini, Mistral | Summaries, entities, validation flags
6. Store | Save structured output to database, index, or downstream system | PostgreSQL, Elasticsearch, S3 | Database records, searchable index
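Before diving into real models, it helps to see the six stages as a plain function chain. The sketch below uses illustrative stubs — `ingest`, `ocr`, `classify`, `parse`, `enrich`, and `store` are placeholders standing in for the real components, not actual model calls:

```python
# Minimal pipeline skeleton: each stage takes the previous stage's
# output dict and returns an enriched copy. Bodies are stubs.
def ingest(path: str) -> dict:
    return {"path": path, "raw_bytes": b""}  # file bytes + metadata

def ocr(doc: dict) -> dict:
    return {**doc, "text": "Invoice #INV-001 Total: $50.00"}

def classify(doc: dict) -> dict:
    return {**doc, "type": "invoice", "confidence": 0.97}

def parse(doc: dict) -> dict:
    return {**doc, "fields": {"invoice_number": "INV-001", "total": 50.0}}

def enrich(doc: dict) -> dict:
    return {**doc, "summary": "Invoice INV-001 totaling $50.00"}

def store(doc: dict) -> dict:
    return {**doc, "stored": True}  # write to DB in a real system

def run_pipeline(path: str) -> dict:
    doc = ingest(path)
    for stage in (ocr, classify, parse, enrich, store):
        doc = stage(doc)
    return doc

result = run_pipeline("invoice.pdf")
```

The full implementation later in this lesson follows exactly this shape, with each stub replaced by a real OCR engine, classifier, or LLM call.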

OCR and Text Extraction Models

The foundation of any document processing pipeline is accurate text extraction. Here is how the leading OCR solutions compare:

Solution | Type | Strengths | Limitations | Cost
-------- | ---- | --------- | ----------- | ----
Tesseract 5 | Open source | Free, 100+ languages, LSTM engine, self-hosted | Struggles with complex layouts, tables, handwriting | Free
PaddleOCR | Open source | Excellent accuracy, 80+ languages, lightweight, good table detection | Smaller community than Tesseract, fewer integrations | Free
Azure Document Intelligence | Cloud API | Best table extraction, prebuilt models for invoices/receipts, layout analysis | Azure dependency, cost at scale | $1.50 per 1K pages
AWS Textract | Cloud API | Strong form extraction, query-based extraction, AWS ecosystem | AWS lock-in, limited language support | $1.50 per 1K pages
Google Document AI | Cloud API | Strong handwriting support, custom processors, good accuracy | GCP dependency | $1.50 per 1K pages

Document Classification with BERT

After extracting text, you need to classify the document type to determine which extraction rules to apply. A fine-tuned BERT or DistilBERT classifier is the standard approach — fast inference (under 50ms), high accuracy (95%+ with good training data), and easy to deploy.

# Document classifier using a fine-tuned DistilBERT model
from transformers import pipeline

# Load a fine-tuned document classifier
classifier = pipeline(
    "text-classification",
    model="./models/document-classifier",
    tokenizer="distilbert-base-uncased"
)

# Define document type labels
DOCUMENT_TYPES = {
    "LABEL_0": "invoice",
    "LABEL_1": "contract",
    "LABEL_2": "receipt",
    "LABEL_3": "medical_record",
    "LABEL_4": "tax_form",
    "LABEL_5": "bank_statement",
    "LABEL_6": "insurance_claim",
    "LABEL_7": "legal_filing"
}

def classify_document(extracted_text: str) -> dict:
    """Classify a document based on its extracted text."""
    # Truncate to roughly the model's 512-token limit (~2,000 chars)
    truncated = extracted_text[:2000]
    result = classifier(truncated)[0]

    doc_type = DOCUMENT_TYPES.get(result["label"], "unknown")
    confidence = result["score"]

    return {
        "document_type": doc_type,
        "confidence": confidence,
        "needs_review": confidence < 0.85
    }

# Example usage
ocr_text = "Invoice #INV-2026-0042 Date: March 15, 2026..."
classification = classify_document(ocr_text)
print(classification)
# {'document_type': 'invoice', 'confidence': 0.97, 'needs_review': False}

Full Document Processing Pipeline

Here is a complete end-to-end pipeline that takes a PDF document, runs OCR, classifies it, extracts entities with an LLM, and returns structured JSON output:

import io
import json
from pathlib import Path
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
from transformers import pipeline
from anthropic import Anthropic

class DocumentProcessor:
    """Multi-model document processing pipeline."""

    def __init__(self):
        # Model 1: Document classifier (DistilBERT)
        self.classifier = pipeline(
            "text-classification",
            model="./models/document-classifier"
        )
        # Model 2: LLM for entity extraction and enrichment
        self.llm = Anthropic()

        # Extraction schemas per document type
        self.schemas = {
            "invoice": {
                "fields": ["vendor_name", "invoice_number", "date",
                           "due_date", "line_items", "subtotal",
                           "tax", "total", "payment_terms"],
                "prompt_template": self._invoice_prompt
            },
            "contract": {
                "fields": ["parties", "effective_date", "termination_date",
                           "key_terms", "obligations", "governing_law"],
                "prompt_template": self._contract_prompt
            },
            "receipt": {
                "fields": ["merchant", "date", "items",
                           "subtotal", "tax", "total", "payment_method"],
                "prompt_template": self._receipt_prompt
            },
            "medical_record": {
                "fields": ["patient_id", "date", "provider",
                           "diagnosis", "medications", "procedures"],
                "prompt_template": self._medical_prompt
            }
        }

    def process(self, file_path: str) -> dict:
        """Process a document through the full pipeline."""
        # Stage 1: Ingest
        pages = self._ingest(file_path)

        # Stage 2: OCR - extract text from all pages
        extracted = self._ocr(pages)

        # Stage 3: Classify document type
        classification = self._classify(extracted["full_text"])

        # Stage 4 & 5: Parse and enrich with LLM
        doc_type = classification["document_type"]
        entities = self._extract_entities(
            extracted["full_text"], doc_type
        )

        # Stage 6: Structure output
        return {
            "file": file_path,
            "pages": len(pages),
            "document_type": doc_type,
            "classification_confidence": classification["confidence"],
            "extracted_text_length": len(extracted["full_text"]),
            "entities": entities,
            "page_texts": extracted["page_texts"],
            "needs_human_review": classification["confidence"] < 0.85
        }

    def _ingest(self, file_path: str) -> list:
        """Convert PDF pages to images for OCR."""
        path = Path(file_path)
        if path.suffix.lower() == ".pdf":
            return convert_from_path(file_path, dpi=300)
        else:
            return [Image.open(file_path)]

    def _ocr(self, pages: list) -> dict:
        """Run OCR on each page and combine results."""
        page_texts = []
        for i, page in enumerate(pages):
            text = pytesseract.image_to_string(
                page,
                config="--oem 3 --psm 6"  # LSTM engine, uniform block
            )
            page_texts.append({
                "page": i + 1,
                "text": text.strip(),
                "char_count": len(text.strip())
            })

        full_text = "\n\n---PAGE BREAK---\n\n".join(
            p["text"] for p in page_texts
        )
        return {"full_text": full_text, "page_texts": page_texts}

    def _classify(self, text: str) -> dict:
        """Classify the document type."""
        result = self.classifier(text[:2000])[0]
        doc_types = ["invoice", "contract", "receipt",
                     "medical_record", "tax_form", "bank_statement",
                     "insurance_claim", "legal_filing"]
        label_idx = int(result["label"].split("_")[-1])
        return {
            "document_type": doc_types[label_idx],
            "confidence": result["score"]
        }

    def _extract_entities(self, text: str, doc_type: str) -> dict:
        """Use LLM to extract structured entities."""
        schema = self.schemas.get(doc_type)
        if not schema:
            return self._generic_extraction(text)

        prompt = schema["prompt_template"](text)

        response = self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}]
        )

        # Parse JSON from LLM response
        response_text = response.content[0].text
        try:
            return json.loads(response_text)
        except json.JSONDecodeError:
            # Extract JSON block if wrapped in markdown
            start = response_text.find("{")
            end = response_text.rfind("}") + 1
            return json.loads(response_text[start:end])

    def _invoice_prompt(self, text: str) -> str:
        return f"""Extract all structured data from this invoice.
Return valid JSON with these fields:
- vendor_name (string)
- invoice_number (string)
- date (YYYY-MM-DD)
- due_date (YYYY-MM-DD)
- line_items (array of objects: description, quantity, unit_price, total)
- subtotal (number)
- tax (number)
- total (number)
- payment_terms (string)
- currency (string, ISO 4217 code)

If a field is not found, use null.

DOCUMENT TEXT:
{text}

Return ONLY valid JSON, no explanation."""

    def _contract_prompt(self, text: str) -> str:
        return f"""Extract all structured data from this contract.
Return valid JSON with these fields:
- parties (array of strings)
- effective_date (YYYY-MM-DD)
- termination_date (YYYY-MM-DD or null)
- contract_type (string)
- key_terms (array of strings - key obligations and terms)
- obligations (object mapping party name to array of obligations)
- governing_law (string - jurisdiction)
- renewal_terms (string or null)

If a field is not found, use null.

DOCUMENT TEXT:
{text}

Return ONLY valid JSON, no explanation."""

    def _receipt_prompt(self, text: str) -> str:
        return f"""Extract all data from this receipt. Return valid JSON:
- merchant (string)
- date (YYYY-MM-DD)
- items (array: name, quantity, price)
- subtotal, tax, total (numbers)
- payment_method (string)

DOCUMENT TEXT:
{text}

Return ONLY valid JSON."""

    def _medical_prompt(self, text: str) -> str:
        return f"""Extract structured data from this medical record.
Return valid JSON:
- patient_id (string)
- date (YYYY-MM-DD)
- provider (string)
- diagnosis (array of strings)
- medications (array: name, dosage, frequency)
- procedures (array of strings)
- notes (string)

DOCUMENT TEXT:
{text}

Return ONLY valid JSON."""

    def _generic_extraction(self, text: str) -> dict:
        """Fallback extraction for unknown document types."""
        response = self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[{"role": "user", "content": f"""Extract all key
information from this document as structured JSON. Identify dates,
names, amounts, reference numbers, and any other important fields.

DOCUMENT TEXT:
{text}

Return ONLY valid JSON."""}]
        )
        text_out = response.content[0].text
        start = text_out.find("{")
        end = text_out.rfind("}") + 1
        return json.loads(text_out[start:end])


# Run the pipeline
processor = DocumentProcessor()
result = processor.process("invoices/scan_march_2026.pdf")
print(json.dumps(result, indent=2))
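The `find("{")` / `rfind("}")` fallback in `_extract_entities` breaks when the model emits commentary containing braces around the JSON. A slightly more defensive parser (a standalone sketch, not part of any SDK) tries the raw text, then a stripped markdown fence, then the outermost brace pair:

```python
import json
import re

def parse_llm_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from an LLM response."""
    # 1. Try the raw text directly.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Strip a ```json ... ``` (or bare ```) fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Fall back to the outermost brace pair.
    start, end = text.find("{"), text.rfind("}") + 1
    if start != -1 and end > start:
        return json.loads(text[start:end])
    raise ValueError("No JSON object found in LLM response")

wrapped = 'Here you go:\n```json\n{"vendor_name": "Acme"}\n```\nDone.'
```

Raising on failure (rather than returning an empty dict) lets the pipeline route unparseable responses to the human-review queue instead of silently storing nothing.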

Using Azure Document Intelligence API

For production workloads where accuracy on complex layouts, tables, and forms is critical, Azure Document Intelligence (formerly Form Recognizer) provides prebuilt models and layout analysis that outperform open-source OCR on structured documents:

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
import json

class AzureDocumentProcessor:
    """Document processing using Azure Document Intelligence."""

    def __init__(self, endpoint: str, key: str):
        self.client = DocumentIntelligenceClient(
            endpoint=endpoint,
            credential=AzureKeyCredential(key)
        )

    def process_invoice(self, file_path: str) -> dict:
        """Extract structured data from an invoice using
        Azure's prebuilt invoice model."""
        with open(file_path, "rb") as f:
            poller = self.client.begin_analyze_document(
                "prebuilt-invoice",
                body=f,
                content_type="application/pdf"
            )
        result = poller.result()

        invoices = []
        for invoice in result.documents:
            fields = invoice.fields
            extracted = {
                "vendor": self._get_field(fields, "VendorName"),
                "vendor_address": self._get_field(
                    fields, "VendorAddress"
                ),
                "customer": self._get_field(fields, "CustomerName"),
                "invoice_number": self._get_field(
                    fields, "InvoiceId"
                ),
                "date": self._get_field(fields, "InvoiceDate"),
                "due_date": self._get_field(fields, "DueDate"),
                "subtotal": self._get_field(fields, "SubTotal"),
                "tax": self._get_field(fields, "TotalTax"),
                "total": self._get_field(fields, "InvoiceTotal"),
                "currency": self._get_field(
                    fields, "CurrencyCode"
                ),
                "line_items": self._extract_line_items(fields),
                "confidence": invoice.confidence
            }
            invoices.append(extracted)
        return {"invoices": invoices, "page_count": len(result.pages)}

    def extract_tables(self, file_path: str) -> list:
        """Extract all tables from a document using layout
        analysis - works on any document type."""
        with open(file_path, "rb") as f:
            poller = self.client.begin_analyze_document(
                "prebuilt-layout",
                body=f,
                content_type="application/pdf"
            )
        result = poller.result()

        tables = []
        for table in result.tables:
            rows = {}
            for cell in table.cells:
                row_idx = cell.row_index
                if row_idx not in rows:
                    rows[row_idx] = {}
                rows[row_idx][cell.column_index] = {
                    "content": cell.content,
                    "kind": cell.kind  # "columnHeader" or "content"
                }

            # Convert to list of dicts using headers
            headers = [
                rows[0][col]["content"]
                for col in sorted(rows.get(0, {}).keys())
            ]
            table_data = []
            for row_idx in sorted(rows.keys()):
                if row_idx == 0:
                    continue
                row_dict = {}
                for col_idx, header in enumerate(headers):
                    cell_data = rows[row_idx].get(col_idx, {})
                    row_dict[header] = cell_data.get("content", "")
                table_data.append(row_dict)

            tables.append({
                "row_count": table.row_count,
                "column_count": table.column_count,
                "headers": headers,
                "data": table_data
            })
        return tables

    def _get_field(self, fields, name):
        field = fields.get(name)
        return field.content if field else None

    def _extract_line_items(self, fields):
        items_field = fields.get("Items")
        if not items_field:
            return []
        items = []
        for item in items_field.value:
            f = item.value
            items.append({
                "description": self._get_field(f, "Description"),
                "quantity": self._get_field(f, "Quantity"),
                "unit_price": self._get_field(f, "UnitPrice"),
                "amount": self._get_field(f, "Amount")
            })
        return items


# Usage
processor = AzureDocumentProcessor(
    endpoint="https://your-resource.cognitiveservices.azure.com/",
    key="your-api-key"
)
invoice_data = processor.process_invoice("scan.pdf")
tables = processor.extract_tables("financial_report.pdf")
print(json.dumps(invoice_data, indent=2, default=str))
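The dictionaries returned by `extract_tables` flatten naturally to CSV with the standard library. A small sketch over the `headers` + `data` structure shown above (the sample table is illustrative):

```python
import csv
import io

def table_to_csv(table: dict) -> str:
    """Serialize one extracted table (headers + row dicts) to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=table["headers"])
    writer.writeheader()
    writer.writerows(table["data"])
    return buf.getvalue()

sample = {
    "headers": ["Description", "Qty", "Amount"],
    "data": [
        {"Description": "Widget", "Qty": "2", "Amount": "10.00"},
        {"Description": "Gadget", "Qty": "1", "Amount": "25.00"},
    ],
}
csv_text = table_to_csv(sample)
```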

Handling Multi-Page Documents

Real-world documents are rarely single-page. Contracts run 50+ pages, medical records span entire patient histories, and financial reports contain dozens of tables. Here are patterns for handling multi-page processing efficiently:

import asyncio
from concurrent.futures import ThreadPoolExecutor

import pytesseract

class MultiPageProcessor:
    """Efficient multi-page document handling."""

    def __init__(self, max_workers: int = 4):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    async def process_pages_parallel(self, pages: list) -> list:
        """OCR pages in parallel for faster processing."""
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(
                self.executor,
                self._ocr_single_page,
                page, i
            )
            for i, page in enumerate(pages)
        ]
        results = await asyncio.gather(*tasks)
        return sorted(results, key=lambda x: x["page"])

    def _ocr_single_page(self, page_image, page_num: int) -> dict:
        """OCR a single page with Tesseract."""
        text = pytesseract.image_to_string(page_image)
        data = pytesseract.image_to_data(
            page_image, output_type=pytesseract.Output.DICT
        )

        # Calculate average confidence
        confidences = [
            int(c) for c in data["conf"] if int(c) > 0
        ]
        avg_confidence = (
            sum(confidences) / len(confidences)
            if confidences else 0
        )

        return {
            "page": page_num + 1,
            "text": text.strip(),
            "word_count": len(text.split()),
            "avg_confidence": round(avg_confidence, 2),
            "low_confidence": avg_confidence < 70
        }

    def detect_scanned_vs_native(self, file_path: str) -> str:
        """Detect whether a PDF is scanned or has native text."""
        import fitz  # PyMuPDF
        doc = fitz.open(file_path)
        text_pages = 0
        for page in doc:
            text = page.get_text()
            if len(text.strip()) > 50:
                text_pages += 1

        ratio = text_pages / len(doc) if len(doc) > 0 else 0
        if ratio > 0.8:
            return "native"
        elif ratio > 0.2:
            return "mixed"
        else:
            return "scanned"

Document Types and Industry Use Cases

Industry | Document Types | Key Extraction Fields | Volume
-------- | -------------- | --------------------- | ------
Finance | Invoices, bank statements, tax forms, loan applications | Amounts, dates, account numbers, tax IDs | Millions/month
Healthcare | Medical records, prescriptions, insurance claims, lab reports | Patient IDs, diagnoses, medications, procedure codes | Thousands/day
Legal | Contracts, court filings, patents, deeds | Parties, dates, terms, obligations, jurisdictions | Hundreds/day
Insurance | Claims, policies, accident reports, appraisals | Policy numbers, claim amounts, dates, descriptions | Thousands/day
Real Estate | Leases, purchase agreements, inspection reports | Property details, terms, parties, amounts | Hundreds/week
Government | Permits, licenses, tax returns, applications | Applicant info, dates, reference numbers, status | Tens of thousands/day

Output Formats and Storage

After processing, structured data can be stored in multiple formats depending on your downstream needs:

💡

Common output formats:

  • JSON: Flexible, nested structures for complex documents. Ideal for API responses and NoSQL databases.
  • Database records: Flat rows in PostgreSQL or MySQL for querying and reporting. Best for high-volume invoice and receipt processing.
  • Searchable index: Elasticsearch or OpenSearch for full-text search across all processed documents. Combine with vector embeddings for semantic search.
  • Data lake: Parquet files in S3 or Azure Blob for analytics workloads and batch processing with Spark or BigQuery.
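The relational pattern — scalar fields in typed columns for querying, the full entity payload as JSON alongside — can be shown with the stdlib sqlite3 module as a stand-in for PostgreSQL (table and column names here are illustrative, not a fixed schema):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a real database in production
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        file TEXT,
        document_type TEXT,
        confidence REAL,
        needs_review INTEGER,
        entities_json TEXT        -- full extraction payload as JSON
    )
""")

def store_result(result: dict) -> None:
    """Persist one pipeline result: scalars in columns, entities as JSON."""
    conn.execute(
        "INSERT INTO documents (file, document_type, confidence, "
        "needs_review, entities_json) VALUES (?, ?, ?, ?, ?)",
        (
            result["file"],
            result["document_type"],
            result["classification_confidence"],
            int(result["needs_human_review"]),
            json.dumps(result["entities"]),
        ),
    )

store_result({
    "file": "scan.pdf",
    "document_type": "invoice",
    "classification_confidence": 0.97,
    "needs_human_review": False,
    "entities": {"vendor_name": "Acme", "total": 50.0},
})
row = conn.execute(
    "SELECT document_type, entities_json FROM documents"
).fetchone()
```

In PostgreSQL the `entities_json` column would typically be JSONB, which supports indexing and querying into the payload itself.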

Best Practices for Production Pipelines

  • Pre-process images: Deskew, denoise, and enhance contrast before OCR. This alone can improve accuracy by 10–20%.
  • Use confidence thresholds: Flag documents with low OCR or classification confidence for human review instead of silently producing bad data.
  • Batch LLM calls: Group multiple documents for entity extraction to reduce API costs and latency.
  • Cache classification results: Documents from the same source often have the same type. Cache to avoid redundant classification.
  • Version your extraction schemas: As document formats change, your extraction prompts need updating. Track schema versions alongside extracted data.
  • Test with real documents: Synthetic test data never captures the messiness of production documents. Build a test set from actual customer documents (anonymized).
  • Monitor accuracy continuously: Set up sampling-based human review to catch accuracy drift over time as document formats evolve.
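The first bullet can be as simple as a grayscale, contrast-stretch, and denoise pass with Pillow before handing the page to Tesseract — a minimal sketch; production pipelines usually add deskewing (e.g. via OpenCV) on top:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(page: Image.Image) -> Image.Image:
    """Basic cleanup before OCR: grayscale, contrast stretch, denoise."""
    gray = page.convert("L")                    # drop color channels
    contrasted = ImageOps.autocontrast(gray)    # stretch the histogram
    denoised = contrasted.filter(ImageFilter.MedianFilter(size=3))
    return denoised

# Usage: clean = preprocess_for_ocr(page)
#        text = pytesseract.image_to_string(clean)
noisy = Image.new("RGB", (200, 100), "white")
clean = preprocess_for_ocr(noisy)
```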

What's Next

In the next lesson, we build another powerful multi-model application: Conversational AI. You will learn to combine speech-to-text, intent classification, LLM response generation, and text-to-speech into a voice-enabled AI assistant that handles real-time conversations.