Advanced

Enhancements & Best Practices

Add OCR fallback, multi-language support, compliance features, and explore advanced document intelligence patterns. Includes a comprehensive FAQ.

Enhancement 1: OCR Fallback with Tesseract

When GPT-4 Vision is too expensive for high-volume processing, use Tesseract OCR as a cost-effective fallback:

# app/extraction/ocr_fallback.py
import pytesseract
from PIL import Image
import fitz
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class OCRFallback:
    """Tesseract OCR fallback for when Vision API is too costly."""

    def __init__(self, lang: str = "eng"):
        self.lang = lang

    def ocr_image(self, image_path: str) -> str:
        """Run OCR on an image file."""
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img, lang=self.lang)
        logger.info(f"OCR extracted {len(text)} chars from {Path(image_path).name}")
        return text

    def ocr_pdf_page(self, pdf_path: str, page_num: int = 0, dpi: int = 300) -> str:
        """Convert PDF page to image and run OCR."""
        doc = fitz.open(pdf_path)
        page = doc[page_num]
        mat = fitz.Matrix(dpi / 72, dpi / 72)
        pix = page.get_pixmap(matrix=mat)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        doc.close()
        return pytesseract.image_to_string(img, lang=self.lang)

    def ocr_pdf_all(self, pdf_path: str) -> str:
        """OCR all pages of a PDF."""
        doc = fitz.open(pdf_path)
        page_count = len(doc)
        doc.close()  # each page is re-opened by ocr_pdf_page; we only need the count
        return "\n\n".join(
            self.ocr_pdf_page(pdf_path, i) for i in range(page_count)
        )
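When both extractors are available, a simple router keeps costs down by escalating to Vision only when OCR output is sparse. A minimal sketch, where `ocr_extract` and `vision_extract` are callables (path in, text out) wrapping the extractors built earlier; the names and the `min_chars` threshold are illustrative:

```python
def extract_with_fallback(pdf_path: str, ocr_extract, vision_extract,
                          min_chars: int = 200) -> dict:
    """Try cheap OCR first; escalate to Vision only when output is sparse.

    ocr_extract / vision_extract are callables (path -> text); the names
    and the min_chars threshold are placeholders, not a fixed API.
    """
    text = ocr_extract(pdf_path)
    if len(text.strip()) >= min_chars:
        return {"text": text, "method": "tesseract"}
    # Sparse OCR output usually means a poor scan or handwriting.
    return {"text": vision_extract(pdf_path), "method": "gpt4_vision"}
```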

Enhancement 2: Multi-Language Support

Detect the document's language from a sample of extracted text, then route to a matching extraction prompt:

# Language detection and multi-language extraction
from langdetect import detect

def detect_language(text: str) -> str:
    """Detect document language. Returns ISO 639-1 code."""
    try:
        return detect(text[:1000])
    except Exception:
        return "en"

# Update vision prompts for detected language
def get_extraction_prompt(lang: str) -> str:
    prompts = {
        "en": "Extract all text and data from this document.",
        "es": "Extrae todo el texto y datos de este documento.",
        "fr": "Extraire tout le texte et les données de ce document.",
        "de": "Extrahieren Sie allen Text und alle Daten aus diesem Dokument.",
        "ja": "This document contains Japanese text. Extract all text preserving the original language.",
    }
    return prompts.get(lang, prompts["en"])

Enhancement 3: Compliance and Audit Trail

Log every processing run to monthly JSONL files so you can answer who processed which document, when, and with what method:

# app/compliance/audit.py
import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path


class AuditLogger:
    """Log all document processing for compliance."""

    def __init__(self, log_dir: str = "data/audit"):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def log_processing(self, job_id: str, filename: str, result: dict, user: str = "system"):
        now = datetime.now(timezone.utc)  # timezone-aware; datetime.utcnow() is deprecated
        entry = {
            "timestamp": now.isoformat(),
            "job_id": job_id,
            "filename": filename,
            "user": user,
            "document_type": result.get("document_type"),
            "method": result.get("method"),
            "fields_extracted": len(result.get("structured_data", {})),
            "file_hash": self._hash_file(filename),
        }
        log_file = self.log_dir / f"{now.strftime('%Y-%m')}.jsonl"
        with open(log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def _hash_file(self, file_path: str) -> str:
        try:
            with open(file_path, "rb") as f:
                return hashlib.sha256(f.read()).hexdigest()[:16]
        except Exception:
            return "unknown"
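Storing a file hash is only useful if you later verify it. A small helper for tamper detection, matching the truncated-SHA256 scheme used above (a sketch; adjust if you log full digests):

```python
import hashlib


def verify_file_hash(file_path: str, logged_hash: str) -> bool:
    """Re-hash a stored file and compare it to an audit entry's file_hash.

    Uses the same truncated SHA-256 scheme as AuditLogger above.
    """
    with open(file_path, "rb") as f:
        current = hashlib.sha256(f.read()).hexdigest()[:16]
    return current == logged_hash
```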

Frequently Asked Questions

How accurate is the extraction?

For digital PDFs with clean text, accuracy is 95-99%. For scanned documents with GPT-4 Vision, accuracy is 85-95% depending on scan quality. Handwritten text accuracy varies by legibility, typically 70-90%. Always include a human review step for critical documents.
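One way to implement that human review step is a simple gate that flags documents when required fields are missing or a self-reported confidence score is low. A sketch, assuming an optional `confidence` key in the extraction result (use whatever your schema actually emits):

```python
def needs_human_review(structured_data: dict, required_fields: list[str],
                       min_confidence: float = 0.8) -> bool:
    """Flag a document for manual review when required fields are missing
    or the model's self-reported confidence (if present) is low.

    The "confidence" key is an assumption; adapt to your own schema.
    """
    missing = [f for f in required_fields if not structured_data.get(f)]
    confidence = structured_data.get("confidence", 1.0)
    return bool(missing) or confidence < min_confidence
```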

Can I process Word documents and spreadsheets?

Yes. Add python-docx for DOCX files and openpyxl for Excel. Create new loader functions in the extraction module and register the file extensions. The structured extraction pipeline works with any text input regardless of source format.
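As a sketch, two such loaders might look like this (lazy imports keep the dependencies optional; the registry layout is illustrative, not the tutorial's actual module structure):

```python
def load_docx(path: str) -> str:
    """Extract paragraph text from a Word document (requires python-docx)."""
    from docx import Document  # pip install python-docx
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs)


def load_xlsx(path: str) -> str:
    """Flatten a spreadsheet to tab-separated text (requires openpyxl)."""
    from openpyxl import load_workbook  # pip install openpyxl
    wb = load_workbook(path, read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            lines.append("\t".join("" if c is None else str(c) for c in row))
    return "\n".join(lines)


# Register extensions so the pipeline can dispatch by suffix.
LOADERS = {".docx": load_docx, ".xlsx": load_xlsx}
```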

How do I handle confidential documents?

For on-premises processing, use Tesseract OCR instead of OpenAI Vision. For the structured extraction step, consider running a local LLM via Ollama. Add encryption at rest for uploaded files and enable audit logging for all operations.
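For the local-LLM route, a request body for Ollama's /api/generate endpoint can be built like this (the model name and prompt wording are assumptions; POST the result to http://localhost:11434/api/generate with requests or httpx):

```python
def build_local_llm_request(text: str, schema_hint: str,
                            model: str = "llama3.1") -> dict:
    """Build a request body for Ollama's /api/generate endpoint, so
    document text never leaves the machine. Model name is illustrative.
    """
    prompt = (
        f"Extract the following fields as JSON: {schema_hint}\n\n"
        f"Document:\n{text}"
    )
    # format="json" asks Ollama to constrain the response to valid JSON.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}
```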

What is the cost per document?

Digital PDF (text extraction only): essentially free. Scanned PDF with GPT-4 Vision: $0.01-0.05 per page. Structured extraction with GPT-4o-mini: $0.001-0.005 per document. Total for a typical invoice: under $0.10.
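Those figures can be folded into a rough per-document estimator. The rates below are midpoint assumptions drawn from the ranges above; substitute your own observed pricing:

```python
def estimate_cost(pages: int, scanned: bool,
                  vision_per_page: float = 0.03,
                  structuring_per_doc: float = 0.003) -> float:
    """Rough per-document cost in dollars.

    Digital PDFs skip the Vision step entirely; the default rates are
    assumptions, not published pricing.
    """
    vision = pages * vision_per_page if scanned else 0.0
    return round(vision + structuring_per_doc, 4)


# estimate_cost(3, scanned=True) -> 0.093 — a 3-page scan stays under $0.10
```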

How do I improve table extraction accuracy?

Try both lattice and stream modes in tabula. For complex tables, use GPT-4 Vision with a specific prompt asking for table data in CSV format. Post-process with pandas to validate row and column counts match expectations.
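That post-processing can be as simple as parsing the model's CSV and dropping rows with the wrong column count, where malformed rows are a signal to re-prompt. A sketch using the stdlib csv module as a lighter stand-in for pandas:

```python
import csv
import io


def validate_csv_table(csv_text: str, expected_cols: int) -> list[list[str]]:
    """Parse CSV returned by the Vision model and keep only rows that
    match the expected column count; discarded rows signal a re-prompt."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    return [row for row in rows if len(row) == expected_cols]
```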

What You Built

| Step | What You Built | Key Files |
|------|----------------|-----------|
| 1. Setup | Project structure, FastAPI server | app/main.py, app/config.py |
| 2. PDF Extraction | Text, table, layout extraction | app/extraction/*.py |
| 3. Vision AI | GPT-4 Vision for scanned docs | app/vision/*.py |
| 4. Structured Output | Pydantic schemas, function calling | app/structuring/*.py |
| 5. Batch Processing | Async queue, workers, progress | app/pipeline/*.py |
| 6. Web UI | Drag-drop upload, results review | frontend/index.html |
| 7. Enhancements | OCR fallback, multi-lang, compliance | app/extraction/ocr_fallback.py |
💡 Keep building. Start with your own invoices or receipts. The extraction errors you find will guide you to better prompts and schemas. Track accuracy metrics and iterate until extraction is reliable enough to remove the human review step.
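A starting point for those accuracy metrics is a field-level exact-match score against a small hand-labeled set (a sketch; swap in fuzzy matching for fields like addresses where exact match is too strict):

```python
def field_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth fields the pipeline got exactly right.

    Track this across prompt/schema iterations; exact match is a
    deliberately strict baseline.
    """
    if not ground_truth:
        return 1.0
    correct = sum(
        1 for key, value in ground_truth.items() if extracted.get(key) == value
    )
    return correct / len(ground_truth)
```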