
Today I Learned

Document extraction: four main approaches with a 1000x cost difference

I looked at the four main ways to turn unstructured documents into structured data: full LLM inference, fine-tuned small models, template-based extraction, and cloud OCR services.

The cost difference is huge: template-based extraction costs $0.001 per document, while full LLM inference costs $5 to $15 per document. That's a 1000x+ difference.

Most companies waste money by treating all documents the same. Document classification upfront can cut costs by 85%+ while maintaining flexibility for edge cases.

What I learned

Cloud OCR services (Azure Document Intelligence, AWS Textract, Google Document AI) cost $1.50 per 1,000 pages for basic OCR. They're fully managed, pre-trained on common document types, and great for MVPs.

Recent benchmarks: Gemini 2.0 Pro achieved 100% item-extraction accuracy at $0.0045 per invoice, while AWS and Azure cost $0.01 per invoice. Azure's asynchronous processing delivers an 85% cost saving: 30 pages cost $0.045 async versus $0.30 synchronous.
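
For a concrete picture of the async path, here's a minimal sketch using AWS Textract's expense API via boto3 (Azure's flow is analogous); the bucket and file names are placeholders:

    import time
    import boto3

    textract = boto3.client("textract")

    # Start an asynchronous expense-analysis job on a PDF already in S3.
    job = textract.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": "my-docs", "Name": "invoices/batch-01.pdf"}}
    )

    # Poll until the job completes.
    while True:
        status = textract.get_expense_analysis(JobId=job["JobId"])
        if status["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(5)

    line_items = status.get("ExpenseDocuments", [])  # empty if the job failed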

The downside is that the cost per page adds up quickly, and Azure's custom extraction models cost $50 for every 1,000 pages.

Fine-tuned small models (7B-class models such as Llama 3.1 and Mistral 7B) cost $0.00368 per 1,000 tokens for inference after training.

Real benchmarks: LLaMA-3 8B achieved 76.6% accuracy without any fine-tuning, matching fine-tuned LLaMA-2 70B. After fine-tuning on just 861 samples, LLaMA-2 7B jumped from 47.6% to 61.5% accuracy, with a 47.78% reduction in hallucinations.

Training cost: under $2 for QLoRA on A100 GPUs (46 minutes for Mistral 7B). Inference hosting runs $288 to $530 per month on cloud GPUs, so breakeven against GPT-4 API pricing comes at roughly 1 million documents per year.
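
For reference, the QLoRA recipe behind numbers like these is only a few lines with Hugging Face transformers and peft; the hyperparameters below are illustrative, not the ones used in the benchmark:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Load Mistral 7B quantized to 4-bit NF4 so it fits on a single GPU.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1", quantization_config=bnb, device_map="auto"
    )

    # Attach low-rank adapters; only these small matrices get trained.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of all weights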

Template-based extraction costs a fraction of a cent per document, but the templates have to be built ahead of time. Modern parsers reach F1 scores of 1.0 with sub-second latency on known formats.

PyMuPDF scored F1 between 0.983 and 0.993 on government, legal, and financial documents. Camelot handled table extraction well, with an F1 of 0.828 on complicated government tenders. Processing speed: structured documents take 0.3 to 1.6 seconds versus 33.9 seconds for multimodal LLM approaches, making the template path roughly 54 times faster.
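
A minimal sketch of that template path with both libraries (file name and page range are placeholders):

    import fitz      # PyMuPDF
    import camelot

    # Fast text extraction from a fixed-layout document.
    doc = fitz.open("tender.pdf")
    text = "\n".join(page.get_text("text") for page in doc)

    # Lattice mode targets ruled tables, like those in government tenders.
    tables = camelot.read_pdf("tender.pdf", pages="1-end", flavor="lattice")
    first_table = tables[0].df  # each table comes back as a pandas DataFrame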

Azure Document Intelligence requires only 3 training + 3 test documents for template model creation, with the first 10 hours of neural training free.

Full LLM inference (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 and Gemini 2.5) costs $0.005-0.02 per typical invoice. It handles any format without training, adapts to changes, and can reason about context.

Production benchmarks: Claude and GPT-4o reach 92–95% accuracy on line items and 95–98% on overall invoice extraction. Claude processes an invoice in 200 to 300 milliseconds; GPT-4o takes 1 to 30 seconds depending on complexity.

Cost optimization: prompt caching cuts the cost of repeated prompt content (the shared instructions and schema) by 90%. Batch API processing halves costs for non-urgent workloads. With caching, Claude runs $30 to $90 per 10,000 invoices a month; GPT-4o runs $50 to $180.
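
As a sketch of how that caching works with the Anthropic Python SDK: the large, stable part of the prompt (extraction instructions and output schema) is marked cacheable, so repeat calls pay full price only for the per-invoice text. The schema prompt here is a stand-in:

    import anthropic

    client = anthropic.Anthropic()

    SCHEMA_PROMPT = "Extract vendor, date, total, and line items as JSON using this schema: ..."
    invoice_text = "..."  # raw text or OCR output for one invoice

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SCHEMA_PROMPT,
            # Cache the shared instructions; cached reads are billed at a small
            # fraction of the normal input-token price.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": invoice_text}],
    )
    print(response.content[0].text)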

The hybrid strategy

The most effective setup routes every document through an upfront classifier, as in the October 2024 Hybrid OCR-LLM Framework study:

  • Standard forms (60%) → Table-based extraction (F1=1.0, 0.3s latency)
  • Semi-structured (30%) → PaddleOCR + table method (F1=0.997, 0.6s)
  • Novel formats (10%) → Multimodal LLM (F1=0.999, 34s)

Real-world impact: Asian Paints cut processing time from 5 minutes to 30 seconds per document (10 times faster), saving 192 person-hours a month and finding $47,000 in vendor overcharges.

The filename classification optimization: lightweight classifiers reach 96.7% accuracy while running 442x faster than full content analysis, sending 80%+ of documents down fast paths before any expensive model is invoked.
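
A toy version of that fast path; the filename patterns and route labels are hypothetical, and a production classifier would be trained on real filename distributions:

    import re

    # Ordered filename patterns mapped to extraction routes.
    FAST_PATHS = [
        (re.compile(r"invoice|inv[-_]?\d+", re.I), "standard_form"),
        (re.compile(r"tender|contract|agreement", re.I), "semi_structured"),
    ]

    def classify_by_filename(filename: str) -> str:
        for pattern, route in FAST_PATHS:
            if pattern.search(filename):
                return route
        return "novel"  # no match: fall through to the expensive LLM path

    classify_by_filename("INV-20240117_acme.pdf")  # -> "standard_form"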

This lowers the blended cost to $1.50 per document, down from $10 for pure LLM. That's an 85% drop in cost while still keeping flexibility.

How to choose

More than 10,000 documents per month: use fine-tuned models or templates for common document types. Mistral 7B fine-tunes in 46 minutes for $1.46 on RunPod and reaches 85% of GPT-4's accuracy at roughly one-eighth the cost.

Fewer than 10,000 docs a month: cloud OCR services, for speed to production. For custom extractors, Google offers the first 1,000 documents free, then $30 per 1,000 pages.

Accuracy critical: Template extraction with rules. Azure supports up to 500 trained models in composed architectures with incremental training on misclassified documents.

Format highly variable: LLM-based extraction. Claude 3.5 Sonnet handles 100-page PDFs up to 30MB within its 200K-token context window, eliminating preprocessing.
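
That "no preprocessing" claim translates to sending the PDF bytes directly, as in this sketch with the Anthropic SDK's PDF document blocks (file name is a placeholder):

    import base64
    import anthropic

    client = anthropic.Anthropic()

    # Send the raw PDF as-is; no OCR or layout-parsing step beforehand.
    pdf_b64 = base64.standard_b64encode(open("statement.pdf", "rb").read()).decode()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64", "media_type": "application/pdf",
                            "data": pdf_b64}},
                {"type": "text", "text": "Extract all invoice fields as JSON."},
            ],
        }],
    )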

The winning architecture

Don't pick one approach. Route intelligently. In Python-style pseudocode (the extractor functions stand in for the three paths above):

    if doc_type == "standard_form":        # Template (F1=1.0, 0.3s, $0.001)
        result = extract_with_template(doc)
    elif doc_type == "semi_structured":    # Fine-tuned 7B (F1=0.997, 0.6s, $0.03)
        result = extract_with_finetuned_7b(doc)
    else:                                  # LLM fallback (F1=0.999, 34s, $10)
        result = extract_with_llm(doc)

Blended cost: $1.50/doc vs $10 pure LLM = 85% savings

The main point

Through smart routing, best-in-class AP departments get their cost per invoice down to $2.78, well below the industry average of $9.40. That's 78% lower cost and 82% faster processing than their peers.

The market data backs this up: the document extraction market is projected to grow from $10.57 billion in 2025 to $66.68 billion by 2032, a 30.6% annual growth rate, driven in part by companies adopting smart routing instead of relying on expensive LLMs for everything.
