The question is never "which extraction tool is best?" but "which tier does the field require?" Tier selection determines the tool, the infrastructure, and the data sovereignty posture.
Organizations select extraction tools the same way they select software platforms — by brand recognition, benchmark scores, or vendor relationships. The result is architectural mismatch: LLMs deployed on fields that regex handles deterministically, cloud APIs processing PHI that should never leave the building, and expensive GPU inference running on documents where a 400KB Python library would have done the same job at a fraction of the cost.
A VIN extracted from a clean PDF requires a different tier than a dollar amount from a handwritten form photographed on a phone. Wrong-tier tools introduce unnecessary LLM calls, external API exposure, and hallucination risk on fields that regex handles deterministically. The six-tier model gives teams a decision framework that starts with the field, not the tool.
Each tier is defined by its sovereignty posture and the field categories it handles reliably.
| Tier | Name | Tools | Best for | Sovereignty |
|---|---|---|---|---|
| T0 | Native PDF Text | pdfplumber, PyMuPDF | Clean digital PDFs — policy docs, ACORD forms, endorsements | Full — no API, no model |
| T1 | Rule-Based OCR | Tesseract, EasyOCR | Scanned docs, structured forms, machine-print | Full |
| T2 | Regex Extraction | Python re, custom parsers | Cat 1 fields: VIN, DL#, DOB, policy#, claim#, ICD-10 codes | Full |
| T2.5 | Document VLM (Local) | Granite-Docling-258M (IBM, Apache 2.0) | Complex layouts, tables, mixed content (loss run schedules, IA reports) | Full (8GB, runs on-premise) |
| T4 | Local LLM | Ollama + Llama 3.3 70B | Cat 3 and Cat 4 fields: narrative interpretation, context-dependent extraction, cross-document synthesis | Full (on-premise, no egress) |
| T5 | External LLM API | Groq, OpenAI, Anthropic | Complex reasoning, multi-document synthesis, cases where T4 latency is prohibitive | Posture B/C — PHI leaves boundary |
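
One way to make the sovereignty column enforceable is to hold the tier table as data and filter on it. The sketch below is illustrative only: the `Tier` class, the `data_leaves_boundary` flag, and `sovereign_tiers` are hypothetical names, and the flag simply restates the "Full" versus "Posture B/C" distinction from the table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tier:
    code: str
    name: str
    tools: Tuple[str, ...]
    data_leaves_boundary: bool  # assumption: True only for Posture B/C tiers


# Rows transcribed from the tier table above.
TIERS = (
    Tier("T0", "Native PDF Text", ("pdfplumber", "PyMuPDF"), False),
    Tier("T1", "Rule-Based OCR", ("Tesseract", "EasyOCR"), False),
    Tier("T2", "Regex Extraction", ("Python re", "custom parsers"), False),
    Tier("T2.5", "Document VLM (Local)", ("Granite-Docling-258M",), False),
    Tier("T4", "Local LLM", ("Ollama + Llama 3.3 70B",), False),
    Tier("T5", "External LLM API", ("Groq", "OpenAI", "Anthropic"), True),
)


def sovereign_tiers(tiers: Tuple[Tier, ...] = TIERS) -> List[str]:
    """Return the codes of tiers that keep all data inside the boundary."""
    return [t.code for t in tiers if not t.data_leaves_boundary]


print(sovereign_tiers())
```

A pipeline handling PHI could refuse to dispatch to any tier whose code is not in this list unless the document's posture explicitly allows it.
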
The tier-to-tool mapping is downstream of field categorization. Before selecting a tool, classify the target fields:
| Category | Field Type | Examples | Recommended Tier |
|---|---|---|---|
| Cat 1 | Structured, regex-deterministic | VIN, DL#, DOB, policy#, claim#, ICD-10, ZIP, phone | T2 (regex) after T0/T1 text extraction |
| Cat 2 | Semi-structured, layout-dependent | Table cells, form fields, endorsement amounts, loss run rows | T2.5 (Granite-Docling) for complex layouts; T1 + regex for simple |
| Cat 3 | Narrative, layout-flexible | Injury descriptions, cause of loss, coverage summaries, legal disclaimers | T4 (local LLM) — keep PHI on-premise |
| Cat 4 | Contextual, judgment-required | Fraud indicators, liability assignment, cross-document reconciliation, complex coverage questions | T4 first; T5 only when T4 latency exceeds workflow SLA and PHI posture permits |
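
The category-to-tier mapping above can be sketched as a small routing function. This is a minimal sketch, not a reference implementation: `recommend_tier` and its parameters (`complex_layout`, `t4_latency_s`, `sla_s`, `phi_egress_permitted`) are hypothetical names chosen for illustration, and the return values are just the tier labels from the tables.

```python
from typing import Optional


def recommend_tier(
    category: int,
    *,
    complex_layout: bool = False,
    t4_latency_s: Optional[float] = None,
    sla_s: Optional[float] = None,
    phi_egress_permitted: bool = False,
) -> str:
    """Map a field category (1-4) to the recommended tier label."""
    if category == 1:
        # Regex over text already produced by T0/T1.
        return "T2"
    if category == 2:
        # Document VLM for complex layouts; OCR plus regex for simple ones.
        return "T2.5" if complex_layout else "T1+T2"
    if category == 3:
        # Narrative fields stay on the local LLM to keep PHI on-premise.
        return "T4"
    if category == 4:
        # T5 only when T4 misses the SLA AND the PHI posture permits egress.
        too_slow = (
            t4_latency_s is not None
            and sla_s is not None
            and t4_latency_s > sla_s
        )
        return "T5" if too_slow and phi_egress_permitted else "T4"
    raise ValueError(f"unknown field category: {category}")


print(recommend_tier(4, t4_latency_s=30.0, sla_s=10.0, phi_egress_permitted=False))
```

Note that the escalation to T5 requires both conditions; a slow T4 alone never sends PHI outside the boundary.
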
T5 is never the first choice for Cat 1 fields. Regex is faster, cheaper, more accurate, and keeps data on-premise. Using an LLM to extract a VIN is architectural malpractice.
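
To make the Cat 1 point concrete, here is a minimal sketch of T2 extraction for a VIN using only the standard library. It assumes the text has already been produced by T0 or T1, that VINs appear in uppercase, and that the modern 17-character format applies (which excludes the letters I, O, and Q). The function name `extract_vins` and the sample sentence are illustrative, and the pattern checks shape only, not the check digit.

```python
import re

# 17 characters drawn from digits and uppercase letters, excluding I, O, Q.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")


def extract_vins(text: str) -> list:
    """Return every VIN-shaped token in the text, in order of appearance."""
    return VIN_PATTERN.findall(text)


sample = "Insured vehicle VIN: 1HGCM82633A004352, garaged at the listed address."
print(extract_vins(sample))  # ['1HGCM82633A004352']
```

The same input always yields the same output, with no model call, no network egress, and nothing to hallucinate.
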
Questions about tier selection for your document types?
We're happy to walk through field categorization for your specific FNOL, submission, or claims document set before you commit to a tooling decision.
gps@elevatenow.tech

We work with insurers and MGAs who are serious about the architecture, not just the demo. Conversations start with the problem, not the product.