Standard OCR benchmarks — DocVQA, SROIE, FUNSD — were built for document understanding tasks, not insurance extraction. They measure whether a model can answer questions about a document. That's not the problem. The problem is whether a VIN extracted from a scanned police report is byte-perfect, whether the same document produces the same output on every run, and whether that output can be traced back to a specific location in the source when a regulator asks.

The framework below is what we actually use to evaluate tools. We share it because the evaluation criteria are as important as the results — and because vendors who cherry-pick generic benchmarks are counting on you not asking the right questions.


Five Evaluation Groups

The first four map to standard document intelligence dimensions. The fifth — audit and compliance — is the one that determines whether output is actually deployable in a regulated insurance workflow.

Group 1
Text Accuracy
Character error rate (CER) and word error rate (WER) measure overall text quality. For insurance, what matters more is field-level exact match (EM) on structured fields: dates, amounts, VINs, policy numbers, driver's license numbers (DL#), ICD-10 codes. A tool that achieves 99% character accuracy (1% CER) but drops one digit from a policy number fails the insurance test.
EM ≥ 95%
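
The gap between character-level and field-level scoring can be sketched in a few lines. This is a minimal illustration, not the evaluation harness itself: the function names, the field names, and the sample values are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def field_exact_match(expected: dict[str, str], extracted: dict[str, str]) -> float:
    """Fraction of expected fields whose extracted value is identical."""
    if not expected:
        return 1.0
    hits = sum(1 for name, value in expected.items() if extracted.get(name) == value)
    return hits / len(expected)


# Hypothetical ground truth and tool output: one digit dropped from the policy number.
truth = {"policy_number": "HO-48219377", "loss_date": "2023-04-17"}
output = {"policy_number": "HO-4821937", "loss_date": "2023-04-17"}

print(cer(truth["policy_number"], output["policy_number"]))  # one edit over 11 characters
print(field_exact_match(truth, output))  # only one of the two fields matches exactly
```

A single dropped character barely moves CER over a full page of text, but it turns that field into a miss under exact match, which is why the threshold is stated on EM.
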
Group 2
Structure Fidelity
TEDS (Tree-Edit-Distance-based Similarity) measures whether table structure is preserved: rows, columns, merged cells, reading order. Loss run schedules, ACORD endorsement tables, and IA report summaries are the primary stress test. Split-page tables and merged headers are the most common failure modes.
TEDS ≥ 0.90
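
For reference, TEDS is usually defined over the HTML tree representations of the predicted and ground-truth tables. The notation below is an assumption for illustration ($T_a$ and $T_b$ are the two trees, $|T|$ is the number of nodes in a tree); whether a given harness uses exactly this normalization should be checked against its own documentation.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Under this definition, a threshold of 0.90 on a table whose larger tree has 200 nodes tolerates a tree edit distance of at most 20, since $1 - 20/200 = 0.90$.
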
Group 3
Robustness Under Real Conditions
A DPI degradation curve (300 / 200 / 150 / 75 DPI) simulates fax-chain degradation common in police reports and older IA documents. Scan quality classes A through D characterize the document population. A tool that performs well on clean scans but collapses at 150 DPI isn't usable in production.
<10% drift 300→150
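
A minimal sketch of the drift check follows. It assumes drift is measured as the relative drop in a single accuracy score (for example field-level EM) between the 300 DPI and 150 DPI renderings; the source threshold does not say whether drift is relative or absolute, so treat that choice, the function names, and the sample scores as hypothetical.

```python
def relative_drift(baseline: float, degraded: float) -> float:
    """Relative drop from the baseline score, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - degraded) / baseline


def passes_drift_gate(scores_by_dpi: dict[int, float], max_drift: float = 0.10) -> bool:
    """True if the score at 150 DPI stays within max_drift of the 300 DPI score."""
    return relative_drift(scores_by_dpi[300], scores_by_dpi[150]) < max_drift


# Hypothetical scores for one tool across the degradation curve.
scores = {300: 0.97, 200: 0.95, 150: 0.90, 75: 0.61}
print(passes_drift_gate(scores))
```

The 75 DPI point is not part of the gate in this sketch; it is kept in the curve because it shows how sharply a tool falls off past the threshold.
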
Group 4
Operational Performance
Throughput, P95 latency, memory footprint, cold-start time, and GPU vs. CPU dependency. For high-volume carriers, all of these matter. For AMICA-scale operations (~80K docs/year), throughput is rarely the constraint; latency for real-time FNOL workflows and memory footprint for on-premises deployment are the actual concerns.
≤8s/page OCR
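
P95 latency can be computed from raw per-page timings with a nearest-rank percentile. The sketch below assumes the 8-second budget is applied to the P95 rather than the mean, which the threshold above does not specify; the function names and timings are hypothetical.

```python
def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer from 1 to 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def within_latency_budget(seconds_per_page: list[float], budget_s: float = 8.0) -> bool:
    """Apply the per-page budget to the P95 rather than to the mean."""
    return percentile_nearest_rank(seconds_per_page, 95) <= budget_s


# Hypothetical per-page OCR timings in seconds, with one slow outlier.
timings = [2.1, 2.4, 2.2, 3.0, 2.8, 2.5, 2.3, 2.6, 2.9, 11.7]
print(percentile_nearest_rank(timings, 95))
print(within_latency_budget(timings))
```

With these timings the mean is well under 8 seconds while the P95 is not, which is the case a mean-only report would hide in a real-time FNOL workflow.
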
Group 5
Audit & Compliance
Grounding rate: what percentage of extracted field values can be traced to a specific character offset or bounding box in the source document. Hallucination rate: what percentage of field values have no textual basis in the source at all. Run variance: whether the same document produces the same output across ten identical runs. Data sovereignty posture: whether PHI ever leaves the customer's infrastructure boundary during processing.
Hallucination = 0
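
Three of the four Group 5 measures can be sketched directly from their definitions. This is an illustration under stated assumptions, not the harness: `ExtractedField` and its character-offset fields are a hypothetical output schema, hallucination is approximated as "value does not appear verbatim in the source text" (which would wrongly flag normalized dates or amounts), and data sovereignty is a deployment property that cannot be checked this way.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    char_start: Optional[int] = None  # offset into the source text, if the tool reports one
    char_end: Optional[int] = None


def is_grounded(field: ExtractedField, source_text: str) -> bool:
    """True if offsets are reported and the source span at those offsets equals the value."""
    if field.char_start is None or field.char_end is None:
        return False
    return source_text[field.char_start:field.char_end] == field.value


def grounding_rate(fields: list[ExtractedField], source_text: str) -> float:
    if not fields:
        raise ValueError("no fields to score")
    return sum(is_grounded(f, source_text) for f in fields) / len(fields)


def hallucination_rate(fields: list[ExtractedField], source_text: str) -> float:
    """Share of field values that appear nowhere in the source text (verbatim check only)."""
    if not fields:
        raise ValueError("no fields to score")
    return sum(f.value not in source_text for f in fields) / len(fields)


def runs_are_identical(runs: list[dict[str, str]]) -> bool:
    """Hash each run's output canonically; identical runs yield a single digest."""
    digests = {
        hashlib.sha256(json.dumps(run, sort_keys=True).encode("utf-8")).hexdigest()
        for run in runs
    }
    return len(digests) == 1


# Made-up source text and tool output for illustration.
source = "VIN TESTVIN0123456789 reported on 2023-04-17"
fields = [
    ExtractedField("vin", "TESTVIN0123456789", 4, 21),
    ExtractedField("policy_number", "HO-48219377"),  # no offsets, not in the source
]
print(grounding_rate(fields, source), hallucination_rate(fields, source))
```

Comparing digests rather than scores is deliberate: two runs can reach the same accuracy with different wrong fields, and only a byte-level comparison catches that.
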

Group 5 is a compliance gate, not a performance metric. A tool can pass Groups 1–4 and be disqualified by Group 5 alone.

How to use this framework

  1. Use Groups 1–3 to shortlist tools in a proof-of-concept. These dimensions separate tools that handle insurance document complexity from those that only perform on clean digital inputs.
  2. Use Group 4 to validate operational fit for your document volume and deployment model: on-premises vs. cloud, GPU vs. CPU, latency requirements for real-time FNOL workflows.
  3. Treat Group 5 as a binary gate: pass or disqualify. No partial credit. A tool that hallucinates field values, even occasionally, is not deployable in a regulated insurance workflow regardless of performance on Groups 1–4.
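
The three steps above can be sketched as a single decision function. The thresholds are the ones listed on the group cards; the `ToolResult` structure, the outcome labels, and the choice to encode only the hallucination threshold for Group 5 are assumptions for illustration (a fuller gate would also cover grounding, run variance, and data sovereignty).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    exact_match: float         # Group 1, fraction in [0, 1]
    teds: float                # Group 2
    drift_300_to_150: float    # Group 3, fraction in [0, 1]
    seconds_per_page: float    # Group 4
    hallucination_rate: float  # Group 5, fraction in [0, 1]


def evaluate(result: ToolResult) -> str:
    # Step 1: Groups 1-3 decide whether the tool is shortlisted at all.
    shortlisted = (
        result.exact_match >= 0.95
        and result.teds >= 0.90
        and result.drift_300_to_150 < 0.10
    )
    if not shortlisted:
        return "not shortlisted"
    # Step 2: Group 4 checks operational fit.
    if result.seconds_per_page > 8.0:
        return "operational misfit"
    # Step 3: Group 5 is binary. Any hallucination disqualifies, with no partial credit.
    if result.hallucination_rate != 0:
        return "disqualified"
    return "pass"


# Hypothetical tool that clears Groups 1-4 but hallucinates in 1% of fields.
print(evaluate(ToolResult(0.98, 0.94, 0.04, 3.2, 0.01)))
```

Note that the Group 5 check is an equality test against zero rather than a weighted score, which is what makes it a gate and not a metric.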

Request the full evaluation dataset

The complete dataset includes 50 test documents across scan quality classes A–D, per-document results for each group, and the evaluation harness scripts.

gps@elevatenow.tech →
