The Extraction Intelligence Benchmark

Case Study #001 · WC FNOL Document Extraction · May 2026

A controlled comparison of three extraction architectures on a single high-complexity workers' compensation (WC) first notice of loss (FNOL) shows that the approach marketed for its audit trail hallucinated the claimant's first name, last name, and date of injury. The zero-LLM approach extracted every deterministic field correctly and was the only one of the three judged defensible under regulatory examination.

19/19

Deterministic fields correct
Approach C, zero API calls

3

Grounded hallucinations
Approach B, 101 intervals cited

86%

Of fields are non-LLM addressable
Only Cat 4 inference requires LLM

10×

Latency penalty, grounded LLM
42.7s vs. 4.3s (single-prompt)

Exhibit 1 — Three architectures under test

Metric	Approach A: Groq · llama-3.3-70b · T5	Approach B: LangExtract · Gemini · T5	Approach C: Docling + Regex · T0+T2
External API calls	1	7	0
Latency	4.3s	42.7s	8.2s
Fields extracted (of 36)	36	36	29 † (Cat 1–3 only)
Cat 1–2 deterministic fields	19 / 19	15 / 19	19 / 19
Hallucinations	0	3 ▲	0
Document-provenance grounded	0 values	101 intervals ▲	29 / 29
Deterministic (zero variance)	No	No	Yes
Cost per document	~$0.04–0.06	~$0.28–0.42	~$0.00015
PHI-safe for data posture B/C	No	No	Yes

▲ Approach B grounding intervals point to real text from wrong entities; see Finding 1.
† Approach C extracted 29 of 36 fields (Cat 1–3); the 7 missing Cat 4 inference fields were left empty, not hallucinated.
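The headline ratios can be rechecked directly from the Exhibit 1 figures. The sketch below is illustrative only: the latency and cost numbers are copied from the exhibit, while the use of range midpoints for the cost comparison is an assumption made here, not something the benchmark states.

```python
# Figures copied from Exhibit 1. Using range midpoints for cost is an
# illustrative assumption, not part of the benchmark itself.
LATENCY_S = {"A": 4.3, "B": 42.7, "C": 8.2}
COST_USD = {"A": (0.04, 0.06), "B": (0.28, 0.42), "C": (0.00015, 0.00015)}


def latency_ratio(slow: str, fast: str) -> float:
    """Return how many times slower one approach is than another."""
    return LATENCY_S[slow] / LATENCY_S[fast]


def midpoint(cost_range: tuple) -> float:
    """Return the midpoint of a (low, high) cost range."""
    low, high = cost_range
    return (low + high) / 2


latency_b_vs_a = latency_ratio("B", "A")
cost_b_vs_a = midpoint(COST_USD["B"]) / midpoint(COST_USD["A"])

print(f"B vs. A latency: {latency_b_vs_a:.1f}x")
print(f"B vs. A cost (midpoints): {cost_b_vs_a:.1f}x")
```

The latency ratio lands just under 10, which is consistent with the rounded "10×" figure quoted at the top of the case study.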

Finding 1

Grounding intervals do not guarantee correct field attribution

LangExtract returned a character-interval citation for every extracted value — the feature distinguishing its audit architecture from a standard LLM call. Three of those intervals pointed to real text at the cited position. The text belonged to the wrong entity or the wrong date context.

A grounding interval proves a string exists in the document. It does not prove the string is the correct value for the correct field. Under NYDFS Part 216 examination, this distinction is the difference between passing and failing provenance review.

Exhibit 2 — Three grounded hallucinations (Approach B)

Field	Extracted	Cited source text	Actual source
first_name	"Franklin"	"Franklin Logistics Inc…"	Third-party shipper — not the claimant
last_name	"Mr."	"Dear Mr. Johnson,"	Broker salutation — not a surname
date_of_injury	"March 4"	"filed March 4 by prior counsel"	Attorney filing date — injury was March 18
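The gap between "grounded" and "correct" in Exhibit 2 can be shown with a small check. This is a minimal sketch, not LangExtract's API: the `Citation` class, the `interval_is_grounded` helper, and the three-line sample document are all hypothetical, assembled from strings that appear in Exhibits 2 and 3.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical record of one extracted value and its cited interval."""
    field_name: str
    value: str
    start: int
    end: int


def interval_is_grounded(document: str, citation: Citation) -> bool:
    """True when the cited characters literally equal the extracted value."""
    return document[citation.start:citation.end] == citation.value


# Hypothetical excerpt built from strings in Exhibits 2 and 3; not the real FNOL.
DOCUMENT = (
    "Shipper: Franklin Logistics Inc\n"
    "Employee Information\n"
    "First Name: Terrence\n"
)
GROUND_TRUTH = {"first_name": "Terrence"}

start = DOCUMENT.index("Franklin")
citation = Citation("first_name", "Franklin", start, start + len("Franklin"))

# The interval check passes because the string exists at that offset...
grounded = interval_is_grounded(DOCUMENT, citation)
# ...but the value is still wrong for the field it was assigned to.
correct = citation.value == GROUND_TRUTH[citation.field_name]

print(f"grounded={grounded} correct={correct}")
```

The interval check only compares characters at an offset. Nothing in it knows which entity the text describes, which is why a citation can verify cleanly and still be attached to the wrong field.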

Exhibit 3 — Category 1 & 2 field results: 19 deterministic fields

Field	Ground truth	A: Groq	B: LangExtract	C: Docling+Regex
policy_number	AP-2026-WC-9214	✓	✓	✓
claim_number	WCH250721001	✓	✓	✓
employer_fein	47-2381094	✓	✓	✓
naics_code	561320	✓	✓	✓
date_of_injury	2026-03-18	✓	✗ "March 4"	✓
date_reported	2026-03-28	✓	✓	✓
date_of_birth	1988-07-15	✓	✓	✓
ssn_last4	4721	✓	✓	✓
hourly_rate	$28.50	✓	✓	✓
avg_weekly_wage	$1,140.00	✓	✓	✓
reporting_delay_days	10	✓	✗	✓
attorney_contact_date	2026-03-21	✓	✓	✓
first_name	Terrence	✓	✗ "Franklin"	✓
last_name	Jackson	✓	✗ "Mr."	✓
employer_name	Apex Staffing Solutions	✓	✓	✓
body_part_primary	Lower back / lumbar	✓	✓	✓
injury_mechanism	Lifting / exertion	✓	✓	✓
occupation_class	Warehouse / labor	✓	✓	✓
state_of_injury	NY	✓	✓	✓
Score — Category 1 & 2		19 / 19	15 / 19	19 / 19

Rows marked ✗ with a quoted value are the fields where Approach B returned a grounded hallucination. Approach C Cat 4 fields (claim type, RTW status, attorney flag, same body part, delay flag) were left empty by design, not hallucinated.
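Some fields in Exhibit 3 are derived rather than read off the page. As a hedged illustration, `reporting_delay_days` can be reproduced from the two ground-truth dates with plain date arithmetic. The `reporting_delay` helper is a hypothetical name, and the benchmark does not say how any of the three approaches computed this field.

```python
from datetime import date


def reporting_delay(date_of_injury: date, date_reported: date) -> int:
    """Days between the injury and the report (hypothetical helper)."""
    return (date_reported - date_of_injury).days


# Ground-truth values from Exhibit 3.
date_of_injury = date(2026, 3, 18)
date_reported = date(2026, 3, 28)

reporting_delay_days = reporting_delay(date_of_injury, date_reported)
print(reporting_delay_days)
```

A derived field like this inherits any error in its inputs, so a wrong `date_of_injury` upstream would also produce a wrong delay.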

Finding 2

Section-aware extraction makes the error category structurally unreachable

The document contains three business entities — Apex Staffing, Excel Manufacturing, and Franklin Logistics — before the Employee Information section that contains the claimant's name. Approach B scanned the full document; Approach C partitioned it.

Each regex pattern runs only against its assigned section pool. The first_name pattern sees only text under the Employee Information header. "Franklin" exists only in the Employer Information pool. The two pools never intersect. The attribution error is not a probability to manage — it is a structural impossibility.

Section partitioning (Python)

import re

# Each header pattern opens a separate text pool.
SECTION_MAP = {
    "employer": re.compile(r"employer\s+information", re.I),
    "employee": re.compile(r"(?:injured\s+)?employee\s+information", re.I),
    "injury": re.compile(r"(?:injury|incident)\s+(?:information|details)", re.I),
}

# first_name runs in the "employee" pool only.
# "Franklin" exists in the "employer" pool only.
# The pools do not overlap, so this attribution error cannot occur.
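To make the pool idea concrete, here is a minimal runnable sketch of how the section map might be applied. The `partition` helper and the sample document are hypothetical and written for illustration; the benchmark's actual Docling pipeline is not shown in the source. The `first_name` pattern is the one quoted in Finding 3.

```python
import re

SECTION_MAP = {
    "employer": re.compile(r"employer\s+information", re.I),
    "employee": re.compile(r"(?:injured\s+)?employee\s+information", re.I),
    "injury": re.compile(r"(?:injury|incident)\s+(?:information|details)", re.I),
}

# Pattern quoted in Finding 3 of the case study.
FIRST_NAME = re.compile(r"First\s+Name[.:\s]+([A-Z][a-z]+)\b")


def partition(text: str) -> dict:
    """Split text into pools, each running from its header to the next header.

    Hypothetical helper: assumes each header appears once, in any order.
    """
    hits = []
    for name, pattern in SECTION_MAP.items():
        match = pattern.search(text)
        if match:
            hits.append((match.start(), match.end(), name))
    hits.sort()

    pools = {}
    for i, (_, header_end, name) in enumerate(hits):
        next_start = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        pools[name] = text[header_end:next_start]
    return pools


# Hypothetical sample using entity names from the case study; not the real FNOL.
SAMPLE = """Employer Information
Employer: Apex Staffing Solutions
Host site: Excel Manufacturing
Shipper: Franklin Logistics Inc
Employee Information
First Name: Terrence
Last Name: Jackson
Injury Details
Date of Injury: 2026-03-18
"""

pools = partition(SAMPLE)
match = FIRST_NAME.search(pools["employee"])
first_name = match.group(1) if match else None
print(first_name)
```

Because `FIRST_NAME` is only ever handed `pools["employee"]`, text from the employer section is never a candidate, whatever it contains.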

Finding 3 — Audit defensibility

Only one architecture passes regulatory examination

Architecture	Answer to "where did this value come from?"	Examination result
A — Groq	"The language model extracted policy number AP-2026-WC-9214 from the document. Confidence: high."	No provenance
B — LangExtract	"The value 'Franklin' was extracted from characters 412–419, which reads 'Franklin'."	Misleading — wrong entity
C — Docling+Regex	"First name matched by `First\s+Name[.:\s]+([A-Z][a-z]+)\b` in the Employee Information section. Deterministic. Reproducible on every run."	Passes examination
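The Approach C answer in the table above amounts to a record of pattern, section, and matched span. The sketch below shows one way such a record could be captured, assuming a hypothetical `ProvenanceRecord` structure and `extract_with_provenance` helper; neither is taken from the benchmark's code.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical audit record for one deterministic extraction."""
    field_name: str
    value: str
    section: str
    pattern: str
    span: Tuple[int, int]  # offsets of the value within the section pool


def extract_with_provenance(
    field_name: str, pattern: str, section: str, pool: str
) -> Optional[ProvenanceRecord]:
    """Run one pattern against one section pool and record how it matched."""
    match = re.search(pattern, pool)
    if match is None:
        return None  # leave the field empty rather than guess
    return ProvenanceRecord(field_name, match.group(1), section, pattern, match.span(1))


# Hypothetical pool text; the pattern is the one quoted in Finding 3.
POOL = "First Name: Terrence\nLast Name: Jackson\n"
PATTERN = r"First\s+Name[.:\s]+([A-Z][a-z]+)\b"

first_run = extract_with_provenance("first_name", PATTERN, "Employee Information", POOL)
second_run = extract_with_provenance("first_name", PATTERN, "Employee Information", POOL)

# Same input and same pattern yield an identical record on every run.
print(first_run == second_run)
print(first_run)
```

The record answers the examiner's question with the rule that produced the value, not only the position of a string, and rerunning the rule gives the same record.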

Implications

Field classification precedes tool selection. The category of a field (deterministic, categorical, verbatim, or inferential) determines the correct extraction tier. Applying an LLM to Cat 1 deterministic fields introduces hallucination risk on the most auditable class of data in a claim file.

Grounding intervals are necessary but not sufficient for audit defensibility. A character offset that cites real source text does not prove correct entity attribution on complex, multi-entity documents. Section-aware deterministic extraction provides a stronger correctness guarantee for Cat 1–2 fields.

The LLM-required zone is 11–14% of this document's field set. Five of 36 fields require genuine inference: claim type, RTW status, attorney flag, same body part comparison, and delay threshold interpretation. The optimal architecture is not LLM vs. non-LLM — it is governing which fields go to which tier.

PHI data posture is a tier-selection constraint, not a post-design concern. For carriers with posture B (carrier-cloud) or C (air-gapped), external APIs are not available for PHI documents regardless of accuracy results. T0+T2 for Cat 1–3 combined with T4 local inference (Ollama, carrier VPC) for Cat 4 is the only architecture that satisfies these constraints end-to-end.
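The tier-selection rule described in these implications can be sketched as a small routing function. Several details here are assumptions and are marked as such in the code: the mapping of "categorical" and "verbatim" to Cat 2 and Cat 3, the meaning of posture A (which the source does not define), and the `select_tier` name itself.

```python
from enum import Enum


class Category(Enum):
    DETERMINISTIC = 1  # Cat 1 per the case study
    CATEGORICAL = 2    # assumption: numbering of Cat 2 and Cat 3 is inferred
    VERBATIM = 3       # from the order the categories are listed in
    INFERENTIAL = 4    # Cat 4 per the case study


class Posture(Enum):
    A = "external APIs permitted"  # assumption: posture A is not defined in the source
    B = "carrier-cloud"
    C = "air-gapped"


def select_tier(category: Category, posture: Posture) -> str:
    """Pick an extraction tier from field category and PHI data posture."""
    if category is not Category.INFERENTIAL:
        return "T0+T2"  # Docling + regex, no external calls
    if posture in (Posture.B, Posture.C):
        return "T4"     # local inference inside the carrier environment
    return "T5"         # external LLM API


# Illustrative field assignments; the per-field categories are assumptions.
FIELD_CATEGORIES = {
    "policy_number": Category.DETERMINISTIC,
    "claim_type": Category.INFERENTIAL,
}

for name, category in FIELD_CATEGORIES.items():
    print(name, select_tier(category, Posture.B))
```

Routing on category first and posture second keeps the decision about LLM use at the field level, which is the governance point the implications make.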

This benchmark is anchored to an active enterprise POC covering 79,908 annual documents across three use cases. Phase 1 recommendation: run Approach C on 50 real labeled documents before any GPU or LLM infrastructure investment.

If the extraction findings apply to your pipeline, let's talk.

We work with insurers and MGAs who are serious about the architecture — not just the demo. Conversations start with the problem, not the product.

gps@elevatenow.tech LinkedIn elevatenow.tech
