Case Study #001 · WC FNOL Document Extraction · May 2026

The Extraction Intelligence Benchmark

A controlled comparison of three extraction architectures on a single high-complexity workers' compensation (WC) first notice of loss (FNOL) shows that the approach marketed for its audit trail hallucinated the claimant's first name, last name, and date of injury. The zero-LLM approach extracted every deterministic field correctly and is the only architecture defensible under regulatory examination.

19/19 deterministic fields correct (Approach C, zero API calls)

3 grounded hallucinations (Approach B, 101 intervals cited)

86% of fields are non-LLM addressable (only Cat 4 inference requires an LLM)

10× latency penalty for the grounded LLM (42.7s vs. 4.3s single-prompt)

Exhibit 1: Three architectures under test

| Metric | Approach A: Groq · llama-3.3-70b · T5 | Approach B: LangExtract · Gemini · T5 | Approach C: Docling + Regex · T0+T2 |
| --- | --- | --- | --- |
| External API calls | 1 | 7 | 0 |
| Latency | 4.3s | 42.7s | 8.2s |
| Fields extracted (of 36) | 36 | 36 | 29 (Cat 1–3 only) |
| Cat 1–2 deterministic fields | 19 / 19 | 15 / 19 | 19 / 19 |
| Hallucinations | 0 | 3 ▲ | 0 |
| Document-provenance grounded | 0 values | 101 intervals † | 29 / 29 |
| Deterministic (zero variance) | No | No | Yes |
| Cost per document | ~$0.04–0.06 | ~$0.28–0.42 | ~$0.00015 |
| PHI-safe for data posture B/C | No | No | Yes |

▲ Approach B grounding intervals point to real text from wrong entities — see Finding 1.  † 29/36 fields for Approach C = Cat 1–3; the 7 missing Cat 4 inference fields were left empty, not hallucinated.

Finding 1

Grounding intervals do not guarantee correct field attribution

LangExtract returned a character-interval citation for every extracted value, which is the feature that distinguishes its audit architecture from a standard LLM call. Three of those intervals pointed to real text at the cited position, but the text belonged to the wrong entity or the wrong date context.

A grounding interval proves a string exists in the document. It does not prove the string is the correct value for the correct field. Under NYDFS Part 216 examination, this distinction is the difference between passing and failing provenance review.

Exhibit 2: Three grounded hallucinations (Approach B)

| Field | Extracted | Cited source text | Actual source |
| --- | --- | --- | --- |
| first_name | "Franklin" | "Franklin Logistics Inc…" | Third-party shipper, not the claimant |
| last_name | "Mr." | "Dear Mr. Johnson," | Broker salutation, not a surname |
| date_of_injury | "March 4" | "filed March 4 by prior counsel" | Attorney filing date; injury was March 18 |

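The gap between "the string exists" and "the string belongs to this field" can be made concrete with two separate checks. The sketch below is illustrative only: the `Citation` class, both helper functions, and the two-section sample text are hypothetical and are not LangExtract's API or the benchmark document.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    field: str
    value: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def text_matches(doc: str, c: Citation) -> bool:
    """Grounding check: the cited span really contains the extracted value."""
    return doc[c.start:c.end] == c.value


def in_expected_section(c: Citation, section_span: tuple) -> bool:
    """Attribution check: the cited span lies inside the section that owns the field."""
    lo, hi = section_span
    return lo <= c.start and c.end <= hi


# Hypothetical sample text; only the entity names come from the case study.
doc = (
    "Employer Information\n"
    "Shipper: Franklin Logistics Inc\n"
    "Employee Information\n"
    "First Name: Terrence\n"
)
employee_span = (doc.index("Employee Information"), len(doc))

start = doc.index("Franklin")
bad = Citation("first_name", "Franklin", start, start + len("Franklin"))

grounded = text_matches(doc, bad)                      # the interval is real
attributed = in_expected_section(bad, employee_span)   # but it is in the wrong section
print(grounded, attributed)
```

A citation that passes only the first check is the "grounded hallucination" pattern shown in Exhibit 2.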
Exhibit 3: Category 1 & 2 field results (19 deterministic fields)

| Field | Ground truth | A: Groq | B: LangExtract | C: Docling+Regex |
| --- | --- | --- | --- | --- |
| policy_number | AP-2026-WC-9214 | | | |
| claim_number | WCH250721001 | | | |
| employer_fein | 47-2381094 | | | |
| naics_code | 561320 | | | |
| date_of_injury | 2026-03-18 | | ✗ "March 4" | |
| date_reported | 2026-03-28 | | | |
| date_of_birth | 1988-07-15 | | | |
| ssn_last4 | 4721 | | | |
| hourly_rate | $28.50 | | | |
| avg_weekly_wage | $1,140.00 | | | |
| reporting_delay_days | 10 | | | |
| attorney_contact_date | 2026-03-21 | | | |
| first_name | Terrence | | ✗ "Franklin" | |
| last_name | Jackson | | ✗ "Mr." | |
| employer_name | Apex Staffing Solutions | | | |
| body_part_primary | Lower back / lumbar | | | |
| injury_mechanism | Lifting / exertion | | | |
| occupation_class | Warehouse / labor | | | |
| state_of_injury | NY | | | |
| Score, Category 1 & 2 | | 19 / 19 | 15 / 19 | 19 / 19 |

Rows marked ✗ are fields where Approach B returned a grounded hallucination. Approach C's Cat 4 fields (claim type, RTW status, attorney flag, same body part, delay flag) were left empty by design, not hallucinated.

Finding 2

Section-aware extraction makes the error category structurally unreachable

The document contains three business entities — Apex Staffing, Excel Manufacturing, and Franklin Logistics — before the Employee Information section that contains the claimant's name. Approach B scanned the full document; Approach C partitioned it.

Each regex pattern runs only against its assigned section pool. The first_name pattern sees only text under the Employee Information header. "Franklin" exists only in the Employer Information pool. The two pools never intersect. The attribution error is not a probability to manage — it is a structural impossibility.

Section partitioning (Python)
```python
import re

# Each key names a section pool; each pattern matches that section's header.
SECTION_MAP = {
    "employer": re.compile(r"employer\s+information", re.I),
    "employee": re.compile(r"(?:injured\s+)?employee\s+information", re.I),
    "injury": re.compile(r"(?:injury|incident)\s+(?:information|details)", re.I),
}

# first_name runs in "employee" pool only.
# "Franklin" exists in "employer" pool only.
# No overlap. Attribution error is impossible.
```
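The header patterns above only locate section boundaries. One way the pools might be built and queried is sketched below, assuming each section runs from its header to the next header found. The `partition` and `extract` helpers, the `FIELD_PATTERNS` table, and the sample layout are hypothetical; the `first_name` regex is the one quoted in the Approach C audit answer.

```python
import re
from typing import Dict, Optional

SECTION_MAP = {
    "employer": re.compile(r"employer\s+information", re.I),
    "employee": re.compile(r"(?:injured\s+)?employee\s+information", re.I),
    "injury": re.compile(r"(?:injury|incident)\s+(?:information|details)", re.I),
}

# Hypothetical field table: field -> (owning section, pattern).
FIELD_PATTERNS = {
    "first_name": ("employee", re.compile(r"First\s+Name[.:\s]+([A-Z][a-z]+)\b")),
}


def partition(doc: str) -> Dict[str, str]:
    """Split doc into pools; each pool runs from its header to the next header."""
    hits = []
    for name, pat in SECTION_MAP.items():
        m = pat.search(doc)  # first occurrence only; a real parser needs stricter anchoring
        if m:
            hits.append((m.start(), name))
    hits.sort()
    pools = {}
    for i, (pos, name) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(doc)
        pools[name] = doc[pos:end]
    return pools


def extract(doc: str) -> Dict[str, Optional[str]]:
    """Run each field pattern against its assigned pool only."""
    pools = partition(doc)
    out = {}
    for field, (section, pat) in FIELD_PATTERNS.items():
        m = pat.search(pools.get(section, ""))
        out[field] = m.group(1) if m else None
    return out


# Hypothetical layout; entity names are taken from the case study.
doc = """Employer Information
Employer: Apex Staffing Solutions
Shipper: Franklin Logistics Inc
Employee Information
First Name: Terrence
Last Name: Jackson
Injury Information
Date of Injury: 2026-03-18
"""

pools = partition(doc)
result = extract(doc)
print(result)
```

Because the `first_name` pattern is never given the employer pool, "Franklin" cannot be returned for that field regardless of how the pattern is written.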
Finding 3 — Audit defensibility

Only one architecture passes regulatory examination

| Architecture | Answer to "where did this value come from?" | Examination result |
| --- | --- | --- |
| A: Groq | "The language model extracted policy number AP-2026-WC-9214 from the document. Confidence: high." | No provenance |
| B: LangExtract | "The value 'Franklin' was extracted from characters 412–419, which reads 'Franklin'." | Misleading: wrong entity |
| C: Docling+Regex | "First name matched by First\s+Name[.:\s]+([A-Z][a-z]+)\b in the Employee Information section. Deterministic. Reproducible on every run." | Passes examination |

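An answer of the kind given for Approach C could be backed by a stored record rather than reconstructed after the fact. The sketch below shows one possible shape, assuming the record keeps the pattern, the section name, the match offsets, and a hash of the section text. The `Provenance` fields and the helper are hypothetical and do not represent a format required by any regulator.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    field: str
    value: str
    section: str
    pattern: str
    start: int  # offsets are relative to the section text
    end: int
    section_sha256: str


def extract_with_provenance(
    field: str, section: str, section_text: str, pattern: str
) -> Optional[Provenance]:
    """Return the first capture group plus everything needed to replay the match."""
    m = re.search(pattern, section_text)
    if m is None:
        return None
    digest = hashlib.sha256(section_text.encode("utf-8")).hexdigest()
    return Provenance(field, m.group(1), section, pattern, m.start(1), m.end(1), digest)


# Hypothetical section text; the pattern is the one quoted for Approach C.
section_text = "Employee Information\nFirst Name: Terrence\n"
pattern = r"First\s+Name[.:\s]+([A-Z][a-z]+)\b"

first = extract_with_provenance("first_name", "employee", section_text, pattern)
second = extract_with_provenance("first_name", "employee", section_text, pattern)

# Same input and same pattern produce an identical record on every run.
print(first == second, first.value)
```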
Implications
1. Field classification precedes tool selection. The category of a field (deterministic, categorical, verbatim, or inferential) determines the correct extraction tier. Applying an LLM to Cat 1 deterministic fields introduces hallucination risk on the most auditable class of data in a claim file.
2. Grounding intervals are necessary but not sufficient for audit defensibility. A character offset that cites real source text does not prove correct entity attribution on complex, multi-entity documents. Section-aware deterministic extraction provides a stronger correctness guarantee for Cat 1–2 fields.
3. The LLM-required zone is 11–14% of this document's field set. Five of 36 fields require genuine inference: claim type, RTW status, attorney flag, same body part comparison, and delay threshold interpretation. The optimal architecture is not a choice between LLM and non-LLM; it is a policy that governs which fields go to which tier.
4. PHI data posture is a tier-selection constraint, not a post-design concern. For carriers with posture B (carrier-cloud) or C (air-gapped), external APIs are not available for PHI documents regardless of accuracy results. T0+T2 for Cat 1–3 combined with T4 local inference (Ollama, carrier VPC) for Cat 4 is the only architecture that satisfies these constraints end-to-end.
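The routing rule implied by these four points can be sketched as a small function. Several details below are assumptions: the Cat 2 and Cat 3 numbering is inferred from the order in which the categories are listed, posture "A" is a hypothetical label for a carrier that permits external APIs, and `select_tier` is an illustrative name rather than part of any benchmarked system.

```python
from enum import Enum


class Category(Enum):
    DETERMINISTIC = 1  # Cat 1
    CATEGORICAL = 2    # Cat 2 (numbering assumed)
    VERBATIM = 3       # Cat 3 (numbering assumed)
    INFERENTIAL = 4    # Cat 4


def select_tier(category: Category, posture: str) -> str:
    """Pick an extraction tier from field category and PHI data posture."""
    if category is not Category.INFERENTIAL:
        # Cat 1-3 never need an LLM, so posture does not matter.
        return "T0+T2"
    if posture in {"B", "C"}:
        # Carrier-cloud or air-gapped: external APIs are unavailable for PHI.
        return "T4"
    # Hypothetical posture "A": an external LLM API is permitted.
    return "T5"


for cat in Category:
    print(cat.name, select_tier(cat, "C"))
```

Keeping the decision in one function makes the posture constraint visible at design time instead of surfacing after a tool has been chosen.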

This benchmark is anchored to an active enterprise POC covering 79,908 annual documents across three use cases. Phase 1 recommendation: run Approach C on 50 real labeled documents before any GPU or LLM infrastructure investment.
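For a rough sense of scale, the per-document costs from Exhibit 1 can be multiplied by the annual volume. This is back-of-envelope arithmetic only, and it assumes every one of the 79,908 documents costs the same as the single FNOL measured here, which the benchmark does not establish for the other use cases.

```python
ANNUAL_DOCS = 79_908

# Per-document cost ranges (low, high) in USD, taken from Exhibit 1.
COST_PER_DOC = {
    "A": (0.04, 0.06),
    "B": (0.28, 0.42),
    "C": (0.00015, 0.00015),
}


def annual_cost(low: float, high: float, docs: int = ANNUAL_DOCS) -> tuple:
    """Scale a per-document cost range linearly to a yearly range."""
    return (low * docs, high * docs)


annual = {name: annual_cost(*rng) for name, rng in COST_PER_DOC.items()}
for name, (lo, hi) in annual.items():
    print(f"Approach {name}: ${lo:,.2f} to ${hi:,.2f} per year")

# Latency ratio behind the "10x" headline figure (Approach B vs. Approach A).
latency_ratio = 42.7 / 4.3
print(f"Latency ratio: {latency_ratio:.1f}x")
```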
