Practitioner Research · Insurance AI

What we build,
what we learn, and
what the evidence shows.

We architect governed AI systems for insurers and MGAs. This is where we publish the work — implemented workflows, benchmark studies, and the reasoning behind every architectural decision.

10
Published field use cases across insurance operations
3
Platform layers governing every AI decision — Certify, Contextualize, Compose
9
Production products — 5 certified data solutions and 4 insurance recipe packs
Perspectives

What we believe about insurance AI.

Not as a vendor. As practitioners who have built these systems, debugged them in production, and had to explain them to underwriters and regulators.

01 · The data problem

A single bad data point doesn't fail quietly in an agentic workflow — it propagates.

Every downstream decision an agent makes inherits the error that entered upstream. Trust in AI outcomes is not a function of the model — it's a function of the data pipeline. Carriers who treat data certification as a pre-deployment step, not a foundation, will keep hitting the same wall.

02 · The governance reality

AI is entering regulated decisions. The audit standard is the same as for human decisions.

When an AI agent participates in an underwriting decision or claims adjudication, regulators apply the same explainability and audit requirements as they do to a human adjuster. You can't build a governed outcome on an ungoverned pipeline. The architecture has to be designed for accountability from the start.

03 · The knowledge problem

The platform handles the computation. The knowledge stays yours.

Your appetite rules, jurisdiction requirements, and SOPs should be encoded, versioned, and applied consistently — not recalled probabilistically from a model that may have drifted since last training. Every AI decision should trace back to your procedures, your expertise, your institutional knowledge.

04 · The market shift

Boards are moving from pilot to P&L. The architecture has to move with them.

Proof-of-value expectations have shifted from "interesting" to "in production." Scaling agentic AI at this stage requires a method, not just a model — one that certifies the data before agents consume it, contextualizes institutional knowledge so agents reason correctly, and composes workflows that are auditable end to end.

Certify before you compose

Bad data in an agentic workflow doesn't fail at the source — it compounds across every downstream decision. Every data product needs a trust score before an agent touches it.
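To make the principle concrete, here is a minimal sketch of what a pre-consumption trust gate could look like. It is illustrative only: the dimension names, weights, and certification threshold are assumptions for this sketch, not any product's actual scoring.

```python
# Illustrative only: not Assure's actual scoring. Dimension names, weights,
# and the 0.85 certification threshold are assumptions made for this sketch.
from typing import Mapping

TRUST_DIMENSIONS: dict[str, float] = {
    "completeness": 0.12, "validity": 0.12, "accuracy": 0.12,
    "consistency": 0.10, "timeliness": 0.10, "uniqueness": 0.08,
    "lineage_coverage": 0.10, "schema_conformance": 0.08,
    "semantic_drift": 0.08, "pii_exposure": 0.10,
}  # ten dimensions, weights sum to 1.0

def composite_trust_score(scores: Mapping[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(weight * scores[dim] for dim, weight in TRUST_DIMENSIONS.items())

def certified_for_agents(scores: Mapping[str, float], threshold: float = 0.85) -> bool:
    """The gate: an agent only consumes a data product that clears the threshold."""
    return composite_trust_score(scores) >= threshold
```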

Knowledge encoded, not recalled

Appetite, jurisdiction rules, and SOPs as versioned, governed definitions — not floating in model weights where they can drift, hallucinate, or become unauditable.
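A hedged sketch of what "encoded and versioned" can mean in practice. The field names and the example appetite rule below are assumptions made for illustration, not a schema from any system described on this page.

```python
# Illustrative shape only; field names and the example rule are assumptions,
# not the actual schema of any product described on this page.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class GovernedRule:
    rule_id: str          # stable identifier, e.g. "appetite/gl/restaurants"
    version: str          # bumped on every change; old versions are never mutated
    effective_date: date  # the date this version starts governing decisions
    jurisdiction: str     # "ALL" or a specific state/country code
    expression: str       # machine-evaluable predicate, reviewed like code
    source: str           # pointer back to the SOP or underwriting guideline it encodes

appetite_rule = GovernedRule(
    rule_id="appetite/gl/restaurants",
    version="2024.2",
    effective_date=date(2024, 7, 1),
    jurisdiction="TX",
    expression="annual_revenue <= 5_000_000 and liquor_receipts_pct < 0.30",
    source="UW Guidelines section 4.2, rev. 2024-06",
)
```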

Regulated decisions need regulated systems

Underwriting and claims AI faces the same audit standards as human decisions. Explainability has to be designed in — it cannot be retrofitted onto a black-box pipeline.

Composable, not captive

Solutions that plug into the stack you already run — Snowflake, Databricks, AWS, Azure. You keep the code. You own the outcome. No platform lock-in.

Proven Results

10 implemented workflows.
Documented as practitioners.

Each case documents what the real problem is, why standard approaches don't solve it, and what an architecture that actually solves it looks like — without giving away the blueprints.

Underwriting · 10 min read

Prior Carrier Loss Runs: From PDF Chaos to Underwriting Intelligence

What risk signals are buried in prior carrier loss runs that underwriters never see because the data arrives as PDFs with carrier-specific codes that don't map to your coverage lines?

Loss Runs · Coverage Gaps
Read →
Underwriting · 10 min read

Municipal Budgets Tell Risk Stories Underwriters Never Have Time to Decode

Counties operate jails, water plants, law enforcement, and healthcare facilities — each with distinct liability profiles. Yet underwriters get 200-page PDFs and 15 minutes. What signal is lost, and what happens after a loss when that signal surfaces?

Public Sector · Budget Analysis
Read →
Claims · 11 min read

WC FNOL: Jurisdiction Rules That Run Consistently Across All 50 States

Workers Comp FNOL routing requires applying the exact statutory rules for the state and date of injury — statute of limitations, benefit schedules, reporting deadlines. LLM-based routing introduces variance that can't be audited against the version of law that applied at time of injury.

Workers Comp · Jurisdiction · Zero LLM
Read →
Operations · 16 min read

Your AI Is Making Decisions on Data Nobody Has Certified

Your enterprise is deploying AI agents that autonomously consume data from structured databases and unstructured documents. When a regulator asks "was this data certified before your AI used it to deny this claim?" — what do you show them?

AI Certification · Data Quality · 10 Dimensions
Read →
Operations · 12 min read

What if Audits Prevented Problems Instead of Just Documenting Them?

Traditional audit happens after binding — when it's too late to prevent problems and too expensive to fix them. This case documents a shift from compliance checkbox to preventive risk intelligence that evaluates 100% of submissions before binding.

Audit · Governance · Pre-bind
Read →
Operations · 14 min read

MDM Platforms Fail When Nobody Assessed Data Readiness First

Your enterprise acquires operations in APAC. Their customer names follow different cultural patterns, addresses use unfamiliar formats, and source schemas don't align with your MDM target model. What happens when the 5–6 week manual mapping process per region becomes the bottleneck preventing global expansion?

MDM Readiness · Entity Resolution
Read →
Operations · 16 min read

Enterprise MDM Provides No Sandbox to Test Configurations Before Production

You set your blocking scheme and auto-merge threshold — but will that generate 50,000 candidate pairs or 10 million false positives? Will golden records pick current addresses or stale CRM data? You discover these answers in production, with real customer data at risk.

Golden Records · Matching Engine
Read →
Actuarial · 12 min read

AI-Driven Cohort Analysis: What if Cohort Design Was Evidence-Based, Not Traditional?

Most actuaries inherit cohort segmentation from predecessors without rigorous testing. Cohort selection is foundational to reserve accuracy — yet it's rarely questioned. This case documents automating hypothesis generation, construction, and statistical validation to find objectively better segmentations.

Reserving · Loss Triangles · Statistical Testing
Read →
The Stack

What we build on.

The governed intelligence stack behind the use cases above. Eight capabilities across five layers — from AI-ready data pipelines to personified workbenches.

This isn't a product pitch — it's context for what makes the architectures in the field cases possible.

AI-Ready Pipelines

Assure

Data quality certification — structured and unstructured — before AI consumes it. 10-dimension composite trust score.

GA
AI-Ready Pipelines

Redact

1,100+ sensitive data types detected and redacted in-pipeline before egress. PHI, PII, PCI, HIPAA, GDPR.

GA
AI-Ready Pipelines

DataDNA

Pipeline-native lineage registry. The trust control plane for AI data consumption across Snowflake and Databricks.

Preview
Semantic

Semantic Hub

Insurance ontology + document intelligence + knowledge repository. The context layer for governed AI.

GA
Semantic

Resolve

AI-powered entity resolution. MDM pre-assessment in days, not weeks. Customer identity that carriers trust.

GA
Governance

AI Compliance Hub

Every AI decision reconstructible from governing chunks, version, effective date, and trace path.

Preview
Workflow

Recipe Packs

Insurance agentic workflows for WC/Auto/Property claims and commercial underwriting. Each tool invocation is traceable.

GA
Apps

Workbench

Claims examiner intelligence workspace. Facts, intelligence, and evidence in a single cockpit from FNOL intake.

GA
Evidence Lab

We publish the criteria before we publish the results.

Standard OCR benchmarks — DocVQA, SROIE, FUNSD — were built for document understanding tasks, not insurance extraction. They measure whether a model can answer questions about a document. That's not the problem. The problem is whether a VIN extracted from a scanned police report is byte-perfect, whether the same document produces the same output on every run, and whether that output can be traced back to a specific location in the source when a regulator asks.

The framework below is what we actually use to evaluate tools. We share it because the evaluation criteria are as important as the results — and because vendors who cherry-pick generic benchmarks are counting on you not asking the right questions.

Published
OCR Quality · Extraction Architecture · Data Sovereignty · Audit Trail

OCR & Extraction Benchmark Framework for Insurance Documents

Five evaluation groups. The first four map to standard document intelligence dimensions. The fifth — audit and compliance — is the one that determines whether output is actually deployable in a regulated insurance workflow. Most tools pass the first four. Group 5 is where the field narrows.

Group 1
Text Accuracy
Character error rate and word error rate measure overall text quality. For insurance, what matters more is field-level exact match on structured fields — dates, amounts, VINs, policy numbers, DL#, ICD-10 codes. A tool that achieves 99% character accuracy but drops one digit from a policy number fails the insurance test.
EM ≥ 95%
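A minimal sketch of the field-level exact-match metric described above; the normalization step is an assumption and would be tuned per field type in practice.

```python
# Sketch of the field-level exact-match metric; normalization rules are assumptions.
def normalize(value: str) -> str:
    """Strip whitespace and unify case so formatting noise doesn't mask real errors."""
    return " ".join(value.split()).upper()

def field_exact_match(extracted: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Share of ground-truth fields (dates, amounts, VINs, policy numbers...)
    reproduced exactly after light normalization. One wrong digit counts as a miss."""
    if not ground_truth:
        return 1.0
    hits = sum(
        1 for field, truth in ground_truth.items()
        if normalize(extracted.get(field, "")) == normalize(truth)
    )
    return hits / len(ground_truth)
```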
Group 2
Structure Fidelity
Table TEDS (Tree-Edit-Distance-based Similarity) measures whether table structure is preserved — rows, columns, merged cells, reading order. Loss run schedules, ACORD endorsement tables, and IA report summaries are the primary stress test. Split-page tables and merged headers are the most common failure modes.
TEDS ≥ 0.90
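True TEDS computes a tree edit distance over the table's HTML representation and is more involved than belongs here. The sketch below is a deliberately crude cell-grid proxy that still catches the two failure modes named above: split-page rows and mishandled merged headers.

```python
# Not TEDS itself: a crude cell-grid proxy for structure fidelity. True TEDS
# scores a tree edit distance over the table's HTML structure; this sketch only
# checks shape and per-cell agreement, which is enough for a first-pass screen.
def grid_similarity(predicted: list[list[str]], reference: list[list[str]]) -> float:
    if len(predicted) != len(reference):
        return 0.0  # row count mismatch: reading order or a page split was lost
    total, matched = 0, 0
    for pred_row, ref_row in zip(predicted, reference):
        if len(pred_row) != len(ref_row):
            return 0.0  # column count mismatch: merged cells were mishandled
        for pred_cell, ref_cell in zip(pred_row, ref_row):
            total += 1
            matched += pred_cell.strip() == ref_cell.strip()
    return matched / total if total else 1.0
```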
Group 3
Robustness Under Real Conditions
A DPI degradation curve (300 / 200 / 150 / 75 DPI) simulates fax-chain degradation common in police reports and older IA documents. Scan quality classes A through D characterize the document population. A tool that performs well on clean scans but collapses at 150 DPI isn't usable in production.
<10% drift 300→150
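A sketch of how the degradation curve can be produced. The `ocr_text` callable is a placeholder for whichever engine is under test, and downsample-then-upsample resizing is used as a simple stand-in for fax-chain quality loss.

```python
# Sketch of the DPI degradation curve; `ocr_text` stands in for the engine under test.
from PIL import Image

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def degradation_curve(page: Image.Image, reference: str, ocr_text, source_dpi: int = 300) -> dict[int, float]:
    """CER at each simulated DPI; the Group 3 gate is <10% drift from 300 to 150."""
    results = {}
    for dpi in (300, 200, 150, 75):
        scale = dpi / source_dpi
        w, h = page.size
        # Downsample then upsample to mimic a low-resolution scan of the same page.
        degraded = page.resize((int(w * scale), int(h * scale))).resize((w, h))
        results[dpi] = cer(reference, ocr_text(degraded))
    return results
```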
Group 4
Operational Performance
Throughput, P95 latency, memory footprint, cold-start time, and GPU vs. CPU dependency. For high-volume carriers, all of these matter. For AMICA-scale operations (~80K docs/year), throughput is rarely the constraint — latency for real-time FNOL workflows and memory footprint for on-premises deployment are the actual concerns.
≤8s/page OCR
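A minimal harness for the operational numbers, with `extract` standing in for the tool under test. A production harness would also track memory footprint and cold-start time, which are omitted here for brevity.

```python
# Sketch of the operational measurement; `extract` is a placeholder for the tool under test.
import time

def operational_profile(documents: list, extract) -> dict:
    latencies = []
    start = time.perf_counter()
    for doc in documents:  # assumes a non-empty document sample
        t0 = time.perf_counter()
        extract(doc)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "throughput_docs_per_min": 60 * len(documents) / wall,
        "p95_latency_s": p95,  # the number that matters for real-time FNOL
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```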
Group 5
Audit & Compliance
Grounding rate: what percentage of extracted field values can be traced to a specific character offset or bounding box in the source document. Hallucination rate: what percentage of field values have no textual basis in the source at all. Run variance: whether the same document produces the same output across ten identical runs. Data sovereignty posture: whether PHI ever leaves the customer's infrastructure boundary during processing.
Hallucination = 0
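A sketch of how the Group 5 metrics can be computed, assuming each extracted field carries the span the tool claims as its source. The field structure is an assumption for illustration, not a fixed interface.

```python
# Sketch of the Group 5 metrics. Assumes each extracted field carries the character
# offsets the tool claims as its source; the field structure is illustrative.
from collections import Counter

def grounding_and_hallucination(fields: dict[str, dict], source_text: str) -> tuple[float, float]:
    """fields: {name: {"value": str, "span": (start, end) | None}}.
    Grounded: the claimed span actually contains the value.
    Hallucinated: the value appears nowhere in the source at all."""
    grounded = hallucinated = 0
    for f in fields.values():
        span = f.get("span")
        if span and f["value"] in source_text[span[0]:span[1]]:
            grounded += 1
        elif f["value"] not in source_text:
            hallucinated += 1
    n = len(fields) or 1
    return grounded / n, hallucinated / n  # the gate requires hallucination == 0

def run_variance(outputs: list[dict]) -> float:
    """Share of runs (same document, identical settings) that differ from the modal output."""
    canonical = [tuple(sorted(o.items())) for o in outputs]
    modal_count = Counter(canonical).most_common(1)[0][1]
    return 1 - modal_count / len(canonical)
```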

Group 5 is a compliance gate, not a performance metric. A tool can pass Groups 1–4 and be disqualified by Group 5 alone.

Request the full framework →
Published

The Six-Tier Extraction Stack

"Which tool should we use?" is the wrong question. The right question is which tier the field requires — and that determines the tool, the infrastructure, and the sovereignty posture.

T0–T2 pdfplumber, Tesseract, regex — no LLM, no GPU, data-sovereign
T2.5 Granite-Docling-258M (IBM) — document VLM, 8GB, Apache 2.0
T4 Ollama + Llama 3.3 70B — local LLM, on-premises, Cat 4 fields only
T5 Groq / OpenAI — external API, fast, PHI risk at posture B/C
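As a rough illustration of routing by field rather than by tool: the Cat 1 and Cat 4 labels follow the tiers above, while the intermediate categories and the egress flag are assumptions made for this sketch.

```python
# Illustrative routing only. Cat 1 and Cat 4 follow the labels used above; the
# intermediate categories and `external_egress_allowed` (a stand-in for the
# sovereignty posture) are assumptions for this sketch.
def select_tier(field_category: int, external_egress_allowed: bool) -> str:
    """Pick the cheapest, most deterministic tier that can carry the field."""
    if field_category == 1:
        return "T0-T2"   # deterministic: pdfplumber/Tesseract + regex, no LLM, no GPU
    if field_category < 4:
        return "T2.5"    # small document VLM, runs locally on modest hardware
    # Cat 4: interpretation-heavy fields that genuinely need a large model
    return "T5" if external_egress_allowed else "T4"
```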
In Progress

Regex vs. LLM on Structured Fields

Head-to-head on Cat 1 fields — VIN, DL#, DOB, policy number, claim number — across 200 FNOL and police report documents. Measuring field-level exact match, hallucination rate, and run variance side by side.

FNOL · Police Reports
Notify when published →
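For a flavor of the deterministic side of this head-to-head, here are illustrative Cat 1 patterns. The VIN character set is standard (post-1981 VINs never contain I, O, or Q); the DOB and policy-number shapes are placeholders that would be state- and carrier-specific in practice.

```python
# Illustrative patterns for the Cat 1 fields named above; production patterns need
# per-state DL# formats and per-carrier policy/claim numbering, which vary widely.
import re

CAT1_PATTERNS = {
    # Post-1981 VINs are 17 characters and never contain I, O, or Q.
    "vin": re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b"),
    # US-style dates; real documents also need month-name and ISO variants.
    "dob": re.compile(r"\b(0[1-9]|1[0-2])[/-](0[1-9]|[12]\d|3[01])[/-](19|20)\d{2}\b"),
    # Placeholder shape only: policy and claim numbers are carrier-specific.
    "policy_number": re.compile(r"\b[A-Z]{2,4}-?\d{6,10}\b"),
}

def extract_cat1(text: str) -> dict[str, list[str]]:
    """Deterministic extraction: same input, same output, every run. That is the
    property the benchmark compares against LLM-based extraction."""
    return {name: [m.group(0) for m in pattern.finditer(text)]
            for name, pattern in CAT1_PATTERNS.items()}
```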
In Progress

Scanned Doc Quality: Tesseract vs. Granite-Docling

DPI degradation curve at 300/200/150/75 DPI. Table TEDS and Cat 1 exact match across 50 scanned insurance documents. Testing the hypothesis that Granite-Docling degrades by under 10% versus 25–30% for Tesseract.

Scanned Docs · Tables
Notify when published →

If the problems in these cases
sound familiar, let's talk.

We work with insurers and MGAs who are serious about the architecture — not just the demo. Conversations start with the problem, not the product.