We architect governed AI systems for insurers and MGAs. This is where we publish the work — implemented workflows, benchmark studies, and the reasoning behind every architectural decision.
Not as a vendor. As practitioners who have built these systems, debugged them in production, and had to explain them to underwriters and regulators.
When an insurer says "we're concerned about LLM cost," the underlying question is: can you show a regulator exactly why a specific value was extracted from a specific document? Cost is the proxy for a governance concern that's harder to articulate.
A 70B-parameter model extracting a VIN from a police report is engineering theater. A compiled regex against the 17-character alphanumeric pattern achieves 99% exact match, zero variance, zero hallucination, and produces a verifiable character offset. The LLM is worse on this task by every measure that matters.
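To make the contrast concrete, here is a minimal sketch of deterministic VIN extraction with a verifiable character offset; the pattern and the sample report text are illustrative, not the production rule set.

```python
import re
from typing import Iterator

# VINs are 17 characters drawn from A-Z and 0-9, excluding I, O, and Q.
# Word boundaries keep the match from bleeding into surrounding tokens.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")

def extract_vins(text: str) -> Iterator[dict]:
    """Yield each VIN candidate with the character offsets that ground it."""
    for match in VIN_PATTERN.finditer(text):
        yield {
            "value": match.group(0),
            "start": match.start(),  # verifiable character offset into the source text
            "end": match.end(),
        }

# The offsets point back to the exact span in the source text.
report = "Vehicle 2 (VIN 1HGCM82633A004352) was stationary at the time of impact."
print(list(extract_vins(report)))
```

Run twice, the output is identical, which is the whole point: there is no sampling step to introduce variance and nothing to hallucinate.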
FNOL documents, police reports, and IA reports contain DOB, DL#, medical codes, and policy numbers. Every external LLM API call with these documents is a potential HIPAA exposure. No anonymization wrapper fixes this — the architecture has to ensure PHI never leaves the carrier's environment.
A governed system means every field value traces to a document position, every rule traces to a versioned definition, and every decision can be reconstructed from first principles. You can't retrofit this onto a black-box LLM pipeline. It requires the right structure from the start.
Every field that can be extracted without inference should be. An LLM adds cost and variance without improving quality on structured fields.
Char offset for text, bounding box for images, section ID for documents. "The model said so" is not an audit trail for insurance operations.
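One way to carry that grounding through the pipeline is a small provenance record attached to every extracted value. The shape and field names below are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class FieldProvenance:
    """Grounding evidence for a single extracted field value (illustrative shape)."""
    field_name: str                       # e.g. "vin", "policy_number"
    value: str                            # the extracted value itself
    document_id: str                      # which document the value came from
    char_span: Optional[Tuple[int, int]] = None                        # start/end offsets for text sources
    bounding_box: Optional[Tuple[float, float, float, float]] = None   # x0, y0, x1, y1 for image sources
    section_id: Optional[str] = None      # section identifier for structured documents
    extractor: str = "regex"              # which tier/tool produced the value
    extractor_version: str = "unversioned"  # so the decision can be reproduced later

record = FieldProvenance(
    field_name="vin",
    value="1HGCM82633A004352",
    document_id="fnol-2024-000123",
    char_span=(15, 32),
)
```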
For carrier-cloud and air-gapped deployments, the full pipeline runs within the carrier's environment. Nothing leaves.
Low-confidence fields surface to handlers by design. AMICA said it directly: "duplicative, unclear data escalated to a representative." That's the right model.
Each case documents the real problem, why generic approaches fall short, and what an architecture that actually solves it looks like. No NDA — but no implementation blueprints either.
Underwriters spend 60–70% of their time on triage work that doesn't require underwriting judgment — extracting data from PDFs, checking appetite guidelines, routing to queues. Meanwhile, submissions sit 24–48 hours before anyone evaluates them, and 40% get mis-routed. What if Fast NOs were as fast as Fast YESes — and both happened in minutes?
When an FNOL arrives — workplace injury, multi-vehicle collision, property loss — adjusters have minutes to assess liability, detect fraud, identify subrogation, calculate reserves, and route correctly. What happens when those signals are buried in narratives, jurisdiction rules vary across 50 states, and institutional knowledge lives only in expert adjusters' heads?
What risk signals are buried in prior carrier loss runs that underwriters never see because the data arrives as PDFs with carrier-specific codes that don't map to your coverage lines?
Counties operate jails, water plants, law enforcement, and healthcare facilities — each with distinct liability profiles. Yet underwriters get 200-page PDFs and 15 minutes. What signal is lost, and what happens after a loss when that signal surfaces?
Workers Comp FNOL routing requires applying the exact statutory rules for the state and date of injury — statute of limitations, benefit schedules, reporting deadlines. LLM-based routing introduces variance that can't be audited against the version of law that applied at time of injury.
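A minimal sketch of what version-resolved routing can look like, assuming a rule table keyed by state and effective dates. The deadlines and rule IDs below are placeholders, not real statutes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class WcRuleVersion:
    """One version of a state's WC reporting rule (placeholder values, not real statutes)."""
    state: str
    effective_from: date
    effective_to: Optional[date]   # None means still in force
    reporting_deadline_days: int
    rule_id: str                   # versioned definition the routing decision cites

RULES = [
    WcRuleVersion("CA", date(2020, 1, 1), date(2023, 12, 31), 30, "CA-WC-REPORT-v7"),
    WcRuleVersion("CA", date(2024, 1, 1), None, 30, "CA-WC-REPORT-v8"),
]

def rule_in_force(state: str, date_of_injury: date) -> WcRuleVersion:
    """Resolve the rule version that applied on the date of injury, not today's version."""
    for rule in RULES:
        if (rule.state == state
                and rule.effective_from <= date_of_injury
                and (rule.effective_to is None or date_of_injury <= rule.effective_to)):
            return rule
    raise LookupError(f"No {state} rule version covers {date_of_injury}")

# The routing decision records rule_id, so an auditor can check it against that exact version.
print(rule_in_force("CA", date(2023, 6, 15)).rule_id)  # CA-WC-REPORT-v7
```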
Your enterprise is deploying AI agents that autonomously consume data from structured databases and unstructured documents. When a regulator asks "was this data certified before your AI used it to deny this claim?" — what do you show them?
Traditional audit happens after binding — when it's too late to prevent problems and too expensive to fix them. This case documents a shift from compliance checkbox to preventive risk intelligence that evaluates 100% of submissions before binding.
Your enterprise acquires operations in APAC. Their customer names follow different cultural patterns, addresses use unfamiliar formats, source schemas don't align with your MDM target model. What happens when the 5–6 week manual mapping process per region becomes the bottleneck preventing global expansion?
You set your blocking scheme and auto-merge threshold — but will that generate 50,000 candidate pairs or 10 million false positives? Will golden records pick current addresses or stale CRM data? You discover these answers in production, with real customer data at risk.
Most actuaries inherit cohort segmentation from predecessors without rigorous testing. Cohort selection is foundational to reserve accuracy — yet it's rarely questioned. This case documents automating hypothesis generation, construction, and statistical validation to find objectively better segmentations.
Framework assessments and benchmark studies grounded in insurance document realities — not vendor marketing. Updated as we conduct new research.
A five-group evaluation methodology for selecting OCR and extraction approaches. Covers text accuracy (CER/WER + field-level exact match), structure fidelity (Table TEDS, section boundaries), robustness under real-world conditions (DPI degradation curve, scan quality Classes A–D), operational performance, and the audit/compliance group that determines whether output is deployable in regulated insurance.
CER, WER, field-level EM. VIN and policy # must be byte-perfect. A metric sketch follows these groups.
Table TEDS, reading order, section boundaries, form field linkage.
DPI curve 300/200/150/75, scan class A–D, jurisdiction variance.
Throughput, latency P50/P95, footprint, CPU vs. GPU dependency.
Grounding rate, hallucination = 0, determinism, data sovereignty posture.
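Of the groups above, text accuracy is the most mechanical to score. A short sketch of CER and field-level exact match under standard definitions: Levenshtein edits over reference length, and byte-for-byte equality for Cat 1 fields.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))                 # standard dynamic-programming edit distance
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, 1)

def field_exact_match(reference: str, hypothesis: str) -> bool:
    """Cat 1 fields like VIN and policy number must match byte-for-byte."""
    return reference == hypothesis

print(cer("1HGCM82633A004352", "1HGCM82633A0O4352"))  # one substituted character -> ~0.059
print(field_exact_match("POL-99231", "POL-99231"))     # True
```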
Why the question "which tool should we use?" is wrong. The right question is "which tier does this field require?" — and the tier determines the tool, the infrastructure, and the sovereignty posture.
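A sketch of what tier-first routing can look like. The tier labels echo the T2.5/T4/T5 naming used in the benchmark studies below, but the field assignments and the deterministic tier's label are assumptions for illustration only.

```python
# Illustrative only: field-to-tier assignments and the "T1 deterministic" label are assumptions.
TIER_PROFILE = {
    "T1 deterministic": {"tool": "compiled regex / lookup tables",   "sovereignty": "never leaves the carrier"},
    "T2.5 layout OCR":  {"tool": "layout-aware OCR",                 "sovereignty": "never leaves the carrier"},
    "T4 local LLM":     {"tool": "in-environment LLM (e.g. Ollama)", "sovereignty": "never leaves the carrier"},
    "T5 cloud LLM":     {"tool": "external LLM API (e.g. Groq)",     "sovereignty": "requires redaction before egress"},
}

FIELD_TIER = {                       # which tier the field requires (assumed examples)
    "vin": "T1 deterministic",
    "policy_number": "T1 deterministic",
    "damage_table": "T2.5 layout OCR",
    "subrogation_indicator": "T4 local LLM",
}

def route(field_name: str) -> dict:
    """The field's required tier, not a house-wide tool preference, picks the tool and posture."""
    return TIER_PROFILE[FIELD_TIER[field_name]]

print(route("vin"))
print(route("subrogation_indicator"))
```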
Direct comparison of compiled regex against Groq LLM on Cat 1 fields (VIN, plate, DOB, DL#, policy number) across 200 FNOL and police report documents. Metrics: field-level exact match, hallucination rate, run variance.
Hypothesis: regex F1 ≥ 0.97, hallucination = 0.00, variance = 0.00. LLM F1 0.85–0.93, non-zero variance.
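A sketch of how those three metrics could be scored across repeated runs. The definitions below, in particular treating any value emitted for a field the document does not contain as a hallucination, are assumptions, not the study's actual harness.

```python
from statistics import pvariance

def score_runs(gold: dict, runs: list[dict]) -> dict:
    """Score repeated extraction runs against gold labels (assumed metric definitions).

    gold maps field -> true value, or None when the field is absent from the document.
    runs holds one dict per repeated run, field -> extracted value.
    """
    answerable = [f for f, v in gold.items() if v is not None]
    per_run_em, hallucinations = [], 0
    for run in runs:
        hits = sum(1 for f in answerable if run.get(f) == gold[f])
        per_run_em.append(hits / len(answerable))
        # A hallucination is a value produced for a field the document does not contain.
        hallucinations += sum(1 for f, v in gold.items() if v is None and run.get(f) is not None)
    return {
        "mean_exact_match": sum(per_run_em) / len(per_run_em),
        "run_variance": pvariance(per_run_em),   # 0.0 means every run scored identically
        "hallucination_count": hallucinations,
    }

gold = {"vin": "1HGCM82633A004352", "dl_number": None}
runs = [{"vin": "1HGCM82633A004352"}, {"vin": "1HGCM82633A004352", "dl_number": "D1234567"}]
print(score_runs(gold, runs))
```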
DPI degradation curve study at 300/200/150/75 DPI. Table TEDS, Cat 1 exact match, and DPI sensitivity coefficient across 50 scanned insurance documents.
Hypothesis: T2.5 degrades <10% vs. Tesseract at 25–30%. Table TEDS 0.90+ vs. 0.72.
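The sensitivity coefficient can be defined several ways; the sketch below uses one plausible definition, a least-squares slope of accuracy against DPI, and the example curves are shaped like the hypothesis rather than measured results.

```python
def dpi_sensitivity(scores_by_dpi: dict[int, float]) -> float:
    """One plausible definition: least-squares slope of accuracy vs. DPI,
    reported as accuracy change per 100 DPI. Smaller magnitude = more robust."""
    xs = sorted(scores_by_dpi)
    ys = [scores_by_dpi[d] for d in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope * 100

# Hypothetical curves shaped like the hypothesis above (not measured results).
tier_25   = {300: 0.96, 200: 0.95, 150: 0.93, 75: 0.88}   # ~8% degradation from 300 to 75 DPI
tesseract = {300: 0.90, 200: 0.84, 150: 0.76, 75: 0.66}   # ~27% degradation
print(round(dpi_sensitivity(tier_25), 4), round(dpi_sensitivity(tesseract), 4))
```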
Accuracy comparison of Ollama + Llama 3.3 70B (T4, on-premise) versus Groq API (T5, cloud) on subrogation indicator detection in IA reports. Includes a T2 trigger pre-filter effectiveness study.
The governed intelligence stack behind the use cases above. Ten capabilities across four layers — from AI-ready data pipelines to personified workbenches.
This isn't a product pitch — it's context for what makes the architectures in the field cases possible.
Data quality certification — structured and unstructured — before AI consumes it. 10-dimension composite trust score.
GA · 1,100+ sensitive data types detected and redacted in-pipeline before egress. PHI, PII, PCI, HIPAA, GDPR.
GA · Pipeline-native lineage registry. The trust control plane for AI data consumption across Snowflake and Databricks.
Preview · Insurance ontology + document intelligence + knowledge repository. The context layer for governed AI.
GA · AI-powered entity resolution. MDM pre-assessment in days, not weeks. Customer identity that carriers trust.
GA · Every AI decision reconstructible from governing chunks, version, effective date, and trace path.
Preview · Insurance agentic workflows for WC/Auto/Property claims and commercial underwriting. Each tool invocation is traceable.
GA · Claims examiner intelligence workspace. Facts, intelligence, and evidence in a single cockpit from FNOL intake.
We work with insurers and MGAs who are serious about the architecture — not just the demo. Conversations start with the problem, not the product.