We architect governed AI systems for insurers and MGAs. This is where we publish the work — implemented workflows, benchmark studies, and the reasoning behind every architectural decision.
Not as a vendor. As practitioners who have built these systems, debugged them in production, and had to explain them to underwriters and regulators.
Every downstream decision an agent makes inherits the error that entered upstream. Trust in AI outcomes is not a function of the model — it's a function of the data pipeline. Carriers who treat data certification as a pre-deployment step, not a foundation, will keep hitting the same wall.
When an AI agent participates in an underwriting decision or claims adjudication, regulators apply the same explainability and audit requirements as they do to a human adjuster. You can't build a governed outcome on an ungoverned pipeline. The architecture has to be designed for accountability from the start.
Your appetite rules, jurisdiction requirements, and SOPs should be encoded, versioned, and applied consistently — not recalled probabilistically from a model that may have drifted since last training. Every AI decision should trace back to your procedures, your expertise, your institutional knowledge.
Proof-of-value expectations have shifted from "interesting" to "in production." Scaling agentic AI at this stage requires a method, not just a model — one that certifies the data before agents consume it, contextualizes institutional knowledge so agents reason correctly, and composes workflows that are auditable end to end.
Bad data in an agentic workflow doesn't fail at the source — it compounds across every downstream decision. Every data product needs a trust score before an agent touches it.
Appetite, jurisdiction rules, and SOPs as versioned, governed definitions — not floating in model weights where they can drift, hallucinate, or become unauditable.
Underwriting and claims AI faces the same audit standards as human decisions. Explainability has to be designed in — it cannot be retrofitted onto a black-box pipeline.
Solutions that plug into the stack you already run — Snowflake, Databricks, AWS, Azure. You keep the code. You own the outcome. No platform lock-in.
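The "trust score before an agent touches it" principle above can be sketched as a simple consumption gate. This is an illustration only: the dimension names, the equal weighting, and the 0.8 threshold are hypothetical stand-ins, not the actual 10-dimension certification scheme.

```python
from dataclasses import dataclass

# Hypothetical quality dimensions; a real certification scheme
# (e.g., a 10-dimension composite) defines its own set and weights.
DIMENSIONS = ("completeness", "freshness", "lineage_coverage", "schema_conformance")

@dataclass
class DataProduct:
    name: str
    scores: dict  # dimension -> 0.0..1.0

def trust_score(product: DataProduct) -> float:
    """Composite trust score: unweighted mean of dimension scores (illustrative)."""
    return sum(product.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

def agent_may_consume(product: DataProduct, threshold: float = 0.8) -> bool:
    """Gate: an agent only reads data products certified above the threshold."""
    return trust_score(product) >= threshold

loss_runs = DataProduct("prior_carrier_loss_runs", {
    "completeness": 0.92, "freshness": 0.85,
    "lineage_coverage": 0.70, "schema_conformance": 0.95,
})
print(agent_may_consume(loss_runs))  # mean = 0.855 -> True
```

The point of the sketch is the shape, not the math: certification runs before consumption, and the gate is enforced in the pipeline rather than left to agent discretion.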
Each case documents what the real problem is, why standard approaches don't solve it, and what an architecture that does actually looks like — without giving away the blueprints.
Underwriters spend 60–70% of their time on triage work that doesn't require underwriting judgment — extracting data from PDFs, checking appetite guidelines, routing to queues. Meanwhile, submissions sit 24–48 hours before anyone evaluates them, and 40% get mis-routed. What if Fast NOs were as fast as Fast YESes — and both happened in minutes?
When an FNOL arrives — workplace injury, multi-vehicle collision, property loss — adjusters have minutes to assess liability, detect fraud, identify subrogation, calculate reserves, and route correctly. What happens when those signals are buried in narratives, jurisdiction rules vary across 50 states, and institutional knowledge lives only in expert adjusters' heads?
What risk signals are buried in prior carrier loss runs that underwriters never see because the data arrives as PDFs with carrier-specific codes that don't map to your coverage lines?
Counties operate jails, water plants, law enforcement, and healthcare facilities — each with distinct liability profiles. Yet underwriters get 200-page PDFs and 15 minutes. What signal is lost, and what happens after a loss when that signal surfaces?
Workers Comp FNOL routing requires applying the exact statutory rules for the state and date of injury — statute of limitations, benefit schedules, reporting deadlines. LLM-based routing introduces variance that can't be audited against the version of law that applied at time of injury.
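The deterministic alternative to LLM-based routing described above can be sketched as a versioned rule lookup keyed on state and date of injury. The table values below are placeholders, not real statutes, and the function is a minimal illustration of the idea, not our implementation.

```python
import datetime

# Hypothetical versioned rule table: (state, effective_from) -> reporting
# deadline in days. Values are placeholders for illustration only.
WC_REPORTING_DEADLINES = {
    ("CA", datetime.date(2020, 1, 1)): 30,
    ("CA", datetime.date(2023, 1, 1)): 21,
    ("TX", datetime.date(2019, 9, 1)): 30,
}

def rule_as_of(state: str, date_of_injury: datetime.date) -> int:
    """Deterministic lookup: the rule version in force on the date of injury.

    The same (state, date) pair always resolves to the same statute
    version, which is what makes the routing auditable after the fact."""
    versions = [(eff, days) for (st, eff), days in WC_REPORTING_DEADLINES.items()
                if st == state and eff <= date_of_injury]
    if not versions:
        raise LookupError(f"no rule version for {state} on {date_of_injury}")
    return max(versions)[1]  # latest effective date not after the injury

print(rule_as_of("CA", datetime.date(2022, 6, 1)))  # resolves to the 2020 version: 30
```

Contrast with probabilistic recall: a model may answer differently run to run, and there is no version pointer to audit against the law as it stood at the time of injury.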
Your enterprise is deploying AI agents that autonomously consume data from structured databases and unstructured documents. When a regulator asks "was this data certified before your AI used it to deny this claim?" — what do you show them?
Traditional audit happens after binding — when it's too late to prevent problems and too expensive to fix them. This case documents a shift from compliance checkbox to preventive risk intelligence that evaluates 100% of submissions before binding.
Your enterprise acquires operations in APAC. Their customer names follow different cultural patterns, addresses use unfamiliar formats, source schemas don't align with your MDM target model. What happens when the 5–6 week manual mapping process per region becomes the bottleneck preventing global expansion?
You set your blocking scheme and auto-merge threshold — but will that generate 50,000 candidate pairs or 10 million false positives? Will golden records pick current addresses or stale CRM data? You discover these answers in production, with real customer data at risk.
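The "50,000 candidate pairs or 10 million" question above is answerable before production: within each block of size n, a matcher scores n·(n−1)/2 pairs, so you can estimate comparison volume from the blocking key alone. The key below (surname prefix plus postal prefix) is a hypothetical example, not a recommended scheme.

```python
from collections import Counter

def candidate_pair_count(records, blocking_key):
    """Estimate how many pairwise comparisons a blocking scheme generates:
    within each block of size n, the matcher scores n*(n-1)//2 pairs."""
    blocks = Counter(blocking_key(r) for r in records)
    return sum(n * (n - 1) // 2 for n in blocks.values())

# Hypothetical blocking key: first 3 letters of surname + postal prefix.
def key(rec):
    return (rec["surname"][:3].lower(), rec["postal"][:3])

records = [
    {"surname": "Nguyen",    "postal": "94107"},
    {"surname": "Nguyen",    "postal": "94105"},
    {"surname": "Nguyenova", "postal": "94110"},
    {"surname": "Smith",     "postal": "10001"},
]
print(candidate_pair_count(records, key))  # 3 records share ("ngu", "941") -> 3 pairs
```

Run this against a sample of real record distributions and an over-broad block shows up as a quadratic blow-up in the count, long before any customer data is merged.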
Most actuaries inherit cohort segmentation from predecessors without rigorous testing. Cohort selection is foundational to reserve accuracy — yet it's rarely questioned. This case documents automating hypothesis generation, construction, and statistical validation to find objectively better segmentations.
The governed intelligence stack behind the use cases above. Ten capabilities across four layers — from AI-ready data pipelines to personified workbenches.
This isn't a product pitch — it's context for what makes the architectures in the field cases possible.
Data quality certification — structured and unstructured — before AI consumes it. 10-dimension composite trust score.
GA · 1,100+ sensitive data types detected and redacted in-pipeline before egress: PHI, PII, PCI, HIPAA, GDPR.
GA · Pipeline-native lineage registry. The trust control plane for AI data consumption across Snowflake and Databricks.
Preview · Insurance ontology + document intelligence + knowledge repository. The context layer for governed AI.
GA · AI-powered entity resolution. MDM pre-assessment in days, not weeks. Customer identity that carriers trust.
GA · Every AI decision reconstructible from governing chunks, version, effective date, and trace path.
Preview · Insurance agentic workflows for WC/Auto/Property claims and commercial underwriting. Each tool invocation is traceable.
GA · Claims examiner intelligence workspace. Facts, intelligence, and evidence in a single cockpit from FNOL intake.
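The traceability capability above says every AI decision is reconstructible from governing chunks, version, effective date, and trace path. As a sketch of what that implies, a decision could ship with a record roughly like the one below; the field names and IDs are hypothetical, not the actual schema.

```python
from dataclasses import dataclass
import datetime
import json

@dataclass
class DecisionTrace:
    """Illustrative shape of an auditable AI decision record. The point is
    that the decision can be reconstructed without re-running the model."""
    decision_id: str
    outcome: str
    governing_chunks: list         # IDs of the knowledge chunks applied
    ruleset_version: str           # version of appetite/SOP definitions used
    effective_date: datetime.date  # which rule version was in force
    trace_path: list               # ordered tool/agent invocations

trace = DecisionTrace(
    decision_id="dec-00172",
    outcome="route_to_subrogation_review",
    governing_chunks=["sop-wc-041#v7", "statute-ca-5402#2020-01-01"],
    ruleset_version="appetite-2024.09",
    effective_date=datetime.date(2022, 6, 1),
    trace_path=["fnol_intake", "jurisdiction_lookup", "routing_decision"],
)
print(json.dumps({**trace.__dict__, "effective_date": str(trace.effective_date)}, indent=2))
```

When a regulator asks why a claim was routed a certain way, the answer is a lookup against this record, not a best-effort re-prompt.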
Standard OCR benchmarks — DocVQA, SROIE, FUNSD — were built for document understanding tasks, not insurance extraction. They measure whether a model can answer questions about a document. That's not the problem. The problem is whether a VIN extracted from a scanned police report is byte-perfect, whether the same document produces the same output on every run, and whether that output can be traced back to a specific location in the source when a regulator asks.
The framework below is what we actually use to evaluate tools. We share it because the evaluation criteria are as important as the results — and because vendors who cherry-pick generic benchmarks are counting on you not asking the right questions.
Five evaluation groups. The first four map to standard document intelligence dimensions. The fifth — audit and compliance — is the one that determines whether output is actually deployable in a regulated insurance workflow. Most tools pass the first four. Group 5 is where the field narrows.
Group 5 is a compliance gate, not a performance metric. A tool can pass Groups 1–4 and be disqualified by Group 5 alone.
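The gate semantics above reduce to a short decision rule: Groups 1–4 are scored, Group 5 is pass/fail, and a Group 5 failure disqualifies regardless of the other scores. The function below is a toy illustration; the 0.9 bar and the shape of the inputs are assumptions.

```python
def deployable(group_scores, group5_pass: bool, bar: float = 0.9) -> bool:
    """Groups 1-4 are performance metrics; Group 5 is a hard gate.
    A tool that clears the performance bar is still disqualified if it
    fails audit/compliance. Threshold and input shape are illustrative."""
    return group5_pass and all(s >= bar for s in group_scores)

print(deployable([0.95, 0.93, 0.91, 0.97], group5_pass=False))  # False: gated out
```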
Request the full framework →
"Which tool should we use?" is the wrong question. The right question is which tier the field requires — and that determines the tool, the infrastructure, and the sovereignty posture.
Head-to-head on Cat 1 fields — VIN, DL#, DOB, policy number, claim number — across 200 FNOL and police report documents. Measuring field-level exact match, hallucination rate, and run variance side by side.
DPI degradation curve at 300/200/150/75 DPI. Table TEDS and Cat 1 exact match across 50 scanned insurance documents. Testing the hypothesis that Granite's accuracy degrades by under 10% versus 25–30% for Tesseract.
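The three measurements the studies above lean on can be sketched in a few lines. These are simplified definitions for illustration: "hallucination" here is narrowed to emitting a value for a field absent from the gold document, and run variance is the fraction of documents whose output changes across repeated runs on identical input.

```python
def field_exact_match(pred: str, gold: str) -> bool:
    """Byte-perfect comparison: a VIN is either right or wrong."""
    return pred == gold

def hallucination_rate(preds, golds):
    """Share of absent gold fields (empty string here) for which the
    model emitted a value anyway."""
    flagged = sum(1 for p, g in zip(preds, golds) if g == "" and p != "")
    absent = sum(1 for g in golds if g == "")
    return flagged / absent if absent else 0.0

def run_variance(runs):
    """Fraction of documents whose output differs across repeated runs
    on identical input; 0.0 means fully deterministic."""
    per_doc = zip(*runs)  # runs: one output list per run, aligned by document
    unstable = sum(1 for outputs in per_doc if len(set(outputs)) > 1)
    return unstable / len(runs[0])

golds = ["1HGCM82633A004352", "", "D1234567"]
run_a = ["1HGCM82633A004352", "POL-99", "D1234567"]
run_b = ["1HGCM82633A004352", "", "D1234567"]
print(hallucination_rate(run_a, golds))  # 1 of 1 absent fields invented -> 1.0
print(run_variance([run_a, run_b]))      # document 2 differs across runs -> 1/3
```

Note what is not here: no question-answering score. That is the gap between DocVQA-style benchmarks and extraction that has to survive an audit.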
We work with insurers and MGAs who are serious about the architecture — not just the demo. Conversations start with the problem, not the product.