Auditability in Claims AI: Building Evidence Chains That Auditors Actually Accept
Everyone says their AI is explainable. Few can prove it to an auditor. Here's the architecture that makes it possible.
When regulators ask “why did your system approve this claim?”, they don't want to hear about attention weights or SHAP values. They want to click on a field and see exactly where it came from. They want to re-run a decision from six months ago and get the same answer. They want to know when a human intervened and why.
We learned this the hard way. Our claims automation system went through 47 evaluation iterations, improving from 18% to 98% accuracy. Every decision has a complete audit trail. Here's the architecture that makes it possible, and the regulatory requirements it satisfies.
What Regulators Actually Want
Let's translate regulatory language into technical specifications.
FINMA Guidance 08/2024 (Switzerland)
Switzerland has no AI-specific law. Instead, FINMA applies technology-neutral governance requirements to AI systems, framing AI risks as operational risks:
- Model risks: robustness, correctness, bias, explainability
- IT/cyber risks: security, availability, integrity
- Third-party dependency risks: vendor lock-in, concentration
FINMA explicitly notes that AI “results often cannot be understood, explained, or reproduced.” This is the problem you must solve.
EU AI Act
The EU AI Act classifies life and health insurance risk assessment and pricing as high-risk. Claims automation isn't explicitly listed as high-risk, but transparency obligations apply. EIOPA emphasizes governance and risk management aligned with existing insurance regulation.
What This Means in Practice
- Every decision must have a traceable evidence chain
- Human overrides must be captured with justification
- You must be able to reproduce a decision from 6 months ago
- "The model said so" is not an acceptable explanation
Key insight: Regulators don't care about your model architecture. They care about your evidence trail.
The Evidence Chain Architecture
We use three layers of traceability. Each layer answers a different audit question.
Layer 1: Field-to-Source Provenance
“Where did this number come from?”
Every extracted field must link to: source document, page number, location (coordinates), and extraction confidence.
```
Field:      [extracted_field_name]
Value:      [extracted_value]
Source:     [document.pdf], page X
Location:   [bounding box coordinates]
Confidence: [high/medium/low]
```
When an auditor asks “where did this value come from?”, you click and show them the exact location in the source document. No guessing, no searching.
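To make this concrete, here is a minimal sketch of what a provenance record could look like. The `FieldProvenance` class, its field names, and the bounding-box convention are illustrative assumptions, not our production schema; the point is that every field carries enough information to reopen the source document at the exact spot.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class FieldProvenance:
    """One extracted field, linked back to its exact location in the source."""
    field_name: str          # e.g. "invoice_total"
    value: str               # extracted value, as it appears in the document
    source_document: str     # e.g. "claim_2024_0815.pdf"
    page: int                # 1-based page number
    bounding_box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) page coordinates
    confidence: str          # "high" | "medium" | "low"

# When an auditor clicks a field, this record is enough to open the source
# document at the right page and highlight the exact region.
record = FieldProvenance(
    field_name="invoice_total",
    value="1250.00",
    source_document="claim_2024_0815.pdf",
    page=3,
    bounding_box=(102.4, 310.0, 188.9, 328.5),
    confidence="high",
)
```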
Layer 2: Decision Audit Trail
“Why was this claim approved?”
Every decision must record: input hash, model version, confidence score, timestamp, and decision tier.
```
Decision:        [APPROVED/DENIED/REVIEW]
Input hash:      [reproducibility hash]
Model version:   [version identifier]
Confidence:      [score]
Decision method: [deterministic/AI/hybrid]
Timestamp:       [ISO timestamp]
```
Our system uses a multi-tier decision cascade, starting with deterministic rules and falling back to AI only when needed. Each tier is logged, so you know exactly which method made the decision.
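A minimal sketch of how such an audit record and its input hash might be produced, assuming JSON-serializable inputs. The function names and record fields are illustrative rather than our actual implementation; what matters is that the hash is deterministic, so the same inputs always yield the same fingerprint.

```python
import hashlib
import json
from datetime import datetime, timezone

def input_hash(inputs: dict) -> str:
    """Deterministic hash of the decision inputs.

    Keys are sorted and the JSON is canonicalized so that identical inputs
    always produce the same hash, regardless of dict ordering.
    """
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def audit_record(decision: str, inputs: dict, model_version: str,
                 confidence: float, method: str) -> dict:
    """One immutable audit-trail entry for a single claims decision."""
    return {
        "decision": decision,            # "APPROVED" | "DENIED" | "REVIEW"
        "input_hash": input_hash(inputs),
        "model_version": model_version,  # pinned version used for this decision
        "confidence": confidence,
        "decision_method": method,       # "deterministic" | "ai" | "hybrid"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```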
Layer 3: Override Capture
“Why did a human change this decision?”
Every human override must capture: original AI decision, human decision, override reason (required field), reviewer identity, and timestamp.
```
Original decision: [AI decision with confidence]
Override decision: [human decision]
Reason:            [required justification text]
Reviewer:          [authenticated user ID]
Timestamp:         [ISO timestamp]
```
This creates a feedback loop: override patterns reveal system gaps. If adjusters consistently override denials for the same reason, that's a signal to improve the system.
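A sketch of how an override record can enforce the justification at write time. `OverrideRecord` and `capture_override` are hypothetical names; the design point is that an empty reason should be rejected, not silently stored.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class OverrideRecord:
    """A human override of an AI decision, with mandatory justification."""
    original_decision: str   # e.g. "DENIED (confidence 0.91)"
    override_decision: str   # e.g. "APPROVED"
    reason: str              # free-text justification, required
    reviewer_id: str         # authenticated user ID
    timestamp: str           # ISO 8601, UTC

def capture_override(original: str, new_decision: str,
                     reason: str, reviewer_id: str) -> OverrideRecord:
    # Refuse to record an override without a justification: an empty reason
    # breaks the audit trail and hides the feedback signal.
    if not reason or not reason.strip():
        raise ValueError("Override reason is required")
    return OverrideRecord(
        original_decision=original,
        override_decision=new_decision,
        reason=reason.strip(),
        reviewer_id=reviewer_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```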
Quality Gates: The Audit Checkpoint
We use a three-state quality gate that determines when automation proceeds and when humans must intervene.
- Proceed automatically: log the decision with its full evidence chain; no human review required.
- Proceed with a flag: route to the sampling queue; a human may review or skip.
- Block automation: route to a human adjuster and require an explicit human decision.
Confidence Asymmetry
We use asymmetric thresholds:
- For approvals: higher confidence threshold required
- For denials: lower confidence threshold acceptable
- Below thresholds: routed to human review
Why asymmetric? False approvals cost more: financial loss plus regulatory risk. False denials can be appealed by the customer. This isn't bias; it's risk management.
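A simplified sketch of an asymmetric three-state gate. The threshold values and the `GateState` names are illustrative assumptions; in practice the thresholds come from evaluation data and are documented as part of the quality-gate criteria.

```python
from enum import Enum

class GateState(Enum):
    PASS = "pass"    # proceed automatically, log full evidence chain
    FLAG = "flag"    # proceed, but route to the sampling queue
    BLOCK = "block"  # stop automation, require a human decision

# Illustrative thresholds only; real values are calibrated from evaluation runs.
APPROVE_AUTO, APPROVE_FLAG = 0.97, 0.90  # approvals need higher confidence
DENY_AUTO, DENY_FLAG = 0.90, 0.80        # denials can tolerate lower confidence

def quality_gate(decision: str, confidence: float) -> GateState:
    """Map a proposed decision and its confidence to one of three gate states."""
    auto, flag = (APPROVE_AUTO, APPROVE_FLAG) if decision == "APPROVED" else (DENY_AUTO, DENY_FLAG)
    if confidence >= auto:
        return GateState.PASS
    if confidence >= flag:
        return GateState.FLAG
    return GateState.BLOCK  # below thresholds: route to human review
```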
Result: We reduced our REFER_TO_HUMAN rate from 60% to less than 5%.
Failure Modes That Break Audit Trails
From our 47 evaluation iterations, here are failures that regulators will catch:
What Doesn't Work for Explainability
- SHAP/LIME values: Technically interesting, meaningless to adjusters. “Feature X contributed 0.3 to the decision” is not an explanation.
- Attention weights: “The model focused on these words” doesn't explain WHY the decision was made.
- Confidence scores alone: “87% confident” isn't an explanation. You need: “87% confident BECAUSE [evidence].”
- Post-hoc rationales: Generating explanations after the decision risks mismatch between explanation and actual logic. Auditors will test this.
Key insight: Explainability means showing your work, not describing your feelings about the answer.
Regulatory Mapping Quick Reference
| Requirement | FINMA 08/2024 | EU AI Act | How to Address |
|---|---|---|---|
| Documentation | Material applications | Transparency obligations | Model cards, decision logs, version control |
| Traceability | Results must be explainable | Record-keeping for high-risk systems | Field-to-source provenance |
| Reproducibility | "Often cannot be reproduced" (the problem) | Implied | Input hashing, model versioning |
| Human oversight | Required for material decisions | Required for high-risk | Quality gates, override capture |
| Fallback mechanisms | Explicitly required | Expected | Tier cascade, human queue |
| Risk classification | Centralized inventory expected | High-risk for life/health pricing | Use-case registry with risk tags |
Core Principles for Auditable AI
Building audit-ready AI systems requires commitment to four foundational principles. The specific implementation will vary by organization, but these principles remain constant.
Complete Provenance
Every extracted value must be traceable to its source document and location.
Decision Reproducibility
Given the same inputs and model version, you must be able to reproduce any historical decision.
Human Accountability
Every human override must be captured with required justification and reviewer identity.
Continuous Validation
Regular testing of historical decisions to verify your audit trail remains intact.
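A sketch of what such a validation pass could look like, assuming each archived record stores its inputs, pinned model version, and final decision; `rerun_decision` is a hypothetical hook into a deterministic replay of the pipeline.

```python
def validate_audit_trail(historical_records: list, rerun_decision) -> list:
    """Replay archived decisions against their pinned model versions.

    `rerun_decision(inputs, model_version)` stands in for whatever re-executes
    the pipeline deterministically and returns the decision string.
    """
    mismatches = []
    for rec in historical_records:
        replayed = rerun_decision(rec["inputs"], rec["model_version"])
        if replayed != rec["decision"]:
            mismatches.append(rec)
    return mismatches

# Run this on a periodic sample of archived decisions; any mismatch means
# either the trail is incomplete or the replay environment has drifted.
```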
Next Steps
1. Audit your current evidence chain. Can you trace any field to its source?
2. Test reproducibility. Re-run a 3-month-old decision: same output?
3. Document your quality gate criteria. What triggers human review?
4. Map your architecture to FINMA and the EU AI Act. Use the table above as a starting point.
The goal isn't perfect AI. It's defensible AI: systems where every decision can be explained, every field can be traced, and every human intervention is documented. That's what regulators want. That's what auditors accept.
Based on internal research from 47 evaluation iterations (18% → 98% accuracy) and regulatory analysis of FINMA Guidance 08/2024, EU AI Act, and EIOPA governance frameworks.
Key Takeaways
- Three-layer evidence chain architecture
- Field-to-source provenance tracking
- Decision audit trail with input hashing
- Human override capture with required justification
- Quality gates that satisfy auditors
- FINMA & EU AI Act compliance mapping
Building Auditable AI?
We can help you design evidence chain architectures that satisfy regulators and auditors.