Build vs Buy: What to Demand from IDP Vendors
A practical framework for insurance teams evaluating claims automation solutions
The Wrong Question
“Should we build or buy our claims AI?”
This is the wrong question. Vendors will show you polished demos. Your developers will promise they can “just use GPT-4.” Neither perspective captures reality.
The right question: What do you need to be true for this to work in production, and who can actually make it true?
We've worked through 47 evaluation iterations on a claims automation pipeline, watching accuracy climb from 18% to 98%. Along the way, we documented every failure mode, every false assumption, and every hard-won lesson. Here's how to evaluate claims AI vendors without getting fooled by demos.
When to Build
Build is a capability investment, not a cost savings. Choose it for strategic reasons, not fear of vendors.
Build Makes Sense When:
- Your policy logic is genuinely unique. If your coverage rules can't be configured in a vendor product, you'll fight their system forever.
- You have ML engineering capacity and can retain it. One departing engineer shouldn't cripple your system.
- Claims processing is your competitive advantage. Some insurers differentiate on claims speed. Most don't.
- Regulatory requirements demand full control. Swiss and EU data residency, audit trails, explainability: some regulators want everything in-house.
- You're prepared for 12-18 months before production. That's realistic: our 47 evaluation iterations alone took 2 months of focused work.
Build Traps to Avoid:
You'll hit the same accuracy ceiling everyone else does. Our baseline LLM-only approach achieved 18% accuracy. Not 80%. Not 60%. Eighteen percent.
Claims domain is deep. German compound words like Abgasrückführungsventil (exhaust gas recirculation valve) require automotive vocabulary that doesn't come from Stack Overflow.
You'll trade vendor lock-in for internal lock-in: dependency on the engineers who understand your bespoke system.
The Real Cost of Build:
- 2-4 ML engineers for 12+ months
- Domain expert time for labeling and validation (~20 person-hours for a 50-claim ground truth set)
- Ongoing maintenance: models drift, document formats change
- Opportunity cost of not shipping sooner
When to Buy
Buy is a speed-to-market trade-off. Choose it because you value time, not because you think it's “easier.”
Buy Makes Sense When:
- You need production results in 3-6 months. Vendors have solved problems you haven't encountered yet.
- You don't have (or want to build) ML capacity. Hiring and retaining ML engineers is hard.
- Your claim types are relatively standard. Motor, property, health: vendors have seen these before.
- You want IT focused on integration, not model building. Your competitive advantage is probably not in AI research.
Buy Traps to Avoid:
Demo data is curated. Our system looked great on German documents until we ran French claims: all 25 claims that should have been approved were wrongly rejected.
Ask: accuracy on what? Their test set or your documents? Our 98% accuracy came with 36% payout accuracy. Decision accuracy ≠ business accuracy.
Integration is always harder than promised. Document formats vary. Field names differ. Your policy structure has quirks.
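The gap between decision accuracy and payout accuracy is worth measuring separately. A minimal sketch, with hypothetical field names and made-up sample claims, of how a system can get every approve/reject decision right while still paying the wrong amounts:

```python
# Illustrative sketch: decision accuracy vs payout accuracy on the same
# claim set. Field names and sample data are hypothetical.

def decision_accuracy(claims):
    """Fraction of claims where the approve/reject decision matched ground truth."""
    correct = sum(1 for c in claims if c["predicted_decision"] == c["true_decision"])
    return correct / len(claims)

def payout_accuracy(claims, tolerance=0.01):
    """Fraction of truly-approved claims where the computed payout matched
    ground truth within a relative tolerance."""
    approved = [c for c in claims if c["true_decision"] == "approve"]
    ok = sum(
        1 for c in approved
        if abs(c["predicted_payout"] - c["true_payout"]) <= tolerance * c["true_payout"]
    )
    return ok / len(approved) if approved else 0.0

claims = [
    {"predicted_decision": "approve", "true_decision": "approve",
     "predicted_payout": 1200.0, "true_payout": 1500.0},  # right decision, wrong payout
    {"predicted_decision": "approve", "true_decision": "approve",
     "predicted_payout": 800.0, "true_payout": 800.0},    # both right
    {"predicted_decision": "reject", "true_decision": "reject",
     "predicted_payout": 0.0, "true_payout": 0.0},
]

print(decision_accuracy(claims))  # 1.0 — every decision correct
print(payout_accuracy(claims))    # 0.5 — only half the approved payouts correct
```

Ask vendors for both numbers; a single headline accuracy figure almost always means the first one.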
The Real Cost of Buy:
- License fees (per claim, per document, or platform)
- Integration effort (typically 3-6 months, not 3-6 weeks)
- Ongoing optimization (you'll still need to tune)
- Vendor dependency (what if they pivot, fail, or get acquired?)
The Vendor Evaluation Framework
A systematic approach to comparing vendors. Weight categories based on your priorities.
Category 1: Accuracy & Evaluation
Weight: 30%

| Question | What You're Looking For |
|---|---|
| What's your holdout accuracy? | Specific %, methodology explained |
| What's your false approve rate? | Tracked separately, <5% |
| Can you show error categories? | Taxonomy exists, improvement tracked |
| How many iterations to reach current accuracy? | Shows discipline, not luck |
Red flags: “Our accuracy is 99%” without explaining holdout methodology. We tracked 14 distinct failure modes across matching, extraction, coverage logic, and calculation.
Category 2: Auditability
Weight: 25%

| Question | What You're Looking For |
|---|---|
| Can I trace any field to its source document? | Page and character level |
| Do you log model version and confidence? | Complete audit trail |
| Can you reproduce a 6-month-old decision? | Full reproducibility |
| How do you capture human overrides? | With justification required |
For Swiss and EU regulated workflows, auditability isn't negotiable.
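The audit questions above can be made concrete as a per-field record. A minimal sketch, assuming documents are addressed by id, page, and character offsets; every field name here is illustrative, not any vendor's schema:

```python
# Sketch of an audit record for one extracted field. Written as an
# append-only JSON line per field, it supports source tracing,
# reproducibility, and human-override capture. Schema is hypothetical.
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass(frozen=True)
class FieldAudit:
    claim_id: str
    field: str                # e.g. "invoice_total"
    value: str
    source_doc: str           # document identifier
    page: int                 # 1-based page number
    char_start: int           # character offsets within the page text
    char_end: int
    model_version: str        # pinned model + prompt version, for reproducing old decisions
    confidence: float
    overridden_by: Optional[str] = None    # reviewer id, if a human corrected it
    override_reason: Optional[str] = None  # justification required on override

record = FieldAudit(
    claim_id="CLM-0042", field="invoice_total", value="1834.50",
    source_doc="invoice_17.pdf", page=2, char_start=412, char_end=419,
    model_version="extractor-v3.1", confidence=0.97,
)
print(json.dumps(asdict(record)))
```

If a vendor can produce a record like this for any field in any historical claim, the "reproduce a 6-month-old decision" question answers itself.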
Category 3: Architecture
Weight: 20%

| Question | What You're Looking For |
|---|---|
| What % is deterministic vs ML? | Higher deterministic = lower risk |
| What are your confidence thresholds? | Asymmetric, configurable |
| How do you detect distribution shift? | Monitoring in place |
| What's your human-in-loop workflow? | Full QA console |
Our architecture: 57% of items handled by deterministic rules (zero LLM cost). Higher deterministic percentage = lower cost and higher explainability.
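The asymmetric-threshold idea can be sketched in a few lines. The thresholds and the deterministic rule are illustrative; the point is that auto-approval demands more confidence than auto-rejection, deterministic rules run first at zero model cost, and everything in between degrades gracefully to a human:

```python
# Sketch of asymmetric confidence routing. Threshold values and the
# "exact_tariff_match" rule are hypothetical examples.

APPROVE_THRESHOLD = 0.95  # false approves are costly: demand high confidence
REJECT_THRESHOLD = 0.80   # false rejects surface via appeals: slightly looser

def route(item):
    # Deterministic rules first: zero LLM cost, fully explainable.
    if item.get("exact_tariff_match"):
        return "auto_approve_rule"
    conf = item["model_confidence"]
    decision = item["model_decision"]
    if decision == "approve" and conf >= APPROVE_THRESHOLD:
        return "auto_approve"
    if decision == "reject" and conf >= REJECT_THRESHOLD:
        return "auto_reject"
    return "human_review"  # never guess on low confidence

print(route({"exact_tariff_match": True,
             "model_confidence": 0.50, "model_decision": "approve"}))
# → auto_approve_rule (the deterministic rule wins regardless of model confidence)
print(route({"model_confidence": 0.90, "model_decision": "approve"}))
# → human_review (below the approve bar, even though 0.90 sounds high)
```

Ask vendors where their equivalent of these two thresholds lives and whether you can configure them per claim type.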
Category 4: Cost & Operations
Weight: 15%
- Cost per claim breakdown
- Token limits, circuit breakers
- P50, P95, P99 latencies
- Native language handling
Category 5: Risk & Safety
Weight: 10%
- Documented failure modes
- Automated regression testing
- Graceful degradation to human
- EU/Swiss data residency
The Demo Checklist
What to demand when vendors present.
Before the Demo
- Send 10-20 of YOUR claims
- Include edge cases: multilingual, poor scan quality
- Define success criteria upfront
During the Demo
- Run on YOUR documents
- Ask to see a failure
- Click through to field provenance
- Request confidence scores
After the Demo
- Compare claimed vs actual accuracy
- Count "we'll tune that" responses
- Evaluate: could your team use this daily?
The 5-Minute Stress Test
Pick your weirdest claim: a German compound word like Zylinderkopfdichtung (cylinder head gasket) on a French invoice format with a borderline coverage decision. If they can't process it live, their "99% accuracy" doesn't apply to your reality.
The Hybrid Path
Consider the middle ground. Build some, buy some.
- Buy: document extraction, OCR, classification. Vendors are good at generic document understanding.
- Build: policy-specific business rules, coverage logic. You're better at your specific domain.
- Own: evaluation framework, ground truth, quality monitoring. Maintain control of the decision layer.
Why this works: Vendors handle commodity parts. You handle differentiated parts. You can swap vendors without rewriting rules. You maintain audit control over final decisions.
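The swap-without-rewriting property comes from putting an interface at the buy/build boundary. A minimal sketch, assuming a Protocol-shaped extraction layer; the stub vendor, field names, and policy structure are all hypothetical:

```python
# Sketch of the decision-layer boundary in the hybrid path: any vendor
# that satisfies the Extractor protocol can be swapped in without
# touching the owned coverage rules. All names are illustrative.
from typing import Protocol

class Extractor(Protocol):
    def extract(self, document: bytes) -> dict: ...

def coverage_decision(fields: dict, policy: dict) -> str:
    """Owned layer: policy-specific rules, independent of any vendor."""
    if fields["damage_type"] not in policy["covered_damage_types"]:
        return "reject"
    if fields["claim_amount"] > policy["max_coverage"]:
        return "partial"
    return "approve"

class StubVendor:
    """Stands in for a bought extraction service in this sketch."""
    def extract(self, document: bytes) -> dict:
        return {"claim_amount": 2400.0, "damage_type": "glass"}

def process(doc: bytes, extractor: Extractor, policy: dict) -> str:
    return coverage_decision(extractor.extract(doc), policy)

policy = {"max_coverage": 5000.0, "covered_damage_types": {"glass", "collision"}}
print(process(b"...", StubVendor(), policy))  # → approve
```

Replacing StubVendor with a different provider changes one class; the coverage rules, and the audit trail they produce, stay yours.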
Next Steps
Define must-haves vs nice-to-haves
Auditability is non-negotiable. Automatic optimization is not.
Prepare your demo claim set
Twenty claims, including your edge cases and multilingual documents.
Build your scorecard before the first vendor call
Don't let demos anchor your evaluation criteria.
Decide: what do you want to own long-term?
The decision layer? The extraction layer? The whole stack?
The goal isn't to avoid vendors or avoid building. It's to make an informed decision based on your specific situation, timeline, and capabilities.
True Aim AG builds auditable AI systems for Swiss and EU regulated markets. We've learned these lessons the hard way: 47 iterations' worth.
Key Takeaways
- Build for strategic capability, not cost savings
- Buy for speed-to-market, not simplicity
- 5-category vendor evaluation framework
- Demo checklist: before, during, after
- The hybrid path: buy commodity, build differentiation
- Auditability is non-negotiable
Evaluating Claims AI Vendors?
We can help you build an evaluation framework tailored to your requirements and run objective vendor comparisons.