Suprmind Multi-Model Divergence Index - April 2026 Edition
Based on 1,324 real-user production turns across financial, legal,
medical, technical, business, and marketing queries.
Sample window: March 5 - April 19, 2026.
Published April 2026 · Next edition: late July 2026
Why this report exists
Hallucinations are the failure mode people notice. But in real work, the deeper risk is often confidence without resilience: answers that sound certain until another model challenges, corrects, or contradicts them. To measure that, we analyzed 1,324 real-user multi-model turns in Suprmind and tracked where providers diverged, corrected one another, and exposed hidden failure modes.
Hallucinations are one of several failure modes that confident-looking AI answers hide. Multi-model review exposes them — alongside stale knowledge, faulty reasoning, and unchallenged assumptions. This report measures the pattern across five frontier providers in production use.
How we measured it
Suprmind runs every user query through Claude, GPT, Gemini, Grok, and Perplexity, then scores the contradictions, corrections, and unique insights between their answers, turn by turn. Across 1,324 real-user turns — finance, legal, medical, technical, business strategy, marketing, research — multi-model review surfaced at least one contradiction, correction, or unique insight on 99.1% of turns. Not a lab benchmark. Real production work.
What we don't claim
One clarification before the numbers. This study does not measure correctness against an external ground truth. No such ground truth exists for the kinds of questions professionals actually ask — investment theses, regulatory interpretation, strategic tradeoffs, clinical judgment. What we measure is what you'd observe if you ran the same question past five experts: how often they contradict each other, who corrects whom, and where the disagreements are severe enough to change a decision.
Models disagreed on 54% of turns in this dataset. They corrected each other on 72%. They surfaced more than 2,500 unique insights - facts, risks, or angles that only one provider contributed on a given turn. Every row below is the mechanism, expressed as a number. The question this report answers: which providers are doing what, and which combinations matter most.
| Domain | Sample | Disagreement Rate |
|---|---|---|
| Financial | n = 258 | 72.1% |
| Legal | n = 136 | 41.2% |
The 31-point gap - nearly 1.75× - inverts the common assumption that legal queries are the hardest domain for AI. The explanation is structural. Legal questions ("what does this statute say," "is this clause enforceable") have citable ground truth in cases and statutes. Five frontier models retrieving the same authoritative sources tend to converge. Financial questions ("should we expand to Germany," "what's the fair valuation") are forecasts and judgments. No ground truth exists, and five models produce five defensible answers that differ.
The failure modes also differ. Legal AI peer-divergence patterns cluster around confidence exposure - confident assertions of case citations that turn out to be unsupported when another model checks.[3] Financial AI peer-divergence patterns cluster around confident contradictions - five different numbers, all sounding certain. Both are real. Only the second is captured in this report.
Based on 1,324 real-user multi-model turns in Suprmind, March 5 - April 19, 2026.
This page tracks three metrics. They describe the mechanism of multi-model review, not any single model's correctness.
When a provider's answer reads as highly confident - definitive language, no hedging - how often is that answer contradicted or corrected by another model in the same turn? The core signal of the Confidence Trap.
Of all the corrections one model made of another's output, what share did each provider contribute? Measures which models are doing the catching, not which models are being caught.
For each provider: corrections made divided by corrections received. Above 1.0 means the provider catches more than it's caught. Below 1.0 means the inverse. The cleanest single number for asymmetry.
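To make the three metrics concrete, here is a minimal Python sketch of how each could be computed from per-turn records. The record shapes and field names (provider, confidence, challenged, corrector, corrected) are illustrative assumptions, not the actual DCI schema.

```python
# Hedged sketch: field names and record shapes are illustrative, not the DCI schema.
from collections import defaultdict

def confident_contradicted_rate(responses, threshold=7):
    """Metric 1: of a provider's high-confidence responses (linguistic score >=
    threshold on a 1-10 scale), what share was contradicted or corrected by a
    peer in the same turn."""
    high = defaultdict(int)
    caught = defaultdict(int)
    for r in responses:  # r: {"provider": str, "confidence": int, "challenged": bool}
        if r["confidence"] >= threshold:
            high[r["provider"]] += 1
            if r["challenged"]:
                caught[r["provider"]] += 1
    return {p: caught[p] / high[p] for p in high}

def correction_share(corrections):
    """Metric 2: each provider's share of all corrections made of another
    model's output."""
    made = defaultdict(int)
    for c in corrections:  # c: {"corrector": str, "corrected": str}
        made[c["corrector"]] += 1
    total = sum(made.values())
    return {p: n / total for p, n in made.items()}

def catch_ratio(corrections):
    """Metric 3: corrections made divided by corrections received; above 1.0
    means the provider catches more than it is caught."""
    made = defaultdict(int)
    received = defaultdict(int)
    for c in corrections:
        made[c["corrector"]] += 1
        received[c["corrected"]] += 1
    return {p: made[p] / received[p] for p in received if received[p]}
```

Read against the summary table below, the Perplexity row is simply 335 catches over 132 times caught, a catch ratio of 2.54.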
We do not rank models on accuracy because we cannot. We rank them on behavior inside a multi-model ensemble, which is a different question - and the one that actually matters if you're deciding whether multi-model orchestration is worth the cost.
One table that anchors the rest of the report - and the decision of whether multi-model validation is worth the cost. Each row describes one provider's behavior inside the five-model ensemble across the three metrics we track, plus raw catch and insight counts.
| Provider | Confident-Contradicted (all turns) | Confident-Contradicted (high-stakes only) | Corrections Made | Times Caught | Catch Ratio | Unique Insights Surfaced |
|---|---|---|---|---|---|---|
| Claude | 33.8% | 26.3% | 304 | 135 | 2.25 | 631 |
| Perplexity | 33.9% | 32.1% | 335 | 132 | 2.54 | 638 |
| GPT | 39.5% | 36.0% | 111 | 295 | 0.38 | 339 |
| Grok | 48.8% | 46.8% | 193 | 269 | 0.72 | 509 |
| Gemini | 51.3% | 50.2% | 109 | 416 | 0.26 | 463 |
Based on 1,324 real-user multi-model turns in Suprmind, March 5 - April 19, 2026. External users only (10 internal/test accounts excluded). "Contradicted" means a peer model produced a substantively different answer in the same turn - not measured against external ground truth.
The columns tell different stories in combination. Low confident-contradicted rate + high catch ratio = a model that speaks carefully and catches other models' errors (Claude, Perplexity). High confident-contradicted rate + low catch ratio = a model that speaks definitively and is frequently contradicted on that confidence (Gemini). Grok and GPT occupy the middle with different profiles: Grok catches moderately and gets caught moderately; GPT catches rarely and surfaces unique insights least often.
Perplexity and Claude together cover most of the value columns. Neither dominates every one, and GPT's consistent middle-of-the-pack position is its own kind of reliability. No single row wins across the board - which is the argument for multi-model orchestration, expressed as a matrix.[1][2]
Suprmind runs every message through the same five frontier models measured in this report, then scores contradictions, corrections, and unique insights turn by turn. Inside the product, the Disagreement/Correction Index shows up as a sidebar panel - the per-turn version of what this page shows in aggregate.
When the models disagree, that disagreement is where the hard questions live. Suprmind surfaces it, quantifies it, and, in three clicks, turns it into a professional deliverable - so the weak reasoning gets caught before the decision ships.
Every provider in our dataset produced answers flagged as high-confidence by their own language - definitive assertions, no hedging, "the answer is X" phrasing. In a single-model workflow, those answers ship as final. In a multi-model turn, they get tested against four peers. How often does the test fail?
| Provider | High-confidence responses | Contradicted or corrected | Rate |
|---|---|---|---|
| Gemini | 887 | 455 | 51.3% |
| Grok | 688 | 336 | 48.8% |
| GPT | 805 | 318 | 39.5% |
| Perplexity | 629 | 213 | 33.9% |
| Claude | 757 | 256 | 33.8% |
Gemini's confident answers are contradicted almost exactly half the time. Claude's are contradicted about a third of the time. That is a 1.5× difference in reliability at the same confidence level - not because Claude has ground truth on its side and Gemini doesn't, but because Claude's definitive language is calibrated closer to what the other four models also assert, and Gemini's isn't.
We filtered to turns the domain classifier flagged as high-stakes - real decisions with legal, financial, medical, or career consequences. Drafts, regulatory interpretation, prescriptions, investment calls, contract reviews.
| Provider | High-confidence responses | Contradicted or corrected | Rate |
|---|---|---|---|
| Gemini | 289 | 145 | 50.2% |
| Grok | 231 | 108 | 46.8% |
| GPT | 261 | 94 | 36.0% |
| Perplexity | 215 | 69 | 32.1% |
| Claude | 240 | 63 | 26.3% |
The sharper cut of the Confidence Trap isn't the absolute rates. It's how each provider's confident-contradicted rate changes when the question has real consequences - when the model "should" be more careful.
| Provider | All turns | High-stakes | Delta |
|---|---|---|---|
| Claude | 33.8% | 26.3% | −7.5 pts |
| GPT | 39.5% | 36.0% | −3.5 pts |
| Grok | 48.8% | 46.8% | −2.0 pts |
| Perplexity | 33.9% | 32.1% | −1.8 pts |
| Gemini | 51.3% | 50.2% | −1.1 pts |
Only Claude slows down meaningfully when stakes rise. GPT drops a modest 3.5 points; Grok, Perplexity, and Gemini barely move.
That's the argument for Claude's calibration, expressed as a delta. When a question carries real-world consequences, Claude's confident-contradicted rate drops nearly eight points - its language becomes more hedged, more bounded. Gemini's drops 1.1 points. Grok's drops 2.0. For those two, it's the same linguistic confidence, the same definitive phrasing, whether the user is asking for a pizza recipe or a regulatory opinion. Their linguistic confidence isn't tracking the stakes.
The Financial-Legal inversion above is the headline. The full pattern across all ten domains is wider.
| Domain | Turns | Disagreement Rate | Correction Rate |
|---|---|---|---|
| Financial | 258 | 72.1% | 71.7% |
| Other | 153 | 60.1% | 69.3% |
| Marketing & Sales | 131 | 55.0% | 77.1% |
| Business Strategy | 257 | 54.9% | 75.5% |
| Research & Analysis | 74 | 52.7% | 77.0% |
| Technical | 172 | 49.4% | 75.0% |
| Creative | 38 | 42.1% | 60.5% |
| Legal | 136 | 41.2% | 69.9% |
| Medical | 56 | 33.9% | 58.9% |
| Education | 49 | 28.6% | 63.3% |
The ranking tracks how much each domain's answer depends on judgment versus lookup. Finance, strategy, marketing, and research all require interpretation and forecasts - where five frontier models produce five defensible answers that differ. Legal, Medical, and Education questions have more retrievable ground truth - where five models querying the same sources tend to converge on the same answer. But even the domains where models most often agree show substantial disagreement. Legal's 41.2% rate is higher than the error rates reported by any published hallucination benchmark.[3] Multi-model review keeps finding something to flag even where models ostensibly agree on the facts.
Counting disagreements doesn't fully describe them. A 3/10 severity disagreement is a stylistic or minor framing difference. A 9/10 is "two models are recommending opposite actions in a high-stakes context." Below, the share of each domain's contradictions rated critical (severity ≥ 7):
| Domain | Total contradictions | Critical (≥ 7) | Share |
|---|---|---|---|
| Research & Analysis | 46 | 24 | 52.2% |
| Technical | 117 | 46 | 39.3% |
| Financial | 285 | 107 | 37.5% |
| Other | 162 | 53 | 32.7% |
| Business Strategy | 196 | 60 | 30.6% |
| Legal | 75 | 22 | 29.3% |
| Medical | 24 | 7 | 29.2% |
| Creative | 16 | 4 | 25.0% |
| Marketing & Sales | 89 | 14 | 15.7% |
| Education | 16 | 2 | 12.5% |
Research & Analysis has a lower disagreement volume than Financial, but a much higher share of its disagreements are severe. When research-level questions diverge across models, they diverge hard. Marketing & Sales disagreements, by contrast, cluster at moderate severity - opinion-level differences on campaign positioning and channel strategy, not "one model is demonstrably missing something critical."
The practical read: on Financial and Research questions, multi-model review is not a nice-to-have. Roughly a third to a half of cross-model contradictions on those question types are critical-severity - disagreements a single-model workflow never sees.
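As a sketch of the arithmetic behind the Share column above, the computation below filters contradictions at the severity ≥ 7 threshold and groups by domain. The field names (domain, severity) are assumptions for illustration, not the classifier's actual output fields.

```python
# Hedged sketch: assumes contradiction records with "domain" and "severity" fields.
from collections import defaultdict

CRITICAL = 7  # severity >= 7 counts as critical, per the definition above

def critical_share_by_domain(contradictions):
    total = defaultdict(int)
    critical = defaultdict(int)
    for c in contradictions:
        total[c["domain"]] += 1
        if c["severity"] >= CRITICAL:
            critical[c["domain"]] += 1
    return {d: critical[d] / total[d] for d in total}

# e.g. Financial in this dataset: 107 critical out of 285 contradictions ≈ 37.5%
```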
The provider-pair matrix describes where disagreement actually happens inside the ensemble. The most combative pairs are not always the highest-severity. The least combative pairs are not the most "in agreement" - they may just be asking different questions.
| Pair | Contradictions | Avg severity | Critical |
|---|---|---|---|
| Gemini vs Grok | 182 | 6.13 | 66 |
| Claude vs Gemini | 120 | 6.04 | 50 |
| Claude vs GPT | 117 | 6.34 | 55 |
| Gemini vs GPT | 108 | 6.31 | 50 |
| Gemini vs Perplexity | 107 | 6.04 | 37 |
| Claude vs Grok | 97 | 6.43 | 48 |
| GPT vs Grok | 86 | 6.51 | 47 |
| Grok vs Perplexity | 81 | 6.26 | 36 |
| GPT vs Perplexity | 68 | 6.13 | 30 |
| Claude vs Perplexity | 52 | 6.44 | 24 |
Gemini and Grok are the most combative pair in the dataset by volume. Claude and Perplexity disagree least often. When GPT and Grok disagree, they disagree with the highest severity (6.51 avg) - a smaller number of harder disagreements.
Different domains have different "hot pairs." If we had to pick which pair to watch for each domain:
| Domain | Top disagreement pair | Count |
|---|---|---|
| Business Strategy | Gemini vs Grok | 59 |
| Financial | Claude vs Gemini | 37 |
| Technical | Gemini vs Grok | 27 |
| Marketing & Sales | Gemini vs Grok | 23 |
| Legal | Gemini vs Perplexity | 18 |
| Medical | Gemini vs Perplexity | 7 |
| Research & Analysis | Claude vs GPT | 10 |
Research & Analysis is the only domain where the top pair isn't Gemini-adjacent. Claude-GPT leads it - and it's also the domain where disagreements are most severe.
The cleanest number in this dataset is each provider's catch ratio - how many times the provider corrected another model, divided by how many times it was itself corrected:
| Provider | Catches made | Times caught | Catch ratio |
|---|---|---|---|
| Perplexity | 335 | 132 | 2.54 |
| Claude | 304 | 135 | 2.25 |
| Grok | 193 | 269 | 0.72 |
| GPT | 111 | 295 | 0.38 |
| Gemini | 109 | 416 | 0.26 |
Gemini corrects others 109 times. It is caught by others 416 times - nearly four to one. Perplexity flips that ratio: 335 catches, 132 times caught.
Perplexity's catch ratio is 9.77× Gemini's. That's the single sharpest stat in the dataset - the Confidence Trap expressed as a multiplier. Two ends of the same ensemble, calibrated differently enough that their roles are almost opposite.
A second stat lands the same argument from a different angle: Perplexity and Claude together made 60.7% of all corrections in the dataset - 639 out of 1,052 attributable to a specific corrector. Two models do three-fifths of the catching. GPT and Gemini combined make 20.9%. Error-catching is concentrated in two providers. That's both a design input for how many models to run and a reminder that not all five do the same job.
This is the central argument for multi-model ensembles expressed as a single statistic. Models with a high catch ratio are disproportionately valuable inside an ensemble - not because they are "smarter," but because their failure modes are less correlated with the failure modes of the models they catch.[2]
No single provider dominates across every metric. Each has a distinct role in the ensemble's error-catching mechanism.
Claude - the Calibrated Specialist. Second-highest catch ratio (2.25) and the largest calibration delta on high-stakes (33.8% → 26.3%, a 7.5-point drop). The language hedges when the stakes rise.
Perplexity - the Grounding Layer. Highest catch ratio (2.54). Surfaced 333 critical-severity unique insights, more than any other provider. Web-grounded retrieval catches stale and fabricated claims other models miss.
Grok - the Contrarian Insight. Third in both catch ratio (0.72) and unique-insight share (19.7%). Surfaces a meaningful slice of what Perplexity misses, often from sources the others don't index.
GPT - the Balanced Generalist. Most balanced across columns. Rarely fails, rarely surfaces unique signal (13.1% insight share, lowest). In a multi-model ensemble, GPT functions as a balanced generalist, not a contrarian catcher.
Gemini - the Overconfident Challenger. Highest response volume in the sample (941 responses). Gemini Flash Lite also serves as the DCI classifier that scores cross-provider behavior across the dataset. The research exists because it produces structured analysis reliably at scale. In its own responses, Gemini's confidence signal and reliability signal are calibrated differently from its peers' - the Confidence Trap pattern is most visible in Gemini's numbers.
Disagreement and corrections describe what multi-model review catches. Unique insights describe what it adds. Across 1,324 turns, the classifier flagged 3,484 unique insights - facts, angles, risks, or recommendations that would have been lost in a single-model workflow. 2,580 are attributable to one of the five tracked providers. The remainder came from earlier classifier runs that used non-standard provider labels and are excluded from provider shares.
| Provider | Unique insights | Share | Avg severity | Critical (≥ 7) |
|---|---|---|---|---|
| Perplexity | 638 | 24.7% | 6.38 | 333 |
| Claude | 631 | 24.5% | 6.09 | 268 |
| Grok | 509 | 19.7% | 5.89 | 159 |
| Gemini | 463 | 17.9% | 5.48 | 104 |
| GPT | 339 | 13.1% | 5.48 | 85 |
Three things stand out.
Perplexity surfaces the highest-severity unique insights (avg 6.38, and more than half of its insights are critical-severity). That tracks with its grounding mechanism - web-retrieved facts that other models don't have access to frequently turn out to be decisive.
Claude is a close second on volume, at a somewhat lower average severity - consistent with its profile elsewhere: careful language, broad coverage.
GPT surfaces the fewest unique insights, by a wide margin. Only 13% of all unique insights come from GPT, even though it produces roughly the same volume of responses as Gemini and Claude. This contradicts the popular framing of GPT as the "default best model." In a multi-model ensemble, GPT functions as a balanced generalist - it rarely fails, but it also rarely surfaces things the other four don't have.
The framing that lands: without Perplexity, 638 insights flagged across the 1,324 turns would have been missed, including 333 rated critical. Without GPT, 339. That's the actual cost of running one model fewer.
One of the questions we expected to answer honestly on this page: when is multi-model review not worth the cost? The natural assumption was "a significant fraction of turns produce no additional signal - the models just agree redundantly."
The data does not support that assumption.
| Domain | Turns | Silent turns | Silent rate |
|---|---|---|---|
| Other | 153 | 4 | 2.6% |
| Business Strategy | 257 | 5 | 1.9% |
| Marketing & Sales | 131 | 1 | 0.8% |
| Technical | 172 | 1 | 0.6% |
| Financial | 258 | 1 | 0.4% |
| Legal | 136 | 0 | 0.0% |
| Medical | 56 | 0 | 0.0% |
| Education | 49 | 0 | 0.0% |
| Research & Analysis | 74 | 0 | 0.0% |
| Creative | 38 | 0 | 0.0% |
| Total | 1,324 | 12 | 0.9% |
A "silent turn" is one where no contradictions, no corrections, and no unique insights were flagged across all five providers. Twelve turns out of 1,324.
In Legal (n = 136), Medical (n = 56), Education (n = 49), Research & Analysis (n = 74), and Creative (n = 38), multi-model review flagged something on every single turn. 353 turns across these five domains, zero silent outcomes. Across the full dataset, the rate holds: multi-model review surfaces at least one contradiction, correction, or unique insight on 99.1% of all turns.
The honest caveat: "flagging something" doesn't automatically mean the flag was decision-relevant. A minor phrasing difference counts toward the 99.1%. A stricter definition - "no critical-severity contradictions or corrections" - would produce a larger silent-agreement rate, and we plan to report that variant in the Q2 edition.
But the headline stands: there is no meaningful floor below which multi-model review is routinely wasted. The burden of proof for "single-model is sufficient" rests on whichever model you're trusting - and the catch-ratio table above is the evidence that it probably isn't sufficient.
Hallucinations - fabricated facts presented as real - are the failure mode most professionals associate with AI risk. The existing Suprmind research on AI hallucination rates across benchmarks aggregates 50+ sources on that specific problem.[9]
The Confidence Trap is adjacent but not the same.
A model can produce an answer that is not a hallucination in the Vectara sense[4] - every fact in it is retrievable from training data - and still be contradicted by another model in the same turn. Stale knowledge, unexamined assumptions, risk-tolerance mismatches, and edge-case reasoning errors all produce contradictions without producing hallucinations.
Our read: hallucinations are one of several failure modes that multi-model review exposes. When a second model flags a fabricated citation, it caught a hallucination. When it flags that the first model's ROI calculation assumed 100% customer adoption, it caught an unchallenged assumption. Both are real. Both are caught. The hallucinations page measures the first mode. This page measures the broader category across multiple failure types - which is why decision validation is a different problem than hallucination detection.[8]
This dataset is production data from real Suprmind users, not a lab benchmark. The Disagreement/Correction Index (DCI) that scored every turn is the same system that runs inside the product as a sidebar panel, giving live users the per-turn version of what this report shows in aggregate. That means tradeoffs we want to be explicit about.
Confidence level (per provider, per turn). An automated classifier scores each provider's response on a 1-10 linguistic-confidence scale, based on language signals: definitive phrasing, absence of hedging, first-person certainty. Score is the language, not the correctness.
Contradiction. A turn is flagged as containing a contradiction when two or more providers produce substantively different answers to the same question. The DCI classifier specifies which providers contradicted each other and assigns a severity score (1-10).
Correction. A turn is flagged as containing a correction when one provider explicitly identifies an inaccuracy, flawed assumption, or missing context in another provider's output. The classifier records which provider corrected which, and the severity.
Unique insight. A fact, risk, angle, or recommendation surfaced by exactly one provider in a turn and not mentioned by any of the other 2-4 providers in the same turn.
Silent turn. No contradictions, no corrections, and no unique insights flagged by the classifier. The floor of the "multi-model adds no signal" measurement.
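Put together, a per-turn record under these definitions might look like the sketch below. The structure and field names are hypothetical - a reader's mental model of the data, not the classifier's actual JSON schema.

```python
# Hypothetical per-turn record shape implied by the definitions above;
# not the actual DCI classifier output.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Contradiction:
    provider_a: str
    provider_b: str
    severity: int  # 1-10; >= 7 treated as critical in this report

@dataclass
class Correction:
    corrector: str  # provider that flagged the inaccuracy or missing context
    corrected: str  # provider whose output was flagged
    severity: int

@dataclass
class UniqueInsight:
    provider: str   # the only provider that surfaced it in this turn
    severity: int

@dataclass
class TurnRecord:
    domain: str                     # e.g. "Financial", "Legal"
    high_stakes: bool
    confidence: Dict[str, int]      # provider -> 1-10 linguistic-confidence score
    contradictions: List[Contradiction] = field(default_factory=list)
    corrections: List[Correction] = field(default_factory=list)
    unique_insights: List[UniqueInsight] = field(default_factory=list)

    @property
    def is_silent(self) -> bool:
        # "Silent turn": nothing flagged across the providers that ran
        return not (self.contradictions or self.corrections or self.unique_insights)
```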
External ground truth. We do not score any answer against an authoritative source. A model outvoted four-to-one on a financial forecast may still be correct. A model the others "correct" may be the only one that's right. We report the mechanism - who contradicts whom, who corrects whom - not the verdict.
Accuracy in a benchmark sense. This is not a hallucination benchmark. It does not compete with Vectara HHEM,[4] AA-Omniscience,[5] FACTS,[6] or SimpleQA. It measures a different thing: ensemble behavior in production.
Three structural facts about the dataset that readers should understand before interpreting the headline numbers. None of them invalidates the pattern; all of them shape how narrowly a given comparison can be drawn.
Tier aggregation. When we say "Claude," "Gemini," or "GPT," we mean the provider family, not a single named variant. Suprmind routes between variants (Haiku / Sonnet / Opus for Claude; Flash / Pro and their variants for Gemini; equivalents for the others) depending on the user's plan, mode, and query type. A "Claude" response in our data could be Haiku on one turn and Opus on another. For the purposes of this index, we're measuring provider-family behavior inside the ensemble, not any specific model version.
Variable peer panels. Not every turn is scored against all five providers. Peer review is task-appropriate and plan-appropriate: some turns run 2 models, some 4, some 5, based on the user's Suprmind plan and the workflow they selected. The confidence-contradiction rates in this report are calculated against whichever peers actually ran on each turn - not a uniform 5-provider panel on every data point. This is a feature of production data, not an artifact we can reshape.
Version verification. No mid-window model-version changes materially affected this dataset. The sample closed April 19, 2026; Claude Opus 4.7 shipped on April 16, but adoption in the routing layer had not yet propagated to our production traffic at the freeze point. This dataset is therefore a pre-Claude 4.7 snapshot. Q2 2026 Edition will be the first edition to reflect post-4.7 behavior.
1. Gemini 3.1 Flash Lite is the classifier. The engine that scores contradictions, corrections, and confidence is itself a member of the Gemini family, which creates a theoretical self-leniency concern. We have not run a manual calibration against human verification in this edition, as a deliberate tradeoff: manual verification would break the quarterly re-run automation. A future edition may add a benchmark arm that cross-validates classifier decisions against a smaller human-rated subset.
2. User sample bias. Users who chose a multi-model platform plausibly skew toward those who already suspect AI models disagree. Our audience, not a random sample of enterprise workers.
3. Sequential mode dominance. 87% of turns in the sample used Sequential mode (where each model sees prior models' answers before responding). Super Mind, Debate, Red Team, and First Principles modes together make up 13%. Mode effect on disagreement rate is a follow-up question we do not address in this edition.
4. Thin domain cells. Creative (n = 38), Education (n = 49), and Medical (n = 56) have enough data for headline rates but not for fine-grained provider-pair cuts. Cells with fewer than 10 observations are suppressed in the domain × pair matrix.
5. Domain classifier is a single-pass LLM call. 100% coverage across 877 total DCI sessions (700 external, 177 internal/test), with high classifier-reported confidence on all but 113 "Unknown" turns (short greetings and fragments). Not human-verified.
6. Silent-agreement definition. The 0.9% silent-agreement rate counts turns where the classifier flagged nothing at all. A stricter definition ("no critical-severity contradictions or corrections") would produce a larger rate. Both framings are defensible. This edition reports the looser one.
7. Corrections are not adjudicated. When Model A asserts that Model B's answer is incorrect, we record the assertion. We do not independently verify whether Model A is correct to make the claim.
8. Contradicted ≠ wrong. Repeated for emphasis: a model outvoted by peers may be the only one that's right. The statistic measures ensemble behavior, not truth.
This is the part of the methodology where we'd welcome replication, disagreement, and pushback. Three concerns we hold about this edition:
Gemini is the judge. Gemini 3.1 Flash Lite classifies contradictions, corrections, and confidence across all five providers - including Gemini's own responses. The catch-ratio numbers argue against significant self-leniency (a lenient self-judge should produce the opposite pattern - Gemini catching more than being caught, not the reverse). But we haven't cross-validated against a non-Gemini classifier. If a non-Gemini judge produced meaningfully different rankings, the methodology would need to update. We'd like to see that analysis.
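For anyone who wants to run that analysis, a minimal way to score it would be inter-judge agreement on identical turns - for example Cohen's kappa between the Gemini-based classifier's labels and a second, non-Gemini judge's labels. The sketch below assumes scikit-learn and hypothetical parallel label lists; it is a scoring outline, not our pipeline.

```python
# Hedged sketch: assumes two judges labeled the same turns; uses scikit-learn.
from sklearn.metrics import cohen_kappa_score

def judge_agreement(gemini_labels, other_labels):
    """Both inputs are parallel per-turn labels (e.g. "contradiction",
    "correction", "none") from the two judges on identical turns."""
    assert len(gemini_labels) == len(other_labels)
    return cohen_kappa_score(gemini_labels, other_labels)

# Rough reading: kappa near 1.0 suggests the rankings hold regardless of judge;
# kappa well below ~0.6 suggests judge choice matters and the methodology
# should update.
```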
The sample is Suprmind users. People who choose a multi-model platform plausibly skew toward those who already suspect AI models disagree. That's our audience, not a random enterprise sample. A stratified replication across a broader population would test whether our domain disagreement rates generalize. If someone runs one, we'd like to compare numbers.
The silent-agreement rate uses a loose definition. The 0.9% rate counts turns where nothing at all was flagged. A stricter definition - "no critical-severity contradictions or corrections" - would produce a larger rate. Both framings are defensible. This edition reports the looser one. Q2 will report the stricter definition alongside it.
Because we're publishing behavior in an ensemble, not correctness against ground truth, the conclusions are sensitive to several things we don't control. The three concerns above are the specific observations that would make us update.
We'd rather update than defend. The dataset is downloadable. If your reading of it differs from ours, we'd like to see the analysis.
This is the first frozen edition of a recurring series. Q2 2026 Edition is planned for late July with a refreshed sample window and any classifier or methodology improvements disclosed explicitly. The underlying classification pipeline runs daily on production traffic, and the dataset will continue to grow at roughly 30-50 new DCI turns per day.
Q2 2026 Edition will publish the first Positional Divergence analysis - how a model's answer changes depending on whether it responds first, middle, or last in a sequential multi-model chain. It's the question no lab benchmark can answer, because lab benchmarks don't run sequential chains. Data collection is already underway.
Early signal: position in the chain appears to meaningfully affect both the catch ratio and the unique-insight rate, in ways the current methodology collapses.
The Confidence Trap: an answer that sounds certain but doesn't survive peer review by other AI models. Operationally: a turn where a provider's response was flagged as high-confidence in its language (score ≥ 7 on a 1-10 scale) and then contradicted or corrected by at least one other provider in the same turn.
Hallucination benchmarks measure how often a model fabricates specific kinds of facts (summarization faithfulness, knowledge-recall accuracy, citation validity). This report measures how often frontier models disagree with each other in production, regardless of failure mode. A turn can contain zero hallucinations in the Vectara sense and still produce a Confidence Trap. The companion Suprmind AI Hallucination Rates & Benchmarks 2026 page covers the first category; this page covers the second.
Gemini 3.1 Flash Lite is the cheapest, fastest model capable of producing structurally reliable JSON classifications at the volume required (~50 turns per day, fire-and-forget). The same model is used to classify every provider's responses, including its own - introducing a theoretical self-leniency concern that we disclose explicitly. The catch-ratio numbers (Gemini caught 416 times, caught others 109) argue against significant self-leniency, since a lenient self-judge would produce the opposite pattern. But we cannot eliminate the concern without a human calibration arm, which we've deferred to preserve automation.
No. "Contradicted" means another model produced a substantively different answer. On questions with no external ground truth - forecasts, strategic judgments, interpretations - being outvoted doesn't equal being incorrect. On questions with ground truth, it usually does, but we don't measure that separately in this edition.
Financial disagreement outpaces Legal because Legal questions have retrievable authoritative sources (statutes, cases). Five models querying the same sources tend to converge. Financial questions are largely forecasts and interpretive judgments - no ground truth, five defensible answers. This inverts the common assumption that legal queries are the hardest domain for AI. The 72.1% Financial disagreement rate vs 41.2% Legal rate - a 31-point gap - captures that structural difference.
The data suggests GPT functions as a balanced generalist inside the ensemble - rarely failing, but also rarely surfacing things the other four models don't have. At 13.1% share of unique insights (339 out of 2,580), it's roughly half the rate of Perplexity and Claude. That's a different kind of reliability than "catches the most unique signal." Whether it's preferable depends on the task.
Perplexity's edge is web-grounded retrieval. When Perplexity fact-checks another model against current sources, it frequently finds stale or fabricated claims. This shows up in the catch ratio (2.54, highest in the dataset) and in the number of its unique insights flagged critical (333, also highest). Citation accuracy on news sources remains a known problem for AI models generally - Columbia Journalism Review's March 2025 audit found 60%+ citation hallucination across AI search tools[7] - so Perplexity's grounding advantage is real but not absolute.
Claude's low confident-contradicted rate does not mean it's the most accurate model. It means Claude produces high-confidence language that is calibrated closer to what the other four models also assert - so its confident answers get contradicted less often. That is a narrower claim than "Claude is the most accurate." The catch ratio (2.25) is strong but below Perplexity's 2.54. The unique-insights rate (24.5%) is essentially tied with Perplexity's 24.7%. The picture is "Claude and Perplexity are the two disproportionately valuable ensemble members," not "one model is best."
The aggregate data behind every table on this page is downloadable as a CSV. We do not release individual turn-level data for user privacy reasons.
This is the April 2026 Edition, frozen at publication. Q2 2026 Edition is planned for late July 2026. Future editions will be archived at their own URLs so citations against a specific edition remain stable.
On whether multi-model review adds signal at all, the data says yes. Across 1,324 real-user turns, multi-model review surfaced at least one contradiction, correction, or unique insight on 99.1% of turns. On Legal, Medical, Education, Research, and Creative domains, the silent-agreement rate was literally zero. Whether that signal is worth the latency and cost is a product-level decision, but the assertion that "a single frontier model is sufficient" is not supported by this dataset. See how Suprmind runs this pattern on your own questions →
Every table on this page is backed by an aggregate CSV - confidence-contradiction rates, catch ratios, domain disagreement, provider-pair matrices, severity distributions, and silent-agreement counts. Individual turn-level data is not released for user privacy reasons.
Licensed CC BY 4.0. Cite as: Suprmind Multi-Model Divergence Index, April 2026 Edition. n = 1,324 production turns.
Download CSV (aggregate data)
Nine sources. The list is grouped by how each reference relates to the research on this page - not alphabetically. Studies cited inline with specific claims are listed first. Adjacent hallucination benchmarks are listed separately because we name them in the body to contrast our methodology, not because we used their data.
We reference these benchmarks in the body to clarify what this page does not do. The Confidence Trap measures ensemble disagreement, not accuracy against ground truth. These are the leaderboards we don't compete with - important to name because readers would otherwise assume we do.
We measured this pattern in 1,324 real production queries across 10 professional domains. If you're working on something where a silently contradicted answer costs money or reputation, run the same pattern on it. If you think the Gemini-as-judge concern breaks the methodology, show us the analysis. If you think the Suprmind-user sample skews the domain rates, stratify and reply. The dataset is downloadable precisely so someone can disagree with us in a way that moves the work forward.
Try Suprmind →
The Confidence Trap is the April 2026 Edition of the Suprmind Multi-Model Divergence Index. Dataset frozen 2026-04-19 from Suprmind production data. Next edition: Q2 2026, late July. This page will remain accessible at its canonical URL with edition history preserved.