Suprmind Multi-Model Divergence Index - April 2026 Edition
Based on 1,324 real-user production turns across financial, legal,
medical, technical, business, and marketing queries.
Sample window: March 5 - April 19, 2026.
Published April 2026 · Next edition: late July 2026
Why this report exists
Hallucinations are the failure mode people notice. But in real work, the deeper risk is often confidence without resilience: answers that sound certain until another model challenges, corrects, or contradicts them. To measure that, we analyzed 1,324 real-user multi-model turns in Suprmind and tracked where providers diverged, corrected one another, and exposed hidden failure modes.
Hallucinations are one of several failure modes that confident-looking AI answers hide. Multi-model review exposes them — alongside stale knowledge, faulty reasoning, and unchallenged assumptions. This report measures the pattern across five frontier providers in production use.
How we measured it
Suprmind runs every user query through Claude, GPT, Gemini, Grok, and Perplexity, then scores the contradictions, corrections, and unique insights between their answers, turn by turn. Across 1,324 real-user turns — finance, legal, medical, technical, business strategy, marketing, research — multi-model review surfaced at least one contradiction, correction, or unique insight on 99.1% of turns. Not a lab benchmark. Real production work.
What we don't claim
One clarification before the numbers. This study does not measure correctness against an external ground truth. No such ground truth exists for the kinds of questions professionals actually ask — investment theses, regulatory interpretation, strategic tradeoffs, clinical judgment. What we measure is what you'd observe if you ran the same question past five experts: how often they contradict each other, who corrects whom, and where the disagreements are severe enough to change a decision.
Models disagreed on 54% of turns in this dataset. They corrected each other on 72%. They surfaced more than 2,500 unique insights - facts, risks, or angles that only one provider contributed on a given turn. Every row below is the mechanism, expressed as a number. The question this report answers: which providers are doing what, and which combinations matter most.
| Domain | Sample | Disagreement Rate |
|---|---|---|
| Financial | n = 258 | 72.1% |
| Legal | n = 136 | 41.2% |
The 31-point gap - nearly 1.75× - inverts the common assumption that legal queries are the hardest domain for AI. The explanation is structural. Legal questions ("what does this statute say," "is this clause enforceable") have citable ground truth in cases and statutes. Five frontier models retrieving the same authoritative sources tend to converge. Financial questions ("should we expand to Germany," "what's the fair valuation") are forecasts and judgments. No ground truth exists, and five models produce five defensible answers that differ.
The failure modes also differ. Legal AI peer-divergence patterns cluster around confidence exposure - confident assertions of case citations that turn out to be unsupported when another model checks.[3] Financial AI peer-divergence patterns cluster around confident contradictions - five different numbers, all sounding certain. Both are real. Only the second is captured in this report.
Based on 1,324 real-user multi-model turns in Suprmind, March 5 - April 19, 2026.
This page tracks three metrics. They describe the mechanism of multi-model review, not any single model's correctness.
When a provider's answer reads as highly confident - definitive language, no hedging - how often is that answer contradicted or corrected by another model in the same turn? The core signal of the Confidence Trap.
Of all the corrections one model made of another's output, what share did each provider contribute? Measures which models are doing the catching, not which models are being caught.
For each provider: corrections made divided by corrections received. Above 1.0 means the provider catches more than it's caught. Below 1.0 means the inverse. The cleanest single number for asymmetry.
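To make the three metrics concrete, here is a minimal Python sketch of how each could be computed from per-turn records. The record shapes and field names (provider, confidence, challenged, corrector, corrected) are illustrative assumptions, not the actual DCI schema.

```python
# Hedged sketch: field names and record shapes are illustrative, not the DCI schema.
from collections import defaultdict

def confident_contradicted_rate(responses, threshold=7):
    """Metric 1: of a provider's high-confidence responses (linguistic score >=
    threshold on a 1-10 scale), what share was contradicted or corrected by a
    peer in the same turn."""
    high = defaultdict(int)
    caught = defaultdict(int)
    for r in responses:  # r: {"provider": str, "confidence": int, "challenged": bool}
        if r["confidence"] >= threshold:
            high[r["provider"]] += 1
            if r["challenged"]:
                caught[r["provider"]] += 1
    return {p: caught[p] / high[p] for p in high}

def correction_share(corrections):
    """Metric 2: each provider's share of all corrections made of another
    model's output."""
    made = defaultdict(int)
    for c in corrections:  # c: {"corrector": str, "corrected": str}
        made[c["corrector"]] += 1
    total = sum(made.values())
    return {p: n / total for p, n in made.items()}

def catch_ratio(corrections):
    """Metric 3: corrections made divided by corrections received; above 1.0
    means the provider catches more than it is caught."""
    made = defaultdict(int)
    received = defaultdict(int)
    for c in corrections:
        made[c["corrector"]] += 1
        received[c["corrected"]] += 1
    return {p: made[p] / received[p] for p in received if received[p]}
```

Read against the summary table below, the Perplexity row is simply 335 catches over 132 times caught, a catch ratio of 2.54.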
We do not rank models on accuracy because we cannot. We rank them on behavior inside a multi-model ensemble, which is a different question - and the one that actually matters if you're deciding whether multi-model orchestration is worth the cost.
One table that anchors the rest of the report - and the decision of whether multi-model validation is worth the cost. Each row describes one provider's behavior inside the five-model ensemble across the three metrics we track, plus raw catch and insight counts.
| Provider | Confident-Contradicted (all turns) | Confident-Contradicted (high-stakes only) | Corrections Made | Times Caught | Catch Ratio | Unique Insights Surfaced |
|---|---|---|---|---|---|---|
| Claude | 33.8% | 26.3% | 304 | 135 | 2.25 | 631 |
| Perplexity | 33.9% | 32.1% | 335 | 132 | 2.54 | 638 |
| GPT | 39.5% | 36.0% | 111 | 295 | 0.38 | 339 |
| Grok | 48.8% | 46.8% | 193 | 269 | 0.72 | 509 |
| Gemini | 51.3% | 50.2% | 109 | 416 | 0.26 | 463 |
Based on 1,324 real-user multi-model turns in Suprmind, March 5 - April 19, 2026. External users only (10 internal/test accounts excluded). "Contradicted" means a peer model produced a substantively different answer in the same turn - not measured against external ground truth.
The columns tell different stories in combination. Low confident-contradicted rate + high catch ratio = a model that speaks carefully and catches other models' errors (Claude, Perplexity). High confident-contradicted rate + low catch ratio = a model that speaks definitively and is frequently contradicted on that confidence (Gemini). Grok and GPT occupy the middle with different profiles: Grok catches moderately and gets caught moderately; GPT catches rarely and surfaces unique insights least often.
Perplexity and Claude together cover most of the value columns. Neither dominates every one, and GPT's consistent middle-of-the-pack position is its own kind of reliability. No single row wins across the board - which is the argument for multi-model orchestration, expressed as a matrix.[1][2]
Suprmind runs every message through the same five frontier models measured in this report, then scores contradictions, corrections, and unique insights turn by turn. Inside the product, the Disagreement/Correction Index shows up as a sidebar panel - the per-turn version of what this page shows in aggregate.
When the models disagree, that disagreement is where the hard questions live. Suprmind surfaces it, quantifies it, and, in three clicks, turns it into a professional deliverable - so the weak reasoning gets caught before the decision ships.
Every provider in our dataset produced answers flagged as high-confidence by their own language - definitive assertions, no hedging, "the answer is X" phrasing. In a single-model workflow, those answers ship as final. In a multi-model turn, they get tested against four peers. How often does the test fail?
| Provider | High-confidence responses | Contradicted or corrected | Rate |
|---|---|---|---|
| Gemini | 887 | 455 | 51.3% |
| Grok | 688 | 336 | 48.8% |
| GPT | 805 | 318 | 39.5% |
| Perplexity | 629 | 213 | 33.9% |
| Claude | 757 | 256 | 33.8% |
Gemini's confident answers are contradicted almost exactly half the time. Claude's are contradicted about a third of the time. That is a 1.5× difference in reliability at the same confidence level - not because Claude has ground truth on its side and Gemini doesn't, but because Claude's definitive language is calibrated closer to what the other four models also assert, and Gemini's isn't.
We filtered to turns the domain classifier flagged as high-stakes - real decisions with legal, financial, medical, or career consequences. Drafts, regulatory interpretation, prescriptions, investment calls, contract reviews.
| Provider | High-confidence responses | Contradicted or corrected | Rate |
|---|---|---|---|
| Gemini | 289 | 145 | 50.2% |
| Grok | 231 | 108 | 46.8% |
| GPT | 261 | 94 | 36.0% |
| Perplexity | 215 | 69 | 32.1% |
| Claude | 240 | 63 | 26.3% |
The sharper cut of the Confidence Trap isn't the absolute rates. It's how each provider's confident-contradicted rate changes when the question has real consequences - when the model "should" be more careful.
| Provider | All turns | High-stakes | Delta |
|---|---|---|---|
| Claude | 33.8% | 26.3% | −7.5 pts |
| GPT | 39.5% | 36.0% | −3.5 pts |
| Grok | 48.8% | 46.8% | −2.0 pts |
| Perplexity | 33.9% | 32.1% | −1.8 pts |
| Gemini | 51.3% | 50.2% | −1.1 pts |
Only Claude slows down meaningfully when stakes rise. GPT drops a modest 3.5 points; Grok, Perplexity, and Gemini barely move.
That's the argument for Claude's calibration, expressed as a delta. When a question carries real-world consequences, Claude's confident-contradicted rate drops nearly eight points - its language becomes more hedged, more bounded. Gemini's drops 1.1 points. Grok's drops 2.0. For those two, it's the same linguistic confidence, the same definitive phrasing, whether the user is asking for a pizza recipe or a regulatory opinion. Their linguistic confidence isn't tracking the stakes.
The Financial-Legal inversion above is the headline. The full pattern across all ten domains is wider.
| Domain | Turns | Disagreement Rate | Correction Rate |
|---|---|---|---|
| Financial | 258 | 72.1% | 71.7% |
| Other | 153 | 60.1% | 69.3% |
| Marketing & Sales | 131 | 55.0% | 77.1% |
| Business Strategy | 257 | 54.9% | 75.5% |
| Research & Analysis | 74 | 52.7% | 77.0% |
| Technical | 172 | 49.4% | 75.0% |
| Creative | 38 | 42.1% | 60.5% |
| Legal | 136 | 41.2% | 69.9% |
| Medical | 56 | 33.9% | 58.9% |
| Education | 49 | 28.6% | 63.3% |
The ranking tracks how much each domain's answer depends on judgment versus lookup. Finance, strategy, marketing, and research all require interpretation and forecasts - where five frontier models produce five defensible answers that differ. Legal, Medical, and Education questions have more retrievable ground truth - where five models querying the same sources tend to converge on the same answer. But even the domains where models most often agree show substantial disagreement. Legal's 41.2% rate is higher than the error rates reported by any published hallucination benchmark.[3] Multi-model review keeps finding something to flag even where models ostensibly agree on the facts.
Counting disagreements doesn't fully describe them. A 3/10 severity disagreement is a stylistic or minor framing difference. A 9/10 is "two models are recommending opposite actions in a high-stakes context." Below, the share of each domain's contradictions rated critical (severity ≥ 7):
| Domain | Total contradictions | Critical (≥ 7) | Share |
|---|---|---|---|
| Research & Analysis | 46 | 24 | 52.2% |
| Technical | 117 | 46 | 39.3% |
| Financial | 285 | 107 | 37.5% |
| Other | 162 | 53 | 32.7% |
| Business Strategy | 196 | 60 | 30.6% |
| Legal | 75 | 22 | 29.3% |
| Medical | 24 | 7 | 29.2% |
| Creative | 16 | 4 | 25.0% |
| Marketing & Sales | 89 | 14 | 15.7% |
| Education | 16 | 2 | 12.5% |
Research & Analysis has a lower disagreement volume than Financial, but a much higher share of its disagreements are severe. When research-level questions diverge across models, they diverge hard. Marketing & Sales disagreements, by contrast, cluster at moderate severity - opinion-level differences on campaign positioning and channel strategy, not "one model is demonstrably missing something critical."
The practical read: on Financial and Research questions, multi-model review is not a nice-to-have. Roughly a third to a half of cross-model contradictions on those question types are critical-severity - disagreements a single-model workflow never sees.
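As a sketch of the arithmetic behind the Share column above, the computation below filters contradictions at the severity ≥ 7 threshold and groups by domain. The field names (domain, severity) are assumptions for illustration, not the classifier's actual output fields.

```python
# Hedged sketch: assumes contradiction records with "domain" and "severity" fields.
from collections import defaultdict

CRITICAL = 7  # severity >= 7 counts as critical, per the definition above

def critical_share_by_domain(contradictions):
    total = defaultdict(int)
    critical = defaultdict(int)
    for c in contradictions:
        total[c["domain"]] += 1
        if c["severity"] >= CRITICAL:
            critical[c["domain"]] += 1
    return {d: critical[d] / total[d] for d in total}

# e.g. Financial in this dataset: 107 critical out of 285 contradictions ≈ 37.5%
```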
The provider-pair matrix describes where disagreement actually happens inside the ensemble. The most combative pairs are not always the highest-severity. The least combative pairs are not the most "in agreement" - they may just be asking different questions.
| Pair | Contradictions | Avg severity | Critical |
|---|---|---|---|
| Gemini vs Grok | 182 | 6.13 | 66 |
| Claude vs Gemini | 120 | 6.04 | 50 |
| Claude vs GPT | 117 | 6.34 | 55 |
| Gemini vs GPT | 108 | 6.31 | 50 |
| Gemini vs Perplexity | 107 | 6.04 | 37 |
| Claude vs Grok | 97 | 6.43 | 48 |
| GPT vs Grok | 86 | 6.51 | 47 |
| Grok vs Perplexity | 81 | 6.26 | 36 |
| GPT vs Perplexity | 68 | 6.13 | 30 |
| Claude vs Perplexity | 52 | 6.44 | 24 |
Gemini and Grok are the most combative pair in the dataset by volume. Claude and Perplexity disagree least often. When GPT and Grok disagree, they disagree with the highest severity (6.51 avg) - a smaller number of harder disagreements.
Different domains have different "hot pairs." If we had to pick which pair to watch for each domain:
| Domain | Top disagreement pair | Count |
|---|---|---|
| Business Strategy | Gemini vs Grok | 59 |
| Financial | Claude vs Gemini | 37 |
| Technical | Gemini vs Grok | 27 |
| Marketing & Sales | Gemini vs Grok | 23 |
| Legal | Gemini vs Perplexity | 18 |
| Medical | Gemini vs Perplexity | 7 |
| Research & Analysis | Claude vs GPT | 10 |
Research & Analysis is the only domain where the top pair isn't Gemini-adjacent. Claude-GPT leads it - and it's also the domain where disagreements are most severe.
The cleanest number in this dataset is each provider's catch ratio - how many times the provider corrected another model, divided by how many times it was itself corrected:
| Provider | Catches made | Times caught | Catch ratio |
|---|---|---|---|
| Perplexity | 335 | 132 | 2.54 |
| Claude | 304 | 135 | 2.25 |
| Grok | 193 | 269 | 0.72 |
| GPT | 111 | 295 | 0.38 |
| Gemini | 109 | 416 | 0.26 |
Gemini corrects others 109 times. It is caught by others 416 times - nearly four to one. Perplexity flips that ratio: 335 catches, 132 times caught.
Perplexity's catch ratio is 9.77× Gemini's. That's the single sharpest stat in the dataset - the Confidence Trap expressed as a multiplier. Two ends of the same ensemble, calibrated differently enough that their roles are almost opposite.
A second stat lands the same argument from a different angle: Perplexity and Claude together made 60.7% of all corrections in the dataset - 639 out of 1,052 attributable to a specific corrector. Two models do three-fifths of the catching. GPT and Gemini combined make 20.9%. Error-catching is concentrated in two providers. That's both a design input for how many models to run and a reminder that not all five do the same job.
This is the central argument for multi-model ensembles expressed as a single statistic. Models with a high catch ratio are disproportionately valuable inside an ensemble - not because they are "smarter," but because their failure modes are less correlated with the failure modes of the models they catch.[2]
No single provider dominates across every metric. Each has a distinct role in the ensemble's error-catching mechanism.
Claude - the Calibrated Specialist. Second-highest catch ratio (2.25) and the largest calibration delta on high-stakes (33.8% → 26.3%, a 7.5-point drop). The language hedges when the stakes rise.
Perplexity - the Grounding Layer. Highest catch ratio (2.54). Surfaced 333 critical-severity unique insights, more than any other provider. Web-grounded retrieval catches stale and fabricated claims other models miss.
Grok - the Contrarian Insight. Third in both catch ratio (0.72) and unique-insight share (19.7%). Surfaces a meaningful slice of what Perplexity misses, often from sources the others don't index.
GPT - the Balanced Generalist. Most balanced across columns. Rarely fails, rarely surfaces unique signal (13.1% insight share, lowest). In a multi-model ensemble, GPT functions as a balanced generalist, not a contrarian catcher.
Gemini - the Overconfident Challenger. Highest response volume in the sample (941 responses). Gemini Flash Lite also serves as the DCI classifier that scores cross-provider behavior across the dataset. The research exists because it produces structured analysis reliably at scale. In its own responses, Gemini's confidence signal and reliability signal are calibrated differently from its peers' - the Confidence Trap pattern is most visible in Gemini's numbers.
Disagreement and corrections describe what multi-model review catches. Unique insights describe what it adds. Across 1,324 turns, the classifier flagged 3,484 unique insights - facts, angles, risks, or recommendations that would have been lost in a single-model workflow. 2,580 are attributable to one of the five tracked providers. The remainder came from earlier classifier runs that used non-standard provider labels and are excluded from provider shares.
| Provider | Unique insights | Share | Avg severity | Critical (≥ 7) |
|---|---|---|---|---|
| Perplexity | 638 | 24.7% | 6.38 | 333 |
| Claude | 631 | 24.5% | 6.09 | 268 |
| Grok | 509 | 19.7% | 5.89 | 159 |
| Gemini | 463 | 17.9% | 5.48 | 104 |
| GPT | 339 | 13.1% | 5.48 | 85 |
Three things stand out.
Perplexity surfaces the highest-severity unique insights (avg 6.38, and more than half of its insights are critical-severity). That tracks with its grounding mechanism - web-retrieved facts that other models don't have access to frequently turn out to be decisive.
Claude is a close second on volume, at a somewhat lower average severity - consistent with its profile elsewhere: careful language, broad coverage.
GPT surfaces the fewest unique insights, by a wide margin. Only 13% of all unique insights come from GPT, even though it produces roughly the same volume of responses as Gemini and Claude. This contradicts the popular framing of GPT as the "default best model." In a multi-model ensemble, GPT functions as a balanced generalist - it rarely fails, but it also rarely surfaces things the other four don't have.
The framing that lands: without Perplexity, 638 insights flagged across the 1,324 turns would have been missed, including 333 rated critical. Without GPT, 339. That's the actual cost of running one model fewer.
One of the questions we expected to answer honestly on this page: when is multi-model review not worth the cost? The natural assumption was "a significant fraction of turns produce no additional signal - the models just agree redundantly."
The data does not support that assumption.
| Domain | Turns | Silent turns | Silent rate |
|---|---|---|---|
| Other | 153 | 4 | 2.6% |
| Business Strategy | 257 | 5 | 1.9% |
| Marketing & Sales | 131 | 1 | 0.8% |
| Technical | 172 | 1 | 0.6% |
| Financial | 258 | 1 | 0.4% |
| Legal | 136 | 0 | 0.0% |
| Medical | 56 | 0 | 0.0% |
| Education | 49 | 0 | 0.0% |
| Research & Analysis | 74 | 0 | 0.0% |
| Creative | 38 | 0 | 0.0% |
| Total | 1,324 | 12 | 0.9% |
A "silent turn" is one where no contradictions, no corrections, and no unique insights were flagged across all five providers. Twelve turns out of 1,324.
In Legal (n = 136), Medical (n = 56), Education (n = 49), Research & Analysis (n = 74), and Creative (n = 38), multi-model review flagged something on every single turn. 353 turns across these five domains, zero silent outcomes. Across the full dataset, the rate holds: multi-model review surfaces at least one contradiction, correction, or unique insight on 99.1% of all turns.
The honest caveat: "flagging something" doesn't automatically mean the flag was decision-relevant. A minor phrasing difference counts toward the 99.1%. A stricter definition - "no critical-severity contradictions or corrections" - would produce a larger silent-agreement rate, and we plan to report that variant in the Q2 edition.
But the headline stands: there is no meaningful floor below which multi-model review is routinely wasted. The burden of proof for "single-model is sufficient" rests on whichever model you're trusting - and the catch-ratio table above is the evidence that it probably isn't sufficient.
Hallucinations - fabricated facts presented as real - are the failure mode most professionals associate with AI risk. The existing Suprmind research on AI hallucination rates across benchmarks aggregates 50+ sources on that specific problem.[9]
The Confidence Trap is adjacent but not the same.
A model can produce an answer that is not a hallucination in the Vectara sense[4] - every fact in it is retrievable from training data - and still be contradicted by another model in the same turn. Stale knowledge, unexamined assumptions, risk-tolerance mismatches, and edge-case reasoning errors all produce contradictions without producing hallucinations.
Our read: hallucinations are one of several failure modes that multi-model review exposes. When a second model flags a fabricated citation, it caught a hallucination. When it flags that the first model's ROI calculation assumed 100% customer adoption, it caught an unchallenged assumption. Both are real. Both are caught. The hallucinations page measures the first mode. This page measures the broader category across multiple failure types - which is why decision validation is a different problem than hallucination detection.[8]
This dataset is production data from real Suprmind users, not a lab benchmark. The Disagreement/Correction Index (DCI) that scored every turn is the same system that runs inside the product as a sidebar panel, giving live users the per-turn version of what this report shows in aggregate. That means tradeoffs we want to be explicit about.
Confidence level (per provider, per turn). An automated classifier scores each provider's response on a 1-10 linguistic-confidence scale, based on language signals: definitive phrasing, absence of hedging, first-person certainty. Score is the language, not the correctness.
Contradiction. A turn is flagged as containing a contradiction when two or more providers produce substantively different answers to the same question. The DCI classifier specifies which providers contradicted each other and assigns a severity score (1-10).
Correction. A turn is flagged as containing a correction when one provider explicitly identifies an inaccuracy, flawed assumption, or missing context in another provider's output. The classifier records which provider corrected which, and the severity.
Unique insight. A fact, risk, angle, or recommendation surfaced by exactly one provider in a turn and not mentioned by any of the other 2-4 providers in the same turn.
Silent turn. No contradictions, no corrections, and no unique insights flagged by the classifier. The floor of the "multi-model adds no signal" measurement.
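Put together, a per-turn record under these definitions might look like the sketch below. The structure and field names are hypothetical - a reader's mental model of the data, not the classifier's actual JSON schema.

```python
# Hypothetical per-turn record shape implied by the definitions above;
# not the actual DCI classifier output.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Contradiction:
    provider_a: str
    provider_b: str
    severity: int  # 1-10; >= 7 treated as critical in this report

@dataclass
class Correction:
    corrector: str  # provider that flagged the inaccuracy or missing context
    corrected: str  # provider whose output was flagged
    severity: int

@dataclass
class UniqueInsight:
    provider: str   # the only provider that surfaced it in this turn
    severity: int

@dataclass
class TurnRecord:
    domain: str                     # e.g. "Financial", "Legal"
    high_stakes: bool
    confidence: Dict[str, int]      # provider -> 1-10 linguistic-confidence score
    contradictions: List[Contradiction] = field(default_factory=list)
    corrections: List[Correction] = field(default_factory=list)
    unique_insights: List[UniqueInsight] = field(default_factory=list)

    @property
    def is_silent(self) -> bool:
        # "Silent turn": nothing flagged across the providers that ran
        return not (self.contradictions or self.corrections or self.unique_insights)
```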
External ground truth. We do not score any answer against an authoritative source. A model outvoted four-to-one on a financial forecast may still be correct. A model the others "correct" may be the only one that's right. We report the mechanism - who contradicts whom, who corrects whom - not the verdict.
Accuracy in a benchmark sense. This is not a hallucination benchmark. It does not compete with Vectara HHEM,[4] AA-Omniscience,[5] FACTS,[6] or SimpleQA. It measures a different thing: ensemble behavior in production.
Three structural facts about the dataset that readers should understand before interpreting the headline numbers. None of them invalidates the pattern; all of them shape how narrowly a given comparison can be drawn.
Tier aggregation. When we say "Claude," "Gemini," or "GPT," we mean the provider family, not a single named variant. Suprmind routes between variants (Haiku / Sonnet / Opus for Claude; Flash / Pro and their variants for Gemini; equivalents for the others) depending on the user's plan, mode, and query type. A "Claude" response in our data could be Haiku on one turn and Opus on another. For the purposes of this index, we're measuring provider-family behavior inside the ensemble, not any specific model version.
Variable peer panels. Not every turn is scored against all five providers. Peer review is task-appropriate and plan-appropriate: some turns run 2 models, some 4, some 5, based on the user's Suprmind plan and the workflow they selected. The confidence-contradiction rates in this report are calculated against whichever peers actually ran on each turn - not a uniform 5-provider panel on every data point. This is a feature of production data, not an artifact we can reshape.
Version verification. No mid-window model-version changes materially affected this dataset. The sample closed April 19, 2026; Claude Opus 4.7 shipped on April 16, but adoption in the routing layer had not yet propagated to our production traffic at the freeze point. This dataset is therefore a pre-Claude 4.7 snapshot. Q2 2026 Edition will be the first edition to reflect post-4.7 behavior.
1. Gemini 3.1 Flash Lite is the classifier. The engine that scores contradictions, corrections, and confidence is itself a member of the Gemini family, which creates a theoretical self-leniency concern. We have not run a manual calibration against human verification in this edition, as a deliberate tradeoff: manual verification would break the quarterly re-run automation. A future edition may add a benchmark arm that cross-validates classifier decisions against a smaller human-rated subset.
2. User sample bias. Users who chose a multi-model platform plausibly skew toward those who already suspect AI models disagree. Our audience, not a random sample of enterprise workers.
3. Sequential mode dominance. 87% of turns in the sample used Sequential mode (where each model sees prior models' answers before responding). Super Mind, Debate, Red Team, and First Principles modes together make up 13%. Mode effect on disagreement rate is a follow-up question we do not address in this edition.
4. Thin domain cells. Creative (n = 38), Education (n = 49), and Medical (n = 56) have enough data for headline rates but not for fine-grained provider-pair cuts. Cells with fewer than 10 observations are suppressed in the domain × pair matrix.
5. Domain classifier is a single-pass LLM call. 100% coverage across 877 total DCI sessions (700 external, 177 internal/test), with high classifier-reported confidence on all but 113 "Unknown" turns (short greetings and fragments). Not human-verified.
6. Silent-agreement definition. The 0.9% silent-agreement rate counts turns where the classifier flagged nothing at all. A stricter definition ("no critical-severity contradictions or corrections") would produce a larger rate. Both framings are defensible. This edition reports the looser one.
7. Corrections are not adjudicated. When Model A asserts that Model B's answer is incorrect, we record the assertion. We do not independently verify whether Model A is correct to make the claim.
8. Contradicted ≠ wrong. Repeated for emphasis: a model outvoted by peers may be the only one that's right. The statistic measures ensemble behavior, not truth.
This is the part of the methodology where we'd welcome replication, disagreement, and pushback. Three concerns we hold about this edition:
Gemini is the judge. Gemini 3.1 Flash Lite classifies contradictions, corrections, and confidence across all five providers - including Gemini's own responses. The catch-ratio numbers argue against significant self-leniency (a lenient self-judge should produce the opposite pattern - Gemini catching more than being caught, not the reverse). But we haven't cross-validated against a non-Gemini classifier. If a non-Gemini judge produced meaningfully different rankings, the methodology would need to update. We'd like to see that analysis.
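For anyone who wants to run that analysis, a minimal way to score it would be inter-judge agreement on identical turns - for example Cohen's kappa between the Gemini-based classifier's labels and a second, non-Gemini judge's labels. The sketch below assumes scikit-learn and hypothetical parallel label lists; it is a scoring outline, not our pipeline.

```python
# Hedged sketch: assumes two judges labeled the same turns; uses scikit-learn.
from sklearn.metrics import cohen_kappa_score

def judge_agreement(gemini_labels, other_labels):
    """Both inputs are parallel per-turn labels (e.g. "contradiction",
    "correction", "none") from the two judges on identical turns."""
    assert len(gemini_labels) == len(other_labels)
    return cohen_kappa_score(gemini_labels, other_labels)

# Rough reading: kappa near 1.0 suggests the rankings hold regardless of judge;
# kappa well below ~0.6 suggests judge choice matters and the methodology
# should update.
```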
The sample is Suprmind users. People who choose a multi-model platform plausibly skew toward those who already suspect AI models disagree. That's our audience, not a random enterprise sample. A stratified replication across a broader population would test whether our domain disagreement rates generalize. If someone runs one, we'd like to compare numbers.
The silent-agreement rate uses a loose definition. The 0.9% rate counts turns where nothing at all was flagged. A stricter definition - "no critical-severity contradictions or corrections" - would produce a larger rate. Both framings are defensible. This edition reports the looser one. Q2 will report the stricter definition alongside it.
Because we're publishing behavior in an ensemble, not correctness against ground truth, the conclusions are sensitive to several things we don't control. The three concerns above are the specific observations that would make us update.
We'd rather update than defend. The dataset is downloadable. If your reading of it differs from ours, we'd like to see the analysis.
This is the first frozen edition of a recurring series. Q2 2026 Edition is planned for late July with a refreshed sample window and any classifier or methodology improvements disclosed explicitly. The underlying classification pipeline runs daily on production traffic, and the dataset will continue to grow at roughly 30-50 new DCI turns per day.
Q2 2026 Edition will publish the first Positional Divergence analysis - how a model's answer changes depending on whether it responds first, middle, or last in a sequential multi-model chain. It's the question no lab benchmark can answer, because lab benchmarks don't run sequential chains. Data collection is already underway.
Early signal: position in the chain appears to meaningfully affect both the catch ratio and the unique-insight rate, in ways the current methodology collapses.
The Confidence Trap: an answer that sounds certain but doesn't survive peer review by other AI models. Operationally: a turn where a provider's response was flagged as high-confidence in its language (score ≥ 7 on a 1-10 scale) and then contradicted or corrected by at least one other provider in the same turn.
Hallucination benchmarks measure how often a model fabricates specific kinds of facts (summarization faithfulness, knowledge-recall accuracy, citation validity). This report measures how often frontier models disagree with each other in production, regardless of failure mode. A turn can contain zero hallucinations in the Vectara sense and still produce a Confidence Trap. The companion Suprmind AI Hallucination Rates & Benchmarks 2026 page covers the first category; this page covers the second.
Gemini 3.1 Flash Lite is the cheapest, fastest model capable of producing structurally reliable JSON classifications at the volume required (~50 turns per day, fire-and-forget). The same model is used to classify every provider's responses, including its own - introducing a theoretical self-leniency concern that we disclose explicitly. The catch-ratio numbers (Gemini caught 416 times, caught others 109) argue against significant self-leniency, since a lenient self-judge would produce the opposite pattern. But we cannot eliminate the concern without a human calibration arm, which we've deferred to preserve automation.
No. "Contradicted" means another model produced a substantively different answer. On questions with no external ground truth - forecasts, strategic judgments, interpretations - being outvoted doesn't equal being incorrect. On questions with ground truth, it usually does, but we don't measure that separately in this edition.
Financial disagreement outpaces Legal because Legal questions have retrievable authoritative sources (statutes, cases). Five models querying the same sources tend to converge. Financial questions are largely forecasts and interpretive judgments - no ground truth, five defensible answers. This inverts the common assumption that legal queries are the hardest domain for AI. The 72.1% Financial disagreement rate vs 41.2% Legal rate - a 31-point gap - captures that structural difference.
The data suggests GPT functions as a balanced generalist inside the ensemble - rarely failing, but also rarely surfacing things the other four models don't have. At 13.1% share of unique insights (339 out of 2,580), it's roughly half the rate of Perplexity and Claude. That's a different kind of reliability than "catches the most unique signal." Whether it's preferable depends on the task.
Perplexity's edge is web-grounded retrieval. When Perplexity fact-checks another model against current sources, it frequently finds stale or fabricated claims. This shows up in the catch ratio (2.54, highest in the dataset) and in the number of its unique insights flagged critical (333, also highest). Citation accuracy on news sources remains a known problem for AI models generally - Columbia Journalism Review's March 2025 audit found 60%+ citation hallucination across AI search tools[7] - so Perplexity's grounding advantage is real but not absolute.
Claude's low confident-contradicted rate does not mean it's the most accurate model. It means Claude produces high-confidence language that is calibrated closer to what the other four models also assert - so its confident answers get contradicted less often. That is a narrower claim than "Claude is the most accurate." The catch ratio (2.25) is strong but below Perplexity's 2.54. The unique-insights rate (24.5%) is essentially tied with Perplexity's 24.7%. The picture is "Claude and Perplexity are the two disproportionately valuable ensemble members," not "one model is best."
The aggregate data behind every table on this page is downloadable as a CSV. We do not release individual turn-level data for user privacy reasons.
This is the April 2026 Edition, frozen at publication. Q2 2026 Edition is planned for late July 2026. Future editions will be archived at their own URLs so citations against a specific edition remain stable.
On whether multi-model review adds signal at all, the data says yes. Across 1,324 real-user turns, multi-model review surfaced at least one contradiction, correction, or unique insight on 99.1% of turns. On Legal, Medical, Education, Research, and Creative domains, the silent-agreement rate was literally zero. Whether that signal is worth the latency and cost is a product-level decision, but the assertion that "a single frontier model is sufficient" is not supported by this dataset. See how Suprmind runs this pattern on your own questions →
Every table on this page is backed by an aggregate CSV - confidence-contradiction rates, catch ratios, domain disagreement, provider-pair matrices, severity distributions, and silent-agreement counts. Individual turn-level data is not released for user privacy reasons.
Licensed CC BY 4.0. Cite as: Suprmind Multi-Model Divergence Index, April 2026 Edition. n = 1,324 production turns.
Download CSV (aggregate data)
Nine sources. The list is grouped by how each reference relates to the research on this page - not alphabetically. Studies cited inline with specific claims are listed first. Adjacent hallucination benchmarks are listed separately because we name them in the body to contrast our methodology, not because we used their data.
We reference these benchmarks in the body to clarify what this page does not do. The Confidence Trap measures ensemble disagreement, not accuracy against ground truth. These are the leaderboards we don't compete with - important to name because readers would otherwise assume we do.
We measured this pattern in 1,324 real production queries across 10 professional domains. If you're working on something where a silently contradicted answer costs money or reputation, run the same pattern on it. If you think the Gemini-as-judge concern breaks the methodology, show us the analysis. If you think the Suprmind-user sample skews the domain rates, stratify and reply. The dataset is downloadable precisely so someone can disagree with us in a way that moves the work forward.
Try Suprmind →
The Confidence Trap is the April 2026 Edition of the Suprmind Multi-Model Divergence Index. Dataset frozen 2026-04-19 from Suprmind production data. Next edition: Q2 2026, late July. This page will remain accessible at its canonical URL with edition history preserved.