If you thought ChatGPT and its counterparts were walking, talking versions of the truth, you might want to adjust your expectations. A new study released today on 1,000 real-world fact-check claims shows that five of the most advanced AI frontier models disagree on 67% of them. When you ask these digital brains to settle a claim, they don't just differ on nuance. They often land in completely different buckets of reality, from "True" to "False."
The researchers used a four-bucket system: True, Mostly True, Misleading, and False. To keep things clean, they used 1,000 recent user submissions from a platform called Lenz. These weren't curated textbook questions designed for a test; they were the messy, complicated, and often contentious claims that real people send in. The study found that on 34% of these cases, the models' verdicts were so far apart—like someone saying "True" while another says "False"—that it qualifies as a major substantive disagreement.
It's tempting to think that if three out of five models say something is true, it must be the case. The researchers warn us against this trap. A majority verdict isn't the same as ground truth. In many instances, the dissenting model—the one disagreeing with the pack—might actually be the only one that got it right. The study emphasizes that the models aren't interchangeable judges, but rather distinct systems with their own biases and blind spots.
The panel consisted of three parametric models—GPT-5.4, Claude Opus 4.7, and Gemini 3 Pro—alongside two retrieval-augmented versions, Gemini 3 Pro + Search and Sonar Pro. These were chosen to cover how different AI systems function in the wild. The researchers took a strict approach by forcing each model to pick a label without any option to abstain. This prevents the models from simply "chickening out" of hard questions.
"A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right."
When the AI panel does agree, they usually converge on the extremes, like "True" or "False." However, when they land on the middle ground of "Mostly True" or "Misleading," they almost never reach unanimity. This suggests that the gray areas of reality remain a massive challenge for even the smartest algorithms. Researchers noted that while some models prefer to stick to the poles of the truth, others are more willing to wander into the middle. This reflects different "decision priors" built into their design.
Many studies use public benchmarks like MMLU-Pro or TruthfulQA, but those questions have been around for so long that AI models have likely already "seen" them during their training. By using recent user submissions from the platform Lenz, the researchers created a snapshot of claims that haven't been memorized or canonized. This ensures that the disagreement they measured is about the models' ability to analyze information rather than their ability to recite pre-trained answers.
The study released this week is just the first part of a larger project. Researchers are currently working on a follow-up that will use human-labeled verification for these same 1,000 claims. Once those human ground-truth labels are available, they'll be able to map out exactly where the machines deviate from human logic. The data is available as a frozen, permanent archive for anyone who wants to see the messy, divided reality of AI "intelligence" for themselves. The next phase will provide definitive human verification for the claims.