The single sharpest fact about the new frontier AI models from OpenAI and Anthropic is that they don't actually improve performance all that much, with scores topping out just below 73% in a test by expert network Pearl. The evaluation covered 25 of the world's leading AI models, including GPT-5.5, Claude Opus 4.7, and Gemini, with real licensed professionals judging the answers. The result was clear: none of the models exceeded 73%.
Benchmarks measure whether a model can pass a test, said Pearl CEO Andy Kurtzig. Pearl's evaluation asks instead whether a professional would trust the answer, and right now, the answer is no. "Almost right is still wrong," he said. In his view, the current state of AI models isn't good enough.
Pearl assembled roughly 510 questions across five professional domains: business, health, law, pets, and technology. None of these questions had ever been released publicly, and they weren't available to model developers during training. Each of the 25 AI models received identical prompts with no tuning or prompt engineering. Responses were graded by credentialed experts on a 1-to-5 rubric. The rubric measured four dimensions: correctness, completeness, prioritization, and professional judgment.
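To make the scoring concrete, here is a minimal sketch of how 1-to-5 rubric grades across the four dimensions could be rolled up into a percentage like the ones reported. The article does not say how Pearl converts rubric grades into percentages, so the equal weighting, the linear rescaling (1 maps to 0%, 5 maps to 100%), and every function name below are assumptions made purely for illustration.

```python
from statistics import mean

# The four rubric dimensions named in the article.
DIMENSIONS = ("correctness", "completeness", "prioritization", "professional_judgment")


def rubric_to_percent(score: float, low: int = 1, high: int = 5) -> float:
    """Linearly rescale a rubric score on [low, high] to 0-100 (assumed mapping)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} rubric")
    return 100.0 * (score - low) / (high - low)


def answer_percent(grades: dict[str, int]) -> float:
    """Average the four dimension grades with equal weight (assumed), then rescale."""
    missing = [d for d in DIMENSIONS if d not in grades]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    return rubric_to_percent(mean(grades[d] for d in DIMENSIONS))


def model_percent(graded_answers: list[dict[str, int]]) -> float:
    """Average the per-answer percentages over all graded answers for one model."""
    return mean(answer_percent(g) for g in graded_answers)


if __name__ == "__main__":
    # Hypothetical grades for one answer, not data from Pearl's test.
    example = {
        "correctness": 5,
        "completeness": 4,
        "prioritization": 3,
        "professional_judgment": 4,
    }
    print(answer_percent(example))
```

Under these assumptions, an answer graded 4 on every dimension lands at 75%, which shows how "mostly good" answers can still leave a model short of the low 70s once weaker answers are averaged in. A different convention, such as dividing the grade by 5, would give different percentages from the same grades.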
Pearl also tested models in both minimum and maximum reasoning configurations. It found that more inference-time compute delivers only 1-2.6% improvement, and occasionally even produced worse answers. Some areas were better, of course: top models hit 80.9% in business, for instance. In law and health, however, Pearl says some widely-used models dropped to around 20% expert alignment, which is unimpressive at best and dangerous at worst.
Kurtzig is the founder and CEO of Pearl, an expert network that builds AI systems with experts in the loop. Before founding Pearl, he worked at various tech companies, including a stint as a VP of engineering at a Silicon Valley startup. That background doesn't suggest a bias against AI; he's just calling it like it is.
For executives at companies like Cisco and Meta that are shedding human workers to align with the age of AI, the results are a reminder that AI makes more than a few mistakes in every domain, and serious errors in high-impact areas like health and law. They shouldn't rely solely on AI models, which aren't ready for prime time yet.
The findings matter for the broader tech industry, too. If AI models aren't as reliable as claimed, companies may need to rethink how they implement AI in their operations, because they can't just plug in a model and expect it to work flawlessly.
The new frontier AI models are not quite ready for prime time. They may be more expensive and come with gaudy claims of higher intelligence, but when it comes down to it, they're just not that much better than the old models. They're still making mistakes, and they can't be trusted to make critical decisions.
- 25 AI models were tested by Pearl.
- No model scored above 73% in the test.
- Roughly 510 questions were assembled across five professional domains.
- The test was graded by credentialed experts on a 1-to-5 rubric.
- More inference-time compute delivered only 1-2.6% improvement.