The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as quickly as the models improve. On Monday, Artificial Analysis, an independent AI benchmarking organization whose rankings are closely watched by developers and enterprise buyers, released a major overhaul of its Intelligence Index that fundamentally changes how the industry measures AI progress.
The new Intelligence Index v4.0 incorporates 10 evaluations spanning agents, coding, scientific reasoning, and general knowledge. But the changes go far deeper than shuffling test names. The organization removed three staple benchmarks (MMLU-Pro, AIME 2025, and LiveCodeBench) that have long been cited by AI companies in their marketing materials. In their place, the new index introduces evaluations designed to measure whether AI systems can complete the kind of work that people actually get paid to do.
"This index shift reflects a broader transition: intelligence is being measured less by recall and more by economically useful action," observed Aravind Sundar, a researcher who responded to the announcement on X (formerly Twitter).
Why AI benchmarks are breaking: The problem with tests that top models have already mastered
The benchmark overhaul addresses a growing crisis in AI evaluation: the leading models have become so capable that traditional tests can no longer meaningfully differentiate between them. The new index deliberately makes the curve harder to climb. According to Artificial Analysis, top models now score 50 or below on the new v4.0 scale, compared to 73 on the previous version, a recalibration designed to restore headroom for future improvement.
This saturation problem has plagued the industry for months. When every frontier model scores in the 90th percentile on a given test, the test loses its usefulness as a decision-making tool for enterprises trying to choose which AI system to deploy. The new methodology attempts to solve this by weighting four categories equally (Agents, Coding, Scientific Reasoning, and General) while introducing evaluations where even the most advanced systems still struggle.
The results under the new framework show OpenAI's GPT-5.2 with extended reasoning effort claiming the top spot, followed closely by Anthropic's Claude Opus 4.5 and Google's Gemini 3 Pro. OpenAI describes GPT-5.2 as "the most capable model series yet for professional knowledge work," while Anthropic's Claude Opus 4.5 scores higher than GPT-5.2 on SWE-Bench Verified, a test set evaluating software coding abilities.
GDPval-AA: The new benchmark testing whether AI can do your job
The most significant addition to the new index is GDPval-AA, an evaluation based on OpenAI's GDPval dataset that tests AI models on real-world, economically valuable tasks across 44 occupations and 9 major industries. Unlike traditional benchmarks that ask models to solve abstract math problems or answer multiple-choice trivia, GDPval-AA measures whether AI can produce the deliverables that professionals actually create: documents, slides, diagrams, spreadsheets, and multimedia content.
Models receive shell access and web browsing capabilities through what Artificial Analysis calls "Stirrup," its reference agentic harness. Scores are derived from blind pairwise comparisons, with Elo ratings frozen at the time of evaluation to ensure index stability.
Under this framework, OpenAI's GPT-5.2 with extended reasoning leads with an Elo score of 1442, while the non-thinking variant of Anthropic's Claude Opus 4.5 follows at 1403. Claude Sonnet 4.5 trails at 1259.
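To put those rating gaps in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head preference rate. It assumes the conventional Elo logistic curve with a 400-point scale; the announcement does not say which rating model or scale the index actually fits, so the function name `expected_win_rate` and the resulting percentages are illustrative only.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise comparisons won by A under the classic Elo formula.

    Assumption: standard base-10 logistic with a 400-point scale; the
    benchmark's own rating fit may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Ratings quoted in the paragraph above.
gpt_5_2 = 1442
claude_opus_4_5 = 1403
claude_sonnet_4_5 = 1259

p_opus = expected_win_rate(gpt_5_2, claude_opus_4_5)
p_sonnet = expected_win_rate(gpt_5_2, claude_sonnet_4_5)
print(f"vs Claude Opus 4.5:   {p_opus:.1%}")
print(f"vs Claude Sonnet 4.5: {p_sonnet:.1%}")
```

Under that assumption, a 39-point lead corresponds to winning only slightly more than half of blind comparisons, while a 183-point lead corresponds to roughly three in four.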
On the original GDPval evaluation, GPT-5.2 beat or tied top industry professionals on 70.9% of well-specified tasks, according to OpenAI. The company claims GPT-5.2 "outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations," with companies including Notion, Box, Shopify, Harvey, and Zoom observing "state-of-the-art long-horizon reasoning and tool-calling performance."
The emphasis on economically measurable output is a philosophical shift in how the industry thinks about AI capability. Rather than asking whether a model can pass a bar exam or solve competition math problems (achievements that generate headlines but don't necessarily translate to workplace productivity), the new benchmarks ask whether AI can actually do jobs.
Graduate-level physics problems expose the limits of today's most advanced AI models
While GDPval-AA measures practical productivity, another new evaluation called CritPT reveals just how far AI systems remain from true scientific reasoning. The benchmark tests language models on unpublished, research-level reasoning tasks across modern physics, including condensed matter, quantum physics, and astrophysics.
CritPT was developed by more than 50 active physics researchers from over 30 leading institutions. Its 71 composite research challenges simulate full-scale research projects at the entry level, comparable to the warm-up exercises a hands-on principal investigator might assign to junior graduate students. Every problem is hand-curated to produce a guess-resistant, machine-verifiable answer.
The results are sobering. Current state-of-the-art models remain far from reliably solving full research-scale challenges. GPT-5.2 with extended reasoning leads the CritPT leaderboard with a score of just 11.5%, followed by Google's Gemini 3 Pro Preview and Anthropic's Claude 4.5 Opus Thinking variant. These scores suggest that despite remarkable progress on consumer-facing tasks, AI systems still struggle with the kind of deep reasoning required for scientific discovery.
AI hallucination rates: Why the most accurate models aren't always the most trustworthy
Perhaps the most revealing new evaluation is AA-Omniscience, which measures factual recall and hallucination across 6,000 questions covering 42 economically relevant topics within six domains: Business, Health, Law, Software Engineering, Humanities & Social Sciences, and Science/Engineering/Mathematics.
The evaluation produces an Omniscience Index that rewards precise knowledge while penalizing hallucinated responses, providing insight into whether a model can distinguish what it knows from what it doesn't. The findings expose an uncomfortable truth: high accuracy does not guarantee low hallucination. Models with the highest accuracy often fail to lead on the Omniscience Index because they tend to guess rather than abstain when uncertain.
Google's Gemini 3 Pro Preview leads the Omniscience Index with a score of 13, followed by Claude Opus 4.5 Thinking and Gemini 3 Flash Reasoning, both at 10. However, the breakdown between accuracy and hallucination rates reveals a more complex picture.
On raw accuracy, Google's two models lead with scores of 54% and 51% respectively, followed by Claude 4.5 Opus Thinking at 43%. But Google's models also show higher hallucination rates than peer models, at 88% and 85%. Anthropic's Claude 4.5 Sonnet Thinking and Claude Opus 4.5 Thinking show hallucination rates of 48% and 58% respectively, while GPT-5.1 with high reasoning effort comes in at 51%, the second-lowest hallucination rate tested.
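The relationship between the three reported numbers can be sketched in a few lines. The announcement does not give the formulas, so this assumes one plausible set of definitions: accuracy is the share of all questions answered correctly, hallucination rate is the share of not-correct responses where the model gave a wrong answer instead of abstaining, and the index is the percentage correct minus the percentage wrong. Partially correct answers are ignored, and the function name is hypothetical.

```python
def omniscience_index(accuracy: float, hallucination_rate: float) -> float:
    """Index on a -100..100 scale under the ASSUMED definitions in the lead-in.

    accuracy           -- fraction of all questions answered correctly
    hallucination_rate -- fraction of the remaining questions that were
                          answered wrongly rather than declined
    """
    wrong = hallucination_rate * (1.0 - accuracy)  # fraction of all questions answered wrongly
    return 100.0 * (accuracy - wrong)

# Accuracy and hallucination figures quoted in the paragraph above.
print(round(omniscience_index(0.54, 0.88), 1))  # the 54%-accuracy model
print(round(omniscience_index(0.43, 0.58), 1))  # the 43%-accuracy model
```

Under these assumed definitions the two calls give about 13.5 and 9.9, close to the reported index scores of 13 and 10. The sketch also makes the trade-off concrete: a model that answers every question it is unsure of pays for each wrong guess, while one that abstains loses nothing.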
Omniscience Accuracy and Hallucination Rate each carry a 6.25% weight in the overall Intelligence Index v4.0.
Inside the AI arms race: How OpenAI, Google, and Anthropic stack up under new testing
The benchmark reshuffling arrives at an especially turbulent moment in the AI industry. All three major frontier model developers have released major new models within just a few weeks, and Gemini 3 still holds the top spot on most of the leaderboards on LMArena, a widely cited benchmarking tool used to compare LLMs.
Google's November release of Gemini 3 prompted OpenAI to declare a "code red" effort to improve ChatGPT. OpenAI is counting on its GPT family of models to justify its $500 billion valuation and over $1.4 trillion in planned spending. "We announced this code red to really signal to the company that we want to marshal resources in one particular area," said Fidji Simo, CEO of applications at OpenAI. OpenAI CEO Sam Altman told CNBC he expected the company to exit its code red by January.
Anthropic responded with Claude Opus 4.5 on November 24, achieving an SWE-Bench Verified accuracy score of 80.9% and reclaiming the coding crown from both GPT-5.1-Codex-Max and Gemini 3. The launch marked Anthropic's third major model release in two months. Microsoft and Nvidia have since announced multi-billion-dollar investments in Anthropic, boosting its valuation to about $350 billion.
How Artificial Analysis tests AI models: A look at the independent benchmarking process
Artificial Analysis emphasizes that all evaluations are run independently using a standardized methodology. The organization states that its "methodology emphasizes fairness and real-world applicability," estimating a 95% confidence interval for the Intelligence Index of less than ±1% based on experiments with more than 10 repeats on certain models.
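For readers unfamiliar with how such an interval is obtained from repeated runs, the following is a minimal sketch using the textbook Student's t interval for a mean. The run scores are invented, the helper name is hypothetical, and nothing here is taken from the published methodology, which may estimate its interval differently.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 9 degrees of freedom
# (i.e. 10 repeats). A standard table value, hard-coded to avoid a SciPy dependency.
T_CRIT_95_DF9 = 2.262

def ci_half_width(run_scores, t_crit=T_CRIT_95_DF9):
    """Half-width of the confidence interval for the mean of repeated runs."""
    n = len(run_scores)
    std_err = statistics.stdev(run_scores) / math.sqrt(n)  # sample std dev / sqrt(n)
    return t_crit * std_err

# Hypothetical index scores from 10 repeated runs of one model (invented numbers).
runs = [49.6, 50.3, 50.1, 49.8, 50.4, 49.9, 50.0, 50.2, 49.7, 50.0]
mean_score = statistics.mean(runs)
half_width = ci_half_width(runs)
print(f"{mean_score:.2f} ± {half_width:.2f}")
```

With run-to-run scatter of a few tenths of a point, as in this invented sample, the interval comes out well under one index point. The default critical value only applies to exactly 10 runs; a different number of repeats needs the matching t value passed as `t_crit`.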
The organization's published methodology defines key terms that enterprise buyers should understand. According to the methodology documentation, Artificial Analysis considers an "endpoint" to be a hosted instance of a model accessible via an API, meaning a single model may have multiple endpoints across different providers. A "provider" is a company that hosts and provides access to one or more model endpoints or systems. Critically, Artificial Analysis distinguishes between "open weights" models, whose weights have been released publicly, and truly open-source models, noting that many open LLMs have been released with licenses that do not meet the full definition of open-source software.
The methodology also clarifies how the organization standardizes token measurement: it uses OpenAI tokens, as measured with OpenAI's tiktoken package, as a standard unit across all providers to enable fair comparisons.
What the new AI Intelligence Index means for enterprise technology decisions in 2026
For technical decision-makers evaluating AI systems, the Intelligence Index v4.0 provides a more nuanced picture of capability than previous benchmark compilations. The equal weighting across agents, coding, scientific reasoning, and general knowledge means that enterprises with specific use cases may want to examine category-specific scores rather than relying solely on the aggregate index.
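As a rough illustration of why category scores can matter more than the aggregate, the sketch below computes an equally weighted composite and then re-weights it for a coding-heavy use case. All scores and the custom weights are invented for illustration, and `composite_score` is a hypothetical helper, not part of any published tooling.

```python
from typing import Dict, Optional

CATEGORIES = ("agents", "coding", "scientific_reasoning", "general")

def composite_score(scores: Dict[str, float],
                    weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the four category scores; equal weights by default."""
    if weights is None:
        weights = {name: 1.0 for name in CATEGORIES}
    total_weight = sum(weights[name] for name in CATEGORIES)
    return sum(scores[name] * weights[name] for name in CATEGORIES) / total_weight

# Hypothetical category scores for two models (invented numbers).
model_a = {"agents": 60.0, "coding": 40.0, "scientific_reasoning": 50.0, "general": 50.0}
model_b = {"agents": 40.0, "coding": 60.0, "scientific_reasoning": 50.0, "general": 50.0}

# Hypothetical weights for a buyer who mostly cares about coding.
coding_heavy = {"agents": 1.0, "coding": 3.0, "scientific_reasoning": 1.0, "general": 1.0}

print(composite_score(model_a), composite_score(model_b))
print(composite_score(model_a, coding_heavy), composite_score(model_b, coding_heavy))
```

Both hypothetical models tie on the equally weighted composite, yet the coding-heavy weighting separates them clearly, which is the case for looking past the headline number.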
The introduction of hallucination measurement as a distinct, weighted factor addresses one of the most persistent concerns in enterprise AI adoption. A model that appears highly accurate but frequently hallucinates when uncertain poses significant risks in regulated industries like healthcare, finance, and law.
The Artificial Analysis Intelligence Index is described as "a text-only, English language evaluation suite." The organization benchmarks models for image inputs, speech inputs, and multilingual performance separately.
The response to the announcement has been largely positive. "It is great to see the index evolving to reduce saturation and focus more on agentic performance," wrote one commenter in an X.com post. "Including real-world tasks like GDPval-AA makes the scores much more relevant for practical use."
Others struck a more ambitious note. "The new wave of models that is just about to come will leave them all behind," predicted one observer. "By the end of the year the singularity will be undeniable."
But whether that prediction proves prophetic or premature, one thing is already clear: the era of judging AI by how well it answers test questions is ending. The new standard is simpler and far more consequential: can it do the work?

