For decades, the IQ test has been one of the most familiar, and most contested, yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world's most powerful language models and plotting them on a standard bell curve.
The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible, and sharp criticism from researchers and commentators who warn the entire framework is misleading.
"This is super useful," wrote Thibaut Mélen, a technology commentator, on X. "Much easier to understand model progress when it's mapped like this instead of another giant leaderboard table."
Brian Vellmure, a business strategist, offered a similar endorsement: "This is helpful. Anecdotally tracks with personal experience."
But the backlash arrived just as quickly. "It's nonsense. AI is far too jagged. The map is not the territory," posted AI Deeply, an artificial intelligence commentary account, crystallizing a fear shared by many researchers: that reducing a language model's sprawling, uneven capabilities to a single number creates a dangerous illusion of precision.
Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works
AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.
The site's methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity's Last Exam, CritPt, and GPQA Diamond.
Each raw benchmark score gets mapped to an implied IQ through what the site describes as "hand-calibrated difficulty curves." Crucially, the methodology compresses ceilings for benchmarks considered easier or more vulnerable to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that "every derived IQ averages all four dimensions, so missing coverage can't make a model look better by omission."
OpenAI leads the bell curve, but the gap between the top AI models has never been smaller
As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier, and widening variety in the tiers below.
According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136, the highest of any model tracked. It is closely followed by Opus 4.7 from Anthropic (roughly 132), GPT-5.4 (roughly 131), and Opus 4.6 (roughly 129). Google's Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight.
That compression is not unique to AI IQ's framework. Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that "the biggest takeaway is how compressed the top of the leaderboard has become." On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.
Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs (Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7) bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don't need the very best model for every task. One X user, ovsky, noted that the data "confirms experience with sonnet 4.6 being an absolute workhorse versus opus 4.5," pointing to the way the charts can validate practitioner intuitions that headline rankings often miss.
Why emotional intelligence scores are becoming the new battleground in AI model rankings
What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an "EQ" (emotional intelligence) score. The site maps each model's EQ-Bench 3 Elo rating and Arena Elo rating to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two.
The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic's Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant, the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google's Gemini 3.1 Pro sits in a strong middle position on both axes.
One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges "creates potential scoring bias in favor of Anthropic models." To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected since it uses human judges. That self-correction is rare in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work.
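The EQ pipeline described above can be sketched as follows. The calibration anchor points are invented placeholders, since the site's actual piecewise-linear scales are not published; only the 50/50 weighting and the 200-point penalty on the Claude-judged component come from the stated methodology.

```python
# Sketch of the EQ composite. The (Elo -> EQ) anchor points below are
# assumed for illustration; only the 50/50 weighting and the 200-Elo
# Anthropic penalty reflect the site's described methodology.

def piecewise_linear(x: float, points: list[tuple[float, float]]) -> float:
    """Linear interpolation through (elo, eq) calibration anchors,
    clamped at both ends of the curve."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]

# Assumed calibration anchors, one curve per component.
EQBENCH_CURVE = [(800, 85), (1200, 100), (1600, 120), (1900, 135)]
ARENA_CURVE = [(1000, 85), (1250, 100), (1450, 120), (1550, 135)]

def implied_eq(eqbench_elo: float, arena_elo: float, anthropic: bool) -> float:
    """50/50 composite of the two mapped scores. Anthropic models lose
    200 Elo on the Claude-judged EQ-Bench component before mapping; the
    human-judged Arena component is left untouched."""
    if anthropic:
        eqbench_elo -= 200
    eq_bench = piecewise_linear(eqbench_elo, EQBENCH_CURVE)
    eq_arena = piecewise_linear(arena_elo, ARENA_CURVE)
    return 0.5 * eq_bench + 0.5 * eq_arena

print(implied_eq(1400, 1250, anthropic=False))  # 105.0
print(implied_eq(1400, 1250, anthropic=True))   # 100.0: penalty applied
```

Note that the penalty is applied in Elo space, before the nonlinear mapping, so its effect on the final EQ depends on where the model sits on the calibration curve.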
The AI cost-performance chart that enterprise buyers actually need to see
Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model's estimated IQ against an "effective cost" metric, defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
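That definition translates directly into arithmetic. In the sketch below, the per-million-token prices and the efficiency factor values are hypothetical examples, not real rates; only the 2M-input / 1M-output reference task and the efficiency multiplier come from the site's definition.

```python
# Effective cost as defined above: token cost of a reference task with
# 2M input and 1M output tokens, scaled by a usage efficiency factor.
# Prices are USD per million tokens; all numbers here are hypothetical.

def effective_cost(input_price: float, output_price: float,
                   efficiency_factor: float = 1.0) -> float:
    """Dollar cost of the reference task (2M in, 1M out) x efficiency."""
    task_cost = 2.0 * input_price + 1.0 * output_price  # millions of tokens
    return task_cost * efficiency_factor

# A hypothetical premium model at $5/M input, $25/M output:
print(effective_cost(5, 25))  # 35.0
# A hypothetical budget model at $0.30/M in, $1.20/M out, factor 0.8:
print(round(effective_cost(0.3, 1.2, 0.8), 2))  # 1.44
```

The example illustrates the scale of the gap the chart visualizes: an order of magnitude or more in per-task cost between premium and budget tiers.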
The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner (high IQ, high cost) with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: decent IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107, potentially the most economical option for bulk classification or extraction workloads.
The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the "green end" of that axis are stronger all-around deals; those near the "red end" sacrifice capability, cost efficiency, or both. For CIOs watching API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing (using expensive models for hard problems and cheap ones for everything else) is no longer optional. It is the dominant architecture for serious AI deployments.
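The routing pattern described above can be reduced to a threshold rule. This is a deliberately minimal sketch: the model names, the task-kind heuristic, and the escalation-on-failure rule are all assumptions for illustration, not anything AI IQ or any provider prescribes.

```python
# Minimal sketch of cost-aware model routing: send known-hard tasks to
# a frontier model, everything else to a cheap one, and escalate after
# a failure. All names and the difficulty heuristic are placeholders.

FRONTIER_MODEL = "frontier-model"  # high IQ, ~$50 effective cost (hypothetical)
BUDGET_MODEL = "budget-model"      # decent IQ, ~$3 effective cost (hypothetical)

HARD_TASK_KINDS = {"proof", "architecture", "multi-step-reasoning"}

def route(task_kind: str, attempts_failed: int = 0) -> str:
    """Pick a model: frontier for known-hard task kinds or after the
    cheap model has already failed, budget otherwise."""
    if task_kind in HARD_TASK_KINDS or attempts_failed > 0:
        return FRONTIER_MODEL
    return BUDGET_MODEL

print(route("classification"))                     # budget-model
print(route("proof"))                              # frontier-model
print(route("classification", attempts_failed=1))  # frontier-model
```

Production routers are usually richer (learned difficulty classifiers, confidence thresholds, latency budgets), but the economics are the same: most traffic never needs the expensive model.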
Critics say AI's "jagged" capabilities make a single IQ score dangerously misleading
The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model's uneven capabilities into a single score obscures more than it reveals.
"IQ as a proxy is fading — we're seeing reasoning density spikes that don't map to g-factor," posted Zaya, a technology commentator, on X. "GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time."
That observation touches on what AI researchers call the "jaggedness" problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps.
Pressureangle, another X user, posted a more granular critique, calling out "complete lack of transparency" and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and exact mathematical transformations are not published as open datasets, a gap that matters to researchers accustomed to fully reproducible methods.
Others questioned the premise itself. "As useless as human IQ testing," wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: "Why not having the Models take an official (MENSA-Grade) test? Wouldn't this be the most accurate and most 'human-comparable' way to benchmark intelligence?" That approach already exists via TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, "an IQ-style benchmark captures only one slice of capability." Each approach has tradeoffs, and neither has won the argument yet.
The real race isn't for the highest score; it's for the smartest model stack
For all the debate about methodology, the most important signal in AI IQ's data may not be any single model's score. It is the shape of the market the charts reveal.
There are now more than 50 frontier-class models accessible via APIs, from at least 14 major providers spanning the United States, China, and Europe. Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel where no two companies measure the same thing in the same way. Academic research has highlighted that "most benchmarks introduce bias by focusing on a specific type of domain," and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135, roughly 60 points of improvement in 30 months.
That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests (ARC-AGI-2, FrontierMath Tier 4, Humanity's Last Exam), the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: "ARC AGI 3 disagrees," he wrote, referencing a next-generation benchmark that may already be undermining current scores.
AI IQ is not perfect. Its methodology is partially opaque. Its IQ metaphor can mislead. And its creator acknowledges known biases while likely missing others. But the alternative, wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions, is worse. The site offers enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to "which model is best?" is almost always "it depends on the task."
As Debdoot Ghosh mused on X after viewing the charts: "Now a human's role is just to orchestrate?"
Maybe. But if the AI IQ data shows anything clearly, it's that orchestration (knowing which model to deploy, when, and at what cost) has become its own kind of intelligence. And for that, there is no benchmark yet.

