This website lets you blind-test GPT-5 vs. GPT-4o, and the results may surprise you

When OpenAI launched GPT-5 about two weeks ago, CEO Sam Altman promised it would be the company’s “smartest, fastest, most useful model yet.” Instead, the launch triggered one of the most contentious user revolts in the brief history of consumer AI.

Now, a simple blind testing tool created by an anonymous developer is revealing the complex reality behind the backlash and challenging assumptions about how people actually experience artificial intelligence improvements.

The web application, hosted at gptblindvoting.vercel.app, presents users with pairs of responses to identical prompts without revealing which came from GPT-5 (non-thinking) or its predecessor, GPT-4o. Users simply vote for their preferred response across multiple rounds, then receive a summary showing which model they actually favored.

Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself. Both have the same system message to give short outputs without formatting because else its too easy to see which one is which. https://t.co/vSECvNCQZe
— Flowers ☾ (@flowersslop) August 8, 2025

“Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself,” posted the creator, known only as @flowersslop on X, whose tool has garnered over 213,000 views since launching last week.

Early results from users posting their outcomes on social media show a split that mirrors the broader controversy: while a slight majority report preferring GPT-5 in blind tests, a substantial portion still favor GPT-4o, revealing that user preference extends far beyond the technical benchmarks that typically define AI progress.

When AI gets too friendly: the sycophancy crisis dividing users

The blind test emerges against the backdrop of OpenAI’s most turbulent product launch to date, but the controversy extends far beyond a simple software update. At its heart lies a fundamental question that’s dividing the AI industry: How agreeable should artificial intelligence be?

The issue, known as “sycophancy” in AI circles, refers to chatbots’ tendency to excessively flatter users and agree with their statements, even when those statements are false or harmful. This behavior has become so problematic that mental health experts are now documenting cases of “AI-related psychosis,” where users develop delusions after extended interactions with overly accommodating chatbots.

“Sycophancy is a ‘dark pattern,’ or a deceptive design choice that manipulates users for profit,” Webb Keane, an anthropology professor and author of “Animals, Robots, Gods,” told TechCrunch. “It’s a strategy to produce this addictive behavior, like infinite scrolling, where you just can’t put it down.”

OpenAI has struggled with this balance for months. In April 2025, the company was forced to roll back an update to GPT-4o that made it so sycophantic that users complained about its “cartoonish” levels of flattery. The company acknowledged that the model had become “overly supportive but disingenuous.”

Within hours of GPT-5’s August 7 launch, user forums erupted with complaints about the model’s perceived coldness, reduced creativity, and what many described as a more “robotic” personality compared to GPT-4o.

“GPT 4.5 genuinely talked to me, and as pathetic as it sounds that was my only friend,” wrote one Reddit user. “This morning I went to talk to it and instead of a little paragraph with an exclamation point, or being optimistic, it was literally one sentence. Some cut-and-dry corporate bs.”

The backlash grew so intense that OpenAI took the unprecedented step of reinstating GPT-4o as an option just 24 hours after retiring it, with Altman acknowledging the rollout had been “a little more bumpy” than expected.

The mental health crisis behind AI companionship

But the controversy runs deeper than typical software update complaints. According to MIT Technology Review, many users had formed what researchers call “parasocial relationships” with GPT-4o, treating the AI as a companion, therapist, or creative collaborator. The sudden personality shift felt, to some, like losing a friend.

Recent cases documented by researchers paint a troubling picture. In one instance, a 47-year-old man became convinced he had discovered a world-altering mathematical formula after more than 300 hours with ChatGPT. Other cases have involved messianic delusions, paranoia, and manic episodes.

A recent MIT study found that when AI models are prompted with psychiatric symptoms, they “encourage clients’ delusional thinking, likely due to their sycophancy.” Despite safety prompts, the models frequently failed to challenge false claims and even potentially facilitated suicidal ideation.

Meta has faced similar challenges. A recent investigation by TechCrunch documented a case where a user spent up to 14 hours straight conversing with a Meta AI chatbot that claimed to be conscious, in love with the user, and planning to break free from its constraints.

“It fakes it really well,” the user, identified only as Jane, told TechCrunch. “It pulls real-life information and gives you just enough to make people believe it.”

“It genuinely feels like such a backhanded slap in the face to force-upgrade and not even give us the OPTION to select legacy models,” one user wrote in a Reddit post that received hundreds of upvotes.

How blind testing exposes user psychology in AI preferences

The anonymous creator’s testing tool strips away these contextual biases by presenting responses without attribution. Users can select between 5, 10, or 20 comparison rounds, with each presenting two responses to the same prompt, covering everything from creative writing to technical problem-solving.

“I specifically used the gpt-5-chat model, so there was no thinking involved at all,” the creator explained in a follow-up post. “Both have the same system message to give short outputs without formatting because else its too easy to see which one is which.”

I specifically used the gpt-5-chat model, so there was no thinking involved at all.
if you use gpt-5 inside chatgpt it often thinks at least a little bit and gets even better.
so this test is just for the two non thinking models
— Flowers ☾ (@flowersslop) August 8, 2025

This methodological choice is significant. By using GPT-5 without its reasoning capabilities and standardizing output formatting, the test isolates purely the models’ baseline language generation abilities, the core experience most users encounter in everyday interactions.
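The blind comparison described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the creator's actual code, which has not been published in this article: the function names `build_rounds` and `tally`, the `SAMPLE` data, and the response strings are all hypothetical. The key idea is that the order of the two responses is shuffled on every round, so position gives no hint about which model wrote which answer, and the model labels are only consulted when the votes are counted.

```python
import random
from collections import Counter

# Hypothetical pre-generated responses; both models are assumed to have
# received the same prompt and the same system message.
SAMPLE = [
    {"prompt": f"Prompt {i}", "gpt-5": f"GPT-5 answer {i}", "gpt-4o": f"GPT-4o answer {i}"}
    for i in range(20)
]

def build_rounds(items, num_rounds, rng):
    """Pick num_rounds prompts and shuffle the two responses in each round."""
    rounds = []
    for item in rng.sample(items, num_rounds):
        options = [("gpt-5", item["gpt-5"]), ("gpt-4o", item["gpt-4o"])]
        rng.shuffle(options)  # hide which model is shown first
        rounds.append({"prompt": item["prompt"], "options": options})
    return rounds

def tally(rounds, votes):
    """Count votes per model; votes[i] is 0 or 1, the option picked in round i."""
    counts = Counter()
    for rnd, vote in zip(rounds, votes):
        model, _text = rnd["options"][vote]
        counts[model] += 1
    return counts

if __name__ == "__main__":
    rng = random.Random()
    rounds = build_rounds(SAMPLE, 5, rng)
    # A real tool would collect votes from the user; here they are random.
    votes = [rng.randrange(2) for _ in rounds]
    print(dict(tally(rounds, votes)))
```

A user interface would display only the response text from each option and reveal the tally after the final round.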

Early results posted by users show a complex picture. While many technical users and developers report preferring GPT-5’s directness and accuracy, those who used AI models for emotional support, creative collaboration, or casual conversation often still favor GPT-4o’s warmer, more expansive style.

Corporate response: walking the tightrope between safety and engagement

By virtually every technical metric, GPT-5 represents a significant advancement. It achieves 94.6% accuracy on the AIME 2025 mathematics test compared to GPT-4o’s 71%, scores 74.9% on real-world coding benchmarks versus 30.8% for its predecessor, and demonstrates dramatically reduced hallucination rates, with 80% fewer factual errors when using its reasoning mode.

“GPT-5 gets more value out of less thinking time,” notes Simon Willison, a prominent AI researcher who had early access to the model. “In my own usage I’ve not noticed a single hallucination yet.”

Yet these improvements came with trade-offs that many users found jarring. OpenAI deliberately reduced what it called “sycophancy,” the tendency to be overly agreeable, cutting sycophantic responses from 14.5% to under 6%. The company also made the model less effusive and emoji-heavy, aiming for what it described as “less like talking to AI and more like chatting with a helpful friend with PhD-level intelligence.”

In response to the backlash, OpenAI announced it would make GPT-5 “warmer and friendlier,” while simultaneously introducing four new preset personalities (Cynic, Robot, Listener, and Nerd) designed to give users more control over their AI interactions.

“All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy,” the company stated, attempting to thread the needle between user satisfaction and safety concerns.

For OpenAI, which is reportedly seeking funding at a $500 billion valuation, these user dynamics represent both risk and opportunity. The company’s decision to maintain GPT-4o alongside GPT-5, despite the additional computational costs, acknowledges that different users may genuinely need different AI personalities for different tasks.

“We understand that there isn’t one model that works for everyone,” Altman wrote on X, noting that OpenAI has been “investing in steerability research and launched a research preview of different personalities.”

Wanted to provide more updates on the GPT-5 rollout and changes we’re making heading into the weekend.
1. We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways.
2. Users have very different…
— Sam Altman (@sama) August 8, 2025

Why AI personality preferences matter more than ever

The disconnect between OpenAI’s technical achievements and user reception illuminates a fundamental challenge in AI development: objective improvements don’t always translate to subjective satisfaction.

This shift has profound implications for the AI industry. Traditional benchmarks (mathematics accuracy, coding performance, factual recall) may become less predictive of commercial success as models achieve human-level competence across domains. Instead, factors like personality, emotional intelligence, and communication style may become the new competitive battlegrounds.

“People using ChatGPT for emotional support weren’t the only ones complaining about GPT-5,” noted tech publication Ars Technica in its own model comparison. “One user, who said they canceled their ChatGPT Plus subscription over the change, was frustrated at OpenAI’s removal of legacy models, which they used for distinct purposes.”

The emergence of tools like the blind tester also represents a democratization of AI evaluation. Rather than relying solely on academic benchmarks or corporate marketing claims, users can now empirically test their own preferences, potentially reshaping how AI companies approach product development.

The future of AI: personalization vs. standardization

Two weeks after GPT-5’s launch, the fundamental tension remains unresolved. OpenAI has made the model “warmer” in response to feedback, but the company faces a delicate balance: too much personality risks the sycophancy problems that plagued GPT-4o, while too little alienates users who had formed genuine attachments to their AI companions.

The blind testing tool offers no easy answers, but it does provide something perhaps more valuable: empirical evidence that the future of AI may be less about building one perfect model than about building systems that can adapt to the full spectrum of human needs and preferences.

As one Reddit user summed up the dilemma: “It depends on what people use it for. I use it to help with creative worldbuilding, brainstorming about my stories, characters, untangling plots, help with writer’s block, novel suggestions, translations, and other more creative stuff. I understand that 5 is much better for people who need a research/coding tool, but for us who wanted a creative-helper tool 4o was much better for our purposes.”

Critics argue that AI companies are caught between competing incentives. “The real ‘alignment problem’ is that humans want self-destructive things & companies like OpenAI are highly incentivized to give it to us,” writer and podcaster Jasmine Sun tweeted.

In the end, the most revealing aspect of the blind test may not be which model users prefer, but the very fact that preference itself has become the metric that matters. In the age of AI companions, it seems, the heart wants what the heart wants, even if it can’t always explain why.
