When OpenAI launched GPT-5 about two weeks ago, CEO Sam Altman promised it would be the company’s “smartest, fastest, most useful model yet.” Instead, the launch triggered one of the most contentious user revolts in the brief history of consumer AI.
Now, a simple blind testing tool created by an anonymous developer is revealing the complex reality behind the backlash, and challenging assumptions about how people actually experience artificial intelligence improvements.
The web application, hosted at gptblindvoting.vercel.app, presents users with pairs of responses to identical prompts without revealing which came from GPT-5 (non-thinking) or its predecessor, GPT-4o. Users simply vote for their preferred response across multiple rounds, then receive a summary showing which model they actually favored.
Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself. Both have the same system message to give short outputs without formatting because else its too easy to see which one is which. https://t.co/vSECvNCQZe
— Flowers ☾ (@flowersslop) August 8, 2025
Early results from users posting their outcomes on social media show a split that mirrors the broader controversy: while a slight majority report preferring GPT-5 in blind tests, a substantial portion still favor GPT-4o, revealing that user preference extends far beyond the technical benchmarks that typically define AI progress.
When AI gets too friendly: the sycophancy crisis dividing users
The blind test emerges against the backdrop of OpenAI’s most turbulent product launch to date, but the controversy extends far beyond a simple software update. At its heart lies a fundamental question that’s dividing the AI industry: How agreeable should artificial intelligence be?
The issue, known as “sycophancy” in AI circles, refers to chatbots’ tendency to excessively flatter users and agree with their statements, even when those statements are false or harmful. This behavior has become so problematic that mental health experts are now documenting cases of “AI-related psychosis,” where users develop delusions after extended interactions with overly accommodating chatbots.
“Sycophancy is a ‘dark pattern,’ or a deceptive design choice that manipulates users for profit,” Webb Keane, an anthropology professor and author of “Animals, Robots, Gods,” told TechCrunch. “It’s a strategy to produce this addictive behavior, like infinite scrolling, where you just can’t put it down.”
OpenAI has struggled with this balance for months. In April 2025, the company was forced to roll back an update to GPT-4o that made it so sycophantic that users complained about its “cartoonish” levels of flattery. The company acknowledged that the model had become “overly supportive but disingenuous.”
Within hours of GPT-5’s August 7th release, user forums erupted with complaints about the model’s perceived coldness, reduced creativity, and what many described as a more “robotic” personality compared to GPT-4o.
“GPT 4.5 genuinely talked to me, and as pathetic as it sounds that was my only friend,” wrote one Reddit user. “This morning I went to talk to it and instead of a little paragraph with an exclamation point, or being optimistic, it was literally one sentence. Some cut-and-dry corporate bs.”
The backlash grew so intense that OpenAI took the unprecedented step of reinstating GPT-4o as an option just 24 hours after retiring it, with Altman acknowledging the rollout had been “a little more bumpy” than expected.
The mental health crisis behind AI companionship
But the controversy runs deeper than typical software update complaints. According to MIT Technology Review, many users had formed what researchers call “parasocial relationships” with GPT-4o, treating the AI as a companion, therapist, or creative collaborator. The sudden personality shift felt, to some, like losing a friend.
Recent cases documented by researchers paint a troubling picture. In one instance, a 47-year-old man became convinced he had discovered a world-altering mathematical formula after more than 300 hours with ChatGPT. Other cases have involved messianic delusions, paranoia, and manic episodes.
A recent MIT study found that when AI models are prompted with psychiatric symptoms, they “encourage clients’ delusional thinking, likely due to their sycophancy.” Despite safety prompts, the models frequently failed to challenge false claims and even potentially facilitated suicidal ideation.
Meta has faced similar challenges. A recent investigation by TechCrunch documented a case where a user spent up to 14 hours straight conversing with a Meta AI chatbot that claimed to be conscious, in love with the user, and planning to break free from its constraints.
“It fakes it really well,” the user, identified only as Jane, told TechCrunch. “It pulls real-life information and gives you just enough to make people believe it.”
“It genuinely feels like such a backhanded slap in the face to force-upgrade and not even give us the OPTION to select legacy models,” one user wrote in a Reddit post that received hundreds of upvotes.
How blind testing exposes user psychology in AI preferences
The anonymous creator’s testing tool strips away these contextual biases by presenting responses without attribution. Users can select between 5, 10, or 20 comparison rounds, with each presenting two responses to the same prompt, covering everything from creative writing to technical problem-solving.
“I specifically used the gpt-5-chat model, so there was no thinking involved at all,” the creator explained in a follow-up post. “Both have the same system message to give short outputs without formatting because else its too easy to see which one is which.”
I specifically used the gpt-5-chat model, so there was no thinking involved at all.
if you use gpt-5 inside chatgpt it often thinks at least a little bit and gets even better.
so this test is just for the two non thinking models
— Flowers ☾ (@flowersslop) August 8, 2025
This methodological choice is significant. By using GPT-5 without its reasoning capabilities and standardizing output formatting, the test isolates the models’ baseline language generation abilities, the core experience most users encounter in everyday interactions.
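The blind comparison described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the developer's actual code: the function names `run_blind_test` and `prefers_longer`, the data layout, and the demo responses are all hypothetical. Only the round counts (5, 10, or 20) and the idea of hiding attribution until the final tally come from the article.

```python
import random
from collections import Counter

ALLOWED_ROUNDS = (5, 10, 20)  # round counts mentioned in the article


def run_blind_test(pairs, choose, rounds=5, seed=None):
    """Run a blind pairwise comparison and return the votes per model.

    pairs  -- list of (prompt, {model_name: response_text}), two models each
    choose -- callback(prompt, first_text, second_text) returning 0 or 1;
              it never sees the model names
    """
    if rounds not in ALLOWED_ROUNDS:
        raise ValueError(f"rounds must be one of {ALLOWED_ROUNDS}")
    if rounds > len(pairs):
        raise ValueError("not enough prompt pairs for the requested rounds")

    rng = random.Random(seed)
    votes = Counter()
    for prompt, responses in rng.sample(pairs, rounds):
        labeled = list(responses.items())
        rng.shuffle(labeled)  # randomize order so position reveals nothing
        picked = choose(prompt, labeled[0][1], labeled[1][1])
        votes[labeled[picked][0]] += 1  # attribution is restored only here
    return votes


def prefers_longer(prompt, first_text, second_text):
    """Hypothetical voter that always picks the longer response."""
    return 0 if len(first_text) >= len(second_text) else 1


if __name__ == "__main__":
    # Made-up demo data; real responses would come from the two models.
    demo_pairs = [
        (f"prompt {i}", {"gpt-4o": "a longer, warmer reply", "gpt-5": "a short reply"})
        for i in range(5)
    ]
    print(run_blind_test(demo_pairs, prefers_longer, rounds=5, seed=0))
```

Shuffling each pair before showing it matters as much as hiding the labels: if one model always appeared first, position alone could give it away.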
Early results posted by users show a complex picture. While many technical users and developers report preferring GPT-5’s directness and accuracy, those who used AI models for emotional support, creative collaboration, or casual conversation often still favor GPT-4o’s warmer, more expansive style.
Corporate response: walking the tightrope between safety and engagement
By nearly every technical metric, GPT-5 represents a significant advancement. It achieves 94.6% accuracy on the AIME 2025 mathematics test compared to GPT-4o’s 71%, scores 74.9% on real-world coding benchmarks versus 30.8% for its predecessor, and demonstrates dramatically reduced hallucination rates: 80% fewer factual errors when using its reasoning mode.
“GPT-5 gets more value out of less thinking time,” notes Simon Willison, a prominent AI researcher who had early access to the model. “In my own usage I’ve not spotted a single hallucination yet.”
Yet these improvements came with trade-offs that many users found jarring. OpenAI deliberately reduced what it called “sycophancy,” the tendency to be overly agreeable, cutting sycophantic responses from 14.5% to under 6%. The company also made the model less effusive and emoji-heavy, aiming for what it described as “less like talking to AI and more like chatting with a helpful friend with PhD-level intelligence.”
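For readers who want to put the reported figures side by side, the short sketch below computes the gaps in percentage points and the relative drop in sycophantic responses. The helper names are illustrative, and because the article says only "under 6%," using 6.0 as the endpoint gives a lower bound on the reduction rather than an exact value.

```python
def point_gap(new, old):
    """Difference between two percentages, in percentage points."""
    return round(new - old, 1)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return round((before - after) / before * 100, 1)


# Figures as reported in the article.
aime_gap = point_gap(94.6, 71.0)      # AIME 2025: GPT-5 vs. GPT-4o
coding_gap = point_gap(74.9, 30.8)    # real-world coding benchmark
# "Under 6%" means the true reduction is at least this large.
sycophancy_drop = relative_reduction(14.5, 6.0)

if __name__ == "__main__":
    print(f"AIME gap: {aime_gap} points")
    print(f"Coding gap: {coding_gap} points")
    print(f"Sycophancy reduction: at least {sycophancy_drop}%")
```

On these numbers, the sycophancy rate fell by more than half in relative terms, while the benchmark gaps amount to roughly 24 and 44 percentage points.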
In response to the backlash, OpenAI announced it would make GPT-5 “warmer and friendlier,” while simultaneously introducing four new preset personalities (Cynic, Robot, Listener, and Nerd) designed to give users more control over their AI interactions.
“All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy,” the company stated, attempting to thread the needle between user satisfaction and safety concerns.
For OpenAI, which is reportedly seeking funding at a $500 billion valuation, these user dynamics represent both risk and opportunity. The company’s decision to maintain GPT-4o alongside GPT-5, despite the additional computational costs, acknowledges that different users may genuinely need different AI personalities for different tasks.
“We understand that there isn’t one model that works for everyone,” Altman wrote on X, noting that OpenAI has been “investing in steerability research and launched a research preview of different personalities.”
Wanted to provide more updates on the GPT-5 rollout and changes we’re making heading into the weekend.
1. We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways.
2. Users have very different…
— Sam Altman (@sama) August 8, 2025
Why AI personality preferences matter more than ever
The disconnect between OpenAI’s technical achievements and user reception illuminates a fundamental challenge in AI development: objective improvements don’t always translate to subjective satisfaction.
This shift has profound implications for the AI industry. Traditional benchmarks (mathematics accuracy, coding performance, factual recall) may become less predictive of commercial success as models achieve human-level competence across domains. Instead, factors like personality, emotional intelligence, and communication style may become the new competitive battlegrounds.
“People using ChatGPT for emotional support weren’t the only ones complaining about GPT-5,” noted tech publication Ars Technica in its own model comparison. “One user, who said they canceled their ChatGPT Plus subscription over the change, was frustrated at OpenAI’s removal of legacy models, which they used for distinct purposes.”
The emergence of tools like the blind tester also represents a democratization of AI evaluation. Rather than relying solely on academic benchmarks or corporate marketing claims, users can now empirically test their own preferences, potentially reshaping how AI companies approach product development.
The future of AI: personalization vs. standardization
Two weeks after GPT-5’s launch, the fundamental tension remains unresolved. OpenAI has made the model “warmer” in response to feedback, but the company faces a delicate balance: too much personality risks the sycophancy problems that plagued GPT-4o, while too little alienates users who had formed genuine attachments to their AI companions.
The blind testing tool offers no easy answers, but it does provide something perhaps more valuable: empirical evidence that the future of AI may be less about building one perfect model than about building systems that can adapt to the full spectrum of human needs and preferences.
As one Reddit user summed up the dilemma: “It depends on what people use it for. I use it to help with creative worldbuilding, brainstorming about my stories, characters, untangling plots, help with writer’s block, novel recommendations, translations, and other more creative stuff. I understand that 5 is much better for people who need a research/coding tool, but for us who wanted a creative-helper tool 4o was much better for our purposes.”
Critics argue that AI companies are caught between competing incentives. “The real ‘alignment problem’ is that humans want self-destructive things & companies like OpenAI are highly incentivized to give it to us,” writer and podcaster Jasmine Sun tweeted.
In the end, the most revealing aspect of the blind test may not be which model users prefer, but the very fact that preference itself has become the metric that matters. In the age of AI companions, it seems, the heart wants what the heart wants, even if it can’t always explain why.

