Washington State University professor Mesut Cicek and his research team repeatedly tested ChatGPT by giving it hypotheses taken from scientific papers. The goal was to see whether the AI could correctly judge if each claim was supported by research: in other words, whether it was true or false.
In total, the team evaluated more than 700 hypotheses and asked the same question 10 times for each one to measure consistency.
Accuracy Results and the Limits of AI Performance
When the experiment was first run in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. However, once the researchers adjusted for random guessing, the results looked far less impressive. The AI performed only about 60% better than chance, a level closer to a low D than to strong reliability.
The system had the most difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed notable inconsistency: even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time.
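One way to picture the consistency figure is as agreement with the majority verdict across the 10 repeated runs. A minimal sketch of such a metric (the function name and the exact definition are illustrative assumptions, not the study's published code):

```python
from collections import Counter

def consistency_rate(answers):
    # Fraction of repeated runs that agree with the most common verdict.
    # Illustrative metric only; the study's precise definition may differ.
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# A run like the one Cicek describes: five "true", five "false"
split_run = ["true", "false"] * 5
print(consistency_rate(split_run))  # 0.5, maximally inconsistent
```

Under this reading, a model that gave the same verdict all 10 times would score 1.0, while the five-true/five-false splits Cicek observed score 0.5.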
Inconsistent Answers Raise Concerns
"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question over and over, you come up with different answers," said Cicek, an associate professor in the Department of Marketing and International Business in WSU's Carson College of Business and lead author of the new publication.
"We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it's false. It's true, it's false, false, true. There were multiple cases where there were five true, five false."
AI Fluency vs. Real Understanding
The findings, published in the Rutgers Business Review, highlight the importance of exercising caution when relying on AI for critical decisions, especially those that require nuanced or complex reasoning. While generative AI can produce smooth, convincing language, it does not yet exhibit the same level of conceptual understanding.
According to Cicek, these results suggest that artificial general intelligence capable of truly "thinking" may be further away than many expect.
"Current AI tools don't understand the world the way we do; they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."
Study Design and Methods
Cicek worked with co-authors Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University.
The team used 719 hypotheses from scientific studies published in business journals since 2021. Questions like these often involve nuance, with multiple factors influencing whether a hypothesis is supported. Reducing that complexity to a simple true-or-false judgment requires careful reasoning.
The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, performance remained similar across both versions. After adjusting for random chance, which gives a 50% likelihood of a correct answer, the AI's effectiveness was only about 60% above chance in both years.
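The chance adjustment is simple arithmetic: on a true/false question a coin flip is right half the time, so accuracy is measured as improvement relative to that 50% baseline. A rough sketch (function name and parameters are illustrative, not from the paper):

```python
def chance_adjusted(accuracy, baseline=0.5):
    # Improvement over random guessing on a binary question,
    # expressed relative to the guessing baseline itself.
    return (accuracy - baseline) / baseline

# 2025 result: 80% raw accuracy is only ~60% better than a coin flip
print(round(chance_adjusted(0.80), 2))   # 0.6
# 2024 result: 76.5% raw accuracy works out to ~53% better than chance
print(round(chance_adjusted(0.765), 2))  # 0.53
```

This is why raw scores of 76.5% and 80% translate into the far less impressive "about 60% above chance" figure the researchers report.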
A Key Weakness in AI Reasoning
The results point to a fundamental limitation of large language model AI systems. Although they can generate fluent and persuasive responses, they often struggle to reason through challenging questions. This can lead to answers that sound convincing but are actually incorrect, Cicek said.
Why Experts Urge Caution With AI
Based on these findings, the researchers recommend that business leaders verify AI-generated information and approach it with skepticism. They also emphasize the need for training to better understand what AI systems can and cannot do effectively.
Although this study focused specifically on ChatGPT, Cicek noted that similar experiments with other AI tools have produced comparable results. The work also builds on earlier research urging caution around AI hype: a 2024 national survey found that consumers were less likely to purchase products when they were marketed with a focus on AI.
"Always be skeptical," he said. "I'm not against AI. I'm using it. But you need to be very careful."

