London, Feb 9, 2026, 17:16 (GMT)
- A UK trial revealed that people seeking symptom advice from AI chatbots didn’t make better health decisions compared to those using web searches or other standard resources
- These AI models showed strong performance on their own, but accuracy dropped once real users got involved
- Researchers and clinicians cautioned that minor tweaks in how users describe symptoms might lead to unsafe recommendations
A new study released Monday found that asking an AI chatbot about medical symptoms doesn’t improve patient decision-making compared to using a standard internet search or reputable health websites.
As more consumers turn to chatbots for health advice at home, health systems and developers are testing these tools as a digital “front door” for triage—helping patients figure out whether to manage symptoms themselves, visit a doctor, or head to the emergency room.
The researchers pointed out that surveys show a rising number of people turn to AI chatbots for health questions—about one in six American adults do so at least monthly. They also warned that high marks on medical exams don’t always reflect how these tools perform with actual users.
The randomized trial included 1,298 UK adults tackling 10 medical scenarios crafted by doctors, ranging from mild illnesses to a critical brain bleed. Participants either used a large language model — the text-generating AI behind tools like ChatGPT — or turned to their usual resources, like internet searches or the National Health Service website.
When tested solo, the models — OpenAI’s GPT-4o, Meta’s Llama 3, and Cohere’s Command R+ — correctly identified relevant conditions 94.9% of the time and selected the right “disposition,” or next step, 56.3% of the time on average, the study found. People using those same systems identified relevant conditions in less than 34.5% of cases and chose the proper next step less than 44.2% of the time, barely outperforming the control group.
Adam Mahdi, co-author and associate professor at the University of Oxford, pointed out a “huge gap” between the theory behind the technology and its real-world performance. “The knowledge may be in those bots; however, this knowledge doesn’t always translate when interacting with humans,” he said.
Mahdi cautioned that impressive benchmark results can hide flaws that only appear when AI systems interact with real users. “The gap between benchmark scores and actual performance should alert AI developers and regulators,” he said, urging more extensive testing with diverse populations before rolling out these tools in healthcare.
Rebecca Payne, a GP and lead medical practitioner on the study, warned that consumers need to approach chatbot responses carefully. “Despite all the hype, AI just isn’t ready to take on the role of the physician,” she said, noting that incorrect advice might overlook cases requiring urgent medical attention.
The study highlighted how minor tweaks in wording can drastically change responses. For instance, a user reporting “the worst headache ever” with a stiff neck and light sensitivity was advised to go to the hospital. But when the headache was described as “terrible” instead, the guidance shifted to resting in a dark room.
Researchers examined a set of conversations closely and found that mistakes frequently stemmed from both ends: people omitted crucial information or shared incorrect details, while the AI occasionally generated misleading or outright false answers. The systems also blended solid advice with weaker suggestions, forcing users to figure out what to believe.
The key question remains: can improved interfaces, clearer instructions for users, or newer models bridge the gap between controlled tests and real-world use? The team intends to run similar studies across other countries and languages to verify if the findings hold up. OpenAI, Meta, and Cohere did not respond to Reuters’ requests for comment.
The researchers noted that the study received backing from the data company Prolific, the German non-profit Dieter Schwarz Stiftung, and both the UK and U.S. governments, reflecting a broader push to evaluate whether consumer-facing AI can be safely integrated into already stretched health systems.