In an unexpected twist that bridges the worlds of artificial intelligence and neurology, researchers have discovered that leading AI chatbots display patterns similar to mild cognitive impairment when subjected to standard dementia screening tests. The findings raise intriguing questions about the true capabilities of AI in medical settings.
Published in The BMJ | Estimated reading time: 5 minutes
The race to integrate artificial intelligence into healthcare has been marked by bold predictions about AI replacing human doctors. However, a fascinating new study published in The BMJ’s Christmas issue reveals that even the most sophisticated AI language models may have significant cognitive limitations that mirror human cognitive decline.
Researchers put several leading AI chatbots through the Montreal Cognitive Assessment (MoCA), a standardized test widely used to detect early signs of dementia. The study included the latest versions of prominent AI models: OpenAI’s ChatGPT-4 and 4o, Anthropic’s Claude 3.5 “Sonnet”, and Alphabet’s Gemini versions 1.0 and 1.5.
The results were striking. ChatGPT-4o emerged as the top performer with a score of 26 out of 30 – just reaching the threshold considered normal for human cognitive function. ChatGPT-4 and Claude tied at 25 points, while Gemini 1.0 scored markedly lower at 16 points. Perhaps most telling, “older” versions of the chatbots tended to perform worse on the tests, mimicking age-related cognitive decline in humans.
The AI models showed consistent weaknesses in specific areas. As the study notes, “All chatbots showed poor performance in visuospatial skills and executive tasks,” struggling particularly with challenges such as connecting numbers and letters in sequence (the trail making test) and drawing clock faces. Most models excelled at naming, attention, and language comprehension, but stumbled when faced with tasks requiring visual abstraction and executive function.
These findings challenge the narrative of AI’s imminent takeover of medical diagnosis. As the researchers conclude, “Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients – artificial intelligence models presenting with cognitive impairment.”
Glossary
- Large Language Models (LLMs): Advanced AI systems trained on vast amounts of text data to understand and generate human-like language.
- Montreal Cognitive Assessment (MoCA): A standardized screening tool used by healthcare professionals to detect cognitive impairment and early signs of dementia.
- Executive Function: Mental skills that help with planning, focusing attention, remembering instructions, and handling multiple tasks successfully.
Test Your Knowledge
What was the highest score achieved by any AI model on the MoCA test?
ChatGPT-4o scored 26 out of 30, the highest among all tested models.
What is considered a normal score on the MoCA test?
A score of 26 or above is generally considered normal.
Which specific types of tasks proved most challenging for the AI chatbots?
The chatbots particularly struggled with visuospatial skills and executive tasks, such as trail making and clock drawing tests.
How did the performance of “older” versions of chatbots compare to newer ones, and what human parallel does this suggest?
Older versions of chatbots tended to perform worse on the tests, mirroring the pattern of age-related cognitive decline seen in human patients.