In a development that could reshape how hospitals deploy artificial intelligence, a free, open-source AI system has matched the performance of a leading proprietary tool in solving complex medical cases that often stump human doctors.
Harvard Medical School researchers found that Llama 3.1 405B, an open-source AI model whose code is publicly available, performed at the same level as GPT-4, the flagship closed-source model from tech giant OpenAI, according to a study published yesterday in JAMA Health Forum.
When tested on 92 diagnostically challenging clinical scenarios from The New England Journal of Medicine, the open-source challenger correctly diagnosed 70 percent of cases compared to GPT-4’s 64 percent. Even more impressively, Llama ranked the correct diagnosis as its first suggestion 41 percent of the time, slightly outperforming GPT-4’s 37 percent.
“To our knowledge, this is the first time an open-source AI model has matched the performance of GPT-4 on such challenging cases as assessed by physicians,” said senior author Arjun Manrai, assistant professor of biomedical informatics in the Blavatnik Institute at Harvard Medical School. “It really is stunning that the Llama models caught up so quickly with the leading proprietary model. Patients, care providers, and hospitals stand to gain from this competition.”
The findings may mark an inflection point in medical AI, as hospitals consider which AI systems to adopt. Open-source models like Llama offer several advantages, chief among them that they can be run locally on hospital servers, keeping sensitive patient data in-house rather than sending it to external servers operated by commercial entities.
“The open-source model is likely to be more appealing to many chief information officers, hospital administrators, and physicians since there’s something fundamentally different about data leaving the hospital for another entity, even a trusted one,” explained lead author Thomas Buckley, a doctoral student in the AI in Medicine track at Harvard Medical School’s Department of Biomedical Informatics.
For Dr. Sarah Chen, a primary care physician at Boston Medical Center who wasn’t involved in the research, the findings offer a glimpse into how AI might eventually assist in her daily practice. “I see patients with vague symptoms every day, and sometimes the correct diagnosis isn’t immediately obvious,” she said after reviewing the study. “Having an AI tool that could suggest possible diagnoses I hadn’t considered could be incredibly helpful, especially if that tool keeps data secure and can be customized for our patient population.”
Both AI systems in the study rely on similar approaches: training on vast datasets that include medical textbooks, research papers, and anonymized patient information. When presented with a new clinical scenario, they draw on the patterns learned during training to suggest possible diagnoses.
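To give a rough sense of what local deployment looks like in practice (this is an illustrative sketch, not the study's protocol), a hospital running Llama on its own hardware might expose it through an OpenAI-compatible local server such as vLLM or Ollama and query it with a case vignette. The server URL, model name, and prompt below are all assumptions for illustration.

```python
# Hedged sketch: asking a locally hosted Llama model for a differential
# diagnosis. Assumes an OpenAI-compatible server (e.g., vLLM or Ollama)
# already running on hospital hardware; URL, model name, and vignette
# are illustrative, not taken from the study.
from openai import OpenAI

# Point the client at the local endpoint; no patient data leaves the network.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

vignette = (
    "A 54-year-old presents with three weeks of night sweats, "
    "a 5 kg weight loss, and painless cervical lymphadenopathy."
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-405B-Instruct",
    messages=[
        {"role": "system",
         "content": "You are a clinical decision-support assistant. "
                    "List a ranked differential diagnosis with brief reasoning."},
        {"role": "user", "content": vignette},
    ],
    temperature=0.0,  # deterministic output is easier to audit clinically
)

print(response.choices[0].message.content)
```

Because the endpoint lives inside the hospital's own network, the vignette and the model's response never pass through a third-party service, which is precisely the data-governance point Buckley raises above.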
The researchers took care to rule out any advantage Llama might have gained from seeing some test cases during its training. They included 22 new cases published after Llama's training period ended, and the open-source model performed even better on these newer cases, correctly diagnosing 73 percent and listing the right answer as its top suggestion 45 percent of the time.
Another key advantage of open-source models is their flexibility. “This is key,” noted Buckley. “You can use local data to fine-tune these models, either in basic ways or sophisticated ways, so that they’re adapted for the needs of your own physicians, researchers, and patients.”
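As a concrete illustration of what such local adaptation could involve (a minimal sketch, assuming the Hugging Face transformers, datasets, and peft libraries; the dataset file, model size, and hyperparameters are hypothetical, not details from the study), a hospital might attach low-rank adapters (LoRA) to a Llama model and train them on de-identified local notes:

```python
# Hedged sketch: parameter-efficient fine-tuning (LoRA) on local,
# de-identified clinical text. File names and hyperparameters are
# illustrative assumptions, not the researchers' method.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# A smaller sibling of the 405B model; the full model needs multi-node hardware.
MODEL = "meta-llama/Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Wrap the frozen base model with small trainable adapter matrices, so
# only a tiny fraction of parameters is updated during fine-tuning.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Hypothetical JSONL file of de-identified case/diagnosis pairs.
data = load_dataset("json", data_files="deidentified_cases.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-local-ft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

A parameter-efficient approach of this kind matters for the scenario Buckley describes because it lets an institution adapt a model on modest hardware while the sensitive training notes never leave the local network.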
The competition between open and closed AI systems resembles earlier technology shifts in medicine, such as when open-source electronic health record systems began challenging proprietary platforms. While closed-source developers like OpenAI provide customer support and hosting, institutions that adopt open-source models must handle setup and maintenance themselves.
“As a physician, I’ve seen much of the focus on powerful large language models center around proprietary models that we can’t run locally,” said Adam Rodman, an HMS assistant professor of medicine at Beth Israel Deaconess Medical Center and co-author on the research. “Our study suggests that open-source models might be just as powerful, giving physicians and health systems much more control over how these technologies are used.”
The stakes for improving diagnostic accuracy are high. Approximately 795,000 patients in the United States die or suffer permanent disability due to diagnostic errors annually, according to a 2023 report cited in the study. Beyond the human toll, delayed or incorrect diagnoses drive up healthcare costs through unnecessary testing and treatments.
“Used wisely and incorporated responsibly in current health infrastructure, AI tools could be invaluable copilots for busy clinicians and serve as trusted diagnostic aides to enhance both the accuracy and speed of diagnosis,” Manrai said. “But it remains crucial that physicians help drive these efforts to make sure AI works for them.”
As hospitals increasingly evaluate AI tools for clinical use, the competitiveness of open-source options could lead to more affordable and customizable solutions. For patients, this might ultimately translate to more accurate diagnoses without the privacy concerns associated with sending their data to third-party companies.
With both open and closed-source systems demonstrating impressive diagnostic capabilities, the choice for healthcare institutions may increasingly hinge on factors beyond raw performance – including data privacy, customization needs, and implementation costs.