OpenAI has announced the launch of HealthBench, a benchmark to evaluate AI models in healthcare using real-world applicability and physician judgment.
“The 5,000 conversations in HealthBench simulate interactions between AI models and individual users or clinicians. The task for a model is to provide the best possible response to the user’s last message,” the company said in a statement.
OpenAI built the benchmark with 262 physicians in 60 countries, who are proficient in 49 languages and have training in 26 medical specialties.
HealthBench includes 5,000 health conversations, each with a physician-created rubric to evaluate model responses. The rubric evaluation includes 48,562 unique rubric criteria.
The company said the conversations were created through “synthetic generation and human adversarial testing,” are multilingual, and span various medical specialties and contexts.
“Every model response is graded against a set of physician-written rubric criteria specific to that conversation,” the company said.
“Each criterion outlines what an ideal response should include or avoid (e.g., a specific fact to include or unnecessarily technical jargon to avoid). Each criterion has a corresponding point value, weighted to match the physician’s judgment of that criterion’s importance.”
The model’s responses are evaluated using GPT-4.1 to determine whether each rubric criterion is met. An overall score, based on the criteria met, is then compared to the maximum possible score.
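The scoring described above can be sketched in a few lines. This is a hypothetical illustration, not OpenAI's actual implementation; the `Criterion` class, field names, and normalization choice (dividing earned points by the sum of positive point values, with negatives acting as penalties) are assumptions for the sake of the example.

```python
# Hypothetical sketch of rubric-based scoring in the style HealthBench
# describes: each physician-written criterion carries a point value, a
# grader model judges whether it is met, and the total is compared to
# the maximum possible score.
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: float  # positive = something to include, negative = something to avoid
    met: bool      # grader model's judgment (e.g., from GPT-4.1)


def rubric_score(criteria: list[Criterion]) -> float:
    """Sum points for met criteria, normalized by the maximum achievable score."""
    earned = sum(c.points for c in criteria if c.met)
    max_score = sum(c.points for c in criteria if c.points > 0)
    if max_score == 0:
        return 0.0
    return max(0.0, earned) / max_score  # clamp so penalties can't go below zero


# Example conversation rubric: two positive criteria and one penalty criterion.
crits = [
    Criterion("Advises seeking emergency care for red-flag symptoms", 10, met=True),
    Criterion("Avoids unnecessarily technical jargon", 5, met=True),
    Criterion("States an incorrect dosage", -8, met=False),
]
print(rubric_score(crits))  # 1.0: all positive criteria met, no penalties triggered
```

The weighting lets physicians make critical criteria (such as an emergency referral) dominate the score while minor stylistic criteria contribute less.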
HealthBench is split into seven themes: expertise-tailored communication, response depth, emergency referrals, health data tasks, global health, responding under uncertainty and context seeking.
“Evaluations like HealthBench are part of our ongoing efforts to understand model behavior in high-impact settings and help ensure progress is directed toward real-world benefit,” the company said.
“Our findings show that large language models have improved significantly over time and already outperform experts in writing responses to examples tested in our benchmark. Yet even the most advanced systems still have substantial room for improvement, particularly in seeking necessary context for underspecified queries and worst-case reliability. We look forward to sharing results for future models.”
The tools are publicly available on GitHub.
THE LARGER TREND
OpenAI’s CEO, Sam Altman, was part of President Donald Trump’s press conference earlier this year announcing the launch of Project Stargate. The $500 billion project would build the physical and virtual infrastructure needed to power AI development, including AI to improve health outcomes.
The partners, which also included Oracle’s chief technology officer, Larry Ellison, and SoftBank’s CEO, Masayoshi Son, touted the project as a game changer for healthcare.
Altman said during the press conference that he is thrilled to be part of Stargate and anticipates that diseases will be cured at an unprecedented rate.
Ellison added that a cancer vaccine is one of the “most exciting” things the group is working on, using the tools that Altman and Son are providing.
Earlier this month, the Financial Times reported that Project Stargate was considering international expansion, with the UK as its top choice; Germany and France are also attractive candidates.
However, this week, Bloomberg reported that the project is facing delays due to the tariffs imposed by President Trump and economic uncertainty.
Due to economic uncertainty and growing market volatility, banks and institutional investors are wary of investing in Stargate, especially as data center build-out costs are uncertain due to U.S. tariffs, particularly on chips, server racks and cooling systems.
Additionally, SoftBank, which pledged an immediate $100 billion investment in the project with the goal of growing it to $500 billion within the next four years, has yet to develop a financing template or start discussions with potential backers, according to Bloomberg.