How AI Systems Fail the Human Test

Economists have a game that reveals how deeply individuals reason. Known as the 11-20 money request game, it is played between two players who each request an amount of money between 11 and 20 shekels, knowing that both will receive the amount they ask for.

But there’s a twist: if one player asks for exactly one shekel less than the other, that player earns a bonus of 20 shekels. This tests each player’s ability to think about what their opponent might do — a classic challenge of strategic reasoning.
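The payoff rule is simple enough to write down directly. What follows is a minimal Python sketch, not code from the study; the names `payoff`, `LOW`, `HIGH`, and `BONUS` are illustrative choices made here.

```python
# Illustrative constants taken from the rules described above.
LOW, HIGH = 11, 20   # allowed requests, in shekels
BONUS = 20           # paid for asking exactly one shekel less than the opponent


def payoff(mine: int, theirs: int) -> int:
    """Return what a player earns for requesting `mine` against `theirs`."""
    for request in (mine, theirs):
        if not LOW <= request <= HIGH:
            raise ValueError(f"request must be between {LOW} and {HIGH}")
    # Every player receives the requested amount; undercutting by exactly one adds the bonus.
    return mine + (BONUS if mine == theirs - 1 else 0)
```

Under this sketch, asking for 19 against an opponent who asks for 20 pays 39 shekels, while the opponent simply collects 20.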

The 11-20 game is an example of level-k reasoning in game theory, where each player tries to anticipate the other’s thought process and adjust their own choices accordingly. For example, a player using level-1 reasoning might pick 19 shekels, assuming the other will pick 20. But a level-2 thinker might ask for 18, predicting that their opponent will go for 19. This kind of thinking gets layered, creating an intricate dance of strategy and second-guessing.
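The layering can be made concrete with a short sketch. It assumes, as a modeling convention rather than anything the researchers specify, that a level-0 player naively asks for the maximum of 20, and that each higher level picks the best response to the level below. The function names are hypothetical.

```python
LOW, HIGH = 11, 20   # allowed requests, in shekels
BONUS = 20           # paid for asking exactly one shekel less than the opponent


def payoff(mine: int, theirs: int) -> int:
    """Requested amount, plus the bonus when undercutting by exactly one."""
    return mine + (BONUS if mine == theirs - 1 else 0)


def best_response(theirs: int) -> int:
    """Request that maximizes payoff against a known opponent request."""
    return max(range(LOW, HIGH + 1), key=lambda mine: payoff(mine, theirs))


def level_k_choice(k: int) -> int:
    """Choice of a level-k player, assuming level-0 asks for HIGH (an assumption)."""
    choice = HIGH
    for _ in range(k):
        # Each level best-responds to the choice of the level below it.
        choice = best_response(choice)
    return choice
```

Calling `[level_k_choice(k) for k in range(4)]` gives `[20, 19, 18, 17]`: each extra step of reasoning shaves one shekel off the request.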

Human Replacements?

In recent years, various researchers have suggested that large language models (LLMs) like ChatGPT and Claude can behave like humans in a wide range of tasks. That’s raised the possibility that LLMs could replace humans in tasks like testing opinions of new products and adverts before they are released to the human market, an approach that would be significantly cheaper than current methods.

But that raises the important question of whether LLM behavior really is similar to humans’. Now we get an answer thanks to the work of Yuan Gao and colleagues at Boston University, who have used a wide range of advanced LLMs to play the 11-20 game. They found that none of these AI systems produced results similar to human players and say that extreme caution is needed when it comes to using LLMs as surrogates for humans.

The team’s approach is straightforward. They explained the rules of the game to a range of LLMs, including several models from the ChatGPT, Claude, and Llama families. They asked each to choose a number and then explain its reasoning. And they repeated the experiment a thousand times for each LLM.

But Gao and co were not impressed with the results. Human players typically use sophisticated strategies that reflect deeper reasoning levels. For example, a common human choice might be 17, reflecting an assumption that their opponent will select a higher value like 18 or 19. But the LLMs showed a starkly different pattern: many simply chose 20 or 19, reflecting basic level-0 or level-1 reasoning.

The researchers also tried to improve the performance of LLMs with techniques like writing more suitable prompts and fine-tuning the models. GPT-4 showed more human-like responses as a result, but none of the other models did.

The behavior of LLMs was also highly inconsistent depending on irrelevant factors, such as the language they were prompted in.

Gao and co say the reason LLMs fail to reproduce human behavior is that they don’t reason like humans. Human behavior is complex, driven by emotions, biases, and varied interpretations of incentives, like the desire to beat an opponent. LLMs give their answer using patterns in language to predict the next word in a sentence, a process that is fundamentally different to human thinking.

Sobering Result

That’s likely to be a sobering result for social scientists, for whom the idea that LLMs could replace humans in certain types of experiments is tempting.

But Gao and co say: “Expecting to gain insights into human behavioral patterns through experiments on LLMs is like a psychologist interviewing a parrot to understand the mental state of its human owner.” The parrot might use similar words and phrases to its owner but manifestly without insight.

“These LLMs are human-like in appearance yet fundamentally and unpredictably different in behavior,” they say.

Social scientists: you have been warned!

Ref: Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina : arxiv.org/abs/2410.19599
