A new study from the University of Birmingham suggests that the key to improving AI language systems lies not in gathering ever-larger amounts of data, but in better representing the diversity of human language. The research, published January 13 in Frontiers in Artificial Intelligence, proposes that understanding how language varies across social groups and contexts is crucial for developing AI that works well for everyone.
“When prompted, generative AIs such as ChatGPT may be more likely to produce negative portrayals about certain ethnicities and genders,” said lead author Professor Jack Grieve. “If the training corpus contains relatively frequent expression of harmful or inaccurate ideas about certain social groups, LLMs will inevitably reproduce those biases resulting in potentially racist or sexist content.”
Quality Over Quantity
While companies have focused on training AI systems on increasingly massive amounts of text data, the researchers argue that what matters more is ensuring this data properly represents the full range of language varieties used across society – from regional dialects to professional jargon to generational differences in speech.
The study suggests that carefully selecting training data to reflect linguistic diversity could help address several major challenges with current AI systems, including social bias, misinformation, and the ability to adapt to specialized domains like medicine or law.
Understanding Language Variation
The researchers identify three key dimensions of language variation that need to be considered: dialects (variation across social groups), registers (variation across contexts and purposes), and time periods (variation across eras). Each of these dimensions contributes to how language is used and understood in different parts of society.
For example, medical professionals use different language when talking to colleagues versus patients, while people from different regions might use different terms for the same concept. These variations, the researchers argue, need to be properly represented in AI training data to create systems that work effectively for all users.
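These three dimensions can be pictured as metadata attached to each training document. A minimal sketch in Python – the class name, field names, and example values are illustrative assumptions, not anything specified in the study:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DocumentVariety:
    """Sociolinguistic metadata for one training document.

    Captures the three dimensions of variation the study identifies:
    dialect (who is writing), register (in what context), and period (when).
    """
    dialect: str   # e.g. a regional or social variety
    register: str  # e.g. clinical notes vs. a patient leaflet
    period: str    # e.g. a decade or year range

# The same topic can surface as quite different varieties: a doctor
# writing for colleagues vs. the same doctor writing for patients.
colleague_note = DocumentVariety("British English", "clinical notes", "2020s")
patient_leaflet = DocumentVariety("British English", "patient leaflet", "2020s")

# These two documents differ in register alone.
print(colleague_note != patient_leaflet)  # prints True
```

Tagging a corpus this way is what would let developers measure, and then correct, how skewed their training data is along each dimension.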
A Framework for Fairer AI
The researchers draw on sociolinguistics – the study of how language varies across social groups and contexts – to propose a new framework for developing language AI systems. This approach emphasizes the importance of understanding and representing different “varieties of language” in AI training data.
Rather than simply gathering massive amounts of text from the internet, AI developers could deliberately include balanced samples of language from different social groups, professional contexts, and time periods. This could help ensure AI systems work equally well for everyone, regardless of their background or how they communicate.
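Deliberate balancing of this kind amounts to stratified sampling over variety labels. The sketch below is a hypothetical illustration of that idea, not a method from the paper; the grouping key, quota, and toy corpus are all assumptions:

```python
import random
from collections import defaultdict

def balanced_sample(documents, key, per_group, seed=0):
    """Draw at most `per_group` documents from each language variety.

    `documents` is a list of (metadata, text) pairs; `key` extracts the
    variety label (e.g. dialect or register) to balance over. Groups with
    fewer documents than the quota are kept whole rather than upsampled.
    """
    groups = defaultdict(list)
    for meta, text in documents:
        groups[key(meta)].append((meta, text))
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        rng.shuffle(members)          # avoid always taking the same documents
        sample.extend(members[:per_group])
    return sample

# Toy corpus heavily skewed toward one register, as web scrapes often are:
docs = [({"register": "news"}, f"news article {i}") for i in range(8)]
docs += [({"register": "medical"}, "clinical note"),
         ({"register": "legal"}, "court opinion")]

balanced = balanced_sample(docs, key=lambda m: m["register"], per_group=2)
# At most two documents per register survive, so the eight news articles
# no longer drown out the medical and legal material.
```

Real corpus curation is far more involved, but the contrast is the point: the cap on each group replaces "more data" with "more evenly represented data."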
Practical Applications
The implications of this work extend beyond just improving AI performance. The research suggests that this approach could help create AI systems that better align with society’s values and expectations, while also addressing practical challenges in fields like customer service, healthcare, and education.
For instance, an AI system trained with greater linguistic diversity would be better equipped to understand and respond to users from different cultural backgrounds or professional contexts. This could lead to more effective and equitable AI-powered services across society.
Looking Forward
“Understanding the structure of society, and how this structure is reflected in patterns of language use, is critical to maximizing the benefits of LLMs for the societies in which they are increasingly being embedded,” explained Professor Grieve.
The study argues that engaging with sociolinguistic research will be crucial as language models spread into education, healthcare, business, and government. By better representing linguistic diversity in their training, these systems could become not only more accurate but also more ethical and socially aware.
The researchers emphasize that this approach provides a theoretical foundation for addressing many current challenges in AI development, offering a path toward systems that better serve the diverse communities in which they operate.