The current success of voice-related technologies is largely driven by the possibility they offer users of performing simple tasks without doing anything but speaking out loud (e.g., “Alexa, show the weather for this afternoon”, “Hey Google, turn off the lights”), tasks which these systems accomplish with great accuracy.
In fact, in 2017, Google announced [Ref1] that its general-purpose speech-to-text technology had reached a 4.9% word error rate, meaning roughly 19 out of 20 words are correctly recognised, down from the 8.5% announced in July 2016 and a big improvement over the 23% of 2013. Some speech-to-text systems do even better in specific usage settings, with a word error rate of only 3% [Ref2].
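For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and the system's output, divided by the length of the reference. The example sentences are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            insertion = d[i][j - 1] + 1
            deletion = d[i - 1][j] + 1
            d[i][j] = min(substitution, insertion, deletion)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("show the weather for this afternoon",
          "show a weather for this afternoon"))  # 1 error in 6 words ≈ 0.167
```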
After several years of these technologies being on the mass market, users have started to notice that they do not work with the same level of precision for everyone. Research carried out by the Washington Post [Ref3] on “the smart speaker’s accent imbalance” showed notable disparities in how users are understood across the United States.
Results showed that people who spoke Spanish as their first language (L1) were understood 6% less often than people born and raised around Washington or California, where the tech giants are based. The same research also showed that, when phrases are limited to utterances related to entertainment controls, the accent gap is even more evident: Google Home showed a 12% gap between Eastern Americans (92% accuracy) and English speakers whose L1 is Spanish (80% accuracy), while Amazon Echo didn’t fare much better, with a 9% gap between Southern Americans (91% accuracy) and English speakers whose L1 is Chinese (82% accuracy).
This indicates that current voice-enabled systems are unable to recognise different accents with the same precision (e.g., the accent of an English speaker whose L1 is Spanish or Chinese vs. an American speaker of broadcast English).
This phenomenon, we should clarify, is not limited to a single language such as English, which alone has 160 distinct dialects spoken around the world. Today, speech-to-text is integrated into a variety of devices, including mobile phones, tablets, laptops, wearables and cars, and is available in a wide range of languages. To a greater or lesser extent, the accent gap is present in all of them.
Only seven languages (English, French, German, Italian, Japanese, Portuguese and Spanish) are covered by the voice assistants of the three main technology companies (Google, Amazon and Apple), of which only English, French and Spanish offer some regional localisation. This is far below what Google offers with its speech-to-text API dictation service, and further still from the 185 languages identified in ISO 639-1. And all this is before we even start considering the accent gap within each localisation.
Causes of the accent gap
To understand where the accent gap comes from, we have to focus on how AI models behind voice-enabled systems (e.g., Amazon Echo, Google Nest, Apple HomePod, etc.) are trained.
Generally speaking, a speech-to-text system is trained to convert speech into text by using audio samples collected from a group of subjects. These samples are manually transcribed and ‘fed’ to models so they can learn to recognise patterns in the words and sounds (an acoustic model). Furthermore, the sequence of words that makes up each sentence is used to train a model that helps predict the word the user is expected to say next (a language model). The sound of the word and the likelihood of the word appearing in the sentence are therefore combined to convert speech into text. What does this imply? The models used by a speech-to-text system will reflect the specific data used for training, just as a child raised in New York will not learn to understand and speak with a Texan accent.
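To make the interplay between the two models concrete, here is a deliberately simplified sketch (not any vendor's actual implementation) of how a decoder can combine an acoustic score with a language-model score. All vocabulary and probabilities below are invented for illustration.

```python
import math

# Hypothetical acoustic scores: how much each candidate word sounds
# like the audio, P(audio | word).
acoustic_scores = {"whether": 0.40, "weather": 0.38, "wetter": 0.22}

# Hypothetical bigram language model: P(word | previous word).
language_model = {
    ("the", "weather"): 0.30,
    ("the", "whether"): 0.01,
    ("the", "wetter"): 0.02,
}

def best_word(previous_word: str, lm_weight: float = 1.0) -> str:
    """Pick the word maximising log P(acoustic) + lm_weight * log P(LM)."""
    def score(word: str) -> float:
        p_ac = acoustic_scores[word]
        p_lm = language_model.get((previous_word, word), 1e-6)
        return math.log(p_ac) + lm_weight * math.log(p_lm)
    return max(acoustic_scores, key=score)

# "whether" sounds marginally more like the audio, but the language
# model knows "the weather" is far more probable than "the whether".
print(best_word("the"))  # -> "weather"
```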
In this sense, if most of the audio samples used to train a speech-to-text model came from white male native English speakers from a particular region, it’ll certainly be more accurate for this segment of the population than for others that have not been properly represented in the dataset. Data diversity is therefore crucial to reduce the accent gap.
Besides accent, a poorly balanced dataset can result in other biases [Ref4] that also jeopardise the system’s accuracy and worsen the accent gap. Consider a woman who asks her bank’s voice assistant to display her account balance. If the AI model behind the assistant has been trained mostly on audio samples from men, the result will be less accurate for women, since the features of their voices are different. If the woman’s first language isn’t English, the accuracy will decrease even further. The same issue occurs with children’s speech, whose voice features differ from those of adults.
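As an illustration of how such imbalances can be surfaced, the sketch below computes the mean WER separately per demographic group on an evaluation set, reusing the wer() function from the earlier example. The group labels, transcripts and system outputs are entirely hypothetical.

```python
from collections import defaultdict

# (group label, reference transcript, system hypothesis) -- invented data.
eval_set = [
    ("L1-English, male",   "show my account balance", "show my account balance"),
    ("L1-Spanish, female", "show my account balance", "show me a cow balance"),
    ("child",              "turn off the lights",     "turn of the light"),
]

# Accumulate (sum of WER, utterance count) per group.
totals = defaultdict(lambda: [0.0, 0])
for group, reference, hypothesis in eval_set:
    totals[group][0] += wer(reference, hypothesis)
    totals[group][1] += 1

for group, (wer_sum, count) in totals.items():
    print(f"{group}: mean WER = {wer_sum / count:.2f}")
```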
Little is said about the impact of the accent gap on the sales or adoption of voice-enabled solutions and devices. Researchers at University College Dublin [Ref5] suggest that native English speakers are more satisfied with voice-enabled systems than non-native speakers. Considering that native speakers do not have to alter their vocabulary to be understood, nor stay constantly aware of how long it takes them to formulate a command before the system resets or interrupts them, this result is no surprise.
Solutions aiming to reduce the accent gap
As explained throughout this article, the accent gap is caused primarily by a lack of diversity within the datasets used for training AI models. Therefore, acquiring large amounts of training data from different demographics is critical for improving speech recognition.
Techniques for achieving this goal are diverse but not equally effective. For instance, a company could hire people from multiple demographic backgrounds to record audio samples for training purposes. However, this approach is expensive, slow and ill-suited to a fast-growing market. Moreover, it is unlikely that this approach, although privacy-friendly, would collect enough data to achieve any real improvement.
Developers and researchers could turn to crowdsourcing voices (e.g., Mozilla’s crowdsourcing initiative, “Common Voice”). However, to the best of our knowledge, there aren’t many projects of this nature large enough to shrink an accent gap that affects so many users around the world.
In this light, there are several solutions, some of them already in the market, that aim at reducing the accent gap.
a) Global English. Speechmatics, a technology company specialising in speech recognition software, has been working toward the development of ‘Global English’ [Ref6], a single English language pack that supports major English accents and dialect variations. Global English follows an accent-independent approach that improves accuracy while, at the same time, reducing complexity and time to market.
Speechmatics’ improvements to speech recognition revolve around several technologies and techniques, particularly modern neural network architectures (i.e., deep neural networks featuring multiple layers between input and output) and proprietary language training techniques.
b) Dragon. Nuance [Ref7], an American company specialising in voice recognition and artificial intelligence, also exemplifies how the industry intends to reduce the accent gap. The company’s newest versions of Dragon, a speech-to-text software suite, use a machine learning model based on neural networks that automatically switches between several dialect models depending on the user’s accent.
The “Voice Training” [Ref8] feature allows the solution to learn how the user speaks by asking them to read aloud one of the available Voice Training stories. The features Voice Training collects include the user’s accent, intonation and tone.
c) Applause. Applause [Ref9] is an American company that specialises in crowdtesting. It provides its customers with a full suite of testing and feedback capabilities that many industries implementing voice-based technologies, particularly the automotive industry, are utilising. Among other services, it offers testing with native-language speakers from around the world to validate utterances and dialogues, and allows for direct testing by vetted in-market testers under real-world conditions.
d) COMPRISE. COMPRISE [Ref10] is a project funded by the Horizon 2020 Programme that aims to create a cost-effective, multilingual, privacy-driven voice-enabled service. Using a novel approach, once on the market, COMPRISE is expected to adapt models locally on the user’s device: user-independent models are trained in the cloud on anonymised data (the user’s speech is automatically anonymised before being sent to the cloud), and these models are then personalised to each user by running additional computations on the user’s device. This will result in improved accuracy of Speech-to-Text, Spoken Language Understanding and Dialogue Management for all users, especially “hard-to-understand” users (e.g., those with non-native or regional accents), and, as a consequence, an improvement in user experience and inclusiveness.
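The split between cloud training and on-device personalisation described above can be sketched as follows. This is a conceptual illustration only, with entirely hypothetical class and function names, not COMPRISE’s actual code.

```python
from dataclasses import dataclass

@dataclass
class AcousticModel:
    # Toy stand-in for real model weights.
    weights: list[float]

    def adapt(self, local_samples: list[list[float]], lr: float = 0.1) -> None:
        """Nudge weights toward the user's local data (runs on-device only)."""
        for sample in local_samples:
            for i, feature in enumerate(sample):
                self.weights[i] += lr * (feature - self.weights[i])

def anonymise(utterance: str) -> str:
    """Placeholder for speech anonymisation before any cloud upload."""
    return f"<anonymised: {len(utterance)} chars>"

# Cloud side: a user-independent model trained on anonymised data
# pooled from many users.
generic_model = AcousticModel(weights=[0.0, 0.0, 0.0])

# Device side: personalise with the user's own (never-uploaded) features.
user_features = [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5]]
generic_model.adapt(user_features)
print(generic_model.weights)

# Only anonymised material ever goes back to the cloud.
print(anonymise("show my account balance"))
```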
Authors: Alvaro Moreton and Ariadna Jaramillo
References:
[Ref1] Protalinski E. “Google’s speech recognition technology now has a 4.9% word error rate”. May 2017. Available: https://venturebeat.com/2017/05/17/googles-speech-recognition-technology-now-has-a-4-9-word-error-rate/
[Ref2] Wiggers K. “Google AI technique reduces speech recognition errors by 29%”. February 2019. Available: https://venturebeat.com/2019/02/21/google-ai-technique-reduces-speech-recognition-errors-by-29/
[Ref3] Harwell D. “The Accent Gap”. July 2018. Available: https://www.washingtonpost.com/graphics/2018/business/alexa-does-not-understand-your-accent/
[Ref4] Tatman R. “How well do Google and Microsoft and recognize speech across dialect, gender and race?” August 2017. Available: https://makingnoiseandhearingthings.com/2017/08/29/how-well-do-google-and-microsoft-and-recognize-speech-across-dialect-gender-and-race/
[Ref5] Wiggers K. “Research suggests ways voice assistants could accommodate non-native English speakers”. June 2020. Available: https://venturebeat.com/2020/06/17/research-shows-non-native-english-speakers-struggle-with-voice-assistants/
[Ref6] Speechmatics. “Global English”. Available: https://www.speechmatics.com/product/global-english/
[Ref7] Nuance. “Nuance”. Available: https://www.nuance.com/index.html
[Ref8] Nuance. “Voice Training”. Available: https://www.nuance.com/products/help/dragon/dragon-for-mac6/
[Ref9] Applause. “Voice Testing”. Available: https://www.applause.com/voice-testing
[Ref10] Vincent E. “Cost Effective Speech-to-Text with Weakly and Semi Supervised Training”. December 2020. Available: https://www.compriseh2020.eu/cost-effective-speech-to-text-with-weakly-and-semi-supervised-training/