What Can Go Wrong When Creating Data to Enable Multilingual AI
Editor’s note: Olga is a speaker for ODSC East 2022! Be sure to check out her talk, “Creating Data to Enable Multilingual AI: What Can Go Wrong and Ways to Mitigate It,” there!
Artificial intelligence (AI), and conversational AI in particular as one of its fastest-growing subdomains, has a broad range of use cases in customer engagement, operations, and supply chain management across the globe.
The global reach of business has created a need to collect and develop data for AI applications that both understand (natural language understanding) and generate (natural language generation) text and speech in multiple languages. Large volumes of varied, high-quality multilingual datasets are essential for the models behind such applications to perform well. But we need to be mindful of everything that can go wrong while developing these massive multilingual datasets. Over the last few years, AI data experts and data scientists have run into a variety of issues they had not anticipated before embarking on global AI application development.
So, what can go wrong when creating data to enable multilingual AI? Problems that we have come across include:
- Sociolinguistics: both language and acoustic model training datasets can end up covering only certain in-country demographics (gender, ethnicity, age group, education level), which will significantly limit customer engagement with your conversational AI product or voice search engine. When developing for local audiences, you can also introduce or carry over cultural phenomena from your English dataset that are not relevant to that geography. This is, in my opinion, different from data that contributes to potentially harmful (unless intended) social biases in AI models, which I’ll be discussing below.
- Engineering: developing datasets for multiple locales raises several challenges. As with software internationalization, localization issues often arise if they are not addressed in advance, including code-switching, multiple glyph sets within a single language (think, for example, of the four writing systems in Japanese), challenges specific to bi-directional languages, and the handling of user errors such as repeated words, typos, and homophones.
- There are also requirements specific to the spoken modality (vs. the written modality) that will hurt the language model's utility if they are not represented properly: the user’s speech may simply not be understood (it is on the language model to understand the speaker, while the acoustic model supports the phonetic mapping). Such conventions include alphanumeric values spelled out as words, filler words, special characters and punctuation, initialisms vs. acronyms (W.H.O. vs. NASA, with ASAP being an edge case), abbreviations, etc.
- Bias and Inclusion: your models may produce gender, racial, age, and other social biases — driven both by algorithms and underlying data — which is a phenomenon recently getting a lot of attention due to the issues it has caused around diversity, equity, and inclusion across multiple industries.
Luckily, there are several techniques in both managing your data (pruning, augmentation, assigning weights and other state-of-the-art methods we’ll talk about) and tweaking your algorithms that can help you control and reduce this bias.
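To make one of these data-side techniques concrete, here is a minimal sketch of assigning weights so that under-represented groups are not drowned out during training. It is an illustration under assumptions, not necessarily the approach covered in the talk: the function name `inverse_frequency_weights` and the locale labels are hypothetical, and the scheme shown is plain inverse-frequency weighting.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Return one weight per sample so each group has the same total weight.

    `groups` is a list of group labels (hypothetical here: one locale or
    demographic label per training utterance).
    """
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    # Each sample in a group of size c gets total / (n_groups * c),
    # so rarer groups receive larger per-sample weights.
    return [total / (n_groups * counts[g]) for g in groups]


# Hypothetical labels: three utterances from one locale, one from another.
sample_groups = ["en-US", "en-US", "en-US", "hi-IN"]
weights = inverse_frequency_weights(sample_groups)
print(list(zip(sample_groups, weights)))
```

With these weights, every group contributes the same total weight and the weights still sum to the number of samples. Whether that is the right target depends on the product and its audience, so treat the weighting scheme itself as a design decision.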
The good news is that in my upcoming session at ODSC East, “Creating Data to Enable Multilingual AI: What Can Go Wrong and Ways to Mitigate It,” I will share ways to mitigate these problems, based on natural language processing (NLP) and other engineering solutions as well as the data-creator training approaches that Welocalize has developed while collaborating with its clients’ data scientists.