AI Experts on Different Language NLP Datasets in APAC
Natural language processing (NLP) has taken hold across every country, industry, and field of study. As language becomes a predominant form of data, researchers and data science professionals are looking for ways to harness it and extract useful insights. In the APAC region, NLP faces the added challenge of many different languages, accents, and dialects, making it important yet difficult to master. We spoke to some of our upcoming ODSC APAC 2021 speakers in the NLP focus area to learn more about the importance of NLP in the APAC region, including the need for different language NLP datasets, and its outlook for the future.
Sowmya Vajjala | Research Officer, Digital Technologies | National Research Council
For a long time, NLP research and development focused primarily on the English language. That is no longer the case. Both in research and in practical application scenarios, we now see more and more discussion of developing NLP systems for other languages, sometimes many at once. However, when starting out with a new language or a new task, we often end up in a scenario where there is no standard dataset to use off the shelf, and datasets may need to be built from scratch. Similarly, when deploying real-world NLP applications, even in English, we often won’t have a ready-made dataset that perfectly fits our problem. Strangely enough, courses and workshops on NLP do not touch on this aspect much.
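For readers wondering what building a dataset from scratch can look like in practice, below is a minimal sketch that bootstraps weak labels with a pretrained zero-shot classifier from the Hugging Face transformers library. The model name, example texts, candidate labels, and confidence threshold are illustrative assumptions, not something prescribed in the interview.

```python
# A minimal sketch of bootstrapping weak labels when no off-the-shelf dataset exists,
# using a pretrained multilingual zero-shot classifier (illustrative model choice).
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",  # multilingual NLI model (assumed choice)
)

unlabeled_texts = [
    "Ang ganda ng serbisyo, mabilis ang delivery!",      # Filipino: praise for the service
    "The app keeps crashing after the latest update.",   # English: a complaint
]
candidate_labels = ["positive feedback", "complaint", "question"]

for text in unlabeled_texts:
    result = classifier(text, candidate_labels)
    # Keep only high-confidence predictions as weak labels, to be reviewed by annotators later.
    if result["scores"][0] > 0.7:
        print(text, "->", result["labels"][0], round(result["scores"][0], 2))
```

Weak labels like these are only a starting point; a round of human review typically follows before any model is trained on them.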
Karin Verspoor | Dean, Fellow | School of Computing Technologies at RMIT University, Australasian Institute of Digital Health
Regarding the issue of different languages, generally speaking, biomedical NLP targets the language of the scientific literature and the language of documentation in electronic health records. For the former, while much of the scientific literature is in English, it definitely isn’t all, and I have been involved in efforts on automatic machine translation specifically for scientific texts, through the Workshop on Machine Translation (WMT) Biomedical task. For the latter, a key challenge is the availability of datasets and resources for working with clinical texts in different languages; clinical texts are not easy to obtain in any language. However, there are ongoing efforts to make these available. For Spanish, for instance, the Biomedical Text Mining Unit at the Barcelona Supercomputing Center has run several shared tasks on Spanish-language clinical texts, and I collaborated with a team in that context to develop a deep learning-based NLP approach for named entity recognition in Spanish clinical narratives. Another challenge is ‘translating’ complex clinical terminology into more consumer-friendly language; we have also done some early work leveraging Wikipedia for that (called WikiUMLS). Because Wikipedia has been translated into so many different languages, this opens up the opportunity to map clinical concepts into those languages.
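To make the named entity recognition step concrete, here is a rough sketch using a generic multilingual NER pipeline as a stand-in. A real clinical system would swap in a model trained on clinical narratives in the target language, extracting diseases, treatments, and similar entities rather than generic names and places; the model name below is an assumption for illustration only.

```python
# A rough sketch of named entity recognition over Spanish text, using a generic
# multilingual NER model as a stand-in for a purpose-built clinical model.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",  # generic multilingual NER (assumed)
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

texto = "El paciente fue trasladado al Hospital Clínic de Barcelona para tratamiento."
for entity in ner(texto):
    print(entity["entity_group"], "|", entity["word"], "|", round(float(entity["score"]), 3))
```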
Ralph Vincent Regalado, PhD | CEO & Developer Expert, ML | Senti.ai
Early NLP projects were often done only in English. But in the past few years, we’ve seen AI startups and companies create algorithms that can understand the local languages of their respective countries. This makes AI-based technology even more accessible to users, especially now that it is being applied to crucial, life-saving work such as disaster management.
There are also challenges in handling both formal and informal language, which can appear in the same documents. For voice-based assistants, we also have to account for varying intonations in speech. There is a lot to explore when designing algorithms that understand not only the context but also the intent of users across various local languages.
This is why we had to factor in local languages in our algorithms — because of the unique way our users communicate and interact, not just with fellow humans but also with machines. To accomplish this, Senti AI combines over 30 years of academic research with our deep industry knowledge when we create our conversational assistants. We believe that having people understand what we do and what we’re trying to achieve is the first step to getting not just the Philippines, but the APAC region up to par with the rest of the world in terms of AI.
David Tan | Product Director | Heicoders Academy
The traditional approach to NLP problems has been rule-based models, which generally yield lackluster results. While we can apply neural networks to NLP tasks, their performance pales in comparison to that of deep learning in computer vision. One of the main reasons for this disparity is the lack of large labeled text datasets: most labeled text datasets are not big enough to train deep neural networks, because these networks have a huge number of parameters and training them on small datasets causes overfitting. However, the introduction of transfer learning to NLP by Google in 2018 led to a paradigm shift. Transfer learning is a technique that reuses a model pre-trained on one problem for a new problem. This is a game-changer because we can now leverage models that were already pre-trained on huge text datasets to augment our own models.
In essence, the introduction of transfer learning to NLP enables the use of deep learning models, leading to better performance on NLP tasks. The effects are deep and wide-reaching: progress in the field has sped up considerably, and we already see the technique applied to other languages such as Chinese. Needless to say, we expect transfer learning to be applied to other languages across APAC countries.
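As a concrete illustration of this recipe, here is a minimal fine-tuning sketch that starts from a multilingual pretrained model and adapts it to a tiny, hypothetical labeled dataset. The model choice, example texts, and labels are assumptions for illustration, not part of the speaker’s description.

```python
# A minimal sketch of transfer learning for text classification: reuse a multilingual
# pretrained model and fine-tune it on a small task-specific dataset.
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"  # pretrained on 100+ languages
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny hypothetical dataset (Indonesian product reviews); replace with your own labeled text.
texts = ["Layanan ini sangat bagus", "Produk rusak saat tiba"]
labels = [1, 0]  # 1 = positive, 0 = negative


class TinyTextDataset(Dataset):
    """Wraps tokenized texts and labels so the Trainer can iterate over them."""

    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mbert-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=TinyTextDataset(texts, labels),
)
trainer.train()  # all pretrained weights are reused; only a short fine-tuning pass needs task data
```

Because the heavy lifting was already done during pre-training, a comparatively small labeled dataset is enough to adapt the model, which is exactly why this approach sidesteps the overfitting problem described above.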
How to learn more about NLP and different language NLP datasets
Even seasoned NLP experts may not have the knowledge required to work with the different language NLP datasets commonly seen in the APAC region, such as data in Chinese, Japanese, or Indian languages. By attending the ODSC APAC 2021 virtual conference, you can gain the skills needed to work with these languages, as well as with novel NLP techniques like transfer learning. Highlighted sessions in the NLP focus area, several of which look into different language NLP datasets, include:
- Fairness in Natural Language Processing: Tim Baldwin, PhD | Director, VP, Laureate Professor | ARC Centre for Cognitive Computing in Medical Technologies, ACL, Uni. Melbourne
- Uncover Hidden Business Insights from Unstructured Data: Dr. Lau Cher Han | CEO, Founder | LEAD, CoronaTracker
- NLP in Ecommerce: Mathangi Sri | Head of Data — GoFood | Gojek
- Data Science Supporting Clinical Decision Making: What, Why, How?: Professor Karin Verspoor | Dean, Fellow | School of Computing Technologies at RMIT University, Australasian Institute of Digital Health
- Model, Task and Data Engineering for NLP: Shafiq Rayhan Joty, PhD | Assistant Professor, Senior Manager | NTU Natural Language Processing Group at Nanyang Technological University, Salesforce AI
- On Summarization Systems: Dr. Sriparna Saha | Group Member, Associate Professor | Department of Computer Science and Engineering, AI-NLP-ML Research Lab, IIT Patna
- Finding Rare Events in Text: Debanjana Banerjee | Senior Data Scientist | Walmart Labs
- How to do NLP When You Don’t Have a Labeled Dataset?: Sowmya Vajjala, PhD | Research Officer, Digital Technologies | National Research Council