Techniques to Overcome Data Scarcity
Editor’s note: Attend ODSC East 2019 this April 30 to May 3 in Boston and check out Parinaz’s talk, “Data Efficiency Through Transfer Learning”!
Supervised machine learning models are being used to successfully solve a whole range of business challenges. However, these models are data-hungry, and their performance relies heavily on the amount of training data available. In many cases, it’s difficult to collect training datasets that are large enough.
In our work with growth-stage startups in different verticals, we have encountered this issue several times.
In this post, we will explore two methods we have used successfully to overcome labeled data scarcity: transfer learning and data generation. For each, we will look at when to use them and the challenges you might face.
Transfer Learning
Transfer learning is a framework for leveraging existing, relevant data or models when building a new machine learning model. It allows a model to make predictions for a new domain or task (known as the target domain) using knowledge learned from another dataset or from existing machine learning models (the source domain). Transfer learning should be considered when you do not have enough target training data and the source and target domains have some similarities but are not identical.
Naively aggregating models or datasets does not always work, though! If the existing datasets are very different from the target data, the new learner can be hurt rather than helped by them, a problem known as negative transfer.
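To make this concrete, here is a minimal, hypothetical sketch of one simple transfer strategy for the cross-customer setting discussed below: pre-train a classifier on pooled data from existing customers, then continue training it on the new customer’s small dataset so it starts from useful weights instead of from scratch. The feature matrices here are random stand-ins, and the pattern is only an illustration, not the exact method used in the engagements described in this post.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Stand-in data: pooled records from existing customers (the source domain)
# and a much smaller set from a newly onboarded customer (the target domain).
X_source, y_source = rng.normal(size=(5000, 20)), rng.integers(0, 2, size=5000)
X_target, y_target = rng.normal(size=(200, 20)), rng.integers(0, 2, size=200)

clf = SGDClassifier(random_state=0)

# Step 1 (source): learn general structure from the pooled customer data.
clf.partial_fit(X_source, y_source, classes=np.array([0, 1]))

# Step 2 (target): fine-tune on the new customer's scarce data, starting
# from the source-trained weights rather than from a random initialization.
clf.partial_fit(X_target, y_target)

print("Accuracy on the target customer's data:", clf.score(X_target, y_target))
```

With random data this won’t show a real benefit, of course; the point is the two-stage fit. The caveat above still applies: if the pooled customers look nothing like the new one, the pre-training can hurt rather than help.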
As is common with SaaS (Software as a Service) companies, the companies we work with amass numerous similar, yet separate, datasets from each of their client companies.
For example, working with Cority, we applied transfer learning techniques to improve the accuracy of the model that predicts when an employee might be injured, with the aim of preventing occupational accidents. Since these accidents are rare, it is unlikely that any single customer would collect enough training data, so aggregation from multiple sources was a useful solution that outperformed the existing approaches.
[Related article: Machine Learning Guide: 20 Free ODSC Resources to Learn Machine Learning]
Another common application of transfer learning is training models on cross-customer datasets to overcome the cold-start problem. This is an issue SaaS companies often face when onboarding new customers to their machine learning products: until the new customer has collected enough data to achieve good model performance, which could take several months, it’s hard to provide value. At both Bluecore and WorkFusion, transfer learning helped solve this problem.
Privacy Challenges
This approach required extra care to protect each company’s underlying sensitive information. Simply aggregating the datasets would have risked exposing the model to adversarial reverse engineering and privacy and security threats. Even the possibility was enough to make customers question the benefits of providing their data to an aggregate model, so privacy guarantees were crucial to getting this off the ground.
With both Bluecore and WorkFusion, the teams used differential privacy techniques to build private machine learning models. Differential privacy is a probabilistic framework for measuring how much the output of an algorithm that computes an answer from data reveals about any individual record in that data. The main principle for preserving privacy is to introduce randomness into the computation so that the final answer does not depend strongly on any individual data point. For machine learning models, this means the learned parameters (weights) should not depend heavily on the presence or absence of any single user’s data in the training set.
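As a toy illustration of that principle (and not the private training procedure actually used with Bluecore and WorkFusion), the classic Laplace mechanism releases a simple statistic with calibrated noise, so that the presence or absence of any one individual barely changes the answer:

```python
import numpy as np

def private_count(values, threshold, epsilon, rng=None):
    """Release a noisy count of values above `threshold` (Laplace mechanism).

    Adding or removing one individual changes the true count by at most 1
    (the query's sensitivity), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this particular query.
    """
    if rng is None:
        rng = np.random.default_rng()
    true_count = int(np.sum(np.asarray(values) > threshold))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records: the released answer is approximately right, but it
# cannot be used to pin down whether any particular record was included.
salaries = [52_000, 61_000, 75_000, 98_000, 120_000]
print(private_count(salaries, threshold=70_000, epsilon=0.5))
```

Private model training builds on the same idea; differentially private SGD, for example, clips each example’s contribution to the gradient and adds noise to the aggregated update, so the final weights do not depend heavily on any single training record.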
[Related article: The 2019 Data Science Dictionary — Key Terms You Need to Know]
Data Generation
Transfer learning works well when you have other datasets you can use to infer knowledge, but what happens when you have no data at all? This is where data generation can play a role. It is used when no data is available, or when you need to create more data than you could amass even through aggregation.
In this case, the small amount of data that does exist is manipulated to create variations that the model can be trained on. For example, many training images of a cat can be generated by flipping, cropping, or downsizing a single image of a cat.
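A minimal sketch of that kind of augmentation, using Pillow and a hypothetical cat.jpg, might look like the following; each variant inherits the original image’s label, so one labeled example becomes several.

```python
from PIL import Image, ImageOps

def augment(path):
    """Yield simple variations of one image: flip, crop, and downsize."""
    img = Image.open(path)
    w, h = img.size
    yield ImageOps.mirror(img)                                    # horizontal flip
    yield img.crop((w // 10, h // 10, w - w // 10, h - h // 10))  # light center crop
    yield img.resize((w // 2, h // 2))                            # downsized copy

# Every variant keeps the original label ("cat"), multiplying the training set.
for i, variant in enumerate(augment("cat.jpg")):
    variant.save(f"cat_augmented_{i}.jpg")
```

In practice, libraries such as torchvision or Keras provide richer, randomized versions of these transforms.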
Data generation was also used to solve the cold-start problem for WorkFusion. The company provides software that uses machine learning to extract information automatically from documents like invoices. Working with WorkFusion, we used data generation to reduce the number of correctly labeled invoices needed to train a model from around 5000 to 5–10.
To do this, we took a single labeled invoice, substituted text, and moved it around the document, using a simple templating engine to generate a large amount of labeled training data with sufficient variation.
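Below is a toy version of that templating idea, with made-up field names and vendors rather than WorkFusion’s actual engine: fill a document template with randomized values and keep those values as ground-truth labels, so every generated invoice comes pre-labeled.

```python
import random
from string import Template

# A toy invoice layout; the $-placeholders are the fields a model should extract.
TEMPLATE = Template(
    "INVOICE $number\n"
    "Vendor: $vendor\n"
    "Date: $date\n"
    "Total due: $total\n"
)

VENDORS = ["Acme Corp", "Globex Ltd", "Initech"]

def generate_invoice(rng):
    """Render the template with randomized field values.

    Returns the document text together with the ground-truth labels, so each
    generated invoice is already labeled for training.
    """
    fields = {
        "number": f"INV-{rng.randint(1000, 9999)}",
        "vendor": rng.choice(VENDORS),
        "date": f"2019-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        "total": f"{rng.uniform(50, 5000):.2f} USD",
    }
    return TEMPLATE.substitute(fields), fields

rng = random.Random(0)
text, labels = generate_invoice(rng)
print(text)
print(labels)
```

A real generator would also vary the layout itself (field order, spacing, extra line items), not just the field values, to better mimic the diversity of real documents.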
When using data generation, you’ll have to consider how to introduce enough variation into the generated examples to avoid overfitting, and whether the approach can reach the performance needed to replace manual labeling.
A lack of quality labeled data is one of the largest challenges facing data science teams, but by using techniques such as transfer learning and data generation it is possible to overcome data scarcity and reach your goals.
Editor’s note: Interested in learning more about transfer learning in person? Attend ODSC East 2019 this April 30 to May 3 in Boston and check out Parinaz’s talk, “Data Efficiency Through Transfer Learning”!
— — — — — — — —
Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday.