Automated Data Labeling for Text Classification

For machine learning (ML) practitioners, one of the most common and important tasks is full-text classification (FTC), a technique in which you assign a set of categories or tags to an entire row of text. How well this apparently simple task is done has a direct impact on how we build and run apps on top of social media, news, blog posts, and online forums, to name a few of the best-known use cases.

According to recent estimates, humanity generates multiple quintillions of bytes daily (that’s right, a number with eighteen 0’s!). Until recently, using distributed crowds of human laborers was the only approach to understanding, organizing, and filtering this kind of data at scale. But is there a more efficient way for ML practitioners to enhance user experience and make smart decisions when fast and accurate FTC is a must for the project?

Data-Centric Approach for Text Classification

At the most basic level, AI = Code + Data, and much of ML practice has historically been built around model development: iterating on the architecture, training procedures, feature engineering, and so on. In this scenario, we treat the data as a fixed component and focus only on the model to improve performance. Here, initial data labeling and model accuracy iterations are traditionally manual tasks.

The more recent data-centric approach focuses on systematically improving the quality of the datasets to improve the accuracy of the machine learning model output. This works when you deal with small datasets that rarely change. But what do you do when you need representative samples for huge amounts of long-tail data, or when the data becomes obsolete by the time the sample is perfected?

Challenges with Manual Approaches

Before describing a more efficient approach to FTC, it’s worth exploring what exactly makes a manual approach to data labeling a problem worth solving. Probably the greatest hurdle, and one most ML practitioners can instantly relate to, is finding and scaling your Subject Matter Experts (SMEs). It is often impossible to consistently outsource your manual labeling for the simple reason that knowledge transfer from your SMEs to the labelers is extremely difficult to handle. ML teams must coordinate the daunting task of building training materials, QAs, procedures, etc., for labelers who are likely living on different continents.

Even when the data is not very specialized, you can safely assume there will be some level of inconsistency in dataset creation. It can come from multiple sources: labelers’ personal biases, staff churn, and delays arising from batch processing, to name a few. Also, labeling large datasets on a fixed budget usually involves longer execution times, which cannot keep pace with inherent model drift, especially when data velocity is high.

The Programmatic Approach

Programmatic labeling is the process of writing programs that assign labels for parts of your dataset and applying them to your machine learning project. The process starts by selecting the parts of the dataset that are related — directly or indirectly — to the labels we want to produce and/or deduce.
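To make the idea concrete, here is a minimal sketch of programmatic labeling in Python. Everything in it is an illustrative assumption rather than a description of any particular tool: the keyword rules, the category names ("billing", "bug"), and the simple majority vote used to combine the functions.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions: each one inspects a row of text and
# either votes for a label or abstains.
def lf_refund(text):
    return "billing" if "refund" in text.lower() else ABSTAIN

def lf_invoice(text):
    return "billing" if "invoice" in text.lower() else ABSTAIN

def lf_crash(text):
    lowered = text.lower()
    return "bug" if "crash" in lowered or "error" in lowered else ABSTAIN

LABELING_FUNCTIONS = [lf_refund, lf_invoice, lf_crash]

def label_row(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine votes by simple majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting evidence, leave the row unlabeled
    return counts[0][0]

rows = [
    "I need a refund for my last invoice",
    "The app crashes with an error on startup",
    "How do I change my avatar?",
]
labels = [label_row(r) for r in rows]
print(labels)
```

Rows that no function covers stay unlabeled, which is usually the signal that another labeling function (or a hand label) is needed for that part of the dataset.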

Instead of relying on just the data scientists and software developers, or even outsourced labelers, it is much more efficient to leverage Subject Matter Experts (SMEs) to process the data. For them to rapidly deploy their own purpose-built AI, a new approach is needed for data discovery, tooling, automation, and validation, according to Jaidev Amrite (SparkCognition).

Programmatic labeling can be a good fit for your use case if you are dealing with a large amount of data (tens of thousands of rows and above) that requires some level of expertise to label and that changes at a high enough rate to warrant a solution that doesn’t add delays to the labeling process. Of course, data scientists and ML engineers can write their own labeling functions from scratch, but this trial-and-error approach takes a lot of time and resources.

Developing a Guided Labeling Experience

Rather than gaining knowledge from data alone, we can have the SMEs teach the machine. They can decompose any problem into smaller parts and provide examples from which the algorithm learns the task independently, enabling an explainable taxonomy that is a proxy for deep learning models.

Let’s see how this works.

First, we upload a set of rows and develop predictive labeling functions that transform raw data into training data by exploring patterns in the data. The key is having an easy-to-use interface focused on making data labeling more efficient and pleasant. Instead of coding functions from scratch or writing regex, SMEs introduce labels by hand into the suggestion engine, which reverse engineers the right labeling functions to match the patterns in the hand labels.

Notice the difference: instead of having a person sit down and painstakingly create labeling functions for each individual entity, you can have a subject-matter expert sit down and click on “Yes” or “No”. Or they can enrich a few hundred individual rows with metadata, and the system can generate predictive labeling functions for the dataset that can be reused as needed, independent of the number of SMEs working on your project.
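As a rough illustration of how a suggestion engine might turn a handful of hand labels into candidate labeling functions, consider the toy sketch below. It is not how any specific product works: the keyword-precision heuristic, the function names (`suggest_keyword_rules`, `make_labeling_function`), the thresholds, and the sample rows are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def suggest_keyword_rules(hand_labeled, min_support=2, min_precision=1.0):
    """Propose keyword -> label rules from a few hand-labeled rows.

    A keyword is suggested when it occurs in at least `min_support`
    labeled rows and at least `min_precision` of those rows share
    the same label.
    """
    label_counts = defaultdict(Counter)  # keyword -> Counter of labels
    for text, label in hand_labeled:
        for token in tokenize(text):
            label_counts[token][label] += 1

    rules = {}
    for token, counts in label_counts.items():
        label, hits = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_support and hits / total >= min_precision:
            rules[token] = label
    return rules

def make_labeling_function(keyword, label):
    """Turn an accepted rule into a reusable labeling function."""
    def lf(text):
        return label if keyword in tokenize(text) else None
    return lf

# A few rows labeled by hand by a (hypothetical) subject matter expert.
hand_labeled = [
    ("Refund my order please", "billing"),
    ("Where is my refund", "billing"),
    ("App crash on login", "bug"),
    ("Crash when I open my settings", "bug"),
]

rules = suggest_keyword_rules(hand_labeled)
labeling_functions = [make_labeling_function(k, v) for k, v in rules.items()]
print(rules)
```

In a guided experience, each suggested rule would be shown to the expert for a “Yes” or “No” before being applied to the rest of the dataset. Note how the word "my" is not suggested: it appears under both labels, so it fails the precision threshold.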

This not only enables you to reuse the work efficiently on the rest of the dataset but also offers recourse in case you detect something wrong with your data or you want to provide documentation on a model decision pattern. With manual labeling, this would mean weeks or months of delays and significant extra costs.
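The recourse point can be sketched in a few lines: if every label keeps a record of which labeling functions voted for it, a faulty function can be traced, fixed, and re-run over the whole dataset in seconds. The function names, the "first vote wins" rule, and the sample rows below are hypothetical and chosen only to keep the example short.

```python
def apply_with_provenance(rows, labeling_functions):
    """Label each row and record which functions voted for what."""
    results = []
    for text in rows:
        votes = {name: lf(text) for name, lf in labeling_functions.items()}
        votes = {name: v for name, v in votes.items() if v is not None}
        # Toy combiner: the first non-abstaining function wins.
        label = next(iter(votes.values()), None)
        results.append({"text": text, "label": label, "votes": votes})
    return results

def lf_charge_v1(text):
    # Too broad: also fires on "battery will not charge".
    return "billing" if "charge" in text.lower() else None

def lf_charge_v2(text):
    # Corrected version that ignores rows about batteries.
    lowered = text.lower()
    if "charge" in lowered and "battery" not in lowered:
        return "billing"
    return None

def lf_battery(text):
    return "hardware" if "battery" in text.lower() else None

rows = ["You charged my card twice", "Battery will not charge"]

before = apply_with_provenance(
    rows, {"lf_charge": lf_charge_v1, "lf_battery": lf_battery}
)
# Trace every row the suspect function touched.
affected = [r["text"] for r in before if "lf_charge" in r["votes"]]

# Swap in the fixed function and relabel everything.
after = apply_with_provenance(
    rows, {"lf_charge": lf_charge_v2, "lf_battery": lf_battery}
)
print([r["label"] for r in before], [r["label"] for r in after])
```

The stored votes also double as documentation: for any row, you can show which rules produced its label.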

Programmatic Labeling for Text Classification

The best way to see the value of the programmatic approach to data labeling for text classification is to look at real case studies. As mentioned above, the cost and time opportunity must be carefully weighed. My team had the opportunity to put this approach into practice for a number of suitable use cases, bringing measurable results.

The Wilson Sonsini data science team had a large body of unclassified data that potentially contained critical insights about the types of work the firm provides to its clients, but that was incredibly difficult to tackle with traditional data labeling approaches. They created a prediction algorithm that was eventually applied to the larger set of related data entries. Their SMEs spent two 3-hour sessions with the team to develop an automated pipeline and workflow for newly generated data entries, generating additional insight.

A Proper High is a company focused on normalizing noisy and fragmented cannabis e-commerce data. Using the Watchful interface, they were able to automate their classification and information extraction. It took one engineer only one afternoon to classify their entire library of more than 200,000 products with greater than 99% accuracy. They reduced their original 30-day estimate to less than 4 hours of effort.

Final Thoughts

There is no “silver bullet” solution to how we approach ML, only constant improvements with each iteration. What we do know today is that the manual labeling approach to large-scale training data sets for classifier use cases is an uphill battle against inefficiency, model drift, low quality, human biases, and lack of sufficient SMEs.

A programmatic approach to key tasks can be more efficient in both time and money, or it can unlock otherwise untenable use cases. We must find powerful tools and weapons to rapidly explore and identify the best data to train models. The future is giving experts “superpowers” compared to the established way of doing things. One such superpower is to have an expert label just a few data samples and let a programmatic set of functions learn from those examples, offering automated training data outputs.

About Watchful

Watchful is a modern and interactive solution for NLP that places the control of data labeling back into the hands of data scientists and machine learning practitioners. Through our scalable data-centric approach, anyone, from subject matter experts to MLOps engineers, can holistically explore, classify, annotate and validate any unique dataset to power today’s AI initiatives and business processes. Watchful’s enterprise-ready solution removes the data bottlenecks associated with AI from the start, allowing for the iterative processes of AI, from production to deployment, to be far more cost-effective and scalable. Use Watchful across multiple industries, such as manufacturing, retail, finance, life sciences, and more. Learn more by visiting www.watchful.io.

Article by Shayan Mohanty, Co-Founder and CEO of Watchful

Originally posted on OpenDataScience.com

