Lots of Data, No Labels, Now What?

Oct 25, 2019

Editor’s Note: Interested in the problem of having lots of data but no labels? See Paolo’s talk, “Guiding AI to Generate the Labels we do not have with Active Learning,” at ODSC West 2019.

Let me tell you a data story that is common in many industries today, simplified here for brevity. “Corey” is a fresh graduate from a great data school, hired right away by a company with lots of money and lots of data. Great, right? Corey is young but already an experienced machine learning practitioner who loves deep learning and natural language processing. The company needs his help to extract value from all the text data sitting in a massive data lake. What kind of text data?


We could be talking about doctors’ diagnostic reports, customer care emails, or the messages attached to wire transfers. For the sake of this argument, it does not matter. What matters is that Corey cannot train any supervised model, simply because the data are not labeled.

In the healthcare example, all Corey has is the text of the diagnostic reports, with no disease type attached. In the financial example, he has wire transfer messages but no fraud label.

Whatever the use case, Corey cannot train a document classifier, complex or simple, unless those documents are first labeled manually by a domain expert, and quickly. We are talking about thousands, if not millions, of confidential documents that might require deep domain knowledge to label. And domain knowledge is expensive. What now?

Corey is not defeated yet, because he has heard about active learning, an old strategy that can be used to train his super deep RNN model or any other supervised model, even a simple logistic regression. Active learning obtains the labels for training by asking the expensive domain expert to label only a subset of the data, namely the documents expected to be most informative for the model. For deep learning the required subset of manually provided labels is larger, but it is still better than labeling the entire dataset.
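To make the idea concrete, here is a minimal sketch of one common flavor of active learning: pool-based uncertainty sampling with a tiny logistic regression, written in plain Python. Everything in it is an assumption for illustration: the toy wire transfer messages, the `ask_expert` function standing in for the domain expert, and the choice of uncertainty sampling itself. It is not the application presented in the talk.

```python
import math

def featurize(text, vocab):
    # Binary bag-of-words vector over a fixed vocabulary.
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def predict_proba(weights, bias, x):
    # Probability of the positive class; z is clamped to avoid overflow.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    # Plain stochastic gradient descent on the logistic loss.
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def active_learning(pool, ask_expert, vocab, seed_idx, rounds=2, batch=2):
    X = [featurize(t, vocab) for t in pool]
    # Start from a handful of expert-labeled documents.
    labeled = {i: ask_expert(pool[i]) for i in seed_idx}
    for _ in range(rounds):
        idx = sorted(labeled)
        weights, bias = train([X[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: query documents whose score is closest to 0.5.
        unlabeled.sort(key=lambda i: abs(predict_proba(weights, bias, X[i]) - 0.5))
        for i in unlabeled[:batch]:
            labeled[i] = ask_expert(pool[i])
    # Final model trained on every label collected so far.
    idx = sorted(labeled)
    weights, bias = train([X[i] for i in idx], [labeled[i] for i in idx])
    return weights, bias, labeled

# Hypothetical toy pool of wire transfer messages (invented for this sketch).
pool = [
    "urgent transfer to offshore account",
    "monthly rent payment",
    "urgent offshore wire request",
    "salary payment for march",
    "offshore account transfer urgent",
    "rent payment for april",
    "gift for birthday",
    "urgent request offshore",
]

def ask_expert(text):
    # Stand-in for the human expert; a real one would not follow a simple rule.
    return 1 if "offshore" in text else 0

vocab = sorted({w for t in pool for w in t.lower().split()})
weights, bias, labeled = active_learning(pool, ask_expert, vocab, seed_idx=[0, 1])
print(len(labeled), "of", len(pool), "documents were labeled by the expert")
```

The point of the sketch is the loop, not the model: train on what is labeled, score the unlabeled pool, send only the most uncertain documents to the expert, and repeat. Swapping in a deep model changes `train` and `predict_proba` but not the loop.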

Corey thus needs a web-based interactive application where the domain expert can provide labels in small doses, i.e., only for the critical data.

The thing is that “usually,” creating an active learning application requires skills beyond those in a typical data scientist’s skill set. Neither Python nor R alone makes it easy to set up a complex web application where frontend interactivity and backend model training are tightly combined. Instead of labeling documents for months, you find yourself shouting at full-stack developers for years. I said “usually,” didn’t I?

Whether you are a “Corey” or not, join my talk on October 31 about active learning. I will show my free, open source blueprint guided analytics application, which you can download to train a document classifier starting with no labels.



Written by ODSC - Open Data Science
