Trial, Error, Triumph: Lessons Learned Using LLMs for Creating Machine Learning Training Data

ODSC - Open Data Science

The broad availability and strong performance of large language models (LLMs) enable practitioners to automate a variety of time-consuming tasks. Obtaining a large number of quality labels for a machine learning training dataset is a critical step in supervised learning, but generating those labels manually can take a prohibitive amount of time. At this year’s ODSC East, Matt Dzugan outlined the approach his team at Muck Rack uses to generate high-quality machine learning training datasets with LLMs.

Get your ODSC Europe 2024 pass today!

In-Person and Virtual Conference

September 5th to 6th, 2024 — London

Featuring 200 hours of content, 90 thought leaders and experts, and 40+ workshops and training sessions, Europe 2024 will keep you up-to-date with the latest topics and tools in everything from machine learning to generative AI and more.

REGISTER NOW

While many natural language processing (NLP) tasks can be solved with LLMs, they are not the most cost-effective or accurate option in every application. To illustrate how his team employs LLMs efficiently, Matt worked through an example task: assigning relevant topics to a large volume of articles. Calling an LLM to label every article in production would be prohibitively expensive at a scale of millions of articles per day. Instead, one can train a more traditional NLP model on a suitable training dataset and use that trained topic classifier to score each article. Unless such a training dataset already exists, it must be created.
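To give a rough sense of what the cheaper production model could look like, here is a minimal sketch in Python, assuming a labeled set of (article, topic) pairs is already in hand. The talk does not prescribe a specific model; the TF-IDF plus logistic regression pipeline and the toy data below are illustrative assumptions.

```python
# A minimal sketch of the "train a cheaper model" idea, assuming a
# labeled dataset of (article_text, topic) pairs already exists.
# The data and model choice here are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

articles = [
    "The central bank raised interest rates by a quarter point...",
    "The quarterback threw for 300 yards in the season opener...",
]
topics = ["finance", "sports"]

# TF-IDF features + logistic regression: cheap to train and cheap to
# run at the scale of millions of articles per day.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(articles, topics)

print(classifier.predict(["Stocks rallied after the earnings report."]))
```

A pipeline like this costs orders of magnitude less per article than an LLM call, which is exactly why the training dataset, not the production model, becomes the bottleneck.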

Not all machine learning training datasets are equally useful. Matt argued that the best training datasets share three qualities: they are easy to obtain, they are accurate, and they generalize well to the data seen in production. The distinction matters because each data generation method trades these qualities off differently.

Figure 1: Three key qualities of effective training data.

Matt described four approaches to generating the machine learning training dataset directly with an LLM. The first, dubbed “The Labeler”, gives the LLM each article and instructs it to assign one of roughly 1,000 possible topics. While “The Labeler” creates a dataset that generalizes well, the long context incurs considerable cost, and the model can hallucinate topics outside the defined list. “The Author” approach works in reverse: it starts with topics and uses an LLM to generate an article to match each one. Its key disadvantage is a loss of generalization; the generated articles resemble the content in the LLM’s training data more than the articles seen in production. The third method, “The Librarian”, has the LLM write a query that retrieves matching articles from a database for a given topic. “The Librarian” scales well but suffers from low accuracy, since keyword queries are a blunt instrument for matching topics. The fourth approach, “The Judge”, gives the LLM an article and a candidate topic and asks whether the two are a good match. This method scales poorly when the space of potential topics is large: one would need to evaluate an enormous number of topic-article combinations to obtain enough true matches.
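To make the four patterns concrete, here are hedged Python sketches of each. The prompts, the call_llm stand-in, and the sample topic list are illustrative assumptions, not the prompts used in the talk; substitute your LLM client of choice.

```python
# Illustrative sketches of the four prompting patterns described above.
# `call_llm` is a hypothetical stand-in for whatever LLM client you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM client of choice")

TOPICS = ["climate", "elections", "mergers"]  # ...up to ~1,000 in the talk

def labeler(article: str) -> str:
    # "The Labeler": the full topic list goes into the prompt, so the
    # context is long and the model may hallucinate unlisted topics.
    return call_llm(
        f"Assign exactly one topic from {TOPICS} to this article:\n{article}"
    )

def author(topic: str) -> str:
    # "The Author": generate a synthetic article for a known topic.
    return call_llm(f"Write a short news article about: {topic}")

def librarian(topic: str) -> str:
    # "The Librarian": produce a keyword query to retrieve real articles.
    return call_llm(f"Write a keyword search query for articles about: {topic}")

def judge(article: str, topic: str) -> bool:
    # "The Judge": verify a proposed (article, topic) pairing.
    answer = call_llm(
        f"Does the topic '{topic}' fit this article? Answer yes or no:\n{article}"
    )
    return answer.strip().lower().startswith("yes")
```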

Figure 2: Tradeoffs of the four LLM-based data generation methods, including “The Author” (top left), “The Librarian” (top right), “The Labeler” (bottom right), and “The Judge” (bottom left).

To balance the cost and quality of the dataset, Matt outlined a fifth approach he calls artificial semi-supervised learning. The first step employs “The Author” approach to generate articles from a list of topics. Although this seed data does not generalize well on its own, the second step trains a model on it and uses that model to score existing real-world articles. “The Judge” method is then applied to verify the model’s proposed matches, and poor matches are discarded. Repeating this process grows the machine learning training dataset in a semi-supervised fashion. By combining “The Author” method, an iteratively retrained supervised model, and “The Judge” approach, the user can maximize accuracy, generalization, and efficiency at once. For perspective, Matt compared costs on one topic classification example: used in this artificial semi-supervised fashion, LLMs incurred substantially lower costs than purely LLM-based approaches like “The Labeler” while still returning a high-quality dataset.
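Tying the earlier sketches together, the loop below shows one plausible shape for the artificial semi-supervised process. The overall structure (seed with “The Author”, retrain, score, filter with “The Judge”, repeat) follows the talk; the confidence threshold, round count, and reuse of the hypothetical author and judge helpers from above are assumptions.

```python
# A schematic sketch of the artificial semi-supervised loop. The
# `author` and `judge` helpers are the hypothetical ones sketched
# earlier; thresholds and round counts are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def grow_training_set(topics, real_articles, n_rounds=3, threshold=0.8):
    # Step 1 ("The Author"): seed with one synthetic article per topic.
    X = [author(t) for t in topics]
    y = list(topics)
    for _ in range(n_rounds):
        # Step 2: retrain a cheap classifier on the current dataset
        # and score the unlabeled real-world articles with it.
        model = make_pipeline(TfidfVectorizer(), LogisticRegression())
        model.fit(X, y)
        probs = model.predict_proba(real_articles)
        preds = model.classes_[probs.argmax(axis=1)]
        confident = probs.max(axis=1) >= threshold
        # Step 3 ("The Judge"): keep confident predictions that the
        # LLM confirms; poor matches are discarded.
        for article, topic, keep in zip(real_articles, preds, confident):
            if keep and article not in X and judge(article, topic):
                X.append(article)
                y.append(topic)
    return X, y
```

Each round the classifier sees more real articles, so the dataset’s generalization improves while the expensive LLM calls are confined to the comparatively cheap yes/no judging step.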

Figure 3: Cost comparison of the four data-generating techniques, where each bill icon represents $100.

Matt’s talk illustrated that using LLM technology effectively requires thoughtful planning. While the topic classification example sits squarely within the NLP domain, the applications of LLMs are broadening as their performance improves. By attending ODSC talks, engineers and data scientists in any domain can stay current with the kinds of problems being solved with the latest technology. ODSC Europe on September 5th-6th will offer a wealth of similar content on key topics like LLMs, Gen AI, and AI for finance. Check out the confirmed speakers here: https://odsc.com/europe/

ODSC West 2024 tickets available now!

In-Person & Virtual Data Science Conference

October 29th-31st, 2024 — Burlingame, CA

Join us for 300+ hours of expert-led content, featuring hands-on, immersive training sessions, workshops, tutorials, and talks on cutting-edge AI tools and techniques, including our first-ever track devoted to AI Robotics!

REGISTER NOW

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.
