State-of-the-Art Text Classification Made Easy

Oct 1, 2020

In Natural Language Processing (NLP), language models such as ULMFiT, BERT, and GPT have become the foundation of many solutions for common NLP tasks. The benefit of language models is their ability to be pre-trained with a general understanding of language, such that users can fine-tune models on significantly less data and achieve better performance than when starting from scratch. Prior to language models, NLP models required enough data to simultaneously learn a language and a task, such as classification.

At Novetta, we achieved amazing performance with language models, but it was difficult for developers and new data scientists to train and deploy their own models. To address this, we decided to streamline the implementation of state-of-the-art models for different NLP tasks. We built an open-source framework, AdaptNLP, that lowers the barrier to entry for practitioners to use advanced NLP capabilities. AdaptNLP is built atop two open-source libraries: Transformers (from Hugging Face) and Flair (from Zalando Research). AdaptNLP enables users to fine-tune language models for text classification, question answering, entity extraction, and part-of-speech tagging.

Example: Text Classification

To demonstrate how AdaptNLP can be used for language model fine-tuning and training, we will fine-tune a pre-trained language model from Transformers for sequence classification, also known as text classification.

Using AdaptNLP starts with a Python pip install.

pip install adaptnlp

First, we import EasySequenceClassifier, which abstracts the sequence classification task to its most basic components such as data preprocessing, inference, and training. We can then instantiate the EasySequenceClassifier class object to start training our own custom sequence classification model.

from adaptnlp import EasySequenceClassifier
classifier = EasySequenceClassifier()

To train a sequence classification model with our `classifier`, we need to prepare our data and our training hyperparameters.

AdaptNLP is tightly integrated with Hugging Face’s datasets library (formerly named nlp), so we will import load_dataset from it and load the “ag_news” dataset. The AG News dataset is a collection of news articles, each labeled with one of four classes: world, sports, business, or sci/tech. This makes it a good multi-class dataset for training our classifier. If you’d like, explore the dataset on Hugging Face’s Viewer UI and try out the library in general.

Note: The classifier can be trained with CSV data file path inputs as well as datasets.Dataset (formerly nlp.Dataset) inputs.

from datasets import load_dataset
train_dataset, eval_dataset = load_dataset('ag_news', split=['train[:10%]', 'test'])
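For the CSV route mentioned in the note above, the file only needs one column for the text and one for the label. The following is a minimal sketch using Python's standard csv module. The file name, the two example rows, and the choice of "text" and "label" as column headers are assumptions made for illustration (the headers mirror the `text_col_nm` and `label_col_nm` values used in this example), and how AdaptNLP reads the file is not shown here.

```python
import csv
import os
import tempfile

# Hypothetical rows of (text, label id). The ids follow the AG News label
# order: 0=World, 1=Sports, 2=Business, 3=Sci/Tech.
rows = [
    ("Stocks rallied after the central bank held rates steady.", 2),
    ("The home team clinched the title in extra time.", 1),
]

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")

# Write a header row followed by one row per example.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

# Read the file back to confirm the layout.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["text"], loaded[0]["label"])
```

Note that values read back from a CSV are strings, so label ids may need converting to integers depending on how the file is consumed.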

Now that we have our train and evaluation/test datasets, we can create the training arguments object, TrainingArguments, from the transformers library. This lets us specify parameters and hyperparameters for training the classifier such as output paths, epochs, batch size, and weight decay. Wonderful, extensive documentation on TrainingArguments can be found on Hugging Face’s documentation site.

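from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./models',            # checkpoints and model files
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,                 # learning-rate warmup steps
    weight_decay=0.01,
    # Flag name in transformers 3.x; later releases replaced it with evaluation_strategy.
    evaluate_during_training=True,
    logging_dir='./logs',
    save_steps=100,                   # save a checkpoint every 100 steps
)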
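To get a feel for what these settings imply, the arithmetic below estimates the number of optimizer steps in this run. It is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and 12,000 training examples (10% of the 120,000-example AG News training split). The variable names are ours.

```python
import math

# Assumed run size: 10% of the 120,000 AG News training examples.
train_examples = 12_000

# Values copied from the TrainingArguments above.
per_device_train_batch_size = 16
num_train_epochs = 1
warmup_steps = 500
save_steps = 100

# One optimizer step per batch (single device, no gradient accumulation assumed).
steps_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

# Steps at which a periodic checkpoint would be written.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(total_steps, round(warmup_fraction, 2), len(checkpoint_steps))
```

Under these assumptions, warmup covers roughly two thirds of the run, so for a run this short a smaller `warmup_steps` may be worth trying.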

We can then start training by running the classifier’s built-in `train()` method, which takes in the train and eval datasets and the `training_args` variable we created. Besides specifying the text and label column names, you will now specify the pre-trained language model to fine-tune.

classifier.train(
    training_args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    model_name_or_path="bert-base-cased",
    text_col_nm="text",
    label_col_nm="label",
)

Important Note: In this example, we will use the “bert-base-cased” pre-trained language model with a sequence classification head. However, you can use nearly any pre-trained language model. Try out a pre-trained DistilBert or an Electra model, a custom fine-tuned model, or any model in Hugging Face’s model repository.

After training is completed, all artifacts and metadata such as checkpoints, model files, configs, and logs will be located in the directory paths specified in your `training_args` for `output_dir` and `logging_dir`. In this example, they are in “./models” and “./logs”.

You will then run a final evaluation with the built-in `evaluate()` method to see how well your model performs by calculating metrics on the eval/test dataset.

classifier.evaluate()

Outputs:

{'epoch': 1.0,
 'eval_accuracy': 0.9019736842105263,
 'eval_f1': array([0.90401969, 0.9692994 , 0.85683646, 0.87806097]),
 'eval_loss': 0.295024262882377,
 'eval_precision': array([0.9408082 , 0.96650968, 0.87322404, 0.8358706 ]),
 'eval_recall': array([0.87      , 0.97210526, 0.84105263, 0.92473684])}
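The per-class arrays are easier to read once paired with class names and reduced to a single number. The sketch below copies the values from the output above, assumes the array positions follow the AG News label order (World, Sports, Business, Sci/Tech), and computes a macro-averaged F1. It also re-derives each F1 from precision and recall as a sanity check. The variable names are illustrative.

```python
# Assumed class order, matching the AG News label ids 0..3.
labels = ["World", "Sports", "Business", "Sci/Tech"]

# Values copied from the evaluate() output above.
precision = [0.9408082, 0.96650968, 0.87322404, 0.8358706]
recall = [0.87, 0.97210526, 0.84105263, 0.92473684]
reported_f1 = [0.90401969, 0.9692994, 0.85683646, 0.87806097]

# F1 is the harmonic mean of precision and recall.
derived_f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro average: unweighted mean over classes.
macro_f1 = sum(reported_f1) / len(reported_f1)

# Class with the lowest reported F1.
weakest = min(zip(reported_f1, labels))[1]

for name, f1 in zip(labels, derived_f1):
    print(f"{name:<9} F1={f1:.4f}")
print(f"macro F1={macro_f1:.4f}, weakest class: {weakest}")
```

Reading the metrics this way shows which class the model struggles with most, which is harder to see from the overall accuracy alone.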

Great! You’ve successfully fine-tuned and trained your own sequence classifier for the AG_News dataset. Now let’s explore the data and model objects in more detail.

The EasySequenceClassifier object can dynamically load and run mini-batch inference on nearly any Transformers model, including the one you just trained. You can load the model and run mini-batch inference with the built-in `tag_text` method.

text = [
    "The batter up went for the run and scored a touch down.",
    "The engineer designed rocket fuel that can take us to mars.",
    "The president of the United States and the prime minister of Britain talked.",
    "The stock market went down as the economy took a hit from stuff.",
]
results = classifier.tag_text(
    text=text,
    model_name_or_path="./models",
    mini_batch_size=2,
)
print(results)

Outputs:

[Sentence: "The batter up went for the run and scored a touch down ." [− Tokens: 13 − Sentence-Labels: {'sc': [World (0.0421), Sports (0.9516), Business (0.003), Sci/Tech (0.0033)]}],

Sentence: "The engineer designed rocket fuel that can take us to mars ." [− Tokens: 12 − Sentence-Labels: {'sc': [World (0.1011), Sports (0.0295), Business (0.0767), Sci/Tech (0.7928)]}],

Sentence: "The president of the United States and the prime minister of Britain talked ." [− Tokens: 14 − Sentence-Labels: {'sc': [World (0.9544), Sports (0.003), Business (0.0335), Sci/Tech (0.0091)]}],

Sentence: "The stock market went down as the economy took a hit from stuff ." [− Tokens: 14 − Sentence-Labels: {'sc': [World (0.0243), Sports (0.0013), Business (0.9655), Sci/Tech (0.0089)]}]]
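Each sentence comes back with a score for every class, so picking a prediction is a matter of taking the highest-scoring label. The sketch below does this on plain dictionaries whose scores are copied from the output above. It deliberately avoids the Flair Sentence API, and the `scores` structure and `top_label` helper are hypothetical names introduced for illustration.

```python
# Scores copied from the printed output above, one dict per input sentence.
scores = [
    {"World": 0.0421, "Sports": 0.9516, "Business": 0.003, "Sci/Tech": 0.0033},
    {"World": 0.1011, "Sports": 0.0295, "Business": 0.0767, "Sci/Tech": 0.7928},
    {"World": 0.9544, "Sports": 0.003, "Business": 0.0335, "Sci/Tech": 0.0091},
    {"World": 0.0243, "Sports": 0.0013, "Business": 0.9655, "Sci/Tech": 0.0089},
]


def top_label(label_scores):
    """Return the (label, score) pair with the highest score."""
    return max(label_scores.items(), key=lambda item: item[1])


predictions = [top_label(s) for s in scores]

for label, score in predictions:
    print(f"{label}: {score:.4f}")
```

A confidence threshold could be applied at the same point, for example routing low-scoring predictions to manual review.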

While you’re at it, you can try to run `tag_text()` with a different model fine-tuned on AG_News from Hugging Face’s model repository to see how your custom trained model fares.

Fine-Tuning Language Models

To go beyond only fine-tuning a classifier from general-domain language models, you can use AdaptNLP’s `LMFineTuner` to fine-tune a language model on your target task data. Data from your target task will typically have a different distribution or topic domain from a general-domain language model, so fine-tuning a language model on your target task data can help it “adapt” to your data.

For more information on these techniques, and AdaptNLP in general, visit our documentation site for tutorials, guides, class reference documentation, and more.

Once fine-tuned, a language model can be integrated easily into user-built systems to provide state-of-the-art text classification. Because AdaptNLP standardizes the input and output data and the function calls, developers can use NLP algorithms in the same way regardless of which model runs in the backend. Before AdaptNLP, we had to integrate each newly released model and its pre-trained weights individually, then rework the build of each NLP task pipeline. AdaptNLP streamlined this process to help us leverage new models in existing workflows without having to overhaul code.
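The idea of keeping the calling code fixed while the backend model changes can be sketched with a small interface. This is a toy illustration, not AdaptNLP's actual design: the `TextTagger` protocol, the two backends, and `run_pipeline` are all hypothetical names, and the "models" are simple keyword rules standing in for real ones.

```python
from typing import Dict, List, Protocol


class TextTagger(Protocol):
    """Hypothetical shared interface: texts in, per-label scores out."""

    def tag_text(self, text: List[str]) -> List[Dict[str, float]]:
        ...


class KeywordBackend:
    """Toy backend: leans toward 'Sports' when the text mentions a game."""

    def tag_text(self, text: List[str]) -> List[Dict[str, float]]:
        return [
            {"Sports": 0.9, "Business": 0.1}
            if "game" in t.lower()
            else {"Sports": 0.1, "Business": 0.9}
            for t in text
        ]


class AlwaysBusinessBackend:
    """Toy backend: a trivial baseline that always prefers 'Business'."""

    def tag_text(self, text: List[str]) -> List[Dict[str, float]]:
        return [{"Sports": 0.2, "Business": 0.8} for _ in text]


def run_pipeline(tagger: TextTagger, docs: List[str]) -> List[str]:
    """Downstream code depends only on the interface, not on the backend."""
    return [max(result, key=result.get) for result in tagger.tag_text(docs)]


docs = ["The game went to overtime.", "Shares fell after the earnings call."]
print(run_pipeline(KeywordBackend(), docs))
print(run_pipeline(AlwaysBusinessBackend(), docs))
```

Swapping one backend for the other requires no change to `run_pipeline`, which is the property the paragraph above describes.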

Using the latest transformer embeddings, AdaptNLP makes it easy to fine-tune and train state-of-the-art token classification (NER, POS, Chunk, Frame Tagging), sentiment classification, and question-answering models. We will be giving a hands-on workshop on using AdaptNLP with state-of-the-art models at ODSC Europe 2020, which is now available on demand to purchase anytime.

About the author/ODSC Europe speakers:

Brian Sacash is a Machine Learning Engineer in Novetta’s Machine Learning Center of Excellence. He helps various organizations discover the best ways to extract value from data. His interests are in the areas of Natural Language Processing, Machine Learning, Big Data, and Statistical Methods. Brian holds a Master of Science in Quantitative Analysis from the University of Cincinnati and a Bachelor of Science in Physics from Ohio Northern University.

Andrew Chang is an Applied Machine Learning Researcher in Novetta’s Machine Learning (ML) Center of Excellence. Andrew is a graduate of Carnegie Mellon University who focuses on researching state-of-the-art machine learning models and rapidly prototyping ML technologies and solutions across the scope of customer problems. He has an interest in open source projects and research in natural language processing, geometric deep learning, reinforcement learning, and computer vision. Andrew is the author and creator of NovettaNLP.

Written by ODSC - Open Data Science
