20 Open Datasets for Natural Language Processing

4 min read · Jul 31, 2019

Natural language processing accounts for a significant share of machine learning use cases, but it requires a lot of data and some deftly handled training. In 25 Excellent Machine Learning Open Data Sets, we listed Amazon Reviews and Wikipedia Links for general NLP and the Stanford Sentiment Treebank and Twitter US Airlines Reviews specifically for sentiment analysis. Here are 20 more great datasets for NLP use cases.

General

Enron Dataset: Over half a million anonymized emails from over 100 users. It’s one of the few publicly available collections of “real” emails for study and training sets.

Google Blogger Corpus: Nearly 700,000 blog posts from blogger.com. Each entry contains at least 200 occurrences of commonly used English words.

SMS Spam Collection: Excellent dataset focused on spam. Nearly 6,000 messages tagged as legitimate or spam, with a useful subset extracted directly from Grumbletext.
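As a rough illustration of working with tagged messages like these, here is a minimal sketch that splits each line into a label and a message and counts the classes. It assumes a tab-separated "label, then message" layout with the labels "ham" (legitimate) and "spam"; the sample lines are made up, so check the layout of the file you actually download.

```python
from collections import Counter

# Made-up sample lines in an ASSUMED "label<TAB>message" layout.
# "ham" is assumed to mark legitimate messages.
SAMPLE_LINES = [
    "ham\tAre we still on for lunch?",
    "spam\tWINNER! Claim your free prize now",
    "ham\tRunning late, see you at 6",
]


def parse_line(line):
    """Split one line into (label, message) at the first tab."""
    label, _, message = line.rstrip("\n").partition("\t")
    return label, message


pairs = [parse_line(line) for line in SAMPLE_LINES]

# Class balance matters for spam data, so count the labels first.
counts = Counter(label for label, _ in pairs)
print(counts)
```

Checking the label counts before training is a cheap way to spot class imbalance, which is common in spam collections.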

Recommender Systems Datasets: Datasets from a variety of sources, including fitness tracking, video games, song data, and social media. Labels include star ratings, time stamps, social networks, and images.

Project Gutenberg: Extensive collection of book texts. These are public domain and available in a variety of languages, spanning a long period of time.

Sentiment Analysis

Sentiment 140: 1.6 million tweets scrubbed of emoticons. They’re arranged in six fields: polarity, tweet date, user, text, query, and ID.
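To make the six-field layout concrete, here is a minimal sketch that reads rows into named records with Python's csv module. The column order used here (polarity, ID, date, query, user, text), the absence of a header row, and the sample rows are all assumptions for illustration; verify them against the downloaded file.

```python
import csv
import io
from typing import NamedTuple


class Tweet(NamedTuple):
    # ASSUMED column order; verify against the real file.
    polarity: int
    tweet_id: str
    date: str
    query: str
    user: str
    text: str


def parse_tweets(handle):
    """Read six-field CSV rows into Tweet records, skipping malformed rows."""
    reader = csv.reader(handle)
    return [Tweet(int(row[0]), *row[1:6]) for row in reader if len(row) == 6]


# Made-up rows standing in for the real file.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","missed the bus again"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","great coffee, great morning"\n'
)

tweets = parse_tweets(io.StringIO(SAMPLE))
for tweet in tweets:
    print(tweet.polarity, tweet.text)
```

Using a csv reader rather than splitting on commas keeps tweets that contain commas intact, since the text field is quoted.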

MultiDomain Sentiment Analysis Dataset: Includes a wide range of Amazon reviews. Star ratings can be converted to binary labels, and some product categories have thousands of entries.
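One way to turn star ratings into binary labels is sketched below. The threshold is an assumption, not something the dataset prescribes: ratings above 3 count as positive, ratings below 3 as negative, and 3-star reviews are dropped as ambiguous. The function name and sample reviews are hypothetical.

```python
from typing import Optional


def star_to_label(stars: float) -> Optional[int]:
    """Map a star rating to 1 (positive), 0 (negative), or None (ambiguous).

    The cut-off at 3 stars is an assumed convention, not part of the dataset.
    """
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


# Made-up (stars, text) pairs for illustration.
reviews = [
    (5.0, "Works perfectly"),
    (1.0, "Broke in a week"),
    (3.0, "It is okay"),
]

# Keep only reviews that received a clear positive or negative label.
labeled = [
    (text, star_to_label(stars))
    for stars, text in reviews
    if star_to_label(stars) is not None
]
print(labeled)
```

Dropping the middle rating trades some data for cleaner labels; keeping it as a third class is an equally reasonable choice if neutral sentiment matters for your use case.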

Yelp Reviews: Restaurant rankings and reviews. The reviews are well suited to sentiment analysis, and there is also a challenge with cash prizes for those working with Yelp’s datasets.

Dictionaries for Movies and Finance: Domain-specific dictionaries for sentiment analysis in the movie and finance fields. Entries are clean and grouped by positive or negative connotation.

OpinRank Dataset: 300,000 reviews from Edmunds and TripAdvisor. They’re neatly arranged by car model or by travel destination and hotel.

Text

20 Newsgroups: 20,000 documents from 20 different newsgroups. The content covers a variety of topics, some of them closely related. There are three versions: one in its original form, one with dates removed, and one with duplicates removed.

The WikiQA Corpus: Contains question and sentence pairs. It’s robust and compiled from Bing query logs. There are over 3,000 questions and over 29,000 candidate sentences, just under 1,500 of which are labeled as answer sentences.

European Parliament Proceedings Parallel Corpus: Sentence pairs from Parliament proceedings. It covers 21 European languages, including some that are less common in ML corpora.

Jeopardy: Over 200,000 questions from the famed TV show. It includes category and value designations as well as other descriptors such as question and answer fields and rounds.

Legal Case Reports Dataset: Text summaries of legal cases. It contains summaries of over 4,000 legal cases and could be great for training automatic text summarization models.

Speech

LibriSpeech: Nearly 1000 hours of speech in English taken from audiobook clips.

Spoken Wikipedia Corpora: Spoken articles from Wikipedia in three languages: English, German, and Dutch. It includes a diverse speaker set and range of topics. There are hundreds of hours available for training sets.

LJ Speech Dataset: 13,100 clips of short passages from audiobooks. They vary in length but contain a single speaker and include a transcription of the audio, which has been verified by a human reader.

M-AILABS Speech Dataset: Nearly 1000 hours of audio plus transcriptions. It includes multiple languages arranged by male voices, female voices, and a mix of the two.

Noisy Speech Database: Parallel dataset of noisy and clean speech. It’s designed for building speech enhancement software but could also be valuable as a training dataset for speech recorded outside of ideal conditions.

NLP and the Road Ahead

Machines are getting better at figuring out our complex human language. Each time someone trains a model to understand us, we are one step closer to integrating our machines more efficiently into our lives. Research will soon unlock even more capability in the fields of business, finance, and a host of other disciplines, but for now, NLP is making progress. We are excited to see what you build!
