The Pile Dataset: EleutherAI’s Massive Project to Help Train NLP Models

Jan 25, 2021

Recently, EleutherAI, a small group of researchers devoted to open-source AI research, created The Pile, a massive dataset designed for training large language models in the style of GPT-2 and GPT-3. The dataset is open-source, contains over 800GB of English-language data, and is still growing.

The Methods

To build The Pile, EleutherAI combined a number of existing language modeling datasets into one diverse, general-purpose corpus for NLP work. Its sources include Pile-CC, Wikipedia, PubMed Central, GitHub, Stack Exchange, YouTube, the US Patent and Trademark Office, and more. Together, the 22 component datasets span academic writing, fiction, code, and mathematics. The Pile also introduces OpenWebText2 and BookCorpus2, extended versions of the original OpenWebText and BookCorpus datasets.
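Because The Pile is a blend of many sources, a natural first step when working with it is to check how many documents come from each component. The snippet below is a minimal sketch of that idea. It assumes each record is a JSON object with a "text" field and a "meta" field carrying a "pile_set_name" label; that layout, the label strings, and the sample records are illustrative assumptions and should be checked against the copy of the data you actually download.

```python
import json
from collections import Counter

# Hypothetical sample records in JSON Lines form. The "text" / "meta" /
# "pile_set_name" layout and the label strings are assumptions made for
# illustration, not a guaranteed schema.
SAMPLE_LINES = [
    '{"text": "A web page about gardening.", "meta": {"pile_set_name": "Pile-CC"}}',
    '{"text": "def add(a, b): return a + b", "meta": {"pile_set_name": "Github"}}',
    '{"text": "Another crawled web page.", "meta": {"pile_set_name": "Pile-CC"}}',
    '{"text": "An encyclopedia entry.", "meta": {"pile_set_name": "Wikipedia (en)"}}',
    '{"text": "A record with no source label."}',
]


def count_by_component(lines):
    """Count documents per source component in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Fall back to "unknown" when a record carries no source label.
        name = record.get("meta", {}).get("pile_set_name", "unknown")
        counts[name] += 1
    return counts


if __name__ == "__main__":
    for name, n in count_by_component(SAMPLE_LINES).most_common():
        print(f"{name}: {n}")
```

The same function works on a real file by passing an open file handle instead of the sample list, since it only needs an iterable of lines.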

The Goals

Most large language models are trained on private datasets derived largely from Common Crawl data, which the researchers argue limits how well those models generalize to downstream tasks. Dataset diversity, a core feature of The Pile, is intended to improve that downstream generalization.

Although The Pile was initially conceived as a training dataset for large-scale models, its diversity also proved useful for evaluating them.
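One common way to use a text corpus for evaluation is to measure how well a model predicts held-out text, normalized by the number of bytes so that models with different tokenizers can be compared. The snippet below is a rough sketch of that calculation. The function name and inputs are hypothetical, and whether this matches the exact procedure EleutherAI used is an assumption.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) into bits per UTF-8 byte.

    total_nll_nats: the model's total negative log-likelihood over `text`,
                    summed across all tokens (hypothetical input; obtaining it
                    depends on the model and library in use).
    text:           the evaluated string.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("cannot normalize by an empty text")
    # Dividing by ln(2) converts nats to bits; dividing by n_bytes normalizes.
    return total_nll_nats / (n_bytes * math.log(2))


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not a real model result.
    print(bits_per_byte(total_nll_nats=12.5, text="The Pile is diverse."))
```

Lower values mean the model compresses the text better. Computing the score separately for each component dataset would show where a model is strong or weak across domains.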

The researchers hope that, by using all of this data, they will be able to replicate GPT-3 with more diverse training data and make the result freely available. They also hope to create datasets in languages other than English in the future.

The Future of NLP

NLP jobs are on the rise, and standing out requires a broad range of skills, including familiarity with models such as the aforementioned GPT-3.

The ODSC on-demand training platform, Ai+ Training, offers a number of videos that will help you get up-to-date on the latest NLP skills, tricks, tools, platforms, libraries, and research advancements. Here are a few standout talks:

An Introduction to Transfer Learning in NLP and HuggingFace Tools: Thomas Wolf, PhD | Chief Science Officer | Hugging Face

Natural Language Processing Case-studies for Healthcare Models: Veysel Kocaman | Lead Data Scientist and ML Engineer | John Snow Labs

Transform your NLP Skills Using BERT (and Transformers) in Real Life: Niels Kasch, PhD | Data Scientist and Founding Partner | Miner & Kasch

A Gentle Intro to Transformer Neural Networks: Jay Alammar | Machine Learning Research Engineer | jalammar.github.io

Level Up: Fancy NLP with Straightforward Tools: Kimberly Fessel, PhD | Senior Data Scientist, Instructor | Metis

Build an ML pipeline for BERT models with TensorFlow Extended — An end-to-end Tutorial: Hannes Hapke | Senior Machine Learning Engineer | SAP Concur

Natural Language Processing: Feature Engineering in the Context of Stock Investing: Frank Zhao | Senior Director, Quantamental Research | S&P Global

Transfer Learning in NLP: Joan Xiao, PhD | Principal Data Scientist | Linc Global

Developing Natural Language Processing Pipelines for Industry: Michael Luk, PhD | Chief Technology Officer | SFL Scientific

Deep Learning-Driven Text Summarization & Explainability: Nina Hristozova | Junior Data Scientist | Thomson Reuters

Original post here.


Written by ODSC - Open Data Science