Hugging Face’s Cosmopedia Hopes To Reshape Pre-Training Data

Apr 1, 2024

For most of the history of AI models, building datasets for supervised fine-tuning and instruction tuning has relied on hiring human annotators, a process that is both time-consuming and often prohibitively expensive.

Hugging Face is hoping to change that with Cosmopedia, a synthetic dataset that covers hundreds of subjects with a duplicate content rate of less than 1%. With over 25 billion tokens and 30 million files, Cosmopedia stands as the largest open synthetic dataset to date.

Creating synthetic data that is both diverse and scalable is a complex undertaking. To address this, the Hugging Face team crafted over 30 million prompts for Cosmopedia, spanning hundreds of topics and achieving a duplicate content rate of less than 1%. The scale of that effort reflects the team’s aim of providing an extensive, high-quality synthetic data resource.
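To make the "duplicate content rate" figure concrete, here is a minimal sketch of how such a rate could be measured over a collection of texts. It assumes exact matching after simple normalization (lowercasing and collapsing whitespace); the article does not say how Hugging Face computed its figure, and the function names and sample prompts below are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def duplicate_rate(texts: list[str]) -> float:
    """Return the fraction of texts that repeat an earlier text.

    Assumption: two texts count as duplicates only if they are identical
    after normalization. Near-duplicate detection is out of scope here.
    """
    if not texts:
        return 0.0
    seen = set()
    duplicates = 0
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(texts)


# Hypothetical prompts, for illustration only.
prompts = [
    "Write a textbook chapter on photosynthesis for high school students.",
    "Write a blog post about budgeting for college students.",
    "write a textbook chapter on  photosynthesis for high school students.",
    "Explain linear regression to young children with a story.",
]
print(f"duplicate rate: {duplicate_rate(prompts):.2%}")
```

Hashing the normalized text keeps memory use modest at large scale, since only fixed-size digests are stored rather than the full strings.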

Cosmopedia’s creation involved a dual approach: conditioning prompts on web data for scale and on curated sources for quality. The curated sources include educational resources like OpenStax and Khan Academy, which help ensure high-caliber content.

The web data, which makes up over 80% of Cosmopedia’s prompts, came from a dataset similar to RefinedWeb, with millions of web samples organized into meaningful clusters. The result not only enriches AI training resources but also highlights the need for safeguards such as decontamination pipelines to ensure the integrity of synthetic data.

The decontamination step, similar to the one used for the Phi-1 model, removes potentially contaminated samples, meaning those that overlap with evaluation benchmarks, to keep the dataset clean. The implications of Cosmopedia and similar projects are profound, offering a glimpse into the future of AI development.
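As a rough illustration of what removing potentially contaminated samples can look like, the sketch below drops any sample that shares a word n-gram with a benchmark text. This is an assumed, simplified approach, not Hugging Face’s actual pipeline: the whitespace tokenization, the n-gram size, and all names and example texts are hypothetical.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(samples: list[str], benchmark_texts: list[str], n: int = 10) -> list[str]:
    """Keep only samples that share no n-gram with any benchmark text.

    Assumption: a single shared n-gram is enough to flag a sample.
    Texts shorter than n words yield no n-grams and are always kept.
    """
    benchmark_ngrams: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)
    return [s for s in samples if ngrams(s, n).isdisjoint(benchmark_ngrams)]


# Hypothetical data; n=5 is used only because the example texts are short.
benchmark = ["What is the capital of France? The answer is Paris."]
samples = [
    "Geography quiz: what is the capital of France? Many students know it.",
    "Plants convert sunlight into chemical energy through photosynthesis.",
]
kept = decontaminate(samples, benchmark, n=5)
print(kept)
```

A stricter or looser filter is a matter of tuning `n` and the overlap rule: a smaller `n` flags more samples, at the cost of discarding text that only shares common phrases with a benchmark.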

These advancements promise a more inclusive field, where the creation of comprehensive datasets is not confined to a privileged few but is accessible to a broader spectrum of researchers. As the AI community continues to explore and refine these methods, the potential for accelerated innovation and growth in AI capabilities seems boundless.

For developers, researchers, and enthusiasts alike, the evolution of synthetic training datasets represents an important moment in AI’s journey. The success of projects like Cosmopedia not only supports the training of more sophisticated models but also paves the way for the next generation of AI advancements.

Originally posted on OpenDataScience.com

Written by ODSC - Open Data Science