22 Machine Learning Open Datasets for 2021

But first: How to Find Machine Learning Open Datasets

Searching for machine learning open datasets is a skill in itself, and one worth getting really good at if you’re in the data science community. Luckily, there are a few reliable sources for finding these datasets. Some common ones include:

  • Google Dataset Search
  • GitHub
  • data.gov
  • UCI Machine Learning Repository

Once you find a dataset, ask a few questions before relying on it:

  • Can I find and fix inaccuracies?
  • Is it complete?
  • Is the data objective?
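
A question like “Is it complete?” can be checked roughly in code before you commit to a dataset. The following is a minimal sketch, not a definitive method: it assumes the data is available as CSV text, and the function name, the set of tokens treated as missing, and the sample rows are all made up for illustration.

```python
import csv
import io

# Tokens treated as "missing". This set is an assumption; adjust it per dataset.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?"}


def missing_share_by_column(csv_text):
    """Return {column: fraction of rows whose value is missing}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    total = 0
    for row in reader:
        total += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None; blank or placeholder cells also count as missing.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                missing[name] += 1
    if total == 0:
        return {name: 0.0 for name in columns}
    return {name: count / total for name, count in missing.items()}


# Made-up sample rows, for illustration only.
SAMPLE = """id,label,text
1,positive,great ride
2,,late pickup
3,negative,
4,NA,fine
"""

if __name__ == "__main__":
    for column, share in missing_share_by_column(SAMPLE).items():
        print(f"{column}: {share:.0%} missing")
```

A high missing share in a column you plan to train on is a signal to look for another source or to plan for imputation. The other two questions, on accuracy and objectivity, usually need a manual read of the documentation.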

22 Best Machine Learning Open Datasets

We’ll divide these machine learning open datasets based on some general categories, but you can also mix and match based on the data available in each set. Just because something is labeled for sentiment analysis doesn’t mean it wouldn’t also work with general natural language processing, for example.

Image Processing

LabelMe: A computer vision dataset published by MIT that lets users contribute through its annotation tool. You can download the images via the MATLAB toolbox or work with them online.

Natural Language Processing

Dirty Words: This fun dataset hosted on GitHub covers what you definitely do not want showing up in your chatbot, unless it’s that type of chatbot. It is a fascinating, ongoing collection of socially unacceptable words and phrases in a multitude of languages.

Sentiment Analysis

DynaSent: This English dataset includes over 121,000 sentences labeled as positive, negative, or neutral, created on its own open platform. Each sentence has been verified by five crowd workers.

Speech

VoxCeleb: A large-scale speaker identification dataset with over 100,000 utterances compiled from YouTube videos. It covers a range of accents, balanced gender, and dispersed ages, with around 2,000 hours of speech in total.

Government Data

Data USA: A well-organized place to find all sorts of data from the US government and its various departments. It includes info on congressional districts, public workers, population studies, and so much more.

For Beginners

NYC Taxi Trip Data: A collection of trip data starting in 2009, this dataset covers things like rates, trip lengths, and payment types. It also comes with user guides and a user-friendly format.
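
For a first exercise with trip data like this, a simple aggregation is a reasonable place to start. The sketch below averages fares by payment type using only the standard library. The column names (`trip_distance`, `fare_amount`, `payment_type`), the function name, and the sample rows are assumptions made up for illustration, so check them against the data dictionary of the file you actually download.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration; the column names are assumptions.
SAMPLE_TRIPS = """trip_distance,fare_amount,payment_type
1.2,7.5,card
3.4,14.0,cash
0.8,6.0,card
5.0,20.5,card
"""


def average_fare_by_payment(csv_text):
    """Return {payment_type: mean fare}, skipping rows with unparsable fares."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            fare = float(row["fare_amount"])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, short row, or non-numeric fare
        key = row.get("payment_type") or "unknown"
        totals[key] += fare
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


if __name__ == "__main__":
    for payment, mean_fare in average_fare_by_payment(SAMPLE_TRIPS).items():
        print(f"{payment}: {mean_fare:.2f}")
```

The same pattern works for trip lengths or any other numeric column; for full-size files, a dataframe library would likely be more practical than row-by-row parsing.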

Leveraging Open Datasets for Your Data Science Practice

There are so many great open datasets you can use to practice your craft, build your dream projects, and expand your portfolio. Whether you’re building for your current employer or dreaming up new projects, these datasets offer great machine learning training without the cost of buying expensive private data collections.

How to Learn More about ML and How to Use These Machine Learning Open Datasets

ODSC West 2021, our upcoming event on November 16th-18th in San Francisco, will feature a plethora of talks, workshops, and training sessions on machine learning and machine learning open datasets. You can register now for 30% off all ticket types before the discount drops to 20% in a few weeks. Some highlighted sessions on machine learning include:

  • Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
  • Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
  • Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University
  • MLOps… From Model to Production: Filipa Peleja, PhD | Lead Data Scientist | Levi Strauss & Co
  • Operationalization of Models Developed and Deployed in Heterogeneous Platforms: Sourav Mazumder | Data Scientist, Thought Leader, AI & ML Operationalization Leader | IBM
  • Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber: Eduardo Blancas | Data Scientist | Fidelity Investments
  • Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
  • Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
  • Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google
