25 Excellent Machine Learning Open Datasets

25 Machine Learning Open Datasets To Get You Started

Natural Language Processing

  • Amazon Reviews: A collection of over 35 million reviews from the last 18 years. It includes things like ratings, reviews in plain text, and user information. It also contains complete product information for reference.
  • Wikipedia Links Data: Text from four million Wikipedia articles containing 1.9 billion words. You can search by word, by phrase, or by part of a paragraph.

Sentiment Analysis

  • Stanford Sentiment Treebank: A dataset containing sentiment annotations for over 10,000 pieces of data from Rotten Tomatoes reviews, rendered in HTML.
  • Twitter US Airline Sentiment: Tweets from 2015 about US airlines, each labeled as positive, negative, or neutral.
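
A labeled dataset like the airline tweets is usually worth a quick class-balance check before any training. The snippet below is a minimal sketch of that check using only the Python standard library. The sample rows are made up for illustration, and the column names `airline_sentiment`, `airline`, and `text` are assumptions; confirm them against the header of the file you actually download.

```python
import csv
import io
from collections import Counter

# Made-up sample rows for illustration only; not taken from the real dataset.
# The column names are assumptions and should be checked against your CSV.
SAMPLE_CSV = """airline_sentiment,airline,text
negative,ExampleAir,My flight was delayed for three hours
positive,ExampleAir,Friendly crew and an easy boarding process
neutral,SampleJet,What time does check-in open?
negative,SampleJet,Lost my luggage again
"""


def count_sentiments(csv_file):
    """Count how many rows carry each sentiment label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["airline_sentiment"] for row in reader)


# For a real file, pass open("your_file.csv", newline="", encoding="utf-8") instead.
counts = count_sentiments(io.StringIO(SAMPLE_CSV))
print(counts.most_common())
```

If one label dominates the counts, plain accuracy can be misleading, so it may help to track per-class metrics as well.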

Public Government Data

  • Data USA: A comprehensive overview of various sets of US public data in fun visualizations. It includes things like population, health, and jobs.
  • EU Open Data Portal: Much like Data USA except with a concentration on countries belonging to the EU. It includes fields such as population, culture, energy, and health, among others.

Finance and Economics

  • World Bank Open Data: Data concerning population demographics and key indicators for development.
  • IMF Data: International Monetary Fund’s collection of open data for things like debt rates, commodity pricing, international markets, and foreign exchange reserves.

Facial Recognition

  • Labeled Faces In The Wild: A common dataset for facial recognition training. It includes 13,000 cropped faces, and a subset of the people appear in two or more different pictures.
  • UMDFaces Dataset: Includes both still and video images. The dataset is annotated and features around 367,000 faces of over 8,000 subjects.

Image Datasets

  • ImageNet: A dataset containing over 14 million images available for download in different formats. It also includes API integration and is organized according to the WordNet hierarchy.
  • Google’s Open Images: 9 million URLs to public images labeled across more than 6,000 categories. Each image is licensed under Creative Commons.

Health

  • Healthdata.gov: A resource from the US federal government providing data to improve health outcomes for the US population.
  • MIMIC Critical Care Database: Datasets for computational physiology with de-identified health data from 40,000 critical care patients (demographics, vital signs, medications, etc.).

Media

  • FiveThirtyEight Journalism: The numbers behind some of this journalism hub’s stories. Useful for visualizations and data stories.
  • BuzzFeed Media: An open source data hub for BuzzFeed’s reporting. It holds the data their journalists used to produce the stories (the organization recommends reading the articles to get a better idea of how the data was used).

Transportation

  • US National Travel and Tourism Office: Provides trustworthy datasets that give a big-picture view of the tourism industry, including inbound and outbound travel and international visitor data.
  • Department of Transportation: Datasets on each field that falls under the DOT, including National Parks, driver registers, bridges and rail information, and port systems.

Speech

  • Flickr Audio Caption Corpus: 40,000 spoken captions for 8,000 images, in a manageable size. It was initially designed for unsupervised speech pattern discovery.
  • Speech Commands Dataset: A continuously evolving collection of one-second-long utterances from thousands of different people. It’s still receiving contributions and is useful for building basic voice interfaces.
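
Models trained on short utterances generally expect every clip to have the same number of samples, and recorded clips can come out slightly shorter or longer than the nominal one second. The sketch below shows one common way to handle that by zero-padding or trimming. The 16,000 samples-per-second rate and the helper name `pad_or_trim` are assumptions for illustration, so check the sample rate of the files you download.

```python
SAMPLE_RATE = 16_000  # assumed samples per second; verify against your audio files


def pad_or_trim(samples, length=SAMPLE_RATE):
    """Return a copy of `samples` that is exactly `length` values long.

    Longer clips are cut at the end; shorter clips are padded with silence (0.0).
    """
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


# A made-up clip that is only 0.75 seconds long at the assumed sample rate.
short_clip = [0.1] * 12_000
fixed = pad_or_trim(short_clip)
print(len(fixed))
```

Padding at the end keeps the spoken word at the start of the clip; centering the audio instead is an equally reasonable choice for some models.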

Sound

  • FSD (Freesound): A collection of everyday sounds gathered from contributors under an open source license.
  • Environmental Audio Datasets: Contains tables of sound events and tables of acoustic scenes. Some of the data is proprietary, but a large portion is open source.

Dataset Aggregators

  • OpenDataSoft: 2,600 data portals, browsable on an interactive map or by country list. If you’re looking for it, chances are it’s here.
  • Kaggle: An online community of data scientists where users can work with and upload datasets. It’s a community and a resource in one.
  • UCI Machine Learning Repository: User-contributed datasets in various levels of cleanliness. It’s one of the originals, and you can download datasets without having to register.

Getting Started With Machine Learning

ODSC - Open Data Science