Ten Reasons for Doing Public Data Hacking

  1. Using public data for doing good — With public data sources, you can focus on a meaningful goal. This is the main premise of the DataKind organization, whose mission statement is “Harnessing the power of data science in the service of humanity” and which uses public data for many of its projects. Take some time to review a few public data resources and grab data sets that align with your personal passions, e.g. climate change, endangered species, traffic, crime, etc. Working on a data science project that’s meaningful to you makes it all the more intriguing and fun.
  2. Public data is remarkably clean — The public data sets I’ve used were very clean: consistent, with few missing values and with values that make sense. This may be because the sponsors of public data repositories bear a responsibility to ensure the data is usable and ready for public consumption. Clean data streamlines a data scientist’s job, since the data transformation (wrangling, munging) phase of the data science process is simpler.
  3. Ample data volume — Many of the data sets from open data repositories are extensive, going back many years and including many variables (high dimensionality). If you’re looking to do some big data experimentation, public data may be a great place to start.
  4. Varied data formats — The public data sites offer a variety of data formats including CSV, XML, JSON, etc. You’ll likely find a format you’re comfortable with.
  5. Data flexibility — Most open data repositories have very flexible means for selecting just the data you’re interested in. You can select subsets of variables, ranges of values (e.g. dollar ranges, date ranges, etc.), and categorical variable values (e.g. state="MA").
  6. Data diversity — You’ll find that each public data website offers a wide variety of data assets for many diverse problem domains. Sometimes I get new project ideas just browsing through collections of open data sets.
  7. New data scientists — If you’re a newbie data scientist trying to build up your resume, a great place to start is sharpening your skills with public data. You can choose an application close to your heart, like traffic patterns, air pollution, or crime stats. This way you’ll have extra motivation to ferret out new patterns and predictions to help the cause.
  8. Generate buzz for personal promotion — If your data science project using public data produces some surprising and/or useful results, you can write a paper summarizing your conclusions and submit it to a few appropriate organizations for promotion. If you’re lucky, you might get some good local press about your project.
  9. Capstone project material — If you’re in an academic program, including a MOOC specialization, public data sources can be a great basis for a capstone project.
  10. Obtain domain knowledge — Examining data is a great way to learn in general and also obtain specialized knowledge about a new domain space you may have an interest in. For example, maybe you’re interested in the healthcare industry, so examining public data sets describing the healthcare realm will serve to give you important insights.
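
To make reasons 4 and 5 a bit more concrete, here is a minimal Python sketch of reading the same records from CSV or JSON and then selecting a subset by state, date range, and dollar amount. The sample data, the column names (`state`, `date`, `amount`), and the helper functions `load_csv`, `load_json`, and `select` are all hypothetical and made up for illustration; a real repository will have its own schema and usually its own filtering tools.

```python
import csv
import io
import json
from datetime import date

# Hypothetical sample standing in for a file downloaded from a public data site.
SAMPLE_CSV = """state,date,amount
MA,2015-01-15,1200.50
NY,2015-02-01,800.00
MA,2016-03-10,150.25
MA,2015-07-04,95.00
"""


def to_record(raw):
    """Convert one raw record (all string values) into typed values."""
    return {
        "state": raw["state"],
        "date": date.fromisoformat(raw["date"]),
        "amount": float(raw["amount"]),
    }


def load_csv(text):
    """Parse CSV text with a header row into a list of typed records."""
    return [to_record(raw) for raw in csv.DictReader(io.StringIO(text))]


def load_json(text):
    """Parse a JSON array of objects into the same list of typed records."""
    return [to_record(raw) for raw in json.loads(text)]


def select(rows, state, start, end, min_amount=0.0):
    """Keep rows matching a categorical value, a date range, and a dollar floor."""
    return [
        r for r in rows
        if r["state"] == state
        and start <= r["date"] <= end
        and r["amount"] >= min_amount
    ]


rows = load_csv(SAMPLE_CSV)

# Massachusetts records from 2015 with an amount of at least $100.
ma_2015 = select(rows, "MA", date(2015, 1, 1), date(2015, 12, 31), min_amount=100.0)
for r in ma_2015:
    print(r["state"], r["date"].isoformat(), r["amount"])
```

Because both loaders return the same record shape, the selection step does not care which format the data arrived in, so you can pick whichever format you are most comfortable with.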



ODSC - Open Data Science

Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.