Brace Yourself, Data Cleaning is Coming

Recap on data cleaning

  • Incorrect data types
  • Out-of-domain observations that violate ranges, set restrictions, or regular patterns
  • Lack of uniqueness
  • Improper cross-field dependencies
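
To make the four categories above concrete, here is a minimal sketch in plain Python. The records, the field names (`age`, `country`, `email`, `start`, `end`), and every rule (the age range, the allowed country set, the simplified email pattern) are assumptions invented for illustration, not taken from any real dataset.

```python
import re
from collections import Counter

# Hypothetical records; all field names and values are illustrative assumptions.
records = [
    {"id": 1, "age": "34", "country": "PL", "email": "ann@example.com",
     "start": "2020-01-01", "end": "2020-06-01"},
    {"id": 2, "age": "-5", "country": "XX", "email": "bob_at_example.com",
     "start": "2021-03-01", "end": "2021-01-01"},
    {"id": 2, "age": "abc", "country": "DE", "email": "eve@example.com",
     "start": "2019-05-01", "end": "2019-07-01"},
]

ALLOWED_COUNTRIES = {"PL", "DE", "US"}                      # assumed set restriction
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # deliberately simplified


def find_issues(rows):
    """Return (row_index, description) pairs for every rule a row breaks."""
    issues = []
    id_counts = Counter(row["id"] for row in rows)
    for i, row in enumerate(rows):
        # 1. Incorrect data type: age should parse as an integer.
        try:
            age = int(row["age"])
        except ValueError:
            issues.append((i, "type: age is not an integer"))
            age = None
        # 2. Out-of-domain values: range, set restriction, regular pattern.
        if age is not None and not 0 <= age <= 120:
            issues.append((i, "range: age outside 0..120"))
        if row["country"] not in ALLOWED_COUNTRIES:
            issues.append((i, "set: unknown country"))
        if not EMAIL_PATTERN.match(row["email"]):
            issues.append((i, "pattern: malformed email"))
        # 3. Lack of uniqueness: the id should identify exactly one row.
        if id_counts[row["id"]] > 1:
            issues.append((i, "uniqueness: duplicate id"))
        # 4. Cross-field dependency: end must not precede start
        #    (ISO dates compare correctly as strings).
        if row["end"] < row["start"]:
            issues.append((i, "cross-field: end before start"))
    return issues


for index, problem in find_issues(records):
    print(index, problem)
```

In this toy data the first record passes every check, while the other two trip several rules each. In a real project the rules would come from the data dictionary or the business owners rather than from hard-coded constants.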

Psychological angle

  • Pressure. Have you ever heard that business wants the results NOW while you are stuck with mountains of dirt, little time, and huge expectations? Often we are under pressure to deliver in a short timeframe while data quality issues keep slowing us down.
  • Never-ending story. Have you ever felt like you were playing a game of infinite matryoshka dolls of dirty data? It’s so easy to get the impression that this heavenly process will never end.
  • Manual work. Have you ever dreamed about running even the simplest linear regression while correcting stupid typos? Classical cleaning is so unambitious, so not compatible with the racing minds of the data hackers.
  • Complete lack of fame. Have you ever heard about this super cool AbbyCadabby model for cleaning text? Of course not. Why? Because it has never been created. Cleaning tricks don’t give you endless fame. Actually, they give you no fame.
  • Satisfaction. Have you ever felt the great pride of delivering forecasts with only 3% MAPE? Modeling is the phase that returns results that are able to literally change the business world we work for. That’s huge, that’s worth cramming for.
  • Fame, fame everywhere! Have you ever enjoyed the sight of dropped jaws, or nowadays the likes and comments, in response to your idea for modeling problem X? Great models bring respect and popularity in the data community.
  • Personal development. Have you ever hunted for new NLP models out there to try them out? For a few years now, more and more innovative models have been emerging every month. They are novel, they are exciting, they are game-changing! We want to use and enhance them while keeping our distance from old models and time-consuming, non-developmental cleaning tasks.

Do it Freud style

  • High iterativeness. No novelty here, you think. Yes, we often use scrum, we work in sprints, we have the backlogs. However, I challenge you to cut the process into many small CRISP-DM pieces. Start with a subset of observations, with very few features, even dirty ones (let’s get crazy), and build the model anyway (warning here: don’t show it to the business! Build it for your sanity’s sake). Then clean those few features at hand. Add features, extend the set, clean the extended features, run a model or two, and start the process again. This way, modeling will sweeten the bitterness of data preparation. At the same time, you gain a benchmark model, which is super nice and satisfying for monitoring progress in the project.
  • ML for cleaning! Don’t reject ML for cleaning purposes. Think hard about whether your process really has to be manual, and whether you really can’t replace it with a novel ML model or, even better, with your OWN innovative model!
  • Evangelisation. The need for data cleaning often comes from careless data collection. And that’s where we should act: at the very source of the problem! We should spread the knowledge of how crucial data collection is, and we should fight for better processes and tools. By silently taking the blows that incorrect data collection throws at us, we willingly agree to future frustration and a huge waste of time. Let’s break the cycle!
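
As a small illustration of the "ML for cleaning" idea from the list above, here is a hedged sketch of a one-nearest-neighbour lookup on character bigrams, about the simplest thing that could be called a model. It maps misspelled category values to a known canonical list. The function names, the canonical city list, and the 0.5 similarity threshold are all assumptions chosen for this example. A real project would likely reach for a proper library and a tuned threshold.

```python
def bigrams(text):
    """Set of character bigrams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correct(value, canonical, threshold=0.5):
    """Return the most similar canonical value, or the input unchanged
    when nothing is similar enough (threshold is an assumed cut-off)."""
    grams = bigrams(value)
    best = max(canonical, key=lambda c: jaccard(grams, bigrams(c)))
    return best if jaccard(grams, bigrams(best)) >= threshold else value


# Hypothetical canonical values and dirty inputs.
CITIES = ["Warsaw", "Berlin", "Boston"]
dirty = ["Warsaww", "Berln", "Tokyo"]
cleaned = [correct(value, CITIES) for value in dirty]
print(cleaned)
```

Values that resemble nothing in the canonical list are left untouched instead of being forced into the nearest category, so they can still be reviewed by hand.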

Do it Sherlock Holmes-style

Do it Sesame Street style




ODSC - Open Data Science



Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.