Brace Yourself, Data Cleaning is Coming

ODSC - Open Data Science
8 min read · Apr 29, 2021


If you are already all too familiar with This Crazy Thing Called Data Cleaning, with both the classical and psychological tricks that help, if your hair has already gone grey because of it, or if you are simply seeking fast, fun, and furious nontrivial tricks, I encourage you to go straight to the “Do it Sesame Street style” section. Otherwise, stay for the chapter about…

Recap on data cleaning

Data preparation, which includes cleaning, is an indispensable part of the cross-industry standard process for data mining, CRISP-DM for short.

The reason is simple: machine learning models are devout followers of the “garbage in, garbage out” religion, which essentially means that the dirtier the data they receive as input, the less reliable the output they return. As a consequence, dirty data may lead you to false conclusions that put your business in trouble or, worse, make a total fool of you in front of leadership.

Let’s start with a quick reminder of typical data issues:

  • Incorrect data types
  • Out-of-domain observations, violating ranges, set restrictions, or regular patterns
  • Lack of uniqueness
  • Improper cross-field dependencies

Even that list itself brings back some scary memories…
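To make the list concrete, here is a minimal pandas sketch, on a hypothetical `orders` table, of how each issue type can be detected:

```python
import pandas as pd

# Hypothetical toy table exhibiting all four issue types
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],                                # 2 appears twice
    "quantity": ["5", "-1", "3", "10"],                      # numbers stored as strings
    "country": ["PL", "UK", "XX", "DE"],                     # "XX" is not a valid code
    "email": ["a@b.com", "not-an-email", "c@d.com", "e@f.com"],
    "ship_date": pd.to_datetime(["2021-01-05", "2021-01-06", "2021-01-06", "2021-01-10"]),
    "delivery_date": pd.to_datetime(["2021-01-07", "2021-01-04", "2021-01-08", "2021-01-12"]),
})

# Incorrect data types: quantity should be numeric
orders["quantity"] = pd.to_numeric(orders["quantity"], errors="coerce")

# Out-of-domain observations: range, set, and pattern violations
bad_range = orders[orders["quantity"] < 0]
bad_set = orders[~orders["country"].isin({"PL", "UK", "DE"})]
bad_pattern = orders[~orders["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+")]

# Lack of uniqueness: duplicated keys
dupes = orders[orders["order_id"].duplicated(keep=False)]

# Improper cross-field dependencies: delivery before shipment
bad_dependency = orders[orders["delivery_date"] < orders["ship_date"]]
```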


Psychological angle

I challenge us to step away from coding for a sec and analyze ourselves. Why do we run away from Cleaning? Why does it frustrate us so much?

[second for thought]

  • Pressure. Have you ever heard that the business wants the results NOW while you are stuck with mountains of dirt, little time, and huge expectations? We are often under pressure to deliver in a short timeframe while facing obstacles, in the form of data quality issues, that slow us down.
  • Never-ending story. Have you ever felt like you were playing an infinite game of dirty-data matryoshka dolls? It’s so easy to get the impression that this heavenly process will never end.
  • Manual work. Have you ever dreamed about running even the simplest linear regression while correcting stupid typos? Classical cleaning is so unambitious, so incompatible with the racing minds of data hackers.
  • Complete lack of fame. Have you ever heard about this super cool AbbyCadabby model for cleaning text? Of course not. Why? Because it has never been created. Cleaning tricks don’t give you endless fame. Actually, they give you no fame.

On the contrary… why do we enjoy modeling so much?

[second for thought]

  • Satisfaction. Have you ever felt the great pride of delivering forecasts with only 3% MAPE? Modeling is the phase that returns results able to literally change the business we work for. That’s huge, and that’s worth cramming for.
  • Fame, fame everywhere! Have you ever enjoyed the sight of dropped jaws (or, nowadays, the likes and comments) at your idea for modeling problem X? Great models bring respect and popularity in the data community.
  • Personal development. Have you ever hunted for new NLP models out there just to try them out? For a few years now, more and more innovative models have been emerging every month. They are novel, they are exciting, they are game-changing! We want to use and enhance them while steering clear of old models and time-consuming, non-developmental cleaning tasks.

Being aware of these psychological barriers, we’re already equipped with a few aces up our sleeve!


Do it Freud style

Let’s balance out the demotivating factors of Cleaning:

  • High iterativeness. No novelty here, you think. Yes, we often use scrum, we work in sprints, we have backlogs. However, I challenge you to cut the process into many small CRISP-DM pieces. Start with a subset of observations, with very few features, even dirty ones (let’s get crazy), and build the model anyway (a warning here: don’t show it to the business! Build it for your sanity’s sake; see the sketch after this list). Then clean those few features at hand. Add features, extend the set, clean the extended features, run a model or two, and start the process again. This way, modeling will sweeten up the bitterness of data preparation. At the same time, you gain a benchmark model, which is super nice and satisfying for monitoring progress in the project.
  • ML for cleaning! Don’t reject ML for cleaning purposes. Think hard about whether your process really has to be manual, whether you truly can’t trick it with a novel ML model or, even better, with your OWN innovative model!
  • Evangelisation. The necessity for data cleaning often comes from frivolous data collection. And that’s where we should act: at the very source of the problem! We should spread the knowledge about how crucial data collection is, and we should fight for the improvement of processes and tools to make it better. By silently taking the blows that incorrect data collection throws at us, we willingly agree to future frustration and a huge waste of time. Let’s break the cycle!
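As a rough illustration of the first point, here is a minimal sketch of such a quick-and-dirty benchmark; the file, features, and target are all hypothetical placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("raw_data.csv")                           # hypothetical raw extract

subset = df.sample(n=min(1000, len(df)), random_state=0)   # few observations...
features = ["feature_a", "feature_b"]                      # ...and very few features

# Knowingly crude: coerce to numeric, flag missing values with a sentinel
X = subset[features].apply(pd.to_numeric, errors="coerce").fillna(-1)
y = subset["target"]

benchmark = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(f"Benchmark accuracy on the dirty subset: {benchmark:.3f}")  # for your eyes only!
```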

With the whining tamed, we can move on to technical aspects (yay!).

Do it Sherlock Holmes-style

Why is it non-trivial to implement automatic cleaning, especially in the era of Auto ML?

Among other reasons, data quality issues have various origins, and each origin calls for a different action. As a consequence, it is often necessary to put on a detective’s coat, discover the truth behind the observed quirks, and only then correct them accordingly.

Below are two interesting examples of data issues. Note that once correctly identified, they are quite easy and pleasant to fix.

The first chart presents a series of COVID tests with seemingly two outliers. But are they really outliers? When we look at the data more closely, it becomes quite clear that it is just a matter of a simple human error introduced while typing the data into an Excel cell 😉 Maybe one zero too many? We can also see that the error was detected and a correction introduced just the next day. Instead of removing the outliers or replacing the values with the series mean/median/another statistic, it is most reasonable to simply take the correction into account.
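Here is a minimal sketch of that repair on a hypothetical daily series; the idea is to trust the correction and fix the typo, not to delete the point:

```python
import pandas as pd

# Hypothetical daily test counts: 13000 looks like 1300 with one zero too many
tests = pd.Series(
    [1200, 1250, 13000, 1280, 1310],
    index=pd.date_range("2021-03-01", periods=5),
)

# A point roughly an order of magnitude above its neighbours is suspicious
neighbours = (tests.shift(1) + tests.shift(-1)) / 2
suspicious = tests / neighbours > 5

# Remove the spurious zero instead of dropping or mean-imputing the point
tests[suspicious] = tests[suspicious] / 10
```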

The second graphic presents the biggest cities of the United Kingdom. They should all be within the country’s boundaries, but a few aren’t, again with a very logical cause: latitude and longitude were swapped.
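A minimal sketch of that fix, on a hypothetical `cities` table: if a point falls outside a rough UK bounding box but its swapped coordinates fall inside, the fields were almost certainly swapped:

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["London", "Manchester", "Leeds"],
    "lat": [51.51, -2.24, 53.80],        # Manchester has lat and lon swapped
    "lon": [-0.13, 53.48, -1.55],
})

LAT_MIN, LAT_MAX = 49.9, 60.9            # rough bounding box of the UK
LON_MIN, LON_MAX = -8.6, 1.8

inside = cities["lat"].between(LAT_MIN, LAT_MAX) & cities["lon"].between(LON_MIN, LON_MAX)
swapped_inside = cities["lon"].between(LAT_MIN, LAT_MAX) & cities["lat"].between(LON_MIN, LON_MAX)

# Swap latitude and longitude back where only the swapped version makes sense
to_fix = ~inside & swapped_inside
cities.loc[to_fix, ["lat", "lon"]] = cities.loc[to_fix, ["lon", "lat"]].values
```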

Even with our Sherlock coats on, the above tasks are still tedious and manual. Let’s move on to the third family of solutions.

Do it Sesame Street style

If you are already a pro and long ago introduced the basic strategies, I would like to invite you to a brainstorm about more advanced solutions. Have you ever managed to get data clean cleverly, fast, and effectively, with little manual work? What were your tricks? Have you used them once or multiple times? In what domains?

While I am curious about your hacks, I will share a model for cleaning text of my own creation. Let’s focus on our favorite issue with text data…

Whenever we get text data introduced by humans, it becomes a headache. Add a few languages to it and you can start ordering melissa tea. But… wait. Google Docs or Word corrects us, quite accurately, when we make mistakes, which is pretty comfy. Surely there is a way to leverage the same thinking and techniques to correct our beauties!

To explain the method, let me use a simple example based on a crazily popular “starter” dataset: the Titanic data. I’ve introduced some quality issues to the ‘Sex’ label.
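The exact corruptions live in the notebook linked below; the following sketch only illustrates the flavour of the mess (typos, casing, abbreviations, other languages):

```python
import numpy as np
import pandas as pd

titanic = pd.read_csv("titanic.csv")      # the classic Kaggle "starter" dataset

# Hypothetical dirty variants of the two clean labels
messy_variants = {
    "male":   ["male", "MALE", "man", "M", "mael", "mężczyzna", "homme"],
    "female": ["female", "Female", "woman", "F", "femal", "kobieta", "femme"],
}

rng = np.random.default_rng(0)
titanic["Sex"] = [rng.choice(messy_variants[s]) for s in titanic["Sex"]]
```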

Then, using embeddings from a multilingual SBERT model (BERT + a siamese network; see the Sentence-BERT paper by Reimers and Gurevych) and PCA on top, a very nice translation into the desired categories was obtained. When used for modeling, literally the same results were achieved, regardless of whether we used the original labels or the messy labels translated into PCA vectors via embeddings!
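A minimal sketch of that pipeline, continuing the snippet above; the specific pretrained model is my assumption here, so check the notebook below for the exact setup:

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Embed the messy labels: similar meanings land close together,
# regardless of typos, casing, or language
embeddings = model.encode(titanic["Sex"].tolist())

# Compress to a couple of components; these become the model features
# in place of the original (clean or messy) label
sex_features = PCA(n_components=2).fit_transform(embeddings)
```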

If you are interested in details, here is the full script: https://github.com/lady-pandas/cleaning-is-coming/blob/main/category_cleaning.ipynb

The second example is about the mess that comes from missing data. In recruitment interviews, the most frequent answer to questions about missing values is to impute the series mean/median/another statistic. And while that is not the worst of the options out there, there are tons of cases where we can do much better!

Let’s examine the series of total COVID vaccinations for one of the countries. It is pretty clear the statistics are not gathered every day. At the very first glance at the series, we immediately know what should be done, and I bet you agree with me on the answer. Let’s interpolate and admire the effect!
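A minimal sketch of the fix, assuming a hypothetical CSV with a date column and a cumulative total_vaccinations column:

```python
import pandas as pd

vacc = pd.read_csv("vaccinations.csv", parse_dates=["date"]).set_index("date")
vacc = vacc.asfreq("D")                  # expose the missing days as NaN

# Cumulative totals grow steadily between reports,
# so time-based linear interpolation is a natural fill
vacc["total_vaccinations"] = vacc["total_vaccinations"].interpolate(method="time")
```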

Summary

Being aware of the length of this post, that’s all for now 😉 If this topic keeps you awake at night and you are interested in knowing more, I will be talking about other Sesame Street style cleaning techniques at this year’s ODSC Europe in my talk, “The Colours of Cleaning”!

About the author/ODSC Europe 2021 speaker, Marta Markiewicz:

Currently a Senior (Big) Data Scientist at InPost and a Lecturer at the Wroclaw University of Economics and Business, previously Head of Data Science at Objectivity, with a background in mathematical statistics. For almost 10 years, she has been discovering the potential of data in various business domains: medical data, retail, HR, finance, aviation, real estate, logistics, … She deeply believes in the power of data in every area of life. A writer of articles, a conference speaker, and, privately, a passionate dancer and creator of hand-made jewelry.
