Want to Prevent Data Emergencies? Clean as You Go

ODSC - Open Data Science
5 min read · Apr 5, 2024

Editor’s note: Eric Callahan is a speaker for ODSC East this April 23–25. Be sure to check out his talk, “Clean as You Go: Basic Hygiene in the Modern Data Stack,” there!

As a parent, I’m always trying to teach my kids to “clean as they go”. When they’re running around making messes, I remind them that if they pick up the thing they just dropped now instead of leaving it on the floor, they won’t have to put away a pile of 12 things later. It’s a basic concept that I think most of us try to embody in our regular lives (to varying degrees of success, of course).

So shouldn’t we make it a priority as data engineers?

The culture of “Move Fast and Break Things” has pressured us into closing tickets as quickly as possible, which encourages an “oh, I’ll clean it up later” mindset. That mindset leads to long-term headaches and data “fire drills”: moments where a little problem has suddenly grown into a massive bug, or a misunderstanding between peers has become a big, confusing knot that needs to be resolved immediately. Some of us might exist in a state of perpetual fire drills, where everything feels like an emergency all the time.

But we don’t have to live this way! And the answer is as simple as tidying systems as we go, rather than waiting for problems to develop. Investing in data quality initiatives is the best way to make sure the data you’re working with is useful. To keep things running smoothly, it’s good to establish strategies to keep your data…

Useful

To keep data useful, the priority is putting measures in place to ensure that the whole team understands how data is being collected and how it should be used. One strategy is establishing a data contract.

A data contract is a process designed to ensure that data producers and data consumers work together to store and log data correctly. Data contracts are unique to each organization’s needs, but they might cover what data is extracted, who is responsible for the data, and what types of metadata should be attached.
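As a rough illustration, the three ingredients mentioned above (what is extracted, who is responsible, and what metadata is attached) could be written down in code so that both producers and consumers can check records against them. This is only a minimal sketch: the `DataContract` class, the `violations` helper, and the `fruit_events` dataset with its owner and field names are all hypothetical, and a real contract would reflect your own organization's needs.

```python
from dataclasses import dataclass


@dataclass
class DataContract:
    """Hypothetical contract agreed between a data producer and its consumers."""
    dataset: str
    owner: str                     # who is responsible for the data
    fields: dict                   # what is extracted: column name -> expected type
    required_metadata: tuple = ()  # metadata keys every record must carry


def violations(contract, record, metadata):
    """Return a list of human-readable contract violations (empty means OK)."""
    problems = []
    for name, expected_type in contract.fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    for key in contract.required_metadata:
        if key not in metadata:
            problems.append(f"missing metadata: {key}")
    return problems


# Illustrative contract; the dataset, owner, and field names are made up.
fruit_events = DataContract(
    dataset="fruit_events",
    owner="ingestion-team",
    fields={"event_id": int, "category": str},
    required_metadata=("source", "extracted_at"),
)
```

A producer could call `violations(fruit_events, record, metadata)` before publishing a record and treat a non-empty list as a reason to stop and talk to the named owner.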

High quality

High-quality data is data that is up-to-date, labeled, cleaned, and interpretable. In other words, high-quality data is information that can paint a picture and show us larger patterns. Low-quality data, on the other hand, is noisy, confusing, and disorganized. The easiest way to ensure that your data is high quality (and not a big mess) is to set up cleaning measures right from the beginning.

As an example: we’ve all had the experience of looking at a data set and seeing multiple labels for the same kind of data. For instance, if the category is supposed to be “orange”, you might look at the set and see “org”, “color_orange”, “or_ange”, or any number of other variations that make the information inscrutable to anyone who is trying to use it. This chaotic labeling makes the data low quality because no one can do anything with it.

To make sure that data is high quality, we can take a “shift left” approach to cleaning. In other words, rather than trying to organize a data set once it’s already in the database, we can organize it as soon as it enters the stack. In the case of our chaotic labels, we can set up the stack to immediately filter out data that’s labeled anything other than “orange”. Taking the time to put cleaning measures in place at the very beginning of the stack prevents a lot of future busy-work.
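To make the “shift left” idea concrete, here is a minimal sketch of a check that runs when records enter the stack, before anything is loaded. It assumes records arrive as dictionaries with a `category` key; the `split_on_ingest` function, the `ALLOWED_CATEGORIES` set, and the sample records are hypothetical names used only for illustration.

```python
ALLOWED_CATEGORIES = {"orange"}  # assumption: the only label the team agreed on


def split_on_ingest(records, allowed=ALLOWED_CATEGORIES):
    """Separate incoming records into accepted and rejected before loading."""
    accepted, rejected = [], []
    for record in records:
        if record.get("category") in allowed:
            accepted.append(record)
        else:
            rejected.append(record)
    return accepted, rejected


incoming = [
    {"id": 1, "category": "orange"},
    {"id": 2, "category": "org"},
    {"id": 3, "category": "color_orange"},
    {"id": 4, "category": "or_ange"},
]
accepted, rejected = split_on_ingest(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")  # 1 accepted; 3 rejected
```

Returning the rejected records instead of silently dropping them is a design choice in this sketch: it gives the producer something to inspect, so the bad labels can be fixed at the source.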

Accessible

When your data is accessible, you can find what you need, when you need it, without having to dig. Making data accessible requires a ground-up approach: organizing your data stack efficiently from the beginning, so that you can pull what you want as soon as you want it.

Simplifying data stacks to ensure data accessibility is a great first step to preventing future emergencies. Cataloging tools like Secoda can help uncomplicate your stack and keep things running smoothly.

Infrastructure work often falls to the bottom of the to-do list because it may not pose immediate problems. However, putting in the time to get your stack in order will prevent lots of urgent problems in the future.

As your most annoying uncle might say, an ounce of prevention is worth a pound of cure. Learn more strategies to prevent data fire drills at my talk on April 23rd, “Clean as You Go: Basic Hygiene in the Modern Data Stack.”

About the Author:

Eric Callahan serves as Principal, Data Solutions at Pickaxe Foundry. With over 15 years of experience, he assists clients in resolving various data challenges. His expertise encompasses data engineering, analytics, machine learning, and experimentation, gained through roles focused on both product and marketing. This diverse background provides him with a unique perspective on the interconnectedness of the data ecosystem. He actively participates as a panelist and speaker, sharing his insights on best practices for modern data stacks.

Originally posted on OpenDataScience.com
