Hands-on Data-Centric AI: Data Preparation Tuning — Why and How?

ODSC - Open Data Science
4 min read · Apr 25, 2023

Editor’s note: Fabiana Clemente is a speaker for ODSC East 2023 this May. Be sure to check out her talk, “Hands-on Data-Centric AI: Data preparation tuning — why and how?” there!

Machine Learning is applied to an increasingly large number of applications, in industries ranging from finance to healthcare. Nevertheless, we haven’t yet nailed the process of building a successful and business-meaningful AI solution. The term Data-Centric AI was coined by Andrew Ng in 2021, and it shifted the focus in developing data-driven solutions.

We can describe Data-Centric AI as a paradigm of AI system development in which data is the element we iterate on. Instead of investing a lot of time in hyperparameter tuning, we optimize the data toward a business objective.

Adopting a more “data-centric” approach while developing AI solutions means spending more time managing, profiling, augmenting, and curating the data efficiently and in a reproducible process.

But adopting a more data-centric perspective does not mean you won’t play around with models: the choice of algorithm, architecture, and hyperparameters. Given that data has higher stakes, it only means that you should spend most of your development effort on improving your data quality.

The continuous process of increasing the quality of our data through data cleansing and, of course, data acquisition is what we call the “data cycle” of machine learning development, and it includes the following steps:

  • Understand and explore the feasibility of the data quality for model development
  • Apply data cleaning and augmentation strategies
  • Run several models to understand the data impact on the results
  • Continuously improve the models’ outcome based on the data preparation

As expected, the first step is to draft a profile of the available data immediately. Profiling is an excellent tool to set up a working baseline to improve dataset quality. We should consider two different types of profiling to have a more holistic perspective: unsupervised and supervised.

Unsupervised profiling includes analyses such as distribution histograms, mutual information, and correlations. It helps answer questions such as: Do I have noisy data? How does the missing data behave? Are my distributions skewed? Do I have relevant variables?
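As a minimal sketch of these checks, the snippet below profiles a small toy dataset with pandas, numpy, and scikit-learn. The dataset, the column names (`signal`, `skewed`, `target`), and the 10% missing rate are assumptions made up for illustration; a tool such as ydata-profiling would produce a much richer report.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical toy dataset: one informative feature, one skewed noise feature.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
df = pd.DataFrame({
    "signal": signal,                    # drives the target
    "skewed": rng.exponential(size=n),   # right-skewed, unrelated to the target
    "target": (signal > 0).astype(int),
})
# Blank out 10% of one column to simulate missing data.
df.loc[df.sample(frac=0.1, random_state=0).index, "skewed"] = np.nan

features = ["signal", "skewed"]

missing_share = df[features].isna().mean()   # fraction of missing values per column
skewness = df[features].skew()               # asymmetry of each distribution
correlations = df[features].corr()           # pairwise Pearson correlations
hist_counts, bin_edges = np.histogram(df["signal"], bins=10)  # histogram as counts

# mutual_info_classif does not accept NaN, so fill with the median for this check only.
filled = df[features].fillna(df[features].median())
mutual_information = pd.Series(
    mutual_info_classif(filled, df["target"], random_state=0),
    index=features,
)

print(missing_share, skewness, correlations, mutual_information, sep="\n\n")
```

Reading the output against the questions above: a high missing share or strong skew flags columns that need attention, while mutual information close to zero suggests a variable carries little information about the target.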

Supervised profiling assumes the choice of a baseline model that is trained on our training set. This step shows how far Machine Learning performance can go given the baseline data quality.
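A baseline of this kind could be sketched as follows. The synthetic dataset from `make_classification`, the choice of logistic regression, and the helper name `baseline_score` are assumptions for illustration, not a prescribed setup.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice this would be your own training set.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)

def baseline_score(model, X, y):
    """Mean 5-fold cross-validated accuracy of a model on (X, y)."""
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# A trivial model gives the floor; a simple real model gives the working baseline.
floor = baseline_score(DummyClassifier(strategy="most_frequent"), X, y)
baseline = baseline_score(LogisticRegression(max_iter=1000), X, y)

print(f"majority-class floor: {floor:.3f}")
print(f"logistic regression baseline: {baseline:.3f}")
```

Keeping the model and the cross-validation settings fixed means that later changes in the score can be attributed to changes in the data.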

Data cleaning is where the heavy lifting towards improved data quality and better Machine Learning model performance happens. In this step, we must understand and build a reproducible flow to deal with faulty data behaviors such as noisy and missing data, imbalanced classes, and bias. Synthetic data generation is a powerful tool to mitigate some of these behaviors: it can be used to augment data, balance classes, de-bias, and add variability so that models generalize better.
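The sketch below shows one possible shape for such a reproducible cleaning step. It uses median imputation and plain random oversampling as a simple stand-in for synthetic data generation, which is a different and more capable technique. The function name `clean_and_balance` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

def clean_and_balance(df, target, seed=0):
    """Impute missing numeric values with the median, then oversample
    minority classes until every class matches the largest one."""
    features = df.columns.drop(target)
    cleaned = df.copy()
    cleaned[features] = SimpleImputer(strategy="median").fit_transform(df[features])

    largest = cleaned[target].value_counts().max()
    parts = []
    for _, group in cleaned.groupby(target):
        extra = largest - len(group)
        if extra > 0:
            # Keep the original rows and add randomly duplicated ones.
            duplicates = resample(
                group, replace=True, n_samples=extra, random_state=seed
            )
            group = pd.concat([group, duplicates])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)

# Hypothetical toy data with missing values and a 4-to-2 class imbalance.
raw = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [0.5, np.nan, 1.5, 2.0, 2.5, 3.0],
    "label": [0, 0, 0, 0, 1, 1],
})
prepared = clean_and_balance(raw, target="label")
print(prepared)
```

Because the seed is fixed, the same input always yields the same prepared dataset. In a real project, imputation and oversampling should be fitted on the training split only, so that no information leaks into the evaluation data.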

After all the data preparation, it is time to retrain our baseline model. Have we achieved the expected performance? If not, it is time to return to work and do more tweaking and cleaning. The whole data preparation process should be reproducible, comparable, and, above all, versionable.
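One lightweight way to make runs comparable and versionable is to fingerprint each version of the dataset and log it next to the score of the same fixed model. This is only a sketch under assumptions: the helpers `dataset_fingerprint` and `evaluate`, the toy data, and the "drop_noise" preparation step are invented for illustration, and dedicated data versioning tools cover far more.

```python
import hashlib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def dataset_fingerprint(df):
    """Short content hash that changes when column names or values change."""
    digest = hashlib.sha256(",".join(map(str, df.columns)).encode())
    digest.update(pd.util.hash_pandas_object(df, index=True).to_numpy().tobytes())
    return digest.hexdigest()[:12]

def evaluate(df, target):
    """Score the same fixed model on a given version of the data."""
    X, y = df.drop(columns=target), df[target]
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Hypothetical data versions: raw, and after dropping an uninformative column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
v1 = pd.DataFrame({"x": x, "noise": rng.normal(size=200), "y": (x > 0).astype(int)})
v2 = v1.drop(columns="noise")

runs = pd.DataFrame([
    {"step": step, "data_version": dataset_fingerprint(v), "accuracy": evaluate(v, "y")}
    for step, v in [("raw", v1), ("drop_noise", v2)]
])
print(runs)
```

Each row ties a score to the exact data it was computed on, so a later run can be compared against an earlier one without guessing which preparation steps were applied.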

But what tools and methods does one adopt to start the journey in Data-Centric AI? That’s the question that we at YData strive to answer for the broader audience of data scientists. For that reason, we are building not only a community but also a set of open-source tools that can help you get started, like ydata-profiling and ydata-synthetic.

Join me in my upcoming ODSC workshop on “Hands-on Data Centric AI: Data preparation tuning — why and how?” this May to explore the entire flow of data preparation and machine learning model development from a more data-centric perspective. We will leverage some open-source essentials such as scikit-learn, pandas, and numpy combined with the data-centric must-haves like profiling and synthetic data generation.

About the author/ODSC East 2023 speaker:

Fabiana Clemente is the co-founder and CDO of YData, combining Data Understanding, Causality, and Privacy as her main fields of work and research, with the mission to make data actionable for organizations. Passionate about data, Fabiana has vast experience leading data science teams in startups and multinational companies. She is the host of the “When Machine Learning meets privacy” podcast, a guest speaker at Datacast and Privacy Please, and a previous WebSummit speaker, and she was recently awarded “Founder of the Year” by the South Europe Startup Awards.

Originally posted on OpenDataScience.com

