Building a Production-Level Data Pipeline Using Kedro

ODSC - Open Data Science
3 min read · Sep 17, 2020

Suppose you are a self-taught data scientist without much experience in software development. One morning, your senior executive asks you for an ad-hoc analysis (a perk of the job), and when you deliver it, she thanks you for the useful insights it gives her planning. Great!

Three months down the line, the newly promoted executive, now your CEO, asks you to re-run the analysis for the next planning meeting… and you cannot. The code is broken because you have overwritten some of its key sections, and you cannot remember the exact environment you used at the time. Or maybe the code is fine, but it sits in one big notebook with all the file paths hardcoded, meaning you have to laboriously check and change each one for the new data inputs while deadlines shift and attention moves elsewhere.

Source: https://xkcd.com/2054/

What you needed when you started out was a tool that applies software engineering best practices to data and machine-learning pipelines: something to organise your project, whether it is a single-user project running in a local environment or a team collaboration on an enterprise-level project. What you needed was Kedro.

What is Kedro?

Kedro is an easy-to-use, open-source Python workflow library for data scientists and data engineers that, we believe, will before long be the industry standard for developing production-ready code. It lets you create portable data and machine learning pipelines, provides a standardised approach to collaboration, and raises productivity across the board, so a team can keep building scalable, deployable, and reproducible code even as distractions arise and deadlines shift. We consider it the bridge between machine learning and software engineering.
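To make that concrete, here is a minimal sketch of how a Kedro pipeline is assembled. The function and dataset names are illustrative rather than taken from the workshop:

    import pandas as pd
    from kedro.pipeline import Pipeline, node

    def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
        # Nodes wrap plain Python functions, so the transformation
        # logic stays easy to unit-test on its own.
        companies["iata_approved"] = companies["iata_approved"] == "t"
        return companies

    def create_pipeline(**kwargs) -> Pipeline:
        # "companies" and "preprocessed_companies" are dataset names
        # that Kedro resolves through its Data Catalog at run time.
        return Pipeline(
            [
                node(
                    preprocess_companies,
                    inputs="companies",
                    outputs="preprocessed_companies",
                ),
            ]
        )

Because each node declares its inputs and outputs by name, Kedro can work out the execution order itself and run the same pipeline unchanged on a laptop or in production.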

Often referred to as the React or Django of data science, Kedro lets the entire team contribute snippets of knowledge without the mental overhead of understanding the whole project.

Due to its modularity, Kedro is suitable for a wide range of applications, ranging from single-user projects to enterprise-level software driving business decisions backed by machine learning models.
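As one example of that modularity, the hardcoded-file-path problem from the notebook story above is handled by Kedro's Data Catalog, which keeps all I/O configuration out of the analysis code. In a project this normally lives in a conf/base/catalog.yml file; the sketch below uses the equivalent Python API, again with illustrative dataset names and paths:

    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalog

    # Every file path lives in one place; pipeline code refers to
    # datasets only by name, never by path.
    catalog = DataCatalog(
        {
            "companies": CSVDataSet(filepath="data/01_raw/companies.csv"),
            "preprocessed_companies": CSVDataSet(
                filepath="data/02_intermediate/preprocessed_companies.csv"
            ),
        }
    )

    companies = catalog.load("companies")  # loads the CSV as a DataFrame
    catalog.save("preprocessed_companies", companies)

Pointing the analysis at new data inputs then means editing one catalog entry rather than hunting through a notebook for hardcoded paths.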

Kedro’s pipeline visualisation tool, named kedro-viz
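To try it yourself, kedro-viz is distributed as a plugin: install it with pip install kedro-viz, then run kedro viz from your project's root directory to open an interactive graph of your pipeline's nodes and datasets in the browser.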

What do you need for the workshop?

Attendees are expected to know basic Python (3.6+) and to be comfortable with the command line (terminal). They should also have an interest in data science and in bringing their data science code up to a high standard.

In our workshop at ODSC Europe, “Building a Production-level Data Pipeline Using Kedro,” we will talk about the emergence of MLOps and production-level data pipelines. We will discuss the software principles that data engineers and data scientists should consider, and we will see how Kedro fits into the workflow for creating robust and reproducible data pipelines. Finally, we will end with a demonstration of how to build and visualise your data and ML pipelines with an example dataset.

We look forward to seeing you there!

About the Author: Kiyoshito Kunii, Software Engineer at QuantumBlack (LinkedIn | GitHub)

Kiyo is a software engineer at QuantumBlack, an advanced analytics firm operating at the intersection of strategy, technology, and design to improve performance outcomes for organisations. Kiyo is one of the core contributors and maintainers of Kedro, a Python library that implements software engineering best practices for data and ML pipelines.

He holds an MSc in Computing Science from Imperial College London, and an MA in Economics from The University of Edinburgh.
