Building a Robust Data Pipeline with the “dAG Stack”: dbt, Airflow, and Great Expectations

Why you should test your data

An open-source data engineering stack for robust data pipelines

  • dbt (data build tool) is a framework that allows data teams to quickly iterate on building data transformation pipelines using templated SQL.
  • Apache Airflow is a workflow orchestration tool that enables users to define complex workflows as “DAGs” (directed acyclic graphs) made up of various tasks, as well as schedule and monitor execution.
  • Great Expectations is a python-based open-source data validation and documentation framework.
https://odsc.com/boston/#register

What is Great Expectations?

expect_column_values_to_be_between(
column=”passenger_count”, min_value=1, max_value=6
)

Putting it all together

import airflow
from airflow_dbt import DbtRunOperator
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator
default_args = {
"owner": "Airflow",
"start_date": airflow.utils.dates.days_ago(1)
}
dag = airflow.DAG(
dag_id='sample_dag',
default_args=default_args,
schedule_interval=None,
)
def load_source_data():
# Implement load to database
def publish_to_prod():
# Implement load to production database
task_validate_source_data = GreatExpectationsOperator(
task_id='validate_source_data',
checkpoint_name='source_data.chk',
dag=dag
)
task_load_source_data = PythonOperator(
task_id='load_source_data',
python_callable=load_source_data,
dag=dag,
)
task_validate_source_data_load = GreatExpectationsOperator(
task_id='validate_source_data_load',
checkpoint_name='source_data_load.chk',
dag=dag
)
task_run_dbt_dag = DbtRunOperator(
task_id='run_dbt_dag',
dag=dag
)
task_validate_analytical_output = GreatExpectationsOperator(
task_id='validate_analytical_output',
checkpoint_name='analytical_output.chk',
dag=dag
)
task_publish = PythonOperator(
task_id='publish',
python_callable=publish_to_prod,
dag=dag
)
task_validate_source_data.set_downstream(task_load_source_data)
task_load_source_data.set_downstream(task_validate_source_data_load)
task_validate_source_data_load.set_downstream(task_run_dbt_dag)
task_run_dbt_dag.set_downstream(task_validate_analytical_output)
task_validate_analytical_output.set_downstream(task_publish)

Wrapping Up on Great Expectations

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
ODSC - Open Data Science

ODSC - Open Data Science

94K Followers

Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.