Are All Monoliths Bad?

ODSC - Open Data Science
Apr 16, 2024

Editor’s note: Elliott Cordo is a speaker for ODSC East this April 23–25. Be sure to check out his talk, “Data Pipeline Architecture — Stop Building Monoliths,” there!

The simple answer is no. When building any software system, if complexity is low and the engineering team is small, a monolith may be a great place to start. Under those same conditions, a more complex microservice implementation can be a pitfall, a form of premature optimization.

However, at a certain level of complexity, monoliths start forming serious cracks in terms of system stability and productivity, especially when you have large, often federated teams working on the same system.


Data Frustration

My career has been nearly equal parts software and data engineering, although data has continually pulled me in. I absolutely love building data platforms, both for the technical challenge and for the intimacy of working with every nook and cranny of a business's data and processes.

However, on the tech side of things, I'm often disappointed by the engineering maturity of the data platforms being built. I'd say my beloved craft lags at least a decade behind more generalized software engineering. Yes, we have some new tools, which we largely consider to comprise the "Modern Data Stack." However, the way most organizations use them, and to some extent the limitations of the tools themselves, has resulted in centralized, fragile, monolithic architectures.

Major platform components become huge single points of failure as they host large portions, or the entirety, of analytic processing. Developer experience and productivity degrade as the codebase becomes so large that it is difficult to make changes. When developers do make changes, they risk merge or dependency conflicts, or introducing unintended bugs.

The pain level of an Airflow environment with 500+ DAGs can be roughly equivalent to a bloated Django project with a similar number of modules. It immediately raises the question: does this all need to be one thing?

Team Organization

At one time the norm was for the data stack to be wholly owned by one central team. Due to the difficulty of scaling that model to large teams, organizations started splitting up responsibilities vertically, i.e., an ingest team, data lake team, data quality team, and serving team.

Although this allowed organizations to organize into more realistically sized teams and projects, it did not make work go any faster, because it introduced cross-team dependencies and impedance for any data product outcome we wanted to drive. And the "smaller" platform components were still quite large (e.g., a single ingest project or data lake codebase).

Many organizations are now, in my opinion, rightly moving toward domain- or outcome-based teams that, in addition to software development, own data concerns from both a product and an analytics perspective.

As many of you have probably heard, a concept called data mesh has been gaining traction; it attempts to address exactly these technical and organizational concerns.

Technology: “Modern Data Stack”

There are many variations, but most consider the modern data stack to be a combination of Airflow, DBT, and a cloud "Data Warehouse" engine (which here includes the "Data Lake"). These tools themselves are not necessarily to blame for monolithic architecture, although features that support multi-project and federated development are still emerging.

Airflow

Large monolithic Airflow environments are quite common, and perhaps the most problematic. As a Python project, an Airflow deployment can bloat quickly, and you can end up in dependency nightmares. The first question we should ask ourselves is: does this have to be a single Airflow instance? Almost always the answer is no, as Airflow DAGs tend to be largely independent. More often, it is inadequate investment in infrastructure as code (IaC) and CI/CD that makes multiple deployments difficult.

As far as what you run inside of Airflow, there are plenty of options to keep the codebase small and domain-specific. A really powerful tool is of course the KubernetesPodOperator, which lets you move code and logic out of the Airflow environment and run it in a container instead.
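As a minimal sketch of this pattern (the DAG name, image, and namespace below are hypothetical placeholders, not from the original post), a domain-specific DAG can delegate all of its real work to a container, keeping the Airflow project itself thin:

```python
# Hypothetical example: all names, images, and namespaces are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="orders_domain_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # The business logic lives entirely in the container image,
    # so the Airflow environment carries no domain dependencies.
    ingest = KubernetesPodOperator(
        task_id="ingest_orders",
        name="ingest-orders",
        namespace="data-pipelines",
        image="registry.example.com/orders-ingest:1.4.2",
        cmds=["python", "-m", "ingest.run"],
        get_logs=True,
    )
```

Because the DAG is just a thin scheduling shell, each domain team can version and release its container image independently of the Airflow deployment.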

See this post for additional details.


DBT & Data Warehouse Engine

Just like Airflow, DBT and the data warehouse engine almost always become monolithic. Unfortunately, this area is a bit pricklier, in both tooling and organizational practice. DBT, however, can with some effort be implemented as a multi-repo architecture, allowing separation along domain boundaries.
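As a rough sketch of multi-repo separation (the repository URL and project names here are hypothetical), one domain's DBT project can pull in another as a package, so each repo stays small and domain-scoped:

```yaml
# packages.yml in a hypothetical "analytics" domain repo
packages:
  - git: "https://github.com/acme-data/dbt-orders-domain.git"
    revision: "1.3.0"   # pin a released tag, not a moving branch
```

Downstream models can then reference the upstream project's models using the two-argument form of `ref`, e.g. `{{ ref('orders_domain', 'fct_orders') }}`, keeping the domain boundary explicit in the code.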

The data warehouse engine itself tends, by default, to be a single environment; however, most platforms do enable separation of storage and processing, as well as multi-warehouse architecture with data sharing. This is more a failure of planning and data governance (e.g., security, data contracts) than a limitation of the technology.

Learning More About Monoliths at ODSC East 2024

I hope these ideas and tips are helpful. I look forward to diving deeper at my upcoming ODSC talk “Stop Building Monoliths.”

About the Author

Elliott is an expert in data engineering, data warehousing, information management, and technology innovation with a passion for helping transform data into powerful information. He has more than a decade of experience implementing cutting-edge, data-driven applications, and helps organizations understand the true potential of their data by working as a leader, architect, and hands-on contributor.

Elliott has built nearly a dozen cloud-native data platforms on AWS, from data warehouses and data lakes to real-time activation platforms, for companies ranging from small startups to large enterprises.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.

