The Rapid Evolution of the Canonical Stack for Machine Learning

ODSC - Open Data Science
11 min read · Jul 27, 2021


Just a few years ago, almost nobody was building software to support the surge of new machine learning apps coming into production all over the world. Every big tech company, like Google, Lyft, Microsoft, and Amazon, rolled its own AI/ML tech stack from scratch.


Fast forward to today and we’ve got a Cambrian explosion of new companies building a massive array of software to democratize AI for the rest of us.

Since I wrote The Rise of the Canonical Stack for Machine Learning about the state of the art in 2020, that state of the art has changed. That’s no surprise because we’re on a brand-new branch of the software development tree, and the cutting edge of ML algorithms is changing as fast as the software that supports it.

Recently, Sam Charrington and the folks at TWIML put out a guide covering the ML landscape and a lot of the AIIA companies made the list.

Why do we even need guides like that, though? Because there are so many companies now that it’s hard to get a handle on what they all do.

That’s why we see people do a three-part post on the MLOps space just trying to figure out what the heck everyone’s software does in the first place! The post has some strong, clear thinking, but it still gets a lot of what the individual companies do wrong. That’s also no surprise because there are a lot of them and it’s not easy to go deep on each and every one.

Take Pachyderm, a company I know well since I happen to work there in addition to acting as the Managing Director of the AIIA. In that three-part post, Pachyderm got pigeonholed as “data versioning,” but we’re also a robust data orchestration engine that takes you from data ingestion through training in the ML lifecycle. We can agnostically run any language or framework in that pipeline, something few other platforms can do. While most pipeline systems run only Python and maybe one other half-supported language, we can run Python, Rust, C++, Bash, R, Java, and any framework or library you want, whether that’s TensorFlow, MXNet, PyTorch, or the deep learning library you found on the MIT website last week.
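To make that concrete, here’s a minimal sketch of what a Pachyderm pipeline spec might look like when it wraps a non-Python workload. The repo, image, and script names are hypothetical, invented for illustration, but the shape is the point: you point the pipeline at a versioned data repo, name a container image, and give it a command to run.

```yaml
# Hypothetical pipeline spec: the repo, image, and script names are made up
# for illustration. Pachyderm mounts each input repo at /pfs/<repo> and
# collects anything written to /pfs/out as the pipeline's versioned output.
pipeline:
  name: train-model
description: Train a model with an R script over versioned data.
input:
  pfs:
    repo: training-data   # a versioned Pachyderm data repo
    glob: "/*"            # split top-level entries into separate datums
transform:
  image: example-registry/r-trainer:1.0   # any container: R, Rust, C++, Java...
  cmd: ["Rscript", "/code/train.R", "/pfs/training-data", "/pfs/out"]
```

Because the unit of work is a container, the orchestration layer never needs to know or care what language is running inside it.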

It’s hard to dig through the marketing hype and have the time to go through lots of tools to figure out what’s the best of the best.

But that’s not the whole story. There’s a bigger problem at work here:

I call it the NASCAR slide.

The NASCAR slide comes to us from marketing. It’s the slide with logos slapped all over it like a NASCAR driver’s uniform.

In machine learning, that slide has completely destroyed people’s ability to actually understand what any of these companies do. You know the one. It’s got 87 categories of machine learning with 500 logos that fit neatly in each box.

It started in 2014–2016 with O’Reilly’s State of Machine Learning series and it’s been replicating like a virus ever since.

It was relevant at the time, when the brilliant Shivon Zilis and her team did it, but if you look at that slide now, you’ll see it’s super out of date. These slides age horribly. Most of them are out of date almost the second they hit the press. Here’s a link to one from this year, but I could point to a hundred different ones.

The biggest problem is that the categories are way too confining. Most companies’ software does multiple things. At the AI Infrastructure Alliance (AIIA), a community group with 50 of the most cutting-edge machine learning software companies on the planet, many of the companies have software that ripples across multiple stages of the AI/ML lifecycle. That will only accelerate as companies grow and take on more of the stack.

Plugging their logos into neat little boxes of made-up machine learning categories makes no sense whatsoever.

That’s why one of the first big projects of the Alliance is a set of blueprints to help people get a better grip on real-world enterprise AI/ML workflows via a canonical stack for machine learning.

Show Me the Blueprints

Take this early time series illustration from our working group as an example. It shows the AI/ML workflow with boxes that match the amount of time we spend in each stage of the dev lifecycle.

Where does AIIA company ClearML fit into that graphic? The answer is “multiple parts of the workflow,” which is why we’re using color to show how software ripples across the stages of the diagram.

What about a model serving framework like Seldon? Even they fit into different parts of the workflow, not just serving.

The partial coloring in the Seldon diagram shows where it does parts of what a totally unified canonical stack for machine learning might do. While a full-blown monitoring engine might cover everything from experimentation to deployment, a monitoring engine focused on the production side of the house is still essential.

And what about Pachyderm? If we’re not just data versioning, where do we fit?

Of course, no graphic perfectly captures every nuance of a canonical stack for machine learning and a workflow. As my chief data scientist Jimmy Whitaker likes to say, “all diagrams are wrong, some are useful.”

We’re looking to make the AIIA diagrams a lot less wrong so everyone can use them as a common visual language in this fast-developing space. Our diagrams are open source so everyone can build on them and modify them to their own needs and we’ll release them on the website when they’re out of beta.

I’ll consider the AIIA a roaring success if we kill the NASCAR slide forever.

The Architecture of Tomorrow

But where are we even getting these architectures? How are we putting them together in the first place?

To start with, we’re looking at what each company in the AIIA does. We’re getting demos from the founders. What are they building and why? What did they bet their futures on? How does it all weave together into a seamless AI fabric?

We’ve also spent time talking with the big consulting operations out there, like Cognizant, who are talking to customers and practitioners all over the world. They’re finding that most companies are still stuck somewhere between L0 and L2 on the AI readiness scale. Very few have advanced to automating the ML lifecycle from end to end.

We’re also studying the architectures from cutting-edge companies like Lyft, Google, and Uber. We’re reading the papers and the blogs that break down those architectures. How is Google’s SEED RL system different from DeepMind’s Acme RL system?

But we can’t just assume that whatever big tech puts out is fully baked and ready to go for the rest of us. We have to ask questions.

  • What works?
  • What doesn’t work?
  • What makes sense and what doesn’t?
  • What’s missing?

We can’t just assume they got it all perfectly correct and fully baked from the jump. The big tech companies are smart but that doesn’t mean they figured it all out on the first try. It took Google ten years before they got to Kubernetes and they built two other solutions along the way, Borg and Omega.

Already we’ve noticed a number of easily missed problems with a lot of the architectures coming out of Google and other big innovators in the space. Take a look at this diagram that people often cite from Google’s MLOps docs:

What’s missing?

Data.

Where are they storing the data? What kind of system are they using to access that data? How do they control access and version it?

They don’t include a storage and data versioning layer at all. The diagram picks up at “data extraction,” assuming you already have data storage and scaling perfectly handled.

Why?

Because Google has a planetary-scale file system that spans data centers, and they have unified RBAC to get to that data. They can take that as a given in their stack. You can’t do that if you work at most companies.

Most companies have a mish-mash of systems developed over many years, with different RBAC and different standards and formats. Half the work of the data engineer is just trying to navigate that RBAC minefield and get that data into a format data scientists can use to do their work.

Of course, our diagrams make assumptions too. The time series diagram is very linear and AI/ML workflows are often branching, like a DAG, and/or looping, aka the Machine Learning Loop. Our diagram has some curved arrows to show you some of the loop-y nature of machine learning but it’s not a perfect representation.

That doesn’t make the diagram wrong; it’s just hard to represent everything in a single diagram perfectly, so we don’t even try. It’s all about focus. Are we focusing on the time or on the stack itself?

In the next diagram, we focus purely on the tech stack itself and not the workflow. This is an early draft of emerging patterns we’re seeing across the 50+ members of the AIIA.

It shows the tech only, not what people are doing across it. Think of it the same way you think of a stack diagram of a web server, database server, and load balancer layer. It doesn’t show the app on top of it or how the programmer writes code and rolls it out.

It also doesn’t show boxes that you can stuff logos into easily. Just like in the time series illustration, a company’s software might flow across multiple components of the tech stack.

We’re using color to show how you could combine different software platforms to build a state-of-the-art machine learning stack that works for multiple use cases with ease. The stack below combines Pachyderm, Algorithmia, ClearML, and Tecton into a unified stack, and it even has a clever acronym: the PACT stack.

But we’re not worrying too much about clever acronyms for now. Sometimes you add a company and it doesn’t make for a memorable naming convention but it still makes for a powerful stack combination. If we add Fiddler’s state-of-the-art monitoring and explainability to the mix, we have the PACTF stack that covers 90% of what any enterprise needs to run machine learning at scale.

If you wanted to go with model serving from Seldon, a stack might look like the one below:

The Future and Beyond

In the long run, the AIIA is looking to deliver what the world of data science really needs: a Kubernetes of ML, something that abstracts away all the concepts and communications between the different layers of any kind of complex AI/ML stack people can dream up.

We want an abstract AI/ML factory that’s plug-and-play.

You might think that’s something like Kubeflow, but Kubeflow is more of a pipelining and orchestration system that’s not really agnostic to the languages and frameworks that run on it. It mostly supports Python. It recently added support for R, but if the developers have to support every language, framework, and library by hand, it’s not going to be the orchestration engine we’re looking for in the future.

Even more, we want something that works more the way Kubernetes itself does. Kubernetes doesn’t know what applications are running, but it knows how to run them perfectly. It lets you run any kind of application you can dream up on top of it. We want the same for AI/ML. The winning structure will let you abstract away the various components and build a fast, flexible, composable engine that works for any kind of AI we can imagine.
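As a sketch of what that agnosticism looks like in practice, consider how a plain Kubernetes Job runs a training workload today. Nothing in the spec tells Kubernetes it’s ML; the image and command below are hypothetical placeholders and could just as easily wrap R, Rust, or C++:

```yaml
# A plain Kubernetes Job. Kubernetes doesn't know or care that this
# container trains a model; it just schedules the container and runs it.
# The image name and command are hypothetical placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-once
spec:
  backoffLimit: 2            # retry a couple of times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: example-registry/mxnet-trainer:0.1  # any framework inside
          command: ["python", "train.py"]
```

A canonical ML stack needs that same property one level up: the orchestration layer shouldn’t have to be taught about each new language or framework before it can run it.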

The system that everyone uses will be agnostic to the languages, frameworks, and tools that run on it. If you want to run R, or Rust, or C++, or two versions of Anaconda, mixed with MXNet and PyTorch and an experimental NLP library you just found, you should be able to do it with ease and without having to wait for the team to support it.

That’s why we’re looking for agnostic orchestration systems, clean API layers, and well-defined communications standards.

What about big, vertically integrated stacks like Amazon SageMaker and Google’s Vertex AI?

Their platforms will always make money and many teams may find that SageMaker is all they need. But in the long run, we believe the stack that everyone uses in the future will be open and cross-platform.

Kubernetes didn’t become Kubernetes because it only ran on Google.

There were dozens of other container orchestration frameworks out there, but Kubernetes proved to be the most flexible and agnostic, and the other ones slowly died off.

Big vertically integrated AI/ML solutions have an early advantage. They have great programmers and they can design a prettier front end. Open systems and frameworks start out a bit uglier and messier. They take time to come together.

Does anyone really think Amazon’s feature store will win out over Feast or Tecton or Molecula? We don’t think so, and when Amazon rips and replaces their feature store with one of the feature stores in the AIIA, we’ll know we’ve really succeeded.

Then we’ll have a new mission. We’ll move “up the stack” to the generation of software that sits on top of the AIIA stack.

When 35 engineers built WhatsApp and reached 400 million people, they built it on the back of pre-baked GUIs, transport layer security standards, messaging protocols, and more. They didn’t have to invent all that stuff. Eventually, we’ll have to move up the stack too. But when we do, we’ll have solved the first major challenge in this rapidly evolving AI/ML space. We’ll keep evolving as the times change and the needs of our people change.

Machine learning is one of the most powerful technologies on the planet but to unleash its true world-changing potential we need a stack everyone can build on. Once we have it, we’ll have gone well beyond the Cambrian explosion of infrastructure software to an explosion of new AI-driven applications that touch every industry on Earth.

Join the AIIA and you won’t just ride the winds of change.

You’ll shape them.

Original post here.

