Machine Learning Challenges You Might Not See Coming
There seems to be a skills gap, and a skills misunderstanding, when it comes to Data Science, Engineering, and DevOps as a joint process. At our machine learning consultancy, Infinia ML, we view deployment as a sequential process across teams:
(1) Data Science explores data and develops algorithm(s).
(2a) Engineering refines/optimizes code and creates an API.
(2b) Engineering integrates this API with the necessary system(s).
(3) DevOps deploys code in a production setting (whether it be cloud, on-prem, or hybrid cloud).
(4) Data Science monitors the performance of models in production.
(5) Process repeats as models get updated or new models are needed.
Businesses beginning their machine learning journey may not realize it yet, but getting an algorithm to work on some data (Steps 1 and 2) is probably the easy part. Real, ongoing business value comes from actually deploying a machine learning solution into production (Step 3) and the necessary monitoring and optimization work that comes after it (Steps 4 and 5).
A lack of focus on these later steps is reflected in machine learning software solutions, which, overall, are still in their infancy. Our company has put a number of paid and open source ML products to the test. Working in small teams that combined data science and engineering talent, we attempted to build, run, and deploy well-known machine learning models with a set of image data and a set of text data (we felt tabular data would not be as challenging).
We used the same data sets and attempted the same models on 22 different products (we won’t name them here, but if you’ve explored this space you’d likely recognize some of the names on our list).
Below are some of our conclusions. While not all products exhibited all of these gaps, our exploration convinced us that there isn’t yet a great overall software solution for making ML real in the enterprise.
In the free trials we conducted, we found no clear process for deploying solutions into production. It’s unclear who in an organization is supposed to take the next step. This was true even for tabular data, meaning it’s certainly a challenge for text and images. Aspects of the challenge include:
On-Prem Deployment is Difficult
Some tools do a great job streamlining the ML deployment process in the cloud. However, deploying ML on your own servers or hardware continues to be a challenge. Although many processes are similar to traditional on-prem software deployments, ML has unique needs related to model training, performance, and maintenance.
Integration Not Included
Integrating prediction APIs into products and workflows is difficult because companies are often working with legacy and/or closed systems. It might be easy to think an ML project is “done” once algorithm development is complete. However, this is only the beginning of the journey: the algorithm must (a) be hooked up to production data inputs and (b) push predictions to applicable downstream processes. The specific issues here may be unique to each client.
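To make the two halves of that integration work concrete, here is a minimal sketch in Python. It is not taken from any product we tested. Every name in it (run_integration, fake_source, fake_model, the "id" and "amount" fields) is a hypothetical stand-in for whatever a client's real data source, model, and downstream system look like.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Prediction:
    record_id: str
    score: float


def run_integration(
    read_inputs: Callable[[], Iterable[dict]],
    predict: Callable[[dict], float],
    push_downstream: Callable[[Prediction], None],
) -> int:
    """Pull production records, score each one, and hand the result downstream."""
    count = 0
    for record in read_inputs():                      # (a) production data inputs
        score = predict(record)
        push_downstream(Prediction(record["id"], score))  # (b) downstream process
        count += 1
    return count


# Hypothetical stand-ins for a legacy source, a trained model, and a sink.
def fake_source() -> List[dict]:
    return [{"id": "a1", "amount": 120.0}, {"id": "b2", "amount": 15.0}]


def fake_model(record: dict) -> float:
    return 1.0 if record["amount"] > 100 else 0.0


delivered: List[Prediction] = []
processed = run_integration(fake_source, fake_model, delivered.append)
```

Keeping the source and the sink behind plain callables is one way to isolate the model from the legacy or closed systems on either side. In a real project, most of the effort tends to go into those two adapters rather than the loop itself.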
Who Owns Deployment and Maintenance?
On traditional SaaS teams, Site Reliability Engineers or DevOps Engineers handle the deployment of code to production (for both practical and audit purposes), as well as the monitoring of production infrastructure and its performance.
Currently, it is unclear whether DevOps is also responsible for the maintenance and deployment of ML models in production. Should data scientists be responsible here? Should engineers? Should DevOps? We haven’t seen an established way to do this. We’ve heard that several companies have Data Science teams who’ve built models that never see the light of day.
This process gap is related to a gap in products, which tend to focus more on data science tools and less on integration and implementation.
Today’s tools lack a dependable way to measure model drift, that is, to identify when a model’s production data is no longer representative of its training data.
Some companies are trying to enable monitoring, but the way to do so is not clear. For example, one product offered tools to theoretically conduct model tuning, identify drift, etc. But it did not give specific instructions on what to change or how to measure a specific model. In practice, the calibration process depends on the specific models in production, and will be different for each.
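As an illustration of what measuring drift can mean for a single numeric feature, the sketch below computes a Population Stability Index (PSI) between training values and production values. PSI is one commonly used technique, not something any product in our evaluation is known to implement. The bin count, the smoothing constant, and the 0.25 rule of thumb mentioned afterward are assumptions that would need tuning for each model.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; values outside the edges fall into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        # Advance while v is at or beyond the upper edge of the current bin.
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    train: Sequence[float],
    prod: Sequence[float],
    n_bins: int = 10,      # assumed default, not a standard
    eps: float = 1e-6,     # keeps empty bins from producing log(0)
) -> float:
    if not train or not prod:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(train), max(train)
    width = (hi - lo) / n_bins
    # Bin edges come from the training data only, so production is judged against it.
    edges = [lo + i * width for i in range(n_bins + 1)]
    expected = histogram_fractions(train, edges)
    actual = histogram_fractions(prod, edges)
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


train_values = [float(i) for i in range(100)]
psi_same = population_stability_index(train_values, train_values)
psi_shifted = population_stability_index(train_values, [v + 50.0 for v in train_values])
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that threshold is a convention, not a guarantee. It also says nothing about text or image inputs, where drift has to be measured on derived features or model outputs instead.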
Machine learning is not “learning” unless there is a continuous feedback loop of data from production; today’s ML toolset does not have an established method for doing this. In the products we evaluated, it was unclear how to push insights back to models so that the solution can continue to learn while in production.
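One possible shape for such a feedback loop is sketched below: log each production prediction, attach the real outcome when it arrives, and expose the paired results both as a live accuracy figure and as fresh training examples. This is a hypothetical in-memory design for illustration only. The class name FeedbackStore, its methods, and the sample records are all assumptions, and a real system would need durable storage and a policy for outcomes that never arrive.

```python
from typing import Any, Dict, List, Tuple


class FeedbackStore:
    """Pairs logged predictions with outcomes that arrive later."""

    def __init__(self) -> None:
        self._pending: Dict[str, Tuple[Dict[str, Any], Any]] = {}
        self._labeled: List[Tuple[Dict[str, Any], Any, Any]] = []

    def log_prediction(self, record_id: str, features: Dict[str, Any], predicted: Any) -> None:
        self._pending[record_id] = (features, predicted)

    def record_outcome(self, record_id: str, actual: Any) -> bool:
        """Attach the observed outcome; returns False if no prediction was logged."""
        if record_id not in self._pending:
            return False
        features, predicted = self._pending.pop(record_id)
        self._labeled.append((features, predicted, actual))
        return True

    def production_accuracy(self) -> float:
        if not self._labeled:
            raise ValueError("no labeled outcomes yet")
        hits = sum(1 for _, predicted, actual in self._labeled if predicted == actual)
        return hits / len(self._labeled)

    def training_examples(self) -> List[Tuple[Dict[str, Any], Any]]:
        """(features, actual) pairs that could be added to the next training set."""
        return [(features, actual) for features, _, actual in self._labeled]


store = FeedbackStore()
store.log_prediction("a1", {"amount": 120.0}, "fraud")
store.log_prediction("b2", {"amount": 15.0}, "ok")
store.record_outcome("a1", "fraud")  # the model was right
store.record_outcome("b2", "fraud")  # the model was wrong
```

The hard part in practice is usually not this bookkeeping but obtaining the outcomes at all, since they often live in a different system and arrive days or weeks after the prediction.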
Machine learning requires continuous maintenance and deployment to ensure models remain performant. Versioning, data governance, and model training continue to be a challenge as Data Scientists, Engineers, and DevOps personnel leverage machine learning in production. Additionally, machine learning presents unique challenges related to quality assurance practices (software testing) and rolling back model versions. Again, this process gap is reflected in, and perhaps exacerbated by, product gaps.
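To show what versioning and rollback might involve at their simplest, here is a toy model registry. It is a sketch under stated assumptions, not a description of any existing product. The ModelRegistry class, its method names, and the metadata labels are hypothetical. Persistence, access control, and audit logging are deliberately left out.

```python
from typing import Any, Dict, List, Optional


class ModelRegistry:
    """Keeps every registered model and remembers which versions went live."""

    def __init__(self) -> None:
        self._models: Dict[int, Any] = {}
        self._metadata: Dict[int, Dict[str, Any]] = {}
        self._history: List[int] = []  # versions that have been live, oldest first

    def register(self, model: Any, **metadata: Any) -> int:
        version = len(self._models) + 1
        self._models[version] = model
        self._metadata[version] = metadata
        return version

    def promote(self, version: int) -> None:
        if version not in self._models:
            raise KeyError(f"unknown model version {version}")
        self._history.append(version)

    def rollback(self) -> int:
        """Return to the version that was live before the current one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live_version(self) -> Optional[int]:
        return self._history[-1] if self._history else None

    def live_model(self) -> Any:
        if not self._history:
            raise RuntimeError("no model has been promoted")
        return self._models[self._history[-1]]


registry = ModelRegistry()
v1 = registry.register(lambda x: x > 100, training_data="snapshot-A")  # placeholder label
v2 = registry.register(lambda x: x > 80, training_data="snapshot-B")   # placeholder label
registry.promote(v1)
registry.promote(v2)
restored = registry.rollback()  # v2 misbehaves in production, so return to v1
```

Recording which data snapshot trained each version is what ties versioning to data governance: without it, rolling back the model does not tell anyone what the restored model was actually trained on.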