The Comprehensive Guide to Model Validation Framework: What is a Robust Machine Learning Model?

ODSC - Open Data Science
9 min read · Mar 20, 2020


Olivier is a speaker for ODSC East this April 13–17 in Boston. Be sure to check out his talk, “Validate and Monitor Your AI and Machine Learning Models,” there!

Machine learning has been widely democratized in the past two years with the development of solutions like Azure ML for machine learning models, Google Colab for free infrastructure, and simplified libraries like fast.ai, Keras, scikit-learn, and others. Building a functional machine learning system is one thing; building a successful machine learning system and being confident enough to put it in production is another ball game.

McKinsey says that about 87% of AI proofs of concept (POCs) are never deployed in production. This is a huge problem, and I believe that proactive model validation is one of the main ways to ensure that the POC yields the agreed-upon benefits: it validates the model's capability to generate realistic predictions and it boosts business adoption.

This post is an introduction to a training session that will be presented at the Open Data Science Conference East 2020 in Boston. If you are not able to make it (ODSC is an awesome event!), I will post follow-up articles on https://moov.ai/en/blog/.

In this blog post, we will introduce the validation framework by answering this simple question: What is a robust machine learning model?

During the training session at ODSC, and in the upcoming articles related to this post, we will explore concrete techniques to validate your model.

What is a robust machine learning model?

According to Investopedia, a model is considered to be robust if its output dependent variable (label) is consistently accurate even if one or more of the input independent variables (features) or assumptions are drastically changed due to unforeseen circumstances.

In more practical terms, here are the different dimensions that need to be validated in order to establish robustness.

Is my model performant?

By definition, a model does not have to be performant to be robust. However, a weak model struggles to predict the phenomenon (label) correctly. This is why we want to make sure the model is good enough to deliver the project's expected benefits.

What is a “performant” model?

By default, a machine learning model cannot be 100% accurate because of bias, variance, and irreducible error. Irreducible error is inevitable: you cannot have a perfect model because the world is not perfect, and neither is your dataset. The real question then becomes: what should your performance target be?

A good model is one that can generate value in real life. If you can gain value by being right 70% of the time, then that can be your performance target. Another good way to set the target is to evaluate manual performance: how well does a human perform on the same task?

Which metric to use?

There are hundreds of different metrics you might want to use for different reasons. Here is a rule of thumb.

First of all, you might want to use different metrics to train your model than the ones you use to validate it, and that is OK. Validation is more about the robustness of the full model.

For regression, I recommend using adjusted R-squared, as it is often used for explanatory purposes: it tells you how well your selected independent variables (features) explain the variability in your dependent variable (label), which is exactly what you want for validation. Its value is also comparable across models because it is a ratio, whereas RMSE, MSE, and MAE are scale-dependent values that cannot be compared from one model to another.
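For concreteness, here is a minimal sketch of how adjusted R-squared can be computed on top of scikit-learn's r2_score; the synthetic dataset and linear model are only placeholders for your own pipeline.

```python
# Adjusted R-squared built on scikit-learn's r2_score (sketch on synthetic data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R-squared penalizes R-squared for the number of features used."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(adjusted_r2(y_test, model.predict(X_test), n_features=X_test.shape[1]))
```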

Here is a great article about this topic: https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4

For classification, the best metric to measure a model's robustness is the AUC (Area Under the Curve) of a ROC curve (Receiver Operating Characteristic). This metric is versatile, as it essentially measures the ability to correctly identify a particular class. If you have multiple classes, you can calculate a one-vs-rest AUC-ROC for each class. AUC-ROC also works well in imbalanced situations, that is, when you have a much smaller sample for one class.
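Here is a minimal sketch of computing AUC-ROC with scikit-learn on synthetic data; the macro-averaged one-vs-rest variant shown for the multi-class case is one common convention, and the model is a placeholder.

```python
# AUC-ROC with scikit-learn, including a one-vs-rest multi-class example.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One-vs-rest AUC across the three classes, macro-averaged.
proba = model.predict_proba(X_test)
print(roc_auc_score(y_test, proba, multi_class="ovr", average="macro"))
```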

Is my model stable?

The goal is to get the same performance every time. Of course, a performance that varies too much can be problematic.

The minimum validation framework

I rarely see the practice of splitting a dataset in three (training, validation, and test) anymore. Still, it is really important to create a test dataset that is only evaluated once, on your final model. When possible, I would add a fourth dataset to validate the deployed system prior to project go-live, but this is not strictly necessary.

Having only a training and a validation dataset (the bare minimum) is a big mistake: you might test thousands of model configurations and end up selecting a model that overfits both the training and validation sets.
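As an illustration, here is a minimal sketch of a three-way split built from two calls to scikit-learn's train_test_split; the 60/20/20 ratios are an assumption, not a rule.

```python
# Three-way split (train / validation / test) from two train_test_split calls.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Carve out a test set that is touched only once, on the final model.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into training and validation sets (60% / 20% overall).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```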

Cross-validation for the win

A good add-on to this testing framework is to replace the single training/validation split with a cross-validation methodology. This technique essentially consists of training and validating the model multiple times, each time on a different, randomly drawn validation subset. Each repetition is called a fold: a 5-fold cross-validation means that you will train and then validate your model 5 times.

This is a good option when you don’t have a big dataset. Cross-validation works well with a smaller validation ratio considering that the multiple folds will cover a large proportion of data points. It also allows you to calculate your performance metric and evaluate the variance between folds. A stable model should have similar performance metrics at every fold. Here is a good article about this technique: https://machinelearningmastery.com/k-fold-cross-validation/
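Here is a minimal sketch of a 5-fold cross-validation that reports the per-fold scores and their spread; the model, data, and scoring metric are placeholders for your own setup.

```python
# 5-fold cross-validation: per-fold scores and their spread as a stability check.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = RandomForestClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("per-fold AUC:", np.round(scores, 3))
print("mean:", scores.mean(), "std:", scores.std())  # a large std suggests instability
```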

Biases are known and approved

In machine learning, bias and discrimination are inherent: if a set of features can accurately predict something, it is precisely because those features discriminate between outcomes. However, it is important to be aware of these biases to be comfortable with the model, ethically speaking.

Typically, interpretability methods are good at identifying biases, since biases only appear when features make an important contribution to the dependent variable (label). To measure feature importance in a complex model, I mostly use SHAP, which is a solid model-agnostic interpretability library.

Here is a good link to learn more about SHAP: https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d

To identify any biases, here are 2 different scenarios that you will want to inspect:

  • When a feature or combination of features has an overall abnormal marginal contribution to the model (e.g. gender is a highly discriminant factor across the board)
  • When a feature or combination of features has a targeted abnormal marginal contribution (e.g. an underrepresented ethnic group is highly discriminant only for the observations where it applies)

By using SHAP and the proper analytical framework, you can get an idea of the biases in your model, and then do something about it.
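As a sketch of what this can look like in practice, the snippet below trains a toy model and ranks features by their mean absolute SHAP contribution; the feature names, including "gender", are hypothetical and only illustrate how a sensitive feature would surface.

```python
# Ranking features by mean absolute SHAP contribution (toy data, hypothetical names).
import numpy as np
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=["age", "income", "tenure", "gender", "region"])  # hypothetical names
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one row of contributions per observation

# Mean absolute contribution per feature: a dominant sensitive feature is a red flag.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```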

Is my model too sensitive?

Sensitivity analysis determines how the label (prediction) is affected by changes in the features.

There are 2 different dimensions that you might want to validate:

  • Model tolerance to noise
  • Model tolerance to extreme scenarios (targeted noise)

Sensitivity analysis will allow you to explore the generalization of your model’s decision boundaries, to really see the impact of a lack of generalization.

For example, an SVM with a smoother, more general decision boundary may be slightly less performant, yet more robust and less sensitive, than one with a tightly fitted boundary. You always want to be aware of this trade-off, and for some critical models you might even prioritize a more tolerant model over a more performant one.

Once you have measured sensitivity, it is important to assess the following:

  • The likelihood of each scenario
  • The impact of each scenario on performance

The goal of this assessment is to evaluate the risks. You can then accept them as is, or fix your model.

Model tolerance to noise

What happens if your data is a little bit messy?

The risk here is that the model is overly narrow and that performance drops suddenly as soon as there is a little bit of noise. Noise can also reflect unseen scenarios. Say you develop a credit assessment model in the subprime industry (small loans): what would happen if a millionaire applied for a loan?

You can easily test tolerance to noise by adding random noise to the features of your test dataset and measuring the impact on performance.
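Here is a minimal sketch of such a noise-tolerance check; the noise levels are arbitrary and should be scaled to the units of your own features.

```python
# Noise-tolerance check: add Gaussian noise to the test features and measure the drop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

rng = np.random.default_rng(0)
for noise_level in [0.0, 0.1, 0.5, 1.0]:
    X_noisy = X_test + rng.normal(scale=noise_level, size=X_test.shape)
    auc = roc_auc_score(y_test, model.predict_proba(X_noisy)[:, 1])
    print(f"noise std {noise_level}: AUC {auc:.3f}")  # a sharp drop signals a narrow model
```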

Model tolerance to extreme scenarios

Imagine you had developed a state-of-the-art, automated stock trading system in 2007. How can you make sure that such a system is robust to abnormal highs and lows?

The way to validate this is by creating test datasets containing extreme or rare events and testing your final model on them. Once again, make sure you do not use these observations to train your model!
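As an illustration, here is a minimal sketch that holds out observations with extreme feature values into a separate stress-test set; the 1st/99th-percentile cut-off is an arbitrary assumption.

```python
# Stress test: score the model on observations with extreme feature values only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
lower, upper = np.percentile(X, [1, 99], axis=0)
is_extreme = ((X < lower) | (X > upper)).any(axis=1)

# Train only on "normal" observations; keep the extreme ones for stress testing.
X_train, X_test, y_train, y_test = train_test_split(X[~is_extreme], y[~is_extreme], random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("normal test accuracy :", accuracy_score(y_test, model.predict(X_test)))
print("extreme set accuracy :", accuracy_score(y[is_extreme], model.predict(X[is_extreme])))
```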

Is my model predictive?

Predictivity is often overlooked. After all, if you have a performant and stable model, what can go wrong?

New data is different from training data

I have seen several contexts where historical data is slightly different from the new data the model will use to make its predictions.

Here are some reasons that can explain this:

  • The historical data source does not match the new source (e.g. historical data uses observed weather while new data uses weather forecasts)
  • Historical data is formatted differently (e.g. new categorical values do not match previous categories)
  • Historical data has already been transformed (e.g. IT transformed the dataset beforehand)
  • New data has a different structure due to new trends

You can analyze this by comparing the data structures. A bunch of anomaly detection tools do this quite well. Here is a series of algorithms capable of assessing your data structure: https://scikit-learn.org/stable/modules/outlier_detection.html
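For example, here is a minimal sketch using IsolationForest, one of the scikit-learn detectors linked above, to flag new observations that do not resemble the training data; the drift shown here is simulated.

```python
# Flag new observations that look unlike the historical (training) data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))  # historical data
X_new = rng.normal(loc=2.0, scale=1.0, size=(200, 5))     # shifted new data

detector = IsolationForest(random_state=0).fit(X_train)
flagged = (detector.predict(X_new) == -1).mean()          # -1 means "anomaly"
print(f"{flagged:.0%} of new observations look unlike the training data")
```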

Leakage

If you are not careful when you define the architecture of your machine learning model, you might end up with features that only become available in the future, creating an error we call "leakage". Leakage leads to overly optimistic expectations about model performance because the model "knows" future information that it will not have in production. It is important to detect any risk of leakage, since the model will not perform the way you anticipate.

One way to identify this is by interpreting your model. If a feature is leaking, it will show up as an abnormally dominant contribution. A contaminated model will also tend to be very sensitive to targeted noise, since noise applied to the leaked variable has an outsized impact. Once you suspect leakage, review the features and make sure each one is generated before the phenomenon you are predicting occurs.
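As a very simple first screen (not a substitute for the review above), here is a sketch that flags features whose correlation with the label is suspiciously close to perfect; the 0.95 threshold and the toy "leaky_feature" are illustrative assumptions.

```python
# Simple leakage screen: near-perfect correlation with the label is a suspect.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 1000),
    "income": rng.normal(50_000, 10_000, 1000),
})
df["label"] = (df["income"] > 50_000).astype(int)
df["leaky_feature"] = df["label"] + rng.normal(0, 0.01, 1000)  # produced after the label exists

correlations = df.drop(columns="label").corrwith(df["label"]).abs()
print(correlations[correlations > 0.95])  # suspects to review for leakage
```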

In conclusion

We now understand what robustness is: essentially, a performant, tolerant, stable, and predictive model whose biases are known and fair. Phew, quite a tall order! Does it have to be perfect? The answer is no, as long as the gaps are known and measured.

Measurement techniques include a proper validation framework consisting of cross-validation and a separate test set, performance metrics such as adjusted R-squared and AUC-ROC, interpretability techniques like SHAP for bias and leakage identification, and anomaly detection to identify data structure discrepancies.

Despite all the techniques, tools and dimensions to validate, one of the most important pieces of advice is to be aware that performance alone can be misleading.

Want to know more about validation? I will describe an efficient validation framework and explain how to develop each analysis in an upcoming article.


About the speaker/Author: Olivier Blais, Co-founder and Head of Data Science | Moov AI

Olivier is a data science expert whose leading field of expertise and cutting-edge knowledge of AI and machine learning led him to support many companies’ digital transformations, as well as implementing projects in different industries. He has led the data team and put in place a data culture in companies like Pratt & Whitney Canada, L’Oréal, GSoft and now Moov AI.
