Auto-Sklearn: AutoML in Python

ODSC - Open Data Science
5 min read · May 26, 2021

Machine learning is the driving force of modern technology and smart applications. While highly efficient methods and implementations are broadly available, applying them successfully is hard: a myriad of design decisions has to be made correctly before an ML pipeline achieves peak performance.

Such decisions include how to preprocess features (e.g., how to replace missing values), which model class to use (e.g., neural networks or boosted trees), and finally, how to set the hyperparameters of this model class (e.g., the learning rate and the number of epochs). Manually searching this vast design space requires a lot of experience, a lot of computing resources, or both. AutoML is here to help!

AutoML automatically finds well-performing machine learning pipelines and thus frees the human expert from this tedious task. This lowers the barrier to applying machine learning broadly and makes it available to everyone. In this post, we’ll have a look at the AutoML tool Auto-sklearn.

Auto-sklearn is an open-source tool, so we are happy to receive stars, pull requests, and issues: www.github.com/automl/auto-sklearn.

What you’ll get out of this post and what you’ll need to run the code

You’ll learn how to replace a manually designed scikit-learn pipeline with an Auto-sklearn estimator. We provide all code in this Colab Notebook.

Step 1: Load data

As a first step, we’ll use scikit-learn’s built-in fetch_openml to load the credit-g dataset from OpenML and split it into train and test data.

import sklearn.datasets
import sklearn.model_selection

# Fetch the credit-g dataset (data_id=31) from openml.org
X, y = sklearn.datasets.fetch_openml(data_id=31, return_X_y=True, as_frame=True)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_train.info()

This dataset describes bank customers applying for credit. It has 1,000 data points and 20 features, and it is a good example dataset because it contains both numerical and categorical features. The task is to classify each application as a good or bad credit risk, i.e., whether the credit would default or not.
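
As a quick sanity check, we can look at the feature types and the class balance directly; this is a minimal sketch using standard pandas calls on the dataframe returned by fetch_openml, not part of the pipeline itself:

# Inspect the mix of categorical and numerical columns
categorical = X_train.select_dtypes(include='category').columns
numerical = X_train.select_dtypes(exclude='category').columns
print(f"{len(categorical)} categorical features, e.g. {list(categorical[:3])}")
print(f"{len(numerical)} numerical features, e.g. {list(numerical[:3])}")

# Check the class balance of the target
print(y_train.value_counts())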

Step 2: Manually build a pipeline

Now, we turn to building our pipeline. We’ll use a Support Vector Machine (SVM). However, to get good performance with an SVM, one needs to preprocess the data: in particular, we need to one-hot encode the categorical values and scale the features (for example, the feature credit_amount goes up to roughly 20,000, while duration does not go above 80).


Note: For demonstration, we use the default hyperparameters set by scikit-learn for this pipeline; however, in practice, these need to be tuned to achieve top performance.

from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create the estimator using the default parameters from the library
estimator_svc = SVC(
    C=1.0, kernel='rbf', gamma='scale', shrinking=True, tol=1e-3,
    cache_size=200, verbose=False, max_iter=-1, random_state=42,
)

# Collect the categorical columns, which need one-hot encoding
categorical_columns = [col for col in X_train.columns
                       if X_train[col].dtype.name == 'category']
encoder = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
], remainder='passthrough')

# Build and fit the pipeline: encode, scale, then classify
pipeline_svc = Pipeline([
    ('encoder', encoder),
    ('scaler', StandardScaler()),
    ('svc', estimator_svc),
])
pipeline_svc.fit(X_train, y_train)

After fitting the pipeline on the training data, we measure its performance on the test set and obtain an accuracy of 76.75%.

# Score the model
prediction = pipeline_svc.predict(X_test)
performance_svc = accuracy_score(y_test, prediction)
print(f"SVC performance is {performance_svc}")

We also tried other classifiers, such as a Gradient Boosting classifier and a Decision Tree; their accuracies were 73.5% and 70.75%, respectively.
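
For reference, here is a minimal sketch of how those classifiers can be slotted into the same pipeline (again with scikit-learn defaults; the exact accuracies may vary slightly across library versions):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Keep the preprocessing and swap only the final estimator
for name, clf in [('gb', GradientBoostingClassifier(random_state=42)),
                  ('tree', DecisionTreeClassifier(random_state=42))]:
    pipeline = Pipeline([
        ('encoder', encoder),
        ('scaler', StandardScaler()),
        (name, clf),
    ])
    pipeline.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipeline.predict(X_test)))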

Step 3: Use Auto-sklearn as a drop-in replacement

Finally, we’ll demonstrate how easy it is to use Auto-sklearn as a drop-in replacement for the manually constructed estimator pipeline discussed above.

Instead of manually specifying a pipeline, we can simply use the Auto-sklearn estimator object; all that’s left is to decide how many resources to spend on the search for the best pipeline. We set this limit to 5 minutes and a single CPU core. Since we have a small dataset at hand, we also turn on cross-validation.

Note: Large datasets require more computational resources to achieve good results.

The result is an estimator object that can be handled like any scikit-learn estimator or pipeline and used to predict labels for new data; in this case, it achieves a test accuracy of 77.5%, better than the manually designed pipeline and without any manual tuning work.

import autosklearn.classification

# Create and train the estimator
estimator_askl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # overall search budget in seconds
    seed=42,
    resampling_strategy='cv',  # cross-validate, since the dataset is small
    n_jobs=1,
)
estimator_askl.fit(X_train, y_train)

# Score the model
prediction = estimator_askl.predict(X_test)
performance_askl = accuracy_score(y_test, prediction)
print(f"Auto-Sklearn Classifier performance is {performance_askl}")

Wrapping up on Auto-sklearn

You might wonder: what does Auto-sklearn do internally? The short answer: it searches a huge, more than 100-dimensional space for a pipeline that performs well on your dataset, and then automatically ensembles the best-performing pipelines for prediction.
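
If you are curious about what the search found on our dataset, the fitted estimator exposes a couple of introspection helpers (a quick sketch; the exact output format depends on your auto-sklearn version):

# Summary of the search: number of runs, best validation score, etc.
print(estimator_askl.sprint_statistics())

# The pipelines in the final ensemble, with their ensemble weights
print(estimator_askl.show_models())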

If this sounds interesting to you and you want to take a deep dive into the methodology behind Auto-sklearn and other up-to-date AutoML systems, and learn how to apply Auto-sklearn to your machine learning problem, we have two events for you at the upcoming ODSC Europe on Wednesday, June 9th:

Frank Hutter will present the methods behind Auto-sklearn and other recent AutoML systems in his presentation (10:50–11:35).

Also, if you like Auto-sklearn, give us a star at www.github.com/automl/auto-sklearn!
