Introduction to PyCaret
In this article, I’d like to introduce a new machine learning library for Python called PyCaret. PyCaret is touted as a low-code resource for data scientists that aims to reduce the “hypothesis to insights cycle time” in a machine learning experiment. It enables data scientists to perform complex, end-to-end experiments quickly and efficiently, often with only a few lines of code.
PyCaret was developed by data scientist Moez Ali, who started the project in the summer of 2019. His motivation for the project comes from the emerging role of citizen data scientists, who play a complementary role to professional data scientists and bring their own expertise and unique skills to analytics-driven tasks. As much as PyCaret is ideal for citizen data scientists due to its simplicity, ease of use, and low-code environment, professional data scientists can also use it in their machine learning workflows to build prototypes quickly and efficiently. Ali told me that PyCaret is not directly related to the caret package in R, but was inspired by caret creator Dr. Max Kuhn’s work in R. The name caret is short for Classification And REgression Training.
“In comparison with the other open-source machine learning libraries, PyCaret is an alternate low-code library that can be used to replace hundreds of lines of code with few words only,” said PyCaret creator Moez Ali. “This makes experiments exponentially fast and efficient.”
The original PyCaret 1.0.0 release was made available in April 2020, and the most recent 2.1 version was released on August 28, 2020.
Why Use PyCaret?
PyCaret is a very useful library that not only simplifies machine learning tasks for citizen data scientists but also helps startup companies reduce the cost of a team of data scientists. The theory is that a smaller team of data scientists using PyCaret can compete with a larger team using traditional tools. Further, the library helps not only citizen data scientists but also newcomers who want to start exploring data science with little prior knowledge of the field.
PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks including scikit-learn, XGBoost, Microsoft LightGBM, spaCy, and many more.
The intended audience for PyCaret is:
- Experienced data scientists who want to increase their productivity
- Citizen data scientists who can benefit from a low code machine learning solution
- Data science students (I plan to include PyCaret in my upcoming “Introduction to Data Science” classes)
- Data science professionals and consultants involved in building MVP versions of projects
Let’s take a quick look at some important PyCaret functions:
- compare_models function — trains all the models in the model library using default hyperparameters and evaluates performance metrics using cross-validation. Metrics used for classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC. Metrics used for regression: MAE, MSE, RMSE, R2, RMSLE, MAPE.
- create_model function — trains a model using default hyperparameters and evaluates performance metrics using cross-validation.
- tune_model function — tunes the hyperparameters of the model passed as an estimator. It uses a random grid search over pre-defined tuning grids that are fully customizable.
- ensemble_model function — you pass a trained model object, and the function ensembles it using techniques such as Bagging or Boosting, returning a table with k-fold cross-validated scores of common evaluation metrics.
- predict_model — used for inference/prediction.
- plot_model — used to evaluate the performance of the trained machine learning model.
- Utility functions — a number of utility functions that are useful when managing your machine learning experiments with PyCaret.
- Experiment logging — PyCaret embeds the MLflow tracking component as a backend API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results.
Sample output from plot_model function
New features in the 2.1 release include:
- You can now tune the hyperparameters of various models on a GPU: XGBoost, LightGBM, and CatBoost.
- The deploy_model function, which previously supported only AWS, now also supports deploying trained models to GCP and Microsoft Azure.
- The plot_model function now includes a new “scale” parameter that controls resolution, so you can generate high-quality plots for your expository data visualization needs.
- It is now possible to use user-defined custom loss functions with the new custom_scorer parameter in the tune_model function.
- For enhanced feature engineering, PyCaret now includes the Boruta algorithm. Originally announced as an R package in the Journal of Statistical Software, September 2010, Boruta has since been ported to Python.
Getting Started with PyCaret
PyCaret comes with a series of well-crafted tutorials (each with its own GitHub repo) that cover many important areas of development for data scientists. The tutorials include such topics as: classification, regression, NLP, clustering, anomaly detection, and association rule mining. A number of video tutorials are also offered, making it pretty easy to get up to speed with these powerful tools.
I’ve witnessed first-hand the popularity of a machine learning assist library after using the caret package in R for a number of years. I was turned on to caret after consuming the wonderful book “Applied Predictive Modeling” by Kuhn and Johnson, which uses the R caret package throughout. PyCaret fills the same role for data scientists who use Python and want a productivity tool that stops short of a full-blown AutoML platform like H2O.ai or DataRobot. I think PyCaret is definitely worth a serious look.