Member-only story

Transforming Skewed Data for Machine Learning

5 min readJul 25, 2019

Skewed data is common in data science; skew is the degree of distortion from a normal distribution. For example, below is a plot of the house prices from Kaggle’s House Price Competition that is right skewed, meaning there are a minority of very large values.

Why do we care if the data is skewed? If the response variable is skewed like in Kaggle’s House Prices Competition, the model will be trained on a much larger number of moderately priced homes, and will be less likely to successfully predict the price for the most expensive houses. The concept is the same as training a model on imbalanced categorical classes. If the values of a certain independent variable (feature) are skewed, depending on the model, skewness may violate model assumptions (e.g. logistic regression) or may impair the interpretation of feature importance.

ODSC West 2024 tickets available now!
In-Person & Virtual Data Science Conference
October 29th-31st, 2024 — Burlingame, CA
Join us for 300+ hours of expert-led content, featuring hands-on, immersive training sessions, workshops, tutorials, and talks on cutting-edge AI tools and techniques, including our first-ever track devoted to AI Robotics!
REGISTER NOW

We can objectively determine if the variable is skewed using the Shapiro-Wilks test. The null hypothesis for this test is that the data is a sample from a normal distribution, so a p-value less than 0.05 indicates significant skewness. We’ll apply the test to the response variable Sale Price above labeled “resp” using Scipy.stats in Python.

The p-value is not surprisingly less than 0.05, so we can conclude that the variable is skewed. A more convenient way of evaluating skewness is with pandas’ “.skew” method. It calculates the Fisher–Pearson standardized moment coefficient for all columns in a dataframe. We can calculate it for all the features in Kaggle’s Home Value dataset (labeled “df”) simultaneously with the following code.

A few of the variables like Pool Area are highly right skewed due to lots of zeros, this is okay. Some models like decision trees are fairly robust to skewed features.

Transforming Skewed Data for Machine Learning

Written by ODSC - Open Data Science

Responses (1)