Standout Code Snippets From ODSC West 2022

ODSC - Open Data Science
5 min read · Dec 27, 2022


This article brings you up-to-speed on some of the best code snippets you may have missed if you were not at ODSC West 2022.

For this notebook you will need the standard imports, plus SciPy for the skewed sample and normal-curve overlays:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import skewnorm

Side-By-Side Data Visualizations

This is a common technique you will find in many notebooks across GitHub, Kaggle, and elsewhere. Yet as useful as this technique is, I wish more analysts would use it.

Clinton Brownley, who presented a tour of machine learning in Python, offers a clear and concise example of the technique that elegantly solves a common use case — when you want to compare multiple distributions side-by-side.

# Specify fictional data to work with.
r = pd.Series(skewnorm.rvs(a=4, loc=10, scale=4, size=1000))

# Specify a four column subplot.
fig, axes = plt.subplots(figsize=(20, 5), ncols=4)

# Note: sns.distplot is deprecated in newer seaborn releases;
# sns.histplot / sns.displot are its successors.
sns.distplot(r, ax=axes[0],
             kde=False, rug=False,
             fit=stats.norm).set_title('Original')
sns.distplot(np.log(r), ax=axes[1],
             kde=False, rug=False,
             fit=stats.norm).set_title('Natural Log')
sns.distplot(np.sqrt(r), ax=axes[2],
             kde=False, rug=False,
             fit=stats.norm).set_title('Square Root')
sns.distplot(1/r, ax=axes[3],
             kde=False, rug=False,
             fit=stats.norm).set_title('Inverse')

Image Credit: Image generated from code snippets shown above. First published by Clinton Brownley.

A variation on this theme lets you view the same distribution with multiple different vertical y-axis scales.

# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# Placeholder blue; swap in your own color per the palette
# strategies referenced below.
my_blue = '#1f77b4'

# Specify a two by two subplot + adjust spacing.
sns.set_context('notebook')
fig, axes = plt.subplots(figsize=(12, 4), ncols=2,
                         nrows=2, squeeze=False)
plt.subplots_adjust(hspace=0.8, wspace=0.3)

sns.histplot(df['price'], ax=axes[0,0],
             stat='count',
             kde=True, color=my_blue).set_title('Count')
sns.histplot(df['price'], ax=axes[0,1],
             stat='frequency',
             kde=True, color=my_blue).set_title('Frequency')
sns.histplot(df['price'], ax=axes[1,0],
             stat='percent',
             kde=True, color=my_blue).set_title('Percent')
sns.histplot(df['price'], ax=axes[1,1],
             stat='density',
             kde=True, color=my_blue).set_title('Density')

Note, to replicate these colors use the palette strategies I wrote about here. This example also switches the layout from a 1 by 4 to a 2 by 2 display.

Image Credit: Image generated from code snippets shown above.

Transpose The Describe Method

As this is a personal favorite Pandas hack of mine, I was glad to see multiple presenters at ODSC using it.

Image credit: Author’s illustration built in Canva.

If you use df.describe() you know that it produces summary statistics. The problem with df.describe() is that it places the variable names across the columns of the summary statistics table.

If you have many variables the table is unreadable: it will be too wide for the screen. To fix that, chain an additional method: df.describe().transpose() for the win!
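As a quick sketch of the difference (using a small made-up frame, since any DataFrame with numeric columns will do):

```python
import pandas as pd

# A small illustrative frame with several numeric columns.
df = pd.DataFrame({
    'price': [4099, 4749, 3799, 4816],
    'mpg': [22, 17, 22, 20],
    'weight': [2930, 3350, 2640, 3250],
})

# Wide: one column per variable (unreadable with many variables).
wide = df.describe()

# Tall: one row per variable; scrolls vertically instead.
tall = df.describe().transpose()

print(tall)
```

With dozens of variables, the transposed table simply grows downward, one readable row per variable, with the eight summary statistics as columns.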

Aggregate Methods On IsNull

In his presentations on Scikit Learn, Corey Wade offers multiple examples of this clever way to chain methods after isnull().

When inspecting a data frame for missing values you are likely familiar with the df.isnull() method. It returns a copy of the data frame in which each value is True or False: True when missing, False otherwise.

Several clever additional method chains make this output more readable. For example, perhaps you already knew about the df.isnull().sum() hack:

# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# View which variables have missing values, and how many.
df.isnull().sum()

But did you know about the double sum? Using df.isnull().sum().sum() will give the total count of missing values across the entire data frame.

# View how many missing values there are in the entire df.
df.isnull().sum().sum()

A less intuitive hack is taking the mean of Boolean values. The mean of a Boolean column is equivalent to the proportion that is True.

Thus, for more useful output, expressing the "missingness" as a proportion of the observations is helpful. The df.isnull().mean() method chain does this for you:

# View the percent of values missing in each column.
df.isnull().mean() * 100

# Use lambda to spruce up the output.
df.isnull().mean().apply(lambda x:
    str(round(x * 100, 1)) + '% Missing Values')

Quality Assurance Environmental Checks

Making sure your environment is ready to go at the time of development, and on through subsequent runs, is important. Stefanie Molin demonstrated a few hacks that do this well, sharing her implementation in her presentation on data visualization.

Image Credit: From Stefanie Molin’s data visualization in the Python repository.

As you can see from the image above, taken from Stefanie's repository, this hack produces elegant and readable output. Placing code like this in any repository is a handy way to add a greater measure of quality assurance for you and those with whom you collaborate.
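Stefanie's exact implementation lives in her repository; as a rough sketch of the general pattern, assuming you simply want to confirm the Python version and minimum package versions before a notebook runs (the version numbers below are illustrative, not hers), something like this works:

```python
import importlib
import sys

# Minimum versions this notebook was developed against (illustrative values).
REQUIRED = {'pandas': (1, 0), 'numpy': (1, 20)}

def check_environment(required=REQUIRED, min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f'Python {min_python[0]}.{min_python[1]}+ required.')
    for name, min_version in required.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems.append(f'{name} is not installed.')
            continue
        # Compare only the (major, minor) components of the version string.
        version = tuple(int(p) for p in module.__version__.split('.')[:2])
        if version < min_version:
            problems.append(f'{name} {module.__version__} is older than required.')
    return problems

for problem in check_environment():
    print('WARNING:', problem)
```

Run near the top of a notebook, this surfaces configuration problems immediately rather than as a cryptic failure several cells later.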

Conclusion

This article summarized some of the best snippets of code shared during ODSC West 2022 in San Francisco. Specific examples included code that presents data visualizations side-by-side, aggregate methods chained after df.isnull(), and coding conventions that check and re-check for proper environmental configuration.

If you missed ODSC West in San Francisco you should consider future editions. In just a few short weeks, information and registration for ODSC East in Boston will be available.

Thanks For Reading

Adam Ross Nelson is a data scientist + career coach. Read more about advancing your data science career: coaching.adamrossnelson.com.

Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. Twitter: @adamrossnelson | LinkedIn: Adam Ross Nelson| Facebook: Adam Ross Nelson.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Subscribe to our fast-growing Medium Publication too, the ODSC Journal, and inquire about becoming a writer.
