# Visualization for Clustering Methods

*Editor’s note: Evie Fowler is a speaker for ODSC West. Be sure to check out her talk, “**Bridging the Interpretability Gap in Customer Segmentation**,” there!*

At this Fall’s **Open Data Science Conference**, I will talk about how to bring **a systematic approach to the interpretation of clustering models**. To get ready for that, let’s talk about data visualization for clustering models.

# Preparing a Workspace

All of these visualizations can be created with the basic tools of data manipulation (pandas and numpy) and the basics of visualization (matplotlib and seaborn). The example data and clustering models come from scikit-learn.

```python
from matplotlib import colormaps, pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import seaborn as sns
```

For this tutorial, I’ll use the diabetes prediction dataset built into scikit-learn. I’ll offer a lot more insight on how to train and evaluate an effective clustering model at ODSC, but for now, I’ll just fit a few simple k-means models.

```python
# load diabetes data
diabetesData = load_diabetes(as_frame = True).data

# scale clusterable features to the range [0, 1]
diabetesScaler = MinMaxScaler().fit(diabetesData)
diabetesDataScaled = pd.DataFrame(diabetesScaler.transform(diabetesData)
                                  , columns = diabetesData.columns
                                  , index = diabetesData.index)

# build three small clustering models
km3 = KMeans(n_clusters = 3).fit(diabetesDataScaled)
km4 = KMeans(n_clusters = 4).fit(diabetesDataScaled)
km10 = KMeans(n_clusters = 10).fit(diabetesDataScaled)
```
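As a rough illustration of what the scaling step does, here is a sketch of the min-max arithmetic written out by hand. The `toy` dataframe and its values are made up for the example, and this mirrors the default `MinMaxScaler` behavior rather than reproducing the scikit-learn implementation.

```python
import pandas as pd

# toy data standing in for the real feature table (values are made up)
toy = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 30.0, 20.0]})

# min-max scaling: subtract each column's minimum, then divide by its range
toyScaled = (toy - toy.min()) / (toy.max() - toy.min())

# every column now runs from exactly 0 to exactly 1
print(toyScaled)
```

Having every feature on the same [0, 1] scale keeps any one feature from dominating the k-means distance calculation, and it is also what lets the histogram code later in this post bin values with simple arithmetic.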

# Choosing a Color Scheme

The matplotlib package provides a number of built-in color schemes through its colormaps registry. It is convenient to choose one colormap for the entirety of a visualization, and important to choose thoughtfully. That can mean evaluating everything from whether the map is sequential (for when data can be interpreted along a scale from low to high) or diverging (for when data is most relevant at either of two extremes) to whether it is thematically appropriate for the subject (greens and browns for a topography project). When there is no particular relationship between the data and the order in which it will be presented, as with arbitrary cluster labels, the nipy_spectral colormap is a good choice.

```python
# choose the nipy_spectral colormap from matplotlib
nps = colormaps['nipy_spectral']

# view the whole colormap
nps
```

Each matplotlib colormap consists of a series of tuples, each describing a color in RGBA format (with components scaled to [0, 1] rather than [0, 255]). Individual colors from the map can be accessed either by integer (between 0 and 255) or by float (between 0 and 1). Numbers close to 0 correspond to colors at the lower end of the colormap, while integers close to 255 and floats close to 1.0 correspond to colors at the upper end. The same color can therefore be described either by an integer or by a float that expresses that integer as a fraction of 255.

```python
# view select colors from the colormap
print(nps(51))
#(0.0, 0.0, 0.8667, 1.0)

print(nps(0.2))
#(0.0, 0.0, 0.8667, 1.0)
```
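To make the integer and float lookups a little more concrete, here is a small sketch that checks the equivalence shown above and then samples one evenly spaced color per cluster. The names `k` and `clusterColors` are made up for this example; the rest uses the same colormap as the tutorial.

```python
import numpy as np
from matplotlib import colormaps

nps = colormaps['nipy_spectral']

# the colormap holds a fixed number of discrete colors
print(nps.N)

# the float 0.2 and the integer 51 select the same entry
print(nps(0.2) == nps(51))

# sample k evenly spaced colors, for example one per cluster label
k = 4
clusterColors = nps(np.linspace(0, 1, k))

# the result is an array with one RGBA row per requested color
print(clusterColors.shape)
```

Passing an array of floats returns all of the colors at once, which is the same trick the violin plot function below uses to build a palette from cluster labels.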

# Creating Visualizations

# Scatter Plots

The classic visualization for a clustering model is a series of scatter plots comparing each pair of features that went into the clustering model, with cluster assignment denoted by color. There are built-in methods to achieve this, but a DIY approach gives more control over details like the color scheme.

```python
def plotScatters(df, model):
    """ Create scatter plots based on each pair of columns in a dataframe.
        Use color to denote model label.
    """
    # create a figure and axes
    plotRows = df.shape[1]
    plotCols = df.shape[1]
    fig, axes = plt.subplots(
        # create one row and one column for each feature in the dataframe
        plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )
    # iterate through subplots to create scatter plots
    pltindex = 0
    for i in np.arange(0, plotRows):
        for j in np.arange(0, plotCols):
            pltindex += 1
            # identify the current subplot
            plt.subplot(plotRows, plotCols, pltindex)
            plt.scatter(
                # compare the i-th and j-th features of the dataframe
                df.iloc[:, j], df.iloc[:, i]
                # use integer cluster labels and a color map to unify color selection
                , c = model.labels_, cmap = nps
                # choose a small marker size to reduce overlap
                , s = 1)
            # label the x axis on the bottom row of sub plots
            if i == df.shape[1] - 1:
                plt.xlabel(df.columns[j])
            # label the y axis on the first column of sub plots
            if j == 0:
                plt.ylabel(df.columns[i])
    plt.show()
```

These plots do double duty, showing the relationship between a pair of features and the relationship between those features and cluster assignment.

`plotScatters(diabetesDataScaled, km3)`

As analysis progresses, it’s easy to focus on a smaller subset of features.

`plotScatters(diabetesDataScaled.iloc[:, 2:7], km4)`

# Violin Plots

To get a better sense of the distribution of each feature within each cluster, we can also look at violin plots. If you’re not familiar with violin plots, think of them as the grown-up cousin of the classic box plot. Where box plots identify only a few key descriptors of a distribution, violin plots are contoured to illustrate an estimate of the entire probability density function.

```python
def plotViolins(df, model, plotCols = 5):
    """ Create violin plots of each feature in a dataframe
        Use model labels to group.
    """
    # calculate number of rows needed for plot grid
    plotRows = df.shape[1] // plotCols
    while plotRows * plotCols < df.shape[1]:
        plotRows += 1

    # create a figure and axes
    fig, axes = plt.subplots(plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )

    # identify unique cluster labels from model
    uniqueLabels = sorted(np.unique(model.labels_))

    # create a custom subpalette from the unique labels
    # this will return a list of RGBA tuples, one per cluster label
    npsTemp = [tuple(c) for c in nps([x / max(uniqueLabels) for x in uniqueLabels])]

    # add integer cluster labels to input dataframe
    df2 = df.assign(cluster = model.labels_)

    # iterate through subplots to create violin plots
    pltindex = 0
    for col in df.columns:
        pltindex += 1
        plt.subplot(plotRows, plotCols, pltindex)
        ax = sns.violinplot(
            data = df2
            # use cluster labels as x grouper
            , x = 'cluster'
            # use current feature as y values
            , y = col
            # use cluster labels and custom palette to unify color selection
            , hue = 'cluster'
            , palette = npsTemp
        )
        # remove the legend if seaborn drew one (this varies by seaborn version)
        if ax.get_legend() is not None:
            ax.get_legend().remove()
        # label y axis with feature name
        plt.ylabel(col)
    plt.show()
```

`plotViolins(diabetesDataScaled, km3, plotCols = 5)`
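The difference between a box plot and a violin plot can be made concrete with a little arithmetic. The sketch below uses a made-up bimodal feature and a normalized histogram as a simple stand-in for the kernel density estimate that a violin plot traces; the names `feature`, `boxSummary`, and `density` are invented for this example.

```python
import numpy as np

# made-up bimodal feature: two well separated groups of equal size
rng = np.random.default_rng(0)
feature = np.concatenate([
    rng.normal(0.2, 0.03, 500),
    rng.normal(0.8, 0.03, 500),
])

# what a box plot draws: minimum, quartiles, and maximum
boxSummary = np.percentile(feature, [0, 25, 50, 75, 100])

# what a violin plot traces: the density across the whole range
# (a normalized histogram stands in for the kernel density estimate)
density, edges = np.histogram(feature, bins=10, range=(0, 1), density=True)

print(np.round(boxSummary, 2))
print(np.round(density, 2))
```

The five-number summary places the median in the middle of the range, where there are no observations at all, while the density values show two separate peaks with an empty gap between them. That gap is exactly the kind of structure a violin plot reveals and a box plot hides.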

# Histograms

Violin plots show the distribution of each feature within each cluster, but it is also helpful to look at how each cluster is represented in the broader distribution of each feature. A modified histogram can illustrate this well.

```python
def histogramByCluster(df, labels, plotCols = 5, nbins = 30, legend = False, vlines = False):
    """ Create a histogram of each feature.
        Use model labels to color code.
    """
    # calculate number of rows needed for plot grid
    plotRows = df.shape[1] // plotCols
    while plotRows * plotCols < df.shape[1]:
        plotRows += 1

    # identify unique cluster labels
    uniqueLabels = sorted(np.unique(labels))

    # create a figure and axes
    fig, axes = plt.subplots(plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )
    pltindex = 0

    # loop through features in input data
    for col in df.columns:
        # discretize the feature into specified number of bins
        tempBins = np.trunc(nbins * df[col]) / nbins
        # cross the discretized feature with cluster labels
        tempComb = pd.crosstab(tempBins, labels)
        # create an index in the same size as the cross tab
        # this will help with alignment
        ind = np.arange(tempComb.shape[0])

        # identify the relevant subplot
        pltindex += 1
        plt.subplot(plotRows, plotCols, pltindex)

        # create grouped histogram data
        histPrep = {}
        # work one cluster at a time
        for lbl in uniqueLabels:
            histPrep.update(
                {
                    # associate the cluster label...
                    lbl:
                    # ... with a bar chart
                    plt.bar(
                        # use the feature-specific index to set x locations
                        x = ind
                        # use the counts associated with this cluster as bar height
                        , height = tempComb[lbl]
                        # stack this bar on top of previous cluster bars
                        , bottom = tempComb[[x for x in uniqueLabels if x < lbl]].sum(axis = 1)
                        # eliminate gaps between bars
                        , width = 1
                        , color = nps(lbl / max(uniqueLabels))
                    )
                }
            )

        # use feature name to label x axis of each plot
        plt.xlabel(col)
        # label the y axis of plots in the first column
        if pltindex % plotCols == 1:
            plt.ylabel('Frequency')
        plt.xticks(ind[0::5], np.round(tempComb.index[0::5], 2))

        # if desired, overlay vertical lines
        if vlines:
            for vline in vlines:
                plt.axvline(x = vline * ind[-1], lw = 0.5, color = 'red')

        # if desired, add a legend mapping colors to cluster labels
        if legend:
            leg1 = []; leg2 = []
            for key in histPrep:
                leg1 += [histPrep[key]]
                leg2 += [str(key)]
            plt.legend(leg1, leg2)
    plt.show()
```

`histogramByCluster(diabetesDataScaled, km4.labels_)`

This process scales easily when more cluster categories are needed.

`histogramByCluster(diabetesDataScaled, km10.labels_)`
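The binning and stacking steps inside `histogramByCluster` are easier to follow on a tiny example. This sketch uses made-up feature values and stand-in labels (the names `feature`, `labels`, `counts`, and `bottoms` are invented for the illustration) and prints the tables that the bar heights and bar bottoms are drawn from.

```python
import numpy as np
import pandas as pd

# made-up scaled feature values in [0, 1] and stand-in cluster labels
feature = pd.Series([0.05, 0.10, 0.30, 0.35, 0.60, 0.65, 0.90, 0.95], name="feature")
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])

# truncate each value down to the left edge of its bin
nbins = 4
binned = np.trunc(nbins * feature) / nbins

# count observations for each combination of bin and cluster
counts = pd.crosstab(binned, labels)

# each cluster's bar starts where the earlier clusters' bars end
bottoms = counts.cumsum(axis=1) - counts

print(counts)
print(bottoms)
```

Each row of `counts` is one histogram bin and each column is one cluster, so stacking the columns rebuilds the overall histogram of the feature. Note that this truncation approach assumes features scaled to [0, 1], and that a value of exactly 1.0 lands in its own extra bin at the right edge.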

# Conclusion

These visualizations will provide a strong base for evaluating clustering models. For more about how to do so in a systematic way, be sure to come to **my talk** at this Fall’s **Open Data Science Conference** in San Francisco!

# About the Author:

Evie Fowler is a data scientist based in Pittsburgh, Pennsylvania. She currently works in the healthcare sector leading a team of data scientists who develop predictive models centered on the patient care experience. She holds a particular interest in the ethical application of predictive analytics and in exploring how qualitative methods can inform data science work. She holds an undergraduate degree from Brown University and a master’s degree from Carnegie Mellon.

*Originally posted on OpenDataScience.com*
