Unsupervised Learning: Evaluating Clusters

K-means clustering is a partitioning approach to unsupervised statistical learning, in contrast to agglomerative approaches such as hierarchical clustering. Rather than building clusters by merging points step by step, a partitioning approach starts with all data points and divides them into a fixed number of clusters.

K-means is applied to a set of quantitative variables. We fix the number of clusters k in advance and choose initial guesses for the cluster centers (called “centroids”). The algorithm then iterates: assign each point to its closest centroid, then recalculate each centroid as the mean of the points assigned to it, repeating until the assignments stabilize.
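To make that loop concrete, here is a minimal sketch of the assign-then-recompute iteration in Python with NumPy, assuming X is an array of shape (n_samples, n_features). In practice you would reach for a library implementation such as scikit-learn’s KMeans, which adds smarter (k-means++) initialization.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm sketch: assign points to the nearest
    centroid, then recompute each centroid as its cluster's mean.
    Assumes no cluster empties out during the iterations."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points from X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids
```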

When we speak about the accuracy of a statistical learning algorithm, we mean a comparison of the true label to the predicted label. K-means clustering works on “unlabeled” data sets, where no true label exists, so accuracy cannot be applied directly as it is in supervised methods. There are, however, some measurements that you can use to evaluate clusters.

Within-Cluster Sum of Squares (WCSS)

WCSS is built on Euclidean distance. In two-dimensional space, the distance between points $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$; more generally, for points $p$ and $q$ in $n$ dimensions, $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$. The WCSS of a cluster $C_j$ with centroid $c_j$ is then $\mathrm{WCSS}_j = \sum_{x \in C_j} d(x, c_j)^2$.

Essentially, WCSS measures the variability of the observations within each cluster. A cluster with a small sum of squares is more compact than one with a large sum of squares; higher values indicate greater spread of the observations around the centroid.

From another perspective, WCSS is influenced by the number of observations: as the number of observations increases, so does the sum of squares. This means WCSS is often not directly comparable across clusters with different numbers of observations. To compare the within-cluster variability of different clusters, use the average distance from centroid instead.
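As a sketch of both quantities, the snippet below fits scikit-learn’s KMeans to toy data from make_blobs (hypothetical example data) and reports the total WCSS (exposed by scikit-learn as inertia_), each cluster’s own WCSS, and the size-adjusted average distance from centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Total WCSS across all clusters is exposed as inertia_.
print("total WCSS:", km.inertia_)

# Per-cluster WCSS and the size-adjusted average distance from centroid.
for j in range(km.n_clusters):
    pts = X[km.labels_ == j]
    dists = np.linalg.norm(pts - km.cluster_centers_[j], axis=1)
    wcss_j = (dists ** 2).sum()
    print(f"cluster {j}: n={len(pts)}, WCSS={wcss_j:.1f}, "
          f"avg distance from centroid={dists.mean():.2f}")
```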

Between-Cluster Sum of Squares (BCSS)

Essentially, BCSS measures the variation between all clusters. A large value can indicate centroids that are well separated, while a small value can indicate clusters that sit close to each other.
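Scikit-learn does not expose BCSS directly, but it can be computed as the size-weighted squared distance of each centroid from the overall data mean. The sketch below continues from the previous snippet (reusing X and km) and also checks the decomposition TSS = WCSS + BCSS.

```python
import numpy as np

# BCSS: size-weighted squared distance of each centroid from the overall mean.
overall_mean = X.mean(axis=0)
bcss = sum(
    (km.labels_ == j).sum()
    * np.linalg.norm(km.cluster_centers_[j] - overall_mean) ** 2
    for j in range(km.n_clusters)
)

# Sanity check: total sum of squares decomposes as TSS = WCSS + BCSS.
tss = (np.linalg.norm(X - overall_mean, axis=1) ** 2).sum()
print(f"BCSS={bcss:.1f}, WCSS={km.inertia_:.1f}, TSS={tss:.1f}")
```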

Other Cluster Metrics

  • To help choose the optimal value of k for K-means clustering, you can use the Elbow Method (see the sketch after this list).
  • Silhouette Coefficient — you can use the Scikit-learn function silhouette_score to compute the mean Silhouette Coefficient of all samples (also shown in the sketch after this list).
  • The Gap Statistic Method compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data, i.e., a distribution with no obvious clustering.
  • Here is a nice tutorial on K-means clustering (including a mathematical foundation) using R code examples, along with the use of WCSS and BCSS.
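As a sketch of the first two items, the snippet below (again reusing X from the earlier example) runs KMeans over a range of k values, plots total WCSS for the Elbow Method, and prints the mean Silhouette Coefficient from silhouette_score for each k.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow Method: plot total WCSS (inertia_) against k and look for the
# "elbow" where adding clusters stops paying off. Also report the mean
# Silhouette Coefficient for each k (higher is better, range -1 to 1).
ks = range(2, 9)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)
    print(f"k={k}: silhouette={silhouette_score(X, km.labels_):.3f}")

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("total WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```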

