Unsupervised Learning with k-means Clustering With Large Datasets

ODSC - Open Data Science
8 min readMay 15, 2020

--

When facing difficult problems with datasets, choosing the right model for the task often proves to be the most difficult task in ML. When we are given an unfamiliar set of features to represent, it can be a somewhat puzzling prospect. Then we have to get our hands dirty and try different models. An efficient estimator requires good datasets, which might change the course of the project.

This article is an excerpt from the book Artificial Intelligence by Example, Second Edition by Denis Rothman. In this article, we will explore how to implement k-means clustering with dataset volumes that exceed the capacity of the given algorithm.

[Related article: The Anatomy of K-Means Clustering]

k-means clustering takes unlabeled data and forms clusters of data points. The names (integers) of these clusters provide a basis to then run a supervised learning algorithm such as a decision tree. When facing a project with large unlabeled datasets, the first step consists of evaluating if machine learning will be feasible or not. Trying to avoid AI in a book on AI may seem paradoxical. However, in AI, as in real life, you should use the right tools at the right time. If AI is not necessary to solve a problem, do not use it.

Use a proof of concept (POC) approach to see if a given AI project is possible or not. A POC costs much less than the project itself and helps to build a team that believes in the outcome. Or, the POC might show that it is too risky to go forward with an ML solution. Intractable problems exist. It’s best to avoid spending months on something that will not work.

The first step is exploring the data volume and the ML estimator model that will be used.

If the POC proves that a particular ML algorithm will solve the problem at hand, the next thing to do is to address data volume issues. The POC shows that the model works on a sample dataset. Now, the implementation process can begin.

Anybody who has run a machine learning algorithm with a large dataset on a laptop knows that it takes some time for a machine learning program to train and test these samples. A machine learning program or a deep learning convolutional neural network consumes a large amount of machine power. Even if you run an ANN using a GPU (short for graphics processing unit) hoping to get better performance than with CPUs, it still takes a lot of time for the training process to run through all the learning epochs. An epoch means that we have tried one set of weights, for example, to measure the accuracy of the result. If the accuracy is not sufficient, we run another epoch, trying other weights until the accuracy is sufficient.

If you go on and you want to train your program on datasets exceeding 1,000,000 data points, for example, you will consume significant local machine power.

Suppose you need to use a k-means clustering algorithm in a corporation with hundreds of millions to billions of records of data coming from multiple SQL Server instances, Oracle databases, and a big data source. For example, suppose that you are working for the phone operating activity of a leading phone company. You must apply a k-means clustering program to the duration of phone calls in locations around the world over a year. That represents millions of records per day, adding up to billions of records in a year.

A machine learning k-means clustering training program running billions of records will consume too much CPU/GPU and take too much time even if it works. On top of that, a billion records might only represent an insufficient amount of features. Adding more features will dramatically increase the size of such a dataset.

The question now is to find out if k-means clustering will even work with a dataset this size. A k-means clustering problem is NP-hard. The P stands for polynomial and the N for non-deterministic.

The solution to our volume problem requires some theoretical considerations. We need to identify the difficulty of the problems we are facing.

Identifying the difficulty of the problem

We first need to understand what level of difficulty we are dealing with. One of the concepts that come in handy is NP-hard.

NP-hard — the meaning of P

The P in NP-hard means that the time to solve or verify the solution of a P problem is polynomial (poly=many, nomial=terms). For example, x3 is a polynomial. The N means that the problem is non-deterministic.

Once x is known, then x3 will be computed. For x = 3,000,000,000 and only 3 elementary calculations, this adds up to:

log x3=28.43

It will take 1028.43 calculations to compute this particular problem.

It seems scary, but it isn’t that much for two reasons:

  • In the world of big data, the number can be subject to large-scale randomized sampling.
  • k-means clustering can be trained in mini-batches (subsets of the dataset) to speed up computations.

Polynomial time means that this time will be more or less proportional to the size of the input. Even if the time it takes to train the k-means clustering algorithm remains a bit fuzzy, as long as the time it will take to verify the solution remains proportional thanks to the batch size of the input, the problem remains a polynomial.

NP-hard — the meaning of non-deterministic

Non-deterministic problems require a heuristic approach, which implies some form of heuristics, such as a trial and error approach. We try a set of weights, for example, evaluate the result, and then go on until we find a satisfactory solution.

The meaning of hard

NP-hard can be transposed into an NP problem with some optimization. This means that NP-hard is as hard or harder than an NP problem.

For example, we can use batches to control the size of the input, the calculation time, and the size of the outputs. That way, we can bring an NP-hard problem down to an NP problem.

One way of creating batches to avoid running an algorithm on a dataset that will prove too large for it is to use random sampling.

Implementing random sampling with mini-batches

A large portion of machine learning and deep learning contains random sampling in various forms. In this case, a training set of billions of elements will prove difficult, if not impossible, to implement without random sampling.

Random sampling is used in many methods: Monte Carlo, stochastic gradient descent, random forests, and many algorithms. No matter what name the sampling takes, they share common concepts to various degrees, depending on the size of the dataset.

Random sampling on large datasets can produce good results, but it requires relying on the law of large numbers, which we will explore in the next section.

Using the LLN

In probability, the LLN states that when dealing with very large volumes of data, significant samples can be effective enough to represent the whole dataset. For example, we are all familiar with polls that use small datasets.

This principle, like all principles, has its merits and limits. But whatever the limitations, this law applies to everyday machine learning algorithms.

In machine learning, sampling resembles polling. A smaller number of individuals represent a larger overall dataset.

Sampling mini-batches and averaging them can prove as efficient as calculating the whole dataset as long as a scientific method is applied:

  • Training with mini-batches or subsets of data
  • Using an estimator in one form or another to measure the progression of the training session until a goal has been reached

You may be surprised to read “until a goal has been reached” and not “until the optimal solution has been reached.” The optimal solution may not represent the best solution. All the features and all the parameters are often not expressed. Finding a good solution will often be enough to classify or predict efficiently.

The LLN explains why random functions are widely used in machine learning and deep learning. Random samples provide efficient results if they respect the central limit theorem (CLT).

The CLT

The LLN applied to the example of the k-means clustering project must provide a reasonable set of centroids using random sampling. If the centroids are correct, then the random sample is reliable.

This approach can now be extended to the CLT, which states that when training a large dataset, a subset of mini-batch samples can be sufficient. The following two conditions define the main properties of the CLT:

  • The variance between the data points of the subset (mini-batch) remains reasonable.
  • The normal distribution pattern with mini-batch variances remains close to the variance of the whole dataset.

A Monte Carlo estimator, for example, can provide a good basis to see if the samples respect the CLT.

Using a Monte Carlo estimator

The name Monte Carlo comes from the casinos in Monte Carlo and gambling. Gambling represents an excellent memoryless random example. No matter what happens before a gambler plays, prior knowledge provides no insight. For example, the gambler plays 10 games, losing some and winning some, creating a distribution of probabilities.

The sum of the distribution of f(x) is computed. Then random samples are extracted from a dataset, for example, x1, x2, x3,…, xn.

f(x) can be estimated through the following equation:

The estimator ê represents the average of the result of the predictions of a k-means clustering algorithm or any implemented model.

We have seen that a sample of a dataset can represent a full dataset, just as a group of people can represent a population when polling for elections, for example. Knowing that we can safely use random samples, just like in polling a population for elections, we can now process a full large dataset directly, or preferably with random samples.

In this article, the issue of large datasets in a model is addressed. We covered how to use k-means clustering with large datasets. We defined the NP-hard characteristic of k-means clustering and implemented random sampling concerning LLN, CLT, and the Monte Carlo estimator.

Understand the fundamentals and develop your own AI solutions in the updated edition of Artificial Intelligence by Example book by Denis Rothman packed with many new examples.

[Related article: K-Means Clustering Applied to GIS Data]

About the author

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, writing one of the very first word2matrix embedding solutions. He began his career authoring one of the first AI cognitive natural language processing (NLP) chatbots applied as a language teacher for Moët et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an advanced planning and scheduling (APS) solution used worldwide.

--

--

ODSC - Open Data Science
ODSC - Open Data Science

Written by ODSC - Open Data Science

Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.