Exploring the Central Limit Theorem in R

The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It’s certainly a concept that every data scientist should fully understand. In this article, we’ll go over some basic theory of the CLT, explain why it’s important for data scientists, and present some R code that explores the theorem’s characteristics.

CLT Theory

The CLT states that if you repeatedly draw random samples of a sufficiently large size from a population with a finite mean and variance, the distribution of the sample means will be approximately normal (aka Gaussian), no matter what the population distribution is. This distribution of sample means is referred to as the “sampling distribution.” It is centered on the population mean, and its spread shrinks as the sample size grows. The quality of the normal approximation depends on the sample size; drawing more samples only makes the shape of the sampling distribution easier to see.

In other words, the CLT states that the sampling distribution of the sample mean approximates a normal distribution. It does so regardless of the distribution of the sampled population, provided the sample size is sufficiently large. This enables data scientists to make statistical inferences about a population from a sample using the properties of the normal distribution, even if the sample is drawn from a population that is not normally distributed.
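To make this concrete, here is a minimal sketch that draws repeated samples from a strongly skewed population and checks whether the sample means behave as the CLT predicts. The exponential population, the sample size of 50, the 5,000 repetitions, and the seed are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n_per_sample <- 50    # observations in each sample (assumed value)
n_samples    <- 5000  # number of repeated samples (assumed value)

# Population: exponential with rate 1, so mean = 1 and sd = 1 (right-skewed)
pop_mean <- 1
pop_sd   <- 1

# Draw many samples and keep the mean of each one
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1)))

# The CLT predicts the sample means are roughly Normal(pop_mean, pop_sd / sqrt(n))
predicted_se <- pop_sd / sqrt(n_per_sample)
observed_se  <- sd(sample_means)

cat("mean of sample means:", round(mean(sample_means), 3), "\n")
cat("predicted SE:", round(predicted_se, 3),
    " observed SE:", round(observed_se, 3), "\n")

# Share of sample means within 1.96 standard errors of the population mean;
# under a normal approximation this should be close to 0.95
within_95 <- mean(abs(sample_means - pop_mean) <= 1.96 * predicted_se)
cat("share within 1.96 SE:", round(within_95, 3), "\n")
```

If the approximation holds, the observed standard error should sit close to the predicted one, and roughly 95% of the sample means should fall within 1.96 standard errors of the population mean.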

CLT for Data Scientists

So why is the CLT important? Because it’s at the core of what every data scientist does: making statistical inferences about data.

If we can assume that the sampling distribution is approximately normal, there are a number of things we can say about the data set. In data science, we often want to compare two different populations through statistical significance tests, i.e. hypothesis testing. Using the CLT and knowledge of the Gaussian distribution, we’re able to assess our hypothesis about the two populations.
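As a rough illustration of such a comparison, the sketch below runs a two-sample z-test on two skewed samples. The group names, the exponential populations, the sample sizes, and the seed are hypothetical choices for the example, not part of any real data set.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Two hypothetical populations, both skewed (exponential), with different means
group_a <- rexp(200, rate = 1 / 10)  # true mean 10
group_b <- rexp(200, rate = 1 / 12)  # true mean 12

# Difference in sample means and its estimated standard error
diff_means <- mean(group_b) - mean(group_a)
se_diff <- sqrt(var(group_a) / length(group_a) +
                var(group_b) / length(group_b))

# By the CLT, z is approximately standard normal when the true means are equal
z <- diff_means / se_diff
p_value <- 2 * pnorm(-abs(z))

cat("z =", round(z, 2), " two-sided p-value =", signif(p_value, 3), "\n")
```

The statistic computed here is the same one that base R's `t.test(group_b, group_a)` reports for Welch's test; the only difference is that the sketch reads the p-value from the normal distribution rather than the t distribution, which matters little at these sample sizes.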

In addition, widely used statistical techniques such as confidence intervals and hypothesis tests are based on the CLT. There are some limitations, however. You can’t rely on the CLT when sampling isn’t random and independent, or when the underlying distribution doesn’t have a finite mean and variance.
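The second limitation can be seen in a small simulation. The sketch below compares the spread of sample means for an exponential population (finite mean and variance) with that of a Cauchy population (neither is defined). The sample sizes, the number of repetitions, the seed, and the helper name `iqr_of_means` are assumptions made for this example.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

reps  <- 2000             # repeated samples per sample size (assumed value)
sizes <- c(10, 100, 1000) # sample sizes to compare (assumed values)

# Interquartile range of the sample means for a given sampler and sample size
iqr_of_means <- function(sampler, m) {
  IQR(replicate(reps, mean(sampler(m))))
}

iqr_exp    <- sapply(sizes, function(m) iqr_of_means(function(k) rexp(k), m))
iqr_cauchy <- sapply(sizes, function(m) iqr_of_means(function(k) rcauchy(k), m))

print(data.frame(sample_size = sizes,
                 iqr_exp     = round(iqr_exp, 3),
                 iqr_cauchy  = round(iqr_cauchy, 3)))
```

For the exponential population the spread of the sample means should shrink as the sample size grows. For the Cauchy population it should stay about the same, because the mean of independent standard Cauchy draws is itself standard Cauchy, so larger samples do not help.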

As a data scientist, you should be able to explain this theorem and understand why it’s so important. To deepen this understanding, I suggest you study the mathematical foundation of the CLT. Also, check out the Khan Academy instructional video on the CLT.

Using R to Explore the CLT

n <- 4                      # Number of trials in each binomial draw
p <- 0.05                   # Probability of success on each trial
s <- 2000                   # Number of simulations
m <- c(20, 100, 500, 1000)  # Sample sizes
EX <- n*p                   # Population mean of Binomial(n, p)
VarX <- n*p*(1-p)           # Population variance of Binomial(n, p)
Z_score <- matrix(NA, nrow = s, ncol = length(m))
for (i in 1:s){
  for (j in 1:length(m)){ # loop over sample size
    samp <- rbinom(n = m[j], size = n, prob = p)
    sample_mean <- mean(samp) # sample mean
    # Calculate Z score for mean of each sample size
    Z_score[i,j] <- (sample_mean-EX)/sqrt(VarX/m[j])
  }
}

Now let’s plot a series of four stacked histograms of the Z-scores, one for each sample size, and add the density curve of the standard normal distribution to each histogram.

# Display distribution of means
par(mfrow = c(4, 1), mar = c(2, 4, 2, 1)) # stack the four plots vertically
for (j in 1:4){
  hist(Z_score[,j], xlim=c(-5,5),
       freq=FALSE, ylim=c(0, 0.5),
       ylab="Density", xlab="",
       main=paste("Sample Size =", m[j]))
  # Density curve of the standard normal distribution
  x <- seq(-4, 4, by=0.01)
  y <- dnorm(x)
  lines(x, y, col="blue")
}
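The histograms can be complemented with a numeric check. The following sketch is a self-contained variant of the simulation, reusing the parameter values n = 4, p = 0.05, s = 2000 and the same four sample sizes, and summarizes the Z-scores for each sample size. The seed, the `skewness` helper, and the `summary_table` name are assumptions introduced for this example.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

n <- 4                      # trials in each binomial draw
p <- 0.05                   # probability of success on each trial
s <- 2000                   # number of simulations
m <- c(20, 100, 500, 1000)  # sample sizes

EX   <- n * p
VarX <- n * p * (1 - p)

# One column of Z-scores per sample size
Z <- sapply(m, function(m_j) {
  means <- replicate(s, mean(rbinom(n = m_j, size = n, prob = p)))
  (means - EX) / sqrt(VarX / m_j)
})

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

summary_table <- data.frame(
  sample_size = m,
  mean        = round(colMeans(Z), 3),
  sd          = round(apply(Z, 2, sd), 3),
  skewness    = round(apply(Z, 2, skewness), 3)
)
print(summary_table)
```

For a standard normal distribution the mean is 0, the standard deviation is 1, and the skewness is 0. The skewness column should move toward 0 as the sample size increases, which reflects the right-skewed binomial population gradually giving way to a symmetric sampling distribution.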

