Why is the formula for sample variance different from the formula for population variance?

STATISTICS 101

Differences for EDA in ML and why use one instead of the other


In statistics and machine learning, when we talk about a population, we mean the entire universe of possible values of a random variable. If you know the entire population, it is always possible to compute the mean and the variance as:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$

where n is the cardinality (the number of elements) of the population.
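
In NumPy, these are exactly what np.mean and np.var (with its default ddof=0) compute. A minimal sketch, assuming we really do have every value of the population (the numbers here are made up):

import numpy as np

# Toy "population": we pretend these six values are the whole universe.
population = np.array([10.2, 9.8, 11.5, 8.9, 10.6, 9.4])

mu = np.mean(population)     # population mean
sigma2 = np.var(population)  # population variance: ddof=0 divides by n
print(mu, sigma2)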

Most of the time, we can’t use the entire population because it is too complex to obtain, or gathering it simply isn’t feasible. Think, for instance, of a problem where you want to analyze the heights of the oak trees in a forest. You could, of course, measure every single tree in the forest and so collect statistics about the entire population, but this would be very expensive and would take a very long time. Instead, you can take a sample of, let’s say, 20 trees and try to relate the sample statistics to the population statistics. So, for N samples, we have:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
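
The same two estimates in NumPy, anticipating the experiment at the end of the article: ddof=1 tells np.var to divide by N-1, as in the formula above. A minimal sketch with a simulated population:

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(10, 2, 10_000)              # simulated forest of heights

sample = rng.choice(population, 20, replace=False)  # the 20 measured trees

x_bar = np.mean(sample)      # sample mean
s2 = np.var(sample, ddof=1)  # sample variance: ddof=1 divides by N-1
print(x_bar, s2)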
You may ask now: why N-1 instead of N? To answer, we need to do some computation. First of all, we compute the expected value of s²:

$$E\left[s^2\right] = E\left[\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right]$$

and then, with a bit of algebra:

$$\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 = \sum_{i=1}^{N}\left(x_i - \mu\right)^2 - N\left(\bar{x} - \mu\right)^2$$

Now, remembering that:

$$E\left[\left(x_i - \mu\right)^2\right] = \sigma^2 \qquad\text{and}\qquad E\left[\left(\bar{x} - \mu\right)^2\right] = \operatorname{Var}\left(\bar{x}\right) = \frac{\sigma^2}{N}$$

we have:

$$E\left[s^2\right] = \frac{1}{N-1}\left(N\sigma^2 - N\frac{\sigma^2}{N}\right) = \frac{N-1}{N-1}\,\sigma^2 = \sigma^2$$
What does this mean? Using the formula with N-1 gives us a sample variance that, on average, is equal to the unknown population variance. So, even with few samples, we can get a reasonable estimate of the actual but unknown parameters of the population distribution.

What if we did the computation with N instead of N-1? Let’s call this estimator s²_b (the biased sample variance) and repeat the same steps:

$$E\left[s_b^2\right] = E\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right] = \frac{1}{N}\left(N\sigma^2 - \sigma^2\right) = \frac{N-1}{N}\,\sigma^2 \neq \sigma^2$$
So, when we use N instead of N-1, we introduce an error called statistical bias: the estimator (here, the sample variance) is systematically different from the true population parameter (here, the variance).
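
This bias is easy to see empirically: averaging many repeated estimates, the divide-by-N version settles near (N-1)/N · σ², while the divide-by-(N-1) version settles near σ². A minimal simulation sketch (the sample size, repetition count, and distribution parameters are arbitrary choices):

import numpy as np

rng = np.random.default_rng(42)
sigma2, N, reps = 4.0, 20, 100_000  # true variance, sample size, repetitions

samples = rng.normal(10, np.sqrt(sigma2), size=(reps, N))

biased = np.var(samples, axis=1)            # divides by N
unbiased = np.var(samples, axis=1, ddof=1)  # divides by N-1

print(biased.mean())    # close to (N-1)/N * sigma2 = 3.8
print(unbiased.mean())  # close to sigma2 = 4.0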

The correcting factor N/(N-1) is called the Bessel factor (Bessel’s correction) and lets us obtain the unbiased variance s² from the biased one:

$$s^2 = \frac{N}{N-1}\, s_b^2$$
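
As a quick sanity check, this identity can be verified numerically: NumPy’s ddof argument selects the divisor. A minimal sketch on arbitrary made-up values:

import numpy as np

x = np.array([9.1, 11.4, 8.7, 10.2, 12.0])
N = len(x)

biased = np.var(x)            # ddof=0: divides by N
unbiased = np.var(x, ddof=1)  # ddof=1: divides by N-1

print(np.isclose(unbiased, biased * N / (N - 1)))  # True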
How does this apply to a machine learning problem? When we model an ML problem, we first look at the data (in both supervised and unsupervised learning), searching for patterns, statistical parameters, opportunities for dimensionality reduction, features to select, and so on. This is called Exploratory Data Analysis, or EDA. For both numerical and categorical features, we first look at the distribution of values among the observations. So, initially, we must estimate two parameters: the mean and the variance. If N, the number of observations, is small, we have to apply the Bessel factor to get a better estimate of the real variance. If N is big (we’ll see how big), we can omit it, because N/(N-1) is approximately equal to 1.

So, in a problem with:

  • very few observations
  • very few values of a feature (because, for example, we have a biased dataset)

we must apply the correcting factor.
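
To get a feel for when the correction matters, it helps to print N/(N-1) for a few sample sizes; a one-line sketch:

for N in (5, 20, 100, 1000, 100_000):
    print(N, N / (N - 1))
# 5 -> 1.25, 20 -> ~1.053, 100 -> ~1.010, 1000 -> ~1.001, 100000 -> ~1.00001

With a handful of observations, the correction changes the variance estimate by several percent; beyond a few hundred, it is practically negligible.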

In any case, we can’t be fully confident in the result, because we are using a sample and not the whole population. The best we can do is estimate a range of values within which the real variance falls (a confidence interval for the population variance).
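
As a taste of what such an interval looks like: under a normality assumption, the classic interval is [(N-1)s²/χ²_{1-α/2}, (N-1)s²/χ²_{α/2}]. A hedged sketch with scipy.stats (the sample is simulated, and the normality assumption is doing real work here):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(10, 2, 20)  # simulated sample of 20 tree heights
N = len(sample)
s2 = np.var(sample, ddof=1)

alpha = 0.05  # 95% confidence level
lower = (N - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=N - 1)
upper = (N - 1) * s2 / stats.chi2.ppf(alpha / 2, df=N - 1)
print(lower, upper)  # range that should contain the population variance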

Let’s see an example. Imagine a forest of 10,000 oak trees: this is the entire population. We want to estimate the distribution of heights. Suppose that, unknown to us, the heights are normally distributed with a mean of 10 m and a standard deviation (the square root of the variance) of 2 m. These are the statistical parameters of the entire population. We try to estimate them through a sample of 20 random oak trees, and we repeat the experiment 100 times. The following is the code in Python:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(1234)

# The entire population: 10,000 oak-tree heights, normally distributed.
mu = 10
sigma = 2
pop = np.random.normal(mu, sigma, 10000)

# Histogram of the whole population.
sns.set_style('whitegrid')
count, bins, ignored = plt.hist(pop, 100, density=True, color='lightgreen')

tests = 100  # number of repeated experiments
sam = []     # the samples themselves
mean = []    # sample means
std_b = []   # biased standard deviations (divide by N)
std_u = []   # unbiased standard deviations (divide by N-1)

fig, axs = plt.subplots(ncols=2)
sns.kdeplot(pop, bw_method=0.3, ax=axs[0])         # population density
for i in range(tests):
    sam_20 = np.random.choice(pop, 20)             # sample of 20 trees
    sns.kdeplot(sam_20, bw_method=0.3, ax=axs[1])  # density of each sample
    sam.append(sam_20)
    mean.append(np.mean(sam_20))
    std_b.append(np.std(sam_20))          # ddof=0: divide by N
    std_u.append(np.std(sam_20, ddof=1))  # ddof=1: divide by N-1

frame = {'mean': mean, 'std_b': std_b, 'std_u': std_u}
table = pd.DataFrame(frame)
plt.show()

This way, for each of the 100 experiments, we obtain a sample mean (mean), a biased standard deviation (std_b), and an unbiased standard deviation (std_u; ddof=1 makes NumPy divide by N-1 instead of N). Graphically, we obtain the density of the population on the left and the densities of the 100 samples on the right.

Given that a sample of 20 trees is a tiny subset of a population of 10,000 items, every time we run the test we get a different distribution. Nevertheless, on average, we get a reasonable estimate of the real mean and standard deviation. We can check that in Python with the following command:

table.describe()

As you can see, on average the unbiased sample standard deviation is closer to the population parameter than the biased one.

The goodness of these estimates is measured through confidence analysis, which we’ll discuss another time.

Conclusion

In this article, we have seen that the sample variance is affected by statistical bias, caused by estimating from very few observations compared to the cardinality of the entire population. We learned how to mitigate this error with the Bessel factor and worked through an example.
