Why is the formula for sample variance different from the formula for population variance?

STATISTICS 101

Differences for EDA in ML and why use one instead of the other


In statistics and machine learning, when we talk about a population, we mean the entire universe of possible values of a random variable. If you know the entire population, it is always possible to compute the mean and the variance as:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$

where n is the cardinality (the number of elements) of the population.
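
In NumPy, these are exactly what np.mean and np.var (with its default ddof=0) compute. A minimal sketch, assuming we really do have every value of the population (the numbers here are made up):

import numpy as np

# Toy "population": we pretend these six values are the whole universe.
population = np.array([10.2, 9.8, 11.5, 8.9, 10.6, 9.4])

mu = np.mean(population)     # population mean
sigma2 = np.var(population)  # population variance: ddof=0 divides by n
print(mu, sigma2)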

Most of the time, we can’t use the entire population because it is too complex to obtain, or gathering it simply isn’t feasible. Think, for instance, of a problem where you want to analyze the heights of the oak trees in a forest. You could, of course, measure every single tree in the forest and so collect statistics about the entire population, but this would be very expensive and would take a very long time. Instead, you can take a sample of, let’s say, 20 trees and try to relate the sample statistics to the population statistics. So, for N samples, we have:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
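
The same two estimates in NumPy, anticipating the experiment at the end of the article: ddof=1 tells np.var to divide by N-1, as in the formula above. A minimal sketch with a simulated population:

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(10, 2, 10_000)              # simulated forest of heights

sample = rng.choice(population, 20, replace=False)  # the 20 measured trees

x_bar = np.mean(sample)      # sample mean
s2 = np.var(sample, ddof=1)  # sample variance: ddof=1 divides by N-1
print(x_bar, s2)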
You may ask now: why N-1 instead of N? To answer, we need to do some computation. First of all, we compute the expected value of s²:

$$E\left[s^2\right] = E\left[\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right]$$

and then, with a bit of algebra:

$$\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 = \sum_{i=1}^{N}\left(x_i - \mu\right)^2 - N\left(\bar{x} - \mu\right)^2$$

Now, remembering that:

$$E\left[\left(x_i - \mu\right)^2\right] = \sigma^2 \qquad\text{and}\qquad E\left[\left(\bar{x} - \mu\right)^2\right] = \operatorname{Var}\left(\bar{x}\right) = \frac{\sigma^2}{N}$$

we have:

$$E\left[s^2\right] = \frac{1}{N-1}\left(N\sigma^2 - N\frac{\sigma^2}{N}\right) = \frac{N-1}{N-1}\,\sigma^2 = \sigma^2$$
What does this mean? Using the formula with N-1 gives us a sample variance that, on average, is equal to the unknown population variance. So, even with few samples, we can get a reasonable estimate of the actual but unknown parameters of the population distribution.

What if we did the computation with N instead of N-1? Let’s call this estimator s²_b (the biased sample variance) and repeat the same steps:

$$E\left[s_b^2\right] = E\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right] = \frac{1}{N}\left(N\sigma^2 - \sigma^2\right) = \frac{N-1}{N}\,\sigma^2 \neq \sigma^2$$
So, when we use N instead of N-1, we introduce an error called statistical bias: the estimator (here, the sample variance) is systematically different from the true population parameter (here, the variance).
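
This bias is easy to see empirically: averaging many repeated estimates, the divide-by-N version settles near (N-1)/N · σ², while the divide-by-(N-1) version settles near σ². A minimal simulation sketch (the sample size, repetition count, and distribution parameters are arbitrary choices):

import numpy as np

rng = np.random.default_rng(42)
sigma2, N, reps = 4.0, 20, 100_000  # true variance, sample size, repetitions

samples = rng.normal(10, np.sqrt(sigma2), size=(reps, N))

biased = np.var(samples, axis=1)            # divides by N
unbiased = np.var(samples, axis=1, ddof=1)  # divides by N-1

print(biased.mean())    # close to (N-1)/N * sigma2 = 3.8
print(unbiased.mean())  # close to sigma2 = 4.0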

The correcting factor N/(N-1) is called the Bessel factor (Bessel’s correction) and lets us obtain the unbiased variance s² from the biased one:

$$s^2 = \frac{N}{N-1}\, s_b^2$$
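
As a quick sanity check, this identity can be verified numerically: NumPy’s ddof argument selects the divisor. A minimal sketch on arbitrary made-up values:

import numpy as np

x = np.array([9.1, 11.4, 8.7, 10.2, 12.0])
N = len(x)

biased = np.var(x)            # ddof=0: divides by N
unbiased = np.var(x, ddof=1)  # ddof=1: divides by N-1

print(np.isclose(unbiased, biased * N / (N - 1)))  # True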
How does this apply to a machine learning problem? When we model an ML problem, we first look at the data (in both supervised and unsupervised learning), searching for patterns, statistical parameters, opportunities for dimensionality reduction, features to select, and so on. This is called Exploratory Data Analysis, or EDA. For both numerical and categorical features, we first look at the distribution of values among the observations. So, initially, we must estimate two parameters: the mean and the variance. If N, the number of observations, is small, we have to apply the Bessel factor to get a better estimate of the real variance. If N is big (we’ll see how big), we can omit it, because N/(N-1) is approximately equal to 1.

So, in a problem with:

  • very few observations
  • very few values of a feature (because, for example, we have a biased dataset)

we must apply the correcting factor.
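
To get a feel for when the correction matters, it helps to print N/(N-1) for a few sample sizes; a one-line sketch:

for N in (5, 20, 100, 1000, 100_000):
    print(N, N / (N - 1))
# 5 -> 1.25, 20 -> ~1.053, 100 -> ~1.010, 1000 -> ~1.001, 100000 -> ~1.00001

With a handful of observations, the correction changes the variance estimate by several percent; beyond a few hundred, it is practically negligible.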

In any case, we can’t be fully confident in the result, because we are using a sample and not the whole population. The best we can do is estimate a range of values within which the real variance falls (a confidence interval for the population variance).
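
As a taste of what such an interval looks like: under a normality assumption, the classic interval is [(N-1)s²/χ²_{1-α/2}, (N-1)s²/χ²_{α/2}]. A hedged sketch with scipy.stats (the sample is simulated, and the normality assumption is doing real work here):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(10, 2, 20)  # simulated sample of 20 tree heights
N = len(sample)
s2 = np.var(sample, ddof=1)

alpha = 0.05  # 95% confidence level
lower = (N - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=N - 1)
upper = (N - 1) * s2 / stats.chi2.ppf(alpha / 2, df=N - 1)
print(lower, upper)  # range that should contain the population variance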

Let’s see an example. Imagine a forest of 10,000 oak trees: this is the entire population. We want to estimate the distribution of heights. Suppose that, unknown to us, the heights are normally distributed with a mean of 10 m and a standard deviation (the square root of the variance) of 2 m. These are the statistical parameters of the entire population. We try to estimate them through a sample of 20 random oak trees, and we repeat the experiment 100 times. The following is the code in Python:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(1234)

# The entire population: 10,000 oak-tree heights, normally distributed.
mu = 10
sigma = 2
pop = np.random.normal(mu, sigma, 10000)

# Histogram of the whole population.
sns.set_style('whitegrid')
count, bins, ignored = plt.hist(pop, 100, density=True, color='lightgreen')

tests = 100  # number of repeated experiments
sam = []     # the samples themselves
mean = []    # sample means
std_b = []   # biased standard deviations (divide by N)
std_u = []   # unbiased standard deviations (divide by N-1)

fig, axs = plt.subplots(ncols=2)
sns.kdeplot(pop, bw_method=0.3, ax=axs[0])         # population density
for i in range(tests):
    sam_20 = np.random.choice(pop, 20)             # sample of 20 trees
    sns.kdeplot(sam_20, bw_method=0.3, ax=axs[1])  # density of each sample
    sam.append(sam_20)
    mean.append(np.mean(sam_20))
    std_b.append(np.std(sam_20))          # ddof=0: divide by N
    std_u.append(np.std(sam_20, ddof=1))  # ddof=1: divide by N-1

frame = {'mean': mean, 'std_b': std_b, 'std_u': std_u}
table = pd.DataFrame(frame)
plt.show()

This way, for each of the 100 experiments, we obtain a sample mean (mean), a biased standard deviation (std_b), and an unbiased standard deviation (std_u; ddof=1 makes NumPy divide by N-1 instead of N). Graphically, we obtain the density of the population on the left and the densities of the 100 samples on the right.

Given that a sample of 20 trees is a tiny subset of a population of 10,000 items, every time we run the test we get a different distribution. Nevertheless, on average, we get a reasonable estimate of the real mean and standard deviation. We can check that in Python with the following command:

table.describe()

As you can see, on average the unbiased sample standard deviation is closer to the population parameter than the biased one.

The goodness of these estimates is measured through confidence analysis, which we’ll discuss another time.

Conclusion

In this article, we have seen that the sample variance is affected by statistical bias, caused by estimating from very few observations compared to the cardinality of the entire population. We learned how to mitigate this error with the Bessel factor and worked through an example.
