Data Driving: Standard errors

This post is about the standard error of the mean, which is often just called the standard error.

As a reminder, we're basing our discussion on a distribution of 5 km race times. The mean of the race times is 25.001 minutes. The standard deviation s is 2.973 minutes. The mean tells you where the distribution is centered. The standard deviation tells you about its width. Look back at the last post if you need a reminder.

The standard error is subtly different. It doesn't really describe your distribution. Rather, it helps you to understand how accurately you know the mean value of your data.

But the mean is the mean, you're thinking. How can it change?

Well, you're right. You calculated the mean of 500 race times. That's a fixed number. But the assumption in this type of statistics is that you're pulling these race times from an infinite series of potential times. When you picked this particular set of 500 values, the mean happened to be 25.001. That's now your estimate of the true mean of the infinite series of potential times.

But you could have picked a different set of 500 race times from the infinite list of potential numbers, in which case, you'd probably have calculated a different estimate of the mean. A third set of 500 times could have given you a still different estimate. And so on ...

How much would these estimates of the true mean vary? That's what the standard error helps you to decide.
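If you'd like to see this for yourself, here is a minimal sketch of the thought experiment. The original race times aren't reproduced here, so the code assumes the "infinite series" of potential times is a normal distribution with a mean of 25 minutes and a standard deviation of 3 minutes (numbers close to the ones above, but still an assumption). The names `TRUE_MEAN`, `TRUE_SD` and `sample_mean` are made up for the illustration.

```python
import random
import statistics

# Assumption: the infinite series of potential race times is modelled as a
# normal distribution. These values are illustrative, not the real data.
TRUE_MEAN = 25.0  # minutes
TRUE_SD = 3.0     # minutes


def sample_mean(rng, n=500):
    """Draw n simulated race times and return their mean."""
    times = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    return statistics.fmean(times)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Pick 1000 different sets of 500 race times and estimate the mean each time.
estimates = [sample_mean(rng) for _ in range(1000)]

# The spread of these estimates shows how much the mean varies between sets.
spread = statistics.stdev(estimates)
print(f"smallest estimate: {min(estimates):.3f}")
print(f"largest estimate:  {max(estimates):.3f}")
print(f"spread (standard deviation of the estimates): {spread:.3f}")
```

Each set of 500 times gives a slightly different estimate of the mean, and the spread of those estimates is the quantity the rest of this post is about.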

Take a look at the histograms below. They show what happens when you split the 500 original race times into groups of different sizes and then calculate the mean of each group.

So the middle panel (group size = 5) shows the distribution of means calculated using 100 groups which each contain 5 race times. The first value represented in the histogram is the mean of the first 5 race times, the second value is the mean of the second 5 race times, and so on.

Do you see how the distributions get narrower as the group size increases? That is because as you include more and more points in the group, it becomes increasingly representative of the original population. By the time you're averaging 20 points, there's not much difference between the individual subsets.
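The grouping procedure can be sketched in a few lines. Again, this uses simulated race times (a normal distribution with mean 25 and standard deviation 3, which is an assumption) rather than the original data, and the helper `group_means` and the choice of group sizes 1, 5 and 20 are hypothetical stand-ins for whatever was used to draw the histograms.

```python
import random
import statistics


def group_means(times, group_size):
    """Split times into consecutive groups and return the mean of each group."""
    return [
        statistics.fmean(times[i:i + group_size])
        for i in range(0, len(times) - group_size + 1, group_size)
    ]


rng = random.Random(1)  # fixed seed so the run is repeatable
# Assumption: 500 simulated race times standing in for the real ones.
times = [rng.gauss(25.0, 3.0) for _ in range(500)]

widths = {}
for size in (1, 5, 20):
    means = group_means(times, size)
    # The standard deviation of the group means is the width of the histogram.
    widths[size] = statistics.stdev(means)
    print(f"group size {size:2d}: {len(means):3d} groups, width = {widths[size]:.3f}")
```

With a group size of 5 you get 100 means, as in the middle panel, and the width shrinks as the group size grows.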

You can calculate the width of the histograms above just like we did when we calculated the standard deviation in the last post. The width of these distributions is the standard error of the mean. It goes down as you increase the group size. You can also calculate it as

$\text{SE}_{\bar{x}} = \frac{s}{\sqrt{n}}$

from Wikipedia, where s is the standard deviation calculated for your sample and n is the number of data points you have. As you increase n, the standard error goes down, because your sample becomes more representative of the original population. However, there are diminishing returns. Once you already have quite a few data points, you have to add a lot more to get a noticeably better estimate of the mean. That comes from the square root of n in the denominator: to halve the standard error, you need four times as many data points.
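As a quick worked example, here is the formula applied to the numbers from this post (s = 2.973 minutes, n = 500). The function name `standard_error` and the larger values of n are just illustrative choices to show the diminishing returns.

```python
import math


def standard_error(s, n):
    """Standard error of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)


s = 2.973  # standard deviation of the race times, in minutes

# n = 500 is the real sample size; 2000 and 8000 are hypothetical.
for n in (500, 2000, 8000):
    print(f"n = {n:5d}: standard error = {standard_error(s, n):.3f} minutes")
```

For the 500 race times this gives a standard error of about 0.133 minutes. Each time n is multiplied by four, the standard error only halves.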

So in summary,

the mean tells you where your data are centered
the standard deviation quantifies the spread in your original data
the standard error describes the uncertainty associated with the mean value

Next up, one-sample t-tests.

Sunday, April 24, 2016
