From the course: Excel Statistics Essential Training: 2

The central limit theorem

- [Instructor] The central limit theorem is a rather simple, somewhat intuitive concept, but it comes with some interesting and helpful twists. We've already started to discover that when we take a sample, the larger our sample size, the more confident we are in our sample. Confident about what? Confident that this one sample reflects the entire population. But the central limit theorem takes this a bit further. It tells us that the more samples we take, the closer the average of our sample means will get to the actual population mean. So, if we take one sample with a reasonable sample size, we have some evidence. If we take lots of samples, the evidence points us closer and closer to the truth. But the central limit theorem actually tells us a bit more. It says that as we take many more samples, dozens, hundreds, even thousands, and plot the sample means as a histogram, they begin to look more and more like a normal distribution. And it goes one step further. If we take hundreds of random samples with a small sample size of four, the distribution might look like this. And that's with a sample size of only four data points. But look what happens when we take hundreds of random samples with a sample size of 50. The distribution starts to look like this. The curve is getting taller and narrower, which means a larger sample size gives us a smaller standard deviation of the sample means. Let's actually see this in action by using Excel. Okay, so what are we looking at here? What you have been given is a population of data: course grades for 950 students. We also know that the max course grade among these 950 students is 999 points and the lowest is 347. Let's go ahead and figure out the population mean of these scores.
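The pattern just described, means of larger samples clustering more tightly around the population mean, can also be sketched outside of Excel. This Python sketch uses made-up grades (not the course workbook; only the 347–999 bounds are borrowed) to show the spread of the sample means shrinking as n grows:

```python
import random
import statistics

random.seed(42)

# Hypothetical population of 950 course grades between 347 and 999 (the bounds
# mirror the workbook; the individual values here are made up).
population = [random.triangular(347, 999, 760) for _ in range(950)]

def sample_means(n, trials=500):
    """Means of `trials` random samples of size n, drawn with replacement."""
    return [statistics.mean(random.choices(population, k=n)) for _ in range(trials)]

for n in (4, 16, 50):
    means = sample_means(n)
    # A histogram of `means` gets taller and narrower as n grows; the shrinking
    # standard deviation below is the numeric version of that picture.
    print(f"n={n:>2}: spread of the sample means = {statistics.stdev(means):.1f}")
```

Running this a few times with different seeds shows the same behavior the charts will show: the n = 50 means are always bunched far more tightly than the n = 4 means.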
So, we're going to do an average of everything in column D, and we can see that the average course grade is 706.2. The next thing we want to do is see what types of course grades we have in this particular class. So, we're going to build a frequency table using the FREQUENCY function, and we have to tell it that we want to gather data from this particular column. And we're going to use the bins that we have here, from F6 down to about F20. Close that up, and it has now told us how many scores we have in each particular bin. And in case you don't remember, this is telling us right here that between 301 and 350 we have one score, between 351 and 400 we have three scores, and so on. All right, the next thing we want to do is build a histogram out of this particular data. First we'll grab this data right here, then we'll go to Insert, let Excel recommend some charts for us, and use this one right here. What this has done is show us what the distribution of our course grades looks like. All right, let's move on to the next thing. Now we're going to gather a random sample using the INDEX function, which we have seen before. We're going to grab our data from our course grades, which start at D2, and since there are 950 grades and a label up top, they go all the way down to D951. And because we're going to be using this over and over again, I'm going to lock these cells in. Next we need to pick one of those grades at random, so we're going to use the RANDBETWEEN function to choose a row number between 1 and 950. And there we go, we have one of our course grades.
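For readers following along outside Excel, here is a rough Python equivalent of the two formulas used above: FREQUENCY for the bin counts, and INDEX with RANDBETWEEN for one random grade. The grades are invented stand-ins for column D, and Excel's FREQUENCY also returns one extra overflow count, which is omitted here:

```python
import random

random.seed(7)

# Hypothetical stand-in for column D: 950 course grades between 347 and 999.
grades = [random.randint(347, 999) for _ in range(950)]

# FREQUENCY-style counts: each value is tallied against the first bin edge it
# does not exceed, so edge 350 catches 301-350, edge 400 catches 351-400, etc.
bin_edges = list(range(350, 1001, 50))  # 350, 400, ..., 1000

def frequency(data, edges):
    counts = [0] * len(edges)
    for x in data:
        for i, edge in enumerate(edges):
            if x <= edge:
                counts[i] += 1
                break
    return counts

counts = frequency(grades, bin_edges)

# INDEX + RANDBETWEEN: pick a row number at random, then read that grade.
row = random.randint(1, 950)   # like RANDBETWEEN(1, 950)
one_sample = grades[row - 1]   # like INDEX($D$2:$D$951, row)
```

Locking the cells with dollar signs in Excel ($D$2:$D$951) plays the same role as keeping `grades` fixed here: the data range stays put while the formula is copied around.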
Now we want to gather some big samples, so let's go ahead and copy this formula all the way over to the right. Now we have a sample of 36 different course grades, and we can also copy it down. And look what we just did: we now have 10 samples, and each sample has up to 36 course grades in it. The next thing I want to acknowledge is that not all samples are big. Sometimes samples are small, sometimes medium sized, sometimes a little bit bigger. So I'm going to ask, "What would happen if our sample size were only four? If n equals four, what would be the mean?" We'll use AVERAGE, but only over the first four grades. Or how about the average of the first 16 scores? Or the average of all 36 scores? What we would expect is that the larger the sample size, the closer we get to our population mean. It doesn't always happen, but it should happen most of the time. I'm going to do this for all of our samples, so I'll copy this down. The next thing we want to do is take the average of our averages, the mean of our means, so we'll do that right here. The average of all 10 of my n equals four means is this, and the average of my n equals 16 means is the average of all of these. I can copy across again, and the average of all of my n equals 36 samples is this right here. And remember, we have RANDBETWEEN living inside all of these cells, so every time we change something, the numbers change. But you can already see how, as things change, the n equals 36 average tends to be the one that's pretty close to our population mean. Let's go ahead and take this a step further: let's create frequency tables for our means.
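The nested averages above can be sketched in Python as well: 10 sample rows of 36 grades each, an AVERAGE over the first 4, 16, or 36 grades in each row, then an average of those averages. The grades here are hypothetical, not the workbook's values:

```python
import random
import statistics

random.seed(1)

# Hypothetical population (column D) and 10 sample rows of 36 grades each.
population = [random.randint(347, 999) for _ in range(950)]
samples = [[random.choice(population) for _ in range(36)] for _ in range(10)]

pop_mean = statistics.mean(population)

# AVERAGE over the first n grades in each row, then the mean of those means.
for n in (4, 16, 36):
    means = [statistics.mean(row[:n]) for row in samples]
    grand_mean = statistics.mean(means)
    print(f"n={n:>2}: mean of means = {grand_mean:.1f}  (population mean = {pop_mean:.1f})")
```

As in the worksheet, no single run is guaranteed to land closest at n = 36, but across repeated runs the larger sample size reliably tracks the population mean better.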
So, for n equals four, I'm going to build a frequency table again. My data array is going to be these means here in column L, and what I'm going to use as my bins are these over here. Close that up. Now, I'm going to be doing this for all three of my sample sizes, n equals four, n equals 16, and n equals 36, and I want my bins to stay the same, so I'm going to lock those in. And there you go, you can see how many means are falling into each bin. If I want, I can copy this to the right, and we can start to see what we're talking about: the larger the sample size, the narrower and taller our distribution gets. Let me show you this one other way. I'm going to grab all of these right here, go to Insert and Recommended Charts, and use this one right here. And notice what's happening. The blue bars represent n equals four, the orange represents n equals 16, and the gray represents n equals 36. As we change things, what do we notice? The blues, n equals four, form a distribution that is always rather low and rather wide. As we move to the larger sample size, the orange, n equals 16, gets a little bit narrower and a little bit taller. And by the time we get to n equals 36, we have a very narrow curve with a very tall center. We can see this over and over again. And there you go, the central limit theorem in action. The larger our sample size gets, the smaller the standard deviation of our sample means and the taller our curve becomes.
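What the chart shows qualitatively also has a precise form (a standard result of the central limit theorem, not something computed in the workbook): the standard deviation of the sample means is roughly the population standard deviation divided by the square root of n. A quick check with made-up grades:

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical grades; sigma is the population standard deviation.
population = [random.randint(347, 999) for _ in range(950)]
sigma = statistics.pstdev(population)

def spread_of_means(n, trials=2000):
    """Standard deviation of the means of `trials` size-n random samples."""
    means = [statistics.mean(random.choices(population, k=n)) for _ in range(trials)]
    return statistics.stdev(means)

for n in (4, 16, 36):
    observed = spread_of_means(n)
    predicted = sigma / math.sqrt(n)  # what the theorem predicts
    print(f"n={n:>2}: observed spread {observed:6.1f}   sigma/sqrt(n) {predicted:6.1f}")
```

The observed spreads land close to sigma divided by the square root of n, which is exactly why the n = 36 curve in the chart is so much taller and narrower than the n = 4 curve.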
