From the course: Excel Statistics Essential Training: 2

Sample considerations

- [Instructor] Inference, what does that mean? Well, it means that we reach a conclusion, a conclusion based on evidence. This course is largely about statistical inference, which means that in this statistics course, we'll make some evidence based conclusions. So what will we use as evidence? Samples. Let me explain. Suppose we want to know the average height of females in a certain city. Well, that poses some problems. Which problems, well, we don't have the time nor the money to measure everyone's height, and most people probably aren't willing to let us measure their height. And this is where samples come in. By taking a small group, a sample of women and getting their heights, we can use this as evidence to draw some conclusions about the average height of women in the city. That said, there's some things to consider. First, what's the sample size? In a city of about 25,000 women, how many women's heights would you need? 100 women, 1,000, 10,000? Needing only 100 women's heights would be great. A small sample size is cheaper and faster to collect, but small sample sizes often have large margins of error. Remember, we want solid evidence to draw our conclusions. Second, what's your selection process? Suppose you said 250 women's heights would be a good sample size. Well, how will you choose the 250 women? Will you choose women that go to the doctor's office? Women leaving the grocery store? Maybe you can try phone surveys or perhaps use a paid survey. We pay women to get their heights measured. I'm not sure any of those is a viable idea. And even if they were, each of those selection options is flawed. Which brings us to our third issue, bias. There are many ways samples can be biased. There's selection bias. If we select women at the doctor's office, we may be measuring women that are healthier or sicker or perhaps more wealthy than most other people. Phone surveys are biased toward people that answer their phones. Paid surveys bring in people that want money. And then there's polling bias. If we survey people outside the grocery store, perhaps our pollster is more likely to ask kinder looking women, more attractive women, or women that are similar to our pollster. A fourth issue to consider, methodology. How are we measuring? Are we using a ruler, a tape measure, a stadiometer? Are we letting people keep their shoes on? Is our pollster reminding people to stand up straight? In other words, are we being very careful or are we a bit sloppy? In essence, our goal is to collect the best sample possible because we want solid evidence so we can make an informed conclusion. The best possible sample is what statisticians call the simple random sample. But gathering a simple random sample is not that simple. Actually, it's quite difficult. Why? Well, a simple random sample must exhibit two key characteristics. The sample must be unbiased, and the data points must be independent. We already discussed bias, making sure every person selected is actually randomly selected. That is very difficult. Where we collect data, when and how we collect data, who's collecting the data, there are so many ways bias can creep into our sample. And then there's independence, the data in the sample must exhibit independence. What does this mean? It means that the selection of one participant must not influence the selection of other participants. Suppose we pay women to participate in our study. Is it possible a person that volunteers tells their friends about this paid opportunity? If so, now we have numerous people from the same friend group. One participant just influenced the selection of other participants. In this course, you won't need to gather data, you'll be given data sets, but often to see our statistics concepts in action, you'll be asked to gather sample data, simple random samples, of course, from the population of data we provide. Luckily we have Excel and gathering a simple random sample from a population in Excel is actually quite simple. Let's head on over to Excel to choose some random samples. So as you can see in our exercise file, we have given you a lot of different exam scores here, 228, it seems to be exact. Now what we want to do here is we want to grab a sample of 10 scores, and there's one really easy way to do this. You can go into Data, click on Data Analysis, and of course we're going to choose Sampling. We click OK. And now it's saying, well, what do you want to grab? Well, we want to grab from our input range D, and we are using labels, so I have that clicked. We don't want periodic samples, we want the best possible samples, we want a random sample and we want a sample of 10 numbers from this column. And we're going to put the numbers, let's see, where do we want to put those? We'll put 'em right here in F8. And let's go ahead and click OK. And just like that, I have a sample with 10 scores in it. If I want to do this again, I can do it again very quickly here, Sampling, OK. Let's only grab five this time. I'll put range, I'm going to put them in a different spot. I'm going to put them right here and we click OK. And there you go, so that's one way to do it. Another way to do it is we can use a random number. So I'm going to type in rand, open parentheses, close parentheses, and this is going to generate a random decimal. Now what I can do is then copy this down, click on this box twice, and now it's given a random number to every single one of the students and the test scores. What I'm going to do is I'm going to copy, and I'm going to do a Paste Special, Values. And I've now locked in those random numbers. And what I'm going to do is going to say, you know what, the 10 scores I'm going to take are the people that randomly just got the lowest numbers. And there we go. This is my sample of 10. Maybe you want to do it the other way. I'm going to take the biggest numbers. So now here is a second sample of 10 with the biggest numbers, and here are my scores, here are my scores. The nice thing is I actually know who contributed these scores as well. Yet another way to do this is by using an index function. So as you can see right here, we are going to use this formula right here, index, and the array, which I'm going to grab data from is right here. We know that there are, well actually, let's double check. Before we do that, how many scores do we have? Let's count 'em up, 228, all right, so this is going to go from D2 down to D229 because we have a label here. So we now know that this is going to be index. We want to go and grab a number that's in this range from D2 to D229. And the next thing is, well, we want it to be at random, any one of those, so I'm going to do a rand between of one of the scores between the first one and the 228th one in the group, close up my parentheses. Now I'm going to be using this formula to gather a number of samples. So what I'm going to do is I'm going to lock these in 'cause I'm always going to be grabbing from the same array, and I'm going to apply my dollar signs here. And so that's going to lock that in. I click this and there we go. I just grabbed a sample of five by copying across. And if I copy this down, I've now grabbed a sample, five samples, and each with five. You know, the thing is that, as you know, when you use any random number here, every single time I change the page at all, it grabs different random numbers. So what I can do is if I like the samples I have right here, I just copy and I do a Paste Special, and those values have now been locked in. Now, the thing that concerns people about this right here is they say, wait a second, I don't know where these scores came from, I don't know which student contributed that score, I don't know if there's duplicates in here. So there's another way to do this. Let's first find out where we're going to grab the data from. So what I've done is I've created this column over here for you that's called index, and the number just goes in order from 1 to 228. So the first thing I'm going to do is I'm going to say, well, I'm going to pick a row at random, it could be any one of those, 1 and 128. So what we're going to do is we're going to type in this formula here, and we're going to grab a number from this particular column over here. So our array is going to go from the first number in that index column down to the last number. And we know there's 228 scores, so it's going to take us down to A229. And next thing we want to do is we want to grab any one of those numbers at random, any number between 1 and 228. And we close this up and it's now at random just choosing a number that's in this particular column. And what we're going to do, we want to grab the associated exam score. And so what we'll do is we'll create a formula over here, and it's going to say, all right, I'm going to grab a number. I want to index a number from this column right here and it's going to go down to D229, and which number do I want? I want the one that I just pulled up here at random and I hit Return. And one thing I want to do before I move on is because I'm always going to be grabbing numbers from this area, I want to add my dollar signs. And I am going to do the same thing over here. I'm always going to be grabbing numbers from the exam score column in this area, so I'm going to add my dollar signs there as well. And so what this allows me to do is I can now just copy this, right, copy this down. You can also do it by clicking the little box here. And what I now know is that every single time, again, every time I change this, the numbers change, and you might say to yourself, how do I know that the numbers are actually working out? So I'm looking at the 32nd score. The 32nd score in the group is this one right here, A93. And that does match up. And again, if you're worried about things changing, you can always Copy, Paste Special Values and now you've locked those in. All right, so we found a number of different ways to find simple random samples from the data populations that were given in Excel, and we'll be using some of these throughout the course.

Contents