Sampling Designs Final Material


SAMPLING DESIGNS

Let’s first learn the software!

What is R?
R is a free software environment for statistical computing and graphics.

The R – chitecture
R exists as a base package with a reasonable amount of functionality. R and its
packages are stored in a central location known as the Comprehensive R Archive
Network (CRAN). Once a package is stored in the CRAN, anyone with an internet connection can
download it from the CRAN and install it to use within their own copy of R.

Pros and Cons of R


Advantages
-free
-versatile
-rapidly expanding tool that can respond quickly to new developments
Disadvantages
-less easy to use (you type instructions rather than pointing, clicking, and dragging things with
a mouse)
-works with a command line rather than a graphical user interface

Downloading and Installing R

To install R onto your computer, visit the project website (http://www.R-project.org).
The figure below shows the process of obtaining the installation files. On the main project page,
on the left-hand side, click on the link labelled ‘CRAN’.
There are various copies (mirrors) of CRAN across the globe; therefore, the link to CRAN
will take you to a page of links to the various ‘mirror’ sites. Scroll down this list to find a
mirror near you. You may click ‘https://cran.stat.upd.edu.ph/’ since this is the closest to us.

Once you have been redirected to the CRAN mirror that you selected, you will see a web page
that asks you which platform you use (Linux , MacOS or Windows). Click the link that applies
to you.

If you click on the ‘Windows’ link, you’ll be taken to another page with some more links.
Click on ‘base’, which will direct you to the webpage with the link to the setup file. Once there,
click on the link that says ‘Download R 3.6.1 for Windows’, which will initiate the download of
the R setup file. Once this file has been downloaded, double-click on it and you will enter a
(hopefully) familiar install procedure.
If you click on the ‘MacOS’ link, you will be taken directly to a page from where you can
download the install package by clicking on the link labelled ‘R-3.6.1.pkg’. Clicking this link will
download the install file; once downloaded, double-click on it and you will enter the normal
MacOS install procedure.

RStudio is an integrated development environment (IDE) for R. It includes a console, a
syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history,
debugging and workspace management. It makes R easier to use and is recommended for
beginners.

To Install RStudio
1. Go to www.rstudio.com and click on the "Download RStudio" button.
2. Click on "Download RStudio Desktop."
3. For Windows : Click on the version recommended for your system, or the latest
Windows version, and save the executable file. Run the .exe file and follow the
installation instructions.
4. For MacOS: Click on the version recommended for your system, or the latest Mac
version, save the .dmg file on your computer, double-click it to open, and then
drag and drop it to your applications folder.

The Main Windows in RStudio


Console – The main window where you can both type commands and see the results of
executing them. This is also where you see errors in the running program, if there are
any.
Editor – A separate window where you can write your commands rather than typing them directly into
the console.
Packages and Plots Pane – Where the graphs you create and the active R packages are displayed.

[Figure: RStudio window layout showing the Editor/Script/Data pane, the Environment/History pane, the Console, and the Files/Plots/Packages/Help pane.]

Menus in R for Windows


File – It allows you to do general things such as saving workspace. Likewise, you can open
previously saved files and print graphs, data or output. In essence, it contains all the options that
are customarily found in File menus.

Edit – This menu contains edit functions such as cut and paste. From here, you can also clear the
console, activate a rudimentary data editor, and change how the Graphical User Interface looks.

View – This menu lets you select whether or not to see the toolbar and whether to show a status
bar at the bottom of the window.

Packages – This menu is very important because it is where you load, install and update
packages.

Window – If you have multiple windows, this menu allows you to change how the windows in R
are arranged.

Help – It routes you to online help (links to frequently asked questions, the R webpage, etc.) and
also offers offline help (pdf manuals and system help files).

Commands, Objects and Functions

Everything you want to do has to be typed into the console.

Commands in R are generally made up of two parts: objects and functions. These are separated by
“<-”, which you can think of as meaning ‘is created from’. As such, the general form of a
command is: object <- function, which means ‘object is created from function’.

An object is anything created in R. It could be a variable, a collection of variables, a statistical
model, etc.

Functions are the things that you do in R to create your objects.

R is case sensitive, which means that if the same thing is written in upper case and in lower case, R
treats the two as completely different things.
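The "object <- function" pattern and case sensitivity can be illustrated with a minimal sketch. The object names `scores` and `average` are hypothetical, chosen only for this illustration; `c()`, `mean()` and `exists()` are base R functions.

```r
# 'scores' and 'average' are hypothetical object names used for illustration.
scores <- c(80, 75, 92)     # the object 'scores' is created from the function c()
average <- mean(scores)     # the object 'average' is created from the function mean()
print(average)

# R is case sensitive: 'Scores' is a different name from 'scores',
# so no object called 'Scores' exists yet.
print(exists("scores"))
print(exists("Scores"))
```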

R Workspace
The collection of objects and things you have created in a session is known as your workspace.

Setting a Working Directory

A working directory is the folder where you store your data files and where R looks for them by default.

- Create a folder and place the data files you’ll be using in that folder.
- Use the setwd( ) command to specify this newly created folder as the working directory.
Example: setwd("D:/R Training/Files")

After executing this command, we can access files in that folder directly without having
to refer to the full file path.

To check what the current working directory is, execute the command getwd( ).
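A small sketch of the same steps that can be run on any computer. It assumes a hypothetical folder named `R_Training_Files`, created inside R's temporary directory so that nothing on your own drive is changed.

```r
# Hypothetical folder, created inside R's temporary directory for illustration.
training_dir <- file.path(tempdir(), "R_Training_Files")
dir.create(training_dir, showWarnings = FALSE)

old_dir <- setwd(training_dir)   # setwd() returns the previous working directory
current_dir <- getwd()           # confirm that the working directory has changed
print(current_dir)

setwd(old_dir)                   # go back to the original working directory
```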

Installing Packages

Most packages do not come pre-installed with R.

Two Ways to Install Packages


1. Through menus.
2. Using a command.

1. In Windows, if you select Packages => Install package(s)…, the window that opens
first asks you to select a CRAN mirror and then to choose the package you want to install.

2. If you know the package you want to install, then the simplest way is to execute the
command install.packages("package.name"), in which ‘package.name’ is replaced by
the name of the package that you’d like to install. Note that the name of the package
must be enclosed in quotation marks.

Once a package is installed, you need to reference it for R to know that you’re using it. You need
to install the package only once, but you need to reference it each time you start a new
session of R.

To reference a package, we simply execute this general command: library(package.name)
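The install-once, load-every-session pattern can be sketched as a small helper. The function name `ensure_package` is hypothetical (it is not part of R); it only wraps the base functions `requireNamespace()`, `install.packages()` and `library()`.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # needed only once per computer
  }
  library(pkg, character.only = TRUE)  # needed in every new R session
  invisible(TRUE)
}

# 'stats' ships with R, so this call loads it without downloading anything.
ensure_package("stats")
```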

To update a previously installed R (the installr package works on Windows only)


1. Type install.packages("installr")
2. Type library(installr)
3. Type updateR()

BASIC CONCEPTS OF SAMPLING DESIGN

The population is the collection of all elements under consideration in a study.

Complete Enumeration Vs Sampling

In a complete enumeration (census), measurements on the variables of interest will be taken from
all the elements in the population.

In sampling, measurements will be taken only from a subset of a population.


Elementary Unit Vs Sampling Unit

An element or elementary unit is an object on which a measurement on the variable of interest
is taken.

A sampling unit is the unit that is selected in the sampling process. Sampling units are
nonoverlapping collections of elements from the population that cover the entire population.


Example:
Suppose a researcher wishes to select a sample in order to study the opinion of university
students in Metro Manila on the SOGIE bill. The elementary units in this study are the
university students.
Approach 1. The researcher compiles a list of all university students in Metro Manila. From the
list, the researcher selects the sample of university students. In this approach, the sampling units
are the university students themselves.
Approach 2. The researcher gets a list of universities. From the list, the researcher selects a
sample of the universities. The sample includes all the students in the selected universities. In
this approach, the sampling units are the universities, not the students.
Approach 3. The researcher gets a list of the universities in Metro Manila. From the list, the
researcher selects a sample of universities. The researcher then gets a list of students from each
of the selected universities and then selects a sample of students from each one of the lists. In
this approach, there are two sets of sampling units: universities in the first stage of
sampling and students in the second stage of sampling.

Random samples are selected using chance methods or random numbers.

1. One method is to number each subject in the population. Then place numbered cards in a
bowl, mix them thoroughly, and select as many cards as needed. The subjects whose
numbers are selected constitute the sample.
2. Obtain a random sample from a table of random numbers.
3. Generate random numbers with a computer.
[Table of random numbers]

Example:
Suppose a researcher wants to host an online talk radio show featuring interviews with provincial
governors on the subject of the war on drugs. Because of time constraints, the 45-minute blog
talk radio show can only accommodate five governors. The radio host wishes to select the governors at
random. Select a random sample of 5 provinces from the 43.

Step 1. Number each province from 1 to 43, as shown.

01. Aklan                 24. Davao Del Norte
02. Antique               25. Davao Del Sur
03. Biliran               26. Davao Oriental
04. Bohol                 27. Dinagat Islands
05. Capiz                 28. Lanao Del Norte
06. Cebu                  29. Lanao Del Norte
07. Eastern Samar         30. Lanao Del Sur
08. Guimaras              31. Maguindanao
09. Iloilo                32. Misamis Occidental
10. Leyte                 33. Misamis Oriental
11. Negros Occidental     34. Sarangani
12. Negros Oriental       35. South Cotabato
13. Northern Samar        36. Sultan Kudarat
14. Western Samar         37. Sulu
15. Siquijor              38. Surigao Del Norte
16. Southern Leyte        39. Surigao Del Sur
17. Agusan Del Norte      40. Tawi-Tawi
18. Agusan Del Sur        41. Zamboanga Del Norte
19. Basilan               42. Zamboanga Del Sur
20. Bukidnon              43. Zamboanga Sibugay
21. Camiguin
22. Compostela Valley
23. North Cotabato

Step 2. Using the table of random numbers, find a starting point. Say, the starting point is
in the 7th column and 1st row. So, the first number selected is 04.

Step 3. Select 5 unique numbers from the starting point. Disregard larger numbers and
numbers that appear more than once. So going down, we get 04, 37, 32, 15 and 07.

Step 4. Take note of the provinces that correspond to the selected numbers. The
governors of these provinces will join the online interview.
04 Bohol
37 Sulu
32 Misamis Occidental
15 Siquijor
07 Eastern Samar

In R, use the following commands:

To sample with replacement, type sample(1:43, 5, replace=TRUE)

To sample without replacement, type sample(1:43, 5, replace=FALSE)

For example, one run without replacement selected the governors of the following provinces:


13 Northern Samar
34 Sarangani
17 Agusan Del Norte
36 Sultan Kudarat
37 Sulu
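
The two commands can be compared in a short sketch. The seed value is arbitrary and is set only so that a run can be repeated; because the draw is random, your selected numbers will generally differ from the ones listed above.

```r
set.seed(123)   # arbitrary seed, used only to make this run repeatable

# Without replacement: five distinct province numbers between 1 and 43.
without_repl <- sample(1:43, 5, replace = FALSE)

# With replacement: the same province number may be drawn more than once.
with_repl <- sample(1:43, 5, replace = TRUE)

print(sort(without_repl))   # look these numbers up in the numbered list of provinces
print(with_repl)

# A draw without replacement can never contain duplicates.
print(anyDuplicated(without_repl) == 0)
```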
Note:
Generalizing from the sample to the population is valid only when the sampling scheme is appropriate.

TARGET POPULATION VS SAMPLED POPULATION

The target population is the collection of elements from which information is desired while the
sampled population is the collection from which the sample is actually selected.

Advantages of Sampling over Census


• Reduced Cost
-The cost of sampling is relatively lower.
• Large Population
-Elements from different groups can be well represented through sampling. When the
population is too large, a survey without sampling becomes impossible, especially when
time is limited.
• Greater Accuracy and Efficiency
-When every element must be measured, measurement error tends to increase; a smaller
sample can be measured with more care.
• Timeliness
-Sampling takes less time since the volume of data is reduced.
• Greater Scope
-More information can be extracted even with limited resources because results from
the sample can be generalized to the population.
• Nature of Testing Procedure
-Procedures such as grinding and sectioning of fossil material in an archaeological study, dissection,
quality control of manufactured products, and many more must be done on a sample, or else the
entire population is destroyed.
• Research Ethics
-A researcher should limit the use of animals, and any such use should follow an ethical
framework.

Disadvantages of Sampling
• There are chances of bias in the selection of the sampling method.
• Appropriate calculation of the sample size is a challenging task.
• It requires adequate knowledge of the subjects.
• When the population is not homogeneous, there is a need for an expert who has
specialized knowledge in sampling.

Characteristics of Estimators

1. The reliability/precision of an estimator refers to how reproducible the estimates are
over repetitions of the process of sampling and computing the estimate.
2. The validity of an estimator refers to how different the mean of the estimator is from the
true value of the parameter being estimated.
3. The accuracy of an estimator refers to how close the estimates are to the true value of the
parameter being measured over repetitions of the process of sampling and computing the
estimate.
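
These three characteristics can be made concrete with a small sketch. It assumes, as is commonly done, that reliability is measured by the variance of the estimator, validity by its bias, and accuracy by its mean squared error; the population values below are hypothetical. All possible samples of size 3 are listed, so the quantities are exact rather than simulated.

```r
# Hypothetical population of 10 values; the parameter of interest is its mean.
population <- c(3, 6, 2, 5, 2, 3, 4, 2, 6, 4)
theta <- mean(population)

# Every possible sample of size 3 (each column is one sample).
all_samples <- combn(population, 3)
estimates <- colMeans(all_samples)     # the sample mean computed for each sample

bias <- mean(estimates) - theta                     # validity: mean of estimator vs. true value
variance <- mean((estimates - mean(estimates))^2)   # reliability: spread over repetitions
mse <- mean((estimates - theta)^2)                  # accuracy: closeness to the true value

print(c(bias = bias, variance = variance, mse = mse))
```

In this sketch the mean squared error equals the variance plus the squared bias, which is why an estimator can be very reliable and still inaccurate when its bias is large.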

Major Classification of Sampling Plans

1. Element Sampling: the elementary units themselves serve as the sampling units.
2. Cluster Sampling: the sampling units contain many elementary units and all the
elementary units that belong in the selected sampling units will form the sample.
3. Multi-Stage Sampling: there is more than one stage of sampling and different sampling
units are used in each stage.

The sampling frame is the complete list of sampling units under study.

Types of Frame

1. directory type: this is a listing of sampling units. It may be a physical list of units or a
list of codes representing the units
2. map type: lists in the form of processed maps

Why do we need a frame?

1. It is essential for selecting the elements of the target population.
2. It provides information for locating and identifying the units.
3. It provides quantitative information for estimation of population parameters based on
sample observations.

Qualities of Good Sampling Frame


1. Includes all individuals in the target population.
2. Includes accurate information that can be used to contact the selected individuals.
3. Has a unique identifier for each member to ensure no duplication.
4. Has a logical organization to the list.
5. Has up-to-date information on each element needed in implementing the sampling plan.

Considerations in Choosing the Frame

1. Cost. If an existing list is to be used, it should not be more expensive to acquire and
clean this list than to generate a list of all sampling units in the population on your own.
2. Simple to use. The rule used to identify the sampling unit to which an element belongs
must not be too complicated.

Probability vs Nonprobability Sampling

A probability sample is one that is based on a sampling plan that gives every element in the
population a known, nonzero probability of being included in the sample; otherwise it is a
nonprobability sample.

When following a qualitative research design, non-probability sampling techniques, such as
purposive sampling, can provide researchers with strong theoretical reasons for their choice of
units (or cases) to be included in their sample.

Common Types of Nonprobability Sampling

1. Haphazard, Convenience or Accidental Sampling: the sampled elements are chosen
for convenience or haphazardly. Examples include a sample of volunteers, street corner
interviews, and pull-out questionnaires in a magazine.

2. Purposive Sampling: the elements are carefully selected to provide a “representative
sample”. Studies have demonstrated that selection bias can arise even with expert choice,
but nevertheless the method may well be appropriate for very small samples when the
expert has a good deal of information about the population elements. The two common
features of the method are (a) sampling units often consist of relatively large groups; and
(b) sampling units are chosen so that they will provide accurate estimates for important
control variables for which results are known for the whole population, and it is hoped
that they will also give “good” estimates for other variables that are highly correlated with
the control variables.

3. Quota Sampling: interviewers are assigned quotas on the number of respondents of
various types, often based on population figures for the different subgroups of the population
being studied.

4. Snowball Sampling: identifying one or more participants in the desired population and
using them to find other participants until the desired sample size is met.

Properties of Probability Sampling

1. The class of distinct samples $\{S_1, S_2, \ldots, S_v\}$ which the procedure is capable of selecting
can be defined. That is, the sampling units that belong to each $S_i$ can be specified.
2. Each possible sample $S_i$ has assigned to it a known probability of selection $p_i$.
3. One of the $S_i$'s will be selected by a random process in which each $S_i$ receives its
appropriate probability $p_i$.
4. The method for computing the estimate from the sample must be stated and must lead to
a unique estimate for any specific sample.

Inclusion Probability vs Selection Probability.

The inclusion probability, $\pi_j$, is the probability that the jth element of the population is
included in the sample. The selection probability, $p_i$, is the probability that the ith possible
sample $S_i$ is selected.

In probability sampling, the inclusion and selection probabilities are both known. This will be
possible if the identification of the sampling units included in the sample is based on a
randomization mechanism.
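
The link between the two probabilities can be sketched by listing every possible sample from a tiny population. The sketch assumes a hypothetical population of N = 5 elements and samples of n = 2 treated as unordered sets, each with the same selection probability; the inclusion probability of an element is then the sum of the selection probabilities of the samples that contain it.

```r
N <- 5   # hypothetical population size
n <- 2   # hypothetical sample size

# Each column is one possible sample S_i (elements identified by number).
samples <- combn(N, n)

# Equal selection probability p_i for every possible sample.
p <- rep(1 / ncol(samples), ncol(samples))

# Inclusion probability of element j: add up p_i over the samples containing j.
inclusion <- sapply(1:N, function(j) sum(p[apply(samples == j, 2, any)]))

print(inclusion)   # under this equal-probability design each value equals n / N
```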

The use of probability sampling does NOT guarantee the selection of a “representative sample”.

The knowledge of the inclusion and selection probabilities under probability sampling will allow
us to measure the reliability, validity and accuracy of the estimators.

Components Involved in Designing Sample Surveys

1. Sample Design: includes both sampling plan and the estimation procedure. The
sampling plan is the methodology used for selecting the sample from the population. The
estimation procedures are the algorithms or formulas used for obtaining estimates of the
parameters from the sample and for estimating the reliability/ accuracy of these
population estimates.

2. Survey Measurements: This component includes: (a) the variables needed in order to
meet the objectives of the survey and (b) the survey instruments to be used to measure
these variables.

3. Survey Operations: Once the sample has been chosen and the measurement instruments
or questionnaires have been drafted, pretested, and modified, the survey operation can begin.
Survey operation includes both fieldwork of the survey (including data collection) and
data management.

4. Statistical Analysis and Report Writing. After the data have been collected, coded,
edited, and processed, the data can be analyzed statistically and the findings incorporated
into a final report. As in all components of a sample survey, considerable care should be
taken in the interpretation of the findings of the survey.

Criteria for a Good Sample Design

1. Cost Efficiency. Each observation, or item, taken from the population contains a certain
amount of information about the parameter of interest. Since information costs money,
the researcher must decide on the amount that will be used to estimate this parameter.
Too little information prevents the researcher from making good estimates, while too
much information results in a waste of money.

Ways in Achieving Cost Efficiency

a. Initially decide on the total cost to be allocated to the survey, and then choose
the sample design that will yield estimates having the highest degree of accuracy
at the stated cost.
b. Make specifications on the desired accuracy of the estimate and choose the sample
design that will yield estimates meeting these specifications at the lowest possible
cost.

2. Feasibility. No matter how cost-efficient a particular design is, it is of no use if it is not
feasible to execute. This means the selection plan must make it
possible for the interviewers to identify the elements from whom to get the
information, as specified by the sampling plan.

SIMPLE RANDOM SAMPLING

Sample Selection Procedure

1. A sample of n elements from a population of N elements drawn using simple random
sampling without replacement (SRSWOR) is one in which each one of the $_NP_n$
possible permutations of n elements taken from the N elements of the population has the
same probability of selection.

2. A sample of n elements from a population of N elements drawn using simple random
sampling with replacement (SRSWR) is one in which each one of the $N^n$ possible
ordered n-tuples from the N elements of the population has the same probability of
selection.

Note:

• The sampling units in SRSWOR and SRSWR are the elements themselves.

Unlocking Terms:

A permutation is an arrangement of a set of n objects in a given order. An arrangement of
any $r \le n$ of these objects in a given order is called an r-permutation or a permutation of the
n objects taken r at a time.

FORMULA: $_nP_r = \dfrac{n!}{(n-r)!}$

An ordered n-tuple $(a_1, a_2, \ldots, a_n)$ is the ordered collection of n elements with $a_1$ as its first
element, $a_2$ as its second element, and $a_n$ as its nth element.
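
As a quick check of the counting rules, the sketch below counts the possible ordered samples for a hypothetical population of N = 5 and sample size n = 2. The function name `n_permutations` is made up for this illustration; `factorial()` is a base R function.

```r
# Hypothetical helper implementing nPr = n! / (n - r)!
n_permutations <- function(n, r) {
  factorial(n) / factorial(n - r)
}

N <- 5
n <- 2

srswor_count <- n_permutations(N, n)   # ordered samples without replacement
srswr_count <- N^n                     # ordered n-tuples with replacement

print(srswor_count)   # 20
print(srswr_count)    # 25
```

The five extra n-tuples under SRSWR are exactly those in which the same element is drawn twice.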

SRSWOR vs. SRSWR

In SRSWOR, a particular element can appear only once in a given sample since a permutation
must consist of distinct coordinates. In SRSWR, a particular element can appear more than once
in a sample.

Sample Selection Procedure

Step 1. Assign a number from 1 to N to each element in the population.

Step 2. Select n numbers (n distinct numbers for SRSWOR) from 1 to N by use of some random process such as a table of
random numbers, a computer or a calculator with a random number generator.

Step 3. The population elements corresponding to the selected numbers in Step 2 constitute the
sample using SRSWR (or SRSWOR, when the numbers are required to be distinct).

DETERMINING THE SAMPLE SIZE UNDER SRSWOR

General Procedure

Step 1. Specify what is expected of the sample in terms of the level of reliability needed for the
resulting estimates. This statement is usually in terms of the desired limits of error (absolute or
relative) and the corresponding level of confidence to be placed on the estimates.

In general, the larger the sample, the greater will be the reliability of the resulting estimates.
Validity, in general, cannot be improved with an increase in the sample size unless the bias is a
function of n. And because of the difficulty of ensuring that no unsuspected bias enters into the
estimates, the level of reliability is controlled instead of the level of accuracy.

Step 2. Find some equation that relates n with the desired reliability of the sample. The equation
will vary according to how the desired reliability is stated and the type of sampling procedure to
be used.

Step 3. Estimate the unknown parameters in Step 2.

Step 4. If more than one variable is to be measured in the survey, select the most vital variables
in the study. Prescribe the desired degree of reliability for each item and the corresponding
sample size is computed. More commonly, there is sufficient variation among the computed n’s
so that it is not advisable to choose the largest n, either from budgetary considerations or because
this will give an over-all standard of reliability that is substantially higher than originally
contemplated. In this event, the desired reliability may be relaxed for certain items in order
to permit the use of a smaller value of n. In some cases, these items are dropped from the
study.

Step 5. Finally appraise the chosen value of n to see whether it is consistent with the resources
available to take the sample. This demands an estimation of cost, labor, time, and materials
required to obtain the proposed size of the sample. If cost has been specified in advance and the
computed n is much larger than what the researcher can afford, let the researcher decide whether
to proceed with a much smaller sample size (thus reducing the reliability of the estimates) or to
look for a more efficient sampling design or to abandon efforts until more resources are found.

Note: If the computed $n_1$ is not an integer, the sample size n is usually taken as $n = \lfloor n_1 \rfloor + 1$, subject to
the unit cost of sampling. For an additional sampling unit, compare the increase in cost to the
increase in reliability, particularly in very small populations where the reliability will fluctuate
largely between $\lfloor n_1 \rfloor$ and $\lfloor n_1 \rfloor + 1$.

Ways of Specifying the Desired Degree of Reliability

1. Directly specify the desired variance of the estimator, $V_d$ = desired $Var(\hat{\Theta})$, or the
desired coefficient of variation of the estimator, $C_d$ = desired $CV(\hat{\Theta})$. The formula for n
using $C_d$ will involve the coefficient of variation of the population measures, which is
often more stable and easier to guess than the standard error.
The coefficient of variation (CV) is a statistical measure of the dispersion of data points
in a data series around the mean. It is often calculated as the ratio of the standard
deviation to the mean.

2. Specify the desired margin of error, d, in the estimate and the risk $\alpha$ that the researcher is
willing to incur that the actual error is larger than d; that is, $P(|\hat{\Theta} - \theta| > d) = \alpha$.
The margin of error is supposed to measure the maximum amount by which the sample
results are expected to differ from those of the actual population.

3. Specify the desired relative error, r, in the estimate and the risk $\alpha$ that the researcher is
willing to incur that the actual relative error is larger than r; that is,
$P\left(\left|\dfrac{\hat{\Theta} - \theta}{\theta}\right| > r\right) = \alpha$.
This type of error is relative to the size of the parameter being measured.

Ways of Estimating the Parameters

1. Use the results of a pilot survey.


Allowance must be made for the selective nature of the pilot when using its results to
estimate the unknown parameters. For instance, a common practice is to confine the pilot
work to a few clusters of units. Thus the computed estimates will measure mainly the
variability within a cluster and may understate the variability among clusters.

2. Use the results from previous surveys.


If suitable past data are found, allowance should also be made to take into account
changes over time.

3. Guess the structure of the population and use some mathematical results.

For example, Deming shows how simple mathematical distributions may be used to
estimate the population variance from information on the range and a general idea of the
shape of the distribution. If the distribution is like a binomial, with a proportion p of the
observations at one end of the range and a proportion q at the other end, $S^2$ can be
estimated by $pqr^2$, where r is the range. Other useful relations are that $S^2$ can be estimated by
$0.083r^2$ for a rectangular distribution, $0.056r^2$ for a distribution that is shaped like a right
triangle, and $0.042r^2$ for an isosceles triangle.
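
These rules of thumb can be collected in one small function. This is only a sketch: the function name `variance_from_range`, its argument names, and the range of 33 used in the call are hypothetical; the multipliers are the ones quoted above.

```r
# Hypothetical helper: guess S^2 from the range r and an assumed shape.
variance_from_range <- function(r, shape, p = NULL) {
  if (shape == "binomial" && is.null(p)) {
    stop("p must be supplied for the binomial-like shape")
  }
  switch(shape,
    binomial = p * (1 - p) * r^2,         # proportions p and q = 1 - p at the two ends
    rectangular = 0.083 * r^2,
    right_triangle = 0.056 * r^2,
    isosceles_triangle = 0.042 * r^2,
    stop("unknown shape")
  )
}

# Example: a variable believed to range over 33 units.
guess_rect <- variance_from_range(33, "rectangular")
guess_binom <- variance_from_range(33, "binomial", p = 0.5)
print(c(rectangular = guess_rect, binomial = guess_binom))
```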

4. Take the sample in two steps, the first being a sample of size $n_1$, from which the estimates
are computed and the required n is obtained.
This method gives the most reliable estimates, but it is not often used since it slows down
the completion of the survey. Cochran lists some formulas used in determining $n_2 = n - n_1$
after a sample of $n_1$ has already been taken.

FORMULAS FOR SAMPLE SIZE UNDER SRSWOR


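Formulas for Computing $n_0$

Parameter of Interest: Mean
  For specified $d$ or $V_d$: $n_0 = \dfrac{S^2 z_{\alpha/2}^2}{d^2}$ or $n_0 = \dfrac{S^2}{V_d}$
  For specified $r$ or $C_d$: $n_0 = \dfrac{CV^2 z_{\alpha/2}^2}{r^2}$ or $n_0 = \dfrac{CV^2}{C_d^2}$

Parameter of Interest: Proportion
  For specified $d$ or $V_d$: $n_0 = \dfrac{PQ z_{\alpha/2}^2}{d^2}$ or $n_0 = \dfrac{PQ}{V_d}$
  For specified $r$ or $C_d$: $n_0 = \dfrac{Q z_{\alpha/2}^2}{P r^2}$ or $n_0 = \dfrac{Q}{P C_d^2}$

• If $n_0/N < 0.05$, use $n = n_0$. Otherwise, use $n = \dfrac{n_0}{1 + \dfrac{n_0}{N}}$ when the objective is to estimate the mean, and
$n = \dfrac{n_0}{1 + \dfrac{n_0 - 1}{N}}$ when the objective is to estimate a proportion.
• When the objective is to estimate the total $\tau$, simply multiply $n_0$ by $N^2$ when $d$ or $V_d$
is specified. Otherwise, use the formula for the mean.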
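
The formulas for a specified margin of error d can be sketched in R as follows. The function names and the inputs in the usage lines (a guessed variance of 25, d = 1, N = 1000) are hypothetical; `qnorm()` is the base R function that returns the normal quantile $z_{\alpha/2}$.

```r
# Initial sample size n0 for a mean with specified margin of error d.
n0_mean <- function(S2, d, alpha) S2 * qnorm(1 - alpha / 2)^2 / d^2

# Initial sample size n0 for a proportion with specified margin of error d.
n0_proportion <- function(P, d, alpha) P * (1 - P) * qnorm(1 - alpha / 2)^2 / d^2

# Apply the finite population adjustment only when n0 / N is not below 0.05.
adjust_n <- function(n0, N, parameter = c("mean", "proportion")) {
  parameter <- match.arg(parameter)
  if (n0 / N < 0.05) return(n0)
  if (parameter == "mean") n0 / (1 + n0 / N) else n0 / (1 + (n0 - 1) / N)
}

# Hypothetical usage: guessed S^2 = 25, d = 1, alpha = 0.05, N = 1000.
n0 <- n0_mean(S2 = 25, d = 1, alpha = 0.05)
n <- adjust_n(n0, N = 1000, parameter = "mean")
print(c(n0 = n0, n = n, sample_size = ceiling(n)))
```

The final sample size is rounded up, in line with the note on non-integer values of the computed n.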

Example:

A community within a city contains 3,000 households and 10,000 persons. For purposes
of planning a community satellite to the local health department, it is desired to estimate the total
number of physician visits made during a calendar year by members of the community. For this
information to be useful, it should be accurate to within 10% of the true value. A small pilot
survey of 10 households, conducted for purposes of gathering preliminary information, yielded
the accompanying data on physician visits made during the previous calendar year. Using these
data as preliminary information, determine the sample size needed to meet the specifications of
the survey.

HH       # of Persons in HH     Total # of Visits
1        3                      12
2        6                      27
3        2                      16
4        5                      17
5        2                      1
6        3                      21
7        4                      34
8        2                      12
9        6                      24
10       4                      30
Total                           194

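Here $r = 0.1$. Let’s use $\alpha = 0.01$ so that $z_{\alpha/2} = 2.576$.

Based on the pilot study, we can guess $\mu$ to be

$$\frac{\sum_{i=1}^{10} X_i}{10} = \frac{194}{10} = 19.4$$

and

$$S^2 = \frac{10\sum_{i=1}^{10} X_i^2 - \left(\sum_{i=1}^{10} X_i\right)^2}{9(10)} = \frac{(10)(4636) - (194)^2}{90} = 96.93.$$

So, $CV = \dfrac{S}{\mu} = \dfrac{\sqrt{96.93}}{19.4} = 0.5075$.

$$n_0 = \frac{CV^2 z_{\alpha/2}^2}{r^2} = \frac{(0.5075)^2 (2.576)^2}{(0.1)^2} = 170.9076 \quad \text{(using the unrounded value of CV)}$$

Since $\dfrac{n_0}{N} = \dfrac{170.9076}{3000} \approx 0.057 > 0.05$, we cannot ignore the fpc. Thus, we compute

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{170.9076}{1 + \dfrac{170.9076}{3000}} = 161.696.$$

Rounding up, use a sample of 162 households.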
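
The arithmetic in this example can be checked with a short R sketch that starts from the ten pilot values in the table. Variable names are chosen for this illustration only; `var()` uses the n − 1 divisor, which matches the formula for $S^2$ used here.

```r
# Total number of physician visits in the 10 pilot households.
visits <- c(12, 27, 16, 17, 1, 21, 34, 12, 24, 30)

N <- 3000    # households in the community
r <- 0.10    # desired relative error
z <- 2.576   # z value for alpha = 0.01

mu_hat <- mean(visits)           # guessed mean
S2_hat <- var(visits)            # guessed variance (n - 1 divisor)
cv <- sqrt(S2_hat) / mu_hat      # guessed coefficient of variation

n0 <- cv^2 * z^2 / r^2
n <- if (n0 / N < 0.05) n0 else n0 / (1 + n0 / N)   # fpc adjustment for a mean

print(round(c(mean = mu_hat, S2 = S2_hat, CV = cv, n0 = n0, n = n), 4))
print(ceiling(n))   # households to sample
```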

Example (Capistrano)

An anthropologist is studying the 3,200 inhabitants of island X. Among other things, he wishes
to estimate the percentage of inhabitants belonging to blood group O. Find a conservative
estimate for n using SRSWOR if the anthropologist will be content if the percentage is correct
within ±5% except for a 1 in 20 chance.

d = .05 N = 3200 α = .05

Since only a conservative estimate for n is needed, we’ll use P = Q = .5 (for which we will observe
the largest variability).

$$n_0 = \frac{z_{\alpha/2}^2 PQ}{d^2} = \frac{(1.96)^2 (.5)(.5)}{(.05)^2} = 384.16$$

Since $\dfrac{384.16}{3200} \approx 0.12$ is not less than .05, we cannot ignore the fpc:

$$n = \frac{n_0}{1 + (n_0 - 1)/N} = \frac{384.16}{1 + 383.16/3200} = 343.08.$$

Use a sample of size 344.

ESTIMATING THE MEAN AND TOTAL USING SRSWOR

Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of the Standard Error under SRSWOR

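Parameter: Mean ($\mu$)
  Estimator: $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$
  Variance: $\dfrac{S^2}{n} \cdot \dfrac{N-n}{N} = \dfrac{\sigma^2}{n} \cdot \dfrac{N-n}{N-1}$
  Estimator of the standard error: $\dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N-n}{N}}$

Parameter: Total ($\tau$)
  Estimator: $\hat{\tau} = N\bar{y} = \dfrac{N}{n} \sum_{i=1}^{n} y_i$
  Variance: $\dfrac{N^2 S^2}{n} \cdot \dfrac{N-n}{N}$
  Estimator of the standard error: $\dfrac{Ns}{\sqrt{n}} \sqrt{\dfrac{N-n}{N}}$

The factor $\dfrac{N-n}{N-1}$ is called the finite population correction (fpc).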

The factors $\dfrac{N-n}{N} = 1 - f$ (where $f = \dfrac{n}{N}$) for the variance and $\sqrt{\dfrac{N-n}{N}}$ for the standard error will
also be referred to as the finite population correction (fpc). As n gets closer to N, the fpc decreases in
magnitude and thus causes a reduction in the value of the standard error. On the other hand, if
the sampling fraction, f, is very small (that is, information has been obtained from only a very
small fraction of the population), then the fpc is close to one.

Using Data Set 1, we can estimate the mean and total forced vital capacity of workers in the
company under consideration and their respective standard errors as follows:

Estimated mean: $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n} = \dfrac{3216}{40} = 80.4$

Estimated total: $N\bar{y} = 1200(80.4) = 96480$

$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = 153.9385$

Sampling fraction: $f = \dfrac{n}{N} = \dfrac{40}{1200}$

Estimated standard error of $\bar{y}$: $\widehat{se}(\bar{y}) = \dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N-n}{N}} = 1.9288$

Estimated standard error of $N\bar{y}$: $\widehat{se}(N\bar{y}) = \dfrac{Ns}{\sqrt{n}} \sqrt{\dfrac{N-n}{N}} = 2314.5320$
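
The estimators above can be wrapped in a small R function. This is a sketch only: the function name `srswor_estimates` and the five sample values in the usage line are hypothetical (Data Set 1 itself is not reproduced here), while the last lines check the standard errors using the summary figures quoted in the example.

```r
# Hypothetical helper: estimates of the mean and total under SRSWOR.
srswor_estimates <- function(y, N) {
  n <- length(y)
  fpc <- (N - n) / N                        # finite population correction, 1 - f
  y_bar <- mean(y)
  se_mean <- sqrt(var(y) / n) * sqrt(fpc)   # (s / sqrt(n)) * sqrt((N - n) / N)
  list(mean = y_bar, total = N * y_bar, se_mean = se_mean, se_total = N * se_mean)
}

# Hypothetical sample of 5 measurements from a population of 50 units.
est <- srswor_estimates(c(78, 85, 90, 72, 81), N = 50)
print(unlist(est))

# Check of the worked example from its summary figures: s^2 = 153.9385, n = 40, N = 1200.
se_mean_check <- sqrt(153.9385 / 40) * sqrt((1200 - 40) / 1200)
se_total_check <- 1200 * se_mean_check
print(round(c(se_mean_check, se_total_check), 4))
```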

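Estimators of Number of Units and Proportions of Units in C, Their Corresponding
Variances and Estimators of the Standard Error Under SRSWOR

Parameter: Proportion of Units in C ($P$)
  Estimator: $\hat{P} = \dfrac{\sum_{i=1}^{n} y_i}{n}$
  Variance: $\dfrac{NPQ}{n(N-1)} \left(1 - \dfrac{n}{N}\right) = \dfrac{PQ}{n} \cdot \dfrac{N-n}{N-1}$
  Estimated standard error: $\sqrt{\dfrac{\hat{P}\hat{Q}}{n-1}} \sqrt{1 - \dfrac{n}{N}}$

Parameter: Total No. of Units in C ($A$)
  Estimator: $\hat{A} = N\hat{P}$
  Variance: $\dfrac{N^2 PQ}{n} \cdot \dfrac{N-n}{N-1}$
  Estimated standard error: $N \sqrt{\dfrac{\hat{P}\hat{Q}}{n-1}} \sqrt{1 - \dfrac{n}{N}} = \sqrt{\dfrac{N\hat{P}\hat{Q}}{n-1}} \sqrt{N-n}$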
Example (Capistrano)

In a sample of size 100 selected using SRSWOR from a population of size 500, there are 37 units
in Class C. Estimate the population proportion of units in C and the total number of units in C
and their respective standard errors.

$\hat{P} = \frac{37}{100} = .37$. Its standard error is estimated to be

$$\sqrt{\frac{(.37)(.63)}{99}}\sqrt{1 - \frac{100}{500}} = 0.0434$$

$\hat{A} = (500)(0.37) = 185$ and its standard error is estimated to be (500)(0.0434) = 21.7.
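The same example can be verified numerically. The sketch below is in Python and uses a hypothetical helper name, `srswor_proportion`; it applies the estimated standard error formula with n - 1 in the denominator.

```python
import math

def srswor_proportion(a, n, N):
    """Estimate P and A = N*P, with estimated standard errors, under SRSWOR.

    a is the number of sampled units found in Class C.
    """
    p = a / n
    se_p = math.sqrt(p * (1 - p) / (n - 1)) * math.sqrt(1 - n / N)
    return p, se_p, N * p, N * se_p

p_hat, se_p, a_hat, se_a = srswor_proportion(a=37, n=100, N=500)
print(round(p_hat, 2), round(se_p, 4), round(a_hat, 1), round(se_a, 1))
```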

STRATIFIED RANDOM SAMPLING

Stratified Sampling is a probability sampling method where the population is divided into
nonoverlapping groups or strata based on supplementary information, and then independent
samples are selected within each stratum.

Stratified Random Sampling is a particular stratified sampling method where simple random
sampling without replacement is used in selecting the samples within each stratum.

Example:

1. A researcher wishes to estimate average enrollments and faculty sizes for high schools.
Private institutions tend to be smaller than the public ones, so stratified sampling is used
where the two strata are private and public.

2. A standard quality control check on automobile batteries involves simply measuring the
weight. One particular shipment from the manufacturer consisted of batteries produced in
six different months. The investigator decides to stratify by month in the sampling
inspection to observe month-to-month variation.

Basic Features of Stratified Random Sampling

• Prior information on the stratification variable is needed in order to assign an element
to its corresponding stratum. Thus, the frame will be in the form of a list of all elements
in the population with the corresponding information on the stratification variable. This
makes stratified random sampling more difficult than simple random sampling.

• Stratification will not always produce more reliable estimates as compared to the
estimates under SRSWOR of the same sample size unless all of the strata are large. A
sufficient condition that will assure more reliable estimates is the selection of a
stratification variable that will decompose the variance $\sigma^2$ in such a way that $\sigma_B^2$ is larger
than $\sigma_w^2$.

• The choice of allocation method affects the reliability of the estimates. The allocation
method that yields the smallest variance per unit cost is optimum allocation. However, if
the stratum variances are not too different from each other, the estimate using
proportional allocation will be as reliable as the estimate using Neyman allocation.
We would then prefer to use proportional allocation since the sample will be a
self-weighting sample.

• Other considerations in choosing the stratification variables are: (i) convenience (e.g.
geographic areas), since stratified sampling will allow for the decentralization of data
collection and processing; and (ii) simplicity of the domain analysis, since the $n_h$'s will
not be random variables because they are fixed by design.

• Misclassification of elements will result in biased estimates.

Sample Selection Procedure

Step 1. Clearly specify the strata. These strata must be nonoverlapping and their union must be
the whole population.

Step 2. Place each sampling unit of the population into its appropriate stratum.

Step 3. Using a randomization mechanism, select a sample from each stratum, making sure that
the selections of the samples are independent of each other. The sample size may vary from one
stratum to the other, depending on the allocation method. The sample allocation procedure may
also vary in the different strata.

In stratified random sampling, use simple random sampling without replacement in each stratum.
Make sure that a different set of random numbers is generated in each one of the strata so that the
observations chosen in one stratum do not depend upon those chosen in another.

Step 4. The stratified random sample consists of the combined samples selected in Step 3.

Example.

Suppose we wish to select a sample of 50 students using stratified random sampling with sex as
stratification variable from a population consisting of 250 females and 100 males. In order to do
this our frame must be in the form of a list of students with information on the sex of each
student. We partition the population into two strata: Female and Male. Each female will be
assigned a unique number from 1 to 250 and each male a unique number from 1 to 100. Suppose
our sample of 50 must consist of 30 females and 20 males. To select a sample of females using
SRSWOR, we need to use a randomization mechanism to choose 30 distinct numbers from 1 to 250.
The females in the list associated with the selected numbers will be included in the sample. To
select a sample of males using SRSWOR, we once again use a randomization mechanism to choose 20
distinct numbers from 1 to 100. The males in the list associated with the selected numbers will be
included in the sample.
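The selection procedure can be sketched in code. The example below is in Python and assumes a frame of 250 females and 100 males with 30 and 20 units drawn respectively; the function name, the dictionary keys and the seed are illustrative choices and do not come from the text. Successive draws from one random generator are independent of each other, which is what Step 3 requires.

```python
import random

def stratified_srswor(stratum_sizes, sample_sizes, seed=None):
    """Draw an SRSWOR of labels 1..N_h separately within each stratum."""
    rng = random.Random(seed)
    sample = {}
    for stratum, N_h in stratum_sizes.items():
        n_h = sample_sizes[stratum]
        # rng.sample selects n_h distinct labels without replacement
        sample[stratum] = sorted(rng.sample(range(1, N_h + 1), n_h))
    return sample

selected = stratified_srswor(
    stratum_sizes={"female": 250, "male": 100},
    sample_sizes={"female": 30, "male": 20},
    seed=2024,  # arbitrary seed, only to make the illustration reproducible
)
print(selected["female"])
print(selected["male"])
```

The stratified random sample is the union of the two lists of selected labels.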

Notations

Population Figures

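$N_h$ = number of units in the population that belong in the hth stratum

$Y_{hj}$ = measure taken from the jth element in the hth stratum

$W_h$ = hth stratum weight $= \frac{N_h}{N}$

$f_h$ = sampling fraction in the hth stratum $= \frac{n_h}{N_h}$

$\mu_h$ = hth stratum mean $= \frac{1}{N_h}\sum_{j=1}^{N_h} Y_{hj}$

$\mu$ = population mean $= \frac{1}{N}\sum_{h=1}^{L}\sum_{j=1}^{N_h} Y_{hj} = \frac{1}{N}\sum_{h=1}^{L} N_h\mu_h = \sum_{h=1}^{L} W_h\mu_h$

$\tau$ = population total $= \sum_{h=1}^{L}\sum_{j=1}^{N_h} Y_{hj} = \sum_{h=1}^{L} N_h\mu_h$

$\sigma_w^2$ = variance within strata $= \frac{1}{N}\sum_{h=1}^{L}\sum_{j=1}^{N_h} (Y_{hj} - \mu_h)^2$

$\sigma_B^2$ = variance between strata $= \frac{1}{N}\sum_{h=1}^{L}\sum_{j=1}^{N_h} (\mu_h - \mu)^2 = \sum_{h=1}^{L}\frac{N_h}{N}(\mu_h - \mu)^2$

$\sigma^2$ = population variance $= \frac{1}{N}\sum_{h=1}^{L}\sum_{j=1}^{N_h} (Y_{hj} - \mu)^2$

Note: $\sigma^2 = \sigma_w^2 + \sigma_B^2$

$S_h^2 = \frac{1}{N_h - 1}\sum_{j=1}^{N_h} (Y_{hj} - \mu_h)^2$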

Sample Counterparts
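$n_h$ = number of units in the sample that belong in the hth stratum ($\sum_{h=1}^{L} n_h = n$)

$y_{hi}$ = measure taken from the ith element in the hth stratum

$\bar{y}_h$ = sample mean in the hth stratum $= \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi}$

$s_h^2 = \frac{1}{n_h - 1}\sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2$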

Allocation of Sample to Strata

1. Equal Allocation: the same number of elements is sampled from each stratum. This
is used if the primary objective of the survey is to test hypotheses about differences
among the strata with respect to the variable of interest (under the assumption that
within strata variances are equal).

2. Proportional Allocation (self-weighting samples): the sampling fraction $\frac{n_h}{N_h}$ is
specified to be the same for each stratum, which implies that the overall sampling
fraction $\frac{n}{N}$ is the fraction taken from each stratum. This is often used because of its
simplicity even if it is not the optimal design in terms of precision of estimates.

3. Optimal Allocation: the allocation that will yield an estimate that has the lowest
variance per unit cost. A special case is called Neyman allocation, where the cost per
unit is the same in all strata.

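STRATUM SIZE & VAR($\bar{y}_{st}$) of the DIFFERENT METHODS OF ALLOCATION

Stratum sample size $n_h$
    Equal Allocation: $n_h = \frac{1}{L}\,n$
    Proportional: $n_h = W_h n = \frac{N_h}{N}\,n$
    Optimal: $n_h = \frac{W_h S_h/\sqrt{c_h}}{\sum_{h=1}^{L} W_h S_h/\sqrt{c_h}}\,n = \frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^{L} N_h S_h/\sqrt{c_h}}\,n$
        where the cost function is $C = c_0 + \sum c_h n_h$,
        $c_0$ = overhead cost and $c_h$ = cost per unit in the hth stratum.
        If cost is fixed in advance: $n_h = \frac{(N_h S_h/\sqrt{c_h})\,(C - c_0)}{\sum_{h=1}^{L} N_h S_h\sqrt{c_h}}$

Var($\bar{y}_{st}$)
    Equal Allocation: $\frac{L}{n}\sum_{h=1}^{L} W_h^2 S_h^2 (1 - f_h)$
    Proportional: $\frac{1-f}{n}\sum_{h=1}^{L} W_h S_h^2$ or $\frac{\sum_{h=1}^{L} W_h S_h^2}{n} - \frac{\sum_{h=1}^{L} W_h S_h^2}{N}$
    Optimal: $\frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{n} - \frac{\sum_{h=1}^{L} W_h S_h^2}{N}$ for Neyman Allocation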

• Under proportional allocation, a larger sample is taken in a stratum if the stratum is
larger.
• Under optimal allocation, a larger sample is taken in a stratum if: (a) the stratum is
larger; (b) the stratum is more variable internally; or (c) sampling is cheaper in the stratum.
• For stratified random sampling with proportional allocation, the variance simplifies to

$$V_{prop} = Var(\bar{y}_{st}) = \frac{1-f}{n}\sum_{h=1}^{L} W_h S_h^2$$

• With proportional allocation and equal variances $S_h^2$ in all strata, say $S_c^2$, we have the
simple result

$$Var(\bar{y}_{st}) = \frac{S_c^2}{n}(1-f)$$

• For stratified random sampling with Neyman allocation, the variance simplifies to

$$V_{opt} = Var(\bar{y}_{st}) = \frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{n} - \frac{\sum_{h=1}^{L} W_h S_h^2}{N}$$

Example:

(Capistrano) Consider the summary data from 3 strata:

Stratum   N_h   S_h   W_h = N_h/N
1         100   50    0.2
2         150   10    0.3
3         250   5     0.5

How will we allocate the total sample of 140 elements to each stratum by using equal allocation,
proportional allocation and Neyman allocation? Compute for Var ( y st ) for each.

Equal Allocation

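$n_h = \frac{1}{L}\,n = \frac{140}{3} = 46.67$ for h = 1, 2, 3 (If we round up to 47, the sample size will increase to 141.)

$$Var(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2\,\frac{S_h^2}{n_h}(1 - f_h) = \frac{1}{47}\left[(0.2)^2(50^2)\left(1 - \frac{47}{100}\right) + (0.3)^2(10^2)\left(1 - \frac{47}{150}\right) + (0.5)^2(5^2)\left(1 - \frac{47}{250}\right)\right] \approx 1.3671$$

Proportional Allocation

$n_h = W_h n$, so $n_1 = 28$, $n_2 = 42$, $n_3 = 70$

$$Var(\bar{y}_{st}) = \frac{1-f}{n}\sum_{h=1}^{L} W_h S_h^2 = \frac{1 - 0.28}{140}\left[0.2(50^2) + 0.3(10^2) + 0.5(5^2)\right] = 2.79$$

Neyman Allocation

$$n_h = \frac{N_h S_h}{\sum_{h=1}^{L} N_h S_h}\,n$$

$$\sum_{h=1}^{L} N_h S_h = 100(50) + 150(10) + 250(5) = 7750$$

So, $n_1 = \left(\frac{5000}{7750}\right)(140) = 90.32$, $n_2 = 27.10$, $n_3 = 22.58$, which we would round off to
$n_1 = 90$, $n_2 = 27$, $n_3 = 23$.

$$Var(\bar{y}_{st}) = \frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{n} - \frac{\sum_{h=1}^{L} W_h S_h^2}{N} = \frac{\left(0.2(50) + 0.3(10) + 0.5(5)\right)^2}{140} - \frac{0.2(50^2) + 0.3(10^2) + 0.5(5^2)}{500} = 0.63107$$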
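The three allocations and their variances can be compared with a few lines of code. This is an illustrative Python sketch; the function names are hypothetical, and the variance is computed from the general formula $\sum W_h^2 (S_h^2/n_h)(1 - f_h)$, which reduces to the special-case formulas above when the unrounded proportional or Neyman sample sizes are used.

```python
def allocate(N_h, S_h, n, method):
    """Return the (unrounded) stratum sample sizes for a total sample of n."""
    N = sum(N_h)
    if method == "equal":
        return [n / len(N_h)] * len(N_h)
    if method == "proportional":
        return [n * Nh / N for Nh in N_h]
    if method == "neyman":
        total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
        return [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]
    raise ValueError("unknown allocation method: " + method)

def var_mean_st(N_h, S_h, n_h):
    """Var(ybar_st) = sum of W_h^2 * (S_h^2 / n_h) * (1 - n_h / N_h)."""
    N = sum(N_h)
    return sum((Nh / N) ** 2 * Sh ** 2 / nh * (1 - nh / Nh)
               for Nh, Sh, nh in zip(N_h, S_h, n_h))

N_h, S_h = [100, 150, 250], [50, 10, 5]
prop = allocate(N_h, S_h, 140, "proportional")
ney = allocate(N_h, S_h, 140, "neyman")
v_equal47 = var_mean_st(N_h, S_h, [47, 47, 47])   # equal allocation, rounded up
v_prop = var_mean_st(N_h, S_h, prop)
v_ney = var_mean_st(N_h, S_h, ney)
print([round(x, 2) for x in ney], round(v_equal47, 4), round(v_prop, 4), round(v_ney, 5))
```

Because $S_1$ is much larger than the other stratum standard deviations, Neyman allocation gives a far smaller variance than either equal or proportional allocation for this population.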

It is possible that the computed $n_h$ under optimal allocation will exceed the value of $N_h$. For
example, the sample size needed is 140 and $N_1 = 100$, $N_2 = 110$, $N_3 = 120$ but the computed $n_1$
under Neyman allocation is $120 > N_1$. In such a case, we will use 100% sampling in the first
stratum and allocate the remaining $n - N_1$ elements to the other strata using the formula

$$n_h = (n - N_1)\,\frac{W_h S_h}{\sum_{h=2}^{L} W_h S_h}, \quad h = 2, \ldots, L$$
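The reallocation rule can be sketched as a small loop that caps any stratum whose computed $n_h$ exceeds $N_h$ and redistributes the remainder. This Python sketch uses a hypothetical function name, and the stratum standard deviations (138, 10, 10) are made-up values chosen only so that the first computed $n_1$ equals 120 as in the example; they do not come from the text.

```python
def neyman_with_cap(N_h, S_h, n):
    """Neyman allocation in which no stratum receives more than N_h units."""
    alloc = [None] * len(N_h)
    free = set(range(len(N_h)))      # strata not yet fixed at 100% sampling
    remaining = n
    while True:
        denom = sum(N_h[h] * S_h[h] for h in free)
        trial = {h: remaining * N_h[h] * S_h[h] / denom for h in free}
        over = [h for h in free if trial[h] > N_h[h]]
        if not over:                 # every remaining stratum fits: done
            for h in free:
                alloc[h] = trial[h]
            return alloc
        for h in over:               # take the whole stratum, then re-allocate
            alloc[h] = N_h[h]
            remaining -= N_h[h]
            free.remove(h)

# Hypothetical S_h values; the first pass gives n_1 = 120 > N_1 = 100
alloc = neyman_with_cap(N_h=[100, 110, 120], S_h=[138, 10, 10], n=140)
print([round(x, 2) for x in alloc])
```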

DETERMINING THE SAMPLE SIZE

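For a specified $V_d$ or d, and if the fpc is ignored, the first approximation for the sample size
when we wish to estimate the mean is

$$n_0 = \frac{z_{\alpha/2}^2}{d^2}\sum_{h=1}^{L}\frac{W_h^2 S_h^2}{w_h} = \frac{1}{V_d}\sum_{h=1}^{L}\frac{W_h^2 S_h^2}{w_h}, \quad \text{where } w_h = \frac{n_h}{n}$$

Note that $V_d = \frac{d^2}{z_{\alpha/2}^2}$ and the exact formula depends on the allocation method.

Formula for Computing $n_0$ for a Specified $V_d$

Allocation weights $w_h$
    Equal Allocation: $w_h = \frac{1}{L}$
    Proportional: $w_h = \frac{N_h}{N}$
    Neyman: $w_h = \frac{W_h S_h}{\sum_{h=1}^{L} W_h S_h}$

Parameter of Interest: Mean
    Equal Allocation: $n_0 = \frac{L}{V_d}\sum_{h=1}^{L} W_h^2 S_h^2$
    Proportional: $n_0 = \frac{1}{V_d}\sum_{h=1}^{L} W_h S_h^2$
    Neyman: $n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} W_h S_h\right)^2$

Parameter of Interest: Total
    Equal Allocation: $n_0 = \frac{L}{V_d}\sum_{h=1}^{L} N_h^2 S_h^2$
    Proportional: $n_0 = \frac{N}{V_d}\sum_{h=1}^{L} N_h S_h^2$
    Neyman: $n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} N_h S_h\right)^2$

Parameter of Interest: Proportion
    Equal Allocation: $n_0 = \frac{L}{V_d}\sum_{h=1}^{L} W_h^2 P_h Q_h$
    Proportional: $n_0 = \frac{1}{V_d}\sum_{h=1}^{L} W_h P_h Q_h$
    Neyman: $n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} W_h\sqrt{P_h Q_h}\right)^2$

• As before, if $\frac{n_0}{N} < 0.05$, use $n = n_0$. Otherwise, compute for n using the following formula:

for the Mean: $n = \frac{n_0}{1 + \frac{1}{N V_d}\sum_{h=1}^{L} W_h S_h^2}$

for the Total: $n = \frac{n_0}{1 + \frac{1}{N^2 V_d}\sum_{h=1}^{L} N_h S_h^2}$
(here $V_d$ is expressed on the scale of the mean; if $V_d$ is the specified variance of the
estimated total, the factor is $\frac{1}{V_d}$ instead of $\frac{1}{N^2 V_d}$)

for the Proportion: $n = \frac{n_0}{1 + \frac{1}{N V_d}\sum_{h=1}^{L} W_h P_h Q_h}$

Take note that under proportional allocation, all formulas reduce to the familiar form

$$n = \frac{n_0}{1 + \frac{n_0}{N}}$$

• When the margin of error (d) is specified, replace $V_d$ by $\left(\frac{d}{z_{\alpha/2}}\right)^2$. Replace $V_d$ by $\left(\frac{r\mu}{z_{\alpha/2}}\right)^2$ if
the relative error (r) is specified. Replace $V_d$ by $(C_d\,\mu)^2$ if $C_d = \frac{r}{z_{\alpha/2}}$ is specified. And, if
the parameter of interest is the total, then use the same formula as for the mean if r or $C_d$ is
specified.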

Example

Suppose that we are planning to take a sample of the members of a health maintenance
organization (HMO) for the purpose of estimating the average number of hospital episodes per
person. The sample will be selected from membership lists grouped according to age (under
45 years; 45-64 years; 65 years and over). Let us suppose that the distributions of hospital
episodes are available from national data (such as the National Health Interview) and are
given below:

Age Group            No. of Members   Average No. of Hosp. Episodes   Var. of Dist’n of Episodes
Under 45 years       600              0.164                           0.245
45-64 years          500              0.166                           0.296
65 years and over    400              0.236                           0.436

Compute the number of subjects needed to be 99 % certain of estimating the mean number of
hospital episodes within 20% of the true mean under stratified random sampling with
proportional allocation.

Specification: Parameter of interest: $\mu$

Allocation Method: Proportional Allocation

$r = 0.2$, $\alpha = 0.01$, $z_{\alpha/2} = 2.576$

$$n_0 = \frac{z_{\alpha/2}^2}{r^2\mu^2}\sum_{h=1}^{L} W_h S_h^2$$

Based on the national data, an initial guess for the mean number of hospital episodes is

$$\mu = \frac{\sum_{h=1}^{L} N_h\mu_h}{N} = \frac{600}{1500}(0.164) + \frac{500}{1500}(0.166) + \frac{400}{1500}(0.236) = 0.184$$

and since $S_h^2 = \frac{N_h}{N_h - 1}\,\sigma_h^2$, then

$$\sum_{h=1}^{3} W_h S_h^2 = \frac{600^2}{(1500)(599)}(0.245) + \frac{500^2}{(1500)(499)}(0.296) + \frac{400^2}{(1500)(399)}(0.436) = 0.313586$$

Thus,

$$n_0 = \frac{2.576^2}{(0.2^2)(0.184^2)}(0.313586) = 1538.801$$

(the value 1538.801 is obtained with the unrounded mean, $\mu = 0.183867$).

Since $n_0/N$ is far above 0.05, adjustment is definitely needed. Under proportional allocation,

$$n = \frac{n_0}{1 + \frac{n_0}{N}} = \frac{1538.801}{1 + \frac{1538.801}{1500}} = 759.576$$

So we take a sample size of 760.
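The steps of this example can be reproduced in a few lines. The Python sketch below uses a hypothetical function name; the third stratum variance is entered as 0.436, the value used in the computation of $\sum W_h S_h^2$ above, and the mean is left unrounded.

```python
import math

def n_proportional_relative_error(N_h, mean_h, var_h, r, z):
    """Sample size for the mean under proportional allocation, relative error r.

    var_h holds the stratum variances sigma_h^2 (divisor N_h), which are
    converted to S_h^2 = N_h / (N_h - 1) * sigma_h^2.
    """
    N = sum(N_h)
    mu = sum(Nh * m for Nh, m in zip(N_h, mean_h)) / N
    sum_W_S2 = sum((Nh / N) * (Nh / (Nh - 1)) * v for Nh, v in zip(N_h, var_h))
    n0 = z ** 2 * sum_W_S2 / (r * mu) ** 2
    n = n0 if n0 / N < 0.05 else n0 / (1 + n0 / N)
    return n0, n

n0, n = n_proportional_relative_error(
    N_h=[600, 500, 400],
    mean_h=[0.164, 0.166, 0.236],
    var_h=[0.245, 0.296, 0.436],
    r=0.2,
    z=2.576,
)
print(round(n0, 3), round(n, 3), math.ceil(n))
```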

Example

(Cochran) A sample of United States colleges and universities will be drawn using stratified
random sampling with optimum allocation in order to estimate enrollments for the current
academic year. The population of teachers’ colleges and normal schools was divided into 7
strata, of which one small stratum will be ignored. The first five strata were constructed by size
of institution while the sixth contained colleges for women only. Data needed for computing the
sample size were taken from the previous academic year. They show that the total enrollment was
56,472. The other needed information is as follows:

Stratum   N_h   S_h
1         13    325
2         18    190
3         26    189
4         42    82
5         73    86
6         24    190
Total     196

A coefficient of variation of 5 % in the estimated total enrollment was specified.

Specifications: Parameter of interest: $\tau$

Allocation Method: optimum allocation

$C_d = 0.05$

We use the formula $n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} W_h S_h\right)^2$

Now, $\mu = \frac{\tau}{N} = 56,472/196 = 288.122$

$V_d = (C_d\,\mu)^2 = (0.05^2)(288.122^2) = 207.536$

So,

$$n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} W_h S_h\right)^2 = \frac{1}{(207.536)(196)^2}\big((13)(325) + (18)(190) + (26)(189) + (42)(82) + (73)(86) + (24)(190)\big)^2 = 90.36$$

Definitely adjustments will be needed, so we use the formula

$$n = \frac{n_0}{1 + \frac{1}{N^2 V_d}\sum_{h=1}^{L} N_h S_h^2} = \frac{90.36}{1 + \frac{1}{196^2(207.536)}(4640387)} = 57.1$$

We round this up to 58 as our sample size.
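A short script confirms these figures. It is a Python sketch with a hypothetical function name; it follows the route taken in the example, that is, the specified coefficient of variation is converted into a variance on the scale of the mean, $V_d = (C_d\,\mu)^2$, and the Neyman formula for the mean is applied.

```python
import math

def n_neyman_for_cv(N_h, S_h, total, cv):
    """Neyman-allocation sample size for a specified coefficient of variation."""
    N = sum(N_h)
    mu = total / N
    V_d = (cv * mu) ** 2                                  # variance on the mean scale
    sum_NS = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    sum_NS2 = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    n0 = sum_NS ** 2 / (N ** 2 * V_d)                     # (sum W_h S_h)^2 / V_d
    n = n0 / (1 + sum_NS2 / (N ** 2 * V_d))               # fpc adjustment
    return V_d, n0, n

V_d, n0, n = n_neyman_for_cv(
    N_h=[13, 18, 26, 42, 73, 24],
    S_h=[325, 190, 189, 82, 86, 190],
    total=56472,
    cv=0.05,
)
print(round(V_d, 3), round(n0, 2), round(n, 1), math.ceil(n))
```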

Estimating the Mean and Total Using Stratified Random Sampling

Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of The Standard Error Under Stratified Random Sampling

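Parameter: Mean ($\mu$)
    Estimator: $\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h\bar{y}_h = \sum_{h=1}^{L} W_h\bar{y}_h$
    Variance: $\sum_{h=1}^{L} W_h^2\,\frac{S_h^2}{n_h}(1 - f_h) = \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{N_h}$
    Estimated standard error: $\sqrt{\sum_{h=1}^{L} W_h^2\,\frac{s_h^2}{n_h}(1 - f_h)} = \sqrt{\sum_{h=1}^{L}\frac{N_h^2}{N^2}\,\frac{s_h^2}{n_h}(1 - f_h)} = \sqrt{\sum_{h=1}^{L}\frac{W_h^2 s_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_h^2 s_h^2}{N_h}}$

Parameter: Total ($\tau$)
    Estimator: $\hat{\tau}_{st} = \sum_{h=1}^{L} N_h\bar{y}_h = N\bar{y}_{st}$
    Variance: $N^2\sum_{h=1}^{L} W_h^2\,\frac{S_h^2}{n_h}(1 - f_h)$
    Estimated standard error: $\sqrt{\sum_{h=1}^{L} N_h^2\,\frac{s_h^2}{n_h}(1 - f_h)}$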

Example

The total number of inhabitants in the 100 cities of country X is to be estimated from a sample of
32 cities. The cities are arranged into 2 strata; the first, containing the 32 largest cities and the
second containing the remaining 68 cities. The number of inhabitants is presented in Dataset #3.
$$\bar{y}_{st} = \sum_{h=1}^{L} W_h\bar{y}_h = (0.32)(600.375) + (0.68)(201) = 328.8$$

$$\hat{\tau}_{st} = N\bar{y}_{st} = (100)(328.8) = 32880$$

The estimated total number of inhabitants is 32880 thousand.

Its standard error is estimated to be

$$\sqrt{\sum_{h=1}^{L} N_h^2\,\frac{s_h^2}{n_h}(1 - f_h)} = \sqrt{(32)^2\,\frac{30531.13}{8}\left(1 - \frac{8}{32}\right) + (68)^2\,\frac{6291.739}{24}\left(1 - \frac{24}{68}\right)} = 1927.526$$
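The estimate and its standard error can be recomputed from the stratum summaries quoted in the example (stratum sizes 32 and 68, sample sizes 8 and 24, sample means 600.375 and 201, sample variances 30531.13 and 6291.739). The Python sketch below uses hypothetical names and works from these summaries because Dataset #3 is not reproduced here.

```python
import math

def stratified_total(strata):
    """Estimated mean, total and SE of the total under stratified random sampling.

    Each stratum is a dict with N (size), n (sample size), ybar and s2.
    """
    N = sum(s["N"] for s in strata)
    ybar_st = sum(s["N"] / N * s["ybar"] for s in strata)
    var_total = sum(s["N"] ** 2 * s["s2"] / s["n"] * (1 - s["n"] / s["N"])
                    for s in strata)
    return ybar_st, N * ybar_st, math.sqrt(var_total)

strata = [
    {"N": 32, "n": 8, "ybar": 600.375, "s2": 30531.13},
    {"N": 68, "n": 24, "ybar": 201.0, "s2": 6291.739},
]
ybar_st, tau_hat, se_tau = stratified_total(strata)
print(round(ybar_st, 1), round(tau_hat), round(se_tau, 3))
```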

Note:

• The standard error will be small if we choose the stratification variable so that all the
strata have small $S_h^2$, that is, the elements within a stratum are homogeneous with respect to
the characteristic of interest. In fact, if it were possible to divide the population into
strata such that all items have the same value within a stratum, then $\mu$ could be estimated
without any error.

CLUSTER SAMPLING

Sample Selection Variable

Cluster sampling is a sampling procedure or system where the sampling unit consists of a group
of elements called a cluster. In simple one-stage cluster sampling, the clusters are selected using
simple random sampling.

Sample Selection Procedure for Simple One-Stage Cluster Sampling

Step 1. Specify appropriate clusters. Similar to the strata in stratified sampling, these clusters
must be nonoverlapping and their union must be equal to the population. The main difference
between the optimal construction of strata in stratified sampling and the construction of clusters
in cluster sampling is that the strata must be as homogeneous as possible and one stratum should
differ as much as possible from another with respect to the characteristic being measured;
whereas clusters should be as heterogeneous as possible within, and one cluster should look very
much like another.

Step 2. Construct a frame containing all clusters in the population.

Step 3. Select n clusters from the frame using simple random sampling without replacement.

Step 4. The sample will consist of all the elements included in the selected clusters.

Examples:

1. The circulation manager of a newspaper wishes to estimate the average number of
newspapers purchased per household in a particular barangay. The 1000 households in the
barangay are listed in 100 clusters of 10 households each, and a simple random sample of
4 clusters is selected and all households in these clusters constitute the sample.

2. A forester wishes to estimate the average height of trees in a plantation. The plantation is
divided into quarter-acre plots. A simple random sample of 20 plots is selected from the
386 plots in the plantation. The forester then measures the height of all trees in the
sampled plots for his study.

3. An inspector wants to estimate the average weight of fill for cereal boxes packaged in a
certain factory. The cereal is available to him in cartons containing 12 boxes each. The
inspector randomly selects 5 cartons and measures the weight of fill for every box in the
sampled cartons.

Basic Features of Simple One-Stage Cluster Sampling


• The frame used in cluster sampling is a list of clusters. Thus, when a good list of
elements is not available or is too costly to obtain, cluster sampling is oftentimes
considered.

• For the standard error to be small, $S_B^2$ must be small so that the variability among the
measurements is mostly explained by $S_W^2$.

• In practice though, the clusters are usually formed so that the elements within are
contiguous to each other (geographic subdivision). This way, cluster sampling will
effectively reduce the cost of the survey, especially for those where the cost of obtaining
observations increases as the distance separating the elements increases. However, this is
at the expense of increasing the standard errors because elements that are close
together are usually homogeneous with respect to many characteristics.

• Cluster sampling can be inefficient, especially if the clusters are large and homogeneous
with respect to the characteristics under study. It will then be more economical to select a
sample of elements from the clusters selected rather than take information from all the
elements. This procedure is what we refer to as multi-stage sampling.

SAMPLE SIZE DETERMINATION FOR SIMPLE CLUSTER SAMPLING

Since simple one-stage cluster sampling uses SRS in the selection of the n clusters, the
formulas for computing the sample size will be the same as that of the SRSWOR except that we
replace $S^2$ by $S_B^2$.

Using Dataset 5, suppose we wish to be virtually certain (that is, $z_{\alpha/2} = 3$) of estimating the total
number of persons over 65 years of age residing in the five housing developments to within 10%
of the true value. How many clusters should we include in our sample?

Specifications: Parameter of Interest: $\tau$

$r = 0.1$
$z_{\alpha/2} = 3$
$N = 5$

Since r is specified, we use the same formula used to compute the sample size for the mean even
if the parameter of interest is $\tau$.

$$n_0 = \frac{S_B^2\,z_{\alpha/2}^2}{r^2\mu^2} \quad \text{and} \quad n = \frac{n_0}{1 + \frac{n_0}{N}}$$

Computing for the parameters,

$$\mu = \text{population mean} = \frac{\sum_{j=1}^{N}\mu_j}{N} = \frac{(1.6) + (1.65) + (1.95) + (1.6) + (1.7)}{5} = 1.7$$

$$S_B^2 = \sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N-1} = \frac{(1.6-1.7)^2 + (1.65-1.7)^2 + (1.95-1.7)^2 + (1.6-1.7)^2 + (1.7-1.7)^2}{4} = 0.02125$$

Computing for the sample size (number of clusters),

$$n_0 = \frac{S_B^2\,z_{\alpha/2}^2}{r^2\mu^2} = \frac{(0.02125)(3^2)}{(0.1^2)(1.7^2)} = 6.6176 \quad \text{and} \quad n = \frac{n_0}{1 + \frac{n_0}{N}} = \frac{6.6176}{1 + \frac{6.6176}{5}} = 2.8481$$

Rounding up, we will select 3 clusters (housing developments) in our sample.
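The calculation can be checked with a short script that takes the five cluster means quoted above as input. This is a Python sketch with a hypothetical function name; the 0.05 rule for skipping the fpc adjustment is the one used for SRSWOR earlier in these notes.

```python
import math

def cluster_sample_size(cluster_means, r, z):
    """Number of clusters needed for relative error r (equal-size clusters)."""
    N = len(cluster_means)
    mu = sum(cluster_means) / N
    S2_B = sum((m - mu) ** 2 for m in cluster_means) / (N - 1)
    n0 = S2_B * z ** 2 / (r * mu) ** 2
    n = n0 if n0 / N < 0.05 else n0 / (1 + n0 / N)
    return S2_B, n0, n

S2_B, n0, n = cluster_sample_size([1.6, 1.65, 1.95, 1.6, 1.7], r=0.1, z=3)
print(round(S2_B, 5), round(n0, 4), round(n, 4), math.ceil(n))
```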

Estimation Using Clusters of Equal Size

Population Figures

N = number of clusters (sampling units) in the population

M = number of elements in each cluster

$Y_{jk}$ = measure taken from the kth element in the jth cluster, j = 1, 2, …, N; k = 1, 2, …, M

$\mu_j$ = mean of the jth cluster $= \frac{1}{M}\sum_{k=1}^{M} Y_{jk}$

$Y_j$ = total of the jth cluster $= \sum_{k=1}^{M} Y_{jk} = M\mu_j$

$\mu_c$ = population mean per cluster $= \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{M} Y_{jk} = \frac{1}{N}\sum_{j=1}^{N} Y_j$

$\mu$ = population mean per element $= \frac{1}{NM}\sum_{j=1}^{N}\sum_{k=1}^{M} Y_{jk} = \frac{1}{N}\sum_{j=1}^{N}\mu_j = \frac{\mu_c}{M}$

$\tau$ = population total $= \sum_{j=1}^{N}\sum_{k=1}^{M} Y_{jk} = N\mu_c = NM\mu = \sum_{j=1}^{N} M\mu_j = \sum_{j=1}^{N} Y_j$

$\sigma^2$ = population variance $= \frac{1}{NM}\sum_{j=1}^{N}\sum_{k=1}^{M} (Y_{jk} - \mu)^2 = \sigma_w^2 + \sigma_B^2$

$\sigma_w^2$ = variance within clusters $= \sum_{j=1}^{N}\sum_{k=1}^{M}\frac{(Y_{jk} - \mu_j)^2}{NM}$

$\sigma_B^2$ = variance between/among clusters $= \sum_{j=1}^{N}\sum_{k=1}^{M}\frac{(\mu_j - \mu)^2}{NM} = \sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N}$

$S^2$ = population variance (corrected) $= \frac{\sum_{j=1}^{N}\sum_{k=1}^{M} (Y_{jk} - \mu)^2}{NM - 1} = \frac{M(N-1)S_B^2 + N(M-1)S_W^2}{NM - 1}$

where $S_B^2 = \sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N-1}$ and $S_W^2 = \sum_{j=1}^{N}\sum_{k=1}^{M}\frac{(Y_{jk} - \mu_j)^2}{N(M-1)}$

Sample Counterparts

n = number of clusters in the sample

$y_{ik}$ = measure taken from the kth element in the ith selected cluster, i = 1, 2, …, n; k = 1, 2, …, M

$\bar{y}_i$ = mean of the ith selected cluster in the sample $= \frac{1}{M}\sum_{k=1}^{M} y_{ik}$

$y_{i\cdot}$ = total of the ith selected cluster in the sample $= \sum_{k=1}^{M} y_{ik} = M\bar{y}_i$

$s_i^2$ = variance of the ith selected cluster in the sample $= \sum_{k=1}^{M}\frac{(y_{ik} - \bar{y}_i)^2}{M-1}$

Take note that $\bar{y}_i$, $y_{i\cdot}$ and $s_i^2$ are not estimates since complete information on the ith selected
cluster is available. The notations for estimates were used to emphasize that these are random
variables whose values depend on which clusters are selected in the sample.

Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of The Standard Error for Cluster Sampling With Equal Cluster Sizes

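Parameter: Mean Per Cluster ($\mu_c$)
    Estimator: $\bar{y} = \sum_{i=1}^{n}\sum_{k=1}^{M}\frac{y_{ik}}{n} = \sum_{i=1}^{n}\frac{y_{i\cdot}}{n}$
    Variance: $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j - \mu_c)^2}{N-1}$
    Estimated standard error: $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot} - \bar{y})^2}{n-1}}$

Parameter: Mean Per Element ($\mu$)
    Estimator: $\bar{\bar{y}} = \sum_{i=1}^{n}\sum_{k=1}^{M}\frac{y_{ik}}{nM} = \sum_{i=1}^{n}\frac{y_{i\cdot}}{nM} = \frac{\bar{y}}{M}$
    Variance: $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N-1} = \frac{1-f}{n}\,S_B^2$
    Estimated standard error: $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(\bar{y}_i - \bar{\bar{y}})^2}{n-1}}$

Parameter: Total ($\tau$)
    Estimator: $\hat{\tau} = N\bar{y} = NM\bar{\bar{y}}$
    Variance: $N^2M^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N-1}$
    Estimated standard error: $NM\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(\bar{y}_i - \bar{\bar{y}})^2}{n-1}}$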

Example

Dataset 5 contains data on households in 5 housing developments in the population. The housing
developments serve as the clusters. Each cluster has 20 households. The households serve as the
elements . Suppose housing developments 2 and 5 were selected in the sample. Thus, the sample
consists of two clusters for a total of 40 households. Let us estimate the mean number of persons
over 65 years of age per household and its standard error.

The estimated mean number of persons over 65 years of age per housing development is:

$$\bar{y} = \sum_{i=1}^{2}\frac{y_{i\cdot}}{2} = \frac{Y_2 + Y_5}{2} = \frac{33 + 34}{2} = 33.5$$

The estimated mean number of persons over 65 years of age per household is:

$$\bar{\bar{y}} = \sum_{i=1}^{2}\frac{\bar{y}_i}{2} = \frac{\mu_2 + \mu_5}{2} = \frac{1.65 + 1.7}{2} = 1.675$$

The estimated standard error (mean per household) is:

$$\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(\bar{y}_i - \bar{\bar{y}})^2}{n-1}} = \sqrt{\frac{1 - \frac{2}{5}}{2}\left((1.65 - 1.675)^2 + (1.7 - 1.675)^2\right)} = \sqrt{\frac{1 - \frac{2}{5}}{2}(0.00125)} = 0.019$$

The estimated total number of persons over 65 years of age is

$$\hat{\tau} = NM\bar{\bar{y}} = (5)(20)(1.675) = 167.5$$

The estimated standard error is (5)(20)(0.019) = 1.9.
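These estimates can be reproduced from the two sampled cluster totals (33 and 34) with M = 20 and N = 5. The Python sketch below uses a hypothetical function name. It carries the unrounded standard error through, so the standard error of the total comes out near 1.94 rather than the 1.9 obtained from the rounded value 0.019.

```python
import math

def equal_cluster_estimates(cluster_totals, M, N):
    """One-stage cluster sampling estimates with equal cluster size M."""
    n = len(cluster_totals)
    f = n / N
    cluster_means = [t / M for t in cluster_totals]
    mean_per_cluster = sum(cluster_totals) / n
    mean_per_element = sum(cluster_means) / n
    s2_b = sum((m - mean_per_element) ** 2 for m in cluster_means) / (n - 1)
    se_mean = math.sqrt((1 - f) / n * s2_b)
    return {
        "mean_per_cluster": mean_per_cluster,
        "mean_per_element": mean_per_element,
        "se_mean_per_element": se_mean,
        "total": N * M * mean_per_element,
        "se_total": N * M * se_mean,
    }

res = equal_cluster_estimates(cluster_totals=[33, 34], M=20, N=5)
print({name: round(value, 4) for name, value in res.items()})
```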

Estimation Using Clusters of Unequal Size

Notations

Population Figures

N = number of clusters (sampling units) in the population

$M_j$ = number of elements in the jth cluster, j = 1, 2, …, N

$M_o$ = number of elements in the population $= \sum_{j=1}^{N} M_j$

$Y_{jk}$ = measure taken from the kth element in the jth cluster, j = 1, 2, …, N; k = 1, 2, …, $M_j$

$\mu_j$ = mean of the jth cluster $= \frac{1}{M_j}\sum_{k=1}^{M_j} Y_{jk}$

$Y_j$ = total of the jth cluster $= \sum_{k=1}^{M_j} Y_{jk} = M_j\mu_j$

$S_j^2$ = variance of the jth cluster $= \sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu_j)^2}{M_j - 1}$

$\mu_c$ = population mean per cluster $= \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{M_j} Y_{jk} = \frac{1}{N}\sum_{j=1}^{N} Y_j$

$\mu$ = population mean per element $= \frac{\sum_{j=1}^{N}\sum_{k=1}^{M_j} Y_{jk}}{M_o} = \frac{\sum_{j=1}^{N} Y_j}{\sum_{j=1}^{N} M_j} = \frac{1}{N}\sum_{j=1}^{N}\frac{Y_j}{\mu_M}$

where $\mu_M$ = mean number of elements per cluster $= \frac{M_o}{N}$

$\tau$ = population total $= \sum_{j=1}^{N}\sum_{k=1}^{M_j} Y_{jk} = \sum_{j=1}^{N} Y_j = M_o\mu = N\mu_c$

$S^2$ = population variance (corrected) $= \sum_{j=1}^{N}\sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu)^2}{M_o - 1}$

Sample Counterparts

n = number of clusters in the sample

$m_i$ = number of elements in the ith sample cluster, i = 1, 2, …, n

$m_o$ = number of elements in the sample $= \sum_{i=1}^{n} m_i$

$\bar{m}$ = estimated mean number of elements per cluster $= \frac{1}{n}\sum_{i=1}^{n} m_i = \frac{m_o}{n}$

$y_{ik}$ = measure taken from the kth element of the ith selected cluster in the sample,
i = 1, 2, …, n; k = 1, 2, …, $m_i$

$\bar{y}_i$ = mean of the ith selected cluster in the sample $= \sum_{k=1}^{m_i}\frac{y_{ik}}{m_i}$

$y_{i\cdot}$ = total of the ith selected cluster in the sample $= \sum_{k=1}^{m_i} y_{ik} = m_i\bar{y}_i$

$s_i^2$ = variance of the ith selected cluster in the sample $= \sum_{k=1}^{m_i}\frac{(y_{ik} - \bar{y}_i)^2}{m_i - 1}$

Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of The Standard Error for Cluster Sampling With Unequal Cluster Sizes

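Parameter: Mean Per Cluster ($\mu_c$)
    Estimator: $\bar{y} = \sum_{i=1}^{n}\sum_{k=1}^{m_i}\frac{y_{ik}}{n} = \sum_{i=1}^{n}\frac{y_{i\cdot}}{n}$
    Variance: $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j - \mu_c)^2}{N-1}$
    Estimated standard error: $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot} - \bar{y})^2}{n-1}}$

Parameter: Mean Per Element ($\mu$)
    Ratio Estimator: $\bar{\bar{y}}_R = \frac{\sum_{i=1}^{n}\sum_{k=1}^{m_i} y_{ik}/n}{\sum_{i=1}^{n} m_i/n} = \sum_{i=1}^{n}\frac{m_i\bar{y}_i}{m_o} = \frac{\bar{y}}{\bar{m}}$
        Variance (approximate): $\frac{1-f}{n}\sum_{j=1}^{N}\frac{M_j^2(\mu_j - \mu)^2}{\mu_M^2(N-1)}$
        Estimated standard error (approximate): $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{\bar{y}}_R)^2}{\bar{m}^2(n-1)}}$
    Unbiased Estimator: $\bar{\bar{y}}_U = \frac{\sum_{i=1}^{n}\sum_{k=1}^{m_i} y_{ik}/n}{\mu_M} = \sum_{i=1}^{n}\frac{y_{i\cdot}/\mu_M}{n} = \frac{\bar{y}}{\mu_M}$
        Variance: $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j/\mu_M - \mu)^2}{N-1}$
        Estimated standard error: $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot}/\mu_M - \bar{\bar{y}}_U)^2}{n-1}}$

Parameter: Total ($\tau$)
    Ratio Estimator: $\hat{\tau}_R = M_o\bar{\bar{y}}_R$
        Variance (approximate): $M_o^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{M_j^2(\mu_j - \mu)^2}{\mu_M^2(N-1)}$
        Estimated standard error (approximate): $M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{\bar{y}}_R)^2}{\bar{m}^2(n-1)}}$
    Unbiased Estimator: $\hat{\tau}_U = M_o\bar{\bar{y}}_U = N\bar{y} = N\sum_{i=1}^{n}\frac{y_{i\cdot}}{n}$
        Variance: $M_o^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j/\mu_M - \mu)^2}{N-1} = N^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j - \mu_c)^2}{N-1}$
        Estimated standard error: $M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot}/\mu_M - \bar{\bar{y}}_U)^2}{n-1}} = N\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot} - \bar{y})^2}{n-1}}$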

Example

(Capistrano) We modify Dataset 5 as presented in Dataset 6 in such a way that clusters are now
of unequal sizes. Clusters 1 and 5 contain 10 households; clusters 2 and 4 contain 15 households;
while cluster 3 contains 20 households. Suppose clusters 1, 2 and 3 were selected in the sample.
Let us estimate the mean number of persons over 65 years old per household and the total number
of persons over 65 years old, together with their standard errors, using ratio estimation.

The estimated mean number of persons over 65 years old per housing development is:

$$\bar{y} = \frac{y_{1\cdot} + y_{2\cdot} + y_{3\cdot}}{3} = \frac{16 + 23 + 39}{3} = 26$$

The estimated mean number of households per cluster is:

$$\bar{m} = \frac{\sum_{i=1}^{n} m_i}{n} = \frac{10 + 15 + 20}{3} = 15$$

The estimated mean number of persons over 65 years old per household using ratio estimation
is

$$\bar{\bar{y}}_R = \frac{\bar{y}}{\bar{m}} = \frac{26}{15} = 1.733$$

The estimate of the approximate standard error is

$$\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{\bar{y}}_R)^2}{\bar{m}^2(n-1)}} = \sqrt{\frac{1 - 3/5}{3}\cdot\frac{10^2(1.6 - 1.733)^2 + 15^2(1.533 - 1.733)^2 + 20^2(1.95 - 1.733)^2}{15^2(3-1)}} = 0.09358$$

The estimated total number of persons over 65 years old is:

$$\hat{\tau}_R = M_o\bar{\bar{y}}_R = 70(1.7333) = 121.3$$

The estimate of the approximate standard error is

$$M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{\bar{y}}_R)^2}{\bar{m}^2(n-1)}} = (70)(0.09358) = 6.551$$
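The ratio estimates can be recomputed from the three sampled cluster totals (16, 23, 39) and cluster sizes (10, 15, 20), with N = 5 clusters and $M_o$ = 70 households. The Python sketch below uses a hypothetical function name and relies on the identity $m_i(\bar{y}_i - \bar{\bar{y}}_R) = y_{i\cdot} - m_i\bar{\bar{y}}_R$.

```python
import math

def ratio_cluster_estimates(cluster_totals, cluster_sizes, N, M_o):
    """Ratio estimates of the mean per element and the total (unequal clusters)."""
    n = len(cluster_totals)
    f = n / N
    m_bar = sum(cluster_sizes) / n
    y_R = sum(cluster_totals) / sum(cluster_sizes)
    # m_i^2 * (ybar_i - y_R)^2 equals (y_i. - m_i * y_R)^2
    ss = sum((y - m * y_R) ** 2 for y, m in zip(cluster_totals, cluster_sizes))
    se_mean = math.sqrt((1 - f) / n * ss / (m_bar ** 2 * (n - 1)))
    return y_R, se_mean, M_o * y_R, M_o * se_mean

y_R, se_R, tau_R, se_tau_R = ratio_cluster_estimates(
    cluster_totals=[16, 23, 39], cluster_sizes=[10, 15, 20], N=5, M_o=70
)
print(round(y_R, 3), round(se_R, 5), round(tau_R, 1), round(se_tau_R, 3))
```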

Example

Let us use Dataset 6 once more, but this time let’s estimate the mean and total using the unbiased
estimator. The estimation of the mean requires information on the cluster sizes for the
computation of $\mu_M$.

For this data, $\mu_M = \frac{10 + 15 + 20 + 15 + 10}{5} = 14$

The estimated mean number of persons over 65 years old using the unbiased estimator is

$$\bar{\bar{y}}_U = \sum_{i=1}^{n}\frac{y_{i\cdot}/\mu_M}{n} = \frac{5.571}{3} = 1.857$$

The estimated standard error is

$$\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot}/\mu_M - \bar{\bar{y}}_U)^2}{n-1}} = \sqrt{\frac{1 - 3/5}{3}\cdot\frac{(1.1429 - 1.857)^2 + (1.6429 - 1.857)^2 + (2.7857 - 1.857)^2}{3-1}} = \sqrt{\frac{1 - \frac{3}{5}}{3}(0.70926)} = 0.3075$$

The estimated total number of persons over 65 years old is:

$$\hat{\tau}_U = M_o\bar{\bar{y}}_U = 70(1.857) = 130$$

The estimate of the standard error is

$$M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_{i\cdot}/\mu_M - \bar{\bar{y}}_U)^2}{n-1}} = (70)(0.3075) = 21.525$$

The standard errors of the estimates using the unbiased estimator are higher than the standard
errors using ratio estimation.
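The unbiased estimates can be checked in the same way as the ratio estimates. The Python sketch below uses a hypothetical function name and takes only the cluster totals, N and $M_o$ as input, since the unbiased estimator does not use the sampled cluster sizes.

```python
import math

def unbiased_cluster_estimates(cluster_totals, N, M_o):
    """Unbiased estimates of the mean per element and the total (unequal clusters)."""
    n = len(cluster_totals)
    f = n / N
    mu_M = M_o / N                                  # mean number of elements per cluster
    scaled = [y / mu_M for y in cluster_totals]     # y_i. / mu_M
    y_U = sum(scaled) / n
    se_mean = math.sqrt((1 - f) / n * sum((v - y_U) ** 2 for v in scaled) / (n - 1))
    return y_U, se_mean, M_o * y_U, M_o * se_mean

y_U, se_U, tau_U, se_tau_U = unbiased_cluster_estimates(
    cluster_totals=[16, 23, 39], N=5, M_o=70
)
print(round(y_U, 3), round(se_U, 4), round(tau_U, 1), round(se_tau_U, 3))
```

For this sample the standard error of the mean per household is about 0.31 under the unbiased estimator against about 0.09 under the ratio estimator, because the cluster totals vary much more than the cluster means.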

SYSTEMATIC SAMPLING

Every kth systematic sampling or 1-in-k systematic sampling is a method of selecting a sample by
randomly selecting one unit from the first k units in the frame and taking every kth element
thereafter.

k is the sampling interval.

Sample Selection Procedure

Step 1. Determine the interval k. In general, for a systematic sample of n elements from a
population of size N, k is computed using the formula $k = \lfloor N/n \rfloor$, the greatest integer not
exceeding N/n.

Step 2. Identify each unit in the frame with consecutive integers, beginning with 1. This,
however, can be done simultaneously with the selection of the elements to be included in the
sample.

Step 3. Choose a number, r, at random from the integers 1 to k. Then the elements of the sample are
those units labeled r, r+k, r+2k, r+3k and so on until you reach the end of the frame.

Example

N = 13, n = 3. Thus, $k = \lfloor 13/3 \rfloor = \lfloor 4\tfrac{1}{3} \rfloor = 4$. Choose $r \in \{1, 2, 3, 4\}$. The labels of units selected in
the sample are as follows:

r    Labels of Units in the Sample
1    1, 5, 9, 13
2    2, 6, 10
3    3, 7, 11
4    4, 8, 12
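The selection steps can be written as a short function. This is a Python sketch with a hypothetical function name; the random start r may be supplied directly to list every possible sample, as is done here for N = 13 and n = 3.

```python
import random

def systematic_sample_labels(N, n, r=None, seed=None):
    """1-in-k systematic sample: labels r, r+k, r+2k, ... up to N."""
    k = N // n                                   # sampling interval, floor(N/n)
    if r is None:
        r = random.Random(seed).randint(1, k)    # random start from 1..k
    return k, r, list(range(r, N + 1, k))

all_samples = {r: systematic_sample_labels(13, 3, r=r)[2] for r in range(1, 5)}
for r, labels in all_samples.items():
    print(r, labels)
```

The listing shows that when N is not a multiple of k the realized sample size depends on the random start: r = 1 yields four units while the other starts yield three.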

Examples:

• Industrial quality control sampling plans are usually systematic in structure. An
inspection plan for manufactured items moving along an assembly line may call for the
inspection of every kth item. In the inspection of work done at fixed stations, the
inspection plan may call for walking up and down the rows of work stations and
inspecting the machinery at every kth station. The time of the day is usually important in
assessing the quality of worker performance, and so, an inspection plan may call for
sampling the output of a work station at systematically selected times of the day.

• Market researchers and opinion pollsters who sample people on the move very often use
systematic sampling. For example, every kth customer at a checkout counter may be asked
his or her opinion on qualities of a certain product. Or, every kth person boarding a bus
may be asked to fill out a questionnaire on bus service.

ESTIMATING THE MEAN AND TOTAL USING SYSTEMATIC SAMPLING

If systematic sampling is used to select a sample of n units, then the population total and the
population mean are estimated in the same manner used under simple random sampling.

Let $Y_j$ = measure from the jth element of the population, j = 1, 2, …, N.
$y_j$ = measure from the jth selected element of the sample, j = 1, 2, …, n
$Y_{ij}$ = measure from the jth element in the ith cluster, j = 1, 2, …, n and i = 1, 2, …, k
$y_{1j}$ = measure from the jth selected element in the 1st (selected) cluster, j = 1, 2, …, n

The systematic sample may be denoted as $\{y_1, y_2, \ldots, y_n\}$.

The systematic sample may also be denoted as $\{y_{11}, y_{12}, \ldots, y_{1n}\}$, where systematic sampling is
viewed as a special case of cluster sampling wherein only 1 cluster is included in the sample.

Then the estimators can be written as:

sample mean, $\bar{y} = \frac{\sum_{j=1}^{n} y_j}{n} = \frac{\sum_{j=1}^{n} y_{1j}}{n}$

sample total, $\hat{\tau} = N\bar{y}$

A slight modification of systematic sampling is called Lahiri’s method, under which the sample
mean is an unbiased estimator even if $N \neq nk$ (unlike the standard procedure).

Lahiri’s method

Step 1. Compute for k =⟦ N /n ⟧.

Step 2. Identify each unit in the frame with consecutive numbers, beginning with 1 to N. Thus,
element ¿ 1 is also labeled as N +1,¿ 2 is as N +2, and so on.

Step 3. Choose a number, r, at random from 1 ¿ N . Then the elements of the sample are those
units labeled as r , r +k , r +2 k , … , r +(n−1)k .

Example

Consider population ¿ {1 , 2 ,3 , 4 , 10} .

Let n=2. Thus k =⟦ 5 /2 ⟧ =2. Under the standard procedure, there are only k =2 possible samples.

r 1 2
Sample (1,3,10) (2,4)
2
ý 4 3
3

In Lahiri’s method, there are N=5 possible samples because we choose the random start, r,
from 1 to N,

r 1 2 3 4 5
Sample (1,3) (2,4) (3,10) (4,1) (10,2)
ý 2 3 6.5 2.5 6

Since each of the 5 possible samples of size 2 will be given the same chances of selection.
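The example can be checked by enumerating every possible sample in R. This is a minimal sketch; the helper name `lahiri_sample()` is an assumption made for illustration, and the wrap-around of labels is implemented with modular arithmetic.

```r
pop <- c(1, 2, 3, 4, 10)
N <- length(pop)
n <- 2
k <- N %/% n                       # integer part of N / n, here 2

# Lahiri's method: labels wrap around, so label N + 1 refers to unit 1
lahiri_sample <- function(pop, n, k, r) {
  labels <- r + (0:(n - 1)) * k
  pop[((labels - 1) %% length(pop)) + 1]
}

# One sample mean per random start r = 1, ..., N
lahiri_means <- sapply(1:N, function(r) mean(lahiri_sample(pop, n, k, r)))

# Standard procedure: start r in 1..k, then every k-th unit to the end of the list
standard_means <- sapply(1:k, function(r) mean(pop[seq(r, N, by = k)]))

mean(lahiri_means)     # average over the N equally likely samples
mean(standard_means)   # average over the k equally likely samples
mean(pop)              # population mean
```

Comparing the three printed values shows that the average of the sample means matches the population mean under Lahiri's method but not under the standard procedure when N is not a multiple of n.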

Basic Features of Systematic Sampling

• Systematic sampling is easier to perform in the field as compared to simple random
sampling, especially if a good frame is not available. This results in a substantial saving in
time and a reduction in selection errors that can be committed by the field workers.

• The sample can be selected even if a list of the sampling units is not available.

• When units within the same sample are heterogeneous, the estimates under systematic
sampling will be more precise than the estimates under simple random sampling. This
will happen if the systematic sample is selected from a large population where the
elements are ordered according to the magnitude of the characteristic of interest or a
variable related to it.

• This procedure sometimes provides more information per unit cost than simple random
sampling since the sample is generally spread more uniformly over the entire population.

• The precision of the estimates depends on the order of the sampling units in the frame. It
is important to know the type of population under investigation since the estimates under
systematic sampling may be unreliable, particularly when there is unsuspected periodicity
in the arrangement of the units in the frame.

• Under certain conditions, an increase in sample size will not even guarantee an increase
in precision.

• Estimating the standard error of the estimate is more complicated. We can use model-
based inference. We can also use repeated systematic sampling.
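The points about ordering and periodicity can be illustrated with a small exact calculation in R. The two populations below are hypothetical, and the helpers `sys_var()` and `srs_var()` are names chosen for this sketch: the first averages the squared error of the sample mean over all k possible systematic samples, the second uses the usual SRSWOR variance of the sample mean, $(1 - n/N)S^2/n$.

```r
# Exact variance of the systematic-sample mean over the k possible starts
sys_var <- function(Y, k) {
  N <- length(Y)
  means <- sapply(seq_len(k), function(r) mean(Y[seq(r, N, by = k)]))
  mean((means - mean(Y))^2)
}

# Variance of the sample mean under SRSWOR with sample size n
srs_var <- function(Y, n) {
  N <- length(Y)
  (1 - n / N) * var(Y) / n
}

k <- 5; n <- 4

Y_ordered  <- 1:20                   # units ordered by magnitude
Y_periodic <- rep(1:5, times = 4)    # period of 5 coincides with the interval k

c(systematic = sys_var(Y_ordered, k),  srs = srs_var(Y_ordered, n))
c(systematic = sys_var(Y_periodic, k), srs = srs_var(Y_periodic, n))
```

For the ordered frame the systematic design gives the smaller variance, while for the periodic frame, where every sample consists of identical values, it gives the larger one.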

MULTI-STAGE SAMPLING

Sample Selection Procedure

Multi-stage sampling is a method of selecting a sample that makes use of the hierarchical structure
of the units, with the sampling of these units done in stages. The population is first divided into
primary sampling units (PSUs) and a sample of PSUs is selected. The selected PSUs are
subdivided into secondary sampling units (SSUs) and a sample of SSUs is selected from each of
the selected PSUs. This process is continued until the last-stage elements are selected within the
final-stage clusters.

Example:

• A researcher wishes to estimate the average number of days of confinement of patients in
Metro Manila hospitals. The researcher then takes a sample of hospitals and, from each of
the hospitals selected, he selects a sample of patients in the past year.

• In the estimation of the amount of impurities in a bulk product like sugar, the sampling
procedure may select bags of sugar from a warehouse and then select small test samples
from each bag. The test samples are then analyzed for the amount of impurities.

• In order to estimate the total number of permanent residents of Quezon City who have
hypertension, the sampling procedure may first require the selection of barangays. Then
from each barangay a sample of households will be selected. Then from each household,
a person will be selected.

Advantages

1. Since this is just an extension of the concept of cluster sampling, the advantages of multi-
stage sampling are the same as those of cluster sampling. For example, a frame that lists
all the elements in the population is not needed, and there is a reduction in the cost of
obtaining the data because of the reduction in the travel costs.
2. In addition, it is not necessary to sample all of the elements in each sampled cluster. Thus,
the cost of sampling can often be reduced with little loss of information.

Disadvantages

1. Choosing the sample size is more difficult. The number of sampling units at each stage
must be determined. Furthermore, the choice of the sample size will now depend on 2
sources of variation: variation between the clusters and variation among elements within
the clusters.

2. The basic principle in the estimation of parameters is to build up estimates from the
bottom (last stage) to the top (first stage). Thus, estimation procedures are more difficult.
The more stages there are, the more complicated the analysis. Also, the analysis is more
complicated when the PSUs are of unequal sizes than when they are all of the same size.

If multi-stage sampling has two stages of sampling and the selection of sampling units in each
stage is done using SRSWOR then this is called simple 2-stage sampling. The PSUs are clusters
while the SSUs are the elements themselves.

Sample Selection Procedure for Simple 2-Stage Sampling

Step 1. Specify the appropriate clusters. The two major considerations in choosing clusters are: (a)
geographic proximity of the elements within a cluster; and (b) cluster sizes that are convenient
to administer. The selection of the clusters will also depend on how many PSUs and SSUs we
intend to include in the sample. Do we intend to sample a few PSUs and many SSUs from each,
or do we intend to sample many PSUs and a few SSUs from each?

Step 2. Obtain a frame listing all PSUs in the population.

Step 3. Draw an SRS of PSUs.

Step 4. Obtain a frame listing all the SSUs (in this case, elements) for each of the selected PSUs
only.

Step 5. Select an SRS of elements from each of these frames.
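Steps 2 to 5 can be sketched in R with the base function `sample()`. Everything specific here is an assumption made for illustration: the six PSUs and their sizes, the choice n = 3 and two SSUs per selected PSU, and the seed, which is fixed only so that the sketch is reproducible.

```r
set.seed(1)   # arbitrary seed, used only for reproducibility

# Step 2: hypothetical frame of 6 PSUs, with the number of SSUs in each
M <- c(A = 8, B = 5, C = 10, D = 7, E = 6, F = 9)

n      <- 3   # number of PSUs to sample
m_star <- 2   # number of SSUs to sample within each selected PSU

# Step 3: SRSWOR of PSUs from the PSU frame
psu <- sample(names(M), size = n)

# Steps 4 and 5: number the elements of each selected PSU only (1, ..., M_i),
# then draw an SRSWOR of elements within it
ssu <- lapply(psu, function(p) sort(sample.int(M[[p]], size = m_star)))
names(ssu) <- psu

ssu   # selected element numbers, listed by selected PSU
```

Element frames are needed only for the PSUs that were actually selected, which is the practical saving that Step 4 describes.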

NOTATION

$N$ = no. of PSUs in the population
$n$ = no. of PSUs in the sample
$M_i$ = no. of SSUs in the $i$th PSU of the population; $i = 1, 2, \ldots, N$
$m_i$ = no. of SSUs in the $i$th sampled PSU; $i = 1, 2, \ldots, n$
$m_i^*$ = no. of sampled SSUs in the $i$th sampled PSU; $i = 1, 2, \ldots, n$
$M_o$ = total number of SSUs in the population $= \sum_{i=1}^{N} M_i$
$\mu_M$ = mean number of SSUs in the PSUs of the population $= \frac{1}{N}\sum_{i=1}^{N} M_i$
$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$

$Y_{ij}$ = measure taken from the $j$th SSU of the $i$th PSU of the population; $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, M_i$
$y_{ij}$ = measure taken from the $j$th SSU of the $i$th sampled PSU; $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m_i$
$y_{ij}^*$ = measure taken from the $j$th sampled SSU of the $i$th sampled PSU; $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m_i^*$

Means:
$\mu_i = \frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij}$, $\quad \bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$, $\quad \bar{y}_i^* = \frac{1}{m_i^*}\sum_{j=1}^{m_i^*} y_{ij}^*$

Totals:
$Y_i = \sum_{j=1}^{M_i} Y_{ij} = M_i\mu_i$, $\quad y_i = \sum_{j=1}^{m_i} y_{ij} = m_i\bar{y}_i$, $\quad y_i^* = \sum_{j=1}^{m_i^*} y_{ij}^* = m_i^*\bar{y}_i^*$

Within-PSU variances:
$S_{wi}^2 = \sum_{j=1}^{M_i}\frac{(Y_{ij}-\mu_i)^2}{M_i-1}$, $\quad s_{wi}^2 = \sum_{j=1}^{m_i}\frac{(y_{ij}-\bar{y}_i)^2}{m_i-1}$, $\quad s_{wi}^{*2} = \sum_{j=1}^{m_i^*}\frac{(y_{ij}^*-\bar{y}_i^*)^2}{m_i^*-1}$

$\mu$ = population mean (per SSU) $= \frac{\sum_{i=1}^{N}\sum_{j=1}^{M_i} Y_{ij}}{\sum_{i=1}^{N} M_i} = \frac{\sum_{i=1}^{N} M_i\mu_i}{N\mu_M} = \frac{\sum_{i=1}^{N} C_i\mu_i}{N}$, where $C_i = \frac{M_i}{\mu_M}$

$\tau$ = population total $= \sum_{i=1}^{N}\sum_{j=1}^{M_i} Y_{ij} = \sum_{i=1}^{N} M_i\mu_i = M_o\mu$
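As a quick check on the notation, the population quantities can be computed in R for a tiny made-up population. The three PSUs and their values below are hypothetical; the variable names simply mirror the symbols defined above.

```r
# Hypothetical population: N = 3 PSUs containing 2, 3 and 5 SSUs
Y <- list(c(4, 6), c(1, 2, 3), c(10, 10, 20, 30, 30))

N    <- length(Y)
M    <- lengths(Y)          # M_i, number of SSUs in each PSU
M_o  <- sum(M)              # total number of SSUs in the population
mu_M <- M_o / N             # mean number of SSUs per PSU
mu_i <- sapply(Y, mean)     # PSU means
Y_i  <- sapply(Y, sum)      # PSU totals, equal to M_i * mu_i
S2_w <- sapply(Y, var)      # within-PSU variances (divisor M_i - 1)
C_i  <- M / mu_M            # relative PSU sizes

tau <- sum(Y_i)             # population total
mu  <- tau / M_o            # population mean per SSU

sum(C_i * mu_i) / N         # same value as mu, as in the identity above
```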

Sample Size Determination using Simple 2-Stage Sampling

The sample size depends on the type of cost function. A simple form of cost function is:

$C = c_0 n + c_2\sum_{i=1}^{n} m_i^* + c_L\sum_{i=1}^{n} m_i$

where
$c_0$ = fixed cost per PSU
$c_2$ = cost of getting information per sampled SSU
$c_L$ = cost of listing an SSU in the selected PSUs in the sample

However, it is difficult to use this cost function because we cannot determine in advance which
PSUs will be selected in the first stage of sampling. In other words, the $m_i$'s and the $m_i^*$'s are
random variables in this sense.

We then consider the expected cost function instead, where

$E(C) = c_0 n + c_2 E\left(\sum_{i=1}^{n} m_i^*\right) + c_L E\left(\sum_{i=1}^{n} m_i\right)$

$E(C) = c_0 n + c_2 n\mu_m + c_L n\mu_M = (c_0 + c_L\mu_M)n + c_2 n\mu_m = c_1 n + c_2 n\mu_m$

where:

$\mu_M = \frac{M_o}{N}$ = mean number of SSUs in the population per PSU
$\mu_m = \sum_{i=1}^{N}\frac{m_i^*}{N}$ = mean number of sampled SSUs per PSU
$c_1 = c_0 + c_L\mu_M$

The problem is to determine $n$ and $\mu_m$ that will minimize the MSE for a given cost, or that will
minimize the cost for a given MSE.
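The expected cost function is simple enough to write as two small R helpers. This is a sketch under stated assumptions: the function names and the cost figures in the usage lines are hypothetical, and the listing cost is taken to apply to every SSU of a selected PSU, so that $c_1 = c_0 + c_L\mu_M$.

```r
# Expected cost E(C) = c1 * n + c2 * n * mu_m, with c1 = c0 + cL * mu_M
expected_cost <- function(n, mu_m, c0, c2, cL, mu_M) {
  c1 <- c0 + cL * mu_M
  c1 * n + c2 * n * mu_m
}

# Number of PSUs that a fixed budget C allows: n = C / (c1 + c2 * mu_m)
n_for_budget <- function(C, mu_m, c0, c2, cL, mu_M) {
  c1 <- c0 + cL * mu_M
  C / (c1 + c2 * mu_m)
}

# Hypothetical costs: 20 per PSU, 5 per sampled SSU, 1 per listed SSU,
# with 10 SSUs per PSU on average and 4 of them sampled
cost_10 <- expected_cost(n = 10, mu_m = 4, c0 = 20, c2 = 5, cL = 1, mu_M = 10)
n_500   <- n_for_budget(C = 500, mu_m = 4, c0 = 20, c2 = 5, cL = 1, mu_M = 10)
```

The two helpers are inverses of each other in n, which is a convenient way to check a budget calculation.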

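Formulas for $n$ and $\mu_m$ for $f_{2i} = f_2$ and $E(C) = c_1 n + c_2 n\mu_m$

Estimator used for $\mu$: Ratio Estimator, $\bar{y}_R$

$\mu_m = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_2^2}{S_B^{+2} - \frac{1}{\mu_M}S_2^2}}$

for specified $V_d = \frac{d^2}{z_{\alpha/2}^2}$:

$n = \frac{S_B^{+2} - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^{+2}}$

for specified cost:

$n = \frac{C}{c_1 + c_2\mu_m}$

where:

$S_B^{+2} = \sum_{i=1}^{N}\frac{C_i^2(\mu_i - \mu)^2}{N-1}$ and $S_2^2 = \sum_{i=1}^{N}\frac{M_i}{M_o}S_{wi}^2$

Estimator used for $\mu$: Unbiased Estimator, $\bar{y}_U$

$\mu_m = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}}$

for specified $V_d = \frac{d^2}{z_{\alpha/2}^2}$:

$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2}$

for specified cost:

$n = \frac{C}{c_1 + c_2\mu_m}$

where:

$S_B^2 = \sum_{i=1}^{N}\frac{(C_i\mu_i - \mu)^2}{N-1}$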

Week (i)   $\mu_j$   $Y_i$    $(Y_i - \sum_{i=1}^{10} Y_i/10)^2$   $S_{wj}^2$
1          162.86    1140     101442.25                            723.81
2          115.00     805     427062.25                            104.33
3          193.29    1353      11130.25                           3795.57
4          124.00     868     348690.25                           1314.00
5          130.29     912     298662.25                            627.57
6          213.43    1494       1260.25                            802.95
7          338.86    2372     834482.25                            820.14
8          377.86    2645    1407782.25                           4181.81
9          155.86    1091     135056.25                           2476.48
10         272.14    1905     199362.25                           2253.48
Total               14585    3764930.50                          17100.14

Specifications: Parameter of interest: $\mu$
d = 50
$\alpha$ = 0.05, $z_{\alpha/2}$ = 1.96

We will compute for the number of days per week using

$\mu_m = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}}$

and the number of weeks using the formula

$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2}$

We'll use the results of the pilot study to come up with the preliminary values for

$S_B^2 = \sum_{j=1}^{N}\frac{\left(Y_j - \sum_{j=1}^{N} Y_j/N\right)^2}{\mu_M^2(N-1)}$, $\quad S_2^2 = \sum_{j=1}^{N}\frac{M_j}{M_o}S_{wj}^2$, $\quad$ and $\quad S_{wj}^2 = \sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu_j)^2}{M_j - 1}$ for all $j$.

$S_B^2 = 3764930.5/(7^2 \cdot 9) = 8537.257$ and $S_2^2 = 17100.14/10 = 1710.014$

$\mu_m = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}} = \sqrt{2\cdot\frac{1710.014}{8537.257 - 1710.014/7}} = 0.64$

$f_2 = \mu_m/\mu_M = 0.64/7 = 0.0917$, so $m_i^* = f_2 m_i = (0.0917)(7) \approx 1$ for all $i$

With $N = 52$:

$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2} = \frac{8537.257 - 1710.014\left(\frac{1}{7} - \frac{1}{0.64}\right)}{\frac{50^2}{1.96^2} + \frac{8537.257}{52}} = 13.45 \approx 14$

Example

(Lemeshow) The numbers of visitors to a state park, based on a pilot study, are given in the table
below.

Week   Sun   Mon   Tue   Wed   Thurs   Fri   Sat
1      200   150   130   140   150     180   190
2      120   105   111   103   111     125   130
3      310   200   180   130   125     200   208
4      200   107   101    98   103     122   137
5      170   160   130   121   107     110   114
6      250   237   209   212   231     175   180
7      380   378   325   330   306     322   331
8      495   400   315   302   350     388   395
9      206   200   108    95   107     185   190
10     308   300   293   206   200     298   300

Suppose the park management wishes to do a survey in order to estimate the mean number of
visitors in a day. They intend to use simple 2-stage sampling where the weeks serve as the PSUs
and the days serve as the SSUs. How many weeks and days within a week will be included in the
sample if management wishes to be 95% confident that the margin of error of the estimate is 50
using the unbiased estimator? (Of course, $c_L = 0$ since we already know that there are always
7 days in a week. Let us further assume that $c_0 = 2c_2$.)

Specifications: Parameter of interest: $\mu$
d = 50
$\alpha$ = 0.05, $z_{\alpha/2}$ = 1.96

We will compute for the number of days per week using

$\mu_m = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}}$

and the number of weeks using the formula

$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2}$

We'll use the results of the pilot study to come up with the preliminary values for

$S_B^2 = \sum_{j=1}^{N}\frac{\left(Y_j - \sum_{j=1}^{N} Y_j/N\right)^2}{\mu_M^2(N-1)}$

$S_2^2 = \sum_{j=1}^{N}\frac{M_j}{M_o}S_{wj}^2$

$S_{wj}^2 = \sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu_j)^2}{M_j - 1}$ for all $j$

Week (i)   $\mu_j$   $Y_i$    $(Y_i - \sum_{i=1}^{10} Y_i/10)^2$   $S_{wj}^2$
1          162.86    1140     101442.25                            723.81
2          115.00     805     427062.25                            104.33
3          193.29    1353      11130.25                           3795.57
4          124.00     868     348690.25                           1314.00
5          130.29     912     298662.25                            627.57
6          213.43    1494       1260.25                            802.95
7          338.86    2372     834482.25                            820.14
8          377.86    2645    1407782.25                           4181.81
9          155.86    1091     135056.25                           2476.48
10         272.14    1905     199362.25                           2253.48
Total               14585    3764930.50                          17100.14

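$S_B^2 = 3764930.5/(7^2 \cdot 9) = 8537.257$ and $S_2^2 = 17100.14/10 = 1710.014$

$\mu_m = \sqrt{\frac{c_1}{c_2}\cdot\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}} = \sqrt{2\cdot\frac{1710.014}{8537.257 - 1710.014/7}} = 0.64$

$f_2 = \mu_m/\mu_M = 0.64/7 = 0.0917$, so $m_i^* = f_2 m_i = (0.0917)(7) \approx 1$ for all $i$

$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2} = \frac{8537.257 - 1710.014\left(\frac{1}{7} - \frac{1}{0.64}\right)}{\frac{50^2}{1.96^2} + \frac{8537.257}{10}} = 7.28 \approx 8$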
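The whole computation can be reproduced in R from the pilot data. This is a sketch, not the notes' own code: the object names are chosen for illustration, the ratio $c_1/c_2 = 2$ follows from the stated assumptions $c_L = 0$ and $c_0 = 2c_2$, and $N$ is set to the 10 pilot weeks as in the computation above. Because `mu_m` is kept unrounded, the value of `n` may differ from the hand computation in the second decimal place.

```r
visitors <- matrix(c(
  200, 150, 130, 140, 150, 180, 190,
  120, 105, 111, 103, 111, 125, 130,
  310, 200, 180, 130, 125, 200, 208,
  200, 107, 101,  98, 103, 122, 137,
  170, 160, 130, 121, 107, 110, 114,
  250, 237, 209, 212, 231, 175, 180,
  380, 378, 325, 330, 306, 322, 331,
  495, 400, 315, 302, 350, 388, 395,
  206, 200, 108,  95, 107, 185, 190,
  308, 300, 293, 206, 200, 298, 300
), nrow = 10, byrow = TRUE)

N    <- nrow(visitors)                  # weeks (PSUs)
mu_M <- ncol(visitors)                  # days per week (SSUs per PSU)
M_j  <- rep(mu_M, N)                    # every week has 7 days
M_o  <- sum(M_j)

Y_j  <- rowSums(visitors)               # weekly totals
S2_B <- sum((Y_j - mean(Y_j))^2) / (mu_M^2 * (N - 1))
S2_2 <- sum(M_j / M_o * apply(visitors, 1, var))

c1_over_c2 <- 2                         # from c_L = 0 and c_0 = 2 * c_2
d <- 50; z <- 1.96
V_d <- d^2 / z^2

mu_m <- sqrt(c1_over_c2 * S2_2 / (S2_B - S2_2 / mu_M))
n    <- (S2_B - S2_2 * (1 / mu_M - 1 / mu_m)) / (V_d + S2_B / N)

ceiling(n)      # number of weeks to sample
ceiling(mu_m)   # number of days to sample per selected week
```

Changing `N` in the last formula shows how strongly the required number of weeks depends on the assumed number of weeks in the population.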

Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of the Standard Error for Simple 2-Stage Sampling

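Parameter: Mean ($\mu$)

Unbiased Estimator:

$\bar{y}_U = \frac{1}{n\mu_M}\sum_{i=1}^{n}\frac{m_i}{m_i^*}\sum_{j=1}^{m_i^*} y_{ij}^* = \sum_{i=1}^{n}\frac{c_i\bar{y}_i^*}{n}$, where $c_i = \frac{m_i}{\mu_M}$

Variance:

$\left(\frac{1}{n} - \frac{1}{N}\right)S_B^2 + \sum_{i=1}^{N}\frac{M_i^2}{nN\mu_M^2}\left(\frac{1}{m_i^*} - \frac{1}{M_i}\right)S_{wi}^2$, where $S_B^2 = \sum_{i=1}^{N}\frac{(C_i\mu_i - \mu)^2}{N-1}$

Estimated Standard Error:

$\sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{*2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m_i^*} - \frac{1}{m_i}\right)s_{wi}^{*2}}$, where $s_B^{*2} = \sum_{i=1}^{n}\frac{(c_i\bar{y}_i^* - \bar{y}_U)^2}{n-1}$

Ratio Estimator:

$\bar{y}_R = \frac{\sum_{i=1}^{n} m_i\bar{y}_i^*}{\sum_{i=1}^{n} m_i}$

Variance (approximate):

$\left(\frac{1}{n} - \frac{1}{N}\right)S_B^{+2} + \sum_{i=1}^{N}\frac{M_i^2}{nN\mu_M^2}\left(\frac{1}{m_i^*} - \frac{1}{M_i}\right)S_{wi}^2$, where $S_B^{+2} = \sum_{i=1}^{N}\frac{C_i^2(\mu_i - \mu)^2}{N-1}$

Estimated Standard Error (approximate):

$\sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{+*2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m_i^*} - \frac{1}{m_i}\right)s_{wi}^{*2}}$, where $s_B^{+*2} = \sum_{i=1}^{n}\frac{c_i^2(\bar{y}_i^* - \bar{y}_R)^2}{n-1}$

Parameter: Total ($\tau$)

Unbiased Estimator:

$\hat{\tau}_U = \frac{N}{n}\sum_{i=1}^{n} m_i\bar{y}_i^*$

Variance: $M_o^2\,Var(\bar{y}_U)$

Estimated Standard Error: $M_o\,\widehat{s.e.}(\bar{y}_U)$

Ratio Estimator:

$\hat{\tau}_R = M_o\bar{y}_R$

Variance (approximate): $M_o^2\,Var(\bar{y}_R)$

Estimated Standard Error (approximate): $M_o\,\widehat{s.e.}(\bar{y}_R)$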
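The estimators of the mean and their estimated standard errors can be collected into one R function. This is a minimal sketch under stated assumptions: the function name `two_stage_mean()`, its argument layout, and the small data set in the usage lines are all hypothetical, and the sums in the standard errors run over the n sampled PSUs.

```r
# y:   list with one vector of sampled values per sampled PSU
# m:   number of SSUs in each sampled PSU (m_i)
# N:   number of PSUs in the population;  M_o: number of SSUs in the population
two_stage_mean <- function(y, m, N, M_o) {
  n      <- length(y)
  m_star <- lengths(y)            # sampled SSUs per sampled PSU
  ybar   <- sapply(y, mean)       # PSU sample means
  s2_w   <- sapply(y, var)        # within-PSU sample variances
  c_i    <- m / (M_o / N)         # c_i = m_i / mu_M

  # Second-stage component, shared by both standard errors
  within <- sum(c_i^2 / (n * N) * (1 / m_star - 1 / m) * s2_w)

  est_U <- sum(c_i * ybar) / n
  s2_B  <- sum((c_i * ybar - est_U)^2) / (n - 1)
  se_U  <- sqrt((1 / n - 1 / N) * s2_B + within)

  est_R <- sum(m * ybar) / sum(m)
  s2_Br <- sum(c_i^2 * (ybar - est_R)^2) / (n - 1)
  se_R  <- sqrt((1 / n - 1 / N) * s2_Br + within)

  list(unbiased = c(estimate = est_U, se = se_U),
       ratio    = c(estimate = est_R, se = se_R))
}

# Hypothetical sample: 3 of N = 10 PSUs, 2 SSUs observed in each
y   <- list(c(10, 12), c(20, 22), c(30, 34))
res <- two_stage_mean(y, m = c(4, 5, 7), N = 10, M_o = 50)
res
```

Estimates of the total follow by multiplying each estimate and standard error by `M_o`.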

Example (Cochran)

From the volume “American Men of Science”, a 2000-page listing consisting of 36,000 names
of scientists with general information, 20 pages were selected at random. The total number of
names per page varies, in general, from about 14 names to 21. On each page, two scientists were
selected and their ages were recorded. Dataset 9 contains the information gathered. Let us
estimate the mean age of scientists in the book using the unbiased estimator and the ratio
estimator. (Refer to Sampled PSU Statistics.csv)

Unbiased Estimation

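$N = 2{,}000$, $\quad n = 20$, $\quad M_o = 36{,}000$, $\quad \mu_M = \frac{M_o}{N} = \frac{36{,}000}{2{,}000} = 18$

$\bar{y}_U = \sum_{i=1}^{n}\frac{c_i\bar{y}_i^*}{n} = \frac{951.2122}{20} = 47.5597$

$s_B^{*2} = \sum_{i=1}^{n}\frac{(c_i\bar{y}_i^* - \bar{y}_U)^2}{n-1} = \frac{2215.3656}{19} = 116.5907$

$\widehat{s.e.}(\bar{y}_U) = \sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{*2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m_i^*} - \frac{1}{m_i}\right)s_{wi}^{*2}} = \sqrt{\left(\frac{1}{20} - \frac{1}{2000}\right)116.5907 + 0.0273} = 2.408$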

Ratio Estimation:
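$\bar{y}_R = \frac{\sum_{i=1}^{n} m_i\bar{y}_i^*}{\sum_{i=1}^{n} m_i} = \frac{17121.5}{359} = 47.6922$

$s_B^{+*2} = \sum_{i=1}^{n}\frac{c_i^2(\bar{y}_i^* - \bar{y}_R)^2}{n-1} = \frac{1763.188}{19} = 92.79939$

$\widehat{s.e.}(\bar{y}_R) = \sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{+*2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m_i^*} - \frac{1}{m_i}\right)s_{wi}^{*2}} = \sqrt{\left(\frac{1}{20} - \frac{1}{2000}\right)92.79939 + 0.0273} = 2.149$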
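The last step of each standard error can be checked in R from the summary values reported above. The helper name `se_from_parts()` is an assumption for this sketch, and the second-stage component 0.0273 is taken as given from the example because the underlying data file is not reproduced here.

```r
n <- 20      # sampled pages
N <- 2000    # pages in the volume
within_term <- 0.0273   # second-stage component, as reported in the example

# Standard error from the between-PSU term and the second-stage component
se_from_parts <- function(s2_B, within, n, N) {
  sqrt((1 / n - 1 / N) * s2_B + within)
}

se_U <- se_from_parts(116.5907, within_term, n, N)   # unbiased estimator
se_R <- se_from_parts(92.79939, within_term, n, N)   # ratio estimator

round(c(unbiased = se_U, ratio = se_R), 3)
```

Small differences in the last decimal place relative to the hand computation can arise from rounding of the reported inputs.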

References:

Introduction to Sampling Designs Course Notes by Therese Ann G Capistrano


Cochran, W. G. (1953). Sampling Techniques. 50–64.
Ziegel, E. R., Levy, P., & Lemeshow, S. (2000). Sampling of Populations. In Technometrics (Vol. 42).
https://doi.org/10.2307/1271120
Lohr, S. (2010). Sampling: Design and Analysis. Nelson Education, Ltd.: Canada.
Sampling of Populations, Methods and Application by Levy and Lemeshow
