Sampling Designs Final Material
What is R?
R is a free software environment for statistical computing and graphics.
The R-chitecture
R exists as a base package with a reasonable amount of functionality. The software R and its
packages are stored in a central location known as CRAN, the Comprehensive R Archive
Network. Once a package is stored in CRAN, anyone with an internet connection can
download it from CRAN and install it to use within their own copy of R.
To install R onto your computer, visit the project website (http://www.R-project.org).
The figure below shows the process of obtaining the installation files. On the main project page,
on the left-hand side, click on the link labelled ‘CRAN’.
There are various copies (mirrors) of CRAN across the globe, so the CRAN link
takes you to a page of links to the various ‘mirror’ sites. Scroll down this list to find a
mirror near you. You may click ‘https://cran.stat.upd.edu.ph/’ since this is the closest to us.
Once you have been redirected to the CRAN mirror that you selected, you will see a web page
that asks you which platform you use (Linux, MacOS or Windows). Click the link that applies
to you.
If you click on the ‘Windows’ link, you’ll be taken to another page with some more links.
Click on ‘base’, which will direct you to the webpage with the link to the setup file. Once there,
click on the link that says ‘Download R 3.6.1 for Windows’, which will initiate the download of
the R setup file. Once this file has been downloaded, double click on it and you will enter a
(hopefully) familiar install procedure.
If you click on the ‘MacOS’ link, you will be taken directly to a page from where you can
download the install package by clicking on the link labelled ‘R-3.6.1.pkg’. Clicking this link will
download the install file; once downloaded, double click on it and you will enter the normal
MacOS install procedure.
To Install RStudio
1. Go to www.rstudio.com and click on the "Download RStudio" button.
2. Click on "Download RStudio Desktop."
3. For Windows: Click on the version recommended for your system, or the latest
Windows version, and save the executable file. Run the .exe file and follow the
installation instructions.
4. For MacOS: Click on the version recommended for your system, or the latest Mac
version, save the .dmg file on your computer, double-click it to open, and then
drag and drop RStudio to your Applications folder.
[Figure: the RStudio window, with four panes labelled Editor/Script/Data Pane, Environment/History, Console, and Files/Plots/Packages/Help.]
Edit – This menu contains edit functions such as cut and paste. From here, you can also clear the
console, activate a rudimentary data editor, and change how the Graphical User Interface looks.
View – This menu lets you select whether or not to see the toolbar and whether to show a status
bar at the bottom of the window.
Packages – This menu is very important because it is where you load, install and update
packages.
Window – If you have multiple windows, this menu allows you to change how the windows in R
are arranged.
Help – It routes you to online help (links to frequently asked questions, the R webpage, etc.) and
offers offline help (pdf manuals and system help files).
Commands in R are generally made up of two parts: objects and functions. These are separated by
“<-”, which you can think of as meaning ‘is created from’. As such, the general form of a
command is: object <- function, which means ‘object is created from function’.
R is case sensitive, which means that if the same name is written in upper case and in lower case, R
treats the two as completely different things.
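As a small illustration of the object <- function pattern and of case sensitivity, consider the sketch below. The object names scores and average are made up for this example.

```r
# 'scores' is created from the function c(), which combines values into a vector
scores <- c(2, 4, 6)

# 'average' is created from the function mean() applied to 'scores'
average <- mean(scores)
print(average)

# R is case sensitive: 'average' exists, but 'Average' was never created
exists("average")
exists("Average")
```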
R Workspace
The collection of objects and things you have created in a session is known as your workspace.
A working directory is the folder where R looks for, and stores, your data files by default.
- Create a folder and place the data files you’ll be using in that folder.
To set the working directory to this newly created folder, use the setwd( ) command.
Example: setwd("D:/R Training/Files")
By executing this command, we can now access files in that folder directly without having
to reference the full file path.
To check what the current working directory is, execute the command getwd( ).
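The sketch below walks through the same idea end to end. It uses a temporary folder so that it runs on any machine; on your own computer you would use your own folder instead (for instance the "D:/R Training/Files" folder above). The file name example.csv is only an assumption for illustration.

```r
# Hypothetical folder used only for illustration
data_dir <- file.path(tempdir(), "R Training", "Files")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

old_dir <- setwd(data_dir)   # setwd() returns the previous working directory
getwd()                      # shows the current working directory

# A file in the working directory can now be referenced by name alone
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
dat <- read.csv("example.csv")

setwd(old_dir)               # restore the original working directory
```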
Installing Packages
1. In Windows, if you select Packages => Install package(s)…, the window that opens
first asks you to select a CRAN mirror and then to choose the package you want to install.
2. If you know the package you want to install, the simplest way is to execute the
command install.packages("package.name"), in which ‘package.name’ is replaced by
the name of the package that you’d like to install. Note that the name of the package
must be enclosed in quotation marks.
Once a package is installed, you need to reference (load) it for R to know that you’re using it. You need
to install a package only once, but you need to reference it each time you start a new
session of R.
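A minimal sketch of the install-once, load-every-session pattern follows. The package name "survey" in the commented line is only an example; the library() call uses the stats package, which ships with R, so the block runs without installing anything.

```r
# Install once (needs an internet connection); "survey" is only an example name
# install.packages("survey")

# Load ("reference") an installed package at the start of every session.
# 'stats' comes with R, so this line works on a fresh installation.
library(stats)

# Check whether the package is attached in the current session
is_attached <- "package:stats" %in% search()
print(is_attached)
```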
In complete enumeration (census), measurements on the variables of interest will be taken from
all the elements in the population.
Sampling unit is the unit that is selected in the sampling process. Sampling units are
nonoverlapping collections of elements from the population that cover the entire population.
Example:
Suppose a researcher wishes to select a sample in order to study the opinion of university
students in Metro Manila on the SOGIE bill. The elementary units in this study are the
university students.
Approach 1. The researcher compiles a list of all university students in Metro Manila. From the
list, the researcher selects the sample of university students. In this approach, the sampling units
are the university students themselves.
Approach 2. The researcher gets a list of universities. From the list, the researcher selects a
sample of the universities. The sample includes all the students in the selected universities. In
this approach, the sampling units are the universities, not the students.
Approach 3. The researcher gets a list of the universities in Metro Manila. From the list, the
researcher selects a sample of universities. The researcher then gets a list of students from each
of the selected universities and then selects a sample of students from each one of the lists. In
this approach, there are two sets of sampling units: universities in the first stage of
sampling and students in the second stage of sampling.
1. One method is to number each subject in the population. Then place numbered cards in a
bowl, mix them thoroughly, and select as many cards as needed. The subjects whose
numbers are selected constitute the sample.
2. Obtain a random sample from a table of random numbers.
3. Generate random numbers with a computer.
[Table of Random Numbers]
Example:
Suppose a researcher wants to have an online talk radio program featuring interviews with provincial
governors on the subject of the war on drugs. Because of time constraints, the 45-minute blog
talk radio program can only accommodate five governors. The radio host wishes to select the governors at
random. Select a random sample of 5 provinces from 43.
Step 3. Select 5 unique numbers from the starting point. Disregard numbers larger than 43 and
numbers that appear more than once. So going down, we get 04, 37, 32, 15 and 07.
Step 4. Take note of the provinces that correspond to the selected numbers. The
governors of these provinces will join the online interview.
04 Bohol
37 Sulu
32 Misamis Occidental
15 Siquijor
07 Eastern Samar
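The third method listed earlier, generating random numbers with a computer, can be sketched in R with the base function sample(). The seed value is arbitrary and is only there so the draw can be repeated; the numbers drawn will not match the ones read from the table of random numbers above.

```r
set.seed(2019)                      # arbitrary seed so the draw can be repeated
N <- 43                             # provinces numbered 1 to 43
n <- 5                              # number of governors to invite

# sample() draws without replacement by default, so the numbers are unique
selected <- sample(1:N, size = n)
sort(selected)
```

The provinces whose numbers appear in selected would then be looked up in the numbered list, as in Step 4.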
The target population is the collection of elements from which information is desired while the
sampled population is the collection from which the sample is actually selected.
-Elements from different groups can be well represented through sampling. When the
population is too large, a survey without sampling becomes impossible, especially when
there is limited time.
Greater Accuracy and Efficiency
-When all entities are measured, the measurement error increases.
Timeliness
-Sampling takes less time since the volume of data is reduced.
Greater Scope
-More information can be extracted even with a limited amount of resources because of
generalizations.
Nature of Testing Procedure
-Grinding and sectioning of fossil material in an archeological study, dissection of,
quality control of manufactured products, and many more need to be sampled or else the
entire population is destroyed.
Research Ethics
-A researcher should control the use of animals, and it should be under an ethical
framework.
Disadvantages of Sampling
There are chances of bias in the selection of the sampling method.
Appropriate calculation of sample size is a challenging task.
It requires adequate knowledge of the subjects.
When the population is not homogeneous, there is a need for an expert who has
specialized knowledge in sampling.
Characteristics of Estimators
1. Element Sampling: the elementary units themselves serve as the sampling units.
2. Cluster Sampling: the sampling units contain many elementary units and all the
elementary units that belong in the selected sampling units will form the sample.
3. Multi-Stage Sampling: there is more than one stage of sampling and different sampling
units are used in each stage.
The sampling frame is the complete list of sampling units under study.
Types of Frame
1. directory type: this is a listing of sampling units. It may be a physical list of units or a
list of codes representing the units
2. map type: lists in the form of processed maps
1. Cost. If an existing list is to be used, it should not be more expensive to acquire and
clean this list than to generate a list of all sampling units in the population on your own.
2. Simple to use. The rule used to identify the sampling unit to which an element belongs
must not be too complicated.
A probability sample is one that is based on a sampling plan that gives every element in the
population a known, nonzero probability of being included in the sample; otherwise it is a
nonprobability sample.
2. Purposive Sampling: the elements are carefully selected to provide a “representative
sample”. Studies have demonstrated that selection bias can arise even with expert choice,
but the method may nevertheless be appropriate for very small samples when the
expert has a good deal of information about the population elements. The two common
features of the method are (a) sampling units often consist of relatively large groups; and
(b) sampling units are chosen so that they will provide accurate estimates for important
control variables for which results are known for the whole population, and it is hoped
that they will also give “good” estimates for other variables that are highly correlated with
the control variables.
3. Quota Sampling: interviewers are assigned quotas on the number of respondents of
various types, often based on population figures for the different subgroups of the population
being studied.
4. Snowball Sampling: identifying one or more participants in the desired population and
using them to find other participants until the desired sample size is met.
The inclusion probability, $\pi_j$, is the probability that the jth element of the population is
included in the sample. The selection probability, $p_i$, is the probability that the ith possible
sample $S_i$ is selected.
In probability sampling, the inclusion and selection probabilities are both known. This will be
possible if the identification of the sampling units included in the sample is based on a
randomization mechanism.
The use of probability sampling does NOT guarantee the selection of a “representative sample”.
The knowledge of the inclusion and selection probabilities under probability sampling will allow
us to measure the reliability, validity and accuracy of the estimators.
1. Sample Design: includes both sampling plan and the estimation procedure. The
sampling plan is the methodology used for selecting the sample from the population. The
estimation procedures are the algorithms or formulas used for obtaining estimates of the
parameters from the sample and for estimating the reliability/ accuracy of these
population estimates.
2. Survey Measurements: This component includes: (a) the variables needed in order to
meet the objectives of the survey and (b) the survey instruments to be used to measure
these variables.
3. Survey Operations: Once the sample has been chosen and the measurement instruments
or questionnaires have been drafted, pretested, and modified, survey operations can begin.
Survey operations include both the fieldwork of the survey (including data collection) and
data management.
4. Statistical Analysis and Report Writing. After the data have been collected, coded,
edited, and processed, the data can be analyzed statistically and the findings incorporated
into a final report. As in all components of a sample survey, considerable care should be
taken in the interpretation of the findings of the survey.
1. Cost Efficiency. Each observation, or item, taken from the population contains a certain
amount of information about the parameter of interest. Since information costs money,
the researcher must decide on the amount that will be used to estimate this parameter.
Too little information prevents the researcher from making good estimates, while too
much information results in a waste of money.
a. Initially decide on the total cost to be allocated to the survey and then choose
the sample design that will yield estimates having the highest degree of accuracy
at the stated cost.
b. Make specifications on the desired accuracy of the estimate and choose the sample
design that will yield estimates meeting these specifications at the lowest possible
cost.
2. Feasibility. No matter how cost-efficient a particular design is, it is of no use if it is not
feasible to execute this design. This means the selection plan must be made such that it
would be possible for the interviewers to identify the elements from whom to get the
information, as specified by the sampling plan.
Note:
The sampling units in SRSWOR and SRSWR are the elements themselves.
Unlocking Terms:
FORMULA: $_nP_r = \dfrac{n!}{(n-r)!}$
An ordered n-tuple $(a_1, a_2, \ldots, a_n)$ is an ordered collection of n elements with $a_1$ as its first
element, $a_2$ as its second element, and $a_n$ as its nth element.
In SRSWOR, a particular element can appear only once in a given sample since a permutation
must consist of distinct coordinates. In SRSWR, a particular element can appear more than once
in a sample.
Step 1. Assign a number from 1 to N to each element in the population.
Step 2. Select n (distinct) numbers from 1 to N by use of some random process such as a table of
random numbers, a computer, or a calculator with a random number generator. The numbers must be
distinct for SRSWOR; repeats are allowed for SRSWR.
Step 3. The population elements corresponding to the numbers selected in Step 2 constitute the
sample using SRSWR (SRSWOR).
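The three steps above can be sketched in R as follows. The population size N = 20 and sample size n = 5 are hypothetical values chosen only for illustration, and the seed is arbitrary.

```r
set.seed(1)   # arbitrary seed, only so the draw can be repeated
N <- 20       # hypothetical population size (elements numbered 1 to N)
n <- 5        # sample size

srswor <- sample(1:N, size = n, replace = FALSE)  # distinct numbers (SRSWOR)
srswr  <- sample(1:N, size = n, replace = TRUE)   # numbers may repeat (SRSWR)

# Number of possible ordered samples:
# N!/(N-n)! without replacement, N^n with replacement
n_perm <- prod((N - n + 1):N)
n_wr   <- N^n
```

The product form of N!/(N-n)! is used instead of factorial(N) / factorial(N - n) to avoid very large intermediate values.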
General Procedure
Step 1. Specify what is expected of the sample in terms of the level of reliability needed for the
resulting estimates. This statement is usually in terms of the desired limits of error (absolute or
relative) and the corresponding level of confidence to be placed on the estimates.
In general, the larger the sample, the greater will be the reliability of the resulting estimates.
Validity, in general, cannot be improved with an increase in the sample size unless the bias is a
function of n. And because of the difficulty of ensuring that no unsuspected bias enters into the
estimates, the level of reliability is controlled instead of the level of accuracy.
Step 2. Find some equation that relates n with the desired reliability of the sample. The equation
will vary according to how the desired reliability is stated and the type of sampling procedure to
be used.
Step 4. If more than one variable is to be measured in the survey, select the most vital variables
in the study. Prescribe the desired degree of reliability for each item and compute the corresponding
sample size. More commonly, there is sufficient variation among the computed n’s
so that it is not advisable to choose the largest n, either from budgetary considerations or because
this will give an over-all standard of reliability that is substantially higher than originally
contemplated. In this event, the desired reliability may be relaxed for certain of the items in order
to permit the use of a smaller value of n. Or, in some cases, these items are dropped from the
study.
Step 5. Finally appraise the chosen value of n to see whether it is consistent with the resources
available to take the sample. This demands an estimation of cost, labor, time, and materials
required to obtain the proposed size of the sample. If cost has been specified in advance and the
computed n is much larger than what the researcher can afford, let the researcher decide whether
to proceed with a much smaller sample size (thus reducing the reliability of the estimates) or to
look for a more efficient sampling design or to abandon efforts until more resources are found.
Note: If the computed $n_1$ is not an integer, the sample size n is usually taken as $n = \lfloor n_1 \rfloor + 1$, subject to
the unit cost of sampling. For an additional sampling unit, compare the increase in cost to the
increase in reliability, particularly in very small populations where the reliability will fluctuate
largely between $\lfloor n_1 \rfloor$ and $\lfloor n_1 \rfloor + 1$.
1. Directly specify the desired variance of the estimator, $V_d =$ desired $Var(\hat{\Theta})$, or the
desired coefficient of variation of the estimator, $C_d =$ desired $CV(\hat{\Theta})$. The formula for n
using $C_d$ will involve the coefficient of variation of the population measures, which is
often more stable and easier to guess than the standard
error.
The coefficient of variation (CV) is a statistical measure of the dispersion of data points
in a data series around the mean. It is often calculated as the ratio of the standard
deviation to the mean.
2. Specify the desired margin of error, d, in the estimate and the risk α that the researcher is
willing to incur that the actual error is larger than d; that is, $P(|\hat{\Theta} - \theta| > d) = \alpha$.
The margin of error is supposed to measure the maximum amount by which the sample
results are expected to differ from those of the actual population.
3. Specify the desired relative error, r, in the estimate and the risk α that the researcher is
willing to incur that the actual relative error is larger than r; that is,
$$P\left(\left|\frac{\hat{\Theta} - \theta}{\theta}\right| > r\right) = \alpha$$
This type of error is relative to the size of the parameter being measured.
3. Guess the structure of the population and use some mathematical results.
For example, Deming shows how simple mathematical distributions may be used to
estimate the population variance from information on the range and a general idea of the
shape of the distribution. If the distribution is like a binomial, with a proportion p of the
observations at one end of the range and a proportion q at the other end, $S^2$ can be
estimated by $pq\,r^2$, where r is the range. Other useful relations are that $S^2$ can be estimated by
$0.083r^2$ for a rectangular distribution, $0.056r^2$ for a distribution that is shaped like a right
triangle, and $0.042r^2$ for an isosceles triangle.
4. Take the sample in two steps, the first being a sample of size $n_1$, from which the estimates
are computed and the required n is obtained.
This method gives the most reliable estimates, but it is not often used since it slows down
the completion of the survey. Cochran lists some formulas used in determining $n_2 = n - n_1$
after a sample of $n_1$ has already been taken.
If $n_0/N < 0.05$, use $n = n_0$. Otherwise, use
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}}$$
to estimate the mean, and
$$n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}$$
when the objective is to estimate a proportion.
When the objective is to estimate $\tau$, simply multiply $n_0$ by $N^2$ when d or $V_d$
is specified. Otherwise, use the formula for the mean.
Example:
A community within a city contains 3000 households and 10,000 persons. For purposes
of planning a community satellite to the local health department, it is desired to estimate the total
number of physician visits made during a calendar year by members of the community. For this
information to be useful, it should be accurate to within 10% of the true value. A small pilot
survey of 10 households, conducted for purposes of gathering preliminary information, yielded
the accompanying data on physician visits made during the previous calendar year. Using this
data as preliminary information, determine the sample size needed to meet the specifications of
the survey.
So, $CV = \dfrac{S}{\mu} = \dfrac{\sqrt{96.3}}{19.4} \approx 0.5075$.
$$n_0 = \frac{CV^2\, z_{\alpha/2}^2}{r^2} = \frac{(0.5075)^2 (2.576)^2}{(0.1)^2} = 170.9076$$
Since $\dfrac{n_0}{N} > 0.05$, we cannot ignore the fpc. Thus, we compute
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{170.9076}{1 + \dfrac{170.9076}{3000}} = 161.696$$
Example (Capistrano)
An anthropologist is studying the 3,200 inhabitants of island X. Among other things, he wishes
to estimate the percentage of inhabitants belonging to blood group O. Find a conservative
estimate for n using SRSWOR if the anthropologist will be content if the percentage is correct
within ±5% except for a 1 in 20 chance.
Since only a conservative estimate for n is needed, we’ll use P = Q = 0.5 (for which we will observe
the largest variability).
$$n_0 = \frac{z_{\alpha/2}^2\, PQ}{d^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.05)^2} = 384.16$$
Since $\dfrac{384.16}{3200} \not< 0.05$, we cannot ignore the fpc:
$$n = \frac{n_0}{1 + (n_0 - 1)/N} = \frac{384.16}{1 + 383.16/3200} = 343.08$$
Use a sample of size 344.
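A small R sketch of the same calculation is given below. The function name srs_size_proportion is hypothetical; the inputs P = 0.5, z = 1.96, d = 0.05 and N = 3200 come from the example.

```r
# Sample size for estimating a proportion under SRSWOR with margin of error d.
# (Hypothetical helper written for this example.)
srs_size_proportion <- function(P, z, d, N) {
  n0 <- z^2 * P * (1 - P) / d^2
  # apply the fpc form for proportions unless n0/N is below 0.05
  n  <- if (n0 / N < 0.05) n0 else n0 / (1 + (n0 - 1) / N)
  c(n0 = n0, n = n)
}

res <- srs_size_proportion(P = 0.5, z = 1.96, d = 0.05, N = 3200)
res
ceiling(res[["n"]])   # round up to the next whole inhabitant
```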
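Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of the Standard Error under SRSWOR
Mean ($\mu$)
   Estimator: $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$
   Variance of the estimator: $\dfrac{S^2}{n}\cdot\dfrac{N-n}{N} = \dfrac{\sigma^2}{n}\cdot\dfrac{N-n}{N-1}$
   Estimated standard error: $\dfrac{s}{\sqrt{n}}\sqrt{\dfrac{N-n}{N}}$
Total ($\tau$)
   Estimator: $\hat{\tau} = N\bar{y} = \dfrac{N}{n}\sum_{i=1}^{n} y_i$
   Variance of the estimator: $\dfrac{N^2 S^2}{n}\cdot\dfrac{N-n}{N}$
   Estimated standard error: $\dfrac{N s}{\sqrt{n}}\sqrt{\dfrac{N-n}{N}}$
The factor $\dfrac{N-n}{N-1}$ is called the finite population correction (fpc).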
The factors $\dfrac{N-n}{N} = 1 - f$ (where $f = \dfrac{n}{N}$) for the variance and $\sqrt{\dfrac{N-n}{N}}$ for the standard error will
also be referred to as the finite population correction (fpc). As n gets closer to N, the fpc decreases in
magnitude and thus will cause a reduction in the value of the standard error. On the other hand, if
the sampling fraction, f, is very small (that is, information has been obtained from only a very
small fraction of the population), then the fpc is close to one.
Using Data Set 1, we can estimate the mean and total forced vital capacity of workers in the
company under consideration and their respective standard errors as follows:
Estimated mean: $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n} = \dfrac{3216}{40} = 80.4$
$s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = 153.9385$
Sampling fraction: $f = \dfrac{n}{N} = 40/1200$
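Under the SRSWOR formulas, the standard error of the mean is estimated by $\frac{s}{\sqrt{n}}\sqrt{1-f}$ and that of the total by N times this value. The R sketch below applies these formulas to the summary figures quoted for Data Set 1 (sum of observations 3216, n = 40, N = 1200, s² = 153.9385). It is a sketch from those summary figures only, not from the raw data.

```r
n  <- 40          # sample size
N  <- 1200        # population size
s2 <- 153.9385    # sample variance quoted in the text

ybar <- 3216 / n                           # estimated mean
f    <- n / N                              # sampling fraction
se_mean  <- sqrt(s2 / n) * sqrt(1 - f)     # estimated SE of the mean, with fpc

tau_hat  <- N * ybar                       # estimated total
se_total <- N * se_mean                    # estimated SE of the total

round(c(ybar = ybar, se_mean = se_mean, tau_hat = tau_hat, se_total = se_total), 4)
```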
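Proportion of Units in C ($P$)
   Estimator: $\hat{P} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$
   Variance of the estimator: $\dfrac{NPQ}{n(N-1)}\left(1 - \dfrac{n}{N}\right) = \dfrac{PQ}{n}\cdot\dfrac{N-n}{N-1}$
   Estimated standard error: $\sqrt{\dfrac{\hat{P}\hat{Q}}{n-1}}\sqrt{1 - \dfrac{n}{N}}$
Total No. of Units in C ($A$)
   Estimator: $\hat{A} = N\hat{P}$
   Variance of the estimator: $\dfrac{N^2 PQ}{n}\cdot\dfrac{N-n}{N-1}$
   Estimated standard error: $N\sqrt{\dfrac{\hat{P}\hat{Q}}{n-1}}\sqrt{1 - \dfrac{n}{N}} = \sqrt{\dfrac{N\hat{P}\hat{Q}}{n-1}}\,\sqrt{N-n}$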
Example (Capistrano)
In a sample of size 100 selected using SRSWOR from a population of size 500, there are 37 units
in Class C. Estimate the population proportion of units in C and the total number of units in C
and their respective standard errors.
$\hat{P} = \dfrac{37}{100} = 0.37$. Its standard error is estimated to be $\sqrt{\dfrac{(0.37)(0.63)}{99}}\sqrt{1 - \dfrac{100}{500}} = 0.0434$.
$\hat{A} = (500)(0.37) = 185$ and its standard error is estimated to be (500)(0.0434) = 21.7.
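The same example can be reproduced with a few lines of R. The object names are made up for this sketch; the inputs n = 100, N = 500 and 37 units in Class C are from the example.

```r
n <- 100   # sample size
N <- 500   # population size
a <- 37    # sampled units that fall in Class C

p_hat <- a / n                                                    # estimated proportion
se_p  <- sqrt(p_hat * (1 - p_hat) / (n - 1)) * sqrt(1 - n / N)    # estimated SE of p_hat

A_hat <- N * p_hat   # estimated total number of units in C
se_A  <- N * se_p    # estimated SE of A_hat

round(c(p_hat = p_hat, se_p = se_p, A_hat = A_hat, se_A = se_A), 4)
```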
Stratified Sampling is a probability sampling method where the population is divided into
nonoverlapping groups or strata based on supplementary information, and then independent
samples are selected within each stratum.
Stratified Random Sampling is a particular stratified sampling method where simple random
sampling without replacement is used in selecting the samples within each stratum.
Example:
1. A researcher wishes to estimate average enrollments and faculty sizes for high schools.
Private institutions tend to be smaller than the public ones, so stratified sampling is used
where the two strata are private and public.
2. A standard quality control check on automobile batteries involves simply measuring the
weight. One particular shipment from the manufacturer consisted of batteries produced in
six different months. The investigator decides to stratify by month in the sampling
inspection to observe month-to-month variation.
Stratification will not always produce more reliable estimates than
SRSWOR with the same sample size unless all of the strata are large. A
sufficient condition that will assure more reliable estimates is the selection of a
stratification variable that will decompose the variance $\sigma^2$ in such a way that $\sigma_B^2$ is larger
than $\sigma_w^2$.
The choice of allocation method affects the reliability of the estimates. The allocation
method that yields the smallest variance per unit cost is optimum allocation. However, if
the stratum variances are not too different from each other, the estimate
using proportional allocation will be as reliable as the estimate using Neyman allocation.
We would then prefer to use proportional allocation since the sample will be a
self-weighting sample.
Other considerations in choosing the stratification variables are: (i) convenience (e.g.
geographic areas), since stratified sampling allows for the decentralization of data
collection and processing; and (ii) simplicity of the domain analysis, since the $n_h$'s will not be
random variables because they are fixed by design.
Step 1. Clearly specify the strata. These strata must be nonoverlapping and their union must be
the whole population.
Step 2. Place each sampling unit of the population into its appropriate stratum.
Step 3. Using a randomization mechanism, select a sample from each stratum, making sure that
the selections of the samples are independent of each other. The sample size may vary from one
stratum to the other, depending on the allocation method. The sample selection procedure may
also vary in the different strata.
In stratified random sampling, use simple random sampling without replacement in each stratum.
Make sure that a different set of random numbers is generated in each one of the strata so that the
observations chosen in one stratum do not depend upon those chosen in another.
Step 4. The stratified random sample consists of the combined samples selected in Step 3.
Example.
Suppose we wish to select a sample of 50 students using stratified random sampling with sex as the
stratification variable from a population consisting of 250 females and 100 males. In order to do
this, our frame must be in the form of a list of students with information on the sex of each
student. We partition the population into two strata: Female and Male. Each female will be
assigned a unique number from 1 to 250, and each male a unique number from 1 to 100. Suppose
our sample of 50 must consist of 30 females
and 20 males. To select a sample of females using SRSWOR, we use a randomization
mechanism to choose 30 distinct numbers from 1 to 250. The females in the list associated with the
selected numbers will be included in the sample. To select a sample of males using SRSWOR,
we once again use a randomization mechanism to choose 20 distinct numbers from 1 to 100. The
males in the list associated with the selected numbers will be included in the sample.
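The selection described in this example can be sketched in R as below, assuming 250 females and 100 males with a sample of 30 females and 20 males. The seed is arbitrary, and a separate call to sample() is made in each stratum so that the draws are independent.

```r
set.seed(50)   # arbitrary seed, only so the draw can be repeated

N_h <- c(Female = 250, Male = 100)   # stratum sizes assumed for the example
n_h <- c(Female = 30,  Male = 20)    # stratum sample sizes

# SRSWOR within each stratum: distinct numbers from 1 to N_h
strat_sample <- lapply(names(N_h), function(h) sample(1:N_h[[h]], size = n_h[[h]]))
names(strat_sample) <- names(N_h)

strat_sample$Female   # numbers of the selected females
strat_sample$Male     # numbers of the selected males
```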
Notations
Population Figures
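$W_h$ = hth stratum weight $= \dfrac{N_h}{N}$
$f_h$ = sampling fraction in the hth stratum $= \dfrac{n_h}{N_h}$
$\tau$ = population total $= \sum_{h=1}^{L}\sum_{j=1}^{N_h} Y_{hj} = \sum_{h=1}^{L} N_h \mu_h$
$\sigma_w^2$ = variance within strata $= \dfrac{\sum_{h=1}^{L}\sum_{j=1}^{N_h} (Y_{hj} - \mu_h)^2}{N}$
$\sigma_B^2$ = variance between strata $= \dfrac{\sum_{h=1}^{L}\sum_{j=1}^{N_h} (\mu_h - \mu)^2}{N} = \sum_{h=1}^{L} \dfrac{N_h}{N}(\mu_h - \mu)^2$
$\sigma^2$ = population variance $= \dfrac{\sum_{h=1}^{L}\sum_{j=1}^{N_h} (Y_{hj} - \mu)^2}{N}$
Note: $\sigma^2 = \sigma_w^2 + \sigma_B^2$
$S_h^2 = \dfrac{\sum_{j=1}^{N_h} (Y_{hj} - \mu_h)^2}{N_h - 1}$
Sample Counterparts
$n_h$ = number of units in the sample that belong to the hth stratum $\left(\sum_{h=1}^{L} n_h = n\right)$
$s_h^2 = \dfrac{\sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2}{n_h - 1}$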
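To make the notation concrete, the sketch below computes these population figures in R for a small made-up population with three strata and checks the decomposition of the population variance into its within-strata and between-strata parts. All data values are invented for illustration.

```r
# Hypothetical population with L = 3 strata (values invented for illustration)
strata <- list(c(2, 4, 6), c(10, 12), c(20, 22, 24, 26))

y    <- unlist(strata)
N    <- length(y)              # population size
N_h  <- lengths(strata)        # stratum sizes
W_h  <- N_h / N                # stratum weights
mu   <- mean(y)                # population mean
mu_h <- sapply(strata, mean)   # stratum means

sigma2   <- sum((y - mu)^2) / N                                      # population variance
sigma2_w <- sum(sapply(strata, function(s) sum((s - mean(s))^2))) / N  # within strata
sigma2_B <- sum(N_h * (mu_h - mu)^2) / N                             # between strata
S2_h     <- sapply(strata, var)   # var() divides by N_h - 1, matching S_h^2

all.equal(sigma2, sigma2_w + sigma2_B)   # checks the variance decomposition
```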
1. Equal Allocation: the same number of elements is sampled from each stratum. This
is used if the primary objective of the survey is to test hypotheses about differences
among the strata with respect to the variable of interest (under the assumption that
within strata variances are equal).
2. Proportional Allocation (self-weighting samples): the sampling fraction $\dfrac{n_h}{N_h}$ is
specified to be the same for each stratum, which implies that the overall sampling
fraction $\dfrac{n}{N}$ is the fraction taken from each stratum. This is often used because of its
simplicity even if it is not the optimal design in terms of precision of estimates.
3. Optimal Allocation: the allocation that will yield an estimate that has the lowest
variance per unit cost. A special case is called Neyman allocation, where the cost per
unit is the same in all strata.
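Formulas for $n_h$ under each allocation method:
Equal Allocation: $n_h = \dfrac{1}{L}\,n$
Proportional Allocation: $n_h = W_h\,n = \dfrac{N_h}{N}\,n$
Optimal Allocation: $n_h = \dfrac{W_h S_h/\sqrt{c_h}}{\sum_{h=1}^{L}\left(W_h S_h/\sqrt{c_h}\right)}\,n = \dfrac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^{L}\left(N_h S_h/\sqrt{c_h}\right)}\,n$
where the cost function is
$C = c_0 + \sum c_h n_h$
$c_0$ = overhead cost and
$c_h$ = cost per unit in the hth stratum.
If cost is fixed in advance:
$$n_h = \frac{\left(N_h S_h/\sqrt{c_h}\right)(C - c_0)}{\sum_{h=1}^{L}\left(N_h S_h \sqrt{c_h}\right)}$$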
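$Var(\bar{y}_{st})$ under each allocation method:
Equal Allocation: $\dfrac{L}{n}\sum_{h=1}^{L} W_h^2 S_h^2 (1 - f_h)$
Proportional Allocation: $\dfrac{1-f}{n}\sum_{h=1}^{L} W_h S_h^2$
Optimal Allocation: $\dfrac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{n} - \dfrac{\sum_{h=1}^{L} W_h S_h^2}{N}$ for Neyman Allocation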
With proportional allocation and equal variances $S_h^2$ in all strata, say $S_c^2$, we have the
simple result
$$Var(\bar{y}_{st}) = \frac{S_c^2}{n}(1 - f)$$
For stratified random sampling with Neyman allocation, the variance simplifies to
$$V_{opt} = Var(\bar{y}_{st}) = \frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{n} - \frac{\sum_{h=1}^{L} W_h S_h^2}{N}$$
Example:
Stratum   $N_h$   $S_h$   $W_h = N_h/N$
1 100 50 0.2
2 150 10 0.3
3 250 5 0.5
How will we allocate the total sample of 140 elements to each stratum by using equal allocation,
proportional allocation and Neyman allocation? Compute for Var ( y st ) for each.
Equal Allocation
$n_h = \dfrac{1}{L}\,n = \dfrac{140}{3} = 46.67$ for h = 1, 2, 3. (If we round up to 47, the sample size will increase to 141.)
$$Var(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \frac{S_h^2}{n_h}(1 - f_h) = \frac{1}{47}\sum_{h=1}^{L} W_h^2 S_h^2 (1 - f_h)$$
Proportional Allocation
Neyman Allocation
$$n_h = \frac{N_h S_h}{\sum_{h=1}^{L} N_h S_h}\, n$$
So, $n_1 = \left(\dfrac{5000}{7750}\right)(140) = 90.32$, $n_2 = 27.10$, $n_3 = 22.58$, which we would round off to
$n_1 = 90$, $n_2 = 27$, $n_3 = 23$.
$$Var(\bar{y}_{st}) = \frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{n} - \frac{\sum_{h=1}^{L} W_h S_h^2}{N}$$
$$= \frac{\left(0.2(50) + 0.3(10) + 0.5(5)\right)^2}{140} - \frac{0.2(50^2) + 0.3(10^2) + 0.5(5^2)}{500} = 0.63107$$
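The three allocations in this example can be compared with a short R sketch. It uses the general variance formula $\sum W_h^2 S_h^2 (1 - f_h)/n_h$ with unrounded $n_h$, so the equal-allocation result uses $n_h = 140/3$ rather than 47. The function name var_st is made up for this illustration.

```r
N_h <- c(100, 150, 250)   # stratum sizes from the example
S_h <- c(50, 10, 5)       # stratum standard deviations from the example
n   <- 140                # total sample size

N   <- sum(N_h)
W_h <- N_h / N
L   <- length(N_h)

# Variance of the stratified mean for a given allocation n_h (hypothetical helper)
var_st <- function(n_h) sum(W_h^2 * S_h^2 / n_h * (1 - n_h / N_h))

n_equal  <- rep(n / L, L)                    # equal allocation
n_prop   <- n * W_h                          # proportional allocation
n_neyman <- n * N_h * S_h / sum(N_h * S_h)   # Neyman allocation

v <- c(equal = var_st(n_equal),
       proportional = var_st(n_prop),
       neyman = var_st(n_neyman))
round(rbind(n_equal, n_prop, n_neyman), 2)
round(v, 4)
```

With these inputs the Neyman variance comes out at about 0.631, the smallest of the three, because stratum 1 is far more variable than the others and receives most of the sample.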
It is possible that the computed $n_h$ under optimal allocation will exceed the value of $N_h$. For
example, the sample size needed is 140 and $N_1 = 100$, $N_2 = 110$, $N_3 = 120$, but the computed $n_1$
under Neyman allocation is $120 > N_1$. In such a case, we will use 100% sampling in the first
stratum and allocate the remaining elements to the other strata using the formula
$$n_h = (n - N_1)\,\frac{W_h S_h}{\sum_{h=2}^{L} W_h S_h}, \quad h = 2, \ldots, L$$
For a specified $V_d$ or d, and with the fpc ignored, the first approximation for the sample size when
we wish to estimate the mean is
$$n_0 = \frac{z_{\alpha/2}^2}{d^2}\sum_{h=1}^{L}\frac{W_h^2 S_h^2}{w_h} = \frac{1}{V_d}\sum_{h=1}^{L}\frac{W_h^2 S_h^2}{w_h}, \quad \text{where } w_h = \frac{n_h}{n}$$
Note that $V_d = \dfrac{d^2}{z_{\alpha/2}^2}$ and the exact formula depends on the allocation method.
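First approximation $n_0$ by parameter and allocation method:
Mean
   Equal allocation: $n_0 = \dfrac{L}{V_d}\sum_{h=1}^{L} W_h^2 S_h^2$
   Proportional allocation: $n_0 = \dfrac{1}{V_d}\sum_{h=1}^{L} W_h S_h^2$
   Neyman allocation: $n_0 = \dfrac{1}{V_d}\left(\sum_{h=1}^{L} W_h S_h\right)^2$
Total
   Equal allocation: $n_0 = \dfrac{L}{V_d}\sum_{h=1}^{L} N_h^2 S_h^2$
   Proportional allocation: $n_0 = \dfrac{N}{V_d}\sum_{h=1}^{L} N_h S_h^2$
   Neyman allocation: $n_0 = \dfrac{1}{V_d}\left(\sum_{h=1}^{L} N_h S_h\right)^2$
Proportion
   Equal allocation: $n_0 = \dfrac{L}{V_d}\sum_{h=1}^{L} W_h^2 P_h Q_h$
   Proportional allocation: $n_0 = \dfrac{1}{V_d}\sum_{h=1}^{L} W_h P_h Q_h$
   Neyman allocation: $n_0 = \dfrac{1}{V_d}\left(\sum_{h=1}^{L} W_h \sqrt{P_h Q_h}\right)^2$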
As before, if $\dfrac{n_0}{N} < 0.05$, use $n = n_0$. Otherwise, compute n using the following formula:
for the Mean: $n = \dfrac{n_0}{1 + \dfrac{1}{N V_d}\sum_{h=1}^{L} W_h S_h^2}$
for the Total: $n = \dfrac{n_0}{1 + \dfrac{1}{V_d}\sum_{h=1}^{L} N_h S_h^2}$
for the Proportion: $n = \dfrac{n_0}{1 + \dfrac{1}{N V_d}\sum_{h=1}^{L} W_h P_h Q_h}$
Take note that under proportional allocation, all formulas reduce to the familiar form
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}}$$
When the margin of error (d) is specified, replace $V_d$ by $\left(\dfrac{d}{z_{\alpha/2}}\right)^2$. Replace $V_d$ by $\left(\dfrac{r\mu}{z_{\alpha/2}}\right)^2$ if
the relative error (r) is specified. Replace $V_d$ by $(C_d\,\mu)^2$ if $C_d = \dfrac{r}{z_{\alpha/2}}$ is specified. And, if
the parameter of interest is the total, then use the same formula as for the mean if r or $C_d$ is
specified.
Example
Suppose that we are planning to take a sample of the members of a health maintenance
organization (HMO) for the purpose of estimating the average number of hospital episodes per
person. The sample will be selected from membership lists grouped according to age (under
45 years; 45-64 years; 65 years and over). Let us suppose that the distributions of hospital
episodes are available from national data (such as the National Health Interview) and are
given below:
Age group | Number of members ($N_h$) | Mean no. of hosp. episodes ($\mu_h$) | Variance of no. of episodes ($\sigma_h^2$)
Under 45 years | 600 | 0.164 | 0.245
45-64 years | 500 | 0.166 | 0.296
65 years and over | 400 | 0.236 | 0.435
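Compute the number of subjects needed to be 99% certain of estimating the mean number of
hospital episodes within 20% of the true mean under stratified random sampling with
proportional allocation.
Based on the national data, an initial guess for the mean number of hospital episodes is
$$\mu = \sum_{h=1}^{L} W_h \mu_h = \frac{600(0.164) + 500(0.166) + 400(0.236)}{1500} = 0.18387 \approx 0.184$$
and since $S_h^2 = \frac{N_h}{N_h - 1}\sigma_h^2$, then
$$\sum_{h=1}^{3} W_h S_h^2 = \frac{600^2}{(1500)(599)}\,0.245 + \frac{500^2}{(1500)(499)}\,0.296 + \frac{400^2}{(1500)(399)}\,0.436 = 0.313586$$
Thus, with $z_{\alpha/2} = 2.576$, $r = 0.2$ and the unrounded value of $\mu$,
$$n_0 = \frac{2.576^2}{0.2^2\,(0.18387)^2}\,(0.313586) = 1538.801$$
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} = \frac{1538.801}{1 + \dfrac{1538.801}{1500}} = 759.576$$
which we round up to $n = 760$.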
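The HMO computation can be reproduced step by step. This is a minimal Python sketch of the proportional-allocation formulas with a specified relative error; the function name is illustrative, and the third variance is entered as 0.436, the value used in the worked computation.

```python
def proportional_sample_size(N_h, mu_h, var_h, z, r):
    """Sample size for the mean under proportional allocation, relative error r."""
    N = sum(N_h)
    W = [Nh / N for Nh in N_h]                               # stratum weights
    mu = sum(w * m for w, m in zip(W, mu_h))                 # initial guess of the mean
    S2 = [Nh / (Nh - 1) * v for Nh, v in zip(N_h, var_h)]    # corrected variances S_h^2
    V_d = (r * mu / z) ** 2                                  # desired variance
    n0 = sum(w * s2 for w, s2 in zip(W, S2)) / V_d           # first approximation
    # apply the finite population correction only when n0/N is not negligible
    n = n0 if n0 / N < 0.05 else n0 / (1 + n0 / N)
    return n0, n


n0, n = proportional_sample_size(
    N_h=[600, 500, 400],
    mu_h=[0.164, 0.166, 0.236],
    var_h=[0.245, 0.296, 0.436],   # third value as used in the worked computation
    z=2.576,
    r=0.2,
)
print(round(n0, 1), round(n, 1))
```

Keeping the unrounded mean inside the function avoids the small difference that appears when 0.184 is substituted for 0.18387.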
Example
(Cochran) A sample of United States colleges and universities will be drawn using stratified
random sampling with optimum allocation in order to estimate enrollments for the current
academic year. The population of teachers' colleges and normal schools was divided into 7
strata, of which one small stratum will be ignored. The first five strata were constructed by size
of institution while the sixth contained colleges for women only. Data needed for computing the
sample size were taken from the previous academic year. They show that the total enrollment was
56,472. The other needed information is as follows:
Stratum | $N_h$ | $S_h$
1 | 13 | 325
2 | 18 | 190
3 | 26 | 189
4 | 42 | 82
5 | 73 | 86
6 | 24 | 190
Total | 196 |
The desired coefficient of variation is $C_d = 0.05$.
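We use the formula $n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} W_h S_h\right)^2$.
Now, $\mu = \frac{\tau}{N} = 56{,}472/196 = 288.122$, so $V_d = (C_d\,\mu)^2 = (0.05 \times 288.122)^2 = 207.536$.
So,
$$n_0 = \frac{1}{V_d}\left(\sum_{h=1}^{L} W_h S_h\right)^2 = \frac{1}{(207.536)(196)^2}\left((13)(325) + (18)(190) + (26)(189) + (42)(82) + (73)(86) + (24)(190)\right)^2 = 90.36$$
Since $\sum_{h=1}^{L} W_h S_h^2 = \frac{1}{N}\sum_{h=1}^{L} N_h S_h^2$, the formula for the mean becomes
$$n = \frac{n_0}{1 + \frac{1}{N^2 V_d}\sum_{h=1}^{L} N_h S_h^2} = \frac{90.36}{1 + \frac{1}{196^2\,(207.536)}(4640387)} = 57.1$$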
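The sample size in this example can be recomputed directly from the stratum sizes and standard deviations. The sketch below is a hedged Python illustration of the optimum-allocation formulas with a specified coefficient of variation; the function name and argument names are assumptions made for the illustration.

```python
def neyman_sample_size_cv(N_h, S_h, total, cv):
    """Sample size under optimum allocation when the coefficient of variation is specified."""
    N = sum(N_h)
    mu = total / N                        # initial guess of the mean per unit
    V_d = (cv * mu) ** 2                  # desired variance of the estimated mean
    sum_ws = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) / N          # sum of W_h S_h
    sum_ws2 = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)) / N    # sum of W_h S_h^2
    n0 = sum_ws ** 2 / V_d                # first approximation, fpc ignored
    n = n0 / (1 + sum_ws2 / (N * V_d))    # corrected for the fpc
    return n0, n


n0, n = neyman_sample_size_cv(
    N_h=[13, 18, 26, 42, 73, 24],
    S_h=[325, 190, 189, 82, 86, 190],
    total=56472,
    cv=0.05,
)
print(round(n0, 2), round(n, 1))
```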
Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of The Standard Error Under Stratified Random Sampling
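Parameter | Estimator | Variance of the estimator | Estimated standard error
Mean ($\mu$) | $\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N} = \sum_{h=1}^{L} W_h \bar{y}_h$ | $\sum_{h=1}^{L} W_h^2 \frac{S_h^2}{n_h}(1 - f_h) = \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_h^2 S_h^2}{N_h}$ | $\sqrt{\sum_{h=1}^{L} W_h^2 \frac{s_h^2}{n_h}(1 - f_h)} = \sqrt{\sum_{h=1}^{L}\frac{N_h^2}{N^2}\frac{s_h^2}{n_h}(1 - f_h)} = \sqrt{\sum_{h=1}^{L}\frac{W_h^2 s_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_h^2 s_h^2}{N_h}}$
Total ($\tau$) | $\hat{\tau}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h = N\bar{y}_{st}$ | $N^2\sum_{h=1}^{L} W_h^2 \frac{S_h^2}{n_h}(1 - f_h)$ | $\sqrt{\sum_{h=1}^{L} N_h^2 \frac{s_h^2}{n_h}(1 - f_h)}$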
Example
The total number of inhabitants in the 100 cities of country X is to be estimated from a sample of
32 cities. The cities are arranged into 2 strata: the first contains the 32 largest cities and the
second contains the remaining 68 cities. The number of inhabitants is presented in Dataset #3.
$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h = (0.32)(600.375) + (0.68)(201) = 328.8$$
so the estimated total is $\hat{\tau}_{st} = N\bar{y}_{st} = 100(328.8) = 32{,}880$. Its estimated standard error is
$$\sqrt{\sum_{h=1}^{L} N_h^2 \frac{s_h^2}{n_h}(1 - f_h)} = \sqrt{(32)^2\,\frac{30531.13}{8}\left(1 - \frac{8}{32}\right) + (68)^2\,\frac{6291.739}{24}\left(1 - \frac{24}{68}\right)} = 1927.526$$
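The estimate and its standard error in this example follow mechanically from the stratum summaries. Here is a minimal Python sketch using the stratum sizes, sample sizes, sample means and sample variances quoted in the computation above; the function name is illustrative.

```python
import math


def stratified_total(N_h, n_h, ybar_h, s2_h):
    """Estimated mean, estimated total and standard error of the total under stratified sampling."""
    N = sum(N_h)
    ybar_st = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N
    tau_hat = N * ybar_st
    # sum of N_h^2 * (s_h^2 / n_h) * (1 - f_h), with f_h = n_h / N_h
    var_tau = sum(
        Nh ** 2 * s2 / nh * (1 - nh / Nh)
        for Nh, nh, s2 in zip(N_h, n_h, s2_h)
    )
    return ybar_st, tau_hat, math.sqrt(var_tau)


ybar_st, tau_hat, se_tau = stratified_total(
    N_h=[32, 68],
    n_h=[8, 24],
    ybar_h=[600.375, 201],
    s2_h=[30531.13, 6291.739],
)
print(round(ybar_st, 1), round(tau_hat), round(se_tau, 3))
```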
Note:
The standard error will be small if we choose the stratification variable so that all the
strata have small $S_h^2$, that is, the elements within a stratum are homogeneous with respect to
the characteristic of interest. In fact, if it were possible to divide the population into
strata such that all items have the same value within a stratum, then $\mu$ could be estimated
without any error.
CLUSTER SAMPLING
Cluster sampling is a sampling procedure in which the sampling unit is a group
of elements called a cluster. In simple one-stage cluster sampling, the clusters are selected using
simple random sampling.
Step 1. Specify appropriate clusters. Similar to the strata in stratified sampling, these clusters
must be nonoverlapping and their union must be equal to the population. The main difference
between the optimal construction of strata in stratified sampling and the construction of clusters
in cluster sampling is that the strata must be as homogeneous as possible and one stratum should
differ as much as possible from another with respect to the characteristic being measured,
whereas clusters should be as heterogeneous as possible within and one cluster should look very
much like another.
Step 3. Select n clusters from the frame using simple random sampling without replacement.
Step 4. The sample will consist of all the elements included in the selected clusters.
Examples:
2. A forester wishes to estimate the average height of trees in a plantation. The plantation is
divided into quarter-acre plots. A simple random sample of 20 plots is selected from the
386 plots in the plantation. The forester then measures the height of all trees in the
sampled plots for his study.
3. An inspector wants to estimate the average weight of fill for cereal boxes packaged in a
certain factory. The cereal is available to him in cartons containing 12 boxes each. The
inspector randomly selects 5 cartons and measures the weight of fill for every box in the
sampled cartons.
In practice though, the clusters are usually formed so that the elements within are
contiguous to each other (geographic subdivision). This way, cluster sampling will
effectively reduce the cost of the survey, especially for those where the cost of obtaining
observations increases as the distance separating the elements increases. However, this is
at the expense of increasing the standard errors because elements that are close
together are usually homogeneous with respect to many characteristics.
Cluster sampling can be inefficient, especially if the clusters are large and homogeneous
with respect to the characteristics under study. It will then be more economical to select a
sample of elements from the clusters selected rather than take information from all the
elements. This procedure is what we refer to as multi-stage sampling.
Since simple one-stage cluster sampling uses SRS in the selection of the n clusters, the
formulas for computing the sample size will be the same as those for SRSWOR except that we
replace $S^2$ by $S_B^2$.
Using Dataset 5, suppose we wish to be virtually certain (that is, $z_{\alpha/2} = 3$) of estimating the total
number of persons over 65 years of age residing in the five housing developments to within 10%
of the true value. How many clusters should we include in our sample?
Since $r$ is specified, we use the same formula used to compute the sample size for the mean even
if the parameter of interest is $\tau$:
$$n_0 = \frac{S_B^2\, z_{\alpha/2}^2}{r^2 \mu^2} \qquad \text{and} \qquad n = \frac{n_0}{1 + \dfrac{n_0}{N}}$$
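Population Figures
$Y_{jk}$ = measure taken from the $k$th element in the $j$th cluster, $j = 1, 2, \ldots, N$; $k = 1, 2, \ldots, M$
$Y_j = \sum_{k=1}^{M} Y_{jk}$ = total of the $j$th cluster, and $\mu_j = Y_j / M$ = mean of the $j$th cluster
$\tau = \sum_{j=1}^{N}\sum_{k=1}^{M} Y_{jk} = \sum_{j=1}^{N} Y_j$ = population total
$\mu_c = \tau / N$ = mean per cluster, and $\mu = \tau / (NM)$ = mean per element
$\sigma^2$ = population variance $= \frac{\sum_{j=1}^{N}\sum_{k=1}^{M}(Y_{jk} - \mu)^2}{NM} = \sigma_w^2 + \sigma_B^2$
$\sigma_w^2$ = variance within clusters $= \sum_{j=1}^{N}\sum_{k=1}^{M}\frac{(Y_{jk} - \mu_j)^2}{NM}$
$\sigma_B^2$ = variance between/among clusters $= \sum_{j=1}^{N}\sum_{k=1}^{M}\frac{(\mu_j - \mu)^2}{NM} = \sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N}$
$S^2$ = population variance (corrected) $= \frac{\sum_{j=1}^{N}\sum_{k=1}^{M}(Y_{jk} - \mu)^2}{NM - 1} = \frac{M(N-1)S_B^2 + N(M-1)S_W^2}{NM - 1}$
where $S_B^2 = \sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N - 1}$ and $S_W^2 = \sum_{j=1}^{N}\sum_{k=1}^{M}\frac{(Y_{jk} - \mu_j)^2}{N(M - 1)}$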
Sample Counterparts
$s_i^2$ = variance of the $i$th selected cluster in the sample $= \sum_{k=1}^{M}\frac{(y_{ik} - \bar{y}_i)^2}{M - 1}$
Take note that $y_i$, $\bar{y}_i$ and $s_i^2$ (the total, mean and variance of the $i$th selected cluster) are not estimates, since complete information on the $i$th selected cluster
is available. The notations for estimates were used to emphasize that these are random variables
whose values depend on which clusters are selected in the sample.
Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of The Standard Error for Cluster Sampling With Equal Cluster Sizes
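Parameter | Estimator | Variance of the estimator | Estimated standard error
Mean per cluster ($\mu_c$) | $\bar{y} = \sum_{i=1}^{n}\sum_{k=1}^{M}\frac{y_{ik}}{n} = \sum_{i=1}^{n}\frac{y_i}{n}$ | $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j - \mu_c)^2}{N-1}$ | $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i - \bar{y})^2}{n-1}}$
Mean per element ($\mu$) | $\bar{\bar{y}} = \sum_{i=1}^{n}\sum_{k=1}^{M}\frac{y_{ik}}{nM} = \sum_{i=1}^{n}\frac{\bar{y}_i}{n} = \frac{\bar{y}}{M}$ | $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N-1}$ | $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(\bar{y}_i - \bar{\bar{y}})^2}{n-1}}$
Total ($\tau$) | $\hat{\tau} = N\bar{y} = NM\bar{\bar{y}}$ | $N^2M^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{(\mu_j - \mu)^2}{N-1}$ | $NM\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(\bar{y}_i - \bar{\bar{y}})^2}{n-1}}$
where $f = n/N$.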
Example
Dataset 5 contains data on households in the 5 housing developments in the population. The housing
developments serve as the clusters. Each cluster has 20 households. The households serve as the
elements. Suppose housing developments 2 and 5 were selected in the sample. Thus, the sample
consists of two clusters for a total of 40 households. Let us estimate the mean number of persons
over 65 years of age per household and its standard error.
The estimated mean number of persons over 65 years of age per housing development is:
$$\bar{y} = \sum_{i=1}^{2}\frac{y_i}{2} = \frac{Y_2 + Y_5}{2} = \frac{33 + 34}{2} = 33.5$$
The estimated mean number of persons over 65 years of age per household is:
$$\bar{\bar{y}} = \sum_{i=1}^{2}\frac{\bar{y}_i}{2} = \frac{\mu_2 + \mu_5}{2} = \frac{1.65 + 1.7}{2} = 1.675$$
Its estimated standard error is:
$$\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(\bar{y}_i - \bar{\bar{y}})^2}{n-1}} = \sqrt{\frac{1 - \frac{2}{5}}{2}\left((1.65 - 1.675)^2 + (1.7 - 1.675)^2\right)} = \sqrt{\frac{1 - \frac{2}{5}}{2}(0.00125)} = 0.019$$
The estimated standard error of the corresponding estimate of the total, $\hat{\tau} = NM\bar{\bar{y}}$, is (5)(20)(0.019) = 1.9.
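The computation for equal-sized clusters is short enough to verify by hand, but a small script makes the role of each term explicit. This is a minimal Python sketch using the two sampled cluster means (1.65 and 1.7) from the example, with N = 5 clusters of M = 20 households; the function name is illustrative.

```python
import math


def cluster_mean_per_element(cluster_means, N):
    """Mean per element and its standard error for one-stage cluster sampling, equal cluster sizes."""
    n = len(cluster_means)
    f = n / N                                      # sampling fraction of clusters
    estimate = sum(cluster_means) / n
    ss = sum((m - estimate) ** 2 for m in cluster_means)
    se = math.sqrt((1 - f) / n * ss / (n - 1))
    return estimate, se


N, M = 5, 20
mean_hat, se_mean = cluster_mean_per_element([1.65, 1.7], N)
se_total = N * M * se_mean        # standard error of the estimated total N*M*mean_hat
print(round(mean_hat, 3), round(se_mean, 4), round(se_total, 2))
```

With unrounded intermediate values the standard error of the total comes out slightly above the 1.9 obtained from the rounded 0.019.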
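Notations
Population Figures
$Y_{jk}$ = measure taken from the $k$th element in the $j$th cluster, $j = 1, 2, \ldots, N$; $k = 1, 2, \ldots, M_j$
$M_o = \sum_{j=1}^{N} M_j$ = total number of elements in the population
$S_j^2$ = variance of the $j$th cluster $= \sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu_j)^2}{M_j - 1}$
$\mu$ = population mean per element $= \frac{\sum_{j=1}^{N}\sum_{k=1}^{M_j} Y_{jk}}{\sum_{j=1}^{N} M_j} = \frac{\sum_{j=1}^{N} Y_j}{M_o} = \frac{\sum_{j=1}^{N} Y_j / N}{\mu_M}$
where $\mu_M$ = mean number of elements per cluster $= \frac{M_o}{N}$
$\tau$ = population total $= \sum_{j=1}^{N}\sum_{k=1}^{M_j} Y_{jk} = \sum_{j=1}^{N} Y_j = M_o\,\mu = N\mu_c$
$S^2$ = population variance (corrected) $= \sum_{j=1}^{N}\sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu)^2}{M_o - 1}$
Sample Counterparts
$y_{ik}$ = measure taken from the $k$th element of the $i$th selected cluster in the sample,
$i = 1, 2, \ldots, n$; $k = 1, 2, \ldots, m_i$
$m_i$ = number of elements in the $i$th selected cluster, and $y_i = \sum_{k=1}^{m_i} y_{ik}$ = total of the $i$th selected cluster
$s_i^2$ = variance of the $i$th selected cluster in the sample $= \sum_{k=1}^{m_i}\frac{(y_{ik} - \bar{y}_i)^2}{m_i - 1}$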
Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of The Standard Error for Cluster Sampling With Unequal Cluster Sizes
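Parameter | Estimator | Variance of the estimator | Estimated standard error
Mean per cluster ($\mu_c$) | $\bar{y} = \sum_{i=1}^{n}\sum_{k=1}^{m_i}\frac{y_{ik}}{n} = \sum_{i=1}^{n}\frac{y_i}{n}$ | $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j - \mu_c)^2}{N-1}$ | $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i - \bar{y})^2}{n-1}}$
Mean per element ($\mu$), ratio estimator | $\bar{y}_R = \frac{\sum_{i=1}^{n}\sum_{k=1}^{m_i} y_{ik}/n}{\sum_{i=1}^{n} m_i/n} = \frac{\sum_{i=1}^{n} y_i}{m_o} = \frac{\bar{y}}{\bar{m}}$ | (approximate) $\frac{1-f}{n}\sum_{j=1}^{N}\frac{M_j^2(\mu_j - \mu)^2}{\mu_M^2(N-1)}$ | $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{y}_R)^2}{\bar{m}^2(n-1)}}$
Mean per element ($\mu$), unbiased estimator | $\bar{y}_U = \sum_{i=1}^{n}\sum_{k=1}^{m_i}\frac{y_{ik}/n}{\mu_M} = \sum_{i=1}^{n}\frac{y_i/\mu_M}{n} = \frac{\bar{y}}{\mu_M}$ | $\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j/\mu_M - \mu)^2}{N-1}$ | $\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i/\mu_M - \bar{y}_U)^2}{n-1}}$
Total ($\tau$), ratio estimator | $\hat{\tau}_R = M_o\,\bar{y}_R$ | (approximate) $M_o^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{M_j^2(\mu_j - \mu)^2}{\mu_M^2(N-1)}$ | $M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{y}_R)^2}{\bar{m}^2(n-1)}}$
Total ($\tau$), unbiased estimator | $\hat{\tau}_U = M_o\,\bar{y}_U = N\bar{y} = N\sum_{i=1}^{n}\frac{y_i}{n}$ | $M_o^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j/\mu_M - \mu)^2}{N-1} = N^2\,\frac{1-f}{n}\sum_{j=1}^{N}\frac{(Y_j - \mu_c)^2}{N-1}$ | $M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i/\mu_M - \bar{y}_U)^2}{n-1}} = N\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i - \bar{y})^2}{n-1}}$
where $f = n/N$, $m_o = \sum_{i=1}^{n} m_i$ and $\bar{m} = m_o/n$.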
Example
(Capistrano) We modify Dataset 5 as presented in Dataset 6 in such a way that clusters are now
of unequal sizes. Clusters 1 and 5 contain 10 households; clusters 2 and 4 contain 15 households;
while cluster 3 contains 20 households. Suppose clusters 1, 2 and 3 were selected in the sample.
Let us estimate the mean and the total number of persons over 65 years old per household and
their standard errors using ratio estimation.
The estimated mean number of persons over 65 years old per housing development is:
$$\bar{y} = \frac{y_1 + y_2 + y_3}{3} = \frac{16 + 23 + 39}{3} = 26$$
The mean size of the sampled clusters is:
$$\bar{m} = \frac{\sum_{i=1}^{n} m_i}{n} = \frac{10 + 15 + 20}{3} = 15$$
The estimated mean number of persons over 65 years old per household using ratio estimation
is
$$\bar{y}_R = \frac{\bar{y}}{\bar{m}} = \frac{26}{15} = 1.733$$
with estimated standard error
$$\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{y}_R)^2}{\bar{m}^2(n-1)}} = \sqrt{\frac{1 - 3/5}{3}\cdot\frac{10^2(1.6 - 1.733)^2 + 15^2(1.533 - 1.733)^2 + 20^2(1.95 - 1.733)^2}{15^2(3-1)}} = 0.09358$$
With $M_o = 10 + 15 + 20 + 15 + 10 = 70$, the estimated total is $\hat{\tau}_R = M_o\,\bar{y}_R = (70)(1.733) = 121.33$, with estimated standard error
$$M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{m_i^2(\bar{y}_i - \bar{y}_R)^2}{\bar{m}^2(n-1)}} = (70)(0.09358) = 6.551$$
Example
Let us use Dataset 6 once more but this time let's estimate the mean and total using the unbiased
estimator. The estimation of the mean requires information on all the cluster sizes for the
computation of $\mu_M$.
For this data, $\mu_M = \frac{10 + 15 + 20 + 15 + 10}{5} = 14$.
The estimated mean number of persons over 65 years old per household using the unbiased estimator is
$$\bar{y}_U = \sum_{i=1}^{n}\frac{y_i/\mu_M}{n} = \frac{5.571}{3} = 1.857$$
with estimated standard error
$$\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i/\mu_M - \bar{y}_U)^2}{n-1}} = \sqrt{\frac{1 - 3/5}{3}\cdot\frac{(1.1429 - 1.857)^2 + (1.6429 - 1.857)^2 + (2.7857 - 1.857)^2}{3-1}} = \sqrt{\frac{1 - \frac{3}{5}}{3}(0.70926)} = 0.3075$$
The estimated total is $\hat{\tau}_U = M_o\,\bar{y}_U = (70)(1.857) = 130$, and its estimated standard error is
$$M_o\sqrt{\frac{1-f}{n}\sum_{i=1}^{n}\frac{(y_i/\mu_M - \bar{y}_U)^2}{n-1}} = (70)(0.3075) = 21.525$$
The standard errors of the estimates using the unbiased estimator are higher than the standard
errors using ratio estimation.
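The two estimators for unequal cluster sizes can be compared side by side in code. The sketch below is a hedged Python illustration using the sampled cluster totals (16, 23, 39) and sizes (10, 15, 20) from the examples, with N = 5 clusters and 70 households in all; the function names are illustrative.

```python
import math


def ratio_estimate(totals, sizes, N):
    """Ratio estimator of the mean per element and its estimated standard error."""
    n = len(totals)
    f = n / N
    m_bar = sum(sizes) / n                      # mean size of the sampled clusters
    y_R = sum(totals) / sum(sizes)
    ss = sum(m ** 2 * (y / m - y_R) ** 2 for y, m in zip(totals, sizes))
    se = math.sqrt((1 - f) / n * ss / (m_bar ** 2 * (n - 1)))
    return y_R, se


def unbiased_estimate(totals, N, mu_M):
    """Unbiased estimator of the mean per element; needs the population mean cluster size mu_M."""
    n = len(totals)
    f = n / N
    scaled = [y / mu_M for y in totals]         # cluster totals divided by mu_M
    y_U = sum(scaled) / n
    ss = sum((s - y_U) ** 2 for s in scaled)
    se = math.sqrt((1 - f) / n * ss / (n - 1))
    return y_U, se


totals, sizes = [16, 23, 39], [10, 15, 20]
N, M_o = 5, 70
y_R, se_R = ratio_estimate(totals, sizes, N)
y_U, se_U = unbiased_estimate(totals, N, mu_M=M_o / N)
print(round(y_R, 3), round(se_R, 5), round(M_o * se_R, 3))
print(round(y_U, 3), round(se_U, 4), round(M_o * se_U, 3))
```

The ratio estimator uses only the sizes of the sampled clusters, while the unbiased estimator needs the mean cluster size of the whole population.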
SYSTEMATIC SAMPLING
Step 1. Determine the interval $k$. In general, for a systematic sample of $n$ elements from a
population of size $N$, $k$ is computed using the formula $k = ⟦N/n⟧$, the greatest integer not exceeding $N/n$.
Step 2. Identify each unit in the frame with consecutive integers, beginning with 1. This,
however, can be done simultaneously with the selection of the elements to be included in the
sample.
Step 3. Choose a number, $r$, at random from the integers 1 to $k$. Then the elements of the sample are
those units labeled $r$, $r+k$, $r+2k$, $r+3k$ and so on until you reach the end of the frame.
Example
With $N = 13$ and $n = 3$, $k = ⟦13/3⟧ = 4$ and the possible samples are:
Random start ($r$) | Units in the sample
1 | 1, 5, 9, 13
2 | 2, 6, 10
3 | 3, 7, 11
4 | 4, 8, 12
Examples:
Market researchers and opinion pollsters who sample people on the move very often use
systematic sampling. For example, every $k$th customer at a checkout counter may be asked
his or her opinion on the qualities of a certain product. Or, every $k$th person boarding a bus
may be asked to fill out a questionnaire on bus service.
If systematic sampling is used to select a sample of $n$ units, then the population total and the
population mean are estimated in the same manner used under simple random sampling:
sample mean, $\bar{y} = \frac{\sum_{j=1}^{n} y_j}{n}$
estimated total, $\hat{\tau} = N\bar{y}$
A slight modification of systematic sampling is called Lahiri's method. Under this method the sample mean
is an unbiased estimator even if $N \neq nk$ (unlike under the standard procedure).
Lahiri's method
Step 1. Compute $k = ⟦N/n⟧$.
Step 2. Identify each unit in the frame with consecutive numbers from 1 to $N$, and treat the frame as circular. Thus,
element #1 is also labeled as $N+1$, #2 as $N+2$, and so on.
Step 3. Choose a number, $r$, at random from 1 to $N$. Then the elements of the sample are those
units labeled as $r, r+k, r+2k, \ldots, r+(n-1)k$.
Example
Consider a population of $N = 5$ units with values 1, 2, 3, 4 and 10, and let $n = 2$. Thus $k = ⟦5/2⟧ = 2$. Under the standard procedure, there are only $k = 2$ possible samples.
$r$ | 1 | 2
Sample | (1, 3, 10) | (2, 4)
$\bar{y}$ | $4\frac{2}{3}$ | 3
In Lahiri's method, there are $N = 5$ possible samples because we choose the random start, $r$,
from 1 to $N$.
$r$ | 1 | 2 | 3 | 4 | 5
Sample | (1, 3) | (2, 4) | (3, 10) | (4, 1) | (10, 2)
$\bar{y}$ | 2 | 3 | 6.5 | 2.5 | 6
Each of the 5 possible samples of size 2 is given the same chance of selection, so the expected value of the sample mean is $(2 + 3 + 6.5 + 2.5 + 6)/5 = 4$, which equals the population mean. Under the standard procedure the expected value is $(4\frac{2}{3} + 3)/2 \approx 3.83$, so the sample mean is biased.
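The difference between the standard procedure and Lahiri's method can be seen by enumerating every possible sample. Below is a minimal Python sketch, assuming a population whose five values are 1, 2, 3, 4 and 10 (the values that appear in the listed samples); the function names are illustrative, and indices are 0-based inside the code.

```python
from fractions import Fraction


def linear_systematic_samples(values, n):
    """All samples under the standard procedure: start r = 1..k, then every k-th unit."""
    N = len(values)
    k = N // n
    return [[values[i] for i in range(r, N, k)] for r in range(k)]


def circular_systematic_samples(values, n):
    """All samples under Lahiri's method: start r = 1..N, wrap around the frame."""
    N = len(values)
    k = N // n
    return [[values[(r + j * k) % N] for j in range(n)] for r in range(N)]


def mean(xs):
    """Exact mean as a fraction, to avoid rounding when checking unbiasedness."""
    return Fraction(sum(xs), len(xs))


population = [1, 2, 3, 4, 10]
linear = linear_systematic_samples(population, 2)
circular = circular_systematic_samples(population, 2)
expected_linear = mean([mean(s) for s in linear])
expected_circular = mean([mean(s) for s in circular])
print(linear, expected_linear)
print(circular, expected_circular, mean(population))
```

Because every start is equally likely, averaging the sample means over all starts gives the expected value of the estimator under each procedure.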
The sample can be selected even if a list of the sampling units is not available.
When units within the same sample are heterogeneous, the estimates under systematic
sampling will be more precise than the estimates under simple random sampling. This
will happen if the systematic sample is selected from a large population where the
elements are ordered according to the magnitude of the characteristics of interest or a
variable related to it.
This procedure sometimes provides more information per unit cost than simple random
sampling since the sample is generally spread more uniformly over the entire population.
The precision of the estimates depends on the order of the sampling units in the frame. It
is important to know the type of population under investigation since the estimates under
systematic sampling may be unreliable, particularly when there is unsuspected periodicity
in the arrangement of the units in the frame.
Under certain conditions, an increase in sample size will not even guarantee an increase
in precision.
Estimating the standard error of the estimate is more complicated. We can use model-
based inference. We can also use repeated systematic sampling.
MULTI-STAGE SAMPLING
Multi-stage sampling is a method of selecting a sample that makes use of the hierarchical structure
of units, with the sampling of these units done in stages. The population is first divided into
primary sampling units (PSUs) and a sample of PSUs is selected. The selected PSUs are
subdivided into secondary sampling units (SSUs) and a sample of SSUs is selected from each of
the selected PSUs. This process is continued until the last-stage elements are selected within the
final-stage clusters.
Example:
In the estimation of the amount of impurities in a bulk product like sugar, the sampling
procedure may select bags of sugar from warehouse and then select small test samples
from each bag. The test samples are then analyzed for amount of impurities.
In order to estimate the total number of permanent residents of Quezon City who have
hypertension, the sampling procedure may first require the selection of barangays. Then
from each barangay a sample of households will be selected. Then from each household,
a person will be selected.
Advantages
1. Since this is just an extension of the concept of cluster sampling, the advantages of multi-
stage sampling are the same as those of cluster sampling. For example, a frame that lists
all the elements in the population is not needed, and there is a reduction in the cost of
obtaining the data because of the reduction in the travel costs.
2. In addition, it is not necessary to sample all of the elements in each sampled cluster. Thus
cost of sampling can often be reduced with little loss of information.
Disadvantages
1. Choosing the sample size is more difficult. The number of sampling units at each stage
must be determined. Furthermore, the choice of the sample size will now depend on 2
sources of variation: variation between the clusters and variation among elements within
the clusters.
2. The basic principle in the estimation of parameters is to build up estimates from the
bottom (last stage) to the top (first stage). Thus, estimation procedures are more difficult.
The more stages there are, the more complicated the analysis. Also, using PSUs that are
not of the same size will be more complicated than using PSUs of the same size.
If multi-stage sampling has two stages of sampling and the selection of sampling units in each
stage is done using SRSWOR then this is called simple 2-stage sampling. The PSUs are clusters
while the SSUs are the elements themselves.
Step 1. Specify the appropriate clusters. The two major considerations in choosing are: (a)
geographic proximity of the elements within a cluster; and (b) cluster sizes that are convenient
to administer. The selection of the clusters will also depend on how many PSUs and SSUs we
intend to include in the sample. Do we intend to sample a few PSUs and many SSUs from each,
or do we intend to sample many PSUs and a few SSUs from each?
Step 4. Obtain a frame listing all the SSUs (in this case, elements) for each of the selected PSUs
only.
Step 5. Select a SRS of elements from each of these frames.
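NOTATION
$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$
$Y_{ij}$ = measure taken from the $j$th SSU of the $i$th PSU of the population; $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, M_i$
$y_{ij}$ = measure taken from the $j$th SSU of the $i$th sampled PSU; $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m_i$
$y^*_{ij}$ = measure taken from the $j$th sampled SSU of the $i$th sampled PSU; $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m^*_i$
Quantity | PSU in the population | Sampled PSU (all SSUs) | Sampled PSU (sampled SSUs only)
Mean | $\mu_i = \frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij}$ | $\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$ | $\bar{y}^*_i = \frac{1}{m^*_i}\sum_{j=1}^{m^*_i} y^*_{ij}$
Total | $Y_i = \sum_{j=1}^{M_i} Y_{ij} = M_i\mu_i$ | $y_i = \sum_{j=1}^{m_i} y_{ij} = m_i\bar{y}_i$ | $y^*_i = \sum_{j=1}^{m^*_i} y^*_{ij} = m^*_i\bar{y}^*_i$
Variance | $S_{wi}^2 = \sum_{j=1}^{M_i}\frac{(Y_{ij} - \mu_i)^2}{M_i - 1}$ | $s_{wi}^2 = \sum_{j=1}^{m_i}\frac{(y_{ij} - \bar{y}_i)^2}{m_i - 1}$ | $s_{wi}^{*2} = \sum_{j=1}^{m^*_i}\frac{(y^*_{ij} - \bar{y}^*_i)^2}{m^*_i - 1}$
$\mu$ = population mean (per SSU) $= \frac{\sum_{i=1}^{N}\sum_{j=1}^{M_i} Y_{ij}}{\sum_{i=1}^{N} M_i} = \frac{\sum_{i=1}^{N} M_i\mu_i}{N\mu_M} = \frac{\sum_{i=1}^{N} C_i\mu_i}{N}$, where $C_i = \frac{M_i}{\mu_M}$
$\tau$ = population total $= \sum_{i=1}^{N}\sum_{j=1}^{M_i} Y_{ij} = \sum_{i=1}^{N} M_i\mu_i = M_o\,\mu$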
The sample size depends on the type of cost function. A simple form of cost function is:
$$C = c_0 n + c_2\sum_{i=1}^{n} m^*_i + c_L\sum_{i=1}^{n} m_i$$
However, it is difficult to use this cost function because we cannot determine in advance which
PSUs will be selected in the first stage of sampling. In other words, the $m^*_i$'s and the $m_i$'s are
random variables. We therefore work with the expected cost:
$$E(C) = c_0 n + c_2 E\left(\sum_{i=1}^{n} m^*_i\right) + c_L E\left(\sum_{i=1}^{n} m_i\right) = c_1 n + c_2 n\mu_m$$
where
$\mu_M = \frac{M_0}{N}$ = mean number of SSUs in the population per PSU
$\mu_m = \sum_{i=1}^{N}\frac{m^*_i}{N}$ = mean number of sampled SSUs per PSU
$c_1 = c_0 + c_L\,\mu_M$
The problem is to determine $n$ and $\mu_m$ that will minimize the MSE for a given cost or vice versa.
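Estimator | Optimal $\mu_m$ | Number of PSUs $n$
Ratio estimator ($\bar{y}_R$) | $\mu_m = \sqrt{c_1/c_2}\,\sqrt{\frac{S_2^2}{S_B^{+2} - \frac{1}{\mu_M}S_2^2}}$ | for specified $V_d = \frac{d^2}{z_{\alpha/2}^2}$: $n = \frac{S_B^{+2} - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^{+2}}$; for specified cost: $n = \frac{C}{c_1 + c_2\mu_m}$
Unbiased estimator ($\bar{y}_U$) | $\mu_m = \sqrt{c_1/c_2}\,\sqrt{\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}}$ | for specified $V_d = \frac{d^2}{z_{\alpha/2}^2}$: $n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2}$; for specified cost: $n = \frac{C}{c_1 + c_2\mu_m}$
where:
$S_B^{+2} = \sum_{i=1}^{N}\frac{C_i^2(\mu_i - \mu)^2}{N-1}$, $\quad S_B^2 = \sum_{i=1}^{N}\frac{(C_i\mu_i - \mu)^2}{N-1}$, $\quad S_2^2 = \sum_{i=1}^{N}\frac{M_i}{M_0}S_{wi}^2$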
Example
(Lemeshow) The number of visitors to a state park, based on a pilot study, is given in the table
below.
Week | Number of visitors on each of the 7 days of the week
7 | 380 378 325 330 306 322 331
8 | 495 400 315 302 350 388 395
9 | 206 200 108 95 107 185 190
10 | 308 300 293 206 200 298 300
Suppose the park management wishes to do a survey in order to estimate the mean number of
visitors in a day. They intend to use simple 2-stage sampling where the weeks serve as the PSUs
and the days serve as the SSUs. How many weeks and days within a week will be included in the
sample if management wishes to be 95% confident that the margin of error of the estimate is 50
using the unbiased estimator? (Of course, $c_L = 0$ since we already know that there are always
7 days in a week. Let us further assume that $c_0 = 2c_2$.)
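We will compute the number of days per week using
$$\mu_m = \sqrt{\frac{c_1}{c_2}}\,\sqrt{\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}}$$
and the number of weeks using
$$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2}$$
We'll use the results of the pilot study to come up with the preliminary values for
$$S_B^2 = \sum_{j=1}^{N}\frac{\left(Y_j - \frac{\sum_{j=1}^{N} Y_j}{N}\right)^2}{\mu_M^2\,(N-1)}, \qquad S_2^2 = \sum_{j=1}^{N}\frac{M_j}{M_o}S_{wj}^2, \qquad S_{wj}^2 = \sum_{k=1}^{M_j}\frac{(Y_{jk} - \mu_j)^2}{M_j - 1} \ \ \forall j$$
For each week $j$ of the pilot study we tabulate $\mu_j$, $Y_j$, $\left(Y_j - \sum_{j=1}^{10} Y_j/10\right)^2$ and $S_{wj}^2$, which give $S_B^2 = 8537.257$ and $S_2^2 = 1710.014$. Since $c_L = 0$, $c_1 = c_0 = 2c_2$, so
$$\mu_m = \sqrt{\frac{c_1}{c_2}}\,\sqrt{\frac{S_2^2}{S_B^2 - \frac{1}{\mu_M}S_2^2}} = \sqrt{2}\,\sqrt{\frac{1710.014}{8537.257 - 1710.014/7}} = 0.64$$
$$n = \frac{S_B^2 - S_2^2\left(\frac{1}{\mu_M} - \frac{1}{\mu_m}\right)}{V_d + \frac{1}{N}S_B^2} = \frac{8537.257 - 1710.014\left(\frac{1}{7} - \frac{1}{0.64}\right)}{\frac{50^2}{1.96^2} + \frac{8537.257}{10}} = 7.28 \approx 8$$
(Taking the population to be the $N = 52$ weeks of a year instead of $N = 10$, the same formula gives $n = 13.45 \approx 14$.)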
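The two formulas used in this example are easy to mistype, so a short numerical check is useful. This is a minimal Python sketch for the unbiased estimator, assuming the pilot values $S_B^2 = 8537.257$ and $S_2^2 = 1710.014$, a cost ratio $c_1/c_2 = 2$, and 7 days per week; the function names are illustrative.

```python
import math


def optimal_ssus_per_psu(cost_ratio, S_B2, S_22, mu_M):
    """Optimal mean number of sampled SSUs per PSU; cost_ratio is c1/c2."""
    return math.sqrt(cost_ratio) * math.sqrt(S_22 / (S_B2 - S_22 / mu_M))


def number_of_psus(S_B2, S_22, mu_M, mu_m, d, z, N):
    """Number of PSUs needed for margin of error d at the confidence level implied by z."""
    V_d = (d / z) ** 2
    return (S_B2 - S_22 * (1 / mu_M - 1 / mu_m)) / (V_d + S_B2 / N)


S_B2, S_22, mu_M = 8537.257, 1710.014, 7
mu_m = optimal_ssus_per_psu(2, S_B2, S_22, mu_M)
# mu_m is rounded to two decimals before use, mirroring the hand computation
n = number_of_psus(S_B2, S_22, mu_M, round(mu_m, 2), d=50, z=1.96, N=10)
print(round(mu_m, 2), round(n, 2), math.ceil(n))
```

Changing the argument N shows how strongly the required number of weeks depends on the assumed number of weeks in the population.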
Estimators of Population Mean and Total, Their Corresponding Variances, and Estimators
of the Standard Error for Simple 2-Stage Sampling
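Mean ($\mu$), Unbiased Estimator:
Estimator: $\bar{y}_U = \frac{1}{n\mu_M}\sum_{i=1}^{n}\sum_{j=1}^{m^*_i}\frac{m_i}{m^*_i}\,y^*_{ij} = \sum_{i=1}^{n}\frac{c_i\,\bar{y}^*_i}{n}$, where $c_i = \frac{m_i}{\mu_M}$
Variance: $\left(\frac{1}{n} - \frac{1}{N}\right)S_B^2 + \sum_{i=1}^{N}\frac{M_i^2}{nN\mu_M^2}\left(\frac{1}{m^*_i} - \frac{1}{M_i}\right)S_{wi}^2$, where $S_B^2 = \sum_{i=1}^{N}\frac{(C_i\mu_i - \mu)^2}{N-1}$
Estimated standard error: $\sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{*2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m^*_i} - \frac{1}{m_i}\right)s_{wi}^{*2}}$, where $s_B^{*2} = \sum_{i=1}^{n}\frac{(c_i\,\bar{y}^*_i - \bar{y}_U)^2}{n-1}$
Mean ($\mu$), Ratio Estimator:
Estimator: $\bar{y}_R = \frac{\sum_{i=1}^{n} m_i\,\bar{y}^*_i}{\sum_{i=1}^{n} m_i}$
Variance (approximate): $\left(\frac{1}{n} - \frac{1}{N}\right)S_B^{+2} + \sum_{i=1}^{N}\frac{M_i^2}{nN\mu_M^2}\left(\frac{1}{m^*_i} - \frac{1}{M_i}\right)S_{wi}^2$, where $S_B^{+2} = \sum_{i=1}^{N}\frac{C_i^2(\mu_i - \mu)^2}{N-1}$
Estimated standard error (approximate): $\sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{*+2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m^*_i} - \frac{1}{m_i}\right)s_{wi}^{*2}}$, where $s_B^{*+2} = \sum_{i=1}^{n}\frac{c_i^2(\bar{y}^*_i - \bar{y}_R)^2}{n-1}$
Total ($\tau$), Unbiased Estimator:
Estimator: $\hat{\tau}_U = \frac{N}{n}\sum_{i=1}^{n} m_i\,\bar{y}^*_i$
Variance: $M_o^2\,\mathrm{Var}(\bar{y}_U)$
Estimated standard error: $M_o\,\widehat{s.e.}(\bar{y}_U)$
Total ($\tau$), Ratio Estimator:
Estimator: $\hat{\tau}_R = M_o\,\bar{y}_R$
Variance (approximate): $M_o^2\,\mathrm{Var}(\bar{y}_R)$
Estimated standard error (approximate): $M_o\,\widehat{s.e.}(\bar{y}_R)$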
Example (Cochran)
From the volume "American Men of Science", a 2000-page listing consisting of 36,000 names
of scientists with general information, 20 pages were selected at random. The total number of
names per page varies, in general, from about 14 to 21. On each page, two scientists were
selected and their ages were recorded. Dataset 9 contains the information gathered. Let us
estimate the mean age of scientists in the book using the unbiased estimator and the ratio
estimator. (Refer to Sampled PSU Statistics.csv)
Unbiased Estimation
$N = 2{,}000$, $n = 20$, $M_o = 36{,}000$, $\mu_M = \frac{M_o}{N} = \frac{36{,}000}{2{,}000} = 18$
$$\bar{y}_U = \sum_{i=1}^{n}\frac{c_i\,\bar{y}^*_i}{n} = \frac{951.2122}{20} = 47.5597$$
$$s_B^{*2} = \sum_{i=1}^{n}\frac{(c_i\,\bar{y}^*_i - \bar{y}_U)^2}{n-1} = \frac{2215.3656}{19} = 116.5907$$
$$\widehat{s.e.}(\bar{y}_U) = \sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{*2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m^*_i} - \frac{1}{m_i}\right)s_{wi}^{*2}} = \sqrt{\left(\frac{1}{20} - \frac{1}{2000}\right)116.5907 + 0.0273} = 2.408$$
Ratio Estimation:
$$\bar{y}_R = \frac{\sum_{i=1}^{n} m_i\,\bar{y}^*_i}{\sum_{i=1}^{n} m_i} = \frac{17121.5}{359} = 47.6922$$
$$s_B^{*+2} = \sum_{i=1}^{n}\frac{c_i^2(\bar{y}^*_i - \bar{y}_R)^2}{n-1} = \frac{1763.188}{19} = 92.79939$$
$$\widehat{s.e.}(\bar{y}_R) = \sqrt{\left(\frac{1}{n} - \frac{1}{N}\right)s_B^{*+2} + \sum_{i=1}^{n}\frac{c_i^2}{nN}\left(\frac{1}{m^*_i} - \frac{1}{m_i}\right)s_{wi}^{*2}} = \sqrt{\left(\frac{1}{20} - \frac{1}{2000}\right)92.79939 + 0.0273} = 2.149$$
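The final step of each standard-error computation combines a between-PSU term and a within-PSU term. As a rough check, the Python sketch below plugs in the summary figures quoted in the example (between-PSU variances 116.5907 and 92.79939, and a within-PSU term of 0.0273); it does not recompute these from Dataset 9, and the function name is illustrative.

```python
import math


def two_stage_se(s_B2, within_term, n, N):
    """Standard error from the between-PSU variance and the already summed within-PSU term."""
    return math.sqrt((1 / n - 1 / N) * s_B2 + within_term)


se_unbiased = two_stage_se(116.5907, 0.0273, n=20, N=2000)
se_ratio = two_stage_se(92.79939, 0.0273, n=20, N=2000)
print(round(se_unbiased, 3), round(se_ratio, 4))
```

The between-PSU term dominates both results, which is why the ratio estimator, with its smaller between-PSU variance, has the smaller standard error here.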
References: