Sampling & Survey #7 – Stratified Sampling

SRS forms the basis of sampling and survey methods because it is easy to design and analyse, but it is rarely the best design. We may adopt systematic sampling or cluster sampling, but we are often limited by the availability of a sampling frame. Thus, we look at stratified sampling as a way to increase precision.

In stratified sampling, we divide the population into H subpopulations, called strata, which do not overlap (mutually exclusive) and together constitute the whole population (exhaustive), so that each sampling unit belongs to exactly one stratum. We then draw an independent random sample from each stratum.

Suppose we have 1000 males and 1000 females to select from. An SRS of size 100 might, by chance, contain no or very few males or females. If men and women respond differently on the item of interest, such a sample would not be representative. A stratified sample, on the other hand, would take an SRS of 50 males and an independent SRS of 50 females, ensuring that the proportions in the sample are the same as those in the population.
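The male/female example can be sketched in Python as follows. This is only an illustration: the function name `stratified_sample`, the ID labels and the fixed seed are assumptions made for the sketch, not part of the original notes.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an independent SRS (without replacement) from each stratum.

    strata: dict mapping stratum label -> list of sampling units
    sizes:  dict mapping stratum label -> stratum sample size n_h
    """
    rng = random.Random(seed)
    # random.Random.sample draws without replacement, i.e. an SRS within the stratum
    return {h: rng.sample(units, sizes[h]) for h, units in strata.items()}

# Hypothetical frame: 1000 males and 1000 females, identified by made-up IDs.
frame = {
    "male": [f"M{i}" for i in range(1000)],
    "female": [f"F{i}" for i in range(1000)],
}
sample = stratified_sample(frame, {"male": 50, "female": 50}, seed=1)
print({h: len(units) for h, units in sample.items()})  # {'male': 50, 'female': 50}
```

Unlike an SRS of 100 from the pooled frame, the split between the two groups is fixed by design rather than left to chance.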

We will usually have better precision (lower variance) for the estimates of population means and totals using stratified sampling than using SRS. This is because the variance within each stratum is often lower than the variance in the whole population.
Stratified sampling can also be more convenient to administer and may result in a lower cost. Sampling frames may be constructed differently in different strata, and different sampling designs or field procedures may be used. For example, we might use an internet survey for large firms and a telephone survey for small firms.

Let us look at the notation we will be using. Firstly, we divide the population of N sampling units into H mutually exclusive and exhaustive strata.
h = 1, 2, …, H indexes the strata.
$N_h$ = number of sampling units in stratum h. Total number of units in the entire population $N = N_1 + N_2 + \ldots + N_H$
$n_h$ = number of sample units drawn from stratum h. Total sample size $n = n_1 + n_2 + \ldots + n_H$
$D_h$ = the set of $n_h$ sample units drawn from stratum h.
$y_{hj}$ = value of $j^{th}$ unit in stratum h, where h = 1, 2, …, H and j = 1, 2, …, $N_h$ .
Stratum population total, $t_h = \sum_{j=1}^{N_h} y_{hj}, h=1, 2, \ldots , H$
Population total, $t = \sum_{h=1}^H t_h$
Stratum population mean, $\bar{y}_{hu} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}$
Population mean, $\bar{y_u} = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}$
Stratum population variance, $S_h^2 = \frac{1}{N_h - 1} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hu})^2$

Since we assume SRS within each stratum, then when we have a population of $N_h$ units and take an SRS of $n_h$ units, we can estimate $\bar{y}_{hu}$ by $\bar{y_h}$ and $t_h$ by $\hat{t_h} = N_h \bar{y}_h$
Estimate of stratum population mean $\bar{y}_{hu}$ , $\bar{y_h} = \frac{1}{n_h} \sum_{j \in D_h} y_{hj}$
Estimate of stratum population total $t_h$ , $\hat{t_h} = N_h \bar{y_h} = \frac{N_h}{n_h} \sum_{j \in D_h} y_{hj}$
Estimate of stratum population variance ${S_h}^2$ , ${s_h}^2 = \frac{1}{n_h - 1} \sum_{j \in D_h} (y_{hj} - \bar{y}_h)^2$

Notice that $\frac{N_h}{N}$ appears in the estimator of the population mean: it is the weight of $\bar{y_h}$, namely the proportion of the population units in stratum h (the relative size of the stratum). When we use stratified sampling, the relative sizes of the strata must be known.
Estimate of population total $t = \sum_{h=1}^H t_h$ , $\hat{t}_{STR} = \sum_{h=1}^H \hat{t_h} = \sum_{h=1}^H N_h \bar{y_h}$
Estimate of population mean $\bar{y_u}$ , $\bar{y}_{STR} = \frac{\hat{t}_{STR}}{N} = \sum_{h=1}^H \frac{N_h}{N} \bar{y_h}$

Since we do SRS in each stratum, $\bar{y}_{STR}$ and $\hat{t}_{STR}$ are unbiased estimators of $\bar{y_u}$ and t respectively. Similarly, because we sample independently in different strata, the variance of each estimator is the sum of the SRS variances within the strata.
$Var(\hat{t}_{STR}) = \sum_{h=1}^H Var(\hat{t_h}) = \sum_{h=1}^H (1 - \frac{n_h}{N_h}) N_h^2 \frac{S_h^2}{n_h}$
$Var(\bar{y}_{STR}) = \frac{1}{N^2} Var(\hat{t}_{STR}) = \sum_{h=1}^H (1 - \frac{n_h}{N_h})(\frac{N_h}{N})^2 \frac{S_h^2}{n_h}$
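As always, the standard error of an estimator is the square root of the estimated variance, where the estimated variance $\hat{V}$ is obtained by replacing $S_h^2$ with $s_h^2$ in the variance formula: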
$SE(\hat{t}_{STR}) = \sqrt{\hat{V}(\hat{t}_{STR})}$
$SE(\bar{y}_{STR}) = \sqrt{\hat{V}(\bar{y}_{STR})}$
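As a minimal sketch of these estimators, the function below computes $\hat{t}_{STR}$, $\bar{y}_{STR}$ and their standard errors from per-stratum samples, using $s_h^2$ in place of $S_h^2$. The function name, the dictionary layout and the numbers in the example are assumptions made for illustration only.

```python
import math
from statistics import mean, variance

def stratified_estimates(N_h, samples):
    """Stratified estimates of the population total and mean.

    N_h:     dict stratum -> stratum population size
    samples: dict stratum -> list of sampled y values (at least 2 per stratum)
    """
    N = sum(N_h.values())
    t_hat = 0.0
    var_t_hat = 0.0
    for h, y in samples.items():
        n_h = len(y)
        y_bar_h = mean(y)
        s2_h = variance(y)  # sample variance, divisor n_h - 1
        t_hat += N_h[h] * y_bar_h
        # (1 - n_h/N_h) * N_h^2 * s_h^2 / n_h, summed over strata
        var_t_hat += (1 - n_h / N_h[h]) * N_h[h] ** 2 * s2_h / n_h
    return {
        "t_hat": t_hat,
        "y_bar": t_hat / N,
        "se_t_hat": math.sqrt(var_t_hat),
        "se_y_bar": math.sqrt(var_t_hat) / N,
    }

# Hypothetical data: two strata with made-up sizes and observations.
N_h = {"A": 100, "B": 300}
samples = {"A": [2, 4, 6, 8], "B": [10, 12, 14, 16, 18, 20]}
result = stratified_estimates(N_h, samples)
print(result)
```

Here $\bar{y}_A = 5$ and $\bar{y}_B = 15$, so $\hat{t}_{STR} = 100 \times 5 + 300 \times 15 = 5000$ and $\bar{y}_{STR} = 12.5$.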

As long as we select at least one element per stratum, the specification for a stratified sample is satisfied. With at least two elements per stratum, we can also compute $s_h^2$ and so estimate both the mean and its standard error.

If either the sample sizes within each stratum are large or the number of strata is large, an approximate $100(1 - \alpha)\%$ confidence interval for the population mean $\bar{y_u}$ can be constructed as $\bar{y}_{STR} \pm z_{\alpha / 2} SE(\bar{y}_{STR}) = \bar{y}_{STR} \pm z_{\alpha / 2} \sqrt{\hat{V}(\bar{y}_{STR})}$
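A short sketch of this interval in Python, assuming the normal approximation is adequate. The helper name `normal_ci` and the plugged-in estimate and standard error are hypothetical values chosen only to show the arithmetic.

```python
from statistics import NormalDist

def normal_ci(estimate, se, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for alpha = 0.05
    return estimate - z * se, estimate + z * se

# Hypothetical stratified mean and standard error.
lower, upper = normal_ci(12.5, 1.2)
print(round(lower, 3), round(upper, 3))
```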

Recall that a proportion is a special case of a mean (the mean of a 0/1 variable), so the results above apply directly.
$\hat{p}_{STR} = \sum_{h=1}^H \frac{N_h}{N} \hat{p_h}$
$Var(\hat{p}_{STR}) = \sum_{h=1}^H (\frac{N_h}{N})^2 (\frac{N_h - n_h}{N_h - 1}) \frac{p_h (1 - p_h)}{n_h}$
$\hat{Var}(\hat{p}_{STR}) = \sum_{h=1}^H (1 - \frac{n_h}{N_h}) (\frac{N_h}{N})^2 \frac{\hat{p_h} (1 - \hat{p_h})}{n_h - 1}$
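The proportion estimator can be sketched as below. The estimated variance here uses the divisor $n_h - 1$, which is what results from substituting $s_h^2 = \frac{n_h}{n_h - 1} \hat{p_h}(1 - \hat{p_h})$ for a 0/1 variable into the general stratified variance formula. The function name and the example figures are assumptions for illustration.

```python
import math

def stratified_proportion(N_h, n_h, p_hat_h):
    """Return (p_hat_STR, SE) from stratum sizes, sample sizes and sample proportions."""
    N = sum(N_h)
    p_hat = sum(Nh / N * p for Nh, p in zip(N_h, p_hat_h))
    var_hat = sum(
        (1 - nh / Nh) * (Nh / N) ** 2 * p * (1 - p) / (nh - 1)
        for Nh, nh, p in zip(N_h, n_h, p_hat_h)
    )
    return p_hat, math.sqrt(var_hat)

# Hypothetical strata: sizes 600 and 400, samples of 60 and 40.
p_hat, se = stratified_proportion([600, 400], [60, 40], [0.5, 0.25])
print(p_hat, se)
```

With these made-up inputs the point estimate is $0.6 \times 0.5 + 0.4 \times 0.25 = 0.4$.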

In SRS, we learnt that $\hat{t} = N \bar{y} = \sum_{i \in D} \frac{N}{n} y_i = \sum_{i \in D} w_i y_i$, where $w_i = \frac{1}{\pi_i} = \frac{N}{n}$ for i = 1, 2, …, N is the sampling weight. The sampling weight of unit i can be interpreted as the number of population units represented by unit i. And in STR:
$\hat{t}_{STR} = \sum_{h=1}^H \hat{t_h} = \sum_{h=1}^H N_h \bar{y_h} = \sum_{h=1}^H \sum_{j \in D_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^H \sum_{j \in D_h} w_{hj} y_{hj}$
Here $w_{hj} = \frac{N_h}{n_h}$ measures the number of units in population represented by the $j^{th}$ unit in stratum h. Clearly, $\sum_{h=1}^H \sum_{j \in D_h} w_{hj} = N$
It should be noted that under SRS within strata, weights for each sampling unit within a stratum are the same, while weights across the strata may be different.
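A small sketch of the weights, with hypothetical stratum labels and sizes, shows that each sampled unit in stratum h carries weight $\frac{N_h}{n_h}$ and that the weights over the whole sample add up to N.

```python
def sampling_weights(N_h, n_h):
    """Weight w_hj = N_h / n_h, shared by every sampled unit in stratum h."""
    return {h: N_h[h] / n_h[h] for h in N_h}

# Hypothetical strata: made-up population and sample sizes.
N_h = {"small": 200, "large": 800}
n_h = {"small": 20, "large": 40}

w = sampling_weights(N_h, n_h)
print(w)  # {'small': 10.0, 'large': 20.0}

# Summing the weight of every sampled unit recovers the population size N.
total_weight = sum(w[h] * n_h[h] for h in N_h)
print(total_weight)  # 1000.0
```

The weights differ across the two strata because the sampling fractions (1 in 10 versus 1 in 20) differ.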

Now we study how to do proportional allocation (an important allocation scheme in stratified sampling), that is, allocating such that the number of sampled units in each stratum is proportional to the size of the stratum.
Stratum sample size, $n_h = N_h \frac{n}{N}$
Notice this rewrites to $\frac{n_h}{N_h} = \frac{n}{N}$, which implies that the inclusion probability $\frac{n}{N}$, and hence the sampling weight $\frac{N}{n}$, is the same for all units.
Thus, the probability that an individual will be selected to be in the sample, $\frac{n}{N}$, is the same as in an SRS, but many bad samples that could occur in an SRS cannot be selected in a stratified sample with proportional allocation. A stratified sample with proportional allocation is self-weighting (every unit in the sample has the same weight and represents the same number of units in the population).
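In practice $N_h \frac{n}{N}$ is rarely a whole number, so some rounding rule is needed. The sketch below uses a largest-remainder rule so that the stratum sample sizes still add up to n. That rounding rule, the function name and the example sizes are assumptions of this sketch, not something prescribed by the notes.

```python
def proportional_allocation(N_h, n):
    """Allocate n sample units across strata in proportion to N_h.

    Rounds down, then gives the leftover units to the strata with the
    largest fractional parts (largest-remainder rounding, an assumption).
    """
    N = sum(N_h.values())
    exact = {h: Nh * n / N for h, Nh in N_h.items()}
    alloc = {h: int(x) for h, x in exact.items()}  # floor, since values are positive
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc

# Hypothetical strata sizes.
print(proportional_allocation({"A": 500, "B": 300, "C": 200}, 100))
# {'A': 50, 'B': 30, 'C': 20}
uneven = proportional_allocation({"A": 335, "B": 333, "C": 332}, 10)
print(uneven)
```

The sketch does not check that every stratum ends up with at least two units, which would be needed to estimate the within-stratum variance.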

When the $N_h$ are large enough, the variance of $\bar{y}_{STR}$ under proportional allocation is usually less than or equal to the variance of $\bar{y}$ from an SRS with the same number of observations.

Sum of Squares

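When we have a stratified sample of size n with proportional allocation, $\frac{n_h}{N_h} = \frac{n}{N}$. For the population, write the within-strata sum of squares as $SSW = \sum_{h=1}^H \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hu})^2 = \sum_{h=1}^H (N_h - 1) S_h^2$, the between-strata sum of squares as $SSB = \sum_{h=1}^H N_h (\bar{y}_{hu} - \bar{y_u})^2$, and the total sum of squares as $SSTO = \sum_{h=1}^H \sum_{j=1}^{N_h} (y_{hj} - \bar{y_u})^2 = SSW + SSB$. Then
$Var_{prop}(\hat{t}_{STR}) = \sum_{h=1}^H (1 - \frac{n_h}{N_h}) {N_h}^2 \frac{{S_h}^2}{n_h} = (1 - \frac{n}{N}) \frac{N}{n} \sum_{h=1}^H N_h {S_h}^2 = (1 - \frac{n}{N}) \frac{N}{n} (SSW + \sum_{h=1}^H {S_h}^2)$
We can find $Var_{SRS}(\hat{t})$ too: $Var_{SRS}(\hat{t}) = (1 - \frac{n}{N}) N^2 \frac{S^2}{n} = (1 - \frac{n}{N}) \frac{N^2}{n} \frac{SSTO}{N - 1}$, where $S^2$ is the variance of the whole population.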

In general, the variance of the estimator of t from a stratified sample with proportional allocation will be smaller than the variance of the estimator of t from an SRS with the same number of observations. The more unequal the stratum means $\bar{y}_{hu}$, the more precision we gain by using proportional allocation.

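Note that $Var_{prop}(\hat{t}_{STR})$ depends primarily on SSW, and SSTO = SSW + SSB is a fixed value for the finite population. Consequently, SSW is smaller when SSB is larger, that is, when the stratum means differ more. This result ONLY holds for population variances; a variance estimated from a particular sample need not behave this way.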
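The decomposition can be checked numerically on a toy population. In this sketch SSW is the sum of squared deviations of each value from its own stratum mean, SSB is $\sum_h N_h (\bar{y}_{hu} - \bar{y_u})^2$, and SSTO is the sum of squared deviations from the overall mean. These are the usual definitions and are stated here as assumptions, as are the function name and the tiny made-up population.

```python
from statistics import mean

def sums_of_squares(strata):
    """strata: list of lists holding ALL population values in each stratum."""
    all_values = [y for s in strata for y in s]
    grand_mean = mean(all_values)
    ssw = sum((y - mean(s)) ** 2 for s in strata for y in s)
    ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in strata)
    ssto = sum((y - grand_mean) ** 2 for y in all_values)
    return ssw, ssb, ssto

# Hypothetical population with two well-separated strata.
population = [[1, 2, 3], [10, 11, 12, 13]]
ssw, ssb, ssto = sums_of_squares(population)
print(ssw, ssb, ssto)
```

Because the two stratum means are far apart, almost all of SSTO sits in SSB and only a small part is left in SSW, which is the situation where stratification helps most.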

After considering proportional allocation, we can now look at optimal allocation. Proportional allocation works well when the within-stratum variances are more or less equal across all the strata, but it does not take into account the cost $c_h$ of sampling in each stratum. Our goal now is to minimise the variance of the estimate given a total cost.

Problem: minimise $Var(\bar{y}_{STR})$ subject to the total cost $C = c_0 + \sum_{h=1}^H c_h n_h$ where C = total cost, $c_0$ = overhead (fixed) cost and $c_h$ = cost of taking an observation in stratum h.
We can use Lagrange multipliers to solve this and will find that the optimal allocation is $n_h \propto \frac{N_h S_h}{\sqrt{c_h}}$
Thus the optimal sample size in stratum h is $n_h = (\frac{\frac{N_h S_h}{\sqrt{c_h}}}{\sum_{l=1}^H \frac{N_l S_l}{\sqrt{c_l}}})n$

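Since $n_h \propto \frac{N_h S_h}{\sqrt{c_h}}$ , we see that the sample size in a stratum depends on the stratum size, the stratum standard deviation and the sampling cost: we sample more heavily in strata that are larger, more variable, or cheaper to sample.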

Proportional allocation is the optimal allocation if all variances and costs are equal across the strata. $\Rightarrow n_h \propto N_h$
Neyman allocation is a special case of optimal allocation, used when the costs (but not necessarily the variances) are approximately equal across the strata. $\Rightarrow n_h \propto N_h S_h$
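The allocation rule $n_h \propto \frac{N_h S_h}{\sqrt{c_h}}$ can be sketched as below. The function returns unrounded stratum sample sizes; leaving the costs out gives Neyman allocation. The function name and the example values of $N_h$, $S_h$ and $c_h$ are hypothetical, and in practice the $S_h$ would have to be guessed from a pilot study or earlier data.

```python
import math

def optimal_allocation(N_h, S_h, n, c_h=None):
    """Unrounded n_h proportional to N_h * S_h / sqrt(c_h).

    With c_h omitted, all costs are taken as equal (Neyman allocation).
    """
    if c_h is None:
        c_h = [1.0] * len(N_h)
    scores = [N * S / math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(scores)
    return [n * s / total for s in scores]

# Hypothetical strata: sizes, standard deviations and per-unit costs.
neyman = optimal_allocation([400, 600], [10, 5], n=70)
with_costs = optimal_allocation([400, 600], [10, 5], n=70, c_h=[4, 1])
print(neyman)      # [40.0, 30.0]
print(with_costs)  # [28.0, 42.0]
```

In the second call the first stratum is four times as expensive per unit, so its share of the sample drops even though it is the more variable stratum.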

In short, we have 3 methods of allocating a sample to strata: Equal (the same $n_h$ in every stratum), Proportional, and Optimal (with Neyman allocation as the equal-cost special case). Each allocation strategy tells us the proportion of the sample allocated to every stratum.

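Under absolute precision, the desired margin of error e is the half-width of the confidence interval. Suppose the sample is allocated so that $n_h = a_h n$, where $a_h$ is the fraction of the sample allocated to stratum h. If we ignore the FPC, $Var(\hat{t}_{STR}) \approx \frac{1}{n} \sum_{h=1}^H \frac{N_h^2 S_h^2}{a_h} = \frac{v}{n}$, so the half-width of the CI is $z_{\alpha / 2 } \sqrt{Var(\hat{t}_{STR})} = z_{\alpha / 2 } \sqrt{\frac{v}{n}}$. Setting this equal to e and solving for n gives $n = \frac{z_{{\alpha / 2}}^2 v}{e^2}$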
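A sketch of the sample size calculation for estimating a total, ignoring the FPC. It assumes $v = \sum_{h=1}^H \frac{N_h^2 S_h^2}{a_h}$ with $a_h = \frac{n_h}{n}$ the fraction of the sample allocated to stratum h. The function name and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_total(N_h, S_h, a_h, e, alpha=0.05):
    """Smallest n with z_{alpha/2} * sqrt(v / n) <= e, ignoring the FPC.

    a_h: allocation fractions n_h / n (should sum to 1)
    e:   desired margin of error for the estimated total
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    v = sum(N ** 2 * S ** 2 / a for N, S, a in zip(N_h, S_h, a_h))
    return math.ceil(z ** 2 * v / e ** 2)

# Hypothetical strata with allocation fractions 4/7 and 3/7.
n_required = sample_size_for_total([400, 600], [10, 5], [4 / 7, 3 / 7], e=1000)
print(n_required)
```

The stratum sample sizes then follow as $n_h = a_h n$, rounded as needed. If the resulting n is a large fraction of N, the FPC should not be ignored and this figure will overstate the sample size required.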

We may argue that stratified sampling almost always gives higher precision than SRS, so one should simply always stratify. However, stratification adds complexity to the survey, and we need to weigh whether the gain in precision is worth the added complexity. Stratified sampling is most efficient when the stratum means differ widely, so we should construct the strata such that their means are as different as possible. To do this, we need more information about the population. But the more information we use, the more strata we have, and the more complexity there is. We will use SRS if there is little or no information about the target population.


About

KS Teng

KS has been teaching H2/H1 Mathematics and IB mathematics for more than 10 years. Having taught students from various Junior Colleges, KS adapts to students' abilities and helps them better understand the topics. As someone who loves to teach mathematics and sees it as a truly useful tool in life, KS seeks to enable students to appreciate math. Therefore, his tuition mission is to motivate and cultivate students to be independent and confident thinkers.
