So last time we saw STR and here is a quick recap.

1. Set the stratification scheme
2. Set the stratum design
3. Implement the sampling methods for each stratum independently
4. Pool the stratum estimates to estimate the population parameters
5. Estimate their respective variances
6. Construct CI, if necessary.

Today, we look at ratio estimation. To start, we will use SRS only and, as before, we assume there is no non-sampling error, only sampling error.

We introduce two variables: $x_i$, an auxiliary (or subsidiary) variable, and $y_i$, the response variable (the characteristic of interest). The idea is to use an auxiliary variable that is correlated with the response variable to improve precision.
Next, we have a new population parameter, the ratio B.
$B = \frac{t_y}{t_x} = \frac{\bar{y_u}}{\bar{x_u}}$ where $t_y = \sum_{i=1}^N y_i$ and $t_x = \sum_{i=1}^N x_i$

And here is the procedure

1. We assume $t_x$ is known, $\Rightarrow \bar{x_u} = \frac{t_x}{N}$ is known
2. Use SRS and measure $y_i$ and $x_i$ in the sample.
3. Calculate $\bar{y}$ and $\bar{x}$ for the sample
4. $\hat{B} =\frac{\hat{t_y}}{\hat{t_x}} = \frac{\bar{y}}{\bar{x}} = \frac{\sum_{i \in D} y_i}{\sum_{i \in D} x_i}$
5. $\hat{t}_{y, ratio} = \hat{B} t_x$
6. $\hat{\bar{y}}_{ratio} = \hat{B} \bar{x_u}$
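The six steps above can be sketched in a few lines of Python. This is only an illustration: the function name `ratio_estimates`, the sample values, and the "known" $t_x$ and N are all made up for the example and do not come from a real survey.

```python
def ratio_estimates(y, x, t_x, N):
    """Return (B_hat, t_y_ratio, ybar_ratio) from an SRS of paired (x_i, y_i).

    t_x and N are assumed known, as in step 1 of the procedure.
    """
    if len(y) != len(x) or len(y) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    xbar_u = t_x / N                 # step 1: population mean of x
    B_hat = sum(y) / sum(x)          # step 4: ybar / xbar
    t_y_ratio = B_hat * t_x          # step 5
    ybar_ratio = B_hat * xbar_u      # step 6
    return B_hat, t_y_ratio, ybar_ratio


# Hypothetical sample of n = 4 units from a population of N = 100
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 22.0, 33.0, 43.0]
B_hat, t_y_ratio, ybar_ratio = ratio_estimates(y, x, t_x=3000.0, N=100)
print(B_hat, t_y_ratio, ybar_ratio)
```

Here the sample mean of x is 25 while the assumed population mean is 30, so the ratio estimate of the mean is scaled up relative to the plain sample mean of y.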

We use ratio estimation because, at times, the quantity of interest is itself a ratio: average yield in bushels per acre, fish caught per hour spent fishing, per capita income, etc. In many of these cases the population size N is unknown, yet it's still necessary for us to estimate a population total. Since we cannot use the estimator $\hat{t_y} = N \bar{y}$ here, we consider another measure of size, that is, $N = \frac{t_x}{\bar{x_u}}$, so we can estimate N by $\frac{t_x}{\bar{x}}$. Consequently, $\hat{t}_{y, ratio} = \bar{y} \frac{t_x}{\bar{x}}$, where $\frac{t_x}{\bar{x}}$ estimates the population size N based on the auxiliary variable.

The benefits of using ratio estimation are clear.

• Smaller MSE if x and y are correlated, giving us an increase in precision
• We are able to adjust estimates to reflect known information about the auxiliary variable, giving a more representative result.
• We can adjust for nonresponse.

Note that a particular SRS may, by chance, slightly underestimate the true population mean of the x's, that is, $\bar{x_D}$ may be smaller than $\bar{x_u}$. Should x and y be positively correlated, $\bar{y_D}$ is then likely to underestimate $\bar{y_u}$ as well.

The ratio estimator of the population mean $\bar{y_u}$ is given by
$\hat{\bar{y}}_{ratio} = \hat{B} \bar{x_u} = \frac{\bar{y}}{\bar{x}} \bar{x_u} = \bar{y} \frac{\bar{x_u}}{\bar{x}}$
Here we correct the underestimation by expanding $\bar{y_D}$ by the factor $\frac{\bar{x_u}}{\bar{x_D}}$.

After looking at the estimators, as usual, we question their qualities.

Firstly, the ratio estimators are biased. This arises because the unbiased $\bar{y}$ is multiplied by the random factor $\frac{\bar{x_u}}{\bar{x}}$ in $\hat{\bar{y}}_{ratio} = \bar{y} \frac{\bar{x_u}}{\bar{x}}$. The good news is that the variance is typically reduced when x and y are strongly correlated, which more than compensates for the presence of bias. This means that although $\mathbb{E}(\hat{\bar{y}}_{ratio}) \neq \bar{y_u}$, the value of $\hat{\bar{y}}_{ratio}$ for any individual sample is likely to be closer to $\bar{y_u}$ than the sample mean $\bar{y_D}$. Of course, the deviation $\bar{y_D} - \bar{y_u}$, averaged over all possible samples D that could be obtained, is zero.
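The trade-off between bias and variance can be explored with a small simulation. The sketch below builds a hypothetical population in which y is roughly proportional to x (the population, the noise level, and the function name `simulate_bias_mse` are all assumptions made for illustration), draws repeated SRSs, and compares the empirical bias and MSE of $\bar{y}$ with those of $\hat{\bar{y}}_{ratio}$.

```python
import random


def simulate_bias_mse(N=500, n=20, reps=5000, seed=1):
    """Empirical bias and MSE of ybar and ybar_ratio over repeated SRS draws."""
    rng = random.Random(seed)
    # Hypothetical population: y is about 2x plus noise (illustrative only).
    x = [rng.uniform(10.0, 100.0) for _ in range(N)]
    y = [2.0 * xi + rng.gauss(0.0, 5.0) for xi in x]
    xbar_u = sum(x) / N
    ybar_u = sum(y) / N

    err_mean, err_ratio = [], []
    for _ in range(reps):
        idx = rng.sample(range(N), n)            # SRS without replacement
        xbar = sum(x[i] for i in idx) / n
        ybar = sum(y[i] for i in idx) / n
        err_mean.append(ybar - ybar_u)
        err_ratio.append(ybar * xbar_u / xbar - ybar_u)

    def avg(values):
        return sum(values) / len(values)

    return {
        "bias_mean": avg(err_mean),
        "bias_ratio": avg(err_ratio),
        "mse_mean": avg([e * e for e in err_mean]),
        "mse_ratio": avg([e * e for e in err_ratio]),
    }


result = simulate_bias_mse()
for name, value in result.items():
    print(name, value)
```

With a population this strongly correlated, the ratio estimator's MSE should come out far below that of the sample mean; weakening the relationship between x and y in the simulated population narrows the gap.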

We first introduce the population correlation coefficient of x and y.
$R = \frac{\sum_{i=1}^N (x_i - \bar{x_u})(y_i - \bar{y_u})}{(N-1) S_x S_y}$

$Bias(\hat{\bar{y}}_{ratio}) = \mathbb{E}(\hat{\bar{y}}_{ratio} - \bar{y_u}) = - \mathbb{E}[\hat{B} (\bar{x} - \bar{x_u})] = - Cov(\hat{B}, \bar{x})$

$\frac{|Bias(\hat{\bar{y}}_{ratio})|}{\sqrt{Var(\hat{\bar{y}}_{ratio})}} = \frac{|Cov(\hat{B}, \bar{x})|}{\sqrt{Var(\hat{B}) \bar{x_u}^2}} \le \frac{\sqrt{Var(\bar{x})}}{\bar{x_u}} = \frac{\sqrt{Var(\bar{x})}}{\mathbb{E}(\bar{x})} = CV(\bar{x})$ where $CV(\bar{x}) = \frac{1}{\bar{x_u}} \sqrt{(1 - \frac{n}{N}) \frac{S_x^2}{n}}$
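One way to see where the inequality in this bound comes from is sketched below. It assumes $\bar{x_u} > 0$ and uses only the Cauchy-Schwarz bound on a covariance, together with the fact that $\bar{x_u}$ is a constant, so $Var(\hat{\bar{y}}_{ratio}) = \bar{x_u}^2 Var(\hat{B})$.

$|Cov(\hat{B}, \bar{x})| \le \sqrt{Var(\hat{B}) Var(\bar{x})}$
$\Rightarrow \frac{|Cov(\hat{B}, \bar{x})|}{\bar{x_u} \sqrt{Var(\hat{B})}} \le \frac{\sqrt{Var(\hat{B}) Var(\bar{x})}}{\bar{x_u} \sqrt{Var(\hat{B})}} = \frac{\sqrt{Var(\bar{x})}}{\bar{x_u}}$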

Here, notice that as the sample size n increases, $CV(\bar{x})$ decreases. Ignoring the FPC, $CV(\bar{x}) \sim \frac{1}{\sqrt{n}} \rightarrow 0$

For large n, $|Bias(\hat{\bar{y}}_{ratio})| \ll \sqrt{Var(\hat{\bar{y}}_{ratio})}$, so the MSE is dominated by the variance. So in large samples, $MSE(\hat{\bar{y}}_{ratio}) \approx Var(\hat{\bar{y}}_{ratio})$

Let $d_i = y_i - B x_i$ for $i = 1, \ldots , N$. Then $\bar{d_u} = \bar{y_u} - B \bar{x_u} = 0$, and $\bar{d} = \frac{1}{n} \sum_{i \in D} (y_i - B x_i)$ is an unbiased estimator of $\bar{d_u}$.
When n is large (as a rule of thumb, more than 30), $Var(\hat{B}) \approx MSE(\hat{B}) \approx \frac{1}{\bar{x_u}^2} Var(\bar{d})$
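The approximation can be motivated by a short linearisation argument. The sketch below assumes n is large enough that $\bar{x}$ is close to $\bar{x_u}$; the symbol $S_d^2$ is notation introduced here for the population variance of the $d_i$.

$\hat{B} - B = \frac{\bar{y} - B \bar{x}}{\bar{x}} = \frac{\bar{d}}{\bar{x}} \approx \frac{\bar{d}}{\bar{x_u}}$
$\Rightarrow \mathbb{E}[(\hat{B} - B)^2] \approx \frac{1}{\bar{x_u}^2} \mathbb{E}(\bar{d}^2) = \frac{1}{\bar{x_u}^2} Var(\bar{d})$, since $\mathbb{E}(\bar{d}) = \bar{d_u} = 0$
$Var(\bar{d}) = (1 - \frac{n}{N}) \frac{S_d^2}{n}$ with $S_d^2 = \frac{1}{N-1} \sum_{i=1}^N (y_i - B x_i)^2$
$y_i - B x_i = (y_i - \bar{y_u}) - B (x_i - \bar{x_u})$ (because $\bar{y_u} = B \bar{x_u}$) $\Rightarrow S_d^2 = S_y^2 - 2 B R S_x S_y + B^2 S_x^2$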

It is worth asking ourselves when this approximate MSE is small. Rewriting it, we have
$MSE(\hat{B}) \approx (1 - \frac{n}{N}) \frac{1}{n \bar{x_u}^2} (S_y^2 - 2 B R S_x S_y + B^2 S_x^2)$.
So the approximate MSE is small when

• Sample size n is large
• Sampling fraction $\frac{n}{N}$ is large
• Deviations $y_i - Bx_i$ are small
• Correlation between x and y is close to $+1$ (taking $B > 0$)
• $\bar{x_u}$ is large.
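These conditions can be checked numerically by plugging values into the approximate MSE formula. The sketch below is only illustrative: the function name `approx_mse_B` and all population values are hypothetical, chosen to show how the approximation responds to n and R.

```python
def approx_mse_B(n, N, xbar_u, B, S_x, S_y, R):
    """Large-sample approximation to MSE(B_hat) under SRS."""
    # Population variance of the deviations d_i = y_i - B * x_i
    S_d2 = S_y**2 - 2.0 * B * R * S_x * S_y + B**2 * S_x**2
    return (1.0 - n / N) * S_d2 / (n * xbar_u**2)


# Hypothetical population summary values (not from real data)
base = dict(N=1000, xbar_u=50.0, B=2.0, S_x=10.0, S_y=20.0)

small_n = approx_mse_B(n=20, R=0.9, **base)    # baseline
large_n = approx_mse_B(n=200, R=0.9, **base)   # larger sample
weak_r = approx_mse_B(n=20, R=0.5, **base)     # weaker correlation

print(small_n, large_n, weak_r)
```

In this setup, increasing n lowers the approximate MSE, while lowering R from 0.9 to 0.5 raises it.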

The estimated variance is $\hat{Var}(\hat{B}) = (1 - \frac{n}{N}) \frac{1}{n \bar{x_u}^2} s_e^2$, where $s_e^2 = \frac{1}{n-1} \sum_{i \in D} (y_i - \hat{B}x_i)^2 = \frac{1}{n-1} \sum_{i \in D} e_i^2$ and $e_i = y_i - \hat{B}x_i$.
When $\bar{x_u}$ is unknown, we can substitute $\bar{x_D}$ for it in this formula. For the mean and the total, we have
$\hat{Var}(\hat{\bar{y}}_{ratio}) = \hat{Var}(\hat{B} \bar{x_u}) = (1 - \frac{n}{N}) \frac{s_e^2}{n}$
$\hat{Var}(\hat{t}_{y, ratio}) = \hat{Var}(\hat{B} t_x) = N^2 (1 - \frac{n}{N}) \frac{s_e^2}{n}$

Similarly, if the sample sizes are sufficiently large, approximate 95% CIs can be constructed using the standard errors as
$\hat{B} \pm 1.96 SE(\hat{B})$
$\hat{\bar{y}}_{ratio} \pm 1.96 SE(\hat{\bar{y}}_{ratio})$
$\hat{t}_{y, ratio} \pm 1.96 SE(\hat{t}_{y, ratio})$
For large samples, the effect of bias in the CIs can be ignored.
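The variance estimates and the intervals above can be put together as follows. This is a minimal sketch assuming $t_x$ and N are known; the function name `ratio_intervals` and the four-unit sample are hypothetical, and a sample this small is far too small for the normal approximation to be trusted in practice.

```python
import math


def ratio_intervals(y, x, N, t_x, z=1.96):
    """Return {name: (estimate, se, lower, upper)} for B, the mean and the total."""
    n = len(y)
    if n != len(x) or n < 2:
        raise ValueError("need paired samples with n >= 2")
    xbar_u = t_x / N
    B_hat = sum(y) / sum(x)
    # Residual variance s_e^2 with e_i = y_i - B_hat * x_i
    s_e2 = sum((yi - B_hat * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
    se_ybar = math.sqrt((1.0 - n / N) * s_e2 / n)
    estimates = {
        "B": (B_hat, se_ybar / xbar_u),       # SE(B_hat) = SE(ybar_ratio) / xbar_u
        "ybar": (B_hat * xbar_u, se_ybar),
        "t": (B_hat * t_x, N * se_ybar),      # SE(t_hat) = N * SE(ybar_ratio)
    }
    return {
        name: (est, se, est - z * se, est + z * se)
        for name, (est, se) in estimates.items()
    }


# Hypothetical sample, for illustration only
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 22.0, 33.0, 43.0]
out = ratio_intervals(y, x, N=100, t_x=3000.0)
for name, (est, se, lower, upper) in out.items():
    print(name, est, se, lower, upper)
```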

A distinct advantage of using ratio estimation is that, to the order of the approximation above and for $B > 0$, $MSE(\hat{\bar{y}}_{ratio}) \le MSE(\bar{y})$ iff $R \ge \frac{B S_x}{2S_y} = \frac{CV(x)}{2CV(y)}$. This implies that if the coefficients of variation are approximately equal, then it pays to use ratio estimation when the correlation between x and y is larger than $\frac{1}{2}$.
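The condition can be checked directly from the approximate MSE. The sketch below assumes $B > 0$ and $S_x, S_y > 0$, and uses $MSE(\hat{\bar{y}}_{ratio}) \approx \bar{x_u}^2 MSE(\hat{B})$ together with $Var(\bar{y}) = (1 - \frac{n}{N}) \frac{S_y^2}{n}$.

$(1 - \frac{n}{N}) \frac{1}{n} (S_y^2 - 2 B R S_x S_y + B^2 S_x^2) \le (1 - \frac{n}{N}) \frac{S_y^2}{n}$
$\Leftrightarrow B^2 S_x^2 \le 2 B R S_x S_y \Leftrightarrow R \ge \frac{B S_x}{2 S_y}$
$\frac{B S_x}{S_y} = \frac{\bar{y_u}}{\bar{x_u}} \cdot \frac{S_x}{S_y} = \frac{S_x / \bar{x_u}}{S_y / \bar{y_u}} = \frac{CV(x)}{CV(y)}$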

Next time, we will look at Regression Estimation.

Sampling & Survey #1 – Introduction
Sampling & Survey #2 – Simple Probability Samples
Sampling & Survey #3 – Simple Random Sampling
Sampling & Survey #4 – Qualities of estimator in SRS
Sampling & Survey #5 – Sampling weight, Confidence Interval and sample size in SRS
Sampling & Survey #6 – Systematic Sampling
Sampling & Survey #7 – Stratified Sampling
Sampling & Survey #8 – Ratio Estimation
Sampling & Survey #9 – Regression Estimation
Sampling & Survey #10 – Cluster Sampling
Sampling & Survey #11 – Two-Stage Cluster Sampling
Sampling & Survey #12 – Sampling with unequal probabilities (Part 1)
Sampling & Survey #13 – Sampling with unequal probabilities (Part 2)
Sampling & Survey #14 – Nonresponse