Sampling & Survey # 8 – Ratio Estimation

Last time we looked at stratified random sampling (STR); here is a quick recap.

Set the stratification scheme
Set the stratum design
Implement the sampling methods for each stratum independently
Pool the stratum estimates to estimate the population parameters
Estimate their respective variances
Construct CI, if necessary.

Today, we look at ratio estimation. For starters, we will use SRS only and, as before, we assume there is no non-sampling error, only sampling error.

As usual, we start with the definitions
We introduce two variables: $x_i$ , which is an auxiliary (or subsidiary) variable, and $y_i$ , which is the response variable (the characteristic of interest). The idea here is to utilise an auxiliary variable that is correlated with the response variable to improve precision.
Next, we have a new population parameter B (ratio).
$B = \frac{t_y}{t_x} = \frac{\bar{y_u}}{\bar{x_u}}$ where $t_y = \sum_{i=1}^N y_i$ and $t_x = \sum_{i=1}^N x_i$

And here is the procedure

We assume $t_x$ is known, $\Rightarrow \bar{x_u} = \frac{t_x}{N}$ is known
Use SRS and measure $y_i$ and $x_i$ in the sample.
Calculate $\bar{y}$ and $\bar{x}$ for the sample
$\hat{B} =\frac{\hat{t_y}}{\hat{t_x}} = \frac{\bar{y}}{\bar{x}} = \frac{\sum_{i \in D} y_i}{\sum_{i \in D} x_i}$
$\hat{t}_{y, ratio} = \hat{B} t_x$
$\hat{\bar{y}}_{ratio} = \hat{B} \bar{x_u}$
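The procedure above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `ratio_estimates`, the sample values, and the values of `t_x` and `N` are all made up for the example, and N is treated as optional because the mean estimate needs $\bar{x_u} = \frac{t_x}{N}$ while the total estimate does not.

```python
def ratio_estimates(x_sample, y_sample, t_x, N=None):
    """Return (B_hat, t_y_ratio, y_bar_ratio) from an SRS.

    y_bar_ratio is None when N (and hence x_bar_u = t_x / N) is not supplied.
    """
    if len(x_sample) != len(y_sample):
        raise ValueError("x_sample and y_sample must have the same length")
    # B_hat = y_bar / x_bar = (sum of y in sample) / (sum of x in sample)
    B_hat = sum(y_sample) / sum(x_sample)
    # Ratio estimate of the population total of y
    t_y_ratio = B_hat * t_x
    # Ratio estimate of the population mean of y (needs x_bar_u = t_x / N)
    y_bar_ratio = B_hat * t_x / N if N is not None else None
    return B_hat, t_y_ratio, y_bar_ratio


# Hypothetical sample of n = 4 units; t_x and N are assumed known.
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 18.0, 33.0, 37.0]
B_hat, t_hat, y_bar_hat = ratio_estimates(x, y, t_x=2500.0, N=100)
print(B_hat, t_hat, y_bar_hat)
```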

We use ratio estimation because at times the quantity of interest is itself a ratio: average yield in bushels per acre, ratio of fish caught to the number of hours spent, per capita income, etc. In many of these cases the population size N is unknown, yet it is still necessary for us to estimate a population total. Since we cannot use the estimator $\hat{t_y} = N \bar{y}$ here, we consider another measure of size, that is, $N = \frac{t_x}{\bar{x_u}}$ . So we can estimate N by $\frac{t_x}{\bar{x}}$ . Consequently, $\hat{t}_{y, ratio} = \bar{y} \frac{t_x}{\bar{x}}$ where $\frac{t_x}{\bar{x}}$ estimates the population size based on the auxiliary variable.

The benefits of using ratio estimation are clear.

Smaller MSE if x and y are correlated, giving us an increase in precision
We are able to adjust estimates from the sample so that they reflect known population information (for example, known totals of the auxiliary variable), which makes the results more representative.
We can adjust for nonresponse.

You might notice that a particular SRS may, by chance, underestimate the true population mean of the x's, that is, $\bar{x_D}$ is smaller than $\bar{x_u}$ . Should x and y be positively correlated, $\bar{y_D}$ is then also likely to underestimate $\bar{y_u}$ .

Ratio estimation for the population mean $\bar{y_u}$ is given by
$\hat{\bar{y}}_{ratio} = \hat{B} \bar{x_u} = \frac{\bar{y}}{\bar{x}} \bar{x_u} = \bar{y} \frac{\bar{x_u}}{\bar{x}}$
Here we correct the underestimation by expanding $\bar{y_D}$ by the factor $\frac{\bar{x_u}}{\bar{x_D}}$

After looking at the estimators, as usual, we question their qualities.

Firstly, the ratio estimators are biased. This arises because the unbiased $\bar{y}$ is multiplied by the random factor $\frac{\bar{x_u}}{\bar{x}}$ , giving $\hat{\bar{y}}_{ratio} = \bar{y} \frac{\bar{x_u}}{\bar{x}}$ . The good news is that the variance is usually reduced when x and y are strongly correlated, which typically compensates for the presence of bias. This means that although $\mathbb{E}(\hat{\bar{y}}_{ratio}) \neq \bar{y_u}$ , the value of $\hat{\bar{y}}_{ratio}$ for any individual sample is likely to be closer to $\bar{y_u}$ than the sample mean $\bar{y_D}$ . Of course, the deviation $\bar{y_D} - \bar{y_u}$ , averaged over all possible samples D that could be obtained, is zero.

We introduce a population correlation coefficient of x and y first.
$R = \frac{\sum_{i=1}^N (x_i - \bar{x_u})(y_i - \bar{y_u})}{(N-1) S_x S_y}$

$Bias(\hat{\bar{y}}_{ratio}) = \mathbb{E}(\hat{\bar{y}}_{ratio} - \bar{y_u}) = - \mathbb{E}[\hat{B} (\bar{x} - \bar{x_u})] = - Cov(\hat{B}, \bar{x})$
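For readers who want the intermediate steps of the bias identity above, here is one way to fill them in. It uses only $\bar{y} = \hat{B} \bar{x}$ and the unbiasedness of $\bar{y}$ and $\bar{x}$ under SRS; the layout is ours, not a quoted derivation.

```latex
\begin{aligned}
\hat{\bar{y}}_{ratio} - \bar{y_u}
  &= \hat{B}\,\bar{x_u} - \bar{y_u} \\
  &= \hat{B}\,\bar{x} - \hat{B}(\bar{x} - \bar{x_u}) - \bar{y_u} \\
  &= (\bar{y} - \bar{y_u}) - \hat{B}(\bar{x} - \bar{x_u}), \\[4pt]
\mathbb{E}(\hat{\bar{y}}_{ratio} - \bar{y_u})
  &= \underbrace{\mathbb{E}(\bar{y}) - \bar{y_u}}_{=\,0}
     - \mathbb{E}[\hat{B}(\bar{x} - \bar{x_u})], \\[4pt]
Cov(\hat{B}, \bar{x})
  &= \mathbb{E}[\hat{B}(\bar{x} - \bar{x_u})]
     - \mathbb{E}(\hat{B})\,\underbrace{\mathbb{E}(\bar{x} - \bar{x_u})}_{=\,0}
   = \mathbb{E}[\hat{B}(\bar{x} - \bar{x_u})].
\end{aligned}
```

Combining the last two lines gives $Bias(\hat{\bar{y}}_{ratio}) = - Cov(\hat{B}, \bar{x})$ .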

$\frac{|Bias(\hat{\bar{y}}_{ratio})|}{\sqrt{Var(\hat{\bar{y}}_{ratio})}} = \frac{|Cov(\hat{B}, \bar{x})|}{\sqrt{Var(\hat{B}) \bar{x_u}^2}} \le \frac{\sqrt{Var(\bar{x})}}{\bar{x_u}} = \frac{\sqrt{Var(\bar{x})}}{\mathbb{E}(\bar{x})} = CV(\bar{x})$ where $CV(\bar{x}) = \frac{1}{\bar{x_u}} \sqrt{(1 - \frac{n}{N}) \frac{S_x^2}{n}}$

Here, notice that as the sample size n increases, $CV(\bar{x})$ decreases. Ignoring the FPC, $CV(\bar{x}) \sim \frac{1}{\sqrt{n}} \rightarrow 0$

$|Bias(\hat{\bar{y}}_{ratio})| \ll \sqrt{Var(\hat{\bar{y}}_{ratio})} \Rightarrow$ MSE is dominated by the variance. So in large samples, $MSE(\hat{\bar{y}}_{ratio}) \approx Var(\hat{\bar{y}}_{ratio})$
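The statements above about bias, variance and MSE can be checked by brute force on a tiny population, by listing every possible SRS. The sketch below does this for a made-up population of N = 6 units with n = 3; the data are purely illustrative and the variable names are ours. It compares the sample mean with the ratio estimator and also checks the bound $|Bias| \le CV(\bar{x}) \sqrt{Var}$ .

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

# Hypothetical toy population (made up for illustration only).
x_pop = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y_pop = [5.0, 9.0, 13.0, 15.0, 21.0, 27.0]
N, n = len(x_pop), 3
x_bar_u, y_bar_u = mean(x_pop), mean(y_pop)

mean_ests, ratio_ests = [], []
for idx in combinations(range(N), n):      # every possible SRS of size n
    x_bar = mean(x_pop[i] for i in idx)
    y_bar = mean(y_pop[i] for i in idx)
    mean_ests.append(y_bar)                      # plain sample mean
    ratio_ests.append(y_bar / x_bar * x_bar_u)   # ratio estimator of the mean


def bias_var_mse(estimates, target):
    """Exact bias, variance and MSE over the equally likely samples."""
    bias = mean(estimates) - target
    mse = mean((e - target) ** 2 for e in estimates)
    return bias, mse - bias ** 2, mse


bias_m, var_m, mse_m = bias_var_mse(mean_ests, y_bar_u)
bias_r, var_r, mse_r = bias_var_mse(ratio_ests, y_bar_u)

# CV of x_bar under SRS; statistics.variance uses the N - 1 divisor, i.e. S_x^2.
cv_x_bar = sqrt((1 - n / N) * variance(x_pop) / n) / x_bar_u

print("sample mean: bias, var, MSE =", bias_m, var_m, mse_m)
print("ratio      : bias, var, MSE =", bias_r, var_r, mse_r)
print("|bias| / sd =", abs(bias_r) / sqrt(var_r), " CV(x_bar) =", cv_x_bar)
```

Because y is roughly proportional to x in this toy population, the ratio estimator should show a small bias but a much smaller MSE than the sample mean.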

Let $d_i = y_i - B x_i$ for $i = 1, \ldots , N$ , so that $\bar{d_u} = 0$ . Then $\bar{d} = \frac{1}{n} \sum_{i \in D} (y_i - B x_i)$ is an unbiased estimator of $\bar{d_u}$
When n is large (say, more than 30), $Var(\hat{B}) \approx MSE(\hat{B}) \approx \frac{1}{\bar{x_u}^2} Var(\bar{d})$

It is worth asking ourselves when this approximate MSE is small. Rewriting it, we have
$MSE(\hat{B}) \approx (1 - \frac{n}{N}) \frac{1}{n \bar{x_u}^2} (S_y^2 - 2 B R S_x S_y + B^2 S_x^2)$ .
So the approximate MSE is small when

Sample size n is large
sampling fraction $\frac{n}{N}$ is large
Deviations $y_i - Bx_i$ are small
Correlation between x and y is close to $\pm 1$
$\bar{x_u}$ is large.
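The rewritten form of the approximate MSE comes from expanding the population variance of the $d_i$ . A short derivation, using $\bar{y_u} = B \bar{x_u}$ and the definition of R (the symbol $S_d^2$ is our shorthand and does not appear elsewhere in this post):

```latex
\begin{aligned}
Var(\bar{d}) &= \left(1 - \frac{n}{N}\right) \frac{S_d^2}{n},
\qquad
S_d^2 = \frac{1}{N-1} \sum_{i=1}^N (d_i - \bar{d_u})^2
      = \frac{1}{N-1} \sum_{i=1}^N (y_i - B x_i)^2, \\[4pt]
S_d^2 &= \frac{1}{N-1} \sum_{i=1}^N
         \left[ (y_i - \bar{y_u}) - B (x_i - \bar{x_u}) \right]^2 \\
      &= S_y^2
         - \frac{2B}{N-1} \sum_{i=1}^N (x_i - \bar{x_u})(y_i - \bar{y_u})
         + B^2 S_x^2 \\
      &= S_y^2 - 2 B R S_x S_y + B^2 S_x^2 .
\end{aligned}
```

Dividing $Var(\bar{d})$ by $\bar{x_u}^2$ then gives the expression for the approximate $MSE(\hat{B})$ .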

Estimated variance, $\hat{Var}(\hat{B}) = (1 - \frac{n}{N}) \frac{1}{n \bar{x_u}^2} s_e^2$ where $s_e^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \hat{B}x_i)^2$ = $\frac{1}{n-1} \sum_{i=1}^n e_i^2$ and $e_i = y_i - \hat{B}x_i$
When $\bar{x_u}$ is unknown, we can substitute $\bar{x_D}$ for it in this formula. Multiplying by $\bar{x_u}^2$ and by $t_x^2 = N^2 \bar{x_u}^2$ respectively gives
$\hat{Var}(\hat{\bar{y}}_{ratio}) = \hat{Var}(\hat{B} \bar{x_u}) = (1 - \frac{n}{N}) \frac{s_e^2}{n}$
$\hat{Var}(\hat{t}_{y, ratio}) = \hat{Var}(\hat{B} t_x) = N^2 (1 - \frac{n}{N}) \frac{s_e^2}{n}$
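As a rough sketch, the three variance estimates can be computed from the sample residuals $e_i = y_i - \hat{B} x_i$ as follows. The function name `ratio_variance_estimates` and the data are hypothetical, and we assume N is known; when `x_bar_u` is not supplied, the sample mean of x is used in its place.

```python
def ratio_variance_estimates(x_sample, y_sample, N, x_bar_u=None):
    """Estimated variances of B_hat, the ratio mean and the ratio total."""
    n = len(x_sample)
    x_bar = sum(x_sample) / n
    B_hat = sum(y_sample) / sum(x_sample)
    # Residuals e_i = y_i - B_hat * x_i and their sample variance s_e^2
    residuals = [yi - B_hat * xi for xi, yi in zip(x_sample, y_sample)]
    s_e2 = sum(e * e for e in residuals) / (n - 1)
    fpc = 1 - n / N                      # finite population correction
    # Fall back to the sample mean of x when x_bar_u is unknown
    x_ref = x_bar_u if x_bar_u is not None else x_bar
    var_B = fpc * s_e2 / (n * x_ref ** 2)
    var_mean = fpc * s_e2 / n
    var_total = N ** 2 * var_mean
    return var_B, var_mean, var_total


# Hypothetical sample of n = 4 units from a population of N = 100.
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, 18.0, 33.0, 37.0]
var_B, var_mean, var_total = ratio_variance_estimates(x, y, N=100, x_bar_u=25.0)
print(var_B, var_mean, var_total)
```

A sample of four units is far too small for these approximations to be trusted in practice; it is used here only so the arithmetic is easy to follow by hand.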

Similarly, if the sample sizes are sufficiently large, approximate 95% CIs can be constructed using the standard errors as
$\hat{B} \pm 1.96 SE(\hat{B})$
$\hat{\bar{y}}_{ratio} \pm 1.96 SE(\hat{\bar{y}}_{ratio})$
$\hat{t}_{y, ratio} \pm 1.96 SE(\hat{t}_{y, ratio})$
For large samples, the effect of bias in the CIs can be ignored.
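In code, each of these intervals is just the estimate plus or minus 1.96 standard errors. A minimal sketch, where the helper name `approx_ci_95` and the numbers passed to it are hypothetical:

```python
from math import sqrt


def approx_ci_95(estimate, var_hat, z=1.96):
    """Approximate 95% CI: estimate +/- z * SE, with SE = sqrt(var_hat)."""
    se = sqrt(var_hat)
    return estimate - z * se, estimate + z * se


# Hypothetical ratio estimate of the mean and its estimated variance.
lower, upper = approx_ci_95(estimate=25.0, var_hat=2.08)
print(lower, upper)
```

The same helper applies to $\hat{B}$ and $\hat{t}_{y, ratio}$ by passing their own estimates and estimated variances.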

It is also easy to tell when ratio estimation pays off. To the order of approximation used above, and for $B > 0$ , $MSE(\hat{\bar{y}}_{ratio}) \le MSE(\bar{y})$ iff $R \ge \frac{B S_x}{2S_y} = \frac{CV(x)}{2CV(y)}$ . This implies that if the coefficients of variation are approximately equal, then it pays to use ratio estimation when the correlation between x and y is larger than $\frac{1}{2}$
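The condition on R is simple to check numerically. The sketch below evaluates it on a made-up population; the function name `ratio_condition` and the data are ours. In a real survey the population quantities are unknown, so one would plug in sample analogues, which is an assumption on our part rather than something derived in this post.

```python
from statistics import mean, stdev


def ratio_condition(x, y):
    """Return (R, threshold, R >= threshold) with threshold = CV(x) / (2 CV(y))."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)        # N - 1 divisor, matching S_x and S_y
    m = len(x)
    # Correlation coefficient R as defined in the post
    r = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / ((m - 1) * s_x * s_y)
    threshold = (s_x / x_bar) / (2 * s_y / y_bar)
    return r, threshold, r >= threshold


# Hypothetical toy population (made up for illustration only).
x_pop = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y_pop = [5.0, 9.0, 13.0, 15.0, 21.0, 27.0]
R, threshold, use_ratio = ratio_condition(x_pop, y_pop)
print(R, threshold, use_ratio)
```

Here the two coefficients of variation are nearly equal, so the threshold is close to one half, and R is well above it.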

Next time, we will look at Regression Estimation.

Sampling & Survey #1 – Introduction
Sampling & Survey #2 – Simple Probability Samples
Sampling & Survey #3 – Simple Random Sampling
Sampling & Survey #4 – Qualities of estimator in SRS
Sampling & Survey #5 – Sampling weight, Confidence Interval and sample size in SRS
Sampling & Survey #6 – Systematic Sampling
Sampling & Survey #7 – Stratified Sampling
Sampling & Survey # 8 – Ratio Estimation
Sampling & Survey # 9 – Regression Estimation
Sampling & Survey #10 – Cluster Sampling
Sampling & Survey #11 – Two – Stage Cluster Sampling
Sampling & Survey #12 – Sampling with unequal probabilities (Part 1)
Sampling & Survey #13 – Sampling with unequal probabilities (Part 2)
Sampling & Survey #14 – Nonresponse

About

KS Teng

KS has been teaching H2/H1 Mathematics and IB mathematics for more than 10 years. Having taught students from various Junior Colleges, KS adapts to students' abilities and helps them better understand the topics. As someone who loves to teach mathematics and sees it as a truly useful tool in life, KS seeks to enable students to appreciate math. Therefore, his tuition mission is to motivate and cultivate students to be independent and confident thinkers.
