by Bryce Frank
Many forest inventories rely on design-based inference to produce estimates of population totals. The main tenent of design-based inference is the use of a randomization process to select samples from the population. We can leverage this randomization process to construct estimators of population parameters. In design-based inference we desire estimators with good properties. One of those properties is unbiasedness. This post will demonstrate what is meant by design unbiasedness using a visual demonstration.
For this post, we will consider the population total:
\[\tau = \sum_{i=1}^{N} y_i\]our target parameter. Note that \(N\) represents the population size and \(y_i\) represents the observation of the \(i\)th population unit. We can represent this population with the grid below, where each grid cell is a population unit, and each observation (i.e. the \(y_i\)) is indicated by a number. Examine the individual population values by hovering your mouse over the population unit. Yellower colors indicate high values, and cooler colors indicate lower values.
Adding up the \(N = 36\) values we can determine that \(\tau = 135\). However, in nearly all real sampling problems \(\tau\) will be unknown.
As samplers, our objective is to construct an estimate of \(\tau\) using only a limited number of observations. In design-based inference we do so by leveraging the process of randomization when selecting a sample from our population. This process of randomization ultimately allows us to establish important properties of a given estimator, such as unbiasedness.
A very common estimator available to us is the Horvitz-Thompson estimator, which relies on knowledge of the inclusion probability of each population unit. The inclusion probability \(\pi_i\) is the probability a population unit is selected into a sample. The Horvitz-Thompson estimator is:
\[\hat{\tau} = \sum_{i=1}^{N} \frac{y_i}{\pi_i}\]The inclusion probabilities are values set in advance by the specification of a sampling design. Many sampling designs are available, and one common one is simple random sapling without replacement (SRSWoR).
SRSWoR begins by randomly selecting an initial population unit, where all population units have an equal probability of selection. Then, at the second stage, the previously sampled unit is not replaced, and the remaining population units have an equal probability of selection. This process continues until \(n\) units have been selected. It can be shown that for all units, such a process will result in an inclusion probability of \(\pi_i = \frac{n}{N}\) for all units.
The Horvitz-Thompson estimator has long been known to be an unbiased estimator of the population total. That is, the expectation of the estimator, taken over all possible samples, is equal the the population total:
\[E_{D}[\hat{\tau}] = \tau\]where the subscript \(D\) refers to the expectation taken over all possible samples (i.e. it is a function of the sampling design).
In this case, the expectation is really just a summation over all possible realizations of the estimator:
\[E_{D}[\hat{\tau}] = \frac{1}{S} \sum_{j=1}^{S}\hat{\tau}_j\]where \(S\) represents the total number of possible samples. Note that any given sample (i.e. any set of population units selected for sampling) is just as likely to get selected as any other. Unfortunately, for even modest sized populations, \(S\) becomes a very large number. For the purposes of illustration we will take advantage of a Monte-Carlo simulation to visualize the unbiasedness of the HT estimator.
\[E_{D}[\hat{\tau}] \approx \frac{1}{M} \sum_{j=1}^{M}\hat{\tau}_j\]where \(M\) is sufficiently large.
The following visualization displays a population of \(N=36\) population units, represented by grid cells. The population values are indicated by the grid cell color. Hover your mouse over a particular grid cell to see the actual value. On the right is a figure that will demonstrate the unbiasedness of the HT estimator once the simulation begins.
The simulation iteratively samples the population using SRSWoR with a sample size \(n=4\). A sample is collected when a set of grid cells flashes in red. The red squares at each phase represent the sampled population units. At each iteration \(j\), the estimator \(\hat{\tau_j}\) is calculated. Additionally all of the previous iterations are aggregated into the approximated expected value, and added to the plot on the right.
As the simulation progresses, it is clear that the quantity on the y axis, \(E_{D}[\hat{\tau}]\) approaches the true value, \(\tau = 135\).