by Bryce Frank
In Design Unbiasedness we discussed the basics of probability-based sampling designs and how they can be used to produce unbiased estimates of population parameters. The principles of design unbiasedness rely on randomization in the sample selection mechanism (i.e., a sampling design), with the values of the response variables treated as fixed.
An alternative route to unbiased estimates is model-based inference. We will use some specific language that deviates slightly from the design-based setting but should cohere with most of the literature on the subject. Our objective in model-based inference is to predict values of population units, or aggregates thereof (such as totals or means), by estimating the parameters of a model.
To begin, we will work with the same population as before.
A key difference in model-based inference is that we assume the data were generated by a random process, referred to as a superpopulation model. We will assume the following:
\[y_i = \mu + \epsilon_i\]where \(y_i\) is the value of the \(i\)th cell, \(\mu\) is a mean parameter, common to all cells, and \(\epsilon_i\) is an error term for the \(i\)th cell. We will assume the error terms are independently and normally distributed with mean 0 and variance \(\sigma^2\):
\[\epsilon_i \sim N(0, \sigma^2)\]When we deal with model-based inference we often talk about population realizations. A realization is the set of population values we are able to observe. In real settings, we only have access to one realization. For this idea we will need some specific notation. Consider the sum of all of the observable grid cells above:
\[\tilde{\tau} = \sum_{i=1}^N y_i\]where \(\tilde{\tau}\) indicates the sum of the realized \(y_i\). Our objective in model-based inference, typically, is to predict this quantity \(\tilde{\tau}\) by estimating the superpopulation parameters.
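As a rough sketch of this setup, one might simulate a single realization of the population and compute its realized total along the following lines, assuming a 6-by-6 grid of \(N = 36\) cells and the parameter values \(\mu = \frac{135}{36}\) and \(\sigma^2 = 4\) adopted later in this post (the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed setup: a 6-by-6 grid of N = 36 cells, with the superpopulation
# parameters used later in the post (mu = 135/36, sigma^2 = 4).
N = 36
mu = 135 / 36
sigma = 2.0

# One realization of the population: y_i = mu + epsilon_i, epsilon_i ~ N(0, sigma^2)
y = mu + rng.normal(loc=0.0, scale=sigma, size=N)

# The realized total tau_tilde for this particular population
tau_tilde = y.sum()
print(tau_tilde)
```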
A useful way to think about model-based estimates is as aggregations of predictions made using the estimated parameters of the model. Denote the set of sampled indices as \(S\) and the set of unsampled indices as \(R\). Our prediction for \(\tilde{\tau}\) might look like:
\[\hat{\tilde{\tau}} = \sum_{i \in S} y_i + \sum_{i \in R} \hat{y}_i\]In other words, we add up the observed values in the sample, and we add up the predictions of the unobserved values made using the estimated model parameter \(\hat{\mu}\). To obtain the predictions \(\hat{y}_i\) we will use the sample mean:
\[\hat{\mu} = \hat{y}_i = \frac{1}{n} \sum_{j \in S} y_j\]that is, our prediction for each unsampled population unit is the sample mean.
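A minimal sketch of this predictor, again under the assumed \(N = 36\), \(\mu = \frac{135}{36}\), \(\sigma^2 = 4\) setup and with an arbitrary fixed set of sampled indices, might look like:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

N = 36
mu = 135 / 36
sigma = 2.0
y = mu + rng.normal(0.0, sigma, size=N)   # one realized population

# An arbitrary, fixed set of sampled indices S; the remainder R is unsampled.
S = np.array([0, 5, 10, 15, 20, 25, 30, 35])
R = np.setdiff1d(np.arange(N), S)

# Estimate mu with the sample mean, use it as the prediction for each
# unsampled unit, then aggregate observed values and predictions.
mu_hat = y[S].mean()
tau_tilde_hat = y[S].sum() + len(R) * mu_hat

print(mu_hat, tau_tilde_hat)
```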
It is left as an exercise to the reader to show that \(E_{M}[\hat{\mu}] = \mu\) and that \(E_{M}[\hat{\tilde{\tau}}] = E_{M}[\tilde{\tau}] = \tau\), where the subscript \(M\) is used to emphasize that the expectation is taken over components of the model, and not over a randomized sampling design as before. Noting these properties, we can claim that our predictor is unbiased with respect to the superpopulation model. Note further that the unbiasedness claim is made with respect to the expectation of the random variable \(\tilde{\tau}\). We can think of this quantity as the superpopulation parameter \(\tau = N\mu\).
Visualizing model unbiasedness forces us to confront two essential differences between model- and design-based inference.
First, a random selection mechanism is not (strictly) required to justify unbiasedness in this case, so we will treat the sample as fixed. However, some randomness must enter the system for us to conduct inference, and it enters through the random error \(\epsilon_i\). The population we observe is a single realization of the superpopulation model. To visualize model unbiasedness, we will need to generate many populations and approximate \(E_{M}[\hat{\tilde{\tau}}]\).
Second, in nearly all cases, the model parameter \(\mu\) is unobservable; we can only ever observe the population values \(y_i\) and estimate \(\mu\) from these observations. This contrasts with the design-based paradigm, where population parameters can be observed if all units are observed. For the purposes of this post, we will assume that \(\mu\) takes on a known value. We will use \(\mu = \frac{135}{36}\), which is simply the population mean of the realization above, along with a variance of \(\sigma^2 = 4\). Note that, in this case, the unbiasedness condition does not depend on the variance term.
Notice an important difference between this and the design-based setting. In the design-based setting we can indicate the total with a horizontal line because it is a fixed quantity. In the model-based setting, however, \(\tilde{\tau}\) is random. To track the value of \(\tilde{\tau}\) we indicate it with a faint red line for each realization. In black, the expectation of our predictor is approximated using the same Monte Carlo procedure as before:
\[E_{M}[\hat{\tilde{\tau}}] \approx \frac{1}{M} \sum_{j=1}^M \hat{\tilde{\tau}}_j\]and finally, the superpopulation parameter \(\tau\), which is fixed but not observable, is indicated with a dashed horizontal line. Given many iterations, we begin to observe the model-unbiasedness condition \(E_{M}[\hat{\tilde{\tau}}] = \tau\).
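A sketch of this Monte Carlo approximation, holding the sample fixed and generating \(M\) realizations of the population under the assumed parameter values \(\mu = \frac{135}{36}\) and \(\sigma^2 = 4\), might look like:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

N = 36
mu = 135 / 36
sigma = 2.0
tau = N * mu                     # superpopulation parameter (fixed, unobservable)

S = np.array([0, 5, 10, 15, 20, 25, 30, 35])   # sample held fixed across realizations
R = np.setdiff1d(np.arange(N), S)

M = 10_000                       # number of simulated realizations
predictions = np.empty(M)
for j in range(M):
    # Generate a fresh realization of the population from the superpopulation model.
    y = mu + rng.normal(0.0, sigma, size=N)
    mu_hat = y[S].mean()
    predictions[j] = y[S].sum() + len(R) * mu_hat

# Monte Carlo approximation of E_M[tau_tilde_hat]; it should settle near tau = 135.
print(predictions.mean(), tau)
```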
While arguing for design- vs. model-based inference is beyond the scope of this post, it is important to emphasize that model unbiasedness is established under an assumed superpopulation model. This type of assumption is not strictly necessary for design-based inference, and it is typically considered a disadvantage of model-based inference, since a certain level of subjectivity is required on the part of the analyst. See the “Further Reading” section for a number of papers that discuss this topic.
Furthermore, this post did not consider several other important topics, including the mean squared error of the predictor, the role of sampling in model-based inference, and more complex model assumptions common in forest inventory.
Gregoire (1998) compares and contrasts design- and model-based inference. Royall (1970) is an early paper that formalizes model-based inference.
Gregoire, T.G. 1998. Design-based and model-based inference in survey sampling: appreciating the difference. Canadian Journal of Forest Research 28(10): 1429–1447.
Royall, R.M. 1970. On finite population sampling theory under certain linear regression models. Biometrika 57(2): 377–387.
tags: model-based