12 Missing Data

Published

October 6, 2025

12.1 Learning Objectives

After completing this chapter, you will be able to:

Classify missing data mechanisms (MCAR, MAR, MNAR) and understand when each is ignorable for inference.
Recognize and quantify biases from missing data, including survivorship bias, instant history bias, and selection bias.
Apply appropriate techniques for handling missing data (complete case analysis, imputation, weighting, model-based procedures).
Implement multiple imputation workflows and understand their assumptions and uncertainty.
Critically evaluate claims based on incomplete data and perform sensitivity analyses.

Note

This chapter addresses a ubiquitous problem in real-world data analysis: missing data. We’ll see through concrete examples how missing data can severely bias results, learn to classify different mechanisms that generate missingness, and explore practical methods for dealing with incomplete data. The material is enriched with findings from Fung and Hsieh (2000) and Karvanen et al. (2016).

12.2 Introduction: The Hidden Cost of Missing Data

Missing data are ubiquitous in real-world applications. They are caused by limitations and defects of sensors, people failing to respond to surveys, entities disappearing in the middle of a long-term study, and countless other reasons.

Data that are missing randomly are often benign: the law of large numbers and the central limit theorem help manage their impact. But data that are missing systematically can be a serious problem, as the following examples will demonstrate.

12.2.1 Opening Example: Investment Fund Performance

Consider three hypothetical investment funds, all established simultaneously and investing in risky assets. After 3 years:

Fund A has done well, reaching an average annual return of 10%
Fund B has done OK, reaching an average annual return of 5%
Fund C has been terrible, making an average annual loss of -3%

It is unlikely that investors would stick with Fund C, so it is likely to have made a quiet exit – for example, by returning the funds to investors and liquidating.

Question: What is the average performance of the funds?

Looking at surviving funds only: (10 + 5)/2 = 7.5% per year
Looking at all funds that started: (10 + 5 - 3)/3 = 4% per year

The difference – a seemingly small 3.5 percentage points – is an example of survivorship bias. Over time, this translates to millions in misallocated capital if investors rely on biased performance data.

Survivorship Bias is Everywhere

Survivorship bias happens everywhere entities in a long-term study can selectively disappear:

Subjects in long-term medication or diet studies who drop out
Individuals who have reached success or fame (we don’t see those who tried and failed)
Companies that went bankrupt and were removed from stock indices
Startups that shut down before being included in databases

12.2.2 A Historical Example: The Bomber Plane Problem

You probably have heard this story: during World War II, the U.S. military analyzed damage patterns on bomber planes returning from missions to decide where to add armor.

Survivorship bias illustrated: bullet hole patterns on returning aircraft. The red areas show where returning planes were hit. Figure by Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept); CC BY-SA 4.0.

The intuitive approach would be to reinforce the areas with the most bullet holes – after all, those are clearly getting hit frequently. Instead, statistician Abraham Wald made a crucial observation: the data showed where planes could take damage and still return. The planes that were hit in other locations didn’t make it back and weren’t in the dataset.

According to this insight, engineers should reinforce the areas without bullet holes. Those were the critical spots where damage meant the plane never returned. This is a classic example of selection bias in missing data – the missing observations (planes that didn’t return) contained the most important information.

Finnish Terminology Reference

For Finnish-speaking students, here’s a reference table of key terms:

English	Finnish
Missing data	Puuttuva tieto / Puuttuvat havainnot
Survivorship bias	Selviytymisharha
Selection bias	Valikoitumisharha / Valintaharha
Instant history bias	Takautuvan historian harha
Missing completely at random (MCAR)	Täysin satunnaisesti puuttuva
Missing at random (MAR)	Satunnaisesti puuttuva
Missing not at random (MNAR)	Ei-satunnaisesti puuttuva
Ignorable	Sivuutettava
Imputation	Imputaatio / Puuttuvien arvojen korvaaminen
Multiple imputation	Moninkertainen imputointi
Censoring	Sensurointi
Complete case analysis	Täydellisten havaintojen analyysi

12.3 Case Studies: When Missing Data Misleads

12.3.1 Financial Performance Biases

All standard financial indices are susceptible to survivorship bias: unsuccessful companies and funds are regularly removed and replaced by more successful competitors.

The issue is made worse by instant history bias: when successful funds are added to a database, their historical returns from their incubation period are backfilled, but only the winners make it to the database.

Observed Hedge Fund Performance Biases

Fung and Hsieh (2000) provide quantitative estimates of these biases using hedge fund data:

Survivorship and instant history bias in hedge fund performance data. The table shows estimates from the TASS database for both hedge funds and commodity trading advisors (CTAs). Figure from Fung and Hsieh (2000).

Key findings:

Survivorship bias in hedge fund performance estimates: approximately 3% per year
Instant history bias: around 1.4% per year for hedge funds, 3.6% per year for CTAs
Dropout rates: About 20% per year for offshore hedge funds, 19% per year for CTAs

Further bias is caused by selection bias, where fund managers can decide whether to include their fund in the indices, with better-performing funds more likely to be included.

The Funds-of-Funds Solution

One way to understand the true bias is to track an investment in a “fund of funds” that invests in every fund in the market. The track records of funds-of-funds:

Retain the investment experience of hedge funds that went out of business due to poor performance
Include hedge funds that stopped reporting to database vendors because of good performance
Don’t suffer from instant history bias (past records are unaffected when adding new funds)

Fung and Hsieh (2000) use this approach to validate their bias estimates, finding that adjusted individual fund returns align well with funds-of-funds returns after accounting for portfolio management costs.

Types of Bias in Fund Data

Let’s break down the different biases more carefully:

Survivorship Bias: Failed funds disappear from databases. Studies comparing surviving funds to all funds (including defunct ones) consistently show that surviving funds have higher average returns – not because they’re better, but because the worst performers are missing from the data.
Instant History Bias: New hedge funds typically undergo an incubation period, trading on money from the managers’ friends and relatives. After compiling good performance, they market themselves to database vendors and hedge fund consultants. When vendors add these funds to their databases, they backfill the earlier returns during the incubation period. The median incubation period for hedge funds is about 343 days (roughly 11 months), during which only the successful funds survive to be added.
Selection Bias: Only funds with good performance want to be included in a database. However, this effect is somewhat mitigated because some managers with superior performance deliberately don’t participate – particularly when they’re not interested in attracting more capital.

Advanced: Multi-Period Sampling Bias

Many databases require funds to have a minimum track record (e.g., 24–36 months) before inclusion. This creates an additional upward bias: funds that survive the initial period tend to have better performance than those that failed early.

Fung and Hsieh (2000) quantify this multi-period sampling bias:

Hedge funds: Approximately +0.6% per year
Commodity trading advisors (CTAs): Approximately +0.1% per year

Practical implication: When constructing performance indices or conducting research, one should apply bias adjustments before imposing minimum history requirements. Otherwise, they would compound the survivorship bias with this additional selection effect.

12.3.2 Missing Responses in Health Surveys

Our second major example comes from population health surveillance. The FINRISK 2007 study investigated whether survey participants accurately represent the full population.

The FINRISK 2007 Study Design

In the 2007 study, a random sample of 10,000 individuals aged 25–74 were invited to participate in a health examination survey in Finland:

6,257 individuals participated (63% participation rate – globally remarkably high)
Both participants and non-participants were followed up for deaths and hospitalizations for 5 years

However, the participants did not appear to fully represent the study population. The full population had substantially higher death rates:

Estimates of health indicators for different participation groups.

Subset	Heavy alcohol users % (95% CI)	Daily smokers % (95% CI)	BMI ≥ 30 kg/m² % (95% CI)	Systolic blood pressure ≥ 140 mm Hg % (95% CI)	Deaths per 1,000 (95% CI)	Hospitalizations per 1,000 (95% CI)
Full cohort	—	—	—	—	33.0 (29.5–36.5)	1,157 (1,098–1,215)
Participants	5.2 (4.7–5.8)	21.8 (20.7–22.8)	21.0 (20.0–22.0)	30.7 (29.5–31.8)	22.1 (18.5–25.7)	1,052 (989–1,115)

A subset of health indicators by participation group. Participants and non-participants show markedly different mortality and hospitalization patterns (bolded). Data from Table 5 of Karvanen et al. (2016).

This raises serious questions about the validity of health indicators estimated using only the survey respondents.

Estimating Health Indicators Without Ground Truth

The primary goal of the FINRISK study was to estimate population-level prevalences of key health indicators:

Heavy alcohol use (self-reported questionnaire data)
Daily smoking (self-reported questionnaire data)
Obesity (BMI \geq 30 kg/m^2, measured)
High blood pressure (systolic BP \geq 140 mm Hg, measured)
Elevated total cholesterol (\geq 5.0 mmol/L, measured)

These variables are only available for study participants. We have no direct way to measure smoking or alcohol use for the 37% who didn’t participate. We cannot simply use participant-only estimates because, as the mortality data shows, non-participants are fundamentally different from participants.

While we don’t have ground truth for health behaviors among non-participants, we do have complete data on deaths and hospitalizations for everyone – both participants and non-participants. These follow-up outcomes provide a rare opportunity to validate different statistical methods by checking whether they recover the known truth. If a method works for deaths/hospitalizations (where we know the answer), we can have more confidence in its estimates for smoking and alcohol use (where we don’t).

Recontact Data

After the initial health examination closed, the FINRISK study team mailed the same questionnaire to non-participants and asked them to return it by mail (without attending the physical examination). This differs from earlier reminders – it aimed to collect at least questionnaire data on smoking, alcohol use, and other self-reported behaviors from those who wouldn’t participate fully.

An additional 473 non-participants returned the questionnaire (13% recontact response rate).

This created three distinct participation groups:

Estimates of health indicators for different participation groups.

Subset	Heavy alcohol users % (95% CI)	Daily smokers % (95% CI)	BMI ≥ 30 kg/m² % (95% CI)	Systolic blood pressure ≥ 140 mm Hg % (95% CI)	Deaths per 1,000 (95% CI)	Hospitalizations per 1,000 (95% CI)
Full cohort	—	—	—	—	33.0 (29.5–36.5)	1,157 (1,098–1,215)
Participants	5.2 (4.7–5.8)	21.8 (20.7–22.8)	21.0 (20.0–22.0)	30.7 (29.5–31.8)	22.1 (18.5–25.7)	1,052 (989–1,115)
Recontact respondents	6.4 (4.1–8.7)	33.4 (29.1–37.7)	—	—	55.1 (34.6–75.7)	1,357 (1,043–1,670)

A subset of health indicators broken down by all three participation groups. The three groups show quite different mortality and hospitalization patterns. Data from Table 5 of Karvanen et al. (2016).

Can the recontact respondents help us correct the selection bias in the participant-only estimates? To answer this, we need a formal framework for handling missing data.

A Causal Model for the Missing Data Process

Karvanen et al. (2016) built a causal model to describe the process responsible for missing responses:

Causal model with design for the FINRISK 2007 study. Variables include background information Z, questionnaire variables X, physical measurements Y, and participation indicators M_1 and M_2. Dashed edges marked ‘A’ and ‘B’ represent assumptions about the missing data mechanism. Figure from Karvanen et al. (2016).

Legend for the causal model. Open circles are unobserved variables, filled circles are observed variables, and diamonds are indicator/decision variables. Figure from Karvanen et al. (2016).

The model includes two key assumptions:

Assumption A: Among non-participants, the recontact response is independent of unobserved health variables X and Y, conditional on background variables Z (age, sex, region). This assumption was tested empirically using follow-up data and found to hold – recontact response did not predict mortality or morbidity among non-participants when adjusted for age and sex.

Assumption B: Physical measurements Y are independent of participation M_1 given questionnaire variables X and background Z. This assumption cannot be tested with available data but is necessary for estimating prevalences of physical health indicators.

The Multiple Imputation Approach

Using these assumptions, Karvanen et al. (2016) developed a multiple imputation method tailored to this missing data mechanism (MI-MNAR).¹

Checking the method: To assess whether MI-MNAR works, they compared different approaches using deaths and hospitalizations – outcomes where ground truth was available for the full cohort:

Participants-only estimates: Deaths 22.1 per 1,000 (true value: 33.0) \times
MI-MAR (assumes missing at random): Deaths 23.6 per 1,000 \times
MI-MNAR (accounts for non-random missingness): Deaths 39.8 per 1,000 (CI overlaps 33.0) \checkmark

The participants-only and MI-MAR confidence intervals failed to capture the true values, revealing systematic bias. The MI-MNAR method performed better: its confidence intervals contained the true values for both deaths and hospitalizations. While this is suggestive rather than definitive validation, it provides evidence that the modeling assumptions are reasonable.

This encouraging performance on outcomes with known truth provides some confidence for applying MI-MNAR to estimate smoking and alcohol use (where we don’t have ground truth for non-participants):

Estimates of health indicators for different participation groups and population-level estimates with three MI methods.

Subset / Method	Heavy alcohol users % (95% CI)	Daily smokers % (95% CI)	BMI ≥ 30 kg/m² % (95% CI)	Systolic blood pressure ≥ 140 mm Hg % (95% CI)	Deaths per 1,000 (95% CI)	Hospitalizations per 1,000 (95% CI)
Full cohort	—	—	—	—	33.0 (29.5–36.5)	1,157 (1,098–1,215)
Participants	5.2 (4.7–5.8)	21.8 (20.7–22.8)	21.0 (20.0–22.0)	30.7 (29.5–31.8)	22.1 (18.5–25.7)	1,052 (989–1,115)
Recontact respondents	6.4 (4.1–8.7)	33.4 (29.1–37.7)	—	—	55.1 (34.6–75.7)	1,357 (1,043–1,670)
MI-MNAR	6.8 (5.6–8.1)	27.1 (24.8–29.3)	20.8 (19.7–22.0)	29.3 (28.1–30.6)	39.8 (32.7–47.0)	1,303 (1,175–1,431)
MI-MAR	5.5 (4.8–6.3)	23.4 (22.3–24.6)	20.4 (19.3–21.6)	29.1 (27.9–30.3)	23.6 (19.7–27.5)	1,027 (968–1,085)
MI-MAR, no recontact	5.4 (4.7–6.0)	23.4 (22.3–24.6)	20.1 (19.0–21.3)	29.1 (27.9–30.2)	23.4 (19.7–27.1)	1,030 (966–1,095)

A subset of health indicator estimates using different methods. The MI-MNAR method (which properly accounts for the non-random missing data mechanism) yields estimates that differ substantially from participants-only estimates for smoking and alcohol use. Data from Table 5 of Karvanen et al. (2016).

Notice the pattern: MI-MNAR estimates for smoking (27.1%) fall between participants (21.8%) and recontact respondents (33.4%), which makes sense given that non-participants are at higher risk. For obesity, blood pressure, and cholesterol, all methods agree because the recontact questionnaire lacked good proxy questions for these physical measurements.

Key insight: Even though only 13% of non-participants responded to recontact, this data dramatically improved the estimates. The assumption that recontact respondents can proxy all non-participants (conditional on age, sex, and region) found support in the mortality and hospitalization data, where this assumption led to estimates consistent with known truth.

The Value of Recontact

The FINRISK example demonstrates that all efforts to collect data on non-participants are likely to be useful, even if the recontact response rate remains low. A 13% response rate might seem discouraging, but it provided crucial information for correcting selection bias that would otherwise remain hidden.

12.4 Types and Mechanisms of Missing Data

Missing data can be classified in two complementary ways. Understanding both perspectives is crucial for choosing appropriate analysis methods.

12.4.1 Sources of Missing Data

First, we can classify missing data by how or why they arise. There are three common sources:

Censoring: Values exceed the range of possible observations. This could be due to limitations of a physical sensor, or due to limited follow-up time (e.g., a participant is still alive at the end of a medical study, so their death date is censored).
Unobserved in random sampling: In a study asking participants to rate 10 randomly selected movies, ratings for other movies are unobserved due to the study design – a form of planned missingness.
Unobserved in non-random sampling: Among the 10 selected movies, participants may fail to rate some. This is non-random missingness that may depend on whether they’ve seen the movie, liked it, etc.

Example: Classifying Sources of Missing Data

Classify each of the following into censoring, unobserved in random sampling, or unobserved in non-random sampling:

Faulty sensors cannot record extreme values
Limited number of sensors means not all locations are monitored
Medical tests are not performed on patients who appear healthy
Survey respondents are not included because they don’t have phones
Survey participants decline to answer sensitive questions
In a movie rating study, only 10 randomly selected movies are rated per participant
In the same study, participants skip rating movies they haven’t seen

Solution

Censoring: The sensor limitation causes systematic missingness for values beyond its range.
Unobserved in random sampling: If sensor placement is random, unmonitored locations are a planned random sample.
Unobserved in non-random sampling: Decision not to test depends on health status (unobserved outcome), creating non-random missingness.
Censoring or unobserved in non-random sampling: Lack of phone creates a systematic barrier; those without phones may differ systematically from those with phones.
Unobserved in non-random sampling: Non-response to sensitive questions likely correlates with the answer itself (e.g., those engaged in sensitive behavior may be less likely to report).
Unobserved in random sampling: By design, only a random subset of movies is rated.
Unobserved in non-random sampling: Whether someone has seen a movie is not random – it relates to preferences, which likely correlate with ratings.

12.4.2 Missing Data Mechanisms (MCAR, MAR, MNAR)

The classification above describes where missing data come from (operational perspective). We now turn to a complementary classification that describes the statistical relationship between missingness and data values (analytical perspective).

The mechanism classification has three categories:

MCAR (Missing Completely at Random): Missingness is independent of all data values
MAR (Missing at Random): Missingness depends on observed values only
MNAR (Missing Not at Random): Missingness depends on unobserved values

Key Distinction: Sources vs. Mechanisms

Sources (previous section): How or why does missingness arise? (censoring, study design, selective dropout)
Mechanisms (this section): Does missingness depend on data values? If so, which ones?

These are independent classification systems. The same source can correspond to different mechanisms:

Censoring (source) can be MCAR if sensors fail randomly, or MNAR if extreme values cause failures
Non-random sampling (source) is typically MAR or MNAR depending on what drives the dropout

Understanding the mechanism (MCAR/MAR/MNAR) determines which statistical methods are valid. We now define these three mechanisms formally.

Setup and Notation

Consider a dataset Y = (Y_{\text{obs}}, Y_{\text{mis}}), where:

Y_{\text{obs}} are the observed elements
Y_{\text{mis}} are the missing elements
R denotes response indicators: R_i = 1 if element i was observed, R_i = 0 if missing

The missingness mechanism is characterized by the probability distribution p(R \mid Y) – the probability that particular values are missing, given the full data (both observed and missing parts). The missingness of Y can be divided into 3 cases:

1. Missing Completely at Random (MCAR)

Missing Completely at Random (MCAR)

Data are MCAR if p(R \mid Y) = p(R). That is, R is independent of Y – the probability of missingness doesn’t depend on any data values, observed or missing.

Examples:

A lab technician accidentally spills coffee on a random stack of patient files
Survey responses are lost in transit by postal service errors
Database corruption affects randomly selected records

2. Missing at Random (MAR)

Missing at Random (MAR)

Data are MAR if p(R \mid Y) = p(R \mid Y_{\text{obs}}). The probability of a value being missing depends only on Y_{\text{obs}} but is independent of the missing values Y_{\text{mis}} themselves.

Examples:

Urban residents are more likely to respond to online surveys than rural residents (and location is recorded)
Younger participants are less likely to respond to follow-up questionnaires (and age is known)
People with higher education levels provide more complete financial data (and education is observed)

3. Missing Not at Random (MNAR)

Missing Not at Random (MNAR)

Data are MNAR if p(R \mid Y) depends on Y_{\text{mis}} as well as possibly Y_{\text{obs}}. The probability of missingness depends on the unobserved values themselves.

Examples:

People with severe symptoms are less likely to complete a symptom severity questionnaire
Individuals with very low test scores are more likely to skip remaining test questions
Employees with poor performance ratings are less likely to complete exit surveys
Fund C in our opening example (poor performance causes dropout)

Example: Classifying Missing Data Mechanisms

An e-commerce site is recording activity data from all its customers. Which missing data mechanism applies in the following settings?

Records for a period of time are lost because a disk got full.
Software error caused activity not to be recorded for customers who used a specific payment system.
No records can be made from users of an ad-blocker.

Solution

MCAR: The disk failure is unrelated to any customer characteristics or behavior. Missingness is completely random.
MAR: Missingness depends on the payment system used, which is an observed variable (the site knows which payment system each customer uses). Once we condition on payment system, missingness is random.
MNAR: Ad-blocker usage is typically unobserved by the site (that’s the point of ad-blockers!). The missingness depends on an unmeasured characteristic that likely correlates with the user’s privacy preferences and behavior.

12.4.3 Ignorability

The key statistical question about missing data is whether the mechanism is ignorable for a particular analysis.

A missingness mechanism is ignorable if we can perform valid inference without explicitly modeling the missingness process. Whether a mechanism is ignorable depends both on the missing data mechanism (MCAR, MAR, or MNAR) and on the statistical method used.

Complete case analysis (also called listwise deletion) is the simplest approach: analyze only observations with no missing values, discarding any row with missing data. Whether this approach is valid depends on the missing data mechanism:

MCAR is ignorable for all standard methods, including complete case analysis
MAR is ignorable for likelihood-based and multiple imputation methods, but not for complete case analysis
MNAR is generally not ignorable – we must explicitly model the missingness or make strong untestable assumptions

Why does ignorability depend on both the mechanism and the method?

MCAR: Imagine randomly tossing out some survey responses. The remaining responses are still representative – just a smaller random sample. Any standard analysis works fine. Complete case analysis is safe because we’re just working with a random subsample.

MAR: Now imagine younger people are less likely to respond, but we know everyone’s age. If we account for age properly (e.g., by including it in our model), we can correct for the bias. Methods like multiple imputation do this conditioning automatically. But complete case analysis doesn’t – it just throws away the younger non-respondents, leaving us with an unrepresentative older sample.

MNAR: Suppose people with severe symptoms are less likely to complete a symptom questionnaire, and we don’t observe their symptoms. There’s no variable we can condition on to fix this – the bias is baked into what we can’t see. We must either explicitly model why people didn’t respond (which requires strong assumptions about the unobserved) or find auxiliary data (like the FINRISK follow-up data).

The technical definition of ignorability involves how the likelihood factors.

Let $L(\theta; Y_{\text{obs}}, R)$ be the likelihood given observed data $Y_{\text{obs}}$ and missingness indicators $R$. The joint distribution factors as:

\[f(Y, R \mid \theta, \phi) = f(Y \mid \theta) f(R \mid Y, \phi)\]

where $\theta$ are parameters of interest and $\phi$ are parameters of the missingness mechanism.

Ignorability holds when:

The parameters $\theta$ and $\phi$ are distinct (variation in one doesn’t constrain the other)
The missingness is MAR: $f(R \mid Y, \phi) = f(R \mid Y_{\text{obs}}, \phi)$

Under these conditions, we can base inference on $f(Y_{\text{obs}} \mid \theta)$ and ignore $f(R \mid Y_{\text{obs}}, \phi)$.

For MCAR, we have even stronger ignorability: $f(R \mid Y, \phi) = f(R \mid \phi)$, so missingness is completely independent of the data.

To see ignorability in action, we simulate a simple regression problem: predicting $y$ from $x$ where the true relationship is $y = 2 + 3x + \text{noise}$. We know the true slope is 3.0. We then artificially impose three different missingness patterns and see whether complete case analysis recovers the true slope:

MCAR: Randomly delete 30% of observations (independent of any values)
MAR: Delete observations based on $x$ values (higher $x$ → more likely to be missing)
MNAR: Delete observations based on $y$ values (higher $y$ → more likely to be missing)

Since we control the data generation, we can compare the estimated slopes to the known truth. We repeat each scenario 100 times to assess variability and compute 95% confidence intervals.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
n = 10000
n_sims = 100

# True data generating process: y = 2 + 3x + noise
true_beta = 3.0

# Three missingness mechanisms
def create_mcar_mask(x, y, rng, p=0.3):
    """MCAR: Missingness independent of everything"""
    return rng.random(len(x)) > p

def create_mar_mask(x, y, rng):
    """MAR: Missingness depends on observed x"""
    p = 1 / (1 + np.exp(-2*x))  # Higher x => much more missing
    return rng.random(len(x)) > p

def create_mnar_mask(x, y, rng):
    """MNAR: Missingness depends on unobserved y"""
    p = 1 / (1 + np.exp(-y/1.5))  # Higher y => more missing
    return rng.random(len(x)) > p

# Run simulations
results = {'MCAR': [], 'MAR': [], 'MNAR': []}

for i in range(n_sims):
    # Generate data
    x = rng.normal(size=n)
    y = 2 + 3*x + rng.normal(size=n)

    # Estimate under each mechanism
    for name, mask_fn in [('MCAR', create_mcar_mask),
                          ('MAR', create_mar_mask),
                          ('MNAR', create_mnar_mask)]:
        mask = mask_fn(x, y, rng)
        slope = stats.linregress(x[mask], y[mask]).slope
        results[name].append(slope)

# Compute means and 95% CIs
means = [true_beta] + [np.mean(results[k]) for k in ['MCAR', 'MAR', 'MNAR']]
cis = [[true_beta, true_beta]] + [np.percentile(results[k], [2.5, 97.5]) for k in ['MCAR', 'MAR', 'MNAR']]
ci_lower = [means[i] - cis[i][0] for i in range(len(means))]
ci_upper = [cis[i][1] - means[i] for i in range(len(means))]

# Visualize the bias
fig, ax = plt.subplots(figsize=(7, 4.5))
mechanisms = ['True β', 'MCAR', 'MAR', 'MNAR']
colors = ['black', 'green', 'orange', 'red']

bars = ax.bar(mechanisms, means, color=colors, alpha=0.7, edgecolor='black',
              yerr=[ci_lower, ci_upper], capsize=5, error_kw={'linewidth': 2})
ax.axhline(true_beta, color='black', linestyle='--', linewidth=1, alpha=0.5,
           label='True value')
ax.set_ylabel('Slope estimate (β)')
ax.set_title('Complete Case Analysis: Bias and Uncertainty by Missing Data Mechanism')
ax.set_ylim([2.5, 3.5])

# Add value labels
for i, (bar, val) in enumerate(zip(bars, means)):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.02,
            f'{val:.3f}', ha='center', va='bottom', fontsize=9)

plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print(f"True slope: {true_beta:.3f}")
for name in ['MCAR', 'MAR', 'MNAR']:
    mean = np.mean(results[name])
    ci_low, ci_high = np.percentile(results[name], [2.5, 97.5])
    print(f"{name}: {mean:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}], bias: {mean - true_beta:+.3f})")

True slope: 3.000
MCAR: 3.001 (95% CI: [2.982, 3.025], bias: +0.001)
MAR: 3.000 (95% CI: [2.966, 3.034], bias: -0.000)
MNAR: 2.794 (95% CI: [2.757, 2.836], bias: -0.206)

The simulation demonstrates key differences across mechanisms:

MCAR is unbiased – the confidence interval contains the true value
MAR is unbiased in this specific example due to a special case: when missingness depends only on the predictor $x$ in a regression, complete case analysis for the slope coefficient happens to be unbiased. However, this is an exception, not the rule. Complete case analysis under MAR is biased for most estimands (means, variances, correlations) and would be biased even for regression if missingness depended on an auxiliary variable or if we were estimating intercepts or predictions.
MNAR shows substantial bias because missingness directly depends on the unobserved outcome variable $y$, systematically excluding high values

Key takeaway: Complete case analysis is safe only under MCAR. Under MAR, complete case analysis is generally biased – the regression slope example above is a rare exception. Under MNAR, complete case analysis is almost always biased.

The Bottom Line: Ignoring the mechanism when it’s not ignorable leads to biased results. Always assess the likely mechanism before choosing an analysis method.

Common Misunderstanding

Even if a variable is fully observed, complete case analysis can bias its estimate when missingness in other variables depends on it. Deleting rows with any missing values changes the effective sampling design.

For example, if income is fully observed but wealthy people are less likely to respond to other survey questions, complete case analysis will undersample wealthy individuals and bias the income estimate downward.

12.5 Techniques for Dealing with Missing Data

Now that we understand missing data mechanisms, we turn to methods for handling incomplete data. The appropriate method depends on the mechanism and the analysis goals.

12.5.1 Overview of Approaches

Some common ways for dealing with missing data include:

Complete case analysis (listwise deletion): Ignore entries with missing data
Imputation: Fill in the missing values with estimated values
Weighting: Reweight complete cases to represent the full sample
Model-based procedures: Build a model for the observed data that accounts for missingness

We’ll explore each of these approaches.

12.5.2 Complete Case Analysis

Complete case analysis, as we saw before, simply discards any observation with missing data and analyzes the remaining complete observations.

When appropriate: MCAR with low missingness (a common rule of thumb is < 5%, though this depends on sample size and precision requirements)

Advantages:

Simple to implement
Honest uncertainty (standard errors reflect the actual sample size)
No additional modeling assumptions

Disadvantages:

Biased under MAR/MNAR
Loss of efficiency (throwing away data)
Different analyses may use different subsets of data (if missingness patterns differ across variables)

Example: In our fund performance example, using only surviving funds is complete case analysis. It’s severely biased because missingness (fund liquidation) depends on the outcome (poor performance).

Why the “Low Missingness” Threshold?

Even when MCAR holds, complete case analysis is only recommended when missingness is low (a common rule of thumb is < 5%, though this depends on sample size and study objectives). Two key considerations:

Efficiency: Even under MCAR, discarding observations wastes data. With very low missingness, you lose minimal precision. With higher missingness (e.g., 20-40%), you’re throwing away substantial information and widening confidence intervals unnecessarily.
Assumption credibility: The MCAR assumption is harder to defend when missingness is high. If you have substantial missing data, it’s worth investing effort in methods that work under weaker assumptions (MAR).

12.5.3 Imputation Methods

Imputation refers to a class of techniques for replacing missing values with estimated values. This makes subsequent analysis easy – after imputation, we have a seemingly complete dataset.

However, imputation comes with a serious warning:

The Seduction and Danger of Imputation

The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the real and imputed data have substantial biases.

– Dempster and Rubin (1983)

Simple (Single) Imputation

Common simple imputation methods include:

Mean/median imputation: Replace missing values with the mean or median of observed values
Hot deck imputation: Randomly select a similar record and use its value
Regression prediction: Predict missing values using observed relationships

The fundamental problem with single imputation: Treating imputed values as if they were observed leads to overconfident inference. The downstream analysis method is not aware of the larger uncertainty in the imputed values.

Additional caveats:

Replacing missing values using observed values for the same variable reduces variability – estimates of variation are biased
Imputation methods based on predicting missing values may work poorly if extrapolation is needed
Mean imputation breaks relationships between variables (imputed values have zero correlation with other variables)

Example: How Mean Imputation Distorts Variance

To see the problem concretely, suppose we have data on a variable X with true mean 50 and standard deviation 10. We randomly make 30% of values missing (MCAR), then impute them using the sample mean. What happens to the distribution?

Show code

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(123)

# True data
n = 100
X_true = rng.normal(50, 10, n)

# Create MCAR missingness (30% missing)
missing_mask = rng.random(n) < 0.3
X_observed = X_true.copy()
X_observed[missing_mask] = np.nan

# Mean imputation
X_mean_imputed = X_observed.copy()
mean_val = np.nanmean(X_observed)
X_mean_imputed[missing_mask] = mean_val

# Compare distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: Histograms
ax1.hist(X_true, bins=15, alpha=0.5, label='True data', edgecolor='black')
ax1.hist(X_mean_imputed, bins=15, alpha=0.5, label='After mean imputation', edgecolor='black')
ax1.axvline(mean_val, color='red', linestyle='--', label=f'Imputed value ({mean_val:.1f})')
ax1.set_xlabel('X')
ax1.set_ylabel('Frequency')
ax1.set_title('Mean Imputation Distorts the Distribution')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Variance comparison
variances = {
    'True data': np.var(X_true),
    'Observed only': np.nanvar(X_observed),
    'Mean imputed': np.var(X_mean_imputed)
}

bars = ax2.bar(range(len(variances)), list(variances.values()),
               color=['black', 'blue', 'red'], alpha=0.7, edgecolor='black')
ax2.set_xticks(range(len(variances)))
ax2.set_xticklabels(list(variances.keys()), rotation=15, ha='right')
ax2.set_ylabel('Variance')
ax2.set_title('Mean Imputation Artificially Reduces Variance')
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar, val in zip(bars, variances.values()):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 2,
            f'{val:.1f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("Variance comparison:")
for name, var in variances.items():
    print(f"  {name:15s}: {var:.2f}")
print(f"\nVariance reduction: {(1 - variances['Mean imputed']/variances['True data'])*100:.1f}%")

Variance comparison:
  True data      : 80.10
  Observed only  : 81.85
  Mean imputed   : 49.11

Variance reduction: 38.7%

The spike at the mean in the imputed distribution reveals the problem: all 30 imputed values are identical (the mean), while the true values would have varied. This artificially reduces variance by a substantial fraction, leading to overconfident inference – confidence intervals that are too narrow and p-values that are too small.

Multiple Imputation (MI)

Multiple imputation addresses the fundamental flaw of single imputation: treating imputed values as known. Instead of creating one “completed” dataset, MI generates multiple plausible versions (typically m = 20-50) by drawing from the distribution of possible values given the observed data. Each imputed dataset is analyzed separately using standard methods, and results are combined using Rubin’s rules², which properly account for two sources of uncertainty: variation between imputations (uncertainty about missing values) and variation within imputations (sampling uncertainty).

Modern software packages (e.g., mice in R, IterativeImputer in Python) make implementation straightforward.

Connection to Bootstrap and Bayesian Methods

MI resembles both bootstrap and Bayesian inference:

Like bootstrap, MI generates multiple datasets and pools results to capture uncertainty
Like Bayesian methods, MI draws from a distribution of plausible values (though MI uses a frequentist framework for combining results via Rubin’s rules rather than full posterior inference)

In fact, fully Bayesian approaches to missing data are ideal but computationally intensive. MI can be viewed as an efficient approximation that works within standard frequentist workflows.

12.5.4 Weighting Methods

The idea behind weighting is to reweight the complete cases so they represent the full sample.

Inverse probability weighting (IPW):

Estimate the probability of being observed: \hat{p}_i = P(R_i = 1 \mid X_i)
Weight each complete case by w_i = 1/\hat{p}_i
Analyze weighted data

Under MAR with a correctly specified model for P(R = 1 \mid X), this gives unbiased estimates.

Example: Sex Ratio Correction in Height Estimation

Assume we wish to estimate the average height of people living in a region. We gather a sample:

70 females with average height of 166 cm
30 males with average height of 180 cm

The sample average is: \overline{X} = \frac{70 \cdot 166 + 30 \cdot 180}{100} = 170.2 \text{ cm}

However, if we know the true sex distribution is around 50:50 (not 70:30), we would get a better estimate of the population average by reweighting the group averages with the population frequencies: \overline{X}_{\text{rw}} = \frac{50 \cdot 166 + 50 \cdot 180}{100} = 173.0 \text{ cm}

This is a simple example of weighting by known population distributions.

Similar approaches are commonly used with other background variables with known population distributions: age, socioeconomic status, political preferences, etc.

12.5.5 Model-Based Procedures

Model-based procedures define a statistical model for the observed data and infer the model parameters using standard techniques (often maximum likelihood via the EM algorithm, or Bayesian methods).

Such approaches can lead to more accurate results when the incompleteness of the data is properly accounted for in the model. However:

The procedures must be adapted to deal with missing data
Existing code and tools often cannot be used as-is
Model misspecification³ can lead to bias

The FINRISK model discussed earlier is an example of a model-based procedure – it explicitly modeled the participation process and used multiple imputation within that framework.

Advanced: Selection Models vs. Pattern-Mixture Models

When modeling MNAR data, there are two main frameworks:

Selection models (like the FINRISK approach) factor the joint distribution as: f(Y, R) = f(Y) \cdot f(R \mid Y) They model the data generating process and then the selection/missingness mechanism conditional on the data.

Pattern-mixture models take the opposite factorization: f(Y, R) = f(Y \mid R) \cdot f(R) They model the distribution of data conditional on the missingness pattern, then combine across patterns.

Both approaches require untestable assumptions about the MNAR mechanism. In practice:

Selection models are more common and intuitive but can be computationally challenging
Pattern-mixture models can be easier to specify but require careful thought about combining patterns
Sensitivity analysis is essential – vary assumptions about the MNAR mechanism and assess how conclusions change
Implementation typically requires custom likelihood functions or Bayesian modeling tools (e.g., PyMC)

12.6 Practical Implementation

12.6.1 Decision Framework

When faced with missing data, follow this workflow:

1. Assess the missingness:

How much data is missing overall?
Which variables have missing values?
Are there patterns (e.g., variables missing together)?
What is the likely mechanism (MCAR, MAR, MNAR)?

2. Choose your method:

MCAR with low missingness: Complete case analysis is acceptable (rule of thumb: < 5%, but depends on sample size and precision needs)
MCAR with higher missingness: MI for efficiency gains
MAR: MI with rich predictors (include auxiliary variables)
MNAR: Requires special methods – sensitivity analysis, auxiliary data collection (like recontact), or explicit modeling of the missingness mechanism

3. Validate your approach:

Compare observed vs. imputed distributions (imputed values should look plausible)
Perform sensitivity analyses by varying assumptions
When possible, check against external data (like the FINRISK follow-up data)
Compare MI-MAR vs. MI-MNAR results to understand the impact of MNAR assumptions

Reporting Guidelines

Transparency is essential when dealing with missing data. A complete report should include:

Amount and pattern of missingness: Report overall percentage and range across variables
Assumed mechanism and justification: State whether you assume MCAR, MAR, or MNAR and why
Method used: Specify whether you used complete case, MI, weighting, or model-based methods
Implementation details:
- For MI: number of imputations, imputation models, predictors used
- For weighting: how weights were estimated
- Software and packages used
Sensitivity analyses: Report results under different assumptions
Impact assessment: Describe how results differ from complete case analysis

12.6.2 Prevention is Better than Correction

The best way to handle missing data is to prevent it. A researcher can design data collection to minimize missingness:

Shorter surveys with clear, relevant questions (reduces respondent fatigue)
Incentives for completion (monetary, results reports, lottery entries)
Follow-up protocols for non-respondents (like FINRISK’s recontact approach)
Pilot testing to identify problematic questions before full deployment
Collect auxiliary data for non-respondents when possible – even limited data can dramatically improve estimates

The FINRISK example demonstrated that even a 13% recontact response rate provided crucial information for correcting selection bias. All efforts to collect data on non-participants are worthwhile.

When Standard Methods Fail

Be aware of situations where even proper methods struggle:

Very high missingness (> 40%): Results become increasingly dependent on modeling assumptions rather than data
MNAR without auxiliary information: No method can reliably recover the truth without strong, untestable assumptions about the missingness mechanism
Extrapolation beyond observed data: Imputation requiring predictions far outside the range of observed values is inherently unreliable

Fundamental limitation: No statistical method can recover information that isn’t there. If the missing data mechanism is MNAR and you have no auxiliary information about non-respondents, your estimates will be biased. The best you can do is acknowledge the limitation explicitly, perform sensitivity analyses showing results under different assumptions, and discuss the likely direction and magnitude of bias.

12.7 Chapter Summary and Connections

12.7.1 Key Concepts Review

Two classification systems:

Sources (how missingness arises): Censoring, random sampling, non-random sampling
Mechanisms (statistical relationship): MCAR, MAR, MNAR

Missing data mechanisms:

MCAR (Missing Completely at Random): Missingness independent of all data
MAR (Missing at Random): Missingness depends only on observed data
MNAR (Missing Not at Random): Missingness depends on unobserved values

Ignorability:

MCAR: Ignorable for all methods (complete case analysis is unbiased)
MAR: Ignorable for likelihood-based and MI methods, but NOT for complete case analysis
MNAR: Generally not ignorable – requires explicit modeling or auxiliary data

Key methods:

Complete case analysis: Only valid under MCAR with low missingness
Multiple imputation: The workhorse for MAR; generates m datasets, analyzes each, pools with Rubin’s rules
Weighting: Reweight complete cases to represent full sample (inverse probability weighting)
Model-based procedures: Explicitly model the missingness mechanism

Real-world impact:

Survivorship bias (investment funds): Using only surviving funds systematically overstates returns, misleading billions in investment decisions
Selection bias (FINRISK): Non-participants had substantially higher mortality; ignoring them produces dangerously wrong public health estimates of smoking and alcohol use

12.7.2 The Big Picture

Missing data is not a niche technical problem – it’s a fundamental challenge in real-world data analysis that affects investment decisions, public health policy, and life-and-death medical outcomes.

We have principled methods for handling missing data, but they require careful thought and honest assessment. The key: understand the missingness mechanism, choose methods accordingly, validate assumptions when possible, and report transparently about limitations. No statistical method can compensate for very high missingness or MNAR without auxiliary data – prevention and auxiliary data collection (like FINRISK’s recontact effort) are invaluable.

12.7.3 Common Pitfalls to Avoid

Assuming MCAR without evidence: Most real-world missing data are not MCAR. Compare complete vs. incomplete cases before assuming missingness is random.
Naïve imputation: Mean imputation destroys variance; single imputation understates uncertainty. Use proper multiple imputation.
Ignoring auxiliary information: Include variables that predict missingness in imputation models, even if they’re not in your analysis model. The FINRISK recontact data – just 13% response – dramatically improved estimates.
Hiding your assumptions: Document your missingness mechanism assumptions, methods used, and sensitivity analyses. Readers cannot judge validity without this transparency.
Misspecified models and false confidence: Imputation models can introduce bias if misspecified. When MNAR is plausible but you use MAR methods, your uncertainty is understated. Validate and conduct sensitivity analyses.

12.7.4 Chapter Connections

Previous (Chapters 5-11):

Chapters 5-7 developed parametric inference, estimation, and hypothesis testing – all assuming complete data. Missing data violates this assumption and can bias estimates, widen confidence intervals artificially (mean imputation), or lead to invalid p-values
Chapter 9’s regression methods are vulnerable to selection bias when missingness depends on outcomes (like Fund C’s poor performance leading to dropout) – complete case analysis can severely bias coefficient estimates
Chapter 11’s causal inference framework helps understand missing data mechanisms – whether missingness depends on observed vs. unobserved variables determines if we can recover valid estimates

This Chapter: Addressed missing data through three mechanisms (MCAR, MAR, MNAR) and their appropriate methods. We saw how survivorship and selection bias can severely distort inference, and learned when complete case analysis fails versus when multiple imputation or model-based approaches are needed.

Next (Ch. 13):

Chapter 13 (Experimental Design and Surveys) will show how proper study design minimizes missingness and enables causal inference – the principles here inform power calculations, sample size planning, and follow-up protocols to reduce missing data

12.7.5 Self-Test Problems

Mechanism Classification: For each scenario, identify whether the missing data are MCAR, MAR, or MNAR:
1. A thermometer fails to record temperatures below -20°C due to physical limitations
2. Survey respondents from rural areas are less likely to complete an online survey (and location is recorded)
3. People with high anxiety scores are less likely to complete an anxiety questionnaire (and anxiety is what we’re measuring)
Solution
1. Source: Censoring (instrument lower limit). Mechanism: MNAR, because the probability of missingness depends on the unobserved value itself (temperatures below −20°C are deterministically missing).
2. Mechanism: MAR. Missingness depends on location, which is observed; conditional on location, response is random.
3. Mechanism: MNAR. Missingness depends on the unobserved anxiety score itself.
Survivorship Bias Calculation: Suppose a database contains 100 funds at year 0. After 5 years, 60 funds survive. The surviving funds show an average return of 8% per year. If the 40 defunct funds had an average return of -2% per year, what is the survivorship bias?
Solution
- Surviving portfolio: 8% (based on 60 funds)
- True portfolio: (60 × 8% + 40 × (-2%))/100 = (480 - 80)/100 = 4%
- Survivorship bias: 8% - 4% = 4% per year
This is even larger than the hedge fund bias estimated by Fung & Hsieh! If instant history bias adds +1.4% per year, the combined bias would be 5.4% per year.
Complete Case Bias: Assume income and health are positively correlated. A study measures income (fully observed) and health status (missing for 30% of respondents). Healthier people are more likely to respond. Will complete case analysis bias the income estimate? Why or why not?

Solution

Yes, complete case analysis will bias the income estimate even though income is fully observed. Since health and income are positively correlated (e.g., healthier people have higher income due to employment), the complete cases will oversample healthier individuals and thus oversample higher-income individuals. Deleting rows changes the effective sampling design, biasing estimates even for fully observed variables.
Ignorability and Valid Methods: Fill in which methods produce valid inference without explicitly modeling the missingness mechanism:

Mechanism Complete Case Analysis Likelihood/ML Methods Multiple Imputation

MCAR ? ? ?

MAR ? ? ?

MNAR ? ? ?

Solution

Mechanism Complete Case Analysis Likelihood/ML Methods Multiple Imputation

MCAR \checkmark \checkmark \checkmark

MAR \times \checkmark \checkmark

MNAR \times \times* \times*

*MNAR requires explicitly modeling the missingness mechanism or bringing auxiliary data/assumptions (selection models, pattern-mixture models, sensitivity analyses).
Method Selection (True/False):
1. With 3% missing data and MCAR, complete case analysis is acceptable.
2. With 30% missing data and MAR, complete case analysis is generally unbiased.
3. With plausible MNAR and no auxiliary information, sensitivity analysis is essential.
Solution
1. True - Under MCAR with low missingness (rule of thumb: < 5%), complete case analysis is valid though less efficient.
2. False - Even under MAR, complete case analysis is biased. The high missingness (30%) makes efficiency loss severe, but MAR methods like multiple imputation are needed for validity.
3. True - With MNAR and no auxiliary data, results are sensitive to untestable assumptions. Sensitivity analysis showing results under different MNAR assumptions is essential for honest inference.

Mechanism	Complete Case Analysis	Likelihood/ML Methods	Multiple Imputation
MCAR	?	?	?
MAR	?	?	?
MNAR	?	?	?

Mechanism	Complete Case Analysis	Likelihood/ML Methods	Multiple Imputation
MCAR	\checkmark	\checkmark	\checkmark
MAR	\times	\checkmark	\checkmark
MNAR	\times	\times*	\times*

12.7.6 Python and R Reference

# Multiple Imputation with statsmodels MICE (imputation + pooling handled internally)
import pandas as pd

# Quick missingness check
df.isnull().sum()             # per-column counts
df.isnull().sum().sum() / df.size  # overall proportion

import statsmodels.api as sm
from statsmodels.imputation.mice import MICEData, MICE

# df: your pandas DataFrame with missing values
md = MICEData(df)  # chained equations setup

# Specify analysis model
model = sm.OLS.from_formula('y ~ x1 + x2', md)

# Fit with multiple imputations; pooled inference returned
res = MICE(model, md, n_imputations=50, model_class=sm.OLS).fit()

# res.summary()  # pooled estimates and standard errors

Other Python options include sklearn.impute.IterativeImputer and miceforest (random forest-based).

library(mice)

# Quick look at missingness
summary(df)
md.pattern(df)

# Multiple Imputation with pooling
imp <- mice(
  data = df,
  m = 50,              # number of imputations
  method = "pmm",      # predictive mean matching for continuous vars
  seed = 123
)

fit <- with(imp, lm(y ~ x1 + x2))
pooled <- pool(fit)
summary(pooled)

# Access a completed dataset if needed:
# complete(imp, action = 1)      # first imputation
# complete(imp, action = "long") # all imputations in long format

Other R packages include Amelia and mi.

12.7.7 Connections to Source Material

This chapter synthesized material from multiple sources. The theoretical framework for missing data mechanisms (MCAR, MAR, MNAR) and ignorability follows Little & Rubin’s foundational Statistical Analysis with Missing Data. Two key empirical studies demonstrated real-world impact: Fung and Hsieh (2000) quantified survivorship and instant history bias in hedge fund databases, while Karvanen et al. (2016) demonstrated selection bias correction in the FINRISK health survey using multiple imputation for MNAR data.

12.7.8 Further Reading

Foundational Text: Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.
Practical Guide: Van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press. Free online
Case Studies: The papers by Fung and Hsieh (2000) and Karvanen et al. (2016) provide excellent examples of missing data in practice.

Remember: Missing data is not just a technical nuisance – it’s a fundamental challenge that can determine whether your analysis leads to valid conclusions or serious errors. Understand the mechanism, choose methods appropriately, and always report your approach transparently!

MI-MNAR = Multiple Imputation for data Missing Not at Random.↩︎
Rubin’s rules provide formulas for combining point estimates and variances from multiple imputed datasets. The total variance accounts for both within-imputation uncertainty (sampling variability) and between-imputation uncertainty (uncertainty about the missing values).↩︎
Model misspecification means using a model that doesn’t match the true data-generating process – e.g., assuming linearity when the relationship is nonlinear, or assuming independence when variables are correlated.↩︎

--- date: today --- # Missing Data ## Learning Objectives After completing this chapter, you will be able to: - **Classify missing data mechanisms** (MCAR, MAR, MNAR) and understand when each is ignorable for inference. - **Recognize and quantify biases** from missing data, including survivorship bias, instant history bias, and selection bias. - **Apply appropriate techniques** for handling missing data (complete case analysis, imputation, weighting, model-based procedures). - **Implement multiple imputation workflows** and understand their assumptions and uncertainty. - **Critically evaluate claims** based on incomplete data and perform sensitivity analyses. ::: {.callout-note} This chapter addresses a ubiquitous problem in real-world data analysis: missing data. We'll see through concrete examples how missing data can severely bias results, learn to classify different mechanisms that generate missingness, and explore practical methods for dealing with incomplete data. The material is enriched with findings from @fung2000performance and @karvanen2016selection. ::: ## Introduction: The Hidden Cost of Missing Data Missing data are ubiquitous in real-world applications. They are caused by limitations and defects of sensors, people failing to respond to surveys, entities disappearing in the middle of a long-term study, and countless other reasons. Data that are missing randomly are often benign: the law of large numbers and the central limit theorem help manage their impact. But data that are missing *systematically* can be a serious problem, as the following examples will demonstrate. ### Opening Example: Investment Fund Performance Consider three hypothetical investment funds, all established simultaneously and investing in risky assets. After 3 years: - **Fund A** has done well, reaching an average annual return of **10%** - **Fund B** has done OK, reaching an average annual return of **5%** - **Fund C** has been terrible, making an average annual loss of **-3%** It is unlikely that investors would stick with Fund C, so it is likely to have made a quiet exit -- for example, by returning the funds to investors and liquidating. **Question**: What is the average performance of the funds? - **Looking at surviving funds only**: (10 + 5)/2 = **7.5%** per year - **Looking at all funds that started**: (10 + 5 - 3)/3 = **4%** per year The difference -- a seemingly small 3.5 percentage points -- is an example of **survivorship bias**. Over time, this translates to millions in misallocated capital if investors rely on biased performance data. ::: {.callout-warning} ## Survivorship Bias is Everywhere Survivorship bias happens everywhere entities in a long-term study can selectively disappear: - Subjects in long-term medication or diet studies who drop out - Individuals who have reached success or fame (we don't see those who tried and failed) - Companies that went bankrupt and were removed from stock indices - Startups that shut down before being included in databases ::: ### A Historical Example: The Bomber Plane Problem You probably have heard this story: during World War II, the U.S. military analyzed damage patterns on bomber planes returning from missions to decide where to add armor. ![Survivorship bias illustrated: bullet hole patterns on returning aircraft. The red areas show where returning planes were hit. Figure by Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept); [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).](../figures/survivorship-bias.png){width=70%} The intuitive approach would be to reinforce the areas with the most bullet holes -- after all, those are clearly getting hit frequently. Instead, statistician [Abraham Wald](https://en.wikipedia.org/wiki/Abraham_Wald) made a crucial observation: the data showed where planes *could* take damage and still return. The planes that were hit in other locations didn't make it back and weren't in the dataset. According to this insight, engineers should reinforce the areas *without* bullet holes. Those were the critical spots where damage meant the plane never returned. This is a classic example of selection bias in missing data -- the missing observations (planes that didn't return) contained the most important information. ::: {.callout-note collapse="true"} ## Finnish Terminology Reference For Finnish-speaking students, here's a reference table of key terms: | English | Finnish | |---------|---------| | Missing data | Puuttuva tieto / Puuttuvat havainnot | | Survivorship bias | Selviytymisharha | | Selection bias | Valikoitumisharha / Valintaharha | | Instant history bias | Takautuvan historian harha | | Missing completely at random (MCAR) | Täysin satunnaisesti puuttuva | | Missing at random (MAR) | Satunnaisesti puuttuva | | Missing not at random (MNAR) | Ei-satunnaisesti puuttuva | | Ignorable | Sivuutettava | | Imputation | Imputaatio / Puuttuvien arvojen korvaaminen | | Multiple imputation | Moninkertainen imputointi | | Censoring | Sensurointi | | Complete case analysis | Täydellisten havaintojen analyysi | ::: ## Case Studies: When Missing Data Misleads ### Financial Performance Biases All standard financial indices are susceptible to survivorship bias: unsuccessful companies and funds are regularly removed and replaced by more successful competitors. The issue is made worse by **instant history bias**: when successful funds are added to a database, their historical returns from their incubation period are backfilled, but only the winners make it to the database. #### Observed Hedge Fund Performance Biases @fung2000performance provide quantitative estimates of these biases using hedge fund data: ![Survivorship and instant history bias in hedge fund performance data. The table shows estimates from the [TASS database](https://en.wikipedia.org/wiki/Trading_Advisor_Selection_System) for both hedge funds and commodity trading advisors (CTAs). Figure from @fung2000performance.](../figures/Fung_Hsieh2000_hedge_fund_bias.png){width=85%} **Key findings**: - **Survivorship bias** in hedge fund performance estimates: approximately **3% per year** - **Instant history bias**: around **1.4% per year** for hedge funds, **3.6% per year** for CTAs - **Dropout rates**: About **20% per year** for offshore hedge funds, **19% per year** for CTAs Further bias is caused by **selection bias**, where fund managers can decide whether to include their fund in the indices, with better-performing funds more likely to be included. ::: {.callout-note collapse="true"} ## The Funds-of-Funds Solution One way to understand the true bias is to track an investment in a "fund of funds" that invests in every fund in the market. The track records of funds-of-funds: - Retain the investment experience of hedge funds that went out of business due to poor performance - Include hedge funds that stopped reporting to database vendors because of good performance - Don't suffer from instant history bias (past records are unaffected when adding new funds) @fung2000performance use this approach to validate their bias estimates, finding that adjusted individual fund returns align well with funds-of-funds returns after accounting for portfolio management costs. ::: #### Types of Bias in Fund Data Let's break down the different biases more carefully: 1. **Survivorship Bias**: Failed funds disappear from databases. Studies comparing surviving funds to all funds (including defunct ones) consistently show that surviving funds have higher average returns -- not because they're better, but because the worst performers are missing from the data. 2. **Instant History Bias**: New hedge funds typically undergo an incubation period, trading on money from the managers' friends and relatives. After compiling good performance, they market themselves to database vendors and hedge fund consultants. When vendors add these funds to their databases, they backfill the earlier returns during the incubation period. The median incubation period for hedge funds is about **343 days** (roughly **11 months**), during which only the successful funds survive to be added. 3. **Selection Bias**: Only funds with good performance want to be included in a database. However, this effect is somewhat mitigated because some managers with superior performance deliberately don't participate -- particularly when they're not interested in attracting more capital. ::: {.callout-note collapse="true"} ## Advanced: Multi-Period Sampling Bias Many databases require funds to have a minimum track record (e.g., 24–36 months) before inclusion. This creates an additional upward bias: funds that survive the initial period tend to have better performance than those that failed early. @fung2000performance quantify this **multi-period sampling bias**: - **Hedge funds**: Approximately **+0.6% per year** - **Commodity trading advisors (CTAs)**: Approximately **+0.1% per year** **Practical implication**: When constructing performance indices or conducting research, one should apply bias adjustments *before* imposing minimum history requirements. Otherwise, they would compound the survivorship bias with this additional selection effect. ::: ### Missing Responses in Health Surveys Our second major example comes from population health surveillance. The FINRISK 2007 study investigated whether survey participants accurately represent the full population. #### The FINRISK 2007 Study Design In the 2007 study, a random sample of **10,000 individuals** aged 25–74 were invited to participate in a health examination survey in Finland: - **6,257 individuals participated** (63% participation rate -- globally remarkably high) - Both participants and non-participants were followed up for **deaths and hospitalizations** for 5 years However, the participants did not appear to fully represent the study population. The full population had substantially higher death rates: ::: {style="font-size: 0.85em;"} **Estimates of health indicators for different participation groups.** | Subset | Heavy alcohol users % (95% CI) | Daily smokers % (95% CI) | BMI ≥ 30 kg/m² % (95% CI) | Systolic blood pressure ≥ 140 mm Hg % (95% CI) | Deaths per 1,000 (95% CI) | Hospitalizations per 1,000 (95% CI) | | --------------------- | -----------------------------: | -----------------------: | ------------------------: | ---------------------------------------------: | ----------------------------------------: | ------------------------: | ----------------------------------: | | Full cohort | — | — | — | — | **33.0** (29.5–36.5) | **1,157** (1,098–1,215) | | Participants | 5.2 (4.7–5.8) | 21.8 (20.7–22.8) | 21.0 (20.0–22.0) | 30.7 (29.5–31.8) | **22.1** (18.5–25.7) | **1,052** (989–1,115) | A subset of health indicators by participation group. Participants and non-participants show markedly different mortality and hospitalization patterns (**bolded**). Data from Table 5 of @karvanen2016selection. ::: This raises serious questions about the validity of health indicators estimated using only the survey respondents. ::: {.callout-note} ## Estimating Health Indicators Without Ground Truth The primary goal of the FINRISK study was to estimate **population-level prevalences** of key health indicators: - **Heavy alcohol use** (self-reported questionnaire data) - **Daily smoking** (self-reported questionnaire data) - **Obesity** (BMI $\geq$ 30 kg/m$^2$, measured) - **High blood pressure** (systolic BP $\geq$ 140 mm Hg, measured) - **Elevated total cholesterol** ($\geq$ 5.0 mmol/L, measured) These variables are **only available for study participants**. We have no direct way to measure smoking or alcohol use for the 37% who didn't participate. We cannot simply use participant-only estimates because, as the mortality data shows, non-participants are fundamentally different from participants. While we don't have ground truth for health behaviors among non-participants, we **do** have complete data on **deaths and hospitalizations** for everyone -- both participants and non-participants. These follow-up outcomes provide a rare opportunity to validate different statistical methods by checking whether they recover the known truth. If a method works for deaths/hospitalizations (where we know the answer), we can have more confidence in its estimates for smoking and alcohol use (where we don't). ::: #### Recontact Data After the initial health examination closed, **the FINRISK study team mailed the same questionnaire to non-participants** and asked them to return it by mail (without attending the physical examination). This differs from earlier reminders -- it aimed to collect at least questionnaire data on smoking, alcohol use, and other self-reported behaviors from those who wouldn't participate fully. An additional **473 non-participants returned the questionnaire** (13% recontact response rate). This created three distinct participation groups: ::: {style="font-size: 0.85em;"} **Estimates of health indicators for different participation groups.** | Subset | Heavy alcohol users % (95% CI) | Daily smokers % (95% CI) | BMI ≥ 30 kg/m² % (95% CI) | Systolic blood pressure ≥ 140 mm Hg % (95% CI) | Deaths per 1,000 (95% CI) | Hospitalizations per 1,000 (95% CI) | | --------------------- | -----------------------------: | -----------------------: | ------------------------: | ---------------------------------------------: | ----------------------------------------: | ------------------------: | ----------------------------------: | | Full cohort | — | — | — | — | 33.0 (29.5–36.5) | 1,157 (1,098–1,215) | | Participants | 5.2 (4.7–5.8) | 21.8 (20.7–22.8) | 21.0 (20.0–22.0) | 30.7 (29.5–31.8) | 22.1 (18.5–25.7) | 1,052 (989–1,115) | | Recontact respondents | 6.4 (4.1–8.7) | 33.4 (29.1–37.7) | — | — | 55.1 (34.6–75.7) | 1,357 (1,043–1,670) | A subset of health indicators broken down by all three participation groups. The three groups show quite different mortality and hospitalization patterns. Data from Table 5 of @karvanen2016selection. ::: Can the recontact respondents help us correct the selection bias in the participant-only estimates? To answer this, we need a formal framework for handling missing data. #### A Causal Model for the Missing Data Process @karvanen2016selection built a causal model to describe the process responsible for missing responses: ::: {.columns} ::: {.column width="72%"} ![Causal model with design for the FINRISK 2007 study. Variables include background information $Z$, questionnaire variables $X$, physical measurements $Y$, and participation indicators $M_1$ and $M_2$. Dashed edges marked 'A' and 'B' represent assumptions about the missing data mechanism. Figure from @karvanen2016selection.](../figures/Karvanen2016_causal_graph.png){width=100%} ::: ::: {.column width="25%"} ![Legend for the causal model. Open circles are unobserved variables, filled circles are observed variables, and diamonds are indicator/decision variables. Figure from @karvanen2016selection.](../figures/Karvanen2016_causal_graph_legend.png){width=100%} ::: ::: The model includes two key assumptions: **Assumption A**: Among non-participants, the recontact response is independent of unobserved health variables $X$ and $Y$, conditional on background variables $Z$ (age, sex, region). This assumption was tested empirically using follow-up data and found to hold -- recontact response did not predict mortality or morbidity among non-participants when adjusted for age and sex. **Assumption B**: Physical measurements $Y$ are independent of participation $M_1$ given questionnaire variables $X$ and background $Z$. This assumption cannot be tested with available data but is necessary for estimating prevalences of physical health indicators. #### The Multiple Imputation Approach Using these assumptions, @karvanen2016selection developed a multiple imputation method tailored to this missing data mechanism (MI-MNAR).^[MI-MNAR = Multiple Imputation for data Missing Not at Random.] **Checking the method**: To assess whether MI-MNAR works, they compared different approaches using **deaths and hospitalizations** -- outcomes where ground truth was available for the full cohort: - **Participants-only estimates**: Deaths 22.1 per 1,000 (true value: 33.0) $\times$ - **MI-MAR** (assumes missing at random): Deaths 23.6 per 1,000 $\times$ - **MI-MNAR** (accounts for non-random missingness): Deaths 39.8 per 1,000 (CI overlaps 33.0) $\checkmark$ The participants-only and MI-MAR confidence intervals failed to capture the true values, revealing systematic bias. The MI-MNAR method performed better: its confidence intervals contained the true values for both deaths and hospitalizations. While this is suggestive rather than definitive validation, it provides evidence that the modeling assumptions are reasonable. **This encouraging performance on outcomes with known truth provides some confidence for applying MI-MNAR to estimate smoking and alcohol use** (where we don't have ground truth for non-participants): ::: {style="font-size: 0.85em;"} **Estimates of health indicators for different participation groups and population-level estimates with three MI methods.** | Subset / Method | Heavy alcohol users % (95% CI) | Daily smokers % (95% CI) | BMI ≥ 30 kg/m² % (95% CI) | Systolic blood pressure ≥ 140 mm Hg % (95% CI) | Deaths per 1,000 (95% CI) | Hospitalizations per 1,000 (95% CI) | | --------------------- | -----------------------------: | -----------------------: | ------------------------: | ---------------------------------------------: | ------------------------: | ----------------------------------: | | Full cohort | — | — | — | — | 33.0 (29.5–36.5) | 1,157 (1,098–1,215) | | Participants | 5.2 (4.7–5.8) | 21.8 (20.7–22.8) | 21.0 (20.0–22.0) | 30.7 (29.5–31.8) | 22.1 (18.5–25.7) | 1,052 (989–1,115) | | Recontact respondents | 6.4 (4.1–8.7) | 33.4 (29.1–37.7) | — | — | 55.1 (34.6–75.7) | 1,357 (1,043–1,670) | | MI-MNAR | 6.8 (5.6–8.1) | 27.1 (24.8–29.3) | 20.8 (19.7–22.0) | 29.3 (28.1–30.6) | 39.8 (32.7–47.0) | 1,303 (1,175–1,431) | | MI-MAR | 5.5 (4.8–6.3) | 23.4 (22.3–24.6) | 20.4 (19.3–21.6) | 29.1 (27.9–30.3) | 23.6 (19.7–27.5) | 1,027 (968–1,085) | | MI-MAR, no recontact | 5.4 (4.7–6.0) | 23.4 (22.3–24.6) | 20.1 (19.0–21.3) | 29.1 (27.9–30.2) | 23.4 (19.7–27.1) | 1,030 (966–1,095) | A subset of health indicator estimates using different methods. The MI-MNAR method (which properly accounts for the non-random missing data mechanism) yields estimates that differ substantially from participants-only estimates for smoking and alcohol use. Data from Table 5 of @karvanen2016selection. ::: Notice the pattern: MI-MNAR estimates for smoking (27.1%) fall between participants (21.8%) and recontact respondents (33.4%), which makes sense given that non-participants are at higher risk. For obesity, blood pressure, and cholesterol, all methods agree because the recontact questionnaire lacked good proxy questions for these physical measurements. **Key insight**: Even though only **13%** of non-participants responded to recontact, this data dramatically improved the estimates. The assumption that recontact respondents can proxy all non-participants (conditional on age, sex, and region) found support in the mortality and hospitalization data, where this assumption led to estimates consistent with known truth. ::: {.callout-tip} ## The Value of Recontact The FINRISK example demonstrates that all efforts to collect data on non-participants are likely to be useful, even if the recontact response rate remains low. A 13% response rate might seem discouraging, but it provided crucial information for correcting selection bias that would otherwise remain hidden. ::: ## Types and Mechanisms of Missing Data Missing data can be classified in two complementary ways. Understanding both perspectives is crucial for choosing appropriate analysis methods. ### Sources of Missing Data First, we can classify missing data by **how or why** they arise. There are three common sources: 1. **Censoring**: Values exceed the range of possible observations. This could be due to limitations of a physical sensor, or due to limited follow-up time (e.g., a participant is still alive at the end of a medical study, so their death date is censored). 2. **Unobserved in random sampling**: In a study asking participants to rate 10 randomly selected movies, ratings for other movies are unobserved due to the study design -- a form of planned missingness. 3. **Unobserved in non-random sampling**: Among the 10 selected movies, participants may fail to rate some. This is non-random missingness that may depend on whether they've seen the movie, liked it, etc. ::: {.callout-tip icon=false} ## Example: Classifying Sources of Missing Data Classify each of the following into **censoring**, **unobserved in random sampling**, or **unobserved in non-random sampling**: 1. Faulty sensors cannot record extreme values 2. Limited number of sensors means not all locations are monitored 3. Medical tests are not performed on patients who appear healthy 4. Survey respondents are not included because they don't have phones 5. Survey participants decline to answer sensitive questions 6. In a movie rating study, only 10 randomly selected movies are rated per participant 7. In the same study, participants skip rating movies they haven't seen ::: {.callout-note collapse="true"} ## Solution 1. **Censoring**: The sensor limitation causes systematic missingness for values beyond its range. 2. **Unobserved in random sampling**: If sensor placement is random, unmonitored locations are a planned random sample. 3. **Unobserved in non-random sampling**: Decision not to test depends on health status (unobserved outcome), creating non-random missingness. 4. **Censoring** or **unobserved in non-random sampling**: Lack of phone creates a systematic barrier; those without phones may differ systematically from those with phones. 5. **Unobserved in non-random sampling**: Non-response to sensitive questions likely correlates with the answer itself (e.g., those engaged in sensitive behavior may be less likely to report). 6. **Unobserved in random sampling**: By design, only a random subset of movies is rated. 7. **Unobserved in non-random sampling**: Whether someone has seen a movie is not random -- it relates to preferences, which likely correlate with ratings. ::: ::: ### Missing Data Mechanisms (MCAR, MAR, MNAR) The classification above describes **where** missing data come from (operational perspective). We now turn to a complementary classification that describes the **statistical relationship** between missingness and data values (analytical perspective). The mechanism classification has three categories: - **MCAR** (Missing Completely at Random): Missingness is independent of all data values - **MAR** (Missing at Random): Missingness depends on observed values only - **MNAR** (Missing Not at Random): Missingness depends on unobserved values ::: {.callout-note} ## Key Distinction: Sources vs. Mechanisms - **Sources** (previous section): How or why does missingness arise? (censoring, study design, selective dropout) - **Mechanisms** (this section): Does missingness depend on data values? If so, which ones? These are **independent classification systems**. The same source can correspond to different mechanisms: - Censoring (source) can be MCAR if sensors fail randomly, or MNAR if extreme values cause failures - Non-random sampling (source) is typically MAR or MNAR depending on what drives the dropout ::: Understanding the mechanism (MCAR/MAR/MNAR) determines which statistical methods are valid. We now define these three mechanisms formally. #### Setup and Notation Consider a dataset $Y = (Y_{\text{obs}}, Y_{\text{mis}})$, where: - $Y_{\text{obs}}$ are the observed elements - $Y_{\text{mis}}$ are the missing elements - $R$ denotes response indicators: $R_i = 1$ if element $i$ was observed, $R_i = 0$ if missing The missingness mechanism is characterized by the probability distribution $p(R \mid Y)$ -- the probability that particular values are missing, given the full data (both observed and missing parts). The missingness of $Y$ can be divided into 3 cases: #### 1. Missing Completely at Random (MCAR) ::: {.definition} **Missing Completely at Random (MCAR)** Data are MCAR if $p(R \mid Y) = p(R)$. That is, $R$ is independent of $Y$ -- the probability of missingness doesn't depend on any data values, observed or missing. ::: **Examples**: - A lab technician accidentally spills coffee on a random stack of patient files - Survey responses are lost in transit by postal service errors - Database corruption affects randomly selected records #### 2. Missing at Random (MAR) ::: {.definition} **Missing at Random (MAR)** Data are MAR if $p(R \mid Y) = p(R \mid Y_{\text{obs}})$. The probability of a value being missing depends only on $Y_{\text{obs}}$ but is independent of the missing values $Y_{\text{mis}}$ themselves. ::: **Examples**: - Urban residents are more likely to respond to online surveys than rural residents (and location is recorded) - Younger participants are less likely to respond to follow-up questionnaires (and age is known) - People with higher education levels provide more complete financial data (and education is observed) #### 3. Missing Not at Random (MNAR) ::: {.definition} **Missing Not at Random (MNAR)** Data are MNAR if $p(R \mid Y)$ depends on $Y_{\text{mis}}$ as well as possibly $Y_{\text{obs}}$. The probability of missingness depends on the unobserved values themselves. ::: **Examples**: - People with severe symptoms are less likely to complete a symptom severity questionnaire - Individuals with very low test scores are more likely to skip remaining test questions - Employees with poor performance ratings are less likely to complete exit surveys - Fund C in our opening example (poor performance causes dropout) ::: {.callout-tip icon=false} ## Example: Classifying Missing Data Mechanisms An e-commerce site is recording activity data from all its customers. Which missing data mechanism applies in the following settings? 1. Records for a period of time are lost because a disk got full. 2. Software error caused activity not to be recorded for customers who used a specific payment system. 3. No records can be made from users of an ad-blocker. ::: {.callout-note collapse="true"} ## Solution 1. **MCAR**: The disk failure is unrelated to any customer characteristics or behavior. Missingness is completely random. 2. **MAR**: Missingness depends on the payment system used, which is an *observed* variable (the site knows which payment system each customer uses). Once we condition on payment system, missingness is random. 3. **MNAR**: Ad-blocker usage is typically *unobserved* by the site (that's the point of ad-blockers!). The missingness depends on an unmeasured characteristic that likely correlates with the user's privacy preferences and behavior. ::: ::: ### Ignorability The key statistical question about missing data is whether the mechanism is **ignorable** for a particular analysis. A missingness mechanism is **ignorable** if we can perform valid inference without explicitly modeling the missingness process. Whether a mechanism is ignorable depends both on the missing data mechanism (MCAR, MAR, or MNAR) and on the statistical method used. **Complete case analysis** (also called listwise deletion) is the simplest approach: analyze only observations with no missing values, discarding any row with missing data. Whether this approach is valid depends on the missing data mechanism: - **MCAR** is ignorable for all standard methods, including complete case analysis - **MAR** is ignorable for likelihood-based and multiple imputation methods, but **not** for complete case analysis - **MNAR** is generally not ignorable -- we must explicitly model the missingness or make strong untestable assumptions ::: {.tabbed-content} ## Intuitive Why does ignorability depend on both the mechanism and the method? **MCAR**: Imagine randomly tossing out some survey responses. The remaining responses are still representative -- just a smaller random sample. Any standard analysis works fine. Complete case analysis is safe because we're just working with a random subsample. **MAR**: Now imagine younger people are less likely to respond, but we know everyone's age. If we account for age properly (e.g., by including it in our model), we can correct for the bias. Methods like multiple imputation do this conditioning automatically. But complete case analysis doesn't -- it just throws away the younger non-respondents, leaving us with an unrepresentative older sample. **MNAR**: Suppose people with severe symptoms are less likely to complete a symptom questionnaire, and we don't observe their symptoms. There's no variable we can condition on to fix this -- the bias is baked into what we can't see. We must either explicitly model why people didn't respond (which requires strong assumptions about the unobserved) or find auxiliary data (like the FINRISK follow-up data). ## Mathematical The technical definition of ignorability involves how the likelihood factors. Let $L(\theta; Y_{\text{obs}}, R)$ be the likelihood given observed data $Y_{\text{obs}}$ and missingness indicators $R$. The joint distribution factors as: $$f(Y, R \mid \theta, \phi) = f(Y \mid \theta) f(R \mid Y, \phi)$$ where $\theta$ are parameters of interest and $\phi$ are parameters of the missingness mechanism. **Ignorability** holds when: 1. The parameters $\theta$ and $\phi$ are **distinct** (variation in one doesn't constrain the other) 2. The missingness is **MAR**: $f(R \mid Y, \phi) = f(R \mid Y_{\text{obs}}, \phi)$ Under these conditions, we can base inference on $f(Y_{\text{obs}} \mid \theta)$ and ignore $f(R \mid Y_{\text{obs}}, \phi)$. For MCAR, we have even stronger ignorability: $f(R \mid Y, \phi) = f(R \mid \phi)$, so missingness is completely independent of the data. ## Computational To see ignorability in action, we simulate a simple regression problem: predicting $y$ from $x$ where the true relationship is $y = 2 + 3x + \text{noise}$. We know the true slope is 3.0. We then artificially impose three different missingness patterns and see whether complete case analysis recovers the true slope: - **MCAR**: Randomly delete 30% of observations (independent of any values) - **MAR**: Delete observations based on $x$ values (higher $x$ → more likely to be missing) - **MNAR**: Delete observations based on $y$ values (higher $y$ → more likely to be missing) Since we control the data generation, we can compare the estimated slopes to the known truth. We repeat each scenario 100 times to assess variability and compute 95% confidence intervals. ```{python} #| echo: true #| fig-width: 7 #| fig-height: 4.5 import numpy as np import matplotlib.pyplot as plt from scipy import stats rng = np.random.default_rng(42) n = 10000 n_sims = 100 # True data generating process: y = 2 + 3x + noise true_beta = 3.0 # Three missingness mechanisms def create_mcar_mask(x, y, rng, p=0.3): """MCAR: Missingness independent of everything""" return rng.random(len(x)) > p def create_mar_mask(x, y, rng): """MAR: Missingness depends on observed x""" p = 1 / (1 + np.exp(-2*x)) # Higher x => much more missing return rng.random(len(x)) > p def create_mnar_mask(x, y, rng): """MNAR: Missingness depends on unobserved y""" p = 1 / (1 + np.exp(-y/1.5)) # Higher y => more missing return rng.random(len(x)) > p # Run simulations results = {'MCAR': [], 'MAR': [], 'MNAR': []} for i in range(n_sims): # Generate data x = rng.normal(size=n) y = 2 + 3*x + rng.normal(size=n) # Estimate under each mechanism for name, mask_fn in [('MCAR', create_mcar_mask), ('MAR', create_mar_mask), ('MNAR', create_mnar_mask)]: mask = mask_fn(x, y, rng) slope = stats.linregress(x[mask], y[mask]).slope results[name].append(slope) # Compute means and 95% CIs means = [true_beta] + [np.mean(results[k]) for k in ['MCAR', 'MAR', 'MNAR']] cis = [[true_beta, true_beta]] + [np.percentile(results[k], [2.5, 97.5]) for k in ['MCAR', 'MAR', 'MNAR']] ci_lower = [means[i] - cis[i][0] for i in range(len(means))] ci_upper = [cis[i][1] - means[i] for i in range(len(means))] # Visualize the bias fig, ax = plt.subplots(figsize=(7, 4.5)) mechanisms = ['True β', 'MCAR', 'MAR', 'MNAR'] colors = ['black', 'green', 'orange', 'red'] bars = ax.bar(mechanisms, means, color=colors, alpha=0.7, edgecolor='black', yerr=[ci_lower, ci_upper], capsize=5, error_kw={'linewidth': 2}) ax.axhline(true_beta, color='black', linestyle='--', linewidth=1, alpha=0.5, label='True value') ax.set_ylabel('Slope estimate (β)') ax.set_title('Complete Case Analysis: Bias and Uncertainty by Missing Data Mechanism') ax.set_ylim([2.5, 3.5]) # Add value labels for i, (bar, val) in enumerate(zip(bars, means)): height = bar.get_height() ax.text(bar.get_x() + bar.get_width()/2., height + 0.02, f'{val:.3f}', ha='center', va='bottom', fontsize=9) plt.grid(True, alpha=0.3, axis='y') plt.tight_layout() plt.show() print(f"True slope: {true_beta:.3f}") for name in ['MCAR', 'MAR', 'MNAR']: mean = np.mean(results[name]) ci_low, ci_high = np.percentile(results[name], [2.5, 97.5]) print(f"{name}: {mean:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}], bias: {mean - true_beta:+.3f})") ``` The simulation demonstrates key differences across mechanisms: - **MCAR** is unbiased -- the confidence interval contains the true value - **MAR** is unbiased in this specific example due to a special case: when missingness depends only on the predictor $x$ in a regression, complete case analysis for the slope coefficient happens to be unbiased. **However, this is an exception, not the rule.** Complete case analysis under MAR is biased for most estimands (means, variances, correlations) and would be biased even for regression if missingness depended on an auxiliary variable or if we were estimating intercepts or predictions. - **MNAR** shows substantial bias because missingness directly depends on the unobserved outcome variable $y$, systematically excluding high values **Key takeaway**: Complete case analysis is safe only under MCAR. Under MAR, complete case analysis is generally biased -- the regression slope example above is a rare exception. Under MNAR, complete case analysis is almost always biased. ::: **The Bottom Line**: Ignoring the mechanism when it's not ignorable leads to biased results. Always assess the likely mechanism before choosing an analysis method. ::: {.callout-warning} ## Common Misunderstanding Even if a variable is fully observed, **complete case analysis can bias its estimate** when missingness in other variables depends on it. Deleting rows with any missing values changes the effective sampling design. For example, if income is fully observed but wealthy people are less likely to respond to other survey questions, complete case analysis will undersample wealthy individuals and bias the income estimate downward. ::: ## Techniques for Dealing with Missing Data Now that we understand missing data mechanisms, we turn to methods for handling incomplete data. The appropriate method depends on the mechanism and the analysis goals. ### Overview of Approaches Some common ways for dealing with missing data include: 1. **Complete case analysis** (listwise deletion): Ignore entries with missing data 2. **Imputation**: Fill in the missing values with estimated values 3. **Weighting**: Reweight complete cases to represent the full sample 4. **Model-based procedures**: Build a model for the observed data that accounts for missingness We'll explore each of these approaches. ### Complete Case Analysis Complete case analysis, as we saw before, simply discards any observation with missing data and analyzes the remaining complete observations. **When appropriate**: MCAR with low missingness (a common rule of thumb is < 5%, though this depends on sample size and precision requirements) **Advantages**: - Simple to implement - Honest uncertainty (standard errors reflect the actual sample size) - No additional modeling assumptions **Disadvantages**: - Biased under MAR/MNAR - Loss of efficiency (throwing away data) - Different analyses may use different subsets of data (if missingness patterns differ across variables) **Example**: In our fund performance example, using only surviving funds is complete case analysis. It's severely biased because missingness (fund liquidation) depends on the outcome (poor performance). ::: {.callout-note} ## Why the "Low Missingness" Threshold? Even when MCAR holds, complete case analysis is only recommended when missingness is low (a common rule of thumb is < 5%, though this depends on sample size and study objectives). Two key considerations: 1. **Efficiency**: Even under MCAR, discarding observations wastes data. With very low missingness, you lose minimal precision. With higher missingness (e.g., 20-40%), you're throwing away substantial information and widening confidence intervals unnecessarily. 2. **Assumption credibility**: The MCAR assumption is harder to defend when missingness is high. If you have substantial missing data, it's worth investing effort in methods that work under weaker assumptions (MAR). ::: ### Imputation Methods Imputation refers to a class of techniques for replacing missing values with estimated values. This makes subsequent analysis easy -- after imputation, we have a seemingly complete dataset. However, imputation comes with a serious warning: ::: {.callout-caution} ## The Seduction and Danger of Imputation > The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the real and imputed data have substantial biases. -- Dempster and Rubin (1983) ::: #### Simple (Single) Imputation Common simple imputation methods include: - **Mean/median imputation**: Replace missing values with the mean or median of observed values - **Hot deck imputation**: Randomly select a similar record and use its value - **Regression prediction**: Predict missing values using observed relationships **The fundamental problem with single imputation**: Treating imputed values as if they were observed leads to **overconfident** inference. The downstream analysis method is not aware of the larger uncertainty in the imputed values. **Additional caveats**: - Replacing missing values using observed values for the same variable **reduces variability** -- estimates of variation are biased - Imputation methods based on predicting missing values may work poorly if extrapolation is needed - Mean imputation **breaks relationships** between variables (imputed values have zero correlation with other variables) ::: {.callout-tip icon=false} ## Example: How Mean Imputation Distorts Variance To see the problem concretely, suppose we have data on a variable $X$ with true mean 50 and standard deviation 10. We randomly make 30% of values missing (MCAR), then impute them using the sample mean. What happens to the distribution? ```{python} #| echo: true #| fig-width: 7 #| fig-height: 4 import numpy as np import matplotlib.pyplot as plt rng = np.random.default_rng(123) # True data n = 100 X_true = rng.normal(50, 10, n) # Create MCAR missingness (30% missing) missing_mask = rng.random(n) < 0.3 X_observed = X_true.copy() X_observed[missing_mask] = np.nan # Mean imputation X_mean_imputed = X_observed.copy() mean_val = np.nanmean(X_observed) X_mean_imputed[missing_mask] = mean_val # Compare distributions fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4)) # Plot 1: Histograms ax1.hist(X_true, bins=15, alpha=0.5, label='True data', edgecolor='black') ax1.hist(X_mean_imputed, bins=15, alpha=0.5, label='After mean imputation', edgecolor='black') ax1.axvline(mean_val, color='red', linestyle='--', label=f'Imputed value ({mean_val:.1f})') ax1.set_xlabel('X') ax1.set_ylabel('Frequency') ax1.set_title('Mean Imputation Distorts the Distribution') ax1.legend() ax1.grid(True, alpha=0.3) # Plot 2: Variance comparison variances = { 'True data': np.var(X_true), 'Observed only': np.nanvar(X_observed), 'Mean imputed': np.var(X_mean_imputed) } bars = ax2.bar(range(len(variances)), list(variances.values()), color=['black', 'blue', 'red'], alpha=0.7, edgecolor='black') ax2.set_xticks(range(len(variances))) ax2.set_xticklabels(list(variances.keys()), rotation=15, ha='right') ax2.set_ylabel('Variance') ax2.set_title('Mean Imputation Artificially Reduces Variance') ax2.grid(True, alpha=0.3, axis='y') # Add value labels for bar, val in zip(bars, variances.values()): height = bar.get_height() ax2.text(bar.get_x() + bar.get_width()/2., height + 2, f'{val:.1f}', ha='center', va='bottom') plt.tight_layout() plt.show() print("Variance comparison:") for name, var in variances.items(): print(f" {name:15s}: {var:.2f}") print(f"\nVariance reduction: {(1 - variances['Mean imputed']/variances['True data'])*100:.1f}%") ``` The spike at the mean in the imputed distribution reveals the problem: all 30 imputed values are identical (the mean), while the true values would have varied. This artificially reduces variance by a substantial fraction, leading to overconfident inference -- confidence intervals that are too narrow and p-values that are too small. ::: #### Multiple Imputation (MI) Multiple imputation addresses the fundamental flaw of single imputation: treating imputed values as known. Instead of creating one "completed" dataset, MI generates multiple plausible versions (typically $m = 20-50$) by drawing from the distribution of possible values given the observed data. Each imputed dataset is analyzed separately using standard methods, and results are combined using **Rubin's rules**^[Rubin's rules provide formulas for combining point estimates and variances from multiple imputed datasets. The total variance accounts for both within-imputation uncertainty (sampling variability) and between-imputation uncertainty (uncertainty about the missing values).], which properly account for two sources of uncertainty: variation between imputations (uncertainty about missing values) and variation within imputations (sampling uncertainty). Modern software packages (e.g., `mice` in R, `IterativeImputer` in Python) make implementation straightforward. ::: {.callout-note collapse="true"} ## Connection to Bootstrap and Bayesian Methods MI resembles both bootstrap and Bayesian inference: - Like **bootstrap**, MI generates multiple datasets and pools results to capture uncertainty - Like **Bayesian methods**, MI draws from a distribution of plausible values (though MI uses a frequentist framework for combining results via Rubin's rules rather than full posterior inference) In fact, fully Bayesian approaches to missing data are ideal but computationally intensive. MI can be viewed as an efficient approximation that works within standard frequentist workflows. ::: ### Weighting Methods The idea behind weighting is to reweight the complete cases so they represent the full sample. **Inverse probability weighting (IPW)**: - Estimate the probability of being observed: $\hat{p}_i = P(R_i = 1 \mid X_i)$ - Weight each complete case by $w_i = 1/\hat{p}_i$ - Analyze weighted data Under MAR with a correctly specified model for $P(R = 1 \mid X)$, this gives unbiased estimates. ::: {.callout-tip icon=false} ## Example: Sex Ratio Correction in Height Estimation Assume we wish to estimate the average height of people living in a region. We gather a sample: - 70 females with average height of 166 cm - 30 males with average height of 180 cm The sample average is: $$\overline{X} = \frac{70 \cdot 166 + 30 \cdot 180}{100} = 170.2 \text{ cm}$$ However, if we know the true sex distribution is around 50:50 (not 70:30), we would get a better estimate of the population average by reweighting the group averages with the population frequencies: $$\overline{X}_{\text{rw}} = \frac{50 \cdot 166 + 50 \cdot 180}{100} = 173.0 \text{ cm}$$ This is a simple example of weighting by known population distributions. ::: Similar approaches are commonly used with other background variables with known population distributions: age, socioeconomic status, political preferences, etc. ### Model-Based Procedures Model-based procedures define a statistical model for the observed data and infer the model parameters using standard techniques (often maximum likelihood via the EM algorithm, or Bayesian methods). Such approaches can lead to more accurate results when the incompleteness of the data is properly accounted for in the model. However: - The procedures must be adapted to deal with missing data - Existing code and tools often cannot be used as-is - Model misspecification^[Model misspecification means using a model that doesn't match the true data-generating process -- e.g., assuming linearity when the relationship is nonlinear, or assuming independence when variables are correlated.] can lead to bias The FINRISK model discussed earlier is an example of a model-based procedure -- it explicitly modeled the participation process and used multiple imputation within that framework. ::: {.callout-note collapse="true"} ## Advanced: Selection Models vs. Pattern-Mixture Models When modeling MNAR data, there are two main frameworks: **Selection models** (like the FINRISK approach) factor the joint distribution as: $$f(Y, R) = f(Y) \cdot f(R \mid Y)$$ They model the data generating process and then the selection/missingness mechanism conditional on the data. **Pattern-mixture models** take the opposite factorization: $$f(Y, R) = f(Y \mid R) \cdot f(R)$$ They model the distribution of data conditional on the missingness pattern, then combine across patterns. Both approaches require untestable assumptions about the MNAR mechanism. In practice: - **Selection models** are more common and intuitive but can be computationally challenging - **Pattern-mixture models** can be easier to specify but require careful thought about combining patterns - **Sensitivity analysis** is essential -- vary assumptions about the MNAR mechanism and assess how conclusions change - Implementation typically requires custom likelihood functions or Bayesian modeling tools (e.g., [PyMC](https://www.pymc.io/projects/examples/en/latest/howto/Missing_Data_Imputation.html)) ::: ## Practical Implementation ### Decision Framework When faced with missing data, follow this workflow: **1. Assess the missingness**: - How much data is missing overall? - Which variables have missing values? - Are there patterns (e.g., variables missing together)? - What is the likely mechanism (MCAR, MAR, MNAR)? **2. Choose your method**: - **MCAR with low missingness**: Complete case analysis is acceptable (rule of thumb: < 5%, but depends on sample size and precision needs) - **MCAR with higher missingness**: MI for efficiency gains - **MAR**: MI with rich predictors (include auxiliary variables) - **MNAR**: Requires special methods -- sensitivity analysis, auxiliary data collection (like recontact), or explicit modeling of the missingness mechanism **3. Validate your approach**: - Compare observed vs. imputed distributions (imputed values should look plausible) - Perform sensitivity analyses by varying assumptions - When possible, check against external data (like the FINRISK follow-up data) - Compare MI-MAR vs. MI-MNAR results to understand the impact of MNAR assumptions ::: {.callout-note collapse="true"} ## Reporting Guidelines Transparency is essential when dealing with missing data. A complete report should include: 1. **Amount and pattern of missingness**: Report overall percentage and range across variables 2. **Assumed mechanism and justification**: State whether you assume MCAR, MAR, or MNAR and why 3. **Method used**: Specify whether you used complete case, MI, weighting, or model-based methods 4. **Implementation details**: - For MI: number of imputations, imputation models, predictors used - For weighting: how weights were estimated - Software and packages used 5. **Sensitivity analyses**: Report results under different assumptions 6. **Impact assessment**: Describe how results differ from complete case analysis ::: ### Prevention is Better than Correction The best way to handle missing data is to prevent it. A researcher can design data collection to minimize missingness: - **Shorter surveys** with clear, relevant questions (reduces respondent fatigue) - **Incentives for completion** (monetary, results reports, lottery entries) - **Follow-up protocols** for non-respondents (like FINRISK's recontact approach) - **Pilot testing** to identify problematic questions before full deployment - **Collect auxiliary data** for non-respondents when possible -- even limited data can dramatically improve estimates The FINRISK example demonstrated that even a 13% recontact response rate provided crucial information for correcting selection bias. All efforts to collect data on non-participants are worthwhile. ::: {.callout-caution} ## When Standard Methods Fail Be aware of situations where even proper methods struggle: - **Very high missingness** (> 40%): Results become increasingly dependent on modeling assumptions rather than data - **MNAR without auxiliary information**: No method can reliably recover the truth without strong, untestable assumptions about the missingness mechanism - **Extrapolation beyond observed data**: Imputation requiring predictions far outside the range of observed values is inherently unreliable **Fundamental limitation**: No statistical method can recover information that isn't there. If the missing data mechanism is MNAR and you have no auxiliary information about non-respondents, your estimates will be biased. The best you can do is acknowledge the limitation explicitly, perform sensitivity analyses showing results under different assumptions, and discuss the likely direction and magnitude of bias. ::: ## Chapter Summary and Connections ### Key Concepts Review **Two classification systems**: - **Sources** (how missingness arises): Censoring, random sampling, non-random sampling - **Mechanisms** (statistical relationship): MCAR, MAR, MNAR **Missing data mechanisms**: - **MCAR** (Missing Completely at Random): Missingness independent of all data - **MAR** (Missing at Random): Missingness depends only on observed data - **MNAR** (Missing Not at Random): Missingness depends on unobserved values **Ignorability**: - **MCAR**: Ignorable for all methods (complete case analysis is unbiased) - **MAR**: Ignorable for likelihood-based and MI methods, but NOT for complete case analysis - **MNAR**: Generally not ignorable -- requires explicit modeling or auxiliary data **Key methods**: - **Complete case analysis**: Only valid under MCAR with low missingness - **Multiple imputation**: The workhorse for MAR; generates m datasets, analyzes each, pools with Rubin's rules - **Weighting**: Reweight complete cases to represent full sample (inverse probability weighting) - **Model-based procedures**: Explicitly model the missingness mechanism **Real-world impact**: - **Survivorship bias** (investment funds): Using only surviving funds systematically overstates returns, misleading billions in investment decisions - **Selection bias** (FINRISK): Non-participants had substantially higher mortality; ignoring them produces dangerously wrong public health estimates of smoking and alcohol use ### The Big Picture Missing data is not a niche technical problem -- it's a fundamental challenge in real-world data analysis that affects investment decisions, public health policy, and life-and-death medical outcomes. We have principled methods for handling missing data, but they require careful thought and honest assessment. The key: understand the missingness mechanism, choose methods accordingly, validate assumptions when possible, and report transparently about limitations. No statistical method can compensate for very high missingness or MNAR without auxiliary data -- prevention and auxiliary data collection (like FINRISK's recontact effort) are invaluable. ### Common Pitfalls to Avoid 1. **Assuming MCAR without evidence**: Most real-world missing data are not MCAR. Compare complete vs. incomplete cases before assuming missingness is random. 2. **Naïve imputation**: Mean imputation destroys variance; single imputation understates uncertainty. Use proper multiple imputation. 3. **Ignoring auxiliary information**: Include variables that predict missingness in imputation models, even if they're not in your analysis model. The FINRISK recontact data -- just 13% response -- dramatically improved estimates. 4. **Hiding your assumptions**: Document your missingness mechanism assumptions, methods used, and sensitivity analyses. Readers cannot judge validity without this transparency. 5. **Misspecified models and false confidence**: Imputation models can introduce bias if misspecified. When MNAR is plausible but you use MAR methods, your uncertainty is understated. Validate and conduct sensitivity analyses. ### Chapter Connections **Previous (Chapters 5-11):** - Chapters 5-7 developed parametric inference, estimation, and hypothesis testing -- all assuming complete data. Missing data violates this assumption and can bias estimates, widen confidence intervals artificially (mean imputation), or lead to invalid p-values - Chapter 9's regression methods are vulnerable to selection bias when missingness depends on outcomes (like Fund C's poor performance leading to dropout) -- complete case analysis can severely bias coefficient estimates - Chapter 11's causal inference framework helps understand missing data mechanisms -- whether missingness depends on observed vs. unobserved variables determines if we can recover valid estimates **This Chapter:** Addressed missing data through three mechanisms (MCAR, MAR, MNAR) and their appropriate methods. We saw how survivorship and selection bias can severely distort inference, and learned when complete case analysis fails versus when multiple imputation or model-based approaches are needed. **Next (Ch. 13):** - Chapter 13 (Experimental Design and Surveys) will show how proper study design minimizes missingness and enables causal inference -- the principles here inform power calculations, sample size planning, and follow-up protocols to reduce missing data ### Self-Test Problems 1. **Mechanism Classification**: For each scenario, identify whether the missing data are MCAR, MAR, or MNAR: a) A thermometer fails to record temperatures below -20°C due to physical limitations b) Survey respondents from rural areas are less likely to complete an online survey (and location is recorded) c) People with high anxiety scores are less likely to complete an anxiety questionnaire (and anxiety is what we're measuring) ::: {.callout-note collapse="true"} ## Solution a) **Source:** Censoring (instrument lower limit). **Mechanism:** MNAR, because the probability of missingness depends on the unobserved value itself (temperatures below −20°C are deterministically missing). b) **Mechanism:** MAR. Missingness depends on location, which is observed; conditional on location, response is random. c) **Mechanism:** MNAR. Missingness depends on the unobserved anxiety score itself. ::: 2. **Survivorship Bias Calculation**: Suppose a database contains 100 funds at year 0. After 5 years, 60 funds survive. The surviving funds show an average return of 8% per year. If the 40 defunct funds had an average return of -2% per year, what is the survivorship bias? ::: {.callout-note collapse="true"} ## Solution - Surviving portfolio: 8% (based on 60 funds) - True portfolio: (60 × 8% + 40 × (-2%))/100 = (480 - 80)/100 = 4% - Survivorship bias: 8% - 4% = **4% per year** This is even larger than the hedge fund bias estimated by Fung & Hsieh! If instant history bias adds +1.4% per year, the combined bias would be 5.4% per year. ::: 3. **Complete Case Bias**: Assume income and health are positively correlated. A study measures income (fully observed) and health status (missing for 30% of respondents). Healthier people are more likely to respond. Will complete case analysis bias the income estimate? Why or why not? ::: {.callout-note collapse="true"} ## Solution Yes, complete case analysis will bias the income estimate even though income is fully observed. Since health and income are positively correlated (e.g., healthier people have higher income due to employment), the complete cases will oversample healthier individuals and thus oversample higher-income individuals. Deleting rows changes the effective sampling design, biasing estimates even for fully observed variables. ::: 4. **Ignorability and Valid Methods**: Fill in which methods produce valid inference without explicitly modeling the missingness mechanism: | Mechanism | Complete Case Analysis | Likelihood/ML Methods | Multiple Imputation | |-----------|:----------------------:|:---------------------:|:-------------------:| | MCAR | ? | ? | ? | | MAR | ? | ? | ? | | MNAR | ? | ? | ? | ::: {.callout-note collapse="true"} ## Solution | Mechanism | Complete Case Analysis | Likelihood/ML Methods | Multiple Imputation | |-----------|:----------------------:|:---------------------:|:-------------------:| | MCAR | $\checkmark$ | $\checkmark$ | $\checkmark$ | | MAR | $\times$ | $\checkmark$ | $\checkmark$ | | MNAR | $\times$ | $\times$* | $\times$* | *MNAR requires explicitly modeling the missingness mechanism or bringing auxiliary data/assumptions (selection models, pattern-mixture models, sensitivity analyses). ::: 5. **Method Selection (True/False)**: a) With 3% missing data and MCAR, complete case analysis is acceptable. b) With 30% missing data and MAR, complete case analysis is generally unbiased. c) With plausible MNAR and no auxiliary information, sensitivity analysis is essential. ::: {.callout-note collapse="true"} ## Solution a) **True** - Under MCAR with low missingness (rule of thumb: < 5%), complete case analysis is valid though less efficient. b) **False** - Even under MAR, complete case analysis is biased. The high missingness (30%) makes efficiency loss severe, but MAR methods like multiple imputation are needed for validity. c) **True** - With MNAR and no auxiliary data, results are sensitive to untestable assumptions. Sensitivity analysis showing results under different MNAR assumptions is essential for honest inference. ::: ### Python and R Reference ::: {.content-visible when-format="html"} ::: {.tabbed-content} ## Python ```python # Multiple Imputation with statsmodels MICE (imputation + pooling handled internally) import pandas as pd # Quick missingness check df.isnull().sum() # per-column counts df.isnull().sum().sum() / df.size # overall proportion import statsmodels.api as sm from statsmodels.imputation.mice import MICEData, MICE # df: your pandas DataFrame with missing values md = MICEData(df) # chained equations setup # Specify analysis model model = sm.OLS.from_formula('y ~ x1 + x2', md) # Fit with multiple imputations; pooled inference returned res = MICE(model, md, n_imputations=50, model_class=sm.OLS).fit() # res.summary() # pooled estimates and standard errors ``` Other Python options include `sklearn.impute.IterativeImputer` and `miceforest` (random forest-based). ## R ```r library(mice) # Quick look at missingness summary(df) md.pattern(df) # Multiple Imputation with pooling imp <- mice( data = df, m = 50, # number of imputations method = "pmm", # predictive mean matching for continuous vars seed = 123 ) fit <- with(imp, lm(y ~ x1 + x2)) pooled <- pool(fit) summary(pooled) # Access a completed dataset if needed: # complete(imp, action = 1) # first imputation # complete(imp, action = "long") # all imputations in long format ``` Other R packages include `Amelia` and `mi`. ::: ::: ::: {.content-visible when-format="pdf"} ::: {.callout-note} ## Python and R Reference Code Python and R code examples for handling missing data can be found in the HTML version of these notes. ::: ::: ### Connections to Source Material This chapter synthesized material from multiple sources. The theoretical framework for missing data mechanisms (MCAR, MAR, MNAR) and ignorability follows Little & Rubin's foundational *Statistical Analysis with Missing Data*. Two key empirical studies demonstrated real-world impact: @fung2000performance quantified survivorship and instant history bias in hedge fund databases, while @karvanen2016selection demonstrated selection bias correction in the FINRISK health survey using multiple imputation for MNAR data. ### Further Reading - **Foundational Text**: Little, R. J. A., & Rubin, D. B. (2019). *Statistical Analysis with Missing Data* (3rd ed.). Wiley. - **Practical Guide**: Van Buuren, S. (2018). *Flexible Imputation of Missing Data* (2nd ed.). CRC Press. [Free online](https://stefvanbuuren.name/fimd/) - **Case Studies**: The papers by @fung2000performance and @karvanen2016selection provide excellent examples of missing data in practice. --- *Remember: Missing data is not just a technical nuisance -- it's a fundamental challenge that can determine whether your analysis leads to valid conclusions or serious errors. Understand the mechanism, choose methods appropriately, and always report your approach transparently!*