4  Nonparametric Estimation and The Bootstrap

Published

July 23, 2025

4.1 Learning Objectives

After completing this chapter, you will be able to:

  • Explain the definition and frequentist interpretation of a confidence interval
  • Apply the plug-in principle with the Empirical Distribution Function (EDF) to estimate statistical functionals
  • Use the bootstrap to simulate the sampling distribution of a statistic and estimate its standard error
  • Construct and compare the three main types of bootstrap confidence intervals: Normal, percentile, and pivotal
  • Identify the assumptions and limitations of the bootstrap, recognizing common situations where it can fail
Note

This chapter covers nonparametric estimation methods and the bootstrap. The material is adapted from Chapters 7 and 8 of Wasserman (2013), with reference to Chapter 6 for confidence intervals. Additional examples and computational perspectives have been added for data science applications.

4.2 Introduction and Motivation

4.2.1 Beyond Point Estimates: Quantifying Uncertainty

Imagine you’re a healthcare administrator planning for hospital capacity during an epidemic. Your data scientists provide a point estimate: “We expect 500 patients next week.” But is this estimate reliable enough to base critical decisions on? What if the true number could reasonably be anywhere from 300 to 700? A single point estimate, without quantifying its uncertainty, is often useless for decision-making. You need a plausible range – this is called a confidence interval.

In Chapter 3, we learned how to create point estimates – single “best guesses” for unknown quantities. We saw that the sample mean \bar{X}_n estimates the population mean \mu, and we even derived its standard error. But what about more complex statistics? How do we find the standard error of the median? The correlation coefficient? The 90th percentile?

Traditional statistical theory provides formulas for simple cases, but quickly becomes intractable for complex statistics. This chapter introduces two powerful ideas that revolutionized modern statistics:

  1. The Plug-In Principle: A simple, intuitive method for estimating almost any quantity by “plugging in” the empirical distribution for the true distribution.

  2. The Bootstrap: A computational method for estimating the standard error and confidence interval of virtually any statistic, even when no formula exists.

These methods exemplify the computational approach to statistics that has become dominant in the era of cheap computing power. Instead of deriving complex mathematical formulas, we let the computer simulate what would happen if we could repeat our experiment many times.

For Finnish-speaking students, here’s a reference table of key terms in this chapter:

| English | Finnish | Context |
|---------|---------|---------|
| Confidence interval | Luottamusväli | Range that contains parameter with specified probability |
| Coverage | Peitto | Probability that interval contains true parameter |
| Empirical distribution function | Empiirinen kertymäfunktio | Data-based estimate of CDF |
| Statistical functional | Tilastollinen funktionaali | Function of the distribution |
| Plug-in estimator | Pistoke-estimaattori | Estimate using empirical distribution |
| Bootstrap | Uusio-otanta | Resampling method for uncertainty |
| Resampling | Uudelleenotanta | Drawing samples from data |
| Bootstrap sample | Bootstrap-otos | Sample drawn with replacement |
| Percentile interval | Prosenttipiste-luottamusväli | CI using bootstrap quantiles |
| Pivotal interval | Pivotaalinen luottamusväli | CI using pivot quantity |
| Standard error | Keskivirhe | Standard deviation of estimator |
| Monte Carlo error | Monte Carlo -virhe | Error from finite simulations |

4.3 Confidence Sets: The Foundation

Before diving into nonparametric estimation and the bootstrap, we need to establish the formal framework for confidence intervals, a staple of classical statistics.

A 1-\alpha confidence interval for a parameter \theta is an interval C_n = (a, b) where a = a(X_1, \ldots, X_n) and b = b(X_1, \ldots, X_n) are functions of the data such that \mathbb{P}_{\theta}(\theta \in C_n) \geq 1 - \alpha, \quad \text{for all } \theta \in \Theta.

In other words, C_n encloses \theta with probability 1-\alpha. This probability is called the coverage of the interval.

A common choice is \alpha = 0.05, which yields 95% confidence intervals.

4.3.1 Interpretation of Confidence Sets

Confidence intervals are not probability statements about \theta. The parameter \theta is fixed; it’s the interval C_n that varies with different samples.

The correct interpretation is: if you repeatedly collect data and construct confidence intervals using the same procedure, (1-\alpha) \times 100\% of those intervals will contain the true parameter. This is the frequentist guarantee.

Bayesian credible intervals provide a different type of statement – they give a probability that \theta lies in the interval given your observed data. However, Bayesian intervals are not guaranteed to achieve frequentist coverage (containing the true parameter (1-\alpha) proportion of the time across repeated sampling).

Critical Point: What is Random?

Remember that \theta is fixed and C_n is random. The parameter doesn’t change; the interval does. Each time we collect new data, we get a different interval. The guarantee is that (1-\alpha) \times 100\% of these intervals will contain the true parameter.

The frequentist guarantee is an average over all possible datasets. For any specific confidence interval procedure, there may exist parameter values where the actual coverage is less than the nominal (1-\alpha) level.

Additionally, for some “unlucky” datasets (the \alpha proportion), the computed interval may be far from the true parameter. This is an inherent limitation of the frequentist approach – it provides no guarantees for individual intervals, only for the long-run performance of the procedure.

This suggests different approaches may be preferred in different contexts:

  • Frequentist methods: Well-suited for repeated analyses where long-run guarantees matter
  • Bayesian methods: May be preferred when prior information is available and you need probabilistic statements about parameters given your specific data

Both approaches are valuable tools for quantifying uncertainty.

4.3.2 Normal-Based Confidence Intervals

When an estimator \hat{\theta}_n is approximately normally distributed, we can form confidence intervals using the normal approximation. This is often the case for large samples due to the Central Limit Theorem.

Suppose \hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\text{se}}^2), where \widehat{\text{se}} is the (estimated) standard error of the estimator. Then we can standardize to get: \frac{\hat{\theta}_n - \theta}{\widehat{\text{se}}} \approx \mathcal{N}(0, 1)

Let z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2) be the upper \alpha/2 quantile of the standard normal distribution, where \Phi is the standard normal CDF. This means:

  • \mathbb{P}(Z > z_{\alpha/2}) = \alpha/2 for Z \sim \mathcal{N}(0, 1)
  • \mathbb{P}(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha

Therefore: \mathbb{P}\left(-z_{\alpha/2} < \frac{\hat{\theta}_n - \theta}{\widehat{\text{se}}} < z_{\alpha/2}\right) \approx 1 - \alpha

Rearranging to isolate \theta: \mathbb{P}\left(\hat{\theta}_n - z_{\alpha/2} \widehat{\text{se}} < \theta < \hat{\theta}_n + z_{\alpha/2} \widehat{\text{se}}\right) \approx 1 - \alpha

This gives us the approximate (1-\alpha) confidence interval: C_n = \left(\hat{\theta}_n - z_{\alpha/2} \widehat{\text{se}}, \hat{\theta}_n + z_{\alpha/2} \widehat{\text{se}}\right)

For the common case of 95% confidence intervals (\alpha = 0.05):

  • z_{0.025} = \Phi^{-1}(0.975) \approx 1.96 \approx 2
  • This leads to the familiar rule of thumb: \hat{\theta}_n \pm 2 \widehat{\text{se}}
Example: Confidence Interval for the Mean

For the sample mean \bar{X}_n with known population variance \sigma^2:

  • Estimator: \hat{\theta}_n = \bar{X}_n
  • Standard error: \text{se} = \sigma/\sqrt{n}
  • 95% CI: \bar{X}_n \pm 1.96 \cdot \sigma/\sqrt{n}

If \sigma is unknown, we substitute the sample standard deviation s to get:

  • 95% CI: \bar{X}_n \pm 1.96 \cdot s/\sqrt{n}

For small samples, we would use the t-distribution instead of the normal distribution.
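
To make this concrete, here is a minimal simulation that constructs the normal-based interval and checks its coverage empirically (the Normal(5, 2²) population, n = 30, and 5,000 repetitions are illustrative choices, not from the text):

import numpy as np
from scipy import stats

np.random.seed(0)
mu, sigma, n, n_reps = 5.0, 2.0, 30, 5000
z = stats.norm.ppf(1 - 0.05/2)  # z_{alpha/2} ~ 1.96

covered = 0
for _ in range(n_reps):
    x = np.random.normal(mu, sigma, n)
    se_hat = np.std(x, ddof=1) / np.sqrt(n)  # estimated SE: s / sqrt(n)
    lo, hi = np.mean(x) - z * se_hat, np.mean(x) + z * se_hat
    covered += (lo <= mu <= hi)  # does this interval contain the true mean?

# Coverage lands slightly below 0.95 at n = 30 with the normal quantile;
# swapping z for stats.t.ppf(0.975, n - 1) brings it closer to the nominal level.
print(f"Empirical coverage: {covered / n_reps:.3f}")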

4.4 The Plug-In Principle: A General Method for Estimation

This chapter focuses on nonparametric estimation, and the plug-in principle is a key instrument of nonparametric statistics.

4.4.1 Estimating the Entire Distribution: The EDF

The plug-in principle is a simple idea that provides a unified framework for nonparametric estimation, one that works for virtually any statistical problem.

The Core Idea: When we need to estimate some property of an unknown distribution F, we simply:

  1. Estimate the distribution itself using our data
  2. Calculate the property using our estimated distribution

The most fundamental nonparametric estimate is of the CDF itself. Let X_1, \ldots, X_n \sim F be an i.i.d. sample where F is a distribution function on the real line. Since we don’t know the true CDF F, we estimate it with the Empirical Distribution Function (EDF).

The Empirical Distribution Function (EDF) \hat{F}_n is the CDF that puts probability mass 1/n on each data point. Formally: \hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \leq x) where I(X_i \leq x) = \begin{cases} 1 & \text{if } X_i \leq x \\ 0 & \text{if } X_i > x \end{cases}

Intuitively, \hat{F}_n(x) is simply the proportion of data points less than or equal to x. It’s the most natural estimate of F(x) = \mathbb{P}(X \leq x).

Properties of the EDF:

For any fixed point x:

  • Unbiased: \mathbb{E}(\hat{F}_n(x)) = F(x)
  • Variance: \mathbb{V}(\hat{F}_n(x)) = \frac{F(x)(1-F(x))}{n}
  • MSE: \text{MSE}(\hat{F}_n(x)) = \mathbb{V}(\hat{F}_n(x)) = \frac{F(x)(1-F(x))}{n} \to 0 as n \to \infty
  • Consistent: \hat{F}_n(x) \xrightarrow{P} F(x) as n \to \infty

But the EDF is even better than these pointwise properties suggest:

The EDF converges to the true CDF uniformly: \sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{P} 0

This is a remarkably strong guarantee – the EDF approximates the true CDF well everywhere, not just at individual points.
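
A quick simulation illustrates this uniform convergence. The sketch below (using an Exponential(0.5) population as an illustrative choice) computes \sup_x |\hat{F}_n(x) - F(x)| for increasing n:

import numpy as np

np.random.seed(0)
for n in [50, 500, 5000, 50000]:
    x = np.sort(np.random.exponential(scale=0.5, size=n))
    edf_right = np.arange(1, n + 1) / n  # EDF value just after each jump
    edf_left = np.arange(0, n) / n       # EDF value just before each jump
    F = 1 - np.exp(-x / 0.5)             # true exponential CDF at the data points
    # The supremum over all x is attained at a jump point, on one side or the other
    sup_dist = max(np.max(np.abs(edf_right - F)), np.max(np.abs(edf_left - F)))
    print(f"n = {n:6d}: sup |F_hat - F| = {sup_dist:.4f}")

The printed distances shrink toward zero as n grows, exactly as the Glivenko-Cantelli theorem promises.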

Let’s visualize the EDF with an example dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate nerve data (waiting times between pulses)
# Using exponential distribution as a model for waiting times
np.random.seed(42)
n = 100
nerve_data = np.random.exponential(scale=0.5, size=n)

# Create the EDF plot
fig, ax = plt.subplots(figsize=(7, 5))

# Sort data for plotting
sorted_data = np.sort(nerve_data)

# Plot the EDF as a step function
ax.step(sorted_data, np.arange(1, n+1)/n, where='post', 
        linewidth=2, color='blue', label='Empirical CDF')

# Add rug plot to show data points
ax.plot(sorted_data, np.zeros_like(sorted_data), '|', 
        color='black', markersize=10, alpha=0.5)

# Add true CDF for comparison
x_range = np.linspace(0, max(sorted_data)*1.1, 1000)
true_cdf = 1 - np.exp(-x_range/0.5)  # Exponential CDF
ax.plot(x_range, true_cdf, 'r--', linewidth=2, alpha=0.7, label='True CDF')

ax.set_xlabel('Waiting time (seconds)')
ax.set_ylabel('Cumulative probability')
ax.set_title('Empirical Distribution Function for Nerve Pulse Data (Simulated)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, max(sorted_data)*1.05)
ax.set_ylim(0, 1.05)

plt.tight_layout()
plt.show()

The step function shows \hat{F}_n, jumping by 1/n at each data point. The vertical lines at the bottom show the actual data points. Notice how the EDF closely follows the true CDF (shown for comparison).

We can construct confidence bands for the entire CDF using the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality:

\mathbb{P}\left(\sup_x |F(x) - \hat{F}_n(x)| > \epsilon\right) \leq 2e^{-2n\epsilon^2}

This leads to a 1-\alpha confidence band:

L(x) = \max\{\hat{F}_n(x) - \epsilon_n, 0\}, \qquad U(x) = \min\{\hat{F}_n(x) + \epsilon_n, 1\},

where \epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}.

# Add confidence bands to previous plot
alpha = 0.05
epsilon_n = np.sqrt(np.log(2/alpha) / (2*n))

fig, ax = plt.subplots(figsize=(7, 5))

# Plot EDF
ax.step(sorted_data, np.arange(1, n+1)/n, where='post', 
        linewidth=2, color='blue', label='Empirical CDF')

# Add confidence bands
y_values = np.arange(1, n+1)/n
lower_band = np.maximum(y_values - epsilon_n, 0)
upper_band = np.minimum(y_values + epsilon_n, 1)

ax.fill_between(sorted_data, lower_band, upper_band, 
                step='post', alpha=0.3, color='gray', 
                label=f'{int((1-alpha)*100)}% Confidence band')

# Add rug plot
ax.plot(sorted_data, np.zeros_like(sorted_data), '|', 
        color='black', markersize=10, alpha=0.5)

ax.set_xlabel('Waiting time (seconds)')
ax.set_ylabel('Cumulative probability')
ax.set_title('EDF with Confidence Band')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, max(sorted_data)*1.05)
ax.set_ylim(0, 1.05)

plt.tight_layout()
plt.show()

print(f"Width of confidence band: ±{epsilon_n:.3f}")
print(f"This means we're 95% confident the true CDF lies within this gray region")

Width of confidence band: ±0.136
This means we're 95% confident the true CDF lies within this gray region

4.4.2 Estimating Functionals: The Plug-In Estimator

Now that we can estimate the entire distribution, we can estimate any property of it. A statistical functional is any function of the CDF F or the PDF f.

Examples include:

  • The mean: \mu = \int x f(x)\,dx
  • The variance: \sigma^2 = \int (x-\mu)^2 f(x)\,dx
  • The median: m = F^{-1}(1/2)

The plug-in estimator of \theta = T(F) is defined by: \hat{\theta}_n = T(\hat{F}_n)

In other words, just plug in the empirical CDF \hat{F}_n for the unknown CDF F.

This principle is remarkably general. To estimate any property of the population, we calculate that same property on our empirical distribution. Let’s see how this works for various functionals:

The Mean:

Let \mu = T(F) = \int x f(x) dx.

The plug-in estimator is: \hat{\mu}_n = \sum_x x\, \hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}_n, where \hat{f}_n is the empirical mass function that puts mass 1/n on each observed data point.

The plug-in estimator for the mean is just the sample mean!

The Variance:

Let \sigma^2 = T(F) = \mathbb{V}(X) = \int x^2 f(x)\,dx - \left(\int x f(x)\,dx\right)^2.

The plug-in estimator is: \hat{\sigma}^2_n = \sum_x x^2 \hat{f}_n(x) - \left(\sum_x x\, \hat{f}_n(x)\right)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X}_n)^2

Alternative: The Sample Variance

Another common estimator of \sigma^2 is the sample variance with the n-1 correction: S_n^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X}_n)^2

This estimator is unbiased (i.e., \mathbb{E}[S_n^2] = \sigma^2) and is almost identical to the plug-in estimator in practice. The factor \frac{n}{n-1} makes little difference for moderate to large samples. Most statistical software uses the n-1 version by default.

The Median:

The median is the value that splits the distribution in half. For the theoretical distribution: m = T(F) = F^{-1}(0.5)

The plug-in estimator is simply the sample median – the middle value when we sort our data. For an odd number of observations, it's the middle value; for an even number, it's the average of the two middle values.

Skewness (measures asymmetry): T(F) = \frac{\mathbb{E}[(X-\mu)^3]}{\sigma^3} \implies \hat{\kappa}_n = \frac{\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X}_n)^3}{(\hat{\sigma}^2_n)^{3/2}}

Correlation for bivariate data (X,Y): T(F) = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \implies \hat{\rho}_n = \frac{\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X}_n)(Y_i-\bar{Y}_n)}{\sqrt{\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X}_n)^2}\sqrt{\frac{1}{n}\sum_{i=1}^n(Y_i-\bar{Y}_n)^2}}

Quantiles in general: T(F) = F^{-1}(p) \implies \hat{q}_p = \hat{F}_n^{-1}(p)

Since \hat{F}_n is a step function, we define: \hat{F}_n^{-1}(p) = \inf\{x : \hat{F}_n(x) \geq p\}

This gives us the sample quantile at level p.
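
As a quick illustration, here is a sketch of these plug-in estimators in code (the simulated exponential data are for demonstration only):

import numpy as np

np.random.seed(0)
x = np.random.exponential(scale=2.0, size=200)

mean_hat = np.mean(x)            # plug-in mean = sample mean
var_plugin = np.var(x)           # plug-in variance (divides by n)
var_sample = np.var(x, ddof=1)   # sample variance (divides by n - 1)
median_hat = np.median(x)        # plug-in median
# 'inverted_cdf' matches inf{x : F_hat(x) >= p} exactly (NumPy >= 1.22);
# the default method interpolates between order statistics instead
q90_hat = np.quantile(x, 0.90, method='inverted_cdf')
skew_hat = np.mean((x - mean_hat)**3) / var_plugin**1.5  # plug-in skewness

print(f"mean {mean_hat:.3f}  plug-in var {var_plugin:.3f}  sample var {var_sample:.3f}")
print(f"median {median_hat:.3f}  90th percentile {q90_hat:.3f}  skewness {skew_hat:.3f}")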

Confidence Intervals for Plug-in Estimators:

When the plug-in estimator \hat{\theta}_n = T(\hat{F}_n) is approximately normally distributed, we can form confidence intervals using: \hat{\theta}_n \pm z_{\alpha/2} \widehat{\text{se}}

as we saw earlier in the Normal-Based Confidence Intervals section.

The challenge is finding \widehat{\text{se}}. For the mean, we know from theory that \text{se}(\bar{X}_n) = \sigma/\sqrt{n}, which we can estimate by plugging in \hat{\sigma} to get: \bar{X}_n \pm z_{\alpha/2} \frac{\hat{\sigma}}{\sqrt{n}}

But what about the median? The 90th percentile? The interquartile range? For most functionals, there is no simple formula for the standard error.

This is where the bootstrap comes to the rescue, as we see next.

Recap: Nonparametric Estimation

Let X_1, \ldots, X_n \sim F be an i.i.d. sample where F is a distribution function on the real line.

  • The empirical distribution function \hat{F}_n is the CDF that puts mass 1/n at each data point X_i
  • A statistical functional T(F) is any function of F (e.g., mean, variance, median)
  • The plug-in estimator of \theta = T(F) is \hat{\theta}_n = T(\hat{F}_n)

We’ve seen how to create point estimates for any functional. The challenge is quantifying their uncertainty when no formula exists for the standard error.

4.5 The Bootstrap: Simulating Uncertainty

4.5.1 The Core Idea

The bootstrap, invented by Bradley Efron in 1979, is one of the most important statistical innovations of the 20th century. It provides a general, computational method for assessing the uncertainty of virtually any statistic or functional of the data.

Let T_n = g(X_1, \ldots, X_n) be our statistic of interest (e.g., the median, the variance, a given quantile). What’s its variability?

The key insight is simple: We can learn about the variability of our statistic by seeing how it varies across different samples. Since we only have one sample, we create new samples by resampling from our data.

Here’s the crucial diagram that illustrates the bootstrap principle:

Real World:      F   ==>   X₁,...,Xₙ   ==>   Tₙ = g(X)
                 ↑                            ↑
                 Unknown                      What we want to understand

Bootstrap World: F̂ₙ  ==>   X₁*,...,Xₙ*  ==>  Tₙ* = g(X*)
                 ↑                            ↑
                 Known!                       What we can simulate

In the real world:

  • Nature draws from the unknown distribution F
  • We observe one sample X_1, \ldots, X_n
  • We calculate our statistic T_n = g(X_1, \ldots, X_n)
  • We want to know the sampling distribution of T_n

In the bootstrap world:

  • We draw from the known empirical distribution \hat{F}_n (see below)
  • We get a bootstrap sample X_1^*, \ldots, X_n^*
  • We calculate the bootstrap statistic T_n^* = g(X_1^*, \ldots, X_n^*)
  • We can repeat the process multiple times to simulate the sampling distribution of T_n^*

The Bootstrap Principle: The distribution of T_n^* around T_n approximates the distribution of T_n around T(F).

How do we draw from \hat{F}_n? Remember that \hat{F}_n puts mass 1/n at each observed data point. Therefore:

Warning

Key Point: Drawing from \hat{F}_n means sampling with replacement from the original data. Each bootstrap sample contains n observations, some repeated, some omitted.
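
A tiny sketch makes this concrete. Since the chance that any given point never appears in a bootstrap sample is (1 - 1/n)^n \approx e^{-1}, only about 63.2% of the original points show up in a typical bootstrap sample:

import numpy as np

np.random.seed(0)
n = 100
data = np.arange(n)  # stand-in for an original sample of size n
x_star = np.random.choice(data, size=n, replace=True)  # one draw from F_hat_n

# Typically around 63 of the 100 points appear; the remaining slots are repeats
print(f"Unique original points in bootstrap sample: {len(np.unique(x_star))} / {n}")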

4.5.2 Bootstrap Variance and Standard Error Estimation

Let’s make this concrete with an algorithm:

Bootstrap Algorithm for Standard Error Estimation:

  1. For b = 1, \ldots, B:
    • Draw a bootstrap sample X_1^*, \ldots, X_n^* by sampling with replacement from \{X_1, \ldots, X_n\}
    • Compute the statistic for this bootstrap sample: T_{n,b}^* = g(X_1^*, \ldots, X_n^*)
  2. The bootstrap estimate of the standard error is: \widehat{\text{se}}_{\text{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(T_{n,b}^* - \bar{T}_n^*\right)^2} where \bar{T}_n^* = \frac{1}{B}\sum_{b=1}^B T_{n,b}^*

Let’s implement this for a concrete example – estimating the standard error of the median:

import numpy as np

def bootstrap_se(data, statistic, B=1000):
    """
    Compute bootstrap standard error of a statistic.
    
    Parameters:
    -----------
    data : array-like
        Original sample
    statistic : function
        Function that computes the statistic of interest
    B : int
        Number of bootstrap replications
    
    Returns:
    --------
    se : float
        Bootstrap standard error
    boot_samples : array
        Bootstrap replications of the statistic
    """
    n = len(data)
    boot_samples = np.zeros(B)
    
    for b in range(B):
        # Draw bootstrap sample
        x_star = np.random.choice(data, size=n, replace=True)
        # Compute statistic
        boot_samples[b] = statistic(x_star)
    
    # Compute standard error
    se = np.std(boot_samples, ddof=1)
    
    return se, boot_samples

# Example: Standard error of the median
np.random.seed(42)
data = np.random.exponential(scale=2, size=50)

# Point estimate
median_est = np.median(data)

# Bootstrap standard error
se_boot, boot_medians = bootstrap_se(data, np.median, B=2000)

print(f"Sample median: {median_est:.3f}")
print(f"Bootstrap SE: {se_boot:.3f}")
print(f"Approximate 95% CI: ({median_est - 2*se_boot:.3f}, {median_est + 2*se_boot:.3f})")
Sample median: 1.146
Bootstrap SE: 0.276
Approximate 95% CI: (0.594, 1.697)
An equivalent implementation in R:

bootstrap_se <- function(data, statistic, B = 1000) {
  #' Compute bootstrap standard error of a statistic
  #' 
  #' @param data Original sample vector
  #' @param statistic Function that computes the statistic of interest
  #' @param B Number of bootstrap replications
  #' @return List with standard error and bootstrap samples
  
  n <- length(data)
  boot_samples <- numeric(B)
  
  for (b in 1:B) {
    # Draw bootstrap sample
    x_star <- sample(data, size = n, replace = TRUE)
    # Compute statistic
    boot_samples[b] <- statistic(x_star)
  }
  
  # Compute standard error
  se <- sd(boot_samples)
  
  return(list(se = se, boot_samples = boot_samples))
}

# Example: Standard error of the median
set.seed(42)
data <- rexp(50, rate = 1/2)

# Point estimate
median_est <- median(data)

# Bootstrap standard error
result <- bootstrap_se(data, median, B = 2000)
se_boot <- result$se
boot_medians <- result$boot_samples

cat(sprintf("Sample median: %.3f\n", median_est))
cat(sprintf("Bootstrap SE: %.3f\n", se_boot))
cat(sprintf("Approximate 95%% CI: (%.3f, %.3f)\n", 
            median_est - 2*se_boot, median_est + 2*se_boot))

Let’s visualize the bootstrap distribution:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 5))

# Left panel: Original data
ax1.hist(data, bins=20, density=True, alpha=0.7, color='blue', edgecolor='black')
ax1.axvline(median_est, color='red', linestyle='--', linewidth=2, label=f'Median = {median_est:.3f}')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')
ax1.set_title('Original Data')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right panel: Bootstrap distribution
ax2.hist(boot_medians, bins=30, density=True, alpha=0.7, color='green', edgecolor='black')
ax2.axvline(median_est, color='red', linestyle='--', linewidth=2)
ax2.axvline(median_est - 2*se_boot, color='orange', linestyle=':', linewidth=2)
ax2.axvline(median_est + 2*se_boot, color='orange', linestyle=':', linewidth=2, 
            label='95% CI')
ax2.set_xlabel('Bootstrap median')
ax2.set_ylabel('Density')
ax2.set_title('Bootstrap Distribution')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

The left panel shows our original data. The right panel shows the distribution of the median across 2000 bootstrap samples. This distribution tells us about the uncertainty in our median estimate.

Two Sources of Error

The bootstrap involves two approximations:

  1. Statistical Approximation: \mathbb{V}_F(T_n) \approx \mathbb{V}_{\hat{F}_n}(T_n)
    • This depends on how well \hat{F}_n approximates F
    • We can’t control this – it depends on our sample size n
  2. Monte Carlo Approximation: \mathbb{V}_{\hat{F}_n}(T_n) \approx v_{\text{boot}}
    • This depends on the number of bootstrap samples B
    • We can make this arbitrarily small by increasing B
    • Typically B \geq 1000 is sufficient

In formulas: \mathbb{V}_F(T_n) \underbrace{\approx}_{\text{not so small}} \mathbb{V}_{\hat{F}_n}(T_n) \underbrace{\approx}_{\text{small}} v_{\text{boot}}

Remember: by increasing the bootstrap samples B we can reduce the Monte Carlo error (due to simulation), but we cannot improve the statistical approximation error without obtaining more real data.

Let’s verify that the Monte Carlo error decreases with B:

# Show convergence of bootstrap SE as B increases
B_values = [10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
se_estimates = []

for B in B_values:
    se, _ = bootstrap_se(data, np.median, B=B)
    se_estimates.append(se)

plt.figure(figsize=(7, 4))
plt.plot(B_values, se_estimates, 'bo-', linewidth=2, markersize=8)
plt.axhline(se_estimates[-1], color='red', linestyle='--', 
            label=f'Converged value ≈ {se_estimates[-1]:.3f}')
plt.xlabel('Number of bootstrap samples (B)')
plt.ylabel('Bootstrap SE estimate')
plt.title('Monte Carlo Error Decreases with B')
plt.xscale('log')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

The fluctuations for small B are due to Monte Carlo variability. As B increases, the bootstrap standard error estimate stabilizes. Still, this is only one component of the error – to improve further we need additional real data.

4.6 Bootstrap Confidence Intervals

4.6.1 Three Common Methods

Now that we can estimate the sampling distribution of any statistic via the bootstrap, we can construct confidence intervals. But how exactly should we use the bootstrap distribution to form an interval?

There are three main approaches, each with different strengths and weaknesses. We’ll illustrate all three using a real example: estimating the correlation between a country’s economic prosperity and the health of its population.

Example: European Health and Wealth Data

This example explores the correlation between a country’s wealth (GDP per capita) and the average lifespan of its citizens (life expectancy at birth). The data is for a subset of EU countries, with GDP per capita for 2025 and life expectancy from 2023.

# European Health and Wealth Data
# GDP per capita (2025 forecast), Life Expectancy at birth (2023) for a subset of EU countries.
countries = [
    'Germany', 'France', 'Italy', 'Spain', 'Netherlands', 'Poland', 
    'Belgium', 'Sweden', 'Austria', 'Ireland', 'Denmark', 'Finland', 
    'Portugal', 'Greece', 'Czech Republic'
]

gdp_per_capita = np.array([
    55911, 46792, 41091, 36192, 70480, 26805, 57772, 58100, 58192, 
    108919, 74969, 54163, 30002, 25756, 33039
])

life_expectancy = np.array([
    81.38, 83.33, 83.72, 83.67, 82.16, 78.63, 82.11, 83.26, 81.96, 
    82.41, 81.93, 81.91, 82.36, 81.86, 79.83
])

# Combine into a single dataset for resampling
data_combined = np.column_stack((gdp_per_capita, life_expectancy))
n = len(life_expectancy)

# Define correlation function
def correlation(data):
    x, y = data[:, 0], data[:, 1]
    return np.corrcoef(x, y)[0, 1]

# Original correlation
rho_hat = correlation(data_combined)
print(f"Sample correlation: {rho_hat:.3f}")
Sample correlation: 0.238

Now let’s generate bootstrap samples:

# Bootstrap the correlation
B = 5000
boot_correlations = np.zeros(B)

np.random.seed(42)
for b in range(B):
    # Resample pairs (important to maintain pairing!)
    indices = np.random.choice(n, size=n, replace=True)
    boot_sample = data_combined[indices]
    boot_correlations[b] = correlation(boot_sample)

# Bootstrap standard error
se_boot = np.std(boot_correlations, ddof=1)
print(f"Bootstrap SE: {se_boot:.3f}")
Bootstrap SE: 0.241

Let’s visualize the bootstrap distribution:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 5))

# Scatter plot of original data
ax1.scatter(gdp_per_capita, life_expectancy, alpha=0.6, s=50)
ax1.set_xlabel('GDP per Capita (US$)')
ax1.set_ylabel('Life Expectancy (Years)')
ax1.set_title(f'European Countries (ρ = {rho_hat:.3f})')
ax1.grid(True, alpha=0.3)

# Bootstrap distribution
ax2.hist(boot_correlations, bins=40, density=True, alpha=0.7, 
         color='green', edgecolor='black')
ax2.axvline(rho_hat, color='red', linestyle='--', linewidth=2, 
            label=f'Original ρ = {rho_hat:.3f}')
ax2.set_xlabel('Bootstrap Correlation')
ax2.set_ylabel('Density')
ax2.set_title('Bootstrap Distribution')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Notice that the bootstrap distribution is somewhat skewed. This skewness will affect our confidence intervals.

4.6.2 Comparing Bootstrap Confidence Intervals

Let \hat{T}_n = T_n(\hat{F}_n) be the statistic evaluated on the empirical distribution, and T_n^* the bootstrap statistic.

Method 1: The Normal Interval

Formula: \hat{T}_n \pm z_{\alpha/2} \widehat{\text{se}}_{\text{boot}}

This is the simplest method – we just use the bootstrap standard error in the usual normal-based formula.

# Normal interval
alpha = 0.05
z_alpha = stats.norm.ppf(1 - alpha/2)
normal_lower = rho_hat - z_alpha * se_boot
normal_upper = rho_hat + z_alpha * se_boot

print(f"95% Normal interval: ({normal_lower:.3f}, {normal_upper:.3f})")
95% Normal interval: (-0.235, 0.711)

Pros:

  • Simple and intuitive
  • Only requires the standard error, not the full distribution
  • Familiar to those who know basic statistics

Cons:

  • Assumes the sampling distribution is approximately normal
  • Can give nonsensical intervals (e.g., correlation > 1)
  • Ignores skewness in the bootstrap distribution

Let’s check if our interval makes sense:

if normal_upper > 1:
    print(f"Warning: Upper limit {normal_upper:.3f} exceeds 1!")
    print("This is impossible for a correlation!")

Method 2: The Percentile Interval

Formula: \left(T^*_{n, \alpha/2},\; T^*_{n, 1-\alpha/2}\right)

where T^*_{n,\beta} denotes the \beta sample quantile of the bootstrapped statistic values T^*_n.

The percentile interval method uses the \alpha/2 and 1-\alpha/2 quantiles of the bootstrap distribution directly.

# Percentile interval
percentile_lower = np.percentile(boot_correlations, 100 * alpha/2)
percentile_upper = np.percentile(boot_correlations, 100 * (1 - alpha/2))

print(f"95% Percentile interval: ({percentile_lower:.3f}, {percentile_upper:.3f})")
95% Percentile interval: (-0.389, 0.596)

Pros:

  • Doesn’t assume normality – adapts to the actual shape
  • Always respects parameter bounds (correlation stays in [-1, 1])
  • Simple to understand: “middle 95% of bootstrap values”
  • Works well when the bootstrap distribution is approximately unbiased

Cons:

  • Can have poor coverage when there’s bias
  • Not transformation-invariant
  • May not be accurate for small samples

Method 3: The Pivotal Interval

Formula: \left(2\hat{T}_n - T^*_{n, 1-\alpha/2},\; 2\hat{T}_n - T^*_{n, \alpha/2}\right)

This method assumes that the distribution of T_n - \theta is approximately the same as the distribution of T_n^* - T_n, where \theta is the true parameter.

# Pivotal interval  
pivotal_lower = 2 * rho_hat - np.percentile(boot_correlations, 100 * (1 - alpha/2))
pivotal_upper = 2 * rho_hat - np.percentile(boot_correlations, 100 * alpha/2)

print(f"95% Pivotal interval: ({pivotal_lower:.3f}, {pivotal_upper:.3f})")
95% Pivotal interval: (-0.120, 0.865)

Pros:

  • Often more accurate than the other two methods
  • Corrects for bias in the estimator
  • Transformation-respecting (invariant under monotone transformations)

Cons:

  • Less intuitive – the formula seems backwards at first
  • Can occasionally give values outside parameter bounds
  • Requires symmetric error distribution for best performance

Let’s compare all three intervals visually:

fig, ax = plt.subplots(figsize=(7, 5))

# Plot bootstrap distribution
histogram_data = ax.hist(boot_correlations, bins=40, density=True, alpha=0.5, 
                        color='gray', edgecolor='black', label='Bootstrap distribution')

# Get the maximum height for positioning intervals
max_density = max(histogram_data[0])

# Add vertical lines for original estimate
ax.axvline(rho_hat, color='red', linestyle='-', linewidth=3, label='Original estimate')

# Add confidence intervals overlaid on the histogram
methods = ['Normal', 'Percentile', 'Pivotal']
intervals = [(normal_lower, normal_upper), 
             (percentile_lower, percentile_upper),
             (pivotal_lower, pivotal_upper)]
colors = ['blue', 'green', 'orange']

# Position intervals at different heights on the histogram
y_positions = [max_density * 0.2, max_density * 0.15, max_density * 0.1]

for method, (lower, upper), color, y_pos in zip(methods, intervals, colors, y_positions):
    # Draw the interval line
    ax.plot([lower, upper], [y_pos, y_pos], color=color, linewidth=4, 
            marker='|', markersize=10, label=f'{method}')
    # Add vertical lines at interval endpoints
    ax.vlines([lower, upper], 0, y_pos, color=color, linestyle=':', alpha=0.5)

ax.set_xlabel('Correlation')
ax.set_ylabel('Density')
ax.set_title('Bootstrap Distribution with Confidence Intervals')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, max_density * 1.1)

plt.tight_layout()
plt.show()

Let’s summarize the confidence intervals in a table for easy comparison:

| Method     | Lower Bound | Upper Bound | Width |
|------------|-------------|-------------|-------|
| Normal     |   -0.235    |    0.711    | 0.946 |
| Percentile |   -0.389    |    0.596    | 0.986 |
| Pivotal    |   -0.120    |    0.865    | 0.986 |

In this example:

  • The Normal interval is symmetric around the point estimate, ignoring the skewness
  • The Percentile interval reflects the skewness of the bootstrap distribution
  • The Pivotal interval adjusts for bias and is slightly shifted from the percentile interval

The pivotal method seems counterintuitive at first. Why do we subtract the upper quantile from 2\hat{T}_n to get the lower bound?

The key insight is that the pivotal interval assumes the error \hat{T}_n - \theta (where \theta = T(F) is the true value) behaves similarly to the bootstrap error T_n^* - \hat{T}_n.

Quick derivation: Start with the bootstrap distribution: \mathbb{P}( T^*_{n,\alpha/2} \leq T_n^* \leq T^*_{n,1-\alpha/2} ) = 1-\alpha

Subtract \hat{T}_n throughout: \mathbb{P}( T^*_{n,\alpha/2} - \hat{T}_n \leq T_n^* - \hat{T}_n \leq T^*_{n,1-\alpha/2} - \hat{T}_n ) = 1-\alpha

The key assumption: The bootstrap error T_n^* - \hat{T}_n has approximately the same distribution as the real error \hat{T}_n - \theta. Therefore: \mathbb{P}( T^*_{n,\alpha/2} - \hat{T}_n \leq \hat{T}_n - \theta \leq T^*_{n,1-\alpha/2} - \hat{T}_n ) \approx 1-\alpha

Rearranging to isolate \theta: \mathbb{P}( 2\hat{T}_n - T^*_{n,1-\alpha/2} \leq \theta \leq 2\hat{T}_n - T^*_{n,\alpha/2} ) \approx 1-\alpha

This gives us the pivotal interval: (2\hat{T}_n - T^*_{n,1-\alpha/2}, 2\hat{T}_n - T^*_{n,\alpha/2}).

Intuition: If bootstrap values tend to be above our estimate, then our estimate is probably below the truth by a similar amount. The “2×estimate minus bootstrap quantile” formula automatically corrects for this bias.

4.7 Bootstrap Application: Higher Moments

The following example demonstrates both the power and limitations of the bootstrap.

Example: Bootstrap for Higher Moments

Consider a sample of n = 20 observations from a standard normal distribution \mathcal{N}(0, 1). We’ll use the bootstrap to estimate confidence intervals for two related statistics:

  1. T^{(1)}(F) = \mathbb{E}[X^4] - the fourth raw moment
  2. T^{(2)}(F) = \mathbb{E}[(X - \mu)^4] - the fourth central moment

For the standard normal, both have the same true value: T^{(1)}(F) = T^{(2)}(F) = 3.

This example is valuable because:

  • We can compute the true sampling distribution via simulation
  • It shows how bootstrap performs for non-standard statistics
  • It reveals differences between seemingly similar estimators

Let’s implement this comparison:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set up the experiment
np.random.seed(42)
n = 20
true_value = 3  # True 4th moment for N(0,1)

# Generate one sample
sample = np.random.normal(0, 1, n)

# Compute point estimates
T1_hat = np.mean(sample**4)  # Raw 4th moment
T2_hat = np.mean((sample - np.mean(sample))**4)  # Central 4th moment

print(f"True value: {true_value}")
print(f"T¹ (raw 4th moment): {T1_hat:.3f}")
print(f"T² (central 4th moment): {T2_hat:.3f}")

# Bootstrap distributions
B = 2000
boot_T1 = np.zeros(B)
boot_T2 = np.zeros(B)

for b in range(B):
    boot_sample = np.random.choice(sample, size=n, replace=True)
    boot_T1[b] = np.mean(boot_sample**4)
    boot_T2[b] = np.mean((boot_sample - np.mean(boot_sample))**4)

# True sampling distributions (via simulation)
n_true = 10000
true_T1 = np.zeros(n_true)
true_T2 = np.zeros(n_true)

for i in range(n_true):
    true_sample = np.random.normal(0, 1, n)
    true_T1[i] = np.mean(true_sample**4)
    true_T2[i] = np.mean((true_sample - np.mean(true_sample))**4)

# Determine common x-axis range to show the dramatic difference
x_min = min(np.min(boot_T1), np.min(boot_T2), np.min(true_T1), np.min(true_T2))
x_max = max(np.max(boot_T1), np.max(boot_T2), np.max(true_T1), np.max(true_T2))

# Create figure with 4 subplots
fig, axes = plt.subplots(2, 2, figsize=(8, 10))

# T1: Raw 4th moment
# Bootstrap distribution
ax1 = axes[0, 0]
ax1.hist(boot_T1, bins=40, density=True, alpha=0.7, color='blue', 
         edgecolor='black', label='Bootstrap')
ax1.axvline(T1_hat, color='red', linestyle='--', linewidth=2, 
            label=f'Estimate = {T1_hat:.2f}')
ax1.axvline(true_value, color='black', linestyle=':', linewidth=2, 
            label=f'True = {true_value}')
ax1.set_title('T¹: Bootstrap Distribution')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')
ax1.set_xlim(x_min, x_max)
ax1.legend()
ax1.grid(True, alpha=0.3)

# True sampling distribution
ax2 = axes[0, 1]
ax2.hist(true_T1, bins=40, density=True, alpha=0.7, color='green', 
         edgecolor='black', label='True sampling dist')
ax2.axvline(np.mean(true_T1), color='red', linestyle='--', linewidth=2, 
            label=f'Mean = {np.mean(true_T1):.2f}')
ax2.axvline(true_value, color='black', linestyle=':', linewidth=2, 
            label=f'True = {true_value}')
ax2.set_title('T¹: True Sampling Distribution')
ax2.set_xlabel('Value')
ax2.set_ylabel('Density')
ax2.set_xlim(x_min, x_max)
ax2.legend()
ax2.grid(True, alpha=0.3)

# T2: Central 4th moment
# Bootstrap distribution
ax3 = axes[1, 0]
ax3.hist(boot_T2, bins=40, density=True, alpha=0.7, color='blue', 
         edgecolor='black', label='Bootstrap')
ax3.axvline(T2_hat, color='red', linestyle='--', linewidth=2, 
            label=f'Estimate = {T2_hat:.2f}')
ax3.axvline(true_value, color='black', linestyle=':', linewidth=2, 
            label=f'True = {true_value}')
ax3.set_title('T²: Bootstrap Distribution')
ax3.set_xlabel('Value')
ax3.set_ylabel('Density')
ax3.set_xlim(x_min, x_max)
ax3.legend()
ax3.grid(True, alpha=0.3)

# True sampling distribution
ax4 = axes[1, 1]
ax4.hist(true_T2, bins=40, density=True, alpha=0.7, color='green', 
         edgecolor='black', label='True sampling dist')
ax4.axvline(np.mean(true_T2), color='red', linestyle='--', linewidth=2, 
            label=f'Mean = {np.mean(true_T2):.2f}')
ax4.axvline(true_value, color='black', linestyle=':', linewidth=2, 
            label=f'True = {true_value}')
ax4.set_title('T²: True Sampling Distribution')
ax4.set_xlabel('Value')
ax4.set_ylabel('Density')
ax4.set_xlim(x_min, x_max)
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.suptitle('Bootstrap vs True Sampling Distributions', fontsize=14)
plt.tight_layout()
plt.show()
True value: 3
T¹ (raw 4th moment): 2.025
T² (central 4th moment): 1.882

Now let’s compare the confidence intervals:

Confidence Intervals for T¹ (Raw 4th Moment):
  True 95% CI:        (0.58, 8.84)
  Bootstrap Normal:   (0.42, 3.63)
  Bootstrap Percent:  (0.64, 3.79)
  Bootstrap Pivotal:  (0.26, 3.41)

Confidence Intervals for T² (Central 4th Moment):
  True 95% CI:        (0.50, 7.97)
  Bootstrap Normal:   (0.41, 3.36)
  Bootstrap Percent:  (0.48, 3.36)
  Bootstrap Pivotal:  (0.40, 3.29)
What Went Wrong?

The bootstrap catastrophically fails here – all methods estimate upper confidence bounds of ~3.4-3.8, while the true bounds are ~8-9 (more than double!).

Why? Fourth moments have heavy-tailed sampling distributions. With only n=20 observations, we likely missed the rare extreme values that drive the true variability. The bootstrap can only resample what it sees, so it fundamentally cannot capture the full range of uncertainty.

This isn’t just a small sample problem – it’s a fundamental limitation when statistics depend heavily on extreme values. Let’s explore when else the bootstrap fails…

4.8 When The Bootstrap Fails

The bootstrap is remarkably general, but it’s not foolproof. It relies on the sample being a good representation of the population. When this assumption breaks down, so does the bootstrap.

The bootstrap may fail or perform poorly when:

  1. Sample size is too small (typically n < 20-30)
  2. Estimating extreme order statistics (min, max, extreme quantiles)
  3. Heavy-tailed distributions without finite moments
  4. Non-smooth statistics (e.g., number of modes)
  5. Dependent data (unless using specialized methods like the block bootstrap, sketched below)
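
On the last point, here is a minimal sketch of the moving-block bootstrap, which resamples contiguous blocks rather than individual points so that short-range dependence is preserved (the AR(1) series, block length 10, and B = 1000 are illustrative assumptions):

import numpy as np

def moving_block_bootstrap(data, statistic, block_len=10, B=1000):
    """Bootstrap for serially dependent data by resampling contiguous blocks."""
    n = len(data)
    n_blocks = int(np.ceil(n / block_len))
    boot_stats = np.zeros(B)
    for b in range(B):
        # Random block start positions; each block keeps block_len consecutive points
        starts = np.random.randint(0, n - block_len + 1, size=n_blocks)
        sample = np.concatenate([data[s:s + block_len] for s in starts])[:n]
        boot_stats[b] = statistic(sample)
    return boot_stats

# Example: SE of the mean of an AR(1) series, where dependence inflates the true SE
np.random.seed(42)
n = 200
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + np.random.normal()

iid_se = np.std([np.mean(np.random.choice(ar1, n, replace=True))
                 for _ in range(1000)], ddof=1)
block_se = np.std(moving_block_bootstrap(ar1, np.mean), ddof=1)
print(f"i.i.d. bootstrap SE: {iid_se:.3f}   block bootstrap SE: {block_se:.3f}")

For positively autocorrelated data the block version reports a noticeably larger standard error – the i.i.d. bootstrap, by shuffling away the dependence, understates the variability of the mean.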
Example: Estimators at the Boundary

The bootstrap performs poorly for statistics at the edge of the parameter space. The classic example is estimating the maximum of a uniform distribution:

# Uniform distribution example
np.random.seed(42)
true_max = 10
n = 50
uniform_sample = np.random.uniform(0, true_max, n)
sample_max = np.max(uniform_sample)

# True sampling distribution of the maximum
# For Uniform(0, θ), the maximum has CDF F(x) = (x/θ)^n
x_theory = np.linspace(sample_max * 0.8, true_max, 1000)
pdf_theory = n * (x_theory/true_max)**(n-1) / true_max

# Bootstrap distribution
B = 2000
boot_max_uniform = np.zeros(B)
for b in range(B):
    boot_sample = np.random.choice(uniform_sample, size=n, replace=True)
    boot_max_uniform[b] = np.max(boot_sample)

fig, ax = plt.subplots(figsize=(7, 5))

# Bootstrap histogram
ax.hist(boot_max_uniform, bins=40, density=True, alpha=0.7, 
        color='blue', edgecolor='black', label='Bootstrap distribution')

# True distribution
ax.plot(x_theory, pdf_theory, 'r-', linewidth=2, label='True distribution')

# Key values
ax.axvline(sample_max, color='green', linestyle='--', linewidth=2, 
           label=f'Sample max = {sample_max:.2f}')
ax.axvline(true_max, color='black', linestyle=':', linewidth=2, 
           label=f'True θ = {true_max}')

ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Bootstrap Fails for Uniform Maximum')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(8, 10.5)

plt.tight_layout()
plt.show()

Why it fails:

  • The true maximum (\theta = 10) is always ≥ the sample maximum
  • The true sampling distribution has support on [\text{sample max}, \theta]
  • But bootstrap samples can never exceed the original sample maximum!
  • The bootstrap distribution is entirely to the left of where it should be

Note: Similar biases arise when estimating extreme statistics (e.g., very small or very large quantiles) of any distribution.

Despite these limitations, the bootstrap remains one of the most useful tools in modern statistics. Just remember: it’s a powerful method, not a magical one.

4.9 Chapter Summary and Connections

4.9.1 Key Concepts Review

We’ve covered two fundamental ideas that revolutionized statistical practice:

The Plug-In Principle:

  • Estimate the distribution with the empirical distribution function (EDF)
  • Estimate any functional T(F) by computing T(\hat{F}_n)
  • Simple, intuitive, and widely applicable
  • Gives us point estimates for any statistic

The Bootstrap:

  • Assess uncertainty by resampling from the data
  • Create the “bootstrap world” that mimics the real world
  • Estimate standard errors and confidence intervals for any statistic
  • Three types of confidence intervals: Normal, Percentile, Pivotal

4.9.2 Why These Concepts Matter

For Statistical Practice:

  • No need to derive complex formulas for standard errors
  • Works for statistics where theory is intractable (median, correlation, etc.)
  • Provides a unified approach to uncertainty quantification
  • Democratizes statistics – complex inference becomes accessible

For Data Science:

  • Computational approach aligns with modern computing power
  • Easy to implement and parallelize
  • Works with complex models and machine learning algorithms
  • Provides uncertainty estimates crucial for decision-making

For Understanding:

  • Makes abstract concepts concrete through simulation
  • Reveals the sampling distribution visually
  • Helps build intuition about statistical variability
  • Connects theoretical statistics to computational practice

4.9.3 Common Pitfalls to Avoid

  1. Forgetting to sample with replacement
    • Bootstrap samples must be the same size as the original
    • Sampling n points without replacement merely permutes the original sample, so every bootstrap statistic is identical
  2. Using too few bootstrap samples
    • Use at least B = 1,000 for standard errors
    • Use B = 10,000 or more for confidence intervals
    • Monte Carlo error decreases at rate 1/\sqrt{B}
    • In practice, use as many as you can given your available compute
  3. Misinterpreting confidence intervals
    • They quantify uncertainty in the estimate
    • They are not probability statements about parameters
    • Different methods can give different intervals
  4. Applying bootstrap blindly
    • Check if your statistic is smooth
    • Be cautious with very small samples
    • Watch out for boundary cases
  5. Ignoring the assumptions
    • Bootstrap assumes the sample represents the population
    • It can’t fix biased sampling or systematic errors
    • It’s not magic – just clever resampling

4.9.4 Chapter Connections

The bootstrap builds on our theoretical foundations and provides a computational path forward:

  • From Previous Chapters: We’ve applied the plug-in principle to the empirical distribution (Chapter 1’s probability concepts), used Chapter 2’s variance formulas for bootstrap standard errors, and provided an alternative to Chapter 3’s CLT-based confidence intervals when theoretical distributions are intractable

  • Next - Hypothesis Testing (Chapter 5): Bootstrap will create null distributions for complex test statistics, complemented by permutation tests as another resampling approach

  • Parametric Methods (Chapters 6-7): Compare bootstrap to theoretical approaches, use it to validate assumptions, and construct confidence intervals for maximum likelihood estimates when standard theory is difficult

  • Machine Learning Applications: Bootstrap underpins ensemble methods (bagging), provides uncertainty quantification for predictions, and helps with model selection—making it essential for modern data science

4.9.5 Rejoinder: Coming Full Circle

Recall our opening example: the healthcare administrator planning hospital capacity during an epidemic. We began by asking how to quantify the uncertainty in our estimates, not just provide point predictions.

Through this chapter, we’ve answered that question with two powerful tools:

  1. The plug-in principle gave us a way to estimate any property of a distribution
  2. The bootstrap gave us a way to quantify the uncertainty of those estimates

These tools enable data-driven decision making by providing not just estimates, but confidence intervals that capture the range of plausible values. Whether you’re estimating hospital capacity, financial risk, or any other critical quantity, you now have the tools to say not just “we expect 500 patients” but “we’re 95% confident it will be between 300 and 700 patients.”

This transformation from point estimates to uncertainty quantification is what makes statistics invaluable for real-world decision making.

4.9.6 Practical Advice

  1. Start simple: Use percentile intervals as your default
  2. Visualize: Always plot the bootstrap distribution
  3. Compare methods: Try different CI methods to check robustness
  4. Think about assumptions: Is your sample representative?
  5. Use modern tools: Most software has built-in bootstrap functions

The bootstrap exemplifies the shift from mathematical to computational statistics. Master it, and you’ll have a powerful tool for almost any statistical problem you encounter.

4.9.7 Quick Self-Check

Before moving on, test your understanding with these questions:

  1. What is bootstrap sampling used for?
  • Estimating the sampling distribution of a statistic
  • Computing standard errors and confidence intervals
  • Assessing uncertainty when no formula exists

The bootstrap is particularly valuable when theoretical formulas are intractable or don’t exist, such as for the median, correlation coefficient, or complex machine learning predictions.

  2. What approximations does the method make?
  • Statistical approximation: Assumes \hat{F}_n approximates F well
    • This depends on sample size n and cannot be improved without more data
  • Monte Carlo approximation: Finite B approximates infinite resampling
    • This can be made arbitrarily small by increasing B (typically B \geq 1,000)

Remember: \mathbb{V}_F(T_n) \underbrace{\approx}_{\text{statistical error}} \mathbb{V}_{\hat{F}_n}(T_n) \underbrace{\approx}_{\text{Monte Carlo error}} v_{\text{boot}}

  3. What are its limitations?
  • Small samples (typically n < 20-30): The empirical distribution poorly represents the population
  • Extreme order statistics: Cannot extrapolate beyond observed data range (e.g., max of uniform distribution)
  • Heavy-tailed distributions: May miss rare extreme values that drive variability (as we saw with 4th moments)
  • Non-smooth statistics: Discontinuous functions like the number of modes
  • Dependent data: Requires specialized methods like block bootstrap for time series

The key limitation: bootstrap cannot see what’s not in your sample!

  4. How can bootstrap be used to construct confidence intervals?

Three main methods, each with different strengths:

  • Normal interval: \hat{T}_n \pm z_{\alpha/2} \cdot \widehat{\text{se}}_{\text{boot}}
    • Simplest, but assumes normality
    • Can give impossible values (e.g., correlation > 1)
  • Percentile interval: (T^*_{n,\alpha/2}, T^*_{n,1-\alpha/2})
    • Uses bootstrap quantiles directly
    • Respects parameter bounds, good default choice
  • Pivotal interval: (2\hat{T}_n - T^*_{n,1-\alpha/2}, 2\hat{T}_n - T^*_{n,\alpha/2})
    • Corrects for bias, often most accurate
    • Can occasionally exceed parameter bounds

Choose based on your specific problem and always visualize the bootstrap distribution!

4.9.8 Self-Test Problems

  1. Implementing Bootstrap: Write a function to bootstrap the trimmed mean (removing top and bottom 10% before averaging). Compare its standard error to the regular mean for normal and heavy-tailed data.

  2. Correlation CI: Using the European Health and Wealth data from the chapter, compute bootstrap confidence intervals for \rho^2 (squared correlation). How do the three methods compare? What happens to the intervals when you transform from \rho to \rho^2?

  3. Ratio Statistics: Given paired data (X_i, Y_i), bootstrap the ratio \bar{Y}/\bar{X}. Why might this be challenging? Compare the three CI methods.

  4. Bootstrap Failure - Range Statistic: Generate n=30 observations from a standard normal distribution and bootstrap the range (max - min). Compare the bootstrap distribution to the true sampling distribution (via simulation). Why does the bootstrap underestimate the variability? How does this relate to our discussion of extreme order statistics?

4.9.9 Code Reference

# Essential bootstrap code template
import numpy as np
from scipy import stats

def bootstrap(data, statistic, B=1000, alpha=0.05):
    """
    Generic bootstrap function for any statistic.
    
    Returns:
    - point_estimate: Original statistic value
    - se: Bootstrap standard error  
    - ci_normal: Normal-based CI
    - ci_percentile: Percentile CI
    - ci_pivotal: Pivotal CI
    - boot_dist: Bootstrap distribution
    """
    n = len(data)
    boot_samples = np.zeros(B)
    
    # Original estimate
    point_estimate = statistic(data)
    
    # Bootstrap
    for b in range(B):
        boot_data = np.random.choice(data, size=n, replace=True)
        boot_samples[b] = statistic(boot_data)
    
    # Standard error
    se = np.std(boot_samples, ddof=1)
    
    # Confidence intervals
    z = stats.norm.ppf(1 - alpha/2)
    ci_normal = (point_estimate - z*se, point_estimate + z*se)
    
    ci_percentile = (np.percentile(boot_samples, 100*alpha/2),
                     np.percentile(boot_samples, 100*(1-alpha/2)))
    
    ci_pivotal = (2*point_estimate - np.percentile(boot_samples, 100*(1-alpha/2)),
                  2*point_estimate - np.percentile(boot_samples, 100*alpha/2))
    
    return {
        'estimate': point_estimate,
        'se': se,
        'ci_normal': ci_normal,
        'ci_percentile': ci_percentile,
        'ci_pivotal': ci_pivotal,
        'boot_dist': boot_samples
    }

# Example usage
data = np.random.exponential(2, 50)
results = bootstrap(data, np.median, B=2000)
print(f"Median: {results['estimate']:.3f} (SE: {results['se']:.3f})")
print(f"95% Percentile CI: {results['ci_percentile']}")
The same template in R:

# Essential bootstrap code template
bootstrap <- function(data, statistic, B = 1000, alpha = 0.05) {
  n <- length(data)
  boot_samples <- numeric(B)
  
  # Original estimate
  point_estimate <- statistic(data)
  
  # Bootstrap
  for (b in 1:B) {
    boot_data <- sample(data, size = n, replace = TRUE)
    boot_samples[b] <- statistic(boot_data)
  }
  
  # Standard error
  se <- sd(boot_samples)
  
  # Confidence intervals
  z <- qnorm(1 - alpha/2)
  ci_normal <- c(point_estimate - z*se, point_estimate + z*se)
  
  ci_percentile <- quantile(boot_samples, c(alpha/2, 1-alpha/2))
  
  ci_pivotal <- c(2*point_estimate - quantile(boot_samples, 1-alpha/2),
                  2*point_estimate - quantile(boot_samples, alpha/2))
  
  list(
    estimate = point_estimate,
    se = se,
    ci_normal = ci_normal,
    ci_percentile = ci_percentile,
    ci_pivotal = ci_pivotal,
    boot_dist = boot_samples
  )
}

# Example usage
set.seed(42)
data <- rexp(50, rate = 1/2)
results <- bootstrap(data, median, B = 2000)
cat(sprintf("Median: %.3f (SE: %.3f)\n", results$estimate, results$se))
cat(sprintf("95%% Percentile CI: (%.3f, %.3f)\n", 
            results$ci_percentile[1], results$ci_percentile[2]))

Remember: The bootstrap transforms the abstract problem of understanding sampling distributions into the concrete task of resampling from your data. It’s statistics made tangible through computation. When in doubt, bootstrap it!