3  Convergence and The Basics of Inference

Published

July 23, 2025

3.1 Learning Objectives

After completing this chapter, you will be able to:

  • Explain how probability inequalities provide bounds on uncertainty.
  • Define concepts of probabilistic convergence and apply the Law of Large Numbers and Central Limit Theorem.
  • Define the core vocabulary of statistical inference (models, parameters, estimators).
  • Evaluate an estimator’s quality using its standard error, bias, and variance.
  • Explain the bias-variance tradeoff in the context of Mean Squared Error (MSE).
Note

This chapter covers probability inequalities, convergence concepts, and the foundations of statistical inference. The material is adapted from Chapters 4, 5, and 6 of Wasserman (2013), supplemented with additional examples and perspectives relevant to data science applications.

3.2 Introduction and Motivation

3.2.1 Convergence Matters to Understand Machine Learning Algorithms

Deep learning models are trained with stochastic optimization algorithms. These algorithms produce a sequence of parameter estimates \theta_1, \theta_2, \theta_3, \ldots

as they iterate through the data. But here’s the fundamental question: do these estimates eventually converge to a good solution, and how do we establish that?

The challenge is that these parameter estimates are random variables – they depend on random initialization, random mini-batch selection, and random data shuffling. We can’t use the simple definition of convergence where |x_n - x| < \epsilon for all large n, which you may remember from calculus. We need new mathematical tools.

This chapter develops the language of probabilistic convergence to understand and analyze such algorithms. We’ll then use these tools to build the foundation of statistical inference – the science of drawing conclusions about populations from samples.

Consider a concrete example: training a neural network for image classification. At each iteration t:

  1. Pick a random subset S of training images
  2. Compute the gradient g = \sum_{x_i \in S} \nabla_\theta L(\theta_t; x_i) of the loss1
  3. Compute next estimate of the model parameters, \theta_{t+1}, using g and the current parameters2

The randomness in batch selection makes each \theta_t a random variable. As mentioned before, ideally we would want \theta_1, \theta_2, \ldots to converge to a good solution. But what does it even mean to say the algorithm “converges”? This chapter provides the answer.
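As a concrete toy illustration of this recipe, the sketch below runs mini-batch stochastic gradient descent on a one-parameter least-squares problem with synthetic data; the data, learning rate, and batch size are all arbitrary choices made for illustration, not part of any particular model.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: squared-error loss L(theta; x, y) = (y - theta*x)^2 on synthetic data
x_data = rng.normal(size=1000)
y_data = 3.0 * x_data + rng.normal(scale=0.5, size=1000)  # "true" parameter is 3

theta = 0.0        # theta_0: initial guess
alpha = 0.05       # learning rate
batch_size = 32

for t in range(200):
    # 1. Pick a random mini-batch S of training points
    idx = rng.choice(len(x_data), size=batch_size, replace=False)
    xb, yb = x_data[idx], y_data[idx]
    # 2. Compute the gradient of the loss over the mini-batch
    g = np.sum(-2 * xb * (yb - theta * xb))
    # 3. Gradient descent step using the averaged mini-batch gradient
    theta = theta - alpha * g / batch_size

print(f"Estimate after 200 iterations: {theta:.3f} (true value 3.0)")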

Convergence isn’t just about optimization algorithms. It’s central to all of statistics:

  • When we compute a sample mean with an increasing amount of data, does it converge to the population mean?
  • More generally, when we estimate a model parameter, does our estimate improve with more data?
  • When we approximate a distribution, does the approximation get better?

The remarkable answers to these questions – provided by the Law of Large Numbers and Central Limit Theorem – form the theoretical backbone of statistical inference and machine learning.

For Finnish-speaking students, here’s a reference table of key terms in this chapter:

| English | Finnish | Context |
|---------|---------|---------|
| Markov’s inequality | Markovin epäyhtälö | Bounds probability of large values |
| Chebyshev’s inequality | Tšebyšovin epäyhtälö | Uses variance to bound deviations |
| Convergence in probability | Stokastinen suppeneminen | Random variable settling to a value |
| Convergence in distribution | Jakaumasuppeneminen | Distribution shape converging |
| Law of Large Numbers | Suurten lukujen laki | Sample mean → population mean |
| Central Limit Theorem | Keskeinen raja-arvolause | Sums become normally distributed |
| Statistical model | Tilastollinen malli | Set of possible distributions |
| Parametric model | Parametrinen malli | Finite-dimensional parameter space |
| Nonparametric model | Epäparametrinen malli | Infinite-dimensional space |
| Nuisance parameter | Kiusaparametri | Parameter not of primary interest |
| Point estimation | Piste-estimointi | Single best guess of parameter |
| Estimator | Estimaattori | Function of data estimating parameter |
| Bias | Harha | Expected error of estimator |
| Unbiased | Harhaton | Zero expected error |
| Consistent | Tarkentuva | Converges to true value |
| Standard error | Keskivirhe | Standard deviation of estimator |
| Mean Squared Error (MSE) | Keskimääräinen neliövirhe | Average squared error |
| Sampling distribution | Otantajakauma | Distribution of the estimator |

3.3 Inequalities: Bounding the Unknown

3.3.1 Why We Need Inequalities

In probability and statistics, we often encounter quantities that are difficult or impossible to compute exactly. Inequalities provide bounds – upper or lower limits – that give us useful information even when exact calculations are intractable. They serve three critical purposes:

  1. Bounding quantities: When we can’t compute a probability exactly, an upper bound tells us it’s “at most this large”
  2. Proving theorems: The Law of Large Numbers and Central Limit Theorem rely on inequalities in their proofs
  3. Practical guarantees: In machine learning, we use inequalities to create bounds on critical quantities such as generalization error3

Think of inequalities as providing universal statistical guarantees. They tell us that no matter how complicated the underlying distribution, certain bounds will always hold.

3.3.2 Markov’s Inequality

Markov’s Inequality: For a non-negative random variable X with finite expectation: \mathbb{P}(X \geq t) \leq \frac{\mathbb{E}(X)}{t} \quad \text{for all } t > 0

This remarkably simple inequality says that – no matter what – the probability of a non-negative random variable exceeding a threshold t is bounded by its mean divided by t.

Proof (continuous case, writing f for the density of X). Since X \geq 0: \begin{align} \mathbb{E}(X) &= \int_0^{\infty} x f(x) \, dx \\ &= \int_0^t x f(x) \, dx + \int_t^{\infty} x f(x) \, dx \\ &\geq \int_t^{\infty} x f(x) \, dx \\ &\geq t \int_t^{\infty} f(x) \, dx \\ &= t \mathbb{P}(X \geq t) \end{align}

Rearranging gives the result.

Example: Exceeding a Multiple of the Mean

Let X be a non-negative random variable with mean \mathbb{E}(X) = \mu. What can we say about the probability that X exceeds k times its mean, for some k > 1?

Using Markov’s inequality by setting t = k\mu: \mathbb{P}(X \geq k\mu) \leq \frac{\mathbb{E}(X)}{k\mu} = \frac{\mu}{k\mu} = \frac{1}{k}

For example, the probability of a non-negative random variable exceeding twice its mean is at most 1/2. The probability of it exceeding 10 times its mean is at most 1/10. This universal bound is surprisingly useful.

Example: Exam Scores

If the average exam score is 50 points, what’s the maximum probability that a randomly selected student scored 90 or more?

Using Markov’s inequality: \mathbb{P}(X \geq 90) \leq \frac{50}{90} = \frac{5}{9} \approx 0.556

At most 55.6% of students can score 90 or more. This bound requires only knowing the average – no other information about the distribution!

Markov’s inequality captures a fundamental truth: averages constrain extremes.

Imagine a village where the average wealth is €50,000. What fraction of villagers could be millionaires? If everyone were a millionaire, the average would be at least €1,000,000. Since the average is only €50,000, at most 5% can be millionaires: \[\text{Fraction of millionaires} \leq \frac{€50,000}{€1,000,000} = 0.05\]

This reasoning works for any non-negative quantity: test scores, waiting times, file sizes, or loss values in machine learning. The average puts a hard limit on how often extreme values can occur.

Markov’s inequality is the foundation for many other inequalities. Its power lies in its generality—it applies to any non-negative random variable with finite expectation.

The inequality is tight (best possible) for certain distributions. Consider: \[X = \begin{cases} 0 & \text{with probability } 1 - \frac{\mu}{t} \\ t & \text{with probability } \frac{\mu}{t} \end{cases}\]

Then \(\mathbb{E}(X) = \mu\) and \(\mathbb{P}(X \geq t) = \frac{\mu}{t}\), achieving equality in Markov’s inequality.
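As a quick numerical check of this tightness claim, we can simulate the two-point distribution above for one arbitrary choice of \(\mu\) and \(t\) and verify that the Markov bound is attained exactly.

import numpy as np

rng = np.random.default_rng(0)

mu, t = 1.0, 4.0                # arbitrary illustrative choice with mu <= t
p = mu / t                      # P(X = t)

# Two-point distribution: X = t with probability mu/t, X = 0 otherwise
x = rng.choice([0.0, t], size=1_000_000, p=[1 - p, p])

print(f"Empirical E[X]      = {x.mean():.4f}  (target {mu})")
print(f"Empirical P(X >= t) = {np.mean(x >= t):.4f}  (Markov bound E[X]/t = {mu / t:.4f})")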

Let’s visualize Markov’s inequality by comparing the true tail probability with the bound for an exponential distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set up the exponential distribution
beta = 2  # mean = 2
x = np.linspace(0, 10, 1000)
pdf = stats.expon.pdf(x, scale=beta)

# Compute true probabilities and Markov bounds
t_values = np.linspace(0.5, 10, 100)
true_probs = 1 - stats.expon.cdf(t_values, scale=beta)
markov_bounds = np.minimum(beta / t_values, 1)  # E[X]/t, capped at 1

# Create the plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 5))

# Left plot: PDF with shaded tail
t_example = 5
ax1.plot(x, pdf, 'b-', linewidth=2, label='Exponential(2) PDF')
ax1.fill_between(x[x >= t_example], pdf[x >= t_example], alpha=0.3, color='red', 
                 label=f'P(X ≥ {t_example})')
ax1.axvline(beta, color='green', linestyle='--', linewidth=2, label=f'E[X] = {beta}')
ax1.set_xlabel('x')
ax1.set_ylabel('Probability density')
ax1.set_title('Exponential Distribution')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right plot: True probability vs Markov bound
ax2.plot(t_values, true_probs, 'b-', linewidth=2, label='True P(X ≥ t)')
ax2.plot(t_values, markov_bounds, 'r--', linewidth=2, label='Markov bound E[X]/t')
ax2.set_xlabel('t')
ax2.set_ylabel('Probability')
ax2.set_title('Markov\'s Inequality for Exponential(2)')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

# Numerical comparison at specific points
print("Comparison of true probability vs Markov bound:")
for t in [1, 2, 4, 8]:
    true_p = 1 - stats.expon.cdf(t, scale=beta)
    markov_p = beta / t
    print(f"t = {t}: True P(X ≥ {t}) = {true_p:.4f}, Markov bound = {markov_p:.4f}")

Comparison of true probability vs Markov bound:
t = 1: True P(X ≥ 1) = 0.6065, Markov bound = 2.0000
t = 2: True P(X ≥ 2) = 0.3679, Markov bound = 1.0000
t = 4: True P(X ≥ 4) = 0.1353, Markov bound = 0.5000
t = 8: True P(X ≥ 8) = 0.0183, Markov bound = 0.2500

Notice that the Markov bound is always valid but often loose. For the exponential, the absolute gap narrows as \(t\) grows, but in relative terms the bound (which decays only like 1/t) remains far above the true, exponentially decaying tail.

3.3.3 Chebyshev’s Inequality

While Markov’s inequality uses only the mean, Chebyshev’s inequality leverages the variance to provide a tighter bound on deviations from the mean.

Chebyshev’s Inequality: Let X have mean \mu and variance \sigma^2. Then: \mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} \quad \text{for all } t > 0

Apply Markov’s inequality to the non-negative random variable (X - \mu)^2: \mathbb{P}(|X - \mu| \geq t) = \mathbb{P}((X - \mu)^2 \geq t^2) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

Equivalently, in terms of standard deviations: \mathbb{P}(|X - \mu| \geq k\sigma) \leq \frac{\sigma^2}{k^2 \sigma^2 } = \frac{1}{k^2} \quad \text{for all } k > 0

Example: Universal Two-Sigma Rule

For any distribution (not just normal!), Chebyshev’s inequality tells us:

\mathbb{P}(|X - \mu| < k \sigma) \ge 1 - \frac{1}{k^2}.

Thus, for k=2 and k=3 we find:

  • At least 75% of the data lies within 2 standard deviations of the mean: \mathbb{P}(|X - \mu| < 2\sigma) \geq 1 - \frac{1}{4} = 0.75
  • At least 89% lies within 3 standard deviations: \mathbb{P}(|X - \mu| < 3\sigma) \geq 1 - \frac{1}{9} \approx 0.889

Compare this to the normal distribution where about 95% lies within 2\sigma and 99.7% within 3\sigma. Chebyshev’s bounds are weaker but universal.

Below we compare Chebyshev’s bound with the actual tail probabilities of a few common distributions (normal, uniform, and exponential).

# Visualizing Chebyshev's inequality
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set up comparison for different distributions
k_values = np.linspace(0.5, 4, 100)
chebyshev_bound = np.minimum(1 / k_values**2, 1)

# Compute actual probabilities for different distributions
normal_probs = 2 * (1 - stats.norm.cdf(k_values))
uniform_probs = np.maximum(0, 1 - k_values / np.sqrt(3))  # Uniform on [-sqrt(3), sqrt(3)]
exp_probs = []
for k in k_values:
    # For exponential with mean 1, mu=1, sigma=1
    exp_probs.append(stats.expon.cdf(1 - k, scale=1) + (1 - stats.expon.cdf(1 + k, scale=1)))

plt.figure(figsize=(7, 4))
plt.plot(k_values, chebyshev_bound, 'k-', linewidth=3, label='Chebyshev bound')
plt.plot(k_values, normal_probs, 'b--', linewidth=2, label='Normal')
plt.plot(k_values, uniform_probs, 'g--', linewidth=2, label='Uniform')
plt.plot(k_values, exp_probs, 'r--', linewidth=2, label='Exponential')
plt.xlabel('Number of standard deviations (k)')
plt.ylabel('P(|X - μ| ≥ kσ)')
plt.title('Chebyshev\'s Inequality vs Actual Tail Probabilities')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0.5, 4)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

# Print specific values
print("Probability of being more than k standard deviations from the mean:")
print("k\tChebyshev\tNormal\t\tUniform\t\tExponential")
for k in [1, 2, 3]:
    cheby = 1/k**2
    normal = 2 * (1 - stats.norm.cdf(k))
    uniform = max(0, 1 - k/np.sqrt(3))
    exp_val = stats.expon.cdf(1 - k, scale=1) + (1 - stats.expon.cdf(1 + k, scale=1))
    print(f"{k}\t{cheby:.4f}\t\t{normal:.4f}\t\t{uniform:.4f}\t\t{exp_val:.4f}")

Probability of being more than k standard deviations from the mean:
k   Chebyshev   Normal      Uniform     Exponential
1   1.0000      0.3173      0.4226      0.1353
2   0.2500      0.0455      0.0000      0.0498
3   0.1111      0.0027      0.0000      0.0183

While Chebyshev’s inequality is universal, it can be quite loose. For bounded random variables, Hoeffding’s inequality provides an exponentially decaying bound that’s much sharper.

Let X_1, \ldots, X_n be independent random variables with X_i \in [a_i, b_i]. Let S_n = \sum_{i=1}^n X_i. Then for any t > 0: \mathbb{P}(S_n - \mathbb{E}[S_n] \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)

For the special case of n independent Bernoulli(p) random variables: \mathbb{P}\left(|\bar{X}_n - p| > \epsilon\right) \leq 2e^{-2n\epsilon^2} where \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.

The key insight is the exponential decay in n. This makes Hoeffding’s inequality the foundation for many machine learning generalization bounds.

Example: Comparing Bounds

Consider estimating a probability p from n = 100 Bernoulli trials. How likely is our estimate to be off by more than \epsilon = 0.2?

Chebyshev’s bound: Since \mathbb{V}(\bar{X}_n) = p(1-p)/n \leq 1/(4n): \mathbb{P}(|\bar{X}_n - p| > 0.2) \leq \frac{1/(4 \times 100)}{0.2^2} = \frac{1/400}{0.04} = 0.0625

Hoeffding’s bound: \mathbb{P}(|\bar{X}_n - p| > 0.2) \leq 2e^{-2 \times 100 \times 0.2^2} = 2e^{-8} \approx 0.00067

Hoeffding’s bound is nearly 100 times tighter! This exponential improvement is crucial for machine learning theory.
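The sketch below evaluates both bounds for \epsilon = 0.2 at a few sample sizes, reproducing the calculation above; the formulas are exactly those stated, and only the grid of n values is an arbitrary choice.

import numpy as np

epsilon = 0.2
n_values = np.array([25, 50, 100, 200, 400])

# Chebyshev: V(Xbar_n) = p(1-p)/n <= 1/(4n), so P(|Xbar_n - p| > eps) <= 1/(4 n eps^2)
chebyshev = 1 / (4 * n_values * epsilon**2)

# Hoeffding: P(|Xbar_n - p| > eps) <= 2 exp(-2 n eps^2)
hoeffding = 2 * np.exp(-2 * n_values * epsilon**2)

print("n\tChebyshev\tHoeffding")
for n, c, h in zip(n_values, chebyshev, hoeffding):
    print(f"{n}\t{c:.4f}\t\t{h:.2e}")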

The proof of Hoeffding’s inequality uses moment generating functions and is beyond the scope of this course, but the intuition is that bounded random variables have light tails, allowing for much stronger concentration.

3.4 Convergence of Random Variables

3.4.1 The Need for Probabilistic Convergence

In calculus, we say a sequence x_n converges to x if for every \epsilon > 0, we have |x_n - x| < \epsilon for all sufficiently large n. But what about sequences of random variables?

There are multiple scenarios:

  1. Concentrating distribution: Let X_n \sim \mathcal{N}(0, 1/n). As n increases, the distribution concentrates more tightly around 0. Intuitively, X_n is “converging” to 0.

  2. Tracking outcomes: The case above can be generalized so that X_n converges not to a constant (such as 0) but to the values taken by another random variable X.

The problem is that for any specific x, \mathbb{P}(X_n = x) = 0 for all n: continuous random variables never exactly equal any specific value.

A third scenario calls for a completely different kind of convergence.

  3. Stable distribution: Let X_n \sim \mathcal{N}(0, 1) for all n. Each X_n has the same distribution, but they are different random variables. Is there a broader sense in which this sequence “converges”?

In sum, we need new definitions that capture different notions of what it means for random variables to converge.

3.4.2 Convergence in Probability

We begin with the first two scenarios above – a sequence of random variables whose outcomes settle toward a constant, or toward the outcomes of another random variable – which are captured by convergence in probability.

A sequence of random variables X_n converges in probability to a random variable X, written X_n \xrightarrow{P} X, if for every \epsilon > 0: \mathbb{P}(|X_n - X| > \epsilon) \to 0 \text{ as } n \to \infty

When X = c (a constant), we write X_n \xrightarrow{P} c.

This definition captures the idea that X_n becomes increasingly likely to be close to X as n grows. The probability of X_n being “far” from X (more than \epsilon away) vanishes. In other words, the sequence of outcomes of the random variable X_n “track” the outcomes of X with ever-increasing accuracy as n increases.

Example: Convergence to Zero

Let X_n \sim \mathcal{N}(0, 1/n). We’ll show that X_n \xrightarrow{P} 0.

For any \epsilon > 0, using Chebyshev’s inequality: \mathbb{P}(|X_n - 0| > \epsilon) = \mathbb{P}(|X_n| > \epsilon) \leq \frac{\mathbb{V}(X_n)}{\epsilon^2} = \frac{1/n}{\epsilon^2} = \frac{1}{n\epsilon^2}

Since \frac{1}{n\epsilon^2} \to 0 as n \to \infty, we have X_n \xrightarrow{P} 0.

3.4.3 Convergence in Distribution

We now consider the third scenario, where it is not the random variables themselves that converge but their distributions.

A sequence of random variables X_n converges in distribution to a random variable X, written X_n \rightsquigarrow X, if: \lim_{n \to \infty} F_n(t) = F(t) at all points t where F is continuous. Here F_n is the CDF of X_n and F is the CDF of X.

This captures the idea that the distribution (or “shape”) of X_n becomes increasingly similar to that of X. We’re not saying the random variables themselves are close – just their overall probability distributions.

If X is a point mass at c, we denote X_n \rightsquigarrow c.

Warning

Key Distinction:

  • Convergence in probability: The random variables themselves get close
  • Convergence in distribution: Only the distributions get close

Let’s visualize this with the X_n \sim \mathcal{N}(0, 1/n) example:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set up the figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 4))

# Left plot: PDFs converging to a spike at 0
x = np.linspace(-3, 3, 1000)
n_values = [1, 2, 5, 10, 50]
colors = plt.cm.Blues(np.linspace(0.3, 0.9, len(n_values)))

for n, color in zip(n_values, colors):
    pdf = stats.norm.pdf(x, loc=0, scale=1/np.sqrt(n))
    ax1.plot(x, pdf, linewidth=2, color=color, label=f'n = {n}')

ax1.axvline(0, color='red', linestyle='--', alpha=0.7)
ax1.set_xlabel('x')
ax1.set_ylabel('Probability density')
ax1.set_title('PDFs of N(0, 1/n)')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 2.5)

# Right plot: CDFs converging to step function
for n, color in zip(n_values, colors):
    cdf = stats.norm.cdf(x, loc=0, scale=1/np.sqrt(n))
    ax2.plot(x, cdf, linewidth=2, color=color, label=f'n = {n}')

# Plot limiting step function
ax2.plot(x[x < 0], np.zeros(sum(x < 0)), 'r--', linewidth=2, label='Limit')
ax2.plot(x[x >= 0], np.ones(sum(x >= 0)), 'r--', linewidth=2)
ax2.plot([0, 0], [0, 1], 'ro', markersize=8, fillstyle='none')
ax2.plot([0], [1], 'ro', markersize=8)

ax2.set_xlabel('x')
ax2.set_ylabel('Cumulative probability')
ax2.set_title('CDFs converging to step function')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

As n increases:

  • The PDF becomes more concentrated at 0 (spike)
  • The CDF approaches a step function jumping from 0 to 1 at x=0
  • This is convergence in distribution to a point mass at 0

3.4.4 Comparing Modes of Convergence

Relationships Between Convergence Types

  1. If X_n \xrightarrow{P} X then X_n \rightsquigarrow X (always)
  2. If X is a point mass at c and X_n \rightsquigarrow X, then X_n \xrightarrow{P} c

Convergence in probability implies convergence in distribution, but the converse holds only for constants.

Convergence in Probability: The Perfect Weather Forecast

Let \(X\) be the actual temperature tomorrow and \(X_n\) be its forecast from an ever-improving machine learning model where \(n\) is the model version, as we make it bigger and feed it more data.

Convergence in probability (\(X_n \xrightarrow{P} X\)) means the forecast becomes more and more accurate as the model gets better and better. Eventually, the temperature prediction \(X_n\) gets so close to the actual temperature \(X\) that the forecast error, \(|X_n - X|\), becomes negligible. The individual outcomes match.

Convergence in Distribution: The Perfect Climate Model

A climate model doesn’t predict a specific day’s temperature; it captures the statistical “character” of a season. Let \(X\) be the random variable for daily temperature, and \(X_n\) be a model’s simulation of a typical day.

Convergence in distribution (\(X_n \rightsquigarrow X\)) means the model’s simulated statistics (e.g., its histogram of temperatures) become identical to the real climate’s statistics. The patterns match, but the individual outcomes don’t.

The Takeaway:

  • Probability implies Distribution: A perfect daily forecast naturally captures the climate’s long-term statistics.
  • Distribution does NOT imply Probability: A perfect climate model can’t predict the actual temperature on next Friday.

We can construct a counterexample showing that convergence in distribution does NOT imply convergence in probability.

Counterexample: Let \(X \sim \mathcal{N}(0,1)\) and define \(X_n = -X\) for all \(n\). Then:

  • Each \(X_n \sim \mathcal{N}(0,1)\), so trivially \(X_n \rightsquigarrow X\)
  • But \(|X_n - X| = |2X|\), so \(\mathbb{P}(|X_n - X| > \epsilon) = \mathbb{P}(|X| > \epsilon/2) \not\to 0\)
  • Therefore \(X_n\) does NOT converge to \(X\) in probability!

The random variables have the same distribution but are perfectly anti-correlated.

Let’s demonstrate both types of convergence and the counterexample computationally.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
fig, axes = plt.subplots(2, 2, figsize=(7, 8))

# Case 1: X_n ~ N(0, 1/n) → 0 (both types of convergence)
ax = axes[0, 0]
n_values = [1, 10, 100, 1000]
n_samples = 5000

for i, n in enumerate(n_values):
    samples = np.random.normal(0, 1/np.sqrt(n), n_samples)
    ax.hist(samples, bins=50, alpha=0.6, density=True, 
            label=f'n={n}', range=(-3, 3))

ax.axvline(0, color='red', linestyle='--', linewidth=2)
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Case 1: N(0,1/n) → 0\n(Converges in both senses)')
ax.legend()
ax.set_xlim(-3, 3)

# Case 1 continued: Show |X_n - 0| for different epsilon
ax = axes[0, 1]
epsilon = 0.5
prob_far = []
for n in range(1, 101):
    samples = np.random.normal(0, 1/np.sqrt(n), n_samples)
    prob_far.append(np.mean(np.abs(samples) > epsilon))

ax.plot(range(1, 101), prob_far, 'b-', linewidth=2)
ax.axhline(0, color='red', linestyle='--')
ax.set_xlabel('n')
ax.set_ylabel(f'P(|X_n| > {epsilon})')
ax.set_title('Convergence in Probability to 0')
ax.grid(True, alpha=0.3)

# Case 2: X_n = -X counterexample
ax = axes[1, 0]
X = np.random.normal(0, 1, n_samples)
X_n = -X  # X_n for all n

# Plot distributions (they're identical!)
ax.hist(X, bins=50, alpha=0.6, density=True, label='X ~ N(0,1)')
ax.hist(X_n, bins=50, alpha=0.6, density=True, label='X_n = -X ~ N(0,1)')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Case 2: Same distribution\n(Converges in distribution)')
ax.legend()

# But |X_n - X| = |2X| doesn't converge to 0
ax = axes[1, 1]
diff = np.abs(X_n - X)
ax.hist(diff, bins=50, alpha=0.8, density=True, color='red')
ax.set_xlabel('|X_n - X| = |2X|')
ax.set_ylabel('Density')
ax.set_title('But NOT convergence in probability!\n|X_n - X| doesn\'t concentrate at 0')

# Add theoretical chi distribution (|2X| where X~N(0,1))
x_theory = np.linspace(0, 8, 1000)
y_theory = stats.chi.pdf(x_theory * 0.5, df=1) * 0.5  # Scale for |2X|
ax.plot(x_theory, y_theory, 'k--', linewidth=2, label='Theory')
ax.legend()

plt.tight_layout()
plt.show()

Summary:

  • Case 1: \(X_n \sim \mathcal{N}\left(0, 1/n\right)\) converges to 0 in BOTH senses
  • Case 2: \(X_n = -X\) has the same distribution as \(X\) but does NOT converge to it in probability
  • Convergence in distribution is weaker than convergence in probability

3.4.5 Properties and Transformations

Understanding how convergence behaves under various operations is crucial for statistical theory. Here are the key properties:

Operations Under Convergence in Probability

If X_n \xrightarrow{P} X and Y_n \xrightarrow{P} Y, then:

  1. X_n + Y_n \xrightarrow{P} X + Y
  2. X_n Y_n \xrightarrow{P} XY
  3. X_n / Y_n \xrightarrow{P} X/Y (if \mathbb{P}(Y = 0) = 0)

This shows that convergence in probability is well-behaved under the standard operations of addition, multiplication, and division (as long as the denominator is nonzero with probability one).

Slutsky’s Theorem

If X_n \rightsquigarrow X and Y_n \xrightarrow{P} c (constant), then:

  1. X_n + Y_n \rightsquigarrow X + c
  2. X_n Y_n \rightsquigarrow cX
  3. X_n / Y_n \rightsquigarrow X/c (if c \neq 0)

Slutsky’s theorem tells us that convergence in distribution behaves nicely when paired with random variables that converge to a constant (this is not true in general!).

Continuous Mapping Theorem

If g is a continuous function:

  1. X_n \xrightarrow{P} X \implies g(X_n) \xrightarrow{P} g(X)
  2. X_n \rightsquigarrow X \implies g(X_n) \rightsquigarrow g(X)

Finally, we see that continuous mappings behave nicely for both types of convergence.

Warning

Important limitation: In general, if X_n \rightsquigarrow X and Y_n \rightsquigarrow Y, we cannot conclude that X_n + Y_n \rightsquigarrow X + Y. Convergence in distribution does not preserve sums unless one component converges to a constant!

Counterexample: Let X \sim \mathcal{N}(0,1) and define Y_n = -X for all n. Then Y_n \sim \mathcal{N}(0,1), so Y_n \rightsquigarrow Y \sim \mathcal{N}(0,1). But X + Y_n = X - X = 0, which does not converge in distribution to X + Y \sim \mathcal{N}(0,2).

Key takeaway

The rules for convergence are subtle. Generally speaking, convergence in probability behaves nicely under algebraic operations, but convergence in distribution requires more care. Always verify which type of convergence you have before applying these properties!
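As a small simulation sketch of Slutsky’s theorem, using only ingredients we have already seen, take X_n \sim \mathcal{N}(0, 1) for every n (which trivially converges in distribution to \mathcal{N}(0,1)) and Y_n \sim \mathcal{N}(c, 1/n) (which, as in the earlier example, converges in probability to the constant c). For large n, X_n + Y_n should look like \mathcal{N}(c, 1) and X_n Y_n like \mathcal{N}(0, c^2); the choice c = 2 and the simulation sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

c, n, n_sims = 2.0, 10_000, 100_000

X_n = rng.normal(0, 1, n_sims)                 # ~ N(0,1) for every n
Y_n = rng.normal(c, 1 / np.sqrt(n), n_sims)    # ~ N(c, 1/n), converges in probability to c

sum_XY = X_n + Y_n       # Slutsky: approximately N(c, 1) for large n
prod_XY = X_n * Y_n      # Slutsky: approximately N(0, c^2) for large n

print(f"X_n + Y_n: mean = {sum_XY.mean():.3f} (≈ {c}), sd = {sum_XY.std():.3f} (≈ 1)")
print(f"X_n * Y_n: mean = {prod_XY.mean():.3f} (≈ 0), sd = {prod_XY.std():.3f} (≈ {c})")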

3.5 The Two Fundamental Theorems of Statistics

3.5.1 The Law of Large Numbers (LLN)

The Law of Large Numbers formalizes one of our most basic intuitions about probability: averages stabilize as we collect more data. When we flip a fair coin many times, the proportion of heads approaches 1/2. When we measure heights of many people, the sample mean approaches the population mean. This isn’t just intuition – it’s a mathematical theorem.

Let X_1, X_2, \ldots, X_n be independent and identically distributed (IID) random variables with \mathbb{E}(X_i) = \mu and \mathbb{V}(X_i) = \sigma^2 < \infty. Define the sample mean: \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i

Then \bar{X}_n \xrightarrow{P} \mu.

Interpretation: The sample mean converges in probability to the population mean. As we collect more data, our estimate gets arbitrarily close to the true value with high probability.

We’ll use Chebyshev’s inequality. First, compute the mean and variance of \bar{X}_n:

\mathbb{E}(\bar{X}_n) = \mathbb{E}\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}(X_i) = \frac{1}{n} \cdot n\mu = \mu

\mathbb{V}(\bar{X}_n) = \mathbb{V}\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}

(We used independence for the variance calculation.)

Now apply Chebyshev’s inequality: for any \epsilon > 0, \mathbb{P}(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\mathbb{V}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0 \text{ as } n \to \infty

Therefore \bar{X}_n \xrightarrow{P} \mu.

Let’s visualize the Law of Large Numbers in action by simulating repeated rolls of a standard six-sided die and computing the running mean of all rolls up to that point.

We show this in two plots: with the number of rolls on a linear scale (top) and on a logarithmic scale (bottom) on the x-axis. The bottom plot also zooms in on the y-axis around 3.5.

Note how the sample mean starts with high variability but converges to the true mean (3.5) as the number of rolls increases.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate die rolls
n_max = 10000
die_rolls = np.random.randint(1, 7, n_max)
cumulative_mean = np.cumsum(die_rolls) / np.arange(1, n_max + 1)
true_mean = 3.5

# Create the plot without shared x-axis
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(7, 5))

# Top plot: Full scale with linear x-axis
ax1.plot(cumulative_mean, linewidth=2, color='blue', alpha=0.8)
ax1.axhline(y=true_mean, color='red', linestyle='--', linewidth=2, 
            label=f'True mean = {true_mean}')
ax1.set_ylabel('Sample mean')
ax1.set_title('Law of Large Numbers: Sample Mean of Die Rolls')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(1, 6)

# Bottom plot: Zoomed in to show convergence with log scale
x_values = np.arange(100, n_max)
ax2.plot(x_values, cumulative_mean[100:], linewidth=2, color='blue', alpha=0.8)
ax2.axhline(y=true_mean, color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Number of rolls')
ax2.set_ylabel('Sample mean')
ax2.set_xscale('log')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(3.2, 3.8)

plt.tight_layout()
plt.show()

# Show convergence at specific sample sizes
for n in [10, 100, 1000, 10000]:
    print(f"After {n:5d} rolls: sample mean = {cumulative_mean[n-1]:.4f}, "
          f"error = {abs(cumulative_mean[n-1] - true_mean):.4f}")

After    10 rolls: sample mean = 3.8000, error = 0.3000
After   100 rolls: sample mean = 3.6900, error = 0.1900
After  1000 rolls: sample mean = 3.4570, error = 0.0430
After 10000 rolls: sample mean = 3.4999, error = 0.0001

The theorem above is known as the “Weak” Law of Large Numbers because it guarantees convergence in probability. There exists a stronger version that guarantees almost sure convergence: \mathbb{P}(\bar{X}_n \to \mu) = 1. The “Strong” LLN says that with probability 1, the sample mean will eventually get arbitrarily close to \mu and stay close, while the Weak LLN only guarantees that the probability of being far from \mu goes to zero. The Weak LLN as proved here requires only finite variance; elementary proofs of the Strong LLN typically need additional assumptions (such as finite fourth moments), although the theorem delivers a more powerful conclusion. We present the Weak version as it has minimal assumptions and suffices for most statistical applications.

3.5.2 The Central Limit Theorem (CLT)

While the Law of Large Numbers tells us that sample means converge to the population mean, it doesn’t tell us about the distribution of the sample mean. The Central Limit Theorem fills this gap with a remarkable result: properly scaled sample means are approximately normal, regardless of the underlying distribution!

Let X_1, X_2, \ldots, X_n be IID random variables with \mathbb{E}(X_i) = \mu and \mathbb{V}(X_i) = \sigma^2 < \infty. Define: Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}

Then Z_n \rightsquigarrow Z where Z \sim \mathcal{N}(0, 1).

Alternative notations (all mean the same thing):

  • \bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n) for large n
  • \sqrt{n}(\bar{X}_n - \mu) \rightsquigarrow \mathcal{N}(0, \sigma^2)
  • (\bar{X}_n - \mu)/(\sigma/\sqrt{n}) \rightsquigarrow \mathcal{N}(0, 1)
Warning

Critical Point: The CLT is about the distribution of the sample mean, not the data itself! The original data doesn’t become normal—only the sampling distribution of \bar{X}_n does.

Interactive Demonstration of the CLT

Let’s visualize how the CLT works for different underlying distributions – continuous, discrete, and skewed (asymmetric).

The interactive visualization below allows you to see this convergence in action. You can change the underlying distribution and adjust the sample size n to see how the distribution of the standardized sample mean approaches a standard normal distribution (red curves).
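If you are reading a static copy of these notes without the interactive widget, the following sketch shows the same phenomenon for one fixed, arbitrary choice: samples from the skewed Exponential(1) distribution with n = 30, whose standardized sample means are compared against the standard normal density.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

n, n_sims = 30, 10_000
mu, sigma = 1.0, 1.0    # Exponential(1) has mean 1 and standard deviation 1

# Standardized sample means Z_n = sqrt(n) (Xbar_n - mu) / sigma
samples = rng.exponential(scale=1.0, size=(n_sims, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

plt.figure(figsize=(7, 4))
plt.hist(z, bins=60, density=True, alpha=0.6, label=f'Standardized means (n={n})')
x = np.linspace(-4, 4, 400)
plt.plot(x, stats.norm.pdf(x), 'r-', linewidth=2, label='N(0, 1)')
plt.xlabel('z')
plt.ylabel('Density')
plt.title('CLT for Exponential(1) data')
plt.legend()
plt.tight_layout()
plt.show()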

Example: CLT in Practice

A factory produces bolts with mean length \mu = 5 cm and standard deviation \sigma = 0.1 cm. If we randomly sample 100 bolts, what’s the probability their average length exceeds 5.02 cm?

By the CLT, \bar{X}_{100} \approx \mathcal{N}(5, 0.1^2/100) = \mathcal{N}(5, 0.0001).

We want: \mathbb{P}(\bar{X}_{100} > 5.02) = \mathbb{P}\left(\frac{\bar{X}_{100} - 5}{0.01} > \frac{5.02 - 5}{0.01}\right) = \mathbb{P}(Z > 2)

where Z \sim \mathcal{N}(0,1). From standard normal tables: \mathbb{P}(Z > 2) \approx 0.0228.

So there’s about a 2.3% chance the sample mean exceeds 5.02 cm.
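We can verify this calculation numerically with scipy (a one-off check of the normal approximation above).

from scipy import stats

mu, sigma, n = 5.0, 0.1, 100
se = sigma / n**0.5                         # standard error of the mean = 0.01

# P(Xbar_100 > 5.02) under the CLT normal approximation
prob = 1 - stats.norm.cdf(5.02, loc=mu, scale=se)
print(f"P(sample mean > 5.02 cm) ≈ {prob:.4f}")   # ≈ 0.0228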

CLT with Unknown Variance: If we replace \sigma with the sample standard deviation S_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2}

then we still have: \frac{\sqrt{n}(\bar{X}_n - \mu)}{S_n} \rightsquigarrow \mathcal{N}(0, 1)

This version is crucial for practice since we rarely know the true variance!

Rejoinder: Understanding Algorithms

Remember the sequence of random variables \theta_1, \theta_2, \ldots from our stochastic optimization algorithm at the beginning of this chapter? We can now answer what kind of convergence we should expect:

Convergence in probability: We want \theta_n \xrightarrow{P} \theta^* where \theta^* is the true optimal solution. This means the probability of \theta_n being far from the optimum vanishes as iterations increase.

The tools we’ve covered – probability inequalities (to bound deviations), convergence concepts (to formalize what “converges” means), and limit theorems (to understand averaging behavior) – are the foundation for analyzing when and why algorithms like stochastic gradient descent converge to good solutions. Modern machine learning theory relies heavily on these concepts to provide theoretical guarantees about algorithm performance!

3.6 The Language of Statistical Inference

3.6.1 From Probability to Inference

We’ve developed powerful tools: inequalities that bound uncertainty, convergence concepts that describe limiting behavior, and fundamental theorems that guarantee nice properties of averages. Now we flip the perspective.

Probability: Given a known distribution, what can we say about the data we’ll observe?

Statistical Inference: Given observed data, what can we infer about the unknown distribution that generated it? More formally: given a sample X_1, \ldots, X_n \sim F, how do we infer F?

This process – often called “learning” in computer science – is at the core of both classical statistics and modern machine learning. Sometimes we want to infer the entire distribution F, but often we focus on specific features like its mean, variance, or other parameters.

Example: Modeling Uncertainty in Real Decisions

An online retailer tests a new ad campaign. Out of 1000 users who see the ad, 30 make a purchase (3% conversion rate). But this raises critical questions:

Immediate questions:

  • What can we say about the true conversion rate? Is it exactly 3%?
  • How likely is it that at least 25 out of the next 1000 users will convert?

Comparative questions:

  • A competing ad had 290 conversions out of 10,000 users (2.9% rate). Which is better?
  • How confident can we be that the 3% ad truly outperforms the 2.9% ad – could the difference just be due to random chance?

Long-term questions:

  • What’s the probability that the long-run conversion rate exceeds 2.5%?
  • How many more users do we need to test to be 95% confident about the true rate?

These questions – about uncertainty, confidence, and decision-making with limited data – are at the heart of statistical inference.

3.6.2 Statistical Models

A statistical model \mathfrak{F} is a set of probability distributions (or densities or regression functions).

In the context of inference, we use models to represent our assumptions about which distributions could have generated our observed data. The model defines the “universe of possibilities” we’re considering – we then use data to identify which specific distribution within \mathfrak{F} is most plausible.

Models come in two main flavors, parametric and nonparametric.

Parametric Models

A parametric model is indexed by a finite number of parameters. We write it as: \mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}

where:

  • \theta (theta) is the parameter (possibly vector-valued)4
  • \Theta (capital theta) is the parameter space (the set of all possible parameter values)
  • f(x; \theta) is the density or distribution function indexed by \theta

Typically, the parameters \theta are unknown quantities we want to estimate. If there are elements of the vector \theta that we are not interested in, those are called nuisance parameters.

A product manager at a tech company launches a new “AI Recap” feature in their app. To determine if the feature is a success, they track the number of daily views over the first month. They hypothesize that the daily view count approximately follows a normal distribution.

The model for daily views is a parametric family \mathfrak{F}: \mathfrak{F} = \{f(x; \theta) : \theta = (\mu, \sigma^2), \mu \in \mathbb{R}, \sigma^2 > 0\}

where the density function is: f(x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

This is a 2-dimensional parametric model with:

  • Parameter vector: \theta = (\mu, \sigma^2)
  • Parameter space: \Theta = \mathbb{R} \times (0, \infty)

Nuisance parameters in action: The company has set a target: the feature will be considered successful and receive further development only if it can reliably generate more than 100,000 views per day.

  • The parameter of interest is the average daily views, \mu. The entire business decision hinges on testing the hypothesis that \mu > 100,000.
  • The nuisance parameter is the variance, \sigma^2. The day-to-day fluctuation in views is critical for assessing the statistical certainty of our estimate for \mu, but it’s not the primary metric for success. The product manager needs to account for this variability, but their core question is about the average performance, not the variability itself.
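As a minimal sketch of what estimation in this model looks like in practice, suppose we had 30 days of view counts (simulated below, purely for illustration): the parameter of interest \mu is estimated by the sample mean, the nuisance parameter \sigma^2 by the sample variance, and the latter determines how precisely \mu can be pinned down.

import numpy as np

rng = np.random.default_rng(1)

# Purely illustrative data: 30 days of simulated view counts (values are made up)
views = rng.normal(loc=105_000, scale=12_000, size=30)

mu_hat = views.mean()                      # estimate of the parameter of interest, mu
sigma2_hat = views.var(ddof=1)             # estimate of the nuisance parameter, sigma^2
se_mu = np.sqrt(sigma2_hat / len(views))   # uncertainty of mu_hat (its standard error; see Section 3.6.3)

print(f"Estimated mean daily views: {mu_hat:,.0f} (uncertainty ≈ {se_mu:,.0f})")
print(f"Estimated daily variance:   {sigma2_hat:,.0f}")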

Nonparametric Models

Nonparametric Models cannot be parameterized by a finite number of parameters. These models make minimal assumptions about the distribution. For example: \mathfrak{F}_{\text{ALL}} = \{\text{all continuous CDFs}\}

or with some constraints: \mathfrak{F} = \{\text{all distributions with finite variance}\}

This seems impossibly broad! In practice, we don’t explicitly enumerate all possible distributions. Instead, nonparametric methods directly use the data without assuming a specific functional form or parameter to be estimated. We will see multiple concrete examples of nonparametric techniques in the next chapter. So, in theory the model space is infinite-dimensional, but in practice nonparametric estimation procedures are still concrete and computable.

Example: Choosing a Model

Scenario 1: Heights of adult males in Finland.

  • Parametric choice: \mathfrak{F} = \{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma > 0\}
  • Justification: Heights are often approximately normal due to many small genetic and environmental factors (CLT in action!)

Scenario 2: Time between website visits.

  • Parametric choice: \mathfrak{F} = \{\text{Exponential}(\lambda) : \lambda > 0\}
  • Justification: Exponential models “memoryless” waiting times

Scenario 3: Unknown distribution shape.

  • Nonparametric choice: \mathfrak{F} = \{\text{all distributions with finite variance}\}
  • Justification: Make minimal assumptions, let data speak

3.6.3 Point Estimation

Point estimation is the task of providing a single “best guess” for an unknown quantity based on data.

This quantity can be a single parameter, a full vector of parameters, or even a full CDF or PDF, or a prediction of a future value of some random variable.

A point estimate of \theta is denoted by \hat{\theta}.

Given data X_1, \ldots, X_n, a point estimator is a function: \hat{\theta}_n = g(X_1, \ldots, X_n)

The “hat” notation \hat{\theta} can indicate both an estimator and the estimate.

Warning

Critical Distinction:

  • Parameter \theta: Fixed, unknown number we want to learn
  • Estimator \hat{\theta}_n: Random variable (before seeing data)
  • Estimate \hat{\theta}_n: Specific number (after seeing data) – notation can be overlapping

For example, \bar{X}_n is an estimator; \bar{x}_n = 3.7 is an estimate.

The distribution of \hat{\theta}_n is called the sampling distribution. The standard deviation of this distribution is the standard error: \text{se}(\hat{\theta}_n) = \sqrt{\mathbb{V}(\hat{\theta}_n)}

When the standard error depends on unknown parameters, we use the estimated standard error \widehat{\text{se}}.

A particularly common standard error in statistics is the standard error of the mean (SEM): for the sample mean, \text{se}(\bar{X}_n) = \sigma/\sqrt{n}, and the estimated version \widehat{\text{se}} = S_n/\sqrt{n} replaces \sigma with the sample standard deviation.
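A tiny sketch computing the estimated SEM from a simulated sample (the data and its parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=50)     # illustrative sample with n = 50, sigma = 2

n = len(x)
sem_estimated = x.std(ddof=1) / np.sqrt(n)   # estimated standard error of the sample mean
sem_true = 2 / np.sqrt(n)                    # true value, known here only because we simulated

print(f"Estimated SEM: {sem_estimated:.3f}   True SEM: {sem_true:.3f}")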

3.6.4 How to Evaluate Estimators

How do we judge if an estimator is “good”? Several criteria have emerged:

Bias: The systematic error of an estimator. \text{bias}(\hat{\theta}_n) = \mathbb{E}(\hat{\theta}_n) - \theta

An estimator is unbiased if \mathbb{E}(\hat{\theta}_n) = \theta.

Consistency: An estimator is consistent if it converges to the true value. \hat{\theta}_n \xrightarrow{P} \theta \text{ as } n \to \infty

Mean Squared Error (MSE): The average squared distance from the truth. \text{MSE}(\hat{\theta}_n) = \mathbb{E}[(\hat{\theta}_n - \theta)^2]

Example: Evaluating Estimators for the Mean

Suppose X_1, \ldots, X_n \sim \mathcal{N}(\theta, \sigma^2) where \theta is unknown. Consider three estimators:

  1. Constant estimator: \hat{\theta}_n^{(1)} = 3
    • Bias: \mathbb{E}(3) - \theta = 3 - \theta (biased unless \theta = 3)
    • Variance: \mathbb{V}(3) = 0
    • Consistent: No, always equals 3
    • MSE: (3 - \theta)^2
  2. First observation: \hat{\theta}_n^{(2)} = X_1
    • Bias: \mathbb{E}(X_1) - \theta = 0 (unbiased!)
    • Variance: \mathbb{V}(X_1) = \sigma^2
    • Consistent: No, variance doesn’t shrink
    • MSE: \sigma^2
  3. Sample mean: \hat{\theta}_n^{(3)} = \bar{X}_n
    • Bias: \mathbb{E}(\bar{X}_n) - \theta = 0 (unbiased!)
    • Variance: \mathbb{V}(\bar{X}_n) = \sigma^2/n
    • Consistent: Yes! (by LLN)
    • MSE: \sigma^2/n \to 0

The sample mean is unbiased AND consistent—it improves with more data!
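A quick simulation sketch (with the arbitrary choices \theta = 2, \sigma = 1, n = 25) confirms these properties empirically; the numbers should match the formulas above up to simulation noise.

import numpy as np

rng = np.random.default_rng(0)

theta, sigma, n, n_sims = 2.0, 1.0, 25, 50_000
samples = rng.normal(theta, sigma, size=(n_sims, n))

estimators = {
    "constant 3":   np.full(n_sims, 3.0),
    "first obs X1": samples[:, 0],
    "sample mean":  samples.mean(axis=1),
}

print(f"{'estimator':<14}{'bias':>8}{'variance':>10}{'MSE':>8}")
for name, est in estimators.items():
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    print(f"{name:<14}{bias:>8.3f}{var:>10.3f}{mse:>8.3f}")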

A classic example shows that unbiased isn’t everything:

Sample variance: Two common estimators for population variance \sigma^2:

  1. Unbiased version: S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2
    • \mathbb{E}(S^2) = \sigma^2 (unbiased by design)
  2. Maximum likelihood estimator: \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2
    • \mathbb{E}(\hat{\sigma}^2) = \frac{n-1}{n}\sigma^2 (biased!)

Which is better? It depends on the criterion!

3.6.5 The Bias-Variance Tradeoff

The MSE decomposes as: \text{MSE} = \text{bias}^2(\hat{\theta}_n) + \mathbb{V}(\hat{\theta}_n)

Let \bar{\theta}_n = \mathbb{E}(\hat{\theta}_n). Then: \begin{align} \mathbb{E}[(\hat{\theta}_n - \theta)^2] &= \mathbb{E}[(\hat{\theta}_n - \bar{\theta}_n + \bar{\theta}_n - \theta)^2] \\ &= \mathbb{E}[(\hat{\theta}_n - \bar{\theta}_n)^2] + 2(\bar{\theta}_n - \theta)\mathbb{E}[\hat{\theta}_n - \bar{\theta}_n] + (\bar{\theta}_n - \theta)^2 \\ &= \mathbb{E}[(\hat{\theta}_n - \bar{\theta}_n)^2] + 2(\bar{\theta}_n - \theta) \cdot 0 + (\bar{\theta}_n - \theta)^2 \\ &= \mathbb{V}(\hat{\theta}_n) + \text{bias}^2(\hat{\theta}_n) \end{align}

where we used that \mathbb{E}[\hat{\theta}_n - \bar{\theta}_n] = \mathbb{E}[\hat{\theta}_n] - \bar{\theta}_n = \bar{\theta}_n - \bar{\theta}_n = 0.

This decomposition reveals a fundamental tradeoff in statistics.

Imagine you’re an archer trying to hit a target. Your performance depends on two things:

Bias: How far your average shot is from the bullseye. A biased archer consistently aims too high or too far left.

Variance: How spread out your shots are. A high-variance archer is inconsistent—sometimes dead on, sometimes way off.

The best archer has low bias AND low variance. But here’s the key insight: sometimes accepting a little bias can dramatically reduce variance, improving overall accuracy!

Think of it this way:

  • A complex model (like memorizing training data) has low bias but high variance
  • A simple model (like always predicting the average) has higher bias but low variance
  • The sweet spot balances both

This tradeoff is why regularization works in machine learning – we accept a bit of bias to gain a lot in variance reduction.

The bias-variance decomposition gives us a precise way to understand prediction error:

\[\text{MSE}(\hat{\theta}_n) = \text{bias}^2(\hat{\theta}_n) + \mathbb{V}(\hat{\theta}_n)\]

This isn’t just algebra – it reveals the two fundamental sources of error:

  1. Systematic error (bias): Being consistently wrong
  2. Random error (variance): Being inconsistently wrong

For prediction problems where \(\hat{f}(x)\) estimates \(f(x)\): \[\mathbb{E}[(\hat{f}(x) - f(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\mathbb{V}(\hat{f}(x))}_{\text{variance}}\]

The optimal predictor minimizes their sum. In machine learning:

  • Increasing model complexity typically decreases bias but increases variance
  • The art is finding the right complexity for your data

Let’s visualize the bias-variance tradeoff by comparing estimators for population variance.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulation parameters
true_variance = 1.0  # True σ² = 1
n_values = np.arange(5, 101, 5)
n_simulations = 10000

# Storage for results
bias_unbiased = []
bias_mle = []
variance_unbiased = []
variance_mle = []
mse_unbiased = []
mse_mle = []

for n in n_values:
    # Generate many samples and compute both estimators
    unbiased_estimates = []
    mle_estimates = []
    
    for _ in range(n_simulations):
        # Generate sample from N(0, 1)
        sample = np.random.normal(0, 1, n)
        sample_mean = np.mean(sample)
        
        # Unbiased estimator (n-1 denominator)
        s_squared = np.sum((sample - sample_mean)**2) / (n - 1)
        unbiased_estimates.append(s_squared)
        
        # MLE (n denominator)
        sigma_hat_squared = np.sum((sample - sample_mean)**2) / n
        mle_estimates.append(sigma_hat_squared)
    
    # Calculate bias, variance, and MSE
    unbiased_estimates = np.array(unbiased_estimates)
    mle_estimates = np.array(mle_estimates)
    
    # Bias
    bias_unbiased.append(np.mean(unbiased_estimates) - true_variance)
    bias_mle.append(np.mean(mle_estimates) - true_variance)
    
    # Variance
    variance_unbiased.append(np.var(unbiased_estimates))
    variance_mle.append(np.var(mle_estimates))
    
    # MSE
    mse_unbiased.append(np.mean((unbiased_estimates - true_variance)**2))
    mse_mle.append(np.mean((mle_estimates - true_variance)**2))

# Create plots
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(7, 5))

# Bias plot
ax1.plot(n_values, bias_unbiased, 'b-', linewidth=2, label='Unbiased (S²)')
ax1.plot(n_values, bias_mle, 'r--', linewidth=2, label='MLE (σ̂²)')
ax1.axhline(y=0, color='black', linestyle=':', alpha=0.5)
ax1.set_xlabel('Sample size (n)')
ax1.set_ylabel('Bias')
ax1.set_title('Bias')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Variance plot  
ax2.plot(n_values, variance_unbiased, 'b-', linewidth=2, label='Unbiased (S²)')
ax2.plot(n_values, variance_mle, 'r--', linewidth=2, label='MLE (σ̂²)')
ax2.set_xlabel('Sample size (n)')
ax2.set_ylabel('Variance')
ax2.set_title('Variance of Estimator')
ax2.legend()
ax2.grid(True, alpha=0.3)

# MSE plot
ax3.plot(n_values, mse_unbiased, 'b-', linewidth=2, label='Unbiased (S²)')
ax3.plot(n_values, mse_mle, 'r--', linewidth=2, label='MLE (σ̂²)')
ax3.set_xlabel('Sample size (n)')
ax3.set_ylabel('MSE')
ax3.set_title('Mean Squared Error')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.suptitle('Bias-Variance Tradeoff: Variance Estimators', fontsize=14)
plt.tight_layout()
plt.show()

# Print numerical comparison for n=10
n_idx = 1  # n=10
print(f"For n=10:")
print(f"Unbiased (S²): Bias = {bias_unbiased[n_idx]:.4f}, Variance = {variance_unbiased[n_idx]:.4f}, MSE = {mse_unbiased[n_idx]:.4f}")
print(f"MLE (σ̂²):      Bias = {bias_mle[n_idx]:.4f}, Variance = {variance_mle[n_idx]:.4f}, MSE = {mse_mle[n_idx]:.4f}")
print(f"\nThe MLE has lower MSE despite being biased!")

For n=10:
Unbiased (S²): Bias = -0.0013, Variance = 0.2219, MSE = 0.2219
MLE (σ̂²):      Bias = -0.1012, Variance = 0.1797, MSE = 0.1899

The MLE has lower MSE despite being biased!

Key insights from the visualization:

  • The unbiased estimator \(S^2\) has zero bias (blue solid line at 0)
  • The MLE \(\hat{\sigma}^2\) has negative bias that shrinks as \(n\) grows
  • The MLE has lower variance than the unbiased estimator
  • For finite samples, the biased MLE has lower MSE!

This demonstrates a profound principle: the “best” estimator depends on your criterion. Unbiasedness is nice, but minimizing MSE often matters more in practice.

3.7 Chapter Summary and Connections

3.7.1 Key Concepts Review

We’ve built a complete framework for understanding randomness and inference:

Inequalities bound the unknown:

  • Markov: \mathbb{P}(X \geq t) \leq \mathbb{E}(X)/t — averages constrain extremes
  • Chebyshev: \mathbb{P}(|X - \mu| \geq k\sigma) \leq 1/k^2 — universal bounds using variance

Convergence describes limiting behavior:

  • In probability: X_n \xrightarrow{P} X — the random variable itself settles down
  • In distribution: X_n \rightsquigarrow X — the shape of the distribution stabilizes
  • Key relationship: Convergence in probability implies convergence in distribution

Fundamental theorems guarantee nice behavior:

  • Law of Large Numbers: \bar{X}_n \xrightarrow{P} \mu — sample means converge to population means
  • Central Limit Theorem: \sqrt{n}(\bar{X}_n - \mu)/\sigma \rightsquigarrow \mathcal{N}(0,1) — sample means are approximately normal

Statistical inference flips the perspective:

  • Models: Parametric (finite-dimensional) vs nonparametric (infinite-dimensional)
  • Estimators: Functions of data that guess parameters
  • Standard error: Standard deviation of an estimator’s sampling distribution
  • Evaluation criteria: Bias, variance, consistency, MSE
  • Bias-variance tradeoff: \text{MSE} = \text{bias}^2 + \text{variance}

3.7.2 Why These Concepts Matter

For Statistical Inference:

  • LLN justifies using sample statistics to estimate population parameters
  • CLT enables confidence intervals and hypothesis tests (to be seen in the next chapters)
  • Bias-variance tradeoff guides choice of estimators
  • Consistency ensures our methods improve with more data

For Machine Learning:

  • Convergence concepts analyze iterative algorithms such as stochastic optimization
  • Bias-variance tradeoff explains overfitting vs underfitting
  • CLT justifies bootstrap and cross-validation, as we will see in the next chapters

For Data Science Practice:

  • Understanding variability in estimates prevents overconfidence
  • Recognizing when CLT applies (and when it doesn’t)
  • Choosing between simple and complex models
  • Interpreting A/B test results correctly

3.7.3 Common Pitfalls to Avoid

  1. Confusing convergence types:

    • “My algorithm converged” – in what sense?
    • Convergence in distribution does not imply convergence in probability!
  2. Misapplying the CLT:

    • CLT is about sample means, not individual observations
    • Need large enough n (depends on skewness)
    • Doesn’t work without finite variance (Cauchy!)
  3. Overvaluing unbiasedness:

    • Unbiased doesn’t mean good (e.g., using just X_1)
    • Biased estimators can be better in practice (regularization, priors)
  4. Ignoring assumptions:

    • Independence matters for variance calculations
    • Finite variance required for CLT
  5. Misinterpreting bounds:

    • Markov/Chebyshev give worst-case bounds
    • Often very loose in practice
    • Tighter bounds exist for specific distributions

3.7.4 Chapter Connections

This chapter connects fundamental probability theory to practical statistical inference:

  • From Previous Chapters: We’ve applied Chapter 1’s probability framework and Chapter 2’s expectation/variance concepts to prove convergence theorems (LLN, CLT) that explain why sample statistics work as estimators
  • Next - Chapter 4 (Bootstrap): While this chapter gave us theoretical tools for inference (CLT-based confidence intervals), the bootstrap will provide a computational approach that works even when theoretical distributions are intractable
  • Statistical Modeling (Chapters 5+): The bias-variance tradeoff introduced here becomes central to model selection, while MSE serves as our primary tool for comparing estimators in regression and machine learning
  • Throughout the Course: The convergence concepts (especially CLT) and inference framework established here underpin virtually every statistical method—from hypothesis testing to Bayesian inference

3.7.5 Self-Test Problems

  1. Applying Chebyshev: A website’s daily visitors have mean 10,000 and standard deviation 2,000. Without assuming any distribution, what can you say about the probability of getting fewer than 4,000 or more than 16,000 visitors?

  2. CLT Application: A casino’s slot machine pays out €1 with probability 0.4 and €0 otherwise. If someone plays 400 times, approximate the probability their total winnings exceed €170.

  3. Comparing Estimators: Given X_1, \ldots, X_n \sim \text{Uniform}(0, \theta), consider two estimators:

    • \hat{\theta}_1 = 2\bar{X}_n
    • \hat{\theta}_2 = \frac{n+1}{n} \max\{X_1, \ldots, X_n\}

    Calculate bias and variance for each. Which has lower MSE?

  4. Convergence Concepts: Let X_n have PMF: P(X_n = 0) = 1 - 1/n, \quad P(X_n = n) = 1/n

    • Does X_n \xrightarrow{P} 0?
    • Does X_n \rightsquigarrow 0?
    • Does \mathbb{E}(X_n) \to 0?

3.7.6 Python and R Reference

#| eval: false
import numpy as np
from scipy import stats

# Probability inequalities
def markov_bound(mean, t):
    """Markov inequality bound P(X >= t)"""
    return min(mean / t, 1) if t > 0 else 1

def chebyshev_bound(k):
    """Chebyshev bound P(|X - μ| >= kσ)"""
    return min(1 / k**2, 1) if k > 0 else 1

# Simulating convergence
def demonstrate_lln(dist, n_max=10000):
    """Show Law of Large Numbers"""
    samples = dist.rvs(n_max)
    cumulative_mean = np.cumsum(samples) / np.arange(1, n_max + 1)
    return cumulative_mean

def demonstrate_clt(dist, n, n_simulations=10000):
    """Show Central Limit Theorem"""
    sample_means = []
    for _ in range(n_simulations):
        sample = dist.rvs(n)
        sample_means.append(np.mean(sample))
    return np.array(sample_means)

# Estimator evaluation
def evaluate_estimator(estimator_func, true_value, n, sample_func, n_simulations=10000):
    """Compute bias, variance, and MSE of an estimator via simulation."""
    estimates = []
    for _ in range(n_simulations):
        sample = sample_func(n)
        estimate = estimator_func(sample)
        estimates.append(estimate)
    
    estimates = np.array(estimates)
    bias = np.mean(estimates) - true_value
    variance = np.var(estimates)
    mse = np.mean((estimates - true_value)**2)
    
    return {'bias': bias, 'variance': variance, 'mse': mse}

# Common distributions for examples
normal_dist = stats.norm(loc=0, scale=1)
exp_dist = stats.expon(scale=1)
uniform_dist = stats.uniform(loc=0, scale=1)

# Sample statistics
def sample_mean(x):
    return np.mean(x)

def sample_variance_unbiased(x):
    return np.var(x, ddof=1)  # n-1 denominator

def sample_variance_mle(x):
    return np.var(x, ddof=0)  # n denominator

# Example: Evaluate variance estimators for N(0,1) samples (true variance = 1)
def generate_normal_sample(n):
    return np.random.normal(loc=0, scale=1, size=n)

print("--- Evaluating Variance Estimators (n=10) ---")
mle_results = evaluate_estimator(
    estimator_func=sample_variance_mle,
    true_value=1.0,
    n=10,
    sample_func=generate_normal_sample
)
print(f"MLE Estimator: {mle_results}")

unbiased_results = evaluate_estimator(
    estimator_func=sample_variance_unbiased,
    true_value=1.0,
    n=10,
    sample_func=generate_normal_sample
)
print(f"Unbiased Estimator: {unbiased_results}")
#| eval: false
# Probability inequalities
markov_bound <- function(mean_val, t) {
  if (t <= 0) return(1)
  min(mean_val / t, 1)
}

chebyshev_bound <- function(k) {
  if (k <= 0) return(1)
  min(1 / k^2, 1)
}

# Simulating convergence
demonstrate_lln <- function(dist_func, n_max = 10000, ...) {
  # dist_func should be a function like rnorm, rexp, etc.
  samples <- dist_func(n_max, ...)
  cumulative_mean <- cumsum(samples) / seq_len(n_max)
  return(cumulative_mean)
}

demonstrate_clt <- function(dist_func, n, n_simulations = 10000, ...) {
  sample_means <- replicate(n_simulations, {
    sample <- dist_func(n, ...)
    mean(sample)
  })
  return(sample_means)
}

# Estimator evaluation
evaluate_estimator <- function(estimator_func, true_value, n, 
                              sample_func, n_simulations = 10000) {
  estimates <- replicate(n_simulations, {
    sample <- sample_func(n)
    estimator_func(sample)
  })
  
  bias <- mean(estimates) - true_value
  variance <- var(estimates)
  mse <- mean((estimates - true_value)^2)
  
  list(bias = bias, variance = variance, mse = mse)
}

# Example usage
# LLN for exponential
lln_exp <- demonstrate_lln(rexp, n_max = 10000, rate = 1)
plot(lln_exp, type = 'l', ylim = c(0.5, 1.5), main = "LLN for Exponential(1)")
abline(h = 1, col = 'red', lty = 2) # True mean is 1

# CLT for uniform
clt_unif <- demonstrate_clt(runif, n = 30, n_simulations = 10000)
hist(clt_unif, breaks = 50, probability = TRUE, main = "CLT for Uniform(0,1)")
# Variance of U(0,1) is 1/12. SD of sample mean = sqrt(Var(X)/n)
curve(dnorm(x, mean = 0.5, sd = sqrt((1/12)/30)), add = TRUE, col = 'red', lwd=2)

# Compare variance estimators
mle_var <- function(x) { var(x) * (length(x) - 1) / length(x) }

compare_var_estimators <- function(n) {
  sample_func <- function(k) rnorm(k, 0, 1) # True variance = 1
  
  unbiased_res <- evaluate_estimator(
    var,  # R's var() is the unbiased (n-1) version
    true_value = 1, n = n, sample_func = sample_func
  )
  mle_res <- evaluate_estimator(
    mle_var,
    true_value = 1, n = n, sample_func = sample_func
  )
  list(unbiased = unbiased_res, mle = mle_res)
}

print("--- Comparing Variance Estimators (n=10) ---")
print(compare_var_estimators(10))

3.7.7 Further Reading

  • Statistical inference: Casella & Berger, “Statistical Inference”
  • Machine learning perspective: Shalev-Shwartz & Ben-David, “Understanding Machine Learning: From Theory to Algorithms”

Remember: Convergence and inference concepts are the bedrock of statistics. The Law of Large Numbers tells us why sampling works. The Central Limit Theorem tells us how to quantify uncertainty. The bias-variance tradeoff tells us how to choose good estimators. Master these ideas – they’re the key to everything that follows!


  1. Remember that \nabla_\theta f denotes the gradient of the function f(\theta; x) with respect to \theta – its “vector derivative” with respect to \theta in more than one dimension.↩︎

  2. For example, via a simple gradient descent step: \theta_{t+1} = \theta_t - \alpha_t g where \alpha_t > 0 is the learning rate at step t.↩︎

  3. For example, in the framework known as probably approximately correct (PAC) learning.↩︎

  4. The symbol \theta is almost universally reserved to represent generic “parameters” of a model in statistics and machine learning.↩︎