6  Parametric Inference II: Properties of Estimators

Published

September 23, 2025

6.1 Learning Objectives

After completing this chapter, you will be able to:

  • Explain the key properties of the MLE (consistency, equivariance, asymptotic normality, and efficiency) and their practical implications.
  • Define Fisher Information and use it to quantify the precision of parameter estimates.
  • Construct and interpret confidence intervals for parameters using the MLE and its standard error.
  • Apply the Delta Method to find confidence intervals for transformed parameters.
  • Implement the EM algorithm for finding MLEs in models with latent variables.
Note

This chapter explores the theoretical properties of Maximum Likelihood Estimators and provides practical tools for statistical inference. The material is adapted from Chapter 9 of Wasserman (2013), with additional computational examples and modern perspectives on optimization for latent variable models.

6.2 Introduction: How Good Are Our Estimators?

In Chapter 5, we learned how to find estimators for parametric models using the Method of Moments (MoM) and Maximum Likelihood Estimation (MLE). We saw that finding the MLE often requires numerical optimization, and we explored practical algorithms for this task.

But finding an estimator is only half the story. The next natural question is: how good are these estimators?

This chapter addresses this fundamental question by exploring the properties of estimators, with a special focus on the MLE. We’ll discover why the MLE is considered the “gold standard” of parametric estimation – it turns out to have a remarkable collection of desirable properties that make it optimal in many senses.

Key Questions We’ll Answer
  • Does our estimator converge to the true value as we get more data? (Consistency)
  • How much uncertainty is associated with our estimate? Can we quantify it? (Confidence Intervals)
  • Is our estimator the best possible one, or could we do better? (Efficiency)
  • How do these properties extend to complex models with multiple parameters or latent variables?

Understanding these properties is crucial for several reasons:

  1. Practical inference: Knowing that \hat{\theta}_n = 2.3 is not very useful without understanding the uncertainty. Is the true value likely between 2.2 and 2.4, or between 1 and 4?

  2. Method selection: When multiple estimation methods exist, these properties help us choose the best one.

  3. Foundation for advanced methods: The concepts in this chapter – especially Fisher Information and the EM algorithm – are fundamental to modern statistical methods and machine learning.

  4. Connection to optimization: We’ll see that the same mathematical quantities that determine statistical uncertainty also appear in optimization algorithms.

Let’s begin with a concrete example to bridge our previous work on finding MLEs with the new focus on analyzing their properties.

6.3 Warm-up: A Complete MLE Example with Numerical Optimization

Before diving into the theoretical properties, let’s work through a complete example that bridges Chapter 5 (finding MLEs) and Chapter 6 (analyzing them). We’ll find the MLE for a Beta distribution, which requires numerical optimization.

The Beta density with parameters \alpha, \beta > 0 is: f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}

The log-likelihood for n observations is: \begin{aligned} \ell_n(\alpha, \beta) = \sum_{i=1}^n \Big[&\log \Gamma(\alpha + \beta) - \log \Gamma(\alpha) - \log \Gamma(\beta) \\ &+ (\alpha - 1) \log(X_i) + (\beta - 1) \log(1-X_i)\Big] \end{aligned}

We will implement it below as the negative log-likelihood (beta_nll).

import numpy as np
import scipy.optimize
import jax
import jax.numpy as jnp
import jax.scipy.special as jsps

# Generate data from a Beta distribution
np.random.seed(43)
x = np.random.beta(1.5, 0.5, size=100)

# Define the negative log-likelihood for Beta(α, β)
def beta_nll(theta, x):
    """Negative log-likelihood for Beta distribution using JAX"""
    alpha, beta = theta
    # Use log-gamma function for numerical stability
    return -jnp.sum(
        jsps.gammaln(alpha + beta) 
        - jsps.gammaln(alpha) 
        - jsps.gammaln(beta)
        + (alpha - 1) * jnp.log(x)
        + (beta - 1) * jnp.log(1 - x)
    )

# Get the gradient function automatically
beta_grad = jax.grad(beta_nll)

# We specify bounds: α, β > 0 
# - Bounds are *inclusive* so we set a small positive number as lower bound
# - None here denotes infinity
bounds = [(0.0001, None), (0.0001, None)]

# Optimize using L-BFGS-B with bounds to ensure positivity
result = scipy.optimize.minimize(
    beta_nll, 
    x0=jnp.array([1.0, 1.0]),  # Initial guess
    args=(x,), # Extra arguments passed to both the objective and its jac (x is the data)
    jac=beta_grad,
    method='L-BFGS-B',
    bounds=bounds
)

print(f"Raw optimization result:\n")
print(result)

print(f"\nExtracted results:\n")
print(f"  MLE: α̂ = {result.x[0]:.3f}, β̂ = {result.x[1]:.3f}")
print(f"  Negative log-likelihood: {result.fun:.3f}")
print(f"  Converged: {result.success}")
print(f"  Iterations: {result.nit}")
Raw optimization result:

  message: CONVERGENCE: RELATIVE REDUCTION OF F <= FACTR*EPSMCH
  success: True
   status: 0
      fun: -55.24228286743164
        x: [ 1.501e+00  4.985e-01]
      nit: 10
      jac: [-1.030e-04 -9.155e-05]
     nfev: 41
     njev: 41
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>

Extracted results:

  MLE: α̂ = 1.501, β̂ = 0.499
  Negative log-likelihood: -55.242
  Converged: True
  Iterations: 10

This example showcases several important points:

  1. Automatic differentiation: We use JAX to automatically compute gradients, avoiding error-prone manual derivations
  2. Constrained optimization: The L-BFGS-B method handles the constraint that both parameters must be positive
  3. Numerical stability: We work with log-gamma functions rather than evaluating gamma functions directly
The MLE as a Starting Point

Now that we have \hat{\alpha} and \hat{\beta}, we can ask the crucial questions:

  • How accurate are these estimates?
  • What’s their sampling distribution?
  • Are they optimal?

These are exactly the questions this chapter will answer!

For Finnish-speaking students, here are the key terms we’ll use in this chapter:

English                 Finnish                        Context
Equivariant             Ekvivariantti                  Property of MLE under reparameterization
Efficient               Tehokas                        Optimal variance property
Score function          Rinnefunktio                   Gradient of log-likelihood
Fisher information      Fisherin informaatio           Variance of score function
Delta method            Delta-menetelmä                Method for transformed parameters
Sufficient statistic    Tyhjentävä tunnusluku          Contains all information about parameter
Latent variable         Piilomuuttuja                  Unobserved variable in model
EM Algorithm            EM-algoritmi                   Expectation-Maximization algorithm
Asymptotic normality    Asymptoottinen normaalisuus    Large-sample distribution property
Consistency             Konsistenssi/Tarkentuva        Convergence to true value

6.4 Core Properties of the Maximum Likelihood Estimator

Before exploring its properties, let’s recall the definition of the maximum likelihood estimator (MLE) from Chapter 5:

Let X_1, \ldots, X_n be i.i.d. with PDF (or PMF) f(x; \theta).

The likelihood function is: \mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta)

The log-likelihood function is: \ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^n \log f(X_i; \theta)

The maximum likelihood estimator (MLE), denoted by \hat{\theta}_n, is the value of \theta that maximizes \mathcal{L}_n(\theta) (or equivalently, \ell_n(\theta)).

6.4.1 Overview

In Chapter 5, we saw that the MLE is found by maximizing the likelihood function – a principle that seems intuitively reasonable. But intuition alone doesn’t make for good statistics. We need to understand why the MLE works so well.

It turns out that the MLE has several remarkable properties that make it the “gold standard” of parametric estimation:

  • Consistency: \hat{\theta}_n \xrightarrow{P} \theta_* – the MLE converges to the true parameter value as n \to \infty.
  • Equivariance: if \hat{\theta}_n is the MLE of \theta, then g(\hat{\theta}_n) is the MLE of g(\theta) – the MLE behaves sensibly under parameter transformations.
  • Asymptotic Normality: (\hat{\theta}_n - \theta_*)/\hat{\text{se}} \rightsquigarrow \mathcal{N}(0, 1) – the MLE is approximately normally distributed for large samples.
  • Asymptotic Efficiency: \text{Var}(\hat{\theta}_n) achieves the Cramér-Rao lower bound – the MLE has the smallest possible asymptotic variance among well-behaved estimators.
  • Approximate Bayes: \hat{\theta}_n \approx \arg\max_\theta \pi(\theta \mid X^n) with a flat prior – the MLE approximates the Bayesian posterior mode.

Let’s explore each of these properties in detail.

6.4.2 Consistency: Getting It Right Eventually

The most fundamental property we could ask of any estimator is that it gets closer to the truth as we collect more data. This is the property of consistency.

An estimator \hat{\theta}_n is consistent for \theta_* if: \hat{\theta}_n \xrightarrow{P} \theta_* That is, for any \epsilon > 0: \lim_{n \to \infty} P(|\hat{\theta}_n - \theta_*| > \epsilon) = 0

In words: as we collect more data, the probability that our estimate is far from the truth goes to zero. The MLE has this property under mild conditions.

The consistency of the MLE is deeply connected to the Kullback-Leibler (KL) divergence. For two densities f and g, the KL divergence is:

D(f, g) = \int f(x) \log \frac{f(x)}{g(x)} dx

Key properties:

  • D(f, g) \geq 0 always
  • D(f, g) = 0 if and only if f = g (almost everywhere)

The crucial insight is that maximizing the log-likelihood is equivalent to minimizing the KL divergence between the true distribution and our model. As n \to \infty, the empirical distribution converges to the true distribution, so the MLE converges to the parameter that makes the model distribution equal to the true distribution.

Identifiability: For this to work, the model must be identifiable: different parameter values must correspond to different distributions. That is, \theta \neq \psi implies D(f(\cdot; \theta), f(\cdot; \psi)) > 0.
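
To make this concrete, here is a minimal sketch (our own construction, using the Poisson family, where the KL divergence has a closed form). It checks that the KL divergence is minimized exactly at the true rate, and that the average log-likelihood over a large sample – which estimates the negative KL divergence up to a constant – is maximized at nearly the same point:

import numpy as np
from scipy import stats

# Closed-form KL divergence between Poisson(lam_star) and Poisson(lam):
# D(lam_star, lam) = lam_star * log(lam_star / lam) + lam - lam_star
lam_star = 3.0
lam_grid = np.linspace(0.5, 8.0, 400)
kl = lam_star * np.log(lam_star / lam_grid) + lam_grid - lam_star
print(f"KL minimized at lambda = {lam_grid[np.argmin(kl)]:.3f} (true value: {lam_star})")

# The average log-likelihood estimates -D(f_true, f_model) up to a constant,
# so for large n its maximizer is close to lam_star
rng = np.random.default_rng(0)
x = rng.poisson(lam_star, size=10_000)
avg_ll = np.array([stats.poisson.logpmf(x, lam).mean() for lam in lam_grid])
print(f"Average log-likelihood maximized at lambda = {lam_grid[np.argmax(avg_ll)]:.3f}")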

Example: Consistency in Action

Let’s visualize how the MLE becomes more accurate with increasing sample size n. We’ll use the Poisson distribution where the MLE for the rate parameter \lambda is simply the sample mean:

import matplotlib.pyplot as plt

# True parameter
true_lambda = 3.0

# Sample sizes to consider
sample_sizes = [10, 50, 100, 500, 1000, 5000]
num_simulations = 100

# Store results
mle_estimates = {n: [] for n in sample_sizes}

# Run simulations
np.random.seed(42)
for n in sample_sizes:
    for _ in range(num_simulations):
        # Generate Poisson data
        data = np.random.poisson(true_lambda, size=n)
        # MLE for Poisson is just the sample mean
        mle = np.mean(data)
        mle_estimates[n].append(mle)

# Create plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3.5))

# Left plot: Box plots showing distribution at each sample size
positions = range(len(sample_sizes))
ax1.boxplot([mle_estimates[n] for n in sample_sizes], 
            positions=positions,
            labels=[str(n) for n in sample_sizes])
ax1.axhline(y=true_lambda, color='red', linestyle='--', label=f'True λ = {true_lambda}')
ax1.set_xlabel('Sample size n')
ax1.set_ylabel('MLE estimate')
ax1.set_title('Distribution of MLE')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right plot: Standard deviation vs sample size
std_devs = [np.std(mle_estimates[n]) for n in sample_sizes]
ax2.loglog(sample_sizes, std_devs, 'bo-', label='Empirical SD')
# Theoretical standard deviation: sqrt(λ/n)
theoretical_sd = [np.sqrt(true_lambda/n) for n in sample_sizes]
ax2.loglog(sample_sizes, theoretical_sd, 'r--', label='Theoretical SD')
ax2.set_xlabel('Sample size n')
ax2.set_ylabel('Standard deviation of MLE')
ax2.set_title('Convergence Rate')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

The plots demonstrate consistency: as n increases, the MLE concentrates more tightly around the true value, with standard deviation decreasing at rate 1/\sqrt{n}.

6.4.3 Equivariance: Reparameterization Invariance

A subtle but important property of the MLE is equivariance (or “functional invariance”). This means that if we reparameterize our model, the MLE transforms in the natural way.

If \hat{\theta}_n is the MLE of \theta, then for any function g: \widehat{g(\theta)} = g(\hat{\theta}_n) That is, the MLE of \tau = g(\theta) is \hat{\tau}_n = g(\hat{\theta}_n).

Proof: Let \tau = g(\theta) where g has inverse h, so \theta = h(\tau). For any \tau:

\mathcal{L}_n(\tau) = \prod_{i=1}^n f(X_i; h(\tau)) = \prod_{i=1}^n f(X_i; \theta) = \mathcal{L}_n(\theta)

where \theta = h(\tau). Therefore:

\mathcal{L}_n(\tau) = \mathcal{L}_n(\theta) \leq \mathcal{L}_n(\hat{\theta}_n) = \mathcal{L}_n(\hat{\tau}_n)

Since this holds for any \tau, we have \hat{\tau}_n = g(\hat{\theta}_n). □

Example: Equivariance in Practice

Consider X_1, \ldots, X_n \sim \mathcal{N}(\theta, 1) where we’re interested in both:

  • The mean \theta
  • The parameter \tau = e^\theta (perhaps \theta is log-income and \tau is income)

The MLE for \theta is \hat{\theta}_n = \bar{X}_n. By equivariance: \hat{\tau}_n = e^{\hat{\theta}_n} = e^{\bar{X}_n}

No need to rederive from scratch! This is particularly convenient when dealing with complex transformations.
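
As a quick numerical check (a sketch with simulated data; the setup and variable names are ours), we can maximize the likelihood over \theta and, separately, over \tau = e^\theta directly, and confirm that the two maximizers are linked by g:

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = rng.normal(loc=1.2, scale=1.0, size=200)  # theta_true = 1.2

# NLL as a function of theta, and as a function of tau = exp(theta)
nll_theta = lambda theta: -stats.norm.logpdf(x, loc=theta, scale=1.0).sum()
nll_tau = lambda tau: nll_theta(np.log(tau))  # reparameterize: theta = log(tau)

theta_hat = optimize.minimize_scalar(nll_theta).x
tau_hat = optimize.minimize_scalar(nll_tau, bounds=(1e-6, 50), method='bounded').x

print(f"theta_hat = {theta_hat:.4f}, exp(theta_hat) = {np.exp(theta_hat):.4f}")
print(f"tau_hat (direct maximization) = {tau_hat:.4f}")  # matches exp(theta_hat)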

Why Equivariance Matters:

  1. Convenience: We can work in whatever parameterization is most natural for finding the MLE, then transform to the parameterization of interest.

  2. Consistency across parameterizations: Different researchers might parameterize the same model differently (e.g., variance vs. precision in a normal distribution). Equivariance ensures they’ll get equivalent results.

  3. Not universal: This property is special to MLEs! Other estimators, like the Bayesian posterior mode (also known as MAP or maximum a posteriori estimate), generally lack this property. For instance, if \theta has a uniform prior, \tau = \theta^2 does not have a uniform prior, leading to different posterior modes.

A Common Misconception

Equivariance does NOT mean that: \mathbb{E}[g(\hat{\theta}_n)] = g(\mathbb{E}[\hat{\theta}_n])

In general, g(\hat{\theta}_n) is a biased estimator of g(\theta) even if \hat{\theta}_n is unbiased for \theta (unless g is linear). Equivariance is about what parameter value maximizes the likelihood, not about expected values.

6.4.4 Asymptotic Normality & Optimality

The consistency and equivariance properties are nice, but they don’t tell us about the distribution of the MLE. How much uncertainty is there in our estimate? How efficient is it compared to other estimators?

The remarkable answer is that the MLE is approximately normally distributed with the smallest possible variance. We’ll explore these twin properties – asymptotic normality and efficiency – in detail in the next section, as they require us to first understand a fundamental concept: the Fisher Information.

6.5 Fisher Information and Confidence Intervals

6.5.1 Fisher Information: Quantifying What Data Can Tell Us

To understand the precision of the MLE, we need to introduce one of the most important concepts in statistical theory: the Fisher Information. Named after statistician and polymath R.A. Fisher, this quantity measures how much “information” about a parameter is contained in the data.

The Fisher Information is formally defined through the score function and its variance.

The score function is the gradient of the log-likelihood: s(X; \theta) = \frac{\partial \log f(X; \theta)}{\partial \theta}

The Fisher Information is: I_n(\theta) = \mathbb{V}_\theta\left(\sum_{i=1}^n s(X_i; \theta)\right) = \sum_{i=1}^n \mathbb{V}_\theta(s(X_i; \theta))

For a single observation (n=1), we often write I(\theta) = I_1(\theta).

The computation of the Fisher information can be simplified using the following results:

For an IID sample of size n:

  1. I_n(\theta) = n \cdot I(\theta) (information accumulates linearly)
  2. \mathbb{E}_\theta[s(X; \theta)] = 0 (expected score is zero at the true parameter)
  3. I(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right] (expected negative curvature)

The last property shows that Fisher Information is literally the expected curvature of the log-likelihood – an intuition about “sharpness” that we develop below.

Property 1: Linear accumulation for IID samples

For IID samples X_1, \ldots, X_n: I_n(\theta) = \mathbb{V}_\theta\left(\sum_{i=1}^n s(X_i; \theta)\right)

Since the X_i are independent and \mathbb{V}(\sum Y_i) = \sum \mathbb{V}(Y_i) for independent random variables: I_n(\theta) = \sum_{i=1}^n \mathbb{V}_\theta(s(X_i; \theta)) = n \cdot \mathbb{V}_\theta(s(X; \theta)) = n \cdot I(\theta)

Property 2: Expected score is zero

Under regularity conditions that allow interchange of derivative and integral: \mathbb{E}_\theta[s(X; \theta)] = \mathbb{E}_\theta\left[\frac{\partial \log f(X; \theta)}{\partial \theta}\right] = \int \frac{\partial \log f(x; \theta)}{\partial \theta} f(x; \theta) dx

Using \frac{\partial \log f}{\partial \theta} = \frac{1}{f} \frac{\partial f}{\partial \theta}: = \int \frac{\partial f(x; \theta)}{\partial \theta} dx = \frac{\partial}{\partial \theta} \int f(x; \theta) dx = \frac{\partial}{\partial \theta} (1) = 0

Property 3: Alternative formula using second derivative

We start with Property 2: \mathbb{E}_\theta[s(X; \theta)] = 0, which we can write explicitly as: \int s(x; \theta) f(x; \theta) dx = \int \frac{\partial \log f(x; \theta)}{\partial \theta} f(x; \theta) dx = 0

Since this holds for all \theta, we can differentiate both sides with respect to \theta: \frac{\partial}{\partial \theta} \int s(x; \theta) f(x; \theta) dx = 0

Under regularity conditions (allowing interchange of derivative and integral): \int \frac{\partial}{\partial \theta}[s(x; \theta) f(x; \theta)] dx = 0

Using the product rule: \int \left[\frac{\partial s(x; \theta)}{\partial \theta} f(x; \theta) + s(x; \theta) \frac{\partial f(x; \theta)}{\partial \theta}\right] dx = 0

Now, note that \frac{\partial f(x; \theta)}{\partial \theta} = f(x; \theta) \cdot \frac{\partial \log f(x; \theta)}{\partial \theta} = f(x; \theta) \cdot s(x; \theta)

Substituting: \int \left[\frac{\partial s(x; \theta)}{\partial \theta} f(x; \theta) + s(x; \theta)^2 f(x; \theta)\right] dx = 0

This can be rewritten as: \mathbb{E}_\theta\left[\frac{\partial s(X; \theta)}{\partial \theta}\right] + \mathbb{E}_\theta[s(X; \theta)^2] = 0

Since s(X; \theta) = \frac{\partial \log f(X; \theta)}{\partial \theta}, we have:

  • \frac{\partial s(X; \theta)}{\partial \theta} = \frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}
  • \mathbb{E}_\theta[s(X; \theta)^2] = \mathbb{V}_\theta[s(X; \theta)] = I(\theta) (since \mathbb{E}_\theta[s(X; \theta)] = 0)

Therefore: \mathbb{E}_\theta\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right] + I(\theta) = 0

Which gives us: I(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right]
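
These identities are easy to verify by simulation. The following minimal sketch (our own) checks, for the Bernoulli model, that the expected score at the true parameter is zero and that the variance of the score matches both the negative expected second derivative and the formula I(p) = 1/(p(1-p)) derived later in this chapter:

import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=1_000_000)  # many Bernoulli draws

# Score and second derivative of log f(x; p), evaluated at the true parameter
score = x / p_true - (1 - x) / (1 - p_true)
d2 = -x / p_true**2 - (1 - x) / (1 - p_true)**2

print(f"E[score]       ~ {score.mean():+.4f} (theory: 0)")
print(f"Var[score]     ~ {score.var():.4f}")
print(f"-E[d2 log f]   ~ {-d2.mean():.4f}")
print(f"1/(p(1-p))     = {1 / (p_true * (1 - p_true)):.4f}")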

Imagine you’re trying to estimate a parameter by maximizing the log-likelihood function. Picture this function as a hill that you’re climbing to find the peak (the MLE).

Now think about two scenarios:

  1. Sharp, pointy peak: The log-likelihood drops off steeply as you move away from the maximum. Even a small change in the parameter makes the data much less likely. This means the data is very “informative” – it strongly prefers one specific parameter value.

  2. Flat, broad peak: The log-likelihood changes slowly near the maximum. Many different parameter values give similar likelihoods. The data isn’t very informative about the exact parameter value.

The Fisher Information measures the “sharpness” or curvature of the log-likelihood at its peak:

  • Sharp peak → High Fisher Information → Low variance for \hat{\theta} → Narrow confidence interval
  • Flat peak → Low Fisher Information → High variance for \hat{\theta} → Wide confidence interval

This is the key insight: the same curvature that makes optimization easy (sharp peak = clear maximum) also makes estimation precise (sharp peak = low uncertainty)!

The meaning of \mathbb{E}_\theta[s(X; \theta)] = 0 is subtle and often misunderstood. It does NOT mean:

  • The derivative is zero (that’s what happens at the MLE for a specific dataset)
  • The log-likelihood is maximized at the true parameter

What it DOES mean:

  • When data is generated from f(x; \theta_*), the score evaluated at \theta_* (the true parameter) averages to zero across all possible datasets
  • If you’re at the true parameter value and observe a random data point, it’s equally likely to suggest increasing or decreasing \theta
  • This is why the true parameter is “stable” – random samples don’t systematically pull the estimate away from the truth

Think of it as a balance point: at the true parameter, the data provides no systematic evidence to move in either direction, even though any individual sample might suggest moving up or down.

Let’s calculate and visualize the Fisher Information for a concrete example. We’ll examine the Bernoulli distribution, where the Fisher Information has a particularly interesting form, which we will derive later:

I(p) = \frac{1}{p(1-p)}

This shows how the precision of estimating a probability depends on the true probability value:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Example: Fisher Information for Bernoulli(p)
# For Bernoulli: I(p) = 1/(p(1-p))

p_values = np.linspace(0.01, 0.99, 200)
fisher_info = 1 / (p_values * (1 - p_values))

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(7, 8))

# Top plot: Fisher Information
ax1.plot(p_values, fisher_info, 'b-', linewidth=2)
ax1.set_xlabel('p')
ax1.set_ylabel('I(p)')
ax1.set_title('Fisher Information for Bernoulli(p)')
ax1.grid(True, alpha=0.3)
ax1.set_ylim(0, 50)

# Middle plot: Standard error (1/sqrt(nI(p))) for different n
n_values = [10, 50, 100, 500]
colors = ['red', 'orange', 'green', 'blue']
for n, color in zip(n_values, colors):
    se = 1 / np.sqrt(n * fisher_info)
    ax2.plot(p_values, se, color=color, linewidth=2, label=f'n = {n}')
ax2.set_xlabel('p')
ax2.set_ylabel('Standard Error')
ax2.set_title('Standard Error of MLE')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 0.2)

# Bottom plot: Log-likelihood curves for different p values
n_obs = 20
successes = 12
theta_range = np.linspace(0.01, 0.99, 200)

# Show log-likelihood for different true p values
for p_true, color in [(0.5, 'red'), (0.6, 'green'), (0.8, 'blue')]:
    log_lik = successes * np.log(theta_range) + (n_obs - successes) * np.log(1 - theta_range)
    ax3.plot(theta_range, log_lik, color=color, linewidth=2, 
             label=f'Data from p = {p_true}')
    
    # Mark the MLE
    mle = successes / n_obs
    mle_ll = successes * np.log(mle) + (n_obs - successes) * np.log(1 - mle)
    ax3.plot(mle, mle_ll, 'ko', markersize=8)

ax3.set_xlabel('θ')
ax3.set_ylabel('Log-likelihood')
ax3.set_title(f'Log-likelihood curves (n={n_obs}, observed {successes} successes)')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key insights from the plots:

  1. Fisher Information is lowest at p = 0.5 (hardest to estimate a fair coin)
  2. Fisher Information \rightarrow \infty as p \rightarrow 0 or p \rightarrow 1 (extreme probabilities are ‘easier’ to pin down)
  3. Standard error decreases with both n and I(p)
  4. The curvature of the log-likelihood reflects the Fisher Information

6.5.2 Asymptotic Normality of the MLE

Now we can state the key theorem that connects Fisher Information to the distribution of the MLE:

Let \text{se} = \sqrt{\mathbb{V}(\hat{\theta}_n)}. Under appropriate regularity conditions we have the following:

  1. \text{se} \approx \sqrt{1 / I_n(\theta)} and \frac{(\hat{\theta}_n - \theta)}{\text{se}} \rightsquigarrow \mathcal{N}(0,1)

  2. Let \widehat{\text{se}} = \sqrt{1 / I_n(\hat{\theta}_n)}. Then \frac{(\hat{\theta}_n - \theta)}{\widehat{\text{se}}} \rightsquigarrow \mathcal{N}(0,1)

According to this theorem, the distribution of the MLE is approximately \mathcal{N}(\theta, \widehat{\text{se}}^2). This is one of the most important results in statistics: it tells us not just that the MLE converges to the true value (consistency), but also gives us its approximate distribution for finite samples.

The key insight is that the same Fisher Information that measures how “sharp” the likelihood is also determines the precision of our estimate. Sharp likelihood → High Fisher Information → Small standard error → Precise estimate.
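
As a sanity check (a simulation of our own, reusing the Poisson model from the consistency example, where I(\lambda) = 1/\lambda), we can standardize many replicated MLEs by their estimated standard errors and confirm the resulting z-scores look standard normal:

import numpy as np

rng = np.random.default_rng(0)
lam_true, n, n_sims = 3.0, 200, 5000

# For Poisson, I(lam) = 1/lam, so se_hat = sqrt(lam_hat / n)
lam_hats = rng.poisson(lam_true, size=(n_sims, n)).mean(axis=1)
z = (lam_hats - lam_true) / np.sqrt(lam_hats / n)

print(f"Mean of z: {z.mean():+.3f}, SD of z: {z.std():.3f} (theory: 0, 1)")
print(f"Fraction with |z| <= 1.96: {(np.abs(z) <= 1.96).mean():.3f} (theory: ~0.95)")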

6.5.3 Constructing Confidence Intervals for the MLE

With the asymptotic normality result in hand, we can now construct confidence intervals for our parameter estimates.

The interval C_n = \left( \hat{\theta}_n - z_{\alpha/2}\widehat{\text{se}}, \hat{\theta}_n + z_{\alpha/2}\widehat{\text{se}} \right) is an asymptotically valid (1-\alpha) confidence interval for \theta.

For a 95% confidence interval, z_{0.025} \approx 1.96 \approx 2, giving the simple rule:

\text{95\% CI} \approx \hat{\theta}_n \pm 2 \cdot \widehat{\text{se}}

This type of confidence interval is used as the basis for statements about expected error in opinion polls and many other applications.

Example: Confidence Interval for a Proportion

Let’s work through the complete derivation for the Bernoulli case, which gives us confidence intervals for proportions – one of the most common applications in practice.

For X_1, \ldots, X_n \sim \text{Bernoulli}(p), we have f(x; p) = p^x (1-p)^{1-x} for x \in \{0, 1\}.

Step 1: Find the MLE The likelihood for n observations is: \mathcal{L}_n(p) = \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i} = p^{S} (1-p)^{n-S} where S = \sum_{i=1}^n X_i is the total number of successes.

Taking the log: \ell_n(p) = S \log p + (n-S) \log(1-p)

Setting \frac{d\ell_n}{dp} = \frac{S}{p} - \frac{n-S}{1-p} = 0 gives \hat{p}_n = \frac{S}{n} = \bar{X}_n

Step 2: Compute the score function For a single observation: \log f(X; p) = X \log p + (1-X) \log(1-p)

The score function is: s(X; p) = \frac{\partial \log f(X; p)}{\partial p} = \frac{X}{p} - \frac{1-X}{1-p}

Let’s verify \mathbb{E}_p[s(X; p)] = 0: \mathbb{E}_p[s(X; p)] = \mathbb{E}_p\left[\frac{X}{p} - \frac{1-X}{1-p}\right] = \frac{p}{p} - \frac{1-p}{1-p} = 1 - 1 = 0 \checkmark

Step 3: Compute the second derivative \frac{\partial^2 \log f(X; p)}{\partial p^2} = \frac{\partial}{\partial p}\left[\frac{X}{p} - \frac{1-X}{1-p}\right] = -\frac{X}{p^2} - \frac{1-X}{(1-p)^2}

Step 4: Find the Fisher Information Using the formula I(p) = -\mathbb{E}_p\left[\frac{\partial^2 \log f(X; p)}{\partial p^2}\right]:

I(p) = -\mathbb{E}_p\left[-\frac{X}{p^2} - \frac{1-X}{(1-p)^2}\right] = \mathbb{E}_p\left[\frac{X}{p^2}\right] + \mathbb{E}_p\left[\frac{1-X}{(1-p)^2}\right]

Since \mathbb{E}_p[X] = p and \mathbb{E}_p[1-X] = 1-p: I(p) = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1-p+p}{p(1-p)} = \frac{1}{p(1-p)}

Step 5: Derive the standard error For n observations, I_n(p) = n \cdot I(p) = \frac{n}{p(1-p)}

The standard error of \hat{p}_n is: \widehat{\text{se}} = \frac{1}{\sqrt{I_n(\hat{p}_n)}} = \frac{1}{\sqrt{\frac{n}{\hat{p}_n(1-\hat{p}_n)}}} = \sqrt{\frac{\hat{p}_n(1-\hat{p}_n)}{n}}

Step 6: Construct the confidence interval From the asymptotic normality theorem: \frac{\hat{p}_n - p}{\widehat{\text{se}}} \rightsquigarrow \mathcal{N}(0,1)

For a (1-\alpha) confidence interval: \mathbb{P}\left(-z_{\alpha/2} \leq \frac{\hat{p}_n - p}{\widehat{\text{se}}} \leq z_{\alpha/2}\right) \approx 1-\alpha

Rearranging: \mathbb{P}\left(\hat{p}_n - z_{\alpha/2} \cdot \widehat{\text{se}} \leq p \leq \hat{p}_n + z_{\alpha/2} \cdot \widehat{\text{se}}\right) \approx 1-\alpha

Therefore, the (1-\alpha) confidence interval is: \hat{p}_n \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_n(1-\hat{p}_n)}{n}}

Example: Political Opinion Polling

Let’s apply the confidence interval for a proportion from the previous worked example to understand how political polls work and what their “margin of error” really means.

# Political polling example
# Suppose we poll n=1000 randomly selected likely voters about a referendum

n = 1000  # Sample size
successes = 520  # Number who support the referendum
p_hat = successes / n  # Sample proportion = 0.52 (52%)

# Standard error using the formula we derived
se = np.sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval (using z_{0.025} = 1.96)
z_critical = 1.96
ci_lower = p_hat - z_critical * se
ci_upper = p_hat + z_critical * se

print("POLL RESULTS:")
print(f"Sample size: {n} voters")
print(f"Support for referendum: {p_hat*100:.1f}% ({successes} out of {n})")
print(f"Standard error: {se*100:.2f} percentage points")
print(f"95% confidence interval: ({ci_lower*100:.1f}%, {ci_upper*100:.1f}%)")
print(f"Margin of error: ±{z_critical*se*100:.1f} percentage points")

# Interpretation
print("\nINTERPRETATION:")
if ci_lower > 0.5:
    print("The referendum has statistically significant majority support.")
elif ci_upper < 0.5:
    print("The referendum has statistically significant minority support.")
else:
    print("The poll cannot determine if there's majority support (CI includes 50%).")

# Compare different scenarios to understand margin of error
print("\n" + "="*60)
print("HOW MARGIN OF ERROR VARIES:")
print("="*60)

scenarios = [
    (0.50, 500, "50% support, n=500"),
    (0.50, 1000, "50% support, n=1000"), 
    (0.50, 2000, "50% support, n=2000"),
    (0.20, 1000, "20% support, n=1000"),
    (0.10, 1000, "10% support, n=1000"),
    (0.01, 1000, "1% support, n=1000")
]

print(f"{'Scenario':<25} {'Margin of Error':>20} {'95% CI':>20}")
print("-" * 65)
for p, n, desc in scenarios:
    margin = 1.96 * np.sqrt(p * (1 - p) / n) * 100
    ci_low = (p - 1.96 * np.sqrt(p * (1 - p) / n)) * 100
    ci_high = (p + 1.96 * np.sqrt(p * (1 - p) / n)) * 100
    print(f"{desc:<25} ±{margin:>6.1f} percentage pts  ({ci_low:>5.1f}%, {ci_high:>5.1f}%)")

# Key assumptions and limitations
print("\nKEY ASSUMPTIONS:")
print("1. Random sampling from the population of interest")
print("2. Respondents answer truthfully")
print("3. The sample size is large enough for normal approximation (np ≥ 10 and n(1-p) ≥ 10)")
print("4. No systematic bias in who responds to the poll")
POLL RESULTS:
Sample size: 1000 voters
Support for referendum: 52.0% (520 out of 1000)
Standard error: 1.58 percentage points
95% confidence interval: (48.9%, 55.1%)
Margin of error: ±3.1 percentage points

INTERPRETATION:
The poll cannot determine if there's majority support (CI includes 50%).

============================================================
HOW MARGIN OF ERROR VARIES:
============================================================
Scenario                       Margin of Error               95% CI
-----------------------------------------------------------------
50% support, n=500        ±   4.4 percentage pts  ( 45.6%,  54.4%)
50% support, n=1000       ±   3.1 percentage pts  ( 46.9%,  53.1%)
50% support, n=2000       ±   2.2 percentage pts  ( 47.8%,  52.2%)
20% support, n=1000       ±   2.5 percentage pts  ( 17.5%,  22.5%)
10% support, n=1000       ±   1.9 percentage pts  (  8.1%,  11.9%)
1% support, n=1000        ±   0.6 percentage pts  (  0.4%,   1.6%)

KEY ASSUMPTIONS:
1. Random sampling from the population of interest
2. Respondents answer truthfully
3. The sample size is large enough for normal approximation (np ≥ 10 and n(1-p) ≥ 10)
4. No systematic bias in who responds to the poll

Important insights about polling:

  1. The margin of error is largest when p = 0.5: This is why close races are hardest to call. When support is near 50%, we have maximum uncertainty about which side has the majority.

  2. Sample size matters, but with diminishing returns: To halve the margin of error, you need to quadruple the sample size (since margin ∝ 1/√n).

  3. Extreme proportions have smaller margins: If only 1% support something, we can estimate that quite precisely even with modest sample sizes.

  4. “Statistical ties”: When the confidence interval includes 50%, we cannot conclude which side has majority support. This is often called a “statistical tie” in media reports.

  5. The stated margin usually assumes 95% confidence: When polls report “±3 percentage points,” they typically mean the 95% confidence interval extends 3 points in each direction.

The confidence interval we derived above, known as the Wald interval, has poor coverage when p is near 0 or 1, or when n is small. In practice, consider using:

  • Agresti-Coull interval: Add 2 successes and 2 failures before computing: \tilde{p} = (S+2)/(n+4).
  • Wilson score interval: More complex but widely recommended as the default choice.

The Wald interval remains important for understanding the theory, but these alternatives perform better in practice.
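
The coverage problem is easy to demonstrate by simulation. The sketch below (our own; the Wilson formula is implemented by hand rather than taken from a library) compares the empirical coverage of the Wald and Wilson intervals for a small sample and an extreme proportion:

import numpy as np

def wald_ci(s, n, z=1.96):
    p = s / n
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(s, n, z=1.96):
    p = s / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

rng = np.random.default_rng(0)
p_true, n, n_sims = 0.05, 40, 20_000
s = rng.binomial(n, p_true, size=n_sims)  # success counts across replications

for name, ci in [("Wald", wald_ci), ("Wilson", wilson_ci)]:
    lo, hi = ci(s, n)
    coverage = np.mean((lo <= p_true) & (p_true <= hi))
    print(f"{name:>6} coverage: {coverage:.3f} (nominal: 0.95)")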

We mentioned that the MLE achieves the smallest possible variance. This is formalized by the Cramér-Rao Lower Bound:

For any unbiased estimator \tilde{\theta}: \mathbb{V}(\tilde{\theta}) \geq \frac{1}{I_n(\theta)}

Since the MLE asymptotically achieves variance 1/I_n(\theta), it is asymptotically efficient – no unbiased estimator can do better!

This is why we call the MLE “optimal”: it extracts all possible information from the data about the parameter.

6.6 Additional Topics

6.6.1 The Delta Method: Confidence Intervals for Transformations

Often we’re not interested in the parameter \theta itself, but in some transformation \tau = g(\theta). For example:

  • If \theta is log-odds, we might want odds \tau = e^\theta
  • If \theta is standard deviation, we might want variance \tau = \theta^2
  • If \theta is a rate parameter, we might want mean lifetime \tau = 1/\theta

Due to the equivariance property seen earlier, if we know the MLE \hat{\theta}, we also know that the MLE of \tau is \hat{\tau} = g(\hat{\theta}). But what about confidence intervals for \tau?

The Delta Method provides a way to find confidence intervals for transformed parameters.

If \tau = g(\theta) where g is differentiable and g'(\theta) \neq 0, then: \widehat{\text{se}}(\hat{\tau}_n) \approx |g'(\hat{\theta}_n)| \cdot \widehat{\text{se}}(\hat{\theta}_n)

Therefore: \frac{(\hat{\tau}_n - \tau)}{\widehat{\text{se}}(\hat{\tau}_n)} \rightsquigarrow \mathcal{N}(0,1)

Intuition Behind the Delta Method

If we zoom in enough, any smooth function looks linear. The Delta Method says that when we transform our estimate through a function g, the standard error gets multiplied by |g'(\hat{\theta}_n)| – the absolute slope of the function at our estimate.

This approximation becomes exact as n \to \infty because the variability of \hat{\theta}_n shrinks, effectively “zooming in” on the region where g is approximately linear.

Example: Full Delta Method Workflow

Let’s work through estimating \psi = \log \sigma for a Normal distribution with known mean. This example demonstrates every step of the Delta Method process.

Setup: X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) with \mu known.

Step 1: Find MLE for \sigma

The log-likelihood is: \ell(\sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2

Taking the derivative and setting to zero: \frac{d\ell}{d\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (X_i - \mu)^2 = 0

Solving: \hat{\sigma}_n = \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2}

Step 2: Find Fisher Information for \sigma

The log density for a single observation is: \log f(X; \sigma) = -\log \sigma - \frac{(X-\mu)^2}{2\sigma^2}

First derivative: \frac{\partial \log f(X; \sigma)}{\partial \sigma} = -\frac{1}{\sigma} + \frac{(X-\mu)^2}{\sigma^3}

Second derivative: \frac{\partial^2 \log f(X; \sigma)}{\partial \sigma^2} = \frac{1}{\sigma^2} - \frac{3(X-\mu)^2}{\sigma^4}

Fisher Information: I(\sigma) = -\mathbb{E}\left[\frac{\partial^2 \log f(X; \sigma)}{\partial \sigma^2}\right] = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2}

Step 3: Standard Error for \hat{\sigma} \widehat{\text{se}}(\hat{\sigma}_n) = \frac{1}{\sqrt{nI(\hat{\sigma}_n)}} = \frac{\hat{\sigma}_n}{\sqrt{2n}}

Step 4: Apply the Delta Method

For \psi = g(\sigma) = \log \sigma, we have g'(\sigma) = 1/\sigma. Therefore:

\widehat{\text{se}}(\hat{\psi}_n) = |g'(\hat{\sigma}_n)| \cdot \widehat{\text{se}}(\hat{\sigma}_n) = \frac{1}{\hat{\sigma}_n} \cdot \frac{\hat{\sigma}_n}{\sqrt{2n}} = \frac{1}{\sqrt{2n}}

Step 5: Confidence Interval

A 95% confidence interval for \psi = \log \sigma is: \hat{\psi}_n \pm \frac{2}{\sqrt{2n}} = \log \hat{\sigma}_n \pm \frac{2}{\sqrt{2n}}

Notice something remarkable: the standard error for \log \sigma doesn’t depend on \sigma itself! This is one reason why log-transformations are often used for scale parameters.

Verification Via Simulation

Let’s verify the Delta Method through simulation. We’ll generate many samples, compute both \hat{\sigma} and \log \hat{\sigma} for each, and check if their empirical standard errors match the theoretical predictions:

# Demonstrate the Delta Method with simulations
np.random.seed(42)

# True parameters
mu_true = 5.0
sigma_true = 2.0
n = 100

# Generate data and compute MLEs
n_sims = 1000
sigma_mles = []
log_sigma_mles = []

for _ in range(n_sims):
    data = np.random.normal(mu_true, sigma_true, n)
    sigma_hat = np.sqrt(np.mean((data - mu_true)**2))
    sigma_mles.append(sigma_hat)
    log_sigma_mles.append(np.log(sigma_hat))

sigma_mles = np.array(sigma_mles)
log_sigma_mles = np.array(log_sigma_mles)

# Theoretical values
theoretical_se_sigma = sigma_true / np.sqrt(2*n)
theoretical_se_log_sigma = 1 / np.sqrt(2*n)

# Create plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3.5))

# Left: Distribution of sigma MLE
ax1.hist(sigma_mles, bins=30, density=True, alpha=0.7, color='blue')
x = np.linspace(sigma_mles.min(), sigma_mles.max(), 100)
ax1.plot(x, stats.norm.pdf(x, sigma_true, theoretical_se_sigma), 
         'r-', linewidth=2, label='Theoretical')
ax1.axvline(sigma_true, color='black', linestyle='--', label='True value')
ax1.set_xlabel(r'$\hat{\sigma}$')
ax1.set_ylabel('Density')
ax1.set_title(r'Distribution of $\hat{\sigma}$')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: Distribution of log(sigma) MLE
ax2.hist(log_sigma_mles, bins=30, density=True, alpha=0.7, color='green')
x = np.linspace(log_sigma_mles.min(), log_sigma_mles.max(), 100)
ax2.plot(x, stats.norm.pdf(x, np.log(sigma_true), theoretical_se_log_sigma), 
         'r-', linewidth=2, label='Theoretical')
ax2.axvline(np.log(sigma_true), color='black', linestyle='--', label='True value')
ax2.set_xlabel(r'$\hat{\psi} = \log \hat{\sigma}$')
ax2.set_ylabel('Density')
ax2.set_title(r'Distribution of $\log \hat{\sigma}$')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Standard Error Comparison:")
print(rf"Empirical SE($\hat{{\sigma}}$): {np.std(sigma_mles):.4f}")
print(rf"Theoretical SE($\hat{{\sigma}}$): {theoretical_se_sigma:.4f}")
print(rf"Empirical SE($\log \hat{{\sigma}}$): {np.std(log_sigma_mles):.4f}")
print(rf"Theoretical SE($\log \hat{{\sigma}}$): {theoretical_se_log_sigma:.4f}")
print(f"\nSE ratio (empirical): {np.std(log_sigma_mles)/np.std(sigma_mles):.4f}")
print(f"SE ratio (Delta method): {1/sigma_true:.4f}")

Standard Error Comparison:
Empirical SE($\hat{\sigma}$): 0.1403
Theoretical SE($\hat{\sigma}$): 0.1414
Empirical SE($\log \hat{\sigma}$): 0.0705
Theoretical SE($\log \hat{\sigma}$): 0.0707

SE ratio (empirical): 0.5027
SE ratio (Delta method): 0.5000

Key takeaways: The simulation confirms that the Delta Method works well. The empirical standard error ratio between \log \hat{\sigma} and \hat{\sigma} closely matches the Delta Method prediction of 1/\sigma, and both sampling distributions are well-approximated by normal distributions, validating the asymptotic normality of MLEs.

6.6.2 Multiparameter Models

Real-world problems often involve multiple parameters. The theory we saw in this chapter extends naturally, with matrices replacing scalars.

For \theta = (\theta_1, \ldots, \theta_k), let \hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k) be the MLE. The key concepts become:

Let \ell_n(\theta) = \sum_{i=1}^n \log f(X_i; \theta) and define: H_{jk} = \frac{\partial^2 \ell_n}{\partial \theta_j \partial \theta_k}

The Fisher Information Matrix is: I_n(\theta) = - \begin{pmatrix} \mathbb{E}_\theta(H_{11}) & \mathbb{E}_\theta(H_{12}) & \cdots & \mathbb{E}_\theta(H_{1k}) \\ \mathbb{E}_\theta(H_{21}) & \mathbb{E}_\theta(H_{22}) & \cdots & \mathbb{E}_\theta(H_{2k}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E}_\theta(H_{k1}) & \mathbb{E}_\theta(H_{k2}) & \cdots & \mathbb{E}_\theta(H_{kk}) \end{pmatrix}

Under regularity conditions:

  1. The MLE is approximately multivariate normal: (\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n) where J_n = I_n^{-1}(\theta) is the inverse Fisher Information matrix.

  2. For parameter \theta_j, the standard error is: \widehat{\text{se}}_j = \sqrt{J_n(j,j)} (the square root of the j-th diagonal element of J_n)

  3. The covariance between \hat{\theta}_j and \hat{\theta}_k is J_n(j,k).

Example: Multiparameter Normal Distribution

For X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) with both parameters unknown:

The Fisher Information matrix is: I_n(\mu, \sigma) = n \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix}

Therefore: J_n = \frac{1}{n} \begin{pmatrix} \sigma^2 & 0 \\ 0 & \frac{\sigma^2}{2} \end{pmatrix}

This tells us:

  • \widehat{\text{se}}(\hat{\mu}) = \sigma/\sqrt{n} (familiar formula!)
  • \widehat{\text{se}}(\hat{\sigma}) = \sigma/\sqrt{2n}
  • \text{Cov}(\hat{\mu}, \hat{\sigma}) = 0 (they’re independent!)

The zero off-diagonal terms mean that uncertainty about \mu doesn’t affect our estimate of \sigma, and vice versa. This orthogonality is special to the normal distribution.
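
When the Fisher Information matrix is hard to derive by hand, a common shortcut is the observed information: the Hessian of the negative log-likelihood evaluated at the MLE, which we can compute with the same JAX tooling used in the warm-up example. Here is a minimal sketch (our own construction) for the Normal model, checked against the analytic standard errors above:

import numpy as np
import jax
import jax.numpy as jnp
import jax.scipy.stats as jstats

rng = np.random.default_rng(0)
x = jnp.array(rng.normal(5.0, 2.0, size=500))
n = x.shape[0]

def nll(theta):
    mu, sigma = theta
    return -jnp.sum(jstats.norm.logpdf(x, loc=mu, scale=sigma))

# Closed-form MLE for the Normal model
mu_hat = jnp.mean(x)
sigma_hat = jnp.sqrt(jnp.mean((x - mu_hat) ** 2))
theta_hat = jnp.array([mu_hat, sigma_hat])

# Observed information = Hessian of the NLL at the MLE; its inverse estimates J_n
obs_info = jax.hessian(nll)(theta_hat)
J_n = jnp.linalg.inv(obs_info)
se_numeric = jnp.sqrt(jnp.diag(J_n))

print(f"se(mu_hat):    numeric {float(se_numeric[0]):.4f}, analytic {float(sigma_hat) / np.sqrt(n):.4f}")
print(f"se(sigma_hat): numeric {float(se_numeric[1]):.4f}, analytic {float(sigma_hat) / np.sqrt(2 * n):.4f}")

At the MLE of a well-specified model the observed and expected information typically agree closely, which is why this numerical shortcut is so widely used in practice.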

For a function \tau = g(\theta_1, \ldots, \theta_k), the Delta Method generalizes to:

\widehat{\text{se}}(\hat{\tau}) = \sqrt{(\widehat{\nabla} g)^T \widehat{J}_n (\widehat{\nabla} g)}

where \widehat{\nabla} g is the gradient of g evaluated at \hat{\theta}.

Example: For the Normal distribution, if we’re interested in the coefficient of variation \tau = \sigma/\mu:

\nabla g = \begin{pmatrix} -\sigma/\mu^2 \\ 1/\mu \end{pmatrix}

The standard error involves both parameter uncertainties and their covariance (though the covariance is zero for the normal case).
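
Here is a brief sketch (our own construction) of the multiparameter Delta Method for the coefficient of variation, with the delta-method standard error checked against the empirical spread of the estimator over many replications:

import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 5.0, 2.0, 400

# Delta-method SE for tau = sigma/mu under the Normal model, using
# grad g = (-sigma/mu^2, 1/mu) and J_n = diag(sigma^2/n, sigma^2/(2n))
def cv_se(mu, sigma, n):
    grad = np.array([-sigma / mu**2, 1.0 / mu])
    J_n = np.diag([sigma**2 / n, sigma**2 / (2 * n)])
    return np.sqrt(grad @ J_n @ grad)

x = rng.normal(mu_true, sigma_true, size=n)
mu_hat, sigma_hat = x.mean(), x.std()  # MLEs (ddof=0)
print(f"tau_hat = {sigma_hat / mu_hat:.4f}, delta-method se = {cv_se(mu_hat, sigma_hat, n):.4f}")

# Check against the spread of the estimator over many replications
sims = rng.normal(mu_true, sigma_true, size=(2000, n))
taus = sims.std(axis=1) / sims.mean(axis=1)
print(f"empirical SD over 2000 replications = {taus.std():.4f}")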

6.6.3 Sufficient Statistics

A statistic is a function T(X^n) of the data. A sufficient statistic is a statistic that contains all the information in the data about the parameter.

A statistic T(X^n) is sufficient for parameter \theta if the conditional distribution of the data given T doesn’t depend on \theta.

Let X_1, \ldots, X_n \sim \text{Bernoulli}(p). Then \mathcal{L}(p) = p^S(1-p)^{n-S} where S = \sum_{i=1}^n X_i.

The likelihood depends on the data only through S, so S is sufficient. This means that once we know the total number of successes, the individual outcomes provide no additional information about p.

A sufficient statistic is minimal if it provides the most compressed summary possible while still retaining all information about the parameter. More precisely: any other sufficient statistic can be “reduced” to the minimal one, but not vice versa.

Connection Between MLE and Sufficiency

When a non-trivial sufficient statistic exists, the MLE depends on the data only through that statistic. Examples:

  • Bernoulli: MLE \hat{p} = S/n depends only on sufficient statistic S = \sum X_i
  • Normal: MLE (\hat{\mu}, \hat{\sigma}^2) depends only on sufficient statistics (\bar{X}, S^2)
  • Uniform(0,\theta): MLE is exactly the sufficient statistic \max\{X_i\}

This provides theoretical justification for data reduction: when sufficient statistics exist, we can compress our data without losing any information about the parameter. However, not all models have nice sufficient statistics beyond the trivial one (the entire dataset).
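
A tiny illustration (our own): two different Bernoulli datasets with the same number of successes produce identical likelihood functions, so they lead to identical inferences about p:

import numpy as np

# Two different Bernoulli datasets with the same sufficient statistic S = 3
x1 = np.array([1, 1, 1, 0, 0, 0, 0, 0])
x2 = np.array([0, 0, 0, 0, 0, 1, 1, 1])

p_grid = np.linspace(0.01, 0.99, 99)

def log_lik(x, p):
    return x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p)

# Identical likelihood curves: the data enter only through S
print(np.allclose(log_lik(x1, p_grid), log_lik(x2, p_grid)))  # True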

6.7 Connection to Machine Learning: Cross-Entropy as MLE

One of the most profound connections between statistics and modern machine learning is that many “loss functions” used in ML are actually negative log-likelihoods in disguise. This isn’t a coincidence – it reflects the deep statistical foundations of machine learning. Let’s consider the case of the cross-entropy loss used in classification tasks.

Consider a classification problem where we predict probabilities for K classes using a machine learning model f(X; \theta) parameterized by \theta.

The cross-entropy loss is:

H(p, q) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^K p(Y_i = j) \log q_\theta(Y_i = j)

where:

  • p(Y_i = j) is the observed distribution (1 if observation i is class j, 0 otherwise)
  • q_\theta(Y_i = j) = f(X_i; \theta)_j is the predicted probability from our model with parameters \theta

Let’s show this is exactly the negative log-likelihood for a categorical distribution.

The likelihood for categorical data is: \mathcal{L}_n(\theta) = \prod_{i=1}^n \text{Categorical}(Y_i; f(X_i; \theta))

The categorical probability for observation i is: \text{Categorical}(Y_i; f(X_i; \theta)) = \prod_{j=1}^K f(X_i; \theta)_j^{p(Y_i = j)}

Taking the log: \ell_n(\theta) = \sum_{i=1}^n \sum_{j=1}^K p(Y_i = j) \log f(X_i; \theta)_j

This is exactly -n \cdot H(p, q). Thus: Minimizing cross-entropy = Maximizing likelihood!
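
We can confirm the equivalence numerically. The sketch below (our own, with random “predicted” probabilities standing in for a model’s output) computes the cross-entropy with one-hot labels and the average categorical negative log-likelihood, and checks that they agree:

import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 3
labels = rng.integers(0, K, size=n)  # observed classes
logits = rng.normal(size=(n, K))
q = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # model probabilities

# One-hot "observed distribution" p
p = np.eye(K)[labels]

cross_entropy = -(p * np.log(q)).sum(axis=1).mean()
neg_log_lik = -np.log(q[np.arange(n), labels]).mean()  # categorical NLL / n
print(np.isclose(cross_entropy, neg_log_lik))  # True: same quantity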

Other Common ML Loss Functions as MLEs

This pattern appears throughout machine learning:

  • Mean Squared Error (MSE): Y \sim \mathcal{N}(\mu(x), \sigma^2) – Gaussian likelihood
  • Mean Absolute Error (MAE): Y \sim \text{Laplace}(\mu(x), b) – Laplace likelihood
  • Huber Loss: hybrid of L_1 and L_2 – robust to outliers

Understanding that common ML losses are negative log-likelihoods provides:

  1. Principled loss design: When facing a new problem, you can derive an appropriate loss function by specifying a probabilistic model for your data.
  2. Historical context: It explains why these particular loss functions became standard – they weren’t arbitrary choices but emerged from statistical principles.
  3. Intuition: Knowing the probabilistic interpretation helps understand when each loss is appropriate (e.g., MAE for heavy-tailed errors, MSE for Gaussian noise).
Example: Deriving a Custom Loss

Suppose you believe errors in your regression problem approximately follow a Student-t distribution (heavy tails for robustness). The negative log-likelihood gives you the loss function:

\text{Loss}(y, \hat{y}) = \frac{\nu + 1}{2} \log\left(1 + \frac{(y - \hat{y})^2}{\nu s^2}\right)

where \nu is degrees of freedom and s is scale. This naturally down-weights outliers!
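
As a rough sketch (our own, with made-up data; \nu and s are held fixed rather than estimated), here is this loss used to fit a location parameter, compared against squared error:

import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 100), [25.0, 30.0]])  # two gross outliers

nu, s = 3.0, 1.0  # assumed fixed here; in practice these can be estimated too
t_loss = lambda m: np.sum((nu + 1) / 2 * np.log1p((y - m)**2 / (nu * s**2)))
mse_loss = lambda m: np.sum((y - m)**2)

m_t = optimize.minimize_scalar(t_loss).x
m_mse = optimize.minimize_scalar(mse_loss).x  # this is just the sample mean
print(f"Student-t location estimate: {m_t:+.3f} (robust to the outliers)")
print(f"Squared-error estimate:      {m_mse:+.3f} (pulled toward the outliers)")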

6.8 MLE for Latent Variable Models: The EM Algorithm

6.8.1 The Challenge of Latent Variables

So far, we’ve assumed that maximizing the likelihood is straightforward – we take derivatives, set them to zero, and solve (possibly numerically). But what if the likelihood itself is intractable?

This often happens when our model involves latent (unobserved) variables – hidden quantities that would make the problem easy if only we could observe them.

Common examples with latent variables:

  • Mixture models: Which component generated each observation?
  • Hidden Markov models: What’s the hidden state sequence?
  • Factor analysis: What are the values of the latent factors?
  • Missing data: What would the missing values have been?

The canonical example is the Gaussian Mixture Model (GMM):

f(y; \theta) = \sum_{k=1}^K \pi_k \mathcal{N}(y; \mu_k, \sigma_k^2)

where \theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K).

The likelihood involves a sum inside the logarithm: \ell_n(\theta) = \sum_{i=1}^n \log\left[\sum_{k=1}^K \pi_k \mathcal{N}(y_i; \mu_k, \sigma_k^2)\right]

This is a nightmare to differentiate! The derivatives involve ratios of sums – messy and hard to work with.

The key insight: If we knew which component Z_i \in \{1, \ldots, K\} generated each observation Y_i, the problem would become trivial – we’d just fit separate Gaussians to each group.

For the special case of K=2 (two components), the mixture model simplifies to:

f(y; \theta) = \pi \mathcal{N}(y; \mu_1, \sigma_1^2) + (1-\pi) \mathcal{N}(y; \mu_2, \sigma_2^2)

where:

  • \pi \in [0,1] is the mixing weight (probability of component 1)
  • (1-\pi) is the probability of component 2
  • \mu_1, \mu_2 are the means of the two Gaussian components
  • \sigma_1^2, \sigma_2^2 are the variances of the two components
  • The parameter vector is \theta = (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2)

Interpretation: Each observation Y_i is generated by:

  1. First, flip a biased coin with probability \pi of heads
  2. If heads: draw Y_i \sim \mathcal{N}(\mu_1, \sigma_1^2)
  3. If tails: draw Y_i \sim \mathcal{N}(\mu_2, \sigma_2^2)

The latent variables Z_1, \ldots, Z_n \in \{1, 2\} indicate which component generated each observation. If we knew these Z_i values, estimation would be trivial – we’d simply:

  • Estimate \pi as the proportion of observations from component 1
  • Estimate \mu_1, \sigma_1 using only observations where Z_i = 1
  • Estimate \mu_2, \sigma_2 using only observations where Z_i = 2

But since the Z_i are unobserved, we need a method to estimate both the component assignments and the parameters.

6.8.2 The Expectation-Maximization (EM) Algorithm

The EM algorithm is a clever iterative procedure that alternates between:

  1. Guessing the values of the latent variables (E-step)
  2. Updating parameters as if those guesses were correct (M-step)

Think of the EM algorithm as a “chicken and egg” problem solver for mixture models.

The dilemma:

  • If we knew which cluster each point belonged to, we could easily estimate the cluster parameters (just compute means and variances for each group)
  • If we knew the cluster parameters, we could easily assign points to clusters (pick the most likely cluster for each point)

The EM solution: Start with a guess and alternate!

  1. E-step (Expectation): Given current cluster parameters, compute “soft assignments” – the probability that each point belongs to each cluster. These are called “responsibilities.”
    • Point near cluster 1 center: Maybe 90% probability for cluster 1, 10% for cluster 2
    • Point between clusters: Maybe 50-50 split
  2. M-step (Maximization): Update cluster parameters using weighted calculations, where weights are the responsibilities.
    • New mean for cluster 1: Weighted average of all points, using their cluster 1 responsibilities as weights
    • Points with high responsibility contribute more to the cluster’s parameters
  3. Iterate: Keep alternating until convergence.

It’s like a dance where clusters and assignments gradually find their proper configuration, each step making the other more accurate.

The EM algorithm maximizes the expected complete-data log-likelihood. Let:

  • Y = observed data
  • Z = latent/missing data
  • (Y, Z) = complete data

The algorithm iterates between:

E-step: Compute the expected complete-data log-likelihood: Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid Y, \theta^{(t)}}[\log P(Y, Z \mid \theta)]

This expectation is over the distribution of Z given the observed data Y and current parameters \theta^{(t)}.

M-step: Maximize with respect to \theta: \theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})

Key property: The likelihood is guaranteed to increase (or stay the same) at each iteration: \mathcal{L}(\theta^{(t+1)}) \geq \mathcal{L}(\theta^{(t)})

This monotonic improvement ensures convergence to a local maximum.

Let’s implement and visualize the EM algorithm for a 2D Gaussian mixture:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from scipy.stats import multivariate_normal

# Generate synthetic data from a mixture of 3 Gaussians
np.random.seed(42)
n_samples = 300
n_components = 3

# True parameters
true_means = np.array([[-2, -2], [0, 2], [3, -1]])
true_covs = [np.array([[0.5, 0.2], [0.2, 0.5]]),
             np.array([[0.8, -0.3], [-0.3, 0.8]]),
             np.array([[0.6, 0], [0, 0.3]])]
true_weights = np.array([0.3, 0.5, 0.2])

# Generate data
data = []
true_labels = []
for i in range(n_samples):
    # Choose component
    k = np.random.choice(n_components, p=true_weights)
    true_labels.append(k)
    # Generate point from that component
    data.append(np.random.multivariate_normal(true_means[k], true_covs[k]))
data = np.array(data)
true_labels = np.array(true_labels)

# Initialize EM with random parameters
means = np.random.randn(n_components, 2) * 2
covs = [np.eye(2) for _ in range(n_components)]
weights = np.ones(n_components) / n_components

# Function to plot current state
def plot_em_state(data, means, covs, weights, responsibilities, iteration, log_likelihood_history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3.5))
    
    # Left plot: data colored by responsibilities
    colors = responsibilities @ np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    ax1.scatter(data[:, 0], data[:, 1], c=colors, alpha=0.6, s=30)
    
    # Plot component centers and ellipses
    for k in range(n_components):
        ax1.plot(means[k, 0], means[k, 1], 'k*', markersize=15)
        
        # Draw ellipse representing covariance
        eigenvalues, eigenvectors = np.linalg.eigh(covs[k])
        angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))
        width, height = 2 * np.sqrt(eigenvalues)
        ellipse = Ellipse(means[k], width, height, angle=angle,
                         facecolor='none', edgecolor='black', linewidth=2)
        ax1.add_patch(ellipse)
    
    ax1.set_title(f'Iteration {iteration}: Soft Assignments')
    ax1.set_xlim(-5, 5)
    ax1.set_ylim(-5, 5)
    ax1.grid(True, alpha=0.3)
    
    # Right plot: log-likelihood history
    ax2.plot(range(len(log_likelihood_history)), log_likelihood_history, 'b-', linewidth=2)
    ax2.plot(iteration, log_likelihood_history[iteration], 'ro', markersize=8)  # Mark current iteration
    ax2.set_xlabel('Iteration')
    ax2.set_ylabel('Log-likelihood')
    ax2.set_title('Convergence')
    ax2.set_xlim(-0.5, 9.5)  # Always show full range
    ax2.set_xticks(range(10))  # Show only integer ticks 0-9
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    return fig

# EM iterations
n_iterations = 10
log_likelihood_history = []

for iteration in range(n_iterations):
    # E-step: compute responsibilities
    responsibilities = np.zeros((n_samples, n_components))
    for k in range(n_components):
        responsibilities[:, k] = weights[k] * multivariate_normal.pdf(
            data, means[k], covs[k])
    responsibilities /= responsibilities.sum(axis=1, keepdims=True)
    
    # Compute log-likelihood
    log_likelihood = np.sum(np.log(np.sum([
        weights[k] * multivariate_normal.pdf(data, means[k], covs[k])
        for k in range(n_components)], axis=0)))
    log_likelihood_history.append(log_likelihood)
    
    # Plot current state (only show a few iterations)
    if iteration in [0, 1, 3, 9]:
        fig = plot_em_state(data, means, covs, weights, responsibilities, iteration, log_likelihood_history)
        plt.show()
    
    # M-step: update parameters
    for k in range(n_components):
        resp_k = responsibilities[:, k]
        # Update weight
        weights[k] = resp_k.sum() / n_samples
        # Update mean
        means[k] = (resp_k[:, np.newaxis] * data).sum(axis=0) / resp_k.sum()
        # Update covariance
        diff = data - means[k]
        covs[k] = (resp_k[:, np.newaxis, np.newaxis] * 
                   diff[:, :, np.newaxis] @ diff[:, np.newaxis, :]).sum(axis=0) / resp_k.sum()

print("Final parameters:")
print(f"Weights: {weights}")
print(f"Means:\n{means}")

Final parameters:
Weights: [0.45669616 0.211968   0.33133584]
Means:
[[ 0.19903644  1.95778334]
 [ 3.02378239 -0.94719336]
 [-2.06041818 -1.92642337]]

The visualizations show:

  • Left plots: Data points colored by their soft assignments (RGB = probabilities for 3 clusters)
  • Black stars: Cluster centers
  • Ellipses: Covariance structure of each component
  • Right plot: Log-likelihood increasing monotonically

Notice how the algorithm gradually discovers the true cluster structure!

6.8.3 Properties and Practical Considerations

Strengths of EM:

  1. Guaranteed improvement: The likelihood never decreases
  2. No step size tuning: Unlike gradient descent, no learning rate to choose
  3. Naturally handles constraints: Probabilities automatically sum to 1
  4. Interpretable intermediate results: Soft assignments have meaning
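
The first strength is easy to verify empirically. A minimal check (a sketch reusing the log_likelihood_history list from the run above): consecutive log-likelihood values should be non-decreasing, up to floating-point noise.

import numpy as np

# EM's monotonicity guarantee: no iteration may decrease the likelihood
diffs = np.diff(log_likelihood_history)
assert np.all(diffs >= -1e-8), "Log-likelihood decreased - check the E/M steps!"
print(f"Smallest per-iteration improvement: {diffs.min():.6f}")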

Limitations and solutions:

  1. Local optima: EM only finds a local maximum
    • Solution: Run from multiple random initializations
    • Smart initialization: Use k-means++ or similar
  2. Slow convergence: Can take many iterations near the optimum
    • Solution: Switch to Newton’s method near convergence
    • Early stopping: Monitor log-likelihood changes
  3. Choosing number of components: How many clusters?
    • Solution: Use information criteria (AIC, BIC) – see the sketch after this list
    • Cross-validation: Evaluate on held-out data
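
Following up on the list above, a minimal sketch using scikit-learn's GaussianMixture (the variable data is assumed to hold the observations, as in the example above): n_init=10 restarts EM from multiple initializations and keeps the best run, while BIC selects the number of components.

from sklearn.mixture import GaussianMixture

# Fit candidate models with 1-5 components, each from 10 EM restarts
candidates = [GaussianMixture(n_components=k, n_init=10, random_state=0).fit(data)
              for k in range(1, 6)]

# BIC trades goodness of fit against model complexity; lower is better
best = min(candidates, key=lambda g: g.bic(data))
print(f"BIC-selected number of components: {best.n_components}")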

The Initialization Trap

The EM algorithm is extremely sensitive to initialization. Poor starting values can lead to:

  • Convergence to inferior local optima
  • One component “eating” all the data
  • Empty components (singularities)

Always run EM multiple times with different initializations and choose the result with the highest likelihood!

6.9 Chapter Summary and Connections

6.9.1 Key Concepts Review

We’ve explored the remarkable properties that make the Maximum Likelihood Estimator the “gold standard” of parametric estimation:

Core Properties of the MLE:

  • Consistency: \hat{\theta}_n \xrightarrow{P} \theta_* – the MLE converges to the truth as n \to \infty (demonstrated in the sketch after this list).
  • Equivariance: \widehat{g(\theta)} = g(\hat{\theta}_n) – reparameterization doesn’t affect the MLE.
  • Asymptotic Normality: \hat{\theta}_n \approx \mathcal{N}(\theta_*, 1/I_n(\theta_*)) – enabling confidence intervals.
  • Asymptotic Efficiency: Achieves the Cramér-Rao lower bound – optimal variance.
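
To make the first property concrete, here is a minimal simulation sketch (the Exponential model and the true rate \lambda = 2 are illustrative choices): the MLE \hat{\lambda}_n = 1/\bar{X}_n drifts toward the truth as n grows.

import numpy as np

rng = np.random.default_rng(7)
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=0.5, size=n)  # scale = 1/lambda, true lambda = 2
    print(f"n = {n:>6}: lambda_hat = {1 / x.mean():.4f}")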

Fisher Information:

  • Measures the “information” about a parameter contained in data.
  • Equals the expected curvature of the log-likelihood: I(\theta) = -\mathbb{E}[\frac{\partial^2 \log f(X;\theta)}{\partial \theta^2}] (checked numerically after this list).
  • Determines the precision of the MLE: \text{Var}(\hat{\theta}_n) \approx 1/(nI(\theta)).
  • Sharp likelihood peak → High information → Precise estimates.
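
Both standard characterizations of Fisher Information – the variance of the score and the expected negative curvature – can be checked by simulation. A sketch for Exponential(\lambda), where the score is \partial_\lambda \log f(X;\lambda) = 1/\lambda - X and I(\lambda) = 1/\lambda^2:

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.exponential(scale=1/lam, size=200_000)

# Score = d/d(lam) log f(x; lam) = 1/lam - x; its variance estimates I(lam)
score = 1/lam - x
print(f"Var(score) ~ {score.var():.4f}")  # Monte Carlo estimate
print(f"1/lam^2    = {1/lam**2:.4f}")     # theoretical value: 0.25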

Deep Connections:

  • Cross-entropy loss in ML = Negative log-likelihood (verified in the snippet after this list).
  • Many ML algorithms are secretly performing MLE.
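
This identity is easy to see in code. A minimal sketch for binary classification (the label and probability arrays are made-up illustrative values): the summed binary cross-entropy is exactly the negative Bernoulli log-likelihood.

import numpy as np

y = np.array([1, 0, 1, 1])           # observed labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # model's predicted probabilities

# Negative Bernoulli log-likelihood = binary cross-entropy (summed form)
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"Cross-entropy / NLL: {nll:.4f}")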

Practical Tools:

  • Confidence Intervals: \hat{\theta}_n \pm z_{\alpha/2} \cdot \widehat{\text{se}} where \widehat{\text{se}} = 1/\sqrt{I_n(\hat{\theta}_n)} (see the worked sketch after this list).
  • Delta Method: For transformed parameters \tau = g(\theta): \widehat{\text{se}}(\hat{\tau}) \approx |g'(\hat{\theta})| \cdot \widehat{\text{se}}(\hat{\theta}).
  • EM Algorithm: Iterative method for MLEs with latent variables – alternates between E-step (soft assignments) and M-step (parameter updates).
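
Putting the first two tools together, a worked sketch for an Exponential rate (sample size and true rate are illustrative choices): since I_n(\lambda) = n/\lambda^2, the plug-in standard error is \hat{\lambda}/\sqrt{n}, and the Delta Method then propagates it to the mean waiting time \tau = 1/\lambda.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=0.5, size=200)  # true rate lambda = 2

lam_hat = 1 / x.mean()                    # MLE
se_lam = lam_hat / np.sqrt(len(x))        # 1 / sqrt(I_n(lam_hat))
z = stats.norm.ppf(0.975)
print(f"lambda: {lam_hat:.3f} +/- {z * se_lam:.3f}")

# Delta Method for tau = g(lambda) = 1/lambda, with g'(lambda) = -1/lambda^2
tau_hat = 1 / lam_hat
se_tau = abs(-1 / lam_hat**2) * se_lam
print(f"tau:    {tau_hat:.3f} +/- {z * se_tau:.3f}")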

6.9.2 The Big Picture

We have moved from simply finding estimators (Chapter 5) to evaluating their quality. This chapter revealed why the MLE is so widely used:

  1. It works: Consistency ensures we get the right answer with enough data.
  2. It’s optimal: No other estimator has smaller asymptotic variance.
  3. It’s practical: To a degree, we can quantify uncertainty through Fisher Information.⁴
  4. It’s flexible: Equivariance means we can work in convenient parameterizations.
  5. It extends: The EM algorithm handles complex models with latent structure.

6.9.3 Common Pitfalls to Avoid

  1. Misinterpreting Confidence Intervals: A 95% CI does NOT mean:
    • “95% probability the parameter is in this interval” (that’s Bayesian!).
    • “95% of the data falls in this interval” (that’s a prediction interval).
    • Correct: “95% of such intervals constructed from repeated samples would contain the true parameter” (see the coverage simulation after this list).
  2. Assuming Asymptotic Results for Small Samples:
    • Asymptotic normality may not hold for small n.
    • Fisher Information formulas are approximate for finite samples.
    • Bootstrap or exact methods may be preferable when n < 30.⁵
  3. Getting Stuck in Local Optima with EM:
    • EM only guarantees local, not global, maximum.
    • Always run from multiple starting points.
    • Monitor log-likelihood to ensure proper convergence.
  4. Forgetting About Model Assumptions:
    • MLE theory assumes the model is correctly specified (i.e., the true distribution belongs to our model family).
    • We briefly mentioned that parameters must be identifiable (Section on Consistency) – different parameter values must give different distributions.
    • The asymptotic results (normality, efficiency) require “regularity conditions” that we haven’t detailed but include smoothness of the likelihood.
  5. Over-interpreting Asymptotic Results:
    • All our results (asymptotic normality, efficiency) are for large samples.
    • With small samples, the MLE might not be normally distributed and confidence intervals may be inaccurate.
    • The MLE’s optimality is theoretical – in practice, we might prefer more robust methods when data contains outliers or model assumptions are questionable.
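
To underline the first pitfall, a minimal coverage simulation sketch (the true p, sample size, and repetition count are arbitrary illustrative choices): roughly 95% of Wald intervals constructed from repeated samples cover the true parameter, which is exactly what the confidence level promises.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_true, n, reps = 0.3, 200, 5_000
z = stats.norm.ppf(0.975)

covered = 0
for _ in range(reps):
    p_hat = rng.binomial(n, p_true) / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)  # 1/sqrt(I_n(p_hat)) for Bernoulli
    covered += (p_hat - z * se) <= p_true <= (p_hat + z * se)
print(f"Empirical coverage: {covered / reps:.3f}")  # close to 0.95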

6.9.4 Chapter Connections

Building on Previous Chapters:

  • Chapter 3 (Statistical Inference): Provided the framework for evaluating estimators. Now we have seen that MLEs have optimal properties within this framework.
  • Chapter 4 (Bootstrap): Offers a computational alternative to the analytical standard errors we derived here. When Fisher Information is hard to compute, bootstrap it!
  • Chapter 5 (Finding Estimators): Showed how to find MLEs. This chapter justifies why we bother – they have provably good properties.

Looking Ahead:

  • Bayesian Inference: While MLE yields point estimates and confidence intervals, Bayesian methods provide full probability distributions over parameters. The MLE coincides with the posterior mode under a uniform prior.
  • Regression Models: The theory in this chapter extends directly – least squares is MLE for normal errors, logistic regression is MLE for binary outcomes.
  • Model Selection: Information criteria (AIC, BIC) build on the likelihood framework we’ve developed.
  • Robust Statistics: When MLE assumptions fail, we need methods that sacrifice some efficiency for robustness.

The properties we’ve studied – consistency, asymptotic normality, efficiency – will reappear throughout statistics. Whether you’re fitting neural networks or analyzing clinical trials, you’re likely using some form of MLE, and the theory in this chapter explains why it works.

6.9.5 Self-Test Problems

  1. Fisher Information: For X_1, \ldots, X_n \sim \text{Exponential}(\lambda) with PDF f(x; \lambda) = \lambda e^{-\lambda x} for x > 0:

    • Find the Fisher Information I(\lambda) for a single observation.
    • What is the approximate standard error of the MLE \hat{\lambda}_n?
  2. Confidence Intervals: You flip a coin 100 times and observe 58 heads. Use Fisher Information to construct an approximate 95% confidence interval for the probability of heads.

  3. Delta Method: If \hat{p} = 0.4 with standard error 0.05, find the standard error of \hat{\tau} = \log(p/(1-p)) using the Delta Method.

  4. EM Intuition: In a two-component Gaussian mixture, the E-step computes “responsibilities” for each data point. What do these represent? Why can’t we just assign each point to its most likely cluster?

  5. MLE Properties: True or False (with brief explanation):

    • If \hat{\theta}_n is the MLE of \theta, then \hat{\theta}_n^2 is the MLE of \theta^2.
    • The MLE is always unbiased.
    • Fisher Information measures how “peaked” the likelihood function is.

6.9.6 Python and R Reference

#| eval: false
import numpy as np
import scipy.stats as stats
import scipy.optimize as optimize
from scipy.stats import multivariate_normal
import statsmodels.api as sm
import sympy as sp
from sklearn.mixture import GaussianMixture
import jax
import jax.numpy as jnp
import jax.scipy.stats as jstats

# Fisher Information calculation (symbolic)
def fisher_info_symbolic():
    """Example: Fisher Information for Exponential distribution"""
    # Define symbols
    x, lam = sp.symbols('x lambda', positive=True)
    
    # Define log pdf
    log_pdf = sp.log(lam) - lam * x
    
    # Score function (first derivative)
    score = sp.diff(log_pdf, lam)
    
    # Second derivative
    second_deriv = sp.diff(score, lam)
    
    # Fisher Information: I(lambda) = -E[d^2 log f / d lambda^2].
    # Substituting x -> E[X] = 1/lambda is harmless here because the
    # second derivative (-1/lambda^2) does not depend on x at all
    fisher_info = -second_deriv.subs(x, 1/lam)
    return sp.simplify(fisher_info)

# Fisher Information using JAX (automatic differentiation)
def fisher_info_jax(log_pdf, theta, samples):
    """Estimate Fisher Information as the sample average of the negative
    Hessian of the log-density (the observed information per observation).

    log_pdf(theta, x) must return the log-density of one observation;
    samples should be drawn from (or approximate) f(.; theta).
    """
    # Hessian with respect to theta (the first argument)
    hess = jax.hessian(log_pdf, argnums=0)

    # Monte Carlo estimate of the expected negative curvature
    return -jnp.mean(jnp.stack([hess(theta, x) for x in samples]))

# Confidence intervals with Fisher Information
def mle_confidence_interval(mle, fisher_info_n, alpha=0.05):
    """Construct CI using Fisher Information"""
    se = 1 / np.sqrt(fisher_info_n)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lower = mle - z_crit * se
    ci_upper = mle + z_crit * se
    return ci_lower, ci_upper

# Delta Method implementation
def delta_method_se(mle, se_mle, g_derivative):
    """
    Apply Delta Method to find SE of transformed parameter
    
    Parameters:
    - mle: MLE of original parameter
    - se_mle: Standard error of MLE
    - g_derivative: Derivative of transformation g evaluated at MLE
    """
    return np.abs(g_derivative) * se_mle

# Example: Poisson rate to mean waiting time
lam_hat = 3.0  # MLE for rate
se_lam = 0.3   # Standard error
# For tau = 1/lambda, g'(lambda) = -1/lambda^2
g_prime = -1 / lam_hat**2
se_tau = delta_method_se(lam_hat, se_lam, g_prime)

# EM Algorithm using sklearn
def fit_gaussian_mixture(data, n_components=2, n_init=10):
    """Fit Gaussian Mixture Model using EM"""
    gmm = GaussianMixture(
        n_components=n_components,
        n_init=n_init,  # Number of initializations
        init_params='k-means++',  # Smart initialization
        max_iter=100,
        tol=1e-3,
        verbose=0
    )
    gmm.fit(data)
    
    return {
        'means': gmm.means_,
        'covariances': gmm.covariances_,
        'weights': gmm.weights_,
        'log_likelihood': gmm.score(data) * len(data),
        'converged': gmm.converged_
    }

# Advanced models with confidence intervals
def fit_model_with_cis(model, data):
    """Example using statsmodels for automatic CIs"""
    # Example: Poisson regression
    results = model.fit()
    
    # Extract MLEs and standard errors
    params = results.params
    se = results.bse
    conf_int = results.conf_int(alpha=0.05)
    
    # Parameter covariance matrix (the inverse of the Fisher Information)
    fisher_info = results.cov_params()
    
    return {
        'mle': params,
        'se': se,
        'confidence_intervals': conf_int,
        'fisher_info_inverse': fisher_info
    }

# Manual EM implementation for mixture of normals
class GaussianMixtureEM:
    def __init__(self, n_components=2):
        self.n_components = n_components
        
    def e_step(self, X, means, covs, weights):
        """Compute responsibilities (soft assignments)"""
        n_samples = X.shape[0]
        resp = np.zeros((n_samples, self.n_components))
        
        for k in range(self.n_components):
            resp[:, k] = weights[k] * multivariate_normal.pdf(
                X, means[k], covs[k]
            )
        
        # Normalize
        resp /= resp.sum(axis=1, keepdims=True)
        return resp
    
    def m_step(self, X, resp):
        """Update parameters given responsibilities"""
        n_samples = X.shape[0]
        
        # Update weights
        weights = resp.sum(axis=0) / n_samples
        
        # Update means
        means = []
        covs = []
        for k in range(self.n_components):
            resp_k = resp[:, k]
            mean_k = (resp_k[:, np.newaxis] * X).sum(axis=0) / resp_k.sum()
            means.append(mean_k)
            
            # Update covariances
            diff = X - mean_k
            cov_k = (resp_k[:, np.newaxis, np.newaxis] * 
                    diff[:, :, np.newaxis] @ diff[:, np.newaxis, :]).sum(axis=0)
            cov_k /= resp_k.sum()
            covs.append(cov_k)
            
        return np.array(means), covs, weights
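    # A minimal fit driver (a sketch, not part of the class above):
    # random initialization, then alternating E- and M-steps. Production
    # code should monitor the log-likelihood and stop on convergence.
    def fit(self, X, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Means at random data points, shared covariance, equal weights
        means = X[rng.choice(X.shape[0], self.n_components, replace=False)]
        covs = [np.cov(X.T) for _ in range(self.n_components)]
        weights = np.full(self.n_components, 1 / self.n_components)
        for _ in range(n_iter):
            resp = self.e_step(X, means, covs, weights)
            means, covs, weights = self.m_step(X, resp)
        return means, covs, weights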

#| eval: false
library(MASS)      # For fitdistr
library(stats4)    # For mle function
library(numDeriv)  # For numerical derivatives
library(mclust)    # For Gaussian mixture models
library(msm)       # For deltamethod function

# Fisher Information calculation
fisher_info_exponential <- function(lambda) {
  # For Exponential(lambda): I(lambda) = 1/lambda^2
  return(1 / lambda^2)
}

# Fisher Information via numerical methods
fisher_info_numerical <- function(loglik_fn, theta, ...) {
  # Negative Hessian of the log-likelihood at theta (the observed information)
  hess <- hessian(loglik_fn, theta, ...)
  return(-hess)
}

# MLE with standard errors using fitdistr
fit_with_se <- function(data, distribution) {
  fit <- fitdistr(data, distribution)
  
  # Extract estimates and standard errors
  mle <- fit$estimate
  se <- fit$sd
  
  # Construct confidence intervals
  ci_lower <- mle - 1.96 * se
  ci_upper <- mle + 1.96 * se
  
  list(
    mle = mle,
    se = se,
    ci = cbind(lower = ci_lower, upper = ci_upper),
    loglik = fit$loglik
  )
}

# Delta Method implementation
apply_delta_method <- function(mle, var_mle, g_expr, param_name) {
  # Using msm::deltamethod
  # g_expr: expression for g(theta) as a string
  # param_name: name of parameter in expression
  
  se_transformed <- deltamethod(
    as.formula(paste("~", g_expr)), 
    mle, 
    var_mle
  )
  
  return(se_transformed)
}

# Example: Delta method for Poisson
lambda_hat <- 3.0
se_lambda <- 0.3
var_lambda <- se_lambda^2

# For tau = 1/lambda
se_tau <- deltamethod(~ 1/x1, lambda_hat, var_lambda)

# Manual Delta Method
delta_method_manual <- function(mle, se_mle, g_deriv) {
  # g_deriv: derivative of g evaluated at MLE
  se_transformed <- abs(g_deriv) * se_mle
  return(se_transformed)
}

# EM Algorithm using mclust
fit_gaussian_mixture <- function(data, G = 2) {
  # Fit Gaussian mixture model
  fit <- Mclust(data, G = G)
  
  list(
    means = fit$parameters$mean,
    covariances = fit$parameters$variance$sigma,
    weights = fit$parameters$pro,
    log_likelihood = fit$loglik,
    classification = fit$classification,
    uncertainty = fit$uncertainty
  )
}

# Manual implementation of MLE with custom likelihood
mle_custom <- function(data, start, nll_function) {
  # Negative log-likelihood function
  nll <- function(params) {
    nll_function(params, data)
  }
  
  # Optimize
  fit <- optim(
    par = start,
    fn = nll,
    method = "L-BFGS-B",
    hessian = TRUE  # Get Hessian for Fisher Information
  )
  
  # Fisher Information is the Hessian at MLE
  fisher_info <- fit$hessian
  
  # Standard errors from inverse Fisher Information
  se <- sqrt(diag(solve(fisher_info)))
  
  list(
    mle = fit$par,
    se = se,
    fisher_info = fisher_info,
    convergence = fit$convergence,
    log_likelihood = -fit$value
  )
}

# Example: Beta distribution MLE
beta_nll <- function(params, data) {
  alpha <- params[1]
  beta <- params[2]
  if(alpha <= 0 || beta <= 0) return(Inf)
  -sum(dbeta(data, alpha, beta, log = TRUE))
}

# Fit Beta distribution
# data <- rbeta(100, 1.5, 0.5)
# fit <- mle_custom(data, c(1, 1), beta_nll)

# Computing profile likelihood confidence intervals
profile_ci <- function(data, param_index, mle_full, nll_function, level = 0.95) {
  # Critical value for the likelihood ratio test (chi-squared, 1 df,
  # halved because we work on the log-likelihood scale)
  crit <- qchisq(level, df = 1) / 2
  nll_min <- nll_function(mle_full, data)
  other_idx <- setdiff(seq_along(mle_full), param_index)
  
  # Profile negative log-likelihood: fix the parameter of interest,
  # minimize over the remaining parameters
  profile_nll <- function(param_value) {
    params <- mle_full
    params[param_index] <- param_value
    if (length(other_idx) == 0) return(nll_function(params, data))
    obj <- function(other_params) {
      params[other_idx] <- other_params
      nll_function(params, data)
    }
    optim(params[other_idx], obj, method = "BFGS")$value
  }
  
  # CI boundaries solve profile_nll(value) = nll_min + crit; the search
  # brackets below are a heuristic (they assume a positive parameter)
  f <- function(v) profile_nll(v) - nll_min - crit
  est <- mle_full[param_index]
  lower <- uniroot(f, c(est / 10, est))$root
  upper <- uniroot(f, c(est, est * 10))$root
  c(lower = lower, upper = upper)
}

# EM Algorithm manual implementation
em_gaussian_mixture <- function(X, n_components = 2, max_iter = 100, tol = 1e-4) {
  n <- nrow(X)
  d <- ncol(X)
  
  # Initialize parameters
  km <- kmeans(X, n_components)
  means <- km$centers
  weights <- table(km$cluster) / n
  covs <- list()
  for(k in 1:n_components) {
    covs[[k]] <- cov(X[km$cluster == k, ])
  }
  
  log_lik_old <- -Inf
  
  for(iter in 1:max_iter) {
    # E-step: compute weighted component densities, then responsibilities
    dens <- matrix(0, n, n_components)
    for(k in 1:n_components) {
      dens[, k] <- weights[k] * dmvnorm(X, means[k, ], covs[[k]])  # dmvnorm from mclust
    }
    resp <- dens / rowSums(dens)
    
    # Log-likelihood from the unnormalized densities (NOT from resp,
    # whose rows sum to 1 after normalization)
    log_lik <- sum(log(rowSums(dens)))
    if(abs(log_lik - log_lik_old) < tol) break
    log_lik_old <- log_lik
    
    # M-step: update parameters
    for(k in 1:n_components) {
      nk <- sum(resp[, k])
      weights[k] <- nk / n
      means[k, ] <- colSums(resp[, k] * X) / nk
      
      X_centered <- sweep(X, 2, means[k, ])
      covs[[k]] <- t(X_centered) %*% (resp[, k] * X_centered) / nk
    }
  }
  
  list(
    means = means,
    covariances = covs,
    weights = weights,
    responsibilities = resp,
    log_likelihood = log_lik,
    iterations = iter
  )
}

6.9.7 Connections to Source Material

This table maps sections in these lecture notes to the corresponding sections in Wasserman (2013) (“All of Statistics” or AoS).

| Lecture Note Section | Corresponding AoS Section(s) |
|----------------------|------------------------------|
| Introduction: How Good Are Our Estimators? | From slides and general context from AoS §9.4 introduction on properties of estimators. |
| Warm-up: Beta Distribution MLE Example | From slides; similar numerical optimization examples in AoS §9.13.4. |
| Core Properties of the MLE | |
| ↳ Overview | AoS §9.4 (list of properties). |
| ↳ Consistency | AoS §9.5 (Theorem 9.13 and related discussion). |
| ↳ Equivariance | AoS §9.6 (Theorem and Example). |
| ↳ Asymptotic Normality & Optimality | AoS §9.7 introduction, §9.8. |
| Fisher Information and Confidence Intervals | |
| ↳ Fisher Information | AoS §9.7 (Definitions and Theorem 9.17). |
| ↳ Constructing Confidence Intervals | AoS §9.7 (Theorem 9.19 and Examples). |
| ↳ Cramér-Rao Lower Bound | Mentioned in AoS §9.8. |
| Additional Topics | |
| ↳ The Delta Method | AoS §9.9 (Theorem 9.24 and Examples). |
| ↳ Multiparameter Models | AoS §9.10 (Fisher Information Matrix and related theorems). |
| ↳ Sufficient Statistics | AoS §9.13.2 (brief treatment from slides). |
| Connection to Machine Learning: Cross-Entropy as MLE | Expanded from slides. |
| MLE for Latent Variable Models: The EM Algorithm | AoS §9.13.4 (Appendix on computing MLEs). |
| Self-Test Problems | Inspired by AoS §9.14 exercises. |

6.9.8 Further Materials

  • Connection to machine learning: Murphy (2022), “Probabilistic Machine Learning: An Introduction”, Chapter 4

Remember: The MLE isn’t just a computational procedure – it’s a principled approach to estimation with deep theoretical foundations. The properties we’ve studied explain why maximum likelihood appears everywhere from simple coin flips to complex neural networks. These concepts represent the statistical foundations of modern machine learning!


  1. Note that the term MLE is used interchangeably to denote maximum-likelihood estimation (the technique), maximum-likelihood estimator (the rule to compute the estimate) and maximum-likelihood estimate (the result).↩︎

  2. In this example, we don’t need numerical optimization since we can compute the MLE analytically. We could still do numerical optimization and we would get the same results – it just would be slower and not very smart.↩︎

  3. Also called Mixture of Gaussians (MoG).↩︎

  4. As we will see in future lectures, a Bayesian approach is often more suitable for proper uncertainty quantification.↩︎

  5. This is just a rule of thumb.↩︎