5  Parametric Inference I: Finding Estimators

Published

July 23, 2025

5.1 Learning Objectives

After completing this chapter, you will be able to:

  • Define a parametric model and explain its role in statistical inference, distinguishing between parameters of interest and nuisance parameters.
  • Derive estimators using the Method of Moments (MoM) by equating theoretical and sample moments.
  • Explain the principle of Maximum Likelihood Estimation (MLE) and formulate the likelihood and log-likelihood functions for a given model.
  • Find the Maximum Likelihood Estimator (MLE) analytically in simple cases by maximizing the (log-)likelihood.
  • Explain when and why numerical optimization is necessary for MLE, and apply standard optimization libraries to find estimators computationally.
Note

This chapter covers parametric models and two fundamental approaches to finding estimators: the Method of Moments and Maximum Likelihood Estimation. The material is adapted from Chapter 9 of Wasserman (2013) and supplemented with computational examples and optimization techniques relevant to modern data science applications.

5.2 Introduction: Machine Learning As Statistical Estimation

When fitting a machine learning model f(x, \theta) for regression with inputs x_i and targets y_i, i = 1, \dots, n, a common approach is to minimize the mean squared error (MSE): \min_{\theta} \frac{1}{n} \sum_{i=1}^n \left( f(x_i, \theta) - y_i \right)^2.

The terms in this expression are similar to the log-probability of a normal distribution: \log \mathcal{N}(y_i;\; f(x_i, \theta), \sigma^2) = -\frac{1}{2}\log(2\pi \sigma^2) - \frac{1}{2 \sigma^2} \left( f(x_i, \theta) - y_i \right)^2.

This similarity is not a coincidence. When \sigma is constant, maximizing the log-likelihood means maximizing -\frac{1}{2\sigma^2} \sum_i (f(x_i, \theta) - y_i)^2 plus a constant – which is equivalent to minimizing the MSE.

The ML-Statistics Connection

When you minimize MSE in machine learning, you’re performing maximum likelihood estimation under a Gaussian noise assumption! This connection reveals a fundamental truth: many machine learning algorithms are secretly (or openly) solving statistical estimation problems.
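To see the equivalence concretely, here is a minimal sketch (the toy model y = \theta x and all variable names are our own illustration, not from the text): we estimate \theta once by minimizing the MSE and once by minimizing the negative Gaussian log-likelihood, and the two optimizers agree.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2.0 * x + rng.normal(0, 1.0, size=50)  # data from y = 2x + Gaussian noise

def mse(theta):
    return np.mean((theta * x - y) ** 2)

def neg_gauss_loglik(theta, sigma=1.0):
    # Negative log-likelihood under the model y_i ~ N(theta * x_i, sigma^2)
    return -np.sum(stats.norm.logpdf(y, loc=theta * x, scale=sigma))

theta_mse = optimize.minimize_scalar(mse).x
theta_mle = optimize.minimize_scalar(neg_gauss_loglik).x
print(f"MSE minimizer: θ = {theta_mse:.6f}")
print(f"Gaussian MLE:  θ = {theta_mle:.6f}")  # identical up to optimizer tolerance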

This chapter introduces the foundational principles of parametric inference – the engine that powers both classical statistics and modern machine learning. We’ll learn two primary methods for finding estimators for model parameters: the Method of Moments and the celebrated Maximum Likelihood Estimation.

For Finnish-speaking students, here’s a reference table of key terms in this chapter:

| English | Finnish | Context |
|---------|---------|---------|
| Parametric model | Parametrinen malli | Models with finite parameters |
| Parameter of interest | Kiinnostuksen parametri | The quantity we want to estimate |
| Nuisance parameter | Kiusaparametri | Parameters not of primary interest |
| Method of Moments (MoM) | Momenttimenetelmä | Estimation by matching moments |
| Moment | Momentti | Expected value of powers |
| Sample moment | Otosmomentti | Empirical average of powers |
| Maximum Likelihood Estimation (MLE) | Suurimman uskottavuuden menetelmä | Most common estimation method |
| Likelihood function | Uskottavuusfunktio | Joint density as function of parameters |
| Log-likelihood function | Log-uskottavuusfunktio | Logarithm of likelihood |
| Maximum Likelihood Estimator | SU-estimaattori | Parameter maximizing likelihood |
| Numerical optimization | Numeerinen optimointi | Computational methods for finding optima |
| Gradient | Gradientti | Vector of partial derivatives |
| Gradient descent | Gradienttimenetelmä | Iterative optimization algorithm |

5.3 Parametric Models

We introduced parametric models in Chapter 3 when discussing statistical inference frameworks. Recall that in the world of statistical inference, we often make assumptions about the structure of our data-generating process. A parametric model is one such assumption – it postulates that our data comes from a distribution that can be fully characterized by a finite number of parameters.

Now that we’re diving into estimation methods, let’s revisit this concept with a focus on how we actually find these parameters.

A parametric model is a set of distributions \mathfrak{F} = \{ f(x; \theta) : \theta \in \Theta \} where:

  • \theta = (\theta_1, \ldots, \theta_k) is the parameter (possibly vector-valued)
  • \Theta \subseteq \mathbb{R}^k is the parameter space (the set of all possible parameter values)
  • f(x; \theta) is the density or distribution function indexed by \theta

The key insight: We assume the data-generating process belongs to a specific family of distributions, and our job is just to find the right parameter \theta within that family.

In other words, the problem of inference reduces to estimating the parameter(s) \theta.

All Models Are Wrong, But Some Are Useful

Parametric models are widely used although the underlying models are usually not perfect. As the statistician George Box famously said – in what is possibly the most repeated quote in statistics – “All models are wrong, but some are useful.”

When the model is good enough, parametric models can be very useful because they offer a simple representation for potentially complex phenomena. The art of statistical modeling is finding a model that is wrong in acceptable ways while still capturing the essential features of your data.

Examples of Parametric Models:

  • Simple distributions:
    • Bernoulli(p): One parameter determining success probability
    • Poisson(\lambda): One parameter determining both mean and variance
    • Normal(\mu, \sigma^2): Two parameters for location and scale
  • Regression models:
    • Linear regression: Y = \beta_0 + \beta_1 X + \epsilon where \epsilon \sim N(0, \sigma^2)
    • Logistic regression: P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
  • Finite mixture models:
    • Mixture of Gaussians: f(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2)
    • Parameters include mixing weights \pi_k and component parameters (\mu_k, \sigma_k)
  • Machine Learning models:
    • Deep Neural Networks: Often millions of parameters (weights and biases)
    • Despite their complexity, these are still parametric models!
Remember: Parameters of Interest vs. Nuisance Parameters

We introduced this distinction in Chapter 3, and it’s crucial for estimation:

  • Parameter of interest: The specific quantity T(\theta) we want to estimate
  • Nuisance parameter: Other parameters we must estimate but don’t care about directly
Example: Mean Lifetime Estimation

Equipment lifetimes often follow a Gamma distribution. If X_1, \ldots, X_n \sim \text{Gamma}(\alpha, \beta), then:

f(x; \alpha, \beta) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha-1} e^{-x/\beta}, \quad x > 0

If we want to estimate the mean lifetime:

  • Parameter of interest: T(\alpha, \beta) = \mathbb{E}(X) = \alpha\beta
  • Nuisance parameters: The individual shape (\alpha) and scale (\beta) parameters

Note that we must estimate both parameters, but only their product matters for our question.

5.4 The Method of Moments (MoM)

The Method of Moments is a simple estimation technique that does not yield optimal estimators, but provides easy-to-compute values that can serve as good starting points for more sophisticated methods.

5.4.1 The Principle: Matching Moments

The Method of Moments is based on a straightforward idea: if our model is correct, then theoretical properties of the distribution (moments) should match their empirical counterparts in the data. It’s like saying, “If this really is the right distribution, then the average I calculate from my model should match the average I see in my data.”

For a model with parameter \theta = (\theta_1, \ldots, \theta_k):

  • The j^{\text{th}} theoretical moment is: \alpha_j(\theta) = \mathbb{E}_{\theta}(X^j)
  • The j^{\text{th}} sample moment is: \hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^n X_i^j

The Method of Moments estimator \hat{\theta}_n is the value of \theta such that: \begin{align} \alpha_1(\hat{\theta}_n) &= \hat{\alpha}_1 \\ \alpha_2(\hat{\theta}_n) &= \hat{\alpha}_2 \\ &\vdots \\ \alpha_k(\hat{\theta}_n) &= \hat{\alpha}_k \end{align}

This gives us a system of k equations with k unknowns – exactly what we need to solve for k parameters!

5.4.2 MoM in Action: Examples

Example: Bernoulli Distribution

For X_1, \ldots, X_n \sim \text{Bernoulli}(p), we have one parameter to estimate.

  • Theoretical first moment: \alpha_1(p) = \mathbb{E}_p(X) = p
  • Sample first moment: \hat{\alpha}_1 = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}_n

Equating them: p = \bar{X}_n

Therefore, the MoM estimator is \hat{p}_{\text{MoM}} = \bar{X}_n – simply the proportion of successes!

Example: Normal Distribution

For X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2), we have two parameters, so we need two equations.

First moment equation:

  • \alpha_1(\theta) = \mathbb{E}(X) = \mu
  • \hat{\alpha}_1 = \bar{X}_n
  • Setting equal: \mu = \bar{X}_n

Second moment equation:

  • \alpha_2(\theta) = \mathbb{E}(X^2) = \mathbb{V}(X) + (\mathbb{E}(X))^2 = \sigma^2 + \mu^2
  • \hat{\alpha}_2 = \frac{1}{n}\sum_{i=1}^n X_i^2
  • Setting equal: \sigma^2 + \mu^2 = \frac{1}{n}\sum_{i=1}^n X_i^2

Solving this system:

  • \hat{\mu}_{\text{MoM}} = \bar{X}_n
  • \hat{\sigma}^2_{\text{MoM}} = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2
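As a quick numerical sanity check (a small sketch of our own), the two expressions above for \hat{\sigma}^2_{\text{MoM}} – the mean of squares minus the squared mean, and the average squared deviation – agree on simulated data:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=2.0, size=1000)

mu_hat = np.mean(X)
sigma2_v1 = np.mean(X**2) - mu_hat**2   # second sample moment minus squared mean
sigma2_v2 = np.mean((X - mu_hat)**2)    # average squared deviation
print(mu_hat, sigma2_v1, sigma2_v2)     # the two variance estimates coincide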

Example: Gamma Distribution

For X_1, \ldots, X_n \sim \text{Gamma}(\alpha, \beta), let’s derive the MoM estimators.

The Gamma distribution has:

  • First moment: \mathbb{E}(X) = \alpha\beta
  • Second moment: \mathbb{E}(X^2) = \mathbb{V}(X) + (\mathbb{E}(X))^2 = \alpha\beta^2 + (\alpha\beta)^2

Setting up the moment equations: \begin{align} \alpha\beta &= \bar{X}_n \\ \alpha\beta^2 + (\alpha\beta)^2 &= \frac{1}{n}\sum_{i=1}^n X_i^2 \end{align}

From the first equation: \alpha = \bar{X}_n / \beta

Substituting into the second equation: \frac{\bar{X}_n}{\beta} \cdot \beta^2 + \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^n X_i^2

Simplifying: \bar{X}_n \beta + \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^n X_i^2

Solving for \beta: \beta = \frac{\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}_n^2}{\bar{X}_n} = \frac{\text{sample variance}}{\bar{X}_n}

Therefore, the MoM estimators are:

  • \hat{\beta}_{\text{MoM}} = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2}{\bar{X}_n}
  • \hat{\alpha}_{\text{MoM}} = \frac{\bar{X}_n}{\hat{\beta}_{\text{MoM}}} = \frac{\bar{X}_n^2}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2}
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate data from a Gamma distribution
np.random.seed(42)
true_alpha, true_beta = 3.0, 2.0  # True parameters
n = 100
data = stats.gamma.rvs(a=true_alpha, scale=true_beta, size=n)

# Method of Moments for Gamma distribution
# For Gamma(α, β): E[X] = αβ, E[X²] = α(α+1)β²
sample_mean = np.mean(data)
sample_second_moment = np.mean(data**2)

# Solve the system of equations
# mean = α * β
# second_moment = α * β² * (α + 1) = α * β² + α² * β²
# This gives us: β = (second_moment - mean²) / mean
mom_beta = (sample_second_moment - sample_mean**2) / sample_mean
mom_alpha = sample_mean / mom_beta

print(f"True parameters: α = {true_alpha}, β = {true_beta}")
print(f"MoM estimates:   α = {mom_alpha:.3f}, β = {mom_beta:.3f}")

# Visualize the fit
x = np.linspace(0, 20, 200)
plt.figure(figsize=(7, 4))
plt.hist(data, bins=30, density=True, alpha=0.7, label='Data')
plt.plot(x, stats.gamma.pdf(x, a=true_alpha, scale=true_beta), 
         'g-', linewidth=2, label='True distribution')
plt.plot(x, stats.gamma.pdf(x, a=mom_alpha, scale=mom_beta), 
         'r--', linewidth=2, label='MoM fit')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Method of Moments Estimation for Gamma Distribution')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
True parameters: α = 3.0, β = 2.0
MoM estimates:   α = 3.869, β = 1.511

5.4.3 Properties of the Method of Moments Estimator

Under regularity conditions, Method of Moments estimators have some desirable properties:

Let \hat{\theta}_n denote the method of moments estimator. Under appropriate conditions on the model, the following statements hold:

  1. Existence: The estimate \hat{\theta}_n exists with probability tending to 1.

  2. Consistency: The estimate is consistent: \hat{\theta}_n \xrightarrow{P} \theta

  3. Asymptotic Normality: The estimate is asymptotically Normal: \sqrt{n}(\hat{\theta}_n - \theta) \rightsquigarrow N(0, \Sigma) where \Sigma = g \mathbb{E}_{\theta}(Y Y^T) g^T, with Y = (X, X^2, \ldots, X^k)^T and g = (g_1, \ldots, g_k) where g_j = \frac{\partial \alpha_j^{-1}(\theta)}{\partial \theta}.

The last result can be used to derive confidence intervals, but the bootstrap is usually easier.

How Good are MoM Estimators?

Strengths:

  • Simple to compute – just solve algebraic equations!
  • Guaranteed existence and consistency under mild conditions
  • Asymptotically normal, enabling confidence intervals

Weaknesses:

  • Not efficient: Other estimators (like MLE) typically have smaller variance
  • May give impossible values: Can produce estimates outside the parameter space
  • Arbitrary: Why use moments? Why not other features of the distribution?

Primary Use Case: MoM estimators are excellent starting values for more sophisticated methods like maximum likelihood estimation, which often require numerical optimization.

5.5 Maximum Likelihood Estimation (MLE)

5.5.1 The Principle: What Parameter Makes My Data Most Probable?

Maximum Likelihood Estimation is arguably the most important estimation method in statistics. While the Method of Moments asks “what parameters make the theoretical moments match the empirical ones?”, MLE asks a more direct question: “what parameter values make my observed data most plausible?”

The elegance of MLE is that it reverses our usual probability thinking:

  • Probability: Given parameters, what data would we expect to see?
  • Likelihood: Given data, which parameters make this data most probable?

5.5.2 The Likelihood Function

The mathematical object at the core of MLE is the likelihood function.

Let X_1, \ldots, X_n be IID with PDF f(x; \theta).

The likelihood function is: \mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta)

For both mathematical and numerical convenience, in practice we often work with the log-likelihood function: \ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^n \log f(X_i; \theta)

The Maximum Likelihood Estimator (MLE) is: \hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \Theta} \mathcal{L}_n(\theta) = \arg\max_{\theta \in \Theta} \ell_n(\theta)

Note that maximizing the likelihood is the same as maximizing the log-likelihood, since the logarithm is a strictly increasing (monotonic) function.

Likelihood is a “what if” game. Imagine you have a coin and you flip it 10 times, getting 7 heads. You ask yourself:

“What if the probability of heads were 0.5? How likely would 7 heads in 10 flips be?” “What if the probability were 0.6? Would that make my data more or less likely?” “What if it were 0.7? 0.8?”

For each possible value of the parameter (here, the probability of heads), you calculate how probable your exact observed data would be. The parameter value that maximizes this probability is your maximum likelihood estimate.

The likelihood function is this “what if” calculation formalized – it’s the probability (or probability density) of your observed data, treated as a function of the unknown parameters.
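Here is the coin example as a few lines of code (a sketch of our own; the grid of candidate values is arbitrary): for each candidate probability of heads, we compute how probable 7 heads in 10 flips would be.

from scipy import stats

n_flips, heads = 10, 7
for p in [0.5, 0.6, 0.7, 0.8]:
    # Probability of exactly 7 heads in 10 flips if P(heads) = p
    print(f"p = {p}: P(data) = {stats.binom.pmf(heads, n_flips, p):.4f}")
# The probability peaks at p = 0.7 = 7/10, which is the maximum likelihood estimate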

Why do we often prefer working with the log-likelihood?

\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^n \log f(x_i; \theta)

  1. Differentiation: For exponential family distributions (normal, exponential, gamma, etc.), taking logarithms eliminates exponentials and simplifies derivatives. For the normal density, for example, \frac{d}{d\mu} \log e^{-(x-\mu)^2/2} is simply (x-\mu)

  2. Numerical stability: Products of small probabilities underflow; sums of logs don’t (demonstrated right after this list)

  3. Additive structure: Log-likelihood is a sum over observations, making derivatives and optimization more tractable
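Point 2 is easy to demonstrate (a small sketch of our own): multiplying a thousand modest probabilities underflows to zero in double precision, while summing their logarithms is unproblematic.

import numpy as np

probs = np.full(1000, 0.01)     # 1000 observations, each with density 0.01
print(np.prod(probs))           # 0.0 -- underflows (the true value is 1e-2000)
print(np.sum(np.log(probs)))    # ≈ -4605.17 -- perfectly representable in log space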

Let’s visualize the likelihood function for a simple Bernoulli example:

import numpy as np
import matplotlib.pyplot as plt

# Simulate coin flips: n=20, observed 12 successes
n = 20
successes = 12

# Define the log-likelihood function
p_values = np.linspace(0.01, 0.99, 200)
log_likelihood = successes * np.log(p_values) + (n - successes) * np.log(1 - p_values)  # binomial coefficient omitted (constant in p)

# Find the MLE (exact analytical solution)
p_mle = successes / n  # Exact MLE for Bernoulli

# Create the plot
plt.figure(figsize=(7, 4))
plt.plot(p_values, log_likelihood, 'b-', linewidth=2)
plt.axvline(p_mle, color='red', linestyle='--', linewidth=2, label=f'MLE: p̂ = {p_mle}')
plt.xlabel('p')
plt.ylabel('Log-likelihood ℓₙ(p)')
plt.title(f'Log-likelihood for Bernoulli with n={n}, S={successes}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Maximum likelihood estimate: p̂ = {successes/n}")
print(f"This makes intuitive sense: it's simply the observed proportion of successes!")

Maximum likelihood estimate: p̂ = 0.6
This makes intuitive sense: it's simply the observed proportion of successes!

Notice how the log-likelihood is maximized exactly at the observed proportion of successes. This is no coincidence – the MLE often has an intuitive interpretation.

Important: Likelihood function is NOT a probability distribution!

The likelihood \mathcal{L}_n(\theta) is NOT a probability distribution over \theta. In general: \int_\Theta \mathcal{L}_n(\theta) d\theta \neq 1

Why? The same mathematical expression f(x; \theta) plays two different roles:

  • As a PDF: Fix \theta, vary x\int f(x; \theta) dx = 1 (proper probability distribution)
  • As a likelihood: Fix x (observed data), vary \theta\int \mathcal{L}_n(\theta) d\theta is usually not 1

Example: Observe one coin flip with result X = 1 (heads) from Bernoulli(p):

  • The likelihood is \mathcal{L}(p) = p for p \in [0,1]
  • \int_0^1 p \, dp = \frac{1}{2} \neq 1

The likelihood tells us relative plausibility of parameter values, not their probabilities. This is why we need Bayesian methods if we want actual probability distributions over parameters!

5.5.3 Finding the MLE Analytically

For simple models, we can find the MLE by taking derivatives and setting them to zero. The general recipe:

  1. Write down the likelihood \mathcal{L}_n(\theta)
  2. Take the logarithm to get \ell_n(\theta) (this simplifies products to sums)
  3. Take the derivative with respect to \theta
  4. Set equal to zero and solve
  5. Verify it’s a maximum (not a minimum or saddle point)
Simplifying MLE Calculations

When finding the MLE, multiplying the likelihood (or log-likelihood) by a positive constant or adding a constant doesn’t change where the maximum occurs. This means we can often ignore:

  • Normalization constants that don’t depend on \theta
  • Terms like \frac{1}{(2\pi)^{n/2}} in the normal distribution
  • Factorials in discrete distributions

This greatly simplifies calculations – focus only on terms involving \theta!

Let’s work through some examples:

Example: Bernoulli Distribution

For X_1, \ldots, X_n \sim \text{Bernoulli}(p):

Step 1: The likelihood is \mathcal{L}_n(p) = \prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^S(1-p)^{n-S} where S = \sum_{i=1}^n X_i is the total number of successes.

Step 2: The log-likelihood is \ell_n(p) = S \log p + (n-S) \log(1-p)

Step 3: Taking the derivative: \frac{d\ell_n}{dp} = \frac{S}{p} - \frac{n-S}{1-p}

Step 4: Setting to zero and solving: \frac{S}{p} = \frac{n-S}{1-p} \implies S(1-p) = (n-S)p \implies S = np

Therefore, \hat{p}_{\text{MLE}} = S/n = \bar{X}_n.

Note: This is the same as the Method of Moments estimator!

Example: Normal Distribution

For X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2):

The log-likelihood (ignoring constants) is: \ell_n(\mu, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2

Taking partial derivatives and setting to zero:

\frac{\partial \ell_n}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \mu) = 0

\frac{\partial \ell_n}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (X_i - \mu)^2 = 0

Solving these equations:

  • \hat{\mu}_{\text{MLE}} = \bar{X}_n
  • \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X}_n)^2

Again, these match the Method of Moments estimators!

Example: A Harder Case - Uniform(0, θ)

Not all MLEs can be found by differentiation! Consider X_1, \ldots, X_n \sim \text{Uniform}(0, \theta).

The PDF is: f(x; \theta) = \begin{cases} 1/\theta & \text{if } 0 \leq x \leq \theta \\ 0 & \text{otherwise} \end{cases}

The likelihood is: \mathcal{L}_n(\theta) = \begin{cases} (1/\theta)^n & \text{if } \theta \geq \max\{X_1, \ldots, X_n\} \\ 0 & \text{if } \theta < \max\{X_1, \ldots, X_n\} \end{cases}

This function:

  • Is 0 when \theta is less than the largest observation
  • Decreases as (1/\theta)^n for larger \theta

Therefore, the likelihood is maximized at the boundary: \hat{\theta}_{\text{MLE}} = X_{(n)} = \max\{X_1, \ldots, X_n\}.

This example shows that not all optimization problems are solved by calculus – sometimes we need to think more carefully about the function’s behavior!

Let’s visualize this unusual likelihood function:

# Uniform(0, θ) MLE visualization
np.random.seed(42)
n = 10
true_theta = 2.0
data = np.random.uniform(0, true_theta, n)
x_max = np.max(data)

# Create theta values
theta_values = np.linspace(0.1, 3.0, 300)
likelihood = np.zeros_like(theta_values)

# Calculate likelihood (proportional to)
for i, theta in enumerate(theta_values):
    if theta >= x_max:
        likelihood[i] = (1/theta)**n
    else:
        likelihood[i] = 0

plt.figure(figsize=(7, 4))
plt.subplot(1, 2, 1)
# Plot data points
plt.scatter(data, np.zeros_like(data), color='red', s=50, zorder=5, label='Data')
plt.axvline(x_max, color='green', linestyle='--', linewidth=2, label=f'max(X) = {x_max:.3f}')
plt.xlim(-0.5, 3.0)
plt.ylim(-0.1, 0.1)
plt.xlabel('x')
plt.title('Observed Data')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(theta_values, likelihood, 'b-', linewidth=2)
plt.axvline(x_max, color='green', linestyle='--', linewidth=2, label=f'MLE = {x_max:.3f}')
plt.xlabel('θ')
plt.ylabel('Likelihood (proportional to)')
plt.title('Likelihood Function')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

5.6 MLE Via Numerical Optimization

Finding the maximum of the log-likelihood by taking a derivative and setting it to zero is clean and satisfying, but it only works for the simplest models. For most real-world problems, the log-likelihood function is a complex, high-dimensional surface, and we cannot find the peak analytically.

Instead, we must turn to numerical optimization. The core idea of most optimization algorithms is simple: we start with an initial guess for the parameters, \theta_0, and then iteratively take steps “uphill” on the likelihood surface until we can no longer find a higher point.

This section explores the concepts and tools behind this fundamental process.1

The Need for Numerical Methods

For many important models, we cannot solve for the MLE analytically:

  • Logistic regression: No closed-form solution for the regression coefficients
  • Mixture models: Complex likelihood with multiple local maxima
  • Most complex ML models: Neural networks, random forests, etc.

We must use iterative numerical optimization algorithms to find the maximum of the likelihood function (or minimum of the negative log-likelihood).

5.6.1 The Optimization Setup for MLE

Finding the MLE requires solving an optimization problem to find the parameters \theta that maximize the likelihood function \mathcal{L}_n(\theta) or equivalently the log-likelihood function \ell_n(\theta).

Since optimization methods in software libraries are conventionally written to perform minimization, our practical goal becomes:

\hat{\theta}_{\text{MLE}} = \arg\min_{\theta} \left[-\ell_n(\theta)\right]

This is a crucial point: we minimize the negative log-likelihood. Forgetting this minus sign is one of the most common programming errors in statistical computing!

Throughout this section, we’ll follow the standard convention and present algorithms for minimization. We’ll use examples from Python’s scipy.optimize, but other languages and libraries (R’s optim, Julia’s Optim.jl, etc.) offer similar algorithms with comparable interfaces.

In optimization, a problem is said to be D-dimensional when D variables are being optimized over – here, D is the number of elements of \theta.

5.6.2 Numerical Optimization in 1D

To understand how numerical optimization works, it’s helpful to start with the simplest case: finding the minimum of a function with a single parameter, f(\theta).

Let’s take a simple example function to minimize:

f(\theta) = \frac{\theta^2}{10} + \frac{1}{\theta}; \quad \theta > 0

whose global minimum is at \theta^\star \approx 1.71.

How can a computer find the minimum of this function? There are two main families of approaches.

1D Optimization without Derivatives

The simplest algorithms work by bracketing the minimum. If you can find three points (a, c, b) such that a < c < b and f(c) < f(a) and f(c) < f(b), you know a minimum lies somewhere in the interval (a, b).

  • Golden-section search: A simple but robust method that is analogous to binary search for finding a root. It progressively narrows the bracket until the desired precision is reached.
  • Brent’s method: A more sophisticated and generally preferred method. It combines the guaranteed (but slow) progress of golden-section search with the potentially faster convergence of fitting a parabola to the three points and jumping to the minimum of the parabola.

These methods are useful when the function’s derivative is unavailable or unreliable.
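For concreteness, here is a bare-bones golden-section search written by us for illustration (scipy’s production implementation is more refined): it repeatedly shrinks the bracket by the golden ratio.

import numpy as np

def golden_section_search(f, a, b, tol=1e-8):
    """Minimize a unimodal f on the bracket [a, b] by golden-section search."""
    invphi = (np.sqrt(5) - 1) / 2       # 1/golden ratio ≈ 0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):                 # the minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                           # the minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Example: minimize f(θ) = θ²/10 + 1/θ over the bracket [0.1, 10]
print(golden_section_search(lambda t: t**2/10 + 1/t, 0.1, 10.0))  # ≈ 1.709976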

1D Optimization with Derivatives: Newton’s Method

If we can compute derivatives, we can often find the minimum much faster. The most famous derivative-based method is Newton’s method (or Newton-Raphson).

The idea is to iteratively approximate the function with a quadratic and jump to the minimum of that quadratic. A second-order Taylor expansion of f(\theta) around a point \theta_t is: f(\theta) \approx f(\theta_t) + f'(\theta_t) (\theta - \theta_t) + \frac{1}{2} f''(\theta_t) (\theta - \theta_t)^2.

To find the minimum of this quadratic approximation, we take its derivative with respect to \theta and set it to zero: \frac{d}{d\theta} (\text{approx}) = f'(\theta_t) + f''(\theta_t)(\theta - \theta_t) = 0

Solving for \theta gives us the next point in our iteration, \theta_{t+1}: \theta_{t+1} = \theta_t - \frac{f'(\theta_t)}{f''(\theta_t)}.

This simple update rule is the core of Newton’s method. When close to a well-behaved minimum, it converges very rapidly (quadratically). However, it can be unstable if the function is not “nice” or if the starting point is far from the minimum.
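Here is Newton’s method applied to the same example function f(\theta) = \theta^2/10 + 1/\theta (our own sketch, with the derivatives worked out by hand):

def fprime(theta):              # f'(θ) = θ/5 - 1/θ²
    return theta/5 - 1/theta**2

def fsecond(theta):             # f''(θ) = 1/5 + 2/θ³
    return 1/5 + 2/theta**3

theta = 1.0                     # starting point
for t in range(6):
    theta = theta - fprime(theta) / fsecond(theta)   # Newton update
    print(f"iteration {t+1}: θ = {theta:.10f}")
# Converges in a handful of steps to 5^(1/3) ≈ 1.7099759467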

Example: 1D Optimization in scipy

Let’s find the minimum of our example function using scipy.optimize. We’ll use Brent’s method, which only requires the function itself and a bracket.

import numpy as np
import scipy.optimize

def f(theta):
  return theta**2/10 + 1/theta

# Find the minimum using Brent's method.
# We provide a full bracket (a, c, b) with f(c) < f(a) and f(c) < f(b),
# so a minimum is guaranteed to lie in (a, b). (With only a two-point
# bracket, the downhill search can escape toward θ → 0.)
# Let's use (0.1, 1.0, 10.0)
result = scipy.optimize.brent(f, brack=(0.1, 1.0, 10.0))

print(f"The minimum found by Brent's method is at θ = {result:.6f}")
print(f"The value of the function at the minimum is f(θ) = {f(result):.6f}")

# Analytical minimum is at (5)^(1/3)
analytical_min = 5**(1/3)
print(f"The analytical minimum is at θ = {analytical_min:.6f}")
The minimum found by Brent's method is at θ = 1.709976
The value of the function at the minimum is f(θ) = 0.877205
The analytical minimum is at θ = 1.709976

The output shows that the numerical method finds the correct minimum with high precision.

Moving to Multiple Dimensions

The concepts from 1D optimization extend to the multi-dimensional case, but with added complexity.

Derivative-free methods:

Generalizing derivative-free methods to multiple dimensions, i.e., optimizing f(\theta_1, \ldots, \theta_n) with respect to \theta_1, \ldots, \theta_n is hard! As the dimensionality increases, the number of possible directions that need to be searched for the optimum grows very rapidly.

  • Classical approaches: Very old algorithms like direct search methods are mostly inefficient and struggle as dimensionality grows. Searching in many directions at once becomes prohibitively expensive.
  • Modern approaches: Recent methods can be more effective in specific scenarios:
    • Bayesian optimization: Works well in low dimensions (typically < 20) and can be very sample-efficient when function evaluations are expensive
    • CMA-ES (Covariance Matrix Adaptation Evolution Strategy): If you can afford many function evaluations and don’t have gradients, this method can work in surprisingly high dimensions – up to tens of thousands of parameters
  • These methods are mainly used when gradients are unavailable or unreliable (e.g., noisy simulations, black-box models)

Derivative-based methods:

  • The standard choice: When gradients are available and the function is relatively inexpensive to evaluate, gradient-based methods dominate
  • Scalability: These methods can handle problems with millions or even billions of parameters (think deep neural networks)
  • Key transformation: The first derivative (slope) becomes the gradient (\nabla f), and the second derivative (curvature) becomes the Hessian matrix (H)
  • Why they work: Gradient information provides a principled direction for improvement at each step, making the search much more efficient than blind exploration

You can find visualizations of several optimization algorithms at work here.

5.6.3 The Gradient: The Key to Multidimensional Optimization

The main tool for optimization of multidimensional functions

f: \mathbb{R}^n \to \mathbb{R}

is the gradient:

\nabla f(\theta) = \left(\frac{\partial f}{\partial \theta_1}, \ldots, \frac{\partial f}{\partial \theta_n}\right)^T

Here \frac{\partial f}{\partial \theta_i} is the partial derivative of f with respect to \theta_i. It can be evaluated by computing the derivative w.r.t. \theta_i while keeping the other variables constant.

Geometric interpretation: The gradient points in the direction where the function values grow fastest. Its magnitude measures the rate of change in that direction.

Intuition: The Gradient as a Compass

Think of finding the MLE as hiking blindfolded on a hilly terrain (the negative log-likelihood surface). At each point, the gradient tells you which direction is steepest uphill. Since we want to minimize, we go in the opposite direction – downhill.

Let’s visualize this concept:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create a 2D function with interesting topology (negative log-likelihood surface)
def f(x, y):
    """Example negative log-likelihood with two local minima"""
    return -2 * np.exp(-(x-1)**2 - y**2) + 3 * np.exp(-(x+1)**2 - y**2) + 0.1*(x**2 + y**2)

# Create grid
x = np.linspace(-3, 3, 50)
y = np.linspace(-2, 2, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Create figure with two subplots
fig = plt.figure(figsize=(7, 10))

# First subplot: 3D surface
ax1 = fig.add_subplot(2, 1, 1, projection='3d')
surf = ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8, edgecolor='none')
ax1.contour(X, Y, Z, levels=15, cmap='viridis', offset=Z.min(), alpha=0.4)

# Mark the global minimum
ax1.scatter([1], [0], [f(1, 0)], color='black', s=100, marker='*', label='Global minimum')

ax1.set_xlabel('θ₁')
ax1.set_ylabel('θ₂')
ax1.set_zlabel('Negative log-likelihood')
ax1.set_title('3D Optimization Landscape')
# Adjust the viewing angle
ax1.view_init(elev=30, azim=-45)

# Second subplot: 2D contour with gradient field
ax2 = fig.add_subplot(2, 1, 2)

# Compute gradient (numerically)
dx = 0.01
dy = 0.01
grad_x = (f(X + dx, Y) - f(X - dx, Y)) / (2 * dx)
grad_y = (f(X, Y + dy) - f(X, Y - dy)) / (2 * dy)

# Normalize gradient vectors for better visualization
grad_norm = np.sqrt(grad_x**2 + grad_y**2)
grad_x_norm = -grad_x / (grad_norm + 1e-10)  # Negative for descent
grad_y_norm = -grad_y / (grad_norm + 1e-10)

# Contour plot
contour = ax2.contour(X, Y, Z, levels=15, cmap='viridis', alpha=0.6)
ax2.clabel(contour, inline=True, fontsize=8)

# Gradient vectors
skip = 3  # Show every 3rd arrow for clarity
ax2.quiver(X[::skip, ::skip], Y[::skip, ::skip], 
           grad_x_norm[::skip, ::skip], grad_y_norm[::skip, ::skip], 
           scale=20, alpha=0.7, width=0.003, color='red')

# Mark global minimum
ax2.plot([1], [0], 'k*', markersize=10, label='Global minimum')

ax2.set_xlabel('θ₁')
ax2.set_ylabel('θ₂')
ax2.set_title('Gradient Field on Negative Log-Likelihood Surface')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

These visualizations show:

  • Top (3D): The optimization landscape as a surface, showing the “hills and valleys” that optimization algorithms must navigate
  • Bottom (2D): The same landscape from above with:
    • Contour lines showing level sets of the negative log-likelihood
    • Red arrows pointing in the direction of steepest descent (negative gradient)
    • Black star marking the global minimum at θ₁ = 1, θ₂ = 0
  • Note how the gradient vectors always point toward lower values, guiding optimization algorithms downhill

5.6.4 Evaluating the Gradient

The first practical challenge in using gradients for optimization is how to evaluate the gradient. There are three main approaches:

  1. Analytically: Derive by hand and implement – fast but error-prone
  2. Finite differences: Approximate numerically by evaluating nearby points – easy but expensive2 and less precise
  3. Automatic differentiation (autodiff): Best of both worlds – exact, fast and automatic

Modern frameworks like PyTorch and JAX make automatic differentiation the preferred choice. However, some older languages such as R still lack full autodiff support. Gradient computation usually uses reverse-mode autodiff, which is suited to computing derivatives of real-valued functions f: \mathbb{R}^n \to \mathbb{R}.
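For comparison, approach 2 (central finite differences) can be sketched in a few lines; the helper below is our own illustration. Note that it costs two function evaluations per dimension:

import numpy as np

def finite_diff_grad(f, theta, h=1e-6):
    """Central-difference approximation to the gradient of f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * h)  # 2 evaluations per dimension
    return grad

# Example: the gradient of f(θ) = θ₁² + 3θ₂ at (1, 2) is (2, 3)
print(finite_diff_grad(lambda t: t[0]**2 + 3*t[1], [1.0, 2.0]))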

Here’s how to compute gradients automatically using JAX:

import jax.numpy as jnp
import jax

def f(theta):
    return theta**2/10 + 1/theta

g = jax.grad(f)
print(g(1.5))  # f'(1.5) = 1.5/5 - 1/1.5² ≈ -0.144

# For multiple variables, the gradient is a vector:
def f2(theta):
    return (jnp.exp(-(theta[0]-1)**2 - theta[1]**2)
            - jnp.exp(-(theta[0]+1)**2 - theta[1]**2))

g2 = jax.grad(f2)
print(g2(jnp.array([-3.0, 0.5])))  # ≈ [-0.057, 0.014]

5.6.5 Gradient-Based Optimization Methods

Once we have gradients, we can use various algorithms to find the optimum.

The simplest gradient-based algorithm is gradient descent (also called steepest descent), which iteratively takes steps in the direction of the negative gradient:

\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)

where \eta > 0 (eta) is the learning rate, which defines the step length.

Gradient descent is inefficient, often zigzagging toward the optimum, but it’s the foundation for understanding more sophisticated methods.
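A minimal gradient descent loop makes the update rule concrete (our own sketch; the elongated quadratic objective and learning rate are arbitrary choices):

import numpy as np

def grad_f(theta):
    # Gradient of f(θ) = θ₁² + 10θ₂², an elongated bowl with its minimum at the origin
    return np.array([2*theta[0], 20*theta[1]])

theta = np.array([3.0, 2.0])                 # initial guess
eta = 0.09                                   # learning rate
for t in range(100):
    theta = theta - eta * grad_f(theta)      # step along the negative gradient
print(theta)                                 # ≈ [0, 0]

Because the bowl is much steeper in θ₂ than in θ₁, each step overshoots in θ₂ and flips sign (the zigzag) while θ₁ shrinks only slowly – exactly the inefficiency described above.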

Beyond basic gradient descent, there are many sophisticated optimization algorithms:

Classical improvements:

  • Conjugate gradient methods: Choose search directions that are conjugate with respect to the Hessian matrix. For quadratic functions, this guarantees finding the exact minimum in at most n steps (where n is the number of dimensions). Unlike gradient descent which can take tiny steps in nearly orthogonal directions, conjugate gradient methods take larger, more efficient steps.

  • Quasi-Newton methods: Approximate the Hessian matrix without computing second derivatives directly.

    • BFGS (Broyden-Fletcher-Goldfarb-Shanno): Builds up an approximation to the inverse Hessian using gradient information from previous steps
    • L-BFGS (Limited-memory BFGS): Stores only a few vectors instead of the full Hessian approximation, making it practical for high-dimensional problems

Modern methods:

  • Momentum methods (heavy ball method): Add a fraction of the previous step to the current gradient step: \theta_{t+1} = \theta_t - \eta \nabla f(\theta_t) + \beta(\theta_t - \theta_{t-1}) where \beta \in [0, 1) is the momentum coefficient. This helps the optimizer “roll through” small local variations and reduces zigzagging in narrow valleys.

  • Accelerated gradient methods: Achieve provably faster convergence rates. Nesterov’s accelerated gradient has convergence rate O(1/t^2) compared to O(1/t) for standard gradient descent on convex functions.

These methods are especially important in deep learning where simple gradient descent would be too slow to navigate the complex, high-dimensional loss landscapes of neural networks.

5.6.6 Stochastic Gradient Methods

Many machine learning problems, especially in deep learning, involve optimizing functions of the form:

\min_\theta f(X, \theta) = \sum_{i=1}^n f(x_i, \theta)

When n is large and \theta is high-dimensional, evaluating the full gradient becomes computationally prohibitive.

Stochastic gradient descent (SGD) approximates the gradient using a random subset of data:

\theta_{t+1} = \theta_t - \eta_t \nabla \sum_{i \in S_t} f(x_i, \theta_t)

where S_t \subset \{1, \ldots, n\} is a randomly selected mini-batch.

Convergence: SGD converges for well-behaved functions when the learning rate sequence satisfies: \sum_{i=1}^\infty \eta_i = \infty, \quad \sum_{i=1}^\infty \eta_i^2 < \infty

Note that constant learning rates don’t guarantee convergence to the exact optimum!
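A bare-bones SGD loop for a toy problem (our own sketch): estimating the mean of Gaussian data by minimizing the squared error with mini-batch gradients and a decaying learning rate \eta_t = 1/(t+1), which satisfies the two conditions above.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)   # true mean = 5

mu = 0.0                                             # initial guess
for t in range(2000):
    batch = rng.choice(data, size=32, replace=False) # random mini-batch S_t
    grad = np.mean(2 * (mu - batch))                 # mini-batch gradient of (mu - x)²
    eta = 1.0 / (t + 1)                              # Σ ηₜ = ∞ and Σ ηₜ² < ∞
    mu = mu - eta * grad

print(f"SGD estimate: {mu:.3f}")                     # ≈ 5, the sample mean (the MLE)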

Popular SGD Variants

The huge popularity of SGD has spawned many improved variants:

  • Adam (Adaptive Moment Estimation) (Kingma and Ba 2015): Combines momentum with adaptive learning rates for each parameter – its core update is sketched after this list
  • AdaGrad (Adaptive Gradient): Adapts learning rate based on historical gradients - parameters with frequent updates get smaller learning rates
  • RMSprop (Root Mean Square Propagation): Uses a running average of recent gradients to normalize the learning rate
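To make the variants concrete, here is the core Adam update written out in NumPy – a sketch following the published algorithm (Kingma and Ba 2015), not a production implementation:

import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus per-parameter adaptive scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # running mean of squared gradients
    m_hat = m / (1 - beta1**t)                # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2**t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Usage on the elongated bowl f(θ) = θ₁² + 10θ₂²
theta = np.array([3.0, 2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    grad = np.array([2*theta[0], 20*theta[1]])
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)
print(theta)                                  # close to [0, 0]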

5.6.7 Which Optimizer Should I Use?

For smaller datasets (< 10K observations):

  • L-BFGS is usually the best first choice
  • Fast convergence, reliable for smooth problems
  • Standard choice in traditional statistical software

For large datasets or deep learning:

  • SGD and variants (especially Adam) are often the only practical choice
  • Require careful tuning of learning rates
  • Can handle millions and even billions of parameters

For black-box problems (no gradients available):

  • CMA-ES: Principled “population-based” method that can handle up to thousands of parameters (requiring many evaluations)
  • Bayesian Optimization: Efficient for expensive functions with D < 20 parameters
  • BADS (Bayesian Adaptive Direct Search): Combines Bayesian optimization with mesh adaptive direct search, good for noisy and mildly expensive functions with D < 20

Often parameters have natural constraints:

  • Standard deviations must be positive: \sigma > 0
  • Probabilities must be in [0,1]: 0 \leq p \leq 1
  • Correlation matrices must be positive definite

How do we enforce these? There are typically three different approaches:

  1. Reparameterization: Optimize \phi = \log \sigma and then use \sigma = e^\phi to ensure positivity3 – see the sketch after this list
  2. Constrained optimization: Use algorithms like L-BFGS-B that allow you to specify parameter bounds
  3. Barrier methods: Add penalty terms that blow up at boundaries (can be unstable – not recommended)
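A quick sketch of approach 1 (our own example: the MLE of a normal \sigma with known mean): we optimize over the unconstrained \phi = \log \sigma and map back with \sigma = e^\phi.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.5, size=200)   # known mean 0, unknown σ

def neg_log_lik(phi):
    sigma = np.exp(phi[0])                        # σ = e^φ is positive for any real φ
    return -np.sum(stats.norm.logpdf(data, loc=0.0, scale=sigma))

# Unconstrained optimization over φ ∈ ℝ – no bounds needed
result = optimize.minimize(neg_log_lik, x0=np.array([0.0]), method="BFGS")
sigma_hat = np.exp(result.x[0])
print(f"MLE of σ: {sigma_hat:.3f}")               # matches np.sqrt(np.mean(data**2))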

Now let’s apply the multidimensional optimization concepts to a real statistical problem. We’ll estimate both parameters of a Gamma distribution, which provides an excellent illustration of:

  • True 2D optimization: Unlike our 1D examples, we need to simultaneously find the shape parameter \alpha and scale parameter \beta.
  • Constrained optimization in practice: Both parameters must be positive, so we’ll use L-BFGS-B with bounds.
  • The importance of starting values: We’ll launch optimization from multiple starting points to see if they converge to the same solution. We choose the first starting point using the Method of Moments and the others arbitrarily (they could equally well have been chosen at random).
  • Visualizing the likelihood surface: We’ll create both 3D and contour plots to understand the optimization landscape.

The Gamma distribution is particularly interesting because its two parameters interact in the likelihood function, creating a curved valley in the likelihood surface rather than a simple bowl.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, optimize
from mpl_toolkits.mplot3d import Axes3D

# Generate data from Gamma distribution
np.random.seed(42)
true_alpha = 3.0  # shape parameter
true_beta = 2.0   # scale parameter
n = 100
data = stats.gamma.rvs(a=true_alpha, scale=true_beta, size=n)

# Define negative log-likelihood for Gamma(α, β)
def neg_log_likelihood(params, data):
    """Negative log-likelihood for Gamma distribution"""
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf  # Return infinity for invalid parameters
    return -np.sum(stats.gamma.logpdf(data, a=alpha, scale=beta))

# Method of Moments starting values
sample_mean = np.mean(data)
sample_var = np.var(data)
mom_beta = sample_var / sample_mean
mom_alpha = sample_mean / mom_beta

# Try optimization from different starting points
starting_points = [
    [mom_alpha, mom_beta],  # MoM estimate
    [1.0, 1.0],             # Generic start
    [5.0, 5.0],             # Far from truth
]

# Store optimization paths
paths = []
results = []

for i, start in enumerate(starting_points):
    # Create a list to store the path for this optimization
    path = []
    
    # Callback function to record each iteration
    def callback(xk):
        path.append(xk.copy())
    
    result = optimize.minimize(
        fun=neg_log_likelihood,
        x0=start,
        args=(data,),
        method='L-BFGS-B',
        bounds=[(0.01, None), (0.01, None)],  # Enforce positivity
        callback=callback
    )
    
    # Add starting point to path
    full_path = [np.array(start)] + path
    paths.append(np.array(full_path))
    results.append(result)
    
    print(f"Start {i+1}: {start} → MLE: [{result.x[0]:.3f}, {result.x[1]:.3f}], "
          f"NLL: {result.fun:.3f}, Iterations: {len(full_path)-1}")

# Best result
best_result = min(results, key=lambda r: r.fun)
mle_alpha, mle_beta = best_result.x

# Visualize the likelihood surface
alpha_range = np.linspace(0.5, 6, 50)
beta_range = np.linspace(0.5, 5.2, 50)
Alpha, Beta = np.meshgrid(alpha_range, beta_range)
NLL = np.zeros_like(Alpha)

for i in range(Alpha.shape[0]):
    for j in range(Alpha.shape[1]):
        NLL[i, j] = neg_log_likelihood([Alpha[i, j], Beta[i, j]], data)

fig = plt.figure(figsize=(7, 10))

# 3D surface plot
ax1 = fig.add_subplot(2, 1, 1, projection='3d')
surf = ax1.plot_surface(Alpha, Beta, NLL, cmap='viridis', alpha=0.8, edgecolor='none')
ax1.contour(Alpha, Beta, NLL, levels=20, cmap='viridis', offset=NLL.min(), alpha=0.4)

# Mark the different solutions
ax1.scatter([true_alpha], [true_beta], [neg_log_likelihood([true_alpha, true_beta], data)], 
           color='black', s=100, marker='*', label='True parameters')
ax1.scatter([mle_alpha], [mle_beta], [best_result.fun], 
           color='red', s=100, marker='o', label='MLE')

ax1.set_xlabel('α (shape)')
ax1.set_ylabel('β (scale)')
ax1.set_zlabel('Negative log-likelihood')
ax1.set_title('Likelihood Surface for Gamma Distribution')
ax1.view_init(elev=25, azim=45)

# 2D contour plot with optimization paths
ax2 = fig.add_subplot(2, 1, 2)
contour = ax2.contour(Alpha, Beta, NLL, levels=30, cmap='viridis', alpha=0.6)
ax2.clabel(contour, inline=True, fontsize=8)

# Plot optimization paths
colors = ['blue', 'green', 'orange']
for i, (path, color) in enumerate(zip(paths, colors)):
    # Plot the full optimization path
    ax2.plot(path[:, 0], path[:, 1], '-', color=color, linewidth=2, alpha=0.7)
    # Mark start point
    ax2.plot(path[0, 0], path[0, 1], 'o', color=color, markersize=10, 
             label=f'Start {i+1}')
    # Mark intermediate points
    ax2.plot(path[1:-1, 0], path[1:-1, 1], '.', color=color, markersize=6)
    # Mark end point
    ax2.plot(path[-1, 0], path[-1, 1], 's', color=color, markersize=10)

ax2.plot(true_alpha, true_beta, 'k*', markersize=12, label='True')
ax2.plot(mle_alpha, mle_beta, 'ro', markersize=10, label='MLE')

ax2.set_xlabel('α (shape)')
ax2.set_ylabel('β (scale)')
ax2.set_title('L-BFGS Optimization Traces from Different Starting Points')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nTrue parameters: α = {true_alpha}, β = {true_beta}")
print(f"MLE estimates:   α = {mle_alpha:.3f}, β = {mle_beta:.3f}")
print(f"MoM estimates:   α = {mom_alpha:.3f}, β = {mom_beta:.3f}")
Start 1: [3.868907522031454, 1.510860197180884] → MLE: [4.021, 1.454], NLL: 240.066, Iterations: 6
Start 2: [1.0, 1.0] → MLE: [4.021, 1.454], NLL: 240.066, Iterations: 14
Start 3: [5.0, 5.0] → MLE: [4.021, 1.454], NLL: 240.066, Iterations: 13

True parameters: α = 3.0, β = 2.0
MLE estimates:   α = 4.021, β = 1.454
MoM estimates:   α = 3.869, β = 1.511

Notice how all three starting points converged to the same MLE, demonstrating that this likelihood surface is well-behaved with a single global optimum. The actual L-BFGS traces (with intermediate points marked as dots) reveal interesting optimization behavior:

  • Blue path (MoM start): Converges in very few iterations since it starts close to the optimum
  • Green path (generic start): Takes a curved path following the likelihood valley
  • Orange path (far start): Makes larger initial steps, then follows the contours more carefully as it approaches the optimum

The 3D plot reveals the characteristic curved valley of the Gamma likelihood, explaining why the optimization paths curve rather than taking straight lines to the optimum.

5.6.8 FAQ: Common Issues in Numerical Optimization

Frequently Asked Questions

Q: I used a minimizer, but I’m supposed to be doing maximum likelihood. What gives?

A: A classic source of confusion! Maximizing a function f(x) is equivalent to minimizing its negative -f(x). All standard optimization libraries are built as minimizers. Therefore, in practice, we always find the MLE by minimizing the negative log-likelihood function.

Q: I tried different starting points and the optimizer gave different answers. Is it broken?

A: No, this is expected behavior! The algorithms we use are local optimizers. They find the nearest valley (local minimum), but they have no guarantee of finding the deepest valley (global minimum). If your likelihood surface has multiple local optima, the result will depend on your starting point. This is why using a good initial value (like the Method of Moments estimate!) is so important.

Q: How do I know if the algorithm has actually converged to the right answer?

A: In general, you don’t know with 100% certainty. Optimizers use heuristics, like stopping when the change in the parameter values or the likelihood is very small. Good practices include:

  • Always try different starting points!
  • Check the optimizer’s status messages
  • Verify that the gradient is near zero at the solution
  • Compare with simpler methods (like MoM) as a sanity check

Let’s demonstrate the importance of starting values:

# Example: Multiple local optima in a mixture model
# We'll create a bimodal likelihood surface

def bimodal_nll(theta):
    """A negative log-likelihood with two local minima"""
    # Artificial example with two valleys
    return -np.log(0.6 * np.exp(-2*(theta-1)**2) + 
                   0.4 * np.exp(-3*(theta-4)**2) + 0.01)

# Try optimization from different starting points
theta_range = np.linspace(-2, 7, 200)
nll_surface = [bimodal_nll(t) for t in theta_range]

starting_points = [-1, 2.5, 5]
colors = ['red', 'green', 'blue']
results = []

plt.figure(figsize=(7, 4))
plt.plot(theta_range, nll_surface, 'k-', linewidth=2, label='Objective function')

for start, color in zip(starting_points, colors):
    result = optimize.minimize(bimodal_nll, x0=[start], method='L-BFGS-B')
    results.append(result.x[0])
    plt.scatter([start], [bimodal_nll(start)], color=color, s=100, 
                marker='o', label=f'Start: {start:.1f}')
    plt.scatter([result.x[0]], [result.fun], color=color, s=100, 
                marker='*', label=f'End: {result.x[0]:.2f}')
    
    # Draw arrow from start to end
    plt.annotate('', xy=(result.x[0], result.fun), 
                 xytext=(start, bimodal_nll(start)),
                 arrowprops=dict(arrowstyle='->', color=color, lw=2))

plt.xlabel('θ')
plt.ylabel('Negative log-likelihood')
plt.title('Local Optimization: Different Starting Points → Different Solutions')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Starting points and their corresponding local optima:")
for start, end in zip(starting_points, results):
    print(f"  Start: {start:4.1f} → End: {end:.3f}")

Starting points and their corresponding local optima:
  Start: -1.0 → End: 4.000
  Start:  2.5 → End: 1.000
  Start:  5.0 → End: 4.000

5.7 Chapter Summary and Connections

5.7.1 Key Concepts Review

We’ve explored two fundamental approaches to finding estimators in parametric models:

Parametric Models:

  • Assume data comes from a specific family of distributions \mathfrak{F} = \{f(x; \theta): \theta \in \Theta\}
  • Our job reduces to estimating the finite-dimensional parameter \theta
  • Often have parameters of interest and nuisance parameters

Method of Moments (MoM):

  • Match theoretical moments \mathbb{E}(X^j) with sample moments \frac{1}{n}\sum X_i^j
  • Simple to compute – just solve algebraic equations
  • Consistent and asymptotically normal
  • Not efficient, but excellent starting values for other methods

Maximum Likelihood Estimation (MLE):

  • Find parameters that make observed data most probable
  • Likelihood: \mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i; \theta)
  • Often requires numerical optimization
  • The gold standard of parametric estimation

Numerical Optimization:

  • Most MLEs require iterative algorithms
  • Gradient-based methods dominate (L-BFGS for small data, SGD/Adam for large)
  • Automatic differentiation (JAX) makes implementation easier
  • Local optima are a real concern – always try multiple starting points!

5.7.2 The Big Picture

This chapter revealed a fundamental connection: much of modern machine learning is secretly Maximum Likelihood Estimation:

  • Linear regression: MLE with normal errors
  • Logistic regression: MLE for Bernoulli responses
  • Neural networks: MLE with complex parametric models
  • Deep learning: MLE with stochastic optimization

The principles we’ve learned – likelihood, optimization, gradients – are the foundation of both classical statistics and modern ML.

5.7.3 Common Pitfalls to Avoid

  1. Confusing likelihood with probability: The likelihood is NOT a probability distribution over parameters
  2. Forgetting the negative sign: Optimizers minimize, so use negative log-likelihood
  3. Assuming analytical solutions exist: Most real problems require numerical methods
  4. Trusting a single optimization run: Did I mention that you should always try multiple starting points?
  5. Ignoring convergence warnings: Check optimizer status and diagnostics

5.7.4 Chapter Connections

  • Previous: Chapter 3 gave us convergence concepts and the framework for evaluating estimators. Now we know how to find estimators.
  • Next: Chapter 6 will explore the properties of these estimators in detail – bias, variance, efficiency – and prove that MLEs have optimal properties.
  • Bootstrap (Chapter 4): Provides a computational alternative to analytical standard errors for our estimators
  • Later chapters: These estimation principles extend to more complex models (regression, time series, etc.)

5.7.5 Self-Test Problems

  1. Method of Moments: For X_1, \ldots, X_n \sim \text{Uniform}(a, b), find the Method of Moments estimators for a and b.

    Hint: Use \mathbb{E}(X) = \frac{a+b}{2} and \mathbb{V}(X) = \frac{(b-a)^2}{12}.

  2. Maximum Likelihood: For X_1, \ldots, X_n \sim \text{Poisson}(\lambda), find the Maximum Likelihood Estimator for \lambda.

    Hint: The Poisson PMF is f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}.

  3. Numerical Optimization: The log-likelihood for logistic regression with one covariate is: \ell(y, x; \beta_0, \beta_1) = \sum_{i=1}^n \left[y_i(\beta_0 + \beta_1 x_i) - \log(1 + e^{\beta_0 + \beta_1 x_i})\right]

    Explain why you cannot find the MLE for (\beta_0, \beta_1) analytically and must use numerical optimization.

  4. Comparing Estimators: For X_1, \ldots, X_n \sim \text{Exponential}(\lambda):

    • Find the MoM estimator for \lambda
    • Find the MLE for \lambda
    • Are they the same? Why or why not?

5.7.6 Python and R Reference

#| eval: false
import numpy as np
from scipy import stats, optimize
import jax
import jax.numpy as jnp

# Method of Moments estimation
def method_of_moments_normal(data):
    """MoM for Normal distribution"""
    mu_hat = np.mean(data)
    sigma2_hat = np.mean((data - mu_hat)**2)
    return mu_hat, np.sqrt(sigma2_hat)

# Maximum Likelihood with scipy
def mle_poisson(data):
    """MLE for Poisson distribution"""
    # For Poisson, MLE is just the sample mean
    return np.mean(data)

# General MLE with numerical optimization
def neg_log_likelihood(params, data, dist_logpdf):
    """Generic negative log-likelihood using log-pdf for numerical stability"""
    return -np.sum(dist_logpdf(data, *params))

# Using scipy.optimize
data = stats.gamma.rvs(a=3, scale=2, size=100)
result = optimize.minimize(
    fun=lambda params: neg_log_likelihood(params, data, 
                                         lambda x, a, b: stats.gamma.logpdf(x, a, scale=b)),
    x0=[1.0, 1.0],  # Initial guess
    method='L-BFGS-B',
    bounds=[(0.01, None), (0.01, None)]  # Parameters must be positive
)

# Using JAX for automatic differentiation
def neg_log_lik_jax(params, data):
    alpha, beta = params
    # Use JAX's scipy.stats
    return -jnp.sum(jax.scipy.stats.gamma.logpdf(data, alpha, scale=beta))

# Get gradient automatically
grad_nll = jax.grad(neg_log_lik_jax)

# Optimize with gradient
result_jax = optimize.minimize(
    fun=lambda p: float(neg_log_lik_jax(jnp.array(p), data)),
    x0=jnp.array([1.0, 1.0]),
    method='L-BFGS-B',
    jac=lambda p: np.array(grad_nll(jnp.array(p), data)),
    bounds=[(0.01, None), (0.01, None)]
)
#| eval: false
# Method of Moments estimation
method_of_moments_normal <- function(data) {
  mu_hat <- mean(data)
  sigma2_hat <- mean((data - mu_hat)^2)
  c(mu = mu_hat, sigma = sqrt(sigma2_hat))
}

# Maximum Likelihood with built-in functions
library(MASS)  # For fitdistr
data <- rgamma(100, shape = 3, scale = 2)
# Note: fitdistr estimates shape and rate (where rate = 1/scale)
mle_fit <- fitdistr(data, "gamma")

# Manual MLE with optim
neg_log_likelihood <- function(params, data) {
  alpha <- params[1]
  beta <- params[2]
  if (alpha <= 0 || beta <= 0) return(Inf)
  -sum(dgamma(data, shape = alpha, scale = beta, log = TRUE))
}

# Optimize
result <- optim(
  par = c(1, 1),  # Initial values
  fn = neg_log_likelihood,
  data = data,
  method = "L-BFGS-B",
  lower = c(0.01, 0.01)  # Lower bounds
)

# Using gradient information (numerical approximation)
library(numDeriv)
result_with_grad <- optim(
  par = c(1, 1),
  fn = neg_log_likelihood,
  gr = function(p, data) grad(neg_log_likelihood, p, data = data),
  data = data,
  method = "L-BFGS-B",
  lower = c(0.01, 0.01)
)

# For more complex models, use specialized packages
library(bbmle)  # For mle2 function
mle2_fit <- mle2(
  minuslogl = function(shape, scale) {
    -sum(dgamma(data, shape = shape, scale = scale, log = TRUE))
  },
  start = list(shape = 1, scale = 1),
  method = "L-BFGS-B",
  lower = c(shape = 0.01, scale = 0.01)
)

5.7.7 Further Materials

  • Bayesian Optimization: An interactive exploration on Distill (Agnihotri & Batra, 2020)
  • Classic reference: Casella & Berger, “Statistical Inference”, Chapter 7
  • Modern perspective: Efron & Hastie, “Computer Age Statistical Inference”, Chapter 4

Remember: Parametric inference is about making assumptions (choosing a model) and then finding the best parameters within that model. The Method of Moments is simple but not optimal. Maximum Likelihood is the gold standard but often requires numerical optimization. Master these concepts – they’re the foundation of statistical modeling and machine learning!


  1. In the next chapter, we’ll explore a special optimization algorithm for MLE called the EM (Expectation-Maximization) algorithm, which is particularly useful for models with latent variables or missing data.↩︎

  2. Generally, evaluating a gradient in D dimensions with finite differences will need at least D + 1 evaluations.↩︎

  3. For numerical stability, it has become common to use the softplus instead of \exp to ensure positivity.↩︎