1  Probability Foundations

Published

July 23, 2025

1.1 Learning Objectives

After completing this chapter, you will be able to:

  • Explain the role of probability in data science and statistical modeling.
  • Apply fundamental probability concepts, including sample spaces, events, and the axioms of probability.
  • Calculate probabilities using independence, conditional probability, and Bayes’ theorem.
  • Distinguish between discrete and continuous random variables and use their distribution functions (PMF, PDF, CDF).
  • Describe and apply key univariate and multivariate distributions (e.g., Binomial, Normal, Multinomial).
Note

This chapter covers fundamental probability concepts essential for statistical inference. The material is adapted and expanded from Chapters 1 and 2 of Wasserman (2013), which interested readers are encouraged to consult directly for a more rigorous and comprehensive treatment.

1.2 Why Do We Need Statistics?

1.2.1 Statistics and Machine Learning in Data Science

Machine learning (ML) has revolutionized our ability to make predictions. Given enough training data, modern ML models can achieve remarkable accuracy on unseen data that resembles what they’ve been trained on. But there’s a crucial limitation: these models excel when the new data comes from the same distribution as the training data.

How do we move beyond this constraint to make reliable predictions in the real world, where conditions change and data can be messy, incomplete, or collected differently than our training set?

This is where statistics becomes essential. Statistics provides the tools to:

  • Understand principles of data collection: How was the data gathered? What biases might exist?
  • Plan data collection strategically: Design experiments and surveys that yield meaningful insights
  • Deal with missing data: Real-world data is rarely complete - we need principled ways to handle gaps
  • Understand causality in modeling: Correlation isn’t causation, and confounding variables can mislead us

Without these statistical foundations, even the most sophisticated ML models can fail spectacularly when deployed in practice.

1.2.2 Case Study: IBM Watson Health

The story of IBM Watson Health illustrates why statistical thinking is crucial for real-world AI applications.

In 2011, IBM Watson made headlines by defeating human champions on the quiz show Jeopardy! This victory showcased the power of natural language processing and knowledge retrieval. IBM saw an opportunity: if Watson could master general knowledge, why not train it to be a doctor?

Watson Health launched in 2015 with an ambitious goal: use data from top US hospitals to train an AI system that could diagnose and treat patients anywhere in the world. The vision was compelling - bring world-class medical expertise to underserved areas through AI.

Over the years, IBM:

  • Spent over $4 billion on acquisitions
  • Employed 7,000 people developing the system
  • Partnered with prestigious medical institutions

Yet by 2022, IBM sold Watson Health’s data and assets for just $1 billion - a massive loss. What went wrong?

The fundamental issue was data representativeness. Watson Health was trained on data from elite US hospitals treating specific patient populations. But this data didn’t represent:

  • Different healthcare systems and practices globally
  • Diverse patient populations with varying genetics, lifestyles, and environmental factors
  • Resource constraints in different settings
  • Variations in how medical data is recorded and coded

This failure wasn’t due to inadequate machine learning algorithms - it was a failure to apply statistical thinking about data collection, representation, and generalization. No amount of computational power can overcome fundamentally biased or unrepresentative data.

Read more in this Slate article.

1.2.3 Two Cultures of Statistical Modeling

In his influential essay Breiman (2001), statistician Leo Breiman identified two distinct approaches to statistical modeling, each with different goals and philosophies. These are often cast as the two approaches of prediction vs. explanation.

Feature The Algorithmic Modeling Culture The Data Modeling Culture
Goal Accurate prediction Understanding mechanisms
Approach Treat the mapping from inputs to outputs as a black box Specify interpretable models that represent how nature works
Validation Predictive accuracy on held-out data Statistical tests, confidence intervals, model diagnostics
Philosophy “It doesn’t matter how it works, as long as it works” “We need to understand which factors matter and why”
Examples Deep learning, random forests, boosting Linear regression, causal inference, experimental design

Think of these two cultures like different approaches to cooking:

The algorithmic approach is like following a top-rated recipe - you don’t know why you fold (not stir) the batter or why ingredients must be room temperature, but following the steps precisely often produces better results than many trained cooks achieve. You can pick 5-star recipes and succeed without any cooking knowledge.

The data modeling approach is like understanding food science - you know about Maillard reactions, gluten development, and emulsification. But translating this into a great dish is slow and complex. You might spend hours calculating optimal ratios only to produce something inferior to what a simple recipe would have given you.

This creates a fundamental tension: The recipe follower often produces better food faster. The food scientist understands why things work and with time might produce a breakthrough - but may struggle to match the empirical success of well-tested recipes. In machine learning, this same tension exists - a neural network might predict customer behavior better than any theory-based model, even if we don’t understand why. Sometimes, letting algorithms find patterns empirically works better than imposing our theoretical understanding. However, understanding often gives us an edge to build better algorithms and generalize to novel scenarios.

Formally, both cultures can be seen as addressing the problem of characterizing a mapping: \[X \rightarrow Y\]

where \(X\) represents input features and \(Y\) represents the response.

Algorithmic approach (example): Find a function \(\hat{f}\) that minimizes prediction error. A common approach is to find the function that minimizes the mean squared error (MSE), \[\text{MSE} = \frac{1}{N} \sum_{n=1}^N (Y_n - \hat{f}(X_n))^2\]

that is make the squared difference between the actual outcome \(Y_n\) and the prediction \(\hat{f}(X_n)\) as small as possible over the available training data. In practice, we often report the root mean squared error (RMSE) = \(\sqrt{\text{MSE}}\), which has the same units as \(Y\). We don’t care about what the function \(\hat{f}\) looks like, as long as it minimizes this error.

Data modeling approach (example): Build a mechanistic model \(Y = f(X; \theta) + \epsilon\) where:

  • \(f\) has a specific, interpretable form
  • \(\theta\) are parameters with scientific, interpretable meaning
  • \(\epsilon\) represents random error

While fitting the model to data often still involves optimizing some objective, the goal here is to study the best-fitting parameters \(\theta\), or find the best model \(f\) among a set of competing hypotheses.

We compare here the two approaches represented by a random forest (RF) model and a linear regression model. The former represents a traditional machine learning approach, while the latter is a staple of statistical modelling.

A trained random forest model is harder to interpret, hence falls in the “algorithmic” camp for the purpose of this example. Conversely, a fitted linear regression model yields interpretable weights which directly tell us how the features linearly affect the response, so it represents the data modeling camp.

# Comparing algorithmic vs data modeling approaches
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm

# Load synthetic housing price data with complex patterns
# File available here: 
# https://raw.githubusercontent.com/lacerbi/stats-for-ds-website/refs/heads/main/data/housing_prices.csv
df = pd.read_csv('../data/housing_prices.csv')

# Prepare features and target
features = ['size_sqft', 'bedrooms', 'age_years', 'location_score', 
            'garage_spaces', 'has_pool', 'crime_rate', 'school_rating']
X = df[features]
y = df['price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Dataset: {X.shape[0]:,} houses with {X.shape[1]} features")
print(f"Training on {len(X_train):,} houses, testing on {len(X_test):,}")
print(f"\nAverage house price: €{y.mean():,.0f}")
print(f"Price standard deviation: €{y.std():,.0f}")

# ALGORITHMIC APPROACH: Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Calculate metrics
from sklearn.metrics import mean_squared_error
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))

print("\n=== ALGORITHMIC APPROACH (Random Forest) ===")
print(f"Root Mean Squared Error (RMSE): €{rf_rmse:,.0f}")
print("\nFeature Importances:")
for feature, importance in zip(features, rf_model.feature_importances_):
    print(f"  {feature}: {importance:.3f}")
Dataset: 5,000 houses with 8 features
Training on 4,000 houses, testing on 1,000

Average house price: €439,750
Price standard deviation: €108,720

=== ALGORITHMIC APPROACH (Random Forest) ===
Root Mean Squared Error (RMSE): €32,608

Feature Importances:
  size_sqft: 0.125
  bedrooms: 0.022
  age_years: 0.022
  location_score: 0.047
  garage_spaces: 0.013
  has_pool: 0.007
  crime_rate: 0.018
  school_rating: 0.746
# DATA MODELING APPROACH: Linear Regression
# Add constant term for intercept
X_train_lm = sm.add_constant(X_train)
X_test_lm = sm.add_constant(X_test)

# Fit linear model
lm_model = sm.OLS(y_train, X_train_lm)
lm_results = lm_model.fit()

# Make predictions
lm_predictions = lm_results.predict(X_test_lm)
lm_rmse = np.sqrt(mean_squared_error(y_test, lm_predictions))

print("\n=== DATA MODELING APPROACH (Linear Regression) ===")
print(f"Root Mean Squared Error (RMSE): €{lm_rmse:,.0f}")

# Show interpretable coefficients
print("\nLinear Model Coefficients:")
coef_df = pd.DataFrame({
    'Feature': ['Intercept'] + features,
    'Coefficient': lm_results.params,
    'Std Error': lm_results.bse,
    'P-value': lm_results.pvalues
})
coef_df['Significant'] = coef_df['P-value'] < 0.05
print(coef_df.to_string(index=False))

print("\n=== INTERPRETATION ===")
print("Linear model suggests:")
for i, feature in enumerate(features):
    coef = lm_results.params[i+1]  # +1 to skip intercept
    if abs(coef) > 100:
        print(f"- Each unit increase in {feature}: €{coef:,.0f} change in price")
print("\nBUT: The model performs (slightly) worse than Random Forest.")
print(f"RF RMSE: €{rf_rmse:,.0f} vs Linear RMSE: €{lm_rmse:,.0f}")
print(f"That's €{lm_rmse - rf_rmse:,.0f} worse prediction error.")
print(f"Should we care more about prediction or understanding?")

=== DATA MODELING APPROACH (Linear Regression) ===
Root Mean Squared Error (RMSE): €34,061

Linear Model Coefficients:
       Feature    Coefficient   Std Error       P-value  Significant
     Intercept -241058.748914 4060.095405  0.000000e+00         True
     size_sqft      72.843101    1.113174  0.000000e+00         True
      bedrooms   17505.976934  561.915079 6.369481e-191         True
     age_years      66.884992   37.769905  7.666128e-02        False
location_score    6326.172511  191.797597 3.372091e-211         True
 garage_spaces   15599.005835  602.273749 7.547388e-137         True
      has_pool   29992.244032 1362.847287 2.110344e-101         True
    crime_rate   -1268.321711  110.805186  7.133961e-30         True
 school_rating   66814.033458  394.636343  0.000000e+00         True

=== INTERPRETATION ===
Linear model suggests:
- Each unit increase in bedrooms: €17,506 change in price
- Each unit increase in location_score: €6,326 change in price
- Each unit increase in garage_spaces: €15,599 change in price
- Each unit increase in has_pool: €29,992 change in price
- Each unit increase in crime_rate: €-1,268 change in price
- Each unit increase in school_rating: €66,814 change in price

BUT: The model performs (slightly) worse than Random Forest.
RF RMSE: €32,608 vs Linear RMSE: €34,061
That's €1,453 worse prediction error.
Should we care more about prediction or understanding?

Both approaches have their place in modern data science. The algorithmic culture has driven breakthroughs in areas like computer vision and natural language processing, where prediction accuracy is paramount. For example, large language models (LLMs) are massively large deep neural networks (pre)trained with the extremely simple objective of just “predicting the next word”1 – without any attempt at understanding the underlying process.

The data modeling culture remains essential for scientific understanding, policy decisions, and any application where we need to know not just what will happen, but why. For LLMs, and in ML more broadly, this aspect is studied by the field of interpretability or “explainable ML” – trying to understand how modern ML models “think” and reach their conclusions.

1.3 Foundations of Probability

Probability provides the mathematical language for quantifying uncertainty. Before we can make statistical inferences or build predictive models, we need a solid foundation in probability theory.

This course is taught internationally, but for Finnish-speaking students, here’s a reference table of key probability terms you may have encountered in your earlier studies:

English Finnish Context
Sample space Perusjoukko, otosavaruus The set of all possible outcomes
Event Tapahtuma A subset of the sample space
Probability distribution Todennäköisyysjakauma Assignment of probabilities to events
Probability measure Todennäköisyysmitta Mathematical function P satisfying axioms
Independent Riippumattomat Events that don’t influence each other
Conditional probability Ehdollinen todennäköisyys Probability given some information
Bayes’ Theorem Bayesin kaava Formula for updating probabilities
Random variable Satunnaismuuttuja Function mapping outcomes to numbers
Cumulative distribution function (CDF) Kertymäfunktio P(X \le x)
Discrete Diskreetti Taking countable values
Probability mass function (PMF) Todennäköisyysmassafunktio P(X = x) for discrete X
Probability density function (PDF) Tiheysfunktio Density for continuous variables
Quantile function Kvantiilifunktio Inverse of CDF
First quartile Ensimmäinen kvartiili 25th percentile
Median Mediaani 50th percentile
Joint density function Yhteistiheysfunktio PDF for multiple variables
Marginal density Reunatiheysfunktio PDF of one variable from joint
Conditional density Ehdollinen tiheysfunktio PDF given another variable’s value
Random vector Satunnaisvektori Vector of random variables
Independent and identically distributed (IID) Riippumattomat ja samoin jakautuneet Common assumption for data
Random sample Satunnaisotos IID observations from population
Frequentist probability Frekventistinen todennäköisyys Long-run frequency interpretation
Subjective probability Subjektiivinen todennäköisyys Degree of belief interpretation

Note: Some terms have multiple Finnish translations. We report here the most common ones.

1.3.1 Sample Spaces and Events

The sample space \Omega is the set of all possible outcomes of an experiment. Individual elements \omega \in \Omega are called sample outcomes, realizations, or just elements. Subsets of \Omega are called events.

Notation: \omega and \Omega are the lowercase (respectively, uppercase) version of the Greek letter omega.

Example: Coin flips

If we flip a coin twice, where each outcome can be head (H) or tails (T), the sample space is: \Omega = \{HH, HT, TH, TT\}

The event “first flip is heads” is A = \{HH, HT\}.

Example: Temperature measurement

When measuring temperature, the sample space might be the full set of real numbers: \Omega = \mathbb{R} = (-\infty, \infty)

The event “temperature between 20°C and 25°C” is the interval A = [20, 25].

Note that we often take \Omega to be larger than strictly necessary - in this case for example we are including physically impossible values like -1000°C. This is still mathematically valid. As we will see later, we can assign zero probability to impossible events.

Notation: [a, b] denotes the interval between a and b (included), whereas (a, b) is the interval between a and b (excluded).

Sample spaces can be:

  • Finite: \Omega = \{1, 2, 3, 4, 5, 6\} for a die roll
  • Countably infinite: \Omega = \{1, 2, 3, ...\} for “number of flips until first heads”
  • Uncountably infinite: \Omega = [0, 1] for “random number between 0 and 1”

1.3.2 Set Operations for Events

Since events are sets, we can combine them using standard set operations:

Operation Notation Meaning
Complement A^c “not A” - all outcomes not in A
Union A \cup B “A or B” - outcomes in either A or B (or both)
Intersection A \cap B “A and B” - outcomes in both A and B
Difference A \setminus B Outcomes in A but not in B
Note

Disjoint events: Events A and B are disjoint (or mutually exclusive) if A \cap B = \emptyset. This means they cannot occur simultaneously. For example, in the case of a standard six-sided die roll, let A = “rolling an even number” = \{2, 4, 6\} and B = “rolling a 1” = \{1\}. These events are disjoint because you can’t roll both an even number AND a 1 simultaneously.

1.3.3 Probability Axioms

Now that we have defined the space of possible events, we can define the probability of an event. A probability measure must satisfy three fundamental axioms:

A function \mathbb{P} that assigns a real number \mathbb{P}(A) to each event A is a probability measure if:

  1. Non-negativity: \mathbb{P}(A) \geq 0 for every event A
  2. Normalization: \mathbb{P}(\Omega) = 1
  3. Countable additivity: If A_1, A_2, ... are disjoint events, then: \mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)

These axioms ensure probability respects intuitive laws:

  1. Non-negativity: The probability of rain tomorrow cannot be negative.
  2. Normalization: When rolling a six-sided die, the probability of getting one of the faces \{1, 2, 3, 4, 5, 6\} is 1: 1 represent the total probability.
  3. Countable additivity: The probability of rolling a 1 or a 2 on a die is the sum of their individual probabilities, as these events cannot happen together.

It turns out that under some assumptions, it can be shown that these axioms are exactly what you would pick if one wants to quantify the concept of “possibility of an event” with a single number – a result known as Cox’s theorem.

From these axioms, we can derive many useful properties:

  • \mathbb{P}(\emptyset) = 0 (the impossible event has probability 0)
  • \mathbb{P}(A^c) = 1 - \mathbb{P}(A) (complement rule)
  • If A \subset B, then \mathbb{P}(A) \leq \mathbb{P}(B) (monotonicity)
  • 0 \leq \mathbb{P}(A) \leq 1 for any event A

For any events A and B: \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)

Why? This formula accounts for the “double counting” when we add \mathbb{P}(A) and \mathbb{P}(B) – the intersection A \cap B gets counted twice, so we subtract it once.

Note

Sometimes you will see the notation \mathbb{P}(A B) to denote \mathbb{P}(A \cap B).

1.3.4 Interpretations of Probability

Probability can be understood from different philosophical perspectives, each leading to the same mathematical framework.

There are two main ways to think about what probability means:

Frequency interpretation: Probability is the long-run proportion of times an event occurs in repeated experiments. If we flip a fair coin millions of times, we expect heads about 50% of the time.

Subjective interpretation: Probability represents a degree of belief. When a weather forecaster says “30% chance of rain,” they’re expressing their confidence based on available information.

Both interpretations are useful, and both lead to the same mathematical rules.

Frequentist probability: \[\mathbb{P}(A) = \lim_{n \to \infty} \frac{\text{number of times A occurs in n trials}}{n}\]

This requires the experiment to be repeatable under identical conditions.

Subjective/Bayesian probability: \(\mathbb{P}(A)\) quantifies an agent’s degree of belief that \(A\) is true, constrained by:

  • Coherence: beliefs must satisfy the probability axioms
  • Updating: beliefs change rationally when new information arrives (via Bayes’ theorem)

Let’s simulate the frequentist interpretation by flipping a fair coin many times and tracking how the proportion of heads converges to the true probability of 0.5. This directly demonstrates the mathematical definition: probability as the long-run proportion.

import numpy as np
import matplotlib.pyplot as plt

# Simulate many coin flips to see frequentist convergence
np.random.seed(42)
n_flips = 10000
flips = np.random.choice(['H', 'T'], size=n_flips)

# Calculate running proportion of heads
heads_count = np.cumsum(flips == 'H')
proportions = heads_count / np.arange(1, n_flips + 1)

# Plot convergence to true probability
plt.figure(figsize=(7, 4))
plt.plot(proportions, linewidth=2, alpha=0.8, color='blue')
plt.axhline(y=0.5, color='red', linestyle='--', label='True probability = 0.5', linewidth=2)
plt.xlabel('Number of flips')
plt.ylabel('Proportion of heads')
plt.title('Frequentist Probability: Long-Run Proportion Converges to True Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0.3, 0.7)
plt.show()

# Print summary
print(f"After {n_flips:,} flips:")
print(f"Proportion of heads: {proportions[-1]:.4f}")
print(f"Deviation from 0.5: {abs(proportions[-1] - 0.5):.4f}")

After 10,000 flips:
Proportion of heads: 0.5013
Deviation from 0.5: 0.0013

The subjective/Bayesian interpretation involves updating beliefs based on evidence. We’ll explore this computational approach in detail when we cover Bayes’ theorem.

1.3.5 Finite Sample Spaces and Counting

When \Omega is finite and all outcomes are equally likely, probability calculations reduce to counting:

\mathbb{P}(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of outcomes in A}}{\text{total number of outcomes}}

Example: Rolling two dice

What’s the probability the sum of rolling two six-sided dice equals 7?

\Omega = \{(i,j) : i,j \in \{1,2,3,4,5,6\}\}, so |\Omega| = 36.

The event “sum equals 7” is A = \{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}.

Therefore \mathbb{P}(A) = \frac{6}{36} = \frac{1}{6}.

Key counting principle - Binomial Coefficient:

The binomial coefficient (read as “n choose k”) is:

\binom{n}{k} = \frac{n!}{k!(n-k)!}

Where n! denotes the factorial, e.g. 4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24.

The binomial coefficient2 counts the number of ways to choose k objects from n objects when order doesn’t matter. For example:

  • \binom{5}{2} = \frac{5!}{2!3!} = \frac{5 \times 4}{2 \times 1} = 10 ways to choose 2 items from 5
  • Choosing 2 students from a class of 30: \binom{30}{2} = 435 possible pairs

1.4 Independence and Conditional Probability

1.4.1 Independent Events

Two events A and B are independent if: \mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B)

We denote this as A \perp\!\!\!\perp B. When events are not independent, we write A \not\perp\!\!\!\perp B.

Independence means that knowing whether one event occurred tells us nothing about the other event.

Note

The textbook uses non-standard notation for independence and non-independence. We use the standard notation A \perp\!\!\!\perp B (and A \not\perp\!\!\!\perp B for dependence), which is widely adopted in probability and statistics literature.

Example: Fair coin tosses

A fair coin is tossed twice. Let H_1 = “first toss is heads” and H_2 = “second toss is heads”. Are these two events independent?

  • \mathbb{P}(H_1) = \mathbb{P}(H_2) = \frac{1}{2}
  • \mathbb{P}(H_1 \cap H_2) = \mathbb{P}(\text{both heads}) = \frac{1}{4}
  • Since \frac{1}{4} = \frac{1}{2} \times \frac{1}{2}, the events are independent

This matches the intuition – whether we obtain head on the first flip does not tell us anything about the second flip, and vice versa.

Example: Dependent events

Draw two cards from a deck without replacement.

  • A = “first card is an ace”
  • B = “second card is an ace”

Are these events independent?

  • \mathbb{P}(A) = \mathbb{P}(B) = \frac{4}{52}
  • However: \mathbb{P}(A \cap B) = \frac{4}{52} \times \frac{3}{51} \neq \mathbb{P}(A)\mathbb{P}(B)

The events are dependent because drawing an ace first changes the probability of drawing an ace second.

Warning

Common misconception: Disjoint events are NOT independent!

If A and B are disjoint with positive probability, then:

  • \mathbb{P}(A \cap B) = 0 (since they are disjoint, they can’t happen together)
  • \mathbb{P}(A)\mathbb{P}(B) > 0 (if both have positive probability)

So disjoint events are actually maximally dependent - if one occurs, the other definitely doesn’t!

1.4.2 Conditional Probability

The conditional probability of A given B is: \mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} provided \mathbb{P}(B) > 0.

Think of \mathbb{P}(A|B) as the probability of A in the “new universe” where we know B has occurred.

Caution

Prosecutor’s Fallacy: Confusing \mathbb{P}(A|B) with \mathbb{P}(B|A).

These can be vastly different! For example:

  • \mathbb{P}(\text{match} | \text{guilty}) might be 0.98.
  • \mathbb{P}(\text{guilty} | \text{match}) might be 0.04.

The second depends on the prior probability of guilt and how many innocent people might also match. We will see next how to compute one from the other.

1.4.3 Bayes’ Theorem

Sometimes we know \mathbb{P}(B|A) but we are really interested in the other way round, \mathbb{P}(A|B).

For example, in the example above, we may know the probability that a test gives a match if the suspect is guilty, \mathbb{P}(\text{match} \mid \text{guilty}), but what we really want to know is the probability that the suspect is guilty given that we find a match, \mathbb{P}(\text{guilty} \mid \text{match}).

Such “inverse” conditional probabilities can be calculated via Bayes’ theorem.

For events A and B with \mathbb{P}(B) > 0: \mathbb{P}(A|B) = \frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}

Where B is some information (evidence) and A an hypothesis.

If A_1, ..., A_k partition \Omega (they’re disjoint and cover all possibilities), then: \mathbb{P}(B) = \sum_{i=1}^k \mathbb{P}(B|A_i)\mathbb{P}(A_i)

Combining these gives the full form of Bayes’ theorem: \mathbb{P}(A_i|B) = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_j \mathbb{P}(B|A_j)\mathbb{P}(A_j)}

Terminology:

  • \mathbb{P}(A_i): Prior probability for hypothesis A_i (before seeing evidence B), also known as “base rate”.
  • \mathbb{P}(A_i|B): Posterior probability (after seeing evidence B).
  • \mathbb{P}(B|A_i): Likelihood of hypothesis A_i for fixed evidence B.
Example: Email spam detection
  • Prior: 70% of emails are spam
  • “Free” appears in 90% of spam emails
  • “Free” appears in 1% of legitimate emails

If an email contains “free”, what’s the probability it’s spam?

Let S = “email is spam” and F = “email contains ‘free’”.

Given:

  • \mathbb{P}(S) = 0.7
  • \mathbb{P}(F|S) = 0.9
  • \mathbb{P}(F|S^c) = 0.01

By Bayes’ theorem: \mathbb{P}(S|F) = \frac{0.9 \times 0.7}{0.9 \times 0.7 + 0.01 \times 0.3} = \frac{0.63}{0.633} \approx 0.995

So an email containing “free” has a 99.5% chance of being spam!

Conditional probability answers questions like:

  • What’s the probability of rain given that it’s cloudy?
  • What’s the probability a patient has a disease given a positive test result?
  • What’s the probability a user clicks an ad given they’re on a mobile device?

The key insight: additional information changes probabilities. Knowing that \(B\) occurred restricts our attention to outcomes where \(B\) is true, potentially changing how likely \(A\) becomes.

Some conditional probabilities are easier to compute than others. For example, we may know that if the patient has a disease, then the test will return positive with a certain probability. However, to compute the “inverse” probability (if the test is positive, what’s the probability of the patient having the disease?) we need Bayes’ theorem.

For fixed \(B\) with \(\mathbb{P}(B) > 0\), the conditional probability \(\mathbb{P}(\cdot|B)\) is itself a probability measure:

  1. \(\mathbb{P}(A|B) \geq 0\) for all \(A\)
  2. \(\mathbb{P}(\Omega|B) = 1\)
  3. If \(A_1, A_2, ...\) are disjoint, then \(\mathbb{P}(\bigcup_i A_i|B) = \sum_i \mathbb{P}(A_i|B)\)

Key relationships:

  • If \(A \perp\!\!\!\perp B\), then \(\mathbb{P}(A|B) = \mathbb{P}(A)\) (independence means conditioning doesn’t matter)
  • \(\mathbb{P}(A \cap B) = \mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A)\) (multiplication rule)

Conversely, in general \(\mathbb{P}(A|\cdot)\) (the likelihood) is not a probability measure.

We will visualize how conditional probability can be counterintuitive. We’ll simulate a medical test scenario to show how base rates affect our interpretation of test results.

import numpy as np
import matplotlib.pyplot as plt

# Medical test scenario
p_disease = 0.001                   # 0.1% have the disease (base rate)
p_pos_given_disease = 0.99          # 99% sensitivity
p_neg_given_healthy = 0.99          # 99% specificity

# Calculate probability of positive test
p_healthy = 1 - p_disease
p_pos_given_healthy = 1 - p_neg_given_healthy
p_positive = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Apply Bayes' theorem: P(disease|positive)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive

# Visualize with different base rates
base_rates = np.logspace(-4, -1, 50)  # 0.01% to 10%
posterior_probs = []

for base_rate in base_rates:
    p_pos = p_pos_given_disease * base_rate + p_pos_given_healthy * (1 - base_rate)
    posterior = (p_pos_given_disease * base_rate) / p_pos
    posterior_probs.append(posterior)

plt.figure(figsize=(7, 5))
plt.semilogx(base_rates * 100, np.array(posterior_probs) * 100, linewidth=3)
plt.axvline(x=0.1, color='red', linestyle='--', label=f'Current: {p_disease_given_pos:.1%} chance')
plt.xlabel('Disease Prevalence (%)')
plt.ylabel('P(Disease | Positive Test) (%)')
plt.title('Impact of Base Rate on Test Interpretation')
plt.ylim(0, 100)  # Set y-axis range from 0 to 100
plt.xticks([0.01, 0.1, 1, 10], ['0.01', '0.1', '1', '10'])  # Set custom x-axis ticks
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

print(f"With 99% accurate test and 0.1% base rate:")
print(f"P(disease | positive test) = {p_disease_given_pos:.1%}")
print(f"Surprising: A positive test means only ~9% chance of disease!")

With 99% accurate test and 0.1% base rate:
P(disease | positive test) = 9.0%
Surprising: A positive test means only ~9% chance of disease!

1.4.4 Classic Probability Examples

Let’s work through some classic examples that illustrate key concepts:

Example: At least one head in 10 flips

What’s the probability of getting at least one head in 10 coin flips?

Hint: Instead of counting all the ways to get 1, 2, …, or 10 heads, use the complement.

\mathbb{P}(\text{at least one head}) = 1 - \mathbb{P}(\text{no heads}) = 1 - \mathbb{P}(\text{all tails})

Since flips are independent: \mathbb{P}(\text{all tails}) = \left(\frac{1}{2}\right)^{10} = \frac{1}{1024}

Therefore: \mathbb{P}(\text{at least one head}) = 1 - \frac{1}{1024} \approx 0.999

Example (advanced): Basketball competition

Two players take turns shooting. Player A shoots first with probability 1/3 of scoring. Player B shoots second with probability 1/4. First to score wins. What’s the probability A wins?

A wins if:

  • A scores on first shot: probability 1/3
  • Both miss, then A scores: (2/3)(3/4)(1/3)
  • Both miss twice, then A scores: (2/3)(3/4)(2/3)(3/4)(1/3)

This is a geometric series: \mathbb{P}(A \text{ wins}) = \frac{1}{3} \sum_{k=0}^{\infty} \left(\frac{2}{3} \cdot \frac{3}{4}\right)^k = \frac{1}{3} \cdot \frac{1}{1-\frac{1}{2}} = \frac{2}{3}

1.5 Random Variables

So far, we’ve worked with events - subsets of the sample space. But in practice, we usually care about numerical quantities associated with random outcomes. This is where random variables come in.

1.5.1 Definition and Intuition

A random variable is a function X: \Omega \rightarrow \mathbb{R} that assigns a real number to each outcome in the sample space.

A random variable is defined by its possible values (real numbers) and their probabilities.

In the case of a discrete random variable, the set of values is discrete (finite or infinite), x_1, \ldots, and each value can be assigned a corresponding point probability p_1, \ldots with 0 \le p_i \le 1, \sum_{i=1}^\infty p_i = 1.

In the case of a continuous random variable, probabilities are defined by a non-negative probability density function that integrates to 1.

A random variable is just a way to assign numbers to outcomes. Think of it as a measurement or quantity that depends on chance.

Examples:

  • Number of heads in 10 coin flips
  • Time until next customer arrives
  • Temperature at noon tomorrow
  • Stock price at market close

The key insight: once we have numbers, we can do arithmetic, calculate averages, measure spread, and use all the tools of mathematics.

Warning: The following will likely make sense only if you have taken an advanced course in probability theory or measure theory. Feel free to skip it otherwise.

Formally, \(X\) is a measurable function from \((\Omega, \mathcal{F})\) to \((\mathbb{R}, \mathcal{B})\) where:

  • \(\mathcal{F}\) is the \(\sigma\)-algebra of events in \(\Omega\)
  • \(\mathcal{B}\) is the Borel \(\sigma\)-algebra on \(\mathbb{R}\)

Measurability means: for any Borel set \(B \subset \mathbb{R}\), the pre-image \(X^{-1}(B) = \{\omega : X(\omega) \in B\}\) is an event in \(\mathcal{F}\).

This technical condition ensures we can compute probabilities like \(\mathbb{P}(X \in B)\).

Here we demonstrate how random variables map outcomes to numbers, allowing us to analyze randomness mathematically. For example, we can plot a histogram for the realizations over multiple experiments.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Demonstrate a random variable: X = number of heads in 10 coin flips
np.random.seed(42)

# Single experiment
flips = np.random.choice(['H', 'T'], size=10)
X = np.sum(flips == 'H')
print(f"Outcomes (single experiment): {flips}")
print(f"X (number of heads) = {X}")

# Simulate many experiments to see the distribution
n_sims = 10000
X_values = [np.sum(np.random.choice(['H', 'T'], size=10) == 'H') 
            for _ in range(n_sims)]

# Visualize distribution
plt.figure(figsize=(7, 4))
counts, bins, _ = plt.hist(X_values, bins=np.arange(0.5, 11.5, 1), density=True, 
                          alpha=0.7, edgecolor='black')

x = np.arange(0, 11)
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Random Variable X: Number of Heads in 10 Flips ({n_sims} experiments)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nAverage value: {np.mean(X_values):.3f} (theoretical: 5.0)")
Outcomes (single experiment): ['H' 'T' 'H' 'H' 'H' 'T' 'H' 'H' 'H' 'T']
X (number of heads) = 7

Average value: 4.993 (theoretical: 5.0)

Example: Coin flips

Within the same sample space we can define multiple distinct random variables.

For example, let \Omega = \{HH, HT, TH, TT\} (two flips). Define:

  • X = number of heads
  • Y = 1 if first flip is heads, 0 otherwise
  • Z = 1 if flips match, 0 otherwise

Then:

  • X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0
  • Y(HH) = 1, Y(HT) = 1, Y(TH) = 0, Y(TT) = 0
  • Z(HH) = 1, Z(HT) = 0, Z(TH) = 0, Z(TT) = 1
Tip

Notation convention:

  • Capital letters (X, Y, Z) denote random variables
  • Lowercase letters (x, y, z) denote specific values
  • \{X = x\} is the event that X takes value x

However, do not expect people to strictly follow this convention beyond mathematical and statistical textbooks. In the real world, you will often see “x” used to refer both to a value and to a random variable “X” that happens to take value x.

1.5.2 Cumulative Distribution Functions

The Cumulative Distribution Function (CDF) completely characterizes a random variable’s probability distribution.

The cumulative distribution function (CDF) of a random variable X is the function F_X(x): \mathbb{R} \rightarrow [0, 1] defined by F_X(x) = \mathbb{P}(X \leq x) for all x \in \mathbb{R}.

Example: Two coin flips

Let X = number of heads for two flips of fair coins.

  • \mathbb{P}(X = 0) = 1/4
  • \mathbb{P}(X = 1) = 1/2
  • \mathbb{P}(X = 2) = 1/4

The CDF is: F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1/4 & \text{if } 0 \leq x < 1 \\ 3/4 & \text{if } 1 \leq x < 2 \\ 1 & \text{if } x \geq 2 \end{cases}

Note: The CDF is defined for ALL real x, even though X only takes values 0, 1, 2!

Show code
import matplotlib.pyplot as plt
import numpy as np

# Define the CDF values
x_jumps = [0, 1, 2]  # Points where jumps occur
cdf_values = [0.25, 0.75, 1.0]  # CDF values after jumps
cdf_values_before = [0, 0.25, 0.75]  # CDF values before jumps

fig, ax = plt.subplots(figsize=(7, 4))

# Plot the step function
# Left segment (x < 0)
ax.hlines(0, -1, 0, colors='black', linewidth=2)

# Plot each segment
for i in range(len(x_jumps)):
    # Horizontal line segment
    if i < len(x_jumps) - 1:
        ax.hlines(cdf_values[i], x_jumps[i], x_jumps[i+1], colors='black', linewidth=2)
    else:
        # Last segment extends to the right
        ax.hlines(cdf_values[i], x_jumps[i], 3, colors='black', linewidth=2)
    
    # Open circles (at discontinuities, left endpoints)
    if i > 0:
        ax.plot(x_jumps[i], cdf_values_before[i], 'o', color='black', 
                markerfacecolor='white', markersize=8, markeredgewidth=2)
    
    # Filled circles (at jump points, right endpoints)
    ax.plot(x_jumps[i], cdf_values[i], 'o', color='black', 
            markerfacecolor='black', markersize=8)

# Open circle at x=0, y=0
ax.plot(0, 0, 'o', color='black', markerfacecolor='white', 
        markersize=8, markeredgewidth=2)

# Set axis properties
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('$F_X(x)$', fontsize=12)
ax.set_xlim(-1, 3)
ax.set_ylim(-0.1, 1.1)

# Set tick marks
ax.set_xticks([0, 1, 2])
ax.set_yticks([0.25, 0.50, 0.75, 1.0])

# Add grid
ax.grid(True, alpha=0.3)

# Add arrows to axes
ax.annotate('', xy=(3.2, 0), xytext=(3, 0),
            arrowprops=dict(arrowstyle='->', color='black', lw=1))
ax.annotate('', xy=(0, 1.15), xytext=(0, 1.1),
            arrowprops=dict(arrowstyle='->', color='black', lw=1))

plt.tight_layout()
plt.show()
Figure 1.1: Cumulative distribution function (CDF) for the number of heads when flipping a coin twice.

1.5.3 Discrete Random Variables

A random variable X is discrete if it takes countably many values \{x_1, x_2, ...\}. Its probability mass function (PMF) (sometimes just probability function) is defined as: f_X(x) = \mathbb{P}(X = x)

Properties of PMFs:

  • f_X(x) \geq 0 for all x
  • \sum_{i} f_X(x_i) = 1 (probabilities sum to 1)
  • F_X(x) = \sum_{x_i \leq x} f_X(x_i) (CDF is sum of PMF)
Show code
# PMF for coin flipping example
x_values = [0, 1, 2]
pmf_values = [0.25, 0.5, 0.25]

fig, ax = plt.subplots(figsize=(7, 4))

# Plot vertical lines from x-axis to probability values
for x, p in zip(x_values, pmf_values):
    ax.plot([x, x], [0, p], 'k-', linewidth=2)
    # Add filled circles at the top
    ax.plot(x, p, 'ko', markersize=8, markerfacecolor='black')

# Set axis properties
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('$f_X(x)$', fontsize=12)
ax.set_xlim(-0.5, 2.5)
ax.set_ylim(0, 1)

# Set tick marks
ax.set_xticks([0, 1, 2])
ax.set_yticks([0.25, 0.5, 0.75, 1])

# Add grid
ax.grid(True, alpha=0.3, axis='y')

# Add a horizontal line at y=0 for clarity
ax.axhline(y=0, color='black', linewidth=0.5)

plt.tight_layout()
plt.show()
Figure 1.2: Probability mass function (PMF) for the number of heads when flipping a coin twice.

1.5.4 Core Discrete Distributions

Note

Notation preview: We’ll use \mathbb{E}[X] to denote the expected value (mean) of a random variable X, and \text{Var}(X) or \sigma^2 for its variance (a measure of spread). These concepts will be covered in detail in Chapter 2 of the lecture notes.

Bernoulli Distribution

The Bernoulli distribution is the simplest non-trivial random variable – a single binary outcome with probability p \in [0, 1] of happening.

X \sim \text{Bernoulli}(p) if: f_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}

An outcome of X = 1 is often referred to as a “hit” or a “success”, while X = 0 is a “miss” or a “failure”.

Use cases:

  • Coin flip (heads/tails)
    • If p \neq 0.5, this is known as a biased coin (as opposed to a fair coin with p = 0.5)
    • Here what constitutes a “hit” and a “miss” is arbitrary!
  • Success/failure of a single trial
  • Binary classification (spam/not spam)
  • User clicks/doesn’t click an ad
Show code
# Bernoulli distribution PMF
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli

# Parameter
p = 0.3  # probability of success

# PMF visualization
fig, ax = plt.subplots(figsize=(7, 4))

# Plot PMF
x = [0, 1]
pmf = [1-p, p]
bars = ax.bar(x, pmf, width=0.4, alpha=0.8, color=['lightcoral', 'lightblue'], 
               edgecolor='black', linewidth=2)

# Add value labels
for i, (xi, pi) in enumerate(zip(x, pmf)):
    ax.text(xi, pi + 0.02, f'{pi:.2f}', ha='center', va='bottom', fontsize=12)

ax.set_xticks([0, 1])
ax.set_xticklabels(['Failure (0)', 'Success (1)'])
ax.set_ylabel('Probability')
ax.set_title(f'Bernoulli Distribution PMF (p = {p})')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"E[X] = p = {p}")
print(f"Var(X) = p(1-p) = {p*(1-p):.3f}")

E[X] = p = 0.3
Var(X) = p(1-p) = 0.210

Binomial Distribution

The binomial distribution counts the number of successes in a fixed number n of independent Bernoulli trials each with probability p.

X \sim \text{Binomial}(n, p) if: f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, ..., n

Key properties:

  • Sum of independent Bernoullis: If X_i \sim \text{Bernoulli}(p) are independent, then \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)
  • Additivity: If X \sim \text{Binomial}(n_1, p) and Y \sim \text{Binomial}(n_2, p) are independent, then X + Y \sim \text{Binomial}(n_1 + n_2, p)

Use cases:

  • Number of heads in n coin flips
  • Number of defective items in a batch
  • Number of customers who make a purchase
  • Number of successful treatments in a clinical trial
Warning

Independence assumption: The binomial distribution assumes all trials are independent - each outcome does not affect the probability of subsequent outcomes. This assumption may not hold in practice!

For example, if items are defective because a machine has broken (rather than random variation), then finding one defective item suggests all subsequent items might also be defective. In such cases, the binomial distribution would be inappropriate.

Show code
# Binomial distribution visualization
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Parameters
n, p = 20, 0.3
x = np.arange(0, n+1)

# Create figure
fig, ax = plt.subplots(figsize=(7, 5))

# Plot PMF
pmf = binom.pmf(x, n, p)
bars = ax.bar(x, pmf, alpha=0.8, color='steelblue', edgecolor='black')

# Highlight mean
mean = n * p
ax.axvline(mean, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean:.1f}')

# Add value labels on significant bars
for i, (xi, pi) in enumerate(zip(x, pmf)):
    if pi > 0.01:  # Only label visible bars
        ax.text(xi, pi + 0.003, f'{pi:.3f}', ha='center', va='bottom', fontsize=8)

ax.set_xlabel('Number of successes (k)')
ax.set_ylabel('P(X = k)')
ax.set_title(f'Binomial Distribution PMF: n={n}, p={p}')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"E[X] = np = {n*p}")
print(f"Var(X) = np(1-p) = {n*p*(1-p)}")
print(f"σ = {np.sqrt(n*p*(1-p)):.3f}")

E[X] = np = 6.0
Var(X) = np(1-p) = 4.199999999999999
σ = 2.049

Discrete Uniform Distribution

The discrete uniform distribution is another simple discrete distribution - every outcome is equally likely.

X \sim \text{DiscreteUniform}(a, b) if: f_X(x) = \frac{1}{b-a+1}, \quad x \in \{a, a+1, ..., b\} where a and b are integers with a \leq b.

Key properties:

  • \mathbb{E}[X] = \frac{a+b}{2}
  • \text{Var}(X) = \frac{(b-a+1)^2 - 1}{12}

Use cases:

  • Fair die roll: \text{DiscreteUniform}(1, 6)
  • Attack roll: \text{DiscreteUniform}(1, 20)
  • Random selection from a finite set
  • Lottery number selection
Show code
# Discrete Uniform distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint

# Example: fair die
a, b = 1, 6
x = np.arange(a, b+1)
pmf = [1/(b-a+1)] * len(x)

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.bar(x, pmf, width=0.6, alpha=0.8, color='lightgreen', edgecolor='black', linewidth=2)

# Add value labels
for i, (xi, pi) in enumerate(zip(x, pmf)):
    ax.text(xi, pi + 0.01, f'{pi:.3f}', ha='center', va='bottom', fontsize=10)

ax.set_xlabel('Outcome')
ax.set_ylabel('Probability')
ax.set_title(f'Discrete Uniform Distribution: Fair Die')
ax.set_ylim(0, 0.3)
ax.set_xticks(x)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"E[X] = {(a+b)/2}")
print(f"Var(X) = {((b-a+1)**2 - 1)/12:.3f}")

E[X] = 3.5
Var(X) = 2.917

Categorical Distribution

The categorical distribution is a generalization of Bernoulli to multiple categories (also called “Generalized Bernoulli” or “Multinoulli”). You can also see it as a generalization of the discrete uniform distribution to a discrete non-uniform distribution.

X \sim \text{Categorical}(p_1, ..., p_k) if: f_X(x) = p_x, \quad x \in \{1, 2, ..., k\} where p_i \geq 0 and \sum_{i=1}^k p_i = 1.

Key properties:

  • One-hot encoding: Often represented as a vector with one 1 and rest 0s
  • Special case: Categorical with k=2 is equivalent to Bernoulli
  • Special case: If all probabilities are equal, it becomes a discrete uniform
  • Foundation for multinomial distribution (multiple categorical trials)

Use cases:

  • Outcome of rolling a (possibly unfair) die
  • Classification into multiple categories
  • Language modeling (next-token prediction)3
  • Customer choice among products
Show code
# Categorical distribution
import numpy as np
import matplotlib.pyplot as plt

# Example: Customer choice among 5 products
categories = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
probabilities = [0.30, 0.25, 0.20, 0.15, 0.10]
x = np.arange(len(categories))

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.bar(x, probabilities, alpha=0.8, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],
               edgecolor='black', linewidth=2)

# Add value labels
for i, p in enumerate(probabilities):
    ax.text(i, p + 0.01, f'{p:.2f}', ha='center', va='bottom', fontsize=10)

ax.set_xlabel('Category')
ax.set_ylabel('Probability')
ax.set_title('Categorical Distribution: Customer Product Choice')
ax.set_xticks(x)
ax.set_xticklabels(categories, rotation=45, ha='right')
ax.set_ylim(0, 0.4)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Expected value for indicator representation (does it make sense here?)
print("If we encode categories as 1, 2, 3, 4, 5:")
expected = sum((i+1) * p for i, p in enumerate(probabilities))
print(f"E[X] = {expected:.2f} (does it really make sense here?)")

If we encode categories as 1, 2, 3, 4, 5:
E[X] = 2.50 (does it really make sense here?)

Brief Catalog: Other Discrete Distributions

Poisson(\lambda): The Poisson distribution models count of rare events in fixed intervals:

  • PMF: f_X(x) = e^{-\lambda} \frac{\lambda^x}{x!} for x = 0, 1, 2, ...
  • Mean = Variance = \lambda (lambda)
  • Use: Email arrivals, typos per page, customer arrivals
  • Approximates Binomial(n,p) when n large, p small: use \lambda = np

Geometric(p): The geometric distribution represents the number of trials until first success:

  • PMF: f_X(x) = p(1-p)^{x-1} for x = 1, 2, ...
  • Use: Waiting times, number of attempts until success

Negative Binomial(r, p): The negative binomial represents the number of failures before rth success

  • Generalization of geometric distribution
  • Use: Overdispersed count data, robust alternative to Poisson

1.5.5 Continuous Random Variables

A random variable X is continuous if there exists a function f_X such that:

  1. f_X(x) \geq 0 for all x
  2. \int_{-\infty}^{\infty} f_X(x) dx = 1
  3. For any a < b: \mathbb{P}(a < X < b) = \int_a^b f_X(x) dx

The function f_X is called the probability density function (PDF).

Warning

Important distinctions from discrete case:

  • \mathbb{P}(X = x) = 0 for any single point x: in a continuum, there is zero probability of picking one specific point
  • PDF can exceed 1 (it’s a density, not a probability!)
  • We get probabilities by integrating densities over an interval, not summing

Relationship between PDF and CDF:

  • F_X(x) = \int_{-\infty}^x f_X(t) dt
  • f_X(x) = F_X'(x) (where the derivative exists)

1.5.6 Core Continuous Distributions

Uniform Distribution

The uniform distribution is the continuous analog of “equally likely outcomes.”

X \sim \text{Uniform}(a, b) if: f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}

Properties:

  • CDF: F_X(x) = \frac{x-a}{b-a} for a \leq x \leq b
  • Every interval of equal length has equal probability

Use cases:

  • Random number generation (\text{Uniform}(0,1) is fundamental in computational statistics)
  • Modeling complete uncertainty within bounds
  • Arrival times when “any time is equally likely”
Show code
# Uniform distribution visualization
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform

# Example: Uniform(a=2, b=5)
a, b = 2, 5
x = np.linspace(0, 7, 1000)

# PDF
pdf = uniform.pdf(x, loc=a, scale=b-a)

fig, ax = plt.subplots(figsize=(7, 4))

# Plot the PDF
ax.plot(x, pdf, linewidth=3, color='darkblue', label=f'Uniform({a}, {b})')
ax.fill_between(x, 0, pdf, where=(x >= a) & (x <= b), alpha=0.3, color='lightblue')

# Mark the boundaries
ax.axvline(x=a, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax.axvline(x=b, color='red', linestyle='--', linewidth=2, alpha=0.7)

# Add height label
ax.text((a+b)/2, 1/(b-a) + 0.02, f'height = 1/{b-a} = {1/(b-a):.3f}', 
        ha='center', fontsize=10)

ax.set_xlabel('x')
ax.set_ylabel('Density f(x)')
ax.set_title('Uniform Distribution PDF')
ax.set_ylim(0, 0.5)
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

print(f"E[X] = (a+b)/2 = {(a+b)/2}")
print(f"Var(X) = (b-a)²/12 = {(b-a)**2/12:.3f}")
print(f"Total area under curve = {1/(b-a)} × {b-a} = 1 ✓")

E[X] = (a+b)/2 = 3.5
Var(X) = (b-a)²/12 = 0.750
Total area under curve = 0.3333333333333333 × 3 = 1 ✓

Normal (Gaussian) Distribution

The normal distribution (also called Gaussian distribution) is the most important distribution in statistics, arising from the Central Limit Theorem.

X \sim \text{Normal}(\mu, \sigma^2) or \mathcal{N}(\mu, \sigma^2) if: f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Parameters:

  • \mu (mu): mean (center of distribution)
  • \sigma^2 (sigma squared): variance (\sigma is standard deviation - controls spread)

Key properties:

  • Symmetric bell curve centered at \mu
  • About 68% of probability within \mu \pm \sigma
  • About 95% within \mu \pm 2\sigma
  • About 99.7% within \mu \pm 3\sigma

Standard Normal: Z \sim \mathcal{N}(0, 1) has \mu = 0, \sigma = 1

  • Any normal can be standardized: If X \sim \mathcal{N}(\mu, \sigma^2), then Z = \frac{X-\mu}{\sigma} \sim \mathcal{N}(0,1)
  • This allows us to use standard normal tables for any normal distribution

Additivity: If X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) are independent, then: \sum_{i=1}^n X_i \sim \mathcal{N}\left(\sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2\right)

Use cases:

  • Measurement errors
  • Heights, weights, test scores in large populations
  • Sum of many small independent effects (CLT)
  • Approximation for many other distributions
Show code
# Normal distribution visualization
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(7, 8))

# 1. PDF with different parameters
x = np.linspace(-6, 6, 1000)

# Different means
for mu, color in zip([-2, 0, 2], ['red', 'blue', 'green']):
    ax1.plot(x, norm.pdf(x, loc=mu), linewidth=2, 
             label=f'μ={mu}, σ=1', color=color)

# Different standard deviations
for sigma, style in zip([0.5, 1, 2], ['-', '--', ':']):
    ax1.plot(x, norm.pdf(x, scale=sigma), linewidth=2, 
             label=f'μ=0, σ={sigma}', linestyle=style, color='black')

ax1.set_xlabel('x')
ax1.set_ylabel('Density')
ax1.set_title('Normal Distribution PDF')
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)

# 2. 68-95-99.7 Rule
mu, sigma = 0, 1
x_range = np.linspace(-4, 4, 1000)
y = norm.pdf(x_range, mu, sigma)
ax2.plot(x_range, y, 'k-', linewidth=2)

# Fill areas for different sigma ranges
colors = ['lightblue', 'lightgreen', 'lightyellow']
alphas = [0.8, 0.6, 0.4]
labels = ['68% (±1σ)', '95% (±2σ)', '99.7% (±3σ)']

for n_sigma, color, alpha, label in zip([1, 2, 3], colors, alphas, labels):
    x_fill = x_range[np.abs(x_range - mu) <= n_sigma * sigma]
    y_fill = norm.pdf(x_fill, mu, sigma)
    ax2.fill_between(x_fill, 0, y_fill, color=color, alpha=alpha, label=label)

ax2.set_xlabel('Standard Deviations from Mean')
ax2.set_ylabel('Density')
ax2.set_title('68-95-99.7 Rule for Standard Normal')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Key probabilities
print("Standard Normal Probabilities:")
print(f"P(-1 ≤ Z ≤ 1) = {norm.cdf(1) - norm.cdf(-1):.3f} ≈ 0.68")
print(f"P(-2 ≤ Z ≤ 2) = {norm.cdf(2) - norm.cdf(-2):.3f} ≈ 0.95")
print(f"P(-3 ≤ Z ≤ 3) = {norm.cdf(3) - norm.cdf(-3):.3f} ≈ 0.997")

Standard Normal Probabilities:
P(-1 ≤ Z ≤ 1) = 0.683 ≈ 0.68
P(-2 ≤ Z ≤ 2) = 0.954 ≈ 0.95
P(-3 ≤ Z ≤ 3) = 0.997 ≈ 0.997

Exponential Distribution

The exponential distribution models waiting times between events in a Poisson process.

X \sim \text{Exponential}(\beta) if: f_X(x) = \frac{1}{\beta} e^{-x/\beta}, \quad x > 0

Here \beta (beta) is the scale parameter, the average time between events.

You can also find the exponential distribution parameterized in terms of a rate parameter \lambda = \frac{1}{\beta}.

Key properties:

  • Memoryless: \mathbb{P}(X > s + t | X > s) = \mathbb{P}(X > t)
    • “The future doesn’t depend on how long you’ve already waited”
  • Connection to Poisson: If events occur at rate \lambda = 1/\beta, time between events is Exponential(\beta)
  • CDF: F_X(x) = 1 - e^{-x/\beta} for x > 0

Use cases:

  • Time between customer arrivals
  • Lifetime of electronic components
  • Time until next earthquake
  • Duration of phone calls
Show code
# Exponential distribution visualization
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Different β values (mean waiting time)
betas = [0.5, 1, 2]
colors = ['red', 'blue', 'green']

fig, ax = plt.subplots(figsize=(7, 5))

x = np.linspace(0, 6, 1000)

for beta, color in zip(betas, colors):
    pdf = expon.pdf(x, scale=beta)
    ax.plot(x, pdf, linewidth=2, color=color, label=f'β = {beta}')
    
    # Show mean as vertical line
    ax.axvline(beta, color=color, linestyle='--', alpha=0.5, linewidth=1)

ax.set_xlabel('Time (x)')
ax.set_ylabel('Density')
ax.set_title('Exponential Distribution PDF')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 6)
ax.set_ylim(0, 2.1)

plt.tight_layout()
plt.show()

# Example calculations
beta = 1
print(f"For β = {beta} (rate λ = {1/beta}):")
print(f"E[X] = β = {beta}")
print(f"Var(X) = β² = {beta**2}")
print(f"P(X > 1) = {1 - expon.cdf(1, scale=beta):.3f}")
print(f"Median waiting time = {expon.ppf(0.5, scale=beta):.3f}")

For β = 1 (rate λ = 1.0):
E[X] = β = 1
Var(X) = β² = 1
P(X > 1) = 0.368
Median waiting time = 0.693

Brief Catalog: Other Continuous Distributions

Gamma(\alpha, \beta): The gamma distribution is a generalization of exponential

  • Sum of independent exponentials
  • Models waiting time for multiple events

Beta(\alpha, \beta): The beta distribution models values between 0 and 1

  • Models proportions and probabilities
  • Conjugate prior for binomial parameter

t(\nu): The t-distribution is a heavy-tailed alternative to normal

  • More probability in tails than normal
  • Used when variance is unknown/unstable
  • \nu = 1 gives Cauchy (no finite mean!)

\chi^2(p): The chi-squared distribution is the sum of squared standard normals

  • If Z_1, ..., Z_p \sim \mathcal{N}(0,1) independent, then \sum Z_i^2 \sim \chi^2(p)
  • Used in hypothesis testing and confidence intervals

1.6 Multivariate Distributions

So far we’ve focused on single random variables. But in practice, we often deal with multiple related variables: height and weight, temperature and humidity, stock prices of different companies. This leads us to multivariate distributions.

1.6.1 Joint Distributions

For random variables X and Y, the joint distribution describes their behavior together:

  • Discrete case: Joint PMF f_{X,Y}(x,y) = \mathbb{P}(X = x, Y = y)
  • Continuous case: Joint PDF f_{X,Y}(x,y) where \mathbb{P}((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy
Example: Discrete joint distribution

Roll two fair six-sided dice. Let X = first die, Y = second die.

The joint PMF is: f_{X,Y}(x,y) = \frac{1}{36} \text{ for } x,y \in \{1,2,3,4,5,6\}

We can display this as a 6×6 table with each entry equal to 1/36.

Example: Continuous joint distribution

Let (X,Y) be uniformly distributed on the unit disk.

f_{X,Y}(x,y) = \begin{cases} \frac{1}{\pi} & \text{if } x^2 + y^2 \leq 1 \\ 0 & \text{otherwise} \end{cases}

The normalizing constant 1/\pi makes the total probability equal to 1 (area of unit disk is \pi).

1.6.2 Marginal Distributions

Given a joint distribution, we can find the distribution of each variable separately.

The marginal distribution of X is obtained by “summing out” or “integrating out” the other variable:

  • Discrete: f_X(x) = \sum_y f_{X,Y}(x,y)
  • Continuous: f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy
Tip

Think of marginal distributions as projections: if you have points scattered in 2D, the marginal distribution of X is like looking at their shadows on the X-axis.

Example: Sum of two dice

Let X = first die, Y = second die, S = X + Y.

What is \mathbb{P}(S = 7)?

To find \mathbb{P}(S = 7), we sum over all ways to get 7:

  • (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)

So \mathbb{P}(S = 7) = 6 \times \frac{1}{36} = \frac{1}{6}

1.6.3 Independent Random Variables

Random variables X and Y are independent if: f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y) for all x, y.

This means the joint distribution factors into the product of marginals - knowing the value of one variable tells us nothing about the other.

Example: Independent coin flips

Flip two fair coins. Let X = 1 if first is heads, 0 otherwise. Same for Y with second coin.

Joint distribution:

Y = 0 Y = 1
X = 0 1/4 1/4
X = 1 1/4 1/4

Since each entry equals the product of marginal probabilities (e.g., \frac{1}{4} = \frac{1}{2} \times \frac{1}{2}), X and Y are independent.

Warning

Common mistake: Assuming uncorrelated means independent.

Independence implies zero correlation, but zero correlation does NOT imply independence! We’ll see counterexamples when we study correlation in Chapter 3.

1.6.4 Conditional Distributions

The conditional distribution of X given Y = y is:

  • Discrete: f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} if f_Y(y) > 0
  • Continuous: Same formula, interpreted as densities

This tells us how X behaves when we know Y = y.

Example: Quality control

A factory produces items on two machines. Let:

  • X = quality score (0-100)
  • Y = machine (1 or 2)

Suppose Machine 1 produces 60% of items with quality \sim \mathcal{N}(80, 25), and Machine 2 produces 40% with quality \sim \mathcal{N}(70, 100).

If we observe a quality score of 75, which machine likely produced it? This requires the conditional distribution \mathbb{P}(Y|X=75).

1.6.5 Interactive Exploration: Marginal and Conditional Distributions

Let’s explore how marginal and conditional distributions relate to a joint distribution using an interactive visualization.

Instructions:

  • Use the sliders to change the x and y values
  • Check the boxes to switch between marginal distributions (e.g., f_X(x)) and conditional distributions (e.g., f_{X|Y}(x|y))
  • When showing conditional distributions, red dashed lines appear on the joint distribution showing where we’re conditioning
  • The visualization uses the simpler shorthand notation p(x) for f_X(x) and p(x|y) for f_{X|Y}(x|y) (and analogous formulas for other pdfs)

Key insights:

  • Marginal distributions show the overall distribution of one variable, ignoring the other
  • Conditional distributions show how one variable is distributed when we fix the other at a specific value
  • The shape of conditional distributions changes as we move the conditioning value
  • This demonstrates how knowing one variable’s value provides information about the other when they’re not independent

1.6.6 Random Vectors and IID Random Variables

A random vector is a vector \mathbf{X} = (X_1, X_2, ..., X_n)^T where each component X_i is a random variable. The joint behavior of all components is characterized by their joint distribution.

Random vectors allow us to study multiple random quantities together, which leads us to an important special case.

IID Random Variables:

Random variables X_1, ..., X_n are independent and identically distributed (IID) if:

  1. They are mutually independent
  2. They all have the same distribution

We write: X_1, ..., X_n \stackrel{iid}{\sim} F.

If F has density f we also write X_1, ..., X_n \stackrel{iid}{\sim} f.

X_1, ..., X_n is a random sample of size n from F (or f, respectively).

IID assumptions are fundamental in statistics:

  • Random sampling: Each observation comes from the same population
  • No interference: One observation doesn’t affect others
  • Stable conditions: The underlying distribution doesn’t change
Example: Customer arrivals

Times between customer arrivals at a stable business might be IID Exponential(\beta).

Not IID:

  • Stock prices (today’s price depends on yesterday’s)
  • Temperature readings (temporal correlation)
  • Survey responses from same household (likely correlated)

1.6.7 Important Multivariate Distributions

Multinomial Distribution

The multinomial distribution is a generalization of binomial to multiple categories.

If we have k categories with probabilities p_1, ..., p_k (summing to 1), and we observe n independent trials, then the counts (X_1, ..., X_k) follow a Multinomial distribution:

f(x_1, ..., x_k) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}

where \sum x_i = n.

Use cases:

  • Dice rolls (6 categories for a standard die – or k for k-sided dice)
  • Survey responses (multiple choice)
  • Document word counts
  • Genetic allele frequencies

Multivariate Normal Distribution

The multivariate normal distribution is the multivariate generalization of the normal distribution.

A random vector \mathbf{X} = (X_1, ..., X_k)^T has a multivariate normal distribution, written \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), if:

f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)

where:

  • \boldsymbol{\mu} is the mean vector
  • \boldsymbol{\Sigma} is the covariance matrix (symmetric, positive definite)

Key properties:

  • Marginals are normal: If \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), then X_i \sim \mathcal{N}(\mu_i, \Sigma_{ii})
  • Linear combinations are normal: \mathbf{a}^T\mathbf{X} \sim \mathcal{N}(\mathbf{a}^T\boldsymbol{\mu}, \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a})
  • Conditional distributions are normal (with formulas for conditional mean and variance)

Special case - Bivariate normal: For two variables with correlation \rho: \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}

The correlation \rho controls the relationship:

  • \rho = 0: independent (for normal variables, uncorrelated = independent!)
  • \rho > 0: positive relationship
  • \rho < 0: negative relationship
Show code
# Bivariate normal distribution visualization
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal

# Create figure
fig, ax = plt.subplots(figsize=(7, 6))

# Bivariate normal with correlation
mean = [0, 0]
cov = [[1, 0.7], [0.7, 1]]  # correlation = 0.7

# Create grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Calculate PDF
rv = multivariate_normal(mean, cov)
Z = rv.pdf(pos)

# Contour plot
contour = ax.contour(X, Y, Z, levels=10, colors='blue', alpha=0.5)
ax.clabel(contour, inline=True, fontsize=8)
contourf = ax.contourf(X, Y, Z, levels=20, cmap='Blues', alpha=0.7)
fig.colorbar(contourf, ax=ax, label='Density')

# Add marginal indicators
ax.axhline(y=0, color='red', linestyle='--', alpha=0.5, linewidth=1)
ax.axvline(x=0, color='red', linestyle='--', alpha=0.5, linewidth=1)

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Bivariate Normal Distribution (ρ = 0.7)')
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

plt.show()

# Example calculations
print("Bivariate Normal with ρ = 0.7:")
print(f"Var(X) = Var(Y) = 1")
print(f"Cov(X,Y) = ρ·σ_X·σ_Y = 0.7")
print(f"If we observe Y=1, then:")
print(f"  E[X|Y=1] = ρ·(Y-μ_Y) = 0.7")
print(f"  Var(X|Y=1) = (1-ρ²) = {1 - 0.7**2:.2f}")

Bivariate Normal with ρ = 0.7:
Var(X) = Var(Y) = 1
Cov(X,Y) = ρ·σ_X·σ_Y = 0.7
If we observe Y=1, then:
  E[X|Y=1] = ρ·(Y-μ_Y) = 0.7
  Var(X|Y=1) = (1-ρ²) = 0.51
Note

The multivariate normal distribution is central to many statistical methods. We will return to it in more detail in Chapter 2 when we discuss expectations, covariances, and the properties of linear combinations of random variables.

We often define variables that are transformations g(\cdot) of other random variables. Assuming we know the distribution of X or (X, Y), how do we find the distribution of Y = g(X) or (U,V) = g(X,Y)?

Method 1: CDF technique

  1. Find the CDF: F_Y(y) = \mathbb{P}(Y \leq y) = \mathbb{P}(g(X) \leq y)
  2. Differentiate to get PDF: f_Y(y) = F_Y'(y)

Method 2: Jacobian method (for bijective transformations)

If (U,V) = g(X,Y) is one-to-one with inverse (X,Y) = h(U,V), then: f_{U,V}(u,v) = f_{X,Y}(h(u,v)) \cdot |J|

where J is the Jacobian determinant: J = \det\begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix}

Example: Box-Muller transform

We can generate Gaussian (normal) distributed random numbers starting from uniform.

If U_1, U_2 \sim \text{Uniform}(0,1) independently, then:

  • X = \sqrt{-2\ln U_1} \cos(2\pi U_2)
  • Y = \sqrt{-2\ln U_1} \sin(2\pi U_2)

are independent \mathcal{N}(0,1) random variables!

1.7 Chapter Summary and Connections

1.7.1 Key Concepts Review

We’ve built up probability theory from its foundations:

  1. Sample spaces and events provide the basic framework for describing uncertainty
  2. Probability axioms give us consistent rules for quantifying uncertainty
  3. Independence and conditioning let us model relationships between events
  4. Random variables connect probability to numerical quantities we can analyze
  5. Distributions characterize the behavior of random variables
  6. Multivariate distributions handle multiple related random quantities

1.7.2 Why These Concepts Matter

For Statistical Inference:

  • Random variables let us model data mathematically
  • Distributions provide templates for common patterns
  • Independence assumptions simplify analysis
  • Conditioning lets us update beliefs with data

For Machine Learning:

  • Probability distributions model uncertainty in predictions
  • Bayes’ theorem enables Bayesian ML methods
  • Multivariate distributions handle high-dimensional data
  • Independence assumptions make computation tractable

For Data Science Practice:

  • Understanding distributions helps choose appropriate methods
  • Recognizing dependence prevents incorrect analyses
  • Conditional probability quantifies relationships in data
  • Simulation using these distributions validates methods

1.7.3 Common Pitfalls to Avoid

  1. Confusing \mathbb{P}(A|B) with \mathbb{P}(B|A) - These can be vastly different!
  2. Assuming independence without justification - Real-world variables are often dependent
  3. Misinterpreting PDFs as probabilities - PDFs are densities, not probabilities
  4. Forgetting \mathbb{P}(X = x) = 0 for continuous variables - Use intervals for continuous RVs
  5. Thinking disjoint means independent - Disjoint events are maximally dependent!

1.7.4 Chapter Connections

The probability foundations from this chapter provide the mathematical language for all of statistics:

  • Next - Chapter 2 (Expectation): Building on our introduction to random variables, we’ll explore expectation as a fundamental tool for summarizing distributions, including variance and the powerful linearity of expectation property
  • Chapter 3 (Convergence & Inference): Using the probability framework and IID concept from this chapter, we’ll prove the Law of Large Numbers and Central Limit Theorem—the theoretical foundations that justify using samples to learn about populations
  • Chapter 4 (Bootstrap): Apply our understanding of empirical distributions to develop computational methods for quantifying uncertainty, providing a modern alternative to traditional parametric approaches

1.7.5 Self-Test Problems

Try to answer these questions after reading these lecture notes.

  1. Bayes in action: A test for a disease has 95% sensitivity (true positive rate) and 98% specificity (true negative rate). If 0.1% of the population has the disease, what’s the probability someone with a positive test actually has the disease?

  2. Distribution identification: Times between earthquakes in a region average 50 days. What distribution would you use to model the time until the next earthquake? Why?

  3. Independence check: You roll two dice. Let A = “sum is even” and B = “first die shows 3”. Are A and B independent?

  4. Conditional expectation preview: In a factory, Machine 1 makes 70% of products with defect rate 2%. Machine 2 makes 30% with defect rate 5%. If a product is defective, what’s the probability it came from Machine 1?

1.7.6 Further Reading

  • Probability Theory: Ross, “A First Course in Probability” - accessible introduction
  • Mathematical Statistics: Casella & Berger, “Statistical Inference” - rigorous treatment
  • Bayesian Perspective: Gelman et al., “Bayesian Data Analysis” - modern Bayesian view
  • Computational Approach: Blitzstein & Hwang, “Introduction to Probability” - simulation-based

1.7.7 Python and R Reference

# Probability distributions in Python
from scipy import stats
import numpy as np

# Discrete distributions
stats.binom.pmf(x, n=n, p=p)           # Binomial PMF
stats.binom.cdf(x, n=n, p=p)           # Binomial CDF
stats.binom.rvs(n=n, p=p, size=size)   # Generate random binomial

stats.poisson.pmf(x, mu=lam)           # Poisson PMF
stats.poisson.cdf(x, mu=lam)           # Poisson CDF
stats.poisson.rvs(mu=lam, size=size)   # Generate random Poisson

# Continuous distributions  
stats.norm.pdf(x, loc=mean, scale=sd)   # Normal PDF
stats.norm.cdf(x, loc=mean, scale=sd)   # Normal CDF
stats.norm.rvs(loc=mean, scale=sd, size=size) # Generate random normal

stats.expon.pdf(x, scale=beta)          # Exponential PDF
stats.expon.cdf(x, scale=beta)          # Exponential CDF
stats.expon.rvs(scale=beta, size=size)  # Generate random exponential

# Multivariate normal
stats.multivariate_normal.rvs(mean, cov, size=size)  # Generate
stats.multivariate_normal.pdf(x, mean, cov)          # Density

Note on lambda parameter: In the Python code, we used lam instead of lambda (\(\lambda\)) for the Poisson distribution parameter because lambda is a reserved keyword in Python (used for anonymous functions). Using lam (or lamb) in Python is a common convention to avoid syntax errors.

# Probability distributions in R

# Discrete distributions
dbinom(x, size=n, prob=p)     # Binomial PMF
pbinom(x, size=n, prob=p)     # Binomial CDF
rbinom(n, size, prob)          # Generate random binomial

dpois(x, lambda)               # Poisson PMF
ppois(x, lambda)               # Poisson CDF  
rpois(n, lambda)               # Generate random Poisson

# Continuous distributions
dnorm(x, mean=0, sd=1)         # Normal PDF
pnorm(x, mean=0, sd=1)         # Normal CDF
rnorm(n, mean=0, sd=1)         # Generate random normal

dexp(x, rate=1/beta)           # Exponential PDF
pexp(x, rate=1/beta)           # Exponential CDF
rexp(n, rate=1/beta)           # Generate random exponential

# Multivariate normal
library(mvtnorm)
rmvnorm(n, mean, sigma)        # Generate multivariate normal
dmvnorm(x, mean, sigma)        # Multivariate normal density

Remember: Probability is the language of uncertainty. Master this language, and you’ll be able to express and analyze uncertainty in any domain.


  1. More correctly, LLMs predict tokens, which are parts of words and other characters.↩︎

  2. The name “binomial” comes from its appearance in the binomial theorem: (x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.↩︎

  3. In modern LLMs, the categorical distribution is over tokens (parts of words), not full words. The token vocabulary can be huge - tens of thousands of different tokens like “a”, “aba”, “add”, etc. GPT models typically use vocabularies of 50,000-100,000 tokens.↩︎