Statistics for Data Science: Lecture Notes

DATA11007

Authors
Affiliations

Associate Professor, Department of Computer Science, University of Helsinki

Professor, Department of Computer Science, University of Helsinki

Published

September 23, 2025

Preface

Welcome to the lecture notes for the Statistics for Data Science (DATA11007) course offered by the Master’s Programme in Data Science at the University of Helsinki.

These notes are designed to be an accessible companion to the course, providing multiple perspectives on statistical concepts for students with diverse backgrounds.

Note

Please refer to the official Course Description and Moodle page for specific information about current implementations of the course.

About These Notes

These lecture notes complement the main course textbook, *All of Statistics* (Wasserman, 2013). While Wasserman’s book provides a comprehensive and rigorous treatment, these notes aim to:

  • Provide gentler introductions to complex topics
  • Offer multiple perspectives (intuitive, practical, mathematical, computational)
  • Include extensive code examples in Python (and occasionally R)
  • Bridge prerequisite gaps for students from different backgrounds

Formats and Feedback

These notes are available in two formats:

  • HTML (recommended): Interactive version with tabs, code folding, and better navigation.
  • PDF: Traditional document format for printing or offline reading.

The content is nearly identical in both formats. The PDF is generated automatically from the source via Quarto’s LaTeX conversion and contains some known formatting artifacts (e.g., extra page breaks), which should not affect readability. Interactive code elements and end-of-chapter code blocks are omitted from the PDF. The HTML version is the officially recommended format.

Your Feedback is Welcome!

These lecture notes are new and under active development. We welcome feedback, corrections, and suggestions! If you spot any errors, unclear explanations, or have ideas for improvements, please let your instructor know. Your feedback will contribute directly to the development of these materials.

Work in Progress

These lecture notes are being developed throughout the course. While the initial chapters are currently available, additional chapters will be released progressively as they are prepared. Each chapter will be made available before its corresponding lecture, ensuring you have the materials when you need them.

We appreciate your patience as we continue to build this comprehensive resource!

How to Use These Notes

Throughout these notes, you’ll encounter sections with multiple perspectives on the same concept. These are presented in tabbed panels to help you explore different viewpoints:

This perspective provides conceptual understanding through analogies, visualizations, and real-world examples. It focuses on building intuition without heavy mathematical formalism.

This perspective explains concepts using mathematical language and notation. It may provide alternative formulations or connect to other mathematical ideas that some readers find clearer or more familiar.

This perspective shows how to implement and explore concepts through code. It includes simulations, numerical experiments, and practical algorithms.

Important: These perspectives often contain material more advanced than required for the course, or previews of material that will be covered later. You’re not expected to understand everything in every panel! Think of them as:

  • Additional resources for deeper exploration
  • Different entry points to the same concept
  • Teasers for the content of future lectures
  • Optional enrichment beyond the core material

Choose the perspective that best matches your learning style, or explore all three to build a richer understanding. The core course content is always presented in the main text outside these panels.

Example: Understanding Variance

Here’s a brief example of how these perspectives work. Consider the concept of variance (\(\text{Var}\)), which measures how spread out data values are:

Variance measures how far data points typically are from their average.

Imagine test scores in two classes:

  • Class A: Everyone scores between 75-85 (low variance)
  • Class B: Scores range from 40-100 (high variance)

Both might have the same average (say, 80), but Class B has much more spread. Variance captures this spread numerically.

The variance of a random variable \(X\) is defined as:

\[\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\]

For a discrete distribution:

\[\text{Var}(X) = \sum_i (x_i - \mu)^2 \cdot \mathbb{P}(X = x_i)\]

Key properties:

  • \(\text{Var}(aX + b) = a^2\text{Var}(X)\)
  • If \(X, Y\) are independent: \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\)
import numpy as np
import matplotlib.pyplot as plt

# Two datasets with same mean but different variance
class_a = [75, 78, 80, 82, 85, 77, 83, 79, 81, 80]
class_b = [40, 95, 100, 65, 85, 90, 45, 98, 70, 82]

# Note: np.var computes the population variance (ddof=0) by default;
# use np.var(..., ddof=1) for the sample variance.
print(f"Class A: mean = {np.mean(class_a):.1f}, variance = {np.var(class_a):.1f}")
print(f"Class B: mean = {np.mean(class_b):.1f}, variance = {np.var(class_b):.1f}")

# Visualize to see the spread
plt.figure(figsize=(7, 3))

# Use same bins for both histograms to ensure consistent bar width
bins = np.linspace(30, 110, 17)  # 16 bins from 30 to 110

plt.hist(class_a, alpha=0.5, label='Class A (low variance)', bins=bins, edgecolor='black')
plt.hist(class_b, alpha=0.5, label='Class B (high variance)', bins=bins, edgecolor='black')
plt.xlabel('Test Score')
plt.ylabel('Count')
plt.title('Comparison of Test Score Distributions')
plt.legend(loc='upper left')

# Set y-axis to show only integers
plt.gca().yaxis.set_major_locator(plt.MaxNLocator(integer=True))

plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Class A: mean = 80.0, variance = 7.8
Class B: mean = 77.0, variance = 413.8
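The two properties listed in the mathematical panel can also be checked numerically. The sketch below (not part of the course code; the distributions and constants are arbitrary choices for illustration) verifies \(\text{Var}(aX + b) = a^2\,\text{Var}(X)\) exactly and the additivity \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\) approximately, since the latter holds only in expectation for a finite sample of independent draws:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)  # Var(X) is approximately 4
y = rng.normal(loc=1.0, scale=3.0, size=100_000)  # Var(Y) is approximately 9, independent of X

a, b = 3.0, 5.0

# Scaling: shifting by b has no effect, scaling by a multiplies variance by a^2
print(np.var(a * x + b), a**2 * np.var(x))

# Additivity for independent X, Y (approximate: sample covariance is near zero,
# not exactly zero)
print(np.var(x + y), np.var(x) + np.var(y))
```

The scaling identity holds to floating-point precision for any sample; the additivity check only approaches equality as the sample size grows.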

Notice how each perspective offers something different: intuition through examples, mathematical rigor through formulas, and hands-on exploration through code.

Prerequisites

The course assumes:

  • Basic programming ability (Python or R)
  • Calculus (derivatives and integrals)
  • Linear algebra (matrices and vectors)
  • Basic probability (sample spaces, events)

Don’t worry if you’re rusty on some topics – we’ll review key concepts as needed.

Course Structure

  1. Foundations: Probability, random variables, distributions
  2. Inference: Estimation, confidence intervals, hypothesis testing
  3. Methods: Regression, resampling, Bayesian approaches
  4. Advanced Topics: Missing data, causal inference, experimental design

Available Chapters

The following chapters are currently available:

  1. Probability Foundations
  2. Expectation
  3. Convergence and Statistical Inference
  4. Nonparametric Methods and Bootstrap
  5. Parametric Inference I: Finding Estimators
  6. Parametric Inference II: Properties of Estimators
  7. Hypothesis Testing and p-values
  8. Bayesian Inference and Decision Theory
  9. Linear and Logistic Regression

Programming

In this course, we’ll mainly use Python, with occasional use of R. Python is the main language for modern machine learning, and both Python and R are widely used in data science. Feel free to choose the environment you’re most comfortable with.

Acknowledgments

These notes build upon materials from previous course iterations and incorporate feedback from many students. Special thanks to the teaching assistants and students who have helped improve these materials.

The course content is largely adapted from Wasserman (2013), which remains the main course textbook.

The course slides and materials in early course iterations (2021-2024) were fully developed by Antti Honkela. Starting from 2025, we introduced the lecture notes that you are currently reading, which have been developed mainly by Luigi Acerbi expanding on the existing course slides and materials.

These notes were compiled with the assistance of Gemini 2.5 Pro, GPT-5 Thinking, and the Claude 4 family of AI models (largely via Claude Code), particularly when it comes to the Quarto layout, coding content, spotting mistakes, and brainstorming examples.


Last updated: September 23, 2025