13  Experimental and Survey Design

Published

October 6, 2025

13.1 Learning Objectives

After completing this chapter, you will be able to:

  • Design and critique surveys by distinguishing target populations from sampling frames, choosing appropriate sampling methods, recognizing nonresponse mechanisms, and applying postsurvey adjustments.
  • Design and critique experiments with a focus on reliability, internal validity, and external validity; incorporate controls, randomization, replication, and balanced factor combinations.
  • Differentiate within-subject and between-subject designs and explain matching and randomized blocks.
  • Perform practical power analysis for common comparisons and interpret Type S (sign) and Type M (magnitude) error diagnostics in study design and result interpretation.
  • Connect design choices to inference quality, understanding when observational comparisons mislead and when randomization or adjustment is necessary.

13.2 The Importance of Data Collection

13.2.1 A Cautionary Tale: The 1936 Literary Digest Poll

In 1936, American magazine The Literary Digest undertook a massive project: they mailed questionnaires to 10 million readers and potential readers for an election poll.1 This was election polling by brute force.

They received 2.27 million responses – an enormous sample by any standard.

Based on these responses, they confidently predicted victory for Republican candidate Alf Landon, who they forecast would receive 55% of the popular vote and 370 (out of 531) electoral votes, over the Democrat Franklin D. Roosevelt with 41% of the popular vote.

State Electoral Vote Landon Poll Total Roosevelt Poll Total
Alabama 11 3,060 10,082
Arizona 3 2,337 1,975
Arkansas 9 2,724 7,608
California 22 89,516 7,608
Colorado 6 15,949 10,025
Connecticut 8 28,809 13,413
Delaware 3 2,918 2,048
Florida 7 6,087 8,620
Georgia 12 3,948 12,915
Idaho 4 3,653 2,611
Illinois 29 123,297 79,035
Indiana 14 42,805 26,663
Iowa 11 31,871 18,614
Kansas 9 35,408 20,254
Kentucky 11 13,365 16,592
Louisiana 10 3,686 7,902
Maine 5 3,686 7,902
Maryland 8 17,463 18,341
Massachusetts 17 87,449 25,965
Michigan 19 51,478 25,686
Minnesota 11 30,762 20,733
Mississippi 9 848 6,080
Missouri 15 50,022 8,267
Montana 4 4,490 3,562
Nebraska 7 18,280 11,770
Nevada 3 1,003 955
New Hampshire 16 9,207 2,737
New Jersey 16 58,677 27,631
New Mexico 3 1,625 1,662
New York 47 162,260 139,277
North Carolina 13 6,113 16,324
North Dakota 4 4,250 3,666
Ohio 26 77,896 50,778
Oklahoma 11 14,442 15,075
Oregon 5 11,747 10,951
Pennsylvania 36 119,086 81,114
Rhode Island 4 10,401 3,489
South Carolina 8 1,247 7,105
South Dakota 4 8,483 4,507
Tennessee 11 9,883 19,829
Texas 23 15,341 37,501
Utah 4 4,067 5,318
Vermont 3 7,241 2,458
Virginia 11 10,223 16,783
Washington 8 21,370 15,300
West Virginia 8 13,660 10,235
Wisconsin 12 33,796 20,781
Wyoming 3 2,526 1,533
State Unknown 7 1,586 545
TOTAL 531 1,293,669 972,897

Source: Literary Digest, 31 October 1936

The poll showed Landon winning with 1,293,669 votes to Roosevelt’s 972,897 (about 57% to 43%).

The results turned out quite different. Roosevelt won 61% of the popular vote and 523 (out of 531) electoral votes, against just 36.5% for Landon. It wasn’t even close.

The actual 1936 election results showed a Roosevelt landslide, the opposite of what the Literary Digest predicted. Source: National Atlas of the United States.

What Went Wrong?

The Literary Digest was not a newcomer to polling – they had run polls since 1920 and forecast the winner correctly every time. However, they were not highly regarded by academic experts, as their policy was to report results such that they were not “weighted, adjusted, nor interpreted.”

Three major issues emerged:

  1. Sampling frame problems: Their method was to find addresses from automobile and telephone registries, in addition to their subscribers. Were these representative of the population as a whole?

  2. Nonresponse bias: While 2.27 million responses were received, this represented less than 25% response rate. Were the responses unbiased?

  3. Timing issues: The poll was conducted several months before the election. Was there an opinion swing between the poll and the election?

Three Questions to Always Ask

When evaluating any survey or study, ask:

  • Sampling frame vs. target population: Does the list you’re sampling from actually represent who you want to learn about?
  • Nonresponse and selection: Who chose to respond? Are they systematically different from those who didn’t?
  • Temporal validity: If measuring something that changes over time, when was the measurement taken?

The Postmortem Evidence

The failure of the poll attracted significant interest at the time. In 1937, Gallup ran a poll to try to help understand why it failed. These results were further analyzed and reported by Squire (1988).

The findings from Gallup’s 1937 analysis are reported below in three tables.

Before looking at the explanations below, examine each table and think: what does this reveal about why the poll failed?

Table 1: 1936 Presidential Vote by Car and Telephone Ownership (in Percent)

Presidential Vote Car & Phone Car, No Phone Phone, No Car Neither
Roosevelt 55 68 69 79
Landon 45 30 30 19
Other 1 2 0 2
Total N 946 447 236 657

Source: American Institute of Public Opinion, 28 May 1937

This table reveals a clear pattern: Roosevelt support increased as wealth decreased. Among those with neither car nor phone (the poorest group), Roosevelt won 79–19. Among those with both (the wealthiest), he still won but only 55–45. Since the Literary Digest sampling frame relied on car and phone registries, it systematically overrepresented wealthier voters who were more favorable to Landon. Still, this sampling frame bias alone is not enough to reverse the predicted outcome from Roosevelt winning to Landon winning.

Table 2: Presidential Vote by Receiving Literary Digest Straw Vote Ballot or Not (in Percent)

Presidential Vote Received Poll Not Receive Poll Do Not Know
Roosevelt 55 71 73
Landon 44 27 25
Other 1 1 3
Total N 780 1339 149

Source: American Institute of Public Opinion, 28 May 1937

This table confirms that the sampling frame problem alone doesn’t fully explain the failure. Among those who actually received the poll, Roosevelt still won 55–44 (matching the actual 61% popular vote reasonably well, given the wealth bias). The real issue must have been in who chose to respond.

Table 3: Presidential Vote by Returning or Not Returning Straw Vote Ballot (in Percent)

Presidential Vote Did Return Did Not Return Do Not Know
Roosevelt 48 69 56
Landon 51 30 40
Other 1 1 4
Total N 493 288 48

Source: American Institute of Public Opinion, 28 May 1937

Here’s the smoking gun: among those who received the poll ballot, those who returned it voted only 48% for Roosevelt (losing 48–51 to Landon), while those who didn’t return it voted 69% for Roosevelt. The poll’s prediction was based entirely on those who returned ballots – and they were systematically different from non-responders. This is classic nonresponse bias.

The Verdict

According to these results, the main reason for the failure seems to have been nonresponse bias.

Roosevelt won 55–44 among those who received the poll, but for some reason the Roosevelt-supporting majority was less active in returning their answers than the Landon-supporting minority.

Scale Cannot Fix Bias

Even with 2.27 million responses, systematic nonresponse bias overwhelmed the large sample size. Scale does not immunize against bias.

This observation is very important, given the rapidly decreasing response rates to many surveys today.

Response rates to US phone-based surveys have declined dramatically. Source: Pew Research Center

13.2.2 Types of Data Collection

We will consider two main types of data collection:

  1. Surveys are an example of observational studies, where data is collected by asking people. We observe what exists without intervening.

  2. Experiments are studies that involve the comparison of two or more interventions. The job of the experimenter is to select which interventions are applied and when.

Both require careful design to produce credible inferences.

For Finnish-speaking students, here’s a reference table of key terms:

English Finnish
Survey Kyselytutkimus (myös: luotaus, kartoitus)
Observational study Havainnoiva tutkimus
Experiment Koe
Intervention Interventio
Target population Perusjoukko, kohdepopulaatio, kohdeväestö
Sampling frame Otantakehikko
Sampling Otanta
Cluster sampling Ryväsotanta
Stratified sampling Ositettu otanta
Panel Paneeli
Nonresponse Vastaamattomuus
Control group Verrokkiryhmä (verrokki)
Reliability Luotettavuus
Internal validity Sisäinen validiteetti
External validity Ulkoinen validiteetti, ulkoinen pätevyys
Systematic error Systemaattinen virhe
Replication Replikaatio
Randomization Satunnaistaminen
Within-subject design Toistomittausasetelma
Between-subject design Riippumattomien ryhmien asetelma
Matched pairs Kaltaistetut parit
Randomized blocks Satunnaistettujen lohkojen asetelma
Factorial design Tekijäkoeasetelma, yhdistelyasetelma
Latin square Latinalainen neliö
Fractional factorial Osittainen tekijäkoeasetelma

13.3 Survey Planning

13.3.1 Surveys and Inference

In the inference chapters, we often assumed an IID sample X_1, \ldots, X_n \sim F that was used for inference about the population distribution F.

The same inference methods can be applied to learn about properties of a large but finite population of N individuals using a smaller sample of n individuals.

When the individuals are people, drawing an IID sample can become difficult.

The problem becomes even more difficult when the properties of interest are not objective physical properties but something like subjective opinions.

13.3.2 Survey Pitfalls: Two Critical Steps

A survey aims at gathering information from respondents to make inferences about a larger population.

This process involves two steps that may introduce distinct sources of uncertainty or error in the inferences:

  1. Question → Respondent: The responses provided by the respondents to the survey may lead to incorrect inferences about the features of the respondents, for example if the survey questions are poorly chosen.

  2. Respondents → Population: The inferred features about the respondents may lead to incorrect inferences about the features of the population if the respondents are not a representative sample, for example because of selective nonresponse, or if the original sample was poorly selected.

Warning

You need to fix both the questions and the sample. Getting one right while failing at the other still produces unreliable inferences.

13.3.3 From Target Population to Responses

Good surveys start by defining a target population – for example, all residents of an area in a given age range, or all users of a service.

Survey respondents need to be reached to answer questions. Practical restrictions require defining a sampling frame from which the actual sample is drawn.

Target Population: The entities about which you want to infer.

Sampling Frame: The operational list or mechanism from which samples are drawn.

Example sampling frames:

  • For phone-based surveys: all phone numbers in a registry, or simply all phone numbers
  • For mail-in surveys: everyone with a known mailing address, or all buildings in a region
  • For internet-based surveys: all visitors to some website(s) in a given period of time
  • For surveys of companies: companies listed in some registry
Imperfect Frames Are the Norm

The sampling frame is often an imperfect representation of the population. For example:

  • People who have recently moved may not have updated contact information
  • Not everyone owns a phone or has a regular mailing address
  • Others might have multiple phones or addresses
  • People might not visit the website during the specified time window for various reasons
  • Recently established companies may not be in the registry yet

Coverage gaps between the frame and the target population introduce systematic errors.

Example: The 1936 Poll’s Sampling Frame

What was the sampling frame in the Literary Digest poll? Which issues can you identify?

Sampling frame: Automobile and telephone registries, plus magazine subscribers.

Coverage issues: In 1936, during the Great Depression, car and phone ownership was associated with wealth. The sampling frame systematically excluded poorer voters, who heavily favored Roosevelt.

Nonresponse issues: Even among those who received the poll, response was associated with political preference and socioeconomic status.

13.3.4 Sampling Methods

In theory, the best method for sampling is often uniformly random sampling (simple random sampling) – every unit in the frame has an equal probability of selection.

However, operational constraints often lead to other methods:

  • Cluster sampling is sometimes used to simplify data collection, for example in sampling pupils one class at a time instead of independently. The results need to be corrected for within-cluster homogeneity, though.

Cluster sampling has a problem: observations within a cluster tend to be similar to each other. Students in the same classroom have similar backgrounds, people in the same neighborhood have similar incomes, etc.

This similarity means you get less information per observation than with simple random sampling. When everyone in a cluster is similar, surveying 20 people from that cluster tells you less than surveying 20 randomly selected people from across all clusters.

The design effect (DEFF) quantifies how much this inflates the variance of your estimator:

\text{DEFF} = 1 + (m - 1)\rho

where m is average cluster size and \rho is the intra-cluster correlation (ICC) — how similar people within a cluster are.

  • If \rho = 0 (no similarity within clusters), then DEFF = 1 (no penalty)
  • If \rho > 0 (people in clusters are similar), then DEFF > 1 (your estimator has higher variance)

Implication: Effective sample size n_{\text{eff}} \approx \frac{n}{\text{DEFF}}.

Example: You sample 1000 students by randomly selecting 50 classrooms with 20 students each. If students within a classroom have ICC = 0.2, then:

\text{DEFF} = 1 + (20-1) \times 0.2 = 4.8

Your effective sample size is only about 1000/4.8 \approx 208 students – you’d need nearly 5 times as many students to achieve the same precision as simple random sampling!

  • Stratified sampling can be used to ensure sufficient representation from different subgroups. Unlike cluster sampling, this can actually increase precision (reduce variance of your estimator) if strata are homogeneous within and heterogeneous between – because you’re guaranteed to sample from all the different groups, rather than missing some by chance.

  • Panels are samples that are reused over time. This reduces the effort of contacting new individuals every time and can help track changes in some phenomenon over time (selecting a fresh sample every time might increase variability between consecutive analyses). The downside of a panel is that as the same sample is reused multiple times, any biases in the original sample are propagated to more analyses.

13.3.5 Respondents vs. Nonrespondents

After the sample has been drawn, usually only a subset respond.

Professional pollsters devote significant resources to trying to chase the nonrespondents – multiple contact attempts, incentives, mixed-mode follow-up.

As illustrated by the 1936 election poll example (and the FINRISK example in Chapter 12), trying to understand the nonrespondents better is critical for accurate results.

Response Mechanisms (same as the missing data mechanisms discussed in Chapter 12):

  • MCAR (Missing Completely At Random): Response probability is independent of everything
  • MAR (Missing At Random): Response probability depends only on observed covariates
  • MNAR (Missing Not At Random): Response probability depends on the unobserved outcome itself

Nonresponse is often effectively MNAR in practice.

Systematic differences between respondents and nonrespondents distort estimates more than random error does. With declining response rates (often below 10% for phone surveys), even well-designed surveys face serious challenges.

Example: Simulating Nonresponse Bias

Let’s simulate a scenario where we’re trying to estimate the prevalence of some outcome (Y) in a population of 20,000 people. However, response probability depends on an unobserved variable Z (like socioeconomic status or political interest), and Z also affects the outcome. This creates MNAR nonresponse – people with higher Z are more likely to respond, and they also have different outcomes.

Show code
import numpy as np
import matplotlib.pyplot as plt

# Simulate nonresponse bias
rng = np.random.default_rng(42)
n = 20000

# Z is a latent variable (e.g., political interest or socioeconomic status)
Z = rng.normal(0, 1, n)

# True outcome depends on Z
Y = (Z > 0.2).astype(int)  # True prevalence: P(Y=1) varies with Z

# Response probability depends on Z (MNAR-like)
# People with high Z are more likely to respond
p_resp = 1 / (1 + np.exp(-(-0.5 + 1.2*Z)))
R = rng.binomial(1, p_resp)

# Compare true prevalence vs respondent prevalence
p_true = Y.mean()
p_resp_est = Y[R==1].mean()

print(f"True prevalence: {p_true:.3f}")
print(f"Naive estimate from respondents: {p_resp_est:.3f}")
print(f"Bias: {p_resp_est - p_true:.3f}")
print(f"Response rate: {R.mean():.1%}")

plt.figure(figsize=(7, 3.5))
bars = [p_true, p_resp_est]
colors = ['#0173B2', '#DE8F05']
plt.bar([0, 1], bars, color=colors, width=0.6)
plt.xticks([0, 1], ["True\nprevalence", "Respondent\nprevalence"])
plt.ylabel("Proportion")
plt.title("Nonresponse Bias: Large n Doesn't Save You")
plt.ylim(0, max(bars) * 1.2)
for i, v in enumerate(bars):
    plt.text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
True prevalence: 0.422
Naive estimate from respondents: 0.655
Bias: 0.233
Response rate: 40.8%

Even with 20,000 observations, if response depends on the outcome (MNAR), the naive estimate is badly biased.

13.3.6 Postsurvey Adjustments

After all the responses are in and cleaned up, it is usually useful to apply postsurvey adjustments. The idea is to use known population characteristics (from census data, registries, etc.) to correct for imbalances in who responded.

Common methods:

  • Weighting: If you know the true population proportions of age groups, gender, or region, you can reweight respondents to match those proportions
    • Poststratification: Reweight to match known population proportions in each stratum (e.g., you know 40% are women aged 18-30, 15% are men aged 60+, etc.)
    • Raking: When you only know the marginal totals (e.g., 52% women overall, 25% aged 18-30 overall) but not the full cross-tabulation (women aged 18-30), iteratively adjust weights to match each marginal separately
  • Model-based adjustments: Build models to predict outcomes for nonrespondents based on observed covariates, or use multiple imputation when assumptions are plausible
Limitations

These adjustments can help with MAR nonresponse (where response depends on observed covariates), but they cannot fully fix coverage gaps or strong MNAR. If nonresponse is strongly related to the outcome in ways not captured by your observed covariates, no adjustment can recover the truth — you’re missing information you can’t reconstruct.

13.3.7 Designing Good Survey Questions

In order to answer a survey question, the respondent has to go through multiple stages (Groves et al. 2009):

  1. Comprehension or interpreting the question
  2. Retrieval of information needed to answer
  3. Judgment to combine and process the recalled information
  4. Reporting to formulate the response

Good survey questions take these steps into account to ensure reliable responses.

Question Design Pitfalls

  • Ambiguous questions or formulations unclear to respondents: Using jargon, assuming knowledge, or combining multiple questions (“double-barreled” questions)

  • Trusting human memory too much: Asking about events from long ago or requiring detailed recall

  • Difficult questions and the substitution heuristic: According to Kahneman (2011), people subconsciously substitute an easier question for a difficult one

    • Example: “Should I invest in Ford Motor Company stock?” → “Do I like Ford cars?”
  • Offered options bias the answers: Order effects (people favor options presented first or last), anchoring (initial values influence estimates), and the specific range of options provided (asking “0-50 or 50-100?” vs. “0-500 or 500-1000?” elicits different responses)

  • Social desirability bias: People may not be equally honest in all questions

Practical Guidelines for Survey Questions
  • Pilot test questions with a small group first
  • Use clear time windows (“in the past week” rather than “recently”)
  • Use neutral wording (avoid leading or loaded language)
  • Limit recall burden (don’t ask for details from years ago)
  • Balance/rotate options to avoid order effects
  • Consider indirect questioning for sensitive topics

13.3.8 Sampling Summary: The Five-Step Process

  1. Define the target population
  2. Construct the sampling frame
  3. Draw the sample using an appropriate method
  4. Collect data from respondents and understand nonrespondents
  5. Apply postsurvey adjustments

Each step requires careful thought and introduces potential sources of error.

13.4 Experimental Design

13.4.1 What Is an Experiment?

An experiment involves active manipulation of some factors, known as independent or input variables, to determine their impact on a selected target(s), also known as dependent or output variable(s).

Experiments are everywhere:

  • Daily life: What impact would leaving 10 minutes earlier have on your commute?
  • Science and engineering: What impact would changing conditions of a process have?
  • Business: How would changing customer offering impact sales and profits?
  • Data science and machine learning: Which hyperparameters should be used?

13.4.2 Goals of a Good Experiment

  • Reliability: Whether the measure will produce a similar value when the measuring instrument is reapplied.

  • Internal Validity: Whether the effects observed in a study are due to the independent variable of interest and not some other confounding factor.

  • External Validity: Whether (causal) relationships can be generalized to different measures, persons, settings, and times.

These three properties form the foundation of credible experimental evidence.

13.4.3 Experimental Design Pitfalls

Just as surveys can go wrong in systematic ways, so can experiments. We list the major pitfalls below.

Systematic Errors

Systematic errors are biases introduced by the experimental process itself. Examples include experiment operator effects (different technicians measuring differently), time-of-day or seasonal effects, instrument drift over time, and learning effects as operators gain experience.

These systematic errors become especially problematic when they correlate with experimental variables. For instance, if one lab technician always measures treatment cases while another measures controls, you cannot distinguish treatment effects from technician effects. Similarly, if you test all controls in the morning and all treatments in the afternoon, time-of-day effects confound your results.

Mitigation

Randomize measurement order; standardize protocols across operators and time; blind experimenters to treatment conditions where possible.

Lack of Replication

Lack of replication prevents you from distinguishing true effects from random variation. Without repeating measurements under the same conditions, an observed difference might be due to the treatment – or it might be noise from uncontrolled factors.

The minimum number of replicates is 2 to detect that variability exists, while 3 replicates allows you to identify and potentially exclude outliers. In practice, you often need many more replicates, depending on the inherent variability of your system and the size of effect you want to detect. Low replication leads directly to low statistical power (discussed later in this chapter).

Changing Multiple Things at Once

Changing multiple things at once creates ambiguity about causation. If you simultaneously change temperature and pressure in an industrial process and observe improved yield, which change was responsible? You cannot tell – the effects are confounded.

Mitigation

Use factorial thinking – test combinations systematically (discussed below under factorial designs), or change one factor at a time while holding others constant. Balanced designs allow you to study multiple factors efficiently without losing the ability to separate their effects.

Lack of Controls

Lack of controls removes your baseline for comparison. When studying interventions over time, outcomes naturally change even without treatment due to temporal trends, regression to the mean, or environmental shifts. Without a concurrent control group receiving no treatment (or the current standard treatment), you cannot isolate the treatment’s specific contribution.

Using historical controls – comparing current treatment subjects to past control subjects – is especially risky because systematic changes in environment, measurement procedures, subject populations, or operator experience may have occurred between time periods. These temporal confounds make it impossible to attribute differences specifically to the treatment.

Beware “Historical Controls”

Using old data as controls introduces confounding: something may have changed in the environment (seasonality, operators, instruments, populations). Prefer concurrent controls randomized within the same study conditions.

13.4.4 Within-Subject and Between-Subject Designs

In a within-subject design, each subject receives all the treatments, while in a between-subject design different treatments are given to different subjects.

Both designs have their own pros and cons:

Aspect Within-Subject Between-Subject
Comparison Direct comparison within same subject Comparison across subjects
Efficiency Higher (controls for individual variation) Lower (more between-subject variability)
Main problem Interactions between treatments (carryover effects, order effects, etc.) Greater variability
Mitigation Randomize/counterbalance order Matching, randomized blocks

In a good within-subject design, the order of treatments should be random for each subject to avoid systematic order effects.

Variability in between-subject designs can sometimes be reduced with matching: finding closely matching subjects who are given the alternative treatments. With two treatments this leads to matched pairs, with more treatments to randomized blocks design.

Example: Matched Pairs

In a clinical trial comparing two pain medications:

  • Between-subject (unmatched): Randomly assign 50 patients to drug A, 50 to drug B
  • Matched pairs: Find 50 pairs of similar patients (matched on age, condition severity, etc.), then randomize one in each pair to drug A, the other to drug B

The matched design has much higher power because it reduces between-subject variability along the matched dimensions (age, severity). Unmatched characteristics still contribute variability, but by controlling for the most important sources of variation, matching substantially improves precision.

13.4.5 Factorial Designs and Latin Squares

When there are more factors, the number of possible combinations grows exponentially.

In order to fully understand possible interactions of the factors, it is necessary to test all combinations, known as factorial design.

For example, with 3 factors each at 3 levels: 3^3 = 27 experimental runs.

Combinatorial designs such as Latin squares and their generalizations, as well as fractional factorial designs, allow deriving balanced designs of subsets of combinations, therefore reducing the number of experiments needed.

However, they should only be used when there are no significant interactions between the factors.

Example: Factorial vs. Latin Square for 3 Factors

With factors \(A\), \(B\), \(C\) each at 3 levels: \(3^3 = 27\) experiments

B \ A \(A_1\) \(A_2\) \(A_3\)
\(B_1\) \(C_1, C_2, C_3\) \(C_1, C_2, C_3\) \(C_1, C_2, C_3\)
\(B_2\) \(C_1, C_2, C_3\) \(C_1, C_2, C_3\) \(C_1, C_2, C_3\)
\(B_3\) \(C_1, C_2, C_3\) \(C_1, C_2, C_3\) \(C_1, C_2, C_3\)

All 27 combinations are tested.

A Latin square arranges one factor’s levels such that each level appears exactly once in each row and once in each column. This balances the third factor across both other factors.

This yields only \(3 \times 3 = 9\) experiments:

B \ A \(A_1\) \(A_2\) \(A_3\)
\(B_1\) \(C_1\) \(C_2\) \(C_3\)
\(B_2\) \(C_2\) \(C_3\) \(C_1\)
\(B_3\) \(C_3\) \(C_1\) \(C_2\)

Each level of \(C\) appears exactly once in each row (level of \(B\)) and column (level of \(A\)).

This balances \(C\) across \(A\) and \(B\) while reducing runs by 2/3.

The Trade-off: Reduced Designs and Confounding

Both Latin squares and fractional factorials save experimental runs by aliasing (deliberately confounding) some effects together.

Latin squares assume no interactions between factors. If A and B interact, or if A and C interact, the Latin square will confound these interactions with main effects, making it impossible to separate them.

Fractional factorials similarly confound higher-order interactions with lower-order effects to reduce the number of required runs.

Use reduced designs only when you can credibly assume no important interactions exist, based on subject-matter knowledge or preliminary experiments. Otherwise, you risk mistaking interaction effects for main effects.

13.4.6 Experimental Design Takeaways

The key principles for credible experiments:

  • Remember to include controls (concurrent, not historical)
  • Replicate (minimum 2–3, more is better)
  • Randomize everything you possibly can:
    • Order of applying treatments
    • Order of making measurements
    • Assignment of experimenters and instruments
  • Use Latin squares or other combinatorial designs to construct balanced combinations when full factorial design is infeasible
Core Principles of Experimental Design
  1. Replication: Multiple independent observations to estimate variability
  2. Randomization: Breaks links between treatments and lurking variables
  3. Blocking (or matching): Groups similar units to reduce noise

These principles apply across all experimental sciences.

Connection to Causal Inference

As discussed in Chapter 11, randomization is critical for causal inference because random assignment of treatment breaks the link between treatment and confounding variables. With randomization, observed associations reflect causal effects rather than spurious correlations due to selection bias or confounding.

13.5 Power and Design Analysis

13.5.1 Classical Power: What It Tells (and Doesn’t)

Power calculation is about estimating the necessary sample size in order to have sufficient probability to detect changes of assumed magnitude with statistical significance.

Power calculation is essential in research to ensure studies have a high probability of reaching the desired outcome.

Statistical Power: The probability of rejecting the null hypothesis when it is false. Equivalently, the probability of detecting an effect of a specified size at a chosen significance level \alpha.

Power depends on:

  • Effect size (larger effects → higher power)
  • Variance (lower variance → higher power)
  • Sample size n (larger n → higher power)
  • Significance level \alpha (higher \alpha → higher power, but more Type I errors)

Low-powered studies with too few subjects tend to produce random results: there is a significant risk that any statistically significant results are not reproducible.

Warning

Low-powered studies are unethical because they place a burden on test subjects that is likely to produce nothing of value. If your study has only 20% power, you’re wasting resources 80% of the time, and even the 20% of “successful” results may be misleading (as we’ll see next).

13.5.2 Type S and Type M Errors

Gelman and Carlin (2014) highlight two additional error types that are crucial for understanding what happens in low-power settings:

Type S (Sign) Error: The probability that the estimate has the incorrect sign, if it is statistically significantly different from zero.

Type M (Magnitude) Error (exaggeration ratio): The expectation of the absolute value of the estimate divided by the effect size, if statistically significantly different from zero.

These errors reveal a pattern: in low-power studies, statistically significant results are often in the wrong direction and greatly exaggerated.

Why This Matters

Classical power analysis asks: “Will I detect an effect?”

Type S and M errors ask: “If I detect something, can I trust the direction and magnitude?”

In noisy, small-sample settings:

  • Type S error: A “significant” result might actually be the opposite of the truth
  • Type M error: A “significant” result might be 5× or 10× larger than reality

Publication filters on significance amplify these problems: the literature fills up with wrong-signed and exaggerated effects.

To understand the relationship between power and these error types, we reproduce the analysis from Figure 2 of Gelman and Carlin (2014) for unbiased estimates that are normally distributed:

Type S error rate and exaggeration ratio as a function of power for unbiased, normally distributed estimates.

Key observations:

  1. Type S errors become serious below 10% power: When power drops below 0.1, statistically significant results are likely to have the wrong sign – your “discovery” points in the opposite direction of truth.

  2. Type M errors (exaggeration) become serious below 50% power: When power is much below 0.5, statistically significant estimates tend to be much larger in magnitude than the true effect size – often overestimating by factors of 2, 5, or more.

  3. High power protects against both: With 80% power (the conventional standard), Type S error rate is negligible (~0.001%) and the exaggeration ratio is small (~1.1×), meaning your estimates are reliable in both direction and magnitude.

13.5.3 Design Analysis Workflow

Diagram of Gelman & Carlin’s recommended approach to design analysis. It will typically make sense to consider different plausible values of D, the assumed true effect size. Source: Figure 1 from Gelman and Carlin (2014).

The recommended approach from Gelman and Carlin (2014):

  1. Use external information (literature review, prior studies, meta-analyses) to posit plausible true effect sizes

    • Not the estimate from your current study
    • Not just “what would be interesting”
  2. Compute power, Type S, and Type M for your design under these plausible effects

  3. Interpret results through the lens of plausibility:

    • If your study has low power for plausible effects, a significant finding is suspect
    • Consider how much exaggeration to expect
  4. Report honestly: Include design analysis in publications, so readers understand limitations

Example: Type S and M Errors for Proportions

Suppose we want to detect a difference in conversion rates between two website designs. We hypothesize the true effect is a 1 percentage point increase (0.01 in proportion).

Setup: Two groups of size n=500 each. For proportions near 0.5, the standard error of the difference is approximately:

\text{SE}(\hat{p}_1 - \hat{p}_2) \approx 0.5 \cdot \sqrt{2/n} = 0.5 \cdot \sqrt{2/500} \approx 0.032

With true effect D = 0.01 and \text{SE} = 0.032:

Show code
from scipy import stats

def retrodesign(D, s, alpha=0.05, df=np.inf, n_sims=20000):
    """Compute power, Type S, and Type M errors."""
    if np.isinf(df):
        z = stats.norm.ppf(1 - alpha/2)
        p_hi = 1 - stats.norm.cdf(z - D/s)
        p_lo = stats.norm.cdf(-z - D/s)
        power = p_hi + p_lo
        typeS = p_lo / power if power > 0 else np.nan
        est = D + s * np.random.normal(size=n_sims)
    else:
        z = stats.t.ppf(1 - alpha/2, df)
        p_hi = 1 - stats.t.cdf(z - D/s, df)
        p_lo = stats.t.cdf(-z - D/s, df)
        power = p_hi + p_lo
        typeS = p_lo / power if power > 0 else np.nan
        est = D + s * np.random.standard_t(df, size=n_sims)

    sig = np.abs(est) > s * z
    exaggeration = np.mean(np.abs(est[sig])) / abs(D) if sig.any() and D != 0 else np.nan

    return {'power': power, 'typeS': typeS, 'exaggeration': exaggeration}

# Design analysis for n=500 per group
result_500 = retrodesign(D=0.01, s=0.032, alpha=0.05)
print(f"n=500 per group (SE=0.032):")
print(f"  Power: {result_500['power']:.3f}")
print(f"  Type S error: {result_500['typeS']:.3f}")
print(f"  Exaggeration: {result_500['exaggeration']:.1f}×")
n=500 per group (SE=0.032):
  Power: 0.061
  Type S error: 0.188
  Exaggeration: 7.6×

With such low power, a statistically significant result has a substantial chance of the wrong sign and will severely overestimate the effect!

What sample size gives 80% power? We need D = 2.8 \times \text{SE}:

0.01 = 2.8 \times 0.5\sqrt{2/n} \implies 0.01 = 1.4\sqrt{2/n} \implies n \approx 39{,}200 \text{ per group}

Show code
# Design analysis for n=39,200 per group
result_39k = retrodesign(D=0.01, s=0.5*np.sqrt(2/39200), alpha=0.05)
print(f"n=39,200 per group (SE≈0.0036):")
print(f"  Power: {result_39k['power']:.3f}")
print(f"  Type S error: {result_39k['typeS']:.6f}")
print(f"  Exaggeration: {result_39k['exaggeration']:.2f}×")
n=39,200 per group (SE≈0.0036):
  Power: 0.800
  Type S error: 0.000001
  Exaggeration: 1.12×

Implication: Small effects require enormous samples. Underpowered studies produce wildly exaggerated, often wrong-signed results.

13.6 Chapter Summary and Connections

13.6.1 Key Concepts Review

Surveys:

  • Target vs. frame: The population you want to learn about vs. the list you can sample from
  • Sampling methods: Simple random sampling (gold standard), cluster (watch for design effects), stratified (can improve precision), panels (propagate initial bias)
  • Nonresponse: Often MNAR in practice; adjustments help but can’t fix everything
  • Question design: Comprehension → Retrieval → Judgment → Reporting pipeline; avoid ambiguity, memory burden, leading options

Experiments:

  • Three goals: Reliability, internal validity, external validity
  • Four pitfalls: Systematic errors, lack of replication, changing multiple things, lack of controls
  • Design types: Within vs. between subjects; matched pairs and randomized blocks reduce variability
  • Factorial designs: Explore interactions; Latin squares and fractional factorials save runs but assume no interactions

Randomization:

  • Breaks links between treatment and confounders
  • Enables causal interpretation: association = causation (in expectation)
  • Observational comparisons require extreme caution

Power and Design Analysis:

  • Classical power: Probability of detecting an effect
  • Type S error: Probability significant result has wrong sign
  • Type M error: Expected overestimation factor for significant results
  • Low power is dangerous: Not just “likely to fail” but “likely to mislead if you succeed”

13.6.2 The Big Picture

Three fundamental lessons from this chapter:

  1. Good data beats clever modeling: No amount of sophisticated analysis can fix bias from poor design. A well-designed study with 100 observations beats a poorly designed study with 10,000.

  2. Randomize and replicate to earn causal interpretations: Without randomization, you’re swimming in a sea of confounders. Without replication, you can’t separate signal from noise.

  3. Plan with realistic effects; interpret significance critically: Use external information about plausible effect sizes. In low-power settings, “p < 0.05” can mean “probably wrong.”

13.6.3 Common Pitfalls to Avoid

  1. Treating sampling frames as if they were populations: Always ask what’s missing from your frame

  2. Ignoring nonresponse bias: With 10% response rates, the 90% who didn’t respond likely differ systematically

  3. No controls or replication: Without controls, you can’t isolate effects; without replication, you can’t assess variability

  4. Changing multiple things at once: Confounds prevent you from learning which change mattered

  5. Overinterpreting significant results from low-power designs: That p < 0.05 might be in the wrong direction and 10\times too large

  6. Using observed effects for power calculations: This creates a circular logic – your noisy estimate makes your underpowered study look well-powered

13.6.4 Chapter Connections

  • Chapters 1-4 (Probability & Inference): Throughout the course we assumed IID samples X_1, \ldots, X_n \sim F for inference. This chapter reveals what’s required to actually obtain such samples from human populations – careful sampling frames, randomization, and awareness of nonresponse.

  • Chapters 5-7 (Parametric Inference & Testing): We learned about Type I and Type II errors in hypothesis testing. This chapter introduces Type S (sign) and Type M (magnitude) errors, which reveal what happens when you actually get a “statistically significant” result in a low-power study – it may point the wrong direction and be wildly exaggerated.

  • Chapter 11 (Causal Inference): Randomization in experiments is the gold standard for causal inference because random assignment of treatment breaks the association between treatment and confounding variables. Observational studies (like surveys) require strong assumptions (no unmeasured confounders) that experiments avoid through randomization.

  • Chapter 12 (Missing Data): Nonresponse in surveys is fundamentally a missing data problem. The mechanisms (MCAR, MAR, MNAR) apply directly to understanding who responds vs. who doesn’t, and postsurvey adjustments are attempts to handle MAR nonresponse through weighting and modeling.

13.6.5 Self-Test Problems

Problem 1: Sampling Frame Issues

You want to survey university students about mental health. Your sampling frame is the university email directory.

  1. What coverage issues might exist?
  2. What nonresponse issues might you expect?
  3. Propose two mitigation strategies.
  1. Coverage: Recently enrolled students might not be in directory yet; students on leave are excluded; students who changed email or don’t check university email

  2. Nonresponse: Students with mental health issues might be less likely (or more likely, if seeking help) to respond; busy students; stigma effects

  3. Mitigations: Multiple contact methods (email, text, in-person); incentives; ensure anonymity; poststratification/raking to population margins; use sensitive question techniques

Problem 2: Experimental Design

You’re testing three teaching methods (A, B, C) across three schools (1, 2, 3) with three difficulty levels of material (Easy, Medium, Hard).

  1. How many runs in a full factorial design?
  2. Construct a 3×3 Latin square to reduce the design
  3. What assumption must hold for the Latin square to be valid?
  1. 3^3 = 27 runs (all combinations)

  2. Example Latin square (methods as treatments; school and difficulty as blocks):

School \ Difficulty Easy Medium Hard
School 1 A B C
School 2 B C A
School 3 C A B

This gives 9 runs instead of 27, estimating the main effect of method while blocking school and difficulty.

  1. Assumes no interactions of method with school or difficulty. If Method A works better for Easy material than expected from main effects alone (an interaction), the Latin square will confound this with main effects.
Problem 3: Type S and Type M Errors

A study with n=20 per group compares two means. Assuming equal variances of \sigma^2 = 1, the SE of the difference is \sqrt{1/20 + 1/20} \approx 0.32. If the true effect is D = 0.2, the power is approximately 0.10 (10%).

  1. Would you trust the sign of a significant result? The magnitude?
  2. What sample size would give 80% power? (Hint: 80% power requires the effect to be about 2.8 standard errors from zero)
  1. No! With only 10% power:
  • Type S error rate is nontrivial (~5%)
  • Type M exaggeration is large (~3.5–4×), so significant results substantially overstate the effect size
  1. For 80% power, need effect to be 2.8 SE from zero:
  • D = 2.8 \times \text{SE}
  • 0.2 = 2.8 \times \sqrt{2/n}
  • n \approx 392 per group

The study is drastically underpowered.

13.6.6 Python and R Reference

import numpy as np
import pandas as pd
from scipy import stats

# Survey: Poststratification weighting
def poststratify(df, outcome, strata, pop_marginals):
    """
    Reweight survey data to match population marginals (single categorical stratum).

    Parameters:
    -----------
    df : DataFrame with columns [outcome] and [strata]
    outcome : str, name of outcome column
    strata : str, name of stratification variable (categorical)
    pop_marginals : dict, {level: population proportion} summing to 1

    Returns:
    --------
    Weighted mean estimate of outcome
    """
    df = df.copy()
    # Compute sample proportions per stratum
    sample_props = df[strata].value_counts(normalize=True)
    # Weight = population_prop / sample_prop for each stratum
    df['w'] = df[strata].map(lambda x: pop_marginals.get(x, 0) / sample_props.get(x, 1))
    # Normalize weights to sum to sample size
    df['w'] = df['w'] * (len(df) / df['w'].sum())
    return np.average(df[outcome], weights=df['w'])

# Experiment: Difference in means with SE
def diff_in_means(y1, y2):
    """Difference in means with Welch standard error (unequal variances)."""
    n1, n2 = len(y1), len(y2)
    mean_diff = np.mean(y1) - np.mean(y2)
    se_diff = np.sqrt(np.var(y1, ddof=1)/n1 + np.var(y2, ddof=1)/n2)  # Welch SE
    return mean_diff, se_diff

# Power/Design analysis
def retrodesign(D, s, alpha=0.05, df=np.inf, n_sims=10000):
    """
    Design analysis: power, Type S, Type M (two-sided test).

    D: hypothesized true effect
    s: standard error
    alpha: significance level (default 0.05)
    df: degrees of freedom (np.inf for normal)
    """
    if np.isinf(df):
        z = stats.norm.ppf(1 - alpha/2)
        p_hi = 1 - stats.norm.cdf(z - D/s)
        p_lo = stats.norm.cdf(-z - D/s)
        power = p_hi + p_lo
        typeS = min(p_lo, p_hi) / power if power > 0 else np.nan
        est = D + s * np.random.normal(size=n_sims)
    else:
        z = stats.t.ppf(1 - alpha/2, df)
        p_hi = 1 - stats.t.cdf(z - D/s, df)
        p_lo = stats.t.cdf(-z - D/s, df)
        power = p_hi + p_lo
        typeS = min(p_lo, p_hi) / power if power > 0 else np.nan
        est = D + s * np.random.standard_t(df, size=n_sims)

    sig = np.abs(est) > s * z
    exaggeration = np.mean(np.abs(est[sig])) / abs(D) if sig.any() and D != 0 else np.nan

    return {'power': power, 'typeS': typeS, 'exaggeration': exaggeration}
# Survey: Poststratification weighting (single categorical stratum)
poststratify <- function(df, outcome, strata, pop_marginals) {
  # Compute sample proportions per stratum
  sample_props <- prop.table(table(df[[strata]]))
  # Weight = population_prop / sample_prop for each stratum
  w <- pop_marginals[as.character(df[[strata]])] / sample_props[as.character(df[[strata]])]
  # Normalize weights to sum to sample size
  w <- w * (nrow(df) / sum(w))
  # Weighted mean
  sum(w * df[[outcome]]) / sum(w)
}

# Experiment: Difference in means (Welch SE for unequal variances)
diff_in_means <- function(y1, y2) {
  n1 <- length(y1)
  n2 <- length(y2)
  mean_diff <- mean(y1) - mean(y2)
  se_diff <- sqrt(var(y1)/n1 + var(y2)/n2)  # Welch SE
  list(diff = mean_diff, se = se_diff)
}

# Power/Design analysis (two-sided test)
retrodesign <- function(D, s, alpha = 0.05, df = Inf, n.sims = 10000) {
  if (is.infinite(df)) {
    z <- qnorm(1 - alpha/2)
    p.hi <- 1 - pnorm(z - D/s)
    p.lo <- pnorm(-z - D/s)
    power <- p.hi + p.lo
    typeS <- ifelse(power > 0, min(p.lo, p.hi) / power, NA)
    est <- D + s * rnorm(n.sims)
  } else {
    z <- qt(1 - alpha/2, df)
    p.hi <- 1 - pt(z - D/s, df)
    p.lo <- pt(-z - D/s, df)
    power <- p.hi + p.lo
    typeS <- ifelse(power > 0, min(p.lo, p.hi) / power, NA)
    est <- D + s * rt(n.sims, df)
  }

  sig <- abs(est) > s * z
  exaggeration <- ifelse(any(sig) && D != 0,
                         mean(abs(est[sig])) / abs(D),
                         NA)

  list(power = power, typeS = typeS, exaggeration = exaggeration)
}

13.6.7 Connections to Source Material

The Literary Digest case study comes from the History Matters website (“Landon in a Landslide: The Poll That Changed Polling”), while the Gallup postmortem analysis (three tables) comes from Squire (1988). Survey methodology follows Groves et al.’s framework, while Type S and Type M errors come from Gelman and Carlin (2014). Experimental design principles build on classical work by Fisher, Box, and colleagues.

13.6.8 Further Reading

  • Literary Digest poll: Squire (1988), historical analysis of the 1936 polling failure
  • Survey design: Groves et al. (2009), comprehensive treatment
  • Type S/M errors: Gelman and Carlin (2014) in Perspectives on Psychological Science
  • Experimental design: Box, Hunter & Hunter (2005), Statistics for Experimenters, 2nd edition

Remember: The best statistical analysis cannot rescue a poorly designed study. Invest time in design – it’s the foundation of credible inference. Randomize when you can, adjust carefully when you can’t, and always think critically about what your data can and cannot tell you.


  1. Details from “Landon in a Landslide: The Poll That Changed Polling” at History Matters – a website so gloriously 1990s it’s a time capsule itself.↩︎