Sampling and Uncertainty

Can We Trust Our Estimates?

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

Today


  1. Know your sample
  2. Getting it right on average
  3. Quantifying how off you are

Your Research Design is due March 18.

Know Your Sample

Your Data Can Only Speak to What Is In It


Before you run a single analysis, ask:

  • What is my unit of observation?
  • What units are in my sample?
  • What units are left out?

Your data cannot tell you anything about units that are not in it!

A cautionary tale

What Went Wrong?

In the 1948 U.S. presidential election, the pollsters were certain Dewey would beat Truman.

  • Gallup, Roper, and Crossley all predicted a comfortable Dewey victory
  • They used quota sampling: interviewers chose respondents matching demographic quotas
  • But interviewers gravitated toward wealthier, more accessible, more Republican-leaning people
  • The sample was big. That was not the problem.
  • The problem was who was in it and who was not.

This Happens in Research Too

  • Survey only urban areas: miss rural poverty
  • Phone surveys: miss people without phones
  • Only villages near roads: miss remote populations
  • Household surveys: miss homeless, displaced, migrant populations
  • Voluntary response: people with strong opinions over-represented

Statistics Have Your Back

Let’s Measure Average Height at Penn


There are about 25,000 students. We cannot measure all of them. So we grab 30 at random.

For this simulation, the true average is 67.5 inches. Let’s see what happens.

Random Samples

Show code
library(ggplot2)
library(knitr)

set.seed(42)

# True population of Penn students
pop_size <- 25000
pop_height <- rnorm(pop_size, mean = 67.5, sd = 5)
true_mean <- mean(pop_height)

random_means <- round(sapply(1:4, function(i) mean(sample(pop_height, 30))), 1)

kable(
  data.frame(Sample = 1:4, `Sample Mean` = random_means,
             `True Mean` = round(true_mean, 1),
             `Off By` = random_means - round(true_mean, 1)),
  col.names = c("Sample", "Sample Mean (in)", "True Mean (in)", "Off By"),
  align = "cccc"
)
Sample Sample Mean (in) True Mean (in) Off By
1 67.0 67.5 -0.5
2 67.5 67.5 0.0
3 68.5 67.5 1.0
4 67.0 67.5 -0.5

Each sample is a bit off: sometimes too high, sometimes too low. On average, we get it right.

What About 1,000 Random Samples?

Show code
set.seed(42)
n_sims <- 1000
sample_means <- numeric(n_sims)

for (i in 1:n_sims) {
  sample_means[i] <- mean(sample(pop_height, 30))
}

ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.5, linetype = "dashed") +
  annotate("label", x = true_mean, y = 80,
           label = paste0("True mean = ", round(true_mean, 1), " in"),
           color = "red", size = 6) +
  labs(x = "Sample mean (inches)", y = "Count",
       title = "1,000 sample means from 1,000 random samples of 30") +
  theme_minimal(base_size = 18)

The Central Limit Theorem


  • The sample means center on the population value
  • And their distribution is normal
  • You never see this distribution in real life. You only have your one sample. But you know it is there and it is normal.
  • Why do we care that it is normal? Because we know everything about normal distributions. We can calculate exactly how spread out it is, where 95% of values fall, etc. That lets us quantify uncertainty.
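The CLT does not even require the population itself to be normal. As a quick sketch with a deliberately skewed population (hypothetical numbers, not heights):

```r
set.seed(1)

# A right-skewed population with true mean 1 (exponential, nothing like heights)
pop <- rexp(25000, rate = 1)

# 2,000 sample means from samples of 30
means <- replicate(2000, mean(sample(pop, 30)))

round(mean(means), 2)  # centers on the population mean, roughly symmetric spread
```

Even though the population is strongly skewed, the histogram of sample means would look roughly normal and centered on the true mean.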

But What If You Get Lazy?

Instead of sampling randomly, you stand outside the Palestra after a basketball game and measure 30 people who walk out. Repeat three times.

Show code
# Biased samples: people near the basketball court are taller
bball_means <- round(rnorm(3, mean = 73, sd = 0.8), 1)

kable(
  data.frame(Sample = 1:3, `Sample Mean` = bball_means,
             `True Mean` = round(true_mean, 1),
             `Off By` = bball_means - round(true_mean, 1)),
  col.names = c("Sample", "Sample Mean (in)", "True Mean (in)", "Off By"),
  align = "cccc"
)
Sample Sample Mean (in) True Mean (in) Off By
1 71.6 67.5 4.1
2 71.7 67.5 4.2
3 72.4 67.5 4.9

The CLT Still Works Here


  • The distribution of basketball-court sample means is still normal
  • It still centers on a population value
  • But it centers on the basketball-court population value, not the Penn student population value
  • The statistics are doing their job perfectly. You asked the wrong question!
  • This is the Dewey problem all over again

Quantifying How Off You Are

The Standard Error

Any single estimate is still off by a little. How much?

The standard error (SE) is the standard deviation of the sampling distribution.

  • It tells you: how much does my estimate bounce around across repeated samples?
  • Small SE = precise (narrow distribution)
  • Large SE = imprecise (wide distribution)
  • Bigger sample = smaller SE

Bigger Sample, Less Wiggle

Show code
set.seed(42)

means_30 <- replicate(1000, mean(sample(pop_height, 30)))
means_200 <- replicate(1000, mean(sample(pop_height, 200)))

compare <- rbind(
  data.frame(mean_height = means_30,
             sample = paste0("n = 30  (SE = ", round(sd(means_30), 2), " in)")),
  data.frame(mean_height = means_200,
             sample = paste0("n = 200  (SE = ", round(sd(means_200), 2), " in)"))
)

ggplot(compare, aes(x = mean_height)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.2, linetype = "dashed") +
  facet_wrap(~sample, ncol = 1) +
  labs(x = "Sample mean (inches)", y = "Count",
       title = "More data, less wiggle") +
  theme_minimal(base_size = 16)

Why 1.96?

We said the sampling distribution is normal. In a normal distribution, 95% of values fall within 1.96 standard deviations of the mean. That is where the 1.96 comes from.
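The 1.96 cutoff comes straight from the normal quantile function:

```r
# The value that cuts off the top 2.5% of a standard normal
qnorm(0.975)                # about 1.96

# Probability of landing within 1.96 SDs of the mean
pnorm(1.96) - pnorm(-1.96)  # about 0.95
```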

Confidence Intervals


Your point estimate is a bit off each time. So instead of reporting just one number, we report a range.

The SE is the standard deviation of the sampling distribution. So 95% of sample means fall within 1.96 SEs of the true value:

\[CI = \text{estimate} \pm 1.96 \times SE\]

A 95% CI will contain the true value 95% of the time across repeated samples.
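As a sketch, here is the formula applied by hand to a single simulated sample of 30 heights:

```r
set.seed(42)

# One simulated sample of 30 heights
s   <- rnorm(30, mean = 67.5, sd = 5)
est <- mean(s)
se  <- sd(s) / sqrt(length(s))

# 95% CI: estimate plus/minus 1.96 standard errors
ci <- est + c(-1.96, 1.96) * se
round(ci, 1)
```

The interval is centered on the sample mean and its width is exactly 2 × 1.96 × SE.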

100 Confidence Intervals

Show code
set.seed(42)
n_sims <- 100
results <- data.frame(
  sim = 1:n_sims,
  mean_hat = numeric(n_sims),
  se = numeric(n_sims)
)

for (i in 1:n_sims) {
  s <- sample(pop_height, 30)
  results$mean_hat[i] <- mean(s)
  results$se[i] <- sd(s) / sqrt(30)
}

results$lower <- results$mean_hat - 1.96 * results$se
results$upper <- results$mean_hat + 1.96 * results$se
results$covers <- results$lower <= true_mean & results$upper >= true_mean
coverage <- round(mean(results$covers) * 100)

ggplot(results, aes(x = sim, y = mean_hat, color = covers)) +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0, alpha = 0.5) +
  geom_hline(yintercept = true_mean, color = "red", linewidth = 1.2, linetype = "dashed") +
  scale_color_manual(values = c("FALSE" = "orange", "TRUE" = "#011F5B"),
                     labels = c("Misses the truth", "Contains the truth")) +
  labs(x = "Sample number", y = "Estimated mean height (in)",
       title = paste0(coverage, " out of 100 intervals contain the true mean"),
       color = "") +
  theme_minimal(base_size = 16) +
  theme(legend.position = "bottom")

Same Logic for Treatment Effects


In research, we often care about whether some treatment has an effect, not just what the average is. Same problem: different sample, slightly different estimate, noise.

The question is usually: is the effect different from zero, or is it just noise?

To answer this, we imagine a world where there is no effect (the null hypothesis, \(H_0\)). Under that assumption, the sampling distribution of our estimate is centered at zero. How far out does our estimate land?

The p-value

If the true effect were zero, this is what the sampling distribution would look like. The further out our estimate lands, the less likely it could have arisen by chance.


The p-value is the shaded area: the probability of seeing an estimate at least that extreme if \(H_0\) were true.

p-values in Practice


  • Small p-value (< 0.05): very unlikely under no effect. Reject the null.
  • Large p-value (> 0.05): consistent with no effect. Fail to reject.
  • We usually set the threshold at 0.05, but it could be anything. It is a convention, not a law of nature.
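As a sketch with hypothetical numbers, and assuming a normal sampling distribution, the two-sided p-value is just a tail area:

```r
# Hypothetical estimate and SE (illustration only)
est <- 2.1
se  <- 1.0

z <- est / se             # how many SEs from zero
p <- 2 * pnorm(-abs(z))   # two-sided tail area under the null
p
```

Here the estimate sits 2.1 SEs from zero, so p falls just under the 0.05 threshold.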

Putting It All Together

Reading summary(lm())

We have been working with height. Now let’s see this in a regression. We simulate data where each additional inch of height adds ~0.4 points to a basketball tryout score.

options(scipen = 999)

set.seed(42)
tryout <- data.frame(
  height = round(sample(pop_height, 80), 1)
)
tryout$score <- 40 + 0.4 * tryout$height + rnorm(80, 0, 5)

model <- lm(score ~ height, data = tryout)
summary(model)

Call:
lm(formula = score ~ height, data = tryout)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.836  -3.067   0.504   3.273  13.627 

Coefficients:
            Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  39.2662     7.1884   5.462 0.000000542 ***
height        0.4092     0.1064   3.845    0.000244 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.895 on 78 degrees of freedom
Multiple R-squared:  0.1594,    Adjusted R-squared:  0.1486 
F-statistic: 14.79 on 1 and 78 DF,  p-value: 0.0002445

Every Number, Explained


  • Estimate (height): for each additional inch of height, tryout score increases by ~0.4 points
  • Std. Error: across repeated samples, this estimate would bounce around by about this much
  • t value: Estimate / Std. Error. How many SEs away from zero?
  • Pr(>|t|): the p-value. Would this be surprising if the true slope were zero?
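You can reproduce the t value and p-value by hand from the printed estimate and standard error (with 78 degrees of freedom, as in the output above):

```r
# Numbers copied from the summary output
est <- 0.4092
se  <- 0.1064
df  <- 78

t_val <- est / se                  # estimate in SE units
p_val <- 2 * pt(-abs(t_val), df)   # two-sided p-value from the t distribution

c(t = round(t_val, 3), p = signif(p_val, 3))
```

This matches the printed t value of about 3.85 and p-value of about 0.0002, up to rounding of the inputs.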

Confidence Interval in R

confint(model)
                 2.5 %     97.5 %
(Intercept) 24.9551596 53.5772870
height       0.1973415  0.6210762


The 95% CI for the effect of height is [0.2, 0.62].

The interval does not contain zero, which is consistent with the small p-value: we have evidence that taller students score higher on the tryout.

What Is Next


  • Now you can run a regression and read every number in the output
  • After spring break: Crime and Punishment, how policing and incarceration affect communities
  • Your Research Design is due March 18: describe your sample, specify a regression model, interpret the output