Sampling and Uncertainty

Can We Trust Our Estimates?

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

Today


  1. Know your sample
  2. Getting it right on average
  3. Quantifying how off you are

Your Research Design is due March 18.

Know Your Sample

Your Data Can Only Speak to What Is In It


Before you run a single analysis, ask:

  • What is my unit of observation?
  • What units are in my sample?
  • What units are left out?

Your data cannot tell you anything about units that are not in it!

A cautionary tale

What Went Wrong?

In the 1948 U.S. presidential election, the pollsters were certain Dewey would beat Truman.

  • Gallup, Roper, and Crossley all predicted a comfortable Dewey victory
  • They used quota sampling: interviewers chose respondents matching demographic quotas
  • But interviewers gravitated toward wealthier, more accessible, more Republican-leaning people
  • The sample was big. That was not the problem.
  • The problem was who was in it and who was not.

This Happens in Research Too

  • Survey only urban areas: miss rural poverty
  • Phone surveys: miss people without phones
  • Only villages near roads: miss remote populations
  • Household surveys: miss homeless, displaced, migrant populations
  • Voluntary response: people with strong opinions over-represented

Statistics Have Your Back

Let’s Measure Average Height at Penn


There are about 25,000 students. We cannot measure all of them. So we grab 30 at random.

For this simulation, the true average is 67.5 inches. Let’s see what happens.

Random Samples

Show code
library(ggplot2)
library(knitr)

set.seed(42)

# True population of Penn students
pop_size <- 25000
pop_height <- rnorm(pop_size, mean = 67.5, sd = 5)
true_mean <- mean(pop_height)

random_means <- round(sapply(1:4, function(i) mean(sample(pop_height, 30))), 1)

kable(
  data.frame(Sample = 1:4, `Sample Mean` = random_means,
             `True Mean` = round(true_mean, 1),
             `Off By` = random_means - round(true_mean, 1)),
  col.names = c("Sample", "Sample Mean (in)", "True Mean (in)", "Off By"),
  align = "cccc"
)
Sample Sample Mean (in) True Mean (in) Off By
1 67.0 67.5 -0.5
2 67.5 67.5 0.0
3 68.5 67.5 1.0
4 67.0 67.5 -0.5

Each sample is a bit off: sometimes too high, sometimes too low. On average, we get it right.

What About 1,000 Random Samples?

Show code
set.seed(42)
n_sims <- 1000
sample_means <- numeric(n_sims)

for (i in 1:n_sims) {
  sample_means[i] <- mean(sample(pop_height, 30))
}

ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.5, linetype = "dashed") +
  annotate("label", x = true_mean, y = 80,
           label = paste0("True mean = ", round(true_mean, 1), " in"),
           color = "red", size = 6) +
  labs(x = "Sample mean (inches)", y = "Count",
       title = "1,000 sample means from 1,000 random samples of 30") +
  theme_minimal(base_size = 18)

The Central Limit Theorem


  • The sample means center on the population value
  • And their distribution is normal
  • You never see this distribution in real life. You only have your one sample. But you know it is there and it is normal.
  • Why do we care that it is normal? Because we know everything about normal distributions. We can calculate exactly how spread out it is, where 95% of values fall, etc. That lets us quantify uncertainty.
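The CLT does not even require the population itself to be normal. As a quick sketch with a deliberately skewed population (hypothetical numbers, not heights):

```r
set.seed(1)

# A right-skewed population with true mean 1 (exponential, nothing like heights)
pop <- rexp(25000, rate = 1)

# 2,000 sample means from samples of 30
means <- replicate(2000, mean(sample(pop, 30)))

round(mean(means), 2)  # centers on the population mean, roughly symmetric spread
```

Even though the population is strongly skewed, the histogram of sample means would look roughly normal and centered on the true mean.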

But What If You Get Lazy?

Instead of sampling randomly, you stand outside the Palestra after a basketball game and measure 30 people who walk out. Repeat three times.

Show code
# Biased samples: people near the basketball court are taller
bball_means <- round(rnorm(3, mean = 73, sd = 0.8), 1)

kable(
  data.frame(Sample = 1:3, `Sample Mean` = bball_means,
             `True Mean` = round(true_mean, 1),
             `Off By` = bball_means - round(true_mean, 1)),
  col.names = c("Sample", "Sample Mean (in)", "True Mean (in)", "Off By"),
  align = "cccc"
)
Sample Sample Mean (in) True Mean (in) Off By
1 71.6 67.5 4.1
2 71.7 67.5 4.2
3 72.4 67.5 4.9

The CLT Still Works Here


  • The distribution of basketball-court sample means is still normal
  • It still centers on a population value
  • But it centers on the basketball-court population value, not the Penn student population value
  • The statistics are doing their job perfectly. You asked the wrong question!
  • This is the Dewey problem all over again

Quantifying How Off You Are

The Standard Error

Any single estimate is still off by a little. How much?

The standard error (SE) is the standard deviation of the sampling distribution.

  • It tells you: how much does my estimate bounce around across repeated samples?
  • Small SE = precise (narrow distribution)
  • Large SE = imprecise (wide distribution)
  • Bigger sample = smaller SE

Bigger Sample, Less Wiggle

Show code
set.seed(42)

means_30 <- replicate(1000, mean(sample(pop_height, 30)))
means_200 <- replicate(1000, mean(sample(pop_height, 200)))

compare <- rbind(
  data.frame(mean_height = means_30,
             sample = paste0("n = 30  (SE = ", round(sd(means_30), 2), " in)")),
  data.frame(mean_height = means_200,
             sample = paste0("n = 200  (SE = ", round(sd(means_200), 2), " in)"))
)

ggplot(compare, aes(x = mean_height)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.2, linetype = "dashed") +
  facet_wrap(~sample, ncol = 1) +
  labs(x = "Sample mean (inches)", y = "Count",
       title = "More data, less wiggle") +
  theme_minimal(base_size = 16)

Why 1.96?

We said the sampling distribution is normal. In a normal distribution, 95% of values fall within 1.96 standard deviations of the mean. That is where the 1.96 comes from.
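The 1.96 cutoff comes straight from the normal quantile function:

```r
# The value that cuts off the top 2.5% of a standard normal
qnorm(0.975)                # about 1.96

# Probability of landing within 1.96 SDs of the mean
pnorm(1.96) - pnorm(-1.96)  # about 0.95
```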

Confidence Intervals


Your point estimate is a bit off each time. So instead of reporting just one number, we report a range.

The SE is the standard deviation of the sampling distribution. So 95% of sample means fall within 1.96 SEs of the true value:

\[CI = \text{estimate} \pm 1.96 \times SE\]

A 95% CI will contain the true value 95% of the time across repeated samples.
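As a sketch, here is the formula applied by hand to a single simulated sample of 30 heights:

```r
set.seed(42)

# One simulated sample of 30 heights
s   <- rnorm(30, mean = 67.5, sd = 5)
est <- mean(s)
se  <- sd(s) / sqrt(length(s))

# 95% CI: estimate plus/minus 1.96 standard errors
ci <- est + c(-1.96, 1.96) * se
round(ci, 1)
```

The interval is centered on the sample mean and its width is exactly 2 × 1.96 × SE.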

100 Confidence Intervals

Show code
set.seed(42)
n_sims <- 100
results <- data.frame(
  sim = 1:n_sims,
  mean_hat = numeric(n_sims),
  se = numeric(n_sims)
)

for (i in 1:n_sims) {
  s <- sample(pop_height, 30)
  results$mean_hat[i] <- mean(s)
  results$se[i] <- sd(s) / sqrt(30)
}

results$lower <- results$mean_hat - 1.96 * results$se
results$upper <- results$mean_hat + 1.96 * results$se
results$covers <- results$lower <= true_mean & results$upper >= true_mean
coverage <- round(mean(results$covers) * 100)

ggplot(results, aes(x = sim, y = mean_hat, color = covers)) +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0, alpha = 0.5) +
  geom_hline(yintercept = true_mean, color = "red", linewidth = 1.2, linetype = "dashed") +
  scale_color_manual(values = c("FALSE" = "orange", "TRUE" = "#011F5B"),
                     labels = c("Misses the truth", "Contains the truth")) +
  labs(x = "Sample number", y = "Estimated mean height (in)",
       title = paste0(coverage, " out of 100 intervals contain the true mean"),
       color = "") +
  theme_minimal(base_size = 16) +
  theme(legend.position = "bottom")

Same Logic for Treatment Effects


In research, we often care about whether some treatment has an effect, not just what the average is. Same problem: different sample, slightly different estimate, noise.

The question is usually: is the effect different from zero, or is it just noise?

To answer this, we imagine a world where there is no effect (the null hypothesis, \(H_0\)). Under that assumption, the sampling distribution of our estimate is centered at zero. How far out does our estimate land?

The p-value

If the true effect were zero, this is what the sampling distribution would look like. The further out our estimate lands, the less likely it could have arisen by chance.


The p-value is the shaded area: the probability of seeing an estimate at least that extreme if \(H_0\) were true.

p-values in Practice


  • Small p-value (< 0.05): very unlikely under no effect. Reject the null.
  • Large p-value (> 0.05): consistent with no effect. Fail to reject.
  • We usually set the threshold at 0.05, but it could be anything. It is a convention, not a law of nature.
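As a sketch with hypothetical numbers, and assuming a normal sampling distribution, the two-sided p-value is just a tail area:

```r
# Hypothetical estimate and SE (illustration only)
est <- 2.1
se  <- 1.0

z <- est / se             # how many SEs from zero
p <- 2 * pnorm(-abs(z))   # two-sided tail area under the null
p
```

Here the estimate sits 2.1 SEs from zero, so p falls just under the 0.05 threshold.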

Putting It All Together

Reading summary(lm())

We have been working with height. Now let’s see this in a regression. We simulate data where each additional inch of height adds ~0.4 points to a basketball tryout score.

options(scipen = 999)

set.seed(42)
tryout <- data.frame(
  height = round(sample(pop_height, 80), 1)
)
tryout$score <- 40 + 0.4 * tryout$height + rnorm(80, 0, 5)

model <- lm(score ~ height, data = tryout)
summary(model)

Call:
lm(formula = score ~ height, data = tryout)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.836  -3.067   0.504   3.273  13.627 

Coefficients:
            Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  39.2662     7.1884   5.462 0.000000542 ***
height        0.4092     0.1064   3.845    0.000244 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.895 on 78 degrees of freedom
Multiple R-squared:  0.1594,    Adjusted R-squared:  0.1486 
F-statistic: 14.79 on 1 and 78 DF,  p-value: 0.0002445

Every Number, Explained


  • Estimate (height): for each additional inch of height, tryout score increases by ~0.4 points
  • Std. Error: across repeated samples, this estimate would bounce around by about this much
  • t value: Estimate / Std. Error. How many SEs away from zero?
  • Pr(>|t|): the p-value. Would this be surprising if the true slope were zero?
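You can reproduce the t value and p-value by hand from the printed estimate and standard error (with 78 degrees of freedom, as in the output above):

```r
# Numbers copied from the summary output
est <- 0.4092
se  <- 0.1064
df  <- 78

t_val <- est / se                  # estimate in SE units
p_val <- 2 * pt(-abs(t_val), df)   # two-sided p-value from the t distribution

c(t = round(t_val, 3), p = signif(p_val, 3))
```

This matches the printed t value of about 3.85 and p-value of about 0.0002, up to rounding of the inputs.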

Confidence Interval in R

confint(model)
                 2.5 %     97.5 %
(Intercept) 24.9551596 53.5772870
height       0.1973415  0.6210762


The 95% CI for the effect of height is [0.2, 0.62].

The interval does not contain zero, which is consistent with the small p-value: we have evidence that taller students score higher on the tryout.

What Is Next


  • Now you can run a regression and read every number in the output
  • After spring break: Crime and Punishment, how policing and incarceration affect communities
  • Your Research Design is due March 18: describe your sample, specify a regression model, interpret the output