Global Development: Intermediate Topics in Politics, Policy, and Data
PSCI 3200 - Spring 2026
Today
Know your sample
Getting it right on average
Quantifying how off you are
Your Research Design is due March 18.
Know Your Sample
Your Data Can Only Speak to What Is In It
Before you run a single analysis, ask:
What is my unit of observation?
What units are in my sample?
What units are left out?
Your data cannot tell you anything about the units that are not in it!
A cautionary tale: “Dewey Defeats Truman” (1948)
What Went Wrong?
The pollsters were certain Dewey would win.
Gallup, Roper, and Crossley all predicted a comfortable Dewey victory
They used quota sampling: interviewers chose respondents matching demographic quotas
But interviewers gravitated toward wealthier, more accessible, more Republican-leaning people
The sample was big. That was not the problem.
The problem was who was in it and who was not.
This Happens in Research Too
Survey only urban areas: miss rural poverty
Phone surveys: miss people without phones
Only villages near roads: miss remote populations
Household surveys: miss homeless, displaced, migrant populations
Voluntary response: people with strong opinions over-represented
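The phone-survey case is easy to simulate. A minimal sketch with made-up numbers: suppose rural residents are poorer and less likely to own phones. The phone-owner mean then overstates average income, no matter how many people you call.

```r
set.seed(1)

# Hypothetical population: 70% urban, 30% rural (all numbers made up)
n <- 10000
urban <- rbinom(n, 1, 0.7)
income <- ifelse(urban == 1, rnorm(n, 500, 100), rnorm(n, 200, 50))

# Phone ownership depends on location: 90% of urban, 30% of rural residents
has_phone <- rbinom(n, 1, ifelse(urban == 1, 0.9, 0.3))

mean(income)                   # true population mean income
mean(income[has_phone == 1])   # what a phone survey would estimate: too high
```

The gap between the two means is the Dewey problem in miniature: the sample is large, but rural households are systematically missing from it.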
Statistics Have Your Back
Let’s Measure Average Height at Penn
There are about 25,000 students. We cannot measure all of them. So we grab 30 at random.
For this simulation, the true average is 67.5 inches. Let’s see what happens.
Random Samples
```r
library(ggplot2)
library(knitr)
set.seed(42)

# True population of Penn students
pop_size <- 25000
pop_height <- rnorm(pop_size, mean = 67.5, sd = 5)
true_mean <- mean(pop_height)

random_means <- round(sapply(1:4, function(i) mean(sample(pop_height, 30))), 1)
kable(
  data.frame(
    Sample = 1:4,
    `Sample Mean` = random_means,
    `True Mean` = round(true_mean, 1),
    `Off By` = random_means - round(true_mean, 1)
  ),
  col.names = c("Sample", "Sample Mean (in)", "True Mean (in)", "Off By"),
  align = "cccc"
)
```
| Sample | Sample Mean (in) | True Mean (in) | Off By |
|:------:|:----------------:|:--------------:|:------:|
| 1      | 67.0             | 67.5           | -0.5   |
| 2      | 67.5             | 67.5           | 0.0    |
| 3      | 68.5             | 67.5           | 1.0    |
| 4      | 67.0             | 67.5           | -0.5   |
Each sample is a bit off: sometimes too high, sometimes too low. On average, we get it right.
What About 1,000 Random Samples?
```r
set.seed(42)
n_sims <- 1000
sample_means <- numeric(n_sims)
for (i in 1:n_sims) {
  sample_means[i] <- mean(sample(pop_height, 30))
}

ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.5, linetype = "dashed") +
  annotate("label", x = true_mean, y = 80,
           label = paste0("True mean = ", round(true_mean, 1), " in"),
           color = "red", size = 6) +
  labs(x = "Sample mean (inches)", y = "Count",
       title = "1,000 sample means from 1,000 random samples of 30") +
  theme_minimal(base_size = 18)
```
The Central Limit Theorem
The sample means center on the population value
And their distribution is normal
You never see this distribution in real life. You only have your one sample. But you know it is there and it is normal.
Why do we care that it is normal? Because we know everything about normal distributions. We can calculate exactly how spread out it is, where 95% of values fall, etc. That lets us quantify uncertainty.
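Those normal-distribution facts are one line of R each. For instance, the familiar 1.96 is just the 97.5th percentile of the standard normal:

```r
# The cutoff that leaves 2.5% in each tail, so 95% of values
# lie within this many standard deviations of the mean
qnorm(0.975)
# ~1.96

# Check: probability of landing within 1.96 SDs of the mean
pnorm(1.96) - pnorm(-1.96)
# ~0.95
```

This is why knowing the sampling distribution is normal is so useful: its entire shape is pinned down by two numbers, the center and the spread.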
But What If You Get Lazy?
Instead of sampling randomly, you stand outside the Palestra after a basketball game and measure 30 people who walk out. Repeat three times.
```r
# Biased samples: people near the basketball court are taller
bball_means <- round(rnorm(3, mean = 73, sd = 0.8), 1)
kable(
  data.frame(
    Sample = 1:3,
    `Sample Mean` = bball_means,
    `True Mean` = round(true_mean, 1),
    `Off By` = bball_means - round(true_mean, 1)
  ),
  col.names = c("Sample", "Sample Mean (in)", "True Mean (in)", "Off By"),
  align = "cccc"
)
```
| Sample | Sample Mean (in) | True Mean (in) | Off By |
|:------:|:----------------:|:--------------:|:------:|
| 1      | 71.6             | 67.5           | 4.1    |
| 2      | 71.7             | 67.5           | 4.2    |
| 3      | 72.4             | 67.5           | 4.9    |
The CLT Still Works Here
The distribution of basketball-court sample means is still normal
It still centers on a population value
But it centers on the basketball-court population value, not the Penn student population value
The statistics are doing their job perfectly. You asked the wrong question!
This is the Dewey problem all over again.
Quantifying How Off You Are
The Standard Error
Any single estimate is still off by a little. How much?
The standard error (SE) is the standard deviation of the sampling distribution.
It tells you: how much does my estimate bounce around across repeated samples?
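For a sample mean, the SE is estimated from the sample standard deviation \(s\) and the sample size \(n\):

\[SE = \frac{s}{\sqrt{n}}\]

Larger samples shrink the SE, but only at the rate \(\sqrt{n}\). This is the sd(s) / sqrt(30) computed in the confidence-interval simulation.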
We said the sampling distribution is normal. In a normal distribution, 95% of values fall within 1.96 standard deviations of the mean.
Confidence Intervals
Your point estimate is a bit off each time. So instead of reporting a single number, we report a range.
The SE is the standard deviation of the sampling distribution. So 95% of sample means fall within 1.96 SEs of the true value:
\[CI = \text{estimate} \pm 1.96 \times SE\]
A 95% CI will contain the true value 95% of the time across repeated samples.
100 Confidence Intervals
```r
set.seed(42)
n_sims <- 100
results <- data.frame(sim = 1:n_sims,
                      mean_hat = numeric(n_sims),
                      se = numeric(n_sims))
for (i in 1:n_sims) {
  s <- sample(pop_height, 30)
  results$mean_hat[i] <- mean(s)
  results$se[i] <- sd(s) / sqrt(30)
}
results$lower <- results$mean_hat - 1.96 * results$se
results$upper <- results$mean_hat + 1.96 * results$se
results$covers <- results$lower <= true_mean & results$upper >= true_mean
coverage <- round(mean(results$covers) * 100)

ggplot(results, aes(x = sim, y = mean_hat, color = covers)) +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0, alpha = 0.5) +
  geom_hline(yintercept = true_mean, color = "red", linewidth = 1.2, linetype = "dashed") +
  scale_color_manual(values = c("FALSE" = "orange", "TRUE" = "#011F5B"),
                     labels = c("Misses the truth", "Contains the truth")) +
  labs(x = "Sample number", y = "Estimated mean height (in)",
       title = paste0(coverage, " out of 100 intervals contain the true mean"),
       color = "") +
  theme_minimal(base_size = 16) +
  theme(legend.position = "bottom")
```
Same Logic for Treatment Effects
In research, we often care about whether some treatment has an effect, not just what the average is. Same problem: different sample, slightly different estimate, noise.
The question is usually: is the effect different from zero, or is it just noise?
To answer this, we imagine a world where there is no effect (the null hypothesis, \(H_0\)). Under that assumption, the sampling distribution of our estimate is centered at zero. How far out does our estimate land?
The p-value
If the true effect were zero, this is what the sampling distribution would look like. The further out our estimate lands, the less likely it is to have arisen by chance alone.
The p-value is the shaded area: the probability of seeing an estimate at least that extreme if \(H_0\) were true.
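Because the sampling distribution under \(H_0\) is approximately normal, that shaded area is a normal tail probability. A minimal sketch with made-up numbers (the estimate and SE here are hypothetical, not from the slides):

```r
# Hypothetical: estimated effect of 2.1 with a standard error of 0.9
estimate <- 2.1
se <- 0.9

z <- estimate / se                   # how many SEs from zero the estimate lands
p_value <- 2 * (1 - pnorm(abs(z)))   # area in both tails beyond |z|
z
p_value
```

Here the estimate sits about 2.3 SEs from zero, so the two-sided p-value comes out a bit under 0.05: unlikely, though not impossible, if the true effect were zero.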
p-values in Practice
Small p-value (< 0.05): very unlikely under no effect. Reject the null.
Large p-value (> 0.05): consistent with no effect. Fail to reject.
We usually set the threshold at 0.05, but it could be anything. It is a convention, not a law of nature.
Putting It All Together
Reading summary(lm())
We have been working with height. Now let’s see this in a regression. We simulate data where each additional inch of height adds ~0.4 points to a basketball tryout score.
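A minimal version of that simulation might look like the following (the variable names and the intercept and noise values are illustrative, not necessarily the ones used in class):

```r
set.seed(42)

# Simulate: each extra inch of height adds ~0.4 tryout points, plus noise
n <- 200
height <- rnorm(n, mean = 67.5, sd = 5)
score <- 20 + 0.4 * height + rnorm(n, sd = 3)

summary(lm(score ~ height))
```

The summary output reports the estimated slope on height, its standard error, and a p-value: exactly the point estimate, SE, and hypothesis test from the previous slides, now packaged in one table.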