The urban-only sample overestimates the population average
Any conclusions we draw from this sample will be misleading
This is selection bias — the sample isn’t representative
No amount of fancy statistics can fix a bad sample
Lesson: before trusting any result, ask: “Where did this data come from?”
The Sampling Distribution
A Thought Experiment
Imagine you could sample 200 countries, compute the average life expectancy, then…
Throw out the sample
Draw a new sample of 200 countries
Compute the average again
Repeat this 1,000 times
What would the distribution of those averages look like?
Let’s Find Out: 1 Sample
set.seed(42)

# True population
pop_size <- 10000
pop_life <- rnorm(pop_size, mean = 68, sd = 10)
true_mean <- mean(pop_life)

# Draw ONE sample
one_sample <- sample(pop_life, 200)

ggplot(data.frame(x = one_sample), aes(x = x)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 25, color = "white") +
  geom_vline(xintercept = mean(one_sample), color = "#011F5B", linewidth = 1.5) +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.5, linetype = "dashed") +
  annotate("label", x = mean(one_sample) + 3, y = 25,
           label = paste0("Sample mean = ", round(mean(one_sample), 1)),
           color = "#011F5B", size = 5) +
  annotate("label", x = true_mean - 3, y = 25,
           label = paste0("True mean = ", round(true_mean, 1)),
           color = "red", size = 5) +
  labs(x = "Life expectancy", y = "Count",
       title = "One sample of 200 countries") +
  theme_minimal(base_size = 18)
Now 1,000 Samples
set.seed(42)

n_sims <- 1000
sample_means <- numeric(n_sims)
for (i in 1:n_sims) {
  sample_means[i] <- mean(sample(pop_life, 200))
}

ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_mean, color = "red", linewidth = 1.5, linetype = "dashed") +
  annotate("label", x = true_mean, y = 100,
           label = paste0("True mean = ", round(true_mean, 1)),
           color = "red", size = 6) +
  labs(x = "Sample mean", y = "Count",
       title = "Distribution of 1,000 sample means",
       subtitle = "Each is the average from a random sample of 200 countries") +
  theme_minimal(base_size = 18)
What Just Happened?
Each sample gives a slightly different answer
But the answers cluster around the truth
The distribution of sample estimates is called the sampling distribution
It’s approximately bell-shaped (Central Limit Theorem)
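The bell shape does not depend on the population itself being normal; that is the point of the Central Limit Theorem. As a quick check (a sketch in base R, not part of the lecture code), repeat the simulation with a heavily skewed exponential population and the sample means still come out roughly bell-shaped:

```r
set.seed(42)

# A strongly right-skewed population -- nothing like a bell curve
skew_pop <- rexp(10000, rate = 1 / 68)

# Sampling distribution of the mean, 1,000 samples of size 200
skew_means <- numeric(1000)
for (i in 1:1000) {
  skew_means[i] <- mean(sample(skew_pop, 200))
}

# The sample means are approximately normal around the population mean
hist(skew_means, breaks = 30,
     main = "Sample means from a skewed population")
abline(v = mean(skew_pop), col = "red", lwd = 2, lty = 2)
```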
This Works for Regression Coefficients Too
set.seed(42)

true_beta <- 0.0004
n_sims <- 1000
betas <- numeric(n_sims)
for (i in 1:n_sims) {
  x <- runif(100, 500, 50000)
  y <- 55 + true_beta * x + rnorm(100, 0, 4)
  betas[i] <- coef(lm(y ~ x))[2]
}

ggplot(data.frame(betas = betas), aes(x = betas)) +
  geom_histogram(fill = "#011F5B", alpha = 0.7, bins = 30, color = "white") +
  geom_vline(xintercept = true_beta, color = "red", linewidth = 1.5, linetype = "dashed") +
  annotate("label", x = true_beta, y = 100,
           label = paste0("True slope = ", true_beta),
           color = "red", size = 6) +
  labs(x = expression(hat(beta)), y = "Count",
       title = "1,000 estimated slopes from 1,000 samples",
       subtitle = "Same idea: estimates scatter around the truth") +
  theme_minimal(base_size = 18)
The Key Insight
Our estimate from any one sample is one draw from the sampling distribution.
We don’t know exactly where in the distribution our estimate falls
But we can estimate how wide the distribution is
That width is the standard error
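For a sample mean, this width even has a known formula: the population standard deviation divided by the square root of the sample size. A quick simulation check (a sketch using the same kind of simulated population as above):

```r
set.seed(42)

pop <- rnorm(10000, mean = 68, sd = 10)

# Simulate the sampling distribution of the mean for n = 200
means <- replicate(1000, mean(sample(pop, 200)))

# The SD of the simulated means should be close to the theoretical SE
sd(means)            # simulated standard error
sd(pop) / sqrt(200)  # theoretical standard error, roughly 0.7
```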
Standard Errors and Confidence Intervals
Standard Error
Standard error = the standard deviation of the sampling distribution.
It measures: how much would our estimate bounce around if we could repeat the study?
Small SE → our estimate is precise (the sampling distribution is narrow)
Large SE → our estimate is imprecise (the sampling distribution is wide)
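One way to see the small-SE vs. large-SE contrast directly (an illustrative sketch, not from the slides): simulate the sampling distribution of the mean at two sample sizes and compare the widths. Quadrupling the sample size roughly halves the SE.

```r
set.seed(42)

pop <- rnorm(10000, mean = 68, sd = 10)

# Simulated SE of the sample mean for a given sample size n
se_for_n <- function(n) {
  sd(replicate(1000, mean(sample(pop, n))))
}

se_for_n(50)   # wide sampling distribution  -> large SE
se_for_n(500)  # narrow sampling distribution -> small SE
```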
A 95% confidence interval gives us a plausible range for the true value.
\[
CI = \hat{\beta} \pm 1.96 \times SE
\]
Interpretation: “If we repeated this study many times, about 95% of the confidence intervals would contain the true value.”
Visualizing Confidence Intervals
set.seed(42)

n_sims <- 100
results <- data.frame(sim = 1:n_sims,
                      beta = numeric(n_sims),
                      se = numeric(n_sims))
for (i in 1:n_sims) {
  x <- runif(100, 500, 50000)
  y <- 55 + true_beta * x + rnorm(100, 0, 4)
  fit <- lm(y ~ x)
  results$beta[i] <- coef(fit)[2]
  results$se[i] <- summary(fit)$coefficients[2, 2]
}

results$lower <- results$beta - 1.96 * results$se
results$upper <- results$beta + 1.96 * results$se
results$covers <- results$lower <= true_beta & results$upper >= true_beta
coverage <- round(mean(results$covers) * 100)

ggplot(results, aes(x = sim, y = beta, color = covers)) +
  geom_point(size = 1.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0, alpha = 0.5) +
  geom_hline(yintercept = true_beta, color = "red", linewidth = 1.2, linetype = "dashed") +
  scale_color_manual(values = c("FALSE" = "orange", "TRUE" = "#011F5B"),
                     labels = c("Misses truth", "Contains truth")) +
  labs(x = "Simulation", y = expression(hat(beta)),
       title = paste0(coverage, " out of 100 confidence intervals contain the truth"),
       color = "") +
  theme_minimal(base_size = 16) +
  theme(legend.position = "bottom")
Confidence Intervals in R
# Using the model from last class
# (set.seed comes first so the simulated data are reproducible)
set.seed(42)
countries <- data.frame(gdp_pc = runif(80, 500, 50000))
countries$life_exp <- 55 + 0.0004 * countries$gdp_pc + rnorm(80, 0, 4)

model <- lm(life_exp ~ gdp_pc, data = countries)
confint(model)
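Note that confint() uses the t distribution rather than the normal 1.96, but with 78 residual degrees of freedom the critical value is nearly identical. A sketch of the same computation done by hand:

```r
# Rebuild the same simulated model
set.seed(42)
countries <- data.frame(gdp_pc = runif(80, 500, 50000))
countries$life_exp <- 55 + 0.0004 * countries$gdp_pc + rnorm(80, 0, 4)
model <- lm(life_exp ~ gdp_pc, data = countries)

# Manual 95% CI for the slope: estimate +/- critical value * SE
b  <- coef(model)["gdp_pc"]
se <- summary(model)$coefficients["gdp_pc", "Std. Error"]
tc <- qt(0.975, df = df.residual(model))  # about 1.99 with 78 df

c(b - tc * se, b + tc * se)  # matches the gdp_pc row of confint(model)
```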