Linear Regression
Carolina Torreblanca
University of Pennsylvania
Global Development: Intermediate Topics in Politics, Policy, and Data
PSCI 3200 - Spring 2026
Almost every empirical paper in this course uses linear regression
For your final projects, you need to:
Today: what regression actually does, and how to use it.
“Given what I know about X, what is my best guess for Y?”
“When X goes up by one unit, how much does Y change?”
Comparing a treatment group to a control group:
\[ \text{Effect} = \overline{Y}_{treatment} - \overline{Y}_{control} \]
Remember the STAR experiment?
We can estimate this with regression
Method 1: Difference in means
The coefficient on treatment = the difference in means
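To see why the two methods agree, here is a minimal sketch on simulated data (not the STAR data). The data frame `sim`, its column names, and the effect size of 5 are all made up for illustration.

```r
# Hypothetical experiment: these numbers are simulated, not from STAR
set.seed(1)
n_students <- 200
sim <- data.frame(treatment = rbinom(n_students, 1, 0.5))  # 1 = treated, 0 = control
sim$score <- 50 + 5 * sim$treatment + rnorm(n_students, 0, 10)

# Method 1: difference in means
diff_means <- mean(sim$score[sim$treatment == 1]) -
  mean(sim$score[sim$treatment == 0])

# Method 2: regress the outcome on the treatment indicator
reg_coef <- unname(coef(lm(score ~ treatment, data = sim))["treatment"])

# The two numbers agree up to floating-point error
all.equal(diff_means, reg_coef)
```

With a single 0/1 predictor, the intercept is the control-group mean and the slope is the gap between the two group means.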
Because the real world is more complicated:
library(ggplot2)
set.seed(42)
n <- 80
countries <- data.frame(
gdp = runif(n, 1, 50) # GDP per capita in thousands of USD
)
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)
ggplot(countries, aes(x = gdp, y = life_exp)) +
geom_point(size = 3, alpha = 0.6, color = "#011F5B") +
labs(x = "GDP per capita (thousands USD)",
y = "Life expectancy (years)",
title = "Each dot is a country") +
theme_minimal(base_size = 18)
There’s clearly a pattern. But what line best captures it?
We need a rule for the “best line.” That rule is OLS.
A line needs two numbers: an intercept and a slope
\[ Y_i = \alpha + \beta X_i + \epsilon_i \]
Life expectancy depends on a million things: healthcare, diet, war, genetics, luck…
For each data point, the line makes a prediction:
\[ \text{Prediction for country } i = \alpha + \beta X_i \]
The residual is how far off that prediction is:
\[ \text{Residual}_i = \text{Observed } Y_i - \text{Prediction}_i \]
OLS picks the \(\alpha\) and \(\beta\) that make the residuals as small as possible.
Specifically, it minimizes the sum of squared residuals:
\[ SSR = \sum_{i=1}^{N} \text{Residual}_i^2 \]
OLS = the line that makes these red segments as small as possible (in total).
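The minimization can be checked by hand. The sketch below rebuilds the simulated `countries` data from the earlier code and, for the one-predictor case, uses the standard closed-form solution (slope = cov(X, Y) / var(X), intercept = mean(Y) minus slope times mean(X)). The helper `ssr()` and the names `alpha_hat` and `beta_hat` are introduced here for illustration only.

```r
# Rebuild the simulated countries data used in the scatterplot
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)

# Sum of squared residuals for any candidate line (alpha, beta)
ssr <- function(alpha, beta) {
  residuals <- countries$life_exp - (alpha + beta * countries$gdp)
  sum(residuals^2)
}

# Closed-form OLS solution with one predictor
beta_hat  <- cov(countries$gdp, countries$life_exp) / var(countries$gdp)
alpha_hat <- mean(countries$life_exp) - beta_hat * mean(countries$gdp)

# lm() returns the same two numbers
fit <- lm(life_exp ~ gdp, data = countries)
all.equal(c(alpha_hat, beta_hat), unname(coef(fit)))

# Nudging the intercept or the slope makes the SSR larger
ssr(alpha_hat, beta_hat) < ssr(alpha_hat + 1, beta_hat)
ssr(alpha_hat, beta_hat) < ssr(alpha_hat, beta_hat + 0.05)
```

Squaring the residuals means that misses above and below the line both count, and large misses count more than small ones.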
Imagine you are God. You get to decide how the world works.
You decree: life expectancy = 55 + 0.4 \(\times\) GDP
That’s the general rule. But you’re a generous God: each country gets a little bit of free will (randomness).
Question: If a researcher only sees the data, can they figure out what you decided?
Now let’s pretend we’re a researcher who only sees the data, not the rule.
fit <- lm(life_exp ~ gdp, data = god_data)
ggplot(god_data, aes(x = gdp, y = life_exp)) +
geom_point(alpha = 0.5, color = "#011F5B") +
geom_abline(intercept = true_intercept, slope = true_slope,
color = "red", linewidth = 1.5, linetype = "dashed") +
geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
color = "#011F5B", linewidth = 1.5) +
annotate("label", x = 8, y = 73, label = "God's truth",
color = "red", size = 5) +
annotate("label", x = 8, y = 70, label = "OLS estimate",
color = "#011F5B", size = 5) +
labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
title = "OLS gets very close to the truth") +
theme_minimal(base_size = 18)
true_intercept     true_slope
          55.0            0.4
(Intercept)         gdp
 54.8574688   0.3633341
Not perfect, because of the free will (randomness), but very close.
par_df <- data.frame()
for (world in 1:3) {
set.seed(world * 10)
g <- runif(n, 1, 50)
le <- true_intercept + true_slope * g + rnorm(n, 0, 20)
f <- lm(le ~ g)
par_df <- rbind(par_df, data.frame(
gdp = g, life_exp = le,
world = paste("World", world,
" (slope =", round(coef(f)[2], 2), ")")
))
}
ggplot(par_df, aes(x = gdp, y = life_exp)) +
geom_point(alpha = 0.4, color = "#011F5B") +
geom_smooth(method = "lm", se = FALSE, color = "#011F5B", linewidth = 1) +
geom_abline(intercept = true_intercept, slope = true_slope,
color = "red", linewidth = 1, linetype = "dashed") +
facet_wrap(~world) +
labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
title = "Same rule, different randomness — OLS is close each time",
subtitle = "Red dashed = God's truth (slope = 0.4)") +
theme_minimal(base_size = 14)
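Three worlds is a small sample of worlds. As a rough sketch of the same idea at scale, the code below draws 1,000 worlds under the same rule and noise level as the loop above, then looks at how the OLS slopes are spread around the true 0.4. The helper `one_world_slope()` and the seed are assumptions made for this illustration.

```r
# God's rule, as in the slides
true_intercept <- 55
true_slope <- 0.4
n <- 80

# Draw one world and return its OLS slope
one_world_slope <- function() {
  g  <- runif(n, 1, 50)
  le <- true_intercept + true_slope * g + rnorm(n, 0, 20)
  unname(coef(lm(le ~ g))[2])
}

set.seed(123)
slopes <- replicate(1000, one_world_slope())

mean(slopes)  # average estimate across worlds
sd(slopes)    # how much a single world's estimate varies
```

Any single world can miss the truth, but the estimates centre on it; the spread across worlds is what a standard error tries to measure.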
set.seed(42)
x_sim <- runif(200, 0, 10)
# Three different truths — with clear, big coefficients
linear_y <- 10 + 3 * x_sim + rnorm(200, 0, 2)
quadratic_y <- 10 + 8 * x_sim - 0.8 * x_sim^2 + rnorm(200, 0, 2)
cubic_y <- 5 + 12 * x_sim - 3 * x_sim^2 + 0.2 * x_sim^3 + rnorm(200, 0, 2)
sim_worlds <- rbind(
data.frame(x = x_sim, y = linear_y, truth = "Y = 10 + 3X (linear)"),
data.frame(x = x_sim, y = quadratic_y, truth = "Y = 10 + 8X - 0.8X² (quadratic)"),
data.frame(x = x_sim, y = cubic_y, truth = "Y = 5 + 12X - 3X² + 0.2X³ (cubic)")
)
ggplot(sim_worlds, aes(x = x, y = y)) +
geom_point(alpha = 0.3, color = "#011F5B", size = 1.5) +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
geom_smooth(method = "loess", se = FALSE, color = "forestgreen",
linewidth = 1.2, linetype = "dashed") +
facet_wrap(~truth, scales = "free_y") +
labs(x = "X", y = "Y",
title = "Red = OLS line | Green dashed = actual pattern",
subtitle = "OLS forces a straight line even when the truth is curved") +
theme_minimal(base_size = 14)
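OLS is only forced into a straight line if we give it nothing but X. As a hedged sketch, the code below simulates a fresh draw from the quadratic rule above (the noise will not match the plotted data exactly) and compares a straight-line fit with one that also includes a squared term via `I(x_sim^2)`. The names `straight` and `curved` are introduced here for illustration.

```r
# A fresh draw from the quadratic world: Y = 10 + 8X - 0.8X^2 + noise
set.seed(42)
x_sim <- runif(200, 0, 10)
quadratic_y <- 10 + 8 * x_sim - 0.8 * x_sim^2 + rnorm(200, 0, 2)

# Straight line only
straight <- lm(quadratic_y ~ x_sim)

# Same OLS machinery, with a squared term added
curved <- lm(quadratic_y ~ x_sim + I(x_sim^2))

coef(curved)             # compare with 10, 8, and -0.8
sum(resid(straight)^2)   # SSR of the straight line
sum(resid(curved)^2)     # SSR once the curve is allowed
```

The model is still linear in its coefficients, so it is still OLS; only the set of predictors changed.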
What if we left something important out of the model?
Let’s play God again, but this time with two variables that matter.
Life expectancy is going to depend on both GDP and average education, BUT education depends on GDP as well.
set.seed(42)
n_c <- 80
ctrl <- data.frame(gdp = runif(n_c, 1, 50)) # GDP in thousands
# Education is correlated with GDP (richer countries → more education)
ctrl$education <- 2 + 0.4 * ctrl$gdp + rnorm(n_c, 0, 3)
# God's truth: BOTH matter, but education matters a LOT
ctrl$life_exp <- 45 + 0.2 * ctrl$gdp + 1.5 * ctrl$education + rnorm(n_c, 0, 2)
God decided:
| | Life Exp | Life Exp |
|---|---|---|
| (Intercept) | 47.854*** (1.089) | 45.131*** (0.427) |
| gdp | 0.808*** (0.035) | 0.170*** (0.032) |
| education | | 1.535*** (0.070) |
| Num.Obs. | 80 | 80 |
| R2 Adj. | 0.871 | 0.982 |
Without controls:
“A $1,000 increase in GDP is associated with a 0.81 year increase in life expectancy.”
With controls:
“Holding education constant, a $1,000 increase in GDP is associated with a 0.17 year increase.”
“Holding constant” = comparing countries with the same education level.
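Where does the 0.81 come from? Under God's rule, GDP has a direct effect of 0.2, plus an indirect path through education: each $1,000 of GDP brings 0.4 more years of education, and each year of education adds 1.5 years of life, so 0.2 + 1.5 × 0.4 = 0.8. The sketch below checks the in-sample version of that arithmetic on the simulated `ctrl` data; the model names `short`, `long`, and `aux` are introduced here for illustration.

```r
# Rebuild the simulated data with two variables that matter
set.seed(42)
n_c <- 80
ctrl <- data.frame(gdp = runif(n_c, 1, 50))
ctrl$education <- 2 + 0.4 * ctrl$gdp + rnorm(n_c, 0, 3)
ctrl$life_exp <- 45 + 0.2 * ctrl$gdp + 1.5 * ctrl$education + rnorm(n_c, 0, 2)

short <- lm(life_exp ~ gdp, data = ctrl)              # without controls
long  <- lm(life_exp ~ gdp + education, data = ctrl)  # with controls
aux   <- lm(education ~ gdp, data = ctrl)             # how education moves with GDP

# Omitted-variable arithmetic:
# short gdp coef = long gdp coef + (education coef) * (slope of education on gdp)
implied_short <- coef(long)["gdp"] + coef(long)["education"] * coef(aux)["gdp"]

all.equal(unname(implied_short), unname(coef(short)["gdp"]))
```

Leaving education out does not make its effect disappear; it gets loaded onto the GDP coefficient.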
We need a good reason for why the contrasts we’re estimating are causal.
Call:
lm(formula = life_exp ~ gdp, data = countries)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1672  -2.1310   0.4795   2.3804  10.7122

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept)  54.6994     0.8775   62.33 <0.0000000000000002 ***
gdp           0.4204     0.0282   14.91 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.758 on 78 degrees of freedom
Multiple R-squared:  0.7402,  Adjusted R-squared:  0.7369
F-statistic: 222.2 on 1 and 78 DF,  p-value: < 0.00000000000000022
Interpretation: “A one unit ($1k) increase in GDP per capita is associated with a 0.42 year increase in life expectancy.”
Or: “A $10k increase in GDP is associated with a 4.2 year increase in life expectancy.”
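The rescaling from $1k to $10k is just multiplication by the slope. As a small sketch, the code below rebuilds the simulated `countries` data, refits the model, and compares predicted life expectancy for two hypothetical countries at $20k and $30k GDP per capita (the values 20 and 30 are arbitrary choices for this illustration).

```r
# Rebuild the simulated countries data and refit the model
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)
fit <- lm(life_exp ~ gdp, data = countries)

slope <- unname(coef(fit)["gdp"])

# Predicted life expectancy at $20k and $30k GDP per capita
new_countries <- data.frame(gdp = c(20, 30))
preds <- predict(fit, newdata = new_countries)

# The gap between the two predictions is exactly 10 times the slope
gap <- unname(preds[2] - preds[1])
all.equal(gap, 10 * slope)
```

The same logic works for any change in X: predicted change in Y = slope × change in X.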
modelsummary()
ggplot(countries, aes(x = gdp, y = life_exp)) +
geom_point(size = 3, alpha = 0.6, color = "#011F5B") +
geom_smooth(method = "lm", color = "red", fill = "red", alpha = 0.15) +
labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
title = "OLS line with confidence band") +
theme_minimal(base_size = 18)
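The shaded band can also be read off as numbers. The sketch below rebuilds the simulated `countries` data and uses base R's `confint()` and `predict(..., interval = "confidence")` to get a 95% interval for the slope and for the fitted line at a few GDP values. The grid of GDP values is an arbitrary choice for this illustration.

```r
# Rebuild the simulated countries data and refit the model
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)
fit <- lm(life_exp ~ gdp, data = countries)

# 95% confidence interval for the slope on gdp
ci_slope <- confint(fit, "gdp", level = 0.95)
ci_slope

# 95% confidence interval for the fitted line at a grid of GDP values
grid <- data.frame(gdp = seq(1, 50, by = 7))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
cbind(grid, band)  # columns: gdp, fit, lwr, upr
```

The band describes uncertainty about the line itself, not about where an individual country will fall, so many dots sit outside it.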
https://carolina-torreblanca.github.io/psci3200-globaldev-main/