Introduction to OLS

Linear Regression

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

Agenda


  1. Why OLS?
  2. Two things regression does
  3. From difference-in-means to regression
  4. Continuous predictors and the regression line
  5. Playing God: simulating data
  6. Adding controls
  7. Running and interpreting OLS in R

Why OLS?

The Workhorse of Social Science


Almost every empirical paper in this course uses linear regression

  • Comparing a treatment group to a control group? Regression
  • Fixed effects? Regression with dummies as controls
  • Fancy Difference-in-Differences? Regression with an interaction term
  • Your final project? Likely Regression

Why Should You Care?


For your final projects, you need to:

  1. Come up with a research question
  2. Answer it with data
  3. Present and interpret your results


Today: what regression actually does, and how to use it.

Two Things Regression Does

Thing 1: Best Guess of Y


“Given what I know about X, what is my best guess for Y?”

  • If I tell you a country’s GDP per capita is $10,000…
  • What’s your best guess for its life expectancy?
  • Regression gives a linear best guess
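This "best guess" idea can be sketched in two lines of R. The toy data below is made up for illustration (it mimics the GDP/life-expectancy example used later, not a real dataset): fit a line, then ask `predict()` for the best linear guess at any value of X.

```r
# Made-up toy data: life expectancy loosely increasing in GDP per capita
set.seed(1)
toy <- data.frame(gdp = runif(30, 1, 50))  # GDP in thousands of USD
toy$life_exp <- 55 + 0.4 * toy$gdp + rnorm(30, 0, 4)

# Fit the line once...
fit <- lm(life_exp ~ gdp, data = toy)

# ...then get the best linear guess for a country with GDP per capita of $10k
predict(fit, newdata = data.frame(gdp = 10))
```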

Thing 2: How Much Does Y Change When X Changes?


“When X goes up by one unit, how much does Y change?”

  • If GDP per capita increases by one unit
  • How many more years of life expectancy should we expect?
  • This is the slope
  • Maybe even the causal effect of X on Y… it depends

From Difference-in-Means to Regression

Something You Already Know


Comparing a treatment group to a control group:

\[ \text{Effect} = \overline{Y}_{treatment} - \overline{Y}_{control} \]

Remember the STAR experiment?

We can estimate this with regression

Let’s Prove It

library(ggplot2)
library(modelsummary)

# Simulate an experiment
set.seed(42)
experiment <- data.frame(
  treatment = c(rep(0, 50), rep(1, 50))
)
experiment$outcome <- 10 + 3 * experiment$treatment + rnorm(100, 0, 2)

Method 1: Difference in means

mean(experiment$outcome[experiment$treatment == 1]) -
  mean(experiment$outcome[experiment$treatment == 0])
[1] 3.272746

Method 2: Regression

coef(lm(outcome ~ treatment, data = experiment))
(Intercept)   treatment 
   9.928656    3.272746 

Same Answer


  • The intercept = the control group mean
  • The slope on treatment = the difference in means
  • Regression with a binary variable IS a difference in means
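Both bullet points can be verified directly. The sketch below re-runs the simulated experiment from the previous slide (same seed and recipe) and checks that the intercept reproduces the control-group mean and the slope reproduces the difference in means:

```r
# Re-simulate the experiment from the previous slide
set.seed(42)
experiment <- data.frame(treatment = c(rep(0, 50), rep(1, 50)))
experiment$outcome <- 10 + 3 * experiment$treatment + rnorm(100, 0, 2)

fit <- lm(outcome ~ treatment, data = experiment)

# The intercept equals the control-group mean...
coef(fit)["(Intercept)"]
mean(experiment$outcome[experiment$treatment == 0])

# ...and the slope equals the difference in means
coef(fit)["treatment"]
mean(experiment$outcome[experiment$treatment == 1]) -
  mean(experiment$outcome[experiment$treatment == 0])
```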

Why Not Just Use Difference-in-Means for Everything?


Because the real world is more complicated:

  • What if your key variable isn’t binary? (GDP is continuous)
  • What if you need to control for something? (Education, geography…)
  • What if you want to compare across multiple groups at once?
  • Regression handles all of this. Difference-in-means can’t

The Regression Line

Now With a Continuous Variable

set.seed(42)
n <- 80
countries <- data.frame(
  gdp = runif(n, 1, 50)  # GDP per capita in thousands of USD
)
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)

ggplot(countries, aes(x = gdp, y = life_exp)) +
  geom_point(size = 3, alpha = 0.6, color = "#011F5B") +
  labs(x = "GDP per capita (thousands USD)",
       y = "Life expectancy (years)",
       title = "Each dot is a country") +
  theme_minimal(base_size = 18)

There’s clearly a pattern. But what line best captures it?

Which Line?

We need a rule for “best line.” That rule is OLS.

The Equation


A line needs two numbers: an intercept and a slope

\[ Y_i = \alpha + \beta X_i + \epsilon_i \]

  • \(\alpha\) (intercept): value of \(Y\) when \(X = 0\)
  • \(\beta\) (slope): how much \(Y\) changes when \(X\) goes up by one unit

What About \(\epsilon\)?


\[ Y_i = \alpha + \beta X_i + \epsilon_i \]

Life expectancy depends on a million things: healthcare, diet, war, genetics, luck…

  • We’re only using GDP to predict life expectancy
  • Everything else lands in \(\epsilon_i\) (the error term)
  • The error is the gap between our simple model and complicated reality

How OLS Picks the Line


For each data point, the line makes a prediction:

\[ \text{Prediction for country } i = \alpha + \beta X_i \]

The residual is how far off that prediction is:

\[ \text{Residual}_i = \text{Observed } Y_i - \text{Prediction}_i \]

Minimizing Residuals


OLS picks the \(\alpha\) and \(\beta\) that make the residuals as small as possible.

Specifically, it minimizes the sum of squared residuals:

\[ SSR = \sum_{i=1}^{N} \text{Residual}_i^2 \]

  • Why squared? So positive and negative misses don’t cancel out
  • And big misses are penalized more than small ones
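To see that `lm()` really is minimizing this quantity, a sketch (with made-up simulated data): write the SSR as a function of \(\alpha\) and \(\beta\), hand it to a generic numerical optimizer, and check that it lands on the same intercept and slope that `lm()` computes in closed form.

```r
# Made-up data with a known linear rule plus noise
set.seed(1)
x <- runif(50, 1, 50)
y <- 55 + 0.4 * x + rnorm(50, 0, 4)

# The sum of squared residuals, as a function of (alpha, beta)
ssr <- function(par) sum((y - (par[1] + par[2] * x))^2)

opt <- optim(c(0, 0), ssr, method = "BFGS")  # minimize SSR numerically
ols <- coef(lm(y ~ x))                       # OLS closed-form answer

round(opt$par, 4)        # numerical minimizer's (alpha, beta)
round(unname(ols), 4)    # lm()'s (alpha, beta): same numbers
```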

Visualizing Residuals

OLS = the line that makes these red segments as small as possible (in total).

Playing God

A Simulation


Imagine you are God. You get to decide how the world works.

You decree: life expectancy = 55 + 0.4 \(\times\) GDP

That’s the general rule. But you’re a generous God: each country gets a little bit of free will (randomness).


Question: If a researcher only sees the data, can they figure out what you decided?

You Create the World

set.seed(123)

# GOD'S RULE (the truth no one can see)
true_intercept <- 55
true_slope     <- 0.4

# Generate data for 100 countries
n <- 100
gdp <- runif(n, 1, 50)  # GDP in thousands of USD
life_exp <- true_intercept + true_slope * gdp + rnorm(n, 0, 20)

god_data <- data.frame(gdp, life_exp)


Now let’s pretend we’re a researcher who only sees the data, not the rule.

OLS Tries to Recover the Truth

fit <- lm(life_exp ~ gdp, data = god_data)

ggplot(god_data, aes(x = gdp, y = life_exp)) +
  geom_point(alpha = 0.5, color = "#011F5B") +
  geom_abline(intercept = true_intercept, slope = true_slope,
              color = "red", linewidth = 1.5, linetype = "dashed") +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
              color = "#011F5B", linewidth = 1.5) +
  annotate("label", x = 8, y = 73, label = "God's truth",
           color = "red", size = 5) +
  annotate("label", x = 8, y = 70, label = "OLS estimate",
           color = "#011F5B", size = 5) +
  labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
       title = "OLS gets very close to the truth") +
  theme_minimal(base_size = 18)

How Close?

# What God decided
c(true_intercept = true_intercept, true_slope = true_slope)
true_intercept     true_slope 
          55.0            0.4 
# What OLS found
coef(fit)
(Intercept)         gdp 
 54.8574688   0.3633341 


Not perfect, because of the free will (randomness), but very close.

Different Worlds, Same Rule

par_df <- data.frame()
for (world in 1:3) {
  set.seed(world * 10)
  g <- runif(n, 1, 50)
  le <- true_intercept + true_slope * g + rnorm(n, 0, 20)
  f <- lm(le ~ g)
  par_df <- rbind(par_df, data.frame(
    gdp = g, life_exp = le,
    world = paste("World", world,
                  "  (slope =", round(coef(f)[2], 2), ")")
  ))
}

ggplot(par_df, aes(x = gdp, y = life_exp)) +
  geom_point(alpha = 0.4, color = "#011F5B") +
  geom_smooth(method = "lm", se = FALSE, color = "#011F5B", linewidth = 1) +
  geom_abline(intercept = true_intercept, slope = true_slope,
              color = "red", linewidth = 1, linetype = "dashed") +
  facet_wrap(~world) +
  labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
       title = "Same rule, different randomness — OLS is close each time",
       subtitle = "Red dashed = God's truth (slope = 0.4)") +
  theme_minimal(base_size = 14)

OLS Works… When the Truth Is Linear


  • We made a linear rule, and OLS (which fits a line) recovered it
  • But what if the truth isn’t linear?
  • What if God decided something more complicated?

What If the Truth Is Curved?

set.seed(42)
x_sim <- runif(200, 0, 10)

# Three different truths — with clear, big coefficients
linear_y    <- 10 + 3 * x_sim + rnorm(200, 0, 2)
quadratic_y <- 10 + 8 * x_sim - 0.8 * x_sim^2 + rnorm(200, 0, 2)
cubic_y     <- 5 + 12 * x_sim - 3 * x_sim^2 + 0.2 * x_sim^3 + rnorm(200, 0, 2)

sim_worlds <- rbind(
  data.frame(x = x_sim, y = linear_y,    truth = "Y = 10 + 3X  (linear)"),
  data.frame(x = x_sim, y = quadratic_y, truth = "Y = 10 + 8X - 0.8X²  (quadratic)"),
  data.frame(x = x_sim, y = cubic_y,     truth = "Y = 5 + 12X - 3X² + 0.2X³  (cubic)")
)

ggplot(sim_worlds, aes(x = x, y = y)) +
  geom_point(alpha = 0.3, color = "#011F5B", size = 1.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  geom_smooth(method = "loess", se = FALSE, color = "forestgreen",
              linewidth = 1.2, linetype = "dashed") +
  facet_wrap(~truth, scales = "free_y") +
  labs(x = "X", y = "Y",
       title = "Red = OLS line | Green dashed = actual pattern",
       subtitle = "OLS forces a straight line even when the truth is curved") +
  theme_minimal(base_size = 14)

All Models Are Wrong, But Some Are Useful


  • OLS always fits a straight line
  • When the truth is linear, it works great
  • When the truth is curved, the line misses the pattern
  • In practice, OLS is surprisingly robust unless things are dramatically nonlinear
  • We mostly use OLS to estimate associations and marginal effects, not to predict

Adding Controls

The Problem: Omitted Variables


What if we left something important out of the model?


Let’s play God again, but this time with two variables that matter.

Life expectancy will depend on both GDP and average education, BUT education also depends on GDP.

Visualize the DAG

God’s New Rule

set.seed(42)
n_c <- 80
ctrl <- data.frame(gdp = runif(n_c, 1, 50)) # GDP in thousands

# Education is correlated with GDP (richer countries → more education)
ctrl$education <- 2 + 0.4 * ctrl$gdp + rnorm(n_c, 0, 3)

# God's truth: BOTH matter, but education matters a LOT
ctrl$life_exp <- 45 + 0.2 * ctrl$gdp + 1.5 * ctrl$education + rnorm(n_c, 0, 2)


God decided:

  • GDP matters a little (true slope = 0.2)
  • Education matters a lot (true slope = 1.5)
  • But GDP and education are correlated

What Happens If We Forget Education?

model_no_ctrl   <- lm(life_exp ~ gdp, data = ctrl)
model_with_ctrl <- lm(life_exp ~ gdp + education, data = ctrl)

modelsummary(list("Life Exp" = model_no_ctrl,
                  "Life Exp" = model_with_ctrl),
  estimate  = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
                       Life Exp            Life Exp
(Intercept)   47.854*** (1.089)   45.131*** (0.427)
gdp            0.808*** (0.035)    0.170*** (0.032)
education                          1.535*** (0.070)
Num.Obs.                     80                  80
R2 Adj.                   0.871               0.982

What Just Happened?


  • Without education: the GDP coefficient is inflated
  • With education: it drops dramatically, closer to God’s truth (0.2)
  • Why? GDP and education are correlated
  • GDP was getting credit for education’s work
  • This is omitted variable bias
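The "credit for education's work" intuition has an exact arithmetic behind it, the omitted-variable-bias formula: the short regression's GDP coefficient equals the true GDP coefficient plus (education's effect) \(\times\) (the slope of education on GDP). A sketch, re-simulating the data with the slides' recipe:

```r
# Re-simulate the slides' two-variable world
set.seed(42)
n <- 80
ctrl <- data.frame(gdp = runif(n, 1, 50))
ctrl$education <- 2 + 0.4 * ctrl$gdp + rnorm(n, 0, 3)
ctrl$life_exp  <- 45 + 0.2 * ctrl$gdp + 1.5 * ctrl$education + rnorm(n, 0, 2)

short <- coef(lm(life_exp ~ gdp, data = ctrl))["gdp"]               # omits education
long  <- coef(lm(life_exp ~ gdp + education, data = ctrl))["gdp"]   # controls for it
educ  <- coef(lm(life_exp ~ gdp + education, data = ctrl))["education"]
aux   <- coef(lm(education ~ gdp, data = ctrl))["gdp"]              # education on gdp

unname(short)               # the inflated coefficient
unname(long + educ * aux)   # long + bias term: reproduces it exactly
```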

Interpretation With Controls


Without controls:

“A $1,000 increase in GDP is associated with a 0.81 year increase in life expectancy.”

With controls:

“Holding education constant, a $1,000 increase in GDP is associated with a 0.17 year increase.”


“Holding constant” = comparing countries with the same education level.
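"Holding education constant" also has a mechanical meaning (the Frisch–Waugh–Lovell result): strip education out of both life expectancy and GDP, then regress the leftovers on each other. The sketch below re-simulates the slides' data to check that this reproduces the multiple regression's GDP coefficient:

```r
# Re-simulate the slides' two-variable world
set.seed(42)
n <- 80
d <- data.frame(gdp = runif(n, 1, 50))
d$education <- 2 + 0.4 * d$gdp + rnorm(n, 0, 3)
d$life_exp  <- 45 + 0.2 * d$gdp + 1.5 * d$education + rnorm(n, 0, 2)

# Step 1: remove the part of each variable explained by education
r_y <- resid(lm(life_exp ~ education, data = d))
r_x <- resid(lm(gdp ~ education, data = d))

# Step 2: regress the leftovers on each other
coef(lm(r_y ~ r_x))["r_x"]

# Same number as the gdp coefficient from the multiple regression
coef(lm(life_exp ~ gdp + education, data = d))["gdp"]
```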

Controls and Causality

Does Controlling Make It Causal?


  • Remember the causality lectures: a causal effect is about the counterfactual
  • “What would have happened to this country if its GDP were higher, but everything else was held constant?”

When Can We Make Causal Claims?


We need a good reason for why the contrasts we’re estimating are causal.

  • That reason always comes back to the same thing: the average counterfactual is valid
  • The comparison group is a good stand-in for what would have happened without the treatment
  • This is an assumption about the world, not about the model

Running OLS in R

The Full Workflow

# Fit the model
model <- lm(life_exp ~ gdp, data = countries)

# See the full output
summary(model)

Call:
lm(formula = life_exp ~ gdp, data = countries)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1672  -2.1310   0.4795   2.3804  10.7122 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  54.6994     0.8775   62.33 <0.0000000000000002 ***
gdp           0.4204     0.0282   14.91 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.758 on 78 degrees of freedom
Multiple R-squared:  0.7402,    Adjusted R-squared:  0.7369 
F-statistic: 222.2 on 1 and 78 DF,  p-value: < 0.00000000000000022

What Each Piece Means


  • Estimate: the intercept and slope are our best guesses of \(\alpha\) and \(\beta\)
  • Std. Error: the standard deviation of the sampling distribution of our estimates
  • Pr(>|t|): the probability of seeing an estimate at least this large (in absolute value) if the true value were zero
  • R-squared: fraction of the variation in \(Y\) explained by the \(X\)s (0 = nothing, 1 = everything)
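Each of these pieces can also be pulled out of the fitted model programmatically, which is handy for tables and plots. A sketch, re-running the slides' simulation so `model` exists:

```r
# Re-simulate the slides' countries data and refit the model
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)
model <- lm(life_exp ~ gdp, data = countries)

coef(summary(model))          # estimates, std. errors, t values, p-values
summary(model)$r.squared      # R-squared
confint(model, level = 0.95)  # 95% confidence intervals for alpha and beta
```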

Interpreting the Slope

coef(model)
(Intercept)         gdp 
 54.6994253   0.4204093 


Interpretation: “A one unit ($1k) increase in GDP per capita is associated with a 0.42 year increase in life expectancy.”

Or: “A $10k increase in GDP is associated with a 4.2 year increase in life expectancy.”
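The unit conversion can be checked directly: rescaling \(X\) by a factor of 10 scales the slope by exactly 10 and changes nothing else about the fit. A sketch, re-simulating the slides' data (`I()` is base R's way to transform a variable inside a formula):

```r
# Re-simulate the slides' countries data
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)

b_per_1k  <- coef(lm(life_exp ~ gdp, data = countries))["gdp"]
b_per_10k <- coef(lm(life_exp ~ I(gdp / 10), data = countries))[2]

unname(b_per_1k)   # years of life expectancy per $1k of GDP per capita
unname(b_per_10k)  # years per $10k: exactly 10 times the first slope
```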

Clean Output with modelsummary()

modelsummary(model,
  estimate  = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
Model 1
(Intercept) 54.699*** (0.878)
gdp 0.420*** (0.028)
Num.Obs. 80
R2 Adj. 0.737

Visualize the Line

ggplot(countries, aes(x = gdp, y = life_exp)) +
  geom_point(size = 3, alpha = 0.6, color = "#011F5B") +
  geom_smooth(method = "lm", color = "red", fill = "red", alpha = 0.15) +
  labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
       title = "OLS line with confidence band") +
  theme_minimal(base_size = 18)

Wrapping Up

What Regression Gives You


  1. A best (linear) guess of \(Y\) for any value of \(X\) (prediction)
  2. A slope that tells you how \(Y\) changes when \(X\) changes (association)
  3. The ability to control for other variables, or hold them constant

What Regression Does NOT Give You


  • Causality does not come from the model; it comes from the persuasiveness of your research design
  • A theory of the world! Controls only help if you’ve identified the right confounders (think about your DAGs!)