Introduction to OLS

Linear Regression

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

Agenda


  1. Why OLS?
  2. Two things regression does
  3. From difference-in-means to regression
  4. Continuous predictors and the regression line
  5. Playing God: simulating data
  6. Adding controls
  7. Running and interpreting OLS in R

Why OLS?

The Workhorse of Social Science


Almost every empirical paper in this course uses linear regression

  • Comparing a treatment group to a control group? Regression
  • Fixed effects? Regression with dummies as controls
  • Fancy Difference-in-Differences? Regression with an interaction term
  • Your final project? Likely Regression

Why Should You Care?


For your final projects, you need to:

  1. Come up with a research question
  2. Answer it with data
  3. Present and interpret your results


Today: what regression actually does, and how to use it.

Two Things Regression Does

Thing 1: Best Guess of Y


“Given what I know about X, what is my best guess for Y?”

  • If I tell you a country’s GDP per capita is $10,000…
  • What’s your best guess for its life expectancy?
  • Regression gives a linear best guess
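This "best guess" idea can be sketched in two lines of R. The toy data below is made up for illustration (it mimics the GDP/life-expectancy example used later, not a real dataset): fit a line, then ask `predict()` for the best linear guess at any value of X.

```r
# Made-up toy data: life expectancy loosely increasing in GDP per capita
set.seed(1)
toy <- data.frame(gdp = runif(30, 1, 50))  # GDP in thousands of USD
toy$life_exp <- 55 + 0.4 * toy$gdp + rnorm(30, 0, 4)

# Fit the line once...
fit <- lm(life_exp ~ gdp, data = toy)

# ...then get the best linear guess for a country with GDP per capita of $10k
predict(fit, newdata = data.frame(gdp = 10))
```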

Thing 2: How Much Does Y Change When X Changes?


“When X goes up by one unit, how much does Y change?”

  • If GDP per capita increases by one unit
  • How many more years of life expectancy should we expect?
  • This is the slope
  • Maybe even the causal effect of X on Y… it depends

From Difference-in-Means to Regression

Something You Already Know


Comparing a treatment group to a control group:

\[ \text{Effect} = \overline{Y}_{treatment} - \overline{Y}_{control} \]

Remember the STAR experiment?

We can estimate this with regression

Let’s Prove It

library(ggplot2)
library(modelsummary)

# Simulate an experiment
set.seed(42)
experiment <- data.frame(
  treatment = c(rep(0, 50), rep(1, 50))
)
experiment$outcome <- 10 + 3 * experiment$treatment + rnorm(100, 0, 2)

Method 1: Difference in means

mean(experiment$outcome[experiment$treatment == 1]) -
  mean(experiment$outcome[experiment$treatment == 0])
[1] 3.272746

Method 2: Regression

coef(lm(outcome ~ treatment, data = experiment))
(Intercept)   treatment 
   9.928656    3.272746 

Same Answer


  • The intercept = the control group mean
  • The slope on treatment = the difference in means
  • Regression with a binary variable IS a difference in means
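Both bullet points can be verified directly. The sketch below re-runs the simulated experiment from the previous slide (same seed and recipe) and checks that the intercept reproduces the control-group mean and the slope reproduces the difference in means:

```r
# Re-simulate the experiment from the previous slide
set.seed(42)
experiment <- data.frame(treatment = c(rep(0, 50), rep(1, 50)))
experiment$outcome <- 10 + 3 * experiment$treatment + rnorm(100, 0, 2)

fit <- lm(outcome ~ treatment, data = experiment)

# The intercept equals the control-group mean...
coef(fit)["(Intercept)"]
mean(experiment$outcome[experiment$treatment == 0])

# ...and the slope equals the difference in means
coef(fit)["treatment"]
mean(experiment$outcome[experiment$treatment == 1]) -
  mean(experiment$outcome[experiment$treatment == 0])
```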

Why Not Just Use Difference-in-Means for Everything?


Because the real world is more complicated:

  • What if your key variable isn’t binary? (GDP is continuous)
  • What if you need to control for something? (Education, geography…)
  • What if you want to compare across multiple groups at once?
  • Regression handles all of this. Difference-in-means can’t

The Regression Line

Now With a Continuous Variable

set.seed(42)
n <- 80
countries <- data.frame(
  gdp = runif(n, 1, 50)  # GDP per capita in thousands of USD
)
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)

ggplot(countries, aes(x = gdp, y = life_exp)) +
  geom_point(size = 3, alpha = 0.6, color = "#011F5B") +
  labs(x = "GDP per capita (thousands USD)",
       y = "Life expectancy (years)",
       title = "Each dot is a country") +
  theme_minimal(base_size = 18)

There’s clearly a pattern. But what line best captures it?

Which Line?

We need a rule for “best line.” That rule is OLS.

The Equation


A line needs two numbers: an intercept and a slope

\[ Y_i = \alpha + \beta X_i + \epsilon_i \]

  • \(\alpha\) (intercept): value of \(Y\) when \(X = 0\)
  • \(\beta\) (slope): how much \(Y\) changes when \(X\) goes up by one unit

What About \(\epsilon\)?


\[ Y_i = \alpha + \beta X_i + \epsilon_i \]

Life expectancy depends on a million things: healthcare, diet, war, genetics, luck…

  • We’re only using GDP to predict life expectancy
  • Everything else lands in \(\epsilon_i\) (the error term)
  • The error is the gap between our simple model and complicated reality

How OLS Picks the Line


For each data point, the line makes a prediction:

\[ \text{Prediction for country } i = \alpha + \beta X_i \]

The residual is how far off that prediction is:

\[ \text{Residual}_i = \text{Observed } Y_i - \text{Prediction}_i \]

Minimizing Residuals


OLS picks the \(\alpha\) and \(\beta\) that make the residuals as small as possible.

Specifically, it minimizes the sum of squared residuals:

\[ SSR = \sum_{i=1}^{N} \text{Residual}_i^2 \]

  • Why squared? So positive and negative misses don’t cancel out
  • And big misses are penalized more than small ones
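To see that `lm()` really is minimizing this quantity, a sketch (with made-up simulated data): write the SSR as a function of \(\alpha\) and \(\beta\), hand it to a generic numerical optimizer, and check that it lands on the same intercept and slope that `lm()` computes in closed form.

```r
# Made-up data with a known linear rule plus noise
set.seed(1)
x <- runif(50, 1, 50)
y <- 55 + 0.4 * x + rnorm(50, 0, 4)

# The sum of squared residuals, as a function of (alpha, beta)
ssr <- function(par) sum((y - (par[1] + par[2] * x))^2)

opt <- optim(c(0, 0), ssr, method = "BFGS")  # minimize SSR numerically
ols <- coef(lm(y ~ x))                       # OLS closed-form answer

round(opt$par, 4)        # numerical minimizer's (alpha, beta)
round(unname(ols), 4)    # lm()'s (alpha, beta): same numbers
```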

Visualizing Residuals

OLS = the line that makes these red segments as small as possible (in total).

Playing God

A Simulation


Imagine you are God. You get to decide how the world works.

You decree: life expectancy = 55 + 0.4 \(\times\) GDP

That’s the general rule. But you’re a generous God: each country gets a little bit of free will (randomness).


Question: If a researcher only sees the data, can they figure out what you decided?

You Create the World

set.seed(123)

# GOD'S RULE (the truth no one can see)
true_intercept <- 55
true_slope     <- 0.4

# Generate data for 100 countries
n <- 100
gdp <- runif(n, 1, 50)  # GDP in thousands of USD
life_exp <- true_intercept + true_slope * gdp + rnorm(n, 0, 20)

god_data <- data.frame(gdp, life_exp)


Now let’s pretend we’re a researcher who only sees the data, not the rule.

OLS Tries to Recover the Truth

fit <- lm(life_exp ~ gdp, data = god_data)

ggplot(god_data, aes(x = gdp, y = life_exp)) +
  geom_point(alpha = 0.5, color = "#011F5B") +
  geom_abline(intercept = true_intercept, slope = true_slope,
              color = "red", linewidth = 1.5, linetype = "dashed") +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2],
              color = "#011F5B", linewidth = 1.5) +
  annotate("label", x = 8, y = 73, label = "God's truth",
           color = "red", size = 5) +
  annotate("label", x = 8, y = 70, label = "OLS estimate",
           color = "#011F5B", size = 5) +
  labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
       title = "OLS gets very close to the truth") +
  theme_minimal(base_size = 18)

How Close?

# What God decided
c(true_intercept = true_intercept, true_slope = true_slope)
true_intercept     true_slope 
          55.0            0.4 
# What OLS found
coef(fit)
(Intercept)         gdp 
 54.8574688   0.3633341 


Not perfect, because of the free will (randomness), but very close.

Different Worlds, Same Rule

par_df <- data.frame()
for (world in 1:3) {
  set.seed(world * 10)
  g <- runif(n, 1, 50)
  le <- true_intercept + true_slope * g + rnorm(n, 0, 20)
  f <- lm(le ~ g)
  par_df <- rbind(par_df, data.frame(
    gdp = g, life_exp = le,
    world = paste("World", world,
                  "  (slope =", round(coef(f)[2], 2), ")")
  ))
}

ggplot(par_df, aes(x = gdp, y = life_exp)) +
  geom_point(alpha = 0.4, color = "#011F5B") +
  geom_smooth(method = "lm", se = FALSE, color = "#011F5B", linewidth = 1) +
  geom_abline(intercept = true_intercept, slope = true_slope,
              color = "red", linewidth = 1, linetype = "dashed") +
  facet_wrap(~world) +
  labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
       title = "Same rule, different randomness — OLS is close each time",
       subtitle = "Red dashed = God's truth (slope = 0.4)") +
  theme_minimal(base_size = 14)

OLS Works… When the Truth Is Linear


  • We made a linear rule, and OLS (which fits a line) recovered it
  • But what if the truth isn’t linear?
  • What if God decided something more complicated?

What If the Truth Is Curved?

set.seed(42)
x_sim <- runif(200, 0, 10)

# Three different truths — with clear, big coefficients
linear_y    <- 10 + 3 * x_sim + rnorm(200, 0, 2)
quadratic_y <- 10 + 8 * x_sim - 0.8 * x_sim^2 + rnorm(200, 0, 2)
cubic_y     <- 5 + 12 * x_sim - 3 * x_sim^2 + 0.2 * x_sim^3 + rnorm(200, 0, 2)

sim_worlds <- rbind(
  data.frame(x = x_sim, y = linear_y,    truth = "Y = 10 + 3X  (linear)"),
  data.frame(x = x_sim, y = quadratic_y, truth = "Y = 10 + 8X - 0.8X²  (quadratic)"),
  data.frame(x = x_sim, y = cubic_y,     truth = "Y = 5 + 12X - 3X² + 0.2X³  (cubic)")
)

ggplot(sim_worlds, aes(x = x, y = y)) +
  geom_point(alpha = 0.3, color = "#011F5B", size = 1.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  geom_smooth(method = "loess", se = FALSE, color = "forestgreen",
              linewidth = 1.2, linetype = "dashed") +
  facet_wrap(~truth, scales = "free_y") +
  labs(x = "X", y = "Y",
       title = "Red = OLS line | Green dashed = actual pattern",
       subtitle = "OLS forces a straight line even when the truth is curved") +
  theme_minimal(base_size = 14)

All Models Are Wrong, But Some Are Useful


  • OLS always fits a straight line
  • When the truth is linear, it works great
  • When the truth is curved, the line misses the pattern
  • In practice, OLS is surprisingly robust unless things are dramatically nonlinear
  • We mostly use OLS to estimate associations and marginal effects, not to predict

Adding Controls

The Problem: Omitted Variables


What if we left something important out of the model?


Let’s play God again, but this time with two variables that matter.

Life expectancy will depend on both GDP and average education, BUT education also depends on GDP.

Visualize the DAG

God’s New Rule

set.seed(42)
n_c <- 80
ctrl <- data.frame(gdp = runif(n_c, 1, 50)) # GDP in thousands

# Education is correlated with GDP (richer countries → more education)
ctrl$education <- 2 + 0.4 * ctrl$gdp + rnorm(n_c, 0, 3)

# God's truth: BOTH matter, but education matters a LOT
ctrl$life_exp <- 45 + 0.2 * ctrl$gdp + 1.5 * ctrl$education + rnorm(n_c, 0, 2)


God decided:

  • GDP matters a little (true slope = 0.2)
  • Education matters a lot (true slope = 1.5)
  • But GDP and education are correlated

What Happens If We Forget Education?

model_no_ctrl   <- lm(life_exp ~ gdp, data = ctrl)
model_with_ctrl <- lm(life_exp ~ gdp + education, data = ctrl)

modelsummary(list("Life Exp" = model_no_ctrl,
                  "Life Exp" = model_with_ctrl),
  estimate  = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
                       Life Exp            Life Exp
(Intercept)   47.854*** (1.089)   45.131*** (0.427)
gdp            0.808*** (0.035)    0.170*** (0.032)
education                          1.535*** (0.070)
Num.Obs.                     80                  80
R2 Adj.                   0.871               0.982

What Just Happened?


  • Without education: the GDP coefficient is inflated
  • With education: it drops dramatically, closer to God’s truth (0.2)
  • Why? GDP and education are correlated
  • GDP was getting credit for education’s work
  • This is omitted variable bias
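The "credit for education's work" intuition has an exact arithmetic behind it, the omitted-variable-bias formula: the short regression's GDP coefficient equals the true GDP coefficient plus (education's effect) \(\times\) (the slope of education on GDP). A sketch, re-simulating the data with the slides' recipe:

```r
# Re-simulate the slides' two-variable world
set.seed(42)
n <- 80
ctrl <- data.frame(gdp = runif(n, 1, 50))
ctrl$education <- 2 + 0.4 * ctrl$gdp + rnorm(n, 0, 3)
ctrl$life_exp  <- 45 + 0.2 * ctrl$gdp + 1.5 * ctrl$education + rnorm(n, 0, 2)

short <- coef(lm(life_exp ~ gdp, data = ctrl))["gdp"]               # omits education
long  <- coef(lm(life_exp ~ gdp + education, data = ctrl))["gdp"]   # controls for it
educ  <- coef(lm(life_exp ~ gdp + education, data = ctrl))["education"]
aux   <- coef(lm(education ~ gdp, data = ctrl))["gdp"]              # education on gdp

unname(short)               # the inflated coefficient
unname(long + educ * aux)   # long + bias term: reproduces it exactly
```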

Interpretation With Controls


Without controls:

“A $1,000 increase in GDP is associated with a 0.81 year increase in life expectancy.”

With controls:

“Holding education constant, a $1,000 increase in GDP is associated with a 0.17 year increase.”


“Holding constant” = comparing countries with the same education level.
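"Holding education constant" also has a mechanical meaning (the Frisch–Waugh–Lovell result): strip education out of both life expectancy and GDP, then regress the leftovers on each other. The sketch below re-simulates the slides' data to check that this reproduces the multiple regression's GDP coefficient:

```r
# Re-simulate the slides' two-variable world
set.seed(42)
n <- 80
d <- data.frame(gdp = runif(n, 1, 50))
d$education <- 2 + 0.4 * d$gdp + rnorm(n, 0, 3)
d$life_exp  <- 45 + 0.2 * d$gdp + 1.5 * d$education + rnorm(n, 0, 2)

# Step 1: remove the part of each variable explained by education
r_y <- resid(lm(life_exp ~ education, data = d))
r_x <- resid(lm(gdp ~ education, data = d))

# Step 2: regress the leftovers on each other
coef(lm(r_y ~ r_x))["r_x"]

# Same number as the gdp coefficient from the multiple regression
coef(lm(life_exp ~ gdp + education, data = d))["gdp"]
```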

Controls and Causality

Does Controlling Make It Causal?


  • Remember the causality lectures: a causal effect is about the counterfactual
  • “What would have happened to this country if its GDP were higher, but everything else was held constant?”

When Can We Make Causal Claims?


We need a good reason for why the contrasts we’re estimating are causal.

  • That reason always comes back to the same thing: the average counterfactual is valid
  • The comparison group is a good stand-in for what would have happened without the treatment
  • This is an assumption about the world, not about the model

Running OLS in R

The Full Workflow

# Fit the model
model <- lm(life_exp ~ gdp, data = countries)

# See the full output
summary(model)

Call:
lm(formula = life_exp ~ gdp, data = countries)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1672  -2.1310   0.4795   2.3804  10.7122 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  54.6994     0.8775   62.33 <0.0000000000000002 ***
gdp           0.4204     0.0282   14.91 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.758 on 78 degrees of freedom
Multiple R-squared:  0.7402,    Adjusted R-squared:  0.7369 
F-statistic: 222.2 on 1 and 78 DF,  p-value: < 0.00000000000000022

What Each Piece Means


  • Estimate: the intercept and slope are our best guesses of \(\alpha\) and \(\beta\)
  • Std. Error: the standard deviation of the sampling distribution of our estimates
  • Pr(>|t|): the probability of seeing an estimate at least this large (in absolute value) if the true value were zero
  • R-squared: fraction of the variation in \(Y\) explained by the \(X\)s (0 = nothing, 1 = everything)
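Each of these pieces can also be pulled out of the fitted model programmatically, which is handy for tables and plots. A sketch, re-running the slides' simulation so `model` exists:

```r
# Re-simulate the slides' countries data and refit the model
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)
model <- lm(life_exp ~ gdp, data = countries)

coef(summary(model))          # estimates, std. errors, t values, p-values
summary(model)$r.squared      # R-squared
confint(model, level = 0.95)  # 95% confidence intervals for alpha and beta
```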

Interpreting the Slope

coef(model)
(Intercept)         gdp 
 54.6994253   0.4204093 


Interpretation: “A one unit ($1k) increase in GDP per capita is associated with a 0.42 year increase in life expectancy.”

Or: “A $10k increase in GDP is associated with a 4.2 year increase in life expectancy.”
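The unit conversion can be checked directly: rescaling \(X\) by a factor of 10 scales the slope by exactly 10 and changes nothing else about the fit. A sketch, re-simulating the slides' data (`I()` is base R's way to transform a variable inside a formula):

```r
# Re-simulate the slides' countries data
set.seed(42)
n <- 80
countries <- data.frame(gdp = runif(n, 1, 50))
countries$life_exp <- 55 + 0.4 * countries$gdp + rnorm(n, 0, 4)

b_per_1k  <- coef(lm(life_exp ~ gdp, data = countries))["gdp"]
b_per_10k <- coef(lm(life_exp ~ I(gdp / 10), data = countries))[2]

unname(b_per_1k)   # years of life expectancy per $1k of GDP per capita
unname(b_per_10k)  # years per $10k: exactly 10 times the first slope
```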

Clean Output with modelsummary()

modelsummary(model,
  estimate  = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
Model 1
(Intercept) 54.699*** (0.878)
gdp 0.420*** (0.028)
Num.Obs. 80
R2 Adj. 0.737

Visualize the Line

ggplot(countries, aes(x = gdp, y = life_exp)) +
  geom_point(size = 3, alpha = 0.6, color = "#011F5B") +
  geom_smooth(method = "lm", color = "red", fill = "red", alpha = 0.15) +
  labs(x = "GDP per capita (thousands USD)", y = "Life expectancy",
       title = "OLS line with confidence band") +
  theme_minimal(base_size = 18)

Wrapping Up

What Regression Gives You


  1. A best (linear) guess of \(Y\) for any value of \(X\) (prediction)
  2. A slope that tells you how \(Y\) changes when \(X\) changes (association)
  3. The ability to control for other variables, or hold them constant

What Regression Does NOT Give You


  • Causality does not come from the model; it comes from the persuasiveness of your research design
  • A theory of the world! Controls only help if you’ve identified the right confounders (think about your DAGs!)