Correlation and Causation

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

Logistics

Assignments

  • Did everyone find the readings and slides for today?
  • For next week:
    • Remember you have a quasi-assignment

The Data Revolution


  • Development is undergoing a data revolution

  • More data than ever: surveys, satellites, sensors, social media

  • We seem to think more data is useful for answering important questions

  • But what kinds of questions can data actually answer?

Three Types of Questions

  1. Descriptive: What is happening?
    • How many people live in poverty?
  2. Predictive: What is likely to happen?
    • Which households are at risk of food insecurity?
  3. Causal: What would happen if we intervened?
    • Would a cash transfer reduce poverty?

Data can help answer all three, but each demands its own toolkit

Each Question Needs Different Tools

  • Description: How do we summarize? → Summary statistics

  • Prediction: How do we generalize? → Models

  • Causation: How do we reason about “what if”? → ???


Today’s focus: What toolkit does causation require?

The Descriptive Toolkit

Summary Statistics

Central Tendency

Where is the middle?

  • Mean: \(\bar{X} = \frac{1}{n}\sum_{i} X_i\)
  • Median: middle value
  • Mode: most frequent

Spread

How dispersed are the data?

  • Variance: \(\sigma^2 = \frac{1}{n}\sum_{i}(X_i - \bar{X})^2\)
  • Standard deviation: \(\sigma = \sqrt{\sigma^2}\)
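A minimal sketch of these summaries using Python's standard library. The `incomes` list is made-up illustrative data, not from any survey. `pvariance` and `pstdev` divide by n, which matches the formulas above.

```python
import statistics

# Hypothetical daily incomes (USD) for seven households; illustrative only.
incomes = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 12.0]

mean = statistics.fmean(incomes)          # (1/n) * sum of X_i
median = statistics.median(incomes)       # middle value once sorted
mode = statistics.mode(incomes)           # most frequent value
variance = statistics.pvariance(incomes)  # divides by n, as in the formula above
std_dev = statistics.pstdev(incomes)      # square root of the variance

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2))
```

Here the single high value (12) pulls the mean above the median, which is one reason to report more than one measure of the middle.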

Correlation: A Summary Statistic


Correlation summarizes co-variation in a single number:

\[r_{XY} = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2 \sum_{i}(Y_i - \bar{Y})^2}}\]


  • Ranges from -1 to +1
  • Measures linear association
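The formula can be written out term by term. This is a sketch, with `pearson_r` as a hypothetical helper name and made-up data for schooling and income.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, mirroring the formula for r_XY term by term."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have equal length of at least 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from each mean.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares.
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    # Note: a variable with no variation makes the denominator zero.
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

# Hypothetical data: years of schooling and an income index.
schooling = [1, 2, 3, 4, 5]
income = [2, 4, 5, 4, 5]
print(round(pearson_r(schooling, income), 3))

# A perfect but non-linear relationship (Y = X squared) can still give r = 0.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
print(pearson_r(x, y))
```

The second example shows why "linear" matters: Y is fully determined by X, yet the correlation is zero because the relationship is not a straight line.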

What Does Correlation Mean?

More colloquially: what do we mean when we say that X and Y are correlated?

  • They tend to move together
  • When X is high, Y tends to be high (or low)
  • There is a systematic association in observed data

What Correlation Is (and Isn’t)


  • Technically: A measure that summarizes (linearly) how two numeric variables move together

  • Conceptually: The basis of inductive evidence: generalizable patterns

When we say “X and Y are correlated,” we’re saying:

“Here is a regularity worth noting.”

But it does not explain why the pattern exists.

Correlation is Not Causation

[Figure: UFO sightings reported in New Mexico (source: National UFO Reporting Center) plotted against total patents granted in the US (source: USPTO), 1975-2020. r = 0.937, r² = 0.878, p < 0.01. From tylervigen.com/spurious/correlation/3099]

Correlation (in the summary statistic sense) Tells Us…

  • Strength: How tightly X and Y move together (0 to ±1)

  • Direction: Positive (both rise) or negative (one rises, other falls)

  • Nothing about why they move together


The UFOs and patents really are correlated in the data!
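One common way a strong correlation can arise without any causal link is a shared trend over time. The sketch below builds two made-up series that each depend only on time. It is an illustration of that mechanism, not an analysis of the UFO or patent data. `pearson_r`, `series_a`, and `series_b` are hypothetical names.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations around each mean."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

# Hand-picked year-to-year wiggles for ten hypothetical years.
wiggle_a = [1, -1, 2, -2, 0, 1, -1, 2, -2, 0]
wiggle_b = [3, -3, 0, 3, -3, 0, -3, 3, 0, 0]

# Each series is "upward trend + wiggle". Neither series appears in the
# other's formula: both depend only on the year index k.
series_a = [10 + 2 * k + wiggle_a[k] for k in range(10)]
series_b = [100 + 5 * k + wiggle_b[k] for k in range(10)]

r = pearson_r(series_a, series_b)
print(round(r, 3))
```

The correlation comes out above 0.9 even though, by construction, neither series affects the other. The number is a real feature of the data, but it says nothing about why.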

The Limits of Description

Correlation and summary statistics are powerful for description

But if we want to answer causal questions…

“Would a cash transfer reduce poverty?”

…we need a different toolkit entirely.

The Causal Toolkit

What is Causation?


Correlation tells us X and Y move together.

But causation says something stronger: X produces or changes Y


How do we study this? Scientists made two key moves:

Two Key Moves


Rather than trying to observe causation directly, scientists:

  1. Define causality as a comparison between TWO counterfactual states

  2. Shift from individual causal effects to averages

Let’s unpack each move.

Move 1: Define Causality as Counterfactual Comparison

A Note on Definitions


There are many ways to define causality:

  • Legal: “But for” the defendant’s action, would harm have occurred?
  • Philosophical: Necessary and sufficient conditions, mechanisms
  • Social science: Counterfactual comparison (what we’ll use)

A Misleading Claim


A pharmaceutical company says:

“Patient took our pill. Patient got better. Therefore, the pill works.”


What’s wrong with this reasoning?

Why This is Wrong


  • We saw the patient with the pill → got better

  • We didn’t see the patient without the pill

  • Maybe they would have gotten better anyway!

Causality is a Comparison


The causal question: Did the pill make the patient better?


To answer this, we need to compare:

  1. What happened to the patient with the pill
  2. What would have happened to the same patient without the pill


Causality is always relative: treatment vs. what alternative?

Defining the Key Terms


  • Units: The people (or things) we’re studying
    • Example: Patients who could take the pill
  • Treatment: A well-defined intervention we want to evaluate (more on this later)
    • Example: Taking the pill (vs. not taking it)
  • Outcome: What we measure
    • Example: Whether the patient recovered

Potential Outcomes

For each patient, define two possible states of the world:

  • \(Y_i(1)\): What happens if patient \(i\) takes the pill

  • \(Y_i(0)\): What happens if patient \(i\) doesn’t take the pill

Every patient has both. But we can only ever see one.

The Individual Causal Effect

Define the individual causal effect as the effect of the pill, relative to no pill, for patient \(i\):

\[TE_i = Y_i(1) - Y_i(0)\]

Can we calculate it?

The Fundamental Problem of Causal Inference

For any patient, we observe only one outcome:

  • If they took the pill → we see \(Y_i(1)\)
  • If they didn’t → we see \(Y_i(0)\)


We never see the same patient both taking and not taking the pill
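A small sketch of what this looks like as data. The table below is a made-up "god's-eye" view that writes down both potential outcomes, which no real dataset can provide. The `Patient` class, the names, and the values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Patient:
    name: str
    y1: int          # Y_i(1): 1 if they recover when taking the pill, else 0
    y0: int          # Y_i(0): 1 if they recover without the pill, else 0
    took_pill: bool  # the treatment they actually received

# Hypothetical patients with both potential outcomes filled in.
patients = [
    Patient("Ana", y1=1, y0=1, took_pill=True),
    Patient("Ben", y1=1, y0=0, took_pill=True),
    Patient("Caro", y1=0, y0=0, took_pill=False),
    Patient("Dev", y1=1, y0=0, took_pill=False),
]

def observed_row(p: Patient) -> Tuple[Optional[int], Optional[int]]:
    """Return (Y(1), Y(0)) as an analyst would see them: one is always missing."""
    return (p.y1, None) if p.took_pill else (None, p.y0)

for p in patients:
    true_effect = p.y1 - p.y0  # TE_i, computable only in this made-up table
    print(p.name, "observed:", observed_row(p), "true TE_i:", true_effect)
```

Every observed row contains a `None`. In this made-up table Ana's true effect is 0 (she recovers either way), but an analyst who only sees "took the pill, recovered" cannot tell her apart from Ben, whose effect is 1.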

Factual vs. Counterfactual

  • Factual: What we observe
    • Patient took the pill and recovered
  • Counterfactual: What would have happened
    • Would they have recovered without the pill?

To measure causal effects, we need both.

But we can never observe the counterfactual.

Why This Problem is Fundamental

  • No amount of data in the world lets us see both states for one patient

  • Ana took the pill and recovered — but would she have recovered anyway?

  • We will never know.


The individual treatment effect \(TE_i\) is unknowable. So what do we do?

Move 2: Go From Individual to Average

The Workaround

We can’t know the pill’s effect on any one patient.

But what if we could estimate the average effect across many patients?

The solution: Compare groups, not individuals.

Average Treatment Effect (ATE)

Instead of asking: “Did the pill help Ana?”

Ask: “On average, does the pill help patients recover?”

\[ATE = \overline{Y(1)} - \overline{Y(0)}\]

Average recovery rate with pill minus average recovery rate without pill.
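As a worked example of the definition, the sketch below computes the ATE from a made-up table in which both potential outcomes are known. Such a table never exists in practice; it only makes the arithmetic concrete.

```python
# Hypothetical potential outcomes for six patients (1 = recovers, 0 = does not).
y1 = [1, 1, 0, 1, 1, 1]  # Y_i(1): outcome if patient i takes the pill
y0 = [1, 0, 0, 0, 1, 0]  # Y_i(0): outcome if patient i does not

n = len(y1)
mean_y1 = sum(y1) / n    # average recovery rate with the pill
mean_y0 = sum(y0) / n    # average recovery rate without the pill
ate = mean_y1 - mean_y0  # difference of averages

# The same number is the average of the individual effects TE_i.
individual_effects = [a - b for a, b in zip(y1, y0)]
mean_te = sum(individual_effects) / n

print("ATE:", ate, "average of TE_i:", mean_te)
```

The difference of averages equals the average of the individual differences, so the ATE summarizes the individual effects even though no single \(TE_i\) is observable in real data.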

Why Do Averages Work?


People are different


So how can comparing group averages tell us anything causal?


We need to make an assumption.

A Bad Comparison


Imagine: people who feel really sick take the pill. People who feel fine don’t.


The pill group was sicker to begin with.

They would have recovered at a lower rate even without the pill.


Comparing these groups tells us nothing about the pill’s effect.

The Problem


  • Sicker people chose to take the pill
  • Sickness affects recovery
  • We can’t tell: is it the pill, or were they just sicker?


The groups aren’t comparable — they differed before treatment.
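The problem can be sketched with a made-up population where the true effect is known by construction. Everything here is an assumption for illustration: the `severity` scale, the continuous recovery score, and the constant effect of +10 for every patient.

```python
n = 100
severity = [i / n for i in range(n)]   # 0.00 (healthy) to 0.99 (very sick)
y0 = [80 - 40 * s for s in severity]   # recovery score without the pill
y1 = [score + 10 for score in y0]      # with the pill: assumed +10 for everyone

# Self-selection: the sicker half chooses to take the pill.
took_pill = [s >= 0.5 for s in severity]

def mean(values):
    return sum(values) / len(values)

true_ate = mean(y1) - mean(y0)

# What an analyst sees: Y(1) for pill takers, Y(0) for everyone else.
pill_group = [y1[i] for i in range(n) if took_pill[i]]
no_pill_group = [y0[i] for i in range(n) if not took_pill[i]]
naive_diff = mean(pill_group) - mean(no_pill_group)

# How different the groups were before any treatment (knowable only here).
baseline_gap = (mean([y0[i] for i in range(n) if took_pill[i]])
                - mean([y0[i] for i in range(n) if not took_pill[i]]))

print("true ATE:", round(true_ate, 2))
print("naive difference:", round(naive_diff, 2))
print("baseline gap:", round(baseline_gap, 2))
```

In this setup the pill helps every patient by 10 points, yet the naive comparison comes out at about -10, because the pill group started roughly 20 points worse off. The naive difference mixes the pill's effect with that pre-existing gap.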

The Key Assumption


Assumption: The two groups are comparable on average.


The pill group and no-pill group would have had similar outcomes if neither had taken the pill.


If this holds → group averages are valid comparisons → we can estimate causal effects.

When Is This Plausible?


  • When who gets treatment isn’t determined by things that also affect the outcome

  • When there’s no systematic difference between groups before treatment

  • When it’s “as if” treatment was assigned at random
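A sketch of the "assigned at random" case, using a made-up population with a known, constant effect of +10. The severity scale, the recovery score, the population size, and the seed are all assumptions for illustration.

```python
import random

rng = random.Random(3200)  # fixed seed so the sketch is reproducible

n = 10_000
severity = [i / n for i in range(n)]   # 0 (healthy) to just under 1 (very sick)
y0 = [80 - 40 * s for s in severity]   # recovery score without the pill
y1 = [score + 10 for score in y0]      # with the pill: assumed +10 for everyone

# Random assignment: shuffle patient ids and give the pill to the first half.
ids = list(range(n))
rng.shuffle(ids)
treated = set(ids[: n // 2])

def mean(values):
    return sum(values) / len(values)

# The analyst sees Y(1) for the treated and Y(0) for the controls.
treated_outcomes = [y1[i] for i in range(n) if i in treated]
control_outcomes = [y0[i] for i in range(n) if i not in treated]
diff_in_means = mean(treated_outcomes) - mean(control_outcomes)

# Check comparability: average severity should be similar in both groups.
severity_gap = (mean([severity[i] for i in range(n) if i in treated])
                - mean([severity[i] for i in range(n) if i not in treated]))

print("difference in means:", round(diff_in_means, 2))
print("severity gap between groups:", round(severity_gap, 4))
```

Because assignment ignores severity, the two groups have nearly the same average severity, and the difference in means lands close to the built-in effect of 10. It will not be exactly 10 in any single draw.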


More on how to make this credible next class.

What’s Next?

Why Causality Matters for Development

  • Understanding cause and effect is how we change things in the real world

  • Causal inference separates good evaluations from bad

    • Did this program actually reduce poverty?
    • Would this policy improve outcomes?

Many Tools to Build Credible Comparisons


  • Randomized experiments
    • Random assignment makes treatment unrelated to pre-existing differences between units
  • Observational methods
    • Natural experiments, Difference-in-Differences, Matching
    • Try to approximate the logic of randomization


We’ll explore these throughout the course.

Takeaways

Today’s Key Points


  1. Different questions need different tools — causal questions ask “what if?”

  2. Causality = comparing two states (treatment vs. no treatment)

  3. Fundamental Problem: We can’t observe both states for one person

  4. Solution: Compare groups. If treatment is unrelated to pre-existing differences between them, groups are comparable on average

  5. Causal inference = building credible comparisons

Next Meeting


  • How randomization solves the comparison problem

  • Why random assignment creates comparable groups

  • Final Project overview

    • What are your options?
    • Possible data sources