Motivation and Concepts
Carolina Torreblanca
University of Pennsylvania
Global Development: Intermediate Topics in Politics, Policy, and Data
PSCI 3200 - Spring 2026
A latent variable is a quantity we care about but cannot directly observe. We infer it from things we can measure.
State capacity
inferred from tax-to-GDP ratio, public goods provision, bureaucratic retention
Electoral fraud
inferred from digit patterns in vote tallies, turnout discontinuities, vote-share heaping
Ethnic polarization
inferred from residential segregation, cross-ethnic voting, intermarriage patterns
The proxy is observable. The latent variable is not. We use the former to reason about the latter.
Actors have beliefs, interests, and intentions that shape what they write but that we cannot observe directly.
Latent variable
ideology, sentiment, priorities, hostility
Language generation
word choice, framing, emphasis, omission
Observed text
speeches, reports, manifestos, tweets
Dictionary methods: researcher defines a word list; score documents by counts. Fast, transparent, brittle.
Supervised models: hand-code a sample; train a classifier to generalize. Flexible, requires labeled data.
Unsupervised models: let the model find structure without labels. Scalable, hard to validate.
Large language models: pretrained on massive corpora; prompted or fine-tuned. Powerful, opaque, expensive, hard to validate systematically.
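To make the first approach concrete, here is a minimal sketch of a dictionary method in base R. The word list, the three documents, and the helper names (tokenize, dictionary_score) are all invented for illustration; they are assumptions, not part of the course materials.

```r
# Hypothetical word list and documents, invented for illustration only.
hostile_words <- c("threat", "enemy", "attack")

docs <- c(
  speech_a = "The enemy will attack, and we will answer every threat.",
  speech_b = "We welcome trade talks and closer cooperation.",
  speech_c = "Our rivals keep threatening us."
)

# Lowercase, replace anything that is not a letter or space, split on spaces.
tokenize <- function(text) {
  clean <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(clean, " +")[[1]]
  words[words != ""]
}

# Share of a document's tokens that appear in the word list.
dictionary_score <- function(text, word_list) {
  words <- tokenize(text)
  sum(words %in% word_list) / length(words)
}

scores <- sapply(docs, dictionary_score, word_list = hostile_words)
round(scores, 2)
```

Fast and transparent: every score traces back to a word count. Brittle: speech_c scores zero because "threatening" is not an exact match for "threat".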
quanteda is an R package that handles the entire text-analysis pipeline.
It is not the only option, but it is well-documented, actively maintained, and designed for social scientists.
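A minimal sketch of what that pipeline can look like, assuming quanteda is installed. The two toy texts are invented for illustration, and the specific steps chosen here (dropping punctuation and English stopwords) are one reasonable set of choices, not the only one.

```r
library(quanteda)

# Hypothetical toy documents, invented for illustration only.
texts <- c(
  doc1 = "Taxes fund public schools and public roads.",
  doc2 = "The ministry lost staff and collected fewer taxes."
)

corp <- corpus(texts)                         # 1. raw text plus document names
toks <- tokens(corp, remove_punct = TRUE)     # 2. split into word tokens
toks <- tokens_remove(toks, stopwords("en"))  # 3. drop common function words
mat  <- dfm(toks)                             # 4. document-feature matrix (lowercased by default)

# Most frequent features across the whole corpus.
topfeatures(mat, 3)
```

Each row of mat is a document and each column is a word, so the matrix is the observed text turned into counts that the methods above can work with.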
Let’s go to R.
https://carolina-torreblanca.github.io/psci3200-globaldev-main/