Text as Data

Motivation and Concepts

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

What is a latent variable?

A quantity we care about that we cannot directly observe. We infer it from things we can measure.

State capacity
inferred from tax-to-GDP ratio, public goods provision, bureaucratic retention

Electoral fraud
inferred from digit patterns in vote tallies, turnout discontinuities, vote-share heaping

Ethnic polarization
inferred from residential segregation, cross-ethnic voting, intermarriage patterns

The proxy is observable. The latent variable is not. We use the former to reason about the latter.

Text as a trace

Actors have beliefs, interests, and intentions that we cannot observe directly but that shape what they write

  • Ideology: party manifestos
  • Hostility: dehumanizing language

The data generating process

Latent variable
ideology, sentiment, priorities, hostility

Language generation
word choice, framing, emphasis, omission

Observed text
speeches, reports, manifestos, tweets

We observe only the text. We want to infer upward, back to the latent variable.

What we want to measure

  • Sentiment / Tone: how positive, negative, or hostile is this document?
  • Topic: what is this document about?
  • Position: where does this actor sit in some ideological space?
  • Similarity: are these two documents saying the same thing? Are they written by the same person?

From words to numbers

  1. Corpus: collection of documents and metadata
  2. Tokenize: split into units (words, n-grams)
  3. Preprocess: remove stopwords, punctuation, normalize case
  4. Document-Term Matrix: rows = documents, columns = terms, cells = counts
  5. Model: apply a method to recover the latent quantity
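The five steps above can be sketched in quanteda. This is a minimal illustration, not a recommended preprocessing recipe: the two-document mini-corpus and its document names are invented here, and real analyses involve more deliberate preprocessing choices.

```r
library(quanteda)

# Step 1: a (hypothetical) corpus of two toy documents
docs <- c(speech1 = "Taxes fund public goods. Taxes matter.",
          speech2 = "Fraud undermines elections and public trust.")
corp <- corpus(docs)

# Steps 2-3: tokenize, then preprocess (punctuation, case, stopwords)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Step 4: document-term matrix (rows = documents, columns = terms)
dfmat <- dfm(toks)
dfmat
```

Note that "taxes" gets a count of 2 in the first row: the cells are raw term counts, which is exactly what most of the methods in the next slide consume (Step 5).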

Four families of methods

  • Dictionary methods: researcher defines a word list; score documents by counts. Fast, transparent, brittle.

  • Supervised models: hand-code a sample; train a classifier to generalize. Flexible, requires labeled data.

  • Unsupervised models: let the model find structure without labels. Scalable, hard to validate.

  • Large language models: pretrained on massive corpora; prompted or fine-tuned. Powerful, opaque, expensive, hard to validate systematically.
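A dictionary method is the simplest of the four to show concretely. Below is a sketch using quanteda's `dictionary()` and `dfm_lookup()`; the hostility word list and the two example sentences are invented for illustration, and a real dictionary would need validation against hand-coded documents.

```r
library(quanteda)

# Hypothetical researcher-defined word list (glob patterns allowed)
hostility_dict <- dictionary(list(
  hostile = c("enemy", "traitor*", "destroy"),
  neutral = c("citizen*", "policy")
))

docs <- c(d1 = "The traitors will destroy us; they are the enemy.",
          d2 = "Citizens deserve a fair policy.")
toks <- tokens_tolower(tokens(docs, remove_punct = TRUE))

# Score each document by counting dictionary matches
scores <- dfm_lookup(dfm(toks), hostility_dict)
scores
```

This is the "fast, transparent, brittle" trade-off in miniature: the scoring rule is fully auditable, but any hostile word missing from the list is silently missed.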

Enter quanteda

quanteda is an R package that handles the entire pipeline

  • Corpus management: store documents with metadata, filter, subset
  • Tokenization and preprocessing in one step
  • Fast, sparse document-term matrix construction
  • Built-in methods: KWIC, wordclouds, dictionary scoring, scaling models, topic models

It is not the only option, but it is well-documented, actively maintained, and designed for social scientists.
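As a taste of the built-in methods, here is a keyword-in-context (KWIC) call, which shows every occurrence of a term with its surrounding words. The two example sentences are made up for illustration:

```r
library(quanteda)

docs <- c(s1 = "We must raise the tax rate to fund schools.",
          s2 = "The tax burden falls on the poor.")
toks <- tokens(docs)

# Every occurrence of "tax", with a 3-word window on each side
k <- kwic(toks, pattern = "tax", window = 3)
k
```

KWIC is a useful first validation step: before trusting any automated score, read how a term is actually used in context.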

Let’s go to R.