Text as Data

Motivation and Concepts

Carolina Torreblanca

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2026

What is a latent variable?

A quantity we care about that we cannot directly observe. We infer it from things we can measure.

State capacity
inferred from tax-to-GDP ratio, public goods provision, bureaucratic retention

Electoral fraud
inferred from digit patterns in vote tallies, turnout discontinuities, vote-share heaping

Ethnic polarization
inferred from residential segregation, cross-ethnic voting, intermarriage patterns

The proxy is observable. The latent variable is not. We use the former to reason about the latter.

Text as a trace

Actors have beliefs, interests, and intentions that we cannot observe directly but that shape what they write

  • Ideology: party manifestos
  • Hostility: dehumanizing language

The data generating process

Latent variable
ideology, sentiment, priorities, hostility

Language generation
word choice, framing, emphasis, omission

Observed text
speeches, reports, manifestos, tweets

We observe only the text. We want to infer upward, back to the latent variable.

What we want to measure

  • Sentiment / Tone: how positive, negative, or hostile is this document?
  • Topic: what is this document about?
  • Position: where does this actor sit in some ideological space?
  • Similarity: are these two documents saying the same thing? Are they written by the same person?

From words to numbers

  1. Corpus: collection of documents and metadata
  2. Tokenize: split into units (words, n-grams)
  3. Preprocess: remove stopwords, punctuation, normalize case
  4. Document-Term Matrix: rows = documents, columns = terms, cells = counts
  5. Model: apply a method to recover the latent quantity
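The five steps above can be sketched in quanteda. This is a minimal illustration, not a recommended preprocessing recipe: the two-document mini-corpus and its document names are invented here, and real analyses involve more deliberate preprocessing choices.

```r
library(quanteda)

# Step 1: a (hypothetical) corpus of two toy documents
docs <- c(speech1 = "Taxes fund public goods. Taxes matter.",
          speech2 = "Fraud undermines elections and public trust.")
corp <- corpus(docs)

# Steps 2-3: tokenize, then preprocess (punctuation, case, stopwords)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Step 4: document-term matrix (rows = documents, columns = terms)
dfmat <- dfm(toks)
dfmat
```

Note that "taxes" gets a count of 2 in the first row: the cells are raw term counts, which is exactly what most of the methods in the next slide consume (Step 5).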

Four families of methods

  • Dictionary methods: researcher defines a word list; score documents by counts. Fast, transparent, brittle.

  • Supervised models: hand-code a sample; train a classifier to generalize. Flexible, requires labeled data.

  • Unsupervised models: let the model find structure without labels. Scalable, hard to validate.

  • Large language models: pretrained on massive corpora; prompted or fine-tuned. Powerful, opaque, expensive, hard to validate systematically.
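A dictionary method is the simplest of the four to show concretely. Below is a sketch using quanteda's `dictionary()` and `dfm_lookup()`; the hostility word list and the two example sentences are invented for illustration, and a real dictionary would need validation against hand-coded documents.

```r
library(quanteda)

# Hypothetical researcher-defined word list (glob patterns allowed)
hostility_dict <- dictionary(list(
  hostile = c("enemy", "traitor*", "destroy"),
  neutral = c("citizen*", "policy")
))

docs <- c(d1 = "The traitors will destroy us; they are the enemy.",
          d2 = "Citizens deserve a fair policy.")
toks <- tokens_tolower(tokens(docs, remove_punct = TRUE))

# Score each document by counting dictionary matches
scores <- dfm_lookup(dfm(toks), hostility_dict)
scores
```

This is the "fast, transparent, brittle" trade-off in miniature: the scoring rule is fully auditable, but any hostile word missing from the list is silently missed.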

Enter quanteda

quanteda is an R package that handles the entire pipeline

  • Corpus management: store documents with metadata, filter, subset
  • Tokenization and preprocessing in one step
  • Fast, sparse document-term matrix construction
  • Built-in methods: KWIC, wordclouds, dictionary scoring, scaling models, topic models

It is not the only option, but it is well-documented, actively maintained, and designed for social scientists.
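As a taste of the built-in methods, here is a keyword-in-context (KWIC) call, which shows every occurrence of a term with its surrounding words. The two example sentences are made up for illustration:

```r
library(quanteda)

docs <- c(s1 = "We must raise the tax rate to fund schools.",
          s2 = "The tax burden falls on the poor.")
toks <- tokens(docs)

# Every occurrence of "tax", with a 3-word window on each side
k <- kwic(toks, pattern = "tax", window = 3)
k
```

KWIC is a useful first validation step: before trusting any automated score, read how a term is actually used in context.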

Let’s go to R.