“Computational Text Analysis”


Welcome to the site for the course PS9594A: “Computational Text Analysis” at Western University, taught by Sebastián Vallejo Vera. In each week, you will find the code, exercises, and slides for the corresponding topic.

Before you start, check the required software and packages below. Also, don’t forget to read the Syllabus and check Perusall for the readings for the course. This site will be corrected/updated throughout the semester.

0.1 Software and Packages

For the first part of this course (Weeks 1 - 5), we will be mainly using R. For the second part of the this course (Weeks 6 - 11), we will use a combination of R and Python. I will assume that you are familiar with R language, RStudio, and R packages. If you are not, please come to office hours and I can help you out1. In R, these are the main packages you will need to have installed:

  • tidyverse (we will be piping)
  • tidylog (helps keep track of what your are pipins)
  • tidytext (great for working with text)
  • quanteda (stands for “Quantitative Analysis of Textual Data”)
    • quanteda.textstats (to obtain stats from our dfm)
    • quanteda.textplots (to obtain plots from our dfm stats)
    • quanteda.dictionaries (to use dictionaries with quanteda)
  • gutenbergr (to download texts from Project Gutenberg)
  • wesanderson (to make things pretty)
  • stm (to run Structural Topic Models)
  • pdftools (to load pdfs)

0.2 Datasets

Throughout the class, we will be using a number of sample datasets. Access to these datasets will be provided directly on the code. For your Final Essay, you can use one of the following datasets (or, even better, you can use your own):

0.3 Assignments

Weeks Assignment?
1: A Primer on Using Text as Data Optional
2: Tokenization and Word Frequency No
3: Dictionary-Based Approaches Homework 1
4: Complexity and Similarity Optional
5: Scaling Techniques and Topic Modeling Homework 2
6: Word Embeddings No
7: Getting Data Optional
8: Introduction to Supervised Machine-Learning Homework 3
9: Transformers Optional
10: Do We Even Need Theory Anymore? No
11: Llama 2, ChatGPT, and Generative Modeling Optional

For Homework 4, students will present to the class a summary of their final paper (Week 12). The rest of the class will provide feedback. You can find a suggested format for your presentation in this link.

0.4 Final Paper

The objective of this activity is for you to write a 4000-word max essay on a subject previously approved by me. Think of it as a research note for a journal (see, for example, the Letters at APSR).

I will provide a range of data sources to choose from, but you are welcome (encouraged) to suggest your own.

The essay should include all of the following:

  1. a research question;
  2. at least one computational text analysis technique that we have studied;
  3. an analysis of the data source you have provided;
  4. a write up of the initial findings;
  5. an outline of potential extensions of your analysis.

You will also need to provide the code you used in reproducible (markdown) format and will be assessed on both the substantive content of your essay contribution (the social science part) and your demonstrated competency in coding and text analysis (the computational part). While you will only be graded on the final submission, I encourage you to submit advances and/or come to office hours to revise your work. I am happy to provide feedback on your essay’s substantive and technical aspects. If this applies to you, use this opportunity to work on sections of your thesis/dissertation/working paper that will require or can be further developed by applying computational text analysis techniques.

0.5 Acknowledgments

The organization of the first part of this course (Weeks 1 - 5) and the format of the assignments are borrowed from Christopher Barrie’s excellent course on “Computational Text Analysis”, a syllabus from the prolific Tiago Ventura, and Grimmer, Roberts, and Stewart’s excellent book, “Text as data: A new framework for machine learning and the social sciences”. The code used throughout the course is a patchwork of my own code, but my own code borrows heavily from the internet (but that’s true for all code). I try my best to give credit to the original authors of the code (when and if possible).