“Computational Text Analysis”


Welcome to the site for the course PS9594A: “Computational Text Analysis” at Western University, taught by Sebastián Vallejo Vera. Under each week, you will find the code, exercises, and slides for the corresponding topic.

Before you start, check the required software and packages below. Also, don’t forget to read the Syllabus and check Perusall for the course readings. This site will be corrected and updated throughout the semester.

0.1 Software and Packages

For the first part of this course (Weeks 1-5), we will mainly be using R. For the second part of this course (Weeks 6-11), we will use a combination of R and Python. I will assume that you are familiar with the R language, RStudio, and R packages. If you are not, please come to office hours and I can help you out. In R, these are the main packages you will need to have installed:

  • tidyverse (we will be piping)
  • tidylog (helps keep track of what you are piping)
  • tidytext (great for working with text)
  • quanteda (stands for “Quantitative Analysis of Textual Data”)
    • quanteda.textstats (to obtain stats from our dfm)
    • quanteda.textplots (to obtain plots from our dfm stats)
    • quanteda.dictionaries (to use dictionaries with quanteda)
  • gutenbergr (to download texts from Project Gutenberg)
  • wesanderson (to make things pretty)
  • stm (to run Structural Topic Models)
  • pdftools (to load pdfs)
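
If you want to install everything in the list above in one go, here is a minimal sketch of one way to do it. It assumes the packages are available on CRAN under exactly these names. The one exception is quanteda.dictionaries, which, as far as I know, is distributed through GitHub rather than CRAN; the repository path used below (`kbenoit/quanteda.dictionaries`) is an assumption you should verify before running. The helper `missing_packages()` is a hypothetical name introduced only for this example.

```r
# CRAN packages from the list above (names assumed to match CRAN)
cran_packages <- c(
  "tidyverse", "tidylog", "tidytext",
  "quanteda", "quanteda.textstats", "quanteda.textplots",
  "gutenbergr", "wesanderson", "stm", "pdftools"
)

# Hypothetical helper: returns the packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(cran_packages)

# Install only in an interactive session, so sourcing this file
# from a script does not trigger downloads
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

# Assumption: quanteda.dictionaries is installed from GitHub
if (interactive() && length(missing_packages("quanteda.dictionaries")) > 0) {
  if (length(missing_packages("remotes")) > 0) install.packages("remotes")
  remotes::install_github("kbenoit/quanteda.dictionaries")
}
```

Checking for missing packages first avoids reinstalling what you already have. After installing, calling `library(quanteda)` (and likewise for the others) is a quick way to confirm that each package loads.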

0.2 Datasets

Throughout the class, we will be using a number of sample datasets. Access to these datasets will be provided directly in the code. For your Final Essay, you can use one of the following datasets (or, even better, you can use your own):

0.3 Acknowledgments

The organization of the first part of this course (Weeks 1-5) and the format of the assignments are borrowed from Christopher Barrie’s excellent course on “Computational Text Analysis”, a syllabus from the prolific Tiago Ventura, and Grimmer, Roberts, and Stewart’s excellent book, “Text as Data: A New Framework for Machine Learning and the Social Sciences”. The code used throughout the course is a patchwork: much of it is my own, but my own code borrows heavily from the internet (as is true of all code). I try my best to give credit to the original authors of the code whenever possible.