Assignments

There will be three worksheets (Exercise #N) that will walk you through the implementation of different text analysis techniques. At the end of each worksheet, you will find a set of questions. Some are optional (even though I strongly suggest you answer them), and some are graded assignments (shown below). You should partner up with someone else in your class and go through these together. Why? Christopher Barrie says: “This is called pair programming and there’s a reason we do this. Firstly, coding can be an isolating and difficult thing—it’s good to bring a friend along for the ride! Secondly, if there’s something you don’t know, maybe your partner will. This saves you both time. Thirdly, your partner can check your code as you write it, and vice versa. Again, this means both of you are working together to produce and check something as you go along.”

After the assignment (optional or graded) is due, I will pick a pair (or an individual, if you prefer to work alone) at random to answer each of these questions and walk us through your code. This is not a punitive exercise, but rather a space for collaborative learning. More often than not, the obstacles encountered by one person are also encountered by many others. Furthermore, there are many ways to arrive at the same solution, and being exposed to different approaches is beneficial to everyone. All that matters to me is that you try and, eventually, learn. (The same goes for those who only have to turn in the assignment.)

In total, the graded assignments are worth 30% of your grade.

Instructions for Submission

The assignments will be graded on accuracy and the quality of your programming style. The following are elements I will be looking for when grading:

  • All code must run.

  • Solutions should be readable:

    • Code should be thoroughly commented (I should be able to understand the code’s purpose by reading the comments).

    • Coding solutions should be broken up into individual code chunks in Jupyter or R Markdown notebooks, not clumped together into one large code chunk.

    • Each function you define should include a docstring explaining what the function does, each input argument, and what the function returns (see the example after this list).

  • Commentary, responses, and/or solutions should be written in Markdown and should explain the outputs sufficiently.
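
For reference, here is a minimal, hypothetical example of the kind of commented, documented function I have in mind. The function and its name are only illustrative, and the docstring uses roxygen-style comments, but any clear docstring format is fine:

```r
#' Count how many comments mention a given candidate
#'
#' @param comments A character vector of comment texts.
#' @param candidate A single string to search for, e.g. "biden".
#' @return An integer: the number of comments that mention the candidate.
count_mentions <- function(comments, candidate) {
  # lower-case both sides so the match is case-insensitive
  sum(grepl(tolower(candidate), tolower(comments), fixed = TRUE))
}

count_mentions(c("Biden spoke today.", "No politics here."), "Biden")  # returns 1
```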

Assignments and Due Dates

Each assignment is worth 10 points.

Assignment 1 - Due Date: EOD Friday Week 4

This assignment must be completed after covering Week 3: Dictionary-Based Techniques.

  1. Replicate the results from the left-most column of Figure 3 in Ventura et al. (2021).
  2. Look at the keywords in context for Biden in the ventura_etal_df dataset, and compare the results with the same data, but pre-processed (e.g., lower-cased, stopwords removed). Which version provides more information about the context in which Biden appears in the comments? (See the first sketch after this list for a possible starting point.)
  3. Use a different collocation approach with the ventura_etal_df dataset, but pre-process the data (e.g., lower-case, remove stopwords). Which approach (pre-processed or not pre-processed) provides a better picture of the corpus or of the collocations you found?
  4. Compare the positive sentiment of comments mentioning trump and comments mentioning biden using bing and afinn. Note that afinn gives a numeric value, so you will need to choose a threshold to determine positive sentiment. (See the second sketch after this list for one possible starting point.)
  5. Using bing, compare the sentiment of comments mentioning trump and comments mentioning biden using different metrics (e.g., Young and Soroka 2012, Martins and Baumard 2020, Ventura et al. 2021).
  6. Create your own domain-specific dictionary and apply it to the ventura_etal_df dataset. Show the limitations of your dictionary (e.g., false positives), and comment on how much of a problem this would be if you wanted to conduct an analysis of this corpus.
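
For question 2, the sketch below shows one possible starting point using quanteda (you are free to use whatever package the worksheet uses instead). The name of the text column (`comment`) is an assumption; adjust it to the actual column in ventura_etal_df.

```r
library(quanteda)

# Assumes the comment text lives in a column called "comment"; adjust as needed.
corp <- corpus(ventura_etal_df, text_field = "comment")

# Raw version: tokens as they appear in the comments
toks_raw <- tokens(corp)
kwic(toks_raw, pattern = "biden", window = 5)

# Pre-processed version: lower-cased, punctuation and stopwords removed
toks_clean <- tokens(corp, remove_punct = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en"))
kwic(toks_clean, pattern = "biden", window = 5)
```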
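
For questions 4 and 5, here is a rough sketch of one way to score positive sentiment per comment with tidytext. The column names (`comment_id`, `comment`) and the afinn threshold of zero are assumptions you should replace and justify, and get_sentiments("afinn") requires the textdata package.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Assumes an id column "comment_id" and a text column "comment"; adjust as needed.
words <- ventura_etal_df |>
  unnest_tokens(word, comment)

# bing: a comment counts as positive if it has more positive than negative words
bing_scores <- words |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(comment_id, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(bing_positive = positive > negative)

# afinn: sum the word values per comment, then pick (and justify) a threshold
afinn_scores <- words |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(comment_id) |>
  summarise(afinn_score = sum(value)) |>
  mutate(afinn_positive = afinn_score > 0)
```

From there, you can split the comments by whether they mention trump or biden and compare the shares of positive comments under each dictionary.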

Assignment 2 - Due Date: EOD Friday Week 6

This assignment must be completed after covering Week 5: Scaling Techniques (Unsupervised Learning I).

  1. We had a hard time scaling our text, so we looked at some possible problems. What are possible solutions if we want to position U.S. presidents on an ideological scale using text?
  2. Use the data/candidate-tweets.csv data to run an STM. Decide what your covariates are going to be. Decide whether you will use all the data or a sample of the data. Decide whether you are going to aggregate or split the text in some way (i.e., decide your unit of analysis). Decide the number of topics you will look for (try more than one option). What can you tell me about the topics tweeted by the 2015 U.S. primary candidates? (See the sketch after this list for one way to get started.)
  3. Choose three topics. Can you place the candidates on an ideological scale within each topic (determine the \(\theta\) threshold for when you can say that a tweet is mostly about a topic)? Does it make sense? Why or why not?
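
For question 2, here is a minimal sketch of the stm workflow. I am assuming the CSV has a text column called `text` and a covariate column called `candidate`, and K = 20 is only a placeholder; check the file and make (and defend) your own choices.

```r
library(stm)

tweets <- read.csv("data/candidate-tweets.csv", stringsAsFactors = FALSE)

# Assumes a text column "text" and a covariate column "candidate"; adjust as needed.
processed <- textProcessor(documents = tweets$text, metadata = tweets)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit an STM with a placeholder K = 20 and candidate as a prevalence covariate
fit <- stm(documents = out$documents,
           vocab = out$vocab,
           K = 20,
           prevalence = ~ candidate,
           data = out$meta,
           init.type = "Spectral")

# Inspect the topics
labelTopics(fit)

# Topic proportions (theta) per document, which you will need for question 3
head(fit$theta)
```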

Assignment 3 - Due Date: EOD Friday Week 11

The first part of this assignment must be completed after covering Week 7: A Primer on Supervised Learning. The second part of this assignment must be completed after covering Week 10: Encoder-Only LLMs.

Part I:

  1. Think of a dataset (corpus) and a classification task. Ideally, both the corpus and the classification task can be used in your final paper. However, it’s OK if you do this only for the assignment (you will still need a corpus). You can choose any task except sentiment classification.
  2. Decide the number of categories that you will be predicting.
  3. Decide the number of observations you will code per category.
  4. Create a draft codebook to guide coders who will (hypothetically) label your training set.
  5. Label a sample of your data (N=200); decide how you will sample the data and explain your decision. Have a classmate label the same sample (you can find the coder pairing here). Estimate inter-coder reliability and evaluate the results (see the sketch after this list).
  6. How difficult or easy was the task? What problems did you run into? What would you change in your codebook to improve it? What other lessons did you learn from this exercise?
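
For question 5, the sketch below shows one way to estimate inter-coder reliability with the irr package; `my_labels` and `partner_labels` are placeholder vectors holding the two coders’ labels for the same 200 documents, in the same order.

```r
library(irr)

# Placeholder objects: one vector of labels per coder, same document order
ratings <- cbind(my_labels, partner_labels)

# Simple percent agreement
agree(ratings)

# Cohen's kappa for two coders with nominal categories
kappa2(ratings)

# Krippendorff's alpha (expects raters in rows, units in columns)
kripp.alpha(t(ratings), method = "nominal")
```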

Note: We will use the codebook and training set for Part II. This can also be the basis for your final paper.

Part II:

  1. Once you have validated your training set and the labels produced by LLMs, expand the training set until you have (at least) 150 observations per category. You can use LLMs to label, but you will still need to manually validate the labels produced.
  2. Using your (validated) training set, fine-tune the encoder-only LLM of your choice.
  3. Report performance statistics for your fine-tuned model (e.g., accuracy, F1, recall), as well as the confusion matrix (see the sketch after this list).
  4. Use your trained model to predict labels on a target corpus.
  5. Describe the results obtained, and whether they match the expectations you had about the data.
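
For question 3, one way to report the metrics and the confusion matrix is the caret package, sketched below; `test_predictions`, `test_labels`, and `category_levels` are placeholders for your model’s predictions on a held-out set, the human-coded labels for that set, and the vector of category names.

```r
library(caret)

# Placeholders: predictions from the fine-tuned model and human-coded labels
# for the same held-out documents, plus the vector of category names.
predicted <- factor(test_predictions, levels = category_levels)
truth     <- factor(test_labels, levels = category_levels)

# Confusion matrix plus accuracy, precision, recall, and F1 per class
confusionMatrix(data = predicted, reference = truth, mode = "everything")
```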