8 Week 8: Introduction to Supervised Machine-Learning
8.1 Setup
I will not be providing code to run SVM or Bi-LSTM. However, if you are interested in good tutorials, please check out the following links:
SVM
- scikit-learn: https://scikit-learn.org/stable/modules/svm.html
- scikit-learn (but specifically for NLP): https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- Mehmet Tekman on Kaggle: https://www.kaggle.com/code/mehmetlaudatekman/text-classification-svm-explained
Bi-LSTM
- Ravindu Senaratne on Medium: https://heartbeat.comet.ml/text-classification-using-bi-directional-lstm-ca0070df7a81
- Nuzulul Khairu Nissa on Medium (compares various models, but the best-performing one is Bi-LSTM): https://medium.com/mlearning-ai/the-classification-of-text-messages-using-lstm-bi-lstm-and-gru-f79b207f90ad
- Using GloVe (word embeddings) with Bi-LSTM: https://www.kaggle.com/code/akashkr/tf-keras-tutorial-bi-lstm-glove-gru-part-6
- Using Word2Vec (word embeddings) with Bi-LSTM: https://www.kaggle.com/code/stoicstatic/twitter-sentiment-analysis-using-word2vec-bilstm
8.2 Homework 3
In this week’s lecture, we learned a framework for Supervised Machine Learning models. This framework includes creating a training set.
- Think of a dataset (corpus) and a classification task. Ideally, both the corpus and the classification task can be used in your final paper. However, it’s ok if they are used only for this assignment (you will still need to get a corpus). You can choose any task except sentiment classification.
- Decide the number of categories that you will be predicting.
- Decide the number of observations you will code per category.
- Create a codebook (draft) to guide coders who will (hypothetically) label your training set.
- Label a sample of your data (N=100; decide how you will sample the data and explain your decision). Have a classmate label the same sample (you can find the coder pairing here). Estimate inter-coder reliability and evaluate the results.
- How difficult or easy was the task? What problems did you run into? What would you change in your codebook to improve it? What other lessons did you learn from this exercise?
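The sampling and inter-coder reliability steps above can be sketched in a few lines of Python. This is only one possible approach, not the required one: the assignment does not name a reliability measure, so the sketch assumes Cohen's kappa for two coders (Krippendorff's alpha is another common choice) and a simple random sample with a fixed seed so that both coders label the same documents. The function names, the seed, and the toy labels ("policy", "other") are all hypothetical.

```python
import random
from collections import Counter


def draw_sample(doc_ids, n=100, seed=42):
    """Simple random sample without replacement.

    The fixed seed (an arbitrary choice) makes the draw reproducible,
    so both coders can be given exactly the same documents.
    """
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), n)


def percent_agreement(labels_a, labels_b):
    """Share of items to which both coders assigned the same label."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both coders must label the same non-empty sample.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both coders pick the same
    # category if each labels at random following their own marginals.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both coders used one identical category throughout;
        # kappa is undefined in that case.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example with hypothetical labels (a real sample would have N=100).
coder_a = ["policy", "policy", "other", "other"]
coder_b = ["policy", "policy", "other", "policy"]
print(percent_agreement(coder_a, coder_b))  # 0.75
print(cohens_kappa(coder_a, coder_b))       # 0.5
```

The gap between raw agreement and kappa in the toy example shows why percent agreement alone can overstate reliability: some matches are expected by chance, especially when one category dominates. If scikit-learn is available, `sklearn.metrics.cohen_kappa_score` can be used in place of the hand-written function.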
Note: We will be using the codebook and training set for an optional assignment (next week). It can also be the basis for your final paper.