2 Tutorial 1: Getting Started with R, RStudio and RMarkdown
This tutorial was created by Hugo Machado and John Santos (with minor adaptations from me).
This tutorial will guide you through setting up R and RStudio, installing essential packages, learning the basics of RMarkdown, and suggesting some best practices that will help keep your work organized and find solutions to common problems (which you will most definitely encounter).
2.1 Installing R and RStudio
R is the programming language we’ll be using in this course. It is optimized for statistics and data analysis. Additionally, it is open source, easily customizable, and very popular in academia, which gives it the edge over proprietary (e.g., Stata, SPSS) or general purpose frameworks (e.g., Python) as the best language to learn for social scientists working with data.
Go to the R Project website: https://mirror.csclub.uwaterloo.ca/CRAN/
This link will send you to a CRAN mirror hosted in the University of Waterloo.
Select the appropriate link for your operating system (Windows, macOS, or Linux).
-
Follow the instructions to download and install R.
-
For macOS users:
- If you have a Mac with an Apple Silicon chip (M1, M2, M3, etc.), choose the arm64 version.
- If you have a Mac with an Intel chip, choose the x86_64 version.
Note: When updating R, you’ll need to reinstall your packages. It’s a good idea to do this when you have time and not right before a deadline.
-
2.1.1 RStudio
RStudio is an Integrated Development Environment (IDE) that makes working with R much easier and more user-friendly.
- Go to the RStudio Desktop download page: https://posit.co/download/rstudio-desktop/
- Download the free version of RStudio Desktop for your operating system.
- Follow the instructions to install RStudio.
2.2 A Quick Tour of the RStudio Interface
When you open RStudio, you’ll see four main panes:
- Source (top-left): This is where you write and edit your R code and RMarkdown documents.
- Console (bottom-left): This is where you can run R code interactively and see the output.
- Environment/History (top-right): The Environment tab shows you the objects (data, variables, functions) you have created. The History tab shows you the commands you have run.
-
Files/Plots/Packages/Help (bottom-right): This pane has several tabs:
- Files: Allows you to browse and manage files on your computer.
- Plots: Displays the plots you create.
- Packages: Shows you the installed R packages and allows you to load/unload them.
- Help: Displays R documentation.
You can customize the layout of these panes in Tools > Global Options > Pane Layout.
2.3 R Working Directory
The working directory is the folder where R will look for files and save output by default. Issues with conflicting working directories are common, especially if you have multiple folders for your different projects. Here are some best practices to deal with this.
-
Using
setwd()
: You can use thesetwd()
function to set the working directory. For example,setwd("~/Documents/R Projects/MyProject")
sets the working directory to the “MyProject” folder within the “R Projects” folder in your “Documents” directory. You can also open a new script (tab on the top left), right-click on the tab, and tell R to “Set working directory” wherever the script is located. If you are working with external data files (e.g., a .dta database, for example), your data will need to be in the same folder as the script you are working with.
- Using the RStudio Interface: You can also set the working directory through the RStudio menu: “Session” > “Set Working Directory” > “Choose Directory…”.
- R Projects: For better organization, consider creating R Projects for your assignments. An R Project is a special type of working directory that makes it easier to manage your code, data, and output. You can create a new project by going to “File” > “New Project…”. RStudio will automatically set the working directory to the project folder. For more information about working with projects, consult this guide: https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects
2.4 Introduction to RMarkdown
RMarkdown is a file format that allows you to combine R code, its output (e.g., tables, plots), and text in a single document. It’s a powerful tool for creating reproducible reports and assignments. You can also use it to write in HTML and render PDFs. However, it can be a little finicky and it is not as intuitive to write with, when compared to common word processors like MSWord or Google Docs, so it takes some practice (and patience) to get to a place where you feel comfortable writing in it.
2.4.1 Why Use RMarkdown?
- Reproducibility: Your code, output, and text are all in one place, making it easy to reproduce your analysis.
- Clarity: You can interweave your code with explanations and narrative, making your work easier to understand.
- Efficiency: You can generate different output formats (HTML, PDF, Word) from the same RMarkdown file. The outputs also look pretty nice.
2.4.2 Basic RMarkdown Syntax
-
Code Chunks: R code is enclosed in “chunks” that start with
```{r}
and end with```
.# This is an R code chunk x <- 10 y <- 20 x + y
Headers: You can create headers using
#
,##
,###
, etc. The number of#
determines the level of the header.-
Text Formatting: You can format text using Markdown syntax. For example:
-
Italic:
*italic*
-
Bold:
**bold**
-
Code
: `code` - Links:
[Link Text](URL)
-
Italic:
2.4.3 Creating an RMarkdown File
- In RStudio, go to “File” > “New File” > “R Markdown…”.
- Choose a title and author name.
- Select the desired output format (e.g., HTML, PDF).
- Click “OK”.
RStudio will create a new RMarkdown file with some example content. You can edit this file, add your own code and text, and then “knit” it to generate the output document.
2.4.4 Knitting
“Knitting” is the process of converting an RMarkdown file into an output document. To knit a document, click the “Knit” button at the top of the Source pane. You can choose the output format from the dropdown menu next to the “Knit” button.
2.4.5 RMarkdown cheatsheet:
This is a useful quick guide you can use to help with formatting and other related tasks: https://rstudio.github.io/cheatsheets/html/rmarkdown.html
2.5 Installing R Packages
R packages extend the functionality of base R. As we progress in the course, we’ll use several packages that help you perform different tasks related to statistical analysis. You will only need to install packages once, but you will need to load them after each new session.
Here are some basic packages we will use and how you would go about installing them.
# Install packages for data import/export:
install.packages("foreign")
install.packages("rio")
# Install tidyverse, a collection of packages for data science:
install.packages("tidyverse")
# Install wooldridge, which contains datasets you will use for the assignments:
install.packages("wooldridge")
# Install knitr and markdown for improved RMarkdown functionality:
install.packages("knitr")
install.packages("markdown")
For the Experienced: If you have some experience with R, you might be interested in exploring the pacman
package. It provides a convenient way to manage packages, allowing you to load, install, and update packages using a single function, p_load()
. For example, instead of the repeated install.packages
operations above, if you have pacman
installed you could use one line like: pacman::p_load(devtools, remotes, foreign, readstata13, rio, labelled, sjlabelled, tidyverse, fixest, wooldridge, modelsummary, stargazer, ggplot2, knitr, kableExtra, markdown, car, carData, lmtest, sandwich, survey, srvyr)
. Using pacman
may also automatically install packages from CRAN, Bioconductor or GitHub, according to where the latest version is. If you’re interested, you can learn more about pacman
on its CRAN page. However, for this course, using the standard install.packages()
method will be sufficient.
2.6 Using AI Tools to Assist with R Programming
Large Language Models (LLMs) and related AI tools, such as ChatGPT, Copilot or Gemini, can be valuable assistants for learning and using R. If used correctly, they can save you a lot of time, especially in debugging and finding ways of performing specific tasks you’re not really familiar with. However, it’s crucial to use them responsibly and ethically. Here are some general tips for using these tools.
2.6.1 How AI Can Help
- Code Suggestions and Auto-Completion: AI tools like GitHub Copilot can suggest code snippets as you type, helping you write code faster and with fewer errors. They also hallucinate at times, so be mindful of what they are suggesting. I’m not a big fan of auto-complete, it’s usually better to either ask it for something specific or to ask for edits to base code you’ve already written.
- Debugging Assistance: When you encounter an error, AI can help explain what the error message means and suggest potential solutions. Just copy and paste the error message (or a relevant excerpt), tell it what you wanted it to do, and ask what could be causing the error.
- Understanding Functions and Packages: AI can provide explanations of how R functions work, what arguments they take, and how to use them effectively. Check the documentation to make sure the LLMs are not telling you to do something that is off the mark.
- Learning New Concepts: You can ask AI to explain statistical concepts or R programming topics in a way that’s easy to understand. Documentation references and StackOverflow can be a bit cryptic.
- Finding Relevant Documentation: AI can help you locate relevant documentation for R packages and functions.
- Generating Code for Specific Tasks: You can describe the task you want to accomplish, and AI can generate R code to perform that task. The better and more informed your prompt is, the better the output you will get. These models are typically better at contained, well-defined, tasks than at broad tasks that are underspecified.
2.6.2 Limitations and Ethical Considerations
- Critical Thinking is Essential: AI is a tool, not a replacement for your own understanding. Always critically evaluate the code generated by AI. Make sure you understand how it works before using it.
- Don’t Blindly Copy and Paste: Avoid copying and pasting code without understanding it. This can lead to errors and a lack of comprehension.
- Avoid Plagiarism: Be transparent about your use of AI. Properly cite any AI-generated code or ideas in your assignments. Your instructors need to know what work is yours and what came from AI assistance.
- AI Makes Mistakes: AI is not perfect. It can generate incorrect or inefficient code. Always test the code thoroughly.
2.6.3 Prompt Engineering Tips
- Be Specific and Clear: The more specific you are in your prompts, the better the AI will understand your request.
- Provide Context: If you’re asking about a specific piece of code, include the relevant code in your prompt.
- Break Down Complex Tasks: If you have a complex task, break it down into smaller, more manageable steps.
- Iterate and Refine: If the AI doesn’t give you the answer you’re looking for, try rephrasing your prompt or providing more information.
2.6.4 Example
Here’s an example of how you could use an AI tool to help you generate a plot:
Prompt: “I have a data frame called mydata
with columns x
and y
. How can I create a scatterplot of y
against x
using ggplot2, with blue points and a red trend line?”
AI-Generated Code (example):
ggplot(mydata, aes(x = x, y = y)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red")
Remember: You would still need to load the ggplot2 package (library(ggplot2)
) and have your data frame mydata
loaded in your R environment for this code to work.
2.7 Quick Example: Importing and Checking Data from wooldridge
Let’s load a dataset from the wooldridge
package and take a quick look at it. We’ll use the wage1
dataset, which contains information about wages and other individual characteristics.
# Load the wooldridge package
library(wooldridge)
# Load the wage1 dataset
data("wage1")
# Display the first few rows of the dataset
head(wage1)
## wage educ exper tenure nonwhite female married numdep smsa northcen south west construc ndurman trcommpu trade services
## 1 3.10 11 2 0 0 1 0 2 1 0 0 1 0 0 0 0 0
## 2 3.24 12 22 2 0 1 1 3 1 0 0 1 0 0 0 0 1
## 3 3.00 11 2 0 0 0 0 2 0 0 0 1 0 0 0 1 0
## 4 6.00 8 44 28 0 0 1 0 1 0 0 1 0 0 0 0 0
## 5 5.30 12 7 2 0 0 1 1 0 0 0 1 0 0 0 0 0
## 6 8.75 16 9 8 0 0 1 0 1 0 0 1 0 0 0 0 0
## profserv profocc clerocc servocc lwage expersq tenursq
## 1 0 0 0 0 1.131402 4 0
## 2 0 0 0 1 1.175573 484 4
## 3 0 0 0 0 1.098612 4 0
## 4 0 0 1 0 1.791759 1936 784
## 5 0 0 0 0 1.667707 49 4
## 6 1 1 0 0 2.169054 81 64
# Get a summary of the dataset
summary(wage1)
## wage educ exper tenure nonwhite female married
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 4.650 Median :12.00 Median :13.50 Median : 2.000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105 Mean :0.1027 Mean :0.4791 Mean :0.6084
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## numdep smsa northcen south west construc ndurman
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.000 Median :1.0000 Median :0.000 Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :1.044 Mean :0.7224 Mean :0.251 Mean :0.3555 Mean :0.1692 Mean :0.04563 Mean :0.1141
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:0.750 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :6.000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## trcommpu trade services profserv profocc clerocc servocc
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.04373 Mean :0.2871 Mean :0.1008 Mean :0.2586 Mean :0.3669 Mean :0.1673 Mean :0.1407
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## lwage expersq tenursq
## Min. :-0.6349 Min. : 1.0 Min. : 0.00
## 1st Qu.: 1.2030 1st Qu.: 25.0 1st Qu.: 0.00
## Median : 1.5369 Median : 182.5 Median : 4.00
## Mean : 1.6233 Mean : 473.4 Mean : 78.15
## 3rd Qu.: 1.9286 3rd Qu.: 676.0 3rd Qu.: 49.00
## Max. : 3.2181 Max. :2601.0 Max. :1936.00
# Get the structure of the dataset
str(wage1)
## 'data.frame': 526 obs. of 24 variables:
## $ wage : num 3.1 3.24 3 6 5.3 ...
## $ educ : int 11 12 11 8 12 16 18 12 12 17 ...
## $ exper : int 2 22 2 44 7 9 15 5 26 22 ...
## $ tenure : int 0 2 0 28 2 8 7 3 4 21 ...
## $ nonwhite: int 0 0 0 0 0 0 0 0 0 0 ...
## $ female : int 1 1 0 0 0 0 0 1 1 0 ...
## $ married : int 0 1 0 1 1 1 0 0 0 1 ...
## $ numdep : int 2 3 2 0 1 0 0 0 2 0 ...
## $ smsa : int 1 1 0 1 0 1 1 1 1 1 ...
## $ northcen: int 0 0 0 0 0 0 0 0 0 0 ...
## $ south : int 0 0 0 0 0 0 0 0 0 0 ...
## $ west : int 1 1 1 1 1 1 1 1 1 1 ...
## $ construc: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ndurman : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trcommpu: int 0 0 0 0 0 0 0 0 0 0 ...
## $ trade : int 0 0 1 0 0 0 1 0 1 0 ...
## $ services: int 0 1 0 0 0 0 0 0 0 0 ...
## $ profserv: int 0 0 0 0 0 1 0 0 0 0 ...
## $ profocc : int 0 0 0 0 0 1 1 1 1 1 ...
## $ clerocc : int 0 0 0 1 0 0 0 0 0 0 ...
## $ servocc : int 0 1 0 0 0 0 0 0 0 0 ...
## $ lwage : num 1.13 1.18 1.1 1.79 1.67 ...
## $ expersq : int 4 484 4 1936 49 81 225 25 676 484 ...
## $ tenursq : int 0 4 0 784 4 64 49 9 16 441 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
Explanation:
-
library(wooldridge)
: Loads thewooldridge
package, making its datasets available. -
data("wage1")
: Loads thewage1
dataset into your R environment. -
head(wage1)
: Displays the first six rows of the dataset, allowing you to see the variable names and some sample data.-
Exercise: Try changing the number inside the parentheses of the
head()
function to display a different number of rows.
-
Exercise: Try changing the number inside the parentheses of the
-
summary(wage1)
: Provides descriptive statistics for each variable in the dataset (e.g., mean, median, min, max, quartiles). -
str(wage1)
: Shows the structure of the dataset, including the data type of each variable (e.g., numeric, integer, factor).
2.8 Getting Help in R
There are several ways to get help in R:
-
?
orhelp()
: To access the documentation for a specific function, type?
followed by the function name (e.g.,?mean
). You can also usehelp(mean)
. -
??
orhelp.search()
: To search for help topics related to a keyword, use??
followed by the keyword (e.g.,??regression
). You can also usehelp.search("regression")
. - Online Resources: The R community is very active online. Websites like Stack Overflow (https://stackoverflow.com/questions/tagged/r) are great places to find answers to common R questions.
-
Package Vignettes: Many R packages include vignettes, which are longer, tutorial-style documents that demonstrate how to use the package. You can access them using the
browseVignettes()
function. - Package Websites: Some packages have dedicated websites with additional resources, such as tutorials, articles, and examples. These can typically be found by searching for the package name followed by “R package” in a search engine.
2.9 Style Guides in R
Coding style guides are sets of conventions that prescribe how code should be formatted and written. Following a style guide can make your code more readable, maintainable, and consistent.
2.9.1 The tidyverse Style Guide
The tidyverse style guide, developed by Hadley Wickham and the RStudio team, is a widely used style guide for R. It covers various aspects of coding style, including:
- File names
- Object names
- Syntax
- Spacing
- Control flow
- Comments
You can find the tidyverse style guide here: https://style.tidyverse.org/. Note: I try to follow the guide, but I am not incredibly good at doing so.
2.9.2 Quick Examples
Here are a few examples of tidyverse style guidelines:
-
File Names: Use
.R
extension for R script files and.Rmd
for R Markdown files. File names should be meaningful and use lowercase letters, numbers, and underscores (e.g.,process_data.R
,analysis_report.Rmd
). -
Object Names: Use lowercase with underscores to separate words (e.g.,
my_variable
,data_frame
). Be descriptive but concise. -
Spacing: Place spaces around operators (e.g.,
x + y
, notx+y
) and after commas (e.g.,mean(x, na.rm = TRUE)
, notmean(x,na.rm=TRUE)
). - Indentation: Use two spaces for indentation. Do not use tabs.
2.9.3 Other Style Guides
While the tidyverse style guide is popular, there are other style guides you might encounter or choose to follow, such as:
- Google’s R Style Guide: https://google.github.io/styleguide/Rguide.html
- The Bioconductor Style Guide: https://contributions.bioconductor.org/
Ultimately, the most important thing is to be consistent in your style, regardless of which guide you choose to follow. Most style guides agree on the basic principles of writing clear and readable code.
2.10 Protips for R Success
2.10.1 Enable Helpful RStudio Options
These options can make your coding experience more pleasant and efficient:
- Highlight R function calls: Go to “Tools” > “Global Options” > “Code” > “Display” and check “Highlight R function calls.” This will visually distinguish function names in your code.
- Rainbow parentheses: In the same “Display” settings, check “Rainbow parentheses.” This will color-code matching parentheses, making it easier to track nested functions.
- Use a dark theme: If you find yourself staring at the screen for long periods, a dark theme can help reduce eye strain. Go to “Tools” > “Global Options” > “Appearance” and choose a dark theme from the “Editor Theme” options.
2.10.2 Good Coding Practices
- Organization: Use a clear directory structure for your projects. Keep your data, code, and output in separate folders.
-
Comments: Explain your code using comments (lines starting with
#
). Focus on explaining the why behind your code, not just the what. - Readability: Use consistent spacing and indentation to make your code easy to read. RStudio can help with this by automatically indenting code.
-
Section Headings: Use
#
to create section headings to organize your R scripts. The number of#
determines the level of the heading (e.g.,#
for a top-level heading,##
for a subheading, etc.). - Don’t Overwrite: Avoid overwriting your original variables. Create new variables when you need to modify data.
- Backups: Regularly back up your work. Save reusable code snippets for future use. For larger projects, consider using separate R scripts for different tasks (e.g., data cleaning, analysis, reporting).
- Coding as a Language: Learning to code is like learning a new language. It takes time and practice. Be patient with yourself.
- Precision: Approach your analysis with care and attention to detail. Avoid rushing, especially when tired.
2.10.3 Solutions to Common Problems
-
Knitting Issues:
- Restart RStudio and try knitting again.
- Check for special characters (e.g., Greek letters) that might be causing problems.
- Make sure your PDF file (if you’re knitting to PDF) is not open in another program.
- Delete temporary
.md
and.tex
files that are created during the knitting process. - Consider disabling “use tinytex when compiling .tex files” in “Tools” > “Global Options” > “Sweave” if you’re having issues with TinyTeX.
-
Error Messages:
- Carefully check your function arguments. Did you enter them correctly and in the right order?
- Pay attention to capitalization and the use of quotes.
- If you encounter naming conflicts (e.g.,
dplyr::recode()
vs.car::recode()
), specify the package you want to use (e.g.,dplyr::recode()
).
-
Model Problems:
- Double-check that you’re using the correct variables in your model.
- Review any recoding steps you performed. Did you make any mistakes that might be affecting your results?