8 Lecture 3 Exercises

This tutorial was created by John Santos (with minor adaptations from me).

Before we start with the exercises, we will take a look at dplyr and piping.

8.1 {dplyr} primer

dplyr is an R package that is part of the tidyverse family of packages. The tidyverse provides a unified “grammar” of coding, as opposed to the idiosyncratic conventions of Base R.

dplyr focuses on data management/transformation.

For our purposes, some of the most used dplyr functions include:

  • select(): pick a column (i.e. a variable) or multiple columns in a data frame.
  • mutate(): create a new column, usually calculated from existing columns.
  • summarise(): collapse rows into summary statistics (e.g. a mean) calculated from an existing column.
  • group_by(): precedes the summarise() command and tells R which column(s) define the groups you want to summarise by.
  • filter(): pick rows (i.e. cases) according to one or more criteria.

The other nifty thing about dplyr (and the tidyverse more generally) is the use of “pipelines,” which are created using the pipe operator, %>%. This means you only need to identify the data frame with which you are working once and then “pipe it through” a “pipeline of operations”, which saves time and reduces the chance of coding errors.

(Note: R also has a “native pipe operator”, |>, but this only works in newer versions of R (4.1.0 and up) and is not yet universally adopted.)
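
As a minimal sketch of how the pipe works, consider the built-in mtcars data frame (any data frame behaves the same way; this example is for illustration only and is not part of the exercises):

library(dplyr)

# The pipe passes the object on its left as the first argument
# of the function on its right, so these two calls are equivalent:
head(mtcars, 2)
mtcars %>% head(2)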

8.1.1 The select() command in action

Say we wanted to pick out the feeling thermometer questions in our sample CES data set.

Here are the column names:

load("Sample_data/ces.rda")
names(ces)
##  [1] "weight"        "age"           "agegrp"        "age_18to34"    "age_35to54"    "age_55plus"    "gender"       
##  [8] "woman"         "lang"          "province"      "region"        "reg_on"        "reg_qc"        "reg_atl"      
## [15] "reg_west"      "educ"          "univ"          "relig"         "rel_catholic"  "rel_christian" "rel_other"    
## [22] "rel_none"      "income"        "incomegrp"     "mostimp"       "turnout"       "leftright"     "leftrightgrp" 
## [29] "lr_left"       "lr_centre"     "lr_right"      "feel_lib"      "feel_cpc"      "feel_ndp"      "feel_bloc"    
## [36] "feel_green"    "feel_ppc"      "feel_trudeau"  "feel_scheer"   "feel_singh"    "feel_blanchet" "feel_may"     
## [43] "feel_bernier"  "polinterest"   "marketlib"     "moraltrad"     "helpgroups"

Let’s pull those columns using Base R and put them into a new data frame called “feelings_base”. I’m feeling lazy, so I’m just going to do the parties.

feelings_base <- data.frame(
  feel_lib = ces$feel_lib,
  feel_cpc = ces$feel_cpc,
  feel_ndp = ces$feel_ndp,
  feel_bloc = ces$feel_bloc,
  feel_green = ces$feel_green,
  feel_ppc = ces$feel_ppc
)
head(feelings_base)
##   feel_lib feel_cpc feel_ndp feel_bloc feel_green feel_ppc
## 1       52        0       64        NA         97        0
## 2       51        0       64        40         88        0
## 3       52       49       NA        NA         NA       NA
## 4       30       85        0         0          0       30
## 5       20       62       51        20         50       20
## 6       81       31       79        NA         85        0

We could also pull these columns by using their index numbers, but this isn’t recommended because index numbers could change depending on the version of a data set or previous operations performed on it.

feelings_numsel <- ces[32:43]
head(feelings_numsel)
##   feel_lib feel_cpc feel_ndp feel_bloc feel_green feel_ppc feel_trudeau feel_scheer feel_singh feel_blanchet feel_may
## 1       52        0       64        NA         97        0           55           0         71            15       80
## 2       51        0       64        40         88        0           53           0         NA            32       59
## 3       52       49       NA        NA         NA       NA           74          24         63            NA       56
## 4       30       85        0         0          0       30            0          80          0             0        0
## 5       20       62       51        20         50       20           30          60         NA            50       60
## 6       81       31       79        NA         85        0           81          35         70            NA       74
##   feel_bernier
## 1            0
## 2            0
## 3           20
## 4           10
## 5           30
## 6            3

The dplyr way is more efficient:

library(dplyr)
feelings_dplyr <- select(ces, c(feel_lib:feel_blanchet))
## select: dropped 37 variables (weight, age, agegrp, age_18to34, age_35to54, …)
head(feelings_dplyr)
##   feel_lib feel_cpc feel_ndp feel_bloc feel_green feel_ppc feel_trudeau feel_scheer feel_singh feel_blanchet
## 1       52        0       64        NA         97        0           55           0         71            15
## 2       51        0       64        40         88        0           53           0         NA            32
## 3       52       49       NA        NA         NA       NA           74          24         63            NA
## 4       30       85        0         0          0       30            0          80          0             0
## 5       20       62       51        20         50       20           30          60         NA            50
## 6       81       31       79        NA         85        0           81          35         70            NA

You could get even fancier and use the starts_with() command to select every column whose column name starts with the character string “feel_”:

feelings_dplyr2 <- select(ces, starts_with("feel_"))
## select: dropped 35 variables (weight, age, agegrp, age_18to34, age_35to54, …)
head(feelings_dplyr2)
##   feel_lib feel_cpc feel_ndp feel_bloc feel_green feel_ppc feel_trudeau feel_scheer feel_singh feel_blanchet feel_may
## 1       52        0       64        NA         97        0           55           0         71            15       80
## 2       51        0       64        40         88        0           53           0         NA            32       59
## 3       52       49       NA        NA         NA       NA           74          24         63            NA       56
## 4       30       85        0         0          0       30            0          80          0             0        0
## 5       20       62       51        20         50       20           30          60         NA            50       60
## 6       81       31       79        NA         85        0           81          35         70            NA       74
##   feel_bernier
## 1            0
## 2            0
## 3           20
## 4           10
## 5           30
## 6            3
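
Other tidyselect helpers follow the same pattern. As a sketch (these particular selections are illustrations, not part of the exercise):

select(ces, contains("feel"))   # every column whose name contains "feel"
select(ces, ends_with("_lib"))  # every column whose name ends with "_lib"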

8.1.2 Using summarise() to calculate summary statistics

Let’s say we want to calculate the sample mean for each feeling thermometer score. In Base, we might do something like the following, where we have a separate line of code for each calculation. (Again, because I’m lazy, I’ve only done three, but there would be 12 lines in total.)

mean(ces$feel_lib, na.rm = TRUE)
## [1] 48.34366
mean(ces$feel_cpc, na.rm = TRUE)
## [1] 43.61424
mean(ces$feel_ndp, na.rm = TRUE)
## [1] 50.03684

The dplyr way is more efficient, thanks to the %>% operator, which allows us to only write the name of the data frame once and then “pipe it through” the entire chain of commands.

ces %>%
  select(starts_with("feel_")) %>%
  summarise(across(everything(), 
                   list(mean), 
                   na.rm = TRUE))
## select: dropped 35 variables (weight, age, agegrp, age_18to34, age_35to54, …)
## summarise: now one row and 12 columns, ungrouped
##   feel_lib_1 feel_cpc_1 feel_ndp_1 feel_bloc_1 feel_green_1 feel_ppc_1 feel_trudeau_1 feel_scheer_1 feel_singh_1
## 1   48.34366   43.61424   50.03684    23.33321     47.45785   25.81046       44.84804      39.51979     50.36898
##   feel_blanchet_1 feel_may_1 feel_bernier_1
## 1        27.79471   46.85782       26.14414
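
(A caution: passing na.rm = TRUE through the ... argument of across(), as above, is deprecated in recent versions of dplyr. A more future-proof sketch uses an anonymous function instead; naming the list entry also gives tidier column names, e.g. feel_lib_mean:)

ces %>%
  select(starts_with("feel_")) %>%
  summarise(across(everything(),
                   list(mean = ~ mean(.x, na.rm = TRUE))))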

8.1.3 The group_by() and summarise() combo

Calculating group-wise summary statistics in Base is a bit clunky. Recall from last week that Base R uses square brackets for selection and subsetting. If we want to calculate the average rating for the PPC among Western Canadians versus everyone else, we would have to do something like this:

mean(ces$feel_ppc[ces$reg_west==1], na.rm = TRUE)
## [1] 27.04223
mean(ces$feel_ppc[ces$reg_west==0], na.rm = TRUE)
## [1] 25.25404

The dplyr way of doing the same is:

ces %>%
  group_by(reg_west) %>% 
  summarise(mean_ppc = mean(feel_ppc, na.rm = TRUE))
## group_by: one grouping variable (reg_west)
## summarise: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   reg_west mean_ppc
##      <dbl>    <dbl>
## 1        0     25.3
## 2        1     27.0

At first, you might think the dplyr way is longer because it has three instead of two lines of code. However, it is more efficient because there is less duplication of code.

Where dplyr really starts to show an advantage is when you have to perform many calculations (more variables, more groupings, and/or more summary statistics).

If we want to calculate the dispersion of the distributions (the standard deviation, SD) or the uncertainty associated with the estimated means (the standard error, SE = SD/sqrt(n)), this is what it looks like in Base R:

sd(ces$feel_ppc[ces$reg_west==1], na.rm = TRUE)
## [1] 26.75502
sd(ces$feel_ppc[ces$reg_west==0], na.rm = TRUE)
## [1] 26.43218
sd(ces$feel_ppc[ces$reg_west==1], na.rm = TRUE) / 
  sqrt(sum(!is.na(ces$feel_ppc[ces$reg_west==1])))
## [1] 0.2654482
sd(ces$feel_ppc[ces$reg_west==0], na.rm = TRUE) / 
  sqrt(sum(!is.na(ces$feel_ppc[ces$reg_west==0])))
## [1] 0.1762576

The dplyr way:

ces %>%
  group_by(reg_west) %>% 
  summarise(mean_ppc = mean(feel_ppc, na.rm = TRUE),
            sd_ppc = sd(feel_ppc, na.rm = TRUE),
            n_ppc = sum(!is.na(feel_ppc)),
            semean_ppc = sd_ppc / sqrt(n_ppc)
            )
## group_by: one grouping variable (reg_west)
## summarise: now 2 rows and 5 columns, ungrouped
## # A tibble: 2 × 5
##   reg_west mean_ppc sd_ppc n_ppc semean_ppc
##      <dbl>    <dbl>  <dbl> <int>      <dbl>
## 1        0     25.3   26.4 22489      0.176
## 2        1     27.0   26.8 10159      0.265
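
This scales naturally to many variables at once. As a sketch (the particular columns chosen here are arbitrary), across() applies the same set of summary functions to several columns within each group:

ces %>%
  group_by(reg_west) %>%
  summarise(across(c(feel_lib, feel_cpc, feel_ppc),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        se = ~ sd(.x, na.rm = TRUE) / sqrt(sum(!is.na(.x))))))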

8.2 Lecture 3 Exercises

Using the ‘fertil2’ dataset from ‘wooldridge’ on women living in the Republic of Botswana in 1988:

(i) Calculate the means and mean differences between those with and without electricity for the following two characteristics:

  • education (educ)
  • age when first child was born (agefbrth)

We’ll start by loading the dataset.

library(wooldridge)
data("fertil2")
head(fertil2)
##   mnthborn yearborn age electric radio tv bicycle educ ceb agefbrth children knowmeth usemeth monthfm yearfm agefm idlnchld
## 1        5       64  24        1     1  1       1   12   0       NA        0        1       0      NA     NA    NA        2
## 2        1       56  32        1     1  1       1   13   3       25        3        1       1      11     80    24        3
## 3        7       58  30        1     0  0       0    5   1       27        1        1       0       6     83    24        5
## 4       11       45  42        1     0  1       0    4   3       17        2        1       0       1     61    15        3
## 5        5       45  43        1     1  1       1   11   2       24        2        1       1       3     66    20        2
## 6        8       52  36        1     0  0       0    7   1       26        1        1       1      11     76    24        4
##   heduc agesq urban urb_educ spirit protest catholic frsthalf educ0 evermarr
## 1    NA   576     1       12      0       0        0        1     0        0
## 2    12  1024     1       13      0       0        0        1     0        1
## 3     7   900     1        5      1       0        0        0     0        1
## 4    11  1764     1        4      0       0        0        0     0        1
## 5    14  1849     1       11      0       1        0        1     0        1
## 6     9  1296     1        7      0       0        0        0     0        1
data("fertil2")
head(fertil2)
##   mnthborn yearborn age electric radio tv bicycle educ ceb agefbrth children knowmeth usemeth monthfm yearfm agefm idlnchld
## 1        5       64  24        1     1  1       1   12   0       NA        0        1       0      NA     NA    NA        2
## 2        1       56  32        1     1  1       1   13   3       25        3        1       1      11     80    24        3
## 3        7       58  30        1     0  0       0    5   1       27        1        1       0       6     83    24        5
## 4       11       45  42        1     0  1       0    4   3       17        2        1       0       1     61    15        3
## 5        5       45  43        1     1  1       1   11   2       24        2        1       1       3     66    20        2
## 6        8       52  36        1     0  0       0    7   1       26        1        1       1      11     76    24        4
##   heduc agesq urban urb_educ spirit protest catholic frsthalf educ0 evermarr
## 1    NA   576     1       12      0       0        0        1     0        0
## 2    12  1024     1       13      0       0        0        1     0        1
## 3     7   900     1        5      1       0        0        0     0        1
## 4    11  1764     1        4      0       0        0        0     0        1
## 5    14  1849     1       11      0       1        0        1     0        1
## 6     9  1296     1        7      0       0        0        0     0        1

After loading the dataset, let’s look at the summary statistics (also known as descriptive statistics or simply descriptives) for each of the variables we’re about to analyze. When looking at these, keep in mind the units of analysis and whether there are missing values that we should deal with as we analyze the data.

summary(fertil2$electric)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1402  0.0000  1.0000       3
summary(fertil2$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   7.000   5.856   8.000  20.000
summary(fertil2$agefbrth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   17.00   19.00   19.01   20.00   38.00    1088

Now we can calculate our summary statistics by group for each of our two variables.

library(dplyr)
fertil2 %>%
  group_by(electric) %>%
  summarise(mean_educ = mean(educ, na.rm = TRUE))
## group_by: one grouping variable (electric)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
##   electric mean_educ
##      <int>     <dbl>
## 1        0      5.38
## 2        1      8.76
## 3       NA      5.67

Difference in educational attainment between those without (0) and those with (1) electricity:

  • Without (0) = 5.382 years
  • With (1) = 8.763 years
  • Difference = 8.763 - 5.382 = 3.381 years of schooling

(Note: This isn’t important now, but the average among respondents who are missing on the grouping variable can hint at whether missingness could bias your results. In later methods courses, you’ll learn about ways to test for and deal with this.)
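
You can also let R do the subtraction for you. A sketch (dropping the three respondents with a missing value on electric first):

fertil2 %>%
  filter(!is.na(electric)) %>%
  group_by(electric) %>%
  summarise(mean_educ = mean(educ, na.rm = TRUE)) %>%
  summarise(diff_educ = diff(mean_educ))  # mean of group 1 minus group 0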

fertil2 %>%
  group_by(electric) %>%
  summarise(mean_agefbrth = mean(agefbrth, na.rm = TRUE))
## group_by: one grouping variable (electric)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
##   electric mean_agefbrth
##      <int>         <dbl>
## 1        0          18.8
## 2        1          20.2
## 3       NA          18

Difference in age when first child was born between those without (0) and those with (1) electricity:

  • Without (0) = 18.825 years old
  • With (1) = 20.162 years old
  • Difference = 20.162 - 18.825 = 1.337 years old

(ii) Evaluate if the mean differences are statistically significant at the 0.01 and 0.05 levels.

t.test(fertil2$educ ~ fertil2$electric)
## 
##  Welch Two Sample t-test
## 
## data:  fertil2$educ by fertil2$electric
## t = -19.569, df = 790.18, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -3.719608 -3.041416
## sample estimates:
## mean in group 0 mean in group 1 
##        5.382172        8.762684

For education, the difference of means = 3.381 years, p < 2.2*10^-16 (which is less than 0.001), so the difference is significant at both the 0.05 and 0.01 levels. In fact, it is even significant at the 0.001 level.

Using the confidence interval approach, you could also say that the 95% confidence interval of the difference is -3.720 and -3.041, which does not include zero. This indicates the difference is statistically significant at the 0.05 level.
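
(If you want those numbers programmatically rather than reading them off the printout, the object returned by t.test() stores them. For example:)

tt <- t.test(fertil2$educ ~ fertil2$electric)
tt$p.value    # the p-value
tt$conf.int   # the 95% confidence interval of the difference
tt$estimate   # the two group means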

t.test(fertil2$agefbrth ~ fertil2$electric)
## 
##  Welch Two Sample t-test
## 
## data:  fertil2$agefbrth by fertil2$electric
## t = -7.5801, df = 562.66, p-value = 1.432e-13
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.6827861 -0.9901586
## sample estimates:
## mean in group 0 mean in group 1 
##        18.82545        20.16193

The code above uses “formula notation”, where the quantitative variable is listed first, then a tilde (“~”), and then the categorical variable.

If you use formula notation in the incorrect order, it won’t work, as in the following case: t.test(fertil2$electric ~ fertil2$agefbrth).

Using “vector notation”, in which you specify two complementary subsets (using square brackets, []), also works. The vector notation approach works regardless of the order in which you specify the two vectors, as can be seen in the examples below:

t.test(fertil2$agefbrth[fertil2$electric==1],fertil2$agefbrth[fertil2$electric==0])
## 
##  Welch Two Sample t-test
## 
## data:  fertil2$agefbrth[fertil2$electric == 1] and fertil2$agefbrth[fertil2$electric == 0]
## t = 7.5801, df = 562.66, p-value = 1.432e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9901586 1.6827861
## sample estimates:
## mean of x mean of y 
##  20.16193  18.82545
t.test(fertil2$agefbrth[fertil2$electric==0],fertil2$agefbrth[fertil2$electric==1])
## 
##  Welch Two Sample t-test
## 
## data:  fertil2$agefbrth[fertil2$electric == 0] and fertil2$agefbrth[fertil2$electric == 1]
## t = -7.5801, df = 562.66, p-value = 1.432e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6827861 -0.9901586
## sample estimates:
## mean of x mean of y 
##  18.82545  20.16193

Regardless of which approach you use, for age at birth of first child, difference of means = 1.337 years (p = 1.432*10^-13). So, this difference is significant at both the 0.05 and 0.01 levels. In fact, it is even significant at the 0.001 level.

Using the confidence interval approach, you could also say that the 95% confidence interval of the difference is -1.683 and -0.990, which does not include zero. This indicates the difference is statistically significant at the 0.05 level.

A note for Stata users: By default, t.test() uses unequal variances (as proposed by Welch). It is possible to perform the “vanilla” Student’s t-test that assumes equal variances by specifying the option var.equal = TRUE. However, in practice, there is no reason to do so. Welch’s t-test is more robust than Student’s t-test when the assumptions of the test are violated, and it performs just as well in the (rare) cases when the assumptions are met. If you compare results between R and Stata, t-tests usually do not match because Stata, by default, assumes equal variances. You can get Stata to assume unequal variances by specifying it as an option (e.g. ttest agefbrth, by(electric) unequal welch).
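
For completeness, this is what the equal-variance version looks like in R (shown only for comparison with Stata; as noted above, there is rarely a reason to use it):

t.test(fertil2$agefbrth ~ fertil2$electric, var.equal = TRUE)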

(iii) Interpret the results.

Women who have access to electricity have, on average, 3.4 more years of education and had their first child 1.3 years later than women who do not have access to electricity. These mean differences are significant at both the 0.01 and 0.05 levels.

(iv) From this analysis, would you say that the comparisons between women with and without electricity are apples-to-apples or apples-to-oranges?

These are apples-to-oranges comparisons.

(v) How does that affect our ability to conclude anything about the relationship between access to electricity and number of children?

In the previous exercise, we found that women with access to electricity have, on average, fewer children than women who do not have access to electricity (mean difference = 0.43 children, p<0.001). Women with electricity are systematically different from women without electricity: they have fewer children, have their children later, and have more years of schooling. Presumably, having children later and having more years of schooling are themselves related to how many children a woman has. So, we cannot be sure that there is a true link between access to electricity and the number of children a woman has without controlling for those other factors.

fertil2 %>%
  select(c(electric, educ, agefbrth)) %>%
  filter(!is.na(electric)) %>%
  group_by(electric) %>%
  summarise(mean_agefbrth = mean(agefbrth, na.rm = TRUE),
            mean_educ = mean(educ, na.rm = TRUE)) %>%
  mutate(electric = factor(electric,
                           levels = c(0,1),
                           labels = c("No electricity",
                                      "Electricity")))
## select: dropped 24 variables (mnthborn, yearborn, age, radio, tv, …)
## filter: removed 3 rows (<1%), 4,358 rows remaining
## group_by: one grouping variable (electric)
## summarise: now 2 rows and 3 columns, ungrouped
## mutate: converted 'electric' from integer to factor (0 new NA)
## # A tibble: 2 × 3
##   electric       mean_agefbrth mean_educ
##   <fct>                  <dbl>     <dbl>
## 1 No electricity          18.8      5.38
## 2 Electricity             20.2      8.76

8.3 Additional Exercises

  1. Using the ‘ces’ dataset, calculate the means and mean differences of support for market liberalism (marketlib) across whether someone lives in Western Canada (reg_west) and between men and women (gender);
  2. evaluate if the mean differences are statistically significant at the 0.01 and 0.05 levels;
  3. interpret the results in the same way as you did in the practice exercises from lecture.

8.3.1 STOP!!

Before you continue, try solving the exercises on your own. It’s the only way you will learn. Then, come back to this page and see how well you did.

8.3.2 Continue

We’ll start by loading the dataset and taking a look at the univariate summaries of each variable we’ll be analyzing.

load("Sample_data/ces.rda")
summary(ces$marketlib)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00    9.00    8.59   11.00   16.00   32684
summary(ces$reg_west)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3145  1.0000  1.0000
summary(ces$gender)
##   Man Woman  NA's 
## 15517 21928   288

Market liberalism is measured on a scale from 0 to 16 points, with an average of 8.59.

Region West and gender are both binary variables.

There are NAs in marketlib and gender.

8.3.3 Group-wise means, mean differences, and statistical significance

I’m going to use t.test() here to get significance testing. The dplyr version follows, if you want to see it.

t.test(ces$marketlib ~ ces$reg_west)
## 
##  Welch Two Sample t-test
## 
## data:  ces$marketlib by ces$reg_west
## t = -2.3712, df = 3428.5, p-value = 0.01779
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.4571754 -0.0433279
## sample estimates:
## mean in group 0 mean in group 1 
##        8.498180        8.748431

Among those who live in Western Canada, the average market liberalism score is 8.748. Among those who do not live in Western Canada, the score is 8.498. This means that, on average, those living in Western Canada are 0.25 points out of 16 more supportive of market liberalism than those living elsewhere in the country. This difference is significant at the 0.05 level (p=0.018).

t.test(ces$marketlib ~ ces$gender)
## 
##  Welch Two Sample t-test
## 
## data:  ces$marketlib by ces$gender
## t = 9.4086, df = 5012.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
## 95 percent confidence interval:
##  0.7309037 1.1156689
## sample estimates:
##   mean in group Man mean in group Woman 
##            9.079404            8.156118

Among men, the average market liberalism score is 9.079. Among women, the score is 8.156. This means that, on average, men are 0.923 points out of 16 more supportive of market liberalism than women. This difference is significant at the 0.001 level (p < 0.001).

The dplyr version:

ces %>%
  group_by(reg_west) %>% 
  summarise(mean_marketlib = mean(marketlib, na.rm = TRUE))
## group_by: one grouping variable (reg_west)
## summarise: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   reg_west mean_marketlib
##      <dbl>          <dbl>
## 1        0           8.50
## 2        1           8.75
ces %>%
  group_by(gender) %>% 
  summarise(mean_marketlib = mean(marketlib, na.rm = TRUE))
## group_by: one grouping variable (gender)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
##   gender mean_marketlib
##   <fct>           <dbl>
## 1 Man              9.08
## 2 Woman            8.16
## 3 <NA>             5.38
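
Note the third row for respondents with missing gender. If you want to drop them from the table, filter first, e.g.:

ces %>%
  filter(!is.na(gender)) %>%
  group_by(gender) %>%
  summarise(mean_marketlib = mean(marketlib, na.rm = TRUE))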

8.3.4 Interpretation of results

In both sets of comparisons, there are statistically significant differences in support for the principle of market liberalism. However, for both gender and region, we cannot necessarily conclude that we’ve identified the reason why Westerners are more supportive of the free market than other Canadians, or why men are more supportive of the free market than women. Moreover, for region, the magnitude of the difference is only about a quarter of a point on a 16-point scale, which is of dubious substantive significance.

Both of these variables are usually referred to as demographic characteristics, or the personal attributes of an individual. They tend to be “far back” in the processes that lead to outcomes of interest in political science (like policy preferences or vote choice), and they often exert their effects through intervening variables, or things that come in between personal characteristics and outcomes.