5 Lecture 2 Exercises

This tutorial was created by John Santos (with minor adaptations from me).

5.1 Main Exercise

The fertil2 data were collected on women living in the Republic of Botswana in 1988. The variable children refers to the number of living children. The variable electric is a binary indicator equal to one if the woman’s home has electricity, and zero if not. Using the fertil2 data in {wooldridge}…

  1. Find the smallest and largest values of children in the sample. What is the average of children?
  2. What percentage of women have electricity in the home?
  3. Compute the average of children for those without electricity and do the same for those with electricity.
  4. From part (iii), can you infer that having electricity “causes” women to have fewer children?
library(wooldridge)  # provides the fertil2 data
library(psych)       # provides describe()
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
data("fertil2")

(i) Find the smallest and largest values of children in the sample. What is the average of children?

Using Base R…

min(fertil2$children)
## [1] 0
max(fertil2$children)
## [1] 13
mean(fertil2$children)
## [1] 2.267828
summary(fertil2$children)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   2.268   4.000  13.000

Using the describe() function from the psych package…

describe(fertil2$children)
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 4361 2.27 2.22      2    1.95 2.97   0  13    13 1.07     0.75 0.03

(ii) What percentage of women have electricity in the home?

Using Base R…

prop.table(table(fertil2$electric))
## 
##         0         1 
## 0.8597981 0.1402019

Using tidyverse conventions…

fertil2%>%
  select(electric)%>%
  table()/nrow(fertil2)
## select: dropped 26 variables (mnthborn, yearborn, age, radio, tv, …)
## electric
##         0         1 
## 0.8592066 0.1401055

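About 14% of women have electricity. The two sets of results differ slightly because table() drops the 3 women with a missing value on electric: prop.table() divides by the 4,358 non-missing cases, while nrow(fertil2) divides by all 4,361 rows, so the second set of proportions does not sum to 1.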
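A minimal sketch of how the choice of denominator plays out when a 0/1 indicator contains missing values. The vector below is made-up toy data, not taken from fertil2, and the object names are only for illustration.

```r
# Toy data (hypothetical): 1 = has electricity, 0 = does not, NA = missing
electric <- c(0, 0, 0, 1, NA, 0, 1, 0)

# table() drops the NA, but length() still counts it,
# so these shares sum to less than 1
share_all_rows <- table(electric) / length(electric)

# prop.table() divides by the non-missing count only
share_non_missing <- prop.table(table(electric))

# Shortcut for a 0/1 indicator: the mean of a logical is a proportion
share_electric <- mean(electric == 1, na.rm = TRUE)

print(share_all_rows)
print(share_non_missing)
print(share_electric)
```

Which denominator is appropriate depends on the question; reporting the share among non-missing cases is the more common choice, but it is worth stating which one was used.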

(iii) Compute the average of children for those without electricity and do the same for those with electricity.

Using Base R to manually calculate averages from subsets…

The code below, translated into plain English, would be something like: “Calculate the mean of fertil2$children for all cases where fertil2$electric equals 0, while removing all cases that have NAs.”

mean(fertil2$children[fertil2$electric==0], na.rm = TRUE)
## [1] 2.327729

The code below calculates the complement of the code above. In plain English, this code says, “Calculate the mean of fertil2$children for all cases where fertil2$electric equals 1, while removing all cases that have NAs.”

mean(fertil2$children[fertil2$electric==1], na.rm = TRUE)
## [1] 1.898527

Mean number of children among women without electricity = 2.33.

Mean number of children among women with electricity = 1.90.

We could also use the t.test() command from Base R:

t.test(fertil2$children ~ fertil2$electric)
## 
##  Welch Two Sample t-test
## 
## data:  fertil2$children by fertil2$electric
## t = 5.2409, df = 958, p-value = 1.965e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.2684895 0.5899142
## sample estimates:
## mean in group 0 mean in group 1 
##        2.327729        1.898527

Mean difference = 0.43, \(p<0.001\), 95% CI = 0.27 to 0.59.

On average, women with electricity have 0.43 fewer children than women without electricity, and this difference is statistically significant.
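To see where the t statistic in a Welch test comes from, the sketch below computes it by hand and compares it with t.test(). The two vectors are made-up toy data, and the object names are only for illustration.

```r
# Toy data (hypothetical): an outcome measured in two groups
g0 <- c(2, 4, 3, 5, 1, 6, 3)
g1 <- c(1, 2, 0, 3, 2, 1)

# Difference in group means
diff_means <- mean(g0) - mean(g1)

# Welch standard error: each group keeps its own variance
se_diff <- sqrt(var(g0) / length(g0) + var(g1) / length(g1))

# t statistic = difference in means divided by its standard error
t_manual <- diff_means / se_diff

# t.test() uses the Welch version by default (var.equal = FALSE)
welch <- t.test(g0, g1)

print(c(manual = t_manual, t_test = unname(welch$statistic)))
print(welch$conf.int)
```

The two statistics should agree. The degrees of freedom and the confidence interval additionally rely on the Welch-Satterthwaite approximation, which t.test() handles internally.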

# dplyr
library(dplyr)
fertil2 %>%
  group_by(electric) %>%
  summarise(mean = mean(children),
            sd = sd(children))
## group_by: one grouping variable (electric)
## summarise: now 3 rows and 3 columns, ungrouped
## # A tibble: 3 × 3
##   electric  mean    sd
##      <int> <dbl> <dbl>
## 1        0  2.33  2.28
## 2        1  1.90  1.80
## 3       NA  2.67  2.89
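Grouped means can also be computed in Base R without subsetting each group by hand. The sketch below uses tapply() and aggregate() on a small made-up data frame (not fertil2). Unlike the dplyr output above, both functions drop the NA group by default.

```r
# Toy data (hypothetical): number of children and a 0/1 electricity indicator
toy <- data.frame(
  children = c(3, 0, 5, 2, 1, 4, 2),
  electric = c(0, 1, 0, 1, 1, NA, 0)
)

# tapply(values, groups, function): returns a named vector of group means
means_tapply <- tapply(toy$children, toy$electric, mean)

# aggregate() with a formula: returns a data frame, one row per group
means_aggregate <- aggregate(children ~ electric, data = toy, FUN = mean)

print(means_tapply)
print(means_aggregate)
```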

(iv) From part (iii), can you infer that having electricity “causes” women to have fewer children?

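While women with electricity, on average, have fewer children than women without electricity, and this difference is statistically significant, we cannot infer from it alone that having electricity “causes” women to have fewer children. A difference in group means only shows an association. To conclude that electricity is the cause, we would need a plausible mechanism linking electricity to fertility, and we would need to rule out other factors that affect both.

Perhaps the relationship is spurious and the common cause is socioeconomic status (SES): wealthier households may be both more likely to have electricity and more likely to have fewer children.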
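A small simulation can illustrate how a common cause produces a gap in group means even when there is no causal effect. Everything below is invented for illustration: the variable ses, the coefficients, and the continuous stand-in for number of children are assumptions, not estimates from fertil2.

```r
set.seed(123)
n <- 5000

# Hypothetical SES index (standard normal)
ses <- rnorm(n)

# Higher SES makes electricity more likely
electric <- rbinom(n, 1, plogis(-2 + 1.5 * ses))

# Higher SES lowers the outcome; electricity has NO effect by construction
# (a continuous stand-in for number of children, to keep the model linear)
children <- 3 - 0.8 * ses + rnorm(n)

# Raw comparison of group means, as in part (iii)
raw_gap <- mean(children[electric == 1]) - mean(children[electric == 0])

# Same comparison, holding SES constant with a linear regression
adj_gap <- unname(coef(lm(children ~ electric + ses))["electric"])

print(c(raw_gap = raw_gap, adj_gap = adj_gap))
```

In this setup the raw gap is clearly negative, while the gap adjusted for SES should be close to zero. In real data we rarely observe every common cause, which is why a difference in means on its own does not support a causal claim.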


5.2 Additional Exercises:

Use the {ces.Rda} data found here.

Overall ratings of Trudeau

The variable feel_trudeau has feeling thermometer ratings of Liberal Leader Justin Trudeau. On average, how do Canadians rate him? What’s the lowest rating? What’s the highest rating?

Trudeau ratings by groups

Do Trudeau’s ratings vary across groups of the population? Specifically, look at gender (gender), age (agegrp), and education (educ).

The variable (leftrightgrp) measures whether an individual places themselves on the left (0-4), centre (5), or right (6-10) of the political spectrum. Do ratings of Trudeau vary across self-placed ideological categories?

5.2.1 STOP!!

Before you continue, try solving the exercises on your own. It’s the only way you will learn. Then, come back to this page and see how well you did.

5.2.2 Continue

load("Sample_data/ces.Rda")

5.2.3 Overall ratings of Trudeau

The variable feel_trudeau has feeling thermometer ratings of Liberal Leader Justin Trudeau. On average, how do Canadians rate him? What’s the lowest rating? What’s the highest rating?

Using base R.

mean(ces$feel_trudeau)
## [1] NA
min(ces$feel_trudeau)
## [1] NA
max(ces$feel_trudeau)
## [1] NA

D’oh! That didn’t work because there are NAs.

Let’s remove those using the option na.rm = TRUE.

mean(ces$feel_trudeau, na.rm = TRUE)
## [1] 44.84804
min(ces$feel_trudeau, na.rm = TRUE)
## [1] 0
max(ces$feel_trudeau, na.rm = TRUE)
## [1] 100

Alternatively, we can use the summary() command.

summary(ces$feel_trudeau)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   50.00   44.85   75.00  100.00    2409

We can do the same using dplyr. We can use this method to calculate other summary statistics at the same time.

library(dplyr)
ces %>%
  summarise(mean = mean(feel_trudeau, na.rm = TRUE),
            min = min(feel_trudeau, na.rm = TRUE),
            max = max(feel_trudeau, na.rm = TRUE))
## summarise: now one row and 3 columns, ungrouped
##       mean min max
## 1 44.84804   0 100

We could also calculate several statistics at once, including the standard error of the mean and a 95% confidence interval…

ces %>%
  summarise(mean = mean(feel_trudeau, na.rm = TRUE),
            median = median(feel_trudeau, na.rm = TRUE),
            min = min(feel_trudeau, na.rm = TRUE),
            max = max(feel_trudeau, na.rm = TRUE),
            sd = sd(feel_trudeau, na.rm = TRUE),
            se = sd(feel_trudeau, na.rm = TRUE) / sqrt(sum(!is.na(feel_trudeau))),
            lower95 = mean - (1.96*se),
            upper95 = mean + (1.96*se)) 
## summarise: now one row and 8 columns, ungrouped
##       mean median min max       sd        se  lower95 upper95
## 1 44.84804     50   0 100 34.54668 0.1838109 44.48777 45.2083
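The same calculation can be wrapped in a small Base R function so it can be reused on any numeric variable. The helper describe_var() below is hypothetical (it is not part of the tutorial or of any package), and the ratings vector is made-up toy data. Like the code above, it uses the normal approximation (1.96 standard errors) for the 95% interval.

```r
# Hypothetical helper: summary statistics with a normal-approximation CI
describe_var <- function(x, z = 1.96) {
  x <- x[!is.na(x)]        # drop missing values once, up front
  n <- length(x)           # number of non-missing observations
  m <- mean(x)
  s <- sd(x)
  se <- s / sqrt(n)        # standard error of the mean
  c(n = n, mean = m, sd = s, se = se,
    lower95 = m - z * se, upper95 = m + z * se)
}

# Toy thermometer ratings (hypothetical) with two missing values
ratings <- c(0, 50, 75, NA, 100, 25, NA, 50)
stats <- describe_var(ratings)
print(stats)
```

Dropping the NAs once at the top avoids repeating na.rm = TRUE in every call and keeps the n used for the standard error consistent with the n used for the mean.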

5.2.4 Trudeau ratings by groups

Do Trudeau’s ratings vary across groups of the population? Have a look at gender (gender), age (agegrp), and education (educ). The variable (leftrightgrp) measures whether an individual places themselves on the left (0-4), centre (5), or right (6-10) of the political spectrum. Do ratings of Trudeau vary across self-placed ideological categories?

5.2.5 Gender

mean(ces$feel_trudeau[ces$gender=="Man"], na.rm=T)
## [1] 43.35218
mean(ces$feel_trudeau[ces$gender=="Woman"], na.rm=T)
## [1] 45.93724
t.test(ces$feel_trudeau ~ ces$gender)
## 
##  Welch Two Sample t-test
## 
## data:  ces$feel_trudeau by ces$gender
## t = -6.9009, df = 31483, p-value = 5.266e-12
## alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
## 95 percent confidence interval:
##  -3.319280 -1.850828
## sample estimates:
##   mean in group Man mean in group Woman 
##            43.35218            45.93724

Using dplyr…

ces %>%
  group_by(gender) %>%
  summarise(avg = mean(feel_trudeau, na.rm = TRUE))
## group_by: one grouping variable (gender)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
##   gender   avg
##   <fct>  <dbl>
## 1 Man     43.4
## 2 Woman   45.9
## 3 <NA>    44.8

We can also use the base R command t.test().

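This option is somewhat limited because it only works when comparing across two categories. However, it does test the significance of the difference, which is useful. Note that t.test() has no na.rm argument; with the formula interface, cases with missing values are dropped by default, so the call below gives the same result as the earlier one.

t.test(ces$feel_trudeau ~ ces$gender)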
## 
##  Welch Two Sample t-test
## 
## data:  ces$feel_trudeau by ces$gender
## t = -6.9009, df = 31483, p-value = 5.266e-12
## alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
## 95 percent confidence interval:
##  -3.319280 -1.850828
## sample estimates:
##   mean in group Man mean in group Woman 
##            43.35218            45.93724
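When a grouping variable has more than two categories, one common alternative to t.test() is a one-way ANOVA. The sketch below uses a made-up data frame (not the ces data) with three groups. It shows the classic aov(), the Welch version via oneway.test(), and pairwise comparisons via pairwise.t.test(); all three are in Base R's stats package.

```r
# Toy data (hypothetical): ratings in three ideological groups
toy <- data.frame(
  rating = c(60, 55, 70, 65, 40, 45, 50, 35, 30, 25, 45, 20),
  group  = factor(rep(c("Left", "Centre", "Right"), each = 4))
)

# Classic one-way ANOVA (assumes equal variances across groups)
fit <- aov(rating ~ group, data = toy)
print(summary(fit))

# Welch one-way test (does not assume equal variances; the default)
welch_anova <- oneway.test(rating ~ group, data = toy)
print(welch_anova)

# Pairwise t-tests with Holm-adjusted p-values (the default adjustment)
pairwise <- pairwise.t.test(toy$rating, toy$group)
print(pairwise)
```

The overall test only says whether at least one group mean differs; the pairwise comparisons indicate which groups differ from each other.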

5.2.5.1 Age

ces %>%
  group_by(agegrp) %>%
  summarise(avg = mean(feel_trudeau, na.rm = TRUE))
## group_by: one grouping variable (agegrp)
## summarise: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
##   agegrp   avg
##   <fct>  <dbl>
## 1 18-34   49.1
## 2 35-54   43.2
## 3 55+     43.7

5.2.5.2 Education

ces %>%
  group_by(educ) %>%
  summarise(avg = mean(feel_trudeau, na.rm = TRUE))
## group_by: one grouping variable (educ)
## summarise: now 5 rows and 2 columns, ungrouped
## # A tibble: 5 × 2
##   educ         avg
##   <fct>      <dbl>
## 1 HS or less  38.0
## 2 Some PSE    42.3
## 3 Bachelors   51.3
## 4 Postgrad    52.1
## 5 <NA>        38.1

5.2.5.3 Ideology

ces %>%
  group_by(leftrightgrp) %>%
  summarise(avg = mean(feel_trudeau, na.rm = TRUE))
## group_by: one grouping variable (leftrightgrp)
## summarise: now 4 rows and 2 columns, ungrouped
## # A tibble: 4 × 2
##   leftrightgrp   avg
##   <fct>        <dbl>
## 1 Left          59.4
## 2 Centre        42.9
## 3 Right         37.8
## 4 <NA>          43.6