3 Lecture 2: Introduction to Causal Inference

Slides

  • 3 Introduction to Causal Inference (link)

3.1 Introduction

We now dive deeper into causal inference and the counterfactual problem. We show why randomized trails solve that counterfactual problem, but also how the counterfactual problem is still a problem when using observational data.

The lecture slide are displayed in full below:

Figure 3.1: Slides for 3 Introduction to Causal Inference.

library(tidyverse) # for wrangling data
library(tidylog) # to know what we are wrangling

3.2 Vignette 2.1

Usually, we do not know the data generation process, but here, we are gods. Let’s create a world where taking a treatment A (e.g., taking a pill) positively affect Y (e.g., health) by one unit. Let’s run an experiment.

df <- data.frame(health_no_pill= rnorm(5000),
                 # Randomly assign a treatment
                 pill=sample(c(0,1),5000,replace=T))
hist(df$health_no_pill)
knitr::kable(table(df$pill), format="markdown")
Var1 Freq
0 2503
1 2497

Now we can create our counterfactual:

df <- df %>%
  mutate(health_w_pill = health_no_pill + 1) # Our Y when A=1 aka our counterfactual
## mutate: new variable 'health_w_pill' (double) with 5,000 unique values and 0% NA

Let’s look at our counterfactual:

health_w_pill <- cbind.data.frame(df$health_w_pill,"with Pill")
colnames(health_w_pill) <- c("health","treatment")
health_no_pill <- cbind.data.frame(df$health_no_pill,"without Pill")
colnames(health_no_pill) <- c("health","treatment")
comparison_y <- rbind.data.frame(health_w_pill,health_no_pill)

comparison_y %>%
  group_by(treatment) %>%
  mutate(mean_health = mean(health)) %>%
  ungroup() %>%
  ggplot(aes(x=health,fill = treatment,color = treatment)) +
  geom_density(alpha = .5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  geom_vline(aes(xintercept = mean_health, color = treatment ),
             linetype = "dashed")
## group_by: one grouping variable (treatment)
## mutate (grouped): new variable 'mean_health' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

Now let’s give each individual the treatment (either the pill or a placebo):

df <- df %>%
  mutate(health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_obs' (double) with 5,000 unique values and 0% NA
head(df,10)
##    health_no_pill pill health_w_pill health_obs
## 1       0.8423212    0     1.8423212  0.8423212
## 2      -0.5428214    1     0.4571786  0.4571786
## 3      -1.5839233    1    -0.5839233 -0.5839233
## 4      -1.6744286    0    -0.6744286 -1.6744286
## 5       0.5431052    1     1.5431052  1.5431052
## 6      -0.6689065    1     0.3310935  0.3310935
## 7      -0.2207760    0     0.7792240 -0.2207760
## 8       0.4573494    1     1.4573494  1.4573494
## 9       0.4404777    0     1.4404777  0.4404777
## 10      0.0932991    1     1.0932991  1.0932991

We can see the average effect of the pill on the treated group (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##    pill   health
##   <dbl>    <dbl>
## 1     0 -0.00903
## 2     1  1.01

Or we can plot it:

df %>%
  group_by(pill) %>%
  mutate(mean_health_obs = mean(health_obs)) %>%
  ungroup() %>%
  ggplot(aes(x=health_obs,fill = factor(pill),color = factor(pill))) +
  geom_density(alpha = .5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  geom_vline(aes(xintercept = mean_health_obs, color = factor(pill) ),
             linetype = "dashed")
## group_by: one grouping variable (pill)
## mutate (grouped): new variable 'mean_health_obs' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

3.3 Vignette 2.2

Ok… but what happens if we cannot randomize? What if we have observational data, such that…

df <- data.frame(income = runif(10000)) %>%
  # In this case, your health is determined randomly AND by your levels of income
  mutate(health_no_pill = rnorm(10000) + income,
         health_w_pill = health_no_pill + 1) %>%
  # Now we give the pill only to people that have money
  mutate(pill = income > .7,
         health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_no_pill' (double) with 10,000 unique values and 0% NA
##         new variable 'health_w_pill' (double) with 10,000 unique values and 0% NA
## mutate: new variable 'pill' (logical) with 2 unique values and 0% NA
##         new variable 'health_obs' (double) with 10,000 unique values and 0% NA
head(df,10)
##        income health_no_pill health_w_pill  pill health_obs
## 1  0.15681246    -0.47126247     0.5287375 FALSE -0.4712625
## 2  0.99598697    -0.80685390     0.1931461  TRUE  0.1931461
## 3  0.11365139     0.60829196     1.6082920 FALSE  0.6082920
## 4  0.04471854    -0.19103219     0.8089678 FALSE -0.1910322
## 5  0.08604455    -1.84413255    -0.8441325 FALSE -1.8441325
## 6  0.80914912     0.02072352     1.0207235  TRUE  1.0207235
## 7  0.59190844     0.73166084     1.7316608 FALSE  0.7316608
## 8  0.90610009     0.51558137     1.5155814  TRUE  1.5155814
## 9  0.92152681     3.51082541     4.5108254  TRUE  4.5108254
## 10 0.12657253     0.57418145     1.5741814 FALSE  0.5741814

Let’s see what happens now to the estimated mean average ‘effect’ (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.342
## 2 TRUE   1.87

Oh no! That is more than the actual effect of the pill, which we know is 1 since we created it. However, if we were to properly model (this is an RDD!), then (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  filter(abs(income-.7)<.01) %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs)) ## BOOM!!
## filter: removed 9,769 rows (98%), 231 rows remaining
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.636
## 2 TRUE   1.85