3 Lecture 2: Introduction to Causal Inference

Slides

  • 3 Introduction to Causal Inference (link)

3.1 Introduction

We now dive deeper into causal inference and the counterfactual problem. We show why randomized trails solve that counterfactual problem, but also how the counterfactual problem is still a problem when using observational data.

The lecture slide are displayed in full below:

Figure 3.1: Slides for 3 Introduction to Causal Inference.

library(tidyverse) # for wrangling data
library(tidylog) # to know what we are wrangling

3.2 Vignette 2.1

Usually, we do not know the data generation process, but here, we are gods. Let’s create a world where taking a treatment A (e.g., taking a pill) positively affect Y (e.g., health) by one unit. Let’s run an experiment.

df <- data.frame(health_no_pill= rnorm(5000),
                 # Randomly assign a treatment
                 pill=sample(c(0,1),5000,replace=T))
hist(df$health_no_pill)
knitr::kable(table(df$pill), format="markdown")
Var1 Freq
0 2463
1 2537

Now we can create our counterfactual:

df <- df %>%
  mutate(health_w_pill = health_no_pill + 1) # Our Y when A=1 aka our counterfactual
## mutate: new variable 'health_w_pill' (double) with 5,000 unique values and 0% NA

Let’s look at our counterfactual:

health_w_pill <- cbind.data.frame(df$health_w_pill,"with Pill")
colnames(health_w_pill) <- c("health","treatment")
health_no_pill <- cbind.data.frame(df$health_no_pill,"without Pill")
colnames(health_no_pill) <- c("health","treatment")
comparison_y <- rbind.data.frame(health_w_pill,health_no_pill)

comparison_y %>%
  group_by(treatment) %>%
  mutate(mean_health = mean(health)) %>%
  ungroup() %>%
  ggplot(aes(x=health,fill = treatment,color = treatment)) +
  geom_density(alpha = .5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  geom_vline(aes(xintercept = mean_health, color = treatment ),
             linetype = "dashed")
## group_by: one grouping variable (treatment)
## mutate (grouped): new variable 'mean_health' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

Now let’s give each individual the treatment (either the pill or a placebo):

df <- df %>%
  mutate(health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_obs' (double) with 5,000 unique values and 0% NA
head(df,10)
##    health_no_pill pill health_w_pill health_obs
## 1     -0.03324713    1     0.9667529  0.9667529
## 2      0.63341976    0     1.6334198  0.6334198
## 3     -0.04272456    1     0.9572754  0.9572754
## 4      0.32105826    1     1.3210583  1.3210583
## 5     -1.76224035    1    -0.7622404 -0.7622404
## 6     -1.35063641    0    -0.3506364 -1.3506364
## 7      1.34648622    1     2.3464862  2.3464862
## 8     -1.27607506    0    -0.2760751 -1.2760751
## 9      2.41346412    0     3.4134641  2.4134641
## 10     0.13955778    0     1.1395578  0.1395578

We can see the average effect of the pill on the treated group (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##    pill  health
##   <dbl>   <dbl>
## 1     0 0.00273
## 2     1 1.01

Or we can plot it:

df %>%
  group_by(pill) %>%
  mutate(mean_health_obs = mean(health_obs)) %>%
  ungroup() %>%
  ggplot(aes(x=health_obs,fill = factor(pill),color = factor(pill))) +
  geom_density(alpha = .5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  geom_vline(aes(xintercept = mean_health_obs, color = factor(pill) ),
             linetype = "dashed")
## group_by: one grouping variable (pill)
## mutate (grouped): new variable 'mean_health_obs' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

3.3 Vignette 2.2

Ok… but what happens if we cannot randomize? What if we have observational data, such that…

df <- data.frame(income = runif(10000)) %>%
  # In this case, your health is determined randomly AND by your levels of income
  mutate(health_no_pill = rnorm(10000) + income,
         health_w_pill = health_no_pill + 1) %>%
  # Now we give the pill only to people that have money
  mutate(pill = income > .7,
         health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_no_pill' (double) with 10,000 unique values and 0% NA
##         new variable 'health_w_pill' (double) with 10,000 unique values and 0% NA
## mutate: new variable 'pill' (logical) with 2 unique values and 0% NA
##         new variable 'health_obs' (double) with 10,000 unique values and 0% NA
head(df,10)
##       income health_no_pill health_w_pill  pill health_obs
## 1  0.1042854      0.9562552     1.9562552 FALSE  0.9562552
## 2  0.9809103      2.6780747     3.6780747  TRUE  3.6780747
## 3  0.3296182      2.3931954     3.3931954 FALSE  2.3931954
## 4  0.6411059      1.4756062     2.4756062 FALSE  1.4756062
## 5  0.2321954     -0.6245798     0.3754202 FALSE -0.6245798
## 6  0.9268927      0.2778808     1.2778808  TRUE  1.2778808
## 7  0.5797159      0.6043916     1.6043916 FALSE  0.6043916
## 8  0.3811220      1.8681736     2.8681736 FALSE  1.8681736
## 9  0.8734975      2.1836346     3.1836346  TRUE  3.1836346
## 10 0.3998066     -0.8228780     0.1771220 FALSE -0.8228780

Let’s see what happens now to the estimated mean average ‘effect’ (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.353
## 2 TRUE   1.86

Oh no! That is more than the actual effect of the pill, which we know is 1 since we created it. However, if we were to properly model (this is an RDD!), then (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  filter(abs(income-.7)<.01) %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs)) ## BOOM!!
## filter: removed 9,803 rows (98%), 197 rows remaining
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.645
## 2 TRUE   1.65