3 Lecture 2: Introduction to Causal Inference

Slides

  • 3 Introduction to Causal Inference (link)

3.1 Introduction

We now dive deeper into causal inference and the counterfactual problem. We show why randomized trails solve that counterfactual problem, but also how the counterfactual problem is still a problem when using observational data.

The lecture slide are displayed in full below:

Figure 3.1: Slides for 3 Introduction to Causal Inference.

library(tidyverse) # for wrangling data
library(tidylog) # to know what we are wrangling

3.2 Vignette 2.1

Usually, we do not know the data generation process, but here, we are gods. Let’s create a world where taking a treatment A (e.g., taking a pill) positively affect Y (e.g., health) by one unit. Let’s run an experiment.

df <- data.frame(health_no_pill= rnorm(5000),
                 # Randomly assign a treatment
                 pill=sample(c(0,1),5000,replace=T))
hist(df$health_no_pill)
knitr::kable(table(df$pill), format="markdown")
Var1 Freq
0 2467
1 2533

Now we can create our counterfactual:

df <- df %>%
  mutate(health_w_pill = health_no_pill + 1) # Our Y when A=1 aka our counterfactual
## mutate: new variable 'health_w_pill' (double) with 5,000 unique values and 0% NA

Let’s look at our counterfactual:

health_w_pill <- cbind.data.frame(df$health_w_pill,"with Pill")
colnames(health_w_pill) <- c("health","treatment")
health_no_pill <- cbind.data.frame(df$health_no_pill,"without Pill")
colnames(health_no_pill) <- c("health","treatment")
comparison_y <- rbind.data.frame(health_w_pill,health_no_pill)

comparison_y %>%
  group_by(treatment) %>%
  mutate(mean_health = mean(health)) %>%
  ungroup() %>%
  ggplot(aes(x=health,fill = treatment,color = treatment)) +
  geom_density(alpha = .5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  geom_vline(aes(xintercept = mean_health, color = treatment ),
             linetype = "dashed")
## group_by: one grouping variable (treatment)
## mutate (grouped): new variable 'mean_health' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

Now let’s give each individual the treatment (either the pill or a placebo):

df <- df %>%
  mutate(health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_obs' (double) with 5,000 unique values and 0% NA
head(df,10)
##    health_no_pill pill health_w_pill health_obs
## 1      0.42322505    1     1.4232250  1.4232250
## 2      0.39419621    1     1.3941962  1.3941962
## 3     -0.04481056    1     0.9551894  0.9551894
## 4      1.06594742    0     2.0659474  1.0659474
## 5     -0.19369337    1     0.8063066  0.8063066
## 6      1.08453509    1     2.0845351  2.0845351
## 7      0.69245997    0     1.6924600  0.6924600
## 8      0.53320426    1     1.5332043  1.5332043
## 9      0.50099025    0     1.5009902  0.5009902
## 10     1.91985072    0     2.9198507  1.9198507

We can see the average effect of the pill on the treated group (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##    pill  health
##   <dbl>   <dbl>
## 1     0 -0.0344
## 2     1  0.981

Or we can plot it:

df %>%
  group_by(pill) %>%
  mutate(mean_health_obs = mean(health_obs)) %>%
  ungroup() %>%
  ggplot(aes(x=health_obs,fill = factor(pill),color = factor(pill))) +
  geom_density(alpha = .5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  geom_vline(aes(xintercept = mean_health_obs, color = factor(pill) ),
             linetype = "dashed")
## group_by: one grouping variable (pill)
## mutate (grouped): new variable 'mean_health_obs' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

3.3 Vignette 2.2

Ok… but what happens if we cannot randomize? What if we have observational data, such that…

df <- data.frame(income = runif(10000)) %>%
  # In this case, your health is determined randomly AND by your levels of income
  mutate(health_no_pill = rnorm(10000) + income,
         health_w_pill = health_no_pill + 1) %>%
  # Now we give the pill only to people that have money
  mutate(pill = income > .7,
         health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_no_pill' (double) with 10,000 unique values and 0% NA
##         new variable 'health_w_pill' (double) with 10,000 unique values and 0% NA
## mutate: new variable 'pill' (logical) with 2 unique values and 0% NA
##         new variable 'health_obs' (double) with 10,000 unique values and 0% NA
head(df,10)
##         income health_no_pill health_w_pill  pill   health_obs
## 1  0.521080535    2.607448254    3.60744825 FALSE  2.607448254
## 2  0.007725727   -1.066143985   -0.06614399 FALSE -1.066143985
## 3  0.589929569   -0.934537963    0.06546204 FALSE -0.934537963
## 4  0.406625681    0.504229056    1.50422906 FALSE  0.504229056
## 5  0.103787734    1.006819180    2.00681918 FALSE  1.006819180
## 6  0.553671626    0.060997530    1.06099753 FALSE  0.060997530
## 7  0.519004499   -0.589790989    0.41020901 FALSE -0.589790989
## 8  0.788197296    2.845975145    3.84597514  TRUE  3.845975145
## 9  0.264524445   -1.379120082   -0.37912008 FALSE -1.379120082
## 10 0.188454419   -0.003144779    0.99685522 FALSE -0.003144779

Let’s see what happens now to the estimated mean average ‘effect’ (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.348
## 2 TRUE   1.83

Oh no! That is more than the actual effect of the pill, which we know is 1 since we created it. However, if we were to properly model (this is an RDD!), then (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):

df %>%
  filter(abs(income-.7)<.01) %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs)) ## BOOM!!
## filter: removed 9,803 rows (98%), 197 rows remaining
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.623
## 2 TRUE   1.84