3 Lecture 2: Introduction to Causal Inference
Slides
- 3 Introduction to Causal Inference (link)
3.1 Introduction
We now dive deeper into causal inference and the counterfactual problem. We show why randomized trails solve that counterfactual problem, but also how the counterfactual problem is still a problem when using observational data.
The lecture slide are displayed in full below:
Figure 3.1: Slides for 3 Introduction to Causal Inference.
3.2 Vignette 2.1
Usually, we do not know the data generation process, but here, we are gods. Let’s create a world where taking a treatment A (e.g., taking a pill) positively affect Y (e.g., health) by one unit. Let’s run an experiment.
df <- data.frame(health_no_pill= rnorm(5000),
# Randomly assign a treatment
pill=sample(c(0,1),5000,replace=T))
hist(df$health_no_pill)

Var1 | Freq |
---|---|
0 | 2463 |
1 | 2537 |
Now we can create our counterfactual:
df <- df %>%
mutate(health_w_pill = health_no_pill + 1) # Our Y when A=1 aka our counterfactual
## mutate: new variable 'health_w_pill' (double) with 5,000 unique values and 0% NA
Let’s look at our counterfactual:
health_w_pill <- cbind.data.frame(df$health_w_pill,"with Pill")
colnames(health_w_pill) <- c("health","treatment")
health_no_pill <- cbind.data.frame(df$health_no_pill,"without Pill")
colnames(health_no_pill) <- c("health","treatment")
comparison_y <- rbind.data.frame(health_w_pill,health_no_pill)
comparison_y %>%
group_by(treatment) %>%
mutate(mean_health = mean(health)) %>%
ungroup() %>%
ggplot(aes(x=health,fill = treatment,color = treatment)) +
geom_density(alpha = .5) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
geom_vline(aes(xintercept = mean_health, color = treatment ),
linetype = "dashed")
## group_by: one grouping variable (treatment)
## mutate (grouped): new variable 'mean_health' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

Now let’s give each individual the treatment (either the pill or a placebo):
## mutate: new variable 'health_obs' (double) with 5,000 unique values and 0% NA
head(df,10)
## health_no_pill pill health_w_pill health_obs
## 1 -0.03324713 1 0.9667529 0.9667529
## 2 0.63341976 0 1.6334198 0.6334198
## 3 -0.04272456 1 0.9572754 0.9572754
## 4 0.32105826 1 1.3210583 1.3210583
## 5 -1.76224035 1 -0.7622404 -0.7622404
## 6 -1.35063641 0 -0.3506364 -1.3506364
## 7 1.34648622 1 2.3464862 2.3464862
## 8 -1.27607506 0 -0.2760751 -1.2760751
## 9 2.41346412 0 3.4134641 2.4134641
## 10 0.13955778 0 1.1395578 0.1395578
We can see the average effect of the pill on the treated group (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
## pill health
## <dbl> <dbl>
## 1 0 0.00273
## 2 1 1.01
Or we can plot it:
df %>%
group_by(pill) %>%
mutate(mean_health_obs = mean(health_obs)) %>%
ungroup() %>%
ggplot(aes(x=health_obs,fill = factor(pill),color = factor(pill))) +
geom_density(alpha = .5) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
geom_vline(aes(xintercept = mean_health_obs, color = factor(pill) ),
linetype = "dashed")
## group_by: one grouping variable (pill)
## mutate (grouped): new variable 'mean_health_obs' (double) with 2 unique values and 0% NA
## ungroup: no grouping variables remain

3.3 Vignette 2.2
Ok… but what happens if we cannot randomize? What if we have observational data, such that…
df <- data.frame(income = runif(10000)) %>%
# In this case, your health is determined randomly AND by your levels of income
mutate(health_no_pill = rnorm(10000) + income,
health_w_pill = health_no_pill + 1) %>%
# Now we give the pill only to people that have money
mutate(pill = income > .7,
health_obs = ifelse(pill==1,health_w_pill,health_no_pill))
## mutate: new variable 'health_no_pill' (double) with 10,000 unique values and 0% NA
## new variable 'health_w_pill' (double) with 10,000 unique values and 0% NA
## mutate: new variable 'pill' (logical) with 2 unique values and 0% NA
## new variable 'health_obs' (double) with 10,000 unique values and 0% NA
head(df,10)
## income health_no_pill health_w_pill pill health_obs
## 1 0.1042854 0.9562552 1.9562552 FALSE 0.9562552
## 2 0.9809103 2.6780747 3.6780747 TRUE 3.6780747
## 3 0.3296182 2.3931954 3.3931954 FALSE 2.3931954
## 4 0.6411059 1.4756062 2.4756062 FALSE 1.4756062
## 5 0.2321954 -0.6245798 0.3754202 FALSE -0.6245798
## 6 0.9268927 0.2778808 1.2778808 TRUE 1.2778808
## 7 0.5797159 0.6043916 1.6043916 FALSE 0.6043916
## 8 0.3811220 1.8681736 2.8681736 FALSE 1.8681736
## 9 0.8734975 2.1836346 3.1836346 TRUE 3.1836346
## 10 0.3998066 -0.8228780 0.1771220 FALSE -0.8228780
Let’s see what happens now to the estimated mean average ‘effect’ (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
## pill health
## <lgl> <dbl>
## 1 FALSE 0.353
## 2 TRUE 1.86
Oh no! That is more than the actual effect of the pill, which we know is 1 since we created it. However, if we were to properly model (this is an RDD!), then (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment, and those who do not):
df %>%
filter(abs(income-.7)<.01) %>%
group_by(pill) %>%
summarize(health = mean(health_obs)) ## BOOM!!
## filter: removed 9,803 rows (98%), 197 rows remaining
## group_by: one grouping variable (pill)
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
## pill health
## <lgl> <dbl>
## 1 FALSE 0.645
## 2 TRUE 1.65