3 Lecture 2: Introduction to Causal Inference

Slides

  • 3 Introduction to Causal Inference (link)

3.1 Introduction

We now dive deeper into causal inference and the counterfactual problem. We show why randomized trials solve the counterfactual problem, but also why it remains a challenge when using observational data.

The lecture slides are displayed in full below:

Figure 3.1: Slides for 3 Introduction to Causal Inference.

library(tidyverse) # for wrangling data
library(tidylog) # to know what we are wrangling

3.2 Vignette 2.1

Usually, we do not know the data-generating process, but here, we are gods. Let’s create a world where taking treatment A (e.g., taking a pill) positively affects Y (e.g., health) by one unit. Let’s run an experiment.

df <- data.frame(
  # Simulate baseline health outcomes *without* treatment for 5,000 individuals
  health_no_pill = rnorm(5000),
  # Randomly assign treatment: pill = 1 (treated) or 0 (control)
  pill = sample(c(0, 1), 5000, replace = TRUE)
)

# Visualize the distribution of baseline (no-pill) health outcomes
hist(df$health_no_pill)
knitr::kable(table(df$pill), format="markdown")
Var1 Freq
0 2514
1 2486

Now we can create our counterfactual:

df <- df %>%
  # Define the potential outcome under treatment (A = 1):
  # health_w_pill = health_no_pill + 1 implies a constant treatment effect of +1 for everyone.
  mutate(health_w_pill = health_no_pill + 1) # Y(1): the potential outcome if treated

Let’s look at our counterfactual:

# Reshape the two potential outcomes into a long format for easy plotting/comparison
health_w_pill <- cbind.data.frame(df$health_w_pill, "with Pill")
colnames(health_w_pill) <- c("health", "treatment")

health_no_pill <- cbind.data.frame(df$health_no_pill, "without Pill")
colnames(health_no_pill) <- c("health", "treatment")

# Stack the two datasets to compare distributions under treatment vs. no treatment
comparison_y <- rbind.data.frame(health_w_pill, health_no_pill)

comparison_y %>%
  group_by(treatment) %>%
  # Compute group means (used to add mean lines to the plot)
  mutate(mean_health = mean(health)) %>%
  ungroup() %>%
  ggplot(aes(x = health, fill = treatment, color = treatment)) +
  # Plot the distribution of health outcomes under each condition
  geom_density(alpha = 0.5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  # Add dashed vertical lines at the mean health for each condition
  geom_vline(
    aes(xintercept = mean_health, color = treatment),
    linetype = "dashed"
  )
## ungroup: no grouping variables remain

Now let’s give each individual the treatment (either the pill or a placebo):

df <- df %>%
  # Construct the observed outcome:
  # - if pill == 1, we observe the treated potential outcome Y(1)
  # - if pill == 0, we observe the untreated potential outcome Y(0)
  mutate(health_obs = ifelse(pill == 1, health_w_pill, health_no_pill))

# Inspect the first 10 rows to verify the observed outcome was assigned correctly
head(df, 10)
##    health_no_pill pill health_w_pill
## 1      0.17747640    1     1.1774764
## 2      0.88211462    1     1.8821146
## 3      0.05187795    1     1.0518779
## 4     -0.37384787    1     0.6261521
## 5     -0.57010506    1     0.4298949
## 6     -0.73979276    1     0.2602072
## 7     -0.18576048    0     0.8142395
## 8     -1.39315770    0    -0.3931577
## 9      0.09258727    0     1.0925873
## 10     0.37311881    0     1.3731188
##     health_obs
## 1   1.17747640
## 2   1.88211462
## 3   1.05187795
## 4   0.62615213
## 5   0.42989494
## 6   0.26020724
## 7  -0.18576048
## 8  -1.39315770
## 9   0.09258727
## 10  0.37311881

We can see the average effect of the pill on the treated group (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment and those who do not):

df %>%
  # Compare average observed health outcomes for treated (pill = 1) vs. control (pill = 0)
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##    pill  health
##   <dbl>   <dbl>
## 1     0 -0.0116
## 2     1  0.971

Or we can plot it:

df %>%
  group_by(pill) %>%
  # Compute the mean observed health within treated vs. control groups
  mutate(mean_health_obs = mean(health_obs)) %>%
  ungroup() %>%
  ggplot(aes(x = health_obs, fill = factor(pill), color = factor(pill))) +
  # Plot the distribution of observed outcomes by treatment status
  geom_density(alpha = 0.5) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  # Add dashed vertical lines at the group means
  geom_vline(
    aes(xintercept = mean_health_obs, color = factor(pill)),
    linetype = "dashed"
  )
## ungroup: no grouping variables remain

3.3 Vignette 2.2

Ok… but what happens if we cannot randomize? What if we have observational data, such that…

df <- data.frame(income = runif(10000)) %>%
  # Simulate baseline health without treatment (Y(0)).
  # Here, health depends on a random component plus income (so income confounds health).
  mutate(
    health_no_pill = rnorm(10000) + income,
    # Define the treated potential outcome (Y(1)) with a constant +1 treatment effect
    health_w_pill = health_no_pill + 1
  ) %>%
  # Assign treatment non-randomly: only higher-income individuals receive the pill
  mutate(
    pill = income > 0.7,
    # Construct the observed outcome based on treatment assignment
    health_obs = ifelse(pill == 1, health_w_pill, health_no_pill)
  )

# Inspect the first 10 rows
head(df, 10)
##        income health_no_pill health_w_pill
## 1  0.40958217     0.41907633    1.41907633
## 2  0.99318164     1.05363382    2.05363382
## 3  0.49829806    -0.91587272    0.08412728
## 4  0.12191040    -0.69444718    0.30555282
## 5  0.86762782    -0.06252199    0.93747801
## 6  0.08248242     1.43009145    2.43009145
## 7  0.72646626     1.77140833    2.77140833
## 8  0.96196816    -0.55896438    0.44103562
## 9  0.32019844    -0.33216602    0.66783398
## 10 0.58158538    -1.30425532   -0.30425532
##     pill health_obs
## 1  FALSE  0.4190763
## 2   TRUE  2.0536338
## 3  FALSE -0.9158727
## 4  FALSE -0.6944472
## 5   TRUE  0.9374780
## 6  FALSE  1.4300915
## 7   TRUE  2.7714083
## 8   TRUE  0.4410356
## 9  FALSE -0.3321660
## 10 FALSE -1.3042553

Let’s see what happens now to the estimated mean average ‘effect’ (remember from the lecture that the effect is, in essence, the difference between those who receive the treatment and those who do not):

df %>%
  group_by(pill) %>%
  summarize(health = mean(health_obs))
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.346
## 2 TRUE   1.86

Oh no! That is more than the true effect of the pill, which we know is 1 because we created it. However, if we model this properly (this is an RDD!), then—remember from the lecture that the effect is, in essence, the difference between those who receive the treatment and those who do not:

df %>%
  # Keep observations close to the cutoff (income = 0.7),
  # mimicking the local comparison at the heart of a regression discontinuity design (RDD)
  filter(abs(income - 0.7) < 0.01) %>%
  # Compare mean observed outcomes just below vs. just above the cutoff
  group_by(pill) %>%
  summarize(health = mean(health_obs)) # BOOM!!
## filter: removed 9,790 rows (98%), 210 rows remaining
## summarize: now 2 rows and 2 columns, ungrouped
## # A tibble: 2 × 2
##   pill  health
##   <lgl>  <dbl>
## 1 FALSE  0.670
## 2 TRUE   1.64

3.4 Lecture Assignment

  1. Using simulated data, describe (in text) and show (with plots) a plausible situation where a treatment has an effect, but the observed outcome is null. Tip: Before committing to an example, remember that a difference-in-differences approach requires two periods (change over time) and two groups (treatment and control).

  2. Using a difference-in-differences approach, show how to estimate the true effect from your example.