
This vignette demonstrates how to perform causal inference with observational data using the modelbased package. While the examples below use the terms “treatment” and “control” groups, these labels are arbitrary and interchangeable. Causal inference can be applied to any scenario where observations are assigned to different conditions, regardless of whether a “treatment” in the medical sense is involved.

The lack of randomization

Randomized controlled trials (RCTs) are frequently employed to estimate the average treatment effect (ATE), which represents the average difference in outcomes between individuals in the treatment and control groups. In the context of RCTs, the ATE can be interpreted as a causal effect.

Estimating causal effects with observational data, where treatment and control group assignments are not randomized, presents a greater challenge due to the potential for confounding bias. In observational studies, individual characteristics associated with the outcome (e.g., age, gender, education, health status) are likely unequally distributed between the treatment and control groups. This imbalance violates a key assumption for causal effect estimation, an assumption typically satisfied by randomization in RCTs but often unmet in observational data.

Propensity scores and G-computation

Two primary methods exist for addressing the lack of randomization in observational data: propensity scores and g-computation. For the propensity score approach, this vignette focuses on inverse probability weighting (IPW), a widely used technique that turns estimated propensity scores into observation weights (Chatton and Rohrer 2024; Gabriel et al. 2024).

IPW assigns weights to individual observations to reflect their contribution to the outcome under the assumption of exchangeability between groups. When specific characteristics are over-represented in the treatment group, IPW assigns lower weights to these individuals, thereby adjusting for the imbalanced distribution of confounders between treatment and control groups.
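
As a small illustration of this weighting logic, using two hypothetical propensity scores (i.e., estimated probabilities of receiving the treatment): the more likely a treated individual was to receive the treatment anyway, the smaller the weight they get, and vice versa.

# two hypothetical propensity scores: one individual who was very
# likely to receive the treatment, one who was very unlikely
p_score <- c(0.8, 0.2)

# weights for treated individuals: 1 / p
1 / p_score
#> [1] 1.25 5.00

# weights for control individuals: 1 / (1 - p)
1 / (1 - p_score)
#> [1] 5.00 1.25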

G-computation, rooted in the traditional methods of stratification and standardization, addresses confounding in observational studies by partitioning the study population into strata, calculating stratum-specific outcomes, and then weighting these outcomes to represent a target population (e.g., the general population). In modelbased, g-computation can be implemented by setting estimate = "population". This approach directly models counterfactual scenarios by creating copies of each observation, effectively assigning each individual to both treatment and control conditions. The final estimate is then derived by averaging predictions across these counterfactual observations (Dickerman and Hernán 2020).
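
To make this counterfactual logic concrete, the following is a minimal sketch of g-computation done by hand (point estimate only, without uncertainty), assuming a fitted outcome model called fit and its model data dat with a binary factor treatment. Calling estimate_contrasts() with estimate = "population", as shown later in this vignette, performs this averaging and adds proper inference.

# minimal sketch of manual g-computation; "fit" and "dat" are
# placeholders for a fitted outcome model and its model data

# duplicate the data: one copy where everyone is set to treatment,
# one copy where everyone is set to control
d_treated <- d_control <- dat
d_treated$treatment <- factor("1", levels = levels(dat$treatment))
d_control$treatment <- factor("0", levels = levels(dat$treatment))

# predict the counterfactual outcome for every individual under both
# conditions, then average the individual differences (the ATE)
mean(
  predict(fit, newdata = d_treated, type = "response") -
    predict(fit, newdata = d_control, type = "response")
)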

Both propensity score and g-computation methods offer distinct advantages and disadvantages. IPW, for instance, can be directly incorporated as weights within regression models, facilitating a straightforward interpretation of the treatment coefficient as the average treatment effect (ATE). However, deriving the ATE becomes more complex in models incorporating interaction terms or other complexities. Conversely, modelbased g-computation readily estimates the ATE even within complex model specifications. A robust approach often involves combining IPW and g-computation to achieve “doubly robust” estimation of the ATE (Chatton and Rohrer 2024; Gabriel et al. 2024). Illustrative examples will follow.

Calculating inverse probability weights

First, we create some toy data in which the outcome in the treatment group is, on average, roughly 9 points higher than in the control group. The example data set has health-related quality of life (QoL) as the outcome, measured for 188 cancer patients at three time points (baseline and two follow-ups). The treatment variable is simulated and not part of the original data set. Since we have repeated measurements, we use the glmmTMB package to fit a linear mixed model to the data.

# example
library(modelbased)
library(datawizard)
library(parameters)
data(qol_cancer, package = "parameters")

# sort and group data by patient ID, then assign each patient either to
# the treatment or control condition, with higher educated patients having
# a higher chance of belonging to the treatment group
set.seed(1234)
d <- qol_cancer |>
  data_arrange("ID") |>
  data_group("ID") |>
  data_modify(treatment = rbinom(1, 1, ifelse(education == "high", 0.7, 0.4))) |>
  data_ungroup()

# create a treatment effect that increases over time,
# with larger improvements for higher educated patients
d$QoL <- d$QoL + rnorm(
  nrow(d),
  (d$treatment * d$time * 5) + ifelse(d$education == "high", 5, 0),
  sd = 2
)

# convert to factors
d <- to_factor(d, c("treatment", "time"))

We first look at the simple effect of treatment, which is about 9.3.

# simple pooled effect
m1 <- glmmTMB::glmmTMB(
  QoL ~ treatment + time + education + hospital + age + phq4 + (1 | ID),
  data = d
)

# we don't need the full summary table,
# only the coefficient for treatment
model_parameters(m1, keep = "^treatment")
#> # Fixed Effects
#> 
#> Parameter     | Coefficient |   SE |        95% CI |    z |      p
#> ------------------------------------------------------------------
#> treatment [1] |        9.27 | 1.96 | [5.43, 13.11] | 4.73 | < .001

Next, inverse probability weights are calculated. This involves fitting a regression model with the predictor of interest, treatment, as the outcome variable. This model estimates each observation’s probability of treatment assignment given the observed confounders (the propensity score). For binary predictors, such as the treatment variable in this analysis, a logistic regression model is used. Potential confounders serve as independent variables in this model, generating a predicted probability of belonging to the treatment group for each observation. Based on these probabilities, the IPW is calculated.

# logistic regression model
m_ipw <- glm(
  treatment ~ time + hospital + phq4 + education + age,
  data = d,
  family = binomial()
)

# add predictions, i.e. the predicted probability of belonging to
# the treatment group for each patient in the sample
d$predictions <- predict(m_ipw, newdata = d, type = "response")

# calculating the IPW
d$ipw <- ifelse(d$treatment == 1, 1 / d$predictions, 1 / (1 - d$predictions))
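
Before fitting the weighted model, it can be useful to inspect the distribution of the weights, since very large weights (resulting from predicted probabilities close to 0 or 1) can make the weighted estimates unstable. A quick summary of the weights is enough for this check (output not shown).

# quick diagnostic: check for extreme weights
summary(d$ipw)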

Next, we run the same model again, including the weights. We see that the treatment effect, after weighting, is about 9 points, i.e. the treatment on average results in a 9-point higher score for QoL.

m2 <- glmmTMB::glmmTMB(
  QoL ~ treatment + time + education + hospital + age + phq4 + (1 | ID),
  weights = ipw,
  data = d
)
model_parameters(m2, keep = "^treatment")
#> # Fixed Effects
#> 
#> Parameter     | Coefficient |   SE |        95% CI |    z |      p
#> ------------------------------------------------------------------
#> treatment [1] |        9.04 | 1.99 | [5.13, 12.95] | 4.53 | < .001

Given the simplicity of the model, g-computation offers no particular advantage for calculating contrasts between treatment levels: the results are equivalent to simply reading the treatment coefficient from the model above.

estimate_contrasts(m2, "treatment", estimate = "population")
#> Counterfactual Contrasts Analysis (G-computation)
#> 
#> Level1 | Level2 |    Difference (CI) |      p
#> ---------------------------------------------
#> 1      | 0      | 9.04 (5.13, 12.95) | <0.001
#> 
#> Variable predicted: QoL
#> Predictors contrasted: treatment
#> Predictors averaged: time, education, hospital (0.95), age (0.22), phq4 (-0.076), ID
#> p-values are uncorrected.
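
To illustrate this equivalence, the contrasts can also be computed without g-computation; for this simple model without interactions, the estimate is essentially identical to the treatment coefficient and the counterfactual contrast above (output not shown).

# for comparison: plain contrasts, no g-computation
estimate_contrasts(m2, "treatment")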

However, we have not yet properly modelled the longitudinal nature of our data. Since patients’ quality of life (QoL) was measured at three distinct time points, allowing for an examination of treatment effects over time, we need to include an interaction between treatment and time. Doing so reveals a substantially greater increase in QoL over time within the treatment group. In such more complex modeling scenarios, g-computation becomes particularly advantageous.

# interaction terms involved
m3 <- glmmTMB::glmmTMB(
  QoL ~ treatment * time + education + hospital + age + phq4 + (1 | ID),
  weights = ipw,
  data = d
)

# estimated marginal means, to show how treatment
# develops over time between treatment and control conditions
estimate_means(
  m3,
  c("time", "treatment"),
  estimate = "population"
) |> plot()

We now see that the coefficient table is no longer helpful, because the treatment effect cannot be isolated. If we want to know the ATE of treatment, we need to calculate contrasts for the levels of our treatment predictor.

# ATE no longer visible from coefficient table
model_parameters(m3, keep = "^treatment")
#> # Fixed Effects
#> 
#> Parameter                | Coefficient |   SE |         95% CI |    z |      p
#> ------------------------------------------------------------------------------
#> treatment [1]            |        4.11 | 2.22 | [-0.24,  8.46] | 1.85 | 0.064 
#> treatment [1] × time [2] |        5.05 | 1.64 | [ 1.84,  8.26] | 3.08 | 0.002 
#> treatment [1] × time [3] |        9.63 | 1.64 | [ 6.42, 12.84] | 5.88 | < .001

# contrasts of treatment levels, using g-computation
estimate_contrasts(m3, "treatment", estimate = "population")
#> Counterfactual Contrasts Analysis (G-computation)
#> 
#> Level1 | Level2 |    Difference (CI) |      p
#> ---------------------------------------------
#> 1      | 0      | 9.00 (5.07, 12.94) | <0.001
#> 
#> Variable predicted: QoL
#> Predictors contrasted: treatment
#> Predictors averaged: time, education, hospital (0.95), age (0.22), phq4 (-0.076), ID
#> p-values are uncorrected.

In even more complex scenarios, g-computation remains a valuable tool for generating accurate estimates. The following example compares “simple” marginal contrasts from an IPW model with contrasts from the same model using g-computation. The plain contrasts overestimate the treatment effect (about 11 points), while the doubly robust combination of IPW and g-computation recovers the simulated effect of about 9 points.

# more complex model
m4 <- glmmTMB::glmmTMB(
  QoL ~ treatment * time + treatment * education + hospital + age + phq4 + (1 | ID),
  weights = ipw,
  data = d
)

# complex model, no g-computation
estimate_contrasts(m4, "treatment")
#> Marginal Contrasts Analysis
#> 
#> Level1 | Level2 |     Difference (CI) |      p
#> ----------------------------------------------
#> 1      | 0      | 10.97 (6.70, 15.23) | <0.001
#> 
#> Variable predicted: QoL
#> Predictors contrasted: treatment
#> Predictors averaged: time, education, hospital (0.95), age (0.22), phq4 (-0.076), ID
#> p-values are uncorrected.

# complex model, with IPW *and* G-computation (double-robust) is accurate!
estimate_contrasts(m4, "treatment", estimate = "population")
#> Counterfactual Contrasts Analysis (G-computation)
#> 
#> Level1 | Level2 |    Difference (CI) |      p
#> ---------------------------------------------
#> 1      | 0      | 8.99 (5.10, 12.88) | <0.001
#> 
#> Variable predicted: QoL
#> Predictors contrasted: treatment
#> Predictors averaged: time, education, hospital (0.95), age (0.22), phq4 (-0.076), ID
#> p-values are uncorrected.

The last example completes the “causal inference” topic by showing how to calculate the average treatment effect on the treated (ATT) and the average treatment effect on the untreated (ATU).

The ATT measures how much the treatment changes the outcome for those who actually received it. It can help decide whether a program or treatment should be continued (or discontinued) for the people currently receiving it. Likewise, the ATU estimates how much the treatment would change the outcome if it were given to those who did not receive it. It helps to assess whether a program or treatment should be expanded to those who have not yet received it, i.e., the potential benefit of extending it to a wider population.

library(insight)

# we need the data used to fit the model - this may not be
# identical to the original data due to case-wise deletion
model_data <- get_data(m4)

# the ATT - apply g-computation only to the subset of the treated
estimate_contrasts(
  m4,
  "treatment",
  newdata = subset(model_data, treatment == 1),
  estimate = "population"
)
#> Counterfactual Contrasts Analysis (G-computation)
#> 
#> Level1 | Level2 |    Difference (CI) |      p
#> ---------------------------------------------
#> 1      | 0      | 9.79 (5.42, 14.15) | <0.001
#> 
#> Variable predicted: QoL
#> Predictors contrasted: treatment
#> Predictors averaged: time, education, hospital (0.95), age (0.22), phq4 (-0.076), ID
#> p-values are uncorrected.

# the ATU - apply g-computation only to the subset of the control
estimate_contrasts(
  m4,
  "treatment",
  newdata = subset(model_data, treatment == 0),
  estimate = "population"
)
#> Counterfactual Contrasts Analysis (G-computation)
#> 
#> Level1 | Level2 |    Difference (CI) |      p
#> ---------------------------------------------
#> 1      | 0      | 8.35 (4.34, 12.37) | <0.001
#> 
#> Variable predicted: QoL
#> Predictors contrasted: treatment
#> Predictors averaged: time, education, hospital (0.95), age (0.22), phq4 (-0.076), ID
#> p-values are uncorrected.

Conclusion

This vignette explored the estimation of average treatment effects (ATEs) from observational data, focusing on the application of inverse probability weighting (IPW) and g-computation. While both methods address confounding bias inherent in non-randomized studies, they offer distinct strengths. IPW provides a straightforward approach for incorporating weights into regression models, simplifying ATE interpretation in basic models. However, g-computation excels in more complex scenarios, such as those involving longitudinal data and interaction terms, where isolating treatment effects becomes challenging.

Our examples demonstrated the utility of both methods. Specifically, we showed that while IPW effectively estimates the ATE in a simple model, g-computation becomes essential when modeling the interaction between treatment and time. Furthermore, combining IPW with g-computation, as illustrated in our last examples, offers increased accuracy, highlighting the benefits of this “doubly robust” approach for estimating ATEs in complex models.

Combining IPW and g-computation does not introduce additional bias and thus has no downside. While IPW alone can sometimes be less accurate than g-computation, the most reliable estimates are achieved by using both methods together. We therefore recommend their combined use.

References

Chatton, Arthur, and Julia M. Rohrer. 2024. “The Causal Cookbook: Recipes for Propensity Scores, G-Computation, and Doubly Robust Standardization.” Advances in Methods and Practices in Psychological Science 7 (1): 25152459241236149. https://doi.org/10.1177/25152459241236149.
Dickerman, Barbra A., and Miguel A. Hernán. 2020. “Counterfactual Prediction Is Not Only for Causal Inference.” European Journal of Epidemiology 35 (7): 615–17. https://doi.org/10.1007/s10654-020-00659-8.
Gabriel, Erin E., Michael C. Sachs, Torben Martinussen, Ingeborg Waernbaum, Els Goetghebeur, Stijn Vansteelandt, and Arvid Sjölander. 2024. “Inverse Probability of Treatment Weighting with Generalized Linear Outcome Models for Doubly Robust Estimation.” Statistics in Medicine 43 (3): 534–47. https://doi.org/10.1002/sim.9969.