Application of causal forests to randomised controlled trial data to identify heterogeneous treatment effects: a case study | BMC Medical Research Methodology


Trial Summary

We used data from the VANISH trial, a 2 × 2 factorial trial that randomised 408 patients with septic shock being treated in ICUs in England, comparing early vasopressin vs. norepinephrine (standard care) and corticosteroids vs. placebo. The primary outcome used in the trial was the number of days alive and free from kidney failure in the 28 days after randomisation, a composite outcome of 28-day mortality and days free of kidney injury. This paper focuses on modelling the treatment effect for the binary outcome of 28-day mortality for the first-line treatments, early vasopressin versus norepinephrine. The trial did not detect a significant difference in 28-day mortality between arms (30.9% in the vasopressin group versus 27.5% in the norepinephrine group; risk difference 3.4%, 95% CI [−5.4% to 12.3%]). This original result depends on the underlying logistic regression model being correctly specified and on the assumption that the treatment effect is homogeneous. If this is not the case, the null result may have been due to between-participant differences: if the homogeneity assumption is violated, the confidence interval is consistent with some patients experiencing positive treatment effects and some negative. It is therefore of interest to explore the HTEs and identify the patients who may have benefitted from, or may have been harmed by, the treatment.

Data handling

The VANISH trial collected a variety of patient baseline variables, daily measurements, and outcomes at day 28. All baseline measurements with missingness < 30% were considered possible treatment effect modifiers. These include demographic information and components of ICU APACHE II and SOFA illness severity scores (physiological measurements such as arterial pressure and temperature) and are listed in the supplementary material. Categorical variables were one-hot encoded (each category is represented with a binary indicator variable), and continuous variables were centred and scaled to have a mean of 0 and a standard deviation of 1.
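As a minimal sketch of this preprocessing step, using pandas on a toy data frame (the variable names below are hypothetical stand-ins, not the actual VANISH covariates):

```python
import numpy as np
import pandas as pd

# Hypothetical baseline data: one categorical and one continuous covariate.
df = pd.DataFrame({
    "vasopressor_site": ["central", "peripheral", "central", "central"],
    "map_mmhg": [55.0, 62.0, 70.0, 58.0],  # e.g. mean arterial pressure
})

# One-hot encode categorical variables: one binary indicator per category.
encoded = pd.get_dummies(df, columns=["vasopressor_site"])

# Centre and scale continuous variables to mean 0 and standard deviation 1.
x = encoded["map_mmhg"].to_numpy()
encoded["map_mmhg"] = (x - x.mean()) / x.std()  # numpy std uses ddof=0
```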

Classical approach – univariable regression

The classical approach of conducting subgroup analysis through testing each covariate-treatment interaction in a separate logistic regression model was initially used. For each of the \(k=1,\dots,K\) potential covariate modifiers, where \(X_k\) denotes the \(k^{th}\) covariate of interest considered as an effect modifier, the following model was fit in each instance:

$$\mathrm{logit}\left(E\left[Y\mid W,X_{k}\right]\right)=\beta_{1,k}W+\beta_{2,k}X_{k}+\beta_{3,k}W\times X_{k},$$

where \(Y\) is the binary mortality outcome and \(W\) is the binary treatment indicator. In this situation, we are interested in the interaction coefficients \(\beta_{3,k}\) for all \(X_k\). This model assumes a linear relationship between each variable \(X_k\) and the outcome on the logit scale.

To account for the inflation of the type I error rate that accrues when multiple tests are performed, we conservatively applied the Bonferroni multiple testing correction; that is, the 0.05 significance threshold was divided by the number of tests carried out. Results of the univariable analysis are visualised using a volcano plot, which plots effect sizes on the x-axis and \(-\log_{10}(p)\)-values on the y-axis.

Machine learning approach – hierarchical lasso

As an intermediary step between the classical method and the most up-to-date causal machine learning methods, we employed a variable selection technique to select which baseline covariates had strong interactions with the treatment. Hierarchical lasso penalisation (based on the group lasso) [25] was applied to a logistic regression model including the treatment, all baseline characteristics and all interaction terms between the treatment and the baseline variables \(X_k\), \(k=1,\dots,K\):

$$\mathrm{logit}\left(E\left[Y\mid W,\mathbf{X}\right]\right)=\beta_{1}W+\sum_{k=1}^{K}\left\{\beta_{2,k}X_{k}+\beta_{3,k}W\times X_{k}\right\}.$$

The penalisation imposes a strong hierarchy such that

$$\beta_{3,k}\neq 0\;\Rightarrow\;\beta_{2,k}\neq 0\ \text{and}\ \beta_{1}\neq 0.$$

This meant that if a covariate-by-treatment interaction was kept in the model, the main-effect terms for both the covariate and the treatment were also retained. The penalty term was selected using 5-fold cross-validation. This method was implemented using the “glinternet” package in R [25]. When centred and scaled independent variables are used in a logistic regression, the magnitudes of the log odds ratios are interpretable as variable importance measures; we refer to the corresponding odds ratios as standardised odds ratios in the results section. In this work, we used these to understand the importance of the features retained by the hierarchical lasso.
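The hierarchical lasso itself is implemented in R's glinternet; as a rough Python analogue (which fits the same design of main effects plus treatment interactions but does not enforce the strong hierarchy), a plain L1-penalised logistic regression can illustrate the standardised odds-ratio reading of the coefficients. All data and penalty settings here are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, K = 600, 4
X = rng.standard_normal((n, K))              # standardised baseline covariates
W = rng.integers(0, 2, n)
eta = -0.4 + 0.5 * W + 0.8 * X[:, 0] + 0.9 * W * X[:, 0]
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Design: treatment, main effects, and all treatment-by-covariate interactions.
design = np.column_stack([W[:, None], X, W[:, None] * X])

# Plain lasso-penalised logistic regression -- unlike glinternet's hierarchical
# lasso, this does NOT force a main effect in whenever its interaction is kept.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(design, Y)

# With standardised inputs, the exponentiated coefficients are standardised
# odds ratios whose magnitudes can be ranked as importance measures.
std_odds_ratios = np.exp(fit.coef_.ravel())
```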

Variable selection models, such as the hierarchical lasso, do not typically generate uncertainty estimates for the coefficients in their selected models, and estimating uncertainty using the same data that informed model selection may lead to overly optimistic estimates. To demonstrate the uncertainty in estimates obtained through the hierarchical lasso, we employed a one-step selective inference procedure for the lasso proposed by Lee et al. [26]. However, we highlight that the confidence intervals presented for lasso estimates are not to be interpreted causally.

Causal machine learning

The primary estimand of interest for causal machine learning methods is the conditional average treatment effect (CATE), which can be written in terms of the potential outcomes of \(Y\) under the two possible treatments \(W\in\left\{0,1\right\}\) as

$$\tau\left(x\right)=E\left[Y\left(1\right)-Y\left(0\right)\mid X=x\right],$$

where \(X=x\) is the baseline covariate vector. Under consistency, no interference and unconfoundedness (which holds by design in this case, as we are in a randomised setting), the CATE is \(\tau\left(x\right)=E\left[Y\mid W=1,X=x\right]-E\left[Y\mid W=0,X=x\right].\)
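In a randomised setting this identity reduces CATE estimation to comparing arm-specific outcome means within covariate strata. A toy numpy illustration with a single binary covariate and a known true effect (all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.integers(0, 2, n)                # a single binary baseline covariate
w = rng.integers(0, 2, n)                # randomised 1:1 treatment assignment
# True CATE on the mean-outcome scale: tau(0) = 0.0, tau(1) = 0.2.
y = rng.binomial(1, 0.3 + 0.2 * w * x)

# Under randomisation, tau(x) = E[Y|W=1,X=x] - E[Y|W=0,X=x]:
def plug_in_cate(xval):
    return y[(w == 1) & (x == xval)].mean() - y[(w == 0) & (x == xval)].mean()

tau_hat_0, tau_hat_1 = plug_in_cate(0), plug_in_cate(1)
```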

Causal forest (CF)

The causal forest (CF) is a non-parametric CML method which estimates the CATE [19]. The causal forest begins by assuming that in a small neighbourhood \(N\left(x\right)\) (which will be defined by the splits) the CATE \(\tau\left(x\right)\) is constant. Assuming the partially linear model \(E\left[Y\mid W,X\right]=m\left(X\right)+\left(W-p\left(X\right)\right)\tau\left(x\right)\), we can estimate \(\tau\left(x\right)\) through the individuals \(i\) in the neighbourhood of \(x\), denoted \(\left\{i:X_{i}\in N\left(x\right)\right\}\):

$$\widehat{\tau}\left(x\right)=\frac{\sum_{\left\{i:X_{i}\in N\left(x\right)\right\}}\left\{\left(W_{i}-\widehat{p}\left(X_{i}\right)\right)\left(Y_{i}-\widehat{m}\left(X_{i}\right)\right)\right\}}{\sum_{\left\{i:X_{i}\in N\left(x\right)\right\}}\left\{W_{i}-\widehat{p}\left(X_{i}\right)\right\}^{2}},$$

where \(i\) indexes the individuals in the data. The propensity score \(p\left(X\right)=E\left[W\mid X\right]\) is obtained by regressing \(W\) on \(X\) non-parametrically via random forests, and the marginal mean outcome \(m\left(X\right)=E\left[Y\mid X\right]\) is obtained by regressing \(Y\) on \(X\), not including \(W\), via random forests. The estimate of \(m\left(X\right)\) can be re-written as:

$$\widehat{m}\left(x\right)=\sum_{i=1}^{n}\alpha_{i}\left(x\right)Y_{i},\qquad\text{where}\qquad\alpha_{i}\left(x\right)=\frac{1}{B}\sum_{b=1}^{B}\frac{\mathbf{1}\left(\left\{X_{i}\in L_{b}\left(x\right),\ i\in\mathcal{S}_{b}\right\}\right)}{\left|\left\{i:X_{i}\in L_{b}\left(x\right),\ i\in\mathcal{S}_{b}\right\}\right|},$$

where \(\alpha_{i}\left(x\right)\) are weights representing the frequency with which the \(i\)th observation falls into the same leaf as \(x\), \(B\) is the number of trees in the causal forest, \(L_{b}\left(x\right)\) denotes the leaf of tree \(b\) containing \(x\), and \(\mathcal{S}_{b}\) denotes the subsample of observations used to build tree \(b\).

Out-of-bag observations, those not used in the training of a given tree, are used to generate the “residuals” \(Y-\widehat{E}\left[Y\mid X\right]\) and \(W-\widehat{E}\left[W\mid X\right]\). The data are then split using “causal trees”, which, instead of choosing splits to minimise prediction error, greedily search over covariate thresholds to maximise the difference in the estimated ATE between the two leaves resulting from a split. Estimated treatment effects are therefore similar within a leaf (approximating homogeneous treatment effects within a leaf), while differing as much as possible between the resulting leaves (capturing heterogeneity of the treatment effects across patients with sufficiently different \(X\) values). This is formalised as the maximisation of the weighted squared difference in resulting leaf means, \(n_{L}n_{R}\left(\bar{y}^{L}-\bar{y}^{R}\right)^{2}\), where \(n\) is the number of observations in a leaf and \(L\) and \(R\) denote the left and right leaves from a split. This procedure is performed on many random subsamples of the data, forming a causal forest. Once all nuisance parameters have been estimated, the modified formula for the CATE becomes:

$$\widehat{\tau}\left(x\right)=\frac{\sum_{i=1}^{n}\alpha_{i}\left(x\right)\left\{\left(W_{i}-\widehat{p}\left(X_{i}\right)\right)\left(Y_{i}-\widehat{m}\left(X_{i}\right)\right)\right\}}{\sum_{i=1}^{n}\alpha_{i}\left(x\right)\left\{W_{i}-\widehat{p}\left(X_{i}\right)\right\}^{2}}.$$
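A toy numpy check of this weighted residual-on-residual estimator, taking the nuisance functions as known and the weights as uniform (simple stand-ins for the forest-derived \(\widehat{p}\), \(\widehat{m}\) and \(\alpha_i(x)\); a constant true effect is used so the whole sample is one "neighbourhood"):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000
X = rng.standard_normal(n)
W = rng.integers(0, 2, n).astype(float)
tau_true = 0.5                                  # constant CATE in this toy example
mu0 = np.sin(X)                                 # outcome mean under control
Y = mu0 + W * tau_true + 0.1 * rng.standard_normal(n)

p_hat = 0.5                                     # known propensity in a 1:1 trial
m_hat = mu0 + p_hat * tau_true                  # m(X) = E[Y|X], taken as known here
alpha = np.full(n, 1.0 / n)                     # uniform stand-in for forest weights

# Weighted residual-on-residual estimator from the displayed formula:
num = np.sum(alpha * (W - p_hat) * (Y - m_hat))
den = np.sum(alpha * (W - p_hat) ** 2)
tau_hat = num / den                             # should be close to tau_true
```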

Honesty splitting is implemented on each tree: the subsample drawn for a tree is partitioned, with one half used to grow the tree and the other half used to estimate the CATE. Work by Athey and Wager shows that this algorithm leads to consistent estimators of the CATE [19]. Once CATE estimates are obtained for all observations, we can use post-estimation methods to test for heterogeneity; the CF can include many variables in a single analysis without inflating the type I error. Testing for the presence of heterogeneity is done through an “omnibus test”, which uses the predicted CATEs to fit a linear predictor of the outcomes and thereby assess the performance of our estimates; the null hypothesis is that there is no HTE and the alternative hypothesis is that heterogeneity is present. Output from the omnibus test includes a differential forest prediction (DFP) coefficient and an associated p-value; a DFP close to 1 indicates that the CF adequately captures the underlying heterogeneity. In-built variable importance metrics are available, which use the frequency with which variables are chosen for splits and the depth of the tree-growing process at which they are used; see the formula in supplementary material A1. We can also observe how the CATE varies with each variable to visualise heterogeneity across baseline characteristics, for example using scatter plots of the CATE against individual covariates. If relevant, heatmaps can be used to explore how the CATE varies across two covariates by plotting predictions of the CATE at specified quantiles of the two covariates, holding all other variables at their medians.
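The best-linear-predictor regression underlying this kind of omnibus test can be sketched in numpy. Here the nuisance functions and CATE estimates are taken as known stand-ins (in practice they come from the forest), so both fitted coefficients, including the differential forest prediction, should land near 1:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
X = rng.standard_normal(n)
W = rng.integers(0, 2, n).astype(float)
tau = 0.3 + 0.4 * X                         # heterogeneous true CATE
mu0 = 0.2 * X
Y = mu0 + W * tau + 0.2 * rng.standard_normal(n)

p_hat = 0.5                                 # known propensity in a 1:1 trial
m_hat = mu0 + p_hat * tau                   # E[Y|X]; nuisances taken as known
tau_hat = tau                               # pretend the forest recovered the CATE

# Regress the outcome residual on the (centred) mean and differential
# forest-prediction terms; coefficients near 1 indicate good calibration.
tau_bar = tau_hat.mean()
A = (W - p_hat) * tau_bar                   # mean forest prediction term
B = (W - p_hat) * (tau_hat - tau_bar)       # differential forest prediction term
coef, *_ = np.linalg.lstsq(np.column_stack([A, B]), Y - m_hat, rcond=None)
mfp, dfp = coef
```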

CFs, the omnibus test and the associated variable importance were implemented using the R grf package [27].

Data-driven subgroups – an extension of the CF

In addition to individual-level CATEs, we sought to define data-driven subgroups of heterogeneous treatment effects to facilitate a more clinically actionable result from this model. To achieve this, we extracted the root (initial) split from each tree; root splits are used because the first split of a causal tree maximises the difference between the ATEs in the two halves of the data it creates. Because individual trees are grown on different subsampled datasets, we identified the most common root-split variable and took the mean of its split values as the data-driven threshold defining the “first-order” subgroups. Using this mean threshold, we split our trial population into two subgroups and estimated the group ATE (GATE) in each. The GATEs were computed using the function “average_treatment_effect” from the grf package. GATE estimation uses augmented inverse-propensity weighting (AIPW), which averages doubly robust scores obtained through the honesty approach [28]; see supplementary appendix A2 for the AIPW formula. We also calculated confidence intervals for the GATEs, but note that these may underestimate the uncertainty and should therefore be interpreted cautiously [29]; we hypothesise that this underestimation arises because the same data were used to estimate the GATEs as were used to fit the causal forest. This is a crude initial exploration of data-derived subgroups from this model.
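A minimal numpy sketch of AIPW-based GATE estimation on simulated data, with the nuisance functions taken as known (grf estimates them internally via the forest; the subgroup threshold and effect sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000
X = rng.standard_normal(n)
W = rng.integers(0, 2, n).astype(float)
tau = np.where(X > 0.5, 0.4, 0.0)           # subgroup effect above a threshold
mu0 = 0.1 * X
Y = mu0 + W * tau + 0.2 * rng.standard_normal(n)

# Doubly robust (AIPW) scores; propensity and outcome means taken as known.
p = 0.5
mu1 = mu0 + tau
scores = (mu1 - mu0
          + W * (Y - mu1) / p
          - (1 - W) * (Y - mu0) / (1 - p))

# GATE: average the doubly robust scores within each data-driven subgroup.
gate_above = scores[X > 0.5].mean()
gate_below = scores[X <= 0.5].mean()
```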

Implementation

Before implementing the above methods, we handled missing data in the potential effect modifiers. MissForest was used to impute the missing values [30]. Briefly, for each variable to be imputed, this approach fits a random forest on the observed part using the values available in all other variables, predicts the missing part to impute the variable at hand, and then iterates; it has a built-in stopping rule based on out-of-bag errors. Alternative missing-data approaches (single mean imputation, complete-case analysis and inverse probability weighting (IPW)) were carried out as sensitivity analyses. For the complete-case analysis, any participant with at least one entry missing (after removal of variables with > 30% missingness) was removed from the analysis. IPW weights were calculated using a logistic regression for the probability of each participant being a complete case, with the inverses of these probabilities used as weights in the analysis. It was not possible to incorporate these weights for the hierarchical lasso.
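missForest itself is an R package; scikit-learn's IterativeImputer with a random-forest estimator gives a rough Python analogue of the same regress-predict-iterate scheme (the data and missingness pattern below are simulated for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.standard_normal(200)  # correlated column
mask = rng.random(X.shape) < 0.1                          # ~10% missing at random
X_missing = X.copy()
X_missing[mask] = np.nan

# Iterative random-forest imputation in the spirit of missForest: each variable
# with missing values is regressed on the others, predicted, and iterated.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```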

Next, we removed highly correlated variables to attenuate issues with lasso regression and forest models, where two highly correlated variables are equally likely to be selected (for either inclusion in the final model or creation of splits). Where variables had a Pearson correlation coefficient > 0.7, the variable carrying the least information (higher degree of missingness, or detailing a derived score instead of a raw value) was excluded from the set of effect modifiers.
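A small pandas/numpy sketch of this correlation filter (the column names are hypothetical, and the rule for choosing which member of a correlated pair to drop is a simplification of the missingness/score criterion described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 300
sofa_raw = rng.standard_normal(n)
df = pd.DataFrame({
    "sofa_raw": sofa_raw,
    "sofa_score": sofa_raw + 0.1 * rng.standard_normal(n),  # near-duplicate
    "age": rng.standard_normal(n),
})

# Drop one variable from each pair with |Pearson r| > 0.7; here the later
# column of a pair is treated as the less informative one (a simplification).
corr = df.corr().abs()
cols = list(df.columns)
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.7:
            to_drop.add(cols[j])
df_filtered = df.drop(columns=sorted(to_drop))
```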

We also repeated the analysis without the components of the APACHE II and SOFA scores, including only the summary scores. This comparison was done in keeping with a typical approach, which assesses severity through these summary scores.

The CF [27] has several hyperparameters with defaults that can be set or tuned. Hyperparameters were set manually; full details can be found in supplementary appendix A3.

Simulation Study

Alongside the case study analysis, we conducted a targeted simulation study to investigate the performance of our extension of the CF, which identifies subgroup thresholds using the modal initial (root) splits. The aim was to determine how often this data-driven method would identify a true subgroup effect for a continuous covariate when an underlying binary threshold value existed.

To do so, using a logistic model for a binary outcome, we generated 1000 trial datasets each containing 1000 participants with a binary treatment indicator W (generated from a Bernoulli distribution), a normally distributed continuous prognostic variable X1, and a normally distributed continuous treatment effect modifier X2 which defined true subgroup membership via a threshold based on the interaction observed in the VANISH dataset. That is, the true subgroup split (i.e. the difference in treatment response) was generated using a threshold of 4.68 for X2. Performance was measured by the percentage of simulations in which the model correctly identified X2 as the treatment effect modifier. The mean and 95% confidence interval of the identified thresholds are reported.
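The data-generating mechanism for a single simulated trial can be sketched as follows; the coefficient values (other than the 4.68 threshold stated above) are illustrative placeholders, not the parameters actually derived from VANISH:

```python
import numpy as np

def simulate_trial(n=1000, threshold=4.68, seed=0):
    """One simulated trial: binary W, prognostic X1, and a modifier X2 whose
    effect on treatment response switches at `threshold` (illustrative
    re-creation of the described set-up, not the exact study parameters)."""
    rng = np.random.default_rng(seed)
    W = rng.binomial(1, 0.5, n)              # 1:1 randomised treatment
    X1 = rng.normal(0.0, 1.0, n)             # prognostic covariate
    X2 = rng.normal(4.68, 1.0, n)            # modifier centred near the threshold
    subgroup = (X2 > threshold).astype(float)
    # Logistic outcome model with a subgroup-by-treatment interaction.
    eta = -1.0 + 0.3 * X1 + 0.2 * W + 1.0 * W * subgroup
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return W, X1, X2, Y

W, X1, X2, Y = simulate_trial()
```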

We also compared the inference obtained from the traditional analysis approach and the hierarchical lasso, looking at the treatment interaction with the continuous covariate X2, as well as the interaction with a binary subgroup covariate that dichotomised the continuous covariate at an alternative, clinically hypothesised value of 5.4 that differed from the true subgroup threshold. This mimics what could be observed in practice, where these two methods can only use known information to categorise a continuous covariate. Full details following the ADEMP structure for planning simulations are in the Appendix [31].

In summary, we compared the subgroup effects identified for the continuous covariate through traditional univariable interaction tests, the hierarchical lasso, and the root splits of the CF, in a setting where a true subgroup effect based on an underlying threshold value existed for that covariate. The true threshold value for the subgroup effect was set to differ from the alternatively hypothesised, clinically known value used to dichotomise the covariate.
