Multi-metric comparison of machine learning imputation methods with application to breast cancer survival
Motivating example: breast cancer case study
To apply our most effective method to a real-world situation, we used data from a retrospective cancer survival study of women with breast cancer conducted in Morocco, involving 711 incident cancers diagnosed in 2009 and followed until December 2014 [19]. That study aimed to identify prognostic variables, including epidemiological, clinical, pathological, biomarker expression, and treatment characteristics. Additionally, other risk factors such as oral contraceptive use, a family history of breast cancer, and obesity were evaluated. The outcome of interest was event-free survival, calculated from the date of surgery or initiation of chemotherapy to the earliest date of either locoregional recurrence or distant metastasis. The baseline covariates included in the cancer study's dataset were: age, body mass index (BMI), radiotherapy, mammographic size, Scarff-Bloom-Richardson (SBR) grade, nulliparity, lymph nodes (N0, N1, N2, N3), oral contraception, PgR, vascular invasion, trastuzumab, hormone therapy, HER2, and ER.

Age refers to the patient's age in years. BMI is a measure of body fat based on height and weight. Radiotherapy indicates whether the patient received radiation treatment. Mammographic size measures the size of the tumor in centimeters. The SBR grade categorizes tumor aggressiveness into three grades: I, II, and III. Nulliparity refers to whether a woman has never given birth. Lymph nodes are classified into four stages (N0, N1, N2, N3) based on the extent of cancer spread, following the TNM classification system. Oral contraception indicates the use of birth control pills. PgR (Progesterone Receptor) status indicates responsiveness to hormone therapy. Vascular invasion refers to the presence of cancer cells in blood vessels. Trastuzumab is a targeted therapy drug. Hormone therapy involves treatments that block hormones to slow or stop cancer growth. HER2 (Human Epidermal Growth Factor Receptor 2) identifies a protein that can promote cancer growth. ER (Estrogen Receptor) status helps determine the cancer's growth sensitivity to estrogen, guiding hormone therapy choices.

Table 1 presents the distribution of the variables in the dataset, along with the percentage of missing values for each variable. The proportion of missing values varied between covariates, ranging from 0.6 to 43%. Missing values were observed in 607 women, resulting in an almost 85% reduction in sample size under CCA, leaving a study sample of only 105 cases. This example highlights the need to handle missing data appropriately in order to optimally analyse the study data, while both minimizing the risk of loss of efficiency and reducing potential bias.
Machine learning imputation algorithms
Using various performance metrics, our analysis compared the performance of eight different ML-based imputation methods, applied as SI or MI (Table 2).
A comprehensive presentation follows for each of the eight algorithms investigated. We will explore SI strategies for KNN, missMDA, CART, missForest, missRanger, and missCforest, and MI strategies for miceCART and miceRandomForest.
K-nearest neighbours (KNN)
The KNN imputation method, which is similar to the hot-deck method, uses donor observations. The value imputed is an aggregation of the values of the \(\:k\) closest neighbors. The method of aggregation depends on the type of variable. For continuous variables, the default aggregation is the median, while for categorical variables it is the most frequent category among the \(\:k\) values.
This method computes a distance to determine the nearest neighbours using a version of Gower’s distance that can handle different types of variables, specifically binary, categorical, ordered, continuous, and semi-continuous variables [20]. The distance between two observations is a weighted average of the contributions of each variable, with the weights reflecting the importance of each variable.
For continuous variables, KNN calculates the absolute distance between two observations and then divides it by the total range of that variable. The same approach is used for ordinal variables, after converting them to integer variables. For nominal and binary variables, a binary distance of 0/1 is used [21]. KNN imputation preserves the inherent structure and relationships within the data during the imputation process. As a non-parametric method, it refrains from making assumptions about data distribution, and consequently offers a non-parametric solution ideal for datasets with unknown or non-normal distributions. However, its effectiveness is closely tied to the critical choice of \(\:k\). Additionally, its reliance on a distance metric that accommodates diverse variables introduces sensitivity to this choice, necessitating careful weighting.
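The study used an R implementation of Gower-distance KNN imputation; as a rough illustration only, the sketch below re-creates the core idea in Python with numpy/pandas (Gower distance over observed variables, median aggregation for numeric donors, mode for categorical donors). The column lists and the value of k are hypothetical, and complete cases are assumed to exist to serve as donors.

```python
import numpy as np
import pandas as pd

def gower_knn_impute(df, num_cols, cat_cols, k=5):
    """Minimal KNN imputation with Gower distance (median/mode aggregation)."""
    out = df.copy()
    donors = df.dropna()                      # complete cases act as donors
    ranges = (donors[num_cols].max() - donors[num_cols].min()).replace(0, 1.0)

    for i in df.index[df.isna().any(axis=1)]:
        row = df.loc[i]
        # per-variable contributions; variables missing in the recipient are skipped
        d_num = donors[num_cols].sub(row[num_cols]).abs() / ranges
        d_cat = donors[cat_cols].ne(row[cat_cols]).astype(float)
        miss_cat = [c for c in cat_cols if pd.isna(row[c])]
        d_cat[miss_cat] = np.nan
        dist = pd.concat([d_num, d_cat], axis=1).mean(axis=1, skipna=True)
        nearest = dist.nsmallest(k).index     # k closest donors
        for col in df.columns[row.isna()]:
            vals = donors.loc[nearest, col]
            out.at[i, col] = vals.median() if col in num_cols else vals.mode().iloc[0]
    return out

# Hypothetical usage: df_imp = gower_knn_impute(df, ["age", "bmi"], ["er", "her2"], k=5)
```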
Classification and regression trees (CART)
CART, often referred to as decision trees, is a versatile class of ML algorithms. The core distinction between classification and regression trees lies in the criteria used for data splitting and tree pruning; a more detailed discussion on these aspects can be found in [22, 23]. A key feature of the CART algorithm is its inherent ability to preserve interactions between variables in the dataset.
CART identifies predictors and cut-off points within these predictors that can be used to divide the sample. These partitions split the sample into sub-samples of greater homogeneity. The splitting procedure is repeated on the sub-samples, producing a binary tree structure. The target variable in these models can either be discrete (classification trees) or continuous (regression trees).
When employing CART for imputation of missing data, the algorithm treats the variable with missing values as the dependent variable (target), and the remaining variables as predictors. The procedure involves building a decision tree where the splits are based on the predictors, the aim being to predict the missing values of the target variable. For continuous variables, regression trees are employed, predicting a value that minimizes the variance within nodes. For categorical variables, classification trees are used, where the prediction corresponds to the most frequent class within a node.
Decision trees possess properties that make them particularly appealing for imputation tasks [24]. They are robust against outliers, can handle multicollinearity and skewed distributions, and are flexible enough to accommodate interactions and non-linear relationships. Moreover, many aspects of model fitting have been automated, meaning minimal tuning by the imputer [10].
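As a minimal sketch of this single-imputation idea, the snippet below fits a scikit-learn decision tree on the rows where the target variable is observed and predicts its missing values; it assumes the predictors have already been made complete (for example by a simple mean/mode fill), and the function and column names are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def cart_impute_column(df, target, predictors, is_categorical):
    """Impute one column by a decision tree fitted on rows where it is observed."""
    X = pd.get_dummies(df[predictors])   # simple one-hot encoding of predictors
    obs = df[target].notna()
    if obs.all():
        return df                        # nothing to impute
    model = DecisionTreeClassifier() if is_categorical else DecisionTreeRegressor()
    model.fit(X[obs], df.loc[obs, target])   # regression or classification tree
    df = df.copy()
    df.loc[~obs, target] = model.predict(X[~obs])
    return df
```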
Imputation with multivariate data analysis (missMDA)
missMDA is based on the Factor Analysis of Mixed Data (FAMD) method. It handles missing values in mixed data types, including both continuous and categorical variables. This method considers the similarities between individuals and the associations between variables. It performs a Principal Component Analysis (PCA) for continuous variables and a Multiple Correspondence Analysis (MCA) for categorical variables.
The imputation process begins with initial estimates for missing values, using the mean for continuous variables and modes for categorical variables. For mixed data, the Factor Analysis of Mixed Data (FAMD) method, which integrates the principles of PCA and MCA, is utilized to handle both data types cohesively. The algorithm iteratively updates these estimates as follows: in each cycle, FAMD predicts missing values based on observed data; these predictions replace previous estimates, and a new model incorporating these updates is then fitted. The process continues until a convergence criterion is met, typically when changes in imputed values between iterations fall below a set threshold. This ensures that the final imputed values are both statistically robust and contextually aligned with the dataset’s structure [25].
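FAMD itself is not available in scikit-learn, but the iterative principle described above can be sketched for continuous variables only: initialize with column means, fit a low-rank PCA model, replace the missing cells with the reconstruction, and repeat until the imputed values stabilise. This is an analogue of the idea, not the missMDA algorithm, and the number of components and tolerance are assumed values.

```python
import numpy as np
from sklearn.decomposition import PCA

def iterative_pca_impute(X, n_components=2, tol=1e-4, max_iter=100):
    """Continuous-only analogue of iterative PCA imputation."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)   # initial mean imputation
    for _ in range(max_iter):
        pca = PCA(n_components=n_components)
        recon = pca.inverse_transform(pca.fit_transform(filled))
        new = np.where(mask, recon, X)                   # update only missing cells
        if np.max(np.abs(new - filled)) < tol:           # convergence criterion
            break
        filled = new
    return filled
```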
Nonparametric imputation using RandomForest (missForest)
The missForest algorithm, a non-parametric imputation technique, utilizes a Random Forest model to iteratively predict missing data values. Initially, it estimates the missing values, typically using the mean or mode, and then employs a Random Forest model to impute missing values for each variable based on observed values from the other variables. The process is iterative, with predictions being updated in each cycle. A key component of this process is the utilization of Out-Of-Bag (OOB) error estimates to assess the accuracy of imputations after each iteration. The algorithm replaces missing values with new predictions and refits the Random Forest models to these updated data, continuing until the OOB error suggests that further iterations would not result in more accurate imputations. This method is adept at handling mixed-type data and provides an estimate of the imputation error, thereby offering an indication of the reliability of the imputed values.
Random Forests differ from the CART methodology by generating multiple trees instead of just one. By averaging numerous trees, the variance of unstable trees is significantly reduced, leading to a more reliable and robust model [23]. The introduction of variability in individual trees yields a more resilient solution, which subsequently enhances the method’s accuracy. This variability can be generated through different processes, including bootstrapping and random input selection [22].
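A commonly used Python approximation of this scheme, for numeric data only, is scikit-learn's IterativeImputer with a random forest as the per-variable estimator; unlike missForest it stops on a tolerance rather than the OOB error, and categorical variables would first need encoding. `X_with_nan` is a hypothetical 2-D numeric array containing np.nan entries.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Each variable with missing values is modelled on the others by a random forest,
# cycling over variables until the imputations stabilise.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imp = imputer.fit_transform(X_with_nan)  # X_with_nan: numeric array with np.nan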
Alternative implementation of missForest algorithm using predictive mean matching (missRanger)
An alternative to missForest is the implementation of missRanger, which incorporates in the imputation process the Predictive Mean Matching (PMM) option [26]. Missing values for a variable are imputed using predictions made by a Random Forest, which employs all remaining variables as covariates. The algorithm performs repeated iterations across all variables. This process continues until there is no further improvement observed in the average OOB prediction error of the models; this serves as the stopping criterion for the iterations. An option within this algorithm is the use of the PMM method [27]. For each missing value, PMM creates a donor pool, which consists of complete cases that have predicted values of the outcome closest to the predicted value of the missing entry. From this donor pool, PMM then randomly selects one donor case, and the actual observed value of this selected case is used to replace the missing value. This way, the PMM aims to restore the variance of resulting conditional distributions to a realistic level and preserves the original distribution of the data.
The missRanger imputation method provides a robust alternative to the widely-used missForest algorithm. In particular, the integrated PMM option ensures the credibility of imputed values [27].
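The PMM step described above can be sketched briefly. This is a minimal illustration rather than the missRanger implementation: `pred_missing` and `pred_observed` would be predictions from a model (for example a random forest) fitted on the observed part of the variable, `y_observed` the corresponding observed values, and the donor-pool size is an assumed default.

```python
import numpy as np

def pmm_draw(pred_missing, pred_observed, y_observed, n_donors=5, rng=None):
    """Predictive mean matching: for each missing case, form a donor pool of the
    n_donors observed cases with the closest predictions, then draw one observed
    value at random from that pool."""
    rng = np.random.default_rng(rng)
    pred_observed = np.asarray(pred_observed)
    y_observed = np.asarray(y_observed)
    imputed = np.empty(len(pred_missing))
    for i, p in enumerate(pred_missing):
        nearest = np.argsort(np.abs(pred_observed - p))[:n_donors]  # donor pool
        imputed[i] = y_observed[rng.choice(nearest)]                # observed donor value
    return imputed
```

Because the imputed values are actual observed values, the conditional distributions keep a realistic spread, which is the motivation given above.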
Imputation based on ensemble conditional trees (missCforest)
The missCforest algorithm can be used for the imputation of numerical, categorical, and mixed-type data [28]. Missing values are imputed through ensemble prediction using Conditional Inference Trees (Ctree) as base learners [29]. Ctree is a non-parametric class of regression and classification trees that combines recursive partitioning with conditional inference theory [30]. missCforest redefines the imputation problem as a prediction problem using a single imputation approach. The missing values are predicted iteratively based on the set of complete cases, which is updated at each iteration. There is no predefined stopping criterion, and the imputation procedure stops when all missing data have been imputed. Since the recursive partitioning of Conditional Trees is based on numerous test procedures, this algorithm is resistant to outliers and pays special attention to the statistical relationship between covariates (i.e., variables used for imputation) and the outcome (i.e., variable to be imputed).
Multiple imputation based on CART algorithm (miceCART)
The MICE (Multivariate Imputation by Chained Equations) algorithm is an iterative method for handling missing data. It operates in a variable-by-variable manner, using the observed data to generate initial random imputations for the missing data [31].
miceCART is a variation of the CART algorithm, designed to work within the framework of MI [24]. Initial missing values are estimated by drawing a random distribution from the observed values of each associated variable. A tree is then fitted to the first variable with at least one missing value, using the remaining variables as predictors.
Only individuals with observed values for the outcome are considered. This produces a tree with multiple leaves, each containing a subset of the data. An individual with a missing value for the outcome is placed inside one of these leaves. A random value from this leaf’s subset is then selected and used for imputation. This procedure is carried out for each variable with missing data and is ultimately repeated multiple times, resulting in multiple imputed datasets.
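The distinctive leaf-donor step can be sketched with scikit-learn, whose trees expose the leaf index of each case through `.apply()`. This is an illustration under assumptions (numpy arrays, a regression tree, an arbitrary minimum leaf size), not the mice implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_leaf_draw(X_obs, y_obs, X_mis, rng=None):
    """Fit a tree on observed cases, then impute each missing case by drawing a
    random observed value from the leaf in which that case lands."""
    rng = np.random.default_rng(rng)
    y_obs = np.asarray(y_obs)
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_obs, y_obs)
    leaf_obs = tree.apply(X_obs)          # leaf index of each observed case
    leaf_mis = tree.apply(X_mis)          # leaf index of each case to impute
    draws = np.empty(len(X_mis))
    for i, leaf in enumerate(leaf_mis):
        pool = y_obs[leaf_obs == leaf]    # observed values sharing that leaf
        draws[i] = rng.choice(pool)       # random donor draw
    return draws
```

Repeating this draw over every incomplete variable, and repeating the whole cycle several times with different seeds, yields the multiple imputed datasets required by the MI framework.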
Multiple imputation based on RandomForest algorithm (miceRF)
MICE RandomForest uses the Random-Forest algorithm to predict missing values based on observed data, updating imputations until convergence is reached. By combining Random-Forest with MI, a degree of uncertainty can be introduced into the imputation model, which makes it more suitable for generating parameter estimates with desirable characteristics [24].
The miceRF approach can yield more accurate and reliable imputations when the data exhibit complex patterns that standard regression models (used in traditional MICE) might fail to capture [32].
However, as with any imputation method, the resulting imputations should be validated through methods such as sensitivity analyses or comparisons with CCA to ensure the robustness of the imputation procedure.
Simulation study
In this simulation study, we aimed to evaluate the performance of different ML methods for missing data imputation using different types of performance metrics.
Our simulation design was tailored to incorporate mixed-type covariates (i.e., both continuous and categorical), include non-linear interdependencies between variables, and address the key issue of missing data. The approach used for this comprehensive assessment was as follows:
(1) Generate survival data for all cases.
(2) Estimate the Cox Proportional Hazards (PH) model using complete data.
(3) Introduce missing values into covariates.
(4) Impute missing data using different methods.
(5) Re-estimate the Cox PH model using imputed data.
(6) Compare the performance of the imputation methods using several performance metrics.
Data generation
We generated a total of 500 all-case datasets, each containing 700 observations with no missing values. Each dataset included three covariates, \(X_1\), \(X_2\), and \(X_3\), in addition to event status and observed survival times, which were simulated as follows:
(i) \(X_1\): a continuous covariate, drawn from a standard normal distribution \((\text{mean}=0,\ \text{sd}=1)\);
(ii) \(X_2\): a continuous covariate non-linearly dependent on \(X_1\), distributed as \(X_2 = X_1^2 + X_1 + U\), where \(U \sim \text{Uniform}(0,1)\);
(iii) \(X_3\): a binary covariate, generated from a binomial distribution with \(p = 0.5\).
To generate individual survival times, we employed the cumulative hazard inversion method, as described by [33]. The method begins with the specification of log-hazard ratios for covariates \(X_1\), \(X_2\), and \(X_3\), which were set at \(0.1\), \(0.3\), and \(0.6\), respectively.
This technique is based on inverting the survival function, formulated as \(S_i^{-1}(u) = H_0^{-1}\left(-\log(u)\exp(-X_i\beta)\right)\), where \(S_i^{-1}(u)\) denotes the inverted survival function for the \(i\)th individual, \(H_0^{-1}(\cdot)\) represents the inverted cumulative baseline hazard function, \(X_i\) is the vector of covariates for the individual, and \(\beta\) comprises the corresponding effect parameters on survival.
To generate individual simulated event times \(T_i\), we applied the formula \(T_i = S_i^{-1}(U_i)\), where \(U_i\) is a random variable drawn from a uniform \(U(0,1)\) distribution.
For our analysis, we adopted a generalized Weibull distribution to model the baseline hazard, chosen for its clinically plausible risk shape which closely follows the pattern observed for the risk of certain cancers. This parametric framework is particularly suitable for representing the aggressive nature of cancer risks in the early observation stages, with a notable decrease in risk following treatment. The generalized Weibull distribution is therefore an ideal fit for scenarios that require a realistic depiction of baseline hazards, as it effectively captures the dynamic risk pattern over time, a trend consistent with the observations reported in [34].
The individual censoring times \(C_i\) were drawn from an exponential distribution, whose parameter was calibrated to obtain a censoring rate of approximately 30%. The individual observed survival times were then determined as the minimum of the uncensored and censoring times, \(T_i = \min(T_i^{*}, C_i)\), with the event status adjusted accordingly. We applied administrative censoring at seven years.
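The sketch below illustrates this generation scheme in Python for the covariates and survival times described above. It uses a standard Weibull baseline hazard \(H_0(t) = (t/\text{scale})^{\text{shape}}\) as a simplification of the generalized Weibull used in the study; the shape, scale, and censoring-rate parameters shown are assumptions, not the study's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 700
beta = np.array([0.1, 0.3, 0.6])                 # log-hazard ratios from the text

# Covariates as described above
x1 = rng.normal(0.0, 1.0, n)
x2 = x1**2 + x1 + rng.uniform(0.0, 1.0, n)
x3 = rng.binomial(1, 0.5, n)
lin_pred = np.column_stack([x1, x2, x3]) @ beta

# Cumulative-hazard inversion: T = H0^{-1}(-log(U) * exp(-X beta)),
# with H0(t) = (t / scale)^shape  (shape and scale assumed here)
shape, scale = 1.2, 5.0
u = rng.uniform(size=n)
t_event = scale * (-np.log(u) * np.exp(-lin_pred)) ** (1.0 / shape)

# Exponential censoring (scale assumed; in the study it was calibrated to ~30%
# censoring) plus administrative censoring at seven years
c = np.minimum(rng.exponential(scale=12.0, size=n), 7.0)
time = np.minimum(t_event, c)
event = (t_event <= c).astype(int)
```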
Missing values
All complete datasets were subjected to the generation of missingness in variables \(\:X_2\) and \(\:X_3\). Specifically, we introduced missing values under the MAR mechanism, which resulted in 30% of the observations in the datasets exhibiting missing values. The probability of missingness in variables \(\:X_2\) and \(\:X_3\) was determined by the linear predictor of a logistic model, which comprised the fully observed \(\:X_1\), the event status indicator, and the observed time [9].
$$lp_i = 0.1 \times X_{1i} + 0.1 \times \text{Event}_i + 0.1 \times T_i$$
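Continuing the sketch above, MAR missingness in \(X_2\) and \(X_3\) can be imposed by passing this linear predictor through a logistic model. The intercept is not given in the text, so it is calibrated here (an assumption) so that about 30% of observations become incomplete.

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import brentq

# Linear predictor of the missingness model (formula above)
lp = 0.1 * x1 + 0.1 * event + 0.1 * time

# Calibrate an intercept so that the expected proportion of incomplete rows is ~30%
alpha = brentq(lambda a: expit(a + lp).mean() - 0.30, -20.0, 20.0)
miss = rng.uniform(size=n) < expit(alpha + lp)

x2_mis = np.where(miss, np.nan, x2)
x3_mis = np.where(miss, np.nan, x3.astype(float))
```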
Imputation models
All imputation models included the fully observed covariate \(\:X_1\), event status, survival times, and marginal Nelson-Aalen cumulative hazard, in accordance with the approach detailed in [35].
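A sketch of assembling these imputation-model covariates, assuming lifelines' NelsonAalenFitter and its cumulative_hazard_at_times helper for the marginal cumulative hazard (the variable names continue the simulation sketch above):

```python
import pandas as pd
from lifelines import NelsonAalenFitter

# Marginal Nelson-Aalen cumulative hazard evaluated at each observed time
naf = NelsonAalenFitter().fit(durations=time, event_observed=event)
na_chaz = naf.cumulative_hazard_at_times(time).values

imp_data = pd.DataFrame({
    "x1": x1, "event": event, "time": time, "na_chaz": na_chaz,
    "x2": x2_mis, "x3": x3_mis,      # variables with missing values to be imputed
})
```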
Estimation models
Irrespective of the method used for imputation, the substantive model, which is the model used in the estimation step (post-imputation), was a multivariate Cox Proportional Hazards model [36]. For SI methods, a single model was estimated per dataset; for MI methods, we created ten imputed datasets (\(m = 10\)) and combined the log-hazard ratios from the Cox models using Rubin's rules [1].
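A minimal sketch of this estimation and pooling step, assuming lifelines' CoxPHFitter and a hypothetical list `imputed_datasets` of m imputed DataFrames containing the covariates plus "time" and "event" columns:

```python
import numpy as np
from lifelines import CoxPHFitter

def pool_rubin(estimates, variances):
    """Rubin's rules: pooled estimate; total variance = within + (1 + 1/m) * between."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = estimates.shape[0]
    qbar = estimates.mean(axis=0)              # pooled coefficients
    ubar = variances.mean(axis=0)              # within-imputation variance
    b = estimates.var(axis=0, ddof=1)          # between-imputation variance
    return qbar, np.sqrt(ubar + (1 + 1 / m) * b)

coefs, variances = [], []
for df_imp in imputed_datasets:                # hypothetical list of m imputed datasets
    cph = CoxPHFitter().fit(df_imp, duration_col="time", event_col="event")
    coefs.append(cph.params_.values)
    variances.append(cph.standard_errors_.values ** 2)

pooled_beta, pooled_se = pool_rubin(coefs, variances)
```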
Performance metrics
To compare the performance of the imputation methods, we considered three distinct types of performance metrics, each based on different criteria. We describe the metrics of each type in the following sections.
Substantive model-free performance metrics
This set of metrics assesses the performance of imputation methods independently of any estimation model. They provide a direct measure of the quality of the imputed data.
Gower’s distance
Gower’s Distance serves as a measure to quantify the dissimilarity between datasets, as outlined by [20]. Specifically, it is adept at handling datasets composed of mixed variable types (i.e., including both categorical and continuous variables). In our study, Gower’s Distance was employed to evaluate the discrepancy between two distinct datasets: one with complete cases and the other with imputed data. The calculation of Gower’s Distance, denoted as \(\:D\), between any two data points \(\:i\) and \(\:j\) is given as:
$$D(i,j) = \frac{1}{p}\sum_{k=1}^{p} \delta_{ijk} \cdot d_{ijk}$$
Where \(p\) is the total number of variables within the dataset; \(\delta_{ijk}\) is a binary indicator that is assigned a value of 1 when both data points \(i\) and \(j\) possess a non-missing value for variable \(k\), and 0 when either data point has a missing value for that variable; finally, \(d_{ijk}\) represents the normalized distance between data points \(i\) and \(j\) for variable \(k\).
For continuous variables, \(d_{ijk}\) is often calculated as the absolute difference between the values for \(i\) and \(j\), normalized by the range of the variable across all data points. For categorical variables, \(d_{ijk}\) is typically 0 if \(i\) and \(j\) share the same category for variable \(k\), and 1 otherwise.
To obtain the overall Gower’s distance for the dataset, we calculate \(\:D\left(i,j\right)\) for all pairs of data points. The overall Gower’s distance is then computed by averaging the pairwise dissimilarities across all pairs of data points, which can be expressed as:
$$\text{Overall Gower's distance} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} D(i,j)$$
where \(n\) is the total number of data points in the dataset. The factor \(\frac{2}{n(n-1)}\) ensures that the average is computed correctly by considering all unique pairs without redundancy. This provides a comprehensive measure of the dissimilarity between the complete (original) and imputed datasets.
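One plausible implementation of this evaluation in Python compares corresponding rows of the complete and imputed datasets and averages the resulting Gower dissimilarities; this is a sketch under that reading, with hypothetical column lists, not the code used in the study.

```python
import pandas as pd

def gower_discrepancy(df_true, df_imp, num_cols, cat_cols):
    """Average Gower dissimilarity between corresponding rows of the complete
    and imputed datasets."""
    ranges = (df_true[num_cols].max() - df_true[num_cols].min()).replace(0, 1.0)
    d_num = (df_true[num_cols] - df_imp[num_cols]).abs() / ranges   # scaled to [0, 1]
    d_cat = (df_true[cat_cols] != df_imp[cat_cols]).astype(float)   # 0/1 mismatch
    per_row = pd.concat([d_num, d_cat], axis=1).mean(axis=1)        # D(i, i')
    return per_row.mean()
```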
Normalized root mean squared error (NRMSE)
For continuous variables, the NRMSE is utilized to quantify the divergence of imputed values from their actual counterparts [37]. This metric serves as an indicator of imputation accuracy, with a lower NRMSE signifying greater precision in imputation.
$$\text{NRMSE} = \sqrt{\frac{\text{mean}\left(\left(X_{\text{true}} - X_{\text{imp}}\right)^2\right)}{\text{var}\left(X_{\text{true}}\right)}}$$
Here, \(X_{\text{true}}\) represents the actual values of the covariate, and \(X_{\text{imp}}\) denotes the corresponding imputed values. This normalized measure offers a standardized assessment of the deviation between the original and the imputed values.
Proportion of false classified (PFC)
For categorical covariates, the PFC is a metric designed to evaluate the accuracy of imputation. It measures the proportion of instances in which the imputed categorical values deviate from their original counterparts.
$$\text{PFC} = \frac{1}{n}\sum_{i=1}^{n} I\left(X_{\text{true},i} \ne X_{\text{imp},i}\right)$$
Where \(n\) represents the total number of observations; \(X_{\text{true},i}\) and \(X_{\text{imp},i}\) denote the actual and imputed values for the \(i\)-th observation, respectively; \(I(\cdot)\) is an indicator function, equaling 1 when the actual and imputed values differ, and 0 otherwise.
The PFC metric therefore provides a straightforward and intuitive measure of imputation accuracy specifically tailored to categorical data, quantifying the frequency of misclassification introduced through the imputation process.
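Both metrics reduce to short helper functions; in practice they would be applied only to the entries that were actually missing before imputation. This is a sketch, not the study's code.

```python
import numpy as np

def nrmse(x_true, x_imp):
    """Normalized RMSE for continuous variables (lower is better)."""
    x_true, x_imp = np.asarray(x_true, float), np.asarray(x_imp, float)
    return np.sqrt(np.mean((x_true - x_imp) ** 2) / np.var(x_true))

def pfc(x_true, x_imp):
    """Proportion of falsely classified entries for categorical variables."""
    return np.mean(np.asarray(x_true) != np.asarray(x_imp))
```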
Post-imputation bias, accuracy, and reliability of regression estimates
This set of metrics evaluates the impact of imputation on the bias, accuracy, and precision of regression model estimates.
Post-imputation bias
The post-imputation bias is a metric used to evaluate the extent to which the imputation process influences the estimation of regression coefficients. It is computed by comparing the estimated coefficients derived from imputed data with those obtained from the all-cases data.
$$\text{Post-imputation bias} = \beta_{\text{true}} - \hat{\beta}_{\text{imp}}$$
Where \(\beta_{\text{true}}\) is the coefficient estimated from the all-cases data, and \(\hat{\beta}_{\text{imp}}\) is the average estimated coefficient for a given covariate obtained from the imputed data.
This measure quantifies the deviation in the regression coefficients due to the imputation of missing data, thereby assessing the impact of the imputation technique on the regression analysis. A smaller value of post-imputation bias indicates that the imputation process has minimal distortion on the regression estimates, suggesting a more accurate and reliable imputation methodology.
Empirical standard error (empirical SE)
The empirical SE is a statistical metric that quantifies the variability of estimated regression coefficients derived from imputed data. It serves as an indicator of the precision of these estimates, with a smaller SE suggesting greater precision.
$$\text{Empirical SE} = \sqrt{\frac{1}{m-1} \sum_{k=1}^{m} \left(\beta_{\text{imp},k} - \hat{\beta}_{\text{imp}}\right)^2}$$
Where \(m\) represents the number of imputed datasets; \(\beta_{\text{imp},k}\) is the estimated coefficient from the \(k\)-th imputed dataset; and \(\hat{\beta}_{\text{imp}}\) is the average of the estimated coefficients across all \(m\) imputed datasets.
This measure essentially calculates the standard deviation of the regression coefficients across multiple imputed datasets, providing insight into the spread or dispersion of the coefficient estimates. A lower Empirical Standard Error implies that the coefficient estimates across different imputed datasets are more consistent, indicating a higher level of reliability and stability in the imputation process.
Empirical coverage rate (ECR)
The ECR is a metric used to evaluate the accuracy of imputed data in terms of statistical inference. It measures how often the 95% CI, calculated from the imputed data, encompasses the true regression coefficients.
$$\text{ECR} = \frac{1}{p}\sum_{k=1}^{p} I\left(\beta_{\text{true},k} \in \text{CI}_{95\%,k}\right)$$
Where \(p\) is the total number of coefficients being estimated; \(\beta_{\text{true},k}\) represents the true value of the \(k\)-th coefficient; \(\text{CI}_{95\%,k}\) is the 95% confidence interval for the \(k\)-th coefficient, as calculated from the imputed data; finally, \(I(\cdot)\) is an indicator function that equals 1 if the true coefficient \(\beta_{\text{true},k}\) falls within the corresponding 95% confidence interval \(\text{CI}_{95\%,k}\), and 0 otherwise.
The ECR quantifies the proportion of times these confidence intervals accurately capture the true parameter values. A value close to 95% indicates that the confidence intervals derived from the imputed data are reliable and effectively represent the uncertainty surrounding the parameter estimates. This metric is crucial for assessing the validity of statistical inferences made from imputed datasets.
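The three estimation-based metrics can be computed together from the collected post-imputation results. The sketch below assumes `beta_hats`, `ci_lower`, and `ci_upper` are arrays of shape (number of datasets, p) gathered across simulation runs, and reports coverage per coefficient across runs; this layout is an assumption about how the results are stored, not the study's code.

```python
import numpy as np

def post_imputation_metrics(beta_true, beta_hats, ci_lower, ci_upper):
    """Return post-imputation bias, empirical SE, and empirical coverage rate."""
    beta_true = np.asarray(beta_true, float)          # all-cases coefficients (length p)
    beta_hats = np.asarray(beta_hats, float)          # shape (n_datasets, p)
    bias = beta_true - beta_hats.mean(axis=0)         # post-imputation bias
    emp_se = beta_hats.std(axis=0, ddof=1)            # empirical standard error
    covered = (np.asarray(ci_lower) <= beta_true) & (beta_true <= np.asarray(ci_upper))
    ecr = covered.mean(axis=0)                        # coverage rate per coefficient
    return bias, emp_se, ecr
```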
Post-imputation predictive accuracy
This set of metrics assesses how well a predictive model performs when trained on imputed data, compared to its performance on all cases.
Time-dependent area under the ROC curve (AUC)
The AUC is a metric used to assess the discriminatory power of a predictive model in a time-dependent context, which is particularly relevant in cancer survival analysis. This measure evaluates the ability of the model to correctly distinguish between different outcome classes (e.g., event vs. no event) at various time points [38]. The time-dependent AUC is calculated as follows.
$$\text{AUC}(t) = \int_0^t \text{Sensitivity}(u) \times \text{Specificity}'(u) \, du$$
Where \(\text{AUC}(t)\) is the area under the ROC curve up to time \(t\); \(\text{Sensitivity}(u)\) is the true positive rate (sensitivity) at time \(u\); and \(\text{Specificity}'(u)\) is the derivative of the true negative rate (specificity) with respect to time at \(u\).
In practical terms, this metric integrates the sensitivity and the rate of change of specificity over time, providing a comprehensive measure of the model’s performance in accurately classifying individuals at risk over the entire follow-up period. A higher Time-Dependent AUC indicates better discriminative ability of the model at different time points, which is crucial for the accurate prediction of cancer survival outcomes in clinical research.
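In Python, a time-dependent AUC of this kind can be obtained with scikit-survival's cumulative_dynamic_auc; the arrays `event_train`, `time_train`, `event_test`, `time_test`, and `risk_score` (a model's predicted risk per test subject) are hypothetical, as is the grid of evaluation times.

```python
import numpy as np
from sksurv.util import Surv
from sksurv.metrics import cumulative_dynamic_auc

# Structured (event, time) arrays for the training and evaluation data
surv_train = Surv.from_arrays(event=event_train.astype(bool), time=time_train)
surv_test = Surv.from_arrays(event=event_test.astype(bool), time=time_test)

eval_times = np.linspace(1.0, 6.0, 11)   # years at which AUC(t) is evaluated
auc_t, mean_auc = cumulative_dynamic_auc(surv_train, surv_test, risk_score, eval_times)
```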
C-index
The C-index, or Concordance index, is a widely used metric in cancer survival analysis to evaluate the concordance between predicted and observed outcomes. It provides a measure of the predictive accuracy of a model, particularly in the context of censored survival data [39]. The C-index is computed as follows.
$$\text{C-index} = \frac{\sum_{i<j} I(T_i < T_j)\left[I(H_i > H_j) + 0.5 \cdot I(H_i = H_j)\right]}{\sum_{i<j} I(T_i < T_j)}$$
Where \(\:T_i\) and \(\:T_j\) are the observed survival times for pairs of individuals \(\:i\) and \(\:j\), respectively; \(H_i\) and \(H_j\) are the predicted instant hazards (or risks) for individuals \(\:i\) and \(\:j\), respectively; and \(\:I\left(\cdot\:\right)\) is an indicator function that equals 1 if the condition is true and 0 otherwise.
The C-index quantifies the proportion of all usable patient pairs in which the predictions and outcomes are concordant. A pair is considered usable if one of the pair’s members experiences the event of interest and the other is either censored or experiences the event at a later time. A C-index of 0.5 suggests no predictive discrimination (random chance), while a C-index of 1.0 indicates perfect discrimination. In practice, a high C-index indicates that the model has good predictive accuracy, successfully ranking individuals in the order of their observed times to the event, which is particularly important in the analysis of imputed datasets in survival studies.
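With lifelines, the C-index is available as a utility function; note that it expects scores for which larger values mean longer predicted survival, so a predicted hazard is negated first. `time`, `event`, and `risk_score` are the (hypothetical) observed times, event indicators, and predicted risks from the sketches above.

```python
from lifelines.utils import concordance_index

# Negate the predicted hazard so that higher scores correspond to longer survival
cindex = concordance_index(event_times=time,
                           predicted_scores=-risk_score,
                           event_observed=event)
```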
Importantly, in our evaluation approach, each of these metrics offers a unique perspective on the performance of imputation methods, covering aspects from the precision of imputed data to the impact on subsequent statistical analyses and predictive modeling. This comprehensive evaluation ensures a robust understanding of the strengths and limitations of the different imputation methods studied here.