Comparison of methods for tuning machine learning model hyper-parameters: with application to predicting high-need high-cost health care users

Extreme gradient boosting model

Our study focused on tuning the hyper-parameters of an extreme gradient boosting model for predicting a binary outcome, denoting high-need high-cost health care status. Details regarding the theory and computation of gradient boosting are discussed in Friedman [14] and Chen et al. [10]. The extreme gradient boosting model has performed well in many empirical predictive tasks; however, it is associated with several hyper-parameters that must be judiciously tuned for the resulting supervised machine learning model to have optimal performance.

We used the Python XGBoost implementation of the extreme gradient boosting classifier. Extreme gradient boosting model hyper-parameters and their default values are presented in Table 1, which also describes the tuning range (support) for each hyper-parameter. Choices regarding tuning range are governed by the measurement scale and support of the specific hyper-parameter under consideration.

Table 1 Extreme gradient boosting model hyper-parameters, default values, and tuning ranges for hyper-parameter optimization experiments
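As an illustration of how the model and its tuning ranges can be encoded, the minimal sketch below (Python) instantiates the XGBoost classifier with a handful of default hyper-parameters and defines a dictionary of bounded tuning ranges. The hyper-parameter names are standard XGBoost arguments, but the specific bounds shown are illustrative placeholders rather than the exact values reported in Table 1.

```python
# Minimal sketch: an XGBoost classifier at default hyper-parameter values,
# plus an illustrative dictionary of tuning ranges (not the exact Table 1 values).
from xgboost import XGBClassifier

default_model = XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    n_estimators=100,       # number of boosting rounds (default)
    learning_rate=0.3,      # eta (default)
    max_depth=6,            # maximum tree depth (default)
)

# Illustrative tuning ranges: bounded continuous and discrete supports
search_space = {
    "n_estimators": (50, 500),        # integer
    "learning_rate": (0.01, 0.3),     # continuous, log scale
    "max_depth": (2, 10),             # integer
    "min_child_weight": (1, 20),      # integer
    "subsample": (0.5, 1.0),          # continuous
    "colsample_bytree": (0.5, 1.0),   # continuous
    "gamma": (0.0, 5.0),              # continuous
    "reg_lambda": (0.1, 10.0),        # continuous, log scale
}
```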

Hyper-parameter optimization

Hyper-parameter optimization (HPO) algorithms seek to identify an optimal set of hyper-parameters (λ) that optimizes an objective function, \(f(\lambda )\), corresponding to a user-selected evaluation metric [9, 17]. The choice of evaluation metric is problem dependent, and it determines whether HPO is framed as a minimization or a maximization problem. In our study we focus on the AUC, a metric suitable for evaluating the performance of binary prediction models; other metric functions exist for this purpose (e.g. binomial/cross-entropy loss, F1 score). Formally, hyper-parameter tuning methods attempt to optimize the objective below:

$${\lambda }^{*}= \text{arg}\underset{\lambda \in\Lambda }{\text{max}}\ f(\lambda )$$

  • \(\lambda\) is a particular hyper-parameter configuration. We note that \(\lambda\) is a J-dimensional tuple \(\lambda \equiv ({\lambda }_{1},{\lambda }_{2},…,{\lambda }_{J})\). We use \({\lambda }_{j}\) to denote the jth hyper-parameter in the tuple.

  • \(\Lambda\) defines the search space (support) of the hyperparameters. In our study, \(\Lambda\) is a product space over (bounded) continuous and discrete variables.

  • \(f(\lambda )\) denotes the metric function employed in the HPO experiment, which evaluates model performance at a given hyper-parameter configuration (\(\lambda \in\Lambda\)). In our study, we use an AUC metric for evaluating the performance of our binary prediction model, and HPO is cast as an AUC maximization problem (a code sketch of such an objective function follows this list of definitions).

  • \({\lambda }^{*}\) is the optimal hyper-parameter configuration that yields the best model performance (with respect to the user defined metric function).

  • We assume s= 1…S trials in our HPO experiment. \({\lambda }^{s}\) denotes the hyper-parameter configuration used in the sth trial of the HPO experiment. In this study, we budget S= 100 trials for each HPO method being comparatively evaluated.
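The objective \(f(\lambda )\) can be expressed programmatically as a function that fits the extreme gradient boosting model at a candidate configuration and returns the validation AUC. The sketch below is illustrative only: synthetic data generated with scikit-learn stand in for the (non-public) health administrative dataset, and the training/validation split and hyper-parameter names are assumptions of the sketch.

```python
# Minimal sketch of the HPO objective f(λ): fit the extreme gradient boosting
# model at configuration λ on the training split and return the validation AUC.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic, class-imbalanced stand-in for the health administrative data
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

def f(lam: dict) -> float:
    """Evaluate one hyper-parameter configuration λ by validation AUC."""
    model = XGBClassifier(objective="binary:logistic", **lam)
    model.fit(X_train, y_train)
    pred = model.predict_proba(X_valid)[:, 1]
    return roc_auc_score(y_valid, pred)

# Example: one trial of an HPO experiment at a single configuration
print(f({"n_estimators": 100, "learning_rate": 0.1, "max_depth": 4}))
```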

Hyper-parameter optimization algorithms

Hyperopt: random sampling

Candidate hyper-parameter configurations are randomly and independently sampled from univariate probability distributions (e.g. bounded discrete/continuous distributions) [5].

$${\lambda }_{j} \sim {P}_{j}({\Lambda }_{j})$$

\({\lambda }_{j}\) denotes a realization of a specific hyper-parameter (jth element of the tuple). \({P}_{j}({\Lambda }_{j})\) is the probability distribution from which \({\lambda }_{j}\) is randomly drawn/sampled.
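A minimal Hyperopt sketch of this random sampling strategy is shown below. The three-hyper-parameter search space and the XGBoost objective are illustrative assumptions, and synthetic data stand in for the study dataset; because Hyperopt's `fmin` minimizes, the negative validation AUC is returned as the loss.

```python
# Minimal sketch: random search over XGBoost hyper-parameters with Hyperopt
from hyperopt import STATUS_OK, Trials, fmin, hp, rand
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative search space: each λ_j is drawn from a univariate distribution P_j(Λ_j)
space = {
    "learning_rate": hp.loguniform("learning_rate", -4.6, -1.2),  # roughly [0.01, 0.3]
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(lam):
    model = XGBClassifier(
        learning_rate=lam["learning_rate"],
        max_depth=int(lam["max_depth"]),
        subsample=lam["subsample"],
    )
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    return {"loss": -auc, "status": STATUS_OK}  # fmin minimizes, so negate the AUC

trials = Trials()
best = fmin(objective, space, algo=rand.suggest, max_evals=100, trials=trials)
print(best)
```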

Hyperopt: simulated annealing

Simulated annealing treats hyper-parameter search as an energy minimization problem, where the metric function (\(f(\lambda )\)) represents energy (for a maximization metric such as the AUC, the negative of the metric plays the role of energy), and candidate solutions are perturbed stochastically until an optimum is identified. The algorithm accepts worse solutions with the following probability [19]:

$$P\left(\Delta f, T\right)= \text{exp}\left(-\frac{\Delta f}{ T}\right)$$

\(\Delta f=f\left({\lambda }^{s+1}\right)- f\left({\lambda }^{s}\right)\) is the change in the objective function value between successive HPO trials, and T is the annealing temperature (a meta-parameter of the HPO algorithm). Initially, simulated annealing explores the hyper-parameter space broadly by accepting poor solutions with high probability; as T cools, the acceptance rate of worse solutions decreases, allowing the method to converge.
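The acceptance rule above can be illustrated with a few lines of Python. This is a generic sketch of the Metropolis-style rule under the convention that \(\Delta f>0\) denotes a worse (higher-energy) solution, not Hyperopt's internal implementation.

```python
import math
import random

def accept_move(delta_f: float, temperature: float) -> bool:
    """Accept or reject a candidate configuration under simulated annealing.

    delta_f: increase in energy between successive trials (here, energy can be
             taken as the negative validation AUC, so delta_f > 0 means worse).
    temperature: the annealing temperature T (a meta-parameter of the algorithm).
    """
    if delta_f <= 0:  # improvements are always accepted
        return True
    return random.random() < math.exp(-delta_f / temperature)

# Acceptance probability of a move that worsens the objective by 0.02
for T in (1.0, 0.1, 0.01):
    print(T, math.exp(-0.02 / T))  # probability shrinks as T cools
```

In Hyperopt, this strategy is selected by passing `algo=anneal.suggest` (imported via `from hyperopt import anneal`) to `fmin`, in place of `rand.suggest` in the random-sampling sketch above.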

Hyperopt: tree-structured Parzen estimator

The tree-Parzen estimator is a Bayesian optimization algorithm that models the conditional probability distribution \(P({\lambda }^{s+1} | f({\lambda }^{s}),\ f({\lambda }^{s-1}),\dots ,\ f({\lambda }^{1}))\) instead of modeling \(f(\lambda )\) directly. The tree-Parzen estimator approach divides observed hyper-parameter configurations into two sets:

  • \({g}\): configurations where the metric value exceeds the threshold, i.e. \(f\left(\lambda \right)>{f}^{*}\).

  • \(\mathcal{L}\): configurations where the metric value is less than or equal to the threshold, i.e. \(f\left(\lambda \right) \le {f}^{*}\).

The tree-Parzen estimator fits two nonparametric kernel density estimators to approximate P(λ |\({g}\)) and P(λ |\(\mathcal{L}\)), respectively. The algorithm chooses new hyper-parameter configurations to maximize expected improvement, which can be shown to be proportional to maximizing the expression below [4]:

$${\lambda }^{*}=\text{arg}\underset{\lambda \in\Lambda }{\text{max}}\ \frac{P(\lambda |{g})}{P(\lambda |\mathcal{L})}$$
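In Hyperopt, the tree-structured Parzen estimator is selected through the `algo` argument of `fmin`. The brief sketch below assumes the `space` and `objective` defined in the random-sampling sketch above.

```python
from hyperopt import Trials, fmin, tpe

# `objective` and `space` as defined in the random-sampling sketch above
trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=100, trials=trials)
```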

Hyperopt: adaptive tree-structured Parzen estimator

The adaptive tree-Parzen estimator extends the standard tree-Parzen estimator by dynamically adjusting the trade-off between exploration and exploitation based on optimization progress. Early in the search, it favors exploration to gain a broad understanding of the search space, while later it prioritizes exploitation by focusing more on high-performing regions. This is achieved by dynamically varying the threshold \({(f}^{*})\) throughout the course of the hyper-parameter optimization experiment [4].
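The adaptive variant is exposed in Hyperopt as `atpe.suggest` and can be substituted in the same way, again assuming the `space` and `objective` from the random-sampling sketch; note that `atpe` relies on Hyperopt's optional dependencies.

```python
from hyperopt import Trials, atpe, fmin

# `objective` and `space` as defined in the random-sampling sketch above
best = fmin(objective, space, algo=atpe.suggest, max_evals=100, trials=Trials())
```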

Optuna: Gaussian Process

Gaussian Process (GP) based hyper-parameter optimization attempts to fit a surrogate model for the HPO metric function (\(f(\lambda )\)) using a distribution over functions [23]. The Gaussian Process used in HPO is defined as:

$$GP(\mu \left(\lambda \right), k(\lambda ,{\lambda }^{\prime}))$$

where \(\mu \left(\lambda \right)\) is the GP mean function and \(k(\lambda ,{\lambda }^{\prime})\) is the GP covariance/kernel function. The GP mean and covariance functions are updated using past hyper-parameter configurations. Given past evaluations of the metric function, GP-based optimization uses an acquisition function (e.g., expected improvement, upper confidence bound, probability of improvement) to determine which hyper-parameter configuration to sample next.
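A minimal Optuna sketch using a Gaussian Process sampler is shown below, assuming a recent Optuna release that provides `optuna.samplers.GPSampler` (which in turn requires PyTorch). As before, synthetic data and a three-hyper-parameter search space stand in for the study data and the Table 1 ranges.

```python
# Minimal sketch: Optuna with a Gaussian Process surrogate sampler
import optuna
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial: optuna.Trial) -> float:
    model = XGBClassifier(
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 10),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])  # maximized

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.GPSampler(seed=0))
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)
```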

Optuna: Quasi-Monte Carlo sampler

Quasi-Monte Carlo methods improve on random search by using low-discrepancy sequences (e.g. Sobol sequences) instead of purely random samples. These sequences fill the search space more uniformly than independent random draws, reducing variance in HPO metric function evaluations [5].
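In Optuna, this strategy corresponds to the `QMCSampler`; the brief sketch below assumes the `objective` defined in the Gaussian Process sketch above.

```python
import optuna

# `objective` as defined in the Gaussian Process sketch above
sampler = optuna.samplers.QMCSampler(qmc_type="sobol", scramble=True, seed=0)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
```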

Optuna: CMA-ES (Covariance Matrix Adaptation Evolution Strategy)

CMA-ES is an evolutionary algorithm for hyper-parameter optimization that iteratively updates the parameters of a multivariate normal (MVN) distribution, \(MVN(\mu (\lambda ),\Sigma (\lambda ))\). The mean \(\mu (\lambda )\) represents the center of promising hyper-parameter configurations, while the covariance matrix \(\Sigma (\lambda )\) captures dependencies between hyper-parameters, allowing the algorithm to model correlations and adapt the search distribution over successive trials of the HPO experiment.

CMA-ES follows an evolutionary process: in each iteration, the CMA-ES algorithm selects the best-performing hyper-parameter configurations (analogous to survival/fitness) and updates the MVN mean and covariance (similar to recombination and mutation). These adaptive updates refine the HPO search space, enabling CMA-ES to efficiently explore and sample better-performing hyper-parameter sets [15, 16].
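In Optuna, this strategy corresponds to the `CmaEsSampler` (which requires the optional `cmaes` package); the brief sketch below again assumes the `objective` defined in the Gaussian Process sketch above.

```python
import optuna

# `objective` as defined in the Gaussian Process sketch above
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.CmaEsSampler(seed=0))
study.optimize(objective, n_trials=100)
```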

Scikit-Optimize (skopt): Gaussian Process

SkOpt uses a Gaussian process (GP) surrogate model for hyper-parameter optimization. It supports flexible kernel selection and acquisition function tuning, adapting to different search spaces, and balances exploration and exploitation in the hyper-parameter search.
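A minimal Scikit-Optimize sketch is shown below, using `gp_minimize` with an expected improvement acquisition function. The search space, synthetic data, and choice of acquisition function are illustrative assumptions; because `gp_minimize` minimizes, the negative validation AUC is returned.

```python
# Minimal sketch: Bayesian optimization with a Gaussian Process surrogate in scikit-optimize
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

space = [
    Real(0.01, 0.3, prior="log-uniform", name="learning_rate"),
    Integer(2, 10, name="max_depth"),
    Real(0.5, 1.0, name="subsample"),
]

@use_named_args(space)
def objective(learning_rate, max_depth, subsample):
    model = XGBClassifier(
        learning_rate=learning_rate,
        max_depth=int(max_depth),
        subsample=subsample,
    )
    model.fit(X_tr, y_tr)
    # gp_minimize minimizes, so return the negative validation AUC
    return -roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

result = gp_minimize(objective, space, n_calls=100, acq_func="EI", random_state=0)
print(result.x, -result.fun)
```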

SMAC3: Bayesian optimization with random forests

SMAC3 is a Bayesian optimization framework for hyper-parameter tuning that uses random forests to estimate a surrogate model of the HPO metric function [20]. SMAC3 builds an ensemble of regression trees to estimate performance, making it robust in high-dimensional and discrete spaces. The expected improvement acquisition function balances exploration and exploitation to efficiently select the next hyper-parameter configuration.
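A minimal sketch of SMAC3 usage follows, assuming the SMAC3 2.x API (`Scenario` plus `HyperparameterOptimizationFacade`) and a `ConfigSpace` search space; SMAC minimizes a cost, so 1 − AUC is returned. The hyper-parameter names, ranges, and synthetic data are illustrative, not the study's configuration.

```python
# Minimal sketch: SMAC3 with its random forest surrogate (HyperparameterOptimizationFacade)
from ConfigSpace import Configuration, ConfigurationSpace, Float, Integer
from smac import HyperparameterOptimizationFacade, Scenario
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

cs = ConfigurationSpace()
cs.add_hyperparameters([
    Float("learning_rate", (0.01, 0.3), log=True),
    Integer("max_depth", (2, 10)),
    Float("subsample", (0.5, 1.0)),
])

def train(config: Configuration, seed: int = 0) -> float:
    model = XGBClassifier(
        learning_rate=float(config["learning_rate"]),
        max_depth=int(config["max_depth"]),
        subsample=float(config["subsample"]),
        random_state=seed,
    )
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    return 1.0 - auc  # SMAC minimizes cost

scenario = Scenario(cs, n_trials=100)
smac = HyperparameterOptimizationFacade(scenario, train)
incumbent = smac.optimize()
print(incumbent)
```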

Description of the high-need high-cost dataset: outcome and feature variables

Total individual health care costs were calculated for all Ontario adult residents > 18 years old separately during fiscal year 2017 and fiscal year 2019 using provincial health administrative data. Individual health conditions were captured over the previous two fiscal years. Total costs were dichotomized into a binary outcome, denoting whether an individual was in the top-5 percentile of the empirical cost distribution, separately for fiscal year 2017 and fiscal year 2019.

Our study is restricted to older adults aged > 65 years (n= 2,321,168). Median age was 73 years old (IQR: 68–80 years) and 54.6% of the sample was female. A total of 345,503 (15.3%) of older Ontario residents experienced costs in the top-5 percentile during fiscal year 2017.

The 2019 population contained 2,483,064 residents aged > 65 years. Of these, 366,219 (14.7%) were in the top-5 percentile of the empirical cost distribution during fiscal year 2019. Demographic and clinical characteristics of health care users were similar in 2017 and 2019.

Subject matter experts from medicine, epidemiology, and biostatistics purposefully selected features for inclusion in our prediction models that they hypothesized to be drivers of high-need high-cost health care utilization. The dataset consists of approximately 104 features, resulting in a clinical tabular dataset whose design matrix contained 143 columns (Supplementary Material 2). The features measure patient socio-demographic characteristics, diagnoses of physical/mental health conditions, past hospitalizations, and other previous health care utilization. Most of the variables are binary, denoting the presence/absence of a health condition or of health care utilization. A few variables are categorical, and two continuous measures exist in the dataset (i.e. age in years and frailty index score). All variables were hypothesized to have a connection with high-need high-cost health care utilization, and as such the dataset is characterized by a strong signal-to-noise ratio. The ratio of samples to features is large, as is the ratio of samples to events (i.e. high-need high-cost health care users).

Study design and evaluation metrics for hyper-parameter optimization experiments

For each of the nine HPO methods being compared, we budget S= 100 trials for each HPO experiment. A single trial (s= 1…S) in our experiment involves: 1) identification of a tuple of hyper-parameters, 2) optimization of the extreme gradient boosting model at the particular hyper-parameter configuration on the training dataset (n.b. the same training dataset is used for estimating the extreme gradient boosting model under each sampled hyper-parameter configuration), and 3) estimation of an AUC evaluation metric on the held-out validation dataset (n.b. the same validation dataset is used for AUC estimation and selection of an optimal tuple of hyper-parameters which maximizes model performance). Given nine HPO methods, we conduct nine HPO experiments, each with a budget of 100 trials. We train 900 extreme gradient boosting models and estimate 900 replicates of the AUC evaluation metric. We note that the budget of S= 100 trials (for each HPO method) is an arbitrary choice. Choice of HPO budget (S) is often informed by the size of the training dataset, the statistical model being fitted, and availability of high-performance computing facilities – as these factors impact model fitting time, and total runtime of the entire HPO experiment.

For each of the HPO methods/experiments we construct a line plot illustrating variation in the metric function over sequential trials, at distinct hyper-parameter configurations. The plot allows for graphical identification of the optimal extreme gradient boosting model identified using each HPO method, comparison of the optimal model against learned alternative models, and comparison against the extreme gradient boosting model fitted using default hyper-parameter values.

We use a random 80:10:10 training/validation/test split of the fiscal year 2017 dataset. Predictions from optimal models identified by each of the hyper-parameter optimization methods are estimated using a randomly selected hold-out test dataset, and generalization performance is evaluated (internal validation). A temporally independent fiscal year 2019 test dataset is also used to evaluate model generalization performance (external validation). The same training, validation, and test datasets are used to evaluate each of the HPO methods being compared, to ensure comparability of discrimination/calibration metrics and inferences regarding feature importance.
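The 80:10:10 split can be produced with two successive random splits, as in the sketch below (scikit-learn, with synthetic data standing in for the fiscal year 2017 cohort; the random seed is arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fiscal year 2017 design matrix and binary outcome
X, y = make_classification(n_samples=10000, n_features=20, weights=[0.95], random_state=0)

# 80% training, then split the remaining 20% evenly into validation and test (10% each)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=0)
```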

The study considers several metrics of discrimination and calibration performance [24,25,26]. We evaluate model discrimination performance using accuracy, sensitivity (a.k.a. recall), specificity, positive predictive value (PPV; a.k.a. precision), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC). For each machine learning model and hyper-parameter optimization method, we plot a receiver operating characteristic (ROC) curve and a LOESS-smoothed calibration curve [2]. We also evaluate model calibration performance using the integrated calibration index (ICI) and the E50, E90 and Emax indices [3]. ICI measures the (integrated) area between the observed and ideal calibration curves, whereas E50, E90, and Emax measure the median, 90th percentile, and maximum deviation between the observed and ideal calibration curves. We use a total gain feature importance metric to compare the stability of important features identified by each of the extreme gradient boosting models and associated HPO methods. HPO algorithms are also compared in terms of runtime over the 100 budgeted trials (n.b. runtime is measured in hours).
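As an illustration of the calibration indices, the sketch below computes ICI, E50, E90, and Emax as summaries of the absolute difference between predicted probabilities and a LOESS-smoothed calibration curve, following the general approach described in [3]. The statsmodels `lowess` smoother and its span (`frac`) are assumptions of this sketch, not necessarily the study's implementation.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def calibration_indices(y_true: np.ndarray, p_hat: np.ndarray) -> dict:
    """ICI, E50, E90, and Emax from a LOESS-smoothed calibration curve."""
    # Smooth observed outcomes against predicted probabilities (observed calibration curve)
    smoothed = lowess(y_true, p_hat, frac=0.75, return_sorted=False)
    abs_diff = np.abs(smoothed - p_hat)  # deviation from the ideal (45-degree) curve
    return {
        "ICI": float(np.mean(abs_diff)),
        "E50": float(np.median(abs_diff)),
        "E90": float(np.quantile(abs_diff, 0.90)),
        "Emax": float(np.max(abs_diff)),
    }

# Toy usage with simulated, well-calibrated predictions
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.01, 0.99, size=1000)
y_true = rng.binomial(1, p_hat)
print(calibration_indices(y_true, p_hat))
```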
