Comparison of spatial prediction models from machine learning of cholangiocarcinoma incidence in Thailand

Data collection and study area
This retrospective analytical cohort study examined 554 sub-districts across four regions of Thailand (northern, central, northeastern, and southern). We collected data from two main sources:
CCA case data
Case data were obtained from four Population-Based Cancer Registries (PBCRs): northern (Lampang Cancer Hospital), central (Lop Buri Cancer Hospital), northeastern (Khon Kaen Provincial Cancer Registry), and southern (Surat Thani Cancer Hospital) [23]. All CCA cases were diagnosed between January 1, 2012, and December 31, 2021, and coded according to the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3), using the codes C22.1 (intrahepatic bile duct), C24.0 (extrahepatic bile duct), C24.8 (overlapping lesion of the biliary tract), and C24.9 (biliary tract, NOS); C24.1 (ampulla of Vater) was excluded [24, 25]. Key variables included sex, age at diagnosis, birth date, ICD-O-3 code, address, and basis of diagnosis. Population data from the Office of the National Economic and Social Development Board [26] were used to calculate age-standardized rates (ASR) by sex and five-year age group for the period 2012–2021 (Table 1).
Spatial variables
First, environmental data (elevation, water source coordinates, and the size and extent of areas) were obtained from the Central Geoinformatics System and Services Project, Department of Water Resources, Ministry of Natural Resources and Environment [27]. Second, climatic data (average rainfall, average temperature, and the coordinates of all meteorological stations) were obtained from the Thailand Meteorological Department through its statistical data request system [28]. All spatial variables were aggregated at the sub-district level (Table 1).
Study areas
The study covered four provinces, one representing each of the four main regions of Thailand, with the following areas and geographical coordinates (latitude and longitude): (i) Lampang province (Northern): 12,533.96 km², 17.2°–19.5°N, 98.9°–100.2°E; (ii) Lop Buri province (Central): 6,208.70 km², 14.6°–15.8°N, 100.3°–101.5°E; (iii) Khon Kaen province (Northeastern): 10,885.99 km², 15.6°–17.1°N, 101.6°–103.3°E; and (iv) Surat Thani province (Southern): 12,891.4 km², 8.3°–10.2°N, 98.5°–100.2°E [23].
Variables and measurement
Abbreviations: ASR, age-standardized rates; CCA, cholangiocarcinoma; IACR, International Association of Cancer Registries.
Statistical analysis
Incidence rate of CCA
The ASR was calculated for each sex and standardized using the Segi World standard population estimates [29]. The International Association of Cancer Registries (IACR) guidelines [30] were used to calculate the ASR of CCA cases in each sub-district.
Machine learning models
We implemented four machine learning models to predict CCA incidence from spatial variables. In our data management process, residential address codes served as the key identifier for linking CCA case data with all spatial factors. Prior to analysis, the distribution of every variable was tested; variables with skewed (left- or right-skewed) distributions were log-transformed before being entered into the machine learning models. Each model represents a different approach to predictive modeling, selected to provide a comprehensive comparison of techniques applicable to spatial epidemiology.
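As a concrete illustration of this preprocessing step, the sketch below tests each numeric variable for skewness and log-transforms the skewed ones; the linkage key, data frame names, and skewness threshold are hypothetical, since the paper does not show its code.

```r
# Illustrative preprocessing sketch (object and variable names are assumptions):
# link CCA case data to spatial factors by residential address code, then
# log-transform any numeric variable whose distribution is markedly skewed.
library(dplyr)
library(e1071)  # skewness()

dat <- merge(cca_cases, spatial_vars, by = "address_code")

preprocess_spatial <- function(df, skew_threshold = 1) {
  df %>%
    mutate(across(
      where(is.numeric),
      ~ if (abs(e1071::skewness(.x, na.rm = TRUE)) > skew_threshold) log1p(.x) else .x
    ))
}

dat <- preprocess_spatial(dat)
```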
Linear regression
A statistical model that examines the linear relationship between the dependent variable (ASR of CCA) and multiple independent variables (spatial factors). We selected this model as a baseline comparison since it represents traditional statistical approaches and assumes linear relationships between variables.
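A minimal baseline sketch in R, assuming a training data frame `train_set` with the outcome `asr` and hypothetical names for the spatial predictors:

```r
# Ordinary least-squares baseline: ASR of CCA regressed on spatial factors
# (predictor names are assumptions for illustration).
lm_fit <- lm(asr ~ elevation + dist_water + area_km2 + rainfall + temperature,
             data = train_set)
summary(lm_fit)                                 # coefficients and overall fit
lm_pred <- predict(lm_fit, newdata = test_set)  # predictions on the test set
```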
Random Forest
An ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of individual trees. Random Forests are well-suited for spatial epidemiology because they can capture non-linear relationships, handle interactions between variables, and are robust against overfitting. The algorithm works by bootstrapping samples of observations and variables to build diverse decision trees, with each tree contributing a vote toward the final prediction [31]. The Random Forest model was configured with the following specifications: number of trees = 500; variables randomly sampled at each split (mtry) = 2; minimum node size = 5; and Gini criterion was employed as the splitting criterion.
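Assuming the randomForest R package (the paper names only a "Random Forest" package), the reported configuration could be sketched as follows:

```r
# Random Forest regression with the hyperparameters reported above
# (formula and data frame names are assumptions).
library(randomForest)

set.seed(2024)
rf_fit <- randomForest(
  asr ~ .,             # ASR of CCA against all spatial predictors
  data       = train_set,
  ntree      = 500,    # number of trees
  mtry       = 2,      # variables sampled at each split
  nodesize   = 5,      # minimum terminal node size
  importance = TRUE    # retain permutation importance for the later analysis
)
rf_pred <- predict(rf_fit, newdata = test_set)
```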
Neural Network
A computational model inspired by the human brain’s neural structure, designed to recognize complex patterns through interconnected layers of nodes (neurons). Neural Networks process information through three main components: an input layer (receiving spatial variables), hidden layers (processing information through weighted connections), and an output layer (producing CCA incidence predictions). This architecture allows Neural Networks to model highly complex, non-linear relationships between spatial factors and disease incidence [32, 33]. The Neural Network utilized a 5 → 15 → 10 → 1 architecture with ReLU activation for hidden layers and linear activation for the output layer. Training employed a 0.01 learning rate with Adam optimizer, L2 regularization (weight decay = 0.0001), batch size of 32, and 200 epochs with early stopping to optimize performance.
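The stated architecture and training settings map onto a Keras-style specification; the sketch below uses the keras R interface as one plausible implementation (an assumption, since the exact tooling is not shown), with a predictor matrix `train_x` and outcome vector `train_y` assumed:

```r
# 5 -> 15 -> 10 -> 1 network with ReLU hidden layers and a linear output,
# Adam (learning rate 0.01), L2 weight decay 0.0001, batch size 32,
# 200 epochs with early stopping (the patience value is an assumption).
library(keras)

nn_model <- keras_model_sequential() %>%
  layer_dense(units = 15, activation = "relu", input_shape = 5,
              kernel_regularizer = regularizer_l2(1e-4)) %>%
  layer_dense(units = 10, activation = "relu",
              kernel_regularizer = regularizer_l2(1e-4)) %>%
  layer_dense(units = 1, activation = "linear")

nn_model %>% compile(optimizer = optimizer_adam(learning_rate = 0.01), loss = "mse")

history <- nn_model %>% fit(
  x = as.matrix(train_x), y = train_y,
  epochs = 200, batch_size = 32, validation_split = 0.2,
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 20))
)
```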
Extreme Gradient Boosting (XGBoost)
An advanced implementation of gradient boosting that builds models sequentially, with each new model correcting errors made by previous ones. XGBoost has three key components: (i) a loss function to evaluate model accuracy, (ii) weak learners (typically decision trees) that perform slightly better than random guessing, and (iii) an additive model that combines weak learners into a strong predictive system. XGBoost includes regularization techniques to prevent overfitting, making it potentially valuable for spatial prediction with limited data [34]. The XGBoost model was configured with a learning rate of 0.05, maximum tree depth of 6, and minimum child weight of 3. Both subsample and column sample ratios were set at 0.8. For regularization purposes, alpha and lambda parameters were established at 0.2 and 0.1, respectively. The model was trained using 1000 boosting rounds with an early stopping mechanism to prevent overfitting and optimize performance.
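With the xgboost R package, the configuration described above could be sketched as follows (matrix names and the early-stopping patience are assumptions):

```r
# XGBoost regression with the reported hyperparameters.
library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(train_x), label = train_y)
dtest  <- xgb.DMatrix(data = as.matrix(test_x),  label = test_y)

params <- list(
  objective        = "reg:squarederror",
  eta              = 0.05,   # learning rate
  max_depth        = 6,
  min_child_weight = 3,
  subsample        = 0.8,
  colsample_bytree = 0.8,
  alpha            = 0.2,    # L1 regularization
  lambda           = 0.1     # L2 regularization
)

xgb_fit <- xgb.train(
  params = params, data = dtrain, nrounds = 1000,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 50,  # patience is an assumption
  verbose = 0
)
xgb_pred <- predict(xgb_fit, dtest)
```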
Model training and validation
For model development and evaluation, we randomly split the dataset into training (70%) and testing (30%) subsets. This ratio was selected to balance the need for adequate training data while ensuring sufficient test data for reliable performance evaluation, given our sample size constraints. The 70:30 split is widely used in machine learning applications and provides a good compromise between these competing needs.
While we considered alternative splitting ratios (80:20, 90:10), evaluating the same models as in the main analyses, our preliminary analyses showed that the 70:30 split offered the best balance between model learning and validation for our dataset size. With 554 sub-districts in the study, this split provided 388 sub-districts (4,465 cases) for training and 166 sub-districts (1,914 cases) for testing, sufficient numbers for both robust model training and meaningful validation without risking overfitting.
Table 1 illustrates our complete research methodology from data collection through model evaluation. The process began with gathering CCA case data from four regional cancer registries and spatial data from government databases. After preprocessing, which included calculating ASR values and standardizing spatial variables, we implemented the 70:30 random split stratified by region to maintain proportional representation. Each model was trained using identical training data and hyperparameter optimization techniques, then evaluated on the common test set using RMSE, R2, and visual assessment via scatter plots.
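One way to realize the region-stratified 70:30 split in R is with caret's createDataPartition, which samples 70% of sub-districts within each level of a (hypothetical) region factor:

```r
# Region-stratified 70:30 split of the sub-district data set
# (the data frame and its region column are assumptions).
library(caret)

set.seed(2024)
train_idx <- createDataPartition(dat$region, p = 0.70, list = FALSE)
train_set <- dat[train_idx, ]   # ~388 sub-districts for training
test_set  <- dat[-train_idx, ]  # ~166 sub-districts for testing
```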
Model evaluations
We implemented a comprehensive evaluation framework using three complementary approaches to ensure robust assessment of model performance:
Root Mean Square Error (RMSE)
RMSE quantifies prediction errors in the same units as the dependent variable, giving greater weight to large errors—critical in health applications where significant errors can have serious consequences. This metric calculates the square root of the average squared differences between predicted and actual CCA incidence values:
$$\mathrm{RMSE}=\sqrt{\frac{\sum(\text{predicted}-\text{actual})^{2}}{n}}$$
Lower RMSE values indicate better model performance with less prediction error. We selected RMSE over alternative metrics like Mean Absolute Error (MAE) because RMSE gives greater weight to large errors through the squaring mechanism, making it particularly valuable for health applications where large prediction errors could have significant consequences for resource allocation and intervention planning. This sensitivity to outliers helps identify models that might perform well on average but produce concerning errors in certain regions or incidence ranges.
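In R this reduces to a short helper (a sketch; the prediction vector and observed column name are assumptions):

```r
# Root mean square error between predicted and observed ASR values.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

rmse(rf_pred, test_set$asr)  # e.g., test-set RMSE for the Random Forest model
```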
R-squared (R2)
This coefficient of determination measures the proportion of variance in the dependent variable (ASR of CCA) that is explained by the model’s independent variables (spatial factors). By providing a straightforward scale from 0 to 1, it allows us to measure the proportion of variance explained in CCA incidence and facilitates meaningful comparisons with previous research findings:
$$R^{2} = 1 - \frac{\text{Sum of Squared Residuals}}{\text{Total Sum of Squares}}$$
R2 values range from 0 to 1, with values closer to 1 indicating that the model explains a greater proportion of the variance in CCA incidence, suggesting better predictive performance. For each model, we calculated 95% confidence intervals for R2 values using bootstrap resampling with 1000 iterations to quantify the uncertainty in our performance estimates and allow for more rigorous statistical comparison between models.
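A sketch of the R2 calculation and its 1,000-iteration bootstrap 95% confidence interval, using the boot package as one reasonable implementation (the paper does not name the tool used):

```r
# Coefficient of determination and a percentile bootstrap CI (1,000 resamples).
library(boot)

r_squared <- function(predicted, actual) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

boot_r2 <- function(data, idx) r_squared(data$predicted[idx], data$actual[idx])

res <- boot(data.frame(predicted = rf_pred, actual = test_set$asr),  # names assumed
            statistic = boot_r2, R = 1000)
boot.ci(res, type = "perc")  # percentile 95% confidence interval for R2
```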
Scatter plots
We created scatter plots to visualize the relationship between predicted and actual CCA incidence values for each model. These visual representations serve multiple analytical purposes:
- Identifying patterns in prediction accuracy across different incidence levels.
- Revealing potential systematic biases (such as consistent over-prediction in high-incidence areas).
- Detecting heteroscedasticity in prediction errors.
- Identifying regional clusters or outliers that might require special attention.
We enhanced these scatter plots with a 45-degree reference line representing perfect prediction, regression lines showing actual trends, and color-coding by region to enable deeper visual analysis of model performance.
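A ggplot2 sketch of such an enhanced scatter plot, assuming hypothetical column names for the observed ASR, the predictions, and the region labels:

```r
# Predicted vs. observed ASR with a 45-degree reference line, a fitted trend
# line, and colour-coding by region.
library(ggplot2)

plot_dat <- data.frame(actual    = test_set$asr,
                       predicted = rf_pred,
                       region    = test_set$region)

ggplot(plot_dat, aes(x = actual, y = predicted, colour = region)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # perfect prediction
  geom_smooth(method = "lm", se = FALSE, colour = "black") +    # observed trend
  labs(x = "Observed ASR of CCA", y = "Predicted ASR of CCA") +
  theme_minimal()
```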
After a comprehensive comparison of these models, we conducted a variable importance analysis using the best-performing model (Random Forest) to identify the key spatial predictors of CCA incidence. This analysis quantifies the mean decrease in prediction accuracy when each variable's information is removed by permutation while all other variables are held constant. The approach involved:
1. Training the optimal Random Forest model on the complete dataset.
2. Permuting each predictor variable one at a time (effectively removing its information while maintaining the same data structure).
3. Measuring the resulting decrease in prediction accuracy.
4. Ranking variables by their impact on model performance.
This permutation-based approach offers advantages over alternative variable importance methods as it directly measures the impact on the model’s predictive performance rather than changes in node purity, providing more interpretable results that directly relate to our prediction goals.
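With the randomForest package, this permutation-based importance is available directly from the fitted model (a sketch continuing from the earlier Random Forest example, which set `importance = TRUE`):

```r
# Permutation importance: mean decrease in accuracy (%IncMSE) when each
# predictor is permuted, ranked from most to least influential.
library(randomForest)

imp <- importance(rf_fit, type = 1)                       # type = 1 -> permutation-based
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]   # rank predictors by impact
varImpPlot(rf_fit, type = 1)                              # visual ranking
```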
For model implementation, we used the Random Forest, Neural Network, XGBoost, and stats packages in R. We performed all analyses and visualizations using R software version 4.2.1 (R Core Team) [35] with RStudio version 1.4.1 [36]. Spatial data processing used the sf and raster packages, while visualization employed ggplot2 with custom themes for clarity. Statistical validation, including confidence interval calculation, was implemented using resampling.