A comparative analysis of classical and machine learning methods for forecasting TB/HIV co-infection
This study aimed to explore prediction models for the incidence of TB/HIV coinfection, estimating the epidemic trend of cases from a past, present, and future perspective, thereby providing a reference for prevention and control for public policy agencies. To this end, an analysis of TB/HIV notification data from 2012 to 2023 in the state of Mato Grosso, Brazil, was conducted. The monthly incidence rate for the reported cases in this range was calculated, and the temporal trend of these data was evaluated, stratified by male, female, and total population.
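The monthly incidence rate described above can be sketched as follows. This is a minimal illustration, not the study's own code: the scaling per 100,000 inhabitants, the function name `incidence_rate`, and all counts and population figures are assumptions chosen for the example.

```python
def incidence_rate(cases, population, per=100_000):
    """Return cases per `per` inhabitants for one month and one stratum.

    The per-100,000 scaling is an assumption; the text does not state it.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population


# Hypothetical counts for a single month (illustrative numbers only).
monthly_cases = {"male": 12, "female": 5}
population = {"male": 1_800_000, "female": 1_750_000}

# One rate per stratum, plus the total population.
rates = {g: incidence_rate(monthly_cases[g], population[g]) for g in monthly_cases}
rates["total"] = incidence_rate(sum(monthly_cases.values()),
                                sum(population.values()))
```

Repeating this calculation for each month from 2012 to 2023 would yield one time series per stratum (male, female, total).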
In this context, we compare cases of TB/HIV in the state of Mato Grosso with other regions of the country. The Central-West region, where Mato Grosso is located, ranked second in the number of TB/HIV coinfection notifications in 2022, with a proportion of 10.0%. The South region ranked first with 12.6% of notifications. The state of Mato Grosso, with 9.3%, also stands out for being among the ten states with the highest number of cases of coinfection, even surpassing the national average of 8.4%3.
This epidemiological scenario highlights the relevance of TB/HIV cases in Mato Grosso relative to other regions of the country. According to Humayun et al. (2022), there is a predominance of coinfection in male individuals, which corroborates the results of our investigation, indicating a higher incidence in this population group.
The evidence reinforces the need for special attention to male individuals, especially in regions with extensive rural areas, such as the state of Mato Grosso45. Developing and implementing specific prevention and control measures can improve TB/HIV coinfection rates in this group. These include targeted health campaigns for men, mobile health clinics, workplace health initiatives, community-based interventions, improved health literacy, enhanced screening and diagnostic services, and integrated health services for concurrent TB and HIV care.
To further understand the dynamics of TB/HIV incidence rates, the time series trends for the period from 2012 to 2023 were evaluated using the ADF, DF-GLS, KPSS, and Phillips-Perron tests. The choice of multiple tests is justified by their complementarity and the specific characteristics of the analyzed datasets. The identified weak stationarity implies that the mean and autocovariance are stable over time, although the complete distribution of time series values does not need to be constant. The autocovariance should be influenced only by the time difference (lag) between two observations, regardless of the specific moments when the observations occur46.
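For reference, the weak (second-order) stationarity described above can be written compactly. The symbols $X_t$, $\mu$, $\gamma$, and $h$ are notation introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\mathbb{E}[X_t] &= \mu \quad \text{for all } t, \\
\operatorname{Var}(X_t) &< \infty \quad \text{for all } t, \\
\operatorname{Cov}(X_t, X_{t+h}) &= \gamma(h) \quad \text{for all } t \text{ and every lag } h .
\end{aligned}
```

The third condition states that the autocovariance depends only on the lag $h$, not on the time $t$ at which the observations occur.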
These evaluations revealed that the time series, stratified by female, male, and total population, were stationary, with significant seasonality observed in the male and total population groups. The absence of a decreasing trend in notifications suggests that current strategies and interventions to combat TB/HIV are insufficient to reduce the incidence of these conditions. This highlights the need for reviewing and intensifying prevention, diagnosis, and treatment measures.
Seasonality underscores the particular challenges in controlling TB/HIV coinfection, such as the stigma associated with both diseases, difficulties in accessing and adhering to treatment, and the necessity for integrated health approaches that consider the interactions between TB, HIV, and other social determinants of health.
The observed trends are further complicated by external factors, most notably the COVID-19 pandemic, which had significant impacts on TB notifications worldwide2. These impacts were due to a combination of factors including the reorganization of health systems, changes in people’s behavior in seeking medical care, and disruptions in health services. Thus, when observing the period from 2020 to 2022 regarding TB/HIV notifications in the state of Mato Grosso, it was possible to verify the same downward trend.
It should be emphasized that the decline in TB/HIV coinfection notifications may not indicate a lack of cases, but rather that these cases possibly went unreported due to difficulties in accessing health services. During the pandemic, late diagnoses and treatment interruptions were common, which may have contributed to complications arising from TB/HIV coinfection. The situation experienced during the pandemic further highlighted the need to fulfill commitments related to TB and HIV, especially through the expansion of ethical and person-centered care, with equity and access to health and social rights4.
In light of these challenges, predictive models play a crucial role in controlling and managing TB, one of the oldest and most persistent infectious diseases affecting humanity. The insights provided by these models guide the development of policies and public health programs, steering decisions on research priorities, health infrastructure development, and education and communication strategies. They are essential for the ongoing monitoring of the effectiveness of public health policies and TB control programs, allowing for quick adjustments in response to real-time feedback, and are effectively considered a global strategy to end TB, providing a basis for evidence-based decisions in disease control.
In this perspective, this study explored two prevalent approaches to constructing predictive models for infectious diseases. The first group of models comes from classical statistics, such as exponential smoothing (SES, DES, Holt-Winters) and autoregressive integrated moving average (ARIMA); the second comprises machine learning-based prediction models like Support Vector Regression (SVR), Decision Trees (XGBoost), and artificial neural network models (LSTM, CNN, and GRU).
Although classical statistical predictive models have been relatively successful in predicting infectious diseases, they struggle to extract nonlinear relationships in a time series47. In contrast, one of the main advantages of machine learning over classical statistical methods is its ability to capture and model the complexity and nonlinearity of data without the need to explicitly specify the form of the relationship between input and output variables48. However, the literature already suggests that the characteristics of the data largely determine the accuracy of these methods, regardless of the approach49,50,51.
Thus, for the data evaluated in this study, deep learning models, specifically Bidirectional LSTM and the custom CNN + LSTM model, outperformed the others in predicting these stratified datasets, as evidenced by lower error metrics. This suggests that they capture complex patterns in the data more effectively than simpler methods such as SES, DES, and Holt-Winters. Also noteworthy is the performance of the ARIMA model52,53, which has shown notable predictive capacity in a wide variety of scenarios but, in this study, performed better only in the male time series, and only in comparison with the other classical statistical models. This result reaffirms that the success of a model’s prediction is intrinsically related to the characteristics and behavior of the data. This can be beneficial, as simpler models such as SES and DES can still provide accurate guidance for decision-making on TB/HIV coinfection.
However, attention must be given to the accuracy of these simpler models. Although lower AIC and BIC54 values indicate a more efficient model in terms of simplicity and information use, they do not necessarily translate into more accurate predictions, as demonstrated by higher sMAPE values.
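For context, the standard definitions of the two information criteria are given below. The symbols are introduced here for illustration: $k$ is the number of estimated parameters, $n$ the number of observations, and $\hat{L}$ the maximized value of the likelihood function.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both criteria reward in-sample fit and penalize the number of parameters. Neither measures forecast error directly, which is why a model with low AIC and BIC can still show a high sMAPE.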
A lower sMAPE is preferable, indicating a smaller percentage error between predictions and actual values and therefore more accurate predictions. sMAPE is less sensitive to extreme values and is interpretable in percentage terms, which makes the model’s predictive performance easier to understand. By comparing sMAPE values across different models, we can identify which models perform better in terms of prediction accuracy.
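A minimal sketch of the sMAPE calculation follows. Several variants of sMAPE exist; the one below (bounded between 0% and 200%) is a common form and is assumed here, since the text does not state which variant was used. The function name `smape` and the treatment of a zero denominator are likewise illustrative choices.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200).

    Assumed variant: mean of |F - A| / ((|A| + |F|) / 2), times 100.
    A term whose actual and forecast values are both zero counts as 0.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        if denom > 0:
            total += abs(f - a) / denom
    return 100.0 * total / len(actual)
```

Under this definition, a perfect forecast gives 0, and a forecast of 0 against a non-zero actual value gives the maximum of 200.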
Regarding the SVR and XGBoost models, which follow the machine learning approach with parameter optimization by Grid Search, their results may have been limited by a suboptimal configuration, which can lead to inferior performance31. Another relevant issue is that, although these two models are widely successful, they might not be the most suitable for a TB/HIV dataset with intrinsic temporal characteristics. There is also the possibility of the curse of dimensionality or overfitting, coupled with a deficiency in capturing seasonal patterns and trends. Thus, models like SVR and XGBoost might not be as effective in modeling temporal patterns without extensive feature engineering55,56.
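To illustrate the kind of feature engineering referred to above, the sketch below converts a univariate monthly series into a supervised table of lagged values plus a calendar-month feature, a form that models such as SVR or XGBoost can consume. It is an assumption-laden example rather than the study's pipeline: the function name `make_lag_features`, the choice of lags, and the month feature are all hypothetical.

```python
def make_lag_features(series, n_lags=12, start_month=1):
    """Build (rows, targets) from a monthly series.

    Each row holds the previous `n_lags` values followed by the calendar
    month (1 to 12) of the target; `start_month` is the month of series[0].
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        lags = list(series[t - n_lags:t])        # most recent value last
        month = (start_month - 1 + t) % 12 + 1   # seasonal indicator
        rows.append(lags + [month])
        targets.append(series[t])
    return rows, targets
```

Without features like these, a tree-based or kernel model sees each observation in isolation and has no direct access to the temporal ordering or the seasonal position of a data point.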
On the other hand, deep learning models, designed specifically to manage large volumes of data, exhibit a remarkable ability to generalize. This is largely due to their ability to efficiently extract and organize feature hierarchies. Among these, models such as Long Short-Term Memory (LSTM) prove particularly effective in capturing temporal sequences and identifying long-range dependencies present in the data6.
Moreover, despite their sensitivity to configuration, deep learning models tend to be more robust compared to other methods. This robustness stems from their unique ability to learn complex data representations, allowing more flexible adaptation to different types of patterns and structures inherent in the datasets they are trained on.
Our results based on time series and predicting the future behavior of TB/HIV coinfection suggest that, in Mato Grosso, meeting the targets proposed for 2030 by the UN SDGs and the End TB Strategy seems increasingly unlikely. According to Silva et al. (2021), if TB is not controlled and the current death rate continues, 31.8 million people will die from TB between 2020 and 2050, leading to an economic loss of 17.5 billion dollars.
The epidemiological scenario projected in this study shows that the incidence of TB/HIV will not be reduced unless decisive measures are taken by policymakers and health professionals. According to WHO, recommendations to advance TB/HIV coinfection control include TB screening for all PLHIV at the time of diagnosis and all follow-up visits, as well as routine HIV testing for all TB patients. It is crucial that PLHIV with active TB receive both TB treatment and antiretroviral therapy (ART). Furthermore, PLHIV without active TB should receive TB preventive treatment to reduce the risk of developing the disease58.
Brazil has invested in public policies that span the health sector with intra and intersectoral scope, aiming to accelerate efforts to eliminate diseases such as TB. In 2023, the country’s Ministry of Health reaffirmed its commitment to eliminating TB and announced the goal of eliminating the disease by 2030, advancing the target initially proposed in the National Plan (10 cases per 100,000 inhabitants and less than 230 deaths per year) by five years4.
Among the implemented policies, the Healthy Brazil Program – Unite to Care, established on 7 February 2024, stands out as a government policy aligned with the SDGs, aimed at controlling TB, HIV, and other socially determined diseases. The priority is for those affected by these diseases to undergo proper treatment, with reduced costs and better results in the network of health professionals and services59. Added to this is the resumption of investment in innovation, science, and technology, including exclusive funding for TB research4. These actions may represent an advancement in terms of addressing TB/HIV coinfection in the country, especially in the state of Mato Grosso, the setting of this study, where future predictions for the incidence of this coinfection are not encouraging.
This study stands out for its use of a wide array of predictive models, including deep learning approaches such as Bidirectional LSTM and CNN + LSTM, which showcases its strength in employing cutting-edge data analysis techniques to address complex epidemiological challenges. These models demonstrated superior performance in capturing the intricate patterns within the data, suggesting their potential utility in guiding more effective TB/HIV prevention and control strategies.
However, the study is not without limitations. One significant concern is the accuracy of the dataset used, which may be compromised by incomplete record-keeping or underreporting of TB/HIV cases. Such discrepancies could skew the analysis and affect the reliability of the predictive models. Potential inaccuracies arise from various sources, including data entry errors, inconsistencies in reporting practices, and limitations in diagnostic capabilities, particularly in resource-limited settings.
To mitigate the problem of data inaccuracies in future studies, several strategies can be implemented. Enhancing data collection processes through standardized reporting protocols and comprehensive training for healthcare workers is essential. Employing robust data validation and cleaning techniques, along with integrating data from multiple sources such as electronic health records and community health surveys, can improve data accuracy. Additionally, using advanced analytical methods such as machine learning algorithms to handle missing data, and establishing continuous monitoring and feedback systems, will help identify and correct emerging issues promptly.
Another limitation, in addition to data precision, is the use of univariate time series analysis, which ignores the impact of socioeconomic variables and other external factors that significantly influence TB incidence rates. Incorporating these multifaceted factors could provide a more holistic understanding of the drivers behind the trends in TB/HIV coinfection and enhance the predictive accuracy of the models.
Furthermore, deep learning models, such as Bidirectional LSTM and CNN + LSTM, are sensitive to their configuration. Parameter optimization was performed using Grid Search; however, suboptimal configurations might still affect performance. These models also require substantial computational resources and time for training. Our models accounted for seasonality and trends in the data, but abrupt changes in external conditions (e.g., public health interventions, pandemics) could alter these patterns, reducing the predictive accuracy of the models.