Ensemble-based eye disease detection system utilizing fundus and vascular structures

Datasets

This work employed three open-source datasets of retinal disease images: the Multi-Label Retinal Diseases (MuReD) dataset29 from the Mendeley platform, the Retinal Fundus Multi-Disease Image Dataset (RFMID)30 from the IEEE Dataport, and the Digital Retinal Images for Vessel Extraction (DRIVE) dataset31. All three datasets are publicly accessible and widely used in research, which supports reliable and reproducible evaluation of the proposed algorithm.

The MuReD and RFMID datasets each contain 3000 high-resolution color retinal images. Each image measures 512×512 pixels and is stored in JPEG format in RGB mode. The MuReD dataset comprises 740 cataract, 752 diabetic retinopathy, 759 myopia, and 749 normal fundus images; the RFMID dataset comprises 750 cataract, 750 diabetic retinopathy, 743 myopia, and 757 normal fundus images. These images were randomly sampled to maintain a balanced distribution across categories, which is critical for training unbiased classification models.

For data partitioning, the merged dataset of 6000 images was divided into training (70%), validation (15%), and test (15%) sets to support robust model generalization. The proportion of samples in each category is approximately equal, and the subjects have an approximately 1:1 gender ratio, ensuring balanced categories. This balance is essential for training accurate and unbiased diagnostic models. Table 2 displays the distribution of the disease categories in these datasets.
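A minimal sketch of how such a stratified 70/15/15 split can be produced with scikit-learn is shown below; the file paths and labels are placeholders, not the actual dataset listing.

```python
from sklearn.model_selection import train_test_split

# Placeholder inputs: in practice these come from the merged MuReD + RFMID image list.
image_paths = [f"images/img_{i:04d}.jpg" for i in range(6000)]
labels = ["cataract", "dr", "myopia", "normal"] * 1500  # balanced toy labels

# 70% training, then the remaining 30% is halved into 15% validation and 15% test,
# stratifying on the labels so every split keeps the class balance.
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

print(len(train_x), len(val_x), len(test_x))  # 4200 900 900
```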

Table 2 Distribution of Categories in the Retinal Image Datasets.

The DRIVE dataset consists of fundus images and their vessel segmentation masks for both the training and test sets; each set contains 20 images with their corresponding masks.

Preprocessing and training

During preprocessing, we removed low-quality, overexposed, or blurry images, improving the clarity and consistency of the training data. The following evaluation metrics were employed: accuracy, recall, precision, F1-score, ROC curves, and AUC values.

In order to improve the model’s ability to generalize and avoid overfitting, extensive data augmentation techniques were implemented on the training data. These included random rotation within a range of ±20 degrees, horizontal flipping, irregular circular cropping, Gaussian blur, and Contrast Limited Adaptive Histogram Equalization (CLAHE).

The selection of these preprocessing techniques was based on the following considerations (a sketch of the corresponding augmentation pipeline follows the list):

  • Image resizing and normalization: Ensured consistency in image size and pixel intensity, reducing unnecessary variability and enabling the model to focus on important disease-related features.

  • Random rotation and horizontal flipping: Simulated variations in patient head position and device angles, helping the model learn features such as those of myopia and cataracts from different orientations.

  • Contrast Limited Adaptive Histogram Equalization (CLAHE): Enhanced local contrast, improving the visibility of blood vessels and small lesions, which is crucial for detecting diabetic retinopathy.

  • Gaussian blur: Reduced image noise, enabling the model to concentrate on broader structural features, such as the blurriness characteristic of cataracts.

  • Irregular circular cropping: Ensured the retinal region remained centered in the image, improving the model’s ability to detect changes in peripheral retinal morphology, such as those caused by myopia and diabetic retinopathy.
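The following is a minimal sketch of such an augmentation pipeline using OpenCV and NumPy. The library choice and the specific parameter values (blur kernel size, CLAHE clip limit, crop radius) are assumptions for illustration, not the exact settings used in this work.

```python
import cv2
import numpy as np

def augment(image):
    """Apply the augmentations described above to a 512x512 RGB fundus image (uint8)."""
    # Random rotation within ±20 degrees around the image centre.
    angle = np.random.uniform(-20, 20)
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, matrix, (w, h))

    # Random horizontal flip.
    if np.random.rand() < 0.5:
        image = cv2.flip(image, 1)

    # Gaussian blur to suppress high-frequency noise.
    image = cv2.GaussianBlur(image, (5, 5), 0)

    # CLAHE on the lightness channel to enhance local contrast.
    lab = cv2.cvtColor(image, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    image = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)

    # Circular crop: zero out everything outside a retinal disc mask so the
    # retinal region stays centred.
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.circle(mask, (w // 2, h // 2), int(0.48 * min(h, w)), 255, -1)
    return cv2.bitwise_and(image, image, mask=mask)
```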

This experiment was conducted on a machine equipped with an AMD Ryzen 5 3600 CPU (3.50 GHz base frequency, 6 cores, 12 logical processors), an NVIDIA P100 GPU, 16 GB RAM, 16 GB virtual memory, a 120 GB SSD, and a 1 TB HDD. The experiments were performed using Jupyter notebooks in Anaconda. The proposed model was implemented in Python using common libraries such as pandas, NumPy, Matplotlib, Seaborn, TensorFlow, Keras, and Scikit-learn.

Subsequently, we evaluated the retinal disease classification model using several key metrics. The confusion matrix provided the foundation for calculating accuracy, which is essential for assessing the classification performance. The Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) were used to visualize and evaluate the model’s performance, especially in multi-class classification. We employed ten-fold cross-validation to ensure the robustness and generalizability of the model.
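A minimal sketch of these evaluation steps using scikit-learn is given below; the synthetic features produced by make_classification merely stand in for the real extracted deep features, and the SVM is used only as an example classifier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy stand-ins for the extracted deep features and the four disease labels.
X, y = make_classification(n_samples=600, n_features=64, n_classes=4,
                           n_informative=16, random_state=0)

clf = SVC(probability=True, random_state=0)

# Ten-fold cross-validation for robustness.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold accuracy:", cross_val_score(clf, X, y, cv=cv).mean())

# Hold-out style evaluation: confusion matrix, precision/recall/F1, and multi-class AUC.
clf.fit(X[:480], y[:480])
y_pred = clf.predict(X[480:])
y_prob = clf.predict_proba(X[480:])
print(confusion_matrix(y[480:], y_pred))
print("accuracy:", accuracy_score(y[480:], y_pred))
print(precision_recall_fscore_support(y[480:], y_pred, average="macro"))
print("macro AUC (one-vs-rest):", roc_auc_score(y[480:], y_prob, multi_class="ovr"))
```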

The backbone of the segmentation model

To determine the most suitable and accurate vessel segmentation model for future research requirements, we conducted a comprehensive comparison and analysis of five segmentation techniques: U-Net32, SegNet33, DenseNet34, ResNet35, and Attention U-Net36.

Table 3 Performance Comparison of Vessel Segmentation Models: U-Net, SegNet, DenseNet, ResNet, and Attention U-Net. Results reveal that U-Net achieved the highest performance metrics.

Table 3 presents the performance metrics of the segmentation models applied to the vessel segmentation task. The results demonstrate that U-Net outperformed the other models on every metric evaluated, particularly F1 score and accuracy. Moreover, U-Net achieves a higher frames-per-second (FPS) rate, indicating faster inference, which is advantageous for practical real-world applications. Consequently, U-Net was selected for vessel segmentation.
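For reference, the following is a compressed U-Net-style encoder-decoder written in Keras. The depth, filter counts, and input size are illustrative assumptions rather than the exact configuration trained on DRIVE.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 convolutions, the basic U-Net building block."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(512, 512, 3)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: contracting path whose feature maps are kept as skip connections.
    c1 = conv_block(inputs, 32); p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64);     p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 128);    p3 = layers.MaxPooling2D()(c3)

    # Bottleneck.
    b = conv_block(p3, 256)

    # Decoder: upsample and concatenate with the matching encoder feature map.
    u3 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    c4 = conv_block(layers.Concatenate()([u3, c3]), 128)
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.Concatenate()([u2, c2]), 64)
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c5)
    c6 = conv_block(layers.Concatenate()([u1, c1]), 32)

    # One-channel sigmoid output: per-pixel vessel probability.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c6)
    return Model(inputs, outputs, name="unet_vessel_segmentation")

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```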

The dual-branch input for various backbones of the feature extractor

We examined six well-established deep learning models: MobileNetV3, ResNet50, DenseNet, Swin Transformer, EfficientNetV2, and InceptionV3. Each model has its strengths. ResNet50 excels at capturing the fine details critical for identifying small lesions in diabetic retinopathy, while DenseNet enhances feature propagation for recognizing complex pathologies. MobileNetV3 offers a lightweight and efficient design, making it well suited to mobile screening without compromising accuracy. EfficientNetV2 effectively processes high-resolution retinal images, which is essential for detecting subtle abnormalities. InceptionV3 specializes in multi-scale feature extraction, making it suitable for comprehensive pathology assessments. Lastly, Swin Transformer employs a hierarchical design to capture both local and global structural changes, which is crucial for analyzing complex deformations.

Each model was trained with a diverse range of augmentation techniques, such as flipping, rotation, CLAHE, and Gaussian blur, using both dual-branch and single-retina input methods. Figure 3 displays the performance of the different models when single-retina and dual-branch inputs are combined with the various processing methods.

Fig. 3

This figure demonstrates the performance of models using single retina and dual-branch inputs with various data augmentation techniques. The results indicate that the combination of dual-branch input with Gaussian blur and flipping yields the highest accuracy, underscoring the effectiveness of these augmentation strategies.

Figure 3 shows that the dual-branch input, combined with Gaussian blur and flipping, achieved the highest accuracy across all models tested, highlighting the effectiveness of this strategy in enhancing model performance. Among the data augmentation combinations evaluated for the dual-branch input, the pairing of Gaussian blur with flipping was consistently the most successful.

Figure 4 illustrates that the dual-branch input strategy, together with Gaussian blur and image flipping, greatly enhances the model’s effectiveness in diagnosing eye diseases. In particular, the dual-branch input allows the model to focus accurately on the crucial regions within the retinal images, thus improving the accuracy of disease classification.

In our experiments, the feature extraction model adopted a transfer learning strategy: the network was first initialized with weights pre-trained on a large-scale dataset (ImageNet) and then fine-tuned for the target task. As shown in Table 4, compared to training from scratch, the pre-trained model achieved higher accuracy and more stable performance across multiple metrics. This improvement is mainly because the pre-trained network has already learned general low- and mid-level features, so only a small amount of labeled data is needed during fine-tuning to capture task-relevant features, significantly reducing reliance on large-scale labeled datasets. By employing this “pre-trained + fine-tuning” approach, the model demonstrated clear advantages in feature extraction, highlighting the importance of transfer learning in disease classification problems of this kind. Below, a comprehensive analysis of each condition is provided.

  • Cataracts are commonly identified by clouding of the lens in the eye37. With the vessel-segmented images supplied through the dual-branch input, the model detects abnormal areas of the lens more accurately. The heatmap demonstrates that the model’s attention is primarily directed towards the central retina, a frequently affected area for cataract-related abnormalities.

  • Diabetic retinopathy is characterized by changes in the small blood vessels of the retina, such as the development of small bulges, bleeding, and the formation of new blood vessels38. By employing vessel-segmented images, the model is able to identify these small changes in the blood vessels with greater clarity, thereby enhancing its diagnostic capabilities. The heatmap generated by our proposed model clearly shows that the model’s attention is focused on the intersections of the local retina, effectively capturing the distinctive characteristics associated with retinal lesions.

  • Myopia, also known as nearsightedness, is a refractive error of the eye that causes distant objects to appear blurry. Myopia frequently occurs alongside structural changes in the retina, such as thinning or distortion of the posterior pole. The dual-branch input, particularly in conjunction with Gaussian blur processing, enhances the model’s capacity to detect these morphological alterations. The heatmap indicates that the model predominantly concentrates on the posterior pole of the retina, which aligns with the region affected by myopia.

  • For normal retinas, the model demonstrates a more consistent region of focus when given the dual-branch input. This suggests that the model accurately distinguishes between normal and pathological retinal images rather than focusing only on the boundaries, which plays a crucial role in reducing false-positive results and improving diagnostic accuracy.

Fig. 4

Grad-CAM Heatmap Comparison of single retina and dual-branch Models. (C) indicates cases of cataracts. (D) indicates diabetic retinopathy. (M) indicates myopia. (N) indicates normal.

To summarize, the dual-branch input method we have presented enhances the model’s ability to extract features, enabling more accurate identification and classification of various retinal diseases.
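To make the dual-branch design and the pre-training/fine-tuning schedule described above more concrete, the following is a minimal Keras sketch. It is illustrative only: the fusion point, the lightweight CNN standing in for the vessel-mask branch, and all layer sizes are assumptions rather than the reported configuration.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

# Branch 1: colour fundus image through an ImageNet-pretrained ResNet50 backbone.
fundus_in = layers.Input(shape=(512, 512, 3), name="fundus")
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False  # freeze first, fine-tune later
fundus_feat = backbone(fundus_in)  # 2048-dimensional pooled features

# Branch 2: the U-Net vessel segmentation mask through a small CNN
# (a lightweight stand-in; the text does not specify this branch's backbone).
vessel_in = layers.Input(shape=(512, 512, 1), name="vessel_mask")
v = layers.Conv2D(32, 3, strides=2, activation="relu")(vessel_in)
v = layers.Conv2D(64, 3, strides=2, activation="relu")(v)
v = layers.Conv2D(128, 3, strides=2, activation="relu")(v)
vessel_feat = layers.GlobalAveragePooling2D()(v)

# Fuse both branches and classify the four categories (C, D, M, N).
fused = layers.Concatenate()([fundus_feat, vessel_feat])
x = layers.Dense(256, activation="relu")(fused)
outputs = layers.Dense(4, activation="softmax")(x)

model = Model([fundus_in, vessel_in], outputs, name="dual_branch_extractor")
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Typical transfer-learning schedule: train the classification head first, then
# unfreeze the backbone and continue training with a low learning rate.
# model.fit(...); backbone.trainable = True; model.compile(...); model.fit(...)
```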

On the ensemble of various classifier models

The experiments conducted in this study aimed to evaluate the performance of various machine learning algorithms in classifying retinal diseases. We implemented several widely used algorithms, including Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGB), and Light Gradient Boosting (LGB). The primary focus was on determining the best-performing models in terms of accuracy across different datasets.

For the experiments, features were extracted from retinal images using transfer learning models as discussed above. Taking ResNet50 as an example, it was first truncated to remove the final fully connected layer, and then features were extracted from the pooling layer following the final residual block. The extracted features, with dimensions of 4200×2048, were then fed into the downstream classifier models for classification tasks. We employed ten-fold cross-validation to validate the model’s robustness and generalizability. Specifically, the dataset was split into ten equal parts, with nine parts used for training and one part for validation in each iteration. This process was repeated ten times, ensuring that the model would perform consistently across unseen data and minimize overfitting.
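As a concrete illustration of this truncation step, the sketch below uses the tf.keras ResNet50 implementation; the random array merely stands in for the 4200 preprocessed training images, and the exact preprocessing applied before extraction is an assumption.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Truncated ResNet50: include_top=False drops the fully connected classifier and
# pooling="avg" returns the global-average-pooled output of the last residual block.
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Placeholder batch standing in for the preprocessed training images.
images = np.random.rand(8, 512, 512, 3).astype("float32") * 255.0
features = extractor.predict(preprocess_input(images))
print(features.shape)  # (8, 2048); (4200, 2048) on the full training set
```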

Table 4 Comparison of Feature Extraction Performance between Pre-trained + Fine-tuned and Trained-from-Scratch Deep Learning Models. Left: Methods utilizing pre-trained weights with subsequent fine-tuning; Right: Methods trained entirely from scratch (without pre-training).

We evaluated the performance of each algorithm individually and selected those with the highest average accuracies. The results of the experiments are presented in Table 4, which shows the accuracy of each algorithm when using features from the six transfer learning backbones: MobileNetV3 (MN), ResNet50 (RN), DenseNet (DN), Swin Transformer (ST), EfficientNetV2 (EN), and InceptionV3 (IN).

The SVM, MLP, and XGB models consistently outperformed the others, achieving average accuracies of 95.28%, 93.57%, and 93.35%, respectively. Notably, SVM achieved the highest accuracy, indicating its robustness and reliability across different feature extraction methods. The MLP and XGB models also demonstrated strong performance. Following the identification of the best-performing models, we incorporated SVM, MLP, and XGB into an ensemble learning framework. This approach was intended to leverage the strengths of each algorithm to improve the overall classification accuracy and stability of the model.
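A minimal sketch of how such an ensemble could be assembled with scikit-learn’s VotingClassifier is shown below. The hyper-parameters are illustrative defaults, and the xgboost import is an assumption, since the text does not name a specific XGB implementation.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumed implementation of XGB

# Base models selected for the ensemble; hyper-parameters are illustrative.
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0))
xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, random_state=0)

# voting="soft" averages the predicted class probabilities of the three models;
# voting="hard" would instead take a simple majority of their predicted labels.
ensemble = VotingClassifier(
    estimators=[("svm", svm), ("mlp", mlp), ("xgb", xgb)], voting="soft")

# ensemble.fit(train_features, train_labels)
# predictions = ensemble.predict(test_features)
```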

Table 5 displays the retinal image classification performance of various base models while employing two different voting classifiers. For MobileNetV3, the accuracies are 95.3% (hard voting) and 96.5% (soft voting); for ResNet50, 98.8% and 99.2%; for DenseNet, 92.7% and 93.9%; for Swin Transformer, 94.2% and 95.4%; for EfficientNetV2, 93.5% and 94.7%; and for InceptionV3, 90.1% and 91.3%. Table 5 indicates that the average classification accuracies of the hard voting and soft voting classifiers in this study are 94.1% and 95.2%, respectively. It is clear that the soft voting classifier outperforms the hard voting classifier. Therefore, in our model, we selected the soft voting classifier.

In addition to cross-validation, we used multiple evaluation metrics to assess the model’s performance. These included Accuracy, Recall, Precision, F1 Score, as well as the Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC). These metrics helped ensure a comprehensive evaluation of the model’s performance across different categories, with macro and weighted averages used for overall performance assessment.

The primary advantage of the soft voting classifier lies in its ability to weigh the predicted probabilities of each base model, rather than simply relying on the majority vote as in hard voting. This allows the soft voting classifier to incorporate more nuanced information from each model, improving overall classification stability and accuracy. Specifically, the soft voting classifier can better handle situations where models disagree by accounting for the confidence levels of their predictions, leading to a more balanced decision-making process. As a result, it provides a robust mechanism for integrating the strengths of diverse classifiers, such as SVM, MLP, and XGB, ensuring that the final predictions are not overly influenced by outliers or misclassifications from individual models.
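The following toy example, with invented probabilities, illustrates how soft voting can overturn a hard-voting majority when one model is highly confident.

```python
import numpy as np

# Invented per-class probabilities from three base models for one image,
# over the four classes (C, D, M, N).
probs = np.array([
    [0.35, 0.40, 0.15, 0.10],   # SVM: predicts D, but with low confidence
    [0.30, 0.45, 0.15, 0.10],   # MLP: predicts D, also low confidence
    [0.90, 0.05, 0.03, 0.02],   # XGB: predicts C with high confidence
])

hard_vote = np.bincount(probs.argmax(axis=1), minlength=4).argmax()
soft_vote = probs.mean(axis=0).argmax()
print("hard voting picks class", hard_vote)  # 1 (D): two of three models agree
print("soft voting picks class", soft_vote)  # 0 (C): the confident model outweighs two weak votes
```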

This characteristic of soft voting is particularly beneficial in the context of retinal disease classification, where subtle variations in the images can lead to different interpretations by base classifiers. By averaging the probabilities, the soft voting classifier helps minimize the impact of any one model’s potential errors, leading to a more reliable and consistent performance. These improvements are especially significant in medical applications like retinal disease diagnosis, where the stability and interpretability of the model are crucial for clinical decision-making.

Table 5 Accuracy performance analysis for ensemble models. The table presents the classification accuracy of various base models using hard and soft voting. Soft voting consistently outperformed hard voting.

The confusion matrix in Figure 5 shows that the per-class TP (True Positive) rates are 100.00%, 98.00%, 100.00%, and 98.80%; the TN (True Negative) rates are 100.00%, 99.60%, 100.00%, and 100.00%; the FP (False Positive) rates are 0.00%, 0.40%, 0.00%, and 0.00%; and the FN (False Negative) rates are 0.00%, 2.00%, 0.00%, and 1.20%. The TP and TN rates are thus very high while the FP and FN rates are very low, which is critical for models that detect retinal diseases. We also plotted the ROC curves for each class; the AUC scores and curves indicate superior model performance.

Fig. 5

Performance of the soft voting classifier in retinal disease classification. (a) shows the confusion matrix of the RetinaDNet model. (b) shows the ROC curve for each class together with the macro-average and weighted-average ROC curves.

Table 6 shows that the model’s performance on the training set (99.5% accuracy) and the test set (99.2% accuracy) are very close, demonstrating good generalization ability and no signs of overfitting. Overfitting typically results in significantly lower accuracy on the test set than on the training set. In our case, the difference is small, with the accuracy on the test set slightly lower than on the training set, which aligns with normal learning patterns.

Table 6 Confusion Matrix and Accuracy on Training Set.

The high accuracy for Class C and Class M is due to their distinct features in retinal images, such as the overall blurriness in cataracts and the optic disc deformation in myopia. Additionally, the large, high-quality dataset for these classes further supports the model’s outstanding performance. To prevent overfitting, we applied techniques such as data augmentation, regularization, and early stopping, ensuring the model’s generalization ability remains strong.

Moreover, the model’s high accuracy is also attributed to the dual-branch feature fusion input and the use of ensemble learning methods. These strategies significantly enhance the classification accuracy, particularly in recognizing complex structural features. Therefore, the model shows no signs of overfitting and achieves excellent performance through thoughtful design and optimization.

Figure 6 presents heatmaps showing how the regions attended to by our classification model improve when ensemble learning is used to integrate multiple machine learning models, compared to using a single retina input. The heatmaps demonstrate the higher interpretability of the model’s results, with the areas of focus highly consistent with current medical knowledge39. In our ablation studies, when the ensemble learning strategies were removed or replaced, the model’s focus shifted to less relevant areas of the retinal images, leading to inaccurate classifications. This highlights the crucial role of ensemble learning in guiding the model to attend to medically significant regions and in improving the overall classification accuracy.

Fig. 6

Heatmap of Feature Focus Improvements with Ensemble Learning. The figure illustrates the significant enhancements in feature focus achieved by our classification model after integrating ensemble learning strategies. The highlighted areas align well with known medically relevant regions, demonstrating the model’s increased interpretability and accuracy.

To mitigate the risk of overfitting caused by small sample sizes, we implemented a series of targeted strategies. First, rather than training the model from scratch, we leverage weights pre-trained on large-scale datasets and fine-tune them, making full use of the prior knowledge learned from general features. Second, we incorporate an ensemble learning approach by weighting or voting on the predictions of the different downstream classification models, enhancing overall stability and generalization. Furthermore, we take advantage of auxiliary information such as vascular segmentation: by integrating pathology-related features at the input stage through a dual-branch feature input structure, we enrich the model’s representational capacity. These multi-level strategies (pre-training, vascular segmentation integration, and model prediction fusion) enable us to fully exploit the limited data available while significantly bolstering the model’s resistance to overfitting, providing an effective solution to the small-sample issue commonly encountered in medical imaging.

In summary, the experimental results validate the effectiveness of the selected algorithms and the ensemble learning approach in accurately classifying retinal diseases. The ablation studies further underscore the importance of ensemble learning strategies by showing how their absence negatively impacts the model’s accuracy and interpretability. The integration of SVM, MLP, and XGB not only enhances the classification performance but also provides a balanced trade-off between accuracy and computational efficiency40. These findings pave the way for the development of advanced diagnostic tools that can assist healthcare professionals in identifying and treating retinal disorders with greater accuracy41.

Model interpretability for enhanced clinical usability

The interpretability of machine learning models is crucial for ensuring their clinical usability, particularly in the context of healthcare, where decision-making transparency is essential. Various interpretability tools are available to help healthcare professionals understand and trust the model’s decision-making process42,43,44,45,46. These tools include feature importance methods, saliency maps, and more advanced techniques like Grad-CAM (Gradient-weighted Class Activation Mapping), which provide visual explanations of model predictions by highlighting the regions of interest in input images.

Among these methods, Grad-CAM has proven to be particularly effective in the context of medical image classification. Grad-CAM generates heatmaps that highlight the areas of the input image that most influence the model’s predictions, offering a direct link between the model’s focus and the regions in the image47,48,49. This is especially useful in medical applications, where understanding which features the model is attending to can provide valuable insights for clinicians. For instance, when classifying retinal diseases, Grad-CAM can pinpoint specific regions in the retinal image, such as the central region for cataracts, the retinal vascular area for diabetic retinopathy, and the peripheral retina for myopia, thus improving interpretability and facilitating clinical validation of the model’s decision-making process.
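A minimal Grad-CAM sketch in TensorFlow/Keras follows; it is not the authors’ implementation. It is written for a single-input backbone for simplicity, and the target convolutional layer name must be chosen per model (for instance, "conv5_block3_out" in the Keras ResNet50).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    """Grad-CAM heatmap for one preprocessed image of shape (H, W, C)."""
    # Model mapping the input to the last conv feature map and the predictions.
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]

    # Gradient of the class score w.r.t. the conv feature map, then channel
    # weights via global average pooling of those gradients.
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Weighted sum of the feature maps, ReLU, and normalisation to [0, 1].
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    cam = tf.maximum(cam, 0) / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()  # upscale and overlay on the fundus image for display
```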

In our study, we employed Grad-CAM to visualize the model’s focus in retinal images. As shown in Figure 7, the model’s attention was directed to the following areas: the central region for cataracts, corresponding to lens opacification; the retinal vascular area for diabetic retinopathy, indicating microvascular damage; the peripheral retina for myopia, reflecting ocular shape changes and retinal stretching; and the entire retinal area for normal images, with no apparent lesions. These focus areas align closely with known medical characteristics of these diseases, demonstrating that the model’s decisions are interpretable and grounded in relevant clinical features. The heatmap visualizations, with arrows pointing to the lens and retinal areas, provide a transparent understanding of the model’s behavior, thereby enhancing its clinical applicability.

Fig. 7

Model’s Different Focus Areas under Grad-CAM Visualization for Each Category.

By integrating Grad-CAM into our model, we offer not only high classification accuracy but also an interpretable framework that can support healthcare professionals in diagnosing retinal diseases with confidence.

Ablation study

To further investigate the importance of each module within our proposed RetinaDNet framework, we conducted an ablation study by systematically removing key components:

  • RetinaDNet (Full): The complete model, which includes the dual-branch (original + vascular) inputs, pre-training on ImageNet, and ensemble learning.

  • RetinaDNet w/o Vascular Branch: The vascular segmentation branch was removed, using only single-branch input (fundus images).

  • RetinaDNet w/o Pre-training: Trained from scratch without using any pre-trained weights.

  • RetinaDNet w/o Ensemble: The ensemble learning stage was replaced by the best single classifier (SVM).

Table 7 shows the classification performance under each setting. It is evident that removing any of these modules leads to a noticeable decline in performance, suggesting that all three design elements—vascular feature integration, transfer learning (pre-training), and ensemble learning—jointly contribute to the superior accuracy and stability of our approach.

Table 7 Ablation study results on the retinal disease classification task. Removing the vascular branch, pre-training, or ensemble learning each results in a notable reduction in performance, underscoring the importance of every module.

Comparison to existing work

Table 8 presents a comprehensive comparison of the performance of the proposed RetinaDNet with that of other state-of-the-art models18,19,20,21,22,23,24. The table covers multiple datasets, including the combined MuReD and RFMID datasets used in this work as well as the widely used public benchmarks MESSIDOR2 and EyePACS-1. The results clearly demonstrate that RetinaDNet’s dual-branch feature fusion, combined with ensemble learning, significantly enhances classification accuracy. Specifically, RetinaDNet achieves a top accuracy of 99.2% on the combined MuReD and RFMID datasets. Furthermore, when applied to the external datasets MESSIDOR2 and EyePACS-1 (indicated by *), RetinaDNet achieves accuracies of 97.5% and 91.0%, respectively, outperforming or remaining highly competitive with established methods such as Yaqoob et al. (97.0% on MESSIDOR2) and Gulshan et al. (90.3% on EyePACS-1), which rely on single deep learning models. This consistently superior or competitive performance across diverse datasets underscores the robustness and generalizability of our approach. The improvements are directly attributable to the synergistic effect of integrating both retinal image features and vessel segmentation information through the dual-branch architecture, coupled with the stability and accuracy gains provided by the ensemble learning strategy. RetinaDNet is also compared to the top ten results from the ODIR-2019 competition; Table 9 indicates that the proposed RetinaDNet achieves higher accuracy than current methods and can therefore better assist doctors in clinical diagnosis17.

In RetinaDNet, the dual-branch input architecture simultaneously integrates two sources of information: the retinal image, which contains the lesion’s textural features, and the vessel segmentation image, which highlights vascular pathways and morphological changes. By fusing these inputs during the feature extraction stage, the model not only captures more comprehensive lesion details but also effectively filters out background noise, thereby significantly improving its ability to detect subtle lesions and local structural abnormalities. Consequently, this leads to notably enhanced diagnostic accuracy.

Table 8 Comparison of Methods for Diabetic Retinopathy Diagnosis. Note: RetinaDNet (Ours) is evaluated on the combined MuReD and RFMID datasets as well as on external datasets. Rows marked with (*) represent RetinaDNet (Ours) applied to external datasets, achieving slightly improved performance over baseline methods.
Table 9 Test results and rankings in the ODIR-2019 competition.

Moreover, the proposed RetinaDNet framework distinguishes itself from previous approaches by leveraging dual-branch feature fusion and employing ensemble learning as the downstream model. This is unlike the other methods used in the comparison in Table 1, none of which have incorporated such a comprehensive strategy. The dual-branch design allows a more efficient integration of complementary features, while the ensemble learning model ensures a robust and consistent performance across various conditions. This combination not only enhances the extraction of hidden vascular information but also fully exploits the rich details contained within retinal images. This ability to use previously overlooked structural data makes RetinaDNet especially promising for small-sample diagnosis scenarios, where accurate feature extraction is critical for effective clinical support. The superior performance of RetinaDNet highlights its potential to significantly impact real-world applications by offering a more reliable and informative diagnostic tool. 

The improvement in our retinal lesion classification performance can be attributed to several key factors: i) Introducing vessel segmentation images as an auxiliary input channel improved precision by providing additional structural information, allowing the model to focus on critical regions within the retinal images. ii) We rigorously selected and combined the top three machine learning algorithms, significantly enhancing the model’s stability and precision. iii) The soft voting classifier outperformed the hard voting classifier by effectively combining well-calibrated probability estimates, leading to improved accuracy.
