Explainable attention-based deep learning for classification and interpretation of heart murmurs using phonocardiograms
This section provides a comprehensive evaluation of the proposed Explainable HeartSound Transformer (EHST) for multi-class heart sound diagnosis. We assess its performance across multiple datasets and compare it with several state-of-the-art baseline models. Our analysis includes overall classification performance, class-wise breakdowns, rare-class detection, ablation studies, confusion matrix analysis, and interpretability assessed through explainability methods such as Grad-CAM and SHAP. Additional analyses address segmentation methods, window length sensitivity, data augmentation strategies, demographic-based performance, and computational efficiency.
Overall multi-dataset performance
We evaluated EHST on six datasets: HeartWave, CirCor DigiScope, PhysioNet/CinC, Pascal (A+B), GitHub (Valvular), and Shenzhen (HSS). Performance metrics including accuracy, precision, recall, macro-F1 score, Matthews Correlation Coefficient (MCC), and Area Under the ROC Curve (AUC) were computed and compared with ten baseline models. Table 8 summarizes the results across all six datasets.
The overall results in Table 8 indicate that EHST consistently achieves higher performance than the baseline models. For example, on the HeartWave dataset, EHST attains an accuracy of 96.7% and a macro-F1 score of 95.5%, compared with 94.8% and 93.1%, respectively, for the best baseline. Similarly, on the CirCor DigiScope and PhysioNet/CinC datasets, EHST outperforms the baselines by 3–5% in accuracy and 2–4% in F1 score. The AUC values remain above 0.90 across all datasets, confirming the model’s robust ability to distinguish normal from abnormal heart sounds even in noisy and imbalanced settings. The consistently high Matthews Correlation Coefficient, typically around 0.91–0.94 (not shown in this table), further corroborates EHST’s strong performance on imbalanced data.
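The metrics reported in Table 8 can all be computed with standard scikit-learn calls. The snippet below is a minimal sketch of that evaluation pipeline; the label and probability arrays are illustrative placeholders, not results from the paper.

```python
# Sketch of the Table 8 evaluation metrics using scikit-learn.
# y_true, y_pred and y_prob below are illustrative placeholders only.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 1])
# Per-class probability scores are required for multi-class AUC (one-vs-rest).
y_prob = np.array([
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.1, 0.7, 0.2],
])

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"acc={acc:.3f} macro_f1={macro_f1:.3f} mcc={mcc:.3f} auc={auc:.3f}")
```

Macro averaging weights each class equally regardless of frequency, which is why macro-F1 and MCC are the more informative metrics under the class imbalance discussed throughout this section.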

Grouped bar chart of accuracy across datasets.
Figure 3 presents a grouped bar chart comparing the accuracy of EHST with the Mean Baselines and Best Baseline across the six datasets. It is evident from the chart that EHST achieves higher accuracy values on each dataset. For instance, on the HeartWave dataset, EHST reaches an accuracy of 96.7% compared to 94.8% for the best baseline. This visual comparison highlights EHST’s significant improvement in overall classification accuracy, reinforcing its suitability for clinical applications where high accuracy is crucial.

Line plot of F1 score across datasets.
The line plot in Fig. 4 illustrates the F1 scores for EHST, Mean Baselines, and Best Baseline methods across six datasets. The plot clearly shows that EHST consistently achieves higher F1 scores on all datasets compared to the baseline methods. The trend line for EHST lies above those for both the mean and best baselines, indicating a superior balance between precision and recall. This consistent performance across datasets underscores EHST’s capability to reliably distinguish between classes even in challenging, imbalanced settings.

Box plot of AUC distribution across methods.
The box plot in Fig. 5 depicts the distribution of AUC values for EHST, Mean Baselines, and Best Baseline methods over several cross-validation folds. The vertical extent of each box represents the interquartile range (IQR) of AUC scores, with the median indicated by a horizontal line. EHST exhibits a higher median AUC with a narrower IQR compared to the baseline methods, which implies not only high discriminative power but also low variability across folds. Consistently, AUC values remain above 0.90 across all datasets, underscoring EHST’s robust ability to differentiate between normal and abnormal heart sounds, even in the presence of noise and imbalanced data.
Overall, EHST demonstrates improvements of 3–5% in accuracy and 2–4% in macro-F1 score over baseline methods, underscoring its strong discriminative power and clinical applicability.
Class-wise breakdown on HeartWave
Table 9 presents the detailed class-wise performance (precision, recall, and F1 score) for each of the nine heart sound classes in the HeartWave dataset, evaluated using 5-fold cross-validation.
The class-wise analysis shows that EHST achieves high F1 scores for normal heart sounds (97.9%) and maintains F1 scores above 90% for minority classes such as congenital anomalies and miscellaneous rare conditions. This indicates effective handling of class imbalance through the weighted loss function and attention mechanisms, ensuring that both common and rare pathologies are accurately detected.
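The weighted loss mentioned above counters class imbalance by scaling each class's contribution to the loss. The exact weighting scheme used by EHST is not specified here, so the sketch below assumes the common inverse-frequency formulation; the class counts are illustrative, not the HeartWave counts.

```python
# Sketch of inverse-frequency class weighting (an assumed, common scheme):
# each class k gets weight N / (K * n_k), so rare classes are up-weighted.
import numpy as np

def inverse_frequency_weights(class_counts):
    counts = np.asarray(class_counts, dtype=float)
    n_total, n_classes = counts.sum(), len(counts)
    return n_total / (n_classes * counts)

counts = [500, 120, 30]  # hypothetical: normal, murmur, rare anomaly
weights = inverse_frequency_weights(counts)
print(weights)  # the rarest class receives the largest weight
```

These weights would then multiply the per-class terms of the cross-entropy loss, so that errors on congenital anomalies and other minority classes are penalized more heavily than errors on normal sounds.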
Rare-class detection across multiple datasets
Rare-class detection performance was measured using micro-F1 scores for underrepresented classes across selected datasets. Table 10 presents these results, demonstrating that EHST outperforms the best baseline by 2–3% in detecting rare conditions.
Figure 6 shows a dumbbell plot comparing the rare-class F1 scores of EHST and the best baseline across four datasets: Pascal (A), Pascal (B), GitHub, and Shenzhen (HSS). Each dataset is represented by two markers, one for EHST (tomato) and one for the best baseline (steelblue), joined by a connecting line that makes the performance gap directly visible. EHST outperforms the best baseline on all four datasets, with improvements of approximately 2–3% in F1 score. On Pascal (A), for example, EHST achieves an F1 score of 86.4% compared with 82.9% for the best baseline. Similar trends for Pascal (B), GitHub, and Shenzhen highlight EHST’s robust detection of underrepresented classes, while the relatively narrow gaps between connected markers suggest that performance is consistent across cross-validation folds.

Rare-class F1 Score comparison for EHST vs. best baseline.
This analysis confirms that EHST consistently outperforms baselines in rare-class detection, which is critical for clinical applications where underrepresented pathologies must be reliably identified.
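The rare-class scores in Table 10 correspond to a micro-averaged F1 restricted to the minority-class labels, which scikit-learn supports directly via the `labels` argument. The label identifiers below are hypothetical.

```python
# Hedged sketch of the rare-class micro-F1 computation: f1_score restricted
# to the underrepresented labels. Label ids and arrays are hypothetical.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 3, 3, 3])
rare_labels = [2, 3]  # assumed ids of the underrepresented classes

rare_micro_f1 = f1_score(y_true, y_pred, labels=rare_labels, average="micro")
print(f"rare-class micro-F1: {rare_micro_f1:.3f}")
```

Micro averaging over the restricted label set pools the true positives, false positives, and false negatives of the rare classes only, so frequent classes cannot mask poor minority-class detection.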
Ablation studies and confusion matrix
Ablation studies were performed on the HeartWave dataset (using 5-fold CV) to quantify the contributions of self-attention and Grad-CAM. Table 11 shows that the removal of self-attention results in a significant performance drop, while Grad-CAM removal mainly impacts interpretability.
Removing the self-attention module leads to a 2.7% drop in F1 score, highlighting its importance for identifying critical features such as murmurs. While removing Grad-CAM has minimal impact on classification performance, it diminishes the model’s interpretability—an essential factor in clinical applications (Fig. 7).
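For reference, the Grad-CAM module ablated above follows the standard computation: channel weights from globally averaged gradients, then a ReLU over the weighted activation sum. The sketch below illustrates this for a 1-D feature map; the activations and gradients are random stand-ins, not outputs of the actual EHST network.

```python
# Illustrative numpy sketch of the standard Grad-CAM computation for a
# 1-D (time-axis) feature map; inputs are random stand-ins, not EHST outputs.
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (channels, time) arrays from a target layer."""
    alpha = gradients.mean(axis=1)                  # per-channel importance
    cam = np.maximum((alpha[:, None] * activations).sum(axis=0), 0.0)  # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                       # normalise to [0, 1]
    return cam

rng = np.random.default_rng(0)
acts = rng.random((8, 64))
grads = rng.standard_normal((8, 64))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # one saliency value per time frame
```

Because the heatmap is defined over the time axis, it can be overlaid on the phonocardiogram to highlight which portions of the cardiac cycle (e.g., systolic murmur regions) drove the prediction.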

Grouped bar chart of ablation study metrics for EHST. The chart uses lightcoral for Accuracy, lightseagreen for F1 Score, and lightsteelblue for AUC.
The chart shows that removing the self-attention mechanism causes a marked drop in both accuracy and F1 score, from 96.7% to 94.2% and from 95.5% to 92.3%, respectively. Removing the Grad-CAM module, in contrast, has a negligible effect on these quantitative metrics, as indicated by the nearly identical values when only Grad-CAM is removed. When both components are removed, performance declines further, underscoring the importance of the self-attention mechanism for robust classification. Although the AUC remains relatively stable (97% for the full model, dropping to 93% when both modules are removed), all configurations retain AUC values above 90%, indicating strong discriminative ability overall.
The confusion matrix for the Shenzhen dataset is presented in Table 12. This matrix indicates that misclassifications predominantly occur between the Mild and Severe classes, which may be attributed to overlapping symptom characteristics.
The strong diagonal of the confusion matrix indicates that EHST is highly accurate for Normal and Severe cases. However, some misclassifications occur between Mild and Severe, suggesting that further refinement of training data or feature extraction methods could help improve discrimination between these classes.
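A confusion-matrix analysis of this kind can be reproduced with scikit-learn; the sketch below uses a hypothetical 3-class severity labeling (Normal / Mild / Severe), not the actual Shenzhen predictions.

```python
# Sketch of the confusion-matrix analysis for a 3-class severity task.
# Labels and predictions below are illustrative, not the paper's data.
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["Normal", "Mild", "Severe"]
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 2, 1, 2, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
per_class_recall = cm.diagonal() / cm.sum(axis=1)  # row-normalised diagonal
print(cm)
print(dict(zip(classes, per_class_recall.round(2))))
```

Row-normalising the diagonal makes Mild/Severe cross-errors of the kind seen in Table 12 immediately visible as reduced per-class recall, even when overall accuracy is high.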
Explainability metrics and SHAP analysis
To evaluate the interpretability of EHST, we use both Grad-CAM and SHAP analyses. Table 13 shows the overlap ratio and Intersection-over-Union (IoU) for systolic and diastolic murmur annotations on the HeartWave dataset. Additionally, Table 14 lists the top five features ranked by mean SHAP value.
In addition to these tables, we employ a beeswarm plot to visualize the SHAP values for each feature across the entire HeartWave dataset. Figure 8 displays the beeswarm plot, which illustrates the distribution of SHAP values per feature. Each point represents a single instance; the position along the horizontal axis indicates the impact on the model output, while the color reflects the feature’s value. This visualization helps in understanding not only which features are most influential but also whether higher or lower feature values push the model output in a particular direction.

Beeswarm plot of SHAP values for EHST on HeartWave dataset.
The explainability metrics show that EHST improves overlap and IoU by 5–8% compared to baseline models, indicating superior alignment between the model’s attention maps and expert annotations. The beeswarm plot further corroborates these findings by highlighting that features such as the murmur frequency band, S1 amplitude, and S2 duration have the greatest impact on the predictions. These insights provide a clear, interpretable understanding of the model’s decision-making process, ensuring that clinicians can trust the automated diagnoses. Overall, the combination of quantitative metrics and visualizations like the beeswarm plot confirms the clinical relevance and transparency of EHST.
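The overlap ratio and IoU reported in Table 13 compare a thresholded attention map against an expert murmur annotation, both treated as binary masks over time frames. A minimal sketch of that comparison, with hypothetical masks:

```python
# Minimal sketch of the overlap ratio and IoU between a thresholded attention
# map and an expert annotation; both masks here are hypothetical.
import numpy as np

def overlap_and_iou(attention_mask, annotation_mask):
    inter = np.logical_and(attention_mask, annotation_mask).sum()
    union = np.logical_or(attention_mask, annotation_mask).sum()
    overlap = inter / max(int(annotation_mask.sum()), 1)  # fraction of annotation covered
    iou = inter / max(int(union), 1)
    return overlap, iou

attn = np.array([0, 1, 1, 1, 0, 0], dtype=bool)  # model attention > threshold
anno = np.array([0, 0, 1, 1, 1, 0], dtype=bool)  # expert systolic annotation
overlap, iou = overlap_and_iou(attn, anno)
print(f"overlap={overlap:.2f} IoU={iou:.2f}")
```

The overlap ratio asks how much of the expert-marked region the model attends to, while IoU additionally penalizes attention spent outside the annotation, making the two metrics complementary.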
Statistical validation with A-Test
To rigorously evaluate the statistical significance and robustness of EHST’s performance compared to baseline models, we employed the non-parametric A-Test35. The A-Test quantifies the probability that a randomly selected observation from one distribution (here, the baseline results) will be greater than a randomly selected observation from another distribution (the EHST results):
$$\begin{aligned} A_{12} = P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2), \end{aligned}$$
(15)
where \(X_1\) and \(X_2\) are the distributions of accuracy, macro F1-score, or AUC obtained from cross-validation folds. An \(A_{12}\) value of 0.5 indicates no difference, whereas values approaching 0 or 1 imply a strong effect size in favor of one method. Conventionally, thresholds of \(A > 0.71\) or \(A < 0.29\) are considered large effects, \(0.64 \le A \le 0.71\) or \(0.29 \le A \le 0.36\) medium, and \(0.56 \le A \le 0.64\) or \(0.36 \le A \le 0.44\) small36.
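Equation (15) translates directly into a pairwise comparison over cross-validation folds. The sketch below implements it on illustrative per-fold scores (not the values behind Table 15):

```python
# Direct implementation of Eq. (15), the Vargha-Delaney A statistic,
# evaluated on illustrative (not actual) per-fold accuracy values.
import numpy as np

def a12(x1, x2):
    """A12 = P(X1 > X2) + 0.5 * P(X1 = X2), over all fold pairs."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    greater = (x1[:, None] > x2[None, :]).mean()
    equal = (x1[:, None] == x2[None, :]).mean()
    return greater + 0.5 * equal

baseline = [0.91, 0.92, 0.90, 0.93, 0.92]  # hypothetical per-fold accuracy
ehst = [0.96, 0.97, 0.95, 0.96, 0.97]
print(a12(baseline, ehst))  # near 0: large effect favouring the second sample
```

With the baseline as \(X_1\), values near 0 mean the baseline almost never exceeds EHST on a random pair of folds, matching the 0.1–0.2 range reported in Table 15.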
Table 15 presents the A-Test results comparing EHST with the best-performing baseline models across all datasets. The consistently low values (close to 0.1–0.2) indicate a large effect size in favor of EHST, confirming the robustness and statistical significance of our improvements.
The results confirm that EHST significantly outperforms the baseline models across all datasets and evaluation metrics, further validating the robustness of the proposed framework.
Segmentation method analysis
To evaluate the impact of the segmentation approach on performance, we compared manual annotations with automated peak detection on the HeartWave dataset. As shown in Table 16, the accuracy and macro F1 scores obtained using automated peak detection are nearly equivalent to those derived from manual annotation. This close performance indicates that the automated segmentation pipeline is robust and reliable, reducing the need for labor-intensive manual labeling while still maintaining high-quality input for the EHST model.
These results confirm that the automated method effectively captures the essential heart sound segments with minimal loss in performance, making it a viable option for scaling up the data preparation process.
Window length sensitivity
We examined the sensitivity of EHST to different window lengths used during segmentation on the HeartWave dataset. As summarized in Table 17, the model’s performance varies slightly with different window lengths. A window length of 1.0 s produced the highest accuracy (96.7%) and macro F1 (95.5%), suggesting that this duration provides an optimal balance between capturing sufficient temporal dynamics and minimizing noise.
The marginal differences observed imply that while shorter windows might not capture the full extent of a heartbeat cycle, longer windows could introduce additional noise. Therefore, a 1.0 s window appears to be the optimal setting for the EHST model.
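Fixed-length windowing of this kind is straightforward to express in numpy. The sketch below assumes a 2 kHz sampling rate and a 50% hop, both of which are illustrative choices rather than the paper's settings.

```python
# Sketch of fixed-length segmentation with a 1.0 s window; the 2 kHz
# sampling rate and 0.5 s hop are assumptions, not the paper's settings.
import numpy as np

def segment(signal, fs, window_s=1.0, hop_s=0.5):
    win, hop = int(window_s * fs), int(hop_s * fs)
    if len(signal) < win:
        return np.empty((0, win))
    n = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

fs = 2000
pcg = np.random.default_rng(1).standard_normal(5 * fs)  # 5 s dummy recording
windows = segment(pcg, fs, window_s=1.0, hop_s=0.5)
print(windows.shape)  # (num_windows, samples_per_window)
```

An overlapping hop trades a larger number of training segments against correlation between neighbouring windows; sweeping `window_s` in such a function reproduces the kind of sensitivity analysis summarized in Table 17.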
Data augmentation strategy
To assess the impact of data augmentation, we evaluated EHST on the PhysioNet/CinC dataset using various augmentation strategies. Table 18 compares the performance without any augmentation, with individual strategies (noise injection, time-stretching, random cropping), and with a combination of all three techniques. The combined strategy results in the highest accuracy (90.3%) and macro F1 (88.9%), demonstrating that the integration of multiple augmentation methods effectively enhances model robustness by simulating realistic variability and reducing overfitting.
This analysis confirms that using a combination of augmentation strategies best simulates the diverse acoustic conditions encountered in clinical settings, thereby improving the model’s generalizability.
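The three augmentation strategies compared in Table 18 can be sketched as simple numpy transforms. The parameter values below (noise scale, stretch rate, crop length) are illustrative, not the settings used in our experiments.

```python
# Hedged numpy sketch of the three augmentations compared in Table 18:
# noise injection, time-stretching, random cropping. Parameters are
# illustrative, not the paper's settings.
import numpy as np

rng = np.random.default_rng(42)

def add_noise(x, scale=0.05):
    """Add Gaussian noise proportional to the signal's standard deviation."""
    return x + scale * x.std() * rng.standard_normal(len(x))

def time_stretch(x, rate=1.1):
    """Change duration by `rate` via linear-interpolation resampling."""
    n_out = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def random_crop(x, crop_len):
    start = rng.integers(0, len(x) - crop_len + 1)
    return x[start : start + crop_len]

x = np.sin(np.linspace(0, 20 * np.pi, 4000))  # dummy PCG segment
aug = random_crop(time_stretch(add_noise(x), rate=1.1), crop_len=2000)
print(aug.shape)
```

Chaining the three transforms, as in the last line, corresponds to the combined strategy that performed best in Table 18; applying them with randomized parameters per epoch is what simulates the acoustic variability discussed above.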
While the combined augmentation strategy significantly improves validation metrics by simulating diverse acoustic conditions, it may also introduce distributional shifts that impact performance on real-world, unseen data. This potential domain shift is a known challenge in PCG and other biomedical signal processing tasks.
To minimize adverse effects, our augmentation parameters were carefully selected to reflect realistic variations encountered in clinical practice. Furthermore, we validated our model on multiple independent datasets with varied noise profiles and demographic distributions to assess robustness beyond the training set.
Nonetheless, there remains a trade-off between increasing training data diversity and preserving fidelity to clinical conditions. Future work will explore adaptive and domain-adversarial augmentation techniques aimed at enhancing model generalizability across heterogeneous clinical environments.
Demographic-based performance
We further analyzed EHST’s performance on the HeartWave dataset across different demographic groups. Table 19 presents the accuracy and macro F1 scores for pediatric (Age < 18) and adult (Age \(\ge\) 18) subgroups. The results show consistent performance between the two groups, with adults achieving slightly higher scores. This consistency indicates that EHST generalizes well across diverse patient populations, which is critical for clinical deployment in varied settings.
These findings demonstrate that the model’s performance remains robust regardless of age, suggesting its effectiveness in both pediatric and adult clinical environments.
Overall, these detailed analyses across segmentation methods, window length sensitivity, data augmentation strategies, demographic-based performance, and computational efficiency provide a comprehensive picture of EHST’s performance and robustness in diverse clinical scenarios. The results demonstrate that EHST not only excels in classification performance but also generalizes well across different data conditions and remains computationally efficient for practical use.
Discussion
Across six datasets, EHST achieved higher accuracy, macro-F1 score, and AUC values than the baseline models, as shown in Table 8. In particular, on the HeartWave dataset, EHST attained higher class-specific performance, detecting both common and rare pathologies and showing improved sensitivity for underrepresented classes, as reflected in elevated micro-F1 scores. While these results indicate notable performance gains, they are specific to the datasets and experimental conditions evaluated.
Ablation studies suggest that the self-attention mechanism and integrated explainability modules contribute meaningfully to classification accuracy and model transparency. The combination of Grad-CAM, attention visualization, and SHAP yielded clinically interpretable outputs, aligning model attention with known pathophysiological features in the evaluated datasets.
In terms of computational efficiency, EHST exhibited training and inference times comparable to strong baseline architectures, including CNN-RNN hybrids and Transformer variants. Optimized model depth and parameter sharing in the multi-head attention layers contributed to a balance between complexity and scalability. For example, training times on the HeartWave dataset were similar to those of the best-performing baseline CNN-RNN model, and inference latency was within real-time constraints under the tested conditions.
The model’s performance remained stable across variations in segmentation method, window length, data augmentation strategy, and demographic subgroup, suggesting potential for broader applicability. However, real-world deployment may involve additional variability in recording devices, patient populations, and environmental noise, which were not exhaustively represented in the current datasets.
Overall, within the scope of the datasets and metrics considered, EHST demonstrated consistent performance advantages over the ten baseline models evaluated, while providing interpretable outputs without marked loss in efficiency. These findings support EHST’s potential for use in scalable, explainable heart sound analysis, pending further validation in prospective and more heterogeneous clinical settings.
Comparison with time growing neural networks (TGNNs)
Time Growing Neural Networks (TGNNs) offer a unique approach to modeling cardiac cycles by incrementally adapting their architecture to capture systolic and diastolic variations25. While TGNNs excel at representing temporal growth patterns, they process sequences sequentially, which can limit the ability to capture global dependencies and complex interactions spanning multiple heartbeats.
In contrast, our proposed EHST utilizes a multi-head self-attention mechanism that simultaneously attends to all time points within a heartbeat segment. This parallel attention allows the model to flexibly learn relationships across both systolic and diastolic phases without explicit segmentation or architectural growth, leading to richer temporal and spectral representations.
Additionally, EHST integrates explainability tools such as Grad-CAM and attention visualization, providing clinicians with interpretable heatmaps linked to physiologically meaningful heart sound components. This level of transparency is often lacking in TGNN architectures, making our method more suitable for clinical applications where interpretability is critical.
Empirically, EHST demonstrates consistent improvements in classification accuracy and robustness over baseline models, including those based on TGNNs, across multiple datasets, further validating the advantages of attention-based modeling for heart sound diagnosis.
Limitations
Despite the promising performance of EHST, several limitations remain. The model shows difficulty in accurately distinguishing between mild and severe cases, likely due to overlapping acoustic features and the limited number of borderline examples in the training data. While EHST performed well under controlled conditions, its robustness to real-world noise, device variability, and differences in recording environments has yet to be fully established. These factors, along with potential domain shifts when applied to new patient populations, may affect performance in practical deployments. The datasets used in this study do not comprehensively represent all demographic groups, age ranges, or rare cardiac pathologies, underscoring the need for greater diversity in training data to improve generalizability.
Extensive prospective clinical trials and longitudinal studies are required to validate EHST’s effectiveness, reliability, and interpretability in routine clinical workflows. Additionally, integration with multimodal data sources, such as ECG and echocardiography, could further enhance diagnostic coverage. Future work will also focus on refining the model’s interpretability mechanisms and applying advanced data augmentation and domain adaptation strategies to improve noise resilience and adaptability across heterogeneous clinical settings. Ultimately, optimizing EHST for robust, transparent, and scalable deployment remains a key objective before real-world adoption.
