Temporal and spatial self-supervised learning methods for electrocardiograms

Datasets
We validated the performance of the TSSL model using three publicly available ECG datasets. Each dataset was divided into training, validation, and test sets in a 6:2:2 ratio. The size of each dataset is shown in Table 1, with specific details provided below:
Chapman53: The Chapman database, sponsored by Chapman University and Shaoxing People’s Hospital (Zhejiang University Shaoxing Hospital), contains 10,646 12-lead ECG recordings with a sampling frequency of 500 Hz and a duration of 10 seconds. This dataset includes 11 common heart rhythm types, which we merged into 4 categories according to the recommendations of Lan et al.44 and Kiyasseh et al.21.
CPSC201854: The China Physiological Signal Challenge 2018 (CPSC2018) database contains 6,877 12-lead ECG recordings, with a sampling frequency of 500 Hz and recording lengths ranging from 6 to 60 seconds. We removed recordings shorter than 10 seconds and truncated the remainder to a uniform length of 10 seconds. This dataset contains 9 types of heart rhythms.
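The length-standardization step for CPSC2018 can be sketched as follows. This is a minimal numpy illustration, not the paper's actual pipeline; the helper name `standardize_length` and the list-of-arrays input format are assumptions.

```python
import numpy as np

FS = 500               # sampling frequency (Hz), as stated for CPSC2018
TARGET_LEN = 10 * FS   # 10-second window = 5000 samples

def standardize_length(recordings):
    """Keep only recordings of at least 10 s, truncated to exactly 10 s.

    `recordings` is a list of (n_leads, n_samples) arrays of varying length.
    """
    kept = []
    for rec in recordings:
        if rec.shape[1] >= TARGET_LEN:        # drop recordings shorter than 10 s
            kept.append(rec[:, :TARGET_LEN])  # truncate the rest to 10 s
    return kept

# Example: one 6-s recording (dropped) and one 30-s recording (truncated).
recs = [np.zeros((12, 6 * FS)), np.zeros((12, 30 * FS))]
out = standardize_length(recs)
# len(out) == 1 and out[0].shape == (12, 5000)
```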
PTB-XL55: The Physikalisch-Technische Bundesanstalt XL (PTB-XL) database comprises 21,837 12-lead ECG recordings from 18,885 patients, with a sampling frequency of 500 Hz and a duration of 10 seconds, similar to the Chapman database. The dataset contains five diagnostic labels.
All 10-second ECG recordings obtained from the three databases were preprocessed only by normalization before being fed into the model for training and testing.
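The paper states only that normalization is applied, without specifying the scheme. A common choice for ECG segments is per-lead z-scoring, assumed here purely for illustration:

```python
import numpy as np

def zscore_normalize(ecg, eps=1e-8):
    """Per-lead z-score normalization of a (n_leads, n_samples) ECG segment.

    The exact normalization used in the paper is unspecified; per-lead
    z-scoring is one standard option and is assumed here.
    """
    mean = ecg.mean(axis=1, keepdims=True)
    std = ecg.std(axis=1, keepdims=True)
    return (ecg - mean) / (std + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(12, 5000))  # synthetic 12-lead, 10 s @ 500 Hz
xn = zscore_normalize(x)
# Each lead now has (approximately) zero mean and unit variance.
```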
Experiment setup
We train the encoder through self-supervised learning and transfer it to downstream tasks to extract representations of the input signal. The performance in these downstream tasks is then evaluated to assess the effectiveness of the representation extraction. In ECG application scenarios, downstream tasks are typically arrhythmia classification tasks. We employ the following two methods to perform downstream tasks:
Linear evaluation (Linear): In this approach, the encoder transferred to the downstream task is frozen, meaning its parameters are no longer updated, and only a linear classifier is trained. Following the recommendations of Kiyasseh et al.21 and Lan et al.44, the encoder and classifier are trained on the same dataset, with the encoder’s performance in the downstream task reflecting the effectiveness of its representation extraction.
Transferability of representations (Fine-tuning): The encoder transferred to the downstream task is not frozen; instead, both the encoder and classifier are trained together. This approach allows for evaluating the robustness and transferability of the learned representations. In this scenario, the encoder is initially trained on one dataset and then fine-tuned in a supervised manner on other datasets.
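The mechanical difference between the two protocols is which parameters the optimizer updates. A minimal PyTorch sketch (the framework the paper uses), with a toy encoder and classifier standing in for the real architectures:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small 1-D conv encoder and a linear classifier head.
encoder = nn.Sequential(
    nn.Conv1d(12, 16, kernel_size=7),  # 12 ECG leads in, 16 channels out
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
)
classifier = nn.Linear(16, 4)  # e.g. 4 merged rhythm classes on Chapman

# Linear evaluation: freeze the encoder, optimize only the classifier.
for p in encoder.parameters():
    p.requires_grad = False
linear_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# Fine-tuning: unfreeze and optimize encoder and classifier jointly.
for p in encoder.parameters():
    p.requires_grad = True
finetune_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3
)
```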
During model training, the batch size is set to 128, and the Adam optimizer56 is used with a learning rate of 1e-3. The maximum number of epochs is set to 400, and training is terminated if the validation loss does not improve within 15 epochs, with the model achieving the best validation loss saved. The experiments in this paper were conducted on a machine with an NVIDIA GeForce 1050Ti GPU using the PyTorch57 framework. Since deep learning training involves a degree of randomness, each model was trained 5 times, and the mean ROC and standard deviation are reported as the final results.
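The early-stopping rule (patience of 15 epochs on validation loss, best model retained) can be sketched framework-agnostically; the `EarlyStopper` class and the simulated loss curve below are illustrative, not the authors' code:

```python
class EarlyStopper:
    """Stop training when validation loss fails to improve for `patience`
    consecutive epochs, tracking the best epoch (whose weights are saved)."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch  # in practice: save a model checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation curve: improves until epoch 5, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.44] + [0.5] * 30
stopper = EarlyStopper(patience=15)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
# stopped_at == 20: epoch 5 was best, stop 15 non-improving epochs later.
```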
Impact of data augmentation
Most existing self-supervised learning methods for ECG signals primarily use the SimCLR39 framework. SimCLR’s main approach involves applying different data augmentation techniques to the same instance, generating different views to define positive and negative pairs. However, this paper redefines positive and negative pairs from the instance level to the individual level, where signals from the same individual at different times are considered different views. Therefore, in our method, data augmentation is not a necessary step. To investigate the impact of data augmentation on the performance of our method, we compared the effects of various data augmentation methods on SimCLR and the proposed TSSL.
We used 7 data augmentation methods, including four artificial transformations41: Random Resized Crop (RRC), Time Out (TO), Gaussian Noise (GN), Gaussian Blur (GB); and three physiological noise transformations: Baseline Wander (BW), Electrode Motion (EM), and Muscle Artifact (MA). Among these, the physiological noise used in the physiological noise transformations was obtained from the MIT-BIH Noise Stress Test Database58. The effects of each transformation are illustrated in Fig. 5.

ECG transformations. Comparison between the effects of different transformation methods on signals.
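Two of the artificial transformations above can be sketched for 1-D signals in numpy; the noise level `sigma` and crop fraction `min_frac` are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(sig, sigma=0.05):
    """GN: add zero-mean Gaussian noise to every lead."""
    return sig + rng.normal(0.0, sigma, size=sig.shape)

def random_resized_crop(sig, min_frac=0.5):
    """RRC: crop a random contiguous segment and linearly resample it back
    to the original length (a 1-D analogue of the image transform)."""
    n = sig.shape[-1]
    crop_len = int(n * rng.uniform(min_frac, 1.0))
    start = rng.integers(0, n - crop_len + 1)
    crop = sig[..., start:start + crop_len]
    old_t = np.linspace(0.0, 1.0, crop_len)
    new_t = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(new_t, old_t, lead) for lead in crop])

# Synthetic 12-lead segment: each augmented copy is a different "view".
x = np.sin(np.linspace(0, 20 * np.pi, 5000))[None, :].repeat(12, axis=0)
aug = random_resized_crop(gaussian_noise(x))
# aug.shape == (12, 5000): same shape as the input, different view.
```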
In this section, we use LeNet, as shown in Fig. 3, as the encoder, and perform downstream tasks on three datasets using two methods, resulting in nine downstream tasks: three linear tasks and six fine-tuning tasks. The experimental results are shown in Fig. 6, where “Linear: Chapman” refers to the linear evaluation performed on the Chapman dataset. In this case, the encoder is first trained and then frozen while training the classifier. “Fine: CPSC2018\(\rightarrow\)Chapman” indicates that the encoder is first trained on CPSC2018 and then transferred to the Chapman dataset for fine-tuning. Since data augmentation is not necessary for TSSL, additional results without data augmentation (None) are also presented for TSSL.
As shown in Fig. 6, no single data augmentation method demonstrates absolute superiority across all tasks. In linear tasks, SimCLR performs best with RRC on the Chapman and PTB-XL datasets, while it excels with MA on the CPSC2018 dataset. In contrast, TSSL performs optimally without data augmentation in linear tasks, suggesting that data augmentation may disrupt the extraction of certain diagnostic features. In fine-tuning tasks, however, SimCLR with RRC does not retain the advantage it shows in linear tasks. Meanwhile, TSSL with data augmentation shows some improvement over TSSL without it, indicating that noise can enhance the model’s ability to generalize to different data distributions and improve the robustness of the representations.
From the above analysis, it is evident that SimCLR’s performance is closely tied to the choice of data augmentation method, with significant variation in results across different methods. Designing a universal data augmentation strategy that works well in all scenarios is challenging. In contrast, TSSL without data augmentation (None) demonstrates excellent performance across all tasks. Although using data augmentation in fine-tuning tasks offers a slight improvement over not using it, the gains are limited, and no single fixed data augmentation method consistently outperforms the no augmentation approach across all tasks.

Results of different data augmentation methods on SimCLR and TSSL. “Linear: Chapman” means that the encoder is trained on the Chapman dataset, then the encoder parameters are frozen and transferred to downstream tasks to train the classifier on the Chapman dataset. “Fine: CPSC2018\(\rightarrow\)Chapman” means that the encoder is trained on CPSC2018, then transferred to downstream tasks to train both the encoder and classifier on the Chapman dataset.
The robustness of different data augmentation methods across various tasks is relatively poor, making it challenging to intuitively rank the performance of these methods. In the field of intelligent optimization algorithms59, the Friedman test is commonly used to evaluate the overall performance of optimization algorithms across multiple tasks. Therefore, we used the Friedman test to evaluate the overall performance of different data augmentation methods across 9 tasks. The evaluation results are shown in Table 2. Since a higher ROC value indicates better performance, equivalent to maximizing the optimization task, a higher mean rank indicates better method performance. As shown in the table, the MA transformation yielded the best performance for SimCLR, so subsequent experiments with SimCLR will use the MA transformation. For TSSL, the best performance is achieved without using data augmentation; therefore, the TSSL method does not utilize data augmentation.
Comparison of methods
In this section, we use the four encoders from Fig. 3 to compare TSSL with state-of-the-art self-supervised learning methods for ECG, including SimCLR39, BYOL40, CLOCS21, MTSR46, and 3KG42. Additionally, to better highlight the performance of self-supervised learning, we also compare against encoders trained using supervised learning (Supervised) and randomly initialized encoders (Random Init). We also present the results of using temporal self-supervision (TSL) and spatial self-supervision (SSL) alone, conducting an ablation analysis to verify the effectiveness of the TSSL strategy. For a more thorough analysis, we ranked the models by complexity from simple to complex as LeNet, AlexNet, VGG, and ResNet, and ranked the datasets by classification difficulty from simple to complex as Chapman, CPSC2018, and PTB-XL.
Linear evaluation
Tables 3, 4, 5 and 6 present the linear evaluation results for different methods using the 4 encoders, with the bolded values representing the best results (supervised methods are included only for reference and are not part of the comparison). First, the tables show that TSSL outperforms other algorithms regardless of whether a simple or complex network structure is used as the encoder, indicating TSSL’s strong robustness across different network structures and its consistent ability to achieve excellent performance. Secondly, on the simpler Chapman dataset, TSSL’s advantage over other algorithms is not significant. For instance, in Table 5, TSSL shows only a 1.7% improvement compared to 3KG and CLOCS. However, on the more complex PTB-XL dataset, TSSL shows improvements of 16.5% and 15.7% over 3KG and CLOCS, respectively, further demonstrating TSSL’s superior performance in feature extraction.
Additionally, as seen in Tables 3, 4, 5 and 6, TSSL outperforms supervised training on the CPSC2018 and PTB-XL datasets. This result is surprising because self-supervised learning lacks direct label guidance and an explicit task objective, whereas supervised learning typically delivers better results when labeled data is available. To investigate the reason behind this phenomenon, Fig. 7 shows the changes in training loss, validation loss, and test loss during classifier training after the VGG encoder, trained with either supervised learning or TSSL, is transferred to linear tasks. In this experiment, we removed the early stopping setting and trained for 100 epochs. As shown in the figure, although supervised learning achieves a lower training loss than TSSL, it overfits too early, leading to higher validation and test losses than TSSL and, in turn, worse downstream performance. TSSL therefore effectively mitigates overfitting.

Loss variation of linear evaluation classifier.
Fine-tuning evaluation
Tables 7, 8, 9 and 10 present the fine-tuning evaluation results for different methods across the 4 encoders. First, compared to other methods, TSSL consistently delivers the best performance across all models and datasets. Secondly, compared to the linear evaluation results, TSSL shows improved classification performance on each dataset after fine-tuning. The best performance on the three datasets (Chapman, CPSC2018, PTB-XL) improved from 0.984, 0.882, and 0.877 to 0.995, 0.921, and 0.897, respectively, indicating that the extracted representations have strong robustness and transferability.
Ablation analysis
As seen in Tables 3, 4, 5 and 6, the representations extracted using only spatial self-supervision (SSL) outperform those from a randomly initialized encoder, indicating that SSL has a certain effect. However, its performance on CPSC2018 and PTB-XL is relatively poor, with suboptimal representation extraction. This is because spatial self-supervision only maintains the correlation between leads, lacking effective measures to extract the latent, efficient features of ECGs. TSL performs well across various models and datasets, indicating that TSL is an effective self-supervised learning method for ECG classification. By combining TSL and SSL, TSSL achieves better results than either method alone, validating our approach of using spatial self-supervision to maintain lead correlation during temporal self-supervision. Figures 8, 9 and 10 show the absolute differences in cosine similarity between leads for a particular signal representation extracted by TSL and TSSL, as well as the classification probability, when using the VGG encoder for linear evaluation on the three datasets. This demonstrates that maintaining inter-lead correlation when extracting self-supervised signal representations helps improve classification accuracy.
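The inter-lead similarity error plotted in Figures 8, 9 and 10 can be computed as sketched below. This is a hypothetical numpy illustration of the quantity (mean absolute difference between the inter-lead cosine-similarity matrices of the signal and of its representations), not the authors' code; the toy "encoder" is a shared orthogonal map, which preserves inter-lead geometry exactly:

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between rows (leads) of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    xn = x / np.clip(norms, 1e-12, None)
    return xn @ xn.T

def lead_correlation_error(signal, representation):
    """Mean absolute difference between the inter-lead cosine-similarity
    matrices of the raw signal and of its per-lead representations."""
    return np.abs(
        cosine_similarity_matrix(signal) - cosine_similarity_matrix(representation)
    ).mean()

rng = np.random.default_rng(0)
signal = rng.normal(size=(12, 64))  # toy 12-lead segment

# A shared orthogonal map applied to every lead leaves all inner products,
# and hence all inter-lead cosine similarities, unchanged.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
rep_preserving = signal @ q
rep_random = rng.normal(size=(12, 64))  # unrelated "representation"

err_preserved = lead_correlation_error(signal, rep_preserving)  # ~0
err_random = lead_correlation_error(signal, rep_random)         # clearly > 0
```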
As shown in Tables 7, 8, 9 and 10, in fine-tuning scenarios, representations extracted by temporal self-supervised learning (TSL) outperform those extracted by temporal-spatial self-supervised learning (TSSL). This is because too many supervision conditions can reduce the model’s ability to generalize across data distributions, weakening the transferability of the representations. In practical applications, TSL and TSSL can be used flexibly depending on the specific scenario.

Absolute error of cosine similarity between signals and representations across leads on Chapman. Train the encoder using different TSL and TSSL respectively, obtain the absolute differences between the correlations of the representations across leads and those of the original signals, as well as the classification probabilities on different classes after training the representations.

Absolute error of cosine similarity between signals and representations across leads on CPSC2018.

Absolute error of cosine similarity between signals and representations across leads on PTB-XL.

Effect of label quantity on classification performance, classification results trained with different available labels.
Analysis of dependency on labels
The purpose of proposing TSSL is to reduce the dependency of deep learning methods on labels in the field of automatic ECG signal detection, where acquiring labels is challenging. In this section, we use VGG as the encoder, training the classifier with only a portion of the labels. We compare the classification performance of supervised learning and TSSL under different label quantities and examine how label quantity affects classification performance.
As seen in Fig. 11, TSSL exhibits more robust performance than supervised learning as the number of available labels varies. Particularly on the Chapman dataset, which has a lower classification difficulty, TSSL requires only 1% of the labels to achieve performance close to that of supervised training with 100% of the labels, whether in linear evaluation or fine-tuning tasks. On the more complex CPSC2018 and PTB-XL datasets, this proportion is 10%. In other words, compared to supervised training, TSSL can achieve nearly equivalent performance with only 10% of the labeling effort, thus reducing dependency on label quantity.