Emotion recognition with multiple physiological parameters based on ensemble learning

Discussion of random initialization results
This study aimed to investigate the impact of random seed initialization on model performance. The same base model was initialized with five different random seeds (3, 6, 9, 12, 15) and trained for 300 epochs. The batch sizes for the SELF, SWELL, and WESAD datasets were set to 512, 128, and 128, respectively. Models were trained on the training set and evaluated continuously on the validation set. After training, each model was saved and its performance on the test set evaluated. Table 1 presents the training and test accuracies of the five models initialized with different random seeds across the three datasets.
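The seeding protocol can be sketched as follows. Here `init_weights` is a hypothetical stand-in for framework-level initialization (in a PyTorch pipeline one would call `torch.manual_seed(seed)` before constructing the model); it is shown only to make the reproducibility argument concrete.

```python
import random

# Hypothetical stand-in for framework-level weight initialization:
# a fixed seed always reproduces the same starting weights, while the
# five seeds used in the experiment (3, 6, 9, 12, 15) start training
# from five different points in weight space.
def init_weights(seed, n=4):
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

seeds = [3, 6, 9, 12, 15]
inits = {s: init_weights(s) for s in seeds}

assert init_weights(9) == inits[9]   # same seed -> identical initialization
assert inits[3] != inits[15]         # different seeds -> different starts
```

Because each seed fixes a distinct starting point, the five runs explore different regions of the loss surface, which is what produces the accuracy spread reported in Table 1.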
The results indicate that random seed initialization significantly impacts model training and generalization. For the SELF dataset, seed 9 achieves the highest accuracy across the training, validation, and test sets, whereas seed 15 may cause the model to fall into a local optimum, resulting in performance degradation. For the SWELL dataset, where overall classification accuracy is relatively high, seed 9 delivers the best test performance, whereas seed 15 impedes model convergence. For the WESAD dataset, although model performance remains stable across different initializations, seed 9 consistently outperforms others. Overall, the choice of random seed is critical for model training and generalization. Seed 9 helps guide the model toward a more stable and generalizable solution, providing a foundation for further optimization strategies and the development of more robust emotion recognition methods.
Discussion of CAWR results
This study introduced the CAWR strategy alongside random seed initialization to adjust the learning rate, aiming to optimize the training effectiveness of the base model. Five different random seeds (3, 6, 9, 12, 15) were used. The dynamic learning rate cycle was T = 60 epochs, with training conducted for 300 epochs. The batch sizes for the SELF, SWELL, and WESAD datasets were set to 512, 128, and 128, respectively. The maximum and minimum learning rates were set at \(\eta_{max}\) = 0.001 and \(\eta_{min}\) = 0, respectively. Throughout training, the model’s performance was assessed on the validation set, and the model’s state was saved at the end of each learning rate cycle. The accuracy and loss curves during training with random seed 9 for the three datasets are shown in Fig. 5.
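Within each cycle, CAWR follows the standard SGDR schedule \(\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi \, T_{cur}/T))\). A minimal sketch with the settings above (T = 60, \(\eta_{max}\) = 0.001, \(\eta_{min}\) = 0); in a PyTorch pipeline the equivalent built-in is `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=60, eta_min=0)`.

```python
import math

def cawr_lr(epoch, T=60, eta_max=1e-3, eta_min=0.0):
    """Cosine annealing with warm restarts: within each cycle of length T
    the learning rate decays from eta_max to eta_min along a half cosine,
    then restarts at eta_max at the cycle boundary."""
    t_cur = epoch % T  # position within the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T))
```

At epoch 0 the rate sits at its maximum (0.001), decays to near zero by epoch 59, and restarts to 0.001 at epoch 60, matching the periodic pattern visible in Fig. 5.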

Experimental results of the base model with CAWR strategy.
Figure 5 clearly illustrates the periodic variation of model accuracy and loss during the CAWR learning rate schedule. Throughout the annealing and restart cycles, both training and validation accuracies improve steadily while loss values gradually decrease. At the end of each annealing cycle, resetting the learning rate to its initial maximum value enables the model to escape local optima and explore a wider solution space. This strategy significantly enhances model performance, achieving peak validation accuracies of 89.04%, 98.00%, and 98.69% on the SELF, SWELL, and WESAD datasets, respectively. Additionally, the CAWR strategy reduces the number of training epochs required to reach high accuracy. For instance, on the SELF dataset, the model initialized with random seed 9 reaches, in just 100 epochs under the CAWR strategy, results comparable to those of the baseline model trained for 300 epochs. This improvement not only enhances efficiency but also substantially reduces training time and computational cost.
As the number of training epochs increases, the model shows varying degrees of overfitting. To comprehensively assess the model’s performance, the saved models at the end of training are tested using the testing dataset. Table 2 presents the testing results and performance comparisons before and after introducing the dynamic learning rate.
Table 2 shows that applying the CAWR learning rate adjustment strategy significantly improves the test accuracy of all models, with a notable average increase. For the SELF dataset, the accuracy of the model initialized with random seed 15 increases from 48.25% to 86.99%, demonstrating that CAWR effectively mitigates underfitting. Similarly, on the SWELL dataset, the accuracy of the seed 15 model rises from 57.61% to 97.35%, further validating CAWR’s ability to enhance model robustness. Even on the WESAD dataset, where baseline accuracy is already relatively high, CAWR still delivers consistent performance gains. Additionally, CAWR significantly reduces the performance variability caused by random seed initialization: under the RI training strategy, the accuracy gap between the best and worst models on the SELF dataset is 21.26%, whereas CAWR narrows this discrepancy to just 2.54%, greatly enhancing training stability and reproducibility.
LSTM effectively retains long-term dependencies through its gating mechanism, alleviating the vanishing gradient problem inherent in traditional RNNs, but it remains susceptible to gradient explosion. To address this, CAWR adopts a monotonically decreasing learning rate strategy within each cycle, gradually reducing the learning rate to zero. This approach suppresses instability caused by large gradients and facilitates model convergence. Furthermore, the compatibility between LSTM’s activation functions (Sigmoid for gating and Tanh for state updates) and the cosine annealing strategy further enhances optimization. By periodically adjusting the learning rate, CAWR optimizes weight updates at different training phases, accelerates convergence toward the global optimum, and enhances multi-class classification accuracy. Notably, under constrained training iterations, this strategy maximizes the model’s potential, significantly enhancing overall performance. These results validate the effectiveness of CAWR in multi-physiological signal-based emotion recognition tasks.
Ensemble learning results discussion
The soft voting method was used to combine 25 saved model states, grouped three ways: models with the same random seed but different training epochs, models with different random seeds but the same training epochs, and all models saved during training. Table 3 presents the test accuracy of the different ensemble strategies.
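A minimal sketch of the soft-voting step for a single sample, assuming each saved model state emits a class-probability vector (three hypothetical models and three classes are shown in place of the full 25 states):

```python
def soft_vote(prob_sets):
    """Average each model's class-probability vector and predict the
    class with the highest mean probability."""
    n_models, n_classes = len(prob_sets), len(prob_sets[0])
    mean = [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Hypothetical outputs of three models on one 3-class sample: two lean
# toward class 1, one confidently picks class 0; the averaged
# probabilities favour class 1.
probs = [[0.2, 0.5, 0.3],
         [0.1, 0.6, 0.3],
         [0.7, 0.2, 0.1]]
assert soft_vote(probs) == 1
```

Because it averages confidences rather than counting discrete votes, soft voting lets well-calibrated models outweigh uncertain ones, which is what smooths out the under- and overfitting of individual checkpoints.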
The results indicate that all ensemble approaches significantly improve the model’s generalization capability. Whether by varying training epochs with a fixed random seed, using different random seeds with fixed training epochs, or employing a full-model ensemble, ensemble learning effectively addresses both underfitting and overfitting, thereby enhancing prediction stability. The SELF dataset exhibits the most pronounced benefits from these strategies. Ensembling across different training epochs enriches feature representation, while ensembling models trained with different random seeds reduces the impact of initialization variability. The full-model ensemble achieves the highest generalization performance. These findings highlight that a well-designed ensemble strategy not only improves classification accuracy but also enhances model robustness, making it a powerful tool for physiological signal analysis.
After 300 training epochs, the soft voting ensemble strategy, which integrates five models with different random initializations, achieves the highest classification accuracy across all three datasets. Table 4 further evaluates the classification consistency for each emotion category, including precision, recall, and F1-score. These metrics offer a comprehensive evaluation of the model’s performance and its balance across emotion classes.
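The per-class metrics in Table 4 follow the standard definitions from true/false positive and negative counts; a small sketch (the labels below are hypothetical examples, not values from the table):

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, computed from TP/FP/FN."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: one "Happy" sample is misclassified as "Calm",
# so "Calm" keeps perfect recall but loses precision.
y_true = ["Calm", "Happy", "Calm", "Happy"]
y_pred = ["Calm", "Calm", "Calm", "Happy"]
p, r, f1 = per_class_metrics(y_true, y_pred, "Calm")
assert (round(p, 3), r, round(f1, 3)) == (0.667, 1.0, 0.8)
```

In practice the same numbers can be obtained with `sklearn.metrics.precision_recall_fscore_support`; the explicit counts are shown here only to make the definitions transparent.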
Overall, the model demonstrates high classification performance across all datasets and emotion categories, highlighting its effectiveness in emotion recognition. However, the difficulty of distinguishing between classes varies. In the SELF dataset, all emotion categories achieve F1 scores above 95%. Recognition of Calm, Happy, Anger, Sad, and Fear is well-balanced, whereas Disgust and Surprise exhibit lower recall rates, likely due to high intra-class variability or feature overlap causing misclassification. In the SWELL dataset, all categories achieve F1 scores exceeding 98%, indicating that the model can accurately distinguish between high/low arousal and valence states. However, the recall rate for LVHA is slightly lower, possibly because low-valence/high-arousal samples are less clearly separated from neighboring states. In contrast, HVLA and LVLA show greater stability, likely due to the even distribution of low-arousal state data. For the WESAD dataset, all categories achieve F1 scores above 96%. Recognition of Baseline and Stress is the most accurate, likely because their physiological signal patterns are distinct. However, Amusement shows a lower recall rate, with some samples misclassified as Baseline or Meditation, reflecting lower separability in physiological signals.
To evaluate the model’s generalization capability to unseen subjects or videos, we applied a segment-wise prediction and voting strategy for external validation. Predictions were generated at the segment level on external datasets, with final classifications determined through soft voting. The results showed classification accuracies of 42.86%, 50%, and 50% on the SELF, SWELL, and WESAD datasets, respectively. These findings indicate that the model’s high accuracy on internal data did not generalize effectively to external data. The primary limitation likely arises from individual differences in physiological signals, which significantly impact cross-subject emotion recognition. Future research should prioritize strategies to reduce inter-subject variability, such as domain adaptation or personalized modeling, to enhance the model’s generalizability and enable broader real-world applications.
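The segment-wise aggregation can be sketched as below, assuming each segment of an external recording yields a class-probability vector fed into the same soft-voting rule used for the model ensemble; the toy numbers also show why soft voting can differ from a per-segment majority vote:

```python
def soft_vote_segments(segment_probs):
    """Average class probabilities across all segments of one recording
    and predict the class with the highest mean probability."""
    n, k = len(segment_probs), len(segment_probs[0])
    mean = [sum(p[c] for p in segment_probs) / n for c in range(k)]
    return max(range(k), key=mean.__getitem__)

# Hypothetical 2-class recording split into three segments: two weakly
# favour class 0, one strongly favours class 1.
segs = [[0.55, 0.45], [0.52, 0.48], [0.10, 0.90]]
seg_votes = [max(range(2), key=s.__getitem__) for s in segs]

assert seg_votes.count(0) > seg_votes.count(1)  # hard majority: class 0
assert soft_vote_segments(segs) == 1            # soft voting: class 1
```

The toy numbers are illustrative only; on the actual external datasets even this confidence-weighted aggregation did not overcome inter-subject variability, as the accuracies above show.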
To validate the effectiveness of our proposed method, we reproduced and compared several representative models from existing studies. Specifically, we strictly followed the original architectures and hyperparameter settings described in the respective papers to replicate DCNN [37], CNN [38], and Res2Net [39]. Additionally, we implemented two widely recognized general-purpose models, DeepConvNet [40] and EEGNet [41], ensuring that their parameter configurations remained consistent with the original implementations. Given the limitations of RNN, GRU, and LSTM in modeling ultra-long time series, we developed hybrid frameworks, CNN + RNN and CNN + GRU, to assess different temporal modeling strategies. Table 5 summarizes the F1 scores of each model across the SELF, SWELL, and WESAD datasets, providing a comprehensive comparison of their classification performance.
As demonstrated in Table 5, our approach achieves the best performance across all datasets, outperforming existing models in terms of F1 score. The SELF dataset has a relatively small sample size and complex, multimodal signal characteristics. While Res2Net, CNN + RNN, and CNN + GRU outperform traditional CNN and lightweight models, our method further enhances feature learning, achieving superior results. The SWELL dataset consists of large-scale, single-channel ECG signals, which are uniform in nature but extensive in quantity. Models with enhanced feature modeling capabilities, such as Res2Net, CNN + RNN, and CNN + GRU, achieve high F1 scores, and our method further improves performance on this basis. The WESAD dataset contains multi-channel physiological signals, offering rich emotional information. Multi-channel temporal modeling significantly enhances the performance of Res2Net, CNN + RNN, and CNN + GRU. However, our method surpasses all existing models, demonstrating its strength in integrating and analyzing multimodal physiological signals.
In conclusion, our method demonstrates superior performance in complex emotion classification tasks, proving its competitiveness with state-of-the-art techniques. This study highlights the significant potential of ensemble learning in multimodal emotion recognition. By combining random seed initialization with cosine annealing learning rate adjustment, we effectively enhance model performance. The soft voting strategy not only improves classification accuracy but also ensures stability across diverse emotion categories. Furthermore, extending training epochs and optimizing the learning rate maximize the model’s potential, significantly boosting overall classification performance. These findings underscore the importance of well-designed ensemble strategies and optimized training protocols in addressing complex classification challenges.