Reweighting balanced representation learning for long tailed image recognition in multiple domains
Dataset descriptions
Yang et al.4 proposed six benchmark datasets for MDLT learning: VLCS-MLT, PACS-MLT, OfficeHome-MLT, TerraInc-MLT, DomainNet-MLT, and Digits-MLT. Among these, Digits-MLT is a synthetic dataset created by combining two digit datasets, while the remaining five are real-world multi-domain datasets widely used in domain generalization research. Following their setup, we use these datasets to assess the performance of our proposed method. Table 2 provides detailed information on the datasets used in the experiments, and Fig. 3 displays the class distributions. In the figure, class labels are sorted in descending order by the number of samples per class to clearly demonstrate the LT characteristics of each dataset. Note that the class and domain labels in the experimental data have been reindexed in ascending alphabetical order; the new indices map one-to-one to the original labels in the respective datasets.

Distribution of training, validation, and test data for each MDLT dataset.
Experimental settings
Setup for model training
All models were implemented in PyTorch and trained on an NVIDIA RTX A4000 GPU. Following the setup of Yang et al.4, we used the same CNN architecture for the Digits-MLT dataset and ResNet-50 as the backbone for the remaining five datasets. To ensure a fair comparison, we evaluated the following groups of methods: (1) domain-invariant feature learning methods, including IRM7, DANN13, CDANN39, CORAL35, and MMD40; (2) imbalanced learning methods, such as Focal14, CBLoss41, LDAM42, BSoftmax15, CRT17, and BoDA4; and (3) other baselines, including ERM43, GroupDRO44, Mixup45, SagNet46, MLDG47, MTL48, and Fish49. In addition, we evaluated the two-stage training procedure of our model following the protocol of Yang et al.4. Our implementation is based on the codebase released by Yang et al.4, and their optimal parameter settings were used for all comparison algorithms.
Evaluation setup
We adopted two evaluation metrics widely used in imbalanced learning: top-1 overall accuracy and the F1 score across all classes. In addition, we computed accuracy for four disjoint class subsets: many-shot classes (more than 100 training samples), medium-shot classes (20–100 samples), few-shot classes (fewer than 20 samples), and zero-shot classes (no training samples). Unlike Yang et al., who selected the best-performing model based on accuracy, we selected the best model during training according to the F1 score for all algorithms, because the F1 score provides a more comprehensive evaluation, particularly on imbalanced datasets.
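As an illustration, the shot-based subset accuracies can be computed from test predictions and per-class training counts. The function below is a minimal sketch using the thresholds stated above; the name `shot_subset_accuracy` and its dictionary interface are our own, not from the released code.

```python
import numpy as np

def shot_subset_accuracy(y_true, y_pred, train_counts):
    """Accuracy over the many-/medium-/few-/zero-shot class subsets.

    `train_counts` maps class index -> number of training samples; the
    thresholds follow the evaluation setup (>100 many, 20-100 medium,
    <20 few, 0 zero-shot).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    def subset_acc(cond):
        classes = [c for c, n in train_counts.items() if cond(n)]
        mask = np.isin(y_true, classes)
        if not mask.any():
            return float("nan")  # subset absent from the test set
        return float((y_pred[mask] == y_true[mask]).mean())

    return {
        "many": subset_acc(lambda n: n > 100),
        "medium": subset_acc(lambda n: 20 <= n <= 100),
        "few": subset_acc(lambda n: 0 < n < 20),
        "zero": subset_acc(lambda n: n == 0),
    }
```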
Results and analysis
Quantitative results
For the MDLT image classification task, we conducted experiments on all selected benchmark datasets, with results presented in Tables 3, 4, 5, 6, 7, 8 and 9. It is important to note that, in this study, we did not perform dataset-specific tuning of the parameters \(\alpha\), \(\beta\), and \(\gamma\). Instead, we adopted a fixed set of values—\(1\times 10^{-3}\), \(1\times 10^{-2}\), and 10, respectively—to identify a parameter configuration that provides better average performance across different datasets, as supported by the experiments discussed in the Parameter Analysis section. As a result, our method may not yield optimal performance on certain datasets (e.g., VLCS-MLT), as shown in Tables 3, 4, 5, 6, 7 and 8. However, subsequent analysis demonstrates that these parameters significantly affect performance depending on the dataset. For example, in the case of Digits-MLT, when \(\gamma\) is held constant at 10 and \(\alpha =1\), \(\beta =1\times 10^{-7}\), both the average F1 score and accuracy improve to 0.686 and 0.688, respectively, while the lowest values also remain relatively high at 0.644 and 0.646.
Overall, the proposed model performs well across most evaluation measures, especially on the “worst” measure (i.e., performance on the most difficult classes), as shown in Table 9. This robustness is particularly evident on the Digits-MLT and TerraInc-MLT datasets, where the model demonstrates strong recognition of hard-to-classify categories. Furthermore, the two-stage training approach outperforms single-stage training in most cases, although in scenarios involving few-shot or zero-shot classes it may overfit the limited samples, leading to slightly lower results. Nevertheless, our method consistently surpasses the baseline algorithms. Figure 4 displays a heatmap of the per-class, per-domain macro F1 scores for each dataset (excluding DomainNet, whose 345 classes limit visual clarity). Darker cells indicate lower F1 scores, highlighting domain-class combinations that are more difficult to recognize.
To further assess the stability of the algorithm’s performance, we conducted five independent runs for each method using distinct random seeds, resulting in five sets of experimental outcomes per method. We employed analysis of variance (ANOVA) to calculate p-values across different performance metrics. As shown in Table 10, the ANOVA results for the Digits-MLT dataset indicate that the p-values for all evaluated metrics are substantially below standard significance thresholds (e.g., 0.05), thereby confirming that the observed performance differences among the compared algorithms are statistically significant. To visualize the distribution of F1 scores, violin plots were generated (Fig. 5), depicting the spread of both average and worst-case F1 scores across algorithms. These plots show that the proposed method achieves F1 scores that are highly concentrated in the upper range, with a narrow gap between the average and worst-case values and relatively low variance, suggesting both superior and stable performance.
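A one-way ANOVA of this kind reduces to comparing between-group and within-group variance across the five seeded runs per method. The sketch below computes the F statistic from scratch; the per-seed F1 scores are illustrative placeholders, since the actual per-run values underlying Table 10 are not reproduced here.

```python
import numpy as np

def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA across independent groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    k, n = len(groups), all_vals.size
    # Between-group and within-group sums of squares.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical per-seed F1 scores for three methods (five seeds each).
f_stat = one_way_anova_f(
    [0.670, 0.672, 0.668, 0.675, 0.671],   # proposed method
    [0.640, 0.645, 0.638, 0.642, 0.644],   # strong baseline
    [0.610, 0.615, 0.605, 0.612, 0.608],   # weaker baseline
)
# The critical value F(2, 12) at the 0.05 level is about 3.89; an F
# statistic far above it corresponds to p << 0.05.
```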
In summary, while the proposed model may not be the top performer on every dataset, it remains close to optimal in most cases. Moreover, it exhibits a smaller discrepancy between “average” and “worst-case” results compared to other algorithms, indicating greater stability. These findings support the effectiveness of the proposed model in the MDLT image classification task, especially in improving recognition performance for difficult classes.

F1 score heatmap of the proposed model on different datasets.

Distribution of F1 score for different algorithms on Digits-MLT.
Qualitative analysis via visualization
To further analyze the model’s capabilities, we performed a t-SNE visualization of the test-set features after training, interpreting the embedding with both class labels and domain labels. Figure 6 illustrates the t-SNE results for the Digits-MLT test set. For each model, the upper row presents the t-SNE plot colored by class labels, while the lower row shows the same features colored by domain labels. In the class-label view, the feature clustering produced by our model closely resembles that of DANN and BoDA, consistent with the quantitative results in Table 3. The domain-label view, however, reveals clear differences. In domain-invariant representation learning, ideal feature representations should minimize differences between domains, so better alignment of feature distributions across domains reflects stronger generalization and reduced domain-specific bias. As shown in Fig. 6, our model exhibits greater overlap between domains, with data points from domain A aligning closely with those from domain B. In this way, shortcut learning, in which classification relies on domain cues, is effectively mitigated.
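The visualization pipeline can be sketched with scikit-learn as follows; `features`, `class_labels`, and `domain_labels` are hypothetical placeholders for the test-set embeddings and labels, not names from the released code.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data standing in for penultimate-layer test-set embeddings.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 64))          # embedding vectors
class_labels = rng.integers(0, 10, size=300)   # e.g. ten digit classes
domain_labels = rng.integers(0, 2, size=300)   # e.g. domains A and B

# Project to 2-D once; the same embedding is then scattered twice,
# colored by class and by domain, to produce the paired rows of Fig. 6.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
```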

t-SNE of the Digits-MLT test set for different models.
Parameter analysis
The proposed method incorporates three key parameters, \(\alpha\), \(\beta\), and \(\gamma\), which control the balance between classification accuracy and the correction of imbalance-related errors. In our main experiments, these parameters were fixed at \(\alpha =1\times 10^{-3}\), \(\beta =1\times 10^{-2}\), and \(\gamma =10\). To analyze the effect of each parameter on model performance, we varied one parameter at a time over \(\{1\times 10^{-7},1\times 10^{-5},1\times 10^{-3},1\times 10^{-2},1\times 10^{-1},1,10,1\times 10^{2},1\times 10^{3},1\times 10^{4}\}\) while holding the other two fixed. Figure 7 shows the results on the Digits-MLT dataset, reporting average and worst-case F1 scores and accuracy. In the figure, solid lines represent average values, dashed lines represent worst-case values, colors differentiate the parameters, and the shaded area between lines of the same color indicates the gap between average and worst-case performance. The results show that the model remains robust as \(\alpha\) increases, indicating resilience when the CB component is emphasized. The best test performance is achieved when \(\beta\) is approximately 0.01; further increases in \(\beta\) cause a rapid decline, likely due to overcorrection in the RB component. Within a certain range, performance exhibits only minor fluctuations as \(\gamma\) increases; however, when \(\gamma\) becomes excessively large (e.g., greater than 100), classification performance declines significantly.
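The one-at-a-time sweep described above can be expressed as a small driver loop; `train_and_eval` is a hypothetical stand-in for a full training run that returns the metrics of interest.

```python
# Value grid and default settings from the parameter analysis.
GRID = [1e-7, 1e-5, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0, 1e3, 1e4]
DEFAULTS = {"alpha": 1e-3, "beta": 1e-2, "gamma": 10.0}

def sweep(param, train_and_eval):
    """Vary one hyperparameter over GRID, holding the other two at DEFAULTS."""
    results = {}
    for value in GRID:
        settings = dict(DEFAULTS, **{param: value})
        results[value] = train_and_eval(**settings)
    return results
```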

Single-parameter analysis results under different parameter values on Digits-MLT.
During the single-parameter analysis on the Digits-MLT dataset, the proposed algorithm achieved relatively high F1 scores and accuracy with \(\alpha =1\), \(\beta =1\times 10^{-7}\), and \(\gamma =10\). To examine the convergence behavior under this setting, we recorded the loss value and F1 score every 100 iterations over 5000 training iterations; the results are illustrated in Fig. 8. The loss decreased consistently with the number of iterations, while the F1 score rose steadily as the loss fell and eventually stabilized, demonstrating favorable convergence.
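A logging harness matching this schedule might look as follows; `step_fn` and `eval_f1` are hypothetical stand-ins for one optimizer step and a validation F1 computation.

```python
# Record loss and F1 every 100 of 5000 iterations, as in Fig. 8.
LOG_EVERY, TOTAL_ITERS = 100, 5000

def train_with_logging(step_fn, eval_f1):
    """Run TOTAL_ITERS steps, logging metrics every LOG_EVERY iterations."""
    history = {"iter": [], "loss": [], "f1": []}
    for it in range(1, TOTAL_ITERS + 1):
        loss = step_fn(it)
        if it % LOG_EVERY == 0:
            history["iter"].append(it)
            history["loss"].append(loss)
            history["f1"].append(eval_f1(it))
    return history
```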

Iterative training results on Digits-MLT.
Based on the parameter analysis, it is evident that relatively small values for the hyperparameters tend to result in higher F1 scores and accuracy in practical applications. This implies that during model optimization, priority should be given to exploring smaller parameter ranges to enhance performance more effectively.
Ablation studies
To assess the contribution of individual components within the proposed model, we conducted ablation studies by selectively enabling or disabling key modules: the balanced cross-entropy loss (BCE), the CB component, the RB component, and the cross-domain penalty term (Penalty). Each configuration was evaluated on the Digits-MLT benchmark, with results summarized in Table 11. Including all components (experiment 8) yields the best overall performance, with an average F1 score of 0.672 and a worst-case F1 score of 0.630, indicating that the full combination provides the most robust and consistent improvements.
