An efficient personalized federated learning approach in heterogeneous environments: a reinforcement learning perspective
We conducted extensive experiments on both standard and real-world datasets to address the following seven key research questions:
- Q1: How does our approach outperform the state-of-the-art personalized federated learning methods in heterogeneous environments? (see Result 1)
- Q2: How does our approach perform on new clients that appear after training? (see Result 2)
- Q3: How do the size of local data storage and the extent of data heterogeneity affect the performance of our model? (see Result 3)
- Q4: How sensitive is our approach to hyperparameters, namely the model mixing parameter \(\lambda _n\) and the number of nearest neighbors k? (see Result 4)
- Q5: How robust is our approach to changes in data distribution? (see Result 5)
- Q6: How do the model performance and training efficiency of our method fare in the fall detection task? (see Result 6)
- Q7: What is the specific contribution of each component in our approach to overall model performance and training efficiency under heterogeneous environments? (see Result 7)
Datasets
We utilize the CIFAR-10, CIFAR-100, FEMNIST, and MobiAct datasets for training and testing, covering two domains: image recognition and fall detection. The first three datasets, CIFAR-10, CIFAR-100, and FEMNIST, are well-established in the field, offering rich content and diverse application scenarios that simulate real-world data distribution effectively. The MobiAct dataset, as a real-world example of fall detection in smart home environments, provides a robust test for evaluating the practical effectiveness of our proposed algorithm.
Dataset introduction
- CIFAR-10: This widely used computer vision dataset consists of 60,000 color images at a resolution of 32\(\times\)32 pixels, spread across 10 categories, with 6,000 images in each category. The dataset is divided into 50,000 images for training and 10,000 images for testing.
- CIFAR-100: While structurally similar to CIFAR-10, CIFAR-100 comprises 60,000 color images at a resolution of 32\(\times\)32 pixels but is organized into 100 distinct categories, with each category containing 600 images. This dataset offers more fine-grained classification, with subcategories such as “bees” and “beetles” under the broader “insects” category.
- FEMNIST: An extension of the MNIST dataset and a benchmark for FL, FEMNIST contains 805,263 grayscale images at a resolution of 28\(\times\)28 pixels. These images are distributed across 62 character classes: 10 digits, 26 lowercase letters, and 26 uppercase letters. The data originates from 3,500 users, reflecting the diversity and non-IID nature of real-world handwriting samples.
- MobiAct42: Designed for human activity recognition and fall detection research, the MobiAct dataset captures data from 67 volunteers using smartphone sensors (e.g., accelerometers and gyroscopes) across more than 3,200 tests. It includes 12 different daily activities (such as walking, running, and stair climbing) and simulates four types of falls (forward, backward, left, and right). The heterogeneity of the data stems from the varying physical conditions of the volunteers.
Non-IID data partitioning
To rigorously evaluate personalized federated learning methods, it is essential to simulate FL environments realistically and partition datasets in a manner that reflects non-IID conditions. Below, we detail the partitioning processes for each dataset:
- CIFAR-10: Since there is no natural partitioning method, we use the Dirichlet distribution to allocate data among clients. The parameter \(\alpha\) controls the uniformity of the distribution: larger values of \(\alpha\) result in more similar client data distributions, while smaller values increase the differences. For each label y, an N-dimensional vector \(p_y\) is sampled from the Dirichlet distribution, where \(p_{y,n}\) represents the proportion of label y data assigned to the n-th client. This approach ensures that each client’s dataset is composed of samples from multiple labels, with varying distributions.
- CIFAR-100: Utilizing the dataset’s hierarchical label structure (coarse and fine labels), we partition the data using the Pachinko allocation method43. This technique employs two-level Dirichlet distributions: the first assigns a distribution with parameter \(\alpha\) to each client to determine the probability of coarse labels appearing, and the second assigns a distribution with parameter \(\beta\) to determine the probability of fine labels within each coarse label. Data is then allocated to clients according to these distributions, allowing for controlled variation in data distribution at both levels.
- FEMNIST: Data is partitioned based on the authorship of characters, with all characters written by a single individual forming a client’s dataset. The inherent variability in handwriting styles across different authors ensures a non-IID distribution.
- MobiAct: Data is divided based on the volunteers who contributed it, creating two groups of client datasets with distinct data heterogeneity. In the “volunteer-based partitioning” group, each volunteer represents a client, reflecting natural non-IID datasets resulting from individual habits and distinct physical conditions. In the “label skew” group, volunteers are assigned different sets of activities, introducing varying degrees of label skew. Specifically, we create three partitions: “natural partition,” where each client’s dataset is based on a single volunteer’s data; “label skew 8,” where each client has 8 activities distinct from the others, introducing moderate label skew; and “label skew 4,” where each client has 4 distinct activities, introducing significant label skew.
These partitioning strategies ensure that the datasets realistically simulate FL conditions, enabling a thorough evaluation of personalized federated learning algorithms.
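The two Dirichlet-based splits above can be sketched as follows. This is an illustrative reimplementation, not the paper's code; the function names, the `labels` array, and the `coarse_to_fine` mapping are our assumptions:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Label-wise Dirichlet split (CIFAR-10 style): for each label y,
    p_y ~ Dir(alpha) gives the fraction of that label's samples assigned
    to each client; smaller alpha yields more heterogeneous clients."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for y in np.unique(labels):
        idx = np.flatnonzero(labels == y)
        rng.shuffle(idx)
        p_y = rng.dirichlet(alpha * np.ones(num_clients))
        # cumulative proportions -> split points into this label's samples
        splits = (np.cumsum(p_y)[:-1] * len(idx)).astype(int)
        for n, part in enumerate(np.split(idx, splits)):
            client_indices[n].extend(part.tolist())
    return client_indices

def pachinko_label_probs(coarse_to_fine, alpha, beta, rng):
    """Two-level Dirichlet draw (CIFAR-100 style): a Dir(alpha) over
    coarse labels, then a Dir(beta) over the fine labels within each
    coarse label, giving one client's probability for every fine label."""
    p_coarse = rng.dirichlet(alpha * np.ones(len(coarse_to_fine)))
    probs = {}
    for c, fines in enumerate(coarse_to_fine.values()):
        p_fine = rng.dirichlet(beta * np.ones(len(fines)))
        for f, fine in enumerate(fines):
            probs[fine] = p_coarse[c] * p_fine[f]
    return probs
```

Sampling each client's local dataset according to `pachinko_label_probs` then reproduces the skew controlled jointly by \(\alpha\) (coarse level) and \(\beta\) (fine level).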
Experimental tasks
To thoroughly evaluate the performance, training efficiency, and overall effectiveness of our FedPRL algorithm across various classification tasks, and to validate the characteristics we designed into the algorithm, we developed four specific classification tasks, each aligned with a particular dataset. Table 1 provides a summary of the datasets, models, and transition representation details for each experimental task.
For the CIFAR-10, CIFAR-100, and FEMNIST datasets, we selected MobileNet-v244 as the base model. The model’s weights are initialized using the default pre-trained weights provided by the torchvision library. We use the output from the model’s last hidden layer as the transition representation for these tasks.
For the MobiAct dataset, we implemented a gated recurrent unit (GRU) model as the base model. Each GRU layer consists of 256 hidden units, with model weights randomly initialized from a uniform distribution over [-0.1, 0.1]. The input to the model’s final linear layer is used as the transition representation.
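A PyTorch sketch of this base model is given below. Only the 256 hidden units, the uniform [-0.1, 0.1] initialization, and the choice of transition representation follow the text; the input feature and class counts are illustrative placeholders:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """GRU base model sketch: a single GRU layer with 256 hidden units;
    the input to the final linear layer is the transition representation."""
    def __init__(self, num_features, num_classes, hidden=256):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)
        # initialize all weights uniformly in [-0.1, 0.1], as in the text
        for p in self.parameters():
            nn.init.uniform_(p, -0.1, 0.1)

    def forward(self, x):
        out, _ = self.gru(x)   # (batch, time, hidden)
        rep = out[:, -1, :]    # transition representation (last time step)
        return self.fc(rep), rep
```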
Implementation details
Heterogeneous hardware setup
To simulate an FL environment with system heterogeneity, we introduced a variety of edge devices and transmission protocols, as detailed in Tables 2 and 3. This setup is designed to replicate the differing computing and transmission capabilities of various edge platforms. The specific configurations are inspired by devices such as the Intel Core i7, Raspberry Pi 3, Jetson AGX Orin, Jetson Nano, and Jetson Xavier NX. The frequencies and bandwidths listed represent average operational values for these platforms. To reflect real-world dynamic conditions, we modeled these values using normal distributions with varying standard deviations. During each training round, a value is randomly drawn from the relevant distribution to represent the client’s CPU frequency or network bandwidth. This approach effectively simulates the variability in computing resources and communication capabilities in practical situations, thereby highlighting how system heterogeneity can impact the process of FL.
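The per-round resource sampling can be sketched as follows; the device names, means, and standard deviations here are placeholders, not the values from Tables 2 and 3:

```python
import random

def sample_client_resources(profiles, seed=None):
    """Draw each client's CPU frequency (GHz) and bandwidth (Mbps) for one
    training round from normal distributions around per-device means.

    profiles maps a device name to (freq_mu, freq_sd, bw_mu, bw_sd);
    all numbers supplied by the caller are illustrative assumptions.
    """
    rng = random.Random(seed)
    drawn = {}
    for name, (freq_mu, freq_sd, bw_mu, bw_sd) in profiles.items():
        freq = max(0.1, rng.gauss(freq_mu, freq_sd))  # clamp to stay positive
        bw = max(0.1, rng.gauss(bw_mu, bw_sd))
        drawn[name] = {"cpu_ghz": freq, "bandwidth_mbps": bw}
    return drawn
```

Re-drawing from these distributions every round yields the fluctuating compute and communication capabilities that the text describes.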
Experimental environment
All experiments were conducted on a remote server running Ubuntu 20.04.5, equipped with an Intel(R) Xeon(R) Platinum 8255C CPU, an RTX 3080 GPU with 10GB of video memory, 40GB of RAM, and 80GB of storage capacity.
Parameter setting
For experiments involving the CIFAR-10, CIFAR-100, and FEMNIST datasets, federated training was carried out over 300 rounds. In each round, 10% or 20% of the clients were selected using our intelligent client selection method. After completing local training, clients adjusted the learning rate to 99% of its original value. For the MobiAct dataset, the training spanned 500 rounds, with a similar selection of 10% of clients per round and the same learning rate adjustment strategy.
In all experiments, our method utilized Euclidean distance as the metric for calculating local predictions, with the kNN model implemented using the IndexFlatL2 class from the FAISS library45 for accurate nearest neighbor searches. The hyperparameters for other baseline methods were set according to their original papers, with grid search employed to optimize these parameters in the current experimental environment. The hyperparameters used in our method for each dataset during the experiments are summarized in Table 4.
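The kNN lookup over local data storage is an exact Euclidean search, which FAISS's `IndexFlatL2` accelerates. Below is a plain NumPy sketch of the same computation; the function name and the label-distribution output format are our assumptions:

```python
import numpy as np

def knn_predict(store_reps, store_labels, query, k, num_classes):
    """Exact Euclidean kNN over the local data storage (the computation
    that FAISS's IndexFlatL2 performs), returning a class distribution
    for a single query representation."""
    d2 = np.sum((store_reps - query) ** 2, axis=1)  # squared L2 distances
    nearest = np.argsort(d2)[:k]                    # indices of k closest
    counts = np.bincount(store_labels[nearest], minlength=num_classes)
    return counts / k                               # empirical label probs
```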
Evaluation metrics
To evaluate the performance, training efficiency, and overall effectiveness of our method on classification tasks, we utilize three key metrics: the average accuracy of personalized models, total training time, and F1 score.
Average accuracy of personalized models: This metric represents the weighted average accuracy across all personalized models on individual clients, providing a comprehensive measure of the personalized federated learning method’s performance across the complete set of clients. It can be calculated as follows:
$$\begin{aligned} \text {Accuracy}=\frac{\sum ^N_{n=1}(TP_n+TN_n)}{\sum ^N_{n=1}|D_n|}, \end{aligned}$$
(16)
where \(TP_n + TN_n\) represents the count of correctly predicted samples by client n’s personalized model, while \(|D_n|\) indicates the total number of samples in client n’s local dataset.
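Eq. (16) reduces to a ratio of two sums over clients; a minimal sketch:

```python
def average_personalized_accuracy(correct_counts, dataset_sizes):
    """Eq. (16): sample-weighted average accuracy across all clients.

    correct_counts[n] = TP_n + TN_n for client n's personalized model;
    dataset_sizes[n]  = |D_n|, the size of client n's local dataset.
    """
    return sum(correct_counts) / sum(dataset_sizes)
```

Weighting by \(|D_n|\) in this way means clients with larger local datasets contribute proportionally more to the reported average.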
Total training time: This metric captures the cumulative sum of local training and data transmission time for all participating clients throughout the global model training process, reflecting the general efficiency of the federated training process. It can be calculated as follows:
$$\begin{aligned} \text {Total Training Time}=\sum \limits ^I_{i=1}\sum \limits ^N_{n=1}T^i_n, \end{aligned}$$
(17)
where \(T^i_n\) is the sum of local training and data transmission time for client n in the i-th round of federated training.
F1 score: The F1 score offers a balanced evaluation of the model’s classification performance by integrating both precision and recall. It can be calculated as follows:
$$\begin{aligned} \text {F1}=2\times \frac{\text {precision}\cdot \text {recall}}{\text {precision}+\text {recall}}, \end{aligned}$$
(18)
where precision = \(\frac{TP}{TP+FP}\) and recall = \(\frac{TP}{TP+FN}\). Here, TP, TN, FP, and FN indicate true positives, true negatives, false positives, and false negatives, respectively.
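Eq. (18) follows directly from the raw counts; a minimal sketch:

```python
def f1_score(tp, fp, fn):
    """Eq. (18): harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two terms, a model cannot score well on F1 by trading precision against recall, which is why the fall detection task reports this metric alongside accuracy.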
Comparison methods
We compare our FedPRL against three representative baseline methods in federated learning and personalized federated learning.
- FedAvg1: A foundational algorithm in FL, FedAvg aggregates local models from multiple devices by averaging their parameters to create a global model.
- Ditto20: An advanced personalized federated learning algorithm, Ditto leverages regularization techniques to generate personalized models for each client, enhancing model accuracy, fairness, and robustness in FL.
- FedRep24: A state-of-the-art personalized federated learning algorithm, FedRep achieves superior personalized model performance by training the representation layer on local clients while optimizing the classifier layer on the server.
Results
We conducted a series of experiments and performed a detailed analysis of the results to address the key research questions Q1-Q7.
Result 1: Comparison of average performance and training efficiency in personalized models (for Q1)
To address Q1, we evaluated the average performance and training efficiency of personalized models produced by various algorithms across the natural partition, label skew 8 partition, and label skew 4 partition of the CIFAR-10, CIFAR-100, FEMNIST, and MobiAct datasets. The algorithms tested included our FedPRL algorithm as well as established FL algorithms: FedAvg, Ditto, and FedRep. To assess the general performance of these algorithms across all clients, we calculated the average accuracy of personalized models, with weights proportional to the number of samples in each client’s dataset. Additionally, we evaluated the model training efficiency of each algorithm by measuring the total training time and recording the number of training rounds required to achieve the target accuracy.
(1) Comparison of average accuracy in personalized models: The results of average accuracy and training rounds are presented in Table 5, where MobiAct (v1), MobiAct (v2), and MobiAct (v3) correspond to the natural partition, label skew 8 partition, and label skew 4 partition of the MobiAct dataset, respectively. Our FedPRL algorithm consistently achieved the highest accuracy across all datasets, outperforming the state-of-the-art methods FedAvg, Ditto, and FedRep. Specifically, on the CIFAR-10 dataset, FedPRL outperformed the baseline algorithms by 21.70%, 7.36%, and 6.32%, respectively. On CIFAR-100, the accuracy improvements were 20.70%, 7.21%, and 5.50%, respectively. On FEMNIST, FedPRL surpassed the baselines by 13.11%, 3.34%, and 2.34%. For the MobiAct dataset, FedPRL improved accuracy by 17.33%, 14.22%, and 12.06% on the natural partition (v1), 18.92%, 15.56%, and 12.53% on the label skew 8 partition (v2), and 9.61%, 6.25%, and 5.98% on the label skew 4 partition (v3). Overall, our algorithm demonstrated an average accuracy improvement of 16.90%, 8.99%, and 7.46% over the baseline algorithms.
(2) Comparison of training efficiency in personalized models: Figure 4 illustrates the total training time of our FedPRL algorithm compared to three baseline algorithms (FedAvg, Ditto, and FedRep) across the CIFAR-10, CIFAR-100, and FEMNIST datasets. On each of these datasets, FedPRL achieved the shortest final total training time, significantly outperforming the state-of-the-art methods FedAvg, Ditto, and FedRep. Specifically, on the CIFAR-10 dataset, FedPRL demonstrated accelerations of 26.54%, 23.86%, and 28.32% compared to the baseline algorithms, respectively. For CIFAR-100, these accelerations were 26.42%, 27.73%, and 29.48%, while on FEMNIST, FedPRL achieved improvements of 22.99%, 22.34%, and 24.98%, respectively. Overall, FedPRL averaged a speedup of 25.32%, 24.64%, and 27.59% against the baseline algorithms.
Furthermore, since Ditto and FedRep primarily address data heterogeneity without optimizations for system heterogeneity, their total training times remain comparable to FedAvg. Ditto’s personalized strategy involves lower computational demands, while FedRep’s additional fine-tuning steps introduce computational overhead, generally resulting in a total training time order of Ditto < FedAvg < FedRep, as shown in Fig. 4. Notably, in Fig. 4, after reaching the midpoint (150) of total training rounds, FedPRL’s total training time consistently became markedly lower than that of the baseline algorithms, with this advantage widening progressively as training advanced. This improvement is due to FedPRL’s reinforcement learning mechanism, which, upon reaching halfway through the training rounds, shifts focus from exploring new client combinations to leveraging previously identified optimal combinations, thereby maximizing both model performance and training efficiency.
In addition to reporting average accuracy, Table 5 presents the number of training rounds required for each algorithm to achieve predefined target accuracies across datasets. For each dataset, we set target accuracies at 77.0%, 59.6%, 64.3%, 80.2%, 78.8%, and 75.8%, respectively. Our proposed FedPRL consistently achieves these targets with the fewest training rounds across all datasets, outperforming all baseline algorithms. These results demonstrate that FedPRL offers superior convergence speed.
Summary 1: The results clearly highlight the superiority of the FedPRL algorithm, demonstrating its enhanced effectiveness over FedAvg, Ditto, and FedRep in addressing both data and system heterogeneity in heterogeneous environments. FedPRL effectively mitigates the adverse effects of data heterogeneity, leading to notable improvements in the performance of FL models across diverse client datasets. Additionally, it addresses system heterogeneity, reducing model total training time, accelerating convergence, and thereby significantly enhancing model training efficiency.

The variation of total training time during federated training on CIFAR-10, CIFAR-100, and FEMNIST.
Result 2: Performance on new clients (for Q2)
In real-world scenarios, the model trained through FL must not only perform well on the clients that participated in the training process but also generalize effectively to new clients that join afterward. Ensuring excellent performance on these new clients requires the model to have strong generalization capabilities. To address Q2, we conducted experiments in which only 80% of the clients took part in the initial training phase, while the remaining 20% were introduced subsequently to test the adaptability of our FedPRL algorithm to new clients. Specifically, we assessed whether the algorithm could generate personalized models that are effective for these new clients.
Table 6 presents the accuracy of the personalized models generated by the FedPRL algorithm for these new clients. When compared to the results in Table 5, we observe that the personalized models created by FedPRL perform similarly for both the clients involved in the training and those that joined later. Specifically, across the six datasets, the accuracy of personalized models for new clients is only slightly lower, by 5.30%, 5.90%, 4.04%, 1.90%, 2.23%, and 0.12%, respectively, resulting in an average decrease of just 3.25%.
Summary 2: The FedPRL algorithm demonstrates strong generalization to new clients, as the accuracy for these new clients is only marginally lower than for those that participated in training. Thanks to the design of the FedPRL algorithm, new clients can easily obtain the global model from the server, use it to build their local data storage for the kNN model, and quickly generate high-quality personalized models. This confirms that FedPRL is highly effective at adapting to new clients and efficiently producing accurate personalized models for them.
Result 3: Impact of local data storage size and data heterogeneity (for Q3)
Distinguishing between new and old clients is essential not only for practical scenarios but also for understanding how various factors impact the performance of the FedPRL algorithm. In this context, “new clients” are those that join after federated training has concluded and did not participate in global model training, while “old clients” are those that contributed to the global model training.
To address Q3, we proportionally reduced the local data storage size for new clients while keeping the global model unchanged, and then tested the average accuracy of the personalized models generated for these new clients. For each client n, the local dataset size is \(|D_n|\), and a capacity parameter \(w_n\) determines the size of the data storage, calculated as \(w_n\cdot |D_n|\). By adjusting the capacity parameter, we modified the local data storage size to observe its impact on model accuracy. Additionally, we introduced a parameter \(\alpha\) to represent data heterogeneity, where \(\alpha\) ranges from 0 to 1, with smaller values indicating stronger data heterogeneity. We created five sub-datasets with varying degrees of data heterogeneity (i.e., 0.1, 0.3, 0.5, 0.7, and 1.0) on the CIFAR-10 and CIFAR-100 datasets and adjusted the capacity parameters on these datasets to evaluate the corresponding model accuracy. This approach allowed us to assess the impact of data heterogeneity on personalized model performance.
The experimental results are presented in Fig. 5. The data shows that as the size of the local data storage decreases, the accuracy of the personalized model also declines. However, even when reducing the local data storage from the full dataset size to one-third (with a capacity parameter of 0.33), the accuracy remains close to that of the maximum storage size, with changes not exceeding 0.83%. This suggests that the FedPRL algorithm is not highly sensitive to the size of local data storage, offering considerable flexibility. Devices with limited storage capacity can still achieve high accuracy by setting local data storage to one-third of the dataset size, while those with larger storage capacity can maximize accuracy by utilizing the full dataset.
It is important to note that if the local data storage size for old clients is also altered during the experiment, the global model and transition representation would change, introducing multiple variables that could obscure the impact of local data storage size on algorithm performance. Figure 6 shows the results of experiments where these variables were introduced.

The relationship between local data storage size and accuracy under different data heterogeneity (global model unchanged).

The relationship between local data storage size and accuracy under different data heterogeneity (global model changed).
From Figs. 5 and 6, it is also evident that with a fixed capacity parameter, the accuracy of the personalized model increases as \(\alpha\) decreases. This indicates that stronger data heterogeneity leads to higher accuracy in the personalized model, demonstrating the FedPRL algorithm’s effectiveness in handling diverse client data distributions. Additionally, when the capacity parameter is set to 0 (indicating no local data storage), the global model is used, and the test accuracy reflects the global model’s performance. In Fig. 5, where the global model is fixed, the test accuracy at the starting point (capacity parameter 0) remains constant across different \(\alpha\) values. However, in Fig. 6, where the global model changes with \(\alpha\), the accuracy at the starting point decreases as data heterogeneity increases, but it recovers as the local data storage size increases, indicating that larger data storage can offset the negative impact of data distribution heterogeneity on the global model.
Summary 3: As local data storage size decreases, personalized model accuracy diminishes, but the FedPRL algorithm remains largely insensitive to storage size, achieving optimal accuracy with storage set to between one-third and the full size of the local dataset. Moreover, stronger data heterogeneity enhances personalized model accuracy, underscoring the algorithm’s effectiveness in managing data heterogeneity challenges.
Result 4: Hyperparameter sensitivity (for Q4)
To address Q4, we designed an experiment to assess the sensitivity of the FedPRL algorithm to its hyperparameters. Since the algorithm’s personalization strategy involves a model interpolation method and a kNN model, it is crucial to assess the impact of the model mixing parameter \(\lambda _n\) in the interpolation method and the number of nearest neighbors k in the kNN model, as these parameters influence the prediction performance of personalized models.
(1) Impact of \(\lambda _n\) on algorithm performance: We conducted a series of experiments using the control variable method to test the average accuracy of personalized models at various \(\lambda _n\) values. The experiments were performed on the CIFAR-10 and CIFAR-100 datasets, with data heterogeneity \(\alpha\) set to 0.3. Fifty clients participated in the FL process, each with non-IID datasets of varying sizes. The number of nearest neighbors k in the kNN model was fixed at 10 and 12, with Euclidean distance used as the distance metric.
Figure 7 illustrates the relationship between \(\lambda _n\) and the average accuracy of personalized models. The four lines represent clients with different sample sizes. The results indicate that the optimal \(\lambda _n\) is approximately 0.8 for CIFAR-10 and 0.9 for CIFAR-100, with the optimal value increasing as the number of client samples grows. Additionally, the accuracy of personalized models varies significantly with changes in \(\lambda _n\), demonstrating that the FedPRL algorithm is sensitive to the interpolation parameter \(\lambda _n\).
This sensitivity can be explained by the principle of model personalization: clients with more local data typically experience greater differences in data distribution compared to other clients. Because the global model, which is based on an averaged distribution, may not accurately reflect these specific distributions, the local model becomes more critical for capturing the client’s data characteristics. Consequently, clients with larger datasets rely more heavily on their local models, leading to a higher optimal \(\lambda _n\) value to achieve the best personalized model performance.
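Under this reading, the personalized prediction can be sketched as a convex combination of the two predictive distributions. The exact mixing form below is our assumption based on the description of \(\lambda _n\), with the local kNN prediction weighted by \(\lambda _n\):

```python
import numpy as np

def personalized_prediction(global_probs, knn_probs, lam):
    """Mix the global model's class probabilities with the local kNN
    prediction; lam (the paper's lambda_n) weights the local component,
    so clients with larger, more distinctive datasets favor larger lam."""
    return lam * knn_probs + (1.0 - lam) * global_probs
```

Since both inputs are probability distributions and the weights sum to one, the output remains a valid distribution for any \(\lambda _n \in [0, 1]\).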

The relationship between model interpolation parameter \(\lambda _n\) and accuracy in clients with different sample sizes.
(2) Impact of k on algorithm performance: We also tested the impact of the number of nearest neighbors k on the FedPRL algorithm’s performance by evaluating the average accuracy of personalized models across various k values. These experiments were conducted on the CIFAR-10 and CIFAR-100 datasets, with data heterogeneity \(\alpha\) set to 0.1. Again, fifty clients participated, with varying sample sizes and non-IID data distributions. The distance metric was Euclidean distance, and the k values tested were 1, 3, 5, 7, 10, 12, 14, 16.
Figure 8 shows the relationship between k and the average accuracy of personalized models. For CIFAR-10, the optimal k value lies between 5 and 12, where the average accuracy is both highest and most stable. Within this range, the accuracy varies by no more than 0.25%. For CIFAR-100, the optimal k value range is between 5 and 14. Given the broad range of optimal k values and the minimal fluctuation in accuracy, we conclude that the FedPRL algorithm is not sensitive to the k parameter.

The relationship between the number of nearest neighbors k and the average accuracy of personalized models.
Summary 4: The performance of the FedPRL algorithm is sensitive to the interpolation parameter \(\lambda _n\), with the optimal value increasing as the quantity of client samples increases. However, the algorithm is not sensitive to the k value in the kNN model. For the CIFAR-10 and CIFAR-100 datasets, optimal k values can be chosen from the ranges 5-12 and 5-14, respectively, without significant impact on performance.
Result 5: Robustness to data distribution shifts (for Q5)
As discussed earlier, the FedPRL algorithm incorporates three strategies for updating local data storage in response to changes in client data distribution within a dynamic environment: First-in-First-out (FIFO), Insert, and NoUpdate.
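The three update strategies can be sketched as follows; the function signature and the fixed-capacity store are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

def update_storage(store, new_samples, strategy, capacity):
    """Three local data storage update strategies: FIFO evicts the oldest
    samples once capacity is reached, Insert appends without eviction,
    and NoUpdate leaves the store untouched."""
    if strategy == "NoUpdate":
        return store
    if strategy == "Insert":
        store.extend(new_samples)
        return store
    if strategy == "FIFO":
        fifo = deque(store, maxlen=capacity)
        fifo.extend(new_samples)  # oldest entries fall out automatically
        return list(fifo)
    raise ValueError(f"unknown strategy: {strategy}")
```

The `deque` with `maxlen` captures why FIFO recovers after a distribution shift: stale samples from the old distribution are evicted as new ones arrive, whereas Insert retains them and NoUpdate never sees the new distribution at all.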
To address Q5 and evaluate the effectiveness of these local data storage update strategies, we simulated the dynamic environment described above and conducted experiments to assess the performance of personalized models under each strategy as the client data distribution changed.
Figure 9 illustrates how the accuracy of personalized models varies with the three local data storage update strategies as client data distribution evolves. The horizontal axis represents time, during which new sample data is incrementally added to the client dataset. At time \(t_0=14\), data from a new distribution begins to be introduced. Before \(t_0\), the client adds samples from the original distribution, and after \(t_0\), it starts incorporating samples from the new distribution.
If the client does not update its data store (NoUpdate strategy), accuracy drops sharply at \(t_0=14\) when the data distribution changes, and it struggles to recover. Under the FIFO strategy, we observe some fluctuations in accuracy before \(t_0\), caused by changes in data storage affecting kNN predictions. Although accuracy decreases at \(t_0\), it gradually improves as the data store is populated with samples from the new distribution. The Insert strategy produces similar results to FIFO, but with notable differences: before \(t_0\), the growing number of samples in the data store enhances kNN predictions, leading to improved accuracy. After \(t_0\), accuracy also increases, but at a slower rate compared to FIFO, due to the retention of samples from the older distribution.
Summary 5: Of the local data storage update strategies tested, the FIFO and Insert strategies significantly enhance the robustness of the FedPRL algorithm in environments where data distribution is dynamically changing. The FIFO strategy proves to be more robust than the Insert strategy, as it effectively clears out outdated samples. In contrast, the NoUpdate strategy results in a substantial and irrecoverable decline in personalized model accuracy following a change in data distribution.

Personalized model accuracy under three strategies with changing data distributions.
Result 6: Model performance and training efficiency in fall detection task (for Q6)
To address Q6, we evaluated the performance and training efficiency of the FedPRL algorithm on the MobiAct dataset, using natural partition, label skew 8 partition, and label skew 4 partition settings. This assessment focused on the algorithm’s effectiveness in the fall detection task within a smart home environment. We used F1 score (%) and total training time (s) as the evaluation metrics, with the FedAvg algorithm serving as the baseline for comparison.
Figure 10 illustrates the F1 score and total training time for both the FedPRL and FedAvg algorithms during training, highlighting their convergence behavior across these two metrics. The results clearly show that the FedPRL algorithm reaches the target F1 score more quickly than FedAvg. For example, with a target F1 score of 77.8% under the label skew 4 partition, FedAvg needs 200 training rounds to achieve the target, while FedPRL reaches it in just 83 rounds, reducing the training rounds by 58.5%. Similarly, under the label skew 8 partition with an F1 score target of 86.75%, FedAvg takes 448 rounds to meet the goal, whereas FedPRL accomplishes this in only 72 rounds, reducing the training rounds by 83.93%.
Moreover, the total client training time for FedAvg increases linearly, whereas FedPRL converges to a constant time, indicating that FedPRL effectively identifies the optimal client combination. This balance between maximizing the F1 score and minimizing total training time demonstrates the effectiveness of the client selection method used by FedPRL.
Summary 6: The client selection method developed for the FedPRL algorithm intelligently identifies the optimal client combination, enhancing both model performance and training efficiency in heterogeneous environments. This approach effectively tackles the challenges associated with data and system heterogeneity, with real-world validations underscoring the superiority and practical applicability of the FedPRL algorithm, indicating its potential for integration with the medical and healthcare field.

The variation of F1 score and total training time during federated training on MobiAct.
Result 7: Ablation study (for Q7)
To address Q7, we conducted comprehensive ablation studies on the CIFAR-10 dataset to assess each component’s specific contributions to the performance and efficiency of the FedPRL algorithm. The FedPRL algorithm incorporates three specifically designed and optimized components: (1) client selection based on RL and user quality evaluation, (2) local training based on global knowledge distillation of non-target classes, and (3) global model personalization based on local data storage. In our visualizations, these components are labeled as “client selection”, “local training”, and “personalization” in both Table 7 and Fig. 11. Our ablation studies use the FedAvg baseline across datasets with data heterogeneity degrees of 0.1 (highest heterogeneity), 0.3, 0.5, 0.7, and 1.0 (lowest heterogeneity). We incrementally add one or two components until reaching the full FedPRL configuration, with each intermediate combination serving as a distinct training method in our study. Evaluation metrics include model accuracy (%) and total training time (s). By systematically comparing different component combinations, we gain insights into each component’s specific contributions to both overall model performance and training efficiency.
Table 7 reports model accuracy for different combinations of FedPRL components when trained on datasets with varying degrees of data heterogeneity. The results indicate that each component contributes positively to accuracy relative to the FedAvg baseline, though to differing degrees. Specifically, the client selection component alone yields a modest average accuracy increase of 1.70%, while the local training component alone achieves a slightly higher average improvement of 3.79%. Notably, the global model personalization component alone drives the most substantial gain, an average improvement of 15.47%, outperforming the other components. Each component independently enhances model accuracy across all heterogeneity levels, and the improvements become more pronounced as heterogeneity increases: greater distribution differences degrade FedAvg's performance, leaving FedPRL's components more room for impactful optimization. Although the client selection and local training components each offer modest improvements, they fall short on highly heterogeneous datasets, where accuracy remains lower than on less heterogeneous ones. In contrast, the global model personalization component achieves significant accuracy gains, with performance on highly heterogeneous datasets matching or even exceeding that on less heterogeneous ones. The global model personalization component therefore plays a critical role in addressing data heterogeneity, consistently boosting model accuracy across all heterogeneity levels, particularly on datasets with pronounced distribution differences.
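A minimal sketch of the idea behind the personalization component: interpolating a global model's class probabilities with a k-nearest-neighbor estimate over the client's local data storage via a mixing parameter \(\lambda\) (here `lam`). The Euclidean distance metric, the \(\lambda\) value, and the toy data are illustrative assumptions, not the paper's exact design:

```python
import math
from collections import Counter

def knn_predict(store, x, k, num_classes):
    """Class-probability estimate from the k nearest (feature, label) pairs
    held in the client's local data storage."""
    nearest = sorted(store, key=lambda item: math.dist(item[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return [counts.get(c, 0) / k for c in range(num_classes)]

def personalized_predict(global_probs, store, x, k, lam):
    """Mix the global model's output with the local kNN estimate:
    p = lam * p_global + (1 - lam) * p_knn, where lam is the mixing parameter."""
    local = knn_predict(store, x, k, num_classes=len(global_probs))
    return [lam * g + (1 - lam) * l for g, l in zip(global_probs, local)]

# Toy 2-class example: the local storage strongly favors class 1 near x,
# so the personalized prediction is pulled from class 0 toward class 1.
store = [((0.0, 0.0), 1), ((0.1, 0.1), 1), ((5.0, 5.0), 0)]
probs = personalized_predict([0.7, 0.3], store, x=(0.05, 0.05), k=2, lam=0.5)
print(probs)
```

Because the kNN term is driven entirely by the client's own stored samples, it adapts the prediction to the local distribution, which is consistent with this component's large gains under strong heterogeneity.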
Moreover, compared to methods utilizing a single component, adding additional components consistently enhances model accuracy, underscoring the contribution of each component to overall performance. Specifically, incorporating the client selection component yields an average accuracy increase of 1.41% over the single-component method. Adding the local training component results in a further average improvement of 3.27%. Most significantly, integrating the global model personalization component elevates accuracy by an average of 15.01%. In terms of model performance, the client selection and local training components specifically support and amplify the impact of the global model personalization component. This synergy further enhances model accuracy across datasets with varying degrees of data heterogeneity, illustrating the complementary roles of each component in strengthening model adaptability and robustness.
Figure 11 illustrates the total training time for different combinations of FedPRL components following 300 rounds of federated learning. Because our personalization step is highly efficient and can be performed after global model training, the global model personalization component has minimal impact on overall training time: its inclusion results in only a marginal increase, averaging a mere 0.55%. The computational overhead introduced by the optimized local training component is likewise minimal, with negligible influence on the total training time, as evidenced by the nearly identical totals observed before and after its integration. Notably, the figure clearly demonstrates that the total training times for methods incorporating the client selection component are markedly lower than for those without it, with an average reduction of 26.65%. This observation highlights the client selection component's role in enhancing training efficiency: it reduces total training time and accelerates convergence, thereby significantly improving the overall efficiency of the federated learning process.
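The training-time effect of client selection can be illustrated with a simple synchronous-FL timing model in which each round waits for its slowest participant. The client speeds and round count below are made-up numbers, not measurements from the paper:

```python
def total_training_time(client_times, rounds, selected=None):
    """Synchronous FL timing model: each round costs the maximum
    (straggler) per-round time among the participating clients."""
    pool = client_times if selected is None else [client_times[i] for i in selected]
    return rounds * max(pool)

# Hypothetical per-round client compute times in seconds.
times = [2.0, 2.5, 3.0, 9.0, 12.0]   # clients 3 and 4 are stragglers

all_clients = total_training_time(times, rounds=300)            # waits on client 4
fast_subset = total_training_time(times, rounds=300, selected=[0, 1, 2])
print(all_clients, fast_subset, 1 - fast_subset / all_clients)  # 3600.0 900.0 0.75
```

Under this model, excluding stragglers from the selected subset shrinks every round's wall-clock cost, which is the mechanism behind the reported average reduction in total training time.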
Summary 7: The ablation experiments demonstrate that the three key components in FedPRL are critical for enhancing both model performance and training efficiency. The client selection component enables the algorithm to adapt effectively to heterogeneous environments, addressing system heterogeneity issues. By selecting an optimal subset of clients, this component significantly reduces total training time and accelerates convergence, thus boosting training efficiency. Moreover, it contributes to a modest improvement in model performance while maintaining efficiency gains. The local training component complements the global model personalization component, further enhancing model performance across datasets with varying degrees of data heterogeneity. The global model personalization component is crucial for addressing data heterogeneity, leading to substantial improvements in model performance, especially on datasets with high heterogeneity. In sum, FedPRL, incorporating these three components, effectively addresses both data and system heterogeneity in heterogeneous environments, delivering marked improvements in both model performance and training efficiency.

Ablation Study on FedPRL: Total training time across different component combinations after 300 rounds.
