Reinforcement learning-driven task migration for effective temperature management in 3D NoC systems

This section first presents the assumptions of the problem. We then give a concise specification of the cooling-based algorithm and its associated parameters, followed by an explanation of how the migration destination is selected.
The primary objective of this work is to develop an efficient task migration strategy for 3D Network-on-Chip (NoC) systems that minimizes peak temperature and balances thermal distribution across processing cores. Due to the vertical stacking of layers in 3D architectures, thermal hotspots emerge, negatively impacting system reliability and performance. An optimal migration strategy should intelligently relocate tasks to cooler regions while considering communication overhead, task dependencies, and computational efficiency.
Several constraints must be addressed to ensure effective migration. First, task dependencies must be preserved to avoid excessive communication delays, meaning tasks with strong interdependencies should remain in close proximity. Second, migration incurs data transfer costs that should be minimized to prevent network congestion and additional energy consumption. Additionally, the system must dynamically adapt to workload variations, requiring a real-time decision-making mechanism. Therefore, the proposed reinforcement learning-based approach optimizes task placement by continuously learning from temperature profiles and workload changes, ensuring a thermally balanced and performance-efficient system.
Problem definition
The system is presumed to comprise homogeneous processing elements, meaning that each task can be executed by any of the processing units. Initially, all tasks are arbitrarily allocated to the processors, with each processing element assigned at most one task. How tasks relate to one another, and the rate at which they communicate, is captured by the task communication graph. This graph is a directed acyclic graph, denoted CTG = (T, E), where \(T_i \in T\) represents the i-th task and \(e_{i,j} \in E\) signifies the dependency between tasks \(T_i\) and \(T_j\). The communication volume of each edge is denoted by \(w(i, j)\).
The mapping of tasks to processing elements is represented by the mapping matrix M, of dimensions \(\left|PE\right| \times \left|T\right|\), where \(\left|PE\right|\) denotes the number of processors and \(\left|T\right|\) the number of tasks in the CTG. Each element \(M_{T_m}^{Pe_i}\) of the matrix is assigned a value of one or zero depending on whether task \(T_m\) is placed on core \(Pe_i\). Relations (1) and (2) present the mathematical formulation of this problem38,39.
$$\forall T_m \in T:\ \sum_{\forall Pe_i \in PE} M_{T_m}^{Pe_i} = 1$$
(1)
$$M_{T_m}^{Pe_i} = \begin{cases} 1 & \text{if } \mathrm{map}\left(T_m\right) = Pe_i \\ 0 & \text{otherwise} \end{cases}$$
(2)
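As a concrete illustration of this formulation, the short sketch below (sizes, seed, and variable names are assumptions) builds a random one-task-per-core mapping matrix and checks relations (1) and (2):

```python
import numpy as np

# A minimal sketch (with assumed sizes) of the mapping matrix M from
# relations (1) and (2): M[pe, task] == 1 iff the task runs on that PE.
num_pe, num_tasks = 8, 6
rng = np.random.default_rng(seed=0)

# Arbitrary initial mapping: each task lands on a distinct core, so every
# processing element holds at most one task, as assumed by the problem.
M = np.zeros((num_pe, num_tasks), dtype=int)
for task, pe in enumerate(rng.choice(num_pe, size=num_tasks, replace=False)):
    M[pe, task] = 1

assert (M.sum(axis=0) == 1).all()   # relation (1): exactly one PE per task
assert (M.sum(axis=1) <= 1).all()   # at most one task per PE
```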
Hot regions on the chip are identified based on a specific temperature threshold, denoted \(TH_1\). This threshold must be set below the chip's vulnerability temperature. Any core above this threshold is designated a hotspot.
The primary objective of the proposed algorithm is to regulate the chip's temperature so as to minimize its maximum temperature as effectively as feasible. Consequently, the temperature of all processing cores post-migration must remain below the specified temperature threshold. The temperature model employed to assess the temperature of processing cores resembles the model utilized in7 and is given in (3). This model accounts for the thermal interdependence among processing elements in the vertical layers through thermal resistance; the temperature of the upper layers is elevated because they exchange heat less effectively40.
$$Th_{i,j,k} = T_{amb} + \sum_{m=1}^{k} \frac{R_{i,j,m}}{A} \times \left( \sum_{s=m}^{n} \left( P_{i,j,s} + PR_{i,j,s} \right) \right)$$
(3)
The parameters of this model are defined as follows:
- \(Th_{i,j,k}\): temperature of the processing element at position (i, j, k) in the 3D on-chip network.
- \(T_{amb}\): ambient temperature.
- \(R_{i,j,m}\): thermal resistance of the element at position (i, j, m). This resistance lies between the elements in layer m and the elements in layer m-1; in the first layer, it is defined by the heat sink.
- \(A\): the area of each processing element.
- \(P_{i,j,s}\): the average power consumption of the processor at position (i, j, s), calculated from the average power consumption of the task running on it.
- \(PR_{i,j,s}\): the average power consumption of the router at position (i, j, s), calculated from the router's per-bit power consumption and the amount of data routed through it.
The average power consumption of cores and routers is used in (3) to obtain the steady-state temperature.
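As a concrete reading of (3), the sketch below (function and array names are assumptions; layers are 0-indexed from the heat sink) accumulates the heat flowing through each resistance below the element:

```python
import numpy as np

def steady_state_temp(T_amb, R, A, P, PR, i, j, k):
    """Evaluate Eq. (3) for the element at (i, j, k); layers are 0-indexed,
    with layer 0 adjacent to the heat sink. R, P and PR are 3D arrays of
    thermal resistances, core powers and router powers, respectively."""
    n = P.shape[2]                 # total number of stacked layers
    temp = T_amb
    for m in range(k + 1):         # resistances between the sink and layer k
        # All heat generated at layer m and above flows through R[i, j, m].
        heat_above = sum(P[i, j, s] + PR[i, j, s] for s in range(m, n))
        temp += (R[i, j, m] / A) * heat_above
    return temp

# Example with a 2x2 grid stacked in 3 layers (made-up numbers):
R = np.full((2, 2, 3), 0.2)
P = np.full((2, 2, 3), 1.5)
PR = np.full((2, 2, 3), 0.4)
print(steady_state_temp(T_amb=25.0, R=R, A=1.0, P=P, PR=PR, i=0, j=0, k=2))
```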
Reinforcement learning
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns optimal actions through interactions with an environment to maximize a cumulative reward. In the context of temperature control in 3D Networks-on-Chip (NoC), RL provides an adaptive and dynamic mechanism to address the thermal challenges posed by the compact design and high power densities of 3D NoCs41. The environment here represents the chip’s thermal and workload states, while the agent’s actions correspond to migrating tasks between cores or layers. The goal is to learn a policy that minimizes temperature hotspots, ensures uniform thermal distribution, and maintains acceptable system performance, such as communication latency and energy efficiency.
In 3D NoCs, uneven task distributions or localized high-power workloads can create thermal hotspots, leading to performance degradation and potential hardware failures. Traditional methods, such as heuristic or model-based techniques, struggle to adapt to dynamic workload changes and runtime variations. RL, however, excels in such scenarios by continuously learning and adapting its policies based on the chip’s current state. The RL agent evaluates the thermal distribution across the NoC and selects migration actions to move tasks between cores, aiming to reduce thermal stress while maintaining balanced workload distribution42. The reward function is tailored to the specific goals of the NoC: minimizing temperature peaks, reducing thermal gradients, and maintaining energy and latency efficiency. Over time, the RL agent identifies effective task migration strategies that enhance the system’s thermal stability and performance.
Using RL for temperature control in 3D NoCs offers significant advantages over traditional methods. The agent’s ability to explore and exploit actions ensures that it not only learns optimal task migrations but also adapts to unexpected workload and thermal conditions in real time. RL incorporates long-term planning through the use of a discount factor, which ensures that the agent considers the future impact of task migrations and avoids decisions that provide short-term relief but worsen the overall thermal profile. Additionally, RL can seamlessly integrate with the multi-layered architecture of 3D NoCs, handling the complex thermal dynamics between layers and cores. This makes RL an ideal solution for ensuring thermal reliability, energy efficiency, and sustained performance in modern 3D NoCs.

Suggested pseudocode for Reinforcement Learning.
Algorithm 1 is specifically designed to address the problem of temperature control in 3D Networks-on-Chip (NoC) using task migration based on Reinforcement Learning (RL). The RL agent learns to make optimal decisions for task migration to balance the temperature across the chip while minimizing performance degradation. The algorithm starts by initializing the system’s thermal state, task assignments, and a Q-table to store state-action values. The states represent the current thermal distribution and task mappings on the 3D NoC, while actions correspond to migrating tasks between cores or layers. Using an ε-greedy policy, the agent explores potential migrations while gradually converging toward effective task placement strategies. This ensures the system achieves both thermal optimization and efficient utilization of resources.
At each iteration, the RL agent interacts with the environment (the 3D NoC) by selecting an action: migrating a task or leaving the task mappings unchanged. The environment responds by updating the thermal distribution and providing a reward. The reward is designed to incentivize uniform temperature distribution, reduce hotspots, and maintain performance metrics such as low communication latency. The Q-table is updated using the Q-learning formula, which balances immediate rewards with the long-term benefits of actions. Unlike traditional thermal control algorithms that rely on fixed heuristics or thermal models, this RL-based approach dynamically adapts to workload changes and thermal variations in real time. The agent progressively improves its decision-making capabilities, identifying migration patterns that reduce thermal stress on the NoC.
This RL-based task migration approach offers several advantages for temperature control in 3D NoCs. By leveraging RL, the system adapts to dynamic workloads and evolving thermal profiles, making it robust against runtime variations. The discount factor (γ) ensures the agent considers the long-term impact of migrations, avoiding short-term fixes that may exacerbate future thermal issues. Meanwhile, the exploration-exploitation strategy balances finding new solutions with leveraging known effective migrations, ensuring both flexibility and efficiency. The termination condition can be tailored to practical constraints, such as achieving a stable thermal distribution or completing a predefined number of migrations. Overall, this algorithm aligns well with the challenges of 3D NoC thermal management, offering scalability and adaptability compared to static or heuristic methods.
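To make the loop of Algorithm 1 concrete, the following self-contained sketch runs tabular Q-learning with an ε-greedy policy on a toy environment: four cores in two stacked layers and two hot tasks, with made-up power and vertical-coupling numbers. Only the learning loop mirrors the algorithm; the environment is an illustrative assumption, not the paper's simulator.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount, exploration
POWER = {0: 4.0, 1: 4.0}                # per-task average power (hypothetical)

def temperatures(mapping):
    # Toy thermal model: a core heats up with its own task's power plus a
    # vertical-coupling term from the core stacked in the same column (c ^ 2).
    p = lambda c: POWER.get(mapping.get(c), 0.0)
    return [40.0 + 5.0 * p(c) + 2.5 * p(c ^ 2) for c in range(4)]

def actions(mapping):
    # Either keep the mapping (None) or migrate a task to any empty core.
    free = [c for c in range(4) if c not in mapping]
    return [None] + [(src, dst) for src in mapping for dst in free]

Q = {}
mapping = {0: 0, 2: 1}                  # core -> task: both tasks share a column
for step in range(500):
    state = tuple(sorted(mapping.items()))
    acts = actions(mapping)
    if random.random() < EPSILON:
        act = random.choice(acts)                                  # explore
    else:
        act = max(acts, key=lambda a: Q.get((state, a), 0.0))      # exploit
    if act is not None:
        src, dst = act
        mapping[dst] = mapping.pop(src)                            # migrate
    reward = -max(temperatures(mapping))       # penalize the peak temperature
    nxt = tuple(sorted(mapping.items()))
    best = max(Q.get((nxt, a), 0.0) for a in actions(mapping))
    q = Q.get((state, act), 0.0)
    Q[(state, act)] = q + ALPHA * (reward + GAMMA * best - q)      # Q-update
```

In this toy setting, the peak temperature drops once the agent learns to keep the two hot tasks in different vertical columns, which is the behavior the reward is designed to encourage.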
Subsequently, the authors of that analysis examined five case studies utilizing this method for task mapping and derived several guidelines for selecting the RL functions and parameters applicable to the task-mapping problem, ensuring an appropriate balance between solution quality and execution time.
While these guidelines pertain to task mapping, they are also applicable when selecting parameters for runtime migration. Migration constitutes a form of dynamic mapping executed during operation: not all tasks in the system require relocation or reassignment; only those situated at the current system-level hotspots should be moved.
It is important to note that the cost function in this algorithm, which is its central component, has been chosen based on the established objectives43. Table 1 lists the symbols and parameters used in this study. Subsequently, we will elucidate the selection criteria and the significance of each function and parameter in relation to the intended application.
MOVE function
The MOVE function generates a neighboring solution by reallocating a task from a hot core to a suitable destination node; the destination may lie on any layer. Acceptance or rejection of the generated neighbor depends on the cost difference between the next state and the current state, together with the ACCEPT function. Neighbors are generated randomly within the neighborhood space. This space could encompass all cores with temperatures below the \(TH_1\) threshold, with cooler cores prioritized solely through the cost function; alternatively, given the heightened significance of the temperature parameter, the search space can be restricted to cores below a secondary threshold \(TH_2\) \((TH_2 < TH_1)\). Selecting an appropriate value for \(TH_2\) is crucial: a small value shrinks the search space and thus speeds convergence, but heightens the risk that no migration destination exists for every hot core, whereas a value near \(TH_1\) yields the opposite outcome. This study adopts the latter formulation, restricting the search space through the secondary temperature threshold.
If no task is allocated to the cold core chosen as the destination, a single transfer per hot region suffices; otherwise, the two tasks are swapped, requiring two transfers. On each execution of this function, if the number of cores with temperatures below \(TH_2\) exceeds the number of overheated cores in the system, a cool core is designated for each overheated core and the transfers are carried out.
If the number of cold cores is smaller than the number of hot cores, the hot cores are sorted in descending order of temperature, and hot processing elements are picked for transfer from the start of this list up to the number of available cold cores.
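A hedged sketch of this selection logic follows; the thresholds, the temperature values, and the random pairing policy are illustrative assumptions:

```python
import random

TH1, TH2 = 80.0, 60.0   # hotspot threshold and secondary (destination) threshold

def move(temps):
    """Pair each hot core (above TH1) with a randomly chosen cool core
    (below TH2), serving the hottest cores first if destinations are scarce."""
    hot = [c for c, t in enumerate(temps) if t > TH1]
    cool = [c for c, t in enumerate(temps) if t < TH2]
    if len(cool) < len(hot):
        # Fewer destinations than hotspots: keep only the hottest cores.
        hot = sorted(hot, key=lambda c: temps[c], reverse=True)[:len(cool)]
    pairs = []
    for h in hot:
        dst = random.choice(cool)   # random neighbor within the search space
        cool.remove(dst)
        pairs.append((h, dst))      # one transfer, or two if dst holds a task
    return pairs

print(move([85.0, 55.0, 92.0, 58.0, 70.0]))   # e.g. [(0, 1), (2, 3)]
```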
ACCEPT function
The inverse exponential function with normalized cost difference was selected as the acceptance function (4). This decision is based on the findings of the analysis in44, whose authors, after evaluating several functions employed for task mapping via the cooling approach, suggest that a function with an inverse exponential form is a more suitable choice for these problems. The normalization ensures that the acceptance probability remains within an appropriate range regardless of variations in the cost function. It also allows the temperature value T to be chosen independently of the cost function, namely within the interval (0, 1)45. The values of \(T_0\) and \(T_f\) are set to 1 and 0.001, respectively.
$$Accept\left(\Delta C, T\right) = \text{true} \iff random() < \frac{1}{1 + e^{\frac{\Delta C}{C \cdot T}}}$$
(4)
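A direct transcription of (4) might look as follows, where the normalization constant C is an assumed input:

```python
import math
import random

def accept(delta_c, T, C=1.0):
    """Acceptance test of Eq. (4): C normalizes the cost difference so the
    probability stays well-scaled whatever the cost magnitude; with T in
    (0, 1], worse moves (delta_c > 0) are accepted with a probability that
    shrinks as the system cools."""
    return random.random() < 1.0 / (1.0 + math.exp(delta_c / (C * T)))
```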
TEMP function
The result of this function denotes a temperature level in the cooling schedule and depends on the number of iterations of the algorithm. The function is defined in this article as (5), with q a geometric cooling factor chosen in (0.9, 0.98). Here \(T_0\) is the initial temperature, i the iteration counter, and L the number of iterations at each temperature level46.
$$Temp\left(i\right) = T_0 \cdot q^{\left\lfloor \frac{i}{L} \right\rfloor}$$
(5)
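A one-line transcription of (5), with illustrative defaults for q and L:

```python
def temp(i, T0=1.0, q=0.95, L=16):
    """Geometric cooling schedule of Eq. (5): the temperature decays by
    factor q every L iterations, starting from T0 (q chosen in (0.9, 0.98);
    the default L is an assumed value for illustration)."""
    return T0 * q ** (i // L)
```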
Reward function
In a reinforcement learning-based task migration strategy for 3D NoC thermal management, the reward function plays a crucial role in guiding the agent toward optimal decisions. The objective is to minimize peak temperature while considering migration overhead and task communication costs. The proposed reward function is formulated as follows:
$$R = -\left(\alpha T_{max} + \beta O_{mig} + \gamma C_{comm}\right)$$
(6)
Here, \(T_{max}\) represents the maximum chip temperature after migration; reducing this value helps prevent thermal hotspots, which degrade system reliability and performance. \(O_{mig}\) denotes the migration overhead, which includes network traffic congestion, latency due to task relocation, and the additional energy consumed by task transfers; lowering this cost ensures that migration does not introduce excessive delays or inefficiencies. \(C_{comm}\) accounts for the communication cost between tasks, which is influenced by task placement and routing; high communication overhead increases network congestion and power consumption, harming both performance and thermal stability.
The \(\alpha, \beta, \gamma\) parameters are weight coefficients that determine the importance of each factor in the reward function. If reducing thermal hotspots is the primary concern, \(\alpha\) is assigned a higher value to prioritize temperature reduction. Conversely, if minimizing migration overhead is critical, \(\beta\) is increased to ensure that task transfers do not disrupt system performance. Similarly, \(\gamma\) controls the impact of task communication cost, keeping migrated tasks close to their dependent tasks to optimize data flow. By optimizing this reward function, the RL agent learns to balance thermal management, migration efficiency, and communication overhead dynamically. Unlike traditional heuristic-based approaches, this method enables adaptive decision-making that continuously improves system performance while maintaining thermal stability. Through iterative learning, the RL model finds effective task migration strategies, resulting in a more reliable, energy-efficient, and high-performance 3D NoC system.
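A minimal transcription of (6), with hypothetical weight values; \(T_{max}\), \(O_{mig}\) and \(C_{comm}\) would be supplied by the thermal model and cost components described in this section:

```python
def reward(t_max, o_mig, c_comm, alpha=1.0, beta=0.3, gamma=0.3):
    """Reward of Eq. (6): the weights below are illustrative assumptions,
    with alpha raised when hotspot reduction dominates the objective."""
    return -(alpha * t_max + beta * o_mig + gamma * c_comm)

print(reward(t_max=72.0, o_mig=4.0, c_comm=6.0))   # -75.0
```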
Terminate function
This function, given by the following relation, establishes the algorithm's termination criteria. It combines the final cooling temperature, the count of rejected responses (instances where the objective function has not improved after a state transition), and the temperatures of the cores47.
$$Terminate\left(i, R\right) = \text{true} \iff \left(Temp\left(i\right) < T_f \wedge R \ge R_{max}\right) \vee \left(T_{Pe_i} < TH_1,\ \forall Pe_i \in PE\right)$$
(7)
The value of \(R_{max}\) in this relation is set comparable to L, the number of repetitions at each temperature level, with \(L \ge N \times M\), where N is the number of tasks on the hot cores and M the number of cool regions on the chip surface. The algorithm terminates in one of two ways. The first occurs when, following the reassignment of tasks and the establishment of a new mapping, the maximum temperature of the cores \((T_{Pe_i})\) falls below the threshold \(TH_1\). The second arises when \(Temp(i)\) drops below the final value \(T_f\) and the count of rejected responses has reached \(R_{max}\).
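A sketch of this test, reusing the temp() schedule sketched earlier; the default parameter values are assumptions:

```python
def terminate(i, rejections, core_temps, T_f=0.001, R_max=16, TH1=80.0):
    """Termination test of Eq. (7): stop once the schedule has cooled below
    T_f with R_max rejected responses, or once every core sits below TH1."""
    cooled_out = temp(i) < T_f and rejections >= R_max
    all_cool = all(t < TH1 for t in core_temps)
    return cooled_out or all_cool
```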
Cost function
The cost function is a fundamental parameter of the RL algorithm that must be established based on the specific problem type. Given that the objective of migration is to minimize the chip's temperature, a primary element of the cost function will be the temperature criterion. At the same time, the migration overhead must be incorporated into the cost function. This overhead can be evaluated in relation to temperature or the traffic imposed on the system.
The greater the distance between the migration’s origin and destination, the more time will be required for data transfer, thereby impacting system traffic more significantly. Furthermore, the quantity of intermediate routers employed for this transmission will escalate, and as a result of the data volume traversing the router impacting its power consumption, the equilibrium temperature of the chip will rise.
This study evaluates migration overhead based on distance, specifically the Manhattan distance, as the implementations were conducted on a mesh topology, the predominant configuration in Network-on-Chip (NoC) systems. The communication rate between tasks is another element: placing tasks with greater communication demands on nearby processors reduces network traffic and lowers core temperatures by minimizing the use of intermediate routers for data transmission48. This factor influences the steady-state temperature and system traffic; hence, the cost function is defined by (8) to (12)49.
$$Cost = \alpha \times C_1 + \beta \times C_2 + \gamma \times C_3$$
(8)
$$C_1 = \frac{\max T_{Pe_i}\left(\text{after move}\right)}{\max T_{Pe_i}\left(\text{initial state}\right)}$$
(9)
$$C_2 = \frac{\sum_{i \in HCores,\ j \in CCores} \left[ \left( M_{T_m}^{Pe_i} \times TS_{Pe_i} + M_{T_m}^{Pe_j} \times TS_{Pe_j} \right) \times dist\left(Pe_i, Pe_j\right) \right]}{\left(\#HCores\right) \times \max TS_{Pe_i} \times 2 \times \max dist\left(Pe_i, Pe_j\right)}$$
(10)
$$C_3 = \frac{\sum_{i \ne j} \left( W_{T_m,T_n} \times dist\left(Pe_i, Pe_j\right) \times M_{T_m}^{Pe_i} \times M_{T_n}^{Pe_j} \right)}{\sum_{i \ne j} \left( W_{T_m,T_n} \times \max dist\left(Pe_i, Pe_j\right) \times M_{T_m}^{Pe_i} \times M_{T_n}^{Pe_j} \right)}$$
(11)
$$dist\left(Pe_i, Pe_j\right) = \left|x_i - x_j\right| + \left|y_i - y_j\right| + \left|z_i - z_j\right|$$
(12)
The cost function is a parameterized function comprising three components: \(C_1\), \(C_2\), and \(C_3\). To reconcile the differing units of chip temperature, Manhattan distance, and inter-task communication rate, each component is normalized by its maximum value. This normalization also ensures that the values of the three parts fall within the range (0, 1). Consequently, the weighting parameters multiplied into the Cost function determine the significance of each part without concern for the range of its values.
In \(C_1\), the peak temperature of the processor cores after migration is normalized by the highest chip temperature in the initial state, whereas \(C_2\) computes the migration overhead. Each migration procedure involves two cores, a cold core (CCore) and a hot core (HCore), and entails at most two task transfers.
During each execution of the Move function, the number of tasks migrated from the hot cores depends on the number of available cold cores, and the overhead is influenced both by the number of transfers and by the sizes of the tasks (TS). The distance between the hot and cold cores (dist) is another significant parameter in this component. \(C_3\) models the impact of the communication rate between tasks during migration.
In the relation for \(C_3\), the numerator multiplies the communication rate of each task pair by the distance between the processor cores hosting that pair. This value is normalized in a similar manner, using the maximum distance between the two processors of each pair. Finally, the coefficients of the three components are established according to the significance of each \(C_i\).
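The sketch below ties (8) through (12) together in a simplified form; the input structures (a transfer list, a communication list, per-core 3D coordinates) and the weights are illustrative assumptions:

```python
def manhattan(a, b):
    """3D Manhattan distance of Eq. (12) between coordinate triples."""
    return sum(abs(p - q) for p, q in zip(a, b))

def cost(t_max_after, t_max_init, transfers, comm, coords, max_ts, max_d,
         alpha=0.5, beta=0.25, gamma=0.25):
    """Simplified reading of Eqs. (8)-(11); each component is normalized by
    its worst case so all three fall in (0, 1)."""
    c1 = t_max_after / t_max_init                       # Eq. (9)
    # Eq. (10): transfers holds (task_size, src_core, dst_core) per transfer.
    c2 = (sum(ts * manhattan(coords[s], coords[d]) for ts, s, d in transfers)
          / (len(transfers) * max_ts * 2 * max_d)) if transfers else 0.0
    # Eq. (11): comm holds (weight, core_i, core_j) for communicating pairs.
    c3 = (sum(w * manhattan(coords[i], coords[j]) for w, i, j in comm)
          / sum(w * max_d for w, _, _ in comm)) if comm else 0.0
    return alpha * c1 + beta * c2 + gamma * c3          # Eq. (8)

coords = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 1, 1)}    # (x, y, z) per core
print(cost(70.0, 85.0, transfers=[(8.0, 0, 2)], comm=[(3.0, 0, 1)],
           coords=coords, max_ts=8.0, max_d=3))
```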
Figure 1 shows the reinforcement learning-based task transfer algorithm for thermal management in 3D networks-on-chip (3D NoCs). The algorithm begins the optimization process from the initial state of the system, including the initial values of the Q-table, the task mapping, and the reinforcement learning parameters such as the learning rate (α), discount factor (γ), and exploration rate (ε). After this step, the initial chip temperature and system cost are calculated to serve as a baseline for comparison. The RL agent then uses the ε-greedy strategy to select an action, which may be a task transfer or maintaining the current state.

Flowchart of the reinforcement learning-based task transfer algorithm for thermal management in 3D networks-on-chip.
Next, the execution of the selected action updates the system state, and the new cost is calculated; this cost includes factors such as temperature, migration overhead, and task communication rate. The Q-table is updated according to the Q-learning formula to improve the agent's decision-making capability, and the best observed state is saved whenever it improves on the previous best. This process continues until a termination condition is met, such as reaching a set number of iterations or stabilizing the system's thermal profile. The output of the algorithm is an optimal task map together with the associated thermal results, providing dynamic, adaptive thermal management of 3D networks-on-chip.
The flowchart summarizes the procedure. The system state is initialized, including the Q-table, task mapping, and RL parameters (learning rate α, discount factor γ, and exploration rate ε), and the initial temperature and system cost are calculated to establish a baseline. The RL agent then uses an ε-greedy strategy to select an action, balancing exploration and exploitation for efficient task migration.
The selected action, either transferring a task or maintaining the current state, is executed, resulting in an updated system state. The algorithm evaluates the new state by calculating its cost, which includes temperature, migration overhead, and communication rates. The Q-table is updated using the Q-learning formula to refine the agent's decision-making, and the best observed state is saved if it outperforms the previous ones. This repeats until the termination condition is met: either a predefined maximum number of iterations is reached or the thermal profile stabilizes. The algorithm terminates with the optimal task mapping and the associated thermal results as output, ensuring adaptive, dynamic, and efficient thermal management of the 3D NoC.