An adaptive coverage method for dynamic wireless sensor network deployment using deep reinforcement learning

Hierarchical state representation

To address the challenge of high-dimensional state spaces in wireless sensor network (WSN) coverage and sensing deployment, a novel hierarchical state representation method is proposed. This approach not only reduces computational complexity but also adaptively captures the dynamic characteristics of the network:

$$\begin{aligned} \textbf{s} = [\textbf{s}_{global}, \textbf{s}_{local}, \textbf{s}_{neighbor}] \end{aligned}$$

(7)

where \(\textbf{s}_{global}\) represents the global network coverage state, with dimension \(d_g \ll M\); \(\textbf{s}_{local}\) represents the local sensor state, including the energy level and current sensing radius; and \(\textbf{s}_{neighbor}\) represents the aggregated features of neighboring node states.

The global state is reduced in dimension through region segmentation and feature aggregation:

$$\begin{aligned} \textbf{s}_{global} = \Phi (A, \{P'_{det}(s_i, p_j, t)\}_{j=1,i=1}^{M,N}) \end{aligned}$$

(8)

where \(\Phi\) is a dimensionality reduction function, which can be based on region partitioning feature aggregation or principal component analysis. This representation method enables efficient fusion of multi-scale state information, allowing the sensor network to achieve adaptive region priority adjustment and dynamic balance between energy consumption and coverage quality under limited computational resources. Additionally, it facilitates real-time optimization of the network topology.
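As an illustration, the region-partitioning variant of \(\Phi\) can be sketched as follows in Python/numpy. All names, the grid discretization, and the independent-sensing aggregation are assumptions for illustration, since the paper leaves the concrete form of \(\Phi\) open:

```python
import numpy as np

def global_state(p_det, region_ids, num_regions):
    """Aggregate per-point detection probabilities into a compact
    global coverage state s_global (Eq. 8): one feature per region.

    p_det      : (N, M) array, p_det[i, j] = detection probability of
                 grid point j by sensor i at the current time step
    region_ids : (M,) array mapping each grid point to a region
    num_regions: d_g, the number of regions (d_g << M)
    """
    # Joint coverage of each grid point under independent sensing
    point_cov = 1.0 - np.prod(1.0 - p_det, axis=0)        # shape (M,)
    # Mean coverage per region -> global state of dimension d_g
    s_global = np.array([point_cov[region_ids == r].mean()
                         if np.any(region_ids == r) else 0.0
                         for r in range(num_regions)])
    return s_global

# Example: 4 sensors, M = 100 grid points, d_g = 5 regions
rng = np.random.default_rng(0)
p_det = rng.uniform(0.0, 0.5, size=(4, 100))
region_ids = rng.integers(0, 5, size=100)
s_g = global_state(p_det, region_ids, 5)
print(s_g.shape)   # (5,)
```

The resulting five-dimensional vector replaces the full M-dimensional coverage map, which is what keeps the actor and critic inputs tractable.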

To ensure robustness under partial observability and node failures, the hierarchical state representation incorporates mechanisms for synchronization and fault tolerance. Specifically, the local and neighbor states are periodically updated to reflect the current operational status of each node and its surroundings. In the event of node failures, the global state representation is designed to accommodate dynamic changes in network topology, allowing the system to adaptively adjust its strategy based on the remaining active nodes.

Action space

In wireless sensor network (WSN) coverage sensing deployment, the actions that each sensor node can perform at a specific time step constitute its action space:

$$\begin{aligned} \textbf{a}_i = [\delta _i, r_i] \end{aligned}$$

(9)

where \(\delta _i \in \{0, 1\}\) is the active/sleep decision, enabling dynamic control of the network’s activity level, and \(r_i \in [0, r_i^{max}]\) is the sensing-radius adjustment, which can be discretized into K levels to support fine-grained control of sensing capability.
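One common way to realize this hybrid action space is to flatten the binary decision and the K radius levels into a single discrete index. The sketch below (the function name and the quotient/remainder encoding are illustrative, not from the paper) shows the decoding back into \(\textbf{a}_i = [\delta _i, r_i]\):

```python
def decode_action(action_index, r_max, k_levels):
    """Map a flat discrete index to a_i = [delta_i, r_i] (Eq. 9).

    The joint space has 2 * K entries: the quotient gives the
    active/sleep bit delta_i and the remainder selects one of K
    radius levels spaced uniformly in [0, r_max].
    """
    delta = action_index // k_levels          # 0 = sleep, 1 = active
    level = action_index % k_levels
    r = r_max * level / (k_levels - 1)        # discretized sensing radius
    return delta, r

# Example with K = 5 radius levels and r_max = 10
print(decode_action(0, 10.0, 5))   # (0, 0.0)  sleep, radius off
print(decode_action(9, 10.0, 5))   # (1, 10.0) active, maximum radius
```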

The reward function is designed to balance the trade-offs between activating more nodes and expanding individual node coverage. Specifically, the energy efficiency reward component \(R_{ene}\) penalizes excessive use of sensing radius, encouraging nodes to optimize their energy consumption while maintaining adequate coverage. This ensures that the network can adaptively adjust its strategy based on the current energy levels and coverage requirements, achieving a dynamic balance between coverage quality and energy conservation.

Multi-objective cooperative reward function

To address the dual challenges of coverage efficiency and energy constraints in wireless sensor networks, a multi-objective reward function is designed, which simultaneously considers the maximization of coverage and the balance of energy consumption:

$$\begin{aligned} R(s, a) = \alpha \cdot R_{cov}(s, a) + \beta \cdot R_{ene}(s, a) + \gamma \cdot R_{bal}(s, a) \end{aligned}$$

(10)

where the coverage reward is given by:

$$\begin{aligned} R_{cov}(s, a) = Q_{cov} \end{aligned}$$

(11)

The optimization aims to achieve effective coverage of the target area, focusing on high-value regions and the elimination of coverage blind spots. The energy efficiency reward is:

$$\begin{aligned} R_{ene}(s, a) = -\frac{1}{N} \sum _{i=1}^{N} \frac{E(r_i)}{e_i} \end{aligned}$$

(12)

The reward for energy efficiency encourages nodes to adjust their sensing strength based on their remaining energy, maximizing energy usage efficiency. The reward for load balancing is:

$$\begin{aligned} R_{bal}(s, a) = -\sigma (\{\frac{e_i^{init} - e_i}{e_i^{init}}\}_{i=1}^{N}) \end{aligned}$$

(13)

where \(\sigma\) represents the standard deviation, which is used to measure the imbalance in energy consumption. By minimizing the standard deviation of energy consumption \(\sigma\), the network load balancing is promoted, thereby extending the overall network lifetime.

It should be noted that while the standard deviation is a common metric for energy-consumption imbalance, it is not robust under skewed distributions. Alternative metrics such as the Gini coefficient, widely used in economics to measure inequality, could capture the imbalance across nodes more accurately in certain scenarios. The standard deviation is adopted here for its simplicity and computational efficiency; integrating other metrics to improve the robustness of the load-balancing reward is left for future work.
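Putting Eqs. (10)–(13) together, the composite reward can be sketched in a few lines of numpy. The sensing-energy model \(E(r) = r^2\) used here is purely illustrative, as are the example values:

```python
import numpy as np

def reward(q_cov, radii, energies, init_energies, alpha, beta, gamma):
    """Composite reward R = alpha*R_cov + beta*R_ene + gamma*R_bal
    (Eqs. 10-13). E(r) = r**2 is an illustrative energy model."""
    r_cov = q_cov                                        # Eq. 11
    r_ene = -np.mean(radii**2 / energies)                # Eq. 12, E(r) = r^2
    depletion = (init_energies - energies) / init_energies
    r_bal = -np.std(depletion)                           # Eq. 13
    return alpha * r_cov + beta * r_ene + gamma * r_bal

# Two nodes: the same radius costs relatively more for the low-energy
# node, and the uneven depletion is penalized via the standard deviation.
r = reward(0.8, np.array([1.0, 2.0]), np.array([10.0, 5.0]),
           np.array([10.0, 10.0]), alpha=1.0, beta=1.0, gamma=1.0)
print(round(r, 3))  # 0.1
```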

The coefficients \(\alpha\), \(\beta\), and \(\gamma\) can be dynamically adjusted over time to balance the optimization objectives at different stages.

$$\begin{aligned} \alpha (t)= & \alpha _0 \cdot (1 – e^{-\lambda _{\alpha } \cdot t}) \end{aligned}$$

(14)

$$\begin{aligned} \beta (t)= & \beta _0 \cdot e^{-\lambda _{\beta } \cdot t} \end{aligned}$$

(15)

$$\begin{aligned} \gamma (t)= & \gamma _0 \cdot (1 – e^{-\lambda _{\gamma } \cdot t}) \end{aligned}$$

(16)

The reward function is designed with a time-varying weighting system, where energy efficiency is prioritized at the beginning of the network deployment (with a larger \(\beta\)), and as time progresses, the focus gradually shifts towards coverage quality and load balancing (with increasing \(\alpha\) and \(\gamma\)), enabling dynamic strategy adjustment throughout the WSN lifecycle.
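This schedule is straightforward to implement; in the sketch below the initial values \(\alpha _0, \beta _0, \gamma _0\) and decay rates \(\lambda\) are illustrative placeholders, not values reported in the paper:

```python
import math

def weights(t, a0=1.0, b0=1.0, g0=1.0,
            lam_a=0.01, lam_b=0.01, lam_g=0.01):
    """Time-varying reward weights of Eqs. 14-16. Initial values and
    decay rates are illustrative, not taken from the paper."""
    alpha = a0 * (1.0 - math.exp(-lam_a * t))   # grows: coverage focus
    beta = b0 * math.exp(-lam_b * t)            # decays: early energy focus
    gamma = g0 * (1.0 - math.exp(-lam_g * t))   # grows: load balancing
    return alpha, beta, gamma

print(weights(0))     # (0.0, 1.0, 0.0): energy efficiency dominates early
print(weights(500))   # alpha, gamma near 1; beta near 0
```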

Hierarchical actor-critic network structure

To address the challenge of high-dimensional decision space in wireless sensor network deployment, a deep reinforcement learning model based on the Actor-Critic architecture is employed:

Actor Network: \(\pi _{\theta }(a|s)\), outputs the probability distribution of taking actions in a given state. By introducing a hierarchical decision process, it first determines the node’s working state (\(\delta\)), and then optimizes the sensing radius (r). Additionally, this network integrates an attention mechanism, which dynamically focuses on key areas and energy-constrained nodes.

Critic Network: \(V_{\phi }(s)\), estimates the state value function. By using a dual time-scale evaluation, it simultaneously considers immediate coverage rewards and long-term network lifetime, while also incorporating graph convolution layers to effectively capture the spatial correlations among nodes.

Both networks share the underlying feature extraction layers to accelerate knowledge sharing and improve learning efficiency. A graph feature extraction module, specifically designed to adapt to the dynamic changes in the wireless sensor network topology, is also integrated.
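A compact numpy sketch of this shared-trunk structure is given below. The attention mechanism, graph convolution layers, and dual time-scale evaluation of the full model are omitted, and the layer sizes and initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_layer(d_in, d_out):
    """Random dense layer (illustrative initialization)."""
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class ActorCritic:
    """Minimal sketch of the shared-trunk Actor-Critic used by ACDRL:
    shared features feed a hierarchical actor (delta, then r) and a
    critic head estimating V_phi(s)."""
    def __init__(self, d_state, k_levels, d_hidden=32):
        self.W_shared, self.b_shared = mlp_layer(d_state, d_hidden)
        self.W_delta, self.b_delta = mlp_layer(d_hidden, 2)        # active/sleep
        self.W_radius, self.b_radius = mlp_layer(d_hidden, k_levels)
        self.W_value, self.b_value = mlp_layer(d_hidden, 1)        # critic

    def forward(self, s):
        h = np.tanh(s @ self.W_shared + self.b_shared)   # shared features
        # Hierarchical actor: first the working state delta ...
        p_delta = softmax(h @ self.W_delta + self.b_delta)
        # ... then the sensing-radius level r
        p_radius = softmax(h @ self.W_radius + self.b_radius)
        v = (h @ self.W_value + self.b_value)[0]         # critic V_phi(s)
        return p_delta, p_radius, v

net = ActorCritic(d_state=12, k_levels=5)
pd, pr, v = net.forward(rng.normal(size=12))
print(pd.shape, pr.shape)   # (2,) (5,)
```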

Gradient-aware adaptive learning rate adjustment

To accelerate the algorithm’s convergence and adapt to the dynamic environments in WSN deployment, an adaptive learning rate adjustment strategy is designed:

$$\begin{aligned} \eta (t) = \frac{\eta _0}{1 + \kappa \cdot t^{\omega }} \cdot \exp (\xi \cdot |\nabla J(\theta )|) \end{aligned}$$

(17)

where \(\eta _0\) is the initial learning rate, and \(\kappa\) and \(\omega\) control the rate of learning decay, while \(\xi\) is the gradient-based adaptive factor. This learning rate adjustment scheme combines time decay and gradient-adaptive mechanisms, providing a higher learning rate in regions with large gradients to quickly escape local optima, while ensuring stable convergence over time.
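Eq. (17) translates directly into code; the parameter values below are illustrative placeholders, since (as discussed next) they require experimental tuning:

```python
import math

def adaptive_lr(t, grad_norm, eta0=1e-3, kappa=0.1, omega=0.75, xi=0.5):
    """Gradient-aware adaptive learning rate eta(t) of Eq. 17.
    Parameter values are illustrative and would need tuning."""
    return eta0 / (1.0 + kappa * t**omega) * math.exp(xi * grad_norm)

print(adaptive_lr(0, 0.0))                             # 0.001: starts at eta0
print(adaptive_lr(100, 2.0) > adaptive_lr(100, 0.0))   # True: gradient boost
print(adaptive_lr(10_000, 0.0) < adaptive_lr(0, 0.0))  # True: time decay
```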

The gradient-aware adjustment is motivated by the need to balance exploration and exploitation in the dynamic WSN environment. The parameter \(\xi\) determines how sensitively the learning rate responds to the gradient magnitude; it must be chosen carefully so that the adaptation does not become unstable under the measurement noise common in WSN deployments. Experimental analysis and parameter tuning are required to select \(\xi\), \(\kappa\), and \(\omega\) for the specific WSN application and noise level. Future work could investigate more sophisticated adaptive learning rate strategies that additionally exploit information about the environment’s dynamics and the agent’s performance over time.

ACDRL algorithm pseudocode

Algorithm 1: ACDRL for WSN Coverage Optimization

Convergence analysis

Theorem 1

Under the following conditions, ACDRL converges to the optimal policy with probability 1:

1. The state space \(\mathcal {S}\) and action space \(\mathcal {A}\) are compact.

2. The learning rates satisfy the Robbins-Monro conditions: \(\sum _{t=1}^{\infty } \eta (t) = \infty\), \(\sum _{t=1}^{\infty } \eta ^2(t) < \infty\).

3. The value estimator \(V_{\phi }\) is Lipschitz continuous on a compact parameter space.

Proof Sketch

Using stochastic approximation framework:

1. Define the joint parameter space \(\Theta = \{\theta , \phi \}\) with update \(\Theta _{t+1} = \Theta _t + \eta (t)[h(\Theta _t) + \xi _t]\), where \(h(\cdot )\) is the update direction vector and \(\xi _t\) is martingale noise.

2. Construct the Lyapunov function \(L(\Theta ) = \mathbb {E}_{\pi _{\theta }}[V_{\phi }(s) - V^*(s)]^2 + D_{KL}(\pi _{\theta }||\pi ^*)\).

3. Prove that, under the adaptive learning rate, \(\lim _{t\rightarrow \infty } \nabla _{\Theta } L(\Theta _t) = 0\) almost surely.

4. Analyze the ODE system \(\dot{\Theta } = h(\Theta )\), showing that its equilibrium corresponds to the Nash solution.

5. Bound the approximation error using GCN properties.

\(\square\)

Remark

The adaptive learning rate (Eq. 17) satisfies:

$$\begin{aligned} \sum _{t=1}^{\infty } \frac{\eta _0 \exp (\xi g_t)}{1+\kappa t^{\omega }} = \infty ,\quad \sum _{t=1}^{\infty } \left( \frac{\eta _0 \exp (\xi g_t)}{1+\kappa t^{\omega }}\right) ^2 < \infty \end{aligned}$$

(18)

where \(g_t = |\nabla J(\theta _t)|\). The exponential term temporarily boosts the learning rate during large gradients, while the polynomial decay in the denominator ensures ultimate convergence.
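The two conditions in Eq. (18) can be checked numerically for a bounded gradient \(g_t \le G\) and \(1/2 < \omega \le 1\). The sketch below (illustrative parameters, worst-case bound on the gradient term) shows the first partial sum still growing over time while the second has flattened:

```python
import numpy as np

# Illustrative parameters; G bounds the gradient magnitude g_t.
eta0, kappa, omega, xi, G = 1e-3, 0.1, 0.75, 0.5, 1.0

t = np.arange(1, 1_000_001, dtype=float)
eta = eta0 * np.exp(xi * G) / (1.0 + kappa * t**omega)  # worst-case eta(t)

partial_sum = np.cumsum(eta)     # should diverge (first condition)
partial_sq = np.cumsum(eta**2)   # should converge (second condition)

growth = partial_sum[-1] / partial_sum[99_999]   # sum is still growing
tail = partial_sq[-1] - partial_sq[99_999]       # squared sum has flattened
print(growth > 1.5, tail < 1e-5)   # True True
```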
