Deep double-Q network dynamic power control method based on Sum tree sampling

Document No.: 1966278    Publication date: 2021-12-14

Note: This invention, "Deep double-Q network dynamic power control method based on Sum Tree sampling" (一种基于Sum tree采样的深度双Q网络动态功率控制方法), was designed and created by 刘德荣, 刘骏, 王永华, 林得有 and 王宇慧 on 2021-08-20. Its main content is as follows: The invention relates to the technical field of radio communication and discloses a deep double-Q network dynamic power control method based on Sum Tree sampling. When the deep double-Q network is used to estimate action values, the action corresponding to the maximum Q value is first selected in the current Q network, and this selected action is then used to compute the target Q value in the target network; this effectively reduces over-estimation, lowers the loss, and improves spectrum allocation efficiency. In addition, during deep double-Q network training the invention combines priority with random sampling so that every sample retains a possibility of being drawn; this improves the utilization of important experience samples while preventing a loss of sample diversity, avoids over-fitting of the system, and accelerates algorithm convergence. By combining prioritized and random sampling with the deep double-Q network algorithm, the invention therefore improves the success rate of dynamic power control.

1. A deep double-Q network dynamic power control method based on Sum Tree sampling is characterized by comprising the following steps:

S1, constructing a spectrum sharing model, wherein the spectrum sharing model comprises a primary base station, M primary users and N secondary users; the primary users and the secondary users are randomly distributed in the network environment and share the same wireless network in a non-cooperative mode;

S2, under the spectrum sharing model constructed in step S1, modeling the power control problem in spectrum allocation as a Markov decision process in deep reinforcement learning, and training a deep double-Q network based on a combination of prioritized and random sampling; after the deep double-Q network training is finished, outputting the power transmission strategy of the secondary users;

and S3, the secondary users obtaining appropriate transmission power for communication according to the power transmission strategy obtained in step S2.

2. The Sum Tree sampling based deep double-Q network dynamic power control method of claim 1, wherein in step S1, the secondary users access the channel of the primary user in an underlay manner; the primary user adaptively controls its own transmission power, and the secondary users update their transmission power according to the training result of the deep double-Q network with prioritized experience replay; the spectrum sharing model measures link quality by the signal-to-noise ratio,

the signal-to-noise ratio of the ith primary user is as follows:

the signal-to-noise ratio of the jth secondary user is:

wherein h_ii and h_jj respectively represent the channel gains of the ith primary user and the jth secondary user, P_i(t) and P_j(t) respectively represent the transmission power of the ith primary user and the jth secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) respectively represent the channel gains between the ith primary user and the jth secondary user, between the jth secondary user and the ith primary user, and between the kth secondary user and the jth secondary user, and N_i(t) and N_j(t) respectively represent the ambient noise received by the ith primary user and the jth secondary user;

the spectrum sharing model judges the power distribution effect through the total throughput of all secondary users, and the relationship between the throughput of the jth secondary user and the signal-to-noise ratio is as follows:

T_j(t) = W log_2(1 + γ_j(t)).

3. The Sum Tree sampling based deep double-Q network dynamic power control method according to claim 2, wherein the transmission power control strategy of the primary user is as follows:

wherein μ_i is the set threshold of the signal-to-noise ratio of the primary user; under this strategy, the primary user controls its transmission power by gradual updating at each time point t: when the signal-to-noise ratio γ_i(t) of primary user i at time t satisfies γ_i(t) ≤ μ_i and the signal-to-noise ratio γ'_i(t+1) predicted by primary user i for time t+1 satisfies γ'_i(t+1) ≥ μ_i, the primary user increases its transmission power; when γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i, the primary user reduces its transmission power; otherwise, the current transmission power is kept unchanged; the signal-to-noise ratio of the ith primary user predicted for time t+1 is as follows:

4. the Sum Tree sampling based deep double Q network dynamic power control method of claim 1, wherein step S2 comprises:

S2.1, initializing an experience pool and setting the experience pool as a Sum Tree storage structure; initializing the weight parameters θ of the current Q network and the target network of the deep double-Q network;

S2.2, modeling the power control problem in spectrum allocation as a Markov decision process in deep reinforcement learning, establishing the state space S(t), defining the action space A(t) and defining the reward function R_t;

S2.3, accumulating a prioritized experience pool;

and S2.4, training the deep double-Q network.

5. The Sum Tree sampling based deep double-Q network dynamic power control method of claim 4, wherein in step S2.2, the state space S(t) is established as follows:

the spectrum sharing model comprises a plurality of assisting base stations, and the assisting base stations receive information of the primary user and the secondary users and transmit the information to the secondary users;

assuming that there are X assisting base stations in the environment, the received signal strengths of the assisting base stations constitute the state space, that is:

S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)];

wherein, the signal strength received by the kth assisting base station is:

wherein l_ik(t) and l_jk(t) respectively represent the distances from the ith primary user and the jth secondary user to the kth assisting base station at time t, P_i(t) and P_j(t) respectively denote the transmission power of the ith primary user and the jth secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path loss exponent, and σ(t) denotes the average noise power of the system; at time t, secondary user k selects an action in state s_k(t) and then enters the next state s_k(t+1).

6. The Sum Tree sampling based deep double-Q network dynamic power control method of claim 5, wherein in step S2.2, the transmission power selected by the secondary users in each time slot is set as the action value, the transmission power of each secondary user is a discretized value, and the transmission powers of the secondary users in the same time slot form the action space, that is:

A(t) = [P_1(t), P_2(t), ..., P_n(t)];

each secondary user can select from H different transmission power values, so that the system model has H^n selectable actions in total.

7. The Sum Tree sampling based deep double-Q network dynamic power control method of claim 5, wherein in step S2.2, the reward function R_t is established as follows:

when the signal-to-noise ratio of the primary user is lower than the set threshold, the transmission is defined as failed and the reward is set to -r; when the signal-to-noise ratio of the primary user is greater than or equal to the set threshold, the signal-to-noise ratio of any secondary user is greater than or equal to its set threshold, and the transmission power of the primary user is also greater than or equal to the sum of the transmission powers of the secondary users, data are defined as transmitted successfully and the reward r is obtained; when only the signal-to-noise ratio of the primary user is above the set threshold and the signal-to-noise ratios of the secondary users are all below the set threshold, the reward obtained is 0, that is:

8. The Sum Tree sampling based deep double-Q network dynamic power control method of claim 5, wherein step S2.3 comprises:

S2.3.1, according to the initial state S_0(t) and all actions A_0(t) of the secondary users, calculating the Q value corresponding to each action;

S2.3.2, the primary user adaptively controls its own transmission power;

S2.3.3, the secondary user selects an action based on an ε-greedy algorithm: it selects an action A_t at random with probability ε, or selects the action A_t = argmax_a Q(s_t, a; θ, α, β) with probability 1-ε;

S2.3.4, obtaining the reward R_t according to the reward function and transitioning to the next state S_{t+1};

S2.3.5, storing the sample data (S_t, A_t, R_t, γ_{t+1}, S_{t+1}) in a leaf node of the experience pool, and determining the priority according to the temporal-difference error of each sample;

S2.3.6, taking the S_{t+1} obtained in step S2.3.4 as the input state, and repeating steps S2.3.1 through S2.3.5 until the leaf nodes of the experience pool are full.

9. The Sum Tree sampling based deep double Q network dynamic power control method of claim 5, wherein step S2.4 comprises:

S2.4.1, sampling from the experience pool of step S2.3 by a method combining Sum Tree sampling and random sampling;

S2.4.2, calculating the importance weight w_j of each sample;

S2.4.3, calculating the network loss function L(θ) = E[(Q_target - Q(s, a; θ))^2], and updating the weight parameters θ of the two neural networks of the deep double-Q network;

S2.4.4, updating the priorities of the samples;

S2.4.5, updating the gradient based on the gradient descent method;

S2.4.6, updating the weight parameters θ of the two neural networks of the deep double-Q network again as θ ← θ + η·Δ, and resetting Δ to 0;

S2.4.7, updating the priorities of the samples again;

S2.4.8, updating the weight parameter θ of the target Q network;

S2.4.9, repeating from step S2.4.1 until S(t) reaches a termination state, and returning to step S2.3.

10. The Sum Tree sampling based deep double-Q network dynamic power control method of claim 9, wherein in step S2.4.1, the probability with which sample j is extracted is:

wherein p_j and p_k respectively represent the priorities of sample j and of an arbitrary sample k;

the priority of sample j is:

p_j = |TD_error(j)| + ε;

where ε is a very small positive constant that guarantees p_j > 0, α is the priority exponent, with α = 0 corresponding to uniform random sampling, and k indexes the sampled batch;

and the bias is corrected according to the sample importance weights:

wherein w_j represents the weight coefficient, N represents the experience pool size, and β represents the non-uniform probability compensation coefficient; when β = 1, P(j) is fully compensated.

Technical Field

The invention relates to the technical field of radio communication, and in particular to a deep double-Q network dynamic power control method based on Sum Tree sampling.

Background

Mitola first proposed the complete concept of Cognitive Radio (CR) technology as early as 1999, with the purpose of alleviating the shortage of spectrum resources and the low utilization of spectrum. A cognitive radio can learn about its surrounding environment and adjust its own behavior according to the learning result. An important role of cognitive radio technology in spectrum allocation is as follows: on the premise of not interfering with the normal operation of the Primary User (PU) that holds the spectrum usage right, a Secondary User (SU) senses the surrounding radio environment and selects a suitable opportunity for spectrum access, so as to improve the utilization of spectrum resources.

In a cognitive radio network, when a spectrum channel is idle or the interference caused by a secondary user to the primary user does not exceed the maximum threshold that the primary user can tolerate, the secondary user can freely use the idle spectrum without considering the problem of power interference between the primary user and the secondary user. However, when the licensed spectrum is occupied or the interference exceeds the maximum threshold, the secondary user not only wants to communicate with higher transmission power to satisfy the Quality of Service (QoS) requirement of the information transmission, but must also consider the influence of excessive transmission power on the communication quality of the primary user. Therefore, in a power control method based on cognitive radio technology, the secondary user needs to learn continuously in order to complete the information transmission task with appropriate transmission power.

The power control problem in spectrum allocation can be modeled as a Markov decision process (MDP) and solved using model-free Reinforcement Learning (RL). Q-learning is one of the most popular RL algorithms; it learns an action value function by interacting with the environment and obtaining immediate reward feedback. Its disadvantage is slow convergence in action selection, since Q-learning is a gradual optimization process. The deep Q-network (DQN) is a newer deep RL algorithm that combines the RL process with a deep neural network to approximate the Q action value function; the neural network makes up for the limitations of Q-learning in generalization and function approximation capability. The deep double-Q network (Double DQN) is an algorithmic improvement on the ordinary DQN: since the ordinary DQN already has two networks, Double DQN does not need to introduce a new network, and merely decomposes the computation of the target value into action selection and action evaluation performed by different networks.

Researchers have combined prioritized experience replay with the Q-learning algorithm for dynamic spectrum access, but the Q-learning algorithm cannot effectively overcome the limitations of reinforcement learning in high-dimensional and continuous states. Researchers have also applied the DQN algorithm to spectrum allocation and, compared with reinforcement learning methods, the system performance is greatly improved; however, the ordinary DQN does not distinguish the priority of samples when extracting experience from the experience pool, which leads to problems such as low experience utilization and slow convergence.

Chinese patent application CN112383922A (published on 2021-02-19) discloses a deep reinforcement learning spectrum sharing method based on prioritized experience replay, which includes the following steps: constructing a spectrum sharing model; under the spectrum sharing model, modeling the spectrum sharing problem as a Markov Decision Process (MDP) of interaction between an agent and the environment in deep reinforcement learning, training a deep reinforcement learning model based on prioritized experience replay of samples, and obtaining learning value information of cognitive user power transmission; and making a spectrum sharing control decision according to the acquired power transmission learning value information of the cognitive user, wherein the control decision enables the cognitive user to share the spectrum of the primary user without affecting the communication quality of the primary user by adjusting the transmission power of the cognitive user, thereby realizing efficient utilization of the available spectrum resources. Although this patent samples based on prioritized experience replay, samples with high priority are sampled frequently as training proceeds, which reduces sample diversity and causes over-fitting of the system; in addition, it trains a deep Q network, which is likely to select over-estimated values, resulting in over-optimistic value estimation and, in turn, reduced spectrum sharing efficiency.

Disclosure of Invention

The invention aims to provide a deep double-Q network dynamic power control method based on Sum Tree sampling, which can effectively reduce over-estimation and loss, accelerate algorithm convergence and improve the success rate of dynamic power control.

In order to achieve the above object, the present invention provides a deep double-Q network dynamic power control method based on Sum Tree sampling, which includes:

S1, constructing a spectrum sharing model, wherein the spectrum sharing model comprises a primary base station, M primary users and N secondary users; the primary users and the secondary users are randomly distributed in the network environment and share the same wireless network in a non-cooperative mode;

S2, under the spectrum sharing model constructed in step S1, modeling the power control problem in spectrum allocation as a Markov decision process in deep reinforcement learning, and training a deep double-Q network based on a combination of prioritized and random sampling; after the deep double-Q network training is finished, outputting the power transmission strategy of the secondary users;

and S3, the secondary users obtain appropriate transmission power for communication according to the power transmission strategy obtained in step S2.

Preferably, in step S1, the secondary users access the channel of the primary user in an underlay manner; the primary user adaptively controls its own transmission power, and the secondary users update their transmission power according to the training result of the deep double-Q network with prioritized experience replay; the spectrum sharing model measures link quality by the signal-to-noise ratio,

the signal-to-noise ratio of the ith primary user is as follows:

the signal-to-noise ratio of the jth secondary user is:

wherein h_ii and h_jj respectively represent the channel gains of the ith primary user and the jth secondary user, P_i(t) and P_j(t) respectively represent the transmission power of the ith primary user and the jth secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) respectively represent the channel gains between the ith primary user and the jth secondary user, between the jth secondary user and the ith primary user, and between the kth secondary user and the jth secondary user, and N_i(t) and N_j(t) respectively represent the ambient noise received by the ith primary user and the jth secondary user;

the spectrum sharing model judges the power distribution effect through the total throughput of all secondary users, and the relationship between the throughput of the jth secondary user and the signal-to-noise ratio is as follows:

T_j(t) = W log_2(1 + γ_j(t)).

As a preferred scheme, the transmission power control strategy of the primary user is as follows:

wherein μ_i is the set threshold of the signal-to-noise ratio of the primary user; under this strategy, the primary user controls its transmission power by gradual updating at each time point t: when the signal-to-noise ratio γ_i(t) of primary user i at time t satisfies γ_i(t) ≤ μ_i and the signal-to-noise ratio γ'_i(t+1) predicted by primary user i for time t+1 satisfies γ'_i(t+1) ≥ μ_i, the primary user increases its transmission power; when γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i, the primary user reduces its transmission power; otherwise, the current transmission power is kept unchanged; the signal-to-noise ratio of the ith primary user predicted for time t+1 is as follows:

preferably, step S2 includes:

S2.1, initializing an experience pool and setting the experience pool as a Sum Tree storage structure; initializing the weight parameters θ of the current Q network and the target network of the deep double-Q network;

S2.2, modeling the power control problem in spectrum allocation as a Markov decision process in deep reinforcement learning, establishing the state space S(t), defining the action space A(t) and defining the reward function R_t;

S2.3, accumulating a prioritized experience pool;

and S2.4, training the deep double-Q network.

Preferably, in step S2.2, the state space S(t) is established as follows:

the spectrum sharing model comprises a plurality of assisting base stations, and the assisting base stations receive information of the primary user and the secondary users and transmit the information to the secondary users;

assuming that there are X assisting base stations in the environment, the received signal strengths of the assisting base stations constitute the state space, that is:

S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)];

wherein, the signal strength received by the kth assisting base station is:

wherein l_ik(t) and l_jk(t) respectively represent the distances from the ith primary user and the jth secondary user to the kth assisting base station at time t, P_i(t) and P_j(t) respectively denote the transmission power of the ith primary user and the jth secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path loss exponent, and σ(t) denotes the average noise power of the system; at time t, secondary user k selects an action in state s_k(t) and then enters the next state s_k(t+1).

Preferably, in step S2.2, the transmission power selected by the secondary users in each time slot is set as the action value, the transmission power of each secondary user is a discretized value, and the transmission powers of the secondary users in the same time slot form the action space, that is:

A(t) = [P_1(t), P_2(t), ..., P_n(t)];

each secondary user can select from H different transmission power values, so that the system model has H^n selectable actions in total.

Preferably, in step S2.2, the reward function R_t is established as follows:

when the signal-to-noise ratio of the primary user is lower than the set threshold, the transmission is defined as failed and the reward is set to -r; when the signal-to-noise ratio of the primary user is greater than or equal to the set threshold, the signal-to-noise ratio of any secondary user is greater than or equal to its set threshold, and the transmission power of the primary user is also greater than or equal to the sum of the transmission powers of the secondary users, data are defined as transmitted successfully and the reward r is obtained; when only the signal-to-noise ratio of the primary user is above the set threshold and the signal-to-noise ratios of the secondary users are all below the set threshold, the reward obtained is 0, that is:

preferably, step S2.3 comprises:

S2.3.1, according to the initial state S_0(t) and all actions A_0(t) of the secondary users, calculating the Q value corresponding to each action;

S2.3.2, the primary user adaptively controls its own transmission power;

S2.3.3, the secondary user selects an action based on an ε-greedy algorithm: it selects an action A_t at random with probability ε, or selects the action A_t = argmax_a Q(s_t, a; θ, α, β) with probability 1-ε;

S2.3.4, obtaining the reward R_t according to the reward function and transitioning to the next state S_{t+1};

S2.3.5, storing the sample data (S_t, A_t, R_t, γ_{t+1}, S_{t+1}) in a leaf node of the experience pool, and determining the priority according to the temporal-difference error of each sample;

S2.3.6, taking the S_{t+1} obtained in step S2.3.4 as the input state, and repeating steps S2.3.1 through S2.3.5 until the leaf nodes of the experience pool are full.

Preferably, step S2.4 includes:

S2.4.1, sampling from the experience pool of step S2.3 by a method combining Sum Tree sampling and random sampling;

S2.4.2, calculating the importance weight w_j of each sample;

S2.4.3, calculating the network loss function L(θ) = E[(Q_target - Q(s, a; θ))^2], and updating the weight parameters θ of the two neural networks of the deep double-Q network;

S2.4.4, updating the priorities of the samples;

S2.4.5, updating the gradient based on the gradient descent method;

S2.4.6, updating the weight parameters θ of the two neural networks of the deep double-Q network again as θ ← θ + η·Δ, and resetting Δ to 0;

S2.4.7, updating the priorities of the samples again;

S2.4.8, updating the weight parameter θ of the target Q network;

S2.4.9, repeating from step S2.4.1 until S(t) reaches a termination state, and returning to step S2.3.

Preferably, in step S2.4.1, the probability with which sample j is extracted is:

wherein p_j and p_k respectively represent the priorities of sample j and of an arbitrary sample k;

the priority of sample j is:

p_j = |TD_error(j)| + ε;

where ε is a very small positive constant that guarantees p_j > 0, α is the priority exponent, with α = 0 corresponding to uniform random sampling, and k indexes the sampled batch;

and the bias is corrected according to the sample importance weights:

wherein w_j represents the weight coefficient, N represents the experience pool size, and β represents the non-uniform probability compensation coefficient; when β = 1, P(j) is fully compensated.

Compared with the prior art, the invention has the beneficial effects that:

when the invention adopts the deep double-Q network to estimate the action value, firstly the action corresponding to the maximum Q value is found out in the current Q network, and then the selected action is utilized to calculate the target Q value in the target network, thereby effectively reducing over-estimation, reducing loss and improving the spectrum distribution efficiency.

Drawings

Fig. 1 is a flow chart of a dynamic power control method of an embodiment of the present invention.

Fig. 2 is a schematic diagram of a spectrum sharing model according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a Sum Tree memory cell according to an embodiment of the present invention.

Fig. 4 is a schematic block diagram of a deep dual Q network according to an embodiment of the present invention.

FIG. 5 is pseudo code of an algorithm of an embodiment of the present invention.

FIG. 6 is a graph comparing loss functions of three algorithms under the same environment in accordance with an embodiment of the present invention.

Fig. 7 is a comparison graph of the number of exploration steps of the three algorithms according to an embodiment of the present invention.

Fig. 8 is a comparison graph of power control success rates for three algorithms in accordance with an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

As shown in fig. 1, a method for controlling dynamic power of a deep dual-Q network based on Sum Tree sampling in a preferred embodiment of the present invention includes:

S1, constructing a spectrum sharing model, wherein the spectrum sharing model comprises a primary base station, M primary users and N secondary users; the primary users and the secondary users are randomly distributed in the network environment and share the same wireless network in a non-cooperative mode;

S2, under the spectrum sharing model constructed in step S1, modeling the power control problem in spectrum allocation as a Markov decision process in deep reinforcement learning, and training a deep double-Q network based on a combination of prioritized and random sampling; after the deep double-Q network training is finished, outputting the power transmission strategy of the secondary users;

and S3, the secondary users obtain appropriate transmission power for communication according to the power transmission strategy obtained in step S2.

As shown in fig. 2, the spectrum sharing model includes a plurality of assisting base stations, and the primary users, the secondary users and the assisting base stations are randomly distributed in the network environment. In step S1, the secondary users access the primary user's channel in an underlay manner, and each secondary user can adaptively adjust its transmission parameters according to the information obtained from the assisting base stations; that is, the secondary user continually learns from the signals received from the assisting base stations and selects an appropriate transmission power with the goal of maximizing network utility.

The primary user adaptively controls its own transmission power, and the secondary users update their transmission power according to the training result of the deep double-Q network with prioritized experience replay. The signal-to-interference-plus-noise ratio (SINR) is an important measure of link quality. The spectrum sharing model of this embodiment measures link quality by the signal-to-noise ratio,

the signal-to-noise ratio of the ith primary user is as follows:

the signal-to-noise ratio of the jth secondary user is:

wherein h_ii and h_jj respectively represent the channel gains of the ith primary user and the jth secondary user, P_i(t) and P_j(t) respectively represent the transmission power of the ith primary user and the jth secondary user at time t, h_ij(t), h_ji(t) and h_kj(t) respectively represent the channel gains between the ith primary user and the jth secondary user, between the jth secondary user and the ith primary user, and between the kth secondary user and the jth secondary user, and N_i(t) and N_j(t) respectively represent the ambient noise received by the ith primary user and the jth secondary user;
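The SINR expressions themselves appear only as equation images in the original filing and are not reproduced in this text. A reconstruction consistent with the variable definitions above (an assumption, not the verbatim patent equations) is:

γ_i(t) = h_ii · P_i(t) / ( Σ_{j=1..N} h_ji(t) · P_j(t) + N_i(t) )

γ_j(t) = h_jj · P_j(t) / ( h_ij(t) · P_i(t) + Σ_{k≠j} h_kj(t) · P_k(t) + N_j(t) )

i.e. each user's received signal power divided by the aggregate interference from the other transmitters plus the ambient noise.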

The spectrum sharing model judges the power allocation effect through the total throughput of all secondary users. In this model, the channel gains, transmission powers, environmental noise and so on change dynamically; according to the Shannon theorem, the relationship between the throughput of the jth secondary user and its signal-to-noise ratio is:

T_j(t) = W log_2(1 + γ_j(t)), where W denotes the channel bandwidth.

in the dynamically changing system, it is required to ensure that the power distribution effect of the system is optimal, that is, it is required to ensure that the secondary users can adjust their own transmission power through continuous learning, so that the total throughput of all secondary users reaches the maximum.

The transmission power control strategy of the primary user in this embodiment is as follows:

wherein μ_i is the set threshold of the signal-to-noise ratio of the primary user; under this strategy, the primary user controls its transmission power by gradual updating at each time point t: when the signal-to-noise ratio γ_i(t) of primary user i at time t satisfies γ_i(t) ≤ μ_i and the signal-to-noise ratio γ'_i(t+1) predicted by primary user i for time t+1 satisfies γ'_i(t+1) ≥ μ_i, the primary user increases its transmission power; when γ_i(t) ≥ μ_i and γ'_i(t+1) ≥ μ_i, the primary user reduces its transmission power; otherwise, the current transmission power is kept unchanged; the signal-to-noise ratio of the ith primary user predicted for time t+1 is as follows:
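The stepwise policy equation and the t+1 SNR prediction are likewise given as equation images that are not reproduced here. The following minimal Python sketch captures only the update logic described above; primary_user_power_update, delta_p and p_max are illustrative names chosen by the editor, and the prediction γ'_i(t+1) is treated as an input rather than reconstructed:

def primary_user_power_update(p_current, gamma_t, gamma_pred_t1, mu_i, delta_p=0.1, p_max=10.0):
    """Gradual power update of primary user i at time t, following the description above."""
    if gamma_t <= mu_i and gamma_pred_t1 >= mu_i:
        return min(p_current + delta_p, p_max)   # current SNR below threshold: increase power
    if gamma_t >= mu_i and gamma_pred_t1 >= mu_i:
        return max(p_current - delta_p, 0.0)     # SNR above threshold now and at t+1: reduce power
    return p_current                             # otherwise keep the current transmission power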

Specifically, step S2 includes:

S2.1, initializing an experience pool and setting the experience pool as a Sum Tree storage structure; initializing the weight parameters θ of the current Q network and the target network of the deep double-Q network;

S2.2, modeling the power control problem in spectrum allocation as a Markov decision process in deep reinforcement learning, establishing the state space S(t), defining the action space A(t) and defining the reward function R_t;

S2.3, accumulating a prioritized experience pool;

and S2.4, training the deep double-Q network.

In step S2.1, this embodiment uses a binary tree of storage units as the storage structure of the experience pool, as shown in fig. 3. The Sum Tree storage structure has four layers of nodes from top to bottom: the topmost node is called the root node, the bottom row are the leaf nodes, and the two middle rows are internal nodes. The data of all experience samples are stored in the leaf nodes, which additionally store the priorities of the samples. All nodes other than the leaf nodes store no sample data; each stores, as a number, the sum of the priorities of its left and right child nodes in the level below.
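As an illustration of this storage structure, a minimal Python sketch of a Sum Tree is given below (class and method names are the editor's own, not taken from the patent): leaf nodes hold the sample priorities and the experience tuples, every internal node holds the sum of its two children, and a prefix-sum descent from the root selects a leaf with probability proportional to its priority.

import numpy as np

class SumTree:
    """Minimal Sum Tree: leaves hold sample priorities, internal nodes hold sums of children."""
    def __init__(self, capacity):
        self.capacity = capacity                      # number of leaf nodes (experience slots)
        self.tree = np.zeros(2 * capacity - 1)        # internal nodes followed by leaves
        self.data = np.empty(capacity, dtype=object)  # experience tuples stored at the leaves
        self.write = 0                                # next leaf position to overwrite
        self.size = 0

    def add(self, priority, sample):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                              # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Descend from the root; value in [0, total) selects a leaf proportionally to its priority."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]

In step S2.4.1, one common way to combine priority with randomness is to divide tree.total into k equal segments, draw one value uniformly at random inside each segment, and pass it to get(); high-priority leaves are then favoured, but every leaf keeps a non-zero chance of being selected.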

The primary user and the secondary users of the system model are in a non-cooperative relationship: the secondary users access the primary user's channel in an underlay manner, and neither side knows the other's power transmission strategy. In the signal transmission process the assisting base stations play an important role; they are responsible for collecting the communication information of the primary user and the secondary users and for forwarding the obtained information to the secondary users. In step S2.2, the state space S(t) is established as follows:

the spectrum sharing model comprises a plurality of assisting base stations, and the assisting base stations receive information of the primary user and the secondary users and transmit the information to the secondary users;

assuming that there are X assisting base stations in the environment, the received signal strengths of the assisting base stations constitute the state space, that is:

S(t) = [s_1(t), s_2(t), ..., s_k(t), ..., s_X(t)];

wherein, the signal strength received by the kth assisting base station is:

wherein l_ik(t) and l_jk(t) respectively represent the distances from the ith primary user and the jth secondary user to the kth assisting base station at time t, P_i(t) and P_j(t) respectively denote the transmission power of the ith primary user and the jth secondary user at time t, l_0(t) denotes the reference distance, τ denotes the path loss exponent, and σ(t) denotes the average noise power of the system; at time t, secondary user k selects an action in state s_k(t) and then enters the next state s_k(t+1).
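The expression for s_k(t) is an equation image in the original and is not reproduced here. Under the usual log-distance path-loss assumption (an assumption by the editor, made consistent with the reference distance l_0(t) and path loss exponent τ defined above), the received signal strength at the kth assisting base station could be written as:

s_k(t) = Σ_{i=1..M} P_i(t) · ( l_ik(t) / l_0(t) )^(-τ) + Σ_{j=1..N} P_j(t) · ( l_jk(t) / l_0(t) )^(-τ) + σ(t)

i.e. the path-loss-attenuated powers of all primary and secondary transmitters plus the average noise power.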

In addition, in step S2.2, the transmission power selected by the secondary users in each time slot is set as the action value, the transmission power of each secondary user is a discretized value, and the transmission powers of the secondary users in the same time slot form the action space, that is:

A(t) = [P_1(t), P_2(t), ..., P_n(t)];

each secondary user can select from H different transmission power values, so that the system model has H^n selectable actions in total.
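A small Python illustration of the H^n joint action space follows; the power levels below are hypothetical example values chosen by the editor, not taken from the patent:

from itertools import product

power_levels = [0.0, 0.5, 1.0, 1.5]      # H = 4 hypothetical discrete transmit-power values
n_secondary_users = 3                    # n secondary users
# Each joint action assigns one power level to every secondary user in the same time slot.
action_space = list(product(power_levels, repeat=n_secondary_users))
assert len(action_space) == len(power_levels) ** n_secondary_users   # H**n = 64 selectable actions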

The reward function is a key element for effective training and learning of the neural network in deep reinforcement learning. In step S2.2, the reward function R_t of this embodiment is established as follows:

when the signal-to-noise ratio of the primary user is lower than the set threshold, the transmission is defined as failed and the reward is set to -r; when the signal-to-noise ratio of the primary user is greater than or equal to the set threshold, the signal-to-noise ratio of any secondary user is greater than or equal to its set threshold, and the transmission power of the primary user is also greater than or equal to the sum of the transmission powers of the secondary users, data are defined as transmitted successfully and the reward r is obtained; when only the signal-to-noise ratio of the primary user is above the set threshold and the signal-to-noise ratios of the secondary users are all below the set threshold, the reward obtained is 0, that is:
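The piecewise reward itself appears as an equation image in the original. Written out from the three cases described above (a reconstruction by the editor, not the verbatim patent formula):

R_t = -r,  if γ_i(t) < μ_i (the primary user's transmission fails);
R_t = r,   if γ_i(t) ≥ μ_i, the signal-to-noise ratio of any secondary user is at or above its threshold, and P_i(t) ≥ Σ_j P_j(t) (data are transmitted successfully);
R_t = 0,   otherwise (only the primary user's signal-to-noise ratio is above its threshold).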

Specifically, step S2.3 comprises:

S2.3.1, according to the initial state S_0(t) and all actions A_0(t) of the secondary users, calculating the Q value corresponding to each action;

S2.3.2, the primary user adaptively controls its own transmission power;

S2.3.3, the secondary user selects an action based on an ε-greedy algorithm: it selects an action A_t at random with probability ε, or selects the action A_t = argmax_a Q(s_t, a; θ, α, β) with probability 1-ε, wherein α and β are the priority exponents and θ is the weight parameter of the network;

S2.3.4, obtaining the reward R_t according to the reward function and transitioning to the next state S_{t+1};

S2.3.5, storing the sample data (S_t, A_t, R_t, γ_{t+1}, S_{t+1}) in a leaf node of the experience pool, and determining the priority according to the temporal-difference error of each sample;

S2.3.6, taking the S_{t+1} obtained in step S2.3.4 as the input state, and repeating steps S2.3.1 through S2.3.5 until the leaf nodes of the experience pool are full.
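A minimal Python sketch of one iteration of steps S2.3.3 through S2.3.5 is given below; q_values_fn, td_error_fn and env_step are placeholder callables assumed by the editor (not names from the patent), and tree is the Sum Tree sketched earlier:

import numpy as np

def accumulate_one_step(q_values_fn, td_error_fn, env_step, tree, state, epsilon=0.1, eps_p=1e-6):
    q_values = q_values_fn(state)                     # Q(s_t, a; θ) for every discrete power action
    if np.random.rand() < epsilon:
        action = np.random.randint(len(q_values))     # explore with probability ε
    else:
        action = int(np.argmax(q_values))             # exploit: A_t = argmax_a Q(s_t, a; θ)
    reward, gamma_next, next_state = env_step(action) # reward R_t and next state S_{t+1}
    sample = (state, action, reward, gamma_next, next_state)
    priority = abs(td_error_fn(sample)) + eps_p       # p_j = |TD_error(j)| + ε
    tree.add(priority, sample)                        # store in a leaf node of the experience pool
    return next_state                                 # S_{t+1} becomes the next input state (S2.3.6)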

The deep double-Q network comprises the environment, a replay memory unit, two neural networks with the same structure but different parameters, and an error function, as shown in fig. 4. To prevent the deep Q network from selecting over-estimated values with high probability, which leads to over-optimistic value estimation, the deep double-Q network decouples action selection from action evaluation:

the maximum Q value is not taken directly from the target Q network; instead, the action corresponding to the maximum Q value is first found in the current Q network, and this selected action is then used to calculate the target Q value in the target network.


step S2.4 of the present embodiment includes:

S2.4.1, sampling from the experience pool of step S2.3 by a method combining Sum Tree sampling and random sampling;

S2.4.2, calculating the importance weight w_j of each sample;

S2.4.3, calculating the network loss function L(θ) = E[(Q_target - Q(s, a; θ))^2], and updating the weight parameters θ of the two neural networks of the deep double-Q network, wherein Q_target is the target Q value computed by the target Q network, and s and a are the state and action of the deep reinforcement learning;

S2.4.4, updating the priorities of the samples;

S2.4.5, updating the gradient based on the gradient descent method;

S2.4.6, updating the weight parameters θ of the two neural networks of the deep double-Q network again as θ ← θ + η·Δ, and resetting Δ to 0, wherein η is a preset value and Δ is the weight difference value;

S2.4.7, updating the priorities of the samples again;

S2.4.8, updating the weight parameter θ of the target Q network;

S2.4.9, repeating from step S2.4.1 until S(t) reaches a termination state, and returning to step S2.3.
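A minimal PyTorch-style sketch of one importance-weighted update (roughly steps S2.4.2 through S2.4.5) is shown below for illustration; q_net, target_net and the batch layout are assumptions by the editor, not part of the patent, and terminal-state handling is omitted for brevity:

import torch

def train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, is_weight = batch                      # states, actions, rewards, next states, importance weights w_j
    with torch.no_grad():
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)  # action selection with the current Q network
        q_target = r + gamma * target_net(s_next).gather(1, next_a).squeeze(1)  # evaluation with the target network
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_error = q_target - q_pred
    loss = (is_weight * td_error.pow(2)).mean()             # importance-weighted loss L(θ)
    optimizer.zero_grad()
    loss.backward()                                         # gradient descent update (S2.4.5)
    optimizer.step()
    return td_error.detach().abs()                          # |TD error| used to refresh sample priorities (S2.4.4)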

Specifically, the pseudo code of the algorithm of the present embodiment is shown in fig. 5.

When the ordinary DQN algorithm samples from the experience pool, it generally samples with equal probability and ignores the differences in importance among samples; as a result, much of the sampled data contains failure information and the utilization of sample data is low. Sum Tree sampling trains samples according to priority, where the priority depends on the magnitude of the temporal-difference error (TD-error): the larger the TD error, the stronger the back-propagation effect on the neural network, the more important the sample is for learning, and the higher its priority, so such samples should be trained preferentially.

However, if sampling is performed purely according to priority, samples with large TD errors are sampled frequently as training proceeds, which reduces sample diversity and may cause the system to over-fit. Therefore, randomness is added to the sampling process: high-priority samples are not the only ones extracted, and low-priority samples also have a certain probability of being extracted. The higher the priority, the higher the probability of being extracted, but every sample retains a chance of being drawn.

To ensure that sampling follows priority while every sample retains the possibility of being drawn, this embodiment adopts a method combining priority with random sampling, which guarantees both priority-based sampling and a non-zero sampling probability for the lowest-priority samples.

In step S2.4.1, the probability with which sample j is extracted is:

wherein p_j and p_k respectively represent the priorities of sample j and of an arbitrary sample k, and α is the priority exponent;

the priority of sample j is:

p_j = |TD_error(j)| + ε;

where ε is a very small positive constant that guarantees p_j > 0, with α = 0 corresponding to uniform random sampling, and k indexes the sampled batch;

and the bias is corrected according to the sample importance weights:

wherein w_j represents the weight coefficient, N represents the experience pool size, and β represents the non-uniform probability compensation coefficient; when β = 1, P(j) is fully compensated.
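The two expressions referenced above are equation images in the original. The standard prioritized-experience-replay forms consistent with the definitions given here (a reconstruction by the editor, following the usual prioritized replay formulation rather than a verbatim copy of the patent) are:

P(j) = p_j^α / Σ_k p_k^α

w_j = ( N · P(j) )^(-β) / max_i w_i

so α trades off pure priority sampling against uniform sampling (α = 0), and β compensates the non-uniform sampling probability, with full compensation at β = 1.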

Sum Tree sampling draws samples from the experience pool using the TD error as the priority, replacing uniform random extraction, and the sampling probability is positively correlated with the TD error. Samples with large TD errors would therefore be sampled frequently, which reduces sample diversity; hence randomness is added to the sampling process so that low-priority samples are also drawn with a certain probability, which improves the utilization of important experience and accelerates convergence.

In this embodiment, experimental simulation of the Sum Tree sampling based Double DQN dynamic power control method was carried out on the Python platform; since the core algorithm combines Sum Tree sampling with Double DQN, it is referred to hereinafter as the ST_Double DQN algorithm. The performance of the natural DQN algorithm, the Double DQN algorithm and the ST_Double DQN algorithm is compared under the same simulation environment. Each algorithm iterates 40000 times, and the performance of each index is displayed once every 1000 iterations.

As shown in fig. 6, which compares the loss functions of the three algorithms in the same environment, all three deep reinforcement learning algorithms reach convergence after a certain number of learning rounds. The natural DQN algorithm fluctuates strongly in the first 20 rounds and also has the largest average loss value at convergence; the Double DQN algorithm also fluctuates in the first 20 rounds, but with a smaller amplitude than the natural DQN algorithm; the ST_Double DQN algorithm provided by the invention reduces the loss value to within 0.25 in 5 rounds of training, reaches convergence within 10 rounds, and has the smallest average loss value at convergence, which indicates that the method has better adaptability and learning capability.

As shown in fig. 7, which compares the number of exploration steps of the three algorithms, the simulation results show that with any of the algorithms the secondary user can generally find a successful power transmission strategy within 3.5 exploration steps. As the number of training rounds increases, the natural DQN algorithm and the Double DQN algorithm achieve successful power transmission with the number of exploration steps roughly stabilized between 2.0 and 3.0. From round 19 onward, the ST_Double DQN algorithm proposed herein finds a successful power transmission strategy after online learning with fewer than 2.0 exploration steps. Therefore, compared with the natural DQN and Double DQN algorithms, the proposed algorithm finds a suitable power control strategy with better stability and the smallest average number of exploration steps, and can effectively improve system performance.

As shown in fig. 8, which compares the power control success rates of the three algorithms, the total number of training iterations in the experiment is 40000; every 1000 iterations are defined as one round displayed in the figure, and 15 training instances are selected in each round for testing. If the user can select a successful access action in the test, the transmission task is regarded as successfully completed, and the ratio of the number of successes to the total number of tests is defined as the power control success rate. In the simulation results, the success rates of the natural DQN algorithm and the Double DQN algorithm fluctuate over a wide range and never converge, remaining unstable throughout. The ST_Double DQN algorithm fluctuates strongly in the initial stage, but converges after 24 rounds and achieves a 100% test success rate. Therefore, the ST_Double DQN algorithm proposed herein maintains good adaptability in a dynamically changing environment and effectively improves the success rate of power control and the utilization of the spectrum.

In summary, the embodiment of the present invention provides a deep double-Q network dynamic power control method based on Sum Tree sampling. When the deep double-Q network is used to estimate action values, the action corresponding to the maximum Q value is first found in the current Q network, and this selected action is then used to calculate the target Q value in the target network; by decoupling the selection of the target action from the calculation of the target Q value, the deep double-Q network algorithm effectively reduces over-estimation, reduces the loss, and improves spectrum allocation efficiency. In this embodiment, the deep double-Q network is trained by combining priority with random sampling: experience samples are given different priorities and randomness is added during sampling, so that every sample has a possibility of being drawn. This improves the utilization of important experience samples, prevents a loss of sample diversity, avoids over-fitting of the system, accelerates algorithm convergence, and improves network performance. The experimental simulation results also show that combining prioritized and random sampling with the deep double-Q network algorithm improves the success rate of dynamic power control.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as the protection scope of the present invention.
