Mobile multi-agent cooperative target searching method


Abstract: The invention discloses a mobile multi-agent cooperative target search method, designed by 陈志�, 狄小娟, 岳文静 and 祝驭航 (2020-04-30). The traditional DDPG algorithm is improved by combining the "centralized learning, decentralized execution" idea of the Actor-Critic (AC) framework: the input of the Critic is extended, and the many-to-one mapping of state-action observation information in the traditional DDPG algorithm is changed into a one-to-one mapping. Each agent is then trained with the improved DDPG algorithm; during centralized learning, the Critic input of each agent contains not only its own state-action observation information but also the policies and action observations of the other agents. Finally, once all agents are trained, each Actor executes the cooperative search task independently, without considering the other agents. The invention solves the problems of environmental instability caused by the continuously changing state of each agent during the search task, long search time and low execution efficiency.

1. A mobile multi-agent cooperative target searching method is characterized by comprising the following steps:

step 1, a target operation domain O is given and uniformly divided into m×n grid regions of equal size; the coordinate of each grid is represented by the center coordinate of the region in which it lies; N agents are set to begin searching within the m×n grid, and the number of targets is S;

step 2, obtain a random noise function Ψ; let θ = {θ_1, ..., θ_i, ..., θ_N} denote the policy parameters of the N agents, let the policy set of all agents be π = {π_1, ..., π_i, ..., π_N}, the action set be a = (a_1, ..., a_i, ..., a_N), and the environmental state vector set be s = (s_1, ..., s_i, ..., s_N); the output of the Actor network of each individual agent is then a_i = π_i(s_i, θ_i);

step 3, update the Actor network and the target network of each agent by computing the deterministic policy of each agent; the specific steps are as follows:

step 3.1, the target gain of agent i is J(θ_i) = E[R_i]; the policy gradient formula is then:

where R_i denotes the sum of the target gains of agent i; θ_i, a_i, π_i and s_i denote the policy parameters, action, policy and observed state information of agent i, respectively; p^π denotes the state distribution; Q_i^π denotes the centralized state-action function of the i-th agent, i.e. the real-time action feedback function given by the Critic network of agent i to its Actor network;

step 3.2, receive the initial state s and randomly select and execute an action a; according to the policy gradient formula of step 3.1, compute and judge whether the currently selected action a is the optimal policy currently evaluated by the Critic; if so, set the currently selected action a as the deterministic policy action, denoted μ_i; if not, reselect an action and substitute it into the policy gradient formula of step 3.1 for calculation until the deterministic policy action μ_i is acquired;

step 4, update the Critic network and the target network of each agent by combining the ideas of TD learning and the target network in DQN; the specific steps are as follows:

step 4.1, execute the deterministic policy action μ_i and obtain a new target gain function policy gradient update formula:

where D = {s, s', a_1, ..., a_i, ..., a_N, r_1, ..., r_i, ..., r_N} is an experience replay buffer containing the historical experience of all agents; s' = (s_1', ..., s_N') denotes the updated state vector after action a is taken; r_i denotes the instant target reward obtained after agent i takes action a_i; Q_i^μ denotes the centralized state-action value function of the i-th agent under the deterministic policy μ_i;

step 4.2, sample in the experience replay buffer D to obtain Q', the experience-pool function corresponding to the centralized state-action value of the i-th agent; the parameters used are taken from the experience pool and are therefore delayed parameters from before the current latest action was taken; γ is the discount factor, which determines the importance of future rewards; μ' is the target policy set with delay-updated parameters θ_i'; a_i', μ_i' and s_i' denote the delay-updated action, policy and observation of agent i, respectively;

step 4.3, update the target gain function policy gradient of step 4.1 by minimizing the loss function to obtain the global optimal policy; the update rule is as follows:

where r = {r_1, ..., r_i, ..., r_N} denotes the set of instant target rewards obtained after all agents take action a;

step 5, after the global optimal policy is obtained, each agent independently executes the search task; the specific steps are as follows:

step 5.1, during the independent search of the agents, continuously compute the sum of the target gain values of all agents, Σ_i J(μ_i), where J(μ_i) denotes the optimal gain value obtained by agent i taking the deterministic policy μ_i;

step 5.2, compare the sum of the target gain values computed in step 5.1 with the target number S; if this value is greater than or equal to S, the search is successful; otherwise the search has failed, so return to step 4 and repeat the steps downward until the search succeeds.

2. The mobile multi-agent cooperative target search method as recited in claim 1, wherein: the deterministic policy action μ_i in step 3.2 is the action a_i actually taken at that time plus the noise function Ψ.

3. The mobile multi-agent cooperative target search method as recited in claim 1, wherein: in step 4.2, the experience-pool function Q' corresponding to the centralized state-action value of the i-th agent is an average value obtained through multiple samplings.

4. The mobile multi-agent cooperative target search method as recited in claim 1, wherein: in step 4.2, the value of the discount factor γ is 0.8.

Technical Field

The invention relates to a mobile multi-agent cooperative target search method, which mainly uses an improved version of the traditional DDPG algorithm to design a mobile multi-agent cooperative control strategy better suited to complex environments, and belongs to the cross-disciplinary application field of reinforcement learning, multi-agent systems and deep learning.

Background

In the study of searching with agents, most early research focused on search by a single agent in a static environment and did not consider cooperative search by multiple agents in a dynamic environment. Traditional methods usually adopt random search, rule-based search and the like. Random search, however, requires a priori knowledge of the environment, yet in real life we often do not know the actual information of the search area, so this method is infeasible in most cases. With rule-based search the search trajectory is fixed, and if the target position changes over time the search efficiency drops sharply. As application scenarios grow more complex, research on multi-agent collaborative search has attracted increasing attention. Considering only simple situations is clearly far from sufficient, and studying how to give a multi-agent system greater autonomous learning ability, so that it can adapt to different scene changes, is currently one of the hot research problems.

A multi-agent system is in fact a complex distributed computing system whose applications span many fields, such as robotic systems and distributed decision-making. Multi-agent reinforcement learning is a key focus of multi-agent system research: applying reinforcement learning to multiple agents allows them to solve increasingly complex tasks. However, reinforcement learning is not yet widely used in multi-agent target search, and given the complexity of real environments, selecting behavior strategies is more difficult than for a single general-purpose robot. Therefore, combining and optimizing some traditional mobile multi-agent reinforcement learning algorithms, so that the agents gain greater autonomous coordination ability and the mobile agents become more adaptable to the environment, is of far-reaching significance for completing multi-agent cooperative target search tasks.

Disclosure of Invention

The purpose of the invention is as follows: aiming at the problem of multi-agent collaborative search in dynamic real-world environments, the invention seeks to make up for the deficiencies of the prior art and provides a mobile multi-agent collaborative target search method. The method combines the traditional DDPG algorithm with the AC and DQN algorithms; it overcomes the unstable environment and low training efficiency of traditional reinforcement learning methods such as AC and DQN in the multi-agent setting, and it also avoids the problem that, when all agents obtain their predicted Q values from a single central Critic, the need to train two networks makes the training time too long to adapt to a random environment and to complete the task within the expected time. The method can successfully optimize the cooperative search strategy of multiple agents while taking the behavior strategies of the other agents into account, has good real-time performance and robustness, and solves the problems of unstable environment, long search time and low execution efficiency caused by the continuously changing state of each agent during the execution of a search task.

Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:

a mobile multi-agent cooperative target searching method comprises the following steps:

step 1) a target operation domain O is given and uniformly divided into m×n grid regions of equal size; the coordinate of each grid is represented by the center coordinate of the region in which it lies; N agents are set to begin searching within the m×n grid, and the number of targets is S;
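As an illustration of step 1, the short Python sketch below shows one way the grid discretization and the center-coordinate representation could be realized; the function name, domain size and anchoring at the origin are assumptions for illustration, not details taken from the patent.

```python
import numpy as np

def grid_centers(origin, width, height, m, n):
    """Return an m x n array holding the center coordinate of every grid
    cell of a rectangular operation domain O anchored at `origin`."""
    dx, dy = width / m, height / n
    xs = origin[0] + dx * (np.arange(m) + 0.5)    # cell-center x coordinates
    ys = origin[1] + dy * (np.arange(n) + 0.5)    # cell-center y coordinates
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    return np.stack([gx, gy], axis=-1)            # shape (m, n, 2)

# Hypothetical 8 x 8 grid over an 80 x 80 domain, matching the simulation scale.
centers = grid_centers(origin=(0.0, 0.0), width=80.0, height=80.0, m=8, n=8)
print(centers[0, 0], centers[7, 7])               # centers of two corner cells
```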

step 2) obtain a random noise function Ψ; let θ = {θ_1, ..., θ_N} denote the parameters of the N agents' policies, let the policy set of all agents be π = {π_1, ..., π_N}, the action set be a = (a_1, ..., a_N), and the environmental state vector set be s = (s_1, ..., s_N); the output of the Actor network of each individual agent is then a_i = π_i(s_i, θ_i);
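A minimal sketch of the per-agent action computation of step 2, assuming each Actor π_i is a small feed-forward network and the noise Ψ is Gaussian; the network architecture and noise form are illustrative assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_i(s_i, theta_i) of a single agent."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),    # actions scaled to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

def select_action(actor, obs, noise_std=0.1):
    """a_i = pi_i(s_i, theta_i) plus exploration noise psi (assumed Gaussian)."""
    with torch.no_grad():
        a = actor(torch.as_tensor(obs, dtype=torch.float32))
    return (a + noise_std * torch.randn_like(a)).clamp(-1.0, 1.0)
```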

step 3) update the Actor network and the target network of each agent by computing the deterministic policy of each agent; the specific steps are as follows:

step 3.1) set the target gain of agent i (the number of targets found) to J(θ_i) = E[R_i]; the policy gradient formula is then:

where R_i denotes the sum of the target gains of agent i; θ_i, a_i, π_i and s_i denote the policy parameters, action, policy and observed state information of agent i, respectively; p^π denotes the state distribution; Q_i^π denotes the centralized state-action function of the i-th agent, i.e. the real-time action feedback function given by the Critic network of agent i to its Actor network;
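The display formula referenced in step 3.1 did not survive extraction from the source document. In the notation just defined, a plausible reconstruction, following the standard multi-agent actor-critic (MADDPG-style) policy gradient and therefore offered as an editorial assumption rather than the patent's verbatim formula, is:

$$
\nabla_{\theta_i} J(\theta_i)
= \mathbb{E}_{s \sim p^{\pi},\, a_i \sim \pi_i}
\left[ \nabla_{\theta_i} \log \pi_i(a_i \mid s_i)\,
Q_i^{\pi}(s, a_1, \ldots, a_i, \ldots, a_N) \right]
$$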

step 3.2) receive the initial state s and randomly select and execute an action a; according to the policy gradient formula of step 3.1), compute and judge whether the currently selected action a is the optimal policy currently evaluated by the Critic; if so, set the currently selected action as the deterministic policy, denoted μ_i for short; if not, substitute the reselected action into the policy gradient formula of step 3.1) again for calculation until the deterministic policy μ_i can be acquired;

step 4) update the Critic network and the target network of each agent by combining the ideas of TD learning and the target network in DQN; the specific steps are as follows:

step 4.1) execute the deterministic policy action μ_i and obtain a new target gain function policy gradient update formula:

d ═ s, s', a1...aN,r1...rNIs an experience replay buffer pool containingHistorical experience with all agents, where s ═ s(s)1′,...,sN') represents the state vector updated before taking action a, riIndicating agent i takes action aiThe value of the later-obtained instant target prize,representing adoption of deterministic policiesA status-action value function of the case i-th agent centralization;

step 4.2) sample in the experience replay buffer D to obtain Q', the experience-pool function corresponding to the centralized state-action value of the i-th agent; the parameters used are taken from the experience pool and are therefore delayed parameters from before the current latest action was taken; γ is the discount factor, which determines the importance of future rewards; μ' is the target policy set with delay-updated parameters θ_i'; a_i', μ_i' and s_i' denote the delay-updated action, policy and observation of agent i, respectively;

step 4.3) update the target gain function policy gradient of step 4.1) by minimizing the loss function to obtain the global optimal policy; the update rule is as follows:

where r = {r_1, ..., r_N} denotes the set of instant target rewards obtained after all agents take action a;
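The update rule of step 4.3, minimizing the Critic loss with the delayed quantities of step 4.2 forming the TD target, is also missing from the extracted text; a plausible MADDPG-style reconstruction, offered as an editorial assumption, is:

$$
L(\theta_i) = \mathbb{E}_{s, a, r, s' \sim D}
\left[ \left( Q_i^{\mu}(s, a_1, \ldots, a_N) - y \right)^2 \right],
\qquad
y = r_i + \gamma\, Q_i^{\mu'}(s', a_1', \ldots, a_N')
\big|_{a_j' = \mu_j'(s_j')}
$$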

step 5) after the global optimal policy is obtained, each agent independently executes the search task. The specific search rules are as follows:

step 5.1) during the independent search of each agent, continuously compute the sum of the target gain values of all agents, Σ_i J(μ_i), where J(μ_i) denotes the optimal gain value obtained by agent i taking the deterministic policy μ_i;

step 5.2) compare the sum of the target gain values Σ_i J(μ_i) computed in step 5.1) with the target number S; if this value is greater than or equal to S, the search is successful; otherwise the search has failed, so return to step 4) and repeat the steps downward until the search succeeds.
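A minimal sketch of the termination test of steps 5.1 and 5.2, assuming each agent reports its accumulated target gain J(μ_i); the function and variable names are illustrative.

```python
def search_succeeded(agent_gains, target_count):
    """Step 5.2: the search succeeds once the summed target gains of all
    agents reach the known number of targets S."""
    return sum(agent_gains) >= target_count

# Hypothetical example with 8 agents and S = 15 targets, as in the simulation.
gains = [2, 1, 3, 2, 2, 1, 2, 2]     # J(mu_i) reported by each agent
print(search_succeeded(gains, 15))    # True, since the gains sum to 15
```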

Preferably, the deterministic policy μ_i in step 3.2) is in fact the action a_i actually taken at that time plus the noise function Ψ.

Preferably, Q' obtained by sampling in step 4.2) is actually an average value obtained by sampling a plurality of times.

Preferably, the value of the discount factor γ in step 4.2) is 0.8.

Compared with the prior art, the invention has the following beneficial effects:

(1) The method adds a certain noise function to the action actually taken, which accounts for interference from real lighting and background noise; it has good real-time performance and robustness and improves the stability of the improved algorithm.

(2) The invention samples the experience pool multiple times and takes the average value, and updates the Actor network by minimizing the loss function; this effectively avoids the excessively large loss differences and the difficulty in stable convergence that could be caused by stale experience samples, and ensures the accuracy of the obtained global optimal policy.
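A sketch of the multiple-sampling-and-averaging idea of effect (2), assuming a list-based replay buffer of transition dictionaries, per-agent delayed target Actors and a centralized target Critic; all names and tensor shapes here are illustrative assumptions.

```python
import random
import torch

def averaged_target_q(buffer, target_critic, target_actors, batch_size=64, k=4):
    """Sample the replay buffer k times and average the delayed Q' estimates,
    smoothing out stale experience samples before the loss is computed."""
    estimates = []
    for _ in range(k):
        batch = random.sample(buffer, batch_size)           # k independent minibatches
        s_next = torch.stack([t["s_next"] for t in batch])  # shape (B, N, obs_dim)
        # Delayed target actions a'_j = mu'_j(s'_j) for every agent j.
        a_next = torch.cat(
            [mu(s_next[:, j]) for j, mu in enumerate(target_actors)], dim=-1)
        estimates.append(target_critic(s_next.flatten(1), a_next))
    return torch.stack(estimates).mean(dim=0)               # averaged Q'
```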

(3) The invention adopts a centralized-learning, decentralized-execution framework by combining with the AC algorithm. During training, each agent considers not only the information it observes itself but also the actions, states and other information of the other agents; after training is finished, each agent executes independently. This solves the unstable environment and low training efficiency of traditional reinforcement learning methods such as AC and DQN in the multi-agent setting, and avoids the problem of agents whose training time is too long to adapt to a random environment. The collaborative search strategy of the multiple agents can be successfully optimized while the behavior strategies of the other agents are taken into account.

Drawings

FIG. 1 is a flow chart of a mobile multi-agent cooperative target searching method.

FIG. 2 is a framework design diagram of the centralized-learning, decentralized-execution idea of the improved DDPG algorithm.

FIG. 3 is a diagram of a multi-agent collaborative search target simulation model.

FIG. 4 is a distribution diagram of agents within region O.

Detailed Description

The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.

A mobile multi-agent cooperative target search method is shown in FIG. 1. First, the traditional DDPG algorithm is improved by combining the "centralized learning, decentralized execution" idea of the AC algorithm: the input of the Critic is extended, and the many-to-one mode of state-action observation information in the traditional DDPG algorithm is changed into a one-to-one mode. Second, each agent is trained with the improved DDPG algorithm; during centralized learning, the Critic input of each agent contains not only its own state-action observation information but also the policies and action observations of the other agents. Finally, once all agents are trained, each Actor independently executes the collaborative search task without considering the other agents. The method comprises the following specific steps:

step 1) a target operation domain O is given and uniformly divided into m×n grid regions of equal size; the coordinate of each grid is represented by the center coordinate of the region in which it lies; N agents are set to begin searching within the m×n grid, and the number of targets is S;

step 2) obtain a random noise function Ψ; let θ = {θ_1, ..., θ_N} denote the parameters of the N agents' policies, let the policy set of all agents be π = {π_1, ..., π_N}, the action set be a = (a_1, ..., a_N), and the environmental state vector set be s = (s_1, ..., s_N); the output of the Actor network of each individual agent is then a_i = π_i(s_i, θ_i);

step 3) update the Actor network and the target network of each agent by computing the deterministic policy of each agent; the specific steps are as follows:

step 3.1) set the target gain of agent i (the number of targets found) to J(θ_i) = E[R_i]; the policy gradient formula is then:

where R_i denotes the sum of the target gains of agent i; θ_i, a_i, π_i and s_i denote the policy parameters, action, policy and observed state information of agent i, respectively; p^π denotes the state distribution; Q_i^π denotes the centralized state-action function of the i-th agent, i.e. the real-time action feedback function given by the Critic network of agent i to its Actor network;

step 3.2) receive the initial state s and randomly select and execute an action a; according to the policy gradient formula of step 3.1), compute and judge whether the currently selected action a is the optimal policy currently evaluated by the Critic; if so, set the currently selected action as the deterministic policy, denoted μ_i for short; if not, substitute the reselected action into the policy gradient formula of step 3.1) again for calculation until the deterministic policy μ_i can be acquired. The deterministic policy μ_i is in fact the action a_i actually taken at that time plus the noise function Ψ.

step 4) update the Critic network and the target network of each agent by combining the ideas of TD learning and the target network in DQN; the specific steps are as follows:

step 4.1) execute the deterministic policy action μ_i and obtain a new target gain function policy gradient update formula:

d ═ s, s', a1...aN,r1...rNIs an experience replay buffer pool containing the historical experiences of all agents, where s ═ s1′,...,sN') represents the state vector updated before taking action a, riIndicating agent i takes action aiThe value of the later-obtained instant target prize,representing adoption of deterministic policiesA status-action value function of the case i-th agent centralization;

step 4.2) sample in the experience replay buffer D to obtain Q', the experience-pool function corresponding to the centralized state-action value of the i-th agent; the parameters used are taken from the experience pool and are therefore delayed parameters from before the current latest action was taken; the Q' obtained by sampling is in fact an average over multiple samplings; γ is the discount factor, which determines the importance of future rewards, and its value is 0.8; μ' is the target policy set with delay-updated parameters θ_i'; a_i', μ_i' and s_i' denote the delay-updated action, policy and observation of agent i, respectively;

step 4.3) update the target gain function policy gradient of step 4.1) by minimizing the loss function to obtain the global optimal policy; the update rule is as follows:

where r = {r_1, ..., r_N} denotes the set of instant target rewards obtained after all agents take action a;

step 5) after the global optimal policy is obtained, each agent independently executes the search task. The specific search rules are as follows:

step 5.1) during the independent search of each agent, continuously compute the sum of the target gain values of all agents, Σ_i J(μ_i), where J(μ_i) denotes the optimal gain value obtained by agent i taking the deterministic policy μ_i;

step 5.2) compare the sum of the target gain values Σ_i J(μ_i) computed in step 5.1) with the target number S; if this value is greater than or equal to S, the search is successful; otherwise the search has failed, so return to step 4) and repeat the steps downward until the search succeeds.

Simulation

The area O is divided into 8×8 grid regions according to the map, the coordinates of each grid are represented by the center coordinates of its region, and 8 agents are dispatched to begin the target search in the grid (the initial distribution is shown in FIG. 4); the targets generally move according to certain rules, and the number of targets is 15. The noise function Ψ of the current environment is added at the same time; θ = {θ_1, ..., θ_8} denotes the corresponding parameters of the 8 agents' policies, the set of all agents' policies is π = {π_1, ..., π_8}, the action set is a = (a_1, ..., a_8), and the environmental state vector set is s = (s_1, ..., s_8).

The 8 agents are trained in the region with the improved DDPG algorithm, and the Actor and Critic networks together with their corresponding target networks are updated respectively (the training model is shown in FIG. 2).
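The numerical setup of this simulation can be collected in a small configuration sketch; the dataclass itself and the noise, buffer and batch fields are illustrative assumptions, while the grid size, agent count, target count and discount factor come from this section and claim 4.

```python
from dataclasses import dataclass

@dataclass
class SearchConfig:
    grid_m: int = 8            # grid rows of region O
    grid_n: int = 8            # grid columns of region O
    n_agents: int = 8          # number of searching agents
    n_targets: int = 15        # number of targets S
    gamma: float = 0.8         # discount factor (claim 4)
    noise_std: float = 0.1     # assumed scale of the exploration noise psi
    buffer_size: int = 100_000 # assumed replay buffer capacity
    batch_size: int = 64       # assumed minibatch size

cfg = SearchConfig()
print(cfg)
```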

First, an Actor and a Critic network are assigned one-to-one to each of the 8 agents (see FIG. 3). The Actor receives the current initial state s and adopts a policy π to obtain an action a; the one-to-one Critic then scores the Actor on the corresponding environment state and action selection according to the Actor's current objective function. If the current highest score is reached, the sum of the action taken by the current Actor and the noise function Ψ is taken as the deterministic policy, denoted μ_i for short; otherwise the policy π is re-established and execution continues until the deterministic policy μ_i is obtained.

Next, the Critic and its target network are updated by combining the ideas of TD and the target network in DQN, mainly by executing the deterministic policy action μ_i to obtain a new target policy gradient update formula:

d is a pool of experience with respect to state actions and rewards of the agent at a previous time, including historical experience of all agents.Representing adoption of deterministic policiesIn this case, the ith agent is a centralized state-action value function, that is, a real-time action feedback function of the target prediction network, that is, the Critic network of agent i, to the Actor network. Then sampling is carried out at D to obtain a corresponding delay centralized state-action value functionThe parameters contained in the function are all taken from the experience pool and are delayed, wherein riRepresenting the instant prize earned by agent i, gamma being a discount factor, determining the importance of the future prize,parameter θ 'with deferred update for target policy'iS' represents a state vector of late updates, ai',μi',si' represents actions, policies, and observations of agent i lag updates, respectively. Finally, updating the gradient through a minimum loss function to obtain a global optimal strategy, thereby obtaining a target detection number statistical formula J (mu) corresponding to each intelligent agenti)。

After the global optimal policy is obtained, each agent independently executes the search task. During the independent search of each agent, the sum of the target gain values of all agents, Σ_i J(μ_i), is computed at every moment and compared with the target number 15; if the value is greater than or equal to 15, the search is successful; otherwise the search has failed, and the operation is repeated after sampling and averaging multiple times in the experience pool to obtain a new Q' value, until the search succeeds.

The invention solves the problems of unstable environments and excessive gradient-direction variance that traditional reinforcement learning methods such as Q-learning and Policy Gradient suffer from in the multi-agent setting, and also avoids the long training time and low efficiency that arise when all agents obtain their predicted Q values from a single central Critic while two networks must be trained. The cooperative search strategy of the multiple agents can be successfully optimized while the behavior strategies of the other agents are taken into account, and the method has good real-time performance and robustness.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
