Bionic robot fish cluster navigation simulation method based on deep reinforcement learning technology

Document No. 21347, published 2021-09-21

Note: This technology, "Bionic robot fish cluster navigation simulation method based on deep reinforcement learning technology", was designed and created by Gao Tianhan and Zhang Yan on 2021-06-21. Abstract: The invention provides a bionic robot fish cluster navigation simulation method based on deep reinforcement learning, relating to the technical field of multi-agent path navigation planning. The method first constructs a 3D fish school environment model and then builds an agent model of the bionic robot fish school within that environment; the agent model comprises a perception model, a motion model and a decision model. A reward function for the fish school is then constructed, and a curiosity mechanism is introduced into it. A distributed training framework for the agent model is built on the curiosity mechanism and the PPO2 algorithm, so that agents acquire behavior policies by learning. Finally, the agent model is trained on the constructed distributed framework to realize navigation simulation of the bionic robot fish cluster. The method enables a virtual fish school to learn reasonable schooling behaviors in a 3D environment, which can be applied to real-world bionic robot fish school navigation.

1. A bionic robot fish swarm navigation simulation method based on a deep reinforcement learning technology, characterized by comprising the following steps:

constructing a 3D fish swarm environment model;

constructing an intelligent agent model of the bionic robot fish cluster; the intelligent agent model comprises a perception model, a motion model and a decision model;

constructing a reward function of the fish swarm, and introducing a curiosity mechanism into the reward function;

constructing a distributed training framework for the agent model, so that the agent obtains its behavior policy through learning;

and training the agent model based on the constructed distributed training framework to realize navigation simulation of the bionic robot fish cluster.

2. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 1, wherein: the specific method for constructing the 3D fish swarm environment model comprises the following steps:

firstly, constructing a fish swarm environment;

constructing a 3D scene in a Unity3D engine system by taking the length of a bionic robot fish as 1 unit; transparent air walls are arranged at the periphery and the top of the 3D scene, and the bottom of the 3D scene simulates real ocean terrain and consists of uneven ground and aquatic weeds; the top and peripheral air walls and the bottom terrain form a closed space through collision bodies;

secondly, constructing a coordinate system of a fish school motion world;

setting one vertex of the intersection of the bottom terrain of the 3D scene and the surrounding air walls as the coordinate origin; setting a fish school consisting of n bionic robot fish in the 3D scene, denoted F = {f_1, f_2, …, f_n}, where the position of the i-th bionic robot fish f_i is denoted p_i(x_i, y_i, z_i), i = 1, 2, …, n; in the 3D scene, a region is randomly initialized as the target region, serving as a reward signal to drive the fish school behavior.

3. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 2, wherein:

the construction method of the perception model comprises the following steps:

setting each bionic robot fish to represent an agent, where each fish can perceive all environment state information within a spherical region of radius r centred on its current position, r being the visual range of the fish, adjustable manually; when other agents enter the visual field of a given bionic robot fish, it can perceive their position information and current states; when the visual field of the bionic robot fish touches the target area, it can perceive the direction and distance of the target;

in addition, each agent is wrapped in a capsule collider in the Unity3D engine, so that the agent can perceive collision information when it collides with other agents or obstacles; note that collision detection in the Unity3D engine is based on bounding-box intersection tests, and a collision is triggered when collider surfaces intersect;

the construction method of the motion model comprises the following steps:

constructing an agent motion model with continuous actions in the virtual 3D scene; the agent has three continuous actions: forward movement, left-right rotation and up-down rotation; the agent controls action selection through the pipeline observation information → neural network model → action decision output set; the action decision output set is a floating-point decision action array vectorAction, each element of which is a continuous value in [-1, 1]; vectorAction[0] is the forward action of the agent, vectorAction[1] the left-right steering action, and vectorAction[2] the up-down steering action;

the construction method of the decision model comprises the following steps:

setting the agent to give a decision every m time steps, which is input into the 3D scene to drive the agent's motion; the agent's advance and steering are controlled according to the motion model; the decision of each agent is obtained by fitting a neural network.

4. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 3, wherein: the forward movement is specifically: controlling the agent to move forward by applying a force M to the agent in the same direction as the agent is facing, wherein the force M is applied as follows:

M=|vectorAction[0]*fishMaxMoveSpeed| (1)

where fishMaxMoveSpeed is the maximum moving speed of the agent;

the action outputs of the left-right rotation and the up-down rotation correspond to the second and third elements of the decision action array, respectively, and represent the target values for the change in rotation angle;

calculating the smoothed values smoothPitchChange and smoothYawChange of the agent's up-down and left-right axial changes, as follows:

smoothPitchChange=Mathf.MoveTowards(smoothPitchChange,pitchChange,2*Time.fixedDeltaTime) (2)

smoothYawChange=Mathf.MoveTowards(smoothYawChange,yawChange,2*Time.fixedDeltaTime) (3)

the method comprises the steps that a function Mathf.MoveTowards () returns a variable quantity used for changing the approach of an intelligent agent from a current value to a target value, pitchChange and yawChange respectively correspond to the target values of the change of the intelligent agent in the left-right axial direction and the up-down axial direction, and time.fixedDeltaTime is the time of each frame of a unity3D engine system;

then, according to formula 4 and formula 5, obtaining angle change quantities pitch and yaw of the horizontal axis and the vertical axis in each frame time of the agent:

pitch=smoothPitchChange*Time.fixedDeltaTime*pitchSpeed (4)

yaw=smoothYawChange*Time.fixedDeltaTime*yawSpeed (5)

where yawSpeed and pitchSpeed are the speeds of the agent turning left and right and up and down.

5. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 4, wherein: the specific method for constructing the reward function of the fish swarm and introducing the curiosity mechanism into the reward function comprises the following steps:

setting that when food is within the agent's observation range, the agent receives a reward signal; to drive the agent towards the food, the reward is inversely related to the agent's distance from the food; meanwhile, to give the agent a more definite training target, a distance threshold is set on the agent-to-food distance, within which the agent receives a positive reward and beyond which it receives a negative reward, as shown in the following formula:

reward_dis=-0.05*(distanceToFood-threshold) (6)

where reward_dis is the reward value received by the agent, distanceToFood is the distance between the agent and the food, and threshold is the distance threshold;

adding an inherent curiosity reward to the reward function, giving the agent positive reward feedback when it explores an unknown state; and setting a balance parameter to balance the ratio of the curiosity reward to the other rewards.

6. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 5, wherein the specific method for constructing the distributed training framework of the agent model based on the curiosity mechanism and the PPO2 algorithm is as follows:

the n independent policies are combined into a fish school swimming policy, and each agent is equipped with a neural network with a curiosity mechanism as its policy network; in the learning stage, a common central network is set up: each agent updates its own neural network parameters and then sends the learned policy to the central network; on receiving the policy parameters sent by an agent, the central network updates the global parameters and returns them to that agent's policy network; after the update completes, the agent collects data for learning using the latest policy.

7. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 6, wherein the specific method for training the agent model based on the constructed distributed training framework to realize the navigation simulation of the bionic robot fish cluster is as follows:

at the start of training, initializing a random policy θ_0 and a clipping threshold ε; setting the learning process to run for Kmax rounds in total, where in each round the agent follows the current policy θ_k to collect a policy trajectory D_k = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, …, s_T), with θ_k denoting the policy after the k-th update; s_t, a_t, r_t, s_{t+1} denote the environment state, action, reward and next state collected at step t of the trajectory, t ∈ [0, T], and T is the maximum number of steps of the trajectory; then calculating the curiosity reward of the current round with the built-in curiosity mechanism, and computing the loss function value with the curiosity reward from the interaction data in trajectory D_k; each agent performs gradient descent on this loss value and updates the parameters of its policy network by backpropagation; after an agent's policy is updated, it is sent to the central network to update the global policy, and after each such update the central network sends the updated global policy back to the agent that submitted its policy.

8. The bionic robotic fish clustering navigation simulation method based on the deep reinforcement learning technology as claimed in claim 7, wherein: the specific calculation of the loss function value with curiosity rewards is as follows:

J(θ) = E_t[min(ratio_t(θ)·A^θ′(s_t, a_t), clip(ratio_t(θ), 1-ε, 1+ε)·A^θ′(s_t, a_t))], ratio_t(θ) = p_θ(a_t|s_t)/p_θ′(a_t|s_t) (7)

A^θ′(s_t, a_t) = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T-t+1}δ_{T-1} (8)

where J(θ) denotes the policy-gradient loss function of the PPO2 algorithm; the function clip(a, a_min, a_max) limits the value of a to between a_min and a_max, returning a_max if a is greater than a_max, a_min if a is less than a_min, and a otherwise; the function min(x, y) returns the smaller of x and y; p_θ(a_t|s_t) is the probability of action a_t in state s_t under policy θ; A^θ′(s_t, a_t) is the action advantage estimate in state s_t under policy θ′; ε is the clipping threshold; γ is the discount factor; λ is the advantage-estimation decay coefficient; V(s_t) is the value of state s_t; δ_t = r_t + γV(s_{t+1}) - V(s_t) is the temporal-difference error at time step t; r^i_t denotes the curiosity reward at time step t, and r_t denotes the ordinary feedback reward from the environment.

Technical Field

The invention relates to the technical field of multi-agent path navigation planning, in particular to a bionic robot fish clustering navigation simulation method based on a deep reinforcement learning technology.

Background

The clustering behavior of fish is a typical self-organizing phenomenon. Fish naturally gather into schools while swimming to improve their chances of survival, exhibiting complex collective behaviors. The swimming of each fish can be described by just two basic rules: follow a nearby fish, and keep moving. Yet simulating natural fish school behavior from these two simple rules to realize bionic robot fish school navigation remains difficult for most artificial swarm systems at present.

A common method for simulating fish school behavior to achieve bionic robot fish school navigation is the artificial fish swarm algorithm (AFSA). The artificial fish swarm algorithm, proposed in 2002 by Li Xiaolei et al., is an optimization algorithm based on simulating fish school behavior. In a body of water, the place where the most fish live is generally the place richest in nutrients; based on this characteristic, behaviors such as foraging of fish schools are simulated to achieve global optimization, which is the basic idea of the fish swarm algorithm.

Another, more advanced approach uses deep reinforcement learning to simulate fish schooling behavior for bionic robot fish school navigation. An environment model, an agent model and rewards are constructed, the fish school agents are trained with a deep reinforcement learning algorithm, and the trained model and the perception model are then deployed directly on the bionic robot fish. Using deep reinforcement learning to simulate fish school self-organization not only provides a new approach to bionic robot fish school navigation, but also promotes the development of deep reinforcement learning towards multi-agent settings.

The artificial fish swarm algorithm (AFSA) suffers from low convergence precision, a tendency to fall into local optima, and slow convergence in later stages. The algorithm is very sensitive to its hyperparameters and is easily affected by the step size, population size and crowding factor, which limits it considerably.

At present, to simplify training, most methods that use deep reinforcement learning to simulate fish school clustering work in simple 2D environments with small action and state spaces, and therefore cannot truly reflect the clustering behavior of fish schools in nature. Such methods are of limited significance for practical applications such as underwater robots and submarine navigation, and offer little help for studying real clustering behavior in nature.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a bionic robot fish cluster navigation simulation method based on a deep reinforcement learning technology, so as to realize navigation simulation of the bionic robot fish cluster.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows: the bionic robot fish cluster navigation simulation method based on the deep reinforcement learning technology specifically comprises the following steps:

constructing a 3D fish swarm environment model;

constructing an intelligent agent model of the bionic robot fish cluster; the intelligent agent model comprises a perception model, a motion model and a decision model;

constructing a reward function of the fish swarm, and introducing a curiosity mechanism into the reward function;

constructing a distributed training framework for the agent model, so that the agent obtains its behavior policy through learning;

and training the agent model based on the constructed distributed training framework to realize navigation simulation of the bionic robot fish cluster.

Further, the specific method for constructing the 3D fish swarm environment model comprises the following steps:

firstly, constructing a fish swarm environment;

constructing a 3D scene in a Unity3D engine system by taking the length of a bionic robot fish as 1 unit; transparent air walls are arranged at the periphery and the top of the 3D scene, and the bottom of the 3D scene simulates real ocean terrain and consists of uneven ground and aquatic weeds; the top and peripheral air walls and the bottom terrain form a closed space through collision bodies;

secondly, constructing a coordinate system of a fish school motion world;

setting one vertex of the intersection of the bottom terrain of the 3D scene and the surrounding air walls as the coordinate origin; setting a fish school consisting of n bionic robot fish in the 3D scene, denoted F = {f_1, f_2, …, f_n}, where the position of the i-th bionic robot fish f_i is denoted p_i(x_i, y_i, z_i), i = 1, 2, …, n; in the 3D scene, a region is randomly initialized as the target region, serving as a reward signal to drive the fish school behavior.

Further, the construction method of the perception model comprises the following steps:

setting each bionic robot fish to represent an agent, where each fish can perceive all environment state information within a spherical region of radius r centred on its current position, r being the visual range of the fish, adjustable manually; when other agents enter the visual field of a given bionic robot fish, it can perceive their position information and current states; when the visual field of the bionic robot fish touches the target area, it can perceive the direction and distance of the target;

in addition, each agent is wrapped in a capsule collider in the Unity3D engine, so that the agent can perceive collision information when it collides with other agents or obstacles; note that collision detection in the Unity3D engine is based on bounding-box intersection tests, and a collision is triggered when collider surfaces intersect;

the construction method of the motion model comprises the following steps:

constructing an agent motion model with continuous actions in the virtual 3D scene; the agent has three continuous actions: forward movement, left-right rotation and up-down rotation; the agent controls action selection through the pipeline observation information → neural network model → action decision output set; the action decision output set is a floating-point decision action array vectorAction, each element of which is a continuous value in [-1, 1]; vectorAction[0] is the forward action of the agent, vectorAction[1] the left-right steering action, and vectorAction[2] the up-down steering action;

the construction method of the decision model comprises the following steps:

setting the agent to give a decision every m time steps, which is input into the 3D scene to drive the agent's motion; the agent's advance and steering are controlled according to the motion model; the decision of each agent is obtained by fitting a neural network.

Further, the forward movement is specifically: controlling the agent to move forward by applying a force M to the agent in the same direction as the agent is facing, wherein the force M is applied as follows:

M=|vectorAction[0]*fishMaxMoveSpeed| (1)

where fishMaxMoveSpeed is the maximum moving speed of the agent;

the action outputs of the left-right rotation and the up-down rotation correspond to the second and third elements of the decision action array, respectively, and represent the target values for the change in rotation angle;

calculating the smoothed values smoothPitchChange and smoothYawChange of the agent's up-down and left-right axial changes, as follows:

smoothPitchChange=Mathf.MoveTowards(smoothPitchChange,pitchChange,2*Time.fixedDeltaTime) (2)

smoothYawChange=Mathf.MoveTowards(smoothYawChange,yawChange,2*Time.fixedDeltaTime) (3)

the method comprises the steps that a function Mathf.MoveTowards () returns a variable quantity used for changing the approach of an intelligent agent from a current value to a target value, pitchChange and yawChange respectively correspond to the target values of the change of the intelligent agent in the left-right axial direction and the up-down axial direction, and time.fixedDeltaTime is the time of each frame of a unity3D engine system;

then, according to formula 4 and formula 5, obtaining angle change quantities pitch and yaw of the horizontal axis and the vertical axis in each frame time of the agent:

pitch=smoothPitchChange*Time.fixedDeltaTime*pitchSpeed (4)

yaw=smoothYawChange*Time.fixedDeltaTime*yawSpeed (5)

where yawSpeed and pitchSpeed are the speeds of the agent turning left and right and up and down.

Further, the specific method for constructing the reward function of the fish swarm and introducing the curiosity mechanism into the reward function is as follows:

setting that when food is within the agent's observation range, the agent receives a reward signal; to drive the agent towards the food, the reward is inversely related to the agent's distance from the food; meanwhile, to give the agent a more definite training target, a distance threshold is set on the agent-to-food distance, within which the agent receives a positive reward and beyond which it receives a negative reward, as shown in the following formula:

reward_dis=-0.05*(distanceToFood-threshold) (6)

where reward_dis is the reward value received by the agent, distanceToFood is the distance between the agent and the food, and threshold is the distance threshold;

adding an inherent curiosity reward to the reward function, giving the agent positive reward feedback when it explores an unknown state; and setting a balance parameter to balance the ratio of the curiosity reward to the other rewards.

Further, the specific method for constructing the distributed training framework of the agent model based on the curiosity mechanism and the PPO2 algorithm is as follows:

the n independent policies are combined into a fish school swimming policy, and each agent is equipped with a neural network with a curiosity mechanism as its policy network; in the learning stage, a common central network is set up: each agent updates its own neural network parameters and then sends the learned policy to the central network; on receiving the policy parameters sent by an agent, the central network updates the global parameters and returns them to that agent's policy network; after the update completes, the agent collects data for learning using the latest policy.

Further, the specific method for training the agent model based on the constructed distributed training framework to realize the navigation simulation of the bionic robot fish cluster is as follows:

at the start of training, initializing a random policy θ_0 and a clipping threshold ε; setting the learning process to run for Kmax rounds in total, where in each round the agent follows the current policy θ_k to collect a policy trajectory D_k = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, …, s_T), with θ_k denoting the policy after the k-th update; s_t, a_t, r_t, s_{t+1} denote the environment state, action, reward and next state collected at step t of the trajectory, t ∈ [0, T], and T is the maximum number of steps of the trajectory; then calculating the curiosity reward of the current round with the built-in curiosity mechanism, and computing the loss function value with the curiosity reward from the interaction data in trajectory D_k; each agent performs gradient descent on this loss value and updates the parameters of its policy network by backpropagation; after an agent's policy is updated, it is sent to the central network to update the global policy, and after each such update the central network sends the updated global policy back to the agent that submitted its policy.

Further, the specific calculation of the loss function value with curiosity rewards is as follows:

J(θ) = E_t[min(ratio_t(θ)·A^θ′(s_t, a_t), clip(ratio_t(θ), 1-ε, 1+ε)·A^θ′(s_t, a_t))], ratio_t(θ) = p_θ(a_t|s_t)/p_θ′(a_t|s_t) (7)

A^θ′(s_t, a_t) = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T-t+1}δ_{T-1} (8)

where J(θ) denotes the policy-gradient loss function of the PPO2 algorithm; the function clip(a, a_min, a_max) limits the value of a to between a_min and a_max, returning a_max if a is greater than a_max, a_min if a is less than a_min, and a otherwise; the function min(x, y) returns the smaller of x and y; p_θ(a_t|s_t) is the probability of action a_t in state s_t under policy θ; A^θ′(s_t, a_t) is the action advantage estimate in state s_t under policy θ′; ε is the clipping threshold; γ is the discount factor; λ is the advantage-estimation decay coefficient; V(s_t) is the value of state s_t; δ_t = r_t + γV(s_{t+1}) - V(s_t) is the temporal-difference error at time step t; r^i_t denotes the curiosity reward at time step t, and r_t denotes the ordinary feedback reward from the environment.

The beneficial effects of the above technical solution are as follows: the bionic robot fish swarm navigation simulation method is trained on the basis of deep reinforcement learning, with a proximal policy optimization algorithm and a curiosity mechanism at its core. Simulation experiments show that this training method enables a virtual fish school to learn reasonable schooling behavior in a 3D environment, which can be applied to real-world bionic robot fish school navigation. The trained school autonomously learns the "fish school storm" behavior. The learned virtual fish progress from random swimming to gradual aggregation: after the simulation starts, each fish explores randomly and actively closes in on the nearest fish once others are found; after the school finds a nutrient-rich region, exploration stops and the fish gather and forage. By controlling the speed parameters of some of the fish, the method observes the school's clustering behavior, and comparative analysis reveals striking consistency with natural fish school clustering.

When the maximum speeds of all the fish are the same, each fish automatically adjusts its speed and direction according to the scale of the school and stays consistent with the moving direction of the school as a whole. When the maximum speed of some fish in the school is halved, the remaining fish automatically slow their own movement to avoid collisions. The moving speed of the whole school then slows down, a phenomenon that shows how individual fish adapt to the activity of the school as a whole. However, when only the maximum speed of an individual fish is limited, the school does not slow its swimming speed, so the slow swimmer can only move at the outermost periphery of the school and its chance of obtaining food drops greatly. This phenomenon is also common in nature and is a typical case of survival of the fittest: the survival probability of individuals left behind by the school decreases. Based on these phenomena, the cluster navigation behavior of the bionic fish school can be controlled by controlling the speed of the bionic robot fish.

Drawings

Fig. 1 is a flowchart of a biomimetic robotic fish clustering navigation simulation method based on a deep reinforcement learning technique according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an effect of a specific environment scenario provided by an embodiment of the present invention;

FIG. 3 is a diagram of a single agent model architecture provided by an embodiment of the present invention;

FIG. 4 is a diagram illustrating a curiosity mechanism model according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating the fish school storm effect according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating the effect of initialized chaotic fish schools according to an embodiment of the present invention;

fig. 7 is a graph comparing experimental results with curiosity and without curiosity provided by the embodiment of the invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

In this embodiment, the method for simulating the navigation of the bionic robot fish cluster based on the deep reinforcement learning technology, as shown in fig. 1, includes the following steps:

step 1, constructing a 3D fish swarm environment model;

step 1.1, constructing a fish swarm environment;

in order to simulate the real-world fish swarm environment, a 3D scene is constructed in a Unity3D engine system by taking the length of a bionic robot fish as 1 unit; transparent air walls are arranged at the periphery and the top of the 3D scene, and the bottom of the 3D scene simulates real ocean terrain and consists of uneven ground and aquatic weeds; the top and peripheral air walls and the bottom terrain form a closed space through collision bodies so as to limit the movement of fish schools;

step 1.2, constructing a coordinate system of a fish school motion world;

setting one vertex of the intersection of the bottom terrain of the 3D scene and the surrounding air walls as the coordinate origin; setting a fish school consisting of n bionic robot fish in the 3D scene, denoted F = {f_1, f_2, …, f_n}, where the position of the i-th bionic robot fish f_i is denoted p_i(x_i, y_i, z_i), i = 1, 2, …, n; to simulate the foraging behavior of fish schools in nature, a region of the 3D scene is randomly initialized as the target region, serving as a reward signal to drive the fish school behavior;

in this embodiment, in order to simulate a real-world fish swarm environment, a length of a biomimetic robotic fish is 1 unit, and a 3D scene with a length, width and height of 100 × 100 × 50 is constructed in a Unity3D engine, as shown in fig. 2, transparent air walls are arranged around and at the top of the environment, and the bottom of the environment simulates a real marine terrain and consists of an uneven ground and waterweeds. The top and surrounding air walls and the bottom terrain form a closed space through the collision body to limit the movement of the fish school. And setting one vertex of the intersection of the bottom terrain and the peripheral air wall as a coordinate origin (0, 0, 0). Supposing a fish group consisting of n fish, using F ═ F1,f2,…,fnDenotes, then fish fiCan be expressed as pi(xi,yi,zi). In the scene, in order to simulate the foraging behavior of the natural fish school, an area is initialized randomly as a target area, namely a nutrient-rich area, and is used as a reward signal to drive the fish school behavior.

Step 2, constructing the agent model of the bionic robot fish cluster; the agent model comprises a perception model, a motion model and a decision model;

step 2.1, constructing a perception model;

setting each bionic robot fish to represent an agent, where each fish can perceive all environment state information within a spherical region of radius r centred on its current position, r being the visual range of the fish, adjustable manually; when other agents enter the visual field of a given bionic robot fish, it can perceive their position information and current states; when the visual field of the bionic robot fish touches the target area, it can perceive the direction and distance of the target;

in addition, each agent is wrapped in a capsule collider in the Unity3D engine, so that the agent can perceive collision information when it collides with other agents or obstacles; note that collision detection in the Unity3D engine is based on bounding-box intersection tests, and a collision is triggered when collider surfaces intersect.

In this embodiment, the constructed agent model of the bionic robot fish cluster is shown in fig. 3, which also shows the agent's scale relative to the environment and the size of its observation range. The perception ability models the actual conditions of the bionic robot fish and imitates fish vision in nature: fish perceive their surroundings mainly through their eyes, and because of their special physiology (the eyes sit on both sides of the head, and the monocular field of view approaches or exceeds 180 degrees in both the vertical and horizontal planes) together with the head flexibly changing direction while swimming, a fish's field of view has almost no blind spot. The invention therefore uses a spherical region as the agent's observation range: the agent can perceive all environment state information within the sphere of radius r centred on its current position, such as the positions and headings of other agents and the direction and distance of food, where r is the visual distance of the fish and can be adjusted manually.
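A minimal sketch of this spherical perception model in Unity C# follows; it assumes neighbours and the target area carry the (illustrative) tags "fish" and "food", and relative positions stand in for the full observation vector.

```csharp
using System.Collections.Generic;
using UnityEngine;

// A minimal sketch of the spherical perception model; r is the view radius from
// the text, and the tags "fish" and "food" are illustrative assumptions.
public class FishPerception : MonoBehaviour
{
    public float r = 10f; // visual range, manually adjustable

    public List<Vector3> Sense()
    {
        var observations = new List<Vector3>();
        // Everything inside the sphere of radius r around the current position is visible.
        foreach (Collider c in Physics.OverlapSphere(transform.position, r))
        {
            if (c.CompareTag("fish") || c.CompareTag("food"))
                observations.Add(c.transform.position - transform.position); // direction and distance
        }
        return observations;
    }
}
```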

Step 2.2, constructing a motion model;

constructing an agent motion model with continuous actions in the virtual 3D scene; the agent has three continuous actions: forward movement, left-right rotation and up-down rotation; the agent controls action selection through the pipeline observation information → neural network model → action decision output set; the action decision output set is a floating-point decision action array vectorAction, each element of which is a continuous value in [-1, 1]; vectorAction[0] is the forward action of the agent, vectorAction[1] the left-right steering action, and vectorAction[2] the up-down steering action;

the forward movement is specifically: controlling the agent to move forward by applying a force M to the agent in the same direction as the agent is facing, wherein the force M is applied as follows:

M=|vectorAction[0]*fishMaxMoveSpeed| (1)

where fishMaxMoveSpeed is the maximum moving speed of the agent, i.e., the agent's moving speed lies between 0 and fishMaxMoveSpeed;

the action outputs of the left-right rotation and the up-down rotation correspond to the second and third elements of the decision action array, respectively, and represent the target values for the change in rotation angle;

the intelligent agent can correct the current axial direction to the target value, in order to make the process performance smoother, smooth values smoothPitchChange and smoothYawChang of the variation of the left and right and up and down axial directions of the intelligent agent need to be calculated firstly, namely the variation from the current angle value to the target value every 0.02s, and the specific formula is as follows:

smoothPitchChange=Mathf.MoveTowards(smoothPitchChange,pitchChange,2*Time.fixedDeltaTime) (2)

smoothYawChange=Mathf.MoveTowards(smoothYawChange,yawChange,2*Time.fixedDeltaTime) (3)

the method comprises the steps that a function Mathf.MoveTowards () returns a variable quantity for changing the approach of an intelligent agent from a current value to a target value, pitchChange and yawChange respectively correspond to the target values of the change of the intelligent agent in the left-right axial direction and the up-down axial direction, and time.fixed DeltaTime is the time of each frame of a unity3D engine system and serves as a change speed limit, namely the maximum speed does not exceed 2 time.fixed DeltaTime in the angle change process;

then, according to formula 4 and formula 5, obtaining angle change quantities pitch and yaw of the horizontal axis and the vertical axis in each frame time of the agent:

pitch=smoothPitchChange*Time.fixedDeltaTime*pitchSpeed (4)

yaw=smoothYawChange*Time.fixedDeltaTime*yawSpeed (5)

where yawSpeed and pitchSpeed are the speeds of the agent turning left and right and up and down;

step 2.3, constructing a decision model;

setting the agent to give a decision every m time steps, which is input into the 3D scene to drive the agent's motion; the agent's advance and steering are controlled according to the motion model, i.e., every 0.1 s each agent's decision policy outputs floating-point numbers between -1 and 1 that control its advance and steering; the decision of each agent is obtained by fitting a neural network;

in a virtual 3D environment, in order to simulate the swimming process of natural fishes in water more truly, the invention constructs an intelligent body motion model with continuous action; besides being influenced by the action decision output set of the agent, the moving speed and the moving angle of the agent can also generate rigid collision with other agents or obstacles in the moving process of the agent, so that the moving speed and the moving angle of the agent are changed, and the realization of the characteristic is realized by a physical system in a Unity3D engine;

each time step in the Unity3D engine is 0.02 s. In the embodiment, the intelligent agent is set to give a decision action every 5 time steps, namely every 0.1s decision of each intelligent agent gives a floating point number between-1 and 1 to control the advance and steering of the intelligent agent;

step 3, constructing a reward function of the fish swarm;

setting that when food is within the agent's observation range, the agent receives a reward signal; to drive the agent towards the food, the reward is inversely related to the agent's distance from the food; meanwhile, to give the agent a more definite training target, a distance threshold is set on the agent-to-food distance, within which the agent receives a positive reward and beyond which it receives a negative reward, as specifically shown in formula 6:

reward_dis=-0.05*(distanceToFood-threshold) (6)

where reward_dis is the reward value received by the agent, distanceToFood is the distance between the agent and the food, and threshold is the distance threshold;

in this embodiment, the agent obtains a positive reward of 0.5 when eating food, the food disappears, and the agent is given a negative reward of-0.5 when a collision occurs between agents or the agent collides with an obstacle;

step 4, introducing a curiosity mechanism into the reward function;

adding an inherent curiosity reward to the reward function, and giving positive reward feedback to the agent when the agent explores an unknown state; setting a balance parameter to balance the ratio of curiosity rewards and other rewards;

step 5, constructing a distributed training framework of the intelligent agent model based on a curiosity mechanism and a PPO2 algorithm, and enabling the intelligent agent to obtain a behavior strategy in a learning mode;

the n independent policies are combined into a fish school swimming policy, and each agent is equipped with a neural network with a curiosity mechanism as its policy network; in the learning stage, a common central network is set up: each agent updates its own neural network parameters and then sends the learned policy to the central network; on receiving the policy parameters sent by an agent, the central network updates the global parameters and returns them to that agent's policy network; after the update completes, the agent collects data for learning using the latest policy;
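For illustration, a conceptual C# sketch of this central-network exchange is given below; it assumes each policy network can expose its weights as a flat float array, and the mixing rule used for the global update is an assumption, since the text does not fix the exact update formula.

```csharp
// A conceptual sketch of the central-network parameter exchange, under the
// assumptions stated above; the 50/50 mixing rule is illustrative only.
public class CentralNetwork
{
    private readonly float[] globalParams;

    public CentralNetwork(int parameterCount)
    {
        globalParams = new float[parameterCount];
    }

    // An agent pushes its freshly learned parameters; the central network folds
    // them into the global policy and returns the updated global parameters.
    public float[] Push(float[] agentParams, float mix = 0.5f)
    {
        for (int i = 0; i < globalParams.Length; i++)
            globalParams[i] = (1f - mix) * globalParams[i] + mix * agentParams[i];
        return (float[])globalParams.Clone();
    }
}
```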

step 6, training an intelligent model based on the distributed training framework constructed in the step 5 to realize navigation simulation of the bionic robot fish cluster;

at the start of training, initializing a random policy θ_0 and a clipping threshold ε; setting the learning process to run for Kmax rounds in total, where in each round the agent follows the current policy θ_k to collect a policy trajectory D_k = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, …, s_T), with θ_k denoting the policy after the k-th update; s_t, a_t, r_t, s_{t+1} denote the environment state, action, reward and next state collected at step t of the trajectory, t ∈ [0, T], and T is the maximum number of steps of the trajectory; then calculating the curiosity reward of the current round with the built-in curiosity mechanism, and computing the loss function value with the curiosity reward from the interaction data s_0, a_0, r_0, s_1, a_1, r_1, s_2, …, s_T in trajectory D_k; each agent performs gradient descent on this loss value and updates the parameters of its policy network by backpropagation; after an agent's policy is updated, it is sent to the central network to update the global policy, and after each such update the central network sends the updated global policy back to the agent that submitted its policy;

the specific calculation of the loss function value with curiosity rewards is as follows:

J(θ) = E_t[min(ratio_t(θ)·A^θ′(s_t, a_t), clip(ratio_t(θ), 1-ε, 1+ε)·A^θ′(s_t, a_t))], ratio_t(θ) = p_θ(a_t|s_t)/p_θ′(a_t|s_t) (7)

A^θ′(s_t, a_t) = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T-t+1}δ_{T-1} (8)

where J(θ) denotes the policy-gradient loss function of the PPO2 algorithm; the function clip(a, a_min, a_max) limits the value of a to between a_min and a_max, returning a_max if a is greater than a_max, a_min if a is less than a_min, and a otherwise; the function min(x, y) returns the smaller of x and y; p_θ(a_t|s_t) is the probability of action a_t in state s_t under policy θ; A^θ′(s_t, a_t) is the action advantage estimate in state s_t under policy θ′; ε is the clipping threshold; γ is the discount factor; λ is the advantage-estimation decay coefficient; V(s_t) is the value of state s_t; δ_t = r_t + γV(s_{t+1}) - V(s_t) is the temporal-difference error at time step t; r^i_t denotes the curiosity reward at time step t, and r_t denotes the ordinary feedback reward from the environment.
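A minimal sketch, under stated assumptions, of how the quantities in formulas (7) and (8) could be computed for one collected trajectory is given below; array names are illustrative, the action probabilities pNew and pOld are taken as outputs of the new and old policy networks (not shown), and the terminal value is assumed to be zero.

```csharp
using UnityEngine;

// A minimal sketch of the per-step quantities in formulas (7)-(8): the TD error
// δ_t, the advantage estimate, and the clipped surrogate term for one sample.
public static class Ppo2Math
{
    // Advantage estimates accumulated backwards from (γλ)-weighted TD errors.
    public static float[] Advantages(float[] rewards, float[] values, float gamma, float lambda)
    {
        int T = rewards.Length;
        var adv = new float[T];
        float running = 0f;
        for (int t = T - 1; t >= 0; t--)
        {
            float nextV = (t + 1 < T) ? values[t + 1] : 0f;            // terminal V assumed 0
            float delta = rewards[t] + gamma * nextV - values[t];      // δ_t = r_t + γV(s_{t+1}) - V(s_t)
            running = delta + gamma * lambda * running;
            adv[t] = running;
        }
        return adv;
    }

    // Clipped surrogate term of formula (7) for one (s_t, a_t) sample.
    public static float SurrogateTerm(float pNew, float pOld, float advantage, float epsilon)
    {
        float ratio = pNew / pOld;                                      // p_θ(a_t|s_t) / p_θ'(a_t|s_t)
        float clipped = Mathf.Clamp(ratio, 1f - epsilon, 1f + epsilon); // clip(ratio, 1-ε, 1+ε)
        return Mathf.Min(ratio * advantage, clipped * advantage);       // min(·, ·)
    }
}
```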

The invention constructs a curiosity mechanism to promote the agent's exploration ability and keep it from falling into a locally optimal policy. This is achieved by introducing an inherent curiosity reward, as shown in fig. 4: the current state s_t and action a_t are input to predict the next state ŝ_{t+1}, which is then compared with the actual next state s_{t+1}; the curiosity reward is built from the difference between ŝ_{t+1} and s_{t+1}, and the larger the difference, the larger the reward. Meanwhile, a balance parameter is set to balance the ratio of the curiosity reward to the other rewards; this parameter must be tuned experimentally and usually lies between 0.001 and 0.1. Furthermore, as training progresses, fewer states remain unexplored and the curiosity reward decreases.
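A minimal sketch of this intrinsic reward follows; predictedNextState stands for the output of a learned forward model fed with (s_t, a_t) (the model itself is not shown), and beta is the balance parameter between 0.001 and 0.1 mentioned in the text.

```csharp
// A minimal sketch of the intrinsic curiosity reward: the reward grows with the
// squared error between the predicted and the actual next state.
public static class Curiosity
{
    public static float Reward(float[] predictedNextState, float[] actualNextState, float beta)
    {
        float err = 0f;
        for (int i = 0; i < actualNextState.Length; i++)
        {
            float d = predictedNextState[i] - actualNextState[i];
            err += d * d; // squared difference between predicted and actual next state
        }
        return beta * err; // larger prediction error -> larger curiosity reward
    }
}
```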

In this embodiment, the training method based on deep reinforcement learning and the curiosity mechanism enables the fish school to learn reasonable schooling behavior in a 3D environment, and the trained school autonomously learns the "fish school storm" behavior, as shown in fig. 5. The trained agent model can be saved as a .pb file, deployed on the bionic robot fish, and applied in a real environment.

In this embodiment, the trained bionic robot fish progress from random swimming to gradual aggregation. The position and angle of each fish are initialized before the experiment starts, as shown in fig. 6. After the experiment starts, each fish explores randomly and actively closes in on the nearest fish once others are found; after the school finds a nutrient-rich area, exploration stops and the fish gather to forage, while throughout the whole process each bionic robot fish avoids colliding with the surrounding environment and with the other fish. In this embodiment, the clustering behavior of the school is observed by controlling the speed parameters of some of the fish, and comparative analysis shows that the clustering of the bionic robot fish is consistent with the natural fish school clustering phenomenon.

In this embodiment, when the maximum speeds of all the bionic robot fish are the same, each fish automatically adjusts its speed and direction according to the scale of the school and stays consistent with the moving direction of the school as a whole. When the maximum speed of more than thirty percent of the fish in the school is halved, the remaining fish are found to automatically slow their own movement to avoid collisions. The moving speed of the whole school then slows down, a phenomenon that shows how individual fish adapt to the activity of the school as a whole. However, when only the maximum speed of an individual fish is limited, the school does not slow its swimming speed, so the slow swimmer can only move at the outermost periphery of the school and its chance of obtaining food drops greatly. This phenomenon is also common in nature and is a typical case of survival of the fittest: the survival probability of individuals left behind by the school decreases. Based on these phenomena, the cluster navigation behavior of the bionic fish school can be controlled by controlling the speed of the bionic robot fish.

To demonstrate the effect of the curiosity mechanism, several comparative experiments were conducted in this embodiment. The upper limit of training for each experiment was set to 5,000,000 rounds. Under the same experimental environment, it can be clearly seen that after about 700,000 training iterations, the convergence speed of the deep reinforcement learning method combined with the curiosity mechanism is significantly better than that of the method without it, as shown in fig. 7. Combined with the PPO algorithm, the curiosity mechanism lets the trained decision policy avoid the locally optimal solution, yielding a better policy than the plain PPO algorithm and a better cluster navigation effect.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
