Reinforcement learning state layering method based on equivalent subspace

Document No.: 1964429    Publication date: 2021-12-14

Note: This technique, "Reinforcement learning state layering method based on equivalent subspace" (一种基于等效子空间的强化学习状态分层方法), was designed and created by Gao Ziwen, Liu Juntao, Wang Zhenjie, Wang Yuanbin and Huang Zhigang on 2021-09-07. Abstract: The invention discloses a reinforcement learning state layering method based on equivalent subspace, comprising the following steps: a state semantic feature vector is generated from the agent's observation data of the environment at a single time step and combined with related information produced through environment interaction to form a state representation vector; the state representation vectors collected within a preset number of episodes form a state representation set, and a number of equivalent subspaces are generated through cluster analysis to obtain an equivalent state partition; based on this partition, the states observed by the agent during learning and training are classified to obtain a one-hot sub-state vector containing class information; subsequent policy calculation is performed on the basis of the one-hot sub-state vector during training, and the equivalent state partition is updated again through the above steps at a preset time resolution. The invention divides the state space into equivalent state subspaces at different abstraction levels, so as to solve the problem of an overly large state space in agent reinforcement learning, improve environment search efficiency, and provide an interpretable basis for reinforcement learning algorithms.

1. A reinforcement learning state layering method based on equivalent subspace, characterized in that the method comprises the following steps:

(1) generating a state semantic feature vector from the agent's observation data of the environment at a single time step, combining the state semantic feature vector with related information produced through environment interaction to form a state representation vector, collecting the state representation vectors within a preset number of episodes to form a state representation set, and generating a plurality of equivalent subspaces through cluster analysis to obtain an equivalent state partition;

(2) based on the equivalent state partition obtained in step (1), classifying the states observed by the agent during learning and training to obtain a one-hot sub-state vector containing class information;

(3) performing subsequent policy calculation based on the one-hot sub-state vector generated in step (2) during learning and training, and updating the equivalent state partition again through the above steps at a preset time resolution.

2. The equivalent subspace-based reinforcement learning state layering method according to claim 1, wherein said step (1) comprises:

(1-1) combining the related information acquired by the agent in a single time step into a state representation vector;

(1-2) during the interaction between the agent and the environment, collecting the state representation vectors within a preset number of consecutive time steps to form a state representation set, and carrying out cluster analysis on the state representation set;

(1-3) obtaining the classification-center set of the clustering calculation, namely the equivalent state partition of the environment state space at an abstract level, wherein each class is an equivalent subspace.

3. The equivalent subspace-based reinforcement learning state layering method according to claim 2, wherein the state representation vector in said step (1-1) is <s, r, a, next_s>, where: s represents the state semantic feature vector observed by the agent; r represents the reward feedback acquired by the agent from the environment; a represents the action selected by the agent through its decision; next_s represents the state semantic feature vector of the agent at the next time step.

4. The equivalent subspace-based reinforcement learning state layering method according to claim 3, wherein said step (1-2) comprises:

(1-2-1) determining a total number k of state classes;

(1-2-2) selecting k classification training samples from the state representation set, and taking each such vector as the classification center of its class;

(1-2-3) sequentially calculating the distance from each unknown vector sample to the k classification centers;

(1-2-4) determining the classification training sample closest to the unknown sample, adding the unknown sample to that sample's class, and recalculating the classification center;

(1-2-5) repeating the above steps until all samples are clustered.

5. The equivalent subspace-based reinforcement learning state layering method according to claim 3, wherein in the step (1-3):

the equivalent state subspace refers to a set of states that have similar characteristics and produce the same return and state transition; the union of all equivalent subspaces forms the complete environment state space, and each equivalent subspace can be regarded as one class of states; all states within a subspace are similar to one another, an equivalent subspace carries the common characteristics of all states of its class, and states belonging to different subspaces differ from one another; states within the same equivalent subspace are processed identically so as to improve the search efficiency of the state space; and the equivalent state partition is updated by repeating the above steps at a preset time resolution and is used for the state layering calculation during the agent's learning and training.

6. The equivalent subspace-based reinforcement learning state layering method according to claim 1 or 2, wherein said step (2) comprises:

(2-1) forming a classifier from the cluster-center set of the equivalent state partition obtained in step (1);

(2-2) during the agent's learning and training, classifying the state semantic feature vectors obtained in step (1) with the classifier.

7. The equivalent subspace-based reinforcement learning state layering method according to claim 6, wherein said step (2-2) comprises:

(2-2-1) calculating the distance from the semantic feature vector to each cluster center;

(2-2-2) selecting the minimum distance to determine classification attribution;

(2-2-3) generating a one-hot vector according to the classification attribution, wherein the position of the 1 in the vector encodes the state class information, and the vector represents the agent's selection of an equivalent subspace.

8. The equivalent subspace-based reinforcement learning state layering method according to claim 1 or 2, wherein said step (3) comprises:

subsequent policy calculation is performed based on the one-hot sub-state vector, which plays the role of layering the state space; the equivalent state partition is updated at a preset time resolution: a preset time-step threshold is set, the equivalent subspaces are reconstructed when the learning and training process reaches the threshold, and the new equivalent state partition replaces the previous one for the agent's learning and training.

9. The equivalent subspace-based reinforcement learning state layering method according to claim 1 or 2, wherein the generating of the state semantic feature vector from the agent's observation data of the environment at a single time step in step (1) specifically comprises: mapping the complex high-dimensional real-time state observed by the agent in a single time step into a low-dimensional state space while retaining its semantic information, so as to realize efficient abstract representation and understanding of the environment; and, through graph neural network techniques, mapping the high-dimensional, continuous environment state into a low-dimensional, discrete feature space and generating the state semantic feature vector, thereby representing the complex relations and changes among unit objects as well as the local and global features of the state.

10. The reinforcement learning state layering method based on equivalent subspace as claimed in claim 1 or 2, wherein the step (1) of generating a plurality of equivalent subspaces through clustering analysis specifically comprises:

after a preset number of episodes, collecting the agent's state semantic feature vectors and the corresponding return, action and state-transition information to form state representation vectors <state semantic feature, return, action, subsequent state>; after a number of time steps, collecting several state representation vectors to form a state representation set; and clustering the set at intervals of a fixed time step, the obtained clustering result being the equivalent subspace partition of the environment state.

Technical Field

The invention belongs to the technical field of reinforcement learning, and particularly relates to a reinforcement learning state layering method based on an equivalent subspace.

Background

In recent years, deep reinforcement learning technology, which combines the perception capability of deep learning with the decision-making capability of reinforcement learning, has achieved breakthrough progress and is widely applied in fields such as board games (e.g. AlphaGo and AlphaZero), game AI, autonomous driving, and robot control. In the field of military command, deep reinforcement learning methods are used to obtain an agent through a large amount of training for intelligent decision making, greatly promoting the development of the field of military command decision making.

The reinforcement learning framework mainly comprises elements such as an agent, a behavior policy, an environment state, an action space, a return function and an interaction environment. The interaction process between the agent and the environment is shown in FIG. 1 and mainly comprises the following steps: (1) the agent senses the current environment state s_t; (2) based on the current environment state s_t and the currently adopted policy, the agent selects an action a_t from the action space and executes it; (3) when the selected action acts on the environment, the environment transitions to a new state s_{t+1} and gives a return value R_t; and the process repeats. At present, the environment under complex adversarial conditions is often complicated and changeable, the number of agents participating in the confrontation is large, and they have different attributes and various interrelations, so that the state space is huge.
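For orientation only, a minimal sketch of this standard interaction loop using a Gym-style interface is given below; the environment name and the random action selection are placeholders, not part of the invention:

```python
import gymnasium as gym  # any Gym-style environment interface

env = gym.make("CartPole-v1")            # placeholder environment
state, _ = env.reset()
done = False
while not done:
    action = env.action_space.sample()   # stand-in for the agent's policy choosing a_t from s_t
    state, reward, terminated, truncated, _ = env.step(action)  # environment returns s_{t+1} and R_t
    done = terminated or truncated       # the loop repeats with the new state
```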

Under all-round, multi-dimensional and complex adversarial environment conditions, the state space of adversarial reinforcement learning is huge, and searching the environment state is difficult. The main problems faced in handling the reinforcement learning environment state are: 1) states in a complex environment have multiple dimensions and contain multiple attributes, so the environment state space is a huge, high-dimensional, continuous space, the state scale under the agent's observation conditions is multiplied, and state search efficiency is low; 2) the state data set objectively reflects real environment information and is a data representation of the real-time environment state, but it is difficult to integrate semantic information at a subjective level, so the complex data of the state data set lacks interpretability, upper-level information is difficult to mine, and states of different classes are difficult to distinguish.

Disclosure of Invention

Aiming at the defects or improvement requirements of the prior art, the invention provides an equivalent subspace-based reinforcement learning state layering method. To address the difficulty of environment search caused by an overly large reinforcement learning state space, a layered state model is established and the state space is divided into equivalent state subspaces at different abstraction levels, so as to solve the problem of the overly large state space, improve environment search efficiency, and provide an interpretable basis for reinforcement learning algorithms.

In order to achieve the above object, the present invention provides a reinforcement learning state layering method based on an equivalent subspace, the method comprising:

(1) generating a state semantic feature vector from the agent's observation data of the environment at a single time step, combining the state semantic feature vector with related information produced through environment interaction to form a state representation vector, collecting the state representation vectors within a preset number of episodes to form a state representation set, and generating a plurality of equivalent subspaces through cluster analysis to obtain an equivalent state partition;

(2) based on the equivalent state partition obtained in step (1), classifying the states observed by the agent during learning and training to obtain a one-hot sub-state vector containing class information;

(3) performing subsequent policy calculation based on the one-hot sub-state vector generated in step (2) during learning and training, and updating the equivalent state partition again through the above steps at a preset time resolution.

In one embodiment of the present invention, the step (1) comprises:

(1-1) combining the related information acquired by the agent in a single time step into a state representation vector;

(1-2) during the interaction between the agent and the environment, collecting the state representation vectors within a preset number of consecutive time steps to form a state representation set, and carrying out cluster analysis on the state representation set;

(1-3) obtaining the classification-center set of the clustering calculation, namely the equivalent state partition of the environment state space at an abstract level, wherein each class is an equivalent subspace.

In an embodiment of the present invention, the state representation vector in the step (1-1) is <s, r, a, next_s>, where: s represents the state semantic feature vector observed by the agent; r represents the reward feedback acquired by the agent from the environment; a represents the action selected by the agent through its decision; next_s represents the state semantic feature vector of the agent at the next time step.

In one embodiment of the present invention, the step (1-2) comprises:

(1-2-1) determining a total number k of state classes;

(1-2-2) selecting k classification training samples from the state representation set, and taking each such vector as the classification center of its class;

(1-2-3) sequentially calculating the distance from each unknown vector sample to the k classification centers;

(1-2-4) determining the classification training sample closest to the unknown sample, adding the unknown sample to that sample's class, and recalculating the classification center;

(1-2-5) repeating the above steps until all samples are clustered.

In one embodiment of the present invention, in the step (1-3):

the equivalent state subspace refers to a set of states that have similar characteristics and produce the same return and state transition; the union of all equivalent subspaces forms the complete environment state space, and each equivalent subspace can be regarded as one class of states; all states within a subspace are similar to one another, an equivalent subspace carries the common characteristics of all states of its class, and states belonging to different subspaces differ from one another; states within the same equivalent subspace are processed identically so as to improve the search efficiency of the state space; and the equivalent state partition is updated by repeating the above steps at a preset time resolution and is used for the state layering calculation during the agent's learning and training.

In one embodiment of the present invention, the step (2) comprises:

(2-1) forming a classifier from the cluster-center set of the equivalent state partition obtained in step (1);

(2-2) during the agent's learning and training, classifying the state semantic feature vectors obtained in step (1) with the classifier.

In one embodiment of the present invention, the step (2-2) comprises:

(2-2-1) calculating the distance from the semantic feature vector to each cluster center;

(2-2-2) selecting the minimum distance to determine classification attribution;

(2-2-3) generating a one-hot vector according to the classification attribution, wherein the position of the 1 in the vector encodes the state class information, and the vector represents the agent's selection of an equivalent subspace.

In one embodiment of the present invention, the step (3) comprises:

subsequent policy calculation is performed based on the one-hot sub-state vector, which plays the role of layering the state space; the equivalent state partition is updated at a preset time resolution: a preset time-step threshold is set, the equivalent subspaces are reconstructed when the learning and training process reaches the threshold, and the new equivalent state partition replaces the previous one for the agent's learning and training.

In an embodiment of the present invention, the generating of the state semantic feature vector from the agent's observation data of the environment at a single time step in step (1) specifically includes: mapping the complex high-dimensional real-time state observed by the agent in a single time step into a low-dimensional state space while retaining its semantic information, so as to realize efficient abstract representation and understanding of the environment; and, through graph neural network techniques, mapping the high-dimensional, continuous environment state into a low-dimensional, discrete feature space and generating the state semantic feature vector, thereby representing the complex relations and changes among unit objects as well as the local and global features of the state.

In an embodiment of the present invention, the step (1) of generating a plurality of equivalent subspaces through clustering analysis specifically includes:

after a preset number of episodes, collecting the agent's state semantic feature vectors and the corresponding return, action and state-transition information to form state representation vectors <state semantic feature, return, action, subsequent state>; after a number of time steps, collecting several state representation vectors to form a state representation set; and clustering the set at intervals of a fixed time step, the obtained clustering result being the equivalent subspace partition of the environment state.

Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:

(1) the equivalent subspaces greatly reduce the state space to be searched, improving the agent's efficiency in searching the environment;

(2) through corresponding visualization and human observation and reasoning, the equivalent state partition results can be used to mine deeper information about state categories at a subjective level, providing a basis for the interpretability of environment state data.

Drawings

FIG. 1 is a schematic diagram illustrating a process for interaction between an agent and an environment in the prior art;

FIG. 2 is a schematic flowchart of a reinforcement learning state layering method based on equivalent subspace according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of the equivalent state partition updating in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

To address the problems in the prior art, as shown in FIG. 2, the invention provides a reinforcement learning state layering method based on equivalent subspace, which comprises the following specific steps:

(1) generating a state semantic feature vector from the agent's observation data of the environment at a single time step, combining the state semantic feature vector with related information produced through environment interaction to form a state representation vector, collecting the state representation vectors within a preset number of episodes to form a state representation set, and generating a plurality of equivalent subspaces through cluster analysis to obtain an equivalent state partition;

the intelligent agent obtains state data through interaction with the environment, and obtains a state semantic feature vector after processing, wherein the state semantic feature vector is the expression of the environment state. The vector is used for reinforcement learning training. And the < s, r, a, next _ s > is a state representation vector, wherein s is the state semantic feature vector, and the state semantic feature vector, the reward r, the action selection a and the next time step, together form the state representation vector. This vector is used for clustering analysis described later.

(2) based on the equivalent state partition obtained in step (1), classifying the states observed by the agent during learning and training to obtain a one-hot sub-state vector containing class information;

(3) performing subsequent policy calculation based on the one-hot sub-state vector generated in step (2) during learning and training, and updating the equivalent state partition again through the above steps at a preset time resolution.

The specific process of the step (1) is as follows:

(1-1) combining the relevant information acquired by the agent within a single time step into a state representation vector <s, r, a, next_s>, where:

(1-1-1) s represents state semantic feature vectors observed by the agent;

(1-1-2) r represents reward feedback acquired by the agent from the environment;

(1-1-3) a represents an action selected by the agent through the decision;

(1-1-4) next_s represents the state semantic feature vector of the agent at the next time step.

(1-2) during the interaction between the agent and the environment, collecting the state representation vectors within a preset number of consecutive time steps to form a state representation set, and carrying out cluster analysis on the state representation set; the specific steps are as follows:

(1-2-1) determining a total number k of state classes;

(1-2-2) selecting k classification training samples from the state representation set, and taking each such vector as the classification center of its class;

(1-2-3) sequentially calculating the distance from each unknown vector sample to the k classification centers;

(1-2-4) determining the classification training sample closest to the unknown sample, adding the unknown sample to that sample's class, and recalculating the classification center;

(1-2-5) repeating the above steps until all samples are clustered.

(1-3) obtaining the classification-center set of the clustering calculation, namely the equivalent state partition of the environment state space at an abstract level, wherein each class is an equivalent subspace. An equivalent state subspace refers to a set of states that have similar characteristics and produce the same reward and state transition. The union of all equivalent subspaces constitutes the complete environment state space, and each equivalent subspace can be regarded as one class of states: all states within a subspace are similar to one another, the equivalent subspace carries the common characteristics of all states of its class, and states belonging to different subspaces differ from one another. States within the same equivalent subspace are processed identically so as to improve the search efficiency of the state space. The equivalent state partition is updated by repeating the above steps at a preset time resolution and is used for the state layering calculation during the agent's learning and training.
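A minimal sketch of the clustering procedure of steps (1-2-1) to (1-2-5) is given below; the Euclidean distance and the random choice of seed samples are assumptions, and any standard clustering method (e.g. k-means) could be substituted:

```python
import numpy as np

def equivalent_state_partition(vectors: np.ndarray, k: int, rng=np.random):
    """Sequentially cluster flattened state representation vectors into k classes.

    vectors: (n, d) array of flattened <s, r, a, next_s> vectors.
    Returns the (k, d) classification-center set (the equivalent state partition)
    and the class index assigned to every sample.
    """
    n = len(vectors)
    seed_idx = rng.choice(n, size=k, replace=False)  # (1-2-2) k samples as initial centers
    centers = vectors[seed_idx].astype(float)
    members = [[int(i)] for i in seed_idx]
    labels = np.full(n, -1)
    labels[seed_idx] = np.arange(k)

    for i in range(n):                                        # (1-2-5) until all samples are clustered
        if labels[i] != -1:
            continue
        dists = np.linalg.norm(centers - vectors[i], axis=1)  # (1-2-3) distance to each center
        c = int(np.argmin(dists))                             # (1-2-4) closest class
        members[c].append(i)
        labels[i] = c
        centers[c] = vectors[members[c]].mean(axis=0)         # recalculate the classification center
    return centers, labels
```

The returned classification-center set is what step (2-1) below turns into a classifier.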

In step (2), based on the equivalent state partition obtained in step (1), the states observed by the agent are classified during learning and training to obtain a one-hot sub-state vector containing category information. The specific process is as follows:

(2-1) forming a classifier from the cluster-center set of the equivalent state partition obtained in step (1);

(2-2) during the agent's learning and training, classifying the state semantic feature vectors obtained in step (1) with the classifier;

(2-2-1) calculating the distance from the semantic feature vector to each cluster center;

(2-2-2) selecting the minimum distance to determine classification attribution;

(2-2-3) generating a one-hot vector according to the classification attribution, wherein the position of the 1 in the vector encodes the state class information, and the vector represents the agent's selection of an equivalent subspace.
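Steps (2-2-1) to (2-2-3) amount to a nearest-center classifier; a minimal sketch follows (Euclidean distance is assumed, and the centers are assumed to lie in the same feature space as the semantic feature vector, e.g. the semantic-feature block of the cluster centers):

```python
import numpy as np

def one_hot_sub_state(s: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Map a state semantic feature vector to its one-hot sub-state vector.

    centers: (k, d) classifier formed from the equivalent state partition.
    The returned vector has a single 1 marking the selected equivalent subspace.
    """
    dists = np.linalg.norm(centers - s, axis=1)  # (2-2-1) distance to each cluster center
    c = int(np.argmin(dists))                    # (2-2-2) minimum distance decides the class
    one_hot = np.zeros(len(centers))             # (2-2-3) one-hot vector carrying class information
    one_hot[c] = 1.0
    return one_hot
```

In training, this one-hot vector is the layered (abstract) state passed on to the policy calculation of step (3).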

As shown in FIG. 3, in the learning and training process of step (3), subsequent policy calculation is performed based on the one-hot sub-state vector, which achieves the effect of layering the state space. The equivalent state partition is updated through the above steps at a preset time resolution: a preset time-step threshold is set, the equivalent subspaces are reconstructed when the learning and training process reaches the threshold, and the new equivalent state partition replaces the previous one for the agent's learning and training.
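How the preset time-step threshold could drive the re-partitioning during training is sketched below, reusing the helper functions from the sketches above; UPDATE_PERIOD, K and the training_stream generator are placeholders, not values prescribed by the invention:

```python
import numpy as np

UPDATE_PERIOD = 10_000   # assumed preset time-step threshold (the "time resolution")
K = 16                   # assumed total number of state classes

def train_with_state_layering(training_stream):
    """training_stream yields StateRepresentation items produced during training (placeholder)."""
    buffer, centers = [], None
    for t, rep in enumerate(training_stream):
        buffer.append(rep.as_vector())
        if t > 0 and t % UPDATE_PERIOD == 0 and len(buffer) >= K:
            # threshold reached: reconstruct the equivalent subspaces and replace
            # the previous equivalent state partition with the new one
            centers, _ = equivalent_state_partition(np.asarray(buffer), K)
            buffer.clear()
        if centers is not None:
            sub_state = one_hot_sub_state(rep.s, centers)
            # ... subsequent policy calculation uses sub_state here ...
```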

The technical solution of the present invention is further illustrated by the following specific examples:

step 1: state understanding and coding

The complex high-dimensional real-time state observed by the agent in a single time step is mapped into a low-dimensional state space while retaining its semantic information, realizing efficient abstract representation and understanding of the environment. Through graph neural network techniques, the high-dimensional, continuous environment state is mapped into a low-dimensional, discrete feature space and a state semantic feature vector is generated, representing the complex relations and changes among unit objects as well as the local and global features of the state.
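The description does not fix a particular network architecture; as one plausible sketch (the weights, dimensions and single round of message passing are all assumptions), a mean-aggregation graph encoder that maps unit-object features and their relations to a low-dimensional state semantic feature vector might look like this:

```python
import numpy as np

def gnn_state_encoding(node_feats: np.ndarray, adj: np.ndarray,
                       w_local: np.ndarray, w_global: np.ndarray) -> np.ndarray:
    """One round of mean-aggregation message passing followed by a global readout.

    node_feats: (n_units, d_in) features of the unit objects in the observation.
    adj:        (n_units, n_units) adjacency matrix encoding their interrelations.
    w_local:    (2 * d_in, d_hidden) projection for the local (per-unit) features.
    w_global:   (d_hidden, d_out) projection for the global (state-level) feature.
    Returns a low-dimensional state semantic feature vector of length d_out.
    """
    deg = adj.sum(axis=1, keepdims=True) + 1e-8
    neighbors = (adj @ node_feats) / deg                   # aggregate relations between unit objects
    local = np.tanh(np.concatenate([node_feats, neighbors], axis=1) @ w_local)  # local features
    global_feat = np.tanh(local.mean(axis=0) @ w_global)   # global feature of the whole state
    # a discretization step (e.g. vector quantization) would map this into the
    # discrete feature space mentioned above; omitted in this sketch
    return global_feat
```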

Step 2: equivalent state partitioning

After a preset number of episodes, the agent's state semantic feature vectors and the corresponding return, action and state-transition information are collected to form state representation vectors <state semantic feature, return, action, subsequent state>. Several state representation vectors collected over a number of time steps form a state representation set. The set is clustered at intervals of a fixed time step, and the obtained clustering result is the equivalent subspace partition of the environment state.

And step 3: reinforcement learning state layering

During learning and training, the agent classifies the acquired state semantic feature vectors according to the equivalent state partition (the clustering result) using a nearest-neighbor method, obtains the code of the sub-state space, and converts it into a one-hot vector that records the class information of the environment state; this vector is a higher-level abstract state representation and is used for subsequent policy calculation. The equivalent state partition is updated at a preset time resolution.
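As an illustration only (the policy architecture is not specified by the invention), the one-hot sub-state vector can enter the subsequent policy calculation by, for example, being concatenated with the semantic feature vector as the input of a simple softmax policy:

```python
import numpy as np

def policy_logits(sub_state_one_hot: np.ndarray, s: np.ndarray, w: np.ndarray) -> np.ndarray:
    # states falling into the same equivalent subspace share the same abstract
    # component of the policy input, which is what layers the state space
    x = np.concatenate([sub_state_one_hot, s])
    return x @ w                              # w: (k + d, n_actions), logits over the action space

def select_action(logits: np.ndarray, rng=np.random) -> int:
    p = np.exp(logits - logits.max())         # softmax over the logits
    p /= p.sum()
    return int(rng.choice(len(p), p=p))       # sample an action from the policy
```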

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
