Navigation filter parameter optimization method based on reinforcement learning

Document No.: 1567588 | Publication date: 2020-01-24

Note: This technique, "Navigation filter parameter optimization method based on reinforcement learning" (一种基于强化学习的导航滤波器参数优化方法), was created by 熊凯, 郭建新, 石恒 and 魏春岭 on 2019-09-19. Its main content is as follows: The invention relates to a navigation filter parameter optimization method based on reinforcement learning. First, based on an ε-greedy strategy, combinations of different system noise and measurement noise variances are selected according to a state-action value function. Meanwhile, the navigation filter explores the application environment, and a reward is computed from its measurement residual. The state-action value function is then updated with the computed reward by the temporal-difference method; its value reflects how well the selected noise variances match the actual application environment. As the navigation filtering process proceeds, iterative computation allows noise variances that match the actual application environment to be selected with high probability, so that the system noise variance and measurement noise variance of the navigation filter are adjusted adaptively. The proposed method strengthens the ability of the navigation filter to overcome the effects of uncertainty in the system noise and measurement noise variances and improves satellite autonomous navigation accuracy.

1. A navigation filter parameter optimization method based on reinforcement learning, characterized by comprising the following steps:

(1) initializing a reference filter, a search filter and a navigation filter, assigning each filter an initial filtering estimate and a corresponding estimation error variance matrix, and setting the system noise variance Q_ref and measurement noise variance R_ref of the reference filter according to prior knowledge;

(2) initializing a state set S and a corresponding action set A for reinforcement learning based on different combinations of system noise variance and measurement noise variance;

(3) setting initial values of the state-action value function, the state value function and the reward for each element of the state set S and the action set A, and randomly selecting a state s ∈ S as the initial state;

(4) based on the ε-greedy strategy of reinforcement learning, selecting an action a ∈ A according to the state-action value function; correspondingly, the state transfers from s to s′, which determines the combination of system noise variance Q_srch and measurement noise variance R_srch used by the search filter;

(5) using Q_ref and R_ref obtained in step (1), performing a recursive calculation with the reference filter to obtain its filtering estimate, estimation error variance matrix and measurement residual;

(6) using Q_srch and R_srch obtained in step (4), performing a recursive calculation with the search filter to obtain its filtering estimate, estimation error variance matrix and measurement residual;

(7) calculating the reward from the measurement residual of the reference filter obtained in step (5) and the measurement residual of the search filter obtained in step (6);

(8) using the system noise variance and measurement noise variance selected in step (4), performing a recursive calculation with the navigation filter to obtain its filtering estimate and corresponding estimation error variance matrix;

(9) updating the state-action value function by the temporal-difference method of reinforcement learning according to the reward obtained in step (7), and resetting the state value function and the reward;

(10) resetting the filtering estimate and estimation error variance matrix of the search filter with the filtering estimate and estimation error variance matrix of the reference filter obtained in step (5);

(11) repeating steps (4) to (10) to obtain the system noise variance Q_nav and measurement noise variance R_nav used as design parameters of the navigation filter, thereby completing the reinforcement-learning-based navigation filter parameter optimization.

2. The reinforcement-learning-based navigation filter parameter optimization method according to claim 1, wherein in step (1) the reference filter, the search filter and the navigation filter are initialized as

x̂_ref(0) = x̂_srch(0) = x̂(0),  P_ref(0) = P_srch(0) = P(0)

where x̂_ref(0) and P_ref(0) denote the initial filtering estimate of the reference filter and the corresponding estimation error variance matrix, x̂_srch(0) and P_srch(0) denote those of the search filter, and x̂(0) and P(0) denote those of the navigation filter, obtained from prior knowledge of the carrier motion.

3. The reinforcement-learning-based navigation filter parameter optimization method according to claim 1, wherein in step (2) the state set S and the corresponding action set A are initialized as follows: each element of the state set S is a different combination of system noise variance and measurement noise variance, and each element of the action set A is a state-transition action, i.e., a switch from one selected combination of system noise variance and measurement noise variance to another.

4. The reinforcement-learning-based navigation filter parameter optimization method according to claim 1, wherein in step (3) the initial values of the state-action value function, the state value function and the reward are set, for any state s ∈ S and action a ∈ A, as

Q(s,a) ← 0, V(s) ← 0, R ← 0

where Q(s,a) denotes the state-action value function, V(s) denotes the state value function, and R denotes the reward.

5. The reinforcement-learning-based navigation filter parameter optimization method according to claim 1, wherein in step (4) the action is selected according to the state-action value function as

a ← greedy(A, Q(s,a), s, ε)

where the function greedy(A, Q(s,a), s, ε) denotes the ε-greedy policy: with probability ε an action is chosen at random from the action set A, and with probability 1 − ε the action that maximizes the state-action value function Q(s,a) is chosen; ε ∈ (0,1) is a preset random-action selection probability.

6. The reinforcement-learning-based navigation filter parameter optimization method according to claim 1, wherein in step (5) the recursive calculation of the reference filter is

x̂_ref(k|k−1) = F_k x̂_ref(k−1)
P_ref(k|k−1) = F_k P_ref(k−1) F_k^T + Q_ref
K_ref(k) = P_ref(k|k−1) H_k^T [H_k P_ref(k|k−1) H_k^T + R_ref]^(−1)
e_ref(k) = y_k − H_k x̂_ref(k|k−1)
x̂_ref(k) = x̂_ref(k|k−1) + K_ref(k) e_ref(k)
P_ref(k) = [I − K_ref(k) H_k] P_ref(k|k−1)

where x̂_ref(k) and P_ref(k) denote the filtering estimate of the reference filter at time k and the corresponding error variance matrix, x̂_ref(k|k−1) and P_ref(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, y_k denotes the observation, e_ref(k) denotes the measurement residual of the reference filter, K_ref(k) denotes the filter gain matrix of the reference filter, the filter parameters Q_ref and R_ref are obtained in step (1), the state transition matrix F_k and the observation matrix H_k are known quantities given by the pre-established navigation system model, and I denotes the identity matrix.

7. The reinforcement-learning-based navigation filter parameter optimization method according to claim 6, wherein in step (6) the recursive calculation of the search filter is

x̂_srch(k|k−1) = F_k x̂_srch(k−1)
P_srch(k|k−1) = F_k P_srch(k−1) F_k^T + Q_srch
K_srch(k) = P_srch(k|k−1) H_k^T [H_k P_srch(k|k−1) H_k^T + R_srch]^(−1)
e_srch(k) = y_k − H_k x̂_srch(k|k−1)
x̂_srch(k) = x̂_srch(k|k−1) + K_srch(k) e_srch(k)
P_srch(k) = [I − K_srch(k) H_k] P_srch(k|k−1)

where x̂_srch(k) and P_srch(k) denote the filtering estimate of the search filter at time k and the corresponding error variance matrix, x̂_srch(k|k−1) and P_srch(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, e_srch(k) denotes the measurement residual of the search filter, K_srch(k) denotes the filter gain matrix of the search filter, and the filter parameters Q_srch and R_srch are obtained in step (4).

8. The reinforcement-learning-based navigation filter parameter optimization method of claim 7, wherein the reward R is computed from the measurement residual e_ref(k) of the reference filter and the measurement residual e_srch(k) of the search filter, its value reflecting how well the noise variances selected for the search filter match the actual application environment (the explicit expression appears only as a formula image in the original document).

9. The reinforcement-learning-based navigation filter parameter optimization method of claim 7, wherein the recursive calculation of the navigation filter is

x̂(k|k−1) = F_k x̂(k−1)
P(k|k−1) = F_k P(k−1) F_k^T + Q_nav
K(k) = P(k|k−1) H_k^T [H_k P(k|k−1) H_k^T + R_nav]^(−1)
e(k) = y_k − H_k x̂(k|k−1)
x̂(k) = x̂(k|k−1) + K(k) e(k)
P(k) = [I − K(k) H_k] P(k|k−1)

where x̂(k) and P(k) denote the filtering estimate of the navigation filter at time k and the corresponding error variance matrix, x̂(k|k−1) and P(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, e(k) denotes the measurement residual of the navigation filter, K(k) denotes the filter gain matrix of the navigation filter, and Q_nav and R_nav are the system noise variance and measurement noise variance selected in step (4).

10. The reinforcement-learning-based navigation filter parameter optimization method according to claim 8, wherein the state-action value function Q(s,a) is updated by the temporal-difference method as

Q(s,a) ← Q(s,a) + α[R + γV(s′) − Q(s,a)]

where α ∈ (0,1) is a preset learning rate and γ ∈ (0,1) is a preset discount factor, and the state value function V(s) and the reward R are reset as

V(s) ← max_a Q(s,a)

s ← s′

R ← 0

where max_a Q(s,a) denotes the maximum of the state-action value function Q(s,a) over the actions a for the given state s.

11. The reinforcement-learning-based navigation filter parameter optimization method according to claim 1, wherein in step (10) the filtering estimate and estimation error variance matrix of the search filter are reset as

x̂_srch(k) ← x̂_ref(k),  P_srch(k) ← P_ref(k)

Technical Field

The invention relates to a navigation filter parameter optimization method based on reinforcement learning, and belongs to the technical field of satellite autonomous navigation.

Background

The traditional navigation filter designed on the basis of Kalman filtering theory is widely used in the field of satellite autonomous navigation. It is well known that the design of a conventional navigation filter relies on known system noise and measurement noise variances. In practical engineering problems, however, these noise variances are often uncertain. For example, in a constellation satellite autonomous navigation system based on inter-satellite relative measurements, the in-orbit measurement error characteristics of the navigation sensor are affected by factors such as attitude jitter of the observation platform, solar illumination conditions and the space thermal environment; given the limitations of simulating the space environment in a ground laboratory, the measurement error characteristics in the actual application environment may differ from those observed in laboratory tests. A conventional navigation filter designed on the basis of Kalman filtering theory has no adaptive capability with respect to uncertainty in the statistical characteristics of the system noise and measurement noise. If the actual noise variance deviates from its nominal value, filter performance may degrade. A filter design specifically tailored to this situation is therefore needed to improve navigation system performance.

To counter the influence of uncertain noise statistics, past research has proposed various strategies whose main purpose is to enhance the adaptability of the filter; among them, adaptive filters capable of estimating the noise variance online have received wide attention. For systems with uncertainty in the statistical characteristics of the system noise or measurement noise, algorithms such as the adaptive Kalman filter (AKF) have been proposed. The problem with adaptive filters is that, when the noise variance is uncertain, the overall performance of the filter is not guaranteed because the noise variance estimate is coupled with the filter estimation error. For the constellation satellite autonomous navigation system studied here, for example, when the noise variance is estimated simultaneously with the satellite position and velocity, the estimation accuracy of the adaptive Kalman filter is often inferior to that of the conventional Kalman filter.

Disclosure of Invention

The technical problem solved by the invention is as follows: to address the increase in filter estimation error caused by uncertainty in the noise variances of the model, a navigation filter parameter optimization method based on reinforcement learning is provided. Based on the measurement residuals of a reference filter and a search filter, the method can recognize when the measurement error of the navigation sensor has grown, interact with the actual application environment, and learn suitable values of the noise variance matrices of the navigation filter, thereby optimizing the processing of the measurement information from the different navigation sensors and enhancing the ability of the constellation satellite autonomous navigation system to cope with noise variance uncertainty.

The technical solution of the invention is as follows:

A navigation filter parameter optimization method based on reinforcement learning comprises the following steps:

(1) initializing a reference filter, a search filter and a navigation filter, assigning each filter an initial filtering estimate and a corresponding estimation error variance matrix, and setting the system noise variance Q_ref and measurement noise variance R_ref of the reference filter according to prior knowledge;

(2) initializing a state set S and a corresponding action set A for reinforcement learning based on different combinations of system noise variance and measurement noise variance;

(3) setting initial values of the state-action value function, the state value function and the reward for each element of the state set S and the action set A, and randomly selecting a state s ∈ S as the initial state;

(4) based on the ε-greedy strategy of reinforcement learning, selecting an action a ∈ A according to the state-action value function; correspondingly, the state transfers from s to s′, which determines the combination of system noise variance Q_srch and measurement noise variance R_srch used by the search filter;

(5) using Q_ref and R_ref obtained in step (1), performing a recursive calculation with the reference filter to obtain its filtering estimate, estimation error variance matrix and measurement residual;

(6) using Q_srch and R_srch obtained in step (4), performing a recursive calculation with the search filter to obtain its filtering estimate, estimation error variance matrix and measurement residual;

(7) calculating the reward from the measurement residual of the reference filter obtained in step (5) and the measurement residual of the search filter obtained in step (6);

(8) using the system noise variance and measurement noise variance selected in step (4), performing a recursive calculation with the navigation filter to obtain its filtering estimate and corresponding estimation error variance matrix;

(9) updating the state-action value function by the temporal-difference method of reinforcement learning according to the reward obtained in step (7), and resetting the state value function and the reward;

(10) resetting the filtering estimate and estimation error variance matrix of the search filter with the filtering estimate and estimation error variance matrix of the reference filter obtained in step (5);

(11) repeating steps (4) to (10) to obtain the system noise variance Q_nav and measurement noise variance R_nav used as design parameters of the navigation filter, completing the reinforcement-learning-based navigation filter parameter optimization.

Further, in step (1), the reference filter, the search filter and the navigation filter are initialized as

x̂_ref(0) = x̂_srch(0) = x̂(0),  P_ref(0) = P_srch(0) = P(0)

where x̂_ref(0) and P_ref(0) denote the initial filtering estimate of the reference filter and the corresponding estimation error variance matrix, x̂_srch(0) and P_srch(0) denote those of the search filter, and x̂(0) and P(0) denote those of the navigation filter; the latter can be obtained from prior knowledge of the carrier motion.

Further, in step (2), the state set S and the corresponding action set A are initialized as follows: each element of the state set S is a different combination of system noise variance and measurement noise variance, and each element of the action set A is a state-transition action, i.e., a switch from one selected combination of system noise variance and measurement noise variance to another.

Further, in step (3), the initial values of the state-action value function, the state value function and the reward are set, for any state s ∈ S and action a ∈ A, as

Q(s,a) ← 0, V(s) ← 0, R ← 0

where Q(s,a) denotes the state-action value function, V(s) denotes the state value function, and R denotes the reward.

Further, in step (4), the action is selected according to the state-action value function as

a ← greedy(A, Q(s,a), s, ε)

where the function greedy(A, Q(s,a), s, ε) denotes the ε-greedy policy: with probability ε an action is chosen at random from the action set A, and with probability 1 − ε the action that maximizes the state-action value function Q(s,a) is chosen; ε ∈ (0,1) is a preset random-action selection probability.

Further, in step (5), the recursive calculation of the reference filter is

x̂_ref(k|k−1) = F_k x̂_ref(k−1)
P_ref(k|k−1) = F_k P_ref(k−1) F_k^T + Q_ref
K_ref(k) = P_ref(k|k−1) H_k^T [H_k P_ref(k|k−1) H_k^T + R_ref]^(−1)
e_ref(k) = y_k − H_k x̂_ref(k|k−1)
x̂_ref(k) = x̂_ref(k|k−1) + K_ref(k) e_ref(k)
P_ref(k) = [I − K_ref(k) H_k] P_ref(k|k−1)

where x̂_ref(k) and P_ref(k) denote the filtering estimate of the reference filter at time k and the corresponding error variance matrix, x̂_ref(k|k−1) and P_ref(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, y_k denotes the observation, e_ref(k) denotes the measurement residual of the reference filter, K_ref(k) denotes the filter gain matrix of the reference filter, the filter parameters Q_ref and R_ref are obtained in step (1), the state transition matrix F_k and the observation matrix H_k are known quantities given by the pre-established navigation system model, and I denotes the identity matrix.

Further, in step (6), the recursive calculation of the search filter is

x̂_srch(k|k−1) = F_k x̂_srch(k−1)
P_srch(k|k−1) = F_k P_srch(k−1) F_k^T + Q_srch
K_srch(k) = P_srch(k|k−1) H_k^T [H_k P_srch(k|k−1) H_k^T + R_srch]^(−1)
e_srch(k) = y_k − H_k x̂_srch(k|k−1)
x̂_srch(k) = x̂_srch(k|k−1) + K_srch(k) e_srch(k)
P_srch(k) = [I − K_srch(k) H_k] P_srch(k|k−1)

where x̂_srch(k) and P_srch(k) denote the filtering estimate of the search filter at time k and the corresponding error variance matrix, x̂_srch(k|k−1) and P_srch(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, e_srch(k) denotes the measurement residual of the search filter, K_srch(k) denotes the filter gain matrix of the search filter, and the filter parameters Q_srch and R_srch are obtained in step (4).

Further, in step (7), the method for calculating the reward according to the measured residual error is as follows:

Figure BDA0002208177310000051

Further, in step (8), the recursive calculation of the navigation filter is

x̂(k|k−1) = F_k x̂(k−1)
P(k|k−1) = F_k P(k−1) F_k^T + Q_nav
K(k) = P(k|k−1) H_k^T [H_k P(k|k−1) H_k^T + R_nav]^(−1)
e(k) = y_k − H_k x̂(k|k−1)
x̂(k) = x̂(k|k−1) + K(k) e(k)
P(k) = [I − K(k) H_k] P(k|k−1)

where x̂(k) and P(k) denote the filtering estimate of the navigation filter at time k and the corresponding error variance matrix, x̂(k|k−1) and P(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, e(k) denotes the measurement residual of the navigation filter, K(k) denotes the filter gain matrix of the navigation filter, and Q_nav and R_nav are the system noise variance and measurement noise variance selected in step (4).

Further, in step (9), the state-action value function is updated by the temporal-difference method as

Q(s,a) ← Q(s,a) + α[R + γV(s′) − Q(s,a)]

where α ∈ (0,1) is a preset learning rate and γ ∈ (0,1) is a preset discount factor, and the state value function and the reward are reset as

V(s) ← max_a Q(s,a)

s ← s′

R ← 0

where max_a Q(s,a) denotes the maximum of the state-action value function Q(s,a) over the actions a for the given state s.

Further, in step (10), the filtering estimate and estimation error variance matrix of the search filter are reset as

x̂_srch(k) ← x̂_ref(k),  P_srch(k) ← P_ref(k)

Compared with the prior art, the invention has the following beneficial effects: the proposed reinforcement-learning-based navigation filter parameter optimization method is intended for the optimal design of a navigation filter when the system noise variance and measurement noise variance of the model are uncertain. By interacting with the application environment it updates a state-action value function and, according to that function, adaptively adjusts the system noise variance and measurement noise variance of the navigation filter, thereby improving the adaptability of the navigation system to its application environment and the performance of the navigation filter.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a diagram of a set of states for representing different combinations of system noise variance and measurement noise variance;

FIG. 3 is a schematic diagram of a constellation satellite autonomous navigation system;

FIG. 4 is a comparison of simulation accuracy curves for the method of the present invention (RLKF), the conventional Kalman filter (KF) and the adaptive Kalman filter (AKF).

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

When the system noise variance and measurement noise variance used as navigation filter parameters are uncertain, the performance of a navigation filter designed on the basis of the traditional Kalman filtering theory degrades. To address this problem, the advantages of reinforcement learning in exploring unknown environments and making decisions on complex problems are exploited, and a navigation filter parameter optimization method based on reinforcement learning is proposed. First, based on an ε-greedy strategy, combinations of different system noise and measurement noise variances are selected according to a state-action value function. Meanwhile, the navigation filter explores the application environment, and a reward is computed from its measurement residual. The state-action value function is then updated with the computed reward by the temporal-difference method; its value reflects how well the selected noise variances match the actual application environment. As the navigation filtering process proceeds, iterative computation allows noise variances that match the actual application environment to be selected with high probability, so that the system noise variance and measurement noise variance of the navigation filter are adjusted adaptively. The method strengthens the ability of the navigation filter to overcome the effects of uncertainty in the system noise and measurement noise variances and improves satellite autonomous navigation accuracy.

As shown in fig. 1, the present invention provides a navigation filter parameter optimization method based on reinforcement learning, which comprises the following steps:

(1) Initializing a reference filter, a search filter and a navigation filter, assigning each filter an initial filtering estimate and a corresponding estimation error variance matrix, and setting the system noise variance Q_ref and measurement noise variance R_ref of the reference filter according to prior knowledge. The three filters are initialized as

x̂_ref(0) = x̂_srch(0) = x̂(0),  P_ref(0) = P_srch(0) = P(0)

where x̂_ref(0) and P_ref(0) denote the initial filtering estimate of the reference filter and the corresponding estimation error variance matrix, x̂_srch(0) and P_srch(0) denote those of the search filter, and x̂(0) and P(0) denote those of the navigation filter; the latter can be obtained from prior knowledge of the carrier motion.

(2) Initializing a state set S and a corresponding action set A for reinforcement learning based on different combinations of system noise variance and measurement noise variance. Each element of the state set S is a different combination of system noise variance and measurement noise variance (in practice covering the system noise variances and measurement noise variances of the three filters, i.e., the reference filter, the search filter and the navigation filter), and each element of the action set A is a state-transition action, i.e., a switch from one selected combination of system noise variance and measurement noise variance to another. A schematic of the state set representing different combinations of system noise variance and measurement noise variance is shown in FIG. 2: each grid cell is one element of the state set, the grid contains M × N elements, the values of the system noise variance and the measurement noise variance differ from cell to cell, and the arrows in the grid indicate the directions of state transition, i.e., the selectable actions.
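As an illustration only, the following sketch builds such an M × N grid of candidate (system noise variance, measurement noise variance) combinations together with a simple four-neighbour action set; the scaling factors, grid layout and variable names are assumptions made for this sketch, not values taken from the patent.

```python
import numpy as np

# Hypothetical M x N grid of candidate noise-variance combinations (cf. FIG. 2).
# State (i, j) selects Q = q_scales[i] * Q0 and R = r_scales[j] * R0, where Q0 and R0
# are nominal system and measurement noise variance matrices chosen by the designer.
M, N = 9, 9
q_scales = np.logspace(-2, 2, M)   # assumed spread of system-noise scalings
r_scales = np.logspace(-2, 2, N)   # assumed spread of measurement-noise scalings

states = [(i, j) for i in range(M) for j in range(N)]
# An action switches to a neighbouring cell, i.e. to another noise-variance combination.
actions = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def step_state(state, action):
    """Apply an action to a state, clipping at the grid boundary."""
    i, j = state
    di, dj = actions[action]
    return (min(max(i + di, 0), M - 1), min(max(j + dj, 0), N - 1))

# Step (3): state-action values, state values and the reward start at zero.
Q_table = {(s, a): 0.0 for s in states for a in actions}
V_table = {s: 0.0 for s in states}
```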

(3) Setting initial values of the state-action value function, the state value function and the reward for each element of the state set S and the action set A, and randomly selecting a state s ∈ S as the initial state. For any state s ∈ S and action a ∈ A, set

Q(s,a) ← 0, V(s) ← 0, R ← 0

where Q(s,a) denotes the state-action value function, V(s) denotes the state value function, and R denotes the reward.

(4) Based on the ε-greedy strategy of reinforcement learning, selecting an action a ∈ A according to the state-action value function; correspondingly, the state transfers from s to s′, which determines the combination of system noise variance Q_srch and measurement noise variance R_srch used by the search filter. The action is selected according to the state-action value function as

a ← greedy(A, Q(s,a), s, ε)

where the function greedy(A, Q(s,a), s, ε) denotes the ε-greedy policy: with probability ε an action is chosen at random from the action set A, and with probability 1 − ε the action that maximizes the state-action value function Q(s,a) is chosen; ε ∈ (0,1) is a preset random-action selection probability.
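A minimal sketch of this ε-greedy selection, reusing the hypothetical grid, action set and Q_table from the sketch given for step (2):

```python
import random

def greedy(action_set, Q_table, s, eps):
    """epsilon-greedy policy: with probability eps pick a random action,
    otherwise pick the action maximising the state-action value Q(s, a)."""
    if random.random() < eps:
        return random.choice(list(action_set))
    return max(action_set, key=lambda a: Q_table[(s, a)])

# Example usage (Q0 and R0 are assumed nominal noise variance matrices):
# a = greedy(actions, Q_table, s, eps=0.1)
# s_next = step_state(s, a)                  # the state transfers from s to s'
# Q_srch = q_scales[s_next[0]] * Q0
# R_srch = r_scales[s_next[1]] * R0
```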

(5) Using Q_ref and R_ref obtained in step (1), performing a recursive calculation with the reference filter to obtain its filtering estimate, estimation error variance matrix and measurement residual. The recursive calculation of the reference filter is

x̂_ref(k|k−1) = F_k x̂_ref(k−1)
P_ref(k|k−1) = F_k P_ref(k−1) F_k^T + Q_ref
K_ref(k) = P_ref(k|k−1) H_k^T [H_k P_ref(k|k−1) H_k^T + R_ref]^(−1)
e_ref(k) = y_k − H_k x̂_ref(k|k−1)
x̂_ref(k) = x̂_ref(k|k−1) + K_ref(k) e_ref(k)
P_ref(k) = [I − K_ref(k) H_k] P_ref(k|k−1)

where x̂_ref(k) and P_ref(k) denote the filtering estimate of the reference filter at time k and the corresponding error variance matrix, x̂_ref(k|k−1) and P_ref(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, y_k denotes the observation, e_ref(k) denotes the measurement residual of the reference filter, K_ref(k) denotes the filter gain matrix of the reference filter, the filter parameters Q_ref and R_ref are obtained in step (1), the state transition matrix F_k and the observation matrix H_k are known quantities given by the pre-established navigation system model, and I denotes the identity matrix.
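The recursion above is the standard Kalman filter step. A compact sketch, assuming a linear model x_k = F_k x_{k−1} + w_k, y_k = H_k x_k + v_k (the function name and interface are illustrative, not taken from the patent):

```python
import numpy as np

def kf_step(x_est, P_est, y, F, H, Q, R):
    """One Kalman filter recursion: predict, form the gain, update.
    Returns the updated estimate, its error variance matrix and the measurement residual."""
    # prediction
    x_pred = F @ x_est
    P_pred = F @ P_est @ F.T + Q
    # gain and measurement residual
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    resid = y - H @ x_pred
    # update
    x_new = x_pred + K @ resid
    P_new = (np.eye(len(x_est)) - K @ H) @ P_pred
    return x_new, P_new, resid

# The reference, search and navigation filters run this same recursion and differ only
# in the (Q, R) pair passed in: (Q_ref, R_ref), (Q_srch, R_srch) or (Q_nav, R_nav).
```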

(6) Using Q_srch and R_srch obtained in step (4), performing a recursive calculation with the search filter to obtain its filtering estimate, estimation error variance matrix and measurement residual. The recursive calculation of the search filter is

x̂_srch(k|k−1) = F_k x̂_srch(k−1)
P_srch(k|k−1) = F_k P_srch(k−1) F_k^T + Q_srch
K_srch(k) = P_srch(k|k−1) H_k^T [H_k P_srch(k|k−1) H_k^T + R_srch]^(−1)
e_srch(k) = y_k − H_k x̂_srch(k|k−1)
x̂_srch(k) = x̂_srch(k|k−1) + K_srch(k) e_srch(k)
P_srch(k) = [I − K_srch(k) H_k] P_srch(k|k−1)

where x̂_srch(k) and P_srch(k) denote the filtering estimate of the search filter at time k and the corresponding error variance matrix, x̂_srch(k|k−1) and P_srch(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, e_srch(k) denotes the measurement residual of the search filter, K_srch(k) denotes the filter gain matrix of the search filter, and the filter parameters Q_srch and R_srch are obtained in step (4).

(7) Calculating the reward from the measurement residual of the reference filter obtained in step (5) and the measurement residual of the search filter obtained in step (6). The reward R is computed from e_ref(k) and e_srch(k); its value reflects how well the noise variances selected for the search filter match the actual application environment (the explicit expression appears only as a formula image in the original document).
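Since the original expression is not reproduced, the following is only a plausible sketch of a residual-based reward, assuming that a search-filter residual smaller than the reference-filter residual should yield a larger reward; the exact form used in the patent may differ.

```python
import numpy as np

def reward_from_residuals(resid_ref, resid_srch, eps=1e-12):
    """Hypothetical reward: positive when the search filter's measurement residual
    is smaller (in norm) than the reference filter's, negative otherwise."""
    n_ref = float(np.linalg.norm(resid_ref))
    n_srch = float(np.linalg.norm(resid_srch))
    return (n_ref - n_srch) / (n_ref + n_srch + eps)
```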

(8) Using the system noise variance Q_nav and measurement noise variance R_nav determined by the selection in step (4), performing a recursive calculation with the navigation filter to obtain its filtering estimate and corresponding estimation error variance matrix. The recursive calculation of the navigation filter is

x̂(k|k−1) = F_k x̂(k−1)
P(k|k−1) = F_k P(k−1) F_k^T + Q_nav
K(k) = P(k|k−1) H_k^T [H_k P(k|k−1) H_k^T + R_nav]^(−1)
e(k) = y_k − H_k x̂(k|k−1)
x̂(k) = x̂(k|k−1) + K(k) e(k)
P(k) = [I − K(k) H_k] P(k|k−1)

where x̂(k) and P(k) denote the filtering estimate of the navigation filter at time k and the corresponding error variance matrix, x̂(k|k−1) and P(k|k−1) denote the filtering prediction at time k and the corresponding prediction error variance matrix, e(k) denotes the measurement residual of the navigation filter, and K(k) denotes the filter gain matrix of the navigation filter.

(9) Updating the state-action value function by the temporal-difference method of reinforcement learning according to the reward obtained in step (7), and resetting the state value function and the reward. The state-action value function is updated as

Q(s,a) ← Q(s,a) + α[R + γV(s′) − Q(s,a)]

where α ∈ (0,1) is a preset learning rate and γ ∈ (0,1) is a preset discount factor, and the state value function and the reward are reset as

V(s) ← max_a Q(s,a)

s ← s′

R ← 0

where max_a Q(s,a) denotes the maximum of the state-action value function Q(s,a) over the actions a for the given state s.
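A direct transcription of the update and reset rules above, reusing the table layout assumed in the earlier sketches:

```python
def td_update(Q_table, V_table, s, a, s_next, R, alpha=0.1, gamma=0.9):
    """Temporal-difference update of Q(s, a), followed by the resets of step (9).
    Returns the new current state and the reset reward."""
    Q_table[(s, a)] += alpha * (R + gamma * V_table[s_next] - Q_table[(s, a)])
    V_table[s] = max(Q_table[(s, b)] for b in actions)   # V(s) <- max_a Q(s, a)
    return s_next, 0.0                                    # s <- s', R <- 0
```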

(10) Resetting the filtering estimate and estimation error variance matrix of the search filter with the filtering estimate and estimation error variance matrix of the reference filter obtained in step (5):

x̂_srch(k) ← x̂_ref(k),  P_srch(k) ← P_ref(k)

(11) Repeating steps (4) to (10) to obtain the system noise variance Q_nav and measurement noise variance R_nav used as design parameters of the navigation filter, thereby completing the reinforcement-learning-based navigation filter parameter optimization. In practice, steps (4) to (10) are repeated in order to determine the state-action value function; once it has been determined, the system noise variance and measurement noise variance selected in step (4) are adopted as Q_nav and R_nav.
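Putting the steps together, a schematic main loop under the assumptions of the previous sketches (the model matrices F and H, the nominal variances Q0 and R0, the initial estimate x0 and P0, and the measurement sequence are placeholders supplied by the navigation system model):

```python
import random

def rlkf_run(measurements, F, H, Q0, R0, x0, P0, eps=0.1, alpha=0.1, gamma=0.9):
    """Schematic loop over steps (4)-(10); reuses states, actions, q_scales, r_scales,
    Q_table, V_table, greedy, step_state, kf_step, td_update and reward_from_residuals
    from the earlier sketches. Illustration only."""
    x_ref, P_ref = x0.copy(), P0.copy()      # reference filter, step (1)
    x_srch, P_srch = x0.copy(), P0.copy()    # search filter
    x_nav, P_nav = x0.copy(), P0.copy()      # navigation filter
    s = random.choice(states)                # step (3): random initial state
    for y in measurements:
        a = greedy(actions, Q_table, s, eps)                                     # step (4)
        s_next = step_state(s, a)
        Q_sel = q_scales[s_next[0]] * Q0
        R_sel = r_scales[s_next[1]] * R0
        x_ref, P_ref, e_ref = kf_step(x_ref, P_ref, y, F, H, Q0, R0)             # step (5)
        x_srch, P_srch, e_srch = kf_step(x_srch, P_srch, y, F, H, Q_sel, R_sel)  # step (6)
        rwd = reward_from_residuals(e_ref, e_srch)                               # step (7)
        x_nav, P_nav, _ = kf_step(x_nav, P_nav, y, F, H, Q_sel, R_sel)           # step (8)
        s, _ = td_update(Q_table, V_table, s, a, s_next, rwd, alpha, gamma)      # step (9)
        x_srch, P_srch = x_ref.copy(), P_ref.copy()                              # step (10)
    return x_nav, P_nav, s
```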

the invention provides the following examples:

the effectiveness of the method is verified by a simulation example by taking autonomous navigation of 3 constellation satellites flying on the earth orbit as an example. The schematic diagram of the constellation satellite autonomous navigation system is shown in fig. 3, relative position vectors between constellation satellites are obtained through measurement of a navigation sensor, measurement information of the navigation sensor is processed through a navigation filter, and positions and speeds of three satellites in a constellation are estimated.

The orbital semi-major axis of the 3 constellation satellites is 27900 km, the orbital inclination is 55°, and the right ascensions of the ascending nodes are 0°, 120° and 240°, respectively. The nominal measurement accuracy of the navigation sensor measuring the inter-satellite relative positions is assumed to be 50 m, with a data update rate of 0.1 Hz. When the navigation sensor signal is disturbed, the measurement error variance increases. In the conventional Kalman filter, the system noise variance and the measurement noise variance are set according to their nominal values (the specific matrices are given only as formula images in the original document).

in the reinforcement learning method, the learning rate α is selected to be 0.1, the discount factor γ is selected to be 0.9, the random action selection probability e is selected to be 0.1, the state set grid size M is 9, and N is 9. The simulation time is 2 days, and the satellite orbit prediction period is 1 s.

Compared with the traditional Kalman filtering (KF) algorithm, the main advantage of the proposed method is its adaptability to different application environments, which helps to mitigate the impact of uncertain noise statistics on filter performance; compared with algorithms such as adaptive Kalman filtering (AKF), its main advantage is that it avoids the error-coupling problem caused by the traditional AKF estimating the noise variance and the satellite position simultaneously. A simulation study of the constellation satellite autonomous navigation system was carried out, and the root-mean-square position estimation error of the navigation filter was evaluated for different measurement noise variances. FIG. 4 compares the simulation accuracy of the proposed method (RLKF) with that of the traditional Kalman filter (KF) and the adaptive Kalman filter (AKF).

In FIG. 4, the abscissa is the amplification factor of the measurement noise standard deviation relative to its nominal value (the larger the factor, the greater the discrepancy between the actual system and the filtering model), and the ordinate is the root-mean-square estimation error of the position of the 1st satellite in the constellation, in meters. The labels KF, AKF and RLKF denote the Kalman filter, the adaptive Kalman filter and the proposed reinforcement-learning Kalman filter, respectively.

As is apparent from FIG. 4, constellation satellite autonomous navigation with the proposed method achieves better positioning accuracy than the conventional KF and AKF, and the advantage becomes more pronounced as the difference between the nominal and actual measurement noise variances grows.

Compared with the traditional methods, the constellation satellite autonomous navigation accuracy obtained with the proposed method is therefore considerably improved, which demonstrates that the reinforcement-learning-based navigation filter parameter optimization method provided by the invention is effective.

The main technical content of the invention can be used in the scheme design of constellation satellite autonomous navigation systems, supporting autonomous navigation for China's new-generation BeiDou satellite navigation system, and can also be extended to the autonomous navigation of other Earth-orbiting spacecraft and deep-space probes, giving it broad application prospects.

Those skilled in the art will appreciate that the invention may be practiced without these specific details.
