Machine learning device, control device, and machine learning method


Abstract: The invention provides a machine learning device, a control device, and a machine learning method that shorten the convergence time of the machine learning. A machine learning device (200) performs machine learning on a servo control device (100) that controls a servo motor (300) using feedforward control performed by feedforward calculation units (109, 110) having IIR filters (1092, 1102), the machine learning relating to optimization of the coefficients of the transfer functions of the IIR filters, the servo motor driving an axis of a machine tool, a robot, or an industrial machine. A zero point, at which the transfer function of an IIR filter becomes zero, and a pole, at which the transfer function diverges to infinity, are each expressed in polar coordinates defined by a radius r and an angle θ, and the radius r and the angle θ are each searched for and learned within a predetermined search range, whereby the coefficients of the transfer function of the IIR filter are optimized. (Created 2019-07-11 by Ryoutarou Tsuneki, Satoshi Ikai, and Naoto Sonoda.)

1. A machine learning device for performing machine learning on a servo control device that controls a servo motor, which drives an axis of a machine tool, a robot, or an industrial machine, using feedforward control performed by a feedforward calculation unit having an IIR filter, the machine learning relating to optimization of coefficients of a transfer function of the IIR filter, wherein

the coefficients of the transfer function of the IIR filter are optimized by expressing, each in polar coordinates defined by a radius r and an angle θ, a zero point at which the transfer function of the IIR filter becomes zero and a pole at which the transfer function diverges to infinity, and by searching for and learning the radius r and the angle θ within a predetermined search range.

2. The machine learning apparatus of claim 1,

the search range of the radius r is defined according to an attenuation rate, and the search range of the angle θ is defined according to a frequency at which vibration is to be suppressed.

3. The machine learning apparatus of claim 1 or 2,

the machine learning device searches for the zero point before searching for the pole.

4. The machine learning apparatus according to any one of claims 1 to 3,

the machine learning device fixes the pole when searching for the zero point.

5. The machine learning apparatus according to any one of claims 1 to 4,

the machine learning device searches for the angle θ before searching for the radius r.

6. The machine learning apparatus according to any one of claims 1 to 5,

the machine learning device fixes the radius r to a fixed value when searching for the angle θ.

7. The machine learning apparatus according to any one of claims 1 to 6,

the zero point is represented by a complex number and its complex conjugate.

8. The machine learning apparatus according to any one of claims 1 to 7,

the feedforward calculation section is a velocity feedforward calculation section or a position feedforward calculation section.

9. The machine learning apparatus according to any one of claims 1 to 8,

the feedforward calculation section is a velocity feedforward calculation section,

the servo control device further includes a position feedforward calculation unit having an IIR filter,

the machine learning device optimizes the transfer function of the IIR filter of the speed feedforward calculation unit before optimizing the transfer function of the IIR filter of the position feedforward calculation unit.

10. The machine learning apparatus according to any one of claims 1 to 9,

the machine learning device includes:

a state information acquisition unit that acquires state information from the servo control device by causing the servo control device to execute a predetermined machining program, the state information including a servo state containing at least a positional deviation and the coefficients of the transfer function of the feedforward calculation unit;

a behavior information output unit that outputs behavior information to the servo control device, the behavior information including adjustment information for the coefficients of the transfer function included in the state information;

a reward output unit that outputs a reward value in reinforcement learning based on the positional deviation included in the state information; and

a value function updating unit that updates an action-value function based on the reward value output by the reward output unit, the state information, and the behavior information.

11. The machine learning apparatus of claim 10,

the reward output unit outputs the reward value based on an absolute value of the positional deviation.

12. The machine learning apparatus of claim 10 or 11,

the machine learning device includes: and an optimization behavior information output unit that generates and outputs correction information of the coefficient of the transfer function of the feedforward calculation unit, based on the cost function updated by the cost function update unit.

13. A control device is characterized by comprising:

the machine learning device of any one of claims 1-12; and

a servo control device that controls a servo motor, which drives an axis of a machine tool, a robot, or an industrial machine, using feedforward control performed by a feedforward calculation unit having an IIR filter.

14. The control device according to claim 13,

the machine learning device is included in the servo control device.

15. A machine learning method for a machine learning device that performs machine learning on a servo control device that controls a servo motor, which drives an axis of a machine tool, a robot, or an industrial machine, using feedforward control performed by a feedforward calculation unit having an IIR filter, the machine learning relating to optimization of coefficients of a transfer function of the IIR filter, wherein

the coefficients of the transfer function of the IIR filter are optimized by expressing, each in polar coordinates defined by a radius r and an angle θ, a zero point at which the transfer function of the IIR filter becomes zero and a pole at which the transfer function diverges to infinity, and by searching for and learning the radius r and the angle θ within a predetermined search range.

Technical Field

The present invention relates to a machine learning device that performs machine learning on a servo control device that controls a servo motor using feedforward control performed by a feedforward calculation unit having an IIR (Infinite Impulse Response) filter, the machine learning relating to optimization of the coefficients of the transfer function of the IIR filter, the servo motor driving an axis of a machine tool, a robot, or an industrial machine; to a control device including the machine learning device; and to a machine learning method.

Background

For example, patent document 1 describes a servo control device using feedforward control by a feedforward calculator having an IIR filter.

Patent document 1 relates to a control device for a servo motor and includes the following description: a velocity feedforward means (corresponding to the position feedforward calculation unit in the embodiment described later) is configured from a velocity feedforward arithmetic means (corresponding to the differentiator in the embodiment described later) and a velocity feedforward filter, and an IIR filter can be used as the velocity feedforward filter (paragraph 0080 and elsewhere).

Disclosure of Invention

An object of the present invention is to provide a machine learning device, a control device including the machine learning device, and a machine learning method, which can shorten a convergence time of machine learning for optimizing a coefficient of a transfer function of an IIR filter in a servo control device that controls a servo motor by feedforward control using a feedforward calculation unit having the IIR filter.

(1) The present invention relates to a machine learning device (for example, a machine learning device 200 described later) that performs machine learning on a servo control device (for example, a servo control device 100 described later) that controls a servo motor (for example, a servo motor 300 described later), which drives an axis of a machine tool, a robot, or an industrial machine, using feedforward control performed by a feedforward calculation unit (for example, a velocity feedforward calculation unit 109 or a position feedforward calculation unit 110 described later) having an IIR filter (for example, an IIR filter 1092 or 1102 described later), wherein

the coefficients of the transfer function of the IIR filter are optimized by expressing, each in polar coordinates defined by a radius r and an angle θ, a zero point at which the transfer function of the IIR filter becomes zero and a pole at which the transfer function diverges to infinity, and by searching for and learning the radius r and the angle θ within a predetermined search range.

(2) In the machine learning device according to the above (1), the search range of the radius r may be defined according to an attenuation rate, and the search range of the angle θ may be defined according to a frequency at which vibration is to be suppressed.

(3) In the machine learning device according to the above (1) or (2), the machine learning device may perform the zero point search before performing the pole search.

(4) In the machine learning device according to any one of the above (1) to (3), the machine learning device may fix the pole when searching for the zero point.

(5) In the machine learning device according to any one of the above (1) to (4), the machine learning device may perform the search for the angle θ before performing the search for the radius r.

(6) In the machine learning device according to any one of the above (1) to (5), the machine learning device may fix the radius r to a fixed value when searching for the angle θ.

(7) In the machine learning device according to any one of the above (1) to (6), the zero point may be expressed by a complex number and its complex conjugate.

(8) In the machine learning device according to any one of the above (1) to (7), the feedforward calculation unit may be a speed feedforward calculation unit or a position feedforward calculation unit.

(9) In the machine learning device according to any one of the above (1) to (8), it may be that:

the feedforward calculation section is a velocity feedforward calculation section,

the servo control device further includes a position feedforward calculation unit having an IIR filter,

the machine learning device optimizes the transfer function of the IIR filter of the speed feedforward calculation unit before optimizing the transfer function of the IIR filter of the position feedforward calculation unit.

(10) In the machine learning device according to any one of the above (1) to (9), it may be that:

the machine learning device includes:

a state information acquisition unit (for example, a state information acquisition unit 201 described later) that acquires state information from the servo control device by causing the servo control device to execute a predetermined machining program, the state information including a servo state containing at least a positional deviation and the coefficients of the transfer function of the feedforward calculation unit;

a behavior information output unit (for example, a behavior information output unit 203 described later) that outputs behavior information including adjustment information for the coefficients of the transfer function included in the state information to the servo control device;

a reward output unit (for example, a reward output unit 2021 described later) that outputs a reward value in reinforcement learning based on the positional deviation included in the state information; and

a value function update unit (for example, a value function update unit 2022 described later) that updates an action-value function based on the reward value output by the reward output unit, the state information, and the behavior information.

(11) In the machine learning device according to the above (10), the reward output unit may output the reward value based on an absolute value of the positional deviation.

(12) The machine learning device according to the above (10) or (11) may further include an optimization behavior information output unit (for example, an optimization behavior information output unit 205 described later) that generates and outputs correction information for the coefficients of the transfer function of the feedforward calculation unit, based on the value function updated by the value function update unit.

(13) The present invention relates to a control device, comprising: the machine learning device according to any one of (1) to (12) above; and

a servo control device that controls a servo motor, which drives an axis of a machine tool, a robot, or an industrial machine, using feedforward control performed by a feedforward calculation unit having an IIR filter.

(14) In the control device of the above (13), the machine learning device may be included in the servo control device.

(15) The present invention relates to a machine learning method for a machine learning device (for example, a machine learning device 200 described later) that performs machine learning on a servo control device (for example, a servo control device 100 described later) that controls a servo motor (for example, a servo motor 300 described later), which drives an axis of a machine tool, a robot, or an industrial machine, using feedforward control performed by a feedforward calculation unit (for example, a velocity feedforward calculation unit 109 or a position feedforward calculation unit 110 described later) having an IIR filter (for example, an IIR filter 1092 or 1102 described later), wherein

the coefficients of the transfer function of the IIR filter are optimized by expressing, each in polar coordinates defined by a radius r and an angle θ, a zero point at which the transfer function of the IIR filter becomes zero and a pole at which the transfer function diverges to infinity, and by searching for and learning the radius r and the angle θ within a predetermined search range.

Effects of the invention

According to the present invention, the convergence time of machine learning for optimizing the coefficients of the transfer function of an IIR filter can be shortened.

Drawings

Fig. 1 is a block diagram showing a configuration example of a control device according to an embodiment of the present invention.

Fig. 2 is a block diagram showing a part of a machine tool including a servo motor, which is an example of a control target of the servo control device.

Fig. 3 is a diagram for explaining the operation of the servo motor when the machining shape is an octagon.

Fig. 4 is a diagram for explaining the operation of the motor when the machining shape is a shape in which every other corner of the octagon is replaced with a circular arc.

Fig. 5 is an explanatory diagram of a complex plane showing search ranges of poles and zeros.

Fig. 6 is a block diagram showing a machine learning device according to the present embodiment.

Fig. 7 is a flowchart illustrating an operation of the machine learning device according to the present embodiment.

Fig. 8 is a flowchart for explaining the operation of the optimization behavior information output unit of the machine learning device according to the present embodiment.

Fig. 9 is a block diagram showing another configuration of the control device.

Description of the symbols

10, 10A control device

100 servo control device

101 position command generating unit

102 subtracter

103 position control unit

104 adder

105 subtracter

106 speed control unit

107 adder

108 integrator

109 velocity feedforward calculation unit

110 position feedforward calculation unit

200 machine learning device

201 state information acquisition unit

202 learning unit

203 behavior information output unit

204 value function storage unit

205 optimization behavior information output unit

300 servo motor

400 control object

500 network

Detailed Description

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

Fig. 1 is a block diagram showing a configuration example of a control device according to an embodiment of the present invention. The control device 10 shown in fig. 1 includes a servo control device 100 and a machine learning device 200.

The servo motor 300 is a control target of the servo control device 100, and is included in, for example, a machine tool, a robot, an industrial machine, or the like. The servo control device 100 may be provided as a part of a machine tool, a robot, an industrial machine, or the like together with the servo motor 300.

First, the servo control device 100 will be explained.

The servo control device 100 includes: a position command generating unit 101, a subtractor 102, a position control unit 103, an adder 104, a subtractor 105, a speed control unit 106, an adder 107, an integrator 108, a velocity feedforward calculation unit 109, and a position feedforward calculation unit 110. The velocity feedforward calculation unit 109 includes a second-order differentiator 1091 and an IIR filter 1092. The position feedforward calculation unit 110 includes a differentiator 1101 and an IIR filter 1102.

The position command generating unit 101 generates a position command value, and outputs the generated position command value to the subtractor 102, the speed feedforward calculating unit 109, the position feedforward calculating unit 110, and the machine learning device 200. The subtractor 102 obtains a difference between the position command value and the detected position of the position feedback, and outputs the difference as a position deviation to the position control unit 103 and the machine learning device 200.

The position command generating unit 101 generates the position command value based on a program for operating the servo motor 300. The servo motor 300 is included, for example, in a machine tool. In a machine tool in which a table carrying a machining target (workpiece) moves in the X-axis and Y-axis directions, the servo control device 100 and servo motor 300 shown in fig. 1 are provided for each of the X-axis direction and the Y-axis direction. When the table is moved along three or more axes, the servo control device 100 and the servo motor 300 are provided for each axis direction.

The position command generating unit 101 sets the feed rate and generates the position command value so as to realize the machining shape specified by the machining program.

The position control unit 103 outputs a value obtained by multiplying the position gain Kp by the position deviation to the adder 104 as a speed command value.

The adder 104 adds the output value (position feedforward term) of the position feedforward calculation unit 110 to the speed command value, and outputs the sum to the subtractor 105 as the feedforward-controlled speed command value. The subtractor 105 obtains the difference between the output of the adder 104 and the fed-back speed detection value, and outputs the difference to the speed control unit 106 as a speed deviation.

The speed control unit 106 adds a value obtained by multiplying the integral of the speed deviation by the integral gain K1v to a value obtained by multiplying the speed deviation by the proportional gain K2v, and outputs the result to the adder 107 as a torque command value.

The adder 107 adds the torque command value to the output value (velocity feedforward term) of the velocity feedforward calculator 109, outputs the resultant to the servo motor 300 as a torque command value for feedforward control, and drives the servo motor 300.

The rotational angle position of the servo motor 300 is detected by a rotary encoder, serving as a position detection unit, associated with the servo motor 300, and a speed detection value is input to the subtractor 105 as speed feedback. The speed detection value is integrated by the integrator 108 into a position detection value, and the position detection value is input to the subtractor 102 as position feedback.

The second-order differentiator 1091 of the velocity feedforward calculation unit 109 performs second-order differentiation of the position command value and multiplies the result by a constant α. The IIR filter 1092 applies, to the output of the second-order differentiator 1091, the IIR filtering represented by the transfer function VFF(z) of Equation 1 below (hereinafter, Equation 1), and outputs the result to the adder 107 as the velocity feedforward term. The coefficients a1, a2, b0 to b2 of Equation 1 are the coefficients of the transfer function of the IIR filter 1092. The denominator and the numerator of the transfer function VFF(z) are both of second order here, but they are not limited to second order and may be of third or higher order.

[ Mathematical formula 1 ]

$$\mathrm{VFF}(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}$$

The differentiator 1101 of the position feedforward calculation unit 110 differentiates the position command value and multiplies the result by a constant β. The IIR filter 1102 applies, to the output of the differentiator 1101, the IIR filtering represented by the transfer function PFF(z) of Equation 2 below (hereinafter, Equation 2), and outputs the result to the adder 104 as the position feedforward term. The coefficients c1, c2, d0 to d2 of Equation 2 are the coefficients of the transfer function of the IIR filter 1102. The denominator and the numerator of the transfer function PFF(z) are both of second order here, but they are not limited to second order and may be of third or higher order.

[ Mathematical formula 2 ]

$$\mathrm{PFF}(z) = \frac{d_0 + d_1 z^{-1} + d_2 z^{-2}}{1 + c_1 z^{-1} + c_2 z^{-2}}$$
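To make the filtering of Equations 1 and 2 concrete, the following sketch applies a second-order IIR filter through its direct-form difference equation. It is an illustrative sketch only: the function name and the direct-form structure are assumptions, not part of the described device.

```python
import numpy as np

def iir_second_order(x, b, a):
    """Apply H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2)
    to the sample sequence x via the difference equation
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    b0, b1, b2 = b
    a1, a2 = a
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = b0 * x[n]
        if n >= 1:
            y[n] += b1 * x[n - 1] - a1 * y[n - 1]
        if n >= 2:
            y[n] += b2 * x[n - 2] - a2 * y[n - 2]
    return y
```

The velocity feedforward term of Equation 1 then corresponds to applying such a filter to the twice-differentiated position command multiplied by α, and the position feedforward term of Equation 2 to the once-differentiated position command multiplied by β.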

As described above, the servo control device 100 is configured.

Next, a control object 400 including the servo motor 300 controlled by the servo control device 100 will be described.

Fig. 2 is a block diagram showing a part of a machine tool including a servo motor 300, which is an example of a control target 400 of the servo control device 100.

The servo control device 100 moves the table 303 via the coupling mechanism 302 by means of the servo motor 300, thereby machining the machining target (workpiece) mounted on the table 303. The coupling mechanism 302 includes a coupling 3021 connected to the servo motor 300 and a ball screw 3023 fixed to the coupling 3021, and a nut 3022 is screwed onto the ball screw 3023. When the servo motor 300 is rotationally driven, the nut 3022 screwed onto the ball screw 3023 moves in the axial direction of the ball screw 3023, and this movement of the nut 3022 moves the table 303.

The rotational angle position of the servo motor 300 is detected by a rotary encoder 301 as a position detecting unit associated with the servo motor 300. As described above, the detected signal is used as velocity feedback. The detected signal is integrated by the integrator 108 and used as position feedback. Further, an output of the linear scale 304 attached to an end portion of the ball screw 3023 and detecting a moving distance of the ball screw 3023 may be used as the position feedback. Further, an acceleration sensor may also be used to generate position feedback.

< machine learning device 200>

The machine learning device 200 executes a preset machining program (hereinafter, also referred to as a "machining program at the time of learning") to learn the coefficient of the transfer function of the IIR filter 1092 of the speed feedforward calculator 109 and the coefficient of the transfer function of the IIR filter 1102 of the position feedforward calculator 110.

Here, the machining shape specified by the machining program at the time of learning is, for example, an octagon, a shape in which every other corner of the octagon is replaced with a circular arc, or the like. The machining shape specified by the machining program at the time of learning is not limited to these machining shapes, and may be another machining shape.

Fig. 3 is a diagram for explaining the operation of the motor when the machining shape is an octagon. Fig. 4 is a diagram for explaining the operation of the motor when the machining shape is a shape in which every other corner of the octagon is replaced with a circular arc. In fig. 3 and 4, the table is moved in the X-axis and Y-axis directions to process the object (workpiece) clockwise.

As shown in fig. 3, when the machining shape is octagonal, at the corner position A1 the motor speed for moving the table in the Y-axis direction is slow, and the motor speed for moving the table in the X-axis direction is fast.

At the corner position A2, the rotation direction of the motor that moves the table in the Y-axis direction is reversed, and the table moves so as to be linearly reversed in the Y-axis direction. Further, the motor that moves the table in the X-axis direction rotates at a constant speed in the same rotational direction from the position A1 toward the position A2 and from the position A2 toward the position A3.

At the corner position A3, the motor speed for moving the table in the Y-axis direction is increased, and the motor speed for moving the table in the X-axis direction is decreased.

At the corner position A4, the rotation direction of the motor that moves the table in the X-axis direction is reversed, and the table moves so as to be linearly reversed in the X-axis direction. Further, the motor that moves the table in the Y-axis direction rotates at the same speed in the same rotational direction from the position A3 toward the position A4 and from the position A4 toward the position of the next corner.

As shown in fig. 4, when the machining shape is a shape in which every other corner of the octagon is replaced with a circular arc, the motor speed for moving the table in the Y-axis direction is slow and the motor speed for moving the table in the X-axis direction is fast at the corner position B1.

At the arc position B2, the rotation direction of the motor that moves the table in the Y-axis direction is reversed, and the table moves so as to be linearly reversed in the Y-axis direction. Further, the motor that moves the table in the X-axis direction rotates at a constant speed in the same rotational direction from the position B1 toward the position B3. Unlike the case shown in fig. 3 in which the machining shape is octagonal, the motor that moves the table in the Y-axis direction gradually decelerates as it approaches position B2, stops rotating at position B2, and then gradually accelerates in the reversed rotational direction after passing position B2, so that the machining shape forms a circular arc before and after position B2.

At angular position B3, the motor speed for moving the table in the Y-axis direction is increased, and the motor speed for moving the table in the X-axis direction is decreased.

At the arc position B4, the rotation direction of the motor that moves the table in the X-axis direction is reversed, and the table moves so as to be linearly reversed in the X-axis direction. Further, the motor that moves the table in the Y-axis direction rotates at the same speed in the same rotational direction from the position B3 toward the position B4 and from the position B4 toward the next corner position. The motor that moves the table in the X-axis direction gradually decelerates as it approaches position B4, stops rotating at position B4, and then gradually accelerates in the reversed rotational direction after passing position B4, so that the machining shape forms a circular arc before and after position B4.

In the present embodiment, machine learning related to optimization of the coefficients of the transfer function of the IIR filter 1092 of the velocity feedforward calculation unit 109 and of the transfer function of the IIR filter 1102 of the position feedforward calculation unit 110 is performed by evaluating the vibration caused when the rotation speed changes during linear control and by examining its influence on the positional deviation at the positions A1 and A3 and the positions B1 and B3 of the machining shape specified by the machining program at the time of learning, as described above. The machine learning related to optimization of the coefficients of an IIR filter transfer function is not limited to the velocity feedforward calculation unit and the position feedforward calculation unit; it can also be applied, for example, to a current feedforward calculation unit having an IIR filter provided in the servo control device to perform current feedforward.

Hereinafter, the machine learning device 200 will be described in more detail.

As an example of machine learning, the machine learning device 200 according to the present embodiment performs reinforcement learning related to optimization of the coefficients of the transfer functions of the velocity feedforward calculation unit 109 and the position feedforward calculation unit 110, which form part of the velocity loop and the position loop, respectively, of the servo control device 100. The machine learning in the present invention is not limited to reinforcement learning; the present invention can also be applied to cases where other machine learning (for example, supervised learning) is performed.

Before describing the functional blocks included in the machine learning device 200, the basic mechanism of reinforcement learning will be described. An agent (corresponding to the machine learning device 200 in the present embodiment) observes the state of the environment and selects a certain behavior, and the environment changes according to that behavior. As the environment changes, some reward is given, and the agent learns to select better behaviors (to make better decisions).

Whereas supervised learning presents complete correct answers, the reward in reinforcement learning is often a fragmentary value based on a partial change of the environment. The agent therefore learns to select behaviors so that the total reward into the future is maximized.

In this way, reinforcement learning learns, through the interaction that behaviors have with the environment, an appropriate behavior, that is, a method for maximizing the reward to be obtained in the future. In the present embodiment, this means that behavior information that affects the future, for example behavior information selected so as to reduce the positional deviation, can be obtained.

Any learning method may be used for the reinforcement learning; in the following description, a case of using Q-learning, a method of learning the value Q(S, A) of selecting a behavior A in a given environmental state S, will be described as an example.

Q-learning aims to select, as the optimal behavior, the behavior A having the highest value Q(S, A) among the behaviors A that can be taken in a given state S.

However, at the time Q-learning first starts, the correct value of Q(S, A) is not known at all for any combination of state S and behavior A. The agent therefore selects various behaviors A in a given state S and, guided by the rewards given for those behaviors A, selects better behaviors, thereby continuing to learn the correct value Q(S, A).

Further, since the agent wants to maximize the total reward obtained in the future, the goal is to eventually achieve $Q(S, A) = E[\sum_t \gamma^t r_t]$. Here, E[·] denotes an expected value, t denotes time, γ is a parameter called the discount rate described later, r_t is the reward at time t, and Σ is the sum over time t. The expected value in this expression is the expected value when the state changes according to the optimal behavior. However, since the optimal behavior is not known during the course of Q-learning, reinforcement learning is performed while searching over various behaviors. An update formula for the value Q(S, A) can be expressed, for example, by Equation 3 below (hereinafter, Equation 3).

[ Mathematical formula 3 ]

$$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha\left(r_{t+1} + \gamma \max_{A} Q(S_{t+1}, A) - Q(S_{t}, A_{t})\right)$$

In Equation 3, S_t denotes the environmental state at time t, and A_t denotes the behavior at time t. The behavior A_t changes the state to S_{t+1}, and r_{t+1} denotes the reward obtained by that change of state. The term with max is the Q value of the behavior A having the highest known Q value in the state S_{t+1}, multiplied by γ. Here, γ is a parameter satisfying 0 < γ ≤ 1 called the discount rate, and α is a learning coefficient in the range 0 < α ≤ 1.

Equation 3 expresses a method of updating the value Q(S_t, A_t) of the behavior A_t in the state S_t based on the reward r_{t+1} returned as a result of trying the behavior A_t.

This update formula expresses the following: if the value max_A Q(S_{t+1}, A) of the optimal behavior in the next state S_{t+1} brought about by the behavior A_t is larger than the value Q(S_t, A_t) of the behavior A_t in the state S_t, then Q(S_t, A_t) is increased; conversely, if it is smaller, Q(S_t, A_t) is decreased. That is, the value of a given behavior in a given state is brought closer to the value of the optimal behavior in the next state resulting from it. Although the difference between them varies with the discount rate γ and the reward r_{t+1}, the mechanism is basically one in which the best behavior value in a given state propagates to the behavior value in the state immediately preceding it.
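A minimal tabular sketch of this Equation 3 update follows; the dictionary-based table and the function name are illustrative assumptions (the embodiment can instead approximate Q with a neural network, as the DQN discussion below notes).

```python
from collections import defaultdict

# Q[s][a]: tabular action-value estimates (illustrative assumption).
Q = defaultdict(lambda: defaultdict(float))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step per Equation 3:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```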

Q-learning can be performed by preparing a table of Q(S, A) for all state-behavior pairs (S, A). However, the number of states may be too large to obtain the Q(S, A) values of all state-behavior pairs, and Q-learning may then take a long time to converge.

Therefore, the well-known technique called DQN (Deep Q-Network) can be used. Specifically, the value function Q may be constructed using an appropriate neural network, and the value of Q(S, A) may be calculated by approximating the value function Q with the neural network through adjustment of its parameters. Using DQN shortens the time required for Q-learning to converge. DQN is described in detail, for example, in the following non-patent document.

< non-patent document >

"Human-level control through depth retrieval for retrieval" learning ", VolodymerMniH 1, line, retrieval 1/17 in 29 years, Internet < URL: http: davidqi. com/research/source 14236. pdf)

The machine learning device 200 performs the Q learning described above.

The machine learning device 200 performs machine learning (hereinafter referred to as learning) on the coefficients of the transfer function of the IIR filter 1092 of the velocity feedforward calculation unit 109 and of the transfer function of the IIR filter 1102 of the position feedforward calculation unit 110 shown in fig. 1.

The machine learning device 200 performs the learning of the coefficients of the transfer function of the IIR filter 1092, which lies on the inner side (inner loop) relative to the IIR filter 1102, prior to the learning of the coefficients of the transfer function of the IIR filter 1102. Specifically, the coefficients of the transfer function of the IIR filter 1102 are fixed, and the optimal values of the coefficients of the transfer function of the IIR filter 1092 are learned. After that, the machine learning device 200 fixes the coefficients of the transfer function of the IIR filter 1092 to the optimal values obtained by the learning, and learns the coefficients of the transfer function of the IIR filter 1102. By learning the coefficients of the IIR filter 1092 first, the learning related to optimization of the coefficients of the transfer function of the IIR filter 1102 can be performed under a velocity feedforward term (the output of the IIR filter 1092) that has already been optimized.

The coefficients of the transfer functions of the IIR filter 1092 and the IIR filter 1102 could be learned simultaneously, but doing so increases the amount of information processing for the machine learning and lengthens its convergence time.

As described above, the machine learning device 200 first performs machine learning on the coefficients of the transfer function of the IIR filter 1092 of the velocity feedforward calculation unit 109. The machine learning device 200 learns the value Q of selecting, as the behavior A, the adjustment of the values of the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092, where the state S includes servo states such as commands and feedback containing the values of the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092 of the velocity feedforward calculation unit 109, as well as the positional deviation information and the position command of the servo control device 100 acquired by executing the machining program at the time of learning.

Specifically, the machine learning device 200 according to the embodiment of the present invention searches for and learns, within predetermined ranges, the radius r and the angle θ of the zeros and poles of the transfer function VFF(z) expressed in polar coordinates, and thereby sets the coefficients of the transfer function VFF(z) of the IIR filter 1092. A pole is a value of z at which the transfer function VFF(z) becomes infinite, and a zero is a value of z at which the transfer function VFF(z) becomes 0.

To this end, the numerator of the transfer function VFF(z) is factored as follows:

$$b_0 + b_1 z^{-1} + b_2 z^{-2} = b_0\left(1 + (b_1/b_0)\,z^{-1} + (b_2/b_0)\,z^{-2}\right)$$

Hereinafter, unless otherwise specified, b1' and b2' denote (b1/b0) and (b2/b0), respectively.

Then, the machine learning device 200 learns the radius r and the angle θ so that the positional deviation is minimized, and thereby sets the coefficients a1, a2, b1', and b2' of the transfer function VFF(z).

The coefficient b0 can be obtained by machine learning after the radius r and the angle θ have been set to their optimal values r0 and θ0, for example. Alternatively, the coefficient b0 may be learned simultaneously with the angle θ, or simultaneously with the radius r.

Thereafter, the coefficients c1, c2, d0 to d2 of the transfer function PFF(z) of the IIR filter 1102 are learned in the same manner as for the transfer function VFF(z) of the IIR filter 1092. The following description therefore covers the learning of the coefficients of the transfer function VFF(z) of the IIR filter 1092 of the velocity feedforward calculation unit 109; the learning of the coefficients of the transfer function PFF(z) of the IIR filter 1102 of the position feedforward calculation unit 110 is performed in the same way.

The machine learning device 200 observes the state information S, which includes servo states such as commands and feedback containing the position command and the positional deviation information of the servo control device 100 at the positions A1 and A3 and the positions B1 and B3 of the machining shape, obtained by executing the machining program at the time of learning on the basis of the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092, and thereby determines the behavior A. The machine learning device 200 receives a reward every time the behavior A is executed. The machine learning device 200 searches for the optimal behavior A by trial and error, for example, so that the total reward into the future is maximized. In this way, for the state S, the machine learning device 200 can select the optimal behavior A (that is, the optimal zeros and poles of the transfer function VFF(z) of the IIR filter 1092). Since the rotation directions of the servo motors for the X-axis and Y-axis directions do not change at the positions A1 and A3 and the positions B1 and B3, the machine learning device 200 can learn the zeros and poles of the transfer function VFF(z) of the IIR filter 1092 during linear operation.

That is, based on the value function Q learned by the machine learning device 200, by selecting the behavior A that maximizes the value of Q among the behaviors A applied to the transfer function VFF(z) of the IIR filter 1092 in a given state S, it is possible to select the behavior A (that is, the zeros and poles of the transfer function VFF(z) of the IIR filter 1092) that minimizes the positional deviation obtained by executing the machining program at the time of learning.

The method of obtaining the coefficients a1, a2, b1', and b2' of the transfer function VFF(z) by learning, so that the positional deviation is minimized, the radius r and the angle θ that express in polar coordinates the zeros and poles of the transfer function VFF(z) of the IIR filter 1092, and the method of obtaining the coefficient b0, are described below.

The machine learning device 200 determines the poles and zeros of the IIR filter 1092: a pole is a value of z at which the transfer function VFF(z) of Equation 1 becomes infinite, and a zero is a value of z at which the transfer function VFF(z) becomes 0.

To find the poles and zeros, the machine learning device 200 multiplies the denominator and the numerator of Equation 1 by z², obtaining Equation 4 below (hereinafter, Equation 4).

[ Mathematical formula 4 ]

$$\mathrm{VFF}(z) = b_0\,\frac{z^{2} + b_1' z + b_2'}{z^{2} + a_1 z + a_2}$$

The poles are the values of z for which the denominator of Equation 4 is 0, that is, z² + a1z + a2 = 0; the zeros are the values of z for which the numerator of Equation 4 is 0, that is, z² + b1'z + b2' = 0.

In the present embodiment, the poles and zeros are expressed in polar coordinates, and the poles and zeros expressed in polar coordinates are searched for.

Since the zeros are particularly important for suppressing vibration, the machine learning device 200 first fixes the poles and sets z = re^{jθ} and its complex conjugate z* = re^{-jθ} as the zeros of the numerator (z² + b1'z + b2') (with the angle θ within a predetermined range and 0 ≤ r ≤ 1). The coefficients b1' (= -re^{jθ} - re^{-jθ} = -2r cos θ) and b2' (= r²) calculated from these zeros are set as coefficients of the transfer function VFF(z); by searching for the zero re^{jθ} in polar coordinates in this way, the optimal coefficients b1' and b2' are learned. The radius r depends on the attenuation rate, and the angle θ depends on the frequency at which vibration is suppressed. Thereafter, the zeros may be fixed at their optimal values to learn the value of the coefficient b0. Next, the poles of the transfer function VFF(z) are expressed in polar coordinates, and the pole re^{jθ} expressed in polar coordinates is searched for in the same way as the zeros. In this way, the optimal values of the coefficients a1 and a2 of the denominator of the transfer function VFF(z) can be learned.
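The mapping from a conjugate pair in polar form to the quadratic coefficients follows directly from expanding (z - re^{jθ})(z - re^{-jθ}); a small sketch (the function name is an illustrative assumption):

```python
import math

def polar_pair_to_coeffs(r, theta):
    """Conjugate pair z = r e^{±jθ} gives z² + p1 z + p2 with
    p1 = -(r e^{jθ} + r e^{-jθ}) = -2 r cos(θ) and p2 = r²."""
    return -2.0 * r * math.cos(theta), r * r

# Example: zeros with r = 0.9, θ = 60° give b1' = -0.9 and b2' = 0.81.
b1p, b2p = polar_pair_to_coeffs(0.9, math.radians(60.0))
```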

When learning the coefficients of the numerator of the transfer function VFF(z) with the poles fixed, it suffices for the fixed poles to suppress the gain on the high-frequency side; they correspond, for example, to a second-order low-pass filter. The transfer function of the second-order low-pass filter is expressed by Equation 5 below (hereinafter, Equation 5), where ω is the peak-gain frequency of the filter and ζ is a damping coefficient.

[ Mathematical formula 5 ]

$$F(s) = \frac{\omega^{2}}{s^{2} + 2\zeta\omega s + \omega^{2}}$$

In the case of a third-order low-pass filter, the transfer function may be configured from three first-order low-pass filters each expressed by 1/(1 + Ts) (where T is the time constant of the filter), or from a combination of a first-order low-pass filter and the second-order low-pass filter of Equation 5.

The transfer function in the z domain is obtained from the transfer function in the s domain by using the bilinear transform.
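As an illustration of this bilinear-transform step, the sketch below discretizes the second-order low-pass of Equation 5 with SciPy; the function name and the ζ default are assumptions for illustration only.

```python
from scipy import signal

def lpf2_to_z(omega, zeta=1.0, ts=1e-3):
    """Discretize F(s) = ω² / (s² + 2ζωs + ω²) by the bilinear transform
    at sampling period ts, returning (b_z, a_z) for
    H(z) = (b0 + b1 z^-1 + b2 z^-2) / (a0 + a1 z^-1 + a2 z^-2)."""
    num = [omega ** 2]
    den = [1.0, 2.0 * zeta * omega, omega ** 2]
    return signal.bilinear(num, den, fs=1.0 / ts)
```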

Although the poles and zeros of the transfer function VFF(z) could be searched for simultaneously, searching for and learning the poles and zeros separately reduces the amount of machine learning and shortens the learning time.

On the complex plane of fig. 5, the search range of the poles and zeros can be narrowed to the predetermined search range indicated by the hatched area: the radius r is restricted, for example, to the range 0 ≤ r ≤ 1, and the angle θ is restricted to the frequency range to which the velocity loop can respond. Since vibration due to resonance of the velocity loop occurs up to about 200 Hz, the upper limit of the frequency range may be set, for example, to 200 Hz. The search range can be determined from the resonance characteristics of the controlled object, such as a machine tool; when the sampling period is 1 msec, an angle θ of 90 degrees corresponds to about 250 Hz, so setting the upper limit of the frequency range to 200 Hz yields the search range of the angle θ shown on the complex plane of fig. 5. By narrowing the search range to a predetermined range in this way, the amount of machine learning can be reduced, and the convergence time of machine learning can be shortened.
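The angle θ corresponding to a frequency f at sampling period Ts is θ = 2πfTs; a small sketch of this correspondence (the function name is an assumption):

```python
import math

def freq_to_angle(f_hz, ts=1e-3):
    """θ = 2π f Ts: with Ts = 1 ms, 250 Hz maps to π/2 (90 degrees),
    and a 200 Hz upper limit maps to 0.4π (72 degrees)."""
    return 2.0 * math.pi * f_hz * ts
```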

When searching for a zero in polar coordinates, first, the coefficient b0 is fixed, for example, to 1, the radius r is fixed to an arbitrary value within the range (0 ≤ r ≤ 1), and the angle θ is varied within the search range shown in fig. 5, thereby trying the coefficients b1' (= -re^{jθ} - re^{-jθ}) and b2' (= r²) for which z and its complex conjugate z* are the zeros of (z² + b1'z + b2'). The initial setting value of the angle θ is taken within the search range shown in fig. 5.

The machine learning device 200 outputs adjustment information for the obtained coefficients b1' and b2' to the IIR filter 1092 as the behavior A, and sets the coefficients b1' and b2' of the numerator of the transfer function VFF(z) of the IIR filter 1092. The coefficient b0 is set, for example, to 1 as described above. When the optimal angle θ0 that maximizes the value Q has been determined by learning the search over the angle θ, the angle θ is fixed to θ0, the radius r is made variable, and the coefficients b1' (= -re^{jθ} - re^{-jθ}) and b2' (= r²) of the numerator of the transfer function VFF(z) of the IIR filter 1092 are set. By learning the search over the radius r, the optimal radius r0 that maximizes the value Q is determined. The coefficients b1' and b2' are set from the angle θ0 and the radius r0; thereafter, b0 is learned, whereby the coefficients b0, b1', and b2' of the numerator of the transfer function VFF(z) are determined.

The search for the poles in polar coordinates is learned in the same way as for the numerator of the transfer function VFF(z). First, the radius r is fixed to a value within the range (for example, 0 ≤ r ≤ 1) and the angle θ is searched for within the search range described above; when the optimal angle θ of the poles of the transfer function VFF(z) of the IIR filter 1092 has been determined by learning, the angle θ is fixed to that value and the radius r is searched for and learned. In this way, the optimal angle θ and the optimal radius r of the poles of the transfer function VFF(z) of the IIR filter 1092, and thus the optimal coefficients a1 and a2 corresponding to them, are determined. As described above, the radius r depends on the attenuation rate and the angle θ depends on the frequency at which vibration is suppressed, and in order to suppress vibration it is desirable to learn the angle θ before the radius r.

As described above, by learning through searching, within predetermined ranges, the radius r and the angle θ that express in polar coordinates the zeros and poles of the transfer function VFF(z) of the IIR filter 1092 so that the positional deviation is minimized, the optimization of the coefficients a1, a2, b0, b1', and b2' of the transfer function VFF(z) can be carried out more efficiently than by learning the coefficients a1, a2, b0, b1', and b2' directly.

In learning the coefficient b0 of the transfer function VFF(z) of the IIR filter 1092, for example, the initial value of the coefficient b0 is set to 1, and an increment is added to or subtracted from the coefficient b0 of the transfer function VFF(z) included in each subsequent behavior A. The initial value of the coefficient b0 is not limited to 1 and may be set to an arbitrary value. The machine learning device 200 gives a reward according to the positional deviation every time the behavior A is executed, and performs reinforcement learning that searches for the optimal behavior A by trial and error so that the total of future rewards is maximized, thereby adjusting the coefficient b0 of the transfer function VFF(z) to an optimal value. Here, the coefficient b0 is learned after the learning of the radius r, but it may instead be learned simultaneously with the angle θ, or simultaneously with the radius r.

By learning the radius r, the angle θ, and the coefficient b0 separately, the amount of machine learning can be reduced, and the convergence time of the machine learning can be shortened.
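To make the staging concrete, the sketch below expresses the three-stage search order as a plain grid search; the reinforcement-learning search of the embodiment replaces the grid minimization, and the cost function shown (a positional-deviation evaluation, cf. f(PD(S)) below) and all names are illustrative assumptions.

```python
def staged_search(cost, thetas, radii, b0s, r_init=0.9):
    """Stage 1: fix r, learn θ; stage 2: fix θ at its best value, learn r;
    stage 3: learn b0. `cost(r, theta, b0)` is assumed to evaluate the
    positional deviation produced by running the machining program."""
    theta0 = min(thetas, key=lambda th: cost(r_init, th, 1.0))  # θ first (vibration frequency)
    r0 = min(radii, key=lambda r: cost(r, theta0, 1.0))         # then r (attenuation rate)
    b0 = min(b0s, key=lambda b: cost(r0, theta0, b))            # finally b0
    return r0, theta0, b0
```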

When the learning related to optimization of the coefficients of the transfer function VFF(z) of the IIR filter 1092 ends, the learning related to optimization of the coefficients of the transfer function PFF(z) of the IIR filter 1102 is performed next.

To obtain the poles and zeros, the machine learning device 200 multiplies the denominator and the numerator of Equation 2 by z², obtaining Equation 6 below (hereinafter, Equation 6).

[ Mathematical formula 6 ]

$$\mathrm{PFF}(z) = d_0\,\frac{z^{2} + d_1' z + d_2'}{z^{2} + c_1 z + c_2}$$

In Equation 6, d1' and d2' correspond to (d1/d0) and (d2/d0), respectively.

The optimization of the coefficients c1, c2, d0 to d2 of the transfer function PFF(z), performed by learning the radius r and the angle θ that express in polar coordinates the zeros and poles of the transfer function PFF(z) of the IIR filter 1102, is the same as the learning, described above, of the radius r and the angle θ of the zeros and poles of the transfer function VFF(z) of the IIR filter 1092 in polar coordinates, and its description is therefore omitted.

Fig. 6 is a block diagram showing the machine learning device 200 according to the present embodiment. The following description concerns the learning of the coefficients of the transfer function VFF(z) of the IIR filter 1092 of the velocity feedforward calculation unit 109; the learning of the coefficients of the transfer function of the IIR filter 1102 of the position feedforward calculation unit 110, performed afterward, proceeds in the same way.

In order to perform the reinforcement learning described above, as shown in fig. 6, the machine learning device 200 includes: a state information acquisition unit 201, a learning unit 202, a behavior information output unit 203, a value function storage unit 204, and an optimization behavior information output unit 205. The learning unit 202 includes a reward output unit 2021, a value function update unit 2022, and a behavior information generation unit 2023.

The state information acquisition unit 201 acquires from the servo control device 100 a state S including servo states such as commands and feedback, containing the values of the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092 of the velocity feedforward calculation unit 109 in the servo control device 100, as well as the position command and the positional deviation information of the servo control device 100 obtained by executing the machining program at the time of learning. The state information S corresponds to the environmental state S in Q-learning.

The state information acquisition unit 201 outputs the acquired state information S to the learning unit 202. The state information acquisition unit 201 also acquires from the behavior information generation unit 2023 the angle θ and the radius r that express the zeros and poles in polar coordinates, together with the corresponding coefficients a1, a2, b1', and b2'; it stores the angle θ and the radius r corresponding to the coefficients a1, a2, b1', and b2' obtained from the servo control device 100, and outputs these to the learning unit 202 as well.

The initial setting of the transfer function VFF(z) of the IIR filter 1092 at the time Q-learning first starts is made by the user in advance. In the present embodiment, the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092 initially set by the user are then adjusted to optimal values by the reinforcement learning that searches, within the predetermined ranges, the radius r and the angle θ that express the zeros and poles in polar coordinates. The coefficient α of the second-order differentiator 1091 of the velocity feedforward calculation unit 109 is a fixed value; for example, α is set to 1. The initial setting of the denominator of the transfer function VFF(z) of the IIR filter 1092 is as given by Equation 5 (converted to a z-domain transfer function by the bilinear transform). The initial setting of the coefficients b0 to b2 of the numerator of the transfer function VFF(z) may be, for example, b0 = 1, with r a value in the range 0 ≤ r ≤ 1 and θ a value within the predetermined search range described above.
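Combining the earlier sketches, an initial setting of VFF(z) as described here might look as follows; every name and numeric value is a placeholder assumption, not the device's actual initialization:

```python
import math

# Denominator: Equation 5 discretized by the bilinear transform (lpf2_to_z above).
bz, az = lpf2_to_z(omega=2.0 * math.pi * 100.0, zeta=1.0, ts=1e-3)
a1, a2 = az[1] / az[0], az[2] / az[0]

# Numerator: b0 = 1, zeros placed from an initial (r, θ) within the search range
# (polar_pair_to_coeffs and freq_to_angle are the earlier illustrative helpers).
b0 = 1.0
b1p, b2p = polar_pair_to_coeffs(0.8, freq_to_angle(100.0))
```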

The initial setting of the position feedforward calculating unit 110 is also performed in the same manner.

Regarding the initial values of the coefficients a1, a2, b0 to b2 and the coefficients c1, c2, d0 to d2, when the operator has adjusted the machine tool in advance, machine learning may be performed using, as initial values, the values of the radius r and the angle θ that express in polar coordinates the zeros and poles of the adjusted transfer function.

The learning unit 202 is the part that learns the value Q(S, A) of selecting a certain behavior A under a certain environmental state S. The behavior A is, for example, correction information for the coefficients b1' and b2' of the numerator of the transfer function VFF(z) of the IIR filter 1092, calculated with the coefficient b0 fixed to 1 from the radius r and the angle θ that express the zeros of the transfer function VFF(z) in polar coordinates. In the following description, the case where the coefficient b0 is initially set to 1 and the behavior information A is the correction information for the coefficients b1' and b2' will be described as an example.

The reward output unit 2021 is the part that calculates the reward when the behavior A is selected under a certain state S. Here, PD(S) denotes the set of positional deviations (positional-deviation set) that are state variables of the state S, and PD(S') denotes the positional-deviation set that is a state variable of the state information S' changed from S by the behavior information A. The positional-deviation value in the state S is a value calculated from a preset evaluation function f(PD(S)).

As the evaluation function f, for example, the following functions can be used (a code sketch of these follows the list):

a function that calculates the integral value of the absolute value of the positional deviation:
∫|e|dt

a function that calculates the integral value of the absolute value of the positional deviation weighted by time:
∫t|e|dt

a function that calculates the integral value of the 2n-th power of the absolute value of the positional deviation (n is a natural number):
∫e^(2n)dt

a function that calculates the maximum value of the absolute value of the positional deviation:
Max{|e|}
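For reference, here is a minimal Python sketch of how these four evaluation functions might be computed over a sampled sequence of positional deviations; the function name, the sampling period dt, and the default n = 1 are illustrative assumptions, not part of the embodiment:

```python
import numpy as np

def evaluation_values(e, dt, n=1):
    """Discrete-time approximations of the four example evaluation
    functions f applied to sampled positional deviations e."""
    e = np.asarray(e, dtype=float)
    t = np.arange(len(e)) * dt
    return {
        "integral_abs":  np.sum(np.abs(e)) * dt,      # ∫|e|dt
        "time_weighted": np.sum(t * np.abs(e)) * dt,  # ∫t|e|dt
        "power_2n":      np.sum(e ** (2 * n)) * dt,   # ∫e^(2n)dt
        "max_abs":       np.max(np.abs(e)),           # Max{|e|}
    }
```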

At this time, when the value f(PD(S')) of the positional deviation of the servo control device 100 operated with the corrected velocity feedforward calculation unit 109, relating to the state information S' corrected by the behavior information A, is larger than the value f(PD(S)) of the positional deviation of the servo control device 100 operated with the velocity feedforward calculation unit 109 before correction, relating to the state information S before correction by the behavior information A, the reward output unit 2021 sets the reward value to a negative value.

On the other hand, when the value f(PD(S')) of the positional deviation is smaller than the value f(PD(S)) of the positional deviation, the reward output unit 2021 sets the reward value to a positive value.

When the value f(PD(S')) of the positional deviation is equal to the value f(PD(S)) of the positional deviation, the reward output unit 2021 sets the reward value to zero.

Further, the negative value used when the value f(PD(S')) of the positional deviation in the state S' after execution of the behavior A is larger than the value f(PD(S)) of the positional deviation in the preceding state S may be increased in proportion; that is, the negative value may be made larger according to the degree to which the value of the positional deviation has increased. Conversely, the positive value used when the value f(PD(S')) of the positional deviation in the state S' after execution of the behavior A is smaller than the value f(PD(S)) of the positional deviation in the preceding state S may be increased in proportion; that is, the positive value may be made larger according to the degree to which the value of the positional deviation has decreased.
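As one possible realization of the comparison described above, the following is a minimal sketch of such a reward calculation; the function name and the linear weighting by the change in deviation are assumptions:

```python
def reward(f_before, f_after, scale=1.0):
    """Reward for the transition from state S to S', computed from the
    evaluation values f(PD(S)) and f(PD(S')).  The magnitude grows in
    proportion to how much the positional deviation changed."""
    if f_after > f_before:
        return -scale * (f_after - f_before)  # deviation grew: negative reward
    if f_after < f_before:
        return scale * (f_before - f_after)   # deviation shrank: positive reward
    return 0.0                                # unchanged: zero reward
```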

The value function update unit 2022 performs Q learning based on the state S, the behavior A, the state S' obtained when the behavior A is applied to the state S, and the reward value calculated as described above, thereby updating the value function Q stored in the value function storage unit 204.

The update of the value function Q may be performed by online learning, batch learning, or mini-batch learning.

Online learning is a learning method in which, each time some behavior A is applied to the current state S and the state S transitions to a new state S', the value function Q is updated immediately. Batch learning is a learning method in which data for learning is collected by repeatedly applying some behavior A to the current state S so that the state transitions to a new state S', and the value function Q is then updated using all of the collected learning data. Mini-batch learning is a learning method intermediate between online learning and batch learning, in which the value function Q is updated each time a certain amount of learning data has accumulated.
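For orientation only, here is a sketch of the standard tabular Q-learning step that such an online update could follow; the learning rate alpha, the discount factor gamma, and the dictionary-based behavior value table are assumptions, and the embodiment does not prescribe this exact form:

```python
from collections import defaultdict

Q = defaultdict(float)  # behavior value table: (state, behavior) -> value

def q_update(s, a, r, s_next, behaviors, alpha=0.1, gamma=0.9):
    """One online update:
    Q(S,A) <- Q(S,A) + alpha * (r + gamma * max_A' Q(S',A') - Q(S,A))."""
    best_next = max(Q[(s_next, a2)] for a2 in behaviors)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```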

The behavior information generation unit 2023 selects the behavior A in the course of Q learning for the current state S. During Q learning, the operation of correcting the coefficients b1', b2' of the transfer function VFF(z) of the IIR filter 1092 of the servo control device 100, based on the radius r and the angle θ expressing the zero point in polar coordinates, corresponds to the behavior A in Q learning; the behavior information generation unit 2023 generates behavior information A for this operation and outputs the generated behavior information A to the behavior information output unit 203.

More specifically, in order to search for the zero point on the polar coordinates, the behavior information generation unit 2023, for example, fixes the coefficients a1, a2, b0 of the transfer function VFF(z) of equation 4 and, in a state where the zero point z of the numerator (z² + b1'z + b2') is z = re^(jθ) and the radius r received from the state information acquisition unit 201 is fixed, increases or decreases the angle θ received from the state information acquisition unit 201 within the search range of fig. 5. Then, with the fixed radius r and the increased or decreased angle θ, z and its complex conjugate z* are set as the zero points, and the coefficients b1', b2' are found anew from these zero points, as in the sketch below.
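For a conjugate zero pair z = r·e^(jθ) and z* = r·e^(−jθ), the numerator factorizes as (z − r·e^(jθ))(z − r·e^(−jθ)) = z² − 2r·cos(θ)·z + r², so b1' and b2' follow directly from the searched r and θ. A minimal sketch of that conversion (the function name is an assumption):

```python
import math

def zero_to_numerator_coeffs(r, theta):
    """Coefficients b1', b2' of z^2 + b1'*z + b2' whose zeros are the
    conjugate pair r*e^{+j*theta} and r*e^{-j*theta}."""
    b1 = -2.0 * r * math.cos(theta)  # b1' = -2r cos(theta)
    b2 = r * r                       # b2' = r^2
    return b1, b2
```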

The following policy may be adopted: when the behavior information generation unit 2023 has reset the coefficients b1', b2' of the transfer function VFF(z) of the IIR filter 1092 by increasing or decreasing the angle θ, a transition is made to the state S', and a positive reward is returned, the behavior information generation unit 2023 selects, as the next behavior A', a behavior that makes the positional deviation smaller, such as increasing or decreasing the angle θ in the same direction as in the previous behavior.

Conversely, the following policy may also be adopted: when a negative reward is returned, the behavior information generation unit 2023 selects, as the next behavior A', a behavior that makes the positional deviation smaller than the previous value, for example by decreasing or increasing the angle θ in the direction opposite to that of the previous behavior.

The behavior information generation unit 2023 may also adopt a policy of selecting the behavior A' by a well-known method such as the greedy algorithm, which selects the behavior A' having the highest value Q(S, A) among the values of the currently estimated behaviors A, or the ε-greedy algorithm, which selects the behavior A' having the highest value Q(S, A) while also selecting a behavior A' at random with a certain small probability ε.
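Here is a minimal sketch of the ε-greedy selection just mentioned, reusing the dictionary-style table Q from the earlier sketch; the names and the value of epsilon are assumptions:

```python
import random

def epsilon_greedy(state, behaviors, Q, epsilon=0.1):
    """With probability epsilon explore at random; otherwise pick the
    behavior with the highest estimated value Q(S, A)."""
    if random.random() < epsilon:
        return random.choice(behaviors)
    return max(behaviors, key=lambda a: Q[(state, a)])
```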

The behavior information generation unit 2023 continues the search for the angle θ and, when the optimum angle θ0 at which the value Q becomes maximum has been determined by learning, based on the optimized behavior information described later from the optimized behavior information output unit 205, fixes the angle θ to θ0 and searches for the radius r within the range 0 ≤ r ≤ 1, setting the coefficients b1', b2' of the numerator of the transfer function VFF(z) of the IIR filter 1092 in the same manner as in the search for the angle θ. The behavior information generation unit 2023 then continues the search for the radius r and, when the optimum radius r0 at which the value Q becomes maximum has been determined by learning, based on the optimized behavior information described later from the optimized behavior information output unit 205, determines the optimum coefficients b1', b2' of the numerator. Thereafter, as described above, the optimum values of the coefficients of the numerator of the transfer function VFF(z) are learned by learning the coefficient b0.
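Purely to illustrate the staged ordering (the angle θ first with r fixed, then the radius r with θ fixed at θ0), here is a grid-search caricature of the procedure; the embodiment performs the search through Q learning, and the grids, ranges, and evaluate() function are assumptions:

```python
def staged_search(evaluate, thetas, radii, r_init=1.0):
    """Search theta with the radius fixed at r_init, then search the
    radius with theta fixed at the best angle theta0 found.
    evaluate(r, theta) is assumed to return the evaluation value f
    (smaller is better)."""
    theta0 = min(thetas, key=lambda th: evaluate(r_init, th))
    r0 = min(radii, key=lambda r: evaluate(r, theta0))
    return r0, theta0
```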

Then, as described above, the behavior information generation unit 2023 searches for the coefficients relating to the denominator of the transfer function VFF(z), based on the radius r and the angle θ of the pole expressed in polar coordinates. In this learning, the radius r and the angle θ of the pole expressed in polar coordinates are optimally adjusted by reinforcement learning, as in the case of the numerator of the transfer function VFF(z) of the IIR filter 1092. At this time, as in the case of the numerator of the transfer function VFF(z), the radius r is learned after the angle θ has been learned. Since the learning method is the same as in the search for the zero point of the transfer function VFF(z), detailed description is omitted.

The behavior information output unit 203 is a part that transmits the behavior information A output from the learning unit 202 to the servo control device 100. As described above, the servo control device 100 slightly corrects the current state S, that is, the radius r and the angle θ expressing in polar coordinates the zero point of the currently set transfer function VFF(z) of the IIR filter 1092, based on the behavior information, and thereby transitions to the next state S' (that is, the coefficients b1', b2' of the transfer function VFF(z) of the IIR filter 1092 corresponding to the corrected zero point).

The value function storage unit 204 is a storage device that stores the value function Q. The value function Q may be stored, for example, as a table for each state S and each behavior A (hereinafter referred to as a behavior value table). The value function Q stored in the value function storage unit 204 is updated by the value function update unit 2022. The value function Q stored in the value function storage unit 204 may also be shared with other machine learning devices 200. If the value function Q is shared among a plurality of machine learning devices 200, the machine learning devices 200 can perform reinforcement learning in a distributed manner, and the efficiency of the reinforcement learning can therefore be improved.

The optimized behavior information output unit 205 generates behavior information A (hereinafter referred to as "optimized behavior information") for causing the velocity feedforward calculation unit 109 to perform an operation that maximizes the value Q(S, A), based on the value function Q updated by the value function update unit 2022 through Q learning.

More specifically, the optimized behavior information output unit 205 acquires the value function Q stored in the value function storage unit 204. As described above, the value function Q is a function updated by the value function update unit 2022 through Q learning. The optimized behavior information output unit 205 generates behavior information from the value function Q and outputs the generated behavior information to the servo control device 100 (the IIR filter 1092 of the velocity feedforward calculation unit 109). The optimized behavior information includes, like the behavior information output by the behavior information output unit 203 during Q learning, information for correcting the coefficients of the transfer function VFF(z) of the IIR filter 1092 through the angle θ, the radius r, and the coefficient b0.

In the servo control device 100, the coefficients relating to the numerator of the transfer function VFF(z) of the IIR filter 1092 are corrected based on the angle θ, the radius r, and the coefficient b0.

After optimizing the coefficients of the numerator of the transfer function VFF(z) of the IIR filter 1092 by the above operation, the machine learning device 200 optimizes the coefficients of the denominator of the transfer function VFF(z) of the IIR filter 1092 by learning the angle θ and the radius r in the same manner. Thereafter, as with the learning and optimization of the coefficients of the transfer function VFF(z) of the IIR filter 1092, the coefficients of the transfer function PFF(z) of the IIR filter 1102 may be learned and optimized through the angle θ, the radius r, and the coefficient d0 so as to reduce the value of the positional deviation.

As described above, by using the machine learning device 200 according to the present invention, parameter adjustment of the velocity feedforward calculation unit 109 and the position feedforward calculation unit 110 of the servo control device 100 can be simplified.

The functional blocks included in the servo control device 100 and the machine learning device 200 have been described above.

In order to realize these functional blocks, the servo control device 100 and the machine learning device 200 each include an arithmetic processing device such as a CPU (Central Processing Unit). The servo control device 100 and the machine learning device 200 each also include an auxiliary storage device, such as a hard disk drive (HDD), that stores various control programs such as application software and an operating system (OS), and a main storage device, such as a RAM (Random Access Memory), that stores data temporarily required when the arithmetic processing device executes the programs.

In the servo control device 100 and the machine learning device 200, the arithmetic processing device reads the application software and the OS from the auxiliary storage device, expands them in the main storage device, and performs arithmetic processing based on them. The various hardware of each device is also controlled based on the calculation results. The functional blocks of the present embodiment are realized in this way; that is, the present embodiment can be realized by cooperation of hardware and software.

Since the machine learning device 200 involves a large amount of computation associated with machine learning, high-speed processing can be achieved by, for example, mounting GPUs (Graphics Processing Units) on a personal computer and using them for the computation associated with machine learning, a technique called GPGPU (General-Purpose computing on Graphics Processing Units). For still higher-speed processing, a computer cluster may be constructed using a plurality of computers on which such GPUs are mounted, and parallel processing may be performed by the plurality of computers included in the computer cluster.

Next, the operation of the machine learning device 200 during Q learning in the present embodiment will be described with reference to the flow of fig. 7. The flow shown in fig. 7 relates to the learning of the angle θ, the radius r, and the coefficient b0 for determining the coefficients b0 to b2 of the numerator of the transfer function VFF(z) of the IIR filter 1092 of the velocity feedforward calculation unit 109.

In the following flow, the learning of the angle θ and the radius r of the zero point of the transfer function VFF(z), expressed in polar coordinates, and of the coefficient b0, for determining the coefficients b0 to b2 of the numerator of the transfer function VFF(z) of the IIR filter 1092 of the velocity feedforward calculation unit 109, is described as an example; however, the angle θ and the radius r of the pole, expressed in polar coordinates, for determining the coefficients a1, a2 of the denominator are learned by the same procedure. The flow relating to the learning of the angle θ and the radius r of the zero point and the pole of the transfer function PFF(z), expressed in polar coordinates, and of the coefficient d0, for determining the coefficients c1, c2, d0 to d2 of the transfer function PFF(z) of the IIR filter 1102 of the position feedforward calculation unit 110, is the same as the flow described in fig. 7, and its description is therefore omitted.

In step S11, the state information acquisition unit 201 acquires the state information S from the servo control device 100. The acquired state information is output to the value function update unit 2022 and the behavior information generation unit 2023. As described above, the state information S is information corresponding to the state in Q learning, and includes the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092 at the time of step S11. In the state information S0 at which learning is first started, the coefficient b0 relating to the numerator of the transfer function VFF(z) of the IIR filter 1092 and the coefficients a1, a2 relating to the denominator are initial settings and are set to fixed values. In this way, the set PD(S) of positional deviations corresponding to the predetermined feed rate and the circular machining shape is obtained with the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092 at their initial values.

The values PD(S0) of the positional deviation in the state S0, at the time point when the servo control device 100 is operated by the machining program used at the time of learning and Q learning is first started, are obtained from the subtractor 102. The position command generation unit 101 sequentially outputs position commands for the predetermined machining shape designated by the machining program, for example an octagonal machining shape. The position command values corresponding to the octagonal machining shape are output from the position command generation unit 101 to the subtractor 102, the velocity feedforward calculation unit 109, the position feedforward calculation unit 110, and the machine learning device 200. The subtractor 102 determines the difference between the position command value and the detected position output from the integrator 108, between the position A1 and the position A3 and between the position B1 and the position B3 of the machining shape, as the positional deviation PD(S0), and outputs it to the machine learning device 200. Alternatively, the machine learning device 200 may itself extract, as the positional deviation PD(S0), the difference between the position command value and the detected position output from the integrator 108 at the positions A1 and A3 and at the positions B1 and B3 of the machining shape.

In step S12, the behavior information generation unit 2023 generates new behavior information A based on the angle θ, the radius r, and the coefficient b0, with the zero point expressed in polar coordinates by the radius r and the angle θ as described above, and outputs the generated new behavior information A to the servo control device 100 via the behavior information output unit 203. The behavior information generation unit 2023 outputs the new behavior information A according to the above-described policy.

The servo control device 100 that has received the behavior information A drives a machine tool including the servo motor 300 in the state S' obtained by correcting the coefficients b1', b2' of the transfer function VFF(z) of the IIR filter 1092 in the current state S based on the received behavior information. As described above, this behavior information corresponds to the behavior A in Q learning.
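To make the role of the corrected coefficients concrete, here is a sketch of a second-order IIR filter applied to a sampled input, assuming the transfer function is normalized to H(z) = (b0 + b1'·z⁻¹ + b2'·z⁻²)/(1 + a1·z⁻¹ + a2·z⁻²), which is equivalent, up to normalization, to the positive-power form used in equation 4:

```python
def iir_second_order(x, b, a):
    """Direct-form I realization of a second-order IIR filter with
    numerator coefficients b = (b0, b1, b2) and denominator
    coefficients a = (a1, a2)."""
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y
```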

In step S13, the state information acquisition unit 201 acquires, from the subtractor 102, the positional deviation PD(S') in the new state S', and acquires the coefficients b1', b2' of the transfer function VFF(z) of the IIR filter 1092. In this way, the state information acquisition unit 201 acquires the set PD(S') of positional deviations corresponding to the octagonal machining shape (specifically, at the position A1 and the position A3 and at the position B1 and the position B3 of the machining shape) with the coefficients of the transfer function VFF(z) in the state S'. The acquired state information is output to the reward output unit 2021.

In step S14, the reward output unit 2021 determines the magnitude relationship between the value f(PD(S')) of the positional deviation in the state S' and the value f(PD(S)) of the positional deviation in the state S. When f(PD(S')) > f(PD(S)), the reward is set to a negative value in step S15. When f(PD(S')) < f(PD(S)), the reward is set to a positive value in step S16. When f(PD(S')) = f(PD(S)), the reward is set to zero in step S17. The negative and positive values of the reward may also be weighted.

When any one of step S15, step S16, and step S17 has finished, the value function update unit 2022 updates, in step S18, the value function Q stored in the value function storage unit 204 based on the reward value calculated in that step. Step S18 illustrates online updating, but batch updating or mini-batch updating may be performed instead of online updating.

Next, in step S19, when the learning of the angle θ has not finished, the process returns to step S11; when it has finished, the process proceeds to step S20.

Next, in step S20, when the learning of the radius r has not finished, the process returns to step S11; when it has finished, the process proceeds to step S21.

Next, in step S21, when the learning of the coefficient b0 has not finished, the process returns to step S11; when it has finished, the process ends.

By returning to step S11 and repeating the above-described processing of steps S11 to S21, the value function Q converges to an appropriate value. The learning of the angle θ, the radius r, and the coefficient b0 may be ended on the condition that the above-described processing has been repeated a predetermined number of times or for a predetermined time.

As described above, by the operations described with reference to fig. 7 and the like, the machine learning device 200 described as an example in the present embodiment searches, within predetermined ranges, the radius r and the angle θ that express in polar coordinates the zero points and the poles of the respective transfer functions of the IIR filter 1092 and the IIR filter 1102, and learns them; this has the effect that the learning time required for optimizing the coefficients of the respective transfer functions of the IIR filter 1092 and the IIR filter 1102 can be further shortened.

Next, the operation of the optimized behavior information output unit 205 in generating the optimized behavior information will be described with reference to the flow of fig. 8.

First, in step S31, the optimized behavior information output unit 205 acquires the value function Q stored in the value function storage unit 204. As described above, the value function Q is a function updated by the value function update unit 2022 performing Q learning.

In step S32, the optimized behavior information output unit 205 generates optimized behavior information from the value function Q, and outputs the generated optimized behavior information to the IIR filter 1092 of the servo control device 100.

Through the above operations, the machine learning device 200 optimizes the angle θ, the radius r, and the coefficient b0 of the zero point, expressed in polar coordinates, for determining the coefficients of the numerator of the transfer function VFF(z) of the IIR filter 1092, and thereafter learns the angle θ and the radius r of the pole, expressed in polar coordinates, for determining the coefficients a1, a2 of the denominator. The coefficients c1, c2, d0 to d2 of the transfer function PFF(z) of the IIR filter 1102 are determined by the same operation: the angle θ and the radius r of the zero point and the pole of the transfer function PFF(z) are learned by expressing them in polar coordinates, and the coefficient d0 is learned and optimized.

Further, by the operation described with reference to fig. 8, in the present embodiment, optimized behavior information may be generated from the value function Q obtained by the learning of the machine learning device 200, and, based on this optimized behavior information, the servo control device 100 can simplify the adjustment of the currently set coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092 and reduce the value of the positional deviation. The velocity feedforward may also initially be set to a higher order, and the value of the positional deviation can then be further reduced by the learning of the machine learning device 200. For the coefficients c1, c2, d0 to d2 of the transfer function PFF(z) of the IIR filter 1102, the value of the positional deviation can be reduced in the same manner as in the adjustment of the coefficients a1, a2, b0 to b2 of the transfer function VFF(z) of the IIR filter 1092.

In the present embodiment, the reward output unit 2021 calculates the reward value by comparing the value f(PD(S)) of the positional deviation in the state S, calculated from the preset evaluation function f with the positional deviation PD(S) in the state S as input, with the value f(PD(S')) of the positional deviation in the state S', calculated from the evaluation function f with the positional deviation PD(S') in the state S' as input.

However, elements other than the positional deviation may also be used in the calculation of the reward value.

For example, in addition to the positional deviation, which is the output of the subtractor 102, at least one of the velocity command subjected to position feedforward control, which is the output of the adder 104, the difference between that velocity command and the velocity feedback, the torque command subjected to feedforward control, which is the output of the adder 107, and the like may be given to the machine learning device 200.

The feedforward calculation unit may also comprise only one of the position feedforward calculation unit and the velocity feedforward calculation unit. In that case, for example, when only the position feedforward calculation unit is provided, the second-order differentiator 1091, the IIR filter 1092, and the adder 107 are not required.

Each of the components included in the servo control unit of the servo control device and in the machine learning device described above can be realized by hardware, by software, or by a combination thereof. The servo control method performed by the cooperation of the components included in the servo control device can likewise be realized by hardware, by software, or by a combination thereof. Here, realization by software means realization by a computer reading and executing a program.

Various types of non-transitory computer-readable recording media (non-transitory computer readable media) can be used to store the program and provide it to a computer. Non-transitory computer-readable recording media include various types of tangible storage media. Examples of non-transitory computer-readable recording media include magnetic recording media (e.g., magnetic disks, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, semiconductor memory (e.g., mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM)), flash ROM, and RAM (Random Access Memory).

The above embodiment is a preferred embodiment of the present invention, but the scope of the present invention is not limited to the above embodiment, and various modifications may be made without departing from the spirit of the present invention.

< Modification of servo control device having machine learning device >

In the above-described embodiment, the machine learning device 200 is configured by a device separate from the servo control device 100, but a part or all of the functions of the machine learning device 200 may be realized by the servo control device 100.

< Degree of freedom of system structure >

Fig. 9 is a block diagram showing another configuration of the control device. As shown in fig. 9, the control device 10A includes n servo control devices 100-1 to 100-n, n machine learning devices 200-1 to 200-n, and a network 500, where n is an arbitrary natural number of 2 or more. The n servo control devices 100-1 to 100-n correspond to the servo control device 100 shown in fig. 1, and the n machine learning devices 200-1 to 200-n correspond to the machine learning device 200 shown in fig. 1.

Here, the servo control device 100-1 and the machine learning device 200-1 form a one-to-one pair and are communicably connected. The servo control devices 100-2 to 100-n and the machine learning devices 200-2 to 200-n are connected in the same manner as the servo control device 100-1 and the machine learning device 200-1. In fig. 9, the n pairs of servo control devices 100-1 to 100-n and machine learning devices 200-1 to 200-n are connected via the network 500, but the servo control device and the machine learning device of each pair may instead be directly connected via a connection interface. These n pairs of servo control devices 100-1 to 100-n and machine learning devices 200-1 to 200-n may be installed in the same factory, or may each be installed in a different factory.

The network 500 is, for example, a LAN (Local Area Network) constructed in a factory, the Internet, a public telephone network, or a combination thereof. The specific communication method in the network 500, such as whether a wired or wireless connection is used, is not particularly limited.

In the control device of fig. 9, the machine learning devices 200-1 to 200-n and the servo control devices 100-1 to 100-n are communicably connected in one-to-one pairs, but, for example, one machine learning device 200-1 may be communicably connected to a plurality of servo control devices 100-1 to 100-m (m < n or m = n) via the network 500 and perform machine learning for each of the servo control devices 100-1 to 100-m.

In this case, the functions of the machine learning device 200-1 may be distributed, as a distributed processing system, appropriately among a plurality of servers. The functions of the machine learning device 200-1 may also be realized by a virtual server function or the like on the cloud.

In addition, when there are a plurality of machine learning devices 200-1 to 200-n corresponding to a plurality of servo control devices 100-1 to 100-n of the same model name, the same specification, or the same series, the learning results of the machine learning devices 200-1 to 200-n can be shared. In this way, a more suitable model can be constructed.

In the embodiment of the present invention, the case where the transfer functions of the IIR filters 1092 and 1102 are second-order has been described as an example, but the transfer functions are not limited to second order as described in the embodiment, and may be of third order or higher.
