Method and device for operating a robot

Document No.: 1929913  Publication date: 2021-12-07

Description: The technology "Method and device for operating a robot" was designed and created by S. Hoppe, M. Giftthaler and R. Krug on 2021-06-02. The invention relates to a method and a device for operating a robot (102), wherein a first part of an adjustment quantity for operating the robot (102) for a transition of the robot (102) from a first state to a second state is determined depending on the first state of the robot (102) and/or of the environment of the robot (102) and depending on an output of a first model; wherein a second part of the adjustment quantity is determined from the first state and independently of the first model; wherein a quality level is determined with a second model from the first state and from the output of the first model; wherein at least one parameter of the first model is determined on the basis of the quality level; wherein at least one parameter of the second model is determined on the basis of the quality level and a theoretical value; wherein the theoretical value is determined based on a reward that is assigned to the transition from the first state to the second state.

1. A method for operating a robot (102), characterized by determining (304), from a first state of the robot (102) and/or an environment of the robot (102), and from an output of a first model, a first part of an adjustment quantity for maneuvering the robot (102) for a transition of the robot (102) from the first state to a second state; wherein a second portion of the adjustment quantity is determined (306) from the first state and independently of the first model; wherein a quality level is determined (310) with a second model according to the first state and according to an output of the first model; wherein at least one parameter of the first model is determined (312) in dependence on the quality level; wherein at least one parameter of the second model is determined (316) from the quality level and a theoretical value; wherein the theoretical value is determined (314) from a reward assigned to the transition from the first state to the second state.

2. The method according to claim 1, characterized by determining (300) at least one force and at least one moment acting on an end effector (108) of the robot (102), wherein the first state and/or the second state is determined (302) from the at least one force and the at least one moment.

3. The method of claim 2, wherein the first state and/or the second state are defined (302) with respect to an axis, wherein a force causes the end effector (108) to move in a direction of the axis, wherein a moment causes the end effector (108) to rotate about the axis.

4. A method according to claim 3, characterized by determining (302) the following vector: the vector defines a constant part of the adjustment quantity, wherein the vector defines a first force, a second force, a third force, a first moment, a second moment and a third moment, wherein different axes are defined for the forces, wherein each moment is assigned a further axis of the different axes.

5. The method according to claim 4, characterized in that the first model comprises a first function approximator, in particular a first Gaussian process or a first artificial neural network (204), wherein a first portion of the vector defines (304) an input to the first function approximator, wherein the input is defined (306) independently of a second portion of the vector.

6. The method according to claim 4 or 5, characterized in that the second model comprises a second function approximator, in particular a second Gaussian process or a second artificial neural network (206), wherein the vector defines an input for the second function approximator.

7. The method according to any one of claims 3 to 6, characterized in that a vector defining (304, 306) the adjustment quantity is determined, wherein the vector defines a first force, a second force, a third force, a first moment, a second moment and a third moment, wherein different axes are defined for the forces, wherein a further axis of the different axes is assigned to each moment, wherein a first part of the vector is defined (306) independently of an output of the first artificial neural network (204), in particular constantly, the first model comprising the first artificial neural network (204), wherein a second part of the vector is defined (304) depending on the output of the first artificial neural network (204).

8. The method of any of the above claims, wherein the end effector (108) comprises at least one finger (114), the at least one finger (114) having a section complementary to the workpiece (110), a surface of the section being embodied to facilitate gripping or self-centering.

9. The method according to any of the preceding claims, characterized in that the theoretical value is determined (310) according to a limit, wherein the limit is determined according to a graph in which nodes define states of the robot (102), wherein according to the first state a sub-graph of the graph is determined, which sub-graph comprises a first node representing the first state, wherein the limit is determined according to values assigned to those nodes of the sub-graph that lie on a path from the first node to a second node, the second node representing a final state of the robot (102).

10. The method according to claim 9, characterized by determining (308) the graph according to at least one state of the robot (102), wherein edges defining an inconsequential action are assigned to nodes that are leaves in the graph and are not assigned to the final state of the robot (102).

11. The method according to claim 10, characterized in that the inconsequential action is assigned (308) an, in particular constant, value for the first part of the adjustment quantity, whereby the varying part of the resulting strategy is left out of consideration when determining the limit.

12. Method according to any of the preceding claims, characterized in that the theoretical value is determined (314) according to a predefined limit.

13. The method according to any of the preceding claims, characterized in that for training the first artificial neural network (204) a cost function is determined from the output of the second artificial neural network (206), wherein the parameters of the first artificial neural network (204) are learned in the training, the cost function having a smaller value for the parameters of the first artificial neural network (204) than for the further parameters.

14. The method according to any of the preceding claims, characterized in that in the training for the second artificial neural network (206), for an output of the second artificial neural network (206), a cost function is defined from the output of the second artificial neural network (206) and the theoretical value, wherein parameters of the second artificial neural network (206) are learned, the cost function having a smaller value for the parameters of the second artificial neural network (206) than for further parameters.

15. An apparatus (104) for operating a robot (102), characterized in that the apparatus (104) is configured to carry out the method according to any one of the preceding claims.

16. A computer program, characterized in that the computer program comprises instructions which, when executed by a computer, cause the steps of the method according to any one of claims 1 to 14 to be carried out.

Technical Field

The invention relates to a method and a device for operating a robot.

Background

Robots are employed in a variety of industrial applications. The motion strategy that the robot implements in the application can be predefined by a controller in a closed control loop or by an agent (Agenten) that learns and specifies the strategy either with the aid of a model or independently of a model.

Disclosure of Invention

By means of the device and the method according to the independent claims, applications can be realized that are improved with respect to conventional applications.

A method for operating a robot provides that a first part of an adjustment quantity is determined, depending on a first state of the robot and/or of an environment of the robot and depending on an output of a first model, for operating the robot for a transition of the robot from the first state to a second state; a second part of the adjustment quantity is determined from the first state and independently of the first model; a quality level is determined with a second model from the first state and from the output of the first model; at least one parameter of the first model is determined depending on the quality level; at least one parameter of the second model is determined depending on the quality level and a theoretical value; and the theoretical value is determined from a reward assigned to the transition from the first state to the second state. In this way, a strategy for operating the robot is used that reaches the target particularly effectively, without divergences occurring that disturb the learning process.

It may be provided that at least one force and at least one moment are determined, which act on the end effector of the robot, wherein the first state and/or the second state is determined as a function of the at least one force and the at least one moment.

Preferably, the first state and/or the second state are defined with respect to an axis, wherein the force causes the end effector to move in the direction of the axis and the moment causes the end effector to rotate about the axis. The manipulation is particularly effective for exploratory and industrial applications. In this way, exploration (i.e. in particular the random trying of new actions) becomes safer. It can be ensured that neither the robot nor the object to be manipulated nor persons standing nearby are harmed.

It may be provided that the following vectors are determined: the vector defines a constant portion of the adjustment quantity, wherein the vector defines a first force, a second force, a third force, a first moment, a second moment and a third moment, wherein different axes are defined for the forces, wherein each moment is assigned a further one of the different axes. Vectors are particularly well suited for describing states and for steering.

The first model may comprise a first function approximator, in particular a first gaussian process or a first artificial neural network, wherein a first part of the vector defines an input thereto, wherein the input is defined independently of a second part of the vector.

The second model may comprise a second function approximator, in particular a second gaussian process or a second artificial neural network, wherein the vector defines an input for this.

It is possible to provide: a vector defining an adjustment quantity is determined; wherein the vector defines a first force, a second force, a third force, a first moment, a second moment, and a third moment; wherein different axes are defined for the forces, wherein each moment is assigned a further one of the different axes; wherein a first part of the vector is defined, in particular constantly, independently of an output of a first artificial neural network, a first model comprising said first artificial neural network; wherein a second portion of the vector is defined based on an output of the first artificial neural network. In this way, the robot can be controlled in a predefined manner and with constant variables and can thus be moved more quickly into the final state in a task-dependent manner.

Preferably, the end effector comprises at least one finger having a section complementary to the workpiece, the surface of said section expediently being embodied to be gripping or self-centering (selbstzentrierend). In this way, a particularly good hold (Halt) can be achieved.

The theoretical value may be determined from a limit, wherein the limit is determined from a graph in which nodes define states of the robot; wherein a subgraph of the graph is determined from the first state, the subgraph comprising a first node representing the first state; wherein the bounds are determined according to values assigned to those nodes of the subgraph that lie on a path from the first node to a second node, the second node representing a final state of the robot. In such a subgraph, the Q-values assigned to the subgraph can be determined analytically. These Q-values may be used as lower bounds.

Preferably, the graph is determined according to at least one state of the robot, wherein edges defining an inconsequential (folgenlose) action are assigned to nodes that are leaves in the graph and are not assigned to the final state of the robot. In this way, a cause of divergence in the learning process is avoided.

The inconsequential action can be assigned an, in particular constant, value for the first part of the adjustment quantity. In this way, a limit that is particularly well suited for avoiding divergence during the learning process is determined. In some cases, embedding the inconsequential action makes it possible to determine a limit at all; in other cases, a higher lower bound can be determined than without the inconsequential action.

The theoretical value is preferably determined according to predefined limits. In this way, domain knowledge related to the task can be taken into account.

In order to train a first artificial neural network, a cost function may be determined from the output of a second artificial neural network, wherein parameters of the first artificial neural network for which the cost function has a smaller value than for further parameters are learned in the training.

In training for the second artificial neural network, for an output of the second artificial neural network, a cost function may be defined from the output of the second artificial neural network and theoretical values, wherein parameters of the second artificial neural network are learned for which the cost function has a smaller value than for further parameters.

Device for operating a robot, characterized in that the device is configured to carry out the method according to any one of the preceding claims.

Drawings

Further advantageous embodiments emerge from the following description and the drawings. In the drawings:

figure 1 shows a schematic view of a robot and an apparatus for operating the robot,

figure 2 shows a schematic view of a part of the apparatus,

fig. 3 shows steps in a method for operating a robot.

Detailed Description

Fig. 1 schematically shows a robot 102 and a device 104 for operating the robot 102. The robot 102 is configured to grasp a first workpiece 106 with an end effector, in this example a gripping device 108. The robot 102 can move into a number of different poses p in the workspace 110. In this example, the pose p can be described in terms of a three-dimensional Cartesian coordinate system. The origin 112 of the coordinate system is in this example arranged centrally between two fingers 114 of the gripping device 108, with which fingers 114 the first workpiece 106 can be gripped. Other arrangements of the Cartesian coordinate system are likewise possible.

The cartesian coordinate system utilizes coordinates x, y, z to define a position for motion in the workspace 110.

In fig. 1, in the workspace 110, a second workpiece 116 is shown. In this example, an opening 118 is provided in the second workpiece 116, the opening 118 being configured to receive the first workpiece 106.

The first workpiece 106 is, for example, a shaft, in particular a motor shaft. For example, the second workpiece 116 is a ball bearing configured to receive a shaft. The ball bearing may be arranged in the motor housing.

The robot 102 is configured to move the first workpiece 106 in the workspace 110 on a trajectory according to a strategy such that the first workpiece 106 is received in the second workpiece 116 at the end of the trajectory.

The device 104 includes at least one processor 120 and at least one memory 122 for instructions which, when executed by the at least one processor 120, cause the method described below to be carried out. At least one graphics processing unit (GPU) can also be provided, with which a function approximator, in particular an artificial neural network, can be trained particularly efficiently. The at least one processor 120 and the at least one memory 122 may be implemented as one or more microprocessors. The device 104 may be disposed outside the robot 102, or the device 104 may be integrated into the robot 102. Data lines may be provided for communication between the processor, the memory, the controls and the robot 102. These data lines are not shown in Fig. 1.

The apparatus 104 may include an output device 124, the output device 124 configured to operate the robot 102. The output device 124 may include an output stage or communication interface for manipulating one or more actuators of the robot 102.

Fig. 2 schematically shows part of the device 104.

The device 104 comprises an, in particular autonomous, agent 202, said agent 202 being configured to interact with its environment.

At each discrete time step t, the agent 202 observes a state s_t and performs an action a_t according to a policy π. The action a_t determines the next state s_{t+1}. After each action, the agent 202 receives a reward r_t.

The environment is modeled in this example as a Markov decision process with states, actions, transition dynamics and a reward function. The transition dynamics may be stochastic.

The future reward is the expected value of the sum of the rewards r_{t+k}, discounted by a factor γ, that is reached when, starting from the state s_t, the policy π is followed.

For the behavior in the first step independently of the policy π, the Q-value can be considered. The Q-value Q^π(s_t, a_t) is the expected value of the sum of the future rewards that is reached when the action a_t is performed at the instantaneous time step t and the policy π is followed from then on.

The goal pursued by the agent 202 is to determine an optimal policy for reaching the final state. It can be provided that an indication (Angabe) d is determined which states whether the robot 102 has completed its task, i.e. has reached a final state. With an optimal policy, an action a_t is selected for each current state s_t such that the expected reward over the future states is maximized. In this example, the expected reward of the future states is considered as a sum over the future states, weighted with the discount γ, and this sum defines the expected reward.

The agent 202 can accomplish this by determining a Q-function for the environment and selecting, at each time step t, the action a_t that maximizes the Q-value of the Q-function.

A function approximator is used and optimized for the Q-function. The Q-function can be determined, for example, from a Q-value estimated from experience (the observed reward r_t) and from the instantaneous estimate of the Q-function for the following state, i.e. from a target value y_t = r_t + γ Q(s_{t+1}, π(s_{t+1})).

This is referred to as temporal-difference learning. It can be provided that, based on the indication d_{t+1}, it is determined whether the state s_{t+1} is a final state.
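
A minimal sketch of this target computation, assuming illustrative names (`actor` and `critic` stand for the first and second function approximators; the calling convention is not prescribed by the text):

```python
def td_target(reward, next_state, done, actor, critic, gamma=0.99):
    """Temporal-difference target y_t = r_t + gamma * Q(s_{t+1}, pi(s_{t+1})).

    The bootstrap term is dropped when the indication d_{t+1} marks a final state.
    """
    if done:
        return reward
    return reward + gamma * critic(next_state, actor(next_state))
```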

These functions may be approximated by artificial neural networks. For example, two artificial neural networks are used. The first artificial neural network 204 represents a deterministic policy π and is referred to as the actor network. The second artificial neural network 206 determines the Q-value based on the state s_t and the action a_t at its input and is referred to as the critic network. This approach is called Deep Deterministic Policy Gradient and is described, for example, in T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, "Continuous control with deep reinforcement learning" (arXiv preprint arXiv:1509.02971, 2015).

Deep deterministic policy gradients are a model-free method for continuous state and action spaces. To avoid instabilities in the learning process with nonlinear neural networks, limits for the Q-value can be determined. These limits are employed during the learning process, whereby the learning process remains stable.

During the learning process, data about the interaction of the agent with the environment is stored in a replay memory (Wiedergabespeicher). Instead of a list of transitions, rewards and an indication of whether a final state has been reached being stored in the replay memory, the data are processed as follows. It is also possible to store such a list and then to continuously derive new graphs from the list. A transition in this example comprises the state s_t, the action a_t and the state s_{t+1} that is reached by performing the action a_t in the state s_t. The reward r_t achieved thereby and the indication d_{t+1} are assigned to this transition. Instead of a list of transitions, a data graph is provided in which a first node defines the state s_t and a second node defines the state s_{t+1} that is reached by performing the action a_t in the state s_t. An edge between the first node and the second node in the graph defines in this case the action a_t. The reward r_t and the indication d_{t+1} are assigned to the edge. Different transitions in the data graph can lead to a divergence of the learning process with different probabilities. For temporal-difference learning, this probability is related to the structure of the data graph.
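
A minimal sketch of such a graph-structured replay memory (the names and the representation of states as hashable tuples are illustrative assumptions):

```python
from collections import defaultdict

class TransitionGraph:
    """Replay memory organised as a data graph (sketch; names illustrative).

    Nodes are states (assumed hashable, e.g. tuples of discretised forces and
    moments); each outgoing edge carries the action a_t, the reward r_t and the
    indication d_{t+1}."""

    def __init__(self):
        self.edges = defaultdict(list)  # state -> [(action, reward, done, next_state), ...]

    def add_transition(self, state, action, reward, done, next_state):
        self.edges[state].append((action, reward, done, next_state))

    def leaves(self):
        """Non-final states that were reached but have no outgoing edge."""
        reached = {nxt for outgoing in self.edges.values()
                   for (_, _, done, nxt) in outgoing if not done}
        return [node for node in reached if node not in self.edges]
```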

The probability that a transition from which the final state is reached leads to divergence is minimal compared to the probability for another transition from which the final state is not reached. States from which the final state can be reached via a path in the graph result in a lower probability of divergence of the learning process than states from which the final state cannot be reached in the graph.

The lower bound can be determined starting from the data graph. For example, subgraphs of the data graph are determined for which all Q-values assigned to the subgraph can be determined analytically, assuming the subgraph is complete. These Q-values may be used as lower bounds for the data graph. Other lower and upper bounds may be defined a priori, i.e. through domain knowledge, for example about the robot 102, the first workpiece 106, the second workpiece 116, the task to be solved and/or the reward function used.
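
For a node whose rewards along a complete path to a final state are known, such an analytic value is simply the discounted return along that path; a sketch (assuming deterministic transitions, which the text does not state explicitly):

```python
def path_q_lower_bound(path_rewards, gamma=0.99):
    """Discounted return along a complete path to a final state.

    path_rewards: rewards of the edges from the considered node up to the final
    node, in order. Assuming deterministic transitions, this achieved return is
    a lower bound on the optimal Q-value of the path's first edge.
    """
    q = 0.0
    for r in reversed(path_rewards):  # backwards accumulation: Q = r + gamma * Q_next
        q = r + gamma * q
    return q
```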

One possibility for using the limits is to bound the Q-function with a lower bound LB and an upper bound UB. A bounded Q-function is thus obtained: Q_bounded(s_t, a_t) = max(LB, min(UB, Q(s_t, a_t))).

In the training for the second artificial neural network 206, the cost function for the output of the second artificial neural network 206 can be defined as the mean squared error between this output and the theoretical value.

in this example, the goal of the training is to learn parameters of the second artificial neural network 206 for which the cost function has a smaller value than for the other parameters. The cost function is minimized, for example, in a gradient descent method.

It can be provided that nodes in the data graph from which no edge defining an action originates are equipped with an inconsequential action, that is to say a zero action. Further nodes can also be equipped with an inconsequential action. The inconsequential action is in this example an action that leaves the state of the robot 102 unchanged, for example by a predefined acceleration of zero.

In this way, leaves in the data graph from which no further action originates are avoided. A lower bound can then be determined for each transition, since every path ends in an infinite loop. More lower bounds can be determined with the inconsequential action than without it. In this way, the lower bounds can be tighter overall.
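
Continuing the graph sketch above, non-final leaves could be closed off with such self-loops (the reward assigned to the self-loop and the representation of the zero action are illustrative assumptions):

```python
def add_zero_actions(graph, zero_action, loop_reward=0.0):
    """Attach an inconsequential self-loop to every non-final leaf of the graph,
    so that every path can be continued and a bound exists for every transition.
    The reward assigned to the self-loop (here 0.0) is an illustrative assumption."""
    for node in graph.leaves():
        graph.add_transition(node, zero_action, loop_reward, False, node)
```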

As input to the first artificial neural network 204, a first part of the variables defining the state s_t is used in this example. The input of the first artificial neural network 204 is defined independently of a second part of the variables defining the state s_t.

The variable may be an estimated variable. These variables may also be measured variables or may be calculated or estimated from measured variables.

In this example, the state s_t is defined by the estimated force in the x-direction, the estimated force in the y-direction, the estimated force in the z-direction, the estimated moment of a rotation about an axis extending in the x-direction, the estimated moment of a rotation about an axis extending in the y-direction and the estimated moment of a rotation about an axis extending in the z-direction. These forces and moments act on the gripping device 108 in the instantaneous pose p. In this example, the inputs to the first artificial neural network 204 are determined in dependence on a part of these estimated variables and independently of the remaining estimated variables. The remaining part of the variables is not used as input to the first artificial neural network 204 in this example. This reduces the input dimension and thus makes the first artificial neural network 204 smaller, so that it can be trained more quickly or more simply. In other applications, other ones of these variables can be determined and used for this purpose, without further variables being used. This is particularly advantageous for a movement of the robot 102 in which the first workpiece 106 is inserted into the opening 118, the opening 118 extending in the z-direction.

In this example, the action is defined by a first part, which is in particular constant, and by a second part. In the example of the movement of the gripping device 108, the first part defines the force in the x-direction, the force in the y-direction, the force in the z-direction and the moment for a rotation about an axis extending in the z-direction. The force in the z-direction can, for example, be a predefined constant value. In this example, this means that the robot 102 moves the gripping device 108 continuously along the z-axis of the gripping device 108. The second part defines in this example a first theoretical value for a moment about an axis extending in the x-direction and/or a second theoretical value for a moment about an axis extending in the y-direction. In this example, the first theoretical value is determined from a first output variable of the first artificial neural network 204 and the second theoretical value is determined from a second output variable of the first artificial neural network 204. In this example, each of these output variables is scaled to the interval [-3, 3] Nm.

In short, in this example an adjustment quantity ζ is determined for the action a_t. In this example, the first part is set to constant values; the first part can also be provided with other values. If the robot 102 is to be operated differently, the adjustment quantity can also be structured differently. In this example, the adjustment quantity ζ is output to a regulator 208 for setting a new pose p of the gripping device 108; the regulator 208 operates the robot 102 until the new pose p is reached. In this example, the action a_t ends when the regulator 208 for the gripping device 108 has reached a steady state, for example when a speed falls below a predefined threshold speed. It may be provided that the regulator 208 is configured to determine a state of the robot 102. It may be provided that the regulator 208 is configured to set the indication d to a first value (e.g. 1) when the robot 102 has reached a final state. It may be provided that the indication d is initialized with a further value or is otherwise set to another value (e.g. 0). The speed used for determining the indication d can be measured, or it can be calculated or estimated from measured variables.

In this way, the robot 102 exerts a constant force in the z-direction on the first workpiece 106, with which the workpiece 106 is moved in the z-direction. By means of the moments about the x-direction and the y-direction, the robot 102 orients the gripping device 108.
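
A sketch of how the adjustment quantity for this insertion movement could be composed from the constant part and the two learned moment setpoints (the component ordering [F_x, F_y, F_z, M_x, M_y, M_z] and the concrete constant force value are illustrative assumptions):

```python
import numpy as np

def compose_adjustment(m_xy, f_z=5.0):
    """Compose the adjustment quantity zeta from a constant part and the learned part.

    m_xy: the two learned moment setpoints about x and y in Nm (already scaled to [-3, 3]).
    f_z: assumed constant downward force in N (illustrative value).
    """
    m_x, m_y = float(m_xy[0]), float(m_xy[1])
    return np.array([0.0, 0.0, f_z, m_x, m_y, 0.0])  # [F_x, F_y, F_z, M_x, M_y, M_z]
```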

An example of another application is a movement of the robot 102 in which the first workpiece 106 is to be screwed into the opening 118. In this case, a continuous rotational movement about an axis extending in the z-direction may be of interest. This can be taken into account as follows: the inputs to the first artificial neural network 204 are determined independently of the estimated force in the x-direction, independently of the estimated force in the y-direction and independently of the estimated force in the z-direction, in dependence on the estimated moment about the axis extending in the x-direction and on the estimated moment about the axis extending in the y-direction, and independently of the estimated moment about the axis extending in the z-direction. The first output of the first artificial neural network 204 can in this case be a first theoretical value for the moment about the axis extending in the x-direction, and the second output can be a second theoretical value for the moment about the axis extending in the y-direction. In this case, the further variables for determining the adjustment quantity ζ are specified independently of the first artificial neural network 204.

The reward r_t can be predefined by a first reward function, which assigns one value for the reward to the transition into the final state and another value for the reward to every other transition.

The reward r_t can also be predefined by a second reward function, which assigns a value to a transition as a function of the distance between the instantaneous pose p of the gripping device 108 and the pose p of the gripping device 108 in the final state. In this example, the final state is reached when the second workpiece 116 receives the first workpiece 106 in the opening 118 provided for this purpose.

In this example, a position error e_p is used for determining a first reward contribution. The position error e_p is determined in this example as the Euclidean norm of a difference vector. The difference vector is determined, for example, from the position of the instantaneous pose p of the gripping device 108 and the position of the pose p of the gripping device 108 in the final state. In this example, an orientation error e_o is used for determining a second reward contribution. The orientation error e_o is determined in this example as the norm of the angular error for a rotation about the x-axis and the angular error for a rotation about the y-axis between the orientation of the instantaneous pose p of the gripping device 108 and the orientation of the pose p of the gripping device 108 in the final state. In this example, no rotation about the z-axis is considered. The rotation about the z-axis can be taken into account in another task (Aufgabenstellung).

In this example, a formula with an adjustable first parameter (= 0.015 in this example) and an adjustable second parameter (= 0.7 in this example) is used to determine the second reward function from the position error e_p and the orientation error e_o. The reward r_t thereby remains in the interval [-1, 0] in this example.
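
The formula itself is not reproduced in the text; a sketch of reward functions of this kind, with the combination of the two error terms and the concrete sparse values chosen as illustrative assumptions, might look as follows:

```python
import numpy as np

def dense_reward(e_p, e_o, c1=0.015, c2=0.7):
    """Illustrative dense reward in [-1, 0] from a position error e_p (in m) and an
    orientation error e_o (in rad); c1 and c2 are the adjustable parameters from the
    text, while the weighting chosen here is an assumption."""
    return float(np.clip(-(e_p / c1 + c2 * e_o), -1.0, 0.0))

def sparse_reward(done):
    """Illustrative sparse reward: 0 on a transition into the final state, -1 otherwise
    (concrete values assumed, consistent with rewards in [-1, 0])."""
    return 0.0 if done else -1.0
```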

With such a setup, divergence can easily occur during the learning process, since large path lengths occur in the associated data graph between the initial state and the final state.

In order to avoid divergence during the learning process (i.e. during training), it is provided that an inconsequential action is used during the learning process and that, in addition, an upper bound UB and a lower bound LB are used. In this example, the upper bound UB and the lower bound LB are defined from the minimum reward and the maximum reward. In this example, a lower bound that follows from the minimum reward (for rewards in [-1, 0] and discount γ, for example -1/(1 - γ)) and an upper bound of 0 are set for the Q-function, so that the bounded Q-function described above applies.

in this example, an unproductive action is defined for the second portion of the adjustment amount. In the case of the above-mentioned example,is defined as a no result action. In other scenarios, an unproductive action may also be defined for another portion of the adjustment amount.

In this example, a learning device 210 is provided, the learning device 210 being configured to determine the reward r_t based on at least one of the reward functions. The learning device 210 is in this example configured to determine the value of the Q-function. It may be provided that the learning device 210 is configured to evaluate the indication d_{t+1} and, when the state s_{t+1} reached with the action a_t is a final state, to determine the value of the Q-function independently of the Q-value of that state, and otherwise to determine it based on this Q-value. It may be provided that the learning device 210 is configured to limit the value of the Q-function by means of the lower bound LB or the upper bound UB.

To train the first artificial neural network 204, a cost function can be defined based on the Q-value at the output of the second artificial neural network 206.

The goal of the training is in this example to learn parameters of the first artificial neural network 204 for which the cost function has a smaller value than for further parameters. Due to the bounding of the Q-value, the Q-value itself is negative in this example. The cost function is thus minimized in this example in a gradient descent method.

Hereinafter, training of the first artificial neural network 204 and the second artificial neural network 206 is described. For training, an Adam optimizer may be employed.

The first artificial neural network 204 (i.e. the actor network) comprises three fully connected layers in this example, of which the two hidden layers each comprise 100 neurons. The first artificial neural network 204 includes an input layer for the forces and moments describing the state s_t; in this example, the estimated variables are used. The forces and moments can in this example be mapped linearly to values in the range [-1, +1]. The first artificial neural network 204 includes an output layer for the forces and moments defining the action a_t. These layers include in this example the tanh activation function. The weights for these layers can be initialized randomly, in particular with a Glorot uniform distribution. The output of the output layer defines in this example the adjustment quantity ζ for the action a_t. In this example, the output layer is two-dimensional. The first output defines in this example the moment about the x-direction; the second output defines in this example the moment about the y-direction. In this example, the first part of the adjustment quantity is specified constantly, independently of the first artificial neural network 204. The first output and the second output define the second part of the adjustment quantity.

The second artificial neural network 206 (i.e. the critic network) comprises three fully connected layers in this example, of which the two hidden layers each comprise 100 neurons. The second artificial neural network 206 includes an input layer for the forces and moments describing the state s_t and for the action a_t. These forces and moments can in this example be mapped linearly to values in the interval [-1, +1]. The second artificial neural network 206 includes a one-dimensional output layer. The output layer does not comprise a nonlinearity in this example, in particular no activation function. At the output of the second artificial neural network 206, the Q-value is output. The other layers include the ReLU activation function in this example. The weights for these layers can be initialized randomly, in particular with a He uniform distribution.
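
A sketch of the two networks in PyTorch under the layer sizes, activations and initializations described above (the input dimension, the placement of the tanh on the output layer and the hidden-layer activations of the actor are assumptions where the text is ambiguous):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """First artificial neural network 204: two hidden layers of 100 neurons,
    two outputs (moments about x and y) with tanh, scaled to [-3, 3] Nm."""
    def __init__(self, state_dim=6, max_torque=3.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 100), nn.ReLU(),   # hidden activations assumed
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 2), nn.Tanh(),
        )
        self.max_torque = max_torque
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)   # Glorot uniform

    def forward(self, state):
        return self.max_torque * self.net(state)

class Critic(nn.Module):
    """Second artificial neural network 206: two hidden layers of 100 neurons
    with ReLU, one linear output (the Q-value)."""
    def __init__(self, state_dim=6, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 1),
        )
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_uniform_(m.weight)  # He uniform

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```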

During training, it can be provided that the gripping device 108 starts from an initial pose. The starting pose can be, for example, one of 8 possible predefined starting poses. In this example, one training round (an episode) ends either when the final state has been reached or after a predefined number of calculation steps t = T (e.g. T = 1000). In this example, the training is performed over a predefined number of rounds (e.g. 40 rounds). In this example, the weights of the first artificial neural network 204 and/or of the second artificial neural network 206 are updated after a predefined number of calculation steps t = N (e.g. N = 20) in the round. After the training, a test phase can be carried out. In the test phase, a predefined number of rounds, in this example 8 rounds, can be performed. In the test phase, the starting pose can differ from the starting pose or poses used in training.

In the training, different actions are performed by the first artificial neural network 204 depending on the weights of the first artificial neural network 204. Depending on the weights of the second artificial neural network 206, the Q-function takes different values. In this example, the training proceeds by adapting the weights of the first artificial neural network 204 and/or of the second artificial neural network 206. The goal of the training is, for example, to adapt the weights of the first artificial neural network 204 and/or of the second artificial neural network 206 such that, for the action determined by the first artificial neural network 204, the value of the Q-function determined by the second artificial neural network 206 is larger than for other weights. For example, weights are determined for the first artificial neural network 204 and/or the second artificial neural network 206 for which the value of the Q-function is maximal. To adapt the weights, a cost function is used in this example which is defined as a function of the weights of the first artificial neural network 204 and/or of the second artificial neural network 206.

In this example, a cost function is defined based on the outputs of the first artificial neural network 204 and of the second artificial neural network 206. In this example, the weights of the first artificial neural network 204 and/or of the second artificial neural network 206 are adapted according to the value of the gradient of the cost function and according to a learning rate, which defines what influence the value of the gradient of the cost function has on each of the weights. Different learning rates can be selected for the first artificial neural network 204 and for the second artificial neural network 206. Preferably, the first learning rate of the first artificial neural network 204 is smaller than the second learning rate of the second artificial neural network 206. Advantageously, the first learning rate is one tenth of the second learning rate.

In addition to the first learning rate and the second learning rate, the above-mentioned predefined numbers of calculation steps and/or rounds can also be varied as hyperparameters of the training.
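
A compact training-loop sketch under these hyperparameters (40 rounds, T = 1000 steps, updates every N = 20 steps, actor learning rate one tenth of the critic learning rate); the environment interface, the replay sampling and the standard DDPG actor objective of maximizing the critic output are assumptions:

```python
import torch

def train(env, actor, critic, replay, episodes=40, T=1000, N=20,
          gamma=0.99, critic_lr=1e-3, lb=None, ub=0.0):
    """Training-loop sketch; names and interfaces are illustrative."""
    actor_opt = torch.optim.Adam(actor.parameters(), lr=critic_lr / 10)   # first learning rate
    critic_opt = torch.optim.Adam(critic.parameters(), lr=critic_lr)      # second learning rate
    for _ in range(episodes):
        state, done, t = env.reset(), False, 0                  # assumed environment interface
        while not done and t < T:
            with torch.no_grad():
                action = actor(state)
            next_state, reward, done = env.step(action)
            replay.add_transition(state, action, reward, done, next_state)
            state, t = next_state, t + 1
            if t % N == 0:
                s, a, r, s_next, d = replay.sample_batch()       # assumed sampling helper
                with torch.no_grad():                            # bounded temporal-difference target
                    y = r + gamma * (1 - d) * critic(s_next, actor(s_next)).squeeze(-1)
                    if lb is not None or ub is not None:
                        y = y.clamp(min=lb, max=ub)
                critic_loss = torch.nn.functional.mse_loss(critic(s, a).squeeze(-1), y)
                critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
                actor_loss = -critic(s, actor(s)).mean()         # maximize the Q-value of the actor's action
                actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```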

Preferably, the first reward function is used, since the first reward function manages with less computational memory. This is advantageous in particular in embedded systems. The first reward function is a sparse reward function, which is generally simpler to define than the second reward function. In this way, the user can complete the training setup more quickly in practice. This is advantageous in particular in industrial environments in which the robot 102 is to be trained for a task.

With this method, the following advantages are achieved in the training compared with the method according to the deep deterministic policy gradient:

higher robustness (Robustheit) against variations of the hyperparameters,

the final state is reached more reliably,

smaller variance when the weights are initialized with different random seeds,

a higher robustness with respect to variations of the reward function,

higher robustness with respect to limited memory (e.g. in embedded systems),

safer exploration.

Fig. 3 schematically shows steps in a method for operating the robot 102.

In step 300, at least one force and at least one moment acting on the end effector 108 of the robot 102 are determined. It can be provided that the estimated variables, or the corresponding variables, are determined as described above.

In step 302, a first state s_t is determined based on the at least one force and the at least one moment.

The first state s_t is defined in this example with respect to an axis, wherein the force causes the end effector 108 to move in the direction of the axis and the moment causes the end effector 108 to rotate about the axis.

In this example, a vector defining the first state s_t is determined, wherein the vector defines a first force, a second force, a third force, a first moment, a second moment and a third moment, wherein different axes are defined for the forces and each moment is assigned a further one of the different axes. The vector is defined, for example, by the estimated variables.

In step 304, depending on the first state s_t of the robot 102 and/or of the environment of the robot 102 and depending on the output of the first model, a first part of the adjustment quantity is determined for operating the robot 102 for a transition of the robot 102 from the first state s_t to a second state s_{t+1}.

The first model includes, for example, a first artificial neural network 204. The first portion of the vector defines the input to the first artificial neural network 204 in this example, for determining the first portion of the adjustment quantity. In this example, the input to the first artificial neural network 204 is defined independently of the second portion of the vector.

In step 306, a second part of the adjustment quantity is determined from the first state s_t and independently of the first model.

In this example, a vector defining the adjustment amount is determined. The vector includes a first force, a second force, a third force, a first moment, a second moment, and a third moment, wherein different axes are defined for the forces. Each moment is assigned a different one of the axes.

In this example, the first portion of the vector is defined, in particular constantly, independent of the output of the first artificial neural network 204. In this example, a second portion of the vector is defined based on the output of the first artificial neural network 204.

In this example, the vector for the adjustment quantity ζ is determined as described above.

In step 308, a data graph is determined based on at least one state of the robot 102. In this example, the most recently recorded transition is used to supplement the data graph. Edges defining an inconsequential action may be assigned to nodes that are leaves in the data graph and are not assigned to the final state of the robot 102. Optionally, the inconsequential action is assigned an, in particular constant, value for the first part of the adjustment quantity.

In step 310, a quality level is determined using the second model based on the first state and based on the output of the first model.

The second model includes a second artificial neural network 206 in this example. The vector defines the input of the second artificial neural network 206 in this example. The output of the second artificial neural network 206 defines a quality level in this example.

In step 312, parameters of the first model are determined based on the quality level. To this end, training is performed as previously described for the first artificial neural network 204.

In step 314, the theoretical value is determined according to the reward that is assigned to the transition from the first state to the second state.

In this example, the theoretical value is determined according to the limit. The limits are determined from the data graph. In this example, a subgraph of the graph is determined from the first state and, as described, the bounds are determined from the Q-values assigned to the nodes of the subgraph.

Optionally, in step 314, the theoretical value is determined on the basis of predefined limits, which take into account domain knowledge in relation to the task.

In step 316, at least one parameter of the second model is determined based on the quality level and the theoretical value. To this end, training is performed as previously described for the second artificial neural network 206.

The steps of the method may be repeated for a number of rounds and/or epochs, in this or another order, in order to train the first artificial neural network 204 towards an optimal strategy that maximizes the Q-value of the second artificial neural network 206. It may be provided that the robot 102 is operated according to the adjustment quantity ζ that is determined with the optimal strategy.
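
A minimal sketch of how one pass through steps 300-316 could be organized (all names are illustrative and refer to the helpers sketched above; the environment interface is an assumption):

```python
def method_iteration(env, actor, critic, graph, state):
    """One pass through steps 300-316 (sketch)."""
    learned_part = actor(state)                 # step 304: part of zeta determined with the first model
    zeta = compose_adjustment(learned_part)     # step 306: remaining, constant part added independently
    next_state, reward, done = env.step(zeta)   # transition from the first state to the second state
    graph.add_transition(state, learned_part, reward, done, next_state)  # step 308
    quality = critic(state, learned_part)       # step 310: quality level from the second model
    # steps 312-316: adapt both models using the quality level and the theoretical value,
    # as in the training-loop sketch above
    return next_state, done, quality
```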

The end effector 108 may include at least one finger 114, the at least one finger 114 having a section complementary to the first workpiece 106, the surface of the section being self-centering. In this way, a particularly good hold can be achieved under a constant downward pressure. Self-centering is also particularly important in the case of high moments that do not act vertically downwards with respect to the workspace 110. In this way, a twisting of the object that could otherwise occur is avoided.
