Dual-arm cooperative grasping method based on deep learning

Document No.: 1945860 | Publication date: 2021-12-10

Reading note: This technology, "一种基于深度学习的双臂协作抓取方法" (A dual-arm cooperative grasping method based on deep learning), was created by 王万良, 钱宇彤, 尤文波, 屠杭垚, 陈嘉诚, 潘杰 and 赵燕伟 on 2021-08-17. Abstract: The invention discloses a dual-arm cooperative grasping method based on deep learning. The method first constructs a grasp detection model for the right arm and trains it on a dataset until the model can recognize grasp positions on unknown objects. A lightweight occlusion detection model for the left arm is then obtained by adding channel-pruning training on top of the grasp detection model, which greatly increases detection speed. A camera captures the operating scene, and the images are resized and preprocessed so that the models can work on them effectively. Finally, the detected information is communicated to the robot through the ROS system according to a fixed logic, so as to remove occluding objects and grasp the target object. The invention brings deep learning and the idea of dual-arm cooperation into unstructured object grasping and improves grasping efficiency; the lightweight left-arm occlusion detection model obtained by pruning training further increases the speed of operation.

1. A dual-arm cooperative grasping method based on deep learning, comprising the following steps:

S1: constructing a grasp detection model for the right robotic arm to realize grasp pose estimation, specifically comprising the following steps:

S11: designing the backbone network part of the grasp detection model: the backbone network part is based on the Darknet-53 network and extracts target features;

S12: designing the feature fusion part of the grasp detection model: the feature fusion part adopts an FPN-like upsampling and fusion scheme to fuse three different feature scales, (w/8, h/8), (w/16, h/16) and (w/32, h/32) for an input image of size (w, h); independent grasp position detection is performed on the feature fusion map of each scale, improving the network's detection accuracy at smaller grasp positions;

S13: designing the output part of the grasp detection model: the output part adopts the idea of multi-scale prediction, so that the feature maps of the three scales from S12 each pass through a series of convolution operations that reduce the number of channels to a fixed value, after which the grasp box is predicted on this basis; the grasp box is defined as:

R = {x, y, w, h, confidence, c, α} (1)

wherein x and y respectively represent the abscissa and ordinate of the center point of the grasp box, w and h respectively represent its width and height, and confidence represents its confidence; the prediction of the robotic arm's grasp rotation angle θ is converted into a joint problem of predicting the grasp box class c and regressing a partial specific angle α, defined as:

θ = c × δ + α (2)

wherein c ∈ {0, 1, 2, ..., N} and N ∈ ℕ; the rotation angle θ is divided from 0° to 180° into categories at intervals of δ; α represents the partial specific angle predicted by the regression branch of the detection model, with α ∈ [0°, δ);

S14: designing the loss function of the grasp detection model: the loss function is divided into a center-offset loss (L_xy), a width-height loss (L_wh), a confidence loss (L_con), a class loss (L_class), and a partial specific-angle loss (L_α) for the rotation angle; these losses are defined as:

L_xy = λ_xy ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} [BCE(σ(x̂_i^j), x_i^j) + BCE(σ(ŷ_i^j), y_i^j)] (3)

wherein S² represents the number of grids and B represents the number of candidate boxes generated by each grid; I_{ij}^{obj} indicates whether the j-th candidate box of the i-th grid is responsible for predicting the object: if so, I_{ij}^{obj} = 1, otherwise I_{ij}^{obj} = 0; σ(·) represents the sigmoid function; S², B, σ(·) and I_{ij}^{obj} carry the same meanings below; λ_xy represents the weighting coefficient of the center-offset loss; x̂_i^j and ŷ_i^j respectively represent the predicted abscissa and ordinate of the center point of the j-th candidate box of the i-th grid, while x_i^j and y_i^j represent the center coordinates of the corresponding ground-truth box; this loss adopts the BCE loss function;

L_wh = λ_wh ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} [(ŵ_i^j − w_i^j)² + (ĥ_i^j − h_i^j)²] (4)

wherein λ_wh represents the weighting coefficient of the width-height loss; ŵ_i^j and ĥ_i^j respectively represent the predicted width and height of the j-th candidate box of the i-th grid, while w_i^j and h_i^j represent the width and height of the corresponding ground-truth box; this loss adopts the MSE loss function;

L_con = λ_con ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} BCE(σ(Ĉ_i^j), C_i^j) + λ_noobj ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{noobj} BCE(σ(Ĉ_i^j), C_i^j) (5)

wherein λ_con represents the weighting coefficient of the confidence loss; since most of the image does not contain a grasp box to be detected, a no-object weighting coefficient λ_noobj is introduced to counteract the model's tendency to predict that no object lies within a grid; Ĉ_i^j represents the predicted confidence of the j-th candidate box of the i-th grid, while C_i^j represents the confidence of the corresponding ground-truth box;

L_class = λ_class ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} BCE(P̂_i^j, P_i^j) (6)

wherein λ_class represents the weighting coefficient of the class loss; P̂_i^j represents the predicted class of the j-th candidate box of the i-th grid, while P_i^j represents the class of the corresponding ground-truth box;

L_α = λ_α ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} SmoothL1(α̂_i^j − α_i^j) (7)

wherein λ_α represents the weighting coefficient of the partial specific-angle loss; α̂_i^j represents the predicted partial specific angle of the j-th candidate box of the i-th grid, while α_i^j represents the partial specific angle of the corresponding ground-truth box; this loss adopts the Smooth L1 function;

therefore, the total loss of the model is defined as:

L_detect = L_xy + L_wh + L_con + L_class + L_α (8)

S2: constructing an occlusion detection model for the left robotic arm to realize pose estimation of the occluding object to be removed, specifically comprising the following steps:

S21: designing the occlusion detection model network: the end effector of the left robotic arm is a suction cup that relies on the siphon effect; that is, removing an occluding object does not require the grasp box to be computed as accurately as in the right-arm grasp; in this case, trading a small loss of detection accuracy for an increase in detection speed is a good strategy; the occlusion detection model network is therefore obtained by channel pruning of the right arm's grasp detection network, reducing the redundancy of the network parameters; the specific pruning steps are as follows:

S22: in the model network of S21, every convolutional layer except those of the output part is followed by a BN layer; the convolutional outputs are normalized with a scale factor, specifically defined as follows:

ẑ = γ · (z − μ_B) / √(σ_B² + ε) + β (9)

wherein z and ẑ denote a channel activation before and after normalization, σ_B² and μ_B respectively represent the variance and the mean within a mini-batch, and γ and β respectively represent the trainable scale factor and offset;

S23: in order to reasonably and effectively distinguish important channels from unimportant ones, channel sparsity training is performed on the scale factor γ of S22 using the sign(·) function, specifically defined as follows:

L = L_detect + ξ ∑_{γ∈Γ} |γ|, with subgradient ∂L/∂γ = ∂L_detect/∂γ + ξ · sign(γ) (10)

wherein ξ is a penalty factor balancing the detection loss against channel sparsity, and Γ is the set of scale factors of all BN layers;

S24: introducing a global threshold η and a safety threshold μ; the global threshold determines whether a feature channel is pruned, thereby controlling the pruning ratio; the safety threshold prevents excessive pruning of the channels of a convolutional layer and preserves the integrity of the network connections; a channel is pruned when its scale factor γ of S23 is smaller than the minimum of η and μ;

S25: initializing a pruning mask for every convolutional layer according to the global and safety thresholds; the maxpool and upsample layers of the original network structure are discarded; each route layer concatenates, in order, the pruning masks of the layers feeding into it and uses the result as its own pruning mask; all layers connected by a shortcut are ensured to have the same number of channels;

S26: on the basis of S25, fine-tuning the parameters of the occlusion detection model and performing manual pruning on the training set;

S27: iterating S22 to S26 until the result is reasonable and effective;

S3: during real-time detection with the target detection models of S1 and S2, first preprocessing the size of the input images acquired by the camera; the target detection models then predict the grasp poses of the object to be grasped and of the occluding object in the pixel coordinate system;

S31: designing the grasp box screening algorithm of the model: the screening algorithm resolves the problem of multiple grasp boxes being predicted for the same image, extracting the grasp box with the highest confidence in the detection result by local maximum search;

S4: adopting the ROS system as the communication system for operating the two robotic arms;

S41: invoking the left-arm occlusion detection model to identify whether an occluding object is present; if so, the position and rotation angle of the occlusion's grasp box are sent over the ROS topic 'shadow' to the left-arm system subscribed to that topic, so that the left arm carries out the removal of the occluding object; if not, proceeding directly to S43;

S42: invoking the occlusion detection model again to confirm that no occluding object remains; if one is found, repeating the operation of S41;

S43: invoking the right-arm grasp detection model to identify the grasp box of the object to be grasped; if one is found, the position and rotation angle of the grasp box are sent over the ROS topic 'box' to the right-arm system subscribed to that topic, so that the right arm carries out the grasping operation; if not, exiting the program;

S44: repeating the operation of S43 until the process terminates.

2. The dual-arm cooperative grasping method based on deep learning according to claim 1, characterized in that the algorithm of step S31 specifically comprises:

S311: sorting the candidate grasp boxes by confidence;

S312: selecting the grasp box with the highest confidence as the initial retained box, and deleting every box whose IOU with it exceeds a set threshold;

S313: repeating the operation of S312 until all grasp boxes have been traversed.

Technical Field

The invention belongs to the fields of intelligent robot control and computer vision, and relates to a dual-arm cooperative grasping method based on deep learning.

Background

With the development of science, technology, and society, the application fields of robots continue to expand, and robots are gradually entering service fields such as biochemical medicine, household cleaning, and medical rehabilitation. In these new application fields, traditional robots cannot meet users' requirements, and developing intelligent collaborative robots that can blend into their environment is currently a popular, cutting-edge topic at universities and research institutions at home and abroad. Against this background, research into methods that break through the limitations of existing robots in autonomously picking up and placing objects in cluttered environments is of great significance.

Traditional robots usually work in structured environments, using manually designed features and three-dimensional models of the objects to be grasped to build a database and store grasping experience. This approach runs into severe limitations in today's increasingly common unstructured environments, such as homes and pharmacies, which contain large numbers of unknown objects along with complicating factors such as variation in the size of the object to be grasped and overlapping occlusion between objects. A common remedy is a deep-learning-based method, such as a neural network, in which a deep learning model is obtained by iterative training; the grasping experience is then transferred to the grasping of unknown objects by relying on the robustness and strong generalization ability of the deep learning model.

On the one hand, network structures have grown more and more complex in pursuit of higher grasp box detection accuracy, which lengthens prediction time and raises hardware requirements; on the other hand, a single robotic arm grasps inefficiently when occluding objects are present. How to apply deep learning while keeping the network efficient and real-time, and how to use dual-arm cooperation to improve grasping efficiency, are therefore problems that collaborative robots urgently need to solve.

Disclosure of Invention

In view of the above problems, the invention provides a dual-arm cooperative grasping method based on deep learning. The method reduces the number of parameters of the network model without sacrificing grasping speed or accuracy, so as to meet the requirement of real-time grasping and ensure the model's autonomous grasping ability in unstructured environments.

The invention achieves this aim through the following technical scheme: a dual-arm cooperative grasping method based on deep learning, comprising the following steps:

S1: construct a grasp detection model for the right robotic arm to realize grasp pose estimation, specifically comprising the following steps:

S11: design the backbone network part of the grasp detection model: the backbone network part is based on the Darknet-53 network and extracts target features.

S12: design the feature fusion part of the grasp detection model: the feature fusion part adopts an FPN-like upsampling and fusion scheme to fuse three different feature scales, (w/8, h/8), (w/16, h/16) and (w/32, h/32) for an input image of size (w, h). Independent grasp position detection is performed on the feature fusion map of each scale, improving the network's detection accuracy at smaller grasp positions.

S13: design the output part of the grasp detection model: the output part adopts the idea of multi-scale prediction, so that the feature maps of the three scales from S12 each pass through a series of convolution operations that reduce the number of channels to a fixed value, after which the grasp box is predicted on this basis. The grasp box is defined as:

R = {x, y, w, h, confidence, c, α} (1)

wherein x and y respectively represent the abscissa and ordinate of the center point of the grasp box, w and h respectively represent its width and height, and confidence represents its confidence. The invention converts the prediction of the robotic arm's grasp rotation angle θ into a joint problem of predicting the grasp box class c and regressing a partial specific angle α, defined as:

θ = c × δ + α (2)

wherein c ∈ {0, 1, 2, ..., N} and N ∈ ℕ. The rotation angle θ is divided from 0° to 180° into categories at intervals of δ; α represents the partial specific angle predicted by the regression branch of the detection model, with α ∈ [0°, δ).

S14: design the loss function of the grasp detection model: the loss function is divided into a center-offset loss (L_xy), a width-height loss (L_wh), a confidence loss (L_con), a class loss (L_class), and a partial specific-angle loss (L_α) for the rotation angle. These losses are defined as:

L_xy = λ_xy ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} [BCE(σ(x̂_i^j), x_i^j) + BCE(σ(ŷ_i^j), y_i^j)] (3)

wherein S² represents the number of grids and B represents the number of candidate boxes generated by each grid. I_{ij}^{obj} indicates whether the j-th candidate box of the i-th grid is responsible for predicting the object: if so, I_{ij}^{obj} = 1; otherwise I_{ij}^{obj} = 0. σ(·) represents the sigmoid function. S², B, σ(·) and I_{ij}^{obj} carry the same meanings below. λ_xy represents the weighting coefficient of the center-offset loss; x̂_i^j and ŷ_i^j respectively represent the predicted abscissa and ordinate of the center point of the j-th candidate box of the i-th grid, while x_i^j and y_i^j represent the center coordinates of the corresponding ground-truth box. This loss adopts the BCE loss function.

L_wh = λ_wh ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} [(ŵ_i^j − w_i^j)² + (ĥ_i^j − h_i^j)²] (4)

wherein λ_wh represents the weighting coefficient of the width-height loss; ŵ_i^j and ĥ_i^j respectively represent the predicted width and height of the j-th candidate box of the i-th grid, while w_i^j and h_i^j represent the width and height of the corresponding ground-truth box. This loss adopts the MSE loss function.

L_con = λ_con ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} BCE(σ(Ĉ_i^j), C_i^j) + λ_noobj ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{noobj} BCE(σ(Ĉ_i^j), C_i^j) (5)

wherein λ_con represents the weighting coefficient of the confidence loss. Since most of the image does not contain a grasp box to be detected, a no-object weighting coefficient λ_noobj is introduced to counteract the model's tendency to predict that no object lies within a grid. Ĉ_i^j represents the predicted confidence of the j-th candidate box of the i-th grid, while C_i^j represents the confidence of the corresponding ground-truth box.

L_class = λ_class ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} BCE(P̂_i^j, P_i^j) (6)

wherein λ_class represents the weighting coefficient of the class loss; P̂_i^j represents the predicted class of the j-th candidate box of the i-th grid, while P_i^j represents the class of the corresponding ground-truth box.

L_α = λ_α ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} SmoothL1(α̂_i^j − α_i^j) (7)

wherein λ_α represents the weighting coefficient of the partial specific-angle loss; α̂_i^j represents the predicted partial specific angle of the j-th candidate box of the i-th grid, while α_i^j represents the partial specific angle of the corresponding ground-truth box. This loss adopts the Smooth L1 function.

Therefore, the total loss of the model is defined as:

L_detect = L_xy + L_wh + L_con + L_class + L_α (8)

S2: construct an occlusion detection model for the left robotic arm to realize pose estimation of the occluding object to be removed, specifically comprising the following steps:

S21: design the occlusion detection model network: the end effector of the left robotic arm is a suction cup that relies on the siphon effect. That is, removing an occluding object does not require the grasp box to be computed as accurately as in the right-arm grasp. In this case, trading a small loss of detection accuracy for an increase in detection speed is a good strategy. The occlusion detection model network is therefore obtained by channel pruning of the right arm's grasp detection network, reducing the redundancy of the network parameters. The specific pruning steps are as follows:

S22: in the model network of S21, every convolutional layer except those of the output part is followed by a BN layer. The convolutional outputs are normalized with a scale factor, specifically defined as follows:

ẑ = γ · (z − μ_B) / √(σ_B² + ε) + β (9)

wherein z and ẑ denote a channel activation before and after normalization, σ_B² and μ_B respectively represent the variance and the mean within a mini-batch, and γ and β respectively represent the trainable scale factor and offset.

S23: in order to reasonably and effectively distinguish important channels from unimportant ones, the invention performs channel sparsity training on the scale factor γ of S22 using the sign(·) function, specifically defined as follows:

L = L_detect + ξ ∑_{γ∈Γ} |γ|, with subgradient ∂L/∂γ = ∂L_detect/∂γ + ξ · sign(γ) (10)

wherein ξ is a penalty factor balancing the detection loss against channel sparsity, and Γ is the set of scale factors of all BN layers.

S24: a global threshold η and a safety threshold μ are introduced. The global threshold determines whether a feature channel is pruned, thereby controlling the pruning ratio; the safety threshold prevents excessive pruning of the channels of a convolutional layer and preserves the integrity of the network connections. A channel is pruned when its scale factor γ of S23 is smaller than the minimum of η and μ.

S25: a pruning mask is initialized for every convolutional layer according to the global and safety thresholds. The maxpool and upsample layers of the original network structure are discarded. Each route layer concatenates, in order, the pruning masks of the layers feeding into it and uses the result as its own pruning mask. All layers connected by a shortcut are ensured to have the same number of channels.

S26: on the basis of S25, the occlusion detection model is further fine-tuned and manually pruned on the training set.

S27: S22 to S26 are iterated until the result is reasonable and effective.

S3: during real-time detection with the target detection models of S1 and S2, the size of the input images acquired by the camera is first preprocessed. The target detection models then predict the grasp poses of the object to be grasped and of the occluding object in the pixel coordinate system.

S31: design the grasp box screening algorithm of the model: the screening algorithm resolves the problem of multiple grasp boxes being predicted for the same image, extracting the grasp box with the highest confidence in the detection result by local maximum search.

S4: the invention adopts the ROS system as the communication system for operating the two robotic arms.

S41: the left-arm occlusion detection model is invoked to identify whether an occluding object is present. If so, the position and rotation angle of the occlusion's grasp box are sent over the ROS topic 'shadow' to the left-arm system subscribed to that topic, so that the left arm carries out the removal of the occluding object. If not, the process proceeds directly to S43.

S42: the occlusion detection model is invoked again to confirm that no occluding object remains. If one is found, the operation of S41 is repeated.

S43: the right-arm grasp detection model is invoked to identify the grasp box of the object to be grasped. If one is found, the position and rotation angle of the grasp box are sent over the ROS topic 'box' to the right-arm system subscribed to that topic, so that the right arm carries out the grasping operation. If not, the program exits.

S44: the operation of S43 is repeated until the process terminates.

Further, the algorithm of step S31 specifically comprises:

S311: sorting the candidate grasp boxes by confidence;

S312: selecting the grasp box with the highest confidence as the initial retained box, and deleting every box whose IOU with it exceeds a set threshold;

S313: repeating the operation of S312 until all grasp boxes have been traversed.

Compared with the prior art, the invention has the following advantages:

1. The two robotic arms use different models. The right arm uses a one-stage target detection model that turns grasp box prediction into a regression problem, directly regressing the center coordinates, the width and height, and the rotation angle of the grasp box. In unstructured environments the model's detection accuracy is unchanged while its detection speed is greatly improved, enabling real-time detection. The left arm applies a channel pruning strategy on top of the right arm's grasp detection model, roughly doubling the speed at nearly identical detection accuracy. This greatly reduces the total time consumed by dual-arm cooperative grasping.

2. Dual-arm cooperative grasping is used instead of single-arm grasping. On the one hand, the invention makes full use of the robot's resources; on the other hand, it offers a direction for future extensions in which the two arms simultaneously remove occluding objects and grasp objects.

Drawings

FIG. 1 is an overall flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the network structure of the right-arm grasp detection model in the method of the present invention.

FIG. 3 is a diagram of the process for generating the network structure of the left-arm occlusion detection model in the method of the present invention.

FIG. 4 is a schematic diagram of the grasp box of an object to be grasped in the method of the present invention.

Detailed Description

The invention is further described below with reference to FIGS. 1-4.

The flow chart of the method of the invention is shown in FIG. 1; the method comprises the following steps:

S1: construct a grasp detection model for the right robotic arm to realize grasp pose estimation, specifically comprising the following steps:

S11: design the backbone network part of the grasp detection model: the backbone network part is based on the Darknet-53 network and extracts target features.

S12: design the feature fusion part of the grasp detection model: the feature fusion part adopts an FPN-like upsampling and fusion scheme, fusing three different feature scales; independent grasp position detection is performed on the feature fusion map of each scale, improving the network's detection accuracy at smaller grasp positions. That is, when the input image size is (w, h), the three feature scales are (w/8, h/8), (w/16, h/16) and (w/32, h/32).

S13: design the output part of the grasp detection model: the output part adopts the idea of multi-scale prediction, so that the feature maps of the three scales from S12 each pass through a series of convolution operations that reduce the number of channels to a fixed value, after which the grasp box is predicted on this basis. The grasp box is defined as:

R = {x, y, w, h, confidence, c, α} (1)

wherein x and y respectively represent the abscissa and ordinate of the center point of the grasp box, w and h respectively represent its width and height, and confidence represents its confidence. The invention converts the prediction of the robotic arm's grasp rotation angle θ into a joint problem of predicting the grasp box class c and regressing a partial specific angle α, defined as:

θ = c × δ + α (2)

wherein c ∈ {0, 1, 2, ..., N} and N ∈ ℕ. The rotation angle θ is divided from 0° to 180° into categories at intervals of δ; α represents the partial specific angle predicted by the regression branch of the detection model, with α ∈ [0°, δ).
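By way of illustration, the following minimal Python sketch decodes a predicted class c and regressed residual α into the grasp rotation angle θ of Eq. (2); the interval value δ = 15° and the function name are assumptions for exposition only, not part of the claimed method.

```python
def decode_rotation(c: int, alpha: float, delta: float = 15.0) -> float:
    """Decode Eq. (2): theta = c * delta + alpha.

    c     -- predicted angle class, c in {0, 1, ..., N}
    alpha -- regressed partial specific angle, alpha in [0, delta)
    delta -- class interval in degrees (assumed value)
    """
    return c * delta + alpha

# Example: class 7 with residual 6.3 deg gives a 111.3 deg grasp rotation.
theta = decode_rotation(7, 6.3)
```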

S14: design the loss function of the grasp detection model: the loss function is divided into a center-offset loss (L_xy), a width-height loss (L_wh), a confidence loss (L_con), a class loss (L_class), and a partial specific-angle loss (L_α) for the rotation angle. These losses are defined as:

L_xy = λ_xy ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} [BCE(σ(x̂_i^j), x_i^j) + BCE(σ(ŷ_i^j), y_i^j)] (3)

wherein S² represents the number of grids and B represents the number of candidate boxes generated by each grid. I_{ij}^{obj} indicates whether the j-th candidate box of the i-th grid is responsible for predicting the object: if so, I_{ij}^{obj} = 1; otherwise I_{ij}^{obj} = 0. σ(·) represents the sigmoid function. S², B, σ(·) and I_{ij}^{obj} carry the same meanings below. λ_xy represents the weighting coefficient of the center-offset loss; x̂_i^j and ŷ_i^j respectively represent the predicted abscissa and ordinate of the center point of the j-th candidate box of the i-th grid, while x_i^j and y_i^j represent the center coordinates of the corresponding ground-truth box. This loss adopts the BCE loss function.

L_wh = λ_wh ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} [(ŵ_i^j − w_i^j)² + (ĥ_i^j − h_i^j)²] (4)

wherein λ_wh represents the weighting coefficient of the width-height loss; ŵ_i^j and ĥ_i^j respectively represent the predicted width and height of the j-th candidate box of the i-th grid, while w_i^j and h_i^j represent the width and height of the corresponding ground-truth box. This loss adopts the MSE loss function.

L_con = λ_con ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} BCE(σ(Ĉ_i^j), C_i^j) + λ_noobj ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{noobj} BCE(σ(Ĉ_i^j), C_i^j) (5)

wherein λ_con represents the weighting coefficient of the confidence loss. Since most of the image does not contain a grasp box to be detected, a no-object weighting coefficient λ_noobj is introduced to counteract the model's tendency to predict that no object lies within a grid. Ĉ_i^j represents the predicted confidence of the j-th candidate box of the i-th grid, while C_i^j represents the confidence of the corresponding ground-truth box.

L_class = λ_class ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} BCE(P̂_i^j, P_i^j) (6)

wherein λ_class represents the weighting coefficient of the class loss; P̂_i^j represents the predicted class of the j-th candidate box of the i-th grid, while P_i^j represents the class of the corresponding ground-truth box.

L_α = λ_α ∑_{i=0}^{S²−1} ∑_{j=0}^{B−1} I_{ij}^{obj} SmoothL1(α̂_i^j − α_i^j) (7)

wherein λ_α represents the weighting coefficient of the partial specific-angle loss; α̂_i^j represents the predicted partial specific angle of the j-th candidate box of the i-th grid, while α_i^j represents the partial specific angle of the corresponding ground-truth box. This loss adopts the Smooth L1 function.

Therefore, the total loss of the model is defined as:

L_detect = L_xy + L_wh + L_con + L_class + L_α (8)
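For concreteness, the following is a minimal PyTorch-style sketch of how the five terms of Eq. (8) could be combined; the tensor layout (predictions and targets indexed per candidate box), the dictionary keys, and the default weights are assumptions for exposition, not the claimed implementation.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred, target, obj_mask, noobj_mask,
                   lam_xy=1.0, lam_wh=1.0, lam_con=1.0,
                   lam_noobj=0.5, lam_class=1.0, lam_alpha=1.0):
    """Combine the five loss terms of Eq. (8).

    pred/target: dicts of tensors shaped (S*S, B, ...) with keys
    'xy', 'wh', 'conf', 'cls', 'alpha'; obj_mask marks the candidate
    boxes responsible for an object, noobj_mask the rest.
    """
    # Eq. (3): BCE on sigmoid-activated center offsets (targets assumed in [0, 1]).
    l_xy = lam_xy * F.binary_cross_entropy(
        torch.sigmoid(pred['xy'][obj_mask]), target['xy'][obj_mask], reduction='sum')
    # Eq. (4): MSE on width and height.
    l_wh = lam_wh * F.mse_loss(
        pred['wh'][obj_mask], target['wh'][obj_mask], reduction='sum')
    # Eq. (5): confidence BCE with a separate no-object weighting.
    l_con = lam_con * F.binary_cross_entropy(
        torch.sigmoid(pred['conf'][obj_mask]), target['conf'][obj_mask], reduction='sum') \
        + lam_noobj * F.binary_cross_entropy(
        torch.sigmoid(pred['conf'][noobj_mask]), target['conf'][noobj_mask], reduction='sum')
    # Eq. (6): BCE over the angle-class predictions.
    l_class = lam_class * F.binary_cross_entropy_with_logits(
        pred['cls'][obj_mask], target['cls'][obj_mask], reduction='sum')
    # Eq. (7): Smooth L1 on the partial specific angle.
    l_alpha = lam_alpha * F.smooth_l1_loss(
        pred['alpha'][obj_mask], target['alpha'][obj_mask], reduction='sum')
    return l_xy + l_wh + l_con + l_class + l_alpha
```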

S2: construct an occlusion detection model for the left robotic arm to realize pose estimation of the occluding object to be removed, specifically comprising the following steps:

S21: design the occlusion detection model network: the end effector of the left robotic arm is a suction cup that relies on the siphon effect. That is, removing an occluding object does not require the grasp box to be computed as accurately as in the right-arm grasp. In this case, trading a small loss of detection accuracy for an increase in detection speed is a good strategy. The occlusion detection model network is therefore obtained by channel pruning of the right arm's grasp detection network, reducing the redundancy of the network parameters. The specific pruning steps are as follows:

S22: in the model network of S21, every convolutional layer except those of the output part is followed by a BN layer. The convolutional outputs are normalized with a scale factor, specifically defined as follows:

ẑ = γ · (z − μ_B) / √(σ_B² + ε) + β (9)

wherein z and ẑ denote a channel activation before and after normalization, σ_B² and μ_B respectively represent the variance and the mean within a mini-batch, and γ and β respectively represent the trainable scale factor and offset.

S23: in order to reasonably and effectively distinguish important channels from unimportant ones, the invention performs channel sparsity training on the scale factor γ of S22 using the sign(·) function, specifically defined as follows:

L = L_detect + ξ ∑_{γ∈Γ} |γ|, with subgradient ∂L/∂γ = ∂L_detect/∂γ + ξ · sign(γ) (10)

wherein ξ is a penalty factor balancing the detection loss against channel sparsity, and Γ is the set of scale factors of all BN layers.
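The following is a minimal PyTorch sketch, under the assumption of standard network-slimming-style training, of how the ξ·sign(γ) subgradient of Eq. (10) could be applied to the BN scale factors; the function name and the value of ξ are illustrative only.

```python
import torch
import torch.nn as nn

def add_sparsity_subgradient(model: nn.Module, xi: float = 1e-4) -> None:
    """Sketch of S23: after loss.backward() and before optimizer.step(),
    add the subgradient of xi * sum(|gamma|), i.e. xi * sign(gamma),
    to the gradient of every BN scale factor."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.add_(xi * torch.sign(m.weight.data))
```

Called once per training step, this drives the scale factors of unimportant channels toward zero, so that the thresholds of S24 can identify them.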

S24: a global threshold η and a safety threshold μ are introduced. The global threshold determines whether a feature channel is pruned, thereby controlling the pruning ratio; the safety threshold prevents excessive pruning of the channels of a convolutional layer and preserves the integrity of the network connections. A channel is pruned when its scale factor γ of S23 is smaller than the minimum of η and μ.

S25: a pruning mask is initialized for every convolutional layer according to the global and safety thresholds. The maxpool and upsample layers of the original network structure are discarded. Each route layer concatenates, in order, the pruning masks of the layers feeding into it and uses the result as its own pruning mask. All layers connected by a shortcut are ensured to have the same number of channels.
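A minimal sketch of how the per-layer masks of S24-S25 could be derived from the BN scale factors follows; the pruning ratio, the concrete form of the safety threshold (keeping at least a fixed fraction of each layer's channels), and all names are assumptions for exposition.

```python
import torch

def init_prune_masks(bn_gammas, prune_ratio=0.5, safety_frac=0.1):
    """Sketch of S24-S25: one boolean keep-mask per convolutional layer.

    bn_gammas: list of 1-D tensors, the BN scale factors of each layer.
    """
    all_g = torch.cat([g.abs() for g in bn_gammas])
    # Global threshold eta: chosen so that `prune_ratio` of all channels fall below it.
    eta = torch.quantile(all_g, prune_ratio)
    masks = []
    for g in bn_gammas:
        # Safety threshold mu: the k-th largest gamma of this layer, so at
        # least `safety_frac` of its channels always survive.
        k = max(1, int(len(g) * safety_frac))
        mu = torch.topk(g.abs(), k).values[-1]
        # A channel is pruned only when gamma < min(eta, mu).
        masks.append(g.abs() >= torch.minimum(eta, mu))
    return masks
```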

S26: on the basis of S25, the occlusion detection model is further fine-tuned and manually pruned on the training set.

S27: S22 to S26 are iterated until the result is reasonable and effective.

S3: during real-time detection with the target detection models of S1 and S2, the size of the input images acquired by the camera is first preprocessed. The target detection models then predict the grasp poses of the object to be grasped and of the occluding object in the pixel coordinate system.
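As an illustration of the size preprocessing in S3, the following Python sketch letterbox-resizes a camera frame to a square network input; the input size of 416 (a value consistent with a Darknet-53 backbone) and the gray padding value are assumptions, not values stated by the method.

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray, input_size: int = 416) -> np.ndarray:
    """Resize the camera frame to input_size x input_size while keeping its
    aspect ratio, pad the remainder with gray, and scale pixels to [0, 1]."""
    h, w = frame.shape[:2]
    scale = input_size / max(h, w)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    canvas = np.full((input_size, input_size, 3), 128, dtype=np.uint8)
    top = (input_size - resized.shape[0]) // 2
    left = (input_size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas.astype(np.float32) / 255.0
```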

S31: design the grasp box screening algorithm of the model: the screening algorithm resolves the problem of multiple grasp boxes being predicted for the same image, extracting the grasp box with the highest confidence in the detection result by local maximum search.

The specific steps of the algorithm are as follows (a minimal sketch follows the list):

1. Sort the candidate grasp boxes by confidence;

2. Select the grasp box with the highest confidence as the initial retained box, and delete every box whose IOU with it exceeds a set threshold;

3. Repeat step 2 until all grasp boxes have been traversed.
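The following NumPy sketch implements the screening above using an axis-aligned IOU, ignoring the grasp rotation for simplicity; the threshold value and names are assumptions for exposition.

```python
import numpy as np

def screen_grasp_boxes(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """boxes: (N, 4) as (x_center, y_center, w, h); scores: (N,) confidences.
    Returns the indices of the retained grasp boxes."""
    x1 = boxes[:, 0] - boxes[:, 2] / 2
    y1 = boxes[:, 1] - boxes[:, 3] / 2
    x2 = boxes[:, 0] + boxes[:, 2] / 2
    y2 = boxes[:, 1] + boxes[:, 3] / 2
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]                          # step 2: retain the best remaining box
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # step 3: drop overlapping boxes, repeat
    return keep
```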

S4: the invention adopts the ROS system as the communication system for operating the two robotic arms.

S41: the left-arm occlusion detection model is invoked to identify whether an occluding object is present. If so, the position and rotation angle of the occlusion's grasp box are sent over the ROS topic 'shadow' to the left-arm system subscribed to that topic, so that the left arm carries out the removal of the occluding object. If not, the process proceeds directly to S43.

S42: the occlusion detection model is invoked again to confirm that no occluding object remains. If one is found, the operation of S41 is repeated.

S43: the right-arm grasp detection model is invoked to identify the grasp box of the object to be grasped. If one is found, the position and rotation angle of the grasp box are sent over the ROS topic 'box' to the right-arm system subscribed to that topic, so that the right arm carries out the grasping operation. If not, the program exits.

S44: the operation of S43 is repeated until the process terminates.
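To make the S41-S44 logic concrete, the following rospy sketch publishes the detected poses on the 'shadow' and 'box' topics named above; the use of geometry_msgs/Pose2D as the message type and the two detector callables are assumptions for exposition, not the claimed interface.

```python
import rospy
from geometry_msgs.msg import Pose2D  # (x, y, theta) fits a planar grasp pose

def dual_arm_loop(detect_occlusion, detect_grasp):
    """detect_occlusion / detect_grasp are assumed callables returning an
    (x, y, theta) tuple in pixel coordinates, or None when nothing is found."""
    rospy.init_node('dual_arm_grasp')
    shadow_pub = rospy.Publisher('shadow', Pose2D, queue_size=1)  # left arm subscribes
    box_pub = rospy.Publisher('box', Pose2D, queue_size=1)        # right arm subscribes
    rate = rospy.Rate(1)
    while not rospy.is_shutdown():
        occ = detect_occlusion()              # S41: look for an occluding object
        if occ is not None:
            shadow_pub.publish(Pose2D(x=occ[0], y=occ[1], theta=occ[2]))
            rate.sleep()
            continue                          # S42: re-check before grasping
        grasp = detect_grasp()                # S43: look for an object to grasp
        if grasp is None:
            break                             # nothing left: exit the program
        box_pub.publish(Pose2D(x=grasp[0], y=grasp[1], theta=grasp[2]))
        rate.sleep()                          # S44: repeat until done
```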

The network structure of the right-arm grasp detection model is shown in FIG. 2, where CBL denotes a convolution block consisting of a convolution kernel, a BN layer and a ReLU activation function.

The generation process of the left-arm occlusion detection model's network structure is shown in FIG. 3. First, the right-arm grasp detection network is pre-trained; then each BN layer is given a scale factor and sparsity training is carried out. The training results distinguish important channels from unimportant ones. The invention prunes away the unimportant channels (those with small scale factors) to obtain a semi-finished new network to be fine-tuned. This new network is then fine-tuned on its hyperparameters according to experience, producing a further semi-finished network. Finally, the new semi-finished network is evaluated; if its performance is not acceptable, iterative pruning continues until the result is reasonable and effective. At that point, an ultra-lightweight occlusion detection model network is obtained.

The grasp box of an object to be grasped is shown in FIG. 4. The two thicker border line segments represent the positions where the two fingers of the gripper fall when closing to grasp.

The description of the embodiments is merely exemplary in nature and is not intended to limit the invention to the particular forms set forth, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
