Object grabbing method and device

Document No.: 551884    Publication date: 2021-05-14

Note: This technology, "Object grabbing method and device", was designed and created by 段文杰, 夏冬青, 陈亚南, 耿嘉, 王正, 丁有爽 and 邵天兰 on 2021-02-05. Abstract: The invention discloses an object grabbing method and device. The method includes: acquiring, along a preset depth direction, a two-dimensional color image corresponding to a three-dimensional article area and a depth map corresponding to the two-dimensional color image; inputting the two-dimensional color image and the depth map into a deep learning model, and predicting, according to the output result, a plurality of graspable objects contained in the two-dimensional color image; calculating the three-dimensional pose information of each graspable object according to the point cloud information corresponding to the three-dimensional article area; and sorting the graspable objects along the preset depth direction according to the three-dimensional pose information, and determining the grasping order of the graspable objects according to the sorting result. Through the deep learning model, this approach can grasp arbitrary articles, and by sorting the graspable objects along the preset depth direction it ensures that, when multiple articles are stacked, the articles on the upper layer are grasped first, which guarantees grasping efficiency and improves grasping quality.

1. An object grabbing method, comprising:

acquiring a two-dimensional color image corresponding to a three-dimensional object area and a depth image corresponding to the two-dimensional color image along a preset depth direction;

inputting the two-dimensional color image and the depth image into a deep learning model, and predicting a plurality of graspable objects contained in the two-dimensional color image according to an output result;

calculating three-dimensional pose information of each object capable of being grabbed according to the point cloud information corresponding to the three-dimensional object area;

and sorting the graspable objects along the preset depth direction according to the three-dimensional pose information, and determining the grasping order of the graspable objects according to a sorting result.

2. The method of claim 1, wherein prior to performing the method, further comprising:

acquiring a sample image corresponding to a three-dimensional sample region, and determining a plurality of article objects contained in the sample image;

and respectively labeling the graspable region and the non-graspable region contained in the sample image according to the positional relationship among the plurality of article objects, and training the deep learning model according to the labeled sample image.

3. The method according to claim 2, wherein the labeling the graspable region and the non-graspable region included in the sample image according to the positional relationship among the plurality of article objects respectively comprises:

determining the stacking sequence of each article object along the preset depth direction, marking the region corresponding to the article object positioned at the top layer as a graspable region, and marking the region corresponding to the article object positioned at the bottom layer as a non-graspable region; and/or

according to the exposure proportion of each article object, marking the region corresponding to the article object with the exposure proportion larger than a preset threshold value as a graspable region, and marking the region corresponding to the article object with the exposure proportion not larger than the preset threshold value as a non-graspable region; and/or

determining a contact area contained in each article object according to the shape and/or type of each article object, marking an area corresponding to the article object of which the contact area is not shielded as a graspable area, and marking an area corresponding to the article object of which the contact area is shielded as a non-graspable area.

4. The method of claim 2 or 3, wherein the training of the deep learning model from the annotated sample images comprises:

taking the labeled sample images as an original training set, expanding the original training set by randomly adjusting picture brightness and/or contrast, applying affine transformations to the pictures, and/or applying white-balance transformations to the pictures, and training the deep learning model with the expanded training set obtained after the expansion.

5. The method of any one of claims 1 to 4, wherein the point cloud information is constructed from the two-dimensional color map and a depth map corresponding to the two-dimensional color map, and then the calculating of the three-dimensional pose information of each graspable object includes:

establishing a three-dimensional coordinate system corresponding to the three-dimensional article area; the directions of a first coordinate axis and a second coordinate axis contained in the three-dimensional coordinate system are matched with the two-dimensional color image, and the direction of a third coordinate axis in the three-dimensional coordinate system is matched with the preset depth direction;

and calculating the depth coordinate value of each grippable object corresponding to the third coordinate axis, and calculating the three-dimensional pose information of each grippable object according to the depth coordinate value.

6. The method of claim 5, wherein the preset depth direction comprises at least one of: a camera shooting direction, a gravity direction, and a direction of a line perpendicular to an article bearing surface, wherein the article bearing surface is a plane on which a carrier for placing the three-dimensional articles is located, and the carrier comprises: the ground, a tray, a conveyor belt, and/or a material basket;

the sorting the graspable objects along the preset depth direction according to the three-dimensional pose information, and determining the grasping order of the graspable objects according to the sorting result includes:

sorting the graspable objects according to their distance from the camera or from the article bearing surface, and determining the grasping order of the graspable objects according to the sorting result; wherein the closer a graspable object is to the camera, the earlier it is in the grasping order, and the farther it is from the camera, the later it is in the grasping order; or the closer a graspable object is to the article bearing surface, the later it is in the grasping order, and the farther it is from the article bearing surface, the earlier it is in the grasping order.

7. The method according to any one of claims 1 to 6, wherein after determining the grabbing order of each object to be grabbed according to the sorting result, the method further comprises:

acquiring a conversion relation between a camera coordinate system and a robot coordinate system;

and converting the three-dimensional pose information of each grippable object corresponding to the camera coordinate system into the robot coordinate system according to the conversion relation, and outputting the converted three-dimensional pose information of each grippable object to the robot so as to enable the robot to execute the gripping operation.

8. The method according to any one of claims 1 to 7, wherein the three-dimensional pose information is position information; the three-dimensional article area comprises a plurality of graspable objects stacked along a preset depth direction; wherein the grippable object comprises: cartons, plastic pouches, cosmeceuticals, and/or toys.

9. An object grasping apparatus, comprising:

the acquisition module is suitable for acquiring a two-dimensional color image corresponding to a three-dimensional object area and a depth image corresponding to the two-dimensional color image along a preset depth direction;

the prediction module is suitable for inputting the two-dimensional color image and the depth image into a deep learning model and predicting a plurality of graspable objects contained in the two-dimensional color image according to an output result;

the calculation module is suitable for calculating the three-dimensional pose information of each graspable object according to the point cloud information corresponding to the three-dimensional object area;

and the sorting and grabbing module is suitable for sorting the graspable objects along the preset depth direction according to the three-dimensional pose information and determining the grasping order of the graspable objects according to a sorting result.

10. The apparatus of claim 9, wherein the prediction module is further adapted to:

acquiring a sample image corresponding to a three-dimensional sample region, and determining a plurality of article objects contained in the sample image;

and respectively labeling the graspable region and the non-graspable region contained in the sample image according to the positional relationship among the plurality of article objects, and training the deep learning model according to the labeled sample image.

11. The apparatus of claim 10, wherein the prediction module is specifically adapted to:

determining the stacking sequence of each article object along the preset depth direction, marking the region corresponding to the article object positioned at the top layer as a graspable region, and marking the region corresponding to the article object positioned at the bottom layer as a non-graspable region; and/or

according to the exposure proportion of each article object, marking the region corresponding to the article object with the exposure proportion larger than a preset threshold value as a graspable region, and marking the region corresponding to the article object with the exposure proportion not larger than the preset threshold value as a non-graspable region; and/or

determining a contact area contained in each article object according to the shape and/or type of each article object, marking an area corresponding to the article object of which the contact area is not shielded as a graspable area, and marking an area corresponding to the article object of which the contact area is shielded as a non-graspable area.

12. The apparatus according to claim 10 or 11, wherein the prediction module is specifically adapted to:

taking the labeled sample images as an original training set, expanding the original training set by randomly adjusting picture brightness and/or contrast, applying affine transformations to the pictures, and/or applying white-balance transformations to the pictures, and training the deep learning model with the expanded training set obtained after the expansion.

13. The apparatus of any of claims 9-12, wherein the point cloud information is constructed from the two-dimensional color map and a depth map corresponding to the two-dimensional color map, and the calculation module is specifically adapted to:

establishing a three-dimensional coordinate system corresponding to the three-dimensional article area; the directions of a first coordinate axis and a second coordinate axis contained in the three-dimensional coordinate system are matched with the two-dimensional color image, and the direction of a third coordinate axis in the three-dimensional coordinate system is matched with the preset depth direction;

and calculating the depth coordinate value of each grippable object corresponding to the third coordinate axis, and calculating the three-dimensional pose information of each grippable object according to the depth coordinate value.

14. The apparatus of claim 13, wherein the preset depth direction comprises at least one of: a camera shooting direction, a gravity direction, and a direction of a line perpendicular to an article bearing surface, wherein the article bearing surface is a plane on which a carrier for placing the three-dimensional articles is located, and the carrier comprises: the ground, a tray, a conveyor belt, and/or a material basket; the sorting and grabbing module is specifically adapted to:

sorting the graspable objects according to their distance from the camera or from the article bearing surface, and determining the grasping order of the graspable objects according to the sorting result;

the closer a graspable object is to the camera, the earlier it is in the grasping order, and the farther it is from the camera, the later it is in the grasping order; or the closer a graspable object is to the article bearing surface, the later it is in the grasping order, and the farther it is from the article bearing surface, the earlier it is in the grasping order.

15. The apparatus of any one of claims 9-14, wherein the sorting and grabbing module is further adapted to:

acquiring a conversion relation between a camera coordinate system and a robot coordinate system;

and converting the three-dimensional pose information of each grippable object corresponding to the camera coordinate system into the robot coordinate system according to the conversion relation, and outputting the converted three-dimensional pose information of each grippable object to the robot so as to enable the robot to execute the gripping operation.

16. The apparatus according to any one of claims 9 to 15, wherein the three-dimensional pose information is position information; the three-dimensional article area comprises a plurality of graspable objects stacked along a preset depth direction; wherein the grippable object comprises: cartons, envelopes, plastic bags, cosmeceuticals, and/or toys.

17. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the object grabbing method according to any one of claims 1-8.

18. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the object grabbing method of any one of claims 1-8.

Technical Field

The invention relates to the technical field of manipulator control, in particular to an object grabbing method and device.

Background

At present, with the wide adoption of intelligent robots such as manipulators, more and more articles can be grabbed and transported by intelligent robots. For example, logistics packages can be grabbed by intelligent robots, which greatly improves grabbing efficiency.

In the prior art, in order to realize accurate grabbing, the object to be grabbed needs to be registered in advance, so that the number, shape, position and other characteristics of the object to be grabbed are determined, and then the robot is controlled to execute grabbing operation according to the number, shape, position and other characteristics of the object to be grabbed.

However, the inventor finds that the conventional grasping method can only be used for grasping a pre-registered article, and for an unregistered article, the information such as the shape and the position of the unregistered article cannot be determined, so that the article cannot be grasped normally. Therefore, the existing grabbing mode cannot flexibly grab any article, and the use scene is limited.

Disclosure of Invention

In view of the above, the present invention has been made to provide an object grasping method and apparatus that overcomes or at least partially solves the above problems.

According to an aspect of the present invention, there is provided an object grasping method, including:

acquiring a two-dimensional color image corresponding to a three-dimensional object area and a depth image corresponding to the two-dimensional color image along a preset depth direction;

inputting the two-dimensional color image and the depth image into a deep learning model, and predicting a plurality of graspable objects contained in the two-dimensional color image according to an output result;

calculating three-dimensional pose information of each object capable of being grabbed according to the point cloud information corresponding to the three-dimensional object area;

and sorting the graspable objects along the preset depth direction according to the three-dimensional pose information, and determining the grasping order of the graspable objects according to a sorting result.

Optionally, before the method is executed, the method further includes:

acquiring a sample image corresponding to a three-dimensional sample region, and determining a plurality of article objects contained in the sample image;

and respectively labeling the graspable region and the non-graspable region contained in the sample image according to the positional relationship among the plurality of article objects, and training the deep learning model according to the labeled sample image.

Optionally, the labeling, according to the positional relationship among the plurality of article objects, the graspable region and the non-graspable region included in the sample image respectively includes:

determining the stacking sequence of each article object along the preset depth direction, marking the region corresponding to the article object positioned at the top layer as a graspable region, and marking the region corresponding to the article object positioned at the bottom layer as a non-graspable region; and/or

according to the exposure proportion of each article object, marking the region corresponding to the article object with the exposure proportion larger than a preset threshold value as a graspable region, and marking the region corresponding to the article object with the exposure proportion not larger than the preset threshold value as a non-graspable region; and/or

determining a contact area contained in each article object according to the shape and/or type of each article object, marking an area corresponding to the article object of which the contact area is not shielded as a graspable area, and marking an area corresponding to the article object of which the contact area is shielded as a non-graspable area.

Optionally, the training the deep learning model according to the labeled sample image includes:

taking the labeled sample images as an original training set, expanding the original training set by randomly adjusting picture brightness and/or contrast, applying affine transformations to the pictures, and/or applying white-balance transformations to the pictures, and training the deep learning model with the expanded training set obtained after the expansion.

Optionally, if the point cloud information is constructed according to the two-dimensional color image and the depth map corresponding to the two-dimensional color image, the calculating of the three-dimensional pose information of each graspable object includes:

establishing a three-dimensional coordinate system corresponding to the three-dimensional article area; the directions of a first coordinate axis and a second coordinate axis contained in the three-dimensional coordinate system are matched with the two-dimensional color image, and the direction of a third coordinate axis in the three-dimensional coordinate system is matched with the preset depth direction;

and calculating the depth coordinate value of each grippable object corresponding to the third coordinate axis, and calculating the three-dimensional pose information of each grippable object according to the depth coordinate value.

Optionally, the preset depth direction includes at least one of: the camera shooting direction, the gravity direction, and the direction of a line perpendicular to the article bearing surface, wherein the article bearing surface is the plane on which the carrier for placing the three-dimensional articles is located, and the carrier includes: the ground, a tray, a conveyor belt, and/or a material basket; the sorting the graspable objects along the preset depth direction according to the three-dimensional pose information, and the determining the grasping order of the graspable objects according to the sorting result comprises:

sorting the graspable objects according to their distance from the camera or from the article bearing surface, and determining the grasping order of the graspable objects according to the sorting result;

the closer a graspable object is to the camera, the earlier it is in the grasping order, and the farther it is from the camera, the later it is in the grasping order; or the closer a graspable object is to the article bearing surface, the later it is in the grasping order, and the farther it is from the article bearing surface, the earlier it is in the grasping order.

Optionally, after determining the grabbing order of each object to be grabbed according to the sorting result, the method further includes:

acquiring a conversion relation between a camera coordinate system and a robot coordinate system;

and converting the three-dimensional pose information of each grippable object corresponding to the camera coordinate system into the robot coordinate system according to the conversion relation, and outputting the converted three-dimensional pose information of each grippable object to the robot so as to enable the robot to execute the gripping operation.

Optionally, the three-dimensional pose information is position information; the three-dimensional article area comprises a plurality of graspable objects stacked along a preset depth direction; wherein the grippable object comprises: cartons, envelopes, plastic bags, cosmeceuticals, and/or toys.

According to still another aspect of the present invention, there is also provided an object grasping apparatus including:

the acquisition module is suitable for acquiring a two-dimensional color image corresponding to a three-dimensional object area and a depth image corresponding to the two-dimensional color image along a preset depth direction;

the prediction module is suitable for inputting the two-dimensional color image and the depth image into a deep learning model and predicting a plurality of graspable objects contained in the two-dimensional color image according to an output result;

the calculation module is suitable for calculating the three-dimensional pose information of each graspable object according to the point cloud information corresponding to the three-dimensional object area;

and the sorting and grabbing module is suitable for sorting the graspable objects along the preset depth direction according to the three-dimensional pose information and determining the grasping order of the graspable objects according to a sorting result.

Optionally, the prediction module is further adapted to:

acquiring a sample image corresponding to a three-dimensional sample region, and determining a plurality of article objects contained in the sample image;

and respectively labeling the graspable region and the non-graspable region contained in the sample image according to the positional relationship among the plurality of article objects, and training the deep learning model according to the labeled sample image.

Optionally, the prediction module is specifically adapted to:

determining the stacking sequence of each article object along the preset depth direction, marking the region corresponding to the article object positioned at the top layer as a graspable region, and marking the region corresponding to the article object positioned at the bottom layer as a non-graspable region; and/or

according to the exposure proportion of each article object, marking the region corresponding to the article object with the exposure proportion larger than a preset threshold value as a graspable region, and marking the region corresponding to the article object with the exposure proportion not larger than the preset threshold value as a non-graspable region; and/or

determining a contact area contained in each article object according to the shape and/or type of each article object, marking an area corresponding to the article object of which the contact area is not shielded as a graspable area, and marking an area corresponding to the article object of which the contact area is shielded as a non-graspable area.

Optionally, the prediction module is specifically adapted to:

taking the labeled sample images as an original training set, expanding the original training set by randomly adjusting picture brightness and/or contrast, applying affine transformations to the pictures, and/or applying white-balance transformations to the pictures, and training the deep learning model with the expanded training set obtained after the expansion.

Optionally, the point cloud information is constructed according to the two-dimensional color image and a depth map corresponding to the two-dimensional color image, and the calculation module is specifically adapted to:

establishing a three-dimensional coordinate system corresponding to the three-dimensional article area; the directions of a first coordinate axis and a second coordinate axis contained in the three-dimensional coordinate system are matched with the two-dimensional color image, and the direction of a third coordinate axis in the three-dimensional coordinate system is matched with the preset depth direction;

and calculating the depth coordinate value of each grippable object corresponding to the third coordinate axis, and calculating the three-dimensional pose information of each grippable object according to the depth coordinate value.

Optionally, the preset depth direction includes at least one of: the camera shooting direction, the gravity direction, and the direction of a line perpendicular to the article bearing surface, wherein the article bearing surface is the plane on which the carrier for placing the three-dimensional articles is located, and the carrier includes: the ground, a tray, a conveyor belt, and/or a material basket; the sorting and grabbing module is specifically adapted to:

sorting the graspable objects according to their distance from the camera or from the article bearing surface, and determining the grasping order of the graspable objects according to the sorting result;

the closer a graspable object is to the camera, the earlier it is in the grasping order, and the farther it is from the camera, the later it is in the grasping order; or the closer a graspable object is to the article bearing surface, the later it is in the grasping order, and the farther it is from the article bearing surface, the earlier it is in the grasping order.

Optionally, the sorting and grabbing module is further adapted to:

acquiring a conversion relation between a camera coordinate system and a robot coordinate system;

and converting the three-dimensional pose information of each grippable object corresponding to the camera coordinate system into the robot coordinate system according to the conversion relation, and outputting the converted three-dimensional pose information of each grippable object to the robot so as to enable the robot to execute the gripping operation.

Optionally, the three-dimensional pose information is position information; the three-dimensional article area comprises a plurality of graspable objects stacked along a preset depth direction; wherein the grippable object comprises: cartons, envelopes, plastic bags, cosmeceuticals, and/or toys.

According to still another aspect of the present invention, there is provided an electronic apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the object grabbing method.

According to still another aspect of the present invention, a computer storage medium is provided, where at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform an operation corresponding to the object grabbing method.

In the object grabbing method and device provided by the invention, a two-dimensional color image corresponding to a three-dimensional object area and a depth map corresponding to the two-dimensional color image are acquired along a preset depth direction, and the two-dimensional color image and the depth map are input into a deep learning model to predict a plurality of graspable objects contained in the two-dimensional color image. The three-dimensional pose information of each graspable object is then calculated according to the point cloud information corresponding to the three-dimensional object area, and the graspable objects are sorted along the preset depth direction according to the three-dimensional pose information, so that the grasping order of the graspable objects is determined according to the sorting result. In this way, the grasping of arbitrary articles can be realized through the deep learning model, and by sorting the graspable objects along the preset depth direction, the articles on the upper layer are grasped first when multiple articles are stacked, so that grasping efficiency is guaranteed.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flow diagram illustrating an object grabbing method according to an embodiment of the present invention;

FIG. 2 is a flow diagram illustrating an object grabbing method according to another embodiment of the present invention;

fig. 3 is a schematic structural view showing an object grasping apparatus according to still another embodiment of the present invention;

FIG. 4 shows a schematic structural diagram of an electronic device according to the present invention;

FIG. 5 illustrates a structural diagram of a deep learning model in one example.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 is a schematic flowchart of an object grabbing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

step S110: and acquiring a two-dimensional color image corresponding to the three-dimensional object area and a depth image corresponding to the two-dimensional color image along a preset depth direction.

The preset depth direction includes at least one of: the camera shooting direction, the gravity direction, and the direction of a line perpendicular to the article bearing surface. The article bearing surface is the plane of the carrier on which the three-dimensional articles are placed; the carrier includes: the ground, a tray, a conveyor belt, and/or a material basket.

For example, the preset depth direction may be the depth direction along which the camera takes pictures, also called the camera shooting direction. Correspondingly, the light rays of the camera lens extend from a first position toward a second position, and the preset depth direction is the direction pointing from the first position to the second position. For example, when the camera lens takes pictures from top to bottom, the preset depth direction is the top-to-bottom direction; when the camera lens takes pictures from left to right, the preset depth direction is the left-to-right direction. In addition, the three-dimensional article area refers to a three-dimensional region in which a plurality of articles are stacked. Since stacking often exists among the articles in this embodiment, the spatial relationship among the articles cannot be accurately described by a planar image alone, and the three-dimensional article area is therefore used for the description.

In specific implementation, the two-dimensional color image corresponding to the three-dimensional article area and the depth map corresponding to the two-dimensional color image are obtained through a 3D camera. The two-dimensional color image corresponds to an image of the plane perpendicular to the preset depth direction; each pixel in the depth map corresponds one-to-one with a pixel in the two-dimensional color image, and its value is the depth value of that pixel. The depth value is determined by the distance of the article from the camera. For example, when the camera takes pictures from top to bottom, the two-dimensional color image corresponds to a top view, while the depth map represents the distance of each article from the camera.

Therefore, the two-dimensional color image and its corresponding depth map can accurately describe the spatial relationship among the articles from a three-dimensional perspective.

Step S120: and inputting the two-dimensional color image and the depth image into a deep learning model, and predicting a plurality of graspable objects contained in the two-dimensional color image according to an output result.

The deep learning model is obtained by training on a plurality of pre-generated training samples. Specifically, by learning from the training samples, the deep learning model can predict, based on the two-dimensional color image and the depth map, a plurality of graspable objects contained in the two-dimensional color image. In specific implementation, the graspable objects contained in the training samples can be labeled in advance, and the graspable objects contained in the two-dimensional color image are then predicted by the deep learning model trained on the labeled samples. The deep learning model may be any of various types of machine learning models, and the present invention does not limit its specific form.
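As a concrete illustration of this input step, the sketch below stacks an aligned color image and depth map into a single channels-first tensor. This early-fusion layout and the normalization constants are assumptions for illustration; the patent does not prescribe how the two inputs are combined.

```python
# A minimal sketch of assembling the model input, assuming the color image and
# depth map are fused into one 4-channel tensor (an assumption, not the patent's
# prescribed scheme).
import numpy as np

def build_model_input(color_hw3: np.ndarray, depth_hw: np.ndarray) -> np.ndarray:
    """Stack an aligned H x W x 3 color image and H x W depth map into C x H x W."""
    assert color_hw3.shape[:2] == depth_hw.shape, "color and depth must be pixel-aligned"
    rgb = color_hw3.astype(np.float32) / 255.0            # normalize color to [0, 1]
    depth = depth_hw.astype(np.float32)
    depth = depth / max(float(depth.max()), 1e-6)          # normalize depth to [0, 1]
    rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)
    return np.transpose(rgbd, (2, 0, 1))                   # channels-first for the network
```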

Step S130: and calculating the three-dimensional pose information of each object capable of being grabbed according to the point cloud information corresponding to the three-dimensional object area.

A point cloud is a set of points in a preset coordinate system. The points carry rich information, including the three-dimensional coordinates X, Y, Z, color, classification value, intensity value, time, and the like. A point cloud represents the real world as discrete points, and high-precision point cloud data can restore the real world. Therefore, the point cloud information reflects the three-dimensional characteristics of the three-dimensional article area. Specifically, the point cloud information can be constructed from the two-dimensional color image and the depth map, or it can additionally be generated by combining components such as a laser detector, an infrared detector, a visible-light detector (such as an LED-based detector), or a radar detector.

The three-dimensional pose information of each graspable object is calculated from the point cloud information. The three-dimensional pose information describes the pose of the graspable object in the three-dimensional world. This embodiment mainly focuses on the depth value information in the three-dimensional pose information, which may be the distance between the graspable object and the camera.

Step S140: and sequencing the grabbed objects along the preset depth direction according to the three-dimensional pose information, and determining the grabbing sequence of the grabbed objects according to the sequencing result.

Because the three-dimensional pose information contains depth value information, the grabbed objects can be sequenced along the preset depth direction according to the three-dimensional pose information. In the present embodiment, the preset depth direction coincides with the shooting direction of the camera. For example, when the camera takes a picture from top to bottom, the various grabbed objects are sorted according to the height information: higher heights are ranked more forward, and lower heights are ranked more backward. Correspondingly, when the grabbing sequence of each object capable of being grabbed is determined according to the sequencing result, the object with the higher height is grabbed firstly, and the object with the lower height is grabbed later, so that the robot can be ensured to grab the objects from top to bottom in sequence.

Therefore, the grabbing operation of any article can be realized through the deep learning model in the mode, and the articles on the upper layer can be grabbed firstly when stacking conditions exist among a plurality of articles by sequencing the grabbed objects along the preset depth direction, so that the grabbing efficiency is guaranteed.

Fig. 2 is a flowchart illustrating an object grabbing method according to another embodiment of the present invention. As shown in fig. 2, the method includes:

step S200: and training the deep learning model through the pre-acquired sample images corresponding to the three-dimensional sample regions.

Specifically, the deep learning model is obtained by training in the following way:

first, a sample image corresponding to a three-dimensional sample region is acquired, and a plurality of object objects included in the sample image are determined. Wherein, a plurality of articles to be grabbed as samples are contained in the three-dimensional sample area. The sample image corresponding to the three-dimensional sample region includes: the depth map comprises a two-dimensional color map corresponding to a three-dimensional sample region and a depth map corresponding to the two-dimensional color map, wherein the two-dimensional color map is acquired along a preset depth direction. The specific obtaining manner may refer to the corresponding description in step S110, and is not described herein again. When a plurality of article objects contained in the sample image are determined, information such as outlines and boundary lines among the articles can be identified in an example segmentation mode, and the article objects contained in the sample image are segmented according to identification results.

Then, the graspable region and the non-graspable region contained in the sample image are labeled respectively according to the positional relationship among the plurality of article objects. Since stacking exists among the article objects in this embodiment, article objects pressed underneath may not be easy to grasp, so the regions corresponding to graspable article objects and those corresponding to non-graspable article objects need to be marked separately. Specifically, the labeling can be implemented in at least one of the following ways:

In an optional implementation, the stacking order of the article objects along the preset depth direction is determined, the region corresponding to the article object located at the top layer is marked as a graspable region, and the region corresponding to the article object located at the bottom layer is marked as a non-graspable region. Conventional instance segmentation algorithms do not distinguish whether an article in the scene is graspable; that is, they give a complete and accurate instance mask for every article in the scene. Therefore, if a conventional instance segmentation algorithm is applied directly to the recognition of graspable regions, a pressed article located at the bottom layer, or a background article irrelevant to the task, may be recognized as graspable, causing recognition errors. To prevent this, in this implementation the stacking order of the article objects along the preset depth direction is determined, the region corresponding to the top-layer article object is marked as graspable, and the region corresponding to the bottom-layer article object is marked as non-graspable, thereby avoiding abnormalities caused by the robot grasping a bottom-layer article. For example, in a carton unstacking scenario, unstacking must proceed from the uppermost layer to the lowermost layer, and a lower layer of cartons cannot be grasped before the upper layer of cartons has been completely removed. Therefore, in such a scene, only the uppermost cartons are labeled as graspable objects, and the remaining cartons are labeled as non-graspable objects. This labeling approach accurately distinguishes the uppermost articles from the non-uppermost articles and thus provides accurate pixel-level article localization.

In yet another optional implementation, according to the exposure proportion of each article object, the region corresponding to an article object whose exposure proportion is greater than a preset threshold is marked as a graspable region, and the region corresponding to an article object whose exposure proportion is not greater than the preset threshold is marked as a non-graspable region (a minimal sketch of this rule follows this list of labeling approaches). In some scenarios, the stacking relationship between articles is not easy to determine, and articles on the same layer may overlap each other, which makes it difficult to accurately mark the top-layer articles. For example, in a supermarket order-picking scene, the upper-lower relationship between goods is unclear and goods on the same layer also overlap; the requirements on the grasping order are not strict, but the distinction between graspable and non-graspable goods is. In this case, articles with little exposed surface, or articles whose removal may cause other articles in the scene to fly out, should not be labeled as graspable. Accordingly, in such a scenario a labeling threshold, for example 85%, may be set: if the exposed surface area of an article is greater than 85%, the article is marked as graspable; if it is not greater than 85%, the article is marked as non-graspable. Of course, the exposure proportion can also be quantified by exposed volume instead of exposed surface area, and the present invention is not limited in this respect.

In yet another alternative implementation manner, a contact area included in each item object is determined according to the shape and/or type of each item object, an area corresponding to an item object whose contact area is not blocked is marked as a graspable area, and an area corresponding to an item object whose contact area is blocked is marked as a non-graspable area. Wherein, the contact area refers to a force-bearing area which is convenient for grabbing in the object. For example, in the case of an article such as a metal part, in order to prevent the part from being damaged, a specific region in the metal part, that is, a contact region, is required to be grasped, and the specific region is generally a region which is firmer and is not easy to fall off in the metal part. Therefore, when the articles are marked, it is necessary to determine whether the contact area is completely exposed and is not blocked. If the exposed surface area of the article is large, but the contact area is blocked, the article is marked as a non-graspable article.

The above-mentioned labeling methods can be used alone or in combination, and the present invention is not limited thereto.
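A minimal sketch of the exposure-ratio rule referenced above, assuming that each article's visible mask and an estimate of its full (unoccluded) mask are available from the annotation tool; the 0.85 threshold mirrors the example in the text and is not mandated by the patent.

```python
# A minimal sketch of the exposure-ratio labeling rule; `visible_mask` and
# `full_mask` are assumed annotation-tool outputs for one article object.
import numpy as np

def label_by_exposure(visible_mask: np.ndarray, full_mask: np.ndarray,
                      threshold: float = 0.85) -> str:
    """Mark an article object as graspable when enough of its surface is exposed."""
    exposed_ratio = visible_mask.sum() / max(int(full_mask.sum()), 1)
    return "graspable" if exposed_ratio > threshold else "non-graspable"
```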

Finally, the deep learning model is trained according to the labeled sample images. In general, the larger the number of samples, the better the training effect. To prevent the poor training effect caused by an insufficient number of samples, in this embodiment the number of training samples is increased by augmenting the sample data, so as to train the deep learning model. Various methods can be used to achieve this data augmentation. Specifically, the labeled sample images may be used as the original training set, the original training set may be expanded by randomly adjusting picture brightness and/or contrast, applying affine transformations to the pictures, and/or applying white-balance transformations to the pictures, and the expanded training set obtained after the expansion is then used to train the deep learning model. This expansion increases the number of samples and improves the training effect.
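A minimal augmentation sketch under the expansion scheme described above, using random brightness/contrast gains, a per-channel white-balance shift, and a small random affine transform; the ranges and the use of OpenCV are assumptions for illustration, not requirements of the patent.

```python
# A minimal color-image augmentation sketch; the random ranges are illustrative.
import random
import numpy as np
import cv2

def augment(image: np.ndarray) -> np.ndarray:
    out = image.astype(np.float32)
    # random brightness and contrast
    alpha = random.uniform(0.8, 1.2)                       # contrast gain
    beta = random.uniform(-20, 20)                         # brightness offset
    out = out * alpha + beta
    # random white-balance shift: independent per-channel gains
    gains = np.array([random.uniform(0.9, 1.1) for _ in range(3)], dtype=np.float32)
    out = out * gains
    # random affine transform: small rotation and translation
    h, w = out.shape[:2]
    angle = random.uniform(-10, 10)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    m[:, 2] += [random.uniform(-0.05, 0.05) * w, random.uniform(-0.05, 0.05) * h]
    out = cv2.warpAffine(out, m, (w, h))
    return np.clip(out, 0, 255).astype(np.uint8)
```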

Step S210: and acquiring a two-dimensional color image corresponding to the three-dimensional object area and a depth image corresponding to the two-dimensional color image along a preset depth direction.

The preset depth direction can be set flexibly according to the actual business scenario. Specifically, the preset depth direction includes at least one of: the camera shooting direction, the gravity direction, and the direction of a line perpendicular to the article bearing surface.

In one implementation, the preset depth direction is the depth direction along which the camera takes pictures, also called the camera shooting direction. Specifically, the light rays of the camera lens extend from a first position toward a second position, and the preset depth direction is the direction pointing from the first position to the second position. For example, when the camera lens takes pictures from top to bottom, the preset depth direction is the top-to-bottom direction; when the camera lens takes pictures from left to right, the preset depth direction is the left-to-right direction. If a single camera is used for photographing, the preset depth direction is the direction in which the camera points toward the article area. If two cameras are used, the preset depth direction is the direction from the midpoint between the two cameras toward the article area. Of course, in a scene with multiple cameras, the preset depth direction may be set as the direction from the center position of the cameras toward the article area, and the present invention is not limited in this respect.

In another implementation, the preset depth direction is the direction of a line perpendicular to the article bearing surface, that is, the direction perpendicular to the article bearing surface. In particular, the shooting angle of the camera can be set flexibly; for example, the camera may be at an angle to the direction in which the articles are placed, that is, the camera is tilted. Therefore, for accuracy of description, the preset depth direction may also be the direction perpendicular to the article bearing surface. In practical implementation, the preset depth direction may be any direction, for example the vertical direction or some inclined direction, and the present invention does not limit it.

The article bearing surface is the plane of the carrier on which the three-dimensional articles are placed. For example, when three-dimensional articles are placed on the ground, the ground is the carrier, and the article bearing surface is the plane of the ground; when three-dimensional articles are placed on a tray, a conveyor belt, or a material basket, the tray, conveyor belt, or material basket is the carrier, and the article bearing surface is the plane on which it lies. In specific scenarios, a carrier such as a tray, conveyor belt, or material basket may be arranged obliquely; for example, for convenience of loading and unloading, the plane of the conveyor belt may form a preset angle with the horizontal plane, and correspondingly the preset depth direction is perpendicular to the plane of the conveyor belt and therefore also forms an angle with the vertical direction.

In addition, the preset depth direction may be a gravity direction. For example, when the object carrying surface is consistent with the horizontal plane, the predetermined depth direction is the gravity direction.

In addition, the three-dimensional article region means: a three-dimensional region in which a plurality of articles are stacked. Since a stacking phenomenon often exists among a plurality of articles in this embodiment, the orientation relationship among the articles cannot be accurately described only by the planar image, and thus the description is performed by the three-dimensional article region.

In specific implementation, the two-dimensional color image corresponding to the three-dimensional article area and the depth map corresponding to the two-dimensional color image are obtained through a 3D camera. The two-dimensional color image corresponds to an image of the plane perpendicular to the preset depth direction; each pixel in the depth map corresponds one-to-one with a pixel in the two-dimensional color image, and its value is the depth value of that pixel. The depth value is determined by the distance of the article from the camera. For example, when the camera takes pictures from top to bottom, the two-dimensional color image corresponds to a top view, while the depth map represents the distance of each article from the camera. Therefore, the two-dimensional color image and its corresponding depth map can accurately describe the spatial relationship among the articles from a three-dimensional perspective.

Step S220: and inputting the two-dimensional color image and the depth image into the deep learning model, and predicting a plurality of graspable objects contained in the two-dimensional color image according to an output result.

Since the deep learning model is generated from the samples labeled with the graspable region and the non-graspable region, a plurality of graspable objects included in the two-dimensional color image can be predicted by the model. Specifically, after the two-dimensional color image and the depth image are input into the deep learning model, the model outputs a graspable region and a non-graspable region included in the two-dimensional color image, wherein the graspable region corresponds to the graspable object.

In particular, the prediction results that the model outputs for the pixel regions may be expressed in various ways. In one representation, the prediction result has two states: graspable and non-graspable. In another representation, the prediction result is a grasp probability for each pixel: after the two-dimensional color image and the corresponding depth map are input, the deep learning model predicts, for every pixel in the 2D image space, the probability that an article can be successfully grasped at that pixel with a suction cup or another grasping tool; the value of each pixel in the resulting 2D probability map represents the model's predicted probability of moving the suction cup to that point and successfully picking the article out of the bin. In the latter representation, the prediction is therefore precise down to the graspable probability of each pixel: the larger the probability, the higher the success rate of a grasping operation performed at the corresponding pixel, and the smaller the probability, the lower the success rate.
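A hedged inference sketch of the per-pixel probability output described above; `grasp_net` is a placeholder for a trained network that maps a 4-channel RGB-D tensor to a single-channel logit map, and the 0.5 threshold is illustrative only.

```python
# A hedged inference sketch; `grasp_net` and its output layout are assumptions.
import numpy as np
import torch

def predict_grasp_map(grasp_net: torch.nn.Module, rgbd_chw: np.ndarray) -> np.ndarray:
    """Return an H x W map where each value is the predicted pick-success probability."""
    with torch.no_grad():
        logits = grasp_net(torch.from_numpy(rgbd_chw).unsqueeze(0).float())
        prob = torch.sigmoid(logits)[0, 0]                 # assume a single-channel output
    return prob.cpu().numpy()

def graspable_mask(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binary mask of pixels considered graspable (illustrative threshold)."""
    return prob_map > threshold
```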

Step S230: and calculating the three-dimensional pose information of each object capable of being grabbed according to the point cloud information corresponding to the three-dimensional object area.

A point cloud is a set of points in a preset coordinate system. The points carry rich information, including the three-dimensional coordinates X, Y, Z, color, classification value, intensity value, time, and the like. A point cloud represents the real world as discrete points, and high-precision point cloud data can restore the real world. Therefore, the point cloud information reflects the three-dimensional characteristics of the three-dimensional article area. In this embodiment, the point cloud information can be constructed from the two-dimensional color image and the depth map. Alternatively, the point cloud information can be generated by additionally combining components such as a laser detector, an infrared detector, or a radar detector, so that the point cloud information is more accurate.
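As an illustration of constructing the point cloud from the depth map, the sketch below uses a standard pinhole camera model; the intrinsics fx, fy, cx, cy are assumed to be known from camera calibration, and the depth values are assumed to be metric distances along the camera's optical axis.

```python
# A minimal sketch of building an organized point cloud from the depth map with a
# pinhole camera model; fx, fy, cx, cy are assumed camera intrinsics.
import numpy as np

def depth_to_point_cloud(depth_hw: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Return an H x W x 3 array of (X, Y, Z) camera-frame points, one per pixel."""
    h, w = depth_hw.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))         # pixel coordinates
    z = depth_hw.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```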

The three-dimensional pose information of each graspable object is calculated from the point cloud information. The three-dimensional pose information describes the three-dimensional posture of the graspable object in the three-dimensional world. The three-dimensional pose information of a graspable object, also referred to as object position information and object posture information, can be determined in various ways.

The three-dimensional pose information can be described by a grasp point or a grasp region contained in the graspable object. Taking the grasp point as the representation, the grasp point corresponding to a graspable object can be determined in several ways: the point with the maximum grasp probability within the region corresponding to the graspable object may be used as the grasp point; or the 2D barycenter of the graspable region (which may be part of the graspable object, and one graspable object may have several graspable regions) may be calculated and used to determine the grasp point. The grasp point describes the approximate location of the graspable object in three-dimensional space.
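A minimal sketch of both grasp-point choices just described (the most confident pixel within the region, or the region's 2D barycenter); `region_mask` and `prob_map` are assumed to come from the model's output in the previous step.

```python
# A minimal sketch of choosing a grasp pixel for one graspable region.
import numpy as np

def grasp_point(region_mask: np.ndarray, prob_map: np.ndarray,
                use_barycenter: bool = False) -> tuple:
    """Return the (row, col) grasp pixel for a graspable region."""
    if use_barycenter:
        rows, cols = np.nonzero(region_mask)
        return int(rows.mean()), int(cols.mean())          # 2D barycenter of the region
    masked = np.where(region_mask, prob_map, -np.inf)      # restrict argmax to the region
    return tuple(int(i) for i in np.unravel_index(np.argmax(masked), masked.shape))
```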

In addition, because the grasping point is a point in a three-dimensional coordinate system, the depth value information corresponding to the grasping point can be determined according to the three-dimensional pose information, namely, the distance between the graspable object and the camera can be determined.

In one specific example, the three-dimensional pose information of each graspable object is calculated as follows. Firstly, a three-dimensional coordinate system corresponding to the three-dimensional article area is established, in which the directions of the first coordinate axis and the second coordinate axis match the two-dimensional color image, and the direction of the third coordinate axis matches the preset depth direction. Then, the depth coordinate value of each graspable object on the third coordinate axis is calculated, and the three-dimensional pose information of each graspable object is calculated according to the depth coordinate value. The depth coordinate value thus reflects the distance between the graspable object and the camera.
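With an organized point cloud aligned to the color map (as produced by the back-projection sketch above), looking up the position and depth coordinate of a grasp point is direct. Averaging over a small window to suppress sensor noise is an added robustness assumption, not a step prescribed by this embodiment.

```python
import numpy as np


def grasp_pose_from_cloud(cloud: np.ndarray, grasp_px: tuple, radius: int = 3) -> np.ndarray:
    """Estimate the 3D position of a grasp point from an organized point cloud.

    cloud    : HxWx3 organized point cloud aligned with the color map (camera frame);
               the third axis matches the preset depth direction, so Z is the depth coordinate.
    grasp_px : (row, col) pixel of the grasp point.
    radius   : half-size of the averaging window around the pixel (illustrative choice).
    """
    row, col = grasp_px
    h, w, _ = cloud.shape
    patch = cloud[max(0, row - radius):min(h, row + radius + 1),
                  max(0, col - radius):min(w, col + radius + 1)].reshape(-1, 3)
    valid = patch[np.isfinite(patch[:, 2]) & (patch[:, 2] > 0)]  # drop missing depth readings
    return valid.mean(axis=0) if len(valid) else cloud[row, col]
```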

Step S240: sequencing the graspable objects along the preset depth direction according to the three-dimensional pose information, and determining the grabbing sequence of the graspable objects according to the sequencing result.

Specifically, the depth coordinate value of each graspable object can be determined according to the three-dimensional pose information, and the magnitude of the depth coordinate value reflects the ordering of the graspable objects along the preset depth direction. In a specific implementation, the graspable objects are sequenced according to their distance to the camera or to the article carrying surface, and the grabbing sequence of the graspable objects is determined according to the sequencing result: the closer a graspable object is to the camera, the earlier it is in the grabbing sequence, and the farther it is from the camera, the later it is in the grabbing sequence; alternatively, the closer a graspable object is to the article carrying surface, the later it is in the grabbing sequence, and the farther it is from the article carrying surface, the earlier it is in the grabbing sequence. In general, since the camera shoots from top to bottom, a graspable object close to the camera is located at the top layer and a graspable object far from the camera is located at the bottom layer. The distance between a graspable object and the article carrying surface is measured along the perpendicular of the article carrying surface, namely, it is the vertical distance between the graspable object and the article carrying surface.
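A minimal sorting sketch covering both conventions is given below. Representing each pose as an (x, y, z) tuple with z as the depth coordinate is an assumption carried over from the earlier sketches.

```python
def order_by_depth(poses, camera_frame: bool = True):
    """Sort graspable objects along the preset depth direction.

    poses        : list of (object_id, (x, y, z)) tuples; z is the depth coordinate.
    camera_frame : True  -> z is the distance to the camera; smaller z means closer to the
                            camera, i.e. the top of the stack, so it is grasped first.
                   False -> z is the height above the article carrying surface; larger z
                            means farther from the surface, i.e. the top of the stack.
    Returns object ids in grabbing order (top layer first).
    """
    ordered = sorted(poses, key=lambda item: item[1][2], reverse=not camera_frame)
    return [obj_id for obj_id, _ in ordered]


# Example: two boxes seen by a top-down camera; box "B" is closer, so it is grasped first.
print(order_by_depth([("A", (0.1, 0.2, 0.95)), ("B", (0.3, 0.1, 0.80))]))  # ['B', 'A']
```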

Therefore, the sequencing result arranges the graspable objects in order from the top layer to the bottom layer, so that during grasping the objects can be picked up sequentially from the top layer to the bottom layer.

Step S250: outputting a grabbing instruction to the robot so that the robot can execute the grabbing operation according to the grabbing instruction.

Specifically, the above steps mainly process the information captured by the camera, so as to identify the graspable objects and determine the grabbing sequence. In general, the camera and the robot belong to different devices; therefore, a grabbing instruction further needs to be output to the robot so that the robot can perform the grabbing operation according to the grabbing instruction.

Considering that the camera is often not located at the same position as the robot, the graspable object can be positioned by means of coordinate system transformation, which is specifically realized as follows: acquiring a conversion relation between the camera coordinate system and the robot coordinate system; converting the three-dimensional pose information of each graspable object from the camera coordinate system into the robot coordinate system according to the conversion relation, and outputting the converted three-dimensional pose information of each graspable object to the robot so that the robot performs the grasping operation. The three-dimensional pose information of the graspable objects described in the preceding steps is determined in the camera coordinate system; to facilitate the grasping operation, it needs to be converted into the robot coordinate system. The conversion relation between the camera coordinate system and the robot coordinate system can be determined according to the relative relationship between the position of the camera and the position of the robot.
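Representing the conversion relation as a 4x4 homogeneous transform obtained from hand-eye calibration is an assumed concrete form of the "conversion relation" above, not the only possible one. A minimal sketch:

```python
import numpy as np


def camera_to_robot(position_cam, T_robot_cam: np.ndarray) -> np.ndarray:
    """Convert a grasp position from the camera frame to the robot base frame.

    position_cam : (x, y, z) grasp point expressed in the camera coordinate system.
    T_robot_cam  : 4x4 homogeneous transform mapping camera coordinates to robot
                   coordinates, assumed to come from hand-eye calibration.
    """
    p = np.append(np.asarray(position_cam, dtype=float), 1.0)  # homogeneous coordinates
    return (T_robot_cam @ p)[:3]
```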

It can be seen that the three-dimensional object region in the present embodiment contains a plurality of graspable objects stacked along the preset depth direction, where the graspable objects include: cartons, plastic soft packages (including but not limited to snack packaging, Tetra Pak pillow milk packaging, plastic milk packaging, and the like), envelopes, file pouches, postcards, cosmetics and pharmaceuticals, and/or irregular toy articles, and the like. The graspable objects in the present embodiment thus include, but are not limited to: cartons, plastic bags (such as bagged milk, snack packages, and the like), ointments, toothpaste, cylindrical cosmetic bottles, and/or other common commodity packages. Here, "article" includes, but is not limited to: boxes in the express delivery industry, cartons in the retail industry, plastic soft packages such as soft milk packaging, Tetra Pak pillow packaging and snack packaging, medicine boxes, cylindrical cosmetic or pharmaceutical containers, irregular toys, small retail commodities (such as cosmetics and snacks), and sheet-like objects such as envelopes, file bags and postcards. The model then provides a graspable area for each article. This mode can assist an industrial robot whose end is equipped with a vacuum suction cup to pick up articles one by one from a chaotically stacked article pile for subsequent stations such as barcode scanning and loading. In short, the articles and the graspable objects in the embodiment may be various article objects, and the present invention is not limited to any specific kind.

In summary, in the present embodiment, the grasping operation of arbitrary articles can be realized through the deep learning model, and by sequencing the graspable objects along the preset depth direction, the articles located on the upper layer are grasped first when a plurality of articles are stacked, thereby guaranteeing grasping efficiency.

For ease of understanding, the implementation details of the deep learning network portion of the above scheme are finally described through a specific example:

Firstly, in a conventional processing mode, a color image of the article region is input into a model, which predicts a mask of the region occupied by each article instance in the color image. The mask predicted by the model is used to cut, from the point cloud collected by the 3D camera, the points matching the mask. The cut-out point cloud is then matched against the registered point cloud of the target object, and the pose transformation between the two is calculated. In a specific implementation, this can be realized through a CPN (Cascaded Pyramid Network) model from the field of key point detection. In that model, the features of different layers of the neural network are first coarsely extracted through a global network (GlobalNet), the coarsely extracted features are then refined through RefineNet, and the features of different layers are combined as the network output. In this way, smaller features in the picture can be processed, so key points of targets in various scenes can be accurately identified. However, this conventional processing manner can only handle objects that have been registered and trained in advance, and therefore cannot handle target object types numbering in the thousands.

To solve the above problem, in this example the network model of the CPN structure is improved, so that the network is more sensitive and accurate in predicting elongated objects and small objects.

Fig. 5 shows a structural diagram of the deep learning model in the present example. The first part of the deep learning model of this example comprises two parallel global networks (GlobalNet); Fig. 5 shows the feature-layer structure of the left and right global networks, where one rectangular block represents one feature layer. The two global networks have the same structure but identify features of different natures: one global network identifies the color features of the color image, and the other identifies the distance features of the depth image. The distance features are determined by parameters such as the position-conversion attributes of the camera in the process of forming the depth map. The color features can accurately locate the position and posture of a target object (for example, an elongated object) in the two-dimensional plane of the picture (the plane perpendicular to the viewing direction), while the distance features can accurately identify the distance from each part of the object surface to the viewpoint along the viewing direction. The color features and the distance features thus complement each other. For example, when multiple boxes of equal height are placed side by side, the distance features cannot distinguish them, while the color features can distinguish them by the texture of the box boundaries. Conversely, for an object with a uniform color but a complex surface, the color features cannot determine the grasping position on the object, but the network can find an area of the object surface suitable for grasping (e.g., a flatter surface) through the distance features. The output of each feature layer in the global network of the present example does not undergo refinement by RefineNet, so the network architecture differs from that of CPN.
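The two parallel global networks can be pictured as two convolutional branches with identical structure but different input channels (3 for the color map, 1 for the depth map). The sketch below is only an illustrative stand-in for the GlobalNet described here; the layer counts and channel widths are assumptions, not the architecture of Fig. 5.

```python
import torch
import torch.nn as nn


class GlobalBranch(nn.Module):
    """Illustrative stand-in for one GlobalNet: a few convolution stages ending in a
    256-channel feature map. The real network is deeper; depths and widths here are
    assumptions chosen for readability."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # downsampling (pooling layer)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)                               # N x 256 x H/2 x W/2


color_branch = GlobalBranch(in_channels=3)   # branch identifying color/texture features
depth_branch = GlobalBranch(in_channels=1)   # branch identifying distance features
```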

In addition, each feature layer in the global network part of this scheme comprises a backbone network part and a remaining part. The backbone network part produces the feature layers and comprises convolution layers, activation function layers, pooling layers, normalization layers and up-sampling layers. The backbone network part adopts a residual learning framework: it learns the residual between the input and the output of every two (or three) convolution layers and uses it to correct the output. For example, the input of the first convolution layer is compared with the output of the second convolution layer, and the features lost during learning are used to correct the output of the second convolution layer; or the input of the first convolution layer is compared with the output of the third convolution layer to obtain the lost features, which are then used to correct the output of the third convolution layer. The remaining part takes all the feature layers processed by the backbone network, samples from the deepest feature layer, adds it after a convolution layer to the preceding feature layer, and so on, to obtain the output of the global network.
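The residual correction described above (comparing the input of the first convolution with the output of the second convolution) follows the standard residual-learning pattern. A minimal two-convolution sketch is given below; the 1x1 projection for mismatched channel counts is an assumed detail, not stated in the text.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two-convolution residual block: the block input is added back to the output of
    the second convolution, so the stack only has to learn the residual, i.e. the
    features that would otherwise be lost during learning."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the skip connection matches the output channel count
        self.skip = (nn.Identity() if in_channels == out_channels
                     else nn.Conv2d(in_channels, out_channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))   # correct the second conv's output with the input
```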

The second part of the deep learning model in this example fuses the results of the two parallel global networks, specifically by merging the output channels of the two global networks. For example, if the color map yields a 256-channel result and the depth map also yields a 256-channel result, the fused result has 512 channels. Finally, the merged network output is up-sampled to obtain a probability prediction map that accurately identifies the graspable positions of the target objects. Specifically, in the model training process, the sparse cross entropy between the model output on the training set and the corresponding labels is calculated as the loss function, and the parameters of the model are updated through back-propagation. Training ends when the loss function hardly drops any more. The deep learning model in this example is effective in predicting the graspable regions on very elongated boxes.
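Building on the `GlobalBranch` stand-ins sketched above, the channel fusion and the sparse cross-entropy objective can be illustrated as follows. The two-class head (not graspable / graspable), the single bilinear up-sampling step and the toy tensors are assumptions of the sketch, not a reproduction of the network in Fig. 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraspProbabilityNet(nn.Module):
    """Fuses the 256-channel outputs of the color and depth branches into a 512-channel
    map, then up-samples to a per-pixel graspability prediction."""

    def __init__(self, color_branch: nn.Module, depth_branch: nn.Module):
        super().__init__()
        self.color_branch = color_branch
        self.depth_branch = depth_branch
        self.head = nn.Conv2d(512, 2, kernel_size=1)   # 2 classes: not graspable / graspable

    def forward(self, color: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.color_branch(color),
                           self.depth_branch(depth)], dim=1)   # 256 + 256 = 512 channels
        logits = self.head(fused)
        # up-sample back to the input resolution so every pixel gets a prediction
        return F.interpolate(logits, size=color.shape[-2:], mode="bilinear",
                             align_corners=False)


model = GraspProbabilityNet(GlobalBranch(in_channels=3), GlobalBranch(in_channels=1))
criterion = nn.CrossEntropyLoss()            # "sparse" cross entropy: labels are class indices

color = torch.randn(2, 3, 128, 128)          # toy batch of color maps
depth = torch.randn(2, 1, 128, 128)          # toy batch of depth maps
labels = torch.randint(0, 2, (2, 128, 128))  # per-pixel 0/1 graspability labels

logits = model(color, depth)
loss = criterion(logits, labels)
loss.backward()                              # back-propagation updates the model parameters
prob_map = torch.softmax(logits, dim=1)[:, 1]  # per-pixel probability of being graspable
```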

Fig. 3 is a schematic structural view of an object grasping apparatus according to still another embodiment of the present invention. As shown in Fig. 3, the apparatus includes:

the acquisition module 31 is adapted to acquire a two-dimensional color image corresponding to a three-dimensional object region and a depth image corresponding to the two-dimensional color image along a preset depth direction;

the prediction module 32 is adapted to input the two-dimensional color image and the depth image into a deep learning model, and predict a plurality of graspable objects contained in the two-dimensional color image according to an output result;

a calculation module 33 adapted to calculate three-dimensional pose information of each graspable object from the point cloud information corresponding to the three-dimensional article region;

and the sequencing and grabbing module 34 is adapted to sequence the graspable objects along the preset depth direction according to the three-dimensional pose information, and determine the grabbing sequence of the graspable objects according to the sequencing result.

Optionally, the prediction module is further adapted to:

acquiring a sample image corresponding to a three-dimensional sample region, and determining a plurality of object objects contained in the sample image;

and respectively labeling the graspable region and the non-graspable region contained in the sample image according to the positional relation among the plurality of article objects, and training the deep learning model according to the labeled sample image.

Optionally, the prediction module is specifically adapted to:

determining the stacking sequence of each article object along the preset depth direction, marking the region corresponding to the article object positioned at the top layer as a graspable region, and marking the region corresponding to the article object positioned at the bottom layer as a non-graspable region; and/or,

according to the exposure proportion of each article object, marking the region corresponding to an article object whose exposure proportion is larger than a preset threshold value as a graspable region, and marking the region corresponding to an article object whose exposure proportion is not larger than the preset threshold value as a non-graspable region; and/or,

determining a contact area contained in each article object according to the shape and/or type of each article object, marking an area corresponding to the article object of which the contact area is not shielded as a graspable area, and marking an area corresponding to the article object of which the contact area is shielded as a non-graspable area.

Optionally, the prediction module is specifically adapted to:

and taking the labeled sample images as an original training set, performing expansion processing on the original training set by means of brightness/contrast adjustment, picture affine transformation and/or random white-balance transformation of the pictures, and training the deep learning model with the expanded training set obtained after the expansion.
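The three perturbations named above can be sketched as follows; the parameter ranges and the use of OpenCV for the affine step are assumptions of the sketch, not values from this disclosure. When the affine transform is applied, the annotated graspable/non-graspable label map has to be warped with the same matrix so image and labels stay aligned.

```python
import cv2
import numpy as np


def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Expand one training sample with the three perturbations mentioned above."""
    img = image.astype(np.float32)

    # 1. brightness / contrast adjustment: out = alpha * in + beta
    alpha = rng.uniform(0.8, 1.2)                       # contrast gain (assumed range)
    beta = rng.uniform(-20, 20)                         # brightness offset (assumed range)
    img = img * alpha + beta

    # 2. random white-balance transformation: independent gain per color channel
    gains = rng.uniform(0.9, 1.1, size=3)
    img = img * gains

    # 3. affine transformation of the picture: small rotation, scale and shift
    h, w = img.shape[:2]
    angle = rng.uniform(-10, 10)
    scale = rng.uniform(0.9, 1.1)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    matrix[:, 2] += rng.uniform(-0.05, 0.05, size=2) * (w, h)   # random translation
    img = cv2.warpAffine(img, matrix, (w, h), borderMode=cv2.BORDER_REFLECT)
    # NOTE: warp the label map with the same `matrix` so annotations stay aligned.

    return np.clip(img, 0, 255).astype(np.uint8)
```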

Optionally, the point cloud information is constructed according to the two-dimensional color image and a depth map corresponding to the two-dimensional color image, and the calculation module is specifically adapted to:

establishing a three-dimensional coordinate system corresponding to the three-dimensional article area; the directions of a first coordinate axis and a second coordinate axis contained in the three-dimensional coordinate system are matched with the two-dimensional color image, and the direction of a third coordinate axis in the three-dimensional coordinate system is matched with the preset depth direction;

and calculating the depth coordinate value of each grippable object corresponding to the third coordinate axis, and calculating the three-dimensional pose information of each grippable object according to the depth coordinate value.

Optionally, the preset depth direction includes at least one of: the shooting direction of the camera, the gravity direction, and the direction of the perpendicular of the article carrying surface, wherein the article carrying surface is the plane on which the carrier used for placing the three-dimensional articles is located, and the carrier includes: the ground, a tray, a conveyor belt, and/or a material basket; the sequencing and grabbing module is specifically adapted to:

sequence the graspable objects according to their distance to the camera or to the article carrying surface, and determine the grabbing sequence of the graspable objects according to the sequencing result: the closer a graspable object is to the camera, the earlier it is in the grabbing sequence, and the farther it is from the camera, the later it is in the grabbing sequence; or, the closer a graspable object is to the article carrying surface, the later it is in the grabbing sequence, and the farther it is from the article carrying surface, the earlier it is in the grabbing sequence.

Optionally, the sequencing and grabbing module is further adapted to:

acquiring a conversion relation between a camera coordinate system and a robot coordinate system;

and converting the three-dimensional pose information of each grippable object corresponding to the camera coordinate system into the robot coordinate system according to the conversion relation, and outputting the converted three-dimensional pose information of each grippable object to the robot so as to enable the robot to execute the gripping operation.

Optionally, the three-dimensional pose information is position information; the three-dimensional article area comprises a plurality of graspable objects stacked along the preset depth direction, wherein the graspable objects comprise: cartons, article packaging, cosmetics and pharmaceuticals, and/or toys.

The specific structure and the working principle of each module may refer to the description of the corresponding step in the method embodiment, and are not described herein again.

The embodiment of the application provides a non-volatile computer storage medium, wherein at least one executable instruction is stored in the computer storage medium, and the computer executable instruction can execute the object capture method in any method embodiment.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the electronic device.

As shown in fig. 4, the electronic device may include: a processor 402, a communication interface 404, a memory 406, and a communication bus 408.

Wherein:

the processor 402, communication interface 404, and memory 406 communicate with each other via a communication bus 408.

A communication interface 404 for communicating with network elements of other devices, such as clients or other servers.

The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps in the above embodiments of the object grabbing method.

In particular, program 410 may include program code comprising computer operating instructions.

The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The electronic device comprises one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.

And a memory 406 for storing a program 410. Memory 406 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.

The program 410 may be specifically configured to cause the processor 402 to perform the operations in the above-described method embodiments.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an electronic device according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
