Method and device for detecting associated object

Document No.: 144501 · Publication date: 2021-10-22

Filing note: This technology, "Method and device for detecting associated object" (关联对象的检测方法及装置), was created by Zhang Xuesen, Wang Bairun, Liu Chunya, and Chen Jinghuan on 2021-04-28. Abstract: The disclosure provides a method and a device for detecting an associated object. The detection method comprises: detecting at least one matching object group from an image to be detected, wherein each matching object group comprises at least two target objects; acquiring visual information of each target object in each matching object group and spatial information of the at least two target objects in each matching object group; and determining whether the at least two target objects in each matching object group are associated according to the visual information and the spatial information of the at least two target objects in each matching object group. The disclosed detection method improves the detection accuracy of associated objects.

1. A method of detecting a related object, comprising:

detecting at least one matching object group from an image to be detected, wherein each matching object group comprises at least two target objects;

acquiring visual information of each target object in each matching object group and spatial information of the at least two target objects in each matching object group;

and determining whether the at least two target objects in each matching object group are related according to the visual information and the spatial information of the at least two target objects in each matching object group.

2. The method of claim 1, wherein detecting at least one matching object group from the image to be detected comprises:

detecting each target object and the object type of each target object from the image to be detected;

and for each object class, combining each target object in the object class with each target object in each other object class to obtain the at least one matching object group.

3. The method of claim 1, wherein obtaining visual information for each of the target objects in the respective set of matching objects comprises:

and performing visual feature extraction on each target object in the matching object group to obtain visual information of the target object.

4. The method of claim 1, wherein obtaining spatial information of the at least two target objects in each of the matching object groups comprises:

detecting a detection frame of each target object from the image to be detected;

for each matching object group, generating the spatial information of the at least two target objects in the matching object group according to the position information of the detection frames of the at least two target objects in the matching object group.

5. The method of claim 4, wherein for each of the matching object groups, generating the spatial information of the at least two target objects in the matching object group according to the position information of the detection boxes of the at least two target objects in the matching object group comprises:

generating an auxiliary bounding box of the matching object group, wherein the auxiliary bounding box covers the detection box of each target object in the matching object group;

determining position feature information of each target object in the matching object group according to the auxiliary bounding box and the detection box of each target object;

and fusing the position feature information of the target objects in the same matching object group to obtain the spatial information of the at least two target objects in the matching object group.

6. The method of claim 5, wherein,

the auxiliary bounding box is the bounding box with the smallest area among the bounding boxes covering all the target objects in the matching object group.

7. The method of any of claims 1 to 6, wherein determining whether the at least two target objects in each of the matching object groups are associated based on the visual information and the spatial information of the at least two target objects in each of the matching object groups comprises:

performing fusion processing on the visual information and the spatial information of the at least two target objects in each matching object group to obtain fusion characteristics of each matching object group;

and performing relevance classification processing on the fusion characteristics of each matching object group, and determining whether the at least two target objects in the matching object group are relevant.

8. The method of claim 7, wherein performing relevance classification processing on the fused features of each matching object group to determine whether the at least two target objects in the matching object group are relevant comprises:

performing relevance classification processing on the fusion characteristics of each matching object group to obtain a relevance score between the at least two target objects in each matching object group;

for a plurality of matching object groups to which the same target object belongs, determining the matching object group with the highest relevance score as a target matching object group;

determining the at least two target objects in the target matching object group as associated target objects.

9. The method of claim 1, wherein, in a case where the target object is a human body part, the determining whether the at least two target objects in each of the matching object groups are related comprises:

and determining whether each human body part in the same matching object group belongs to the same human body.

10. The method of claim 1, further comprising:

acquiring a sample image set; the sample image set comprises at least one sample image, each sample image comprises at least one sample matching object group and label information corresponding to the sample matching object group, each sample matching object group comprises at least two sample target objects, and the label information represents the association result of each sample target object in the sample matching object group;

processing the sample image through an association detection network to be trained to detect at least one sample matching object group from the sample image;

processing the sample image through an object detection network to be trained to obtain visual information of each sample target object in each sample matching object group; processing the sample image through the association detection network to be trained to obtain spatial information of the at least two sample target objects in each sample matching object group;

obtaining, through the association detection network to be trained, an association detection result of each sample matching object group according to the visual information and the spatial information of the at least two sample target objects in each sample matching object group;

and determining an error between the association detection result of each sample matching object group and the corresponding label information, and adjusting a network parameter of at least one of the association detection network and the object detection network according to the error until the error converges.

11. An apparatus for detecting an associated object, comprising:

the detection module is used for detecting at least one matched object group from an image to be detected, wherein each matched object group comprises at least two target objects;

an obtaining module, configured to obtain visual information of each target object in each of the matching object groups and spatial information of the at least two target objects in each of the matching object groups;

a determining module, configured to determine whether the at least two target objects in each of the matching object groups are related according to the visual information and the spatial information of the at least two target objects in each of the matching object groups.

12. The apparatus of claim 11, wherein the detection module comprises:

the detection submodule is used for detecting each target object and the object type of each target object from the image to be detected;

and the combination sub-module is used for combining, for each object class, each target object in the object class with each target object in each other object class to obtain the at least one matching object group.

13. The apparatus of claim 11, wherein the means for obtaining is further configured to:

and performing visual feature extraction on each target object in the matching object group to obtain visual information of the target object.

14. The apparatus of claim 11, wherein the means for obtaining is further configured to:

detecting a detection frame of each target object from the image to be detected;

for each matching object group, generating the spatial information of the at least two target objects in the matching object group according to the position information of the detection frames of the at least two target objects in the matching object group.

15. The apparatus of claim 14, wherein the means for obtaining is further for:

generating an auxiliary bounding box of the matching object group, wherein the auxiliary bounding box covers the detection box of each target object in the matching object group;

determining position feature information of each target object in the matching object group according to the auxiliary bounding box and the detection box of each target object;

and fusing the position feature information of the target objects in the same matching object group to obtain the spatial information of the at least two target objects in the matching object group.

16. The apparatus of any of claims 11 to 15, wherein the means for determining comprises:

the fusion submodule is used for carrying out fusion processing on the visual information and the spatial information of the at least two target objects in each matching object group to obtain fusion characteristics of each matching object group;

and the determining submodule is used for performing relevance classification processing on the fusion features of each matching object group and determining whether the at least two target objects in the matching object group are relevant or not.

17. The apparatus of claim 16, wherein the determination submodule is further configured to:

performing relevance classification processing on the fusion characteristics of each matching object group to obtain a relevance score between the at least two target objects in each matching object group;

for a plurality of matching object groups to which the same target object belongs, determining the matching object group with the highest relevance score as a target matching object group;

determining the at least two target objects in the target matching object group as associated target objects.

18. An electronic device, comprising:

a processor; and

a memory, communicatively coupled to the processor, storing computer instructions readable by the processor, the computer instructions, when read by the processor, causing the processor to perform the method of any of claims 1 to 10.

19. A storage medium storing computer readable instructions for causing a computer to perform the method of any one of claims 1 to 10.

20. A computer program comprising computer readable code, wherein the computer readable code when run in an electronic device causes a processor in the electronic device to implement the method of any of claims 1 to 10.

Technical Field

The disclosure relates to the technical field of computer vision, in particular to a method and a device for detecting a related object.

Background

Target detection is an important part of intelligent video analysis, for example, detecting human bodies, human faces, and the like in video frames or scene images. In the related art, target detection boxes in a video frame or scene image may be obtained with a target detector such as Faster R-CNN (Region-based Convolutional Neural Network), so as to implement target detection.

However, in dense scenes, different objects may occlude each other. Taking a scene with a relatively dense crowd as an example, body parts of different persons may occlude one another; in this case, a plain target detector cannot meet the detection requirements of high-precision scenarios.

Disclosure of Invention

The embodiments of the present disclosure provide a method and a device for detecting an associated object, an electronic device, and a storage medium.

In a first aspect, an embodiment of the present disclosure provides a method for detecting a related object, including: detecting at least one matching object group from an image to be detected, wherein each matching object group comprises at least two target objects; acquiring visual information of each target object in each matching object group and spatial information of the at least two target objects in each matching object group; and determining whether the at least two target objects in each matching object group are related according to the visual information and the spatial information of the at least two target objects in each matching object group.

In some embodiments, detecting at least one matching object group from the image to be detected includes: detecting each target object and the object class of each target object from the image to be detected; and for each object class, combining each target object in the object class with each target object in each other object class to obtain the at least one matching object group.

In some embodiments, obtaining visual information of each of the target objects in the respective set of matching objects comprises: and performing visual feature extraction on each target object in the matching object group to obtain visual information of the target object.

In some embodiments, obtaining spatial information of the at least two target objects in each of the matching object groups comprises: detecting a detection frame of each target object from the image to be detected; for each matching object group, generating the spatial information of the at least two target objects in the matching object group according to the position information of the detection frames of the at least two target objects in the matching object group.

In some embodiments, for each of the matching object groups, generating the spatial information of the at least two target objects in the matching object group according to the position information of the detection boxes of the at least two target objects in the matching object group includes: generating an auxiliary bounding box of the matching object group, wherein the auxiliary bounding box covers the detection box of each target object in the matching object group; determining position feature information of each target object in the matching object group according to the auxiliary bounding box and the detection box of each target object; and fusing the position feature information of the target objects in the same matching object group to obtain the spatial information of the at least two target objects in the matching object group.

In some embodiments, the auxiliary bounding box is the smallest-area bounding box among those covering the detection boxes of the target objects in the matching object group.

In some embodiments, determining whether the at least two target objects in each of the matching object groups are associated based on the visual information and the spatial information of the at least two target objects in each of the matching object groups comprises: performing fusion processing on the visual information and the spatial information of the at least two target objects in each matching object group to obtain fusion characteristics of each matching object group; and performing relevance classification processing on the fusion characteristics of each matching object group, and determining whether the at least two target objects in the matching object group are relevant.

In some embodiments, performing relevance classification processing on the fusion features of each matching object group to determine whether the at least two target objects in the matching object group are related includes: performing relevance classification processing on the fusion characteristics of each matching object group to obtain a relevance score between the at least two target objects in each matching object group; for a plurality of matching object groups to which the same target object belongs, determining the matching object group with the highest relevance score as a target matching object group; determining the at least two target objects in the target matching object group as associated target objects.

In some embodiments, in a case where the target object is a human body part, the determining whether the at least two target objects in each of the matching object groups are related includes: and determining whether the human body parts in the same matching object group belong to the same human body.

In some embodiments, the method further comprises: acquiring a sample image set, the sample image set comprising at least one sample image, each sample image comprising at least one sample matching object group and label information corresponding to the sample matching object group, each sample matching object group comprising at least two sample target objects, and the label information representing the association result of the sample target objects in the sample matching object group; processing the sample image through an association detection network to be trained to detect at least one sample matching object group from the sample image; processing the sample image through an object detection network to be trained to obtain visual information of each sample target object in each sample matching object group; processing the sample image through the association detection network to be trained to obtain spatial information of the at least two sample target objects in each sample matching object group; obtaining, through the association detection network to be trained, an association detection result of each sample matching object group according to the visual information and the spatial information of the at least two sample target objects in each sample matching object group; and determining an error between the association detection result of each sample matching object group and the corresponding label information, and adjusting a network parameter of at least one of the association detection network and the object detection network according to the error until the error converges.

In a second aspect, the present disclosure provides an apparatus for detecting an associated object, including: the detection module is used for detecting at least one matched object group from an image to be detected, wherein each matched object group comprises at least two target objects; an obtaining module, configured to obtain visual information of each target object in each of the matching object groups and spatial information of the at least two target objects in each of the matching object groups; a determining module, configured to determine whether the at least two target objects in each of the matching object groups are related according to the visual information and the spatial information of the at least two target objects in each of the matching object groups.

In some embodiments, the detection module comprises: the detection submodule, used for detecting each target object and the object class of each target object from the image to be detected; and the combination submodule, used for combining, for each object class, each target object in the object class with each target object in each other object class to obtain the at least one matching object group.

In some embodiments, the obtaining module is further configured to: and performing visual feature extraction on each target object in the matching object group to obtain visual information of the target object.

In some embodiments, the obtaining module is further configured to: detecting a detection frame of each target object from the image to be detected; for each matching object group, generating the spatial information of the at least two target objects in the matching object group according to the position information of the detection frames of the at least two target objects in the matching object group.

In some embodiments, the obtaining module is further configured to: generating an auxiliary bounding box of the matching object group, wherein the auxiliary bounding box covers the detection box of each target object in the matching object group; determining position feature information of each target object in the matching object group according to the auxiliary bounding box and the detection box of each target object; and fusing the position feature information of the target objects in the same matching object group to obtain the spatial information of the at least two target objects in the matching object group.

In some embodiments, the auxiliary bounding box is the smallest-area bounding box among those covering the detection boxes of the target objects in the matching object group.

In some embodiments, the determining module comprises: the fusion submodule is used for carrying out fusion processing on the visual information and the spatial information of the at least two target objects in each matching object group to obtain fusion characteristics of each matching object group; and the determining submodule is used for performing relevance classification processing on the fusion features of each matching object group and determining whether the at least two target objects in the matching object group are relevant or not.

In some embodiments, the determination sub-module is further to: performing relevance classification processing on the fusion characteristics of each matching object group to obtain a relevance score between the at least two target objects in each matching object group; for a plurality of matching object groups to which the same target object belongs, determining the matching object group with the highest relevance score as a target matching object group; determining the at least two target objects in the target matching object group as associated target objects.

In some embodiments, in the case that the target object is a human body part, the determining module is further configured to: and determining whether each human body part in the same matching object group belongs to the same human body.

In a third aspect, the disclosed embodiments provide an electronic device, including: a processor; and a memory, communicatively coupled to the processor, storing computer instructions readable by the processor, the computer instructions, when read by the processor, causing the processor to perform the method according to any of the embodiments of the first aspect.

In a fourth aspect, the embodiments of the present disclosure provide a storage medium storing computer-readable instructions for causing a computer to execute the method according to any one of the embodiments of the first aspect.

In a fifth aspect, the disclosed embodiments provide a computer program comprising computer readable code, wherein the computer readable code, when executed in an electronic device, causes a processor in the electronic device to perform the method according to any one of the embodiments of the first aspect.

According to the method for detecting an associated object of the present disclosure, at least one matching object group is detected from an image to be detected, each matching object group comprising at least two target objects; visual information of each target object and spatial information of the at least two target objects in each matching object group are acquired; and whether the target objects in each matching object group are associated is determined according to the visual information and the spatial information. Assisting target detection with the association features between target objects in the same matching object group improves detection accuracy in complex scenes; for example, human body detection in a multi-person scene can be realized through face-body association detection. In the association detection itself, combining the visual information with the spatial information of the target objects improves the association accuracy: in face-body association detection, for instance, the spatial position features of the face and the body are considered in addition to their visual features, so that the spatial position relationship assists the association, improving both the association accuracy and the overall detection accuracy.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.

Fig. 1 is a flow chart of a method of detecting an associated object in accordance with some embodiments of the present disclosure.

Fig. 2 is a flow chart of a method of detecting a matching object group in some embodiments according to the present disclosure.

Fig. 3 is a flow chart of a visual information extraction method in some embodiments according to the present disclosure.

Fig. 4 is a schematic diagram of a detection network in some embodiments according to the present disclosure.

FIG. 5 is a schematic diagram of a method of detecting an associated object according to some embodiments of the present disclosure.

Fig. 6 is a schematic diagram of an association detection network in some embodiments according to the present disclosure.

FIG. 7 is a flow diagram of a method of determining whether a target object in a set of matching objects is associated in some embodiments according to the present disclosure.

FIG. 8 is a schematic illustration of a visual output of a detection result of an associated object in accordance with some embodiments of the present disclosure.

Fig. 9 is a schematic diagram of a training flow of a neural network for detecting an associated object in accordance with some embodiments of the present disclosure.

Fig. 10 is a block diagram of an associated object detection apparatus according to some embodiments of the present disclosure.

Fig. 11 is a block diagram of a detection module in a detection apparatus of an associated object according to some embodiments of the present disclosure.

Fig. 12 is a block diagram of a determination module in a detection apparatus of an associated object according to some embodiments of the present disclosure.

FIG. 13 is a block diagram of a computer system suitable for use in implementing the detection method of associated objects of the present disclosure.

Detailed Description

The technical solutions of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure. In addition, technical features involved in different embodiments of the present disclosure described below may be combined with each other as long as they do not conflict with each other.

Associated object detection is of significant research value for intelligent video analysis. Taking human body detection as an example, in a complex scene with many people, persons may occlude one another; a detection method aimed at a single human body then suffers a high false detection rate and can hardly meet the requirement. Associated object detection can instead determine matching object groups using face-body association: by determining whether a face and a human body in the same matching object group belong to the same person, detection of the target objects (namely, the face and the human body) is realized, and the target detection accuracy in complex scenes is improved.

Target objects can be detected with detectors such as Faster R-CNN (Region-based CNN) to obtain the face and human body detection boxes in a video frame or scene image; a classifier is then trained on the visual features of the faces and bodies, and the classifier predicts the association result. The association accuracy of such methods is relatively limited. In high-precision scenarios such as a multiplayer game scene, people are often partially occluded, and it must also be determined whether a user's face, body, hands, and even game props are associated, so that it is known which user performed a given action; a failed association may cause great losses. Therefore, the association accuracy in the related art can hardly meet the requirements of high-precision scenarios.

The embodiment of the disclosure provides a method and a device for detecting a related object, an electronic device and a storage medium, so as to improve the detection precision of the related object.

In a first aspect, the disclosed embodiments provide a method for detecting a related object. The execution subject of the detection method of the embodiment of the present disclosure may be a terminal device, a server, or other processing device, for example, the terminal device may be a user device, a mobile device, a user terminal, a cellular phone, a vehicle-mounted device, a personal digital assistant, a handheld device, a computing device, a wearable device, or the like. In some embodiments, the detection method may also be implemented by way of the processor invoking computer-readable instructions stored in the memory, which is not limited by the present disclosure.

A method of detecting an associated object according to some embodiments of the present disclosure is illustrated in fig. 1 and described below in conjunction with fig. 1.

As shown in fig. 1, in some embodiments, the method for detecting a related object of the present disclosure includes:

s110, detecting at least one matching object group from the image to be detected, wherein each matching object group comprises at least two target objects.

Specifically, the image to be detected may be a natural scene image from which preset associated target objects are to be detected. It is understood that "associated target object" in the present disclosure refers to two or more target objects having an association in the scene of interest. For example, taking face-human body association in human body detection as an example, the image to be detected includes a plurality of human faces and a plurality of human bodies, and a "human face" and a "human body" belonging to the same person may be referred to as associated target objects. For another example, in a multi-person horse-riding entertainment scene, the image to be detected includes a plurality of human bodies and a plurality of horses, and a "human body" and a "horse" having a riding relationship may be referred to as associated target objects. Further associated-object scenarios will occur to those skilled in the art and are not enumerated here.

The image to be detected may be acquired by an image acquisition device such as a camera, and specifically, the image to be detected may be a single frame image captured by the image acquisition device, and may also include a frame image in a video stream captured by the image acquisition device, which is not limited in this disclosure.

In the embodiment of the present disclosure, at least one matching object group may be detected from the image to be detected, where each matching object group includes at least two target objects. The matching object group refers to a set formed by at least two target objects which need to be confirmed whether to be associated.

As shown in fig. 2, in some embodiments, detecting at least one matching object group from the image to be detected may include:

s111, detecting each target object and the object type of each target object from the image to be detected;

and S112, for each object class, combining each target object in the object class with each target object in each other object class to obtain the at least one matching object group.

In one example, taking "face-human body" association detection as an example, a plurality of target objects and the object class of each target object are detected from the image to be detected. The object classes here are the "face" class and the "human body" class; the target objects of the face class include m faces, and the target objects of the human body class include n human bodies. Each of the m faces is combined with each of the n human bodies, yielding m × n face-body pairs. The faces and human bodies are the detected target objects, and the m × n face-body pairs obtained by pairwise combination are the matching object groups, where m and n are positive integers.

In another example, in a multiplayer game scenario, each person may also be associated with objects, such as horses in a horse-riding entertainment scene or game props in a table game scene, so the disclosed method is equally applicable to "human-object" association detection. Taking a horse-riding entertainment scene as an example, a plurality of target objects and the object class of each target object are detected from the image to be detected; the object classes are the human body class and the object class, the target objects of the human body class include p human bodies, and the target objects of the object class include q horses. Each of the p human bodies is combined with each of the q horses, yielding p × q human-object pairs. The human bodies and the horses are the detected target objects, and the p × q human-object pairs obtained by pairwise combination are the matching object groups, where p and q are positive integers.

In another example, take the association detection of three human body parts. A plurality of target objects and the object class of each target object are detected from the image to be detected; the object classes are the human hand class, the face class, and the human body class, and each object class includes at least one target object belonging to that class. Each target object in each object class is combined with target objects in the other two object classes, i.e., 1 human hand, 1 face, and 1 human body are combined into a hand-face-body group, and these groups are the matching object groups. For example, if the target objects of the human hand class include k hands, those of the face class include m faces, and those of the human body class include n human bodies, then combining each of the k hands with each of the m faces and each of the n human bodies yields k × m × n hand-face-body groups, where k, m, and n are positive integers.

As can be understood from the above examples, in the embodiments of the present disclosure, there is no need to limit the number of target objects in the matching object group, nor the category of the target objects. The matching object group may include at least two target objects, for example, 2, 3, 4 or more target objects, where the target objects may be human bodies or parts of human bodies, objects associated with human bodies, or entities unrelated to human bodies in a scene, and the disclosure is not limited thereto.
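
The exhaustive combination described in the above examples can be sketched in a few lines of Python. The sketch below is illustrative only; the function name make_matching_groups and the (box, class_name) detection format are assumptions for illustration, not details fixed by the disclosure.

    from itertools import product

    def make_matching_groups(detections, classes):
        """Exhaustively combine detected target objects across object classes.

        detections: list of (box, class_name) tuples from the detector.
        classes: object classes to combine, e.g. ["face", "body"] for
                 m x n pairs or ["hand", "face", "body"] for k x m x n triples.
        """
        # Bucket the detected target objects by object class.
        buckets = {c: [d for d in detections if d[1] == c] for c in classes}
        # Cartesian product: one target object from each class per group.
        return list(product(*(buckets[c] for c in classes)))

    dets = [((0, 0, 10, 10), "face"), ((50, 0, 60, 10), "face"),
            ((0, 0, 40, 80), "body")]
    groups = make_matching_groups(dets, ["face", "body"])  # 2 x 1 = 2 groups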

In one example, the image to be detected is processed through an association detection network to obtain at least one matching object group from the image to be detected. The details are set forth below.

S120, acquiring visual information of each target object in each matching object group and spatial information of at least two target objects in each matching object group.

Specifically, the visual information is information on the visual characteristics of each target object in the image, generally image features obtained from the pixel values of the image. For example, visual feature extraction on the image to be detected yields image feature information of the faces, hands, human bodies, or objects in the image. The spatial information may include feature information on the spatial positions and/or postures of the target objects in the matching object group, or spatial position relationship information or relative posture information between the target objects in the matching object group, for example, relative spatial position feature information and/or relative orientation information of a face and a body, a face and a hand, or a body and an object in the image.

In one example, visual feature extraction may be performed on the region where each target object is located in the image to be detected; for example, feature points are extracted and their pixel values are converted into the visual features of the target object. Position feature information of each target object may be generated according to the position of its boundary in the image, and the posture of each target object may be analyzed against a standard posture model to obtain its posture information, thereby obtaining the spatial information of the target object. Optionally, the relative position and/or relative posture between the target objects in the matching object group may also be analyzed, in which case the obtained spatial information also includes the relative position information and/or relative posture information between each target object and the other target objects.

In one example, in the process of processing an image to be detected, visual feature extraction is performed on the image to be detected through an object detection network to obtain a feature map, and then visual information of each target object is obtained according to the feature map extraction.

In one example, in the process of processing the image to be detected, the image to be detected may be processed through the association detection network, so as to obtain spatial information of at least two target objects in each matching object group.

The network structures and implementation principles in the above examples are explained in detail below.

S130, determining whether the at least two target objects in each matching object group are related or not according to the visual information and the spatial information of the at least two target objects in each matching object group.

For a certain matching object group, for example, a human face and human body matching object group, it is intended to determine whether there is an association between a human body and a human face in the matching object group, that is, whether the human body and the human face belong to the same person. After the visual information and the spatial information of the at least two target objects in the matching object group are obtained, whether the at least two target objects in the matching object group are related or not is determined by combining the visual information and the spatial information.

It is worth mentioning that at least one inventive concept of the disclosed method lies in: the relevance of the target objects is determined based on the visual information combined with the spatial information of the target objects in the matching object group. Taking face-body association detection as an example, the position of a face relative to the body follows a regular distribution; combining the spatial position information of the face and the body with their visual information therefore makes the association more robust to the occlusion problem in complex multi-person scenes and improves the association accuracy between the body and the face.

In addition, based on the above inventive concept, the related target object in the method of the present disclosure refers to an object having a relevance in a spatial position, so that high-reliability spatial information can be extracted from the image to be detected, the number and types of the target objects in the matching object group are not limited, and the target objects may be human body parts, or any other objects having a relevance in a spatial position, such as animals and props, which is not described in detail in this disclosure.

In one example, the visual information and the spatial information of the at least two target objects in each matching object group may be fused through an association detection network (e.g., "Pair Head" in fig. 4), and relevance classification may be performed on the fused features, so as to determine whether the at least two target objects in a given matching object group are associated. The details are described below.

Therefore, the detection method of the present disclosure assists target detection with the association features between target objects in the same matching object group, improving target detection accuracy in complex scenes; for example, human body detection in a multi-person scene is realized through face-body association detection, improving the detection accuracy. In the association detection itself, combining the visual information with the spatial information of the target objects improves the association accuracy: in face-body association detection, the spatial position features of the face and the body are considered in addition to their visual features, so that the spatial position relationship assists the association, improving both the association accuracy and the overall target detection accuracy.

In some embodiments, visual feature extraction may be performed on each target object in the matching object group to obtain visual information of the target object.

Specifically, fig. 3 shows a process of extracting visual information of a target object, fig. 4 shows an architecture of a detection network of the method of the present disclosure, and the method of the present disclosure is further described with reference to fig. 3 and 4.

As shown in fig. 3, in some embodiments, the method for detecting the related object includes:

s310, extracting visual features of the image to be detected to obtain a feature map of the image to be detected.

Specifically, as shown in fig. 4, the detection network of the present disclosure includes an object detection network 100 and an association detection network 200, where the object detection network 100 may be a trained neural network, and is configured to perform visual feature extraction on a target object on an image to be detected to obtain visual information of the target object.

In this embodiment, the object detection network 100 may include a backbone network (backbone) and a Feature Pyramid Network (FPN), and the image to be detected is sequentially processed by the backbone network and the FPN to obtain a Feature map (Feature) of the image to be detected.

In one example, the backbone network may be, for example, VGGNet or ResNet. The FPN converts the feature map produced by the backbone into feature maps with a multilayer pyramid structure. The backbone is the image feature extraction part, and the FPN performs feature enhancement, strengthening the shallow features extracted by the backbone. It is to be understood that the above networks are only examples and do not limit the present disclosure; for example, in other embodiments the backbone may adopt any other form of feature extraction network, or the FPN in fig. 4 may be omitted and the feature map extracted by the backbone used directly as the feature map (feature) of the image to be detected. The present disclosure is not limited in this regard.
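
As a rough, non-authoritative illustration of such a backbone-plus-FPN stage (the disclosure does not fix a particular implementation), torchvision's ResNet-FPN helper produces exactly this kind of multi-level pyramid feature map. Note that the weight-loading argument name varies across torchvision versions.

    import torch
    from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

    # ResNet-50 backbone with an FPN on top (weights omitted for brevity;
    # newer torchvision uses weights=None instead of pretrained=False).
    backbone = resnet_fpn_backbone("resnet50", pretrained=False)
    image = torch.randn(1, 3, 800, 1333)  # a dummy image to be detected
    features = backbone(image)
    # The FPN outputs a dict of multi-scale feature maps (the "pyramid").
    for level, fmap in features.items():
        print(level, tuple(fmap.shape))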

S320, detecting the detection boxes of the target objects based on the feature map.

S330, extracting the visual information of each target object in each matching object group according to the detection boxes.

Specifically, with continued reference to fig. 4, the object detection network 100 also includes an RPN (Region Proposal Network). After the feature map of the image to be detected is obtained, the RPN predicts candidate detection boxes (anchors) of the target objects and their object classes based on the feature map (feature) output by the FPN. For example, for face-body association detection, the RPN computes, from the feature map, the detection boxes of the human bodies and faces in the image to be detected and the "face" or "human body" class to which the target object in each box region belongs.

In this embodiment, the object detection network 100 further includes an RCNN (Region-based Convolutional Neural Network). The RCNN computes a bounding box (bbox) offset for each target object detection box based on the feature map and performs bounding box regression on the detection box according to the bbox offset, obtaining a more accurate target object detection box.
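
Assuming the conventional R-CNN offset parameterization (the disclosure does not spell out the encoding), decoding a predicted bbox offset into a refined detection box looks like this:

    import math

    def apply_bbox_offset(box, deltas):
        """Refine a detection box (x1, y1, x2, y2) with a predicted offset.

        deltas = (dx, dy, dw, dh): the box center moves proportionally to
        the box size, and the width/height are rescaled exponentially.
        """
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
        dx, dy, dw, dh = deltas
        cx, cy = cx + dx * w, cy + dy * h
        w, h = w * math.exp(dw), h * math.exp(dh)
        return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)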

After the detection boxes of the target objects are obtained, the visual feature information of each target object is extracted based on the feature map and the detection boxes. For example, further feature extraction may be performed on each detection box region of the feature map, and the resulting feature of each box serves as the visual feature information of the corresponding target object. Alternatively, the feature map and the detection boxes may be input into a visual feature extraction network to obtain the visual feature information of each detection box, i.e., the visual features of each target object.
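
For instance, per-box visual features can be cropped from the feature map with RoI-Align, one common choice; the operator and the 1/8 feature stride below are assumptions for illustration.

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 64, 100, 168)  # e.g. an FPN output level
    # Detection boxes as (batch_index, x1, y1, x2, y2) in image coordinates.
    boxes = torch.tensor([[0., 120., 40., 180., 100.],
                          [0., 100., 40., 220., 400.]])
    # One 64 x 7 x 7 visual feature per detection box; spatial_scale maps
    # image coordinates onto the feature map (assumed stride of 8 here).
    visual = roi_align(feature_map, boxes, output_size=(7, 7),
                       spatial_scale=1.0 / 8)
    print(visual.shape)  # torch.Size([2, 64, 7, 7])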

In one example, taking face-body association detection as an example, an input image to be detected is shown in fig. 5. The RPN and RCNN networks obtain the detection boxes of each face and each human body in the image to be detected according to its feature map; the detection boxes may be rectangular.

Referring to fig. 5, the image to be detected includes 3 human bodies and 3 faces in total. After RPN and RCNN processing, 3 face detection boxes 201, 202, 203 and 3 human body detection boxes 211, 212, 213 are obtained, and the visual information of each human body and face is extracted according to the corresponding detection box.

The association detection network (e.g., "Pair Head" in fig. 4) 200 may also be a trained neural network; it combines target objects of different classes based on the obtained detection boxes and object classes to obtain the matching object groups. For example, in a face-body association detection scene, the faces and human bodies are combined pairwise based on the obtained detection boxes and object classes to obtain the face-body matching object groups. Taking fig. 5 as an example, the 3 face detection boxes 201, 202, 203 and the 3 human body detection boxes 211, 212, 213 are combined pairwise, yielding 9 face-body matching object groups. Next, the position features of each face-body matching object group need to be determined.

For each matching object group, an auxiliary bounding box is first constructed according to the detection boxes of the target objects in the group. Taking the matching object group composed of the face detection box 201 and the human body detection box 212 in fig. 5 as an example, a union box that contains both detection boxes and has the smallest area is determined as the auxiliary bounding box, i.e., the auxiliary bounding box 231 shown by the dotted line in fig. 5.
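
A minimal sketch of this construction, with boxes in (x1, y1, x2, y2) form: the union box is simply the coordinate-wise min/max of the two detection boxes (the example coordinates are illustrative).

    def union_box(box_a, box_b):
        """Smallest axis-aligned box covering both detection boxes."""
        return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

    # e.g. face box 201 and body box 212 -> auxiliary bounding box 231
    aux = union_box((120, 40, 180, 100), (100, 40, 220, 400))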

It should be noted that the purpose of the auxiliary bounding box is the subsequent computation of the spatial information of the target objects in the matching object group. In this embodiment, an auxiliary bounding box covering the detection box of every target object in the matching object group is selected, so that the spatial information later obtained for each target object is fused with that of the other target objects in the group; the associated object detection is thus performed based on the potential spatial position relationship between actually associated target objects, making the information more compact, reducing interference from other positions, and reducing the amount of computation. Further, the auxiliary bounding box with the smallest area may be chosen among those covering the detection boxes of all target objects in the group. In other embodiments, those skilled in the art will appreciate that the auxiliary bounding box 231 need only be guaranteed to cover at least the target objects in the matching object group.

After the auxiliary bounding box is obtained, the position feature information of each target object is generated from its detection box and the auxiliary bounding box. In fig. 5, face mask information, i.e., feature information indicating the spatial position of the face detection box 201 relative to the auxiliary bounding box 231 of the matching object group, is generated from the face detection box 201 and the auxiliary bounding box 231. Similarly, human body mask information, i.e., feature information indicating the spatial position of the human body detection box 212 relative to the auxiliary bounding box 231, is generated from the human body detection box 212 and the auxiliary bounding box 231.

In one example, when computing the position feature information of the face and the human body, the pixel values inside the face detection box 201 and the human body detection box 212 may be set to 1 and the initial pixel values of the auxiliary bounding box 231 set to 0, so that the position feature information of the face and the body relative to the auxiliary bounding box can be obtained from the pixel values.

After the position feature information of the target objects is obtained, the position feature information of the at least two target objects in the matching object group is concatenated or otherwise fused to obtain the spatial information of the target objects in the matching object group.
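
One way to realize the 0/1 mask construction described above is sketched below; pixel-coordinate boxes are assumed, and the 7 × 7 output size is chosen to match the spatial feature used later in the Pair Head.

    import torch
    import torch.nn.functional as F

    def box_mask(box, aux_box, out_size=7):
        """Binary mask of a detection box inside the auxiliary bounding box.

        Pixels inside the detection box are set to 1; the rest of the
        auxiliary bounding box stays 0. The mask is then resized.
        """
        ax1, ay1, ax2, ay2 = [int(v) for v in aux_box]
        mask = torch.zeros(ay2 - ay1, ax2 - ax1)
        x1, y1, x2, y2 = [int(v) for v in box]
        mask[y1 - ay1:y2 - ay1, x1 - ax1:x2 - ax1] = 1.0
        return F.interpolate(mask[None, None], size=(out_size, out_size),
                             mode="nearest")[0, 0]

    face_box, body_box = (120, 40, 180, 100), (100, 40, 220, 400)
    aux = (100, 40, 220, 400)  # auxiliary bounding box of the group
    # Stack the two masks into a 2 x 7 x 7 spatial feature.
    spatial = torch.stack([box_mask(face_box, aux), box_mask(body_box, aux)])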

The above description uses the matching object group formed by the face in the face detection box 201 and the human body in the human body detection box 212; the position features of the other matching object groups are computed in the same way by performing the above process for each group in turn, which the present disclosure does not repeat.

Taking a matching object group composed of a human face object and a human body object as an example, after obtaining the visual information and the spatial information, the association detection network (for example, "Pair Head" in fig. 4) determines whether the target objects are associated according to the visual information and the spatial information of the matching object group.

The network structure of the association detection network (Pair Head) is shown in fig. 6. The visual information of the face detection box 201 and the human body detection box 212 is processed by an RoI (Region of Interest) pooling layer to obtain the face visual feature 131 and the human body visual feature 132 respectively, and the spatial feature 133 is obtained by feature transformation of the spatial information. In this embodiment, the face visual feature 131 and the human body visual feature 132 are each represented by a feature of size 64 × 7 × 7, and the spatial feature 133 is represented by a feature of size 2 × 7 × 7.

The face visual feature 131, the human body visual feature 132, and the spatial feature 133 are fused to obtain the fusion feature of the matching object group, and relevance classification is performed on the fusion feature to determine whether the target objects in the matching object group are associated.
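
A schematic PyTorch rendering of this fusion-and-classification step is given below. The feature shapes follow the embodiment above, while the module name PairHead and the hidden layer size are assumptions for illustration.

    import torch
    import torch.nn as nn

    class PairHead(nn.Module):
        """Fuse two 64x7x7 visual features with a 2x7x7 spatial feature and
        classify whether the pair of target objects is associated."""

        def __init__(self):
            super().__init__()
            # (64 + 64 + 2) channels of 7x7 features, flattened.
            self.fc = nn.Sequential(
                nn.Linear((64 + 64 + 2) * 7 * 7, 256), nn.ReLU(),
                nn.Linear(256, 1))  # relevance score logit

        def forward(self, vis_face, vis_body, spatial):
            fused = torch.cat([vis_face, vis_body, spatial], dim=1)  # fusion feature
            return self.fc(fused.flatten(1))

    head = PairHead()
    score = head(torch.randn(1, 64, 7, 7), torch.randn(1, 64, 7, 7),
                 torch.randn(1, 2, 7, 7))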

In some embodiments, as shown in FIG. 7, determining whether the target object in the set of matching objects is associated may include:

and S710, performing relevance classification processing on the fusion characteristics of each matching object group to obtain relevance scores between at least two target objects in each matching object group.

S720, for a plurality of matching object groups to which the same target object belongs, determining the matching object group with the highest association degree value as the target matching object group.

S730, determining at least two target objects in the target matching object group as related target objects.

Specifically, the network structures shown in fig. 4 to 6 are still taken as the example. After the fusion feature of each matching object group is obtained, it is passed through a fully connected layer (FC) 140, which performs relevance classification on the fusion feature, yielding a relevance score between the target objects in each matching object group.

For example, as shown in fig. 5, after the classification processing of the fully connected layer 140, prediction scores of the 9 matching object groups are obtained. A given face or human body belongs to 3 matching object groups; for example, the face 201 forms 3 matching object groups with the human bodies 211, 212, and 213, and among these three groups the one with the highest relevance score is selected as the target matching object group. In this example, if the matching object group composed of the face 201 and the human body 211 has the highest relevance score, that group is taken as the target matching object group, and the face 201 and the human body 211 are determined to be associated target objects, i.e., the face 201 and the human body 211 belong to the same person.
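
The per-object selection of steps S720-S730 can be sketched as follows; the group/score bookkeeping is illustrative, and the object identifiers are whatever the detector assigns.

    def select_target_groups(scored_groups):
        """For each target object, keep the matching object group in which
        it appears with the highest relevance score.

        scored_groups: list of (group, score), where group is a tuple of
        target-object ids such as (("face", 201), ("body", 211)).
        """
        best = {}  # target-object id -> (best score, its group)
        for group, score in scored_groups:
            for obj in group:
                if obj not in best or score > best[obj][0]:
                    best[obj] = (score, group)
        # Deduplicate: a group may be best for several of its members.
        return {group for _, group in best.values()}

    scored = [((("face", 201), ("body", 211)), 0.93),
              ((("face", 201), ("body", 212)), 0.41)]
    targets = select_target_groups(scored)  # face 201 pairs with body 211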

In addition, in some embodiments, to visualize the output of the model, the associated target objects may be displayed in the image after they are determined.

In one example, the visualized output may be as shown in fig. 8. The example of fig. 8 is a multi-player tabletop game scene in which the detection of associated objects includes "face-body-hand" association detection; a plurality of "face-body-hand" target matching object groups are obtained through the above embodiments, as will be understood by those skilled in the art, and the disclosure does not repeat this here.

After the target matching object groups are obtained, the face, body and hand detection frames included in each target matching object group can be displayed in the image. For example, fig. 8 includes 3 face detection frames 201, 202 and 203, 3 body detection frames 211, 212 and 213, and 5 hand detection frames 221, 222, 223, 224 and 225. In one example, detection frames of different categories can be shown in different colors; since fig. 8 is a grayscale figure the colors are not visible, but this will be understood by those skilled in the art and is not detailed in this disclosure.

Associated target objects in the same target matching object group are connected and displayed using connecting lines. For example, in fig. 8, within each target matching object group, the center point of the hand detection frame and the center point of the face detection frame are each connected to the center point of the body detection frame by a dotted line, so that the associated target objects in the image are clearly indicated and an intuitive visualization result is obtained.
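
An OpenCV sketch of this visualization might look as follows; the colors, coordinates and helper names are assumptions, and solid lines stand in for the dotted lines of fig. 8, since OpenCV has no built-in dotted-line style.

```python
# Illustrative visualization in the spirit of fig. 8, using OpenCV.
import cv2

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

def draw_group(img, face_box, body_box, hand_boxes):
    # one color per category: face green, body blue, hands red (BGR)
    cv2.rectangle(img, face_box[:2], face_box[2:], (0, 255, 0), 2)
    cv2.rectangle(img, body_box[:2], body_box[2:], (255, 0, 0), 2)
    cv2.line(img, center(face_box), center(body_box), (200, 200, 200), 1)
    for hb in hand_boxes:
        cv2.rectangle(img, hb[:2], hb[2:], (0, 0, 255), 2)
        # connect the hand center to the body center, as in fig. 8
        cv2.line(img, center(hb), center(body_box), (200, 200, 200), 1)
```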

In some embodiments, before feature fusion, the visual information and the spatial information of the matching object group may be reduced in dimension through a fully connected layer, and fusion is performed after the features are mapped to a fixed length; this is not described in detail in this disclosure.

In some embodiments, the method of the present disclosure further includes a training process for the neural network shown in fig. 4. The training process is shown in fig. 9 and is described below with reference to figs. 4 and 9.

S910, obtaining a sample image set.

S920, processing the sample images in the sample image set through the association detection network to be trained, and detecting at least one sample matching object group from the sample images.

S930, processing the sample images through the object detection network to be trained to obtain the visual information of each sample target object in each sample matching object group, and processing the sample images through the association detection network to be trained to obtain the spatial information of the at least two sample target objects in each sample matching object group.

S940, obtaining, through the association detection network to be trained, the association detection result of each sample matching object group according to the visual information and the spatial information of the at least two sample target objects in the group.

S950, determining the error between the association detection result of each sample matching object group and its label information, and adjusting the network parameters of at least one of the association detection network and the object detection network according to the error until the error converges.

Specifically, the sample image set includes at least one sample image. Each sample image contains at least one detectable sample matching object group, such as a "face-body pair", a "body-object pair" or a "hand-face-body group", and each sample matching object group includes at least two sample target objects corresponding to at least two object categories; a sample target object may be a face, a hand, a body, a limb, an object and the like, with corresponding object categories such as the face category, the hand category and the object category. Each sample image further carries label information for each sample matching object group. The label information is the ground truth of the group: it represents the actual relevance of the sample target objects in the group, that is, whether they are actually associated target objects. The label information can be obtained through manual annotation, neural network annotation or other means.
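
By way of illustration only, the label information for one sample image might be organized as follows; the disclosure does not prescribe a storage format, so every field name and value here is hypothetical.

```python
# A hypothetical label record for one sample image: detected objects with
# their categories and boxes, plus the ground-truth association of each
# sample matching object group.
sample_label = {
    "image": "table_game_0001.jpg",
    "objects": [
        {"id": 0, "category": "face", "box": [120, 40, 180, 110]},
        {"id": 1, "category": "body", "box": [100, 30, 260, 420]},
        {"id": 2, "category": "body", "box": [300, 50, 460, 430]},
    ],
    # each entry: ids of one sample matching object group and whether its
    # objects actually belong to the same person
    "pairs": [
        {"group": [0, 1], "associated": True},
        {"group": [0, 2], "associated": False},
    ],
}
```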

The sample image set is input into the network shown in fig. 4 and passes sequentially through the object detection network 100 and the association detection network 200 to be trained, finally outputting the value of the association detection result for each sample matching object group. The processing performed by the object detection network and the association detection network is as described above and is not repeated here.

After the output value of the association detection result of each sample matching object group is obtained, the error between the output value and the label information can be determined, and the network parameters are adjusted through error back-propagation until the error converges, at which point the training of the object detection network and the association detection network is complete.
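
Under the assumption that the whole pipeline of fig. 4 is one differentiable module that outputs a relevance score per sample matching object group, the procedure of S910 to S950 could be condensed as in the following sketch; the optimizer, the loss function and the convergence test are choices of the sketch, not prescribed by the disclosure.

```python
# Condensed training-loop sketch for the object detection network plus the
# association detection network, treated here as one PyTorch module.
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-4, tol=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    for epoch in range(epochs):
        total = 0.0
        for images, labels in loader:   # labels: 1.0 = associated, 0.0 = not
            scores = model(images)      # relevance score per matching group
            loss = bce(scores, labels)  # error vs. the label information
            opt.zero_grad()
            loss.backward()             # error back-propagation
            opt.step()
            total += loss.item()
        if total / len(loader) < tol:   # crude convergence check
            break
```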

The method of the present disclosure has been described in detail with reference to the above examples. Those skilled in the art will understand that the method for detecting associated objects of the present disclosure is not limited to the above example scenarios, and may also be applied to the association detection of any other target objects that are associated in spatial position, which is not detailed here.

Therefore, according to the method for detecting associated objects of the present disclosure, target detection is assisted by the association features between target objects in the same matching object group, which improves target detection accuracy in complex scenes; for example, body detection in a multi-person scene is realized through face-body association detection with improved accuracy. In the association detection itself, combining the visual information and the spatial information of the target objects improves the accuracy with which associations are detected. For example, in face-body association detection, the spatial position features of the face and the body are considered in addition to their visual features; the spatial position features assist the face-body association, improving the accuracy of the association and, in turn, of the target detection.

In a second aspect, the disclosed embodiments provide an apparatus for detecting associated objects. An apparatus for detecting associated objects according to some embodiments of the present disclosure is shown in fig. 10.

As shown in fig. 10, in some embodiments, the detection device of the present disclosure comprises:

the detection module 410 is configured to detect at least one matching object group from the image to be detected, where each matching object group includes at least two target objects;

an obtaining module 420, configured to obtain visual information of each target object in each matching object group, and spatial information of at least two target objects in each matching object group;

the determining module 430 is configured to determine whether at least two target objects in each matching object group are related according to the visual information and the spatial information of at least two target objects in each matching object group.

As shown in fig. 11, in some embodiments, the detection module 410 may include:

the detection submodule 411 is used for detecting each target object and the object type of each target object from the image to be detected;

the combining sub-module 412 is configured to, for each object class, combine each target object in the object class with each target object in other object classes, respectively, to obtain at least one matching object group.

In some embodiments, the obtaining module 420 is further configured to:

and performing visual feature extraction on each target object in the matching object group to obtain visual information of the target object.

In some embodiments, the obtaining module 420 is further configured to:

detecting a detection frame of each target object from the image to be detected;

and for each matching object group, generating the spatial information of at least two target objects in the matching object group according to the position information of the detection frames of the at least two target objects in the matching object group.

In some embodiments, the obtaining module 420 is further configured to:

generating an auxiliary bounding box of the matching object group, where the auxiliary bounding box covers the detection frame of each target object in the matching object group;

determining the position characteristic information of each target object in the matching object group according to the auxiliary bounding box and the detection frame of each target object; and

fusing the position characteristic information of the target objects in the same matching object group to obtain the spatial information of the at least two target objects in the matching object group.

In some embodiments, the auxiliary bounding box is the smallest bounding box that covers the detection frames of all the target objects in the matching object group.
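
A small sketch of this computation under one plausible encoding: the auxiliary bounding box is the union of the detection frames, and each frame's corners are normalized against it. The exact position-characteristic encoding is an assumption of the sketch, not the patent's formula.

```python
# Auxiliary bounding box as the smallest box covering all detection frames,
# and per-object position characteristics normalized against that box.
def auxiliary_box(boxes):
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

def position_feature(box, aux):
    ax1, ay1, ax2, ay2 = aux
    w, h = ax2 - ax1, ay2 - ay1
    x1, y1, x2, y2 = box
    # corners of the detection frame relative to the auxiliary box
    return ((x1 - ax1) / w, (y1 - ay1) / h, (x2 - ax1) / w, (y2 - ay1) / h)

group = [(120, 40, 180, 110), (100, 30, 260, 420)]   # face box, body box
aux = auxiliary_box(group)
spatial_info = [position_feature(b, aux) for b in group]
```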

As shown in fig. 12, in some embodiments, the determining module 430 may include:

a fusion sub-module 431, configured to perform fusion processing on the visual information and the spatial information of at least two target objects in each matching object group to obtain a fusion feature of each matching object group;

the determining sub-module 432 is configured to perform relevance classification processing on the fusion features of each matching object group, and determine whether at least two target objects in the matching object group are relevant.

In some embodiments, the determining sub-module 432 is specifically configured to:

performing relevance classification processing on the fusion feature of each matching object group to obtain a relevance score between the at least two target objects in each matching object group;

for a plurality of matching object groups to which the same target object belongs, determining the matching object group with the highest relevance score as the target matching object group; and

determining the at least two target objects in the target matching object group as associated target objects.

In some embodiments, in the case that the target object is a human body part, the determining module 430 is specifically configured to:

and determining whether the human body parts in the same matching object group belong to the same human body.

Therefore, the apparatus for detecting associated objects of the present disclosure assists target detection through the association features between target objects in the same matching object group, which improves target detection accuracy in complex scenes; for example, body detection in a multi-person scene is realized through face-body association detection with improved accuracy. In the association detection itself, combining the visual information and the spatial information of the target objects improves the accuracy with which associations are detected. For example, in face-body association detection, the spatial position features of the face and the body are considered in addition to their visual features; the spatial position features assist the face-body association, improving the accuracy of the association and, in turn, of the target detection.

In a third aspect, the disclosed embodiments provide an electronic device, including:

a processor; and

a memory communicatively coupled to the processor and storing computer instructions readable by the processor, the computer instructions, when read by the processor, causing the processor to perform the method according to any of the embodiments of the first aspect.

In a fourth aspect, the disclosed embodiments provide a storage medium storing computer-readable instructions for causing a computer to execute the method according to any one of the embodiments of the first aspect.

Specifically, fig. 13 shows a schematic structural diagram of a computer system 600 suitable for implementing the method of the present disclosure; the corresponding functions of the processor and the storage medium described above can be implemented by the system shown in fig. 13.

As shown in fig. 13, the computer system 600 includes a processor (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse and the like; an output section 607 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read from it is installed into the storage section 608 as needed.

In particular, the above methods may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the above-described method. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It should be understood that the above embodiments are only examples used to clearly illustrate the present disclosure and are not intended to limit it. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here. Obvious variations or modifications may be made without departing from the scope of the present disclosure.
