Three-dimensional target detection method based on graph convolution attention network

Document No.: 1614449 | Published: 2020-01-10

Reading note: This technology, a three-dimensional target detection method based on a graph convolution attention network (一种基于图卷积注意网络的三维目标检测方法), was created by Xia Guihua (夏桂华), He Yunqian (何芸倩), Su Li (苏丽), Zhu Qidan (朱齐丹) and Zhang Zhi (张智) on 2019-09-26. The invention provides a three-dimensional target detection method based on a graph convolution attention network, comprising: (1) voxelizing the point cloud and randomly downsampling it; (2) extracting local features within each voxel grid cell; (3) extracting a high-order feature map by intermediate-layer convolution; (4) predicting the bounding box, category, and orientation of the target with a region proposal network. To strengthen the connection between each point and its neighboring points, the invention proposes a feature extraction module that builds on edge convolution and introduces an attention mechanism. An attention module based on the same principle is also introduced after the intermediate convolution layers; it reselects features across the channels of the feature map to obtain a more reasonable high-order feature map. The invention improves the accuracy of point cloud target detection and maintains good performance even under severe occlusion.

1. A three-dimensional target detection method based on a graph convolution attention network, characterized by comprising the following steps:

(1) voxelizing the point cloud and randomly downsampling it;

(2) extracting local features within each voxel grid cell;

(3) extracting a high-order feature map by intermediate-layer convolution;

(4) predicting the bounding box, category, and orientation of the target with a region proposal network.

2. The three-dimensional target detection method based on a graph convolution attention network according to claim 1, wherein voxelizing and randomly downsampling the point cloud specifically comprises: partitioning the original point cloud with a voxel grid structure, discarding outliers outside a specified range, dividing the point cloud into grid cells, performing random downsampling within each voxel grid cell, and numbering and storing each grid cell.

3. The three-dimensional target detection method based on a graph convolution attention network according to claim 2, wherein the storing uses a hash table.

4. The method according to claim 1, wherein extracting local features within each voxel grid cell specifically comprises: within each voxel grid cell, extracting features of the corresponding points with a graph attention network module.

5. The three-dimensional target detection method based on a graph convolution attention network according to claim 4, wherein extracting features of the corresponding points with the graph attention network module is specifically as follows: first, each point is connected to its surrounding neighboring points to form a graph structure that uses Euclidean distance as the criterion, and each point is also connected to itself to form an edge; information such as the coordinates of the two endpoints of each edge is extracted as the initial feature of the edge; a convolution operation is then performed on the edges; and finally the voxel-level feature is obtained through selection by a symmetric function.

6. The method of claim 5, wherein an attention mechanism is used to select the initial features before the edge convolution operation.

7. The three-dimensional target detection method based on a graph convolution attention network according to claim 1, wherein extracting the high-order feature map by intermediate-layer convolution specifically comprises: compressing the feature map into a compact structure using sparse convolution, and mapping it back to the original sparse spatial representation after convolution; after the convolutional abstraction, an attention mechanism reassigns weights to the different channels to obtain an attention map corresponding to the feature map, and the attention map is superimposed on the high-order feature map obtained by convolution to obtain the final three-dimensional feature map.

8. The method according to claim 1, wherein predicting the bounding box, category, and orientation of the target with the region proposal network specifically comprises: after features are extracted from the high-order feature map through multiple convolution layers, the predicted bounding box, category, and orientation for each anchor are computed by three separate fully connected layers.

Technical Field

The invention relates to a computer vision three-dimensional point cloud processing method, in particular to a three-dimensional target detection method.

Background

Target detection is a classic computer vision task that identifies and localizes targets at the same time, and it is a prerequisite for realizing intelligent scenes. Two-dimensional detection has advanced enormously, but in fields such as mapping, indoor robotics, and augmented reality, three-dimensional detection is clearly superior to two-dimensional detection. It provides more position and pose information and is one of the basic tasks of environment perception for autonomous driving. RGB images were once the mainstream data form for target detection, but with the development of 3D sensors, lidar has become an increasingly popular detection tool in recent years.

Some methods based on lidar and cameras now fuse point cloud data and image data for greater accuracy. However, fusion methods suffer from excessive computational cost, so single-sensor methods remain competitive. Many studies have shown that point clouds are a more suitable form of data for describing object shape. Point clouds represent Euclidean distance well and do not suffer from multi-scale problems. However, point clouds are sparse, which makes two-dimensional methods difficult to apply directly.

When extracting features, most methods process points one by one and use a symmetric function to extract global features, an approach that ignores the connections and relationships between points. In contrast to image data, a point cloud is a natural graph structure whose points are easy to link. Some studies draw on graph networks, argue that the relationships between adjacent points and edges help strengthen the expression of local features, and propose the edge convolution method. In three-dimensional convolution, many voxels within the defined range are empty because the points are sparse; sparse convolution improves computation speed and reduces memory consumption without affecting the convolution result.

Disclosure of Invention

The invention aims to provide a three-dimensional target detection method based on a graph convolution attention network that improves the accuracy of point cloud target detection and maintains good performance under severe occlusion.

The purpose of the invention is realized as follows:

(1) voxelizing the point cloud and randomly downsampling it;

(2) extracting local features within each voxel grid cell;

(3) extracting a high-order feature map by intermediate-layer convolution;

(4) predicting the bounding box, category, and orientation of the target with a region proposal network.

The present invention may further comprise:

1. Voxelizing and randomly downsampling the point cloud specifically comprises: partitioning the original point cloud with a voxel grid structure, discarding outliers outside a specified range, dividing the point cloud into grid cells, performing random downsampling within each voxel grid cell, and numbering and storing each grid cell.

The storing uses a hash table.

2. Extracting local features within each voxel grid cell specifically comprises: within each voxel grid cell, extracting features of the corresponding points with a graph attention network module.

Extracting features of the corresponding points with the graph attention network module specifically comprises: first, each point is connected to its surrounding neighboring points to form a graph structure that uses Euclidean distance as the criterion, and each point is also connected to itself to form an edge; information such as the coordinates of the two endpoints of each edge is extracted as the initial feature of the edge; a convolution operation is then performed on the edges; and finally the voxel-level feature is obtained through selection by a symmetric function.

Before the edge convolution operation, an attention mechanism is used to select the initial features.

3. Extracting the high-order feature map by intermediate-layer convolution specifically comprises: compressing the feature map into a compact structure using sparse convolution, and mapping it back to the original sparse spatial representation after convolution; after the convolutional abstraction, an attention mechanism reassigns weights to the different channels to obtain an attention map corresponding to the feature map, and the attention map is superimposed on the high-order feature map obtained by convolution to obtain the final three-dimensional feature map.

4. Predicting the bounding box, category, and orientation of the target with the region proposal network specifically comprises: after features are extracted from the high-order feature map through multiple convolution layers, the predicted bounding box, category, and orientation for each anchor are computed by three separate fully connected layers.

The three-dimensional target detection method based on the graph convolution attention network is characterized by enhancing the expression of local relationships in the point cloud and optimizing the feature selection process. The invention applies edge convolution, which can express the relationship between adjacent points, to feature extraction for target detection. In the feature selection stage for the initial points, it uses an attention mechanism to select the initial physical features that matter most for feature expression, thereby obtaining better extracted features. Intermediate-layer convolution also produces multi-channel feature data; the invention refines the convolution result with an attention mechanism, strengthening the weight of the channels with the greatest influence and obtaining a more expressive feature map.

A typical point cloud for one scene contains more than 100k points, so the point cloud is preprocessed with a specific data structure, namely voxelization. The original points are first divided into voxels and point-wise features are extracted; the downsampled voxel features then pass through convolution and region proposal to obtain three-dimensional bounding boxes.

The invention strengthens the representation of relationships between the low-level original points during feature extraction by drawing on graph networks. To further strengthen feature expression, it also adopts an attention mechanism, which mimics human cognitive acuity, so that multi-channel feature selection becomes more intelligent. Attention mechanisms are applied both before the initial feature selection of the graph network's edge convolution and after the sparse convolution feature map is processed, which makes the feature expression at each stage more interpretable while improving the expressive power of the neural network modules.

The invention has the following advantages:

1. The method uses graph convolution with an attention mechanism in the feature representation of each voxel, which better describes the relationships between the points of the point cloud and extracts more expressive features.

2. After intermediate-layer convolution, the method reweights the resulting high-order feature map with an attention mechanism, obtaining a more reasonable high-order feature map.

3. The two improvements act together and can improve the accuracy of three-dimensional target detection in vehicle detection.

Drawings

FIG. 1: feature extraction module based on the graph network attention structure, where e denotes an edge, x denotes a point, and i and j denote point indices;

FIG. 2: voxel feature extraction;

FIG. 3: intermediate-layer sparse convolution with the attention mechanism introduced;

FIG. 4: overall pipeline.

Detailed Description

The invention is described in more detail below by way of example.

Step one: voxelized partitioning and grouping of the point cloud

The original point cloud of more than 100k points is structured and downsampled by voxelization. First, points outside a certain range are cropped, and only points within the ranges $D$, $H$, and $W$ along the x, y, and z axes are kept. Because a single point cloud contains too many points, the whole point cloud within the retained range is divided into small voxel grid cells of size $v_d$, $v_h$, $v_w$.

To address the uneven distribution of points across voxels, this embodiment uses random downsampling so that the number of points in each voxel does not exceed $T$. Finally, the processed voxels are numbered and stored in a hash table, which eliminates voxels that contain no points.
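Step one can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `voxelize`, its parameters, and the use of a Python `dict` as the hash table are all hypothetical choices made for the example.

```python
import numpy as np

def voxelize(points, pc_range, voxel_size, max_points, seed=0):
    """Group points into voxels, keeping at most `max_points` (T) per voxel.

    points:     (N, 3) array of x, y, z coordinates.
    pc_range:   (xmin, ymin, zmin, xmax, ymax, zmax) crop range (assumed layout).
    voxel_size: edge lengths of one voxel along x, y, z.
    Returns a dict (hash table) mapping voxel index -> (<=T, 3) array.
    """
    rng = np.random.default_rng(seed)
    lo = np.asarray(pc_range[:3], dtype=float)
    hi = np.asarray(pc_range[3:], dtype=float)
    size = np.asarray(voxel_size, dtype=float)

    # Discard points outside the specified range.
    inside = np.all((points >= lo) & (points < hi), axis=1)
    points = points[inside]

    # Integer grid index of each remaining point.
    idx = np.floor((points - lo) / size).astype(np.int64)

    # Hash table keyed by voxel index; empty voxels never get an entry.
    table = {}
    for key, p in zip(map(tuple, idx.tolist()), points):
        table.setdefault(key, []).append(p)

    # Random downsampling so that no voxel holds more than max_points points.
    voxels = {}
    for key, pts in table.items():
        pts = np.stack(pts)
        if len(pts) > max_points:
            keep = rng.choice(len(pts), size=max_points, replace=False)
            pts = pts[keep]
        voxels[key] = pts
    return voxels
```

Because only occupied cells are inserted, the table size scales with the number of non-empty voxels rather than with the full grid.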

Step two: point cloud feature extraction in voxels

After the original point cloud is voxelized, in order to obtain the voxel-level features, the method performs feature extraction on each voxel by using a graph attention network module.

The point cloud is a natural graph structure, yet conventional point cloud feature extraction considers each point separately and ignores the relationships between points. Define $G = (V, E)$ as a graph comprising a point set $V = \{x_1, \dots, x_n\}$ of $n$ points and an edge set $E \subseteq V \times V$ between points.

The invention defines a $d$-dimensional proximity graph: for each point $x_i \in \mathbb{R}^d$, $E$ contains edges of the form $(i, j_{i1}), \dots, (i, j_{ik})$, where $i$ and $j$ are point indices. The edge feature is then defined as $e_{ij} = h_\theta(x_i, x_j)$, where $h_\theta$ is the edge function with learnable parameters $\theta$, and $H$ below is a symmetric function.

In general, each point of the point cloud has three dimensions representing its real-world coordinates. In this embodiment, when describing the edge between two points, the center point $x_i$ and each point connected to it are combined through $h_\theta$ as the initial feature selection. The channels of the edge feature contribute differently to the overall feature representation, so an attention mechanism is added. After the multilayer perceptron operation of the edge convolution, a symmetric operation $H$ aggregates the edge-level features into the corresponding point-level features. Then, given the point-level features $X' = \{x'_1, \dots, x'_n\}$, another symmetric operation is applied to extract the final voxel-level feature.
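The flow of step two (neighbor graph, initial edge features, channel attention, edge convolution, two symmetric reductions) can be illustrated roughly as follows. Several details are assumptions made only for this sketch, since the text does not fix them: the attention is a sigmoid gate per edge-feature channel, the edge function is a single linear layer with ReLU, both symmetric functions are max, and the names `knn_indices`, `attentive_edge_conv`, `w_att`, and `w_mlp` are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest points (Euclidean) for every point, self included."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)  # (n, n)
    return np.argsort(d2, axis=1)[:, :k]                          # (n, k)

def attentive_edge_conv(x, k, w_att, w_mlp):
    """x: (n, d) points of one voxel.

    w_att: (2d, 2d) attention weights, w_mlp: (2d, f) edge-MLP weights.
    Returns point-level features (n, f) and the voxel-level feature (f,).
    """
    nbr = knn_indices(x, k)                       # (n, k)
    xi = np.repeat(x[:, None, :], k, axis=1)      # centre point,  (n, k, d)
    xj = x[nbr]                                   # connected point, (n, k, d)
    edge = np.concatenate([xi, xj], axis=-1)      # initial edge feature, (n, k, 2d)

    # Assumed attention form: a sigmoid gate in (0, 1) per edge-feature channel.
    gate = 1.0 / (1.0 + np.exp(-(edge @ w_att)))
    edge = edge * gate

    # Assumed edge function: one shared linear layer followed by ReLU.
    e = np.maximum(edge @ w_mlp, 0.0)             # (n, k, f)

    point_feats = e.max(axis=1)                   # symmetric function over edges
    voxel_feat = point_feats.max(axis=0)          # symmetric function over points
    return point_feats, voxel_feat
```

Because both reductions are symmetric, the voxel-level feature does not depend on the order in which the points of the voxel are listed.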

Step three: intermediate layer sparse convolution

This embodiment uses three-dimensional sparse convolution as the intermediate convolution layers. Let $\mathrm{ConvMD}(c_{in}, c_{out}, k, s, p)$ denote a convolution operator, where $c_{in}$ and $c_{out}$ are the numbers of input and output channels, and $k$, $s$, and $p$ are the kernel size, stride, and padding, respectively. Each convolution operation consists of a 3D convolution, a BatchNorm layer, and a ReLU layer. Finally, after the sparse feature map is converted to a dense feature map, a high-level feature map is obtained and an attention module is added.
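To see how $k$, $s$, and $p$ shape the feature grid, the following sketch applies the standard convolution output-size arithmetic (no dilation) to each axis. The helper names are hypothetical, and the grid size and layer settings in the usage note are illustrative values only; the text does not list the actual layer hyperparameters.

```python
def conv_output_size(n_in, k, s, p):
    """Output length along one axis for kernel size k, stride s, padding p."""
    return (n_in + 2 * p - k) // s + 1

def convmd_output_shape(shape, k, s, p):
    """Apply conv_output_size to every axis of a 3D grid.

    shape, k, s, p are 3-tuples with one entry per axis.
    """
    return tuple(conv_output_size(n, ki, si, pi)
                 for n, ki, si, pi in zip(shape, k, s, p))
```

For example, `convmd_output_shape((10, 400, 352), (3, 3, 3), (2, 1, 1), (1, 1, 1))` halves the first axis while preserving the other two.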

Feature maps at many different scales arise during convolution, and the features of each dimension clearly contribute to the overall feature with different importance. To make the feature map more descriptive and more reasonable, the invention adds an attention map to the original feature map.

This embodiment uses an SE (squeeze-and-excitation) attention module to generate the attention feature map. First, let the dense input feature map have size $H \times W \times C$, where $H$ is the feature height, $W$ is the feature width, and $C$ is the number of channels. An average-pooling operation then condenses each channel into a single statistic, giving the channel weight $z_c$. A multilayer perceptron then derives higher-level features for each dimension, so the final attention is $s_c = F_e(z_c, W)$, where $F_e$ is the extraction function (here $W$ denotes the parameters of $F_e$, not the feature width).

After the scaling function $F_{scale}$, the attention feature map is added to the original feature map to obtain the final output feature map.

This attention operation is added after the intermediate layers and aggregates high-level information into the final intermediate-layer feature map, thereby providing more information for the subsequent region proposals.
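The channel attention of step three can be sketched as follows: average pooling per channel, a small multilayer perceptron, channel-wise scaling, and addition to the original map. The two-layer perceptron with ReLU and sigmoid follows the usual SE design and is an assumption here, as are the names `se_attention`, `w1`, and `w2`.

```python
import numpy as np

def se_attention(u, w1, w2):
    """u: (H, W, C) dense feature map.

    w1: (C, C_hidden) and w2: (C_hidden, C) are the MLP weights (assumed shapes).
    Returns an (H, W, C) map: the original plus its channel-reweighted copy.
    """
    z = u.mean(axis=(0, 1))                    # average pooling: one value per channel
    hidden = np.maximum(z @ w1, 0.0)           # first MLP layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # channel weights in (0, 1)
    attention_map = u * s                      # scaling: reweight each channel
    return u + attention_map                   # add the attention map to the original
```

With all-zero weights every channel weight is 0.5, so the output is simply 1.5 times the input, which makes a convenient sanity check.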

Step four: region proposal network

Region proposal networks (RPNs) have become a typical embedded module in many detection frameworks. This embodiment uses an end-to-end, SSD-like architecture for region proposal. The inputs to the region proposal layers are the feature maps extracted by the intermediate layers, and each region proposal layer contains a convolution layer, a BatchNorm layer, and a ReLU layer. After each individual RPN layer, the feature maps are upsampled to the same fixed size and concatenated. Finally, three 1 × 1 convolutions generate the predicted bounding boxes, categories, and orientations.
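The three output heads can be illustrated with a 1 × 1 convolution written as a per-location linear map. The output layouts are assumptions made for this sketch only: seven regression values per anchor for the box, one score per anchor for the category, and two orientation bins per anchor. The names `conv1x1`, `rpn_heads`, and the `params` dictionary are likewise hypothetical.

```python
import numpy as np

def conv1x1(feat, weight, bias):
    """1x1 convolution on an (H, W, C_in) map: the same linear map at every location."""
    return feat @ weight + bias

def rpn_heads(feat, params, num_anchors):
    """feat: (H, W, C) concatenated feature map.

    params: dict with keys "box", "cls", "dir", each a (weight, bias) pair.
    Returns box (H, W, A, 7), category (H, W, A), direction (H, W, A, 2).
    """
    h, w, _ = feat.shape
    box = conv1x1(feat, *params["box"]).reshape(h, w, num_anchors, 7)
    cls = conv1x1(feat, *params["cls"]).reshape(h, w, num_anchors)
    direction = conv1x1(feat, *params["dir"]).reshape(h, w, num_anchors, 2)
    return box, cls, direction
```

Each head shares the same input and differs only in its output width, so the three predictions are produced in parallel from one feature map.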
