Target depth estimation and detection method based on multiplexed image

Document No. 1490768 · Published 2020-02-04

Reading note: this technique, "Target depth estimation and detection method based on multiplexed image", was designed and created by *** Zhou Changxin on 2019-09-15. Abstract: The invention discloses a target depth estimation and detection method based on multiplexed images, comprising three processes: training a target detection neural network on multiplexed images, performing target detection on the multiplexed image, and performing depth estimation on the detected targets. The method solves the problem that 2D target detection methods cannot perceive the depth of objects.

1. A target depth estimation and detection method based on a multiplexed image is characterized by comprising three processes of training a target detection neural network on the multiplexed image, carrying out target detection on the multiplexed image and carrying out depth estimation on a detected target:

training a target detection neural network on a multiplexed image, comprising the steps of:

1) selecting an anchor-box-based target detection network and extending its anchor boxes;

2) initializing a target detection network;

3) training a target detection network;

performing object detection on the multiplexed image, comprising the steps of:

4) inputting the multiplexed image to be tested into the target detection network trained in step 3) to obtain the category and bounding-box position parameters of each target;

5) labeling each target on the image with its category according to the detected category and bounding-box position parameters, and marking the position of each object with a rectangular box;

the depth estimation of the detected target comprises the following steps:

6) calculating the horizontal pixel distance of the centers of the two bounding boxes of each target;

7) performing depth estimation for each detected target based on the focal length of the camera and the baseline distance.

2. The method according to claim 1, wherein the step 1) comprises the following specific steps:

11) selecting an anchor-box-based target detection network framework;

12) extending each anchor box into an anchor box pair; each anchor box pair consists of a group of two horizontally aligned anchor boxes of the same size and outputs 7 parameters, the first being a target class parameter and the other 6 being target position parameters $(x_{left}, x_{right}, w_{left}, w_{right}, y, h)$, corresponding respectively to the horizontal coordinate and width of the left and right bounding boxes and to the vertical coordinate y and height h shared by both.

3. The method according to claim 1, wherein the step 2) comprises the following specific steps:

21) initializing the training parameters; the learning rate is set to 0.001, the number of iterations to 100000, and the number of images input per iteration to 2;

22) initializing the weight parameters; the target detection network is initialized with ResNet-101 network parameters pre-trained on the ImageNet dataset, and the additional network layers of the target detection network use the xavier initialization method: given the input dimension m and the output dimension n of the layer a parameter belongs to, the parameter is initialized uniformly within the range

$$\left[-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right].$$

4. The method according to claim 1, wherein the step 3) comprises the following specific steps:

31) inputting training images and labels of objects on each image into a network, wherein the image size is set to be 1333 × 600;

32) in the training process, each anchor box pair is labeled as a positive or negative sample according to its overlap with the target's bounding boxes; when both the left and right anchor boxes of a pair overlap the target's left and right bounding boxes with an overlap (IoU) greater than 0.5, the pair is a positive sample and is matched to that target; when both overlaps are less than 0.3, the pair is a negative sample; all other anchor box pairs are ignored;

33) calculating a loss function; the loss function is as follows:

$$L = \frac{1}{N_{gt}} \sum_{i=1}^{N_a} \sum_{j=1}^{N_{gt}} t_{ij}\, L_{loc}\!\left(u_i^{left}, \hat{u}_j^{left}\right) + \frac{1}{N_{gt}} \sum_{i=1}^{N_a} \sum_{j=1}^{N_{gt}} t_{ij}\, L_{loc}\!\left(u_i^{right}, \hat{u}_j^{right}\right) - \frac{1}{N_a} \sum_{i=1}^{N_a} \log p_i(c)$$

where $N_a$ and $N_{gt}$ are the numbers of anchor box pairs and targets respectively; $(x_i, y_i, w_i, h_i)$ represents the left and right bounding box parameters, in which $x_i$ comprises the two parts $x_i^{left}$ and $x_i^{right}$ and $w_i$ comprises the two parts $w_i^{left}$ and $w_i^{right}$, so that $u_i^{left} = (x_i^{left}, y_i, w_i^{left}, h_i)$ and $u_i^{right} = (x_i^{right}, y_i, w_i^{right}, h_i)$, with $L_{loc}$ the smooth-L1 position error against the matched ground-truth parameters $\hat{u}_j$; $p_i(c)$ represents the predicted probability of the true class; $s_i$ indicates whether the ith anchor box pair is a positive sample, equal to 1 if positive and 0 otherwise; and $t_{ij}$ indicates whether the ith anchor box pair matches the bounding box of the jth target, equal to 1 if matched and 0 otherwise; the whole loss function consists of three parts: the first two terms compute the position errors of the positive samples' left and right boxes, and the third term is the class prediction error.

34) if the current iteration count exceeds the maximum number of iterations, training ends; otherwise, back propagation is performed; in the back propagation process, the partial derivative of the loss function with respect to each node is calculated, and the weight parameters of the network are updated with a gradient descent algorithm; the network weights are updated as follows:

$$w_{ij}^{m} \leftarrow w_{ij}^{m} - \alpha\,\frac{\partial L}{\partial w_{ij}^{m}}$$

$$b_{i}^{m} \leftarrow b_{i}^{m} - \alpha\,\frac{\partial L}{\partial b_{i}^{m}}$$

where m denotes the layer index, i the index of a neuron within the layer, j the index of one of that neuron's inputs, and α the learning rate;

and after the updating of the network weight parameters by the back propagation process is finished, returning to the step 33).

5. The method according to claim 1, wherein the step 4) comprises the following specific steps:

41) a multiplexed image to be detected is input, and the image size is set to 1333 × 600.

42) computing the numerical detection results from the trained neural network model and the input image.

6. The method according to claim 1, wherein the step 5) comprises the following specific steps:

51) applying non-maximum suppression to the detection boxes output by the network to remove redundancy; since each target on the multiplexed image has two target boxes, non-maximum suppression is performed separately on the left and right boxes, and only the targets retained by both the left and right passes are output;

52) writing the detection results $(class, x_{left}, x_{right}, w_{left}, w_{right}, y, h, score)$ to a text file for saving.

7. The method according to claim 1, wherein in step 6), the horizontal pixel difference between the two bounding boxes of each target, namely the target's disparity, is calculated: $disp = x_{right} - x_{left}$.

8. The method according to claim 1, wherein in step 7),

obtaining the relation between the target depth and the disparity according to the binocular vision principle: $depth = f \times baseline / disp$, where f is the focal length of the camera and baseline is the baseline distance of the camera, thereby obtaining the depth of each target.

Technical Field

The invention relates to the fields of pattern recognition and computer vision, and in particular to a target depth estimation and detection method based on multiplexed images.

Background

Object detection is an important research topic in the field of computer vision. It has been widely used in many real-world applications, such as face recognition, traffic safety, and autonomous driving. Deep-learning-based object detection refers to marking the position and category of target objects in natural-scene images or videos. However, applications like autonomous driving require not only the position of an object in image space but also the depth of the object in actual three-dimensional space.

At present, 2D object detectors based on convolutional neural networks achieve good accuracy and high real-time performance, but cannot perceive the depth of objects. Such a detector comprises two key steps: feature extraction and classifier (regressor) training. Unlike conventional methods in which the features of target objects are extracted manually, deep learning uses a multi-layer convolutional neural network to extract features automatically. The network used to extract features must be trained on a labeled dataset so that it can extract valuable features, with its parameters updated by the back propagation algorithm during training. Training the regressor means finding a set of rules for judging the category and position of objects in the current image from the extracted features. Mainstream deep-learning object detection methods currently include Faster R-CNN, YOLO, and the like. However, these methods cannot perceive object depth and must be paired with other sensors to meet the requirements of applications such as autonomous driving.

The 3D object detection task can simultaneously identify the position of an object in image space and estimate its depth. Good 3D detection performance typically depends on point-cloud data from lidar; however, lidar is expensive to manufacture, has low resolution, and has a limited sensing range. Image-based 3D object detection, meanwhile, is inefficient and cannot meet the real-time requirements of applications such as autonomous driving.

Disclosure of Invention

To address the problem that 2D target detection methods cannot perceive object depth, the invention aims to provide a target depth estimation and detection method based on multiplexed images.

The technical solution for realizing the purpose of the invention is as follows: a target depth estimation and detection method based on a multiplexed image comprises three processes of training a target detection neural network on the multiplexed image, carrying out target detection on the multiplexed image and carrying out depth estimation on a detected target:

training a target detection neural network on the multiplexed image comprises the steps of:

1) selecting a target detection network based on an anchor frame, and expanding the anchor frame part;

2) initializing a target detection network;

3) training the target detection network.

Performing object detection on the multiplexed image includes the steps of:

4) inputting the multiplexed image to be tested into the target detection network trained in step 3) to obtain the category and bounding-box position parameters of each target;

5) labeling each target on the image with its category according to the detected category and bounding-box position parameters, and marking the position of each object with a rectangular box.

The depth estimation of the detected target comprises the following steps:

6) calculating the horizontal pixel distance of the centers of the two bounding boxes of each target;

7) performing depth estimation for each detected target according to the focal length and baseline distance of the camera.

In the above method, the step 1) comprises the following specific steps:

11) An anchor-box-based object detection network framework is selected, taking Faster R-CNN as an example.

12) The anchor boxes are extended into anchor box pairs. Each anchor box pair consists of a group of two horizontally aligned anchor boxes of the same size and outputs 7 parameters, the first being a target class parameter and the other 6 being target position parameters $(x_{left}, x_{right}, w_{left}, w_{right}, y, h)$, corresponding respectively to the horizontal coordinate and width of the left and right bounding boxes and to the vertical coordinate y and height h shared by both.
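As an illustration, the sketch below decodes the 7 outputs of one anchor box pair into the shared-parameter left and right boxes described above; the function name and tuple layout are assumptions for this example, not part of the patented method.

```python
def decode_anchor_pair(output):
    """Split the 7 outputs of one anchor box pair into a class score
    and two boxes that share the vertical coordinate y and height h.

    `output` is assumed to be ordered as
    (class_score, x_left, x_right, w_left, w_right, y, h).
    """
    cls_score, x_left, x_right, w_left, w_right, y, h = output
    left_box = (x_left, y, w_left, h)    # (center x, center y, width, height)
    right_box = (x_right, y, w_right, h)
    return cls_score, left_box, right_box


# Example: one raw prediction from the detection head.
print(decode_anchor_pair((0.9, 395.0, 410.0, 64.0, 64.0, 212.0, 128.0)))
```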

In the above method, the step 2) comprises the following specific steps:

21) The training parameters are initialized. The learning rate is set to 0.001, the number of iterations to 100000, and the number of images input per iteration to 2.

22) The weight parameters are initialized. An appropriate weight initialization method leads to faster convergence of the neural network. The target detection network is initialized with ResNet-101 network parameters pre-trained on the ImageNet dataset, and the additional network layers of the target detection network use the xavier initialization method: given the input dimension m and the output dimension n of the layer a parameter belongs to, the parameter is initialized uniformly within the range

$$\left[-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right].$$
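For concreteness, a minimal NumPy sketch of this xavier (Glorot) uniform rule follows; the function name is illustrative, and any deep-learning framework's built-in initializer could be used instead.

```python
import numpy as np

def xavier_uniform(m, n, rng=None):
    """Sample an m-by-n weight matrix uniformly from
    [-sqrt(6/(m+n)), +sqrt(6/(m+n))], where m and n are the input
    and output dimensions of the layer being initialized."""
    rng = rng or np.random.default_rng()
    bound = np.sqrt(6.0 / (m + n))
    return rng.uniform(-bound, bound, size=(m, n))

# Example: initialize a 1024 -> 256 fully connected layer.
w = xavier_uniform(1024, 256)
```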

In the above method, the step 3) includes the following specific steps:

31) The training images and the labels of the objects on each image are input into the network, with the image size set to 1333 × 600.

32) During the training process, each anchor box pair is labeled as a positive or negative sample according to its overlap with the target's bounding boxes. When both the left and right anchor boxes of a pair overlap the target's left and right bounding boxes with an overlap (IoU) greater than 0.5, the pair is a positive sample and is matched to that target; when both overlaps are less than 0.3, the pair is a negative sample. The remaining anchor box pairs are ignored.
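A minimal sketch of this assignment rule follows, assuming boxes in (x1, y1, x2, y2) corner form; the helper names and the single-target signature are simplifications for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor_pair(anchor_l, anchor_r, gt_l, gt_r):
    """Return 1 (positive) when both IoUs exceed 0.5, 0 (negative)
    when both fall below 0.3, and -1 (ignored) otherwise."""
    iou_l, iou_r = iou(anchor_l, gt_l), iou(anchor_r, gt_r)
    if iou_l > 0.5 and iou_r > 0.5:
        return 1
    if iou_l < 0.3 and iou_r < 0.3:
        return 0
    return -1
```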

33) A loss function is calculated. The loss function used in the present invention is as follows:

$$L = \frac{1}{N_{gt}} \sum_{i=1}^{N_a} \sum_{j=1}^{N_{gt}} t_{ij}\, L_{loc}\!\left(u_i^{left}, \hat{u}_j^{left}\right) + \frac{1}{N_{gt}} \sum_{i=1}^{N_a} \sum_{j=1}^{N_{gt}} t_{ij}\, L_{loc}\!\left(u_i^{right}, \hat{u}_j^{right}\right) - \frac{1}{N_a} \sum_{i=1}^{N_a} \log p_i(c)$$

where $N_a$ and $N_{gt}$ are the numbers of anchor box pairs and targets respectively; $(x_i, y_i, w_i, h_i)$ represents the left and right bounding box parameters, in which $x_i$ comprises the two parts $x_i^{left}$ and $x_i^{right}$ and $w_i$ comprises the two parts $w_i^{left}$ and $w_i^{right}$, so that $u_i^{left} = (x_i^{left}, y_i, w_i^{left}, h_i)$ and $u_i^{right} = (x_i^{right}, y_i, w_i^{right}, h_i)$, with $L_{loc}$ the smooth-L1 position error against the matched ground-truth parameters $\hat{u}_j$; $p_i(c)$ represents the predicted probability of the true class; $s_i$ indicates whether the ith anchor box pair is a positive sample, equal to 1 if positive and 0 otherwise; and $t_{ij}$ indicates whether the ith anchor box pair matches the bounding box of the jth target, equal to 1 if matched and 0 otherwise. The whole loss function consists of three parts: the first two terms compute the position errors of the positive samples' left and right boxes, and the third term is the class prediction error.
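The sketch below computes this loss for a batch of already-matched positive pairs, assuming the smooth-L1 form of $L_{loc}$; the array layouts and function names are assumptions made for this example.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth-L1 penalty: 0.5*x^2 below 1, |x| - 0.5 above."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def pair_loss(pred_l, pred_r, gt_l, gt_r, cls_probs, labels, n_gt):
    """Position error of the positive anchor box pairs plus the class
    prediction error.

    pred_l/pred_r and gt_l/gt_r: (P, 4) arrays of matched (x, y, w, h)
    parameters for the left and right boxes of the P positive pairs;
    cls_probs: (N_a, C) predicted class probabilities;
    labels: (N_a,) assigned class indices (0 = background)."""
    loc = (smooth_l1(pred_l - gt_l).sum()
           + smooth_l1(pred_r - gt_r).sum()) / n_gt
    cls = -np.log(cls_probs[np.arange(len(labels)), labels] + 1e-9).mean()
    return loc + cls
```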

34) If the current iteration count exceeds the maximum number of iterations, training ends. Otherwise, back propagation is performed. The back propagation process computes the partial derivative of the loss function with respect to each node and updates the weight parameters of the network with a gradient descent algorithm. The network weights are updated as follows:

$$w_{ij}^{m} \leftarrow w_{ij}^{m} - \alpha\,\frac{\partial L}{\partial w_{ij}^{m}}, \qquad b_{i}^{m} \leftarrow b_{i}^{m} - \alpha\,\frac{\partial L}{\partial b_{i}^{m}}$$

where m denotes the layer index, i the index of a neuron within the layer, j the index of one of that neuron's inputs, and α the learning rate.

After the back propagation process finishes updating the network weight parameters, return to step 33).
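As a minimal illustration of this update rule (not the full back propagation pass), the following sketch applies one gradient descent step to a list of per-layer weight arrays; the names and the flat list-of-layers layout are assumptions.

```python
import numpy as np

def sgd_step(weights, grads, alpha=0.001):
    """One gradient descent update, w <- w - alpha * dL/dw, applied
    layer by layer; `grads` holds the partial derivatives of the loss
    computed by back propagation."""
    return [w - alpha * g for w, g in zip(weights, grads)]

# Example with two toy layers; the learning rate matches step 21).
ws = [np.ones((3, 3)), np.ones(3)]
gs = [np.full((3, 3), 0.5), np.full(3, 0.5)]
ws = sgd_step(ws, gs)
```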

In the above method, the step 4) includes the following specific steps:

41) a multiplexed image to be detected is input, and the image size is set to 1333 × 600.

42) The numerical detection results are computed from the trained neural network model and the input image.

In the above method, the step 5) includes the following specific steps:

51) Non-maximum suppression is applied to the detection boxes output by the network to remove redundancy. Since each target on the multiplexed image has two target boxes, non-maximum suppression is performed separately on the left boxes and the right boxes, and only the targets retained by both passes are output.
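A compact sketch of this two-pass filtering follows, reusing the `iou` helper from the assignment sketch above; the greedy NMS form and the 0.5 threshold are common defaults assumed here, not values stated in the specification.

```python
def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes.
    Uses the iou() helper defined in the anchor-assignment sketch."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

def nms_pairs(left_boxes, right_boxes, scores):
    """Suppress the left and right boxes independently, then keep only
    the detections that survive both passes."""
    kept = set(nms(left_boxes, scores)) & set(nms(right_boxes, scores))
    return sorted(kept)
```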

52) The detection results $(class, x_{left}, x_{right}, w_{left}, w_{right}, y, h, score)$ are written to a text file for saving.

In the above method, the step 6) includes the following specific steps:

61) The horizontal pixel difference between the two bounding boxes of each target, namely the target's disparity, is calculated: $disp = x_{right} - x_{left}$.

In the above method, the step 7) includes the following specific steps:

71) The relation between the target depth and the disparity is obtained from the binocular vision principle: $depth = f \times baseline / disp$, where f is the focal length of the camera and baseline is the baseline distance of the camera; the depth of each detected object is thereby obtained.
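Putting steps 6) and 7) together, a minimal sketch follows; the guard against non-positive disparity is an added safety check, and f and baseline would come from the camera calibration.

```python
def estimate_depth(x_left, x_right, f, baseline):
    """Depth from one detected box pair: disparity is the horizontal
    pixel distance between the two box centers, and
    depth = f * baseline / disparity (binocular vision principle)."""
    disp = x_right - x_left
    if disp <= 0:
        return None  # degenerate detection: no usable disparity
    return f * baseline / disp

# Hypothetical calibration: focal length 720 px, baseline 0.5 m.
print(estimate_depth(395.0, 410.0, f=720.0, baseline=0.5))  # 24.0 m
```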

Compared with the prior art, the invention has the following remarkable advantages: (1) the multiplexed image is used as input for 2D object detection; (2) the designed target detector can simultaneously identify the object and estimate the depth of the object; (3) the designed target detector has high running speed and meets the real-time requirement.

Drawings

Fig. 1 is an example of a multiplexed image.

Fig. 2 shows the structure of the anchor box pair.

Fig. 3 is an object detection network model structure.

FIG. 4 is a training process of the present invention.

FIG. 5 is a detection process of the present invention.

Fig. 6 is a visualization result diagram of the present invention.

Detailed Description

The following description will further describe the embodiments of the present invention with reference to the drawings.

The method comprises three processes: training a target detection neural network on multiplexed images, performing target detection on the multiplexed image, and performing depth estimation on the detected targets.

Training a target detection neural network on a multiplexed image, comprising the following specific steps:

First, since existing object detectors are designed only for natural images, the detector must be modified to make it suitable for multiplexed images (see Fig. 1). An anchor-box-based object detection network framework is selected, taking Faster R-CNN as an example.

The anchor boxes in the target detector are extended into "anchor box pairs", as shown in Fig. 2. Each anchor box pair consists of a group of two horizontally aligned anchor boxes of the same size and outputs 7 parameters, the first being a target class parameter and the other 6 being target position parameters $(x_{left}, x_{right}, w_{left}, w_{right}, y, h)$, corresponding respectively to the horizontal coordinate and width of the left and right bounding boxes and to the vertical coordinate y and height h shared by both. The structure of the target detector is shown in Fig. 3.

The training parameters and the weight parameters are initialized. The learning rate is set to 0.001, the number of iterations to 100000, and the number of images input per iteration to 2. The weights of the target detection network are initialized using ResNet-101 parameters pre-trained on the ImageNet dataset, and for the additional network layers of the target detection network the xavier initialization method is used: given the input dimension m and the output dimension n of the layer a parameter belongs to, the parameter is initialized uniformly within the range $\left[-\sqrt{6/(m+n)},\ \sqrt{6/(m+n)}\right]$.

The training images and the labels of the objects on each image are input into the network, with the image size set to 1333 × 600. During the training process, each anchor box pair is labeled as a positive or negative sample according to its overlap with the target's bounding boxes. When both the left and right anchor boxes of a pair overlap the target's left and right bounding boxes with an overlap (IoU) greater than 0.5, the pair is a positive sample and is matched to that target; when both overlaps are less than 0.3, the pair is a negative sample. The remaining anchor box pairs are ignored.

A loss function is calculated. The loss function used in the present invention is as follows:

$$L = \frac{1}{N_{gt}} \sum_{i=1}^{N_a} \sum_{j=1}^{N_{gt}} t_{ij}\, L_{loc}\!\left(u_i^{left}, \hat{u}_j^{left}\right) + \frac{1}{N_{gt}} \sum_{i=1}^{N_a} \sum_{j=1}^{N_{gt}} t_{ij}\, L_{loc}\!\left(u_i^{right}, \hat{u}_j^{right}\right) - \frac{1}{N_a} \sum_{i=1}^{N_a} \log p_i(c)$$

where $N_a$ and $N_{gt}$ are the numbers of anchor box pairs and targets respectively; $(x_i, y_i, w_i, h_i)$ represents the left and right bounding box parameters, in which $x_i$ comprises the two parts $x_i^{left}$ and $x_i^{right}$ and $w_i$ comprises the two parts $w_i^{left}$ and $w_i^{right}$, so that $u_i^{left} = (x_i^{left}, y_i, w_i^{left}, h_i)$ and $u_i^{right} = (x_i^{right}, y_i, w_i^{right}, h_i)$, with $L_{loc}$ the smooth-L1 position error against the matched ground-truth parameters $\hat{u}_j$; $p_i(c)$ represents the predicted probability of the true class; $s_i$ indicates whether the ith anchor box pair is a positive sample, equal to 1 if positive and 0 otherwise; and $t_{ij}$ indicates whether the ith anchor box pair matches the bounding box of the jth target, equal to 1 if matched and 0 otherwise. The whole loss function consists of three parts: the first two terms compute the position errors of the positive samples' left and right boxes, and the third term is the class prediction error.

If the current iteration count exceeds the maximum number of iterations, training ends. Otherwise, back propagation is performed. The back propagation process computes the partial derivative of the loss function with respect to each node and updates the weight parameters of the network with a gradient descent algorithm. The network weights are updated as follows:

$$w_{ij}^{m} \leftarrow w_{ij}^{m} - \alpha\,\frac{\partial L}{\partial w_{ij}^{m}}, \qquad b_{i}^{m} \leftarrow b_{i}^{m} - \alpha\,\frac{\partial L}{\partial b_{i}^{m}}$$

where m denotes the layer index, i the index of a neuron within the layer, j the index of one of that neuron's inputs, and α the learning rate.

The target detection on the multiplexed image comprises the following specific steps:

First, the multiplexed image to be detected is resized to 1333 × 600 and input into the neural network model trained in the previous stage to obtain the numerical results of target detection. Non-maximum suppression is applied to the detection boxes output by the network to remove redundancy; since each target on the multiplexed image has two target boxes, non-maximum suppression is performed separately on the left and right boxes, and only the targets retained by both passes are output. The detection results $(class, x_{left}, x_{right}, w_{left}, w_{right}, y, h, score)$ are written to a text file for saving.

Depth estimation is performed on the detected targets; the flow chart is shown in Fig. 5. The specific steps are as follows:

the horizontal pixel difference of two bounding boxes of each object, i.e. the disparity of this object, is first calculated: disp ═ xright-xleft. And then obtaining the relation between the target depth and the parallax according to the binocular vision principle: depth is f × baseline/disp, where f is the focal length of the camera and baseline is the reference distance of the camera, and thus the depth of each detected object can be obtained.

It is important to point out that, compared with conventional 2D object detection methods, the present invention can estimate the depth of objects without losing detection accuracy or speed. Compared with a 3D object detection method based on binocular images, the target detector of the present invention takes only one third of the time. Fig. 6 shows a visualization of part of the detection results (for clarity, the detections obtained from the multiplexed image are shown on the corresponding left-camera image).

It will be apparent to those skilled in the art that each of the functions described above can be implemented in hardware, software, or a suitable combination of the two. It should be noted and understood that various modifications and improvements can be made to the invention described above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.
