Method for improving depth estimation accuracy by utilizing multitask complementation

Document No.: 1202612. Publication date: 2020-09-01.

Reading note: this technology, "Method for improving depth estimation accuracy by utilizing multitask complementation", was designed by 颜成钢 (Yan Chenggang), 张杰华 (Zhang Jiehua), 楼杰栋 (Lou Jiedong), 孙垚棋 (Sun Yaoqi), 张继勇 (Zhang Jiyong) and 张勇东 (Zhang Yongdong) on 2020-04-29. Its main content is as follows. The invention discloses a method for improving depth estimation accuracy by utilizing multi-task complementation. The method uses the sparse depth image obtained by an SFM system as an auxiliary input and feeds it into a convolutional neural network together with the RGB image, where it serves as an auxiliary cue that improves the accuracy of depth estimation. The sparse depth image and the RGB image are treated as noisy data; the input data are encoded by denoising encoders to extract their features, and the four extracted groups of features are concatenated and then shared. This addresses the problem that depth estimation from an RGB image alone with a convolutional-neural-network-based method has no reliable and robust cues. The encoded features are then reconstructed into an accurate depth map by a decoder, giving a more accurate depth estimation result.

1. A method for improving depth estimation accuracy by utilizing multi-task complementation, characterized in that a sparse depth image obtained by an SFM system is used as an auxiliary input and is input to a convolutional neural network together with an RGB image, serving as an auxiliary cue to improve depth estimation accuracy, the method comprising the following implementation steps:

Step 1: perform data augmentation:

all RGB images in the data set, which has already been divided into a training set and a test set, are flipped horizontally and saved, doubling the size of the data set; the RGB images in the enlarged data set are processed by the SFM system to obtain sparse depth images; the sparse depth images correspond one-to-one with the RGB images in the enlarged data set and become part of the data set;
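The flipping part of step 1 can be sketched as follows. This is a minimal illustration, assuming the images are held in a NumPy array of shape (N, H, W, 3); the function name `augment_with_horizontal_flips` and the array sizes are hypothetical, and the SFM processing that produces the sparse depth images is not shown.

```python
import numpy as np

def augment_with_horizontal_flips(images):
    """Return the originals followed by their left-right mirrored copies."""
    # Assumed layout: (N, H, W, C); reversing axis 2 mirrors each image horizontally.
    flipped = images[:, :, ::-1, :]
    return np.concatenate([images, flipped], axis=0)

rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 8, 3))              # 4 hypothetical RGB images of 6 x 8 pixels
augmented = augment_with_horizontal_flips(rgb)
print(augmented.shape)                      # the data set is twice as large
```

Under this layout, the sparse depth images would then be computed for all images of the enlarged set, so that each RGB image (original or mirrored) has its own depth counterpart.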

Step 2: input the RGB image and the sparse depth image into an RGB denoising encoder and a depth denoising encoder, respectively, to obtain the encoded features of the sparse depth image and the RGB image, denoted as the sparse depth image feature d_feat, the RGB image R-channel feature r_feat, the RGB image G-channel feature g_feat and the RGB image B-channel feature b_feat;

the RGB denoising encoder has the following structure:

the size of the input layer equals the size of the input RGB image; the hidden layer has 1024 neurons, its activation function is ReLU, and its weights are randomly initialized from a normal distribution; the size of the output layer equals the size of the input RGB image, its weights are randomly initialized from a normal distribution, and its activation function is the sigmoid function;

the depth denoising encoder has the following structure:

the size of the input layer equals the size of the sparse depth image; the hidden layer has 1024 neurons, its activation function is ReLU, and its weights are randomly initialized from a normal distribution; the size of the output layer equals the size of the input sparse depth image, its weights are randomly initialized from a normal distribution, and its activation function is linear, that is, the output is left unchanged; the input of the hidden layer is the output of the input layer, and the output of the hidden layer is the input of the output layer;
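The two encoder structures above can be illustrated with a small forward-pass sketch. It is written under several assumptions that are not in the source: the layers are modeled as plain fully connected layers in NumPy, bias terms are included and start at zero, the standard deviation of the normal initialization is 0.01, and one set of parameters is created per colour channel. All function and variable names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def identity(x):
    return x  # linear activation: the value is left unchanged

def make_encoder(rng, n_pixels, hidden=1024, std=0.01):
    # Zero-mean normal initialization; std and the zero biases are assumptions.
    return {
        "W1": rng.normal(0.0, std, size=(n_pixels, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, std, size=(hidden, n_pixels)), "b2": np.zeros(n_pixels),
    }

def encode(params, image, output_activation):
    x = image.reshape(-1)                                   # flatten H x W x 1
    hidden = relu(x @ params["W1"] + params["b1"])          # 1024-neuron hidden layer
    out = output_activation(hidden @ params["W2"] + params["b2"])
    return hidden, out.reshape(image.shape)                 # output has the input's shape

rng = np.random.default_rng(0)
H, W = 6, 8                                                 # hypothetical image size
rgb_image = rng.random((H, W, 3))
sparse_depth = rng.random((H, W, 1))

# The RGB encoder is applied once per channel, with sigmoid on the output layer.
rgb_encoders = {c: make_encoder(rng, H * W) for c in "rgb"}
rgb_out = {c: encode(rgb_encoders[c], rgb_image[:, :, i:i + 1], sigmoid)
           for i, c in enumerate("rgb")}

# The depth encoder uses a linear output layer.
depth_hidden, depth_recon = encode(make_encoder(rng, H * W), sparse_depth, identity)
print(rgb_out["r"][1].shape, depth_recon.shape)
```

The sketch returns both the hidden activation and the output-layer result, since either could be read as the "feature" passed on to the later steps.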

Step 3: concatenate the features d_feat, r_feat, g_feat and b_feat obtained in step 2, that is, perform a dot product operation, to obtain the concatenated feature can_feat; input can_feat into the following convolutional layer, which activates it and outputs the shared feature;

the convolutional layer consists of 1024 neurons, its activation function is ReLU, and its weights are randomly initialized from a normal distribution; the shared feature output by the convolutional layer is denoted shared_feed;

Step 4: input the shared feature shared_feed into a decoding layer, wherein the size of the decoding layer equals that of the sparse depth image, its weights are randomly initialized from a normal distribution, and its activation function is ReLU; the output of the decoding layer is the desired depth map.
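Steps 3 and 4 can be sketched as a forward pass. This is only an illustration under stated assumptions: the four features are taken to be 1024-element vectors, "serial connection" is read as concatenation along the feature axis, the 1024-neuron shared layer and the decoding layer are modeled as bias-free fully connected layers, and the initialization standard deviation of 0.01 is a placeholder. The identifiers `can_feat` and `shared_feed` follow the source; everything else is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
H, W, feat_dim = 6, 8, 1024                       # hypothetical sizes

# Stand-ins for the encoder outputs of step 2.
d_feat, r_feat, g_feat, b_feat = (rng.random(feat_dim) for _ in range(4))

# Step 3: join the four features end to end, then apply the shared 1024-neuron layer.
can_feat = np.concatenate([d_feat, r_feat, g_feat, b_feat])
W_shared = rng.normal(0.0, 0.01, size=(can_feat.size, 1024))
shared_feed = relu(can_feat @ W_shared)

# Step 4: the decoding layer has one output per pixel of the sparse depth image.
W_dec = rng.normal(0.0, 0.01, size=(1024, H * W))
depth_map = relu(shared_feed @ W_dec).reshape(H, W, 1)

print(can_feat.shape, shared_feed.shape, depth_map.shape)
```

Because the decoding layer uses ReLU, every predicted depth value in this sketch is non-negative.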

2. The method for improving depth estimation accuracy by utilizing multi-task complementation according to claim 1, characterized in that:

for the input layer of the RGB denoising encoder, a single monocular RGB image from the training set is input, and the output end of the input layer passes it to the hidden layer; the size of the input layer is required to be W × H × 1, the input layer is a convolutional layer, and a variable 'channel' is added to the layer name to distinguish the three channels; W is the width of the input RGB image, H is the height of the input RGB image, and channel takes the values r, g and b; the hidden layer is a fully connected layer of 1024 neurons, its activation function is ReLU, its weights are randomly initialized from a zero-mean normal distribution, and a variable is added to the layer name to distinguish the hidden layers of the different channels; the activated output of the hidden layer is passed to the output layer; the output layer is a fully connected layer of W × H neurons, its activation function is the sigmoid function, and its weights are randomly initialized from a zero-mean normal distribution; the output of the output layer is reshaped to W × H × 1;

for the input layer of the depth denoising encoder, a sparse depth image of the training set, as output by the SFM system, is input, and the output end of the input layer passes it to the hidden layer; the size of the input layer is required to be W × H × 1, the input layer is a convolutional layer, and its output is flattened before being input to the hidden layer; the hidden layer is a fully connected layer of 1024 neurons that activates the flattened input, its weights are randomly initialized from a zero-mean normal distribution, and its activation function is ReLU; the activated output of the hidden layer is passed to the output layer; the output layer is a fully connected layer of W × H neurons, its weights are randomly initialized from a zero-mean normal distribution, and its activation function is linear, that is, the output is left unchanged; the output result is reshaped to W × H × 1.

3. The method for improving depth estimation accuracy by utilizing multi-task complementation according to claim 2, characterized in that:

10% of the pixel values in the RGB images and the sparse depth images of the training set are set to zero; the images are then input into the RGB denoising encoder and the depth denoising encoder, respectively, and the encoded features of the sparse depth image and the RGB image are obtained from the respective output layers.
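The corruption step in claim 3 can be sketched as below. The source does not say how the 10% of pixels are chosen; this sketch assumes they are drawn uniformly at random without replacement, and the helper name `corrupt` and the 10 × 10 test image are hypothetical.

```python
import numpy as np

def corrupt(image, rng, fraction=0.10):
    """Return a copy of `image` with `fraction` of its pixels set to zero."""
    flat = image.reshape(-1).copy()                  # copy so the clean image is kept
    n_zero = int(round(fraction * flat.size))
    idx = rng.choice(flat.size, size=n_zero, replace=False)
    flat[idx] = 0.0
    return flat.reshape(image.shape)

rng = np.random.default_rng(0)
# Values lie in [1, 2), so any zero in the result was introduced by the corruption.
channel = rng.random((10, 10, 1)) + 1.0
noisy = corrupt(channel, rng)
print(int((noisy == 0).sum()))                       # number of zeroed pixels
```

The clean image is left untouched so that it remains available as a reconstruction target, which is the usual arrangement for a denoising encoder.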

Technical Field

The invention relates to the field of computer vision, and in particular to a method for improving depth estimation accuracy by utilizing multi-task complementation, that is, a method for depth estimation with a monocular camera.

Background

Depth estimation is an important component of computer vision and has become popular in recent years with the development of deep learning. Its main applications are scene understanding, scene modeling and depth perception, and it is also of great help for autonomous driving and traversability estimation.

Depth can be measured with devices such as lidar, or estimated with computer vision methods from images acquired by a camera. Devices such as lidar are inconvenient to use, expensive, and unsuitable for transparent objects. With the development of deep learning in recent years, depth estimation with conventional machine learning methods has gradually declined, because conventional methods rely on manually extracted features, which cannot represent real 3D structural information well and therefore cannot give accurate results. The great progress of deep learning makes it possible for a model to learn rich features from the image itself, so depth estimation with convolutional neural networks (CNNs) has become mainstream over the past five years. Image acquisition devices, in turn, can be divided into monocular and binocular cameras. The monocular camera is the common image acquisition device in everyday scenes; it is inexpensive and widely used. The binocular camera is designed to imitate the structure of human eyes and can compute depth from binocular parallax, but this approach is computationally expensive and is strongly affected by illumination and distance.

In practice, obtaining three-dimensional depth information from a two-dimensional RGB image alone remains difficult because there are no reliable cues; in particular, end-to-end regression methods based on convolutional neural networks estimate depth directly from the RGB image and ignore many cues and features. Existing methods for computing sparse point clouds, such as that of Geiger, can be used to compute a sparse depth image for an image, and using this noisy, sparse depth image to assist the RGB image can improve prediction accuracy.

Disclosure of Invention

The invention aims to solve the problem that depth estimation with convolutional-neural-network-based methods cannot achieve sufficiently good results when only RGB images are used, and provides a method for improving depth estimation accuracy by utilizing multi-task complementation.

To achieve this purpose, the sparse depth image obtained by the SFM (structure-from-motion) system is used as an auxiliary input and is fed into the convolutional neural network together with the RGB image, serving as an auxiliary cue to improve the accuracy of depth estimation.

Step 1: the RGB images in the data set are processed by the SFM system to obtain sparse depth images, which correspond one-to-one with the RGB images in the data set and become part of the data set.

Step 2: the RGB image and the sparse depth image are input into an RGB denoising encoder and a depth denoising encoder, respectively, to obtain the encoded features of the sparse depth image and the RGB image, denoted as the sparse depth image feature d_feat, the RGB image R-channel feature r_feat, the RGB image G-channel feature g_feat and the RGB image B-channel feature b_feat.

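The RGB denoising encoder has the following structure:

the size of the input layer equals the size of the input RGB image; the hidden layer has 1024 neurons, its activation function is ReLU, and its weights are randomly initialized from a normal distribution; the size of the output layer equals the size of the input RGB image, its weights are randomly initialized from a normal distribution, and its activation function is the sigmoid function.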

The structure of the depth denoising encoder is as follows:

the size of the input layer equals the size of the sparse depth image; the hidden layer has 1024 neurons, its activation function is ReLU, and its weights are randomly initialized from a normal distribution; the size of the output layer equals the size of the input sparse depth image, its weights are randomly initialized from a normal distribution, and its activation function is linear, that is, the output is left unchanged. The input of the hidden layer is the output of the input layer, and the output of the hidden layer is the input of the output layer.

Step 3: the features d_feat, r_feat, g_feat and b_feat obtained in step 2 are concatenated, that is, a dot product operation is performed, to obtain the concatenated feature can_feat; can_feat is input into the following convolutional layer, which activates it and outputs the shared feature.

The convolutional layer consists of 1024 neurons, its activation function is ReLU, and its weights are randomly initialized from a normal distribution. The shared feature output by the convolutional layer is denoted shared_feed.

Step 4: the shared feature shared_feed is input into a decoding layer, wherein the size of the decoding layer equals that of the sparse depth image, its weights are randomly initialized from a normal distribution, and its activation function is ReLU. The output of the decoding layer is the desired depth map.

The above is the main structure and flow of the invention; some details need to be supplemented. First, each RGB image should be split into its red, green and blue channels, which are input to the RGB denoising encoder separately; that is, the RGB denoising encoder is called three times and produces three outputs. Second, the model is trained with stochastic gradient descent, and the loss function used in training is the Euclidean loss.
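The training detail mentioned above can be illustrated on a toy problem. This sketch assumes the Euclidean loss is taken as half the sum of squared differences between prediction and target (one common convention), and it applies stochastic gradient descent to a single bias-free linear layer rather than to the full network; the learning rate, sizes and function names are hypothetical.

```python
import numpy as np

def euclidean_loss(pred, target):
    # Assumed convention: half the sum of squared differences.
    diff = pred - target
    return 0.5 * np.sum(diff ** 2)

def sgd_step(W, x, target, lr=0.01):
    """One gradient step on a single sample for the linear model pred = x @ W."""
    pred = x @ W
    grad = np.outer(x, pred - target)        # d(loss)/dW for the loss above
    return W - lr * grad, euclidean_loss(pred, target)

rng = np.random.default_rng(0)
x = rng.random(16)                           # stand-in for a shared feature vector
target = rng.random(4)                       # stand-in for ground-truth depth values
W = rng.normal(0.0, 0.01, size=(16, 4))      # zero-mean normal initialization

losses = []
for _ in range(50):
    W, loss = sgd_step(W, x, target)
    losses.append(loss)
print(losses[0] > losses[-1])                # the loss shrinks over the updates
```

In a full implementation the same update would be applied to every layer by backpropagation, with a different training sample or mini-batch drawn at each step.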

The invention achieves the following beneficial effects:

The sparse depth image and the RGB image are treated as noisy data; the input data are encoded by denoising encoders to extract their features, and the four extracted groups of features are concatenated and then shared. This addresses the problem that depth estimation from an RGB image alone with a convolutional-neural-network-based method has no reliable and robust cues. The encoded features are reconstructed into an accurate depth map by a decoder, giving a more accurate depth estimation result.

Drawings

FIG. 1 is a flow chart of the present invention

FIG. 2 is a schematic diagram of an encoder structure

Detailed Description

The invention is described in further detail below with reference to specific embodiments and with reference to the accompanying drawings.

The specific embodiment and the implementation process of the invention are as follows:
