Light field depth prediction method based on convolutional neural network and attention mechanism

Document No.: 106341    Publication date: 2021-10-15

Note: This technology, "Light field depth prediction method based on convolutional neural network and attention mechanism" (一种基于卷积神经网络和注意力机制的光场深度预测方法), was designed and created by 张倩, 杜昀璋, 刘敬怀, 花定康, 王斌 and 朱苏磊 on 2021-06-30. Its main content is as follows: The invention relates to a light field depth prediction method based on a convolutional neural network and an attention mechanism, comprising the following steps: acquiring a light field image and preprocessing it to generate a light field image set; constructing a light field depth prediction model comprising an EPI learning module, an attention module and a feature fusion module; inputting the light field image set into the EPI learning module and the attention module respectively, to obtain the EPI information of the light field image and the weight of each image; and inputting the EPI information of the light field image and the weight of each image into the feature fusion module to obtain the light field depth prediction result. Compared with the prior art, the method has the advantages of high prediction accuracy and good practicability.

1. A light field depth prediction method based on a convolutional neural network and an attention mechanism is characterized by comprising the following steps:

step 1: acquiring a light field image and preprocessing the light field image to generate a light field image set;

step 2: constructing a light field depth prediction model, wherein the model comprises an EPI learning module, an attention module and a feature fusion module;

and step 3: respectively inputting the light field image set obtained in the step 1 into an EPI learning module and an attention module, and respectively obtaining EPI information of the light field image and image weights;

and 4, step 4: and respectively inputting the EPI information of the light field image and the weight of each image into the feature fusion module to obtain a light field depth prediction result.

2. The method for predicting the depth of the light field based on the convolutional neural network and the attention mechanism as claimed in claim 1, wherein the preprocessing of the light field image in the step 1 specifically comprises: and carrying out data enhancement operation on the light field image.

3. The light field depth prediction method based on the convolutional neural network and the attention mechanism as claimed in claim 1, wherein the EPI learning module specifically is:

parallel EPI learning networks are respectively arranged at four angles of 0 degree, 45 degrees, 90 degrees and 135 degrees, and each parallel EPI learning network comprises a two-dimensional convolution layer, an activation layer, a two-dimensional convolution layer, an activation layer and a batch normalization layer which are sequentially connected.

4. The method according to claim 3, wherein the loss function of the EPI learning network is:

where L is the return loss value, N is the total number of samples, and x and y are the predicted outputs, respectively.

5. The method according to claim 3, wherein the activation layer is specifically: sigmoid function.

6. The light field depth prediction method based on the convolutional neural network and the attention mechanism as claimed in claim 1, wherein the attention module comprises a two-dimensional convolutional layer, a Resblock, a feature extraction layer, a Cost volume layer, a pooling layer, a fully-connected layer and an activation layer which are connected in sequence.

7. The light field depth prediction method based on the convolutional neural network and the attention mechanism as claimed in claim 6, wherein the feature extraction layer specifically comprises: a spatial pyramid pooling layer.

8. The method according to claim 1, wherein the step 2 further comprises validating the light field depth prediction model during training.

9. The method for predicting the depth of the light field based on the convolutional neural network and the attention mechanism as claimed in claim 8, wherein the verification method comprises:

firstly, calculating the mean square error MSE of the light field depth prediction result and the ground truth:

wherein N is the total number of pixels in the light field image; dep and GT are the light field depth prediction result and ground truth respectively; i is each pixel in the light field image;

secondly, calculating peak signal-to-noise ratio PSNR:

wherein MAX is the maximum value of pixels in the light field image;

then, calculating a structural similarity index SSIM:

wherein x and y are respectively the light field depth prediction result and the ground truth; μ is the mean of the light field image pixel values; σx² and σy² are respectively the variances of the corresponding images; σx,y is the covariance of x and y;

and finally, judging whether the MSE, the PSNR and the SSIM are all within preset thresholds, if so, finishing the training of the model, and otherwise, continuing to train the model.

10. The method for predicting the depth of the light field based on the convolutional neural network and the attention mechanism as claimed in claim 1, wherein the feature fusion module comprises 8 sequentially connected convolution blocks and 1 optimization block; the optimization block comprises two-dimensional convolution layers and an activation layer.

Technical Field

The invention relates to the technical field of light field depth estimation, in particular to a light field depth prediction method based on a convolutional neural network and an attention mechanism.

Background

Light field depth information reflects the precise spatial information of the corresponding target. Scene depth acquisition is a key technology that determines whether light field images can be widely applied, and it is also a research hotspot in computer vision and related fields. It plays an important role in three-dimensional reconstruction, target recognition, autonomous driving and other fields.

Currently, light field depth estimation algorithms are mainly divided into non-learning-based methods and learning-based methods. Non-learning methods mainly include focus/defocus fusion and stereo-matching-based methods. The focus/defocus fusion method obtains the corresponding depth by measuring the blur of pixels across different focal stacks; the depth map obtained in this way retains more detail, but introduces defocus errors that reduce its accuracy.

In recent years, deep learning has achieved great success in the field of light field depth estimation. For example, Chinese patent CN112785637A discloses a light field depth estimation method based on a dynamic fusion network, which includes the following steps: determining a light field data set, and determining a training set and a test set based on it; augmenting the light field data set; building a dynamic fusion network model, which consists of a dual-stream network and a multi-modal dynamic fusion module, the dual-stream network consisting of an RGB stream and a focal stack stream; taking the global RGB features and the focus features output by the dual-stream network as the input of the multi-modal dynamic fusion module and outputting the final depth map; training the constructed dynamic fusion network model on the training set; and testing the trained model on the test set and verifying it on a mobile phone data set. The light field depth estimation method of that patent achieves accuracy superior to other light field depth estimation methods, reduces noise, retains more detailed information, breaks through the limitations of the light field camera, and has been successfully applied to data from ordinary consumer-grade cameras.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide a light field depth prediction method based on a convolutional neural network and an attention mechanism that has high prediction accuracy and good practicability.

The purpose of the invention can be realized by the following technical scheme:

a light field depth prediction method based on a convolutional neural network and an attention mechanism comprises the following steps:

step 1: acquiring a light field image and preprocessing the light field image to generate a light field image set;

step 2: constructing a light field depth prediction model, wherein the model comprises an EPI learning module, an attention module and a feature fusion module;

and step 3: respectively inputting the light field image set obtained in the step 1 into an EPI learning module and an attention module, and respectively obtaining EPI information of the light field image and image weights;

and 4, step 4: and respectively inputting the EPI information of the light field image and the weight of each image into the feature fusion module to obtain a light field depth prediction result.

Preferably, the preprocessing on the light field image in the step 1 specifically includes: and carrying out data enhancement operation on the light field image.

Preferably, the EPI learning module is specifically:

parallel EPI learning networks are respectively arranged at four angles of 0 degree, 45 degrees, 90 degrees and 135 degrees, and each parallel EPI learning network comprises a two-dimensional convolution layer, an activation layer, a two-dimensional convolution layer, an activation layer and a batch normalization layer which are sequentially connected.

More preferably, the loss function of the EPI learning network is:

where L is the return loss value, N is the total number of samples, and x and y are the predicted outputs, respectively.

More preferably, the active layer is specifically: sigmoid function.

Preferably, the attention module comprises a two-dimensional convolutional layer, a Resblock, a feature extraction layer, a Cost volume layer, a pooling layer, a full-link layer and an activation layer which are connected in sequence.

More preferably, the feature extraction layer specifically is: a spatial pyramid pooling layer.

Preferably, the step 2 further includes verifying the light field depth prediction model during training.

More preferably, the verification method is:

firstly, calculating the mean square error MSE of the light field depth prediction result and the ground truth:

wherein N is the total number of pixels in the light field image; dep and GT are the light field depth prediction result and ground truth respectively; i is each pixel in the light field image;

secondly, calculating peak signal-to-noise ratio PSNR:

wherein MAX is the maximum value of pixels in the light field image;

then, calculating a structural similarity index SSIM:

wherein x and y are respectively the light field depth prediction result and the ground truth; μ is the mean of the light field image pixel values; σx² and σy² are respectively the variances of the corresponding images; σx,y is the covariance of x and y;

and finally, judging whether the MSE, the PSNR and the SSIM are all within preset thresholds, if so, finishing the training of the model, and otherwise, continuing to train the model.

Preferably, the feature fusion module comprises 8 convolution blocks and 1 optimization block which are connected in sequence; the optimization block comprises two-dimensional convolution layers and an activation layer.

Compared with the prior art, the invention has the following beneficial effects:

Firstly, high prediction accuracy: the light field depth prediction method fully considers the geometric characteristics of the light field image, makes full use of its angular characteristics and symmetry, improves the accuracy of depth estimation, and provides more accurate results under the same working time and working conditions.

Secondly, good practicability: the light field depth prediction method does not depend on precision equipment such as radar or antennas, can conveniently acquire the required depth information, and is highly practical.

Drawings

FIG. 1 is a flow chart of a light field depth prediction method according to the present invention;

FIG. 2 is a schematic structural diagram of a light field depth prediction model according to the present invention;

FIG. 3 is a schematic diagram of three modes of an attention module according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

A light field depth prediction method based on a convolutional neural network and an attention mechanism, the flow of which is shown in fig. 1:

step 1: acquiring a light field image and preprocessing the light field image to generate a light field image set;

acquisition of light field image: with the gradual maturity of the light field imaging technology, consumer-grade light field cameras are applied in a large scale, the light field cameras can be used for collecting the position and direction information rich in light in a scene, and the depth information of the scene can be further acquired by analyzing and processing the information by a passive depth estimation method. A light field camera may obtain four-dimensional light field information, i.e. obtain images of a scene from multiple perspectives, after one shot. These images form a 9 x 9 total array of 81 images, the relative position of each picture in the array being fixed. The difference between the relative positions of each picture (i.e., the baseline) and the difference in position between the corresponding same spatial point in each picture (i.e., the disparity) are calculated. And calculating the relation between the base line and the parallax to obtain the distance between the corresponding point in the space and the central lens visual angle of the camera.

Because light field images must be acquired with specialized equipment such as a fixed camera array, a camera gantry or a light field camera, the image data available for a given scene may sometimes be insufficient. In practice, this problem is addressed by preprocessing the data with data enhancement.

While keeping the geometric relationships among the sub-aperture images of the light field unchanged, the limited data are transformed to enlarge the usable data volume. The data enhancement operations in this embodiment include:

1. light field image with transferred central viewpoint

The acquired light field data has 9 × 9 views, each with a spatial resolution of 512 × 512. Translating a 7 × 7 window over the 9 × 9 array yields 9 times as much data available for training;

2. angle-changeable

New training data can be obtained directly by rotation: the sub-aperture images are rotated first, and then the images of all viewpoints are rearranged and reconnected by extracting the epipolar-plane features of the viewpoints.

3. Scaling and flipping

It should be noted that when the image is enlarged or reduced, the disparity values must be transformed accordingly.

The above three methods can operate on multiple attributes of the image, such as the central view, the image size, the image RGB values, random color transformations, gray values, gamma values and so on.
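A minimal sketch of the three augmentation operations described above, assuming the light field is stored as a NumPy array of shape (9, 9, H, W, 3) (u, v, height, width, RGB) with a ground-truth disparity map of shape (H, W); the function names and shape conventions are illustrative, not taken from the patent:

import numpy as np
from scipy.ndimage import zoom

def shift_center_viewpoint(lf, win=7):
    # Slide a win x win window over the 9 x 9 viewpoint array: 3 x 3 = 9 sub-light-fields.
    U, V = lf.shape[:2]
    return [lf[du:du + win, dv:dv + win]
            for du in range(U - win + 1) for dv in range(V - win + 1)]

def rotate_90(lf, disp):
    # Rotate every sub-aperture image by 90 degrees and rearrange the viewpoint grid
    # so that the epipolar geometry stays consistent; the disparity map rotates with it.
    lf_rot = np.rot90(lf, k=1, axes=(2, 3))      # rotate each image spatially
    lf_rot = np.rot90(lf_rot, k=1, axes=(0, 1))  # rearrange the 9 x 9 viewpoints
    return lf_rot, np.rot90(disp, k=1)

def scale_lf(lf, disp, factor=0.5):
    # Spatially scale the images; the disparity values must be scaled by the same factor.
    lf_s = zoom(lf, (1, 1, factor, factor, 1), order=1)
    return lf_s, zoom(disp, (factor, factor), order=1) * factor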

Step 2: constructing a light field depth prediction model as shown in FIG. 2, the model comprising an EPI learning module, an attention module and a feature fusion module;

the construction method of the EPI learning module comprises the following steps:

the four-dimensional light-field image can be represented as L (x, y, u, v), where (x, y) is spatial resolution and (u, v) is angular resolution, and the relationship of the light-field image center to other viewpoints can be represented as:

L(x,y,0,0)=L(x+d(x,y)*u,y+d(x,y)*v,u,v)

where d(x, y) is the disparity between the central-viewpoint pixel (x, y) and the corresponding pixel in an adjacent viewpoint.

For an angular direction of θ (tan θ = v/u), the following relationship is re-established:

L(x, y, 0, 0) = L(x + d(x,y)*u, y + d(x,y)*u·tanθ, u, u·tanθ)

Since the light field viewpoints form an ordered 9 × 9 array, a corresponding viewpoint exists only when u·tanθ is an integer. The four viewpoint directions with image angles of 0°, 45°, 90° and 135° are therefore selected, and the angular resolution of the light field image can be assumed to be (2N+1) × (2N+1).

Therefore, the EPI learning module sets up parallel EPI learning networks at the four angles of 0°, 45°, 90° and 135°, each performing feature extraction on the light field image data. Each of the four parallel EPI learning networks comprises a two-dimensional convolution layer (2D Conv), an activation layer (ReLU), a two-dimensional convolution layer (2D Conv), an activation layer (ReLU) and a batch normalization layer (BN), connected in sequence.
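A minimal PyTorch sketch of one of the four parallel EPI learning branches described above (2D convolution, activation, 2D convolution, activation, batch normalization). The 2 × 2 kernel with stride 1 follows the description further below; the channel counts and the use of ReLU as the activation are assumptions (the claims also mention a sigmoid activation):

import torch.nn as nn

class EPIBranch(nn.Module):
    def __init__(self, in_ch=9, mid_ch=70):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=2, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=2, stride=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(mid_ch),
        )

    def forward(self, x):   # x: the views stacked along one angular direction, (B, in_ch, H, W)
        return self.net(x)

# The full EPI learning module would hold one branch per direction (0, 45, 90, 135 degrees):
# branches = nn.ModuleList([EPIBranch() for _ in range(4)])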

In a two-dimensional convolution layer (2D Conv), with A and B both two-dimensional matrices, the convolution result is:

C(j, k) = ∑p ∑q A(p, q) B(j − p + 1, k − q + 1)

A sigmoid function is adopted in the activation layer (ReLU), specifically:

the nonlinear output is introduced into the output z of the neuron on the upper layer by the activation function, phi (z) is the output of the next layer, and Relu avoids the problems of gradient explosion and gradient disappearance to a certain extent.

Because deep neural networks stack many layers, learning slows down: as the distribution of a lower layer's input shifts, the upper layers fall into saturation regions and learning stops too early. To prevent this, batch normalization (BN) is applied after the last activation layer.

where μ is the shift parameter and σ is the scaling parameter, used to shift and scale the data so that it follows a standard distribution with mean 0 and variance 1; b is a re-shift parameter and g a re-scale parameter, ensuring that normalization does not reduce the expressive power of the model.
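In standard batch-normalization notation, the transformation consistent with the parameter description above (the patent formula itself is not reproduced) is

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y = g\,\hat{x} + b

where μ and σ² are the batch mean and variance, ε is a small constant for numerical stability, and g and b are the learnable rescaling and re-shifting parameters.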

The loss function of the EPI learning network is:

where L is the return loss value, N is the total number of samples, and x and y are the predicted outputs, respectively.
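The formula itself is not reproduced above. A regression loss consistent with these symbols would be, for example, the mean absolute error (an assumed form; the patent may use a different one):

L = \frac{1}{N}\sum_{i=1}^{N}\left|x_{i} - y_{i}\right|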

To cope with the problem of an overly small baseline in the light field, a convolution kernel of size 2 × 2 with a stride of 1 is used to measure slight disparity values. The convolution depth is set to 7, and the learning rate of the network is set to 1e-5.

The construction method of the attention module comprises the following steps:

A large number of pictures from different angular viewpoints are acquired in the light field data. As described in step 1, depth information in three-dimensional space can be obtained by computing the disparity information and EPI information of corresponding points in these pictures. However, since these pictures contain a large amount of redundant information, an attention module is provided to compute and assign weights to the pictures in the light field, highlighting the importance and contribution of the pictures that are more valuable for estimating the light field depth.

The attention module comprises a two-dimensional convolution layer (2D Conv), a residual block (Resblock), a feature extraction layer (FE block), a Cost volume layer, a pooling layer, a fully-connected layer and an activation layer (ReLU), connected in sequence. Specifically:

First, the light field image is preprocessed by the two-dimensional convolution layer (2D Conv) and the Resblock; features are then extracted in the feature extraction layer (FE block), removing the influence of textureless regions and non-Lambertian surfaces. The FE block extracts features from the connections of adjacent regions and concatenates all feature maps to obtain the output feature map. Next, in the Cost volume layer, the relative positions of the feature views are adjusted, and a five-dimensional cost volume (batch size × disparity × height × width × feature size) is computed by connecting the shifted feature maps. Finally, the input cost volumes are aggregated to generate the attention map through the fully-connected layer and the activation layer. Taking the HCI data set as an example, there are 9 × 9 sub-aperture viewpoints per scene, so an attention map of size 9 × 9 is finally obtained. This part of the operation is divided into three steps:

First, extract image features using the feature extraction layer

The feature extraction layer uses an SPP (spatial pyramid pooling) module, which estimates the disparity value using information from the neighborhood of the corresponding point.

Specifically, in a CNN, the last pooling layer is removed and replaced by an SPP that performs max pooling. The SPP-net can be trained with standard back-propagation.
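A minimal PyTorch sketch of such a spatial pyramid pooling block; the pooling scales, channel counts and the use of bilinear upsampling are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPBlock(nn.Module):
    def __init__(self, in_ch=32, branch_ch=8, sizes=(2, 4, 8, 16)):
        super().__init__()
        # One max-pooling branch per pyramid level, each reduced by a 1 x 1 convolution.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.MaxPool2d(kernel_size=s, stride=s),
                          nn.Conv2d(in_ch, branch_ch, kernel_size=1),
                          nn.ReLU(inplace=True))
            for s in sizes)

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample every pooled branch back to the input size and concatenate all feature maps.
        feats = [x] + [F.interpolate(b(x), size=(h, w), mode='bilinear', align_corners=False)
                       for b in self.branches]
        return torch.cat(feats, dim=1)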

Second, calculate the Cost volume

The feature map of each sub-aperture view is obtained by passing it through the SPP module. To make better use of these feature maps, a cost volume is computed. Based on the feature maps provided by the SPP module, the input image is shifted in the u or v direction at different disparity levels, so that the latter half of the network can directly examine pixel information at different spatial locations with a relatively small receptive field. Nine disparity levels are set, ranging from −4 to 4. After the feature maps are shifted, they are connected into a 5D cost volume whose size equals batch size × disparity × height × width × feature size.
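A minimal sketch of this cost-volume construction, assuming per-view feature maps of shape (B, C, H, W) and integer pixel shifts via torch.roll; the shift convention and names are assumptions:

import torch

def build_cost_volume(view_feats, view_coords, disparities=range(-4, 5)):
    # view_feats:  list of tensors, one per sub-aperture view, each (B, C, H, W)
    # view_coords: list of (u, v) offsets of each view relative to the central view
    # returns:     cost volume of shape (B, disparity, H, W, feature)
    volumes = []
    for d in disparities:
        shifted = [torch.roll(f, shifts=(d * u, d * v), dims=(3, 2))
                   for f, (u, v) in zip(view_feats, view_coords)]
        volumes.append(torch.cat(shifted, dim=1))      # (B, C_total, H, W)
    cost = torch.stack(volumes, dim=1)                 # (B, D, C_total, H, W)
    return cost.permute(0, 1, 3, 4, 2)                 # (B, D, H, W, C_total)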

Third, obtain the attention map

The attention map is essentially a 9 × 9 map that indicates the importance of the corresponding views. The first type is free attention mapping, in which every view has its own importance value and all images in the light field picture are learned. The second type is symmetric attention: the light field image array is symmetric along the u and v axes, so only the 25 images that are unique under this symmetry need to be learned, and the entire map is constructed by mirroring along the u and v axes. In the third type, the image array is symmetric along u, v and the two diagonal axes; again exploiting symmetry, the weights of the 15 unique images are computed, and the complete attention map is then constructed by mirroring along the diagonal, v and u axes. Constraining the structure of the attention map reduces the number of learnable weights. With the cost volume as input, the view selection module generates the attention map through a global pooling layer, a fully-connected layer and finally an activation layer, thereby obtaining the attention allocation map for all pictures of the light field image.
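A minimal sketch of this view-selection head (global pooling, fully-connected layer, activation) producing one weight per sub-aperture view; the layer size and the sigmoid activation are assumptions:

import torch
import torch.nn as nn

class ViewAttentionHead(nn.Module):
    def __init__(self, feat_dim, n_views=81):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_views)

    def forward(self, cost):                   # cost: (B, D, H, W, C)
        pooled = cost.mean(dim=(1, 2, 3))      # global pooling over disparity and space -> (B, C)
        attn = torch.sigmoid(self.fc(pooled))  # one weight per view -> (B, 81)
        return attn.view(-1, 9, 9)             # reshape into the 9 x 9 attention map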

The attention module includes three modes, as shown in FIG. 3. In the first mode, the module performs attention evaluation on each image independently; in the second mode, only the 0° and 90° directions are used and the remaining weights are obtained by mirroring; in the last mode, the 45° and 135° directions are added as symmetry axes. The three modes are combined to obtain the attention. The attention map is combined, in the form of weights, with the convolution layers of the neural network, thereby strengthening the weights of the valuable sub-aperture viewpoints.
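For the second, symmetric mode, only the 5 × 5 = 25 weights of one quadrant (including the central row and column) need to be learned; a sketch of rebuilding the full 9 × 9 map by mirroring along the u and v axes (the index conventions are assumptions) is:

import torch

def mirror_quadrant(q):                                    # q: (5, 5), central view at q[0, 0]
    top = torch.cat([q.flip([1])[:, :-1], q], dim=1)       # mirror along the v axis -> (5, 9)
    full = torch.cat([top.flip([0])[:-1, :], top], dim=0)  # mirror along the u axis -> (9, 9)
    return full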

The construction method of the feature fusion module comprises the following steps:

the feature fusion module comprises 8 convolution blocks and 1 optimization block which are sequentially connected, wherein the optimization block comprises two-dimensional convolution layers and an activation layer.

Step 2 also includes verifying the light field depth prediction model during training, specifically:

firstly, calculating the mean square error MSE of the light field depth prediction result and the ground truth:

wherein N is the total number of pixels in the light field image; dep and GT are the light field depth prediction result and ground truth respectively; i is each pixel in the light field image;

secondly, calculating peak signal-to-noise ratio PSNR:

wherein MAX is the maximum value of pixels in the light field image;

then, calculating a structural similarity index SSIM:

wherein x and y are respectively the light field depth prediction result and the ground truth; μ is the mean of the light field image pixel values; σx² and σy² are respectively the variances of the corresponding images; σx,y is the covariance of x and y;

and finally, judging whether the MSE, the PSNR and the SSIM are all within preset thresholds, if so, finishing the training of the model, and otherwise, continuing to train the model.
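The formulas themselves are not reproduced above; the standard definitions consistent with the listed symbols (with c1 and c2 the usual small stabilizing constants of SSIM) are

MSE = \frac{1}{N}\sum_{i=1}^{N}\bigl(Dep(i) - GT(i)\bigr)^{2}, \qquad PSNR = 10\,\log_{10}\frac{MAX^{2}}{MSE}

SSIM(x, y) = \frac{(2\mu_{x}\mu_{y} + c_{1})(2\sigma_{x,y} + c_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2})}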

And step 3: respectively inputting the light field image set obtained in the step 1 into an EPI learning module and an attention module, and respectively obtaining EPI information of the light field image and image weights;

and 4, step 4: and respectively inputting the EPI information of the light field image and the weight of each image into the feature fusion module to obtain a light field depth prediction result.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
