Light field depth estimation method based on dynamic fusion network

Document No.: 635416  Publication date: 2021-05-11

Reading note: this technology, "一种基于动态融合网络的光场深度估计方法" (Light field depth estimation method based on dynamic fusion network), was designed and created by 朴永日, 张淼, 张玉坤 and 吉新新 on 2021-01-20. The invention discloses a light field depth estimation method based on a dynamic fusion network, which comprises the following steps: determining a light field data set, and determining a training set and a test set based on the light field data set; augmenting the light field data set; building a dynamic fusion network model, which consists of a dual-stream network and a multi-modal dynamic fusion module, the dual-stream network consisting of an RGB stream and a focus stack stream; taking the global RGB features and the focus features output by the dual-stream network as the input of the multi-modal dynamic fusion module and outputting the final depth map; training the constructed dynamic fusion network model on the training set; and testing the trained dynamic fusion network model on the test set and verifying it on a mobile phone data set. The light field depth estimation method of the invention achieves accuracy superior to other light field depth estimation methods, reduces noise, retains more detail information, breaks through the limitation of the light field camera, and is successfully applied to ordinary consumer-grade camera data.

1. A light field depth estimation method based on a dynamic fusion network is characterized by comprising the following steps:

step 1, acquiring a light field data set, and obtaining a training set and a test set based on the light field data set, wherein each light field sample in the light field data set comprises an RGB image, a focal stack containing a plurality of focal slices focused at different depths, and a depth image corresponding to the RGB image;

step 2, performing data expansion on the light field data set;

step 3, building a dynamic fusion network model; the dynamic fusion network model consists of a dual-stream network and a multi-modal dynamic fusion module; the dual-stream network consists of an RGB stream and a focus stack stream; the RGB stream takes an RGB image of resolution H × W × 3 as input and outputs a global RGB feature f_rgb; the focus stack stream takes a focus stack of resolution N × H × W × 3 as input and outputs a focus feature h; where H and W represent the height and width of the image and N represents the number of focal slices; the global RGB feature f_rgb and the focus feature h output by the dual-stream network are used as the input of the multi-modal dynamic fusion module, which outputs the final depth map;

step 4, training the constructed dynamic fusion network model on the training set, taking the ground-truth depth map as the supervision signal and calculating a loss function;

step 5, testing the trained dynamic fusion network model on the test set, and verifying it on the mobile phone data set.

2. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein performing data expansion on the light field data set includes:

scaling the focal slice and the RGB image with a scaling factor in the range [1, 1.5]; and/or,

rotating the focal slice and the RGB image by a rotation angle in the range [-5, 5]; and/or,

achieving random variation of brightness, contrast and saturation by uniform sampling within the range [0.6, 1.4] to perform a color transformation on the focal slice and the RGB image; and/or,

flipping the focal slice and the RGB image horizontally with a random probability of 50%.

3. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein the RGB stream learns RGB features from different layers of a convolutional neural network using an encoder-decoder structure; the encoder adopts a pre-trained SENet-154 network structure, and the decoder adopts four up-sampling layers to gradually enlarge the output features from the encoder; an optimization module concatenates the features from the decoder and the encoder along the channel dimension and feeds them into three 3 × 3 convolutional layers, and the RGB stream finally outputs a global RGB feature f_rgb of size 256 × 256 × 1;

the focus stack stream consists of a pre-coding block and a spatial pyramid ConvGRU model; first, the focal stack I_1, I_2, ..., I_n is input into the focus stack stream for pre-coding, where n represents the number of focal slices corresponding to an RGB image, and the pre-coding block is composed of four 5 × 5 convolutional layers that encode the focus features x_i; the encoded focus feature x_i is fed into the pyramid ConvGRU model; the pyramid ConvGRU takes the i-th focus feature x_i and the output h_{i-1} from the previous pyramid ConvGRU as input, and selectively updates the multi-scale focus information from the input focus feature x_i and the previous output feature h_{i-1} by controlling the update gate z_i and the reset gate r_i; z_i and r_i encode the multi-scale focus information using a spatial pyramid pooling module, i.e., dilated convolutions are applied with multiple filters of different scales and receptive fields; the candidate state n_i is obtained by combining the previous focus feature h_{i-1}, modulated by the reset gate r_i, with the input feature x_i; the new output feature h_i is obtained by weighting with the update gate z_i.

4. The dynamic fusion network-based light field depth estimation method according to claim 3, wherein the multi-modal dynamic fusion module comprises:

calculating content-adaptive convolution kernel parameters from the focus features, where [p_i − p_j] is the index of the 2D spatial offset in the kernel array, i and j denote pixel coordinates, W is a standard spatial convolution, and h denotes the output focus feature from the last pyramid ConvGRU of the focus stack stream;

applying the generated content-adaptive convolution kernel K to the RGB features, so that the whole network dynamically fuses the multi-modal information and outputs an accurate predicted depth map d, where f_rgb_j denotes the output of the RGB stream at pixel j and b is a bias; the predicted depth map d is input into two 5 × 5 convolutional layers for refinement, and a prediction with a final resolution of 256 × 256 × 1 is output.

5. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein the loss function is composed of a depth error log loss, a depth gradient loss and a surface normal loss;

calculating a loss function comprising:

step 4-1: calculating the depth error logarithmic loss l_depth, minimizing the difference between the predicted depth map d_i and its ground-truth depth map g_i,

where α (α > 0) is a hyperparameter and n denotes the number of valid pixels whose value is not 0 in the ground-truth depth map; d_i is the predicted depth value at pixel i and g_i is the ground-truth depth value at pixel i;

step 4-2: computing the depth gradient loss function l_grad,

where ∇_x(·) is the spatial derivative of ||d_i − g_i|| computed at the i-th pixel along x, and ∇_y(·) is the spatial derivative of ||d_i − g_i|| computed at the i-th pixel along y;

step 4-3: calculating the surface normal loss l_normal between the predicted depth values and the ground-truth depth values,

where <·,·> denotes the inner product of vectors, n_i^d denotes the surface normal of the predicted depth, and n_i^g denotes the surface normal of the ground-truth depth;

step 4-4: the loss function of the dynamic fusion network light field depth estimation model is as follows:

L = l_depth + λ·l_grad + μ·l_normal

wherein λ, μ are weighting coefficients.

6. The dynamic fusion network-based light field depth estimation method according to claim 5, wherein training the constructed dynamic fusion network model based on the training set comprises:

sending the training set into the constructed dynamic fusion network model, training with the Adam optimization algorithm according to the loss function L, setting the maximum number of training epochs to 80, and obtaining the final model result.

7. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein N is 12, H is 256, and W is 256.

8. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein n is 12.

9. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein α is 0.5.

10. The dynamic fusion network-based light field depth estimation method according to claim 1, wherein λ is 1, and μ is 1.

Technical Field

The invention relates to the technical field of light field depth estimation, in particular to a light field depth estimation method based on a dynamic fusion network.

Background

Depth estimation is an important link in understanding scene geometry. Accurate depth information plays an important role in fields such as object tracking, scene understanding, target detection, automatic driving, and pose estimation. Traditional depth estimation is obtained from two-dimensional images captured by multiple cameras. With the development of sensor technology, the light field camera, as a novel passive multi-view imaging device, can simultaneously acquire the spatial and angular information of light radiation in a single shot, realizing the unique capability of shooting first and focusing later. By analyzing and processing this light information, the depth information of a scene can be acquired.

Light field depth estimation methods can be divided into three categories: methods based on epipolar plane images (EPI), depth estimation based on multi-view stereo matching, and depth estimation based on focal stacks. Compared with the former two, the focal stack fully exploits the advantage of light field digital refocusing, and its implicit focus cues can be used effectively in the depth estimation task. A focal stack records the degree of focus of a scene under different focal settings, and a series of focal slices focused at different depths contains rich spatial parallax information, which reveals the 3D structure of the scene well. In addition, the focus information matches the human visual system, which naturally concentrates on in-focus objects in the scene.

Early research on focal-stack-based depth estimation mainly determined the depth of a pixel by measuring its sharpness or degree of focus across the different images of the focal stack. In recent years, with the development of convolutional neural networks, researchers have used CNNs to extract effective focus information for depth estimation instead of traditional hand-crafted features. Filters that extend over the entire focal stack are generally learned with conventional convolution kernels, which effectively improves the accuracy of light field depth estimation.

However, several challenges remain: 1) methods using conventional convolution kernels ignore the spatial correlation between focal slices and thus fail to capture the focus information well, so the predicted depth maps contain a large amount of noise; methods using only the focal slices also lose part of the detail information. 2) Out-of-focus regions of unknown sharpness in the focal slices can lead to information loss, making the predicted depth map discontinuous. Considering that RGB images have higher definition and can compensate for the data lost in the out-of-focus regions of the focal slices, researchers have tried to introduce RGB images as guidance for multi-modal information fusion, but they fuse the two kinds of features statically through a simple multiplication operation and cannot fully exploit the multi-modal information between the RGB image and the focal slices.

Therefore, a method that effectively learns the spatial relationship within the focal stack and dynamically fuses the RGB information with the focus features is needed.

Disclosure of Invention

In view of the above, the present invention provides a light field depth estimation method based on a dynamic fusion network, which is a practical and powerful deep learning solution for predicting scene depth.

The technical means adopted by the invention are as follows:

the invention provides a light field depth estimation method based on a dynamic fusion network, which comprises the following steps:

step 1, acquiring a light field data set, and obtaining a training set and a test set based on the light field data set, wherein each light field sample in the light field data set comprises an RGB image, a focal stack containing a plurality of focal slices focused at different depths, and a depth image corresponding to the RGB image;

step 2, performing data expansion on the light field data set;

step 3, building a dynamic fusion network model; the dynamic fusion network model consists of a dual-stream network and a multi-modal dynamic fusion module; the dual-stream network consists of an RGB stream and a focus stack stream; the RGB stream takes an RGB image of resolution H × W × 3 as input and outputs a global RGB feature f_rgb; the focus stack stream takes a focus stack of resolution N × H × W × 3 as input and outputs a focus feature h; where H and W represent the height and width of the image and N represents the number of focal slices; the global RGB feature f_rgb and the focus feature h output by the dual-stream network are used as the input of the multi-modal dynamic fusion module, which outputs the final depth map;

step 4, training the constructed dynamic fusion network model on the training set, taking the ground-truth depth map as the supervision signal and calculating a loss function;

step 5, testing the trained dynamic fusion network model on the test set, and verifying it on the mobile phone data set.

Further, data augmenting the light field data set, comprising:

scaling the focal slice and the RGB image with a scaling factor in the range [1, 1.5]; and/or,

rotating the focal slice and the RGB image by a rotation angle in the range [-5, 5]; and/or,

achieving random variation of brightness, contrast and saturation by uniform sampling within the range [0.6, 1.4] to perform a color transformation on the focal slice and the RGB image; and/or,

flipping the focal slice and the RGB image horizontally with a random probability of 50%.

Further, the RGB stream learns RGB features from different layers of a convolutional neural network using an encoder-decoder structure; the encoder adopts a pre-trained SENet-154 network structure, and the decoder adopts four up-sampling layers to gradually enlarge the output features from the encoder; an optimization module concatenates the features from the decoder and the encoder along the channel dimension and feeds them into three 3 × 3 convolutional layers, and the RGB stream finally outputs a global RGB feature f_rgb of size 256 × 256 × 1.

The focus stack stream consists of a pre-coding block and a spatial pyramid ConvGRU model; first, the focal stack I_1, I_2, ..., I_n is input into the focus stack stream for pre-coding, where n represents the number of focal slices corresponding to an RGB image, and the pre-coding block is composed of four 5 × 5 convolutional layers that encode the focus features x_i; the encoded focus feature x_i is fed into the pyramid ConvGRU model; the pyramid ConvGRU takes the i-th focus feature x_i and the output h_{i-1} from the previous pyramid ConvGRU as input, and selectively updates the multi-scale focus information from the input focus feature x_i and the previous output feature h_{i-1} by controlling the update gate z_i and the reset gate r_i; z_i and r_i encode the multi-scale focus information using a spatial pyramid pooling module, i.e., dilated convolutions are applied with multiple filters of different scales and receptive fields; the candidate state n_i is obtained by combining the previous focus feature h_{i-1}, modulated by the reset gate r_i, with the input feature x_i; the new output feature h_i is obtained by weighting with the update gate z_i.

Further, the multi-modal dynamic fusion module includes:

calculating content-adaptive convolution kernel parameters from the focus features, where [p_i − p_j] is the index of the 2D spatial offset in the kernel array, i and j denote pixel coordinates, W is a standard spatial convolution, and h denotes the output focus feature from the last pyramid ConvGRU of the focus stack stream;

applying the generated content-adaptive convolution kernel K to the RGB features, so that the whole network dynamically fuses the multi-modal information and outputs an accurate predicted depth map d, where f_rgb_j denotes the output of the RGB stream at pixel j and b is a bias; the predicted depth map d is input into two 5 × 5 convolutional layers for refinement, and a prediction with a final resolution of 256 × 256 × 1 is output.

Further, the loss function is composed of three parts of depth error logarithmic loss, depth gradient loss and surface normal loss;

calculating a loss function comprising:

Step 4-1: calculating the depth error logarithmic loss l_depth, minimizing the difference between the predicted depth map d_i and its ground-truth depth map g_i,

where α (α > 0) is a hyperparameter and n denotes the number of valid pixels whose value is not 0 in the ground-truth depth map; d_i is the predicted depth value at pixel i and g_i is the ground-truth depth value at pixel i;

Step 4-2: computing the depth gradient loss function l_grad,

where ∇_x(·) is the spatial derivative of ||d_i − g_i|| computed at the i-th pixel along x, and ∇_y(·) is the spatial derivative of ||d_i − g_i|| computed at the i-th pixel along y;

Step 4-3: calculating the surface normal loss l_normal between the predicted depth values and the ground-truth depth values,

where <·,·> denotes the inner product of vectors, n_i^d denotes the surface normal of the predicted depth, and n_i^g denotes the surface normal of the ground-truth depth;

step 4-4: the loss function of the dynamic fusion network light field depth estimation model is as follows:

L = l_depth + λ·l_grad + μ·l_normal

wherein λ, μ are weighting coefficients.

Further, training the constructed dynamic fusion network model based on the training set comprises:

sending the training set into the constructed dynamic fusion network model, training with the Adam optimization algorithm according to the loss function L, setting the maximum number of training epochs to 80, and obtaining the final model result.

Further, N is 12, H is 256, and W is 256.

Further, n takes the value of 12.

Further, α is 0.5.

Further, λ is 1, and μ is 1.

Compared with the prior art, the invention has the following beneficial effects:

the invention provides a light field depth estimation model dynamically fusing multi-modal clues between an RGB image and a focal stack. Considering that different focal slices have different scales of focal regions and focus at different depths, in order to correlate focal cues and depths, the present invention proposes a pyramid ConvGRU module to mine spatial correlation between different focal slices, thereby conveying multi-scale focal information in the depth direction. Subsequently, the invention provides a multi-modal dynamic fusion module, which fuses multi-modal features in a self-adaptive mode, and in the fusion strategy, convolution kernel parameters of RGB features dynamically change along with different input focus features, so that information loss in a depth map is avoided to a certain extent. The light field depth estimation method not only can retain the edge and detail information of an object and reduce the depth estimation noise, but also can be successfully applied to common consumption level camera data.

For the above reasons, the method of the present invention can be widely applied to fields such as scene depth estimation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of a light field depth estimation method based on a dynamic fusion network according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of the dynamic fusion network model in an embodiment of the present invention;

FIG. 3(a) is a detailed structural diagram of a conventional ConvGRU;

FIG. 3(b) is a schematic structural diagram of a pyramid ConvGRU according to an embodiment of the present invention;

FIG. 4(a) shows three static fusion methods according to an embodiment of the present invention;

FIG. 4(b) is a schematic structural diagram of the multi-modal dynamic fusion module in the embodiment of the present invention;

FIG. 5 shows comparative experiment 1 of various light field depth estimation methods on real light field data;

wherein the first column is RGB (converted to grayscale), the second column is the ground-truth depth, the third column is the depth map obtained by the VDFF method, the fourth column is the depth map obtained by the PADMM method, the fifth column is the depth map obtained by the DDFF method, the sixth column is the depth map obtained by the LF-OCC method, the seventh column is the depth map obtained by the LF method, the eighth column is the depth map obtained by the EPINET method, and the ninth column is the depth map obtained by the method of the present invention;

FIG. 6 shows comparative experiment 2 of various light field depth estimation methods on real light field data;

wherein FIG. 6 shows the depth maps obtained by the method of the present invention, the VDFF method, the PADMM method and the DDFF method;

FIG. 7 shows a comparative experiment of various light field depth estimation methods on the mobile phone data set;

wherein the first row is RGB (converted to grayscale), the second row is the depth map obtained by the method of the present invention, and the third row is the depth map obtained by the DDFF method.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As shown in fig. 1, a schematic flow chart of a light field depth estimation method based on a dynamic fusion network in an embodiment of the present invention is shown, where the method includes the following steps:

step 1, acquiring a light field data set, and obtaining a training set and a test set based on the light field data set; the concrete implementation is as follows:

(1) the DUT-LFDD light field data set containing 967 real-world light field samples from various indoor and outdoor scenes was used as the experimental data set.

Each light field sample comprises an RGB image, a focal stack I_1, I_2, ..., I_12 containing 12 focal slices focused at different depths, and a depth image corresponding to the RGB image; 630 light field samples were randomly selected for the training set and the remaining 337 light field samples were used for the test set. The spatial resolution of each focal slice and its corresponding RGB image is H × W.

(2) The LFSD data set is used as a test set. The data set contains 100 light field samples captured by the Lytro camera, with 60 indoor and 40 outdoor scenes. Each light-field sample consists of an RGB image, a focal stack and a depth map.

(3) The mobile phone data set is used as a test set. Images of 640 × 360 pixels were captured continuously with a mobile phone during auto-focus, containing focal stacks and RGB images of different scenes (the numbers in parentheses are frame counts): plants (23), bottles (31), fruits (30), metal (33), windows (27), phones (33), etc. For each scene, the present invention selects 12 focal slices and the RGB image to evaluate the proposed model.
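For illustration, a minimal sketch of how such a light field sample (RGB image, 12-slice focal stack, depth map) could be loaded as tensors is given below. It is written in PyTorch; the directory layout, file names and the LightFieldSample class are hypothetical assumptions for illustration and are not part of the original disclosure.

```python
# Minimal sketch (hypothetical file layout) of a light field sample loader.
import os
from glob import glob

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class LightFieldSample(Dataset):
    """Loads (RGB image, focal stack of N slices, depth map) triplets."""

    def __init__(self, root, n_slices=12, size=(256, 256)):
        self.rgb_paths = sorted(glob(os.path.join(root, "rgb", "*.png")))
        self.root, self.n_slices, self.size = root, n_slices, size

    def _load(self, path):
        # Depth scaling is dataset-specific; dividing by 255 is only a placeholder.
        img = Image.open(path).resize(self.size)
        arr = np.asarray(img, dtype=np.float32) / 255.0
        return torch.from_numpy(arr)

    def __len__(self):
        return len(self.rgb_paths)

    def __getitem__(self, idx):
        name = os.path.splitext(os.path.basename(self.rgb_paths[idx]))[0]
        rgb = self._load(self.rgb_paths[idx]).permute(2, 0, 1)               # 3 x H x W
        slices = [self._load(os.path.join(self.root, "focal", name, f"{k}.png"))
                  for k in range(self.n_slices)]
        stack = torch.stack([s.permute(2, 0, 1) for s in slices])            # N x 3 x H x W
        depth = self._load(os.path.join(self.root, "depth", f"{name}.png"))  # H x W
        return rgb, stack, depth
```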

Step 2, carrying out data expansion on the light field data set;

to prevent the overfitting problem, data expansion is performed in the same manner for each focal slice in the light field data set and for the RGB image.

The specific implementation of augmenting the light field dataset is as follows:

(1) scaling the focal slice and the RGB image: the value range of the scale scaling factor is [1,1.5 ];

(2) performing rotation transformation on the focal slice and the RGB image: the value range of the rotation angle is [-5, 5];

(3) color transform the focal slice and RGB image: respectively carrying out linear transformation on the brightness, the saturation and the contrast of the image, wherein the value range of a linear transformation factor is [0.6,1.4 ];

(4) flipping the focal slice and the RGB image left and right: the random probability of the left-right flip is 0.5.

The above four modes can be applied simultaneously, or only one of them can be adopted.
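A minimal sketch of such a joint augmentation pipeline is shown below. It assumes torchvision-style functional transforms and applies identical random parameters to the RGB image, every focal slice and the depth map; the exact sampling and cropping strategy beyond the ranges stated above is an assumption.

```python
import random
import torchvision.transforms.functional as TF


def augment(rgb, slices, depth):
    """Apply the same random scale/rotation/color/flip to RGB, focal slices and depth.

    rgb: PIL image; slices: list of PIL images; depth: PIL image.
    """
    # (1) scale by a factor in [1, 1.5], then crop back to the original size
    s = random.uniform(1.0, 1.5)
    w, h = rgb.size
    resize = lambda im: TF.center_crop(TF.resize(im, [int(h * s), int(w * s)]), [h, w])
    # (2) rotate by an angle in [-5, 5] degrees
    angle = random.uniform(-5.0, 5.0)
    # (3) brightness / contrast / saturation factors sampled uniformly in [0.6, 1.4]
    b, c, sat = (random.uniform(0.6, 1.4) for _ in range(3))
    color = lambda im: TF.adjust_saturation(
        TF.adjust_contrast(TF.adjust_brightness(im, b), c), sat)
    # (4) horizontal flip with probability 0.5
    flip = random.random() < 0.5

    def apply(im, is_depth=False):
        im = TF.rotate(resize(im), angle)
        if not is_depth:                    # no photometric change on the depth map
            im = color(im)
        return TF.hflip(im) if flip else im

    return apply(rgb), [apply(sl) for sl in slices], apply(depth, is_depth=True)
```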

Step 3, building a dynamic fusion network model.

As shown in fig. 2, the dynamic fusion network model is composed of a dual-stream network and a multi-modal dynamic fusion module. The dual-stream network consists of an RGB stream and a focus stack stream; the RGB stream takes an RGB image of resolution H × W × 3 as input and outputs a global RGB feature f_rgb; the focus stack stream takes a focus stack of resolution N × H × W × 3 as input and outputs a focus feature h, where H and W represent the height and width of the image and both take the value 256, and N represents the number of focal slices and takes the value 12. The global RGB feature f_rgb and the focus feature h output by the dual-stream network are then used as the input of the multi-modal dynamic fusion module, which outputs the final depth map.

The RGB stream and the focus stack stream are specifically structured as follows:

(1) The RGB stream:

the RGB stream learns RGB features from different layers of CNN using a codec structure. Wherein the encoder employs a SENet-154 network architecture pre-trained on ImageNet and the decoder employs four upsampling layers to progressively expand the output characteristics from the encoder. An optimization module cascades the features from the decoder and encoder along the color channel and sends them to three 3 × 3 convolutional layers, and finally the RGB stream outputs a global RGB feature f with 256 × 256 × 1 sizergb

(2) The focus stack stream:

the focus stack stream consists of a pre-encoded block and a spatial pyramid ConvGRU model.

First, the focal stack I_1, I_2, ..., I_n is input into the focus stack stream for pre-coding, where n represents the number of focal slices corresponding to an RGB image; the pre-coding block is composed of four 5 × 5 convolutional layers that encode the focus features x_i. This process can be defined as:

f_i(I_i; θ_i) → x_i    (1)

where i denotes the i-th focal slice, θ_i denotes the parameters of the pre-coding layer, and f_i is a learned mapping function.
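A sketch of the pre-coding block of equation (1) is shown below; it assumes a simple stack of four 5 × 5 convolutions with ReLU activations, and the channel widths are assumptions.

```python
import torch.nn as nn


class PreEncoder(nn.Module):
    """Pre-coding block f_i of Eq. (1): four 5x5 convolutions mapping a focal slice
    I_i to its focus feature x_i. Channel widths are illustrative assumptions."""

    def __init__(self, c_out=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, c_out, 5, padding=2), nn.ReLU(inplace=True))

    def forward(self, slice_i):      # slice_i: B x 3 x H x W
        return self.net(slice_i)     # x_i:     B x c_out x H x W
```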

Then, the encoded focus features x_i are fed into the pyramid ConvGRU model to mine the spatial correlation within the focal slices, so that the multi-scale focus information in different focal slices is propagated along the depth direction to generate effective focus features h_i. The pyramid ConvGRU takes the i-th focus feature x_i and the output h_{i-1} from the previous pyramid ConvGRU as input, and selectively updates the multi-scale focus information from the input focus feature x_i and the previous output feature h_{i-1} by controlling the update gate z_i and the reset gate r_i. The two gates encode the multi-scale focus information using a spatial pyramid pooling module instead of a conventional convolution kernel, i.e., dilated convolutions are applied with multiple filters of different scales and receptive fields. The dilation rates are 1, 3 and 5, respectively.

The update gate z_i and the reset gate r_i are defined as:

z_i = σ(x_i*W_xz + h_{i-1}*W_hz + b_z)    (2)

r_i = σ(x_i*W_xr + h_{i-1}*W_hr + b_r)    (3)

where all W and b are learnable model parameters, σ is the sigmoid function, and · and * denote element-wise multiplication and convolution, respectively; in the pyramid ConvGRU the gate convolutions are realized by the spatial pyramid pooling module.

Finally, the candidate state n_i is obtained by combining the previous focus feature h_{i-1}, modulated by the reset gate r_i, with the input feature x_i, and the new output feature h_i is obtained by weighting with the update gate z_i; this process can be written as:

n_i = tanh(x_i*W_xn + r_i·h_{i-1}*W_hn + b_n)    (4)

h_i = (1 - z_i)·h_{i-1} + z_i·n_i    (5)
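The pyramid ConvGRU cell of equations (2)-(5) can be sketched as follows. Each gate uses a spatial pyramid of dilated 3 × 3 convolutions with rates 1, 3 and 5, as stated above; the 3 × 3 kernel size, the concatenation of x_i and h_{i-1}, and the summation of the pyramid branches are assumptions. The cell would be applied sequentially over the n encoded focal slices, with h_0 initialized to zeros.

```python
import torch
import torch.nn as nn


class SPPConv(nn.Module):
    """Spatial-pyramid convolution: parallel dilated 3x3 convolutions (rates 1, 3, 5)
    whose outputs are summed, used in place of a single conventional kernel."""

    def __init__(self, c_in, c_out, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r, bias=False) for r in rates)
        self.bias = nn.Parameter(torch.zeros(1, c_out, 1, 1))

    def forward(self, x):
        return sum(b(x) for b in self.branches) + self.bias


class PyramidConvGRUCell(nn.Module):
    """One step of the pyramid ConvGRU (Eqs. 2-5) over focus features."""

    def __init__(self, c_feat):
        super().__init__()
        self.gate_z = SPPConv(2 * c_feat, c_feat)   # update gate z_i
        self.gate_r = SPPConv(2 * c_feat, c_feat)   # reset gate r_i
        self.cand = SPPConv(2 * c_feat, c_feat)     # candidate state n_i

    def forward(self, x_i, h_prev):
        xh = torch.cat([x_i, h_prev], dim=1)
        z = torch.sigmoid(self.gate_z(xh))                          # Eq. (2)
        r = torch.sigmoid(self.gate_r(xh))                          # Eq. (3)
        n = torch.tanh(self.cand(torch.cat([x_i, r * h_prev], 1)))  # Eq. (4)
        return (1 - z) * h_prev + z * n                             # Eq. (5)
```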

(3) The multi-modal dynamic fusion module:

The RGB features output by the RGB stream and the focus features output by the last pyramid ConvGRU in the focus stack stream are taken as the input of the multi-modal dynamic fusion module. Specifically, the module mainly comprises the following steps:

Step 3-1, calculating content-adaptive convolution kernel parameters from the focus features, where [p_i − p_j] is the index of the 2D spatial offset in the kernel array, i and j denote pixel coordinates, W is a standard spatial convolution, and h denotes the output focus feature from the last pyramid ConvGRU of the focus stack stream. When the focus features change, the parameters of the adaptive convolution kernel change dynamically with them.

Step 3-2, applying the generated content self-adaptive convolution kernel K to RGB features, enabling the whole network to dynamically fuse multi-modal information and output an accurate prediction depth map d:

wherein f isrgb_jRepresenting the output of the RGB stream, b is the offset. Before performing the filtering operation, the filter parameters depend on the content of the focus feature. The final prediction graph relies on RGB features and reliable focus information.

Step 3-3, the output depth map d is fed into two 5 × 5 convolutional layers for refinement, and a prediction with a final resolution of 256 × 256 × 1 is output.
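Because the kernel-generation formula appears in the original publication only as an image, the following is a hedged sketch of one common way to realize such content-adaptive filtering: per-pixel k × k kernels are predicted from the focus feature h by a small convolution and applied to the unfolded RGB feature, followed by the two 5 × 5 refinement layers. The kernel size, the softmax normalization and the refinement channel widths are assumptions, and the exact formulation in the patent may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicFusion(nn.Module):
    """Sketch of multi-modal dynamic fusion: kernels predicted from the focus
    feature h are applied to the RGB feature f_rgb (a dynamic-filtering scheme)."""

    def __init__(self, c_focus, c_rgb=1, k=3):
        super().__init__()
        self.k = k
        self.kernel_net = nn.Conv2d(c_focus, k * k, 3, padding=1)   # per-pixel kernels
        self.bias = nn.Parameter(torch.zeros(1))
        self.refine = nn.Sequential(                                 # two 5x5 conv layers
            nn.Conv2d(c_rgb, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 5, padding=2))

    def forward(self, f_rgb, h):
        b, c, H, W = f_rgb.shape
        kernels = torch.softmax(self.kernel_net(h), dim=1)           # B x k*k x H x W
        patches = F.unfold(f_rgb, self.k, padding=self.k // 2)       # B x c*k*k x H*W
        patches = patches.view(b, c, self.k * self.k, H * W)
        kernels = kernels.view(b, 1, self.k * self.k, H * W)
        d = (patches * kernels).sum(2).view(b, c, H, W) + self.bias  # filtered output d
        return self.refine(d)                                        # final 256x256x1 map
```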

Step 4, training the constructed dynamic fusion network model on the training set, taking the ground-truth depth map as the supervision signal and calculating the loss function.

The loss function of the dynamic fusion network light field depth estimation model consists of three parts, namely depth error logarithmic loss, depth gradient loss and surface normal loss; the calculation of the loss function is implemented as follows:

Step 4-1: calculating the depth error logarithmic loss l_depth, minimizing the difference between the predicted depth map d_i and its ground-truth depth map g_i, as given by the following formula:

l_depth = (1/n)·Σ_{i=1}^{n} ln(||d_i − g_i|| + α)    (8)

where α (α > 0) is a hyperparameter taking the value 0.5, n denotes the number of valid pixels whose value is not 0 in the ground-truth depth map, d_i is the predicted depth value at pixel i, and g_i is the ground-truth depth value at pixel i.

Step 4-2: computing a depth gradient loss function lgrad

Wherein +x(. is | | d calculated at the ith pixel of xi-giA spatial derivative of | |, (v |)y(. is | | d calculated at the ith pixel of yi-giThe spatial derivative of | l. Depth gradient loss function lgradThe edge distortion problem caused by convolutional neural network training can be handled.

Step 4-3: calculating the accuracy l of the surface normal between the predicted depth value and the true depth valuenormal

Wherein<Denotes the inner product of the vectors,a surface normal representing a predicted depth value,surface normal representing a true depth value.

Step 4-4: the loss function of the dynamic fusion network light field depth estimation model is as follows:

L = l_depth + λ·l_grad + μ·l_normal    (11)

wherein, λ and μ are weighting coefficients and take the value of 1.
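The three loss terms of equations (8)-(11) can be sketched as follows; the spatial gradients are approximated with simple forward differences and the validity mask is taken as g > 0, both of which are assumptions about the discretization.

```python
import torch


def spatial_grad(t):
    """Forward-difference spatial derivatives along x and y (zero at the last row/column)."""
    gx = torch.zeros_like(t)
    gy = torch.zeros_like(t)
    gx[..., :, :-1] = t[..., :, 1:] - t[..., :, :-1]
    gy[..., :-1, :] = t[..., 1:, :] - t[..., :-1, :]
    return gx, gy


def depth_loss(d, g, alpha=0.5, lam=1.0, mu=1.0, eps=1e-8):
    """L = l_depth + lam * l_grad + mu * l_normal, evaluated on valid pixels (g > 0)."""
    valid = (g > 0).float()
    n = valid.sum().clamp(min=1.0)
    err = (d - g).abs()

    l_depth = (torch.log(err + alpha) * valid).sum() / n                  # Eq. (8)

    ex, ey = spatial_grad(err)
    l_grad = ((torch.log(ex.abs() + alpha) +
               torch.log(ey.abs() + alpha)) * valid).sum() / n            # Eq. (9)

    dx, dy = spatial_grad(d)
    gx, gy = spatial_grad(g)
    nd = torch.stack([-dx, -dy, torch.ones_like(d)], dim=-1)              # predicted normals
    ng = torch.stack([-gx, -gy, torch.ones_like(g)], dim=-1)              # ground-truth normals
    cos = (nd * ng).sum(-1) / (nd.norm(dim=-1) * ng.norm(dim=-1) + eps)
    l_normal = ((1.0 - cos) * valid).sum() / n                            # Eq. (10)

    return l_depth + lam * l_grad + mu * l_normal                         # Eq. (11)
```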

Further, the dynamic fusion light field depth estimation network is trained using the training set. Specifically, the training set obtained in step 1 is fed into the dynamic fusion network model built in step 3, the network is trained with the Adam optimization algorithm according to the loss function L designed above, the maximum number of training epochs is set to 80, and the final model is obtained.
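A minimal training loop matching this description (Adam, 80 epochs, loss L) might look as follows; `train_set`, `DynamicFusionNet` (a hypothetical wrapper combining the stream and fusion sketches above) and `depth_loss` refer to the earlier sketches, and the learning rate and batch size are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader

# `train_set`, `DynamicFusionNet` and `depth_loss` are assumed to follow the
# sketches above; the learning rate and batch size are illustrative choices.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = DynamicFusionNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(train_set, batch_size=4, shuffle=True, num_workers=4)

for epoch in range(80):                       # maximum of 80 training epochs
    model.train()
    for rgb, focal_stack, depth_gt in loader:
        rgb, focal_stack, depth_gt = (t.to(device) for t in (rgb, focal_stack, depth_gt))
        depth_pred = model(rgb, focal_stack)  # dual-stream forward + dynamic fusion
        loss = depth_loss(depth_pred.squeeze(1), depth_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```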

Step 5, testing the trained dynamic fusion network model on the test set, and verifying it on the mobile phone data set.

Further, the trained network is tested on the two light field data sets and the mobile phone data set. To match the trained network parameters, the spatial resolution of the test data is rescaled to H × W.

As shown in fig. 3, in contrast to the conventional ConvGRU, the pyramid ConvGRU proposed by the present invention uses a spatial pyramid pooling module for each gate instead of conventional convolution, so that the focus features are encoded by filters with different receptive fields and passed along the depth direction.

As shown in fig. 4, the three static fusion methods in fig. 4(a) are applied to the entire image. When the network parameters are fixed, the convolution kernel does not vary with the input pixels, thus ignoring the relationships between the multiple modes. In fig. 4(b), the filter varies with the focal point characteristics and is then used to convolve with the RGB features, thereby avoiding loss of information.

To demonstrate the advantages of the method provided by the present invention, two sets of comparative experiments were performed on the two light field data sets, covering both deep-learning-based methods (DDFF, EPINet) and non-deep-learning methods marked with × (VDFF, LF, PADMM, LF_OCC). The results of comparative experiment 1 are shown in fig. 5, and the results of comparative experiment 2 are shown in fig. 6. On the data set acquired with the mobile phone, the method is compared with the focal-stack-based DDFF, and the results are shown in FIG. 7. It is evident from these figures that the depth estimation results of the method provided by the invention are the most accurate, the edge information of objects is better preserved, and the method is robust. In addition, the method handles various challenging scenes well, such as scenes with similar foreground and background, smooth object surfaces, large depth variation, and complex backgrounds.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
