Depth map generation method and device


Note: This technology, "Depth map generation method and device", was designed and created by 毛亚敏, 刘志花, 高天豪, 金允泰, 李洪硕, 全明帝 and 王强 on 2019-11-14. Its main content is as follows: A depth map generation method and apparatus are provided. The depth map generation method comprises: generating a first depth map based on a binocular image; acquiring a second depth map collected by a depth camera; and generating a final depth map by fusing the first depth map with the second depth map.

1. A depth map generation method, comprising:

generating a first depth map based on a binocular image;

acquiring a second depth map acquired by a depth camera;

generating a final depth map by fusing the first depth map with the second depth map.

2. The depth map generating method of claim 1, wherein the step of acquiring the second depth map comprises:

extracting depth features from the second depth map;

generating a depth-complemented third depth map by performing similarity matrix-guided filtering on the extracted depth features,

wherein the similarity matrix is generated based on the binocular images.

3. The depth map generation method of claim 2, wherein the step of generating a depth-complemented third depth map comprises: obtaining a similarity between a current pixel of the depth map and a neighboring pixel of the current pixel based on the similarity matrix, and filtering the depth feature of the current pixel along a predetermined direction based on the depth feature of the neighboring pixel and the similarity.

4. The depth map generating method according to claim 3, wherein the predetermined direction includes at least one of a top-to-bottom direction, a bottom-to-top direction, a left-to-right direction, and a right-to-left direction.

5. The depth map generating method of claim 1, wherein the step of generating the first depth map comprises:

obtaining a first feature map of each monocular image included in the binocular image;

obtaining a matching cost matrix based on a first feature map of the monocular image;

performing cost aggregation on the matching cost matrix;

and generating a depth map of each monocular image based on the cost-aggregated matching cost matrix.

6. The depth map generating method of claim 5, wherein the step of generating the first depth map further comprises:

performing consistency detection on the depth map of each monocular image to obtain a trusted depth map of each monocular image.

7. The depth map generation method of claim 5, wherein the step of obtaining the first feature map comprises feature extraction of the binocular image based on a neural network of an attention mechanism.

8. The depth map generating method of claim 5, wherein the step of obtaining the matching cost matrix comprises: obtaining an initial matching cost matrix based on the first feature map of the monocular image, and inputting the initial matching cost matrix into a predetermined neural network for convolution to obtain the matching cost matrix.

9. The depth map generation method of claim 5, wherein the step of cost aggregating comprises cost aggregating the matching cost matrices based on a Graph Neural Network (GNN).

10. The depth map generating method of claim 9, wherein the step of performing cost aggregation comprises: obtaining, according to a distance, neighborhood nodes of each central node serving as an element of the matching cost matrix, and updating the matching cost of each central node based on the weight of each neighborhood node and the matching cost of the neighborhood node to obtain a first cost-aggregated matching cost matrix.

11. The depth map generating method of claim 10, wherein the step of performing cost aggregation further comprises using a second feature map to guide the cost aggregation.

12. The depth map generating method of claim 11, wherein the second feature map is generated based on a semantic segmentation neural network, or the second feature map is the same as the first feature map.

13. The depth map generating method of claim 12, wherein the step of performing cost aggregation comprises: obtaining neighborhood pixels of each pixel of the second feature map according to a distance, finding neighborhood nodes of each central node serving as an element of the matching cost matrix according to the pixel correspondence between the second feature map and the matching cost matrix, updating the matching cost of each central node based on the weights and the matching costs of the neighborhood nodes to obtain a second cost-aggregated matching cost matrix, and aggregating the first cost-aggregated matching cost matrix and the second cost-aggregated matching cost matrix to obtain a third cost-aggregated matching cost matrix.

14. The depth map generating method of claim 10 or 13, wherein the distance is a spatial distance, a feature vector distance, or a weighted sum of a spatial distance and a feature vector distance.

15. The depth map generating method of claim 5, wherein the step of performing cost aggregation comprises: obtaining a temporary matching cost matrix from a plurality of matching cost matrices of different scales, obtaining a probability that each element of the temporary matching cost matrix corresponds to a predetermined disparity, and obtaining a fourth matching cost matrix according to the probability and the temporary matching cost matrix.

16. The depth map generating method of claim 8, wherein the step of performing cost aggregation comprises obtaining a fifth matching cost matrix based on the similarity matrix and the matching cost matrix,

wherein the similarity matrix is generated based on the binocular images.

17. The depth map generating method of claim 1, wherein the step of generating the first depth map comprises:

inputting the binocular images into a preset neural network for feature extraction to obtain a first feature map of each monocular image included in the binocular images;

obtaining a matching cost matrix based on a first feature map of the monocular image;

calculating a probability that each element of the matching cost matrix corresponds to a predetermined disparity to convert the matching cost matrix into a matching probability matrix;

generating a depth map of each monocular image based on the matching probability matrix.

18. A depth map generation apparatus comprising:

a first acquisition unit configured to generate a first depth map based on a binocular image;

a second acquisition unit configured to acquire a second depth map collected by a depth camera; and

a depth synthesis unit configured to generate a final depth map by fusing the first depth map with the second depth map.

19. An electronic device comprising a memory and a processor, the memory having stored thereon computer-executable instructions that, when executed by the processor, perform the method of any of the preceding claims 1-17.

20. A computer readable medium having stored thereon computer executable instructions which, when executed, perform the method of any of the preceding claims 1-17.

Technical Field

The present invention relates to the field of image processing, and in particular, to a depth map generation method and apparatus.

Background

Recently, in many technologies such as autonomous driving, robot navigation, and augmented reality, high-level tasks need to be solved based on accurate and dense depth estimation. Depth can be obtained by sensors such as LiDAR, structured light (SL), and time-of-flight (TOF) sensors, or can be generated from binocular images by depth estimation based on stereo matching algorithms.

In particular, LiDAR can accurately measure long distances, but it provides only sparse depth maps with a small vertical imaging field of view, and high-precision LiDAR equipment is expensive and bulky, which limits its application scenarios. SL and TOF sensors have a limited measurement range, their accuracy gradually decreases as the measurement distance increases, they perform poorly in outdoor environments, and the acquired depth maps suffer from high noise, low resolution, and similar problems. Depth estimation based on stereo matching algorithms can obtain dense depth maps, and current deep-learning-based methods are mostly accurate; however, the accuracy of depth estimation drops in ill-conditioned regions such as texture-less, reflective, and occluded regions, and there is a domain adaptation problem, i.e., a network model trained on one data set (such as a synthetic data set) loses accuracy when applied to another data set (such as a real data set), making it poorly suited to new environments.

Disclosure of Invention

According to an exemplary embodiment of the present invention, there is provided a depth map generating method including: generating a first depth map based on the binocular image; acquiring a second depth map acquired by a depth camera; a final depth map is generated by fusing the first depth map with the second depth map.

The step of obtaining the second depth map may comprise: extracting depth features from the second depth map; a depth-complemented third depth map is generated by applying similarity matrix-guided filtering to the extracted depth features, wherein the similarity matrix is generated based on the binocular image.

The step of generating a depth-complemented third depth map may comprise: obtaining a similarity between a current pixel of the depth map and a neighboring pixel of the current pixel based on the similarity matrix, and filtering the depth feature of the current pixel along a predetermined direction based on the depth feature of the neighboring pixel and the similarity.

The predetermined direction may include at least one of a top-to-bottom direction, a bottom-to-top direction, a left-to-right direction, and a right-to-left direction.

The step of generating the first depth map may comprise: obtaining a first feature map of each monocular image included in the binocular image; obtaining a matching cost matrix based on the first feature map of the monocular image; performing cost aggregation on the matching cost matrix; and generating a depth map of each monocular image based on the cost-aggregated matching cost matrix.

The step of generating the first depth map may further comprise: performing consistency detection on the depth map of each monocular image to obtain a trusted depth map of each monocular image.

The step of obtaining the first feature map may include feature extraction of the binocular image based on a neural network of an attention mechanism.

The step of obtaining the matching cost matrix may comprise: obtaining an initial matching cost matrix based on the first feature map of the monocular image, and inputting the initial matching cost matrix into a predetermined neural network for convolution to obtain the matching cost matrix.

The step of performing cost aggregation may comprise performing cost aggregation on the matching cost matrices based on a Graph Neural Network (GNN).

The step of performing cost aggregation may comprise: obtaining, according to a distance, neighborhood nodes of each central node serving as an element of the matching cost matrix, and updating the matching cost of each central node based on the weight of each neighborhood node and the matching cost of the neighborhood node to obtain a first cost-aggregated matching cost matrix.

The step of performing cost aggregation may further comprise using the second feature map to guide cost aggregation.

The second feature map may be generated based on a semantic segmentation neural network, or the second feature map may be the same as the first feature map.

The step of performing cost aggregation may comprise: obtaining neighborhood pixels of each pixel of the second feature map according to a distance, finding neighborhood nodes of each central node serving as an element of the matching cost matrix according to the pixel correspondence between the second feature map and the matching cost matrix, updating the matching cost of each central node based on the weights and the matching costs of the neighborhood nodes to obtain a second cost-aggregated matching cost matrix, and aggregating the first cost-aggregated matching cost matrix and the second cost-aggregated matching cost matrix to obtain a third cost-aggregated matching cost matrix.

The distance may be a spatial distance, a feature vector distance, or a weighted sum of the spatial distance and the feature vector distance.

The step of performing cost aggregation may comprise: obtaining a temporary matching cost matrix from a plurality of matching cost matrices of different scales, obtaining a probability that each element of the temporary matching cost matrix corresponds to a predetermined disparity, and obtaining a fourth matching cost matrix according to the probability and the temporary matching cost matrix.

The performing of the cost aggregation may include obtaining a fifth matching cost matrix based on the similarity matrix and the matching cost matrix, wherein the similarity matrix is generated based on the binocular image.

The step of generating the first depth map may comprise: inputting the binocular images into a preset neural network for feature extraction to obtain a first feature map of each monocular image included in the binocular images; obtaining a matching cost matrix based on a first feature map of the monocular image; calculating a probability that each element of the matching cost matrix corresponds to a predetermined disparity to convert the matching cost matrix into a matching probability matrix; a depth map for each monocular image is generated based on the match probability matrix.

The step of obtaining a trusted depth map may comprise: comparing pixels at corresponding positions between the depth maps of the respective monocular images, determining the pixels at a corresponding position as untrusted pixels when the depth difference of those pixels is greater than or equal to a predetermined threshold, and determining them as trusted pixels when the depth difference is less than the predetermined threshold, so as to obtain the trusted depth maps.
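By way of illustration only, a minimal sketch of such a per-pixel consistency check is given below; the threshold value, the use of same-position comparison without disparity warping, and the handling of untrusted pixels are assumptions for illustration, not requirements of this disclosure.

```python
import numpy as np

def consistency_check(depth_left, depth_right, threshold=0.1):
    """Hedged sketch: mark a pixel as trusted only if the depth difference at the
    corresponding position is below a predetermined threshold (value assumed here)."""
    trusted = np.abs(depth_left - depth_right) < threshold
    trusted_depth = np.where(trusted, depth_left, 0.0)  # untrusted pixels zeroed (assumption)
    return trusted_depth, trusted

d_left = np.random.rand(4, 6).astype(np.float32)
d_right = np.random.rand(4, 6).astype(np.float32)
trusted_depth, mask = consistency_check(d_left, d_right)
```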

The step of obtaining a trusted depth map may comprise: processing the first feature map, the initial matching cost matrix, and the matching cost matrix according to the trusted depth map.

According to an exemplary embodiment of the present invention, there is provided a depth map generating apparatus including: the first acquisition unit is used for generating a first depth map based on the binocular image; the second acquisition unit is used for acquiring a second depth map acquired by the depth camera; a depth synthesis unit to generate a final depth map by fusing the first depth map with the second depth map.

The second acquisition unit may include: a depth feature extraction unit that extracts a depth feature from the second depth map; and a filtering unit which generates a depth-complemented third depth map by performing filtering guided by a similarity matrix on the extracted depth features, wherein the similarity matrix is generated by the first acquisition unit based on the binocular image.

The filtering unit may obtain a similarity between a current pixel of the depth map and a neighboring pixel of the current pixel based on the similarity matrix, and filter the depth feature of the current pixel in a predetermined direction based on the depth feature of the neighboring pixel and the similarity.

The predetermined direction may include at least one of a top-to-bottom direction, a bottom-to-top direction, a left-to-right direction, and a right-to-left direction.

The first acquisition unit may include: the feature extraction unit is used for obtaining a first feature map of each monocular image included in the binocular image; the matching cost obtaining unit is used for obtaining a matching cost matrix based on the first feature map of the monocular image; the cost aggregation unit is used for carrying out cost aggregation on the matched cost matrix; and a depth map generation unit which generates a depth map for each monocular image based on the cost aggregated matching cost matrix.

The first acquisition unit may further include: a trusted depth obtaining unit, which is used for performing consistency detection on the depth map of each monocular image to obtain a trusted depth map of each monocular image.

The feature extraction unit may perform feature extraction on the binocular image based on a neural network of an attention mechanism.

The matching cost obtaining unit may obtain an initial matching cost matrix based on the first feature map of the monocular image, and input the initial matching cost matrix to a predetermined neural network for convolution to obtain the matching cost matrix.

The cost aggregation unit may perform cost aggregation on the matching cost matrix based on a Graph Neural Network (GNN).

The cost aggregation unit may obtain, according to a distance, neighborhood nodes of each central node serving as an element of the matching cost matrix, and update the matching cost of each central node based on the weight of each neighborhood node and the matching cost of the neighborhood node to obtain a first cost-aggregated matching cost matrix.

The cost aggregation unit may also direct cost aggregation using the second feature map.

The second feature map may be generated based on a semantic segmentation neural network, or the second feature map may be the same as the first feature map.

The cost aggregation unit may obtain neighborhood pixels of each pixel of the second feature map according to a distance, find neighborhood nodes of each central node serving as an element of the matching cost matrix according to the pixel correspondence between the second feature map and the matching cost matrix, update the matching cost of each central node based on the weights and the matching costs of the neighborhood nodes to obtain a second cost-aggregated matching cost matrix, and aggregate the first cost-aggregated matching cost matrix and the second cost-aggregated matching cost matrix to obtain a third cost-aggregated matching cost matrix.

The distance may be a spatial distance, a feature vector distance, or a weighted sum of the spatial distance and the feature vector distance.

The cost aggregation unit may obtain a temporary matching cost matrix according to a plurality of matching cost matrices of different scales, obtain a probability that each element of the temporary matching cost matrix corresponds to a predetermined disparity, and obtain a fourth matching cost matrix according to the probability and the temporary matching cost matrix.

The cost aggregation unit may obtain a fifth matching cost matrix based on the similarity matrix and the matching cost matrix, wherein the similarity matrix is generated by the first acquisition unit based on the binocular image.

The first acquisition unit may include: the feature extraction unit is used for inputting the binocular images into a preset neural network to carry out feature extraction so as to obtain a first feature map of each monocular image included in the binocular images; the matching cost obtaining unit is used for obtaining a matching cost matrix based on the first feature map of the monocular image; a matching probability obtaining unit calculating a probability that each element of the matching cost matrix corresponds to a predetermined disparity to convert the matching cost matrix into a matching probability matrix; and a depth map generation unit which generates a depth map for each monocular image based on the matching probability matrix.

The trusted depth obtaining unit may compare pixels of corresponding positions between the depth maps of the respective monocular images, determine the pixels of corresponding positions as untrusted pixels when a depth difference value of the pixels of corresponding positions is greater than or equal to a predetermined threshold value, and determine the pixels of corresponding positions as trusted pixels when the depth difference value of the pixels of corresponding positions is less than the predetermined threshold value, to obtain the trusted depth map.

The trusted depth obtaining unit may process the first feature map, the initial matching cost matrix, and the matching cost matrix according to the trusted depth map.

According to an exemplary embodiment of the invention, an electronic device is provided, comprising a memory and a processor, the memory having stored thereon computer-executable instructions that, when executed by the processor, perform the aforementioned method.

According to an exemplary embodiment of the present invention, a computer-readable medium is provided, having stored thereon computer-executable instructions, which, when executed, perform the aforementioned method.

Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.

Drawings

The above and other objects and features of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which:

fig. 1 is a block diagram illustrating a depth map generating apparatus according to an exemplary embodiment of the present invention;

fig. 2 is a detailed block diagram illustrating a depth map generating apparatus according to an exemplary embodiment of the present invention;

fig. 3 is a schematic diagram illustrating a depth map generating apparatus according to an exemplary embodiment of the present invention;

fig. 4 is a diagram illustrating depth filtering guided by an obtained similarity matrix according to an exemplary embodiment of the present invention.

FIG. 5 is a schematic diagram illustrating feature extraction based on an attention mechanism in accordance with an exemplary embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating a stacked hourglass structure of an attention-based propagation module according to an exemplary embodiment of the present invention;

FIG. 7 is a diagram illustrating an inter-pixel effect according to an exemplary embodiment of the present invention;

FIG. 8 is a diagram illustrating propagation of matching costs in the height dimension from top to bottom according to an exemplary embodiment of the invention;

FIG. 9 is a flowchart illustrating cost aggregation by a cost aggregation unit based on Graph Neural Networks (GNNs), according to an illustrative embodiment of the present invention;

FIG. 10 is a diagram illustrating cost aggregation by a cost aggregation unit based on a graph neural network according to an exemplary embodiment of the present invention;

FIG. 11 is a schematic diagram illustrating the operation of an aggregation function according to an embodiment of the present invention;

fig. 12 is a diagram illustrating a feature/matching cost improvement operation performed by a trusted depth acquisition unit according to an exemplary embodiment of the present invention;

FIG. 13 is a diagram illustrating a geometric consistency optimization operation in accordance with an exemplary embodiment of the present invention;

fig. 14 is a flowchart illustrating a depth map generating method according to an exemplary embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.

In general, the present invention obtains a high-quality depth map by jointly optimizing the depth map obtained from a sensor and the depth map obtained by a stereo matching algorithm. On the one hand, to address the noise, sparseness, and low resolution of the depth map obtained from the sensor, a deep learning network guided by binocular images is provided to complete this depth map and obtain a completed depth map A. On the other hand, to address the poor accuracy of deep-learning-based depth estimation in ill-conditioned regions, a deep learning network based on an attention mechanism and a graph neural network is provided to extract robust features and propagate them to the ill-conditioned regions, obtaining a depth map B. Finally, the completed depth map A and the depth map B obtained by the stereo matching algorithm are jointly optimized: corresponding normal vector maps are obtained from depth maps A and B, respectively, and a high-quality depth map is generated through a consistency constraint on the normal vector maps.

Fig. 1 is a block diagram illustrating a depth map generating apparatus according to an exemplary embodiment of the present invention.

Referring to fig. 1, a depth map generating apparatus 100 according to an exemplary embodiment of the present invention may include a first acquisition unit 110, a second acquisition unit 120, and a depth synthesis unit 130.

The first acquisition unit 110 may generate a first depth map based on the binocular image. In addition, the first acquisition unit 110 may also generate a similarity matrix based on the binocular image. Here, the first acquisition unit 110 may arbitrarily select a monocular image (by way of example and not limitation, the left image or the right image) from the binocular image and calculate the degree of similarity between pixels of that image using a neural network (by way of example and not limitation, a convolutional neural network) to obtain the similarity matrix. Further, by way of example only and not limitation, the binocular image may be captured by a binocular camera. It should also be understood that the left/right images, left/right disparity maps, left/right depth maps, and the like described in the present invention are merely examples relating to binocular images and are not limiting; the present invention is equally applicable to other types of binocular images, such as top-bottom binocular images.

The second acquisition unit 120 may acquire a second depth map collected by the depth camera. More specifically, the second acquisition unit 120 may generate the second depth map by depth-complementing the depth map based on the similarity matrix: it may first obtain the similarity matrix generated by the first acquisition unit 110, and then input the depth map into the deep learning network constituted by the second acquisition unit 120 to obtain the second depth map. Alternatively, an existing deep learning network could be used that takes the depth map directly as input and outputs the second depth map; the present invention, however, obtains the degree of similarity between pixels of the depth map through the similarity matrix, thereby guiding the network to assign higher weights to neighboring pixels with high similarity during depth completion and obtaining a more accurate completed depth map. Here, by way of example only and not limitation, the depth map may be acquired by a depth camera such as an SL or TOF sensor.

The depth synthesis unit 130 may generate a final depth map by fusing the first depth map with the second depth map. Here, by way of example only and not limitation, the depth synthesis unit 130 may perform the fusion in a weighted manner to output a high-quality depth map.
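As an illustration only, a weighted fusion could look like the sketch below; the fusion weight is an assumption for illustration, not a value from this disclosure.

```python
import numpy as np

def fuse_depth_maps(depth_first, depth_second, w=0.6):
    """Toy sketch of weighted fusion of the first (stereo) depth map and the
    second (depth-camera / completed) depth map; the scalar weight w is assumed."""
    return w * depth_first + (1.0 - w) * depth_second

final_depth = fuse_depth_maps(np.random.rand(4, 6), np.random.rand(4, 6))
```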

The depth map generating apparatus 100 according to an exemplary embodiment of the present invention will be described in more detail with reference to fig. 2 and 3. Fig. 2 is a detailed block diagram illustrating a depth map generating apparatus according to an exemplary embodiment of the present invention. Fig. 3 is a schematic diagram illustrating a depth map generating apparatus according to an exemplary embodiment of the present invention.

Referring to fig. 2, the second acquisition unit 120 may include a depth feature extraction unit 211 and a filtering unit 212, and the first acquisition unit 110 may include a feature extraction unit 221, a matching cost acquisition unit 222, a cost aggregation unit 223, and a depth map generation unit 224.

The depth feature extraction unit 211 may extract a depth feature from the second depth map. More specifically, the depth feature extraction unit 211 may input the depth map to a neural network (which may be, by way of example and not limitation, a convolutional neural network) to extract the depth features.

Then, in order to fill holes in the depth map and improve its resolution, the depth map needs to be filtered. Fig. 4 is a diagram illustrating depth filtering guided by the obtained similarity matrix according to an exemplary embodiment of the present invention. As shown in fig. 4, a similarity matrix may be generated by the first acquisition unit 110 based on the binocular image and used for filtering by the second acquisition unit 120. Here, the filtering unit 212 may generate a depth-complemented third depth map by performing similarity matrix-guided filtering on the extracted depth features, in which case the depth synthesis unit 130 may generate the final depth map by fusing the first depth map with the third depth map. The role of the similarity matrix is to compute pixel similarity so as to guide adaptive aggregation of the subsequent features, which is particularly advantageous for propagation in weakly textured areas.

More specifically, the filtering unit 212 may obtain the similarity between a current pixel of the depth map and a neighboring pixel of the current pixel based on the similarity matrix, and filter the depth feature of the current pixel along a predetermined direction based on the depth feature of the neighboring pixel and the similarity. By way of example only and not limitation, assuming that the depth feature of the obtained depth map is F (dimension H × W × C, where H is the height, W the width, and C the number of channels), the filtering unit 212 may slice it along the channel dimension to obtain C depth feature slices (each of dimension H × W); propagation within each slice consists of updates along a predetermined direction. Here, by way of example only and not limitation, the predetermined direction may include at least one of a top-to-bottom direction, a bottom-to-top direction, a left-to-right direction, and a right-to-left direction. Further, by way of example only and not limitation, for the top-to-bottom and left-to-right directions, assuming that the depth feature value of the neighboring pixel q of the current pixel p is d′(q), the filtering unit 212 may obtain the similarity w between the current pixel p and the neighboring pixel q from the similarity matrix and filter the depth feature value d(p) of the current pixel p according to equation (1) below to obtain the updated depth feature value d′(p) of the current pixel p; for the bottom-to-top and right-to-left directions, assuming that the depth feature value of the neighboring pixel q of the current pixel p before the update is d(q) and after the update is d′(q), the filtering unit 212 may obtain the similarity w between the current pixel p and the neighboring pixel q from the similarity matrix and filter the depth feature value d(p) of the current pixel p according to equation (2) below to obtain the updated depth feature value d″(p) of the current pixel p. The filtering unit 212 may then convolve the filtered depth features with a neural network, such as a convolutional neural network, to produce the depth-complemented third depth map.

d′(p)=d(p)+w*d′(q) (1)

d″(p)=w*d(q)+(1-w2)*d′(q) (2)
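For concreteness, the following is a minimal NumPy sketch of the top-down pass of equation (1); the layout of the per-pixel similarities taken from the similarity matrix is an assumption for illustration.

```python
import numpy as np

def propagate_top_down(depth_feat, similarity):
    """Minimal sketch of the top-to-bottom update of equation (1):
    d'(p) = d(p) + w * d'(q), where q is the already-updated pixel above p.
    depth_feat: (H, W) depth feature slice; similarity: (H, W) similarity w between
    each pixel and its upper neighbor (assumed layout)."""
    d = depth_feat.astype(np.float32).copy()
    for y in range(1, d.shape[0]):
        d[y] = d[y] + similarity[y] * d[y - 1]
    return d

# Usage sketch: filter every channel slice of an H x W x C depth feature volume.
feat = np.random.rand(4, 5, 3).astype(np.float32)
sim = np.random.rand(4, 5).astype(np.float32)
filtered = np.stack(
    [propagate_top_down(feat[..., c], sim) for c in range(feat.shape[-1])], axis=-1)
```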

Using multi-view geometry, the depth of a pixel can be calculated from the disparity between the two monocular images of a binocular image (the left and right images, by way of example only and not limitation). Here, the disparity map may be generated by a stereo matching algorithm, that is, by finding correspondences between pixels of the left image and the right image and generating the disparity map by minimizing a matching cost. With the application of deep learning networks to stereo matching, the accuracy of disparity maps has improved greatly, but the error of the generated disparity map is still large in ill-conditioned regions such as texture-less, occluded, and reflective regions. To address this challenging problem, the present invention improves three steps of the stereo matching algorithm, namely matching cost calculation, cost aggregation, and disparity calculation. In the matching cost calculation, an attention mechanism may be adopted for feature extraction to obtain robust depth features (such as object boundaries) and reduce mismatches between pixels of the binocular images, so that the matching cost in ill-conditioned regions is more accurate; in addition, trusted depths may be determined from the learned left and right depth maps and fed back to the feature extraction network for feature/matching-cost improvement, which will be described later. In the cost aggregation, a graph neural network may be adopted to capture longer-range dependencies between pixels, so that the extracted robust features can be propagated to ill-conditioned regions. In the disparity calculation, in order to obtain a more accurate disparity map, a left-right consistency constraint may be applied to pixels without label information.

The feature extraction unit 221 may obtain a first feature map of each monocular image included in the binocular image. More specifically, the feature extraction unit 221 may perform feature extraction on the binocular image based on an attention-mechanism neural network. Further, the similarity matrix may be obtained by the feature extraction unit 221 based on the binocular image, or by another unit provided in the first acquisition unit 110.

Fig. 5 is a schematic diagram illustrating feature extraction based on an attention mechanism according to an exemplary embodiment of the present invention. Before the attention mechanism is added, features are aggregated by convolution, and the size of the receptive field is determined by the size of the convolution kernel; after the attention mechanism is added, different weights are assigned to different features, so that the feature aggregation range becomes dynamic. As shown in fig. 5, the left part represents the feature map and the right part represents the attention mechanism; the weight corresponding to each feature is learned through the attention mechanism. If the weight of a feature used in a convolution operation of the attention-based neural network is small, this is equivalent to blocking the propagation of that feature to the central feature of the convolution; the propagation indicated by the arrows in the right part of fig. 5 is the amount of information contributed by the surrounding features to the central feature in the convolution operation. By capturing object boundary information with the attention mechanism, features belonging to the same object can be aggregated, while features not belonging to the same object can be blocked.

More specifically, the feature extraction unit 221 may first perform feature extraction on the binocular image to obtain a feature map F (dimension H × W × C, where H is the height, W the width, and C the number of channels of the feature map) of each monocular image included in the binocular image. It may then take the feature map f_i of each channel (dimension H × W) and perform a max pooling operation Maxpool(f_i), i.e., compute the largest element m_i of f_i; C max pooling operations are performed for the C channels in total, yielding C maximum elements m_1, m_2, …, m_C. These are input into a neural network to obtain attention parameters a_1, a_2, …, a_C. Finally, the attention parameters a_1, a_2, …, a_C are multiplied element by element with the feature maps f_1, f_2, …, f_C of the respective channels to obtain a new feature map F′.

As another example, an attention mechanism based on the feature space may also be used. That is, the feature extraction unit 221 may take the feature vector f_j of each pixel of the feature map F (dimension C) and perform a max pooling operation Maxpool(f_j), i.e., compute the largest element m_j of f_j; H × W max pooling operations are performed for the H × W pixels in total, yielding H × W maximum elements m_1, m_2, …, m_(H×W). These are input into the neural network to obtain attention parameters b_1, b_2, …, b_(H×W). Finally, the attention parameters b_1, b_2, …, b_(H×W) are multiplied element by element with the feature vectors f_1, f_2, …, f_(H×W) of the respective pixels to obtain a new feature map F″. In this way, an attention mechanism can be applied to feature extraction of the left and right images, and global information is learned using attention to obtain robust depth features (object boundaries and the like), thereby reducing mismatches between binocular image pixels and making the matching cost in ill-conditioned regions more accurate.
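As an illustration only, the following PyTorch sketch implements the channel-attention variant described above; the MLP size and the use of a sigmoid are assumptions, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention sketch: per-channel max pooling followed by a small MLP,
    then element-wise rescaling of each channel, as described above.
    Layer sizes are illustrative assumptions."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat):                 # feat: (N, C, H, W)
        m = feat.amax(dim=(2, 3))            # per-channel maxima m_1 ... m_C, shape (N, C)
        a = self.mlp(m)                      # attention parameters a_1 ... a_C
        return feat * a[:, :, None, None]    # rescale each channel feature map f_i by a_i

feat = torch.randn(1, 32, 64, 128)
out = ChannelAttention(32)(feat)             # new feature map F'
```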

Robust and reliable stereo matching is of great significance for scene understanding in multimedia applications. Strong visual cues can be detected and matched well with current feature descriptors, but they are generally too sparse for propagating matching costs. Propagating costs from strong cells to weak cells, however, is key to dense disparity map generation. To this end, an attention-based propagation module may be designed to propagate reliable matching costs from one cell to its neighbors, especially in ill-conditioned regions. The attention-based propagation module may include two parts: an attention part and a propagation part.

In particular, robust matching beliefs can be enhanced and false-positive matching beliefs weakened by embedding a spatial attention layer in the stacked hourglass module of an existing convolutional neural network model. After the four-dimensional matching cost is compressed into three dimensions, the three-dimensional matching cost can be cut into slices, and disparity belief propagation is performed from one edge of each slice to the other. Unlike the plain stacked hourglass module structure, explicit disparity belief propagation can be employed within the matching cost to achieve more efficient belief propagation.

Fig. 6 is a schematic diagram illustrating a stacked hourglass structure of an attention-based propagation module according to an exemplary embodiment of the present invention. As shown in fig. 6, an existing convolutional neural network model can be used as the backbone, and its stacked hourglass module structure can be improved by attention-based propagation modules. Because the information each unit transmits to its neighborhood resembles a shared convolution kernel, a spatial attention mechanism can be introduced to expand the influence of strongly robust pixels, and a channel attention mechanism can be introduced to extract robust features more effectively. The propagation module in fig. 6 is an attention-based propagation module according to an exemplary embodiment of the present invention, where the stacked hourglass structure of the attention-based propagation module is synonymous with the attention-embedded stacked hourglass structure.

The stacked hourglass module structure of the convolutional neural network model can transmit information from one unit to its neighbors through designs at different scales, and can expand the range of each cell by up-sampling and down-sampling the matching cost. However, this exchange of information is sometimes inefficient, because the effect of one pixel on another is closely related to the distance between them. Fig. 7 is a schematic diagram illustrating the inter-pixel influence according to an exemplary embodiment of the present invention. As shown in fig. 7, the black cell refers to a cell with a strong disparity belief, i.e., a robust match, and the white cells refer to cells with weak disparity beliefs. In general, the disparity belief should be passed from the black cell to the other two white cells. However, ordinary convolution layers have kernels of limited size, so such belief transfer would require stacking many convolution layers or using larger convolution kernels. Therefore, a matching cost propagation module based on disparity belief propagation may be employed. The output of the stacked hourglass module structure of the convolutional neural network model contains four dimensions, namely channel, disparity, height, and width; the channel dimension can be compressed to obtain a three-dimensional matching cost with the dimensions disparity, height, and width. To pass disparity beliefs from one side to the other, the matching cost can be divided along the height or width dimension. Fig. 8 is a diagram illustrating propagation of matching costs in the height dimension from top to bottom according to an exemplary embodiment of the present invention. As shown in fig. 8, for a slice of shape "disparity × 1 × width", the slice can be copied i times, and a two-dimensional convolution layer with a kernel size of 1 × kw can be added, whose input and output channels are both D, i.e., the disparity dimension of the current matching cost. In addition, a batch normalization (BN) layer may be added before the nonlinear activation, followed by another convolution and BN layer (without activation). In this way, the disparity propagation module according to the exemplary embodiment of the present invention can avoid propagating cells with low disparity beliefs over the global scope.

The matching cost obtaining unit 222 may obtain a matching cost matrix based on the first feature map of the monocular image. That is, the matching cost obtaining unit 222 may obtain an initial matching cost matrix based on the first feature map of the monocular image, and input the initial matching cost matrix into a neural network (which may be, by way of example and not limitation, an hourglass convolutional network or a multi-layer 3D convolutional neural network) for convolution to obtain the matching cost matrix. Specifically, the matching cost obtaining unit 222 may obtain the initial matching cost matrix by connecting the left feature map F_l and the right feature map F_r, both of dimension H × W × C, along the disparity direction.

More specifically, when the disparity is 0, the matching cost obtaining unit 222 may directly connect the left and right feature maps to obtain a matrix of dimension H × W × 2C; when the disparity is d, the matching cost obtaining unit 222 may first translate the right feature map by d columns (where d is a positive integer) along a predetermined direction (by way of example and not limitation, the horizontal direction) and then connect the left and right feature maps to obtain a matrix of dimension H × W × 2C. Assuming that the disparity value d ranges from 0 to D (where D is a positive integer), the matching cost obtaining unit 222 may shift the right feature map in sequence according to the magnitude of the disparity and connect the left and right feature maps to obtain a matrix of dimension H × W × 2C × D, that is, the initial matching cost matrix. Subsequently, the matching cost obtaining unit 222 may input the initial matching cost matrix into the above-mentioned neural network, such as the hourglass convolutional network or the multi-layer 3D convolutional neural network, for convolution to obtain the matching cost matrix (dimension H × W × D). In addition, the matching cost obtaining unit 222 may obtain a plurality of matching cost matrices of different scales by inputting the initial matching cost matrix into the above-mentioned neural network for convolution.
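As an illustration only, the following PyTorch sketch builds such a concatenation-based initial cost volume; the channel-first tensor layout is an assumption for illustration.

```python
import torch

def build_cost_volume(feat_left, feat_right, max_disp):
    """Sketch of the initial matching cost volume described above: for each disparity d,
    shift the right feature map by d columns and concatenate it with the left feature map.
    feat_left, feat_right: (C, H, W) feature maps; returns a (2C, D, H, W) volume."""
    C, H, W = feat_left.shape
    volume = feat_left.new_zeros(2 * C, max_disp, H, W)
    for d in range(max_disp):
        volume[:C, d, :, d:] = feat_left[:, :, d:]
        volume[C:, d, :, d:] = feat_right[:, :, :W - d]   # right map translated by d columns
    return volume

# Usage sketch: the 4D volume would then be fed to a 3D convolutional network
# (e.g., an hourglass network) to obtain the H x W x D matching cost matrix.
cost = build_cost_volume(torch.randn(32, 64, 128), torch.randn(32, 64, 128), max_disp=48)
```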

The cost aggregation unit 223 may perform cost aggregation on the matching cost matrix. Here, the cost aggregation unit 223 may perform cost aggregation on the matching cost matrix based on a graph neural network (GNN). More specifically, the cost aggregation unit 223 may obtain, according to a distance, neighborhood nodes of each central node serving as an element of the matching cost matrix, and update the matching cost of each central node based on the weight of each neighborhood node and the matching cost of the neighborhood node to obtain a first cost-aggregated matching cost matrix. This is described below with reference to fig. 9 and 10. Fig. 9 is a flowchart illustrating cost aggregation performed by the cost aggregation unit 223 based on a graph neural network (GNN) according to an exemplary embodiment of the present invention. Fig. 10 is a schematic diagram illustrating cost aggregation performed by the cost aggregation unit 223 based on the graph neural network according to an exemplary embodiment of the present invention. As shown in fig. 9, in step S910, the cost aggregation unit 223 may construct a graph from the matching cost matrix. More specifically, each element of the matching cost matrix is taken as a central node, for a total of H × W central nodes, each consisting of position information and feature information (f, of dimension D). The cost aggregation unit 223 may obtain, for each central node, a specific number k (k may be set by those skilled in the art according to actual conditions) of other central nodes (n_1, n_2, …, n_k) closest to it (i.e., its neighborhood nodes) and connect each central node with its k nearest other central nodes to form edges, where the distance may be a spatial distance, a feature vector distance, or a weighted sum of the spatial distance and the feature vector distance; more specifically, the distance may be calculated as a Euclidean distance. In step S920, the cost aggregation unit 223 may connect the matching costs of the k other central nodes (also referred to as the k neighborhood nodes) corresponding to each central node along the feature dimension (i.e., each neighborhood node has feature dimension D, giving a total dimension of k × D after connection) and input the result into a neural network (which may be, by way of example and not limitation, a multi-layer perceptron) to obtain the weights (w_1, w_2, …, w_k) of the edges between each central node and its k nearest other central nodes, and update the matching cost of each central node based on these weights and the matching costs of the corresponding k nearest other central nodes to obtain the first cost-aggregated matching cost matrix. More specifically, the cost aggregation unit 223 may update the matching cost matrix by equation (3) below.

where w_i is the weight of the edge to the i-th of the k nearest other central nodes, f_i is the matching cost of that node, f_before1 is the matching cost of the central node before the update, f_current1 is the updated matching cost of the central node, σ is an activation function, a_1 and b_1 are predetermined constants, and k is a positive integer. Through this operation, features can be propagated between pixels, and long-range dependencies between pixels can be captured after multiple iterations, so that the extracted robust features are propagated to ill-conditioned regions.
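For concreteness, the sketch below follows this construction with a k-nearest-neighbor graph over pixel positions and an MLP that produces edge weights; the exact form of the node update in equation (3), the MLP architecture, and the use of only the spatial distance are assumptions for illustration.

```python
import torch
import torch.nn as nn

def gnn_cost_aggregation(cost, k=8, sigma=torch.relu, a1=1.0, b1=1.0):
    """Hedged sketch of the GNN cost aggregation described above (one plausible reading
    of the update around equation (3)). cost: (H, W, D) matching cost matrix."""
    H, W, D = cost.shape
    nodes = cost.reshape(H * W, D)                       # one node per pixel, D-dim features
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()

    dist = torch.cdist(pos, pos)                         # spatial Euclidean distance
    knn = dist.topk(k + 1, largest=False).indices[:, 1:] # k nearest neighbors (exclude self)

    neigh = nodes[knn]                                   # (HW, k, D) neighborhood costs
    mlp = nn.Sequential(nn.Linear(k * D, k), nn.Softmax(dim=-1))
    w = mlp(neigh.reshape(H * W, k * D))                 # edge weights w_1 ... w_k per node

    aggregated = (w.unsqueeze(-1) * neigh).sum(dim=1)    # weighted sum of neighbor costs
    updated = sigma(a1 * aggregated + b1 * nodes)        # assumed form of the node update
    return updated.reshape(H, W, D)

out = gnn_cost_aggregation(torch.randn(16, 24, 12))
```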

In addition, the cost aggregation unit 223 may also use a second feature map F_seg to guide the cost aggregation. Here, the second feature map is generated based on a semantic segmentation neural network (such as DeepLab v3), or the second feature map is the same as the first feature map. More specifically, the cost aggregation unit 223 may obtain neighborhood pixels of each pixel of the second feature map according to a distance, find neighborhood nodes of each central node serving as an element of the matching cost matrix according to the pixel correspondence between the second feature map and the matching cost, update the matching cost of each central node based on the weights and the matching costs of the neighborhood nodes to obtain a second cost-aggregated matching cost matrix, and aggregate the first cost-aggregated matching cost matrix and the second cost-aggregated matching cost matrix to obtain a third cost-aggregated matching cost matrix. To do so, the cost aggregation unit 223 may first construct a graph based on F_seg, again taking each element of the matching cost matrix as a central node, for a total of H × W central nodes, each consisting of position information and feature information (f, of dimension D). Since the central nodes of the matching cost matrix are in one-to-one correspondence with the pixels of F_seg, the cost aggregation unit 223 can obtain, for each central node, a specific number k (k may be set by those skilled in the art according to actual conditions) of other pixels of F_seg closest to the pixel corresponding to that central node (i.e., its neighborhood pixels). Subsequently, the cost aggregation unit 223 may obtain the k other central nodes (n_1, n_2, …, n_k, where k is a positive integer) corresponding to those nearest pixels of F_seg and connect each central node with its k corresponding other central nodes to form edges, thereby constructing a graph G, where the distance may be a spatial distance, a feature vector distance, or a weighted sum of the spatial distance and the feature vector distance; more specifically, the distance may be calculated as a Euclidean distance or a Hamming distance. The cost aggregation unit 223 may then connect the matching costs of the k other central nodes corresponding to each central node along the feature dimension and input the result into the above-mentioned neural network, such as the multi-layer perceptron, to obtain the weights (w_1, w_2, …, w_k) of the edges between each central node and its k nearest other central nodes, update the matching cost of each central node based on these weights and the matching costs of the corresponding k nearest other central nodes to obtain the second cost-aggregated matching cost matrix, and aggregate the first cost-aggregated matching cost matrix and the second cost-aggregated matching cost matrix to obtain the third cost-aggregated matching cost matrix. More specifically, the cost aggregation unit 223 may update the matching cost matrix by equation (4) below:

where w_i is the weight of the edge to the i-th of the k nearest other central nodes, f_i is the matching cost of that node, f_before2 is the matching cost of the central node before the update, f_current2 is the updated matching cost of the central node, σ is an activation function, a_2 and b_2 are predetermined constants, and k is a positive integer.

The resulting third cost-aggregated matching cost matrix benefits not only from matching cost aggregation guided by depth estimation but also from matching cost aggregation guided by semantic segmentation, so the accuracy can be improved.

In addition, other embodiments may also be employed, specifically as follows:

stereo image matching can provide convincing dense depth maps without depth sensors, and is of great interest in modern applications. Effective matching cost aggregation is the key of the stereo matching task, and although some effective cost aggregation methods have been developed by the traditional method and the deep neural network, the difficulty of finding the correct matching relationship in the pathological region still exists, and the difficulty can be alleviated by utilizing long-term context information. Compared with the previous work of adopting multilayer three-dimensional convolution, a new Guide Graph Convolution Module (GGCM) can be adopted to solve the problem. The GGCM constructs a graph according to the pixel similarity of the guide image, learns the global background information through a multi-layer superposition GGCM module, the vertex of the graph is a pixel, and the graph is constructed into a K-neighbor graph according to the similarity distance. In addition, a simple and efficient distance calculation method can be defined and compared with the existing GCN method. The distance comprises two parts, one part is a spatial distance for spatial relationship modeling, the other part is defined in a feature space of a guide image, and non-local pixel similarity can be explored and boundary information can be reserved. This approach is non-local in that each node receives information from all other nodes on the graph after a number of convolution iterations.

Current stereo matching methods usually perform cost aggregation with down-sampling and up-sampling operations in three-dimensional convolution layers, and the loss of resolution blurs the edges of the disparity map. Conventional guided image filtering shows that image-guided filtering can preserve the edges of the depth image. Therefore, a graph can be constructed automatically from the guide image and used in a graph convolution operation to adaptively aggregate the cost volume features.

Specifically, given a feature map of the guide image (dimension H × W × C), a matching cost matrix C (dimension H × W × D) is input. Each element of the matching cost matrix can be considered as a central node, for a total of H × W central nodes, each consisting of feature information (dimension D).

First, a graph G is constructed from the similarity distances (N vertices, each with k neighbors and k edges). The vertices of the graph (dimension N × W) are all the pixel units of the tensor, and edges (dimension k × N) are generated between the k neighboring pixels. A simple and effective distance function d_ij can be defined, as shown in equation (i) below.

The distance can comprise two parts. One part is the spatial distance used for modeling spatial relationships: for pixel points with coordinates (x_i, y_i) and (x_j, y_j), the first term of the equation is the Euclidean distance between the two pixel coordinates. The other part is defined in the feature space of the guide image and is used to exploit non-local pixel similarity while preserving depth edge information: for the feature map I, each pixel has a feature vector f_i (of C dimensions, each dimension denoted f_ic) or f_j (of C dimensions, each dimension denoted f_jc), and the distance in feature space is the Euclidean distance between the two feature vectors, i.e., the second term of the equation; l denotes the l-th layer of the convolution.
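Equation (i) is not legible in this copy; based on the description of its two terms above, a plausible reconstruction (an assumption, not a verbatim reproduction of the published formula) is:

```latex
d_{ij}^{\,l} \;=\; \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
\;+\; \sqrt{\sum_{c=1}^{C}\left(f_{ic} - f_{jc}\right)^2}
\qquad \text{(i)}
```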

Secondly, the matching cost units are adaptively aggregated by performing weighted summation on the k neighborhood pixels. FIG. 11 is a schematic diagram illustrating the operation of an aggregation function according to an embodiment of the present invention. As shown in fig. 11, the aggregation weights are estimated by the edge filter network, using a data-driven approach that depends on cost units, rather than predefined parameterization operations. The aggregation function is shown in equation (ii) below.

where v_i^l represents the vertex information, the edge filter network F^l takes the edge information L^l(i, j) as input and outputs a weight matrix θ_ij, W^l represents a learnable parameter, u_j^l denotes a neighborhood node of v_i^l, E^l represents the edge attention function, N(i) represents the number of neighborhood nodes, l represents the l-th convolution layer, and γ_ij represents an edge-attention parameter.

Finally, the aggregated vertex information and the original vertex information can be used to update the matching cost matrix, as shown in equation (iii) below, where w is a fixed parameter.
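As a rough illustration of the aggregation and update steps (equations (ii) and (iii)), the following sketch replaces the learned edge filter network with a simple feature-similarity weighting and uses a fixed blending weight w; these substitutions are assumptions, not the patent's implementation.

import numpy as np

def ggcm_aggregate(cost, neighbors, guide_feat, w=0.5):
    # cost: H x W x D matching cost matrix; neighbors: (H*W, k) indices from the graph;
    # guide_feat: H x W x C guide features, used here to derive stand-in edge weights.
    H, W, D = cost.shape
    nodes = cost.reshape(-1, D).astype(np.float32)                    # N x D vertex information
    feats = guide_feat.reshape(-1, guide_feat.shape[-1]).astype(np.float32)
    aggregated = np.zeros_like(nodes)
    for i in range(nodes.shape[0]):
        nbr = neighbors[i]
        # Stand-in for the learned edge filter network: softmax of feature similarity.
        sim = -np.linalg.norm(feats[nbr] - feats[i], axis=1)
        theta = np.exp(sim - sim.max())
        theta /= theta.sum()                                          # normalized aggregation weights
        aggregated[i] = theta @ nodes[nbr]                            # weighted sum of neighbor costs
    # Update step: blend the original and aggregated vertex information with fixed w.
    return (w * nodes + (1.0 - w) * aggregated).reshape(H, W, D)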

In addition, the cost aggregation unit 223 may further obtain a temporary matching cost matrix from a plurality of matching cost matrices of different scales, obtain the probability that each element of the temporary matching cost matrix corresponds to a predetermined disparity, and obtain a fourth matching cost matrix from the probability and the temporary matching cost matrix. More specifically, the cost aggregation unit 223 may convert (e.g., upsample) a plurality of matching cost matrices of different scales (cost1, cost2, …, costk) to the same size (dimension H × W × D) and concatenate them along the disparity dimension, as shown in equation (5) below, to obtain a temporary matching cost matrix (dimension H × W × kD). Then, the cost aggregation unit 223 may obtain the probability that each element of the temporary matching cost matrix corresponds to a predetermined disparity based on the temporary matching cost matrix. Here, the cost aggregation unit 223 may obtain the probability P that each element of the temporary matching cost matrix corresponds to the predetermined disparity using the softmax function, as shown in equation (6) below. Finally, the cost aggregation unit 223 may multiply the probability P with the temporary matching cost matrix to obtain the fourth matching cost matrix, as shown in equation (7) below. Experiments show that this strategy can effectively improve the accuracy of the generated disparity map.

cost_temp = concat(cost1, cost2, ..., costk)  (5)

P = F_softmax(cost_temp)  (6)

cost_new = cost_temp * P  (7)

where cost1, cost2, …, costk represent the matching cost matrices of k different scales, cost_temp represents the temporary matching cost matrix, P represents the probability that each element of the temporary matching cost matrix cost_temp corresponds to a predetermined disparity, cost_new represents the newly obtained matching cost matrix (i.e., the fourth matching cost matrix), and k is a positive integer.
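A minimal sketch of equations (5)-(7), assuming all scales share the same disparity dimension D and using nearest-neighbor upsampling for simplicity:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resize_nearest(cost, H, W):
    # Nearest-neighbor spatial upsampling to H x W (disparity dimension unchanged).
    h, w, _ = cost.shape
    yi = np.arange(H) * h // H
    xi = np.arange(W) * w // W
    return cost[yi][:, xi]

def fuse_multiscale_costs(costs, H, W):
    # costs: list of k matching cost matrices at different spatial scales.
    cost_temp = np.concatenate([resize_nearest(c, H, W) for c in costs], axis=2)  # (5)
    P = softmax(cost_temp, axis=2)                                                # (6)
    return cost_temp * P                                                          # (7)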

Further, the cost aggregation unit 223 may also guide the matching cost aggregation based on the similarity matrix; that is, it may obtain a fifth matching cost matrix based on the similarity matrix and the matching cost matrix. More specifically, the cost aggregation unit 223 may perform a convolution operation, whose parameters are determined by the similarity matrix (dimension HW × HW), on the matching cost matrix obtained by the matching cost obtaining unit 222. This operation is performed by converting the matching cost matrix (H × W × D) into a two-dimensional matching cost matrix (HW × D), multiplying the similarity matrix (HW × HW) by the two-dimensional matching cost matrix to obtain a new HW × D matrix, and then converting the dimensions back to obtain the fifth matching cost matrix (H × W × D).
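The following sketch shows the reshape-multiply-reshape operation described above; row-normalizing the similarity matrix is an added assumption to keep the cost magnitudes stable.

import numpy as np

def similarity_guided_aggregation(cost, similarity):
    # cost: H x W x D; similarity: HW x HW similarity matrix derived from the binocular images.
    H, W, D = cost.shape
    S = similarity / np.maximum(similarity.sum(axis=1, keepdims=True), 1e-8)  # row-normalize (assumption)
    cost_2d = cost.reshape(H * W, D)     # H x W x D  ->  HW x D
    new_cost = S @ cost_2d               # (HW x HW) @ (HW x D) -> HW x D
    return new_cost.reshape(H, W, D)     # fifth matching cost matrix, H x W x D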

It should be appreciated that the cost aggregation unit 223 may be added as a separate part to an existing stereo matching network or a depth estimation network to improve the performance of the existing network.

Performing cost aggregation using context information in stereo matching is very effective for improving matching accuracy. However, deep learning networks tend to perform cost aggregation on a four-dimensional matching cost matrix whose dimensions are feature, disparity, image height, and image width, and generating a disparity map from a four-dimensional matching cost matrix requires compressing the feature dimension and the disparity dimension. In the prior art, simple fully-connected layers are often used for pixel-by-pixel disparity calculation; context information is not consulted in this process, which causes disparity errors. According to an embodiment of the present invention, after the feature dimension is compressed, context-based matching confidence propagation can be performed on the three-dimensional matching cost matrix to improve the disparity calculation accuracy of each pixel during dimension compression. More specifically, the first acquisition unit 110 may include a matching probability obtaining unit configured to calculate the probability Pd (by way of example and not limitation, using a softmax function) that each element (x, y) of the matching cost matrix obtained by the matching cost obtaining unit 222 corresponds to a predetermined disparity D, thereby converting the three-dimensional matching cost matrix into a three-dimensional matching probability matrix C (H × W × D). The matching probability matrix can then be optimized using a belief propagation algorithm, which can be reduced to a layer-by-layer convolution within the matching probability matrix, i.e., a convolution operation on the matching probability matrix (H × W × D). That is, the cost aggregation unit 223 may be replaced with a matching probability obtaining unit that performs the above-described matching probability calculation, and the depth map generation unit 224 may generate a depth map for each monocular image based on the matching probability matrix. According to an embodiment of the invention, when implementing a differentiable belief propagation algorithm within the matching probability matrix, the values in the matching probability matrix are transformed from probability space to log space so that the belief propagation algorithm can be fitted with a convolution process. This process requires a guide map describing the similarity between adjacent pixels to participate in the calculation; specifically, the convolution kernel parameters in the convolution process are determined by the pixel values at the corresponding positions of the guide map. The guide map can be derived from some of the feature maps obtained during the convolution process, or from the optical flow or semantic segmentation results of the images.
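As a simplified sketch of this idea (not the exact belief propagation algorithm), the following code converts the cost matrix into a matching probability matrix with softmax, moves to log space, and performs one propagation step in which each pixel mixes in its 4-neighbors' log-probabilities with kernel weights derived from a guide map; the Gaussian weighting and single-step propagation are assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_propagation_step(cost, guide, sigma=0.1):
    # cost: H x W x D matching cost matrix; guide: H x W guide map describing pixel similarity.
    P = softmax(cost, axis=2)                    # matching probability matrix
    logP = np.log(np.clip(P, 1e-8, 1.0))         # move from probability space to log space
    out = logP.copy()
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # 4-neighborhood
        nbr = np.roll(logP, shift=(dy, dx), axis=(0, 1))
        g_nbr = np.roll(guide, shift=(dy, dx), axis=(0, 1))
        # Kernel weight determined by the guide map (assumed Gaussian form);
        # np.roll wraps at the border, which real code would handle with padding.
        weight = np.exp(-((guide - g_nbr) ** 2) / (2.0 * sigma ** 2))[..., None]
        out = out + weight * nbr
    return softmax(out, axis=2)                  # back to probability space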

The depth map generation unit 224 may generate a depth map for each monocular image based on the cost-aggregated matching cost matrix. Here, the depth map generation unit 224 may generate the depth map of each monocular image using any one of the first cost-aggregated matching cost matrix, the third cost-aggregated matching cost matrix, the fourth matching cost matrix, and the fifth matching cost matrix obtained by the cost aggregation unit 223. According to an exemplary embodiment of the present invention, the depth map generation unit 224 may first generate the disparity map disp (H × W) of each monocular image based on the cost-aggregated matching cost matrix or the matching probability matrix, and then generate the depth map of each monocular image based on the disparity map disp of each monocular image. More specifically, the depth map generation unit 224 may use the softmax function to calculate, for each element c_d(x_i, y_i) of the aggregated matching cost matrix or matching probability matrix, the corresponding probability P_di in the disparity dimension D, and then multiply all disparity values di in the range 0 to DD by their corresponding probabilities P_di and sum the products to obtain the disparity value of the current pixel of the monocular image, as shown in equation (8) below.

where di represents a disparity value, c_d represents an element of the aggregated matching cost matrix or matching probability matrix, the result of equation (8) is the disparity value of the current pixel, and DD represents the maximum disparity value.

Subsequently, the depth map generating unit 224 may calculate the depth map of each monocular image through depth = b·f/d, where b denotes the distance between the optical centers of the two cameras used to acquire the two monocular images, f denotes the focal length of the two cameras used to acquire the two monocular images, and d is the disparity value of the obtained disparity map.
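A minimal sketch of equation (8) (a probability-weighted sum over the disparity dimension) followed by the depth = b·f/d conversion; the epsilon guard against zero disparity is an added assumption.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def disparity_and_depth(cost, baseline_b, focal_f, eps=1e-6):
    # cost: H x W x D aggregated matching cost (or probability) matrix.
    D = cost.shape[2]
    P = softmax(cost, axis=2)                         # probability P_di over disparities 0..D-1
    disparity = (P * np.arange(D)).sum(axis=2)        # equation (8): probability-weighted sum
    depth = baseline_b * focal_f / (disparity + eps)  # depth = b*f/d
    return disparity, depth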

Further, according to an exemplary embodiment of the present invention, the depth map generating unit 224 may further include a trusted depth obtaining unit. The trusted depth obtaining unit may compare pixels at corresponding positions between the depth maps of the respective monocular images, determine a pixel at a corresponding position as an untrusted pixel when the depth difference of the pixels at the corresponding positions is greater than or equal to a predetermined threshold, and determine it as a trusted pixel when the depth difference is less than the predetermined threshold, thereby obtaining a trusted depth map. More specifically, the trusted depth obtaining unit may set the pixel value of an untrusted pixel to 0 and the pixel value of a trusted pixel to 1 to obtain the trusted depth map. Subsequently, the trusted depth obtaining unit may process the first feature map, the initial matching cost matrix, and the matching cost matrix according to the trusted depth map. More specifically, the trusted depth obtaining unit may perform feature/matching cost improvement by multiplying the trusted depth map by the first feature map obtained by the feature extracting unit 221, or by the initial matching cost matrix or the matching cost matrix obtained by the matching cost obtaining unit 222, and then transferring the result to the surrounding pixels, thereby improving the accuracy of the generated depth map while retaining the trusted features or matching costs. Here, after the trusted depth obtaining unit multiplies the trusted depth map by the first feature map, the initial matching cost matrix, or the matching cost matrix, the matching cost obtaining unit 222 may obtain the initial matching cost matrix based on the product of the trusted depth map and the first feature map and may obtain the matching cost matrix based on the product of the trusted depth map and the initial matching cost matrix, and the cost aggregation unit 223 may perform cost aggregation on the product of the trusted depth map and the matching cost matrix. FIG. 12 is a diagram illustrating the feature/matching cost improvement operation performed by the trusted depth obtaining unit according to an exemplary embodiment of the present invention. As shown in FIG. 12, the trusted depth obtaining unit may transfer the product of the trusted depth map and the feature map or matching cost matrix to the surrounding pixels in subsequent processing.
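The left-right depth check can be sketched as below, assuming the right depth map has already been warped into the left view (the warping itself is omitted); the threshold value and function names are illustrative.

import numpy as np

def trusted_depth_mask(depth_left, depth_right_in_left_view, threshold=1.0):
    # 1 where the two depth maps agree within the threshold (trusted pixel), 0 otherwise.
    diff = np.abs(depth_left - depth_right_in_left_view)
    return (diff < threshold).astype(np.float32)

def apply_trust(mask, feature_or_cost):
    # Broadcast the H x W mask over the trailing channel/disparity dimension
    # before passing the product on to the next unit.
    return feature_or_cost * mask[..., None]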

Further, according to an exemplary embodiment of the present invention, in the training of the neural network constituted by the first acquisition unit 110, the error between the calculated disparity values and the labeled disparity values may be back-propagated as a first loss function to train the neural network. Real-scene datasets mostly use sparse LiDAR data as labels, and a large amount of unlabeled data exists in each image. During training, the sparse labels can only constrain some of the pixels, and effective constraints are lacking for the unlabeled pixels. According to an exemplary embodiment of the present invention, a left-right consistency detection method may be adopted to constrain the unlabeled pixels; that is, for the unlabeled pixels, the computed disparity map of one monocular image is mapped to the disparity map of the corresponding other monocular image, and the difference between the computed disparity map of the corresponding other monocular image and the mapped disparity map is used as a loss function of the unlabeled pixels for back propagation. Furthermore, the loss function of the unlabeled pixels may be used together with the loss function of the labeled pixels as the first loss function for back propagation in the training of the neural network formed by the first acquisition unit 110. More specifically, the computed disparity map of one monocular image may be mapped to the disparity map of the corresponding other monocular image (by way of example only and not limitation, the right disparity map may be mapped to the left disparity map), as shown in equation (9) below, to generate a new disparity map of the other monocular image (by way of example only and not limitation, a new left disparity map). Then, the difference L_warp_consistency between the computed disparity map of that monocular image and the mapped disparity map of the corresponding other monocular image may be obtained as shown in equation (10) below, and, as shown in equation (11) below, it is used together with the loss function L_smooth of the labeled pixels for the training of the neural network formed by the first acquisition unit 110.

L = L_smooth + L_warp_consistency  (11)

where the left-hand side of equation (9) represents the new left disparity map generated from the right disparity map, D_l(x_l, y) denotes the value of pixel (x_l, y) in the left disparity map, (x_l - D_l(x_l, y), y) denotes the new pixel coordinates obtained by translating the pixel (x_l, y) to the left by D_l(x_l, y), I_R(x_l - D_l(x_l, y), y) represents the pixel value of the right image at the corresponding position, L_warp_consistency represents the consistency loss function of the unlabeled pixels, L_smooth represents the loss function of the labeled pixels, and L represents the first loss function. Further, it should be noted that equations (9) to (11) and the left/right designations described in this paragraph are merely examples; a person skilled in the art can perform the corresponding processing according to the actual situation of the binocular images.
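A rough sketch of the consistency term of equations (9)-(10), using nearest-pixel sampling and an L1 penalty as simplifying assumptions:

import numpy as np

def warp_consistency_loss(disp_left, disp_right, unlabeled_mask):
    # disp_left, disp_right: H x W disparity maps; unlabeled_mask: H x W in {0, 1}.
    H, W = disp_left.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Equation (9): sample the right disparity map at (x_l - D_l(x_l, y), y).
    x_src = np.clip(np.round(xs - disp_left), 0, W - 1).astype(np.int64)
    disp_left_from_right = disp_right[ys, x_src]
    # Equation (10): consistency difference, restricted to the unlabeled pixels.
    diff = np.abs(disp_left - disp_left_from_right) * unlabeled_mask
    return diff.sum() / max(unlabeled_mask.sum(), 1.0)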

FIG. 13 is a diagram illustrating a geometric consistency optimization operation according to an exemplary embodiment of the present invention. As shown in FIG. 13, in the training of the neural network composed of the first acquisition unit 110 and the second acquisition unit 120, the first depth map may be input to one neural network (by way of example only and not limitation, a convolutional neural network) and the second depth map may be input to another neural network (by way of example only and not limitation, a convolutional neural network) to obtain a first normal vector map and a second normal vector map, respectively, and a high-quality depth map is generated through a normal vector map consistency constraint. More specifically, the difference between the first normal vector map and the second normal vector map may be used as a second loss function for back propagation in the training of the neural network formed by the first acquisition unit 110 and the second acquisition unit 120. According to an exemplary embodiment of the present invention, the difference between the first normal vector map and the second normal vector map may be calculated by the following equation (12). The depth maps could be optimized separately using conventional methods or a deep learning network, but the present invention optimizes the depth maps obtained from multiple sensors jointly and uses the geometric constraints between them to obtain a depth map with higher consistency and quality.
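For illustration, the following sketch derives normal maps from the two depth maps with finite differences (standing in for the two networks of FIG. 13) and penalizes their disagreement; the cosine-style penalty is an assumed form of equation (12).

import numpy as np

def normals_from_depth(depth):
    # Finite-difference surface normals (a stand-in for the normal-estimation networks).
    dzdy, dzdx = np.gradient(depth)
    n = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_loss(depth_first, depth_second):
    n1 = normals_from_depth(depth_first)
    n2 = normals_from_depth(depth_second)
    cos_sim = (n1 * n2).sum(axis=-1)      # 1 where the two normal maps agree
    return (1.0 - cos_sim).mean()         # used as the second loss function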

In addition, when training a model, a large amount of labeled data is usually needed to supervise a specific task of network learning, such as the depth estimation step in the present invention. In practice, however, labeling data requires a large amount of manpower and material resources. To save resources, synthetic data and actual data can be used together as the training set, with testing performed on the actual data. In this case, the network structure constituted by the depth map generating device may be modified to change the regression network structure into a network structure with two tasks, namely regression and classification. The two tasks largely share parameters: the classification branch requires the network to correctly classify synthetic data and actual data and to minimize the binary classification error, while the regression branch requires the network to minimize the regression error so as to obtain a correct depth estimate. In order to implement domain adaptation, the network structure formed by the depth map generating device can be divided into three parts, namely a shared encoding module, a regression module, and a classification module. In order to train the network formed by the depth map generating device simultaneously and without ambiguity, a gradient reversal layer can be used: the network parameters before this layer are updated in the direction of encoding the data of the two domains into similar distributions so as to obtain the optimal depth estimation, while the network parameters after this layer are updated in the direction of reducing the domain classification loss.
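A minimal PyTorch-style sketch of the gradient reversal idea, under the assumption of a shared encoder feeding a regression head and a domain classifier; the module and loss choices are illustrative, not the patent's specification.

import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def domain_adaptation_losses(shared_feat, depth_pred, depth_gt, domain_classifier, domain_label):
    # Regression branch: minimize the depth regression error on labeled data.
    reg_loss = F.smooth_l1_loss(depth_pred, depth_gt)
    # Classification branch: the reversed gradient pushes the shared encoder to map
    # synthetic and real data to similar distributions, while the classifier itself
    # still minimizes the binary domain classification error.
    cls_loss = F.cross_entropy(domain_classifier(GradReverse.apply(shared_feat, 1.0)), domain_label)
    return reg_loss + cls_loss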

Fig. 14 is a flowchart illustrating a depth map generating method according to an exemplary embodiment of the present invention.

As shown in fig. 14, in step S1410, a first depth map may be generated based on the binocular image.

In step S1420, a second depth map acquired by the depth camera may be acquired.

In step S1430, a final depth map may be generated by fusing the first depth map with the second depth map.

The step of obtaining the second depth map may comprise: extracting depth features from the second depth map; a depth-complemented third depth map is generated by applying similarity matrix-guided filtering to the extracted depth features, wherein the similarity matrix is generated based on the binocular image.

The step of generating a depth-complemented third depth map may comprise: and obtaining the similarity between the current pixel of the depth map and the adjacent pixel of the current pixel based on the similarity matrix, and filtering the depth characteristic of the current pixel according to a preset direction based on the depth characteristic of the adjacent pixel and the similarity.

The predetermined direction may include at least one of a top-to-bottom direction, a bottom-to-top direction, a left-to-right direction, and a right-to-left direction.
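As an illustration of one such pass, the following sketch propagates depth features from left to right, blending each pixel with its already-filtered left neighbor using a similarity value taken from the similarity matrix; the blending rule and array layout are assumptions, and the other three directions work analogously.

import numpy as np

def filter_left_to_right(depth_feat, similarity_to_left):
    # depth_feat: H x W x C depth features extracted from the second depth map;
    # similarity_to_left: H x W similarity between each pixel and its left neighbor, in [0, 1].
    out = depth_feat.astype(np.float32).copy()
    H, W, _ = out.shape
    for x in range(1, W):
        s = similarity_to_left[:, x][:, None]                    # similarity to the left neighbor
        out[:, x] = (1.0 - s) * out[:, x] + s * out[:, x - 1]    # blend with the filtered neighbor
    return out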

The step of generating the first depth map may comprise: obtaining a first feature map of each monocular image included in the binocular image; obtaining a matching cost matrix based on a first feature map of the monocular image; performing cost aggregation on the matched cost matrix; and generating a depth map of each monocular image based on the matched cost matrix of the cost aggregation.

The step of generating the first depth map may further comprise: and carrying out consistency detection on the depth map of each monocular image to obtain a credible depth map of each monocular image.

The step of obtaining the first feature map may include feature extraction of the binocular image based on a neural network of an attention mechanism.

The step of obtaining the matching cost matrix may comprise: and obtaining an initial matching cost matrix based on the first feature map of the monocular image, and inputting the initial matching cost matrix into a preset neural network for convolution to obtain the matching cost matrix.

The step of performing cost aggregation may comprise performing cost aggregation on the matching cost matrices based on a Graph Neural Network (GNN).

The step of performing cost aggregation may comprise: obtaining neighborhood nodes of each central node serving as an element of the matching cost matrix according to the distance, and updating the matching cost of each central node based on the weight of each neighborhood node and the matching costs of the neighborhood nodes, to obtain a first cost-aggregated matching cost matrix.

The step of performing cost aggregation may further comprise using the second feature map to guide cost aggregation.

The second feature map may be generated based on a semantic segmentation neural network, or the second feature map may be the same as the first feature map.

The step of performing cost aggregation may comprise: obtaining neighborhood pixels of each pixel of the second feature map according to the distance, finding neighborhood nodes of each central node serving as an element of the matching cost matrix according to the pixel correspondence between the second feature map and the matching cost matrix, updating the matching cost of each central node based on the weights and the matching costs of the neighborhood nodes to obtain a second cost-aggregated matching cost matrix, and aggregating the first cost-aggregated matching cost matrix and the second cost-aggregated matching cost matrix to obtain a third cost-aggregated matching cost matrix.

The distance may be a spatial distance, a feature vector distance, or a weighted sum of the spatial distance and the feature vector distance.

The step of performing cost aggregation may comprise: obtaining a temporary matching cost matrix according to a plurality of matching cost matrices of different scales, obtaining the probability that each element of the temporary matching cost matrix corresponds to a predetermined disparity, and obtaining a fourth matching cost matrix according to the probability and the temporary matching cost matrix.

The performing of the cost aggregation may include obtaining a fifth matching cost matrix based on the similarity matrix and the matching cost matrix, wherein the similarity matrix is generated based on the binocular image.

The step of generating the first depth map may comprise: inputting the binocular images into a preset neural network for feature extraction to obtain a first feature map of each monocular image included in the binocular images; obtaining a matching cost matrix based on a first feature map of the monocular image; calculating a probability that each element of the matching cost matrix corresponds to a predetermined disparity to convert the matching cost matrix into a matching probability matrix; a depth map for each monocular image is generated based on the match probability matrix.

The step of obtaining a trusted depth map may comprise: comparing the pixels of the corresponding positions among the depth maps of the monocular images, determining the pixels of the corresponding positions as untrustworthy pixels when the depth difference values of the pixels of the corresponding positions are larger than or equal to a preset threshold value, and determining the pixels of the corresponding positions as credible pixels when the depth difference values of the pixels of the corresponding positions are smaller than the preset threshold value so as to obtain credible depth maps.

The step of obtaining a trusted depth map may comprise: and processing the first feature map, the initial matching cost matrix and the matching cost matrix according to the credible depth map.

Further, according to an exemplary embodiment of the present invention, there may also be provided an electronic device including a memory and a processor, the memory having stored thereon computer-executable instructions that, when executed by the processor, perform the foregoing method. According to an exemplary embodiment of the present invention, there may also be provided a computer-readable medium having stored thereon computer-executable instructions that, when executed, perform the aforementioned method.

According to the embodiments of the present invention, a high-precision depth map can be generated by using a deep-learning-based method to address problems such as noise, sparsity, and low resolution in the depth map obtained from the sensor, as well as the poor accuracy of depth estimation in ill-posed regions; in addition, by filtering the depth map obtained by the sensor under the guidance of the binocular images, a completed high-resolution depth map can be generated; in addition, a high-precision depth map can be generated by using a stereo matching deep learning network to extract robust features and propagate them to ill-posed regions; in addition, the first depth map and the depth-completed second depth map are jointly optimized through consistency constraints using a depth optimization network to generate a high-quality depth map; in addition, with the multi-sensor-based high-quality depth map generation method, a high-quality depth map can be obtained by exploiting the advantages of each depth sensor, and the method is also suitable for some difficult scenes.

While the invention has been shown and described with reference to certain exemplary embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.
