Depth map acquisition method, device and storage medium

Document No.: 1906214    Publication date: 2021-11-30

Note: this technique, "Depth map acquisition method, device and storage medium", was designed and created by 陈鹭远 and 王珂 on 2020-05-25. Abstract: The application provides a depth map acquisition method, device and storage medium. The method comprises: training a processing model using binocular image samples and depth maps of the binocular image samples as training data, and then processing binocular images with the trained processing model to generate depth maps of the binocular images. Compared with the prior-art approach of first obtaining a disparity map from the binocular image and then obtaining the depth map from the disparity map, the present application does not need a disparity map, so disparity map errors are prevented from being transferred to the depth map in proportion to the square of the depth value, error-transfer steps are reduced, and the accuracy of the obtained depth map is improved.

1. A depth map acquisition method is characterized by comprising the following steps:

acquiring binocular images of roads synchronously shot by a camera set;

processing the binocular image by using a trained processing model to generate a depth map of the binocular image, wherein the trained processing model is obtained by training using training data, and the training data comprises: binocular image samples serving as input data of the processing model and depth map samples serving as output data of the processing model.

2. The method of claim 1, further comprising, prior to processing the binocular images using the trained processing model to generate the depth map of the binocular images:

acquiring a point cloud and a shot image of a training scene, wherein the shot image comprises a left eye image sample and a right eye image sample in the binocular image sample;

and determining the depth map sample according to the point cloud and the shot image.

3. The method of claim 2, wherein determining the depth map sample from the point cloud and the captured image comprises:

performing semantic segmentation on the shot image to obtain a pixel region of at least one training object, wherein the training scene comprises at least one training object;

determining the depth value of the pixel area of each training object according to the type of each training object and the point cloud;

and determining the depth map sample according to the depth value of the pixel area of each training object.

4. The method according to claim 3, wherein determining the depth value of the pixel region of each training object according to the type of each training object and the point cloud comprises:

judging whether the training object is sky or not;

if yes, setting the depth value of the pixel area of the training object as the maximum depth value;

if not, determining the depth value of the pixel area of the training object according to the point cloud.

5. The method of claim 4, wherein determining depth values for pixel regions of the training object from the point cloud comprises:

determining an associated point cloud matched with the pixel region of the training object from the point cloud;

and determining the depth value of the pixel area of the training object according to the associated point cloud.

6. The method of claim 5, wherein determining, from the point clouds, an associated point cloud that matches a pixel region of the training object comprises:

projecting the point cloud to a plane of the shot image, and determining the position of a projection area of the point cloud;

and determining the associated point cloud according to the position of the projection area and the position of the pixel area of the training object.

7. The method of claim 5, wherein determining depth values for pixel regions of the training object from the associated point cloud comprises:

and taking the distance from the associated point cloud to the plane of the shot image as the depth value of the pixel area of the training object.

8. The method according to any of claims 1-7, wherein the processing parameters of the processing model comprise a depth value range determined from the distance between the real objects in the road and the set of cameras.

9. An acquisition device, comprising:

a memory for storing a program;

a processor for executing the program stored in the memory, the processor being configured to perform the depth map acquisition method of any one of claims 1 to 8 when the program is executed.

10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the depth map acquisition method of any one of claims 1 to 8.

Technical Field

The present application relates to the field of machine vision technologies, and in particular, to a depth map acquisition method, device, and storage medium.

Background

In the process of manufacturing a high-precision map, in order to generate vectorized data of a road, two cameras with the same parameters are configured on a collection vehicle and used for shooting binocular images of the road.

The binocular image of a road refers to a pair of images of the same road shot by two cameras with fixed relative positions that are triggered simultaneously, where the image shot by the left camera is called the left eye image and the image shot by the right camera is called the right eye image. After the binocular image is obtained, a depth map of the binocular image is extracted from it as required; the depth map is further used to generate the vectorized data of the road. The existing method for extracting a depth map of road elements generally comprises: performing resolution reduction on the images acquired by the collection vehicle to reduce the amount of computation, performing matching processing on the binocular images using a trained binocular matching model to obtain a disparity map, and then converting the disparity map into a depth map according to the conversion formula Z = f·B / D to obtain a real-object depth map, where Z is the depth, f is the focal length, B is the baseline length, and D is the disparity.

However, the above method of converting a disparity map into a depth map transfers the disparity map error to the depth map. The error transfer formula is ΔZ = (f·B / D²)·ΔD, which can be simplified to ΔZ ∝ Z²·ΔD: the depth map error is Z² times the disparity map error. In other words, the disparity map error is transferred to the depth map in proportion to the square of the depth value, which is particularly obvious for objects at large depths and results in inaccurate depth information.

Disclosure of Invention

The application provides a depth map acquisition method, a device and a storage medium. It solves the technical problem that the depth map obtained by the existing method, which converts a disparity map into a depth map, contains errors.

In a first aspect, the present application provides a depth map obtaining method, including:

acquiring binocular images of roads synchronously shot by a camera set;

processing the binocular image by using the trained processing model to generate a depth map of the binocular image, wherein the trained processing model is obtained by training with training data, and the training data comprises: binocular image samples serving as input data of the processing model and depth map samples serving as output data of the processing model.

Optionally, before the processing the binocular image using the trained processing model to generate the depth map of the binocular image, the method further includes:

acquiring a point cloud of a training scene and a shot image of the training scene, wherein the shot image comprises a left eye image sample and a right eye image sample in a binocular image sample;

and determining a depth map sample according to the point cloud and the shot image.

Optionally, determining a depth map sample according to the point cloud and the shot image, specifically including:

performing semantic segmentation on the shot image to obtain a pixel area of at least one training object, wherein the training scene comprises at least one training object;

determining the depth value of the pixel area of each training object according to the type and the point cloud of each training object;

depth map samples are determined from the depth values of the pixel regions of each training object.

Optionally, determining a depth value of the pixel region of each training object according to the type and the point cloud of each training object specifically includes:

judging whether the training object is sky or not;

if yes, setting the depth value of the pixel area of the training object as the maximum depth value;

if not, determining the depth value of the pixel area of the training object according to the point cloud.

Optionally, determining a depth value of a pixel region of the training object according to the point cloud, specifically including:

determining an associated point cloud matched with a pixel region of a training object from the point clouds;

and determining the depth value of the pixel area of the training object according to the associated point cloud.

Optionally, determining, from the point clouds, an associated point cloud matched with a pixel region of the training object, specifically including:

projecting the point cloud to a plane of a shot image, and determining the position of a projection area of the point cloud;

and determining the associated point cloud according to the position of the projection area and the position of the pixel area of the training object.

Optionally, determining a depth value of the pixel region of the training object according to the associated point cloud, specifically including:

and taking the distance from the associated point cloud to the plane of the shot image as the depth value of the pixel area of the training object.

Optionally, the processing parameter of the processing model includes a depth value range, and the depth value range is determined according to a distance between a real object in the road and the camera group.

In a second aspect, the present application provides a depth map acquisition apparatus, comprising:

the acquisition module is used for acquiring binocular images of the road synchronously shot by the camera set;

the processing module is used for processing the binocular images by using the trained processing model to generate the depth map of the binocular images, wherein the trained processing model is obtained by training with training data, and the training data comprises: binocular image samples serving as input data of the processing model and depth map samples serving as output data of the processing model.

In a third aspect, the present application provides an acquisition device, comprising:

a memory for storing a program;

a processor for executing the program stored in the memory, the processor being configured to perform the depth map acquisition method according to the first aspect and the alternative when the program is executed.

In a fourth aspect, the present application provides a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the depth map acquisition method according to the first aspect and the alternatives.

The application provides a depth map acquisition method, a device and a storage medium, wherein the depth map acquisition method comprises: training a processing model using binocular image samples and depth maps of the binocular image samples as training data, and processing binocular images with the trained processing model to generate depth maps of the binocular images. Compared with the prior-art method of first obtaining a disparity map from the binocular image and then obtaining the depth map from the disparity map, the present method does not need to obtain a disparity map, so disparity map errors are prevented from being transferred to the depth map in proportion to the square of the depth value, error-transfer steps are reduced, and the accuracy of the obtained depth map is improved.

In addition, since the disparity map does not need to be acquired, the preprocessing step of reducing the resolution of the binocular image before obtaining the disparity map is also unnecessary, and the high-resolution binocular image can be processed by the model directly to obtain a more accurate depth map. Furthermore, in the process of generating the depth map training samples, when a training object is the sky, the depth value of the pixel region corresponding to the sky is set to the maximum depth value; training the processing model with such depth map samples makes the boundaries of the depth maps generated by the processing model clearer, so that clearer distant-view edges can be obtained, which facilitates manufacturing a high-precision map. In addition, when obtaining the training data, the depth maps of the binocular image samples can be obtained from the point clouds collected by the existing collection vehicle, and no additional collection equipment is required.

Drawings

Fig. 1 is a schematic structural view of a collection vehicle provided in the present application;

FIG. 2 is a high-precision mapping process provided herein;

fig. 3 is a schematic view of a shooting principle of the binocular camera set provided by the present application;

fig. 4 is a schematic flowchart of a depth map obtaining method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a sample depth map obtained according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a training processing model according to an embodiment of the present disclosure;

fig. 7 is a left eye image sample of a road provided in the second embodiment of the present application;

fig. 8 is a depth map sample, before sky depth completion, of the left eye image sample shown in fig. 7 according to the second embodiment of the present application;

fig. 9 is a depth map sample, after sky depth completion, of the left eye image sample shown in fig. 7 according to the second embodiment of the present application;

fig. 10 is a left eye image of a road provided in the second embodiment of the present application;

FIG. 11 is a depth map generated by processing the binocular image shown in FIG. 10 using a processing model;

FIG. 12 is a depth map generated by processing the binocular image shown in FIG. 10 using a contrast model;

fig. 13 is a schematic structural diagram of a depth map obtaining apparatus according to a third embodiment of the present application;

fig. 14 is a schematic structural diagram of an acquisition device provided in the fourth embodiment of the present application.

Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The manufacturing process of a high-precision map comprises field work and office work. Field work is the process of collecting road data while the collection vehicle drives along the roads to be collected; office work is the process of processing the collected data to generate vectorized map data, after which the vectorized map data is subjected to format conversion, transcoding, compilation and so on to generate a map product.

Fig. 1 is a schematic structural diagram of a collection vehicle provided in the present application. As shown in fig. 1, the roof of the collection vehicle is generally configured with a plurality of laser radars and a plurality of cameras. The laser radars are used for collecting point clouds of the elements on a road, and the cameras are used for collecting images of those elements. The collection vehicle is also provided with sensors such as a Global Positioning System (GPS), an Inertial Measurement Unit (IMU) and an odometer, used for realizing combined navigation. The collection vehicle is further provided with a host system, which receives user input instructions and controls the working state of each sensor according to the instructions. A hard disk is arranged in the trunk of the collection vehicle and used for storing the data the vehicle collects. Fig. 2 shows the process of manufacturing a high-precision map. As shown in fig. 2, after the collection vehicle executes the field work, point clouds, images and track points have been collected. The images and point clouds are then separately preprocessed in combination with the track points, for example: semantic segmentation and depth map extraction for the images, and aggregation, classification and solving for the point clouds. The preprocessed images and point clouds are then fused to generate vectorized map data. The vectorized map data reflects the position information of each road element on the collected road and the attribute information of each road element, where the attribute information includes the road element type, road element color, and so on.

Two cameras with the same parameters and fixed relative positions are generally arranged on the collection vehicle; the two cameras are triggered simultaneously to shoot the same road and generate a binocular image of the road, where the image shot by the left camera is called the left eye image and the image shot by the right camera is called the right eye image. Fig. 3 is a schematic diagram of the shooting principle of the binocular camera set provided by the present application. As shown in fig. 3, P1 is the imaging point of a target point P on the left eye image captured by the left camera, and P2 is the imaging point of the target point P on the right eye image captured by the right camera. o1 is the intersection of the line connecting the imaging point P1 and the target point P with the left optical axis of the left camera; o2 is the intersection of the line connecting the imaging point P2 and the target point P with the right optical axis of the right camera. B is the distance between the left optical axis of the left camera and the right optical axis of the right camera. X_left denotes the coordinate of the imaging point P1 on the left eye image, and X_right denotes the coordinate of the imaging point P2 on the right eye image. The disparity value D is expressed as: D = X_left - X_right.

After the binocular image is obtained, a depth map of the binocular image is extracted from it. In the prior art, the method for extracting a depth map of road elements generally comprises: matching the binocular images using a trained binocular matching model to obtain a disparity map, and then converting the disparity map into a depth map according to the following formula to obtain a real-object depth map:

Z = f·B / D  (1)

where Z is the depth, f is the focal length, B is the baseline length, and D is the disparity.

According to the above formula, the error transfer formula for converting the disparity map into the depth map is as follows:

ΔZ = (f·B / D²)·ΔD  (2)

where ΔZ represents the depth error and ΔD represents the disparity error.

Substituting D = f·B / Z into the error transfer formula gives:

ΔZ = (Z² / (f·B))·ΔD  (3)

Since f·B is constant for a given camera set, the above error transfer formula can be further simplified as:

ΔZ ∝ Z²·ΔD  (4)

From equation (4), the depth map error is proportional to Z² times the disparity error; that is, the disparity error is transferred to the depth map in proportion to the square of the depth value. For road elements far away from the camera, the error transfer coefficient is larger, which makes the obtained depth map inaccurate.
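As a rough numeric illustration of equation (4), the following sketch computes the depth error caused by a fixed disparity error at several depths; the focal length, baseline and disparity error are illustrative assumptions, not values from the application.

f_px = 1000.0      # assumed focal length in pixels
baseline_m = 0.5   # assumed baseline length in metres
disp_err_px = 0.5  # assumed disparity error in pixels

for depth_m in (5.0, 20.0, 80.0):
    disparity = f_px * baseline_m / depth_m                 # D = f*B / Z
    noisy_depth = f_px * baseline_m / (disparity - disp_err_px)
    depth_err = noisy_depth - depth_m                       # roughly Z^2 / (f*B) * dD
    print(f"Z = {depth_m:5.1f} m  ->  depth error ~ {depth_err:.3f} m")

With these assumed values the error grows from about 0.03 m at 5 m depth to about 7 m at 80 m depth, i.e. roughly with the square of the depth.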

The embodiments of the present application provide a depth map acquisition method, apparatus, device and storage medium, aiming to solve the above problem. The inventive concept of the present application is as follows: since the disparity-to-depth conversion formula introduces error transfer, the binocular image is processed directly by a model to obtain the depth map, which removes the error-transfer step. For the model to be able to directly output the depth map, training data for training the model need to be acquired: the collection vehicle synchronously collects binocular images and point clouds of a training scene, the point clouds are matched with the binocular images to determine the depth map of the training scene, and the training data are thereby obtained. In addition, in the process of generating the depth map training samples, when a training object is the sky, the depth value of the pixel region corresponding to the sky is set to the maximum depth value; training the processing model with such depth map samples makes the boundaries of the depth maps generated by the processing model clearer, so that clearer distant-view edges can be obtained, which facilitates manufacturing a high-precision map.

Fig. 4 is a flowchart illustrating a depth map acquisition method according to the first embodiment of the present application. The depth map acquisition method provided by this embodiment is applied to an acquisition device, such as a server or a computer. As shown in fig. 4, the depth map acquisition method provided by this embodiment includes the following steps:

s101, point cloud of a training scene and a shot image of the training scene are obtained.

The training scene can be any road. The collection vehicle drives through the training scene while the two cameras with the same parameters and fixed relative positions synchronously and continuously shoot the training scene, obtaining multi-frame binocular image samples of the training scene. The image captured by the left camera is referred to as the left eye image sample, and the image captured by the right camera is referred to as the right eye image sample; the captured images of the training scene include the left eye image samples and right eye image samples of the binocular image samples. While the cameras shoot the training scene, the laser radar on the collection vehicle synchronously emits laser and scans the training scene to obtain multi-frame point clouds of the training scene.

After the collection vehicle acquires the original images and original point clouds of the training scene, the multiple frames of original images acquired in this collection are stored in one directory and the multiple frames of original point clouds are stored in another, and the image data and point cloud data are then exported to the acquisition device.

Each frame of original image may be preprocessed, for example by denoising. Each frame of original point cloud is subjected to aggregation and classification processing to obtain several groups of point clouds. After the original images and original point clouds have each been preprocessed, the point clouds are matched with the pixel regions in the images; the preprocessed images may also be matched directly with the original point clouds. Whether the point cloud and the image being matched have been preprocessed is not limited here.

The acquisition time of each frame of image is compared with the acquisition time of each frame of point cloud; if the difference between the two is within a preset threshold, the image frame and the point cloud frame are regarded as synchronously acquired data, as the sketch below illustrates.
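The following is a minimal sketch of this synchronisation rule under assumed timestamp formats; the 50 ms threshold and the nearest-neighbour pairing are illustrative assumptions, not values from the application.

SYNC_THRESHOLD_S = 0.05  # assumed threshold between image and point-cloud acquisition times

def pair_synchronised_frames(image_times: list[float],
                             cloud_times: list[float]) -> list[tuple[int, int]]:
    """Return (image_index, cloud_index) pairs whose acquisition times differ by no
    more than the threshold; each image is paired with its nearest point cloud."""
    pairs = []
    for i, t_img in enumerate(image_times):
        j = min(range(len(cloud_times)), key=lambda k: abs(cloud_times[k] - t_img))
        if abs(cloud_times[j] - t_img) <= SYNC_THRESHOLD_S:
            pairs.append((i, j))
    return pairs

# Example: the second image has no point cloud within the threshold and is dropped.
print(pair_synchronised_frames([0.00, 0.10, 0.20], [0.01, 0.19]))   # [(0, 0), (2, 1)]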

And S102, determining a depth map sample according to the point cloud and the shot image.

After the synchronously acquired image and point cloud are determined, a depth map sample is obtained by matching the point cloud with the image. The following applies to each pair of synchronously acquired point cloud and captured image. Semantic segmentation is performed on the captured image to obtain the pixel region of at least one training object. The depth value of the pixel region of each training object is determined according to the type of the training object and the point cloud. The depth map sample is then determined from the depth values of the pixel regions of the training objects.

When the collection vehicle shoots images of the training scene, the sky easily appears in the captured image, and the depth value of the sky cannot be obtained through the laser radar. Therefore, when determining the depth value of the pixel region of each training object, it is first judged whether the training object is the sky; if so, the depth value of the pixel region of the training object is set to the maximum depth value. If not, the depth value of the pixel region of the training object is determined according to the point cloud, which specifically includes: determining, from the point cloud and according to the matching relationship between the point cloud and the pixel regions in the captured image, an associated point cloud matched with the pixel region of the training object, and determining the depth value of the pixel region of the training object according to the associated point cloud.

Fig. 5 is a schematic diagram of obtaining a depth map sample according to an embodiment of the present application. The captured image is subjected to semantic segmentation to obtain the pixel regions of the training objects in the captured image. If a training object is not the sky, the point cloud is projected onto the plane where the captured image is located, and the pixel region matched with the point cloud is determined according to the projection position of the point cloud and the positions of the pixel regions of the training objects in the captured image.

When determining the pixel region matched with the point cloud, if the projection position of the point cloud and the position of the pixel region of a training object satisfy a preset condition, for example the distance between them is within a preset range, the point cloud is determined to match the pixel region of that training object. The projection of a point cloud onto the plane of the captured image is usually a region; the center of the projection region of the point cloud can be compared with the center of the pixel region, and if the distance between the two centers is within a preset threshold, the point cloud is considered to match the pixel region of the training object.

After the point cloud is matched with the pixel regions of the respective training objects, the distance from the point cloud to the plane of the photographed image may be used as the depth value of each pixel region. If the training object is sky, the depth value of the pixel region of the training object is set to be the maximum depth value, that is, the depth value of the pixel region corresponding to the sky is greater than the depth value of the pixel region corresponding to any other training object. And obtaining a depth map sample of the training scene according to the depth value of each pixel area, wherein the depth map sample is a depth map of the shot image.
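As a concrete illustration, the following is a minimal sketch of how such a depth map sample could be assembled. The pinhole projection with intrinsic matrix K, the sky label id, the 96 m maximum depth value, the point-in-region matching rule and the use of the mean point distance per region are all illustrative assumptions rather than the exact procedure of the application (which compares projection-region centers with pixel-region centers, as described above).

import numpy as np

MAX_DEPTH = 96.0   # assumed maximum depth value assigned to sky pixels
SKY_LABEL = 10     # assumed semantic id of the "sky" class

def make_depth_map_sample(points_cam, K, seg_labels):
    """points_cam: (N, 3) lidar points in the camera frame (z is the distance to the
    image plane); K: (3, 3) camera intrinsics; seg_labels: (H, W) semantic
    segmentation of the captured image. Returns an (H, W) depth map sample."""
    h, w = seg_labels.shape
    depth_map = np.zeros((h, w), dtype=np.float32)      # unmatched regions stay 0

    # Project the point cloud onto the plane of the captured image.
    pts = points_cam[points_cam[:, 2] > 0]
    uv = (K @ pts.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[valid], v[valid], pts[valid, 2]

    # For each training object, take the mean distance of its associated points to the
    # image plane as the depth value of its pixel region; the sky gets the maximum depth.
    for label in np.unique(seg_labels):
        region = seg_labels == label
        if label == SKY_LABEL:
            depth_map[region] = MAX_DEPTH
            continue
        hits = region[v, u]        # projected points that fall inside this region
        if hits.any():
            depth_map[region] = float(z[hits].mean())
    return depth_map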

And S103, training the processing model by using the training data.

The training data comprise binocular image samples of the training scene and depth map samples of the training scene. The binocular image samples of the training scene are used as the input data of the processing model, and the depth map samples of the training scene are used as the output (supervision) data of the processing model. The processing model may be a neural network model, a deep learning network model, or the like, which is not limited here.

The following describes the process of training the processing model in detail, taking a deep learning network model as an example. As shown in fig. 6, the processing model comprises five stages: feature map extraction, cost value calculation, cost aggregation, a fully convolutional network, and Softmax processing. During training, feature map extraction is performed on the binocular image sample to generate the corresponding feature maps. When cost value calculation is performed on the feature maps, a matrix C(w × h × z) is generated, where C represents the matching cost of the pixel located at (w, h) having depth value z. After the matching cost of each pixel is obtained, the cost value of each pixel is optimized using the cost values of its neighboring pixels, generating S(w × h × z). After cost aggregation, convolution processing and a Softmax function are applied for each pixel to generate the depth value of each pixel. The depth map sample and the per-pixel depth values output by the Softmax function are substituted into a Smooth L1 loss function to determine the loss value, the loss value is propagated back to the cost value calculation, and the minimum of the loss function is sought through continuous iteration, thereby training the processing model.
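For intuition, the following is a minimal sketch of the last stages described above: an aggregated cost volume is turned into per-pixel depth values with a Softmax over the depth hypotheses, and the Smooth L1 loss against the depth map sample is computed and back-propagated. It is written with PyTorch; the tensor shapes, the depth range [0, 96] sampled at 1 m steps and the soft arg-min reading of the Softmax are illustrative assumptions, not the application's exact network.

import torch
import torch.nn.functional as F

def cost_volume_to_depth(cost_volume, depth_values):
    """Convert an aggregated cost volume of shape (B, Z, H, W) into a depth map (B, H, W).
    Lower cost means a better match, so the Softmax is taken over the negated costs and
    the depth estimate is the expectation over the depth hypotheses."""
    prob = torch.softmax(-cost_volume, dim=1)                     # (B, Z, H, W)
    return (prob * depth_values.view(1, -1, 1, 1)).sum(dim=1)     # (B, H, W)

# Illustrative shapes: batch of 2, 97 depth hypotheses covering [0, 96] m, 64 x 128 pixels.
depth_values = torch.linspace(0.0, 96.0, steps=97)
cost_volume = torch.randn(2, 97, 64, 128, requires_grad=True)     # stands in for S(w*h*z)
depth_map_sample = torch.rand(2, 64, 128) * 96.0                  # stands in for the label

pred_depth = cost_volume_to_depth(cost_volume, depth_values)
loss = F.smooth_l1_loss(pred_depth, depth_map_sample)
loss.backward()   # the loss gradient flows back to the cost value calculation, as described above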

And S104, acquiring binocular images of the road synchronously shot by the camera group.

After the collection vehicle executes the field work, the binocular images of the road synchronously shot by the camera group are imported into the acquisition device, and the acquisition device performs subsequent processing on the binocular images.

And S105, processing the binocular image by using the trained processing model to generate a depth map of the binocular image.

The binocular image of the road is loaded into the trained processing model, the processing model is run, and the processing model outputs the depth map of the binocular image.

In the depth map acquisition method provided by this embodiment, by training the processing model, the processing model can directly output the depth map of a binocular image from the binocular image; compared with the prior art, error transfer is reduced and the precision of the obtained depth map can be improved. In addition, in the process of generating the depth map training samples, the depth value of the pixel region corresponding to the sky is set to the maximum depth value, and the processing model is trained with these depth map samples, so that the boundaries of the depth maps generated by the processing model are clearer; in particular, the boundary between the sky and the objects around it becomes a clearer distant-view edge.

The depth map acquisition method provided by the second embodiment of the present application is described below with emphasis on specific examples; it includes the following steps:

s201, point cloud of a training scene and a shot image of the training scene are obtained.

This step has already been described in detail in S101 of the first embodiment and is not repeated here. A captured image of the training scene is shown in fig. 7; the training scene includes at least one training object, for example street lamps, curbs, bus stops, the sky, and so on.

S202, determining a depth map sample according to the point cloud and the shot image.

Here, this step has already been described in detail in S102, and repeated descriptions are omitted.

Whether each training object is the sky is judged; if it is, the depth value of the pixel region of the training object is set to the maximum depth value. That is, the depth value corresponding to the pixel region of the sky is larger than the depth values corresponding to the remaining pixel regions.

If a training object is determined not to be the sky, the associated point cloud matched with the pixel region of that training object is determined according to the positions of the point clouds projected onto the captured image, and the depth value of the pixel region of each training object is determined according to the distance between its associated point cloud and the shooting plane.

Fig. 8 is a depth map sample of the binocular image sample, obtained from the point cloud, according to an embodiment of the present application. As shown in fig. 8, since no corresponding point clouds were collected for the upper and right areas of the image, their depth values are uniformly set to 0; at this stage the depth value of the pixel region where the sky is located is also 0. On the basis of the depth map sample shown in fig. 8, the depth map sample obtained after completing the sky depth value is shown in fig. 9.

And S203, training the processing model by using the training data.

This step has already been described in detail in S103 and the repeated parts are not described again. During training, the depth value range parameter among the processing parameters of the processing model is determined according to the distance from the real objects in the road to the camera set. Compared with a depth value range with infinity as the upper limit, determining the depth value range from the distance between the real objects in the road and the camera set removes the invalid depth values, eliminates the computation performed on those invalid depth values, reduces the memory and computation cost, and improves the computation efficiency. When the processing model adopts a neural-network-based structure, the recommended depth value range, based on the actual application range observed in road-test data, is [0, 96].
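As a rough illustration of the memory saving, the cost volume grows linearly with the number of depth hypotheses, so restricting the range to [0, 96] at 1 m steps keeps it small; the image size and float32 storage below are illustrative assumptions, not values from the application.

height, width, bytes_per_value = 1024, 2048, 4   # assumed image size and float32 storage

for z_max in (96, 300):
    hypotheses = z_max + 1                        # 1 m depth steps from 0 to z_max
    mib = height * width * hypotheses * bytes_per_value / 2**20
    print(f"depth range [0, {z_max}] m: {hypotheses} hypotheses, ~{mib:.0f} MiB per cost volume")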

And S204, acquiring binocular images of the road synchronously shot by the camera group.

This step has already been described in detail in S104 and the repeated parts are not described again. Fig. 10 is a left eye image of a road provided in the second embodiment of the present application; as shown in fig. 10, the left eye image includes trees, curbs, lanes, street lamps, buildings, the sky, and the like.

And S205, processing the binocular image by using the trained processing model to generate a depth map of the binocular image.

This step has already been described in detail in S105 and the repeated parts are not described again. Fig. 11 is the depth map obtained by processing the binocular image shown in fig. 10 with the processing model; dark gray represents parts closer to the camera and light gray represents parts farther away. As can be seen from fig. 11, there is a clear boundary between the sky and the trees far from the image capturing apparatus, so a clearer distant-view edge is obtained. Fig. 12 is the depth map obtained by processing the binocular image shown in fig. 10 with a contrast model, which was trained with depth map samples in which the depth value of the pixel region corresponding to the sky is set to zero; as shown in fig. 12, the depth map output by the contrast model has no obvious boundary between the sky and the trees.

In this embodiment, in the process of generating the depth map training samples, the depth value of the pixel region corresponding to the sky is set to the maximum depth value, and the processing model is trained with these depth map samples; as a result, the boundaries of the depth maps generated by the processing model are clearer, and in particular the boundary between the sky and the objects around it yields a clearer distant-view edge.

Fig. 13 is a schematic structural diagram of a depth map obtaining apparatus according to a third embodiment of the present application. As shown in fig. 13, a depth map obtaining apparatus 300 provided in an embodiment of the present application includes:

the acquisition module 301 is configured to acquire binocular images of a road synchronously photographed by the camera group;

a processing module 302, configured to process the binocular image using the trained processing model to generate a depth map of the binocular image, where the trained processing model is obtained by training using training data, and the training data includes: binocular image samples serving as input data of the processing model and depth map samples serving as output data of the processing model.

Optionally, the apparatus further comprises:

the obtaining module 301 is further configured to obtain a point cloud of a training scene and a captured image of the training scene, where the captured image includes a left eye image sample and a right eye image sample in a binocular image sample;

the determining module 303 is further configured to determine a depth map sample according to the point cloud and the captured image.

Optionally, the determining module 303 is specifically configured to:

performing semantic segmentation on the shot image to obtain a pixel area of at least one training object, wherein the training scene comprises at least one training object;

determining the depth value of the pixel area of each training object according to the type and the point cloud of each training object;

depth map samples are determined from the depth values of the pixel regions of each training object.

Optionally, the determining module 303 is specifically configured to:

judging whether the training object is sky or not;

if yes, setting the depth value of the pixel area of the training object as the maximum depth value;

if not, determining the depth value of the pixel area of the training object according to the point cloud.

Optionally, the determining module 303 is specifically configured to:

determining an associated point cloud matched with a pixel region of a training object from the point clouds;

and determining the depth value of the pixel area of the training object according to the associated point cloud.

Optionally, the determining module 303 is specifically configured to:

projecting the point cloud to a plane of a shot image, and determining the position of a projection area of the point cloud;

and determining the associated point cloud according to the position of the projection area and the position of the pixel area of the training object.

Optionally, the determining module 303 is specifically configured to:

and taking the distance from the associated point cloud to the plane of the shot image as the depth value of the pixel area of the training object.

Optionally, the processing parameter of the processing model includes a depth value range, and the depth value range is determined according to a distance between a real object in the road and the camera group.

Fig. 14 is a schematic structural diagram of an acquisition device provided in the fourth embodiment of the present application. As shown in fig. 14, the acquisition device 400 provided in this embodiment includes: a transmitter 401, a receiver 402, a memory 403, and a processor 404.

A transmitter 401 for transmitting instructions and data;

a receiver 402 for receiving instructions and data;

a memory 403 for storing computer-executable instructions;

the processor 404 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the depth map obtaining method in the above embodiments. Specifically, reference may be made to the related description in the foregoing depth map obtaining method embodiment.

Alternatively, the memory 403 may be separate or integrated with the processor 404.

When the memory 403 is provided separately, the acquisition device further includes a bus for connecting the memory 403 and the processor 404.

An embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the depth map obtaining method performed by the above obtaining device is implemented.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
