Training method and device of depth estimation model, electronic equipment and storage medium

Document No.: 1832817    Publication date: 2021-11-12

Reading note: This technology, "Training method and device of depth estimation model, electronic equipment and storage medium", was designed and created by An Shijie and Zhang Yuan on 2020-04-27. Its main content includes: The disclosure relates to a training method and device of a depth estimation model, an electronic device and a storage medium, which are used to at least solve the problem that a convolutional neural network has low stability in video image depth estimation. The method comprises the following steps: respectively inputting n sample images and n simulated images in one-to-one correspondence with the n sample images into a preset depth estimation model to obtain depth prediction results of the n sample images and depth prediction results of the n simulated images, where the n sample images and the n simulated images are annotated with pixel depth values; obtaining n first losses according to the pixel depth values of each sample image and the depth prediction result of each sample image; obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image; and training the preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model.

1. A method for training a depth estimation model, the method comprising:

respectively inputting n sample images and n simulated images which are in one-to-one correspondence with the n sample images into a preset depth estimation model, to obtain depth prediction results respectively corresponding to the n sample images and depth prediction results of the n simulated images; wherein n is a positive integer; each simulated image is obtained by performing augmentation processing on the corresponding sample image; and the n sample images and the n simulated images are annotated with pixel depth values;

obtaining n first losses according to the depth value of the pixel point of each sample image and the depth prediction result of each sample image;

obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image;

and training the preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model.

2. The method of claim 1, further comprising:

respectively performing augmentation processing on each sample image by using a first method, to obtain n simulated images respectively corresponding to each sample image; the first method comprising at least one of mirroring, rotation, or translation;

the obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image includes:

respectively processing the depth prediction result of each simulated image by using the inverse process of the first method to obtain n first depth prediction results respectively corresponding to each simulated image;

and obtaining the loss of each first depth prediction result and the depth prediction result of the sample image corresponding to each first depth prediction result respectively to obtain n second losses.

3. The method of claim 2, wherein, when the first method comprises any one of mirroring, rotation, or translation, the performing augmentation processing on each sample image respectively by using the first method to obtain n simulated images respectively corresponding to each sample image comprises:

respectively projecting each sample image to a three-dimensional space according to a camera imaging equation to obtain n first point clouds;

respectively processing each first point cloud by using a preset first rotation matrix and a first translation vector to obtain n second point clouds;

respectively carrying out back projection on each second point cloud according to a camera imaging equation;

and obtaining n simulated images respectively corresponding to each sample image according to the coordinate relationship between each sample image and the second point cloud after the back projection respectively corresponding to each sample image.

4. The method of claim 3, wherein the processing the depth prediction result of each simulated image respectively by using the inverse process of the first method to obtain n first depth prediction results respectively corresponding to each simulated image comprises:

respectively processing the depth prediction result of each simulated image by using a preset second rotation matrix and a second translation vector to obtain n first depth prediction results respectively corresponding to each simulated image; wherein the second rotation matrix is an inverse of the first rotation matrix; the sum of the first translation vector and the second translation vector is 0.

5. The method according to any one of claims 2-4, further comprising:

respectively performing augmentation processing on the depth prediction result of each sample image by using the first method, to obtain n projection depth prediction results respectively corresponding to each sample image; the first method comprising at least one of mirroring, rotation, or translation;

obtaining n third losses according to the depth value of the pixel point of each simulated image and the depth prediction result of each simulated image, and obtaining n fourth losses according to each projection depth prediction result and the depth prediction result of the simulated image corresponding to each projection depth prediction result respectively;

and training the first target depth estimation model according to the n third losses and the n fourth losses to obtain a second target depth estimation model.

6. A method of depth estimation, the method comprising:

acquiring an image of a depth to be estimated;

inputting the image with the depth to be estimated into the depth estimation model obtained by training according to the method of any one of claims 1 to 5, and obtaining the depth prediction result of the image with the depth to be estimated.

7. An apparatus for training a depth estimation model, the apparatus comprising: the device comprises a prediction module, an acquisition module and a training module;

the prediction module is configured to: respectively inputting n sample images and n simulated images which are in one-to-one correspondence with the n sample images into a preset depth estimation model to obtain depth prediction results which respectively correspond to the n sample images and depth prediction results of the n simulated images; wherein n is a positive integer; each analog image is obtained by performing amplification processing on a corresponding sample image; the n sample images and the n simulated images are marked with pixel point depth values;

the acquisition module is configured to: obtaining n first losses according to the depth value of the pixel point of each sample image and the depth prediction result of each sample image; obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image;

the training module is configured to: and training the preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model.

8. A depth estimation apparatus, characterized in that the apparatus comprises:

an acquisition module configured to acquire an image of a depth to be estimated;

a prediction module configured to input the image with the depth to be estimated into the depth estimation model trained by the method according to any one of claims 1 to 5, and obtain a depth prediction result of the image with the depth to be estimated.

9. An electronic device, comprising:

a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute the executable instructions to implement the method of any one of claims 1-5 or to implement the method of claim 6.

10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-5 or perform the method of claim 6.

Technical Field

The present disclosure relates to the field of computer vision technologies, and in particular, to a method and an apparatus for training a depth estimation model, an electronic device, and a storage medium.

Background

In the field of computer vision technology, estimating the depth information of a two-dimensional image is a very important technique. Depth estimation makes it possible to obtain, from a two-dimensional image, the sizes of objects, the occlusion relationships between objects, the shapes of objects, and the like. Depth estimation can be applied to the conversion of 2D movies into 3D movies, autonomous navigation of intelligent robots, robotic-arm grasping, augmented reality, and so on.

Currently, the industry generally uses a convolutional neural network (i.e., a preset depth estimation model) to estimate the depth information of a two-dimensional image. The publicly available datasets of sample images mainly fall into two categories: indoor scenes and outdoor scenes. Only a small portion of these datasets consists of continuous-frame image data; most of the data are discontinuous frames. As a result, when a convolutional neural network trained on such public datasets is used to estimate the depth of continuous frames in a video, two adjacent frames that differ only slightly can produce depth estimation results that differ greatly, which leads to poor stability of depth estimation for video images.

Disclosure of Invention

The disclosure provides a training method and device of a depth estimation model, so as to improve the stability of depth estimation of continuous frame images in a video.

The technical scheme of the disclosure is as follows:

according to a first aspect of an embodiment of the present disclosure, an embodiment of the present disclosure provides a training method for a depth estimation model, the method including: respectively inputting n sample images and n simulated images which are in one-to-one correspondence with the n sample images into a preset depth estimation model to obtain depth prediction results which respectively correspond to the n sample images and depth prediction results of the n simulated images; wherein n is a positive integer; each analog image is obtained by performing amplification processing on a corresponding sample image; the n sample images and the n simulated images are marked with pixel point depth values; obtaining n first losses according to the depth value of the pixel point of each sample image and the depth prediction result of each sample image; obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image; and training the preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model.

In an embodiment of the present disclosure, the first loss reflects the error of the preset depth estimation model in estimating the depth information of the sample image, and the second loss reflects the error of the preset depth estimation model in estimating the depth information of the simulated image. Training the preset depth estimation model with the first loss and the second loss to obtain the first target depth estimation model therefore corresponds to correcting the errors the preset depth estimation model makes when estimating the depth information of a sample image and of its corresponding simulated image. Compared with the preset depth estimation model, using the first target depth estimation model trained with the first losses and the second losses to estimate the depth information of video images keeps the difference in depth information between adjacent frames from becoming too large, so the stability of the estimated depth information of video images is improved. In addition, when the first target depth estimation model is used to acquire the depth information of an image, no temporal or spatial filtering of the video image is required before or after depth estimation, which reduces the amount of computation the computer device performs when estimating the depth of video images.

Optionally, the "obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image" includes: and obtaining n second losses according to the depth value of the pixel point of each sample image, the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image.

Optionally, the training method of the depth estimation model further includes: respectively carrying out augmentation processing on each sample image by using a first method to obtain n analog images respectively corresponding to each sample image; the first method includes: at least one of mirrored, rotated, or translated; the obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image includes: respectively processing the depth prediction result of each simulated image by using the inverse process of the first method to obtain n first depth prediction results respectively corresponding to each simulated image; and obtaining the loss of each first depth prediction result and the depth prediction result of the sample image corresponding to each first depth prediction result respectively to obtain n second losses.

In this way, the simulated image corresponding to each sample image is equivalent to a simulated adjacent frame of that sample image. Processing the depth prediction result of the simulated image with the inverse process of the first method directly measures how much the depth information the preset depth estimation model estimates for the simulated image deviates from the depth information it estimates for the sample image, which further improves the stability of the trained first target depth estimation model when estimating the depth information of adjacent frames.

Optionally, when the first method includes any one of mirroring, rotating, and translating, the "performing the amplification processing on each sample image by using the first method to obtain n analog images corresponding to each sample image respectively" includes: respectively projecting each sample image to a three-dimensional space according to a camera imaging equation to obtain n first point clouds; respectively processing each first point cloud by using a preset first rotation matrix and a first translation vector to obtain n second point clouds; and respectively carrying out back projection on each second point cloud according to a camera imaging equation. And obtaining n analog images respectively corresponding to each sample image according to the coordinate relation between each sample image and the second point cloud which is respectively corresponding to each sample image and is subjected to back projection.

Optionally, the "performing augmentation processing on the depth prediction result of each simulated image by using the inverse process of the first method to obtain n first depth prediction results corresponding to each simulated image respectively" includes: and respectively processing the depth prediction result of each simulated image by using a preset second rotation matrix and a second translation vector to obtain n first depth prediction results respectively corresponding to each simulated image. The second rotation matrix is an inverse of the first rotation matrix. The sum of the first translation vector and the second translation vector is 0.

Optionally, the method further includes: respectively carrying out augmentation processing on the depth prediction result of each sample image by using the first method to obtain n projection depth prediction results respectively corresponding to each sample image; the first method includes: at least one of mirrored, rotated, or translated; obtaining n third losses according to the depth value of the pixel point of each simulated image and the depth prediction result of each simulated image, and obtaining n fourth losses according to each projection depth prediction result and the depth prediction result of the simulated image corresponding to each projection depth prediction result respectively; and training the first target depth estimation model according to the n third losses and the n fourth losses to obtain a second target depth estimation model.

In this way, the third loss may reflect an error of the depth information of the simulated image estimated by the preset depth estimation model, and the fourth loss may reflect an error of the depth information of the simulated image estimated by the preset depth estimation model relative to the depth prediction result of the sample image. Therefore, the first target depth estimation model is trained by using the third loss and the fourth loss to obtain the second target depth estimation model, and the stability of the obtained second target depth estimation model for estimating the depth information of the video image can be further improved.

Optionally, the "performing amplification processing on each sample image by using the first method to obtain n analog images corresponding to each sample image respectively" includes: respectively carrying out augmentation processing on each sample image by using a first method; and adding noise to each processed sample image to obtain n analog images respectively corresponding to each sample image.

In this way, the trained model improves the estimation accuracy when estimating the depth information of the continuous frame images including noise.

According to a second aspect of the embodiments of the present disclosure, there is provided a depth estimation method, including: and obtaining an image of the depth to be estimated, and inputting the image of the depth to be estimated into the depth estimation model obtained by training the training method of the depth estimation model shown in the first aspect and any one of the possible implementation manners of the first aspect, so as to obtain a depth prediction result of the image of the depth to be estimated.

According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a depth estimation model, the apparatus including: the device comprises a prediction module, an acquisition module and a training module; the prediction module is configured to: respectively inputting n sample images and n simulated images which are in one-to-one correspondence with the n sample images into a preset depth estimation model to obtain depth prediction results which respectively correspond to the n sample images and depth prediction results of the n simulated images; wherein n is a positive integer; each analog image is obtained by performing amplification processing on a corresponding sample image; the n sample images and the n simulated images are marked with pixel point depth values; the acquisition module is configured to: obtaining n first losses according to the depth value of the pixel point of each sample image and the depth prediction result of each sample image; obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image; the training module is configured to: and training the preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model.

Optionally, the obtaining module is specifically configured to: and obtaining n second losses according to the depth value of the pixel point of each sample image, the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image.

Optionally, the obtaining module is further configured to: respectively carrying out augmentation processing on each sample image by using a first method to obtain n analog images respectively corresponding to each sample image; the first method includes: at least one of mirrored, rotated, or translated; the acquisition module is specifically configured to: respectively processing the depth prediction result of each simulated image by using the inverse process of the first method to obtain n first depth prediction results respectively corresponding to each simulated image; and obtaining the loss of each first depth prediction result and the depth prediction result of the sample image corresponding to each first depth prediction result respectively to obtain n second losses.

Optionally, when the first method includes any one of mirroring, rotation, or translation, the obtaining module is specifically configured to: respectively projecting each sample image to a three-dimensional space according to a camera imaging equation to obtain n first point clouds; respectively processing each first point cloud by using a preset first rotation matrix and a first translation vector to obtain n second point clouds; respectively carrying out back projection on each second point cloud according to a camera imaging equation; and obtaining n simulated images respectively corresponding to each sample image according to the coordinate relationship between each sample image and the second point cloud after the back projection respectively corresponding to each sample image.

Optionally, the obtaining module is specifically configured to: respectively processing the depth prediction result of each simulated image by using a preset second rotation matrix and a second translation vector to obtain n first depth prediction results respectively corresponding to each simulated image; the second rotation matrix is an inverse of the first rotation matrix; the sum of the first translation vector and the second translation vector is 0.

Optionally, the obtaining module is further configured to: respectively carrying out augmentation processing on the depth prediction result of each sample image by using the first method to obtain n projection depth prediction results respectively corresponding to each sample image; the first method includes: at least one of mirrored, rotated, or translated; obtaining n third losses according to the depth value of the pixel point of each simulated image and the depth prediction result of each simulated image, and obtaining n fourth losses according to each projection depth prediction result and the depth prediction result of the simulated image corresponding to each projection depth prediction result respectively; the training module is further configured to: and training the first target depth estimation model according to the n third losses and the n fourth losses to obtain a second target depth estimation model.

Optionally, the obtaining module is specifically configured to: processing each sample image separately using a first method; and adding noise to each processed sample image to obtain a simulated image of each sample image.

According to a fourth aspect of embodiments of the present disclosure, there is provided a depth estimation apparatus, the apparatus including: an acquisition module configured to acquire an image whose depth is to be estimated; and a prediction module configured to input the image whose depth is to be estimated into the depth estimation model trained by the training method shown in the first aspect or any possible implementation of the first aspect, to obtain a depth prediction result of the image.

According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions. Wherein the processor is configured to execute the instructions to implement the method for training the depth estimation model shown in any one of the above-mentioned first aspect and possible implementations of the first aspect, or to implement the method for depth estimation shown in the above-mentioned implementation of the second aspect.

According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform a training method of a depth estimation model as shown in the first aspect or perform a depth estimation method as shown in the second aspect.

According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product directly loadable into an internal memory of an electronic device and containing software codes, the computer program being capable of implementing a training method of a depth estimation model shown in the first aspect or implementing a depth estimation method shown in the second aspect when the computer program is loaded into and executed by the electronic device.

The training apparatus, depth estimation apparatus, electronic device, computer-readable storage medium, and computer program product described above are all used to perform the corresponding methods described above; for the beneficial effects they can achieve, reference may therefore be made to the beneficial effects of the corresponding schemes in the corresponding methods, which are not repeated here.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a flow diagram illustrating a method of training a depth estimation model in accordance with an exemplary embodiment;

FIG. 2 illustrates a first image 20, a depth prediction result 30, and a depth prediction result 40, according to an exemplary embodiment;

FIG. 3 shows two consecutive frame images from the public dataset Scannet and the depth prediction results obtained by estimating the depth information of the two consecutive frame images, according to an exemplary embodiment;

FIG. 4 is a block diagram illustrating a training apparatus for a depth estimation model in accordance with an exemplary embodiment;

FIG. 5 is a block diagram illustrating a depth estimation device in accordance with an exemplary embodiment;

FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that in the embodiments of the present disclosure, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "e.g.," in an embodiment of the present disclosure is not to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Some of the terms and techniques involved in the embodiments of the present disclosure are briefly described below:

1) world coordinate system (world coordinate system), camera coordinate system (camera coordinate system), image coordinate system

World coordinate system: also called the measurement coordinate system, a three-dimensional rectangular coordinate system in which the spatial positions of the camera and of the object to be measured can be described. The initial pose of the camera is generally taken as the origin of the world coordinate system.

Camera coordinate system: also a three-dimensional rectangular coordinate system. Its origin is located at the optical center of the camera lens, its x-axis and y-axis are parallel to the two sides of the image plane, and its z-axis is the optical axis of the lens, perpendicular to the image plane.

The relationship between the camera coordinate system and the world coordinate system can be described by a rotation matrix R1 and a translation vector T1. Assuming that the coordinate of a point P in space is M1 (e.g., (Xw, Yw, Zw)) in the world coordinate system and M2 (e.g., (Xc, Yc, Zc)) in the camera coordinate system, the relationship between M1 and M2 is M1 × R1 + T1 = M2, where R1 is a 3 × 3 rotation matrix and T1 is a 3 × 1 translation vector.

Image coordinate system: is a two-dimensional rectangular coordinate system. The intersection point of the lens optical axis and the image plane is the origin of the image coordinate system.

M2(Xc, Yc, Zc) is projected onto the image plane along a ray passing through the center of projection, and the corresponding image point is M3(x, y), where M3 is the coordinate of P in the image coordinate system, f is the focal length of the lens, and Zc is the depth information of P. According to the similar-triangle principle, the correspondence between M2 and M3 can be expressed by the following formula:

x = f · Xc / Zc,    y = f · Yc / Zc
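To make the projection relationship concrete, the following is a minimal sketch (not taken from the patent) of lifting pixels with depth values into camera-space 3D points and projecting 3D points back onto the image plane; the intrinsic parameters fx, fy, cx, cy are assumed example inputs.

```python
import numpy as np

# A minimal sketch of the camera imaging equation above; fx, fy, cx, cy are assumed
# camera intrinsics (focal lengths and principal point), not values from the patent.
def back_project(u, v, depth, fx, fy, cx, cy):
    """Image coordinates (u, v) with depth Zc -> camera coordinates (Xc, Yc, Zc)."""
    x_c = (u - cx) / fx * depth
    y_c = (v - cy) / fy * depth
    return np.stack([x_c, y_c, depth], axis=-1)

def project(points, fx, fy, cx, cy):
    """Camera coordinates (Xc, Yc, Zc) -> image coordinates (u, v)."""
    x_c, y_c, z_c = points[..., 0], points[..., 1], points[..., 2]
    return fx * x_c / z_c + cx, fy * y_c / z_c + cy
```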

2) other terms

In the embodiments of the present disclosure, "at least one" means one or more. "plurality" means two or more.

In the embodiment of the present disclosure, "and/or" is only one kind of association relationship describing an association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

In the disclosed embodiments, a combination includes one or more objects.

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.

It should be noted that the training method of the depth estimation model provided by the embodiment of the present disclosure may be applied to an electronic device or a server. The electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, and the like. The server may be one server, or may be a server cluster composed of a plurality of servers, which is not limited in this disclosure.

As shown in fig. 1, fig. 1 is a flowchart illustrating a training method of a depth estimation model according to an exemplary embodiment. The method shown in fig. 1 may be applied to an electronic device or a server. The method shown in fig. 1 may include the following steps S100 to S103:

S100: respectively inputting the n sample images and the n simulated images, which are in one-to-one correspondence with the n sample images, into a preset depth estimation model to obtain depth prediction results respectively corresponding to the n sample images and depth prediction results of the n simulated images. Here, n is a positive integer; each simulated image is obtained by performing augmentation processing on the corresponding sample image; and the n sample images and the n simulated images are labeled with pixel depth values.

The preset depth estimation model may be a convolutional neural network (CNN) model.
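As an illustration only, a minimal encoder-decoder CNN for depth prediction might look like the following sketch, assuming PyTorch; the patent does not specify any particular architecture, so all layer sizes here are assumptions.

```python
import torch.nn as nn

# A minimal sketch of a CNN-based preset depth estimation model (assumed architecture,
# for illustration only): an encoder that downsamples the RGB input and a decoder that
# upsamples back to a one-channel depth map.
class SimpleDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (batch, 3, H, W) image -> (batch, 1, H, W) non-negative depth prediction
        return self.decoder(self.encoder(x))
```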

In a possible implementation, a first method is used to perform augmentation processing on each sample image respectively, so as to obtain n simulated images respectively corresponding to each sample image. The first method includes at least one of mirroring, rotation, or translation.

Specifically, the first method may be used to perform the augmentation processing on each sample image respectively through the following steps to obtain a simulated image corresponding to each sample image:

Step one: convert the first sample image into a first three-dimensional image, where the first sample image is any one of the n sample images.

In one implementation, a first sample image in an image coordinate system is converted to a three-dimensional image in a world coordinate system using a first algorithm and the three-dimensional image is taken as a first three-dimensional image. Wherein, the first algorithm may be a perspective projection algorithm or a parallel projection algorithm, etc.

Step two: process the first three-dimensional image using a preset first rotation matrix and a preset first translation vector to obtain a second three-dimensional image. The first rotation matrix may be a preset rotation matrix R2 about the x-axis, y-axis, and z-axis of the world coordinate system, and the first translation vector may be a preset translation vector T2 along those axes.

It can be understood that the magnitudes of R2 and T2 can be limited to ensure that the similarity between the resulting simulated image and the corresponding sample image is greater than or equal to a threshold value.

Illustratively, a point V in the first three-dimensional image has coordinates V1, and after processing its coordinates in the second three-dimensional image become V2, with V1 × R2 + T2 = V2. Here R2 may be a rotation matrix about the x-axis with a rotation angle in the range [-5, 5], and T2 may be a translation along the y-axis in the range [-0.05, 0.05]. The unit of the translation amount is the same as the unit of the coordinates.

Step three: and converting the second three-dimensional image into a simulation image corresponding to the first sample image by using an inverse algorithm of the first algorithm.

Optionally, noise is added to the image obtained by converting the second three-dimensional image by using the inverse algorithm of the first algorithm, so as to obtain a simulated image corresponding to the first sample image. In this way, the trained first target depth estimation model improves the stability of estimating depth information when estimating the depth information of continuous frame images including noise.
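The augmentation in steps one to three can be sketched as follows; this is an illustrative implementation under assumptions (pinhole intrinsics fx, fy, cx, cy, a rotation only about the x-axis, nearest-neighbour resampling, additive Gaussian noise), not the patent's exact procedure.

```python
import numpy as np

def simulate_adjacent_frame(image, depth, fx, fy, cx, cy, rng=None):
    """Sketch of steps one to three: image is an (H, W, C) float array, depth is (H, W)."""
    rng = rng or np.random.default_rng()
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float32)

    # Step one: project each pixel into 3D space (first point cloud).
    points = np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], axis=-1).reshape(-1, 3)

    # Step two: apply a small preset rotation R2 and translation T2
    # (ranges follow the example above: rotation in [-5, 5], y translation in [-0.05, 0.05]).
    angle = np.deg2rad(rng.uniform(-5.0, 5.0))
    r2 = np.array([[1, 0, 0],
                   [0, np.cos(angle), -np.sin(angle)],
                   [0, np.sin(angle),  np.cos(angle)]], dtype=np.float32)
    t2 = np.array([0.0, rng.uniform(-0.05, 0.05), 0.0], dtype=np.float32)
    points = points @ r2.T + t2  # second point cloud

    # Step three: back-project the second point cloud and resample the sample image
    # at the new coordinates (nearest-neighbour, for brevity).
    z = np.clip(points[:, 2], 1e-6, None)
    u_new = np.clip(np.round(fx * points[:, 0] / z + cx).astype(int), 0, w - 1)
    v_new = np.clip(np.round(fy * points[:, 1] / z + cy).astype(int), 0, h - 1)
    simulated = np.zeros_like(image)
    simulated[v_new, u_new] = image.reshape(-1, image.shape[-1])
    simulated_depth = np.zeros_like(depth)
    simulated_depth[v_new, u_new] = z

    # Optional: add noise to the simulated image.
    simulated += rng.normal(0.0, 0.01, simulated.shape)
    return simulated, simulated_depth
```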

Optionally, the depth prediction result of each sample image is respectively subjected to augmentation processing by using a first method, so as to obtain n projection depth prediction results respectively corresponding to each sample image.

Specifically, this follows the same procedure as steps one to three above, in which the first method is used to process each sample image to obtain the n simulated images respectively corresponding to each sample image; the details are not repeated here.

S101: and obtaining n first losses according to the depth value of the pixel point of each sample image and the depth prediction result of each sample image.

Specifically, a loss function is used to calculate the first loss between the pixel depth values of each sample image and the depth prediction result of that sample image. The first loss may be characterized by the mean square error or the mean difference between the pixel depth values of the sample image and the depth prediction result of the sample image. The loss function may be any one of a Reversed HuBer loss function, a mean square error function, or a cross entropy function.

It should be noted that, when the preset depth estimation model estimates the depth information of the sample image absolutely accurately, that is, when the depth value of a pixel point of the first sample image is the same as the depth prediction result of the first sample image, the first loss is 0. Therefore, the first loss may reflect an error of the preset depth estimation model for estimating the depth information of the first sample image.
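As one illustrative option, the first loss (and, applied to simulated images, the third loss below) can be computed with the Reversed HuBer (berHu) loss mentioned above; the threshold rule c = 0.2 · max|error| used in this sketch is a common convention and is an assumption, not specified by the patent.

```python
import torch

def berhu_loss(pred_depth, gt_depth):
    # Reversed HuBer (berHu): L1 for small errors, scaled L2 for large errors.
    diff = torch.abs(pred_depth - gt_depth)
    c = (0.2 * diff.max()).clamp(min=1e-6)  # assumed threshold rule
    l2_part = (diff ** 2 + c ** 2) / (2 * c)
    return torch.where(diff <= c, diff, l2_part).mean()
```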

Optionally, n third losses are obtained according to the depth value of the pixel point of each simulated image and the depth prediction result of each simulated image.

Specifically, a loss function is used to calculate the third loss between the pixel depth values of each simulated image and the depth prediction result of that simulated image. The third loss may be characterized by the mean square error or the mean difference between the pixel depth values of the simulated image and the depth prediction result of the simulated image. The loss function may be any one of a Reversed HuBer loss function, a mean square error function, or a cross entropy function.

It should be noted that, when the preset depth estimation model estimates the depth information of the analog image absolutely accurately, that is, when the depth value of a pixel point of the analog image is the same as the depth prediction result of the analog image, the third loss is 0. Therefore, the third loss may reflect an error of the preset depth estimation model in estimating the depth information of the simulated image.

S102: and obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image.

In a possible implementation manner, n second losses may be obtained according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image through the following steps, and a method for obtaining one second loss is described below by taking the first sample image as an example for convenience of description. The analog image corresponding to the first sample image is hereinafter referred to as a first analog image:

Step one: process the depth prediction result of the first simulated image using the inverse process of the method that generated the first simulated image, to obtain a first depth prediction result corresponding to the first simulated image.

Specifically, the depth prediction result of the first simulated image is processed by using the second rotation matrix and the second translational vector. Wherein the second rotation matrix is an inverse matrix of the first rotation matrix, and a sum of the first translation vector and the second translation vector is 0.

Step two: and obtaining the loss of the depth prediction result of the first sample image and the first depth prediction result corresponding to the first simulation image to obtain a second loss.

Specifically, the loss function is used to calculate the loss between the depth prediction result of the first sample image and the first depth prediction result corresponding to the first simulated image, so as to obtain a second loss. The second loss may be characterized by a mean square error or a mean difference of the depth predictor of the first sample image and the first depth predictor. The loss function may be any one of a Reversed HuBer loss function, a mean square error function, and a cross entropy function.

In another possible implementation manner, the n second losses are obtained according to the depth value of the pixel point of each sample image, the depth prediction result of each sample image, and the depth prediction result of the simulated image corresponding to each sample image.

That is, in addition to the loss between the depth prediction result of the first sample image and the first depth prediction result corresponding to the first simulated image described above, the loss between the pixel depth values of the first sample image and the first depth prediction result corresponding to the first simulated image is also obtained, and the sum of the two losses is taken as the second loss.

It should be noted that when the preset depth estimation model estimates the depth information of the sample image and of its corresponding simulated image with good stability, that is, when the difference between the depth prediction result of the first sample image and the depth prediction result of the first simulated image is relatively small, the difference between the first depth prediction result corresponding to the first simulated image and the depth prediction result of the first sample image is also relatively small. Therefore, the second loss between the depth prediction result of the first sample image and the first depth prediction result corresponding to the first simulated image reflects how much the depth information the preset depth estimation model estimates for the first simulated image deviates from its depth prediction result for the first sample image.
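A minimal sketch of computing one second loss is given below, assuming PyTorch; `inverse_warp` is a hypothetical helper standing in for the inverse process of the first method (the second rotation matrix and second translation vector described above).

```python
import torch

def second_loss(pred_depth_sample, pred_depth_simulated, r2, t2, inverse_warp):
    # Undo the augmentation applied to the simulated image's prediction:
    # second rotation matrix = inverse of the first, second translation = -T2.
    r2_inv = torch.inverse(r2)
    t2_inv = -t2
    first_depth_pred = inverse_warp(pred_depth_simulated, r2_inv, t2_inv)

    # Consistency loss against the sample image's prediction (MSE as one option).
    return torch.mean((pred_depth_sample - first_depth_pred) ** 2)
```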

Optionally, n fourth losses are obtained according to each projection depth prediction result and the depth prediction result of the simulated image corresponding to each projection depth prediction result respectively.

Specifically, a loss function is used to calculate the loss between each projection depth prediction result and the depth prediction result of the simulated image corresponding to that projection depth prediction result, so as to obtain n fourth losses. The fourth loss may be characterized by the mean square error or the mean difference between the projection depth prediction result and the depth prediction result of the corresponding simulated image. The loss function may be any one of a Reversed HuBer loss function, a mean square error function, or a cross entropy function.

It should be noted that the execution order of S101 and S102 is not limited in the embodiment of the present disclosure, and for example, S101 is executed after S102 is executed.

S103: and training a preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model.

In a possible implementation, first, the sum of the m-th first loss and the m-th second loss is taken to obtain a first total loss, where 1 ≤ m ≤ n. The m-th first loss is obtained from the pixel depth values of the m-th sample image and the depth prediction result of the m-th sample image, and the m-th second loss is obtained from the depth prediction result of the m-th sample image and the depth prediction result of the simulated image corresponding to that sample image. In this way, n first total losses are obtained. Then, the preset depth estimation model is trained with a stochastic gradient descent (SGD) optimization algorithm based on the n first total losses, so as to obtain the first target depth estimation model.

In this way, when the preset depth estimation model is trained, both the difference between the depth prediction result of the simulated image corresponding to the m-th sample image and the depth prediction result of the m-th sample image, and the difference between the pixel depth values of the m-th sample image and the depth prediction result of the m-th sample image, are taken into account, which improves the stability of the trained first target depth estimation model when estimating the depth information of video images.
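Putting the pieces together, one training step for S103 might look like the following sketch, assuming PyTorch and the `berhu_loss`, `second_loss`, and `inverse_warp` helpers sketched above; the batch layout and the optimizer settings are assumptions.

```python
import torch

def training_step(model, optimizer, samples, simulated, gt_depths, r2, t2, inverse_warp):
    pred_samples = model(samples)      # depth predictions of the n sample images
    pred_simulated = model(simulated)  # depth predictions of the n simulated images

    total_loss = 0.0
    for m in range(samples.shape[0]):  # one first total loss per sample image
        first = berhu_loss(pred_samples[m], gt_depths[m])
        second = second_loss(pred_samples[m], pred_simulated[m], r2, t2, inverse_warp)
        total_loss = total_loss + first + second  # m-th first total loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()  # e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    return float(total_loss)
```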

Optionally, the first target depth estimation model is trained according to the n third losses and the n fourth losses to obtain a second target depth estimation model.

In a possible implementation, first, the sum of the m-th third loss and the m-th fourth loss is taken to obtain a second total loss, where 1 ≤ m ≤ n. The m-th third loss is obtained from the pixel depth values of the m-th simulated image and the depth prediction result of that simulated image, and the m-th fourth loss is obtained from the m-th projection depth prediction result and the depth prediction result of the simulated image corresponding to that projection depth prediction result. In this way, n second total losses are obtained. Then, the first target depth estimation model is trained with a stochastic gradient descent (SGD) optimization algorithm based on the n second total losses, so as to obtain the second target depth estimation model.

Therefore, when the preset depth estimation model is trained, the loss of the depth prediction result of the simulated image relative to the projection depth prediction result and the loss of the pixel point depth value of the simulated image relative to the depth prediction result of the simulated image are also considered, and the stability of the trained second target depth estimation model when the depth information of the video image is estimated is further improved.

It should be noted that, in another implementation manner in the embodiment of the present disclosure, a preset depth estimation model may be trained to obtain a third target depth estimation model by using the obtained n third losses and n fourth losses, and further, the third target depth estimation model may be trained to obtain a fourth target depth estimation model by using the obtained n first losses and n second losses. Compared with the preset depth estimation model, when the depth information of the video image is estimated, the stability of any one of the first target depth estimation model, the second target depth estimation model, the third target depth estimation model or the fourth target depth estimation model obtained by training in any implementation mode is higher than that of the preset depth estimation model when the depth information of the video image is estimated.

In the following, the root mean square error (RMSE) is used as an evaluation index to evaluate the accuracy of the result of estimating the depth information of one frame of image by using the first target depth estimation model obtained in the embodiment of the present disclosure. The accuracy of the depth information estimation result for a sample image Y can be expressed by the following formula:

RMSE = sqrt( (1/n) · Σ_i (Y_i − P_i)² )

where Y_i represents the true depth value of pixel i in the sample image Y, P_i represents the estimated depth value of pixel i of the sample image, and n is the number of pixels in the sample image Y. A smaller RMSE value indicates a higher accuracy of the depth estimation.

Illustratively, as shown in fig. 2, fig. 2 illustrates a first image 20, a depth prediction result 30, and a depth prediction result 40 according to an exemplary embodiment. The depth prediction result 30 is obtained by estimating the depth information of the first image with the preset depth estimation model, and the depth prediction result 40 is obtained by estimating the depth information of the first image with the first target depth estimation model trained according to the embodiment of the present disclosure. The calculated RMSE value of the depth prediction result 30 is 1.1035, and the RMSE value of the depth prediction result 40 is 0.2091. Clearly, the accuracy of the depth information in the depth prediction result 40 is higher.

In the following, the accuracy of the result of estimating the depth information of two consecutive frames of images in the video by using the first target depth estimation model obtained in the embodiment of the present disclosure is evaluated by using RMSE as an evaluation index.

The RMSE value between the depth prediction results of two consecutive frame images Y1 and Y2 can be calculated by the following formula:

RMSE = sqrt( (1/n) · Σ_i (Y1_i − Y2_i)² )

where Y1_i represents the depth estimate at pixel i of Y1, Y2_i represents the depth estimate at pixel i of Y2, and n represents the number of pixels in Y1 (Y1 and Y2 have the same number of pixels). A smaller RMSE indicates a smaller depth difference between Y1 and Y2, that is, less jitter between Y1 and Y2 and higher stability.

Illustratively, as shown in fig. 3, fig. 3 shows the depth prediction results obtained by estimating the depth information of two consecutive frame images from the public dataset Scannet, according to an exemplary embodiment. Part 50 of fig. 3 is the result of estimating the depth information of the two consecutive frame images with the preset depth estimation model; the RMSE value calculated for the two corresponding depth prediction results is 0.791. Part 60 of fig. 3 is the result of estimating the depth information of the same two consecutive frame images with the first target depth estimation model trained by the method provided by the embodiment of the present disclosure; the RMSE value calculated for the two corresponding depth prediction results is 0.400. This shows that the results obtained by estimating the depth information of two consecutive frame images with the first target depth estimation model trained by the method provided by the embodiment of the present disclosure are more stable and exhibit less jitter.

It can be understood that, for a segment of video, the average value of the RMSE values of all the consecutive two frames of images in the video can be used as the stability indicator of the depth estimation of the segment of video.
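For completeness, below is a minimal sketch of these two evaluation indices: the per-image accuracy RMSE against ground truth, and the video stability indicator obtained by averaging the RMSE over all pairs of consecutive frame predictions. NumPy is assumed.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def video_stability(pred_depths):
    """pred_depths: list of per-frame depth prediction maps of identical shape."""
    pair_rmses = [rmse(pred_depths[i], pred_depths[i + 1])
                  for i in range(len(pred_depths) - 1)]
    return float(np.mean(pair_rmses))  # smaller value => less jitter, higher stability
```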

In an exemplary embodiment, the present disclosure further provides a depth estimation method, which may use the depth estimation model obtained by the training method of the depth estimation model provided in any one of the above possible implementations to estimate the depth information of an image whose depth is to be estimated.
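A minimal inference sketch is given below, assuming PyTorch and a trained model such as the one sketched earlier; the preprocessing of the input image is an assumption and is not prescribed by the patent.

```python
import torch

@torch.no_grad()
def estimate_depth(model, image_tensor):
    """image_tensor: (3, H, W) float tensor of the image whose depth is to be estimated."""
    model.eval()
    depth_prediction = model(image_tensor.unsqueeze(0))  # (1, 1, H, W)
    return depth_prediction.squeeze(0)
```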

It should be noted that the beneficial effects of the depth estimation method provided by the embodiment of the present disclosure refer to the beneficial effects of the training method of the depth estimation model, and are not described again.

The foregoing describes the scheme provided by the embodiments of the present disclosure, primarily from a methodological perspective. To implement the above functions, it includes hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the present disclosure can be implemented in hardware or a combination of hardware and computer software for the various exemplary method steps described in connection with the embodiments disclosed herein. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The embodiment of the present disclosure may perform functional module division on a training device or a depth estimation device of a depth estimation model according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiments of the present disclosure is illustrative, and is only one division of logic functions, and there may be another division in actual implementation.

FIG. 4 is a block diagram illustrating a training apparatus for a depth estimation model according to an exemplary embodiment. Referring to fig. 4, the training apparatus 200 of the depth estimation model includes a prediction module 201, an acquisition module 202, and a training module 203. Wherein: the prediction module 201 is configured to: and respectively inputting the n sample images and the n simulated images which are in one-to-one correspondence with the n sample images into a preset depth estimation model to obtain depth prediction results respectively corresponding to the n sample images and depth prediction results of the n simulated images. Wherein n is a positive integer. Each analog image is obtained by performing amplification processing on the corresponding sample image. The n sample images and the n simulated images are labeled with pixel depth values. The acquisition module 202 is configured to: obtaining n first losses according to the pixel point depth value of each sample image and the depth prediction result of each sample image; and obtaining n second losses according to the depth prediction result of each sample image and the depth prediction result of the simulated image corresponding to each sample image. Or obtaining n second losses according to the depth value of the pixel point of each sample image, the depth prediction result of each sample image, and the depth prediction result of the simulated image corresponding to each sample image. The training module 203 is configured to: and training a preset depth estimation model according to the n first losses and the n second losses to obtain a first target depth estimation model. For example, in conjunction with fig. 1, the prediction module 201 may be configured to perform S100, and the acquisition module 202 may be configured to perform S101-S102. The training module 203 may be configured to perform S103.

Optionally, the obtaining module 202 is further configured to: and respectively carrying out augmentation processing on each sample image by using a first method to obtain a simulation image corresponding to each sample image. The first method comprises the following steps: at least one of mirrored, rotated, or translated. The obtaining module 202 is specifically configured to: and respectively carrying out augmentation processing on the depth prediction result of each simulated image by using the inverse process of the first method to obtain n first depth prediction results respectively corresponding to each simulated image. And obtaining the loss of each first depth prediction result and the depth prediction result of the sample image corresponding to each first depth prediction result respectively to obtain n second losses.

Optionally, when the first method includes any one of mirroring, rotating, or translating, the obtaining module 202 is specifically configured to: project each sample image into three-dimensional space according to a camera imaging equation to obtain n first point clouds; process each first point cloud using a preset first rotation matrix and a preset first translation vector to obtain n second point clouds; back-project each second point cloud according to the camera imaging equation; and obtain the n simulated images corresponding to the respective sample images according to the coordinate relation between each sample image and the back-projected second point cloud corresponding to that sample image.
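The projection and back-projection can be pictured with a pinhole camera model. The NumPy sketch below uses placeholder intrinsics and a placeholder rotation/translation, which are not values from the disclosure, and it omits the final resampling that would actually produce the simulated image.

```python
import numpy as np

# Pinhole-camera sketch: project to 3D, rigidly transform, back-project.
# fx, fy, cx, cy, R1, t1 are illustrative placeholders.
fx = fy = 500.0
cx, cy = 160.0, 120.0
H, W = 240, 320

depth = np.random.uniform(1.0, 5.0, size=(H, W))   # labeled depth of a sample image

# 1) Project every pixel to 3D (first point cloud).
v, u = np.mgrid[0:H, 0:W]
X = (u - cx) * depth / fx
Y = (v - cy) * depth / fy
cloud1 = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

# 2) Apply the preset first rotation matrix and first translation vector.
theta = np.deg2rad(3.0)
R1 = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
               [ 0.0,           1.0, 0.0          ],
               [-np.sin(theta), 0.0, np.cos(theta)]])
t1 = np.array([0.05, 0.0, 0.0])
cloud2 = cloud1 @ R1.T + t1                         # second point cloud

# 3) Back-project the second point cloud with the same imaging equation.
Z2 = cloud2[:, 2]
u2 = fx * cloud2[:, 0] / Z2 + cx
v2 = fy * cloud2[:, 1] / Z2 + cy

# The coordinate relation (u, v) -> (u2, v2) is what maps the sample image onto
# the simulated image; the resampling/interpolation step is omitted here.
```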

Optionally, the obtaining module 202 is specifically configured to: process the depth prediction result of each simulated image using a preset second rotation matrix and a preset second translation vector to obtain the n first depth prediction results corresponding to the respective simulated images. The second rotation matrix is the inverse of the first rotation matrix, and the sum of the first translation vector and the second translation vector is 0.
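One reading that makes this relation exact is a forward transform that translates the point cloud first and then rotates it: if the augmentation maps a point p to p' = R_1(p + t_1), then choosing R_2 = R_1^{-1} and t_2 = -t_1 gives R_2 p' + t_2 = R_1^{-1} R_1 (p + t_1) - t_1 = p, so processing the simulated prediction with the second rotation matrix and second translation vector restores the original geometry. This interpretation is offered only as an assumption to illustrate the relation between the two rotation matrices and translation vectors.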

Optionally, the obtaining module 202 is further configured to: perform augmentation processing on the depth prediction result of each sample image using the first method to obtain n projection depth prediction results corresponding to the respective sample images, where the first method includes at least one of mirroring, rotation, or translation; obtain n third losses according to the pixel point depth value of each simulated image and the depth prediction result of each simulated image; and obtain n fourth losses according to each projection depth prediction result and the depth prediction result of the simulated image corresponding to that projection depth prediction result. The training module 203 is further configured to: train the first target depth estimation model according to the n third losses and the n fourth losses to obtain a second target depth estimation model.
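To make this second training stage concrete, the sketch below (PyTorch, hypothetical names) computes the third and fourth losses; `augment` stands for the first method applied to a depth map, for example the horizontal flip used earlier.

```python
import torch.nn.functional as F

# Hypothetical sketch of the second training stage: sample and simulated images
# swap roles relative to the first stage.
def second_stage_losses(model, simulated, gt_sim_depth, pred_sample, augment):
    pred_sim = model(simulated)

    # Third losses: simulated prediction vs. labeled depth of the simulated image.
    third = F.l1_loss(pred_sim, gt_sim_depth)

    # Projection depth prediction: the first method applied to the sample prediction.
    proj_pred = augment(pred_sample)

    # Fourth losses: projected sample prediction vs. simulated prediction.
    fourth = F.l1_loss(proj_pred, pred_sim)
    return third, fourth
```

The first target depth estimation model would then be fine-tuned on a combination of the third and fourth losses to obtain the second target depth estimation model.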

Optionally, the obtaining module 202 is specifically configured to: perform augmentation processing on each sample image using the first method, and add noise to each processed sample image to obtain the simulated image of that sample image.
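A toy illustration of augmentation followed by noise injection; Gaussian noise is used here only as an example, since the disclosure does not fix the noise type.

```python
import torch

image = torch.rand(1, 3, 240, 320)                # a sample image
augmented = torch.flip(image, dims=[-1])          # first method: mirroring
noise = 0.01 * torch.randn_like(augmented)        # example noise, not prescribed
simulated = (augmented + noise).clamp(0.0, 1.0)   # simulated image of the sample
```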

Optionally, the preset depth estimation model is a model based on a convolutional neural network (CNN).
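As a purely illustrative example of what such a CNN could look like, the following is a generic encoder-decoder sketch in PyTorch; it is not the network architecture of the disclosure.

```python
import torch
import torch.nn as nn


class TinyDepthCNN(nn.Module):
    """Generic encoder-decoder CNN mapping an RGB image to a one-channel depth map."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),  # positive depth
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```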

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here. In addition, for the explanation and the description of the beneficial effects of the training apparatus 200 for any depth estimation model provided above, reference may be made to the corresponding method embodiments described above, and details are not repeated.

Fig. 5 is a block diagram illustrating a depth estimation apparatus according to an exemplary embodiment. Referring to fig. 5, the depth estimation apparatus 500 includes an acquisition module 501 and a prediction module 502. The acquisition module 501 is configured to acquire an image whose depth is to be estimated; the prediction module 502 is configured to input the image into the depth estimation model trained according to the above method, so as to obtain a depth prediction result of the image.
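A hypothetical usage sketch of the two modules at inference time, reusing the TinyDepthCNN sketch above as a stand-in for the trained target depth estimation model (the checkpoint path is assumed):

```python
import torch

model = TinyDepthCNN()                               # stand-in for the trained model
model.load_state_dict(torch.load("depth_model.pt", map_location="cpu"))
model.eval()

image = torch.rand(1, 3, 240, 320)                   # image whose depth is to be estimated
with torch.no_grad():
    depth_pred = model(image)                        # depth prediction result
```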

FIG. 6 is a block diagram illustrating an electronic device according to an exemplary embodiment. As shown in fig. 6, the electronic device 400 includes, but is not limited to: a processor 401, a memory 402, a display 403, an input unit 404, an interface unit 405, a power supply 406, and the like.

The memory 402 is configured to store instructions executable by the processor 401, and the processor 401 is configured to execute the instructions to perform any of the steps described above in the embodiment shown in fig. 1. That is, the block diagram of the electronic device 400 may serve as a hardware configuration diagram of the training apparatus 200 for the depth estimation model.

It should be noted that the electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown in fig. 6, combine some components, or arrange the components differently, as will be understood by those skilled in the art.

The processor 401 is the control center of the electronic device. It connects the various parts of the whole electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the electronic device as a whole. The processor 401 may include one or more processing units; optionally, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.

The memory 402 may be used to store software programs as well as various data. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one functional unit (such as an acquisition unit, a transceiver unit, or a merging unit), and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.

The display 403 is used to display information input by the user or information provided to the user. The display 403 may include a display panel, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.

The input unit 404 may include a graphics processing unit (GPU), which processes image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display 403, and the image frames processed by the GPU may be stored in the memory 402 (or another storage medium).

The interface unit 405 is an interface for connecting an external device to the electronic device 400. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 405 may be used to receive input (e.g., data information) from the external device and transmit the received input to one or more elements within the electronic device 400, or may be used to transmit data between the electronic device 400 and the external device.

A power source 406 (e.g., a battery) may be used to supply power to the various components, and optionally, the power source 406 may be logically connected to the processor 401 via a power management system, so that functions such as managing charging, discharging, and power consumption are performed via the power management system.

In an exemplary embodiment, the disclosed embodiments also provide a storage medium comprising instructions, such as the memory 402 comprising instructions, which are executable by the processor 401 of the electronic device 400 to perform the above-described method. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In one example, referring to fig. 5, the receiving function of the acquisition module 501 may be implemented by the interface unit 405 in fig. 6. The processing functions of the acquisition module 501 and the prediction module 502 may both be implemented by the processor 401 in fig. 6 calling a computer program stored in the memory 402.

In an exemplary embodiment, the disclosed embodiments also provide a computer program product comprising one or more instructions executable by the processor 401 of the electronic device 400 to perform the above-described method.

It should be noted that the instructions in the storage medium, or the one or more instructions in the computer program product, are executed by the processor 401 to implement the processes of the above method embodiments and achieve the same technical effect.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
