Single-view three-dimensional object reconstruction method and device based on RGB data

Document No.: 138462 · Published: 2021-10-22

Reading note: this technology, "A single-view three-dimensional object reconstruction method and device based on RGB data", was created by 孔德慧, 高俊娜, 王少帆, 李敬华 and 王立春 on 2021-07-05. Its main content comprises: a single-view three-dimensional object reconstruction method and device based on RGB data, which convert the object three-dimensional reconstruction task into the generation problem of a basis-coefficient matrix pair and mine the relationship between the shapes of the visible and occluded parts, thereby obtaining three-dimensional voxels with accurate detail information and improving the reconstruction precision of the three-dimensional model. The method comprises: (1) establishing a generative model from latent features to the initial three-dimensional voxels, the latent features being linear combinations of a basis and coefficients; the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding and is matrix-factorized to obtain the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y realizes image-based three-dimensional model reconstruction; (2) modeling the voxel data as slice data and refining the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

1. A single-view three-dimensional object reconstruction method based on RGB data is characterized in that: the method comprises the following steps:

(1) Establishing a generative model from latent features to the initial three-dimensional voxels, wherein each latent feature is formed by linearly combining a basis and coefficients: the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding, and matrix decomposition of this latent space yields the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y then realizes image-based three-dimensional model reconstruction.

(2) Modeling the voxel data as slice data, and refining the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

2. The RGB data-based single-view three-dimensional object reconstruction method of claim 1, wherein: the step (1) comprises the following sub-steps:

(1.1) learning the latent features S of the three-dimensional voxels in the training set by an auto-encoder, and then defining a set of bases by using SVD;

(1.2) extracting a feature representation Z of the input image with an image encoder, clustering the latent embeddings of all instances in each object class, and taking the clustering result as a shape prior B; then designing a Transformer-based network to regress the coefficients, wherein the self-attention mechanism models and fuses the image visual features and the prior information to explore their association, strengthening the contextual dependencies of the features and learning complex semantic abstractions to obtain a better coefficient representation Y.

3. The RGB data-based single-view three-dimensional object reconstruction method of claim 2, wherein: in the step (1.1), the basis matrix Θ_{F×K} is obtained through SVD of the feature matrix S, i.e., S_{F×G} = U_{F×F} Σ_{F×G} V_{G×G}^T, where U_{F×F} and V_{G×G} contain the left and right singular vectors and Σ_{F×G} is the diagonal matrix of singular values.

4. A single-view three-dimensional object reconstruction method based on RGB data as recited in claim 3, wherein: in the step (1.1), Θ_{F×K} consists of the columns of the left singular vector matrix U_{F×F} corresponding to the K largest singular values.

5. The method for single-view three-dimensional object reconstruction based on RGB data as claimed in claim 4, wherein: in the step (1.2), a Transformer encoder is used to model and fuse the visual image features and the prior information to obtain the coefficient representation; the encoder comprises L identical blocks, each block comprising two sub-layers: the first sub-layer is a multi-head self-attention mechanism, the second sub-layer is a multi-layer perceptron network, and each of the two sub-layers uses a residual connection; self-attention, the kernel component of the Transformer, relates different positions of the feature maps and is described as a mapping function that maps the query matrix Q, the key matrix K and the value matrix V to an output attention matrix, where Q, K and V are matrices and the output is computed as a weighted sum of the values, the weight assigned to each value being computed from the corresponding key and query; in the attention operation, the scale factor √d provides proper normalization, preventing extremely small gradients when a large d makes the dot products grow in magnitude; the output of the scaled dot-product attention is expressed as:

Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

where Q, K and V are computed from the embedded feature Z through linear transformations with W^Q, W^K and W^V:

Q = ZW^Q,  K = ZW^K,  V = ZW^V    (2)

the multi-head self-attention layer MSA uses multiple heads to jointly model information from different representation subspaces at different positions, each head applying scaled dot-product attention in parallel; the output of the multi-head self-attention is the concatenation of the h attention-head outputs:

MSA(Q, K, V) = Concat(H_1, H_2, ..., H_h) W^out

H_i = Attention(Q_i, K_i, V_i),  i ∈ [1, ..., h]    (3)

given embedded featuresThen the characteristic Transfomer encoder structure of the L layer is represented as:

Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1},  l = 1, 2, ..., L

Z_l = MLP(LN(Z'_l)) + Z'_l,  l = 1, 2, ..., L

Y = LN(Z_L)    (4)

where LN(·) is defined as the layer normalization operation, and the final encoder output Y is the obtained coefficient representation; the coefficients are multiplied with the basis, and the product is sent to a decoder for decoding to obtain the reconstructed initial three-dimensional voxels V_coa.

6. The RGB data-based single-view three-dimensional object reconstruction method of claim 5, wherein: in the step (1.2),

the loss function is measured between the reconstructed three-dimensional voxels and the real three-dimensional voxels using the mean of the voxel-based binary cross entropy, defined as:

L_rec = -(1/N) Σ_{i=1}^{N} [ v_i^gt log v_i^coa + (1 - v_i^gt) log(1 - v_i^coa) ]    (5)

where N represents the number of voxels in the three-dimensional object, and v_i^coa and v_i^gt represent the predicted occupancy of the initial voxels and the occupancy of the corresponding real voxels; the smaller the loss, the closer the prediction result is to the real voxels.

7. The RGB data-based single-view three-dimensional object reconstruction method of claim 6, wherein: in the step (2),

for each three-dimensional voxel, it is first defined as V; the slicing direction is then defined such that slicing the three-dimensional voxel along the x-y coordinate plane yields a set {V_1, V_2, ..., V_{d_r}}, a two-dimensional slice sequence of length d_r whose slices are of size d_r × d_r; each two-dimensional slice is converted into a feature vector of size D_l taken as that slice's feature, so the slice feature matrix T is of size d_r × D_l; this feature matrix is fed into a Transformer encoder, and the L-layer Transformer encoder structure is then expressed as:

T'_l = MSA(LN(T_{l-1})) + T_{l-1},  l = 1, 2, ..., L

T_l = MLP(LN(T'_l)) + T'_l,  l = 1, 2, ..., L

M = LN(T_L)    (6)

where LN(·) is defined as the layer normalization operation; the Transformer encoder contains L layers, and the output M keeps the same size as the encoder input T; the optimized slices are then stitched to form complete and accurate three-dimensional voxels.

8. The RGB data-based single-view three-dimensional object reconstruction method of claim 7, wherein: in the step (2),

the loss function comprises a refined reconstruction loss that makes the predicted three-dimensional shape as close as possible to the true three-dimensional shape; the loss function L_Rrec is defined as:

L_Rrec = -(1/N) Σ_{i=1}^{N} [ v_i^gt log v_i^ref + (1 - v_i^gt) log(1 - v_i^ref) ]

where N represents the number of voxels in the three-dimensional object, and v_i^ref and v_i^gt represent the predicted occupancy of the refined voxels and the occupancy of the corresponding real voxels.

9. A single-view three-dimensional object reconstruction device based on RGB data, characterized in that it comprises:

a three-dimensional reconstruction module, which establishes a generative model from latent features to the initial three-dimensional voxels, each latent feature being a linear combination of a basis and coefficients: the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding, and matrix decomposition of this latent space yields the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y then realizes image-based three-dimensional model reconstruction; and

a three-dimensional voxel refinement module, which models the voxel data as slice data and refines the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

Technical Field

The invention relates to the technical field of computer vision and pattern recognition, and in particular to a single-view three-dimensional object reconstruction method based on RGB data and a corresponding single-view three-dimensional object reconstruction device.

Background

Three-dimensional object reconstruction based on computer vision technology is an important subject in scientific research and human life, and has very wide applications in fields such as human-computer interaction, augmented/virtual reality, medical diagnosis and automatic driving.

One of the main goals of three-dimensional reconstruction based on computer vision techniques is to recover the three-dimensional structure of an object from two-dimensional images acquired by a vision sensor. At present, three-dimensional object reconstruction methods based on RGB images fall into traditional methods and deep-learning-based methods. Traditional three-dimensional reconstruction methods solve the reconstruction problem from a geometric angle: they require matching features between multiple images captured from different viewpoints and rely on manually extracted features to recover the three-dimensional shape of the object. However, due to appearance changes or self-occlusion, images from different viewpoints differ greatly, so establishing accurate feature correspondences is very difficult and the reconstructed model often lacks detail. In addition, traditional methods perform three-dimensional reconstruction from geometric cues such as shading, texture, contours and photometric stereo, and place high requirements on the image acquisition environment, so constraint conditions are usually imposed to obtain consistent reconstruction results. These methods also typically require a precisely calibrated camera and high-quality visual imaging elements to acquire images of the object, which undoubtedly increases the difficulty of model reconstruction.

In recent years, the rapid development of deep learning and the advent of large 3D databases have drawn attention to data-driven reconstruction of three-dimensional objects. Deep-learning-based three-dimensional reconstruction overcomes the defects of traditional methods and provides a new approach to high-quality three-dimensional reconstruction. Existing deep-learning-based methods comprise two kinds: recurrent-neural-network-based and convolutional-neural-network-based. Recurrent-neural-network-based methods treat three-dimensional reconstruction as a sequence learning problem and use a recurrent neural network to fuse features extracted from the input images to reconstruct a three-dimensional model. However, given different orderings of the input images, such methods do not produce consistent three-dimensional reconstruction results. Moreover, long-term correlations of the sequence are difficult to capture due to vanishing or exploding gradients, and important features of the input images may be forgotten as the network depth increases, resulting in incomplete three-dimensional shapes. Convolutional-neural-network-based methods solve these problems by processing all input images in the sequence in parallel. At present, most convolutional-neural-network-based methods adopt an encoder-decoder framework, in which an encoder encodes a two-dimensional image into a latent feature and a decoder decodes the latent feature into a three-dimensional shape. However, this type of method does not take into account the correlations between different objects in the shape space. In addition, it usually introduces an average shape prior to supplement the class features of the model, but the average shape prior weakens some instance-specific features, and conventional processing does not consider how the shape prior should complement the visual features. Furthermore, the internal geometric associations between the visible and occluded parts of an object are insufficiently mined, which is another defect of existing methods. These limitations restrict the realization of refined three-dimensional reconstruction.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a single-view three-dimensional object reconstruction method based on RGB data, which converts the object three-dimensional reconstruction task into the generation problem of a basis-coefficient matrix pair and mines the relationship between the shapes of the visible and occluded parts, thereby obtaining three-dimensional voxels with accurate detail information and improving the reconstruction precision of the three-dimensional model.

The technical scheme of the invention is as follows: the single-view three-dimensional object reconstruction method based on RGB data comprises the following steps:

(1) Establishing a generative model from latent features to the initial three-dimensional voxels, wherein each latent feature is formed by linearly combining a basis and coefficients: the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding, and matrix decomposition of this latent space yields the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y then realizes image-based three-dimensional model reconstruction.

(2) Modeling the voxel data as slice data, and refining the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

The basis of the latent space can be derived from the three-dimensional shapes, and a regression network is constructed at the same time to obtain the latent-space representation coefficients corresponding to a two-dimensional image; the combination of the two realizes image-based three-dimensional model reconstruction. A Transformer is then built over three-dimensional voxel slices of the reconstructed initial model to mine the dependency between the shapes of the visible and occluded parts, so that three-dimensional voxels with accurate detail information are obtained and the reconstruction accuracy of the three-dimensional model is improved.

There is also provided a single-view three-dimensional object reconstruction apparatus based on RGB data, the apparatus comprising:

a three-dimensional reconstruction module, which establishes a generative model from latent features to the initial three-dimensional voxels, each latent feature being a linear combination of a basis and coefficients: the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding, and matrix decomposition of this latent space yields the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y then realizes image-based three-dimensional model reconstruction; and

a three-dimensional voxel refinement module, which models the voxel data as slice data and refines the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

Drawings

Fig. 1 shows an overall block diagram of a single-view three-dimensional object reconstruction apparatus based on RGB data according to the present invention.

Fig. 2 shows a block diagram of the Transformer encoder.

Fig. 3 shows a block diagram of the multi-head attention.

Fig. 4 shows single-view reconstruction results on the ShapeNet dataset.

Detailed Description

The single-view three-dimensional object reconstruction method based on RGB data comprises the following steps:

(1) Establishing a generative model from latent features to the initial three-dimensional voxels, wherein each latent feature is formed by linearly combining a basis and coefficients: the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding, and matrix decomposition of this latent space yields the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y then realizes image-based three-dimensional model reconstruction.

(2) Modeling the voxel data as slice data, and refining the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

The basis of the latent space can be derived from the three-dimensional shapes, and a regression network is constructed at the same time to obtain the latent-space representation coefficients corresponding to a two-dimensional image; the combination of the two realizes image-based three-dimensional model reconstruction. A Transformer is then built over three-dimensional voxel slices of the reconstructed initial model to mine the dependency between the shapes of the visible and occluded parts, so that three-dimensional voxels with accurate detail information are obtained and the reconstruction accuracy of the three-dimensional model is improved.

Preferably, the step (1) comprises the following substeps:

(1.1) learning the latent features S of the three-dimensional voxels in the training set by an auto-encoder, and then defining a set of bases by using SVD;

(1.2) extracting a feature representation Z of the input image with an image encoder, clustering the latent embeddings of all instances in each object class, and taking the clustering result as a shape prior B; then designing a Transformer-based network to regress the coefficients, wherein the self-attention mechanism models and fuses the image visual features and the prior information to explore their association, strengthening the contextual dependencies of the features and learning complex semantic abstractions to obtain a better coefficient representation Y.

Preferably, in the step (1.1), the basis matrix Θ_{F×K} is obtained through SVD of the feature matrix S, i.e., S_{F×G} = U_{F×F} Σ_{F×G} V_{G×G}^T, where U_{F×F} and V_{G×G} contain the left and right singular vectors and Σ_{F×G} is the diagonal matrix of singular values.

Preferably, in the step (1.1), Θ_{F×K} consists of the columns of the left singular vector matrix U_{F×F} corresponding to the K largest singular values.

Preferably, in the step (1.2), a Transformer encoder is used to model and fuse the visual image features and the prior information to obtain the coefficient representation; the encoder comprises L identical blocks, each block comprising two sub-layers: the first sub-layer is a multi-head self-attention mechanism, the second sub-layer is a multi-layer perceptron network, and each of the two sub-layers uses a residual connection; self-attention, the kernel component of the Transformer, relates different positions of the feature maps and is described as a mapping function that maps the query matrix Q, the key matrix K and the value matrix V to an output attention matrix, where Q, K and V are matrices and the output is computed as a weighted sum of the values, the weight assigned to each value being computed from the corresponding key and query; in the attention operation, the scale factor √d provides proper normalization, preventing extremely small gradients when a large d makes the dot products grow in magnitude; the output of the scaled dot-product attention is expressed as:

Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

where Q, K and V are computed from the embedded feature Z through linear transformations with W^Q, W^K and W^V:

Q = ZW^Q,  K = ZW^K,  V = ZW^V    (2)

the multi-head self-attention layer MSA uses multiple heads to jointly model information from different representation subspaces at different positions, each head applying scaled dot-product attention in parallel; the output of the multi-head self-attention is the concatenation of the h attention-head outputs:

MSA(Q, K, V) = Concat(H_1, H_2, ..., H_h) W^out

H_i = Attention(Q_i, K_i, V_i),  i ∈ [1, ..., h]    (3)

given embedded featuresThen the characteristic Transfomer encoder structure of the L layer is represented as:

Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1},  l = 1, 2, ..., L

Z_l = MLP(LN(Z'_l)) + Z'_l,  l = 1, 2, ..., L

Y = LN(Z_L)    (4)

where LN(·) is defined as the layer normalization operation, and the final encoder output Y is the obtained coefficient representation; the coefficients are multiplied with the basis, and the product is sent to a decoder for decoding to obtain the reconstructed initial three-dimensional voxels V_coa.

Preferably, in the step (1.2),

the loss function is measured between the reconstructed three-dimensional voxels and the real three-dimensional voxels using the mean of the voxel-based binary cross entropy, defined as:

L_rec = -(1/N) Σ_{i=1}^{N} [ v_i^gt log v_i^coa + (1 - v_i^gt) log(1 - v_i^coa) ]    (5)

where N represents the number of voxels in the three-dimensional object, and v_i^coa and v_i^gt represent the predicted occupancy of the initial voxels and the occupancy of the corresponding real voxels; the smaller the loss, the closer the prediction result is to the real voxels.

Preferably, in the step (2),

for each three-dimensional voxel, it is first defined as V; the slicing direction is then defined such that slicing the three-dimensional voxel along the x-y coordinate plane yields a set {V_1, V_2, ..., V_{d_r}}, a two-dimensional slice sequence of length d_r whose slices are of size d_r × d_r; each two-dimensional slice is converted into a feature vector of size D_l taken as that slice's feature, so the slice feature matrix T is of size d_r × D_l; this feature matrix is fed into a Transformer encoder, and the L-layer Transformer encoder structure is then expressed as:

T'_l = MSA(LN(T_{l-1})) + T_{l-1},  l = 1, 2, ..., L

T_l = MLP(LN(T'_l)) + T'_l,  l = 1, 2, ..., L

M = LN(T_L)    (6)

where LN(·) is defined as the layer normalization operation; the Transformer encoder contains L layers, and the output M keeps the same size as the encoder input T; the optimized slices are then stitched to form complete and accurate three-dimensional voxels.

Preferably, in the step (2),

the loss function comprises a refined reconstruction loss that makes the predicted three-dimensional shape as close as possible to the real three-dimensional shape; its loss function L_Rrec is defined as:

L_Rrec = -(1/N) Σ_{i=1}^{N} [ v_i^gt log v_i^ref + (1 - v_i^gt) log(1 - v_i^ref) ]

where N represents the number of voxels in the three-dimensional object, and v_i^ref and v_i^gt represent the predicted occupancy of the refined voxels and the occupancy of the corresponding real voxels.

As shown in Fig. 1, there is also provided a single-view three-dimensional object reconstruction apparatus based on RGB data, the apparatus comprising:

a three-dimensional reconstruction module, which establishes a generative model from latent features to the initial three-dimensional voxels, each latent feature being a linear combination of a basis and coefficients: the samples in the training set span a shape space, whose shape latent space is obtained after encoding and decoding, and matrix decomposition of this latent space yields the basis representation Θ; a coefficient regression network realizes the coefficient regression task, regressing each image in the test set through the encoding process to the coefficient matrix Y of its corresponding shape; the linear combination of the basis Θ and the coefficients Y then realizes image-based three-dimensional model reconstruction; and

a three-dimensional voxel refinement module, which models the voxel data as slice data and refines the initial three-dimensional voxels with the designed slice Transformer, realizing refined image-based three-dimensional model reconstruction.

The present invention is described in more detail below.

The main key technical problems solved by the invention comprise: deriving a basis matrix from the three-dimensional shapes to construct a better latent feature space, and constructing a coefficient regression network to regress the coefficient representation from the image, so that the object three-dimensional reconstruction task is converted into the generation problem of a basis-coefficient matrix pair; and designing the slice Transformer to mine the relationship between the shapes of the visible and occluded parts, yielding three-dimensional voxels with precise detail information. Finally, the invention improves the reconstruction precision of the three-dimensional model.

The same space has different expressions (coefficient matrices) with respect to different bases, and these expressions can be converted into one another through matrix transformations. Based on this principle, the correspondence between three-dimensional shapes and their two-dimensional projection representations (images) guarantees the essential identity of the spaces in which they live, so both can be embedded into a certain intermediate space to unify their representation models. For this intermediate space, the space basis can be derived from the three-dimensional shapes, while the representation coefficients of each shape can be obtained either by computation from the shape or by regression from the corresponding image. On this basis, a shape reconstruction method based on a latent-space feature representation model is proposed: the shape space spanned by the samples in the training set is encoded and decoded to obtain the intermediate latent space, which is then matrix-factorized to obtain the basis representation Θ; the images in the test set are regressed through a regression network to the coefficient matrix Y of the corresponding shape; and the linear combination of the two realizes image-based three-dimensional model reconstruction.

The invention mainly comprises three key technical points: 1) establishing a generative model from latent features to initial three-dimensional voxels, where the basis is obtained by matrix decomposition of the shape latent space; 2) realizing the coefficient regression task with a coefficient regression network; 3) modeling the voxel data as slice data and refining the initial three-dimensional voxels with the designed slice Transformer.

1. Three-dimensional reconstruction based on latent space feature representation model

The main work of this part is to learn the basis representation from the shape latent space, regress the coefficients from the prior knowledge and the image visual features, and then multiply the basis and the coefficients and send the product to a decoder to obtain the predicted initial three-dimensional shape.

1.1 Basis representation module

This module mainly learns the most relevant features of the shape latent space: extracting them by matrix decomposition reduces the feature dimensionality and simplifies the network output while suppressing interference from irrelevant information. Specifically, we first learn the latent features S of the three-dimensional voxels in the training set with an auto-encoder and then define a set of bases using SVD: the basis matrix Θ_{F×K} is computed from the SVD of the feature matrix S, i.e., S_{F×G} = U_{F×F} Σ_{F×G} V_{G×G}^T, where U_{F×F} and V_{G×G} contain the left and right singular vectors and Σ_{F×G} is the diagonal matrix of singular values. More specifically, Θ_{F×K} consists of the columns of U_{F×F} corresponding to the K largest singular values.
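
As a concrete illustration, the following minimal NumPy sketch computes such a basis and shows how a regressed coefficient vector would be recombined with it; the matrix sizes and the helper name compute_basis are illustrative assumptions, not part of the patented network:

```python
import numpy as np

def compute_basis(S: np.ndarray, K: int) -> np.ndarray:
    """Basis Theta (F x K): left singular vectors of S (F x G) associated
    with the K largest singular values."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)  # S = U @ diag(sigma) @ Vt
    return U[:, :K]                                       # Theta_{F x K}

# Illustrative sizes: F = 256-dim latent features of G = 1000 training shapes.
S = np.random.randn(256, 1000)
Theta = compute_basis(S, K=64)      # (256, 64)
y = np.random.randn(64)             # coefficient vector regressed from an image
s_hat = Theta @ y                   # latent feature that would be sent to the decoder
```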

1.2 Coefficient representation module

This module regresses the coefficient representation with a Transformer-based network. Specifically, a feature representation Z of the input image is extracted with an image encoder. Furthermore, for each object class, the latent embeddings of all instances within that class are clustered, and the clustering result is taken as a shape prior B. A Transformer-based network is then designed to regress the coefficients: the self-attention mechanism models and fuses the image visual features and the prior information to explore their association, strengthening the contextual dependencies of the features and learning complex semantic abstractions to obtain a better coefficient representation Y.

The Transformer encoder is used to model and fuse the visual image features and the prior information to obtain the coefficient representation; its structure is shown in Fig. 2. Specifically, the encoder comprises L identical blocks, each with two sub-layers: the first sub-layer is a multi-head self-attention mechanism, and the second is a multi-layer perceptron network; each of the two sub-layers uses a residual connection. Self-attention is the kernel component of the Transformer, relating different positions of the feature maps. Self-attention can be described as a mapping function that maps the query matrix Q, the key matrix K and the value matrix V to an output attention matrix, where Q, K and V are all matrices. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed from the corresponding key and query. In the attention operation, the scale factor √d provides proper normalization, preventing extremely small gradients when a large d makes the dot products grow in magnitude. The output of the scaled dot-product attention can thus be expressed as:

Attention(Q, K, V) = softmax(QK^T / √d) V

where Q, K and V are computed from the embedded feature Z through linear transformations with W^Q, W^K and W^V:

Q = ZW^Q,  K = ZW^K,  V = ZW^V
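
For reference, a minimal PyTorch sketch of this scaled dot-product attention and the projections of Eq. (2) is given below; the sequence length, feature width and variable names are illustrative assumptions:

```python
import math
import torch

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, as in Eq. (1)."""
    d = Q.size(-1)                                      # query/key dimension
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)     # scaled dot products
    return torch.softmax(scores, dim=-1) @ V            # weighted sum of the values

# Q, K and V are linear transformations of the embedded features Z (Eq. (2)).
n, d = 16, 64                                           # illustrative sizes
Z = torch.randn(n, d)
W_Q, W_K, W_V = (torch.nn.Linear(d, d, bias=False) for _ in range(3))
out = scaled_dot_product_attention(W_Q(Z), W_K(Z), W_V(Z))   # shape (n, d)
```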

a multi-headed self-attention layer (MSA) jointly models information representing subspaces from different locations using multiple heads. The multi-headed self-noticed structure is shown in fig. 3. Each head uses scaled dot product attention in parallel. The final multi-headed self-attention output will be the concatenation of the h attention head outputs:

MSA(Q, K, V) = Concat(H_1, H_2, ..., H_h) W^out

H_i = Attention(Q_i, K_i, V_i),  i ∈ [1, ..., h]
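
A plausible PyTorch sketch of this multi-head layer follows; fusing W^Q, W^K and W^V into a single projection is a common implementation convenience and an assumption here, not something the text prescribes:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """MSA(Q, K, V) = Concat(H_1, ..., H_h) W^out with h parallel scaled
    dot-product attention heads, as in Eq. (3)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)   # fused W^Q, W^K, W^V
        self.w_out = nn.Linear(dim, dim, bias=False)        # W^out

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (batch, n, dim)
        b, n, _ = z.shape
        qkv = self.to_qkv(z).view(b, n, 3, self.heads, self.d_head)
        q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))   # (b, h, n, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5      # per-head logits
        heads = torch.softmax(scores, dim=-1) @ v                  # (b, h, n, d_head)
        return self.w_out(heads.transpose(1, 2).reshape(b, n, -1)) # concat + W^out
```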

given embedded featuresThen the characteristic transmomer encoder structure of the L layer can be expressed as:

Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1},  l = 1, 2, ..., L

Z_l = MLP(LN(Z'_l)) + Z'_l,  l = 1, 2, ..., L

Y = LN(Z_L)
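
Putting the pieces together, the following sketch stacks L such pre-norm blocks exactly as in the three equations above, reusing the MultiHeadSelfAttention sketch; the depth and the MLP expansion ratio are illustrative assumptions:

```python
import torch.nn as nn

class TransformerEncoder(nn.Module):
    """L pre-norm blocks: Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1};
    Z_l = MLP(LN(Z'_l)) + Z'_l; output Y = LN(Z_L)."""
    def __init__(self, dim: int, heads: int, depth: int, mlp_ratio: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.ModuleList([
                nn.LayerNorm(dim),
                MultiHeadSelfAttention(dim, heads),      # sketch defined above
                nn.LayerNorm(dim),
                nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                              nn.Linear(mlp_ratio * dim, dim)),
            ]) for _ in range(depth))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, z):
        for ln1, msa, ln2, mlp in self.blocks:
            z = msa(ln1(z)) + z                          # first sub-layer + residual
            z = mlp(ln2(z)) + z                          # second sub-layer + residual
        return self.final_norm(z)                        # Y = LN(Z_L)
```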

where LN(·) is defined as the layer normalization operation. The final encoder output Y is the obtained coefficient representation.

The coefficients are multiplied with the basis, and the product is sent to a decoder for decoding to obtain the reconstructed initial three-dimensional voxels V_coa.

The loss function of this part is measured between the reconstructed three-dimensional voxels and the true three-dimensional voxels using the mean of the voxel-based binary cross entropy. More specifically, it can be defined as:

L_rec = -(1/N) Σ_{i=1}^{N} [ v_i^gt log v_i^coa + (1 - v_i^gt) log(1 - v_i^coa) ]

where N represents the number of voxels in the three-dimensional object, and v_i^coa and v_i^gt represent the predicted occupancy of the initial voxels and the occupancy of the corresponding real voxels. The smaller the loss, the closer the prediction result is to the real voxels.
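
A minimal sketch of this mean voxel-wise binary cross entropy; the epsilon clamp is an illustrative numerical-stability guard, not part of the definition:

```python
import torch

def voxel_bce_loss(v_pred: torch.Tensor, v_gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Mean voxel-wise binary cross entropy between predicted occupancies
    v_pred (values in (0, 1)) and ground-truth occupancies v_gt (values in {0, 1})."""
    v_pred = v_pred.clamp(eps, 1.0 - eps)   # guard against log(0)
    bce = -(v_gt * torch.log(v_pred) + (1.0 - v_gt) * torch.log(1.0 - v_pred))
    return bce.mean()                       # average over the N voxels
```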

2. Three-dimensional voxel refinement module

The initial three-dimensional voxels are obtained through the three-dimensional reconstruction module. Three-dimensional objects have local continuity and internal relevance, which is indispensable guiding information for three-dimensional voxel refinement. To capture these relations, a slice Transformer (S-Transformer) is designed to refine the voxels. The main operations are to model the voxel data as a two-dimensional slice sequence using symmetry, to find the relations among the local features of the three-dimensional voxels with the slice Transformer, and finally to stitch the optimized slices that capture these relations into the final three-dimensional voxels. The method can capture richer dependencies among the parts of the three-dimensional voxels, explore detail information, and finally obtain complete and reasonable three-dimensional voxels.

Everyday objects often have local symmetry, a property that facilitates restoring parts that are occluded or only partially observed. Most models in the public ShapeNet dataset used in our experiments are also symmetric, with the x-y plane as the symmetry plane. Each three-dimensional voxel grid is first defined as V; the slicing direction is then defined such that slicing the three-dimensional voxels along the x-y coordinate plane yields a set {V_1, V_2, ..., V_{d_r}}, a two-dimensional slice sequence of length d_r whose slices are of size d_r × d_r. Each two-dimensional slice is converted into a feature vector of size D_l taken as that slice's feature, so the slice feature matrix T is of size d_r × D_l. This feature matrix is fed into the Transformer encoder. The L-layer Transformer encoder structure can then be expressed as:

T'_l = MSA(LN(T_{l-1})) + T_{l-1},  l = 1, 2, ..., L

T_l = MLP(LN(T'_l)) + T'_l,  l = 1, 2, ..., L

M = LN(T_L)

where LN(·) is defined as the layer normalization operation. The Transformer encoder comprises L layers, and the output M keeps the same size as the encoder input T. The optimized slices are then stitched to form complete and accurate three-dimensional voxels.
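
The following sketch ties the slicing, encoding and stitching steps together under the stated shapes (d_r slices of size d_r × d_r, each embedded to a D_l-dimensional feature); the linear embedding/un-embedding layers, the default sizes and the final sigmoid are illustrative assumptions about details the text does not fix:

```python
import torch
import torch.nn as nn

class SliceTransformer(nn.Module):
    """Refine a coarse d_r x d_r x d_r voxel grid: slice it along the x-y plane
    into d_r two-dimensional slices, embed each slice as a D_l-dim feature,
    run the L-layer encoder, and stitch the refined slices back together."""
    def __init__(self, d_r: int = 32, d_l: int = 1024, heads: int = 8, depth: int = 6):
        super().__init__()
        self.d_r = d_r
        self.embed = nn.Linear(d_r * d_r, d_l)                 # slice -> D_l feature
        self.encoder = TransformerEncoder(d_l, heads, depth)   # sketch from Sec. 1.2
        self.unembed = nn.Linear(d_l, d_r * d_r)               # token -> refined slice

    def forward(self, v_coa: torch.Tensor) -> torch.Tensor:   # (batch, d_r, d_r, d_r)
        b = v_coa.size(0)
        t = self.embed(v_coa.reshape(b, self.d_r, -1))   # T: d_r tokens of dim D_l
        m = self.encoder(t)                              # M keeps the size of T
        v_ref = self.unembed(m).reshape(b, self.d_r, self.d_r, self.d_r)
        return torch.sigmoid(v_ref)                      # refined occupancy grid
```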

The loss function of this part comprises a refined reconstruction loss, so that the predicted three-dimensional shape is as close as possible to the real three-dimensional shape; its loss function L_Rrec is defined as:

L_Rrec = -(1/N) Σ_{i=1}^{N} [ v_i^gt log v_i^ref + (1 - v_i^gt) log(1 - v_i^ref) ]

where N represents the number of voxels in the three-dimensional object, and v_i^ref and v_i^gt represent the occupancy of the predicted refined voxels and the occupancy of the corresponding real voxels.

The invention has been verified on the public ShapeNet dataset and achieves good experimental results. Table 1 shows the single-view reconstruction results of the invention on the ShapeNet dataset; compared with other methods, the present method achieves the best results. Fig. 4 shows the subjective effect of some three-dimensional reconstructions on the ShapeNet dataset; the experimental results show that the algorithm achieves good reconstruction quality on various objects.

TABLE 1: single-view reconstruction results on the ShapeNet dataset

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; all simple modifications, equivalent variations and refinements made to the above embodiment in accordance with the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.
