Non-rigid alignment of volumetric performance capture

Abstract

An electronic device (100) estimates a pose of one or more objects in an environment based on a correspondence between a data volume (120) and a reference volume (125). The data volume (120) contains a data grid (117) of a current frame (105) captured by a depth camera (102), and the reference volume (125) contains a plurality of fused previous data frames. The electronic device predicts the correspondence between the data volume and the reference volume based on spectral embedding and performs a bi-directional non-rigid matching between the reference volume and the current data frame to refine the correspondence, thereby supporting location-based functionality. The correspondence provides constraints that accelerate convergence between the data volume and the reference volume.

1. A method, comprising:

computing a current data grid frame (117) based on at least one frame of depth data (105) captured at one or more depth cameras (102) and at least one frame of color data (107) captured at one or more color cameras (104) of an electronic device (100), wherein the current data grid frame comprises a three-dimensional (3D) point cloud representing an environment of the electronic device;

maintaining a data volume (120) comprising the current data grid frame;

maintaining a reference volume (125) comprising a plurality of fused data mesh frames based on previously captured depth data frames and color data frames;

calculating an initial correspondence between the data volume and the reference volume based on spectral embedding (206); and

estimating a current pose of at least one object in the environment based on the initial correspondence.

2. The method of claim 1, wherein calculating the initial correspondence comprises:

aggregating 3D point clouds of the data volume and the reference volume based on spatial pooling; and

calculating embedding points from the aggregated 3D point clouds based on a non-linear mapping to generate a correspondence field between the data volume and the reference volume.

3. The method of claim 2, further comprising:

aligning the reference volume with the data volume;

calculating an alignment error between the data volume and the reference volume based on the non-linear mapping; and

resetting the reference volume to include the current data grid frame in response to identifying that an alignment error between the data volume and the reference volume exceeds a threshold.

4. The method of claim 1, further comprising:

identifying non-rigid alignment parameters to warp the reference volume to the pose of the current data grid frame.

5. The method of claim 1, further comprising: calculating a dense correspondence between the data volume and the reference volume, comprising:

aligning the data volume to the reference volume to generate an embedded deformation map comprising a plurality of vertices; and

aligning the reference volume to the data volume based on the embedded deformation map to identify a correspondence between the reference volume and the vertices of the data volume.

6. The method of claim 5, wherein aligning the data volume to the reference volume comprises: matching the color value of each vertex of the data volume with the color value of each corresponding vertex of the reference volume.

7. The method of claim 6, further comprising representing the color value of each vertex of the data volume and the color value of each corresponding vertex of the reference volume as a single intensity value.

8. The method of claim 1, wherein calculating the initial correspondence comprises: calculating an affinity matrix that encodes a similarity between each 3D point of the point cloud of the data volume.

9. A method, comprising:

capturing depth data (105) of an environment at one or more depth cameras (102) of an electronic device (100);

capturing color data (107) of the environment at one or more color cameras (104) of the electronic device;

computing a current data grid frame (117) comprising corresponding three-dimensional (3D) points of a 3D point cloud based on the depth data and the color data;

maintaining a data volume (120) comprising the current data grid frame;

fusing the current data grid frame to a plurality of previously computed data grid frames to generate a reference volume (125);

calculating a sparse correspondence between the data volume and the reference volume based on spectral embedding;

computing a dense correspondence between the data volume and the reference volume based on a non-rigid alignment; and

estimating a current pose of at least one object in the environment based on the sparse correspondence.

10. The method of claim 9, wherein computing the sparse correspondence comprises:

aggregating 3D point clouds of the data volume and the reference volume based on spatial pooling; and

calculating embedding points from the aggregated 3D point clouds based on a non-linear mapping to generate a correspondence field between the data volume and the reference volume.

11. The method of claim 10, further comprising:

calculating an alignment error between the data volume and the reference volume based on the non-linear mapping; and

resetting the reference volume to include the current data grid frame in response to identifying that the alignment error between the data volume and the reference volume exceeds a threshold.

12. The method of claim 9, further comprising:

solving for non-rigid alignment parameters to warp the reference volume to the pose of the current data grid frame.

13. The method of claim 9, wherein calculating the dense correspondence between the data volume and the reference volume comprises:

aligning the data volume to the reference volume to generate an embedded deformation map comprising a plurality of vertices; and

aligning the reference volume with the data volume based on the embedded deformation map to identify a correspondence between the reference volume and the data volume.

14. The method of claim 13, further comprising: representing the color value of each vertex of the data volume and the color value of each corresponding vertex of the reference volume as a single intensity value.

15. The method of claim 9, wherein computing the sparse correspondence comprises: calculating an affinity matrix that encodes a similarity between each 3D point of the 3D point cloud of the data volume and the reference volume.

16. An electronic device (100) comprising:

at least one depth camera (102), the at least one depth camera (102) for capturing depth data (105) of an environment of the electronic device; and

a processor (110), the processor (110) being configured to:

computing a current data grid frame (117) comprising a three-dimensional (3D) point cloud of an environment of the electronic device based on at least one frame of depth data captured at the at least one depth camera;

maintaining a data volume (120) comprising the current data grid frame;

maintaining a reference volume (125) comprising a plurality of fused data mesh frames based on previously captured depth data frames and color data frames;

calculating a sparse correspondence between the data volume and the reference volume based on spectral embedding; and

estimating a current pose of at least one object in the environment based on the sparse correspondence.

17. The electronic device of claim 16, wherein the processor is to calculate the sparse correspondence based on:

aggregating 3D point clouds of the data volume and the reference volume based on spatial pooling; and

calculating embedding points from the aggregated 3D point clouds based on a non-linear mapping to generate a correspondence field between the data volume and the reference volume.

18. The electronic device of claim 17, wherein the processor is further configured to:

calculating an alignment error between the data volume and the reference volume based on the non-linear mapping; and

resetting the reference volume to include the current data grid frame in response to identifying that an alignment error between the data volume and the reference volume exceeds a threshold.

19. The electronic device of claim 16, wherein the processor is further configured to:

solving for non-rigid alignment parameters to warp the reference volume to the pose of the current data grid frame.

20. The electronic device of claim 16, wherein the processor is further configured to calculate a dense correspondence between the data volume and the reference volume based on:

aligning the data volume with the reference volume to generate an embedded deformation map comprising a plurality of vertices; and

aligning the reference volume with the data volume based on the embedded deformation map to identify a correspondence between the reference volume and the data volume.

Background

Real-time performance capture for modeling human poses, shapes, appearances, and motions is used as input for computer vision tasks such as hand, body, or object tracking, 3D reconstruction, and simultaneous localization and mapping (SLAM). For such tasks, each new frame of depth and image data is used for non-rigid scene reconstruction over time. However, non-rigid scene reconstruction is computationally intensive, suffers from over-smoothing of high-frequency details (such as faces), and is prone to tracking failures in the face of topology changes.

Drawings

The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.

FIG. 1 is a block diagram illustrating a motion tracking system that uses spectral embedding and backward and forward matching to correlate data volumes and evolved reference volumes to estimate the pose of objects in an environment, in accordance with some embodiments.

FIG. 2 is a flow diagram illustrating a method of correlating a data volume and an evolved reference volume to estimate a pose of an object in an environment using spectral embedding and backward and forward matching, in accordance with some embodiments.

Fig. 3 is a flow diagram illustrating a method of calculating a sparse correspondence field between a data volume and a reference volume based on a learned approximation of the spectral embedding of extracted points, according to some embodiments.

Fig. 4 is a flow diagram illustrating a method of backward and forward matching of a reference volume to a data volume according to some embodiments.

Detailed Description

The following description is intended to convey a thorough understanding of the present disclosure by providing numerous specific embodiments and details, including estimating a correspondence between a current data frame and a reference volume containing a plurality of fused previous data frames based on spectral embedding, and performing a bi-directional non-rigid matching between the reference volume and the current data frame to refine the correspondence. It is to be understood, however, that the present disclosure is not limited to these specific embodiments and details, which are examples only, and that the scope of the present disclosure is accordingly intended to be limited only by the appended claims and their equivalents. It should also be appreciated that, in light of known systems and methods, those skilled in the art will be able to utilize the present disclosure for its intended purposes and benefits in any number of alternative embodiments, in accordance with specific design and other needs.

Figs. 1-4 illustrate techniques by which a motion tracking system estimates the pose of one or more objects (e.g., people) in an environment based on a correspondence, estimated using spectral embedding, between a current data frame and a reference volume containing multiple fused previous data frames, and performs a bi-directional non-rigid matching between the reference volume and the current data frame to refine the correspondence, in order to support location-based functionality, such as augmented reality (AR) functionality, virtual reality (VR) functionality, visual localization/odometry, or other simultaneous localization and mapping (SLAM) functionality. The motion tracking system is an electronic device that computes a data grid based on the zero crossings of an extracted volume. The volume is a three-dimensional (3D) truncated signed distance function, an implicit surface (3D model) that represents the environment of the electronic device based on depth data captured by one or more depth cameras. The electronic device maintains both a data volume and an evolving reference volume. The data volume contains a data grid based on the current frame captured by the one or more depth cameras (referred to as the current data grid frame), and the reference volume is populated with an initial data grid frame and subsequent data grid frames, which are fused to the initial data grid frame. The electronic device predicts a correspondence between the data volume and the reference volume based on spectral embedding. The correspondence provides constraints that accelerate convergence between the data volume and the reference volume. In some embodiments, the electronic device improves convergence between the data volume and the reference volume based on a color term that corrects for drift in regions with finer geometry (e.g., faces). By tracking changes between the current data grid frame and the reference volume, the electronic device avoids tracking failures that can occur when relying solely on previous data grid frames.

In some embodiments, the electronic device performs a match from the reference volume to the data volume ("forward matching") and a match from the data volume to the reference volume ("backward matching") to account for surface topology changes. The electronic device uses the correspondences identified by this two-way matching as residuals in a final non-rigid matching stage to align the reference volume with the data volume. After aligning the reference volume with the data volume, the electronic device fuses the data volume into the reference volume and warps the reference volume to the data to estimate the pose of objects in the environment of the electronic device. In some embodiments, the electronic device incorporates high-frequency details and texture mapping to perform high-fidelity reconstruction of moving objects (such as people and deformable objects such as clothing) at relatively high speeds.

Fig. 1 illustrates an electronic device 100 in accordance with at least one embodiment of the present disclosure, the electronic device 100 configured to support real-time reconstruction of arbitrary non-rigid scenes used in virtual reality (VR) and/or augmented reality (AR) applications using depth map data. The electronic device 100 can include a user-portable mobile device such as a tablet computer, a computing-enabled cellular telephone (e.g., a "smartphone"), a notebook computer, a personal digital assistant (PDA), a remote gaming system, a television remote control, and so forth. In other embodiments, the electronic device 100 can include another type of mobile device, such as a head-mounted display, a single camera, a multi-sensor camera, and so forth. For ease of illustration, the electronic device 100 is generally described in the example context of a mobile device, such as a tablet computer or smartphone; however, the electronic device 100 is not limited to these example embodiments. The electronic device 100 includes a plurality of sensors (such as a depth camera 102 and a color camera 104) for acquiring information about a local environment (not shown) of the electronic device 100, and a processor 110 for analyzing the information acquired by the sensors 102, 104 and estimating the pose of objects in the environment of the electronic device 100.

The electronic device 100 acquires visual information (images) of the local environment in the form of depth data 105 via a depth camera 102 and color data 107 via a color (RGB) imaging camera 104. In one embodiment, imaging camera 104 is implemented as a wide-angle imaging camera with a fisheye lens or other wide-angle lens to provide a wide-angle view of the local environment.

In one embodiment, the depth camera 102 projects a modulated light pattern into the local environment using a modulated light illuminator (not shown) and captures reflections of the modulated light pattern as it reflects off objects in the local environment using one or more imaging sensors (not shown). In some embodiments, the illuminator emits infrared (IR) light. In some embodiments, the illuminator is a vertical-cavity surface-emitting laser (VCSEL). A VCSEL emits light from a larger surface area than other lasers and can therefore emit more light while remaining eye-safe. In some embodiments, the illuminator is coupled with a suitable mask (not shown) to emit structured light (i.e., a modulated light pattern). In some embodiments, these modulated light patterns are temporally modulated light patterns. The reflections of the captured modulated light pattern are referred to herein as a "depth map". In some embodiments, the depth camera 102 captures depth maps at a rate of approximately 200 frames per second, where each frame includes depth data for a particular period of time.

Processor 110 includes memory 115, data volume 120, reference volume 125, sparse correspondence module 130, dense correspondence module 135, final match module 140, and skinning module 145. Memory 115 is typically a storage device configured to store depth data 105 and color data 107, and thus may be a Random Access Memory (RAM) storage module, a non-volatile storage device (e.g., flash memory), or the like. Memory 115 may form part of a memory hierarchy of electronic device 100 and may include other memory modules, such as additional caches that are not shown in FIG. 1. The memory 115 is configured to receive and store depth data 105 from the depth camera 102 and color data 107 from the color camera 104.

The processor 110 generates a textured mesh (also referred to as a three-dimensional (3D) point cloud) based on the depth data 105 and the color data 107 and maintains the textured mesh at the data volume 120. The data volume 120 includes the textured mesh corresponding to the current frame, referred to as the current data grid frame 117. The processor 110 also maintains a reference volume 125, also referred to as a key volume. Initially, the processor 110 sets the first data grid frame 117 as a reference frame, or key frame, and performs a non-rigid alignment between the key frame and each new data grid frame, enabling the data of the new data grid frame to be fused into the key frame. In some embodiments, the processor 110 uses a truncated signed distance function (TSDF) at the reference volume 125 to fuse the new data mesh to the key frame in a volume-weighted manner. If the processor 110 determines that the alignment error between the new data grid frame and the key frame is above a threshold, the processor 110 sets the new data grid frame as the current key frame.
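
The fusion and reset logic can be summarized in a few lines. The following is a minimal sketch, assuming hypothetical array-based TSDF volumes, a weight cap, and an illustrative error threshold; the document specifies only that fusion is volume-weighted and that a reset occurs when the alignment error exceeds a threshold.

```python
# A minimal sketch of volume-weighted TSDF fusion and the key-frame reset rule.
# W_MAX and ERROR_THRESHOLD are hypothetical values, not from the source.
import numpy as np

W_MAX = 64.0
ERROR_THRESHOLD = 0.01

def fuse(ref_tsdf, ref_w, new_tsdf, new_w):
    """Per-voxel weighted running average of TSDF values."""
    total_w = ref_w + new_w
    fused = np.where(total_w > 0,
                     (ref_tsdf * ref_w + new_tsdf * new_w) / np.maximum(total_w, 1e-9),
                     ref_tsdf)
    return fused, np.minimum(total_w, W_MAX)

def update_reference(ref_tsdf, ref_w, new_tsdf, new_w, align_error):
    """Fuse the aligned frame into the key frame, or reset the key frame."""
    if align_error > ERROR_THRESHOLD:
        return new_tsdf, new_w          # new data grid frame becomes the key frame
    return fuse(ref_tsdf, ref_w, new_tsdf, new_w)
```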

The sparse correspondence module 130 is configured to find correspondences between the 3D mesh of the new data mesh frame 117 and the reference volume 125. In some embodiments, for each node (3D point) in the graph, the sparse correspondence module 130 builds an affinity matrix

$$A \in \mathbb{R}^{n \times n}.$$

The affinity matrix encodes the similarity in the graph based on the distance between each pair of nodes, approximating a kernel function to encode the non-linearity in the data grid. The sparse correspondence module 130 then uses the SVD decomposition $A = USV^{\top}$ to decompose the affinity matrix $A$ and uses the first $k-1$ eigenvectors $u_2, \ldots, u_k$ as a new space in which to find correspondences. The sparse correspondence module 130 discards the first eigenvector, which is associated with eigenvalue 0 and is a constant vector.
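
As a concrete illustration, the exact embedding can be computed as below. This is a minimal NumPy sketch that assumes a Gaussian affinity $A_{ij} = \exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$; the document states only that the affinity approximates a kernel function over node distances.

```python
# Exact spectral embedding sketch: build the affinity matrix, take its SVD, and
# keep eigenvectors u_2..u_k as the embedding space (the first is discarded).
# The Gaussian kernel and sigma value are assumptions for illustration.
import numpy as np

def spectral_embedding(points: np.ndarray, k: int, sigma: float = 0.1) -> np.ndarray:
    """points: (n, 3) point cloud; returns an (n, k-1) embedding."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))   # affinity matrix A in R^{n x n}
    U, S, Vt = np.linalg.svd(A)            # A = U S V^T
    return U[:, 1:k]                       # drop the first (constant) eigenvector
```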

In some embodiments, the sparse correspondence module 130 learns an approximation of the embedding points from a set of 3D points $X \in \mathbb{R}^{n \times 3}$ representing the current data grid frame 117, a process referred to herein as "spectral embedding". The sparse correspondence module 130 associates the affinity matrix $A$ with a non-linear mapping of the input space, similar to approximating a kernel function, and thus treats the embedding space as a target output to be learned. In some embodiments, the sparse correspondence module 130 learns a function $f(X) = Y \in \mathbb{R}^{k \times n}$, where $Y$ is the first $k$ eigenvectors of the affinity matrix $A$. By learning the function $f(X)$, the sparse correspondence module 130 avoids building the affinity matrix $A$ and performing its SVD decomposition.

The sparse correspondence module 130 learns a non-linear mapping $\Phi(X)$ of the point cloud and a regression function $W$ that minimize the loss

$$\min_{W} \left\lVert \Phi(X)\,W - Y \right\rVert_2^2 + \Gamma(W),$$

where $Y \in \mathbb{R}^{k \times n}$ is the embedding space and $\Gamma(W)$ is a regularization term that avoids overfitting, which the sparse correspondence module 130 sets to the $\ell_2$-norm of $W$. The sparse correspondence module 130 avoids computing the kernel matrix $A$ of the data by approximating it with random features, setting

$$\Phi(X) = \cos(XF + b),$$

where $F \in \mathbb{R}^{3 \times F}$ is randomly generated from a Gaussian distribution (with $b$ a random phase offset, following the standard random-features construction).

In some embodiments, the sparse correspondence module 130 uses spatial pooling to summarize the point cloud of the data grid frame 117 and capture both global and local structure. The sparse correspondence module 130 computes an additional non-linear mapping

$$\Psi_i = \psi(x_i)$$

and calculates its max-pool over all $n$ points to obtain a single compact descriptor $\psi = \max_i \Psi_i \in \mathbb{R}^F$, where the sparse correspondence module 130 uses the same number of random features $F$. To capture global and local information for each point $x$ in the point cloud $X$, the sparse correspondence module 130 aggregates the descriptors $\Phi(x)$ and $\psi$ with a simple subtraction, $\Phi(x) - \psi$.

In some embodiments, the processor 110 trains the sparse correspondence module 130 by collecting point clouds $X_1, \ldots, X_N$ and randomly sub-sampling a plurality of points (e.g., 256 points) from each point cloud. Each point cloud represents a particular object (such as a person) holding a particular pose. The processor 110 computes the embedding spaces $Y_1, \ldots, Y_N$ using an offline process. The processor 110 then remaps the input point clouds to $\Phi_1, \ldots, \Phi_N$ and minimizes the following loss function:

$$\min_{W} \sum_{i=1}^{N} \left\lVert \Phi_i W - Y_i \right\rVert_2^2 + \Gamma(W).$$

In some embodiments, at run time, given two meshes $M_1$ and $M_2$, the processor 110 randomly sub-samples each mesh to, for example, 256 3D coordinates $X_1$ and $X_2$, computes their non-linear mappings $\Phi_1$ and $\Phi_2$, and generates the embedding spaces $Y_1 = \Phi_1 W$ and $Y_2 = \Phi_2 W$. In the embedding space, the sparse correspondence module 130 retrieves the closest-point correspondences and uses them to initialize the correspondence field between the current data grid frame 117 and the reference volume 125.
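
The run-time pipeline of the preceding paragraphs can be sketched end to end. In the sketch below, the random-feature form cos(XF + b), the ridge-regression solve for W, and all constants are assumptions for illustration; the document specifies only that F is drawn from a Gaussian, that Γ(W) is the l2-norm of W, and that correspondences are closest points in the embedding space.

```python
# Learned spectral-embedding approximation: random features, pooled descriptors,
# ridge regression, and closest-point retrieval in embedding space.
import numpy as np

rng = np.random.default_rng(0)
F_DIM, K = 128, 16                       # number of random features, embedding dim
F = rng.normal(size=(3, F_DIM))          # random Gaussian projection, R^{3 x F}
b = rng.uniform(0.0, 2.0 * np.pi, F_DIM)

def phi(X):
    """Non-linear random-feature map of an (n, 3) point cloud."""
    return np.cos(X @ F + b)

def pooled_features(X):
    """Per-point features aggregated against a max-pooled global descriptor."""
    P = phi(X)
    return P - P.max(axis=0)             # Phi(x) - psi, with psi = max_i Psi_i

def fit_W(clouds, embeddings, lam=1e-3):
    """Ridge regression minimizing sum_i ||Phi_i W - Y_i||^2 + lam ||W||^2."""
    Phi = np.vstack([pooled_features(X) for X in clouds])
    Y = np.vstack(embeddings)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(F_DIM), Phi.T @ Y)

def closest_point_correspondences(X1, X2, W):
    """Match two sub-sampled meshes by nearest neighbors in embedding space."""
    Y1, Y2 = pooled_features(X1) @ W, pooled_features(X2) @ W
    d = np.linalg.norm(Y1[:, None, :] - Y2[None, :, :], axis=-1)
    return d.argmin(axis=1)              # for each point of X1, its match in X2
```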

The dense correspondence module 135 is configured to identify non-rigid alignment parameters that warp the mesh at the reference volume 125 to the current data grid frame 117. The dense correspondence module 135 runs both forward matching (reference volume 125 to data volume 120) and backward matching (data volume 120 to reference volume 125) to find dense correspondences between the reference volume 125 and the data volume 120. The dense correspondence module 135 uses the sparse correspondences identified by the sparse correspondence module 130 as a data term in the forward matching optimization problem. The data term penalizes the distance between a warped reference point and its corresponding data point. The dense correspondence module 135 uses the dense correspondences as residuals in a final non-rigid matching stage, after which the current data grid frame 117 is fused with the reference volume 125 to generate a model encoding the shape of objects in the environment of the electronic device 100.

In some embodiments, the dense correspondence module 135 uses an embedded deformation (ED) graph-based approach for non-rigid matching. The dense correspondence module 135 uniformly samples ED nodes from the vertices of the reference mesh, denoting the $i$-th node position as $g_i$. The dense correspondence module 135 associates each ED node with a set of parameters that represent the deformation the ED node imposes in its local neighborhood. The dense correspondence module 135 connects adjacent ED nodes together to form an ED graph and uses $G$ to collectively denote the ED node locations and the deformation parameters on the ED graph. The dense correspondence module 135 "skins" each mesh vertex to $K$ adjacent ED nodes so that the mesh is deformed according to the parameters of the ED graph.

The dense correspondence module 135 identifies the ED graph parameters that deform the reference volume 125 to fit the current data grid 117. The dense correspondence module 135 parameterizes the deformation at each ED node as an affine transformation. In some embodiments, the affine transformation has 12 parameters (a 3×3 matrix $A$ and a 3D vector $t$). Using linear blend skinning, the dense correspondence module 135 applies the transformations to warp a vertex $v$ at the reference volume 125 to the corresponding vertex $\tilde{v}$ at the current data grid frame 117:

$$\tilde{v} = \sum_{k} w_k \left[ A_k (v - g_k) + g_k + t_k \right],$$

where the transformation at the $k$-th skinned ED node is represented as $(A_k, t_k)$ and $w_k$ is the corresponding skinning weight. In some embodiments, the dense correspondence module 135 adds regularization on $A$ as a soft constraint to force $A$ to be as close to a rotation matrix as possible.

In some embodiments, the dense correspondence module 135 represents $A$ as a quaternion $q$ without explicitly forcing $q$ to be a unit quaternion, instead treating $\lVert q \rVert = 1$ as a soft constraint. To deform a vertex $v$, the dense correspondence module 135 computes

$$\tilde{v} = \sum_{k} w_k \left[ R(q_k)(v - g_k) + g_k + t_k \right], \qquad (3)$$

where $R(\cdot)$ converts a quaternion to a rotation matrix. Because the dense correspondence module 135 does not explicitly force $q$ to be a unit quaternion, $R(q)$ becomes a rotation matrix multiplied by a scalar. In some embodiments, the dense correspondence module 135 uses linear blending instead of quaternion blending to conserve computing resources while maintaining a high frame rate.

The dense correspondence module 135 is configured to perform a bi-directional non-rigid matching between the reference volume 125 and the data volume 120. The dense correspondence module 135 aligns the reference volume 125 with the data volume 120 (forward matching) and then aligns the data volume 120 with the reference volume 125 (backward matching), based on which the dense correspondence module 135 identifies an optimized point-to-point correspondence between the reference volume 125 and the data volume 120.

For forward matching, the dense correspondence module 135 matches the reference volume 125 to the data volume 120 by measuring the misalignment between the reference mesh and the input depth maps of the data volume 120 using the projective point-to-plane distance:

$$E_{\mathrm{data}}(G) = \sum_{m} \sum_{n} \delta_{mn} \left( n_{mn}^{\top} \left( \tilde{v}_m - \Gamma_n(\tilde{v}_m) \right) \right)^2,$$

where $\tilde{v}_m$ is the warped vertex defined in equation (3), $m$ and $n$ are the indices of the vertices and depth maps, respectively, $\Gamma_n(v)$ denotes the corresponding projected 3D point of $v$ from the $n$-th depth map, $n_{mn}$ is the surface normal at that point, and $\delta_{mn}$ is a visibility term:

$$\delta_{mn} = \begin{cases} 1 & \text{if } \tilde{v}_m \text{ is visible in the } n\text{-th depth map,} \\ 0 & \text{otherwise.} \end{cases}$$

In some embodiments, the dense correspondence module 135 additionally integrates a visual hull term and the correspondence term learned from equation (1). In some embodiments, the dense correspondence module 135 adds a regularization term to enforce smoothness of the non-rigid motion field. In some embodiments, the dense correspondence module 135 adds a color term to minimize drift.
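
A simplified version of this data term is sketched below. The pinhole projection, its intrinsics, and the per-pixel normal map are assumed stand-ins for Γ_n and the surface normal; the document does not specify the camera model.

```python
# Projective point-to-plane residual sketch: project each warped reference vertex
# into the depth map, back-project the hit pixel, and accumulate the squared
# point-to-plane distance. Invisible vertices contribute nothing (delta_mn = 0).
import numpy as np

def point_to_plane_energy(warped_verts, depth, normals, intrinsics):
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    energy = 0.0
    for v in warped_verts:
        if v[2] <= 0:
            continue                                   # behind the camera
        u = int(round(fx * v[0] / v[2] + cx))          # project to pixel column
        r = int(round(fy * v[1] / v[2] + cy))          # project to pixel row
        if not (0 <= r < h and 0 <= u < w) or depth[r, u] <= 0:
            continue                                   # not visible: delta_mn = 0
        z = depth[r, u]                                # back-project Gamma_n(v)
        p = np.array([(u - cx) * z / fx, (r - cy) * z / fy, z])
        n = normals[r, u]
        energy += float(n @ (v - p)) ** 2              # point-to-plane distance
    return energy
```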

After establishing the forward alignment, the dense correspondence module 135 pairs each vertex $v_m$ on the reference surface with its corresponding point $p_m$ on the current data grid frame 117 (volumetrically fused from the input depth maps 105). The dense correspondence module 135 lets $p_m$ denote the point on the data mesh closest to the warped vertex $\tilde{v}_m$ and discards a corresponding pair if their distance is greater than a threshold (e.g., 2 mm), generating a correspondence set $\{(v_m, p_m)\}$.

For backward matching, the dense correspondence module 135 uses equation (3) to warp the reference ED nodes to the data pose ($g \rightarrow g + t$). The dense correspondence module 135 then updates the node-to-node connectivity based on the mesh topology at the depth data 105, and uses the updated ED graph for backward matching, in which the dense correspondence module 135 finds, for each vertex of the reference volume 125, a correspondence with the current data grid frame 117:

$$\{(v_m, p_m^{\mathrm{bwd}})\}.$$

The dense correspondence module 135 reuses the ED graph rather than resampling a new ED graph from the current data grid frame 117, preserving correct ED node connectivity while aligning a reference volume 125 that represents an open topology to a current data grid frame 117 that represents a closed topology. In that case, although the ED graph from the reference volume 125 has the correct connectivity, an ED graph resampled at the current data grid frame 117 would have incorrect connectivity.

The dense correspondence module 135 aligns the data mesh to the reference TSDF volume using the following data term to measure misalignment:

$$E_{\mathrm{data}}(G_{\mathrm{bwd}}) = \sum_{m} D(\tilde{v}_m)^2,$$

where $D(\cdot)$ is the continuous volumetric field obtained by tri-linearly interpolating, at any given point, the signed distance function defined at the fixed regular lattice points. The dense correspondence module 135 initializes the parameters at each ED node for backward matching to the inverse of the forward transform at that node: $q_{\mathrm{bwd}} = q^{-1}$ and $t_{\mathrm{bwd}} = -t$.
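
The sampled field D(·) and the resulting term can be sketched as follows, assuming a dense TSDF array with a hypothetical origin and voxel size; points are assumed to fall inside the volume bounds.

```python
# Backward data term sketch: score each warped data vertex by tri-linearly
# sampling the reference TSDF, so misalignment is the squared signed distance.
import numpy as np

def trilinear(tsdf, p, origin, voxel_size):
    """Sample the continuous field D(.) at point p by tri-linear interpolation."""
    g = (np.asarray(p) - origin) / voxel_size       # continuous grid coordinates
    i0 = np.floor(g).astype(int)
    f = g - i0                                      # fractional offsets
    acc = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((1 - f[0]) if dx == 0 else f[0]) \
                  * ((1 - f[1]) if dy == 0 else f[1]) \
                  * ((1 - f[2]) if dz == 0 else f[2])
                acc += w * tsdf[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return acc

def backward_data_term(warped_data_verts, tsdf, origin, voxel_size):
    return sum(trilinear(tsdf, v, origin, voxel_size) ** 2
               for v in warped_data_verts)
```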

The final matching module 140 is configured to identify the best correspondence $\{(v_m, p_m)\}$ for each vertex at the reference volume. If the final matching module 140 identifies the forward correspondence as the best correspondence with the current data grid frame 117 for a reference vertex $v$, the processor 110 considers vertex $v$ to have a reliable correspondence with the current data grid frame 117 given the reference mesh topology at the reference volume 125. If the final matching module 140 identifies the backward correspondence as the best correspondence with the current data grid frame 117 for reference vertex $v$, the final matching module 140 updates the skinning result (the associated ED nodes and weights) of reference vertex $v$ to match the topology at the current data grid frame 117. The final matching module 140 performs the final match using the extracted correspondences as residuals to align the reference volume 125 with the data volume 120 using the updated graph connectivity and vertex skinning.

In some embodiments, the final matching module 140 adds an energy term to measure the photometric consistency between each vertex color on the reference mesh at the reference volume 125 and the color of the mesh vertex observed from the color data 107. By adding the energy term, the final matching module 140 accounts for drift that can occur because the point-to-plane data term used in forward matching does not constrain a vertex from moving along its tangent plane. The final matching module 140 customizes the energy term to be:

$$E_{\mathrm{color}}(G) = \sum_{m} \sum_{n} \delta_{mn} \left\lVert I_n\!\left( \Pi_n(\tilde{v}_m) \right) - c_m \right\rVert^2,$$

where $\Pi_n(\tilde{v}_m)$ is the projection of the warped reference vertex into the $n$-th color image space (2D image coordinates), as shown in equation (4), $\delta_{mn}$ is the visibility term, and $c_m$ denotes the 3D color vector of the $m$-th vertex on the reference mesh. In some embodiments, the final matching module 140 folds the 3D color into a single intensity (grayscale) value, which reduces the computational requirements of evaluating $J^{\top}J$. In some embodiments, to further reduce computation, the final matching module 140 folds the multiple residuals on the same vertex into one residual, replacing them with the merged residual:

$$r_m = \sum_{n} w_{mn} \, I_n\!\left( \Pi_n(\tilde{v}_m) \right) - c_m,$$

where $I$ is the input intensity image, $c$ is the per-vertex intensity value, and $w_{mn}$ is a weight for each input view, with $\sum_n w_{mn} = 1$. The weight $w_{mn}$ takes visibility and viewing angle into account.
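
Folding the per-view residuals of one vertex might look like the following sketch, in which the view weights are derived from visibility flags and viewing angles and normalized to sum to one; the exact weighting scheme is not specified in the document.

```python
# Merged per-vertex intensity residual: r_m = sum_n w_mn I_n(Pi_n(v_m)) - c_m.
import numpy as np

def folded_intensity_residual(intensities, vis, view_dots, c_m):
    """intensities: sampled image intensities per view; vis: visibility flags
    (delta_mn); view_dots: cosine of the angle between the vertex normal and
    each view direction; c_m: the per-vertex intensity value."""
    w = vis * np.maximum(view_dots, 0.0)     # weight by visibility and view angle
    if w.sum() == 0.0:
        return 0.0                           # vertex observed in no view
    w = w / w.sum()                          # enforce sum_n w_mn = 1
    return float(w @ intensities - c_m)
```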

Voxels in the reference volume 125 are located at regular lattice grid points, but after forward warping, in some embodiments, the final matching module 140 resamples the TSDF values and weights on the non-uniform grid of the data volume 120. In some cases, voxels corresponding to different surface regions can be warped to the same location. Instead of processing each voxel individually, the final matching module 140 applies the warping function to each volume cube (with eight voxels at its corners) on the lattice grid. After deformation, a cube can become non-cubic. The final matching module 140 discards severely deformed cubes and assumes that the remaining cubes roughly retain the shape of a cube. The final matching module 140 then performs a rasterization process on each deformed cube to detect all lattice points of the data volume 120 located within it, and tri-linearly interpolates the TSDF value and weight at each such lattice point.

When a collision occurs, a data voxel can receive TSDF values from multiple cubes. In some embodiments, the final matching module 140 uses custom atomic operations to handle collisions. When the final matching module 140 writes a new TSDF value $d_{\mathrm{new}}$ to a voxel with old TSDF value $d_{\mathrm{old}}$, if $d_{\mathrm{old}}$ is invalid or $|d_{\mathrm{new}}| < |d_{\mathrm{old}}|$, the final matching module 140 sets $d = d_{\mathrm{new}}$; otherwise, it sets $d = d_{\mathrm{old}}$.
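
The collision rule reduces to a compare-by-magnitude merge, sketched below in plain Python for clarity; on a GPU this would be implemented as a custom atomic operation (e.g., a compare-and-swap loop), which the sketch does not attempt to reproduce.

```python
# Merge rule for colliding TSDF writes: keep the value closest to the surface
# (smallest magnitude); an invalid old value is always overwritten.
import math

INVALID = math.inf

def merge_tsdf(d_old: float, d_new: float) -> float:
    if d_old == INVALID or abs(d_new) < abs(d_old):
        return d_new
    return d_old
```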

The skinning module 145 is configured to perform a relatively fast approximation to geodesic skinning. The skinning module 145 first finds the ED nodes closest to each vertex using the Euclidean distance. For each ED node, the skinning module 145 solves for the heat equilibrium over the surface, in which heat is transferred from the ED node only to the vertices that select it as one of their k-nearest ED nodes. The skinning module 145 then solves the linear system for ED node $i$:

$$(D - t\Delta)\, w_i = \delta_i, \qquad (9)$$

where $D$ is a diagonal matrix in which $D_{jj}$ is the area of vertex $j$ (one third of the area of all triangles incident on vertex $j$), $\Delta$ is the discrete surface Laplacian computed using the cotangent formula with a half-edge data structure as the intermediate representation, $t$ is a temperature hyperparameter, and $\delta_i$ is a heat vector indicating the vertex that represents ED node $i$. In some embodiments, the skinning module 145 samples the ED nodes on the mesh manifold by selecting the ED nodes as medoids rather than centers of the vertex sets. In some embodiments, the skinning module 145 solves the linear system with a parallel Gauss-Seidel implementation. When the recovered solution $w_{ij}$ is zero for a vertex $j$, the skinning module 145 determines that vertex $j$ is too far away on the manifold and detaches it from ED node $i$, thereby avoiding distortion artifacts. In some embodiments, the skinning module 145 uses Euclidean distances during volume warping to conserve computing resources.
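
A direct solve of equation (9) is sketched below using SciPy's sparse solver in place of the parallel Gauss-Seidel implementation described above; building the cotangent Laplacian is omitted, and L, the per-vertex areas, and t are assumed inputs.

```python
# Heat-diffusion skinning solve, (D - t*Laplacian) w_i = delta_i, for one ED node.
# Weights that come back as zero indicate vertices too far away on the manifold,
# which are detached from the node as described in the text.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def skinning_weights(L, areas, node_vertex, t):
    """L: (n, n) sparse cotangent Laplacian; areas: (n,) per-vertex areas;
    node_vertex: index of the vertex hosting the ED node; t: temperature."""
    n = areas.shape[0]
    D = sp.diags(areas)
    delta = np.zeros(n)
    delta[node_vertex] = 1.0                     # heat source at the ED node
    w = spla.spsolve(sp.csc_matrix(D - t * L), delta)
    return np.maximum(w, 0.0)                    # clamp small negative values
```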

Thus, the skinning module 145 associates each vertex (or voxel) in the volume with its neighboring ED nodes and the corresponding weights. The non-rigid deformation (i.e., the motion field or pose) is represented as a set of local semi-rigid deformations at each of the ED nodes, so initially the deformation is defined only at the positions of the ED nodes. Because the vertices of the mesh (or the voxels of the volume) are generally not co-located with the ED nodes, the skinning module 145 interpolates the deformation parameters for those mesh vertices (or volume voxels). To perform the interpolation, the skinning module 145 determines the neighboring ED nodes of each mesh vertex (or voxel) and then interpolates the parameters (e.g., by taking a weighted average of the neighboring ED node parameters).

FIG. 2 is a flow diagram illustrating a method 200 employed by the motion tracking system 100 of FIG. 1 to correlate a data volume and an evolved reference volume using spectral embedding and forward and backward matching to estimate a pose of an object in an environment, according to some embodiments. At block 202, the processor 110 of the motion tracking system 100 of FIG. 1 receives captured depth and color data from the depth camera 102 and the color camera 104. The depth and color data may be in the form of one or more depth maps and color images. At block 204, the processor 110 generates the current data grid frame 117, which the processor 110 maintains at the data volume 120. At block 206, the sparse correspondence module 130 estimates a correspondence field between the data volume 120 and the reference volume 125 based on the learned approximation of the spectral embedding. In some embodiments, the sparse correspondence module 130 uses equation (1) to estimate the correspondence field.

At block 208, the dense correspondence module 135 performs forward and backward matching between the reference volume 125 and the data volume 120 to find dense correspondences between the meshes. At block 210, the final matching module 140 performs a non-rigid optimization based on the residuals of the forward and backward matches from the dense correspondence module 135. At block 212, the processor 110 fuses the data mesh into the reference volume 125 and warps the reference volume 125 to the data volume 120, after which the method flow continues back to block 202.

Fig. 3 is a flow diagram illustrating a method 300 employed by the motion tracking system 100 of Fig. 1 to calculate a sparse correspondence field between a data volume and a reference volume based on a learned approximation of the spectral embedding of extracted points, in accordance with some embodiments. At block 302, the sparse correspondence module 130 receives the 3D point cloud of the current data grid frame 117. At block 304, the sparse correspondence module 130 approximates the embedding points from a subset of the 3D points of the 3D point cloud of the current data grid frame 117. At block 306, the sparse correspondence module 130 correlates the affinity matrix with the non-linear mapping of the current data grid frame 117. At block 308, the sparse correspondence module 130 retrieves the closest-point correspondences between the current data grid frame 117 and the reference volume 125 based on the non-linear mapping and the embedding space.

Fig. 4 is a flow diagram illustrating a method 400 employed by the motion tracking system 100 of Fig. 1 for backward and forward matching of a reference volume to a data volume, in accordance with some embodiments. At block 402, the dense correspondence module 135 performs forward matching of the reference volume 125 to the data volume 120. At block 404, the dense correspondence module 135 warps the embedded deformation nodes of the reference volume 125 to the pose of the data volume 120. At block 406, the dense correspondence module 135 updates the node-to-node connections between the meshes of the reference volume 125 and the data volume 120 based on the mesh topology at the data volume 120 to generate an updated embedded deformation graph. At block 408, the dense correspondence module 135 performs backward matching of the data volume 120 to the reference volume 125 based on the updated embedded deformation graph.

At block 410, the dense correspondence module 135 determines, for each vertex, whether the forward-match correspondence is better than the backward-match correspondence. If the dense correspondence module 135 determines that the forward-match correspondence is better than the backward-match correspondence for a vertex, then at block 412 the final matching module 140 determines, from the topology of the mesh at the reference volume 125, that the vertex has a reliable correspondence with the current data grid frame 117. If the dense correspondence module 135 determines that the forward-match correspondence is not better than the backward-match correspondence for a given vertex, then at block 414 the final matching module 140 updates the vertex skinning result for the vertex to match the topology of the current data grid frame 117 at the data volume 120. At block 416, the final matching module 140 aligns the reference volume 125 with the data volume 120.

In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer-readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer-readable storage medium can include, for example, a magnetic or optical disk storage device, solid-state storage devices such as flash memory, a cache, random access memory (RAM), or other non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer-readable storage medium may be in source code, assembly language code, object code, or another instruction format that is interpreted or otherwise executable by one or more processors.

Computer-readable storage media can include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-ray disc), magnetic media (e.g., floppy disk, magnetic tape, or hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer-readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).

Note that not all of the activities or elements described above in the general description are required, that a portion of a particular activity or device may not be required, and that one or more other activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments.

However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure. Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. The benefits, advantages, solutions to problems, and any feature or features that may cause any benefit, advantage, or solution to occur or become more pronounced, however, are not to be construed as critical, required, or essential features of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.
