Processing of depth maps for images

Document No. 1966965, published 2021-12-14

Note: This technique, Processing of depth maps for images, was devised by C. Varekamp and B. W. D. Van Geest on 2020-03-03. Abstract: A method of processing a depth map comprises receiving (301) images and corresponding depth maps. The depth values of a first depth map of the corresponding depth maps are updated (303) based on the depth values of at least a second depth map of the corresponding depth maps. The updating is based on a weighted combination of candidate depth values determined from the other maps. A weight for a candidate depth value from the second depth map is determined based on a similarity between pixel values in the first image corresponding to the depth being updated and pixel values in the third image at a location determined by projecting the location of the depth value being updated to the third image using the candidate depth value. In this way a more consistent depth map may be generated.

1. A method of processing a depth map, the method comprising:

receiving (301) a plurality of images representing a scene from different viewing poses and corresponding depth maps;

updating (303) depth values of a first depth map of the corresponding depth maps based on depth values of at least a second depth map of the corresponding depth maps, the first depth map being for a first image and the second depth map being for a second image; the updating (303) comprises:

determining (501) a first candidate depth value for a first depth pixel of the first depth map at a first depth map position in the first depth map, the first candidate depth value being determined in response to at least one second depth value of a second depth pixel of the second depth map at a second depth map position in the second depth map;

determining (503) a first depth value for the first depth pixel by a weighted combination of a plurality of candidate depth values for the first depth map location, the weighted combination comprising the first candidate depth value weighted by a first weight;

wherein determining (503) the first depth value comprises:

determining (601) a first image position in the first image for the first depth map position,

determining (603) a third image position in a third image of the plurality of images based on the first candidate depth value, the third image position corresponding to a projection of the first image position to the third image;

determining (605) a first match error indication indicative of a difference between image pixel values in the third image for the third image position and image pixel values in the first image for the first image position, and

determining (607) the first weight in response to the first match error indication.

2. The method as recited in claim 1, wherein determining (501) a first candidate depth value includes determining the second depth map position relative to the first depth map position by a projection between a first viewing pose of the first image and a second viewing pose of the second image, the projection being based upon at least one of the second depth value and a first original depth value of the first depth map.

3. The method of claim 1 or 2, wherein the weighted combination comprises candidate depth values determined from a region of the second depth map determined in response to the first depth map position.

4. The method of claim 3, wherein the region of the second depth map is determined as a region around the second depth map location and the second depth map location is determined as a depth map location in the second depth map that is equal to the first depth map location in the first depth map.

5. The method of claim 3, wherein the region of the second depth map is determined as a region around a location in the second depth map determined by a projection from the first depth map location based on original depth values in the first depth map at the first depth map location.

6. The method of any preceding claim, further comprising determining a second match error indication indicative of a difference between an image pixel value in the second image for the second depth map position and the image pixel value in the first image for the first depth map position; and wherein determining the first weight is also responsive to the second match error indication.

7. The method of any preceding claim, further comprising determining an additional match error indication indicative of a difference between image pixel values in other images for depth map positions corresponding to the first depth map position and the image pixel values in the first image for the first depth map position; and wherein determining the first weight is also responsive to the additional match error indication.

8. The method of any preceding claim, wherein the weighted combination comprises depth values of the first depth map in an area around the first depth map position.

9. The method of any preceding claim, wherein the first weight depends on a confidence value of the first candidate depth value.

10. The method of any preceding claim, wherein only depth values of the first depth map having confidence values below a threshold are updated.

11. The method of any preceding claim, further comprising selecting a set of depth values of the second depth map to be included in the weighted combination to meet a requirement that the depth values of the set of depth values must have a confidence value above a threshold.

12. The method of any preceding claim, further comprising:

projecting a given depth map location for a given depth value in a given depth map to a corresponding location in a plurality of corresponding depth maps;

determining a measure of variation for a set of depth values comprising the given depth value and a depth value at the corresponding location in the plurality of corresponding depth maps; and

determining a confidence value for the given depth map location in response to the measure of variation.

13. The method of any preceding claim, further comprising:

projecting a given depth map location for a given depth value in a given depth map to a corresponding location in another depth map, the projection based on the given depth value;

projecting the corresponding location in the other depth map to a test location in the given depth map, the projecting based on a depth value at the corresponding location in the other depth map;

determining a confidence value for the given depth map location in response to a distance between the given depth map location and the test location.

14. An apparatus for processing a depth map, the apparatus comprising:

a receiver (201) for receiving (301) a plurality of images representing a scene from different viewing poses and corresponding depth maps;

an updater (203) for updating (303) depth values of a first depth map of the corresponding depth maps based on depth values of at least a second depth map of the corresponding depth maps, the first depth map being for a first image and the second depth map being for a second image; the updating (303) comprises:

determining (501) a first candidate depth value for a first depth pixel of the first depth map at a first depth map position in the first depth map, the first candidate depth value being determined in response to at least one second depth value for a second depth pixel of the second depth map at a second depth map position in the second depth map;

determining (503) a first depth value for the first depth pixel by a weighted combination of a plurality of candidate depth values for the first depth map location, the weighted combination comprising the first candidate depth value weighted by a first weight;

wherein determining (503) the first depth value comprises:

determining (601) a first image position in the first image for the first depth map position,

determining (603) a third image position in a third image of the plurality of images based on the first candidate depth value, the third image position corresponding to a projection of the first image position to the third image;

determining (605) a first match error indication indicative of a difference between image pixel values in the third image for the third image position and image pixel values in the first image for the first image position, and

determining (607) the first weight in response to the first match error indication.

15. A computer program product comprising computer program code means adapted to perform all the steps of claims 1-13 when said program is run on a computer.

Technical Field

The present invention relates to the processing of depth maps for images, and in particular, but not exclusively, to the processing of depth maps that support view synthesis for virtual reality applications.

Background

In recent years, the variety and scope of image and video applications has increased dramatically with the continued development and introduction of new services and ways to utilize and consume video.

For example, one increasingly popular service is to provide image sequences in a manner that viewers can actively and dynamically interact with the system to change the parameters of the presentation. A very attractive feature in many applications is the ability to change the effective viewing position and viewing direction of the viewer, for example to allow the viewer to move and "look around" in the scene presented.

Such a feature can specifically allow providing a virtual reality experience to a user. This may allow the user to e.g. (relatively) freely move and dynamically change his position and where he is looking in the virtual environment. Typically, such virtual reality applications are based on a three-dimensional model of a scene that is dynamically evaluated to provide a particular requested view. Such methods are well known, for example, in gaming applications for computers and gaming machines, such as first-person shooter games.

It is also desirable that the images presented are three-dimensional images, particularly for virtual reality applications. Indeed, in order to optimize the viewer's immersion, it is generally preferable for the user to experience the rendered scene as a three-dimensional scene. Indeed, the virtual reality experience should preferably allow the user to select his/her own position, camera viewpoint and moment in time relative to the virtual world.

Many virtual reality applications are based on models of predetermined scenes and are typically based on artificial models of the virtual world. It is often desirable to provide a virtual reality experience based on real world capture.

In many systems, such as particularly when based on a real-world scene, an image representation of the scene is provided, where the image representation includes images and depths for one or more capture points/viewpoints in the scene. The image plus depth representation provides a very efficient representation of, in particular, real-world scenes, wherein the representation is not only relatively easy to generate by capture of the real-world scene, but is also well suited for renderers to synthesize views from other viewpoints than those captured. For example, the renderer may be arranged to dynamically generate a view that matches the current local viewer pose. For example, a viewer pose may be dynamically determined, and a view may be dynamically generated based on the images and, for example, the provided depth maps to match the viewer pose.

In many practical systems, calibrated multi-view camera equipment may be used to allow playback from different perspectives of the user relative to the captured scene. Applications include selecting a personal viewpoint during a sports game, or playing back a captured 3D scene on an augmented reality or virtual reality headset.

You Yang et al. disclose a cross-view filtering scheme in "Cross-View Multi-layer Filter for Compressed Multi-View Depth Video" (IEEE Transactions on Image Processing, vol. 28, 1 January 2019 (2019-01-01), pp. 302-). With this approach, the distorted depth map is enhanced via non-local candidates selected from the current and neighboring viewpoints at different time slots. Specifically, the candidates are clustered into macro superpixels representing cross-view, spatially and temporally prioritized physical and semantic cross-relationships.

Wolff, Katja, et al., in "Point Cloud Noise and Outlier Removal for Image Based 3D Reconstruction" (2016 Fourth International Conference on 3D Vision (3DV), IEEE, 25 October 2016 (2016-10-25), XP033027617, DOI: 10.1109/3DV.2016.20), disclose an algorithm that uses an input image and a corresponding depth map to remove pixels that are geometrically or photometrically inconsistent with the color surface implied by the input. This allows standard surface reconstruction methods to perform less smoothing and thus achieve higher quality.

In order to provide a smooth transition between the discrete captured viewpoints and some extrapolation outside the captured viewpoints, depth maps are typically provided and used to predict/synthesize views from these other viewpoints.

Depth maps are typically generated using (multi-view) stereo matching between capture cameras or more directly by using depth sensors (based on structured light or time of flight). However, such depth maps obtained from a depth sensor or disparity estimation process inherently have errors and may lead to errors and inaccuracies in the synthesized view. This degrades the viewer's experience.

Hence, an improved method for generating and processing a depth map would be advantageous. In particular, systems and/or methods that allow for improved operation, increased flexibility, improved virtual reality experience, reduced complexity, facilitated implementation, improved depth maps, improved composite image quality, improved presentation, improved user experience, and/or improved performance and/or operation would be advantageous.

Disclosure of Invention

Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.

According to an aspect of the invention, there is provided a method of processing a depth map, the method comprising: receiving a plurality of images representing a scene from different viewing poses and corresponding depth maps; updating depth values of a first depth map of the corresponding depth maps based on depth values of at least a second depth map of the corresponding depth maps, the first depth map being for the first image and the second depth map being for the second image; the updating comprises the following steps: determining a first candidate depth value for a first depth pixel of the first depth map at a first depth map position in the first depth map, the first candidate depth value being determined in response to at least one second depth value for a second depth pixel of the second depth map at a second depth map position in the second depth map; determining a first depth value for the first depth pixel by a weighted combination of a plurality of candidate depth values for the first depth map location, the weighted combination comprising the first candidate depth value weighted by a first weight; wherein determining the first depth value comprises: determining a first image position in the first image for the first depth map position, determining a third image position in a third image of the plurality of images, the third image position corresponding to a projection of the first image position to the third image based on the first candidate depth value; a first match error indication indicative of a difference between image pixel values in the third image for the third image position and image pixel values in the first image for the first image position is determined, and a first weight is determined in response to the first match error indication.

The method may in many embodiments provide an improved depth map, and may in particular provide a set of depth maps with increased consistency. The method may allow for improved view consistency when synthesizing an image based on the image and the updated depth map.

The inventors have recognized that inconsistencies between depth maps may generally be more perceptible than errors or noise that are consistent between depth maps, and that certain methods may provide more consistent updated depth maps. The method may be used as a depth refinement algorithm that improves the quality of a depth map for a set of multi-view images of a scene.

The method may be convenient to implement in many embodiments and may be implemented with relatively low complexity and resource requirements.

A position in an image may directly correspond to a position in a corresponding depth map and vice versa. There may be a one-to-one correspondence between locations in the image and locations in the corresponding depth map. In many embodiments, the pixel locations may be the same in the image and the corresponding depth map, and the corresponding depth map may include one pixel for each pixel in the image.

In some embodiments, the weights may be binary (e.g., one or zero) and the weighted combination may be a selection (selection).

It should be understood that the term projection generally refers to the projection of three-dimensional spatial coordinates in a scene to two-dimensional image coordinates (u, v) in an image or depth map. However, projection may also refer to a projection from one image or depth map to another, i.e. from a set of image coordinates (u1, v1) for one pose to another set of image coordinates (u2, v2) for another pose, for the same scene point. Such projection between image coordinates for images corresponding to different viewing poses/positions is typically performed taking into account the corresponding spatial scene points, and in particular by taking into account the depth of the scene points.

In some embodiments, determining the first depth value comprises: projecting the first depth map location to a third depth map location in a third depth map of the corresponding depth maps, the third depth map being for the third image, and the projecting based on the first candidate depth value, determining a first match error indication indicative of a difference between image pixel values in the third image for the third depth map location and image pixel values in the first image for the first depth map location, and determining the first weight in response to the first match error indication.

In accordance with an optional feature of the invention, determining the first candidate depth value comprises determining the second depth map position relative to the first depth map position by a projection between a first viewing pose of the first image and a second viewing pose of the second image, the projection being based on at least one of the second depth value and the first original depth value of the first depth map.

This may provide particularly advantageous performance in many embodiments, and may in particular allow for improved depth maps with improved consistency in many scenarios.

The projection may be from the first depth map location to the second depth map location, and thus from the first viewing pose to the second viewing pose, based on the first original depth value.

The projection may be based on the second depth value, from the second depth map location to the first depth map location, and thus from the second viewing pose to the first viewing pose.

The original depth value may be an un-updated depth value of the first depth map.

The original depth value may be the depth value of the first depth map as received by the receiver.

In accordance with an optional feature of the invention, the weighted combination comprises candidate depth values determined from an area of the second depth map determined in response to the first depth map position.

This may provide increased depth map consistency in many embodiments. The first candidate depth value may be derived from one or more depth values of the region.

According to an optional feature of the invention, the region of the second depth map is determined as a region around a second depth map position, and the second depth map position is determined as a depth map position in the second depth map equal to a first depth map position in the first depth map.

This may allow a low complexity and low resource but efficient determination of suitable depth values to be taken into account.

According to an optional feature of the invention, the area of the second depth map is determined as an area around a position in the second depth map determined by a projection from the first depth map position based on the original depth value in the first depth map at the first depth map position.

In many embodiments, this may provide increased depth map consistency. The original depth value may be a depth value of the first depth map received by the receiver.

In accordance with an optional feature of the invention, the method further comprises determining a second match error indication indicative of a difference between image pixel values in the second image for a second depth map position and image pixel values in the first image for the first depth map position; and wherein determining the first weight is also responsive to the second match error indication.

This may provide an improved depth map in many embodiments.

In accordance with an optional feature of the invention, the method further comprises determining an additional match error indication indicative of a difference between image pixel values in the other image for the depth map position corresponding to the first depth map position and image pixel values in the first image for the first depth map position; and wherein determining the first weight is also responsive to an additional match error indication.

This may provide an improved depth map in many embodiments.

According to an optional feature of the invention, the weighted combination comprises depth values of the first depth map in an area around the first depth map position.

This may provide an improved depth map in many embodiments.

According to an optional feature of the invention, the first weight is dependent on a confidence value of the first candidate depth value.

This may provide an improved depth map in many scenarios.

According to an optional feature of the invention, only depth values of the first depth map having a confidence value below a threshold are updated.

This may provide an improved depth map in many scenes and may specifically reduce the risk of updating accurate depth values to less accurate depth values.

According to an optional feature of the invention, the method further comprises selecting a set of depth values of the second depth map for inclusion in the weighted combination which fulfils a requirement that depth values of the set of depth values have to have a confidence value above a threshold value.

This may provide an improved depth map in many scenarios.

According to an optional feature of the invention, the method further comprises: projecting a given depth map location for a given depth value in a given depth map to a corresponding location in a plurality of corresponding depth maps; determining a measure of variation for a set of depth values, the set of depth values including the given depth value and the depth values at the corresponding locations in the plurality of corresponding depth maps; and determining a confidence value for the given depth map location in response to the measure of variation.
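By way of illustration only (this is a sketch, not the claimed method itself), the measure of variation could be realized as the standard deviation of the depth values gathered at the projected locations, mapped to a confidence value; the function name, the exponential mapping and the scale parameter below are assumptions made for this sketch:

```python
import numpy as np

def variation_confidence(given_depth, projected_depths, scale=0.05):
    """Confidence for a depth value from the spread of depths across views.

    projected_depths holds the depth values read from the other depth maps
    at the locations obtained by projecting the given depth map location
    into them. A small spread yields a confidence close to 1, a large
    spread a confidence close to 0.
    """
    values = np.array([given_depth, *projected_depths], dtype=float)
    variation = values.std()  # the measure of variation
    return float(np.exp(-variation / (scale * abs(given_depth) + 1e-6)))

# Example: depths from three other views agree fairly well with 2.0 m.
print(variation_confidence(2.0, [1.98, 2.03, 2.01]))
```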

This may provide a particularly advantageous determination of confidence values that may result in an improved depth map.

According to an optional feature of the invention, the method further comprises: projecting a given depth map location for a given depth value in a given depth map to a corresponding location in another depth map, the projection being based on the given depth value; projecting a corresponding location in the other depth map to a test location in the given depth map, the projection being based on a depth value at the corresponding location in the other depth map; a confidence value for a given depth map location is determined in response to a distance between the given depth map location and the test location.
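A possible sketch of this forward-and-back consistency check is given below; it assumes a pixel-warping helper project_to_view(u, v, z, K_src, K_dst, R, t) of the kind sketched later in the detailed description, the exponential distance-to-confidence mapping and its scale parameter are illustrative assumptions, and image-boundary checks are omitted for brevity:

```python
import numpy as np

def roundtrip_confidence(u, v, depth_l, depth_k,
                         K_l, K_k, R_lk, t_lk, R_kl, t_kl, scale=2.0):
    """Confidence from projecting a pixel to another view and back again.

    depth_l / depth_k are the depth maps of the given and the other view.
    The distance (in pixels) between the starting position and the test
    position obtained after the round trip is mapped to a confidence in
    (0, 1]; a small distance gives a high confidence.
    """
    z_l = depth_l[v, u]
    # Forward projection to the other view, based on the given depth value.
    u_k, v_k, _ = project_to_view(u, v, z_l, K_l, K_k, R_lk, t_lk)
    # Read the depth stored at the corresponding location in the other map.
    z_k = depth_k[int(round(v_k)), int(round(u_k))]
    # Back projection to the given view, based on that depth value.
    u_t, v_t, _ = project_to_view(u_k, v_k, z_k, K_k, K_l, R_kl, t_kl)
    distance = float(np.hypot(u_t - u, v_t - v))
    return float(np.exp(-distance / scale))
```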

This may provide a particularly advantageous determination of confidence values that may result in an improved depth map.

According to an aspect of the invention, there is provided an apparatus for processing a depth map, the apparatus comprising: a receiver for receiving a plurality of images representing a scene from different viewing poses and corresponding depth maps; an updater for updating depth values of a first depth map of the corresponding depth maps based on depth values of at least a second depth map of the corresponding depth maps, the first depth map being for the first image and the second depth map being for the second image; the updating comprises the following steps: determining a first candidate depth value for a first depth pixel of the first depth map at a first depth map position in the first depth map, the first candidate depth value being determined in response to at least one second depth value for a second depth pixel of the second depth map at a second depth map position in the second depth map; determining a first depth value for the first depth pixel by a weighted combination of a plurality of candidate depth values for the first depth map location, the weighted combination comprising the first candidate depth value weighted by a first weight; wherein determining the first depth value comprises: determining a first image position in the first image for the first depth map position, determining a third image position in a third image of the plurality of images, the third image position corresponding to a projection of the first image position to the third image based on the first candidate depth value; a first match error indication is determined, the first match error indication indicating a difference between image pixel values in the third image for the third image position and image pixel values in the first image for the first image position, and the first weight is determined in response to the first match error indication.

These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

Drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which

FIG. 1 shows an example of an apparatus for providing a virtual reality experience;

FIG. 2 illustrates an example of elements of an apparatus for processing a depth map according to some embodiments of the invention;

FIG. 3 illustrates an example of elements of a method of processing a depth map according to some embodiments of the invention;

FIG. 4 shows an example of camera settings for capturing a scene;

FIG. 5 illustrates an example of elements of a method of updating a depth map according to some embodiments of the invention;

FIG. 6 illustrates an example of elements of a method of determining weights according to some embodiments of the invention;

FIG. 7 shows an example of processing of depth maps and images according to some embodiments of the invention.

Detailed Description

The following description focuses on embodiments of the invention applicable to a virtual reality experience but it will be appreciated that the invention is not limited to this application but may be applied to many other systems and applications such as specific applications involving view synthesis.

Virtual experiences that allow users to move around in a virtual world are becoming more and more popular and services are being developed to meet this demand. However, providing efficient virtual reality services is very challenging, particularly if the experience is based on a capture of the real-world environment rather than a fully virtually generated artificial world.

In many virtual reality applications, a viewer pose input is determined, reflecting the pose of a virtual viewer in the scene. The virtual reality device/system/application then generates one or more images corresponding to the views and viewports of the scene for a viewer corresponding to the viewer pose.

Typically, virtual reality applications generate a three-dimensional output in the form of separate view images for the left and right eyes. These can then be presented to the user by suitable means, such as the individual left and right eye displays of a VR headset. In other embodiments, the images may, for example, be presented on an autostereoscopic display (in which case a larger number of view images may be generated for the viewer pose), or indeed in some embodiments only a single two-dimensional image may be generated (e.g. using a conventional two-dimensional display).

The viewer pose input may be determined in different ways in different applications. In many embodiments, the physical movement of the user may be tracked directly. For example, a camera surveying the user area may detect and track the user's head (or even eyes). In many embodiments, the user may wear a VR headset which can be tracked by external and/or internal means. For example, the headset may comprise accelerometers and gyroscopes providing information on the movement and rotation of the headset, and thus of the head. In some examples, the VR headset may transmit signals or comprise (e.g. visual) identifiers that enable an external sensor to determine the movement of the VR headset.

In some systems, the viewer pose may be provided by manual means, e.g. by the user manually controlling a joystick or similar manual input. For example, the user may manually move the virtual viewer around in the scene by controlling a first analog joystick with one hand and manually controlling the direction in which the virtual viewer is looking by manually moving a second analog joystick with the other hand.

In some applications, a combination of manual and automated approaches may be used to generate the input viewer pose. For example, a headset may track the orientation of the head, and the movement/position of the viewer in the scene may be controlled by the user using a joystick.

The generation of images is based on a suitable representation of the virtual world/environment/scene. In some applications, a full three-dimensional model may be provided for the scene, and the views of the scene from a specific viewer pose can be determined by evaluating this model.

In many practical systems, the scene may be represented by an image representation comprising image data. The image data may typically comprise one or more images associated with one or more capture or anchor poses, and may specifically include images for one or more viewports, with each viewport corresponding to a specific pose. An image representation may be used comprising one or more images, where each image represents the view of a given viewport for a given viewing pose. Such viewing poses or positions for which image data is provided are often also referred to as anchor poses or positions, or capture poses or positions (since the image data may typically correspond to images that are or would be captured by cameras positioned in the scene with the position and orientation corresponding to the capture pose).

The images are typically associated with depth information, and in particular, a depth image or depth map is typically provided. Such a depth map may provide a depth value for each pixel in the corresponding image, where the depth value indicates a distance from the camera/anchor/capture location to the object/scene point depicted by the pixel. Thus, a pixel value may be considered to represent a ray from an object/point in the scene to the camera's capture device, and a depth value for the pixel may reflect the length of the ray.

In many embodiments, the resolution of the image and the corresponding depth map may be the same, and thus may comprise an individual depth value for each pixel in the image, i.e. the depth map may comprise one depth value for each pixel of the image. In other embodiments, the resolution may be different, and for example the depth map may have a lower resolution, such that one depth value may be applied to a plurality of image pixels. The following description will focus on embodiments in which the resolution of the image and the corresponding depth map are the same, and therefore for each image pixel (pixel of the image) there is a separate depth image pixel (pixel of the depth map).

The depth value may be any value indicative of the depth for a pixel, and thus it may be any value indicative of the distance from the camera position to an object of the scene depicted by the given pixel. The depth values may be, for example, disparity values, z-coordinates, distance measures, and the like.
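Purely as an illustrative aside (this relation is not part of the described method), for a rectified stereo pair with focal length f (in pixels) and camera baseline b, a disparity value d and the corresponding z-coordinate z of the depicted scene point are related by

$$ d = \frac{f\,b}{z} \qquad\Longleftrightarrow\qquad z = \frac{f\,b}{d}, $$

so a depth map storing disparity values can be converted to distance values when f and b are known.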

Many typical VR applications may, on the basis of such an image plus depth representation, proceed to provide view images corresponding to viewports of the scene for the current viewer pose, with the images being dynamically updated to reflect changes in the viewer pose, and with the images being generated based on the image data representing the (possibly) virtual scene/environment/world. The application may do this by performing view synthesis and view shifting algorithms known to those of ordinary skill in the art.

In the art, the terms placement and pose are used as general terms for position and/or direction/orientation. For example, the combination of the position and direction/orientation of an object, a camera, a head, or a view may be referred to as a pose or a placement. Thus, a placement or pose indication may comprise six values/components/degrees of freedom, with each value/component typically describing an individual property of the position/location or the orientation/direction of the corresponding object. Of course, in many cases, a placement or pose may be considered with fewer components, or be represented by fewer components, for example if one or more components are considered fixed or irrelevant (e.g. if all objects are considered to be at the same height and have a horizontal orientation, four components may provide a full representation of the pose of an object). In the following, the term pose is used to refer to a position and/or orientation which may be represented by one to six values (corresponding to the maximum possible degrees of freedom).

Many VR applications are based on a pose having the maximum degrees of freedom, i.e. three degrees of freedom for each of the position and the orientation, resulting in a total of six degrees of freedom. A pose may thus be represented by a set or vector of six values representing the six degrees of freedom, and thus the pose vector may provide a three-dimensional position and/or a three-dimensional direction indication. However, it will be appreciated that in other embodiments, poses may be represented by fewer values.

A pose may be at least one of an orientation and a position. A pose value may be indicative of at least one of an orientation value and a position value.

Systems or entities based on providing the viewer with the greatest degree of freedom are commonly referred to as having 6 degrees of freedom (6 DoF). Many systems and entities provide only orientation or position, and these are often referred to as having 3 degrees of freedom (3 DoF).

In some systems, the VR application may be provided locally to the viewer through, for example, a stand-alone device that does not use, or even access, any remote VR data or processing. For example, a device such as a game console may include a memory for storing scene data, an input for receiving/generating viewer gestures, and a processor for generating corresponding images from the scene data.

In other systems, the VR application may be implemented and performed remote from the viewer. For example, a device local to the user may detect/receive movement/pose data which is transmitted to a remote device that processes the data to generate the viewer pose. The remote device may then generate suitable view images for the viewer pose based on scene data describing the scene. The view images are then transmitted to the device local to the viewer, where they are presented. For example, the remote device may directly generate a video stream (typically a stereo/3D video stream) which is directly presented by the local device. Thus, in such an example, the local device may not perform any VR processing except for transmitting movement data and presenting received video data.

In many systems, the functionality may be distributed across the local and remote devices. For example, the local device may process received input and sensor data to generate viewer poses that are continuously transmitted to the remote VR device. The remote VR device may then generate the corresponding view images and transmit these to the local device for presentation. In other systems, the remote VR device may not directly generate the view images but may select relevant scene data and transmit this to the local device, which may then generate the view images that are presented. For example, the remote VR device may identify the closest capture point and extract the corresponding scene data (e.g. spherical image and depth data from the capture point) and transmit this to the local device. The local device may then process the received scene data to generate the images for the specific, current viewing pose. The viewing pose will typically correspond to the head pose, and references to the viewing pose may typically be considered equivalent to references to the head pose.

In many applications, particularly for broadcast services, a source may transmit scene data in the form of an image (including video) representation of the scene which is independent of the viewer pose. For example, an image representation for a single view sphere for a single capture position may be transmitted to a plurality of clients. The individual clients may then locally synthesize view images corresponding to the current viewer pose.

An application attracting particular interest is one where a limited amount of movement is supported, such that the presented views are updated to follow small movements and rotations corresponding to a substantially static viewer making only small head movements and rotations of the head. For example, a viewer sitting down can turn his head and move it slightly, with the presented views/images being adapted to follow these pose changes. Such an approach may provide a highly immersive, e.g. video, experience. For example, a viewer watching a sports event may feel that he is present at a particular spot in the arena.

Such limited freedom applications have the advantage of providing an improved experience while not requiring an accurate representation of the scene from many different positions, thereby substantially reducing the capture requirements. Similarly, the amount of data that needs to be provided to a renderer can be reduced substantially. Indeed, in many scenarios, only image and typically depth data for a single viewpoint need to be provided, with the local renderer being able to generate the desired views from this.

The method may be particularly suitable for applications requiring data to be transmitted from a source to a destination over a bandwidth-limited communication channel, such as broadcast or client server applications.

Fig. 1 shows one such example of a VR system, where a remote VR client device 101 is in contact with a VR server 103 via a network 105, such as the internet. The server 103 may be arranged to support a potentially large number of client devices 101 simultaneously.

The VR server 103 may, for example, support a broadcast experience by transmitting an image signal comprising an image representation in the form of image data that the client devices can use to locally synthesize view images corresponding to the appropriate poses.

Fig. 2 shows example elements of an example implementation of an apparatus for processing a depth map. The apparatus may be embodied in the VR server 103 and will be described in this context. Fig. 3 shows a flow chart of a method for processing a depth map performed by the apparatus of fig. 2.

The device/VR server 103 comprises a receiver 201 performing step 301, in which step 301 a plurality of images representing a scene from different viewing poses and corresponding depth maps are received.

The images include light intensity information, and the pixel values of the images reflect light intensity values. In some examples, a pixel value may be a single value, e.g. a luminance value for a grayscale image, but in many embodiments a pixel value may be a set or vector of (sub-)values, such as color channel values for a color image (e.g. RGB or YUV values may be provided).

The depth map for an image may include depth values for the same viewport. For example, for each pixel of the image for a given view/capture/anchor pose, the corresponding depth map includes a pixel having a depth value. Thus, the same position in the image and in its corresponding depth map provides, respectively, the light intensity and the depth of the ray corresponding to that pixel. In some embodiments, the depth map may have a lower resolution, and, for example, one depth map pixel may correspond to a plurality of image pixels. However, in that case there may still be a direct one-to-one correspondence between positions in the image and positions in the depth map (including sub-pixel positions).

For the sake of brevity and to avoid complexity, the following description will focus on an example in which only three images and corresponding depth maps are provided. It is also assumed that these images are provided by a linear arrangement of cameras capturing a scene from three different viewing positions and having the same orientation as shown in fig. 4.

It should be appreciated that in many embodiments, a large number of images are typically received, and the scene is typically captured from a larger number of capture gestures.

The output of the receiver 201 is fed to a depth map updater, which for brevity is hereinafter simply referred to as the updater 203. The updater 203 performs step 303, in which one or more (typically all) of the received depth maps are updated. The updating comprises updating the depth values of a first received depth map based on the depth values of at least a second received depth map. Thus, a cross-depth map and cross-viewing pose update is performed to generate improved depth maps.

In the example, the updater 203 is coupled to an image signal generator 205 which performs step 305, in which it generates an image signal comprising the received images and the updated depth maps. The image signal may then, for example, be transmitted to the VR client device 101, where it may be used as the basis for synthesizing view images for the current viewer pose.

In an example, the depth map update is thus performed in the VR server 103, the updated depth map being distributed to the VR client device 101. However, in other embodiments, the depth map update may be performed, for example, in the VR client device 101. For example, the receiver 201 may be part of the VR client device 101 and receive images and corresponding depth maps from the VR server 103. The received depth map may then be updated by the updater 203 and instead of the image signal generator 205, the apparatus may comprise a renderer or view image synthesizer arranged to generate a new view based on the image and the updated depth map.

In other embodiments, all of the processing may be performed in a single device. For example, the same device may receive directly captured information and generate an initial depth map, e.g., by disparity estimation. The generated depth map may be updated and the synthesizer of the device may dynamically generate a new view.

Thus, the location of the described functionality and the specific use of the updated depth map will depend on the preferences and requirements of the various embodiments.

The updating of the depth map is accordingly based on one or more other depth maps representing depths from different spatial locations and for different images. The method makes use of the following recognition: for depth maps, not only are the absolute accuracy or reliability of the individual depth values or depth maps important for producing perceptual quality, but also the consistency between different depth maps.

In fact, an underlying insight is that when errors or inaccuracies are not consistent between depth maps, i.e. when they vary from source view to source view, they are perceived as particularly detrimental, as they effectively cause the viewer to perceive the virtual scene as jarring or unstable when he changes position.

This view consistency is not always adequately achieved by the depth map estimation process. This is for example the case when a separate depth sensor is used to obtain a depth map for each view. In that case, the depth data is captured completely independently. At the other extreme, where all views are used to estimate depth (e.g. using a plane-sweep algorithm), the results may still be inconsistent, since the result will depend on the specific multi-view disparity algorithm used and its parameter settings. The specific approach described in the following may mitigate such problems in many scenarios, and the depth maps may be updated to provide increased consistency between the depth maps, and thus improved perceived image quality. The approach may improve the quality of depth maps for a set of multi-view images of a scene.

Fig. 5 shows a flow chart for performing an update on a pixel of a depth map. The method may be repeated for some or all depth map pixels to generate an updated first depth map. The process may then also be repeated for other depth maps.

The updating of a pixel (hereinafter referred to as the first depth pixel) in a depth map (hereinafter referred to as the first depth map) starts in step 501, wherein a first candidate depth value is determined for the first depth pixel. The position of the first depth pixel in the first depth map is referred to as the first depth map position. Corresponding terms are used for the other views, with only the numeral label being changed.

The first candidate depth value is determined in response to at least one second depth value, which is a depth value of a second depth pixel at a second depth map position in the second depth map. Thus, the first candidate depth value is determined from one or more depth values of the further depth map. The first candidate depth value may specifically be an estimate of the correct depth value for the first depth pixel based on information contained in the second depth map.

Step 501 is followed by step 503 wherein an updated first depth value is determined for the first depth pixel by a weighted combination of a plurality of candidate depth values for the first depth map position. The first candidate depth value determined in step 501 is included in the weighted combination.

Thus, in step 501, one of a plurality of candidate depth values for a subsequent combination is determined. In most embodiments, the plurality of candidate depth values may be determined in step 501 by repeating the process described for the first candidate depth value for other depth values in the second depth map and/or for depth values in other depth maps.

In many embodiments, one or more candidate depth values may be determined in other ways or from other sources. In many embodiments, one or more of the candidate depth values may be depth values from the first depth map, such as depth values in a neighborhood of the first depth pixel. In many embodiments, the original first depth value, i.e. the depth value received by the receiver 201 for the first depth pixel in the first depth map, may be included as one of the candidate depth values.

Accordingly, the updater 203 may perform a weighted combination of candidate depth values comprising at least one candidate depth value determined as described above. The number, properties, origin, etc. of any other candidate depth values will depend on the preferences and requirements of the individual embodiment and on the exact depth update operation desired.

For example, in some embodiments, the weighted combination may only include the first candidate depth value and the original depth value determined in step 501. In this case, only a single weight for the first candidate depth value may be determined, for example, and the weight for the original depth value may be constant.

As another example, in some embodiments, the weighted combination may be a combination of a number of candidate depth values including values determined from other depth maps and/or locations, original depth values, depth values in a neighborhood in the first depth map, or indeed even depth values in an alternative depth map (e.g., a depth map using a different depth estimation algorithm). In such a more complex embodiment, a weight may be determined, for example, for each candidate depth value.

It should be appreciated that any suitable form of weighted combination may be used including, for example, a non-linear combination or a selective combination (where one candidate depth value is given a weight of 1 and all other candidate depth values are given a weight of 0). However, in many embodiments, linear combinations, in particular weighted averages, may be used.

Thus, as a specific example, the updated depth value for image coordinates (u, v) in depth map/view k may be a weighted average of a set of candidate depth values z_i, i ∈ {1, …, n}, at least one of which is generated as described in step 501. In this case, the weighted combination may correspond to a filter function given as:

$$\hat{z}_k(u,v) = \frac{\sum_{i=1}^{n} w_i \, z_i}{\sum_{i=1}^{n} w_i}$$

where $\hat{z}_k(u,v)$ is the updated depth value at pixel position (u, v) for view k, $z_i$ is the i-th input candidate depth value, and $w_i$ is the weight of the i-th input candidate depth value.
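A minimal sketch of this filter function in Python is given below (it is not the reference implementation of the described method; the function name and the example numbers are illustrative only):

```python
import numpy as np

def combine_candidates(candidates, weights):
    """Weighted combination of candidate depth values for one depth pixel.

    candidates and weights are sequences of equal length; the weights are
    non-negative and are normalized here. If all weights are zero, the
    first candidate (e.g. the original depth value) is returned unchanged.
    """
    z = np.asarray(candidates, dtype=float)
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    if total <= 0.0:
        return float(z[0])
    return float((w * z).sum() / total)

# Example: the original value 2.10 with a fixed weight, plus two candidates
# derived from neighbouring views with match-error based weights.
print(combine_candidates([2.10, 2.04, 2.55], [1.0, 0.8, 0.1]))
```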

The approach uses a specific way of determining the weight of the first candidate depth value, i.e. the first weight. This will be described with reference to the flowchart of FIG. 6 and the images and depth maps of FIG. 7.

Fig. 7 shows an example in which three images and three corresponding depth maps are provided/considered. The first image 701 is provided together with a first depth map 703. Similarly, a second image 705 is provided with a second depth map 707, and a third image 709 is provided with a third depth map 711. The following description will focus on determining a first weight for a first depth value of the first depth map 703 based on a depth value from the second depth map 707 and further taking into account the third image 709.

The first weight (for the first candidate depth value) is thus determined for the first depth pixel/first depth map position based on one or more second depth values for the second depth pixel at the second depth map position in the second depth map 707. In particular, as indicated by arrow 713 in FIG. 7, the first candidate depth value may be determined as the second depth value at the corresponding position in the second depth map 707.

The determination of the first weight starts in step 601, where the updater 203 determines a first image position in the first image 701 corresponding to the first depth map position, as indicated by arrow 715. Typically, this may simply be the same position and image coordinates. The pixel in the first image 701 at the first image position is referred to as the first image pixel.

The updater 203 then continues in step 603 to determine a third image position in a third image 709 of the plurality of images based on the first candidate depth values, wherein the third image position corresponds to a projection of the first image position to the third image. The third image position may be determined from a direct projection of the image coordinates of the first image 701, indicated by arrow 717.

The updater 203 accordingly continues to project the first image position to a third image position in the third image 709. The projection is based on the first candidate depth value. Thus, the projection of the first image position to the third image 709 is based on a depth value which can be considered as an estimate of the first depth value determined based on the second depth map 707.

In some embodiments, the determination of the third image location may be based on a projection of the depth map location. For example, the updater 203 may continue to project the first depth map position (the position of the first depth pixel) to the third depth map position in the third depth map 711 as indicated by arrow 719. The projection is based on the first candidate depth value. Thus, the projection of the first depth map position to the third depth map 711 is based on a depth value, which can be considered as an estimate of the first depth value determined based on the second depth map 707.

The third image position may then be determined as the image position in the third image 709 corresponding to the third depth map position as indicated by arrow 721.

It should be understood that these two ways are equivalent.

The projection from one depth map/image to a different depth map/image may be a determination of the depth map/image position in a different depth map/image representing the same scene point as the depth map/image position in one depth map/image. Since the depth map/images represent different viewing/capture gestures, parallax effects will result in a shift in image position for a given point in the scene. The offset will depend on the change in viewing pose and the depth of the point in the scene. The projection from one image/depth map to another image/depth map may accordingly also be referred to as image/depth map position offset or determination.

As an example, the projection of image coordinates (u, v)_l in one view (l), with its depth value z_l(u, v), to the corresponding image coordinates (u, v)_k in an adjacent view (k) can, for a perspective camera, for example be performed by:

1. un-projecting the image coordinates (u, v)_l, using z_l and the intrinsic parameters (focal length and principal point) of camera (l), into 3D space, giving the point (x, y, z)_l;

2. transforming the un-projected point (x, y, z)_l from the coordinate system of camera (l) into the coordinate system of camera (k), giving (x, y, z)_k, using their relative extrinsic parameters (camera rotation matrix R and translation vector t);

3. projecting the resulting point (x, y, z)_k onto the image plane of camera (k) using the intrinsic parameters of camera (k), resulting in the image coordinates (u, v)_k.

Similar mechanisms may be used for other camera projection types, such as the equirectangular projection (ERP).
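The three steps above can be sketched as follows for a perspective (pinhole) camera; the 3x3 intrinsic matrices K_l and K_k, the rotation matrix R and the translation vector t are assumed to be given, and the function is an illustrative sketch rather than the implementation used in the described system:

```python
import numpy as np

def project_to_view(u, v, z, K_l, K_k, R, t):
    """Warp pixel (u, v) with depth z from camera (l) to camera (k).

    K_l, K_k: 3x3 intrinsic matrices. R (3x3) and t (3,) are the relative
    extrinsics mapping camera-(l) coordinates into camera-(k) coordinates.
    Returns the image coordinates (u_k, v_k) and the depth z_k in view (k).
    """
    # 1. Un-project (u, v) with depth z into 3D camera-(l) coordinates.
    p_l = z * (np.linalg.inv(K_l) @ np.array([u, v, 1.0]))
    # 2. Transform the point into the camera-(k) coordinate system.
    p_k = R @ p_l + t
    # 3. Project onto the image plane of camera (k).
    uvw = K_k @ p_k
    return uvw[0] / uvw[2], uvw[1] / uvw[2], p_k[2]
```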

In the described approach, the projection based on the first candidate depth value may thus be considered to determine the third depth map/image position of the scene point corresponding to the first depth map/image position, for a depth equal to the first candidate depth value and for the change in viewing pose between the first and third viewing poses.

Different depths will result in different offsets, and in the present case the offset in image and depth map position between the first viewing pose (for the first depth map 703 and the first image 701) and the third viewing pose (for the third depth map 711 and the third image 709) is based on at least one depth value of the second depth map 707.

In step 603, the updater 203 thus determines a position in the third depth map 711 and the third image 709, respectively, which will reflect the same scene point as the first image pixel in the first image 701 if the first candidate depth value is indeed the correct value for the first depth value and the first image pixel. Any deviation of the first candidate depth value from the correct value may result in an incorrect position being determined in the third image 709. It should be noted that scene points here refer to scene points on the ray associated with the pixel, but they may not necessarily be the frontmost scene points for both viewing poses. For example, if a scene point seen from the first viewing pose is occluded by a (more) foreground object seen from the second viewing pose, the depth maps, and hence the depth values of the images, may represent different scene points and may therefore have very different values.

Step 603 is followed by step 605, wherein a first match error indication is generated based on the content of the first and third images 701, 709 at the first image position and the third image position, respectively. Specifically, the image pixel value of the third image at the third image position is retrieved. In some embodiments, this image pixel value may be determined as the pixel value in the third image 709 at the position for which the third depth map 711 provides the determined depth value. It should be appreciated that in many embodiments, i.e. where the same resolution is used for the third depth map 711 and the third image 709, directly determining the position in the third image 709 corresponding to the first depth map position is equivalent to determining the position in the third depth map 711 (arrow 719) and retrieving the corresponding image pixel.

Similarly, the updater 203 proceeds to extract the pixel value of the first image 701 at the first image position, and then determines a first match error indication indicative of the difference between the two image pixel values. It will be appreciated that any suitable difference metric may be used, such as a simple absolute difference, a root-squared difference applied to, for example, the pixel value components of multiple color channels, etc.

Thus, the updater 203 determines 605 a first match error indication indicating a difference between image pixel values in the third image for the third image position and image pixel values in the first image for the first image position.
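The match error could, for instance, be computed as a sum of absolute per-channel differences; the sketch below is one possible realization under that assumption.

```python
import numpy as np

def match_error(pixel_first, pixel_third):
    """Sum of absolute colour-channel differences between two pixel values.

    Other metrics (e.g. squared differences, or a spatial average over a
    small window) could equally be used; this choice is an assumption.
    """
    a = np.asarray(pixel_first, dtype=float)
    b = np.asarray(pixel_third, dtype=float)
    return float(np.sum(np.abs(a - b)))
```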

The updater 203 then proceeds to step 607, wherein a first weight is determined in response to the first match error indication. It should be appreciated that the specific manner of determining the first weight from the first match error indication may vary between embodiments. In many embodiments, more complex considerations, including for example other match error indications, may be used; more examples of this will be provided later.

As a low complexity example, the first weight may be determined as a monotonically decreasing function of the first match error indication in some embodiments, and in many embodiments without taking any other parameters into account.

For example, where the weighted combination comprises only the first candidate depth value and the original depth value of the first depth pixel, the combination may apply a fixed weight to the original depth value while the first weight is increased the lower the first match error indication is (typically followed by a normalization of the weights).
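A minimal sketch of this low-complexity example is given below; the exponential mapping and the scale constant are assumptions, the text only requiring a monotonically decreasing function.

```python
import math

def candidate_weight(first_match_error, scale=10.0):
    """Monotonically decreasing mapping from match error to weight.

    The exponential form and the scale of 10.0 are illustrative assumptions.
    """
    return math.exp(-first_match_error / scale)

def updated_depth(original_depth, candidate_depth, first_match_error,
                  original_weight=1.0):
    """Weighted combination of the original depth value (fixed weight) and
    the first candidate depth value (variable weight), with normalization."""
    w = candidate_weight(first_match_error)
    return (original_weight * original_depth + w * candidate_depth) / (original_weight + w)
```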

The first match error indication may be considered to reflect the degree of match of the first and third images in terms of representing a given scene point. If there is no occlusion difference between the first image and the third image, and if the first candidate depth value is the correct value, the image pixel values should be the same and the first match error indication should be zero. If the first candidate depth value deviates from the correct value, the image pixels in the third image may not directly correspond to the same scene point, and thus the first match error indication may increase. If the occlusion changes, the error may be very high. Thus, the first match error indication may provide a good indication of the accuracy and fitness of the first candidate depth value for the first depth pixel.

In different embodiments, the first candidate depth value may be determined from one or more depth values of the second depth map in different ways. Similarly, different ways may be used to determine which candidate values to generate for a weighted combination. In particular, a plurality of candidate values may be generated from the depth values of the second depth map, and a weight may be calculated separately for each of these candidate values in the manner described with respect to fig. 6.

In many embodiments, the determination of which second depth values to use to derive the first candidate depth value depends on a projection between the first depth map and the second depth map, determining the corresponding position in the two depth maps. In particular, in many embodiments, the first candidate depth value may be determined as the second depth value at the second depth map position, which is considered to correspond to the first depth map position, i.e. the second depth value is selected as the depth value considered to represent the same scene point.

The determination of the corresponding first and second depth map positions may be based on a projection from the first depth map to the second depth map, i.e. may be based on the original first depth value, or it may be based on a projection from the second depth map to the first depth map, i.e. may be based on the second depth value. In some embodiments, projections in two directions may be performed and, for example, an average of the projections may be used.

Thus, determining the first candidate depth value may comprise determining the second depth map position relative to the first depth map position by projection between the first viewing pose of the first image and the second viewing pose of the second image, based on at least one of the second depth value and the original first depth value of the first depth map.

For example, for a first pixel in a given first depth map, the updater 203 may extract a depth value and use it to project the corresponding first depth map location to the corresponding second depth map location in the second depth map. It may then extract the second depth value at that location and use it as the first candidate depth value.

As another example, for a second pixel in a given second depth map, the updater 203 may extract a depth value and use it to project the corresponding second depth map location to the corresponding first depth map location in the first depth map. It may then extract the second depth value and use it as a first candidate depth value for the first depth pixel at the first depth map location.
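The first of these two examples might be sketched as follows, reusing the project_to_view helper sketched earlier; the parameter names and the nearest-pixel rounding are assumptions.

```python
def candidate_from_second_map(u1, v1, depth_map_1, depth_map_2,
                              K1, K2, R12, t12):
    """Project the first depth map position into the second depth map using
    the original first depth value, and use the second depth value found
    there directly as the first candidate depth value."""
    z1 = float(depth_map_1[v1, u1])                    # original first depth value
    u2, v2 = project_to_view(u1, v1, z1, K1, K2, R12, t12)
    u2, v2 = int(round(u2)), int(round(v2))            # nearest pixel (assumption)
    h, w = depth_map_2.shape
    if not (0 <= u2 < w and 0 <= v2 < h):
        return None                                    # projected outside view 2
    return float(depth_map_2[v2, u2])
```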

In such embodiments, the depth value in the second depth map is directly used as the first candidate depth value. However, the depth values may be different because the two depth map pixels represent (without occlusion) distances to the same scene point but from different viewpoints. In many practical embodiments, this difference in distance from the camera/viewing pose at different locations to the same scene point is negligible and can be ignored. Thus, in many embodiments, it may be assumed that the cameras are perfectly aligned and looking in the same direction and have the same position. In that case, if the object is flat and parallel to the image sensor, the depth may indeed be exactly the same in the two corresponding depth maps. The deviation from this is usually negligibly small.

However, in some embodiments, determining the first candidate depth value from the second depth value may comprise modifying a projection of the depth value. This may be based on more detailed geometrical calculations, including considering the projection geometry of the two views.

In some embodiments, more than a single second depth value may be used to generate the first candidate depth value. For example, spatial interpolation may be performed between different depth values to compensate for projections that are not aligned with the center of the pixel.

As another example, in some embodiments, the first candidate depth value may be determined as a result of spatial filtering, wherein a kernel centered on the second depth map position is applied to the second depth map.

The following description will focus on embodiments where each candidate depth value depends only on a single second depth value and is also equal to the second depth value.

In many embodiments, the weighted combination may further comprise a plurality of candidate depth values determined from different second depth values.

In particular, in many embodiments, the weighted combination may comprise candidate depth values from the region of the second depth map. The region may generally be determined based on the first depth map location. In particular, the second depth map position may be determined by projection (in either or both directions) as previously described, and the region may be determined as a region (e.g., having a predetermined contour) around the second depth map position.

The approach may accordingly provide a set of candidate depth values for the first depth pixel in the first depth map. For each candidate depth value, the updater 203 may perform the method of fig. 6 to determine a weight for weighted combination.

One particular advantage of this approach is that selecting the second depth value for the candidate depth value is not overly important, as subsequent weight determinations will properly weight good and bad candidates. Thus, in many embodiments, a relatively low complexity approach may be used to select the candidate values.

In many embodiments, the region may for example simply be determined as a predetermined area around a position in the second depth map determined by a projection from the first depth map to the second depth map based on the original first depth value. Indeed, in many embodiments, the projection may even be omitted, with the region simply being selected around the position in the second depth map that is the same as the first depth map position. Thus, the approach may simply select a candidate set of depth values by taking the second depth values in an area around the position in the second depth map that is the same as the position of the first pixel in the first depth map.

This approach may reduce resource usage while still providing efficient operation. It may be particularly appropriate when the size of the region is relatively large compared to the position/disparity shift that occurs between the depth maps.
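A sketch of this simplification, gathering candidates from a fixed window at the same coordinates in the second depth map, might look as follows; the window radius is an assumption.

```python
import numpy as np

def candidate_set(u, v, depth_map_2, radius=2):
    """Collect candidate depth values from a (2*radius+1)^2 window of the
    second depth map centred on the same coordinates as the first depth
    pixel (no projection performed)."""
    h, w = depth_map_2.shape
    v0, v1 = max(0, v - radius), min(h, v + radius + 1)
    u0, u1 = max(0, u - radius), min(w, u + radius + 1)
    return depth_map_2[v0:v1, u0:u1].ravel()
```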

As previously mentioned, many different methods may be used to determine the weights for weighting the respective candidate depth values in the combination.

In many embodiments, the first weight may also be determined in response to additional match error indications determined for other images than the third image. In many embodiments, the described method may be used to generate a match error indication for all other images than the first image. A combined match error indication may then be generated, e.g. as an average of these match error indications, and the first weight may be determined based thereon.

In particular, the first weight may depend on a match error indicator which is a function of the individual match errors from the view being filtered to all other views l ≠ k. An example indicator for determining the weight for a candidate z_i is:

w_i(z_i) = min_{l≠k}( e_kl(z_i) ),

where e_kl(z_i) is the match error between views k and l for the given candidate z_i. The match error may for example depend on the color difference for a single pixel or may be calculated as a spatial average around the pixel position (u, v). Instead of calculating the minimum match error over the views l ≠ k, an average or median value may for example be used. In many embodiments, the evaluation function may preferably be robust to match error outliers caused by occlusions.
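The sketch below illustrates this per-candidate combination of match errors across views; the "min", "mean" and "median" modes correspond to the alternatives mentioned above.

```python
import numpy as np

def weight_indicator(match_errors, mode="min"):
    """Combine the match errors e_kl(z_i) of one candidate against all other
    views l != k into a single indicator, from which the weight may then be
    derived (e.g. by a monotonically decreasing mapping)."""
    e = np.asarray(list(match_errors), dtype=float)
    if mode == "min":
        return float(e.min())
    if mode == "median":
        return float(np.median(e))
    return float(e.mean())
```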

In many embodiments, a second match error indication may be determined for the second image (i.e. for the view from which the first candidate depth value was generated). The determination of this second match error indication may use the same method as described for the first match error indication and may generate a second match error indication indicating the difference of image pixel values in the second image for the second depth map position and image pixel values in the first image for the first depth map position.

The first weight may then be determined in response to the first and second match error indications (and possibly other match error indications or parameters).

In some embodiments, such a weight determination may take into account not only, for example, the average of the match error indications, but also the relative differences between them. For example, if the first match error indication is relatively low and the second match error indication is relatively high, this may be due to occlusion in the second image relative to the first image (but not in the third image). In that case, the first weight may be reduced or even set to zero.

Other examples of weight considerations may, for example, use statistical measures such as the median match error or other quantiles. Similar reasoning as described above applies here. For example, if we have a linear array of nine cameras all looking in the same direction, we can assume that, for the center camera, the area around the edge of an object will always be non-occluded in either the four anchors to the left or the four anchors to the right. In this case, a good total weight for a candidate may be a function of only the lowest four of the eight match errors.
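A sketch of that nine-camera example, keeping only the lowest four of the eight match errors, is given below; the choice of the mean over the kept errors is an assumption.

```python
import numpy as np

def robust_total_error(match_errors, keep=4):
    """Combine only the 'keep' smallest match errors (e.g. 4 of 8 for the
    nine-camera example) so that occluded views do not dominate the total."""
    e = np.sort(np.asarray(list(match_errors), dtype=float))
    return float(e[:keep].mean())
```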

In many embodiments, the weighted combination may comprise other depth values of the first depth map itself. In particular, a set of depth pixels in the first depth map around the first depth position may be included in the weighted combination. For example, a predetermined spatial kernel may be applied to the first depth map, resulting in a low pass filtering of the first depth map. The weighting of the spatially low-pass filtered first depth map value and the candidate depth values from the other views may then be adjusted, e.g. by applying a fixed weight to the low-pass filtered depth values and a variable first weight to the first candidate depth values.
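One possible sketch of such a combination, using a box filter as the predetermined spatial kernel and a fixed weight for the filtered value from the first depth map itself, is shown below; the kernel size, the fixed weight and the scipy dependency are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def update_with_self_term(depth_map_1, u, v, candidates, candidate_weights,
                          self_weight=1.0, kernel=3):
    """Weighted combination of a low-pass filtered depth value from the
    first depth map (fixed weight) and candidate depth values from other
    views (variable weights), followed by normalization."""
    smoothed = uniform_filter(depth_map_1.astype(float), size=kernel)
    num = self_weight * smoothed[v, u]
    den = self_weight
    for z, w in zip(candidates, candidate_weights):
        num += w * z
        den += w
    return num / den
```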

In many embodiments, the determination of the weight, in particular the first weight, also depends on the confidence value for the depth value.

Depth estimation and measurement are noisy in nature and various errors and variations may occur. In addition to depth estimation, many depth estimation and measurement algorithms may also generate confidence values that indicate how reliable the provided depth estimates are. For example, disparity estimation may be based on detecting matching regions in different images, and a confidence value may be generated to reflect the degree of similarity of the matching regions.

The confidence value may be used in different ways. For example, in many embodiments, the first weight for the first candidate depth value may depend on the confidence value for the first candidate depth value, and in particular on the confidence value for the second depth value used to generate the first candidate depth value. The first weight may be a monotonically increasing function of the confidence value for the second depth value, so that the first weight increases with increasing confidence of the underlying depth value from which the first candidate depth value was generated. The weighted combination is thereby biased towards depth values that are considered reliable and accurate.

In some embodiments, the confidence values for the depth map may be used to select which depth values/pixels to update and for which depth pixels to leave the depth values unchanged. In particular, the updater 203 may be arranged to select for updating only those depth values/pixels of the first depth map that have a confidence value below a threshold.

Thus, instead of updating all pixels in the first depth map, the updater 203 specifically identifies depth values that are considered unreliable and updates only these values. This may result in an improved overall depth map in many embodiments, since it can prevent, for example, very accurate and reliable depth estimates from being replaced by more uncertain values generated from depth values from other viewpoints.
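As a minimal sketch, this selection could be expressed as follows; the threshold value and the convention that confidence lies in [0, 1] are assumptions.

```python
import numpy as np

def pixels_to_update(confidence_map, threshold=0.5):
    """Return the (row, col) indices of depth pixels whose confidence falls
    below the threshold; only these pixels are updated, all others keep
    their original depth values."""
    return np.argwhere(confidence_map < threshold)
```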

In some embodiments, the set of depth values of the second depth map included in the weighted combination, either by contributions to different candidate depth values or contributions to the same candidate depth value, may depend on the confidence value for the depth value. In particular, only depth values having a confidence value above a given threshold may be included, and all other depth values may be discarded from processing.

For example, the updater 203 may initially generate a modified second depth map by scanning the second depth map and removing all depth values having confidence values below a threshold. The previously described processing may then be performed using the modified second depth map, wherein all operations requiring the second depth value are bypassed if no such second depth value is present in the second depth map. For example, if the second depth value does not exist, a candidate depth value for the second depth value is not generated.

In some embodiments, the updater 203 may further be arranged to generate a confidence value for the depth value.

In some embodiments, a confidence value for a given depth value in a given depth map may be determined in response to the variation of the depth values of other depth maps at the corresponding locations in those depth maps.

The updater 203 may first project the depth map position of the given depth value for which a confidence value is being determined to a corresponding position in a plurality of other depth maps, and typically in all of them.

In particular, for a given depth value at image coordinates (u, v)_k in depth map k, a set L of other depth maps (typically for neighboring views) is determined. For each of these depth maps (l ∈ L), the corresponding image coordinates (u, v)_l are calculated by re-projection.

The updater 203 may then consider the depth values of the other depth maps at these corresponding locations and proceed to determine a variation metric for these depth values. Any suitable measure of variation may be used, such as the variance.

The updater 203 may then proceed to determine confidence values for a given depth map position from such a measure of variation, and in particular an increasing degree of variation may indicate a decreasing confidence value. Thus, the confidence value may be a monotonically decreasing function of the measure of variation.

In particular, given the depth value z_k and the set z_l of corresponding adjacent depth values at (u, v)_l for l ∈ L, a confidence indicator may be calculated based on the consistency of these depth values. For example, the variance of the depth values may be used as a confidence indicator, where a low variance means a high confidence.

It is often desirable to make such determinations more robust to outliers resulting from corresponding image coordinates (u, v)_k that may be occluded by objects in the scene or that fall outside the camera boundaries. One particular way of achieving this is to select two adjacent views l_0 and l_1 at opposite sides of the camera view k and to use the minimum of the two depth differences.
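The two confidence indicators just described might be sketched as follows; the 1/(1+x) mappings are assumptions, the text only requiring monotonically decreasing functions of the variation.

```python
import numpy as np

def variance_confidence(neighbour_depths):
    """Confidence from the spread of the corresponding depth values z_l in
    the neighbouring depth maps: low variance -> high confidence."""
    var = float(np.var(np.asarray(list(neighbour_depths), dtype=float)))
    return 1.0 / (1.0 + var)

def robust_confidence(z_k, z_l0, z_l1):
    """Occlusion-robust variant using two neighbouring views l0 and l1 on
    opposite sides of view k: take the minimum of the two depth differences."""
    d = min(abs(z_k - z_l0), abs(z_k - z_l1))
    return 1.0 / (1.0 + d)
```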

In some embodiments, a confidence value for a given depth value in a given depth map can be determined by evaluating the error resulting from projecting the corresponding depth map position to another depth map and then projecting it back, using the depth values of the two depth maps for the two projections.

Thus, the updater 203 may first project a given depth map location to another depth map based on a given depth value. The depth value at the projected location is then retrieved and the location in the further depth map is projected back to the original depth map based on the further depth value. This results in a test location that is exactly the same as the original depth map location if the two depth values used for projection match perfectly (e.g., taking into account camera and capture properties and geometry). However, any noise or error will produce a difference between the two positions.

The updater 203 may accordingly determine a confidence value for the given depth map position in response to the distance between the given depth map position and the test position. The smaller the distance, the higher the confidence value, and thus the confidence value may be determined as a monotonically decreasing function of the distance. In many embodiments, multiple other depth maps may be considered, with a distance being determined for each of them.

Thus, in some embodiments, the confidence value may be determined based on the geometric consistency of the motion vectors. Let d_kl denote the 2D motion vector which, given its depth z_k, brings pixel (u, v)_k to the adjacent view l. Each corresponding pixel position (u, v)_l in the neighboring view l has its own depth z_l, which produces a vector d_lk back to view k. In the ideal case of zero error, all these vectors map exactly back to the original point (u, v)_k. In general, however, this is not the case, and certainly not for regions with insufficient confidence. A good metric for the lack of confidence is therefore the average error of the back-projected positions. The error indicator may be expressed as:

e_k(u, v) = (1/|L|) Σ_{l∈L} || f((u, v)_l; z_l) − (u, v)_k ||,

where f((u, v)_l; z_l) denotes the image coordinates in view k obtained by back-projection from the adjacent view l using the depth value z_l. The norm ||·|| may be the L1 norm, the L2 norm, or any other suitable norm. The confidence value may be determined as a monotonically decreasing function of this value. It should be understood that the term "candidate" does not imply any limitation on depth values, and that the term candidate depth value may refer to any depth value included in the weighted combination.

It will be appreciated that for clarity, the above description has described embodiments of the invention with reference to different functional circuits, units and processors. It will be apparent, however, that any suitable distribution of functionality between different functional circuits, units or processors may be used without detracting from the invention. For example, functions illustrated as being performed by separate processors or controllers may be performed by the same processor or controllers. Thus, references to specific functional units or circuits are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.

The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. Thus, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.

Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the invention is limited only by the appended claims. Furthermore, although a feature may appear to be described in connection with particular embodiments, one of ordinary skill in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.

Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor. Furthermore, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Furthermore, the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. Furthermore, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second", etc., do not preclude a plurality. The terms "first", "second", "third", etc. are used merely as labels providing a clear identification of corresponding features; they do not imply any other limitation and should not be construed as limiting the scope of the claims in any way. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.
