Method and apparatus for performing depth estimation of an object

Document No.: 1026938    Publication date: 2020-10-27

Description: This technology, "Method and apparatus for performing depth estimation of an object," was designed and created by K. Srinivasan, P.K.S. Krishnappa, P.P. Deshpande, and R. Sarkar on 2019-03-11. Its main content is as follows: A method and apparatus for performing depth estimation of objects in an image of a scene is provided. The method includes: capturing an image of a scene; obtaining, by a sensor, pixel intensity data and event data from the image; generating an event depth map using the event data, wherein the event data includes event map data for the image and event velocity data for the image; and generating a depth map of the object in the image using the event depth map and the pixel intensity data.

1. A method for performing, by an electronic device, depth estimation of an object in an image, the method comprising:

capturing, by the electronic device, an image of a scene;

obtaining, by a sensor of the electronic device, pixel intensity data and event data from the image;

generating an event depth map using the event data, wherein the event data includes event map data for the image and event velocity data for the image; and

generating a depth map of an object in the image using the event depth map and the pixel intensity data.

2. The method of claim 1, wherein generating an event depth map comprises:

generating a spatiotemporal two-dimensional (2D) event map and an event intensity image by processing event map data of the image; and

generating the event depth map using the spatiotemporal 2D event map and the event velocity data.

3. The method of claim 2, wherein generating an event depth map using the spatiotemporal 2D event map and the event velocity data comprises using a degree of rotation of a capture device of the electronic device and translation estimation data of a sensor of the electronic device.

4. The method of claim 3, further comprising distributing the event depth map along an edge of the image.

5. The method of claim 1, wherein generating a depth map of an object in an image using the event depth map and the pixel intensity data comprises combining the event depth map with the pixel intensity data.

6. The method of claim 1, wherein generating a depth map of an object in an image comprises:

obtaining a red, green, and blue (RGB) image from the image using the pixel intensity data;

generating an intermediate depth map based on the event depth map and the event intensity image; and

generating the depth map by combining the intermediate depth map with the RGB image.

7. The method of claim 6, wherein combining the intermediate depth map with the RGB image comprises combining the intermediate depth map with the RGB image by applying a guided surface fitting process.

8. The method of claim 6, wherein generating the depth map of the object in the image further comprises post-processing the intermediate depth map using depth smoothing and hole filling.

9. The method of claim 1, wherein obtaining, by a sensor of an electronic device, pixel intensity data and event data comprises:

obtaining the pixel intensity data by an Active Pixel Sensor (APS) of the sensor; and

obtaining the event data by a Dynamic Vision Sensor (DVS) of the sensor.

10. The method of claim 1, further comprising performing timing synchronization and timing correction of the event data with respect to the pixel intensity data.

11. The method of claim 1, further comprising:

performing frequency scaling on a depth estimation circuit of the electronic device based on a scene data rate and a maximum throughput of the electronic device; and

determining an operating frequency of the electronic device based on the frequency scaling.

12. The method of claim 1, further comprising:

obtaining, by the sensor, depth information of the object in the image; and

obtaining, by the sensor, motion information of the object in the image.

13. The method of claim 1, further comprising generating pixel intensity data by interpolating accumulated event data.

14. An apparatus for performing depth estimation of an object in an image, the apparatus comprising:

a camera configured to capture an image of a scene;

a sensor configured to obtain pixel intensity data and event data from an image; and

a processor configured to:

generate an event depth map using the event data, wherein the event data comprises event map data for the image and event velocity data for the image, and

generate a depth map of the object using the event depth map and the pixel intensity data.

15. A non-transitory computer-readable medium storing instructions thereon, wherein the instructions, when executed, instruct at least one processor to perform a method comprising:

capturing, by an electronic device, an image of a scene;

obtaining, by a sensor of the electronic device, pixel intensity data and event data from the image;

generating an event depth map using the event data, wherein the event data includes event map data for the image and event velocity data for the image; and

generating a depth map of an object in the image using the event depth map and the pixel intensity data.

Technical Field

The present disclosure relates generally to image processing and, more particularly, to a method and apparatus for performing depth estimation of objects in a scene.

Background

Various electronic devices, such as cameras, mobile phones, and other multimedia devices, are used to capture images of a scene. The captured depth map of the scene may be used in different applications, such as robotics, automotive sensing, medical imaging, and three-dimensional (3D) applications. A depth map is an image that includes information about the distance from a viewpoint to a surface included in a scene.

Conventional camera systems and active depth sensors have many processing bottlenecks in depth estimation (e.g., depth estimation in oversaturated regions, depth estimation under varying illumination conditions, and depth estimation of reflective and transparent objects).

For example, in a conventional Complementary Metal Oxide Semiconductor (CMOS) sensor and stereo setup for depth estimation, an accurate depth cannot be estimated in saturated image regions. Furthermore, conventional CMOS sensors are unable to capture images of a scene at high frame rates and low power, which makes them unsuitable for providing fast visual feedback.

In Advanced Driver Assistance Systems (ADAS), an accurate depth map of a scene is necessary for obstacle detection. Furthermore, the ADAS system should be able to operate under a variety of lighting conditions and provide quick visual feedback to the user for proper navigation. However, conventional CMOS sensors do not operate well under various lighting conditions and do not provide rapid visual feedback to the user, which results in poor imaging characteristics of the depth map. In addition, conventional CMOS sensors require high bandwidth. For example, a conventional CMOS sensor may sample at the Nyquist rate, which requires over 20 Gbps.

Therefore, there is a need for a method and apparatus for performing accurate depth estimation of objects in a scene under various lighting conditions.

Disclosure of Invention

Technical solution

According to an aspect of the present disclosure, a method for performing depth estimation of an object in an image by an electronic device is provided. The method includes: capturing, by the electronic device, an image of a scene; obtaining, by a sensor of the electronic device, pixel intensity data and event data from the image; generating an event depth map using the event data, wherein the event data includes event map data for the image and event velocity data for the image; and generating a depth map of the object in the image using the event depth map and the pixel intensity data.

Advantageous effects

The present disclosure provides a method and apparatus for performing accurate depth estimation of objects in an image of a scene under various lighting conditions.

Drawings

The above and other aspects, features and advantages of certain embodiments of the present disclosure will become more apparent from the following description when taken in conjunction with the accompanying drawings, in which:

FIG. 1A illustrates an electronic device for performing depth estimation of objects in a scene according to an embodiment;

FIG. 1B illustrates a sparse depth map and a dense depth map in accordance with an embodiment;

FIG. 1C is a flow diagram illustrating a method of obtaining an event intensity image according to an embodiment;

FIG. 1D illustrates a method of obtaining scene data according to an embodiment;

FIG. 2 illustrates a depth estimation engine of an electronic device for performing depth estimation of objects in a scene, in accordance with an embodiment;

FIG. 3A is a flow diagram illustrating a method for performing depth estimation of objects in a scene according to an embodiment;

FIG. 3B is a flow diagram illustrating a method for generating a sparse event depth map, according to an embodiment;

FIG. 3C is a flow diagram illustrating a method for creating a dense depth map according to an embodiment;

FIG. 4 is a process flow diagram illustrating an electronic device performing depth estimation of objects in a scene in accordance with an embodiment;

FIG. 5 is a process flow diagram illustrating an electronic device performing depth estimation of objects in a scene according to an embodiment;

FIG. 6 is a process flow diagram illustrating an electronic device performing depth estimation of objects in a scene according to an embodiment;

FIG. 7 is a flow diagram illustrating a method for performing depth estimation of objects in a scene according to an embodiment;

FIG. 8A is a process flow diagram illustrating an electronic device performing timing synchronization of event data according to an embodiment;

FIG. 8B illustrates timing synchronization of event data according to an embodiment;

FIG. 9 is a process flow diagram illustrating an electronic device performing frequency scaling according to an embodiment;

FIG. 10 is a process flow diagram illustrating an electronic device performing obstacle detection and navigation of an autonomous vehicle by using a depth map in accordance with an embodiment;

FIG. 11 is a process flow diagram illustrating an electronic device creating a depth map to be used in Augmented Reality (AR) shopping according to an embodiment;

FIG. 12 is a process flow diagram for reconstructing a three-dimensional (3D) scene of a 3D image according to an embodiment; and

FIG. 13 is a process flow diagram for creating a Bokeh effect using a depth map, according to an embodiment.

Detailed Description

The present disclosure is provided to address at least the above problems and/or disadvantages and to provide at least the advantages described below.

According to an aspect of the present disclosure, a method for performing depth estimation of an object in an image by an electronic device is provided. The method includes: capturing, by the electronic device, an image of a scene; obtaining, by a sensor of the electronic device, pixel intensity data and event data from the image; generating an event depth map using the event data, wherein the event data includes event map data for the image and event velocity data for the image; and generating a depth map of the object in the image using the event depth map and the pixel intensity data.

According to an aspect of the present disclosure, there is provided an apparatus for performing depth estimation of an object in an image. The device includes: a camera configured to capture an image of a scene; a sensor configured to obtain pixel intensity data and event data from an image; and a processor configured to generate an event depth map using the event data, wherein the event data comprises event map data for the image and event velocity data for the image, and to generate a depth map for the object using the event depth map and the pixel intensity data.

According to an aspect of the disclosure, a non-transitory computer-readable medium for storing instructions thereon is provided, wherein the instructions, when executed, instruct at least one processor to perform a method. The method includes: capturing, by an electronic device, an image of a scene; obtaining, by a sensor of the electronic device, pixel intensity data and event data from the image; generating an event depth map using the event data, wherein the event data includes event map data for the image and event velocity data for the image; and generating a depth map of the object in the image using the event depth map and the pixel intensity data.

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of various embodiments of the disclosure as defined by the claims and their equivalents. It includes various specific details to assist in this understanding, although these specific details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the various embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

The terms and words used in the following description and claims are not limited to the written meaning but are used by the inventor to convey a clear and consistent understanding of the disclosure. Accordingly, it will be appreciated by those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.

As used herein, the singular forms (such as "a," "an," and "the") include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to a "component surface" includes reference to one or more of such surfaces.

The various embodiments described herein are not necessarily mutually exclusive, as some embodiments may be combined with one or more other embodiments to form new embodiments.

Herein, unless otherwise indicated, the term "or" refers to a non-exclusive or.

The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.

As is conventional in the art of this disclosure, embodiments may be described and illustrated in terms of blocks performing one or more of the described functions. These blocks, which may be referred to herein as managers, engines, controllers, units, modules, and the like, are physically implemented by analog and/or digital circuits (such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits, and the like), and may optionally be driven by firmware and software. The circuitry may be embodied in one or more semiconductor chips or on a substrate support such as a printed circuit board or the like. The circuitry making up the blocks may be implemented by dedicated hardware, or by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some of the functions of the blocks and a processor to perform other functions of the blocks.

Each block of an embodiment may be physically separated into two or more interacting and discrete blocks without departing from the scope of the disclosure. Likewise, blocks of embodiments may be physically combined into more complex blocks without departing from the scope of the disclosure.

According to one embodiment, a method and apparatus for performing depth estimation of objects in a scene is provided.

A sparse event depth map may be generated by processing event data for a scene.

A dense depth map of objects in a scene may be created by combining a sparse event depth map with a stream of pixel intensity data for the scene.

A spatiotemporal two-dimensional (2D) event map and an event intensity image may be generated by processing event map data of an image of a scene.

A sparse event depth map may be generated by processing the spatiotemporal 2D event map, the event velocity data, and/or the rotation of a capture device of the electronic device and the translation estimation data of the sensor. The sensor may be a monocular event based sensor of the electronic device.

A pixel intensity data stream of a scene may be processed along with event data to obtain high quality red, green, and blue (RGB) images of the scene.

An intermediate depth map may be generated by using the generated sparse depth and an event intensity image of the scene.

Dense depth maps may be created by combining the intermediate depth maps with high quality RGB images of the scene.

Timing synchronization and timing correction of the event data are performed with respect to the stream of pixel intensity data of the scene.

According to an embodiment, a method for performing depth estimation of an object in a scene by using an electronic device is provided. The method includes: capturing, by the electronic device, an image of a scene; and obtaining, by the electronic device, scene data, i.e., input data for the scene, from a sensor, wherein the input data includes a stream of pixel intensity data for the scene and event data for the scene. The sensor may be a monocular event based sensor. The method further includes: generating, by the electronic device, a sparse event depth map by processing the event data of the scene, wherein the event data of the scene includes event map data and event velocity data; and generating, by the electronic device, a dense depth map of objects in the scene by fusing the sparse event depth map with the stream of pixel intensity data of the scene.

Unlike conventional methods and systems, the method according to embodiments may be used to perform depth estimation under various lighting conditions and in relation to a reflective surface. For example, the methods of the present disclosure may also be used to perform more accurate 3D reconstruction of a scene by combining CMOS image sensor data with event sensor data using a single sensor (such as a monocular event based sensor).

Due to the superior dynamic range of monocular event based sensors, the methods of the present disclosure may be used to capture more information about a scene, thereby providing more information in the saturated region of the scene.

Unlike conventional methods and systems, the methods of the present disclosure can be used to perform accurate depth estimation using timing synchronization of CMOS image sensor data and event data. For example, low-power wearable vision devices, such as Augmented Reality (AR) smart glasses, human-robot interaction (HRI) devices, and so forth, may perform accurate depth estimation of a scene using CMOS image sensor data synchronized in timing with the event data.

Unlike conventional methods and systems, the methods of the present disclosure may be used to reduce the Dynamic Random Access Memory (DRAM) bandwidth required for monocular event based sensors by using event-rate-based frequency scaling with a dynamic event temporal context. As a result, low-latency processing with reduced bandwidth can be obtained.

The methods of the present disclosure may be used to generate depth maps with high accuracy using a single sensor, such as a monocular event-based sensor. Furthermore, the method may be used to generate an accurate depth map without compromising the power consumption and performance of the electronic device.

Fig. 1A illustrates an electronic device for performing depth estimation of objects in a scene according to an embodiment.

Referring to FIG. 1A, the electronic device 100 may be, but is not limited to, a smart phone, a drone, a mobile robot, an autonomous vehicle, a smart watch, a laptop, a mobile phone, a head-mounted display, a Personal Digital Assistant (PDA), a tablet, or any other electronic device that includes an image capture device. The electronic device 100 includes a monocular event based sensor 110, a depth estimator 120, an Image Signal Processor (ISP) 130, an Event Signal Processor (ESP) 140, a communicator 150, a processor 160, a memory 170, and a display 180. The depth estimator 120, the ISP 130, the ESP 140, the communicator 150, and the processor 160 may be implemented as at least one hardware processor.

Monocular event based sensor 110 may be configured to capture an image of a scene. The scene may be, but is not limited to, a 2D scene or a 3D scene.

Monocular event based sensor 110 may be, but is not limited to, a camera, an RGB camera, a Charge Coupled Device (CCD) or CMOS sensor, or the like.

The monocular event based sensor 110 includes an Active Pixel Sensor (APS) 111 and a Dynamic Vision Sensor (DVS) 112. The APS 111 may be configured to obtain a stream of pixel intensity data for the scene. The DVS 112 may be configured to obtain event data for the image. The monocular event based sensor 110 may be configured to track motion changes in the image of the scene, i.e., changes in the intensity of pixels in the image of the scene.

The monocular event based sensor 110 may be configured to perform a function of a depth sensor capable of obtaining depth information of an object in a 3D image of a scene, and a function of a motion sensor capable of acquiring motion information by detecting motion of an object in a 3D image of a scene.

Monocular event based sensor 110 may obtain scene data while capturing an image of a scene using a capture device, such as a camera of electronic device 100. The scene data may include a stream of pixel intensity data for the scene and event data for the scene.

Fig. 1B illustrates a sparse event depth map and a dense depth map, in accordance with an embodiment.

Throughout the specification, the term "sparse event depth map" may be used interchangeably with event depth map, sparse dense depth map, or sparse depth map. Likewise, the term "dense depth map" may be used interchangeably with depth map or dense map throughout the specification.

Referring to FIG. 1B, the ESP 140 may generate a sparse event depth map by processing event data of an image. In general, depth estimation involves two kinds of depth maps. One is a sparse depth map, in which depth is known only for a subset of the pixels of the 2D image. The sparse event depth map 1003 may be depicted as a plurality of points obtained from an image 1001 of the scene. The dense depth map 1007, on the other hand, provides a depth for the entire 2D image. Regions of the same color in the dense depth map indicate the same or substantially the same depth in the image 1001 of the scene, which is a 2D image.

The intermediate dense depth map 1005 may be obtained using the sparse event depth map 1003 and the pixel intensities. Throughout the description, the term "intermediate dense depth map" may be used interchangeably with intermediate map, semi-dense depth map, or intermediate depth map.

The event data for the image may include event map data and event velocity data for the image of the scene. Event map data is generated by accumulating event data in an Address Event Representation (AER) format over a certain time period, thereby producing a spatiotemporal 2D event map of the event data. Event velocity data refers to the velocity of each event predicted using optical flow.
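For illustration only, the following sketch (not part of the disclosure) accumulates a window of AER events into a simple spatiotemporal 2D event map; the signed-count/last-timestamp layout and the 30 ms window are assumptions chosen for the example rather than requirements of the method.

```python
import numpy as np

def accumulate_event_map(events, shape, window_us=30_000):
    """Accumulate AER events into a simple spatiotemporal 2D event map.
    Each event is an (x, y, polarity, timestamp_us) tuple; the map here is
    a signed per-pixel count plus the latest timestamp per pixel, which is
    an illustrative layout rather than one mandated by the disclosure."""
    events = list(events)
    height, width = shape
    count_map = np.zeros((height, width), np.int32)   # signed event counts
    time_map = np.zeros((height, width), np.int64)    # last event time per pixel
    t_end = max(t for _, _, _, t in events)
    for x, y, polarity, t in events:
        if t_end - t > window_us:                     # keep only the recent window
            continue
        count_map[y, x] += 1 if polarity > 0 else -1
        time_map[y, x] = max(time_map[y, x], t)
    return count_map, time_map
```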

The ESP 140 may generate a spatiotemporal 2D event map and an event intensity image using the event map data of the image.

Fig. 1C is a flowchart illustrating a method of obtaining an event intensity image according to an embodiment.

Referring to fig. 1C, in step S101, event data is accumulated for a certain period of time (e.g., 30 ms).

In step S103, an event denoising process is applied to the accumulated event data.

In step S105, a Surface of Active Events (SAE) is generated. The SAE is a 3D domain that consists of the two-dimensional sensor frame and an additional dimension representing time.

In step S107, the initially reconstructed event intensity image is refined using intensity image smoothing by applying a smoothing function to the SAE.

In step S109, a final event intensity image is obtained after applying the smoothing function.
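A minimal sketch of steps S101 through S109 follows. The contrast weight, the 3x3 median-based noise rejection, and the Gaussian smoothing are illustrative stand-ins for the denoising and smoothing functions, which the text does not specify.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def reconstruct_event_intensity(events, shape, window_us=30_000, contrast=0.1):
    """Rough sketch of steps S101-S109: accumulate ~30 ms of events, drop
    isolated (noise) events, keep a surface of active events (SAE), build a
    polarity-weighted reconstruction, and smooth it into an event intensity
    image.  `events` yields (x, y, polarity, timestamp_us) tuples."""
    events = list(events)
    h, w = shape
    sae = np.zeros((h, w))            # S105: last event timestamp per pixel
    recon = np.zeros((h, w))          # polarity-weighted log-intensity proxy
    active = np.zeros((h, w), bool)
    t_end = max(t for _, _, _, t in events)
    for x, y, polarity, t in events:              # S101: accumulate the window
        if t_end - t > window_us:
            continue
        sae[y, x] = max(sae[y, x], t)
        recon[y, x] += contrast * (1.0 if polarity > 0 else -1.0)
        active[y, x] = True
    # S103: crude denoising -- keep only pixels whose 3x3 neighbourhood is active
    support = median_filter(active.astype(float), size=3)
    recon *= support > 0
    # S107/S109: smooth the reconstruction into the final intensity image
    return gaussian_filter(recon, sigma=1.0)
```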

The ESP 140 may generate a sparse event depth map using the spatiotemporal 2D event map, the event velocity data, and the camera rotation and translation estimation data of the monocular event based sensor 110. For example, the sparse event depth map may be distributed along the edges of the image. Depth may indicate a relative distance between an object included in the scene and a capture device, such as a camera.
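One standard way to obtain per-event depth from event velocity and a known camera rotation and translation is motion parallax: subtract the rotation-induced image motion and compare the remaining translational flow against the flow a unit-depth point would produce. The sketch below illustrates that relation under the assumptions of calibrated intrinsics and a known camera twist; it is not necessarily the computation used in the disclosure.

```python
import numpy as np

def event_depth_from_flow(x, y, flow, t_cam, omega, f):
    """Per-event depth from measured event flow and known camera motion.
    x, y  : event coordinates relative to the principal point (pixels)
    flow  : measured event velocity (u, v) in pixels/s
    t_cam : camera translational velocity (3,), assumed known
    omega : camera angular velocity (3,), assumed known
    f     : focal length in pixels
    Returns the depth Z, or None when the translational flow is too small
    to give a stable estimate (no parallax at this event)."""
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])                      # translational flow per (1/Z)
    B = np.array([[x * y / f, -(f + x * x / f), y],
                  [f + y * y / f, -x * y / f, -x]])   # rotation-induced flow
    flow_t = np.asarray(flow, float) - B @ np.asarray(omega, float)
    denom = np.linalg.norm(flow_t)
    if denom < 1e-6:
        return None
    return np.linalg.norm(A @ np.asarray(t_cam, float)) / denom
```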

While the event data is processed, the ESP 140 may also process the pixel intensity data stream of the image of the scene and thereby obtain a high-quality RGB image from the captured image of the scene. The ESP 140 may generate an intermediate depth map using the generated sparse depth map and the pixel intensity data. In general, pixel intensity data may also be obtained by interpolating accumulated event data.

Using the estimated pixel intensities, the sparse event depth map may be propagated to neighboring regions. That is, regions without depth information are populated using the estimated pixel intensities in order to generate an intermediate depth map. The intermediate depth map may also be referred to as a semi-dense (event) depth map.

The depth estimator 120 may generate a dense depth map by combining the intermediate depth map with the high quality RGB image. Dense depth maps may be generated by post-processing the intermediate depth maps.

The depth estimator 120 may generate a dense depth map of objects in the image by combining the intermediate depth map with the high-quality RGB image of the scene using a guided surface fitting process. Since the color, texture, and structure information about objects in the scene that can be recovered from the estimated event intensities is incomplete, object information from the high-quality RGB image of the scene may be used to refine the intermediate event depth map.
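The disclosure does not spell out the guided surface fitting itself. As an illustration of RGB-guided refinement, the sketch below applies a joint-bilateral-style pass in which the RGB image steers how intermediate depth values are averaged and propagated; the window size and sigma parameters are assumptions, and this is not presented as the patent's exact fitting procedure.

```python
import numpy as np

def rgb_guided_depth_refine(depth, valid, rgb, radius=5, sigma_s=3.0, sigma_r=0.1):
    """Joint-bilateral refinement of an intermediate depth map guided by the
    RGB image: fill and smooth depth while respecting RGB edges.
    depth : HxW intermediate depth (values where valid==False are ignored)
    valid : HxW boolean mask of pixels that already carry depth
    rgb   : HxWx3 image with values in [0, 1]"""
    h, w = depth.shape
    out = np.zeros_like(depth, dtype=float)
    gray = rgb.mean(axis=2)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            win_d = depth[i0:i1, j0:j1]
            win_v = valid[i0:i1, j0:j1]
            win_g = gray[i0:i1, j0:j1]
            sp = spatial[i0 - i + radius:i1 - i + radius,
                         j0 - j + radius:j1 - j + radius]
            rng = np.exp(-((win_g - gray[i, j]) ** 2) / (2 * sigma_r ** 2))
            wgt = sp * rng * win_v                 # only valid depths contribute
            s = wgt.sum()
            # Fall back to the input value when no valid support exists.
            out[i, j] = (wgt * win_d).sum() / s if s > 1e-8 else depth[i, j]
    return out
```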

The ISP 130 may be configured to perform image signal processing using guided surface fitting techniques for creating dense depth maps.

The ISP 130 may perform timing synchronization and timing correction of the event data for the pixel intensity data stream of the scene.

ESP 140 may be configured to perform frequency scaling on the depth estimation circuitry of electronic device 100 in order to determine an operating frequency for maintaining a balance between power and performance of electronic device 100, where the frequency scaling is performed based on the scene data rate and the maximum throughput of the electronic device. The maximum throughput indicates the maximum amount of information that the electronic device 100 can process during a given amount of time. The operating frequency refers to the clock frequency at which the digital hardware block or digital hardware circuit operates. Frequency scaling is used to control the clock frequency to match the required performance and power consumed.
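The inputs to the frequency scaling (scene data rate and maximum throughput) are stated above, but the selection policy itself is not. The following sketch shows one plausible policy, assumed for illustration: pick the lowest available clock step whose throughput covers the current event rate with some headroom.

```python
def select_operating_frequency(event_rate, events_per_cycle, freq_steps_hz, headroom=1.2):
    """Illustrative frequency-scaling policy (the selection rule is an
    assumption, not taken from the disclosure).
    event_rate       : measured events per second from the event-speed estimator
    events_per_cycle : events the depth-estimation circuit processes per clock cycle
    freq_steps_hz    : supported clock frequencies of the circuit"""
    required = event_rate * headroom / events_per_cycle
    for f in sorted(freq_steps_hz):
        if f >= required:
            return f          # lowest clock that still meets the required throughput
    return max(freq_steps_hz) # otherwise run at the maximum supported clock
```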

Communicator 150 may be configured to communicate internally with hardware components in electronic device 100. Processor 160 may be coupled to memory 170 for processing various instructions stored in memory 170 to perform depth estimation of objects in an image of a scene using electronic device 100.

The memory 170 may be configured to store instructions to be executed by the processor 160. Further, the memory 170 may be configured to store image frames of a 3D scene. The memory 170 may include non-volatile storage elements, such as magnetic hard disks, optical disks, floppy disks, flash memory, or forms of erasable programmable read-only memory (EPROM) or electrically erasable programmable read-only memory (EEPROM).

Further, the memory 170 may be a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or propagated signal. However, the term "non-transitory" should not be construed in a sense that the memory 170 is not removable. A non-transitory storage medium may store data (e.g., Random Access Memory (RAM) or cache) that may change over time.

The memory 170 may be configured to store a large amount of information.

The display 180 may be configured to display an image of the captured 3D scene. Display 180 may include a touch screen display, an AR display, a Virtual Reality (VR) display, and the like.

Although various hardware components of electronic device 100 are shown in fig. 1A, embodiments of the present disclosure are not so limited. For example, electronic device 100 may include fewer or more components than shown in FIG. 1A. Two or more components may be combined together to perform the same or substantially similar functions to perform depth estimation of objects in a scene.

Further, the labels or names of the components shown in fig. 1A are for illustrative purposes only, and do not limit the scope of the present disclosure.

Fig. 1D illustrates a method for obtaining scene data according to an embodiment.

Referring to fig. 1D, when a scene is captured by a camera, a plurality of photons are reflected from the scene in step S151, and the reflected plurality of photons fall on a plurality of pixel arrays in step S153.

In step S155, each pixel may release several electron-hole pairs according to the wavelength of the photon. In step S157, the charge formed in the electron-hole pairs may be converted into a voltage using an electrical component such as a capacitor.

In step S159, the voltage is compared with the previously stored voltage using the first comparator 191.

In step S161, the output of the comparator 191, which indicates a voltage difference that can be converted into an intensity difference, is compared with a predetermined threshold by the second comparator 193.

In step S163, the output of the second comparator 193 may be an on/off (on/off) signal, which may be event data.

In step S165, a time stamp is attached to the event data, and the event data may be converted into AER data in an AER format. The event data may represent changes in pixel intensity in an image of the scene. That is, an image representing a scene may be divided into a plurality of cells, which are referred to as "pixels". Each of a plurality of pixels in the image may represent a discrete region and have an associated intensity value. A pixel may appear dark at low gray intensities and light at high intensities.

In step S167, the voltage output associated with step S157 may be amplified using the amplifier 195. The amplified voltage may be input into an analog-to-digital converter (ADC) 197 in step S169 in order to generate a digital value representing the pixel intensity of the scene.

The event data may be in the form of a time series of events e(n) from the monocular event based sensor 110, as shown in equation (1) below.

e(n) = {x_n, y_n, θ_n, t_n}    (1)

In equation (1), x_n and y_n are the coordinates of the pixel, θ_n is the polarity of the event, i.e., positive or negative, and t_n is the timestamp at which the event was triggered. A positive θ_n indicates that the intensity at the corresponding pixel has increased by a threshold Δ+ > 0 in the logarithmic intensity space. A negative θ_n indicates that the intensity has dropped by a second threshold Δ− > 0 in the logarithmic intensity space.
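The contrast-threshold behavior described above can be summarized by the following sketch of a per-pixel event generator; the threshold values and the reference-update rule are illustrative assumptions.

```python
import math

def maybe_emit_event(x, y, intensity, state, t, delta_pos=0.15, delta_neg=0.15):
    """Sketch of the DVS event model behind equation (1): an event
    e(n) = {x_n, y_n, theta_n, t_n} is emitted when the log intensity at a
    pixel moves more than a contrast threshold away from its last reference
    value.  `state` maps (x, y) -> last reference log intensity."""
    log_i = math.log(max(intensity, 1e-6))
    ref = state.setdefault((x, y), log_i)   # first sample only sets the reference
    if log_i - ref >= delta_pos:
        state[(x, y)] = log_i
        return (x, y, +1, t)                # positive-polarity event
    if ref - log_i >= delta_neg:
        state[(x, y)] = log_i
        return (x, y, -1, t)                # negative-polarity event
    return None                             # no event: change below both thresholds
```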

Fig. 2 illustrates a depth estimator of an electronic device for performing depth estimation of an object in an image of a scene according to an embodiment.

Referring to fig. 2, the depth estimator 120 includes an event stream guided intensity generator 121, a sparse depth map generator 122, an intermediate depth map generator 123, an image guided surface fitting engine 124, a post processor 125, and a dense depth map generator 126. The depth estimator 120 may be implemented as at least one hardware processor.

The event stream directed intensity generator 121 may generate intensity data from event data of the monocular event based sensor 110. The sparse depth map generator 122 may generate a sparse event depth map using event data of an image of a scene. The sparse depth map generator 122 may generate a spatiotemporal 2D event map and an event intensity image by processing event map data of an image of a scene. Further, the sparse depth map generator 122 may generate a sparse event depth map by processing the spatiotemporal 2D event map, the event speed data, and the camera rotation and translation estimation data of the monocular event based sensor 110. The sparse event depth map may be distributed along the edges of the image of the scene.

As the event data is processed, the intermediate depth map generator 123 may simultaneously process the pixel intensity data stream of the image to obtain a high quality RGB image of the scene. The intermediate depth map generator 123 may generate an intermediate depth map by using the generated sparse depth and an event intensity image of the scene.

By fusing or combining the intermediate depth map with the high quality RGB image from the scene using a guided surface fitting process, the image-guided surface fitting engine 124 may create a dense depth map of the objects in the image.

The post-processor 125 may perform post-processing, such as depth smoothing, on the intermediate depth map to generate a dense depth map.
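Depth smoothing is named here, and hole filling is named in claim 8 and in the discussion of FIG. 6, but the concrete filters are not specified. The sketch below uses nearest-valid-pixel hole filling followed by a median filter as one plausible post-processing pass.

```python
from scipy.ndimage import median_filter, distance_transform_edt

def postprocess_depth(depth, valid):
    """Illustrative post-processing: fill holes with the depth of the nearest
    valid pixel, then median-smooth.  The specific filters are assumptions.
    depth : HxW numpy array of depth values
    valid : HxW boolean numpy array marking pixels that carry depth"""
    # Hole filling: index of the nearest valid pixel for every invalid one.
    _, inds = distance_transform_edt(~valid, return_indices=True)
    filled = depth[inds[0], inds[1]]
    # Depth smoothing while preserving coarse structure.
    return median_filter(filled, size=5)
```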

The dense depth map generator 126 may create a dense depth map by fusing or combining the intermediate depth map with the high quality RGB image. Dense depth maps may be created by post-processing the intermediate depth maps.

By fusing or combining the intermediate depth map with the high-quality RGB image using a guided surface fitting process, the dense depth map generator 126 may generate a dense depth map of objects in the image. A guided surface fitting process can be used to find the best-fit line or surface for a series of data points using high quality RGB images and event data.

FIG. 3A is a flow diagram illustrating a method for performing depth estimation of objects in a scene according to an embodiment.

Referring to FIG. 3A, the operations therein (i.e., steps 310 to 340) may be performed by a processor or, in particular, by the depth estimator 120 of the electronic device 100.

In step 310, the electronic device 100 captures an image of a scene. The capturing may be performed by a camera or a capturing device of the electronic device 100.

In step 320, the electronic device obtains scene data of an image via a sensor (e.g., monocular event based sensor 110 of electronic device 100).

In step 330, the electronic device 100 generates a sparse event depth map by using event data of the image. The event data may include event map data and event speed data for an image of the scene.

To generate a sparse event depth map, other information, such as the camera orientation, may be used together with the event data of the captured image. The relative position of the camera or capture device when capturing images of the scene, and the stream of events between accumulated image frames captured by the camera or capture device, may help estimate the depth of the scene. The generated event map is sparse in that events are available only at certain pixels, in particular along the edges of the image of the scene.

In step 340, the electronic device 100 generates a dense depth map of objects in the image using the sparse event depth map and the stream of pixel intensity data. Dense depth maps may be generated by fusing or combining a sparse event depth map and a stream of pixel intensity data of an image of a scene.

Although the various steps, acts, blocks, etc. in the flowchart of fig. 3A are shown in a particular order, they may be performed in a different order or concurrently. Further, some of the steps, actions, acts, blocks, etc. may be omitted, added, modified or skipped without departing from the scope of the present disclosure.

Fig. 3B is a flow diagram illustrating a method for generating a sparse event depth map, according to an embodiment.

Referring to fig. 3B, the electronic device 100 generates a spatiotemporal 2D event map and/or an event intensity image by processing event map data of an image of a scene in step 321. The sparse depth map generator 122 may use event map data of the images to generate spatiotemporal 2D event maps and/or event intensity images.

In step 323, the electronic device 100 may generate a sparse event depth map by processing the spatiotemporal 2D event map and the event velocity data. The rotation of the capture device (e.g., camera) of the electronic device 100 and the translation estimation data of the monocular event based sensor 110 may also be used to generate the sparse event depth map. The sparse event depth map may be distributed along the edges of the image. The sparse depth map generator 122 may generate the sparse event depth map using the aforementioned processing.

Although the various steps, actions, acts, blocks, etc. in the flowchart of FIG. 3B are shown in a particular order, they may be performed in a different order or concurrently. Further, some of the steps, actions, acts, blocks, etc. may be omitted, added, modified or skipped without departing from the scope of the present disclosure.

Fig. 3C is a flow diagram illustrating a method for creating a dense depth map according to an embodiment.

Referring to FIG. 3C, while the event data is processed, the electronic device 100 obtains a high-quality RGB image from the image of the scene by simultaneously processing the pixel intensity data stream of the image in step 331. As the event data is processed, the ISP 130 simultaneously processes the pixel intensity data stream of the image to obtain the high-quality RGB image from the image of the scene.

In step 333, the electronic device 100 generates an intermediate depth map based on the sparse event depth map and the event intensity image. The intermediate depth map generator 123 may generate an intermediate depth map using the generated sparse event depth map and an event intensity image of the scene.

In step 335, the electronic device 100 generates a dense depth map by combining or fusing the intermediate depth map with the high-quality RGB image of the scene. The dense depth map generator 126 may generate the dense depth map by combining or fusing the intermediate depth map with the high-quality RGB image.

Although the various steps, actions, acts, blocks, etc. in the flowchart of FIG. 3C are shown in a particular order, they may be performed in a different order or concurrently. Further, some of the steps, actions, acts, blocks, etc. may be omitted, added, modified or skipped without departing from the scope of the present disclosure.

FIG. 4 is a process flow diagram illustrating an electronic device performing depth estimation of objects in a scene according to an embodiment.

Referring to fig. 4, APS 111 may be, but is not limited to, a CMOS image sensor that may receive light and convert the received light into digital pixel values depending on light intensity. The sensor control circuit 110A may be, but is not limited to, a controller circuit that may be used to modify sensor parameter settings, such as exposure time, frame rate, etc.

A system Phase Locked Loop (PLL) 135 may be a control system that generates an output signal whose phase is related to the phase of an input signal. The system PLL 135 may be part of an application processor system-on-chip (AP SoC).

A CSI receiver (CSI Rx) 136 on the AP SoC may receive the APS data via the CSI D-PHY. In addition, the ISP 130 may be used to enhance the quality of the APS data.

The DVS and APS feature correlation engine 131 may be configured to perform feature correlation between DVS data and APS data. The corrected event timing engine 132 may be configured to perform timing synchronization and timing correction of event data for a pixel intensity data stream of a scene.

The event speed based frequency scaling engine 134 may be configured to perform frequency scaling on the depth estimation circuitry of the electronic device 100 and determine an operating frequency of the electronic device 100 for maintaining a balance between power and performance, wherein the frequency scaling is performed based on the input data rate and the maximum throughput of the electronic device.

The event sensor timing synchronization engine 133 can be configured to perform timing synchronization of event data.

The ESP 140 may include an AER bus decoder 141, an event camera application layer 142, an event stabilizer 143, an event temporal context detector 144, an event signal controller 149, an event buffer controller 145, an event map generator 146, an event-based feature detector 147, and an event speed estimator 148.

The AER decoder 141 may receive event data from the DVS and transfer the event data from one clock domain (that of the AER decoder 141) to another (that of the event signal controller 149). The event camera application layer 142 may receive data from the AER decoder 141 and interface with the event signal controller 149. The event camera application layer 142 may also be used to filter noise out of the event data.

The event stabilizer 143 may fuse the pose (or orientation) received from the sensor hub (e.g., Inertial Measurement Unit (IMU) or other directional sensors) and stabilize the event data relative to the sensor motion. The event temporal context detector 144 may determine the validity of events over a time slice and determine the average cumulative time of event data.

The event signal controller 149 may control the event timing of the event data and the digital Intellectual Property (IP) frequency control 250 of the ESP 140. Digital IP may be referred to herein as "IP".

The event buffer controller 145 may manage a queue of accumulated event frames and process internal memory requests.

The event map generator 146 may create a spatial 2D event map from the individual pixel event data. The event speed estimator 148 may measure the velocity of each event via particle tracking or other techniques. The event-based feature detector 147 may extract binary features from the event map.
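Besides particle tracking, a common "other technique" for per-event velocity is local plane fitting on the surface of active events; the sketch below shows that alternative and is not stated to be the implementation used here.

```python
import numpy as np

def event_velocity_plane_fit(sae, x, y, radius=3, min_pts=8):
    """Estimate per-event velocity by fitting a local plane t ~ a*x + b*y + c
    to the surface of active events (SAE) around the event.  The velocity is
    the inverse gradient of the fitted plane.  Returns (vx, vy) in pixels per
    time unit, or None when the fit is underdetermined."""
    h, w = sae.shape
    i0, i1 = max(0, y - radius), min(h, y + radius + 1)
    j0, j1 = max(0, x - radius), min(w, x + radius + 1)
    ys, xs = np.mgrid[i0:i1, j0:j1]
    ts = sae[i0:i1, j0:j1]
    mask = ts > 0                         # only pixels that have seen an event
    if mask.sum() < min_pts:
        return None
    A = np.column_stack([xs[mask], ys[mask], np.ones(mask.sum())])
    coeff, *_ = np.linalg.lstsq(A, ts[mask], rcond=None)
    a, b, _ = coeff
    denom = a * a + b * b
    if denom < 1e-12:
        return None                       # flat plane: speed too high to resolve
    return (a / denom, b / denom)
```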

The depth estimator 120 may receive data from the ISP 130, the corrected event timing engine 132, and the event speed estimator 148. By combining the data received from the ISP 130, the corrected event timing engine 132, and the event speed estimator 148, the depth estimator 120 may create a depth map of objects in the image of the scene.

Fig. 5 is a process flow diagram illustrating an electronic device performing depth estimation of objects in an image of a scene according to an embodiment.

Referring to fig. 5, in step S510, the event sensor timing synchronization engine 133 receives the following input and transmits the following output.

Input: event speed information, the DVS event map, the most recent time-slice RGB image, system PLL clock information, RGB image V-Sync information, and the previously corrected event timing (starting from frame 2; a constant is assumed for the first frame).

Output: corrected event timing (relative to the event signal controller).

In step S520, the depth estimator 120 receives the following input and transmits the following output.

Input: event map, estimated camera pose, event speed

Output: event intensity image

In step S530, the depth estimator 120 receives the following input and transmits the following output.

Inputting: event intensity image of generation engine using guided event intensity

And (3) outputting: sparse 3D edge based depth data

In step S540, the depth estimator 120 receives the following input and transmits the following output.

Inputting: sparse 3D edge-based depth data from step 530

And (3) outputting: intermediate dense depth map using event sensor data

In step S550, the depth estimator 120 receives the following input and transmits the following output.

Input: RGB image from the ISP 130, and the intermediate dense depth map from step S540

Output: dense depth map from the combined RGB and event-based data

In step S560, the depth estimator 120 receives the following input and transmits the following output.

Inputting: depth map from step S550

And (3) outputting: post-processed smooth dense depth map

In step S570, the event speed based frequency scaling engine 134 performs frequency scaling on the depth estimation circuit of the electronic device 100 to determine an operating frequency of the electronic device 100 for maintaining a balance between power and performance. The frequency scaling is performed based on the input data rate and the maximum throughput of the electronic device 100.

FIG. 6 is a process flow diagram illustrating an electronic device performing depth estimation of objects in a scene according to an embodiment.

Referring to FIG. 6, the depth estimator 120 may perform depth estimation of an image of a scene for saturated regions of the scene or under various lighting conditions of the scene. During image capture, for each scene, the monocular event-based sensor 110 may output pixel intensity data and event data (which represents a positive or negative change in pixel intensity). The event data is processed to generate a sparse depth map. The sparse depth map is processed to form an intermediate depth map. The pixel intensity data is processed in parallel by the ISP 130 to obtain a high-quality RGB image.

Guided surface fitting is applied to the intermediate depth map and the RGB image in order to obtain a dense depth map that represents object depths more accurately than the intermediate depth map. Thereafter, post-processing, such as depth smoothing and hole filling, is applied. The output is a dense, accurate depth map at the highest video resolution supported by the electronic device 100.

Unlike traditional stereo or active depth estimation methods, the event-based depth estimation of the present disclosure depends on camera/object motion. Furthermore, the low data rate makes the depth estimation close to a real-time depth measurement, which is effective for vision applications on wearable devices. In a conventional approach, a pair of event sensor cameras is used to capture the depth of a scene. Unlike conventional approaches, this approach may use a single monocular event-based sensor 110 to capture depth information for the scene.

Furthermore, the monocular event based sensor 110 does not need to be calibrated or corrected for depth measurements. The CMOS and DVS sensor outputs are synchronized by the event sensor timing synchronization engine 133 of the present disclosure. The event sensor based depth estimation disclosed herein produces a sparse depth map, in which information is distributed primarily along object edges. Thus, surface reflectivity does not affect the depth measurement accuracy as it does in stereo or active depth estimation. The image data from the APS 111 is used to obtain a dense depth map of the scene.

Fig. 7 is a flow diagram illustrating an electronic device performing depth estimation of an object in an image of a scene according to an embodiment.

Referring to fig. 7, in step 701, monocular event based sensor 110 begins capturing an event based camera sequence. Further, the monocular event based sensor 110 performs processing for the DVS sensor data processing path and the CMOS sensor data processing path.

In step 702, the ESP 140 begins to perform camera event stabilization, i.e., image stabilization, using techniques that reduce blur associated with motion of the camera or other image capture device during exposure. Basically, an inertial measurement unit may be used to measure the specific force and/or angular rate of the body in order to remove the effects of unwanted camera motion from the input frame and obtain stable image frames. The inertial measurement unit may compensate for offset and/or temperature errors to generate inertial measurement (IM) data. The IM data may be fused with the event data obtained from the unstabilized images.

In step 703, ESP 140 detects the validity of each event across a time slice.

In step 704, ESP 140 identifies an average accumulated time for event map generation.

In step 705, ESP 140 generates an event map based on the identified average accumulated time, and in step 706, ESP 140 generates an intensity image from the event map.

In step 707, ESP 140 performs event-based binary feature extraction from the generated event map. In step 708, ESP 140 measures event speed using particle tracking, and in step 709, outputs of event-based binary feature extraction and event speed are input to DVS and APS feature correlation engine 131.

In step 711, the RGB image data from the CMOS image sensor is processed. Steps 712 to 714 may be performed in parallel by the depth estimator 120 for creating the depth map.

In step 709, the DVS and APS feature correlation engine 131 may receive the event-based binary features, the event speed, and the CMOS image data as input.

In step 718, the event sensor timing synchronization engine 133 performs timing synchronization and communicates the results of the timing synchronization to the corrected event timing engine 132.

In step 719, the corrected event timing engine 132 passes the timing-corrected event sensor data to depth processing and performs event camera stabilization on the timing-corrected event sensor data.

Fig. 8A is a process flow diagram illustrating an electronic device performing timing synchronization of event data according to an embodiment.

Typically, the monocular event based sensor 110 is asynchronous in nature. The event data is represented in an AER format, which contains the pixel coordinates, a timestamp of the event generation, and an event type.

In the case of a sensor comprising dynamic pixels and active pixels, there are two separate interfaces: the CMOS data is sent via the CSI D-PHY and the event data is sent via the AER bus interface. Many technologies are evolving to take advantage of event data and CMOS data together. However, since the two data streams are generated and transmitted via separate interfaces, timing synchronization and correction of the event data with respect to the CMOS data is required.

Referring to fig. 8A, timing synchronization may be primarily performed by a correlation engine 1300, wherein the correlation engine 1300 includes the DVS and APS feature correlation engine 131, the event sensor timing synchronization engine 133, and the corrected event timing engine 132.

The event sensor timing synchronization engine 133 performs timing synchronization and correction of event data received from the monocular event based sensor 110, which includes dynamic pixels and active pixels.

For timing synchronization, the RGB data from the ISP 130 is processed to extract its image features. The ISP 130 may also generate a V-Sync signal to indicate completion of the RGB frame. The V-Sync may be sent to a V-Sync based frame timing generator 137.

In step S801, the event map from the event map generator 146 may be processed to estimate the event velocity using the event speed estimator 148 and to detect event features using the event-based feature detector 147. In addition, timing information may be obtained via the system PLL 135. The outputs of the ISP 130 (via the V-Sync based frame timing generator 137) and of the event-based feature detector 147 may be combined and fused by the DVS and APS feature correlation engine 131 to generate the DVS data.

Based on the foregoing feature correlation, it is determined in step S803 whether the feature correlation exceeds a predetermined threshold. Based on this determination, a timing correction amount to be applied to the DVS data may be determined.
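The text describes the correlation test and the resulting timing correction but not the search itself. The sketch below shows one plausible search, assumed for illustration: try candidate offsets, correlate event-derived edges with RGB edges, and apply the best offset only when the correlation exceeds the threshold. The callable `event_edge_map_at` is a caller-supplied stand-in, not an interface named in the disclosure.

```python
import numpy as np

def estimate_event_timing_offset(event_edge_map_at, rgb_edges,
                                 candidate_offsets_us, corr_threshold=0.3):
    """Illustrative timing-correction search.
    event_edge_map_at(dt) : returns a binary edge map built from events
                            accumulated around the RGB frame's V-Sync
                            shifted by dt microseconds (assumed callable)
    rgb_edges             : edge map extracted from the RGB frame"""
    best_dt, best_corr = 0, -1.0
    r = (rgb_edges.astype(float) - rgb_edges.mean()).ravel()
    r_norm = np.linalg.norm(r) + 1e-12
    for dt in candidate_offsets_us:
        e = event_edge_map_at(dt).astype(float)
        e = (e - e.mean()).ravel()
        corr = float(e @ r) / ((np.linalg.norm(e) + 1e-12) * r_norm)
        if corr > best_corr:
            best_dt, best_corr = dt, corr
    # Apply a correction only when the correlation is convincing (step S803).
    return best_dt if best_corr > corr_threshold else 0
```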

FIG. 8B illustrates timing synchronization of event data according to an embodiment.

Referring to fig. 8B, the correlation engine 1300 may combine and/or fuse RGB data 810 obtained from the active pixel frame and event data 820 obtained from the event data frame to correct and synchronize the time difference between the active pixel frame and the event data frame.

Fig. 9 is a process flow diagram illustrating an electronic device performing frequency scaling according to an embodiment.

Referring to FIG. 9, during digital IP design of a camera-based system, synthesis and timing closure are performed for a maximum frequency chosen by considering the input data rate and the maximum throughput. Unlike conventional CMOS-based sensors, where the input rate is deterministic, the monocular event-based sensor 110 is asynchronous in nature. Thus, the rate at which data arrives and the required IP throughput may vary.

However, in actual operation, the depth estimation IP should be clocked according to criteria that maintain an optimal balance between power and performance of the electronic device. The system of the present disclosure includes an event speed-based frequency scaling engine 134 for the depth estimation IP that can take into account the input data rate and the current throughput and decide the IP operating frequency that optimally balances power and performance.

The event speed based frequency scaling engine 134 receives outputs from the event speed estimator 148 and the event buffer controller 145 and processes the throughput to estimate the IP frequency.

Fig. 10 is a process flow diagram illustrating obstacle detection and navigation via an autonomous vehicle by using a depth map according to an embodiment.

Referring to FIG. 10, the operations performed by elements 1001 to 1010 may be used by an autonomous vehicle for obstacle detection or pedestrian tracking, particularly in low-light conditions.

More specifically, the onboard IMU 1001 operates in conjunction with the wheel odometer 1003 to establish a global camera reference frame and its trajectory with respect to time. The pseudo-image stream created by the event-based camera sensor 1002 is used to extract event-based features (such as corners and line segments) which are then compared to IMU stabilization and positioning data obtained by the external localizer 1006. Thereafter, the accurate camera pose is estimated by the camera pose estimator 1007.

The external sensor calibrator 1004, the external position mapping unit 1005 and the external locator 1006 are configured to determine the exact position of the obstacle from the real-time depth information about the obstacle. Further, the navigator 1010 and the external obstacle detector 1009 determine the presence of an obstacle from the real-time depth information.

The monocular event based depth estimator 1008 is configured to operate with low power consumption and provide accurate depth information even in low-light environments. The pseudo frames created by the event-based camera sensor 1002 give a sparse estimate of the depth map, which is further refined; holes in the depth map are filled and smoothed in real time using the time-correlated APS (RGB) frames.

In Advanced Driver Assistance Systems (ADAS) comprising the elements of FIG. 10, an accurate depth map of the image of the scene is necessary for obstacle detection. Furthermore, the ADAS system should work in various lighting conditions and provide quick visual feedback to the user for proper navigation. The monocular event based sensor 110 provides higher frame rates and a higher dynamic range (approximately 120 dB) under extreme lighting conditions.

FIG. 11 is a process flow diagram for using a depth map in AR shopping, according to an embodiment.

Referring to FIG. 11, the method may be used to determine an accurate depth map of an image of a scene that includes transparent and reflective surfaces and a plurality of objects. The operations performed by the elements in flowchart 1100 may be used for AR shopping and allow a user to perform virtual object placement based on the depth map estimation. The method of FIG. 11 may thus be used in an AR shopping case to employ the real-time depth obtained by the method for virtual object placement.

Refined accurate depth information from the monocular event based depth estimator 1104 is used by the flat area detector 1105 for flat plane estimation. The monocular event based depth estimator 1104 estimates the depth of objects in the image of the scene by fusing the active pixels and the event based data. Thus, a 3D surface may be defined to project a virtual object thereon.

The flat area detector 1105 is used to detect flat areas in the scene, thereby identifying suitable locations reserved for virtual object placement. The overlay rendering and seamless mixer 1106 is used to overlay the virtual object on the appropriate area identified by the flat area detector 1105.

The virtual component 1103 extracted from the event-based camera sensor 1101 is warped with respect to the desired estimate calculated from the depth and plane information. The data is fused, and the corresponding virtual components are overlaid on the presentation surface by the overlay rendering and seamless mixer 1106. The resulting image is presented along with the other image data, as seen on the user's viewfinder/preview, providing an AR experience via an enhanced image 1108 displayed on a display 1109.

Fig. 12 is a process flow diagram for reconstructing a 3D scene of a 3D image according to an embodiment.

Referring to fig. 12, the operations corresponding to elements 1201-1208 of flowchart 1200 may be used to reconstruct a 3D scene of a 3D image.

Depth maps are important for 3D imaging and display and can be used in different application fields, such as digital holographic image processing, object reconstruction in whole body imaging, 3D object retrieval and scene understanding, and 3D printing. Furthermore, 3D depth maps are used in vision-based applications, such as in robotics, games, AR glasses, and the like.

Accurate depth maps of objects in the scene obtained by using monocular event based sensors 110 are used for virtual object placement. Accurate depth maps can be used in product previews to visualize or understand different products and functions.

FIG. 13 is a process flow diagram for creating a Bokeh effect using a depth map with an electronic device, according to an embodiment.

Referring to FIG. 13, elements 1301 and 1302 may be used to create a Bokeh effect using a depth map. The electronic device 100 may capture an image of a scene and, based on processing via the Bokeh engine 1301 and the depth estimator 1302 using the monocular event based sensor, may create an accurate depth map of the image of the scene.

This process may be used to create a Bokeh effect. Thus, the real-time depth map obtained according to the present disclosure may be used to apply a high-quality Bokeh effect to a still image. A device (e.g., a smart camera platform) may capture a series of images presented to the user's preview/viewfinder. Event data from the event pipeline is processed to obtain a fine-grained, accurate, real-time depth map via the depth estimator 1302 using the monocular event-based sensor. The depth map may be used to decompose the scene into a plurality of depth regions and to classify those regions as background or foreground layers based on the region of interest. Furthermore, such a solution may be used to apply various image processing kernels, such as the Bokeh effect, to achieve a desired result.
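As an illustration of the depth-layer decomposition described above, the sketch below splits the image into foreground and background by comparing depth against the region of interest and blurs only the background; the single-layer split and the parameter values are simplifications, not the disclosed processing kernels.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def apply_bokeh(rgb, depth, roi_depth, depth_tolerance=0.15, blur_sigma=6.0):
    """Minimal Bokeh sketch: keep pixels whose depth is close to the depth of
    the region of interest sharp, and blur everything else.
    rgb       : HxWx3 float image
    depth     : HxW depth map aligned with rgb
    roi_depth : representative depth of the region of interest"""
    foreground = np.abs(depth - roi_depth) <= depth_tolerance * roi_depth
    blurred = np.stack([gaussian_filter(rgb[..., c], blur_sigma) for c in range(3)],
                       axis=-1)
    out = blurred.copy()
    out[foreground] = rgb[foreground]     # composite the sharp foreground layer
    return out
```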

The embodiments disclosed herein may be implemented by at least one software program running on at least one hardware device and performing network management functions to control the various elements.

The elements shown in fig. 1 to 13 include blocks, which may be at least one of hardware devices, software modules, or a combination of hardware devices and software modules.

While the present disclosure has been particularly shown and described with reference to certain embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure as defined by the following claims and their equivalents.
