Scalable unified audio renderer

Document No. 1160362, published 2020-09-15.

Created by A·G·P·舍弗次乌 and N·G·彼得斯, 2019-02-01.

Abstract: A device including an audio decoder, a memory, and one or more processors may be configured to perform various aspects of the techniques described herein. The audio decoder may decode first audio data and second audio data from a bitstream. The memory may store the first audio data and the second audio data. The one or more processors may render the first audio data into first spatial-domain audio data for playback by virtual speakers at a set of virtual speaker locations, and render the second audio data into second spatial-domain audio data for playback by the virtual speakers at the set of virtual speaker locations. The one or more processors may also mix the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data, and convert the mixed spatial-domain audio data to scene-based audio data.

1. A device configured to support unified audio rendering, the device comprising:

an audio decoder configured to decode first audio data within a time frame and second audio data within the time frame from a bitstream;

a memory configured to store the first audio data and the second audio data; and

one or more processors configured to:

render the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;

render the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;

mix the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and

convert the mixed spatial domain audio data to scene-based audio data.

2. The device of claim 1, wherein the one or more processors are further configured to determine, prior to rendering the first audio data and the second audio data, the set of virtual speaker locations at which the virtual speakers are located based on headphone capability data representing one or more capabilities of headphones.

3. The device of claim 1,

wherein the first audio data comprises one of first scene-based audio data, first channel-based audio data, or first object-based audio data, and

wherein the second audio data comprises one of second scene-based audio data, second channel-based audio data, or second object-based audio data.

4. The device of claim 1,

wherein the one or more processors are configured to convert the mixed spatial domain audio data from the spatial domain to a spherical harmonic domain, and

wherein the scene-based audio data comprises higher order ambisonic audio data defined in the spherical harmonic domain as a set of one or more higher order ambisonic coefficients corresponding to a spherical basis function.

5. The device of claim 1, wherein the set of virtual speaker locations comprises a set of virtual speaker locations that are uniformly distributed about a sphere centered on a listener's head.

6. The device of claim 1, wherein the set of virtual speaker locations comprises Fliege points.

7. The device of claim 1,

wherein the one or more processors are configured to render the first audio data based on headphone-captured audio data to obtain the first spatial-domain audio data, wherein the headphone-captured audio data comprises audio data representative of sounds detected by headphones, and

wherein the one or more processors are configured to render the second audio data based on the audio data captured by the headphones to obtain the second spatial-domain audio data.

8. The device of claim 1, further comprising an interface configured to send the scene-based audio data and data indicative of the set of virtual speaker locations to a headset.

9. The device of claim 8, wherein the headset comprises a wireless headset.

10. The device of claim 8, wherein the headset comprises a computer-mediated reality headset that supports one or more of virtual reality, augmented reality, and mixed reality.

11. The device of claim 1,

wherein the audio decoder is further configured to decode third audio data within the time frame from the bitstream,

wherein the memory is further configured to store the third audio data,

wherein the one or more processors are further configured to render the third audio data into third spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations, and

wherein the one or more processors are configured to mix the first spatial-domain audio data, the second spatial-domain audio data, and the third spatial-domain audio data to obtain the mixed spatial-domain audio data.

12. A method of supporting unified audio rendering, the method comprising:

decoding, by a computing device and from a bitstream, first audio data within a time frame and second audio data within the time frame;

rendering, by the computing device, the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;

rendering, by the computing device, the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;

mixing, by the computing device, the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and

converting, by the computing device, the mixed spatial domain audio data to scene-based audio data.

13. The method of claim 12, further comprising determining, prior to rendering the first audio data and the second audio data, the set of virtual speaker locations at which the virtual speakers are located based on headphone capability data representing one or more capabilities of headphones.

14. The method of claim 12,

wherein the first audio data comprises one of first scene-based audio data, first channel-based audio data, or first object-based audio data, and

wherein the second audio data comprises one of second scene-based audio data, second channel-based audio data, or second object-based audio data.

15. The method of claim 12,

wherein converting the mixed spatial domain audio data comprises converting the mixed spatial domain audio data from the spatial domain to a spherical harmonic domain, and

wherein the scene-based audio data comprises higher order ambisonic audio data defined in the spherical harmonic domain as a set of one or more higher order ambisonic coefficients corresponding to a spherical basis function.

16. The method of claim 12, wherein the set of virtual speaker locations comprises a set of virtual speaker locations that are uniformly distributed about a sphere centered on a listener's head.

17. The method of claim 12, wherein the set of virtual speaker locations comprises Fliege points.

18. The method of claim 12,

wherein rendering the first audio data comprises rendering the first audio data based on headphone-captured audio data to obtain the first spatial-domain audio data, wherein the headphone-captured audio data comprises audio data representing sounds detected by headphones, and

wherein rendering the second audio data comprises rendering the second audio data based on the audio data captured by the headphones to obtain the second spatial-domain audio data.

19. The method of claim 12, further comprising sending the scene-based audio data and data indicative of the set of virtual speaker locations to a headset.

20. The method of claim 19, wherein the headset comprises a wireless headset.

21. The method of claim 19, wherein the headset comprises a computer-mediated reality headset that supports one or more of virtual reality, augmented reality, and mixed reality.

22. The method of claim 12, further comprising:

decoding third audio data within the time frame from the bitstream; and

rendering the third audio data into third spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations,

wherein mixing the first spatial-domain audio data and the second spatial-domain audio data comprises mixing the first spatial-domain audio data, the second spatial-domain audio data, and the third spatial-domain audio data to obtain the mixed spatial-domain audio data.

23. A device configured to support unified audio rendering, the device comprising:

means for decoding, from a bitstream, first audio data within a time frame and second audio data within the time frame;

means for rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;

means for rendering the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;

means for mixing the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and

means for converting the mixed spatial domain audio data to scene-based audio data.

24. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to:

decode first audio data within a time frame and second audio data within the time frame from a bitstream;

render the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;

render the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;

mix the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and

convert the mixed spatial domain audio data to scene-based audio data.

Technical Field

The present disclosure relates to processing of media data, such as audio data.

Background

Higher Order Ambisonic (HOA) signals, often represented by a plurality of Spherical Harmonic Coefficients (SHC) or other hierarchical elements, are three-dimensional representations of a sound field. The HOA or SHC representation may represent the sound field in a manner that is independent of the local speaker geometry used to play back a multi-channel audio signal rendered from the SHC signal. The SHC signal may also facilitate backward compatibility because the SHC signal may be rendered to well-known and widely adopted multi-channel formats, such as the 5.1 audio channel format or the 7.1 audio channel format. The SHC representation may thus enable a better representation of the sound field that also accommodates backward compatibility.

Disclosure of Invention

The present disclosure relates generally to auditory aspects of a user experience of computer-mediated reality systems, including Virtual Reality (VR), Mixed Reality (MR), Augmented Reality (AR), computer vision, and imaging systems.

In one example, various aspects of the technology are directed to a device configured to support unified audio rendering, the device comprising: an audio decoder configured to decode, from the bitstream, first audio data within a time frame and second audio data within the time frame; a memory configured to store first audio data and second audio data; and one or more processors configured to: rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; rendering the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; mixing the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and converting the mixed spatial domain audio data to scene-based audio data.

In another example, various aspects of the technology are directed to a method of supporting unified audio rendering, the method comprising: decoding, by a computing device and from a bitstream, first audio data within a time frame and second audio data within the time frame; rendering, by the computing device, the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; rendering, by the computing device, the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; mixing, by the computing device, the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and converting, by the computing device, the mixed spatial domain audio data to scene-based audio data.

In another example, various aspects of the technology are directed to a device configured to support unified audio rendering, the device comprising: means for decoding, from a bitstream, first audio data within a time frame and second audio data within the time frame; means for rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; means for rendering second audio data into second spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; means for mixing the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and means for converting the mixed spatial domain audio data to scene-based audio data.

In another example, various aspects of the technology are directed to a non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: decoding first audio data within a time frame and second audio data within the time frame from a bitstream; rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; rendering the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; mixing the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and converting the mixed spatial domain audio data to scene-based audio data.

The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.

Drawings

FIG. 1 is a schematic diagram showing spherical harmonic basis functions of various orders and sub-orders.

Fig. 2 is a schematic diagram illustrating a system that may perform various aspects of the techniques described in this disclosure.

FIG. 3 is a schematic diagram illustrating aspects of the non-uniform spatial resolution distribution of a mixed order ambisonic representation of a sound field.

Fig. 4 is a schematic diagram illustrating the difference between a full third order HOA representation of a sound field and a mixed order ambisonic representation of the same sound field, in which the horizontal region has a higher spatial resolution than the remaining regions.

Fig. 5 is a schematic diagram illustrating an example of a headset that may be used by one or more computer-mediated reality systems of the present disclosure.

Fig. 6 is a block diagram illustrating an example implementation of an audio playback system using a common information reference renderer (CIRR), in accordance with the techniques of this disclosure.

Fig. 7 is a block diagram illustrating another example implementation of an audio playback system using a common information reference renderer, in accordance with the techniques of this disclosure.

Fig. 8 is a block diagram illustrating an example implementation of an audio playback system using a common information reference renderer that uses audio data captured by headphones for augmented reality, in accordance with the techniques of this disclosure.

FIG. 9 is a flow diagram illustrating example operations of the audio playback system shown in the example of FIG. 7 in performing aspects of the scalable unified rendering technique.

Detailed Description

In general, this disclosure is directed to techniques for playback of a sound field representation during a user experience of a computer-mediated reality system. Computer-mediated reality (CMR) technologies include various types of content generation and content consumption systems, such as Virtual Reality (VR), Mixed Reality (MR), Augmented Reality (AR), computer vision, and image systems. While several aspects of the disclosure are described with respect to a virtual reality system by way of example for ease of discussion, it will be understood that the techniques of the disclosure are also applicable to other types of computer-mediated reality technologies, such as mixed reality, augmented reality, computer vision, and image systems.

The virtual reality system may utilize field of view (FoV) information of the user to obtain video data associated with the FoV of the user. As such, the virtual reality system may obtain video data that partially or fully surrounds the viewer's head, e.g., for a virtual reality application or other similar scenario in which the user may move his or her head to see different portions of the image canvas that are not visible when pointing focus at a single point of the canvas. In particular, these techniques may be applied when a viewer points visual focus at a particular portion of a large canvas (such as a three-dimensional canvas that partially or fully encloses the viewer's head). The video data surrounding the user's head may be provided using a combination of screens (e.g., a set of screens surrounding the user) or via a head mounted display.

Examples of hardware capable of providing a head mounted display include VR headsets, MR headsets, AR headsets, and various other hardware. Sensory data and/or test data may be used to determine the FoV of the user. As one example of sensory data, one or more angles associated with the positioning of the VR headset, which form the "steering angle" of the headset, may indicate the FoV of the user. As another example of sensory data, the user's gaze angle (e.g., sensed via iris detection) may indicate the user's FoV. Video data and corresponding audio data may be encoded and prepared (e.g., for storage and/or transmission) using a feature set that includes the FoV information.

The techniques of this disclosure may be used in connection with techniques related to the transmission (e.g., sending and/or receiving) of media data, such as video data and audio data, encoded at various quality levels for the different regions at which the media data is to be played back. For example, the techniques of this disclosure may be used by a client device that includes a panoramic display (e.g., a display that partially or fully surrounds a viewer's head) and surround sound speakers. Typically, the display is configured such that the user's visual focus is directed at only a portion of the display at a given time. The systems of the present disclosure may render and output audio data via the surround sound speakers such that audio objects associated with the current region of focus on the display are output with greater directionality than the remaining audio objects.

There are various "surround sound" channel-based audio formats in the market. For example, they range from the 5.1 home theater system (which, beyond stereo, has been the most successful in terms of making inroads into living rooms) to the 22.2 system developed by NHK (Nippon Hoso Kyokai, or the Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce a soundtrack for a movie once and not expend the effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has promulgated standards that allow a sound field to be represented using a hierarchical set of elements (e.g., Higher Order Ambisonic (HOA) coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configurations, whether at locations defined by various standards or at non-uniform locations.

MPEG promulgated the MPEG-H 3D Audio standard, formally entitled "Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio", set forth by ISO/IEC JTC 1/SC 29, with the file identifier ISO/IEC DIS 23008-3 and dated July 25, 2014. MPEG also promulgated a second edition of the 3D Audio standard, entitled "Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio", set forth by ISO/IEC JTC 1/SC 29, with the file identifier ISO/IEC 23008-3:201x (E) and dated October 12, 2016. References to the "3D Audio standard" in this disclosure may refer to one or both of the above standards.

As described above, one example of a hierarchical set of elements is a set of Spherical Harmonic Coefficients (SHC). The following expression illustrates a description or representation of a sound field using SHC:

$$ p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty}\left[4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r)\right] e^{j\omega t}. $$

The expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the sound field, at time $t$, can be represented uniquely by the SHC, $A_n^m(k)$. Here, $k = \omega/c$, $c$ is the speed of sound (approximately 343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a reference point (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ is the spherical harmonic basis function of order $n$ and sub-order $m$ (which may also be referred to as a spherical basis function). It can be recognized that the term in square brackets is a frequency-domain representation of the signal (e.g., $S(\omega, r_r, \theta_r, \varphi_r)$), which can be approximated by various time-frequency transforms, such as the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multi-resolution basis functions.
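
For readers who want a computational view of the expression above, the following sketch (an illustration only, not part of the disclosure) evaluates the bracketed frequency-domain term for a single wavenumber k using SciPy's spherical Bessel function and spherical harmonics; the coefficient layout, function names, and truncation order are assumptions made for the example.

```python
# Minimal sketch: evaluate S(omega, r, theta, phi) =
#   4*pi * sum_n j_n(k*r) * sum_m A_n^m(k) * Y_n^m(theta, phi)
# truncated at the order implied by the supplied coefficients.
import numpy as np
from scipy.special import spherical_jn, sph_harm

def evaluate_field(A, k, r, theta, phi):
    """A: dict mapping (n, m) -> complex SHC A_n^m(k); (r, theta, phi): observation point
    with theta the polar angle and phi the azimuth."""
    order = max(n for (n, _) in A)
    total = 0.0 + 0.0j
    for n in range(order + 1):
        radial = spherical_jn(n, k * r)                 # spherical Bessel function j_n(kr)
        for m in range(-n, n + 1):
            # SciPy's sph_harm(m, n, azimuth, polar) returns Y_n^m
            total += A[(n, m)] * radial * sph_harm(m, n, phi, theta)
    return 4.0 * np.pi * total

# A fourth order sound field carries (1 + 4)**2 = 25 coefficients.
A = {(n, m): 0.0j for n in range(5) for m in range(-n, n + 1)}
A[(0, 0)] = 1.0 + 0.0j                                  # simple omnidirectional field
print(len(A), evaluate_field(A, k=2 * np.pi * 1000 / 343.0, r=0.1, theta=np.pi / 2, phi=0.0))
```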

FIG. 1 is a diagram showing spherical harmonic basis functions from the zeroth order (n = 0) to the fourth order (n = 4). As can be seen, for each order there is an expansion of sub-orders m, which are shown in the example of FIG. 1 but not explicitly noted for ease of illustration. The SHC $A_n^m(k)$ may be physically captured by various microphone array configurations or, alternatively, derived from channel-based or object-based descriptions of the sound field. The SHC represent scene-based audio, in which the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth order representation involving $(1+4)^2$ (25, and hence fourth order) coefficients may be used.

As described above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how the SHC may be derived from microphone arrays are described in Poletti, M., "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics," Journal of the Audio Engineering Society (J. Audio Eng. Soc.), Vol. 53, No. 11, November 2005, pp. 1004-1025.

To illustrate how the SHC may be derived from an object-based description, consider the following equation. The coefficients $A_n^m(k)$ for the sound field corresponding to an individual audio object may be expressed as:

$$ A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s), $$

where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and its corresponding location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, a multitude of PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the sound field (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall sound field in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$. The remaining figures are described below in the context of object-based and SHC-based audio coding.
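
As a hedged illustration of the object-to-SHC equation above (a sketch, not the disclosed implementation), the code below converts a single object's source energy g(ω) and location into coefficients A_n^m(k), and uses the additivity noted above to accumulate several objects; the helper names and the default order are assumptions.

```python
# Sketch: A_n^m(k) = g(omega) * (-4*pi*i*k) * h_n^(2)(k*r_s) * conj(Y_n^m(theta_s, phi_s)),
# where h_n^(2) = j_n - i*y_n is the spherical Hankel function of the second kind.
import numpy as np
from scipy.special import spherical_jn, spherical_yn, sph_harm

def spherical_hankel2(n, z):
    return spherical_jn(n, z) - 1j * spherical_yn(n, z)

def object_to_shc(g, k, r_s, theta_s, phi_s, order=4):
    """SHC for one audio object at (r_s, theta_s, phi_s) with source energy g at wavenumber k."""
    coeffs = {}
    for n in range(order + 1):
        h = spherical_hankel2(n, k * r_s)
        for m in range(-n, n + 1):
            Y = sph_harm(m, n, phi_s, theta_s)          # Y_n^m(theta_s, phi_s)
            coeffs[(n, m)] = g * (-4j * np.pi * k) * h * np.conj(Y)
    return coeffs

def mix_objects(objects, k, order=4):
    """Because the decomposition is linear and orthogonal, per-object coefficients simply add."""
    total = {(n, m): 0.0j for n in range(order + 1) for m in range(-n, n + 1)}
    for g, r_s, theta_s, phi_s in objects:
        for key, value in object_to_shc(g, k, r_s, theta_s, phi_s, order).items():
            total[key] += value
    return total
```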

Fig. 2 is a schematic diagram illustrating a system 10 that may perform various aspects of the techniques described in this disclosure. As shown in the example of fig. 2, system 10 includes a source system 200 and a content consumer system 202. Although described in the context of source system 200 and content consumer system 202, the techniques may be implemented in other contexts. Furthermore, source system 200 may represent any form of computing device capable of generating a bitstream compatible with the techniques of this disclosure. Likewise, content consumer system 202 may represent any form of computing device capable of implementing the techniques of this disclosure.

The source system 200 may be operated by an entertainment company or other entity that may generate multi-channel audio content for consumption by an operator of a content consumption device, such as the content consumer system 202. In many VR scenarios, the source system 200 generates audio content along with video content. In the example of fig. 2, the source system 200 includes a content capture device 204, a bitstream generation unit 206, a microphone 208, and a camera 210.

The content capture device 204 may be configured to connect with the microphone 208 or otherwise communicate with the microphone 208. The microphone 208 may represent an Eigenmike (spherical microphone) or another type of 3D audio microphone capable of capturing and representing the sound field as HOA coefficients 11. In some examples, the content capture device 204 includes an integrated microphone 208 within the housing of the content capture device 204. In some examples, the content capture device 204 may be connected with the microphone 208 wirelessly or via a wired connection.

The microphone 208 generates audio data 212. In some examples, the audio data is scene-based audio data (e.g., HOA coefficients), channel-based audio data, object-based audio data, or another type of audio data. In other examples, the content capture device 204 may process the audio data 212 after receiving the audio data 212 via some type of storage (e.g., removable storage). Various combinations of the content capture device 204 and the microphone 208 are possible, with several examples of such combinations being discussed above for purposes of illustration. The camera 210 may be configured to capture video data 214 and provide the captured raw video data 214 to the content capture device 204.

The content capture device 204 may be configured to connect with the bitstream generation unit 206 or otherwise communicate with the bitstream generation unit 206. The bitstream generation unit 206 may comprise any type of hardware device capable of connecting with the content capture device 204. The bitstream generation unit 206 may use the audio data 212 to generate a bitstream 216, the bitstream 216 including one or more representations of a sound field defined by the audio data 212. In some examples, the bitstream 216 may also include a representation of the video data 214.

The bitstream generation unit 206 may generate the representation of the audio data 212 in various ways. For example, the bitstream generation unit 206 may represent the audio data 212 in one or more of a scene-based audio format, a channel-based audio format, and/or an object-based audio format.

In some examples in which the bitstream generation unit 206 represents audio data in a scene-based format, the bitstream generation unit 206 uses an encoding scheme for ambisonic representation of a sound field referred to as Mixed Order Ambisonics (MOA). To generate a particular MOA representation of the sound field, the bitstream generation unit 206 may generate a partial subset of the full set of HOA coefficients. For example, each MOA representation generated by the bitstream generation unit 206 may provide precision with respect to some regions of the sound field, but less precision in other regions. In one example, an MOA representation of the sound field may include eight (8) of the uncompressed HOA coefficients, whereas a third order HOA representation of the same sound field may include sixteen (16) of the uncompressed HOA coefficients. As such, each MOA representation generated as a partial subset of the HOA coefficients may be less memory intensive and less bandwidth intensive (if and when transmitted as part of the bitstream 216 over the illustrated transmission channel) than the corresponding third order HOA representation of the same sound field generated from the HOA coefficients.
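
The following sketch shows one way such a "3H1P"-style partial subset could be selected from a full set of HOA coefficients; the convention of keeping every coefficient up to the lower "peripheral" order plus the sectoral coefficients (|m| = n, which carry horizontal resolution) up to the higher "horizontal" order is an assumption for illustration, not a statement of the encoding scheme actually used by the bitstream generation unit 206.

```python
# Sketch: pick a mixed order ambisonic (MOA) subset of HOA coefficient indices.
def moa_coefficient_indices(horizontal_order, peripheral_order):
    kept = []
    for n in range(horizontal_order + 1):
        for m in range(-n, n + 1):
            # keep full resolution up to the peripheral order, sectoral terms beyond it
            if n <= peripheral_order or abs(m) == n:
                kept.append((n, m))
    return kept

full_third_order = [(n, m) for n in range(4) for m in range(-n, n + 1)]
moa_3h1p = moa_coefficient_indices(horizontal_order=3, peripheral_order=1)
print(len(full_third_order), len(moa_3h1p))   # 16 uncompressed coefficients vs. 8
```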

In some examples, the content capture device 204 may be configured to wirelessly communicate with the bitstream generation unit 206. In some examples, the content capture device 204 may communicate with the bitstream generation unit 206 via one or both of a wireless connection or a wired connection. Via the connection between the content capture device 204 and the bitstream generation unit 206, the content capture device 204 may provide content in various content forms, which are described herein as being part of the HOA coefficients 11 for purposes of discussion.

In some examples, the content capture device 204 may utilize aspects of the bitstream generation unit 206 (in terms of the hardware or software capabilities of the bitstream generation unit 206). For example, the bitstream generation unit 206 may include dedicated hardware configured to perform psychoacoustic audio encoding (or specialized software that, when executed, may cause one or more processors to perform psychoacoustic audio encoding), such as the unified speech and audio coder denoted "USAC" set forth by the Moving Picture Experts Group (MPEG) or the MPEG-H 3D Audio coding standard. The content capture device 204 may not include psychoacoustic audio encoder-specific hardware or specialized software, and may instead provide the audio aspects of the audio content 212 in a non-psychoacoustic-audio-coded form (which is another way of referring to the audio data 212). The bitstream generation unit 206 may assist in the capture of the content 212 by performing, at least in part, psychoacoustic audio encoding with respect to the audio aspects of the audio content 212.

The bitstream generation unit 206 may assist in content capture and transmission by generating one or more bitstreams based at least in part on audio content (e.g., an MOA representation and/or a third order HOA representation) generated from the audio data 212. The bitstream 216 may include a compressed version of the audio data 212 (and/or a partial subset thereof used to form an MOA representation of a sound field) as well as any other different type of content, such as compressed versions of video data, image data, and/or text data. As an example, the bitstream generation unit 206 may generate the bitstream 216 for transmission across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 216 may represent an encoded version of the audio data 212 (and/or a partial subset thereof used to form an MOA representation of a sound field), and may include a main bitstream and additional side bitstreams, which may be referred to as side channel information.

FIG. 3 is a schematic diagram illustrating aspects of the non-uniform spatial resolution distribution of an MOA representation of a sound field. Whereas a full HOA representation has consistently high spatial resolution in all directions, an MOA representation of the same sound field has variable spatial resolution. In many cases, as in the example of fig. 3, the MOA representation of the sound field includes high resolution spatial audio data only in a horizontal region, and lower resolution spatial audio data in the remaining regions of the sound field. In the example shown in fig. 3, the MOA representation of the sound field includes a third order representation of the horizontal region (marked by the white bar) and a first order representation of all other regions (shown by the black shaded portion). That is, with the MOA representation of fig. 3, as soon as a sound source moves away from the equator of the sound field, the sharpness and extent of the high quality reconstruction of audio objects emanating from that source decrease rapidly.

Fig. 4 is a schematic diagram showing the difference between a full third order HOA representation of a sound field and a MOA representation of the same sound field, in which the horizontal regions have a higher spatial resolution than the remaining regions. As shown in fig. 4, the full third order HOA representation includes sixteen (16) uncompressed HOA coefficients to represent the sound field. The consistent spatial resolution of the complete HOA representation is shown as white (or blank) by the entire 3-axis diagram for the complete third order HOA representation.

In contrast, with respect to the same sound field, the MOA representation includes eight (8) uncompressed coefficients (or coefficient channels). Furthermore, in contrast to the consistent spatial resolution exhibited by the third order HOA representation, the MOA representation exhibits a non-uniform spatial resolution, in which high spatial resolution occurs along the equator of the 3D sound field, while the remaining regions of the sound field are represented at a lower spatial resolution. The MOA representation shown in fig. 4 is described as being a "3H1P" MOA representation, indicating that the MOA representation comprises a third order representation of the horizontal region and a first order representation of the remaining region of the sound field.

Although described with respect to captured content 212/214, various aspects of the techniques described in this disclosure may be applied to generated or rendered content, such as is common in video games, where audio data 212 is retrieved from memory and/or storage rather than captured, and video data 214 is programmatically generated by hardware, such as a Graphics Processing Unit (GPU). In instances in which the source system 200 obtains the content 212/214 instead of fully capturing the content 212/214, the source system 200 may represent a computer (e.g., a video game system, a laptop computer, a desktop computer, etc.) configured to generate the audio data 212 and the video data 214.

Regardless, the content consumer system 202 may be operated by an individual, and may represent a VR client device in many examples. Content consumer system 202 may include an audio playback system 218 and headphones 220. The audio playback system 218 may refer to any form of audio playback system capable of rendering SHC (whether in the form of third order HOA and/or MOA representations) or other scene-based audio data for playback as multi-channel audio content.

Although shown in fig. 2 as being sent directly to the content consumer system 202, the source system 200 may output the bitstream 216 to an intermediary device positioned between the source system 200 and the content consumer system 202. The intermediary device may store the bitstream 216 for subsequent delivery to the content consumer system 202, which may request the bitstream. The intermediary device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smartphone, or any other device capable of storing the bitstream 216 for subsequent retrieval by an audio decoder. The intermediary device may reside in a content delivery network capable of streaming the bitstream 216 (possibly in conjunction with sending a corresponding video data stream) to a requester, such as the content consumer system 202, that requests the bitstream 216.

Alternatively, the source system 200 may store the bitstream 216 to a storage medium, such as a compact disc, digital video disc, high definition video disc, or other storage medium, most of which are readable by a computer and thus may be referred to as a computer-readable storage medium or a non-transitory computer-readable storage medium. In this context, a transmission channel may refer to a channel through which content stored to the medium is transmitted (and may include retail stores and other store-based delivery mechanisms). Regardless, the techniques of this disclosure should not be limited in this regard to the example of fig. 2.

As noted above, the content consumer system 202 includes an audio playback system 218. The audio playback system 218 may represent any audio playback system capable of playing back multi-channel audio data. The audio playback system 218 may include several different renderers. The renderers may each supply different forms of rendering, where the different forms of rendering may include one or more of various methods of performing vector-based amplitude panning (VBAP), and/or one or more of various methods of performing sound field synthesis. As used herein, "a and/or B" means "a or B," or both "a and B.

The audio playback system 218 may decode scene-based audio data, object-based audio data, and/or channel-based audio data from the bitstream 216. As described in more detail elsewhere in this disclosure, the audio playback system 218 may render the audio data decoded from the bitstream 216 into the output speaker feeds 222. The speaker feeds 222 may drive one or more speakers included in the headphones 220 (which are not shown in the example of fig. 2 for ease of illustration). In some examples, the speaker feeds 222 include a left channel and a right channel for binaural playback. In examples in which scene-based audio data (such as HOA coefficients) is included in the bitstream, the ambisonic representation of the sound field may be normalized in a number of ways, including N3D, SN3D, FuMa, N2D, or SN2D.
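
Because the normalization determines how decoded coefficients must be scaled before rendering, a short hedged sketch of the common SN3D-to-N3D rescaling is given below; the per-order factor sqrt(2n + 1) is the standard relationship between those two conventions, while the ACN channel ordering is an assumption about the layout of the decoded coefficients.

```python
# Sketch: rescale ambisonic coefficient channels between SN3D and N3D normalization.
# Assumes ACN ordering, where channel index acn maps to order n = floor(sqrt(acn)).
import numpy as np

def sn3d_to_n3d(channels):
    """channels: array of shape (num_coeffs, num_samples) in SN3D/ACN."""
    out = np.array(channels, dtype=float, copy=True)
    for acn in range(out.shape[0]):
        n = int(np.floor(np.sqrt(acn)))        # ambisonic order of this channel
        out[acn] *= np.sqrt(2 * n + 1)         # N3D = SN3D * sqrt(2n + 1)
    return out

def n3d_to_sn3d(channels):
    out = np.array(channels, dtype=float, copy=True)
    for acn in range(out.shape[0]):
        n = int(np.floor(np.sqrt(acn)))
        out[acn] /= np.sqrt(2 * n + 1)
    return out
```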

In some examples, content consumer system 202 receives bitstream 216 from a streaming server. The streaming server may provide various types of streams, or combinations of streams, in response to such requests from the streaming client. For example, the streaming server may also provide a full-order HOA stream as an option if requested by the streaming client (e.g., executing on the audio playback system 218). In other examples, the streaming server may provide one or more of an object-based representation of the soundfield, a higher order ambisonic representation of the soundfield, a mixed order ambisonic representation of the soundfield, a combination of the object-based representation of the soundfield and the higher order ambisonic representation of the soundfield, a combination of the object-based representation of the soundfield and the mixed order ambisonic representation of the soundfield, or a combination of the mixed order representation of the soundfield and the higher order ambisonic representation of the soundfield.

Content consumer system 202 may represent a video game system or other computing device similar to the source system. Although shown as separate systems, in some examples source system 200 and content consumer system 202 may be a single system. For example, both source system 200 and content consumer system 202 may be implemented within a single video game system or other computing device. A single computing device may be connected with the headset 220. In some instances, the headset 220 may house a single computing device (which implements both the source system 200 and the content consumer system 202) and not have separate computing systems.

Regardless of the configuration of the source system 200 and the content consumer system 202, the content consumer system 202 may include the headset 220. Fig. 5 is a schematic diagram illustrating an example of the headset 220 that may be used by one or more computer-mediated reality systems of the present disclosure. In various examples, the headset 220 may represent a VR headset, an AR headset, an MR headset, an extended reality (XR) headset, or another type of headset for CMR. In some examples, the headset 220 does not have a visual component and instead outputs sound without a visual component. For example, the headset 220 may be a set of headphones.

As shown in the example of fig. 5, the headset 220 includes a rear camera, one or more directional speakers, one or more tracking and/or recording cameras, and one or more Light Emitting Diode (LED) lights. In some examples, the LED lamp may be referred to as an "ultra bright" LED lamp. In addition, the headset 220 includes one or more eye tracking cameras, high-sensitivity audio microphones, and optical/projection hardware. The optics/projection hardware of the headphones 220 may include durable translucent display technology and hardware.

The headset 220 also includes connectivity hardware, which may represent one or more network interfaces that support multi-mode connectivity, such as 4G communications, 5G communications, and so forth. The headset 220 also includes an ambient light sensor and a bone conduction transducer. In some examples, the headset 220 may also include one or more passive and/or active cameras with fisheye lenses and/or telephoto lenses. In accordance with various techniques of this disclosure, various devices of this disclosure, such as the content consumer system 202 of fig. 2, may use the steering angle of the headset 220 to select an audio representation of a sound field for output via the directional speakers of the headset 220. It will be appreciated that the headset 220 may exhibit a variety of different form factors.

As described above, content consumer system 202 also includes headphones 220. It will be understood that in various implementations, the headphones 220 may be included in the content consumer system 202 or externally coupled to the content consumer system 202. As discussed above with respect to fig. 5, the headphones 220 include display hardware configured to render the video data 214 and one or more speakers configured to reproduce a sound field represented by the audio data 212 based on the audio data 212. In some examples, the headphones 220 may also include hardware that implements some or all of the audio playback system 218. In some examples, a device separate from the headphones 220, such as a smartphone or personal computer (including video game systems), includes hardware that implements some or all of the audio playback system 218.

In some examples, the processor of the headset 220 tracks the steering angle using one or more angles associated with head rotation information. In turn, the headset 220 can use the steering angle to determine how to output the CMR audio sound field. The processor of the headset 220 may also reproduce the sound field via one or more speakers (e.g., speakers of the headset 220). In some examples, the processor of the headset 220 may use one or more sensors and/or cameras (e.g., sensors and/or cameras of the headset 220) to capture images indicative of the gaze angle of the user wearing the headset 220. For example, the processor of the headset 220 may use the gaze angle to determine the steering angle. The processor of the headset 220 may also present a sequence of images at a field-of-view angle based on the steering angle. For example, the processor of the headset 220 may output portions of the image sequence via the display hardware of the headset 220 at a particular field-of-view angle that is appropriate for the current steering angle of the headset 220.
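
As a rough, hedged sketch of how a steering angle might be derived from sensed head orientation (the coordinate frame, with +x forward and +z up as is common in ambisonics, and the use of a rotation matrix are assumptions, not the headset 220's actual interface):

```python
# Sketch: derive a steering angle (azimuth, elevation) from a 3x3 head-rotation matrix.
import numpy as np

def steering_angle(rotation_matrix):
    forward = rotation_matrix @ np.array([1.0, 0.0, 0.0])    # rotated "look" direction
    azimuth = np.arctan2(forward[1], forward[0])             # yaw, in radians
    elevation = np.arcsin(np.clip(forward[2], -1.0, 1.0))    # pitch, in radians
    return azimuth, elevation

print(steering_angle(np.eye(3)))   # (0.0, 0.0): looking straight ahead
```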

The storage device of the headset 220 may also locally store various types of representations, or combinations thereof, in response to such selections by an audio stream selector executed by a processor of the headset 220. For example, as discussed above, the processor of the headphones 220 may also provide a full-order HOA stream as an option if selected by the audio stream selector. In other examples, via speaker hardware of the headphones 220, the processor of the headphones 220 can output one or more of an object-based representation of the soundfield, a higher order ambisonic representation of the soundfield, a mixed order ambisonic representation of the soundfield, a combination of the object-based representation of the soundfield and the higher order ambisonic representation of the soundfield, a combination of the object-based representation of the soundfield and the mixed order ambisonic representation of the soundfield, or a combination of the mixed order ambisonic representation of the soundfield and the higher order ambisonic representation of the soundfield. In some examples, one or more of the sound field representations stored to the storage of the headphones 220 may include at least one high-resolution region and at least one lower-resolution region, and wherein the representation selected based on the steering angle provides a higher spatial accuracy with respect to the at least one high-resolution region and a smaller spatial accuracy with respect to the lower-resolution region.

In some examples, the headset 220 may include one or more batteries that provide power to the components of the headset 220.

Fig. 6 is a block diagram illustrating an example implementation of an audio playback system 218 using a Common Information Reference Renderer (CIRR) in accordance with aspects of the techniques described in this disclosure. In the example of fig. 6, the audio playback system 218 includes an external renderer Application Programming Interface (API)1700, an object/channel-based external renderer 1702, an object/channel-based internal renderer 1704, a CMR stream decoder 1706, a HOA-to-spatial domain conversion unit 1708, a mixing unit 1710, a HOA converter 1712, a HOA renderer 1714, a mixing unit 1716, a generic renderer API 1718, and a virtual speaker location unit 1720.

The CMR stream decoder 1706 receives and decodes a bitstream, such as the bitstream 216 (shown in the example of fig. 2). The bitstream 216 may include a CMR stream (which may be referred to as the "CMR stream 216"). By decoding the CMR stream 216, the CMR stream decoder 1706 can generate one or more streams of non-diegetic audio data, one or more streams of object-based audio data, channel-based audio data, and associated metadata, and/or HOA audio data or other scene-based audio data.

When the audio playback system 218 uses the external renderer 1702, the CMR stream decoder 1706 interfaces with the external renderer 1702 through an external renderer application programming interface (API) 1700. The external renderer API 1700 may represent an interface configured to provide configuration data or metadata to the external renderer 1702 and/or receive configuration data or metadata from the external renderer 1702. Accordingly, channel-based audio data, object-based audio data, and/or scene-based audio data, along with appropriate metadata and configuration information, are sent from the CMR stream decoder 1706 to the external renderer 1702.

The external renderer 1702 (which may also be referred to as the "object/channel-based renderer 1702") uses the one or more streams of channel-based audio data, object-based audio data, and associated metadata and/or HOA audio data or other scene-based audio data to generate binaural diegetic audio data. The mixing unit 1716 mixes the binaural diegetic audio data with the one or more streams of non-diegetic audio data to generate mixed binaural audio data 1717. The speakers in the headphones 220 may generate sound based on the mixed binaural audio data generated by the mixing unit 1716.

In instances in which the CMR stream decoder 1706 provides the channel-based audio data 1705 and/or the object-based audio data 1707 (which may include associated metadata) to the external renderer 1702 via the external renderer API 1700, the external renderer 1702 may render channel-based audio data corresponding to the speaker layout of the headphones 220, such as the binaural audio data 1717, which is then transformed to accommodate the motion-sensing data 221. In other words, the headphones 220 may perform further rendering to transform the binaural audio data 1717 in a manner that accounts for changes in the steering angle represented by the motion-sensing data 221.

Given that the headphones 220 may be processing limited (e.g., may feature a processor with less processing power than the audio playback system 218) and/or energy limited (e.g., powered by a limited power source such as a battery), the headphones 220 may not have sufficient processing and/or power capabilities to fully transform the binaural audio data 1717 in sufficient time to maintain consistency with the presented video data. Additionally, transforming the channel-based binaural audio data 1717 (which may have a left channel and a right channel) may involve significant mathematical computation that is difficult to perform in real time with processing-limited and/or energy-limited resources, further increasing the lack of consistency between the binaural audio data 1717 and the video data 214.

Such lack of consistency can introduce audio artifacts that reduce the immersion of the CMR experience. Furthermore, significant processing may increase power consumption, memory bandwidth consumption, and associated memory consumption, which may limit the time (due to a limited power supply, such as a battery) during which the headphones 220 can support playback of the binaural audio data 1717 and the video data 214. The potentially intensive processing and limited playback duration can frustrate the user of the headphones 220 in terms of the overall experience, as audio artifacts can break the immersion, cause nausea, or otherwise disrupt the overall experience, possibly preventing adoption of CMR.

In accordance with various aspects of the techniques described in this disclosure, the audio playback system 218 may provide scalable unified audio rendering that reduces processing complexity while accommodating all of the various different audio format types, such as channel-based audio data, object-based audio data, and/or scene-based audio data. The audio playback system 218 may support scalable audio rendering because, in contrast to the rendering performed by the external renderer 1702, any number of channels and/or objects may be rendered without increasing processing complexity. Further, the audio playback system 218 may support unified audio rendering by transforming object-based audio data and/or channel-based audio data into scene-based audio data, thereby potentially unifying all of the various audio format types.

As such, various aspects of the techniques may improve the operation of the audio playback system 218 itself, as the audio playback system 218 may reduce processing cycles when rendering the binaural audio data 1717 and the scene-based audio data 1703 (one example of which is shown as HOA audio data 1703) from the channel-based audio data 1705 and/or the object-based audio data 1707. As a result of more efficient processing during rendering, and due to the unification, the audio playback system 218 may reduce power, memory bandwidth, and memory storage space consumption, thereby potentially enabling the audio playback system 218 to operate for longer durations on a fixed-capacity power source (such as a battery).
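
The data flow just described (render every incoming format to a common set of virtual speakers, mix in the spatial domain, then convert the mix to scene-based audio) can be summarized with the following structural sketch; the function signature, matrix shapes, and random stand-in values are assumptions for illustration and do not correspond to units 1704 through 1716.

```python
# Structural sketch of the scalable unified rendering path described above.
import numpy as np

def unified_render(hoa_frames, channel_frames, object_frames,
                   hoa_to_speakers, channels_to_speakers, objects_to_speakers,
                   speakers_to_hoa):
    """Each *_frames argument is (num_signals, num_samples) or None; each matrix maps
    its signals onto the L virtual speakers of the common layout."""
    mixed = None
    for frames, matrix in ((hoa_frames, hoa_to_speakers),
                           (channel_frames, channels_to_speakers),
                           (object_frames, objects_to_speakers)):
        if frames is None:
            continue
        feed = matrix @ frames                            # render onto the common virtual speaker layout
        mixed = feed if mixed is None else mixed + feed   # mix in the spatial domain
    return speakers_to_hoa @ mixed                        # convert the mix to scene-based (HOA) audio

# Shape check with random stand-ins: 16 HOA coefficients, 2 channels, 1 object,
# 8 virtual speakers, 1024 samples in the time frame.
L, S = 8, 1024
out = unified_render(np.random.randn(16, S), np.random.randn(2, S), np.random.randn(1, S),
                     np.random.randn(L, 16), np.random.randn(L, 2), np.random.randn(L, 1),
                     np.random.randn(16, L))
print(out.shape)   # (16, 1024): one time frame of mixed scene-based audio
```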

In operation, the audio playback system 218 may receive a bitstream 216 that includes one or more different types of audio data (or, in other words, audio data that conforms to one or more different audio formats). For example, the bitstream 216 may include a compressed representation of the channel-based audio data 1705, the object-based audio data 1707, and/or the scene-based audio data 1703.

The CMR stream decoder 1706 may represent an example of an audio decoder configured to decode first audio data within a time frame (meaning a discrete period of time, such as a frame having a defined number of audio samples) and second audio data within the same time frame from the bitstream 216. The first audio data may refer to any one of the scene-based audio data 1703, the channel-based audio data 1705, or the object-based audio data 1707. The second audio data may also refer to any one of the scene-based audio data 1703, the channel-based audio data 1705, or the object-based audio data 1707.

Unless explicitly stated otherwise, it is assumed for the purpose of explanation that the scene-based audio data 1703 represents first audio data, and the channel-based audio data 1705 represents second audio data. However, in some examples, various other types of audio data 1703 and 1707 may represent the first audio data, while various other types of audio data 1703 and 1707 may represent the second audio data.

As further shown in the example of fig. 6, the audio playback system 218 may include an object/channel-based renderer 1704, an HOA to spatial domain conversion unit 1708, a mixing unit 1710, an HOA converter 1712, an HOA renderer 1714, a mixing unit 1716, a general renderer API 1718, and a virtual speaker position unit 1720. The CMR stream decoder 1706 may output scene-based audio data 1703 to the HOA to spatial domain conversion unit 1708. The CMR stream decoder 1706 may also output channel-based audio data 1705 to the object/channel-based renderer 1704.

HOA-to-spatial-domain conversion unit 1708 may represent a unit configured to render scene-based audio data 1703 into spatial-domain audio data for playback by virtual speakers at a set of virtual speaker positions (shown as position 1721, and may also be referred to as "virtual speaker position 1721"). In the case of HOA audio data 1703, the HOA to spatial domain conversion unit 1708 may store one or more different spherical basis functions having different orders and sub-orders. The HOA to spatial domain conversion unit 1708 may apply various equations similar to those listed above based on the spherical basis functions to render the HOA audio data 1703 into spatial domain audio data 1731.

That is, the HOA to spatial domain conversion unit 1708 may convert the HOA audio data 1703 from the spherical harmonic domain to the spatial domain to obtain the channel-based audio data 1731 (which is another way of referring to the spatial domain audio data 1731). The channel-based audio data 1731 may include a channel for each of the virtual speakers located at respective positions of the set of virtual speaker positions 1721. The HOA to spatial domain conversion unit 1708 may output spatial domain audio data 1731 to the mixing unit 1710.
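
One plausible realization of this conversion is sketched below under the assumptions that complex spherical harmonics are sampled at the virtual speaker directions and that a simple pseudo-inverse (mode-matching style) rendering matrix is acceptable; real-valued spherical harmonics would typically be used in practice, and the actual matrix design of unit 1708 is not specified here. The inverse step, converting mixed virtual speaker feeds back to HOA, is also shown for completeness.

```python
# Sketch: HOA <-> virtual-speaker conversion via a sampled spherical-harmonic matrix.
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, directions):
    """directions: list of (theta, phi) = (polar, azimuth) virtual speaker angles.
    Returns Y of shape (num_speakers, (order + 1)**2)."""
    Y = np.zeros((len(directions), (order + 1) ** 2), dtype=complex)
    for s, (theta, phi) in enumerate(directions):
        col = 0
        for n in range(order + 1):
            for m in range(-n, n + 1):
                Y[s, col] = sph_harm(m, n, phi, theta)
                col += 1
    return Y

def hoa_to_spatial(hoa_frames, order, directions):
    """hoa_frames: ((order + 1)**2, num_samples). Returns (num_speakers, num_samples) feeds."""
    C = sh_matrix(order, directions).T        # re-encoding matrix: coefficients per speaker
    D = np.linalg.pinv(C)                     # simple mode-matching rendering matrix
    return np.real(D @ hoa_frames)

def spatial_to_hoa(speaker_frames, order, directions):
    """Inverse step (used after mixing): convert virtual speaker feeds back to HOA coefficients."""
    C = sh_matrix(order, directions).T
    return C @ speaker_frames
```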

The object/channel-based renderer 1704 may represent a unit configured to render the channel-based audio data 1705 and/or the object-based audio data 1707 for playback by the virtual speakers at the set of virtual speaker locations 1721. The object/channel-based renderer 1704 may remap the channel-based audio data 1705 from the current location for each channel to the set of virtual speaker locations. In some examples, the object/channel-based renderer 1704 may perform vector-based amplitude panning (VBAP) to remap the channel-based audio data 1705 from the current location for each channel to the set of virtual speaker locations, as sketched below. In this respect, the object/channel-based renderer 1704 may render the channel-based audio data 1705 into spatial-domain audio data 1733 for playback by the virtual speakers at the set of virtual speaker locations.
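
A minimal sketch of vector-based amplitude panning for a single source direction is shown below, assuming unit-vector speaker directions and a brute-force search for a valid speaker triplet (a real implementation would precompute a triangulation of the layout); this is not a description of the renderer 1704's actual implementation.

```python
# Sketch: 3D vector-based amplitude panning (VBAP) of one source onto a speaker triplet.
import itertools
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """source_dir: unit vector (3,); speaker_dirs: (L, 3) array of unit vectors.
    Returns a length-L gain vector, power normalized over the active triplet."""
    best = None
    for triplet in itertools.combinations(range(len(speaker_dirs)), 3):
        L = speaker_dirs[list(triplet)]                 # 3x3 matrix, rows are speaker vectors
        try:
            g = np.linalg.solve(L.T, source_dir)        # solve L^T g = source direction
        except np.linalg.LinAlgError:
            continue                                    # degenerate (coplanar) triplet
        if np.all(g >= -1e-9):                          # source lies inside this triplet
            score = g.min()
            if best is None or score > best[0]:
                best = (score, triplet, g)
    _, triplet, g = best
    gains = np.zeros(len(speaker_dirs))
    gains[list(triplet)] = g / np.linalg.norm(g)        # power-normalize the active gains
    return gains
```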

The virtual speaker location unit 1720 may represent a unit configured to determine a set of speaker locations (e.g., Fliege points, which may represent one example of a set of virtual speaker locations that are uniformly distributed about a sphere in which a listener's head is centered on the sphere). In some examples, 4, 9, 16, or 25 virtual speaker locations (or, in other words, positions) are supported. In accordance with various techniques of this disclosure, the virtual speaker location unit 1720 may determine the set of virtual speaker locations based on headphone capability information that indicates one or more capabilities of the headphones 220. For example, HOA coefficients of a higher order, and hence a larger number of HOA coefficients, require more processing operations to render in the same amount of output time. Accordingly, processors with less processing power, or those with more limited battery power, may be unable to process HOA coefficients having an order above a certain threshold, or may be configured to avoid processing HOA coefficients having an order above a certain threshold.

For example, the processor of the headphones 220 for rendering HOA coefficients may be configured to render HOA coefficients up to the third order HOA coefficients but not the fourth order HOA coefficients or higher. Typically, a smaller number of virtual loudspeaker positions are associated with lower order HOA coefficients. Accordingly, the virtual speaker location unit 1720 may determine the virtual speaker location based on information about the processing capabilities of the headphones 220. For example, the virtual speaker location unit 1720 may determine a threshold based on the processing capabilities of the headphones 220 and determine the virtual speaker locations such that the number of virtual speaker locations does not exceed the threshold.
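A minimal sketch of this threshold logic is shown below in Python. The capability fields, the battery-saver rule, and the order-to-speaker-count table are assumptions introduced only for illustration; the disclosure itself requires only that the number of virtual speaker locations not exceed a threshold derived from the headphone capability data.

```python
from dataclasses import dataclass

@dataclass
class HeadphoneCapabilities:
    """Hypothetical capability report received from the headphones."""
    max_hoa_order: int   # highest HOA order the headphone renderer accepts
    battery_saver: bool  # if set, back off by one order to save power

# Assumed table: HOA order -> number of uniformly distributed virtual
# speakers, following the (order + 1)**2 relationship used in the text.
VIRTUAL_SPEAKER_COUNTS = {1: 4, 2: 9, 3: 16, 4: 25}

def select_virtual_speaker_count(caps: HeadphoneCapabilities,
                                 content_order: int) -> int:
    """Pick a virtual-speaker count no larger than the headphones can handle."""
    order = min(caps.max_hoa_order, content_order)
    if caps.battery_saver and order > 1:
        order -= 1
    return VIRTUAL_SPEAKER_COUNTS[order]

# Example: fourth-order content, headphones limited to third order.
print(select_virtual_speaker_count(HeadphoneCapabilities(3, False), 4))  # 16
```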

In some examples, the virtual speaker location unit 1720 determines the set of virtual speaker locations based at least in part on information regarding the scene-based audio data 1703 decoded from the bitstream 216. For example, the virtual speaker location unit 1720 may determine the set of virtual speaker locations based on the order of HOA coefficients in the scene-based audio data 1703 decoded from the bitstream 216.

In some examples, the virtual speaker location unit 1720 is configured to use a look-up table that maps a type of processor (or a type of headphones) to a predetermined set of virtual speaker locations. In some examples, the virtual speaker location unit 1720 is configured to determine the set of virtual speaker locations 1721 based on other factors.

In some examples, the processing power of the headphones 220 may change dynamically over time. For example, the processing capabilities of the headphones 220 may change based on other processing loads on the processor of the headphones 220, based on the available bandwidth for transmission of HOA audio data to the headphones 220, and/or based on other factors. Thus, in some such examples, the virtual speaker location unit 1720 may dynamically change which virtual speaker locations are used over time. In this regard, the virtual speaker location unit 1720 may obtain the set of virtual speaker locations 1721 and output the virtual speaker locations 1721 to the HOA to spatial domain conversion unit 1708, the object/channel-based renderer 1704, and the HOA renderer 1714.

As described above, the object/channel-based external renderer 1702 and/or the object/channel-based internal renderer 1704 render the channel- and/or object-based audio data 1705/1707 for output on the virtual speakers at the determined virtual speaker positions 1721, based on the virtual speaker positions 1721 determined by the virtual speaker position unit 1720. Vector-based amplitude panning (VBAP) may be used by the object/channel-based external renderer 1702 and/or the object/channel-based internal renderer 1704 to render object- or channel-based audio data for playback by the virtual speakers at the determined virtual speaker positions 1721. The object/channel-based external renderer 1702 and/or the object/channel-based internal renderer 1704 may generate one spatial domain signal (e.g., channel) for each of the determined virtual speaker positions. As such, this first rendering step may be performed by either an internal renderer or an external renderer.

In examples using the object/channel-based external renderer 1702, the external renderer API 1700 may be used (e.g., by the CMR stream decoder 1706) to send and receive information from the object/channel-based external renderer 1702. The general renderer API 1718 may be used (e.g., by the CMR stream decoder 1706) to send and receive information from a general information reference renderer component.

The HOA to spatial domain conversion unit 1708 converts the HOA audio data 1703 to an equivalent spatial domain representation based on the virtual speaker positions 1721 determined by the virtual speaker position unit 1720. For example, the HOA to spatial domain conversion unit 1708 may apply a rendering matrix corresponding to the determined virtual speaker positions 1721 to the HOA audio data 1703. The HOA-to-spatial-domain conversion unit 1708 may generate one spatial-domain signal for each of the determined virtual speaker positions.

An equivalent spatial domain representation of an Nth-order sound field representation c(t) is obtained by rendering c(t) to O virtual loudspeaker signals w_j(t), 1 ≤ j ≤ O, where O = (N+1)^2. The respective virtual loudspeaker positions are expressed by means of a spherical coordinate system, where each position lies on the unit sphere, e.g., a sphere with a radius of 1. Hence, the positions may be equivalently expressed by order-dependent directions

Ω_j^(N) := (θ_j^(N), φ_j^(N)), 1 ≤ j ≤ O,

where θ_j^(N) and φ_j^(N) denote the inclination and azimuth, respectively.

Rendering c(t) into the equivalent spatial domain may be formulated as a matrix multiplication

w(t) = (Ψ^(N,N))^(-1) · c(t),

where (·)^(-1) denotes matrix inversion. The mode matrix Ψ^(N,N) of order N with respect to the order-dependent directions Ω_j^(N) may be defined by

Ψ^(N,N) := [S_1^(N)  S_2^(N)  …  S_O^(N)],

where

S_j^(N) := [S_0^0(Ω_j^(N))  S_1^(-1)(Ω_j^(N))  S_1^0(Ω_j^(N))  S_1^1(Ω_j^(N))  …  S_N^N(Ω_j^(N))]^T,

and S_n^m(·) denotes the real-valued spherical harmonic of order n and degree m.

The mode matrix Ψ^(N,N) is invertible, so the HOA representation c(t) can be converted back from the equivalent spatial domain by

c(t) = Ψ^(N,N) · w(t).

The HOA sound field H may be converted to N-channel audio data C according to the following equation

C = D · H,

where D is a rendering matrix determined based on the speaker configuration (e.g., the determined virtual speaker locations) of the N-channel audio data. The N-channel audio data C may be converted back to the HOA sound field according to

H = D^T · C,

where D^T denotes the transpose of the rendering matrix D. Matrices, such as rendering matrices, may be processed in various ways. For example, a matrix may be processed (e.g., stored, added, multiplied, retrieved, etc.) as rows, columns, vectors, or otherwise.
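To make the equivalent-spatial-domain relationship above concrete, the following Python sketch builds a mode matrix from real-valued spherical harmonics (derived from SciPy's complex spherical harmonics using one common real-SH convention; ambisonic normalizations such as SN3D versus N3D differ by per-order scale factors that are ignored here), renders a vector of HOA coefficients to virtual-speaker signals, and converts the result back. The randomly drawn directions stand in for the uniformly distributed positions 1721; actual Fliege points are not computed here.

```python
import numpy as np
from scipy.special import sph_harm

def real_sph_harm(n, m, azimuth, inclination):
    """Real-valued spherical harmonic S_n^m (one common convention)."""
    y = sph_harm(abs(m), n, azimuth, inclination)  # complex Y_n^{|m|}
    if m > 0:
        return np.sqrt(2.0) * (-1) ** m * y.real
    if m < 0:
        return np.sqrt(2.0) * (-1) ** m * y.imag
    return y.real

def mode_matrix(order, azimuths, inclinations):
    """Psi^(N,N): column j stacks S_n^m(Omega_j) for (n, m) in ACN order."""
    cols = []
    for az, incl in zip(azimuths, inclinations):
        col = [real_sph_harm(n, m, az, incl)
               for n in range(order + 1) for m in range(-n, n + 1)]
        cols.append(col)
    return np.array(cols).T               # shape: (order+1)^2 x O

order = 3
num_virtual = (order + 1) ** 2            # O = (N + 1)^2 virtual speakers
rng = np.random.default_rng(0)
az = rng.uniform(0.0, 2.0 * np.pi, num_virtual)    # stand-ins for Fliege points
incl = np.arccos(rng.uniform(-1.0, 1.0, num_virtual))

psi = mode_matrix(order, az, incl)
c = rng.standard_normal(num_virtual)      # one time sample of HOA coefficients c(t)

w = np.linalg.solve(psi, c)               # w(t) = (Psi)^-1 c(t): virtual speaker feeds
c_back = psi @ w                          # c(t) = Psi w(t): back to the HOA domain
print(np.allclose(c, c_back))             # expected: True for well-conditioned direction sets
```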

The mixing unit 1710 may represent a unit configured to mix spatial domain audio data 1731 generated by the HOA-to-spatial domain conversion unit 1708 with corresponding spatial domain audio data 1733 generated by the object/channel based external renderer 1702 or the object/channel based internal renderer 1704. In this way, the mixing unit 1710 may output spatial-domain audio data 1735 to the HOA converter 1712, the spatial-domain audio data 1735 having a channel for each of the determined virtual speaker positions 1721.

Further, in the example of fig. 6, based on the determined virtual speaker locations 1721, the HOA converter 1712 may convert the spatial domain audio data 1735 output by the mixing unit 1710 to scene-based audio data (e.g., HOA audio data, or in other words, audio data defined in the spherical harmonic domain). The HOA converter 1712 may output a stream of scene-based audio data 1737. In this way, the audio playback system 218 may determine a set of one or more virtual speaker locations 1721 based on the data regarding the capabilities of the headphones 220 and generate scene-based audio data 1737 based on the set of virtual speaker locations 1721. In some examples, the audio playback system 218 includes a transmitter configured to transmit, to the headphones 220, the scene-based audio data 1737 and data indicating the set of virtual speaker locations 1721.

Thus, in some examples, the audio playback system 218 may perform at least one of: generating first spatial domain audio data based on a set of one or more virtual speaker positions and scene-based audio data decoded from a bitstream; and generating second spatial domain audio data based on the set of one or more virtual speaker positions and channel- or object-based audio data decoded from the bitstream. In such an example, the audio playback system 218 may generate third spatial domain audio data based on at least one of the first spatial domain audio data and the second spatial domain audio data. Further, the audio playback system 218 may generate scene-based audio data based on the third spatial domain audio data.

Further, in one example, the audio playback system 218 may determine a set of one or more virtual speaker locations based on the data regarding the capabilities of the headphones. In this example, the audio playback system 218 may decode first audio data from the bitstream, the first audio data being scene-based audio data. Further, the audio playback system 218 may decode second audio data from the bitstream, the second audio data being object-based or channel-based audio data. In this example, the audio playback system 218 may render the first audio data into first spatial domain audio data for playback by virtual speakers at the set of virtual speaker locations. The audio playback system 218 may render the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations. Further, the audio playback system 218 may generate third spatial domain audio data by mixing the first spatial domain audio data and the second spatial domain audio data. In this example, the audio playback system 218 may transform the third spatial domain audio data into second scene-based audio data.

The HOA renderer 1714 may then apply a rendering matrix to the stream of scene-based audio data 1737 output by the HOA converter 1712. By applying the rendering matrix to the stream of scene-based audio data 1737, the HOA renderer 1714 may generate spatial domain binaural diegetic audio data 1715. In other words, the HOA renderer 1714 may determine the rendering matrix based on the orientation of the headphones 220 (e.g., as defined by the motion sensing data 221), and may generate the spatial domain audio data 1717 by applying the rendering matrix to the scene-based audio data 1737.

In other words, the HOA renderer 1714 may represent a unit configured to transform the scene-based audio data 1737 from the spherical harmonic domain to the spatial domain to obtain the channel-based audio data 1717. The HOA renderer 1714 may obtain rendering matrices specific to the headphones 220 or, in some examples, derive rendering matrices specific to the headphones 220 from the headphone capability information. The rendering matrix may be specific to the headphones 220, as the rendering matrix may account for the placement of speakers within the headphones 220 or otherwise adapt the transformation to better localize sound in view of the capabilities of the headphones 220.

The HOA renderer 1714 may adapt or otherwise configure the rendering matrix to account for movement represented by the motion sensing data 221. That is, the HOA renderer 1714 may apply one or more transformations to the rendering matrix to adjust how the sound field is represented by the resulting binaural diegetic audio data 1715. The transformations may rotate or otherwise adjust the sound field to compensate for the movement defined by the motion sensing data 221.
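As a minimal sketch of the kind of low-complexity rotation the HOA renderer 1714 could apply, the Python example below counter-rotates a single first-order (ACN-ordered) ambisonic frame by the yaw reported in the motion sensing data and then applies a 2 x 4 binaural rendering matrix. The binaural matrix values are made-up placeholders standing in for one derived from HRTFs or the headphone capability information, and full HOA rotation (which uses per-order rotation matrices) is not shown.

```python
import numpy as np

def rotate_foa_yaw(foa, yaw_rad):
    """Counter-rotate a first-order ambisonic frame (ACN order: W, Y, Z, X)
    about the vertical axis so the scene stays world-locked while the head
    turns by yaw_rad. Sign conventions vary between implementations."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    w, y, z, x = foa
    x_rot = c * x + s * y
    y_rot = -s * x + c * y
    return np.array([w, y_rot, z, x_rot])

# Placeholder binaural rendering matrix (2 ears x 4 FOA channels); a real one
# would come from HRTF-derived filters specific to the headphones 220.
R_binaural = np.array([[0.7,  0.5, 0.1, 0.5],
                       [0.7, -0.5, 0.1, 0.5]])

foa_frame = np.array([1.0, 0.2, 0.0, 0.8])   # one sample of FOA audio
yaw_from_sensor = np.radians(30.0)           # e.g., from the motion sensing data 221
binaural = R_binaural @ rotate_foa_yaw(foa_frame, yaw_from_sensor)
print(binaural)                              # left/right ear samples
```

Because the rotation and the matrix multiply have fixed cost for a given order, this step is a natural candidate for running on the headphone device itself, as discussed below.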

The mixing unit 1716 may mix the binaural diegetic audio data 1715 generated by the HOA renderer 1714 with the non-diegetic audio data 1739 to generate mixed audio data in the spatial domain. The speakers of the headphones 220 can reproduce the sound field represented by the mixed audio data 1717. In this way, any higher order ambisonic (HOA) content of the bitstream is converted into an equivalent spatial domain representation using the virtual speaker positions. All signals rendered at the virtual speaker locations are then mixed by the CIRR and converted to an HOA representation. Finally, the CIRR renders the binaural signal.

As mentioned above, the HOA renderer 1714 may render the stream of scene-based audio data 1737 output by the HOA converter 1712 with a rendering matrix. In some examples, the HOA renderer 1714 determines the rendering matrix based on the orientation of the headphones 220 (e.g., a two-dimensional or three-dimensional spatial orientation of the headphones 220). For example, the headphones 220 may include one or more sensors. In this example, the headphones 220 may use signals from the sensors to determine the orientation of the headphones 220. In this example, the headphones 220 may generate information indicative of the orientation of the headphones 220. In this example, the HOA renderer 1714 may use the information indicative of the orientation of the headphones 220 to determine the rendering matrix. For example, the HOA renderer 1714 may select a rendering matrix from a predetermined set of rendering matrices. In other examples, the HOA renderer 1714 or another component may use signals from the sensors of the headphones 220 to determine the orientation of the headphones 220.

In some examples, the components of the audio playback system 218 are distributed among multiple devices. For example, the HOA renderer 1714 may be implemented in the headphones 220, while the remaining components of the audio playback system 218 shown in the example of fig. 6 are implemented in another device communicatively coupled to the headphones 220.

Distributing the components of the audio playback system 218 in this manner may have several advantages. For example, applying the rendering matrix to the scene-based audio data 1737 is a relatively simple computation that requires relatively little power compared to computations performed by other components of the audio playback system 218. Furthermore, when the HOA renderer 1714 is included in the headphones 220, the time taken to transmit information about the orientation of the headphones 220 is reduced. Thus, the audio playback system 218 as a whole may respond more quickly to changes in the orientation of the headphones 220.

When mixing and converting object-, channel-, and scene-based audio signals into the HOA format, a low-complexity sound field rotation operation can be implemented as close as possible to the binaural rendering point, potentially in a separate headphone device (e.g., the headphones 220), enabling low motion-to-sound latency and fixed complexity for a given HOA order (regardless of the number of channels and objects). Other rendering steps, with potentially higher latency and computational requirements, may be performed closer to the decoder operation and synchronized with the video (e.g., on a computer or mobile phone). These other rendering steps may be performed by either an internal renderer or an external renderer. If necessary, devices that implement the CIRR can further reduce complexity by reducing the ambisonic order used in the rendering operation.

Thus, in summary, the techniques of this disclosure may be implemented in one or more devices for rendering audio streams (e.g., headphones 220, smartphones, computers, or other devices). The device may include memory, battery, CPU, etc. The device may be configured to generate a set of speaker positions corresponding to an equivalent spatial domain representation for a desired rendering order based on available hardware resources. Further, the device may receive a scene-based audio stream and convert the stream to an equivalent spatial domain representation for a desired rendering order.

Further, the device may receive an object and/or channel based audio stream and convert the stream to an equivalent spatial domain representation for a desired rendering order. The device may mix equivalent spatial domain streams corresponding to the scene-based, object-based, and channel-based audio streams to generate equivalent spatial domain mixed streams. The device may render the equivalent spatial domain mixed streams to a binaural or speaker-based representation. In some examples, the desired rendering order is determined based on: a level of the scene-based audio stream and/or metadata information from the object-based audio stream. In some examples, the equivalent spatial domain representation is reconfigured according to information from the motion sensor.

Fig. 7 is a block diagram illustrating an example implementation of an audio playback system 218 using a general information reference renderer, in accordance with the techniques of this disclosure. In the example of fig. 7, the audio playback system 218 includes an external renderer API 1800, an object/channel-based external renderer 1802, an object/channel-based internal renderer 1804, a CMR stream decoder 1806, a truncation unit 1808, a mixing unit 1810, an HOA renderer 1811, a mixing unit 1812, a generic renderer API 1814, and a virtual speaker location unit 1816.

The CMR stream decoder 1806 receives and decodes a bitstream, such as the bitstream 216 (fig. 2). In some examples, the bitstream 216 is a CMR stream that includes encoded audio data and encoded video data for use in CMR. In some examples, the bitstream 216 does not include encoded video data. By decoding the bitstream 216, the CMR stream decoder 1806 may generate one or more streams of non-diegetic audio data 1739, one or more streams of channel-based audio data 1705 and/or object-based audio data 1707 and associated metadata, and scene-based audio data 1703 (e.g., HOA data).

In the example of fig. 7, the virtual speaker location unit 1816 determines a set of virtual speaker locations 1721 (e.g., Fliege points, which again may represent one example of locations that are uniformly distributed about a sphere in which the listener's head is centered on the sphere). The virtual speaker location unit 1816 may determine the set of virtual speaker locations 1721 in the same manner as described elsewhere in this disclosure with respect to the virtual speaker location unit 1720 (fig. 6).

In the example of fig. 7, the object/channel-based external renderer 1802 and/or the object/channel-based internal renderer 1804 render the object- or channel-based audio data 1705/1707 to a stream of scene-based audio data 1805 having a desired rendering order based on the virtual speaker positions 1721 determined by the virtual speaker position unit 1816. In this context, the "order" is in the sense of the order of higher order ambisonic coefficients. In examples using the object/channel-based external renderer 1802, the external renderer API 1800 may be used to send and receive information from the object/channel-based external renderer 1802 (e.g., by the CMR stream decoder 1806). The generic renderer API 1814 may be used (e.g., by the CMR stream decoder 1806) to send and receive information from the general information reference renderer component.

In the manner described above, the audio playback system 218 may determine one or more sets of virtual speaker locations 1721 based on the data regarding the capabilities of the headphones 220. Then, an object-based renderer (e.g., 1702, 1704, 1802, 1804) may render first audio data, which is object-based audio data, into spatial domain audio data (fig. 6) or scene-based audio data (fig. 7) based on the set of virtual speaker locations 1721.

The truncation unit 1808 may represent a unit configured to truncate the scene-based audio data 1703 output by the CMR stream decoder 1806 based on the virtual speaker positions 1721 determined by the virtual speaker position unit 1816. For example, the truncation unit 1808 may reduce the order of the HOA audio data 1703. For example, the truncation unit 1808 may convert third-order HOA audio data 1703 to 3H1P mixed-order ambisonics (MOA), as shown in the example of fig. 4, outputting truncated scene-based audio data 1809. In another example, the truncation unit 1808 may convert HOA coefficients of order 4 to HOA coefficients of order 3, or perform other similar conversions, to obtain the truncated scene-based audio data 1809. In some examples, based on the determined virtual speaker positions 1721, the truncation unit 1808 does not perform any truncation.
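A sketch of this order-reduction path is given below in Python. The ACN channel ordering and the particular horizontal/periphonic split used for the 3H1P-style subset are stated assumptions, since the disclosure specifies only that the order (or a mixed-order subset) is reduced based on the determined virtual speaker positions 1721.

```python
import numpy as np

def truncate_order(hoa, target_order):
    """Keep only the first (target_order + 1)**2 ACN-ordered HOA channels."""
    keep = (target_order + 1) ** 2
    return hoa[:keep, :]

def mixed_order_mask(horizontal_order, periphonic_order):
    """Boolean mask for a mixed-order (e.g., 3H1P-style) subset in ACN order:
    keep all channels up to periphonic_order plus the horizontal channels
    (|m| == n) up to horizontal_order."""
    mask = []
    for n in range(horizontal_order + 1):
        for m in range(-n, n + 1):
            mask.append(n <= periphonic_order or abs(m) == n)
    return np.array(mask)

hoa_4th = np.random.default_rng(1).standard_normal((25, 1024))  # 4th order, 1024 samples
hoa_3rd = truncate_order(hoa_4th, 3)                            # 16 channels remain
moa_3h1p = hoa_3rd[mixed_order_mask(3, 1)]                      # 8-channel 3H1P-style subset
print(hoa_3rd.shape, moa_3h1p.shape)
```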

The mixing unit 1810 may mix the scene-based audio data 1809 output by the truncation unit 1808 with the scene-based audio data 1805 output by the object/channel-based external renderer 1802 or the object/channel-based internal renderer 1804. For example, the mixing unit 1810 may add the corresponding coefficients of the scene-based audio data 1809 output by the truncation unit 1808 to the scene-based audio data 1805 output by the object/channel-based external renderer 1802 or the object/channel-based internal renderer 1804.
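Because the two scene-based streams being mixed may have been produced at different orders (the truncated stream 1809 versus the rendered stream 1805), a small Python sketch of coefficient-wise addition after aligning to the lower common order is shown below; the alignment step is an assumption for illustration, as the disclosure simply adds corresponding coefficients.

```python
import numpy as np

def mix_scene_based(hoa_a, hoa_b):
    """Add corresponding HOA coefficients, truncating to the smaller channel
    count when the two ACN-ordered streams disagree."""
    channels = min(hoa_a.shape[0], hoa_b.shape[0])
    return hoa_a[:channels, :] + hoa_b[:channels, :]

rng = np.random.default_rng(2)
truncated_1809 = rng.standard_normal((16, 1024))  # e.g., third-order truncated stream
rendered_1805 = rng.standard_normal((16, 1024))   # objects/channels rendered to HOA
mixed_1813 = mix_scene_based(truncated_1809, rendered_1805)
print(mixed_1813.shape)                           # (16, 1024)
```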

The audio playback system 218 of fig. 7 may perform at least one of the following: generating second preliminary scene-based audio data based on the set of one or more virtual speaker locations and the first preliminary scene-based audio data decoded from the bitstream; and generating third preliminary scene-based audio data based on the set of one or more virtual speaker locations and the channel or object-based audio data decoded from the bitstream. Further, in such an example, the audio playback system 218 of fig. 7 may generate final scene-based audio data based on at least one of the second preliminary scene-based audio data and the third preliminary scene-based audio data.

In one example, the audio playback system 218 may determine the set of one or more virtual speaker locations 1721 based on data regarding the capabilities of the headphones 220. In this example, the CMR stream decoder 1806 of the audio playback system 218 may decode first audio data from the bitstream 216, the first audio data being first scene-based audio data 1703. Further, in this example, the CMR stream decoder 1806 may decode second audio data from the bitstream 216, the second audio data being object-based or channel-based audio data 1705/1707.

Further, in this example, the internal renderer 1804 or the external renderer 1802 of the audio playback system 218 may render the second audio data into scene-based audio data 1805 based on the virtual speaker positions 1721. In this example, the mixing unit 1810 of the audio playback system 218 may mix the scene-based audio data 1805 with the truncated scene-based audio data 1809 to obtain mixed scene-based audio data 1813. In some examples, prior to mixing, the truncation unit 1808 may truncate the first scene-based audio data 1703 based on the virtual speaker positions 1721. The HOA renderer 1811 may apply a rendering matrix in the manner described above with respect to the HOA renderer 1714 shown in the example of fig. 6 to convert the mixed scene-based audio data 1813 into binaural audio data 1715.

The HOA renderer 1811 may apply a rendering matrix to the mixed scene-based audio data 1813 output by the mixing unit 1810. The HOA renderer 1811 may generate spatial domain binaural diegetic audio data by applying the rendering matrix to the stream of mixed scene-based audio data 1813. The mixing unit 1812 may mix the binaural diegetic audio data generated by the HOA renderer 1811 with the non-diegetic audio data to generate mixed audio data in the spatial domain. The headphones 220 may include speakers configured to reproduce the sound field represented by the mixed audio data. The HOA renderer 1811 may operate in the same manner as the HOA renderer 1714 of fig. 6. Further, in some examples, the HOA renderer 1811 may be implemented in the headphones 220, while one or more other components of the audio playback system 218 may be implemented in another device, such as a smartphone or computer.

Thus, in some examples, the techniques of this disclosure may be implemented in one or more devices for rendering audio streams. The device may include a memory, a battery, a CPU, and the like, and may be configured to generate a set of speaker locations corresponding to an equivalent spatial domain representation for a desired rendering order based on available hardware resources. The device may receive a scene-based audio stream and truncate the stream to a scene-based audio representation having the desired rendering order. Further, the device may receive an object- and/or channel-based audio stream and convert the stream to a scene-based audio representation having the desired rendering order.

Further, the device may mix the scene-based audio representations having the desired rendering order that correspond to the scene-based, object-based, and channel-based audio streams to generate a mixed scene-based audio representation having the desired rendering order. The device may render the mixed scene-based audio representation having the desired rendering order to a binaural or speaker-based representation. In some examples, the desired rendering order is determined based on a level of the scene-based audio stream and/or metadata information from the object-based audio stream. In some examples, the mixed scene-based audio representation having the desired rendering order is reconfigurable according to information from a motion sensor.

Fig. 8 is a block diagram illustrating an example implementation of an audio playback system 218 using a general information reference renderer that uses audio data captured by headphones for augmented reality, in accordance with the techniques of this disclosure. Fig. 8 is similar to fig. 6, except that the headphones 220 may include a microphone that captures sound in the environment of the user of the headphones 220. The headphones 220 may generate headphone-captured audio data 223 based on signals from the microphone.

In some examples, the audio data 223 captured by the headphones includes spatial domain audio data. In such an example, the object/channel-based external renderer 1702 or the object/channel-based internal renderer 1704 may generate modified spatial-domain audio data 223' (e.g., using VBAP) based on the determined set of virtual speaker locations 1721 and the headphone-captured audio data 223. The mixing unit 1710 may mix the modified spatial-domain audio data 223' with the spatial-domain audio data 1733 generated by the object/channel-based external renderer 1702 or the object/channel-based internal renderer 1704 based on the channel/object-based audio data decoded from the bitstream, and/or with the spatial-domain audio data 1731 output by the HOA-to-spatial-domain conversion unit 1708. The spatial domain audio data generated by the mixing unit 1710 may then be processed in the manner described above with respect to fig. 6.

In some examples, the headphone-captured audio data 223 includes scene-based audio data. In such an example, the HOA-to-spatial-domain conversion unit 1708 may generate modified spatial-domain audio data 223' based on the determined set of virtual speaker locations 1721 and the headphone-captured audio data 223. The mixing unit 1710 may mix the modified spatial-domain audio data 223' with the spatial-domain audio data 1733 generated by the object/channel-based external renderer 1702 or the object/channel-based internal renderer 1704 based on the channel-based or object-based audio data decoded from the bitstream, and/or with the spatial-domain audio data 1731 output by the HOA-to-spatial-domain conversion unit 1708 based on the scene-based audio data 1703 decoded from the bitstream 216.

The spatial domain audio data generated by the mixing unit 1710 may then be processed in the manner described with respect to fig. 8. For example, the HOA converter 1712 may convert the audio data output by the mixing unit 1710 into scene-based audio data. Thus, in accordance with the techniques of this disclosure, the audio playback system 218 may generate scene-based audio data based on the headphone-captured audio data 223, which includes audio data representing sounds detected by a computer-mediated reality (CMR) headset (e.g., the headphones 220), and based on data decoded from the bitstream 216.

Similar examples may be provided with respect to the object/channel based external renderer 1802, the object/channel based internal renderer 1804, and the truncation unit 1808 of fig. 7 for receiving headphone-captured audio data.

FIG. 9 is a flowchart illustrating example operations of the audio playback system illustrated in the example of FIG. 6 in performing various aspects of the scalable unified rendering techniques. The CMR stream decoder 1706 may receive the bitstream 216 and decode, from the bitstream 216, first audio data 1703 within a given time frame and second audio data 1705 within the same time frame (1900). The CMR stream decoder 1706 may output the first audio data 1703 to the HOA-to-spatial-domain conversion unit 1708 and the second audio data 1705 to the object/channel-based renderer 1704.

The HOA-to-spatial-domain conversion unit 1708 may render the first audio data 1703 into first spatial-domain audio data 1731 for playback by the virtual speakers at the set of virtual speaker locations 1721, as described above (1902). The HOA-to-spatial-domain conversion unit 1708 may output the first spatial-domain audio data 1731 to the mixing unit 1710. As described above, the object/channel-based renderer 1704 may render the second audio data 1705 into second spatial domain audio data 1733 for playback by the virtual speakers at the set of virtual speaker locations 1721 (1904). The object/channel-based renderer 1704 may output the second spatial domain audio data 1733 to the mixing unit 1710.

The mixing unit 1710 may mix the first spatial-domain audio data 1731 and the second spatial-domain audio data 1733 to obtain mixed spatial-domain audio data 1735 (1906). The mixing unit 1710 may output the mixed spatial-domain audio data 1735 to the HOA converter 1712, which may convert the mixed spatial-domain audio data 1735 to scene-based audio data 1737 (1908).
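The four flowchart steps (1900 through 1908) can be summarized with the matrix operations already introduced. The Python sketch below uses a generic invertible mode matrix and a generic object-panning gain matrix as stand-ins for the actual decoder and renderers, so it illustrates only the data flow between the steps, not the decoding or panning details.

```python
import numpy as np

rng = np.random.default_rng(3)
order, samples = 3, 480
num_virtual = (order + 1) ** 2                  # virtual speakers at positions 1721

# Stand-ins: psi maps virtual-speaker feeds to HOA; gains pans 2 objects to speakers.
psi = rng.standard_normal((num_virtual, num_virtual))
gains = rng.uniform(0.0, 1.0, (num_virtual, 2))

# (1900) decoded first (scene-based) and second (object-based) audio data
scene_hoa = rng.standard_normal((num_virtual, samples))
objects = rng.standard_normal((2, samples))

spatial_1731 = np.linalg.solve(psi, scene_hoa)  # (1902) HOA -> spatial domain
spatial_1733 = gains @ objects                  # (1904) objects -> spatial domain
spatial_1735 = spatial_1731 + spatial_1733      # (1906) mix at the virtual speakers
scene_1737 = psi @ spatial_1735                 # (1908) spatial domain -> HOA
print(scene_1737.shape)                         # ((order + 1)^2, samples)
```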

In some examples, the HOA converter 1712 may send the scene-based audio data 1737 to the wireless headphones 220, which may incorporate the HOA renderer 1714 to facilitate adapting the rendering matrix in near real time based on the motion sensing data 221, as described in more detail above. In other examples, the audio playback system 218 may include the HOA renderer 1714 and perform the adaptation of the rendering matrix noted above based on the motion-sensing data 221.

It is to be understood that, depending on the example, some acts or events of any of the techniques described herein can be performed in a different order, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the techniques). Further, in some examples, acts or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or a combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media corresponding to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, such as according to a communication protocol. In this manner, the computer-readable medium may generally correspond to (1) a tangible computer-readable storage medium that is non-transitory, or (2) a communication medium such as a signal or carrier wave. The data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve the instructions, code and/or data structures for implementation of the techniques described in this disclosure. The computer program product may include a computer-readable medium.

By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor," as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Further, in some aspects, the functions described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Furthermore, the techniques may be fully implemented in one or more circuits or logic elements.

The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses including a wireless headset, an Integrated Circuit (IC), or a collection of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require implementation by different hardware units. Instead, as described above, the various units may be incorporated in a codec hardware unit or provided by a collection of interoperating hardware units including one or more memories as described above, along with appropriate software and/or firmware.

Various examples have been described. These examples, as well as other examples, are within the scope of the following claims.
