Scalable unified audio renderer
Reader's note: This technique, a scalable unified audio renderer, was created by A. G. P. Schevciw and N. G. Peters on 2019-02-01. Abstract: A device including an audio decoder, a memory, and a processor may be configured to perform various aspects of the techniques. The audio decoder may decode first audio data and second audio data from a bitstream. The memory may store the first audio data and the second audio data. The processor may render the first audio data into first spatial-domain audio data for playback by virtual speakers at a set of virtual speaker locations, and render the second audio data into second spatial-domain audio data for playback by the virtual speakers at the set of virtual speaker locations. The processor may also mix the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data, and convert the mixed spatial-domain audio data to scene-based audio data.
1. An apparatus configured to support unified audio rendering, the apparatus comprising:
an audio decoder configured to decode first audio data within a time frame and second audio data within the time frame from a bitstream;
a memory configured to store the first audio data and the second audio data; and
one or more processors configured to:
render the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;
render the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;
mix the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and
convert the mixed spatial domain audio data to scene-based audio data.
2. The apparatus of claim 1, wherein the one or more processors are further configured to determine the set of virtual speaker locations at which the virtual speakers are located based on headphone capability data representing one or more capabilities of headphones and prior to rendering the first audio data and the second audio data.
3. The apparatus of claim 1,
wherein the first audio data comprises one of first scene-based audio data, first channel-based audio data, or first object-based audio data, and
wherein the second audio data comprises one of second scene-based audio data, second channel-based audio data, or second object-based audio data.
4. The apparatus of claim 1,
wherein the one or more processors are configured to convert the mixed spatial domain audio data from the spatial domain to a spherical harmonic domain, and
wherein the scene-based audio data comprises higher order ambisonic audio data defined in the spherical harmonic domain as a set of one or more higher order ambisonic coefficients corresponding to a spherical basis function.
5. The apparatus of claim 1, wherein the set of virtual speaker locations comprises a set of virtual speaker locations that are uniformly distributed with respect to a sphere centered on a listener's head.
6. The apparatus of claim 1, wherein the set of virtual speaker locations comprises Fliege points.
7. The apparatus of claim 1,
wherein the one or more processors are configured to render the first audio data based on headphone-captured audio data to obtain the first spatial-domain audio data, wherein the headphone-captured audio data comprises audio data representative of sounds detected by headphones, and
wherein the one or more processors are configured to render the second audio data based on the audio data captured by the headphones to obtain the second spatial-domain audio data.
8. The apparatus of claim 1, further comprising an interface configured to send the scene-based audio data and data indicative of the set of virtual speaker locations to a headset.
9. The apparatus of claim 8, wherein the headset comprises a wireless headset.
10. The apparatus of claim 8, wherein the headset comprises a computer-mediated reality headset that supports one or more of virtual reality, augmented reality, and mixed reality.
11. The apparatus of claim 1,
wherein the audio decoder is further configured to decode third audio data within the time frame from the bitstream,
wherein the memory is further configured to store the third audio data,
wherein the one or more processors are further configured to render the third audio data into third spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations, and
wherein the one or more processors are configured to mix the first spatial-domain audio data, the second spatial-domain audio data, and the third spatial-domain audio data to obtain the mixed spatial-domain audio data.
12. A method of supporting unified audio rendering, the method comprising:
decoding, by a computing device and from a bitstream, first audio data within a time frame and second audio data within the time frame;
rendering, by the computing device, the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;
rendering, by the computing device, the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;
mixing, by the computing device, the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and
converting, by the computing device, the mixed spatial domain audio data to scene-based audio data.
13. The method of claim 12, further comprising determining the set of virtual speaker locations at which the virtual speakers are located based on headphone capability data representing one or more capabilities of headphones and prior to rendering the first audio data and the second audio data.
14. The method of claim 12,
wherein the first audio data comprises one of first scene-based audio data, first channel-based audio data, or first object-based audio data, and
wherein the second audio data comprises one of second scene-based audio data, second channel-based audio data, or second object-based audio data.
15. The method of claim 12,
wherein converting the mixed spatial domain audio data comprises converting the mixed spatial domain audio data from the spatial domain to a spherical harmonic domain, and
wherein the scene-based audio data comprises higher order ambisonic audio data defined in the spherical harmonic domain as a set of one or more higher order ambisonic coefficients corresponding to a spherical basis function.
16. The method of claim 12, wherein the set of virtual speaker locations comprises a set of virtual speaker locations that are uniformly distributed with respect to a sphere centered on a listener's head.
17. The method of claim 12, wherein the set of virtual speaker locations comprises Fliege points.
18. The method of claim 12,
wherein rendering the first audio data comprises rendering the first audio data based on headphone-captured audio data to obtain the first spatial-domain audio data, wherein the headphone-captured audio data comprises audio data representing sounds detected by headphones, and
wherein rendering the second audio data comprises rendering the second audio data based on the audio data captured by the headphones to obtain the second spatial-domain audio data.
19. The method of claim 12, further comprising sending the scene-based audio data and data indicative of the set of virtual speaker locations to a headset.
20. The method of claim 19, wherein the headset comprises a wireless headset.
21. The method of claim 19, wherein the headset comprises a computer-mediated reality headset that supports one or more of virtual reality, augmented reality, and mixed reality.
22. The method of claim 12, further comprising:
decoding third audio data within the time frame from the bitstream; and
rendering the third audio data into third spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations,
wherein mixing the first spatial-domain audio data and the second spatial-domain audio data comprises mixing the first spatial-domain audio data, the second spatial-domain audio data, and the third spatial-domain audio data to obtain the mixed spatial-domain audio data.
23. An apparatus configured to support unified audio rendering, the apparatus comprising:
means for decoding, from a bitstream, first audio data within a time frame and second audio data within the time frame;
means for rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;
means for rendering the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;
means for mixing the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and
means for converting the mixed spatial domain audio data to scene-based audio data.
24. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to:
decode first audio data within a time frame and second audio data within the time frame from a bitstream;
render the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations;
render the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations;
mix the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and
convert the mixed spatial domain audio data to scene-based audio data.
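The pipeline recited in the claims above (render each decoded stream to a shared set of virtual speakers, mix in the spatial domain, convert back to scene-based audio) can be sketched numerically. The following is a minimal illustration, not the claimed implementation: the tetrahedral virtual speaker layout, the first-order ambisonic basis, and the pseudo-inverse conversion are all assumptions made for the example.

```python
import numpy as np

def first_order_basis(azimuths, elevations):
    """Real first-order ambisonic basis [W, Y, Z, X] evaluated at the
    given directions (radians); one row per virtual speaker."""
    az, el = np.asarray(azimuths), np.asarray(elevations)
    x = np.cos(el) * np.cos(az)
    y = np.cos(el) * np.sin(az)
    z = np.sin(el)
    return np.stack([np.ones_like(az), y, z, x], axis=1)

# Hypothetical virtual speaker layout: four tetrahedral directions.
az = np.radians([45.0, 135.0, 225.0, 315.0])
el = np.radians([35.26, -35.26, 35.26, -35.26])
basis = first_order_basis(az, el)       # (4 speakers x 4 SH channels)
to_scene = np.linalg.pinv(basis)        # speaker feeds -> scene-based

rng = np.random.default_rng(0)
first_audio = rng.standard_normal((4, 128))    # first decoded stream (SH)
second_audio = rng.standard_normal((4, 128))   # second decoded stream (SH)

# Render each stream to the virtual speakers, mix, convert back.
first_spatial = basis @ first_audio
second_spatial = basis @ second_audio
mixed_spatial = first_spatial + second_spatial
scene_based = to_scene @ mixed_spatial

# For this square, well-conditioned layout the round trip is exact.
assert np.allclose(scene_based, first_audio + second_audio)
```

Mixing in the common spatial (virtual speaker) domain is what lets streams of different formats be combined before a single conversion back to the scene-based representation.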
Technical Field
The present disclosure relates to processing of media data, such as audio data.
Background
Higher Order Ambisonic (HOA) signals, often represented by a plurality of Spherical Harmonic Coefficients (SHC) or other hierarchical elements, are three-dimensional representations of a sound field. The HOA or SHC representation may represent the sound field in a manner that does not rely on the local speaker geometry used to play back a multi-channel audio signal rendered from the SHC signal. The SHC signal may also facilitate backward compatibility, because the SHC signal may be rendered to well-known and widely adopted multi-channel formats, such as the 5.1 audio channel format or the 7.1 audio channel format. The SHC representation may thus enable a better representation of the sound field that also accommodates backward compatibility.
Disclosure of Invention
The present disclosure relates generally to auditory aspects of a user experience of computer-mediated reality systems, including Virtual Reality (VR), Mixed Reality (MR), Augmented Reality (AR), computer vision, and imaging systems.
In one example, various aspects of the technology are directed to a device configured to support unified audio rendering, the device comprising: an audio decoder configured to decode, from a bitstream, first audio data within a time frame and second audio data within the time frame; a memory configured to store the first audio data and the second audio data; and one or more processors configured to: render the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; render the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; mix the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and convert the mixed spatial domain audio data to scene-based audio data.
In another example, various aspects of the technology are directed to a method of supporting unified audio rendering, the method comprising: decoding, by a computing device and from a bitstream, first audio data within a time frame and second audio data within the time frame; rendering, by the computing device, the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; rendering, by the computing device, the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; mixing, by the computing device, the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and converting, by the computing device, the mixed spatial domain audio data to scene-based audio data.
In another example, various aspects of the technology are directed to a device configured to support unified audio rendering, the device comprising: means for decoding, from a bitstream, first audio data within a time frame and second audio data within the time frame; means for rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; means for rendering the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; means for mixing the first spatial-domain audio data and the second spatial-domain audio data to obtain mixed spatial-domain audio data; and means for converting the mixed spatial domain audio data to scene-based audio data.
In another example, various aspects of the technology are directed to a non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: decoding first audio data within a time frame and second audio data within the time frame from a bitstream; rendering the first audio data into first spatial domain audio data for playback by virtual speakers at a set of virtual speaker locations; rendering the second audio data into second spatial domain audio data for playback by the virtual speakers at the set of virtual speaker locations; mixing the first spatial domain audio data and the second spatial domain audio data to obtain mixed spatial domain audio data; and converting the mixed spatial domain audio data to scene-based audio data.
The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 is a schematic diagram showing spherical harmonic basis functions of various orders and sub-orders.
Fig. 2 is a schematic diagram illustrating a system that may perform various aspects of the techniques described in this disclosure.
FIG. 3 is a schematic diagram illustrating aspects of the non-uniform spatial resolution distribution of a mixed order ambisonic representation of a sound field.
Fig. 4 is a schematic diagram illustrating the difference between a full third order HOA representation of a sound field and a mixed order ambisonic representation of the same sound field, in which the horizontal region has a higher spatial resolution than the remaining regions.
Fig. 5 is a schematic diagram illustrating an example of a headset that may be used by one or more computer-mediated reality systems of the present disclosure.
Fig. 6 is a block diagram illustrating an example implementation of an audio playback system using a general information reference renderer, in accordance with the techniques of this disclosure.
Fig. 7 is a block diagram illustrating another example implementation of an audio playback system using a general information reference renderer, in accordance with the techniques of this disclosure.
Fig. 8 is a block diagram illustrating an example implementation of an audio playback system using a general information reference renderer that uses audio data captured by headphones for augmented reality, in accordance with the techniques of this disclosure.
FIG. 9 is a flow diagram illustrating example operations of the audio playback system shown in the example of FIG. 7 in performing aspects of the scalable unified rendering technique.
Detailed Description
In general, this disclosure is directed to techniques for playback of a sound field representation during a user experience of a computer-mediated reality system. Computer-mediated reality (CMR) technologies include various types of content generation and content consumption systems, such as Virtual Reality (VR), Mixed Reality (MR), Augmented Reality (AR), computer vision, and imaging systems. While several aspects of the disclosure are described with respect to a virtual reality system by way of example for ease of discussion, it will be understood that the techniques of the disclosure are also applicable to other types of computer-mediated reality technologies, such as mixed reality, augmented reality, computer vision, and imaging systems.
The virtual reality system may utilize field of view (FoV) information of the user to obtain video data associated with the FoV of the user. As such, the virtual reality system may obtain video data that partially or fully surrounds the viewer's head, e.g., for a virtual reality application or other similar scenario in which the user may move his or her head to see different portions of the image canvas that are not visible when pointing focus at a single point of the canvas. In particular, these techniques may be applied when a viewer points visual focus at a particular portion of a large canvas (such as a three-dimensional canvas that partially or fully encloses the viewer's head). The video data surrounding the user's head may be provided using a combination of screens (e.g., a set of screens surrounding the user) or via a head mounted display.
Examples of hardware capable of providing a head mounted display include VR headsets, MR headsets, AR headsets, and various other hardware. Sensory data and/or test data may be used to determine the FoV of the user. As one example of sensory data, one or more angles associated with the positioning of the VR headset, which form the "steering angle" of the headset, may indicate the FoV of the user. As another example of sensory data, the user's gaze angle (e.g., sensed via iris detection) may indicate the user's FoV. Video data and corresponding audio data may be encoded and prepared (e.g., for storage and/or transmission) using a feature set that includes the FoV information.
The techniques of this disclosure may be used in connection with techniques related to the transfer (e.g., transmission and/or reception) of media data, such as video data and audio data, encoded at various quality levels for the different regions at which the media data is to be played back. For example, the techniques of this disclosure may be used by a client device that includes a panoramic display (e.g., a display that partially or fully surrounds a viewer's head) and surround sound speakers. Typically, the display is configured such that the user's visual focus is directed at only a portion of the display at a given time. The systems of this disclosure may render and output audio data via the surround sound speakers such that an audio object associated with the current region of focus on the display is output with greater directionality than the remaining audio objects.
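As a purely illustrative sketch of emphasizing audio objects near the viewer's region of focus (the weighting function and the 60-degree focus width below are assumptions for the example, not part of this disclosure):

```python
import math

def angular_distance(az1, el1, az2, el2):
    """Great-circle angle (radians) between two directions given as
    (azimuth, elevation) pairs in radians."""
    s = (math.sin(el1) * math.sin(el2)
         + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.acos(max(-1.0, min(1.0, s)))

def focus_weight(obj_dir, focus_dir, width=math.radians(60)):
    """Hypothetical emphasis weight: 1.0 inside the focus region,
    falling off smoothly (floored at 0.5) outside it."""
    d = angular_distance(*obj_dir, *focus_dir)
    if d <= width:
        return 1.0
    return max(0.5, 1.0 - (d - width) / math.pi)

# An object straight ahead of the viewer's focus gets full emphasis;
# an object directly behind gets the floor weight.
assert focus_weight((0.0, 0.0), (0.0, 0.0)) == 1.0
assert focus_weight((math.pi, 0.0), (0.0, 0.0)) == 0.5
```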
There are various "surround sound" channel-based audio formats on the market. They range, for example, from the 5.1 home theater system (which has been the most successful in making inroads into the living room beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai, or the Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, without expending effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has promulgated a standard that allows a sound field to be represented using a hierarchical set of elements (e.g., Higher Order Ambisonic (HOA) coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configurations, whether in locations defined by various standards or in irregular locations.
MPEG promulgated a standard, MPEG-H 3D Audio, formally entitled "Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio," set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC DIS 23008-3, dated July 25, 2014. MPEG also promulgated a second edition of the 3D Audio standard, entitled "Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio," set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC 23008-3:201x (E), dated October 12, 2016. References to the "3D Audio standard" in this disclosure may refer to one or both of the above standards.
As described above, one example of a hierarchical set of elements is a set of Spherical Harmonic Coefficients (SHC). The following expression demonstrates a description or representation of a sound field using SHC:

\[
p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty} \left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r) \right] e^{j\omega t}
\]

The expression shows that the pressure p_i at any point {r_r, θ_r, φ_r} of the sound field, at time t, can be represented uniquely by the SHC, A_n^m(k). Here, k = ω/c, c is the speed of sound (~343 m/s), {r_r, θ_r, φ_r} is a point of reference (or observation point), j_n(·) is the spherical Bessel function of order n, and Y_n^m(θ_r, φ_r) are the spherical harmonic basis functions of order n and sub-order m (which may also be referred to as spherical basis functions). It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., S(ω, r_r, θ_r, φ_r)), which can be approximated through various time-frequency transformations, such as the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.

Fig. 1 is a diagram illustrating the spherical harmonic basis functions from the zero order (n = 0) to the fourth order (n = 4). As can be seen, for each order there is an expansion of sub-orders m, which are shown in the example of fig. 1 but not explicitly noted, for ease of illustration.
The SHC A_n^m(k) can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, derived from channel-based or object-based descriptions of the sound field. The SHC represent scene-based audio, in which the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving (1 + 4)^2 (25, and hence fourth order) coefficients may be used. As described above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics," J. Audio Eng. Soc., Vol. 53, No. 11, November 2005, pp. 1004-1025.
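The radial term j_n(k r) in the bracketed expansion above can be evaluated numerically. Below is a minimal pure-Python sketch; the frequency, radius, and recurrence-based evaluation are illustrative choices (valid here because the orders used stay below k·r):

```python
import math

def spherical_jn(n: int, x: float) -> float:
    """Spherical Bessel function j_n(x) via the upward recurrence
    j_{m+1}(x) = ((2m + 1) / x) j_m(x) - j_{m-1}(x).
    Adequate for the small orders (n < x) used here."""
    if x == 0.0:
        return 1.0 if n == 0 else 0.0
    j_prev = math.sin(x) / x                       # j_0
    if n == 0:
        return j_prev
    j_curr = math.sin(x) / x**2 - math.cos(x) / x  # j_1
    for m in range(1, n):
        j_prev, j_curr = j_curr, (2 * m + 1) / x * j_curr - j_prev
    return j_curr

c = 343.0                     # speed of sound, m/s
k = 2 * math.pi * 1000.0 / c  # wavenumber k = omega / c at 1 kHz
r = 0.5                       # observation radius, m

# Radial weights j_n(k r) of the SHC expansion for orders n = 0..4.
radial = [spherical_jn(n, k * r) for n in range(5)]

# Cross-check against the closed form j_0(x) = sin(x) / x.
assert math.isclose(radial[0], math.sin(k * r) / (k * r))
```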
To illustrate how the SHC may be derived from an object-based description, consider the following equation. The coefficients A_n^m(k) for the sound field corresponding to an individual audio object may be expressed as:

\[
A_n^m(k) = g(\omega) \left( -4\pi i k \right) h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s),
\]

where i is \(\sqrt{-1}\), h_n^{(2)}(·) is the spherical Hankel function (of the second kind) of order n, and {r_s, θ_s, φ_s} is the location of the object. Knowing the object source energy g(ω) as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows the conversion of each PCM object and its corresponding location into the SHC A_n^m(k). Further, it can be shown (since the above is a linear and orthogonal decomposition) that the A_n^m(k) coefficients for each object are additive. In this manner, a multitude of PCM objects can be represented by the A_n^m(k) coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the sound field (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall sound field in the vicinity of the observation point {r_r, θ_r, φ_r}. The remaining figures are described below in the context of object-based and SHC-based audio coding.

Fig. 2 is a schematic diagram illustrating a system 10 that may perform various aspects of the techniques described in this disclosure. As shown in the example of fig. 2, system 10 includes a source system 200 and a content consumer system 202. Although described in the context of source system 200 and content consumer system 202, the techniques may be implemented in other contexts. Furthermore, source system 200 may represent any form of computing device capable of generating a bitstream compatible with the techniques of this disclosure. Likewise, content consumer system 202 may represent any form of computing device capable of implementing the techniques of this disclosure.
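The order-zero case of the point-source equation above can be checked numerically, since the spherical Hankel function of the second kind has the closed form h_0^(2)(x) = i e^{-ix} / x and Y_0^0 = 1 / (2√π). This is a minimal sketch; the source energies, frequency, and radii are illustrative values, not from the disclosure:

```python
import cmath
import math

def h2_0(x: float) -> complex:
    """Spherical Hankel function of the second kind, order 0:
    h_0^(2)(x) = i * exp(-i x) / x."""
    return 1j * cmath.exp(-1j * x) / x

def a00(g: float, k: float, r_s: float) -> complex:
    """Order-0 SHC of a point source with energy g at radius r_s:
    A_0^0(k) = g * (-4 pi i k) * h_0^(2)(k r_s) * Y_0^0."""
    y00 = 1.0 / (2.0 * math.sqrt(math.pi))
    return g * (-4j * math.pi * k) * h2_0(k * r_s) * y00

k = 2 * math.pi * 1000.0 / 343.0   # wavenumber at 1 kHz

# Two hypothetical PCM objects: the coefficients of the combined sound
# field are simply the sum of the per-object coefficients (additivity).
combined = a00(0.7, k, 1.0) + a00(0.3, k, 2.0)

# Linearity check: splitting one source into two halves at the same
# location leaves the resulting coefficient unchanged.
assert cmath.isclose(a00(0.5, k, 1.0) + a00(0.5, k, 1.0), a00(1.0, k, 1.0))
```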
The source system 200 may be operated by an entertainment company or other entity that may generate multi-channel audio content for consumption by an operator of a content consumption device, such as the content consumer system 202. In many VR scenarios, the source system 200 generates audio content along with video content. In the example of fig. 2, the source system 200 includes a content capture device 204, a bitstream generation unit 206, a microphone 208, and a camera 210.
The content capture device 204 may be configured to connect with the microphone 208 or otherwise communicate with the microphone 208. The microphone 208 may represent an Eigenmike (spherical microphone) or another type of 3D audio microphone capable of capturing and representing the sound field as HOA coefficients 11. In some examples, the content capture device 204 includes an integrated microphone 208 inside the housing of the content capture device 204. In some examples, the content capture device 204 may connect with the microphone 208 wirelessly or via a wired connection. The microphone 208 generates audio data 212. In some examples, the audio data is scene-based audio data (e.g., HOA coefficients), channel-based audio data, object-based audio data, or another type of audio data. In other examples, the content capture device 204 may process the audio data 212 after receiving the audio data 212 via some type of storage (e.g., removable storage). Various combinations of the content capture device 204 and the microphone 208 are possible, with several examples of such combinations discussed above for purposes of illustration. The camera 210 may be configured to capture video data 214 and provide the captured raw video data 214 to the content capture device 204.
The content capture device 204 may be configured to connect with the bitstream generation unit 206 or otherwise communicate with the bitstream generation unit 206. The bitstream generation unit 206 may comprise any type of hardware device capable of connecting with the content capture device 204. The bitstream generation unit 206 may use the audio data 212 to generate a bitstream 216, the bitstream 216 including one or more representations of a sound field defined by the audio data 212. In some examples, the bitstream 216 may also include a representation of the video data 214.
The bitstream generation unit 206 may generate the representation of the audio data 212 in various ways. For example, the bitstream generation unit 206 may represent the audio data 212 in one or more of a scene-based audio format, a channel-based audio format, and/or an object-based audio format.
In some examples in which the bitstream generation unit 206 represents audio data in a scene-based format, the bitstream generation unit 206 uses an encoding scheme for the ambisonic representation of a sound field referred to as Mixed Order Ambisonics (MOA). To generate a particular MOA representation of the sound field, the bitstream generation unit 206 may generate a partial subset of the full set of HOA coefficients. For example, each MOA representation generated by the bitstream generation unit 206 may provide high precision with respect to some regions of the sound field, but less precision in other regions. In one example, an MOA representation of the sound field may include eight (8) uncompressed HOA coefficients, whereas a third order HOA representation of the same sound field may include sixteen (16) uncompressed HOA coefficients. As such, each MOA representation generated as a partial subset of the HOA coefficients may be less memory intensive and less bandwidth intensive (if and when transmitted as part of the bitstream 216 over the illustrated transmission channel) than the corresponding third order HOA representation of the same sound field generated from the full set of HOA coefficients.
In some examples, the content capture device 204 may be configured to wirelessly communicate with the bitstream generation unit 206. In some examples, the content capture device 204 may communicate with the bitstream generation unit 206 via one or both of a wireless connection or a wired connection. Via the connection between the content capture device 204 and the bitstream generation unit 206, the content capture device 204 may provide content in various content forms, which are described herein as being part of the HOA coefficients 11 for purposes of discussion.
In some examples, the content capture device 204 may utilize aspects of the bitstream generation unit 206 (in terms of the hardware or software capabilities of the bitstream generation unit 206). For example, the bitstream generation unit 206 may include dedicated hardware configured to perform psychoacoustic audio encoding (or specialized software that, when executed, causes one or more processors to perform psychoacoustic audio encoding), such as the unified speech and audio coder denoted "USAC" set forth by the Moving Picture Experts Group (MPEG), or the MPEG-H 3D Audio coding standard. The content capture device 204 may not include hardware or specialized software dedicated to psychoacoustic audio encoding, and may instead provide the audio aspects of the audio content 212 in a form that has not been psychoacoustically encoded (which is another way of referring to the audio data 212). The bitstream generation unit 206 may assist in the capture of the content 212 by performing, at least in part, psychoacoustic audio coding with respect to the audio aspects of the audio content 212.
The bitstream generation unit 206 may assist in content capture and transmission by generating one or more bitstreams based at least in part on audio content (e.g., an MOA representation and/or a third order HOA representation) generated from the audio data 212. The bitstream 216 may include a compressed version of the audio data 212 (and/or a partial subset thereof used to form an MOA representation of a sound field) as well as any other different type of content, such as compressed versions of video data, image data, and/or text data. As an example, the bitstream generation unit 206 may generate the bitstream 216 for transmission across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 216 may represent an encoded version of the audio data 212 (and/or a partial subset thereof used to form an MOA representation of a sound field), and may include a main bitstream and additional side bitstreams, which may be referred to as side channel information.
FIG. 3 is a schematic diagram illustrating aspects of the non-uniform spatial resolution distribution of an MOA representation of a soundfield. Whereas a full HOA representation has consistently high spatial resolution in every direction over the sphere, the MOA representation of the same soundfield has variable spatial resolution. In many cases, as in the example of FIG. 3, the MOA representation of the soundfield includes high resolution spatial audio data only in the horizontal region, and lower resolution spatial audio data in the remaining regions of the soundfield. In the example shown in FIG. 3, the MOA representation of the soundfield includes a third order representation of the horizontal region (marked by the white band) and a first order representation of all other regions (shown by the black shaded portions). That is, according to the MOA representation of FIG. 3, as soon as a sound source leaves the equator of the soundfield, the clarity and extent of high quality reconstruction for audio objects emanating from that source rapidly decrease.
FIG. 4 is a schematic diagram showing the difference between a full third order HOA representation of a soundfield and an MOA representation of the same soundfield, in which the horizontal region has a higher spatial resolution than the remaining regions. As shown in FIG. 4, the full third order HOA representation includes sixteen (16) uncompressed HOA coefficients to represent the soundfield. The consistent spatial resolution of the complete HOA representation is shown as white (or blank) across the entire 3-axis diagram for the complete third order HOA representation.
In contrast, for the same soundfield, the MOA representation includes eight (8) uncompressed coefficients (or coefficient channels). Furthermore, in contrast to the consistent spatial resolution exhibited by the third order HOA representation, the MOA representation exhibits an inconsistent spatial resolution, in which high spatial resolution occurs along the equator of the 3D soundfield, while the remaining regions of the soundfield are represented at a lower spatial resolution. The MOA representation shown in FIG. 4 is described as a "3H1P" MOA representation, indicating that it comprises a third order representation of the horizontal region and a first order representation of the remaining regions of the soundfield.
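The channel counts above (sixteen for the full third order representation, eight for 3H1P) follow from a simple count: a complete order-P representation contributes (P+1)^2 channels, and each horizontal-only order above P contributes the two degree m = ±n components. A minimal sketch (the helper names are illustrative, not from the disclosure):

```python
def full_hoa_channels(order: int) -> int:
    """Number of coefficient channels in a full HOA representation of a given order."""
    return (order + 1) ** 2

def moa_channels(horizontal_order: int, full_order: int) -> int:
    """Channels in a mixed order ("xHyP") representation: a complete
    representation of order P plus the horizontal-only components
    (degree m = +/-n) of each order above P up to H."""
    full = full_hoa_channels(full_order)                    # (P+1)^2 channels
    horizontal_extra = 2 * (horizontal_order - full_order)  # two per extra order
    return full + horizontal_extra

print(full_hoa_channels(3))  # 16 channels for the full third order representation
print(moa_channels(3, 1))    # 8 channels for the 3H1P representation of FIG. 4
```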
Although described with respect to captured content 212/214, various aspects of the techniques described in this disclosure may be applied to generated or rendered content, such as is common in video games, where audio data 212 is retrieved from memory and/or storage rather than captured, and video data 214 is programmatically generated by hardware, such as a Graphics Processing Unit (GPU). In instances in which the source system 200 obtains the content 212/214 instead of fully capturing the content 212/214, the source system 200 may represent a computer (e.g., a video game system, a laptop computer, a desktop computer, etc.) configured to generate the audio data 212 and the video data 214.
Regardless, the content consumer system 202 may be operated by an individual, and may represent a VR client device in many examples. Content consumer system 202 may include an
Although shown in FIG. 2 as being sent directly to the content consumer system 202, the source system 200 may output the bitstream 216 to an intermediary device interposed between the source system 200 and the content consumer system 202. The intermediary device may store the bitstream 216 for subsequent delivery to the content consumer system 202, which may request the bitstream 216. The intermediary device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smartphone, or any other device capable of storing the bitstream 216 for subsequent retrieval by an audio decoder. The intermediary device may reside in a content delivery network capable of streaming the bitstream 216 (possibly in conjunction with sending a corresponding video data stream) to a user, such as the content consumer system 202, requesting the bitstream 216.
Alternatively, the source system 200 may store the bitstream 216 to a storage medium, such as a compact disc, digital video disc, high definition video disc, or other storage medium, most of which are readable by a computer and thus may be referred to as a computer-readable storage medium or a non-transitory computer-readable storage medium. In this context, a transmission channel may refer to a channel through which content stored to the medium is transmitted (and may include retail stores and other store-based delivery mechanisms). Regardless, the techniques of this disclosure should not be limited in this regard to the example of fig. 2.
As noted above, the content consumer system 202 includes an
The
In some examples, content consumer system 202 receives bitstream 216 from a streaming server. The streaming server may provide various types of streams, or combinations of streams, in response to such requests from the streaming client. For example, the streaming server may also provide a full-order HOA stream as an option if requested by the streaming client (e.g., executing on the audio playback system 218). In other examples, the streaming server may provide one or more of an object-based representation of the soundfield, a higher order ambisonic representation of the soundfield, a mixed order ambisonic representation of the soundfield, a combination of the object-based representation of the soundfield and the higher order ambisonic representation of the soundfield, a combination of the object-based representation of the soundfield and the mixed order ambisonic representation of the soundfield, or a combination of the mixed order representation of the soundfield and the higher order ambisonic representation of the soundfield.
Content consumer system 202 may represent a video game system or other computing device similar to the source system. Although shown as separate systems, in some examples source system 200 and content consumer system 202 may be a single system. For example, both source system 200 and content consumer system 202 may be implemented within a single video game system or other computing device. A single computing device may be connected with the
Regardless of the configuration of source system 200 and content consumer system 202, content consumer system 202 may include
As shown in the example of fig. 5, the
The
As described above, content consumer system 202 also includes
In some examples, the processor of the
The storage device of the
In some examples, the
Fig. 6 is a block diagram illustrating an example implementation of an
The CMR stream decoder 1706 receives and decodes a bitstream, such as the bitstream 216 (shown in the example of fig. 2). The bitstream 216 may include a CMR stream (which may be referred to as the "CMR stream 216"). By decoding the CMR stream 216, the CMR stream decoder 1706 can generate one or more streams of non-diegetic audio data, one or more streams of object-based audio data, channel-based audio data and associated metadata, and/or HOA audio data or other scene-based audio data.
When the
The external renderer 1702 (which may also be referred to as an "object/channel-based renderer 1702") uses one or more streams of channel-based audio data, object-based audio data and associated metadata, and/or HOA audio data or other scene-based audio data to generate binaural diegetic audio data. A mixing unit 1716 mixes the binaural diegetic audio data with the one or more streams of non-diegetic audio data to generate mixed two-
In instances in which the CMR stream decoder 1706 provides the channel-based audio data 1705 and/or the object-based audio data 1707 (which may include associated metadata) to the external renderer 1702 via the external renderer API 1700, the external renderer 1702 may render the channel-based audio data corresponding to the speaker layout of the
Given that the headphones 220 may be processing limited (e.g., feature a processor with less processing power than the audio playback system 218) and/or energy limited (e.g., powered by a limited power source such as a battery), the
Such lack of consistency can introduce audio artifacts that reduce the immersion of the CMR experience. Furthermore, significant processing may increase power consumption, memory bandwidth consumption, and associated memory consumption, which may result in limited time (due to a limited power supply, such as a battery) during which the
In accordance with various aspects of the technology described in this disclosure, the
As such, various aspects of the techniques may improve the operation of the
In operation, the
The CMR stream decoder 1706 may represent an example of an audio decoder configured to decode first audio data within a time frame (meaning a clear time period, such as a frame having a defined number of audio samples) and second audio data within the same time frame from the bitstream 216. The first audio data may refer to any one of the scene-based
Unless explicitly stated otherwise, it is assumed for the purpose of explanation that the scene-based
As further shown in the example of fig. 6, the
HOA-to-spatial-domain conversion unit 1708 may represent a unit configured to render scene-based
That is, the HOA to spatial domain conversion unit 1708 may convert the
The object/channel-based renderer 1704 may represent a unit configured to render the channel-based audio data 1705 and/or the object-based audio data 1707 for playback by the virtual speakers at the set of
The virtual speaker location unit 1720 may represent a unit configured to determine a set of virtual speaker locations (e.g., Fliege points, which may represent one example of a set of virtual speaker locations uniformly distributed about a sphere centered on the listener's head). In some examples, 4, 9, 16, or 25 virtual speaker locations (or, in other words, positions) are supported. In accordance with various techniques of this disclosure, the virtual speaker location unit 1720 may determine a set of virtual speaker locations based on headphone capability information that indicates one or more capabilities of the
For example, the processor of the
In some examples, the virtual speaker location unit 1720 determines the set of virtual speaker locations based at least in part on information regarding the scene-based
In some examples, the virtual speaker location unit 1720 is configured to use a look-up table that maps a type of processor (or a type of headphones) to a predetermined set of virtual speaker locations. In some examples, the virtual speaker location unit 1720 is configured to determine the set of
In some examples, the processing power of the
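The look-up-table approach described above can be sketched as follows. The capability tiers, their mappings, and the (order + 1)^2 relationship between rendering order and virtual speaker count are illustrative assumptions, not values taken from this disclosure:

```python
# Hypothetical look-up table mapping a device capability tier to the ambisonic
# order used for rendering. The matching number of virtual speaker locations
# follows as O = (order + 1)^2. Tier names and values are illustrative only.
CAPABILITY_TABLE = {
    "low":    1,  # e.g., battery-powered headphones
    "medium": 2,
    "high":   3,  # e.g., a mobile phone or computer
}

def virtual_speaker_count(tier: str) -> int:
    """Return the number of virtual speaker locations for a capability tier."""
    order = CAPABILITY_TABLE[tier]
    return (order + 1) ** 2

print(virtual_speaker_count("low"))   # 4 virtual speaker locations
print(virtual_speaker_count("high"))  # 16 virtual speaker locations
```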
As described above, the object/channel-based external renderer 1702 and/or the object/channel-based internal renderer 1704 render the channel and/or object-based audio data 1705/1707 for output on the virtual speakers at the determined
In examples using the object/channel-based external renderer 1702, the external renderer API 1700 may be used (e.g., by the CMR stream decoder 1706) to send and receive information from the object/channel-based external renderer 1702. The generic renderer API 1718 may be used (e.g., by the CMR stream decoder 1706) to send and receive information from a generic renderer component.
The HOA to spatial domain conversion unit 1708 converts the
An equivalent spatial domain representation of the Nth order soundfield representation c(t) is obtained by rendering c(t) to O virtual loudspeaker signals w_j(t), 1 ≤ j ≤ O, where O = (N+1)^2. The respective virtual loudspeaker positions are expressed in a spherical coordinate system, with each position lying on the unit sphere (i.e., a sphere of radius 1). The positions may therefore be represented equivalently by order-dependent directions

Ω_j^(N) := (θ_j^(N), φ_j^(N)), 1 ≤ j ≤ O,

where θ_j^(N) and φ_j^(N) denote the inclination and the azimuth, respectively. Rendering c(t) into the equivalent spatial domain may then be formulated as the matrix multiplication

w(t) = (Ψ^(N,N))^(-1) · c(t),

where (·)^(-1) denotes matrix inversion. The mode matrix Ψ^(N,N) of order N with respect to the order-dependent directions Ω_j^(N) may be defined by

Ψ^(N,N) := [S_1^(N) S_2^(N) … S_O^(N)],

with the mode vectors

S_j^(N) := [S_0^0(Ω_j^(N)) S_1^(-1)(Ω_j^(N)) S_1^0(Ω_j^(N)) … S_N^N(Ω_j^(N))]^T,

where S_n^m(·) represents the real-valued spherical harmonic of order n and degree m. The matrix Ψ^(N,N) is invertible, so the HOA representation c(t) can be recovered from the equivalent spatial domain by

c(t) = Ψ^(N,N) · w(t).
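The conversion between c(t) and the equivalent spatial domain signals w(t) can be illustrated numerically. The sketch below is a minimal pure-Python round trip for order N = 1, using a tetrahedral direction set in place of Fliege points and one common real spherical harmonic convention (ACN ordering, constant scale factors chosen for simplicity); the exact normalization varies between conventions and is not specified here:

```python
import math

def real_sh_order1(x, y, z):
    # Real spherical harmonics up to order 1 for a unit direction (x, y, z),
    # ACN ordering (W, Y, Z, X); normalization convention is illustrative.
    return [1.0, math.sqrt(3) * y, math.sqrt(3) * z, math.sqrt(3) * x]

def mat_inv(m):
    # Gauss-Jordan inversion of a small square matrix with partial pivoting.
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        f = a[col][col]
        a[col] = [v / f for v in a[col]]
        for r in range(n):
            if r != col and a[r][col]:
                a[r] = [v - a[r][col] * u for v, u in zip(a[r], a[col])]
    return [row[n:] for row in a]

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

# Four virtual loudspeaker directions on the unit sphere (a tetrahedron),
# standing in for an order-1 point set such as Fliege points.
s = 1.0 / math.sqrt(3)
dirs = [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]

# Mode matrix Psi: column j holds the spherical harmonics of direction j.
cols = [real_sh_order1(*d) for d in dirs]
psi = [[cols[j][i] for j in range(4)] for i in range(4)]

c = [0.5, -0.1, 0.3, 0.2]      # first order HOA coefficients
w = mat_vec(mat_inv(psi), c)   # equivalent spatial domain signals, w = Psi^-1 c
c_back = mat_vec(psi, w)       # conversion back to HOA, c = Psi w

print([round(v, 6) for v in c_back])  # [0.5, -0.1, 0.3, 0.2]
```

The round trip is exact (up to floating point error) because the mode matrix for O = (N+1)^2 well-chosen directions is invertible, as stated above.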
The HOA soundfield H may be converted to N-channel audio data C according to the equation

C = H · D^T,

where D is a rendering matrix determined based on the speaker configuration (e.g., the determined virtual speaker locations) of the N-channel audio data, and D^T denotes the transpose of the rendering matrix D. Matrices, such as rendering matrices, may be processed in various ways. For example, a matrix may be processed (e.g., stored, added, multiplied, retrieved, etc.) as rows, columns, vectors, or otherwise.
The mixing unit 1710 may represent a unit configured to mix spatial domain audio data 1731 generated by the HOA-to-spatial domain conversion unit 1708 with corresponding spatial domain audio data 1733 generated by the object/channel based external renderer 1702 or the object/channel based internal renderer 1704. In this way, the mixing unit 1710 may output spatial-domain audio data 1735 to the HOA converter 1712, the spatial-domain audio data 1735 having a channel for each of the determined virtual speaker positions 1721.
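Because both renderers target the same set of virtual speaker locations, the mixing performed by the mixing unit 1710 reduces to a per-speaker, per-sample addition. A minimal sketch (the channels-by-samples data layout is an assumption, not specified by the disclosure):

```python
def mix(stream_a, stream_b):
    """Sample-wise mix of two spatial domain streams that share the same
    virtual speaker layout: one output channel per virtual speaker."""
    assert len(stream_a) == len(stream_b)  # same number of virtual speakers
    return [[sa + sb for sa, sb in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(stream_a, stream_b)]

scene_part = [[0.1, 0.2], [0.0, 0.4]]    # 2 virtual speakers x 2 samples
object_part = [[0.3, 0.0], [0.2, 0.1]]
mixed = mix(scene_part, object_part)
print([[round(v, 6) for v in ch] for ch in mixed])  # [[0.4, 0.2], [0.2, 0.5]]
```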
Further, in the example of fig. 6, based on the determined
Thus, in some examples, the
Further, in one example, the
The HOA renderer 1714 may then apply the rendering matrix to the stream of scene-based audio data 1737 output by the HOA converter 1712. By applying the rendering matrix to the stream of scene-based audio data 1737, the HOA renderer 1714 may generate spatial domain binaural
In other words, the HOA renderer 1714 may represent a unit configured to transform the scene-based audio data 1737 from the spherical harmonic domain to the spatial domain to obtain the channel-based
The HOA renderer 1714 may adapt or otherwise configure the rendering matrix to cause movement as represented by the
The mixing unit 1716 may mix the binaural diegetic audio data generated by the HOA renderer 1714 with the
As mentioned above, the HOA renderer 1714 may render the stream of scene-based audio data 1737 output by the HOA converter 1712 with a rendering matrix. In some examples, the HOA renderer 1714 determines the rendering matrix based on the orientation of the headphones 220 (e.g., a two-dimensional or three-dimensional spatial orientation of the headphones 220). For example, the
In some examples, the components of the
Distributing the components of the
When mixing and converting object-based, channel-based, and scene-based audio signals into the HOA format, low complexity soundfield rotation operations can be implemented as close as possible to the binaural rendering point, potentially in a separate headphone device (e.g., the headphones 220), enabling low motion-to-sound latency and fixed complexity for a given HOA order (regardless of the number of channels and objects). Other rendering steps, with potentially higher latency and computational requirements, may be performed closer to the decoder operation and synchronized with the video (e.g., on a computer or mobile phone). These other rendering steps are performed by either an internal renderer or an external renderer. If necessary, devices that implement CIRR can further reduce complexity by reducing the ambisonic order of the rendering operation.
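The fixed-complexity rotation referred to above can be illustrated for a first order stream: a yaw rotation touches only the two horizontal dipole components, regardless of how many channels and objects were mixed into the stream. A sketch assuming ACN channel ordering (W, Y, Z, X); compensating for tracked head motion would apply the negative of the measured yaw:

```python
import math

def rotate_foa_yaw(w, y, z, x, yaw_rad):
    """Rotate a first order ambisonic frame (ACN order: W, Y, Z, X) about the
    vertical axis. Only the two horizontal dipole components change, which is
    what keeps the per-sample cost fixed for a given HOA order."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return w, y_rot, z, x_rot

# A source straight ahead (+x); rotating the field 90 degrees moves it to +y.
w, y, z, x = rotate_foa_yaw(1.0, 0.0, 0.0, 1.0, math.pi / 2)
print(round(x, 6), round(y, 6))  # 0.0 1.0
```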
Thus, in summary, the techniques of this disclosure may be implemented in one or more devices for rendering audio streams (e.g.,
Further, the device may receive an object and/or channel based audio stream and convert the stream to an equivalent spatial domain representation for a desired rendering order. The device may mix equivalent spatial domain streams corresponding to the scene-based, object-based, and channel-based audio streams to generate equivalent spatial domain mixed streams. The device may render the equivalent spatial domain mixed streams to a binaural or speaker-based representation. In some examples, the desired rendering order is determined based on: a level of the scene-based audio stream and/or metadata information from the object-based audio stream. In some examples, the equivalent spatial domain representation is reconfigured according to information from the motion sensor.
Fig. 7 is a block diagram illustrating an example implementation of an
The
In the example of fig. 7, the virtual
In the example of fig. 7, object/channel-based
In the manner described above, the
The
The
The
In one example, the
Further, in this example, the
The
Thus, in some examples, the techniques of this disclosure may be implemented in one or more devices for rendering audio streams. The device may include a memory, battery, CPU, or the like configured to generate a set of speaker locations corresponding to an equivalent spatial domain representation for a desired rendering order based on available hardware resources. The device may receive a scene-based audio stream and truncate the stream to a scene-based audio representation having a desired rendering order. Further, the device may receive an object and/or channel based audio stream and convert the stream to a scene based audio representation having a desired rendering order.
Further, the device may mix the scene-based audio representation with a desired rendering stage stream to generate a scene-based audio representation having a desired rendering stage mix stream, the desired rendering stage stream corresponding to the scene-based, object-based, and channel-based audio streams. The device may render the scene-based audio representation to a binaural or speaker-based representation using a desired rendering order mixing stream. In some examples, the desired rendering order is determined based on: a level of the scene-based audio stream and/or metadata information from the object-based audio stream. In some examples, the scene-based audio representation with the desired rendering order mix stream representation is reconfigurable according to information from a motion sensor.
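Truncating a scene-based stream to a lower desired rendering order, as described above, amounts to keeping only the first (target order + 1)^2 coefficient channels (assuming ACN channel ordering), for example:

```python
def truncate_hoa(channels, target_order):
    """Keep only the first (target_order + 1)^2 coefficient channels of a
    scene-based (ACN ordered) stream, reducing its rendering order."""
    keep = (target_order + 1) ** 2
    if keep >= len(channels):
        return channels  # already at or below the target order
    return channels[:keep]

third_order = [[0.0] * 4 for _ in range(16)]  # 16 channels of a 3rd order stream
print(len(truncate_hoa(third_order, 1)))      # 4 channels remain (1st order)
```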
Fig. 8 is a block diagram illustrating an example implementation of an
In some examples, the audio data 223 captured by the headphones includes spatial domain audio data. In such an example, the object/channel-based external renderer 1702 or the object/channel-based internal renderer 1704 may generate modified spatial-domain audio data 223' (e.g., using VBAP) based on the determined combination of
In some examples, the headset-captured audio data 223 includes scene-based audio data. In such an example, the HOA-to-spatial-domain conversion unit 1708 may generate modified spatial-domain audio data 223' based on the determined set of
The spatial domain audio data generated by the mixing unit 1710 may then be processed in the manner described with respect to fig. 8. For example, the HOA converter 1712 may convert the audio data output by the mixing unit 1710 into scene-based audio data. Thus, in accordance with the techniques of this disclosure, the
Similar examples may be provided with respect to the object/channel based
FIG. 9 is a flowchart illustrating example operations of the audio playback system illustrated in the example of FIG. 7 in performing aspects of the scalable unified rendering technique. The CMR stream decoder 1706 may receive the bitstream 216 and decode, from the bitstream 216,
The HOA-to-spatial-domain conversion unit 1708 may render the
The mixing unit 1710 may mix the first spatial-domain audio data 1731 and the second spatial-domain audio data 1733 to obtain mixed spatial-domain audio data 1735 (1906). The mixing unit 1710 may output the mixed spatial domain audio data 1735 to the HOA converter 1712, which the HOA converter 1712 may convert the mixed spatial domain audio data 1735 to the scene-based audio data 1737 (1908).
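The render-mix-convert flow described above can be sketched end to end for order N = 1. The tetrahedral virtual speaker layout, the ACN-style harmonics, and the closed-form inverse mode matrix (Ψ·Ψ^T = 4I holds for this particular geometry) are illustrative assumptions, not details of the disclosure:

```python
import math

# Order-1 unified rendering sketch: scene-based audio and an audio object are
# both rendered to the same four virtual speakers (a tetrahedron), mixed, and
# the mix is converted back to scene-based (first order) audio.
s = 1.0 / math.sqrt(3)
DIRS = [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]

def sh(x, y, z):
    # Real spherical harmonics up to order 1, ACN ordering (W, Y, Z, X).
    return [1.0, math.sqrt(3) * y, math.sqrt(3) * z, math.sqrt(3) * x]

# Mode matrix: column j holds the spherical harmonics of speaker direction j.
PSI = [[sh(*d)[i] for d in DIRS] for i in range(4)]

def scene_to_spatial(c):
    # For this geometry PSI * PSI^T = 4I, so the inverse mode matrix is PSI^T / 4.
    return [sum(PSI[i][j] * c[i] for i in range(4)) / 4.0 for j in range(4)]

def object_to_spatial(sample, direction):
    # Encode the object at its direction, then decode to the virtual speakers.
    return scene_to_spatial([coeff * sample for coeff in sh(*direction)])

def spatial_to_scene(w):
    # Convert mixed spatial domain signals back to scene-based audio: c = PSI w.
    return [sum(PSI[i][j] * w[j] for j in range(4)) for i in range(4)]

scene = [0.5, 0.0, 0.0, 0.0]                   # omnidirectional scene content
obj = object_to_spatial(0.2, (1.0, 0.0, 0.0))  # an object straight ahead (+x)
mixed = [a + b for a, b in zip(scene_to_spatial(scene), obj)]
result = spatial_to_scene(mixed)  # scene-based mix of both contributions
```

In this sketch the W channel of the result carries both the scene content and the object, while the X channel carries the object's directional component, mirroring the decode-render-mix-convert flow of the flowchart.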
In some examples, the HOA converter 1712 may send scene-based audio data 1737 to the
It is to be understood that, depending on the example, some acts or events of any of the techniques described herein can be performed in a different order, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the techniques). Further, in some examples, acts or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially.
In one or more examples, the functions described may be implemented in hardware, software, firmware, or a combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media corresponding to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, such as according to a communication protocol. In this manner, the computer-readable medium may generally correspond to (1) a tangible computer-readable storage medium that is non-transitory, or (2) a communication medium such as a signal or carrier wave. The data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve the instructions, code and/or data structures for implementation of the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Further, in some aspects, the functions described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Furthermore, the techniques may be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses including a wireless headset, an Integrated Circuit (IC), or a collection of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require implementation by different hardware units. Instead, as described above, the various units may be incorporated in a codec hardware unit or provided by a collection of interoperating hardware units including one or more memories as described above, along with appropriate software and/or firmware.
Various examples have been described. These examples, as well as other examples, are within the scope of the following claims.