Rendering audio objects using multiple types of renderers

Document No.: 1943028    Publication date: 2021-12-07

This technique, "Rendering audio objects using multiple types of renderers", was created by F·G·热尔曼 and A·J·西斐德 on 2020-05-01. Abstract: An apparatus and method for rendering audio objects using multiple types of renderers. The weights between the selected renderers depend on the position information in each audio object. Since each type of renderer has a different output coverage, the combination of their weighted outputs causes the audio to be perceived at the intended position according to the position information.

1. A method of audio processing, the method comprising:

receiving one or more audio objects, wherein each of the one or more audio objects respectively includes position information;

for a given audio object of the one or more audio objects:

selecting at least two renderers of a plurality of renderers based on the position information of the given audio object;

determining at least two weights based on the position information of the given audio object;

rendering the given audio object using the at least two renderers weighted according to the at least two weights based on the position information to generate a plurality of rendered signals; and

combining the plurality of rendered signals to generate a plurality of speaker signals; and

outputting the plurality of speaker signals from a plurality of speakers.

2. The method of claim 1, wherein the at least two renderers are classified into at least two categories.

3. The method of claim 2, wherein the at least two categories include a sound field renderer, a beamformer, a panner, and a binaural renderer.

4. The method of any preceding claim, wherein a given rendered signal of the plurality of rendered signals comprises at least one component signal,

wherein each of the at least one component signal is associated with a respective one of the plurality of speakers; and

wherein, for a given speaker of the plurality of speakers, a given speaker signal of the plurality of speaker signals corresponds to combining all of the at least one component signal associated with the given speaker.

5. The method of claim 4, wherein a first renderer generates a first rendered signal, wherein the first rendered signal includes a first component signal associated with a first speaker and a second component signal associated with a second speaker,

wherein a second renderer generates a second rendered signal, wherein the second rendered signal comprises a third component signal associated with the first speaker and a fourth component signal associated with the second speaker,

wherein a first speaker signal associated with the first speaker corresponds to combining the first component signal and the third component signal, and

wherein the second speaker signal associated with the second speaker corresponds to combining the second component signal and the fourth component signal.

6. The method of any of claims 1-5, wherein rendering the given audio object comprises: for a given renderer of the plurality of renderers, applying a gain based on the position information to generate a given rendered signal of the plurality of rendered signals.

7. The method of any of claims 1-5, wherein the plurality of speakers are arranged in a first group that is pointed in a first direction and a second group that is pointed in a second direction different from the first direction.

8. The method of claim 7, wherein the second direction comprises a vertical component, wherein the at least two renderers comprise a wave field synthesis renderer, an upward-firing panning renderer, and a beamformer, and wherein the wave field synthesis renderer, the upward-firing panning renderer, and the beamformer generate the plurality of rendered signals for the second group.

9. The method of claim 7, wherein the second direction comprises a vertical component, wherein the at least two renderers comprise a wave field synthesis renderer, an upward-firing panning renderer, and a lateral-firing panning renderer, and wherein the wave field synthesis renderer, the upward-firing panning renderer, and the lateral-firing panning renderer generate the plurality of rendered signals for the second group.

10. The method of claim 7, wherein the second direction comprises a lateral component, wherein the at least two renderers comprise a wave field synthesis renderer and a beamformer, and wherein the wave field synthesis renderer and the beamformer generate the plurality of rendered signals for the second group.

11. The method of claim 7, wherein the second direction comprises a lateral component, wherein the at least two renderers comprise a wave field synthesis renderer and a lateral-firing panning renderer, and wherein the wave field synthesis renderer and the lateral-firing panning renderer generate the plurality of rendered signals for the second group.

12. The method of any of claims 1-11, wherein the at least two renderers include serially connected renderers.

13. The method of any of claims 1-12, wherein the at least two renderers include an amplitude panner, a plurality of binaural renderers, and a plurality of beamformers;

wherein the amplitude panner is configured to render the given audio object based on the position information to generate a first plurality of signals;

wherein the plurality of binaural renderers are configured to render the first plurality of signals to generate a second plurality of signals;

wherein the plurality of beamformers are configured to render the second plurality of signals to generate a third plurality of signals; and

wherein the third plurality of signals are combined to generate the plurality of speaker signals.

14. A computer program comprising instructions which, when executed by a processor, control an apparatus to perform a process comprising the method according to any one of claims 1-13.

15. An apparatus for audio processing, the apparatus comprising:

a plurality of speakers;

a processor; and

a memory,

wherein the processor is configured to control the apparatus to receive one or more audio objects, wherein each of the one or more audio objects comprises position information, respectively;

wherein, for a given audio object of the one or more audio objects:

the processor is configured to control the apparatus to select at least two renderers of a plurality of renderers based on the position information of the given audio object;

the processor is configured to control the apparatus to determine at least two weights based on the position information of the given audio object;

the processor is configured to control the apparatus to render the given audio object using the at least two renderers weighted according to the at least two weights based on the position information to generate a plurality of rendered signals; and

the processor is configured to control the apparatus to combine the plurality of rendered signals to generate a plurality of speaker signals; and

wherein the processor is configured to control the apparatus to output the plurality of speaker signals from the plurality of speakers.

Background

The present invention relates to audio processing, and in particular to processing audio objects using multiple types of renderers.

Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

Audio signals can be generally classified into two types: channel-based audio and object-based audio.

In channel-based audio, the audio signal includes a plurality of channel signals, each channel signal corresponding to a speaker. Exemplary channel-based audio signals include stereo audio, 5.1-channel surround audio, 7.1-channel surround audio, and the like. Stereo audio includes two channels: a left channel for the left speaker and a right channel for the right speaker. 5.1-channel surround audio includes six channels: a front left channel, a front right channel, a center channel, a left surround channel, a right surround channel, and a low-frequency effects channel. 7.1-channel surround audio includes eight channels: a front left channel, a front right channel, a center channel, a left surround channel, a right surround channel, a left rear channel, a right rear channel, and a low-frequency effects channel.

In object-based audio, an audio signal includes audio objects, and each audio object includes position information regarding where the audio of the audio object is to be output. The position information may therefore be independent of the configuration of the speakers. The rendering system then renders the audio objects using the position information to generate specific signals for the specific configuration of the speakers. Examples of object-based audio include Dolby Atmos™ audio, DTS:X™ audio, and the like.
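As a rough illustration of this structure (a minimal sketch; the field names below are hypothetical and not taken from any particular object-audio format):

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class AudioObject:
    """One audio object: its audio samples plus position metadata."""
    samples: np.ndarray            # mono object audio data
    position: Tuple[float, float, float]  # desired perceived position, e.g. (x, y, z) in a unit cube

obj = AudioObject(samples=np.zeros(48000), position=(0.5, 0.2, 1.0))  # one second of silence near the upper center
```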

Both channel-based systems and object-based systems may include a renderer that generates speaker signals from the channel signals or the object signals. Renderers can be classified into various categories, including sound field renderers, beamformers, panners, binaural renderers, and the like.

Disclosure of Invention

Although many existing systems combine multiple renderers, these existing systems do not recognize that the selection of a renderer can be made based on the desired perceived location of the sound. In many listening environments, the listening experience may be improved by taking into account the desired perceived location of the sound when selecting the renderer. Therefore, there is a need for a system that takes into account the desired perceived location of sound when selecting renderers and when assigning weights to use among the selected renderers.

In view of the foregoing problems and deficiencies in solutions, embodiments described herein are directed to controlling two or more renderers (optionally of a single category or of different categories) using desired perceived locations of audio objects.

According to an embodiment, an audio processing method includes receiving one or more audio objects, wherein each of the one or more audio objects respectively includes position information. The method further includes, for a given audio object of the one or more audio objects, selecting at least two renderers of the plurality of renderers based on the location information of the given audio object, e.g., the at least two renderers having at least two categories; determining at least two weights based on the position information of the given audio object; rendering a given audio object using at least two renderers weighted according to at least two weights based on the position information to generate a plurality of rendered signals; and combining the plurality of rendered signals to generate a plurality of speaker signals. The method also includes outputting a plurality of speaker signals from a plurality of speakers.

The at least two categories may include a sound field renderer, a beamformer, a panner, and a binaural renderer.

A given rendering signal of the plurality of rendering signals may comprise at least one component signal, wherein each of the at least one component signals is associated with a respective one of the plurality of speakers, and wherein for a given speaker of the plurality of speakers, a given speaker signal of the plurality of speaker signals corresponds to combining all of the at least one component signal associated with the given speaker.

The first renderer may generate a first rendering signal, where the first rendering signal includes a first component signal associated with a first speaker and a second component signal associated with a second speaker. The second renderer may generate a second rendered signal, where the second rendered signal includes a third component signal associated with the first speaker and a fourth component signal associated with the second speaker. The first speaker signal associated with the first speaker may correspond to combining the first component signal and the third component signal. The second speaker signal associated with the second speaker may correspond to combining the second component signal and the fourth component signal.

Rendering the given audio object may include, for a given renderer of the plurality of renderers, applying a gain based on the location information to generate a given rendering signal of the plurality of rendering signals.

The plurality of speakers may include a dense linear array of speakers.

The at least two categories may include a sound field renderer, wherein the sound field renderer performs a wave field synthesis process.

The plurality of speakers may be arranged in a first group directed in a first direction and a second group directed in a second direction different from the first direction. The first direction may include a forward component and the second direction may include a vertical component. The second direction may comprise a vertical component, wherein the at least two renderers comprise a wave field synthesis renderer and an upward-firing panning renderer, and wherein the wave field synthesis renderer and the upward-firing panning renderer generate a plurality of rendering signals for the second group. The second direction may comprise a vertical component, wherein the at least two renderers comprise a wave field synthesis renderer, an upward-firing panning renderer, and a beamformer, and wherein the wave field synthesis renderer, the upward-firing panning renderer, and the beamformer generate a plurality of rendering signals for the second group. The second direction may comprise a vertical component, wherein the at least two renderers comprise a wave field synthesis renderer, an upward-firing panning renderer, and a lateral-firing panning renderer, and wherein the wave field synthesis renderer, the upward-firing panning renderer, and the lateral-firing panning renderer generate a plurality of rendering signals for the second group. The first direction may include a forward component and the second direction may include a lateral component. The first direction may comprise a forward component, wherein the at least two renderers comprise a wave field synthesis renderer, and wherein the wave field synthesis renderer generates a plurality of rendering signals for the first group. The second direction may comprise a lateral component, wherein the at least two renderers comprise a wave field synthesis renderer and a beamformer, and wherein the wave field synthesis renderer and the beamformer generate a plurality of rendering signals for the second group. The second direction may comprise a lateral component, wherein the at least two renderers comprise a wave field synthesis renderer and a lateral-firing panning renderer, and wherein the wave field synthesis renderer and the lateral-firing panning renderer generate a plurality of rendering signals for the second group.

The method may further include combining the plurality of rendered signals of the one or more audio objects to generate a plurality of speaker signals.

The at least two renderers may include series connected renderers.

The at least two renderers may include an amplitude panner, a plurality of binaural renderers, and a plurality of beamformers. The amplitude panner may be configured to render the given audio object based on the position information to generate a first plurality of signals. The plurality of binaural renderers may be configured to render the first plurality of signals to generate a second plurality of signals. The plurality of beamformers may be configured to render the second plurality of signals to generate a third plurality of signals. The third plurality of signals may be combined to generate a plurality of speaker signals.

According to another embodiment, a non-transitory computer readable medium stores a computer program that, when executed by a processor, controls an apparatus to perform a process including one or more of the method steps discussed herein.

According to another embodiment, an apparatus for processing audio includes a plurality of speakers, a processor, and a memory. The processor is configured to control the apparatus to receive one or more audio objects, wherein each of the one or more audio objects comprises position information, respectively. For a given audio object of the one or more audio objects, the processor is configured to control the apparatus to select at least two renderers of the plurality of renderers based on the position information of the given audio object, wherein the at least two renderers have at least two categories. The processor is configured to control the apparatus to determine at least two weights based on the position information of the given audio object. The processor is configured to control the apparatus to render the given audio object using the at least two renderers weighted according to the at least two weights based on the position information to generate a plurality of rendered signals. The processor is configured to control the apparatus to combine the plurality of rendered signals to generate a plurality of speaker signals. The processor is configured to control the apparatus to output the plurality of speaker signals from the plurality of speakers.

The apparatus may include further details similar to the methods described herein.

According to another embodiment, a method of audio processing includes receiving one or more audio objects, wherein each of the one or more audio objects includes location information, respectively. For a given audio object of the one or more audio objects, the method further includes rendering the given audio object using a first class of renderers based on the location information to generate a first plurality of signals; rendering the first plurality of signals using a second category of renderers to generate a second plurality of signals; rendering the second plurality of signals using a third category of renderers to generate a third plurality of signals; and combining the third plurality of signals to generate a plurality of speaker signals. The method also includes outputting a plurality of speaker signals from a plurality of speakers.

The first category of renderers may correspond to amplitude panners, the second category of renderers may correspond to a plurality of binaural renderers, and the third category of renderers may correspond to a plurality of beamformers.

The method may include more details similar to those described with respect to other methods discussed herein.

According to another embodiment, an apparatus for processing audio includes a plurality of speakers, a processor, and a memory. The processor is configured to control the apparatus to receive one or more audio objects, wherein each of the one or more audio objects comprises position information, respectively. For a given audio object of the one or more audio objects, the processor is configured to control the apparatus to render the given audio object using the first category of renderers based on the location information to generate a first plurality of signals; the processor is configured to control the apparatus to render the first plurality of signals using a second category of renderers to generate a second plurality of signals; the processor is configured to control the apparatus to render the second plurality of signals using a third category of renderers to generate a third plurality of signals; the processor is configured to control the apparatus to combine the third plurality of signals to generate a plurality of speaker signals. The processor is configured to control the apparatus to output a plurality of speaker signals from a plurality of speakers.

The apparatus may include further details similar to the methods described herein.

The following detailed description and the accompanying drawings provide a further understanding of the nature and advantages of the various embodiments.

Drawings

Fig. 1 is a block diagram of a rendering system 100.

Fig. 2 is a flow chart of a method 200 of audio processing.

Fig. 3 is a block diagram of a rendering system 300.

Fig. 4 is a block diagram of a speaker system 400.

Fig. 5A and 5B are top and side views, respectively, of sound bar 500.

Fig. 6A, 6B, and 6C are first top, second top, and side views, respectively, illustrating output coverage of soundbar 500 (see fig. 5A and 5B) in a room.

Fig. 7 is a block diagram of a rendering system 700.

Fig. 8A and 8B are a top view and a side view, respectively, illustrating an example of a source distribution of soundbar 500 (see fig. 5A).

Fig. 9A and 9B are top views illustrating object-based audio (fig. 9A) to speaker array (fig. 9B) mapping.

Fig. 10 is a block diagram of a rendering system 1100.

Fig. 11 is a top view showing the output coverage of beamformers 1120e and 1120f implemented in a soundbar 500 (see fig. 5A and 5B) in a room.

Fig. 12 is a top view of soundbar 1200.

Fig. 13 is a block diagram of a rendering system 1300.

Fig. 14 is a block diagram of the renderer 1400.

Fig. 15 is a block diagram of a renderer 1500.

Fig. 16 is a block diagram of a rendering system 1600.

Fig. 17 is a flow diagram of a method 1700 of audio processing.

Detailed Description

Described herein are techniques for audio rendering. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention as defined by the claims may include some or all of the features in these examples alone or in combination with other features described below, and may also include modifications and equivalents of the features and concepts described herein.

In the following description, various methods, procedures, and steps are described in detail. Although specific steps may be described in a certain order, such order is mainly for convenience and clarity. Certain steps may be repeated multiple times, may occur before or after other steps (even if the steps are otherwise described in another order), and may occur in parallel with other steps. A second step needs to be performed after a first step only when the first step must be completed before the second step starts; such cases will be specifically pointed out when they are not clear from the context.

In this document, the terms "and", "or", and "and/or" are used. Such terms are to be understood with an inclusive meaning. For example, "A and B" may mean at least the following: "both A and B", "at least A and B". As another example, "A or B" may mean at least the following: "at least A", "at least B", "both A and B", "at least A and B". As another example, "A and/or B" may mean at least the following: "A and B", "A or B". When an exclusive or is intended, it will be specifically noted (e.g., "either A or B", "at most one of A and B").

Fig. 1 is a block diagram of a rendering system 100. Rendering system 100 includes an assignment module 110, a plurality of renderers 120 (three shown: 120a, 120b, and 120c), and a routing module 130. The renderers 120 are classified into a number of different categories, which will be discussed in more detail below. The rendering system 100 receives the audio signal 150, renders the audio signal 150, and generates a plurality of speaker signals 170. Each speaker signal 170 drives a speaker (not shown).

The audio signal 150 is an object audio signal and comprises one or more audio objects. Each audio object includes object metadata 152 and object audio data 154. The object metadata 152 includes position information of the audio object. The position information corresponds to a desired perceived position of the object audio data 154 of the audio object. The object audio data 154 corresponds to audio data to be rendered by the rendering system 100 and output by speakers (not shown). The audio signal 150 may be in one or more of a variety of formats, including the Dolby Atmos™ format, the Ambisonics format (e.g., B-format), the DTS™ format from Xperi Corp., etc. For simplicity, the following refers to a single audio object for purposes of describing the operation of the rendering system 100, but it is understood that multiple audio objects may be processed simultaneously, for example, by instantiating multiple instances of one or more renderers 120. For example, an implementation of the Atmos™ system may reproduce up to 128 simultaneous audio objects in the audio signal 150.

The assignment module 110 receives object metadata 152 from the audio signal 150. The assignment module 110 also receives speaker configuration information 156. The speaker configuration information 156 generally indicates the configuration of the speakers connected to the rendering system 100, such as the number, configuration, or physical location of the speakers. The speaker configuration information 156 may be static when the speaker positions are fixed (e.g., as components physically attached to a device that includes the rendering system 100), and the speaker configuration information 156 may be dynamic when the speaker positions may be adjusted. The dynamic information may be updated as needed (e.g., when the speaker is moved). The speaker configuration information 156 may be stored in a memory (not shown).

Based on the object metadata 152 and the speaker configuration information 156, the assignment module 110 determines selection information 162 and location information 164. Given the arrangement of speakers according to the speaker configuration information 156, the selection information 162 selects two or more renderers 120 that are suitable for rendering audio objects of given position information in the object metadata 152. The location information 164 corresponds to a source location to be rendered by each selected renderer 120. In general, the location information 164 may be considered a weighting function that weights the object audio data 154 among the selected renderers 120.

The renderers 120 receive the object audio data 154, the speaker configuration information 156, the selection information 162, and the location information 164. The renderers 120 use the speaker configuration information 156 to configure their outputs. The selection information 162 selects two or more renderers 120 to render the object audio data 154. Based on the location information 164, each selected renderer 120 renders the object audio data 154 to generate rendered signals 166 (e.g., renderer 120a generates rendered signals 166a, renderer 120b generates rendered signals 166b, etc.). Each of the rendered signals 166 from each renderer 120 corresponds to a driver signal for one of the speakers (not shown) configured according to the speaker configuration information 156. For example, if the rendering system 100 is connected to 14 speakers, renderer 120a generates up to 14 rendered signals 166a. (If a given audio object is rendered such that it is not output from a particular speaker, then that one of the rendered signals 166 may be considered zero or non-existent, as indicated by the speaker configuration information 156.)

Routing module 130 receives the rendered signals 166 from each renderer 120 and the speaker configuration information 156. Based on the speaker configuration information 156, the routing module 130 combines the rendered signals 166 to generate the speaker signals 170. To generate each speaker signal 170, the routing module 130 combines, for each speaker, each of the rendered signals 166 corresponding to that speaker. For example, a given speaker may be associated with one of the rendered signals 166a, one of the rendered signals 166b, and one of the rendered signals 166c; the routing module 130 combines these three signals to generate a corresponding one of the speaker signals 170 for the given speaker. In this manner, the routing module 130 performs the function of mixing the appropriate rendered signals 166 to generate the corresponding speaker signals 170.

Due to acoustic linearity, the superposition principle allows the rendering system 100 to use any given speaker simultaneously for any number of renderers 120. The routing module 130 accomplishes this by summing, for each speaker, the contribution from each renderer 120. As long as the sum of these signals does not overload the loudspeaker, the result, in terms of the listener's impression, corresponds to the case of assigning separate loudspeakers to each renderer.
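As an illustrative sketch of this per-speaker summation (not the patent's implementation), a routing step can sum each renderer's contribution for each speaker and attenuate any speaker signal that would overload, as also mentioned for step 210 below; all names are hypothetical:

```python
import numpy as np

def route(rendered_signals, num_speakers):
    """Sum the contributions of all renderers for each speaker (superposition).

    rendered_signals: list of dicts mapping speaker index -> signal array,
                      one dict per active renderer.
    Returns one signal array per speaker.
    """
    length = max((len(sig) for r in rendered_signals for sig in r.values()), default=0)
    speaker_signals = [np.zeros(length) for _ in range(num_speakers)]
    for renderer_output in rendered_signals:
        for k, sig in renderer_output.items():
            speaker_signals[k][:len(sig)] += sig        # superposition of renderer outputs
    for k, sig in enumerate(speaker_signals):
        peak = np.max(np.abs(sig)) if len(sig) else 0.0
        if peak > 1.0:                                  # avoid overloading speaker k
            speaker_signals[k] = sig / peak
    return speaker_signals
```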

When multiple audio objects are rendered for simultaneous output, routing module 130 combines rendering signals 166 in a manner similar to the single audio object case discussed above.

Fig. 2 is a flow chart of a method 200 of audio processing. The method 200 may be performed by the rendering system 100 (see fig. 1). The method 200 may be implemented by one or more computer programs, for example, executed by the rendering system 100 to control the operation of the rendering system.

In step 202, one or more audio objects are received. Each audio object respectively includes position information. (For example, two audio objects A and B may have respective position information PA and PB.) As an example, the rendering system 100 (see fig. 1) may receive one or more audio objects in the audio signal 150. For each audio object, the method continues at step 204.

At step 204, for a given audio object, at least two renderers are selected based on the position information of the given audio object. Optionally, the at least two renderers have at least two categories. (Of course, a single category of renderer may be used to render a particular audio object; the operation in that case is similar to the multiple-category case discussed herein.) For example, when the position information indicates that two particular renderers (of two particular categories) would be appropriate for rendering the audio object, then those two renderers are selected. The renderers may be selected based on the speaker configuration information 156 (see fig. 1). As an example, the assignment module 110 may generate the selection information 162 to select at least two of the renderers 120 based on the position information in the object metadata 152 and the speaker configuration information 156.

At step 206, for a given audio object, at least two weights are determined based on the position information. The weights are associated with the renderer selected at step 204. For example, the assignment module 110 (see fig. 1) may generate location information 164 (corresponding to the weights) based on the location information in the object metadata 152 and the speaker configuration information 156.

At step 208, a given audio object is rendered using the selected renderer (see step 204) weighted according to the weight (see step 206) based on the position information to generate a plurality of rendering signals. As an example, the renderer 120 (see fig. 1, selected according to the selection information 162) generates a rendering signal 166 from the object audio data 154, weighted according to the position information 164. Continuing with the example, when the renderers 120a and 120b are selected, rendering signals 166a and 166b are generated.

At step 210, the plurality of rendered signals are combined (see step 208) to generate a plurality of speaker signals. For a given speaker, the respective rendered signals 166 are summed to generate a speaker signal. When the speaker signal is above the maximum signal level, the speaker signal may be attenuated to prevent overloading of a given speaker. As an example, the routing module 130 may combine the rendered signals 166 to generate speaker signals 170.

At step 212, a plurality of speaker signals are output from a plurality of speakers (see step 210).

The method 200 operates in a similar manner when multiple audio objects are to be output simultaneously. For example, multiple paths of steps 204-206-208 may be used to process multiple given audio objects in parallel, wherein rendered signals corresponding to the multiple audio objects are combined (see step 210) to generate a speaker signal.
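A minimal sketch of steps 204-208, assuming a hypothetical renderer interface and a simple z-based constant-energy weighting between a horizontal renderer and a height renderer (the actual selection and weighting rules depend on the speaker configuration):

```python
import numpy as np

def render_object(obj_audio, position, renderers):
    """Steps 204-208 for one audio object, using a hypothetical renderer interface.

    obj_audio: numpy array of object samples.
    position: (x, y, z) object position from the metadata, in a unit cube.
    renderers: dict with hypothetical keys "horizontal" and "height".
    Returns a list of per-renderer outputs (each a dict: speaker index -> signal).
    """
    x, y, z = position
    # Steps 204/206: select renderers and constant-energy weights from the z coordinate
    # (the cos/sin split is an assumption used only for illustration).
    candidates = [(renderers["horizontal"], np.cos(z * np.pi / 2)),
                  (renderers["height"], np.sin(z * np.pi / 2))]
    selected = [(r, w) for r, w in candidates if w > 1e-6]
    # Step 208: each selected renderer renders the weighted object audio.
    return [r.render(obj_audio * w, position) for r, w in selected]
```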

Fig. 3 is a block diagram of a rendering system 300. Rendering system 300 may be used to implement rendering system 100 (see fig. 1) or to perform one or more steps of method 200 (see fig. 2). The rendering system 300 may store and execute one or more computer programs to implement the rendering system 100 or to perform the method 200. Rendering system 300 includes memory 302, processor 304, input interface 306, and output interface 308 connected by bus 310. The rendering system 300 may include (for simplicity) other components not shown.

The memory 302 collectively stores data used by the rendering system 300. The memory 302 may also store one or more computer programs that control the operation of the rendering system 300. The memory 302 may include volatile components (e.g., random access memory) and non-volatile components (e.g., solid-state memory). The memory 302 may store the speaker configuration information 156 (see fig. 1) or data corresponding to other signals in fig. 1, such as the object metadata 152, the object audio data 154, the rendered signals 166, and the like.

The processor 304 generally controls the operation of the rendering system 300. When the rendering system 300 implements the rendering system 100 (see fig. 1), the processor 304 implements functions corresponding to the assignment module 110, the renderers 120, and the routing module 130.

The input interface 306 receives the audio signal 150 and the output interface 308 outputs the speaker signal 170.

Fig. 4 is a block diagram of a speaker system 400. The speaker system 400 includes a rendering system 402 and a plurality of speakers 404 (six are shown: 404a, 404b, 404c, 404d, 404e, and 404f). The speaker system 400 may be configured as a single device that includes all of the components (e.g., in a soundbar form factor). The speaker system 400 may alternatively be configured as separate devices (e.g., the rendering system 402 is one component and the speakers 404 are one or more other components).

The rendering system 402 may correspond to the rendering system 100 (see fig. 1) that receives the audio signal 150 and generates a speaker signal 406 corresponding to the speaker signal 170 (see fig. 1). The components of rendering system 402 may be similar to the components of rendering system 300 (see fig. 3).

The speakers 404 output audible signals (not shown) corresponding to the speaker signals 406 (six are shown: 406a, 406b, 406c, 406d, 406e, and 406f). The speaker signals 406 may correspond to the speaker signals 170 (see fig. 1). The speakers 404 may output the speaker signals as discussed above with respect to 312 in fig. 3.

Renderer categories

As described above, the renderers (e.g., the renderers 120 of fig. 1) are classified into different categories. Four common categories are sound field renderers, binaural renderers, panning renderers, and beamforming renderers. As described above (see step 204 in fig. 2), for a given audio object, the selected renderers may have at least two categories. For example, based on the object metadata 152 and the speaker configuration information 156 (see fig. 1), the assignment module 110 may select a sound field renderer and a beamforming renderer (among the renderers 120) to render a given audio object.

Additional details of four common renderer categories are provided below. Note that when a category includes a sub-category of a renderer, it should be understood that references to different categories of renderers similarly apply to different sub-categories of renderers. The rendering system described herein (e.g., rendering system 100 of fig. 1) may implement one or more of these categories of renderers.

Sound field renderer

Typically, sound field rendering aims at reproducing a specific sound pressure field within a given spatial volume. Subcategories of sound field renderers include wave field synthesis, near-field compensated higher-order Ambisonics, and spectral division.

An important capability of sound field rendering methods is the ability to project virtual sources in the near field, that is, to generate a source that the listener localizes at a position between themselves and the loudspeakers. Although binaural renderers (see below) can also achieve this effect, the particularity here is that the correct localization impression can be generated over a wide listening area.

Binaural renderer

Binaural rendering methods focus on delivering to the listener's ears a signal carrying the source signal processed to mimic the binaural cues associated with the source location. Although the simpler way of delivering such signals is usually through headphones, it can also be done successfully through a speaker system by feeding separate left-ear and right-ear signals to the listener using crosstalk cancellers.

Panning renderer

Panning methods directly exploit basic auditory mechanisms (e.g., changes in interaural loudness and time differences) to move the sound image, through delays and/or gain differences applied to the source signal before it is fed to the plurality of speakers. Amplitude panners, which use only gain differences, are popular for their simple implementation and stable perceptual impression. Amplitude panners have been deployed in many consumer audio systems, such as stereo systems and traditional cinema content rendering. (An example of an amplitude panner design suitable for use with arbitrary loudspeaker arrays is provided by V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456-466, 1997.) Finally, methods using reflections from the reproduction environment generally rely on similar principles to manipulate the spatial impression produced by the system.
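For illustration only (a textbook equal-power pan law, not the VBAP formulation from the Pulkki reference above), a stereo amplitude panner can compute its two gains from a normalized pan position:

```python
import numpy as np

def equal_power_pan(x):
    """Equal-power stereo pan gains for x in [0, 1] (0 = left only, 1 = right only)."""
    theta = x * np.pi / 2
    return np.cos(theta), np.sin(theta)   # squared gains sum to 1 (constant energy)

left_gain, right_gain = equal_power_pan(0.25)  # a source a quarter of the way toward the right
```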

Beamforming renderer

Beamforming was originally designed for sensor arrays (e.g., microphone arrays) as a means of amplifying signals from a preferred set of directions. Due to the principle of reciprocity in acoustics, the same principle can be used to create directional acoustic signals. U.S. Patent 7,515,719 describes the use of beamforming to create virtual loudspeakers by using a focused source.
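As a rough sketch of the underlying idea (a basic delay-and-sum beamformer for a uniform linear array, not the specific beamformers described later in this document), the per-speaker delays that steer a beam toward a given angle can be computed as:

```python
import numpy as np

def delay_and_sum_delays(num_speakers, spacing_m, steer_angle_rad, c=343.0):
    """Per-speaker delays (seconds) steering a uniform linear array toward steer_angle_rad.

    The angle is measured from the array normal; spacing_m is the speaker spacing.
    """
    positions = np.arange(num_speakers) * spacing_m
    delays = positions * np.sin(steer_angle_rad) / c
    return delays - delays.min()            # make all delays non-negative

delays = delay_and_sum_delays(12, 0.04, np.deg2rad(40))  # e.g., a 12-speaker array steered 40 degrees
```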

Rendering system considerations

The renderer categories discussed above involve a number of considerations regarding the optimal listening position (sweet spot) and the source locations they can render.

The optimal position generally corresponds to the region of space in which the rendering is deemed acceptable according to a listener perception metric. While the exact extent of such a region is often difficult to determine due to the lack of analytical metrics that capture perceived rendering quality well, qualitative information can often be obtained from typical error metrics (e.g., squared error) and compared across different systems in different configurations. For example, a common observation is that at higher frequencies, the optimal position shrinks (for all classes of renderers). In general, it is also observed that the sweet spot grows with the number of loudspeakers available in the system, except for panning methods, for which adding loudspeakers has different advantages.

Different renderer categories also differ in how, and how well, they can make audio be perceived at different source locations. From the listener's perspective, sound field rendering methods generally allow virtual sources to be created anywhere in the direction of the loudspeaker array. One aspect of these methods is that they allow the perceived distance of the source to be manipulated in a transparent manner, and from the perspective of the entire listening area. Binaural rendering methods can in principle provide any source position within the optimal position, as long as the binaural information related to those positions has been previously stored. Finally, panning methods can provide any source direction spanned by a pair (or triple) of speakers that are sufficiently close together (e.g., about 60 degrees apart, such as between 55 and 65 degrees) from the perspective of the listener. (However, panning methods do not generally define a specific way of handling source distance, so if a distance component is required, other strategies need to be used.)

Furthermore, some classes of rendering systems exhibit interdependencies between source locations and optimal positions. For example, for a linear loudspeaker array implementing a wave field synthesis process (in the sound field rendering category), a source position in the center behind the array may be perceived correctly over a larger optimal area in front of the array, while a source position in front of the array and shifted to the side may only be perceived correctly over a smaller, off-center optimal area.

DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION

In view of the above considerations, embodiments tend to use two or more rendering methods in combination, where the relative weights between the selected rendering methods depend on the audio object position.

With the increasing availability of hardware that allows the use of a large number of loudspeakers in consumer applications, the possibility of using complex rendering strategies becomes increasingly attractive. In practice, the number of loudspeakers is still limited, so using a single rendering method generally leads to significant limitations, usually in terms of the extent of the optimal listening area. Furthermore, complex strategies may address complex speaker setups, such as a lack of surround coverage in certain areas, or simply a lack of speaker density. However, the standard limitations of these rendering methods still exist, resulting in an essential compromise, for a given number of channels, between coverage (a larger array allows a wider range of possible source locations) and density (a denser array better avoids high-frequency distortion due to aliasing).

In view of the foregoing, embodiments are directed to rendering object-based audio content using commonly driven multi-type renderers. For example, in rendering system 100 (see fig. 1), assignment module 110 processes the object-based audio content based on object metadata 152 and speaker configuration information 156 to determine (1) which of the renderers 120 to activate (selection information 162), and (2) the source location to be rendered by each activated renderer (location information 164). Each selected renderer then renders the object audio data 154 according to the location information 164 and generates a rendering signal 166, which the routing module 130 routes to the appropriate speaker in the system. The routing module 130 allows multiple renderers to use a given speaker. In this manner, the rendering system 100 uses the assignment module 110 to assign each audio object to the renderer 120, and the renderer 120 will effectively convey the intended spatial impression in the desired listening area.

For a system with K speakers (k = 1 … K) in which O objects (o = 1 … O) are rendered using R different renderers (r = 1 … R), the output s_k of each speaker k is given by:

s_k(t) = Σ_{o=1…O} Σ_{r=1…R} δ_{k∈r} · w_r(x_o) · [ d_{k,r}(x̃_o) ∗ s_o ](t)

In the above equation:

s_k(t): output signal of loudspeaker k

s_o(t): signal of object o

w_r: activation of renderer r as a function of the object position x_o (may be a real scalar or a real filter)

δ_{k∈r}: indicator function equal to 1 if speaker k is attached to renderer r, and 0 otherwise

d_{k,r}: driving function of loudspeaker k by renderer r as a function of the object position x̃_o (may be a real scalar or a real filter)

x_o: object position according to its metadata

x̃_o: object position used by renderer r to drive object o (may be equal to x_o)

The renderer type of renderer r is reflected in the driving function d_{k,r}. The specific behavior of a given renderer depends on its type and on the set of speakers it drives (determined by δ_{k∈r}). The assignment of a given object to the renderers is controlled by an assignment algorithm, through the activation coefficients w_r and the mapping x̃_o of a given object o in the space controlled by renderer r.

Applying the above equation to the rendering system 100 (see fig. 1), each s_k corresponds to one of the speaker signals 170, s_o corresponds to the object audio data 154 of a given audio object, w_r corresponds to the selection information 162, δ_{k∈r} corresponds to the speaker configuration information 156 (e.g., configuring the routing performed by the routing module 130), d_{k,r} corresponds to the rendering function of each renderer 120, and x_o and x̃_o correspond to the position information 164. w_r and x̃_o can be considered as weights that provide a relative weighting among the selected renderers for a given audio object.

Although the above equations are written in the time domain, exemplary implementations may operate in the frequency domain, for example using a filter bank. Such an implementation may transform the object audio data 154 to the frequency domain, perform the operations of the above equations in the frequency domain (e.g., convolution into multiplication, etc.), and then inverse transform the result to generate the rendered signal 166 or speaker signal 170.
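A minimal numerical sketch of this idea (an illustration of the equation above, not the patent's implementation; block-wise processing such as overlap-add is omitted, and the filter values are placeholders): transform a block of object audio to the frequency domain, apply a renderer's driving filter and activation weight as multiplications, and transform back.

```python
import numpy as np

def render_block(s_o, d_kr_freq, w_r):
    """Render one block of object audio s_o through one renderer/speaker path.

    d_kr_freq: frequency response of the driving function d_{k,r} (length N//2 + 1).
    w_r: real activation weight of renderer r for this object position.
    """
    S_o = np.fft.rfft(s_o)                  # to the frequency domain
    S_k = w_r * d_kr_freq * S_o             # convolution becomes multiplication
    return np.fft.irfft(S_k, n=len(s_o))    # back to the time domain

block = np.random.randn(1024)               # placeholder object audio block
d_kr = np.ones(513, dtype=complex)           # placeholder flat driving filter
contribution = render_block(block, d_kr, 0.7)
```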

Fig. 5A and 5B are top and side views, respectively, of soundbar 500. Soundbar 500 may implement rendering system 100 (see fig. 1). Soundbar 500 includes a plurality of speakers, including a linear array 502 (having 12 speakers 502a, 502b, 502c, 502d, 502e, 502f, 502g, 502h, 502i, 502j, 502k, and 502l) and an upward-firing group 504 (including 2 speakers 504a and 504b). Speaker 502a may be referred to as the leftmost speaker, speaker 502l may be referred to as the rightmost speaker, speaker 504a may be referred to as the upper left speaker, and speaker 504b may be referred to as the upper right speaker. The number of speakers and their arrangement may be adjusted as desired.

Soundbar 500 is suitable for consumer use, for example in a home theater configuration, and may receive its input from a connected television or audio/video receiver. For example, soundbar 500 may be placed above or below the television screen.

Fig. 6A, 6B, and 6C are first top, second top, and side views, respectively, illustrating output coverage of soundbar 500 (see fig. 5A and 5B) in a room. Fig. 6A shows a near field output 602 generated by the linear array 502. The near field output 602 is generally projected outward from the front of the linear array 502. Fig. 6B shows virtual side outputs 604a and 604B generated by the linear array 502 using beamforming. Virtual side outputs 604a and 604b are generated by beamforming against the wall. Fig. 6C shows a virtual top output 606 generated by the upward-firing group 504. (the near-field output 602 of FIG. 6A, which is generally in the plane of the listener, is also shown.) the virtual top output 606 is generated by reflection against the ceiling. For a given audio object, sound bar 500 may combine two or more of these outputs together, for example, using a routing module such as routing module 130 (see fig. 1), in order to reconcile the perceived location of the audio object with its location metadata.

Fig. 7 is a block diagram of a rendering system 700. Rendering system 700 is a specific embodiment of rendering system 100 (see fig. 1) suitable for sound bar 500 (see fig. 5A). The rendering system 700 may be implemented using components of the rendering system 300 (see fig. 3). Like the rendering system 100, the rendering system 700 receives the audio signal 150. The rendering system 700 includes an allocation module 710, four renderers 720a, 720b, 720c, and 720d (collectively, renderers 720), and a routing module 730.

The assignment module 710 receives the object metadata 152 and the speaker configuration information 156 and generates the selection information 162 and the location information 164 in a manner similar to the assignment module 110 (see fig. 1).

The renderers 720 receive the object audio data 154, the speaker configuration information 156, the selection information 162, and the position information 164, and generate rendered signals 766a, 766b, 766c, and 766d (collectively referred to as the rendered signals 766). Other functions of the renderers 720 are similar to those of the renderers 120 (see fig. 1). The renderers 720 include a wave field renderer 720a, a left beamformer 720b, a right beamformer 720c, and a vertical panner 720d. The wave field renderer 720a generates a rendered signal 766a corresponding to the near field output 602 (see fig. 6A). The left beamformer 720b generates a rendered signal 766b corresponding to the virtual side output 604a (see fig. 6B). The right beamformer 720c generates a rendered signal 766c corresponding to the virtual side output 604b (see fig. 6B). The vertical panner 720d generates a rendered signal 766d corresponding to the virtual top output 606 (see fig. 6C).

Routing module 730 receives speaker configuration information 156 and rendering signals 766 and combines rendering signals 766 in a manner similar to routing module 130 (see fig. 1) to generate speaker signals 770a and 770b (collectively speaker signals 770). The routing module 730 combines the rendered signals 766a, 766b, and 766c to generate speaker signals 770a that are provided to the speakers of the linear array 502 (see fig. 5A). The routing module 730 routes the render signal 766d to the speakers of the upward-firing group 504 (see fig. 5A) as the speaker signal 770 b.

As the perceived location of the audio objects changes throughout the listening environment, the assignment module 710 (using the location information 164) performs cross-fading between the various renderers 720 to generate smooth perceived source motion between the different regions of fig. 6A, 6B, and 6C.

Fig. 8A and 8B are a top view and a side view, respectively, illustrating an example of a source distribution of soundbar 500 (see fig. 5A). For a particular audio object in the audio signal 150 (see fig. 1), the object metadata 152 defines a desired perceived location within a virtual cube of size 1×1×1. The virtual cube maps to a cube in the listening environment, for example by the assignment module 110 (see fig. 1) or the assignment module 710 (see fig. 7) using the location information 164.

FIG. 8A shows the horizontal plane (x, y), with point 902 at (0,0), point 904 at (1,0), point 906 at (0, -0.5), and point 908 at (1, -0.5). (These points are marked with an "X".) The perceived position of the audio object is then mapped from the virtual cube to a rectangular area 920 defined by these four points. Note that this plane is only half of the virtual cube in this dimension, and that sources with y > 0.5 (e.g., behind listener position 910) are placed in front of listener position 910 on the line between points 906 and 908. Points 902 and 904 may be considered to be at the front wall of the listening environment. The width of region 920 (e.g., between point 902 and point 904) is approximately aligned with (or slightly within) the sides of the linear array 502 (see also fig. 5A).

FIG. 8B shows the vertical plane (x, z), where point 902 is at (0,0), point 906 is at (-0.5,0), point 912 is at (0,1), and point 916 is at (-0.5,1). The perceived position of the audio object is then mapped from the virtual cube to a rectangular area 930 defined by these four points. As with fig. 8A, in fig. 8B, sources with y > 0.5 (e.g., behind the listener position 910) are placed on the line between point 906 and point 916. Points 912 and 916 may be considered to be at the ceiling of the listening environment. The bottom of the region 930 is aligned with the horizontal plane of the linear array 502.
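As an illustrative sketch of this mapping (coordinate conventions and the function name are assumptions, not taken from the patent), the unit-cube object position can be folded onto the front half-space of figs. 8A and 8B:

```python
def map_to_region(x, y, z):
    """Map a unit-cube object position (x, y, z) to the regions of figs. 8A and 8B.

    x stays in [0, 1] across the width of region 920 (points 902 to 904);
    y is folded so that y > 0.5 (behind the listener) lands on the line between
    points 906 and 908; z in [0, 1] spans from the array height to the ceiling.
    """
    y_folded = min(y, 0.5)
    return x, -y_folded, z      # region 920 uses y in [0, -0.5]; region 930 uses z in [0, 1]
```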

In FIG. 8A, note that the trapezoid 922 in the horizontal plane has its wide base aligned to the side of the area 920 between point 902 and point 904 and its narrow base aligned in front of the listener position 910 (on the line between point 906 and point 908). The system separates sources whose desired perceived location is within the trapezoid 922 from sources outside the trapezoid 922 (but still within the region 920). Within the trapezoid 922, sources are not rendered using the beamformers (e.g., 720b and 720c in fig. 7); instead, they are reproduced using the sound field renderer (e.g., 720a in fig. 7). Outside of the trapezoid 922, sources may be rendered in the horizontal plane using the beamformers (e.g., 720b and 720c) and the sound field renderer (e.g., 720a). In particular, the sound field renderer 720a places the source at the same coordinate y, at the leftmost side of the trapezoid 922 if the source is located on the left side (or at the rightmost side if the source is located on the right side), while the two beamformers 720b and 720c create a stereo phantom source between each other by panning. The left-right panning factor between the two beamformers 720b and 720c may follow a constant-energy amplitude panning rule, mapping x = 0 to only the left beamformer 720b and x = 1 to only the right beamformer 720c. (The assignment module 710 may use the location information 164 to implement this amplitude panning rule, e.g., using weights.) The system applies a constant-energy cross-fading rule between the sound field renderer 720a and the beamformer pair 720b-720c, such that the acoustic energy from the beamformers 720b-720c increases and the acoustic energy from the sound field renderer 720a decreases as the sound source is placed farther from the trapezoid 922. (The assignment module 710 may use the location information 164 to implement this cross-fading rule.)

In the z dimension (see fig. 8B), the system applies a constant-energy cross-fading rule between the combined signals fed to the beamformers 720b-720c and the sound field renderer 720a, and the rendered signal 766d generated by the vertical panner 720d and fed to the upward-firing group 504 (see figs. 5A and 5B). The cross-fading factor is proportional to the z coordinate, with z = 0 corresponding to all signals being rendered by the beamformers 720b-720c and the sound field renderer 720a, and z = 1 corresponding to all signals being rendered using the vertical panner 720d. The rendered signal 766d generated by the vertical panner 720d is distributed between the two channels (to the two speakers 504a and 504b) using a constant-energy amplitude panning rule, mapping x = 0 to only the left speaker 504a and x = 1 to only the right speaker 504b. (The assignment module 710 may use the position information 164 to implement this amplitude panning rule.)
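A minimal sketch of these constant-energy rules (an equal-power cos/sin law is assumed as the specific constant-energy rule; the text above only requires that energy be preserved):

```python
import numpy as np

def constant_energy_pair(t):
    """Equal-power weight pair for t in [0, 1]; the squared weights sum to 1."""
    return np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)

x, z = 0.8, 0.3                                         # example object position coordinates

g_beam_left, g_beam_right = constant_energy_pair(x)    # x = 0 -> left beamformer only
g_horizontal, g_vertical = constant_energy_pair(z)     # z = 0 -> horizontal renderers only
g_up_left, g_up_right = constant_energy_pair(x)        # split across the upward-firing pair
```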

Fig. 9A and 9B are top views illustrating object-based audio (fig. 9A) to speaker array (fig. 9B) mapping. Fig. 9A shows a horizontal square region 1000 defined by a point 1002 at (0,0), a point 1004 at (1,0), a point 1006 at (0,1), and a point 1008 at (1,1). Point 1003 is at (0, 0.5), the midpoint between points 1002 and 1006, and point 1007 is at (1, 0.5), the midpoint between points 1004 and 1008. Point 1005 is at (0.5, 0.5), the center of the square region 1000. Points 1002, 1004, 1012, and 1014 define a trapezoid 1016. Adjacent to the sides of the trapezoid 1016 are two regions 1020 and 1022, which have a width of 0.25 units in the x direction. Adjacent to the sides of regions 1020 and 1022 are triangles 1024 and 1026. According to its metadata (e.g., the object metadata 152 of fig. 1), an audio object may have a desired perceived location within the square region 1000. An example object audio system using the horizontal square region 1000 is the Dolby Atmos™ system.

FIG. 9B illustrates the mapping of a portion of the square region 1000 (see fig. 9A) to a region 1050 defined by points 1052, 1054, 1053, and 1057. Note that only half of the square region 1000 (defined by points 1002, 1004, 1003, and 1007) maps to the region 1050; perceived positions in the other half of the square region 1000 are mapped onto the line between points 1053 and 1057. (This is similar to the description of fig. 8A above.) The speaker array 1059 is within the region 1050; the width of the speaker array 1059 corresponds to the width L of the region 1050. Similar to the square region 1000 (see fig. 9A), the region 1050 includes a trapezoid 1056, two regions 1070 and 1072 adjacent to the sides of the trapezoid, and two triangles 1074 and 1076. Regions 1070 and 1072 correspond to regions 1020 and 1022 (see fig. 9A), while triangles 1074 and 1076 correspond to triangles 1024 and 1026 (see fig. 9A). The wide base of the trapezoid 1056 corresponds to the width L of the region 1050 and the narrow base corresponds to a width l. The height of the trapezoid 1056 is (H − h), where H corresponds to the height of the large triangle containing the trapezoid 1056 and extending from the wide base (width L) to point 1075, and h corresponds to the height of the small triangle extending from the narrow base (width l) to point 1075. As will be described in more detail below, within regions 1070 and 1072, the system implements a constant-energy cross-fade rule between categories of renderers.

More specifically, the output of the speaker array 1059 (see FIG. 9B) can be described as follows. The speaker array 1059 has M speakers, indexed m = 1 ... M from left to right. The loudspeakers are driven in the following manner.

A factor θ_NF/B(x_o, y_o) drives the balance between the near-field wave field synthesis renderer 720a and the beamformers 720b-720c (see FIG. 7). It is defined piecewise using the notation of trapezoid 1056 presented in FIG. 9B; one branch of the definition is

θ_NF/B(x_o, y_o) = |4·x_o - 2| - 2·l/L
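As an illustration only, the following sketch evaluates this branch of the balance factor; clamping the result to [0, 1] and treating values at or below zero as "all near-field renderer" are assumptions of the example, since the full piecewise definition is given in the original figures.

    def theta_nf_b(x_o, l, L):
        # Reconstructed branch: |4*x_o - 2| - 2*l/L, clamped to [0, 1].
        raw = abs(4.0 * x_o - 2.0) - 2.0 * l / L
        return min(max(raw, 0.0), 1.0)

    # Example: object at x_o = 0.9, narrow base one quarter of the array width.
    print(theta_nf_b(0.9, 0.25, 1.0))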

Locating a source in the near field using the wave field renderer 720a follows the rules below. The drive functions are written in the frequency domain. For sources behind the array plane (e.g., behind the speaker array 1059, e.g., on the line between points 1052 and 1054), the drive function is given by equation (1), in which c is the speed of sound. For sources in front of the array plane (e.g., in front of the speaker array 1059), the drive function is given by equation (2); note that only the last term changes.

In these expressions, the last term corresponds to the amplitude and delay control values given by 2.5-D wave field synthesis theory for point sources behind and in front of the array plane (e.g., the plane defined by the speaker array 1059). (For an overview of wave field synthesis theory, see H. Wierstorf, "Perceptual Assessment of Sound Field Synthesis," Technical University of Berlin, 2014.) The other coefficients are defined as follows; an illustrative delay-and-gain sketch follows the definitions.

ω: frequency (in radians/second)

α: a window function limiting the truncation artifacts and enabling local wave field synthesis as a function of source and listening position.

EQm: an equalization filter compensates for loudspeaker response distortion.

PreEQ: a pre-equalization filter that compensates for 2.5 dimensional effects and truncation effects.

An arbitrary listening position.
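Since the exact drive functions of equations (1) and (2) are not reproduced above, the following is only a textbook-style sketch of the delay-and-amplitude part of a 2.5-D wave field synthesis driving function for a point source behind a linear array; the window α, the per-speaker filters EQ_m, and the pre-equalization PreEQ are omitted, and the cosine and inverse-square-root weighting is a common approximation rather than the patent's formula.

    import numpy as np

    def wfs_point_source_delays_gains(speaker_x, source_pos, c=343.0):
        # Speakers lie on the line y = 0 at positions (x_m, 0); the source is
        # behind the array (y_s < 0). Each speaker radiates a delayed, scaled
        # copy of the source signal.
        x_s, y_s = source_pos
        dx = speaker_x - x_s
        dy = 0.0 - y_s
        r = np.sqrt(dx ** 2 + dy ** 2)      # speaker-to-source distances
        delays = r / c                      # propagation delay in seconds
        gains = (dy / r) / np.sqrt(r)       # cos(incidence angle) / sqrt(distance)
        return delays, gains

    # Example: 12 speakers spaced 5 cm apart, source 30 cm behind the array center.
    speaker_x = np.arange(12) * 0.05
    delays, gains = wfs_point_source_delays_gains(speaker_x, (0.275, -0.3))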

With respect to the beamformers 720b-720c, the system pre-computes a set of M/2 speaker delays and amplitudes appropriate for the configuration of the left half of the linear speaker array 1059. In the frequency domain, for each loudspeaker m and frequency ω, this yields filter coefficients B_m(ω). The beamformer drive function for the left half of the loudspeaker array (m = 1 ... M/2) is then a filter defined in the frequency domain that combines B_m(ω) with an equalization filter EQ_m.

in the above equation, EQmIs an equalization filter (the same as the filters in equations (1) and (2)) that compensates for loudspeaker response distortion. The system is designed for symmetrical setup so that one needs to flip the beam filter for the right half of the array to get another beam, so for M/2 … M, there are:

the rendered signal 766d (see FIG. 7) corresponding to the speaker signal 770b provided to the two upward-firing speakers 504a-504b (see FIG. 5) corresponds to the signal s in the following mannerULAnd sUR

According to one embodiment, the vertical translator 720d (see FIG. 7) includes a pre-filtering stage. The pre-filtering stage applies a height-perception filter H whose contribution is proportional to the height coordinate z_0; the filter applied thus depends on the given z_0.
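The exact form of the applied filter is not reproduced above; the linear blend between the unfiltered and H-filtered signal in the sketch below is therefore only one plausible reading of "proportional to z_0," not the patent's formula.

    import numpy as np

    def apply_height_prefilter(signal_spectrum, H, z0):
        # signal_spectrum, H : complex spectra of equal length (e.g., FFT bins)
        # z0                 : height coordinate in [0, 1]
        return (1.0 - z0) * signal_spectrum + z0 * H * signal_spectrum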

FIG. 10 is a block diagram of a rendering system 1100. The rendering system 1100 is a modification of the rendering system 700 (see FIG. 7) suitable for implementation in the soundbar 500 (see FIG. 5A). The rendering system 1100 may be implemented using components of the rendering system 300 (see FIG. 3). The components of the rendering system 1100 are similar to the components of the rendering system 700, and similar reference numerals are used. The rendering system 1100 also includes a second pair of beamformers 1120e and 1120f. The left beamformer 1120e generates a rendered signal 1166d and the right beamformer 1120f generates a rendered signal 1166e, which the routing module 730 combines with the other rendered signals 766a, 766b, and 766c to generate the speaker signals 770a. When the beamformer outputs are considered separately, the left beamformer 1120e creates a virtual left rear source and the right beamformer 1120f creates a virtual right rear source, as shown in FIG. 11.
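The combining step performed by the routing module amounts to summing, per loudspeaker, the component signals contributed by each renderer (the superposition noted later in this description). A minimal sketch follows, with the dictionary-based interface being an assumption for illustration.

    import numpy as np

    def route(rendered_signals):
        # rendered_signals: list of dicts mapping speaker index -> sample array.
        # Returns one summed feed per speaker.
        speaker_feeds = {}
        for rendered in rendered_signals:
            for speaker, samples in rendered.items():
                if speaker in speaker_feeds:
                    speaker_feeds[speaker] = speaker_feeds[speaker] + samples
                else:
                    speaker_feeds[speaker] = np.asarray(samples, dtype=float)
        return speaker_feeds

    # Example: two renderers each contribute to speakers 0 and 1.
    feeds = route([
        {0: np.zeros(480), 1: 0.1 * np.ones(480)},   # e.g., sound field renderer
        {0: 0.2 * np.ones(480), 1: np.zeros(480)},   # e.g., rear beamformer
    ])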

FIG. 11 is a top view showing the output coverage of the beamformers 1120e and 1120f implemented in a soundbar 500 (see FIGS. 5A and 5B) in a room. (The output coverage of the other renderers of the rendering system 1100 is as shown in FIGS. 6A-6C.) The virtual left rear output 1206a comes from the left beamformer 1120e (see FIG. 10), which generates signals reflected from the left and rear walls of the room. The virtual right rear output 1206b comes from the right beamformer 1120f (see FIG. 10), which generates signals reflected from the right and rear walls of the room. (Note the triangular regions of outputs 1206a and 1206b that overlap behind the listener.) For a given audio object, the soundbar 500 may combine the output coverage of FIG. 11 with one or more of the output coverages of FIGS. 6A-6C, e.g., using a routing module such as the routing module 730 (see FIG. 10).

The output coverages of FIGS. 6A-6C and FIG. 11 show how a soundbar 500 (see FIGS. 5A and 5B) can be used in place of the speakers in a conventional 7.1-channel (or 7.1.2-channel) surround sound system. The left, center, and right speakers of the 7.1-channel system may be replaced by the linear array 502 driven by the sound field renderer 720a (see FIG. 7), generating the output coverage shown in FIG. 6A. The top speakers of the 7.1.2-channel system may be replaced by the upward-firing group 504 driven by the vertical translator 720d, resulting in the output coverage shown in FIG. 6C. The left and right surround speakers of the 7.1-channel system may be replaced by the linear array 502 driven by the beamformers 720b and 720c, producing the output coverage shown in FIG. 6B. The left and right rear surround speakers of the 7.1-channel system may be replaced by the linear array 502 driven by the beamformers 1120e and 1120f (see FIG. 10), producing the output coverage shown in FIG. 11. As described above, the system enables multiple renderers to render audio objects according to their combined output coverage in order to generate the appropriate perceived locations for the audio objects.
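One way to summarize this substitution is as a static assignment of bed channels to renderer types; the channel names and the dictionary below are illustrative conventions, not taken from the description.

    # Hypothetical 7.1.2 bed-channel-to-renderer assignment for the soundbar.
    CHANNEL_TO_RENDERER = {
        "L":   "sound_field",         # front left   -> near-field WFS (FIG. 6A coverage)
        "C":   "sound_field",         # center       -> near-field WFS
        "R":   "sound_field",         # front right  -> near-field WFS
        "Ls":  "side_beamformer",     # left surround  -> wall-reflected beam (FIG. 6B)
        "Rs":  "side_beamformer",     # right surround -> wall-reflected beam
        "Lrs": "rear_beamformer",     # left rear surround  -> rear beam (FIG. 11)
        "Rrs": "rear_beamformer",     # right rear surround -> rear beam
        "Ltf": "vertical_translator", # left top  -> up-firing group (FIG. 6C)
        "Rtf": "vertical_translator", # right top -> up-firing group
    }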

In summary, the system described herein has the following advantages: the highest-resolution rendering (e.g., the near-field renderer) is placed at the front, where most movie content is expected (because this position matches the screen position) and where human localization accuracy is highest, while the rear, side, and height rendering remains coarser, which may be less important for typical movie content. Many of these systems also remain relatively compact and may be judiciously integrated with typical video equipment (e.g., above or below a television screen). One feature to keep in mind is that, owing to the superposition principle (e.g., by combining signals in the routing modules), the speaker array can be used to generate a large number of beams simultaneously to create more complex systems.

In addition to the output coverage shown above, further configurations may use other combinations of renderers to model other speaker settings.

FIG. 12 is a top view of a soundbar 1200. The soundbar 1200 may implement the rendering system 100 (see FIG. 1). The soundbar 1200 is similar to the soundbar 500 (see FIG. 5A) and includes a linear array 502 (having 12 speakers 502a, 502b, 502c, 502d, 502e, 502f, 502g, 502h, 502i, 502j, 502k, and 502l) and an upward-firing group 504 (including 2 speakers 504a and 504b). The soundbar 1200 also includes two side-firing speakers 1202a and 1202b, where speaker 1202a is referred to as the left side-firing speaker and speaker 1202b is referred to as the right side-firing speaker.

In contrast to the soundbar 500 (see FIG. 5A), the soundbar 1200 uses the side-firing speakers 1202a and 1202b to generate the virtual side outputs 604a and 604b (see FIG. 6B).

Fig. 13 is a block diagram of a rendering system 1300. Rendering system 1300 is a modification of rendering system 1100 (see fig. 10) suitable for implementation in sound bar 1200 (see fig. 12). The rendering system 1300 may be implemented using components of the rendering system 300 (see fig. 3). The components of the rendering system 1300 are similar to the components of the rendering system 1100 and similar reference numerals are used. In contrast to rendering system 1100, rendering system 1300 replaces beamformers 720b and 720c with binaural renderer 1320.

The binaural renderer 1320 receives the speaker configuration information 156, the object audio data 154, the selection information 162, and the position information 164. The binaural renderer 1320 performs binaural rendering on the object audio data 154 and generates a left binaural signal 1366b and a right binaural signal 1366c. Considering only the side-firing speakers 1202a and 1202b (see FIG. 12), the left binaural signal 1366b generally corresponds to the output of the left side-firing speaker 1202a, and the right binaural signal 1366c generally corresponds to the output of the right side-firing speaker 1202b. (Note that the routing module 730 then combines the binaural signals 1366b and 1366c with the other rendered signals 766 to generate the speaker signals 770 for the entire set of speakers 502, 504, and 1202.)

FIG. 14 is a block diagram of a renderer 1400. The renderer 1400 may correspond to one or more of the renderers discussed above, such as the renderer 120 (see FIG. 1), the renderer 720 (see FIG. 7), the renderer 1120 (see FIG. 10), and so on. The renderer 1400 illustrates that a renderer may itself include more than one renderer as components. As shown here, the renderer 1400 includes a renderer 1402 in series with a renderer 1404. Although two renderers 1402 and 1404 are shown, the renderer 1400 may include additional renderers configured in various serial and parallel arrangements. The renderer 1400 receives the speaker configuration information 156, the selection information 162, and the position information 164; the renderer 1400 may provide these signals to one or more of the renderers 1402 and 1404, depending on their particular configuration.

The renderer 1402 receives the object audio data 154 and one or more of the speaker configuration information 156, the selection information 162, and the location information 164. The renderer 1402 performs rendering on the object audio data 154 and generates a rendering signal 1410. Rendering signal 1410 typically corresponds to an intermediate rendering signal. For example, the rendering signal 1410 may be a virtual speaker feed signal.

The renderer 1404 receives the rendering signal 1410 and one or more of the speaker configuration information 156, the selection information 162, and the location information 164. The renderer 1404 performs rendering on the rendering signal 1410 and generates a rendering signal 1412. Rendering signal 1412 corresponds to the rendering signals discussed above, such as rendering signal 166 (see fig. 1), rendering signal 766 (see fig. 7), rendering signal 1166 (see fig. 10), and so on. The renderer 1400 may then provide the render signal 1412 to a routing module (e.g., the routing module 130 of fig. 1, the routing module 730 of fig. 7 or 10 or 13), and so on, in a manner similar to that discussed above.
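The serial arrangement can be thought of as function composition: the first stage produces an intermediate rendering (such as a virtual speaker feed), and the second stage renders that intermediate result. A minimal sketch follows, with the callable interface being an assumption for illustration.

    class SerialRenderer:
        # Chains two renderer stages; each stage is a callable that takes
        # (signals, position) and returns rendered signals.
        def __init__(self, first, second):
            self.first = first
            self.second = second

        def render(self, audio, position):
            intermediate = self.first(audio, position)   # e.g., virtual speaker feeds
            return self.second(intermediate, position)   # e.g., final rendered signals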

Generally, the renderers 1402 and 1404 are of different types in a manner similar to that discussed above. For example, these types may include amplitude translators, vertical translators, wave field renderers, binaural renderers, and beamformers. A specific example configuration is shown in fig. 15.

Fig. 15 is a block diagram of a renderer 1500. Renderer 1500 may correspond to one or more of the renderers discussed above, such as renderer 120 (see fig. 1), renderer 720 (see fig. 7), renderer 1120 (see fig. 10), renderer 1400 (see fig. 14), and so forth. Renderer 1500 includes an amplitude translator 1502, a number N of binaural renderers 1504 (three are shown: 1504a, 1504b, and 1504c), and a number M of beamformer groups, including a plurality of left beamformers 1506 (three are shown: 1506a, 1506b, and 1506c) and right beamformers 1508 (three are shown: 1508a, 1508b, and 1508 c).

The amplitude translator 1502 receives the object audio data 154, the selection information 162, and the position information 164. The amplitude translator 1502 performs rendering on the object audio data 154 and generates virtual speaker feeds 1520 (three shown: 1520a, 1520b, and 1520c) in a manner similar to other amplitude translators described herein. The virtual speaker feeds 1520 may correspond to canonical speaker feed signals, such as 5.1 channel surround signals, 7.1 channel surround signals, 7.1.2 channel surround signals, 7.1.4 channel surround signals, 9.1 channel surround signals, and so on. The virtual speaker feeds 1520 are referred to as "virtual" because they need not be provided directly to the actual speakers, but may be provided to other renderers in the renderer 1500 for further processing.

The details of the virtual speaker feeds 1520 may vary in various embodiments and implementations of the renderer 1500. For example, when virtual speaker feed 1520 includes a low frequency effects channel signal, amplitude translator 1502 may provide the channel signal directly to one or more speakers (e.g., bypassing binaural renderer 1504 and beamformers 1506 and 1508). As another example, when the virtual speaker feed 1520 includes a center channel signal, the amplitude translator 1502 may provide the channel signal directly to one or more speakers, or may provide the signal directly to one of a set of left beamformers 1506 and one of right beamformers 1508 (e.g., bypassing the binaural renderer 1504).
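The channel-dependent bypasses described above can be expressed as a simple routing decision per virtual feed; the channel names and the returned tags below are illustrative assumptions.

    def route_virtual_feed(channel_name, feed):
        # Decide whether a virtual speaker feed goes through the binaural and
        # beamforming stages or bypasses one or both of them.
        if channel_name == "LFE":
            return ("direct_to_speakers", feed)      # bypass binaural + beamformers
        if channel_name == "C":
            return ("direct_to_beamformers", feed)   # may skip the binaural stage
        return ("binaural_then_beamform", feed)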

The binaural renderer 1504 receives the virtual speaker feeds 1520 and the speaker configuration information 156. (typically, the number N of binaural renderers 1504 depends on the details of the embodiment of the renderer 1500, e.g. the number of virtual speaker feeds 1520, the type of virtual speaker feeds, etc., as described above.) the binaural renderer 1504 performs rendering on the virtual speaker feeds 1520 and generates a left binaural signal 1522 (three are shown: 1522a, 1522b, and 1522c) and a right binaural signal 1524 (three are shown: 1524a, 1524b, and 1524c) in a manner similar to the other binaural renderers described herein.

The left beamformers 1506 receive the left binaural signals 1522 and the speaker configuration information 156, and the right beamformers 1508 receive the right binaural signals 1524 and the speaker configuration information 156. Each left beamformer 1506 may receive one or more left binaural signals 1522, and each right beamformer 1508 may receive one or more right binaural signals 1524, again depending on the details of the embodiment of the renderer 1500 as described above. (These one-or-more relationships are indicated by the dashed lines of 1522 and 1524 in FIG. 15.) The left beamformers 1506 perform rendering on the left binaural signals 1522 and generate rendered signals 1566 (three shown: 1566a, 1566b, and 1566c). The right beamformers 1508 perform rendering on the right binaural signals 1524 and generate rendered signals 1568 (three shown: 1568a, 1568b, and 1568c). The beamformers 1506 and 1508 otherwise operate in a manner similar to the other beamformers described in the present disclosure. The rendered signals 1566 and 1568 correspond to the rendered signals discussed above, such as the rendered signal 166 (see FIG. 1), the rendered signal 766 (see FIG. 7), the rendered signal 1166 (see FIG. 10), the rendered signal 1412 (see FIG. 14), and so on.

The renderer 1500 may then provide the rendered signals 1566 and 1568 to a routing module (e.g., the routing module 130 of FIG. 1, or the routing module 730 of FIG. 7, 10, or 13), and so on, in a manner similar to that discussed above.

As described above, the number M of left beamformers 1506 and right beamformers 1508 depends on the details of the embodiment of the renderer 1500. For example, the number M may vary based on the form factor of the device that includes the renderer 1500, based on the number of speaker arrays connected to the renderer 1500, based on the capabilities and arrangement of those speaker arrays, and so on. As a general criterion, the number M (of beamformers 1506 and 1508) may be less than or equal to the number N (of binaural renderers 1504). As another general criterion, the number of separate speaker arrays may be less than or equal to twice the number N (of binaural renderers 1504). As one example form factor, a device may have physically separate left and right speaker arrays, where the left speaker array generates all left beams and the right speaker array generates all right beams. As another example form factor, a device may have physically separate front and rear speaker arrays, where the front speaker array generates left and right beams for all front binaural signals and the rear speaker array generates left and right beams for all rear binaural signals.
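The cascade in renderer 1500 can be sketched end-to-end as follows; the one-to-one pairing of each binaural renderer with one beamformer pair is a simplification (the description allows a beamformer to receive more than one binaural signal), and all stage callables are assumed interfaces rather than the patent's APIs.

    def render_object_1500_style(audio, position, amplitude_panner,
                                 binaural_renderers, beam_pairs):
        # Stage 1: amplitude panning to N virtual speaker feeds.
        virtual_feeds = amplitude_panner(audio, position)
        rendered = []
        # Stages 2 and 3: binaural rendering, then left/right beamforming.
        for feed, binaural, (beam_left, beam_right) in zip(
                virtual_feeds, binaural_renderers, beam_pairs):
            left, right = binaural(feed)
            rendered.append(beam_left(left))
            rendered.append(beam_right(right))
        return rendered   # passed on to a routing module for per-speaker summation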

FIG. 16 is a block diagram of a rendering system 1600. The rendering system 1600 is similar to the rendering system 100 (see FIG. 1), except that the renderer 120 (see FIG. 1) is replaced by a renderer arrangement similar to that of the renderer 1500 (see FIG. 15); there are also differences associated with the assignment module 110 (see FIG. 1). The rendering system 1600 includes an amplitude translator 1602, a number N of binaural renderers 1604 (three are shown: 1604a, 1604b, and 1604c), a number M of beamformer groups including a plurality of left beamformers 1606 (three are shown: 1606a, 1606b, and 1606c) and right beamformers 1608 (three are shown: 1608a, 1608b, and 1608c), and a routing module 1630.

The amplitude translator 1602 receives the object metadata 152 and the object audio data 154, renders the object audio data 154 according to the position information in the object metadata 152, and generates virtual speaker feeds 1620 (three shown: 1620a, 1620b, and 1620c) in a manner similar to the other amplitude translators described in the present disclosure. Likewise, the details of the virtual speaker feeds 1620 may vary in various embodiments and implementations of the rendering system 1600, in a manner similar to that described above with respect to the renderer 1500 (see FIG. 15). (In contrast to the rendering system 100 (see FIG. 1), the rendering system 1600 omits the assignment module 110 and instead uses the amplitude translator 1602 to weight the virtual speaker feeds 1620 provided to the binaural renderers 1604.)

The binaural renderer 1604 receives the virtual speaker feeds 1620 and the speaker configuration information 156. (in general, the number N of binaural renderers 1604 depends on the details of the embodiment of the rendering system 1600, e.g., the number of virtual speaker feeds 1620, the type of virtual speaker feeds, etc., as described above.) the binaural renderer 1604 performs rendering on the virtual speaker feeds 1620 and generates left binaural signals 1622 (three are shown: 1622a, 1622b, and 1622c) and right binaural signals 1624 (three are shown: 1624a, 1624b, and 1624c) in a manner similar to other binaural renderers described in this disclosure.

The left beamformer 1606 receives the left binaural signal 1622 and the speaker configuration information 156, and the right beamformer 1608 receives the right binaural signal 1624 and the speaker configuration information 156. Each left beamformer 1606 may receive one or more left binaural signals 1622 and each right beamformer 1608 may receive one or more right binaural signals 1624, again depending on the details of the embodiment of the rendering system 1600 described above. (these one or more relationships are indicated by the dashed lines 1622 and 1624 in fig. 16.) the left beamformer 1606 performs rendering on the left binaural signal 1622 and generates a rendered signal 1666 (three are shown: 1666a, 1666b, and 1666 c). The right beamformer 1608 performs rendering on the right binaural signal 1624 and generates a rendered signal 1668 (three are shown: 1668a, 1668b, and 1668 c). The beamformers 1606 and 1608 additionally operate in a manner similar to the other beamformers described in this disclosure.

Routing module 1630 receives speaker configuration information 156, render signal 1666, and render signal 1668. The routing module 1630 generates the speaker signal 1670 in a manner similar to other routing modules described in this disclosure.

Fig. 17 is a flow diagram of a method 1700 of audio processing. The method 1700 may be performed by the rendering system 1600 (see fig. 16). Method 1700 may be implemented by one or more computer programs, such as executed by rendering system 1600 to control its operations.

At step 1702, one or more audio objects are received. Each audio object comprises position information, respectively. As an example, the rendering system 1600 (see fig. 16) may receive the audio signal 150, which includes the object metadata 152 and the object audio data 154. For each audio object, the method continues to step 1704.

At step 1704, for a given audio object, the given audio object is rendered using a first category of renderers based on the location information to generate a first plurality of signals. For example, the amplitude translator 1602 (see FIG. 16) may render a given audio object (in the object audio data 154) based on the location information (in the object metadata 152) to generate the virtual speaker feeds 1620.

At step 1706, for the given audio object, the first plurality of signals is rendered using a second category of renderers to generate a second plurality of signals. For example, the binaural renderer 1604 (see fig. 16) may render the virtual speaker feed 1620 to generate a left binaural signal 1622 and a right binaural signal 1624.

At step 1708, for the given audio object, the second plurality of signals is rendered using a third category of renderers to generate a third plurality of signals. For example, the left beamformer 1606 may render the left binaural signal 1622 to generate a rendered signal 1666, and the right beamformer 1608 may render the right binaural signal 1624 to generate a rendered signal 1668.

At step 1710, the third plurality of signals is combined to generate a plurality of speaker signals. For example, routing module 1630 (see fig. 16) may combine render signal 1666 and render signal 1668 to generate speaker signal 1670.

At step 1712, a plurality of speaker signals are output from a plurality of speakers (see step 1708).

The operation of method 1700 is similar when multiple audio objects are to be output simultaneously. For example, multiple paths of steps 1704-1706-1708 may be used to process multiple given audio objects in parallel, wherein rendering signals corresponding to the multiple audio objects are combined (see step 1710) to generate a speaker signal.

As another example, multiple given audio objects may be processed by combining the rendered signals for each audio object at the output of one or more rendering stages. Applying this example to the rendering system 1600 (see FIG. 16), the amplitude translator 1602 may render a plurality of given audio objects, each virtual speaker feed 1620 then corresponds to a combined rendering of the plurality of given audio objects, and the binaural renderers 1604 and the beamformers 1606 and 1608 operate on that combined rendering.
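A sketch of this early-combination variant is given below; the callable interfaces and the per-channel list representation of the feeds are assumptions for illustration.

    def render_objects_combined(objects, amplitude_panner, downstream):
        # objects   : iterable of (audio, position) pairs
        # downstream: callable applying the binaural and beamforming stages
        combined = None
        for audio, position in objects:
            feeds = amplitude_panner(audio, position)   # list of per-channel feeds
            if combined is None:
                combined = list(feeds)
            else:
                combined = [c + f for c, f in zip(combined, feeds)]
        return downstream(combined)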

Details of the implementation

Embodiments may be implemented in hardware, executable modules stored on a computer-readable medium, or a combination of both (e.g., programmable logic arrays). Unless otherwise specified, the steps performed by an embodiment need not be inherently related to any particular computer or other apparatus, although they may be in some embodiments. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus (e.g., an integrated circuit) to perform the required method steps. Thus, embodiments may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices in a known manner.

Each such computer program is preferably stored on or downloaded to a storage medium or device (e.g., solid-state memory or media, or magnetic or optical media) readable by a general- or special-purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein. (Software per se and intangible or transitory signals are excluded to the extent that they are non-statutory subject matter.)

The above description illustrates various embodiments of the invention and examples of how aspects of the invention may be practiced. The above examples and embodiments should not be considered the only embodiments, but rather, to illustrate the flexibility and advantages of the present invention as defined by the following claims. Based on the foregoing disclosure and the following claims, other arrangements, embodiments, implementations and equivalents will be apparent to those skilled in the art and may be employed without departing from the spirit and scope of the invention as defined by the claims.

The various aspects of the invention can be understood from the following exemplified examples (EEEs):

1. a method of audio processing, the method comprising:

receiving one or more audio objects, wherein each of the one or more audio objects respectively includes location information;

for a given audio object of the one or more audio objects:

selecting at least two renderers of a plurality of renderers based on the position information of the given audio object, wherein the at least two renderers have at least two categories;

determining at least two weights based on the position information of the given audio object;

rendering the given audio object using the at least two renderers weighted according to the at least two weights based on the position information to generate a plurality of rendered signals; and

combining the plurality of rendered signals to generate a plurality of speaker signals; and

outputting the plurality of speaker signals from a plurality of speakers.

2. The method of EEE 1, wherein the at least two categories include a sound field renderer, a beamformer, a pan, and a binaural renderer.

3. The method of any of EEEs 1-2, wherein a given rendering signal of the plurality of rendering signals includes at least one component signal,

wherein each of the at least one component signals is associated with a respective one of the plurality of speakers; and

wherein, for a given speaker of the plurality of speakers, a given speaker signal of the plurality of speaker signals corresponds to combining all of the at least one component signals associated with the given speaker.

4. The method of EEE 3, wherein a first renderer generates a first rendering signal, wherein the first rendering signal includes a first component signal associated with a first speaker and a second component signal associated with a second speaker,

wherein a second renderer generates a second rendering signal, wherein the second rendering signal comprises a third component signal associated with the first speaker and a fourth component signal associated with the second speaker,

wherein a first speaker signal associated with the first speaker corresponds to combining the first component signal and the third component signal, an

Wherein the second speaker signal associated with the second speaker corresponds to combining the second component signal and the fourth component signal.

5. The method of any of EEEs 1-4, wherein rendering the given audio object comprises: for a given renderer of the plurality of renderers, applying a gain based on the location information to generate a given rendering signal of the plurality of rendering signals.

6. The method of any one of EEEs 1-5, wherein the plurality of speakers comprises a dense linear array of speakers.

7. The method according to any of EEEs 1-6, wherein the at least two classes comprise a sound field renderer, wherein the sound field renderer performs a wave field synthesis process.

8. The method according to any one of EEEs 1-7, wherein the plurality of loudspeakers is arranged in a first group pointing in a first direction and a second group pointing in a second direction different from the first direction.

9. The method of EEE 8, wherein the first direction includes a forward component and the second direction includes a vertical component.

10. The method of EEE 8, wherein the second direction includes a vertical component, wherein the at least two renderers include a wave field synthesis renderer and an upward-firing panning renderer, and wherein the wave field synthesis renderer and the upward-firing panning renderer generate the plurality of rendering signals for the second group.

11. The method of EEE 8, wherein the second direction includes a vertical component, wherein the at least two renderers include a wave field synthesis renderer, an upward-firing panning renderer, and a beamformer, and wherein the wave field synthesis renderer, the upward-firing panning renderer, and the beamformer generate the plurality of rendering signals for the second group.

12. The method of EEE 8, wherein the second direction includes a vertical component, wherein the at least two renderers include a wave field synthesis renderer, an upward-firing panning renderer, and a lateral-firing panning renderer, and wherein the wave field synthesis renderer, the upward-firing panning renderer, and the lateral-firing panning renderer generate the plurality of rendering signals for the second group.

13. The method of EEE 8, wherein the first direction includes a forward component and the second direction includes a lateral component.

14. The method of EEE 8, wherein the first direction comprises a forward component, wherein the at least two renderers comprise wave field synthesis renderers, and wherein the wave field synthesis renderer generates the plurality of rendering signals for the first group.

15. The method of EEE 8, wherein the second direction comprises a lateral component, wherein the at least two renderers comprise a wave field synthesis renderer and a beamformer, and wherein the wave field synthesis renderer and the beamformer generate the plurality of rendering signals for the second group.

16. The method of EEE 8, wherein the second direction comprises a lateral component, wherein the at least two renderers comprise a wave field synthesis renderer and a lateral-firing panning renderer, and wherein the wave field synthesis renderer and the lateral-firing panning renderer generate the plurality of rendering signals for the second group.

17. The method according to any one of EEEs 1-16, further comprising:

combining the plurality of rendered signals for the one or more audio objects to generate the plurality of speaker signals.

18. The method of any of EEEs 1-17, wherein the at least two renderers include serially connected renderers.

19. The method according to any of EEEs 1-18, wherein the at least two renderers include an amplitude translator, a plurality of binaural renderers, and a plurality of beamformers;

wherein the amplitude translator is configured to render a given audio object based on the position information to generate a first plurality of signals;

wherein the plurality of binaural renderers are configured to render the first plurality of signals to generate a second plurality of signals;

wherein the plurality of beamformers are configured to render the second plurality of signals to generate a third plurality of signals; and

wherein the third plurality of signals are combined to generate the plurality of speaker signals.

20. An audio processing apparatus, the apparatus comprising:

a plurality of speakers;

a processor; and

a memory,

wherein the processor is configured to control the apparatus to receive one or more audio objects, wherein each of the one or more audio objects comprises position information, respectively;

for a given audio object of the one or more audio objects:

the processor is configured to control the apparatus to select at least two renderers of a plurality of renderers based on the position information of the given audio object, wherein the at least two renderers have at least two categories;

the processor is configured to control the apparatus to determine at least two weights based on the position information of the given audio object;

the processor is configured to control the apparatus to render the given audio object using the at least two renderers weighted according to the at least two weights based on the position information to generate a plurality of rendered signals; and

the processor is configured to control the apparatus to combine the plurality of rendered signals to generate a plurality of speaker signals; and

wherein the processor is configured to control the apparatus to output the plurality of speaker signals from the plurality of speakers.

21. A method of audio processing, the method comprising:

receiving one or more audio objects, wherein each of the one or more audio objects respectively includes location information;

for a given audio object of the one or more audio objects:

based on the position information, rendering the given audio object using a first class of renderers to generate a first plurality of signals;

rendering the first plurality of signals using a second category of renderers to generate a second plurality of signals;

rendering the second plurality of signals using a third category of renderers to generate a third plurality of signals; and

combining the third plurality of signals to generate a plurality of speaker signals; and

outputting the plurality of speaker signals from a plurality of speakers.

22. The method of EEE 21, wherein the first category of renderers corresponds to an amplitude translator, wherein the second category of renderers corresponds to a plurality of binaural renderers, and wherein the third category of renderers corresponds to a plurality of beamformers.

23. A non-transitory computer readable medium storing a computer program that, when executed by a processor, controls a device to perform a process comprising the method of any one of EEEs 1-19, 21, or 22.

24. An audio processing apparatus, the apparatus comprising:

a plurality of speakers;

a processor; and

a memory,

wherein the processor is configured to control the apparatus to receive one or more audio objects, wherein each of the one or more audio objects comprises position information, respectively;

for a given audio object of the one or more audio objects:

the processor is configured to control the apparatus to render the given audio object using a first category of renderers to generate a first plurality of signals based on the location information;

the processor is configured to control the apparatus to render the first plurality of signals using a second category of renderers to generate a second plurality of signals;

the processor is configured to control the apparatus to render the second plurality of signals using a third category of renderers to generate a third plurality of signals; and

the processor is configured to control the apparatus to combine the third plurality of signals to generate a plurality of speaker signals; and

wherein the processor is configured to control the apparatus to output the plurality of speaker signals from the plurality of speakers.

Reference to the literature

U.S. application publication No. 2016/0300577

U.S. application publication No. 2017/0048640

International Application Publication No. WO 2017/087564 A1

U.S. application publication No. 2015/0245157

H. Wittek, F. Rumsey, and G. Theile, "Perceptual Enhancement of Wavefield Synthesis by Stereophonic Means," Journal of the Audio Engineering Society, vol. 55, no. 9, pp. 723-751, 2007

U.S. Pat. No. 7,515,719

U.S. application publication No. 2015/0350804

Montag, "Wave Field Synthesis in Three Dimensions by Multiple Line Arrays," University of Miami, 2011

R. Ranjan and W. S. Gan, "A Hybrid Speaker Array-Headphone System for Immersive 3D Audio Reproduction," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1836-1840, April 2015

V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456-466, 1997


H. Wierstorf, "Perceptual Assessment of Sound Field Synthesis," Technical University of Berlin, 2014
