Three-dimensional audio source spatialization

Document No.: 1866606. Publication date: 2021-11-19.

Note: This technology, Three-dimensional audio source spatialization, was designed and created by 约瑟夫·德洛热 on 2019-06-12. Abstract: Techniques for delivering audio in a telepresence system include specifying a frequency threshold below which Crosstalk Cancellation (CC) is used and above which VBAP is used. In some embodiments, such a frequency threshold is between 1000 Hz and 2000 Hz. Further, in some embodiments, the improved technique includes modifying VBAP for more than three loudspeakers by forming an overdetermined system to determine amplitude weights for all loudspeakers at once.

1. A method, comprising:

receiving, by a processing circuit configured to perform audio source spatialization, audio data from an audio source at a source location, the audio data representing an audio waveform configured to be converted into sound of a certain frequency via a plurality of loudspeakers heard at a listener location, each of the plurality of loudspeakers having a respective loudspeaker location;

in response to the frequency of the audio signal being below a specified threshold, performing, by the processing circuit, a Crosstalk Cancellation (CC) operation on the plurality of loudspeakers to produce amplitudes and phases of respective audio signals emitted by the loudspeakers to determine spatialization cues; and

in response to the frequency of the audio signal being above the specified threshold, performing, by the processing circuit, a vector-based amplitude panning, VBAP, operation on the plurality of loudspeakers to produce respective weights for the loudspeakers, the respective weights for each of the plurality of loudspeakers representing factors with which the audio signal emitted by the loudspeaker is multiplied to determine spatialization cues.

2. The method of claim 1, wherein performing the CC operation on the plurality of loudspeakers comprises: tracking the position and orientation of the listener over time.

3. The method of claim 1 or 2, wherein the number of loudspeakers of the plurality of loudspeakers is an even number, and

wherein performing the CC operation on the plurality of loudspeakers comprises: applying, to pairs of the loudspeakers, head-related transfer functions (HRTFs) configured to provide the listener with a binaural sound field, the HRTFs being based on a parametric rigid spherical model.

4. The method of any of claims 1 to 3, wherein the specified threshold is between 1000Hz and 2000 Hz.

5. The method of any of claims 1 to 4, wherein performing the VBAP operation on the plurality of loudspeakers comprises:

generating a loudspeaker matrix having elements that are components of vectors parallel to differences between the listener position and respective loudspeaker positions of each of the plurality of loudspeakers;

generating a source vector having elements that are components of a vector parallel to a difference between the listener position and the source position; and

performing a pseudo-inverse operation on the loudspeaker matrix and the source vector to produce a weight vector having components, each component of the weight vector representing a respective weight for each of the plurality of loudspeakers.

6. The method of claim 5, wherein a distance between the listener and a first loudspeaker of the plurality of loudspeakers is different than a distance between the listener and a second loudspeaker of the plurality of loudspeakers.

7. The method of claim 5 or 6, wherein the number of loudspeakers of the plurality of loudspeakers is greater than three, and

wherein performing the pseudo-inverse operation on the loudspeaker matrix and the source vector comprises: generating a product of the Penrose pseudo-inverse of the loudspeaker matrix and the source vector.

8. The method of claim 7, wherein performing the pseudo-inverse operation on the loudspeaker matrix and the source vector further comprises: minimizing the sum of the squares of the components of the weight vector.

9. The method according to claim 7 or 8, wherein a component of the weight vector is less than zero, and

wherein the method further comprises:

removing the elements of the loudspeaker matrix corresponding to the loudspeakers for which the components of the weight vector are less than zero, to form a reduced loudspeaker matrix; and

performing the pseudo-inverse operation on the reduced loudspeaker matrix and the source vector to produce a reduced weight vector.

10. The method of any of claims 5 to 9, further comprising: multiplying each of the components of the weight vector by a respective scaling factor proportional to the distance between the listener and the loudspeaker of the plurality of loudspeakers to which that component of the weight vector corresponds.

11. A computer program product comprising a non-transitory storage medium, the computer program product comprising code that, when executed by processing circuitry configured to perform audio source spatialization, causes the processing circuitry to perform a method, the method comprising:

receiving audio data from an audio source at a source location, the audio data representing an audio waveform configured to be converted into sound of a certain frequency via a plurality of loudspeakers heard at a listener location, each of the plurality of loudspeakers having a respective loudspeaker location;

generating a loudspeaker matrix having elements that are components of vectors parallel to differences between the listener position and respective loudspeaker positions of each of the plurality of loudspeakers;

generating a source vector having elements that are components of a vector parallel to a difference between the listener position and the source position; and

performing a pseudo-inverse operation on the loudspeaker matrix and the source vector to produce a weight vector having components, each component of the weight vector representing a respective weight for each of the plurality of loudspeakers.

12. The computer program product of claim 11, wherein a distance between the listener and a first loudspeaker of the plurality of loudspeakers is different than a distance between the listener and a second loudspeaker of the plurality of loudspeakers.

13. The computer program product of claim 11 or 12, wherein the number of loudspeakers in the plurality of loudspeakers is greater than three, and

wherein performing the pseudo-inverse operation on the loudspeaker matrix and the source vector comprises: generating a product of the Penrose pseudo-inverse of the loudspeaker matrix and the source vector.

14. The computer program product of claim 13, wherein performing the pseudo-inverse operation on the loudspeaker matrix and the source vector further comprises: minimizing the sum of the squares of the components of the weight vector.

15. The computer program product according to claim 13 or 14, wherein a component of the weight vector is less than zero, and

wherein the method further comprises:

removing the elements of the loudspeaker matrix corresponding to the loudspeakers for which the components of the weight vector are less than zero, to form a reduced loudspeaker matrix; and

performing the pseudo-inverse operation on the reduced loudspeaker matrix and the source vector to produce a reduced weight vector.

16. The computer program product of any of claims 11 to 15, further comprising: multiplying each of the components of the weight vector by a respective scaling factor proportional to the distance between the listener and the loudspeaker of the plurality of loudspeakers to which that component of the weight vector corresponds.

17. The computer program product of any of claims 11 to 16, wherein generating the loudspeaker matrix and the source vector is part of performing a vector-based amplitude panning (VBAP) operation on the plurality of loudspeakers, and

wherein the method further comprises:

in response to the frequency of the audio signal being below a specified threshold, performing a Crosstalk Cancellation (CC) operation on the plurality of loudspeakers to produce amplitudes and phases of respective audio signals emitted by the loudspeakers to determine spatialization cues; and

in response to the frequency of the audio signal being above the specified threshold, performing the VBAP operation on the plurality of loudspeakers to generate respective weights for the loudspeakers.

18. The computer program product of claim 17, wherein performing the CC operation on the plurality of loudspeakers comprises: tracking the position and orientation of the listener over time.

19. The computer program product of claim 17 or 18, wherein the number of loudspeakers in the plurality of loudspeakers is an even number, and

wherein performing the CC operation on the plurality of loudspeakers comprises: applying, to pairs of the loudspeakers, head-related transfer functions (HRTFs) configured to provide the listener with a binaural sound field, the HRTFs being based on a parameterized rigid spherical model.

20. An electronic device configured to perform audio source spatialization, the electronic device comprising:

a memory; and

a control circuit coupled to the memory, the control circuit configured to:

receive audio data from an audio source at a source location, the audio data representing an audio waveform configured to be converted into sound of a certain frequency via a plurality of loudspeakers heard at a listener location, each of the plurality of loudspeakers having a respective loudspeaker location;

in response to the frequency of the audio signal being below a specified threshold, perform a Crosstalk Cancellation (CC) operation on the plurality of loudspeakers to produce amplitudes and phases of respective audio signals emitted by the loudspeakers to determine spatialization cues; and

in response to the frequency of the audio signal being above the specified threshold, perform a vector-based amplitude panning (VBAP) operation on the plurality of loudspeakers to produce respective weights for the loudspeakers, the respective weights for each of the plurality of loudspeakers representing factors with which the audio signal emitted by the loudspeaker is multiplied to determine a spatialization cue.

Technical Field

The present description relates to three-dimensional audio source spatialization in systems such as telepresence systems.

Background

Telepresence refers to a set of technologies that allow a person to feel as if they were present at, or to give the appearance of being present at, a place other than their true location. For example, rather than making a long trip to attend a face-to-face conference, a person may instead use a telepresence system that uses a multi-codec video system to provide the appearance of being in a face-to-face conference. Each member of the conference uses a telepresence room to "dial in" and is able to see and talk to every other member on a screen as if they were in the same room. Such telepresence systems may represent an improvement over conventional teleconferences and video conferences because the visual aspects greatly enhance communication, allowing for the perception of facial expressions and other body language.

Disclosure of Invention

In one general aspect, a method can include: audio data from an audio source at a source location is received by a processing circuit configured to perform audio source spatialization, the audio data representing an audio waveform configured to be converted into sound of a certain frequency via a plurality of loudspeakers heard at a listener location, each of the plurality of loudspeakers having a respective loudspeaker location. The method can also include, in response to the frequency of the audio signal being below a specified threshold, performing, by the processing circuit, a Crosstalk Cancellation (CC) operation on the plurality of loudspeakers to produce amplitudes and phases of respective audio signals emitted by the loudspeakers to determine the spatialization cues. The method can further include, in response to the frequency of the audio signal being above a specified threshold, performing, by the processing circuit, a vector-based amplitude panning (VBAP) operation on the plurality of loudspeakers to generate respective weights for the loudspeakers, the respective weights for each of the plurality of loudspeakers representing factors by which the audio signal emitted by the loudspeaker is multiplied to determine the spatialization cue. In some embodiments, the weights are complex and include phase.

In another general aspect, a computer program product comprising a non-transitory storage medium, the computer program product comprising code that, when executed by a processing circuit configured to perform audio source spatialization, causes the processing circuit to perform a method. The method can include: audio data from an audio source at a source location is received, the audio data representing an audio waveform configured to be converted to sound of a certain frequency via a plurality of loudspeakers heard at a listener location, each of the plurality of loudspeakers having a respective loudspeaker location. The method can also include generating a loudspeaker matrix having elements that are components of a vector parallel to a difference between the listener position and a respective loudspeaker position for each of the plurality of loudspeakers. The method can further include generating a source vector having elements that are components of a vector parallel to a difference between the listener position and the source position. The method can further include performing a pseudo-inverse operation on the loudspeaker matrix and the source vector to produce a weight vector having components, each component of the weight vector representing a respective weight for each of the plurality of loudspeakers.

The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.

Drawings

FIG. 1 is a schematic diagram illustrating an example electronic environment for implementing the improved techniques described herein.

FIG. 2 is a flow diagram illustrating an example method of performing an improved technique within an electronic environment.

Fig. 3 is a schematic diagram illustrating an example geometry used in considering Crosstalk Cancellation (CC) operations.

Fig. 4 is a schematic diagram illustrating an example rigid ball HRTF model at two different arrival orientations.

Fig. 5 is a schematic diagram illustrating example geometries used in considering vector-based amplitude panning (VBAP) operations.

FIG. 6 is a flow chart illustrating an example process of performing VBAP operations.

Fig. 7 illustrates an example of a computer device and a mobile computer device that can be used with the circuitry described herein.

Detailed Description

The goal of a telepresence system that delivers the above audio is to provide the listener with appropriately spatialized talker speech. Such systems accurately deliver sound to the listener's left and right ears. Delivery is simple if headphones can be used. However, in the telepresence examples of interest, the listening experience should be unencumbered, and therefore loudspeaker rendering is used.

There are a number of techniques for delivering spatialized audio to listeners, including wave field synthesis and Ambisonics. These techniques are typically used for the rendering of complex acoustic environments (with many sound sources) and require a minimum of four loudspeakers (for B-format Ambisonics installations) and often more (for higher-order Ambisonics and wave field synthesis installations). Furthermore, loudspeakers used for Ambisonics envelop/surround the listener.

In contrast, the above-described telepresence systems use a relatively small number of loudspeakers (e.g., between two and four). In some embodiments, the loudspeakers are positioned in front of the listener. Therefore, neither Ambisonics nor wave field synthesis is suitable for use in the above-described telepresence systems. Instead, loudspeaker rendering here centers on two conceptually simple techniques intended to use two or more loudspeakers to display spatialized sound to a single listener: crosstalk cancellation and vector-based amplitude panning.

One conventional approach to delivering audio in a telepresence system includes using crosstalk cancellation techniques to determine the complex signal from each loudspeaker that produces the desired signal in each listener's ear. Another conventional approach to delivering audio in a telepresence system includes using vector-based amplitude panning (VBAP) to derive amplitude weightings for each loudspeaker that properly position the audio source.

The above-described conventional methods of delivering audio in a telepresence system have some drawbacks that may result in poor spatialization. For example, while crosstalk cancellation can provide more accurate spatialization cues, crosstalk cancellation also tends to be sensitive to tracker error at high frequencies, where the sound wavelengths are comparable to the magnitude of the tracker error. VBAP is less sensitive to tracker errors but produces less accurate spatialization cues.

Furthermore, VBAP assumes that there are exactly three loudspeakers and that the listener's head is equidistant from each loudspeaker. If there are more than three loudspeakers, the region defined by the loudspeakers is decomposed into non-overlapping triangles with loudspeakers at the vertices, and VBAP is performed for the loudspeaker triplet of the appropriate triangle. This can be problematic because there can be more than one way to decompose a region, and there is no clear way to determine which is preferable.

In accordance with embodiments described herein and in contrast to the above-described conventional methods of delivering audio in a telepresence system, an improved technique of delivering audio in a telepresence system includes specifying a frequency threshold below which Crosstalk Cancellation (CC) is used and above which VBAP is used. In some embodiments, such frequency threshold is between 1000Hz and 2000 Hz. Further, in some embodiments, the improved technique includes modifying VBAP for more than three loudspeakers by forming an overdetermined system to determine amplitude weights for all loudspeakers at once.

Such a hybrid approach maintains the more accurate CC localization cues in the frequency region where they are most important and where CC is least sensitive to tracker errors and Head Related Transfer Function (HRTF) individualization, while the less accurate but less sensitive VBAP localization cues are used outside that frequency region. Furthermore, the modified VBAP does not assume that the listener is equidistant from all loudspeakers, and the weights determined by the modified VBAP for each loudspeaker do not depend on any decomposition of the region spanned by those loudspeakers.
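
For illustration only, the frequency split at the heart of this hybrid approach can be sketched as follows in Python/NumPy. This is a minimal sketch, assuming the CC gains and VBAP weights have already been computed by the CC and VBAP operations described below; the function and parameter names, and the 1500 Hz example threshold, are illustrative assumptions rather than details specified in this description.

```python
# Minimal sketch of the hybrid CC/VBAP frequency split (assumed names).
import numpy as np

def hybrid_render(mono_block, sample_rate, cc_gains, vbap_weights,
                  threshold_hz=1500.0):
    """Render one mono block: CC below threshold_hz, VBAP above.

    cc_gains:     complex array, shape (num_speakers, num_bins), giving the
                  per-bin amplitude/phase from the CC operation.
    vbap_weights: real array, shape (num_speakers,), the VBAP amplitude weights.
    Returns one time-domain block per loudspeaker, shape (num_speakers, N).
    """
    spectrum = np.fft.rfft(mono_block)
    freqs = np.fft.rfftfreq(len(mono_block), d=1.0 / sample_rate)
    low = freqs < threshold_hz                            # CC region
    out = np.zeros((vbap_weights.shape[0], spectrum.size), dtype=complex)
    for k in range(out.shape[0]):
        out[k, low] = cc_gains[k, low] * spectrum[low]    # complex CC gains
        out[k, ~low] = vbap_weights[k] * spectrum[~low]   # real VBAP weights
    return np.fft.irfft(out, n=len(mono_block), axis=-1)
```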

FIG. 1 is a schematic diagram illustrating an example electronic environment 100 in which the improved techniques described above may be implemented. As shown in FIG. 1, the example electronic environment 100 includes a sound rendering computer 120.

The sound rendering computer 120 is configured to implement the hybrid scheme described above and to perform the modified VBAP operation described above. The sound rendering computer 120 includes a network interface 122, one or more processing units 124, and a memory 126. The network interface 122 includes, for example, an ethernet adapter, a token ring adapter, or the like, for converting electronic and/or optical signals into electronic form for use by the sound rendering computer 120. The set of processing units 124 includes one or more processing chips and/or components. The memory 126 includes both volatile memory (e.g., RAM) and non-volatile memory, such as one or more ROMs, disk drives, solid-state drives, or the like. The set of processing units 124 and the memory 126 together form a control circuit that is configured and arranged to perform various methods and functions as described herein.

In some embodiments, one or more of the components of the sound rendering computer 120 can be or can include a processor (e.g., processing unit 124) configured to process instructions stored in memory 126. An embodiment of such instructions as depicted in fig. 1 includes a sound acquisition manager 130, a crosstalk cancellation manager 140, and a VBAP manager 150. Further, as illustrated in FIG. 1, the memory 126 is configured to store various data, which is described with respect to a corresponding manager that uses such data.

The sound acquisition manager 130 is configured to acquire sound data 132 from a sound source. For example, in a telepresence system that hosts a virtual conference, conference participants at remote locations speak, and the sound produced by the speaking is detected by a microphone. The microphone converts the detected sound into a digital data format, which is transmitted to the sound rendering computer 120 through the network.

The sound data 132 represents audio detected by a microphone and converted into a digital data format. In some embodiments, the digital data format is uncompressed mono at a 16 kHz sampling rate and 16-bit resolution. In some embodiments, the digital data format is a compressed stereo format such as Opus or MP3. In some embodiments, recording is performed at a rate greater than 16 kHz, such as 44 kHz or 48 kHz. In some embodiments, the resolution is higher than 16 bits, e.g., 24-bit, 32-bit, floating point, etc. The sound rendering computer 120 is then configured to convert the sound data 132 into sound played through loudspeakers such that, at the listener's location, the listener will perceive the sound as originating from a virtual source location (e.g., at a seat beside the listener).

The sound data 132 uses waveforms to represent audio produced by a source at any time. The waveform represents a range of frequencies at each time instant or throughout a time window. In some implementations, the sound acquisition manager 130 is configured to store a frequency-space representation of the sound data 132 over a specified time window (e.g., 10 seconds, 1 second, 0.5 seconds, 0.1 seconds, etc.). In this case, for each time window, there is a distribution of frequencies and corresponding amplitudes and phases.
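
For illustration, the per-window frequency representation just described can be computed as in the following minimal sketch; the Hann window and the helper name are illustrative assumptions, not details taken from this description.

```python
# Minimal sketch of a windowed frequency-space representation of sound data.
import numpy as np

def windowed_spectra(samples, sample_rate, window_s=0.1):
    """Return (freqs, amplitudes, phases) for consecutive time windows."""
    n = int(window_s * sample_rate)                # samples per window
    num_windows = len(samples) // n
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    amps, phases = [], []
    for w in range(num_windows):
        frame = samples[w * n:(w + 1) * n] * np.hanning(n)
        spectrum = np.fft.rfft(frame)
        amps.append(np.abs(spectrum))              # amplitude per frequency
        phases.append(np.angle(spectrum))          # phase per frequency
    return freqs, np.array(amps), np.array(phases)
```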

The loudspeaker position data 134 represents the positions of the loudspeakers near the listener. Each position is specified with respect to the origin of a specified coordinate system. In some embodiments, the origin of the coordinate system is at a point in the listener's head. In some embodiments, each loudspeaker position is represented by a Cartesian coordinate triplet.

The virtual source location data 136 represents the location of the virtual source within the coordinate system described above. The location of the virtual source is the apparent location of the sound source as heard by the listener. For example, in a telepresence system, it may be desirable to have a meeting with a remote user, but as if the remote user were sitting beside the listener. In this case, the location of the virtual source will be at that place next to the listener.

The listener position data 138 represents the position of the listener within the coordinate system. In some implementations, the listener's position is at the origin of a coordinate system. In some implementations, the listener position data 138 changes over time, corresponding to tracking of the motion of the listener.

The crosstalk cancellation manager 140 is configured to perform crosstalk cancellation operations on the sound data 132 and HRTF data 142 to produce amplitude/phase data 144. As discussed in detail with respect to fig. 3 and 4, the crosstalk cancellation operation generates an amplitude/phase signal at each loudspeaker based on the sound data 132 and the HRTF data 142. When the frequency is below a specified threshold, e.g., 1000Hz, 2000Hz, or in between, the operations are performed by the sound rendering computer 120.

HRTF data 142 represents the various HRTFs between each speaker and each ear of the listener. With two loudspeakers and two ears, there are four HRTFs for each configuration of user and loudspeakers. In some embodiments, the HRTFs are based on a rigid spherical model, i.e. a parametric model that depends on the position and orientation of the listener relative to the loudspeakers. Like sound data, HRTFs are represented in frequency space.

The amplitude/phase data 144 represents the output of the crosstalk cancellation operation, i.e., the respective amplitude and phase emitted at each loudspeaker such that the listener hears the respective desired sound in each ear. In some implementations, because the sound data 132 is sampled in frequency space throughout a time window, the amplitude/phase data 144 will change with each time window duration.

The VBAP manager 150 is configured to perform VBAP operations on the loudspeaker position data 134, the virtual source position data 136 and the listener position data 138 to produce weight vector data 162 representing an amplitude weight for each loudspeaker. As shown in fig. 1, the VBAP manager 150 includes a loudspeaker matrix manager 152, a source vector manager 154, and a pseudo-inverse manager 156.

The loudspeaker matrix manager 152 is configured to generate loudspeaker matrix data 158 based on the loudspeaker position data 134 and the listener position data 138. In some implementations, the loudspeaker matrix data 158 has columns that include components of the unit vector in the direction of the loudspeaker position relative to the listener position.

The source vector manager 154 is configured to generate source vector data 160 based on the virtual source location data 136 and the listener location data 138. In some implementations, the source vector data 160 has elements that include components of a unit vector in the direction of the virtual source location relative to the listener location.

The pseudo-inverse manager 156 is configured to perform pseudo-inverse operations on the loudspeaker matrix data 158 and the source vector data 160 to produce weight vector data 162. In some embodiments, the pseudo-inverse operation includes generating a Penrose (Penrose) pseudo-inverse from the loudspeaker matrix data 158. In some embodiments, the pseudo-inverse operation includes generating a Singular Value Decomposition (SVD) of the loudspeaker matrix represented by the loudspeaker matrix data 158.

The weight vector data 162 represents a weight vector whose elements are the respective weights for each loudspeaker. The weight for a loudspeaker is the factor by which the signal emitted by that loudspeaker is multiplied so that the listener hears the desired sound. In some embodiments, each element of the weight vector is a positive number. In some implementations, at least one of the elements of the weight vector is zero, implying that the loudspeaker to which the zero weight corresponds is not active in producing the desired sound for the listener.

In some embodiments, the memory 126 can be any type of memory, such as random access memory, disk drive memory, flash memory, or the like. In some implementations, the memory 126 can be implemented as more than one memory component (e.g., more than one RAM component or disk drive memory) associated with the components of the sound rendering computer 120. In some implementations, the memory 126 can be a database memory. In some implementations, the memory 126 can be or can include non-local memory. For example, the memory 126 can be or can include a memory shared by multiple devices (not shown). In some implementations, the memory 126 can be associated with a server device (not shown) within a network and configured to serve components of the sound rendering computer 120.

The components (e.g., modules, processing units 124) of the sound rendering computer 120 can be configured to operate based on one or more platforms (e.g., one or more similar or different platforms) that can include one or more types of hardware, software, firmware, operating systems, runtime libraries, and the like. In some implementations, the components of the sound rendering computer 120 can be configured to operate within a cluster of devices (e.g., a server farm). In such implementations, the functionality and processing of the components of the sound rendering computer 120 can be distributed to several devices of a cluster of devices.

The components of the sound rendering computer 120 can be or include any type of hardware and/or software configured to process attributes. In some implementations, one or more portions of the components shown in the components of the sound rendering computer 120 in fig. 1 can be or can include a hardware-based module (e.g., a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a memory), a firmware module, and/or a software-based module (e.g., a module of computer code, a set of computer-readable instructions that can be executed at a computer). For example, in some implementations, one or more portions of the components of the sound rendering computer 120 can be or can include software modules configured for execution by at least one processor (not shown). In some embodiments, the functionality of the components can be included in different modules and/or different components than those shown in fig. 1.

Although not shown, in some implementations, the components of the sound rendering computer 120 (or portions thereof) can be configured to operate within, for example, a data center (e.g., a cloud computing environment), a computer system, one or more servers/host devices, and/or the like. In some implementations, the components of the sound rendering computer 120 (or portions thereof) can be configured to operate within a network. Accordingly, components of the sound rendering computer 120 (or portions thereof) can be configured to function in various types of network environments that can include one or more devices and/or one or more server devices. For example, the network can be or can include a Local Area Network (LAN), a Wide Area Network (WAN), and the like. The network can be or can include a wireless network and/or a wireless network implemented using, for example, gateway devices, bridges, switches, and the like. The network can include one or more segments and/or can have portions based on various protocols such as Internet Protocol (IP) and/or proprietary protocols. The network can include at least a portion of the internet.

In some embodiments, one or more of the components of the sound rendering computer 120 can be or can include a processor configured to process instructions stored in a memory. For example, the sound acquisition manager 130 (and/or a portion thereof), the crosstalk cancellation manager 140 (and/or a portion thereof), and the VBAP manager 150 (and/or a portion thereof) can be a combination of a processor and a memory configured to execute instructions related to a process for implementing one or more functions.

FIG. 2 is a flow diagram illustrating an example method 200 of performing audio source spatialization. The method 200 may be performed by the software constructs described in connection with fig. 1, which reside in the memory 126 of the sound rendering computer 120 and are run by the set of processing units 124.

At 202, the sound acquisition manager 130 receives audio data from an audio source at a source location, the audio data representing an audio waveform configured to be converted into sound of a frequency via a plurality of loudspeakers heard at a listener location, each of the plurality of loudspeakers having a respective loudspeaker location.

At 204, the crosstalk cancellation manager 140 performs a Crosstalk Cancellation (CC) operation on the plurality of loudspeakers to produce amplitudes and phases of respective audio signals emitted by the loudspeakers to determine spatialization cues in response to the frequency of the audio signals being below a specified threshold.

At 206, the VBAP manager 150 performs a VBAP operation on the plurality of loudspeakers to generate respective weights for the loudspeakers in response to the frequency of the audio signal being above the specified threshold, the respective weights for each of the plurality of loudspeakers representing factors by which the audio signal emitted by the loudspeaker is multiplied to determine the spatialization cues.

Fig. 3 is a schematic diagram illustrating an example geometry 300 for use when considering Crosstalk Cancellation (CC) operations. Within the geometry 300, a pair of loudspeakers 310(1) and 310(2) face a listener 320.

The propagation of sound from a source to a human listener is typically described in terms of Head Related Transfer Functions (HRTFs). The HRTF is a frequency response that describes the propagation from a point source at a particular location to the left and right ears in the absence of reverberation. HRTFs depend on many factors. For simplicity, it is usually reduced to relying on the source arrival orientation-i.e., azimuth and elevation-relative to the direction in which the head is pointing. Other factors are often ignored, such as distance, rotation of the head relative to the torso, and so forth.

The sound rendered by loudspeaker 310(1) propagates to the two ears of the listener 320 as described by the HRTF pair (H_{1L}, H_{1R}). Similarly, sound rendered by loudspeaker 310(2) propagates to the two ears of the listener 320 as described by (H_{2L}, H_{2R}). This means that, represented in the frequency domain, the signals S_1 and S_2 played from the loudspeakers generate the observed signals L and R obeying the following relationship:

\begin{pmatrix} L \\ R \end{pmatrix} = \begin{pmatrix} H_{1L} & H_{2L} \\ H_{1R} & H_{2R} \end{pmatrix} \begin{pmatrix} S_1 \\ S_2 \end{pmatrix}

Suppose that the desired binaural signal to be rendered at the two ears is given by L_{des} and R_{des}. This system of equations can then be solved to obtain the appropriate S_1 and S_2 which, when played over the loudspeakers, will generate the desired signal at each ear:

\begin{pmatrix} S_1 \\ S_2 \end{pmatrix} = \begin{pmatrix} H_{1L} & H_{2L} \\ H_{1R} & H_{2R} \end{pmatrix}^{-1} \begin{pmatrix} L_{des} \\ R_{des} \end{pmatrix}

Thus, if the loudspeaker-to-ear HRTFs (H_{1L}, H_{1R}) and (H_{2L}, H_{2R}) are known, one can generate the loudspeaker output signals necessary to deliver the spatialized audio to the listener 320.

Note that when the position of the listener changes relative to the loudspeakers (or vice versa), the HRTF will change. An example of an HRTF that can be changed in real-time as the listener moves is provided in fig. 4.

Fig. 4 shows HRTFs for two source orientations (az, el) located respectively on the left and right sides of the listener's head: (-10°, 0°) and (20°, 0°). The top row of panels shows the amplitudes of the left- and right-ear transfer functions. The middle row of panels shows the amplitude of the left-ear frequency response divided by that of the right ear. The bottom row shows the time delay of the left ear relative to the right ear. These graphs show the following HRTF characteristics relevant for sound localization.

Interaural Time Difference (ITD), which is the relative delay in the source signal between the two ears. Consider the bottom row of panels in fig. 4. A source arriving from the left side of the listener, i.e., from (-10°, 0°), arrives first at the left ear and second at the right ear. This results in the negative relative delay L/R (the ITD) observed for this source position. A source arriving from the right side of the listener, i.e., from (20°, 0°), shows the opposite behavior. The |ITD| from the more laterally positioned source at (20°, 0°) is greater than the |ITD| from the source at (-10°, 0°). The ITD is not constant with frequency, as it would be for a point receiver in the free field. The presence of the head produces a larger ITD amplitude at lower frequencies than at higher frequencies.

Interaural Level Difference (ILD), which is the relative level difference in the source signal between the two ears. Consider the top and middle rows of panels in fig. 4. A source arriving from the listener's left side, i.e., from (-10°, 0°), is louder at the left ear than at the right ear, because the head "shadows" the source signal as it travels to the right ear. For this source position, this results in a positive amplitude ratio L/R expressed in dB (the ILD). A source arriving from the right side of the listener, i.e., from (20°, 0°), shows the opposite behavior. The |ILD| from the more laterally positioned source at (20°, 0°) is typically greater than the |ILD| from the source at (-10°, 0°), because of the higher degree of head shadowing. Like the ITD, the ILD is not constant with frequency. The presence of the head produces a larger ILD amplitude at higher frequencies than at lower frequencies.

Spectral cues, which are the peaks, valleys, and notches evident in the transfer-function amplitudes shown in the top row of panels in fig. 4. These arise from various factors, including ear canal resonance, reflections from the listener's torso/shoulders, and reflections from the outer ear, or pinna.

In general, interaural cues (ITDs and ILDs) reflect source lateralization (i.e., movement to the left or right of the listener). The broad trends in ITDs and ILDs are similar across different listeners and even lend themselves to easy simulation using rigid spherical head models. Since ITDs begin to alias at higher frequencies, ITDs are most relevant at lower frequencies (below ~1500 Hz). ILDs are most relevant at higher frequencies (above ~1500 Hz), primarily due to the reduced reliability of ITDs at these frequencies.

The interaural cues become ambiguous for similarly lateralized source positions along a "cone of confusion." For example, sources located at (az, el) = (45°, 0°), (135°, 0°), (90°, 45°), and (90°, -45°) are all similarly lateralized, lying along the cone formed by rotating a ray directed at (45°, 0°) about the interaural axis. Spectral cues are typically used by listeners to distinguish source locations along the same cone of confusion. In particular, spectral cues are useful for elevation localization and front/back source discrimination. They are also useful for "externalization", i.e., making sound appear as if it originates from an actual point outside the head. Spectral cues are highly individualized due to the highly individual variation in pinna structure across different listeners.

The telepresence system is configured to present the voice of the remote talker as if the talker were in the listener's acoustic space. Assume that the sound rendering computer 120 has properly "cleaned up" the transmitted audio such that it is a single channel consisting of only the talker's voice. The task of the sound rendering computer 120 is to convert this single source into a binaural signal based on the relative positions and head orientations of the listener and talker. This is done by applying appropriate HRTFs to the talker's voice to produce the signals that should be presented to the listener's ears, as shown in fig. 3.
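
In frequency-domain terms, this amounts to multiplying the talker's spectrum by the HRTF pair for the desired source position; a minimal sketch follows, with the function name being an illustrative assumption.

```python
# Minimal sketch: apply a source-position HRTF pair to the talker's voice to
# obtain the desired binaural (ear) signals, which CC then reproduces.
import numpy as np

def desired_binaural(talker_spectrum, H_L, H_R):
    """talker_spectrum, H_L, H_R: complex arrays over frequency bins."""
    L_des = H_L * talker_spectrum   # desired left-ear signal
    R_des = H_R * talker_spectrum   # desired right-ear signal
    return L_des, R_des
```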

One technique for producing these signals is a rigid-sphere HRTF model for ILD/ITD rendering. Studies have shown that rigid-sphere models are able to generate interaural cues, particularly ITDs, that reflect those observed with real listeners. Fig. 4 also shows, in dashed lines, a synthetic HRTF based on a rigid spherical head model with a radius of 8.5 cm. (Other radii may be used, e.g., 8.0 cm, 9.0 cm, 7.5 cm, 9.5 cm, etc.) The interaural cues are very similar, although the high-frequency ILD tends to be reduced. There are no detailed spectral cues, but this is not surprising. However, the rigid-sphere model has the advantage of being fully parameterized and mathematically solvable.
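
As one illustration of how mathematically tractable the rigid sphere is, the classic Woodworth approximation gives the ITD in closed form. The formula below is a standard textbook model assumed here for illustration; it is not quoted from this description.

```python
# Illustrative rigid-sphere ITD via Woodworth's approximation:
# ITD = (a / c) * (sin(theta) + theta), with head radius a and sound speed c.
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.085, speed_of_sound=343.0):
    """Approximate interaural time difference (seconds) for a rigid sphere."""
    theta = math.radians(azimuth_deg)    # source azimuth from straight ahead
    return (head_radius_m / speed_of_sound) * (math.sin(theta) + theta)

# For the two source orientations of fig. 4:
print(woodworth_itd(-10.0))   # about -86 microseconds (source to the left)
print(woodworth_itd(20.0))    # about +171 microseconds (more lateral, larger |ITD|)
```

Consistent with the discussion of fig. 4, the more laterally positioned source at (20°, 0°) yields the larger |ITD|.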

Another technique that may be used is a custom HRTF rendering, where the listener's own empirically derived HRTFs are applied. While this produces the most accurate and truest binaural signal, in some embodiments the cost associated with this approach renders it impractical as a general approach.

Another technique that may be used is reference-set HRTF rendering. Instead of using individual listener HRTFs, an alternative is to use a generic "typical" HRTF for spatialization, or an HRTF chosen from a library of reference HRTFs. Since the interaural cues of ITDs and ILDs are generally similar across listeners, this will yield good spatialization, especially with respect to lateral sources.

Another technique that may be used is reference-set ILD/ITD rendering. Instead of using full HRTFs to synthesize spatialization, a simpler alternative is to synthesize only the interaural (ITD and ILD) localization cues. These cues are similar across listeners, so using a "reference set" of interaural cues will produce spatialization of lateral sources similar to that achieved using the listener's own interaural cues. Furthermore, interaural cues are typically less "rich" than full HRTFs, meaning that they may be parameterized or sampled at a less dense set of source orientations, thereby reducing the memory footprint at runtime.

As described above, the CC operation is preferably performed for lower frequencies (e.g., below a threshold between 1000 Hz and 2000 Hz). Above such frequencies, the improved techniques include performing a modified VBAP operation to produce a set of positive weights for at least some of the loudspeakers.

Fig. 5 is a schematic diagram illustrating an example geometry 500 for use in considering a modified vector-based amplitude panning (VBAP) operation. In the geometry 500, there are four loudspeakers 510(1), 510(2), 510(3), and 510(4) aimed at the listener 530. There is also a virtual source 520, typically in front of the listener 530. The listener 530 is not necessarily equidistant from all of the loudspeakers 510(1-4) and may be moving around with respect to them. In some implementations, there are more than four loudspeakers near the listener 530. In some implementations, there are two loudspeakers near the listener 530.

FIG. 5 shows a set of unit vectors U_{HL,1-4} pointing from the center of the listener's head 530 toward each of the loudspeakers 510(1-4), and a unit vector U_{HV} pointing toward the virtual source 520. From these unit vectors, the VBAP manager 150 generates an overdetermined (or underdetermined, when the number of loudspeakers is less than three) linear system that produces weights corresponding to each of the loudspeakers 510(1-4).

The solution of the linear system in conventional VBAP has several limitations. First, conventional VBAP assumes that the head of the listener 530 is positioned equidistant from all of the loudspeakers, e.g., 510(1-4). Second, conventional VBAP uses exactly three loudspeakers to spatialize the virtual source 520. When there are more than three loudspeakers, conventional VBAP needs to partition the listener space into non-overlapping triangles, so that each sub-region is covered by exactly three loudspeakers. In conventional VBAP, although spatialization is achieved by computing VBAP weights for the appropriate subset of loudspeakers, it requires an arbitrary partition of the space into triangles. For example, when the loudspeakers 510(1-4) are arranged in a square centered on the listener, the square may be divided into two triangles in two different ways: 510(1,2,3) + 510(2,3,4) or 510(1,2,4) + 510(1,3,4); it is unclear which is preferable. Furthermore, the division into groups of three loudspeakers can lead to counterintuitive loudspeaker weighting. For example, consider the square geometry above divided into the two triangular subregions spanned by 510(1,2,3) + 510(2,3,4). In this case, a virtual source located exactly at the center of the square will have non-zero VBAP weights only for loudspeakers 510(2) and 510(3). A more intuitive VBAP weighting would have equal contributions from all four loudspeakers. Third, there is no guarantee that the weights found from conventional VBAP will all be positive. Thus, the modified VBAP is presented with respect to fig. 6.

FIG. 6 is a flow diagram illustrating an example method 600 of performing a modified VBAP. The method 600 may be performed by the software constructs described with respect to fig. 1 residing in the memory 126 of the sound rendering computer 120 and being executed by the set of processing units 124.

At 602, the loudspeaker matrix manager 152 generates a loudspeaker matrix based on the unit vectors U_{HL,1-4}. Typically, the loudspeaker matrix has one column per loudspeaker, each column being the three-dimensional unit vector corresponding to that loudspeaker. For example, when there are N loudspeakers, the loudspeaker matrix has dimensions 3 × N. For the case illustrated in fig. 5, the matrix has dimensions 3 × 4. Thus, the linear system is overdetermined.

At 604, the source vector manager 154 generates a source vector. In this case, the source vector is simply the unit vector U_{HV}.

At 606, the pseudo-inverse manager 156 performs a pseudo-inverse operation on the loudspeaker matrix and the source vector to generate a weight vector. For example, in some embodiments, the pseudo-inverse manager 156 computes the matrix (L^T L)^{-1} L^T to generate the Penrose pseudo-inverse of the loudspeaker matrix L. In this case, the weights are then generated from the quantity (L^T L)^{-1} L^T U_{HV}. The weight vector of the overdetermined system is not uniquely determined. In this case, the pseudo-inverse manager 156 generates the weight vector w having the smallest norm, i.e., the smallest sum of squares of the components of w.

At 608, the VBAP manager 150 determines whether all components of the weight vector are positive. If all weights are positive, the method 600 is complete at 614. If not, the VBAP manager 150 sets all negative components of the weight vector w to zero at 610. In effect, the VBAP manager 150 removes the loudspeakers to which the negative weights correspond. In this case, at 612, the loudspeaker matrix manager 152 generates a new loudspeaker matrix L' in which the columns corresponding to the negative weights have been removed. The method 600 then repeats until all components of the weight vector w are positive.

In some implementations, after generating a weight vector with all positive components, the VBAP manager 150 multiplies each component by the corresponding head-to-loudspeaker distance. This multiplication corrects for the inverse-square loss of energy due to waves propagating over different distances. In the absence of reverberation, this compensates the direct-path signal from each loudspeaker for the case where the listener is not equidistant from the loudspeakers. In some embodiments, the weight vector w may also include phase components based on the distances between the listener and the loudspeakers. In this case, such phase components align the phases of the loudspeaker signals at the listener's head.
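
The following minimal sketch pulls steps 602 through 612 together, including the distance scaling just described. It uses NumPy's pseudo-inverse (computed via the SVD mentioned above); the exact iteration and the helper name are one reasonable reading of method 600 for illustration, not a verbatim transcription.

```python
# Minimal sketch of the modified VBAP of method 600 (assumed conventions).
import numpy as np

def modified_vbap(speaker_units, source_unit, speaker_distances=None):
    """speaker_units: (3, N) unit-vector columns; source_unit: (3,)."""
    n = speaker_units.shape[1]
    active = np.arange(n)                    # loudspeakers still in play
    weights = np.zeros(n)
    while active.size > 0:
        L = speaker_units[:, active]         # (reduced) loudspeaker matrix
        w = np.linalg.pinv(L) @ source_unit  # minimum-norm pseudo-inverse solve
        if np.all(w >= 0):
            weights[active] = w
            break
        active = active[w >= 0]              # drop negative-weight loudspeakers
    if speaker_distances is not None:        # compensate inverse-square loss
        weights = weights * speaker_distances
    return weights

# Square arrangement of four loudspeakers centered on the listener, source
# straight ahead: all four weights come out equal, as discussed for FIG. 5.
sq = np.array([[1.0,  1.0, -1.0, -1.0],
               [1.0, -1.0,  1.0, -1.0],
               [2.0,  2.0,  2.0,  2.0]])
sq /= np.linalg.norm(sq, axis=0)             # normalize columns to unit vectors
print(modified_vbap(sq, np.array([0.0, 0.0, 1.0])))  # ~[0.306 0.306 0.306 0.306]
```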

The modified VBAP described above addresses all three of the concerns described above. In particular, (i) the modified VBAP does not assume that the listener is equidistant from all loudspeakers, (ii) the modified VBAP applies to two or more loudspeakers, (iii) the subset of loudspeakers is selected by an iterative process rather than by arbitrarily pre-dividing the space into triangles, (iv) for an arrangement such as a square, a source located at the center of the square receives equal VBAP contributions from all four vertex loudspeakers, and (v) all weights are positive.

The improved technique described above uses the tracked listener head position to continuously update the VBAP weights for proper source spatialization. Note that VBAP depends only on the listener head position and the virtual source position; VBAP does not require knowledge of head rotation or HRTFs. This can result in less accurate spatialization cues than those provided by CC, but the cues are also less susceptible to tracking errors and HRTF inaccuracies.

In summary, CC requires knowledge of the listener position/rotation and the listener HRTF, whereas VBAP requires only knowledge of the listener's position. In general, CC provides more accurate localization cues but is more sensitive to tracker (especially rotational) errors and is limited by the accuracy of the underlying HRTF model, while VBAP provides less accurate localization cues but is less sensitive to tracker errors and does not require HRTF knowledge at all. The sensitivity of CC to tracker error is wavelength dependent: as the wavelength decreases, the tracker error becomes a larger fraction of the wavelength. Furthermore, the highly individualized aspects of listener HRTFs are concentrated in high-frequency spectral cues that depend on the shape of the individual listener's outer ear (or pinna). Finally, sound localization (especially left/right localization) is dominated by the low-frequency interaural cues.

These properties suggest a hybrid CC/VBAP approach that uses CC in the low-frequency region and VBAP in the high-frequency region. In that way, the more accurate CC localization cues are maintained in the frequency region where they are most important and where CC has the lowest sensitivity to tracker errors and HRTF individualization, while the less sensitive VBAP localization cues are used elsewhere. A typical cutoff frequency between the low-frequency region and the high-frequency region is in the range of 1000-2000 Hz (which reflects the fact that the interaural time difference begins to spatially alias in this region).

FIG. 7 illustrates an example of a general-purpose computer device 700 and a general-purpose mobile computer device 750 that can be used with the techniques described herein.

As shown in fig. 7, computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 750 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

Computing device 700 includes a processor 702, memory 704, a storage device 706, a high-speed interface 708 connecting to memory 704 and high-speed expansion ports 710, and a low speed interface 712 connecting to low speed bus 714 and storage device 706. Each of the components 702, 704, 706, 708, 710, and 712, are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 702 can process instructions for execution within the computing device 700, including instructions stored in the memory 704 or on the storage device 706 to display graphical information for a GUI on an external input/output device, such as display 716 coupled to high speed interface 708. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Additionally, multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 704 stores information within the computing device 700. In one implementation, the memory 704 is a volatile memory unit or units. In another implementation, the memory 704 is a non-volatile memory unit or units. The memory 704 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 706 is capable of providing mass storage for the computing device 700. In one implementation, the storage device 706 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. The computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer-or machine-readable medium, such as the memory 704, the storage device 706, or memory on processor 702.

The high-speed controller 708 manages bandwidth-intensive operations for the computing device 700, while the low-speed controller 712 manages lower bandwidth-intensive operations. This allocation of functionality is merely exemplary. In one embodiment, high-speed controller 708 is coupled to memory 704, display 716 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 710, which high-speed expansion ports 710 may accept various expansion cards (not shown). In this embodiment, low-speed controller 712 is coupled to storage device 706 and low-speed expansion port 714. The low-speed expansion port, which may include various communication ports (e.g., USB, bluetooth, ethernet, wireless ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, for example, through a network adapter.

As shown in the figure, computing device 700 may be implemented in many different forms. For example, it may be implemented as a standard server 720, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 724. Additionally, it may be implemented in a personal computer such as a laptop computer 722. Alternatively, components from computing device 700 may be combined with other components in a mobile device (not shown), such as device 750. Each of such devices may contain one or more of computing device 700, 750, and an entire system may be made up of multiple computing devices 700, 750 communicating with each other.

Computing device 750 includes a processor 752, a memory 764, an input/output device such as a display 754, a communication interface 766, and a transceiver 768, among other components. The device 750 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 750, 752, 764, 754, 766, and 768, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 752 is capable of executing instructions within the computing device 750, including instructions stored in the memory 764. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. For example, the processor may provide for coordinating other components of device 750, such as control of user interfaces, applications run by device 750, and wireless communications by device 750.

Processor 752 may communicate with a user through control interface 758 and display interface 756 coupled to a display 754. The display 754 may be, for example, a TFT LCD (thin film transistor liquid Crystal display) or OLED (organic light emitting diode) display, or other suitable display technology. The display interface 756 may comprise appropriate circuitry for driving the display 754 to present graphical and other information to a user. The control interface 758 may receive commands from a user and convert them for submission to the processor 752. Additionally, an external interface 762 may be provided in communication with processor 752, so as to enable near area communication of device 750 with other devices. External interface 762 may, for example, be provided for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.

The memory 764 stores information within the computing device 750. The memory 764 can be implemented as one or more of a computer-readable medium (medium) or media (media), one or more volatile memory units, or one or more non-volatile memory units. Expansion memory 774 may also be provided and connected to device 750 through expansion interface 772, which expansion interface 772 may include, for example, a SIMM (Single in line memory Module) card interface. Such expansion memory 774 may provide additional storage space for device 750, or may also store applications or other information for device 750. Specifically, expansion memory 774 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 774 may be provided as a security module for device 750, and may be programmed with instructions that permit secure use of device 750. In addition, secure applications may be provided via the SIMM card, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

As described below, the memory may include, for example, flash memory and/or NVRAM memory. In one embodiment, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 764, expansion memory 774, or memory on processor 752, and may be received, for example, over transceiver 768 or external interface 762.

Device 750 may communicate wirelessly through communication interface 766, which may include digital signal processing circuitry where necessary. Communication interface 766 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 768. Additionally, short-range communication may occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 770 may provide additional navigation- and location-related wireless data to device 750, which may be used as appropriate by applications running on device 750.

Device 750 may also communicate audibly using audio codec 760, which may receive spoken information from a user and convert it to usable digital information. Audio codec 760 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 750. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications operating on device 750.
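As a concrete illustration of this output path, the following is a minimal sketch, assuming a Python environment and using only the standard library; the tone parameters, function names, and file name are hypothetical and not part of the described device. It synthesizes a short sine tone as 16-bit PCM and writes it through a WAV container, standing in for the route by which a codec such as audio codec 760 might deliver generated sound to a speaker.

```python
# Minimal sketch (not the patent's implementation): synthesize a tone
# and write it out as 16-bit PCM, standing in for a codec output path.
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second, an illustrative assumption

def synthesize_tone(frequency_hz: float, duration_s: float) -> bytes:
    """Generate mono 16-bit PCM samples for a sine tone."""
    n_samples = int(SAMPLE_RATE * duration_s)
    frames = bytearray()
    for n in range(n_samples):
        sample = math.sin(2.0 * math.pi * frequency_hz * n / SAMPLE_RATE)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16
    return bytes(frames)

def write_audio(path: str, pcm: bytes) -> None:
    """Write mono 16-bit PCM into a WAV container as a codec-output stand-in."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)

write_audio("tone.wav", synthesize_tone(440.0, 0.5))
```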

As shown in the figure, computing device 750 may be implemented in many different forms. For example, it may be implemented as a cellular telephone 780. It may also be implemented as part of a smart phone 782, a personal digital assistant, or other similar mobile device.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
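To make the client-server relationship described above concrete, the following is a minimal sketch using only the Python standard library; the host address, port number, and echoed payload are illustrative assumptions rather than any part of the described system. A server socket is bound and listening before the handler thread starts, so the client cannot race ahead of it.

```python
# Minimal client-server sketch: one server thread echoes bytes back
# to one client over a loopback TCP connection.
import socket
import threading

HOST, PORT = "127.0.0.1", 9000  # illustrative assumptions

# Bind and listen before spawning the handler thread, so the client
# connection below cannot arrive before the server is ready.
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind((HOST, PORT))
server_sock.listen(1)

def handle_one_client() -> None:
    """Accept a single connection and echo the received bytes back."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)  # the server's response to the client's request

threading.Thread(target=handle_one_client, daemon=True).start()

# Client side: connect through the communication network (here, loopback),
# send a request, and read the server's response.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.connect((HOST, PORT))
    client.sendall(b"hello")
    print(client.recv(1024))  # prints b'hello'

server_sock.close()
```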

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the description.

It will also be understood that when an element is referred to as being "on," "connected to," "electrically connected to," "coupled to," or "electrically coupled to" another element, it can be directly on, connected, or coupled to the other element, or one or more intervening elements may be present. In contrast, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there are no intervening elements present. Although the terms "directly on," "directly connected to," or "directly coupled to" may not be used throughout the detailed description, elements shown as directly on, directly connected, or directly coupled can be so referred to. The claims of the present application may be modified to recite exemplary relationships described in the specification or shown in the figures.

While certain features of the described embodiments have been illustrated as described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the embodiments. The embodiments have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the devices and/or methods described herein can be combined in any combination, except mutually exclusive combinations. The embodiments described herein can include various combinations and/or subcombinations of the functions, components, and/or features of the different embodiments described.

In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Moreover, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.
