Personalization of head-related transfer function templates for audio content presentation

Document No.: 1943027    Publication date: 2021-12-07

Note: This technology, "Personalization of head-related transfer function templates for audio content presentation," was created by William Owen Brimijoin II, H. G. Hassager, Vamsi Krishna Ithapu, and Philip Robinson on 2020-04-10. Abstract: A system for generating personalized HRTFs customized for a user of a head-mounted device. The system includes a server and an audio system. The server determines a personalized HRTF based in part on acoustic feature data of the user (e.g., image data, anthropometric features, etc.) and a template HRTF. The server provides the personalized HRTF to the audio system. The audio system uses the personalized HRTF to present spatialized audio content to the user.

1. A method, comprising:

determining one or more personalized filters based at least in part on the acoustic feature data of the user;

generating one or more personalized HRTFs for the user based on a template head-related transfer function (HRTF) and the determined one or more personalized filters; and

providing the generated one or more personalized HRTFs to an audio system, wherein the personalized HRTFs are used to generate the spatialized audio content.

2. The method of claim 1, wherein determining the one or more personalized filters comprises determining parameter values for the one or more personalized filters using a trained machine learning model and acoustic feature data of the user.

3. The method of claim 1 or claim 2, wherein the parameter values of the one or more personalized filters describe one or more personalized notches in the one or more personalized HRTFs, and the parameter values comprise one or more of: a frequency location, a width in a frequency band centered on the frequency location, and an amount of attenuation caused in the frequency band centered on the frequency location.

4. The method of claim 2, wherein the machine learning model is trained with image data, anthropometric features, and/or acoustic data, the acoustic data comprising measurements of HRTFs obtained for a population of users.

5. The method of any preceding claim, wherein generating the one or more personalized HRTFs for the user based on the template HRTFs and the determined one or more personalized filters comprises:

adding at least one notch to the template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF of the one or more personalized HRTFs.

6. A method as claimed in any preceding claim, wherein the template HRTF is based on a universal HRTF describing a user population, the universal HRTF comprising at least one notch in a frequency range.

7. The method of claim 6, wherein the template HRTF is generated from the universal HRTF by removing the at least one notch such that the template HRTF is a smooth and continuous function over the frequency range, the frequency range being 5kHz to 10kHz, and there being at least one notch in the template HRTF outside of the frequency range.

8. The method of any preceding claim, wherein the audio system is part of a head-mounted device.

9. The method of any of claims 1-7, wherein the audio system is separate from and external to a headset.

10. A non-transitory computer readable medium configured to store program code instructions that, when executed by a processor, cause the processor to perform steps comprising:

determining one or more personalized filters based at least in part on the acoustic feature data of the user;

generating one or more personalized HRTFs for the user based on a template head-related transfer function (HRTF) and the determined one or more personalized filters; and

providing the generated one or more personalized HRTFs to an audio system, wherein the personalized HRTFs are used to generate the spatialized audio content.

11. The computer-readable medium of claim 10, wherein determining the one or more personalized filters comprises determining parameter values for the one or more personalized filters using a trained machine learning model and acoustic feature data of the user, and training the machine learning model with image data, anthropometric features, and/or acoustic data, the acoustic data comprising measurements of HRTFs obtained for a population of users.

12. The computer-readable medium of claim 11, wherein the parameter values of the one or more personalized filters describe one or more personalized notches in the one or more personalized HRTFs, and the parameter values include one or more of: a frequency location, a width in a frequency band centered on the frequency location, and an amount of attenuation caused in the frequency band centered on the frequency location.

13. The computer-readable medium of any of claims 10-12, wherein generating the one or more personalized HRTFs for the user based on the template HRTFs and the determined one or more personalized filters comprises:

adding at least one notch to the template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF of the one or more personalized HRTFs.

14. A method, comprising:

receiving, at a head mounted device, one or more personalized head-related transfer functions (HRTFs) of a user of the head mounted device;

retrieving audio data associated with a target sound source direction relative to the head mounted device;

applying the one or more personalized HRTFs to the audio data to render the audio data as audio content; and

rendering, by a speaker assembly of the head mounted device, the audio content, wherein the rendered audio content is spatialized such that it sounds as if it originates from the target sound source direction.

15. The method of claim 14, further comprising:

capturing acoustic feature data of the user; and

transmitting the captured acoustic feature data to a server,

wherein the server determines the one or more personalized HRTFs using the captured acoustic feature data, and the server provides the one or more personalized HRTFs to the head-mounted device.

Background

The present disclosure relates generally to binaural audio synthesis, and in particular to personalizing Head Related Transfer Functions (HRTFs) to render audio content.

The sound received at both ears from a given sound source may differ depending on the direction and location of the sound source relative to each ear and the room environment in which the sound is perceived. HRTFs characterize the sound received at a person's ear for a particular position (and frequency) of a sound source. A plurality of HRTFs are used to characterize how a user perceives sound. In some cases, the HRTFs form a high-dimensional data set that relies on tens of thousands of parameters to provide the listener with a perception of the direction of the sound source.

SUMMARY

A system is described for generating personalized HRTFs customized for a user of an audio system (e.g., an audio system that may be implemented as part of a headset). The system includes a server and an audio system. The server determines a personalized HRTF based in part on acoustic feature data of the user (e.g., image data, anthropometric features, etc.) and a template HRTF. The template HRTF is an HRTF that can be customized (e.g., by adding one or more notches) so that it can be personalized for different users. The server provides the personalized HRTFs to the audio system. The audio system presents spatialized audio content to the user using the personalized HRTFs. The methods described herein may also be embodied as instructions stored on a computer-readable medium.

According to a first aspect of the invention, there is provided a method comprising: determining one or more personalized filters based at least in part on the acoustic feature data of the user; generating one or more personalized head related transfer functions HRTFs for the user based on the template HRTFs and the determined one or more personalized filters; and providing the generated one or more personalized HRTFs to an audio system, wherein the personalized HRTFs are used to generate the spatialized audio content.

The method may be performed by a server.

The one or more personalized filters may be determined by machine learning.

One or more personalized filters may be used to personalize the template HRTF so that it is customized for the user, thereby forming a personalized HRTF.

Personalization may mean adding one or more notches.

Determining the one or more personalized filters may include determining parameter values for the one or more personalized filters using the trained machine learning model and acoustic feature data of the user.

The parameter values of the one or more personalized filters may describe one or more personalized notches in the one or more personalized HRTFs.

The parameter values may include one or more of: a frequency location, a width in a frequency band centered on the frequency location, and an amount of attenuation caused in the frequency band centered on the frequency location. The parameter values may comprise each item in this list.

The machine learning model may be trained with image data, anthropometric features, and/or acoustic data, including measurements of HRTFs obtained for a population of users.

Generating one or more personalized HRTFs for a user based on the template HRTFs and the determined one or more personalized filters may include: at least one notch is added to the template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF of the one or more personalized HRTFs.

The template HRTF may be based on a universal HRTF (generic HRTF) describing a user population, the universal HRTF comprising at least one notch in a frequency range.

The template HRTF may be generated from the generic HRTF by removing the at least one notch such that the template HRTF is a smooth and continuous function over the frequency range.

The frequency range may be 5kHz to 10 kHz.

There may be at least one notch in the template HRTF that is outside the frequency range.

The audio system may be part of a head-mounted device.

Alternatively, the audio system may be separate from and external to the headset.

According to a second aspect of the invention, there is provided a non-transitory computer readable medium configured to store program code instructions which, when executed by a processor, cause the processor to perform steps comprising: determining one or more personalized filters based at least in part on the acoustic feature data of the user; generating one or more personalized head related transfer functions HRTFs for the user based on the template HRTFs and the determined one or more personalized filters; and providing the generated one or more personalized HRTFs to an audio system, wherein the personalized HRTFs are used to generate the spatialized audio content.

Determining the one or more personalized filters may include determining parameter values for the one or more personalized filters using the trained machine learning model and acoustic feature data of the user.

The parameter values of the one or more personalized filters may describe one or more personalized notches in the one or more personalized HRTFs.

The parameter values may include one or more of: a frequency location, a width in a frequency band centered on the frequency location, and an amount of attenuation caused in the frequency band centered on the frequency location. The parameter values may comprise each item in this list.

The machine learning model may be trained with image data, anthropometric features, and/or acoustic data, including measurements of HRTFs obtained for a population of users.

Generating one or more personalized HRTFs for a user based on the template HRTFs and the determined one or more personalized filters may include: at least one notch is added to the template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF of the one or more personalized HRTFs.

According to a third aspect of the invention, there is provided a method comprising: receiving, at a head mounted device, one or more personalized head related transfer functions (HRTFs) of a user of the head mounted device; retrieving audio data associated with a target sound source direction relative to the head mounted device; applying the one or more personalized HRTFs to the audio data to render the audio data as audio content; and rendering, by a speaker assembly of the head mounted device, the audio content, wherein the rendered audio content is spatialized such that it sounds as if it originates from the target sound source direction.

One or more HRTFs may be received from a server. The head mounted device may retrieve the audio data. The head-mounted device may apply one or more personalized HRTFs to the audio data.

The method may further comprise: capturing acoustic feature data of a user; and transmitting the captured acoustic feature data to a server, wherein the server determines one or more personalized HRTFs using the captured acoustic feature data, and the server provides the one or more personalized HRTFs to the head-mounted device.

A system for generating a personalized HRTF customized for an audio system (e.g., that may be implemented as part of a head-mounted device) is also described. The system may include a server and an audio system. The server may determine a personalized HRTF based in part on acoustic feature data (e.g., image data, anthropometric features, etc.) of the user and the template HRTF. The template HRTF may be an HRTF that may be customized (e.g., adding one or more notches) so that it may be personalized for different users. The server may provide personalized HRTFs to the audio system. The audio system may present the spatialized audio content to the user using personalized HRTFs.

The methods described herein may also be embodied as instructions stored on a computer-readable medium.

Brief Description of Drawings

Fig. 1 is a perspective view illustrating the elevation angle of a sound source from a user's viewpoint, in accordance with one or more embodiments.

Fig. 2 shows an example depiction of three HRTFs parameterized by a user's sound source elevation angle, in accordance with one or more embodiments.

Fig. 3 is a schematic diagram of a high-level system environment for generating personalized HRTFs, in accordance with one or more embodiments.

FIG. 4 is a block diagram of a server in accordance with one or more embodiments.

Fig. 5 is a flow diagram illustrating a process for processing a request for one or more personalized HRTFs for a user, in accordance with one or more embodiments.

Fig. 6 is a block diagram of an audio system in accordance with one or more embodiments.

Fig. 7 is a flow diagram illustrating a process of presenting audio content on a head mounted device using one or more personalized HRTFs, in accordance with one or more embodiments.

Fig. 8 is a system environment of a headset including an audio system in accordance with one or more embodiments.

Fig. 9 is a perspective view of a headset including an audio system in accordance with one or more embodiments.

The figures depict various examples for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative examples of the structures and methods illustrated herein may be employed without departing from the principles described herein.

Detailed Description

Overview

The system environment is configured to generate personalized HRTFs. An HRTF characterizes the sound received at a person's ear for a particular position of a sound source. A plurality of HRTFs are used to characterize how a user perceives sound. Because a person's anatomy (e.g., ear shape, shoulders, etc.) affects how sound reaches the person's ear canal, the HRTFs for a particular source direction relative to the person may be unique to that person.

An HRTF that is specific to a user typically includes features (e.g., notches) that customize the HRTF for that user. The template HRTF is an HRTF determined using data from a certain population of people, which can then be personalized for a single user. Thus, a single template HRTF is customizable to provide different personalized HRTFs for different users. The template HRTF can be considered a smoothly varying, continuous energy function with no individual sound-source-direction frequency characteristics over one or more frequency ranges (e.g., 5 kHz-10 kHz). The template HRTF is used to generate a personalized HRTF by applying one or more filters to the template HRTF. For example, a filter may be used to introduce one or more notches into the template HRTF. In some embodiments, for a given source direction, a notch is described by the following parameters: a frequency location, a bandwidth centered at the frequency location, and a band attenuation value at the frequency location. A notch may be viewed as the result of resonance in the acoustic energy as it reaches the listener's head and bounces around the head and pinna, undergoing cancellation before reaching the entrance of the ear canal. As described above, a notch can affect how a person perceives sound (e.g., the elevation from which the sound appears to originate relative to the user).
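
To make the notch parameters above concrete, the following sketch applies a single parametric notch to a template HRTF magnitude response. This is a minimal illustration, not the patent's implementation: the Gaussian dB-domain shape, the function names, and the numerical values are assumptions.

```python
# Minimal sketch (not from the patent): add one parametric notch, described by
# a frequency location, a bandwidth, and an attenuation, to a template HRTF
# magnitude response expressed in dB.
import numpy as np

def add_notch(freqs_hz, template_db, center_hz, bandwidth_hz, attenuation_db):
    """Attenuate the template HRTF magnitude (dB) in a band around center_hz."""
    sigma = bandwidth_hz / 2.355                      # FWHM -> standard deviation
    shape = np.exp(-0.5 * ((freqs_hz - center_hz) / sigma) ** 2)
    return template_db - attenuation_db * shape       # deepest attenuation at the center

# Example: a flat (smooth, notch-free) template and a personalized 12 dB notch at 7 kHz.
freqs = np.linspace(0.0, 16_000.0, 512)
template = np.zeros_like(freqs)
personalized = add_notch(freqs, template, center_hz=7_000.0,
                         bandwidth_hz=2_000.0, attenuation_db=12.0)
```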

The system environment includes a server and an audio system (which may be implemented in whole or in part as part of the headset, may be stand-alone and external to the headset, etc.). The server may receive acoustic feature data characterizing the user's head and/or the head-mounted device. For example, a user may provide images and/or video of their head and/or ears, anthropometric features of the head and/or ears, and so on to the server. The server determines parameter values for one or more personalized filters (e.g., filters that add notches) based at least in part on the acoustic feature data. For example, the server may utilize machine learning to identify parameter values for one or more notch filters based on the received acoustic feature data. The server generates one or more personalized HRTFs for the user based on the template HRTF and the personalized filters (e.g., the parameter values determined for one or more personalized notches). In some embodiments, the server provides the one or more personalized HRTFs to an audio system (e.g., which may be part of a headset) associated with the user. The audio system may apply the one or more personalized HRTFs to audio data to render the audio data as audio content. The audio system may then render the audio content (e.g., through a speaker assembly of the audio system). The presented audio content is spatialized audio content (i.e., audio content that sounds as if it originates from one or more target sound source directions).

In some embodiments, some or all of the functions of the server are performed by the audio system. For example, the server may provide personalized filters (e.g., parameter values for one or more personalized notches) to an audio system on the headset, and the audio system may generate one or more personalized HRTFs using the personalized filters and the template HRTFs.

FIG. 1 is a perspective view of the auditory perception of a user 110 when perceiving audio content, in accordance with one or more embodiments. An audio system (not shown) presents audio content to the user 110 of the audio system. In this illustrative example, the user 110 is placed at the origin of a spherical coordinate system, more specifically at the midpoint between the ears of the user 110. When the audio system in the headset provides audio content to the user 110, to facilitate the user's immersive experience, the audio system may spatialize the audio content such that the user perceives the audio content as originating from a source direction 120 relative to the headset. The source direction 120 may be described by an elevation angle φ and an azimuth angle θ 140. The elevation angle is measured from the horizontal plane 150 toward the pole of the spherical coordinate system. The azimuth angle is measured from a reference axis in the horizontal plane 150. In other embodiments, the perceived direction of origin of the sound may comprise one or more vector angles, e.g., a vector angle describing the width of the perceived direction of origin of the sound, or a vector solid angle describing the perceived region of the direction of origin of the sound. The audio content may further be spatially localized to originate from a certain distance in the target sound source direction, using the physical principle that sound pressure decreases with distance r in proportion to 1/r.

Two parameters that affect sound localization are the user's Interaural Time Difference (ITD) and Interaural Level Difference (ILD). The ITD describes the difference in arrival time of a sound between the two ears, a parameter that provides an indication of the angle or direction of the sound source relative to the head. For example, sound from a sound source located on a person's right side will reach the right ear before reaching the left ear. The ILD describes the difference in sound level or intensity between the two ears. For example, sound from a sound source located to a person's right may be louder when heard by the right ear than by the left ear, since the head occludes part of the sound wave as it propagates to the left ear. ITDs and ILDs may affect the lateralization of sound.
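
For illustration, the sketch below shows one conventional way to estimate ITD and ILD from a pair of ear signals (cross-correlation lag for ITD, RMS level ratio for ILD); the patent does not prescribe this method, so the approach and names are assumptions.

```python
# Illustrative sketch (assumed method, not the patent's): estimate ITD from the
# cross-correlation lag between the two ear signals and ILD from their RMS levels.
import numpy as np

def estimate_itd_ild(left, right, sample_rate_hz):
    """Return (ITD in seconds, ILD in dB) for one pair of ear signals."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    # ITD: lag at which the two ear signals are maximally correlated.
    corr = np.correlate(left, right, mode="full")
    lag_samples = np.argmax(corr) - (len(right) - 1)
    itd_s = lag_samples / float(sample_rate_hz)

    # ILD: level difference between the ears in dB.
    rms = lambda x: np.sqrt(np.mean(x ** 2) + 1e-12)
    ild_db = 20.0 * np.log10(rms(left) / rms(right))
    return itd_s, ild_db
```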

In some embodiments, the personalized HRTFs of the user are parameterized by the sound source elevation and azimuth. Therefore, for a target user audio perception of a particular source direction 120 with defined values of the elevation angle φ and the azimuth angle θ 140, the audio content provided to the user may be modified by a set of HRTFs personalized for the user and for the target source direction 120. Some embodiments may also spatially localize the presented audio content at a target distance in the target sound source direction, depending on the distance between the user 110 and the target location from which the sound is intended to be perceived to originate.

Template HRTF

The template HRTF is an HRTF that can be customized so that it can be personalized to different users. The template HRTF can be considered as a smoothly varying continuous energy function, without individual sound source directional frequency characteristics, but describing the average sound source directional frequency characteristics of a set of listeners (e.g., in some cases all listeners).

In some embodiments, the template HRTF is generated from a universal HRTF of a user population. In some embodiments, the universal HRTF corresponds to an average HRTF obtained in a population of users. In some embodiments, the universal HRTF corresponds to an HRTF in an HRTF database obtained from a population of users. In some embodiments, the criterion for selecting the one HRTF from the database of HRTFs corresponds to a predefined machine learning or statistical model or statistical measure. The universal HRTF shows the average frequency characteristics of different sound source directions in a user population.

In some embodiments, the template HRTF may be considered to preserve the average angle-dependent ITDs and ILDs of the general user population. However, the template HRTF does not show any personalized frequency characteristics (e.g., notches at specific locations). A notch may be viewed as the result of resonance in the acoustic energy when the acoustic energy reaches the listener's head and bounces around the head and pinna, undergoing cancellation before reaching the entrance of the ear canal. The notches in an HRTF (e.g., number of notches, location of notches, width of notches, etc.) are customized/personalized for a particular user. Thus, the template HRTF is a generic, non-personalized parametric frequency transfer function that has been modified to remove personalized notches in the spectrum, particularly those between 5 kHz and 10 kHz. In some embodiments, these notches may also be located below 5 kHz and above 10 kHz.

For the user, a fully personalized "real" HRTF is a high dimensional data set that relies on tens of thousands of parameters, providing the listener with a realistic perception of the elevation of the sound source. Features such as the geometry of the user's head, the shape of the pinna of the ear, the geometry of the ear canal, the density of the head, environmental characteristics, etc., all transform the audio content as it propagates from the source location and affect the manner in which the individual user perceives the audio (e.g., attenuates or amplifies the frequencies of the generated audio content). In short, the personalized "real" HRTF for a user includes personalized notches in the frequency spectrum.

Fig. 2 shows an example depiction of three HRTFs parameterized by a user's sound source elevation angle, in accordance with one or more embodiments. The three HRTFs include a real HRTF 210 for the user, a template HRTF 220, and a personalized HRTF 230. Each HRTF is plotted as color-coded energy values in decibels (ranging from -20 dB to 20 dB) as a function of elevation angle in degrees (ranging from -90 degrees to 90 degrees) and frequency in kilohertz (ranging from 0.0 kHz to 16.0 kHz), as discussed further below. Note that although not shown, each of these HRTFs also has a plot as a function of azimuth.

The real HRTF 210 describes the real frequency attenuation characteristics that affect how the ear receives sound from a point in space over the elevation range shown. Note that in the frequency range of about 5.0 kHz to 16.0 kHz, the real HRTF 210 exhibits frequency attenuation characteristics over the entire elevation range. This is visually depicted as a notch 240. This means that for audio content within the 5.0 kHz-16.0 kHz frequency band, in order for the audio content to provide the user with a truly immersive experience with respect to the sound source elevation, the generated audio content would ideally be convolved with HRTFs that are as close as possible to the real HRTF 210 over the illustrated elevation range.

The template HRTF 220 represents an example of the frequency attenuation characteristics displayed by a generic centroid HRTF that preserves the average angle-dependent ITDs and ILDs of a general user population. Note that the template HRTF 220 exhibits characteristics similar to the real HRTF 210 over a frequency range of about 0.0 kHz-5.0 kHz. However, in the frequency range of about 5.0 kHz-16.0 kHz, unlike the real HRTF 210, the template HRTF 220 exhibits reduced frequency attenuation characteristics over the elevation range shown.

The personalized HRTF 230 is a version of the template HRTF 220 that has been personalized for the user. As discussed below with reference to figs. 3-7, personalization applies one or more filters to the template HRTF. One or more filters may be used to introduce one or more notches into the template HRTF. In the illustrated example, two notches 250 are added to the template HRTF 220 to form the personalized HRTF 230. Note that the personalized HRTF 230 exhibits characteristics similar to the real HRTF 210 over the frequency range of 0.0 kHz-16.0 kHz, due in part to the fact that the notches 250 approximate the notches 240 in the real HRTF 210.

Overview of the System

Fig. 3 is a schematic diagram of a high-level system environment 300 for determining personalized HRTFs for a user 310, in accordance with one or more embodiments. The head mounted device 320 communicates with the server 330 over the network 340. The user 310 may wear the head mounted device 320.

The server 330 receives acoustic feature data. For example, the user 310 may provide acoustic feature data to the server 330 via the network 340. The acoustic feature data describes features of the head of the user 310 and/or the head-mounted device 320. The acoustic feature data may include, for example, one or more images of the head and/or ears of the user 310, one or more videos of the head and/or ears of the user 310, anthropometric features of the head and/or ears of the user 310, one or more images of the head wearing the headset 320, one or more images of the headset 320 alone, one or more videos of the head wearing the headset 320, one or more videos of the headset 320 alone, or some combination thereof. The anthropometric features of the user 310 are measurements of the head and/or ears of the user 310. In some embodiments, the anthropometric features may be measured using a measuring instrument such as a tape measure and/or a straightedge. In some embodiments, an imaging device (not shown) is used to capture images and/or video of the head and/or ears of the user 310. The imaging device may be a camera on the headset 320, a Depth Camera Assembly (DCA) that is part of the headset 320, an external camera (e.g., part of a mobile device), an external DCA, some other device configured to capture images and/or depth information, or some combination thereof. In some embodiments, the imaging device is also used to capture images of the headset 320. Data may be provided to the server 330 via the network 340.

To more accurately capture the user's head, the user 310 (or some other party) positions the imaging device at different positions relative to their head such that the captured images cover different portions of the head of the user 310. The user 310 may hold the imaging device at different angles and/or distances relative to the user 310. For example, the user 310 may hold the imaging device at arm's length directly in front of the face of the user 310 and use the imaging device to capture an image of the face of the user 310. The user 310 may also hold the imaging device at a distance shorter than arm's length, with the imaging device pointing at the side of the head of the user 310, to capture images of the ears and/or shoulders of the user 310. In some embodiments, the imaging device may run feature recognition software and automatically capture an image when a feature of interest (e.g., an ear, a shoulder) is recognized or input is received from the user to capture the image. In some embodiments, the imaging device may have an application with a Graphical User Interface (GUI) that guides the user 310 to capture multiple images of the head of the user 310 from particular angles and/or distances relative to the user 310. For example, the GUI may request a frontal image of the face of the user 310, a right ear image of the user 310, and a left ear image of the user 310. In some embodiments, the anthropometric features are determined by the imaging device using images and/or video captured by the imaging device.

In the illustrated example, data is provided from the headset 320 to the server 330 via the network 340. However, in alternative embodiments, some other device (e.g., a mobile device (e.g., a smartphone, a tablet, etc.), a desktop computer, an external camera, etc.) may be used to upload data to the server 330. In some embodiments, the data may be provided directly to server 330.

The network 340 may be any suitable communication network for data transmission. The network 340 is typically the Internet, but may be any network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile wired or wireless network, a private network, or a virtual private network. In some example embodiments, the network 340 is the internet and uses standard communication technologies and/or protocols. Thus, the network 340 may include links using technologies such as Ethernet, 802.11, Worldwide Interoperability for Microwave Access (WiMAX), 3G, 4G, Digital Subscriber Line (DSL), Asynchronous Transfer Mode (ATM), InfiniBand, PCI Express Advanced Switching, and so on. In some example embodiments, the entities use custom and/or dedicated data communication techniques instead of, or in addition to, the techniques described above.

The server 330 uses the acoustic feature data of the user and the template HRTF to generate a personalized HRTF for the user 310. In some embodiments, there is a single template HRTF for all users. However, in alternative embodiments, there are multiple different template HRTFs, and each template HRTF is for a different group having one or more common features (e.g., head size, ear shape, male, female, etc.). In some embodiments, each template HRTF is associated with a particular feature. These features may be, for example, head size, head shape, ear size, gender, age, some other feature that affects how a person perceives sound, or some combination thereof. For example, based on changes in head size and/or age, there may be different template HRTFs (e.g., a template HRTF for children and a different template HRTF for adults) because ITDs may scale with head diameter. In some embodiments, the server 330 uses the acoustic feature data to determine one or more features (e.g., ear size, shape, head size, etc.) that describe the head of the user 310. The server 330 may then select a template HRTF based on the one or more features.

The server 330 uses a trained machine learning system on the acoustic feature data to obtain filters customized for the user. Filters may be applied to the template HRTF to create a personalized HRTF. The filter may be, for example, bandpass (e.g., describing peaks), bandstop (e.g., describing notches), highpass (e.g., describing high frequency shelves), lowpass (e.g., describing low frequency shelves), or some combination thereof. The filter may be described by one or more parameter values. The parameter values may include, for example, frequency location, frequency bandwidth centered at the frequency location (e.g., determined by a quality factor and/or filter order), and depth at the frequency location (e.g., gain). The depth at a frequency location refers to the attenuation value in the frequency band at the frequency location. A single filter or a combination of filters may be used to describe one or more notches. In some embodiments, server 330 uses a trained Machine Learning (ML) model to determine filter parameter values for one or more personalized filters using acoustic feature data of user 310. The ML model may determine the filter based in part on ITDs and/or ILDs estimated from the acoustic feature data. As mentioned above, ITD may affect, for example, elevation angle, and ILD may have some impact on lateralization. Based on the respective filter parameter values, one or more personalized filters are each applied to the template HRTF to modify the template HRTF (e.g., add one or more notches) to generate a personalized HRTF for the user 310 (e.g., at least one for each ear). The personalized HRTF can be parameterized by elevation and azimuth. In some embodiments, when multiple users may operate the head mounted device 320, the ML model may determine parameter values to be applied to the personalized notches of the template HRTFs for each particular individual user to generate a personalized HRTF for each of the multiple users.
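
A minimal sketch of the server-side flow just described, under assumed data layouts: a hypothetical `predict_filter_params` stands in for the trained ML model, and a hypothetical `add_notch` helper stands in for applying a single filter to the template HRTF magnitude response (e.g., the notch sketch shown earlier).

```python
# Sketch of the server-side flow: ML-predicted filter parameters are applied,
# per source direction, to a template HRTF magnitude response. The helpers
# `predict_filter_params` and `add_notch` are hypothetical and passed in.
import numpy as np

def personalize_hrtf(template_hrtf_db, freqs_hz, acoustic_feature_data,
                     predict_filter_params, add_notch):
    """template_hrtf_db: dict mapping (elevation, azimuth) -> magnitude response in dB.
    predict_filter_params: trained model returning, per direction, a list of
    (center_hz, bandwidth_hz, attenuation_db) tuples."""
    params_by_direction = predict_filter_params(acoustic_feature_data)
    personalized = {}
    for direction, response_db in template_hrtf_db.items():
        response = np.asarray(response_db, dtype=float)
        # Apply each predicted personalized filter (e.g., notch) for this direction.
        for center_hz, bandwidth_hz, attenuation_db in params_by_direction.get(direction, []):
            response = add_notch(freqs_hz, response, center_hz, bandwidth_hz, attenuation_db)
        personalized[direction] = response
    return personalized
```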

In some embodiments, the server 330 provides the personalized HRTFs to the head mounted device 320 via the network 340. An audio system (not shown) in the head mounted device 320 stores the personalized HRTFs. The head mounted device 320 may then present audio content to the user 310 using the personalized HRTFs such that it sounds as if it originates from a particular location relative to the user (e.g., from a virtual object in front of or behind the user in the room). For example, the head-mounted device 320 may convolve the audio data with one or more personalized HRTFs to generate spatialized audio content that, when rendered, sounds as if it originates from that particular location.

In some embodiments, the server 330 provides the generated personalized set of filter parameter values to the headset 320. In this embodiment, an audio system (not shown) in the head mounted device 320 applies the personalized set of filter parameter values to the template HRTF to generate one or more personalized HRTFs. The template HRTF may be stored locally on the head mounted device 320 and/or retrieved from some other location (e.g., the server 330).

Fig. 4 is a block diagram of a server 400 in accordance with one or more embodiments. Server 330 is an embodiment of server 400. The server 400 includes various components including, for example, a data store 410, a communication module 420, a template HRTF generation module 430, and an HRTF personalization module 440. Some embodiments of server 400 have different components than those described herein. Similarly, functionality may be distributed among components in a different manner than described herein. And in some embodiments one or more functions of the server 400 may be performed by other components (e.g., the audio system of the headset).

The data store 410 stores data for use by the server 400. The data in the data store 410 may include, for example, one or more template HRTFs, one or more personalized HRTFs, personalized filters (e.g., a personalized set of filter parameter values), a user profile, acoustic feature data, other data related to use of the server 400, audio data, or some combination thereof. In some embodiments, the data store 410 stores one or more template HRTFs from the template HRTF generation module 430, stores personalized HRTFs from the HRTF personalization module 440, stores a personalized set of filter parameter values from the HRTF personalization module 440, or some combination thereof. In some embodiments, the data store 410 may periodically receive and store updated time-stamped template HRTFs from the template HRTF generation module 430. In some embodiments, periodically updated personalized HRTFs for a user may be received from the HRTF personalization module 440, time stamped, and stored in the data store 410. In some embodiments, the data store 410 may receive and store a personalized set of time-stamped filter parameter values from the HRTF personalization module 440.

The communication module 420 communicates with one or more head-mounted devices (e.g., head-mounted device 320). In some embodiments, the communication module 420 may also communicate with one or more other devices (e.g., imaging devices, smartphones, etc.). The communication module 420 may communicate via, for example, the network 340 and/or some direct coupling (e.g., Universal Serial Bus (USB), WIFI, etc.). The communication module 420 may receive a request from the headset for a personalized HRTF for a particular user, acoustic feature data (from the headset and/or some other device), or some combination thereof. The communication module 420 may also provide the head-mounted device with one or more personalized HRTFs, one or more personalized sets of filter parameter values, one or more template HRTFs, or some combination thereof.

The template HRTF generation module 430 generates a template HRTF. The generated template HRTF may be stored in the data store 410 and may also be transmitted to the head-mounted device to be stored on the head-mounted device. In some embodiments, the template HRTF generation module 430 generates the template HRTF from a generic HRTF. The generic HRTF is associated with some group of users and may include one or more notches. Notches in the generic HRTF correspond to amplitude variations over a frequency window or band. A notch is described by the following parameters: a frequency location, a bandwidth centered at the frequency location, and a band attenuation value at the frequency location. In some embodiments, notches in an HRTF are identified as frequency locations where the amplitude variation exceeds a predetermined threshold. Thus, notches in the generic HRTF can be considered to represent the average attenuation characteristics as a function of frequency and direction for a user population.
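
As an illustration of flagging notches where the amplitude variation exceeds a threshold, the sketch below compares the measured magnitude with a smoothed envelope; the moving-average envelope, window size, and threshold are assumptions, not values from the patent.

```python
# Hedged sketch of one way to flag notches: frequency locations where the
# magnitude dips below a smoothed envelope by more than a threshold.
import numpy as np

def find_notches(freqs_hz, magnitude_db, threshold_db=6.0, window_bins=31):
    """Return a list of (frequency_hz, depth_db) pairs for candidate notches."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    magnitude_db = np.asarray(magnitude_db, dtype=float)
    kernel = np.ones(window_bins) / window_bins
    envelope = np.convolve(magnitude_db, kernel, mode="same")   # smoothed trend
    depth = envelope - magnitude_db                              # positive where the response dips
    notches = []
    for i in range(1, len(freqs_hz) - 1):
        is_local_min = (magnitude_db[i] <= magnitude_db[i - 1]
                        and magnitude_db[i] <= magnitude_db[i + 1])
        if is_local_min and depth[i] > threshold_db:
            notches.append((freqs_hz[i], float(depth[i])))
    return notches
```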

The template HRTF generation module 430 removes notches in the generic HRTF over some or all of the audible frequency band (the range of sound that can be perceived by humans) to form a template HRTF. The template HRTF generation module 430 may also smooth the template HRTF such that some or all of it is a smooth and continuous function. In some embodiments, the template HRTF is generated as a smooth and continuous function that lacks notches over some frequency ranges, but does not necessarily lack notches outside those frequency ranges. In some embodiments, the template HRTF has no notches in the frequency range of 5 kHz-10 kHz. This may be important because notches in this frequency range vary from user to user. This means that in the frequency range of about 5 kHz-10 kHz, the number, size, and location of notches may have a large effect on how acoustic energy is received at the entrance of the ear canal (and thus on the user's perception). Thus, making the template HRTF a smooth and continuous function without notches in the frequency range of about 5 kHz-10 kHz makes it a suitable template that can be personalized for different users. In some embodiments, the template HRTF generation module 430 generates template HRTFs as smooth and continuous functions that lack notches over all frequency ranges. In some embodiments, the template HRTF generation module 430 generates HRTFs that are smooth and continuous functions over one or more frequency bands, but may include notches outside of the one or more frequency bands. For example, the template HRTF generation module 430 may generate a template HRTF that lacks notches over a range of frequencies (e.g., about 5 kHz-10 kHz) but may include one or more notches outside of that range.
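
One simple way to realize the notch removal described above is sketched below: the 5 kHz-10 kHz band of a generic HRTF magnitude response is replaced with a smoothed spectral envelope so that the band becomes a smooth, continuous, notch-free function. The smoothing method and window size are assumptions, not the patent's exact procedure.

```python
# Sketch (assumed procedure): form a template HRTF by replacing the 5-10 kHz
# band of a generic HRTF magnitude response with a smoothed spectral envelope.
import numpy as np

def make_template(freqs_hz, generic_db, band_hz=(5_000.0, 10_000.0), window_bins=51):
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    generic_db = np.asarray(generic_db, dtype=float)
    kernel = np.ones(window_bins) / window_bins
    smoothed = np.convolve(generic_db, kernel, mode="same")   # notch-free envelope
    template = generic_db.copy()
    in_band = (freqs_hz >= band_hz[0]) & (freqs_hz <= band_hz[1])
    template[in_band] = smoothed[in_band]                     # remove per-user notches in the band
    return template
```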

Note that the generic HRTF used to generate the template HRTF is based on a user population. In some embodiments, a population may be selected such that it represents the majority of users, and a single template HRTF is generated from that population and used to generate some or all of the personalized HRTFs.

In other embodiments, multiple populations are used to generate different universal HRTFs, and the populations are such that each is associated with one or more common features. These features may be, for example, head size, head shape, ear size, ear shape, age, gender, some other feature that affects how a person perceives sound, or some combination thereof. For example, one population may be for adults, one population for children, one population for males, one population for females, and so forth. Template HRTF generation module 430 may generate a template HRTF for one or more of a plurality of general HRTFs. Thus, there may be multiple different template HRTFs, and each template HRTF is for a different group that shares some common feature set.

In some embodiments, template HRTF generation module 430 may periodically generate new template HRTFs and/or modify previously generated template HRTFs as more population HRTF data is obtained. Template HRTF generation module 430 may store each newly generated template HRTF and/or each update to the template HRTF in data store 410. In some embodiments, the server 400 may send the newly generated template HRTF and/or an update of the template HRTF to the head mounted device.

The HRTF personalization module 440 determines filters personalized for a user based at least in part on acoustic feature data associated with the user. A filter may be described by one or more filter parameter values that are personalized for the user. The HRTF personalization module 440 employs a trained Machine Learning (ML) model on the acoustic feature data of the user to determine personalized filter parameter values for one or more personalized filters (e.g., notch filters) customized for the user. In some embodiments, the personalized filter parameter values are parameterized by the sound source elevation and azimuth. The ML model is first trained using data collected from a population of users. The collected data may include, for example, image data, anthropometric features, and acoustic data. Training may include supervised or unsupervised learning algorithms, including but not limited to linear and/or logistic regression models, neural networks, classification and regression trees, k-means clustering, vector quantization, or any other machine learning algorithm. The acoustic data may include HRTFs measured using an audio measurement device and/or simulated by numerical analysis from three-dimensional scans of the head.

In some embodiments, the filters and/or filter parameter values are derived via machine learning directly from the user's image data, which corresponds to one or more snapshots of the left and right ears taken by a camera (in a phone or otherwise). In some embodiments, the filters and/or filter parameter values are derived by machine learning from one or more videos of the left and right ears captured by a camera (in a phone or otherwise). In some embodiments, the filters and/or filter parameter values are derived from anthropometric features of the user that correspond to physical features of the left and right ears. These anthropometric features include left and right ear heights, left and right ear widths, left and right ear concha cavity (cavum concha) heights, left and right ear concha cavity widths, left and right ear concha boat (cymba) heights, left and right ear fossa heights, left and right ear pinna heights and widths, left and right ear notch widths, and other relevant physical measurements. In some embodiments, the filters and/or filter parameter values are derived from a weighted combination of the photograph, video, and anthropometric results.
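
Purely for illustration, the anthropometric measurements listed above could be packaged as a model input vector along the following lines; the field names and units are assumptions, not the patent's data format.

```python
# Illustrative container (field names and units assumed) for one ear's
# anthropometric measurements, flattened into a feature vector for a model.
from dataclasses import dataclass, astuple

@dataclass
class EarAnthropometrics:
    ear_height_mm: float
    ear_width_mm: float
    cavum_concha_height_mm: float
    cavum_concha_width_mm: float
    cymba_concha_height_mm: float
    fossa_height_mm: float
    pinna_height_mm: float
    pinna_width_mm: float
    notch_width_mm: float

    def as_vector(self):
        return list(astuple(self))

# Example: one (made-up) set of left-ear measurements as a model input vector.
left_ear = EarAnthropometrics(62.0, 33.0, 19.0, 16.0, 7.5, 17.0, 60.0, 30.0, 8.0)
features = left_ear.as_vector()
```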

In some embodiments, the ML model uses a convolutional neural network model having a layer of nodes, where the values at the nodes of the current layer are a transformation of the values at the nodes of the previous layer. The transformation in the model is determined by a set of weights and parameters that connect the current layer and the previous layer. In some examples, the transformation may also be determined by a set of weights and parameters used to transform between previous layers in the model.

The inputs to the neural network model may be some or all of the acoustic feature data of the user and the template HRTF, encoded onto the first convolutional layer; the outputs of the neural network model are the filter parameter values to be applied to one or more personalized notches of the template HRTF, parameterized by elevation and azimuth. These are decoded from the output layer of the neural network. The weights and parameters of the transformations across multiple layers of the neural network model may indicate the relationship between the information contained in the starting layer and the information obtained from the final output layer. For example, the weights and parameters may be a quantization of the user features contained in the user image data. The weights and parameters may also be based on historical user data.
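
A toy PyTorch sketch of the kind of convolutional model described above: it maps an ear image to notch parameters (frequency location, bandwidth, attenuation) for a fixed grid of directions. The layer sizes, the fixed output dimensionality, and the framework choice are illustrative assumptions, not the patent's architecture.

```python
# Toy model (architecture assumed): an ear image in, notch parameters out,
# as 3 values (center frequency, bandwidth, attenuation) per notch per direction.
import torch
import torch.nn as nn

class NotchParameterNet(nn.Module):
    def __init__(self, num_directions=64, notches_per_direction=2):
        super().__init__()
        self.num_directions = num_directions
        self.notches_per_direction = notches_per_direction
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, num_directions * notches_per_direction * 3)

    def forward(self, ear_image):                    # ear_image: (batch, 1, H, W)
        x = self.features(ear_image).flatten(1)
        params = self.head(x)
        return params.view(-1, self.num_directions, self.notches_per_direction, 3)

# Example usage with a dummy 128x128 grayscale ear image.
model = NotchParameterNet()
notch_params = model(torch.zeros(1, 1, 128, 128))   # shape: (1, 64, 2, 3)
```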

The ML model may include any number of machine learning algorithms. Some other ML models that may be used are linear and/or logistic regression, classification and regression trees, k-means clustering, vector quantization, and the like. In some embodiments, the ML model includes a deterministic method that has been trained with reinforcement learning (thereby creating a reinforcement learning model). The model is trained to improve the quality of a personalized set of filter parameter values generated using measurements from a monitoring system within an audio system at a head-mounted device.

The HRTF personalization module 440 selects a template HRTF for generating one or more personalized HRTFs for the user. In some embodiments, the HRTF personalization module 440 simply retrieves a single template HRTF (e.g., from the data store 410). In other embodiments, the HRTF personalization module 440 determines one or more features associated with the user from the acoustic feature data and uses the determined one or more features to select a template HRTF from a plurality of template HRTFs.

The HRTF personalization module 440 generates one or more personalized HRTFs for the user using the selected template HRTF and one or more personalized filters (e.g., a set of filter parameter values). The HRTF personalization module 440 applies the personalized filters (e.g., one or more sets of personalized filter parameter values) to the selected template HRTF to form the personalized HRTFs. In some embodiments, the HRTF personalization module 440 adds at least one notch to the selected template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF. In this way, the HRTF personalization module 440 can approximate a real HRTF (e.g., as described above with respect to fig. 2) by adding one or more notches (personalized to the user) to the template HRTF. In some embodiments, the HRTF personalization module 440 may then provide the one or more personalized HRTFs to the head-mounted device (via the communication module 420). In an alternative embodiment, the HRTF personalization module 440 provides the personalized set of filter parameter values to the head mounted device, and the head mounted device generates one or more personalized HRTFs using the template HRTF.

Fig. 5 is a flow diagram illustrating a process 500 for processing a request for one or more personalized HRTFs for a user, in accordance with one or more embodiments. In one embodiment, the process of FIG. 5 is performed by a server (e.g., server 400). In other embodiments, other entities (e.g., consoles) may perform some or all of the steps of the process. Likewise, embodiments may include different and/or additional steps, or perform the steps in a different order.

The server 400 receives 510 acoustic feature data associated with a user. For example, the server 400 may receive one or more images of the user's head and/or ears. The acoustic feature data may be provided to the server over a network from, for example, an imaging device, a mobile device, a headset, etc.

The server 400 selects 520 a template HRTF. The server 400 selects the template HRTF from one or more templates (e.g., stored in a data store). In some embodiments, the server 400 selects the template HRTF based in part on acoustic feature data associated with the user. For example, the server 400 may use the acoustic feature data to determine that the user is an adult and select a template HRTF associated with adults (rather than one associated with children).

The server 400 determines 530 one or more personalized filters based in part on the acoustic feature data. The determination is performed using a trained machine learning model. In some embodiments, at least one personalized filter is described by one or more sets of filter parameter values, and each set of filter parameter values describes a single notch. The personalized filter parameter values describe a frequency location, a frequency bandwidth centered at the frequency location (e.g., determined by a quality factor and/or filter order), and a depth (e.g., gain) at the frequency location. In some embodiments, the personalized filter parameter values are parameterized for each elevation and azimuth pair in a spherical coordinate system centered on the user. In some embodiments, the personalized filter parameter values lie within one or more particular frequency ranges (e.g., 5 kHz-10 kHz).
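
As one possible realization of a filter described by a frequency location, quality factor, and gain, the sketch below builds a biquad cut filter using the widely used Robert Bristow-Johnson "Audio EQ Cookbook" peaking-filter formulas. The patent does not specify a filter topology, so this realization is an assumption.

```python
# One possible IIR realization (assumed): a biquad cut filter from the RBJ
# "Audio EQ Cookbook" peaking-EQ formulas, parameterized by frequency
# location, quality factor (bandwidth), and cut depth (gain).
import numpy as np

def notch_biquad(center_hz, q, cut_db, sample_rate_hz=48_000.0):
    """Return normalized (b, a) biquad coefficients that cut |cut_db| dB at center_hz."""
    a_gain = 10.0 ** (-abs(cut_db) / 40.0)            # gain < 1 -> attenuation at the center
    w0 = 2.0 * np.pi * center_hz / sample_rate_hz
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * a_gain, -2.0 * np.cos(w0), 1.0 - alpha * a_gain])
    a = np.array([1.0 + alpha / a_gain, -2.0 * np.cos(w0), 1.0 - alpha / a_gain])
    return b / a[0], a / a[0]

# Example: a 12 dB cut centered at 7 kHz with Q = 4 at a 48 kHz sample rate.
b, a = notch_biquad(center_hz=7_000.0, q=4.0, cut_db=12.0)
```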

The server 400 generates 540 one or more personalized HRTFs for the user based on the template HRTF and the one or more personalized filters (e.g., one or more sets of filter parameter values). The server 400 adds at least one notch to the template HRTF using the one or more personalized filters (e.g., through one or more sets of filter parameter values) to generate a personalized HRTF.

The server 400 provides 550 the one or more personalized HRTFs to an audio system associated with the user. In some embodiments, some or all of the audio system may be part of a head-mounted device. In other embodiments, some or all of the audio system may be separate from and external to the headset. The audio system may use the one or more personalized HRTFs to render audio content for the user.

Note that in an alternative embodiment, the server 400 provides the one or more personalized filters (and possibly the template HRTF) to the headset, and step 540 is performed by the headset.

Fig. 6 is a block diagram of an audio system 600 in accordance with one or more embodiments. In some embodiments, the audio system of fig. 6 is a component of a head-mounted device that provides audio content to a user. In other embodiments, some or all of the audio system 600 is separate from and external to the headset. For example, the audio system 600 may be part of a console. The audio system 600 includes a speaker assembly 610 and an audio controller 620. Some embodiments of the audio system 600 have different components than those described herein. Similarly, functionality may be distributed among components in a different manner than described herein.

The speaker assembly 610 provides audio content to a user of the audio system 600. The speaker assembly 610 includes speakers that provide audio content according to instructions from the audio controller 620. In some embodiments, one or more speakers of the speaker assembly 610 may be located remotely from the headset (e.g., within a local area of the headset). The speaker assembly 610 is configured to provide audio content to one or both ears of a user of the audio system 600 using the speakers. A speaker may be, for example, a moving coil transducer, a piezoelectric transducer, some other device that generates acoustic pressure waves using an electrical signal, or some combination thereof. A typical moving coil transducer includes a coil of wire and a permanent magnet that generates a permanent magnetic field. When the coil is placed in the permanent magnetic field, applying a current to the coil generates a force that can move the coil toward or away from the permanent magnet, depending on the amplitude and polarity of the current. A piezoelectric transducer comprises a piezoelectric material that can be strained by applying an electric field or voltage across the material. Some examples of piezoelectric materials include polymers (e.g., polyvinyl chloride (PVC), polyvinylidene fluoride (PVDF)), polymer-based composites, ceramics, or crystals (e.g., quartz (silica or SiO2), lead zirconate titanate (PZT)). One or more speakers placed near the user's ears may be coupled to a soft material (e.g., silicone) that adheres well to the user's ears and may be comfortable for the user.

The audio controller 620 controls the operation of the audio system 600. In some embodiments, the audio controller 620 obtains acoustic feature data associated with the head-mounted device user. The acoustic feature data may be obtained from an imaging device on the headset (e.g., a depth camera component) or from some other device (e.g., a smartphone). In some embodiments, audio controller 620 may be configured to determine anthropometric features based on data from the imaging device and/or other devices. For example, audio controller 620 may derive anthropometric features using a weighted combination of photos, videos, and anthropometric results. In some embodiments, audio controller 620 provides acoustic feature data to a server (e.g., server 400) via a network (e.g., network 340).

The audio system 600 generates audio content using one or more personalized HRTFs. One or more personalized HRTFs are customized for a user. In some embodiments, some or all of the one or more personalized HRTFs are received from a server. In some embodiments, audio controller 620 generates one or more personalized HRTFs using data (e.g., a personalized set of notch parameters and a template HRTF) received from a server.

In some embodiments, audio controller 620 may identify an opportunity to present audio content having a target sound source direction to a user of audio system 600, for example, when a flag for presenting audio content having a target sound source direction appears in the virtual experience. The audio controller 620 may first retrieve audio data that will be subsequently rendered to generate audio content for presentation to the user. The audio data may additionally specify a target sound source direction and/or a target position of a virtual source of audio content within a local area of the audio system 600. Each target sound source direction describes a spatial direction of a virtual source of sound. Further, the target sound source position is a spatial position of the virtual source. For example, the audio data may include an explosion from a first target sound source direction and/or target location behind the user, and a bird call from a second target sound source direction and/or target location in front of the user. In some embodiments, the target sound source direction and/or target position may be organized in a spherical coordinate system, with the user located at the origin of the spherical coordinate system. Then, each target sound source direction is represented as an elevation angle with respect to a horizontal plane and an azimuth angle in a spherical coordinate system, as shown in fig. 1. The target sound source position includes an elevation angle from a horizontal plane, an azimuth angle, and a distance from an origin of a spherical coordinate system.

The audio controller 620 applies the one or more personalized HRTFs for the user to the audio data based on the target sound source direction and/or target position associated with the audio data to be presented to the user. The audio controller 620 convolves the audio data with the one or more personalized HRTFs to render audio content that is spatialized so as to appear to originate from the target sound source direction and/or target position. The audio controller 620 provides the rendered audio content to the speaker assembly 610 for presentation to the user of the audio system.
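A minimal sketch of this rendering step, assuming the personalized HRTFs for the target direction are available as time-domain head-related impulse responses (HRIRs); the function and variable names are illustrative and are not part of the audio controller 620's actual interface:

```python
import numpy as np
from scipy.signal import fftconvolve

def render_spatialized(mono_audio, hrir_left, hrir_right):
    """Render mono audio data so that it appears to originate from the direction
    associated with the given pair of HRIRs (the time-domain equivalent of HRTFs)."""
    left = fftconvolve(mono_audio, hrir_left, mode="full")
    right = fftconvolve(mono_audio, hrir_right, mode="full")
    stereo = np.stack([left, right], axis=-1)
    # Normalize to avoid clipping before handing the buffer to the speaker assembly.
    peak = np.max(np.abs(stereo))
    return stereo / peak if peak > 1.0 else stereo
```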

Fig. 7 is a flow diagram illustrating a process 700 of presenting audio content on a head-mounted device using one or more personalized HRTFs, in accordance with one or more embodiments. In one embodiment, the process of fig. 7 is performed by a headset. In other embodiments, other entities may perform some or all of the steps of the process. For example, steps 710 and 720 may be performed by some other device. Additionally, embodiments may include different and/or additional steps, or perform the steps in a different order.

The head mounted device captures 710 acoustic feature data of the user. The headset may capture images and/or video of the user's head and ears, for example, using an imaging device in the headset. In some embodiments, the headset may communicate with an external device (e.g., camera, mobile device/phone, etc.) to receive acoustic feature data.

The head-mounted device provides 720 the acoustic feature data to a server (e.g., the server 400). In some embodiments, the acoustic feature data may be pre-processed at the headset before being provided to the server. For example, in some embodiments, the headset may use the captured images and/or video to determine anthropometric features of the user.

The head mounted device receives 730 one or more personalized HRTFs from a server. One or more personalized HRTFs are customized for a user.

The head-mounted device renders 740 the audio content using the one or more personalized HRTFs. The head-mounted device may convolve the audio data with the one or more personalized HRTFs to generate the audio content. The audio content is presented by the speaker assembly and is perceived as originating from the target sound source direction and/or target position.

In the above embodiments, the server provides the personalized HRTFs to the head-mounted device. However, in alternative embodiments, the server may provide the template HRTF, one or more personalized filters (e.g., one or more sets of personalized filter parameter values), or some combination thereof to the head-mounted device. The head-mounted device then generates the one or more personalized HRTFs by applying the one or more personalized filters to the template HRTF.
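A minimal sketch of how a device might apply personalized notch parameters to a template HRTF magnitude response. The Gaussian notch shape, the flat placeholder template, and all names here are assumptions chosen only for illustration; they are not the filter form mandated by the embodiments above.

```python
import numpy as np

def apply_personalized_notches(template_mag_db, freqs_hz, notches):
    """Apply personalized notch filters, each described by a center frequency,
    a bandwidth, and an attenuation depth in dB, to a template HRTF magnitude
    response, yielding a personalized HRTF magnitude response.

    template_mag_db : magnitude of the template HRTF in dB, sampled at freqs_hz
    notches         : iterable of (center_hz, width_hz, depth_db) tuples, e.g.
                      personalized filter parameter values received from a server
    """
    personalized = template_mag_db.copy()
    for center_hz, width_hz, depth_db in notches:
        # Gaussian-shaped dip centered on the notch frequency; the exact notch
        # shape used by the system is not specified, so this is an illustrative choice.
        sigma = width_hz / 2.355  # full width at half maximum -> standard deviation
        dip = depth_db * np.exp(-0.5 * ((freqs_hz - center_hz) / sigma) ** 2)
        personalized -= dip
    return personalized

# Example: a single personalized notch at 7 kHz, 1.5 kHz wide, 12 dB deep.
freqs = np.linspace(20, 20_000, 1024)
template = np.zeros_like(freqs)                 # flat placeholder template
hrtf = apply_personalized_notches(template, freqs, [(7_000.0, 1_500.0, 12.0)])
```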

Artificial reality system environment

Fig. 8 is a system environment 800 including a headset 805 that includes the audio system 600, in accordance with one or more embodiments. The system 800 may operate in an artificial reality environment (e.g., a virtual reality environment, an augmented reality environment, a mixed reality environment, or some combination thereof). The system 800 shown in fig. 8 includes a headset 805 and an input/output (I/O) interface 815 coupled to a console 810, and the console 810 and/or the headset 805 communicate with the server 400 over the network 340. The headset 805 may be an embodiment of the headset 320. Although fig. 8 illustrates an example system 800 including one headset 805 and one I/O interface 815, in other embodiments any number of these components may be included in the system 800. For example, there may be multiple headsets 805, each having an associated I/O interface 815, with each headset 805 and I/O interface 815 communicating with the console 810. In alternative configurations, different and/or additional components may be included in the system 800. Additionally, in some embodiments, the functionality described in connection with one or more of the components shown in fig. 8 may be distributed among the components in a manner different than that described in connection with fig. 8. For example, some or all of the functionality of the console 810 may be provided by the headset 805.

The head-mounted device 805 may be a near-eye display (NED) or Head Mounted Display (HMD) that presents content to a wearer, the content including an augmented view of a physical reality environment with computer-generated elements (e.g., two-dimensional (2D) or three-dimensional (3D) images, 2D or 3D video, sound, etc.). In some embodiments, the presented content includes audio presented via the audio system 600, which receives audio information from the headset 805, the console 810, or both, and presents audio data based on the audio information. In some embodiments, the head-mounted device 805 presents virtual content to the wearer that is based in part on the real environment surrounding the wearer. For example, the virtual content may be presented to a wearer of the head-mounted device. The head-mounted device 805 includes the audio system 600. The headset 805 may also include a depth camera component (DCA) 825, an electronic display 830, an optics block 835, one or more position sensors 840, and an Inertial Measurement Unit (IMU) 845. Some embodiments of the headset 805 have different components than those described in connection with fig. 8. Additionally, in other embodiments, the functionality provided by the various components described in conjunction with fig. 8 may be distributed differently among the components of the headset 805 or captured in a separate component remote from the headset 805. One example of a headset is described below with reference to fig. 9.

The audio system 600 presents audio content to a user of the head-mounted device 805 using one or more personalized HRTFs. In some embodiments, the audio system 600 may receive and store (e.g., from the server 400 and/or the console 810) the user's personalized HRTFs. In some embodiments, the audio system 600 may receive and store (e.g., from the server 400 and/or the console 810) the template HRTF and/or one or more personalized filters (e.g., described by parameter values) to be applied to the template HRTF. The audio system 600 receives audio data associated with a target sound source direction relative to the headset 805. The audio system 600 applies the one or more personalized HRTFs to the audio data to generate audio content. The audio system 600 presents the audio content to the user through the speaker assembly. The presented audio content is spatialized such that, when presented with the speaker assembly, it sounds as if it originates from the target sound source direction and/or target position.

The DCA 825 captures data describing depth information for a local area around some or all of the headset 805. The DCA 825 may include an illuminator, an imaging device, and a DCA controller that may be coupled to both the illuminator and the imaging device. The illuminator illuminates the local area with illumination light, for example, according to emission instructions generated by the DCA controller. The DCA controller is configured to control operation of specific components of the illuminator based on the emission instructions, for example, to adjust the intensity and pattern of the illumination light illuminating the local area. In some embodiments, the illumination light may include a structured light pattern, such as a dot pattern, a line pattern, or the like. The imaging device captures one or more images of one or more objects in the local area illuminated with the illumination light. The DCA 825 may use the data captured by the imaging device to calculate depth information, or the DCA 825 may send this information to another device (e.g., the console 810) that may use the data from the DCA 825 to determine the depth information. The DCA 825 may also be used to capture depth information describing the user's head and/or ears, for example by removing the headset and pointing the DCA 825 at the user's head and/or ears.
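As general background on how such a sensor recovers depth (this is the classic geometry of structured-light and stereo systems, not a detail specified for the DCA 825), depth is typically obtained by triangulation: it is inversely proportional to the disparity between the projected pattern (or second view) and the captured image. A one-line sketch with hypothetical parameter names:

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Triangulation used by structured-light / stereo depth sensors: depth is
    inversely proportional to the observed disparity (in pixels) between the
    projected pattern or reference view and the captured image."""
    return focal_length_px * baseline_m / disparity_px
```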

Electronic display 830 displays 2D or 3D images to the wearer based on data received from console 810. In various embodiments, electronic display 830 comprises a single electronic display or multiple electronic displays (e.g., one display for each eye of the wearer). Examples of electronic display 830 include: a Liquid Crystal Display (LCD), an Organic Light Emitting Diode (OLED) display, an active matrix organic light emitting diode display (AMOLED), a waveguide display, some other display, or some combination thereof.

The optics block 835 magnifies image light received from the electronic display 830, corrects optical errors associated with the image light, and presents the corrected image light to the wearer of the headset 805. In various embodiments, the optics block 835 includes one or more optical elements. Example optical elements included in the optics block 835 include: a waveguide, an aperture, a Fresnel lens, a convex lens, a concave lens, a filter, a reflective surface, or any other suitable optical element that affects the image light. Further, the optics block 835 may include combinations of different optical elements. In some embodiments, one or more optical elements in the optics block 835 may have one or more coatings, such as a partially reflective coating or an anti-reflective coating.

The magnification and focusing of the image light by the optics block 835 allows the electronic display 830 to be physically smaller, weigh less, and consume less power than larger displays. Further, the magnification may increase the field of view of the content presented by the electronic display 830. For example, the field of view of the displayed content is such that the displayed content is presented using nearly all of the wearer's field of view (e.g., about 110 degrees diagonal), and in some cases all of the field of view. Additionally, in some embodiments, the amount of magnification may be adjusted by adding or removing optical elements.

In some embodiments, the optics block 835 may be designed to correct one or more types of optical errors. Examples of optical errors include barrel or pincushion distortion, longitudinal chromatic aberration, or lateral chromatic aberration. Other types of optical errors may further include spherical aberration, chromatic aberration, errors due to lens field curvature, astigmatism, or any other type of optical error. In some embodiments, content provided to the electronic display 830 for display is pre-distorted, and the optics block 835 corrects the distortion when it receives image light generated based on the content from the electronic display 830.

The IMU 845 is an electronic device that generates data indicative of the position of the headset 805 based on measurement signals received from the one or more position sensors 840. The position sensor 840 generates one or more measurement signals in response to the motion of the headset 805. Examples of the position sensor 840 include: one or more accelerometers, one or more gyroscopes, one or more magnetometers, another suitable type of sensor to detect motion, one type of sensor for error correction of the IMU 845, or some combination thereof. The position sensor 840 may be located outside of the IMU 845, inside of the IMU 845, or some combination of the two.

Based on the one or more measurement signals from the one or more position sensors 840, the IMU 845 generates data indicative of an estimated current position of the headset 805 relative to the initial position of the headset 805. For example, the position sensors 840 include multiple accelerometers for measuring translational motion (forward/backward, up/down, left/right) and multiple gyroscopes for measuring rotational motion (e.g., pitch, yaw, and roll). In some embodiments, the IMU 845 performs fast sampling of the measurement signals and calculates an estimated current position of the headset 805 from the sampled data. For example, the IMU 845 integrates the measurement signals received from the accelerometers over time to estimate a velocity vector, and integrates the velocity vector over time to determine an estimated current location of a reference point on the headset 805. Alternatively, the IMU 845 provides sampled measurement signals to the console 810, and the console 810 parses the data to reduce errors. The reference point is a point that may be used to describe the position of the headset 805. The reference point may generally be defined as a point or location in space that is related to the orientation and position of the headset 805.
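A simplified sketch of this double integration, assuming accelerometer samples that are already gravity-compensated and expressed in the world frame; the function name and the constant sampling interval are illustrative assumptions rather than details of the IMU 845:

```python
import numpy as np

def dead_reckon(accel_samples, dt, v0=None, p0=None):
    """Integrate accelerometer samples over time to estimate a velocity vector,
    then integrate the velocity vector to estimate position, as described above.
    Drift accumulates quickly, which is one reason the console may re-estimate
    position from other data."""
    v = np.zeros(3) if v0 is None else np.asarray(v0, dtype=float)
    p = np.zeros(3) if p0 is None else np.asarray(p0, dtype=float)
    positions = []
    for a in accel_samples:
        v = v + np.asarray(a, dtype=float) * dt   # acceleration -> velocity
        p = p + v * dt                            # velocity -> position
        positions.append(p.copy())
    return np.array(positions)
```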

The I/O interface 815 is a device that allows the wearer to send action requests and receive responses from the console 810. An action request is a request to perform a particular action. For example, an action request may be an instruction to start or end capturing image or video data, or an instruction to perform a particular action within an application. The I/O interface 815 may include one or more input devices. Example input devices include: a keyboard, a mouse, a game controller, or any other suitable device for receiving action requests and communicating them to the console 810. An action request received by the I/O interface 815 is communicated to the console 810, which performs an action corresponding to the action request. In some embodiments, as further described above, the I/O interface 815 includes an IMU 845 that captures calibration data indicating an estimated position of the I/O interface 815 relative to an initial position of the I/O interface 815. In some embodiments, the I/O interface 815 may provide haptic feedback to the wearer according to instructions received from the console 810. For example, haptic feedback is provided when an action request is received, or when the console 810 communicates instructions to the I/O interface 815 that cause the I/O interface 815 to generate haptic feedback when the console 810 performs the action.

The console 810 provides content to the headset 805 for processing in accordance with information received from one or more of: the headset 805 and the I/O interface 815. In the example shown in fig. 8, the console 810 includes an application storage 850, a tracking module 855, and an engine 860. Some embodiments of the console 810 have different modules or components than those described in conjunction with fig. 8. Similarly, the functionality described further below may be distributed among the components of the console 810 in a manner different from that described in conjunction with fig. 8.

The application storage 850 stores one or more applications for execution by the console 810. An application is a set of instructions that, when executed by a processor, generates content for presentation to a wearer. The content generated by the application may be responsive to input received from the wearer via the movement of the headset 805 or the I/O interface 815. Examples of applications include: a gaming application, a conferencing application, a video playback application, or other suitable application.

The tracking module 855 calibrates the system environment 800 using one or more calibration parameters and may adjust the one or more calibration parameters to reduce errors in the position determination of the headset 805 or the I/O interface 815. The calibration performed by the tracking module 855 may also take into account information received from the IMU 845 in the headset 805 and/or the IMU 845 included in the I/O interface 815. Additionally, if tracking of the headset 805 is lost, the tracking module 855 may recalibrate some or all of the system environment 800.

The tracking module 855 uses information from the one or more position sensors 840, the IMU 845, the DCA825, or some combination thereof, to track movement of the headset 805 or the I/O interface 815. For example, the tracking module 855 determines the location of a reference point of the headset 805 in the map of the local area based on information from the headset 805. The tracking module 855 may also determine the location of a reference point of the headset 805 or a reference point of the I/O interface 815 using data from the IMU 845 indicating the location of the headset 805 or the I/O interface 815, respectively, or using data from the IMU 845 included in the I/O interface 815. Additionally, in some embodiments, the tracking module 855 may use the partial data from the IMU 845 indicating the location of the headset 805 to predict a future location of the headset 805. The tracking module 855 provides the estimated or predicted future location of the headset 805 or the I/O interface 815 to the engine 860.

The engine 860 executes applications within the system environment 800 and receives position information, acceleration information, velocity information, predicted future positions of the headset 805, or some combination thereof, from the tracking module 855. Based on the received information, the engine 860 determines content to be provided to the headset 805 for presentation to the wearer. For example, if the received information indicates that the wearer has looked to the left, the engine 860 generates content for the headset 805 that reflects the wearer's movement in a virtual environment or in an environment augmenting the local area with additional content. Additionally, the engine 860 performs actions within an application executing on the console 810 in response to action requests received from the I/O interface 815 and provides feedback to the wearer that the action was performed. The feedback provided may be visual or audible feedback via the headset 805, or haptic feedback via the I/O interface 815.

Example head-mounted device

Fig. 9 is a perspective view of a headset 900 including an audio system, in accordance with one or more embodiments. The head-mounted device 900 presents media to a user. Examples of media presented by the headset 900 include one or more images, video, audio, or some combination thereof. The head-mounted device 900 may be a near-eye display, glasses, or a Head Mounted Display (HMD). The headset 900 includes components such as a frame 905, a lens 910, a sensor device 915, and an audio system (not shown). In some embodiments, the head-mounted device 900 may correct or enhance the vision of the user, protect the user's eyes, or provide images to the user. The head-mounted device 900 may be eyeglasses that correct visual defects of the user. The head-mounted device 900 may be sunglasses that protect the user's eyes from sunlight. The head-mounted device 900 may be safety glasses that protect the user's eyes from impact. The head-mounted device 900 may be a night vision device or infrared goggles that enhance the user's vision at night. In an alternative embodiment, the headset 900 may not include the lens 910 and may be a frame 905 with an audio system that provides audio content (e.g., music, radio, podcasts) to the user.

The frame 905 includes a front portion that holds the lens 910 and end pieces that attach to the user. The front portion of the frame 905 rests on top of the user's nose. The end pieces (e.g., temple arms) are the portions of the frame 905 that rest against the user's temples. The length of an end piece may be adjustable (e.g., adjustable temple length) to fit different users. The end pieces may also include portions that curl behind the user's ears (e.g., temple tips, ear pieces).

The lens 910 provides or transmits light to a user wearing the head-mounted device 900. The lens 910 is held by the front portion of the frame 905 of the headset 900. The lens 910 may be a prescription lens (e.g., a single vision lens, a bifocal lens, a trifocal lens, or a progressive lens) to help correct the user's vision deficiencies. The prescription lens transmits ambient light to the user wearing the head-mounted device 900. The transmitted ambient light may be altered by the prescription lens to correct the user's vision deficiencies. The lens 910 may be a polarized lens or a tinted lens to protect the user's eyes from sunlight. The lens 910 may be one or more waveguides that are part of a waveguide display, in which image light is coupled to the user's eye through an end or edge of the waveguide. The lens 910 may include an electronic display for providing image light, and may also include an optics block for magnifying the image light from the electronic display. In some embodiments, the lens 910 is an embodiment of the electronic display 830.

The sensor device 915 estimates a current position of the headset 900 relative to an initial position of the headset 900. The sensor device 915 may be located on a portion of the frame 905 of the headset 900. The sensor device 915 includes a position sensor and an inertial measurement unit. The sensor device 915 may also include one or more cameras placed on the frame 905 to view or face the user's eyes. The one or more cameras of the sensor device 915 are configured to capture image data corresponding to eye positions of the user's eyes. The sensor device 915 may be an embodiment of the IMU 845 and/or the position sensor 840.

An audio system (not shown) provides audio content to a user of the head-mounted device 900. The audio system is an embodiment of the audio system 600 and presents audio content using the speakers 920.

Additional configuration information

Embodiments according to the invention are specifically disclosed in the appended claims, relating to methods, storage media, and audio systems, wherein any feature mentioned in one claim category (e.g. method) may also be claimed in another claim category (e.g. storage media, audio systems, and computer program products). The dependencies or back-references in the appended claims are chosen for formal reasons only. However, any subject matter resulting from an intentional back-reference to any preceding claim (in particular multiple dependencies) may also be claimed, such that any combination of a claim and its features is disclosed and may be claimed without regard to the dependencies chosen in the appended claims.

The subject matter which can be claimed comprises not only the combination of features as set forth in the appended claims, but also any other combination of features in the claims, wherein each feature mentioned in the claims can be combined with any other feature or combination of other features in the claims. Furthermore, any of the embodiments and features described or depicted herein may be claimed in a separate claim and/or in any combination with any of the embodiments or features described or depicted herein or in any combination with any of the features of the appended claims.

In one embodiment, a method may comprise: determining one or more personalized filters based at least in part on the acoustic feature data of the user; generating one or more personalized Head Related Transfer Functions (HRTFs) for the user based on the template HRTFs and the determined one or more personalized filters; and providing the generated one or more personalized HRTFs to an audio system, wherein the personalized HRTFs are used to generate the spatialized audio content.

Determining the one or more personalized filters may include determining parameter values for the one or more personalized filters using the trained machine learning model and acoustic feature data of the user. The parameter values of the one or more personalized filters may describe one or more personalized notches in the one or more personalized HRTFs. The parameter values may include: a frequency location, a width in a frequency band centered on the frequency location, and an amount of attenuation caused in the frequency band centered on the frequency location.
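As an illustration of what such parameter values might look like in practice, the sketch below defines a per-notch parameter record and a placeholder for the trained model's inference step. The data-class fields mirror the parameters listed above, while the model, the feature inputs, and the example values are purely hypothetical; the values produced could then be applied to a template HRTF as in the earlier notch-application sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NotchParams:
    """Parameter values for one personalized notch, as described above."""
    center_hz: float       # frequency location of the notch
    width_hz: float        # width of the band centered on the frequency location
    attenuation_db: float  # attenuation caused within that band

def predict_notch_params(anthropometric_features) -> List[NotchParams]:
    """Stand-in for the trained machine learning model: maps a user's acoustic
    feature data (e.g., anthropometric features) to notch parameter values.
    The actual model architecture and feature set are not specified here."""
    # Illustrative output only; a real model would infer these from the features.
    return [NotchParams(center_hz=7_000.0, width_hz=1_500.0, attenuation_db=12.0)]
```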

A machine learning model may be trained with image data, anthropometric features, and acoustic data, including measurements of HRTFs obtained for a population of users.

Generating the one or more personalized HRTFs for the user based on the template HRTF and the determined one or more personalized filters may include: adding at least one notch to the template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF of the one or more personalized HRTFs.

The template HRTF may be based on a universal HRTF describing a user population, which may include at least one notch in a frequency range. The template HRTF may be generated from the generic HRTF by removing the at least one notch such that the template HRTF is a smooth and continuous function over the frequency range. The frequency range may be 5 kHz to 10 kHz. At least one notch may be present in the template HRTF outside the frequency range.
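A minimal sketch of deriving such a template from a generic HRTF by replacing the notched 5 kHz to 10 kHz region with a smooth interpolation between its band edges. The log-frequency linear interpolation scheme and all names are illustrative assumptions, not the method prescribed by these embodiments.

```python
import numpy as np

def make_template_from_generic(generic_mag_db, freqs_hz, lo_hz=5_000.0, hi_hz=10_000.0):
    """Derive a template HRTF magnitude response from a generic (population) HRTF
    by removing notches in the given band: the band is replaced with a smooth
    interpolation between its edges, so the template is smooth and continuous
    over that range while notches outside the range are left untouched."""
    template = generic_mag_db.copy()
    band = (freqs_hz >= lo_hz) & (freqs_hz <= hi_hz)
    edges = np.array([lo_hz, hi_hz])
    edge_vals = np.interp(edges, freqs_hz, generic_mag_db)   # values at band edges
    template[band] = np.interp(np.log(freqs_hz[band]), np.log(edges), edge_vals)
    return template
```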

The audio system may be part of a head-mounted device. The audio system may be separate from the headset and external to the headset.

In one embodiment, a non-transitory computer readable medium may be configured to store program code instructions that, when executed by a processor, may cause the processor to perform steps comprising: determining one or more personalized filters based at least in part on the acoustic feature data of the user; generating one or more personalized Head Related Transfer Functions (HRTFs) for the user based on the template HRTFs and the determined one or more personalized filters; and providing the generated one or more personalized HRTFs to an audio system, wherein the personalized HRTFs are used to generate the spatialized audio content.

Determining the one or more personalized filters may include determining parameter values for the one or more personalized filters using the trained machine learning model and acoustic feature data of the user.

The parameter values of the one or more personalized filters may describe one or more personalized notches in the one or more personalized HRTFs. The parameter values may include: a frequency location, a width in a frequency band centered on the frequency location, and an amount of attenuation caused in the frequency band centered on the frequency location.

A machine learning model may be trained with image data, anthropometric features, and acoustic data, including measurements of HRTFs obtained for a population of users.

Generating the one or more personalized HRTFs for the user based on the template HRTF and the determined one or more personalized filters may include: adding at least one notch to the template HRTF using at least one of the one or more personalized filters to generate a personalized HRTF of the one or more personalized HRTFs.

In one embodiment, a method may comprise: receiving, at a head-mounted device, one or more personalized HRTFs of a user of the head-mounted device; retrieving audio data associated with a target sound source direction relative to the head-mounted device; applying the one or more personalized HRTFs to the audio data to render the audio data as audio content; and rendering, by a speaker assembly of the headset, the audio content, wherein the rendered audio content is spatialized such that it sounds as if it originates from the target sound source direction.

In one embodiment, a method may comprise: capturing acoustic feature data of a user; and transmitting the captured acoustic feature data to a server, wherein the server determines one or more personalized HRTFs using the captured acoustic feature data, and the server provides the one or more personalized HRTFs to the head-mounted device.

In one embodiment, an audio system may include an audio component and an audio controller, the audio component comprising one or more speakers configured to present audio content to a user of the audio system, and the audio controller being configured to perform a method according to or within any of the embodiments described above.

In an embodiment, one or more computer-readable non-transitory storage media may embody software that is operable when executed to perform a method according to or within any of the embodiments described above.

In an embodiment, an audio system and/or system may include: one or more processors; and at least one memory coupled to the processor and comprising instructions executable by the processor, the processor being operable when executing the instructions to perform a method according to or within any of the embodiments described above.

In an embodiment, a computer program product, preferably comprising a computer-readable non-transitory storage medium, is operable, when executed on a data processing system, to perform a method according to or within any of the embodiments described above.

The foregoing description of the embodiments of the disclosure has been presented for the purposes of illustration; it is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. One skilled in the relevant art will recognize that many modifications and variations are possible in light of the above disclosure.

Some portions of the present description describe embodiments of the disclosure in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Moreover, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combination thereof.

Any of the steps, operations, or processes described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code, the computer program code executable by a computer processor for performing any or all of the steps, operations, or processes described.

Embodiments of the present disclosure may also relate to apparatuses for performing the operations herein. The apparatus may be specially constructed for the required purposes, and/or it may comprise a general purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of medium suitable for storing electronic instructions, which may be coupled to a computer system bus. Moreover, any computing system referred to in the specification may include a single processor, or may be an architecture that employs a multi-processor design to increase computing power.

Embodiments of the present disclosure may also relate to products produced by the computing processes described herein. Such products may include information derived from computing processes, where the information is stored on non-transitory, tangible computer-readable storage media and may include any embodiment of a computer program product or other combination of data described herein. Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the disclosure be limited not by this detailed description, but rather by any claims that issue on an application based thereupon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.
