Audio signal encoding method, audio signal decoding method, program, encoding device, audio system, and decoding device

Document No.: 174379    Publication date: 2021-10-29

Reading note: This technique, "Audio signal encoding method, audio signal decoding method, program, encoding device, audio system, and decoding device", was designed and created by 西口正之 and 加藤巧大 on 2020-02-18. Its main content is as follows: The invention provides an audio signal encoding method capable of encoding at a sufficient bit rate even for an audio signal with a large number of channels. The audio signal encoding method is executed by an encoding apparatus 1 and encodes audio signals of a plurality of channels. First, a masking threshold corresponding to a spatial masking effect of hearing is calculated. Then, the amount of information allocated to each channel is determined for the audio signals of the plurality of channels using the calculated masking threshold. On this basis, the audio signals of the plurality of channels are encoded with the amounts of information allocated to them. This enables encoding at a sufficient bit rate even for audio signals of a plurality of channels.

1. An audio signal encoding method for encoding audio signals of a plurality of channels by an encoding apparatus,

a masking threshold corresponding to the spatial masking effect of hearing is calculated,

determining an amount of information to be allocated to each of the channels using the calculated masking threshold,

and encoding the audio signals of the plurality of channels with the amount of information allocated to each of the plurality of channels.

2. An audio signal encoding method for encoding a sound source object and position information of the sound source object, the method being performed by an encoding device,

a masking threshold corresponding to the spatial masking effect of hearing is calculated,

deciding an amount of information allocated to the sound source object using the calculated masking threshold,

encoding the sound source object and the position information of the sound source object with the allocated information amount.

3. The audio signal encoding method of claim 1 or 2,

the masking threshold is calculated corresponding to the spatial masking effect based on the spatial distance and/or direction between the respective channels and/or between the respective sound source objects.

4. The audio signal encoding method of claim 3,

the masking threshold is calculated corresponding to the spatial masking effect in which the mutual influence between the channels and/or the sound source objects is larger when their spatial distances and/or directions are closer to each other, and smaller when they are farther from each other.

5. The audio signal encoding method of claim 3 or 4,

the masking threshold is calculated, with respect to the channels and/or the sound source objects located at front-rear symmetrical positions as viewed from a listener, corresponding to the spatial masking effect in which the degree of mutual influence varies with the spatial distance and/or direction between the channels and/or the sound source objects.

6. The audio signal encoding method of any one of claims 3 to 5,

the masking threshold is calculated, with respect to a channel and/or a sound source object located behind the listener, corresponding to the spatial masking effect as if that channel and/or object existed at the front-rear symmetrical position in front of the listener.

7. The audio signal encoding method of any one of claims 3 to 6,

the masking threshold is calculated corresponding to the spatial masking effect in which the degree of mutual influence between the signals of the respective channels and/or sound source objects varies depending on whether those signals are tonal signals or noisy signals.

8. The audio signal encoding method of claim 7,

the masking threshold is adjusted using the following equation (1),

T = β{max(y1, αy2) - 1}

y1 = f(x - θ)

y2 = f(180 - x - θ)    …… Equation (1)

Where T denotes a weight by which the masking threshold in the frequency domain of each channel signal is multiplied in order to calculate the masking threshold, θ denotes the azimuth of the masking sound, α denotes a constant controlled by the frequency of the masking sound, β denotes a constant controlled depending on whether the signal of the masking sound is a tonal signal or a noisy signal, and x denotes the azimuth of the masked sound.

9. The audio signal encoding method of any one of claims 1 to 8,

the average number of bits per sample is calculated using perceptual entropy (PE).

10. An audio signal decoding method, which is an audio signal decoding method performed by a decoding apparatus,

decoding the audio signals of a plurality of channels encoded by the audio signal encoding method recited in any one of claims 1 to 9.

11. A program for encoding audio signals of a plurality of channels, the program being executed by an encoding apparatus,

causing the encoding device to calculate a masking threshold corresponding to a spatial masking effect of the auditory sense,

causing the encoding device to determine the amount of information to be allocated to each channel using the calculated masking threshold,

causing the encoding apparatus to encode the audio signals of the plurality of channels with the amount of information allocated to each of the channels.

12. An encoding device that encodes audio signals of a plurality of channels and/or a sound source object and position information of the sound source object, the encoding device comprising:

a masking threshold calculation unit that calculates a masking threshold corresponding to a spatial masking effect of an auditory sense;

an information amount determination unit configured to determine an information amount to be assigned to each of the channels and/or the sound source object, using the masking threshold calculated by the masking threshold calculation unit; and

an encoding unit that encodes the audio signals of the plurality of channels and/or the sound source object and the position information of the sound source object with the amount of information allocated to each of the channels and/or sound source objects.

13. An audio system having the encoding device and the decoding device according to claim 12,

the decoding device includes a decoding unit that decodes the audio signals of the plurality of channels and/or the sound source object encoded by the encoding device into a speech signal.

14. An audio system having the encoding device and the decoding device according to claim 12,

the decoding device is provided with:

a direction calculation unit that calculates a direction in which a listener is facing;

a transmitting unit that transmits the direction calculated by the direction calculating unit to the encoding device; and

a decoding unit configured to decode the audio signals of the plurality of channels and/or the sound source object encoded by the encoding device into a speech signal,

the masking threshold calculation section of the encoding device calculates the masking threshold in correspondence with the spatial masking effect based on the spatial distance and/or direction between the respective channels and/or between the respective sound source objects with respect to the position and the direction of the listener.

15. The audio system of claim 13 or 14,

the decoding device further includes a stereo audio playback unit that converts the audio signal decoded by the decoding unit into a stereo audio signal for playing a stereo audio to the listener.

16. A decoding device is characterized by comprising:

a signal obtaining unit that determines an amount of information to be allocated to each channel and/or sound source object by using a masking threshold corresponding to a spatial masking effect of an auditory sense, and obtains a signal in which audio signals of a plurality of channels and/or position information of the sound source object and the sound source object are encoded by the amount of information allocated to each channel and/or sound source object; and

and a decoding unit configured to decode the encoded audio signals of the plurality of channels and/or the sound source object into a speech signal from the signal obtained by the signal obtaining unit.

17. The decoding device according to claim 16, further comprising:

a direction calculation unit that calculates a direction in which a listener is facing; and

and a transmitting unit configured to transmit the direction calculated by the direction calculating unit to an encoding device.

18. The decoding apparatus according to claim 16 or 17,

the decoding device further includes a stereo audio playback unit that converts the audio signal decoded by the decoding unit into a stereo audio signal for playing stereo audio to the listener.

Technical Field

The present invention particularly relates to an audio signal encoding method, an audio signal decoding method, a program, an encoding device, an audio system, and a decoding device.

Background

Conventionally, in the encoding of audio signals, there are audio encoding techniques based on bit allocation, which adaptively allocate the number of quantization bits to each channel of a multi-channel input audio signal along the time axis or the frequency axis.

In recent years, in standardized audio coding schemes such as MPEG-2 AAC, MPEG-4 AAC, and MP3, the auditory masking effect on the frequency axis is used for this bit allocation.

The auditory masking effect refers to an effect in which a certain sound is hard to hear due to the presence of other sounds.

Patent document 1 describes an example of an audio signal encoding technique using the auditory masking effect. In the technique of patent document 1, a threshold of the masking effect used for bit allocation (hereinafter referred to as a "masking threshold") is calculated in order to utilize the auditory masking effect.

Documents of the prior art

Patent document

Patent document 1: Japanese Patent Laid-Open No. 5-248972.

Non-patent document 1: andrea Spanias et al, "Audio Signal Processing and Coding," U.S. Wiley-Interscience, John Wiley & Sons, 2007.

Disclosure of Invention

Problems to be solved by the invention

However, in the conventional masking threshold calculation, the spatial relationship between a plurality of channels is not considered, and therefore, there is a problem that the bit rate (bandwidth) may be insufficient for an audio signal having a large number of channels.

The present invention has been made in view of the above circumstances, and an object thereof is to solve the above problems.

Means for solving the problems

An audio signal encoding method according to the present invention is an audio signal encoding method for encoding audio signals of a plurality of channels, the method being executed by an encoding device, wherein a masking threshold corresponding to a spatial masking effect of an auditory sense is calculated, an amount of information to be assigned to each of the channels is determined using the calculated masking threshold, and the audio signals of the plurality of channels are encoded with the amount of information to be assigned to each of the channels.

The program of the present invention is a program for encoding audio signals of a plurality of channels to be executed by an encoding device, the program causing the encoding device to calculate a masking threshold corresponding to a spatial masking effect of an auditory sense, causing the encoding device to determine an amount of information to be allocated to each of the channels using the calculated masking threshold, and causing the encoding device to encode the audio signals of the plurality of channels with the amount of information allocated to each of the channels.

An encoding device according to the present invention is an encoding device that encodes audio signals of a plurality of channels and/or a sound source object and position information of the sound source object, the encoding device including: a masking threshold calculation unit that calculates a masking threshold corresponding to a spatial masking effect of an auditory sense; an information amount determination unit configured to determine an information amount to be assigned to each of the channels and/or the sound source object, using the masking threshold calculated by the masking threshold calculation unit; and an encoding unit that encodes the audio signals of the plurality of channels and/or the sound source object and the position information of the sound source object with the amount of information allocated to each of the channels and/or sound source objects.

The audio system according to the present invention is an audio system including the encoding device and a decoding device, wherein the decoding device includes: a direction calculation unit that calculates a direction in which a listener is facing; a transmitting unit that transmits the direction calculated by the direction calculating unit to the encoding device; and a decoding section that decodes the audio signals of the plurality of channels and/or the sound source objects encoded by the encoding apparatus into a speech signal, the masking threshold calculation section of the encoding apparatus calculating the masking threshold in correspondence with the spatial masking effect based on a spatial distance and/or direction between the channels and/or between the sound source objects with respect to the position and the direction of the listener.

The decoding device of the present invention is characterized by comprising: a signal obtaining unit that determines an amount of information to be allocated to each channel and/or sound source object by using a masking threshold corresponding to a spatial masking effect of an auditory sense, and obtains a signal in which audio signals of a plurality of channels and/or position information of the sound source object and the sound source object are encoded by the amount of information allocated to each channel and/or sound source object; and a decoding unit configured to decode the encoded audio signals of the plurality of channels and/or the sound source object into a speech signal from the signal obtained by the signal obtaining unit.

Effects of the invention

According to the present invention, the following audio signal encoding method can be provided: an audio signal encoding method for encoding an audio signal at a sufficient bit rate even for an audio signal having a large number of channels, by calculating a masking threshold corresponding to an auditory spatial masking effect, determining the amount of information to be allocated to each of the channels for audio signals of a plurality of channels using the calculated masking threshold, and encoding the audio signals with the allocated amounts of information.

Drawings

Fig. 1 is a system configuration diagram of an audio system of an embodiment of the present invention.

Fig. 2 is a flowchart of an audio encoding-decoding process of an embodiment of the present invention.

Fig. 3 is a conceptual diagram of the audio encoding and decoding process shown in fig. 2.

Fig. 4 is a conceptual diagram of the audio encoding and decoding process shown in fig. 2.

Fig. 5 is a conceptual diagram of a measurement system of a listening experiment according to an embodiment of the present invention.

Fig. 6 is a conceptual diagram showing threshold search in a listening experiment according to an embodiment of the present invention.

Fig. 7 is a screen example of an answer screen in a listening experiment according to an embodiment of the present invention.

Fig. 8 is a graph in which the horizontal axis represents the azimuth of the masked sound, and the peak of the masking threshold when the azimuth of the masking sound is 0 ° in the embodiment of the present invention is plotted.

Fig. 9 is a graph in which the horizontal axis represents the azimuth of the masked sound, and the peak of the masking threshold when the azimuth of the masking sound is 45 ° in the embodiment of the present invention is plotted.

Fig. 10 is a graph in which the horizontal axis represents the azimuth of the masked sound, and the peak of the masking threshold when the azimuth of the masking sound is 90 ° in the embodiment of the present invention is plotted.

Fig. 11 is a graph in which the horizontal axis represents the azimuth of the masked sound, and the peak of the masking threshold when the azimuth of the masking sound is 135 ° according to the embodiment of the present invention is plotted.

Detailed Description

< embodiment >

[ control structure of audio system X ]

First, a control structure of an audio system X according to an embodiment of the present invention will be described with reference to fig. 1.

The audio system X is a system capable of acquiring audio signals of a plurality of channels, encoding and transmitting the audio signals by the encoding apparatus 1, and decoding and playing the audio signals by the decoding apparatus 2.

The encoding device 1 is a device that encodes an audio signal. In the present embodiment, the encoding device 1 is, for example, a PC (Personal Computer), a server, an encoder board installed in these, a dedicated encoder, or the like. The encoding device 1 of the present embodiment encodes audio signals of a plurality of channels and/or a sound source object and position information of the sound source object. For example, the encoding device 1 encodes audio signals of a plurality of channels such as 2-channel, 5.1-channel, 7.1-channel, and 22.2-channel in accordance with an audio encoding method such as MPEG-2 AAC, MPEG-4 AAC, MP3, Dolby Digital (registered trademark), or DTS (registered trademark).

The decoding device 2 is a device that decodes the audio signal encoded by the encoding device 1. In the present embodiment, the decoding device 2 is, for example, an HMD (Head-Mounted Display) for VR (Virtual Reality) or AR (Augmented Reality), a smartphone, a game machine, a home television, a wireless headset, a virtual multichannel headset, a device for movie theaters and public viewing venues, a dedicated decoder, a head tracking sensor, or the like. The decoding device 2 decodes and plays the audio signal encoded by the encoding device 1, received by wired or wireless transmission.

The audio system X mainly includes a microphone array 10, a sound collecting unit 20, a frequency domain converting unit 30, a masking threshold calculating unit 40, an information amount determining unit 50, an encoding unit 60, a direction calculating unit 70, a transmitting unit 80, a decoding unit 90, a stereo audio playing unit 100, and headphones 110.

Among them, the frequency domain converting unit 30, the masking threshold calculating unit 40, the information amount determining unit 50, and the encoding unit 60 function as the encoding device 1 (transmitting side) of the present embodiment.

The direction calculation unit 70, the transmission unit 80, the decoding unit 90, the stereo audio playback unit 100, and the headphone 110 function as the decoding device 2 (receiving side) of the present embodiment.

The microphone array 10 picks up speech in a sound space, which is a space where various sounds exist in various places. Specifically, for example, the microphone array 10 obtains sound waves in a plurality of directions of 360 °. In this case, by controlling the directivity by the beamforming process and directing the beam in each direction, spatial sampling of the sound space can be performed, and a multi-channel speech beam signal can be obtained. Specifically, in the beamforming of the present embodiment, the phase difference of the sound waves arriving at each microphone of the microphone array 10 is controlled by a filter, and the signals arriving at each microphone are emphasized. On the basis, as spatial sampling, a sound field is spatially divided, information on the included space is kept unchanged, and sound collection is performed through multiple channels.
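The delay-and-sum beamforming described above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation; the array geometry, steering angle, sample rate, and the frequency-domain phase-shift formulation are all assumptions.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, steer_deg, fs=48000, c=343.0):
    """Steer a microphone array toward steer_deg by delay-and-sum.

    signals: (num_mics, num_samples) time-domain microphone signals
    mic_positions: (num_mics, 2) microphone coordinates in meters
    Returns the beam signal emphasizing sound arriving from steer_deg.
    """
    theta = np.deg2rad(steer_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Per-microphone delay (seconds) of a plane wave from steer_deg
    delays = mic_positions @ direction / c
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, tau in zip(signals, delays):
        # Apply the (fractional) delay as a phase shift in the frequency domain
        spec = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau)
        out += np.fft.irfft(spec, n)
    return out / len(signals)
```

Repeating this for beams pointed in each direction of the 360° range yields the multi-channel speech beam signals used for spatial sampling.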

The sound collecting unit 20 is a device such as a mixer that collects voices of a plurality of channels and transmits the collected voices as an audio signal to the encoding device 1.

The frequency domain conversion unit 30 cuts the voice beam signals obtained by spatial sampling in different directions into windows (frames) of about several milliseconds to several tens of milliseconds, and converts the signals from the time domain to the frequency domain by DFT (Discrete Fourier Transform) or MDCT (Modified Discrete Cosine Transform). For example, the frame preferably uses about 2048 samples at a sampling frequency of 48 kHz with 16 quantization bits. The frequency domain conversion unit 30 outputs the frame as an audio signal of each channel. That is, the audio signal of the present embodiment is a signal in the frequency domain.
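The framing and time-to-frequency conversion above can be sketched as follows, using a DFT on 2048-sample frames. The Hann window and 50% overlap are assumptions for illustration; the embodiment may equally use MDCT.

```python
import numpy as np

def frames_to_frequency(x, frame_len=2048, hop=1024):
    """Cut a time-domain signal into windowed frames and DFT each frame.

    Returns an array of shape (num_frames, frame_len // 2 + 1) holding the
    complex frequency-domain coefficients, one row per frame.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    spectra = []
    for i in range(num_frames):
        frame = x[i * hop:i * hop + frame_len] * window
        spectra.append(np.fft.rfft(frame))  # real-input DFT
    return np.asarray(spectra)
```

At 48 kHz, a 2048-sample frame corresponds to about 42.7 ms, consistent with the "several tens of milliseconds" window length mentioned above.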

The masking threshold calculation unit 40 calculates a masking threshold corresponding to a spatial masking effect of the auditory sense from the audio signal of each channel converted by the frequency domain conversion unit 30. In this case, the masking threshold calculation unit 40 calculates the masking threshold in the frequency domain by applying a model in which the spatial masking effect is taken into consideration. The calculation of the masking threshold in the frequency domain itself can be realized by the method described in non-patent document 1, for example.

Alternatively, the masking threshold calculation unit 40 may obtain the sound source object, and similarly calculate the masking threshold corresponding to the spatial masking effect of the auditory sense. The sound source object represents each of a plurality of audio signals generated from spatially different positions. The sound source object is for example an audio signal with position information. For example, the audio signal may be obtained by converting an output signal of a microphone such as each instrument of a recording band, a sampled voice signal used in a game or the like, or the like into an audio signal in the frequency domain.

Further, the masking threshold calculation unit 40 may obtain an audio signal that is temporarily collected and stored in a recording medium such as a flash memory, HDD, or optical recording medium, convert the audio signal, and calculate a frequency mask.

Specifically, as a model of the above-described spatial masking effect, the masking threshold calculation section 40 may calculate the masking threshold in correspondence with the spatial masking effect based on the spatial distance and/or direction between the channels and/or between the sound source objects, using the position and direction information with respect to the listener.

Alternatively, the masking threshold calculation unit 40 may calculate the masking threshold in accordance with a spatial masking effect based on a spatial distance and/or direction between the channels and/or between the sound source objects.

More specifically, the masking threshold calculation unit 40 may calculate the masking threshold in accordance with a spatial masking effect in which the mutual influence between the channels and/or sound source objects is larger when their spatial distances and/or directions are closer to each other, and smaller when they are farther apart.

The masking threshold calculation unit 40 may also calculate the masking threshold, for the channels and/or sound source objects located at front-rear symmetrical positions as viewed from the listener, in accordance with a spatial masking effect in which the degree of mutual influence changes with the spatial distance and/or direction between them.

Furthermore, for a channel and/or a sound source object located behind the listener, the masking threshold calculation unit 40 may calculate the masking threshold in accordance with a spatial masking effect as if that channel and/or sound source object existed at the front-rear symmetrical position in front of the listener.

Specifically, the masking threshold calculation unit 40 may perform adjustment using the following equation (1) when calculating the masking threshold.

T = β{max(y1, αy2) - 1}

y1 = f(x - θ)

y2 = f(180 - x - θ)    …… Equation (1)

Where T denotes a weight by which the masking threshold in the frequency domain of each channel signal is multiplied in order to calculate the masking threshold, θ denotes the azimuth of the masking sound, α denotes a constant controlled by the frequency of the masking sound, β denotes a constant controlled depending on whether the signal of the masking sound is a tonal signal or a noisy signal, and x denotes the obtained direction, i.e., the azimuth of the masked sound.

More specifically, in the present embodiment, a sound that interferes with the hearing of another sound is referred to as a "masking sound", and a sound whose audibility is interfered with is referred to as a "masked sound". max is a function that returns the maximum of its arguments. As for the constants, α = 1 can be used when the masking sound is at 400 Hz, and α = 0.8 when the masking sound is at 1 kHz. When the masking sound is noisy, β may be about 11 to 14, and when the masking sound is a pure tone (tonal), β may be about 3 to 5. That is, when the masking sound is tonal, T is nearly flat for all x regardless of the value of θ.

For f (x) in the formula (1), a linear function such as a triangular wave shown in the following formula (2) can be used, for example.

The obtained direction, that is, the azimuth of the masked sound, can be used as x. The azimuth corresponds to the beamforming direction of the microphone, the direction of the sound source object, and the like.

As f(x), an expression f(x) = cos(x) may also be used. Further, as f(x), a function calculated from experimental results with actual masking and masked sounds may be used, for example.
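Equation (1) can be sketched directly in code. The triangular f(x) below is one illustrative choice (formula (2) itself is not reproduced in this excerpt, so its exact shape and width are assumptions), and the α and β values follow the example constants given in the text:

```python
import numpy as np

def f_triangular(x_deg, width=90.0):
    """Illustrative triangular f(x): peaks at 1 for x = 0 deg, reaches 0 at +/-width."""
    return np.maximum(0.0, 1.0 - np.abs(x_deg) / width)

def masking_weight(x_deg, theta_deg, alpha=0.8, beta=12.0, f=f_triangular):
    """Weight T of equation (1) for the masking threshold of the channel at
    azimuth x_deg, given a masking sound at azimuth theta_deg.

    alpha: constant controlled by the masker frequency (e.g. 1 at 400 Hz,
           0.8 at 1 kHz); beta: ~11-14 for a noisy masker, ~3-5 for a tonal one.
    """
    y1 = f(x_deg - theta_deg)
    y2 = f(180.0 - x_deg - theta_deg)  # front-rear symmetric contribution
    return beta * (max(y1, alpha * y2) - 1.0)
```

With this sketch, a masked channel in the same direction as the masker (x = θ) gets the largest weight (T = 0, no reduction), and the weight decreases as the masked sound moves away from both the masker and its front-rear mirror position.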

The masking threshold calculation unit 40 may calculate the masking threshold in accordance with a spatial masking effect in which the degree of mutual influence between the signals of the respective channels and/or sound source objects changes depending on whether those signals are tonal signals or noisy signals.

The information amount determination unit 50 determines the amount of information to be allocated to each channel and/or sound source object by using the masking threshold calculated by the masking threshold calculation unit 40. In the present embodiment, bit allocation for each audio signal based on the masking threshold is performed as the information amount. For this bit allocation, the information amount determination unit 50 may calculate the average number of bits per sample from the masking threshold calculated by the masking threshold calculation unit 40 using perceptual entropy (hereinafter referred to as "PE").

The encoding unit 60 encodes the audio signals of the plurality of channels and/or the sound source object and the position information of the sound source object with the respective allocated amounts of information. In the present embodiment, the encoding unit 60 quantizes each audio signal based on the number of bits allocated by the information amount determination unit 50, and transmits the quantized audio signal to the transmission path. The transmission path may use, for example, Bluetooth (registered trademark), HDMI (registered trademark), WiFi, USB (Universal Serial Bus), or other wired or wireless information transmission means. More specifically, the transmission can be performed by peer-to-peer (Peer-to-Peer) communication via a network such as the Internet or WiFi.

The direction calculating section 70 calculates the direction in which the listener is facing. The direction calculation unit 70 includes, for example: acceleration sensors, gyro sensors, geomagnetic sensors, and the like capable of head tracking; and circuits for converting their outputs into directional information.

In addition, the direction calculating unit 70 can calculate position and direction information that is information obtained by adding position information considering the relationship of the positions of the audio signals of the sound source object or the plurality of channels with respect to the listener to the calculated direction information.

The transmission unit 80 transmits the positional and directional information calculated by the directional calculation unit 70 to the encoding device 1. The transmitting unit 80 can transmit the position and direction information to be receivable by the masking threshold calculating unit 40, for example, by wired or wireless transmission similar to the transmission path of the audio signal.

The decoding unit 90 decodes the audio signals of the plurality of channels and/or the sound source object encoded by the encoding device 1 into a speech signal. The decoding unit 90 first performs inverse quantization on a signal received from the transmission path, for example. Next, the signal in the frequency domain is returned to the time domain by IDFT (Inverse Discrete Fourier Transform), IMDCT (Inverse Modified Discrete Cosine Transform), or the like, and converted into a speech signal of each channel.

The stereo audio playback unit 100 converts the audio signal decoded by the decoding unit 90 into a stereo audio signal for playing stereo audio to the listener. Specifically, the stereo audio playback section 100 regards each beam signal returned to the time domain, pointing in a different direction, as a signal emitted from a sound source located in that direction, and convolves it with the Head-Related Transfer Function (HRTF) of its beam direction. The HRTF expresses, as a transfer function, the change a sound undergoes due to surrounding parts of the body, including the pinna, the head, and the shoulders.

Next, the HRTF-convolved signals are weighted per beam direction and added, thereby generating a 2-channel binaural signal to be presented to the listener. Here, the per-beam-direction weighting is a process of choosing weights such that the binaural signal, consisting of the L signal and the R signal, approaches the binaural signal of the sound space to be reproduced. Specifically, a target binaural signal is generated by convolving each sound source existing in a certain sound space with the HRTF of its sound source direction and adding the results; the output signal is then weighted so that the binaural signal obtained as output equals this target signal.
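The binaural synthesis above can be sketched as follows. The HRTF impulse responses and per-beam weights here are placeholders; in practice the HRTFs would come from a measured set and the weights from the target-matching optimization described above.

```python
import numpy as np

def binauralize(beam_signals, hrtf_left, hrtf_right, weights):
    """Convolve each direction's beam signal with that direction's HRTF pair,
    weight it, and sum into a 2-channel (L, R) binaural signal.

    beam_signals: list of 1-D arrays, one per beam direction
    hrtf_left, hrtf_right: list of 1-D impulse responses, one per direction
    weights: per-direction weights approximating the target binaural signal
    """
    n = len(beam_signals[0]) + len(hrtf_left[0]) - 1  # full convolution length
    left = np.zeros(n)
    right = np.zeros(n)
    for sig, hl, hr, w in zip(beam_signals, hrtf_left, hrtf_right, weights):
        left += w * np.convolve(sig, hl)
        right += w * np.convolve(sig, hr)
    return np.stack([left, right])
```
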

The stereo audio playing section 100 can play the stereo audio by updating the HRTF using the position and direction information calculated by the direction calculating section 70, in addition to the use of that information for the masking threshold described above.

The headphones 110 are a device with which the listener plays back the decoded and binauralized audio. The headphones 110 include a D/A converter, an amplifier, electromagnetic drivers, earcups worn by the user, and the like.

The encoding device 1 and the decoding device 2 include, as various circuits, control units that perform arithmetic processing, such as an ASIC (Application Specific Integrated Circuit), a DSP (Digital Signal Processor), a CPU (Central Processing Unit), an MPU (Micro Processing Unit), and a GPU (Graphics Processing Unit).

The encoding device 1 and the decoding device 2 also include, as storage means, semiconductor memories such as ROM (Read Only Memory) and RAM (Random Access Memory), and storage units such as an HDD (Hard Disk Drive) using magnetic recording media, or optical recording media. The storage unit stores a control program for implementing each method according to the embodiment of the present invention.

Further, the encoding device 1 and the decoding device 2 may include a display unit such as a liquid crystal display or an organic EL display, an input unit such as a keyboard or a pointing device such as a mouse or a touch panel, and interfaces such as a LAN board, a wireless LAN board, serial/parallel ports, or USB (Universal Serial Bus).

The encoding device 1 and the decoding device 2 realize the methods according to the embodiments of the present invention by having the control unit execute the various programs stored mainly in the storage unit while using the hardware resources.

Further, a part of the above configuration, or an arbitrary combination thereof, may be implemented in hardware or as circuitry using an IC, a programmable logic device, an FPGA (Field-Programmable Gate Array), or the like.

[ Audio codec processing for Audio System X ]

Next, with reference to fig. 2 and 3, the audio signal encoding/decoding process of the audio system X according to the embodiment of the present invention will be described.

The audio signal encoding/decoding process according to the present embodiment is mainly performed by the encoding device 1 and the decoding device 2, with the control unit of each device executing the control program stored in the storage unit in cooperation with each unit while controlling the hardware resources, or with each circuit executing the process directly.

Next, the audio signal encoding/decoding process will be described in detail for each step with reference to the flowchart of fig. 2.

(step S101)

First, the frequency domain converter 30 of the encoding device 1 performs speech data acquisition processing.

Here, a recording operator goes to a stadium or similar venue and collects sound using the microphone array 10. This yields a speech signal for each direction (θ) centered on the microphone array 10. Sound is collected on the pickup side based on the idea of "spatial sampling": spatially dividing a sound field into multiple channels to pick up sound. In the present embodiment, for example, speech signals are collected for a plurality of channels obtained by dividing the horizontal 0° to 360° range into specific steps. Sound can likewise be collected in specific steps over 0° to 360° in the vertical direction.

The frequency domain converter 30 cuts the acquired speech data into frames, converts each frame from the time domain into a frequency-domain signal by DFT, MDCT, or the like, and stores the result in the storage unit as an audio signal.
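The framing and frequency-domain conversion step can be sketched as follows. This is a minimal illustration, not the converter's actual implementation: the frame length, hop size, Hann window, and function name are all assumptions, and a DFT is used where the device may equally use an MDCT.

```python
import numpy as np

def frames_to_spectra(signal, frame_len=1024, hop=512):
    """Cut a time-domain signal into overlapping windowed frames and
    convert each frame to the frequency domain with the DFT
    (illustrative sketch; frame/hop sizes are assumptions)."""
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.fft.rfft(frame))  # one-sided DFT spectrum
    return np.array(spectra)

# Each direction signal obtained by spatial sampling would be converted
# this way before the masking analysis.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # 440 Hz test tone, one second
S = frames_to_spectra(x)
print(S.shape)  # (num_frames, frame_len // 2 + 1)
```

In a real codec the same framing would be applied to every channel (direction) independently, producing one spectrum sequence per direction for the threshold calculation that follows.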

(step S201)

Here, the direction calculation unit 70 of the decoding device 2 performs the direction calculation process.

The direction calculation unit 70 calculates, for the audio data, direction information indicating where the listener is facing, as well as position information.

(step S202)

Next, the transmission unit 80 performs direction transmission processing.

The transmission unit 80 transmits the position and direction information calculated by the direction calculation unit 70 to the encoding device 1.

(step S102)

Here, the masking threshold calculation unit 40 of the encoding device 1 performs the masking threshold calculation process. In the present embodiment, the masking threshold T is first calculated in the frequency domain; a masking threshold for the spatial masking described later is then calculated, and the bit allocation is determined. The masking threshold calculation unit 40 therefore first calculates the masking threshold T for each frequency band.

The auditory masking effect will be described with reference to fig. 3 (a). The auditory masking effect is an effect in which a certain sound becomes difficult to hear due to the presence of another sound. Hereinafter, the sound that interferes with hearing is referred to as the "masking sound", and the sound whose audibility is impaired is referred to as the "masked sound".

The masking effect is roughly divided into frequency masking (simultaneous masking) and temporal masking (sequential masking). Frequency masking occurs when the masking sound and the masked sound overlap in time, and temporal masking occurs when they are separated in time.

In the graph of fig. 3 (a), the horizontal axis represents frequency and the vertical axis the energy of the signal. That is, fig. 3 (a) shows an example of the range and threshold of the spectrum (masked sound) masked by a masking sound when one spectral line (a pure tone) contained in a certain signal acts as the masking sound. As shown, the threshold of the masked sound rises even in the vicinity of the masking sound's frequency, where no signal component exists. Moreover, the frequency range over which the threshold rises is asymmetric about the frequency of the masking sound: a masked sound whose frequency is higher than the masking sound's is masked more easily than one whose frequency is lower. The masking sound can therefore be regarded, in auditory terms, as having not only its own frequency component but also components extending to both sides of it.

The concept of applying frequency masking to encoding is shown using fig. 3 (b). In the graph, the horizontal axis represents frequency and the vertical axis the energy of the signal. The thick black curve represents the spectrum of the signal, and the gray curve the masking threshold. The shaded range in fig. 3 (b) is the portion masked by frequency masking so as not to be perceived. The portion that actually contributes to the perception of sound is thus the portion sandwiched between the curve representing the signal spectrum and the curve representing the masking threshold. As shown in the high-frequency region of fig. 3 (b), frequencies at which the energy of the signal spectrum is smaller than the masking threshold do not contribute to the perception of sound. That is, even if only enough bits are allocated to cover the energy obtained by subtracting the masking threshold from the energy of the signal spectrum, the signal can be transmitted without the degradation being audibly perceived. In this way, by using the masking effect in the frequency domain, the number of bits required for transmission can be reduced while maintaining auditory quality.
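The bit-saving idea of fig. 3 (b) can be sketched numerically: bits are allocated only for the part of each band's energy that exceeds the masking threshold. The function name, the example numbers, and the roughly 6 dB-of-SNR-per-bit rule of thumb for uniform quantization are illustrative assumptions, not values from the source.

```python
import numpy as np

def bits_above_mask(energy_db, mask_db, db_per_bit=6.02):
    """Allocate bits only for the part of the spectrum above the masking
    threshold (sketch; ~6.02 dB of SNR per bit is a common rule of thumb
    for uniform quantization).
    energy_db, mask_db: per-band signal energy and masking threshold [dB]."""
    headroom = np.maximum(energy_db - mask_db, 0.0)  # audible part only
    return np.ceil(headroom / db_per_bit).astype(int)

energy = np.array([60.0, 42.0, 30.0, 55.0])  # band energies [dB]
mask   = np.array([40.0, 45.0, 25.0, 43.0])  # masking thresholds [dB]
print(bits_above_mask(energy, mask))  # fully masked bands get 0 bits
```

Band 2, whose energy lies below its threshold, receives no bits at all, which is exactly the saving the high-frequency region of fig. 3 (b) illustrates.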

The curve showing the masking threshold values for the entire frequency band as shown in fig. 3 (b) is obtained by calculating the masking threshold values for the respective frequency components using the knowledge of the masking for a single spectrum or noise and integrating them.

Here, a detailed calculation method of the masking threshold T in the frequency band will be described.

The masking threshold calculation unit 40 first convolves, for example, a spreading function (hereinafter referred to as "SF") with the Bark spectrum, as described in patent document 1. The masking threshold calculation unit 40 then calculates the spread masking threshold T_spread using the Spectral Flatness Measure (SFM) and an adjustment factor. On this basis, the masking threshold calculation unit 40 returns the spread masking threshold T_spread to the Bark spectrum domain by deconvolution and calculates a temporary threshold T. Further, in the present embodiment, the masking threshold calculation unit 40 divides the temporary threshold T by the number of DFT spectral lines belonging to each Bark band, compares the divided values with the absolute threshold, and converts the temporary threshold T into the final frequency-masking threshold T_final.

More specifically, as the absolute threshold to be compared with the temporary threshold T, the masking threshold calculation unit 40 calculates an approximation T_q(f) [dB SPL] of the absolute threshold at frequency f (Hz) by the following expression (3).

T_q(f) = 3.64 (f/1000)^-0.8 − 6.5 exp{−0.6 (f/1000 − 3.3)^2} + 10^-3 (f/1000)^4 + O_LSB … expression (3)

Here, O_LSB added in expression (3) is an offset value chosen so that the absolute threshold at a frequency of 4 kHz, T_q(4000) = min(T_q(f)), corresponds to the energy of a signal with a frequency of 4 kHz and an amplitude of 1 bit.
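Expression (3) can be sketched directly. The function name is illustrative, and the O_LSB term is represented by a generic offset parameter, since its exact value depends on the system's 1-bit amplitude reference.

```python
import numpy as np

def absolute_threshold(f_hz, offset_db=0.0):
    """Approximate absolute threshold of hearing T_q(f) [dB SPL] of
    expression (3). offset_db stands in for O_LSB, whose actual value is
    system-dependent."""
    khz = np.asarray(f_hz, dtype=float) / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * np.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4
            + offset_db)

f = np.array([100.0, 1000.0, 4000.0, 10000.0])
print(np.round(absolute_threshold(f), 1))
```

Evaluating the curve shows the familiar shape of the hearing threshold: high at low frequencies, dipping to its minimum near 4 kHz (which is why O_LSB is anchored there), then rising steeply again toward high frequencies.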

Specifically, the masking threshold calculation unit 40 calculates the final frequency-masking threshold T_final,i in the i-th frequency band using the following expression (4).

In addition, the masking threshold calculation unit 40 further calculates, from the frequency-band threshold T_final, a masking threshold corresponding to the spatial masking effect of hearing. In doing so, the masking threshold calculation unit 40 uses the direction information of the audio signal to calculate a frequency masking threshold that takes spatial masking into account.

The masking threshold corresponding to the spatial masking effect of auditory sensation will be described with reference to (c) of fig. 3.

In conventional audio encoding methods, the masking threshold of a channel is in many cases calculated using only that channel's own signal components. That is, in an audio signal having a plurality of channels, the masking threshold is determined independently for each channel, without considering masking of the target channel by the signals of the other channels.

Here, in an audio signal obtained by spatial sampling as used in the present embodiment, the correlation between signals of adjacent channels is large, and portions with similar waveforms coexist with portions with dissimilar waveforms. Therefore, when encoding a signal obtained by spatial sampling, the masking information of each channel can potentially be applied mutually between channels. For this reason, the present embodiment uses "spatial masking", which extends the masking effect to the spatial domain, to encode signals obtained by spatial sampling.

In the conceptual diagram of fig. 3 (c), the horizontal axis represents the spatial direction of the signal, the depth axis represents frequency, and the vertical axis represents the energy of the signal. The region inside the gently sloping quadrangular pyramid around the masking sound signal represents the region masked by that signal. Comparison with the frequency masking of fig. 3 (b) shows that fig. 3 (c) adds the dimension of direction, increasing the dimensionality by one. Direction in space includes both azimuth and elevation. As shown in fig. 3 (c), in spatial masking the surface representing the masking threshold is three-dimensional; that is, masking and masked signals arise in the spatial direction as well. Such spatial masking is considered to be masking involving the central auditory nervous system, which integrates the information arriving at the two ears.

The calculation of the masking threshold for spatial masking will be described with reference to fig. 4. Fig. 4 is an example of a masking threshold that takes spatial masking into account, calculated for the signal of the i direction among the signals of the N directions 1 to N. The horizontal axis of each graph represents frequency, and the vertical axis the energy of the signal. In each graph, the solid black line represents the signal spectrum, and the solid gray line the masking threshold calculated from it. The black dashed lines are the direction-wise weighted masking thresholds of the signals. The gray dotted line indicates the masking threshold of the i-direction signal after all masking by the signals of the respective directions has been taken into account.

More specifically, the present inventors constructed a masking model that takes into account spatial masking from sound sources in all directions, based on the results of the listening experiments in the embodiments described later, and calculate it as described below.

The calculation order is as follows. First, a masking threshold is calculated for each direction's signal in the same manner as in conventional frequency-domain masking. Next, to obtain the masking threshold for each direction, the function T_spatial(θ, x) corresponding to expression (1) above is used to calculate the weights by which the frequency-domain masking threshold of each channel signal is multiplied, and the weighting is applied. Here, the weight applied to the masking threshold of the signal itself, i.e., of the i direction, is 0 dB, that is, 1 on a linear scale. The weighted masking thresholds of all directions are then summed on a linear scale. This yields the masking threshold of the i-direction signal with spatial masking taken into account. By performing the same processing for the signals of the other directions, thresholds that take spatial masking into account are obtained for the signals over the entire circumference.
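The weighting-and-summation order just described can be sketched as follows. The weight-matrix values are hypothetical, and the thresholds are combined as power on a linear scale (summing in the power domain is an assumption about what "a linear scale" means here).

```python
import numpy as np

def spatial_mask(thresholds_db, t_spatial_db):
    """Combine per-direction frequency-masking thresholds into spatial
    masking thresholds (sketch of the order described above).
    thresholds_db: (N, F) masking thresholds of N direction signals [dB].
    t_spatial_db:  (N, N) weights, t_spatial_db[j, i] = attenuation of
                   direction j's threshold when masking direction i
                   (0 dB on the diagonal, i.e. linear weight 1)."""
    lin = 10.0 ** (thresholds_db / 10.0)  # dB -> linear power
    w = 10.0 ** (t_spatial_db / 10.0)     # weights, linear scale
    combined = w.T @ lin                  # sum over masker directions j
    return 10.0 * np.log10(combined)      # back to dB

# Toy example: 3 directions, 2 frequency bands, hypothetical weights.
T = np.array([[40.0, 20.0],
              [10.0, 35.0],
              [15.0, 15.0]])
W = np.array([[  0.0, -12.0, -20.0],
              [-12.0,   0.0, -12.0],
              [-20.0, -12.0,   0.0]])
print(np.round(spatial_mask(T, W), 1))
```

Because each direction's own threshold enters with weight 1 and all other contributions are non-negative, the combined threshold is never below the per-direction threshold, which is what allows bits to be saved relative to channel-independent masking.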

The function T_spatial will now be described in detail. T_spatial is a function that, taking the direction of the masking sound and the direction of the masked sound as inputs, gives in decibels the attenuation of the masking threshold relative to the direction in which the masking sound exists. T_spatial is therefore defined to take its maximum value of 0 [dB] at the azimuth where the masking sound exists.

In the present embodiment, with the azimuth of the masking sound denoted θ [deg] and the azimuth of the masked sound denoted x [deg], the function T_spatial(θ, x) [dB] is calculated by the following expression (2 of 4).

T_spatial(θ, x) = β{max(f(x − θ), α f(180° − x − θ)) − 1} … expression (2 of 4)

Here, α and β are scaling factors with 0 ≤ α ≤ 1 and 0 ≤ β. max is a function that returns the maximum of its arguments. f is an arbitrary periodic function with a period of 360° that takes its maximum value at phase 0°.

In the present embodiment, as the periodic function f(x), for example, a triangular wave similar to expression (2) above can be used. With f defined in this way, f(x − θ) represents a threshold variation that is 0 dB at the azimuth where the masking sound exists and minimal at the opposite azimuth, 180° away. On the other hand, f(180° − x − θ) is 0 dB at the azimuth front-rear symmetric to the azimuth where the masking sound exists, and represents a threshold variation that is minimal at the azimuth 180° away from that. That is, two functions f are prepared with phases matched so as to represent "attenuation from the threshold of the direction in which the masking sound exists" and "attenuation from the threshold of the direction front-rear symmetric to the direction in which the masking sound exists"; by taking the maximum of these and scaling it, a masking threshold can be calculated that simultaneously expresses two phenomena: "the threshold decreases as the masked sound moves away from the direction of the masking sound" and "the threshold is folded back at the coronal plane".

The scaling factor α (0 ≤ α ≤ 1) is a coefficient that reflects the finding that "the rise of the threshold when the masked sound is located at the azimuth front-rear symmetric to the masking sound appears more pronounced the lower the frequency (center frequency) of the masking sound". α is determined so as to approach 1 the lower the frequency of the masking sound, and to approach 0 the higher the frequency. In this way, f(180° − x − θ) can be scaled according to the frequency of the masking sound, adjusting the degree to which the threshold is folded back at the coronal plane.

The scaling factor β (0 ≤ β) is a coefficient that reflects the finding that "when the masking sound is a pure tone, the variation of the threshold with the azimuth of the masked sound is flat". β is determined so as to approach 0 the more tonal the masking sound is, and to take larger values the more noise-like it is. In this way, the overall magnitude of the values T_spatial takes as θ and x vary can be adjusted according to whether the masking sound is a pure tone or noise.
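As a concrete illustration, expression (2 of 4) can be sketched with a 360°-periodic triangular wave as f. The particular triangular wave and the values of α and β below are placeholder assumptions, not values from the source (the text says these are found experimentally).

```python
import numpy as np

def tri(x_deg):
    """360-degree-periodic triangular wave: maximum 1 at phase 0,
    minimum -1 at 180 degrees (one possible choice of f)."""
    x = np.mod(np.asarray(x_deg, dtype=float), 360.0)
    return 1.0 - np.minimum(x, 360.0 - x) / 90.0

def t_spatial(theta_deg, x_deg, alpha=0.5, beta=12.0):
    """Expression (2 of 4): attenuation [dB] of the masking threshold at
    masked-sound azimuth x for a masking sound at azimuth theta.
    alpha, beta are the scaling factors of the text; the defaults here
    are placeholders."""
    direct = tri(x_deg - theta_deg)                  # decay away from theta
    mirror = alpha * tri(180.0 - x_deg - theta_deg)  # front-rear fold-back
    return beta * (np.maximum(direct, mirror) - 1.0)

print(t_spatial(0.0, 0.0))    # at the masker's own azimuth: 0 dB
print(t_spatial(0.0, 90.0))   # to the side: fully attenuated
print(t_spatial(0.0, 180.0))  # front-rear mirror: partially restored
```

With a masker at 0°, the attenuation at the rear mirror azimuth (180°) is smaller in magnitude than at the side (90°), reproducing the fold-back of the threshold at the coronal plane; raising α toward 1 strengthens this fold-back, as described for low-frequency maskers.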

In this way, in the present embodiment, the weight T_spatial is applied to the frequency-domain masking threshold of each channel signal. By summing the weighted frequency-domain masking thresholds of all directions, the masking threshold for a given direction (the x direction) can be calculated on the frequency axis.

As shown in the examples, the optimal values of α and β can be obtained by iterative calculation against actual experiments, and can be held as tables.

(step S103)

Next, the information amount determination unit 50 performs information amount determination processing.

In the audio system X according to the present embodiment, bit allocation is performed in consideration of a spatial region in the frequency domain using direction information of a signal obtained by spatial sampling. In addition, in order to perform bit allocation in consideration of a spatial region, a masking effect is used.

For this purpose, the information amount determination unit 50 determines the amount of information to be assigned to each channel and/or sound source object using the masking threshold calculated by the masking threshold calculation unit 40. By using a masking threshold corresponding to an auditory spatial masking effect, bit allocation on the frequency axis in consideration of a spatial region can be performed. That is, by using the auditory spatial masking effect, the number of bits of a signal required for transmission can be reduced while maintaining the auditory quality.

In the present embodiment, the information amount determination unit 50 calculates the bit allocation as the amount of information using, for example, PE (Perceptual Entropy), in order to make active use of the auditory masking effect. PE is the quantity obtained by calculating the average amount of information of a music signal on the assumption that quantization noise below the masking threshold is buried, i.e., carries no information meaningful to human hearing.

The PE can be calculated by the following formula (5).

Here, T_i, the threshold of the i-th critical band in the Bark domain, is substituted as T_i/k_i = T_final,i (k_i being the number of DFT spectral lines belonging to band i).
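Since expression (5) itself is not reproduced in the text, the following is only a commonly used simplified perceptual-entropy-style estimate, not the source's formula: roughly 0.5·log2(energy/threshold) bits per spectral line, and 0 bits where the line is fully masked. All names and numbers are illustrative.

```python
import numpy as np

def perceptual_entropy(line_energy, line_threshold):
    """Simplified PE-style estimate: bits needed per frame so that
    quantization noise stays below the masking threshold. NOT expression
    (5) of the text, which is not reproduced there; a common
    simplification of the same idea."""
    ratio = np.asarray(line_energy) / np.asarray(line_threshold)
    return float(np.sum(np.maximum(0.5 * np.log2(ratio), 0.0)))

e = np.array([1000.0, 4.0, 64.0, 0.5])  # spectral line energies
t = np.array([10.0,  8.0,  1.0, 1.0])   # per-line thresholds (T_i / k_i)
print(perceptual_entropy(e, t))         # total bits for this frame
```

Lines 2 and 4, which lie below their thresholds, contribute nothing; raising the thresholds via spatial masking therefore directly lowers the PE and hence the bit allocation.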

(step S104)

Next, the encoding unit 60 performs an encoding process.

The encoding unit 60 encodes audio signals of a plurality of channels and/or position information of a sound source object and a sound source object with respective allocated amounts of information.

The encoded data is transmitted to the decoding device 2 on the receiving side. The transmission is performed, for example, by point-to-point communication. Alternatively, the data may be downloaded, or read by the decoding device 2 from a memory card or an optical recording medium.

(step S203)

Here, the decoding unit 90 of the decoding device 2 performs decoding processing.

The decoding unit 90 decodes the audio signals of the plurality of channels and/or the sound source objects encoded by the encoding device 1 into speech signals. Specifically, when the decoding device 2 is a smartphone or the like, the audio signal transmitted from the encoding device 1 is decoded by a decoder for the specific codec.

(step S204)

Next, the stereo audio playback section 100 performs stereo audio playback processing.

The stereo audio playback unit 100 converts the audio signal decoded by the decoding unit 90 into a stereo audio signal for playing stereo audio to a listener.

Specifically, the stereo audio playback unit 100 plays a multi-channel audio signal as a 2-channel audio signal without losing its spatial information. This can be achieved by applying, to each speech signal, the transfer characteristics of the sound path from the sound source to the entrance of the listener's ears, and summing over all directions. That is, the stereo audio playback unit 100 synthesizes the audio signals of the different directions and plays them through headphones. For this purpose, the Head-Related Transfer Function (HRTF) corresponding to the direction of each speech signal is convolved with it, converting the signals into a 2-channel sound signal. Specifically, the stereo audio playback unit 100 applies, for example, the transfer characteristics of the HRTF corresponding to each signal's direction to each audio signal, and outputs the sums of the resulting signals for the L channel and the R channel. In this way, a 2-channel speech signal can easily be reproduced regardless of the number of channels on the sound pickup side.
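The direction-wise HRTF convolution and L/R summation above can be sketched as follows. The 2-tap "HRTFs" here are dummy impulse responses standing in for measured data (they model only an interaural level difference), and all names are illustrative.

```python
import numpy as np

def binauralize(direction_signals, hrtf_l, hrtf_r):
    """Render direction-wise decoded signals to a 2-channel binaural
    signal: convolve each direction with its HRTF pair and sum.
    direction_signals: list of 1-D arrays, one per beam direction.
    hrtf_l, hrtf_r:    lists of L-ear / R-ear impulse responses."""
    n = max(len(s) + len(h) - 1 for s, h in zip(direction_signals, hrtf_l))
    left = np.zeros(n)
    right = np.zeros(n)
    for s, hl, hr in zip(direction_signals, hrtf_l, hrtf_r):
        yl = np.convolve(s, hl)
        yr = np.convolve(s, hr)
        left[:len(yl)] += yl
        right[:len(yr)] += yr
    return left, right

# Toy example: two directions with dummy 2-tap impulse responses.
sigs = [np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5, 0.5])]
hl = [np.array([1.0, 0.0]), np.array([0.3, 0.0])]
hr = [np.array([0.3, 0.0]), np.array([1.0, 0.0])]
L, R = binauralize(sigs, hl, hr)
print(L, R)
```

Because the summation happens after the per-direction convolutions, the output is always 2 channels regardless of how many pickup directions exist, which is the property the paragraph above relies on.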

The audio signal encoding/decoding process according to the embodiment of the present invention is completed as described above.

With the above configuration, the following effects can be obtained.

In recent years, with the spread of multichannel audio playback environments and of binaural playback in AR (Augmented Reality) and VR (Virtual Reality), techniques for collecting, transmitting, playing, and enhancing 3D sound fields have grown in importance.

However, in encoding a signal obtained by spatial sampling, it is necessary to encode a sound signal around the entire circumference of a listener, and therefore, as the sampling direction increases, the number of channels increases greatly, and a higher total bit rate is required.

As an example, consider transmission over the Internet using a smartphone or the like. In Spotify (registered trademark), one of the music distribution services, the bit rate during streaming playback is at most about 320 kbps for 2-channel stereo. Since spatial sampling assumes transmission of a signal with far more channels than 2-channel stereo, the bit rate per channel must be reduced further still.

On the other hand, encoding of audio signals (data compression such as MPEG) has conventionally exploited the auditory masking effect. However, such masking mainly uses only the masking effect on the frequency axis. In audio coding such as MPEG-2 AAC, MPEG-4 AAC, and MP3, the auditory masking effect on the frequency axis of each channel is used even when encoding multichannel signals.

However, a sound field represented by a multi-channel signal is generally composed of a plurality of spatially distributed sound sources. Yet when a plurality of sound sources are arranged in space simultaneously, the behavior of the masking effect and its influence on auditory perception have not been clarified, and so have not been put to practical use: what masking effect sound sources arranged in three-dimensional space produce, what influence they exert, and what auditory perception they form remain open questions. That is, conventional masking threshold calculation does not consider the spatial relationship between channels.

In contrast, an encoding device 1 according to an embodiment of the present invention is an encoding device that encodes audio signals of a plurality of channels and/or a sound source object and position information of the sound source object, including: a masking threshold calculation unit (40) for calculating a masking threshold corresponding to a spatial masking effect of an auditory sense; an information amount determination unit 50 for determining the amount of information to be assigned to each channel and/or sound source object by using the masking threshold calculated by the masking threshold calculation unit 40; and an encoding unit 60 for encoding the audio signals of the plurality of channels and/or the sound source object and the position information of the sound source object with the amounts of information allocated to them.

In this configuration, when encoding audio signals of a plurality of channels or sound source objects and position information thereof, the number of bits to be allocated to each channel and the sound source object is determined in consideration of the spatial masking effect of the sense of hearing, and thus, the present invention can be applied to compression of a multi-channel signal having direction information. This enables encoding that takes into account the spatial relationship between the channels.

Here, since the spatial relationship between channels is not considered in the conventional calculation of the masking threshold, for an audio signal with a large number of channels, such as 22.2-channel audio that further improves the sense of presence, there is a possibility that compression by bit allocation is insufficient and the bit rate (bandwidth) at the time of transmission or the like falls short.

In contrast, the audio signal encoding method according to the embodiment of the present invention exploits the fact that the sound field expressed by a multi-channel signal is composed of a plurality of spatially distributed sound sources. Since a signal obtained by spatial sampling includes spatial information, the number of transmission bits can be further reduced by performing bit allocation that considers the spatial domain in addition to the conventional frequency domain.

Thus, an audio signal encoding method capable of encoding an audio signal with a sufficient bit rate even for an audio signal with a large number of channels such as 22.2 channels can be provided. That is, for a plurality of sound sources spatially distributed, a masking threshold is obtained based on a mutual masking effect, and bit allocation based on the masking threshold is performed, whereby the bit rate can be reduced. According to the experiments of the present inventors, the bit rate can be reduced by 5% to 20% as compared with the conventional one.

An audio system X according to the present invention is an audio system X including the encoding device 1 and the decoding device 2 described above, the decoding device 2 including: a direction calculation unit 70 for calculating the direction in which the listener is facing; a transmission unit 80 that transmits the direction calculated by the direction calculation unit 70 to the encoding device 1; and a decoding unit 90 that decodes the audio signals of the plurality of channels and/or the sound source objects encoded by the encoding device 1 into speech signals, wherein the masking threshold calculation unit 40 of the encoding device 1 calculates the masking threshold in accordance with a spatial masking effect based on the spatial distance and/or direction between the channels and/or the sound source objects with respect to the position and direction of the listener.

With this configuration, when decoding an audio signal encoded by encoding using a masking threshold corresponding to the above-described auditory spatial masking effect, directional information toward which a listener is facing is calculated by head tracking or the like, and thus, auditory display for controlling the position of a sound image can be realized. That is, the position of the sound source of each channel or the relative positional relationship between the position of the sound source object and the listener can be fed back to the encoding device 1, and encoding and decoding can be performed based on the positional relationship.

Thus, an audio system capable of easily collecting, transmitting, playing, and enjoying a 360 DEG spherical sound space among users can be provided.

Conventionally, as 3D (three-dimensional) sound field playback technologies, there have been developed: auditory display technologies based on binaural/transaural playback, in which music and broadcast/movie content, including surround-type content, are enjoyed through headphones or through two front speakers; and sound field reproduction technologies that simulate the sound field of an actual hall or theater in a 5.1-channel or 7.1-channel surround playback environment for home theater. Development of 3D sound field playback technology using wavefront synthesis based on speaker arrays is also advancing. With the development of such playback systems, sound collection and multi-channel presentation of content have become widespread.

However, as playback technologies for 3D audio, studies on head-related transfer functions and sound localization are actively conducted, but their relationship to spatial masking has not been studied.

In contrast, the audio system of the present invention is characterized in that the decoding device 2 further includes a stereo audio playback unit 100, and the stereo audio playback unit 100 converts the audio signal decoded by the decoding unit 90 into a stereo audio signal for playing back stereo audio to a listener.

With such a configuration, an audio signal encoded efficiently using the correlation and masking effects of a plurality of sound sources distributed in a 3-dimensional sound field can be played back over 2 channels in combination with the Head-Related Transfer Function (HRTF), which concerns the spatial perception of audio signals. That is, an audio signal encoded according to how humans perceive a 3D sound field is played as stereo audio, and a sound field with higher realism than before can thereby be reproduced.

This is considered analogous to the effect, in images, whereby reproducing the "impression" humans feel as "memory colors" increases the sense of realism more than reproducing colors faithfully. That is, sound field reproduction with higher realism can be achieved.

The audio signal encoding method of the present invention is characterized in that the masking threshold is calculated in correspondence with a spatial masking effect based on a spatial distance and/or direction between channels and/or between sound source objects.

With such a configuration, for example, it is possible to realize encoding based on a spatial masking effect using a model calculated based on a spatial distance or direction between each channel and/or each sound source object. That is, when a person listens to sounds spread over a 3-dimensional space, by applying a mutual masking effect based on the spatial distance and/or direction of spatially arranged sound sources to encoding, more efficient encoding can be achieved, and the transmission bit rate of data can be reduced.

The audio signal encoding method of the present invention is characterized in that the masking threshold is calculated in correspondence with a spatial masking effect in which channels and/or sound source objects influence each other more strongly the closer their spatial distance and/or direction, and more weakly the farther apart they are.

With this configuration, the spatial masking effect can be calculated using, for example, a model in which the mutual influence between channels and/or sound source objects is larger the closer their spatial distance or direction, and smaller the farther apart they are. By using such a spatial masking effect, more efficient encoding can be realized and the transmission bit rate of data can be reduced.

The audio signal encoding method of the present invention is characterized in that the masking threshold is calculated in correspondence with a spatial masking effect in which, for channels and/or sound source objects located at positions that are front-rear symmetric as viewed from the listener, the degree of their mutual influence varies with respect to spatial distance and/or direction.

With the above configuration, for channels or sound source objects located at front-rear symmetric positions as viewed from the listener, the spatial masking effect can be calculated using a model in which the influence is not necessarily greater the closer the spatial distance or direction, nor smaller the farther apart. Thus, for example, if the masked sound lies at the position front-rear symmetric to the masking sound, a large rise in the masking threshold can be calculated in accordance with a spatial masking effect whose influence is strong even though the spatial distance is large.

By utilizing such a spatial masking effect, more efficient encoding can be realized, and the transmission bit rate of data can be reduced.

The audio signal encoding method of the present invention is characterized in that, with respect to a channel and/or sound source object located at a rear position as viewed from the listener, the masking threshold is calculated corresponding to a spatial masking effect as if the channel and/or object existed in front, at the front-rear symmetric position.

With the above configuration, the masking threshold can be calculated using a spatial masking effect in which a channel or sound source object located at a rear position as viewed from the listener is treated as if a mirror-image copy of it existed in front, at the front-rear symmetric position. That is, taking the straight line connecting both ears as an axis, the masking threshold is calculated as if a sound source located behind that axis were moved to the line-symmetric position in front of it.

By utilizing such a spatial masking effect, more efficient encoding can be realized, and the transmission bit rate of data can be reduced.

The audio signal encoding method of the present invention is characterized in that the masking threshold is calculated in accordance with a spatial masking effect in which the degree of mutual influence between the signals of the channels and/or sound source objects varies depending on whether each signal is tonal or noise-like.

With the above configuration, the masking threshold can be calculated using a model of the spatial masking effect in which the degree of mutual influence between the channel signals or sound source object signals changes depending on whether each signal is tonal or noise-like.

With this configuration, more efficient encoding can be realized, and the transmission bit rate of data can be reduced.

In the audio signal encoding method of the present invention, the masking threshold is adjusted by the following formula (1).

T = β{max(y1, αy2) − 1}

y1 = f(x − θ)

y2 = f(180° − x − θ) …… formula (1)

Here, T denotes the weight multiplied by the frequency-domain masking threshold of each channel signal in order to calculate the spatial masking threshold, θ denotes the azimuth of the masking sound, α denotes a constant controlled by the frequency of the masking sound, β denotes a constant controlled by whether the masking sound is a tonal or noise-like signal, and x denotes the azimuth of the masked sound.

With this configuration, the spatial masking effect corresponding to each model can be easily calculated. This enables efficient encoding and reduction of the data transmission bit rate.
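A minimal sketch of formula (1) follows. The shape of the periodic function f is not fixed by the text beyond having a 360° period and being line-symmetric about the interaural axis, so a raised cosine with its peak at 0° is assumed here purely for illustration.

```python
import math

def spatial_masking_weight(x, theta, alpha, beta, f=None):
    """Weight T (in dB) applied to the frequency-domain masking threshold,
    following formula (1): T = beta * (max(y1, alpha*y2) - 1).
    f is an assumed raised cosine with a 360-degree period and peak at 0."""
    if f is None:
        f = lambda d: 0.5 * (1.0 + math.cos(math.radians(d)))
    y1 = f(x - theta)          # direct term, peaked at the masker azimuth theta
    y2 = f(180.0 - x - theta)  # mirror term about the interaural (90-270 deg) axis
    return beta * (max(y1, alpha * y2) - 1.0)
```

With this shape, T is 0 when the masked sound coincides with the masker azimuth (no extra attenuation of the threshold) and becomes increasingly negative as the azimuths separate, apart from the front-rear mirror direction where the αy2 term raises it again.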

Conventionally, PE is generally calculated by considering only masking effects in the frequency domain of each channel of a stereo signal.

In contrast, the audio signal encoding method of the present invention is characterized in that the average number of bits per sample is calculated from the PE (perceptual entropy) computed in consideration of the masking effect across the space between channels.

By performing bit allocation based on the masking threshold in this manner, the transmission bit rate of data can be reduced. Experiments by the present inventors have confirmed that the bit rate can be reduced by about 5 to 25 percent.
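The relation between PE and the average bits per sample can be sketched as follows. The per-band form used here (log2 of the signal-to-mask amplitude ratio, summed over bands above the masking threshold) is a common textbook simplification of perceptual entropy, not necessarily the exact formula of the embodiment.

```python
import math

def perceptual_entropy(signal_power, masking_threshold):
    """Estimate average bits/sample from per-band signal power and masking
    thresholds (both on a linear power scale).  A band whose power exceeds
    its threshold needs about log2(sqrt(power/threshold)) bits; bands at or
    below threshold need none."""
    bits = 0.0
    for p, t in zip(signal_power, masking_threshold):
        if p > t:
            bits += math.log2(math.sqrt(p / t))
    return bits / len(signal_power)
```

Raising the masking threshold of a band (e.g. through spatial masking from another channel) directly lowers the bits that band contributes, which is how the spatial model translates into a lower transmission bit rate.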

The audio signal decoding method of the present invention is an audio signal decoding method executed by the decoding device 2, and is characterized by decoding audio signals of a plurality of channels encoded by the above-described audio signal encoding method.

With this configuration, by decoding the audio signal encoded by the encoding device 1, it is possible to play back a high-quality audio signal even if the transmission bit rate is low.

[ other embodiments ]

In addition, in the embodiment of the present invention, as the encoding of the audio signals of a plurality of channels, the encoding of 22.2 channels is mentioned.

In contrast, the audio signal encoding method according to the present embodiment can be applied to multi-channel audio coding such as 5.1-channel and 7.1-channel, to 3D audio coding in which a space is sampled, to object coding represented by MPEG-H 3D Audio, and to existing 2-channel stereo audio coding.

That is, instead of collecting sound with the microphone array 10 as shown in fig. 1 of the above-described embodiment, the encoding device 1 can, in step S101 of fig. 2, acquire audio data from already collected multi-channel audio data, audio objects, and the like.

Further, in the above-described embodiment, an example is described in which a headphone capable of head tracking is used as the decoding device 2 for decoding the audio signal transmitted to the audio system X.

However, the audio signal encoding method and audio decoding method according to the present embodiment can be applied to any system in which the auditory masking effect of sound sources scattered in 3-dimensional space can be used. For example, they can be applied to other systems that capture, transmit, and play a 3D sound field, and also to VR/AR applications and the like.

The above-described embodiments describe examples in which a wearable headphone, an in-ear headphone, or the like is used as the headphone 110 for playing stereoscopic audio.

However, instead of the headphones 110, a plurality of fixed speakers or the like may be used.

Further, although the above-described embodiment describes the position and direction information as being fed back from the headphones to the encoding device 1, this need not be the case. When such feedback of position and direction information is not performed, the masking threshold can of course be calculated without using the position and direction information.

In this case, the stereo audio playback section 100 may update the convolution of the Head Related Transfer Function (HRTF) without reference to the position and direction information.

In addition, in the above-described embodiment, the configuration in which the decoding device 2 includes the direction calculation unit 70 and the transmission unit 80 has been described.

However, in the audio signal encoding method and the audio decoding method according to the present embodiment, it is not always necessary to know the direction in which the listener is facing. Therefore, the direction calculation unit 70 and the transmission unit 80 may not be provided.

In the above-described embodiments, an example of calculating the spatial masking effect after the frequency masking is extended is described.

In contrast, even if the frequency is replaced with time, the same spatial masking effect can be calculated. Further, as the spatial masking effect, a combination of masking in the frequency direction and masking in the time direction can be used.

In the above-described embodiments, an example in which transmission is performed using a spatial masking effect while suppressing the bit rate at a low level has been described. That is, an example of encoding audio signals of a plurality of channels with quality equivalent to that of conventional high-bit-rate audio encoding is described.

In contrast, rather than aiming at faithful high-quality coding, the coding may emphasize important sounds or deliberately distort the sense of localization. Alternatively, the amount of information allocated to acoustically important locations may be increased using the spatial masking effect, or conversely the amount allocated to acoustically unimportant locations may be decreased further, thereby enhancing the sense of presence.

In addition, in the above-described embodiments, an example of bit allocation as allocation of information amount is described.

However, the allocation of the information amount need not be a simple determination (allocation) of the number of bits per band; an information amount corresponding to entropy coding or another coding scheme may be allocated instead.

Further, as described in the above embodiment, when there is feedback of the position and direction information, it is possible to calculate an efficient masking threshold value using the position and direction information.

Therefore, the bit rate of distribution (transmission) can be changed according to the presence or absence of feedback of the position and direction information. That is, the decoding device 2 that feeds back the position and direction information to the encoding device 1 can transmit data at a lower bit rate than the decoding device 2 that does not feed back the position and direction information.

With this configuration, a service for providing content at a lower cost can be realized.

Next, the present invention will be further described with reference to examples based on the drawings, and the following specific examples do not limit the present invention.

Examples

(experiment of masking model considering spatial masking)

(Experimental method)

An experiment measuring, for each position of the masking sound, the threshold of the masked sound at each frequency in the presence of the masking sound will be described with reference to figs. 5 and 6.

FIG. 5 is a schematic diagram of the measurement system. The front of the subject is defined as 0°, with the counterclockwise direction positive. A PC (Personal Computer) is placed in front of the subject. The subject sits on a chair and listens with both ears to the stimulus sounds presented from the speakers. Speakers were placed at 8 positions at 45° intervals, 1.5 m from the subject, so as to surround the subject's entire circumference. The sound pressure level [dBSPL] at the output of the experimental system was calibrated by measurement with a sound level meter (RION NA-27).

The experimental method is as follows. First, so that the subject could become familiar with the sound sources used in the experiment, a demonstration was given in which each sound source was presented individually. The measurement then began. During the measurement, the masking sound is always present. The masked sound is presented for a duration of 0.7 seconds, repeated after 0.7-second intervals of silence. For each frequency and each sound pressure level of the masked sound, the masked sound is presented 3 times, during which the subject, looking at the answer screen, inputs to the PC whether or not the way the sound is heard changes. The subject is instructed to answer without moving the head, moving only the line of sight. Here, "the way the sound is heard changes" covers not only the case where the masked sound itself is perceived, but also the case where a sound that is neither the masking sound nor the masked sound is perceived. For example, when two pure tones with slightly different frequencies are present simultaneously, interference of the sound waves produces "beats" at a frequency equal to the frequency difference of the 2 tones. Perceiving such a sound is also counted as "the way the sound is heard changes".

In addition, so that the subject could become accustomed to the experimental method, several practice measurements, which are not reflected in the experimental results, were conducted before the start.

Fig. 6 is an explanatory diagram showing the threshold search method in this experiment. The threshold search was performed following the adaptive method, in which the experimenter adjusts the physical parameter value of the stimulus according to the subject's responses and thereby determines the threshold.

In fig. 6, the horizontal axis represents the number of groups of masked-sound presentations, and the vertical axis represents the sound pressure level of the masked sound. One "group" of masked-sound presentations means a period in which the masked sound is presented 3 times, and is used as the unit of sound-source presentation.

First, the frequency of the masked sound is fixed at f1, and the masked sound is presented to the listener at sound pressure level SPLmax, then at SPLmin. SPLmax is the maximum of the measurement range of the sound pressure level, and SPLmin the minimum. If the subject cannot detect the masked sound at SPLmax, SPLmax is taken as the threshold; if the subject can detect the masked sound at SPLmin, SPLmin is taken as the threshold. In either case, the actual threshold may be considered to lie outside the measurement range. The threshold of the masked sound at frequency f2 in fig. 6 is an example of such a case; fig. 6 shows a case where the masked sound at frequency f2 cannot be detected even at sound pressure level SPLmin. Thus, the number of sound-pressure-level groups to which the subject must respond varies with the subject's answers. After the masked sound has been presented at SPLmin, the threshold is found by binary search over the subject's answers: the value halfway between the minimum sound pressure level at which the masked sound has so far been detected and the maximum at which it has not is used as the next sound pressure level. Continuing this search, only one settable sound pressure level eventually remains, and this final value is taken as the threshold of the masked sound at frequency f1.
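The search procedure above can be sketched as follows. The `detects` callback standing in for the subject's answer, and the discrete 3-dB grid (taken from the level steps stated later for Table 2), are illustrative assumptions.

```python
def measure_threshold(detects, spl_max, spl_min, step=3):
    """Binary search for a masked-tone threshold over a discrete SPL grid,
    mirroring the procedure in the text.  detects(spl) plays the role of the
    subject's answer: True means the masked tone was detected at that SPL."""
    levels = list(range(spl_min, spl_max + 1, step))
    if not detects(levels[-1]):
        return levels[-1]          # inaudible even at SPLmax: threshold = SPLmax
    if detects(levels[0]):
        return levels[0]           # audible even at SPLmin: threshold = SPLmin
    lo, hi = 0, len(levels) - 1    # levels[lo] undetected, levels[hi] detected
    while hi - lo > 1:
        mid = (lo + hi) // 2       # halfway between last undetected and detected
        if detects(levels[mid]):
            hi = mid
        else:
            lo = mid
    return levels[hi]              # lowest level still detected
```

Each call to `detects` corresponds to one "group" of three presentations in the experiment, so the number of groups per frequency depends on the answers, exactly as fig. 6 illustrates.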

For the search described above, the frequency is continuously changed in the order of f1, f2, f3, and … … as shown in fig. 6, and the search is performed. In this experiment, the threshold values of the masked sound are examined sequentially from the low frequency side.

The answer screen presented to the subject is shown in fig. 7: fig. 7(a) shows the answer screen when there is 1 masking sound source, and fig. 7(b) when there are 2. The screen displays the azimuth of the masking sound, the sound pressure level of the masking sound, the azimuth of the masked sound, the frequency of the masked sound, a lamp that lights up while the masked sound is playing, a counter indicating the number of times the masked sound has been played, and a button for inputting whether the masked sound was detected. The subject can thus tell from which direction, at what level, and when each sound source is presented. The frequency of the masked sound is displayed because the measurement is performed while continuously changing the frequency of the masked sound; showing it lets the subject know which masked sound is currently being answered and prevents confused answers. The subject notifies the PC of "masked sound detected" by turning the detection button on, and of "masked sound not detected" by turning it off. The counter indicating the number of plays starts at 0 and cycles with the number of plays, e.g. 0, 1, 2, 3, 0, …. When the count returns to 0, the answer is reset, that is, the detection button is turned off, and the masked sound moves to the next sound pressure level or frequency. The subject must input detection or non-detection while the counter shows 1, 2, or 3.

The answering program for the listening experiment was written in Max ver. 7 from Cycling '74. All other programs were written in MATLAB ver. R2018a from MathWorks.

(List of masking tones)

A list of masking sounds used in the experiment is shown in table 1 below.

[ Table 1]

Masking sounds used

Name                       Sound source signal
Masking sound A (mask A)   Band noise with center frequency 400 Hz and bandwidth 100 Hz
Masking sound B (mask B)   Band noise with center frequency 1000 Hz and bandwidth 150 Hz
Masking sound C (mask C)   Pure tone with frequency 400 Hz
Masking sound D (mask D)   Pure tone with frequency 1000 Hz

For the masking sounds, band noises and pure tones with a frequency (center frequency) of 400 Hz or 1000 Hz were prepared. These are referred to below as masking sound A (mask A) through masking sound D (mask D). The bandwidth of each band noise was chosen to approximately match the bandwidth of the critical band. It is known that, in a band noise centered on a pure tone's frequency, the noise components contributing to masking that pure tone are limited to components within a certain bandwidth; the critical band is the band that contributes to masking such a pure tone.
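A masker like those in Table 1 can be sketched as follows. The construction (a sum of random-phase sinusoids at 10-Hz spacing inside the band) is an illustrative assumption, since the text does not state how the band noises were actually generated.

```python
import math
import random

def band_noise(center_hz, bandwidth_hz, duration_s=0.7, fs=48000):
    """Band-limited noise masker (e.g. masker A: 400 Hz centre, 100 Hz wide),
    built as equal-amplitude sinusoids with random phases inside the band.
    0.7 s duration and 48 kHz sampling follow the experimental conditions."""
    lo = center_hz - bandwidth_hz / 2.0
    hi = center_hz + bandwidth_hz / 2.0
    n = int(duration_s * fs)
    components = [(f, random.uniform(0.0, 2.0 * math.pi))
                  for f in range(int(lo), int(hi) + 1, 10)]  # assumed 10-Hz spacing
    return [sum(math.sin(2.0 * math.pi * f * t / fs + ph) for f, ph in components)
            for t in range(n)]
```

For the pure-tone maskers C and D, the same idea reduces to a single component at 400 Hz or 1000 Hz.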

(Experimental conditions)

As experimental conditions, two cases were examined: the case where the number of masking sounds is 1 and the case where it is 2. The experiments were conducted in an anechoic room, with the sampling frequency of the sound source signals set to 48 kHz.

First, the conditions when the number of masking sounds to be arranged is 1 are shown in table 2 below.

[ Table 2]

Experimental conditions (when the masking sound is set to 1 sound source)

The subjects were 2 males in their 20s (subject a, subject b) with normal hearing. For the masking sound, any one of the sound sources of masking sounds A to D described above was used, at one of two sound pressure levels, 60 dBSPL or 80 dBSPL. The azimuth of the masking sound was set to one of the four azimuths 0°, 45°, 90°, and 135°; that is, the masking-sound azimuths cover only the 4 azimuths on the left-ear side. By preparing and testing these 4 masking-sound azimuths, threshold data for half the circle around the subject can be obtained. If the human head is assumed to be bilaterally symmetric, the threshold is considered symmetric with respect to the median plane, so the threshold data for the remaining half circle, not obtained in this experiment, mirror the data obtained here.

The masked sound is a single pure-tone source, with frequency and sound pressure level as follows. Specifically, the frequencies of the masked sound are chosen densely near the frequency (center frequency) of the masking sound. When the masking sound is a pure tone, the masked sound is considered imperceptible at any sound pressure level if its frequency coincides exactly with that of the masking sound (400 Hz, 1000 Hz), so such frequencies are excluded from the measurement targets. The sound pressure level of the masked sound is set in 3-dB steps, with the maximum level equal to the sound pressure level of the masking sound and the minimum level 20 dBSPL or 18 dBSPL. The maximum level is based on the expectation that the masked sound is always perceived when its sound pressure level exceeds that of the masking sound. The minimum level was determined in consideration of the background noise level in the anechoic room used as the laboratory, the measurement range extending to approximately 15 dB below the background noise level. The azimuth of the masked sound is set to 45° or 315°. When the azimuth of the masked sound is 45°, the azimuths of the masking sound and the masked sound coincide, so the result obtained is the conventionally studied frequency-masking threshold. When the azimuth of the masked sound is 315°, the masking sound and the masked sound lie in different azimuths, so the result obtained is a spatial masking threshold, i.e., masking between the channels of stereo sound.

The azimuth of the masked sound is set to any one of the 8 azimuths at 45° intervals from 0° to 315°.

Next, the conditions when the number of masking sounds to be arranged is 2 are shown in table 3 below.

[ Table 3]

Experimental conditions (when the masking sound is set to 2 sound sources)

The subject is subject a only. As the masking sounds, masking sound A is placed at azimuth 45° and masking sound B at 315°. The masked sound is a single pure-tone source. The frequencies of the masked sound combine those used under the condition where the masking sound's frequency (center frequency) is 400 Hz and those where it is 1000 Hz. Further, since the masking sounds placed (masking sound A and masking sound B) are both noise-like, it can be considered that, unlike the pure-tone case, the masked sound is perceptible above a certain sound pressure level even when its frequency coincides exactly with the center frequency of a masking sound (400 Hz, 1000 Hz). Therefore, 400 Hz and 1000 Hz were also added to the measurement targets. The maximum sound pressure level of the masked sound is 9 dB greater than in Table 2. This allows for the fact that, with 2 sound sources present as masking sounds, the sound pressure level of the sound heard rises by up to about 6 dB.

The azimuth of the masked sound is set to 225°.

(calculation of masking threshold)

(Results and discussion)

The experimental results on the subject a will be described with reference to fig. 8 to 11.

α and β described in the above formula (5) are searched for in the range of values shown in the following table 4.

[ Table 4]

Range of loop calculation for α and β

Parameter   Range of loop calculation
α           0, 0.01, 0.02, …, 1
β           0, 0.01, 0.02, …, 20

In the present embodiment, the optimal values of α and β are calculated as follows. First, for given values of α and β, the Mean Squared Error (MSE) between the output of Tspatial and the measured threshold at each azimuth of the masked sound obtained in the experiment is calculated for all combinations of masking-sound type (masking sounds A to D), azimuth, and sound pressure level. Next, the calculated mean squared errors are summed for each type of masking sound. These operations are repeated while changing the values of α and β, and the pair of α and β that minimizes the sum of the mean squared errors per masking-sound type is taken as the optimum values of α and β.

Here, a mean square error mse (j) in the azimuth of the j-th masking sound is calculated by the following expression (6).

In expression (6), Tspatial(i) denotes the output value of the function Tspatial at the azimuth [deg] of the i-th masked sound, and Tmeasured(i) denotes the measured threshold of the masked sound, obtained in the experiment, at the azimuth [deg] of the i-th masked sound. Lmasker azimuth is the threshold [dBSPL] of the masked sound in the direction in which the masking sound exists. Since Tspatial expresses a change relative to the azimuth at which the masking sound exists, this offset serves to align Tspatial with Tmeasured. N is the number of entries of Tspatial and Tmeasured (the total number of masked-sound azimuths). In this calculation, the azimuth of the masked sound is taken in 1° steps from 0° to 360°, so N = 361. However, since Tmeasured was actually measured in 45° steps of the masked-sound azimuth, the values missing on the 1° scale are estimated by linear interpolation.
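The 45°-to-1° linear interpolation and the offset fitting error of expression (6) can be sketched as follows. Since expression (6) itself is not reproduced in this excerpt, the exact placement of the offset Lmasker azimuth inside the error term is an assumption based on the surrounding description.

```python
def interpolate_measured(measured):
    """Expand thresholds measured at 45-degree steps (dict keyed 0..315) to a
    1-degree grid over 0..360 by linear interpolation, wrapping at 360."""
    out = []
    for az in range(361):
        lo = (az // 45) * 45
        hi = min(lo + 45, 360)
        a = measured[lo % 360]
        b = measured[hi % 360]
        frac = (az - lo) / 45.0 if hi != lo else 0.0
        out.append(a + frac * (b - a))
    return out

def mse(t_spatial, t_measured, l_masker):
    """Mean squared error in the spirit of expression (6): the model output is
    offset by the threshold at the masker azimuth before comparison."""
    n = len(t_measured)
    return sum((t_spatial[i] + l_masker - t_measured[i]) ** 2
               for i in range(n)) / n
```

The grid search of Table 4 then amounts to evaluating `mse` over all (α, β) pairs and keeping the pair with the smallest per-masker sum.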

As a result of the loop calculation, optimum values of α and β are obtained for the masking sounds a to D as shown in table 5 below.

[ Table 5]

Optimum values of α and β obtained by the loop calculation

Type of masking sound   Optimum value of α   Optimum value of β
Masking sound A         0.40                 11.96
Masking sound B         0.28                 9.24
Masking sound C         0.52                 1.12
Masking sound D         0.30                 5.82

FIGS. 8 to 11 show the values of Tspatial, computed with the values of Table 5, fitted to the measured values of the threshold of the masked sound. In each figure, the upper left graph shows the result for masking sound A, the upper right for masking sound B, the lower left for masking sound C, and the lower right for masking sound D.

The horizontal axis of each graph represents the azimuth of the masked sound, and the vertical axis the sound pressure level. The azimuth of the masking sound is indicated by a vertical dotted line. The black solid line represents the measured threshold of the masked sound when the sound pressure level of the masking sound is 80 dBSPL, and the gray solid line the measured threshold when it is 60 dBSPL. The dashed lines represent the values fitted using the function Tspatial: the black dashed line is the fit to the black solid line, and the gray dashed line the fit to the gray solid line.

In addition, each dashed line shows the value of the function Tspatial with the offset Lmasker azimuth added.

As can be seen from figs. 8 to 11, each graph approximately fits the measured values. However, for band noises such as masking sound A and masking sound B, as in the upper left graphs of fig. 8 and fig. 9, portions can be seen where the dashed line does not fit the solid line well around the rise in threshold at the azimuth front-rear symmetric to the masking sound. The reason is considered to be that, when the masking sound is a band noise and its azimuth is 90°, the threshold changes little with azimuth, which pushes the value of α small when the sum of the mean squared errors is minimized. To fit these portions well, the value of α may be set large if the error between the measured values and the model function at a masking-sound azimuth of 90° can be disregarded.

In the present embodiment, the values of α and β are obtained by loop calculation, but the value of β may instead be determined from an index that discriminates the tonality (tonal or noise-like character) of the masking sound. Examples of such indices include the autocorrelation and the Spectral Flatness Measure (SFM). Using these indices, β can be determined as a parameter and fitted.
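As one of the tonality indices mentioned, the Spectral Flatness Measure has a standard definition, the ratio of the geometric to the arithmetic mean of the power spectrum, which can be computed as follows:

```python
import math

def spectral_flatness(power_spectrum):
    """Spectral Flatness Measure: geometric mean / arithmetic mean of the
    power spectrum.  Close to 1 for noise-like signals, close to 0 for tonal
    ones, so it can drive the choice of beta as suggested in the text."""
    n = len(power_spectrum)
    geo = math.exp(sum(math.log(p) for p in power_spectrum) / n)
    arith = sum(power_spectrum) / n
    return geo / arith
```

How the SFM value is then mapped to β (e.g. interpolating between a "tonal" and a "noisy" β) is left open by the text and would be part of the fitting.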

(conclusion)

In the present embodiment, a basic listening experiment was performed to confirm spatial masking, and by reflecting the knowledge obtained from the experiment, a masking threshold calculation method and a model taking spatial masking into account were realized.

First, in the listening experiment, a threshold rise near the frequency of the masking sound was observed even when the masking sound and the masked sound were placed at different azimuths, confirming the existence of spatial masking.

The masking threshold varies with the azimuths of the masking sound and the masked sound; basically, the farther the azimuth of the masked sound is from that of the masking sound, the lower the threshold. In a 2-channel stereo environment, a value obtained by applying a −15 dB weight to the threshold of masking exerted by a channel's own signal on that channel may be used as the threshold of masking exerted by that signal on the other channel's signal. When the masking sound is a band noise, the threshold of the masked sound is seen to rise at the azimuth front-rear symmetric to the masking sound compared with the surrounding azimuths, and this becomes more noticeable the lower the center frequency of the masking sound. When the masking sound is a pure tone, the variation of the threshold with the azimuth of the masked sound is flat.

Further, the masking threshold that takes signals at azimuths other than a signal's own azimuth into account is obtained by summing, in linear scale, the masking threshold from the signal at the same azimuth as the masking sound and the masking thresholds from signals at other azimuths, each obtained when the respective masking sound is present alone.

These results are summarized below:

When the masking sound is at 0°, the threshold is highest when the masked sound is at 0°. At 45° and 90°, the threshold decreases as the masked sound moves away from the masking sound. However, the threshold rises again from 135°, reaching at 180° approximately the same level as at 0°. That is, the masking-threshold values based on the masking sound are almost symmetric in front of and behind the listener.

When the masking sound is at 45°, the threshold is highest when the masked sound is at 45°. At 90°, the threshold drops. At 135°, however, the threshold rises as expected, approaching the threshold at 45°. At 180° the threshold decreases, and at 225° it decreases further. As in the case of a masking sound at 0°, the masking thresholds are almost front-rear symmetric around the listener, that is, line-symmetric about the line connecting 90° and 270°.

The same tendency applies to masking sounds at 90° and at 135°.

From the above findings, a masking threshold calculation method taking spatial masking into account is proposed as follows. In a 2-channel stereo environment, the masking threshold of the own channel and the masking thresholds of the other channels weighted by −15 dB are summed in linear scale. For all azimuths, the change in the azimuth of the peak of the masked-sound threshold is modeled by an arbitrary periodic function with a period of 360° and by periodic functions obtained by phase-shifting it so as to be line-symmetric about 90° and 270°. Using the modeled function, the masking thresholds of each channel are weighted and summed in linear scale.
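The linear-scale summation with −15 dB weighting described above can be sketched as follows. Treating the per-band thresholds as powers expressed in dB, and passing the weight of each contributing channel explicitly, are assumptions consistent with the text.

```python
import math

def combined_threshold(own_threshold, other_thresholds, weights):
    """Combine masking thresholds (dB) across channels on a linear power
    scale: each other-channel threshold gets its weight in dB (e.g. -15 dB
    in the 2-channel stereo case) and all contributions are power-summed."""
    total = 10.0 ** (own_threshold / 10.0)
    for thr, w in zip(other_thresholds, weights):
        total += 10.0 ** ((thr + w) / 10.0)
    return 10.0 * math.log10(total)
```

Because the other channels enter attenuated, a far channel barely raises the threshold, while an equally loud neighbouring channel raises it slightly, which is exactly the behaviour that lets the encoder spend fewer bits on the affected bands.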

That is, the masking threshold can be calculated by the above equation (1). The masking threshold is calculated based on this, and the number of bits required for signal transmission can be reduced.

It is to be understood that the configurations and operations of the above embodiments are examples, and can be performed by making appropriate changes within the scope of the present invention.

Industrial applicability

The present invention can provide an audio signal encoding method in which the bit rate is reduced further than before by using the spatial masking effect of hearing, and is industrially applicable.

Description of the reference numerals

1 encoding device

2 decoding device

10 microphone array

20 sound collecting part

30 frequency domain conversion part

40 masking threshold calculation section

50 information amount determination unit

60 code part

70 direction calculating part

80 transmitting part

90 decoding unit

100 stereo audio playing part

110 headphones

X audio system
