Audio signal processing apparatus and noise suppression method

Document No.: 538889    Publication date: 2021-06-01

Note: This technology, "Audio signal processing apparatus and noise suppression method", was designed and created by 难波隆一, 见山成志, 真锅芳宏, and 及川芳明 on 2019-08-23. Abstract: The purpose of the present invention is to improve noise suppression performance by performing noise suppression suitable for a noise environment. Noise dictionary data is acquired, which is read from a noise database based on installation environment information including information on a direction between a sound receiving point and a noise source and a type of noise. The acquired noise dictionary data is used for noise suppression processing of an audio signal acquired by a microphone arranged at a sound receiving point.

1. A speech signal processing apparatus comprising:

a control calculation unit configured to acquire noise dictionary data read from a noise database unit based on installation environment information including information on a type of noise and a direction between a sound receiving point and a noise source; and

a noise suppression unit configured to perform noise suppression processing on a speech signal obtained by a microphone arranged at the sound receiving point using the noise dictionary data.

2. The speech signal processing apparatus according to claim 1,

wherein the control calculation unit acquires a transfer function between the noise source and the sound receiving point from a transfer function database unit that holds transfer functions between two points under various environments based on the installation environment information, and

the noise suppression unit uses the transfer function for noise suppression processing.

3. The speech signal processing apparatus according to claim 1,

wherein the installation environment information includes information on a distance from the sound receiving point to a noise source, and

the control calculation unit acquires noise dictionary data from the noise database unit while including the type, the direction, and the distance as parameters.

4. The speech signal processing apparatus according to claim 1,

wherein the installation environment information includes information on an azimuth angle and an elevation angle between the sound receiving point and a noise source as the direction, and

the control calculation unit acquires noise dictionary data from the noise database unit while including the type, the azimuth angle, and the elevation angle as parameters.

5. The speech signal processing apparatus according to claim 1, further comprising

an installation environment information holding unit configured to store the installation environment information.

6. The speech signal processing apparatus according to claim 1,

wherein the control calculation unit executes processing of storing the installation environment information input by the user operation.

7. The speech signal processing apparatus according to claim 1,

wherein the control calculation unit performs a process of estimating a direction or distance between the sound receiving point and a noise source, and performs a process of storing installation environment information suitable for the estimation result.

8. The speech signal processing apparatus according to claim 7,

wherein the control calculation unit determines whether or not noise of the type of the noise source is present within a predetermined period of time when estimating the direction or the distance between the sound receiving point and the noise source.

9. The speech signal processing apparatus according to claim 1,

wherein the control calculation unit executes a process of storing installation environment information determined based on an image captured by an imaging device.

10. The speech signal processing apparatus according to claim 9,

wherein the control calculation unit performs shape estimation based on the captured image.

11. The speech signal processing apparatus according to claim 1,

wherein the noise suppression unit calculates a gain function using the noise dictionary data acquired from the noise database unit, and performs noise suppression processing using the gain function.

12. The speech signal processing apparatus according to claim 1,

wherein the noise suppression unit calculates a gain function based on noise dictionary data in which a transfer function between a noise source and the sound receiving point is reflected by convolving the transfer function into the noise dictionary data acquired from the noise database unit, and performs noise suppression processing using the gain function.

13. The speech signal processing apparatus according to claim 1,

wherein the noise suppression unit performs gain function interpolation in the frequency direction according to a predetermined condition determination in the noise suppression processing, and performs the noise suppression processing using the interpolated gain function.

14. The speech signal processing apparatus according to claim 1,

wherein the noise suppression unit performs gain function interpolation in a spatial direction according to a predetermined condition determination in the noise suppression processing, and performs the noise suppression processing using the interpolated gain function.

15. The speech signal processing apparatus according to claim 1,

wherein the noise suppression unit performs noise suppression processing using estimation results of a time period in which noise is not included and a time period in which noise is included.

16. The speech signal processing apparatus according to claim 1,

wherein the control calculation unit acquires noise dictionary data from the noise database unit for each frequency band.

17. The speech signal processing apparatus according to claim 2, further comprising

a storage unit configured to store the transfer function database unit.

18. The speech signal processing apparatus according to claim 1, further comprising

a storage unit configured to store the noise database unit.

19. The speech signal processing apparatus according to claim 1,

wherein the control calculation unit acquires the noise dictionary data through communication with an external device.

20. A noise suppression method performed by a speech signal processing apparatus, the noise suppression method comprising:

acquiring noise dictionary data read from a noise database unit based on installation environment information including information on a type of noise and a direction between a sound receiving point and a noise source; and

performing noise suppression processing on a speech signal obtained by a microphone arranged at the sound receiving point using the noise dictionary data.

Technical Field

The present technology relates to a speech signal processing apparatus and a noise suppression method thereof, and particularly relates to the technical field of noise suppression suitable for the environment.

Background

Examples of noise suppression techniques include spectral subtraction, which subtracts an estimated noise spectrum from the observed signal, and gain-function methods, which define a gain function relating the signal before and after noise suppression (for example, a spectral gain derived from the a priori/a posteriori SNR) and multiply the observed signal by the defined gain function.
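To make the two background approaches concrete, the following Python sketch implements a minimal magnitude-domain spectral subtraction and a Wiener-style spectral gain. The function names, the spectral floor, and the gain rule are illustrative assumptions, not taken from the cited documents.

```python
import numpy as np

def spectral_subtraction(observed_mag, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from the observed
    spectrum, clamping to a small spectral floor so bins are not driven
    all the way to zero."""
    out = observed_mag - noise_mag
    return np.maximum(out, floor * observed_mag)

def apply_spectral_gain(observed_mag, noise_mag, eps=1e-12):
    """Wiener-style method: derive a per-bin gain in [0, 1] from an
    estimated a priori SNR and multiply the observed spectrum by it."""
    snr = np.maximum(observed_mag**2 - noise_mag**2, 0.0) / (noise_mag**2 + eps)
    gain = snr / (1.0 + snr)
    return gain * observed_mag
```

Both operate per frequency bin on short-time spectra; a full pipeline would wrap them between an STFT and its inverse.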

Non-patent document 1 described below discloses a noise suppression technique using spectral subtraction. Further, non-patent document 2 described below discloses a technique of a method using a spectral gain.

Documents of the prior art

Non-patent document

Non-patent document 1: Boll, S. F. (1979). "Suppression of Acoustic Noise in Speech Using Spectral Subtraction." IEEE Trans. on Acoustics, Speech and Signal Processing, ASSP-27(2), pp. 113-120.

Non-patent document 2: Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32(6), pp. 1109-1121, Dec. 1984.

Disclosure of Invention

Problems to be solved by the invention

In spectral subtraction, the subtraction can punch holes in the spectrum in individual time-frequency bins (the signal at some time-frequency points becomes 0), which sometimes produces an artificial tonal artifact known as musical noise.

Further, in gain-function methods, since a certain probability density distribution is assumed for the target sound (e.g., voice) and the noise (mainly stationary noise), performance against non-stationary noise is poor, and performance degrades in environments where even stationary noise deviates from the assumed distribution.

Further, in an actual usage environment, neither the target sound nor the noise is a dry source; however, conventional methods cannot effectively reflect in noise suppression the spatial transfer characteristics convolved during propagation or the radiation characteristics of the noise source.

In view of the foregoing, the present technology provides a method that can achieve appropriate noise suppression suitable for the environment.

Problem solving scheme

A speech signal processing apparatus according to the present technology includes: a control calculation unit configured to acquire noise dictionary data read from a noise database unit based on installation environment information including information on a type of noise and a direction between a sound receiving point and a noise source; and a noise suppression unit configured to perform noise suppression processing on a speech signal obtained by a microphone arranged at the sound receiving point using the noise dictionary data.

For example, using a noise database unit that stores characteristics of each type and direction of a noise source, noise dictionary data of noise suitable for at least the type and direction of noise in the installation environment of the voice signal processing apparatus is acquired and used for noise suppression processing (noise reduction).
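As a concrete illustration, the sketch below models a noise database unit as a Python mapping keyed by noise type and azimuth, with a nearest-direction lookup. The keys, spectra, and selection rule are hypothetical stand-ins for the patent's database.

```python
import numpy as np

# Hypothetical noise database unit: keys combine a noise type and an
# azimuth (degrees); values are template magnitude spectra.
noise_database = {
    ("air_conditioner", 0):  np.array([0.8, 0.5, 0.2, 0.1]),
    ("air_conditioner", 90): np.array([0.6, 0.4, 0.3, 0.1]),
    ("refrigerator", 0):     np.array([0.3, 0.7, 0.4, 0.2]),
}

def acquire_noise_dictionary(noise_type, azimuth_deg):
    """Return the dictionary entry whose stored azimuth is closest to the
    azimuth given in the installation environment information."""
    candidates = [(t, az) for (t, az) in noise_database if t == noise_type]
    if not candidates:
        raise KeyError(f"no dictionary data for noise type {noise_type!r}")
    best = min(candidates, key=lambda key: abs(key[1] - azimuth_deg))
    return noise_database[best]
```

For example, `acquire_noise_dictionary("air_conditioner", 80)` would select the entry stored for azimuth 90, the closest available direction.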

Typically, the sound receiving point corresponds to the position of the microphone.

The direction between the sound receiving point and the noise source may be information indicating the azimuth of the noise source as viewed from the sound receiving point, or information indicating the azimuth of the sound receiving point as viewed from the noise source.

In the above-described voice signal processing apparatus according to the present technology, it is considered that the control calculation unit acquires the transfer function between the noise source and the sound receiving point from the transfer function database unit that holds the transfer function between two points under various environments based on the installation environment information, and the noise suppression unit uses the transfer function for the noise suppression processing.

In other words, the spatial transfer function is used for the noise suppression processing in addition to the noise dictionary data of the noise suitable for the noise type and the azimuth.

In the above-described speech signal processing apparatus according to the present technology, it is considered that the installation environment information includes information on a distance from the sound receiving point to the noise source, and the control calculation unit acquires the noise dictionary data from the noise database unit while including the type, the direction, and the distance as parameters.

In other words, noise dictionary data suitable for at least these types, directions, and distances is used for noise suppression.

In the above-described voice signal processing apparatus according to the present technology, it is considered that the installation environment information includes information on an azimuth and an elevation between the sound reception point and the noise source as a direction, and the control calculation unit acquires the noise dictionary data from the noise database unit while including the type, the azimuth, and the elevation as parameters.

The information on the direction is not limited to the direction in a two-dimensional view of the positional relationship between the sound receiving point and the noise source; it is information on a three-dimensional direction that also includes the vertical positional relationship (elevation angle).
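For illustration, the installation environment information carrying these parameters could be modeled as a small record. The class name and fields below are hypothetical, not terms from the patent.

```python
from dataclasses import dataclass

@dataclass
class InstallationEnvironmentInfo:
    """Hypothetical record of one noise source as seen from the sound
    receiving point: noise type, three-dimensional direction, and distance."""
    noise_type: str
    azimuth_deg: float    # horizontal angle from the sound receiving point
    elevation_deg: float  # vertical angle, completing the 3-D direction
    distance_m: float

    def lookup_key(self):
        # Parameters passed when querying the noise database unit.
        return (self.noise_type, self.azimuth_deg,
                self.elevation_deg, self.distance_m)
```

A query such as claim 4 describes would then use `lookup_key()` to retrieve dictionary data matching type, azimuth, and elevation.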

In the above-described voice signal processing apparatus according to the present technology, it is considered that an installation environment information holding unit configured to store installation environment information is included.

According to the installation of the voice signal processing apparatus, information as installation environment information is input in advance.

In the above-described voice signal processing apparatus according to the present technology, it is considered that the control calculation unit performs processing of storing the installation environment information input by the user operation.

For example, in the case where the installation environment information is input by an operation by a person who installs the voice signal processing apparatus, a person who uses the voice signal processing apparatus, or the like, the voice signal processing apparatus may store the installation environment information according to the operation.

In the above-described voice signal processing apparatus according to the present technology, it is considered that the control calculation unit performs a process of estimating a direction or a distance between the sound receiving point and the noise source, and performs a process of storing the installation environment information suitable for the estimation result.

For example, the installation environment information is obtained by performing a process of estimating a direction or a distance between the sound receiving point and the noise source in a state where the voice signal processing apparatus is installed in the use environment.

In the above-described voice signal processing apparatus according to the present technology, it is considered that when the direction or distance between the sound receiving point and the noise source is estimated, the control calculation unit determines whether or not there is noise of the type of the noise source within a predetermined period of time.

For each type of noise source, a period of time during which noise is generated is estimated, and estimation of a direction or a distance is performed in an appropriate period of time.

In the above-described voice signal processing apparatus according to the present technology, it is considered that the control calculation unit performs processing of storing the installation environment information determined based on the image captured by the imaging apparatus.

For example, image capturing is performed by an imaging device in a state where a voice signal processing device is installed in a use environment, and the installation environment is determined by image analysis.

In the above-described voice signal processing apparatus according to the present technology, it is considered that the control calculation unit performs shape estimation based on the captured image.

For example, in a state where the voice signal processing apparatus is installed in a use environment, image capturing is performed by the imaging apparatus to estimate the three-dimensional shape of the installation space.

In the above-described speech signal processing apparatus according to the present technology, it is considered that the noise suppression unit calculates a gain function using the noise dictionary data acquired from the noise database unit, and performs the noise suppression process using the gain function.

The gain function is calculated using the noise dictionary data as a template.

In the above-described speech signal processing apparatus according to the present technology, it is considered that the noise suppression unit calculates a gain function based on noise dictionary data in which a transfer function between the noise source and the sound receiving point has been convolved into the noise dictionary data acquired from the noise database unit, and performs the noise suppression process using the gain function.

In other words, the noise dictionary data is modified so as to reflect the transfer function between the noise source and the sound receiving point.
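Since convolution in the time domain corresponds to a per-bin product in the frequency domain, reflecting the transfer function in the dictionary data can be sketched as a spectrum multiplication. The helper below also verifies that equivalence for a circular convolution; it is a sketch, not the patent's exact computation.

```python
import numpy as np

def reflect_transfer_function(dict_spectrum, transfer_function):
    """Reflect the noise-source-to-receiving-point transfer function in the
    dictionary data: time-domain convolution equals a per-bin product of the
    dictionary spectrum and the transfer function."""
    return dict_spectrum * transfer_function

def circular_convolve(noise, impulse_response):
    """Time-domain circular convolution via the FFT, for comparison."""
    return np.real(np.fft.ifft(np.fft.fft(noise) * np.fft.fft(impulse_response)))
```

Transforming the convolved time-domain signal gives the same spectrum as multiplying the two transforms directly, which is why the database can store spectra and apply transfer functions cheaply.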

In the above-described speech signal processing apparatus according to the present technology, it is considered that the noise suppression unit determines to perform gain function interpolation in the frequency direction according to a predetermined condition in the noise suppression processing, and performs the noise suppression processing using the interpolated gain function.

For example, in the case where a gain function is obtained for each frequency bin, interpolation is performed in the frequency direction.
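A minimal sketch of frequency-direction interpolation: gain values in bins flagged as unreliable are replaced by linear interpolation from neighbouring reliable bins. The trigger condition and the use of linear interpolation are assumptions; the patent leaves the predetermined condition unspecified here.

```python
import numpy as np

def interpolate_gain(gain, unreliable):
    """Linearly interpolate the gain function in the frequency direction
    across bins marked unreliable (a sketch; the condition that marks a
    bin unreliable is application-defined)."""
    bins = np.arange(len(gain))
    reliable = ~unreliable
    return np.interp(bins, bins[reliable], gain[reliable])
```

For instance, a gain of `[1.0, ?, 0.5]` with the middle bin flagged would be smoothed to `0.75` in that bin.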

In the above-described speech signal processing apparatus according to the present technology, it is considered that the noise suppression unit determines to perform gain function interpolation in the spatial direction according to a predetermined condition in the noise suppression processing, and performs the noise suppression processing using the interpolated gain function.

For example, in the case where gain functions are obtained at a plurality of sound receiving points due to a plurality of microphones or the like, interpolation is performed in the spatial direction.

In the above-described speech signal processing apparatus according to the present technology, it is considered that the noise suppression unit performs the noise suppression process using the estimation results of the period in which the noise is not included and the period in which the noise is included.

For example, a signal-to-noise ratio (SNR) is obtained from the estimation of time periods with and without noise, and the SNR is reflected in the gain function calculation.
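A simplified sketch of that SNR estimate: noise power is measured over samples estimated to be noise-only, target power is inferred from the remaining (noise-plus-target) samples, and the ratio is returned in decibels. The power-subtraction rule is an assumption for illustration.

```python
import numpy as np

def estimate_snr(signal, noise_only_mask, eps=1e-12):
    """Estimate SNR (dB) from time periods estimated to contain only noise
    versus periods that also contain the target sound (a sketch)."""
    noise_power = np.mean(signal[noise_only_mask] ** 2)
    mixed_power = np.mean(signal[~noise_only_mask] ** 2)
    target_power = max(mixed_power - noise_power, 0.0)
    return 10.0 * np.log10((target_power + eps) / (noise_power + eps))
```

The resulting SNR can then feed the a priori SNR term of a gain function such as the Wiener rule.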

In the above-described speech signal processing apparatus according to the present technology, it is considered that the control calculating unit acquires noise dictionary data for each frequency band from the noise database unit.

In other words, the noise dictionary data for each frequency bin is obtained from the noise database unit.

In the above-described speech signal processing apparatus according to the present technology, it is considered to include a storage unit configured to store a transfer function database unit.

In other words, the transfer function database unit is stored into the speech signal processing apparatus.

In the above-described speech signal processing apparatus according to the present technology, it is considered to include a storage unit configured to store a noise database unit.

In other words, the noise database unit is stored into the voice signal processing apparatus.

In the above-described speech signal processing apparatus according to the present technology, it is considered that the control calculation unit acquires the noise dictionary data by communication with an external apparatus.

In other words, the noise database unit is not stored in the speech signal processing apparatus.

The noise suppression method according to the present technology includes: the noise dictionary data read from the noise database unit is acquired based on installation environment information including information on the type of noise and the direction between the sound receiving point and the noise source, and noise suppression processing is performed on a speech signal obtained by a microphone arranged at the sound receiving point using the noise dictionary data.

Thus, noise suppression suitable for the environment is realized.

Drawings

Fig. 1 is a block diagram of a speech signal processing apparatus according to an embodiment of the present technology.

Fig. 2 is a block diagram of a voice signal processing apparatus and an external apparatus according to an embodiment.

Fig. 3 is an explanatory diagram of a function of a control calculation unit and a storage function according to the embodiment.

Fig. 4 is an explanatory diagram of noise section estimation according to the embodiment.

Fig. 5 is a block diagram of an NR unit according to an embodiment.

Fig. 6 is an explanatory diagram of a noise suppressing operation according to the first embodiment.

Fig. 7 is an explanatory diagram of a noise suppressing operation according to the second embodiment.

Fig. 8 is an explanatory diagram of a noise suppressing operation according to the third embodiment.

Fig. 9 is an explanatory diagram of a noise suppressing operation according to the fourth embodiment.

Fig. 10 is an explanatory diagram of a noise suppressing operation according to the fifth embodiment.

Fig. 11 is a flow diagram of a process of noise database construction according to an embodiment.

Fig. 12 is an explanatory diagram of acquiring noise dictionary data according to the embodiment.

Fig. 13 is a flow diagram of a preliminary measurement/input process according to an embodiment.

Fig. 14 is a flowchart of a process performed when using a device according to an embodiment.

Fig. 15 is a flowchart of a process performed by the NR unit according to the embodiment.

Detailed Description

Hereinafter, the embodiments will be described in the following order.

<1. configuration of speech Signal processing apparatus >

<2. operation of the first to fifth embodiments >

<3. noise database construction step >

<4. preliminary measurement/input processing >

<5. processing performed when using the apparatus >

<6. noise reduction treatment >

<7. conclusions and variants >

<1. configuration of speech Signal processing apparatus >

The speech signal processing apparatus 1 of the embodiment is an apparatus that performs speech signal processing serving as noise suppression (NR: noise reduction) on a speech signal input by a microphone.

Such a voice signal processing apparatus 1 may be configured as a stand-alone apparatus, may be connected with another apparatus, or may be built in various electronic apparatuses.

Actually, the voice signal processing apparatus 1 is configured to be built in or connected to a camera, a television apparatus, an audio apparatus, a recording apparatus, a communication apparatus, a remote presentation apparatus, a voice recognition apparatus, a dialogue apparatus, an agent apparatus for performing voice support, a robot, or various information processing apparatuses.

Fig. 1 shows a configuration of a speech signal processing apparatus 1. The speech signal processing apparatus 1 includes a microphone 2, a Noise Reduction (NR) unit 3, a signal processing unit 4, a control calculation unit 5, a storage unit 6, and an input device 7.

Note that not all of these components are required, and they need not all be housed in a single apparatus. For example, a separate microphone may be connected as the microphone 2, and the input device 7 need only be provided or connected when necessary.

As the speech signal processing apparatus 1 of this embodiment, it is sufficient to provide at least the NR unit 3, which functions as a noise suppression unit, and the control calculation unit 5.

For example, a plurality of microphones 2a, 2b, and 2c are provided as the microphone 2. Note that, for convenience of description, when it is not particularly necessary to indicate the respective microphones 2a, 2b, and 2c, the plurality of microphones 2a, 2b, and 2c are collectively referred to as "microphone 2".

The voice signal collected by the microphone 2 and converted into an electric signal is supplied to the NR unit 3. Note that as shown by the dotted line, the voice signal from the microphone 2 is sometimes supplied to the control calculation unit 5 for analysis.

In the NR unit 3, noise reduction processing is performed on the input speech signal. Details of the noise reduction processing will be described later.

The voice signal subjected to the noise reduction processing is supplied to the signal processing unit 4, and necessary signal processing suitable for the function of the apparatus is performed on the voice signal. For example, recording processing, communication processing, reproduction processing, voice recognition processing, voice analysis processing, and the like are performed on a voice signal.

Note that the signal processing unit 4 may function as an output unit of a voice signal that has been subjected to noise reduction processing, and may transmit the voice signal to an external apparatus.

For example, the control calculation unit 5 is constituted by a microcomputer including a CPU (central processing unit), a ROM (read only memory), a RAM (random access memory), an interface unit, and the like. The control calculation unit 5 performs a process of supplying data (noise dictionary data) to the NR unit 3 in such a manner that noise suppression suitable for the environmental state is performed in the NR unit 3, which will be described later in detail.

The storage unit 6 includes, for example, a nonvolatile storage medium, and stores information necessary for controlling the NR unit 3 executed by the control calculation unit 5. Specifically, information storage serving as a noise database unit, a transfer function database unit, an installation environment information holding unit, and the like, which will be described later, is performed.

The input device 7 indicates a device that inputs information to the control calculation unit 5. For example, a keyboard, a mouse, a touch panel, a pointing device, a remote controller, or the like for a user to perform information input is used as an example of the input device 7.

Further, a microphone, an imaging device (camera), and various sensors are also used as examples of the input device 7.

For example, fig. 1 shows a configuration in which a storage unit 6 is provided in an integrated apparatus to store a noise database unit, a transfer function database unit, an installation environment information holding unit, and the like. Alternatively, a configuration using the external storage unit 6A as shown in fig. 2 is also assumed.

For example, the communication unit 8 is provided in the voice signal processing apparatus 1, and the control calculation unit 5 may communicate with the calculation system 100 serving as a cloud or an external server via the network 10.

In the computing system 100, the control computing unit 5A communicates with the control computing unit 5 via the communication unit 11.

Then, a noise database unit and a transfer function database unit are provided in the storage unit 6A, and information serving as an installation environment information holding unit is stored in the storage unit 6.

In this case, the control calculation unit 5 acquires necessary information (for example, a noise dictionary data unit obtained from a noise database unit, a transfer function obtained from a transfer function database unit, and the like) in communication with the control calculation unit 5A.

For example, the control calculation unit 5 transmits the installation environment information of the voice signal processing apparatus 1 to the control calculation unit 5A. The control calculation unit 5A acquires noise dictionary data suitable for the installation environment information from the noise database unit, and transmits the acquired noise dictionary data to the control calculation unit 5.

Of course, a noise database unit, a transfer function database unit, an installation environment information holding unit, and the like may be provided in the storage unit 6A.

Alternatively, only the information serving as the noise database unit may be stored in the storage unit 6A. This is effective particularly when the data volume of the noise database unit is huge; in such a case, it is preferable to use a storage resource external to the voice signal processing apparatus 1, such as the storage unit 6A.

In the case of the configuration shown in fig. 2 described above, the network 10 only needs to be a transmission path through which the voice signal processing apparatus 1 can communicate with an external information processing apparatus. For example, various configurations are assumed, such as the internet, a Local Area Network (LAN), a Virtual Private Network (VPN), an intranet, an extranet, a satellite communication network, a CATV (community antenna television) communication network, a telephone line network, and a mobile communication network.

Hereinafter, description will be continued assuming the configuration shown in fig. 1, but the following description may be applied to the configuration shown in fig. 2.

Functions included in the control calculation unit 5 and information areas stored in the storage unit 6 are illustrated in A and B in fig. 3. Note that, in the case of the configuration shown in fig. 2, the functions shown in A in fig. 3 need only be distributed between the control calculation units 5 and 5A, and the information areas shown in B in fig. 3 need only be distributed and stored in one or both of the storage units 6 and 6A.

As shown in A in fig. 3, the control calculation unit 5 includes the following functions: a management control unit 51, an installation environment information input unit 52, a noise section estimation unit 53, a noise direction/distance estimation unit 54, and a shape/type estimation unit 55. Note that the control calculation unit 5 need not include all of these functions.

The management control unit 51 represents a function of the control calculation unit 5 that executes various types of basic processing. For example, the management control unit 51 performs writing/reading of information to/from the storage unit 6, communication processing, control processing of the NR unit 3 (provision of noise dictionary data), control of the input device 7, and the like.

The installation environment information input unit 52 represents a function of inputting specification data such as the size and sound absorption characteristics of the installation environment of the voice signal processing apparatus 1, and information such as the type, position, and direction of noise present in the installation environment, and storing the input information as installation environment information.

For example, the installation environment information input unit 52 generates installation environment information based on data input by the user using the input device 7, and stores the generated installation environment information into the storage unit 6.

Alternatively, the installation environment information input unit 52 generates installation environment information by analyzing an image or voice obtained by an imaging device or a microphone serving as the input device 7, and causes the generated installation environment information to be stored into the storage unit 6.

The installation environment information includes, for example, the type of noise, the direction (azimuth, elevation) from the noise source to the sound receiving point, and the distance.

The type of noise is, for example, the type of sound of the noise itself (such as the type of frequency characteristic), the type of noise source, and the like. Noise sources are, for example, household appliances in the installation environment, such as air conditioners, washing machines or refrigerators, stationary ambient noise, etc.

Further, various methods may be used to decompose a noise type into a plurality of patterns. For example, even for the same washing machine, the washing noise and the drying noise differ. In this way, a noise type may be subdivided into a plurality of modes by subcategory.

The noise section estimation unit 53 represents a function of determining whether or not various types of noise exist within a predetermined period of time using voice input from a microphone array including one or more microphones 2 (or another microphone serving as the input device 7).

For example, the noise section estimation unit 53 determines a noise section serving as a time period during which noise to be suppressed occurs, and a target sound existing section serving as a time period during which a target sound such as sound to be recorded exists, as shown in fig. 4.

The noise direction/distance estimation unit 54 indicates a function of estimating the direction and distance of each sound source. For example, the noise direction/distance estimation unit 54 estimates the direction of arrival and the distance of a sound source from signals observed using a speech input from a microphone array including one or more microphones 2 (or other microphones serving as the input device 7). For example, a MUSIC (multiple signal classification) method or the like may be used for such estimation.
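The MUSIC method mentioned above can be sketched as follows. This is a minimal narrowband illustration with a simulated 6-microphone linear array; the array geometry, function name, and parameters are hypothetical and not from the source.

```python
import numpy as np

def music_spectrum(snapshots, mic_positions, freq_hz, n_sources, angles_deg, c=343.0):
    """Narrowband MUSIC pseudo-spectrum for a linear microphone array.

    snapshots: (n_mics, n_frames) complex STFT values at freq_hz.
    mic_positions: (n_mics,) positions along the array axis in metres.
    """
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # spatial covariance
    eigvals, eigvecs = np.linalg.eigh(R)                      # eigenvalues ascending
    En = eigvecs[:, : len(mic_positions) - n_sources]         # noise subspace
    k = 2 * np.pi * freq_hz / c
    spectrum = []
    for a in np.deg2rad(angles_deg):
        sv = np.exp(-1j * k * mic_positions * np.sin(a))      # steering vector
        sv /= np.linalg.norm(sv)
        denom = np.linalg.norm(En.conj().T @ sv) ** 2         # projection on noise subspace
        spectrum.append(1.0 / max(denom, 1e-12))
    return np.asarray(spectrum)

# Simulate one source arriving at 30 degrees on a 6-mic, 5 cm-spaced array.
rng = np.random.default_rng(0)
mics = np.arange(6) * 0.05
f, c = 1000.0, 343.0
k = 2 * np.pi * f / c
sv = np.exp(-1j * k * mics * np.sin(np.deg2rad(30.0)))
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)  # source signal
x = np.outer(sv, s) + 0.01 * (rng.standard_normal((6, 200))
                              + 1j * rng.standard_normal((6, 200)))
angles = np.arange(-90, 91)
p = music_spectrum(x, mics, f, n_sources=1, angles_deg=angles)
print(angles[np.argmax(p)])  # peak near 30
```

In practice the estimation would be run per frequency bin of the microphone-array STFT and combined across bins; distance estimation requires additional cues (e.g. near-field steering vectors) not shown here.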

The shape/type estimation unit 55 represents the following function in a case where an imaging device is used as the input device 7: image data obtained by image capturing performed by the imaging device is input, the three-dimensional shape of the installation space is estimated by analyzing the image data, and the presence, type, position, and the like of a noise source are estimated.

As shown in B in fig. 3, an installation environment information holding unit 61, a noise database unit 62, and a transfer function database unit 63 are provided in the storage unit 6.

The installation environment information holding unit 61 is a database that holds specification data such as the size and sound absorption of the installation environment and information such as the type, position, and direction of noise present in the installation environment. That is, the installation environment information generated by the installation environment information input unit 52 is stored.

The noise database unit 62 is a database that holds statistical attributes of noise for each noise type. In other words, the noise database unit 62 stores, for each sound source type, the directional characteristic, the probability density distribution of the amplitude, and the spatial transfer characteristics for various directions and distances, collected as data in advance.

The noise database unit 62 is configured to be able to read out noise dictionary data using, for example, the type, direction, distance, and the like of the noise source as parameters.

The noise dictionary data is information including the above-described directional characteristic, probability density distribution of the amplitude, and spatial transfer characteristics for various directions and distances of each sound source type.

Note that the directivity of each sound source may be obtained by performing actual measurement or performing acoustic simulation in advance using a dedicated device, and may be represented by a function using the direction as a parameter, for example.

The transfer function database unit 63 is a database that holds transfer functions between arbitrary two points in various environments. For example, the transfer function database unit 63 is a database that stores transfer functions between two points collected as data in advance, or transfer functions generated from shape information by acoustic simulation.

Fig. 5 shows a configuration example of the NR unit 3.

The NR unit 3 performs processing of suppressing the corresponding noise on the voice signal input from the microphone 2 using the statistical characteristics obtained from the noise database unit 62.

For example, the NR unit 3 acquires information on the type of noise in a period determined to include noise from the noise database unit 62, reduces noise from recorded voice, and outputs voice.

As described above, the accuracy/performance of the noise reduction processing can be improved by appropriately transforming (by convolution or the like) the noise statistical information using the noise source statistical information obtained from the noise database unit 62 (a template such as a gain function or mask information), the directional characteristic of the noise source, and the transfer characteristic from the noise source to the sound receiving point obtained from the positional relationship between the two points (for example, convolution in the order of the statistical characteristic/directional characteristic of the noise source, the transfer characteristic, and the microphone (array) directivity).

In the present embodiment, as compared with performing adaptive signal processing/noise reduction processing using only the observation signal as information, the accuracy of noise reduction can be made higher by considering the noise dictionary data (sound source directivity and the like) stored in advance in the database, the signal transformation caused by the transfer characteristic between two points, and the like.
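As a rough illustration of transforming a noise template by the transfer characteristic, assuming hypothetical per-bin magnitude values (multiplication in the frequency domain corresponds to convolution in the time domain):

```python
import numpy as np

# Hypothetical magnitude template of a noise source (per frequency bin) and
# the magnitude of the transfer function from the source to the receiving point.
D = np.array([1.0, 0.8, 0.5, 0.3])   # noise dictionary template |D(k)|
H = np.array([0.9, 0.7, 0.4, 0.2])   # |H(k)| between noise source and sound receiving point

# The transformed template D' reflects how the room shapes the noise
# spectrum on its way to the microphone.
D_prime = D * H
print(D_prime)
```

The real processing would additionally fold in the noise source's directional characteristic and the microphone (array) directivity, in the order described above.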

The NR unit 3 includes a Short Time Fourier Transform (STFT) unit 31, a gain function application unit 32, an Inverse Short Time Fourier Transform (ISTFT) unit 33, an SNR estimation unit 34, and a gain function estimation unit 35.

The speech signal input from the microphone 2 is supplied to the gain function application unit 32, the SNR estimation unit 34, and the gain function estimation unit 35 after being subjected to short-time fourier transform in the STFT unit 31.

The noise section estimation result and the noise dictionary data D (or the noise dictionary data D' considering the transfer function) are input to the SNR estimating unit 34. Then, the a priori SNR and the a posteriori SNR of the voice signal that has undergone the short-time Fourier transform are obtained using the noise section estimation result and the noise dictionary data D.

Using the a priori SNR and the a posteriori SNR, for example, a gain function for each frequency bin is obtained in the gain function estimation unit 35. Note that these types of processing performed by the SNR estimation unit 34 and the gain function estimation unit 35 will be described later.

The obtained gain function is supplied to the gain function application unit 32. The gain function application unit 32 performs noise suppression by, for example, multiplying the voice signal of each frequency bin by a gain function.

The ISTFT unit 33 performs an inverse short-time Fourier transform on the output of the gain function application unit 32, and outputs the result as a voice signal (NR output) on which noise reduction has been performed.
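The STFT → gain application → inverse STFT flow of the NR unit can be sketched roughly as follows, using a simple Wiener-style gain as a stand-in for the patent's SNR-based gain estimation; the function name and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def noise_reduce(x, fs, noise_psd, nperseg=512):
    """STFT -> per-bin gain -> inverse STFT, as in the NR unit sketch.

    noise_psd: estimated noise power per frequency bin, a hypothetical
    stand-in for a lookup of noise dictionary data D.
    """
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    Sxx = np.abs(X) ** 2
    # A posteriori-SNR-like quantity and a simple Wiener-style gain per bin,
    # floored so speech is not fully muted.
    snr_post = Sxx / (noise_psd[:, None] + 1e-12)
    gain = np.maximum(1.0 - 1.0 / np.maximum(snr_post, 1e-12), 0.05)
    _, y = istft(X * gain, fs=fs, nperseg=nperseg)
    return y

fs = 16000
rng = np.random.default_rng(1)
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # target sound
noise = 0.3 * rng.standard_normal(fs)                 # broadband noise
noisy = tone + noise
# Estimate the noise PSD from a noise-only stretch (cf. noise section estimation).
_, _, N = stft(noise, fs=fs, nperseg=512)
noise_psd = np.mean(np.abs(N) ** 2, axis=1)
out = noise_reduce(noisy, fs, noise_psd)
```

The residual error of `out` against the clean tone should be well below the original noise power, illustrating the suppression.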

<2. operation of the first to fifth embodiments >

The voice signal processing apparatus 1 having the above-described configuration performs noise suppression using the radiation characteristic and the transfer characteristic of the noise source in the environment.

For example, noise dictionary data (a probability density function describing the occurrence of the amplitude of a noise source, a time-frequency mask, or the like) having statistical characteristics of each noise source is created, and the noise dictionary data is acquired using the transfer direction from the sound source or the like as a parameter.

Further, by utilizing the direction or spatial transfer characteristic (distance in the simplified case) between the noise source and the sound receiving point (in the present embodiment, the position of the microphone 2), noise suppression can be effectively performed on the recorded sound.

Various sound sources have unique radiation characteristics, and sound is not radiated uniformly in all directions. In view of this, the performance of noise suppression is enhanced by considering the radiation characteristic of noise, or by considering the spatial transfer characteristic indicating the characteristics of reverberation and reflection in the space.

Specifically, in the preliminary measurement performed when the voice signal processing apparatus 1 is installed (or, in a case where the installation location of the apparatus varies, when the location changes), the direction/distance of the noise source, the noise type, the size of the installation environment, and the like are input by the user, or estimation of the noise direction/distance is performed using the microphone array, the imaging apparatus, and the like. Information on the noise type, the azimuth angle, the elevation angle, the distance, and the like is thereby acquired and recorded as the installation environment information.

Next, required noise dictionary data (template) is extracted from the noise database using the installation environment information as a parameter.

Then, the input voice signal from the microphone 2 is subjected to noise reduction using the noise dictionary data.

Hereinafter, specific examples of such system operations are illustrated as operations of the first to fifth embodiments.

Note that the system operation includes two types of processing: the process of preliminary measurement (hereinafter also referred to as the "preliminary measurement/input process") and the actual process performed when the voice signal processing apparatus 1 is used (hereinafter also referred to as the "process performed when the apparatus is used").

In the preliminary measurement/input process, any one of or a combination of input information of a user, a recording signal in a microphone array, an image signal obtained by an imaging device, and the like is used as the input information.

Installation environment information such as the size of a room in which the voice signal processing apparatus 1 is installed, the sound absorption degree based on materials, and the position and type of a noise source is stored in the installation environment information holding unit 61.

In the case where the speech signal processing apparatus 1 is a stationary apparatus, it is assumed that preliminary measurement is performed at the time of installation or the like. Further, in the case where the voice signal processing apparatus 1 is a movable apparatus such as a smart speaker, it is assumed that preliminary measurement is performed when the installation position is changed.

Next, as the processing performed when using the apparatus, the NR unit 3 performs noise suppression on the voice signal from the microphone 2 using statistical information of noise extracted from the noise database, with the stored installation environment information as parameters.

Hereinafter, the processing performed by the control calculation unit 5 and the storage unit 6 will be mainly exemplified as the operation performed using the functions illustrated in a and B in fig. 3.

Fig. 6 shows the operation of the first embodiment.

In the preliminary measurement/input process, input information input by the user is received by the function of the installation environment information input unit 52, and is stored as installation environment information into the installation environment information holding unit 61.

The input information input by the user includes information specifying the direction or distance between the noise source and the microphone 2, information specifying the type of noise, information relating to the size of the installation environment, and the like.

In the processing performed when the apparatus is used, the management control unit 51 acquires the installation environment information (e.g., i, θ, φ, l) and acquires the noise dictionary data D(i, θ, φ, l).

Here, i, θ, φ, and l are as follows.

i: noise type index

θ: azimuth angle of the direction from the noise source to the sound receiving point (the direction of the microphone 2)

φ: elevation angle of the direction from the noise source to the sound receiving point

l: distance from the noise source to the sound receiving point

The management control unit 51 transmits the noise dictionary data D(i, θ, φ, l) to the NR unit 3. The NR unit 3 performs noise reduction processing using the noise dictionary data D(i, θ, φ, l).

By this operation, the NR unit 3 can perform noise reduction processing suitable for the installation environment, particularly the type, direction, and distance of noise.

Note that, in each of the examples of fig. 6 to 10, i, θ, φ, and l are used as an example of the installation environment information, but this is merely an example, and other kinds of installation environment information, such as the size and the sound absorption degree of the installation environment, may also be used as parameters of the noise dictionary data D. In addition, all of i, θ, φ, and l need not always be included, and various combinations of parameters are assumed. For example, only the noise type i and the azimuth angle θ may be used as parameters of the noise dictionary data D.
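The idea of reading noise dictionary data with a variable combination of parameters can be sketched as follows, assuming a simple dict-backed database; the keys, angle values, and templates are hypothetical.

```python
# Hypothetical noise database: tuples of parameters map to magnitude templates.
noise_db = {
    ("air_conditioner", 30, 10, 2.0): [0.9, 0.7, 0.4],  # (i, θ, φ, l) -> template D
    ("air_conditioner", 30): [0.8, 0.6, 0.5],           # coarser (i, θ)-only entry
}

def read_dictionary(i, theta, phi=None, l=None):
    """Look up D with the most specific available parameter combination."""
    if phi is not None and l is not None and (i, theta, phi, l) in noise_db:
        return noise_db[(i, theta, phi, l)]
    return noise_db.get((i, theta))

print(read_dictionary("air_conditioner", 30, 10, 2.0))  # [0.9, 0.7, 0.4]
print(read_dictionary("air_conditioner", 30))           # [0.8, 0.6, 0.5]
```

A real noise database unit would also interpolate between stored grid points when an exact match is absent, as described in the database construction section.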

Fig. 7 shows the operation of the second embodiment.

The preliminary measurement/input process is similar to that in fig. 6.

In the processing performed when the apparatus is used, the management control unit 51 acquires the installation environment information (e.g., i, θ, φ, l) and acquires the noise dictionary data D(i, θ, φ, l). Further, using the installation environment information (i, θ, φ, l) as parameters, the management control unit 51 acquires the transfer function H(i, θ, φ, l) from the transfer function database unit 63.

The management control unit 51 transmits the noise dictionary data D(i, θ, φ, l) and the transfer function H(i, θ, φ, l) to the NR unit 3.

The NR unit 3 performs noise reduction processing using the noise dictionary data D(i, θ, φ, l) and the transfer function H(i, θ, φ, l).

By this operation, the NR unit 3 can perform noise reduction processing that is particularly suitable for the installation environment (such as the type, direction, and distance of noise) and that reflects the transfer function.

Fig. 8 shows the operation of the third embodiment.

In the preliminary measurement/input process, input information input by the user is received by the function of the installation environment information input unit 52, and is stored as installation environment information into the installation environment information holding unit 61.

Further, the voice signal collected by the microphone 2 (or another microphone in the input device 7) is received and analyzed by the function of the noise direction/distance estimation unit 54, and the direction and distance of the noise source are estimated. This information may also be stored as installation environment information in the installation environment information holding unit 61 by the function of the installation environment information input unit 52.

Therefore, even if the user does not perform input, the installation environment information can be stored. Further, even if the user does not perform input, the installation environment information can be updated when the arrangement of the voice signal processing apparatus 1 is changed or the like.

In the processing performed when the apparatus is used, the management control unit 51 acquires the installation environment information (e.g., i, θ, φ, l) and acquires the noise dictionary data D(i, θ, φ, l). The management control unit 51 transmits the noise dictionary data D(i, θ, φ, l) to the NR unit 3.

Further, the noise section estimation unit 53 supplies determination information of the noise section to the NR unit 3.

In the NR unit 3, noise reduction processing using the noise dictionary data D(i, θ, φ, l) is performed for a period determined to include noise.

By this operation, the NR unit 3 can perform noise reduction processing suitable for the installation environment, for example, the type, direction, and distance of noise, limited to the period in which noise occurs.

Note that, as shown in fig. 7, the NR unit 3 may also perform, in a period in which noise is included, noise reduction processing using the noise dictionary data D(i, θ, φ, l) and the transfer function H(i, θ, φ, l).

Fig. 9 shows the operation of the fourth embodiment.

In the preliminary measurement/input process, the user input may be omitted. For example, a voice signal collected by the microphone 2 (or another microphone in the input device 7) is received and analyzed by the function of the noise direction/distance estimation unit 54, and the direction and distance of the noise source are estimated. This information may be stored as installation environment information into the installation environment information holding unit 61 by the function of the installation environment information input unit 52.

Further, in this case, the noise section is determined by the function of the noise section estimation unit 53, and the noise direction/distance estimation unit 54 estimates the direction, the distance, the noise type, the installation environment size, and the like in the period in which noise is generated.

By using the noise section determination information, the estimation accuracy of the noise direction/distance estimation unit 54 can be improved.

The processing performed when using the apparatus is similar to that of the first embodiment shown in fig. 6.

However, as shown in fig. 7, it is also assumed that the transfer function H(i, θ, φ, l) is used, or further, that the noise section determination information obtained by the noise section estimation unit 53 is used as shown in fig. 8.

Fig. 10 shows the operation of the fifth embodiment.

Also in this case, in the preliminary measurement/input process, the user input may be omitted. For example, the shape/type estimation unit 55 performs image analysis on an image signal obtained by performing image capturing by an imaging device in the input device 7, and estimates a direction, a distance, a noise type, an installation environment size, and the like.

Specifically, in the image analysis, the shape/type estimation unit 55 estimates the three-dimensional shape of the installation space, and estimates the presence or absence and the position of the noise source. For example, a home appliance used as a noise source is determined or a three-dimensional spatial shape of a room is determined, and then a distance, a direction, a reflection state, and the like of a voice are recognized.

These pieces of information are stored into the installation environment information holding unit 61 as installation environment information by the function of the installation environment information input unit 52.

By the image analysis, environmental information different from the voice analysis can be input.

Note that, as a combination with the example shown in fig. 8, more accurate or diversified installation environment information can also be obtained by combining the voice analysis of the noise direction/distance estimation unit 54 and the image analysis of the shape/type estimation unit 55.

The processing performed when using the apparatus is similar to that of the first embodiment shown in fig. 6.

Also in this case, as shown in fig. 7, it is assumed that the transfer function H(i, θ, φ, l) is used, or further, that the noise section determination information obtained by the noise section estimation unit 53 is used as shown in fig. 8.

<3. noise database construction step >

In the various embodiments described above, the description has been given assuming that the construction of the noise database unit 62 has been completed in advance. Here, an example of the construction step of the noise database unit 62 will be described.

Fig. 11 shows an example of the construction steps of the noise database unit 62.

For example, the processing in fig. 11 is performed using a noise database construction system including an information processing apparatus and a sound recording system.

Here, the sound recording system refers to equipment and an environment in which various noise sources can be installed and noise can be recorded while changing, for example, the recording position of the microphone with respect to the noise source.

In step S101, basic information input is performed.

For example, the operator inputs, into the noise database construction system, information on the type of noise and on the direction and distance of the measurement position with respect to the front of the noise source.

In this state, in step S102, the operation of the noise source is started. In other words, noise is generated.

In step S103, recording and measurement of noise are started, and recording and measurement are performed within a predetermined time. Then, in step S104, the measurement is completed.

In step S105, determination of additional recording is performed.

For example, by performing a plurality of measurements while changing the type of noise or the position (i.e., direction or distance) of a microphone, noise recording suitable for various installation environments is performed.

That is, steps S101 to S104 are repeatedly performed, while changing the position of the microphone or changing the noise source, as additional recording.

When the required measurements have been completed, the process proceeds to step S106, where statistical parameter calculation is performed by the information processing apparatus of the noise database construction system. In other words, the noise dictionary data D is calculated based on the measured voice data, and the calculated noise dictionary data D is compiled into the database.

As a specific example of measuring/generating the noise dictionary data D by the above-described steps, a generation/acquisition example of the noise dictionary data in consideration of directivity will be described.

For example, the directional characteristic of the noise is obtained using the noise type, frequency, and direction as parameters.

First, an example of generation of the noise dictionary data D will be described.

For each noise type (i), direction (θ, φ), and distance (l), the propagation of sound is obtained by measurement or by acoustic simulation such as the finite-difference time-domain method (FDTD method).

Fig. 12 shows a sphere, and the noise source is arranged at the center of the sphere (denoted by "x" in the figure). Then, by installing microphones at the grid points (intersections of circular arcs) of the sphere and performing measurement, or by performing acoustic simulation on the 3D shape of the noise source, the transfer function y from the central noise source position x to each grid point can be obtained.

Note that, in the case of the measurement as in fig. 12, the distance (l) is equal to the radius of the microphone array including the microphones arranged on the intersection point of the circular arcs (the radius of the sphere).

The above measurement is repeated, and for each noise type i, a dictionary of transfer functions is obtained with a predetermined discrete precision for each azimuth angle θ, elevation angle φ, and distance l.

Then, a DFT (discrete Fourier transform) is performed on the measured transfer characteristic yi(θ, φ, l).

[ equation 1]

Yi(k,θ,φ,l) = Σ_{n=0}^{N−1} yi(n,θ,φ,l) e^(−j2πkn/N)

Note that reference numerals in the formulas are as follows.

i: noise type index

θ: azimuth angle from the noise source to the sound receiving point direction

φ: elevation angle from the noise source to the sound receiving point direction

l: distance from the noise source to the sound receiving point

k: frequency bin index

N: measured impulse response length

Then, the absolute value (amplitude) of the FFT coefficient of each frequency bin is held as the noise dictionary data Di(k, θ, φ, l).

[ formula 2]

Di(k,θ,φ,l)=|Yi(k,θ,φ,l)|

Note that another gain calculation method may be used as long as the method can perform relative comparison for each type, each direction, and each distance.
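The generation step above (equation 1 followed by formula 2) can be sketched as follows for a single hypothetical measured transfer characteristic:

```python
import numpy as np

# Equation 1: DFT of a measured transfer characteristic y, then
# formula 2: per-bin magnitude held as noise dictionary data D.
N = 64                                    # measured impulse response length
n = np.arange(N)
y = 0.5 * np.cos(2 * np.pi * 8 * n / N)   # hypothetical measured transfer characteristic

Y = np.fft.fft(y)                         # Yi(k, θ, φ, l)
D = np.abs(Y)                             # Di(k, θ, φ, l) = |Yi(k, θ, φ, l)|
# A length-N cosine at bin 8 concentrates its energy at k = 8 and k = N - 8.
print(np.argmax(D[: N // 2]))             # -> 8
```

In the full construction step, one such magnitude vector would be stored per combination of noise type i, direction (θ, φ), and distance l.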

Next, an example of acquiring the noise dictionary data D will be described.

Basically, the desired value of Di(k, θ, φ, l) is obtained simply by specifying the noise type (i), the direction (θ, φ), the distance (l), and the frequency (k) as parameters.

In a case where data for the specified direction does not exist in the noise database unit 62, it is conceivable to generate the data by performing linear interpolation, Lagrange interpolation (quadratic interpolation), or the like from the data of surrounding adjacent grid points. For example, in a case where the position of "●" in fig. 12 is the sound receiving point LP for which directivity is desired, interpolation is performed using the data of the grid points HP around the sound receiving point LP indicated by "o".

In a case where data for the specified distance does not exist in the noise database unit 62, it is conceivable to generate the data based on the inverse-square law of distance or the like. Further, similarly to the case of direction, interpolation may be performed from the data of adjacent distances.
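The two interpolation ideas above (interpolating between adjacent grid directions, and scaling by distance) can be sketched as follows. The grid values are hypothetical, and amplitude is assumed to fall off as 1/l so that power follows the inverse-square law.

```python
import numpy as np

# Stored dictionary values on a coarse azimuth grid at a reference distance of 1 m.
grid_theta = np.array([0.0, 30.0, 60.0, 90.0])   # stored azimuths (degrees)
grid_D = np.array([1.0, 0.8, 0.5, 0.4])          # D at those azimuths, l = 1 m

def lookup(theta, l):
    d_at_1m = np.interp(theta, grid_theta, grid_D)  # linear interpolation over direction
    return d_at_1m * (1.0 / l)                      # amplitude falls off as 1/l

print(lookup(45.0, 1.0))   # midpoint of 0.8 and 0.5 -> ~0.65
print(lookup(30.0, 2.0))   # 0.8 halved at twice the distance -> ~0.4
```

Lagrange (quadratic) interpolation or interpolation over a two-dimensional (θ, φ) grid would follow the same pattern with more neighboring grid points.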

It is assumed that NR is performed for each frequency point on the frequency axis using the value of the noise dictionary data D obtained by the above-described method.

Note that, in addition to the combination of the parameters i (noise type), θ (azimuth angle), φ (elevation angle), l (distance), and k (frequency), parameters indicating the surrounding environment, such as the sound absorption degree, may be used.

Further, in a case where the directivity or the frequency characteristics substantially differ, noise sources of the same type may be treated as different types depending on the operation mode or the like, such as a heating mode or a cooling mode of an air conditioner.

<4. preliminary measurement/input processing >

Subsequently, preliminary measurement/input processing performed at the time of device installation will be described.

For example, when the voice signal processing apparatus 1 (a single apparatus or an apparatus including the voice signal processing apparatus 1) is installed for use, measurement and input of information about an installation environment are performed.

Fig. 13 shows processing regarding such measurement and input performed by the control calculation unit 5 mainly using the function of the installation environment information input unit 52.

In step S201, the control calculation unit 5 inputs installation environment information from the input device 7 and the like.

As the input mode, it is assumed that input is performed by a user operation. For example, assume the following inputs, etc.:

Input of information specifying the direction/distance of the noise source relative to the installed apparatus

Input of information specifying the type of noise

Input of the installation environment dimensions, wall materials, reflectance, sound absorption degree, and other information about the room.

Further, as in the above-described third, fourth, and fifth embodiments, input of installation environment information (preliminary measurement) other than user input is also performed. For example, a case where the following information is input is also assumed:

The measured value of the direction or distance of the noise source obtained by the noise direction/distance estimation unit 54

Estimation information obtained by the shape/type estimation unit 55, such as the noise type, direction, distance, or information about the room.

If the control calculation unit 5 (installation environment information input unit 52) acquires such information obtained by user input or automatic measurement, in step S202, the control calculation unit 5 performs the following processing: the installation environment information is generated based on the acquired information, and the generated installation environment information is stored in the installation environment information holding unit 61.

As described above, the installation environment information is stored in the voice signal processing apparatus 1.

<5. processing performed when using the apparatus >

Subsequently, a process performed when the device is used will be described with reference to fig. 14.

This processing is, for example, processing performed after the power of the voice signal processing apparatus 1 is turned on or the operation of the voice signal processing apparatus 1 is started.

In step S301, the control calculation unit 5 checks whether the installation environment information has been stored. In other words, it is checked whether the control calculation unit 5 has stored the installation environment information into the installation environment information holding unit 61 in the above-described process in fig. 13.

If the installation environment information has not been stored, the control calculation unit 5 performs acquisition and storage of the installation environment information by the above processing in fig. 13 in step S302.

In the state where the installation environment information is stored, the process advances to step S303.

In step S303, the control calculation unit 5 acquires the installation environment information from the installation environment information holding unit 61, and supplies necessary information to the NR unit 3. Specifically, the control calculation unit 5 acquires the noise dictionary data D from the noise database unit 62 using the installation environment information, and supplies the noise dictionary data D to the NR unit 3.

Further, in some cases, the control calculation unit 5 acquires the transfer function H between the noise source and the sound receiving point from the transfer function database unit 63 using the installation environment information, and supplies the transfer function H to the NR unit 3.

When such information is supplied to the NR unit 3, in step S304 the NR unit 3 calculates a gain function using the noise dictionary data D (or further using the transfer function H) and performs noise reduction processing.

After that, the noise reduction processing in step S304 is continued by the NR unit 3 until it is determined in step S305 that the operation is ended.

<6. noise reduction treatment >

An example of the noise reduction processing in the NR unit 3 will be described.

In the NR unit 3, by repeatedly executing the processing in fig. 15, a gain function for noise reduction processing performed on the voice signal obtained by the microphone 2 is calculated, and noise reduction processing is performed. The processing to be described below is gain function setting processing performed by the SNR estimating unit 34 and the gain function estimating unit 35 in fig. 5.

In step S401 of fig. 15, the NR unit 3 initializes the microphone index (microphone index = 1).

The microphone index is a number assigned to each of the plurality of microphones 2a, 2b, 2c, and the like. By performing initialization of the microphone index, the microphone with index number 1 (e.g., the microphone 2a) can be used as the first target of the gain function calculation.

In step S402, the NR unit 3 initializes the frequency index (frequency index = 1).

The frequency index is a number assigned to each frequency bin, and by performing initialization of the frequency index, the frequency bin having the index number 1 can be used as a first processing target of gain function calculation.

In steps S403 to S409, for the microphone 2 having the specified microphone index, the gain function of the frequency point specified by the frequency index is acquired and applied.

First, an overview of the flow in steps S403 to S409 will be described, and details of gain function calculation will be described later.

First, in step S403, the NR unit 3 updates the estimated noise power, the a priori SNR, and the a posteriori SNR of the corresponding microphone 2 and frequency bin, using the SNR estimating unit 34 in fig. 5.

The a priori SNR is the SNR of the target sound (e.g., mainly human voice) with respect to the suppression target noise.

The a posteriori SNR is the SNR of the actually observed sound, on which the noise is superimposed, with respect to the suppression target noise.

For example, fig. 5 shows an example in which the noise section estimation result is input to the SNR estimating unit 34. In the SNR estimating unit 34, the noise power and the a posteriori SNR are updated in a period in which the suppression target noise exists, using the noise section estimation result. Although the true power of the target sound cannot be obtained, the a priori SNR can be calculated using an existing method such as the decision guidance (decision-directed) method disclosed in non-patent document 2.

In step S404, the NR unit 3 determines whether the power of the noise other than the target noise at the current frequency is equal to or less than a predetermined value. This determination checks whether the gain function calculation can be performed with a high degree of confidence.

When a positive result is obtained in step S404, in step S406, the NR unit 3 performs gain function calculation using the gain function estimation unit 35.

Then, in step S409, the obtained gain function is transmitted to the gain function application unit 32 as a gain function of the frequency point of the target microphone 2, and applied to noise reduction processing.

Note that when the microphone index is set to 1 and the frequency index is set to 1, the process always advances from step S404 to step S406. This is because interpolation in step S407 or S408, which will be described later, cannot be performed.

When a positive result is not obtained in step S404, in step S405, the NR unit 3 determines whether the power of noise other than the target noise in the vicinity of the corresponding frequency is equal to or smaller than a predetermined value. This determination is a determination as to whether or not interpolation of the gain function on the frequency axis is appropriate.

When a positive result is obtained in step S405, the NR unit 3 performs interpolation calculation of the gain function in step S407. In other words, using the gain function estimation unit 35, the NR unit 3 performs processing of interpolating the gain function of the corresponding frequency bin from neighboring frequencies on the frequency axis, using the directivity dictionary information based on the noise dictionary data D.

Then, in step S409, the obtained gain function is transmitted to the gain function application unit 32 as a gain function of the frequency point of the target microphone 2, and applied to noise reduction processing.

When a positive result is not obtained in step S405, the NR unit 3 performs interpolation calculation of the gain function in step S408. In this case, using the gain function estimation unit 35, the NR unit 3 performs processing of interpolating the gain function of the frequency bin of the target microphone 2 from the gain function of the same frequency index obtained for another microphone 2, using the directivity dictionary information based on the noise dictionary data D.

Then, in step S409, the obtained gain function is transmitted to the gain function application unit 32 as a gain function of the frequency point of the target microphone 2, and applied to noise reduction processing.

Then, in step S410, the NR unit 3 checks whether the above-described processing in steps S403 to S409 has been performed in the entire frequency band, and if the processing has not been completed, the frequency index is incremented and the processing returns to step S403. That is, the NR unit 3 performs a process of similarly obtaining a gain function of the next bin.

In the case where the processing of steps S403 to S409 has been completed in the entire frequency band for one microphone 2, the NR unit 3 checks in step S412 whether or not all the microphones 2 have completed the processing. If the processing has not been completed, in step S413, the NR unit 3 increments the microphone index, and the processing returns to step S402. That is, the processing is sequentially started for the other microphones 2 for each frequency point.

In this way, in fig. 15, for each microphone 2, a gain function is obtained for each frequency bin, and the obtained gain function is applied to the noise reduction processing.

In this case, in the processing of steps S403, S404, and S405, the calculation method of the gain function is selected.

In a case where the process advances to step S406, gain function calculation is performed.

In the case where the process advances to step S407, a gain function is obtained by interpolation in the frequency direction.

In the case where the process advances to step S408, a gain function is obtained by interpolation in the spatial direction.
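The branch selection in steps S404 to S408 can be sketched as follows. This is a minimal illustration in Python/NumPy; the threshold, the neighborhood width, and all names are assumptions for illustration, not part of the embodiment.

```python
import numpy as np

def choose_branch(other_noise_power, k, threshold, neighborhood=2):
    """Mirrors the decisions of steps S404/S405 (names are illustrative).
    other_noise_power[k]: power of noise other than the suppression target
    at frequency bin k of the current microphone."""
    if other_noise_power[k] <= threshold:                 # S404: bin itself is reliable
        return "calculate"                                # -> S406: direct calculation
    lo = max(0, k - neighborhood)
    hi = min(len(other_noise_power), k + neighborhood + 1)
    nearby = np.delete(other_noise_power[lo:hi], k - lo)  # exclude bin k itself
    if nearby.size and nearby.min() <= threshold:         # S405: a reliable neighbor exists
        return "interpolate_frequency"                    # -> S407
    return "interpolate_spatial"                          # -> S408

p = np.array([0.1, 5.0, 0.2, 6.0, 7.0])  # hypothetical per-bin powers
```

For instance, with the hypothetical powers above and a threshold of 1.0, bin 0 is calculated directly, bin 1 falls back to frequency-axis interpolation (its neighbors are clean), and a uniformly contaminated spectrum falls back to spatial interpolation.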

Hereinafter, the processing of the gain function will be described.

The above-described processing in fig. 15 is an example of noise reduction using the noise dictionary data D. In other words, using the dictionary Di(k, θ, φ, l) as a template (i: noise type, k: frequency, θ: azimuth angle, φ: elevation angle, l: distance), a gain function G(k) is calculated for each frequency k. Then, by calculating the estimated noise power using the dictionary, the accuracy of the gain function is enhanced.

However, in step S406, the noise dictionary data D is not used, and in the processing of steps S407 and S408, the noise dictionary data D is used.

Then, when a gain function is obtained, the gain function is applied to each frequency, and a noise-reduced output is obtained. In the case of using a noise reduction method that applies a spectral gain function, X(k) = G(k)Y(k) is obtained, where X(k) represents the noise-reduced voice signal output, G(k) represents the gain function, and Y(k) represents the voice signal input obtained by the microphone 2.
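The per-bin gain application X(k) = G(k)Y(k) reduces to an element-wise multiplication in the STFT domain. A minimal sketch with hypothetical spectrum and gain values:

```python
import numpy as np

def apply_spectral_gain(Y, G):
    """Noise-reduced output X(k) = G(k) * Y(k), applied bin by bin.
    Y: complex STFT spectrum of one frame; G: real per-bin gain."""
    return G * Y

Y = np.array([1.0 + 1.0j, 2.0 + 0.0j, 0.0 + 4.0j])  # hypothetical spectrum
G = np.array([1.0, 0.5, 0.25])                      # hypothetical gains
X = apply_spectral_gain(Y, G)
```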

First, the gain function calculation in step S406 will be described.

The gain function calculation is performed by assuming a specific distribution shape as the probability density distribution of the amplitude (and/or phase) of the target sound, the shape varying according to the type of the target sound or the like.

The updates of the estimated noise power, the a priori SNR and the a posteriori SNR in step S403 are used for gain function calculation.

In the case of the present embodiment, as shown in fig. 5, since the SNR estimating unit 34 acquires information on the noise section estimation result, it is possible to determine a period in which the target sound is not present.

Thus, the noise power σ_N² is estimated using the period in which the target sound does not exist.

The a priori SNR is the SNR of the target sound with respect to the suppression target noise, and is expressed as follows.

[ formula 3]

ξ(λ, k) = σ_S²(λ, k) / σ_N²(λ, k)

Here, the reference numerals in the formulas are as follows.

ξ(λ, k): a priori SNR

λ: time frame index

k: frequency index

σ_S²: target sound power

σ_N²: noise power

In this way, the a priori SNR can be obtained by estimating the noise power σ_N² from a portion that includes only noise and in which the target sound is not present, and calculating the target sound power σ_S².
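As a concrete illustration of the decision guidance (decision-directed) method mentioned above, the a priori SNR can be estimated recursively from the previous frame's clean-speech power estimate and the current a posteriori SNR. A minimal sketch; the smoothing constant α and the function name are assumptions:

```python
import numpy as np

def decision_directed_prior_snr(prev_clean_power, noise_power, post_snr, alpha=0.98):
    """Decision-guidance (decision-directed) estimate of the a priori SNR:
    xi = alpha * |X(lambda-1, k)|^2 / sigma_N^2 + (1 - alpha) * max(gamma - 1, 0).
    alpha ~ 0.98 is a commonly used smoothing constant (an assumption here)."""
    ml_term = np.maximum(post_snr - 1.0, 0.0)  # maximum-likelihood term from gamma
    return alpha * prev_clean_power / noise_power + (1.0 - alpha) * ml_term
```

The first term carries over the previous frame's estimate; the second term tracks the current observation, so the estimate follows the speech envelope while staying smooth.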

Further, the a posteriori SNR is the SNR of the actually observed sound (target sound after noise superposition) with respect to the suppression target noise, and is calculated by obtaining the power of the observation signal (target sound + noise) of each frame. The a posteriori SNR is expressed as follows.

[ formula 4]

γ(λ, k) = R²(λ, k) / σ_N²(λ, k)

Here, the reference numerals in the formulas are as follows.

γ(λ, k): a posteriori SNR

R²: observed signal (target sound + noise) power

Then, a gain function G(λ, k) for suppressing noise is calculated from the above-described a priori SNR and a posteriori SNR. The gain function G(λ, k) is as follows. Note that ν and μ are probability density distribution parameters of the speech amplitude.

[ formula 5]

Here, "u" represents as follows.

[ formula 6]

For example, in step S406 of fig. 15, the gain function is obtained as described above. This case is a case where it is determined in step S404 that the power of noise other than the target noise at the current frequency is equal to or smaller than a predetermined value. For example, this case is a case where there is no sudden noise component or the like for the corresponding microphone 2 and frequency point, and the accuracy of the above gain function (equation 5) is estimated to be high.
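Since formulas 5 and 6 are not reproduced here, the following sketch uses the Wiener gain G = ξ/(1 + ξ), the simplest spectral gain derived from the a priori SNR, purely as an illustration of how a gain function follows from the SNR estimates; it is not the embodiment's parametric formula:

```python
import numpy as np

def wiener_gain(prior_snr):
    """G = xi / (1 + xi): the simplest spectral gain derived from the
    a priori SNR. Illustration only; the embodiment's parametric gain
    (formulas 5 and 6) additionally uses the a posteriori SNR and the
    amplitude-distribution parameters nu and mu."""
    prior_snr = np.asarray(prior_snr, dtype=float)
    return prior_snr / (1.0 + prior_snr)
```

The gain approaches 1 where the target sound dominates (large ξ) and approaches 0 where noise dominates, which is the qualitative behavior any member of this gain family shares.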

However, in reality, in the voice signal obtained by the microphone 2, there is no period in which only noise desired to be removed exists. In other words, dark noise, unstable noise, and the like are always present, and an estimation error of a noise spectrum is generated.

Then, by erroneously determining an interval including the target sound or the unstable noise as a noise interval, an estimation error of the noise spectrum becomes large.

Therefore, for an unreliable frequency band or microphone signal, the gain function is interpolated by using the directional characteristic of the noise source and its frequency characteristic, thereby improving the noise reduction accuracy. This processing corresponds to the processing in step S407 or S408.

First, gain function interpolation on the frequency axis in step S407 will be described.

Note that the microphone index m is set for the calculation target microphone 2. Further, k and k' denote frequency indices. Hereinafter, the microphone 2 having a microphone index of m is described as "microphone m".

Hereinafter, for each microphone m on which noise reduction is performed (azimuth angle θ, elevation angle φ, distance l between the noise source and the microphone 2), the following processes [1], [2], and [3] are performed.

[1] The noise power σ_N² is estimated in a period determined not to include the target sound.

[2] A frequency band k that is unlikely to include another kind of noise or a component of the target sound is obtained.

Using the estimated noise power σ_N², the a priori SNR, the a posteriori SNR, and the gain function g_m(k) are calculated based on each noise reduction method.

[3] A frequency band k' that is likely to include another kind of noise (or target sound) is obtained.

The noise dictionary data D(k′, θ, φ, l) is acquired, and the estimated noise power σ_N² is obtained from the edge band.

When the noise power of the microphone m in the time frame λ of the frequency band k is written as σ_{N,m}²(λ, k), the noise power can be expressed as follows based on the estimated noise power σ_{N,m}²(λ, k′) of the edge band k′ and the noise dictionary data D.

[ formula 7]

σ_{N,m}²(λ, k) = σ_{N,m}²(λ, k′) · D(k, θ, φ, l) / D(k′, θ, φ, l)

Then, the a priori SNR, the a posteriori SNR, and the gain function g_m(k) are calculated from the obtained estimated noise power.

In this way, the gain function can be calculated by proportionally interpolating the ratio of the target sound to the observed sound (target sound + noise), or the ratio of the noise components, between frequencies.
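The proportional calculation described above can be sketched as a dictionary-ratio scaling (formula 7); the array layout and values here are hypothetical:

```python
import numpy as np

def interp_noise_power_freq(sigma2_edge, D_freq, k, k_edge):
    """Formula 7 as a ratio: scale the reliable estimate at edge band k'
    by the dictionary ratio D(k)/D(k') to get the noise power at the
    unreliable bin k.  D_freq holds Di(k, theta, phi, l) sampled along
    frequency for the installed direction/distance (hypothetical layout)."""
    return sigma2_edge * D_freq[k] / D_freq[k_edge]

D_freq = np.array([1.0, 2.0, 4.0])  # illustrative dictionary values per bin
```

With these values, a reliable estimate of 3.0 at bin 0 yields an interpolated noise power of 12.0 at bin 2, since the dictionary says the source radiates four times as much power there.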

Note that it is desirable to update the gain function in such a manner as to achieve consistency between the frequency band in which the gain function has been calculated and the frequency characteristics of noise, rather than updating the gain function independently for each frequency k.

Further, in the frequency band k′ where the reliability of the estimated noise spectrum is low, it is conceivable not to use the estimated noise spectrum, but to calculate the gain function from that of a frequency band having high reliability, using the noise directional characteristic dictionary.

Note that a linear mixture with the estimated noise power in past time frames, using an appropriate time constant, or the like may also be used.

The gain function interpolation in the spatial direction in step S408 proceeds as follows.

In the case where the updating of the gain function of a microphone m′ (azimuth angle θ′, elevation angle φ′, distance l′) has ended, the result is used to calculate the estimated noise power σ_{N,m}², and the gain function g_m(k) is calculated.

The estimated noise power σ_{N,m}²(λ, k) of the microphone m and the estimated noise power σ_{N,m′}²(λ, k) of the microphone m′ are related as follows.

[ formula 8]

σ_{N,m}²(λ, k) = σ_{N,m′}²(λ, k) · D(k, θ, φ, l) / D(k, θ′, φ′, l′)

In other words, in the interpolation in the spatial direction using the other microphone m', the gain function is obtained by proportionally calculating the ratio of the target sound to the observation sound (target sound + noise) or the ratio of the noise components between the microphones.
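The spatial counterpart (formula 8) can be sketched the same way, with hypothetical dictionary values for the two microphone geometries:

```python
import numpy as np

def interp_noise_power_spatial(sigma2_m_prime, D_m, D_m_prime, k):
    """Formula 8 as a ratio: transfer the finished estimate of microphone m'
    to microphone m using the dictionary entries for the two geometries,
    Di(k, theta, phi, l) and Di(k, theta', phi', l') (hypothetical layout)."""
    return sigma2_m_prime * D_m[k] / D_m_prime[k]

# Microphone m' estimated 4.0 at bin 1; m sees the source at a geometry
# where the dictionary predicts half the radiated power (3.0 vs 6.0).
sigma2_m = interp_noise_power_spatial(4.0, np.array([2.0, 3.0]),
                                      np.array([1.0, 6.0]), 1)
```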

Note that a linear mixture with a gain function calculated from the estimated noise spectrum of the actual microphone m may be used.

By performing these interpolations, the performance and efficiency of noise reduction can be made higher.

In other words, it is possible to reduce adverse effects caused by estimation errors of a noise spectrum that actually causes performance deterioration. This is because using the directional characteristic information of the noise source, the other noise power can be accurately estimated from the noise power of the frequency band including a small amount of the target sound and the other noise.

Furthermore, the gain function of another microphone 2 can be quickly calculated from the gain functions to be applied to the observation signals of the microphones 2 existing in a specific direction and a specific distance.

Further, the gain function between the microphones 2 can be made uniform. For example, even if some microphones 2 are mixed with sudden noise such as contact, the noise power and the gain function can be accurately calculated from the estimated noise power and the noise directivity dictionary of another microphone 2.

Note that the processing in fig. 15 shows an example in which interpolation is performed in the frequency direction and in the spatial direction separately; in addition to or instead of this, interpolation may be performed in the frequency direction and the spatial direction in combination.

Subsequently, a case where the transfer function is considered will be described.

The following processes [1], [2], [3], and [4] are performed in consideration of the transfer function between the noise source and the sound receiving point.

[1] The transfer characteristic H(k, θ, φ, l) is acquired.

[2] In calculating the gain function, the transfer characteristic is convolved into the dictionary. When the dictionary into which the transfer function has been convolved is written as Di′(k, θ, φ, l), Di′(k, θ, φ, l) = Di(k, θ, φ, l) · |H(k, θ, φ, l)|, where Di(k, θ, φ, l) is the noise dictionary data and H(k, θ, φ, l) is the transfer function.

[3] A gain function is calculated based on each noise reduction method. In this case, the estimated noise power is updated using not the noise dictionary data Di but the above-described noise dictionary data Di′ into which the transfer characteristic has been convolved, and the gain function is calculated using the noise dictionary data Di′.

[4] The gain function is applied, and a noise-reduced output is obtained.

As described above, the voice signal output X(k) that has undergone noise reduction processing is expressed as X(k) = G(k)Y(k). In this case, the gain function G(k) is calculated from the noise dictionary data Di′(k, θ, φ, l).
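The weighting of the dictionary by the transfer characteristic, Di′(k) = Di(k)·|H(k)|, reduces to a per-bin magnitude multiplication. A sketch with hypothetical values:

```python
import numpy as np

def dictionary_with_transfer(D, H):
    """Di'(k) = Di(k) * |H(k)|: fold the spatial transfer characteristic
    into the noise dictionary before the gain calculation.  D and H are
    sampled per frequency bin for the fixed (theta, phi, l) geometry."""
    return D * np.abs(H)

D = np.array([1.0, 2.0])                 # hypothetical dictionary values
H = np.array([0.6 + 0.8j, 1.0 + 0.0j])   # |0.6 + 0.8j| = 1.0
D_prime = dictionary_with_transfer(D, H)
```

Only the magnitude of H enters the dictionary, matching the |H| in the expression above; the phase of the transfer function does not affect the power template.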

Note that, as the transfer function, it is conceivable to use a transfer function H(ω, θ, l) obtained by simplifying the transfer function from the noise source to the sound receiving point (microphone 2) by distance, or to use a transfer function H(x1, y1, z1, x2, y2, z2) that specifies the positions of the noise source and the sound receiving point by coordinates.

In other words, the transfer function H is represented by a function having the positions (three-dimensional coordinates) of the noise source and the sound receiving point in a certain space as parameters.

Further, by appropriately dispersing the coordinates, the transfer function H can be recorded as data.

Further, the transfer function H may be recorded as a function or data that simplifies the distance between two points.

<7. conclusions and variants >

According to the above embodiment, the following effects are obtained.

The speech signal processing apparatus 1 of the embodiment includes a control calculation unit 5 that acquires noise dictionary data D read out from the noise database unit 62 based on installation environment information including information on the type of noise and the direction between a sound receiving point (the position of the microphone 2 in the case of the present embodiment) and a noise source, and an NR unit 3 (noise suppression unit) that performs noise suppression processing on a speech signal obtained by the microphone 2 disposed at the sound receiving point using the noise dictionary data D.

By using noise dictionary data adapted at least to the type i of the noise and the direction (θ, φ) between the sound receiving point of the microphone 2 and the noise source, the NR unit 3 can effectively perform noise suppression on the voice signal from the microphone 2. This is because each sound source has unique radiation characteristics and does not radiate sound uniformly in all directions; by considering radiation characteristics suited to the type i and direction (θ, φ) of the noise, the performance of noise suppression can be improved.

For example, in the case where an acoustic device for remote presentation, a television, or the like is permanently installed and operated in a real space, the distance and direction between a noise source and a sound receiving point (e.g., the microphone 2) are generally fixed. For example, a television set is rarely moved once installed, and the position of a microphone mounted on the television set with respect to an air conditioner or the like is given as a specific example. Further, the case of the fixed position also includes a case where it is desired to remove the sound of a person sitting at a table or the like from the recorded sound. Particularly in these cases, it is possible to suppress noise sources by effectively utilizing the direction information and the spatial transfer characteristic between two points in the setting space, thereby improving the quality of the recorded sound.

On the other hand, in the case of a movably installed device such as a smart speaker, when the installation position varies within the same installation environment, it is necessary to re-estimate the direction and distance of the noise source. A configuration is also conceivable in which optimum noise suppression is performed using a combination of the sound source type/direction information and the spatial transfer characteristics between two points obtained in advance.

At this time, as long as the installation environment itself remains unchanged, dynamic direction/distance estimation can also be performed accurately using the 3D shape and size data of the installation environment and the direction/distance information of fixed sound sources obtained in advance.

Note that, for noise arriving from a definite direction, noise suppression can also be performed by beamforming using a plurality of microphones, but a sufficient effect may not be obtained depending on the reverberation characteristics of the environment. Further, the target sound sometimes deteriorates depending on the noise direction and the target sound direction. Combining beamforming with the technique of the present embodiment is therefore effective.

In the second embodiment, a description has been given of an example in which the control calculation unit 5 acquires a transfer function between a noise source and a sound receiving point from the transfer function database unit 63, which holds transfer functions between two points under various environments, based on the installation environment information, and the NR unit 3 uses the transfer function for noise suppression processing.

By taking into account not only the type i and direction (θ, φ) of the noise but also a spatial transfer characteristic (transfer function H) representing reverberation and reflection characteristics in the space, the performance of noise suppression can be improved.

In this embodiment, a description has been given of an example in which the installation environment information includes information on the distance l from the sound receiving point to the noise source, and the control calculation unit 5 acquires the noise dictionary data D from the noise database unit 62 with the type i, the direction (θ, φ), and the distance l as parameters.

The installation environment information includes the type i of the noise, the direction (θ, φ) from the sound receiving point to the noise source, and the distance l, and noise dictionary data adapted at least to the type i, the direction (θ, φ), and the distance l is stored in the noise database unit 62. It is thereby possible to identify noise dictionary data appropriate for the type i, the direction (θ, φ), and the distance l.

Then, by also reflecting the distance l between the noise source and the sound receiving point, the attenuation of the noise level based on the distance l can also be reflected. This may further enhance the performance of noise suppression.

In the embodiments, a description has been given of an example in which the installation environment information includes, as the direction, information on the azimuth angle θ and the elevation angle φ between the sound receiving point and the noise source, and the control calculation unit 5 acquires the noise dictionary data D from the noise database unit 62 with the type i, the azimuth angle θ, and the elevation angle φ as parameters.

In other words, the information on the direction is not information on the direction when the positional relationship between the sound receiving point and the noise source is viewed from a two-dimensional perspective, but information on a three-dimensional direction including the positional relationship (elevation angle) in the up-down direction.

The installation environment information includes the type i of the noise, the azimuth angle θ, the elevation angle φ, and the distance l from the sound receiving point to the noise source, and noise dictionary data adapted at least to the type i, the azimuth angle θ, the elevation angle φ, and the distance l is stored in the noise database unit 62.

By reflecting the azimuth angle θ and the elevation angle φ as the direction between the noise source and the sound receiving point, noise suppression can be performed in consideration of the characteristics of the noise based on a more precise direction in three-dimensional space, and the noise suppression performance can be improved.

In the embodiment, a description has been given of an example including the installation-environment-information holding unit 61 that stores the installation environment information (refer to B in fig. 3, fig. 13, and fig. 14).

For example, information input in advance as installation environment information is stored according to the installation of the voice signal processing apparatus. By acquiring the installation environment information in advance according to the actual installation environment, the noise dictionary data can be appropriately obtained at the time of actual operation of the NR unit 3.

In the first and second embodiments, a description has been given of an example in which the control calculation unit 5 performs processing of storing the installation environment information input by the user operation (refer to fig. 13).

In the case where the user inputs the installation environment information in advance using the function of the installation environment information input unit 52 according to the actual installation environment, the control calculation unit 5 acquires the installation environment information and stores it into the installation environment information holding unit 61. Thus, noise dictionary data D suitable for the installation environment specified by the user can be obtained from the noise database unit 62 at the time of the actual operation of the NR unit 3.

In the third and fourth embodiments, a description has been given of an example in which the control calculation unit 5 performs a process of estimating the direction or distance between the sound receiving point and the noise source, and performs a process of storing installation environment information suitable for the estimation result.

The control calculation unit 5 estimates the direction or distance between the sound receiving point and the noise source in advance from the actual installation environment using the function of the noise direction/distance estimation unit 54, and stores the estimation result as installation environment information into the installation environment information holding unit 61. Therefore, even if the user does not input the installation environment information, the noise dictionary data D suitable for the installation environment can be obtained from the noise database unit 62 at the time of the actual operation of the NR unit 3.

Further, when the installation position or the like is moved, the user is not required to re-input the installation environment information, and the installation environment information can also be updated to new installation environment information based on the estimation of the direction or distance.

In the fourth embodiment, a description has been given of an example in which, when estimating the direction or distance between the sound receiving point and the noise source, the control calculation unit 5 determines whether or not there is noise of the noise source type within a predetermined period of time.

The direction or distance between the sound receiving point and the noise source can thereby be accurately estimated.

In the fifth embodiment, a description has been given of an example in which the control calculation unit 5 performs processing of storing the installation environment information determined based on the image captured by the imaging device.

For example, in a state where the voice signal processing apparatus 1 is installed in a use environment, image capturing is performed by an imaging apparatus serving as the input apparatus 7. The control calculation unit 5 analyzes an image captured in the actual installation environment using the function of the shape/type estimation unit 55, and estimates the type, direction, distance, and the like of a noise source. The estimation result is stored as installation environment information into the installation environment information holding unit 61. Therefore, even if the user does not input the installation environment information, the noise dictionary data D suitable for the installation environment can be obtained from the noise database unit 62 at the time of the actual operation of the NR unit 3.

Further, when the installation location or the like is moved, the installation environment information can be updated to new installation environment information based on the analysis of the captured image without requiring the user to newly input the installation environment information.

In the fifth embodiment, a description has been given of an example in which the control calculation unit 5 performs shape estimation based on a captured image. For example, in a state where the voice signal processing apparatus 1 is installed in a use environment, image capturing is performed by an imaging apparatus to estimate a three-dimensional shape of an installation space.

Using the function of the shape/type estimation unit 55, the control calculation unit 5 can analyze the image captured in the actual installation environment, estimate the three-dimensional shape, and estimate the presence or absence and the position of a noise source. The estimation result is stored as installation environment information in the installation environment information holding unit 61, whereby the installation environment information can be acquired automatically. For example, a home appliance serving as a noise source can be identified, and the distance, direction, voice reflection conditions, and the like can be accurately recognized from the spatial shape.

The NR unit 3 of the embodiment calculates a gain function using the noise dictionary data D acquired from the noise database unit 62, and performs noise reduction processing (noise suppression processing) using the gain function.

It is thereby possible to obtain a gain function suitable for the environment information and perform noise suppression processing suitable for the environment.

Further, a description has been given of an example in which the NR unit 3 of the embodiment calculates a gain function based on noise dictionary data D′ reflecting the transfer function H, and performs noise suppression processing using the gain function. The noise dictionary data D′ is obtained by convolving the transfer function between the noise source and the sound receiving point into the noise dictionary data D acquired from the noise database unit 62.

In other words, the noise dictionary data D is transformed into a state in which the transfer function H is reflected. Thereby, a gain function that takes the transfer function between the noise source and the sound receiving point into account can be obtained, and the noise suppression performance can be enhanced.

As described above with reference to fig. 15, a description has been given of an example in which, in the noise reduction processing, the NR unit 3 of the embodiment determines (step S404 or S405) to perform gain function interpolation in the frequency direction (step S407) according to a predetermined condition, and performs noise suppression processing using the interpolated gain function (step S409).

For example, when, in a certain frequency bin, the power of noise other than the target noise is large due to sudden noise or the like, it is assumed that the gain function for removing the target noise in that frequency bin cannot be appropriately calculated. Therefore, the state of the adjacent frequency bins is determined, and if the power of noise other than the target noise in the adjacent bins is not large, interpolation is performed using the gain coefficients in those bins. In particular, by using the noise dictionary data, appropriate interpolation can be performed by simple calculation. This improves the noise suppression performance, reduces the processing load, and accordingly increases the processing speed.

Further, in the processing example of fig. 15, the NR unit 3 determines (step S404 or S405) to perform gain function interpolation in the spatial direction according to a predetermined condition (step S408), and performs noise suppression processing using the interpolated gain function (step S409).

For example, the gain coefficient may be calculated by performing interpolation of the gain function in the spatial direction while reflecting the difference in the azimuth angle θ between the microphones 2. In particular, by using the noise dictionary data, appropriate interpolation can be performed by simple calculation. Thereby improving the noise suppression performance, reducing the processing load, and accordingly increasing the processing speed.

In particular, as shown in the flow in fig. 15, in the case where the power of noise other than the target noise is large in the frequency point in which the gain coefficient calculation is being performed or in the frequency point in the vicinity thereof, by applying the gain function interpolation in the spatial direction, an appropriate gain function can be obtained even when the interpolation in the frequency direction is inappropriate.

A description has been given of an example in which the NR unit 3 of the embodiment performs noise suppression processing using the estimation results of the period of time not including noise and the period of time including noise (refer to fig. 5).

For example, the a priori SNR and the a posteriori SNR are obtained from an estimation of the presence or absence of noise as a time period, and are reflected in the gain function calculation.

Therefore, the noise power can be appropriately estimated, and appropriate gain function calculation can be performed.
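The following sketch shows a standard decision-directed formulation of the a priori and a posteriori SNR feeding a Wiener gain; it is given only to illustrate how the noise-period estimate can be reflected in the gain function calculation, and is not necessarily the exact computation of the embodiment.

```python
# A standard decision-directed SNR estimate and Wiener gain, shown only
# to illustrate the a priori / a posteriori SNR flow; not the patent's
# exact formulation.
import numpy as np

def wiener_gain(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    """One frame of per-bin gain. noise_power would come from the
    estimated noise-only time periods."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.maximum(np.asarray(noise_power, dtype=float), 1e-12)
    post_snr = noisy_power / noise_power                 # a posteriori SNR
    inst = np.maximum(post_snr - 1.0, 0.0)               # instantaneous estimate
    prio_snr = (alpha * np.asarray(prev_clean_power, dtype=float) / noise_power
                + (1.0 - alpha) * inst)                  # a priori SNR (decision-directed)
    return prio_snr / (1.0 + prio_snr)                   # Wiener gain in [0, 1)
```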

Description has been given of an example in which the control calculation unit 5 of the embodiment acquires noise dictionary data from the noise database unit for each frequency band.

In other words, as described above with reference to fig. 15, information suitable for the installation environment (all or part of the type i, azimuth angle θ, elevation angle, and distance l) is acquired for each frequency bin, and a gain function is obtained. Accordingly, the noise suppression processing can be performed using an appropriate gain function for each frequency bin.
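A minimal sketch of such a per-frequency-band lookup might look as follows; the key layout (band, type, azimuth, elevation, distance) and exact-match retrieval are simplifying assumptions made here for illustration.

```python
# Hypothetical per-frequency-band lookup of noise dictionary data keyed by
# (type, azimuth, elevation, distance); the key layout is an assumption.
def lookup_noise_dictionary(database, env, num_bands):
    """For each band, return the dictionary entry matching the
    installation environment (exact match here for simplicity; a real
    system would pick the nearest stored parameters)."""
    key = (env["type"], env["azimuth"], env["elevation"], env["distance"])
    return [database[(band,) + key] for band in range(num_bands)]
```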

In the embodiment, a description has been given of an example in which the storage unit 6 storing the transfer function database unit 63 is included (refer to B in fig. 3).

Thus, the speech signal processing apparatus 1 can appropriately obtain the transfer function H on its own at the time of actual operation of the NR unit 3.

In the embodiment, a description has been given of an example in which the storage unit 6 storing the noise database unit 62 is included (refer to B in fig. 3).

The speech signal processing apparatus 1 can thereby obtain the noise dictionary data D on its own as appropriate at the time of actual operation of the NR unit 3.

As an embodiment, as shown in fig. 2, a configuration is exemplified in which the control calculation unit 5 acquires the noise dictionary data D by communicating with an external device.

In other words, for example, the noise database unit 62 is stored not in the speech signal processing apparatus but in the cloud or the like, and the noise dictionary data D is acquired by communication.

This can reduce the burden of memory capacity on the speech signal processing apparatus 1. In particular, the data amount of the noise database unit 62 sometimes becomes huge, and in this case, handling becomes easier by using an external resource such as the storage unit 6A in fig. 2. Further, as the data amount of the noise dictionary data D becomes large, noise dictionary data suitable for a wider variety of environments can be stored. That is, by storing the noise database unit 62 in an external resource and having each speech signal processing apparatus 1 acquire the noise dictionary data D by communication, noise dictionary data D more suitable for each speech signal processing apparatus 1 can be acquired. This can further enhance the noise suppression performance.
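A possible shape for such communication-based acquisition is sketched below, with the transport injected as a callable and a small local cache so the apparatus need not hold the whole database; the class and its interface are hypothetical, not part of the embodiment.

```python
# Hypothetical client for fetching noise dictionary data D from an
# external resource (e.g. cloud storage); the interface is an assumption.
class NoiseDictionaryClient:
    def __init__(self, transport):
        self._transport = transport   # callable: key -> dictionary data
        self._cache = {}              # avoids repeated remote round trips

    def get(self, noise_type, azimuth, elevation, distance):
        key = (noise_type, azimuth, elevation, distance)
        if key not in self._cache:
            self._cache[key] = self._transport(key)  # remote fetch
        return self._cache[key]
```

Injecting the transport keeps the apparatus-side logic independent of the actual communication protocol.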

Note that, for similar reasons, it is also preferable to store the transfer function database unit 63 in an external resource similar to the storage unit 6A.

Further, an external resource such as the storage unit 6A may also be given the function of the installation environment information holding unit 61 for each speech signal processing apparatus 1, so that the hardware load on the speech signal processing apparatus 1 can be reduced.

Note that the effects described in this specification are merely exemplary and not limiting, and other effects may be caused.

Note that the present technology may also adopt the following configuration.

(1) A speech signal processing apparatus comprising:

a control calculation unit configured to acquire noise dictionary data read from the noise database unit based on installation environment information including information on a type of noise and a direction between a sound receiving point and a noise source; and

a noise suppression unit configured to perform noise suppression processing on a voice signal obtained by a microphone arranged at the sound receiving point using the noise dictionary data.

(2) The speech signal processing apparatus according to the above (1),

wherein the control calculation unit acquires a transfer function between the noise source and the sound receiving point from a transfer function database unit that holds transfer functions between two points under various environments based on the installation environment information, and

The noise suppression unit uses the transfer function for noise suppression processing.

(3) The speech signal processing apparatus according to the above (1) or (2),

wherein the installation environment information includes information on a distance from the sound receiving point to the noise source, and

The control calculation unit acquires noise dictionary data from the noise database unit while including the type, direction, and distance as parameters.

(4) The speech signal processing apparatus according to any one of the above (1) to (3),

wherein the installation environment information includes information on an azimuth angle and an elevation angle between the sound receiving point and the noise source as the direction, and

The control calculation unit acquires the noise dictionary data from the noise database unit while including the type, azimuth angle, and elevation angle as parameters.

(5) The voice signal processing apparatus according to any one of the above (1) to (4), further comprising an installation environment information holding unit configured to store the installation environment information.

(6) The speech signal processing apparatus according to any one of the above (1) to (5),

wherein the control calculation unit executes processing of storing the installation environment information input by the user operation.

(7) The speech signal processing apparatus according to any one of the above (1) to (6),

wherein the control calculation unit performs a process of estimating a direction or distance between the sound receiving point and the noise source, and performs a process of storing installation environment information suitable for the estimation result.

(8) The speech signal processing apparatus according to the above (7),

wherein the control calculation unit determines whether or not there is noise of the type of the noise source within a predetermined period of time when estimating the direction or distance between the sound receiving point and the noise source.

(9) The speech signal processing apparatus according to any one of the above (1) to (8),

wherein the control calculation unit executes a process of storing the installation environment information determined based on the image captured by the imaging device.

(10) The speech signal processing apparatus according to the above (9),

wherein the control calculation unit performs shape estimation based on the captured image.

(11) The speech signal processing apparatus according to any one of the above (1) to (10),

wherein the noise suppression unit calculates a gain function using the noise dictionary data acquired from the noise database unit, and performs the noise suppression process using the gain function.

(12) The speech signal processing apparatus according to any one of the above (1) to (11),

wherein the noise suppression unit calculates a gain function based on transfer-function-reflected noise dictionary data, obtained by convolving the transfer function between the noise source and the sound receiving point into the noise dictionary data acquired from the noise database unit, and performs the noise suppression process using the gain function.

(13) The speech signal processing apparatus according to any one of the above (1) to (12),

wherein the noise suppression unit determines to perform gain function interpolation in the frequency direction according to a predetermined condition in the noise suppression processing, and performs the noise suppression processing using the interpolated gain function.

(14) The speech signal processing apparatus according to any one of the above (1) to (13),

wherein the noise suppression unit determines to perform gain function interpolation in the spatial direction according to a predetermined condition in the noise suppression processing, and performs the noise suppression processing using the interpolated gain function.

(15) The speech signal processing apparatus according to any one of the above (1) to (14),

wherein the noise suppression unit performs the noise suppression process using the estimation results of the period not including the noise and the period including the noise.

(16) The speech signal processing apparatus according to any one of the above (1) to (15),

wherein the control calculation unit acquires the noise dictionary data from the noise database unit for each frequency band.

(17) The speech signal processing apparatus according to the above (2), further comprising

A storage unit configured to store the transfer function database unit.

(18) The speech signal processing apparatus according to any one of the above (1) to (17), further comprising

A storage unit configured to store the noise database unit.

(19) The speech signal processing apparatus according to any one of the above (1) to (17),

wherein the control calculation unit acquires the noise dictionary data through communication with an external device.

(20) A noise suppression method performed by a speech signal processing apparatus, the noise suppression method comprising:

acquiring noise dictionary data read from a noise database unit based on installation environment information including information on a type of noise and a direction between a sound receiving point and a noise source; and

performing noise suppression processing on a speech signal obtained by a microphone arranged at the sound receiving point using the noise dictionary data.

Description of the symbols

1 speech signal processing device

2 microphone

3 NR Unit

4 Signal processing unit

5, 5A control calculation unit

6, 6A storage unit

7 input device

51 management control unit

52 installation environment information input unit

53 noise interval estimation unit

54 noise direction/distance estimation unit

55 shape/type estimation unit

61 install environment information holding unit

62 noise database unit

63 transfer function database unit
