Speech enhancement method, method for training a neural network, and related apparatus

Document No.: 116995  Publication date: 2021-10-19

Note: This application, "A speech enhancement method, a method for training a neural network, and related equipment", was created by Wang Wupeng (王午芃), Xing Chao (邢超), Chen Xiao (陈晓) and Sun Fengyu (孙凤宇) on 2020-04-10. Abstract: The application discloses a speech enhancement method in the field of artificial intelligence, comprising: acquiring speech to be enhanced and a reference image, where the speech to be enhanced and the reference image are data acquired at the same time; outputting a first enhancement signal of the speech to be enhanced according to a first neural network; outputting a masking function of the reference image according to a second neural network, where the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, and frequency band energy smaller than the preset value indicates that the corresponding frequency band of the speech to be enhanced is noise; and determining a second enhancement signal of the speech to be enhanced according to the operation result of the first enhancement signal and the masking function. With the technical solution provided by the application, image information can be applied to the speech enhancement process, so that even in relatively noisy environments the speech enhancement capability can be well improved and the listening experience can be improved.

1. A method of speech enhancement, comprising:

acquiring a voice to be enhanced and a reference image, wherein the voice to be enhanced and the reference image are simultaneously acquired data;

outputting a first enhancement signal of the voice to be enhanced according to a first neural network, wherein the first neural network is obtained by taking a first mask as a training target and training mixed data of the voice and noise;

outputting a masking function of the reference image according to a second neural network, wherein the masking function indicates whether frequency band energy corresponding to the reference image is smaller than a preset value, the frequency band energy smaller than the preset value indicates that a frequency band of the voice to be enhanced corresponding to the reference image is noise, and the second neural network is obtained by training an image including lip features corresponding to a voice source adopted by the first neural network with a second mask as a training target;

and determining a second enhanced signal of the voice to be enhanced according to the first enhanced signal and the operation result of the masking function.

2. The speech enhancement method according to claim 1, wherein the reference image is a corresponding image including lip features at a sound source of the speech to be enhanced.

3. The speech enhancement method according to claim 1 or 2, wherein said determining a second enhancement signal of the speech to be enhanced according to the first enhancement signal and the operation result of the masking function comprises:

and determining the second enhancement signal according to a weight output by a third neural network by taking the first enhancement signal and the masking function as input data of the third neural network, wherein the weight indicates an output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

4. The speech enhancement method of claim 3, further comprising:

determining whether the reference image includes face information or lip information;

when the reference image does not include the face information or the lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

5. The speech enhancement method of claim 3 or 4, wherein the correction signal is the result of a product operation of the first enhancement signal and the masking function.

6. The speech enhancement method of claim 5, wherein the correction signal is determined according to a result of a multiplication operation of M signal-to-noise ratios and a masking function at a first time, wherein M is a positive integer, the first enhancement signal output by the first neural network at the first time comprises M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time.

7. The speech enhancement method according to any one of claims 1 to 6, wherein the speech to be enhanced comprises a first acoustic feature frame, a time instant corresponding to the first acoustic feature frame is indicated by a first time index, the reference image comprises a first image frame, the first image frame is input data of the second neural network, and the outputting of the masking function of the reference image according to the second neural network comprises:

outputting a masking function corresponding to the first image frame at a first time according to the second neural network, wherein the first time is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of a frame rate of the first acoustic feature frame to a frame rate of the first image frame.

8. The speech enhancement method of any one of claims 1 to 7, wherein the method further comprises:

performing feature transformation on the voice to be enhanced to obtain frequency domain features of the voice to be enhanced;

the method further comprises the following steps:

and performing characteristic inverse transformation on the second enhanced signal to obtain enhanced voice.

9. The speech enhancement method of claim 8,

the feature transformation of the speech to be enhanced comprises:

performing short-time Fourier transform (STFT) on the voice to be enhanced;

said inverse characteristic transforming said second enhancement signal comprising:

and carrying out inverse short-time Fourier transform (ISTFT) on the second enhanced signal.

10. The speech enhancement method according to any one of claims 1 to 9, characterized in that the method further comprises:

and sampling the reference image to enable the frame rate of an image frame included in the reference image to be a preset frame rate.

11. The speech enhancement method according to any one of claims 1 to 10, wherein the lip features are obtained by feature extraction of a face image obtained by face detection of the reference image.

12. The speech enhancement method according to any one of claims 1 to 11, wherein the frequency band energy of the reference image is represented by an activation function, and the value of the activation function is approximated to the IBM to obtain the second neural network.

13. The speech enhancement method of any one of claims 1 to 12 wherein the speech to be enhanced is obtained through a single audio channel.

14. The speech enhancement method of any one of claims 1 to 13 wherein the first mask is an ideal floating-value mask IRM and the second mask is an ideal binary mask IBM.

15. A method of training a neural network for speech enhancement, the method comprising:

acquiring training data, wherein the training data comprises mixed data of voice and noise and an image which corresponds to a voice sound source and comprises lip features;

training the mixed data by taking an ideal floating value mask IRM as a training target to obtain a first neural network, wherein the trained first neural network is used for outputting a first enhancement signal of the voice to be enhanced;

training the image by taking an ideal binary masking IBM as a training target to obtain a second neural network, wherein the trained second neural network is used for outputting a masking function of a reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy smaller than the preset value indicates that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the operation result of the first enhancement signal and the masking function is used for determining a second enhancement signal of the voice to be enhanced.

16. The method of training a neural network of claim 15, wherein the reference image is a corresponding image including lip features at a sound source of the speech to be enhanced.

17. The method of training a neural network according to claim 15 or 16, wherein the operation result of the first enhancement signal and the masking function is used to determine a second enhancement signal of the speech to be enhanced, comprising:

and determining the second enhancement signal according to a weight output by a third neural network by taking the first enhancement signal and the masking function as input data of the third neural network, wherein the weight indicates an output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

18. The method of training a neural network of claim 17, further comprising:

determining whether the image includes face information or lip information;

when the image does not include the face information or the lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

19. The method of training a neural network according to claim 17 or 18, wherein the correction signal is the result of a product operation of the first enhancement signal and the masking function.

20. The method of training a neural network according to claim 19, wherein the correction signal is determined according to a result of a multiplication operation of M signal-to-noise ratios and a masking function at a first time, wherein M is a positive integer, the first enhancement signal output by the first neural network at the first time comprises M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time.

21. A method for training a neural network according to any one of claims 15 to 20, wherein the speech to be enhanced includes a first acoustic feature frame, the time corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of the second neural network, and the outputting the masking function of the image according to the second neural network includes:

outputting a masking function corresponding to the first image frame at a first time according to the second neural network, wherein the first time is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of a frame rate of the first acoustic feature frame to a frame rate of the first image frame.

22. The method of training a neural network of any one of claims 15 to 21, further comprising:

performing feature transformation on the voice to be enhanced to obtain frequency domain features of the voice to be enhanced;

the method further comprises the following steps:

and performing characteristic inverse transformation on the second enhanced signal to obtain enhanced voice.

23. The method of training a neural network of claim 22,

the feature transformation of the speech to be enhanced comprises:

performing short-time Fourier transform (STFT) on the voice to be enhanced;

said inverse characteristic transforming said second enhancement signal comprising:

and carrying out inverse short-time Fourier transform (ISTFT) on the second enhanced signal.

24. The method of training a neural network of any one of claims 15 to 23, further comprising:

and sampling the image to enable the frame rate of an image frame included in the image to be a preset frame rate.

25. A method for training a neural network according to any one of claims 15 to 24, wherein the lip features are obtained by feature extraction on a face image obtained by face detection on the image.

26. A method for training a neural network as claimed in any one of claims 15 to 25, wherein the energy in the frequency band of the image is represented by an activation function, and the value of the activation function is approximated to the IBM to obtain the second neural network.

27. A method for training a neural network as claimed in any one of claims 15 to 26, wherein the speech to be enhanced is obtained via a single audio channel.

28. The method of training a neural network of any one of claims 15 to 27, wherein the first mask is an ideal floating-value mask IRM and the second mask is an ideal binary mask IBM.

29. A speech enhancement apparatus, comprising:

an acquisition module, wherein the acquisition module is used for acquiring a voice to be enhanced and a reference image, and the voice to be enhanced and the reference image are simultaneously acquired data;

the audio processing module is used for outputting a first enhancement signal of the voice to be enhanced according to a first neural network, and the first neural network is obtained by training mixed data of the voice and noise by taking a first mask as a training target;

the image processing module is used for outputting a masking function of the reference image according to a second neural network, wherein the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value and indicates that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the second neural network is obtained by training an image which comprises lip features and corresponds to the voice source adopted by the first neural network by taking a second mask as a training target;

and the comprehensive processing module is used for determining a second enhanced signal of the voice to be enhanced according to the first enhanced signal and the operation result of the masking function.

30. The speech enhancement device of claim 29 wherein the reference image is a corresponding image including lip features at a sound source of the speech to be enhanced.

31. The speech enhancement device according to claim 29 or 30, wherein the integrated processing module is specifically configured to:

and determining the second enhancement signal according to a weight output by a third neural network by taking the first enhancement signal and the masking function as input data of the third neural network, wherein the weight indicates an output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

32. The speech enhancement device of claim 31, wherein the device further comprises: a feature extraction module for extracting the features of the image,

the feature extraction module is used for determining whether the reference image comprises face information or lip information; when the reference image does not include the face information or the lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

33. The speech enhancement device of claim 31 or 32, wherein the correction signal is the result of a product operation of the first enhancement signal and the masking function.

34. The speech enhancement device of claim 33, wherein the correction signal is determined according to a result of a multiplication operation of M signal-to-noise ratios and a masking function at a first time, wherein M is a positive integer, the first enhancement signal output by the first neural network at the first time comprises M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time.

35. The speech enhancement device according to any one of claims 29 to 34, wherein the speech to be enhanced comprises a first acoustic feature frame, a time instant corresponding to the first acoustic feature frame is indicated by a first time index, the reference image comprises a first image frame, the first image frame is input data of the second neural network, and the image processing module is specifically configured to:

outputting a masking function corresponding to the first image frame at a first time according to the second neural network, wherein the first time is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of a frame rate of the first acoustic feature frame to a frame rate of the first image frame.

36. An apparatus for training a neural network, the neural network being used for speech enhancement, the apparatus comprising:

the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring training data, and the training data comprises mixed data of voice and noise and an image which corresponds to a voice source and comprises lip characteristics;

the audio processing module is used for training the mixed data to obtain a first neural network by taking an ideal floating value mask IRM as a training target, and the trained first neural network is used for outputting a first enhancement signal of the voice to be enhanced;

the image processing module is used for training the image by taking an ideal binary masking IBM as a training target to obtain a second neural network, the trained second neural network is used for outputting a masking function of a reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value to indicate that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the operation result of the first enhancement signal and the masking function is used for determining a second enhancement signal of the voice to be enhanced.

37. The apparatus for training a neural network according to claim 36, wherein the reference image is a corresponding image including lip features at a sound source of the speech to be enhanced.

38. An apparatus for training a neural network as claimed in claim 36 or 37, further comprising: a comprehensive processing module which is used for processing the data,

the comprehensive processing module is configured to take the first enhancement signal and the masking function as input data of the third neural network and determine the second enhancement signal according to a weight output by the third neural network, where the weight indicates an output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network with the first mask as a training target.

39. The apparatus for training a neural network of claim 38, further comprising a feature extraction module,

wherein the feature extraction module is used for determining whether the image includes face information or lip information;

when the image does not include the face information or the lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

40. The apparatus for training a neural network according to claim 38 or 39, wherein the correction signal is the result of a product operation of the first enhancement signal and the masking function.

41. The apparatus for training a neural network according to claim 40, wherein the correction signal is determined according to a result of a multiplication operation of M signal-to-noise ratios and a masking function at a first time, wherein M is a positive integer, the first enhancement signal output by the first neural network at the first time comprises M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time.

42. An apparatus for training a neural network according to any one of claims 36 to 41, wherein the speech to be enhanced includes a first acoustic feature frame, a time instant corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of the second neural network, and the image processing module is specifically configured to:

outputting a masking function corresponding to the first image frame at a first time according to the second neural network, wherein the first time is indicated by a multiple of the first time index, and the multiple is determined according to a ratio of a frame rate of the first acoustic feature frame to a frame rate of the first image frame.

43. A speech enhancement apparatus, comprising:

a memory for storing a program;

a processor for executing the memory-stored program, the processor for performing the method of any one of claims 1-14 when the memory-stored program is executed.

44. An apparatus for training a neural network, comprising:

a memory for storing a program;

a processor for executing the memory-stored program, the processor for performing the method of any one of claims 15-28 when the memory-stored program is executed.

45. A computer storage medium, characterized in that the computer storage medium stores a program code comprising instructions for performing the steps in the method according to any of claims 1-14.

46. A computer storage medium, characterized in that the computer storage medium stores program code comprising instructions for performing the steps in the method according to any of claims 15-28.

Technical Field

The application relates to the field of artificial intelligence, in particular to a voice enhancement method and a method for training a neural network.

Background

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.

Automatic speech recognition (ASR) is a technology for recognizing the corresponding text content from a speech waveform, and is one of the important technologies in the field of artificial intelligence. In speech recognition systems, speech enhancement, also commonly referred to as speech noise reduction, is a very important technique. High-frequency noise, low-frequency noise, white noise and various other noises in the speech signal can be eliminated by speech enhancement, thereby improving the speech recognition effect. Therefore, how to improve the speech enhancement effect is a problem that needs to be solved.

Disclosure of Invention

The embodiments of the present application provide a speech enhancement method that applies image information to the speech enhancement process, so that the speech enhancement capability can be well improved and the listening experience can be improved even in relatively noisy environments.

In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:

a first aspect of the present application provides a speech enhancement method, which may include: and acquiring the voice to be enhanced and the reference image, wherein the voice to be enhanced and the reference image are simultaneously acquired data. And outputting a first enhancement signal of the voice to be enhanced according to a first neural network, wherein the first neural network is obtained by taking a first mask as a training target and training mixed data of the voice and the noise. And outputting a masking function of the reference image according to the second neural network, wherein the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value and indicates that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the second neural network is obtained by training the image which can include lip features and corresponds to the voice source adopted by the first neural network by taking the second mask as a training target. And determining a second enhanced signal of the voice to be enhanced according to the first enhanced signal and the operation result of the masking function. According to the first aspect, a first enhancement signal of the speech to be enhanced is output by using the first neural network, and the incidence relation between the image information and the speech information is modeled by using the second neural network, so that the masking function of the reference image output by the second neural network can indicate that the speech to be enhanced corresponding to the reference image is noise or speech. Through the technical scheme provided by the application, the image information can be applied to the voice enhancement process, and in relatively noisy environments, the voice enhancement capability can be well improved, and the auditory sense is improved.

Optionally, with reference to the first aspect, in a first possible implementation manner, the reference image is an image that may include lip features and corresponds to a sound source of the speech to be enhanced.

Optionally, with reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner, determining a second enhancement signal of the speech to be enhanced according to the first enhancement signal and an operation result of the masking function may include: and taking the first enhancement signal and the masking function as input data of a third neural network, determining a second enhancement signal according to a weight output by the third neural network, wherein the weight indicates the output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.
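
The weighted combination described in this implementation manner can be illustrated as follows; the `weight` value is assumed to come from the third neural network, and the particular blending rule below is an assumption rather than the application's definition:

```python
def fuse(first_enhancement, masking_function, weight):
    """Blend the audio-only estimate with the image-corrected estimate.

    weight: assumed scalar (or per-band value) in [0, 1] output by the third
    neural network; weight = 1 keeps only the first enhancement signal,
    weight = 0 keeps only the correction signal.
    """
    correction = first_enhancement * masking_function
    return weight * first_enhancement + (1.0 - weight) * correction
```

Under this reading, the degenerate case of the next implementation manner (no face or lip information in the reference image) simply corresponds to a weight that routes one hundred percent of the output to the first enhancement signal.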

Optionally, with reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the method may further include: it is determined whether the reference image may include face information or lip information. When the reference image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

Optionally, with reference to the second or third possible implementation manner of the first aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation of the first enhancement signal and the masking function.

Optionally, with reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, the correction signal is determined according to a result of a multiplication operation between M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, the first enhancement signal output by the first neural network at the first time may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.
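
A minimal numpy illustration of the product operation over the M frequency bands; M, the random stand-in values and the element-wise form are all assumptions:

```python
import numpy as np

M = 257                                            # assumed number of frequency bands
snr_per_band = np.random.rand(M)                   # stand-in for the M values of the first enhancement signal
mask_at_t = (np.random.rand(M) > 0.5).astype(np.float32)  # masking function at the first time
correction_at_t = snr_per_band * mask_at_t         # element-wise product over the M bands
```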

Optionally, with reference to the first aspect or the first to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner, the speech to be enhanced may include a first acoustic feature frame, a time instant corresponding to the first acoustic feature frame is indicated by a first time index, the reference image may include a first image frame, the first image frame is input data of a second neural network, and outputting a masking function of the reference image according to the second neural network may include: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.
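
One common way to realize the frame-rate alignment described here is an integer index mapping; the concrete rates below (100 acoustic feature frames per second, 25 image frames per second) are assumptions used only to make the "multiple" concrete:

```python
def image_index_for_audio_frame(audio_idx, audio_fps=100, video_fps=25):
    """Map an acoustic feature frame index to the image frame whose masking
    function is reused for it (assumed frame rates)."""
    ratio = audio_fps / video_fps      # e.g. 100 / 25 = 4, the "multiple"
    return int(audio_idx // ratio)     # each image frame covers `ratio` acoustic frames
```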

Optionally, with reference to the first aspect or the first to sixth possible implementation manners of the first aspect, in a seventh possible implementation manner, the method may further include: and performing feature transformation on the voice to be enhanced to obtain the frequency domain features of the voice to be enhanced. The method may further comprise: and performing characteristic inverse transformation on the second enhanced signal to obtain enhanced voice.

Optionally, with reference to the seventh possible implementation manner of the first aspect, in an eighth possible implementation manner, the performing feature transformation on the speech to be enhanced may include: and performing short-time Fourier transform (STFT) on the voice to be enhanced. Performing an inverse characteristic transform on the second enhanced signal may include: an inverse short-time fourier transform ISTFT is performed on the second enhancement signal.
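
A sketch of the feature transform and inverse feature transform using SciPy's STFT/ISTFT; the sampling rate, window length and the reuse of the noisy phase are assumptions, not requirements stated by the application:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                               # assumed sampling rate
x = np.random.randn(fs)                  # stand-in for one second of speech to be enhanced

# Feature transform: short-time Fourier transform of the speech to be enhanced.
f, t, X = stft(x, fs=fs, nperseg=512, noverlap=384)
mag, phase = np.abs(X), np.angle(X)

# ... enhancement operates on `mag` (the frequency-domain features) ...
enhanced_mag = mag                       # placeholder for the second enhancement signal

# Inverse feature transform: ISTFT reconstructs the enhanced speech waveform,
# here reusing the phase of the noisy signal.
_, x_enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs,
                      nperseg=512, noverlap=384)
```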

Optionally, with reference to the first to eighth possible implementation manners of the first aspect, in a ninth possible implementation manner, the method may further include sampling the reference image, so that a frame rate of an image frame that the reference image may include is a preset frame rate.

Optionally, with reference to the first aspect or the first to eighth possible implementation manners of the first aspect, in a tenth possible implementation manner, the lip feature is obtained by performing feature extraction on a face image, where the face image is obtained by performing face detection on a reference image.
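
As one possible illustration of the face-detection plus lip-feature step (not the application's actual detector), OpenCV's bundled Haar cascade can provide a face box from which a rough lip region is cropped:

```python
import cv2

# Illustrative only: a generic face detector followed by a crude lip crop.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_region(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face/lip information in this frame
    x, y, w, h = faces[0]
    # Take the lower half of the detected face as a rough lip region.
    return frame_bgr[y + h // 2 : y + h, x : x + w]
```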

Optionally, with reference to the first aspect or the first to tenth possible implementation manners of the first aspect, in an eleventh possible implementation manner, frequency band energy of the reference image is represented by an activation function, and a value of the activation function is made to approach IBM to obtain a second neural network.
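
The idea of representing the frequency band energy by an activation function and driving its value towards the IBM can be sketched, for example, with a sigmoid output layer trained against binary targets; the toy network below, its dimensions and the binary cross-entropy loss are assumptions for illustration only:

```python
import torch
import torch.nn as nn

lip_dim, num_bands = 128, 257                     # assumed feature and band dimensions
# Stand-in for the second neural network: lip-feature vector in,
# per-band sigmoid activation out.
video_net = nn.Sequential(nn.Linear(lip_dim, 256), nn.ReLU(),
                          nn.Linear(256, num_bands), nn.Sigmoid())
criterion = nn.BCELoss()                          # pushes the activation towards the IBM

lip_features = torch.randn(8, lip_dim)                        # assumed batch of lip features
ibm_target = torch.randint(0, 2, (8, num_bands)).float()      # IBM labels (0 = noise band)

loss = criterion(video_net(lip_features), ibm_target)
loss.backward()
```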

Optionally, with reference to the first aspect or the first to eleventh possible implementation manners of the first aspect, in a twelfth possible implementation manner, the speech to be enhanced is obtained through a single audio channel.

Optionally, with reference to the first aspect or the first to twelfth possible implementation manners of the first aspect, in a thirteenth possible implementation manner, the first mask is an ideal floating value mask IRM, and the second mask is an ideal binary mask IBM.
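
For reference, the ideal floating value mask (ideal ratio mask, IRM) and the ideal binary mask (IBM) are commonly defined in the speech-enhancement literature as follows; these are the usual textbook forms, not necessarily the exact definitions used in this application:

```latex
\mathrm{IRM}(t,f) = \left(\frac{S^{2}(t,f)}{S^{2}(t,f)+N^{2}(t,f)}\right)^{\beta},
\qquad
\mathrm{IBM}(t,f) =
\begin{cases}
1, & 10\log_{10}\dfrac{S^{2}(t,f)}{N^{2}(t,f)} > \mathrm{LC},\\
0, & \text{otherwise},
\end{cases}
```

where S(t,f) and N(t,f) are the clean-speech and noise magnitudes in time-frequency bin (t,f), the exponent β is often 0.5, and LC is a local signal-to-noise threshold.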

A second aspect of the present application provides a method of training a neural network for speech enhancement, and the method may include: acquiring training data, where the training data may include mixed data of speech and noise, and an image which corresponds to the speech sound source and may include lip features; training on the mixed data with an ideal floating value mask IRM as a training target to obtain a first neural network, where the trained first neural network is used for outputting a first enhancement signal of the speech to be enhanced; and training on the image with an ideal binary mask IBM as a training target to obtain a second neural network, where the trained second neural network is used for outputting a masking function of a reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, frequency band energy smaller than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise, and the operation result of the first enhancement signal and the masking function is used for determining a second enhancement signal of the speech to be enhanced.
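
A sketch of how one training pair for the first neural network could be prepared from clean speech and noise; the mixing rule and the per-sample IRM-style target below are assumptions made for illustration (in practice the target is usually computed per time-frequency bin):

```python
import numpy as np

def make_training_pair(clean, noise, snr_db):
    """Mix clean speech and noise at a given SNR and compute an IRM-style target.

    Assumes `noise` is at least as long as `clean`; both are 1-D float arrays.
    """
    noise = noise[: len(clean)]
    # Scale the noise so that the mixture has the requested signal-to-noise ratio.
    gain = np.sqrt(np.sum(clean ** 2) /
                   (np.sum(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    mixture = clean + gain * noise
    # IRM-style target: clean energy over total energy (assumed per-sample form).
    irm = clean ** 2 / (clean ** 2 + (gain * noise) ** 2 + 1e-12)
    return mixture, irm
```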

Optionally, in combination with the second aspect described above, in a first possible implementation manner, the reference image is a corresponding image that may include lip features at a sound source of the speech to be enhanced.

Optionally, with reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner, the determining, by using the operation result of the first enhancement signal and the masking function, a second enhancement signal of the speech to be enhanced may include: and taking the first enhancement signal and the masking function as input data of a third neural network, determining a second enhancement signal according to a weight output by the third neural network, wherein the weight indicates the output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

Optionally, with reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the method may further include: it is determined whether the image may include face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

Optionally, with reference to the second or third possible implementation manner of the second aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation of the first enhancement signal and the masking function.

Optionally, with reference to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner, the correction signal is determined according to a result of a multiplication operation between M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, the first enhancement signal output by the first neural network at the first time may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.

Optionally, with reference to the second aspect or the first to fifth possible implementation manners of the second aspect, in a sixth possible implementation manner, the speech to be enhanced may include a first acoustic feature frame, a time instant corresponding to the first acoustic feature frame is indicated by a first time index, the image may include a first image frame, the first image frame is input data of a second neural network, and outputting a masking function of the image according to the second neural network may include: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.

Optionally, with reference to the second aspect or the first to sixth possible implementation manners of the second aspect, in a seventh possible implementation manner, the method may further include: and performing feature transformation on the voice to be enhanced to obtain the frequency domain features of the voice to be enhanced. The method may further comprise: and performing characteristic inverse transformation on the second enhanced signal to obtain enhanced voice.

Optionally, with reference to the seventh possible implementation manner of the second aspect, in an eighth possible implementation manner, the performing feature transformation on the speech to be enhanced may include: and performing short-time Fourier transform (STFT) on the voice to be enhanced. Performing an inverse characteristic transform on the second enhanced signal may include: an inverse short-time fourier transform ISTFT is performed on the second enhancement signal.

Optionally, with reference to the first to eighth possible implementation manners of the second aspect, in a ninth possible implementation manner, the method may further include: the image is sampled, so that the frame rate of the image frame which can be included in the image is a preset frame rate.

Optionally, with reference to the second aspect or the first to eighth possible implementation manners of the second aspect, in a tenth possible implementation manner, the lip feature is obtained by performing feature extraction on a face image, and the face image is obtained by performing face detection on an image.

Optionally, with reference to the second aspect or the first to tenth possible implementation manners of the second aspect, in an eleventh possible implementation manner, frequency band energy of the image is represented by an activation function, and a value of the activation function is made to approach IBM to obtain a second neural network.

Optionally, with reference to the second aspect or the first to eleventh possible implementation manners of the second aspect, in a twelfth possible implementation manner, the speech to be enhanced is obtained through a single audio channel.

Optionally, with reference to the second aspect or the first to twelfth possible implementation manners of the second aspect, in a thirteenth possible implementation manner, the first mask is an ideal floating value mask IRM, and the second mask is an ideal binary mask IBM.

A third aspect of the present application provides a speech enhancement apparatus, comprising: an acquisition module, configured to acquire the speech to be enhanced and the reference image, where the speech to be enhanced and the reference image are data acquired at the same time; an audio processing module, configured to output a first enhancement signal of the speech to be enhanced according to a first neural network, where the first neural network is obtained by training on mixed data of speech and noise with a first mask as a training target; an image processing module, configured to output a masking function of the reference image according to a second neural network, where the masking function indicates whether the frequency band energy corresponding to the reference image is smaller than a preset value, frequency band energy smaller than the preset value indicates that the frequency band of the speech to be enhanced corresponding to the reference image is noise, and the second neural network is a neural network obtained by training, with a second mask as a training target, on the image including the lip feature corresponding to the speech source used by the first neural network; and a comprehensive processing module, configured to determine a second enhancement signal of the speech to be enhanced according to the operation result of the first enhancement signal and the masking function.

Optionally, with reference to the third aspect, in a first possible implementation manner, the reference image is an image including a lip feature corresponding to a sound source of the speech to be enhanced.

Optionally, with reference to the third aspect or the first possible implementation manner of the third aspect, in a second possible implementation manner, the comprehensive processing module is specifically configured to: and taking the first enhancement signal and the masking function as input data of a third neural network, determining a second enhancement signal according to a weight output by the third neural network, wherein the weight indicates the output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

Optionally, with reference to the second possible implementation manner of the third aspect, in a third possible implementation manner, the apparatus further includes a feature extraction module, configured to determine whether the reference image includes face information or lip information. When the reference image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

Optionally, with reference to the second or third possible implementation manner of the third aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation of the first enhancement signal and the masking function.

Optionally, with reference to the fourth possible implementation manner of the third aspect, in a fifth possible implementation manner, the correction signal is determined according to a result of a multiplication operation between M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, a first enhancement signal output by the first neural network at the first time includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.

Optionally, with reference to the third aspect or the first to fifth possible implementation manners of the third aspect, in a sixth possible implementation manner, the speech to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the reference image includes a first image frame, the first image frame is input data of a second neural network, and the image processing module is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.

Optionally, with reference to the seventh possible implementation manner of the third aspect, in an eighth possible implementation manner, the performing feature transformation on the speech to be enhanced may include: and performing short-time Fourier transform (STFT) on the voice to be enhanced. Performing an inverse characteristic transform on the second enhanced signal may include: an inverse short-time fourier transform ISTFT is performed on the second enhancement signal.

Optionally, with reference to the first to eighth possible implementation manners of the third aspect, in a ninth possible implementation manner, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that the reference image may include is a preset frame rate.

Optionally, with reference to the third aspect or the first to eighth possible implementation manners of the third aspect, in a tenth possible implementation manner, the lip feature is obtained by performing feature extraction on a face image, where the face image is obtained by performing face detection on a reference image.

Optionally, with reference to the third aspect or the first to tenth possible implementation manners of the third aspect, in an eleventh possible implementation manner, frequency band energy of the reference image is represented by an activation function, and a value of the activation function is made to approach IBM to obtain a second neural network.

Optionally, with reference to the third aspect or the first to eleventh possible implementation manners of the third aspect, in a twelfth possible implementation manner, the speech to be enhanced is obtained through a single audio channel.

Optionally, with reference to the third aspect or the first to twelfth possible implementation manners of the third aspect, in a thirteenth possible implementation manner, the first mask is an ideal floating value mask IRM, and the second mask is an ideal binary mask IBM.

A fourth aspect of the present application provides an apparatus for training a neural network, the neural network being used for speech enhancement, the apparatus comprising: the acquisition module is used for acquiring training data, wherein the training data comprises mixed data of voice and noise and images corresponding to the voice source and comprising lip features. And the audio processing module is used for training the mixed data to obtain a first neural network by taking the ideal floating value mask IRM as a training target, and the trained first neural network is used for outputting a first enhancement signal of the voice to be enhanced. The image processing module is used for training the image by taking an ideal binary masking IBM as a training target to obtain a second neural network, the trained second neural network is used for outputting a masking function of the reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value and indicates that the frequency band of the voice to be enhanced corresponding to the reference image is noise, and the operation result of the first enhancement signal and the masking function is used for determining a second enhancement signal of the voice to be enhanced.

Optionally, with reference to the fourth aspect, in a first possible implementation manner, the reference image is an image including a lip feature corresponding to a sound source of the speech to be enhanced.

Optionally, with reference to the fourth aspect or the first possible implementation manner of the fourth aspect, in a second possible implementation manner, the apparatus further includes a comprehensive processing module.

And the comprehensive processing module is used for determining a second enhancement signal according to a weight output by the third neural network by taking the first enhancement signal and the masking function as input data of the third neural network, wherein the weight indicates the output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

Optionally, with reference to the second possible implementation manner of the fourth aspect, in a third possible implementation manner, the apparatus further includes a feature extraction module.

The feature extraction module is configured to determine whether the image includes face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

Optionally, with reference to the second or third possible implementation manner of the fourth aspect, in a fourth possible implementation manner, the correction signal is the result of a product operation of the first enhancement signal and the masking function.

Optionally, with reference to the fourth possible implementation manner of the fourth aspect, in a fifth possible implementation manner, the correction signal is determined according to a result of a multiplication operation between M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, a first enhancement signal output by the first neural network at the first time includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.

Optionally, with reference to the fourth aspect or the first to fifth possible implementation manners of the fourth aspect, in a sixth possible implementation manner, the speech to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of a second neural network, and the image processing module is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.

Optionally, with reference to the seventh possible implementation manner of the fourth aspect, in an eighth possible implementation manner, the performing feature transformation on the speech to be enhanced may include: and performing short-time Fourier transform (STFT) on the voice to be enhanced. Performing an inverse characteristic transform on the second enhanced signal may include: an inverse short-time fourier transform ISTFT is performed on the second enhancement signal.

Optionally, with reference to the first to eighth possible implementation manners of the fourth aspect, in a ninth possible implementation manner, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that the reference image may include is a preset frame rate.

Optionally, with reference to the fourth aspect or the first to eighth possible implementation manners of the fourth aspect, in a tenth possible implementation manner, the lip feature is obtained by performing feature extraction on a face image, where the face image is obtained by performing face detection on a reference image.

Optionally, with reference to the fourth aspect or the first to tenth possible implementation manners of the fourth aspect, in an eleventh possible implementation manner, frequency band energy of the reference image is represented by an activation function, and a value of the activation function is made to approach IBM to obtain a second neural network.

Optionally, with reference to the fourth aspect or the first to eleventh possible implementation manners of the fourth aspect, in a twelfth possible implementation manner, the speech to be enhanced is obtained through a single audio channel.

Optionally, with reference to the fourth aspect or the first to twelfth possible implementation manners of the fourth aspect, in a thirteenth possible implementation manner, the first mask is an ideal floating value mask IRM, and the second mask is an ideal binary mask IBM.

A fifth aspect of the present application provides a speech enhancement apparatus, comprising: a memory for storing a program. A processor for executing the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to perform the method as described in the first aspect or any one of the possible implementations of the first aspect.

A sixth aspect of the present application provides an apparatus for training a neural network, comprising: a memory for storing a program. A processor for executing the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to perform the method as described in the second aspect or any one of the possible implementations of the second aspect.

A seventh aspect of the present application provides a computer storage medium, characterized in that the computer storage medium stores program code, which includes instructions for executing the method as described in the first aspect or any one of the possible implementation manners of the first aspect.

An eighth aspect of the present application provides a computer storage medium, wherein the computer storage medium stores program code, and the program code comprises instructions for executing the method as described in the second aspect or any one of the possible implementation manners of the second aspect.

According to the solution provided by the embodiments of the present application, the first neural network is used to output a first enhancement signal of the speech to be enhanced, and the second neural network is used to model the correlation between the image information and the speech information, so that the masking function of the reference image output by the second neural network can indicate whether the speech to be enhanced corresponding to the reference image is noise or speech. With the technical solution provided by the application, the image information can be applied to the speech enhancement process, so that even in relatively noisy environments the speech enhancement capability can be well improved and the listening experience can be improved.

Drawings

FIG. 1 is a schematic diagram of an artificial intelligence agent framework provided by an embodiment of the present application;

FIG. 2 shows a system architecture provided in the present application;

FIG. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;

FIG. 5 shows a hardware structure of a chip according to an embodiment of the present application;

FIG. 6 is a block diagram of a system architecture according to an embodiment of the present application;

FIG. 7 is a flowchart illustrating a speech enhancement method according to an embodiment of the present application;

FIG. 8 is a diagram illustrating an application scenario of a solution provided by an embodiment of the present application;

FIG. 9 is a diagram illustrating an application scenario of a solution provided by an embodiment of the present application;

FIG. 10 is a diagram illustrating an application scenario of a solution provided by an embodiment of the present application;

FIG. 11 is a diagram illustrating an application scenario of a solution provided by an embodiment of the present application;

FIG. 12 is a schematic diagram of time series alignment according to an embodiment of the present application;

FIG. 13 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;

FIG. 14 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;

FIG. 15 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;

FIG. 16 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;

FIG. 17 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application;

FIG. 18 is a schematic structural diagram of an apparatus for training a neural network according to an embodiment of the present application;

FIG. 19 is a schematic structural diagram of another speech enhancement apparatus according to an embodiment of the present application;

FIG. 20 is a schematic structural diagram of another apparatus for training a neural network according to an embodiment of the present application.

Detailed Description

Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.

The terms "first," "second," and the like in the specification, the claims, and the accompanying drawings of this application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data used in this way may be interchanged in appropriate circumstances, so that the embodiments described herein can be implemented in an order other than the order illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such a process, method, article, or apparatus. The naming or numbering of steps in this application does not mean that the steps in the method procedure must be performed in the chronological or logical order indicated by the naming or numbering; the named or numbered steps may be performed in a different order, provided that the same or similar technical effect is achieved. The division into modules presented in this application is a logical division; in practical applications, there may be another division manner, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed between modules may be implemented through some interfaces, and indirect couplings or communication connections between modules may be in an electrical or another similar form, which is not limited in this application. Modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed among a plurality of circuit modules; some or all of them may be selected according to actual needs to achieve the objectives of the solutions of the present application.

In order to better understand the field and the scenario to which the scheme provided by the present application may be applied, before specifically describing the technical scheme provided by the present application, first, a description is given to an artificial intelligence agent framework, a system architecture to which the scheme provided by the present application is applicable, and related knowledge of a neural network.

FIG. 1 shows a schematic diagram of an artificial intelligence agent framework, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements in the artificial intelligence field.

The artificial intelligence agent framework is described in detail below along two dimensions: the "intelligent information chain" (horizontal axis) and the "information technology (IT) value chain" (vertical axis).

The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it may be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.

The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and the information (technology for providing and processing information) up to the industrial ecology of the system.

(1) Infrastructure:

the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform.

The infrastructure may communicate with the outside through sensors, and the computing power of the infrastructure may be provided by a smart chip.

The smart chip may be a hardware acceleration chip such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA).

The infrastructure platform may include distributed computing framework and network, and may include cloud storage and computing, interworking network, and the like.

For example, for an infrastructure, data may be obtained through sensors and external communications and then provided to an intelligent chip in a distributed computing system provided by the base platform for computation.

(2) Data:

data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphics, images, voice and text, and also relates to internet of things data of traditional equipment, including service data of an existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.

(3) Data processing:

the data processing generally includes processing modes such as data training, machine learning, deep learning, searching, reasoning, decision making and the like.

The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.

Inference refers to a process of simulating an intelligent human inference mode in a computer or an intelligent system, in which a machine uses formalized information to think about and solve problems according to an inference control strategy; typical functions are searching and matching.

Decision-making refers to a process of making a decision after reasoning on intelligent information, and generally provides functions such as classification, sorting, and prediction.

(4) General-purpose capability:

after the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.

(5) Intelligent products and industrial applications:

Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They encapsulate an overall artificial intelligence solution, commercialize intelligent information decision-making, and implement practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminal, and the like.

The embodiment of the application can be applied to many fields in artificial intelligence, such as intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe cities and other fields.

In particular, the embodiment of the present application can be applied to the field of speech enhancement and speech recognition requiring the use of (deep) neural networks.

Since the embodiments of the present application relate to the application of a large number of neural networks, for the sake of understanding, the following description will be made first of all with respect to terms and concepts of the neural networks to which the embodiments of the present application may relate.

(1) Neural network

The neural network may be composed of neural units. A neural unit may be an operation unit that uses xs and an intercept of 1 as inputs, and the output of the operation unit may be:

h(x) = f( W1·x1 + W2·x2 + ... + Wn·xn + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is an activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network to convert an input signal of the neural unit into an output signal. The output signal of the activation function may be used as an input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining together a plurality of the foregoing single neural units, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
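For illustration only, the following minimal Python sketch (not part of the claimed solution; the inputs, weights, and bias are hypothetical values) shows one such neural unit with a sigmoid activation function:

import numpy as np

def sigmoid(z):
    # activation function f, introducing the nonlinear characteristic
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # weighted sum of the inputs xs with weights Ws plus the bias b, passed through f
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1 ... xn
w = np.array([0.8, 0.1, -0.4])   # weights W1 ... Wn
b = 0.2                          # bias of the neural unit
print(neural_unit(x, w, b))      # output signal of the neural unit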

(2) Loss function

In the process of training a deep neural network, because it is expected that the output of the deep neural network is as close as possible to the value that is actually desired to be predicted, the weight vector of each layer of the neural network may be updated according to the difference between the predicted value of the current network and the actually desired target value (certainly, an initialization process is usually performed before the first update, that is, a parameter is preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is excessively high, the weight vectors are adjusted to lower the predicted value, and the adjustment is performed continuously until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is done by a loss function (loss function) or an objective function (objective function), which is an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes a process of reducing the loss as much as possible.

(3) Back propagation algorithm

In the training process, the neural network may use a back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, an input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a back propagation motion dominated by the error loss, and aims to obtain optimal parameters of the neural network model, for example, a weight matrix.
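The following sketch is illustrative only (the network shape, learning rate, and data are assumed and are not those of the first or second neural network); it shows one training step in PyTorch in which a loss function measures the difference between the prediction and the target and back propagation updates the weights:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(257, 128), nn.ReLU(),
                      nn.Linear(128, 257), nn.Sigmoid())
loss_fn = nn.MSELoss()                              # measures predicted value vs. target value
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

features = torch.randn(8, 257)                      # stand-in input features (one frame per row)
target = torch.rand(8, 257)                         # stand-in training target (for example, a mask)

prediction = model(features)                        # forward propagation
loss = loss_fn(prediction, target)                  # difference between prediction and target
loss.backward()                                     # back propagation of the error loss
optimizer.step()                                    # update the weights to reduce the loss
optimizer.zero_grad()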

As shown in fig. 2, the present embodiment provides a system architecture 100. In fig. 2, a data acquisition device 160 is used to acquire training data.

After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130.

The following describes how the training device 120 obtains the target model/rule 101 based on the training data. The training device 120 processes the input raw data and compares the output data with the raw data until the difference between the output data of the training device 120 and the raw data is smaller than a specific threshold, thereby completing the training of the target model/rule 101.

The target model/rule 101 can be used to implement the speech enhancement method of the embodiment of the present application, and the training device can be used to implement the method for training a neural network provided by the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the acquisition of the data acquisition device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.

The target model/rule 101 obtained through training by the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 2. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server, a cloud, or the like. In fig. 2, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include the to-be-processed image input by the client device.

The preprocessing module 113 and the preprocessing module 114 are configured to perform preprocessing according to input data (such as an image to be processed) received by the I/O interface 112, and in this embodiment of the application, the preprocessing module 113 and the preprocessing module 114 may not be provided (or only one of the preprocessing modules may be provided), and the computing module 111 may be directly used to process the input data.

In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.

Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.

It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.

In the case shown in fig. 2, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.

It should be noted that fig. 2 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.

As shown in fig. 2, the target model/rule 101 is obtained through training by the training device 120. In this embodiment of the application, the target model/rule 101 may be the neural network in the present application. Specifically, the neural network provided in this embodiment of the application may be a CNN, a deep convolutional neural network (DCNN), a recurrent neural network (RNN), or the like.

Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 3. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.

The structure of the neural network specifically adopted by the speech enhancement method and the method for training the model according to the embodiment of the present application may be as shown in fig. 3. In fig. 3, Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where pooling is optional), and a neural network layer 230. The input layer 210 may obtain an image to be processed, and deliver the obtained image to be processed to the convolutional layer/pooling layer 220 and the following neural network layer 230 for processing, so as to obtain a processing result of the image. The following describes the internal layer structure in CNN 200 in fig. 3 in detail.

Convolutional layer/pooling layer 220:

and (3) rolling layers:

the convolutional layer/pooling layer 220 shown in fig. 3 may include layers such as 221 and 226, for example: in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 is a pooling layer, 224, 225 are convolutional layers, and 226 is a pooling layer. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.

The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.

The convolutional layer 221 may include a plurality of convolution operators, also called kernels. In image processing, a convolution operator functions as a filter that extracts specific information from an input image matrix. The convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to complete the work of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same size (rows × columns), that is, a plurality of matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension may be understood as being determined by the "plurality" described above. Different weight matrices may be used to extract different features from the image, for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and still another weight matrix is used to blur unwanted noise in the image. The plurality of weight matrices have the same size (rows × columns), the convolutional feature maps extracted by the plurality of weight matrices of the same size also have the same size, and the extracted convolutional feature maps of the same size are combined to form the output of the convolution operation.
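As a rough illustration (the image size, kernel size, and number of kernels below are arbitrary assumptions, not values required by this application), the following Python sketch shows how a set of equally sized weight matrices whose depth matches the input depth produces a stacked depth dimension in the output:

import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)        # one input image: depth 3, 32 x 32 pixels
conv = nn.Conv2d(in_channels=3,          # kernel depth equals the input depth
                 out_channels=16,        # 16 weight matrices -> output depth of 16
                 kernel_size=3,
                 stride=1,               # move one pixel at a time
                 padding=1)

feature_maps = conv(image)
print(feature_maps.shape)                # torch.Size([1, 16, 32, 32])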

The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.

When the convolutional neural network 200 has a plurality of convolutional layers, an initial convolutional layer (for example, the layer 221) usually extracts relatively general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 200 increases, features extracted by later convolutional layers (for example, the layer 226) become more complex, for example, features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.

A pooling layer:

since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after the convolutional layer, where the layers 221-226, as illustrated by 220 in fig. 3, may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a certain range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.

The neural network layer 230:

after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of output using the neural network layer 230. Accordingly, a plurality of hidden layers (231, 232 to 23n shown in fig. 3) and an output layer 240 may be included in the neural network layer 230, and parameters included in the hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.

After the hidden layers in the neural network layer 230, that is, at the end of the entire convolutional neural network 200, is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from 210 to 240 in fig. 3) is completed, the backward propagation (propagation in the direction from 240 to 210 in fig. 3) starts to update the weight values and biases of the foregoing layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.

The structure of the neural network specifically adopted by the speech enhancement method and the method for training the model according to the embodiment of the present application may be as shown in fig. 4. In fig. 4, Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where pooling is optional), and a neural network layer 230. Compared with fig. 3, in the convolutional layer/pooling layer 220 in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the all-neural network layer 230 for processing.

It should be noted that the convolutional neural networks shown in fig. 3 and fig. 4 are only examples of two possible convolutional neural networks of the speech enhancement method and the method for training the model according to the embodiment of the present application, and in a specific application, the convolutional neural networks used in the speech enhancement method and the method for training the model according to the embodiment of the present application may also exist in the form of other network models.

Fig. 5 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor. The chip may be provided in the execution device 110 as shown in fig. 2 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 2 to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithms for each layer in the convolutional neural network shown in fig. 3 or fig. 4 can be implemented in a chip as shown in fig. 5.

The neural network processor NPU is mounted as a coprocessor on a main central processing unit (CPU, host CPU), and tasks are allocated by the main CPU. The core portion of the NPU is an arithmetic circuit 303, and the controller 304 controls the arithmetic circuit 303 to extract data in a memory (weight memory or input memory) and perform an operation.

In some implementations, the arithmetic circuitry 303 includes a plurality of processing units (PEs) internally. In some implementations, the operational circuitry 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 303 is a general-purpose matrix processor.

For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 302 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 301 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in an accumulator (accumulator) 308.
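Purely as an illustration of accumulating partial results (this is ordinary Python, not the NPU's internal microcode), the matrix operation described above can be sketched as follows:

import numpy as np

A = np.random.randn(4, 6)                # input matrix A
B = np.random.randn(6, 5)                # weight matrix B
C = np.zeros((4, 5))                     # output matrix C, built up step by step

for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])      # accumulate one partial result per step

assert np.allclose(C, A @ B)             # identical to the full matrix product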

The vector calculation unit 307 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 307 may be used for network calculation of a non-convolution/non-FC layer in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.

In some implementations, the vector calculation unit 307 can store the processed output vector to the unified buffer 306. For example, the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 307 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 303, for example, for use in subsequent layers in a neural network.

The unified memory 306 is used to store input data as well as output data.

A memory unit access controller (direct memory access controller, DMAC) 305 is configured to transfer input data in an external memory to the input memory 301 and/or the unified memory 306, store weight data in the external memory into the weight memory 302, and store data in the unified memory 306 into the external memory.

A Bus Interface Unit (BIU) 310, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 309 through a bus.

An instruction fetch memory (instruction fetch buffer) 309 is connected to the controller 304 and is configured to store instructions used by the controller 304;

and the controller 304 is configured to call the instruction cached in the instruction fetch memory 309, so as to control the working process of the operation accelerator.

An inlet: the data that can be explained here according to the actual invention are explanatory data such as detected vehicle speed? Obstacle distance, etc.

Generally, the unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are On-Chip (On-Chip) memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM) or other readable and writable memories.

The operation of each layer in the convolutional neural network shown in fig. 3 or fig. 4 may be performed by the operation circuit 303 or the vector calculation unit 307.

As shown in fig. 6, the present application provides a system architecture. The system architecture includes a local device 401, a local device 402, and an execution device 210 and a data storage system 150, wherein the local device 401 and the local device 402 are connected with the execution device 210 through a communication network.

The execution device 210 may be implemented by one or more servers. Optionally, the execution device 210 may be used with other computing devices, such as: data storage, routers, load balancers, and the like. The execution device 210 may be disposed on one physical site or distributed across multiple physical sites. The execution device 210 may use data in the data storage system 150 or call program code in the data storage system 150 to implement the speech enhancement method or the method of training a neural network of the embodiments of the present application.

The execution device 210 can build a target neural network through the above process, and the target neural network can be used for voice enhancement or voice recognition processing, etc.

The user may operate respective user devices (e.g., local device 401 and local device 402) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.

The local devices of each user may interact with the enforcement device 210 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof.

In one implementation, the local device 401 and the local device 402 acquire relevant parameters of the target neural network from the execution device 210, deploy the target neural network on the local device 401 and the local device 402, and perform voice enhancement or voice recognition and the like by using the target neural network.

In another implementation, the execution device 210 may directly deploy a target neural network, and the execution device 210 performs speech enhancement or other types of speech processing on the speech to be enhanced according to the target neural network by acquiring the images to be processed from the local device 401 and the local device 402.

The execution device 210 may also be referred to as a cloud device, and in this case, the execution device 210 is generally deployed in the cloud.

The execution device 110 in fig. 2 described above can execute the speech enhancement method of the embodiments of the present application, the training device 120 in fig. 2 described above can execute the steps of the method for training a neural network of the embodiments of the present application, and the CNN models shown in fig. 3 and fig. 4 and the chip shown in fig. 5 can also be used to execute the steps of the speech enhancement method and the method for training a model of the embodiments of the present application. The speech enhancement method and the method for training a model according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 7 is a schematic flowchart of a speech enhancement method according to an embodiment of the present application.

As shown in fig. 7, a speech enhancement method provided in an embodiment of the present application may include the following steps:

701. The speech to be enhanced and a reference image are acquired.

The method can acquire the voice to be enhanced through a multi-channel microphone array and also can acquire the voice to be enhanced through a single audio channel (hereinafter, referred to as single channel).

Single-channel speech enhancement uses only time-domain and frequency-domain information, whereas microphone-array speech enhancement uses spatial-domain information in addition to time-domain and frequency-domain information. Because the time-domain and frequency-domain information plays a leading role in sound source separation and the spatial-domain information plays only an auxiliary role, the speech to be enhanced in the solution provided in this application may be acquired through a single-channel microphone.

It should be noted that obtaining the speech to be enhanced through a single audio channel is a preferred solution provided in the embodiments of the present application. Single-channel speech enhancement has relatively low hardware cost requirements, can form a versatile solution, and is widely applied in various products. However, a complex environment may limit the effect of a single-channel acoustic probabilistic model, and the task of single-channel speech enhancement is more difficult. The solution provided in this application can provide visual information for the acoustic model to enhance the effect of the speech noise reduction model. With the development of the fifth generation mobile communication technology (5G), video calls and cameras are increasingly widely used in 5G smart homes, so the single-channel speech enhancement method provided in this application can be widely applied in the near future.

The reference image in the technical solution provided in this application may be acquired by a camera, a video camera, or another device capable of recording images or videos. The following describes the acquisition of the speech to be enhanced and the reference image with reference to several exemplary scenarios to which the present application may be applied. It should be noted that the following exemplary scenarios are only examples of possible applicable scenarios of the solution provided in the present application, and do not represent all scenarios to which the solution provided in the present application may be applicable.

Scene one: video and voice call

Fig. 8 is a schematic diagram of an application scenario of a solution provided in an embodiment of the present application. As shown in a in fig. 8, device A and device B are in a video voice call. Device A and device B may each be a mobile phone, a tablet, a notebook computer, or an intelligent wearable device. Assuming that device A adopts the solution provided in this application, during the video voice call between device A and device B, the sound acquired by device A is the speech to be enhanced, and the speech to be enhanced at this time may include the voice of the user of device A and the noise of the surrounding environment. The image acquired by device A is the reference image, and the reference image at this time may be an image of the area at which the camera lens of device A is aimed. For example, if the user of device A aims the camera at his or her face (it should be noted that when this application does not emphasize the difference between a camera lens and a camera, the two express the same meaning and both represent a device for recording images or videos), the reference image at this time is the face of the user of device A. Alternatively, if the user of device A does not aim the camera at himself or herself but at the surrounding environment during the video voice call, the reference image at this time is the surrounding environment.

The technical solution provided in this application performs speech enhancement in combination with image information, in particular image information of a face, so a better speech enhancement effect can be achieved when the camera is aimed at the face. To help users better experience the speech enhancement effect brought by the solution provided in this application, in a specific scenario the user may be prompted to aim the camera at the face to obtain a better speech enhancement effect. Fig. 8 b is a schematic diagram of another applicable scenario of the solution provided in this application. Taking device A as an example, it is assumed that device A adopts the solution provided in this application. During a video voice call with device B, a text prompt may be displayed in the window of the video session. For example, as shown in b in fig. 8, during the video call, the text "Aim the camera at the face for a better voice effect" is displayed in the video window, or "Please aim the camera at the face" or "Speech enhancement is in progress, please aim the camera at the face" is displayed. Alternatively, as shown in c in fig. 8, during the video call, if device A detects that the user has aimed the camera at the face, no prompt is given; when it is detected that the user of device A does not aim the camera at the face but at the environment during the video call, a text prompt is displayed in the video window, for example, "Aim the camera at the face for a better voice effect" or "Please aim the camera at the face". It should be noted that after the user knows this function, the user may choose to turn off the text prompt. That is, after the user knows that aiming the camera at the face during a video voice call achieves a better speech enhancement effect, the user may actively turn off the text prompt function; alternatively, it may be preset that a device adopting this solution displays the text prompt only during the first video voice call.

Scene two: meeting recording

Fig. 9 is a schematic diagram of another applicable scenario provided in an embodiment of the present application. At present, coordinating the work of multiple people through conferences is an important means of improving work efficiency. In order to be able to trace back the conference content, recording what each speaker says during the conference and organizing the conference record has become a basic requirement. Recording the speakers' utterances and organizing the conference record can currently be done in a number of ways, for example, manual shorthand by a secretary, or first recording the whole conference with a recording device such as a voice recorder and then manually organizing the recorded content into a conference record after the conference. However, both of these approaches are inefficient because of the need for manual intervention.

Introducing speech recognition technology into a conference system brings convenience to the organization of conference records. For example, in the conference system, the speech content of the participants is recorded by a recording device, and speech recognition software recognizes the speech content of the participants, so that a conference record can be further formed, which greatly improves the efficiency of organizing conference records. The solution provided in this application can be applied to the conference recording scenario to further improve the speech recognition effect. In this scenario, assuming that A is speaking at the conference, the speech content of A may be recorded, and images are synchronously acquired while the speech content of A is being recorded. The speech content of A is the speech to be enhanced, and the speech to be enhanced may include the pure speech of A and other noise generated in the conference; the synchronously captured image is the reference image, which in a preferred embodiment is a face image of A. In some practical situations, the photographer may not photograph the face of A throughout the speech of A; in that case, other non-face images acquired during the speech of A may also be regarded as reference images in this solution.

In another scenario, assuming that three persons A, B, and C are speaking at the conference, the speech content of at least one of A, B, and C can be selected for enhancement. For example, when the speech content of A is selected for enhancement, a face image of A may be synchronously captured during A's speech. In this case, the speech content of A is the speech to be enhanced, the speech to be enhanced may include the pure speech of A and other noise generated in the conference (for example, the other noise may be the speech content of B or the speech content of C), and the synchronously captured face image of A is the reference image. The same applies when the speech content of B or of C is selected for enhancement: the speech content of the selected speaker is the speech to be enhanced, the speech to be enhanced may include the pure speech of the selected speaker and other noise generated in the conference (for example, the speech content of the other two speakers), and the synchronously captured face image of the selected speaker is the reference image. Similarly, when the speech content of any two of A, B, and C is selected for enhancement (for example, A and B, B and C, or A and C), the face images of the two selected speakers may be synchronously captured during their speech; the speech content of the two selected speakers is the speech to be enhanced, the speech to be enhanced may include the pure speech of the two selected speakers and other noise generated in the conference (for example, the speech content of the remaining speaker), and the synchronously captured face images of the two selected speakers are the reference images. When the speech content of A, B, and C is all selected for enhancement, the face images of A, B, and C may be synchronously captured during their speech; the speech content of A, B, and C is the speech to be enhanced, the speech to be enhanced may include the pure speech of A, the pure speech of B, the pure speech of C, and other noise generated in the conference (such as sound made by conferees other than A, B, and C, or other environmental noise), and the synchronously captured face images of A, B, and C are the reference images.

Scene three: voice interaction with wearable devices

The wearable device referred to here is a portable device that can be worn directly on the body or integrated into the user's clothing or accessories, for example, a smart watch, a smart band, or smart glasses. Input methods and semantic understanding based on speech recognition are widely applied to wearable devices. Although touch is still the main mode of communication between people and wearable devices at present, the screens of wearable devices are generally small and the communication between people and wearable devices mainly involves simple and direct tasks, so voice is bound to become the next-generation information entry of wearable devices; it frees people's fingers and makes communication between people and wearable devices more convenient and natural. However, these devices are generally used in relatively complex acoustic environments with various kinds of sudden noise interference. For example, communication between people and mobile phones or wearable devices often takes place on the street or in a shopping mall, where very noisy background noise exists. A complex noise environment usually causes a significant reduction in the speech recognition rate, and a reduced recognition rate means that these devices cannot accurately understand the user's instructions, which greatly degrades the user experience. The solution provided in this application can also be applied to the scenario of voice interaction with a wearable device. As shown in fig. 10, when acquiring a voice instruction of a user, the wearable device may synchronously acquire a face image of the user, and according to the solution provided in this application, the voice instruction of the user is subjected to speech enhancement, so that the wearable device can better recognize the user's instruction and make a corresponding response. In this scenario, the voice instruction of the user can be regarded as the speech to be enhanced, and the synchronously acquired face image is regarded as the reference image.

Scene four: voice interaction with smart home

A smart home (home automation) uses the home as a platform and integrates facilities related to home life by means of integrated wiring technology, network communication technology, security precaution technology, automatic control technology, and audio and video technology, so as to build an efficient management system for home facilities and family schedule affairs, improve home safety, convenience, comfort, and artistry, and realize an environmentally friendly and energy-saving living environment. For example, a smart home may include a smart lighting system, smart curtains, a smart television, a smart air conditioner, and the like. As shown in fig. 11, a user may send a voice control instruction to the smart home directly, or send a voice control instruction to the smart home through another device, for example, remotely through a device such as a mobile phone. At this time, an image of a preset area can be obtained through the smart home or the other device. For example, when a user sends a voice control instruction to a smart home through a mobile phone, the mobile phone can acquire an image captured at that moment. In such a scenario, the voice control instruction sent by the user is the speech to be enhanced, and the synchronously captured image is the reference image. In a specific implementation scenario, when no face is detected in the preset area, a voice prompt may be issued to prompt the user to aim the camera at the face, for example, the prompt "Speech enhancement is in progress, please aim the camera at the face" is issued.

702. A first enhancement signal of the speech to be enhanced is output according to the first neural network.

The first neural network is a neural network obtained by training mixed data of speech and noise with an ideal floating value mask (IRM) as a training target.

Time-frequency masking is a common target of speech separation. Common time-frequency masks include the ideal binary mask and the ideal floating value mask, which can significantly improve the intelligibility and perceptual quality of the separated speech; once the time-frequency masking target is estimated, the time-domain waveform of the speech can be synthesized by an inverse transformation technique without considering phase information. Illustratively, the definition of the ideal floating value mask in the Fourier transform domain is given below:

IRM(t, f) = [ Ps(t, f) / (Ps(t, f) + Pn(t, f)) ]^(1/2)

where Ys(t, f) is the short-time Fourier transform coefficient of the clean speech in the mixed data, Yn(t, f) is the short-time Fourier transform coefficient of the noise in the mixed data, Ps(t, f) is the energy density corresponding to Ys(t, f), and Pn(t, f) is the energy density corresponding to Yn(t, f).
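Assuming the IRM form given above, the following Python sketch (with placeholder signals instead of a real corpus) computes the ideal floating value mask from the short-time Fourier transforms of the clean speech and the noise:

import numpy as np
from scipy.signal import stft

fs = 16000
clean = np.random.randn(fs)                    # stand-in for a clean speech signal
noise = np.random.randn(fs)                    # stand-in for a noise signal

_, _, Ys = stft(clean, fs=fs, nperseg=512)     # Ys(t, f)
_, _, Yn = stft(noise, fs=fs, nperseg=512)     # Yn(t, f)

Ps = np.abs(Ys) ** 2                           # energy density of the clean speech
Pn = np.abs(Yn) ** 2                           # energy density of the noise
irm = np.sqrt(Ps / (Ps + Pn + 1e-12))          # ideal floating value mask, values in [0, 1]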

The definition of the ideal floating value mask in the Fourier transform domain is given above. It should be noted that a person skilled in the art, having learned the solution provided in this application, can easily conceive that other speech separation targets may also be used as the training target of the first neural network. For example, short-time Fourier transform masking, implicit time-frequency masking, or the like may also be used as the training target of the first neural network. In other words, if, in the prior art, the signal-to-noise ratio of the output signal of a neural network can be obtained at any time after the neural network performs speech separation on mixed data of speech and noise, then the training target adopted by that neural network may be adopted here.

The speech may be pure speech or clean speech, that is, speech that does not contain any noise. The mixed data of speech and noise refers to noise-added speech, that is, speech obtained by adding noise with a preset distribution to the clean speech. In this embodiment, the clean speech and the noise-added speech are used as the speech for training.

Specifically, when the noise-added speech is generated, a plurality of noise-added speech samples corresponding to one clean speech sample may be obtained by adding noises with different distributions to the clean speech. For example: noise of a first distribution is added to clean speech 1 to obtain noise-added speech 1, noise of a second distribution is added to clean speech 1 to obtain noise-added speech 2, noise of a third distribution is added to clean speech 1 to obtain noise-added speech 3, and so on. Through the foregoing noise-adding process, a plurality of data pairs of clean speech and noise-added speech can be obtained, for example: {clean speech 1, noisy speech 1}, {clean speech 1, noisy speech 2}, {clean speech 1, noisy speech 3}, and so on.

In the actual training process, a plurality of clean speech samples may be obtained first, and a plurality of noises with different distributions are added to each clean speech sample, so that a large number of {clean speech, noisy speech} data pairs are obtained. These data pairs are used as the speech for training. For example, 500 sentences may be selected from mainstream newspapers, magazines, and other media so as to cover as many vocalizations as possible, and then 100 different persons are selected to read them aloud to serve as clean speech signals (that is, the clean speech corresponding to the simulated noisy speech). Then, 18 kinds of common daily noise, such as noise in public places, traffic, working scenes, and coffee shops, are selected and cross-synthesized with the clean speech signals to obtain noisy speech signals (equivalent to the simulated noisy speech). The clean speech signals and the noisy speech signals are paired one by one to be used as labeled data. The data are randomly shuffled, 80% of the data are selected as a training set to train the neural network model, and the other 20% are selected as a validation set to verify the result of the neural network model. The finally trained neural network model corresponds to the first neural network in this embodiment of the application.
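The data preparation described above can be sketched roughly as follows (the signals, the signal-to-noise ratio, and the corpus sizes are placeholders, not the 500-sentence/100-speaker/18-noise setup itself):

import random
import numpy as np

rng = np.random.default_rng(0)
clean_utterances = [rng.standard_normal(16000) for _ in range(10)]   # stand-in clean speech
noises = [rng.standard_normal(16000) for _ in range(3)]              # stand-in noise types

def mix(clean, noise, snr_db):
    # scale the noise so that the mixture has the requested signal-to-noise ratio
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    return clean + gain * noise

pairs = [(c, mix(c, n, snr_db=5)) for c in clean_utterances for n in noises]
random.shuffle(pairs)                                 # randomly shuffle the labeled data
split = int(0.8 * len(pairs))
train_set, valid_set = pairs[:split], pairs[split:]   # 80% training, 20% validation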

After the training of the first neural network is finished, during speech enhancement the speech to be enhanced is converted into a two-dimensional time-frequency signal and input into the first neural network, and the first enhancement signal of the speech to be enhanced is obtained.

The speech signal to be enhanced may be subjected to time-frequency conversion in a short-time Fourier transform (STFT) manner, so as to obtain the two-dimensional time-frequency signal of the speech to be enhanced. It should be noted that, in the present application, the time-frequency transformation is sometimes referred to as feature transformation; when the distinction between the two is not emphasized, they have the same meaning. Likewise, the two-dimensional time-frequency signal is sometimes referred to as the frequency domain feature, and when the distinction is not emphasized, they also have the same meaning. This is illustrated below, assuming that the expression of the speech to be enhanced is as follows:

y(t)=x(t)+n(t)

wherein y(t) represents the time-domain signal of the speech to be enhanced at time t, x(t) represents the time-domain signal of the clean speech at time t, and n(t) represents the time-domain signal of the noise at time t. The STFT transform of the speech to be enhanced can be expressed as follows:

Y(t,d)=X(t,d)+N(t,d), t=1,2,...,T; d=1,2,...,D

wherein Y(t, d) represents the frequency-domain representation of the speech to be enhanced in the t-th acoustic feature frame and the d-th frequency band, X(t, d) represents the frequency-domain representation of the clean speech in the t-th acoustic feature frame and the d-th frequency band, and N(t, d) represents the frequency-domain representation of the noise in the t-th acoustic feature frame and the d-th frequency band. T and D respectively represent the total number of acoustic feature frames and the total number of frequency bands of the signal to be enhanced.
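For illustration, the two-dimensional time-frequency signal Y(t, d) can be obtained from the waveform y(t) with an off-the-shelf STFT routine; in the sketch below, the 16 kHz sampling rate and the 25 ms / 10 ms windowing are assumed values, not parameters fixed by this embodiment.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                   # assumed sampling rate
y = np.random.randn(fs)                      # placeholder for one second of the speech to be enhanced y(t)

# 25 ms windows with a 10 ms hop: one acoustic feature frame every 10 ms
freqs, times, Y = stft(y, fs=fs, nperseg=400, noverlap=240)
print(Y.shape)                               # (D, T): D frequency bands by T acoustic feature frames
```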

Note that the method of performing feature transformation on a speech signal is not limited to the STFT; other methods, such as the Gabor transform and the Wigner-Ville distribution, may be used in some other embodiments. Any existing manner of performing feature transformation on a speech signal to obtain its two-dimensional time-frequency signal can be applied to the embodiment of the present application. In a specific embodiment, in order to accelerate the convergence of the neural network, the frequency domain features obtained after feature transformation may be further normalized; for example, the mean value may be subtracted and the result divided by the standard deviation to obtain normalized frequency domain features. In a specific embodiment, the normalized frequency domain features may be used as the input of the first neural network to obtain the first enhancement signal. Taking a long short-term memory network (LSTM) as an example, this may be represented by the following formula:

LSTM(g(α_j)) ≈ Ps(a, j)/(Ps(a, j) + Pn(a, j))

To the right of the above equation is the training target IRM, which has been introduced above. In this formula, Ps(a, j) represents the energy spectrum (also called the energy density) of the clean signal at time j, and Pn(a, j) represents the energy spectrum of the noise signal at time j. The left side of the above equation represents the approximation of the training target by the neural network. α_j represents the input of the neural network at time j, which in this embodiment may be the frequency domain feature, and g() represents a functional relationship; for example, it may be the functional relationship obtained by normalizing the input of the neural network (subtracting the mean and dividing by the standard deviation) and then performing a logarithmic transformation.
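The sketch below shows one way the IRM training target and the normalized input feature g(α_j) could be computed under the same assumed STFT front end; the exact form of the IRM (with or without a square root) and the order of the log and normalization steps inside g() are assumptions for the example, so this is only an illustrative realization.

```python
import numpy as np
from scipy.signal import stft

def irm_target(clean, noise, fs=16000, nperseg=400, noverlap=240):
    """Ideal floating value mask computed from the clean-speech and noise energy densities."""
    _, _, Ys = stft(clean, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, Yn = stft(noise, fs=fs, nperseg=nperseg, noverlap=noverlap)
    Ps, Pn = np.abs(Ys) ** 2, np.abs(Yn) ** 2          # energy densities Ps(t, f) and Pn(t, f)
    return Ps / (Ps + Pn + 1e-12)                      # some IRM variants take the square root of this ratio

def network_input(noisy, fs=16000, nperseg=400, noverlap=240):
    """One possible realization of g(): log-magnitude feature, then mean/standard-deviation normalization."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg, noverlap=noverlap)
    feat = np.log(np.abs(Y) + 1e-12)
    return (feat - feat.mean()) / (feat.std() + 1e-12)
```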

It should be noted that the LSTM is only an example; the first neural network of the present application may be any time-sequential model, that is, a model that provides a corresponding output at each time step, so as to ensure the real-time behavior of the model. After the first neural network is trained, its weights can be frozen, that is, the weight parameters of the first neural network are kept unchanged, so that training the second neural network or other neural networks does not affect the performance of the first neural network. This also ensures that, in the absence of the visual modality (that is, when the reference image does not include face information or lip information), the model can still produce an output based on the first neural network alone, which guarantees the robustness of the model.
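Freezing the weights of the first neural network can be done, for example in a PyTorch-style framework, by disabling gradients on its parameters; the placeholder LSTM below only stands in for the trained audio model and its sizes are assumptions.

```python
import torch.nn as nn

first_net = nn.LSTM(input_size=201, hidden_size=256, batch_first=True)  # placeholder for the trained audio model

# freeze: keep the weight parameters of the first neural network unchanged
for p in first_net.parameters():
    p.requires_grad = False
first_net.eval()
# any optimizer later built for the second or third network simply excludes these parameters
```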

703. And outputting a masking function of the reference image according to the second neural network.

The masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the fact that the frequency band energy is smaller than the preset value indicates that the voice to be enhanced corresponding to the reference image is noise, and the fact that the frequency band energy is not smaller than the preset value indicates that the voice to be enhanced corresponding to the reference image is clean voice. The second neural network is a neural network obtained by training an image including lip features corresponding to a sound source of a voice used by the first neural network with an Ideal Binary Mask (IBM) as a training target.

From a physiological point of view, the volume, timbre and other characteristics of the same utterance differ from person to person, so the time spectrum of each pronounced sound differs, but the energy distribution is the same. The energy distribution of a pronunciation can be regarded as the original audio normalized with respect to factors such as the speaker and the volume, which is also why a syllable can be guessed from the formants of the audio. We therefore model the energy distribution of the clean signal and fit this energy distribution with an image of the human mouth. In fact, it is difficult to fit the energy distribution directly from the mouth image: human voice is determined not only by the mouth shape but also by the shape of the resonant cavity inside the oral cavity and the position of the tongue, and the mouth image cannot accurately reflect these factors, so a video of the same mouth shape can correspond to different voices, i.e., mouth shapes and voices cannot be mapped one to one. Therefore, we design a weak correlation (weak reference) mode and convert the original fine distribution into a coarse distribution by means of binarization, so that it is easier to fit at the image end. The coarse distribution characterizes whether the mouth shape corresponds to pronunciation in a certain set of frequency bands. In this way, the mapping relationship between the frequency band energy of the image and the frequency band energy of the speech is established through the second neural network; specifically, an association is established between the energy of each frequency band of the image frame at each moment and the energy of each frequency band of the acoustic feature frame at each moment.

The training target of the second neural network and the data used for training will be described below.

The training target IBM of the second neural network is a sign-type (symbolic) function, the definition of which is illustrated by the following expression:

IBM(a_j) = 1, if dist(a_j) - threshold ≥ 0; IBM(a_j) = 0, if dist(a_j) - threshold < 0

wherein the dist function is an energy distribution function, whose meaning is explained as follows.

wherein j is time j, or the moment at which the duration of the j-th frame ends. Each frame may include a plurality of frequency bands, for example k frequency bands, where k indexes the k-th frequency band of the clean speech at time j and k is a positive integer. How many frequency bands each time instant includes may be preset; for example, one time instant may include 4 frequency bands, or 5 frequency bands, which is not limited in this embodiment of the application. Ps(a_kj) refers to the energy spectrum of the k-th frequency band of the clean signal at time j. Thus dist(a_j) characterizes the distribution of the audio energy over the k frequency bands corresponding to time j. The threshold is a predetermined value, which in one embodiment may typically be 10^-5. If the difference between dist(a_j) and the threshold is greater than or equal to 0, that is, dist(a_j) is not smaller than the threshold, dist(a_j) is considered speech-dominant (or it cannot be judged whether it is speech-dominant or noise-dominant), and the corresponding function value is set to 1. If the difference between dist(a_j) and the threshold is less than 0, that is, dist(a_j) is smaller than the threshold, dist(a_j) is considered noise-dominant, and the corresponding function value is set to 0.
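A sketch of this binarized target is given below; the thresholding follows the description above, while the normalization of the per-band energies inside dist() is an assumption (the text only states that dist characterizes how the clean-signal energy is distributed over the k bands).

```python
import numpy as np

def ibm_target(Ps_bands, threshold=1e-5):
    """Ps_bands: array of shape (K, T) holding the clean-signal band energies Ps(a_kj),
    band k (rows) by time j (columns). Returns a 0/1 target of the same shape."""
    # assumed form of dist(): band energy normalized by the total energy of the frame
    dist = Ps_bands / (Ps_bands.sum(axis=0, keepdims=True) + 1e-12)
    # 1: speech dominant (or undecided), 0: noise dominant
    return (dist - threshold >= 0).astype(np.float32)
```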

The training data of the second neural network is the image including lip features corresponding to the sound source of the speech adopted by the first neural network. For example, in step 702 it is mentioned that 500 sentences from mainstream newspapers, magazines and similar media may be selected, covering as many vocalizations as possible, and read aloud by 100 different persons to serve as the clean speech signals (i.e. the clean speech corresponding to the simulated noisy speech); accordingly, the training data of the second neural network may include face images of those 100 persons, mouth images of those 100 persons, or images of those 100 persons that contain the face, such as upper-half-body images. It should be noted that the training data of the second neural network does not only include the images including lip features corresponding to the sound source of the speech adopted by the first neural network; it may also include some image data that does not include lip features, or data that does not include a face image.

The following is a specific explanation in conjunction with the following formulas.

v represents the training data, which has already been described above and is not repeated here. sigmoid is the activation function defined as sigmoid(x) = 1/(1 + e^(-x)); its output represents the energy of each frequency band of the image at each moment, and through a neural network (for example the LSTM used in the formula above), the value of the sigmoid output is made to approximate the value of dist(a_j) - threshold. f() represents a feature extraction function. It should be noted that the sigmoid is only used for illustration; other activation functions may also be adopted to approximate the training target in the embodiment of the present application.

Furthermore, in a particular embodiment, the processed image frames of the second neural network may be time-series aligned with the acoustic feature frames of the first neural network. Through this time series alignment, the data output by the second neural network and the data output by the first neural network at the same time are guaranteed to correspond to each other in the subsequent flow. For example, assume that there is a segment of video that includes 1 image frame and 4 acoustic feature frames. The multiple relationship between the number of image frames and the number of acoustic feature frames may be determined by resampling the segment of video at preset frame rates, for example, resampling the image data included in the segment of video at a frame rate of 40 frames/s for the image frames and resampling the audio data included in the segment of video at a frame rate of 10 frames/s for the acoustic feature frames. In this video, 1 image frame is temporally aligned with 4 acoustic feature frames; in other words, the duration of the 1 image frame is aligned with the duration of the 4 acoustic feature frames. In the scheme, the first neural network processes the 4 acoustic feature frames, the second neural network processes the 1 image frame, and the processed image frames of the second neural network are aligned with the acoustic feature frames of the first neural network in time series, so that, in this example, the 4 acoustic feature frames remain temporally aligned with the 1 image frame both during and after the processing of the first neural network and the second neural network. Moreover, according to the scheme provided by the application, after the 1 image frame is subjected to time alignment processing by the second neural network, 4 image frames respectively corresponding to the 4 acoustic feature frames can be obtained, and the masking functions corresponding to these 4 image frames are output. A specific description of a time series alignment manner according to an embodiment of the present application is given below.

In a specific embodiment, the speech to be enhanced includes a first acoustic feature frame whose corresponding time is indicated by a first time index, the image includes a first image frame, and the first image frame is input data of the second neural network. Outputting the masking function of the image according to the second neural network then includes: outputting, according to the second neural network, the masking function corresponding to the first image frame at a first moment, where the first moment is indicated by a multiple of the first time index and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame, so that the first moment is the moment corresponding to the first acoustic feature frame. For example, in the above formula, m represents this multiple and is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame. For example, if the frame rate of the first acoustic feature frame is 10 frames/s and the frame rate of the first image frame is 40 frames/s, the ratio of the two frame rates is 1/4 (10/40), and m in the above formula is 4. For another example, if the frame rate of the first acoustic feature frame is 25 frames/s and the frame rate of the first image frame is 50 frames/s, the ratio of the two frame rates is 1/2 (25/50), and m in the above formula is 2. For a clearer explanation of the time series alignment, the following takes m = 4 as an example and is further explained in conjunction with fig. 12. Fig. 12 is a schematic diagram of time series alignment according to an embodiment of the present application. As shown in fig. 12, the white boxes represent the image frames input to the second neural network; fig. 12 shows 4 input image frames. Assuming that the duration of one input image frame is the same as the duration of 4 acoustic feature frames, that is, m = 4, then after the time-series alignment processing of the second neural network, each input image frame corresponds to 4 processed image frames, and the duration of each of these processed image frames is the same as the duration of one acoustic feature frame. As shown in fig. 12, the black boxes represent the image frames after the time alignment processing of the second neural network, and the second neural network outputs the masking functions of these processed image frames; since fig. 12 includes 16 processed image frames in total, the masking functions corresponding to the 16 processed image frames are output. These 16 processed image frames are temporally aligned with 16 acoustic feature frames; in other words, 1 image frame represented by a white box is temporally aligned with 4 acoustic feature frames, and 1 image frame represented by a black box is temporally aligned with 1 acoustic feature frame.
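One simple way to realize this alignment, sketched below, is to let the second neural network produce one masking-function vector per input image frame and then expand it m times along the time axis so that it lines up with the acoustic feature frames; the repetition-based expansion and the array shapes are illustrative assumptions.

```python
import numpy as np

def align_image_outputs(image_outputs, m=4):
    """image_outputs: (N_img, K) masking-function values, one row per input image frame.
    Each row is repeated m times so that every input image frame covers m acoustic
    feature frames, giving an output of shape (N_img * m, K)."""
    return np.repeat(image_outputs, m, axis=0)

aligned = align_image_outputs(np.random.rand(4, 8), m=4)   # 4 input image frames -> 16 aligned frames
print(aligned.shape)                                        # (16, 8)
```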

After the training of the second neural network is finished, during speech enhancement the reference image is input into the second neural network to obtain the masking function of the reference image. In the actual implementation process, some preprocessing may be performed on the reference image and the preprocessed reference image is input into the second neural network; for example, the reference image may be resampled to a specified image frame rate. Face feature extraction may be performed on the reference image to obtain a face image, for instance through a face feature extraction algorithm. Face feature extraction algorithms include recognition algorithms based on face feature points, recognition algorithms based on the whole face image, template-based recognition algorithms, and the like; for example, face detection may be based on a face feature point detection algorithm. The face features may also be extracted through a neural network, for example a convolutional neural network model such as face detection based on a multitask convolutional neural network. The face image obtained after face feature extraction can be used as the input of the second neural network. The second neural network may further process the face image, for example extract the image frames corresponding to the motion features of the human mouth and perform time-series alignment processing on those image frames.
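As one possible realization of the face-detection preprocessing mentioned above, a classical detector can be used to crop the face region before it is fed to the second neural network; the OpenCV Haar-cascade detector below is only one of the recognition algorithms listed, chosen for the sketch.

```python
import cv2

def crop_face(frame_bgr):
    """Return the largest detected face region of a video frame, or None if no face is found."""
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return frame_bgr[y:y + h, x:x + w]
```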

704. And determining a second enhanced signal of the voice to be enhanced according to the first enhanced signal and the operation result of the masking function.

This embodiment can output the first enhancement signal through the first neural network and output the masking function of the reference image through the second neural network. Because the second neural network establishes the mapping relationship between the frequency band energy of the image and the frequency band energy of the speech, the masking function can indicate whether the frequency band energy of the reference image is smaller than a preset value; frequency band energy smaller than the preset value indicates that the speech to be enhanced corresponding to the reference image is noise, and frequency band energy not smaller than the preset value indicates that the speech to be enhanced corresponding to the reference image is clean speech. The second enhancement signal of the speech to be enhanced, determined from the result of the operation of the first enhancement signal and the masking function, can achieve a better speech enhancement effect than the first enhancement signal alone, i.e., better than a scheme that enhances speech with only a single neural network. For example, assume that, for a first frequency band included in the speech to be enhanced at a certain time, the first neural network outputs a signal-to-noise ratio A for the first frequency band, where A indicates that the first neural network judges the first frequency band to be speech-dominant, and the second neural network outputs a frequency band energy B for the first frequency band, where B is smaller than the preset value, that is, B indicates that the second neural network judges the first frequency band to be noise-dominant. A mathematical operation is then performed on A and B, for example one or more of addition, multiplication or squaring, and the result of this operation determines the proportions of A and B in the finally output second enhancement signal. Specifically, the principle of the operation between the first enhancement signal and the masking function is that the practical meaning of the masking function is to measure whether a certain frequency band has enough energy. When the first enhancement signal output by the first neural network and the masking function output by the second neural network are inconsistent, this can be reflected as:

the value output by the second neural network is small while the value output by the first neural network is large, which corresponds to the case where the first neural network (audio end) considers that a certain frequency band (such as the first frequency band) has enough energy to form a pronunciation, while the second neural network (video end) considers that the mouth shape of the person cannot be producing the corresponding sound;

the value output by the second neural network is large while the value output by the first neural network is small, which corresponds to the case where the first neural network (audio end) considers that a certain frequency band (such as the first frequency band) does not have enough energy to form a pronunciation, while the second neural network (video end) considers that the mouth shape of the person is making some possible sound;

Through the operation of the first enhancement signal and the masking function, the inconsistent parts are scaled down to smaller values while the consistent parts are kept unchanged, yielding a new, fused second enhancement signal in which the frequency band energy of unvoiced bands, or of bands where audio and video disagree, is compressed to a small value.

As can be seen from the embodiment corresponding to fig. 7, the first neural network is used to output the first enhancement signal of the speech to be enhanced, and the second neural network is used to model the association relationship between the image information and the speech information, so that the masking function of the reference image output by the second neural network can indicate that the speech to be enhanced corresponding to the reference image is noise or speech. Through the technical scheme provided by the application, the image information can be applied to the voice enhancement process, and in relatively noisy environments, the voice enhancement capability can be well improved, and the auditory sense is improved.

The above embodiment corresponding to fig. 7 describes that the second enhancement signal of the speech to be enhanced can be determined from the first enhancement signal and the result of the operation of the masking function. A preferred scheme is given below, in which a second enhancement signal of the speech to be enhanced is determined by a third neural network, and specifically, the second enhancement signal is determined according to a weight value output by the third neural network. The weight value indicates an output ratio of the first enhancement signal and a modification signal in the second enhancement signal, the modification signal being a result of an operation of the masking function and the first enhancement signal. The third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network with the IRM as a training target.

Fig. 13 is a schematic flow chart of another speech enhancement method according to an embodiment of the present application.

As shown in fig. 13, another speech enhancement method provided in the embodiment of the present application may include the following steps:

1301. and acquiring the voice to be enhanced and the reference image.

Step 1301 can be understood with reference to step 701 in the corresponding embodiment of fig. 7, and details are not repeated here.

1302. A first enhancement signal of the speech to be enhanced is output according to the first neural network.

Step 1302 may be understood with reference to step 702 in the corresponding embodiment of fig. 7, and will not be repeated herein.

1303. And outputting a masking function of the reference image according to the second neural network.

Step 1303 can be understood with reference to step 703 in the corresponding embodiment of fig. 7, and is not repeated here.

In a specific embodiment, the method may further include: it is determined whether the reference image includes face information. And if the reference image is determined to comprise the face information, outputting a masking function of the reference image according to the second neural network.

1304. And determining a second enhanced signal according to the weight value output by the third neural network.

And determining a second enhanced signal according to the weight value output by the third neural network by taking the first enhanced signal and the masking function as input data of the third neural network. The weight value indicates an output ratio of the first enhancement signal and a modification signal in the second enhancement signal, the modification signal being a result of an operation of the masking function and the first enhancement signal. The third neural network is a neural network obtained by training the output data of the first neural network and the output data of the second neural network with the IRM as a training target.

The third neural network is trained on the output data of the first neural network and the output data of the second neural network; specifically, it is trained on multiple groups of first enhancement signals output by the first neural network during training and multiple groups of masking functions output by the second neural network during training. Since the second neural network aligns the image frames with the acoustic feature frames of the first neural network in time series in step 1302, the output of the first neural network and the output of the second neural network received by the third neural network at the same time are time-aligned data. The third neural network may be trained on the operation result of the first enhancement signal and the masking function; the mathematical operation between the first enhancement signal and the masking function has been described above and is not repeated here. The application does not limit the type of the third neural network. For example, taking the third neural network as an LSTM, and taking the mathematical operation between the first enhancement signal and the masking function as a multiplication, the third neural network is trained on the output data of the first neural network and the output data of the second neural network to output a weight (gate), which can be represented by the following formula:

gate=LSTM(IBM×IRM)

In the above step 701, several specific scenarios to which the present solution may be applied are mentioned, wherein the reference image may include face information, specifically an image including face information at the sound source of the speech to be enhanced. In some scenarios, the reference image may also be unrelated to face information, e.g., unrelated to the image at the sound source. As noted above, the training data of the second neural network not only includes the images including lip features corresponding to the sound source of the speech adopted by the first neural network, but may also include some image data that does not include lip features or data that does not include a face image. Therefore, in different scenarios, whether the output of the second neural network is combined for speech enhancement, and, if so, what proportions the output of the second neural network and the output of the first neural network take in the finally output second enhancement signal, are both determined by the weight output by the third neural network. Illustratively, taking the mathematical operation between the first enhancement signal and the masking function as a multiplication, the second enhancement signal may be represented by the following formula, where IRM' represents the second enhancement signal:

IRM'=gate×(IBM×IRM)+(1-gate)×IRM

Since the output of the second neural network is not completely accurate, part of the first enhancement signal may be scaled erroneously. We therefore add the third neural network, which, through the weight, retains the parts it is confident about, while the inconclusive parts are filled by the first enhancement signal. This design also ensures that, when the visual modality is not detected (namely, the reference image is not detected to include face information or lip information), the weight can be set to 0, so that IRM' = IRM, that is, the second enhancement signal equals the first enhancement signal. In this way, the scheme provided by the application can maintain good speech enhancement performance under different conditions.
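Putting the last two formulas together, the third neural network can be sketched as a small gating module; the layer sizes, the LSTM gate network and the class name below are illustrative assumptions rather than the exact architecture of the embodiment.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Third neural network: learns a gate from the product of the two branch outputs and
    mixes the corrected signal with the first enhancement signal, IRM' = gate*(IBM*IRM) + (1-gate)*IRM."""
    def __init__(self, num_bands=201, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(num_bands, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, num_bands)

    def forward(self, irm, ibm):
        corrected = ibm * irm                     # operation result of the masking function and the first enhancement signal
        seq, _ = self.lstm(corrected)
        gate = torch.sigmoid(self.proj(seq))      # per-band weight in [0, 1]
        return gate * corrected + (1.0 - gate) * irm

net = FusionNet()
out = net(torch.rand(1, 50, 201), torch.randint(0, 2, (1, 50, 201)).float())  # (batch, frames, bands)
```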

In a specific embodiment, the correction signal is determined according to the result of multiplying M signal-to-noise ratios by the masking function at a first time, where M is a positive integer, the first enhancement signal output by the first neural network at the first time includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is the masking function output by the second neural network at the first time. This process is illustrated below in conjunction with fig. 14. Fig. 14 is a schematic flow chart of another speech enhancement method according to an embodiment of the present application. As shown in fig. 14, a distribution curve of the frequencies of a segment of speech to be enhanced is given; the speech to be enhanced at the first time includes one acoustic feature frame, and this frame includes 4 frequency bands. It should be noted that the first time may be any time corresponding to the speech to be enhanced, and the 4 frequency bands are only for illustration: how many frequency bands each time instant includes may be preset, for example 4 or 5, which is not limited in this embodiment of the application. Suppose the signal-to-noise ratios of the 4 frequency bands are 0.8, 0.5, 0.1 and 0.6, respectively. The second neural network outputs the masking functions of the 4 frequency bands corresponding to the reference image at the first time, because the second neural network aligns the image frames with the acoustic feature frames of the first neural network in time series, which is not repeated here. Assume that the masking functions corresponding to the 4 frequency bands are 1, 1, 0 and 1, respectively. The correction signal then comprises 4 frequency bands, whose energies are 0.8 (1×0.8), 0.5 (1×0.5), 0 (0×0.1) and 0.6 (1×0.6), respectively.
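Using the numbers of this example, the correction signal is simply the element-wise product of the per-band signal-to-noise ratios and the masking function; the short check below reproduces the values 0.8, 0.5, 0 and 0.6.

```python
import numpy as np

snr = np.array([0.8, 0.5, 0.1, 0.6])      # first enhancement signal: M = 4 bands at the first time
mask = np.array([1.0, 1.0, 0.0, 1.0])     # masking function output by the second neural network
correction = snr * mask
print(correction)                          # [0.8 0.5 0.  0.6]
```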

With the implementation provided by the application, the scheme can support streaming decoding, and the upper bound of the delay is the duration of one acoustic feature frame. Taking the duration of one acoustic feature frame as 10 ms as an example, the theoretical upper bound of the time delay of the output second enhanced speech is 10 ms under the scheme provided by the application. Because the second neural network outputs the masking function according to the time corresponding to the acoustic feature frame (see the description of time series alignment above, which is not repeated here), the third neural network, once it receives the first enhancement signal corresponding to one acoustic feature frame, can process the first enhancement signal and the masking function corresponding to the same time and output the second enhancement signal at that time. Since the speech to be enhanced can be processed frame by frame, the second enhancement signal can also be played frame by frame. In other words, since the speech to be enhanced may be processed frame by frame with the acoustic feature frame as the unit, and the second neural network likewise outputs the masking function according to the time corresponding to the acoustic feature frame, the third neural network may output the second enhancement signal with the acoustic feature frame as the unit; therefore, in the scheme provided in this application, the upper bound of the theoretical time delay is the duration of one acoustic feature frame.
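A sketch of the resulting streaming loop is shown below: each acoustic feature frame and its time-aligned masking-function frame are pushed through the networks as soon as they arrive, so the theoretical delay is one frame; the run_first and run_third callables are placeholders standing in for the trained first and third neural networks.

```python
def stream_enhance(acoustic_frames, aligned_mask_frames, run_first, run_third):
    """Process the speech frame by frame: one enhanced frame is emitted per input acoustic feature frame."""
    for frame, mask in zip(acoustic_frames, aligned_mask_frames):
        irm = run_first(frame)              # first neural network on one acoustic feature frame
        yield run_third(irm, mask)          # third neural network fuses it with the time-aligned mask
```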

For a better understanding of the solution provided by the present application, it is described below in connection with fig. 15.

Fig. 15 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application. Assume that there is a segment of video that includes speech to be enhanced and a reference picture. And after the voice to be enhanced is subjected to characteristic transformation to obtain the frequency domain characteristic corresponding to the voice to be enhanced, inputting the frequency domain characteristic into the first neural network. As shown in fig. 15, it is assumed that the segment of speech to be enhanced is sampled into 3 segments of audio, and each segment of audio is subjected to feature transformation and includes 4 frames of acoustic feature frames, i.e., the input of the first neural network in fig. 15. The reference image is assumed to be resampled according to the ratio of the preset frame rate of the image frame to the frame rate of the acoustic feature frame, and it is determined that every 4 frames of the acoustic feature frame correspond to 1 frame of the image frame. After the second neural network performs the time alignment processing on the 1 frame image frame, 4 frames of image frames corresponding to the 4 frames of acoustic feature frames may be output, that is, the output of the second neural network in fig. 15. The first enhancement signal corresponding to the 4 frames of acoustic feature frames output by the first neural network and the masking function corresponding to the 4 frames of image frames output by the second neural network may be sequentially input to the third neural network, and the third neural network may output the second enhancement signal corresponding to the 4 frames of acoustic feature frames, that is, the output of the third neural network in fig. 15. And then, performing feature inverse transformation on the second enhanced signal to obtain a time domain enhanced signal of the voice to be enhanced.

After the third neural network is trained, when speech enhancement is performed, the first enhancement signal and the masking function may be used as input data of the third neural network, and a second enhancement signal is determined according to a weight output by the third neural network.

In a specific embodiment, after the training of the third neural network, during speech enhancement the method may further include performing an inverse feature transform on the result output by the third neural network to obtain a time-domain signal. For example, if the frequency domain feature obtained by the short-time Fourier transform of the speech to be enhanced is the input of the first neural network, an inverse short-time Fourier transform (ISTFT) may be performed on the second enhancement signal output from the third neural network to obtain the time-domain signal.
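If the forward feature transform is the scipy STFT used in the earlier sketches, the time-domain signal can be recovered with the matching inverse routine; the all-ones mask below is only a placeholder for the second enhancement signal, and the window parameters must match those of the forward transform.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
y = np.random.randn(fs)                                   # placeholder noisy waveform
_, _, Y = stft(y, fs=fs, nperseg=400, noverlap=240)

second_enhancement = np.ones_like(np.abs(Y))              # placeholder for the mask output by the third network
_, enhanced = istft(Y * second_enhancement, fs=fs, nperseg=400, noverlap=240)
```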

As can be seen from the embodiments corresponding to fig. 7 and fig. 15, the training data of the second neural network may also include some image data that does not include lip features, or data that does not include a face image. It should be noted that, in some specific embodiments, the training data of the second neural network may instead include only image data that includes lip features or data that includes face images. In some specific embodiments, it may first be determined whether the reference image includes face information or lip information; if it does not, the enhancement signal of the speech to be enhanced is output only according to the first neural network, and if it does, the enhancement signal of the speech to be enhanced is output according to the first neural network, the second neural network and the third neural network. Referring to fig. 16, fig. 16 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application. The system first judges whether the reference image includes face information or lip information. If not, the enhancement signal of the speech to be enhanced is determined according to the first enhancement signal output by the first neural network, that is, the second enhancement signal is the first enhancement signal. If the system determines that the reference image includes face information or lip information, the second enhancement signal is determined by the third neural network according to the masking function output by the second neural network and the first enhancement signal output by the first neural network; how the second enhancement signal is determined according to the third neural network has been described in detail above and is not repeated here.
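The decision logic of fig. 16 amounts to a simple dispatch, sketched below; the has_face predicate and the three network callables are placeholders standing in for the face/lip detection step and the trained networks.

```python
def enhance(speech_features, reference_image, first_net, second_net, third_net, has_face):
    """Choose the enhancement path depending on whether the reference image shows a face or lips."""
    irm = first_net(speech_features)        # first enhancement signal
    if not has_face(reference_image):       # no face or lip information detected
        return irm                          # second enhancement signal = first enhancement signal
    ibm = second_net(reference_image)       # masking function of the reference image
    return third_net(irm, ibm)              # second enhancement signal from the third neural network
```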

The flow of the speech enhancement method provided by the embodiment of the application comprises an application flow and a training flow. The application process provided by the present application is introduced above, specifically, a speech enhancement method is introduced, and the training process provided by the present application is introduced below, specifically, a method for training a neural network is introduced.

The present application provides a method of training a neural network for speech enhancement, the method may include: training data is obtained, which may include mixed data of speech and noise and corresponding images at the source of the speech, which may include lip features. And training the mixed data by taking the ideal floating value mask IRM as a training target to obtain a first neural network, wherein the trained first neural network is used for outputting a first enhancement signal of the voice to be enhanced. The method comprises the steps of training an image by taking an ideal binary masking IBM as a training target to obtain a second neural network, wherein the trained second neural network is used for outputting a masking function of a reference image, the masking function indicates whether the frequency band energy of the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value and indicates that the frequency band of a voice to be enhanced corresponding to the reference image is noise, and the operation result of a first enhancement signal and the masking function is used for determining a second enhancement signal of the voice to be enhanced.

In a specific embodiment, the reference image is a corresponding image at the sound source of the speech to be enhanced, which may include lip features.

In a specific embodiment, the operation result of the first enhancement signal and the masking function is used to determine a second enhancement signal of the speech to be enhanced, which may include: and taking the first enhancement signal and the masking function as input data of a third neural network, determining a second enhancement signal according to a weight output by the third neural network, wherein the weight indicates the output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

In a specific embodiment, the method may further comprise: it is determined whether the image may include face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0, and the output proportion of the first enhancement signal is one hundred percent.

In a specific embodiment, the modification signal is the result of the multiplication of the first enhancement signal and the masking function.

In a specific embodiment, the correction signal is determined according to a result of a multiplication operation of M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, the first enhancement signal output by the first neural network at the first time may include M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.

In a specific embodiment, the speech to be enhanced may include a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the image may include a first image frame, the first image frame is input data of a second neural network, and outputting a masking function of the image according to the second neural network may include: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.

In a specific embodiment, the method may further include: and performing feature transformation on the voice to be enhanced to obtain the frequency domain features of the voice to be enhanced. The method may further comprise: and performing characteristic inverse transformation on the second enhanced signal to obtain enhanced voice.

In a specific embodiment, the performing feature transformation on the speech to be enhanced may include: performing a short-time Fourier transform (STFT) on the speech to be enhanced. Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.

In a specific embodiment, the method may further include: the image is sampled, so that the frame rate of the image frame which can be included in the image is a preset frame rate.

In a specific embodiment, the lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on an image.

In a specific embodiment, the frequency band energy of the image is represented by an activation function, and the value of the activation function is made to approach IBM to obtain a second neural network.

In one particular embodiment, the speech to be enhanced is acquired through a single audio channel.

In one particular embodiment, the first mask is an ideal floating value mask IRM and the second mask is an ideal binary mask IBM.

The experimental data set used the Grid data set as the pure speech corpus: 32 speakers with 1000 utterances each, and the 32000 utterances were divided into a training set of 27000 (30 speakers, 900 each), a seen test set of 3000 (30 speakers, 100 each) and an unseen test set of 2000 (2 speakers, 1000 each). The CHiME background data set was divided into a training noise set and a common-environment test noise set at a ratio of 8:2, and the Audio Human noise was used as the human-voice environment test set. The primary comparison baselines are the acoustic-only model (AO), the Visual Speech Enhancement (VSE) model and the Looking to Listen (L2L) model. Experiments were evaluated primarily by PESQ scoring. The experimental data prove that the scheme provided by the application can use visual information to comprehensively improve the speech enhancement task over the range of -5 dB to 20 dB.

The speech enhancement method and the neural network training method according to the embodiments of the present application are described in detail above with reference to the accompanying drawings, and related apparatuses according to the embodiments of the present application are described in detail below. It should be understood that the related device can perform the speech enhancement method and the steps of neural network training of the embodiments of the present application, and the repetitive description is appropriately omitted when describing the related device.

Fig. 17 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application;

in one particular embodiment, the speech enhancement apparatus comprises: the obtaining module 1701 is configured to obtain a speech to be enhanced and a reference image, where the speech to be enhanced and the reference image are data obtained simultaneously. The audio processing module 1702 is configured to output a first enhancement signal of the speech to be enhanced according to a first neural network, where the first neural network is obtained by training mixed data of speech and noise with a first mask as a training target. And an image processing module 1703, configured to output a masking function of the reference image according to a second neural network, where the masking function indicates whether frequency band energy corresponding to the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value, and indicates that a frequency band of a to-be-enhanced voice corresponding to the reference image is noise, and the second neural network is a neural network obtained by training an image including a lip feature corresponding to a voice source used by the first neural network with a second mask as a training target. And the comprehensive processing module 1704 is configured to determine a second enhancement signal of the speech to be enhanced according to the first enhancement signal and an operation result of the masking function.

In a specific embodiment, the reference image is an image including lip features corresponding to a sound source of the speech to be enhanced.

In a specific embodiment, the comprehensive processing module 1704 is specifically configured to: and taking the first enhancement signal and the masking function as input data of a third neural network, determining a second enhancement signal according to a weight output by the third neural network, wherein the weight indicates the output proportion of the first enhancement signal and a correction signal in the second enhancement signal, the correction signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network by taking the first mask as a training target.

In a particular embodiment, the apparatus further comprises a feature extraction module, configured to determine whether the reference image includes face information or lip information. When the reference image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0 and the output proportion of the first enhancement signal is one hundred percent.

In a specific embodiment, the modification signal is the result of the multiplication of the first enhancement signal and the masking function.

In a specific embodiment, the correction signal is determined according to a multiplication result of M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, a first enhancement signal output by the first neural network at the first time includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.

In a specific embodiment, the speech to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the reference image includes a first image frame, the first image frame is input data of a second neural network, and the image processing module 1703 is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.

In a specific embodiment, the performing feature transformation on the speech to be enhanced may include: performing a short-time Fourier transform (STFT) on the speech to be enhanced. Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.

In a specific embodiment, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that the reference image may include is a preset frame rate.

In a specific embodiment, the lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on a reference image.

In a specific embodiment, the frequency band energy of the reference image is represented by an activation function, and the value of the activation function is made to approach IBM to obtain a second neural network.

In one particular embodiment, the speech to be enhanced is acquired through a single audio channel.

In one particular embodiment, the first mask is an ideal floating value mask IRM and the second mask is an ideal binary mask IBM.

Fig. 18 is a schematic structural diagram of an apparatus for training a neural network according to an embodiment of the present disclosure.

The application provides a device for training a neural network, the neural network is used for speech enhancement, the device includes: an obtaining module 1801, configured to obtain training data, where the training data includes mixed data of voice and noise and a corresponding image including lip features at a sound source of the voice. And the audio processing module 1802 is configured to train the mixed data to obtain a first neural network by using the ideal floating value mask IRM as a training target, where the trained first neural network is used to output a first enhancement signal of the speech to be enhanced. An image processing module 1803, configured to train an image to obtain a second neural network by using an ideal binary masking IBM as a training target, where the trained second neural network is configured to output a masking function of a reference image, where the masking function indicates whether a frequency band energy of the reference image is smaller than a preset value, the frequency band energy is smaller than the preset value, and indicates that a frequency band of a to-be-enhanced speech corresponding to the reference image is noise, and an operation result of the first enhancement signal and the masking function is used to determine a second enhancement signal of the to-be-enhanced speech.

In a specific embodiment, the reference image is an image including lip features corresponding to a sound source of the speech to be enhanced.

In a specific embodiment, the method further comprises the following steps: the comprehensive processing module 1804 is configured to use the first enhancement signal and the masking function as input data of a third neural network, determine a second enhancement signal according to a weight output by the third neural network, where the weight indicates an output ratio of the first enhancement signal and a modification signal in the second enhancement signal, the modification signal is an operation result of the masking function and the first enhancement signal, and the third neural network is a neural network obtained by training output data of the first neural network and output data of the second neural network with the first mask as a training target.

In a particular embodiment, the apparatus further comprises a feature extraction module, configured to determine whether the image includes face information or lip information. When the image does not include face information or lip information, the weight value indicates that the output proportion of the correction signal in the second enhancement signal is 0 and the output proportion of the first enhancement signal is one hundred percent.

In a specific embodiment, the modification signal is the result of the multiplication of the first enhancement signal and the masking function.

In a specific embodiment, the correction signal is determined according to a multiplication result of M signal-to-noise ratios and a masking function at a first time, where M is a positive integer, a first enhancement signal output by the first neural network at the first time includes M frequency bands, each of the M frequency bands corresponds to one signal-to-noise ratio, and the masking function at the first time is a masking function output by the second neural network at the first time.

In a specific embodiment, the speech to be enhanced includes a first acoustic feature frame, a time corresponding to the first acoustic feature frame is indicated by a first time index, the image includes a first image frame, the first image frame is input data of a second neural network, and the image processing module 1803 is specifically configured to: and outputting a masking function corresponding to the first image frame at a first moment according to the second neural network, wherein the first moment is indicated by a multiple of the first time index, and the multiple is determined according to the ratio of the frame rate of the first acoustic feature frame to the frame rate of the first image frame.

In a specific embodiment, the performing feature transformation on the speech to be enhanced may include: performing a short-time Fourier transform (STFT) on the speech to be enhanced. Performing an inverse feature transform on the second enhanced signal may include: performing an inverse short-time Fourier transform (ISTFT) on the second enhancement signal.

In a specific embodiment, the feature extraction module is further configured to sample the reference image, so that a frame rate of an image frame that the reference image may include is a preset frame rate.

In a specific embodiment, the lip features are obtained by feature extraction on a face image, and the face image is obtained by face detection on a reference image.

In a specific embodiment, the frequency band energy of the reference image is represented by an activation function, and the value of the activation function is made to approach IBM to obtain a second neural network.

In one particular embodiment, the speech to be enhanced is acquired through a single audio channel.

In one particular embodiment, the first mask is an ideal floating value mask IRM and the second mask is an ideal binary mask IBM.

Fig. 19 is a schematic structural diagram of another speech enhancement apparatus according to an embodiment of the present application

Fig. 19 is a schematic block diagram of a speech enhancement apparatus according to an embodiment of the present application. The speech enhancement apparatus shown in fig. 19 includes a memory 1901, a processor 1902, a communication interface 1903, and a bus 1904. The memory 1901, the processor 1902, and the communication interface 1903 are communicatively connected to each other via the bus 1904.

The communication interface 1903 corresponds to the obtaining module 1701 in the speech enhancement apparatus, and the processor 1902 corresponds to the audio processing module 1702, the image processing module 1703 and the comprehensive processing module 1704 in the speech enhancement apparatus. Each component of the speech enhancement apparatus is described in detail below.

The memory 1901 may be a read-only memory (ROM), a static memory device, a dynamic memory device, or a random access memory (RAM). The memory 1901 may store a program; when the program stored in the memory 1901 is executed by the processor 1902, the processor 1902 and the communication interface 1903 are configured to perform the steps of the speech enhancement method of the embodiments of the present application. Specifically, the communication interface 1903 may acquire the data to be processed from a memory or another device, and the processor 1902 then performs speech enhancement on the acquired data.

The processor 1902 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the functions that are required to be executed by the modules in the speech enhancement apparatus according to the embodiment of the present application (for example, the processor 1902 may implement the functions that are required to be executed by the feature extraction module 902 and the detection module 903 in the speech enhancement apparatus), or execute the speech enhancement method according to the embodiment of the present application.

The processor 1902 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the speech enhancement method of the embodiment of the present application may be implemented by integrated logic circuits of hardware in the processor 1902 or by instructions in the form of software.

The processor 1902 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 1901; the processor 1902 reads information in the memory 1901 and completes, in combination with its hardware, the functions required to be executed by the modules included in the speech enhancement apparatus according to the embodiments of the present application, or executes the speech enhancement method according to the method embodiments of the present application.

The communication interface 1903 enables communication between the apparatus shown in Fig. 19 and other devices or a communication network by using a transceiver apparatus such as, but not limited to, a transceiver. For example, the data to be processed may be acquired through the communication interface 1903.

The bus 1904 may include a path that transfers information between various components of the apparatus (e.g., the memory 1901, the processor 1902, and the communication interface 1903).

Fig. 20 is a schematic structural diagram of another apparatus for training a neural network according to an embodiment of the present application.

Fig. 20 is a schematic hardware structural diagram of an apparatus for training a neural network according to an embodiment of the present application. Similar to the above-described apparatus, the apparatus for training a neural network shown in Fig. 20 includes a memory 2001, a processor 2002, a communication interface 2003, and a bus 2004. The memory 2001, the processor 2002, and the communication interface 2003 are communicatively connected to each other via the bus 2004.

The memory 2001 may store a program, and the processor 2002 is configured to execute the steps of the training method of the neural network according to the embodiment of the present application when the program stored in the memory 2001 is executed by the processor 2002.

The processor 2002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute the relevant programs to implement the neural network training method according to the embodiment of the present application.

The processor 2002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method of the neural network according to the embodiment of the present application may be implemented by hardware integrated logic circuits in the processor 2002 or instructions in the form of software.

It should be understood that the neural network trained by the apparatus for training a neural network shown in Fig. 20 can be used to perform the method of the embodiments of the present application.

Specifically, the apparatus shown in Fig. 20 may acquire training data and a neural network to be trained from the outside through the communication interface 2003, and the processor 2002 then trains the neural network to be trained according to the training data.

It should be noted that although the above-described apparatuses show only a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art will appreciate that the apparatuses may also include other devices necessary for normal operation. In addition, according to specific needs, those skilled in the art will appreciate that the apparatuses may also include hardware components for implementing other additional functions. Furthermore, those skilled in the art will appreciate that the apparatuses may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in Fig. 19 and Fig. 20.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that substantially contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
