Audio noise reduction method and training method of audio noise reduction model

Document No.: 972870. Publication date: 2020-11-03.

Reading note: this technology, "Audio noise reduction method and training method of audio noise reduction model", was designed by 胡诗超 and 赵伟峰 on 2020-07-09. The embodiment of the invention discloses an audio noise reduction method and a training method of an audio noise reduction model. According to the scheme, an audio signal to be denoised is acquired and processed to obtain spectral features; the spectral features are input into a convolutional network model to obtain spectral convolution features; the spectral convolution features are input into a recurrent network model to obtain a target spectrum or a target spectrum mask; and either the target spectrum is processed to obtain a target audio signal, or the audio signal to be denoised is processed using the target spectrum mask to obtain the target audio signal. By applying this neural network structure to the noise reduction of singing recorded in karaoke, the embodiments of the application effectively denoise noisy singing while preserving the original signal structure of the singing voice well, avoiding obvious audible distortion after noise reduction.

1. An audio noise reduction method, comprising:

acquiring an audio signal to be denoised;

processing the audio signal to be denoised to obtain spectral features;

inputting the spectral features into a convolutional network model for processing to obtain spectral convolution features;

inputting the spectral convolution features into a recurrent network model for processing to obtain a target spectrum or a target spectrum mask;

processing the target spectrum to obtain a target audio signal; or processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal.

2. The audio noise reduction method of claim 1, wherein the step of processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal comprises:

performing a calculation on the spectral features of the audio signal to be denoised according to the target spectrum mask, and generating a target spectrum from the calculated spectral features;

and processing the target spectrum to obtain a target audio signal.

3. The audio noise reduction method of claim 1, wherein the step of inputting the spectral convolution features into a recurrent network model for processing to obtain a target spectrum or a target spectrum mask comprises:

inputting the spectral convolution features into the recurrent network model to obtain spectral recurrent features;

and inputting the spectral recurrent features into a fully connected network to obtain the target spectrum or the target spectrum mask.

4. The audio noise reduction method of claim 1, wherein the step of processing the audio signal to be noise reduced to obtain spectral features comprises:

carrying out short-time Fourier transform on the time domain waveform of the audio signal to be denoised to obtain a transformed initial frequency spectrum;

and extracting amplitude characteristics and phase characteristics of the initial frequency spectrum.

5. The audio noise reduction method of any of claims 1-4, wherein the step of processing the target spectrum to obtain a target audio signal, or processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal, comprises:

calculating a target complex spectrum according to the target spectrum, the amplitude feature, and the phase feature; or calculating a target complex spectrum according to the target spectrum mask, the amplitude feature, and the phase feature;

and performing an inverse short-time Fourier transform on the target complex spectrum to generate the target audio signal.

6. The audio noise reduction method of claim 5, wherein the step of calculating a target complex spectrum according to the target spectrum, the amplitude feature, and the phase feature comprises:

calculating the target complex spectrum according to a first formula, wherein the first formula is as follows:

Ct = Yt_abs * exp(1j * Xt_phase)

where Yt_abs is the spectral amplitude of the target spectrum, and Xt_phase is the phase feature of the initial spectrum.

7. The audio noise reduction method of claim 5, wherein the step of calculating a target complex spectrum according to the target spectrum mask, the amplitude feature, and the phase feature comprises:

calculating the target complex spectrum according to a second formula, wherein the second formula is as follows:

Ct = Xt_abs * mask_t * exp(1j * Xt_phase)

where Xt_abs is the amplitude feature of the initial spectrum, Xt_phase is the phase feature of the initial spectrum, and mask_t is the target spectrum mask.

8. A method for training an audio noise reduction model, comprising:

acquiring a first audio corresponding to a target song containing noise and a second audio corresponding to a target song not containing noise;

acquiring amplitude features of the spectrum of the first audio and audio features of the second audio;

and training a preset audio noise reduction model according to the amplitude features of the spectrum of the first audio and the audio features of the second audio, wherein the preset audio noise reduction model comprises a multilayer convolutional neural network and a two-layer recurrent neural network.

9. The method for training an audio noise reduction model according to claim 8, wherein the step of training a preset audio noise reduction model according to the amplitude features of the spectrum of the first audio and the audio features of the second audio comprises:

inputting the amplitude features of the spectrum of the first audio into the preset audio noise reduction model to obtain audio features of a denoised third audio;

calculating an error between the audio features of the third audio and the audio features of the second audio;

and iteratively training the preset audio noise reduction model according to the error.

10. The method for training an audio noise reduction model according to claim 9, wherein the preset audio noise reduction model further comprises a deep neural network, and the step of inputting the amplitude features of the spectrum of the first audio into the preset audio noise reduction model to obtain the audio features of the denoised third audio comprises:

inputting the amplitude features of the spectrum of the first audio into the preset audio noise reduction model, and inputting the middle sequence feature or the last sequence feature of the time sequence of the recurrent neural network into the deep neural network to obtain the audio features of the denoised third audio.

Technical Field

The invention relates to the technical field of data processing, in particular to an audio noise reduction method and an audio noise reduction model training method.

Background

In recent years, the market for karaoke software on mobile terminals has grown steadily, with users spanning all ages and all levels of musical skill. In particular, with the spread of intelligent terminals such as smartphones and tablet computers, users can sing karaoke without leaving home. For example, after installing karaoke software on a smartphone, a user can sing a song without going to a KTV. One of the main use cases of such software is recording songs: two audio signals, the accompaniment and the human voice, are combined through signal processing into a single synthesized audio signal.

At present, when a karaoke app on the market is used for recording, the recording is limited by non-professional equipment and environments, so noise (microphone friction, ambient background noise, and the like) is easily mixed into the singing voice recorded by the user and greatly degrades the listening experience. It is therefore necessary to denoise the recorded singing voice. Existing singing voice noise reduction schemes are based on traditional digital signal processing: they estimate the noise spectrum using various frequency-domain and time-domain transforms, and then extract the clean voice signal from the recorded signal. For example, starting from the original noisy singing voice signal, a statistical signal processing method estimates the power spectrum and other characteristics of the noise component, and the denoised singing voice signal is then predicted from the original noisy signal using those estimated noise characteristics.

The applicant finds that traditional noise reduction methods are effective only for certain specific types of noise (such as stationary noise) and struggle to achieve satisfactory results on more complex and variable background noise (such as non-stationary noise). In addition, traditional singing voice noise reduction methods easily introduce distortion into the original voice signal during noise reduction.

Disclosure of Invention

The embodiment of the invention provides an audio noise reduction method and an audio noise reduction model training method, which can improve the noise reduction effect of audio.

The embodiment of the invention provides an audio noise reduction method, which comprises the following steps:

acquiring an audio signal to be denoised;

processing the audio signal to be denoised to obtain spectral features;

inputting the spectral features into a convolutional network model for processing to obtain spectral convolution features;

inputting the spectral convolution features into a recurrent network model for processing to obtain a target spectrum or a target spectrum mask;

processing the target spectrum to obtain a target audio signal; or processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal.

The embodiment of the invention also provides a training method of the audio noise reduction model, which comprises the following steps:

acquiring a first audio corresponding to a target song containing noise and a second audio corresponding to a target song not containing noise;

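acquiring amplitude features of the spectrum of the first audio and audio features of the second audio;

and training a preset audio noise reduction model according to the amplitude features of the spectrum of the first audio and the audio features of the second audio.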

According to the audio noise reduction scheme provided by the embodiment of the invention, an audio signal to be denoised is acquired and processed to obtain spectral features; the spectral features are input into a convolutional network model to obtain spectral convolution features; the spectral convolution features are input into a recurrent network model to obtain a target spectrum or a target spectrum mask; and either the target spectrum is processed to obtain a target audio signal, or the audio signal to be denoised is processed using the target spectrum mask to obtain the target audio signal. By applying this neural network structure to the noise reduction of singing recorded in karaoke, the embodiments of the application effectively denoise noisy singing while preserving the original signal structure of the singing voice well, avoiding obvious audible distortion after noise reduction.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1a is a first schematic flowchart of an audio noise reduction method according to an embodiment of the present invention;

Fig. 1b is a second schematic flowchart of an audio noise reduction method according to an embodiment of the present invention;

Fig. 1c is a schematic flowchart of a method for training an audio noise reduction model according to an embodiment of the present invention;

Fig. 1d is a schematic diagram of a first structure of a preset network model according to an embodiment of the present invention;

Fig. 1e is a schematic diagram of a second structure of the preset network model according to an embodiment of the present invention;

Fig. 2a is a schematic diagram of a first structure of an audio noise reduction device according to an embodiment of the present invention;

Fig. 2b is a schematic diagram of a second structure of the audio noise reduction device according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

An embodiment of the present invention provides an audio noise reduction method. The method may be executed by the audio noise reduction device provided in the embodiment of the present invention, or by a server integrating that device, where the audio noise reduction device may be implemented in hardware or software.

As shown in Fig. 1a, which is a first schematic flowchart of the audio noise reduction method according to an embodiment of the present invention, the specific flow of the method may be as follows:

101. Acquire an audio signal to be denoised.

In an embodiment, the audio signal to be denoised is an audio signal input by a user that has not yet undergone noise reduction, such as a song sung by the user. Further, the audio signal to be denoised may be an audio signal obtained by synthesizing accompaniment audio and dry vocal audio. The accompaniment audio may be the accompaniment corresponding to a target audio that the user is to sing. For example, a request is sent to a server according to an identifier of the target audio (song name, album name, singer, etc.), and the accompaniment audio returned by the server in response to the request is then received, where the accompaniment audio is the pure accompaniment part of the target audio.

The dry vocal audio may be audio input by the user, such as a human voice captured through a microphone of the terminal device while the accompaniment is playing. For example, when recording a song, the user obtains the accompaniment audio of the song according to the song name of the target audio, and then inputs, through the microphone, the dry vocal audio of the user's own cover of the song.

In other embodiments, the audio signal to be denoised may include one accompaniment audio track and multiple dry vocal audio tracks. For example, several users sing a target audio as a chorus: if the song lasts four minutes, user A sings the first two minutes and user B sings the last two minutes. In this case, after one accompaniment audio track is obtained according to the song name, a first dry vocal audio corresponding to the first two minutes input by user A and a second dry vocal audio corresponding to the last two minutes input by user B are each captured through a microphone, so as to obtain the audio signal to be denoised.

102. Process the audio signal to be denoised to obtain spectral features.

In an embodiment, an initial spectrum of the audio signal to be denoised may be obtained, and spectral features of the initial spectrum may then be extracted. The spectral features may include amplitude features and phase features. For example, after the initial spectrum is obtained, spectral power calculation and filtering are performed, and amplitude extraction and phase extraction are finally performed on the filtered signal.

In another embodiment, the audio signal may first be divided into frames to obtain at least one audio signal frame. During framing, the frame length and the interval between two adjacent frames can be set according to actual conditions. Each audio signal frame is then sampled into discrete form, and a short-time Fourier transform (STFT) is performed on the sampled data of each frame to obtain the spectral features corresponding to that frame. The short-time Fourier transform is a general-purpose tool for speech signal processing. It defines a very useful class of time-frequency distributions that specify the complex amplitude of any signal over time and frequency. Specifically, the acquired audio signal to be denoised may be converted into its time-domain waveform; in practice, the short-time Fourier transform is computed as a fast Fourier transform (FFT) of a series of windowed data frames, where the window "slides" or "hops" over time.
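By way of illustration only, the framing and windowed-FFT computation described above can be sketched as follows. The frame length (512), hop size (256), Hann window, sampling rate, and the synthetic test tone are all assumptions chosen for the example; the embodiment leaves these settings open.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Split x into overlapping frames, window each frame, and take its FFT.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    frame_len and hop are illustrative values, not taken from the embodiment.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    # Each row is one frame; consecutive frames start hop samples apart.
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(num_frames)])
    return np.fft.rfft(frames * window, axis=1)

# One second of a 440 Hz tone at 16 kHz, a synthetic stand-in for a recording.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

X = stft(x)                              # initial spectrum, one row per frame
peak_bin = int(np.argmax(np.abs(X[0])))  # strongest frequency bin in frame 0
```

Each bin spans sr / frame_len = 31.25 Hz in this setup, so the tone's energy concentrates around bin 14.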

103. Input the spectral features into a convolutional network model for processing to obtain spectral convolution features.

104. Input the spectral convolution features into a recurrent network model for processing to obtain a target spectrum or a target spectrum mask.

The audio noise reduction model provided in the embodiment of the present application includes a convolutional network model (CNN) and a recurrent network model (RNN), each of which may have multiple layers. That is, the audio noise reduction model may be a hybrid neural network structure based on a convolutional neural network and a recurrent neural network. The CNN can capture the spectral structure of the audio signal well, while the RNN can use preceding and following temporal information to make the corresponding spectrum prediction. Therefore, the CNN-RNN-based audio noise reduction method provided by the embodiment of the application can effectively denoise noisy singing voice while preserving the original signal structure of the singing voice well, avoiding obvious audible distortion after noise reduction.

In an embodiment, the audio noise reduction model may be trained in advance. After training is completed, the model is used for prediction, and the prediction result can be a target spectrum or a target spectrum mask, namely a MASK value of the target spectrum. Specifically, the output of the output layer of the audio noise reduction model during training may be the spectrum of the target audio signal or a MASK value of that spectrum.

105. Process the target spectrum to obtain a target audio signal; or process the audio signal to be denoised by using the target spectrum mask to obtain the target audio signal.

In an embodiment, when the target spectrum is acquired in step 104, a complex spectrum of the target audio signal may be generated according to the target spectrum amplitude and the phase characteristics of the initial spectrum. When the MASK value of the target spectrum is acquired in step 104, the complex spectrum of the target audio signal may be generated according to the MASK value of the target spectrum, the magnitude characteristic and the phase characteristic of the initial spectrum.

Further, after the complex spectrum of the target audio signal is obtained, an inverse short-time Fourier transform may be performed on it. The short-time Fourier transform converts a time-domain signal into the corresponding amplitude and phase at different frequencies, so the spectrum is the frequency-domain representation of the time-domain signal, and the inverse short-time Fourier transform converts the spectrum back into a time-domain signal. The time-domain waveform of the target audio signal is therefore obtained by performing the inverse short-time Fourier transform on the complex spectrum, which completes the noise reduction of the audio signal to be denoised.

In an embodiment, the processing the audio signal to be noise-reduced by using the target spectrum mask to obtain the target audio signal may include: calculating the spectrum characteristics of the audio signal to be denoised according to the target spectrum mask, generating a target spectrum according to the calculated spectrum characteristics, and processing the target spectrum to obtain a target audio signal.

Specifically, after obtaining the spectrum mask, the electronic device may process the audio signal to be denoised with it: the spectrum mask is multiplied element-wise with the spectral features of the audio signal to be denoised to generate a denoised spectrum, and the target audio signal is obtained from the denoised spectrum. After the target audio signal is obtained, the electronic device can cache it and perform corresponding operations according to user requirements.

In the embodiment of the application, the audio is denoised by a CNN-RNN hybrid neural network: the CNN first transforms and extracts features such as the corresponding spectra, the RNN then analyzes and extracts the temporal relationships among those features, and the MASK value or spectrum of the clean singing voice signal is predicted for each frame, finally yielding the denoised singing voice signal.

As described above, the audio noise reduction method provided by the embodiment of the present invention acquires an audio signal to be denoised and processes it to obtain spectral features; inputs the spectral features into the convolutional network model to obtain spectral convolution features; inputs the spectral convolution features into the recurrent network model to obtain a target spectrum or a target spectrum mask; and either processes the target spectrum to obtain a target audio signal, or processes the audio signal to be denoised using the target spectrum mask to obtain the target audio signal. By applying this neural network structure to the noise reduction of singing recorded in karaoke, the embodiments of the application effectively denoise noisy singing while preserving the original signal structure of the singing voice well, avoiding obvious audible distortion after noise reduction.

The method described in the previous examples is described in further detail below.

Referring to Fig. 1b, Fig. 1b is a second schematic flowchart of the audio noise reduction method according to an embodiment of the invention. The method comprises the following steps:

201. Acquire an audio signal to be denoised.

202. Perform a short-time Fourier transform on the time-domain waveform of the audio signal to be denoised to obtain a transformed initial spectrum.

For example, if the noisy waveform of the audio signal to be denoised is xn, a short-time Fourier transform can be performed on it to obtain the transformed initial spectrum stft(xn).

203. Extract amplitude features and phase features of the initial spectrum.

In an embodiment, a fast Fourier transform (FFT) is performed on the time-domain waveform signal to obtain a frequency-domain signal array, spectral power calculation and filtering are performed on the transformed signal, and amplitude extraction and phase extraction are finally performed on the filtered signal. For example, after the initial spectrum stft(xn) is acquired, the amplitude feature Xt_abs and the phase feature Xt_phase are extracted.
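As a minimal sketch of the amplitude and phase extraction, assuming the initial spectrum is held as a complex NumPy array (the small hand-made array below is hypothetical and merely stands in for stft(xn)), the two features are the modulus and the argument of each complex bin:

```python
import numpy as np

# A tiny hand-made "initial spectrum": 2 frames x 3 frequency bins (complex values).
X = np.array([[3 + 4j, 1 + 0j, 0 + 2j],
              [1 - 1j, -2 + 0j, 0.5 + 0.5j]])

Xt_abs = np.abs(X)       # amplitude feature
Xt_phase = np.angle(X)   # phase feature, in radians

# Amplitude and phase together carry the full complex spectrum.
X_rebuilt = Xt_abs * np.exp(1j * Xt_phase)
```

Because the pair (Xt_abs, Xt_phase) loses no information, a network can operate on the amplitude alone while the phase is set aside for later reconstruction.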

204. Input the spectral features into a convolutional network model for processing to obtain spectral convolution features.

205. Input the spectral convolution features into a recurrent network model for processing to obtain a target spectrum or a target spectrum mask.

The audio noise reduction model provided in the embodiment of the present application includes a convolutional network model and a recurrent network model. In an embodiment, the amplitude feature Xt_abs is input into the convolutional network model for processing to obtain spectral convolution features, and the spectral convolution features are then input into the recurrent network model for processing to predict the denoised singing voice features, which may be the spectral amplitude Yt_abs of the target audio signal or the MASK value mask_t of that spectral amplitude.

Specifically, when the audio noise reduction model outputs the spectrum of the target audio signal, the spectral amplitude Yt_abs is obtained as the audio feature; when the model outputs a MASK value of the target audio signal spectrum, the MASK value mask_t of the spectral amplitude is obtained as the audio feature.

In an embodiment, the target spectrum or target spectrum mask may be obtained through a fully connected network (FCN), whose role is to convert the RNN output features into spectral features whose dimensions match those of the short-time Fourier transform spectrum. That is, the step of inputting the spectral convolution features into a recurrent network model for processing to obtain a target spectrum or a target spectrum mask includes:

inputting the spectral convolution features into the recurrent network model to obtain spectral recurrent features;

and inputting the spectral recurrent features into a fully connected network to obtain the target spectrum or the target spectrum mask.
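The data flow through the three stages above (convolution, recurrence, fully connected output) can be sketched as a forward pass in plain NumPy. This is a hypothetical toy and not the model of the embodiment: the layer sizes, the single convolution layer along the frequency axis, the plain tanh recurrence, the sigmoid output, and the random untrained weights are all assumptions made only to show how the shapes change from stage to stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(Xt_abs, kernels):
    """1-D convolution (cross-correlation form) along frequency, then ReLU.

    Xt_abs: (T, F) amplitude features; kernels: (C, K) with odd K.
    Returns spectral convolution features of shape (T, C * F).
    """
    T, F = Xt_abs.shape
    C, K = kernels.shape
    padded = np.pad(Xt_abs, ((0, 0), (K // 2, K // 2)))
    out = np.zeros((T, C, F))
    for c in range(C):
        for k in range(K):
            out[:, c, :] += kernels[c, k] * padded[:, k : k + F]
    return np.maximum(out, 0.0).reshape(T, C * F)

def rnn_layer(feats, W_in, W_rec, b):
    """Plain tanh recurrent layer: each frame's state depends on the previous state."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for frame in feats:
        h = np.tanh(W_in @ frame + W_rec @ h + b)
        states.append(h)
    return np.stack(states)

def fc_layer(states, W_out, b_out):
    """Fully connected layer back to F bins; the sigmoid keeps the mask in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(states @ W_out.T + b_out)))

T, F, C, K, H = 10, 257, 4, 3, 32           # hypothetical sizes
Xt_abs = np.abs(rng.standard_normal((T, F)))  # stand-in amplitude features
kernels = rng.standard_normal((C, K)) * 0.1
W_in = rng.standard_normal((H, C * F)) * 0.01
W_rec = rng.standard_normal((H, H)) * 0.1
b = np.zeros(H)
W_out = rng.standard_normal((F, H)) * 0.1
b_out = np.zeros(F)

conv_feats = conv_layer(Xt_abs, kernels)            # spectral convolution features
rnn_feats = rnn_layer(conv_feats, W_in, W_rec, b)   # spectral recurrent features
mask_t = fc_layer(rnn_feats, W_out, b_out)          # one mask value per frame and bin
```

The fully connected layer is what restores the per-frame output to F values, matching the number of frequency bins of the short-time Fourier transform.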

206. Calculate a target complex spectrum according to the target spectrum, the amplitude feature, and the phase feature; or calculate the target complex spectrum according to the target spectrum mask, the amplitude feature, and the phase feature.

In an embodiment, if the spectral amplitude Yt_abs of the target audio signal is predicted, the complex spectrum of the target audio signal may be generated from that spectral amplitude and the phase feature of the initial spectrum. That is, the step of calculating the target complex spectrum according to the target spectrum, the amplitude feature, and the phase feature includes:

calculating the target complex spectrum according to a first formula, wherein the first formula is as follows:

Ct = Yt_abs * exp(1j * Xt_phase)

where Yt_abs is the spectral amplitude of the target spectrum, and Xt_phase is the phase feature of the initial spectrum.

In an embodiment, if the MASK value mask_t of the spectral amplitude of the target audio signal is predicted, the complex spectrum of the target audio signal may be generated from that MASK value together with the amplitude feature and the phase feature of the initial spectrum. That is, the step of calculating the target complex spectrum according to the target spectrum mask, the amplitude feature, and the phase feature includes:

calculating the target complex spectrum according to a second formula, wherein the second formula is as follows:

Ct = Xt_abs * mask_t * exp(1j * Xt_phase)

where Xt_abs is the amplitude feature of the initial spectrum, Xt_phase is the phase feature of the initial spectrum, and mask_t is the target spectrum mask.
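The first and second formulas can be written directly in NumPy, as the following sketch shows. The function names and the three-bin toy values are hypothetical and serve only to make the arithmetic visible. In both cases the phase of the noisy input is reused unchanged.

```python
import numpy as np

def complex_from_spectrum(Yt_abs, Xt_phase):
    """First formula: predicted amplitude combined with the noisy phase."""
    return Yt_abs * np.exp(1j * Xt_phase)

def complex_from_mask(Xt_abs, mask_t, Xt_phase):
    """Second formula: noisy amplitude scaled element-wise by the mask."""
    return Xt_abs * mask_t * np.exp(1j * Xt_phase)

# Toy values for one frame with three frequency bins (illustrative only).
Xt_abs = np.array([2.0, 1.0, 4.0])
Xt_phase = np.array([0.0, np.pi / 2, np.pi])
mask_t = np.array([1.0, 0.5, 0.25])

Ct_mask = complex_from_mask(Xt_abs, mask_t, Xt_phase)
# Using the masked amplitude as Yt_abs makes the first formula give the same result.
Ct_spec = complex_from_spectrum(Xt_abs * mask_t, Xt_phase)
```

With these values the masked amplitudes are 2, 0.5, and 1, placed at phases 0, pi/2, and pi, so the target complex spectrum is approximately [2, 0.5j, -1].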

207. Perform an inverse short-time Fourier transform on the target complex spectrum to generate a target audio signal.
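One common way to realize the inverse short-time Fourier transform is windowed overlap-add, sketched below under assumed settings (Hann window, frame length 512, hop 256). The embodiment does not prescribe a particular inversion procedure, and the random test signal merely stands in for a real target complex spectrum.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Forward transform used here only to produce a spectrum to invert."""
    window = np.hanning(frame_len)
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    return np.fft.rfft(frames * window, axis=1)

def istft(C, frame_len=512, hop=256):
    """Inverse STFT by windowed overlap-add, normalised by the summed squared window."""
    window = np.hanning(frame_len)
    frames = np.fft.irfft(C, n=frame_len, axis=1) * window
    length = frame_len + hop * (len(C) - 1)
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        y[i * hop : i * hop + frame_len] += frame
        norm[i * hop : i * hop + frame_len] += window ** 2
    # The floor avoids division by zero where the window tapers to zero at the edges.
    return y / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
Ct = stft(x)      # stand-in for a target complex spectrum (no denoising applied)
y = istft(Ct)     # time-domain waveform recovered from the complex spectrum
```

Away from the first and last half frame, where only one tapered window contributes, the recovered waveform matches the input to numerical precision.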

The audio noise reduction method based on the CNN-RNN neural network uses CNN feature transformation to extract the relevant clean singing voice features from the noisy singing voice features. In addition, the RNN learns how the input features change dynamically over time, so that information from the features of preceding and following frames informs the noise reduction prediction for the current frame. The proposed CNN-RNN hybrid neural network thus combines the CNN's strength in feature extraction with the RNN's strength in temporal analysis and makes good use of the strong correlation between audio frames, reducing the noise features while protecting the original clean vocal features. Compared with traditional noise reduction based on signal processing, this scheme can achieve better noise reduction while keeping the distortion of the denoised singing voice signal low.

As described above, the audio noise reduction method provided by the embodiment of the present invention acquires an audio signal to be denoised; performs a short-time Fourier transform on its time-domain waveform to obtain a transformed initial spectrum; extracts the amplitude feature and phase feature of the initial spectrum; inputs the spectral features into a convolutional network model to obtain spectral convolution features; inputs the spectral convolution features into a recurrent network model to obtain a target spectrum or a target spectrum mask; calculates a target complex spectrum according to the target spectrum, the amplitude feature, and the phase feature, or according to the target spectrum mask, the amplitude feature, and the phase feature; and performs an inverse short-time Fourier transform on the target complex spectrum to generate a target audio signal. By applying this neural network structure to the noise reduction of singing recorded in karaoke, the embodiments of the application effectively denoise noisy singing while preserving the original signal structure of the singing voice well, avoiding obvious audible distortion after noise reduction.

An embodiment of the present application further provides a method for training an audio noise reduction model, please refer to fig. 1c, which includes the following steps:

301. Acquire a first audio, corresponding to the target song with noise, and a second audio, corresponding to the target song without noise.

In one embodiment, the electronic device may be used to record a sufficient amount of clean singing voice and of noise separately. The clean singing voice and the noise are then mixed to obtain a noisy singing voice, which serves as the first audio, while the clean singing voice serves as the second audio. The noise may include several kinds of noise, such as human voice noise and outdoor noise, and the noise used to synthesize the first audio is selected according to actual requirements.

Further, when the clean singing voice and the noise are mixed to obtain the first audio, the first audio can be synthesized at different signal-to-noise ratios (SNRs) to cover noise of different loudness, where the signal-to-noise ratio is the ratio of signal to noise in an electronic device or electronic system. For example, the first audio is synthesized at SNRs of 10 dB, 5 dB, 0 dB, -5 dB, -10 dB, and so on, which is not further limited in this embodiment.
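The mixing step can be illustrated with a minimal NumPy sketch. It assumes that SNR is measured as the ratio of average signal power to average noise power over the whole clip; the function names `mix_at_snr` and `measured_snr_db`, the synthetic 220 Hz tone, and the 16 kHz sample rate are illustrative assumptions, not part of the described embodiment.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR (in dB).

    Assumption: SNR = 10 * log10(mean(clean^2) / mean(noise^2)).
    Returns the noisy mixture and the scaled noise that was added.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or trim the noise recording so it matches the clean length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise recording is silent")
    # Choose the gain so that clean_power / (gain^2 * noise_power) = 10^(snr/10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return clean + scaled_noise, scaled_noise

def measured_snr_db(clean, scaled_noise):
    """SNR in dB of a mixture whose components are known."""
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(scaled_noise ** 2))

# Hypothetical stand-ins for recorded material: a tone and white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
noise = rng.standard_normal(8000)

# One noisy "first audio" per SNR listed in the text.
mixes = {snr_db: mix_at_snr(clean, noise, snr_db) for snr_db in (10, 5, 0, -5, -10)}
```

Each entry of `mixes` holds a candidate first audio together with the scaled noise, so the achieved SNR can be checked with `measured_snr_db`.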

302. Acquire the amplitude feature of the first audio's spectrum and the audio feature of the second audio.

In an embodiment, let x be the time-domain noisy waveform of the first audio obtained by mixing in step 301. A short-time Fourier transform STFT(x) is applied to x to obtain the spectrum of the first audio, and the amplitude feature Xt is then extracted from that spectrum. Correspondingly, y is the time-domain waveform of the second audio, and the audio feature of the second audio is either its spectrum Yt or the corresponding mask value Mask_t.

That is, Yt = STFT(y).

Mask_t = abs(Xt) / abs(Yt); if Mask_t > 1, then Mask_t = 1.
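The feature extraction in step 302 can be sketched as follows. This is a hedged illustration, not the embodiment's implementation: the frame length of 512, hop of 128, and Hann window are assumed values, and `stft` and `training_pair` are hypothetical helper names. One further assumption is flagged loudly: the sketch computes the mask as the clean magnitude divided by the noisy magnitude, clipped to 1, which is the orientation under which multiplying the noisy spectrum by the mask yields the clean spectrum. The text above writes the ratio the other way round.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex spectrogram of shape (frames, n_fft // 2 + 1).

    Assumed parameters: Hann window, 512-sample frames, 128-sample hop.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def training_pair(noisy, clean, eps=1e-8):
    """Return (X_mag, Y_mag, mask) for one noisy/clean waveform pair.

    ASSUMPTION: mask = |Y| / |X| (clean over noisy), clipped to at most 1.
    `eps` avoids division by zero in silent bins.
    """
    x_mag = np.abs(stft(noisy))   # amplitude feature of the first audio
    y_mag = np.abs(stft(clean))   # spectrum magnitude of the second audio
    mask = np.minimum(y_mag / (x_mag + eps), 1.0)
    return x_mag, y_mag, mask

# Hypothetical data: a clean tone plus white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
noisy = clean + 0.1 * rng.standard_normal(16000)

X_mag, Y_mag, mask = training_pair(noisy, clean)
```

Either `Y_mag` or `mask` can then serve as the training target paired with the input `X_mag`.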

303. Train a preset audio noise reduction model according to the amplitude feature of the first audio's spectrum and the audio feature of the second audio, where the preset audio noise reduction model comprises a multilayer convolutional neural network and a two-layer recurrent neural network.

The preset neural network model comprises a multilayer convolutional neural network (CNN) and a two-layer recurrent neural network (RNN). Specifically, please refer to fig. 1d and fig. 1e, which are two schematic structural diagrams of the preset network model according to the embodiment of the present invention. Xt passes through the multilayer CNN and then through two layers of RNN or BiRNN, and the output layer produces the noise-reduced singing voice signal spectrum Yt' or its corresponding mask ratio value Mask_t'.

Specifically, the network model in fig. 1d is based on last-frame prediction: the model uses only the last sequence feature of the RNN's time sequence as the input of the DNN (Deep Neural Network), and therefore uses only the correlation with the preceding frames. The network model in fig. 1e is based on middle-frame prediction: the model uses only the middle sequence feature of the BiRNN's time sequence as the input of the DNN, which makes full use of the context on both sides.
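The difference between the two variants can be shown with a small NumPy sketch. It is a simplification under stated assumptions: a single plain tanh recurrent layer stands in for the two-layer RNN or BiRNN, random vectors stand in for the CNN output, and all names (`rnn_forward`, `select_last`, `select_middle`) and sizes are hypothetical.

```python
import numpy as np

def rnn_forward(seq, w_in, w_rec, reverse=False):
    """Plain tanh RNN. Returns hidden states of shape (time, hidden).

    With reverse=True the sequence is processed from the last frame
    to the first, as in the backward half of a bidirectional RNN.
    """
    steps = range(len(seq) - 1, -1, -1) if reverse else range(len(seq))
    h = np.zeros(w_rec.shape[0])
    states = np.zeros((len(seq), w_rec.shape[0]))
    for t in steps:
        h = np.tanh(seq[t] @ w_in + h @ w_rec)
        states[t] = h
    return states

def select_last(seq, params):
    """Fig. 1d style: feed the DNN the state at the last time step."""
    return rnn_forward(seq, *params["fwd"])[-1]

def select_middle(seq, params):
    """Fig. 1e style: feed the DNN the BiRNN state at the middle step."""
    fwd = rnn_forward(seq, *params["fwd"])
    bwd = rnn_forward(seq, *params["bwd"], reverse=True)
    mid = len(seq) // 2
    return np.concatenate([fwd[mid], bwd[mid]])

# Hypothetical sizes: 11 frames of 6 CNN features, 4 hidden units.
rng = np.random.default_rng(0)
n_frames, n_feat, n_hidden = 11, 6, 4
cnn_features = rng.standard_normal((n_frames, n_feat))

def init_weights():
    return (0.3 * rng.standard_normal((n_feat, n_hidden)),
            0.3 * rng.standard_normal((n_hidden, n_hidden)))

params = {"fwd": init_weights(), "bwd": init_weights()}

last_feature = select_last(cnn_features, params)      # past context only
middle_feature = select_middle(cnn_features, params)  # past and future context
```

In this sketch the forward half of `middle_feature` depends only on frames up to the middle one, while the backward half also reflects the frames that follow it, which is the sense in which the bidirectional variant uses context from both sides.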

The amplitude feature of the first audio's spectrum is input into the preset neural network model, and the middle sequence feature or the last sequence feature of the recurrent neural network's time sequence is input into the deep neural network to obtain the audio feature of the noise-reduced third audio.

In an embodiment, after the output layer produces the noise-reduced singing voice signal spectrum Yt' or its corresponding Mask_t', that output may be compared with the second audio's spectrum Yt or its Mask_t to calculate the error. The hyper-parameters of the model (e.g., learning rate, batch_size, and the hidden-layer or node sizes of the CNN or RNN) are then adjusted according to the error so that the weights in the model are sufficiently trained. That is, the step of training the preset neural network model according to the amplitude feature of the first audio's spectrum and the audio feature of the second audio comprises the following steps:

inputting the amplitude characteristic of the first audio frequency spectrum into the preset neural network model to obtain the audio frequency characteristic of the third audio frequency after noise reduction;

calculating an error between the audio features of the third audio and the audio features of the second audio;

and performing iterative training on the preset neural network model according to the error.

In an embodiment, training may be considered complete when the error falls below a preset value during iterative training, or when the number of iterations reaches a preset count, for example 100 rounds. The result is the trained preset neural network model.
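The two stopping conditions can be sketched as a generic loop. The example below is only an illustration: a toy linear model trained by gradient descent stands in for the CNN-RNN network, mean squared error stands in for the unspecified error measure, and the names `train` and `LinearModel` and the threshold of 1e-3 are assumptions.

```python
import numpy as np

def train(features, targets, model, error_threshold=1e-3, max_rounds=100):
    """Iterate until the error is below a preset value or a round limit is hit.

    Returns (rounds_run, error_history, reason).
    """
    history = []
    for round_idx in range(1, max_rounds + 1):
        outputs = model.predict(features)               # "third audio" features
        error = float(np.mean((outputs - targets) ** 2))  # assumed MSE error
        history.append(error)
        if error < error_threshold:
            return round_idx, history, "error below threshold"
        model.update(features, targets, outputs)
    return max_rounds, history, "round limit reached"

class LinearModel:
    """Toy stand-in for the preset network: outputs = features @ w."""

    def __init__(self, n_in, n_out, lr=0.25):
        self.w = np.zeros((n_in, n_out))
        self.lr = lr

    def predict(self, x):
        return x @ self.w

    def update(self, x, y, outputs):
        # Gradient-descent step on the squared error (up to a constant factor).
        grad = 2.0 * x.T @ (outputs - y) / len(x)
        self.w -= self.lr * grad

# Hypothetical data: targets generated by a known linear map.
rng = np.random.default_rng(0)
features = rng.standard_normal((256, 8))
targets = features @ rng.standard_normal((8, 4))

rounds, history, reason = train(features, targets, LinearModel(8, 4))
short_rounds, short_history, short_reason = train(
    features, targets, LinearModel(8, 4), max_rounds=3
)
```

The first call stops on the error criterion and the second, limited to three rounds, stops on the round limit, so the same loop covers both completion conditions described above.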

As described above, the method for training an audio noise reduction model according to the embodiment of the present invention may acquire a first audio corresponding to a target song containing noise and a second audio corresponding to the target song not containing noise, acquire the amplitude feature of the first audio's spectrum and the audio feature of the second audio, and train a preset audio noise reduction model according to the amplitude feature of the first audio's spectrum and the audio feature of the second audio, where the preset audio noise reduction model includes a multilayer convolutional neural network and a two-layer recurrent neural network. The embodiment of the application uses CNN feature transformation to extract clean singing-voice features from noisy singing-voice features. In addition, the network learns how the input features change over time and applies information from the preceding and following frames to the noise reduction prediction for the current frame. The resulting hybrid neural network combines the CNN's strength in feature extraction with the RNN's strength in time-series analysis, makes good use of the strong correlation between audio frames, and reduces noise features while protecting the original clean vocal features.

In order to implement the above audio noise reduction method, an embodiment of the present invention further provides an audio noise reduction device, which may be specifically integrated in a terminal device such as a mobile phone and a tablet computer.

For example, as shown in fig. 2a, it is a schematic structural diagram of an audio noise reduction apparatus provided in an embodiment of the present invention. The audio noise reduction apparatus may include:

a signal obtaining unit 301, configured to obtain an audio signal to be noise reduced.

In an embodiment, the audio signal to be denoised is an audio signal input by a user before denoising processing, such as a song sung by the user.

An extracting unit 302, configured to process the audio signal to be denoised to obtain a spectral feature.

A first processing unit 303, configured to input the spectrum feature into a convolution network model for processing, so as to obtain a spectrum convolution feature.

A second processing unit 304, configured to input the spectrum convolution feature into a cyclic network model for processing, so as to obtain a target spectrum or a target spectrum mask.

The audio noise reduction model in the embodiment of the application can be a hybrid neural network structure based on a convolutional neural network and a cyclic neural network. The CNN can well capture the frequency spectrum structure of the audio signal, and on the other hand, the RNN can use the front and rear time sequence information to perform related frequency spectrum prediction.

A third processing unit 305, configured to process the target spectrum to obtain a target audio signal; or processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal.

In an embodiment, when the target spectrum is acquired in the second processing unit 304, the complex spectrum of the target audio signal may be generated according to the target spectrum amplitude and the phase characteristic of the initial spectrum. When the MASK value of the target spectrum is acquired in the second processing unit 304, the complex spectrum of the target audio signal may be generated according to the MASK value of the target spectrum, the magnitude characteristic, and the phase characteristic of the initial spectrum.

Further, after obtaining the complex frequency spectrum of the target audio signal, the complex frequency spectrum may be subjected to short-time inverse fourier transform, so as to obtain a time-domain waveform of the target audio signal. Therefore, the noise reduction processing of the audio signal to be subjected to noise reduction is realized.
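The reconstruction path can be sketched in NumPy as follows. This is a hedged example rather than the device's actual implementation: the Hann window, 512-sample frames, 128-sample hop, and overlap-add inverse are assumed choices, and `stft`, `istft`, and `apply_network_output` are hypothetical names. The network output is passed in as an array, since the model itself is outside the scope of the sketch.

```python
import numpy as np

N_FFT, HOP = 512, 128  # assumed analysis parameters

def stft(x):
    """Complex spectrogram of shape (frames, N_FFT // 2 + 1)."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack(
        [x[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Inverse STFT by windowed overlap-add with window-energy normalization."""
    window = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * window
    length = N_FFT + HOP * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * HOP
        out[start:start + N_FFT] += frame
        norm[start:start + N_FFT] += window ** 2
    return out / np.maximum(norm, 1e-12)

def apply_network_output(noisy, output, output_is_mask):
    """Build the target complex spectrum and return the time-domain waveform.

    `output` is either a predicted magnitude spectrum or a predicted mask,
    with the same shape as the magnitude of the noisy spectrogram.
    """
    initial = stft(noisy)
    magnitude, phase = np.abs(initial), np.angle(initial)
    # Mask case: scale the noisy magnitude. Spectrum case: use it directly.
    target_mag = output * magnitude if output_is_mask else output
    # Reuse the phase of the initial (noisy) spectrum.
    target_complex = target_mag * np.exp(1j * phase)
    return istft(target_complex)

# Hypothetical input and trivial "network outputs" for a round-trip check.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noisy = 0.5 * np.sin(2.0 * np.pi * 220.0 * t) + 0.1 * rng.standard_normal(16000)

magnitude = np.abs(stft(noisy))
from_mask = apply_network_output(noisy, np.ones_like(magnitude), True)
from_spectrum = apply_network_output(noisy, magnitude, False)
```

With an all-ones mask, or with the noisy magnitude passed back as the target spectrum, the output reproduces the input away from the clip edges, which confirms that the magnitude and phase recombination and the inverse transform are consistent.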

The audio noise reduction device provided by the embodiment of the invention can acquire an audio signal to be denoised, process the audio signal to be denoised to obtain a spectrum feature, input the spectrum feature into the convolutional network model to obtain a spectrum convolution feature, input the spectrum convolution feature into the recurrent network model to obtain a target spectrum or a target spectrum mask, and either process the target spectrum to obtain a target audio signal or process the audio signal to be denoised using the target spectrum mask to obtain the target audio signal. By applying the neural network structure to noise reduction of singing voice recorded in karaoke, the embodiment of the application effectively denoises noisy singing voice while preserving the original signal structure of the singing voice, thereby avoiding obvious audible distortion after noise reduction.

In order to implement the above training method for the audio noise reduction model, an embodiment of the present invention further provides a training device for the audio noise reduction model, where the training device for the audio noise reduction model may be specifically integrated in a terminal device such as a mobile phone and a tablet computer.

For example, as shown in fig. 2b, it is a schematic structural diagram of a training apparatus for an audio noise reduction model according to an embodiment of the present invention. The training device of the audio noise reduction model can comprise:

an audio acquiring unit 401, configured to acquire a first audio corresponding to a target song that contains noise and a second audio corresponding to a target song that does not contain noise;

a feature obtaining unit 402, configured to obtain a magnitude feature of the first audio frequency spectrum and an audio feature of the second audio frequency;

a training unit 403, configured to train a preset audio noise reduction model according to the amplitude feature of the first audio frequency spectrum and the audio feature of the second audio frequency, where the preset audio noise reduction model includes a multilayer convolutional neural network and a two-layer cyclic neural network.

The training device of the audio noise reduction model provided by the embodiment of the application uses CNN feature transformation to extract clean singing-voice features from noisy singing-voice features. In addition, the network learns how the input features change over time and applies information from the preceding and following frames to the noise reduction prediction for the current frame. The resulting hybrid neural network combines the CNN's strength in feature extraction with the RNN's strength in time-series analysis, makes good use of the strong correlation between audio frames, and reduces noise features while protecting the original clean vocal features.

An embodiment of the present invention further provides a terminal, as shown in fig. 3, the terminal may include a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. Those skilled in the art will appreciate that the terminal structure shown in fig. 3 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

the RF circuit 601 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink messages from a base station and then processing the received downlink messages by one or more processors 608; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuit 601 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 601 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.

The memory 602 may be used to store software programs and modules, and the processor 608 executes various functional applications and information processing by operating the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 access to the memory 602.

The input unit 603 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 603 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 608, and can receive and execute commands sent by the processor 608. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 603 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.

The display unit 604 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 604 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 3 the touch-sensitive surface and the display panel are shown as two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.

The terminal may also include at least one sensor 605, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.

The audio circuit 606, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 606 may convert received audio data into an electrical signal and transmit it to the speaker, which converts the electrical signal into a sound signal for output. Conversely, the microphone converts a collected sound signal into an electrical signal, which the audio circuit 606 receives and converts into audio data. The audio data is then output to the processor 608 for processing and afterwards transmitted, for example, to another terminal via the RF circuit 601, or output to the memory 602 for further processing. The audio circuit 606 may also include an earphone jack to allow a peripheral headset to communicate with the terminal.

WiFi belongs to short-distance wireless transmission technology, and the terminal can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 607, and provides wireless broadband internet access for the user. Although fig. 3 shows the WiFi module 607, it is understood that it does not belong to the essential constitution of the terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.

The processor 608 is a control center of the terminal, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the handset. Optionally, processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.

The terminal also includes a power supply 609 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 608 via a power management system that may be used to manage charging, discharging, and power consumption. The power supply 609 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 608 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 608 runs the application programs stored in the memory 602, thereby implementing various functions:

acquiring an audio signal to be denoised;

processing the audio signal to be denoised to obtain a frequency spectrum characteristic;

inputting the frequency spectrum characteristics into a convolution network model for processing to obtain frequency spectrum convolution characteristics;

inputting the spectrum convolution characteristics into a circulating network model for processing to obtain a target spectrum or a target spectrum mask;

processing the target frequency spectrum to obtain a target audio signal; or processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the audio noise reduction method, and are not described herein again.

As can be seen from the above, the terminal according to the embodiment of the present invention may obtain an audio signal to be denoised, process the audio signal to be denoised to obtain a spectrum feature, input the spectrum feature into the convolutional network model to obtain a spectrum convolution feature, input the spectrum convolution feature into the recurrent network model to obtain a target spectrum or a target spectrum mask, and either process the target spectrum to obtain a target audio signal or process the audio signal to be denoised using the target spectrum mask to obtain the target audio signal. By applying the neural network structure to noise reduction of singing voice recorded in karaoke, the embodiment of the application effectively denoises noisy singing voice while preserving the original signal structure of the singing voice, thereby avoiding obvious audible distortion after noise reduction.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute steps in any one of the audio noise reduction methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:

acquiring an audio signal to be denoised;

processing the audio signal to be denoised to obtain a frequency spectrum characteristic;

inputting the frequency spectrum characteristics into a convolution network model for processing to obtain frequency spectrum convolution characteristics;

inputting the spectrum convolution characteristics into a circulating network model for processing to obtain a target spectrum or a target spectrum mask;

processing the target frequency spectrum to obtain a target audio signal; or processing the audio signal to be denoised by using the target spectrum mask to obtain a target audio signal.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

The storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

Since the instructions stored in the storage medium may execute the steps in any audio noise reduction method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any audio noise reduction method provided in the embodiments of the present invention may be achieved, which are detailed in the foregoing embodiments and will not be described herein again.

The audio noise reduction method, the audio noise reduction device, the storage medium and the terminal provided by the embodiment of the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
