Speech enhancement method based on artificial intelligence, server and storage medium

Document No.: 1393443 Published: 2020-02-28

Note: This technology, "Speech enhancement method based on artificial intelligence, server and storage medium", was designed and created by Wang Jianzong (王健宗) and Zhao Feng (赵峰) on 2019-10-12. Its main content is as follows: The invention relates to data processing technology and provides an artificial intelligence based speech enhancement method, server and storage medium. The method first obtains speech data as training samples and constructs a generative adversarial network. Noisy speech and its corresponding denoised speech are input into the discriminator, and the discriminator's parameters are updated through its loss function; noisy speech is then input into the generator, the output speech is input into the discriminator together with that noisy speech, and the loss is computed to update the discriminator's parameters. With the discriminator's parameters fixed, noisy speech is input into the generator, the output speech and that noisy speech are input into the discriminator, and the generator's parameters are updated through the generator's loss function. The parameter-updated generator serves as the speech enhancement model; speech data to be enhanced is input into the model to generate enhanced speech data. The invention improves the performance of a speech enhancement model based on a generative adversarial network and thereby improves the speech enhancement effect.

1. An artificial intelligence based speech enhancement method applied to a server, characterized in that the method comprises the following steps:

an acquisition step: acquiring a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and dividing the training samples into a first data set, a second data set and a third data set;

a construction step: constructing a generative adversarial network comprising at least one generator and one discriminator;

a first training step: inputting the first data set into the discriminator, adjusting the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and updating the parameters of the discriminator when the loss function value is smaller than a first preset threshold to obtain a first discriminator; then inputting the noisy speech of the second data set into the generator, inputting the output speech together with that noisy speech into the first discriminator, and updating the parameters of the first discriminator using a back-propagation algorithm;

a second training step: inputting the noisy speech of the third data set into the generator, inputting the output speech together with that noisy speech into the parameter-updated first discriminator, obtaining the loss function of the generator from the output of the parameter-updated first discriminator, adjusting the parameters of the generator with the goal of minimizing the generator's loss function value, updating the parameters of the generator when the loss function value is smaller than a second preset threshold, and taking the parameter-updated generator as the speech enhancement model; and

a feedback step: receiving speech data to be enhanced sent by a user, inputting the speech data to be enhanced into the speech enhancement model, generating enhanced speech data and feeding it back to the user.

2. The artificial intelligence based speech enhancement method of claim 1, wherein the generator consists of a two-layer convolutional network and a two-layer fully connected neural network; the activation function of the convolutional network and of the first fully connected layer is the ReLU function, and the activation function of the second fully connected layer is the sigmoid function.

3. The artificial intelligence based speech enhancement method of claim 1, wherein the discriminator consists of an eight-layer convolutional network, a one-layer long short-term memory (LSTM) recurrent network and a two-layer fully connected neural network; the activation functions of the convolutional network, the LSTM recurrent network and the first fully connected layer are ReLU functions, and the activation function of the second fully connected layer is the sigmoid function.

4. The artificial intelligence based speech enhancement method of claim 1, wherein the loss function of the generator is:

$$L_G = \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein G denotes the generator, D denotes the discriminator, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of sample Z, X_c denotes the speech output after the noisy speech is input into the generator, E denotes taking the mean over the sample X_c and Z outputs, X_c ~ P_data(X_c) denotes the distribution of sample X_c, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for G(Z, X_c) and X_c.

5. An artificial intelligence based speech enhancement method according to any one of claims 1 to 4 in which the discriminator's loss function is:

$$L_D = \mathbb{E}_{(X,X_c)\sim P_{data}(X,X_c)}\big[\log D(X,X_c)\big] + \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein D denotes the discriminator, X denotes the denoised speech, X_c denotes the speech output after the noisy speech is input into the generator, (X, X_c) ~ P_data(X, X_c) denotes the joint distribution of training-sample features X and X_c, D(X, X_c) denotes the discriminator's truth score for X and X_c, Z ~ P_z(Z) denotes the distribution of sample Z, X_c ~ P_data(X_c) denotes the distribution of sample X_c, E denotes taking the mean over the sample (X, X_c) or sample (Z, X_c) outputs, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for G(Z, X_c) and X_c.

6. A server comprising a memory and a processor, wherein an artificial intelligence based speech enhancement program is stored on the memory, and the artificial intelligence based speech enhancement program, when executed by the processor, implements the following steps:

an acquisition step: acquiring a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and dividing the training samples into a first data set, a second data set and a third data set;

a construction step: constructing a generative adversarial network comprising at least one generator and one discriminator;

a first training step: inputting the first data set into the discriminator, adjusting the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and updating the parameters of the discriminator when the loss function value is smaller than a first preset threshold to obtain a first discriminator; then inputting the noisy speech of the second data set into the generator, inputting the output speech together with that noisy speech into the first discriminator, and updating the parameters of the first discriminator using a back-propagation algorithm;

a second training step: inputting the noisy speech of the third data set into the generator, inputting the output speech together with that noisy speech into the parameter-updated first discriminator, obtaining the loss function of the generator from the output of the parameter-updated first discriminator, adjusting the parameters of the generator with the goal of minimizing the generator's loss function value, updating the parameters of the generator when the loss function value is smaller than a second preset threshold, and taking the parameter-updated generator as the speech enhancement model; and

a feedback step: receiving speech data to be enhanced sent by a user, inputting the speech data to be enhanced into the speech enhancement model, generating enhanced speech data and feeding it back to the user.

7. The server of claim 6, wherein the generator consists of a two-layer convolutional network and a two-layer fully connected neural network; the activation function of the convolutional network and of the first fully connected layer is the ReLU function, and the activation function of the second fully connected layer is the sigmoid function.

8. The server of claim 6, wherein the discriminator consists of an eight-layer convolutional network, a one-layer long short-term memory (LSTM) recurrent network and a two-layer fully connected neural network; the activation functions of the convolutional network, the LSTM recurrent network and the first fully connected layer are ReLU functions, and the activation function of the second fully connected layer is the sigmoid function.

9. The server of claim 6, wherein the loss function of the generator is:

$$L_G = \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein G denotes the generator, D denotes the discriminator, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of sample Z, X_c denotes the speech output after the noisy speech is input into the generator, E denotes taking the mean over the sample X_c and Z outputs, X_c ~ P_data(X_c) denotes the distribution of sample X_c, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for G(Z, X_c) and X_c.

10. A computer-readable storage medium, wherein the computer-readable storage medium contains an artificial intelligence based speech enhancement program which, when executed by a processor, implements the steps of the artificial intelligence based speech enhancement method of any one of claims 1 to 5.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a voice enhancement method based on artificial intelligence, a server and a storage medium.

Background

The purpose of speech enhancement is mainly to remove complex background noise from noisy speech and to improve speech intelligibility without distorting the speech signal. Most traditional speech enhancement algorithms are based on noise estimation and handle only a single type of noise, so they cannot cope well with speech denoising against complex backgrounds. With the rapid development of neural networks, more and more neural network models are being applied to speech enhancement algorithms.

However, because the distribution of speech noise is generally complex, existing deep-learning-based speech enhancement methods suffer from unstable model convergence, resulting in a poor speech enhancement effect.

Disclosure of Invention

In view of the foregoing, the present invention provides an artificial intelligence based speech enhancement method, server and storage medium, aiming to improve the effect of speech enhancement.

In order to achieve the above object, the present invention provides a speech enhancement method based on artificial intelligence, the method comprising:

an acquisition step: acquiring a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and dividing the training samples into a first data set, a second data set and a third data set;

a construction step: constructing a generative adversarial network comprising at least one generator and one discriminator;

a first training step: inputting the first data set into the discriminator, adjusting the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and updating the parameters of the discriminator when the loss function value is smaller than a first preset threshold to obtain a first discriminator; then inputting the noisy speech of the second data set into the generator, inputting the output speech together with that noisy speech into the first discriminator, and updating the parameters of the first discriminator using a back-propagation algorithm;

a second training step: inputting the noisy speech of the third data set into the generator, inputting the output speech together with that noisy speech into the parameter-updated first discriminator, obtaining the loss function of the generator from the output of the parameter-updated first discriminator, adjusting the parameters of the generator with the goal of minimizing the generator's loss function value, updating the parameters of the generator when the loss function value is smaller than a second preset threshold, and taking the parameter-updated generator as the speech enhancement model; and

a feedback step: receiving speech data to be enhanced sent by a user, inputting the speech data to be enhanced into the speech enhancement model, generating enhanced speech data and feeding it back to the user.

Preferably, the generator consists of a two-layer convolutional network and a two-layer fully connected neural network; the activation function of the convolutional network and of the first fully connected layer is the ReLU function, and the activation function of the second fully connected layer is the sigmoid function.

Preferably, the discriminator consists of an eight-layer convolutional network, a one-layer long short-term memory (LSTM) recurrent network and a two-layer fully connected neural network; the activation functions of the convolutional network, the LSTM recurrent network and the first fully connected layer are ReLU functions, and the activation function of the second fully connected layer is the sigmoid function.

Preferably, the loss function of the generator is:

$$L_G = \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein G denotes the generator, D denotes the discriminator, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of sample Z, X_c denotes the speech output after the noisy speech is input into the generator, E denotes taking the mean over the sample X_c and Z outputs, X_c ~ P_data(X_c) denotes the distribution of sample X_c, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for G(Z, X_c) and X_c.

Preferably, the loss function of the discriminator is:

$$L_D = \mathbb{E}_{(X,X_c)\sim P_{data}(X,X_c)}\big[\log D(X,X_c)\big] + \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein D denotes the discriminator, X denotes the denoised speech, X_c denotes the speech output after the noisy speech is input into the generator, (X, X_c) ~ P_data(X, X_c) denotes the joint distribution of training-sample features X and X_c, D(X, X_c) denotes the discriminator's truth score for X and X_c, Z ~ P_z(Z) denotes the distribution of sample Z, X_c ~ P_data(X_c) denotes the distribution of sample X_c, E denotes taking the mean over the sample (X, X_c) or sample (Z, X_c) outputs, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for G(Z, X_c) and X_c.

To achieve the above object, the present invention also provides a server comprising a memory and a processor, wherein the memory stores an artificial intelligence based speech enhancement program which, when executed by the processor, implements the following steps:

an acquisition step: acquiring a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and dividing the training samples into a first data set, a second data set and a third data set;

a construction step: constructing a generative adversarial network comprising at least one generator and one discriminator;

a first training step: inputting the first data set into the discriminator, adjusting the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and updating the parameters of the discriminator when the loss function value is smaller than a first preset threshold to obtain a first discriminator; then inputting the noisy speech of the second data set into the generator, inputting the output speech together with that noisy speech into the first discriminator, and updating the parameters of the first discriminator using a back-propagation algorithm;

a second training step: inputting the noisy speech of the third data set into the generator, inputting the output speech together with that noisy speech into the parameter-updated first discriminator, obtaining the loss function of the generator from the output of the parameter-updated first discriminator, adjusting the parameters of the generator with the goal of minimizing the generator's loss function value, updating the parameters of the generator when the loss function value is smaller than a second preset threshold, and taking the parameter-updated generator as the speech enhancement model; and

a feedback step: receiving speech data to be enhanced sent by a user, inputting the speech data to be enhanced into the speech enhancement model, generating enhanced speech data and feeding it back to the user.

Preferably, the generator consists of a two-layer convolutional network and a two-layer fully connected neural network; the activation function of the convolutional network and of the first fully connected layer is the ReLU function, and the activation function of the second fully connected layer is the sigmoid function.

Preferably, the discriminator consists of an eight-layer convolutional network, a one-layer long short-term memory (LSTM) recurrent network and a two-layer fully connected neural network; the activation functions of the convolutional network, the LSTM recurrent network and the first fully connected layer are ReLU functions, and the activation function of the second fully connected layer is the sigmoid function.

Preferably, the loss function of the generator is:

$$L_G = \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein G denotes the generator, D denotes the discriminator, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of sample Z, X_c denotes the speech output after the noisy speech is input into the generator, E denotes taking the mean over the sample X_c and Z outputs, X_c ~ P_data(X_c) denotes the distribution of sample X_c, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for G(Z, X_c) and X_c.

To achieve the above object, the present invention further provides a computer-readable storage medium containing an artificial intelligence based speech enhancement program which, when executed by a processor, can implement any of the steps of the artificial intelligence based speech enhancement method described above.

Compared with prior-art speech enhancement methods, the artificial intelligence based speech enhancement method, server and storage medium provided by the invention obtain noisy speech and corresponding denoised speech as training samples, construct a generative adversarial network comprising a discriminator and a generator, adjust and update the discriminator's parameters several times based on the noisy speech and the speech output by the generator to obtain a first discriminator, obtain the generator's loss function based on the first discriminator, and finally obtain a speech enhancement model by adjusting the generator's parameters to minimize the generator's loss function value, which is then applied to enhance speech data. The generative adversarial network applied by the method contains no recursive operations of the kind found in RNNs, so compared with such neural networks it processes data faster and more promptly, thereby achieving rapid speech enhancement. In addition, the generator and discriminator of the generative adversarial network process raw audio, so features do not need to be extracted manually; the network can learn speech characteristics from different speakers and different types of noise and combine them into shared parameters, making the system simple and giving it strong generalization ability.

Drawings

FIG. 1 is a diagram of a server according to a preferred embodiment of the present invention;

FIG. 2 is a block diagram of a preferred embodiment of the artificial intelligence based speech enhancement program of FIG. 1;

FIG. 3 is a flowchart of a preferred embodiment of the artificial intelligence based speech enhancement method of the present invention;

the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a server 1 according to a preferred embodiment of the present invention is shown.

The server 1 includes, but is not limited to: a memory 11, a processor 12, a display 13 and a network interface 14. The server 1 is connected to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired communication network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth or Wi-Fi.

The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. In other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card or flash card equipped on the server 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the server 1. In this embodiment, the memory 11 is generally used to store the operating system installed on the server 1 and various types of application software, such as the program code of the artificial intelligence based speech enhancement program 10. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally used for controlling the overall operation of the server 1, such as performing data interaction or communication-related control and processing. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run the program code of the artificial intelligence based speech enhancement program 10.

The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-emitting diode (OLED) touch screen, or the like. The display 13 is used for displaying information processed in the server 1 and for displaying a visual work interface, for example, results of data statistics.

The network interface 14 may optionally comprise a standard wired interface and a wireless interface (e.g., a Wi-Fi interface); the network interface 14 is typically used to establish a communication connection between the server 1 and other electronic devices.

FIG. 1 only shows the server 1 with components 11 to 14 and the artificial intelligence based speech enhancement program 10, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.

Optionally, the server 1 may further include a user interface, which may include a display and an input unit such as a keyboard; the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is used to display information processed in the server 1 and to display a visual user interface.

The server 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which will not be described herein.

In the above embodiment, the processor 12, when executing the artificial intelligence based speech enhancement program 10 stored in the memory 11, may implement the following steps:

an acquisition step: acquiring a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and dividing the training samples into a first data set, a second data set and a third data set;

a construction step: constructing a generative adversarial network comprising at least one generator and one discriminator;

a first training step: inputting the first data set into the discriminator, adjusting the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and updating the parameters of the discriminator when the loss function value is smaller than a first preset threshold to obtain a first discriminator; then inputting the noisy speech of the second data set into the generator, inputting the output speech together with that noisy speech into the first discriminator, and updating the parameters of the first discriminator using a back-propagation algorithm;

a second training step: inputting the noisy speech of the third data set into the generator, inputting the output speech together with that noisy speech into the parameter-updated first discriminator, obtaining the loss function of the generator from the output of the parameter-updated first discriminator, adjusting the parameters of the generator with the goal of minimizing the generator's loss function value, updating the parameters of the generator when the loss function value is smaller than a second preset threshold, and taking the parameter-updated generator as the speech enhancement model; and

a feedback step: receiving speech data to be enhanced sent by a user, inputting the speech data to be enhanced into the speech enhancement model, generating enhanced speech data and feeding it back to the user.

For a detailed description of the above steps, please refer to the following description of FIG. 2, a block diagram of an embodiment of the artificial intelligence based speech enhancement program 10, and FIG. 3, a flowchart of an embodiment of the artificial intelligence based speech enhancement method.

In other embodiments, the artificial intelligence based speech enhancement program 10 may be partitioned into a plurality of modules that are stored in the memory 11 and executed by the processor 12 to accomplish the present invention. A module referred to herein is a series of computer program instruction segments capable of performing a specified function.

Referring to FIG. 2, a block diagram of an embodiment of the artificial intelligence based speech enhancement program 10 is shown. In this embodiment, the artificial intelligence based speech enhancement program 10 can be divided into: an acquisition module 110, a construction module 120, a first training module 130, a second training module 140 and a feedback module 150.

The acquisition module 110 is configured to obtain a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and to divide the training samples into a first data set, a second data set and a third data set.

In this embodiment, a preset number of noisy speech data and the denoised speech data corresponding to each noisy speech sample may be obtained from a preset third-party speech library as training samples. The denoised and noisy speech data are sampled at 16 kHz, with the speech frame length set to 16 ms and the frame shift set to 8 ms. It should be understood that the invention does not limit the frame length or frame shift of the acquired speech spectrum, nor the acoustic features contained in the speech spectrum.
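As a concrete illustration of the framing just described, the following minimal Python sketch (NumPy only; all names are illustrative and not part of the patent) splits a mono 16 kHz waveform into 16 ms frames with an 8 ms shift:

```python
import numpy as np

SAMPLE_RATE = 16000                     # 16 kHz sampling, as stated above
FRAME_LEN = int(0.016 * SAMPLE_RATE)    # 16 ms frame -> 256 samples
FRAME_SHIFT = int(0.008 * SAMPLE_RATE)  # 8 ms shift  -> 128 samples

def frame_signal(signal: np.ndarray) -> np.ndarray:
    """Split a mono waveform (assumed at least one frame long) into
    overlapping frames of 16 ms with an 8 ms shift."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    return np.stack([
        signal[i * FRAME_SHIFT : i * FRAME_SHIFT + FRAME_LEN]
        for i in range(n_frames)
    ])  # shape: (n_frames, 256)

# One second of audio yields 1 + (16000 - 256) // 128 = 124 frames.
frames = frame_signal(np.random.randn(SAMPLE_RATE).astype(np.float32))
print(frames.shape)  # (124, 256)
```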

The noisy speech and denoised speech obtained from the preset speech library are raw speech data and may contain invalid and redundant segments, for example speech whose duration or quality does not meet requirements. The unprocessed speech data may also contain invalid or redundant speech intervals, portions of the raw data whose presence can adversely affect subsequent speech processing, so these redundant or invalid intervals need to be removed. Cleaning and filtering the raw speech data in this way improves the efficiency of subsequent speech data processing.

The construction module 120 is configured to construct a generative adversarial network comprising at least one generator and one discriminator.

In this embodiment, the constructed generative adversarial network comprises one generator and one discriminator; the output of the generator is connected to the input of the discriminator, and the discrimination result of the discriminator is fed back to the generator.

The generator may consist of a two-layer convolutional network and a two-layer fully connected neural network; the activation function of the convolutional network and of the first fully connected layer is the ReLU function, and the activation function of the second fully connected layer is the sigmoid function. The generator feeds its generated speech, together with denoised speech, into the discriminator to train the discriminator network: the discriminator judges the predicted speech produced by the generator as fake data and gives it a low score (close to 0), and judges the real denoised speech as real data and gives it a high score (close to 1), thereby learning the distributions of the denoised speech and of the speech data produced by the generator. The discriminator may consist of an eight-layer convolutional network, a one-layer long short-term memory (LSTM) recurrent network and a two-layer fully connected neural network; the activation functions of the convolutional network, the LSTM recurrent network and the first fully connected layer are ReLU functions, and the activation function of the second fully connected layer is the sigmoid function.
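The paragraph above fixes only the layer counts and activation functions, so the following PyTorch sketch is just one possible realization: the channel widths, kernel sizes, LSTM hidden size and the 256-sample frame length are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Two convolutional layers plus two fully connected layers; ReLU
    activations except for the final sigmoid, as described above."""
    def __init__(self, frame_len: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=31, padding=15), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=31, padding=15), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * frame_len, 512), nn.ReLU(),
            nn.Linear(512, frame_len), nn.Sigmoid(),  # enhanced frame
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        h = self.conv(noisy.unsqueeze(1))  # (B, 32, frame_len)
        return self.fc(h.flatten(1))       # (B, frame_len)

class Discriminator(nn.Module):
    """Eight convolutional layers, one LSTM layer and two fully connected
    layers; outputs a truth score in [0, 1] for a (speech, condition) pair."""
    def __init__(self, frame_len: int = 256, channels: int = 16):
        super().__init__()
        layers, in_ch = [], 2              # speech and condition as channels
        for _ in range(8):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=31, padding=15),
                       nn.ReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(channels, 64, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),  # truth score close to 0 or 1
        )

    def forward(self, speech: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        x = torch.stack([speech, condition], dim=1)  # (B, 2, frame_len)
        h = self.conv(x).transpose(1, 2)             # (B, frame_len, channels)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])                   # (B, 1)
```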

The first training module 130 is configured to input the first data set into the discriminator, adjust the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and update the parameters when the loss function value is smaller than a first preset threshold to obtain a first discriminator; and to input the noisy speech of the second data set into the generator, input the output speech together with that noisy speech into the first discriminator, and update the parameters of the first discriminator using a back-propagation algorithm.

When iterative training begins, the speech of the first data set is first input into the discriminator; the discriminator's output value is the truth score of the input noisy speech, the discriminator's loss function is obtained from this truth score, and the discriminator's parameters are updated with a back-propagation algorithm according to the loss function to obtain the first discriminator. The noisy speech of the second data set is then input into the generator of the adversarial network, the speech output by the generator is input into the first discriminator together with that noisy speech, and the first discriminator's parameters are updated from its output result via the back-propagation algorithm. In this embodiment, for any input sample of noisy speech X, the discriminator outputs a real number in [0, 1] indicating the degree of truth of the input X; the closer to 0, the lower the degree of truth, and the closer to 1, the higher the degree of truth.

The generative adversarial network is optimized according to the following objective formula:

$$\min_G \max_D V(D,G) = \mathbb{E}_{X\sim P_{data}(X)}\big[\log D(X)\big] + \mathbb{E}_{Z\sim P_z(Z)}\big[\log\big(1 - D(G(Z))\big)\big]$$

wherein V denotes the loss value, G denotes the generator, D denotes the discriminator, log is the logarithm function, X is the denoised speech data, X ~ P_data(X) denotes the distribution of the denoised speech X, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of the noisy speech Z, D(X) denotes the discriminator's truth score for the denoised speech X, G(Z) denotes the generated speech output after the noisy speech is input into the generator, D(G(Z)) denotes the discriminator's truth score for the generated speech output by the generator, and E denotes taking the mean over the sample X or sample Z outputs.

When optimizing the discriminator, the aim is to maximize the sum of the expectation terms over the noisy speech Z and the denoised speech X; from the objective formula above, the discriminator's loss function is:

$$L_D = \mathbb{E}_{(X,X_c)\sim P_{data}(X,X_c)}\big[\log D(X,X_c)\big] + \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein D denotes the discriminator, X denotes the denoised speech data, X_c denotes the speech output after the noisy speech is input into the generator, P_data denotes the distribution of the training samples, (X, X_c) ~ P_data(X, X_c) denotes the joint distribution of training-sample features X and X_c, D(X, X_c) denotes the discriminator's truth score for X and X_c, Z ~ P_z(Z) denotes the distribution of the noisy speech samples Z, X_c ~ P_data(X_c) denotes the distribution of the generated speech X_c output by the generator, E denotes taking the mean over the sample (X, X_c) or sample (Z, X_c) outputs, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for the synthetic data G(Z, X_c) and X_c.

Substituting the truth scores of training sample Z and training samples X and X_c into the discriminator's loss function, the weights between the nodes of the discriminator's different layers can be optimized by continually minimizing the discriminator's loss function value; when the loss function value is smaller than the first preset threshold, the discriminator's parameters are updated.
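A minimal sketch of one such discriminator update, reusing the Generator and Discriminator sketches above, might look as follows; the choice of the Adam optimizer is an assumption, while the 0.0002 learning rate comes from the training configuration stated later in this embodiment.

```python
import torch

G, D = Generator(), Discriminator()
opt_d = torch.optim.Adam(D.parameters(), lr=0.0002)

def discriminator_step(noisy: torch.Tensor, clean: torch.Tensor) -> float:
    """Maximize log D(X, Xc) + log(1 - D(G(Z, Xc), Xc)) by minimizing
    its negation; only the discriminator's parameters are updated."""
    fake = G(noisy).detach()      # generated speech; no gradient into G
    score_real = D(clean, noisy)  # truth score for the real pair
    score_fake = D(fake, noisy)   # truth score for the generated pair
    loss_d = -(torch.log(score_real + 1e-8) +
               torch.log(1.0 - score_fake + 1e-8)).mean()
    opt_d.zero_grad()
    loss_d.backward()             # back-propagation through D
    opt_d.step()
    return loss_d.item()
```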

The second training module 140 is configured to input the noisy speech of the third data set into the generator, input the output speech together with that noisy speech into the parameter-updated first discriminator, obtain the loss function of the generator from the output of the parameter-updated first discriminator, adjust the parameters of the generator with the goal of minimizing the generator's loss function value, update the parameters when the loss function value is smaller than a second preset threshold, and take the parameter-updated generator as the speech enhancement model.

In this embodiment, when optimizing the generator G, the generated-sample term of the objective must be minimized; from the objective formula above, the generator's loss function is:

$$L_G = \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein G denotes the generator, D denotes the discriminator, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of the noisy speech samples Z, E denotes taking the mean over the sample X_c and Z outputs, X_c denotes the generated speech output after the noisy speech is input into the generator, X_c ~ P_data(X_c) denotes the distribution of sample X_c, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for the synthetic data G(Z, X_c) and X_c.

Substituting the truth scores of training sample Z and training sample X_c into the generator's loss function, the weights between the nodes of the generator's different layers can be optimized by continually minimizing the generator's loss function value; when the loss function value is smaller than the second preset threshold, the generator's parameters are updated.
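Under the same assumptions, one generator update with the discriminator held fixed (only the generator's parameters are stepped by its optimizer) could be sketched as:

```python
opt_g = torch.optim.Adam(G.parameters(), lr=0.0002)

def generator_step(noisy: torch.Tensor) -> float:
    """Minimize log(1 - D(G(Z, Xc), Xc)) with respect to G; the optimizer
    steps only G's parameters, so D's parameters stay fixed."""
    fake = G(noisy)
    score_fake = D(fake, noisy)
    loss_g = torch.log(1.0 - score_fake + 1e-8).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item()
```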

In this example, a total of 86 epochs were trained, with a learning rate of 0.0002 and a batch size of 400. An epoch means that all the data is sent through the network to complete one forward computation and one back-propagation pass. Because one epoch is too large for the computer to process at once, it is divided into several smaller batches; a batch is the portion of data sent through the network in one training step, and the batch size is the number of training samples per batch.
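An illustrative outer loop tying the two update steps above to the stated schedule (86 epochs, batch size 400) follows; noisy_frames and clean_frames are assumed pre-framed tensors of shape (N, 256), not data structures defined by the patent:

```python
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(noisy_frames, clean_frames)
loader = DataLoader(dataset, batch_size=400, shuffle=True)  # batch size 400

for epoch in range(86):          # 86 epochs in total
    for noisy, clean in loader:  # one batch per training step
        d_loss = discriminator_step(noisy, clean)
        g_loss = generator_step(noisy)
```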

The feedback module 150 is configured to receive speech data to be enhanced sent by a user, input it into the speech enhancement model, generate enhanced speech data and feed it back to the user.

In this embodiment, the speech to be enhanced sent by a user may be captured by a microphone, converted into a spectrogram by a short-time Fourier transform and fed into the trained speech enhancement model to generate predicted denoised speech data, which is then converted back into a speech analog signal by an inverse short-time Fourier transform and fed back to the user, for example by playing the enhanced speech through a speaker or other device.
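A sketch of this inference path, with the STFT window and hop matching the 16 ms / 8 ms framing above; treating the model as a per-frame magnitude enhancer (and reusing the noisy phase) is an assumption about its interface, not something the patent specifies:

```python
import torch

def enhance(waveform: torch.Tensor, model: torch.nn.Module) -> torch.Tensor:
    """STFT -> enhance magnitude frames -> inverse STFT."""
    n_fft, hop = 256, 128  # 16 ms window, 8 ms shift at 16 kHz
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    with torch.no_grad():
        enhanced_mag = model(mag.T).T            # model enhances each time frame
    enhanced = torch.polar(enhanced_mag, phase)  # recombine with the noisy phase
    return torch.istft(enhanced, n_fft=n_fft, hop_length=hop, window=window)
```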

In addition, the invention also provides an artificial intelligence based speech enhancement method. FIG. 3 is a flowchart of an embodiment of the artificial intelligence based speech enhancement method of the present invention. The processor 12 of the server 1, when executing the artificial intelligence based speech enhancement program 10 stored in the memory 11, implements the following steps of the method:

Step S10: a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample are obtained as training samples, and the training samples are divided into a first data set, a second data set and a third data set.

In this embodiment, a preset number of noisy speech data and the denoised speech data corresponding to each noisy speech sample may be obtained from a preset third-party speech library as training samples. In one embodiment, the denoised and noisy speech data are sampled at 16 kHz, the speech frame length is set to 16 ms, and the frame shift is set to 8 ms. It should be understood that the invention does not limit the frame length or frame shift of the acquired speech spectrum, nor the acoustic features contained in the speech spectrum.

The noisy speech and denoised speech obtained from the preset speech library are raw speech data and may contain invalid and redundant segments, for example speech whose duration or quality does not meet requirements. The unprocessed speech data may also contain invalid or redundant speech intervals, portions of the raw data whose presence can adversely affect subsequent speech processing, so these redundant or invalid intervals need to be removed. Cleaning and filtering the raw speech data in this way improves the efficiency of subsequent speech data processing.

Step S20: a generative adversarial network comprising at least one generator and one discriminator is constructed.

In this embodiment, the constructed generative adversarial network comprises one generator and one discriminator; the output of the generator is connected to the input of the discriminator, and the discrimination result of the discriminator is fed back to the generator.

The generator may consist of a two-layer convolutional network and a two-layer fully connected neural network; the activation function of the convolutional network and of the first fully connected layer is the ReLU function, and the activation function of the second fully connected layer is the sigmoid function. The generator feeds its generated speech, together with denoised speech, into the discriminator to train the discriminator network: the discriminator judges the predicted speech produced by the generator as fake data and gives it a low score (close to 0), and judges the real denoised speech as real data and gives it a high score (close to 1), thereby learning the distributions of the denoised speech and of the speech data produced by the generator. The discriminator may consist of an eight-layer convolutional network, a one-layer long short-term memory (LSTM) recurrent network and a two-layer fully connected neural network; the activation functions of the convolutional network, the LSTM recurrent network and the first fully connected layer are ReLU functions, and the activation function of the second fully connected layer is the sigmoid function.

Step S30: the first data set is input into the discriminator, the parameters of the discriminator are adjusted with the goal of minimizing the discriminator's loss function value, and the parameters are updated when the loss function value is smaller than a first preset threshold to obtain a first discriminator; the noisy speech of the second data set is then input into the generator, the output speech is input into the first discriminator together with that noisy speech, and the parameters of the first discriminator are updated using a back-propagation algorithm.

When iterative training begins, the speech of the first data set is first input into the discriminator; the discriminator's output value is the truth score of the input noisy speech, the discriminator's loss function is obtained from this truth score, and the discriminator's parameters are updated with a back-propagation algorithm according to the loss function to obtain the first discriminator. The noisy speech of the second data set is then input into the generator of the adversarial network, the speech output by the generator is input into the first discriminator together with that noisy speech, and the first discriminator's parameters are updated from its output result via the back-propagation algorithm. In this embodiment, for any input sample of noisy speech X, the discriminator outputs a real number in [0, 1] indicating the degree of truth of the input X; the closer to 0, the lower the degree of truth, and the closer to 1, the higher the degree of truth.

The generative adversarial network is optimized according to the following objective formula:

$$\min_G \max_D V(D,G) = \mathbb{E}_{X\sim P_{data}(X)}\big[\log D(X)\big] + \mathbb{E}_{Z\sim P_z(Z)}\big[\log\big(1 - D(G(Z))\big)\big]$$

wherein V denotes the loss value, G denotes the generator, D denotes the discriminator, log is the logarithm function, X is the denoised speech data, X ~ P_data(X) denotes the distribution of the denoised speech X, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of the noisy speech Z, D(X) denotes the discriminator's truth score for the denoised speech X, G(Z) denotes the generated speech output after the noisy speech is input into the generator, D(G(Z)) denotes the discriminator's truth score for the generated speech output by the generator, and E denotes taking the mean over the sample X or sample Z outputs.

When optimizing the discriminator, the aim is to maximize the sum of the expectation terms over the noisy speech Z and the denoised speech X; from the objective formula above, the discriminator's loss function is:

$$L_D = \mathbb{E}_{(X,X_c)\sim P_{data}(X,X_c)}\big[\log D(X,X_c)\big] + \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein D denotes the discriminator, X denotes the denoised speech data, X_c denotes the speech output after the noisy speech is input into the generator, P_data denotes the distribution of the training samples, (X, X_c) ~ P_data(X, X_c) denotes the joint distribution of training-sample features X and X_c, D(X, X_c) denotes the discriminator's truth score for X and X_c, Z ~ P_z(Z) denotes the distribution of the noisy speech samples Z, X_c ~ P_data(X_c) denotes the distribution of the generated speech X_c output by the generator, E denotes taking the mean over the sample (X, X_c) or sample (Z, X_c) outputs, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for the synthetic data G(Z, X_c) and X_c.

Substituting the truth scores of training sample Z and training samples X and X_c into the discriminator's loss function, the weights between the nodes of the discriminator's different layers can be optimized by continually minimizing the discriminator's loss function value; when the loss function value is smaller than the first preset threshold, the discriminator's parameters are updated.

Step S40: the noisy speech of the third data set is input into the generator, the output speech is input together with that noisy speech into the parameter-updated first discriminator, the loss function of the generator is obtained from the output of the parameter-updated first discriminator, the parameters of the generator are adjusted with the goal of minimizing the generator's loss function value and updated when the loss function value is smaller than a second preset threshold, and the parameter-updated generator is taken as the speech enhancement model.

In this embodiment, when optimizing the generator G, the generated-sample term of the objective must be minimized; from the objective formula above, the generator's loss function is:

$$L_G = \mathbb{E}_{Z\sim P_z(Z),\,X_c\sim P_{data}(X_c)}\big[\log\big(1 - D(G(Z,X_c),\,X_c)\big)\big]$$

wherein G denotes the generator, D denotes the discriminator, Z denotes the noisy speech, Z ~ P_z(Z) denotes the distribution of the noisy speech samples Z, E denotes taking the mean over the sample X_c and Z outputs, X_c denotes the generated speech output after the noisy speech is input into the generator, X_c ~ P_data(X_c) denotes the distribution of sample X_c, G(Z, X_c) denotes the synthetic data into which the generator converts sample Z and sample X_c, and D(G(Z, X_c), X_c) denotes the discriminator's truth score for the synthetic data G(Z, X_c) and X_c.

Substituting the truth scores of training sample Z and training sample X_c into the generator's loss function, the weights between the nodes of the generator's different layers can be optimized by continually minimizing the generator's loss function value; when the loss function value is smaller than the second preset threshold, the generator's parameters are updated.

In this example, a total of 86 epochs were trained, with a learning rate of 0.0002 and a batch size of 400. An epoch means that all the data is sent through the network to complete one forward computation and one back-propagation pass. Because one epoch is too large for the computer to process at once, it is divided into several smaller batches; a batch is the portion of data sent through the network in one training step, and the batch size is the number of training samples per batch.

Step S50: speech data to be enhanced sent by a user is received and input into the speech enhancement model, and enhanced speech data is generated and fed back to the user.

In this embodiment, the speech to be enhanced sent by a user may be captured by a microphone, converted into a spectrogram by a short-time Fourier transform and fed into the trained speech enhancement model to generate predicted denoised speech data, which is then converted back into a speech analog signal by an inverse short-time Fourier transform and fed back to the user, for example by playing the enhanced speech through a speaker or other device.

Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory and the like. The computer-readable storage medium includes an artificial intelligence based speech enhancement program 10 which, when executed by a processor, performs the following operations:

an acquisition step: acquiring a preset number of noisy speech samples and the denoised speech corresponding to each noisy speech sample as training samples, and dividing the training samples into a first data set, a second data set and a third data set;

a construction step: constructing a generative adversarial network comprising at least one generator and one discriminator;

a first training step: inputting the first data set into the discriminator, adjusting the parameters of the discriminator with the goal of minimizing the discriminator's loss function value, and updating the parameters of the discriminator when the loss function value is smaller than a first preset threshold to obtain a first discriminator; then inputting the noisy speech of the second data set into the generator, inputting the output speech together with that noisy speech into the first discriminator, and updating the parameters of the first discriminator using a back-propagation algorithm;

a second training step: inputting the noisy speech of the third data set into the generator, inputting the output speech together with that noisy speech into the parameter-updated first discriminator, obtaining the loss function of the generator from the output of the parameter-updated first discriminator, adjusting the parameters of the generator with the goal of minimizing the generator's loss function value, updating the parameters of the generator when the loss function value is smaller than a second preset threshold, and taking the parameter-updated generator as the speech enhancement model; and

a feedback step: receiving speech data to be enhanced sent by a user, inputting the speech data to be enhanced into the speech enhancement model, generating enhanced speech data and feeding it back to the user.

The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiment of the artificial intelligence based speech enhancement method, and will not be described herein again.

It should be noted that the above numbering of the embodiments of the present invention is merely for description and does not indicate the relative merits of the embodiments. The terms "comprises", "comprising" and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article or method.

Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, stored in a storage medium as described above, that includes several instructions for causing a terminal device to execute the methods of the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
