Voice separation method, model training method and electronic device

Document No. 1339717 · Published 2020-07-17

Note: this technology, "Voice separation method, model training method and electronic device" (语音分离方法、模型训练方法及电子设备), was designed and created by 艾文, 冯大航 and 陈孝良 on 2020-05-09. Abstract: The invention provides a voice separation method, a model training method and an electronic device. The voice separation method includes: acquiring voice features of a voice to be processed, where the voice to be processed includes voice signals of at least two sound sources and the voice features include at least phase features; and inputting the voice features of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed and obtain a voice separation result. Embodiments of the invention can improve the voice separation effect.

1. A method of speech separation, the method comprising:

acquiring voice characteristics of voice to be processed, wherein the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics;

and inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

2. The method of claim 1, wherein before inputting the speech features of the speech to be processed into the pre-trained speech separation network model, the method further comprises:

dividing the voice to be processed into a first voice section and a second voice section, wherein the first voice section and the second voice section both comprise N frames of continuous voice signals, the first voice section and the second voice section both at least comprise an ith frame of voice signal, the ith frame of voice signal is any frame of voice signal in the voice to be processed, N is greater than 0, and i is greater than 0;

the inputting the voice features of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result includes:

inputting the voice characteristics of the first voice segment into a pre-trained voice separation network model to obtain a first separation result of the first voice segment;

inputting the voice characteristics of the second voice segment into the voice separation network model to obtain a second separation result of the second voice segment;

and acquiring a voice separation result of the ith frame of voice signal based on the first separation result and the second separation result.

3. The method according to claim 2, wherein the obtaining a speech separation result of the i-th frame speech signal based on the first separation result and the second separation result comprises:

and acquiring a voice separation result of the i-th frame voice signal based on one of the first separation result and the second separation result, which has a larger product with the voice to be processed.

4. The method of claim 1, wherein the speech features further include spectral features, the speech separation network model includes a first neural network model and a second neural network model, and the inputting the speech features of the speech to be processed into a pre-trained speech separation network model to perform speech separation on the speech to be processed to obtain a speech separation result includes:

inputting the spectral characteristics of the voice to be processed into the first neural network model so as to perform voice separation on the voice to be processed and obtain a third separation result;

and inputting the third separation result and the phase characteristics into the second neural network model so as to perform voice separation on the third separation result and obtain a voice separation result.

5. The method according to claim 1, wherein the inputting the speech features of the speech to be processed into a pre-trained speech separation network model to perform speech separation on the speech to be processed to obtain a speech separation result includes:

and inputting the voice characteristics of the voice to be processed and the voice characteristics of the voice in the fixed beam direction into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

6. A method of model training, the method comprising:

acquiring voice features of a training sample, wherein the voice features at least comprise phase features;

and training a voice separation network model based on the voice characteristics of the training samples.

7. The method of claim 6, wherein prior to training a speech separation network model based on the speech features of the training samples, the method further comprises:

dividing the training sample into a third voice section and a fourth voice section, wherein the third voice section and the fourth voice section both comprise M frames of continuous voice signals, the third voice section and the fourth voice section both at least comprise a j-th frame of voice signal, the j-th frame of voice signal is any frame of voice signal in the training sample, M is greater than 0, and j is greater than 0;

training a voice separation network model for voice separation based on voice features of the training samples, including:

inputting the voice characteristics of the third voice segment into a voice separation network model for voice separation to obtain a third separation result of the third voice segment;

inputting the voice features of the fourth voice segment into the voice separation network model to obtain a fourth separation result of the fourth voice segment;

updating parameters of the voice separation network model based on a voice separation result of the j-th frame voice signal and a target output;

wherein the voice separation result of the j-th frame voice signal is obtained based on the one of the third separation result and the fourth separation result whose confidence score is higher.

8. The method of claim 7, wherein the confidence score is determined based on the target output and an output of the speech separation network model; or

the confidence score is determined based on a speech enhancement value and an actual speech value, wherein the speech enhancement value is a product of an input of the speech separation network model and a beam coefficient of a sound source, and the actual speech value is a product of an output and an input of the speech separation network model.

9. The method of claim 6, wherein the speech features further comprise spectral features, wherein the speech separation network model comprises a first neural network model and a second neural network model, and wherein training the speech separation network model for speech separation based on the speech features of the training samples comprises:

training the first neural network model and the second neural network model based on spectral features and phase features of the training samples.

10. The method of claim 6, wherein the output of the speech separation network model comprises speech signals of at least two sound sources, and wherein the correspondence of the speech signals of the at least two sound sources output by the speech separation network model to the target output is determined based on sound source localization.

11. An electronic device, characterized in that the electronic device comprises:

the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring the voice characteristics of voice to be processed, the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics;

and the input module is used for inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

12. An electronic device, characterized in that the electronic device comprises:

the acquisition module is used for acquiring the voice characteristics of the training sample, wherein the voice characteristics at least comprise phase characteristics;

and the training module is used for training a voice separation network model based on the voice characteristics of the training samples.

13. An electronic device, comprising: memory, processor and program stored on the memory and executable on the processor, the program implementing the steps in the speech separation method according to any of claims 1 to 5 when executed by the processor or implementing the steps in the model training method according to any of claims 6 to 10 when executed by the processor.

Technical Field

The invention relates to the technical field of voice signal processing, in particular to a voice separation method, a model training method and electronic equipment.

Background

In a noisy acoustic environment such as a cocktail party, multiple different sound sources are often present simultaneously, e.g., several human voices, the clatter of tableware, music, and so on. Voice separation separates the target voice from the background interference. It is a fundamental task with a wide range of applications, including hearing prostheses, mobile communication, robust automatic speech recognition, speaker recognition, and the like.

However, when training a voice separation network model at present, the separated voices cannot be matched to their actual sound sources, so the accuracy of the trained voice separation network model is low and the voice separation effect is poor.

Disclosure of Invention

The embodiment of the invention provides a voice separation method, a model training method and electronic equipment, and aims to solve the problems that in the prior art, in the process of training a voice separation network model, separated voice cannot correspond to an actual sound source, so that the accuracy of the trained voice separation network model is low, and the voice separation effect is poor.

In order to solve the technical problem, the invention is realized as follows:

in a first aspect, an embodiment of the present invention provides a speech separation method, where the method includes:

acquiring voice characteristics of voice to be processed, wherein the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics;

and inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

In a second aspect, an embodiment of the present invention provides a model training method, where the method includes:

acquiring voice features of a training sample, wherein the voice features at least comprise phase features;

and training a voice separation network model based on the voice characteristics of the training samples.

In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:

the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring the voice characteristics of voice to be processed, the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics;

and the input module is used for inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

In a fourth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:

the acquisition module is used for acquiring the voice characteristics of the training sample, wherein the voice characteristics at least comprise phase characteristics;

and the training module is used for training a voice separation network model based on the voice characteristics of the training samples.

In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, the program implementing the steps of the speech separation method according to the first aspect when executed by the processor or implementing the steps of the model training method according to the second aspect when executed by the processor.

In a sixth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the speech separation method according to the first aspect, or the computer program, when executed by the processor, implements the steps in the model training method according to the second aspect.

In the embodiment of the invention, in the process of model training, the voice characteristics of a training sample are obtained, wherein the voice characteristics at least comprise phase characteristics; the voice separation network model is trained based on the voice characteristics of the training samples, so that the separated voice can be corresponding to an actual sound source based on the phase characteristics, and the accuracy of the trained voice separation network model can be improved. In the process of voice separation, acquiring voice characteristics of voice to be processed, wherein the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics; and inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed to obtain a voice separation result, so that the separated voice can be corresponding to an actual sound source based on the phase characteristics, and the voice separation effect is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.

Fig. 1 is a flowchart of a speech separation method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a voice separation network model according to an embodiment of the present invention;

FIG. 3 is a flow chart of a model training method provided by an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;

fig. 5 is a second schematic structural diagram of an electronic device according to an embodiment of the invention;

fig. 6 is a third schematic structural diagram of an electronic apparatus according to an embodiment of the invention;

fig. 7 is a fourth schematic structural diagram of an electronic device according to an embodiment of the present invention;

fig. 8 is a fifth schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted mobile terminal, a wearable device, a pedometer, and the like.

Referring to fig. 1, fig. 1 is a flowchart of a speech separation method according to an embodiment of the present invention, as shown in fig. 1, including the following steps:

step 101, obtaining voice characteristics of voice to be processed, wherein the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics.

Feature extraction may be performed on the voice to be processed to obtain its voice features. The phase feature may be obtained from a plurality of voice channels; for example, each microphone may serve as one voice channel, the voice to be processed may be acquired by a plurality of microphones, and the phase feature may be computed from the voice acquired by the plurality of microphones. The voice features may also include spectral features, which may likewise be acquired from a plurality of voice channels. The phase feature may include an IPD (inter-channel phase difference) feature, or other phase-dependent feature parameters.
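For illustration, a minimal sketch of such a feature extractor might look as follows; the function name, the choice of channel 0 as reference, and the cos/sin encoding of the IPD are assumptions for this example, not the patent's prescribed implementation:

```python
import numpy as np
from scipy.signal import stft

def extract_features(mics, fs=16000, nperseg=512):
    """Extract spectral + phase features from a multi-microphone recording.

    mics: array of shape (channels, samples), one row per microphone.
    Returns a (feat_dim, frames) matrix: the reference channel's
    log-magnitude spectrum, plus cos/sin-encoded IPDs of every other
    channel relative to channel 0.
    """
    specs = [stft(ch, fs=fs, nperseg=nperseg)[2] for ch in mics]  # (freq, frames)
    ref = specs[0]
    feats = [np.log(np.abs(ref) + 1e-8)]           # spectral feature
    for spec in specs[1:]:
        ipd = np.angle(spec) - np.angle(ref)       # inter-channel phase difference
        feats.append(np.cos(ipd))                  # cos/sin encoding avoids the
        feats.append(np.sin(ipd))                  # 2*pi wrap-around discontinuity
    return np.concatenate(feats, axis=0)
```

Stacking the rows this way corresponds to splicing the spectral and phase features before they are input into the model, as described below.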

And 102, inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

The inputting the voice feature of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result may include: and inputting the spectral characteristic and the phase characteristic of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result. The inputting of the spectral feature and the phase feature of the speech to be processed into the pre-trained speech separation network model may be inputting the pre-trained speech separation network model after splicing the spectral feature and the phase feature of the speech to be processed.

Or, the voice separation network model may further include a first neural network model and a second neural network model, and the inputting the voice feature of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result may include: inputting the spectral characteristics of the voice to be processed into the first neural network model so as to perform voice separation on the voice to be processed and obtain a third separation result; and inputting the third separation result and the phase characteristics into the second neural network model so as to perform voice separation on the third separation result and obtain a voice separation result.

Or, the inputting the voice feature of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result may include: and inputting the voice characteristics of the voice to be processed and the voice characteristics of the voice in the fixed beam direction into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

In addition, the speech separation network model may include a BLSTM (bidirectional long short-term memory) model, or an LSTM (long short-term memory) model, or an RNN (recurrent neural network) model, etc.; any network model usable for speech separation may serve as the speech separation network model, which is not limited in this embodiment.
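As a hedged sketch of the two-stage variant described above (a first network on spectral features, a second network on the first result plus phase features), the following PyTorch-style model illustrates one plausible shape; the layer sizes, the sigmoid mask heads, and the mask-based outputs are assumptions for illustration, not the patent's prescribed architecture:

```python
import torch
import torch.nn as nn

class TwoStageSeparator(nn.Module):
    """Two-stage separation: stage 1 separates from spectral features alone,
    stage 2 refines the stage-1 result using the phase (IPD) features."""

    def __init__(self, freq_bins=257, phase_dim=514, hidden=300, num_sources=2):
        super().__init__()
        self.stage1 = nn.LSTM(freq_bins, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.head1 = nn.Linear(2 * hidden, freq_bins * num_sources)
        self.stage2 = nn.LSTM(freq_bins * num_sources + phase_dim, hidden,
                              num_layers=2, bidirectional=True, batch_first=True)
        self.head2 = nn.Linear(2 * hidden, freq_bins * num_sources)

    def forward(self, spec, phase):
        # spec: (batch, frames, freq_bins); phase: (batch, frames, phase_dim)
        h1, _ = self.stage1(spec)
        third = torch.sigmoid(self.head1(h1))          # the "third separation result"
        h2, _ = self.stage2(torch.cat([third, phase], dim=-1))
        masks = torch.sigmoid(self.head2(h2))          # final per-source T-F masks
        return third, masks
```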

In the embodiment of the invention, the voice features of the voice to be processed are obtained, where the voice to be processed includes voice signals of at least two sound sources and the voice features include at least phase features. The voice features of the voice to be processed are then input into a pre-trained voice separation network model to perform voice separation on the voice to be processed and obtain a voice separation result. Because the phase features allow the separated voices to be matched to their actual sound sources, the voice separation effect is improved.

Optionally, before the inputting the speech feature of the speech to be processed into the pre-trained speech separation network model, the method further includes:

dividing the voice to be processed into a first voice section and a second voice section, wherein the first voice section and the second voice section both comprise N frames of continuous voice signals, the first voice section and the second voice section both at least comprise an ith frame of voice signal, the ith frame of voice signal is any frame of voice signal in the voice to be processed, N is greater than 0, and i is greater than 0;

the inputting the voice features of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result includes:

inputting the voice characteristics of the first voice segment into a pre-trained voice separation network model to obtain a first separation result of the first voice segment;

inputting the voice characteristics of the second voice segment into the voice separation network model to obtain a second separation result of the second voice segment;

and acquiring a voice separation result of the ith frame of voice signal based on the first separation result and the second separation result.

The last frame of voice signal of the first voice segment may be the same as the first frame of voice signal of the second voice segment, namely the i-th frame voice signal. Taking N as 3 and i as 3 as an example, the first voice segment may include the 1st to 3rd frame voice signals of the voice to be processed, and the second voice segment may include the 3rd to 5th frame voice signals. In other words, the voice to be processed can be segmented with overlap, dividing it into a plurality of voice sections in which every two adjacent sections share a common frame, and the voice features of the divided sections are input into the voice separation network model for voice separation.

In addition, obtaining the voice separation result of the i-th frame voice signal based on the first separation result and the second separation result may be done by determining the one of the first separation result and the second separation result whose product with the voice to be processed is larger, and obtaining the voice separation result of the i-th frame voice signal from that separation result; alternatively, the average of the first separation result and the second separation result may be used as the voice separation result of the i-th frame voice signal, and so on.
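A minimal sketch of this overlapped segmentation, under the assumption that adjacent segments share exactly one boundary frame as in the N=3 example above:

```python
import numpy as np

def overlap_segments(frames, seg_len):
    """Split a (num_frames, feat_dim) feature array into overlapping segments
    of seg_len frames, where consecutive segments share one boundary frame
    (e.g. frames 1-3 and 3-5 for seg_len=3, as in the example above)."""
    hop = seg_len - 1                       # one shared frame between neighbours
    return [frames[s:s + seg_len] for s in range(0, len(frames) - seg_len + 1, hop)]
```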

In this embodiment, the first speech segment and the second speech segment each include at least an i-th frame speech signal, and the speech to be processed is subjected to overlap segmentation processing, so that each frame speech signal can refer to information of an adjacent frame in a speech separation process, and speech separation can be performed twice, thereby improving a speech separation effect.

Optionally, the obtaining a voice separation result of the i-th frame voice signal based on the first separation result and the second separation result includes:

and acquiring a voice separation result of the i-th frame voice signal based on one of the first separation result and the second separation result, which has a larger product with the voice to be processed.

The voice separation result of the i-th frame voice signal may be obtained from the one of the first separation result and the second separation result whose product with the voice to be processed has the larger absolute value: that separation result is determined first, and the voice separation result of the i-th frame voice signal is then extracted from it. Taking the first voice segment k1 and the second voice segment k2 as an example, the selection can be computed as max(abs(Out_k1(f,t) · X(f,t)), abs(Out_k2(f,t) · X(f,t))), which yields the separation result whose product with the voice to be processed has the larger absolute value, where Out_k1(f,t) is the first separation result, Out_k2(f,t) is the second separation result, and X(f,t) is the voice feature of the voice to be processed.
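For the shared frame, the selection could be sketched as follows (a simplified per-frame version; treating Out and X as per-frame vectors is an assumption for illustration):

```python
import numpy as np

def select_frame_result(out_k1, out_k2, x_i):
    """For the frame shared by segments k1 and k2, keep the separation
    output whose product with the mixture feature X(f, t) has the larger
    absolute value, i.e. max(abs(Out_k1 * X), abs(Out_k2 * X)).
    out_k1, out_k2, x_i: (freq,) vectors for the shared frame i."""
    p1 = np.sum(np.abs(out_k1 * x_i))
    p2 = np.sum(np.abs(out_k2 * x_i))
    return out_k1 if p1 >= p2 else out_k2
```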

In this embodiment, the one of the first separation result and the second separation result whose product with the voice to be processed is larger is determined, and the voice separation result of the i-th frame voice signal is obtained from it. Because this product reflects the concentration of the sound source's spatial spectrum, taking the more concentrated result as the voice separation result can further improve the voice separation effect.

Optionally, the voice feature further includes a spectral feature, the voice separation network model includes a first neural network model and a second neural network model, and the voice feature of the voice to be processed is input into a pre-trained voice separation network model to perform voice separation on the voice to be processed, so as to obtain a voice separation result, including:

inputting the spectral characteristics of the voice to be processed into the first neural network model so as to perform voice separation on the voice to be processed and obtain a third separation result;

and inputting the third separation result and the phase characteristics into the second neural network model so as to perform voice separation on the third separation result and obtain a voice separation result.

The first neural network model may be a BLSTM (bidirectional long short-term memory) model for speech separation, or an LSTM model, or an RNN model, etc.; the second neural network model may likewise be a BLSTM model, an LSTM model, or an RNN model, etc.

In this embodiment, the first speech separation is performed based on the first neural network model using the spectral features, the second speech separation is performed based on the second neural network model using the phase features, and the speech separation effect can be further improved by performing the speech separation twice.

Optionally, the inputting the voice feature of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result includes:

and inputting the voice characteristics of the voice to be processed and the voice characteristics of the voice in the fixed beam direction into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

The fixed-beam-direction voice may be a preset voice. For example, a voice may be played at a preset position relative to a microphone, the played voice may be collected by the microphone, and the collected voice may be regarded as the fixed-beam-direction voice. The voice features of the fixed-beam-direction voice may include spectral features, or both spectral features and phase features. Inputting the voice features of the voice to be processed and the voice features of the fixed-beam-direction voice into the pre-trained voice separation network model may consist of splicing (concatenating) the two sets of features and then inputting the result into the model.
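A rough sketch of appending fixed-beam features, assuming the fixed beamformer weights are given in advance (the einsum-based filter-and-sum and the log-magnitude feature are illustrative assumptions, not the patent's prescribed form):

```python
import numpy as np

def append_beam_features(feats, beam_coeffs, mix_stft):
    """Append a fixed-direction beamformer output's log-magnitude to the
    input features. beam_coeffs: (channels, freq) fixed beam weights;
    mix_stft: (channels, freq, frames) multi-channel STFT; feats: (d, frames)."""
    beam_out = np.einsum('cf,cft->ft', beam_coeffs.conj(), mix_stft)
    beam_feat = np.log(np.abs(beam_out) + 1e-8)
    return np.concatenate([feats, beam_feat], axis=0)
```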

In this embodiment, the voice feature of the voice to be processed and the voice feature of the voice in the fixed beam direction are input into the pre-trained voice separation network model to perform voice separation on the voice to be processed, so as to obtain a voice separation result.

Referring to fig. 3, fig. 3 is a flowchart of a model training method according to an embodiment of the present invention, and as shown in fig. 3, the method includes the following steps:

step 201, obtaining the voice characteristics of the training sample, wherein the voice characteristics at least comprise phase characteristics.

The feature extraction can be performed on the training sample to obtain the voice feature of the training sample. The phase feature may be a phase feature obtained based on multiple voice channels, for example, each microphone may be a voice channel, training samples may be obtained through multiple microphones, and the phase feature may be a phase feature of training samples obtained by multiple microphones. The speech features may also include spectral features, which may be spectral features acquired based on a plurality of speech channels.

And 202, training a voice separation network model based on the voice characteristics of the training samples.

Wherein the training of the speech separation network model based on the speech features of the training samples may include: and training a voice separation network model based on the spectral characteristics and the phase characteristics of the training samples. The training of the speech separation network model for speech separation based on the speech features of the training samples may include: training the first neural network model and the second neural network model based on spectral features and phase features of the training samples.

In addition, the voice separation network model may further include a first neural network model and a second neural network model, and the inputting the voice feature of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed to obtain a voice separation result may include: inputting the spectral characteristics of the voice to be processed into the first neural network model so as to perform voice separation on the voice to be processed and obtain a third separation result; and inputting the third separation result and the phase characteristics into the second neural network model so as to perform voice separation on the third separation result and obtain a voice separation result.

In practical applications, if there are N sound sources in total, the voice separation process has one input and N outputs. When calculating the loss function, the outputs and the labels must be placed in one-to-one correspondence, and the ordering in this correspondence gives rise to a permutation problem. For example, when two voices are separated, let A and B respectively denote the two outputs of the model and C and D respectively denote the labels; it cannot be determined whether A corresponds to C and B to D, or A corresponds to D and B to C, so the trained voice separation network model is inaccurate.
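The ambiguity can be made concrete with a small sketch that enumerates both pairings of two outputs against two labels; the helper name and the mean-squared-error loss are assumptions for illustration:

```python
import itertools
import numpy as np

def pairing_losses(outputs, labels):
    """Illustrates the permutation problem: with outputs (A, B) and labels
    (C, D) there are two possible pairings, and the loss alone does not say
    which assignment matches the true sources."""
    losses = []
    for perm in itertools.permutations(range(len(labels))):
        loss = sum(np.mean((outputs[i] - labels[p]) ** 2)
                   for i, p in enumerate(perm))
        losses.append((perm, loss))
    return losses   # [((0, 1), loss_AC_BD), ((1, 0), loss_AD_BC)]
```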

Further, the model training method in the embodiment of the present invention may be used to train the voice separation network model, and the trained voice separation network model is used as the pre-trained voice separation network model in the voice separation method in the above embodiment.

In the embodiment of the invention, in the process of model training, the voice characteristics of a training sample are obtained, wherein the voice characteristics at least comprise phase characteristics; based on the voice characteristic training voice separation network model of the training sample, the voice separated can be corresponding to the actual sound source based on the phase characteristic, so that the accuracy of the trained voice separation network model can be improved, the trained voice separation network model is adopted for voice separation, and the voice separation effect can be improved.

Optionally, before training the speech separation network model based on the speech features of the training samples, the method further includes:

dividing the training sample into a third voice section and a fourth voice section, wherein the third voice section and the fourth voice section both comprise M frames of continuous voice signals, the third voice section and the fourth voice section both at least comprise a j-th frame of voice signal, the j-th frame of voice signal is any frame of voice signal in the training sample, M is greater than 0, and j is greater than 0;

training a voice separation network model for voice separation based on voice features of the training samples, including:

inputting the voice characteristics of the third voice segment into a voice separation network model for voice separation to obtain a third separation result of the third voice segment;

inputting the voice features of the fourth voice segment into the voice separation network model to obtain a fourth separation result of the fourth voice segment;

updating parameters of the voice separation network model based on a voice separation result of the j-th frame voice signal and a target output;

wherein the voice separation result of the j-th frame voice signal is obtained based on the one of the third separation result and the fourth separation result whose confidence score is higher.

The last frame of voice signal of the third voice segment may be the same as the first frame of voice signal of the fourth voice segment, namely the j-th frame voice signal. Taking M as 4 and j as 4 as an example, the third voice segment may include the 1st to 4th frame voice signals of the training sample, and the fourth voice segment may include the 4th to 7th frame voice signals. The training sample can be segmented with overlap, dividing it into a plurality of voice sections in which every two adjacent sections share a common frame, and the voice features of the divided sections are input into the voice separation network model for model training. The value of M may be the same as or different from the value of N.

In addition, the voice separation result of the j-th frame voice signal may be obtained based on the one of the third separation result and the fourth separation result whose confidence score is higher: that result is determined, and the voice separation result of the j-th frame voice signal is extracted from it.

Further, take as an example the case where the output of the voice separation network model is the voice signals of two sound sources, denoted sound source A and sound source B, whose directions are θ_A and θ_B respectively. For each voice segment of the training sample there are several cases. If the voices of both sound sources last long enough, the segment can be regarded as a confirmed dual-source segment, and the output order of the voice separation network model can be determined by θ_A and θ_B: assuming θ_A < θ_B, during training the voice signal of sound source A in the model's output is taken as the first output and the voice signal of sound source B as the second output. If only one sound source's voice exists and its duration exceeds a certain threshold, the segment can be regarded as a confirmed single-source segment; of the two outputs of the voice separation network model, the first output is the separated voice and the second output is 0. If there is no voice from either sound source, the outputs of the voice separation network model are all 0. In addition, to avoid the situation where the information required for voice separation is insufficient and the output of the voice separation network model may become uncontrollable, a voice segment containing the voice of 1 or 2 sound sources in which the voice of at least 1 sound source is of short duration may be excluded from the training of the voice separation network model. A sketch of this labeling rule is given below.
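A minimal sketch of the segment-labeling rule just described, assuming ideal per-source masks and per-segment speech durations are available (the function name and the exact duration tests are illustrative assumptions):

```python
import numpy as np

def segment_targets(mask_a, mask_b, dur_a, dur_b, min_dur):
    """Build the two-channel target output for one training segment, with
    direction theta_A < theta_B fixing sound source A as the first channel.
    mask_a, mask_b: (freq, frames) ideal masks; dur_a, dur_b: speech
    durations of A and B in the segment. Returns None to skip the segment."""
    zeros = np.zeros_like(mask_a)
    if dur_a >= min_dur and dur_b >= min_dur:   # confirmed dual-source segment
        return mask_a, mask_b
    if dur_a >= min_dur and dur_b == 0:         # confirmed single source (A only)
        return mask_a, zeros
    if dur_b >= min_dur and dur_a == 0:         # confirmed single source (B only)
        return mask_b, zeros
    if dur_a == 0 and dur_b == 0:               # no speech at all
        return zeros, zeros
    return None                                 # some source too short: skip
```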

In this embodiment, the third speech segment and the fourth speech segment at least include a jth frame of speech signal, and the training samples are subjected to overlap segmentation processing, so that each frame of speech signal can refer to information of an adjacent frame in the speech separation process, and speech separation can be performed twice, thereby improving accuracy of a speech separation network model, and further improving the speech separation effect.

Optionally, the confidence score is determined based on the target output and the output of the speech separation network model; or

the confidence score is determined based on a speech enhancement value and an actual speech value, where the speech enhancement value is the product of the input of the speech separation network model and the beam coefficient of a sound source, and the actual speech value is the product of the output and the input of the speech separation network model.

The confidence score determined based on the target output and the output of the voice separation network model may be obtained by calculating the square of the difference between the model's output for each frame of voice signal in the voice segment and the target output, multiplying that square by the weight of the frequency point of each frame of voice signal to obtain a first product, and summing the first products corresponding to the multiple frames of voice signals in the voice segment to obtain the confidence score of the voice segment. In practical applications, taking voice segment k as an example, where segment k includes the voice signals of sound source A and sound source B, the confidence score of the voice signal of sound source A separated from segment k may be:

Cost_k = ∑_{f,t} Weight(f,t) · [Mask(f,t) − Out_k(f,t)]²

where Weight(f,t) is the weight of the frequency point of each frame of voice signal, Mask(f,t) is the target output for the voice signal of sound source A, Out_k(f,t) is the voice signal of sound source A separated from segment k by the voice separation network model, f is frequency, and t is time.

In addition, the confidence score determined based on the voice enhancement value and the actual voice value may be obtained by calculating the square of the difference between the voice enhancement value and the actual voice value, multiplying that square by the weight of the frequency point of each frame of voice signal to obtain a second product, and summing the second products corresponding to the multiple frames of voice signals in the voice segment to obtain the confidence score of the voice segment. In practical applications, taking voice segment k as an example, where segment k includes the voice signals of sound source A and sound source B, the confidence score of the voice signal of sound source A separated from segment k may be:

Cost_k = ∑_{f,t} Weight(f,t) · [Beam(f,t) · X(f,t) − Out_k(f,t) · X(f,t)]²

where Weight(f,t) is the weight of the frequency point of each frame of voice signal, Beam(f,t) is the beam coefficient of the beam in the direction of sound source A, X(f,t) is the voice feature of the voice signal input to the voice separation network model, Out_k(f,t) is the voice signal of sound source A separated from segment k by the voice separation network model, f is frequency, and t is time.
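For concreteness, the two confidence-score variants above can be transcribed directly as follows (the array shapes (freq, frames) and the use of |·|² for possibly complex spectra are assumptions):

```python
import numpy as np

def cost_vs_target(weight, mask, out_k):
    """Cost_k = sum_{f,t} Weight(f,t) * [Mask(f,t) - Out_k(f,t)]^2:
    weighted squared distance between segment k's output and its target."""
    return np.sum(weight * (mask - out_k) ** 2)

def cost_vs_beam(weight, beam, x, out_k):
    """Cost_k = sum_{f,t} Weight(f,t) * [Beam(f,t)*X(f,t) - Out_k(f,t)*X(f,t)]^2:
    weighted squared distance between the beamformed enhancement value and
    the actual voice value; |.|^2 handles complex spectra."""
    return np.sum(weight * np.abs(beam * x - out_k * x) ** 2)
```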

In this embodiment, when the confidence score is determined based on the target output and the output of the voice separation network model, the output closest to the target can be used as the voice separation result, which can improve the accuracy of the voice separation network model and thus the voice separation effect; or, when the confidence score is determined based on the voice enhancement value and the actual voice value, the result whose spatial spectrum is more concentrated in the sound source's direction can be used as the voice separation result, which can likewise improve the accuracy of the voice separation network model and the voice separation effect.

Optionally, the speech features further include spectral features, the speech separation network model includes a first neural network model and a second neural network model, and training the speech separation network model for speech separation based on the speech features of the training samples includes:

training the first neural network model and the second neural network model based on spectral features and phase features of the training samples.

The first neural network model can be a BLSTM (bidirectional long short-term memory) model for speech separation, or an LSTM model, or an RNN model, etc.; the second neural network model can likewise be a BLSTM model, an LSTM model, or an RNN model, etc.

In this embodiment, the first neural network model and the second neural network model are trained based on the spectral feature and the phase feature of the training sample, so that the accuracy of the voice separation network model can be improved, and the voice separation effect can be improved.

Optionally, the output of the voice separation network model includes voice signals of at least two sound sources, and the correspondence between the voice signals of the at least two sound sources output by the voice separation network model and the target output is determined based on sound source localization.

In the process of training the voice separation network model, because the training sample includes a plurality of voice segments and the model separates each segment individually, the voice signal of a sound source may fail to correspond to the target output. For example, consider two adjacent voice segments where the former contains only the voice signal of one sound source, e.g. sound source B: the voice separation network model outputs the voice signal of sound source B on the first output. If the latter segment contains the voice signals of two sound sources, e.g. sound source A and sound source B, the model may then output the voice signal of sound source B on the second output, so that the correspondence between the sound source's voice signal and the target output is incorrect; the voice signal of sound source B should then be adjusted to the first output. The output order can be determined by means of sound source localization: the voice signals of sound sources located in the same direction are always emitted on the same output of the voice separation network model, so the correspondence between the model's outputs and the target outputs can be determined.
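A minimal sketch of this direction-based ordering, assuming a direction-of-arrival estimate is available for each separated output (the DOA estimation itself is outside this sketch):

```python
def order_by_direction(outputs, directions):
    """Keep a source localized in a given direction on a fixed output
    channel: sort the separated signals by their estimated DOA so the
    model's outputs and the target outputs stay in correspondence."""
    order = sorted(range(len(outputs)), key=lambda k: directions[k])
    return [outputs[k] for k in order]
```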

In this embodiment, the output of the voice separation network model includes voice signals of at least two sound sources, and the correspondence between the voice signals of at least two sound sources output by the voice separation network model and the target output is determined based on sound source localization, so that it is further possible to avoid that the actual output of the voice separation network model does not correspond to the target output, thereby improving the accuracy of the voice separation network model and further improving the voice separation effect.

Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, the electronic device 300 includes:

an obtaining module 301, configured to obtain a voice feature of a voice to be processed, where the voice to be processed includes voice signals of at least two sound sources, and the voice feature at least includes a phase feature;

an input module 302, configured to input the voice feature of the voice to be processed into a pre-trained voice separation network model, so as to perform voice separation on the voice to be processed, so as to obtain a voice separation result.

Optionally, as shown in fig. 5, the electronic device 300 further includes:

a segmenting module 303, configured to divide the speech to be processed into a first speech segment and a second speech segment, where the first speech segment and the second speech segment both include N frames of continuous speech signals, and both the first speech segment and the second speech segment include at least an ith frame of speech signal, where the ith frame of speech signal is any frame of speech signal in the speech to be processed, N is greater than 0, and i is greater than 0;

the input module 302 includes:

a first input unit 3021, configured to input a speech feature of the first speech segment into a pre-trained speech separation network model, so as to obtain a first separation result of the first speech segment;

a second input unit 3022, configured to input the voice feature of the second voice segment into the voice separation network model, so as to obtain a second separation result of the second voice segment;

an obtaining unit 3023 configured to obtain a voice separation result of the i-th frame voice signal based on the first separation result and the second separation result.

Optionally, the obtaining unit 3023 is specifically configured to:

and acquiring a voice separation result of the i-th frame voice signal based on one of the first separation result and the second separation result, which has a larger product with the voice to be processed.

Optionally, the speech features further include spectral features, the speech separation network model includes a first neural network model and a second neural network model, and the input module 302 is specifically configured to:

inputting the spectral characteristics of the voice to be processed into the first neural network model so as to perform voice separation on the voice to be processed and obtain a third separation result;

and inputting the third separation result and the phase characteristics into the second neural network model so as to perform voice separation on the third separation result and obtain a voice separation result.

Optionally, the input module 302 is specifically configured to:

and inputting the voice characteristics of the voice to be processed and the voice characteristics of the voice in the fixed beam direction into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

The electronic device can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.

Referring to fig. 6, fig. 6 is a third schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 6, the electronic device 400 includes:

an obtaining module 401, configured to obtain a speech feature of a training sample, where the speech feature at least includes a phase feature;

a training module 402, configured to train a speech separation network model based on the speech features of the training samples.

Optionally, as shown in fig. 7, the electronic device 400 further includes:

a segmenting module 403, configured to divide the training sample into a third speech segment and a fourth speech segment, where the third speech segment and the fourth speech segment both include M frames of continuous speech signals, both include at least a j-th frame speech signal, the j-th frame speech signal is any frame of speech signal in the training sample, M is greater than 0, and j is greater than 0;

the training module 402 is specifically configured to:

inputting the voice characteristics of the third voice segment into a voice separation network model for voice separation to obtain a third separation result of the third voice segment;

inputting the voice features of the fourth voice segment into the voice separation network model to obtain a fourth separation result of the fourth voice segment;

updating parameters of the voice separation network model based on a voice separation result of the j-th frame voice signal and a target output;

wherein the voice separation result of the j-th frame voice signal is obtained based on the one of the third separation result and the fourth separation result whose confidence score is higher.

Optionally, the confidence score is determined based on the target output and the output of the speech separation network model; or

the confidence score is determined based on a speech enhancement value and an actual speech value, where the speech enhancement value is the product of the input of the speech separation network model and the beam coefficient of a sound source, and the actual speech value is the product of the output and the input of the speech separation network model.

Optionally, the speech features further include spectral features, the speech separation network model includes a first neural network model and a second neural network model, and the training module 402 is specifically configured to:

training the first neural network model and the second neural network model based on spectral features and phase features of the training samples.

Optionally, the output of the voice separation network model includes voice signals of at least two sound sources, and the correspondence between the voice signals of the at least two sound sources output by the voice separation network model and the target output is determined based on sound source localization.

The electronic device can implement each process implemented in the method embodiment of fig. 3, and details are not described here to avoid repetition.

Referring to fig. 8, fig. 8 is a fifth schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 8, the electronic device 500 includes: a memory 502, a processor 501, and a program stored on the memory 502 and executable on the processor 501, wherein:

in one embodiment, the processor 501 reads the program in the memory 502 for executing:

acquiring voice characteristics of voice to be processed, wherein the voice to be processed comprises voice signals of at least two sound sources, and the voice characteristics at least comprise phase characteristics;

and inputting the voice characteristics of the voice to be processed into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

Optionally, the processor 501 is further configured to perform:

dividing the voice to be processed into a first voice section and a second voice section, wherein the first voice section and the second voice section both comprise N frames of continuous voice signals, the first voice section and the second voice section both at least comprise an ith frame of voice signal, the ith frame of voice signal is any frame of voice signal in the voice to be processed, N is greater than 0, and i is greater than 0;

the processor 501 is configured to input the voice feature of the voice to be processed into a pre-trained voice separation network model, so as to perform voice separation on the voice to be processed, and obtain a voice separation result, where the voice separation result includes:

inputting the voice characteristics of the first voice segment into a pre-trained voice separation network model to obtain a first separation result of the first voice segment;

inputting the voice characteristics of the second voice segment into the voice separation network model to obtain a second separation result of the second voice segment;

and acquiring a voice separation result of the ith frame of voice signal based on the first separation result and the second separation result.

Optionally, the obtaining, by the processor 501, a voice separation result of the i-th frame voice signal based on the first separation result and the second separation result, includes:

and acquiring a voice separation result of the i-th frame voice signal based on one of the first separation result and the second separation result, which has a larger product with the voice to be processed.

Optionally, the voice features further include spectral features, the voice separation network model includes a first neural network model and a second neural network model, and the processor 501 is configured to input the voice features of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed, so as to obtain a voice separation result, including:

inputting the spectral characteristics of the voice to be processed into the first neural network model so as to perform voice separation on the voice to be processed and obtain a third separation result;

and inputting the third separation result and the phase characteristics into the second neural network model so as to perform voice separation on the third separation result and obtain a voice separation result.

Optionally, the inputting, by the processor 501, the voice feature of the voice to be processed into a pre-trained voice separation network model to perform voice separation on the voice to be processed, so as to obtain a voice separation result, where the inputting includes:

and inputting the voice characteristics of the voice to be processed and the voice characteristics of the voice in the fixed beam direction into a pre-trained voice separation network model so as to perform voice separation on the voice to be processed and obtain a voice separation result.

In another embodiment, the processor 501 reads the program in the memory 502 for executing:

acquiring voice features of a training sample, wherein the voice features at least comprise phase features;

and training a voice separation network model based on the voice characteristics of the training samples.

Optionally, the processor 501 is further configured to perform:

dividing the training sample into a third voice section and a fourth voice section, wherein the third voice section and the fourth voice section both comprise M frames of continuous voice signals, the third voice section and the fourth voice section both at least comprise a j-th frame of voice signal, the j-th frame of voice signal is any frame of voice signal in the training sample, M is greater than 0, and j is greater than 0;

the training, performed by the processor 501, of the speech separation network model for speech separation based on the speech features of the training samples includes:

inputting the voice characteristics of the third voice segment into a voice separation network model for voice separation to obtain a third separation result of the third voice segment;

inputting the voice features of the fourth voice segment into the voice separation network model to obtain a fourth separation result of the fourth voice segment;

updating parameters of the voice separation network model based on a voice separation result of the j-th frame voice signal and a target output;

wherein the voice separation result of the j-th frame voice signal is obtained based on the one of the third separation result and the fourth separation result whose confidence score is higher.

Optionally, the confidence score is determined based on the target output and the output of the speech separation network model; or

the confidence score is determined based on a speech enhancement value and an actual speech value, where the speech enhancement value is the product of the input of the speech separation network model and the beam coefficient of a sound source, and the actual speech value is the product of the output and the input of the speech separation network model.

Optionally, the speech features further include spectral features, the speech separation network model includes a first neural network model and a second neural network model, and the training, performed by the processor 501, of the speech separation network model for speech separation based on the speech features of the training samples includes:

training the first neural network model and the second neural network model based on spectral features and phase features of the training samples.

Optionally, the output of the voice separation network model includes voice signals of at least two sound sources, and the correspondence between the voice signals of the at least two sound sources output by the voice separation network model and the target output is determined based on sound source localization.

In fig. 8, the bus architecture may include any number of interconnected buses and bridges, with one or more processors represented by processor 501 and various circuits of memory represented by memory 502 being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.

The processor 501 is responsible for managing the bus architecture and general processing, and the memory 502 may store data used by the processor 501 in performing operations.

It should be noted that any implementation manner in the method embodiment of the present invention may be implemented by the electronic device in this embodiment, and achieve the same beneficial effects, and details are not described here.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the foregoing speech separation method embodiment, or when the computer program is executed by the processor, the computer program implements each process of the foregoing model training method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
