Voice separation method, system, device and storage medium

Document No.: 170874    Publication date: 2021-10-29

This technology, "Voice separation method, system, device and storage medium", was designed and created by 刘博卿, 王健宗 and 张之勇 on 2021-07-24. The invention discloses a voice separation method, system, device and storage medium. The voice separation method comprises the steps of segmenting a voice signal to obtain a plurality of voice segments; mapping the voice segments to a time domain feature space to obtain a time domain vector; iteratively performing multiple rounds of recognition processing on the time domain vector, and stopping the recognition processing once the confidence level obtained by the recognition processing is smaller than a threshold value; acquiring the speech feature vector of the target speaker; and determining the voice activity value corresponding to the target speaker according to the time domain vector, the speech feature vector of the target speaker and the speech feature vectors of the speakers recognized in each round of recognition processing. The voice activity value obtained by the voice separation method can indicate whether the target speaker speaks at a given moment in the voice signal, so that the speaking order of the speakers can be distinguished easily and clearly, alleviating the problem of order ambiguity. The invention can be widely applied to the technical field of voice processing.

1. A method of speech separation, comprising:

acquiring a voice signal;

segmenting the voice signal to obtain a plurality of voice segments;

mapping the voice segments to a time domain feature space to obtain time domain sub-vectors;

splicing the time domain sub-vectors to obtain time domain vectors;

performing multiple rounds of identification processing on the time domain vector until an iteration stop condition is met; in the 1st round of identification processing, identifying the voice feature vector of the speaker contained in the time domain vector; in the i-th round of identification processing, where i ∈ N and i > 1, identifying the voice feature vector of the speaker contained in the time domain vector, inputting the time domain vector and the voice feature vector identified by the (i-1)-th round of identification processing into a classifier, and acquiring a first confidence coefficient output by the classifier, wherein the first confidence coefficient is used for representing the probability that the voice feature vector identified by the i-th round of identification processing comes from a new speaker; the new speaker is a speaker whose corresponding voice feature vector has not been identified in any round of identification processing before the i-th round; the iteration stop condition is that a first confidence coefficient obtained by executing the identification processing is smaller than a threshold value;

acquiring a voice feature vector of a target speaker;

obtaining a second average value; the second average value is the average value of all the speech feature vectors identified in all the identification processes;

and determining a voice activity value corresponding to the target speaker according to the time domain vector, the voice characteristic vector of the target speaker and the second average value.

2. The speech separation method of claim 1, wherein the mapping the speech segments into a time-domain feature space to obtain time-domain sub-vectors comprises:

inputting each of the speech segments to a time-domain encoder;

and performing down-sampling on each voice segment through the time domain encoder and mapping the down-sampled voice segment to the time domain feature space to obtain a corresponding time domain sub-vector.

3. The speech separation method of claim 2 wherein the concatenating the time-domain subvectors to obtain a time-domain vector comprises:

and splicing the time domain sub-vectors according to the time sequence of the corresponding voice segments in the voice signal to obtain the time domain vector.

4. The speech separation method of claim 3, wherein the inputting the time-domain vector and the speech feature vector identified by the i-1 th round of recognition processing into a classifier to obtain a first confidence level of the classifier output comprises:

acquiring a first average value; the first average value is the average value of the voice characteristic vectors of the speakers identified in the i-1 th round of identification processing;

inputting the first average value and the time domain vector to a classifier;

classifying by the classifier according to the first average value and the time domain vector to determine an event type and a corresponding confidence coefficient; the event types are a first event, a second event, a third event and a fourth event; the first event is that only the voice of the new speaker exists in the voice segment; the second event is that only the voices of speakers other than the new speaker exist in the voice segment; the third event is that both the voice of the new speaker and the voices of speakers other than the new speaker exist in the voice segment; the fourth event is that no speaker's voice exists in the voice segment; the first event corresponds to a first confidence coefficient;

and when the determined event type is the first event, returning a corresponding first confidence coefficient, and analyzing the speech feature vector of the new speaker from the speech segment.

5. The method of claim 4, wherein the determining the voice activity value corresponding to the target speaker according to the time domain vector, the voice feature vector of the target speaker and the second average value comprises:

splicing the second average value with the voice characteristic vector of the target speaker to obtain a spliced vector;

inputting the splicing vector into a first fully-connected network, and mapping the splicing vector by the first fully-connected network to obtain a first mapping value;

inputting the time domain vector to a second fully connected network, and mapping the time domain vector by the second fully connected network to obtain a second mapping value;

and multiplying the first mapping value by the second mapping value to obtain the voice activity value.

6. The speech separation method of any one of claims 1-5, further comprising the step of jointly training the time-domain encoder, the classifier, the first fully-connected network, and the second fully-connected network.

7. The method of claim 6, wherein in the step of jointly training the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network, the loss function corresponding to the classifier is L_selector(h, μ) = -(1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} log P(e_t | h_t, μ_i), wherein L_selector(h, μ) is the loss function corresponding to the classifier, h is the time domain vector, μ is the first average value of the voice feature vectors of the speakers identified in the (i-1)-th round of identification processing, T represents the number of the voice segments, N represents the current number of rounds in training, t represents the time, h_t represents the time domain vector corresponding to time t, e_t represents the event type corresponding to the time domain vector h_t, i is the round number, and μ_i represents the first average value corresponding to the i-th round;

the loss function corresponding to the first fully-connected network and the second fully-connected network is L_vad(ŷ, y) = -(1/(N·T)) Σ_{i=1}^{N} Σ_{t=1, t∉B_r}^{T} [ y_{i,t} log ŷ_{i,t} + (1 - y_{i,t}) log(1 - ŷ_{i,t}) ], wherein L_vad(ŷ, y) is the loss function for the first fully-connected network and the second fully-connected network, ŷ_{i,t} represents the voice activity value, y_{i,t} represents the label value in the training sample used in the training, and B_r represents the set fault tolerance time.

8. A speech separation system, comprising:

a first module for acquiring a voice signal;

the second module is used for segmenting the voice signal to obtain a plurality of voice segments;

a third module, configured to map the speech segment to a time domain feature space, to obtain a time domain sub-vector;

a fourth module, configured to splice the time-domain sub-vectors to obtain time-domain vectors;

a fifth module, configured to perform multiple rounds of identification processing on the time domain vector until an iteration stop condition is met; in the 1st round of identification processing, identifying the voice feature vector of the speaker contained in the time domain vector; in the i-th round of identification processing, where i ∈ N and i > 1, identifying the voice feature vector of the speaker contained in the time domain vector, inputting the time domain vector and the voice feature vector identified by the (i-1)-th round of identification processing into a classifier, and acquiring a first confidence coefficient output by the classifier, wherein the first confidence coefficient is used for representing the probability that the voice feature vector identified by the i-th round of identification processing comes from a new speaker; the new speaker is a speaker whose corresponding voice feature vector has not been identified in any round of identification processing before the i-th round; the iteration stop condition is that a first confidence coefficient obtained by executing the identification processing is smaller than a threshold value;

the sixth module is used for acquiring the voice characteristic vector of the target speaker;

a seventh module for obtaining a second average value; the second average value is the average value of all the speech feature vectors identified in all the identification processes;

and the eighth module is used for determining a voice active value corresponding to the target speaker according to the time domain vector, the voice feature vector of the target speaker and the second average value.

9. A computer apparatus comprising a memory for storing at least one program and a processor for loading the at least one program to perform the speech separation method of any of claims 1-7.

10. A storage medium having stored therein a processor-executable program, wherein the processor-executable program, when executed by a processor, is configured to perform the speech separation method of any one of claims 1-7.

Technical Field

The invention relates to the technical field of voice processing, in particular to a voice separation method, a voice separation system, a computer device and a storage medium.

Background

The voice separation technique can identify when a target speaker is speaking in a segment of a voice signal, i.e., it addresses the problem of who spoke when in the voice signal. Some related speech separation technologies use a neural network to perform the speech separation task and require that the neural network be trained in advance with the voice of the target speaker, so that the neural network has the capability of recognizing whether the target speaker is speaking. However, such techniques suffer from ambiguity in the speaking order: for example, the voices of speaker A and speaker B may both be recognized in a voice signal, but it is difficult to determine the order in which A and B spoke.

Disclosure of Invention

In view of the foregoing technical problems, an object of the present invention is to provide a speech separation method, system, computer device and storage medium, so as to improve the accuracy of speech signal recognition.

In one aspect, an embodiment of the present invention provides a speech separation method, including:

acquiring a voice fragment;

mapping the voice segments to a time domain feature space to obtain time domain vectors;

performing multiple rounds of recognition processing on the time domain vector until an iteration stop condition is met; in the first round of the recognition processing, recognizing the speech feature vectors of the speaker contained in the time domain vector and the corresponding confidence levels; in each round of the recognition processing except the first round, recognizing the speech feature vector of the new speaker included in the processing result of the previous round of the recognition processing and the corresponding confidence level, the new speaker being a speaker whose corresponding speech feature vector has not been recognized in any previous round of the recognition processing; the iteration stop condition is that the confidence level obtained by executing the recognition processing is smaller than a threshold value;

acquiring a voice feature vector of a target speaker;

and determining a voice active value corresponding to the target speaker according to the time domain vector, the voice feature vector of the target speaker and the voice feature vector of the speaker recognized in each round of recognition processing.

Further, the obtaining the voice segment includes:

acquiring a voice signal;

and segmenting the voice signal to obtain a plurality of voice segments.

Further, the mapping the voice segment to a time domain feature space to obtain a time domain vector includes:

inputting each of the speech segments to a time-domain encoder;

down-sampling each voice segment through the time domain encoder and mapping the voice segment to the time domain feature space to obtain a corresponding time domain sub-vector;

and splicing the time domain sub-vectors according to the time sequence of the corresponding voice segments in the voice signal to obtain the time domain vector.

Further, the recognizing the speech feature vector of the new speaker and the corresponding confidence included in the processing result of the previous round of the recognition processing includes:

acquiring a first average value; the first average value is the average value of the voice feature vectors of the speakers identified in the previous round of identification processing;

inputting the first average value and the time domain vector to a classifier;

classifying by the classifier according to the average value and the time domain vector to determine an event type and a corresponding confidence coefficient; the event category includes one of: only the voice of the new speaker exists in the voice segment; only the voices of speakers except the new speaker exist in the voice segment; the voice segment contains the voice of the new speaker and the voices of speakers except the new speaker; the voice of any speaker is not present in the voice segment;

and when the determined event type is that only the voice of the new speaker exists in the voice segment, returning corresponding confidence coefficient, and analyzing the voice feature vector of the new speaker from the voice segment.

Further, the determining, according to the time domain vector, the voice feature vector of the target speaker, and the voice feature vector of the speaker recognized in each round of the recognition processing, the voice activity value corresponding to the target speaker includes:

obtaining a second average value; the second average value is the average value of the voice feature vectors of all the speakers identified in each round of the identification processing;

splicing the second average value with the voice characteristic vector of the target speaker to obtain a spliced vector;

inputting the splicing vector into a first fully-connected network, and mapping the splicing vector by the first fully-connected network to obtain a first mapping value;

inputting the time domain vector to a second fully connected network, and mapping the time domain vector by the second fully connected network to obtain a second mapping value;

and multiplying the first mapping value by the second mapping value to obtain the voice activity value.

Further, the speech separation method further comprises the step of jointly training the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network.

Further, in the step of jointly training the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network, the loss function corresponding to the classifier is L_selector(h, μ) = -(1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} log P(e_t | h_t, μ_i), wherein L_selector(h, μ) is the loss function corresponding to the classifier, h is the time domain vector, μ is the first average value of the speech feature vectors of the speakers recognized in the previous round of recognition processing, T represents the number of the voice segments, N represents the current number of rounds in training, t represents the time, h_t represents the time domain vector corresponding to time t, e_t represents the event type corresponding to the time domain vector h_t, i is the round number, and μ_i represents the first average value corresponding to the i-th round;

the loss function corresponding to the first fully-connected network and the second fully-connected network is L_vad(ŷ, y) = -(1/(N·T)) Σ_{i=1}^{N} Σ_{t=1, t∉B_r}^{T} [ y_{i,t} log ŷ_{i,t} + (1 - y_{i,t}) log(1 - ŷ_{i,t}) ], wherein L_vad(ŷ, y) is the loss function for the first fully-connected network and the second fully-connected network, ŷ_{i,t} represents the voice activity value, y_{i,t} represents the label value in the training sample used in the training, and B_r represents the set fault tolerance time.

On the other hand, an embodiment of the present invention further provides a speech separation system, including:

a first module for obtaining a voice segment;

a second module, configured to map the speech segment to a time domain feature space to obtain a time domain vector;

a third module, configured to perform multiple rounds of recognition processing on the time domain vector until an iteration stop condition is satisfied; in the first round of the recognition processing, recognizing the speech feature vectors of the speaker contained in the time domain vector and the corresponding confidence levels; in each round of the recognition processing except the first round, recognizing the speech feature vector of the new speaker included in the processing result of the previous round of the recognition processing and the corresponding confidence level, the new speaker being a speaker whose corresponding speech feature vector has not been recognized in any previous round of the recognition processing; the iteration stop condition is that the confidence level obtained by executing the recognition processing is smaller than a threshold value;

the fourth module is used for acquiring the voice characteristic vector of the target speaker;

and the fifth module is used for determining the voice activity value corresponding to the target speaker according to the time domain vector, the voice feature vector of the target speaker and the voice feature vector of the speaker recognized in each round of recognition processing.

On the other hand, the embodiment of the present invention further provides a computer apparatus, which includes a memory and a processor, where the memory is used to store at least one program, and the processor is used to load the at least one program to perform the voice separation method in the embodiment of the present invention.

In another aspect, an embodiment of the present invention further provides a storage medium, in which a processor-executable program is stored, and the processor-executable program is used to execute the voice separation method in the embodiment of the present invention when being executed by a processor.

The beneficial effects of the invention include: the voice separation method in the embodiment can determine the voice activity value corresponding to the target speaker in the voice segment, and can clearly show whether the target speaker speaks at a certain moment in the voice signal or not through the voice activity value, so that the speaking sequence of the speaker is easily and clearly distinguished, the problem of sequence ambiguity is solved, and the accuracy of voice recognition is improved.

Drawings

FIG. 1 is a flow chart of a speech separation method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a speech separation method in an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a computer device for performing a speech separation method according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

Referring to fig. 1, an embodiment of the present application provides a speech separation method, including the following steps:

s1, acquiring a voice signal;

s2, segmenting the voice signals to obtain a plurality of voice segments;

s3, mapping the voice segments to a time domain feature space to obtain time domain sub-vectors;

s4, splicing the time domain sub-vectors to obtain time domain vectors;

s5, performing multi-round identification processing on the time domain vector until an iteration stop condition is met;

s6, acquiring a voice feature vector of the target speaker;

s7, acquiring a second average value; wherein the second average value is an average value of all the speech feature vectors recognized in all the recognition processes;

and S8, determining a voice activity value corresponding to the target speaker according to the time domain vector, the voice characteristic vector of the target speaker and the second average value.

In the present embodiment, the principle of steps S1-S8 is shown in FIG. 2.

In step S1, a long speech signal x may be obtained by live recording or by capturing it from existing audio or video, where the length of the speech signal x is L. In step S2, the speech signal x is sampled using a window of fixed length; each sampling yields one speech segment x_j, giving T speech segments x_1, x_2, …, x_T.
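
For illustration, a minimal sketch of the fixed-length window sampling in steps S1 and S2 might look like the following; the window length, hop size and sampling rate used here are illustrative assumptions, not values specified in this embodiment:

```python
import numpy as np

def segment_signal(x: np.ndarray, win_len: int, hop: int) -> np.ndarray:
    """Split a 1-D speech signal x into T fixed-length segments x_1..x_T.

    x       : speech signal of length L (samples)
    win_len : number of samples per segment (fixed window length)
    hop     : number of samples between consecutive window starts
    """
    # Pad the tail so the last window is complete.
    n_windows = max(1, int(np.ceil((len(x) - win_len) / hop)) + 1)
    padded_len = (n_windows - 1) * hop + win_len
    x = np.pad(x, (0, padded_len - len(x)))
    # Stack the windows into a (T, win_len) array.
    segments = np.stack([x[i * hop : i * hop + win_len] for i in range(n_windows)])
    return segments

# Example: a 10-second signal at 16 kHz, 400-sample windows with 50% overlap.
signal = np.random.randn(16000 * 10)
segments = segment_signal(signal, win_len=400, hop=200)
print(segments.shape)  # (T, 400)
```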

In steps S3 and S4, the T speech segments x_1, x_2, …, x_T are each input to a time-domain encoder for processing. In this embodiment, the time-domain encoder may be formed by stacking dilated one-dimensional CNNs (Convolutional Neural Networks) with a residual structure; PReLU is used as the activation function, layer normalization is applied to the convolutional layers, and a 1D average pooling layer is introduced in the middle of the residual structure, so that the time-domain encoder can down-sample the input speech segments.

Referring to fig. 2, in step S3, each speech segment is down-sampled by the time-domain encoder and mapped to the time-domain feature space to obtain a corresponding time-domain sub-vector. If the time-domain encoder is denoted h, then the time-domain sub-vector obtained by down-sampling and mapping the speech segment x_j can be represented as h(x_j) ∈ R^D, i.e., the resulting time-domain sub-vector has dimension D. In step S4, the time-domain sub-vectors h(x_1), h(x_2), …, h(x_T) are spliced according to the temporal order of their corresponding speech segments x_1, x_2, …, x_T in the speech signal x, giving the time-domain vector h(x) ∈ R^(T×D), whose dimension is T×D.
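
For illustration only, a minimal PyTorch sketch of such a time-domain encoder (dilated 1-D convolutions in a residual structure, PReLU activations, layer normalization, and a 1-D average-pooling layer for down-sampling) together with the splicing of the sub-vectors in step S4 might look as follows; the layer sizes, dilation factors and pooling strides are assumptions made for the sketch, not parameters given in this embodiment:

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """One residual block: dilated 1-D conv -> PReLU -> pooling -> conv -> PReLU -> norm,
    with a 1-D average-pooling layer for down-sampling."""
    def __init__(self, channels: int, dilation: int, pool: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.act = nn.PReLU()
        self.pool = nn.AvgPool1d(pool)          # down-samples the time axis
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, channels)   # layer-norm-style normalization over channels
        self.skip_pool = nn.AvgPool1d(pool)     # down-sample the residual path to match

    def forward(self, x):                        # x: (B, C, L)
        y = self.pool(self.act(self.conv1(x)))
        y = self.norm(self.act(self.conv2(y)))
        return y + self.skip_pool(x)             # residual connection

class TimeDomainEncoder(nn.Module):
    """Maps one speech segment x_j to a D-dimensional time-domain sub-vector h(x_j)."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.front = nn.Conv1d(1, dim, kernel_size=16, stride=8)  # waveform -> feature channels
        self.blocks = nn.Sequential(
            ResidualDilatedBlock(dim, dilation=1, pool=2),
            ResidualDilatedBlock(dim, dilation=2, pool=2),
            ResidualDilatedBlock(dim, dilation=4, pool=2),
        )

    def forward(self, seg):                       # seg: (B, 1, win_len)
        feats = self.blocks(self.front(seg))      # (B, D, L')
        return feats.mean(dim=-1)                 # (B, D): one sub-vector per segment

# Step S4: encode every segment and splice along the time axis -> h(x) of shape (T, D).
encoder = TimeDomainEncoder(dim=128)
segments = torch.randn(799, 1, 400)               # T segments, as in the previous sketch
h_x = encoder(segments)                           # (T, D) time-domain vector
print(h_x.shape)
```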

By performing fixed-length window sampling on a speech signal of relatively long duration in steps S1-S4, the obtained speech segments are the result of segmenting the speech signal. The speech segments are processed separately by the time-domain encoder to obtain time-domain sub-vectors, which are finally spliced into the time-domain vector. Only a suitable window length and sampling frequency need to be set: regardless of the length of the speech signal, speech signals of different lengths can be sampled into speech segments of the same length, and time-domain vectors of the same dimension are obtained accordingly. This improves the adaptability to speech signals of different lengths; in the training stage of the time-domain encoder, speech signals of different lengths can therefore be used for training, which improves the diversity of the training samples used to train the time-domain encoder.

Referring to fig. 2, in step S5, multiple rounds of recognition processing may be iteratively performed on the time-domain vector h(x) using a softmax linear classifier as the classifier, and the recognition processing is no longer performed once the stop condition is satisfied. In this embodiment, the softmax linear classifier may include two fully connected networks, g_μ and g_h. The softmax linear classifier is configured as a four-class classifier, so that the time-domain vector h(x) and the related data input into the softmax linear classifier are classified into the first event, the second event, the third event or the fourth event. The first event represents: only the voice of a new speaker exists in the voice segment corresponding to the time-domain vector h(x); the second event represents: only the voices of speakers other than the new speaker exist in the voice segment; the third event represents: both the voice of the new speaker and the voices of speakers other than the new speaker exist in the voice segment; the fourth event represents: the voice segment does not contain the voice of any speaker. The new speaker refers to a speaker whose corresponding speech feature vector has not been recognized in any previous round of recognition processing, that is, a speaker whose corresponding speech feature vector is recognized for the first time in the current round of recognition processing.

The softmax linear classifier classifies the time-domain vector h(x) and the related data input into it as the first event, the second event, the third event or the fourth event, and outputs a first confidence level, a second confidence level, a third confidence level and a fourth confidence level to represent the probability of each event. In this embodiment, the first confidence level output by the softmax linear classifier represents the probability that the time-domain vector h(x) and the related data input into the classifier belong to the first event, the second confidence level represents the probability that they belong to the second event, the third confidence level represents the probability that they belong to the third event, and the fourth confidence level represents the probability that they belong to the fourth event.

In the 1st round of recognition processing, the softmax linear classifier recognizes the speech feature vector of the speaker contained in the time-domain vector h(x) and the corresponding confidence level. Specifically, the time-domain vector h(x) is input to the fully connected network g_h. In this embodiment, the speech separation task is to detect whether each speaker is speaking in the speech signal at time t, so the time-domain vector at time t can be written as h_t; the fully connected network g_h processes h_t to obtain g_h(h_t). The fully connected network g_μ processes the result of the previous round of recognition processing; since the current round is the 1st round and there is no previous round, g_μ can be given the zero vector as its initial input. The softmax linear classifier processes the input time-domain vector h_t and maps h_t to an event type e_t. In this embodiment, the event type e_t takes four different values, representing the first event, the second event, the third event and the fourth event respectively. Mapping h_t to e_t identifies the situation of the speakers' voices contained in the voice segment corresponding to the time-domain vector h_t. In the 1st round of recognition processing, the processing of the softmax linear classifier can be expressed as:

P(e_t | h_t, μ_i) = softmax(g_μ(0) · g_h(h_t)).

In the 1st round of recognition processing, if the voice of speaker A is recognized from the voice segment, speaker A is a new speaker with respect to the 1st round, so the softmax linear classifier maps this case to the first event and outputs the corresponding confidence level, which can be expressed as the maximum confidence that a new speaker appears. After the 1st round of recognition processing is executed, the speech feature vector of speaker A can be obtained and stored, and it is then judged whether the confidence level obtained in this round is greater than a threshold value; the threshold may be set to a relatively small value. If the confidence level is greater than the threshold, the judgment made by the softmax linear classifier in the 1st round, namely that speaker A is a new speaker, can be considered reliable, and the next round of recognition processing, i.e., the 2nd round, can be executed; if the confidence level is smaller than the threshold, that judgment can be considered unreliable, that is, no new speaker is found in the 1st round of recognition processing, so the recognition processing may be terminated, i.e., the next round is not executed and the process jumps to step S6.

In the 2nd round of recognition processing, the softmax linear classifier recognizes the speech feature vector of the speaker contained in the time-domain vector h(x) and the corresponding confidence level. Specifically, the time-domain vector h(x) is input to the fully connected network g_h. In this embodiment, the fully connected network g_h processes h_t to obtain g_h(h_t), and the fully connected network g_μ processes the first average value μ_i, where the first average value μ_i is the average value of the speech feature vectors of the speakers recognized in the previous round of recognition processing; for the 2nd round of recognition processing, the first average value μ_i is the average of the speech feature vectors of the speakers recognized in the 1st round. In the 2nd round of recognition processing, the processing of the softmax linear classifier can be expressed as:

P(e_t | h_t, μ_i) = softmax(g_μ(μ_i) · g_h(h_t)).

In the 2nd round of recognition processing, if the voice of speaker A is recognized from the voice segment, speaker A does not count as a new speaker with respect to the 2nd round, since speaker A was already recognized in the 1st round. If the voice of speaker B is recognized from the voice segment, speaker B is a new speaker with respect to the 2nd round, since speaker B was not recognized in the 1st round. The softmax linear classifier maps the cases "only the voice of speaker B is recognized", "only the voice of speaker A is recognized", "the voices of both speaker A and speaker B are recognized" and "no speaker's voice is recognized" to the first event, the second event, the third event and the fourth event respectively, and outputs the corresponding confidence level, which can be expressed as the maximum confidence that a new speaker appears. After the 2nd round of recognition processing is executed, if the result is the first event, the speech feature vector of speaker B can be obtained and stored, and it is then judged whether the confidence level obtained in this round is greater than the threshold. If the confidence level is greater than the threshold, the judgment made by the softmax linear classifier in the 2nd round, namely that speaker B is a new speaker, can be considered reliable, and the next round of recognition processing, i.e., the 3rd round, can be executed; if the confidence level is smaller than the threshold, that judgment can be considered unreliable, that is, no new speaker is found in the 2nd round of recognition processing, so the recognition processing may be terminated, i.e., the next round is not executed and the process jumps to step S6.

The steps of each i-th round of recognition processing after the 2nd round are similar to the 2nd round. The softmax linear classifier recognizes the speech feature vector of the speaker contained in the time-domain vector h(x) and the corresponding confidence level. Specifically, the time-domain vector h(x) is input to the fully connected network g_h, and the fully connected network g_h processes h_t to obtain g_h(h_t); the fully connected network g_μ processes the first average value μ_i, which for the i-th round of recognition processing is the average value of the speech feature vectors of the speakers recognized in the (i-1)-th round of recognition processing. In the i-th round of recognition processing, the processing of the softmax linear classifier can be expressed as:

P(e_t | h_t, μ_i) = softmax(g_μ(μ_i) · g_h(h_t)).

In the i-th round of recognition processing, if the voices of speakers A and B are recognized from the voice segment, neither of them is a new speaker with respect to the i-th round, since speaker A was recognized in the 1st round and speaker B in the 2nd round. If the voice of speaker C is recognized from the voice segment and speaker C was not recognized in the previous i-1 rounds of recognition processing, speaker C is a new speaker with respect to the i-th round. The softmax linear classifier maps the cases "only the voice of the new speaker is recognized", "only the voices of previously recognized speakers are recognized", "the voices of both the new speaker and previously recognized speakers are recognized" and "no speaker's voice is recognized" to the first event, the second event, the third event and the fourth event respectively, and outputs the corresponding confidence level, which can be expressed as the maximum confidence that a new speaker appears. After the i-th round of recognition processing is executed, if the result is the first event, the speech feature vector of speaker C can be obtained and stored, and it is then judged whether the confidence level obtained in this round is greater than the threshold. If the confidence level is greater than the threshold, the judgment made by the softmax linear classifier in the i-th round, namely that speaker C is a new speaker, can be considered reliable, and the next round of recognition processing, i.e., the (i+1)-th round, can be executed; if the confidence level is smaller than the threshold, that judgment can be considered unreliable, that is, no new speaker is found in the i-th round of recognition processing, so the recognition processing may be terminated, i.e., the next round is not executed and the process jumps to step S6.
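
By way of illustration, the softmax linear classifier built from the two fully connected networks g_μ and g_h, together with the iterative multi-round recognition loop and its confidence threshold, could be sketched in PyTorch as follows; the network sizes, the threshold value and the helper used to extract the new speaker's feature vector are assumptions of this sketch rather than details fixed by this embodiment:

```python
import torch
import torch.nn as nn

NEW_SPEAKER_ONLY = 0   # first event: only the new speaker's voice
KNOWN_ONLY       = 1   # second event: only previously recognized speakers
NEW_AND_KNOWN    = 2   # third event: new speaker plus known speakers
SILENCE          = 3   # fourth event: no speaker's voice

class SpeakerSelector(nn.Module):
    """Softmax linear classifier: P(e_t | h_t, mu_i) = softmax(g_mu(mu_i) * g_h(h_t))."""
    def __init__(self, dim: int = 128, n_events: int = 4):
        super().__init__()
        self.g_mu = nn.Linear(dim, n_events)   # processes the first average value mu_i
        self.g_h = nn.Linear(dim, n_events)    # processes the time-domain vector h_t

    def forward(self, h_t, mu_i):                     # h_t: (T, D), mu_i: (D,)
        logits = self.g_mu(mu_i) * self.g_h(h_t)      # product of the two mappings
        return torch.softmax(logits, dim=-1)          # (T, 4) event probabilities per time step

def iterative_recognition(h_x, selector, extract_speaker_vec, threshold=0.5, max_rounds=10):
    """Multi-round recognition (step S5): stop when the first-event confidence
    drops below the threshold, i.e. no reliable new speaker is found."""
    dim = h_x.shape[-1]
    speaker_vecs = []                            # speech feature vectors found so far
    mu = torch.zeros(dim)                        # round 1: g_mu receives the zero vector
    for _ in range(max_rounds):
        probs = selector(h_x, mu)                # (T, 4)
        # Confidence that a new speaker appears: maximum first-event probability over time.
        confidence = probs[:, NEW_SPEAKER_ONLY].max().item()
        if confidence < threshold:
            break                                # iteration stop condition met
        # extract_speaker_vec is a placeholder for however the embodiment derives the
        # new speaker's speech feature vector from the frames classified as the first event.
        new_vec = extract_speaker_vec(h_x, probs)
        speaker_vecs.append(new_vec)
        mu = torch.stack(speaker_vecs).mean(dim=0)   # first average value for the next round
    return speaker_vecs

# Example usage with a dummy extractor that averages the first-event frames.
selector = SpeakerSelector(dim=128)
dummy_extract = lambda h, p: (p[:, NEW_SPEAKER_ONLY:NEW_SPEAKER_ONLY + 1] * h).sum(0) / (
    p[:, NEW_SPEAKER_ONLY].sum() + 1e-8)
found = iterative_recognition(torch.randn(799, 128), selector, dummy_extract)
print(len(found), "speaker feature vectors recognized")
```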

By performing multiple rounds of recognition processing in step S5, the speech feature vectors of the different speakers contained in the multiple speech segments, such as speaker A, speaker B and speaker C, can be recognized. Combined with the segmentation of the speech signal into multiple speech segments in steps S1-S4, the speaking times of the different speakers A, B and C in the speech signal can then be determined from the time positions of the speech segments within the speech signal.

In step S6, the target speaker, i.e., the person whose speaking times in the speech signal one wishes to know, is determined. For example, speaker A may be chosen as the target speaker, meaning that one wishes to know at which moments speaker A speaks in the speech signal. In step S6, the speech feature vector s_A of the target speaker A can be obtained by collecting and analyzing the speech of speaker A. In the same way, the speech feature vectors of the speakers recognized in each round of recognition processing can be obtained, including the speech feature vector s_B of speaker B, the speech feature vector s_C of speaker C, and so on.

In steps S7 and S8, the voice activity value corresponding to the target speaker in the voice segment is determined according to the time-domain vector, the speech feature vector of the target speaker, and the speech feature vectors of the speakers recognized in each round of recognition processing. Specifically, in step S7 the average value of the speech feature vectors of all the speakers recognized in all rounds of recognition processing is first computed to obtain the second average value; for example, if the speakers recognized in this embodiment are A, B and C, the second average value is s = (s_A + s_B + s_C)/3. The second average value s is then spliced with the speech feature vector s_A of the target speaker A to obtain a spliced vector ŝ. Next, in step S8, processing is performed using a voice detector. The voice detector in this embodiment comprises a first fully connected network f_s and a second fully connected network f_h; PReLU is used as the activation function and layer normalization is adopted, with the last layer being a plain linear mapping. The first fully connected network f_s and the second fully connected network f_h map the spliced vector ŝ and the current time-domain vector h(x) ∈ R^(T×D) respectively: the spliced vector ŝ is input to the first fully connected network f_s, which maps it to the first mapping value f_s(ŝ); the time-domain vector is input to the second fully connected network f_h, which maps it to the second mapping value f_h(h_t)^T; the voice activity value is then determined from the first mapping value f_s(ŝ) and the second mapping value f_h(h_t)^T.

In this embodiment, the voice activity value obtained by executing step S8 can be expressed as ŷ_t = f_s(ŝ) f_h(h_t)^T, where ŷ_t = 1 indicates that the target speaker is speaking at time t in the speech signal and ŷ_t = 0 indicates that the target speaker is not speaking at time t. The voice activity value can therefore clearly indicate whether the target speaker speaks at a given moment in the voice signal, and by switching between different target speakers it can be determined whether each speaker speaks at each moment, so that the speaking order of the speakers is distinguished and the problem of order ambiguity is alleviated.
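
A minimal sketch of the voice detector of steps S7 and S8, with the two fully connected networks f_s and f_h whose mappings are multiplied to give a per-frame voice activity value, might look as follows; the hidden sizes and the sigmoid used to keep the product in (0, 1) are assumptions of this sketch, not details given in this embodiment:

```python
import torch
import torch.nn as nn

class VoiceDetector(nn.Module):
    """Step S8: map the spliced vector with f_s and the time-domain vector with f_h,
    then multiply the two mappings to obtain a voice activity value per time step."""
    def __init__(self, dim: int = 128, hidden: int = 256, out: int = 64):
        super().__init__()
        # f_s: processes the splice of the second average value and the
        #      target speaker's speech feature vector (2 * dim inputs).
        self.f_s = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.PReLU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, out),                 # last layer: plain linear mapping
        )
        # f_h: processes the time-domain vector h_t.
        self.f_h = nn.Sequential(
            nn.Linear(dim, hidden), nn.PReLU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, out),
        )

    def forward(self, h_x, target_vec, second_avg):           # h_x: (T, D); vectors: (D,)
        s_hat = torch.cat([second_avg, target_vec], dim=-1)   # spliced vector
        first_map = self.f_s(s_hat)                           # (out,)
        second_map = self.f_h(h_x)                            # (T, out)
        # Product of the two mappings, squashed to (0, 1) as a per-frame activity value.
        return torch.sigmoid(second_map @ first_map)          # (T,)

# Example usage with quantities of the kind produced in the earlier sketches.
detector = VoiceDetector(dim=128)
h_x = torch.randn(799, 128)                 # time-domain vector h(x)
s_target = torch.randn(128)                 # speech feature vector of the target speaker
s_avg = torch.randn(128)                    # second average value
activity = detector(h_x, s_target, s_avg)   # activity[t] near 1: target speaker speaks at t
print(activity.shape)
```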

In this embodiment, before performing steps S1-S8, the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network to be used in steps S1-S8 may be jointly trained; that is, the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network are connected into one system in the order determined by steps S1-S8, and the loss function of the whole system during training is set to be the sum of the loss functions used by the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network.

Specifically, in the joint training, the loss function corresponding to the classifier is L_selector(h, μ) = -(1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} log P(e_t | h_t, μ_i), and the loss function corresponding to the first fully-connected network and the second fully-connected network is L_vad(ŷ, y) = -(1/(N·T)) Σ_{i=1}^{N} Σ_{t=1, t∉B_r}^{T} [ y_{i,t} log ŷ_{i,t} + (1 - y_{i,t}) log(1 - ŷ_{i,t}) ], where T represents the number of voice segments, N represents the current number of rounds in training, h_t represents the time-domain vector at time t, e_t represents the event type corresponding to the time-domain vector h_t, μ_i represents the first average value, ŷ_{i,t} represents the voice activity value, y_{i,t} represents the label value in the training sample used in the training, and B_r represents the set fault tolerance time. The significance of excluding t ∈ B_r is that when DER (diarization error rate) is used to evaluate the system, some tolerance needs to be added to the judgment of speaker boundaries so that the system is not penalized for small labeling errors; typically the fault tolerance time is 250 ms around each speaker transition. Since this strategy is used in DER evaluation, the fault tolerance time is preferably also taken into account during training, so that when computing the loss function of the voice detector, i.e., the first and second fully connected networks, the loss of frames within the tolerance range B_r is excluded.

The loss function L_total of the whole system is the sum of the loss function L_selector corresponding to the classifier and the loss function L_vad corresponding to the first fully-connected network and the second fully-connected network, i.e., L_total = L_selector + L_vad.
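
As a rough illustration of this joint objective, the classifier's cross-entropy loss and the detector's binary cross-entropy loss with the fault-tolerance mask could be combined as sketched below; the exact normalization and the form of the mask are assumptions consistent with the loss functions as reconstructed above, not details fixed by this embodiment:

```python
import torch
import torch.nn.functional as F

def selector_loss(event_probs, event_labels):
    """L_selector: negative log-likelihood of the true event type e_t
    under P(e_t | h_t, mu_i), averaged over rounds and time steps.

    event_probs  : (N, T, 4) softmax outputs over the four event types
    event_labels : (N, T)    integer event labels
    """
    log_p = torch.log(event_probs.clamp_min(1e-8))
    return F.nll_loss(log_p.reshape(-1, 4), event_labels.reshape(-1))

def detector_loss(activity, labels, tolerance_mask):
    """L_vad: binary cross-entropy between the voice activity values and the labels,
    ignoring frames inside the fault-tolerance window B_r around speaker transitions.

    activity       : (N, T) predicted voice activity values in (0, 1)
    labels         : (N, T) ground-truth 0/1 speaking labels y_{i,t}
    tolerance_mask : (N, T) 1 for frames outside B_r, 0 for frames to be ignored
    """
    bce = F.binary_cross_entropy(activity, labels, reduction="none")
    return (bce * tolerance_mask).sum() / tolerance_mask.sum().clamp_min(1.0)

# L_total = L_selector + L_vad, back-propagated through the encoder, classifier and detector.
N, T = 3, 799
probs = torch.softmax(torch.randn(N, T, 4), dim=-1)
events = torch.randint(0, 4, (N, T))
acts = torch.sigmoid(torch.randn(N, T))
ys = torch.randint(0, 2, (N, T)).float()
mask = torch.ones(N, T)
total_loss = selector_loss(probs, events) + detector_loss(acts, ys, mask)
print(total_loss.item())
```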

The advantages of the above joint training are as follows. In some related technologies, in order to give the neural network the capability of identifying whether speaker A or speaker B is speaking, the voices of speakers A and B must be used in advance to train the neural network; however, in most cases the information of the target speaker to be identified is difficult to obtain in advance, which limits the application of such voice separation technologies. In the joint training of this embodiment, the training set and test set used to train the system formed by the time-domain encoder, the classifier, the first fully-connected network and the second fully-connected network do not need to be specific to the target speaker; that is, the system does not need to be trained with the target speaker's voice in advance. For example, the system can be trained using the voices of speakers A and B and still has the capability of recognizing whether speaker C speaks in a voice signal, which reduces the limitations of the voice separation technology in application and widens its application range.

An application scenario of the voice separation method in this embodiment is as follows: three people, speakers A, B and C, speak at random during the time period [t1, t2], and a recording is made over [t1, t2] to obtain the voice signal. When the voice signal is played back, situations such as "only speaker A is speaking and the others are silent", "speakers A and B are speaking simultaneously while speaker C is silent", "speakers A, B and C are all speaking simultaneously" and "no one is speaking" may occur at different moments within [t1, t2]. By performing steps S1 and S2, voice segments can be obtained by sampling the voice signal; by performing steps S3-S5, the speech feature vectors of the three speakers A, B and C can be recognized from the voice segments (e.g., if only speaker A speaks in a voice segment, the speech feature vector of speaker A is output; if only speakers A and C speak in a voice segment, the speech feature vectors of speakers A and C are output); by performing step S6, the speech feature vector of the target speaker to be recognized is obtained (e.g., if one wishes to know whether speaker A speaks in a voice segment, speaker A is taken as the target speaker and speaker A's speech feature vector is obtained); by performing steps S7 and S8, the voice activity value of the target speaker, i.e., whether the target speaker speaks in the voice segment at time t, can be obtained, thereby completing the voice separation task.

Those skilled in the art can understand that the above application scenario is only one example of an application scenario of the speech separation method in the embodiment of the present application, and the speech separation method in the embodiment of the present application can also be applied to a scenario in which more speakers participate in a manner similar to the above example, and details are not described here again.

In this embodiment, a speech separation system is further provided, and the speech separation system includes:

a first module for acquiring a voice signal;

the second module is used for segmenting the voice signal to obtain a plurality of voice segments;

a third module, configured to map the speech segment to a time domain feature space, to obtain a time domain sub-vector;

a fourth module, configured to splice the time-domain sub-vectors to obtain time-domain vectors;

a fifth module, configured to perform multiple rounds of identification processing on the time domain vector until an iteration stop condition is met; in the 1st round of identification processing, identifying the voice feature vector of the speaker contained in the time domain vector; in the i-th round of identification processing, where i ∈ N and i > 1, identifying the voice feature vector of the speaker contained in the time domain vector, inputting the time domain vector and the voice feature vector identified by the (i-1)-th round of identification processing into a classifier, and acquiring a first confidence coefficient output by the classifier, wherein the first confidence coefficient is used for representing the probability that the voice feature vector identified by the i-th round of identification processing comes from a new speaker; the new speaker is a speaker whose corresponding voice feature vector has not been identified in any round of identification processing before the i-th round; the iteration stop condition is that a first confidence coefficient obtained by executing the identification processing is smaller than a threshold value;

the sixth module is used for acquiring the voice characteristic vector of the target speaker;

a seventh module for obtaining a second average value; the second average value is the average value of all the speech feature vectors identified in all the identification processes;

and the eighth module is used for determining a voice active value corresponding to the target speaker according to the time domain vector, the voice feature vector of the target speaker and the second average value.

In this embodiment, the voice separation system includes a first module, a second module, a third module, a fourth module, a fifth module, a sixth module, a seventh module, and an eighth module, where the first module, the second module, the third module, the fourth module, the fifth module, the sixth module, the seventh module, and the eighth module may be hardware modules, software modules, or a combination of hardware and software having corresponding functions. The first module may be configured to perform step S1 in the speech separation method in the embodiment, the second module may be configured to perform step S2 in the speech separation method in the embodiment, the third module may be configured to perform step S3 in the speech separation method in the embodiment, the fourth module may be configured to perform step S4 in the speech separation method in the embodiment, the fifth module may be configured to perform step S5 in the speech separation method in the embodiment, the sixth module may be configured to perform step S6 in the speech separation method in the embodiment, the seventh module may be configured to perform step S7 in the speech separation method in the embodiment, and the eighth module may be configured to perform step S8 in the speech separation method in the embodiment. Therefore, by operating the voice separation system, the voice separation method can be performed, so that the voice separation system can achieve the same technical effect as the voice separation method.

In an embodiment of the present invention, steps S1-S8 may be performed using a computer device having the structure shown in fig. 3, wherein the computer device includes a memory 6001 and a processor 6002, wherein the memory 6001 is used to store at least one program and the processor 6002 is used to load the at least one program to perform the speech separation method in the embodiment of the present invention. By operating the computer device, the same technical effect as the voice separation method in the embodiment of the invention can be achieved.

In an embodiment of the present invention, there is provided a storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by a processor, is for performing a voice separation method in an embodiment of the present invention. By using this storage medium, the same technical effects as those of the voice separation method in the embodiment of the present invention can be achieved.

It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.

It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.

It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.

Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.

Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.

A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.

The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.
