Single-channel voice separation method and device and electronic equipment

Document No.: 1339718    Publication date: 2020-07-17

Reading note: This technology, "Single-channel voice separation method and device and electronic equipment" (一种单通道语音分离方法、装置及电子设备), was designed and created by 毛启容, 陈静静, 钱双庆, and 高利剑 on 2020-05-09. Abstract: The invention provides a single-channel voice separation method, device, and electronic equipment. An encoder extracts features from a mixed voice signal; the extracted features are segmented and spliced into a 3-D tensor; a dual-path recurrent neural network fused with a self-attention mechanism models the spliced 3-D tensor and learns the long-range dependencies among the voice signals; the modeled 3-D tensor is restored by overlap-add to sequential voice-signal features; and a decoder reconstructs those features into clean voice signals, yielding the separated voice signals. By modeling long voice signals with the aim of improving separation performance, the invention fully exploits the long-range dependencies among voice signals, achieves good separation, effectively reduces voice distortion, and improves the intelligibility of the separated voice.

1. A single-channel voice separation method, characterized in that extracted voice-signal features are segmented and spliced into a 3-D tensor; the spliced 3-D tensor is modeled by a dual-path recurrent neural network fused with a self-attention mechanism, learning the long-range dependencies among the voice signals; the modeled 3-D tensor is restored to sequential voice-signal features; and the sequential voice-signal features are reconstructed into clean voice signals to obtain the separated voice signals.

2. The single-channel speech separation method of claim 1, wherein the self-attention mechanism fuses a recurrent neural network to map a query Q and key-value pairs (K, V) to a specified output.

3. The single-channel speech separation method of claim 2, wherein the self-attention mechanism comprises a dot-product attention module, a multi-head attention module, a residual-normalization module, and a recurrent-neural-network module.

4. The single-channel speech separation method of claim 3, wherein the recurrent neural network module employs a bidirectional recurrent neural network.

5. The single-channel speech separation method of any of claims 2-4, wherein the self-attention mechanism fused with the recurrent neural network is further fused into a dual-path network.

6. The single-channel speech separation method of claim 5, wherein the dual-path network comprises an intra-block module and an inter-block module.

7. The single-channel speech separation method of claim 6, wherein the dual-path network is calculated as follows:

IntraD = LN([MultiHead(D[:,:,s], D[:,:,s], D[:,:,s]), s = 1, ..., H])

Intrablock(D) = [BiLSTM(IntraD[:,:,s]), s = 1, ..., H]

InterD = LN([MultiHead(Intrablock(D)[:,p,:], Intrablock(D)[:,p,:], Intrablock(D)[:,p,:]), p = 1, ..., P])

Interblock(D) = [BiLSTM(InterD[:,p,:]), p = 1, ..., P]

wherein IntraD denotes the output of the intra-block module after processing by the multi-head attention module and the residual-normalization module; InterD denotes the corresponding output of the inter-block module; Intrablock(D) and Interblock(D) are the outputs of the intra-block module and the inter-block module respectively; BiLSTM is a bidirectional long short-term memory unit; MultiHead is the multi-head attention module; LN is layer normalization; D is the 3-D tensor; P is the block length; and H is the number of blocks of the speech-signal feature.

8. A single-channel voice separation device, characterized by comprising a voice acquisition module, a voice separation module, and a voice playing module which are connected in sequence;

the voice acquisition module acquires a single-channel mixed voice signal;

the voice separation module separates the mixed voice signal based on a self-attention mechanism and a dual-path recurrent neural network to obtain the separated voice signal;

and the voice playing module plays the voice signal obtained from the voice separation module.

9. The single-channel voice separation device according to claim 8, wherein separating the mixed voice signal based on the self-attention mechanism and the dual-path recurrent neural network specifically comprises:

segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, performing the following: segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

Technical Field

The invention relates to the fields of voice-signal processing, pattern recognition, and the like, and in particular to a single-channel voice separation method, device, and electronic equipment.

Background

Single-channel speech separation means separating each speaker's clean speech from the mixed speech of multiple speakers; it is an important branch of the signal-processing field. It has many practical applications in the real world, for example separating the clean speech signal from noisy mixed speech to improve the accuracy of speech recognition and speaker recognition. In fields such as video-conference transcription, hearing assistance, and mobile communication, single-channel voice separation has broad application prospects and practical significance.

Traditional single-channel speech separation mainly adopts non-negative matrix factorization and auditory scene analysis. Non-negative matrix factorization decomposes the spectral features of the mixed speech signal, through a non-negative dictionary, into specific representations associated with each speaker, and then derives each speaker's clean speech from these representations. Auditory scene analysis decomposes the spectral features into time-frequency blocks and then extracts a specific speaker's voice signal by grouping the blocks. However, these traditional methods can only handle separation tasks for known speakers and cannot generalize to the separation of mixed speech from unknown speakers, so their application scenarios are greatly limited. In the deep-learning era, neural networks based on spectral features have solved this generalization problem and improved separation performance to a certain extent. However, such networks still take spectral features as input and, in most cases, separate only the magnitude features without processing the phase information; artifacts therefore exist in the separated speech, placing an upper limit on performance that prevents it from being maximally improved. To overcome this problem, time-domain separation methods extract voice-signal features and recover the voice signal through a convolution-deconvolution scheme, avoiding artifacts in principle and greatly improving separation performance. However, a time-domain separation system usually needs to model an extremely long input sequence and mine the relationships between frames within it, which poses a great challenge to time-domain methods.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a single-channel voice separation method, a single-channel voice separation device and electronic equipment.

The present invention achieves the above-described object by the following technical means.

A single-channel voice separation method comprises the steps of segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

Further, the self-attention mechanism fuses a recurrent neural network to map Q, composed of a number of queries, and (K, V), composed of a number of key-value pairs, to a specified output.

Still further, the self-attention mechanism includes a dot-product attention module, a multi-head attention module, a residual-normalization module, and a recurrent-neural-network module.

Further, the recurrent neural network module adopts a bidirectional recurrent neural network.

Further, the self-attention mechanism fused with the recurrent neural network is further fused into a dual-path network.

Further, the dual-path network includes an intra-block module and an inter-block module.

Furthermore, the dual-path network is calculated as follows:

IntraD = LN([MultiHead(D[:,:,s], D[:,:,s], D[:,:,s]), s = 1, ..., H])

Intrablock(D) = [BiLSTM(IntraD[:,:,s]), s = 1, ..., H]

InterD = LN([MultiHead(Intrablock(D)[:,p,:], Intrablock(D)[:,p,:], Intrablock(D)[:,p,:]), p = 1, ..., P])

Interblock(D) = [BiLSTM(InterD[:,p,:]), p = 1, ..., P]

where IntraD denotes the output of the intra-block module after processing by the multi-head attention module and the residual-normalization module; InterD denotes the corresponding output of the inter-block module; Intrablock(D) and Interblock(D) are the outputs of the intra-block module and the inter-block module respectively; BiLSTM is a bidirectional long short-term memory unit; MultiHead is the multi-head attention module; LN is layer normalization; D is the 3-D tensor; P is the block length; and H is the number of blocks of the speech-signal feature.

A single-channel voice separation device comprises a voice acquisition module, a voice separation module, and a voice playing module which are connected in sequence;

the voice acquisition module acquires a single-channel mixed voice signal;

the voice separation module separates the mixed voice signal based on a self-attention mechanism and a dual-path recurrent neural network to obtain the separated voice signal;

and the voice playing module plays the voice signal obtained from the voice separation module.

In the above technical solution, separating the mixed voice signal based on the self-attention mechanism and the dual-path recurrent neural network specifically includes:

segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

An electronic device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, performing the following: segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

The invention has the following beneficial effects: the method uses a dual-path recurrent neural network based on a self-attention mechanism to model long voice signals, fully exploiting the long-range dependencies among them; the modeled 3-D tensor is restored to sequential voice-signal features and reconstructed into clean voice signals to obtain the separated voice signals; the distortion rate of the voice is effectively reduced, and the intelligibility of the separated voice is improved.

Drawings

FIG. 1 is a flow chart of the single-channel speech separation method of the present invention;

FIG. 2 is a schematic diagram of the self-attention mechanism fused with the recurrent neural network of the present invention;

FIG. 3 is a schematic diagram of the dual-path recurrent neural network fused with the self-attention mechanism of the present invention;

FIG. 4 is a schematic structural diagram of the single-channel voice separation apparatus of the present invention;

FIG. 5 is a schematic structural diagram of the electronic device of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Referring to fig. 1, a single-channel speech separation method based on a self-attention mechanism and a dual-path recurrent neural network includes the following steps:

step one, an encoder receives a mixed voice signal of multiple speaking persons, and extracts the characteristics of the mixed voice signal:

A one-dimensional convolutional neural network is used as the encoder to extract the feature X ∈ R^(N×L) from the mixed speech signal of multiple speakers. The feature is a 2-D tensor, where R denotes the set of real numbers, L is the number of time steps of the extracted speech-signal feature, and N is the dimensionality of the extracted speech-signal feature.
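The encoder can be sketched in numpy as follows. This is a minimal illustration, not the patent's trained model: the filter bank `basis`, the ReLU nonlinearity, and the stride are hypothetical stand-ins for the learned one-dimensional convolution.

```python
import numpy as np

def encode(mixture, basis, stride):
    """1-D convolutional encoder: slide N filters over the waveform.

    mixture : (T,) raw waveform; basis : (N, kernel) filter bank (hypothetical
    stand-in for learned weights); returns X of shape (N, L) with L time steps.
    """
    kernel = basis.shape[1]
    L = (len(mixture) - kernel) // stride + 1
    # gather overlapping frames of the waveform, then project onto the basis
    frames = np.stack([mixture[i * stride : i * stride + kernel] for i in range(L)])
    return np.maximum(basis @ frames.T, 0.0)  # ReLU keeps the features non-negative

rng = np.random.default_rng(0)
wave = rng.standard_normal(160)
X = encode(wave, rng.standard_normal((16, 8)), stride=4)  # X is the 2-D tensor (N, L)
```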

Step two, segmenting the extracted voice-signal features and splicing them into a 3-D tensor:

The L time steps of the speech-signal feature are divided into blocks of length P, yielding H blocks with overlapping parts between adjacent blocks; all blocks are then spliced together into a 3-D tensor D ∈ R^(N×P×H).
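The segmentation can be sketched as follows; the 50% overlap (hop of P // 2) and the zero-padding are assumptions, since the text only states that adjacent blocks overlap:

```python
import numpy as np

def segment(X, P):
    """Split an (N, L) feature into H overlapping blocks of length P and
    stack them into a 3-D tensor D of shape (N, P, H).

    A 50% overlap (hop = P // 2) is assumed here; the feature is zero-padded
    so that every frame is covered by some block.
    """
    N, L = X.shape
    hop = P // 2
    H = int(np.ceil(max(L - P, 0) / hop)) + 1
    pad = (H - 1) * hop + P - L
    Xp = np.pad(X, ((0, 0), (0, pad)))
    return np.stack([Xp[:, s * hop : s * hop + P] for s in range(H)], axis=-1)

X = np.arange(2 * 10, dtype=float).reshape(2, 10)  # toy (N, L) feature
D = segment(X, P=4)                                 # (N, P, H) tensor
```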

Step three, modeling the spliced 3-D tensor with the dual-path recurrent neural network fused with the self-attention mechanism, and learning the long-range dependencies among the voice signals:

As shown in fig. 2, the self-attention mechanism fuses the recurrent neural network to map Q, composed of a number of queries, and (K, V), composed of a number of key-value pairs, to a specified output.

The self-attention mechanism comprises a dot-product attention module, a multi-head attention module, a residual-normalization module, and a recurrent-neural-network module.

The dot-product attention module first computes weights from Q and the corresponding K, and then uses these weights to form a weighted sum of V, yielding the output. The calculation formula is:

Attention(Q, K, V) = SoftMax(QK^T / √d_model) V    (1)

where d_model is the dimension of the input sequence, equal in the present invention to the dimension N of the speech-signal features; K^T denotes the transpose of the matrix K; SoftMax is the activation function; and the scaling by √d_model normalizes the scores before the SoftMax.
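Equation (1) can be checked with a small numpy sketch (the dimensions here are arbitrary):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V, d_model):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d_model)) V, per Eq. (1)."""
    weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out, weights = dot_product_attention(Q, K, V, d_model=8)
```

Each row of `weights` is a probability distribution over the keys, so every output row is a convex combination of the rows of V.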

The multi-head attention module is formed by combining several dot-product attention modules. It first applies linear mappings to Q, K, and V, then feeds the mapped results to several dot-product attention modules in parallel, and finally splices their outputs to obtain the output of the multi-head attention module. The calculation formulas are:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (2)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)

where W_i^Q, W_i^K, W_i^V, and W^O are all parameters of fully connected layers; h is the number of parallel dot-product attention modules; and h, d_model, d_k, and d_V satisfy d_k = d_V = d_model / h, where d_k is the dimension of the per-head projections of K and d_V is the dimension of the per-head projections of V. The multi-head attention module has few parameters, can effectively learn the long-range dependencies among the voice signals, and helps improve the final voice-separation performance.
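Equations (2)-(3) admit a compact numpy sketch; the random projection matrices below are placeholders for the fully connected layer parameters:

```python
import numpy as np

def multi_head(Q, K, V, WQ, WK, WV, WO):
    """MultiHead(Q, K, V) = Concat(head_1..head_h) W^O with
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):          # one dot-product attention per head
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        s = q @ k.T / np.sqrt(Wq.shape[-1])     # scale scores by sqrt(d_k)
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        heads.append((s / s.sum(axis=-1, keepdims=True)) @ v)
    return np.concatenate(heads, axis=-1) @ WO  # splice the heads, then project

rng = np.random.default_rng(2)
T, d_model, h = 6, 8, 2
d_k = d_model // h                              # d_k = d_v = d_model / h
X = rng.standard_normal((T, d_model))           # self-attention: Q = K = V = X
WQ, WK, WV = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
WO = rng.standard_normal((h * d_k, d_model))
out = multi_head(X, X, X, WQ, WK, WV, WO)
```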

The residual-normalization module adds the output of the multi-head attention module to the initial input (Q, K, V) and then applies layer normalization to the result. Let the object of the normalization be U ∈ R^(N×P×H); the normalization is computed as:

LN(U) = ((U − μ(U)) / √(σ(U) + ε)) ⊙ z + r

where μ(U) and σ(U) are respectively the mean and variance of U, LN denotes layer normalization, z and r are trainable rescaling factors, and ε is an extremely small positive number that prevents the denominator from being zero. Residual normalization aids the convergence of the neural-network parameters and prevents gradient explosion or vanishing gradients during training.
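A numpy sketch of the layer normalization; normalizing over the feature dimension N is an assumption (the text does not spell out the axis), and the scalar z and r below stand in for the trainable rescaling factors:

```python
import numpy as np

def layer_norm(U, z, r, eps=1e-8):
    """LN(U) = ((U - mu(U)) / sqrt(sigma(U) + eps)) * z + r,
    normalising each block of the (N, P, H) tensor over the feature axis N."""
    mu = U.mean(axis=0, keepdims=True)    # mu(U): mean over the feature dimension
    var = U.var(axis=0, keepdims=True)    # sigma(U): variance over the same axis
    return (U - mu) / np.sqrt(var + eps) * z + r

rng = np.random.default_rng(3)
U = rng.standard_normal((4, 3, 2))
out = layer_norm(U, z=1.0, r=0.0)  # identity rescaling: pure normalization
```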

The recurrent-neural-network module is a bidirectional long short-term memory unit (BiLSTM), computed as follows:

u = σ(W_u[a^<t−1>; x^<t>] + b_u)    (7)

f = σ(W_f[a^<t−1>; x^<t>] + b_f)    (8)

o = σ(W_o[a^<t−1>; x^<t>] + b_o)    (9)

c̃ = tanh(W_c[a^<t−1>; x^<t>] + b_c)    (10)

c^<t> = u ∗ c̃ + f ∗ c^<t−1>    (11)

a^<t> = o ∗ tanh(c^<t>)    (12)

where u, f, and o are respectively the update gate, forget gate, and output gate; W_u, b_u are the parameters of the update gate, W_f, b_f the parameters of the forget gate, W_o, b_o the parameters of the output gate, and W_c, b_c the parameters of the memory cell; x^<t> is the input at the current time, a^<t> is the output at the current time, and c^<t> and c̃ are the memory cells in the module. The bidirectional recurrent neural network can further learn the long-range dependencies between frames in the voice signal and promote the final voice-separation performance. In addition, the bidirectional recurrent neural network also provides position information for the self-attention mechanism.
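Equations (7)-(12) describe one LSTM step; a BiLSTM runs this recurrence once forward and once backward over the sequence and concatenates the two outputs. A single-direction numpy sketch, with the four gate projections stacked into one weight matrix W (a packing convention assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(a_prev, c_prev, x_t, W, b):
    """One step of Eqs. (7)-(12); W maps [a^<t-1>; x^<t>] to the four
    stacked pre-activations (update, forget, output, candidate)."""
    n = a_prev.size
    z = W @ np.concatenate([a_prev, x_t]) + b
    u, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])  # gates (7)-(9)
    c_tilde = np.tanh(z[3*n:])                                        # candidate (10)
    c_t = u * c_tilde + f * c_prev                                    # memory cell (11)
    a_t = o * np.tanh(c_t)                                            # output (12)
    return a_t, c_t

rng = np.random.default_rng(4)
n, d = 3, 2                                   # hidden size, input size
W = rng.standard_normal((4 * n, n + d))
b = np.zeros(4 * n)
a, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((5, d)):       # run a short forward sequence
    a, c = lstm_step(a, c, x_t, W, b)
```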

As shown in FIG. 3, the self-attention mechanism fused with the recurrent neural network is further fused into a dual-path network, which is divided into two modules: an intra-block module and an inter-block module. The object processed by the dual-path network is the 3-D tensor D ∈ R^(N×P×H). Following the procedure of the self-attention mechanism fused with the recurrent neural network, the dual-path network is computed as:

IntraD = LN([MultiHead(D[:,:,s], D[:,:,s], D[:,:,s]), s = 1, ..., H])    (13)

Intrablock(D) = [BiLSTM(IntraD[:,:,s]), s = 1, ..., H]    (14)

InterD = LN([MultiHead(Intrablock(D)[:,p,:], Intrablock(D)[:,p,:], Intrablock(D)[:,p,:]), p = 1, ..., P])    (15)

Interblock(D) = [BiLSTM(InterD[:,p,:]), p = 1, ..., P]    (16)

where IntraD denotes the output of the intra-block module after processing by the multi-head attention module and the residual-normalization module; InterD denotes the corresponding output of the inter-block module; and Intrablock(D) and Interblock(D) are respectively the outputs of the intra-block module and the inter-block module.

Using these two modules, within blocks and between blocks, substantially reduces the number of time steps the network must process at once and overcomes the difficulty of modeling extremely long time-series signals, so that the neural network can fully mine the long-range dependencies among the voice signals and the voice-separation performance is greatly improved.

In this step, the spliced 3-D tensor D ∈ R^(N×P×H) from step two is modeled by the dual-path recurrent neural network fused with the self-attention mechanism: the intra-block module learns the local information of the speech signals and the inter-block module learns their global information, so that the long-range dependencies among the speech signals are learned. A two-dimensional convolutional neural network then maps the modeled speech signals into masks D′ ∈ R^((S×N)×P×H) for the several clean human voices, and the masks are dot-multiplied with the original 3-D tensor D ∈ R^(N×P×H) to obtain the clean speech-signal features D″ ∈ R^((S×N)×P×H) of the several speakers, where S is the number of speakers in the mixed speech.
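The masking at the end of this step can be sketched as an element-wise (dot) multiplication that broadcasts the mixture tensor against one mask per speaker; the uniform random masks below are placeholders for the 2-D convolutional network's output:

```python
import numpy as np

def apply_masks(D, masks):
    """Multiply the mixture tensor D (N, P, H) element-wise by S speaker
    masks (S, N, P, H), giving clean-speech features of shape (S, N, P, H)."""
    return masks * D[None]  # broadcast D across the speaker axis

rng = np.random.default_rng(5)
S, N, P, H = 2, 4, 3, 2
D = rng.standard_normal((N, P, H))
masks = rng.uniform(size=(S, N, P, H))   # placeholder for the network's masks
D_sep = apply_masks(D, masks)            # one masked tensor per speaker
```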

Step four, overlap-adding the modeled 3-D tensor to restore it to sequential voice-signal features:

The clean speech-signal features D″ ∈ R^((S×N)×P×H) of the several speakers are restored by the overlap-add operation into the clean speech-signal features X″ ∈ R^((S×N)×L).
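The overlap-add restoration can be sketched as the inverse of the segmentation in step two. Averaging over the overlap count, with hop = P // 2, is an assumption matching the earlier segmentation sketch; under it the round trip is exact:

```python
import numpy as np

def overlap_add(D, L, hop):
    """Sum the H overlapping blocks of an (N, P, H) tensor back into an
    (N, L) sequence, dividing by the overlap count at each position so a
    segment/overlap-add round trip reproduces the input."""
    N, P, H = D.shape
    total = (H - 1) * hop + P
    out = np.zeros((N, total))
    cnt = np.zeros(total)
    for s in range(H):
        out[:, s * hop : s * hop + P] += D[:, :, s]
        cnt[s * hop : s * hop + P] += 1
    return (out / cnt)[:, :L]

# round trip: build blocks of length 4 with hop 2, then restore the sequence
X = np.arange(2 * 10, dtype=float).reshape(2, 10)
D = np.stack([X[:, s * 2 : s * 2 + 4] for s in range(4)], axis=-1)
X_rec = overlap_add(D, L=10, hop=2)
```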

Step five, reconstructing the sequence voice signal characteristics into a pure voice signal by using a decoder to obtain a separated voice signal:

and a one-dimensional deconvolution neural network is used as a decoder to restore the pure voice signal characteristics of each person into respective pure voice signals to obtain separation results.

As shown in fig. 4, a single-channel voice separation apparatus includes a voice acquisition module, a voice separation module, and a voice playing module which are connected in sequence. The voice acquisition module acquires a single-channel mixed voice signal; the voice separation module separates the mixed voice signal based on a self-attention mechanism and a dual-path recurrent neural network to obtain the separated voice signal; and the voice playing module plays the voice signal obtained from the voice separation module.

Separating the mixed voice signal based on the self-attention mechanism and the dual-path recurrent neural network specifically comprises: segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

As shown in fig. 5, an electronic device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to implement the single-channel voice separation method described above.

The memory may be a random-access memory (RAM) or a non-volatile memory, such as a Samsung 860 EVO solid-state drive. The memory stores the program, including the program code of the single-channel voice separation method, and provides instructions and data to the processor.

The processor may be an Intel Core i5-4200U processor. The processor reads the corresponding program code from the non-volatile memory into memory and runs it, forming the single-channel voice separation method. The processor executes the program stored in the memory and is specifically configured to perform the following operations: segmenting the extracted voice-signal features and splicing them into a 3-D tensor; modeling the spliced 3-D tensor with a dual-path recurrent neural network fused with a self-attention mechanism to learn the long-range dependencies among the voice signals; restoring the modeled 3-D tensor to sequential voice-signal features; and reconstructing the sequential voice-signal features into clean voice signals to obtain the separated voice signals.

The memory and the processor may be connected to each other by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like; the buses are indicated by double-headed arrows in fig. 5.

The dual-path recurrent neural network described above is trained with the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function, computed as:

x_target = (⟨x̂, x⟩ x) / ‖x‖²

e_noise = x̂ − x_target

SI-SNR = 10 log₁₀(‖x_target‖² / ‖e_noise‖²)

where x̂ is the separated voice obtained in step five and x is the original clean voice.
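The SI-SNR objective can be sketched in numpy; the zero-mean step below is conventional for SI-SNR and assumed here:

```python
import numpy as np

def si_snr(x_hat, x, eps=1e-8):
    """Scale-invariant SNR: project the estimate onto the clean signal,
    treat the remainder as noise, and compare the two energies in dB."""
    x_hat = x_hat - x_hat.mean()   # zero-mean both signals (conventional step)
    x = x - x.mean()
    x_target = (x_hat @ x) * x / (x @ x + eps)   # projection onto the clean signal
    e_noise = x_hat - x_target                   # residual treated as noise
    return 10 * np.log10((x_target @ x_target) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(6)
x = rng.standard_normal(100)
clean_score = si_snr(2.0 * x, x)                  # rescaling the estimate is "free"
noisy_score = si_snr(x + rng.standard_normal(100), x)
```

Because the metric is scale-invariant, a perfectly separated but rescaled estimate scores very high, while additive noise of comparable power scores near 0 dB; training maximizes this quantity.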
