Voice extraction method, device and equipment

Document number: 344498  Publication date: 2021-12-03

Reading note: this technique, "Voice extraction method, device and equipment", was designed and created by Shi Huiyu, Yin Shouyi, Han Huiming, Liu Leibo and Wei Shaojun on 2021-09-03. The embodiment of the specification provides a voice extraction method, a voice extraction device and voice extraction equipment. The method comprises the following steps: acquiring mixed voice sample data; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal, and a target voice signal; training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model; constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters; determining a target quantization strategy based on the network parameters; updating the pre-training voice separation model by using the target quantization strategy to obtain a voice extraction model; and extracting a target object voice signal from the voice data to be processed by utilizing the voice extraction model. The method reduces the scale of the voice extraction model, and thereby quickly and effectively separates the voice of the target object from single-channel voice.

1. A method of speech extraction, comprising:

acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal;

training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model;

constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters;

determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model;

updating the pre-training voice separation model by using a target quantization strategy to obtain a voice extraction model;

extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

2. The method of claim 1, wherein the mixed voice sample data is obtained by:

mixing at least two human voice signals within a first signal-to-noise ratio range to obtain a human-voice mixed voice signal;

mixing the human-voice mixed voice signal with a noise signal within a second signal-to-noise ratio range to obtain a comprehensive voice signal;

and processing the comprehensive voice signal by utilizing a voice signal generating function to obtain the mixed voice sample data.

3. The method of claim 1, in which the mixed voice sample data comprises training sample data, verification sample data, and test sample data; the training of the preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model comprises the following steps:

training a preset voice separation model by using the training sample data to obtain a pre-training voice separation model;

before extracting the target object voice signal from the voice data to be processed by using the voice extraction model, the method further comprises the following steps:

extracting a test target voice signal in the test sample data by using the voice extraction model;

optimizing the voice extraction model according to the comparison result of the test target voice signal and the verification sample data;

correspondingly, the extracting the target object voice signal from the voice data to be processed by using the voice extraction model includes:

and extracting the target object voice signal from the voice data to be processed by using the optimized voice extraction model.

4. The method of claim 1, wherein said training a preset speech separation model using said mixed speech sample data to obtain a pre-trained speech separation model comprises:

initializing model parameters in the preset voice separation model so that the neural network can carry out forward propagation, which comprises: initializing weights and biases among neuron nodes in the preset voice separation model; wherein an activation function is arranged between network neuron nodes in the initial voice extraction model; the activation function is used for generating a nonlinear mapping between the input and the output corresponding to the network neuron nodes in the process of neural network forward propagation;

calculating a loss function of the preset voice separation model based on the mixed voice sample data;

and updating the model parameters by using a gradient descent method according to the loss function.

5. The method of claim 4, wherein said calculating a loss function of said pre-set speech separation model based on said mixed speech sample data comprises:

inputting the mixed voice sample data into a preset voice separation model to obtain a predicted target voice;

using the formula L = −SI-SNR, with SI-SNR = 10·log₁₀(‖s_target‖²/‖e_noise‖²), to calculate a loss function, wherein L is the loss function, s_target = (⟨ŝ, s⟩/‖s‖²)·s, s is the ideal target voice, ŝ is the predicted target voice, and e_noise = ŝ − s_target.

6. The method of claim 1, wherein the constructing a policy network and an evaluation network based on the pre-trained speech separation model comprises:

constructing a current strategy network, a target strategy network, a current evaluation network and a target evaluation network based on the pre-training voice separation model;

the determining a target quantization policy based on the network parameters includes:

constructing a first state based on network parameters in a current policy network, a target policy network, a current evaluation network and a target evaluation network; the first state corresponds to a first feature vector;

constructing a first action according to the current policy network and the initial state;

executing the first action to obtain a second state corresponding to the pre-training voice separation model, an execution reward and a state termination judgment result; the second state corresponds to a second feature vector;

constructing an experience playback set based on the first feature vector, the first action, the execution reward, the second feature vector and a state termination judgment result;

calculating a current target value using the set of empirical playbacks;

updating a current evaluation network and a current strategy network respectively based on the experience playback set and a current target value;

updating the target evaluation network and the target strategy network based on the updated current evaluation network and the current strategy network;

and under the condition that the state termination judgment result is the termination state, determining a target quantization strategy based on the updated current evaluation network, current strategy network, target evaluation network and target strategy network.

7. The method of claim 6, wherein the building a first action based on the current policy network and the initial state comprises:

constructing a first action by using the formula A = π_θ(φ(S)) + N, wherein A is the first action, π_θ is the current policy network with network parameter θ, φ(S) is the first feature vector, and N is noise;

the calculating a current target value using the empirical playback set includes:

using the formula

y_j = R_j, if is_end is true
y_j = R_j + γ·Q′(φ(S′_j), π_θ′(φ(S′_j)), ω′), if is_end is false

to calculate the current target value, wherein y_j is the current target value corresponding to the j-th sample in the experience playback set, R_j is the execution reward corresponding to the j-th sample in the experience playback set, is_end is the state termination judgment result, true is the termination state, false is the non-termination state, γ is the attenuation factor, Q′ is the feedback value obtained by the target evaluation network, φ(S′_j) is the second feature vector, π_θ′ is the target policy network, and ω′ is the network parameter of the target evaluation network;

the updating a current evaluation network and a current policy network based on the experience playback set and a current target value, respectively, includes:

using the formula J(ω) = (1/m)·Σ_{j=1..m} (y_j − Q(φ(S_j), A_j, ω))² to update the network parameters of the current evaluation network by gradient descent, wherein m is the number of samples in the experience playback set, Q is the feedback value obtained by the current evaluation network, φ(S_j) is the first feature vector corresponding to the j-th sample, A_j is the first action corresponding to the j-th sample in the experience playback set, and ω is the network parameter of the current evaluation network;

using the formula J(θ) = −(1/m)·Σ_{j=1..m} Q(φ(S_j), π_θ(φ(S_j)), ω) to update the network parameters of the current policy network, wherein φ(S_j) is the first feature vector corresponding to the j-th sample in the experience playback set, π_θ(φ(S_j)) is the action, corresponding to a candidate quantization strategy, given by the current policy network for the j-th sample, and θ is the network parameter of the current policy network;

the updating the target evaluation network and the target policy network based on the updated current evaluation network and the current policy network includes:

updating the target evaluation network based on the formula ω′ ← τ·ω + (1 − τ)·ω′, wherein τ is an adjustment coefficient, ω is the network parameter of the current evaluation network, and ω′ is the network parameter of the target evaluation network;

and updating the target policy network based on the formula θ′ ← τ·θ + (1 − τ)·θ′, wherein θ′ is the network parameter of the target policy network.
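The update rules above follow the standard pattern of deep deterministic policy gradient (DDPG) training. As an illustrative sketch only (the scalar stand-ins for network outputs are assumptions, not the patented networks):

```python
import numpy as np

gamma = 0.99   # attenuation factor
tau = 0.005    # adjustment coefficient

def td_target(reward_j, is_end, q_next):
    """Current target value: y_j = R_j on termination, else R_j + gamma * Q'(...)."""
    return reward_j if is_end else reward_j + gamma * q_next

def soft_update(target_params, current_params):
    """Soft update of target parameters: omega' <- tau*omega + (1 - tau)*omega'."""
    return tau * current_params + (1 - tau) * target_params
```

For example, with R_j = 1 and Q′ = 5, a non-terminal sample yields y_j = 1 + 0.99 × 5 = 5.95, while a terminal sample yields y_j = 1.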

9. The method of claim 6, wherein prior to determining a target quantization policy based on the updated current evaluation network, current policy network, target evaluation network and target policy network, the method further comprises:

and repeating the steps of constructing the first action; obtaining the second state, the execution reward and the state termination judgment result; constructing the experience playback set; calculating the current target value; updating the current evaluation network and the current strategy network; and updating the target evaluation network and the target strategy network, until the state termination judgment result is the termination state.

9. A speech extraction device, comprising:

the mixed voice sample data acquisition module is used for acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal;

the preset voice separation model training module is used for training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model;

the network construction module is used for constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters;

a target quantization strategy determination module for determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model;

the pre-training voice separation model updating module is used for updating the pre-training voice separation model through the target quantization strategy to obtain a voice extraction model;

the target object voice signal extraction module is used for extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

10. A speech extraction device comprising a memory and a processor;

the memory to store computer program instructions;

the processor to execute the computer program instructions to implement the steps of: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal; training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model; constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters; determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model; updating the pre-training voice separation model by using a target quantization strategy to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

Technical Field

The embodiment of the specification relates to the technical field of voice signal processing, in particular to a voice extraction method, a device and equipment.

Background

With the development of computer and artificial intelligence technologies, automatic speech recognition based on intelligent devices has also been widely used. In practical application, the intelligent device collects the voice of the target object and often receives interference signals such as the voice of other objects and noise in the environment. Therefore, before performing speech recognition, a speech signal corresponding to a target object is first extracted from the acquired speech signal.

At present, when a multi-channel voice signal is processed, voice extraction can be performed by comparing voice signals of different channels. However, when processing a single-channel speech signal, it is more difficult to directly extract the corresponding sound source from the noisy and reverberant environment. Some existing methods for separating single-channel voice signals mainly improve the performance of the model by expanding the structure of the model and increasing the parameter number of the model on the basis of the original model, but in this way, not only higher requirements are put forward on the performance of computing equipment, but also the computing time is greatly prolonged. Therefore, there is a need for a method that can reduce the scale of a speech extraction model to quickly and efficiently extract single-channel speech on the premise of ensuring the speech separation effect.

Disclosure of Invention

An object of the embodiments of the present specification is to provide a method, an apparatus, and a device for speech extraction, so as to solve the problem of how to reduce the scale of a speech extraction model to achieve fast and efficient speech extraction.

In order to solve the above technical problem, an embodiment of the present specification provides a speech extraction method, including: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal; training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model; constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters; determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model; updating the pre-training voice separation model by using a target quantization strategy to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

An embodiment of this specification further provides a speech extraction apparatus, including: the mixed voice sample data acquisition module is used for acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal; the preset voice separation model training module is used for training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model; the network construction module is used for constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters; a target quantization strategy determination module for determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model; the pre-training voice separation model updating module is used for updating the pre-training voice separation model through the target quantization strategy to obtain a voice extraction model; the target object voice signal extraction module is used for extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

The embodiment of the present specification further provides a speech extraction device, including a memory and a processor; the memory to store computer program instructions; the processor to execute the computer program instructions to implement the steps of: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal; training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model; constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters; determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model; updating the pre-training voice separation model by using a target quantization strategy to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

According to the technical scheme provided by the embodiment of the specification, after the mixed voice sample data is obtained, the preset voice separation model is trained by using the mixed voice sample data, then the strategy network and the evaluation network are constructed according to the pre-trained voice separation model obtained through training, and then the target quantization strategy is determined according to the network parameters corresponding to the constructed network, so that the pre-trained voice separation model is updated by using the obtained strategy, and finally the target voice signal is extracted from the voice data to be processed by using the updated voice extraction model. By the method, after the model is pre-trained, the model can be updated by determining the quantization strategy to ensure the accuracy of the model, and further, the structural details and parameters of the preset voice separation model can be reduced as much as possible when the model structure is determined, so that the model training time is shortened, the construction of the corresponding model is rapidly and effectively realized, the extraction of the voice signal corresponding to the target object in single-channel voice is effectively realized, and the use experience of a user is improved.

Drawings

In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the specification, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a method for extracting speech according to an embodiment of the present disclosure;

fig. 2 is a schematic structural diagram of an Actor-Critic network according to an embodiment of the present disclosure;

FIG. 3 is a schematic flow chart illustrating a process for obtaining a speech extraction model according to an embodiment of the present disclosure;

FIG. 4 is a block diagram of a speech extraction apparatus according to an embodiment of the present disclosure;

fig. 5 is a block diagram of a speech extraction device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.

In order to solve the above technical problem, a speech extraction method according to an embodiment of the present specification is introduced. The speech extraction method is executed by a voice extraction device, which includes but is not limited to a server, an industrial personal computer, a personal computer (PC) and the like. As shown in fig. 1, the speech extraction method may include the following implementation steps.

S110: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal.

In this embodiment, in order to achieve the purpose of extracting the voices of one or more objects from a single-channel voice, a corresponding model needs to be constructed and trained, and finally, the purpose of extracting the voices can be achieved by using the model.

The mixed speech sample data is the data utilized in training the model. For the purpose of separating the voice of the target object from the voice signal, the target voice signal and at least one of a noise signal, an interference voice signal, a reverberation signal may be included in the mixed voice sample data.

The noise signal can be a signal which causes interference to the original voice due to the reasons of incompleteness of the voice signal collected by the microphone, loss in the signal transmission process and the like in the voice collection process. The interfering speech signal may be speech generated by an object other than the target object, for example, if the speech acquisition region includes a plurality of objects emitting sounds, and only one of the objects is the target object required for extracting speech this time, the acquired speech signal corresponding to the other object is the interfering speech signal. The reverberation signal may be a signal received by the sound collection device after the sound emitted by the target object itself is reflected by an object such as a surrounding obstacle or barrier. Since there is a certain delay in the acquisition of these sounds compared to the sounds directly emitted by the target object, certain interference is also caused to the speech extraction.

The target speech signal may be a signal corresponding to speech produced by the target object. The number of the target objects may be one or more. In order to perform processing such as voice recognition on the target voice signal in a subsequent process, the target voice signal needs to be separated from the mixed voice sample data.

Specifically, for the purpose of performing speech extraction on a single-channel speech in conjunction with the embodiments of the present specification, the mixed speech sample data may also be a single-channel speech signal. The single-channel speech signal may be a sound signal collected by only one microphone.

Specifically, under the condition that the process of training the model is realized based on supervised learning, the mixed voice sample data may further correspond to a corresponding label for identifying the target voice signal therein. The specific identification manner may be set based on the requirement of the actual application, which is not limited to this.

In some embodiments, the mixed voice sample data may be prepared as follows. Firstly, at least two human voice signals are mixed within a first signal-to-noise ratio range to obtain a human-voice mixed voice signal; a human voice signal may be an independently acquired or pre-separated voice signal corresponding to a human voice, and the first signal-to-noise ratio range limits the signal-to-noise ratio interval for mixing the human voice signals, and may be, for example, 0 dB to 5 dB. Secondly, the human-voice mixed voice signal and a noise signal are mixed within a second signal-to-noise ratio range to obtain a comprehensive voice signal; the noise signal may be an additionally generated signal that interferes with the voice signal, and the second signal-to-noise ratio range limits the signal-to-noise ratio interval for mixing the two signals, and may be, for example, -6 dB to 3 dB. Finally, the comprehensive voice signal is processed by using a voice signal generating function to obtain the mixed voice sample data; the voice signal generating function can generate a corresponding voice signal from the given data so as to simulate the voice of a practical application, and may specifically be a pyroomacoustics function, which can quickly construct simulation scenes of single/multiple sound sources and microphones in a 2D/3D room and thereby help construct simulated voice sample data.
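The SNR-controlled mixing described above can be sketched as follows; this is an illustrative NumPy fragment (function and variable names are assumptions), not the patent's implementation:

```python
import numpy as np

def mix_at_snr(signal, interference, snr_db):
    """Scale `interference` so the signal-to-interference ratio equals `snr_db`, then add."""
    p_sig = np.mean(signal ** 2)
    p_int = np.mean(interference ** 2)
    # Choose gain g so that 10*log10(p_sig / (g**2 * p_int)) == snr_db.
    gain = np.sqrt(p_sig / (p_int * 10 ** (snr_db / 10.0)))
    return signal + gain * interference

rng = np.random.default_rng(0)
speech_a = rng.standard_normal(8000)  # stand-ins for two 8 kHz speech signals
speech_b = rng.standard_normal(8000)
noise = rng.standard_normal(8000)

# First range (e.g. 0 dB to 5 dB): mix the two human voice signals.
voices = mix_at_snr(speech_a, speech_b, rng.uniform(0, 5))
# Second range (e.g. -6 dB to 3 dB): add the noise signal.
mixture = mix_at_snr(voices, noise, rng.uniform(-6, 3))
```

A room simulation step (e.g. with pyroomacoustics) would then add reverberation to such a mixture.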

The above process is explained in detail with a specific example. When preparing the mixed voice sample data, the time-domain signals of the WSJ0 voice signal samples and the WHAM noise samples are first resampled at 8 kHz; the voices of two different speakers are randomly mixed at a signal-to-noise ratio between 0 dB and 5 dB; the mixed voice and a randomly extracted noise sample are then mixed within a signal-to-noise ratio range of -6 dB to 3 dB; and the corresponding room impulse response is obtained based on the pyroomacoustics function and the room configuration parameters in Table 1, based on which the obtained voice is finally mixed to produce the final mixed voice sample data y containing noise, reverberation and interference from other speakers.

TABLE 1

Based on the above embodiment, after a certain number of mixed voice sample data is obtained, the mixed voice sample data may be further divided. Specifically, the mixed voice sample data may be divided into training sample data, verification sample data, and test sample data. Wherein the training sample data is used for training for a model in a subsequent step; the test sample data and the verification sample data can be used for respectively testing and verifying the model after the model training is finished so as to ensure the effect of the model.

To illustrate with a specific example, assuming that the total number of sample data generated based on the above steps is 28000, the data can be divided into 20000 training sample data, 3000 verification sample data and 5000 test sample data, which are then used in the subsequent model training, verification and testing processes, respectively. In practical applications, the sample numbers may be set to other ratios according to the application requirements, and are not limited to the above example.
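A minimal sketch of such a partition (illustrative helper, not part of the patent):

```python
import numpy as np

def split_samples(samples, n_train, n_val, n_test, seed=0):
    """Shuffle and partition samples into training/verification/test subsets."""
    assert n_train + n_val + n_test == len(samples)
    idx = np.random.default_rng(seed).permutation(len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

With 28000 samples, this would be called as split_samples(data, 20000, 3000, 5000).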

S120: and training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model.

The preset speech separation model may be a pre-constructed model for speech extraction for single-channel speech.

Preferably, in order to achieve the technical purpose of the present application, that is, to achieve speech extraction quickly and efficiently, the preset speech separation model may reduce the size of the model itself as much as possible while ensuring that the effect of extracting speech can be achieved completely, so as to reduce the time consumed by extracting speech using the model. Accordingly, the accuracy of the extraction result can be effectively ensured through the subsequent steps.

In some embodiments, the preset speech separation model may be a model constructed based on a deep neural network. When the model is trained, forward propagation and backward propagation can be sequentially performed based on the connection sequence between the neural network nodes in the preset voice separation model, so that the model is updated to complete the training.

Based on the above embodiment, when training the model, the model parameters in the preset speech separation model may be initialized first. The model parameters mainly include weight values and bias values between network neuron nodes, and the model parameters constitute a specific way of processing data by the model, specifically, in the embodiment of the present specification, that is, are used to extract target speech from single-channel speech.

After the initialization of the model is completed, the loss function of the preset voice separation model can be calculated according to the mixed voice sample data. After the mixed voice sample data is input into the preset voice separation model, the data can be correspondingly processed based on the structures of all parts in the preset voice separation model and the data flow relation among different structures so as to obtain the final predicted target voice. The specific process of calculating data by using the preset voice separation model may be set based on the actual structure of the preset voice separation model, which is not described herein again.

Specifically, an activation function may be arranged between network neuron nodes in the initial speech extraction model. The activation function introduces a nonlinear relation between network layers during forward propagation of the neural network, finally generating a nonlinear mapping between the input and the output. The specific type of the activation function may be set according to application requirements, and may be, for example, an activation function such as PReLU, which is not limited herein.
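For illustration, PReLU keeps positive inputs unchanged and scales negative inputs by a learnable slope; a minimal sketch (the slope value 0.25 is an assumed initialization, not taken from the patent):

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU activation: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)
```

For example, prelu(np.array([-2.0, 3.0]), alpha=0.1) gives [-0.2, 3.0].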

After the predicted target speech is obtained, the model may be optimized by using a loss function according to the predicted target speech. The loss function may be a preset loss function corresponding to the preset speech separation model, and is used for evaluating the loss of the model according to the prediction result, and further correcting the model by combining the calculation result so as to obtain a more accurate prediction result.

Specifically, the predetermined loss function may be the negative scale-invariant signal-to-noise ratio:

L = −10·log10( ||s_target||² / ||e_noise||² ), where s_target = (⟨ŝ, s⟩ / ||s||²)·s and e_noise = ŝ − s_target,

where L is the predetermined loss function, s_target is used to represent the effective signal in the speech signal, s is the ideal target speech (which may be embodied in advance by labeling the mixed voice sample data), ŝ is the predicted target speech, e_noise is used to represent the noise signal in the speech signal, ⟨·,·⟩ represents the dot product between two vectors, and ||·||² represents the squared Euclidean norm. Here SNR denotes the signal-to-noise ratio.
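Reading the predetermined loss above as a scale-invariant SNR criterion, a minimal pure-Python sketch might look like this; the function names and the small `eps` stabilizer are assumptions for illustration, not part of the embodiment:

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between a predicted signal `est`
    and the ideal target speech `ref` (equal-length sample lists)."""
    # project the estimate onto the reference: the "effective" signal s_target
    scale = _dot(est, ref) / (_dot(ref, ref) + eps)
    s_target = [scale * r for r in ref]
    # the residual is treated as the noise component e_noise
    e_noise = [e - t for e, t in zip(est, s_target)]
    snr = 10.0 * math.log10(_dot(s_target, s_target) / (_dot(e_noise, e_noise) + eps))
    return -snr  # minimizing the loss maximizes the SNR of the estimate
```

A perfectly scaled copy of the reference drives the loss strongly negative, while a poorly matched estimate yields a positive loss.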

In some embodiments, the model may be optimized using the gradient descent method in combination with the predetermined loss function. The optimization may proceed by computing a first gradient of the loss function at the output layer of the initial speech extraction model, sequentially computing the gradient of each preceding layer from that first gradient, and finally updating the weights and biases of the initial speech extraction model using the per-layer gradients.

Specifically, updating the parameters of the multi-scale extraction deep neural network by gradient descent may proceed as follows: fix the parameters of the deep neural network for a certain period, compute the gradient of the loss function at the output layer using the formula above, then, taking the output layer as the L-th layer, sequentially compute the gradients of layers L−1, L−2, …, 2, where L is the number of layers of the neural network. After all gradients have been computed, the weights and biases of the whole network are updated according to the computed gradients, thereby completing the optimization of the model and obtaining the pre-trained speech separation model.
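The per-layer gradient computation and weight/bias update described above can be illustrated on a deliberately tiny one-layer model; this is a didactic sketch of the update rule, not the multi-scale deep network of the embodiment:

```python
def train_step(w, b, x, target, lr=0.1):
    """One gradient-descent step for y = w*x + b under squared error.
    The chain rule applied here, repeated layer by layer from layer L
    down to layer 2, is the backpropagation procedure described above."""
    y = w * x + b                  # forward pass
    grad_y = 2.0 * (y - target)    # gradient of (y - target)^2 at the output
    grad_w = grad_y * x            # backpropagate to the weight
    grad_b = grad_y                # and to the bias
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for _ in range(200):
    w, b = train_step(w, b, x=1.0, target=3.0)
```

After a few hundred steps the output w·x + b converges to the target, which is the fixed point of the update.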

Accordingly, because the differences among preset speech separation models are embodied mainly in their model parameters, the optimization of the preset speech separation model is mainly an optimization of those parameters. The specific optimization process may be adjusted to the requirements of the actual application and is not described here again.

Since the embodiments of the present specification reduce the scale and parameters of the preset speech separation model as much as possible, training it requires correspondingly less time and fewer computing resources. Correspondingly, however, the accuracy of the speech separated by the trained model necessarily suffers somewhat, so the accuracy of the model is compensated by other methods in the subsequent steps.

S130: constructing a strategy network and an evaluation network based on the pre-training voice separation model; the policy network and the evaluation network correspond to network parameters.

In practical applications, a trained model may be optimized using certain preset optimization rules; for example, analysis features may be fed into the model, and once the corresponding performance parameters are obtained, a corresponding optimization strategy is determined.

Specifically, when the model is optimized in this way, the Actor-Critic framework can be adopted. The Actor generates actions based on a policy function and continually interacts with the environment by trial and error to produce samples for training; the Critic evaluates the Actor's performance based on a value function and, combining the feedback from actual operation, guides the Actor's next action, thereby ensuring the accuracy of the model update. In a specific implementation, the problem can be converted into a reinforcement-learning scenario by defining the four-tuple (S, A, R, P) of reinforcement learning: State, Action, Reward, and Policy. The State represents the relevant parameters of the model. An Action is what is performed, based on the current state of the network, to obtain a new network state. The Reward indicates the difference between the effect the current network can achieve and the desired effect, and thus constitutes the reward signal for optimizing the network. The Policy is the strategy taken to maximize the expected reward, and it is ultimately applied to updating the model into the speech extraction model.

The policy network and the evaluation network refer respectively to the Actor network and the Critic network: the Actor network is responsible for generating actions and interacting with the environment, while the Critic network is responsible for evaluating the Actor network and guiding its next action.

In a specific application, the constructed policy network comprises a current policy network and a target policy network, and the evaluation network comprises a current evaluation network and a target evaluation network. The target policy network is adjusted based on the parameters of the current policy network, so that the policy network can be updated progressively during optimization; the current evaluation network and the target evaluation network stand in the same relationship.

Fig. 2 shows a schematic diagram of the Actor-Critic network structure: the Actor current network interacts with the pre-trained model based on an action A, generating sample data that is placed in an experience pool; sample data y_j is then extracted from the experience pool and fed back to the Critic current network to further update it. Correspondingly, the Actor current network and the Critic current network may also be further used to update the Actor target network and the Critic target network, thereby driving the generation of the final policy.

The policy network and the evaluation network respectively correspond to respective network parameters. The network parameters may be initialized separately during an initial application phase. The specific initialization mode and the corresponding relationship between the current policy network and the target policy network, and the parameters between the current evaluation network and the target evaluation network may be set based on the requirements of the actual application, and are not described herein again.

S140: determining a target quantization strategy based on the network parameters; the target quantization strategy is used to determine an optimization mode for the pre-trained speech separation model.

A target quantization strategy is determined in order to optimize the pre-trained speech separation model and thereby obtain the final speech extraction model. To determine the target quantization strategy, an initial mixed quantization strategy a_0 may first be established and then updated in subsequent processes. The specific initial hybrid quantization strategy may be set based on specific parameters.

Before the specific steps are executed, the network parameters may be initialized, specifically, the network parameters of the target policy network may be set to be the same as the network parameters of the current policy network, and the network parameters of the target evaluation network may be set to be the same as the network parameters of the current evaluation network. An experience playback set D may also be set, which is used to store corresponding samples in subsequent steps, and correspondingly, in the initialization process, the experience playback set is emptied to ensure subsequent application effects.

Thereafter, a first state S may be constructed based on network parameters in the current policy network, the target policy network, the current evaluation network, and the target evaluation network, the first state corresponding to a first feature vector φ (S).

Thereafter, a first action A may be obtained from the current policy network based on the first state S. Specifically, the first action may be constructed as A = π_θ(φ(S)) + N, where A is the first action, π_θ is the current policy network with network parameters θ, φ(S) is the first feature vector, and N is noise, specifically some additional supplementary noise.
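The construction of the first action, policy output plus noise, can be sketched as follows; the linear policy and the Gaussian noise are hypothetical stand-ins for the actual current policy network and its supplementary noise N:

```python
import random

def select_action(theta, phi_s, noise_scale=0.1, rng=None):
    """First action A = pi_theta(phi(S)) + N, using a linear policy as a
    stand-in for the current policy network plus Gaussian exploration noise."""
    rng = rng or random.Random(0)
    a = sum(p * s for p, s in zip(theta, phi_s))  # hypothetical pi_theta(phi(S))
    return a + rng.gauss(0.0, noise_scale)        # additive noise N
```

The noise term is what lets the Actor explore actions beyond those the deterministic policy would currently choose.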

Based on the foregoing description of the policy network, the first action may be executed by the current policy network, thereby generating a second state S′ corresponding to the pre-trained speech separation model, an execution reward R corresponding to the execution result, and a state termination determination result is_end. The execution reward represents the difference between the current model's achieved effect and the ideal effect, and the state termination determination result indicates whether the update process based on the policy network and the evaluation network has reached its termination condition. Correspondingly, the second state S′ corresponds to a second feature vector φ(S′).

The obtained first feature vector φ(S), first action A, execution reward R, second feature vector φ(S′), and state termination determination result is_end are stored in the experience playback set D as one group of samples {φ(S), A, R, φ(S′), is_end}. The above operations are then repeated from the obtained second state S′, and the resulting groups of samples are stored together in the experience playback set.

Based on the samples in the experience playback set, the current target value may be calculated; this value enables the update of the networks. Specifically, m samples may be extracted from the experience playback set, expressed as {φ(S_j), A_j, R_j, φ(S′_j), is_end_j}, where j = 1 … m. The current target value is then computed as

y_j = R_j, if is_end_j is true;
y_j = R_j + γ · Q′(φ(S′_j), π_θ′(φ(S′_j)), ω′), if is_end_j is false,

where y_j is the current target value corresponding to the j-th sample in the experience playback set, R_j is the execution reward corresponding to the j-th sample, is_end_j is the state termination result (true or false), γ is an attenuation factor that may be obtained from practical experience, Q′ is the feedback value produced by the target evaluation network, φ(S′_j) is the second feature vector, π_θ′ is the conversion function that maps the second feature vector to the corresponding second action through the target policy network, and ω′ is the network parameter of the target evaluation network.

After the calculation of the current target value is completed, all network parameters ω of the current evaluation network are updated through gradient backpropagation of the neural network using a mean-square-error loss function. Specifically, the network parameters of the current evaluation network may be updated using the formula

J(ω) = (1/m) · Σ_{j=1..m} ( y_j − Q(φ(S_j), A_j, ω) )²,

where m is the number of samples drawn from the experience playback set, Q is the feedback value produced by the current evaluation network, A_j is the first action corresponding to the j-th sample in the experience playback set, and ω is the network parameter of the current evaluation network.
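The target-value computation and the mean-square-error objective for the current evaluation network described above can be sketched together; `q_target` and `q_current` are hypothetical callables standing in for the target evaluation network (already composed with the target policy network) and the current evaluation network:

```python
def td_targets(batch, gamma, q_target):
    """y_j = R_j when is_end_j is true, otherwise
    y_j = R_j + gamma * Q'(phi(S'_j), pi'(phi(S'_j)), omega')."""
    return [r if is_end else r + gamma * q_target(phi_s2)
            for (phi_s, a, r, phi_s2, is_end) in batch]

def critic_mse(batch, targets, q_current):
    """J(omega) = (1/m) * sum_j (y_j - Q(phi(S_j), A_j, omega))^2."""
    m = len(batch)
    return sum((y - q_current(phi_s, a)) ** 2
               for y, (phi_s, a, r, phi_s2, is_end) in zip(targets, batch)) / m
```

In a real implementation J(ω) would then be minimized by gradient backpropagation; here it is only evaluated.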

Accordingly, updating the current policy network may also be implemented based on the current target value, the initial hybrid quantization policy, and the other network parameters. Specifically, the network parameters of the current policy network may be updated using the objective

J(θ) = −(1/m) · Σ_{j=1..m} Q(s_j, a_j, ω),

where s_j is the first state corresponding to the j-th sample in the experience playback set, a_j is the target quantization strategy (action) corresponding to the j-th sample, and θ is the network parameter of the current policy network.
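A deterministic policy-gradient step for the current policy network can be sketched with scalar stand-ins; `q_grad_a` (dQ/da) and `policy_grad` (dπ/dθ) are hypothetical placeholders for quantities the evaluation network and automatic differentiation would supply:

```python
def actor_step(theta, states, q_grad_a, policy_grad, lr=0.01):
    """One gradient-ascent step on (1/m) * sum_j Q(s_j, pi_theta(s_j), omega)
    for a scalar stand-in policy pi_theta(s) = theta * s."""
    m = len(states)
    grad = sum(q_grad_a(s, theta * s) * policy_grad(s) for s in states) / m
    return theta + lr * grad  # ascent: the policy seeks larger Q values
```

With a toy Q(s, a) = −(a − s)², whose gradient in a is −2(a − s), repeated steps drive theta toward the value that maximizes Q.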

After the current evaluation network and the current policy network are updated, the updated networks can be used to update the target evaluation network and the target policy network. Specifically, the target evaluation network may be updated as ω′ ← τω + (1 − τ)ω′, where τ is an adjustment coefficient; the target policy network is updated as θ′ ← τθ + (1 − τ)θ′, where θ′ is the network parameter of the target policy network and ← denotes assignment.
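The soft updates of the target networks follow directly from the two assignment formulas above; representing the parameters as flat lists is an assumption for illustration:

```python
def soft_update(target_params, current_params, tau):
    """omega' <- tau*omega + (1 - tau)*omega' (and likewise theta' from theta)."""
    return [tau * c + (1.0 - tau) * t
            for t, c in zip(target_params, current_params)]
```

With a small τ the target network trails the current network slowly, which stabilizes the target values used in the evaluation-network update.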

When the state termination determination result is the termination state, the final target quantization policy may be determined based on the updated current evaluation network, current policy network, target evaluation network, and target policy network.

If, after the target evaluation network and the target policy network are updated in the previous step, the termination determination result is not the termination state, the steps of constructing the first action; obtaining the second state, the execution reward, and the state termination determination result; building the experience playback set; computing the current target value; updating the current evaluation network and the current policy network; and updating the target evaluation network and the target policy network are repeated until the state termination determination result is the termination state.

S150: and updating the pre-training voice separation model by using a target quantization strategy to obtain a voice extraction model.

After the target quantization strategy is obtained, the pre-training voice separation model can be updated by using the target quantization strategy to obtain a final voice extraction model. Specifically, the model may be subjected to parameter quantization to obtain the required speech extraction model.
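As one possible realization of the parameter quantization, not prescribed by the embodiment, uniform symmetric quantization of a layer's weights to a bit width chosen by the target quantization strategy might be sketched as:

```python
def quantize(weights, bits):
    """Uniform symmetric quantization of a list of weights to `bits` bits:
    snap each weight to the nearest of the evenly spaced representable values."""
    levels = (1 << (bits - 1)) - 1                      # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels or 1.0  # guard all-zero layers
    return [round(w / scale) * scale for w in weights]
```

A mixed-precision strategy would apply this per layer with different `bits` values, trading model size against the separation accuracy that the subsequent fine-tuning recovers.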

Preferably, after the model is quantized, the model may be further fine-tuned by one to two epochs based on the training method in step S120, so as to optimize the voice extraction effect of the model. The specific implementation process may be adjusted based on the requirements of the actual application, and is not described herein again.

Because the preset speech separation model has a relatively simple structure, the training time and resources can be kept as low as possible during training; updating the model with the target quantization strategy then ensures its recognition accuracy, so that the corresponding model can be obtained quickly and effectively and single-channel speech separation realized.

S160: extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

After the speech extraction model is obtained, the voice data to be processed is input into the quantized and fine-tuned network model, and the separation result for the target speech is obtained through the model's computation, meeting the needs of subsequent speech recognition and the like. In some embodiments, after the speech extraction model is trained, it may be tested and verified to ensure the training effect. Specifically, based on the implementation of step S110, after the mixed voice sample data is obtained, test sample data and verification sample data may be drawn from it.

And extracting a test target voice signal in the test sample data by using the trained voice extraction model, comparing the extracted test target voice signal with the verification sample data, and optimizing the voice extraction model according to a comparison result. By analyzing the consistency of the prediction result and the original result, the prediction accuracy of the model can be effectively judged, so that whether the model can be directly applied or trained again is determined, and the training effect of the model is effectively ensured.

As shown in fig. 3, to summarize the process of obtaining the speech extraction model in combination with the above process, after pre-training a preset speech separation model, network parameters are updated based on automatic search of the network, and then the model is fine-tuned in combination with the training mode of the model, and finally the accuracy of the model is verified according to the test, so that the accuracy of the model on the extraction result of single-channel speech is ensured.

After the voice extraction model is obtained, the voice of the target object in the single-channel voice can be accurately and effectively extracted, so that the subsequent application process is effectively ensured. The specific process of extracting the speech may be set based on the requirements of the actual application, and is not described herein again.

As the above embodiments show, in this method, after the mixed voice sample data is obtained, the preset speech separation model is trained with it; a policy network and an evaluation network are then constructed from the resulting pre-trained speech separation model; a target quantization strategy is determined from the network parameters of the constructed networks and used to update the pre-trained speech separation model; and finally the target object voice signal is extracted from the voice data to be processed by the updated speech extraction model. In this way, after the model is pre-trained, determining a quantization strategy to update it ensures its accuracy, which in turn allows the structural details and parameters of the preset speech separation model to be reduced as much as possible when the model structure is designed. This shortens model training time, so that the corresponding model is built quickly and effectively, the voice signal of the target object is effectively extracted from single-channel speech, and the user experience is improved.

A speech extraction device according to an embodiment of the present description is introduced based on the speech extraction method corresponding to fig. 1. The voice extraction device is arranged on the voice extraction equipment. As shown in fig. 4, the speech extraction apparatus includes the following modules.

A mixed voice sample data obtaining module 410, configured to obtain mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal.

And the preset voice separation model training module 420 is configured to train a preset voice separation model by using the mixed voice sample data to obtain a pre-trained voice separation model.

A network construction module 430, configured to construct a policy network and an evaluation network based on the pre-trained speech separation model; the policy network and the evaluation network correspond to network parameters.

A target quantization strategy determination module 440, configured to determine a target quantization strategy based on the network parameter; the target quantization strategy is used to determine an optimization mode for the pre-trained speech separation model.

And the pre-training voice separation model updating module 450 is configured to update the pre-training voice separation model through the target quantization strategy to obtain a voice extraction model.

A target object voice signal extraction module 460, configured to extract a target object voice signal from the voice data to be processed by using the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

Based on the speech extraction method corresponding to fig. 1, an embodiment of the present specification provides a speech extraction device. As shown in fig. 5, the speech extraction device may include a memory and a processor.

In this embodiment, the memory may be implemented in any suitable manner. For example, the memory may be a read-only memory, a mechanical hard disk, a solid state disk, a U disk, or the like. The memory may be used to store computer program instructions.

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The processor may execute the computer program instructions to perform the steps of: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a target voice signal; training a preset voice separation model by using the mixed voice sample data to obtain a pre-training voice separation model; constructing a strategy network and an evaluation network based on the pre-training voice separation model; the strategy network and the evaluation network correspond to network parameters; determining a target quantization strategy based on the network parameters; the target quantization strategy is used for determining an optimization mode for the pre-training speech separation model; updating the pre-training voice separation model by using a target quantization strategy to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.

While the process flows described above include operations that occur in a particular order, it should be appreciated that the processes may include more or less operations that are performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).


The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of an embodiment of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
