Hybrid speech recognition network training method, hybrid speech recognition method, device and storage medium

Document No.: 1355685    Publication date: 2020-07-24

Reading note: The technology "Hybrid speech recognition network training method, hybrid speech recognition method, device and storage medium" was designed and created by 王珺, 陈杰, 苏丹 and 俞栋 on 2018-05-24. Its main content includes: The application provides a hybrid speech recognition network training method, comprising: acquiring a mixed speech sample through a deep neural network in a hybrid speech recognition network; processing the vector of the mixed speech sample and the corresponding supervised label through the hybrid speech recognition network to form a speech extractor of a target object in a vector space; determining, through the hybrid speech recognition network, a mask of the target object using the vector of the mixed speech sample and the speech extractor; and updating the parameters of the hybrid speech recognition network through the mask of the target object and the reference speech of the target object. The application also provides a hybrid speech recognition method, an apparatus and a storage medium. The application can determine the speech of a target object from mixed speech, which makes it convenient to track the speech of the target object in the mixed speech; at the same time, only mixed speech samples are needed during training of the hybrid speech recognition network, which effectively reduces the number of samples required in the training stage and improves the training efficiency of the hybrid speech recognition network.

1. A hybrid speech recognition network training method, the method comprising:

acquiring a mixed voice sample through a deep neural network in the mixed voice recognition network, wherein the mixed voice sample comprises voices of at least two different speakers;

the deep neural network determines a vector of a mixed voice sample corresponding to the mixed voice sample;

processing the vector of the mixed voice sample and the corresponding supervised label through the mixed voice recognition network to form a voice extraction sub of a target object in a vector space;

determining, by the hybrid speech recognition network, a mask of the target object using the vector of the hybrid speech sample and the speech extractor;

and updating the parameters of the hybrid voice recognition network through the mask of the target object and the reference voice of the target object.

2. The method of claim 1, wherein the determining, by the deep neural network of the speech recognition network, a vector of mixed speech samples corresponding to the mixed speech samples comprises:

embedding the mixed voice sample into a vector space of K dimension to obtain a vector of each frame in each vector dimension in the mixed voice sample, wherein,

the mixed speech sample is an input non-adaptive speech sample.

3. The method of claim 1, wherein the processing the vectors of the mixed speech samples and the corresponding supervised labels through the mixed speech recognition network to form a speech extractor of the target object in a vector space comprises:

denoising low-energy spectrum window noise in the mixed voice sample;

according to the voice spectrum amplitude of a target object in the mixed voice sample and the spectrum amplitude of an interference object in a corresponding voice frame, determining a supervised label of the target object in the mixed voice sample;

and determining corresponding voice extractors of the voices of different speakers in the mixed voice sample in a vector space according to the vector of the mixed voice sample and the supervised label of the target object in the mixed voice sample.

4. The method of claim 1, wherein updating the parameters of the hybrid speech recognition network with the mask of the target object and the reference speech of the target object comprises:

extracting the voice of each speaker in the mixed voice sample according to the masks corresponding to the voices of different speakers in the mixed voice sample;

determining spectral errors of the voices of the speakers extracted by using the masks of the target object and the reference voice of the target object through an objective function of the voice recognition network;

and minimizing the objective function of the voice recognition network through the spectrum error so as to update the parameters of the hybrid voice recognition network.

5. A hybrid speech recognition method applied to a hybrid speech recognition network trained by the method according to any one of claims 1-4, comprising:

monitoring the input of voice;

when the input of adaptive voice and mixed voice is monitored, acquiring the voice characteristics of a target object based on the adaptive voice;

determining the voice belonging to the target object in the mixed voice based on the voice characteristics of the target object;

the adaptive voice is a voice containing preset voice information, and the mixed voice is a non-adaptive voice input after the adaptive voice.

6. The hybrid speech recognition method of claim 5, wherein the obtaining the speech feature of the target object based on the adapted speech comprises:

respectively embedding the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice into a vector space with a dimension K to obtain a vector of each frame of the adaptive voice in each vector dimension and a vector of each frame of the mixed voice in each vector dimension, wherein the adaptive voice is voice containing preset voice information, the mixed voice is non-adaptive voice input after the adaptive voice, and the K is not less than 1;

calculating the average vector of the adaptive voice in each vector dimension based on the vector of each frame of the adaptive voice in each vector dimension;

taking the average vector of the adaptive voice in each vector dimension as a voice extraction sub of a target object in each vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension so as to estimate the mask of each frame of the mixed voice;

the voice belonging to the target object in the mixed voice is determined to be:

and determining the voice belonging to the target object in the mixed voice based on the mask of each frame of the mixed voice.

7. The hybrid speech recognition method according to claim 6, wherein the calculating an average vector of the adapted speech in each vector dimension based on the vector of each frame of the adapted speech in each vector dimension is specifically:

calculating an average vector of the adaptive voice in each vector dimension based on the vector of the adaptive voice effective frame in each vector dimension, wherein the adaptive voice effective frame is a frame of which the spectral amplitude is greater than an adaptive spectral comparison value in the adaptive voice, and the adaptive spectral comparison value is equal to a difference value between the maximum spectral amplitude of the adaptive voice and a preset spectral threshold value.

8. The hybrid speech recognition method of claim 7 wherein said computing an average vector of the adapted speech in each vector dimension based on the vectors of the adapted speech active frames in each vector dimension comprises:

for each vector dimension, multiplying the vector of each frame of the adaptive voice in the corresponding vector dimension by the supervised labeling of the corresponding frame respectively, and then summing to obtain the total vector of the effective frame of the adaptive voice in the corresponding vector dimension;

dividing the total vector of the adaptive voice effective frame in each vector dimension by the sum of supervised labels of each frame of the adaptive voice to obtain an average vector of the adaptive voice in each vector dimension;

and the supervised label of the frame with the spectral amplitude larger than the adaptive spectrum comparison value in the adaptive voice is 1, and the supervised label of the frame with the spectral amplitude not larger than the adaptive spectrum comparison value in the adaptive voice is 0.

9. The hybrid speech recognition method of any one of claims 6 to 8, wherein the calculating an average vector of the adapted speech in each vector dimension based on the vector of each frame of the adapted speech in each vector dimension further comprises:

inputting the average vector of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension into a pre-trained forward neural network to obtain a regular vector of each frame in each vector dimension;

the step of taking the average vector of the adaptive voice in each vector dimension as a voice extractor of a target object in each vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extractor of the corresponding vector dimension, so as to estimate the mask of each frame of the mixed voice, is replaced by:

and respectively measuring the distance between the regularized vector of each frame in each vector dimension and a preset voice extractor, so as to estimate the mask of each frame of the mixed voice.

10. The hybrid speech recognition method according to any one of claims 6 to 8, wherein the embedding the spectrum of the adapted speech and the spectrum of the hybrid speech into the vector space of K dimensions further comprises:

processing the vector of each frame of the mixed voice in each vector dimension based on a clustering algorithm to determine the centroid vector of the mixed voice corresponding to the voices of different speakers in each vector dimension;

replacing the step of taking the average vector of the adaptive voice in each vector dimension as the voice extractor of the target object in each vector dimension with:

and taking a target centroid vector of the mixed voice in each vector dimension as a voice extractor of a target object in the corresponding vector dimension, wherein the target centroid vector is the centroid vector with the minimum distance from the average vector of the adaptive voice in the same vector dimension.

11. The hybrid speech recognition method of any one of claims 6 to 8, wherein the calculating an average vector of the adapted speech in each vector dimension based on the vector of each frame of the adapted speech in each vector dimension further comprises:

respectively comparing the distances between preset M voice extractors and the average vectors of the adaptive voice in each vector dimension, wherein M is larger than 1;

replacing the step of taking the average vector of the adaptive voice in each vector dimension as the voice extractor of the target object in each vector dimension with:

and taking, among the M voice extractors, the voice extractor with the minimum distance to the average vector of the adaptive voice in a vector dimension as the voice extractor of the target object in the corresponding vector dimension.

12. The hybrid speech recognition method according to any one of claims 6 to 8, wherein the embedding the spectrum of the adapted speech and the spectrum of the hybrid speech into a vector space of K dimensions respectively to obtain the vector of each frame of the adapted speech in each vector dimension and the vector of each frame of the hybrid speech in each vector dimension is specifically:

and mapping the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice to a vector space of K dimensionality through a deep neural network to obtain the vector of each frame of the adaptive voice in each vector dimensionality and the vector of each frame of the mixed voice in each vector dimensionality.

13. A hybrid speech recognition device, comprising:

the monitoring unit is used for monitoring the input of voice;

an acquisition unit configured to acquire a speech feature of a target object based on an adapted speech when the monitoring unit monitors input of the adapted speech and the mixed speech;

a determining unit, configured to determine, based on a voice feature of the target object, a voice belonging to the target object in the mixed voice;

the adaptive voice is a voice containing preset voice information, and the mixed voice is a non-adaptive voice input after the adaptive voice.

14. A hybrid speech recognition device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 5 to 12 when executing the computer program.

15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a hybrid speech recognition network training method according to any one of claims 1 to 4, or carries out the steps of a method according to any one of claims 5 to 12.

Technical Field

The application belongs to the technical field of voice recognition, and particularly relates to a hybrid voice recognition network training method, a hybrid voice recognition device and a storage medium.

Background

As an acoustic representation of language, speech is one of the most natural, effective and convenient means for humans to exchange information, and speech recognition technology has developed greatly in recent years. However, while inputting speech, people inevitably pick up the voices of other speakers in the same environment. These interferences mean that the captured speech is ultimately not pure speech, but noise-contaminated speech (i.e., mixed speech). In recent years, many deep-learning-based methods and systems have been developed to handle the separation and recognition of mixed speech signals, such as deep attractor networks. To this end, Artificial Intelligence (AI) technology provides a solution: training an appropriate speech recognition network to support the above-described applications. Artificial intelligence is the theory, method and technology of using a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best result; applied artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making. In the field of speech processing, the recognition of speech is realized by using a digital computer or a machine controlled by a digital computer.

A Deep Attractor Network generates a discriminative embedding vector for each time-frequency window of the mixed speech and generates an attractor for each speaker in the mixed speech; it then estimates the mask belonging to the corresponding speaker by calculating the distance between the embedding vectors and the attractors, and uses the masks to compute the representation of each speaker's speech in the mixed speech in the time-frequency domain. The system framework of the hybrid speech recognition scheme based on the deep attractor network can be shown in fig. 1, and the processing flow of the scheme is explained in conjunction with fig. 1 as follows:

Firstly, the mixed speech spectrum (i.e., "mix" in fig. 1) is input into a long short-term memory network (i.e., the LSTM layers in fig. 1), and the embedding vector corresponding to each time-frequency window (i.e., "Embedding" in fig. 1) is calculated. Then, the supervised labeling information of each speaker in the mixed speech (i.e., "Ideal Mask" in fig. 1) is used to weight and normalize all embedding vectors, obtaining the attractor corresponding to each speaker (i.e., "Attractors" in fig. 1). Next, the mask of each speaker's speech is estimated by measuring the distance between each embedding vector of the mixed speech and the attractors, and the masks are used to compute the representation of each speaker's speech in the mixed speech in the time-frequency domain (i.e., "clean reference" in fig. 1).
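For orientation, the following NumPy sketch illustrates the prior-art attractor computation just described (ideal-mask-weighted averaging of embeddings, then distance-based mask estimation); the array shapes and the sigmoid squashing are illustrative assumptions, not the exact formulation of fig. 1.

```python
import numpy as np

def danet_masks(embeddings, ideal_masks):
    """Prior-art deep attractor network step (sketch).

    embeddings:  (F*T, K) embedding vector for every time-frequency window.
    ideal_masks: (F*T, C) supervised (ideal) mask for each of C speakers.
    Returns estimated masks of shape (F*T, C).
    """
    # Attractor of each speaker: ideal-mask-weighted average of the embeddings.
    attractors = (ideal_masks.T @ embeddings) / (
        ideal_masks.sum(axis=0, keepdims=True).T + 1e-8)   # (C, K)
    # Similarity (inner product) between every embedding and every attractor.
    scores = embeddings @ attractors.T                       # (F*T, C)
    return 1.0 / (1.0 + np.exp(-scores))                     # sigmoid masks
```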

Although the deep attractor network does not rigidly limit the number of speakers in the mixed speech, it still needs to know or estimate the number of speakers during recognition. Moreover, the hybrid speech recognition scheme based on the deep attractor network can only separate the speech of the different speakers in the mixed speech; it cannot track the speech of a specific speaker (e.g., a target object), i.e., it cannot specifically obtain the representation of the target object's speech in the mixed speech in the time-frequency domain.

Disclosure of Invention

In view of this, the present application provides a hybrid speech recognition network training method, a hybrid speech recognition method, an apparatus and a storage medium, which can determine the speech of a target object from the hybrid speech and facilitate tracking the speech of the target object in the hybrid speech.

The embodiment of the invention provides a hybrid speech recognition network training method, which comprises the following steps:

acquiring a mixed voice sample through a deep neural network in the mixed voice recognition network, wherein the mixed voice sample comprises voices of at least two different speakers;

the deep neural network determines a vector of a mixed voice sample corresponding to the mixed voice sample;

processing the vector of the mixed voice sample and the corresponding supervised label through the mixed voice recognition network to form a voice extractor of a target object in a vector space;

determining, by the hybrid speech recognition network, a mask for the target object using the vector of hybrid speech samples and the speech extractor;

and updating the parameters of the hybrid voice recognition network through the mask of the target object and the reference voice of the target object.

In the foregoing solution, the determining, by the deep neural network of the speech recognition network, the vector of the mixed speech sample corresponding to the mixed speech sample includes:

embedding the mixed voice sample into a vector space of K dimension to obtain a vector of each frame in each vector dimension in the mixed voice sample, wherein,

the mixed speech sample is a non-adapted speech sample input after the adapted speech sample.

In the foregoing solution, the processing the vector of the mixed speech sample and the corresponding supervised label through the mixed speech recognition network to form a speech extractor of the target object in a vector space includes:

denoising low-energy spectrum window noise in the mixed voice sample;

according to the voice spectrum amplitude of a target object in the mixed voice sample and the spectrum amplitude of an interference object in a corresponding voice frame, determining a supervised label of the target object in the mixed voice sample;

and determining corresponding voice extractors of the voices of different speakers in the mixed voice sample in a vector space according to the vector of the mixed voice sample and the supervised label of the target object in the mixed voice sample.

In the foregoing solution, the updating the parameter of the hybrid speech recognition network through the mask of the target object and the reference speech of the target object includes:

extracting the voice of each speaker in the mixed voice sample according to the masks corresponding to the voices of different speakers in the mixed voice sample;

determining spectral errors of the voices of the speakers extracted by using the masks of the target object and the reference voice of the target object through an objective function of the voice recognition network;

and minimizing the objective function of the voice recognition network through the spectrum error so as to update the parameters of the hybrid voice recognition network.

A first aspect of an embodiment of the present application provides a hybrid speech recognition method, including:

monitoring the input of voice;

when the input of adaptive voice and mixed voice is monitored, acquiring the voice characteristics of a target object based on the adaptive voice;

determining the voice belonging to the target object in the mixed voice based on the voice characteristics of the target object;

the adaptive voice is a voice containing preset voice information, and the mixed voice is a non-adaptive voice input after the adaptive voice.

Based on the first aspect of the present application, in a first possible implementation manner, the obtaining a voice feature of a target object based on adaptive voice includes:

respectively embedding the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice into a vector space of a K dimension to obtain a vector of each frame of the adaptive voice in each vector dimension and a vector of each frame of the mixed voice in each vector dimension, wherein the adaptive voice is voice containing preset voice information, the mixed voice is non-adaptive voice input after the adaptive voice, and the K is not less than 1;

calculating the average vector of the adaptive voice in each vector dimension based on the vector of each frame of the adaptive voice in each vector dimension;

taking the average vector of the adaptive voice in each vector dimension as a voice extraction sub of a target object in each vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension so as to estimate the mask of each frame of the mixed voice;

the voice belonging to the target object in the mixed voice is determined to be:

and determining the voice belonging to the target object in the mixed voice based on the mask of each frame of the mixed voice.

Based on the first possible implementation manner of the first aspect of the present application, in a second possible implementation manner, the calculating, based on the vector of each frame of the adaptive speech in each vector dimension, an average vector of the adaptive speech in each vector dimension specifically includes:

calculating an average vector of the adaptive voice in each vector dimension based on the vector of the adaptive voice effective frame in each vector dimension, wherein the adaptive voice effective frame is a frame of which the spectral amplitude is greater than an adaptive spectral comparison value in the adaptive voice, and the adaptive spectral comparison value is equal to a difference value between the maximum spectral amplitude of the adaptive voice and a preset spectral threshold value.

Based on the second possible implementation manner of the first aspect of the present application, in a third possible implementation manner, the calculating an average vector of the adapted speech in each vector dimension based on the vector of the adapted speech valid frame in each vector dimension includes:

for each vector dimension, multiplying the vector of each frame of the adaptive voice in the corresponding vector dimension by the supervised labeling of the corresponding frame respectively, and then summing to obtain the total vector of the effective frame of the adaptive voice in the corresponding vector dimension;

dividing the total vector of the adaptive voice effective frame in each vector dimension by the sum of supervised labels of each frame of the adaptive voice to obtain an average vector of the adaptive voice in each vector dimension;

and the supervised label of the frame with the spectral amplitude larger than the adaptive spectrum comparison value in the adaptive voice is 1, and the supervised label of the frame with the spectral amplitude not larger than the adaptive spectrum comparison value in the adaptive voice is 0.

Based on the first possible implementation manner of the first aspect of the present application, the second possible implementation manner of the first aspect of the present application, or the third possible implementation manner of the first aspect of the present application, in a fourth possible implementation manner, the calculating, based on the vector of each frame of the adapted speech in each vector dimension, an average vector of the adapted speech in each vector dimension further includes:

inputting the average vector of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension into a pre-trained forward neural network to obtain a regular vector of each frame in each vector dimension;

the step of taking the average vector of the adaptive voice in each vector dimension as a voice extractor of a target object in each vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extractor of the corresponding vector dimension, so as to estimate the mask of each frame of the mixed voice, is replaced by:

and respectively measuring the distance between the regular vector of each frame in each vector dimension and a preset voice extraction unit to estimate and obtain the mask of each frame of the mixed voice.

Based on the first possible implementation manner of the first aspect of the present application, the second possible implementation manner of the first aspect of the present application, or the third possible implementation manner of the first aspect of the present application, in a fifth possible implementation manner, after the respectively embedding the spectrum of the adapted speech and the spectrum of the mixed speech into a vector space of a K dimension, the method further includes:

processing the vector of each frame of the mixed voice in each vector dimension based on a clustering algorithm to determine the centroid vector of the mixed voice corresponding to the voices of different speakers in each vector dimension;

replacing the step of taking the average vector of the adaptive voice in each vector dimension as the voice extractor of the target object in each vector dimension with:

and taking a target centroid vector of the mixed voice in each vector dimension as a voice extractor of a target object in the corresponding vector dimension, wherein the target centroid vector is the centroid vector with the minimum distance from the average vector of the adaptive voice in the same vector dimension.

Based on the first possible implementation manner of the first aspect of the present application, the second possible implementation manner of the first aspect of the present application, or the third possible implementation manner of the first aspect of the present application, in a sixth possible implementation manner, the calculating, based on the vector of each frame of the adapted speech in each vector dimension, an average vector of the adapted speech in each vector dimension further includes:

respectively comparing the distances between preset M voice extractors and the average vectors of the adaptive voice in each vector dimension, wherein M is larger than 1;

replacing the step of taking the average vector of the adaptive voice in each vector dimension as the voice extractor of the target object in each vector dimension with:

and taking, among the M voice extractors, the voice extractor with the minimum distance to the average vector of the adaptive voice in a vector dimension as the voice extractor of the target object in the corresponding vector dimension.

Based on the first possible implementation manner of the first aspect of the present application, the second possible implementation manner of the first aspect of the present application, or the third possible implementation manner of the first aspect of the present application, in a seventh possible implementation manner, the embedding the spectrum of the adapted speech and the spectrum of the mixed speech into a vector space of a K dimension, respectively, to obtain a vector of each frame of the adapted speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension specifically includes:

and mapping the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice to a K-dimensional vector space through a deep neural network, to obtain the vector of each frame of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension.

Based on the seventh possible implementation manner of the first aspect of the present application, in an eighth possible implementation manner, the deep neural network is formed by 4 layers of bidirectional long-short term memory networks, and each layer of bidirectional long-short term memory networks has 600 nodes.

Based on the seventh possible implementation manner of the first aspect of the present application, in a ninth possible implementation manner, K is 40.

A second aspect of the present application provides a hybrid speech recognition apparatus comprising:

the monitoring unit is used for monitoring the input of voice;

an acquisition unit configured to acquire a speech feature of a target object based on an adapted speech when the monitoring unit monitors input of the adapted speech and the mixed speech;

a determining unit, configured to determine, based on a voice feature of the target object, a voice belonging to the target object in the mixed voice;

the adaptive voice is a voice containing preset voice information, and the mixed voice is a non-adaptive voice input after the adaptive voice.

Based on the second aspect of the present application, in a first possible implementation manner, the obtaining unit includes:

a space mapping unit, configured to, when the monitoring unit monitors that adaptive speech and mixed speech are input, embed a frequency spectrum of the adaptive speech and a frequency spectrum of the mixed speech into a vector space of a K dimension, respectively, to obtain a vector of each frame of the adaptive speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, where the adaptive speech is speech including preset speech information, the mixed speech is non-adaptive speech input after the adaptive speech, and K is not less than 1;

the calculation unit is used for calculating the average vector of the adaptive voice in each vector dimension based on the vector of each frame of the adaptive voice in each vector dimension;

a mask estimation unit, configured to use the average vector of the adaptive speech in each vector dimension as a speech extractor of a target object in each vector dimension, and measure distances between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension, respectively, to estimate a mask of each frame of the mixed speech;

the determining unit is specifically configured to determine, based on a mask of each frame of the mixed speech, a speech belonging to the target object in the mixed speech.

Based on the first possible implementation manner of the second aspect of the present application, in a second possible implementation manner, the computing unit is specifically configured to: and obtaining an average vector of the adaptive voice in each vector dimension based on the vector of the adaptive voice effective frame in each vector dimension, wherein the adaptive voice effective frame is a frame of which the spectral amplitude is greater than an adaptive spectral comparison value in the adaptive voice, and the adaptive spectral comparison value is equal to the difference between the maximum spectral amplitude of the adaptive voice and a preset spectral threshold.

Based on the second possible implementation manner of the second aspect of the present application, in a third possible implementation manner, the calculating unit is specifically configured to: for each vector dimension, multiplying the vector of each frame of the adaptive voice in the corresponding vector dimension by the supervised labeling of the corresponding frame respectively, and then summing to obtain the total vector of the effective frame of the adaptive voice in the corresponding vector dimension; respectively dividing the total vector of the adaptive voice effective frame in each vector dimension by the sum of supervised labels of each frame of the adaptive voice to obtain an average vector of the adaptive voice in each vector dimension;

and the supervised label of the frame with the spectral amplitude larger than the adaptive spectrum comparison value in the adaptive voice is 1, and the supervised label of the frame with the spectral amplitude not larger than the adaptive spectrum comparison value in the adaptive voice is 0.

Based on the first possible implementation manner of the second aspect of the present application, the second possible implementation manner of the second aspect of the present application, or the third possible implementation manner of the second aspect of the present application, in a fourth possible implementation manner, the hybrid speech recognition apparatus further includes:

the regularization unit is used for inputting the average vector of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension into a pre-trained forward neural network to obtain the regularized vector of each frame in each vector dimension;

the mask estimation unit is specifically configured to: respectively measure the distance between the regularized vector of each frame in each vector dimension and a preset voice extractor, to estimate the mask of each frame of the mixed voice.

Based on the first possible implementation manner of the second aspect of the present application, the second possible implementation manner of the second aspect of the present application, or the third possible implementation manner of the second aspect of the present application, in a fifth possible implementation manner, the hybrid speech recognition apparatus further includes:

the clustering unit is used for processing the vector of each frame of the mixed voice in each vector dimension based on a clustering algorithm so as to determine the centroid vector of the mixed voice corresponding to the voices of different speakers in each vector dimension;

the mask estimation unit is specifically configured to: and taking the target centroid vector of the mixed voice in each vector dimension as a voice extraction sub of a target object in a corresponding vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension so as to estimate the mask of each frame of the mixed voice.

Based on the first possible implementation manner of the second aspect of the present application, the second possible implementation manner of the second aspect of the present application, or the third possible implementation manner of the second aspect of the present application, in a sixth possible implementation manner, the hybrid speech recognition apparatus further includes:

the comparison unit is used for respectively comparing the distances between preset M voice extractors and the average vectors of the adaptive voice in each vector dimension, wherein M is larger than 1;

the mask estimation unit is specifically configured to: take, among the M voice extractors, the voice extractor with the minimum distance to the average vector of the adaptive voice in a vector dimension as the voice extractor of the target object in the corresponding vector dimension, and respectively measure the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extractor of the corresponding vector dimension, to estimate the mask of each frame of the mixed voice.

Based on the first possible implementation manner of the second aspect of the present application, the second possible implementation manner of the second aspect of the present application, or the third possible implementation manner of the second aspect of the present application, in a seventh possible implementation manner, the spatial mapping unit is specifically configured to: when the monitoring unit monitors the input of the adaptive voice and the mixed voice, the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice are mapped to a vector space with K dimensionality through a deep neural network, and the vector of each frame of the adaptive voice in each vector dimensionality and the vector of each frame of the mixed voice in each vector dimensionality are obtained.

A third aspect of the application provides a hybrid speech recognition device comprising a memory, a processor and a computer program stored on the memory and executable on the processor. The processor, when executing the computer program, implements the hybrid speech recognition method mentioned in the first aspect or any of the possible implementations of the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon. The computer program as mentioned above, when executed by a processor, implements the hybrid speech recognition method as mentioned in the first aspect above or in any of the possible implementations of the first aspect above.

As can be seen from the above, in the scheme of the application, when the input of the adaptive voice and the mixed voice is monitored, the voice feature of the target object is obtained based on the adaptive voice, and the voice belonging to the target object in the mixed voice is determined based on the voice feature of the target object. By learning the voice features of the target object from the adaptive voice, the scheme of the application can determine the voice of the target object from the mixed voice, so that the voice of the target object in the mixed voice can be conveniently tracked. For example, in the application scenario of a smart speaker, the wake-up voice can be used as the adaptive voice for learning the features of the wake-up speaker (i.e., the target object), and the voice belonging to the wake-up speaker can be identified and tracked from the mixed voice input after the wake-up voice. In addition, because the voice features of the target object in the application do not depend on the number of speakers in the mixed voice, the scheme of the application does not need to know or estimate the number of speakers in the mixed voice in advance during the mixed voice recognition process.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

FIG. 1 is a schematic flow diagram of a hybrid speech recognition scheme based on deep attractors;

FIG. 2 is a flow diagram illustrating an embodiment of a hybrid speech recognition method according to the present application;

FIG. 3 is a schematic flow chart diagram illustrating a hybrid speech recognition method according to another embodiment of the present application;

FIG. 4-a is a schematic diagram of a recognition network structure provided herein;

FIG. 4-b is a schematic diagram of another recognition network structure provided herein;

FIG. 5 is a schematic flow chart diagram illustrating a hybrid speech recognition method according to another embodiment of the present application;

FIG. 6 is a schematic diagram of yet another recognition network structure provided in the present application;

FIG. 7 is a schematic structural diagram of an embodiment of a hybrid speech recognition device provided in the present application;

FIG. 8 is a schematic structural diagram of another embodiment of a hybrid speech recognition device according to the present invention;

fig. 9 is a schematic structural diagram of a hybrid speech recognition device according to still another embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It should be understood that the sequence numbers of the steps in the method embodiments described below do not imply an order of execution; the execution order of the processes is determined by their functions and internal logic, and does not limit the implementation process of the embodiments in any way.

In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.

Referring to fig. 2, a hybrid speech recognition method according to an embodiment of the present application includes:

step 101, monitoring input of voice;

in the embodiment of the application, the input of voice can be monitored through the microphone array, so that noise interference of voice input is reduced.

Step 102, when the input of adaptive voice and mixed voice is monitored, acquiring the voice characteristics of a target object based on the adaptive voice;

in an embodiment of the present application, the adaptive speech is a speech including preset speech information. When a voice input including preset voice information is monitored, it may be considered that an input adapted to the voice is monitored. For example, in an application scenario of a smart speaker, it is generally necessary to input a wake-up voice to wake up a voice control function of the smart speaker, where the wake-up voice is a voice including a wake-up word (e.g., "ding-dong"), and therefore, in the application scenario, the wake-up voice can be regarded as an adaptive voice, and when an input of the wake-up voice is monitored, the input of the adaptive voice can be considered to be monitored.

Optionally, in step 102, the speech feature of the target object may be extracted from the adapted speech based on a speech feature recognition algorithm (e.g., Mel-frequency cepstral coefficients (MFCC) algorithm).
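As a hedged illustration of this option, the sketch below extracts an utterance-level MFCC feature from the adaptive speech using librosa; the sampling rate and number of coefficients are example values, not values mandated by the application.

```python
import librosa
import numpy as np

def target_speech_features(adapt_wav_path, sr=16000, n_mfcc=13):
    """Extract an utterance-level MFCC feature for the target object (sketch)."""
    y, sr = librosa.load(adapt_wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    # Average over time to obtain one feature vector for the target speaker.
    return np.mean(mfcc, axis=1)
```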

Of course, in step 102, the speech feature of the target object may also be extracted from the adaptive speech in other ways, which may be specifically referred to in the description in the following embodiments.

Step 103, determining the voice belonging to the target object in the mixed voice based on the voice characteristics of the target object;

wherein the mixed voice is a non-adaptive voice input after the adaptive voice.

In step 103, based on the voice feature of the target object, a voice feature similar to the voice feature of the target object may be recognized from the mixed voice through a similarity likelihood algorithm, and the voice belonging to the target object in the mixed voice is determined.
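A minimal sketch of such a similarity-based assignment, assuming cosine similarity between the target feature and per-frame features of the mixed speech (the threshold value is an arbitrary example):

```python
import numpy as np

def frames_of_target(mixed_frame_feats, target_feat, threshold=0.7):
    """Mark frames of the mixed speech whose features resemble the target's.

    mixed_frame_feats: (T, D) per-frame feature vectors of the mixed speech.
    target_feat:       (D,)  feature vector of the target object.
    Returns a boolean array of length T (True = frame assigned to the target).
    """
    a = mixed_frame_feats / (np.linalg.norm(mixed_frame_feats, axis=1, keepdims=True) + 1e-8)
    b = target_feat / (np.linalg.norm(target_feat) + 1e-8)
    cos_sim = a @ b
    return cos_sim > threshold
```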

As can be seen from the above, in the scheme of the application, when the input of the adaptive voice and the mixed voice is monitored, the voice feature of the target object is obtained based on the adaptive voice, and the voice belonging to the target object in the mixed voice is determined based on the voice feature of the target object. By learning the voice features of the target object from the adaptive voice, the scheme of the application can determine the voice of the target object from the mixed voice, so that the voice of the target object in the mixed voice can be conveniently tracked. For example, in the application scenario of a smart speaker, the wake-up voice can be used as the adaptive voice for learning the features of the wake-up speaker (i.e., the target object), and the voice belonging to the wake-up speaker can be identified and tracked from the mixed voice input after the wake-up voice. In addition, because the voice features of the target object in the application do not depend on the number of speakers in the mixed voice, the scheme of the application does not need to know or estimate the number of speakers in the mixed voice in advance during the mixed voice recognition process.

Referring to fig. 3, the hybrid speech recognition method in the present application is described in another embodiment, where the hybrid speech recognition method in the embodiment of the present application includes:

step 201, monitoring the input of voice;

in the embodiment of the application, the input of voice can be monitored through the microphone array, so that noise interference of voice input is reduced.

Step 202, when monitoring the input of adaptive voice and mixed voice, respectively embedding the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice into a vector space with a dimension of K to obtain a vector of each frame of the adaptive voice in each vector dimension and a vector of each frame of the mixed voice in each vector dimension;

the adaptive voice is a voice containing preset voice information, and K is not less than 1, and optionally, K may be 40.

In the embodiment of the present application, when a voice input including preset voice information is monitored, it may be considered that an input adapted to the voice is monitored. For example, in an application scenario of a smart speaker, it is generally necessary to input a wake-up voice to wake up a voice control function of the smart speaker, where the wake-up voice is a voice including a wake-up word (e.g., "ding-dong"), and therefore, in the application scenario, the wake-up voice can be regarded as an adaptive voice, and when an input of the wake-up voice is monitored, the input of the adaptive voice can be considered to be monitored.

The mixed speech is non-adaptive speech input after the adaptive speech, and in a real intelligent speech interaction scene, especially under a far-distance speaking condition, the speech aliasing of different speakers often occurs, so that the input speech is the mixed speech.

In step 202, the frequency spectrum of the adaptive speech and the frequency spectrum of the mixed speech can be mapped to a K-dimensional vector space through a deep neural network, to obtain the vector of each frame of the adaptive speech and the vector of each frame of the mixed speech in each vector dimension. The deep neural network is optionally composed of 4 layers of bidirectional Long Short-Term Memory (LSTM) networks, and each LSTM layer can have 600 nodes. Of course, the deep neural network can also be replaced by various other effective model structures, such as a model combining a Convolutional Neural Network (CNN) with other network structures, or other network structures such as a time-delay network, a gated neural network, and the like.
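A PyTorch sketch of a deep neural network with this shape (4 bidirectional LSTM layers of 600 nodes, mapping an F-dimensional spectrum to K embedding dimensions per time-frequency window) is given below; the layer sizes follow the text, while the class name, the projection layer and the default arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Map a spectrogram (B, T, F) to K-dimensional embeddings per time-frequency window."""
    def __init__(self, n_freq_bins=129, k_dims=40):
        super().__init__()
        self.blstm = nn.LSTM(input_size=n_freq_bins, hidden_size=600,
                             num_layers=4, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 600, n_freq_bins * k_dims)
        self.n_freq_bins, self.k_dims = n_freq_bins, k_dims

    def forward(self, spec):                      # spec: (B, T, F) log-magnitude
        h, _ = self.blstm(spec)                   # (B, T, 1200)
        v = self.proj(h)                          # (B, T, F*K)
        return v.view(spec.size(0), spec.size(1), self.n_freq_bins, self.k_dims)
```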

Specifically, the frequency spectrum of the embodiment of the present application may be obtained by performing short-time fourier transform on the voice and then taking the logarithm of the result of the short-time fourier transform.
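A minimal sketch of this spectrum computation (short-time Fourier transform followed by a logarithm); the window and hop sizes are assumptions for illustration.

```python
import numpy as np
import librosa

def log_spectrum(wav, n_fft=256, hop_length=64):
    """Log-magnitude spectrum of a waveform, with frames on the time axis: (T, F)."""
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length)   # (F, T) complex
    return np.log(np.abs(stft) + 1e-8).T
```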

Step 202 is described below by way of example. Let the superscript "ws" denote the adaptive speech, the superscript "cs" denote the mixed speech, and let $X_{f,t}$ denote the spectrum of the t-th speech frame (f is the index in the spectral dimension, t is the frame index in the time dimension). The spectrum of the adaptive speech can then be represented as $X_{f,t}^{ws}$, and the spectrum of the mixed speech as $X_{f,t}^{cs}$. In step 202, the input spectrum of the adaptive speech $X_{f,t}^{ws}$ and the input spectrum of the mixed speech $X_{f,t}^{cs}$ can each be mapped into K-dimensional vectors through the deep neural network, to obtain the vector of each frame of the adaptive speech in each vector dimension, $V_{k,f,t}^{ws}$ (the vector of the t-th frame of the adaptive speech in the k-th vector dimension, $k \in [1, K]$), and the vector of each frame of the mixed speech in each vector dimension, $V_{k,f,t}^{cs}$ (the vector of the t-th frame of the mixed speech in the k-th vector dimension, $k \in [1, K]$).

Step 203, calculating the average vector of the adaptive voice in each vector dimension based on the vector of each frame of the adaptive voice in each vector dimension;

In the embodiment of the application, the average vector of the adaptive speech in each vector dimension can be calculated by the formula $\bar a_k^{ws} = \frac{1}{F \times T_1}\sum_{f,t} V_{k,f,t}^{ws}$, where $F$ denotes the number of spectral dimensions and $T_1$ denotes the number of frames of the adaptive speech.

Alternatively, in order to remove low-energy spectral window noise and obtain the valid frames of the adaptive speech, in step 203 the spectrum of the adaptive speech may also be compared with a spectral threshold: if the spectral amplitude of a frame (i.e., a time-frequency window) of the adaptive speech is greater than the adaptive spectral comparison value, the frame is regarded as a valid frame of the adaptive speech, and the average vector of the adaptive speech in each vector dimension is calculated based on the vectors of the valid frames of the adaptive speech in each vector dimension. The adaptive spectral comparison value is equal to the difference between the maximum spectral amplitude of the adaptive speech and a preset spectral threshold Γ. Specifically, a supervised label $w_{f,t}^{ws}$ can be set for the adaptive speech by comparing the spectrum of each frame of the adaptive speech with the adaptive spectral comparison value: if the spectral amplitude of a frame (i.e., a time-frequency window) of the adaptive speech is greater than the adaptive spectral comparison value (i.e., the maximum spectral amplitude of the adaptive speech minus the preset spectral threshold Γ), the supervised label $w_{f,t}^{ws}$ of the adaptive speech for that time-frequency window takes 1; otherwise it takes 0. This can be expressed as the following first formula:

The first formula:  $w_{f,t}^{ws} = \begin{cases} 1, & \text{if } X_{f,t}^{ws} > \max(X^{ws}) - \Gamma \\ 0, & \text{otherwise} \end{cases}$

The obtaining of an average vector of the adaptive speech in each vector dimension based on the vectors of the valid frames of the adaptive speech in each vector dimension includes: for each vector dimension, multiplying the vector of each frame of the adaptive speech in the corresponding vector dimension by the supervised label of the corresponding frame and then summing, to obtain the total vector of the valid frames of the adaptive speech in the corresponding vector dimension; and dividing the total vector of the valid frames of the adaptive speech in each vector dimension by the sum of the supervised labels of the frames of the adaptive speech, to obtain the average vector of the adaptive speech in each vector dimension. Specifically, this can be implemented by the following second formula:

The second formula:  $\bar a_k^{ws} = \dfrac{\sum_{f,t} V_{k,f,t}^{ws} \times w_{f,t}^{ws}}{\sum_{f,t} w_{f,t}^{ws}}$,  where $\bar a_k^{ws}$ represents the average vector of the adaptive speech in vector dimension k, $k \in [1, K]$.

Step 204, taking the average vector of the adaptive voice in each vector dimension as a voice extractor of a target object in each vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extractor of the corresponding vector dimension to estimate the mask of each frame of the mixed voice;

In step 204, the mask of each frame of the mixed speech is estimated by measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor, so as to restore the speech of the target object. The estimation method is shown in the third formula:

The third formula:  $\hat M_{f,t} = \mathrm{Sigmoid}\left(\sum_{k=1}^{K} \bar a_k^{ws} \times V_{k,f,t}^{cs}\right)$

In the third formula above, $\hat M_{f,t}$ represents the mask of the t-th frame of the mixed speech in the f-th spectral dimension; for $\bar a_k^{ws}$ and $V_{k,f,t}^{cs}$, reference is made to the preceding description.

If the inner-product distance between the vector of a certain frame (i.e., time-frequency window) of the mixed speech and the speech extractor is smaller, the probability that the frame belongs to the target object is higher, and correspondingly the mask estimated by the third formula for that time-frequency window is larger.
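A sketch of the third formula: the mask of each time-frequency window of the mixed speech is obtained from the inner product between its embedding and the speech extractor. The sigmoid used here follows the attractor-network style and should be read as an assumption about the exact squashing function.

```python
import numpy as np

def estimate_mask(mix_emb, extractor):
    """Estimate the target object's mask for every time-frequency window.

    mix_emb:   (T, F, K) embedding of the mixed speech.
    extractor: (K,)      speech extractor of the target object.
    Returns a (T, F) mask with values in (0, 1).
    """
    scores = np.tensordot(mix_emb, extractor, axes=([2], [0]))   # (T, F) inner products
    return 1.0 / (1.0 + np.exp(-scores))                         # sigmoid (assumed)
```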

Step 205, determining the voice belonging to the target object in the mixed voice based on the mask of each frame of the mixed voice;

in this embodiment, after obtaining the mask of each frame of the mixed speech, the speech belonging to the target object in the mixed speech may be determined based on the mask of each frame of the mixed speech. Specifically, the mask is used to weight the mixed speech, so that the speech belonging to the target object in the mixed speech can be extracted frame by frame, and the larger the mask is, the more speech in the corresponding time-frequency window will be extracted.

It should be noted that, in the embodiment shown in fig. 3, the average vector of the adaptive speech in each vector dimension is used as the speech extractor of the target object in each vector dimension, but in other embodiments, the speech extractor of the target object in each vector dimension may be selected in other manners.

For example, one alternative may be: after the step 202, the vectors of the frames of the mixed speech in each vector dimension are processed based on a clustering algorithm (e.g., K-means algorithm) to determine the centroid vector of the mixed speech corresponding to the speech of different speakers in each vector dimension. Step 204 above is replaced by: and taking the target centroid vector of the mixed voice in each vector dimension as a voice extraction sub of a target object in a corresponding vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension to estimate the mask of each frame of the mixed voice, wherein the target centroid vector is the centroid vector with the minimum distance from the average vector of the adaptive voice in the same vector dimension.
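A sketch of this clustering alternative, assuming two speakers and scikit-learn's KMeans; the centroid closest to the adaptive-speech average vector is taken as the speech extractor.

```python
import numpy as np
from sklearn.cluster import KMeans

def extractor_from_clustering(mix_emb, adapt_avg, n_speakers=2):
    """Pick the mixed-speech centroid closest to the adaptive-speech average vector.

    mix_emb:   (T, F, K) embedding of the mixed speech.
    adapt_avg: (K,)      average vector of the adaptive speech.
    """
    flat = mix_emb.reshape(-1, mix_emb.shape[-1])                  # (T*F, K)
    centroids = KMeans(n_clusters=n_speakers, n_init=10).fit(flat).cluster_centers_
    # Target centroid: the one with minimum distance to the adaptive-speech average.
    return centroids[np.argmin(np.linalg.norm(centroids - adapt_avg, axis=1))]
```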

For another example, another alternative may be: after the step 203, the distances between the preset M speech extractors and the average vector of the adaptive speech in each vector dimension are respectively compared, where M is greater than 1. Step 204 above is replaced by: and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub in the corresponding vector dimension to estimate the mask of each frame of the mixed voice by taking the voice extraction sub with the minimum average vector distance with the proper voice in the vector dimension as the voice extraction sub of a target object in the corresponding vector dimension.
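And a sketch of this second alternative, selecting among M preset speech extractors the one closest to the adaptive-speech average vector (how the preset extractors are obtained, e.g. from training, is assumed here):

```python
import numpy as np

def nearest_preset_extractor(preset_extractors, adapt_avg):
    """preset_extractors: (M, K) candidate speech extractors; adapt_avg: (K,)."""
    dists = np.linalg.norm(preset_extractors - adapt_avg, axis=1)
    return preset_extractors[np.argmin(dists)]
```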

In order to implement the hybrid speech recognition process shown in fig. 3, in the embodiment of the present application, a recognition network for implementing the hybrid speech recognition process may be pre-constructed, and the recognition network may be trained.


In one application scenario, the structure of the recognition network may be as shown in fig. 4-a. The training process of the above recognition network is described below with reference to fig. 4-a:

1. The adapted and mixed speech samples used to train the recognition network are input into a deep neural network composed of 4 bidirectional LSTM layers, each LSTM layer having 600 nodes.

In this application scenario, the superscript "ws" denotes the adapted speech sample, the superscript "cs" denotes the mixed speech sample, and $X_{f,t}$ denotes the spectrum of the t-th frame of speech (f is the index in the spectral dimension, t is the frame index in the time dimension), so the spectrum of the adapted speech sample can be written as $X^{ws}_{f,t}$ and the spectrum of the mixed speech sample as $X^{cs}_{f,t}$. The input spectrum $X^{ws}_{f,t}$ of the adapted speech sample and the input spectrum $X^{cs}_{f,t}$ of the mixed speech sample are each mapped to K-dimensional vectors by the deep neural network, yielding the vector $V^{ws}_{k,f,t}$ of each frame of the adapted speech sample in each vector dimension ($V^{ws}_{k,f,t}$ denotes the vector of the t-th frame of the adapted speech sample in the k-th vector dimension, k ∈ [1, K]) and the vector $V^{cs}_{k,f,t}$ of each frame of the mixed speech sample in each vector dimension ($V^{cs}_{k,f,t}$ denotes the vector of the t-th frame of the mixed speech sample in the k-th vector dimension, k ∈ [1, K]).

2. A supervised label $Y^{ws}_{f,t}$ of the adapted speech sample is set in order to remove low-energy spectral-window noise and obtain the valid frames of the adapted speech. The spectrum of each frame of the adapted speech sample is compared with a spectral threshold: if the spectral magnitude of a frame (i.e., a time-frequency window) of the adapted speech sample is greater than the adapted-spectrum comparison value (i.e., the difference between the maximum spectral magnitude of the adapted speech sample and the adapted-spectrum threshold), the supervised label $Y^{ws}_{f,t}$ of the adapted speech sample for that time-frequency window takes 1; otherwise it takes 0. The specific formula can be expressed as the fourth formula.

The fourth formula:

$$Y^{ws}_{f,t} = \begin{cases} 1, & X^{ws}_{f,t} > \max\big(X^{ws}\big) - \Gamma \\ 0, & \text{otherwise} \end{cases}$$

where $\Gamma$ denotes the preset spectral threshold.
In this application scenario, the vector $V^{ws}_{k,f,t}$ of the adapted speech sample and the supervised label $Y^{ws}_{f,t}$ are used to estimate the speech extractor $A^{ws}_{k}$ of the target object in the vector space. For each vector dimension, the vector of each frame of the adapted speech sample in the corresponding vector dimension is multiplied by the supervised label of the corresponding frame, and the products are summed to obtain the total vector of the valid frames of the adapted speech sample in the corresponding vector dimension; the total vector of the valid frames in each vector dimension is then divided by the sum of the supervised labels over all frames of the adapted speech sample to obtain the average vector of the adapted speech sample in each vector dimension. The calculation method can be as in the fifth formula.

The fifth formula:

$$A^{ws}_{k} = \frac{\sum_{f,t} V^{ws}_{k,f,t} \times Y^{ws}_{f,t}}{\sum_{f,t} Y^{ws}_{f,t}}$$
3. The mask of each frame of the mixed speech sample is estimated by measuring the distance between the vector $V^{cs}_{k,f,t}$ of each frame of the mixed speech sample in each vector dimension and the speech extractor $A^{ws}_{k}$; the estimation method is as in the sixth formula. The smaller the inner-product distance between a time-frequency window and the speech extractor, the higher the probability that the time-frequency window belongs to the target object, the larger the mask of the corresponding time-frequency window estimated by the sixth formula, and the more of the speech in the corresponding time-frequency window of the mixed speech sample will be extracted.

The sixth formula:

$$\widetilde{M}_{f,t} = \mathrm{Sigmoid}\Big(\sum_{k=1}^{K} A^{ws}_{k} \times V^{cs}_{k,f,t}\Big)$$

In the sixth formula, $\widetilde{M}_{f,t}$ denotes the mask of the t-th frame of the mixed speech sample (written here as a sigmoid of the inner product, consistent with the distance relation described above), and $A^{ws}_{k}$ and $V^{cs}_{k,f,t}$ are as defined in the preceding description.
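A minimal sketch of this labelling, extractor estimation and mask estimation (the fourth to sixth formulas), assuming the spectra and embeddings are stored as NumPy arrays; the threshold value and the sigmoid form are assumptions:

```python
import numpy as np

def adapted_speech_labels(adapted_spectrum, spectral_threshold=40.0):
    """Fourth formula: label 1 for time-frequency windows whose magnitude
    exceeds the comparison value (maximum magnitude minus the preset
    threshold), 0 for low-energy windows.  adapted_spectrum: (F, T)."""
    comparison_value = adapted_spectrum.max() - spectral_threshold
    return (adapted_spectrum > comparison_value).astype(float)

def estimate_extractor(adapted_vectors, adapted_labels):
    """Fifth formula: label-weighted average of the adapted-speech vectors.
    adapted_vectors: (K, F, T); adapted_labels: (F, T)."""
    weighted_sum = (adapted_vectors * adapted_labels).sum(axis=(1, 2))   # (K,)
    return weighted_sum / max(adapted_labels.sum(), 1e-8)

def estimate_mask(extractor, mixed_vectors):
    """Sixth formula: sigmoid of the inner product between the extractor
    and each embedding vector of the mixed speech sample.
    extractor: (K,); mixed_vectors: (K, F, T)."""
    inner_product = np.tensordot(extractor, mixed_vectors, axes=(0, 0))  # (F, T)
    return 1.0 / (1.0 + np.exp(-inner_product))
```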

4. The objective function of the recognition network reconstructs the spectral error between the target-object speech restored by the estimated mask and the reference speech of the target object; the objective function L may be as shown in the seventh formula, and the entire network is then trained by minimizing this objective function.

A seventh formula:

$$L = \sum_{f,t} \Big\| S_{f,t} - \widetilde{M}_{f,t} \times X^{cs}_{f,t} \Big\|^{2}$$

In the seventh formula, $S_{f,t}$ denotes the spectrum of the reference speech of the target object at the t-th frame (i.e., the reference speech spectrum). The seventh formula is a standard L2 reconstruction error; since the reconstruction error reflects the spectral error between the restored speech and the reference speech of the target object, training the recognition network to reduce this global error generates gradients that optimize the quality of the extracted target-object speech.
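A sketch of the seventh formula, the L2 reconstruction error between the mask-restored target speech and the reference speech spectrum:

```python
import numpy as np

def reconstruction_loss(mask: np.ndarray, mixed_spectrum: np.ndarray,
                        reference_spectrum: np.ndarray) -> float:
    """Standard L2 reconstruction error of the seventh formula.

    mask, mixed_spectrum, reference_spectrum: arrays of shape (F, T)
    """
    restored = mask * mixed_spectrum                           # target speech restored by the mask
    return float(np.sum((reference_spectrum - restored) ** 2))
```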

In another application scenario, the structure of the above recognition network may also be as shown in fig. 4-b. In this application scenario, the training process of the recognition network does not require the input of an adapted speech sample, that is, the target object and the interfering objects are not distinguished. The training process of the above recognition network is described below with reference to fig. 4-b:

1. Suppose the mixed speech sample contains the voices of C speakers in total, and a supervised label $Y_{c,f,t}$ is obtained for each speaker: the low-energy spectral-window noise in the mixed speech sample is removed first, and then, for the speech spectral magnitude of each speaker in the mixed speech sample, if the spectral magnitude of a speaker in a frame is greater than the spectral magnitudes of the other speakers in that frame, the label $Y_{c,f,t}$ of that speaker for that frame takes 1, and otherwise takes 0.
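A sketch of this per-speaker labelling, assuming the individual speakers' spectral magnitudes are available during training and using a hypothetical threshold for discarding low-energy windows:

```python
import numpy as np

def speaker_labels(speaker_spectra: np.ndarray,
                   mixed_spectrum: np.ndarray,
                   spectral_threshold: float = 40.0) -> np.ndarray:
    """Supervised labels Y_{c,f,t} of the C speakers in the mixed speech sample.

    speaker_spectra:    shape (C, F, T), spectral magnitudes of the individual speakers
    mixed_spectrum:     shape (F, T), spectral magnitudes of the mixed sample
    spectral_threshold: hypothetical threshold for removing low-energy windows

    A window is labelled 1 for the speaker whose magnitude dominates that
    window and 0 for the other speakers; low-energy windows are labelled 0
    for every speaker.
    """
    dominant = speaker_spectra.argmax(axis=0)                             # (F, T) loudest speaker index
    speaker_index = np.arange(speaker_spectra.shape[0]).reshape(-1, 1, 1)
    labels = (speaker_index == dominant).astype(float)                    # one-hot over speakers, (C, F, T)
    valid = mixed_spectrum > (mixed_spectrum.max() - spectral_threshold)  # remove low-energy windows
    return labels * valid
```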

In this application scenario, $X_{c,f,t}$ denotes the spectrum of the t-th frame of speech of the mixed speech sample. The input spectrum $X_{c,f,t}$ of the mixed speech sample is mapped to K-dimensional vectors by the deep neural network, yielding the vector $V_{k,f,t}$ of each frame of the mixed speech sample in each vector dimension ($V_{k,f,t}$ denotes the vector of the t-th frame of the mixed speech sample in the k-th vector dimension, k ∈ [1, K]). The deep neural network is composed of 4 bidirectional LSTM layers with 600 nodes per LSTM layer; of course, the deep neural network can be replaced by various other effective model structures, such as a model combining CNN with other network structures, or other network structures such as a time-delay network or a gated convolutional neural network.

2. The vector $V_{k,f,t}$ of the mixed speech sample and the supervised label $Y_{c,f,t}$ are used to estimate the speech extractor $A_{c,k}$ of each speaker in the vector space; the calculation method is as in the eighth formula.

Eighth formula:

$$A_{c,k} = \frac{\sum_{f,t} V_{k,f,t} \times Y_{c,f,t}}{\sum_{f,t} Y_{c,f,t}}$$
3. The mask of each speaker is estimated by measuring the distance between the vector of each frame of the mixed speech sample in each vector dimension and each speech extractor; the estimation method is shown in the ninth formula.

Ninth formula:

$$M_{c,f,t} = \mathrm{Softmax}_{c}\Big(\sum_{k=1}^{K} A_{c,k} \times V_{k,f,t}\Big)$$

In the ninth formula, $M_{c,f,t}$ denotes the mask associating the t-th frame of the mixed speech sample with speaker c (written here with a softmax taken over the C speakers), and $A_{c,k}$ and $V_{k,f,t}$ are as defined in the preceding description.
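A sketch of the eighth and ninth formulas for the C-speaker case, assuming the labels from the preceding step; the softmax over speakers is an assumption:

```python
import numpy as np

def speaker_extractors(mixed_vectors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Eighth formula: one extractor A_{c,k} per speaker.
    mixed_vectors: (K, F, T); labels: (C, F, T)."""
    weighted = np.einsum('kft,cft->ck', mixed_vectors, labels)     # (C, K)
    counts = labels.sum(axis=(1, 2)).reshape(-1, 1)                # (C, 1)
    return weighted / np.maximum(counts, 1e-8)

def speaker_masks(extractors: np.ndarray, mixed_vectors: np.ndarray) -> np.ndarray:
    """Ninth formula: mask M_{c,f,t} of each speaker, written here as a
    softmax over the C speakers of the per-window inner products.
    extractors: (C, K); mixed_vectors: (K, F, T)."""
    scores = np.einsum('ck,kft->cft', extractors, mixed_vectors)   # (C, F, T)
    scores -= scores.max(axis=0, keepdims=True)                    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)
```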

4. Extracting the voice of each speaker in the mixed voice sample by using the Mask of each speaker;

5. The objective function of the recognition network reconstructs the spectral error between each speaker's speech restored by the estimated masks and the corresponding speaker's reference speech; the objective function L can be as shown in the tenth formula, and the whole network is then trained by minimizing this objective function.

The tenth formula:

$$L = \sum_{c}\sum_{f,t} \Big\| S_{c,f,t} - M_{c,f,t} \times X_{c,f,t} \Big\|^{2}$$

In the tenth formula, $S_{c,f,t}$ denotes the spectrum of the reference speech of speaker c at the t-th frame (i.e., the reference speech spectrum). The tenth formula is a standard L2 reconstruction error; since the reconstruction error reflects the spectral error between the restored speech of each speaker and that speaker's reference speech, training the recognition network to reduce this global error generates gradients that optimize the quality of the extracted speech of all speakers.

As can be seen from the above, in the embodiment of the present application, when the input of adapted speech and mixed speech is monitored, the spectrum of the adapted speech and the spectrum of the mixed speech are each embedded into a K-dimensional vector space, a speech extractor is determined for the target object based on the adapted speech, the mask of each frame of the mixed speech is then estimated by measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension, and finally the speech belonging to the target object in the mixed speech is determined based on the mask. By introducing the adapted speech to learn the characteristics of the target object, the scheme of the present application can determine the speech of the target object from the mixed speech, so that the speech of the target object in the mixed speech can be conveniently tracked. For example, in the application scenario of a smart speaker, the wake-up speech can be used as the adapted speech for learning the characteristics of the speaker who utters it (i.e., the target object), and the speech belonging to that speaker can then be identified and tracked in the mixed speech input after the wake-up speech. In addition, since the determination of the speech extractor does not depend on the number of speakers in the mixed speech, the present application does not need to know or estimate the number of speakers in the mixed speech in advance during hybrid speech recognition.

The hybrid speech recognition method in the present application is described below with another embodiment, which is different from the embodiment shown in fig. 3 in that the present embodiment introduces a forward neural network in the recognition network (i.e., the network for implementing hybrid speech recognition) to map the original vector space to the regular vector space, so that the distribution of the speech extractors obtained by training the recognition network is relatively more concentrated and stable. As shown in fig. 5, the hybrid speech recognition method in the embodiment of the present application includes:

step 301, monitoring the input of voice;

in the embodiment of the application, the input of voice can be monitored through the microphone array, so that noise interference of voice input is reduced.

Step 302, when monitoring the input of adaptive voice and mixed voice, respectively embedding the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice into a vector space with K dimension to obtain a vector of each frame of the adaptive voice in each vector dimension and a vector of each frame of the mixed voice in each vector dimension;

the adaptive voice is a voice containing preset voice information, and K is not less than 1, and optionally, K may be 40.

In the embodiment of the present application, when a voice input including preset voice information is monitored, it may be considered that an input adapted to the voice is monitored. For example, in an application scenario of a smart speaker, it is generally necessary to input a wake-up voice to wake up a voice control function of the smart speaker, where the wake-up voice is a voice including a wake-up word (e.g., "ding-dong"), and therefore, in the application scenario, the wake-up voice can be regarded as an adaptive voice, and when an input of the wake-up voice is monitored, the input of the adaptive voice can be considered to be monitored.

The mixed speech is non-adaptive speech input after the adaptive speech, and in a real intelligent speech interaction scene, especially under a far-distance speaking condition, the speech aliasing of different speakers often occurs, so that the input speech is the mixed speech.

In step 302, the spectrum of the adapted speech and the spectrum of the mixed speech are mapped to a K-dimensional vector space through a deep neural network to obtain the vectors of the frames of the adapted speech in each vector dimension and the vectors of the frames of the mixed speech in each vector dimension. Optionally, the deep neural network is composed of 4 bidirectional LSTM layers, and each LSTM layer may have 600 nodes.

Specifically, the spectrum in the embodiment of the present application may be obtained by performing a short-time Fourier transform on the speech and then taking the logarithm of the result of the short-time Fourier transform.
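A sketch of this spectrum computation, assuming SciPy's STFT; the frame length, hop size and flooring constant are hypothetical values:

```python
import numpy as np
from scipy.signal import stft

def log_spectrum(waveform: np.ndarray, sample_rate: int = 16000,
                 frame_length: int = 512, hop_length: int = 256) -> np.ndarray:
    """Short-time Fourier transform of the speech followed by a logarithm.

    waveform: one-dimensional speech signal
    sample_rate, frame_length, hop_length: hypothetical analysis parameters
    """
    _, _, z = stft(waveform, fs=sample_rate, nperseg=frame_length,
                   noverlap=frame_length - hop_length)
    magnitude = np.abs(z)                         # spectral magnitudes, shape (F, T)
    return np.log(magnitude + 1e-8)               # log spectrum used as the network input
```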

Step 302 is described below by way of example. The adapted speech is indicated by the superscript "ws", the mixed speech by the superscript "cs", and $X_{f,t}$ denotes the spectrum of the t-th frame of speech (f is the index in the spectral dimension, t is the frame index in the time dimension), so the spectrum of the adapted speech can be written as $X^{ws}_{f,t}$ and the spectrum of the mixed speech as $X^{cs}_{f,t}$. In step 302, the input spectrum $X^{ws}_{f,t}$ of the adapted speech and the input spectrum $X^{cs}_{f,t}$ of the mixed speech are each mapped to K-dimensional vectors by the deep neural network, yielding the vector $V^{ws}_{k,f,t}$ of each frame of the adapted speech in each vector dimension ($V^{ws}_{k,f,t}$ denotes the vector of the t-th frame of the adapted speech in the k-th vector dimension, k ∈ [1, K]) and the vector $V^{cs}_{k,f,t}$ of each frame of the mixed speech in each vector dimension ($V^{cs}_{k,f,t}$ denotes the vector of the t-th frame of the mixed speech in the k-th vector dimension, k ∈ [1, K]).

Step 303, calculating an average vector of the adaptive speech in each vector dimension based on the vector of each frame of the adaptive speech in each vector dimension;

In the embodiment of the present application, the average vector $\bar{A}^{ws}_{k}$ of the adapted speech in each vector dimension can be calculated by the formula $\bar{A}^{ws}_{k} = \frac{1}{T_1}\sum_{f,t} V^{ws}_{k,f,t}$, where $T_1$ denotes the number of frames (time-frequency windows) of the adapted speech.

Optionally, in step 303, in order to remove the low-energy spectral-window noise and obtain the valid frames of the adapted speech, the spectrum of each frame of the adapted speech may be compared with a spectral threshold: if the spectral magnitude of a frame (i.e., a time-frequency window) of the adapted speech is greater than the adapted-spectrum comparison value, the frame is regarded as a valid frame of the adapted speech, and in step 303 the average vector of the adapted speech in each vector dimension is calculated based on the vectors of the valid frames of the adapted speech in each vector dimension. The adapted-spectrum comparison value equals the difference between the maximum spectral magnitude of the adapted speech and a preset spectral threshold. Specifically, a supervised label $Y^{ws}_{f,t}$ of the adapted speech can be set: the spectrum of each frame of the adapted speech is compared with the spectral threshold, and if the spectral magnitude of a frame (i.e., a time-frequency window) of the adapted speech is greater than the adapted-spectrum comparison value, the supervised label $Y^{ws}_{f,t}$ of the adapted speech for that time-frequency window takes 1; otherwise it takes 0. The specific formula may refer to the first formula, and calculating the average vector of the adapted speech in each vector dimension based on the vectors of the valid frames of the adapted speech in each vector dimension may be implemented by the second formula.

Step 304, inputting the average vector of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension into a pre-trained forward neural network to obtain a regular vector of each frame in each vector dimension;

In this embodiment, the forward neural network may be a two-layer network with 256 nodes in each layer. Continuing the foregoing example, the average vector $\bar{A}^{ws}_{k}$ of the adapted speech in each vector dimension and the vector $V^{cs}_{k,f,t}$ of each frame of the mixed speech in each vector dimension are concatenated into a 2K-dimensional vector and input into the forward neural network, which outputs a K-dimensional regular vector $\widetilde{V}_{k,f,t}$. Specifically, the functional representation of the forward neural network may be as shown in the eleventh formula.

An eleventh formula:

$$\widetilde{V}_{k,f,t} = f\Big(\big[\bar{A}^{ws};\, V^{cs}_{f,t}\big]\Big)$$

In the eleventh formula, $f(\cdot)$ denotes a non-linear mapping function learned by the deep neural network, whose role is to map the original vector space to a new vector space (i.e., the regular vector space), and $[\cdot\,;\cdot]$ denotes the concatenation of the K-dimensional average vector of the adapted speech with the K-dimensional vector of a frame of the mixed speech into a 2K-dimensional vector.
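A sketch of the eleventh formula as a two-layer forward network with 256 nodes per layer; the weight shapes and the tanh activation are assumptions, and in practice the layers are trained jointly with the rest of the recognition network:

```python
import numpy as np

def warp_vectors(mixed_vectors: np.ndarray, adapted_mean_vector: np.ndarray,
                 w1: np.ndarray, b1: np.ndarray,
                 w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Map the original embedding vectors to the regular vector space.

    mixed_vectors:       shape (N, K), one K-dim vector per time-frequency window
    adapted_mean_vector: shape (K,), average vector of the adapted speech
    w1 (2K, 256), b1 (256,), w2 (256, K), b2 (K,): hypothetical trained weights
    """
    n = mixed_vectors.shape[0]
    tiled = np.tile(adapted_mean_vector, (n, 1))           # repeat the average vector per window
    x = np.concatenate([mixed_vectors, tiled], axis=1)     # 2K-dimensional network input
    hidden = np.tanh(x @ w1 + b1)                          # first layer, 256 nodes
    return np.tanh(hidden @ w2 + b2)                       # K-dimensional regular vectors
```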

Step 305, respectively measuring the distance between the regular vector of each frame in each vector dimension and a preset speech extractor to estimate the mask of each frame of the mixed speech;

because the voice extractors obtained by training the recognition network in the embodiment of the present application have the characteristic of stable and concentrated distribution, in the embodiment of the present application, the centroids of all the voice extractors obtained by training the recognition network can be used as the preset voice extractors. Because the speech extractor does not need to be estimated again in the recognition process of the mixed speech in the embodiment of the application, the mixed speech recognition scheme in the embodiment of the application can better realize frame-by-frame real-time processing.
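A minimal sketch of forming the preset extractor from the extractors collected during training, assuming they are stacked row-wise:

```python
import numpy as np

def preset_extractor(training_extractors: np.ndarray) -> np.ndarray:
    """Centroid of all speech extractors obtained while training the
    recognition network, used as the preset extractor so that no extractor
    has to be re-estimated during recognition.

    training_extractors: shape (N, K), one extractor per training sample
    """
    return training_extractors.mean(axis=0)
```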

Step 306, determining the voice belonging to the target object in the mixed voice based on the mask of each frame of the mixed voice;

in this embodiment, after obtaining the mask of each frame of the mixed speech, the speech belonging to the target object in the mixed speech may be determined based on the mask of each frame of the mixed speech. Specifically, the mask is used to weight the mixed speech, so that the speech belonging to the target object in the mixed speech can be extracted frame by frame, and the larger the mask is, the more speech in the corresponding time-frequency window will be extracted.

The following describes a recognition network for implementing the hybrid speech recognition procedure shown in fig. 5, and the structure of the recognition network can be schematically shown in fig. 6. The training process of the above recognition network is described below with reference to fig. 6:

1. The adapted and mixed speech samples used to train the recognition network are input into a deep neural network composed of 4 bidirectional LSTM layers, each LSTM layer having 600 nodes.

In this application scenario, the superscript "ws" denotes the adapted speech sample, the superscript "cs" denotes the mixed speech sample, and $X_{f,t}$ denotes the spectrum of the t-th frame of speech (f is the index in the spectral dimension, t is the frame index in the time dimension), so the spectrum of the adapted speech sample can be written as $X^{ws}_{f,t}$ and the spectrum of the mixed speech sample as $X^{cs}_{f,t}$. The input spectrum $X^{ws}_{f,t}$ of the adapted speech sample and the input spectrum $X^{cs}_{f,t}$ of the mixed speech sample are each mapped to K-dimensional vectors by the deep neural network, yielding the vector $V^{ws}_{k,f,t}$ of each frame of the adapted speech sample in each vector dimension ($V^{ws}_{k,f,t}$ denotes the vector of the t-th frame of the adapted speech sample in the k-th vector dimension, k ∈ [1, K]) and the vector $V^{cs}_{k,f,t}$ of each frame of the mixed speech sample in each vector dimension ($V^{cs}_{k,f,t}$ denotes the vector of the t-th frame of the mixed speech sample in the k-th vector dimension, k ∈ [1, K]).

2. A supervised label $Y^{ws}_{f,t}$ of the adapted speech sample is set in order to remove low-energy spectral-window noise and obtain the valid frames of the adapted speech. The spectrum of each frame of the adapted speech sample is compared with a spectral threshold: if the spectral magnitude of a frame (i.e., a time-frequency window) of the adapted speech sample is greater than the adapted-spectrum comparison value (i.e., the difference between the maximum spectral magnitude of the adapted speech sample and the adapted-spectrum threshold), the supervised label $Y^{ws}_{f,t}$ of the adapted speech sample for that time-frequency window takes 1; otherwise it takes 0. The specific formula can be expressed as described in the fourth formula above.

In this application scenario, the average vector of the adapted speech sample in each vector dimension is calculated based on the vector $V^{ws}_{k,f,t}$ of each frame of the adapted speech sample in each vector dimension and the supervised label $Y^{ws}_{f,t}$; the calculation method is as in the fifth formula above.

3. The average vector $A^{ws}_{k}$ of the adapted speech in each vector dimension and the vector $V^{cs}_{k,f,t}$ of each frame of the mixed speech in each vector dimension are concatenated into a 2K-dimensional vector and input into the forward neural network, which outputs a K-dimensional regular vector $\widetilde{V}_{k,f,t}$; specifically, the functional representation of the forward neural network can be as shown in the eleventh formula above. For the description of the forward neural network, reference may be made to the description in step 304, which is not repeated here.

4. A supervised label $Y^{cs}_{f,t}$ is set for the target object in the mixed speech sample: the low-energy spectral-window noise in the mixed speech sample is removed first, and then, for the speech spectral magnitude of the target object in the mixed speech sample, if the spectral magnitude of the target object in a frame is greater than the spectral magnitude of the interfering object in that frame, the label $Y^{cs}_{f,t}$ of the target object for that frame takes 1, and otherwise takes 0.

5. Based on the regular vector $\widetilde{V}_{k,f,t}$ and the supervised label $Y^{cs}_{f,t}$ of the target object in the mixed speech sample, the regular speech extractor $\widetilde{A}_{k}$ is estimated by the twelfth formula.

The twelfth formula:

$$\widetilde{A}_{k} = \frac{\sum_{f,t} \widetilde{V}_{k,f,t} \times Y^{cs}_{f,t}}{\sum_{f,t} Y^{cs}_{f,t}}$$
6. The mask of each frame of the mixed speech sample is estimated by measuring the distance between the regular vector $\widetilde{V}_{k,f,t}$ of each frame in each vector dimension and the regular speech extractor $\widetilde{A}_{k}$; the estimation method is as shown in the thirteenth formula. The smaller the inner-product distance between a time-frequency window and the speech extractor, the higher the probability that the time-frequency window belongs to the target object, the larger the mask of the corresponding time-frequency window estimated by the thirteenth formula, and the more of the speech in the corresponding time-frequency window of the mixed speech sample will be extracted.

A thirteenth formula:

$$\widetilde{M}_{f,t} = \mathrm{Sigmoid}\Big(\sum_{k=1}^{K} \widetilde{A}_{k} \times \widetilde{V}_{k,f,t}\Big)$$

In the thirteenth formula, $\widetilde{M}_{f,t}$ denotes the mask of the t-th frame of the mixed speech sample (written here, as in the sixth formula, as a sigmoid of the inner product).

7. The objective function of the recognition network reconstructs the spectral error between the target-object speech restored by the estimated mask and the reference speech of the target object; the objective function L can be expressed as the fourteenth formula, and the whole network is trained by minimizing this objective function.

A fourteenth formula:

$$L = \sum_{f,t} \Big\| S_{f,t} - \widetilde{M}_{f,t} \times X^{cs}_{f,t} \Big\|^{2}$$

In the fourteenth formula, $S_{f,t}$ denotes the spectrum of the reference speech of the target object at the t-th frame (i.e., the reference speech spectrum). The fourteenth formula is a standard L2 reconstruction error; since the reconstruction error reflects the spectral error between the restored speech and the reference speech of the target object, training the recognition network to reduce this global error generates gradients that optimize the quality of the extracted target-object speech.

Unlike with the mixed speech samples used for training, in actual hybrid speech recognition it is not known which speech in the input mixed speech belongs to the target object, so the supervised label of the target object in the mixed speech is unknown. As mentioned above, the centroids of all the speech extractors obtained when training the recognition network can therefore be used as the preset speech extractors, and in step 305 of the embodiment shown in fig. 5, the distances between the regular vectors of each frame in each vector dimension and the preset speech extractors are measured respectively to estimate the mask of each frame of the mixed speech.

The embodiment of the application provides a hybrid speech recognition device. As shown in fig. 7, the hybrid speech recognition apparatus in the embodiment of the present application includes:

a monitoring unit 71 for monitoring the input of voice;

an acquisition unit 72 configured to acquire a speech feature of the target object based on the adapted speech when the input of the adapted speech and the mixed speech is monitored by the monitoring unit 71;

a determining unit 73 configured to determine a voice belonging to the target object in the mixed voice based on a voice feature of the target object;

the adaptive voice is a voice containing preset voice information, and the mixed voice is a non-adaptive voice input after the adaptive voice.

Alternatively, on the basis of the embodiment shown in fig. 7, as shown in fig. 8, the obtaining unit 72 may include:

a space mapping unit 721, configured to, when the monitoring unit 71 monitors the input of adaptive speech and mixed speech, embed the frequency spectrum of the adaptive speech and the frequency spectrum of the mixed speech into a vector space of a dimension K respectively, to obtain a vector of each frame of the adaptive speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, where the adaptive speech is speech including preset speech information, the mixed speech is non-adaptive speech input after the adaptive speech, and K is not less than 1;

a calculating unit 722, configured to calculate an average vector of the adapted speech in each vector dimension based on the vector of each frame of the adapted speech in each vector dimension;

a mask estimation unit 723, configured to use the average vector of the adaptive speech in each vector dimension as a speech extractor of a target object in each vector dimension, and measure distances between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension, respectively, so as to estimate a mask of each frame of the mixed speech;

a determining unit 73, configured to determine, based on the mask of each frame of the mixed speech, a speech belonging to the target object in the mixed speech.

Optionally, the calculating unit 722 is specifically configured to: calculating the average vector of the adaptive voice in each vector dimension based on the vector of the adaptive voice effective frame in each vector dimension, wherein the adaptive voice effective frame is a frame of which the spectral amplitude is greater than an adaptive spectral comparison value in the adaptive voice, and the adaptive spectral comparison value is equal to the difference between the maximum spectral amplitude of the adaptive voice and a preset spectral threshold value.

Optionally, the hybrid speech recognition apparatus in this embodiment of the application further includes: and the warping unit is used for inputting the average vector of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension into a pre-trained forward neural network to obtain the warped vector of each frame in each vector dimension. The mask estimation unit 723 is specifically configured to: and respectively measuring the distance between the regular vector of each frame in each vector dimension and a preset voice extraction son to estimate and obtain the mask of each frame of the mixed voice.

Optionally, the hybrid speech recognition apparatus in this embodiment of the application further includes: and the clustering unit is used for processing the vector of each frame of the mixed voice in each vector dimension based on a clustering algorithm so as to determine the centroid vector of the mixed voice corresponding to the voices of different speakers in each vector dimension. The mask estimation unit 723 is specifically configured to: and taking the target centroid vector of the mixed voice in each vector dimension as a voice extraction sub of a target object in a corresponding vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension so as to estimate the mask of each frame of the mixed voice.

Optionally, the hybrid speech recognition apparatus in this embodiment of the application further includes: and the comparison unit is used for respectively comparing the distances between preset M voice extractors and the average vectors of the adaptive voice in each vector dimension, wherein M is larger than 1. The mask estimation unit 723 is specifically configured to: and taking the voice extraction sub with the minimum average vector distance with the adaptive voice in a vector dimension from the M voice extraction sub as the voice extraction sub of a target object in the corresponding vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension so as to estimate the mask of each frame of the mixed voice.

It should be understood that the hybrid speech recognition apparatus in the embodiment of the present invention may be configured to implement all technical solutions in the above method embodiment, and the functions of each functional module may be specifically implemented according to the method in the above method embodiment, and the specific implementation process may refer to the relevant description in the above embodiment, which is not described herein again.

As can be seen from the above, in the embodiment of the present application, when the input of the adapted speech and the mixed speech is monitored, the speech features of the target object are obtained based on the adapted speech, and the speech belonging to the target object in the mixed speech is determined based on those speech features. By introducing the adapted speech to learn the speech features of the target object, the scheme of the present application can determine the speech of the target object from the mixed speech, so that the speech of the target object in the mixed speech can be conveniently tracked. For example, in the application scenario of a smart speaker, the wake-up speech can be used as the adapted speech for learning the features of the speaker who utters it (i.e., the target object), and the speech belonging to that speaker can then be identified and tracked in the mixed speech input after the wake-up speech. In addition, because the speech features of the target object do not depend on the number of speakers in the mixed speech, the scheme of the present application does not need to know or estimate the number of speakers in the mixed speech in advance during hybrid speech recognition.

Referring to fig. 9, the hybrid speech recognition apparatus according to the embodiment of the present application further includes: a memory 81, one or more processors 82 (only one shown in fig. 9), and a computer program stored on the memory 81 and executable on the processors. Wherein: the memory 81 is used to store software programs and modules, and the processor 82 executes various functional applications and data processing by operating the software programs and units stored in the memory 81. Specifically, the processor 82 realizes the following steps by running the above-mentioned computer program stored in the memory 81:

monitoring the input of voice;

when the input of adapted speech and mixed speech is monitored, acquiring a speech feature of a target object based on the adapted speech;

determining, based on the speech feature of the target object, the speech belonging to the target object in the mixed speech;

the adaptive voice is a voice containing preset voice information, and the mixed voice is a non-adaptive voice input after the adaptive voice.

Assuming that the foregoing is the first possible implementation manner, in a second possible implementation manner provided on the basis of the first possible implementation manner, the acquiring the voice feature of the target object based on the adaptive voice includes:

respectively embedding the frequency spectrum of the adaptive voice and the frequency spectrum of the mixed voice into a vector space of K dimensionality to obtain a vector of each frame of the adaptive voice in each vector dimensionality and a vector of each frame of the mixed voice in each vector dimensionality, wherein K is not less than 1;

calculating the average vector of the adaptive voice in each vector dimension based on the vector of each frame of the adaptive voice in each vector dimension;

taking the average vector of the adaptive voice in each vector dimension as a voice extraction sub of a target object in each vector dimension, and respectively measuring the distance between the vector of each frame of the mixed voice in each vector dimension and the voice extraction sub of the corresponding vector dimension so as to estimate the mask of each frame of the mixed voice;

and determining the voice belonging to the target object in the mixed voice based on the mask of each frame of the mixed voice.

In a third possible implementation manner provided on the basis of the second possible implementation manner, the calculating an average vector of the adapted speech in each vector dimension based on the vector of each frame of the adapted speech in each vector dimension specifically includes:

calculating an average vector of the adaptive voice in each vector dimension based on the vector of the adaptive voice effective frame in each vector dimension, wherein the adaptive voice effective frame is a frame of which the spectral amplitude is greater than an adaptive spectral comparison value in the adaptive voice, and the adaptive spectral comparison value is equal to a difference value between the maximum spectral amplitude of the adaptive voice and a preset spectral threshold value.

In a fourth possible implementation manner provided on the basis of the third possible implementation manner, the calculating unit is specifically configured to: for each vector dimension, multiplying the vector of each frame of the adaptive voice in the corresponding vector dimension by the supervised labeling of the corresponding frame respectively, and then summing to obtain the total vector of the effective frame of the adaptive voice in the corresponding vector dimension; dividing the total vector of the adaptive voice effective frame in each vector dimension by the sum of supervised labels of each frame of the adaptive voice to obtain an average vector of the adaptive voice in each vector dimension;

and the supervised label of the frame with the spectral amplitude larger than the adaptive spectrum comparison value in the adaptive voice is 1, and the supervised label of the frame with the spectral amplitude not larger than the adaptive spectrum comparison value in the adaptive voice is 0.

In a fifth possible implementation manner provided on the basis of the second possible implementation manner, the third possible implementation manner, or the fourth possible implementation manner, after calculating an average vector of the adapted speech in each vector dimension based on a vector of each frame of the adapted speech in each vector dimension, the processor 82 further implements the following steps when executing the computer program stored in the memory 81:

inputting the average vector of the adaptive voice in each vector dimension and the vector of each frame of the mixed voice in each vector dimension into a pre-trained forward neural network to obtain a regular vector of each frame in each vector dimension;

and the step of taking the average vector of the adapted speech in each vector dimension as the speech extractor of the target object in each vector dimension and respectively measuring the distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to estimate the mask of each frame of the mixed speech is replaced by:

and respectively measuring the distance between the regular vector of each frame in each vector dimension and a preset voice extraction unit to estimate and obtain the mask of each frame of the mixed voice.

In a sixth possible implementation manner provided on the basis of the second possible implementation manner, the third possible implementation manner, or the fourth possible implementation manner, after the spectrum of the adapted speech and the spectrum of the mixed speech are respectively embedded into a vector space in a K dimension, the processor 82 further implements the following steps when executing the computer program stored in the memory 81:

processing the vector of each frame of the mixed voice in each vector dimension based on a clustering algorithm to determine the centroid vector of the mixed voice corresponding to the voices of different speakers in each vector dimension;

the step of taking the average vector of the adapted speech in each vector dimension as the speech extractor of the target object in each vector dimension is replaced by: taking the target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target object in the corresponding vector dimension, wherein the target centroid vector is the centroid vector with the smallest distance to the average vector of the adapted speech in the same vector dimension.

In a seventh possible implementation manner provided on the basis of the second possible implementation manner, the third possible implementation manner, or the fourth possible implementation manner, after the calculating an average vector of the adapted speech in each vector dimension based on the vector of each frame of the adapted speech in each vector dimension, the processor 82 further implements the following steps when executing the computer program stored in the memory 81:

respectively comparing the distances between preset M voice extractors and the average vectors of the adaptive voice in each vector dimension, wherein M is larger than 1;

the step of taking the average vector of the adapted speech in each vector dimension as the speech extractor of the target object in each vector dimension is replaced by: taking, from the M speech extractors, the speech extractor with the smallest distance to the average vector of the adapted speech in a vector dimension as the speech extractor of the target object in the corresponding vector dimension.

Optionally, as shown in fig. 9, the hybrid speech recognition apparatus further includes: one or more input devices 83 (only one shown in fig. 9) and one or more output devices 84 (only one shown in fig. 9). The memory 81, processor 82, input device 83 and output device 84 are connected by a bus 85.

It should be understood that in the embodiments of the present Application, the Processor 82 may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The input device 83 may include a keyboard, a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output device 84 may include a display, a speaker, etc.

The memory 81 may include read-only memory and random access memory, and provides instructions and data to the processor 82. Part or all of the memory 81 may also include non-volatile random access memory.

As can be seen from the above, in the embodiment of the present application, when the input of the adapted speech and the mixed speech is monitored, the speech features of the target object are obtained based on the adapted speech, and the speech belonging to the target object in the mixed speech is determined based on those speech features. By introducing the adapted speech to learn the speech features of the target object, the scheme of the present application can determine the speech of the target object from the mixed speech, so that the speech of the target object in the mixed speech can be conveniently tracked. For example, in the application scenario of a smart speaker, the wake-up speech can be used as the adapted speech for learning the features of the speaker who utters it (i.e., the target object), and the speech belonging to that speaker can then be identified and tracked in the mixed speech input after the wake-up speech. In addition, because the speech features of the target object do not depend on the number of speakers in the mixed speech, the scheme of the present application does not need to know or estimate the number of speakers in the mixed speech in advance during hybrid speech recognition.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned functional units and modules are illustrated as being divided, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in the form of a hardware or a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described or recited in detail in a certain embodiment, reference may be made to the descriptions of other embodiments.

Those of ordinary skill in the art would appreciate that the elements and algorithm steps of the various embodiments described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules or units is only one logical functional division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, all or part of the flow in the method of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and used by a processor to implement the steps of the embodiments of the methods described above. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the above-mentioned computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signal, telecommunication signal, and software distribution medium, etc. It should be noted that the computer readable medium described above may be suitably increased or decreased as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media not including electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the present disclosure, and are intended to be included within the scope thereof.
