Data acquisition and model training method and device for isolated word speech recognition

Document No.: 925434    Publication date: 2021-03-02

Reading note: this technology, "Data acquisition and model training method and device for isolated word speech recognition", was designed and created by Xu Yu, Wu Lei and Xu Sufen on 2020-10-16. The invention discloses a data acquisition and model training method and device for isolated word speech recognition, relates to the technical field of speech recognition, and aims to reduce the collection cost of isolated word speech samples and improve collection efficiency while ensuring the robustness of speech recognition. The method comprises: isolated word speech is collected in batches; in the first batch or first several batches, both "noisy environment" and "fixed environment" isolated word speech are collected to train a "Y-type" network; in subsequent batches, only "fixed environment" isolated word speech is collected, and only the model parameters of the semantic feature sub-network are updated during training (see FIG. 1 of the specification).

1. A data acquisition and model training method for isolated word speech recognition is characterized by comprising the following steps:

isolated word speech data are collected in batches, and a "Y-type" network is trained using the isolated word speech samples collected in the first batch or first several batches;

the "Y-type" network can be functionally divided into two sub-networks: (1) a sub-network generating regularized speech features, and (2) a sub-network generating semantic features;

for subsequent batches of speech, only the model parameters of the semantic feature generation sub-network are updated.

2. The data acquisition and model training method for isolated word speech recognition according to claim 1, wherein the speech samples collected in the first batch or first several batches comprise "noisy environment" isolated word speech and "fixed environment" isolated word speech, and subsequent batches may collect only "fixed environment" isolated word speech.

3. The isolated word speech recognition data collection and model training method of claim 1, further comprising:

when isolated word speech samples collected in the first batch or first several batches are used to train the whole "Y-type" network, the cost function is L = α·L_C + β·L_R, wherein L_C = −ln(y_i) is the classification cost and y_i is the semantic-feature output for the target word; L_R = ‖x̂_{i,j} − x′_{i,j}‖² is the mean-square reconstruction cost, wherein x_{i,j} and x′_{i,j} are spectrograms of "noisy environment" isolated word speech and "fixed environment" isolated word speech from the same person uttering the same word, and x̂_{i,j} is the spectrogram reconstruction result; α and β are adjustable coefficients between 0 and 1;

when isolated word speech samples collected in subsequent batches are used to train part of the "Y-type" network, the cost function is L = L_C = −ln(y_i).

4. The isolated word speech recognition data collection and model training method of claim 1, further comprising:

when isolated word voice samples collected in a first batch or a plurality of previous batches are adopted to train the whole Y-shaped network, the input of the network is isolated word voice in a noisy environment;

when isolated word voice samples collected in subsequent batches are adopted to train part of the Y-shaped network, the input of the network is isolated word voice in a fixed environment.

5. A data acquisition and model training device for isolated word speech recognition is characterized by comprising:

the speech spectrum extraction module is used for framing speech signals and computing speech features such as MFCC (Mel-frequency cepstral coefficients) and Fbank;

the regularization speech feature generation module is used for generating regularization features of isolated word speech;

the voice reconstruction module is used for reconstructing voice and providing the voice to the cost function generation module;

the semantic feature generation module is used for generating semantic features of isolated word voice;

the semantic classification module is used for calculating the probability that the input voice is classified into a certain recognized word;

the cost function generating module is used for calculating classification cost and reconstruction cost;

the network parameter 'freezing' indicating module is used for determining whether the model parameters in the regularized voice feature generation module can be updated through training;

and the network parameter updating module is used for updating the model parameters.

Technical Field

The invention relates to the technical field of voice processing and voice recognition, in particular to a data acquisition and model training method and device for isolated word voice recognition.

Background

Currently, some fields (e.g., mobile phone applications, smart furniture, industrial control) involve waking a device or changing its state on demand. If these functions are implemented with physical buttons, convenience is limited.

Waking a device with a specific utterance, or changing its state by voice command, is contact-free and highly responsive, which improves the user experience.

Due to differences in application environments and speech acquisition devices, speech signals are affected by environmental noise, surrounding human voices, channel distortion and other factors. A successful speech recognition system must be able to cope with all of these sources of variation.

Therefore, these variation factors need to be considered during speech sample collection and sample augmentation to achieve good recognition performance. For example, different user devices carry microphones of different brands and models, so a speech recognition system must recognize speech signals with different channel distortions; speech samples then need to be collected for microphones of different brands and models.

Compared with a general speech recognition system, an isolated word speech recognition system is distinctive: (1) it recognizes short phrases or isolated words; (2) the number of "recognition words" is limited, generally several to a dozen or so; (3) the "recognition words" are typically specified by a customer. For example, a maker of children's toys may specify "lucky cat" as the wake word for its plush cat toy.

Because the "recognition words" of an isolated word speech recognition system are strongly tied to the application, "recognition word" speech is usually collected in batches. When the system is customized for a particular customer, only speech for that customer's specified "recognition words" is collected. Since the words future customers will require are unpredictable, the cost of repeatedly collecting "recognition word" speech in large quantities is considerable.

In applications of isolated word speech recognition systems, there are cases where the application environments are similar but the recognition words differ. In such cases, the speech samples collected in batches under different environmental factors contain redundant information. Yet if samples are collected under only some environmental factors and the recognition model is trained on them alone, recognition performance suffers. Therefore, how to collect speech samples more efficiently, and which training strategy preserves recognition robustness, are the problems to be solved.

Disclosure of Invention

In order to solve the above technical problems, embodiments of the present invention provide a method and an apparatus for data acquisition and model training for isolated word speech recognition, so as to achieve the purposes of improving the acquisition efficiency of isolated word speech samples and enhancing the robustness of speech recognition, and the technical scheme is as follows:

in order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

In a first aspect, a method for collecting isolated word speech data in batches is provided, the method comprising:

For the specific application scenario of a customer's product (e.g., home environment, industrial environment), the speech samples collected in the first batch or first several batches include "noisy environment" isolated word speech and "fixed environment" isolated word speech. Here, "noisy environment" isolated word speech refers to speech samples that sufficiently cover variable factors such as environmental noise, channel distortion, intonation and speaking speed. "Fixed environment" isolated word speech refers to speech samples with "fixed" environmental noise, "fixed" channel distortion, "fixed" intonation, etc., for example recordings made in a quiet room with one particular model of microphone. Subsequent batches may collect only "fixed environment" isolated word speech.
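The reconstruction cost used later requires each "noisy environment" sample to be paired with a "fixed environment" sample of the same speaker saying the same word. A minimal sketch of that pairing step follows; the Utterance fields and function name are illustrative assumptions, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker_id: str
    word: str          # the "recognition word" being uttered
    environment: str   # "noisy" or "fixed"
    audio_path: str    # hypothetical path to the recording

def pair_first_batch(utterances):
    """Pair each 'noisy environment' sample with the 'fixed environment'
    sample of the same speaker saying the same word."""
    fixed = {(u.speaker_id, u.word): u
             for u in utterances if u.environment == "fixed"}
    pairs = []
    for u in utterances:
        if u.environment == "noisy" and (u.speaker_id, u.word) in fixed:
            pairs.append((u, fixed[(u.speaker_id, u.word)]))
    return pairs
```

Unpaired "noisy" samples are simply dropped here; a real pipeline would likely flag them for re-recording instead.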

In a second aspect, a method for training an isolated word speech recognition network is provided, the method comprising:

The isolated word speech samples collected in the first batch or first several batches are used to train a "Y-type" network, see FIG. 1. The "Y-type" network has one input, the "noisy environment" isolated word speech x_{i,j}, which may be raw speech or a spectrogram (e.g., MFCC, Fbank); and two outputs, the semantic features y_i and the speech reconstruction result x̂_{i,j}.

A "Y-type" network can be functionally divided into two sub-networks: (1) the regularized speech feature sub-network, similar in structure to U-Net, generates speech features in which factors such as environmental noise, surrounding human voices and channel distortion are suppressed; these are called regularized speech features and are intermediate features, not a network output; (2) the semantic feature sub-network generates semantic features of the speech signal to realize isolated word recognition.

Each rectangle (Layer) in FIG. 1 and FIG. 2 represents one layer of a neural network (e.g., a convolutional layer or a fully connected layer) or a residual block, and the arrows indicate the direction of signal flow.

The cost function of the "Y-type" network has two parts: (1) the classification cost uses cross entropy, L_C = −ln(y_i), where y_i is the semantic-feature output for the target word; (2) the reconstruction cost uses the mean square error, L_R = ‖x̂_{i,j} − x′_{i,j}‖², where x_{i,j} is the network input and x̂_{i,j} is the speech reconstruction result. x_{i,j} and x′_{i,j} are samples of the same isolated word spoken by the same person: x_{i,j} is the "noisy environment" isolated word speech and x′_{i,j} is the "fixed environment" isolated word speech. The total cost function is L = α·L_C + β·L_R, where α and β are adjustable coefficients between 0 and 1 that control the proportion of the two cost terms.
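The two-part cost above can be sketched as a plain-Python function; the flat-list spectrograms and the default α = β = 0.5 are illustrative assumptions:

```python
import math

def y_network_cost(y_target_prob, x_hat, x_fixed, alpha=0.5, beta=0.5):
    """Total cost L = alpha*L_C + beta*L_R for the 'Y-type' network.

    y_target_prob: softmax probability y_i assigned to the correct word.
    x_hat, x_fixed: reconstructed and 'fixed environment' spectrograms,
                    given here as flat lists of equal length.
    """
    l_c = -math.log(y_target_prob)                                 # cross-entropy term
    n = len(x_fixed)
    l_r = sum((a - b) ** 2 for a, b in zip(x_hat, x_fixed)) / n    # mean squared error
    return alpha * l_c + beta * l_r
```

A perfect classification (y_i = 1) and perfect reconstruction give L = 0; raising β emphasizes reconstruction fidelity over classification.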

For the speech of subsequent batches, if the sample collection method is the same as that of the first batch, the "Y-type" network is trained as before. If only "fixed environment" isolated word speech is collected, part of the "Y-type" network structure is reused, as shown in FIG. 2: the parameters of the regularized speech feature generation sub-network are "frozen", and only the model parameters of the semantic feature generation sub-network are updated. In this case, the input of the network is the "fixed environment" isolated word speech x′_{i,j}, and the cost function contains only the classification part, i.e. L = L_C = −ln(y_i).

In a third aspect, there is provided a training apparatus for an isolated word speech recognition model, the apparatus comprising:

the spectrogram extraction module is used for acquiring spectrograms such as MFCC (Mel frequency cepstrum coefficient) and Fbank;

the regularization speech feature generation module is used for generating regularization features of isolated word speech;

the voice reconstruction module is used for reconstructing voice and providing the voice to the cost function generation module;

the semantic feature generation module is used for generating semantic features of isolated word voice;

the semantic classification module is used for calculating the probability that the input voice is classified into a certain recognized word;

the cost function generating module is used for calculating classification cost and reconstruction cost;

the network parameter 'freezing' indicating module is used for determining whether the model parameters in the regularized voice feature generation module can be updated through training;

and the network parameter updating module is used for updating the model parameters.

The data acquisition and model training method and device for isolated word speech recognition provided by the embodiments of the invention divide speech samples into "noisy environment" isolated word speech and "fixed environment" isolated word speech and train the isolated word speech model with a "Y-type" network, thereby reducing the collection cost of isolated word speech samples and improving collection efficiency while ensuring recognition robustness.

Drawings

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a block diagram of the structure provided by the present invention for training the entire "Y-type" network using "noisy environment" isolated word speech and "fixed environment" isolated word speech.

FIG. 2 is a block diagram of training the semantic feature generation sub-network using "fixed environment" isolated word speech, as provided by the present invention.

Fig. 3 is a flowchart of a method for collecting and recognizing isolated word speech according to an embodiment of the present invention.

Fig. 4 is a block diagram of a "Y-type" network structure provided by the embodiment of the present invention.

FIG. 5 is a diagram of an isolated word speech recognition network model training apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the invention discloses a method and a device for data acquisition and model training of isolated word voice recognition, wherein the method comprises the following steps: the method comprises the steps of collecting voice samples in batches, wherein the voice samples collected in the first batch or the first batches comprise isolated word voice in a noisy environment and isolated word voice in a fixed environment, and only the isolated word voice in the fixed environment can be collected in the subsequent batch; framing the voice signal in the noisy environment and the voice signal in the fixed environment, and calculating an Fbank spectrogram; training a Y-shaped network by using a spectrogram to generate a regularized voice characteristic model parameter and a semantic characteristic model parameter; if only 'fixed environment' voice signals are used for voice recognition model training, parameters of the regularization voice feature generation sub-network are 'frozen', and only model parameters of the semantic feature generation sub-network are updated. The invention can reduce the collection cost of the isolated word voice sample and improve the collection efficiency on the premise of ensuring the robustness of the isolated word voice recognition.

Next, a speech recognition method for isolated words disclosed in the embodiment of the present invention is described, referring to fig. 3, which may include the following steps:

step S11, collecting an initial voice sample, wherein the initial voice sample contains the voice of the "recognized word" designated by the customer and some other interference voices.

In this embodiment, the voice samples are collected in batches, and for the "recognition word" specified by the client, the voice of the "recognition word" specified by the client is collected in the batch. To increase the diversity of the speech samples, the batch may collect some interfering speech.

The voice samples collected in the first batch or the first batches comprise isolated word voice in a noisy environment and isolated word voice in a fixed environment, and only isolated word voice in the fixed environment can be collected in the subsequent batches.

And step S12, calculating the spectrogram of the isolated word voice in the noisy environment and the isolated word voice in the fixed environment.

A speech signal sampled at 16 kHz is framed with a frame length of 32 ms and an overlap of 16 ms, and the Fbank spectrogram of the framed signal is computed using 40 Mel-scale triangular filters.
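Under this framing scheme (a 32 ms frame with a 16 ms overlap, i.e. a 16 ms hop), the number of frames for a given utterance length can be checked with a short helper; the function name is illustrative:

```python
def num_frames(duration_s, sample_rate=16000, frame_ms=32, hop_ms=16):
    """Number of full frames when framing `duration_s` seconds of audio
    with the given frame length and hop (overlap = frame - hop)."""
    n = int(duration_s * sample_rate)            # total samples
    frame = sample_rate * frame_ms // 1000       # samples per frame (512)
    hop = sample_rate * hop_ms // 1000           # samples per hop (256)
    return 1 + (n - frame) // hop

# A 1.04 s utterance yields 64 frames; with 40 Mel filters this gives
# the 64x40 Fbank input used later in the embodiment.
```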

And step S13, taking the spectrogram of the isolated word speech in the noisy environment as input, and calculating semantic features and spectrogram reconstruction results by utilizing a Y-shaped network.

The Fbank spectrogram computed from 1.04 s of speech is used as the input of the "Y-type" network, i.e. x_{i,j} has size 64×40. The "Y-type" network of this embodiment is shown in FIG. 4. In the figure, "Conv2 5×5" refers to a 2D convolutional layer with a 5×5 kernel and a stride of 2; GAP refers to the global average pooling layer; FC refers to the fully connected layer; "DeConv2" refers to a 2D deconvolution layer with a 3×3 kernel and a stride of 2; "Conv2 3×3" refers to a 2D convolutional layer with a 3×3 kernel and a stride of 1. The stride of residual blocks 1-0 and 1-1 is 2, and the stride of the other residual blocks is 1.
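As a quick sanity check on the stride-2 layers, the spatial size of a feature map can be traced with a small helper, assuming "same"-style padding (an assumption; the patent does not state its padding scheme):

```python
import math

def conv_out(size, stride):
    """Spatial output size of one conv/deconv dimension under
    'same'-style padding: ceil(size / stride)."""
    return math.ceil(size / stride)

# First stride-2 convolution applied to the 64x40 Fbank input:
h, w = conv_out(64, 2), conv_out(40, 2)  # halves each dimension
```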

And calculating semantic features and spectrogram reconstruction results by using the Y-shaped network.

And step S14, calculating the cost function of the Y-shaped network.

The cost function of the "Y-type" network is L = α·L_C + β·L_R, wherein L_C = −ln(y_i) and y_i is the semantic-feature output for the target word; L_R = ‖x̂_{i,j} − x′_{i,j}‖² is the mean-square reconstruction cost, wherein x_{i,j} and x′_{i,j} are spectrograms of "noisy environment" isolated word speech and "fixed environment" isolated word speech from the same person uttering the same word, and x̂_{i,j} is the spectrogram reconstruction result; α and β are adjustable coefficients between 0 and 1.

And step S15, updating the model parameters of the whole network according to the cost function of the Y-shaped network.

And updating the model parameters of the whole network by using a gradient descent method according to the result of the cost function of the Y-shaped network.

And step S21, collecting an initial voice sample, wherein the initial voice sample contains the voice of the client-specified recognized word collected in the fixed environment.

After training of the sub-network generating the regularized speech features is completed, only isolated word speech in a fixed environment is collected in subsequent batches.

And step S22, calculating a spectrogram of the isolated word voice of the fixed environment.

Step S23, using the spectrogram of the isolated word speech in "fixed environment" as input, and using local "Y-type" network to calculate semantic features, as shown in fig. 2.

Step S24, calculating the cost function of the partial "Y-type" network, i.e. L = L_C = −ln(y_i).

Step S25, "freezing" the parameters of the regularized speech feature generation sub-network, and updating the model parameters of the semantic feature generation sub-network according to the cost function of step S24.
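The "freeze" of step S25 can be sketched as a gradient-descent update that skips the frozen sub-network's parameters; the dict-based parameter store and the name prefixes are illustrative assumptions, not the patent's implementation:

```python
def sgd_step(params, grads, frozen, lr=0.01):
    """One gradient-descent update in which parameters named in `frozen`
    (e.g. the regularized-speech-feature sub-network) are left untouched,
    so only the semantic-feature sub-network is trained."""
    updated = {}
    for name, value in params.items():
        if name in frozen:
            updated[name] = value                     # frozen: keep as-is
        else:
            updated[name] = value - lr * grads[name]  # plain SGD update
    return updated
```

In a framework such as PyTorch the same effect is usually achieved by disabling gradients for the frozen sub-network and passing only the remaining parameters to the optimizer.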

The embodiment of the present invention further provides a device for training an isolated word speech recognition network model, as shown in fig. 5, the device for training an isolated word speech recognition network model includes:

the speech spectrum extraction module 101 is configured to perform framing on a speech signal and calculate speech features such as MFCC and Fbank;

a regularization speech feature generation module 102, configured to generate regularization features of isolated word speech;

the voice reconstruction module 103 is used for reconstructing voice and providing the voice to the cost function generation module;

a semantic feature generation module 104, configured to generate semantic features of isolated word speech;

a semantic classification module 105, configured to calculate a probability that the input speech is classified into a certain "recognized word";

a cost function generating module 106, configured to calculate a classification cost and a reconstruction cost;

a network parameter "freeze" indication module 107, configured to determine whether the model parameters in the regularized speech feature generation module can be updated through training; and a network parameter updating module 108 for updating the model parameters.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
