End-to-end speech recognition model training method, speech recognition method and related device


Note: this technology, "End-to-end speech recognition model training method, speech recognition method and related device", was designed and created by 吴振宗, 徐易楠, 康世胤 and 许佳 on 2021-07-22. Its main content is as follows: the invention provides an end-to-end speech recognition model training method, a speech recognition method, and a related device. The method comprises: obtaining a trained language model according to a text corpus; and constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model. The invention first trains a language model on a text corpus several orders of magnitude larger than the available audio data, so that the language model can learn more linguistic knowledge; the trained language model is then used to construct an end-to-end speech recognition model, which is trained in combination with the audio corpus. This not only prevents the trained model's recognition accuracy from being degraded by polyphonic characters, but also avoids the high cost of labeling the audio corpus before training.

1. A method for end-to-end speech recognition model training, the method comprising:

obtaining a trained language model according to the text corpus;

constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model.

2. The method according to claim 1, wherein constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model comprises:

constructing the language model as the decoding module of the end-to-end speech recognition model to obtain the constructed end-to-end speech recognition model;

for the constructed end-to-end speech recognition model, keeping all model parameters other than the cross-attention parameters of the language model fixed, and training the language model according to the audio corpus;

wherein the other model parameters include: model parameters in the end-to-end speech recognition model other than those of the language model, and parameters in the language model other than the cross-attention parameters; and the cross-attention parameters are used for calculating an attention score of output information of an encoder module of the end-to-end speech recognition model;

when it is determined that a loss value of a loss function of the end-to-end speech recognition model has decreased to a first value and no longer changes, keeping the model parameters other than those of the language model in the end-to-end speech recognition model fixed, and training the language model according to the audio corpus; and

when it is determined that the loss value of the loss function has decreased to a second value and no longer changes, obtaining the trained end-to-end speech recognition model, wherein the first value is greater than the second value.

3. The method of claim 2, further comprising:

configuring a weight parameter for each layer of the language model, wherein the weight parameter represents a probability that the output information is filtered out; and

wherein the cross-attention parameters are used to calculate the attention score of the output information of the encoder module of the end-to-end speech recognition model as follows:

obtaining the attention score of the output information according to the output information, the weight parameter of the current layer, the cross-attention parameters, and the calculation result of the layer preceding the current layer.

4. The method of claim 1, wherein obtaining the trained language model from the text corpus comprises:

obtaining a spoken-language text corpus and a business text corpus, wherein the spoken-language text corpus is a set of texts collected in arbitrary scenarios, and the business text corpus is a set of user texts collected in a business scenario;

pre-training an initial language model according to the spoken-language text corpus to obtain a pre-trained language model; and

fine-tuning the pre-trained language model according to the business text corpus to obtain the trained language model.

5. A method of speech recognition, the method comprising:

acquiring speech to be recognized; and

inputting audio features of the speech to be recognized into a trained end-to-end speech recognition model for recognition to obtain target text corresponding to the speech to be recognized.

6. The speech recognition method of claim 5, wherein the end-to-end speech recognition model is trained by:

obtaining a trained language model according to the text corpus;

constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to a labeled audio corpus to obtain the trained end-to-end speech recognition model.

7. The speech recognition method of claim 6, wherein the language model is trained by:

obtaining a spoken-language text corpus and a business text corpus, wherein the spoken-language text corpus is a set of texts collected in arbitrary scenarios, and the business text corpus is a set of user texts collected in a business scenario;

pre-training an initial language model according to the spoken-language text corpus to obtain a pre-trained language model; and

fine-tuning the pre-trained language model according to the business text corpus to obtain the trained language model.

8. The speech recognition method of claim 5, wherein obtaining the speech to be recognized comprises:

obtaining the speech to be recognized in response to an entry operation instruction in a speech entry area on a user interface; and

wherein inputting the audio features of the speech to be recognized into the trained end-to-end speech recognition model for recognition to obtain the target text corresponding to the speech to be recognized comprises:

in response to a recognition instruction on the user interface, performing feature extraction on the speech to be recognized to obtain the audio features, and inputting the audio features into the trained end-to-end speech recognition model for recognition to obtain the target text.

9. The speech recognition method of claim 8, further comprising:

when a preview instruction is received on the user interface, displaying the target text in a preview area.

10. An end-to-end speech recognition model training apparatus, comprising:

a training module configured to obtain a trained language model according to a text corpus, construct an end-to-end speech recognition model according to the language model, and train the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model.

11. A speech recognition apparatus, comprising:

an obtaining module configured to acquire speech to be recognized; and

a recognition module configured to input audio features of the speech to be recognized into a trained end-to-end speech recognition model for recognition to obtain target text corresponding to the speech to be recognized.

12. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being operable to execute the computer program to perform the method of any of claims 1 to 4 or the method of any of claims 5 to 9.

13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4 or the method according to any one of claims 5-9.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to an end-to-end speech recognition model training method, a speech recognition method, and a related device.

Background

Automatic Speech Recognition (ASR) is the process of converting audio collected by a microphone into text. In recent years, end-to-end speech recognition (End-to-End ASR, E2E-ASR for short) has gradually become mainstream, and its recognition performance surpasses that of traditional speech recognition models.

However, current end-to-end speech recognition models require an audio training corpus, and many scenarios lack sufficient audio data. The model can therefore learn only limited linguistic knowledge, and because it contains no language model, polyphonic characters easily cause errors during decoding, reducing recognition accuracy. Moreover, the audio corpus must be labeled before training, which is costly.

Disclosure of Invention

An objective of the present invention is to provide an end-to-end speech recognition model training method, a speech recognition method and a related apparatus, so as to improve the accuracy of an end-to-end speech recognition model.

Embodiments of the invention may be implemented as follows:

In a first aspect, the present invention provides a method for training an end-to-end speech recognition model, the method comprising: obtaining a trained language model according to a text corpus; and constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model.

In a second aspect, the present invention provides a speech recognition method, the method comprising: acquiring speech to be recognized; and inputting audio features of the speech to be recognized into a trained end-to-end speech recognition model for recognition to obtain target text corresponding to the speech to be recognized.

In a third aspect, the present invention provides an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being capable of executing the computer program to implement the method of the first aspect or the method of the second aspect.

In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect or the method of the second aspect.

The invention provides an end-to-end speech recognition model training method, a speech recognition method, and a related device. The method comprises: obtaining a trained language model according to a text corpus; and constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model. The difference from the prior art is as follows: an existing end-to-end speech recognition model requires an audio training corpus, which many scenarios lack, so the model can learn only limited linguistic knowledge; because it contains no language model, polyphonic characters easily cause decoding errors that reduce recognition accuracy; and the audio corpus must be labeled before training, which is costly. The invention therefore first trains a language model on a text corpus several orders of magnitude larger, so that the language model can learn more linguistic knowledge, and then uses the trained language model to construct an end-to-end speech recognition model that is trained in combination with the audio corpus. This not only prevents the trained model's recognition accuracy from being degraded by polyphonic characters, but also avoids the high cost of labeling the audio corpus before training.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.

FIG. 1 is a diagram of a prior art training framework for an end-to-end speech recognition model;

FIG. 2 is a frame diagram of a shallow fusion scheme proposed in the related art;

FIG. 3 is a schematic flow chart of an end-to-end speech recognition model training method according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of an implementation of step S305 provided by an embodiment of the present invention;

FIG. 5 is a diagram of a training framework for an end-to-end speech recognition model according to an embodiment of the present invention;

FIG. 6 is a schematic flowchart of an implementation of step S303 provided by an embodiment of the present invention;

FIG. 7 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;

FIG. 8 is an implementation of a user interface provided by an embodiment of the present invention;

FIG. 9 is a functional block diagram of an end-to-end speech recognition model training apparatus according to an embodiment of the present invention;

FIG. 10 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present invention;

FIG. 11 is a structural block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

In the description of the present invention, it should be noted that terms such as "upper", "lower", "inside", and "outside", where used, indicate orientations or positional relationships based on those shown in the drawings or those in which the product of the invention is customarily used. They are used only for convenience and simplicity of description, do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.

At present, speech recognition is widely applied in smartphones, smart homes, smart in-vehicle devices, and intelligent customer-service robots, and will reach into every aspect of learning, daily life, and work in the future.

Traditional speech recognition is mainly based on hidden Markov model-deep neural network (HMM-DNN) modeling. Hidden Markov models have inherent modeling limitations, and the decoder relies on hand-crafted resources such as pronunciation dictionaries and language models. While these manual rules achieve good results when data is scarce, they cannot fully exploit the modeling potential of large amounts of data. Therefore, in recent years, end-to-end speech recognition (E2E-ASR for short) has gradually become mainstream: its recognition performance surpasses that of traditional speech recognition models, and because the model is small and needs no additional language model, it can easily be deployed on devices and applied widely across fields.

Popular end-to-end speech recognition models currently adopt a hybrid decoding network that combines a Connectionist Temporal Classification (CTC) model with an attention mechanism, for the following reasons. CTC decoding recognizes speech by predicting an output for each frame under the assumption that the frames are decoded independently of one another; it therefore misses the connection between preceding and following speech features during decoding and depends on a language model for correction. Decoding with an attention mechanism alone, by contrast, is independent of the frame order of the input speech: each decoding step generates the current result from the previous step's result and the overall speech features, ignoring the monotonic temporal order of speech. A hybrid decoding framework is therefore generally adopted to balance the strengths and weaknesses of the two approaches.

Based on the basic framework of the end-to-end speech recognition model described above, FIG. 1 shows a prior-art training framework for an end-to-end speech recognition model. The existing training process is as follows: the output of the encoder is fed simultaneously into a CTC model and a decoder, and a labeled audio corpus is added during decoder training. The loss function of the end-to-end speech recognition model is a weighted sum of the CTC loss and the decoder loss in fixed proportions, for example 0.3 × CTC loss + 0.7 × GPT-2 loss, and training ends when the weighted loss value of the end-to-end speech recognition model has decreased and stabilized.
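As an illustration of this weighted loss, the following is a minimal PyTorch sketch. The 0.3/0.7 split follows the example in the text, while the function name, tensor shapes, and the `ignore_index` convention are assumptions made for illustration rather than the patent's actual implementation.

```python
import torch.nn.functional as F

def joint_loss(ctc_log_probs, targets, input_lengths, target_lengths,
               decoder_logits, decoder_targets, ctc_weight=0.3):
    # CTC branch: ctc_log_probs has shape (time, batch, vocab).
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0)
    # Decoder (GPT-2) branch: token-level cross-entropy.
    dec = F.cross_entropy(decoder_logits.view(-1, decoder_logits.size(-1)),
                          decoder_targets.view(-1), ignore_index=-100)
    # Weighted sum: 0.3 * CTC loss + 0.7 * GPT-2 loss.
    return ctc_weight * ctc + (1.0 - ctc_weight) * dec
```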

It can be seen that an end-to-end speech recognition model requires an audio training corpus during training, yet many scenarios lack sufficient audio data. The model can therefore learn only limited linguistic knowledge, and because it contains no language model, polyphonic characters easily cause errors during decoding, reducing recognition accuracy; moreover, the audio corpus must be labeled before training, which is costly.

To solve the above technical problem, the related art proposes a scheme that fuses an end-to-end speech recognition model with a language model. Referring to FIG. 2, FIG. 2 is a framework diagram of the shallow fusion scheme proposed by the related art. The core of the scheme is as follows: without changing the original end-to-end speech recognition model, an additional language model (such as a GPT-2 model) is introduced, the scores of the two modules are added for re-scoring, and the character sequence with the highest score is retained.

For example, with continued reference to FIG. 2, inputting the audio feature z into the speech recognition model (ASR) yields a first probability distribution Pasr; inputting the text decoded in the previous step and the hidden-layer information H of the language model (LM) into the language model yields a second probability distribution Plm. The distributions Pasr and Plm of the ASR model and the LM are added with a certain weight at each step until decoding finally finishes. Here, c1 and c2 shown in FIG. 2 are the characters with the maximum summed probability obtained at each decoding step.
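A minimal sketch of one such shallow-fusion decoding step follows. Combining the two distributions in the log domain is the conventional reading of "adding with a certain weight", and the fixed `lm_weight` value stands in for the unspecified weight; both are assumptions.

```python
import torch

def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine the ASR distribution (Pasr) and the LM distribution (Plm)
    and keep the highest-scoring token for this decoding step."""
    combined = asr_log_probs + lm_weight * lm_log_probs
    return combined.argmax(dim=-1)
```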

However, the applicant found that although this scheme alleviates the polyphone problem in decoding and improves recognition accuracy, the structure of the additional language model is so large that decoding becomes very slow.

Therefore, based on the end-to-end speech recognition model training framework shown in FIG. 1, the applicant proposes an end-to-end speech recognition model training method. Referring to FIG. 3, FIG. 3 is a schematic flowchart of an end-to-end speech recognition model training method provided by an embodiment of the present invention, and the method may include:

S304: obtaining a trained language model according to a text corpus.

In the embodiment of the present invention, the text corpus may be acquired from the network by any existing crawler technology, and the corpus content may relate to any field or scenario, for example, texts appearing in everyday conversation, chat software, live-streaming software, social software, and the like; it may also be text randomly generated from existing texts.

It can be foreseen that text corpora far outnumber audio corpora and are much easier to obtain than labeled audio corpora, so the language model can learn rich linguistic knowledge and thereby resolve the polyphone problem in the decoding process; once the audio corpus exceeds a certain duration, the language model offers no further decoding advantage.

It can also be foreseen that the text corpora used for training do not need to be labeled in advance, which reduces training cost and time and improves training efficiency.

In some possible embodiments, the language model is preferably a GPT-2 model. A GPT-2 model is formed by stacking decoder blocks from the existing Transformer framework and comes in different sizes: the smallest GPT-2 model has 12 decoder layers and the largest has 48, and the number of decoder layers can be customized according to actual needs.
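For concreteness, a toy sketch of such a customizable decoder stack is given below; the class name, model width, and head count are illustrative assumptions, not the patent's configuration.

```python
import torch.nn as nn

class TinyGPT2(nn.Module):
    """GPT-2-style stack of Transformer decoder blocks. n_layers is 12 for
    the smallest released GPT-2 and 48 for the largest, and can be set to
    any value needed (the embodiment described later uses 6)."""
    def __init__(self, vocab_size, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.lm_head = nn.Linear(d_model, vocab_size)
```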

S305: constructing an end-to-end speech recognition model according to the language model, and training the constructed end-to-end speech recognition model according to an audio corpus to obtain the trained end-to-end speech recognition model.

It can be understood that an existing end-to-end speech recognition model consists of two parts, a decoder group and an encoder group, where the encoder group is a stack of multiple encoders and the decoder group is a stack of multiple decoders. Moreover, the decoder group usually has 6 layers; therefore, to align with the model's output structure, the embodiment of the present invention uses a GPT-2 model with 6 layers.

Thus, when the GPT-2 model in the present invention has 6 layers, the trained GPT-2 model can replace the existing decoder group of the end-to-end speech recognition model. Because the GPT-2 model is trained on text corpora, it acquires more linguistic knowledge, so an end-to-end speech recognition model constructed from this GPT-2 language model can overcome the loss of accuracy caused by polyphonic characters.

In some possible embodiments, the corpus used for training the constructed end-to-end speech recognition model is a labeled audio corpus. Because the earlier GPT-2 training has already instilled sufficient linguistic knowledge, this later stage can be trained with a conventional audio corpus.

The difference from the prior art is as follows: an existing end-to-end speech recognition model requires an audio training corpus, which many scenarios lack, so the model can learn only limited linguistic knowledge; because it contains no language model, polyphonic characters easily cause decoding errors that reduce recognition accuracy; and the audio corpus must be labeled before training, which is costly. The invention therefore first trains a language model on a text corpus several orders of magnitude larger, letting it learn more linguistic knowledge, and then uses the trained language model to construct an end-to-end speech recognition model trained in combination with the audio corpus, which both prevents the accuracy loss caused by polyphonic characters and avoids the high cost of labeling the audio corpus before training.

Optionally, an implementation of training the constructed end-to-end speech recognition model according to the audio corpus is provided below. Referring to FIG. 4, FIG. 4 is a schematic flowchart of an implementation of step S305 provided by the embodiment of the present invention; step S305 may include the following sub-steps:

Sub-step S305-1: constructing the language model as the decoding module of the end-to-end speech recognition model to obtain the constructed end-to-end speech recognition model.

It can be understood that a new end-to-end speech recognition model is obtained by replacing the existing decoder group of the end-to-end speech recognition model with the trained GPT-2 model.

Sub-step S305-2: for the constructed end-to-end speech recognition model, keeping all model parameters other than the cross-attention parameters of the language model fixed, and training the language model according to the audio corpus.

The other model parameters mentioned above are: the parameters of the modules other than the language model in the end-to-end speech recognition model, and the parameters other than the cross-attention parameters within the language model.

For example, with continued reference to FIG. 1, the end-to-end speech recognition model includes an encoder module, a decoder module, and a CTC model, so the model parameters other than the language model may be the parameters of the encoder module and of the CTC model. Keeping all model parameters other than the cross-attention parameters fixed during training prevents the pre-trained knowledge of the language model from being destroyed.
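A minimal sketch of this first-stage freezing follows. It assumes the model exposes `encoder`, `ctc`, and `decoder` submodules and that the cross-attention parameter names contain `cross_attn`; both naming conventions are assumptions made for illustration.

```python
def freeze_all_but_cross_attention(model):
    """Stage 1: only the language model's cross-attention parameters train."""
    for module in (model.encoder, model.ctc):
        for p in module.parameters():
            p.requires_grad = False
    for name, p in model.decoder.named_parameters():
        # Keep pre-trained GPT-2 knowledge intact: train cross-attention only.
        p.requires_grad = "cross_attn" in name
```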

In the embodiment of the present invention, the language model that replaces the decoder of the end-to-end speech recognition model also has a layered structure. Each layer retains the existing self-attention mechanism (self-attention) and additionally adds a cross-attention mechanism (cross-attention). Self-attention performs the attention calculation with the output of the previous layer serving as the value, key, and query vectors; the input to the first layer comes from the encoder. Cross-attention is as follows: for each layer, the query vector from the previous layer is attended against the key and value vectors of the current layer, so that the model can attend to the entire input sequence rather than a final single vector, with larger values receiving larger weights. The cross-attention parameters are used to calculate the attention score of the output information of the encoder module of the end-to-end speech recognition model; the higher the attention score of a piece of output information, the larger its proportion in the result vector.
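The sketch below shows a cross-attention computation in the form described above, with queries derived from the previous decoder layer and keys/values computed from the encoder output, which is the conventional reading of this mechanism; the single-head form and the projection-weight arguments are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention(prev_layer_out, encoder_out, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.
    prev_layer_out: (T_dec, d) result of the previous decoder layer (queries).
    encoder_out:    (T_enc, d) encoder output (keys and values)."""
    q = prev_layer_out @ w_q
    k = encoder_out @ w_k
    v = encoder_out @ w_v
    scores = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    # Encoder frames with higher attention scores dominate the result vector.
    return scores @ v
```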

The purpose of adding a cross-attention mechanism to the language model in the present invention is to perform attention calculation between the acoustic information and the text information, combining the information of the encoder and the decoder, so that the model is not left without acoustic knowledge.

Sub-step S305-3: when it is determined that the loss value of the loss function of the end-to-end speech recognition model has decreased to a first value and no longer changes, keeping the model parameters other than those of the language model in the end-to-end speech recognition model fixed, and training the language model according to the audio corpus.

It can be understood that the loss function is a weighted sum of the loss function of the CTC model and the loss function of the decoder, each taken in its respective proportion; for example, the weighted loss value may be calculated as 0.3 × CTC loss + 0.7 × GPT-2 loss. When the weighted loss value of the end-to-end speech recognition model no longer changes from the first value, the second stage of training can begin.

Sub-step S305-4: when it is determined that the loss value of the loss function has decreased to a second value and no longer changes, obtaining the trained end-to-end speech recognition model, wherein the first value is greater than the second value.

To facilitate understanding of the above training scheme, refer to FIG. 5, a training framework diagram of an end-to-end speech recognition model provided by an embodiment of the present invention. Unlike the existing training framework shown in FIG. 1, the decoder module of the end-to-end speech recognition model in the present invention is a pre-trained GPT-2 language model with 6 decoder layers in total. During training, the parameters of the encoder module and the CTC model, together with all GPT-2 parameters other than its cross-attention parameters, are first kept fixed while the GPT-2 model is trained on the audio corpus; then, after the loss value has decreased and stopped changing, the remaining GPT-2 parameters are trained further until the loss stabilizes again, and the trained end-to-end speech recognition model is obtained.
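Putting sub-steps S305-2 to S305-4 together, a hypothetical training schedule might look like the following sketch; `train_one_epoch` and `loss_has_plateaued` are assumed helpers, not APIs from the patent, and `freeze_all_but_cross_attention` refers to the earlier sketch.

```python
def two_stage_training(model, audio_corpus, train_one_epoch,
                       loss_has_plateaued):
    # Stage 1: freeze everything except the LM's cross-attention parameters.
    freeze_all_but_cross_attention(model)
    while not loss_has_plateaued():   # loss settles at the first value
        train_one_epoch(model, audio_corpus)

    # Stage 2: unfreeze the whole language model; the encoder and CTC model
    # stay fixed. Training continues until the loss settles at the
    # (smaller) second value.
    for p in model.decoder.parameters():
        p.requires_grad = True
    while not loss_has_plateaued():
        train_one_epoch(model, audio_corpus)
    return model
```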

Optionally, in the training process described above, a weight parameter may also be configured for each layer of the language model, where the weight parameter represents the probability that the output information is filtered out. The cross-attention parameters are then used to calculate the attention score of the output information of the encoder module of the end-to-end speech recognition model as follows:

the attention score of the output information is obtained according to the output information, the weight parameter of the current layer, the cross-attention parameters, and the calculation result of the layer preceding the current layer.

It can be foreseen that giving each layer of the language model a weight parameter gives each layer a certain probability of discarding the encoder's output information, and the higher the layer, the higher the probability that the information is lost. This simulates the situation where the input is text only, with no audio, and lets the model retain its pre-trained knowledge, preventing overfitting.

For example, with continued reference to FIG. 5: GPT-2 has a 6-layer structure, and the weight parameters increase gradually from the bottom layer to the top layer. During training, the output information of the encoder is fed in turn to each of the 6 layers for attention calculation, and the audio-corpus text is also input at layer 1. The input of the current layer includes the calculation result of the previous layer and the output information of the encoder. For instance, at layer 1, the encoder output, the layer-1 weight parameter of 0.1, and the audio-corpus input enter the attention calculation to produce the layer-1 result; at layer 2, the encoder output, the layer-2 weight parameter of 0.2, and the layer-1 result enter the attention calculation to produce the layer-2 result; and so on, until the result of layer 6 is output to obtain the final probability distribution over the dictionary.
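A minimal sketch of this per-layer filtering follows. The linear 0.1, 0.2, … schedule follows the figure's example, while zeroing the encoder output as the "filtered" case is an illustrative assumption.

```python
import torch

def maybe_filter_encoder_output(encoder_out, layer_idx, base_p=0.1):
    """With probability 0.1 * (layer_idx + 1) -- 0.1 at layer 1 up to 0.6 at
    layer 6 -- drop the acoustic information for this layer, so the
    cross-attention sees text only (simulating audio-free input)."""
    p = base_p * (layer_idx + 1)
    if torch.rand(1).item() < p:
        return torch.zeros_like(encoder_out)
    return encoder_out
```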

The end-to-end speech recognition model obtained with this multi-stage pre-training strategy produces decoding results with fewer polyphone errors and smoother text, improving accuracy.

Optionally, an implementation of training the language model is provided below. Referring to FIG. 6, FIG. 6 is a schematic flowchart of an implementation of step S303 provided by the embodiment of the present invention; step S303 may include the following sub-steps:

Sub-step S303-1: obtaining a spoken-language text corpus and a business text corpus.

The spoken-language text corpus is a set of texts collected in arbitrary scenarios; the business text corpus is a set of user texts collected in a business scenario. The business scenario may be, but is not limited to, a live-streaming service, a game service, a social service, and the like.

Sub-step S303-2: pre-training the initial language model according to the spoken-language text corpus to obtain a pre-trained language model.

It can be understood that pre-training refers to training a model in advance (or to the model so obtained). During pre-training, texts that are too short, for example texts with a length of less than 5, are removed; the remaining texts are used as training samples, and training stops when a convergence condition is reached, yielding the pre-trained language model.

Sub-step S303-3: fine-tuning the pre-trained language model according to the business text corpus to obtain the trained language model.

It can be understood that fine-tuning refers to applying a pre-trained model to one's own business data set and adapting its parameters to that data. As in pre-training, texts that are too short are removed, the remaining texts are used as training samples, and training stops when a convergence condition is reached, yielding the desired language model.

Based on the obtained end-to-end speech recognition model, a speech recognition method is provided below. The speech recognition method can be applied to electronic devices such as smartphones, tablet computers, smart-home devices, smart in-vehicle devices, and intelligent customer-service robots, without limitation here. Taking application of the method in a smartphone as an example, refer to FIG. 7, a schematic flowchart of a speech recognition method according to an embodiment of the present invention; the method may include:

S703: acquiring speech to be recognized.

It can be understood that the speech to be recognized may be speech data pre-stored in the smartphone, speech data captured in real time by the smartphone, or speech extracted from other audio/video data, which is not limited here.

S704: inputting audio features of the speech to be recognized into the trained end-to-end speech recognition model for recognition to obtain target text corresponding to the speech to be recognized.
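As a rough illustration of S703-S704, the sketch below extracts filter-bank features with torchaudio and passes them to the model. The `recognize` method and the 80-bin feature choice are assumptions, since the patent does not specify the feature front end.

```python
import torchaudio

def transcribe(model, wav_path):
    # Load audio (assumes a mono recording) and extract 80-dimensional
    # log-Mel filter-bank features.
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)
    # Feed the audio features to the trained end-to-end model.
    return model.recognize(feats)  # -> target text
```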

It is understood that the above-mentioned end-to-end speech recognition model can be obtained according to any one of the training methods in the above-mentioned embodiments, and will not be described herein again.

Optionally, a front-end implementation of the speech recognition method is provided below. Referring to FIG. 8, FIG. 8 shows an implementation of a user interface provided by an embodiment of the present invention. The user interface may be displayed on any intelligent electronic device, such as a smartphone, tablet computer, smart-home device, smart in-vehicle device, or intelligent customer-service robot, on which the end-to-end speech recognition model of any of the above embodiments is deployed.

As shown in FIG. 8, the user interface has a speech entry area, a start-recognition control, and a text preview area. When the electronic device receives an entry instruction in the speech entry area, it responds to the entry instruction to obtain the speech to be recognized.

In some possible embodiments, a user may record a speech signal in real time by operating the recording control, or may operate the file-upload control to upload a pre-recorded speech signal or speech extracted in advance from other audio/video data. When the device receives an operation instruction on the start-recognition control, it performs feature extraction on the speech to be recognized to obtain the audio features and then inputs the audio features into the end-to-end speech recognition model for recognition to obtain the target text.

In some possible embodiments, the user interface may further have a text preview area, and when a preview instruction is received on the user interface, the target text is displayed in the preview area. For example, if the content of the speech to be recognized is "please recognize the text of this speech for me", then after speech recognition the text "please recognize the text of this speech for me" may be displayed in the preview area.

In other possible embodiments, the electronic device may also search for data matching the recognized target text. For example, when the user records "please find me the nearest restaurant", the electronic device recognizes the speech as the text "please find me the nearest restaurant" and then performs a search based on that text.

To execute the end-to-end speech recognition model training method of the above embodiments and its various possible implementations, an implementation of an end-to-end speech recognition model training apparatus is provided below. Referring to FIG. 9, FIG. 9 is a functional block diagram of an end-to-end speech recognition model training apparatus according to an embodiment of the present invention. It should be noted that the basic principles and technical effects of the apparatus provided in this embodiment are the same as those of the above embodiments; for brevity, anything not mentioned here can be found in the corresponding content above. The end-to-end speech recognition model training apparatus 30 includes:

the training module 31, configured to obtain a trained language model according to the text corpus, construct an end-to-end speech recognition model according to the language model, and train the constructed end-to-end speech recognition model according to the audio corpus to obtain the trained end-to-end speech recognition model.

Optionally, the training module 31 is specifically configured to: construct the language model as the decoding module of the end-to-end speech recognition model to obtain the constructed end-to-end speech recognition model; for the constructed end-to-end speech recognition model, keep all model parameters other than the cross-attention parameters of the language model fixed and train the language model according to the audio corpus, wherein the other model parameters include the model parameters in the end-to-end speech recognition model other than those of the language model and the parameters in the language model other than the cross-attention parameters, and the cross-attention parameters are used for calculating an attention score of the output information of the encoder module of the end-to-end speech recognition model; when it is determined that the loss value of the loss function of the end-to-end speech recognition model has decreased to a first value and no longer changes, keep the model parameters other than those of the language model fixed and train the language model according to the audio corpus; and when it is determined that the loss value of the loss function has decreased to a second value and no longer changes, obtain the trained end-to-end speech recognition model, wherein the first value is greater than the second value.

Optionally, the end-to-end speech recognition model training apparatus 30 further includes a configuration module configured to configure a weight parameter for each layer of the language model, where the weight parameter represents the probability that the output information is filtered out. The cross-attention parameters are then used to calculate the attention score of the output information of the encoder module of the end-to-end speech recognition model as follows: the attention score of the output information is obtained according to the output information, the weight parameter of the current layer, the cross-attention parameters, and the calculation result of the layer preceding the current layer.

Optionally, the training module 31 is further specifically configured to: obtain a spoken-language text corpus and a business text corpus, where the spoken-language text corpus is a set of texts collected in arbitrary scenarios and the business text corpus is a set of user texts collected in a business scenario; pre-train the initial language model according to the spoken-language text corpus to obtain a pre-trained language model; and fine-tune the pre-trained language model according to the business text corpus to obtain the trained language model.

To execute the steps of the speech recognition method of the above embodiments and its various possible implementations, an implementation of a speech recognition apparatus is provided below. Referring to FIG. 10, FIG. 10 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present invention. It should be noted that the basic principles and technical effects of the apparatus provided in this embodiment are the same as those of the above embodiments; for brevity, anything not mentioned here can be found in the corresponding content above. The speech recognition apparatus 40 includes:

the obtaining module 41, configured to acquire the speech to be recognized; and

the recognition module 42, configured to input the audio features of the speech to be recognized into the trained end-to-end speech recognition model for recognition to obtain the target text corresponding to the speech to be recognized.

Optionally, the speech recognition apparatus 40 further includes a processing module configured to obtain the speech to be recognized in response to an entry operation instruction in the speech entry area on the user interface, and, in response to a recognition instruction on the user interface, perform feature extraction on the speech to be recognized to obtain the audio features and input the audio features into the trained end-to-end speech recognition model for recognition to obtain the target text.

Optionally, the processing module is further configured to display the target text in a preview area when a preview instruction is received on the user interface.

An embodiment of the present invention further provides an electronic device. As shown in FIG. 11, FIG. 11 is a structural block diagram of an electronic device according to an embodiment of the present invention. The electronic device 80 comprises a communication interface 81, a processor 82, and a memory 83. The processor 82, the memory 83, and the communication interface 81 are electrically connected to one another, directly or indirectly, to enable the transfer and interaction of data; for example, these components may be electrically connected via one or more communication buses or signal lines. The memory 83 may be used to store software programs and modules, such as the program instructions/modules corresponding to the end-to-end speech recognition model training method or the speech recognition method provided by the embodiments of the present invention, and the processor 82 executes the software programs and modules stored in the memory 83 to perform various functional applications and data processing. The communication interface 81 may be used to exchange signaling or data with other node devices. The electronic device 80 may have a plurality of communication interfaces 81 in the present invention.

The memory 83 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like.

The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.

Alternatively, the modules may be stored in the memory shown in FIG. 11 in the form of software or firmware, or solidified in the operating system (OS) of the electronic device, and may be executed by the processor in FIG. 11. Meanwhile, the data, program code, and the like required to execute the above modules may be stored in the memory.

An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the end-to-end speech recognition model training method or the speech recognition method of any of the foregoing embodiments. The computer-readable storage medium may be, but is not limited to, a USB flash drive, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a magnetic disk, an optical disc, or other media that can store program code.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
