Speech recognition method, computer program product, and electronic device

Document No.: 193357 · Publication date: 2021-11-02

Note: This patent application, "一种语音识别方法、计算机程序产品及电子设备" (A speech recognition method, computer program product, and electronic device), was created by 吴振宗, 徐易楠, 康世胤, and 许佳 on 2021-08-25. Abstract: The application provides a speech recognition method, a computer program product, and an electronic device. The method is applied to an end-to-end speech recognition model comprising an encoding sub-model, a decoding sub-model, and a language sub-model. Target text units whose confidence falls below a threshold in the text sequence decoded by the decoding sub-model are masked, and the language sub-model decodes the masked sequence according to the feature sequence output by the encoding sub-model to obtain the recognized text. Introducing a language sub-model into the end-to-end speech recognition model effectively reduces the influence of polyphonic characters on the recognition result, thereby improving speech recognition accuracy.

1. A speech recognition method, applied to an end-to-end speech recognition model, the end-to-end speech recognition model comprising an encoding sub-model, a decoding sub-model, and a language sub-model, the method comprising:

acquiring a feature sequence output by the encoding sub-model after encoding speech features, wherein the speech features are obtained by performing feature extraction on a speech signal;

acquiring a text sequence output by the decoding sub-model after decoding the feature sequence, wherein the text sequence comprises at least one text unit;

masking at least a target text unit in the text sequence whose confidence is lower than a preset threshold, to obtain a masked sequence;

inputting the masked sequence and the feature sequence into the language sub-model, so that the language sub-model decodes the masked sequence according to the feature sequence; and

acquiring the recognized text output by the language sub-model.

2. The method of claim 1, wherein the language sub-model is a model optimized using spoken-language text, the spoken-language text comprising at least one text unit, and the optimization of the language sub-model comprises:

selecting, in the spoken-language text, at least one text unit for masking according to a preset selection probability;

inputting the masked spoken-language text into the language sub-model to be optimized; and

updating parameters of the language sub-model to be optimized according to a loss function.

3. The method of claim 1, wherein the number of target text units does not exceed a preset number threshold.

4. The method of claim 1, wherein the training process of the language sub-model comprises:

performing word segmentation on training text;

selecting at least one word in the text for masking according to a preset selection probability;

inputting the masked text into the language sub-model to be trained; and

updating parameters of the language sub-model to be trained according to a loss function.

5. The method of claim 4, wherein the masked sequence comprises at least the following three types of sequences:

a sequence in which only the target text unit is masked;

a sequence in which the target text unit and one text unit adjacent to it are masked; and

a sequence in which the target text unit and the other text unit adjacent to it are masked.

6. The method of claim 5, wherein the decoding process of the language sub-model comprises:

decoding, by the language sub-model, the three types of sequences respectively according to the feature sequence, and determining the recognized text as the decoded sequence with the highest confidence.

7. The method of claim 1, wherein the decoding process of the language sub-model comprises no more than a threshold number of cycles, and the termination condition of the decoding process is that:

the number of cycles reaches the number threshold; or

the confidence of each text unit in the text sequence output by a cycle is greater than a preset threshold.

8. The method of claim 2 or 4, wherein the preset selection probability is determined according to the decoding accuracy of the decoding sub-model.

9. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1-8.

10. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

acquiring a feature sequence output by a coding sub-model after encoding speech features, wherein the speech features are obtained by performing feature extraction on a speech signal;

acquiring a text sequence output by a decoding sub-model after decoding the feature sequence, wherein the text sequence comprises at least one text unit;

masking at least a target text unit in the text sequence whose confidence is lower than a preset threshold, to obtain a masked sequence;

inputting the masked sequence and the feature sequence into a language sub-model, so that the language sub-model decodes the masked sequence according to the feature sequence; and

acquiring the recognized text output by the language sub-model.

Technical Field

The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a computer program product, and an electronic device.

Background

Automatic Speech Recognition (ASR) is a technology that converts human speech into text. In the related art, the modeling process of an end-to-end speech recognition model is simple: a speech signal can be mapped directly to a text sequence by the model. One of the more mainstream end-to-end speech recognition models is the non-autoregressive model based on CTC (Connectionist Temporal Classification) and a prediction mask, but the accuracy of the recognition results output by such a model is limited.

Disclosure of Invention

The present application provides a speech recognition method, a computer program product, and an electronic device, which can effectively improve the accuracy of speech recognition.

According to a first aspect of embodiments of the present application, there is provided a speech recognition method applied to an end-to-end speech recognition model, where the end-to-end speech recognition model includes an encoding sub-model, a decoding sub-model, and a language sub-model; the method includes:

acquiring a feature sequence output by the encoding sub-model after encoding speech features, wherein the speech features are obtained by performing feature extraction on a speech signal;

acquiring a text sequence output by the decoding sub-model after decoding the feature sequence, wherein the text sequence comprises at least one text unit;

masking at least a target text unit in the text sequence whose confidence is lower than a preset threshold, to obtain a masked sequence;

inputting the masked sequence and the feature sequence into the language sub-model, so that the language sub-model decodes the masked sequence according to the feature sequence; and

acquiring the recognized text output by the language sub-model.

In some examples, the language sub-model is a model optimized using spoken-language text, the spoken-language text including at least one text unit, and the optimization of the language sub-model includes:

selecting, in the spoken-language text, at least one text unit for masking according to a preset selection probability;

inputting the masked spoken-language text into the language sub-model to be optimized; and

updating parameters of the language sub-model to be optimized according to a loss function.

In some examples, the number of target text units does not exceed a preset number threshold.

In some examples, the training process of the language sub-model includes:

performing word segmentation on the training text;

selecting at least one word in the text for masking according to a preset selection probability;

inputting the masked text into the language sub-model to be trained; and

updating parameters of the language sub-model to be trained according to a loss function.

In some examples, the masked sequences include at least the following three types of sequences:

a sequence in which only the target text unit is masked;

a sequence in which the target text unit and one adjacent text unit are masked; and

a sequence in which the target text unit and the other adjacent text unit are masked.

In some examples, the decoding process of the language sub-model includes:

decoding, by the language sub-model, the three types of sequences respectively according to the feature sequence, and determining the recognized text as the decoded sequence with the highest confidence.

In some examples, the decoding process of the language sub-model includes no more than a threshold number of cycles, and the termination condition of the decoding process is that:

the number of cycles reaches the number threshold; or

the confidence of each text unit in the text sequence output by a cycle is greater than a preset threshold.

In some examples, the preset selection probability is determined according to the decoding accuracy of the decoding sub-model.

According to a second aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.

According to a third aspect of embodiments of the present application, there is provided an electronic device, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

acquiring a feature sequence output by a coding sub-model after encoding speech features, wherein the speech features are obtained by performing feature extraction on a speech signal;

acquiring a text sequence output by a decoding sub-model after decoding the feature sequence, wherein the text sequence comprises at least one text unit;

masking at least a target text unit in the text sequence whose confidence is lower than a preset threshold, to obtain a masked sequence;

inputting the masked sequence and the feature sequence into a language sub-model, so that the language sub-model decodes the masked sequence according to the feature sequence; and

acquiring the recognized text output by the language sub-model.

The technical solutions provided by the embodiments of the present application may have the following beneficial effects:

The present application provides a speech recognition method, a computer program product, and an electronic device, in which target text units whose confidence falls below a threshold in the text sequence decoded by the decoding sub-model are masked, and the language sub-model decodes the masked sequence to obtain the recognized text. Introducing a language sub-model into the end-to-end speech recognition model effectively reduces the influence of polyphonic characters on the recognition result, thereby improving speech recognition accuracy.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

FIG. 1 is a diagram illustrating an end-to-end speech recognition model according to an embodiment of the present application.

FIG. 2 is a flow diagram illustrating a method of speech recognition according to one embodiment of the present application.

FIG. 3 is a diagram illustrating a text sequence and a masking sequence according to an embodiment of the present application.

FIG. 4 is a flow diagram illustrating a method of speech recognition according to another embodiment of the present application.

Fig. 5(a) is a schematic diagram of a BERT language model decoding process according to an embodiment of the present application.

Fig. 5(b) is a schematic diagram of a BERT language model decoding process according to another embodiment of the present application.

Fig. 6(a) is a flow chart illustrating a speech recognition method according to another embodiment of the present application.

Fig. 6(b) is a schematic diagram of a BERT language model decoding process according to another embodiment of the present application.

FIG. 7 is a flow diagram illustrating a method of speech recognition according to another embodiment of the present application.

Fig. 8 is a hardware block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

Automatic Speech Recognition (ASR) is a technology that converts human speech into text. In the related art, the modeling process of an end-to-end speech recognition model is simple: a speech signal can be mapped directly to a text sequence by the model. One of the more mainstream end-to-end models is the non-autoregressive model based on the CTC algorithm and a prediction mask (hereinafter, the Mask-CTC model). The Mask-CTC model combines an encoder-decoder structure, a prediction mask, and CTC training, but the accuracy of its recognition results is limited. The inventors found that the recognition results of this model are susceptible to polyphonic characters, which is one reason the accuracy is limited. A language model can be used to decide which word sequence is more likely, or to predict the most likely next word given the preceding words. In the ASR field, a language model limits the search space over words: by eliminating implausible candidates, it greatly constrains the matching process and can thus mitigate the accuracy loss caused by polyphonic characters. The present application therefore proposes a speech recognition method, applied to the end-to-end speech recognition model 100 shown in Fig. 1, where the end-to-end speech recognition model 100 includes an encoding sub-model 110, a decoding sub-model 120, and a language sub-model 130. The method comprises the steps shown in Fig. 2:

step 210: acquiring a feature sequence output by the encoding sub-model after encoding speech features, wherein the speech features are obtained by performing feature extraction on a speech signal;

step 220: acquiring a text sequence output by the decoding sub-model after decoding the feature sequence, wherein the text sequence comprises at least one text unit;

step 230: masking at least a target text unit in the text sequence whose confidence is lower than a preset threshold, to obtain a masked sequence;

step 240: inputting the masked sequence and the feature sequence into the language sub-model, so that the language sub-model decodes the masked sequence according to the feature sequence;

step 250: acquiring the recognized text output by the language sub-model.
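As a rough illustration (not the claimed implementation), the five steps above can be sketched as a pipeline in which the three sub-models are plain callables. The class name `EndToEndASR` and the callables passed to it are illustrative placeholders, not names from the application:

```python
from dataclasses import dataclass
from typing import Callable

MASK = "[MASK]"

@dataclass
class EndToEndASR:
    """Toy stand-in for the three-sub-model structure of Fig. 1."""
    encoder: Callable          # speech features -> feature sequence
    ctc_decoder: Callable      # feature sequence -> [(text_unit, confidence)]
    language_model: Callable   # (masked units, feature sequence) -> text units
    threshold: float = 0.90    # confidence threshold for masking

    def recognize(self, speech_features) -> str:
        feats = self.encoder(speech_features)              # step 210
        units = self.ctc_decoder(feats)                    # step 220
        masked = [u if conf >= self.threshold else MASK    # step 230
                  for u, conf in units]
        refined = self.language_model(masked, feats)       # step 240
        return "".join(refined)                            # step 250
```

With dummy callables in place of real networks, a unit whose confidence is below 0.90 is masked and then re-filled by the language-model callable, while high-confidence units pass through unchanged.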

The present application provides a speech recognition method in which target text units whose confidence falls below a threshold in the text sequence decoded by the decoding sub-model are masked, and the language sub-model decodes the masked sequence to obtain the recognized text. Introducing a language sub-model into the end-to-end speech recognition model effectively reduces the influence of polyphonic characters on the recognition result, thereby improving speech recognition accuracy.

In some embodiments, the end-to-end speech recognition model 100 may be based on the Mask-CTC model: the encoding sub-model 110 may be the encoder of the Mask-CTC model, the decoding sub-model 120 may be a CTC model, and the language sub-model 130 may be a BERT (Bidirectional Encoder Representations from Transformers) language model. Replacing the decoder of the Mask-CTC model with the BERT language model fuses a language model into the Mask-CTC model, which can alleviate the limited recognition accuracy caused by polyphonic characters.

Speech features are obtained from the speech signal by feature extraction, and may include FBank (filter bank) features, MFCCs (Mel-Frequency Cepstral Coefficients), LPCs (Linear Prediction Coefficients), and the like, which are not limited here. The speech features are encoded by the encoding sub-model 110 into a feature sequence, and the feature sequence is decoded by the decoding sub-model 120 into a text sequence. The text sequence includes at least one text unit, the smallest constituent unit of the sequence; if the text sequence is a Chinese sentence, a text unit may be a single Chinese character. Fig. 3 shows a text sequence according to an embodiment of the present application: text sequence A includes five text units T1-T5, each with a corresponding confidence. Target text units whose confidence is below the preset threshold are masked to obtain masked sequence B. The threshold may be set by a person skilled in the art according to actual needs; for example, with a threshold of 0.90, the target text unit T3 in this example would be masked to obtain masked sequence B.

Masking a target text unit may consist of marking it, e.g. marking the target text unit T3 as [mask] in this embodiment. In some embodiments, if more than one text unit in the text sequence has confidence below the threshold, the number of target text units may be capped at a preset number threshold. The number threshold may be a fixed value set according to actual needs, or it may be adjusted dynamically with the length of the text sequence; for example, it may be 20% of the total number of text units in the sequence, i.e. the number of target text units does not exceed 20% of the total.
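A minimal sketch of this masking step follows. The function name and the policy of masking the lowest-confidence units first when the cap is exceeded are illustrative assumptions, not details fixed by the application:

```python
import math

MASK = "[MASK]"

def build_mask_sequence(units, confidences, threshold=0.90, max_ratio=0.20):
    """Mask units whose confidence is below `threshold`, but never more than
    `max_ratio` of the sequence; if too many qualify, mask the
    lowest-confidence units first (an assumed tie-breaking policy)."""
    limit = max(1, math.floor(len(units) * max_ratio))
    # indices of low-confidence units, lowest confidence first, capped at `limit`
    low = sorted((i for i, c in enumerate(confidences) if c < threshold),
                 key=lambda i: confidences[i])[:limit]
    low = set(low)
    return [MASK if i in low else u for i, u in enumerate(units)]
```

For the five-unit example of Fig. 3 with a 0.90 threshold, only the single low-confidence unit (T3) is masked, since 20% of five units rounds down to one.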

After the masked sequence is input into the language sub-model, the language sub-model decodes it according to the feature sequence. In some embodiments, the decoding process consists of several cycles; each cycle outputs a sequence in which every text unit has a corresponding confidence. The termination condition of the decoding process may be that the number of cycles reaches a number threshold, or that the confidence of every text unit in a cycle's output exceeds the confidence threshold. In the example above, masked sequence B passes through several cycles after being input into the language sub-model; when the number of cycles reaches a preset threshold, say 10, decoding terminates and the output of the last cycle is taken as the recognized text. Alternatively, when every text unit in some cycle's output has confidence greater than the threshold (for example 0.9), the loop may terminate early and the current cycle's output is taken as the recognized text. Decoding with the language sub-model in this way yields recognized text that is more accurate than the original text sequence.
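The cyclic decoding with its two termination conditions can be sketched as follows, where `lm_step` stands in for one forward pass of the language sub-model and is an assumed interface:

```python
def iterative_decode(lm_step, mask_seq, feature_seq,
                     conf_threshold=0.90, max_cycles=10):
    """Repeatedly re-decode the masked sequence; stop when every unit's
    confidence clears the threshold (early exit) or when the cycle
    budget `max_cycles` is spent."""
    seq = mask_seq
    for _ in range(max_cycles):
        # one language-model pass: returns refined units and per-unit confidences
        seq, confs = lm_step(seq, feature_seq)
        if all(c > conf_threshold for c in confs):
            break
    return seq
```

With a stand-in `lm_step` that only becomes confident on its second pass, the loop exits early after two cycles instead of running the full budget.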

In some embodiments, the speech recognition method provided by the present application may be applied to a live-streaming scenario and executed by a live-streaming server, with the speech signal received from the streamer's (anchor's) terminal. Speech recognition converts what the streamer says during a live broadcast into text and generates subtitles in real time, improving the user experience. However, live broadcasts typically contain a large amount of spoken, colloquial language. To better adapt the language sub-model to the live-streaming scenario, in some embodiments the language sub-model may be optimized using spoken-language text comprising at least one text unit; the optimization process includes the steps shown in Fig. 4:

step 410: selecting, in the spoken-language text, at least one text unit for masking according to a preset selection probability;

step 420: inputting the masked spoken-language text into the language sub-model to be optimized;

step 430: updating parameters of the language sub-model to be optimized according to a loss function.

Taking the BERT language model as the language sub-model, the selection and masking of text units may follow BERT's default masking recipe: each character or word in a sentence is selected with probability 15%; a selected token is replaced with [mask] with probability 80%, left unchanged with probability 10%, and replaced with a random token with probability 10%. Optimizing the BERT language model on the masked spoken-language text updates its parameters so that it adapts to the colloquial live-streaming scenario.
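The 15% / 80-10-10 recipe can be sketched at the token level as follows; the function name and the `vocab` parameter are illustrative, and `targets` records which positions contribute to the loss:

```python
import random

MASK = "[MASK]"

def bert_style_mask(units, vocab, select_prob=0.15, rng=None):
    """Select each unit with `select_prob`; a selected unit is replaced
    by [MASK] 80% of the time, kept as-is 10%, and swapped for a random
    vocabulary unit 10% (the default BERT masking recipe)."""
    rng = rng or random.Random()
    out, targets = [], []
    for u in units:
        if rng.random() < select_prob:
            targets.append(u)          # this position is a prediction target
            r = rng.random()
            if r < 0.8:
                out.append(MASK)       # 80%: mask
            elif r < 0.9:
                out.append(u)          # 10%: keep unchanged
            else:
                out.append(rng.choice(vocab))  # 10%: random replacement
        else:
            targets.append(None)       # not selected, not a target
            out.append(u)
    return out, targets
```

With `select_prob=1.0` every position becomes a target; with `select_prob=0.0` the text passes through untouched, which is a quick sanity check of the recipe's two extremes.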

Furthermore, in some embodiments, the selection probability may be determined according to the decoding accuracy of the decoding sub-model. Taking a CTC model as the decoding sub-model, its decoding accuracy is usually greater than 80%, so the selection probability can be raised to 20%; that is, during optimization of the BERT language model, each character or word of a sentence in the spoken-language text is selected with probability 20%.

When the model is in use, the language sub-model, such as the BERT language model, decodes the masked sequence according to the feature sequence. In some embodiments, when the masked sequence contains two or more consecutive errors, the BERT language model has difficulty correcting them, which greatly affects the accuracy of the output. Fig. 5(a) shows a BERT language model decoding process according to an embodiment of the present application. The text sequence decoded by the CTC model (not shown) is the sentence "今天天气真好" ("the weather is really nice today") with one character misrecognized; here the text units are single Chinese characters. Based on the per-unit confidences, the misrecognized character is masked, giving the masked sequence "今天天[mask]真好". After this sequence is input into the BERT language model, BERT predicts from context that the most likely character at the [mask] position is "气", yielding the recognized text "今天天气真好". The BERT language model can thus accurately correct a single error in a text sequence. For two or more consecutive errors, however, consider Fig. 5(b), which shows a BERT language model decoding process according to another embodiment. Here the text sequence decoded by the CTC model contains two consecutive misrecognized characters where "天气" should appear: the first is "田" ("field") and the second is a low-confidence character. It should be noted that, because the model determines the target text unit by confidence, the target could be either of the two wrong characters; this embodiment takes the second as an example.
After the masked sequence "今天田[mask]真好" is input into the BERT language model, BERT predicts the most likely character at the [mask] position to be "七", yielding "今天田七真好" ("田七" is a plausible word in context but the sentence is still wrong). Clearly, the BERT language model has difficulty correcting two or more consecutive errors in a text sequence. To address this, in some embodiments the training process of the BERT language model may be improved to include the steps shown in Fig. 6(a):

step 610: performing word segmentation on the training text;

step 620: selecting at least one word in the text for masking according to a preset selection probability;

step 630: inputting the masked text into the language sub-model to be trained;

step 640: updating parameters of the language sub-model to be trained according to a loss function.

In this embodiment, when training the BERT language model, the training text is first segmented into words, and then at least one word is selected for masking according to the preset selection probability. That is, each word in the text is selected with probability 15%; a selected word is replaced with [mask] with probability 80%, left unchanged with probability 10%, and replaced with a random word with probability 10%. Because whole words are masked, the model learns to recover several consecutive characters at once.
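Word-level masking differs from the character-level recipe only in the unit that is masked. A sketch, using a pre-segmented word list in place of a real Chinese word segmenter (such as `jieba`, mentioned here only as an assumed example) and simplifying away the 80/10/10 split:

```python
import random

MASK = "[MASK]"

def word_level_mask(segmented_words, select_prob, rng):
    """Mask whole words, so that several consecutive characters can be
    masked (and must be recovered) together."""
    out = []
    for word in segmented_words:
        if rng.random() < select_prob:
            out.append(MASK * len(word))  # one [MASK] per character in the word
        else:
            out.append(word)
    return "".join(out)
```

Masking the two-character word "天气" produces two adjacent [mask] positions, which is exactly the consecutive-gap pattern the model must learn to repair.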

Furthermore, in some embodiments, the selection probability may be determined according to the decoding accuracy of the decoding sub-model. Taking a CTC model as the decoding sub-model, its decoding accuracy is usually greater than 80%, so the selection probability can be raised to 20%; that is, during training of the BERT language model, each word in the training text is selected with probability 20%.

Accordingly, in some embodiments, the masked sequences include at least the following three types: a sequence in which only the target text unit is masked; a sequence in which the target text unit and one adjacent text unit are masked; and a sequence in which the target text unit and the other adjacent text unit are masked. The BERT language model then decodes the three types of sequences according to the feature sequence and takes the sequence with the highest confidence as the recognized text, where the confidence of a decoded sequence is the average post-decoding confidence of the text units that were masked before decoding.
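A sketch of the three-variant decoding and best-confidence selection. The `lm` callable stands in for one BERT pass returning per-unit confidences, and the boundary handling (no left/right variant at the sequence edges) is an assumption:

```python
MASK = "[MASK]"

def three_variant_decode(lm, units, target_idx, feature_seq):
    """Mask (a) the target only, (b) target + left neighbour,
    (c) target + right neighbour; decode each variant and keep the
    hypothesis whose masked positions score the highest mean confidence."""
    variants = [{target_idx}]
    if target_idx > 0:
        variants.append({target_idx - 1, target_idx})
    if target_idx < len(units) - 1:
        variants.append({target_idx, target_idx + 1})
    best, best_score = None, -1.0
    for idx_set in variants:
        masked = [MASK if i in idx_set else u for i, u in enumerate(units)]
        decoded, confs = lm(masked, feature_seq)  # per-unit confidences
        # mean confidence over the positions that were masked in this variant
        score = sum(confs[i] for i in idx_set) / len(idx_set)
        if score > best_score:
            best, best_score = decoded, score
    return best
```

A toy `lm` that is only confident when both wrong characters are masked reproduces the "天气" example: the target-plus-left-neighbour variant wins and the sentence is fully corrected.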

Fig. 6(b) shows a BERT language model decoding process according to another embodiment of the present application. As in the example above, the text sequence decoded by the CTC model (not shown) contains two consecutive misrecognized characters where "天气" should appear, the second of which is the low-confidence target text unit; the text units are single Chinese characters. The masked sequences include the following three types: a sequence masking only the target unit; a sequence masking the target unit and its left neighbour "田"; and a sequence masking the target unit and its right neighbour "真". After the three sequences are input into the BERT language model, BERT predicts "七" for the single masked position in the first sequence, "天气" for the two masked positions in the second, and another, less plausible two-character filler for the third. Each masked text unit receives a confidence after decoding; by comparing the confidence of the decoded "七", the mean confidence of "天" and "气", and the mean confidence of the third pair, the sequence with the highest confidence (or highest mean confidence) is taken as the recognized text.

With this improvement, the BERT language model can effectively correct two or more consecutive errors in a text sequence. In some embodiments, the BERT language model decodes the three types of sequences in parallel, so that although there are more sequences to decode, decoding efficiency is not affected while decoding accuracy is improved.

The speech recognition method provided by the present application fuses a language model into the end-to-end speech recognition model, effectively reducing the influence of polyphonic characters on the recognition result, improving decoding accuracy by 5%, and making the recognized text more fluent.

In addition, the present application further provides a speech recognition method applied to a live-streaming server, where the server stores an end-to-end speech recognition model comprising an encoding sub-model, a CTC model, and a BERT language model; the method includes the steps shown in Fig. 7:

step 710: receiving a voice signal sent by a main broadcasting terminal, and extracting the characteristics of the voice signal to obtain voice characteristics;

step 720: acquiring a feature sequence output after the voice features are coded through the coding sub-model;

step 730: acquiring a text sequence output after the feature sequence is decoded by the CTC model, wherein the text sequence comprises at least one text unit;

step 740: determining a target text unit whose confidence is lower than a preset threshold in the text sequence, and obtaining the following three types of masking sequences: a sequence masking the target text unit; a sequence masking the target text unit and one adjacent text unit; and a sequence masking the target text unit and the other adjacent text unit;

step 750: inputting the three types of masking sequences and the feature sequence into the BERT language model, so that the BERT language model decodes the three types of masking sequences respectively according to the feature sequence;

wherein the decoding process includes a loop that runs no more than a number threshold of times, and the termination condition of the decoding process is that the number of loop iterations reaches the number threshold, or that the confidence of each text unit in the sequence output by the loop is greater than a preset threshold.

Step 760: determining, among the decoded sequences, the sequence with the highest confidence as the recognition text.
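
Steps 710 to 760 can be sketched end to end as follows. All five callables are stand-ins for the actual sub-models (the signatures are assumptions for illustration, not an API prescribed by the application), and only the first low-confidence unit is corrected here; the feature sequence is passed through to the BERT stand-in as in step 750.

```python
MASK = "[MASK]"

def recognize(speech, extract, encode, ctc_decode, bert_decode, threshold=0.5):
    """Sketch of steps 710-760. `ctc_decode` returns (text_units,
    confidences); `bert_decode(masked_units, feature_seq)` returns
    (filled_units, confidences_of_predicted_units)."""
    features = extract(speech)                           # step 710
    feature_seq = encode(features)                       # step 720
    units, confs = ctc_decode(feature_seq)               # step 730
    low = [i for i, c in enumerate(confs) if c < threshold]
    if not low:
        return "".join(units)                            # nothing to correct
    t = low[0]                                           # step 740: target unit

    def masked(positions):
        return [MASK if i in positions else u for i, u in enumerate(units)]

    # The three mask variants (degenerate at the sequence edges).
    variants = [masked({t}), masked({t - 1, t}), masked({t, t + 1})]
    candidates = [bert_decode(v, feature_seq) for v in variants]      # step 750
    filled, _ = max(candidates, key=lambda c: sum(c[1]) / len(c[1]))  # step 760
    return "".join(filled)
```

In practice `extract`, `encode`, `ctc_decode`, and `bert_decode` would wrap the feature extractor, the coding sub-model, the CTC model, and the BERT language model respectively.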

The BERT language model is a model optimized with spoken-language texts, and the texts used for training are subjected to word segmentation. For the specific implementation, refer to the above embodiments; details are not repeated here.

In the above voice recognition method, a target text unit whose confidence is lower than a threshold in the text sequence decoded by the CTC model is masked, and the BERT language model decodes the masking sequence to obtain the recognition text. Introducing the BERT language model into the end-to-end voice recognition model effectively reduces the influence of polyphonic characters on the recognition result and thus improves voice recognition accuracy. Moreover, because the BERT model is optimized with spoken-language texts and the training sentences are word-segmented, the model can adapt to the live-streaming scenario and more accurately correct two or more consecutive errors in the first-pass decoded sequence, further improving the applicability and recognition accuracy of the model.

Based on the speech recognition method described in any of the above embodiments, the present application further provides a computer program product including a computer program which, when executed by a processor, can be used to perform the speech recognition method described in any of the above embodiments.

Based on the voice recognition method described in any of the above embodiments, the present application further provides an electronic device, a schematic structural diagram of which is shown in Fig. 8. As shown in Fig. 8, at the hardware level the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it; the processor is configured to:

acquiring a feature sequence output after the voice features are coded through the coding sub-model, wherein the voice features are features of voice signals after feature extraction;

acquiring a text sequence output after the characteristic sequence is decoded by the decoding submodel, wherein the text sequence comprises at least one text unit;

at least masking a target text unit with the confidence level lower than a preset threshold value in the text sequence to obtain a masking sequence;

inputting the masking sequence and the feature sequence into the language sub-model so that the language sub-model decodes the masking sequence according to the feature sequence;

and acquiring the recognition text output by the language submodel.

In some examples, the language submodel is a model optimized using a spoken text, the spoken text including at least one text unit, the optimization of the language submodel including:

for the spoken text, selecting at least one text unit according to a preset selection probability and masking it;

inputting the masked spoken text into a language sub-model to be optimized;

and updating the parameters of the language sub-model to be optimized according to the loss function.

In some examples, the number of target text units does not exceed a preset number threshold.

In some examples, the training process for the language submodel includes:

performing word segmentation processing on the text for training;

selecting at least one word in the text according to a preset selection probability and masking it;

inputting the masked text into a language sub-model to be trained;

and updating the parameters of the language sub-model to be trained according to the loss function.
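
The four training-preparation steps above can be sketched as follows. The word segmenter is passed in as a callable (for Chinese this could be, e.g., jieba, though the application does not name a specific segmenter), and the masked positions are returned as labels for the loss computation; the function and parameter names are assumptions for illustration.

```python
import random

def prepare_masked_sample(text, segment, select_prob=0.15, mask_token="[MASK]"):
    """Word-segment the training text, then mask each whole word with
    probability `select_prob`; the original words at masked positions
    become the prediction targets. Masking whole words (rather than
    single characters) is what lets the model learn to fill several
    consecutive positions."""
    words = segment(text)
    inputs, targets = [], []
    for word in words:
        if random.random() < select_prob:
            inputs.append(mask_token)   # the whole word is masked as one unit
            targets.append(word)        # original word is the prediction target
        else:
            inputs.append(word)
            targets.append(None)        # no loss computed at this position
    return inputs, targets
```

As noted in the embodiments below, `select_prob` can be chosen according to the decoding accuracy of the first-pass decoding sub-model.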

In some examples, the masking sequences include at least the following three types of sequences:

masking the sequence of target text units;

masking the target text unit and one of the sequences of adjacent text units;

masking the target text unit and another sequence of text units adjacent thereto.

In some examples, the processor is configured to:

causing the language sub-model to decode the three types of sequences respectively according to the feature sequence, and determining the recognition text with the highest confidence among the decoded sequences.

In some examples, the decoding process of the language sub-model includes a loop that runs no more than a number threshold of times, and the termination condition of the decoding process is:

the number of loop iterations reaches the number threshold; or

the confidence of each text unit in the text sequence output by the loop is greater than a preset threshold.
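
The two termination conditions can be sketched as a simple loop. The `refine` callable stands in for one decoding pass of the language sub-model, and its `(sequence, confidences)` return shape is an assumption for illustration.

```python
def iterative_decode(units, refine, conf_threshold=0.9, max_loops=3):
    """Re-decode the sequence until every text unit's confidence exceeds
    `conf_threshold`, or until `max_loops` passes have run, whichever
    comes first: the two termination conditions described above."""
    for _ in range(max_loops):
        units, confs = refine(units)
        if all(c > conf_threshold for c in confs):
            break  # every unit is confident enough; stop early
    return units
```

The `for` bound enforces the number threshold, and the `break` implements the confidence-based early exit.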

In some examples, the preset selection probability is determined according to the decoding accuracy of the decoding sub-model.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
