Non-autoregressive speech recognition network, method and equipment based on bidirectional context


Abstract: The invention is applicable to the technical field of human language processing and provides a non-autoregressive speech recognition network, method, device and storage medium based on bidirectional context. The speech recognition network adopts a Transformer encoder-decoder structure: the encoder performs preliminary recognition on the input speech features to obtain a preliminary recognition result, and the decoder adjusts the preliminary recognition result using the bidirectional language information that this result provides, outputting the final speech recognition result. The decoder exploits the bidirectional language information through a preset attention mask applied to each of its multi-head self-attention layers, thereby making full use of the language information and improving the recognition effect; compared with methods that use two unidirectional decoders to exploit unidirectional language information separately, the structure is more efficient and unified.

1. A bi-directional context based non-autoregressive speech recognition network, wherein the speech recognition network employs a Transformer encoder-decoder architecture, and wherein:

the encoder of the speech recognition network is configured to perform preliminary recognition on the input speech features to obtain a preliminary recognition result;

and the decoder of the speech recognition network is configured to adjust the preliminary recognition result using the bidirectional language information provided by the preliminary recognition result and to output a final speech recognition result, wherein the decoder exploits the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder.

2. The speech recognition network of claim 1, wherein the attention mask is a two-dimensional matrix whose main-diagonal elements are all 0 and whose off-diagonal elements are all 1.

3. The speech recognition network of claim 1, wherein the decoder uses a position code as the query Q of its first multi-head self-attention layer and inputs the same key K and value V into each multi-head self-attention layer of the decoder.

4. A method of speech recognition based on the bi-directional context based non-autoregressive speech recognition network of any of claims 1-3, the method comprising:

performing preliminary recognition on input speech features through a trained encoder of the speech recognition network to obtain a preliminary recognition result;

and adjusting the preliminary recognition result through a trained decoder of the speech recognition network and outputting a final speech recognition result, wherein the decoder adjusts the preliminary recognition result using the bidirectional language information provided by the preliminary recognition result.

5. The method of claim 4, wherein before the preliminary recognition of the input speech features is performed by the trained encoder of the speech recognition network, the method further comprises:

performing joint training on the decoder and the encoder of the speech recognition network using a training set until the loss value of the speech recognition network is minimized, to obtain the trained speech recognition network.

6. The method of claim 5, wherein the joint loss function of the encoder and the decoder is as follows:

L = λ · L_CTC + (1 − λ) · L_CE

wherein L is the joint loss function, L_CTC is the connectionist temporal classification (CTC) loss of the encoder, L_CE is the cross-entropy loss of the decoder, and λ is a hyperparameter.

7. The method of claim 4, wherein the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism.

8. The method of claim 4, wherein the preliminary recognition result includes a text sequence length, and the decoder outputs the speech recognition result in parallel based on the text sequence length.

9. A speech recognition device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 4 to 8 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 4 to 8.

Technical Field

The invention belongs to the technical field of human language processing, and particularly relates to a non-autoregressive speech recognition network, a method, equipment and a storage medium based on bidirectional context.

Background

Speech recognition is widely used in scenarios such as in-vehicle applications, voice wake-up, human-machine communication and smart homes. The input of a speech recognition model is speech, and the output is the text of the speech content. Traditional speech recognition generally uses autoregressive decoding, i.e., the characters are output serially; this approach has high accuracy, but its speed falls far short of real-time requirements. In contrast, non-autoregressive methods predict characters in parallel and can meet real-time requirements, but they cannot model language information well and generally need to determine the length of the output sequence in advance before decoding; compared with autoregressive methods, this length prediction is difficult and the recognition accuracy is low. A large number of methods for improving non-autoregressive speech recognition have emerged in academia and industry. The method most commonly applied in industry at present is based on CTC (Connectionist Temporal Classification) (Alex Graves, Santiago Fernandez, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks [C]. International Conference on Machine Learning, 2006: 369-376.), but CTC models only the input speech features, which leads to a strong conditional independence assumption between output characters, so the mutual language information between output characters cannot be used; moreover, the computational complexity of the CTC method is the square of the input speech frame length, which is high. In recent years, with the cross-fertilization of methods across fields, the Transformer (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. Attention is all you need [C]. Conference and Workshop on Neural Information Processing Systems, 2017: 5998-6008.) architecture has also been introduced into speech recognition.

The present invention focuses on the non-autoregressive speech recognition problem. In speech recognition, whether speech information and language information can be fully utilized determines the quality of the recognition result, but non-autoregressive methods generally make poor use of language information and give poor recognition results. To improve the performance of non-autoregressive methods, Yosuke et al (Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict [C]. Conference of the International Speech Communication Association, 2020: 3655-3659.) proposed using part of the text output by the encoder as input to the decoder, masking low-confidence characters and re-predicting them using the language information provided by the unmasked characters. Tian et al (Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, et al. Spike-triggered non-autoregressive transformer for end-to-end speech recognition [C]. Conference of the International Speech Communication Association, 2020: 5026-5030.) directly used part of the speech coding features output by the encoder as input to the decoder in order to accelerate recognition. To reduce length prediction errors, Yosuke et al (Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, and Tetsunori Kobayashi. Improved Mask-CTC for non-autoregressive end-to-end ASR [C]. International Conference on Acoustics, Speech and Signal Processing, 2021: 8363-8367.) designed a length-prediction decoder that can dynamically adjust the length of the output sequence during decoding, thereby reducing length prediction errors. Furthermore, to further reduce the modeling difficulty of the decoder and increase the text information available to it, Song et al (Xingchen Song, Zhiyong Wu, Yiheng Huang, Chao Weng, Dan Su, Helen Meng. Non-autoregressive transformer ASR with CTC-enhanced decoder input [C]. International Conference on Acoustics, Speech and Signal Processing, 2021: 5894-5898.) used the CTC output of the encoder as the decoder input. However, the above methods either mask part of the language information according to a probability or directly use only unidirectional language information, which limits the language information available to the decoder and results in language information being wasted.

In autoregressive and streaming speech recognition, there has been research on bidirectional language information, such as the Transformer with a bidirectional decoder proposed by Dong et al (Dong M, He D, Luo C, et al. Transformer with a bidirectional decoder for speech recognition [C]. Conference of the International Speech Communication Association, 2020: 1773-1777.). Other methods use two completely separate decoders to model the two unidirectional kinds of language information, but such structures are complex and still lose the reverse-direction language information.

Disclosure of Invention

The invention aims to provide a non-autoregressive speech recognition network, method, device and storage medium based on bidirectional context, so as to solve the problem that language information cannot be fully utilized because existing non-autoregressive speech recognition methods predict with a unidirectional context.

In one aspect, the present invention provides a bidirectional-context-based non-autoregressive speech recognition network, wherein the speech recognition network employs a Transformer encoder-decoder architecture, wherein:

the encoder of the speech recognition network is configured to perform preliminary recognition on the input speech features to obtain a preliminary recognition result;

and the decoder of the speech recognition network is configured to adjust the preliminary recognition result using the bidirectional language information provided by the preliminary recognition result and to output a final speech recognition result, wherein the decoder exploits the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder.

Preferably, the attention mask is a two-dimensional matrix whose main-diagonal elements are all 0 and whose off-diagonal elements are all 1.

Preferably, the decoder takes a position code as the Q of its first multi-head self-attention layer and inputs the same K and V into each multi-head self-attention layer of the decoder.

In another aspect, the present invention further provides a speech recognition method based on the bidirectional-context-based non-autoregressive speech recognition network, the method comprising:

performing preliminary recognition on input speech features through a trained encoder of the speech recognition network to obtain a preliminary recognition result;

and adjusting the preliminary recognition result through a trained decoder of the speech recognition network and outputting a final speech recognition result, wherein the decoder adjusts the preliminary recognition result using the bidirectional language information provided by the preliminary recognition result.

Preferably, before the preliminary recognition of the input speech features by the trained encoder of the speech recognition network, the method further includes:

performing joint training on the decoder and the encoder of the speech recognition network using a training set until the loss value of the speech recognition network is minimized, to obtain the trained speech recognition network.

Preferably, the joint loss function of the encoder and the decoder is as follows:

L = λ · L_CTC + (1 − λ) · L_CE

wherein L is the joint loss function, L_CTC is the connectionist temporal classification (CTC) loss of the encoder, L_CE is the cross-entropy loss of the decoder, and λ is a hyperparameter.

Preferably, the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism.

Preferably, the preliminary recognition result includes a text sequence length, and the decoder outputs the speech recognition result in parallel based on the text sequence length.

In another aspect, the present invention also provides a speech recognition device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method as described above when executing the computer program.

In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.

In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder of the speech recognition network performs preliminary recognition on the input speech features to obtain a preliminary recognition result, and the decoder adjusts the preliminary recognition result using the bidirectional language information provided by that result and outputs the final speech recognition result, the decoder exploiting the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. The language information is thus fully utilized and the speech recognition effect improved, and compared with methods in which two unidirectional decoders exploit unidirectional language information separately, the structure is more efficient and unified.

Drawings

FIG. 1A is a schematic structural diagram of a bi-directional context-based non-autoregressive speech recognition network according to an embodiment of the present invention;

FIG. 1B is a diagram comparing the learning of a bidirectional context by a decoder with other methods according to an embodiment of the present invention;

FIG. 1C is a diagram illustrating an exemplary structure of the improved Transformer-based decoder according to an embodiment of the present invention;

FIG. 1D is a diagram illustrating an example of a structure of an attention mask according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating an implementation of a bi-directional context-based non-autoregressive speech recognition method according to a second embodiment of the present invention; and

fig. 3 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:

the first embodiment is as follows:

fig. 1A illustrates a structure of a bidirectional context-based non-autoregressive speech recognition network according to an embodiment of the present invention, and for convenience of description, only the relevant portions of the embodiment of the present invention are shown, which is detailed as follows:

in the speech recognition problem, the decoder in the transform structure can use the linguistic information in the input text sequence to predict the output. The autoregressive speech recognition method is to predict the speech by using the language information provided by the characters output before the current position, and because the autoregressive method is a serial decoding method, only the character information output before the time can be used. Many non-autoregressive methods use unidirectional linguistic information for prediction, however, non-autoregressive methods output text sequences in parallel, and the use of unidirectional linguistic information results in waste of reverse linguistic information. Therefore, the non-autoregressive speech recognition network based on bidirectional context provided by the embodiment adopts a Transformer encoder-decoder structure to perform speech recognition by using bidirectional language information, and the whole speech recognition network can be trained and tested end to end. The bidirectional context, i.e., bidirectional language information, proposed in this embodiment includes two directions, i.e., left to right and right to left.

As shown in FIG. 1A, the speech recognition network provided in this embodiment mainly includes an encoder 11 and a decoder 12 connected in sequence. The encoder 11 performs preliminary recognition on the input speech, and the decoder 12 adjusts the preliminary recognition result using the bidirectional language information provided by that result and outputs the final speech recognition result, the decoder exploiting the bidirectional language information through a preset attention mask applied to each of its multi-head self-attention layers. FIG. 1B compares learning a bidirectional context with a decoder against other methods: y1, y2 and y3 are characters, eos (end of sequence) is an end marker and sos (start of sequence) is a start marker; FIG. 1B(a) uses a unidirectional decoder to learn a left-to-right context, FIG. 1B(b) uses a unidirectional decoder to learn a right-to-left context, and FIG. 1B(c) uses the decoder provided in this embodiment to learn a bidirectional context.

In a specific implementation, the encoder 11 of the speech recognition network may adopt the encoder structure of the Speech-Transformer (Linhao Dong, Shuang Xu, and Bo Xu. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition [C]. International Conference on Acoustics, Speech and Signal Processing, 2018: 5884-5888.), which mainly consists of self-attention layers and fully connected layers and outputs encoded speech features and a text sequence by extracting global features of the input speech features. The decoder takes the encoded speech features and the character sequence as input and predicts the recognition result by further extracting speech information and language information. In doing so, each position can update itself using the bidirectional language information; even if the original input sequence contains erroneous recognition results, the decoder can adjust its own recognition result according to the other characters in the input sequence. Further, the output sequence of the decoder can be fed back into the decoder and recognized iteratively to further reduce the character error rate at the cost of slightly lower decoding speed. Because the encoder does not model language information, its output characters carry a strong conditional independence assumption; the decoder can eliminate this assumption by using the language information provided by the input character sequence and output a more accurate recognition result.

The number of iterations may be preset; preferably, the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism to improve decoding speed. Specifically, the adaptive iteration-stop mechanism means that when the output of the decoder in the current iteration is exactly the same as its input, iteration stops automatically: the language information available at each position in the next iteration would be identical to that of the current iteration, so the result would not change. The adaptive iteration-stop mechanism effectively improves the decoding speed.
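As an illustration of this mechanism, the following is a minimal PyTorch sketch; `decoder_step`, `initial_tokens` and `max_iters` are hypothetical names, not from the patent, and a real decoder would also consume the encoded speech features.

```python
import torch

def iterative_decode(decoder_step, initial_tokens: torch.Tensor,
                     max_iters: int = 10) -> torch.Tensor:
    """Refine the token sequence until it stops changing (or max_iters)."""
    tokens = initial_tokens
    for _ in range(max_iters):
        refined = decoder_step(tokens)     # one full pass through the decoder
        if torch.equal(refined, tokens):   # output == input: converged,
            break                          # so further passes cannot change it
        tokens = refined
    return tokens
```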

Preferably, the speech recognition network further comprises a convolutional downsampling layer, which downsamples the input speech signal and feeds the resulting speech features into the encoder; the downsampling removes redundant frames from the speech signal and reduces the computational complexity of the whole network. In a specific implementation, the speech signal may first be passed through the convolutional downsampling layer for N-fold downsampling, for example 4-fold, and the downsampled speech features are used as the input of the encoder.
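A minimal sketch of such a 4-fold downsampling front end is given below, assuming two stride-2 convolutions; the dimensions (80-dimensional input features, 256-dimensional model width) are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class ConvDownsample(nn.Module):
    """Two stride-2 convolutions give roughly 4x downsampling along time."""
    def __init__(self, in_dim: int = 80, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # after two stride-2 convs the feature axis shrinks to ceil(in_dim / 4)
        self.proj = nn.Linear(out_dim * ((in_dim + 3) // 4), out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature) -> add a channel axis for Conv2d
        h = self.conv(x.unsqueeze(1))
        b, c, t, f = h.shape
        # flatten channels and features, then project to the model dimension
        return self.proj(h.transpose(1, 2).reshape(b, t, c * f))
```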

In the embodiment of the invention, speech recognition has to solve two problems: predicting the output length and recognizing the output characters. In non-autoregressive speech recognition all characters are output in parallel, so the decoder needs to determine the length of the whole output sequence before decoding. Preferably, the preliminary recognition result includes the text sequence length, and the decoder outputs the speech recognition result in parallel based on that length, thereby enabling real-time operation. The encoder may use a CTC (Connectionist Temporal Classification) loss so that it can predict the sequence length at test time, and the decoder may use a CE (Cross Entropy) loss so that it can re-predict using the bidirectional language information provided by the sequence output by the encoder. Since the decoder takes the speech features output by the encoder as part of its input, the encoder and decoder can be trained jointly; preferably, the joint loss function of the encoder and decoder is as follows:

L = λ · L_CTC + (1 − λ) · L_CE

wherein L is the joint loss function, L_CTC is the connectionist temporal classification loss of the encoder, L_CE is the cross-entropy loss of the decoder, and λ is a hyperparameter; that is, the network is trained jointly with the CTC loss and the CE loss.
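As a concrete illustration, here is a minimal PyTorch sketch of this joint objective; the weighting λ = 0.3 matches the experimental setting reported below (CTC weight 0.3, CE weight 0.7), while the function name, tensor shapes and padding handling are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(ctc_log_probs: torch.Tensor,   # (time, batch, vocab), log-softmaxed
               enc_lens: torch.Tensor,        # (batch,) encoder output lengths
               ce_logits: torch.Tensor,       # (batch, target_len, vocab)
               targets: torch.Tensor,         # (batch, target_len) token ids
               target_lens: torch.Tensor,     # (batch,) target lengths
               lam: float = 0.3) -> torch.Tensor:
    # Encoder branch: CTC loss over the frame-level log-probabilities.
    ctc = F.ctc_loss(ctc_log_probs, targets, enc_lens, target_lens,
                     blank=0, zero_infinity=True)
    # Decoder branch: cross-entropy over the re-predicted character sequence.
    ce = F.cross_entropy(ce_logits.transpose(1, 2), targets)
    return lam * ctc + (1.0 - lam) * ce
```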

In the embodiment of the present invention, every position in the decoder can be predicted using the bidirectional context, but a naive bidirectional context causes an information leakage problem. In short, information leakage means that during training the output of the decoder can see information about its own input; as a result, at test time the decoder would not re-predict the input and would lose the ability to adjust the result using language information.

To prevent information leakage, the decoder preferably takes the position code as the Q of its first multi-head self-attention layer and inputs the same K and V into each multi-head self-attention layer of the decoder. In a specific implementation, as shown in FIG. 1C, the queries (Q), keys (K) and values (V) input to the decoder are modified relative to the original Transformer decoder; FIG. 1C includes the character encoding, position encoding, multi-head source attention layers and multi-head self-attention layers. Because of the residual connections in the decoder, a mapping of the position codes is first obtained:

P' = W_p · P

and P' is taken as the Q of the first multi-head self-attention layer of the decoder, wherein P' is the linear mapping of the position encoding, W_p is a linear mapping, and P is the positional encoding.

Then, the same K and V are input to each multi-head self-attention layer of the decoder:

K_i = V_i = C, 1 ≤ i ≤ I

wherein I is the total number of multi-head self-attention layers of the decoder, i is the index of the current multi-head self-attention layer, and C is the character embedding. The K and V of the multi-head source attention layers of the decoder may be determined based on the encoder states.
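To make this arrangement concrete, the following is a minimal sketch under stated assumptions: the layer count, model width and use of torch.nn.MultiheadAttention are illustrative, and the multi-head source attention layers that attend to the encoder output are omitted.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 256, 4, 6
W_p = nn.Linear(d_model, d_model)            # linear mapping of position codes
layers = nn.ModuleList(nn.MultiheadAttention(d_model, n_heads)
                       for _ in range(n_layers))

def decoder_self_attention(C: torch.Tensor, P: torch.Tensor,
                           self_mask: torch.Tensor) -> torch.Tensor:
    # C: character embedding, P: positional encoding, both (seq, batch, d_model)
    q = W_p(P)                               # Q of the first layer is W_p(P)
    for layer in layers:
        # the same K and V, namely the character embedding C, at every layer
        q, _ = layer(q, C, C, attn_mask=self_mask)
    return q
```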

In addition, to prevent information leakage, the attention mask described above is used so that at each position the language information of that position itself is invisible. As shown in FIG. 1D, the attention mask is preferably a two-dimensional matrix whose main-diagonal elements are all 0 and whose remaining elements are all 1; that is, the attention mask gives each position an attention weight of 0 to itself, thereby preventing information leakage.
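A minimal sketch of constructing such a mask follows; note that torch.nn.MultiheadAttention expects the opposite boolean convention (True marks positions that may not be attended), so the matrix of FIG. 1D is converted accordingly. The sequence length of 5 is only an example.

```python
import torch

L = 5                                         # example output-sequence length
mask = 1 - torch.eye(L, dtype=torch.int64)    # 0 on the main diagonal, 1 elsewhere
# In the torch convention (True = attention disallowed), this forbids each
# position from attending to itself only:
self_mask = mask == 0                         # True exactly on the main diagonal
```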

In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on the input speech features to obtain a preliminary recognition result, and the decoder adjusts that result using the bidirectional language information it provides and outputs the final speech recognition result, exploiting the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. The language information is thus fully utilized and the speech recognition effect improved, and compared with methods in which two unidirectional decoders exploit unidirectional language information separately, the structure is more efficient and unified.

Example two:

fig. 2 shows the implementation flow of the bidirectional-context-based non-autoregressive speech recognition method provided by the second embodiment of the present invention, which is implemented on the network of the first embodiment; for convenience of description, only the parts relevant to the embodiment of the present invention are shown, detailed as follows:

in step S201, an encoder of the trained speech recognition network performs a preliminary recognition on the input speech features to obtain a preliminary recognition result.

The embodiment of the invention is applicable to a speech recognition device, which may be a mobile phone, a tablet computer, a wearable device, a smart speaker, an in-vehicle device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA) or other device; the embodiment of the invention does not limit the specific type of the speech recognition device.

In the embodiment of the invention, before the preliminary recognition of the input speech features by the trained encoder, the speech recognition network needs to be trained. Preferably, a training set is used to jointly train the decoder and encoder of the speech recognition network until the loss value of the network is minimized, yielding the trained speech recognition network; this realizes end-to-end training and reduces training complexity and running time. Joint training trains the encoder and the decoder together.

When the speech recognition network is trained, its loss value can be calculated as a weighted sum of the respective losses of the encoder and the decoder; different weights correspond to different degrees of parameter updating, and the best model is obtained by tuning the weights. Preferably, the joint loss function of the encoder and the decoder is as follows:

L = λ · L_CTC + (1 − λ) · L_CE

wherein L is the joint loss function, L_CTC is the connectionist temporal classification loss of the encoder, L_CE is the cross-entropy loss of the decoder, and λ is a hyperparameter; that is, the network is trained jointly with the CTC loss and the CE loss.

When the speech recognition network is trained, preferably, two data augmentation strategies, spectrum augmentation and speed perturbation, are applied to the data in the training set to enhance the robustness of the speech recognition network.
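By way of illustration, here is a minimal sketch of the two augmentation strategies; the mask widths, the perturbation factors and the torchaudio-based resampling are assumptions for illustration, not settings from the patent.

```python
import torch
import torchaudio

def spectrum_augment(feats: torch.Tensor, f_mask: int = 10,
                     t_mask: int = 20) -> torch.Tensor:
    """feats: (time, freq). Zero out one random frequency band and time span."""
    T, F_dim = feats.shape
    f0 = torch.randint(0, max(1, F_dim - f_mask), (1,)).item()
    t0 = torch.randint(0, max(1, T - t_mask), (1,)).item()
    feats = feats.clone()
    feats[:, f0:f0 + f_mask] = 0.0            # frequency masking
    feats[t0:t0 + t_mask, :] = 0.0            # time masking
    return feats

def speed_perturb(wave: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    """Resample-based speed change, e.g. factor in {0.9, 1.0, 1.1}."""
    return torchaudio.functional.resample(wave, sr, int(sr / factor))
```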

In step S202, the preliminary recognition result is adjusted by the decoder of the trained speech recognition network, and the final speech recognition result is output, wherein the decoder adjusts the preliminary recognition result using the bidirectional language information provided by the preliminary recognition result.

In the embodiment of the invention, when the preliminary recognition result is adjusted by the trained decoder, each position can update itself using the bidirectional language information; even if the original input sequence contains erroneous recognition results, the decoder can adjust its own recognition result according to the other characters in the input sequence. Further, the output sequence of the decoder can be fed back into the decoder and recognized iteratively to further reduce the character error rate at the cost of slightly lower decoding speed. Preferably, the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism to improve decoding speed; specifically, when the output of the decoder in the current iteration is exactly the same as its input, iteration stops automatically.

Preferably, the preliminary recognition result includes the text sequence length, and the decoder outputs the speech recognition result in parallel based on that length, thereby enabling real-time operation.
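As one concrete way to obtain such a length at test time, the following sketch greedy-decodes the encoder's CTC branch, collapses repeats, drops blanks and counts the remaining tokens; `blank_id = 0` is an assumption, not a value from the patent.

```python
import torch

def ctc_greedy_length(log_probs: torch.Tensor, blank_id: int = 0) -> int:
    """log_probs: (time, vocab) frame-level log-probabilities."""
    ids = log_probs.argmax(dim=-1).tolist()
    tokens, prev = [], blank_id
    for t in ids:
        if t != prev and t != blank_id:   # collapse repeats, drop blanks
            tokens.append(t)
        prev = t
    return len(tokens)                    # predicted output-sequence length
```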

In the embodiment of the invention, the preliminary recognition result is obtained by the trained encoder performing preliminary recognition on the input speech features; the trained decoder then adjusts the preliminary recognition result and outputs the final speech recognition result, using the bidirectional language information provided by the preliminary recognition result. The language information is thus fully utilized and the recognition effect improved, and compared with methods in which two unidirectional decoders exploit unidirectional language information separately, the structure is more efficient and unified.

The speech recognition method provided by this embodiment is further verified and explained below with an experimental example:

(1) corpus used in this experimental example:

the Aishell1 corpus is a Hill Shell Mandarin open source Speech corpus, which is part of the Hill Shell Mandarin Chinese Speech database AISHELL-ASR 0009. In a quiet indoor environment, 400 speakers from different accent areas in China participate in recording, a high-fidelity microphone (44.1 kHz, 16-bit) with 16kHz audio down-sampling is used for manufacturing, and the recording time is 178 hours. And (4) transcription and labeling by professional voice proofreaders, and passing strict quality inspection. The text accuracy of the corpus is more than 95%, and the corpus is divided into a training set, a development set and a test set.

The Magicdata corpus is a Mandarin Chinese speech corpus released by Magic Data Technology. Recording took place in quiet indoor environments with 1000 native Mandarin speakers from mainland China, recorded on mobile phones, for a total of 755 hours. The text accuracy of the corpus is above 98%, and the corpus is likewise divided into a training set, a development set and a test set.

(2) Description of the experiments:

during model training, the experimental example uses two data amplification strategies of frequency spectrum enhancement and speed disturbance to enhance the robustness of the model. The experimental example adopts a pytorch1.7.0 deep learning framework, and is trained by using an Adam optimization strategy and a gradient accumulation strategy, wherein momentum parameters are set to be beta _1=0.9 and beta _2= 0.999. The initial learning rate was set to 0.0001 and the training batch was 32. All experiments were performed on a machine containing 4 NVIDIA Titan XP GPUs.

The corpora used in this example are the two open-source Mandarin corpora described above, and the speech in both is clean speech. Training uses the joint CTC and CE loss, with the CTC loss weight set to 0.3 and the CE loss weight set to 0.7. Several dropout layers exist in the network, with all drop probabilities set to 0.1.

(3) The experimental results are as follows:

to evaluate the effectiveness of this example, this example performed speech recognition tests in the above-mentioned corpus. The method provided by the embodiment is compared with the existing mainstream autoregressive and non-autoregressive voice recognition method, and comprises KERMIT, LASO, ST-NAR, Masked-NAT, CASS-NAT, CTC-enhanced Transformer, TS-NAT and AR Transformer.

The experimental results are shown in Tables 1 and 2, which report the results on the Aishell1 and Magicdata corpora respectively, where NAT-BC (Non-Autoregressive Transformer with Bidirectional Context) denotes the method described in this embodiment. The results show that the character error rate of the method described in this embodiment is lower than that of all the other non-autoregressive methods on the different corpora, and that its recognition speed is significantly faster than the autoregressive method while keeping a comparable character error rate, so it can meet real-time requirements. The character error rate covers insertion, substitution and deletion errors; the real-time factor is numerically equal to the time the computer spends processing one unit of speech time; and the relative speed is the speed of a model relative to the autoregressive Transformer model.

TABLE 1: experimental results on the Aishell1 corpus

TABLE 2: experimental results on the Magicdata corpus

To further verify the advantage of using a bidirectional context in the method described in this example, the bidirectional context in the decoder was replaced with a unidirectional context, and comparative experiments were performed on the Magicdata corpus. The results are shown in Table 3. As can be seen from Table 3, the bidirectional context achieves a lower character error rate under different numbers of iterations, and the character error rate with multiple iterations is lower than with a single iteration. Notably, over multiple iterations the character error rate of the bidirectional context drops more than that of the unidirectional context, which further highlights the superiority of the bidirectional context.

TABLE 3: comparison of unidirectional and bidirectional context on the Magicdata corpus

Example three:

fig. 3 shows the structure of the speech recognition device provided by the third embodiment of the present invention; for convenience of explanation, only the parts related to the embodiment of the present invention are shown.

The speech recognition device 3 of an embodiment of the present invention comprises a processor 30, a memory 31 and a computer program 32 stored in the memory 31 and executable on the processor 30. The processor 30, when executing the computer program 32, implements the steps in the above-described method embodiments, such as the steps S201 to S202 shown in fig. 2.

In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on the input speech features to obtain a preliminary recognition result, and the decoder adjusts that result using the bidirectional language information it provides and outputs the final speech recognition result, exploiting the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. The language information is thus fully utilized and the speech recognition effect improved, and compared with methods in which two unidirectional decoders exploit unidirectional language information separately, the structure is more efficient and unified.

Example four:

in an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps in the above-described method embodiment, for example, steps S201 to S202 shown in fig. 2.

In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on the input speech features to obtain a preliminary recognition result, and the decoder adjusts that result using the bidirectional language information it provides and outputs the final speech recognition result, exploiting the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. The language information is thus fully utilized and the speech recognition effect improved, and compared with methods in which two unidirectional decoders exploit unidirectional language information separately, the structure is more efficient and unified.

The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium such as a ROM/RAM, a magnetic disk, an optical disk or a flash memory.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
