Error correction method based on multi-modal speech recognition results and related device


Reading note: this technique, "Error correction method based on multi-modal speech recognition results and related device", was designed and created by 庄子扬, 魏韬, 马骏, 王少军 and 肖京 on 2021-09-10. Its main content is as follows: an embodiment of the application provides an error correction method based on multi-modal speech recognition results and a related device, the method comprising: processing voice data of a user by means of an acoustic model and a language model to obtain a plurality of first candidate recognition results and the corresponding acoustic scores and language scores; obtaining a weight score corresponding to each first candidate recognition result; taking the first candidate recognition result with the highest weight score as a target recognition result, and acquiring a text sequence vector of the target recognition result; determining the first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results, and acquiring a pinyin sequence vector corresponding to that result; and inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result. The invention can effectively correct text errors in speech recognition results; in particular, homophone errors can be effectively corrected.

1. An error correction method based on multi-modal speech recognition results, the method comprising:

acquiring voice data;

processing the voice data by adopting an acoustic model and a language model to obtain a plurality of first candidate recognition results and obtain an acoustic score and a language score corresponding to each first candidate recognition result;

performing, for each first candidate recognition result, a weighted calculation on the acoustic score and the language score to obtain a weight score corresponding to each first candidate recognition result;

taking a first candidate recognition result with the highest weight score as a target recognition result, and acquiring a text sequence vector of the target recognition result;

determining a first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results, and acquiring a pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score;

and inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result.

2. The method of claim 1, wherein the error correction model comprises an input layer, a fully-connected layer, and a softmax layer;

the inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result includes:

inputting the text sequence vector and the pinyin sequence vector into the input layer, and performing feature fusion on the text sequence vector and the pinyin sequence vector through the input layer to obtain a fused feature vector;

inputting the fused feature vector into the fully-connected layer and, via the fully-connected layer, into the softmax layer, so as to obtain an error correction recognition result output by the softmax layer.

3. The method of claim 2, wherein feature fusing the text sequence vector and the pinyin sequence vector through the input layer comprises:

and performing dot multiplication and summation operation on the text sequence vector and the pinyin sequence vector through the input layer.

4. The method of claim 1, wherein the obtaining the text sequence vector of the target recognition result comprises:

and inputting the text data of the target recognition result into a pre-trained BERT model to obtain a text sequence vector of the text data.

5. The method of claim 1, wherein the obtaining the pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score comprises:

obtaining a pinyin sequence corresponding to a first candidate recognition result with the highest acoustic score;

inputting the pinyin sequence to a pre-trained Tacotron2 model to obtain the pinyin sequence vector.

6. The method of claim 5, wherein prior to said inputting the pinyin sequence to a pre-trained Tacotron2 model, the method further comprises:

constructing a pinyin sequence sample set, wherein the pinyin sequence sample set comprises a plurality of pinyin sequence samples;

acquiring Mel spectrum features of the Mandarin pronunciation corresponding to each pinyin sequence sample to obtain a plurality of Mel spectrum sequence vectors;

and training a Tacotron2 model according to the plurality of pinyin sequence samples and the plurality of Mel spectrum sequence vectors to obtain a trained Tacotron2 model.

7. The method according to claim 1, further comprising, after obtaining the error correction recognition result:

judging whether the text of the error correction recognition result contains errors;

and when the text of the error correction recognition result has errors, training the error correction model again by using the text sequence vector, the pinyin sequence vector and the error correction recognition result.

8. An apparatus for correcting an error based on a multi-modal speech recognition result, the apparatus comprising:

the acquisition module is used for acquiring voice data;

the first processing module is used for processing the voice data by adopting an acoustic model and a language model to obtain a plurality of first candidate recognition results and obtain an acoustic score and a language score corresponding to each first candidate recognition result;

the second processing module is used for performing, for each first candidate recognition result, a weighted calculation on the acoustic score and the language score to obtain a weight score corresponding to each first candidate recognition result;

the third processing module is used for taking the first candidate recognition result with the highest weight score as a target recognition result and acquiring a text sequence vector of the target recognition result;

the fourth processing module is used for determining a first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results and acquiring a pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score;

and the error correction module is used for inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model so as to obtain an error correction recognition result.

9. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by one or more of the processors, cause the one or more processors to perform the steps of the method of any one of claims 1 to 7.

10. A computer-readable storage medium readable by a processor, the storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1 to 7.

Technical Field

The application relates to the field of artificial intelligence, and in particular to an error correction method based on multi-modal speech recognition results and a related device.

Background

Speech recognition technology based on artificial intelligence is now widely used in a variety of scenarios such as vehicle navigation, smart home, social chat, application assistants, and entertainment games. Speech recognition automatically recognizes the voice content input by a user, converts it into corresponding text, and outputs that text, greatly improving the convenience of interaction between the user and the terminal. In the actual voice interaction process, however, the error rate of speech recognition is high because of factors such as non-standard user pronunciation and noise. The prior art focuses on improving the accuracy of speech recognition itself but lacks a means of correcting the speech recognition result. For these reasons, the adoption of voice interaction products is greatly hindered.

Disclosure of Invention

The present application aims to provide a method, an apparatus, a computer device and a computer readable storage medium for correcting errors based on a multi-modal speech recognition result, which can effectively correct errors of the speech recognition result and greatly improve user experience of a speech recognition system.

In a first aspect, the present application provides a method for correcting errors based on multi-modal speech recognition results, the method comprising:

acquiring voice data;

processing the voice data by adopting an acoustic model and a language model to obtain a plurality of first candidate recognition results and obtain an acoustic score and a language score corresponding to each first candidate recognition result;

performing, for each first candidate recognition result, a weighted calculation on the acoustic score and the language score to obtain a weight score corresponding to each first candidate recognition result;

taking a first candidate recognition result with the highest weight score as a target recognition result, and acquiring a text sequence vector of the target recognition result;

determining a first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results, and acquiring a pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score;

and inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result.

According to some embodiments of the present application, in the above scheme, the error correction model includes an input layer, a fully-connected layer, and a softmax layer;

the inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result includes:

inputting the text sequence vector and the pinyin sequence vector into the input layer, and performing feature fusion on the text sequence vector and the pinyin sequence vector through the input layer to obtain a fused feature vector;

inputting the fused feature vector into the fully-connected layer and, via the fully-connected layer, into the softmax layer, so as to obtain an error correction recognition result output by the softmax layer.

According to some embodiments of the present application, in the above scheme, performing feature fusion on the text sequence vector and the pinyin sequence vector through the input layer includes:

and performing dot multiplication and summation operation on the text sequence vector and the pinyin sequence vector through the input layer.

According to some embodiments of the present application, in the above scheme, the obtaining a text sequence vector of the target recognition result includes:

and inputting the text data of the target recognition result into a pre-trained BERT model to obtain a text sequence vector of the text data.

According to some embodiments of the present application, in the foregoing scheme, the obtaining a pinyin sequence vector corresponding to a first candidate recognition result with a highest acoustic score includes:

obtaining a pinyin sequence corresponding to a first candidate recognition result with the highest acoustic score;

inputting the pinyin sequence to a pre-trained Tacotron2 model to obtain the pinyin sequence vector.

According to some embodiments of the present application, in the above scheme, before the inputting the pinyin sequence to a pre-trained Tacotron2 model, the method further includes:

constructing a pinyin sequence sample set, wherein the pinyin sequence sample set comprises a plurality of pinyin sequence samples;

acquiring Mel spectrum features of the Mandarin pronunciation corresponding to each pinyin sequence sample to obtain a plurality of Mel spectrum sequence vectors;

and training a Tacotron2 model according to the plurality of pinyin sequence samples and the plurality of Mel spectrum sequence vectors to obtain a trained Tacotron2 model.

According to some embodiments of the present application, in the above scheme, after obtaining the error correction recognition result, the method further includes: and replacing the target recognition result with the error correction recognition result to take the error correction recognition result as a final recognition result.

In a second aspect, the present application provides an apparatus for correcting errors based on multi-modal speech recognition results, the apparatus comprising:

the acquisition module is used for acquiring voice data;

the first processing module is used for processing the voice data by adopting an acoustic model and a language model to obtain a plurality of first candidate recognition results and obtain an acoustic score and a language score corresponding to each first candidate recognition result;

the second processing module is used for performing, for each first candidate recognition result, a weighted calculation on the acoustic score and the language score to obtain a weight score corresponding to each first candidate recognition result;

the third processing module is used for taking the first candidate recognition result with the highest weight score as a target recognition result and acquiring a text sequence vector of the target recognition result;

the fourth processing module is used for determining a first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results and acquiring a pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score;

and the error correction module is used for inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model so as to obtain an error correction recognition result.

In a third aspect, the present application provides a computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by one or more of the processors, cause the one or more processors to perform the steps of any one of the methods described above in the first aspect.

In a fourth aspect, the present application also provides a computer-readable storage medium readable by a processor, the storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of any of the methods described above in the first aspect.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

in the embodiment of the application, an acoustic model and a language model are adopted to process the voice data of the user, a plurality of first candidate recognition results are obtained, and the acoustic score and language score corresponding to each first candidate recognition result are obtained; a weighted calculation on the acoustic score and the language score is performed for each first candidate recognition result to obtain a weight score corresponding to each first candidate recognition result; the first candidate recognition result with the highest weight score is taken as the target recognition result, and the text sequence vector of the target recognition result is acquired; the first candidate recognition result with the highest acoustic score is determined from the plurality of first candidate recognition results, and the pinyin sequence vector corresponding to that result is acquired; and the text sequence vector and the pinyin sequence vector are input into a pre-trained error correction model to obtain an error correction recognition result. The embodiment of the application adopts a multi-modal feature fusion method, combining the pinyin sequence vector features corresponding to the result with the highest acoustic score with the text sequence vector features of the target recognition result for error correction, and can effectively correct text errors in the speech recognition result, in particular homophone errors. By using acoustic features for error correction, the technical scheme of the embodiment of the application improves the error correction recall rate and reduces the false-correction rate, which markedly benefits the accuracy of the overall speech recognition.

Drawings

FIG. 1 is a schematic flow chart of an error correction method based on multi-modal speech recognition results according to an embodiment of the present application;

FIG. 2 is a flow chart illustrating the sub-steps of step S150 in FIG. 1;

FIG. 3 is a schematic structural diagram of a Tacotron2 model provided in an embodiment of the present application;

FIG. 4 is a flow chart illustrating the sub-steps of step S160 in FIG. 1;

FIG. 5 is a schematic structural diagram of an error correction apparatus based on multi-modal speech recognition results according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may each be single or multiple.

It should be appreciated that the embodiments of the present application may acquire and process relevant data based on artificial intelligence techniques. Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, robotics, biometric recognition technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

Speech recognition technology based on artificial intelligence is now widely used in a variety of scenarios such as vehicle navigation, smart home, social chat, application assistants, and entertainment games. Speech recognition automatically recognizes the voice content input by a user and converts it into corresponding text for output, greatly improving the convenience of interaction between the user and the terminal.

In the actual voice interaction process, the error rate of the speech recognition result is high because of factors such as non-standard user pronunciation and noise. For example, after receiving voice information input by a user, the terminal automatically recognizes and converts the voice information and outputs the text "tell me", whereas the correct output should be "please tell me". Grammatical errors, homophone errors, and the like are pain points that affect the speech recognition user experience.

In the related art, in order to correct text errors in a speech recognition result, one approach summarizes the regularities of errors occurring in the Chinese character sequences produced by continuous Chinese speech recognition and, according to the characteristics of Chinese speech, manually writes corresponding grammatical, syntactic and semantic rules for error checking and correction; errors in the character sequence are found and corrected using a "vocabulary-semantics driven" analysis method, finally yielding a correct character sequence. It has also been proposed to correct the text of the speech recognition result with a language model, based on the likelihood of the word sequence spoken by the speaker. However, these methods all correct grammar or syntax on the basis of text alone, and their accuracy is not high.

The embodiment of the application provides a method, an apparatus, a computer device and a computer-readable storage medium for correcting errors based on multi-modal speech recognition results. On the basis of deep learning, acoustic features and text feature information from the speech recognition decoding process are combined, so that errors in the speech recognition result, especially grammatical errors and homophone errors, can be effectively corrected, and the user experience of a speech recognition system can be greatly improved.

The error correction method based on multi-modal speech recognition results provided by the embodiment of the application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, Augmented Reality (AR)/Virtual Reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks and Personal Digital Assistants (PDA), and can also be applied to intelligent household appliances such as sound boxes, televisions and washing machines.

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an error correction method based on a multi-modal speech recognition result according to an embodiment of the present application. The method comprises the following steps:

s110, voice data are obtained.

It can be understood that the terminal device to which the method of the embodiment of the present application is applied is provided with a voice acquisition device, which may specifically be a sound pickup. A sound pickup is a sound sensor: an energy conversion device that converts a sound signal into an electrical signal, also called a microphone. The terminal device can collect the voice information of a user through the sound pickup: the sound vibration of the voice is transmitted to the diaphragm of the microphone, pushing the magnet inside the diaphragm to form a changing current, which is sent to the sound processing circuit behind the diaphragm to be amplified and processed into a voice signal. After the terminal acquires the voice signal through the voice acquisition device, it can perform spatio-temporal sampling on the acquired voice signal to obtain voice data. The voice data is formed as follows: the voice signal within the receiving range is collected through the sound pickup and converted into an analog electrical signal, which is amplified by a front-end amplifier; the amplified analog electrical signal is then sampled by a multi-channel synchronous sampling unit, converting it into a digital electrical signal and forming the voice data to be recognized.

And S120, processing the voice data by adopting an acoustic model and a language model to obtain a plurality of first candidate recognition results, and obtaining an acoustic score and a language score corresponding to each first candidate recognition result.

In a specific implementation, the voice data may be input into the acoustic model; the acoustic model then performs acoustic feature extraction on the voice data to obtain the acoustic features corresponding to the voice to be recognized. Here, the acoustic features may specifically be filter-bank (FBank) features, fundamental frequency features, formant features, spectral features, or the like.
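As an illustrative, non-authoritative sketch (the patent does not name a feature extraction toolkit; librosa, a 16 kHz sampling rate and an 80-band log Mel filter bank are assumptions here), FBank-style features could be computed as follows:

```python
# Hedged sketch: log Mel filter-bank (FBank) feature extraction.
# librosa and all parameter values are assumptions, not fixed by the patent.
import librosa
import numpy as np

def extract_fbank(wav_path: str, n_mels: int = 80) -> np.ndarray:
    """Load a waveform and return log Mel filter-bank features, shape (frames, n_mels)."""
    y, sr = librosa.load(wav_path, sr=16000)              # 16 kHz mono audio
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels
    )                                                      # (n_mels, frames)
    return librosa.power_to_db(mel).T                      # log scale, transposed
```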

The acoustic model can also recognize acoustic characteristic information to obtain at least one phoneme of current voice data and an acoustic score of each phoneme in the at least one phoneme, and then a plurality of candidate text sequences corresponding to the voice to be recognized are determined according to the acoustic scores.

In a specific implementation, each candidate text sequence is scored by the language model to obtain the language score corresponding to each candidate text sequence.

Based on the acoustic score and the linguistic score corresponding to each candidate text sequence, a final text sequence, that is, a speech recognition result to be output, may be determined. It should be understood that the candidate text sequence is the first candidate recognition result.

And S130, performing, for each first candidate recognition result, a weighted calculation on the acoustic score and the language score to obtain a weight score corresponding to each first candidate recognition result.

For example, in the speech recognition process, the acoustic model and the language model respectively calculate the acoustic scores and language scores of a plurality of candidate results according to the input acoustic features and text features, and the weight score of each candidate result is then obtained by weighted summation.
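A minimal sketch of this weighted combination is given below; the weight values, candidate texts and scores are purely illustrative assumptions:

```python
# Hedged sketch: combining acoustic and language scores into a weight score.
def weight_score(acoustic_score: float, language_score: float,
                 w_ac: float = 0.6, w_lm: float = 0.4) -> float:
    """Weighted sum of the two scores; the weights are illustrative."""
    return w_ac * acoustic_score + w_lm * language_score

# Hypothetical candidate list (log-domain scores, higher is better).
candidates = [
    {"text": "tell me",        "ac": -12.1, "lm": -3.0},
    {"text": "please tell me", "ac": -10.4, "lm": -6.5},
]
target = max(candidates, key=lambda c: weight_score(c["ac"], c["lm"]))  # step S140
best_acoustic = max(candidates, key=lambda c: c["ac"])                  # step S150
```

Note that, as in steps S140 and S150, the candidate selected by weight score and the candidate with the highest acoustic score need not be the same one.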

And S140, taking the first candidate recognition result with the highest weight score as a target recognition result, and acquiring a text sequence vector of the target recognition result.

After obtaining the weight scores of the respective first candidate recognition results, an optimal result output may be determined from the plurality of first candidate recognition results. Specifically, the first candidate recognition result with the highest score may be used as the target recognition result, and the text sequence corresponding to the target recognition result may be output.

It will be appreciated that the score output by the acoustic model represents the probability that a given acoustic feature belongs to each acoustic modeling unit, while the language model represents the prior probability of a text sequence occurring. When the speech recognition result is obtained through the foregoing steps, the grammar and syntax of the correct text sequence may occur infrequently, giving it a low language score and thus a final weight score that is not high enough, so the correct result cannot be output. The embodiment of the application therefore subsequently corrects the text sequence of the target recognition result so that the correct result is finally output.

In order to perform error correction on the text sequence of the target recognition result, after the text sequence corresponding to the target recognition result is obtained, a vector representation of the text sequence is also obtained.

As an example, the obtaining of the text sequence vector of the target recognition result specifically includes: and inputting the text data of the target recognition result into a pre-trained BERT model to obtain a text sequence vector of the text data.

The BERT model is a pre-trained language model. A character vector table is preset in the BERT model, and each character can find its corresponding vector in this table. The BERT model converts characters, which cannot be computed on directly, into a computable vector form, and enriches the semantic vector of each character according to its context, so that the digitized vectors better reflect the meaning of the corresponding characters within the sentence. In a specific implementation, the BERT model can predict the probability of a character occurring in a given sentence according to the characters before and after it (i.e., its context information), and select the correct character for the sentence according to the predicted probability, so that the output sentence conforms to the habits of human language.
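A hedged sketch of obtaining the text sequence vector with a pre-trained BERT model follows; the Hugging Face transformers library and the "bert-base-chinese" checkpoint are assumptions here, since the patent only requires some pre-trained BERT model:

```python
# Hedged sketch: text sequence vector from a pre-trained BERT model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def text_sequence_vector(text: str) -> torch.Tensor:
    """Return one contextual vector per token, shape (seq_len, 768)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)
```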

S150, determining a first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results, and acquiring a pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score.

As an example, referring to fig. 2, step S150 may specifically include the following sub-steps:

and S151, acquiring a pinyin sequence corresponding to the first candidate identification result with the highest acoustic score.

It is understood that, in the process of obtaining recognition results through the acoustic model and the language model, the correct recognition result often obtains a higher acoustic score. For example, in the speech recognition of "please tell me" (请告诉我), the pinyin sequence of the candidate result with the highest acoustic score is "qing3 gao4 su4 wo3", so this pinyin sequence can be used as auxiliary error correction information to correct the previously obtained target recognition result "tell me".
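In practice the pinyin sequence comes from the decoder's candidate itself; purely to illustrate the tone-numbered representation used in this example, a sketch with the pypinyin library (an assumption, not named in the patent) is:

```python
# Hedged sketch: tone-numbered pinyin sequence for a piece of candidate text.
# pypinyin is an assumed helper; a real system would read the pinyin from
# the decoding lattice rather than re-converting text.
from pypinyin import Style, lazy_pinyin

def to_pinyin_sequence(text: str) -> list:
    """Return syllables with the tone digit appended, e.g. 'qing3'."""
    return lazy_pinyin(text, style=Style.TONE3)

print(to_pinyin_sequence("请告诉我"))  # ['qing3', 'gao4', 'su4', 'wo3']
```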

S152, inputting the pinyin sequence to a pre-trained Tacotron2 model to obtain the pinyin sequence vector.

It should be appreciated that Tacotron2 is an end-to-end, neural-network-based TTS model. The Tacotron2 model converts input text into continuous character embedding vectors (Character Embeddings) through an encoder, processes the embedding vectors through a connection layer and a multi-layer LSTM network, and outputs the Mel spectrum feature vectors corresponding to the text. In the embodiment of the application, a pinyin sequence is used instead of text as the input of the Tacotron2 model: the pinyin sequence is converted by the encoder into continuous pinyin sequence embedding vectors (which can be understood as the Embeddings of the pinyin sequence), these embedding vectors are processed through the connection layer and the multi-layer LSTM network to output the Mel spectrum feature vector corresponding to the pinyin sequence, and the Embeddings of the pinyin sequence are then taken as the pinyin sequence vector.

It will be appreciated that, in order to obtain pinyin sequence vectors by inputting pinyin sequences into the Tacotron2 model, the Tacotron2 model needs to be trained in advance. That is, before the pinyin sequence is input to the pre-trained Tacotron2 model, the method of the embodiment of the present application further includes the following steps:

s1501, constructing a pinyin sequence sample set, wherein the pinyin sequence sample set comprises a plurality of pinyin sequence samples;

s1502, acquiring Mel spectrum characteristics of mandarin pronunciation corresponding to each pinyin sequence sample to obtain a plurality of Mel spectrum sequence vectors;

s1503, training a Tacotron2 model according to the pinyin sequence samples and the Mel spectrum sequence vectors to obtain a trained Tacotron2 model.

Specifically, fig. 3 shows a schematic structural diagram of the Tacotron2 model provided in an embodiment of the present application. To train the Tacotron2 model shown in fig. 3, a pinyin sequence sample set can be pre-constructed, a standard Mandarin pronunciation segment corresponding to each pinyin sequence sample in the set is obtained, and the Mel spectrum feature sample vector corresponding to each pronunciation segment is extracted as the label for model training. During training, each input pinyin sequence corresponds to a randomly initialized pinyin embedding vector; this vector passes through 3 convolutional layers and is then fed into a bi-LSTM network for encoding along the time dimension; attention based on position features is obtained through Location Sensitive Attention; a Mel spectrum feature vector is generated through several further network layers; finally, the loss against the Mel spectrum feature sample vector is calculated and the model parameters are optimized according to the loss value until the model converges, yielding the trained Tacotron2 model.
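A minimal PyTorch sketch of the encoder part described above (pinyin embedding table, 3 convolutional layers, bi-LSTM) follows; all dimensions are illustrative assumptions, and the attention and decoder parts used only during training are omitted:

```python
# Hedged sketch: Tacotron2-style encoder that turns pinyin IDs into
# pinyin sequence vectors. Sizes are assumptions, not fixed by the patent.
import torch
import torch.nn as nn

class PinyinEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # randomly initialized
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(3)                               # 3 convolutional layers
        ])
        self.bilstm = nn.LSTM(emb_dim, emb_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, pinyin_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(pinyin_ids).transpose(1, 2)      # (B, emb_dim, T)
        for conv in self.convs:
            x = conv(x)
        x, _ = self.bilstm(x.transpose(1, 2))               # time-dimension encoding
        return x                                            # (B, T, emb_dim)
```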

And S160, inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result.

Illustratively, the error correction model may include an input layer, a fully-connected layer, and a softmax layer. The input layer fuses the input text sequence vector and pinyin sequence vector to obtain a multi-modal feature vector; the fully-connected layer connects the input layer and the softmax layer, and reduces the dimension of the multi-modal feature vector and splices it into a long vector; the softmax layer generates a plurality of texts from the long vector output by the fully-connected layer, calculates the probability that each text is the correct recognition result, and outputs the text with the highest probability as the error correction recognition result.

Correspondingly, referring to fig. 4, in step S160, inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model to obtain an error correction recognition result, which may specifically include the following sub-steps:

s161, inputting the text sequence vector and the pinyin sequence vector into the input layer, and performing feature fusion on the text sequence vector and the pinyin sequence vector through the input layer to obtain a fused feature vector;

and S162, inputting the fused feature vector into the full connection layer and inputting the fused feature vector into the softmax layer through the full connection layer so as to obtain an error correction identification result output by the softmax layer.

In step S161, the feature fusion of the text sequence vector and the pinyin sequence vector through the input layer specifically includes: performing dot multiplication and summation operations on the text sequence vector and the pinyin sequence vector through the input layer.

Illustratively, the dot multiplication and summation operations on the text sequence vector and the pinyin sequence vector can be implemented by the following equations (1) and (2):

F_e = σ(F_p W_p + b_p) · F_s (1)

F_es = F_e + F_s (2)

where F_s denotes the text sequence vector, F_p denotes the pinyin sequence vector, W_p and b_p are learnable model parameters, σ is a nonlinear activation function (specifically, the ReLU activation function), and "·" denotes the vector dot product. Equation (2) is a residual connection, and its result F_es serves as the input to the fully-connected layer.
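A hedged PyTorch sketch of the error correction model's forward pass, implementing equations (1) and (2) followed by the fully-connected and softmax layers, is given below. It reads "·" as element-wise multiplication and assumes the two vector sequences have been aligned to the same length; all dimensions (768 for BERT, 512 for the pinyin encoder, a BERT-sized output vocabulary) are assumptions:

```python
# Hedged sketch: multi-modal fusion error correction model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionErrorCorrector(nn.Module):
    def __init__(self, text_dim: int = 768, pinyin_dim: int = 512,
                 vocab_size: int = 21128):
        super().__init__()
        self.proj = nn.Linear(pinyin_dim, text_dim)   # W_p and b_p in eq. (1)
        self.fc = nn.Linear(text_dim, vocab_size)     # fully-connected layer

    def forward(self, f_s: torch.Tensor, f_p: torch.Tensor) -> torch.Tensor:
        # f_s: text sequence vectors (B, T, text_dim)
        # f_p: pinyin sequence vectors (B, T, pinyin_dim), assumed length-aligned
        f_e = F.relu(self.proj(f_p)) * f_s            # eq. (1)
        f_es = f_e + f_s                              # eq. (2), residual connection
        return F.softmax(self.fc(f_es), dim=-1)      # softmax layer, per position
```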

It is understood that, after obtaining the error correction recognition result, the embodiment of the present application outputs the error correction recognition result as a final recognition result.

In an exemplary application scenario of the embodiment of the application, after receiving voice data of a user, the terminal processes the voice data using an acoustic model and a language model to obtain a plurality of first candidate recognition results, together with the acoustic score and language score corresponding to each first candidate recognition result. A weighted calculation on the acoustic score and language score is performed for each first candidate recognition result to obtain its weight score; the result with the highest weight score is "tell me", so "tell me" is taken as the target recognition result, and the text sequence vector of the target recognition result "tell me" is obtained. The first candidate recognition result with the highest acoustic score is determined from the obtained first candidate recognition results, and its corresponding pinyin sequence, "qing3 gao4 su4 wo3", is converted into a pinyin sequence vector. The text sequence vector and the pinyin sequence vector are then input into the pre-trained error correction model to obtain the error correction recognition result "please tell me"; finally, "please tell me" is output as the final recognition result.

Optionally, after obtaining the error correction recognition result, the following steps may further be included: judging whether the text of the error correction recognition result contains errors; and, when the text of the error correction recognition result contains errors, training the error correction model again using the text sequence vector, the pinyin sequence vector and the error correction recognition result.

It can be understood that, if the output text still contains errors after the target recognition result has been corrected by the error correction model, the error correction model can be trained again using the text sequence vector, the pinyin sequence vector and the error correction recognition result, so as to improve the accuracy of the error correction model's output and reduce the false-correction rate.
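A heavily hedged sketch of one such feedback step follows, assuming the 0/1 labelling described in the next paragraph; the scalar scoring head `model.confidence` is a hypothetical addition for this sketch, not part of the model described above:

```python
# Hedged sketch: one feedback retraining step on a labelled correction.
# `model.confidence` is hypothetical: a head returning a shape-(1,) tensor in (0, 1).
import torch
import torch.nn.functional as F

def feedback_step(model, optimizer, f_s, f_p, label: float) -> float:
    """Replay (text vector, pinyin vector) with a 0/1 label and update the model."""
    optimizer.zero_grad()
    prob = model.confidence(f_s, f_p)                 # hypothetical scalar head
    loss = F.binary_cross_entropy(prob, torch.tensor([label]))
    loss.backward()
    optimizer.step()
    return loss.item()
```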

In a specific implementation, it may be judged whether the text of the error correction recognition result contains an error, and the error correction recognition result is then labelled: for example, if the text is correct, the label of the error correction recognition result is set to a first preset value (for example, 1); if the text is wrong, the label is set to a second preset value (for example, 0). When the error correction model is trained again using the text sequence vector, the pinyin sequence vector and the error correction recognition result, the text sequence vector and the pinyin sequence vector are taken as input, the preset label value of the error correction recognition result is taken as the expected output probability, and the error correction model is then trained. Through multiple rounds of iterative optimization, the accuracy of the error correction model's output is improved while the false-correction rate is reduced.

In the embodiment of the application, an acoustic model and a language model are adopted to process the voice data of the user, a plurality of first candidate recognition results are obtained, and the acoustic score and language score corresponding to each first candidate recognition result are obtained; a weighted calculation on the acoustic score and the language score is performed for each first candidate recognition result to obtain a weight score corresponding to each first candidate recognition result; the first candidate recognition result with the highest weight score is taken as the target recognition result, and the text sequence vector of the target recognition result is acquired; the first candidate recognition result with the highest acoustic score is determined from the plurality of first candidate recognition results, and the pinyin sequence vector corresponding to that result is acquired; and the text sequence vector and the pinyin sequence vector are input into a pre-trained error correction model to obtain an error correction recognition result. The embodiment of the application adopts a multi-modal feature fusion method, combining the pinyin sequence vector features corresponding to the result with the highest acoustic score with the text sequence vector features of the target recognition result for error correction, and can effectively correct text errors in the speech recognition result, in particular homophone errors. By using acoustic features for error correction, the technical scheme of the embodiment of the application improves the error correction recall rate and reduces the false-correction rate, markedly benefiting the accuracy of the overall speech recognition.

It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.

Referring to fig. 5, the present application provides an error correction apparatus based on multi-modal speech recognition results, the apparatus comprising:

the acquisition module is used for acquiring voice data;

the first processing module is used for processing the voice data by adopting an acoustic model and a language model to obtain a plurality of first candidate recognition results and obtain an acoustic score and a language score corresponding to each first candidate recognition result;

the second processing module is used for performing, for each first candidate recognition result, a weighted calculation on the acoustic score and the language score to obtain a weight score corresponding to each first candidate recognition result;

the third processing module is used for taking the first candidate recognition result with the highest weight score as a target recognition result and acquiring a text sequence vector of the target recognition result;

the fourth processing module is used for determining a first candidate recognition result with the highest acoustic score from the plurality of first candidate recognition results and acquiring a pinyin sequence vector corresponding to the first candidate recognition result with the highest acoustic score;

and the error correction module is used for inputting the text sequence vector and the pinyin sequence vector into a pre-trained error correction model so as to obtain an error correction recognition result.

As an example, the error correction model includes an input layer, a fully-connected layer, and a softmax layer.

In one embodiment, the error correction module comprises:

the fusion unit is used for inputting the text sequence vector and the pinyin sequence vector into the input layer, and performing feature fusion on the text sequence vector and the pinyin sequence vector through the input layer to obtain a fused feature vector;

a result output unit, configured to input the fused feature vector to the fully-connected layer and to the softmax layer via the fully-connected layer, so as to obtain an error correction recognition result output by the softmax layer.

In a specific embodiment, the feature fusion of the text sequence vector and the pinyin sequence vector by the fusion unit through the input layer includes: and performing dot multiplication and summation operation on the text sequence vector and the pinyin sequence vector through the input layer.

In a specific embodiment, the third processing module includes:

a first determination unit configured to take a first candidate recognition result with a highest weight score as a target recognition result;

and the first acquisition unit is used for inputting the text data of the target recognition result into a pre-trained BERT model so as to obtain a text sequence vector of the text data.

In one embodiment, the fourth processing module comprises:

a second determination unit configured to determine a first candidate recognition result having a highest acoustic score from the plurality of first candidate recognition results;

the second acquisition unit is used for acquiring the pinyin sequence corresponding to the first candidate recognition result with the highest acoustic score; inputting the pinyin sequence to a pre-trained Tacotron2 model to obtain the pinyin sequence vector.

In a specific embodiment, the apparatus further comprises a training module configured to:

constructing a pinyin sequence sample set, wherein the pinyin sequence sample set comprises a plurality of pinyin sequence samples;

acquiring Mel spectrum features of the Mandarin pronunciation corresponding to each pinyin sequence sample to obtain a plurality of Mel spectrum sequence vectors;

and training a Tacotron2 model according to the plurality of pinyin sequence samples and the plurality of Mel spectrum sequence vectors to obtain a trained Tacotron2 model.

In a particular embodiment, the training module is further configured to: and when the text of the error correction recognition result has errors, training the error correction model again by using the text sequence vector, the pinyin sequence vector and the error correction recognition result.

It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Fig. 6 illustrates a computer device 500 provided by an embodiment of the present application. The computer device 500 includes, but is not limited to:

a memory 510 for storing programs;

a processor 520 for executing the program stored in the memory 510, wherein when the processor 520 executes the program stored in the memory 510, the processor 520 is configured to perform the above-mentioned error correction method based on the multi-modal speech recognition result.

The processor 520 and the memory 510 may be connected by a bus or other means.

The memory 510, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs and non-transitory computer executable programs, such as the method for error correction based on multi-modal speech recognition results described in any of the embodiments of the present invention. Processor 520 implements the above-described multi-modal speech recognition result-based error correction method by executing non-transitory software programs and instructions stored in memory 510.

The memory 510 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created in executing the above error correction method based on multi-modal speech recognition results. Further, the memory 510 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 510 may optionally include memory located remotely from the processor 520, which may be connected to the processor 520 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions necessary to implement the multi-modal speech recognition result-based error correction method described above are stored in memory 510 and, when executed by the one or more processors 520, perform the multi-modal speech recognition result-based error correction method provided by any of the embodiments of the present invention.

The embodiment of the application also provides a computer-readable storage medium, which stores computer-executable instructions, and the computer-executable instructions are used for executing the error correction method based on the multi-modal voice recognition result.

In one embodiment, the storage medium stores computer-executable instructions which, when executed by one or more processors 520, for example by one processor 520 in the computer device 500, may cause the one or more processors 520 to execute the error correction method based on multi-modal speech recognition results provided by any embodiment of the present invention.

The above described embodiments are merely illustrative, wherein elements illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.

While the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments. Those skilled in the art will appreciate that various equivalent modifications or substitutions can be made without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included within the scope of the invention defined by the claims.
