Acoustic model training method, system, device and medium for speech recognition

Document No.: 972889 Publication date: 2020-11-03

Abstract: This technique, "Acoustic model training method, system, device and medium for speech recognition" (语音识别的声学模型训练方法、系统、设备及介质), was created by 李明, 江文斌 and 李健 on 2020-07-30. The invention discloses an acoustic model training method, system, device and medium for speech recognition. Training samples are obtained; each first voice segment is input into a plurality of preset, different reference voice recognition models to obtain a plurality of recognized texts; a similarity value is calculated between every two recognized texts to determine the text score of the first voice segment; and it is judged whether the text score is greater than a preset threshold. If so, the first voice segment corresponding to the text score is selected as a third voice segment, and a voice recognition model is trained based on the third voice segments and the second voice segments. The invention addresses the problem that manually labeling training data is time-consuming and laborious, so that a large training data set cannot be built in a short time and the word error rate of speech recognition is high, and it improves the accuracy of the voice recognition model.

1. A method for training an acoustic model for speech recognition, characterized by comprising the following steps:

obtaining a training sample; wherein the training sample comprises a plurality of first voice segments without labels and a plurality of second voice segments with labels;

inputting each first voice segment into a plurality of preset, different reference voice recognition models for recognition, so as to obtain a plurality of recognized texts;

calculating the similarity value between every two recognized texts to determine the text score corresponding to the first voice segment;

judging whether the text score is greater than a preset threshold and, if so, selecting the first voice segment corresponding to the text score as a third voice segment; wherein the third voice segment has a pseudo label generated by recognition with the reference voice recognition model;

training and generating a voice recognition model based on the third voice segment and the second voice segment.

2. The acoustic model training method of claim 1, wherein, after judging whether the text score is greater than the preset threshold, the method further comprises:

and if not, deleting the first voice segment corresponding to the text score.

3. The method of acoustic model training of claim 1, wherein the step of screening the first speech segment corresponding to the text score as a third speech segment further comprises:

acquiring a recognition text to be corrected, which is generated after the third voice fragment is recognized by the reference voice recognition model with the lowest word error rate;

and after the recognized text to be corrected is corrected, acquiring a pseudo label corresponding to the third voice fragment.

4. The acoustic model training method of claim 1, wherein the step of calculating a similarity value between each two of the recognized texts to determine a text score corresponding to the first speech segment comprises:

calculating the similarity between the recognized texts by using an edit distance method; wherein the edit distance method includes at least one of replacing one character with another character, inserting one character, and deleting one character;

and calculating a text score corresponding to the first voice fragment based on the similarity.

5. The acoustic model training method of claim 1, wherein the training to generate a speech recognition model based on the third speech segment and the second speech segment comprises:

selecting the reference speech recognition model with the lowest word error rate as a model to be trained;

updating the training data of the model to be trained by using the third voice segment and its pseudo label;

retraining the model to be trained based on the third speech segment including the pseudo label and the second speech segment to generate a speech recognition model.

6. The acoustic model training method of claim 1, wherein the step of obtaining training samples is further followed by:

training a plurality of reference models with the second speech segments to generate the plurality of reference speech recognition models with speech recognition capability; wherein the reference models have different network structures, and the reference speech recognition models have different word error rates.

7. An acoustic model training system for speech recognition, the acoustic model training system comprising:

the first acquisition module is used for acquiring a training sample; wherein the training sample comprises a plurality of first voice segments without labels and a plurality of second voice segments with labels;

the recognition module is used for respectively inputting the first voice fragment into a plurality of preset different reference voice recognition models for recognition so as to obtain a plurality of recognition texts;

the calculation module is used for calculating the similarity value between every two recognition texts so as to determine the text score corresponding to the first voice fragment;

the judging module is used for judging whether the text score is greater than a preset threshold;

if so, calling a screening module, wherein the screening module is used for screening the first voice segment corresponding to the text score to serve as a third voice segment; wherein the third speech segment has a pseudo label generated after recognition by the reference speech recognition model;

and the training module is used for training and generating a voice recognition model based on the third voice segment and the second voice segment.

8. The system for training an acoustic model for speech recognition according to claim 7, wherein if the determination result of the determining module is negative, a deleting module is invoked, and the deleting module is configured to delete the first speech segment corresponding to the text score.

9. The system for acoustic model training for speech recognition according to claim 7, wherein the system further comprises:

the second obtaining module is used for obtaining a recognition text to be corrected, which is generated after the third voice segment is recognized by the reference voice recognition model with the lowest word error rate;

and the third acquisition module is used for correcting the to-be-corrected recognized text to acquire the pseudo label corresponding to the third voice fragment.

10. The acoustic model training system for speech recognition of claim 7, wherein the computation module comprises:

a similarity calculation unit for calculating the similarity between the recognition texts by using an edit distance method; wherein the edit distance method includes at least one of replacing one character with another character, inserting one character, and deleting one character;

and the text score calculating unit is used for calculating the text score corresponding to the first voice fragment based on the similarity.

11. The acoustic model training system for speech recognition of claim 7, wherein the training module comprises:

the selection unit is used for selecting the reference speech recognition model with the lowest word error rate as a model to be trained;

the updating unit is used for updating the training data of the model to be trained by using the third voice segment and its pseudo label;

and the training execution unit is used for retraining the model to be trained based on the third voice segment comprising the pseudo label and the second voice segment to generate a voice recognition model.

12. The system for acoustic model training for speech recognition according to claim 7, wherein the system further comprises:

a pre-training module for training a plurality of reference models with the second speech segments to generate the plurality of reference speech recognition models with speech recognition capability; wherein the reference models have different network structures, and the reference speech recognition models have different word error rates.

13. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the acoustic model training method for speech recognition according to any one of claims 1-6.

14. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the acoustic model training method for speech recognition according to any one of claims 1-6.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to an acoustic model training method, system, device and medium for speech recognition.

Background

Speech is an important carrier of human thought. Speech recognition is a technology in which a machine receives, recognizes, and understands a speech signal and converts it into a corresponding digital signal. With the continuous development of speech recognition technology, applications based on it have become increasingly widespread, reaching into family life, office work, entertainment and other areas. Speech recognition has brought products such as voice input, voice search and intelligent voice customer service into the consumer electronics field.

Speech recognition for customer service call recordings is very complex. Because customer service staff speak different regional dialects and have individual habits of pronunciation and spoken expression, a large corpus covering enough scenarios is needed to train the acoustic model. Because labeling large volumes of speech is expensive and time-consuming, a sufficiently large manually labeled training data set cannot be obtained in a short time, and as a result the word error rate of speech recognition is high.

Disclosure of Invention

The invention aims to overcome the drawback in the prior art that manual labeling is time-consuming and laborious, so that a large training data set cannot be built in a short time and the word error rate of speech recognition is high, and provides an acoustic model training method, system, device and medium for speech recognition.

The invention solves the above technical problem through the following technical solutions:

In a first aspect, the present invention provides a method for training an acoustic model for speech recognition, including the following steps:

obtaining a training sample; wherein the training sample comprises a plurality of first voice segments without labels and a plurality of second voice segments with labels;

inputting each first voice segment into a plurality of preset, different reference voice recognition models for recognition, so as to obtain a plurality of recognized texts;

calculating the similarity value between every two recognized texts to determine the text score corresponding to the first voice segment;

judging whether the text score is greater than a preset threshold and, if so, selecting the first voice segment corresponding to the text score as a third voice segment; wherein the third voice segment has a pseudo label generated by recognition with the reference voice recognition model;

training and generating a voice recognition model based on the third voice segment and the second voice segment.

Preferably, after determining whether the text score is greater than the preset threshold, the method further includes:

and if not, filtering out the first voice segment corresponding to the text score.
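The screening step described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the names `screen_segments`, `exact_agreement` and `label_from` are hypothetical, the reference models are stood in for by plain dictionary lookups, the score is a simple exact-match agreement rate rather than the similarity value of the invention, and the pseudo label is assumed to come from one designated model.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a reference model maps a segment id to recognized text.
Recognizer = Callable[[str], str]


def exact_agreement(texts: Sequence[str]) -> float:
    """Stand-in score: fraction of text pairs that are exactly identical."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two recognized texts")
    return sum(a == b for a, b in pairs) / len(pairs)


def screen_segments(
    segments: Sequence[str],
    recognizers: Sequence[Recognizer],
    score_fn: Callable[[Sequence[str]], float],
    threshold: float,
    label_from: int = 0,  # assumption: index of the model supplying the pseudo label
) -> List[Tuple[str, str]]:
    """Keep unlabeled segments whose score exceeds the threshold; drop the rest."""
    kept = []
    for segment in segments:
        texts = [recognize(segment) for recognize in recognizers]
        if score_fn(texts) > threshold:
            kept.append((segment, texts[label_from]))
    return kept


# Toy data: three "models" agree on a.wav and disagree on b.wav.
model_a = {"a.wav": "hello", "b.wav": "good day"}
model_b = {"a.wav": "hello", "b.wav": "good bay"}
model_c = {"a.wav": "hello", "b.wav": "wood day"}
recognizers = [model_a.__getitem__, model_b.__getitem__, model_c.__getitem__]

selected = screen_segments(["a.wav", "b.wav"], recognizers, exact_agreement, 0.5)
print(selected)  # [('a.wav', 'hello')]
```

Passing the score function as a parameter keeps the screening loop independent of how similarity is measured, so an edit-distance-based score can be substituted without changing the loop.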

Preferably, the step of screening out the first speech segment corresponding to the text score as a third speech segment further includes:

acquiring a recognition text to be corrected, which is generated after the third voice fragment is recognized by the reference voice recognition model with the lowest word error rate;

and after the recognized text to be corrected is corrected, acquiring a pseudo label corresponding to the third voice fragment.

Preferably, the step of calculating a similarity value between every two recognized texts to determine a text score corresponding to the first speech segment includes:

calculating the similarity between the recognized texts by using an edit distance method; wherein the edit distance method includes at least one of replacing one character with another character, inserting one character, and deleting one character;

and calculating a text score corresponding to the first voice fragment based on the similarity.
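A minimal sketch of an edit-distance-based score is given here for illustration. The edit distance itself (substitution, insertion, deletion of single characters) follows the text; the normalization by the longer text's length and the averaging over all pairs are assumptions, since the text does not state how the text score is derived from the pairwise similarities. The function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def text_score(texts: Sequence[str]) -> float:
    """Assumed aggregation: mean similarity over every pair of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two recognized texts")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


print(edit_distance("kitten", "sitting"))  # 3
```

With this normalization the score lies between 0 and 1, so a preset threshold can be chosen on a fixed scale regardless of utterance length.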

Preferably, training and generating a speech recognition model based on the third speech segment and the second speech segment includes:

selecting the reference speech recognition model with the lowest word error rate as a model to be trained;

updating the training data of the model to be trained by using the third voice segment and its pseudo label;

retraining the model to be trained based on the third speech segment including the pseudo label and the second speech segment to generate a speech recognition model.
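The selection and data-merging steps above might look like the following sketch. It assumes that each reference model's word error rate has already been measured and is available as a mapping; the model names, the example rates and the function names are hypothetical, and the actual retraining call is omitted because the text does not name a training framework.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (segment id, transcript)


def select_model_to_retrain(wer_by_model: Dict[str, float]) -> str:
    """Return the name of the reference model with the lowest word error rate."""
    if not wer_by_model:
        raise ValueError("no reference models given")
    return min(wer_by_model, key=wer_by_model.get)


def build_training_set(
    labeled: List[Example], pseudo_labeled: List[Example]
) -> List[Tuple[str, str, bool]]:
    """Merge manually labeled and pseudo-labeled examples.

    The boolean flag marks pseudo labels so they can be told apart later.
    """
    merged = [(seg, text, False) for seg, text in labeled]
    merged += [(seg, text, True) for seg, text in pseudo_labeled]
    return merged


# Hypothetical word error rates for three reference models.
wers = {"model_1": 0.18, "model_2": 0.21, "model_3": 0.15}
chosen = select_model_to_retrain(wers)
training_set = build_training_set(
    labeled=[("second_001.wav", "thank you for calling")],
    pseudo_labeled=[("third_001.wav", "please hold the line")],
)
print(chosen, len(training_set))  # model_3 2
```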

Preferably, the step of obtaining the training sample further comprises:

training a plurality of reference models with the second speech segments to generate the plurality of reference speech recognition models with speech recognition capability; wherein the reference models have different network structures, and the reference speech recognition models have different word error rates.

In a second aspect, the present invention provides an acoustic model training system for speech recognition, the acoustic model training system comprising:

the first acquisition module is used for acquiring a training sample; wherein the training sample comprises a plurality of first voice segments without labels and a plurality of second voice segments with labels;

the recognition module is used for respectively inputting the first voice fragment into a plurality of preset different reference voice recognition models for recognition so as to obtain a plurality of recognition texts;

the calculation module is used for calculating the similarity value between every two recognition texts so as to determine the text score corresponding to the first voice fragment;

the judging module is used for judging whether the text score is greater than a preset threshold;

if so, calling a screening module, wherein the screening module is used for screening the first voice segment corresponding to the text score to serve as a third voice segment; wherein the third speech segment has a pseudo label generated after recognition by the reference speech recognition model;

and the training module is used for training and generating a voice recognition model based on the third voice segment and the second voice segment.

Preferably, if the judgment result of the judgment module is negative, a deletion module is called, and the deletion module is used for deleting the first voice segment corresponding to the text score.

Preferably, the system further comprises:

the second obtaining module is used for obtaining a recognition text to be corrected, which is generated after the third voice segment is recognized by the reference voice recognition model with the lowest word error rate;

and the third acquisition module is used for correcting the to-be-corrected recognized text to acquire the pseudo label corresponding to the third voice fragment.

Preferably, the calculation module comprises:

a similarity calculation unit for calculating the similarity between the recognition texts by using an edit distance method; wherein the edit distance method includes at least one of replacing one character with another character, inserting one character, and deleting one character;

and the text score calculating unit is used for calculating the text score corresponding to the first voice fragment based on the similarity.

Preferably, the training module comprises:

the selection unit is used for selecting the reference speech recognition model with the lowest word error rate as a model to be trained;

the updating unit is used for updating the training data of the model to be trained by using the third voice segment and its pseudo label;

and the training execution unit is used for retraining the model to be trained based on the third voice segment comprising the pseudo label and the second voice segment to generate a voice recognition model.

Preferably, the system further comprises:

a pre-training module for training a plurality of reference models with the second speech segments to generate the plurality of reference speech recognition models with speech recognition capability; wherein the reference models have different network structures, and the reference speech recognition models have different word error rates.

In a third aspect, the present invention also provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program is executed by the processor to implement the acoustic model training method for speech recognition according to the first aspect.

In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for training an acoustic model for speech recognition according to the first aspect.

The beneficial effects of the invention are as follows: following the idea of ensemble learning, the reference voice recognition models are used to construct pseudo labels for the unlabeled data set; similarity values calculated over the outputs of the reference voice recognition models are used to screen the pseudo-labeled training set; the originally obtained training set is combined with the pseudo-labeled training set; and a new voice recognition model is trained on the expanded data set. This overcomes the drawback that manual labeling is time-consuming and laborious, so that a large training data set cannot be built in a short time and the word error rate of speech recognition is high, and it ultimately improves the recognition accuracy of the acoustic model.

Drawings

Fig. 1 is a flowchart of an acoustic model training method for speech recognition according to embodiment 1 of the present invention.

Fig. 2 is a flowchart of step S4 of the acoustic model training method for speech recognition according to embodiment 1 of the present invention.

Fig. 3 is a flowchart of step S10 of the acoustic model training method for speech recognition according to embodiment 1 of the present invention.

Fig. 4 is a schematic block diagram of an acoustic model training system for speech recognition according to embodiment 2 of the present invention.

Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to embodiment 3 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
