Speech recognition model training method and system

Document No.: 193314    Publication date: 2021-11-02

Reading note: This technology, "Speech recognition model training method and system", was created by Wen Ya on 2021-07-30. Summary: The invention discloses a speech recognition model training method comprising: determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained; acquiring a preset domain training data set uploaded by the user; and training the model to be trained based on the preset domain training data set. The user only needs to select the model to be trained according to their needs and upload a training data set for the target domain to complete training of the selected model, and thus of the speech recognition model. The user needs no systematic algorithm or artificial-intelligence knowledge, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.

1. A method of speech recognition model training, comprising:

determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;

acquiring a preset domain training data set uploaded by the user;

and training the model to be trained based on the preset domain training data set.

2. The method according to claim 1, wherein the acquiring of the preset domain training data set uploaded by the user comprises:

detecting and acquiring the preset domain training data set uploaded by the user through an interactive interface; or detecting an acquisition request sent by the user through an API interface, and acquiring the preset domain training data set accordingly.

3. The method according to claim 1, wherein the training the model to be trained based on the preset domain training data set comprises:

for the acoustic model, processing the preset domain training data set to obtain an audio data set with text labels, so as to train the acoustic model;

for the language model, processing the preset domain training data set to obtain a plain-text corpus or a custom corpus template, so as to train the language model;

and for the hotword model, processing the preset domain training data set to obtain a vocabulary set of the preset domain, so as to train the hotword model.

4. The method of claim 1, further comprising: detecting a training mode selected by a user, wherein the training mode comprises an incremental training mode and a full training mode;

training the model to be trained based on the preset domain training data set, including: training the model to be trained in the training mode selected by the user, based on the preset domain training data set.

5. The method according to any one of claims 1-4, further comprising: testing the trained acoustic model, language model, or hotword model using a test audio data set uploaded in batch by the user.

6. The method according to any one of claims 1-4, further comprising: selecting a single test audio item to test the trained acoustic model, language model, or hotword model.

7. The method according to any one of claims 1-4, wherein a plurality of acoustic models, a plurality of language models, and a plurality of hotword models are trained;

the method further comprises the following steps:

testing each of the plurality of acoustic models using the test audio data set uploaded in batch by the user, to determine the acoustic model with optimal performance;

testing each of the plurality of language models using the test audio data set uploaded in batch by the user, to determine the language model with optimal performance; and

testing each of the plurality of hotword models using the test audio data set uploaded in batch by the user, to determine the hotword model with optimal performance.

8. A speech recognition model training system, comprising:

a model selection program module, configured to determine a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;

a user data input program module, configured to acquire a preset domain training data set uploaded by the user; and

a model training program module, configured to train the model to be trained based on the preset domain training data set.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.

10. A storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a speech recognition model training method and system.

Background

With growing data volumes, increasing computing power, and advances in deep-learning theory and technology, the accuracy of speech recognition keeps improving and its application fields keep widening. In interactive applications, for example, a voice assistant on a vehicle or mobile phone converts the user's speech into text the machine can understand, so that the machine executes the corresponding task and gives feedback, enabling natural human-machine communication. There are also non-interactive applications, such as trip recording for driver and passenger safety, customer-service quality inspection, and intelligent outbound calling.

Taking interactive products as an example, speech recognition accuracy can generally reach about 95% at the word level, but this still does not meet ever-growing business demands. In particular, for newly added special terms in vertical domains, such as English words, place names, and professional terminology, the speech recognition model of virtually any vendor struggles to meet service requirements unless it is specifically tuned.

At present, speech recognition optimization typically happens during research and development: models are trained and coefficients adjusted, and only after testing passes is the system deployed. If new special-domain data cannot be recognized after deployment, the model must be sent back to the development stage for retraining and coefficient adjustment, then retested and redeployed. Two problems arise. First, because retraining requires returning to the development stage, the model may pick up errors from manual intervention without the user's knowledge, and the model cannot be iterated in real time, so efficiency is low. Second, traditional training must be redone in batches, otherwise the model may not converge; the model therefore cannot be adjusted immediately each time special data is encountered, and retraining and redeployment are only possible after enough data has been collected, which cannot satisfy the business need for rapid recognition iteration. Since a business optimization cycle can take weeks or even months, the schedules of multiple business lines overlap, and urgent demands arise from time to time, the process depends entirely on a limited number of speech engineers, cannot respond in time, and provides insufficient support. In addition, communication costs are high and customers depend excessively on the speech vendor, with no room for customers to act on their own, which hurts business progress and user experience.

Disclosure of Invention

An embodiment of the present invention provides a method and a system for training a speech recognition model, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, including:

determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;

acquiring a preset domain training data set uploaded by the user;

and training the model to be trained based on the preset domain training data set.

In a second aspect, an embodiment of the present invention provides a speech recognition model training system, including:

a model selection program module, configured to determine a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;

a user data input program module, configured to acquire a preset domain training data set uploaded by the user; and

a model training program module, configured to train the model to be trained based on the preset domain training data set.

In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs that include execution instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any one of the above speech recognition model training methods of the present invention.

In a fourth aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform any one of the speech recognition model training methods of the present invention.

In a fifth aspect, the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, which, when executed by a computer, cause the computer to execute any one of the above-mentioned speech recognition model training methods.

According to the embodiments of the present invention, a model to be trained (an acoustic model and/or a language model and/or a hotword model) is determined according to a selection operation of a user, a preset domain training data set uploaded by the user is acquired, and the model to be trained is then trained on that data set. The user only needs to select the model to be trained according to their needs and upload a training data set for the target domain to complete training of the selected model, and thus of the speech recognition model. The user needs no systematic algorithm or artificial-intelligence knowledge, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of one embodiment of a speech recognition model training method of the present invention;

FIG. 2 is a flow chart of another embodiment of a speech recognition model training method of the present invention;

FIG. 3 is a functional block diagram of an embodiment of a speech recognition model training system of the present invention;

FIG. 4 is a functional block diagram of another embodiment of a speech recognition model training system of the present invention;

FIG. 5 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. For example, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution; an element may be localized on one computer and/or distributed between two or more computers, and may be operated from various computer-readable media. Elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., a signal from data interacting with another element in a local system or distributed system, and/or interacting with other systems across a network such as the Internet.

Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

As shown in fig. 1, an embodiment of the present invention provides a method for training a speech recognition model, including:

and S11, determining a model to be trained according to the selection operation of the user, wherein the model to be trained at least comprises one of an acoustic model to be trained, a language model to be trained and a hotword model to be trained.

S12: acquiring the preset domain training data set uploaded by the user.

Illustratively, the acquiring of the preset domain training data set uploaded by the user includes: detecting and acquiring the preset domain training data set uploaded by the user through the interactive interface; or detecting an acquisition request sent by the user through the API interface, and acquiring the preset domain training data set accordingly.
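The two acquisition channels above can be sketched as a single dispatch helper. This is illustrative only; `acquire_training_data` and its argument names are assumptions, not part of the disclosed system:

```python
def acquire_training_data(ui_upload=None, api_request=None):
    """Return the preset domain training data set from whichever channel the user used.

    ui_upload   -- data already uploaded through the interactive interface (or None)
    api_request -- a dict representing an acquisition request sent through the API (or None)
    """
    if ui_upload is not None:
        return ui_upload                   # data came through the UI upload
    if api_request is not None:
        return api_request.get("dataset")  # data referenced by the API request
    raise ValueError("no training data was provided")
```

Either channel yields the same data set, so the training steps that follow do not need to care how the data arrived.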

S13: training the model to be trained based on the preset domain training data set.

The method determines the model to be trained according to the user's selection operation, acquires the preset domain training data set uploaded by the user, and then trains the model to be trained on that data set. The user only needs to select the model to be trained according to their needs and upload a training data set for the target domain to complete training of the selected model, and thus of the speech recognition model. The user needs no systematic algorithm or artificial-intelligence knowledge, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.

FIG. 2 is a flowchart of another embodiment of the speech recognition model training method of the present invention. In this embodiment, the training the model to be trained based on the preset domain training data set includes:

S131: for the acoustic model, processing the preset domain training data set to obtain an audio data set with text labels, so as to train the acoustic model;

s132, for the language model, processing the preset domain training data set to obtain a pure text corpus or a corpus template based on self-definition so as to train the language model;

s133, processing the preset field training data set to obtain a vocabulary set of a preset field for the hot word model so as to train the hot word model.

In this embodiment, the training data set is processed specifically for each model (the acoustic model, the language model, and the hotword model) to obtain data suitable for training that model, which makes fast training possible. Because the preset domain training data set is processed according to each model's different requirements on training data, the user can select whichever model needs training according to their needs; this makes training more flexible and improves its pertinence and effectiveness.
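The per-model processing of steps S131-S133 might look like the following sketch. The record layout with `audio`/`label`/`text`/`terms` keys is an assumption for illustration, not a format the disclosure specifies:

```python
def preprocess(dataset, model_type):
    """Route the raw preset domain data set to the form each model needs (a sketch)."""
    if model_type == "acoustic":
        # S131: keep only items that pair audio with a text label
        return [d for d in dataset if "audio" in d and "label" in d]
    if model_type == "language":
        # S132: keep only the plain-text (or template) side of each item
        return [d["text"] for d in dataset if "text" in d]
    if model_type == "hotword":
        # S133: collect the domain vocabulary as a deduplicated, sorted list
        return sorted({w for d in dataset for w in d.get("terms", [])})
    raise ValueError(f"unknown model type: {model_type}")
```

One uploaded data set can thus feed whichever of the three models the user selected.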

In some embodiments, the speech recognition model training method of the present invention further comprises: detecting a training mode selected by the user, the training mode comprising an incremental training mode and a full training mode. Training the model to be trained based on the preset domain training data set then includes: training the model to be trained in the training mode selected by the user, based on the preset domain training data set.

In this embodiment, when training a model the user can select the training mode (for example, incremental or full) according to actual needs. When deployment is urgent and the required recognition accuracy is relatively low, the incremental training mode can be selected for fast training and fast deployment; when deployment is not urgent and high recognition accuracy is required, the full training mode can be selected to guarantee accuracy.

In some embodiments, the speech recognition model training method of the present invention further comprises: testing the trained acoustic model, language model, or hotword model using a test audio data set uploaded in batch by the user.

In this embodiment, the user uploads a test audio data set in batch through the UI interface to test the trained acoustic model, language model, or hotword model, so as to measure the performance of each customized model and select the best-performing one. Batch-uploading the test audio gives an efficient test of the trained model, and because every test audio item is treated identically, the test results are objective and accurate.
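A batch test of this kind typically reduces to computing word error rate (WER) over the uploaded pairs. A minimal sketch, assuming each test item is an (audio, reference transcript) pair and `recognize` is the trained model's decoding function (both assumptions for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER as Levenshtein distance over word sequences, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def batch_wer(test_set, recognize):
    """Average WER of `recognize` over (audio, reference transcript) pairs."""
    rates = [word_error_rate(ref, recognize(audio)) for audio, ref in test_set]
    return sum(rates) / len(rates)
```

Because every item goes through the same metric, the batch result is the objective accuracy figure mentioned above.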

In some embodiments, the speech recognition model training method of the present invention further comprises: selecting a single test audio item to test the trained acoustic model, language model, or hotword model.

In some embodiments, a plurality of acoustic models, a plurality of language models, and a plurality of hotword models are trained. In these embodiments, the speech recognition model training method of the present invention further comprises:

testing each of the plurality of acoustic models using the test audio data set uploaded in batch by the user, to determine the acoustic model with optimal performance;

testing each of the plurality of language models using the test audio data set uploaded in batch by the user, to determine the language model with optimal performance; and

testing each of the plurality of hotword models using the test audio data set uploaded in batch by the user, to determine the hotword model with optimal performance.

In this embodiment, a plurality of models (acoustic, language, and hotword) are selected and trained in advance, so that in the testing stage multiple candidates can be tested at the same time and the best-performing model selected, improving the efficiency of model training and testing.
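Selecting the best-performing candidate then amounts to taking the model with the lowest average error over the batch test set. A sketch with the scoring function left pluggable; all names here are illustrative assumptions:

```python
def best_model(models, test_set, score):
    """Return the name of the model whose average error over the test set is lowest.

    models   -- mapping from model name to a recognize(audio) callable
    test_set -- list of (audio, reference transcript) pairs
    score    -- function (reference, hypothesis) -> error measure, lower is better
    """
    def avg_error(recognize):
        errs = [score(ref, recognize(audio)) for audio, ref in test_set]
        return sum(errs) / len(errs)
    # min over model names, comparing each candidate's average error
    return min(models, key=lambda name: avg_error(models[name]))
```

The same routine serves acoustic, language, and hotword candidates; only the `models` mapping and the error metric change.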

FIG. 3 is a schematic block diagram of an embodiment of the speech recognition model training system of the present invention, which includes:

a model selection program module 100, configured to determine a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;

a user data input program module 200, configured to acquire a preset domain training data set uploaded by the user; and

a model training program module 300, configured to train the model to be trained based on the preset domain training data set.

The speech recognition model training system determines the model to be trained according to the user's selection operation, acquires the preset domain training data set uploaded by the user, and trains the model to be trained on that data set. The user only needs to select the model to be trained according to their needs and upload a training data set for the target domain to complete training of the selected model, and thus of the speech recognition model. The user needs no systematic algorithm or artificial-intelligence knowledge, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.

In some embodiments, the acquiring of the preset domain training data set uploaded by the user includes: detecting and acquiring the preset domain training data set uploaded by the user through an interactive interface; or detecting an acquisition request sent by the user through an API interface, and acquiring the preset domain training data set accordingly.

In some embodiments, the training the model to be trained based on the preset domain training dataset includes:

for the acoustic model, processing the preset domain training data set to obtain an audio data set with text labels, so as to train the acoustic model;

for the language model, processing the preset domain training data set to obtain a plain-text corpus or a custom corpus template, so as to train the language model; and

for the hotword model, processing the preset domain training data set to obtain a vocabulary set of the preset domain, so as to train the hotword model.

In some embodiments, the speech recognition model training system of the present invention further comprises:

a training mode detection module, configured to detect the training mode selected by the user, the training mode comprising an incremental training mode and a full training mode. Training the model to be trained based on the preset domain training data set then includes: training the model to be trained in the training mode selected by the user, based on the preset domain training data set.

In some embodiments, the speech recognition model training system of the present invention further comprises: a first testing module, configured to test the trained acoustic model, language model, or hotword model using a test audio data set uploaded in batch by the user.

In some embodiments, the speech recognition model training system of the present invention further comprises: a second testing module, configured to select a single test audio item to test the trained acoustic model, language model, or hotword model.

In some embodiments, a plurality of acoustic models, a plurality of language models, and a plurality of hotword models are trained. In these embodiments, the speech recognition model training system of the present invention further comprises a third testing module, configured to: test each of the plurality of acoustic models using the test audio data set uploaded in batch by the user, to determine the acoustic model with optimal performance; test each of the plurality of language models likewise, to determine the language model with optimal performance; and test each of the plurality of hotword models likewise, to determine the hotword model with optimal performance.

FIG. 4 is a schematic block diagram of another embodiment of the speech recognition model training system of the present invention, which includes: a user data input module, a data preprocessing service module, a model training module, an automatic model evaluation and testing module, a model publishing module, an online data acquisition module, and a speech annotation module. Wherein:

(1) User data input module:

a user can upload data set corpora used for training a language model, an acoustic model and a hot word model through a UI (user interface), and can also send an HTTP (hyper text transport protocol) request and transmit the data set corpora by calling an API (application program interface). At the level of corpus format requirement:

the corpus used for language model training needs to be a pure text corpus or a custom corpus template and entity. For example, for an airline scenario, a business involving reservation of tickets. Similarly: "Xiaoming wants to order a ticket to fly from Shanghai to Beijing", we abstract such a language as: "{ person name } wants to order a { quantity } ticket to fly from { city } to { city }. In the template, "names", "numbers" and "cities" are slots, and the slots can have a great number of specific entry entities, that is, the slots can be customized by users. Such as the name of a person: three Zhang, four Li, etc., in number: one, two, etc., city: beijing, Shanghai, Shenzhen, Guangzhou, etc.

The above example can be further extended and abstracted, i.e., the complexity of the template can be increased; a more complex template covers more concrete utterances. For example, "Xiaoming plans to order a train ticket from Shanghai to Beijing tomorrow" can be abstracted into the template: "{person name} {action} {quantity} {time} ticket from {city} to {city}". Each pair of braces {} represents a reserved slot, which can be filled with a corresponding entity entry.
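Template expansion of this kind can be sketched as a cartesian product over slot entities. This is illustrative; the disclosure does not specify the expansion algorithm:

```python
import itertools
import re

def expand_template(template, slots):
    """Generate concrete training sentences from a '{slot}'-style corpus template."""
    names = re.findall(r"\{(.+?)\}", template)   # slot names, in occurrence order
    choices = [slots[n] for n in names]          # candidate entities per occurrence
    sentences = []
    for combo in itertools.product(*choices):
        it = iter(combo)
        # replace each slot occurrence in turn; the same slot name may appear
        # more than once (e.g. {city} ... {city}) and is filled independently
        sentences.append(re.sub(r"\{.+?\}", lambda _: next(it), template))
    return sentences
```

Each expanded sentence is a plain-text corpus line ready for language model training.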

The corpus used for acoustic model training needs to be an audio data set with text labels.

The corpus used for training the hot word model needs to be a vocabulary set in the professional field.

(2) Data preprocessing module:

the data input by the user often has a problem in data format, and the data received by the language model, the acoustic model and the hotword training have respective standard format requirements. For this reason, it is necessary to process input data, such as text normalization, word segmentation, audio format standardization, annotation data processing, and the like.

(3) Model training module:

the model training module mainly comprises: language model self-training, acoustic model self-training, hotword model self-training. A user can create a language model, an acoustic model and a hotword model through a UI interface or an API, and after the models are created, a unique task ID is identified. The user may select a language model or an acoustic model training or hotword model. For example, a user can operate through a UI interactive interface and can also select a model to be trained through API interaction, the user does not need to have system algorithm and artificial intelligence knowledge, the user can complete recognition optimization more autonomously, and the threshold and the cost of training the model are reduced.

When selecting language model, acoustic model, or hotword model training, the user can choose different training modes, such as incremental training or full training. In incremental training mode, when the user triggers incremental training, a historically trained model can be selected, and iterative optimization is performed on the newly added data on the basis of that historical model. In full training mode, the user can select historical training data sets to combine with the current data set for training, or train with the current data set alone. In either mode, the user can also customize model training parameters. The user therefore has both a choice of multiple angles for optimizing speech recognition and room for deep optimization along a single angle.
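The incremental/full dispatch described above can be sketched as follows; the returned dict stands in for a real training job, and all names are illustrative assumptions:

```python
def train(model_type, dataset, mode="full", base_model=None, history=None):
    """Dispatch training according to the user-selected mode (a sketch).

    mode       -- "incremental" (iterate on a historical model with new data)
                  or "full" (train from scratch, optionally merging history)
    base_model -- previously trained model required by incremental mode
    history    -- optional historical data sets to merge in full mode
    """
    if mode == "incremental":
        if base_model is None:
            raise ValueError("incremental training needs a historical model")
        # optimize the historical model on the newly added data only
        return {"type": model_type, "base": base_model, "data": list(dataset)}
    if mode == "full":
        # superpose historical data sets (if any) with the current one
        merged = list(history or []) + list(dataset)
        return {"type": model_type, "base": None, "data": merged}
    raise ValueError(f"unknown training mode: {mode}")
```

Incremental mode trades some accuracy for speed; full mode rebuilds from the complete merged data.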

For the acoustic model, for example, the user can choose the training mode, trading off training time against training effect. With the user's input data unchanged, three modes are supported: the first has a short training time and a modest optimization effect; the second takes slightly longer with a moderate effect; the third takes a long time with a good effect. There is also the incremental-versus-full choice: the user can add data and train on the basis of a historically trained model, or retrain from scratch. More detailed parameters of the underlying algorithm are deliberately not exposed, to avoid unnecessary interference for users.

For the language model, the user can freely choose whether to interpolate the custom model with the large base model. Several interpolation parameters are therefore opened: whether to interpolate, the interpolation coefficient, whether to prune, and the pruning coefficient. These parameters control whether the custom model is interpolated with the large model, in what proportion, and whether and by how much the interpolated model is pruned.
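The four opened parameters can be illustrated with a toy example. This is a sketch under stated assumptions: the real system interpolates full n-gram language models, while here plain unigram probability dicts stand in for them, and all names are illustrative.

```python
# Minimal sketch of linear interpolation between a custom LM and a large base
# LM, followed by probability-based pruning. Unigram dicts stand in for real
# n-gram models purely to illustrate the four opened parameters.
def interpolate_and_prune(custom, base, do_interp=True, lam=0.5,
                          do_prune=False, prune_threshold=1e-4):
    if do_interp:
        # P(w) = lam * P_custom(w) + (1 - lam) * P_base(w)
        vocab = set(custom) | set(base)
        merged = {w: lam * custom.get(w, 0.0) + (1 - lam) * base.get(w, 0.0)
                  for w in vocab}
    else:
        merged = dict(custom)
    if do_prune:
        # Drop entries whose probability falls below the pruning coefficient.
        merged = {w: p for w, p in merged.items() if p >= prune_threshold}
    return merged
```

A larger `lam` keeps the interpolated model closer to the user's domain data; pruning then trades model size against coverage.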

For the hotword model, hotwords are abstracted into slots, and the user can upload a custom word list for each slot.
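The slot abstraction can be sketched as a mapping from a named slot to a user-uploaded word list. The class and method names below are hypothetical illustrations, not the system's actual interface.

```python
# Hypothetical sketch of slot-based hotword registration: each abstract slot
# (e.g. a contact list or a product catalogue) carries a user word list.
from collections import defaultdict

class HotwordModel:
    def __init__(self):
        self.slots = defaultdict(set)

    def upload(self, slot: str, words):
        """Attach a user-uploaded word list to an abstract slot."""
        self.slots[slot].update(words)

    def lookup(self, slot: str):
        """Return the registered hotwords for a slot, sorted for stability."""
        return sorted(self.slots[slot])
```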

(4) Model evaluation and test module:

The model evaluation test mainly comprises objective tests, subjective tests, and comparative tests.

In an objective test, the user can upload a batch of test audio data and select recognition tests against different customized language, acoustic, or hotword models, obtaining quantified speech recognition accuracy under each configuration.
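The quantified accuracy in such a batch test is typically a character error rate (CER) between reference transcripts and recognition outputs. The source does not name its exact metric, so the sketch below shows a standard CER computation as one plausible choice.

```python
# Sketch of the quantified accuracy of an objective batch test: character
# error rate (edit distance between reference and hypothesis, normalized by
# reference length), computed with a rolling-array Levenshtein DP.
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming (O(len(hyp)) memory)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (r != h))    # substitution/match
    return dp[len(hyp)]

def cer(refs, hyps):
    """Aggregate CER over a batch of (reference, hypothesis) transcripts."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total if total else 0.0
```

Running the same batch through two customized models and comparing their CER gives exactly the "accuracy under different configurations" the objective test is after.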

The speech recognition model training system in this embodiment of the application is a model self-training system. Its significance is that a user can collect data from the user's own business scenario and independently train the language, acoustic, and hotword models, completing user-level customized optimization of speech recognition. For example, some business scenarios are noisy, involve dialects or mixed Chinese and foreign languages, contain many proper names, or have a distinctive speaking style, such as court trials. In these situations ordinary general-purpose speech recognition performs poorly, and customized, personalized, and targeted optimization is required. With this system, each of a large number of users can complete such customization independently and with a low threshold, without the involvement of professional research and development personnel.

When selecting recognition tests of different customized language, acoustic, or hotword models, it is assumed that the user has customized each of these models. After customization, the user can upload a test data set and test the language model, acoustic model, and hotword model separately, measuring which optimization yields the most obvious improvement. The user can also select all three at once and run a joint test to observe the recognition improvement brought by the three dimensions acting together.

In a subjective test, the user can upload a single test audio file and select recognition tests against different customized language, acoustic, or hotword models, obtaining the speech recognition results under each condition and directly and intuitively verifying the effectiveness of the self-trained model.

In a comparative test, the user can upload batches of audio data for recognition testing. Before testing, the data are divided into two test groups, and different self-trained language, acoustic, or hotword models are selected for each group, so that the difference in recognition effect between the two self-trained models can be measured and the better model chosen for use.

Illustratively, a plurality of models, e.g., two language models, two acoustic models, and two hotword models, are trained in the training phase. When training, the user can start from three dimensions, selecting the language model, acoustic model, or hotword model as the training type. Assume the user selects the language model training type: the user can customize several language models under this type, then run model tests to compare them and determine which optimization is better. Similarly, multiple acoustic models and hotword models may be customized.

(5) Model publishing and online deployment module:

The model publishing module is mainly used to deploy the user's self-trained language, acoustic, and hotword models online. In actual production, the recognition service and the training service are often deployed on different machines within the same cluster: model resources that the user self-trains through the training service are stored on the server hosting the training service, so training is the production side, while the recognition service is the consumption side running on another machine. The trained model therefore has to be deployed to the machine hosting the recognition service. The user can trigger a release through the UI interface or the API, and a background program synchronizes the model to the recognition machine to complete release, deployment, and going online.
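The synchronization step of this publish flow can be sketched as copying the trained model artifact into each recognition machine's model directory. This is a deliberately simplified sketch: the function name, paths, and the use of a shared filesystem are all assumptions, and a real deployment would likely use a transfer protocol plus an activation/reload step.

```python
# Hypothetical sketch of the publish step: copy a trained model from the
# training server's storage into each recognition node's model directory.
import pathlib
import shutil

def publish(model_path, recognition_dirs):
    """Synchronize one trained model file to every recognition machine's
    model directory; return the list of deployed paths."""
    deployed = []
    for d in recognition_dirs:
        target = pathlib.Path(d) / pathlib.Path(model_path).name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(model_path, target)   # preserves timestamps/metadata
        deployed.append(str(target))
    return deployed
```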

(6) Online data acquisition module:

The online data acquisition module mainly completes two tasks. The first is filtering data according to a discrimination strategy and a threshold, and passing the filtered data directly to the acoustic model training system. The second is importing online data into the labeling system module according to a certain strategy for accurate data labeling.

Better optimization of a model usually requires a large amount of labeled data for support; however, labeled data take a long time to acquire and are costly. To optimize the model quickly even with little or no labeled data, a semi-supervised training mode is provided, which makes full use of the large amount of unlabeled online data to quickly self-train and optimize the model.

For example, for a new business scenario, online voice data are periodically pulled from a database table by a timer and fed into a data acquisition module consisting of several recall models and a selectable discrimination strategy; the module recalls the higher-quality speech and produces a corresponding pseudo label for it.

Although a large amount of voice data can be obtained online, accurate labels for those data are not easily obtained. Under such low-resource conditions, unsupervised or semi-supervised learning is an effective way to optimize speech recognition performance.

First, several data sets are randomly sampled from the small amount of existing labeled data, and a plurality of initial acoustic recall models are trained through the standard training process.

Second, the obtained initial acoustic recall models are used to recognize and decode the unlabeled online voice data, yielding recognized pseudo labels. The annotation can be stored either as the single best result or as multiple candidates; here the best result is stored.

Then, because the acoustic recall models are imperfect, the pseudo labels obtained by speech recognition decoding contain many recognition errors. A discrimination strategy is therefore defined: the automatically generated pseudo labels are screened by combining confidence and perplexity, so that only relatively reliable recognition results are retained.

Confidence: data are screened according to the posterior probability or likelihood score from decoding. Each sentence is given a confidence score, a score threshold is set, and sentences with high confidence are selected.
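The confidence criterion reduces to a threshold split over scored utterances; the document's stated default threshold is 0.75, with below-threshold data routed to manual labeling. The function name and tuple layout below are illustrative assumptions.

```python
# Sketch of the confidence criterion: each decoded sentence carries a
# posterior-probability score; sentences at or above the threshold are kept
# for training, the rest are routed to manual labeling.
def select_by_confidence(utterances, threshold=0.75):
    """utterances: list of (audio_id, pseudo_label, confidence) tuples.
    Returns (selected_for_training, sent_to_manual_labeling)."""
    selected = [u for u in utterances if u[2] >= threshold]
    to_manual = [u for u in utterances if u[2] < threshold]
    return selected, to_manual
```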

Perplexity: data are selected by computing the perplexity of the decoding result against the initial language model. A perplexity threshold is set, and sentences with lower perplexity are selected.
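Perplexity is the exponential of the average negative log-probability that the language model assigns to the decoded text; the document's stated default threshold is 100. In the sketch below a unigram probability dict stands in for the real initial language model, which is a simplifying assumption.

```python
# Sketch of the perplexity criterion: score decoded text against the initial
# LM (a unigram dict here) and keep sentences below a perplexity threshold.
import math

def perplexity(tokens, unigram_probs, floor=1e-8):
    """exp of the average negative log-probability of the tokens;
    unknown tokens get a small floor probability."""
    logp = sum(math.log(unigram_probs.get(t, floor)) for t in tokens)
    return math.exp(-logp / max(len(tokens), 1))

def select_by_perplexity(sentences, unigram_probs, threshold=100.0):
    return [s for s in sentences
            if perplexity(s.split(), unigram_probs) <= threshold]
```

Well-formed in-domain sentences score low perplexity and survive; gibberish decodes score very high and are discarded.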

The sentence confidence criterion selects data by the reliability of the decoded text produced by the multiple acoustic models, while the perplexity criterion selects data by how well the decoded text matches the language model. Because their principles differ, the two selection criteria complement each other. The combined strategy merges and deduplicates the data selected by the two methods, exploiting this complementarity so that data selection is more reliable.
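The merge-and-deduplicate step of the combined strategy can be sketched as a union over utterance IDs. The keying choice (deduplicate by audio ID, keep the first label seen) is an illustrative assumption.

```python
# Sketch of the combined strategy: union the data chosen by the confidence
# criterion and by the perplexity criterion, deduplicated by utterance ID.
def merge_selections(by_confidence, by_perplexity):
    """Each input: list of (audio_id, pseudo_label). Dedup keeps first seen."""
    merged = {}
    for audio_id, label in by_confidence + by_perplexity:
        merged.setdefault(audio_id, label)
    return list(merged.items())
```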

The recall model refers to the acoustic model and corresponds to the confidence criterion in the discrimination strategy. That is, data are filtered in turn according to the posterior probability in the decoding result against a designed confidence threshold, for example 0.75: data below the threshold are sent for manual labeling, while data above it are selected for model training.

The data recalled from online are added to the model as new training data, and the model is automatically optimized and its parameters tuned against the test set provided by the business side. Finally, the optimized model performs capability output, and the resulting acoustic model is also returned to the data acquisition recall module for data screening, so that the quality of the next round of recalled data is improved by updating the recall models. (Here, capability output means that the selected data pass through the data screening system into the model training system, which produces the trained acoustic model.) Illustratively, each optimized acoustic model obtained is pushed back to the data acquisition recall module and treated as one of the acoustic recall models. During this period, to improve the quality of the data recalled by the data acquisition module, pseudo-label prediction is not performed with a single model; instead, under the prescribed thresholds, several models similar to the target scene are selected, according to a measure of similarity, for data selection and pseudo-label prediction. This ensures that the data quality can effectively improve model training performance, and it also increases the diversity of the training samples, making the model more robust during training.
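The closed loop above can be sketched as maintaining a recall-model pool that grows with each optimized model, from which the models most similar to the target scene are picked for the next round. The similarity scores and selection rule here are illustrative placeholders, since the document does not specify its similarity measure.

```python
# Sketch of the recall-model update loop: newly optimized acoustic models join
# the pool; each round selects up to k models whose scene similarity passes a
# threshold for pseudo-label prediction. Similarity values are placeholders.
def update_pool(pool, new_model):
    """Add an optimized acoustic model back into the recall-model pool."""
    pool.append(new_model)
    return pool

def pick_recall_models(pool, similarities, k=3, min_sim=0.5):
    """Rank pool models by scene similarity, keep the top-k above min_sim."""
    ranked = sorted(zip(pool, similarities), key=lambda x: -x[1])
    return [m for m, s in ranked[:k] if s >= min_sim]
```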

Illustratively, a single model refers to a neural network model whose input is speech and whose output is the pseudo label of that speech, such as the acoustic recall model mentioned above, which is also a neural network model.

Illustratively, the prescribed thresholds include confidence and perplexity. The threshold under the confidence criterion can be set and adjusted, with a default of 0.75; the threshold under the perplexity criterion can likewise be set and adjusted, with a default of 100.

(7) Voice labeling system module:

When a user autonomously optimizes the acoustic model, a large amount of labeled data is generally needed to support self-training. So that the user can directly obtain labeled data for acoustic model training, a voice data labeling system module is provided. The user can upload large amounts of voice data through the UI or API interface and issue a data labeling task, which is allocated to labeling personnel. After labeling is completed, the labeled task data are pushed to data auditing, and data that pass the audit can be used directly for language model or acoustic model training.

The invention realizes self-training of the language model, acoustic model, and hotword model, directly lowering the threshold for users to independently optimize speech recognition performance and enabling efficient and rapid recognition optimization. The user does not need to understand the deeper speech recognition algorithm logic of the underlying layer or the link logic of its optimization; the user only needs to pay attention to the business scenario and to which real data that scenario can collect. By collecting the data related to the business scenario and uploading them to the system, scenario optimization can be completed from multiple angles. The self-training system also goes further: because the business scenarios suited to the whole system are broad and each flow module runs as a single micro-service, the user can conveniently carry out secondary development on the system. In addition, the modules of the system link are loosely coupled, which helps users locate problems and strengthens service stability. When designing the system link, the design principle of high cohesion and low coupling is followed. For example, language model training, acoustic model training, and hotword model training each optimize speech recognition from a different dimension; the three are independent, and the user can selectively deploy one or more of them. For this purpose, each optimization module exists as a business API service. Moreover, the three are computationally intensive, so isolation is appropriate from the viewpoints of service stability and maintenance.

It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above-described speech recognition model training methods of the present invention.

In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speech recognition model training methods described above.

In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech recognition model training method.

In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, where the computer program is used to implement a method for training a speech recognition model when the computer program is executed by a processor.

The speech recognition model training system according to the embodiment of the present invention may be used to execute the speech recognition model training method according to the embodiment of the present invention, and accordingly achieve the technical effects achieved by the speech recognition model training method according to the embodiment of the present invention, and will not be described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).

Fig. 5 is a schematic hardware structure diagram of an electronic device for performing a speech recognition model training method according to another embodiment of the present application, and as shown in fig. 5, the electronic device includes:

one or more processors 510 and memory 520, with one processor 510 being an example in fig. 5.

The apparatus for performing the speech recognition model training method may further include: an input device 530 and an output device 540.

The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.

The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speech recognition model training method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing by executing nonvolatile software programs, instructions and modules stored in the memory 520, so as to implement the speech recognition model training method of the above method embodiment.

The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the speech recognition model training apparatus, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 520 may optionally include memory located remotely from processor 510, which may be connected to a speech recognition model training apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the speech recognition model training device. The output device 540 may include a display device such as a display screen.

The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform a speech recognition model training method in any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.

(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.

(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.

(5) And other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions substantially or contributing to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
