Optimization method and device of n-gram language model, computer equipment and storage medium

Document No.: 1273695  Publication date: 2020-08-25

Note: This technology, "Optimization method and device of n-gram language model, computer equipment and storage medium" (n-gram语言模型的优化方法、装置、计算机设备和存储介质), was created by 张旭华, 齐欣, 孙泽明, 朱林林, and 王宁 on 2020-04-07. Its main content is as follows: the invention discloses a method and an apparatus for optimizing an n-gram language model, a computer device, and a storage medium. The method comprises: screening out, from the original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus; acquiring, from the original model file of the n-gram language model to be optimized, original n-grams corresponding to the similar corpora; generating a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-grams and the number of segmented words of the target corpus and the probabilities of the original n-grams; and adding the target n-gram to the original model file. On the basis of changing neither the acoustic model nor the pronunciation dictionary, the invention quickly optimizes the recognition effect of the original n-gram language model on the target corpus.

1. A method for optimizing an n-gram language model, the method comprising the steps of:

screening out, from an original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus;

acquiring, from an original model file of the n-gram language model to be optimized, an original n-gram corresponding to the similar corpus;

generating a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and according to the probability of the original n-gram; and

adding the target n-gram to the original model file.

2. The method for optimizing an n-gram language model according to claim 1, wherein the step of screening out similar corpora matching the target corpus from the original corpus table of the n-gram language model to be optimized comprises:

vectorizing, with a pre-trained word vector model, each corpus in the original corpus table of the n-gram language model to be optimized and the target corpus, to obtain a word vector of each corpus in the original corpus table and a word vector of the target corpus;

calculating the similarity between the target corpus and each corpus in the original corpus table according to the word vector of each corpus in the original corpus table and the word vector of the target corpus; and

screening out, from the calculated similarities, target similarities that meet a preset threshold, and taking the corpora in the original corpus table corresponding to the target similarities as the similar corpora matching the target corpus.

3. The method for optimizing an n-gram language model according to claim 1 or 2, wherein the generating of the target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and the probability of the original n-gram comprises:

performing word segmentation on the target corpus to obtain a word-segmentation result and the number of segmented words of the target corpus;

judging whether the number of segmented words of the target corpus is greater than the highest order of the original n-gram;

if so, splitting the word-segmentation result into target word-segmentation results whose word counts equal the highest order of the original n-gram; otherwise, directly taking the word-segmentation result as the target word-segmentation result; and

replacing the word group in the original n-gram with the target word-segmentation result to generate the target n-gram, and determining the probability of the target n-gram according to the probability of the original n-gram.

4. The method for optimizing an n-gram language model according to claim 3, wherein the determining of the probability of the target n-gram according to the probability of the original n-gram comprises:

acquiring the probability of an original n-gram of the same order as the target n-gram and taking it as the probability of the target n-gram.

5. The method for optimizing an n-gram language model according to claim 3, wherein the determining of the probability of the target n-gram according to the probability of the original n-gram further comprises:

acquiring the probability of an original n-gram of the same order as the target n-gram, determining a weight value of the target n-gram according to the corresponding target similarity, and calculating the probability of the target n-gram from the probability of the original n-gram and the weight value of the target n-gram.

6. The method for optimizing an n-gram language model according to claim 1 or 2, wherein before the adding of the target n-gram to the original model file, the method further comprises:

judging whether the target n-gram already exists in the original model file;

if so, acquiring the probability of the n-gram in the original model file that corresponds to the target n-gram; otherwise, directly adding the target n-gram to the original model file; and

judging whether the probability of the target n-gram is greater than the probability of the corresponding n-gram in the original model file, and if so, replacing the corresponding n-gram in the original model file with the target n-gram.

7. An apparatus for optimizing an n-gram language model, the apparatus comprising:

a corpus matching module, configured to screen out, from an original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus;

an n-gram acquisition module, configured to acquire, from an original model file of the n-gram language model to be optimized, an original n-gram corresponding to the similar corpus;

an n-gram generation module, configured to generate a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and the probability of the original n-gram; and

a file adding module, configured to add the target n-gram to the original model file.

8. The apparatus for optimizing an n-gram language model according to claim 7, wherein the corpus matching module comprises:

a vector generation unit, configured to vectorize, with a pre-trained word vector model, each corpus in the original corpus table of the n-gram language model to be optimized and the target corpus, to obtain a word vector of each corpus in the original corpus table and a word vector of the target corpus;

a similarity calculation unit, configured to calculate the similarity between the target corpus and each corpus in the original corpus table according to the word vector of each corpus in the original corpus table and the word vector of the target corpus; and

a corpus screening unit, configured to screen out, from the calculated similarities, target similarities that meet a preset threshold, and to take the corpora in the original corpus table corresponding to the target similarities as the similar corpora matching the target corpus.

9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a method and an apparatus for optimizing an n-gram language model, a computer device, and a storage medium.

Background

In the speech recognition task there are two important models: the acoustic model and the language model. The basic principle of speech decoding is as follows: the acoustic model scores the acoustic features and outputs the probability of a phoneme string, and the language model scores the corresponding text string; finally, the two scores are combined, and the text string with the highest probability is given as the recognition result. A common formula is the Bayes decoding rule:

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\,P(W)$$

where $X$ is the sequence of acoustic features, $P(X \mid W)$ is the score given by the acoustic model, and $P(W)$ is the score given by the language model.

it can be seen that the language model plays a very important role in the final decoding result. The language model most used today is the statistical-based n-gram language model. n-Gram is a language model commonly used in large vocabulary continuous speech recognition to realize the conversion from phonemes to words, wherein the words can be Chinese words or English words. Generally, the acoustic model gives probabilities of phoneme sequences, and the language model scales the phoneme sequences by counting the probabilities between words using the language model probabilities, so that word sequences more conforming to language habits are output.

A large amount of related corpora is usually needed to train a good language model for a specific scene. In reality, however, most of the available training corpora are general corpora, and it is very difficult to find a large amount of natural corpora related to the scene. In some scenes many proper nouns are difficult to recognize; although recognition can be improved by modifying the word-segmentation dictionary and the pronunciation dictionary, the acoustic model then needs to be iterated along with the language model to achieve a good effect. Moreover, continuously modifying the dictionary makes it larger and larger and may even affect the recognition of other words.

Therefore, how to improve the recognition effect of the n-gram language model on specific vocabulary is a problem that currently needs to be solved.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for optimizing an n-gram language model, a computer device, and a storage medium, so as to overcome the prior-art difficulty that certain special words in specific scenes are hard to recognize.

In order to solve one or more of the above technical problems, the invention adopts the following technical solutions:

In a first aspect, a method for optimizing an n-gram language model is provided, the method comprising the following steps:

screening out, from an original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus;

acquiring, from an original model file of the n-gram language model to be optimized, an original n-gram corresponding to the similar corpus;

generating a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and according to the probability of the original n-gram; and

adding the target n-gram to the original model file.

Further, the step of screening out similar corpora matching the target corpus from the original corpus table of the n-gram language model to be optimized includes:

vectorizing, with a pre-trained word vector model, each corpus in the original corpus table of the n-gram language model to be optimized and the target corpus, to obtain a word vector of each corpus in the original corpus table and a word vector of the target corpus;

calculating the similarity between the target corpus and each corpus in the original corpus table according to the word vector of each corpus in the original corpus table and the word vector of the target corpus; and

screening out, from the calculated similarities, target similarities that meet a preset threshold, and taking the corpora in the original corpus table corresponding to the target similarities as the similar corpora matching the target corpus.
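A minimal sketch of this screening step is shown below. It assumes each corpus is already represented by a fixed-length sentence vector (for example, an average of its word vectors from the pre-trained word vector model) and that cosine similarity is the similarity measure; the toy vectors and the 0.8 threshold are illustrative assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen_similar(corpus_vecs, target_vec, threshold=0.8):
    """Return (corpus, similarity) pairs whose similarity meets the threshold."""
    hits = [(corpus, cosine(vec, target_vec))
            for corpus, vec in corpus_vecs.items()]
    return sorted([h for h in hits if h[1] >= threshold],
                  key=lambda h: h[1], reverse=True)

# Toy vectors standing in for word-vector averages of corpora in the table.
table = {
    "weather in beijing":  np.array([0.90, 0.10, 0.00]),
    "weather in shanghai": np.array([0.88, 0.12, 0.02]),
}
print(screen_similar(table, np.array([0.91, 0.09, 0.01])))
```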

Further, the generating of the target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and the probability of the original n-gram includes:

performing word segmentation on the target corpus to obtain a word-segmentation result and the number of segmented words of the target corpus;

judging whether the number of segmented words of the target corpus is greater than the highest order of the original n-gram;

if so, splitting the word-segmentation result into target word-segmentation results whose word counts equal the highest order of the original n-gram; otherwise, directly taking the word-segmentation result as the target word-segmentation result; and

replacing the word group in the original n-gram with the target word-segmentation result to generate the target n-gram, and determining the probability of the target n-gram according to the probability of the original n-gram.
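The sketch below illustrates one plausible implementation of this generation step: the segmented target corpus is split with a sliding window when its word count exceeds the highest order, and each resulting word group replaces the word group of a same-order original n-gram, inheriting that n-gram's probability. The data shapes and the sliding-window split are assumptions for illustration.

```python
def make_target_ngrams(target_words, original_ngrams, highest_order):
    """Generate target n-grams from the segmented target corpus.

    target_words    : list of words from segmenting the target corpus
    original_ngrams : list of (word_tuple, log_prob) pairs taken from the
                      original model file for the similar corpora
    highest_order   : highest order n of the original n-grams
    """
    # Split the segmentation result when it is longer than the highest order.
    if len(target_words) > highest_order:
        windows = [tuple(target_words[i:i + highest_order])
                   for i in range(len(target_words) - highest_order + 1)]
    else:
        windows = [tuple(target_words)]

    targets = []
    for window in windows:
        # Replace the word group of a same-order original n-gram.
        for words, log_prob in original_ngrams:
            if len(words) == len(window):
                targets.append((window, log_prob))
                break
    return targets

# Example with hypothetical data: a 4-word target against 3-gram originals.
originals = [(("dragon", "sentry", "area"), -2.5)]
print(make_target_ngrams(["shengting", "garden", "hotel", "lobby"], originals, 3))
# -> [(('shengting', 'garden', 'hotel'), -2.5), (('garden', 'hotel', 'lobby'), -2.5)]
```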

Further, the determining of the probability of the target n-gram according to the probability of the original n-gram includes:

acquiring the probability of an original n-gram of the same order as the target n-gram and taking it as the probability of the target n-gram.

Further, the determining of the probability of the target n-gram according to the probability of the original n-gram further includes:

acquiring the probability of an original n-gram of the same order as the target n-gram, determining a weight value of the target n-gram according to the corresponding target similarity, and calculating the probability of the target n-gram from the probability of the original n-gram and the weight value of the target n-gram.
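One plausible reading of this weighting, sketched below, uses the target similarity itself as the weight value and applies it multiplicatively to the same-order original probability, which becomes an additive shift in log space. The exact weighting function is not fixed by the text, so this scaling is an assumption.

```python
import math

def target_log_prob(orig_log10_prob: float, similarity: float) -> float:
    """Weight the same-order original probability by the target similarity.

    Assumption: the weight value equals the similarity (0 < similarity <= 1)
    and scales the probability multiplicatively, p_target = similarity * p_orig,
    i.e. an additive shift in log10 space.
    """
    return orig_log10_prob + math.log10(similarity)

# A corpus more similar to the target keeps more of the original probability
# mass, so different similar corpora receive different weights.
print(target_log_prob(-2.5, 0.95))  # ~ -2.52
print(target_log_prob(-2.5, 0.80))  # ~ -2.60
```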

Further, before the adding of the target n-gram to the original model file, the method further includes:

judging whether the target n-gram already exists in the original model file;

if so, acquiring the probability of the n-gram in the original model file that corresponds to the target n-gram; otherwise, directly adding the target n-gram to the original model file; and

judging whether the probability of the target n-gram is greater than the probability of the corresponding n-gram in the original model file, and if so, replacing the corresponding n-gram in the original model file with the target n-gram.
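A brief sketch of this merge check, assuming the original model file has already been parsed into a dictionary that maps word tuples to log-probabilities (this in-memory representation is an assumption for illustration):

```python
def merge_target_ngrams(model, targets):
    """Add target n-grams to the parsed model, keeping the higher probability
    when the same n-gram already exists.

    model   : {word_tuple: log_prob} parsed from the original model file
    targets : [(word_tuple, log_prob)] generated target n-grams
    """
    for words, log_prob in targets:
        if words not in model:
            model[words] = log_prob       # absent: add directly
        elif log_prob > model[words]:
            model[words] = log_prob       # present with lower probability: replace
    return model

model = {("shengting", "garden", "hotel"): -3.1}
merge_target_ngrams(model, [(("shengting", "garden", "hotel"), -2.52)])
print(model)  # the higher probability (-2.52) replaces the original entry
```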

In a second aspect, an apparatus for optimizing an n-gram language model is provided, the apparatus comprising:

a corpus matching module, configured to screen out, from an original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus;

an n-gram acquisition module, configured to acquire, from an original model file of the n-gram language model to be optimized, an original n-gram corresponding to the similar corpus;

an n-gram generation module, configured to generate a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and the probability of the original n-gram; and

a file adding module, configured to add the target n-gram to the original model file.

Further, the corpus matching module includes:

a vector generation unit, configured to vectorize, with a pre-trained word vector model, each corpus in the original corpus table of the n-gram language model to be optimized and the target corpus, to obtain a word vector of each corpus in the original corpus table and a word vector of the target corpus;

a similarity calculation unit, configured to calculate the similarity between the target corpus and each corpus in the original corpus table according to the word vector of each corpus in the original corpus table and the word vector of the target corpus; and

a corpus screening unit, configured to screen out, from the calculated similarities, target similarities that meet a preset threshold, and to take the corpora in the original corpus table corresponding to the target similarities as the similar corpora matching the target corpus.

In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the following steps are implemented:

screening out, from an original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus;

acquiring, from an original model file of the n-gram language model to be optimized, an original n-gram corresponding to the similar corpus;

generating a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and according to the probability of the original n-gram; and

adding the target n-gram to the original model file.

In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the following steps are performed:

screening out, from an original corpus table of the n-gram language model to be optimized, similar corpora that match a target corpus;

acquiring, from an original model file of the n-gram language model to be optimized, an original n-gram corresponding to the similar corpus;

generating a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and according to the probability of the original n-gram; and

adding the target n-gram to the original model file.

The technical solutions provided by the embodiments of the invention have the following beneficial effects:

1. The embodiments of the invention provide a method and an apparatus for optimizing an n-gram language model, a computer device, and a storage medium. Similar corpora matching the target corpus are screened out from the original corpus table of the n-gram language model to be optimized; an original n-gram corresponding to the similar corpus is then acquired from the original model file of the n-gram language model to be optimized; a target n-gram corresponding to the target corpus is generated according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and the probability of the original n-gram; and finally the target n-gram is added to the original model file. The recognition effect of the original n-gram language model on the target corpus is thus quickly optimized without changing the acoustic model or the pronunciation dictionary;

2. In the method, apparatus, computer device, and storage medium for optimizing an n-gram language model provided by the embodiments of the invention, the similarity between the target corpus and each corpus in the original corpus table is calculated from the word vector of each corpus in the original corpus table and the word vector of the target corpus, which improves both the efficiency and the accuracy of calculating the similarity between the corpora in the original corpus table and the target corpus;

3. In the method, apparatus, computer device, and storage medium for optimizing an n-gram language model provided by the embodiments of the invention, the probability of an original n-gram of the same order as the target n-gram is acquired, the weight value of the target n-gram is determined according to the corresponding target similarity, and the probability of the target n-gram is calculated from the probability of the original n-gram and the weight value of the target n-gram, so that different similar corpora receive different weights, which preserves the discriminative power of the n-gram language model.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a flow diagram illustrating a method for optimizing an n-gram language model, according to an exemplary embodiment;

FIG. 2 is a flow diagram illustrating the screening of similar corpora matching a target corpus from an original corpus table of an n-gram language model to be optimized, according to an exemplary embodiment;

FIG. 3 is a flowchart illustrating a process of generating a target n-gram corresponding to a target corpus according to the relation between the highest order of the original n-gram and the number of segmented words of the target corpus and the probability of the original n-gram, according to an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating an apparatus for optimizing an n-gram language model, according to an exemplary embodiment;

FIG. 5 is a schematic diagram of an internal structure of a computer device shown in accordance with an example embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Currently, the most widely used language model in speech recognition tasks is the statistics-based n-gram language model. As with other model training, most of the training corpora generally available for the n-gram language model are general corpora. However, training a good n-gram language model for a specific scene requires a large amount of related corpora, and it is difficult to find a large amount of natural corpora related to the scene. In some scenarios many proper nouns are difficult to recognize: they are divided into single words during word segmentation, which affects the calculation result when the language-model probability is computed.

Some scene words are exemplified in Table 1 below; as shown, many proper nouns are broken into single words during word segmentation. Words such as those in rows 1 and 2 of the table are common in certain scenes. Although recognition can be improved by modifying the word-segmentation dictionary and the pronunciation dictionary, the acoustic model also needs to be iterated while the language model is trained to achieve a good effect, which makes the implementation complicated. The person and place names in rows 3, 4, and 5 belong to an open set; if the dictionary is continuously modified to cover them, it grows larger and larger and may affect the recognition of other words. Therefore, it is desirable to provide an optimization method for the n-gram language model that can rapidly improve the recognition of specific scene words without updating the vocabulary or training the acoustic model.

TABLE 1

No. | Word-segmentation result | Original scene phrase
1 | Dragon sentry area tear open and violate | Dragon sentry area dismantling violation office
2 | Sweat Abadde military airport | Sweat Abadde military airport
3 | The rest becoming the subsidiary leader | The rest becomes the subsidiary leader
4 | Shengting garden hotel | Shengting garden hotel
5 | Wuklan New Zen general blade-Harnuovif | All-leaf of Ukrainian Kanuorufu

The application creatively proposes to find, in the n-gram language model, the n-grams corresponding to words related to the target corpus (i.e., the specific scene words), and to replace the related words in the found n-grams with the target corpus, thereby forming new n-grams and the corresponding lower-order grams. Finally, all the generated new n-grams are added to the model file, where the probability of a newly generated n-gram can be expressed as a weighted form of the probability of the related original n-gram of the same order. The process involves neither updating the vocabulary nor training the acoustic model, so the recognition effect of the model on the target corpus can be improved rapidly.
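n-gram model files are commonly stored in the ARPA text format, in which each entry holds a log10 probability, the word sequence, and an optional back-off weight. The patent does not name a file format, so the ARPA-style serialization sketched below is an assumption about how a newly generated target n-gram might be written into the model file.

```python
from typing import Optional

def arpa_line(log10_prob: float, words: tuple, backoff: Optional[float] = None) -> str:
    """Format one n-gram entry as an ARPA-style line:
    <log10 prob><TAB><word1 ... wordn>[<TAB><backoff>]."""
    line = f"{log10_prob:.4f}\t{' '.join(words)}"
    if backoff is not None:
        line += f"\t{backoff:.4f}"
    return line

# Hypothetical target 3-gram carrying a similarity-weighted probability.
print(arpa_line(-2.52, ("shengting", "garden", "hotel")))
# -> "-2.5200	shengting garden hotel"
```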

FIG. 1 is a flow diagram illustrating a method for optimizing an n-gram language model according to an exemplary embodiment. Referring to FIG. 1, the method includes the following steps:

s1: screening out similar corpora matched with the target corpora from a primitive corpus table of the n-gram language model to be optimized;

s2: acquiring an original n-gram corresponding to the similar corpus from an original model file of the n-gram language model to be optimized;

s3: generating a target n-gram corresponding to the target corpus according to the relation between the highest order of the original n-gram and the word number of the target corpus and the probability of the original n-gram;

s4: and adding the target n-gram into the original model file.

Specifically, when the original n-gram language model is required to recognize a target corpus (such as certain specific scene words) with high accuracy, and model optimization is to be achieved quickly without changing the pronunciation dictionary or the acoustic model, the embodiment of the invention generates n-grams corresponding to the target corpus and adds them to the model file of the original n-gram language model. To generate the n-grams corresponding to the target corpus quickly, words similar to the target corpus are found in the n-gram language model to be optimized, and the related words in the found n-grams are then replaced with the target corpus, forming new n-grams and the corresponding lower-order grams.
