Text error correction method, device, terminal and storage medium

Document No.: 1378979    Publication date: 2020-08-14

Note: This technology, "Text error correction method, device, terminal and storage medium" (文本的纠错方法、装置、终端、及存储介质), was designed and created by 郭晗暄, 单彦会, 李娜, 郑文彬 and 罗红 on 2020-04-17. Abstract: The embodiment of the invention relates to the field of artificial intelligence and discloses a text error correction method, apparatus, terminal, and computer-readable storage medium. In the invention, the text error correction method includes: obtaining a sentence to be corrected; converting the sentence to be corrected into a sentence vector; inputting the sentence vector of the sentence to be corrected into a trained neural machine translation (NMT) model to obtain the sentence vector of the corrected sentence output by the NMT model; and converting the sentence vector of the corrected sentence into the corrected sentence. The embodiment of the invention can reduce the workload of manual maintenance, thereby reducing processing cost.

1. A method for correcting a text, comprising:

obtaining a sentence to be corrected;

converting the sentence to be corrected into a sentence vector;

inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT;

and converting the sentence vector of the error-corrected sentence into the error-corrected sentence.

2. The method according to claim 1, wherein before the step of obtaining the sentence to be corrected, the method further comprises: generating the neural machine translation model NMT;

the step of generating the neural machine translation model NMT comprises:

carrying out new word discovery processing on the original corpus to generate original corpus participles;

adding the original corpus participles and pre-collected hot words into a word segmentation table;

segmenting the original sentences in the original corpus according to the segmentation table to generate original sentence segments;

converting the original sentence participles into sentence vectors;

converting a correct sentence corresponding to the original sentence into a sentence vector;

inputting a sentence vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence into an NMT model for training;

and generating the neural machine translation model NMT through training of the sentence vector pair.

3. The method according to claim 2, wherein said step of performing new word discovery processing on the original corpus to generate original corpus participles comprises:

dividing original sentences in the original corpus into word fragments;

determining the word segments which can possibly form words according to the solidification degree of the word segments;

and when the value of the information entropy of the word segment which can be formed into a word is larger than a preset threshold value, determining the word segment as an original corpus participle.

4. The method according to claim 2, wherein the step of generating the neural machine translation model, NMT, comprises:

carrying out new word discovery processing on original corpora corresponding to different speaking roles to generate original corpus participles corresponding to different speaking roles;

adding the original corpus participles and pre-collected hot words into a participle table; the word segmentation table corresponds to the same speaking role as the original corpus;

segmenting the original sentences in the original corpus according to the segmentation table to generate original sentence segments; the word segmentation table corresponds to the same speaking role as the original corpus;

converting the original sentence participles into sentence vectors;

converting a correct sentence corresponding to the original sentence into a sentence vector;

inputting a vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence and a corresponding speaking role into an NMT model for training;

and generating the neural machine translation model NMT through the training of the vector pair.

5. The method of claim 4,

the step of obtaining the sentence to be corrected comprises the following steps: acquiring a sentence to be corrected and a speaking role corresponding to the sentence to be corrected;

the step of inputting the sentence vector of the sentence to be corrected into the trained neural machine translation model NMT comprises: and inputting the sentence vector of the sentence to be corrected and the speaking role corresponding to the sentence to be corrected into a trained neural machine translation model NMT.

6. The method according to claim 1, wherein the step of converting the sentence to be error-corrected into a sentence vector comprises:

segmenting the sentence to be corrected to generate word segments of the sentence to be corrected;

converting each word segment of the sentence to be corrected into a word vector;

and combining the word vectors of the word segments of the sentence to be corrected to generate a sentence vector of the sentence to be corrected.

7. The method of claim 6,

while the step of converting each word segment of the sentence to be corrected into a word vector is performed, the method further comprises: generating a corresponding relation between the word segments and the word vectors;

the step of converting the sentence vector of the error-corrected sentence into an error-corrected sentence includes:

generating error-corrected word vectors from the sentence vector of the error-corrected sentence through a decoder;

acquiring error-corrected participles corresponding to the error-corrected word vectors according to the corresponding relation between the participles and the word vectors;

and combining the error-corrected word segments to generate an error-corrected sentence.

8. An apparatus for correcting a text, comprising:

an acquisition unit configured to acquire a sentence to be error-corrected;

a first conversion unit for converting the sentence to be corrected into a sentence vector;

the input unit is used for inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT so as to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT;

and the second conversion unit is used for converting the sentence vector of the error-corrected sentence into the error-corrected sentence.

9. A terminal, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of correcting text according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a method of correcting a text according to any one of claims 1 to 7.

Technical Field

The embodiment of the invention relates to the field of artificial intelligence, in particular to a text error correction method, a text error correction device, a text error correction terminal and a computer readable storage medium.

Background

Voice conversation mainly refers to the rounds of dialogue between two parties in a telephone call; it is usually colloquial and repetitive and contains much useless information. Speech cannot be processed directly and needs to be transcribed into text by automatic speech recognition (ASR). During transcription, due to the influence of noise, the speaker's tone and other factors, the transcribed text is of poor quality and cannot be used directly for actual downstream tasks. To address this problem, the text converted from speech needs to be corrected, and errors of grammar, syntax and the like in the text need to be fixed, so as to meet the requirements of subsequent tasks.

Text error correction at the current stage mainly relies on rule-based error correction algorithms. A rule-based error correction algorithm mainly comprises two steps, error detection and error correction: the suspected erroneous sentence is analyzed, and candidate sentences or words are generated at the error positions for replacement.

The inventors found that at least the following problem exists in the related art: the rules used by a rule-based error correction algorithm need to be built and maintained manually, so the workload of manual maintenance is large and the processing cost is high.

Disclosure of Invention

An object of embodiments of the present invention is to provide a method, an apparatus, a terminal, and a computer-readable storage medium for correcting a text, which can reduce the workload of manual maintenance, thereby reducing the cost of text correction.

In order to solve the above technical problem, an embodiment of the present invention provides a text error correction method, including:

obtaining a sentence to be corrected;

converting the sentence to be corrected into a sentence vector;

inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT;

and converting the sentence vector of the error-corrected sentence into the error-corrected sentence.

The embodiment of the present invention further provides a text error correction apparatus, including:

an acquisition unit configured to acquire a sentence to be error-corrected;

a first conversion unit for converting the sentence to be corrected into a sentence vector;

the input unit is used for inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT so as to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT;

and the second conversion unit is used for converting the sentence vector of the error-corrected sentence into the error-corrected sentence.

An embodiment of the present invention further provides a terminal, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for correcting text.

The embodiment of the invention also provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the text error correction method.

Compared with the prior art, the embodiment of the invention obtains the sentence to be corrected; converting the sentence to be corrected into a sentence vector; inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT; and converting the sentence vector of the error-corrected sentence into the error-corrected sentence. Therefore, automatic sentence correction is performed by using the neural machine translation model NMT, and the workload of manual maintenance can be reduced, thereby reducing the processing cost.

In addition, before the step of obtaining the sentence to be corrected, the method further includes: generating the neural machine translation model NMT;

the step of generating the neural machine translation model NMT comprises:

carrying out new word discovery processing on the original corpus to generate original corpus participles;

adding the original corpus participles and pre-collected hot words into a word segmentation table;

segmenting the original sentences in the original corpus according to the segmentation table to generate original sentence segments;

converting the original sentence participles into sentence vectors;

converting a correct sentence corresponding to the original sentence into a sentence vector;

inputting a sentence vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence into an NMT model for training;

and generating the neural machine translation model NMT through training of the sentence vector pair.

In the embodiment of the invention, a new word discovery processing step is added when generating the neural machine translation model NMT to produce the entries of the word segmentation table, and some common hot words are also added to the table; this improves word segmentation accuracy and provides a certain improvement for the subsequent error correction task. Because the word segments are generated from the original corpus, the semantics of the context and the frequency of occurrence in context are taken into account, which further improves segmentation accuracy and therefore error correction accuracy.

Further, the step of generating the neural machine translation model NMT comprises: carrying out new word discovery processing on original corpora corresponding to different speaking roles to generate original corpus participles corresponding to different speaking roles; adding the original corpus participles and pre-collected hot words into a participle table; the word segmentation table corresponds to the same speaking role as the original corpus; segmenting the original sentences in the original corpus according to the segmentation table to generate original sentence segments; the word segmentation table corresponds to the same speaking role as the original corpus; converting the original sentence participles into sentence vectors; converting a correct sentence corresponding to the original sentence into a sentence vector; inputting a vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence and a corresponding speaking role into an NMT model for training; and generating the neural machine translation model NMT through the training of the vector pair. In the above embodiment, different processing is performed on the texts of different roles in the error correction process according to different dialog texts corresponding to different roles. During word segmentation, different roles correspond to different word lists, and complexity of model operation can be reduced.

In addition, the step of converting the sentence to be corrected into a sentence vector includes: segmenting the sentence to be corrected to generate word segments of the sentence to be corrected; converting each word segment of the sentence to be corrected into a word vector; and combining the word vectors of the word segments of the sentence to be corrected to generate a sentence vector of the sentence to be corrected. In the above embodiment, the sentence to be corrected is converted into sentence-vector form before being input into the model, which improves the compatibility with different formats of the sentence to be corrected.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which elements with the same reference numerals denote similar elements; the drawings are not to scale unless otherwise specified.

Fig. 1 is a flowchart of a text error correction method according to a first embodiment of the present invention;

fig. 2 is a flowchart of a text error correction method according to a second embodiment of the present invention;

FIG. 3 is a flow chart of a text error correction method according to another embodiment of the present invention;

FIG. 4 is a diagram of a neural machine translation model according to another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for correcting text errors according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can be implemented without these technical details, and with various changes and modifications based on the following embodiments. The following division into embodiments is for convenience of description and does not limit the specific implementation of the present invention; the embodiments may be combined with and refer to each other where no contradiction arises.

The first embodiment of the present invention relates to a text error correction method, the flow of which is shown in fig. 1 and specifically includes the following steps:

step 11, generating a neural machine translation model NMT;

step 12, obtaining a sentence to be corrected;

step 13, converting the sentence to be corrected into a sentence vector;

step 14, inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT to obtain the sentence vector of the sentence after error correction output by the neural machine translation model NMT;

and step 15, converting the sentence vector of the error-corrected sentence into an error-corrected sentence.

Compared with the prior art, the embodiment of the invention obtains the sentence to be corrected; converts the sentence to be corrected into a sentence vector; inputs the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT; and converts the sentence vector of the corrected sentence into the corrected sentence. Because the embodiment of the invention adopts the neural machine translation model NMT to correct sentences automatically, the workload of manual maintenance can be reduced, thereby reducing the processing cost.

In addition, the embodiment of the invention converts the sentence to be corrected into vector form, so the neural machine translation model NMT has no special requirement on the text format of the sentence to be corrected, and the compatibility with sentences to be corrected is higher.

Wherein, step 11 comprises:

step 111, carrying out new word discovery processing on the original corpus to generate original corpus participles;

step 112, adding the original corpus participles and pre-collected hot words into a word segmentation table; the embodiment of the invention adds a new word discovery processing step and also adds some common hot words into the word segmentation table, which improves the word segmentation effect and provides a certain improvement for the subsequent error correction task.

step 113, segmenting the original sentences in the original corpus according to the word segmentation table to generate original sentence word segments; this step can use a word segmentation tool such as jieba.

Step 114, converting the original sentence participles into sentence vectors;

step 115, converting the correct sentence corresponding to the original sentence into a sentence vector;

step 116, inputting a sentence vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence into an NMT model for training;

step 117, generating the neural machine translation model NMT through training on the sentence vector pairs.

The neural machine translation model NMT can be trained aiming at the original linguistic data of a specific scene, and the accuracy of the neural machine translation model NMT aiming at the specific scene can be improved. The embodiment of the invention can be used for correcting the error of the text which is converted from the voice, and certainly can also be used for correcting the error of the written text. The invention can train the model according to different scenes, thereby reducing the labor cost and improving the automation degree.

Wherein step 111 comprises:

step 1111, dividing the original sentences in the original corpus into word fragments;

step 1112, determining the word segments which can be possibly formed into words according to the solidification degree of the word segments;

and step 1113, when the information entropy of a word fragment that may form a word is larger than a preset threshold value, determining the word fragment as an original corpus participle.

In the invention, the main process of new word discovery is to calculate the internal cohesion (solidification degree) and the degree of free use (measured by information entropy) of a candidate word, thereby improving the accuracy of word segmentation.

Wherein step 13 comprises:

step 131, performing word segmentation on the sentence to be corrected to generate word segments of the sentence to be corrected; a word segmentation tool such as jieba can be used, and the word segmentation table generated by the method described above can also be used here.

Step 132, converting each word segment of the sentence to be corrected into a word vector; at the same time, the one-to-one corresponding relation between the word segments and the word vectors is recorded for later use.

Step 133, combining the word vectors of the word segments of the sentence to be corrected to generate the sentence vector of the sentence to be corrected. The combination can be done in various ways, including: (1) simply taking a weighted average of the word vectors; or (2) using the embedding layer of the model to represent the sentence vector from the word vectors.
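As an illustration of method (1) above, the following is a minimal sketch (not the patent's implementation) of combining the word vectors of a segmented sentence into a sentence vector by weighted averaging; the vectors and weights are made-up toy values.

```python
# A minimal sketch of combining word vectors into a sentence vector by
# (weighted) averaging. Method (2), an embedding layer inside the model,
# is not shown here.
import numpy as np

def sentence_vector(word_vectors, weights=None):
    """Combine word vectors (1-D arrays of equal dimension) into one sentence vector."""
    vectors = np.stack(word_vectors)            # shape: (num_words, dim)
    if weights is None:
        weights = np.ones(len(word_vectors))    # plain average
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # normalize the weights
    return weights @ vectors                    # shape: (dim,)

# Usage: three 4-dimensional word vectors with equal weights.
vecs = [np.array([0.1, 0.2, 0.3, 0.4]),
        np.array([0.0, 0.1, 0.0, 0.1]),
        np.array([0.4, 0.3, 0.2, 0.1])]
print(sentence_vector(vecs))
```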

Accordingly, step 15 comprises:

Step 151, generating error-corrected word vectors from the sentence vector of the error-corrected sentence through a decoder. The decoding here corresponds to the word-vector-to-sentence-vector processing in step 133 and is its inverse process.

Step 152, obtaining the error-corrected word segments corresponding to the error-corrected word vectors according to the corresponding relationship between the word segments and the word vectors;

and step 153, combining the error-corrected word segments to generate an error-corrected sentence.
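To illustrate steps 151 to 153, here is a minimal sketch, under assumed data structures, of mapping the corrected word vectors produced by the decoder back to word segments via the recorded word-to-vector correspondence and joining them into the corrected sentence; the nearest-neighbour lookup and the toy vocabulary are illustrative assumptions, not the patent's exact procedure.

```python
# Map each corrected word vector back to a known word segment using the
# recorded word <-> word-vector correspondence, then join the segments.
import numpy as np

def decode_sentence(corrected_word_vectors, word_to_vec):
    """Look up the nearest known word for each corrected word vector."""
    words = list(word_to_vec.keys())
    matrix = np.stack([word_to_vec[w] for w in words])    # (vocab_size, dim)
    segments = []
    for vec in corrected_word_vectors:
        distances = np.linalg.norm(matrix - vec, axis=1)  # Euclidean distance
        segments.append(words[int(np.argmin(distances))])
    return "".join(segments)   # Chinese text: concatenate without spaces

# Usage with a toy two-word vocabulary.
word_to_vec = {"你好": np.array([1.0, 0.0]), "世界": np.array([0.0, 1.0])}
print(decode_sentence([np.array([0.9, 0.1]), np.array([0.2, 0.8])], word_to_vec))
```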

Another embodiment of the present invention relates to a text error correction method, a flow of which is shown in fig. 2 and specifically includes the following steps:

step 21, generating a neural machine translation model NMT;

step 22, obtaining a sentence to be corrected and a speaking role corresponding to the sentence to be corrected;

step 23, converting the sentence to be corrected into a sentence vector; this step may be: converting the sentence to be corrected into a fixed-length sentence vector through a word-to-vector (word2vec) model.

Step 24, inputting the sentence vector of the sentence to be corrected and the speaking role corresponding to the sentence to be corrected into the trained neural machine translation model NMT to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT;

and step 25, converting the sentence vector of the error-corrected sentence into an error-corrected sentence.

In the above embodiment, in order to reduce the complexity of the model in the error correction process, different roles correspond to different word lists during word segmentation, and the model can perform different processing on sentences with different roles.

Accordingly, step 21 comprises:

step 211, performing new word discovery processing on the original corpus corresponding to different speaking roles, and generating original corpus participles corresponding to different speaking roles;

step 212, adding the original corpus participles and pre-collected hot words into a word segmentation table; the word segmentation table corresponds to the same speaking role as the original corpus;

step 213, segmenting the original sentence in the original corpus according to the segmentation table to generate original sentence segmentation; the word segmentation table corresponds to the same speaking role as the original corpus;

step 214, converting the original sentence participles into sentence vectors so as to represent the original sentences by the vectors;

step 215, representing the correct sentence corresponding to the original sentence by a sentence vector;

step 216, inputting a vector pair composed of the sentence vector of the original sentence and the sentence vector of the correct sentence and a corresponding speaking role into an NMT model for training;

and step 217, generating the neural machine translation model NMT through the training of the vector pairs.

In the embodiment, in the training process of the model, different processing aiming at different conversation roles is considered, and the simplicity and the accuracy of the model are improved. Moreover, the automatic model training is carried out on different roles each time, so that the automatic processing level can be improved.

The steps of the above methods are divided only for clarity of description; in implementation, several steps may be combined into one step, or one step may be split into multiple steps, and such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to the algorithm or process, or introducing insignificant design changes, without changing the core design also falls within the protection scope of this patent.

The following describes an application scenario of the present invention. The embodiment of the invention provides a text error correction method, which is a text error correction model based on neural machine translation and can correct text obtained from speech recognition or other texts. The idea of the invention is as follows: first, new word discovery is performed on the original spoken dialogue text to obtain a list of fragments in the original corpus that are likely to form words, and this list, together with pre-collected hot words, is added to the jieba word segmentation list, which is then used to segment each sentence in the original corpus; next, the word vector of each word is obtained from the segmentation result with a word-to-vector (word2vec) model; after the word vectors are obtained, each sentence in the dialogue is represented as a vector. The sentence-vector pairs of sentences to be corrected and their correct sentences are input into the NMT model for training, and through training on a large number of such pairs the NMT model can learn the error patterns of a specific scene. In subsequent use, it is only necessary to input the vector representation of a sentence to be corrected into the trained NMT model, which automatically outputs the sentence vector of the corrected sentence; the sentence vector is then converted back into the corrected sentence.

The following is a detailed description. The embodiment of the invention relates to a text error correction method, as shown in fig. 3, comprising the following steps:

First, a text error correction model is trained, based on the idea of neural machine translation, on data consisting mainly of texts to be corrected after speech recognition and their manually labeled correct sentences. Before the model is trained, the original text files need to be processed. Dialogue texts transcribed by automatic speech recognition (ASR) are acquired; these texts contain transcription errors as well as the role information of each sentence, and different roles correspond to different dialogue texts. In order to reduce the complexity of the model in the error correction process, the texts of different roles are processed differently: during word segmentation, different roles correspond to different word lists.

Then, new word discovery is performed on the original spoken dialogue text to obtain a list of fragments in the original corpus that are likely to form words; this list and the pre-collected hot words are added to the jieba word segmentation list, and each sentence in the original corpus is segmented. That is, a new word discovery algorithm extracts new words from the original text, and the extracted new words together with the pre-collected hot words form the segmentation word list used to segment the original corpus.

Specifically, in order to obtain better word segmentation, hot words that may be used by the different roles are collected first, and new words in the original corpus are then collected with a new word discovery algorithm. New word discovery is an unsupervised way of building a lexicon: by discovering certain language features (mainly statistical features) without supervision, it determines which character fragments in the corpus are likely to be new words. The main process of new word discovery is to calculate the internal cohesion (solidification degree) and the degree of free use of a fragment. Fragments that may form words are first determined according to their cohesion; the degree of free use of each such fragment is then calculated with information entropy, and the fragment is determined to be a new word when the computed information entropy is larger than a preset threshold.

In order to calculate the cohesion of a text fragment, all the ways of splitting it into two parts are enumerated. Let p(x) be the probability that the text fragment x appears in the whole corpus. Then the cohesion of "电影院" (movie theater) is defined as:

min{ p(电影院) / (p(电) · p(影院)),  p(电影院) / (p(电影) · p(院)) }

and the cohesion of "的电影" ("the movie") is:

min{ p(的电影) / (p(的) · p(电影)),  p(的电影) / (p(的电) · p(影)) }

The calculation shows that the cohesion of "电影院" is clearly greater than that of "的电影", so "电影院" is much more likely to form a word.

After the cohesion has been calculated, the degree of free use of the fragment is also needed. For example, consider the two fragments "被子" (quilt) and "辈子" (lifetime). Many different characters can precede "被子": one can buy a quilt (买被子), cover oneself with a quilt (盖被子), get into a quilt (进被子), or fetch a quilt (拿被子). The usage of "辈子", however, is very fixed: apart from "一辈子", "这辈子", "上辈子" and "下辈子", essentially no other character can be placed in front of it. Because the characters that can appear to the left of "辈子" are so limited, intuitively "辈子" does not form a word on its own; rather, it forms whole words such as "一辈子" and "这辈子" together with the preceding character. Therefore, the degree of free use of a text fragment is also an important criterion for judging whether it forms a word: if a fragment can count as a word, it should appear flexibly in many different environments and have very rich sets of left and right neighbouring characters. Cohesion and freedom of use are both indispensable criteria: looking only at the former finds fragments that are actually half of a word, while looking only at the latter finds "junk phrases" such as the fragment "的电影" discussed above. The embodiment of the invention uses information entropy to calculate the degree of free use of a fragment. Information entropy reflects how much information the outcome of an event brings on average; here it is used to measure the randomness of the sets of characters adjacent to the left and right of a text fragment.

The information entropy is calculated as H = -Σ_{i=1}^{n} p_i · log(p_i), where p_i is the probability of the i-th left (or right) neighbouring character appearing next to the fragment in the text, n is the size of the set of left (or right) neighbouring characters, and i is the index within that set.

Take the sentence "吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮" ("eat grapes without spitting out the grape skins; don't eat grapes yet spit out the grape skins") as an example. The word "葡萄" (grape) appears four times; its left neighbouring characters are {吃, 吐, 吃, 吐} (eat, spit, eat, spit) and its right neighbouring characters are {不, 皮, 倒, 皮} (not, skin, instead, skin). According to the definition of information entropy, the entropy of the left neighbours of "葡萄" is -(1/2)·log(1/2) - (1/2)·log(1/2) = log 2 ≈ 0.693, and the entropy of its right neighbours is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) = (3/2)·log 2 ≈ 1.04 (natural logarithm). It can be seen that the right neighbourhood of "葡萄" is richer in this sentence. The information entropy is compared with a preset threshold: if it is larger than the threshold, the fragment can form a word; otherwise it cannot.
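The entropy values above can be checked with a few lines of code; the helper below simply applies the information entropy formula to the left- and right-neighbour character sets of "葡萄" in the example sentence (natural logarithm assumed).

```python
# Quick numeric check of the neighbour-entropy example above.
from collections import Counter
from math import log

def neighbour_entropy(chars):
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum(c / total * log(c / total) for c in counts.values())

left = ["吃", "吐", "吃", "吐"]    # characters to the left of "葡萄"
right = ["不", "皮", "倒", "皮"]   # characters to the right of "葡萄"
print(neighbour_entropy(left))    # ~0.693  (= ln 2)
print(neighbour_entropy(right))   # ~1.040  (= 1.5 * ln 2), richer right neighbourhood
```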

Then, in the word segmentation step, the jieba word segmentation tool can be used. jieba supports loading a user word list; the word list obtained in the previous step is added to jieba, and the original text is then segmented with jieba using this word list. Because the hot words and newly discovered words have been added, the segmentation result is better than it would be without the word list.
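A minimal sketch of this segmentation step using jieba's user-dictionary interface; the file name user_dict.txt and the example words are assumptions for illustration only.

```python
# Segment text with jieba after loading the custom word list built from
# hot words and newly discovered words.
import jieba

# "user_dict.txt" (assumed to exist) lists one word per line, optionally with
# a frequency and part-of-speech tag, following jieba's user-dictionary format.
jieba.load_userdict("user_dict.txt")

# Individual words can also be added programmatically.
jieba.add_word("电影院")

sentence = "吃葡萄不吐葡萄皮"
print(jieba.lcut(sentence))   # list of word segments for this sentence
```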

Next, a fixed-length word vector is obtained for each word by training a word-to-vector (word2vec) model on the segmentation result; that is, the segmented words are used to train a word2vec model, and each word is then represented in the form of a word vector. These word vectors are subsequently used to build the inputs for training the neural machine translation model.

In natural language processing tasks, a word has two kinds of vector representation: a discrete representation and a distributed representation. The discrete (one-hot) representation uses a long vector whose dimension equals the size of the vocabulary; exactly one element of the vector is 1 and all the others are 0, and the position of the 1 corresponds to the position of the word in the dictionary. Vectors represented this way can be separated in a high-dimensional space and are simple to obtain, but their dimensionality becomes very large, which is unfavourable for representation. The distributed representation expresses a word as a dense, continuous vector of fixed length; it can express similarity relations between words and carry more information in the vector. The preferred way of generating word vectors in the embodiment of the present invention is to train a word2vec model.
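As a sketch of how the word vectors could be trained (the patent does not prescribe a specific toolkit), the following uses gensim's Word2Vec implementation, assuming gensim 4.x; the toy segmented corpus stands in for the jieba output.

```python
# Train a word2vec model on the segmented corpus and read out word vectors.
from gensim.models import Word2Vec

segmented_corpus = [
    ["吃", "葡萄", "不", "吐", "葡萄", "皮"],
    ["不", "吃", "葡萄", "倒", "吐", "葡萄", "皮"],
]

model = Word2Vec(
    sentences=segmented_corpus,
    vector_size=100,   # dimension of each word vector ("size" in gensim < 4.0)
    window=5,          # context window
    min_count=1,       # keep every word in this tiny corpus
    sg=1,              # skip-gram; 0 would select CBOW
)

vector = model.wv["葡萄"]   # fixed-length (100-dimensional) vector for one word
print(vector.shape)
```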

Then, after the word vectors are obtained, each sentence in the dialogue is represented as a vector, and the vector pairs of sentences to be corrected and their correct sentences are input into the NMT model for training; through training on a large number of such text pairs, the NMT model can learn the error patterns of a specific scene. Specifically, after every word in a sentence has been represented as a word vector, the word vectors need to be combined to convert the whole sentence into a vector representation. The combination can be done in various ways, for example: taking a weighted average of the word vectors; or using an embedding layer to represent the sentence vector from the individual word vectors.

A neural machine translation (NMT) model translates one language into another; its idea is to use neural-network-based techniques to obtain translations that are more accurate in context, rather than translating isolated words or fragments one at a time. Traditional machine translation generally used large statistical models developed with highly complex linguistic knowledge, but much recent research models the translation process directly with a deep model and automatically learns the necessary linguistic knowledge when only the original data and the translated text data are provided. Translation models based on deep neural networks currently give the best results. For such a neural machine translation model, only the vectors of the source-language sentences and the target-language sentences need to be input for training.

NMT computes the probabilities of word sequences with a large artificial neural network and handles the complete sentence within one integrated model. Its neurons, designed to mimic the neurons of the human brain, learn and accumulate information, form connections, and evaluate the input as a whole. NMT is mainly divided into two stages, encoding and decoding: the encoding stage processes the source-language text, takes it as model input and represents it in vector form; the decoding stage then maps these vectors into the target language. Throughout the translation process, the technique translates not only words and phrases but also context and information. The embodiment of the invention applies the idea of neural machine translation to text error correction by treating the sentence to be corrected as the source language and the correct sentence as the target language. Handling text error correction in this way simplifies a great deal of work: the model can be trained with only the sentences to be corrected and their corresponding correct sentences, which greatly simplifies the labeling work.

The neural machine translation model generally consists of four parts: a simple recurrent neural network (RNN), an RNN with word embeddings, a bidirectional RNN, and an encoder-decoder model. Following this idea, the sentence to be corrected is fed in as the source input of the NMT network, and the correct sentence corresponding to the erroneous sentence is produced as the target output, yielding an end-to-end error correction model. The structure of the model is shown in FIG. 4: the source input is a sentence containing a recognition error and the target output is the corresponding correct sentence, and the error correction model for a specific scene is trained on a large number of such text pairs.
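The patent does not give code for the model, so the following is only a rough sketch of an encoder-decoder network of the kind described (bidirectional GRU encoder, GRU decoder) operating directly on word-vector sequences in PyTorch; all dimensions, layer choices and the MSE training objective are illustrative assumptions rather than the patent's exact architecture.

```python
# Sketch of an encoder-decoder error-correction model over word-vector sequences.
import torch
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    def __init__(self, vec_dim=100, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(vec_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Project the concatenated forward/backward final states to the
        # decoder's hidden size.
        self.bridge = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = nn.GRU(vec_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vec_dim)   # predict corrected word vectors

    def forward(self, source_vectors, target_vectors):
        # source_vectors, target_vectors: (batch, seq_len, vec_dim)
        _, h = self.encoder(source_vectors)                 # h: (2, batch, hidden)
        h = torch.tanh(self.bridge(torch.cat([h[0], h[1]], dim=-1))).unsqueeze(0)
        decoded, _ = self.decoder(target_vectors, h)        # teacher forcing
        return self.out(decoded)                            # (batch, seq_len, vec_dim)

# Usage with random stand-in data: 2 sentences of 7 word vectors each.
model = Seq2SeqCorrector()
src = torch.randn(2, 7, 100)
tgt = torch.randn(2, 7, 100)
pred = model(src, tgt)
loss = nn.functional.mse_loss(pred, tgt)   # train toward the correct-sentence vectors
loss.backward()
print(pred.shape)   # torch.Size([2, 7, 100])
```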

Then, in the subsequent use process, the sentence vector of the sentence after error correction can be automatically output only by inputting the vector representation form of the sentence to be corrected into the trained neural machine translation model NMT, and an end-to-end error correction model is realized.

An embodiment of the present invention further provides a text error correction apparatus, as shown in fig. 5, including:

an acquisition unit configured to acquire a sentence to be error-corrected;

a first conversion unit for converting the sentence to be corrected into a sentence vector;

the input unit is used for inputting the sentence vector of the sentence to be corrected into a trained neural machine translation model NMT so as to obtain the sentence vector of the corrected sentence output by the neural machine translation model NMT;

and the second conversion unit is used for converting the sentence vector of the error-corrected sentence into the error-corrected sentence.

The error correction apparatus further includes: a generating unit for generating the neural machine translation model NMT, specifically by: carrying out new word discovery processing on the original corpus to generate original corpus participles; adding the original corpus participles and pre-collected hot words into a word segmentation table; segmenting the original sentences in the original corpus according to the segmentation table to generate original sentence segments; converting the original sentence participles into sentence vectors; converting a correct sentence corresponding to the original sentence into a sentence vector; inputting a sentence vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence into an NMT model for training; and generating the neural machine translation model NMT through training of the sentence vector pairs.

The new word discovery processing is performed on the original corpus, and the generation of the original corpus participles specifically comprises the following steps: dividing original sentences in the original corpus into word fragments; determining the word segments which can possibly form words according to the solidification degree of the word segments; and when the value of the information entropy of the word segment which can be formed into a word is larger than a preset threshold value, determining the word segment as an original corpus participle.

Wherein the generating of the neural machine translation model NMT specifically includes: carrying out new word discovery processing on original corpora corresponding to different speaking roles to generate original corpus participles corresponding to different speaking roles; adding the original corpus participles and pre-collected hot words into a participle table; the word segmentation table corresponds to the same speaking role as the original corpus; segmenting the original sentences in the original corpus according to the segmentation table to generate original sentence segments; the word segmentation table corresponds to the same speaking role as the original corpus; converting the original sentence participles into sentence vectors; converting a correct sentence corresponding to the original sentence into a sentence vector; inputting a vector pair consisting of the sentence vector of the original sentence and the sentence vector of the correct sentence and a corresponding speaking role into an NMT model for training; and generating the neural machine translation model NMT through the training of the vector pair.

The obtaining of the sentence to be corrected specifically includes: acquiring a sentence to be corrected and a speaking role corresponding to the sentence to be corrected;

the specific step of inputting the sentence vector of the sentence to be corrected into the trained neural machine translation model NMT is as follows: and inputting the sentence vector of the sentence to be corrected and the speaking role corresponding to the sentence to be corrected into a trained neural machine translation model NMT.

The converting of the sentence to be corrected into a sentence vector specifically includes: segmenting the sentence to be corrected to generate word segments of the sentence to be corrected; converting each word segment of the sentence to be corrected into a word vector; and combining the word vectors of the word segments of the sentence to be corrected to generate the sentence vector of the sentence to be corrected.

When the word segments of the sentence to be corrected are converted into word vectors, a corresponding relation between the word segments and the word vectors is also generated;

the converting the sentence vector of the error-corrected sentence into the error-corrected sentence specifically comprises: generating an error-corrected word vector by the sentence vector of the sentence to be error-corrected through a decoder; acquiring error-corrected participles corresponding to the error-corrected word vectors according to the corresponding relation between the participles and the word vectors; and combining the error-corrected word segments to generate an error-corrected sentence.

It should be noted that each module referred to in this embodiment is a logical module, and in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not so closely related to solving the technical problems proposed by the present invention are not introduced in the present embodiment, but this does not indicate that other elements are not present in the present embodiment.

An embodiment of the present invention further provides a terminal, as shown in fig. 6, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for correcting text.

Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.

Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.

The embodiment of the invention has the following beneficial effects:

1. the embodiment of the invention provides a speech text error correction model based on neural machine translation, which is characterized in that a vector pair of a sentence to be corrected and a correct sentence is input into an NMT model for training, so that the NMT error correction model suitable for a specific dialogue scene can be obtained.

2. The embodiment of the invention uses the NMT model to correct sentences, which saves a great deal of extra work on part-of-speech tagging and word-frequency statistics and greatly reduces labor cost; and since manual tagging may itself contain a certain amount of error, removing the dependence on such tagging can also improve error correction performance.

3. The embodiment of the invention adds the function of new word discovery in the word segmentation process, and simultaneously adds some common hot words into the word segmentation word list, thereby greatly improving the word segmentation effect and having certain improvement effect on the subsequent error correction task.

As can be understood by those skilled in the art, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware; the program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.
