Text automatic generation method based on theme

Document No.: 1215800    Publication date: 2020-09-04

Note: This technology, a theme-based automatic text generation method, was designed and created by 路松峰 (Lu Songfeng) and 李天成 (Li Tiancheng) on 2020-04-01. The invention discloses a theme-based text automatic generation method comprising the following steps: 1) acquire a corpus, preprocess the sentences in it, and extract the subject words and word vectors of the preprocessed sentences; 2) construct a text generation model and input the word vectors obtained in step 1) to train the model parameters; 3) input a text to be generated, extract its subject words, obtain their word vectors, and input these subject word vectors into the parameter-trained text generation model of step 2) to generate a new text. The text generated by the invention is fluent and coherent, involves all input subject words, and is closely related to them.

1. A theme-based automatic text generation method, characterized by comprising the following steps:

1) acquiring a corpus, preprocessing sentences in the corpus, and extracting subject words and word vectors of the preprocessed sentences;

2) constructing a text generation model, and inputting the word vectors obtained in the step 1) to train model parameters;

3) inputting a text to be generated, extracting subject words of the text to be generated, acquiring word vectors of the subject words, and inputting the subject word vectors into the text generation model after parameter training in the step 2) to generate a new text.

2. The method according to claim 1, wherein step 1) extracts the subject words of a sentence by the TFIDF method and trains the word vectors of the subject words with the open-source Python library gensim.

3. The method for automatically generating a theme-based text according to claim 1, wherein the sentence preprocessing in step 1) comprises: unifying punctuation and removing English letters, numbers, and emoticons.

4. The method for automatically generating a text based on a theme according to claim 1, wherein the step 2) specifically comprises:

1) Input a shared vector C_t = {C_0, C_1, ...} that changes as t changes. Before model training, C_t is randomly initialized as a K-dimensional vector, where K is the number of subject word vectors randomly extracted in step 1), and each dimension has an initial value of 1, i.e. C_0 = [c_{0,1}, c_{0,2}, ..., c_{0,K}] = [1.0, 1.0, 1.0, ...]. When a new word is generated, the j-th component c_{t,j} of the t-th vector is calculated by the following formula:

c_{t,j} = c_{t-1,j} · α_{t,j}

2) The topic representation T_t is updated at each step of text generation; for each time t, T_t is calculated by the following formula:

T_t = Σ_{j=1}^{K} α_{t,j} · topic_j

where topic_j is the word vector of subject word j and α_{t,j} is the attention weight at time t; α_{t,j} and g_{t,j} are given by the following formulas:

α_{t,j} = exp(g_{t,j}) / Σ_{k=1}^{K} exp(g_{t,k})

g_{t,j} = v_a^T · tanh(W_a · h_{t-1} + U_a · topic_j)

where α_{t,j} is the attention weight score of subject word vector j at time t, v_a, W_a and U_a are parameter matrices trained with the LSTM, and g_{t,j} is the attention score of subject word vector j at time t; thus the probability distribution of the next word y_t can be defined as follows:

P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t))

Before each generation step, h_t is updated by the following formula:

h_t = f(h_{t-1}, y_{t-1})

where the function g is a linear function, the function f is an activation function determined by the structure of the LSTM, and softmax is the excitation function used to calculate the probability. The model P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t)) maintains a shared vector, each dimension of which represents the probability that a subject word vector will still need to appear in the generated text. The shared vector improves the topical completeness and readability of the generated text. Meanwhile, an attention mechanism is added to the LSTM model; it calculates the semantic relevance between the generated text and each subject word vector, and automatically selects the relevant topics to guide the model in generating text.

Technical Field

The invention relates to the field of natural language processing, and in particular to a theme-based automatic text generation method.

Background

Natural language generation is a fundamental and challenging task in natural language processing and computational linguistics, and topic-based text generation can be viewed as a special case of it. Today there are three main research directions in natural language generation: template-based methods, grammar-based methods, and statistical-learning-based methods. Template-based methods employ a large number of manually crafted templates, with slots left empty for custom filling. Grammar-based methods generate text step by step from a manually specified grammatical structure for the article. Statistical methods focus on learning language models from a corpus: learning how humans normally write and the relationships between the various components of language.

Traditional natural language generation is rule-based. This approach produces good output, but it requires many domain experts to formulate uniform grammatical and linguistic rules, so its time and labor costs are high; moreover, a system built this way is difficult to port, i.e., it has no generalization capability.

The research direction shifted from rule-based methods to statistics-based methods mainly for the following reasons:

the rapid development of computer hardware has led to a steady increase in computing power, and since ENIAC, the first electronic computer constructed in 1946, each component constituting the computer has changed in skyrocken every year, various hardware costs have become low and the operation speed has become faster. The establishment of a large-scale operating system and the invention of various programming languages make the processing of various problems convenient and fast, and people can use more powerful and more intelligent computers.

Chomsky's theory gradually lost its dominance, and statistics-based natural language processing gradually became mainstream.

Statistical processing mainly works with words and their frequencies of occurrence, but words are only part of semantics and cannot represent meaning on their own; text generated automatically from words alone is therefore semantically poor, and it is difficult to express a theme tied to a central idea.

Disclosure of Invention

The invention aims to solve the above problems and provides a theme-based automatic text generation method.

In order to achieve the above object, the present invention provides a method for automatically generating a text based on a theme, which comprises the following steps:

1) acquiring a corpus, preprocessing sentences in the corpus, and extracting subject words and word vectors of the preprocessed sentences;

2) constructing a text generation model, and inputting the word vectors obtained in the step 1) to train model parameters;

3) inputting a text to be generated, extracting subject words of the text to be generated, acquiring word vectors of the subject words, and inputting the subject word vectors into the text generation model after parameter training in the step 2) to generate a new text.

Further, step 1) extracts the subject words of a sentence by the TFIDF method and trains the word vectors of the subject words with the open-source Python library gensim.

Further, the sentence preprocessing in step 1) comprises: unifying punctuation and removing English letters, numbers, and emoticons.

Further, the step 2) specifically includes:

1) Input a shared vector C_t = {C_0, C_1, ...} that changes as t changes. Before model training, C_t is randomly initialized as a K-dimensional vector, where K is the number of subject word vectors randomly extracted in step 1), and each dimension has an initial value of 1, i.e. C_0 = [c_{0,1}, c_{0,2}, ..., c_{0,K}] = [1.0, 1.0, 1.0, ...]. When a new word is generated, the j-th component c_{t,j} of the t-th vector is calculated by the following formula:

c_{t,j} = c_{t-1,j} · α_{t,j}

2) The topic representation T_t is updated at each step of text generation; for each time t, T_t is calculated by the following formula:

T_t = Σ_{j=1}^{K} α_{t,j} · topic_j

where topic_j is the word vector of subject word j and α_{t,j} is the attention weight at time t; α_{t,j} and g_{t,j} are given by the following formulas:

α_{t,j} = exp(g_{t,j}) / Σ_{k=1}^{K} exp(g_{t,k})

g_{t,j} = v_a^T · tanh(W_a · h_{t-1} + U_a · topic_j)

where α_{t,j} is the attention weight score of subject word vector j at time t, v_a, W_a and U_a are parameter matrices trained with the LSTM, and g_{t,j} is the attention score of subject word vector j at time t; thus the probability distribution of the next word y_t can be defined as follows:

P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t))

Before each generation step, h_t is updated by the following formula:

h_t = f(h_{t-1}, y_{t-1})

where the function g is a linear function, the function f is an activation function determined by the structure of the LSTM, and softmax is the excitation function used to calculate the probability. The model P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t)) maintains a shared vector, each dimension of which represents the probability that a subject word vector will still need to appear in the generated text. The shared vector improves the topical completeness and readability of the generated text. Meanwhile, an attention mechanism is added to the LSTM model; it calculates the semantic relevance between the generated text and each subject word vector, and automatically selects the relevant topics to guide the model in generating text.

In this theme-based text generation method, the subject words are mapped into a word-embedding vector space to represent the theme, an LSTM is used as the generator, and an attention mechanism is introduced to build sentence-level correlation between the subject words and the generated text, guiding the generator to produce text related to the theme. In addition, since each individual text relates to several topics with different degrees of relevance, the model uses a dedicated vector to automatically assign each topic a weight representing the relevance of that subject word to the target text. The text generated by the invention is fluent and coherent, involves all input subject words, and is closely related to them.

Drawings

FIG. 1 is a flowchart of a method for automatically generating a text based on a topic according to an embodiment of the present invention;

FIG. 2 is a structural diagram of the text generation model constructed in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be noted that the drawings are merely illustrative and not drawn to strict scale; they may contain enlargements and reductions for convenience of description, and may omit some well-known partial structures.

In step 101, a corpus is obtained, and sentences in the corpus are preprocessed to extract subject words and word vectors of the preprocessed sentences.

In step 101-1, the corpus can be obtained by crawling various websites, for example the Q&A site Zhihu (https://www.zhihu.com/).

In step 101-2, the preprocessing operation of the statement is as follows:

English punctuation marks are replaced with Chinese punctuation marks, i.e., the punctuation in the corpus is unified.

Various ellipses are unified, because usage varies from person to person: some use three periods to represent an ellipsis and some use six. All are normalized to '…'.

After the above processing, each sample is checked for digits or English letters; if it contains any, the sample is discarded, otherwise it is kept.

The sample data produced by the above operations are then filtered by preset maximum and minimum length thresholds, for example keeping all text data between 50 and 300 words.
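The preprocessing steps above can be sketched as follows; the punctuation map is an illustrative subset of Table 1, and the length bounds use the 50-300 example values rather than values fixed by the patent:

```python
import re

# Illustrative subset of the English-to-Chinese punctuation map (cf. Table 1).
PUNCT_MAP = {",": "，", ".": "。", "?": "？", "!": "！", ":": "：", ";": "；"}

def preprocess(sample, min_len=50, max_len=300):
    """Return the cleaned sample, or None if it should be discarded."""
    # Normalize ellipses first: any run of 3 or more periods becomes '…'.
    sample = re.sub(r"\.{3,}", "…", sample)
    # Replace English punctuation with Chinese punctuation.
    for en, zh in PUNCT_MAP.items():
        sample = sample.replace(en, zh)
    # Discard samples that still contain digits or English letters.
    if re.search(r"[0-9A-Za-z]", sample):
        return None
    # Keep only samples within the preset length thresholds.
    if not (min_len <= len(sample) <= max_len):
        return None
    return sample
```

A sample containing English letters (e.g. `preprocess("abc")`) is discarded, while a clean Chinese sentence of admissible length is returned with its ellipses and punctuation normalized.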

The specific correspondence between Chinese and English punctuation marks is shown in Table 1:

TABLE 1 Correspondence table of Chinese and English punctuation marks


In step 101-3, the subject words and their word vectors are obtained.

Word vectors play an important role in the present invention. The dictionary formed from the preprocessed corpus contains tens of thousands of words; if each word were represented by one-hot coding, every vector would be extremely high-dimensional, and such sparse vectors cause many problems in computation and storage.

Many pretrained Chinese word vectors are available online, such as those trained on data from Baidu Encyclopedia, Chinese Wikipedia, and People's Daily. These word vectors are trained on very large datasets, but they contain many tokens made up of digits and English, and each tends to focus on a particular domain. For the experiments, a pretrained Chinese word-vector set related to the domain of the experimental corpus can be selected and then processed appropriately. Alternatively, a suitable corpus can be chosen to train the word vectors directly: for small text samples the training is fast, and the resulting word vectors are good.

Before obtaining the word vectors, the subject words need to be extracted. Keyword extraction is defined as follows: given a character sequence s and the number n of keywords to extract, n words occurring in s are extracted by some specific method, thereby simplifying or summarizing the sentence. The extracted keywords serve as the subject words of the sentence; the invention uses the mature TFIDF method to extract them.
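As an illustration of the TFIDF extraction step, the following standard-library sketch scores the words of one document against a small corpus; pre-tokenized word lists and the plain tf·idf weighting are simplifying assumptions, not the patent's exact procedure:

```python
import math
from collections import Counter

def extract_topic_words(docs, doc_index, n):
    """Return the n highest TF-IDF-scoring words of docs[doc_index].

    docs: list of documents, each given as a list of word tokens.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    # tf * idf score for every word of the target document.
    scores = {
        w: (c / total) * math.log(n_docs / df[w])
        for w, c in tf.items()
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
```

Words frequent in the target sentence but rare across the corpus score highest, which matches the intent of using the extracted keywords as subject words.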

The method uses the open-source Python library gensim to train word vectors for the Chinese subject words in the corpus. The input to the word-vector model is the text file produced after subject-word extraction (each line ends with its subject words), and only the path of this file needs to be supplied. The file at that path is read in, each sentence is segmented into words, the training data are iterated over several times according to the model configuration, and after training the model and the Chinese word vectors are saved.

The configuration of the parameters of the model during training is shown in table 2:

TABLE 2 word vector model parameter configuration

Algorithm (sg)                  CBOW     Word vector dimension (size)      300
Sliding window size (window)    15       Frequency threshold (min_count)   1
Iterations (iter)               10       Hierarchical softmax (hs)         1
Learning rate (alpha)           0.025    Worker threads (workers)          8
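Expressed in code, the Table 2 configuration corresponds roughly to the following gensim `Word2Vec` keyword arguments; the mapping uses gensim 3.x parameter names (`size`, `iter`) and is an illustrative sketch, not the patent's actual training script:

```python
# Word2Vec training configuration from Table 2, as gensim keyword
# arguments (gensim 3.x names; in gensim 4.x `size` is `vector_size`
# and `iter` is `epochs`).
W2V_PARAMS = {
    "sg": 0,           # 0 selects the CBOW algorithm
    "size": 300,       # word vector dimensionality
    "window": 15,      # sliding window size
    "min_count": 1,    # frequency threshold
    "iter": 10,        # number of training iterations
    "hs": 1,           # hierarchical softmax enabled
    "alpha": 0.025,    # initial learning rate
    "workers": 8,      # worker threads
}

# Training call, shown commented out since it needs a tokenized corpus:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=tokenized_corpus, **W2V_PARAMS)
# model.save("topic_word_vectors.model")
```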

In step 102, a text generation model is constructed, and the word vectors obtained in step 101 are input for training model parameters.

The model is shown in FIG. 2. The input shared vector C_t = {C_0, C_1, ...} changes as t changes. Before model training, C_t is randomly initialized as a K-dimensional vector (K being the number of subject word vectors), and each dimension has an initial value of 1, i.e. C_0 = [c_{0,1}, c_{0,2}, ..., c_{0,K}] = [1.0, 1.0, 1.0, ...]. When a new word is generated, the j-th component c_{t,j} of the t-th vector is calculated by the following formula:

c_{t,j} = c_{t-1,j} · α_{t,j}

Meanwhile, the topic representation T_t is updated at each step of text generation; for each time t, T_t is calculated by the following formula:

T_t = Σ_{j=1}^{K} α_{t,j} · topic_j

where topic_j is the word vector of subject word j and α_{t,j} is the attention weight at time t, given by the following formulas:

α_{t,j} = exp(g_{t,j}) / Σ_{k=1}^{K} exp(g_{t,k})

g_{t,j} = v_a^T · tanh(W_a · h_{t-1} + U_a · topic_j)

Here α_{t,j} is the attention weight score of subject word vector j at time t, v_a, W_a and U_a are parameter matrices trained with the LSTM, and g_{t,j} is the attention score of subject word vector j at time t. Thus the probability distribution of the next word y_t can be defined as follows (the text generation model):

P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t))

Before each generation step, h_t is updated by the following formula:

h_t = f(h_{t-1}, y_{t-1})

Here the function g is a linear function, the function f is an activation function determined by the structure of the LSTM, and softmax is the excitation function that calculates the probability. The model maintains a shared vector, each dimension of which represents the probability that a subject word vector will still need to appear in the generated text. The shared vector improves the topical completeness and readability of the generated text. Meanwhile, an attention mechanism is added to the LSTM model; it calculates the semantic relevance between the generated text and each subject word vector, and automatically selects the relevant topics to guide the model in generating text.
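One attention step of the formulas above can be sketched with NumPy as follows; the dimensions are illustrative, and treating the shared-vector update as elementwise scaling by the attention weights is an assumption of this sketch:

```python
import numpy as np

def attention_step(h_prev, topics, c_prev, va, Wa, Ua):
    """One step of topic attention.

    h_prev: (d,) previous LSTM hidden state
    topics: (K, e) subject word vectors
    c_prev: (K,) shared vector from the previous step
    va: (d,), Wa: (d, d), Ua: (d, e) trained parameters
    """
    # g_tj = va^T tanh(Wa h_{t-1} + Ua topic_j) for each topic j
    g = np.array([va @ np.tanh(Wa @ h_prev + Ua @ t) for t in topics])
    # Attention weights: numerically stable softmax over the K scores.
    alpha = np.exp(g - g.max())
    alpha /= alpha.sum()
    # Topic representation: attention-weighted sum of topic vectors.
    T = alpha @ topics
    # Shared-vector update (assumed elementwise scaling by alpha).
    c = c_prev * alpha
    return T, alpha, c
```

Because the weights are a softmax, they sum to 1, and each dimension of the shared vector can only shrink from its initial value of 1.0, mirroring the "probability of a topic still needing to appear" interpretation.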

The output of the model is y. The formula P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t)) is iterative: the result of each step is the input required by the next, so computing y_t requires y_{t-1}, T_t, C_t and h_t. From the above equations, T_t is determined by topic_j and α_{t,j}; topic_j is simply the word vector of the subject word (the same at every step), while α_{t,j} is computed from g_{t,j} by the formula above. Computing g_{t,j} in turn involves several parameters that must be trained: they are given initial values at first, and accurate values are obtained gradually over many steps of computation. At each step C_t is a K-dimensional vector, initially [1.0, 1.0, ...]; at every subsequent step each dimension j = 1, 2, ..., K is updated according to c_{t,j} = c_{t-1,j} · α_{t,j}.

Since step 101 produces many word vectors (each a 300-dimensional numeric vector), not all of them can be used at once during training; generally K word vectors are selected. FIG. 2 is drawn with K = 5 word vectors as an example (i.e., each piece of text is represented by 5 word vectors), and the model can be trained many times. At each training run, 5 of the word vectors obtained in step 101 are drawn at random; these are topic_1 through topic_5. At the start, C_0 and h_0 are initialized to [1.0, 1.0, 1.0, 1.0, 1.0], the matrices v_a, W_a and U_a are randomly initialized, and then C_1, T_1 and y_1 are calculated. Training proceeds in many steps, say m, with final output y_m. After each training run, the result is compared with the previous one; the parameters v_a, W_a and U_a whose output y_m is closer to the original input are kept and randomly fine-tuned, until training finishes with suitable parameters.

The lower part of FIG. 2 computes C_t, where t is the t-th step of each training run and the output of each step is y_t; the upper part shows only three vectors C_0, C_1, C_2, but in fact the computation iterates continuously. At generation time, one text is input at random (or just a few words), its word vectors are obtained and fed into the model, the parameters v_a, W_a and U_a take the values of the trained model, y_m is output, and the most similar text in the original corpus is retrieved to produce the generated text.
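The final lookup ("retrieve the most similar text in the original corpus") can be sketched as cosine-similarity nearest-neighbor retrieval; representing each corpus text by a single vector is an assumption of this sketch, not a detail fixed by the description:

```python
import numpy as np

def most_similar_text(output_vec, corpus_vecs, corpus_texts):
    """Return the corpus text whose vector is closest (cosine) to output_vec."""
    def cos(a, b):
        # Small epsilon guards against division by zero for null vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    sims = [cos(output_vec, v) for v in corpus_vecs]
    return corpus_texts[int(np.argmax(sims))]
```

In practice `corpus_vecs` would hold one representation per original text (for instance an average of its word vectors) and `output_vec` the model output y_m.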

In step 103, a text to be generated is input, a subject word of the text to be generated is extracted, a word vector of the subject word is obtained, and the word vector of the subject word is input into the text generation model after parameter training in step 102 to generate a new text.

In this theme-based text generation method, the subject words are mapped into a word-embedding vector space to represent the theme, an LSTM is used as the generator, and an attention mechanism is introduced to build sentence-level correlation between the subject words and the generated text, guiding the generator to produce text related to the theme. In addition, since each individual text relates to several topics with different degrees of relevance, the model uses a dedicated vector to automatically assign each topic a weight representing the relevance of that subject word to the target text. The text generated by the invention is fluent and coherent, involves all input subject words, and is closely related to them.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
