Text automatic generation method based on theme

Document No.: 1215800    Publication date: 2020-09-04

Note: This technology, a theme-based automatic text generation method, was designed and created by 路松峰 (Lu Songfeng) and 李天成 (Li Tiancheng) on 2020-04-01. The invention discloses a theme-based text automatic generation method comprising the following steps: 1) acquire a corpus, preprocess the sentences in it, and extract the subject words and word vectors of the preprocessed sentences; 2) construct a text generation model and input the word vectors obtained in step 1) to train the model parameters; 3) input a text to be generated, extract its subject words, obtain their word vectors, and input these subject word vectors into the parameter-trained text generation model of step 2) to generate a new text. The text generated by the invention is fluent and coherent, involves all input subject words, and is closely related to them.

1. A theme-based automatic text generation method, characterized by comprising the following steps:

1) acquiring a corpus, preprocessing sentences in the corpus, and extracting subject words and word vectors of the preprocessed sentences;

2) constructing a text generation model, and inputting the word vectors obtained in the step 1) to train model parameters;

3) inputting a text to be generated, extracting subject words of the text to be generated, acquiring word vectors of the subject words, and inputting the subject word vectors into the text generation model after parameter training in the step 2) to generate a new text.

2. The method according to claim 1, wherein step 1) extracts the subject words of a sentence by the TFIDF method and trains the word vectors of the subject words with the open-source Python library gensim.

3. The method for automatically generating a theme-based text according to claim 1, wherein the sentence preprocessing in step 1) comprises: unifying punctuation and removing English letters, numbers, and emoticons.

4. The method for automatically generating a text based on a theme according to claim 1, wherein the step 2) specifically comprises:

1) Input a shared vector C_t = {C_0, C_1, ...} that changes as t changes. Before model training, C_t is randomly initialized as a K-dimensional vector, where K is the number of subject word vectors randomly extracted in step 1), and each dimension has an initial value of 1, i.e. C_0 = [c_{0,1}, c_{0,2}, ..., c_{0,K}] = [1.0, 1.0, 1.0, ...]. When a new word is generated, the j-th component c_{t,j} of the t-th vector is calculated by the following formula:

c_{t,j} = c_{t-1,j} · α_{t,j}

2) The topic representation T_t is updated at each step of text generation; for each time t, T_t is calculated by the following formula:

T_t = Σ_{j=1}^{K} α_{t,j} · topic_j

where topic_j is the word vector of subject word j and α_{t,j} is the attention weight at time t; α_{t,j} and g_{t,j} are given by the following formulas:

α_{t,j} = exp(g_{t,j}) / Σ_{k=1}^{K} exp(g_{t,k})

g_{t,j} = v_a^T · tanh(W_a · h_{t-1} + U_a · topic_j)

where α_{t,j} is the attention weight score of subject word vector j at time t, v_a, W_a and U_a are parameter matrices trained with the LSTM, and g_{t,j} is the attention score of subject word vector j at time t; thus the probability distribution of the next word y_t can be defined as follows:

P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t))

Before each generation step, h_t is updated by the following formula:

h_t = f(h_{t-1}, y_{t-1})

where the function g is a linear function, the function f is an activation function determined by the structure of the LSTM, and softmax is the excitation function used to calculate the probability. The model P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t)) maintains a shared vector, each dimension of which represents the probability that a subject word vector will still need to appear in the generated text. The shared vector improves the topical completeness and readability of the generated text. Meanwhile, an attention mechanism is added to the LSTM model; it calculates the semantic relevance between the generated text and each subject word vector, and automatically selects the relevant topics to guide the model in generating text.

Technical Field

The invention relates to the field of natural language processing, and in particular to a theme-based automatic text generation method.

Background

Natural language generation is a fundamental and challenging task in natural language processing and computational linguistics, and topic-based text generation can be viewed as a special case of it. Today there are three main research directions in natural language generation: template-based methods, grammar-based methods, and statistical-learning-based methods. Template-based methods employ a large number of manually crafted templates, with slots left empty for custom filling. Grammar-based methods generate text step by step from a manually specified grammatical structure for the article. Statistical methods focus on learning language models from a corpus: learning how humans normally write and the relationships between the various components of language.

Traditional natural language generation is rule-based. This approach produces good output, but it requires many domain experts to formulate uniform grammatical and linguistic rules, so its time and labor costs are high; moreover, a system built this way is difficult to port, i.e., it has no generalization capability.

The research direction shifted from rule-based methods to statistics-based methods mainly for the following reasons:

the rapid development of computer hardware has led to a steady increase in computing power, and since ENIAC, the first electronic computer constructed in 1946, each component constituting the computer has changed in skyrocken every year, various hardware costs have become low and the operation speed has become faster. The establishment of a large-scale operating system and the invention of various programming languages make the processing of various problems convenient and fast, and people can use more powerful and more intelligent computers.

Chomsky's theory gradually lost its dominance, and statistics-based natural language processing gradually became mainstream.

Statistical processing mainly works with words and their frequencies of occurrence, but words are only part of semantics and cannot represent meaning on their own; text generated automatically from words alone is therefore semantically poor, and it is difficult to express a theme tied to a central idea.

Disclosure of Invention

The invention aims to solve the above problems and provides a theme-based automatic text generation method.

In order to achieve the above object, the present invention provides a method for automatically generating a text based on a theme, which comprises the following steps:

1) acquiring a corpus, preprocessing sentences in the corpus, and extracting subject words and word vectors of the preprocessed sentences;

2) constructing a text generation model, and inputting the word vectors obtained in the step 1) to train model parameters;

3) inputting a text to be generated, extracting subject words of the text to be generated, acquiring word vectors of the subject words, and inputting the subject word vectors into the text generation model after parameter training in the step 2) to generate a new text.

Further, step 1) extracts the subject words of a sentence by the TFIDF method and trains the word vectors of the subject words with the open-source Python library gensim.

Further, the sentence preprocessing in step 1) comprises: unifying punctuation and removing English letters, numbers, and emoticons.

Further, the step 2) specifically includes:

1) Input a shared vector C_t = {C_0, C_1, ...} that changes as t changes. Before model training, C_t is randomly initialized as a K-dimensional vector, where K is the number of subject word vectors randomly extracted in step 1), and each dimension has an initial value of 1, i.e. C_0 = [c_{0,1}, c_{0,2}, ..., c_{0,K}] = [1.0, 1.0, 1.0, ...]. When a new word is generated, the j-th component c_{t,j} of the t-th vector is calculated by the following formula:

c_{t,j} = c_{t-1,j} · α_{t,j}

2) The topic representation T_t is updated at each step of text generation; for each time t, T_t is calculated by the following formula:

T_t = Σ_{j=1}^{K} α_{t,j} · topic_j

where topic_j is the word vector of subject word j and α_{t,j} is the attention weight at time t; α_{t,j} and g_{t,j} are given by the following formulas:

α_{t,j} = exp(g_{t,j}) / Σ_{k=1}^{K} exp(g_{t,k})

g_{t,j} = v_a^T · tanh(W_a · h_{t-1} + U_a · topic_j)

where α_{t,j} is the attention weight score of subject word vector j at time t, v_a, W_a and U_a are parameter matrices trained with the LSTM, and g_{t,j} is the attention score of subject word vector j at time t; thus the probability distribution of the next word y_t can be defined as follows:

P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t))

Before each generation step, h_t is updated by the following formula:

h_t = f(h_{t-1}, y_{t-1})

where the function g is a linear function, the function f is an activation function determined by the structure of the LSTM, and softmax is the excitation function used to calculate the probability. The model P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t)) maintains a shared vector, each dimension of which represents the probability that a subject word vector will still need to appear in the generated text. The shared vector improves the topical completeness and readability of the generated text. Meanwhile, an attention mechanism is added to the LSTM model; it calculates the semantic relevance between the generated text and each subject word vector, and automatically selects the relevant topics to guide the model in generating text.

In this theme-based text generation method, the subject words are mapped into a word-embedding vector space to represent the theme, an LSTM is used as the generator, and an attention mechanism is introduced to build sentence-level correlation between the subject words and the generated text, guiding the generator to produce text related to the theme. In addition, since each individual text relates to several topics with different degrees of relevance, the model uses a dedicated vector to automatically assign each topic a weight representing the relevance of that subject word to the target text. The text generated by the invention is fluent and coherent, involves all input subject words, and is closely related to them.

Drawings

FIG. 1 is a flowchart of a method for automatically generating a text based on a topic according to an embodiment of the present invention;

FIG. 2 is a structural diagram of the text generation model constructed in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be noted that the drawings are merely illustrative and not drawn to strict scale; they may contain enlargements and reductions for convenience of description, and may omit some well-known partial structures.

In step 101, a corpus is obtained, and sentences in the corpus are preprocessed to extract subject words and word vectors of the preprocessed sentences.

In step 101-1, the corpus can be obtained by crawling various websites, for example the Q&A site Zhihu (https://www.zhihu.com/).

In step 101-2, the preprocessing operation of the statement is as follows:

English punctuation marks are replaced with Chinese punctuation marks, i.e., the punctuation in the corpus is unified.

Various ellipses are unified, because usage varies from person to person: some use three periods to represent an ellipsis and some use six. All are normalized to '…'.

After the above processing, each sample is checked for digits or English letters; if it contains any, the sample is discarded, otherwise it is kept.

The sample data produced by the above operations are then filtered by preset maximum and minimum length thresholds, for example keeping all text data between 50 and 300 words.
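The preprocessing steps above can be sketched as follows; the punctuation map is an illustrative subset of Table 1, and the length bounds use the 50-300 example values rather than values fixed by the patent:

```python
import re

# Illustrative subset of the English-to-Chinese punctuation map (cf. Table 1).
PUNCT_MAP = {",": "，", ".": "。", "?": "？", "!": "！", ":": "：", ";": "；"}

def preprocess(sample, min_len=50, max_len=300):
    """Return the cleaned sample, or None if it should be discarded."""
    # Normalize ellipses first: any run of 3 or more periods becomes '…'.
    sample = re.sub(r"\.{3,}", "…", sample)
    # Replace English punctuation with Chinese punctuation.
    for en, zh in PUNCT_MAP.items():
        sample = sample.replace(en, zh)
    # Discard samples that still contain digits or English letters.
    if re.search(r"[0-9A-Za-z]", sample):
        return None
    # Keep only samples within the preset length thresholds.
    if not (min_len <= len(sample) <= max_len):
        return None
    return sample
```

A sample containing English letters (e.g. `preprocess("abc")`) is discarded, while a clean Chinese sentence of admissible length is returned with its ellipses and punctuation normalized.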

The specific correspondence between Chinese and English punctuation marks is shown in Table 1:

TABLE 1 Correspondence table of Chinese and English punctuation marks


In step 101-3, the subject words and their word vectors are obtained.

Word vectors play an important role in the present invention. The dictionary formed from the preprocessed corpus contains tens of thousands of words; if each word were represented by one-hot coding, every vector would be extremely high-dimensional, and such sparse vectors cause many problems in computation and storage.

Many pretrained Chinese word vectors are available online, such as those trained on data from Baidu Encyclopedia, Chinese Wikipedia, and People's Daily. These word vectors are trained on very large datasets, but they contain many tokens made up of digits and English, and each tends to focus on a particular domain. For the experiments, a pretrained Chinese word-vector set related to the domain of the experimental corpus can be selected and then processed appropriately. Alternatively, a suitable corpus can be chosen to train the word vectors directly: for small text samples the training is fast, and the resulting word vectors are good.

Before obtaining the word vectors, the subject words need to be extracted. Keyword extraction is defined as follows: given a character sequence s and the number n of keywords to extract, n words occurring in s are extracted by some specific method, thereby simplifying or summarizing the sentence. The extracted keywords serve as the subject words of the sentence; the invention uses the mature TFIDF method to extract them.
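As an illustration of the TFIDF extraction step, the following standard-library sketch scores the words of one document against a small corpus; pre-tokenized word lists and the plain tf·idf weighting are simplifying assumptions, not the patent's exact procedure:

```python
import math
from collections import Counter

def extract_topic_words(docs, doc_index, n):
    """Return the n highest TF-IDF-scoring words of docs[doc_index].

    docs: list of documents, each given as a list of word tokens.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    # tf * idf score for every word of the target document.
    scores = {
        w: (c / total) * math.log(n_docs / df[w])
        for w, c in tf.items()
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
```

Words frequent in the target sentence but rare across the corpus score highest, which matches the intent of using the extracted keywords as subject words.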

The method uses the open-source Python library gensim to train word vectors for the Chinese subject words in the corpus. The input to the word-vector model is the text file produced after subject-word extraction (each line ends with its subject words), and only the path of this file needs to be supplied. The file at that path is read in, each sentence is segmented into words, the training data are iterated over several times according to the model configuration, and after training the model and the Chinese word vectors are saved.

The configuration of the parameters of the model during training is shown in table 2:

TABLE 2 word vector model parameter configuration

Algorithm (sg)                  CBOW     Word vector dimension (size)      300
Sliding window size (window)    15       Frequency threshold (min_count)   1
Iterations (iter)               10       Hierarchical softmax (hs)         1
Learning rate (alpha)           0.025    Worker threads (workers)          8
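Expressed in code, the Table 2 configuration corresponds roughly to the following gensim `Word2Vec` keyword arguments; the mapping uses gensim 3.x parameter names (`size`, `iter`) and is an illustrative sketch, not the patent's actual training script:

```python
# Word2Vec training configuration from Table 2, as gensim keyword
# arguments (gensim 3.x names; in gensim 4.x `size` is `vector_size`
# and `iter` is `epochs`).
W2V_PARAMS = {
    "sg": 0,           # 0 selects the CBOW algorithm
    "size": 300,       # word vector dimensionality
    "window": 15,      # sliding window size
    "min_count": 1,    # frequency threshold
    "iter": 10,        # number of training iterations
    "hs": 1,           # hierarchical softmax enabled
    "alpha": 0.025,    # initial learning rate
    "workers": 8,      # worker threads
}

# Training call, shown commented out since it needs a tokenized corpus:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=tokenized_corpus, **W2V_PARAMS)
# model.save("topic_word_vectors.model")
```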

In step 102, a text generation model is constructed, and the word vectors obtained in step 101 are input for training model parameters.

The model is shown in FIG. 2. The input shared vector C_t = {C_0, C_1, ...} changes as t changes. Before model training, C_t is randomly initialized as a K-dimensional vector (K being the number of subject word vectors), and each dimension has an initial value of 1, i.e. C_0 = [c_{0,1}, c_{0,2}, ..., c_{0,K}] = [1.0, 1.0, 1.0, ...]. When a new word is generated, the j-th component c_{t,j} of the t-th vector is calculated by the following formula:

c_{t,j} = c_{t-1,j} · α_{t,j}

Meanwhile, the topic representation T_t is updated at each step of text generation; for each time t, T_t is calculated by the following formula:

T_t = Σ_{j=1}^{K} α_{t,j} · topic_j

where topic_j is the word vector of subject word j and α_{t,j} is the attention weight at time t, given by the following formulas:

α_{t,j} = exp(g_{t,j}) / Σ_{k=1}^{K} exp(g_{t,k})

g_{t,j} = v_a^T · tanh(W_a · h_{t-1} + U_a · topic_j)

Here α_{t,j} is the attention weight score of subject word vector j at time t, v_a, W_a and U_a are parameter matrices trained with the LSTM, and g_{t,j} is the attention score of subject word vector j at time t. Thus the probability distribution of the next word y_t can be defined as follows (the text generation model):

P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t))

Before each generation step, h_t is updated by the following formula:

h_t = f(h_{t-1}, y_{t-1})

Here the function g is a linear function, the function f is an activation function determined by the structure of the LSTM, and softmax is the excitation function that calculates the probability. The model maintains a shared vector, each dimension of which represents the probability that a subject word vector will still need to appear in the generated text. The shared vector improves the topical completeness and readability of the generated text. Meanwhile, an attention mechanism is added to the LSTM model; it calculates the semantic relevance between the generated text and each subject word vector, and automatically selects the relevant topics to guide the model in generating text.
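One attention step of the formulas above can be sketched with NumPy as follows; the dimensions are illustrative, and treating the shared-vector update as elementwise scaling by the attention weights is an assumption of this sketch:

```python
import numpy as np

def attention_step(h_prev, topics, c_prev, va, Wa, Ua):
    """One step of topic attention.

    h_prev: (d,) previous LSTM hidden state
    topics: (K, e) subject word vectors
    c_prev: (K,) shared vector from the previous step
    va: (d,), Wa: (d, d), Ua: (d, e) trained parameters
    """
    # g_tj = va^T tanh(Wa h_{t-1} + Ua topic_j) for each topic j
    g = np.array([va @ np.tanh(Wa @ h_prev + Ua @ t) for t in topics])
    # Attention weights: numerically stable softmax over the K scores.
    alpha = np.exp(g - g.max())
    alpha /= alpha.sum()
    # Topic representation: attention-weighted sum of topic vectors.
    T = alpha @ topics
    # Shared-vector update (assumed elementwise scaling by alpha).
    c = c_prev * alpha
    return T, alpha, c
```

Because the weights are a softmax, they sum to 1, and each dimension of the shared vector can only shrink from its initial value of 1.0, mirroring the "probability of a topic still needing to appear" interpretation.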

The output of the model is y. The formula P(y_t | y_{t-1}, T_t, C_t) = softmax(g(h_t)) is iterative: the result of each step is the input required by the next, so computing y_t requires y_{t-1}, T_t, C_t and h_t. From the above equations, T_t is determined by topic_j and α_{t,j}; topic_j is simply the word vector of the subject word (the same at every step), while α_{t,j} is computed from g_{t,j} by the formula above. Computing g_{t,j} in turn involves several parameters that must be trained: they are given initial values at first, and accurate values are obtained gradually over many steps of computation. At each step C_t is a K-dimensional vector, initially [1.0, 1.0, ...]; at every subsequent step each dimension j = 1, 2, ..., K is updated according to c_{t,j} = c_{t-1,j} · α_{t,j}.

Since step 101 produces many word vectors (each a 300-dimensional numeric vector), not all of them can be used at once during training; generally K word vectors are selected. FIG. 2 is drawn with K = 5 word vectors as an example (i.e., each piece of text is represented by 5 word vectors), and the model can be trained many times. At each training run, 5 of the word vectors obtained in step 101 are drawn at random; these are topic_1 through topic_5. At the start, C_0 and h_0 are initialized to [1.0, 1.0, 1.0, 1.0, 1.0], the matrices v_a, W_a and U_a are randomly initialized, and then C_1, T_1 and y_1 are calculated. Training proceeds in many steps, say m, with final output y_m. After each training run, the result is compared with the previous one; the parameters v_a, W_a and U_a whose output y_m is closer to the original input are kept and randomly fine-tuned, until training finishes with suitable parameters.

The lower part of FIG. 2 computes C_t, where t is the t-th step of each training run and the output of each step is y_t; the upper part shows only three vectors C_0, C_1, C_2, but in fact the computation iterates continuously. At generation time, one text is input at random (or just a few words), its word vectors are obtained and fed into the model, the parameters v_a, W_a and U_a take the values of the trained model, y_m is output, and the most similar text in the original corpus is retrieved to produce the generated text.
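The final lookup ("retrieve the most similar text in the original corpus") can be sketched as cosine-similarity nearest-neighbor retrieval; representing each corpus text by a single vector is an assumption of this sketch, not a detail fixed by the description:

```python
import numpy as np

def most_similar_text(output_vec, corpus_vecs, corpus_texts):
    """Return the corpus text whose vector is closest (cosine) to output_vec."""
    def cos(a, b):
        # Small epsilon guards against division by zero for null vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    sims = [cos(output_vec, v) for v in corpus_vecs]
    return corpus_texts[int(np.argmax(sims))]
```

In practice `corpus_vecs` would hold one representation per original text (for instance an average of its word vectors) and `output_vec` the model output y_m.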

In step 103, a text to be generated is input, a subject word of the text to be generated is extracted, a word vector of the subject word is obtained, and the word vector of the subject word is input into the text generation model after parameter training in step 102 to generate a new text.

In this theme-based text generation method, the subject words are mapped into a word-embedding vector space to represent the theme, an LSTM is used as the generator, and an attention mechanism is introduced to build sentence-level correlation between the subject words and the generated text, guiding the generator to produce text related to the theme. In addition, since each individual text relates to several topics with different degrees of relevance, the model uses a dedicated vector to automatically assign each topic a weight representing the relevance of that subject word to the target text. The text generated by the invention is fluent and coherent, involves all input subject words, and is closely related to them.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
