Text generation method and system based on neural network vocabulary extension paragraphs

Document No.: 1922138 | Publication date: 2021-12-03

Abstract: A text generation method and system based on neural network vocabulary extension paragraphs (一种基于神经网络词汇扩展段落的文本生成方法及系统), designed by 陈海林, 张蓬 and 赵绪龙 on 2021-09-07. The invention discloses a text generation method and system based on neural network vocabulary extension paragraphs, belonging to the technical field of text processing. The system comprises a data acquisition module, an article database, a storage module, a modeling module, an input module, a text generation module and a server. The data acquisition module collects article data and sends it to the article database for storage; the article database segments the stored article data, preprocesses the segment data, extracts the core words of each segment, combines each segment with its core words and marks the result as training set data, and sends the training set data to the storage module for storage. The modeling module builds a prediction model and sends it to the text generation module. Generating text paragraph by paragraph gives better control over the required word count of the article and improves the user's experience of text generation.

1. A text generation system based on a neural network vocabulary extension paragraph is characterized by comprising a data acquisition module, an article database, a storage module, a modeling module, an input module, a text generation module and a server;

the data acquisition module is used for acquiring article data and sending the acquired article data to an article database for storage, the article database segments the stored article data, performs data preprocessing on the segment data, extracts core words in the segment data, integrates and marks the segment data and the corresponding core words as training set data, and sends the training set data to the storage module for storage; establishing a prediction model through a modeling module, and sending the prediction model to a text generation module;

a user inputs prediction parameters through the input module, wherein the prediction parameters comprise the industry field, keywords, the length of the generated article and the word-count range of each paragraph; the input module obtains prediction input data according to the input prediction parameters and sends the prediction input data to the text generation module, and the text generation module generates a text according to the obtained data and sends the text to the user.

2. The system of claim 1, wherein the article database checks the received article data before storing it, and when the received article data duplicates existing article data, stores the copy with the most recent publication date and deletes the other copy.

3. The system of claim 2, wherein when the article data carries no publication date, the time at which the article data was most recently acquired is used as the publication date, and the publication date is marked.

4. The neural network vocabulary extension paragraph-based text generation system of claim 1, wherein the method for the modeling module to build the prediction model comprises:

writing a seq2seq model, wherein the encoder end and the decoder end of the seq2seq model each adopt an XLNet model structure, acquiring training set data from the storage module, training the seq2seq model with the training set data, and marking the trained seq2seq model as the prediction model.

5. The system of claim 1, wherein the input module operates in a manner that includes:

setting a retrieval unit and a keyword library, selecting the industry field of the target article, recommending keywords to the user according to the industry field, the user selecting from the recommended keywords, and, when the recommended keywords do not contain the keywords required by the user, retrieving the required keywords through the retrieval unit; setting the length of the generated article and the word-count range of each paragraph;

inputting keywords into an article database for matching to obtain matched sentences, obtaining a word number range required by a user and the word number of the matched sentences, screening out the matched sentences meeting the requirements of the user, carrying out data preprocessing, marking the sentences subjected to data preprocessing as basic sentences, and extracting core words in the basic sentences; the core words are labeled as prediction input data.

6. The system of claim 5, wherein the method for recommending keywords to the user according to the industry field comprises:

acquiring the industry field, matching it against the keyword library to obtain the keywords of that field, marking these as candidate keywords, acquiring the number of times each candidate keyword has been used, sorting the candidate keywords by usage count, and recommending the top N candidate keywords to the user.

7. The system of claim 1, wherein the text generation module generates the text according to the acquired data by a method comprising:

acquiring the prediction input data and the prediction model, inputting the prediction input data into the prediction model to obtain sentences, and marking the sentences as output sentences, wherein the sentences are generated using beam search; and ordering the output sentences according to heuristic rules to form text data.

8. The generation method of the text generation system based on the neural network vocabulary extension paragraphs according to any one of claims 1 to 7, wherein the specific method comprises the following steps:

step one: establishing a prediction model;

step two: acquiring a prediction parameter input by a user, and setting prediction input data according to the prediction parameter;

step three: inputting the prediction input data into the prediction model to obtain output sentences, and ordering the output sentences according to heuristic rules to form text data;

step four: the text data is presented to the user.

Technical Field

The invention belongs to the technical field of text processing, and particularly relates to a text generation method and system based on a neural network vocabulary extension paragraph.

Background

Text generation is an important research direction in natural language processing with a wide range of application scenarios. It is mainly applied to generating text from formatted data, or to generating information content, explanatory text and the like. Current tasks for generating text from unformatted input fall roughly into abstract generation, paraphrase generation and the like. Paraphrase generation needs a large amount of material to be effective, and the parallel corpora required to train a paraphrase model must be both plentiful and well-formed; in practice, such parallel corpora are difficult to obtain at scale. Poetry generation and novel generation do not yet produce refined results; they have some research value, but most such work is experimental exploration by researchers rather than work aimed at practical application.

Text generation by neural network vocabulary extension paragraphs can be trained on a corpus of limited scope and still expand keywords into articles effectively. Training a deep learning network for sentence and article generation on accumulated user feature data, user preference data and article material data, together with label feature data added through relation extraction and entity recognition, can greatly increase the diversity and originality of the generated articles.

Disclosure of Invention

In order to solve the problems existing in the scheme, the invention provides a text generation method and system based on a neural network vocabulary extension paragraph.

The purpose of the invention can be realized by the following technical scheme:

a text generation system based on neural network vocabulary extension paragraphs comprises a data acquisition module, an article database, a storage module, a modeling module, an input module, a text generation module and a server;

the data acquisition module is used for acquiring article data and sending the acquired article data to an article database for storage, the article database segments the stored article data, performs data preprocessing on the segment data, extracts core words in the segment data, integrates and marks the segment data and the corresponding core words as training set data, and sends the training set data to the storage module for storage; establishing a prediction model through a modeling module, and sending the prediction model to a text generation module;

the user inputs the prediction parameters through the input module, the input module obtains prediction input data according to the input prediction parameters, the prediction input data are sent to the text generation module, and the text generation module generates a text according to the obtained data and sends the text to the user.

Further, the article database checks the received article data before storing it, and when the received article data duplicates existing article data, stores the copy with the most recent publication date and deletes the other copy.

Further, when the article data carries no publication date, the time at which the article data was most recently acquired is taken as the publication date, and the publication date is marked.

Further, the method for building the prediction model by the modeling module comprises the following steps:

writing a seq2seq model, wherein the encoder end and the decoder end of the seq2seq model each adopt an XLNet model structure, acquiring training set data from the storage module, training the seq2seq model with the training set data, and marking the trained seq2seq model as the prediction model.

Further, the working method of the input module comprises the following steps:

setting a retrieval unit and a keyword library, selecting the industry field of the target article, recommending keywords to the user according to the industry field, the user selecting from the recommended keywords, and, when the recommended keywords do not contain the keywords required by the user, retrieving the required keywords through the retrieval unit; setting the length of the generated article and the word-count range of each paragraph;

inputting keywords into an article database for matching to obtain matched sentences, obtaining a word number range required by a user and the word number of the matched sentences, screening out the matched sentences meeting the requirements of the user, carrying out data preprocessing, marking the sentences subjected to data preprocessing as basic sentences, and extracting core words in the basic sentences; the core words are labeled as prediction input data.

Further, the method for recommending keywords to the user according to the industry field comprises the following steps:

acquiring the industry field, matching it against the keyword library to obtain the keywords of that field, marking these as candidate keywords, acquiring the number of times each candidate keyword has been used, sorting the candidate keywords by usage count, and recommending the top N candidate keywords to the user.

Further, the method for generating the text by the text generation module according to the acquired data comprises the following steps:

acquiring the prediction input data and the prediction model, inputting the prediction input data into the prediction model to obtain sentences, and marking the sentences as output sentences, wherein the sentences are generated using beam search; and ordering the output sentences according to heuristic rules to form text data.

A text generation method based on neural network vocabulary extension paragraphs specifically comprises the following steps:

step one: establishing a prediction model;

step two: acquiring a prediction parameter input by a user, and setting prediction input data according to the prediction parameter;

step three: inputting the prediction input data into the prediction model to obtain output sentences, and ordering the output sentences according to heuristic rules to form text data;

step four: the text data is presented to the user.

Compared with the prior art, the invention has the following beneficial effects: it addresses the low quality and disfluent sentences of text generated with common auto-encoding models such as BERT; generating text paragraph by paragraph gives better control over the required word count of the article and improves the user's experience; and extracting core words from sentences and using the sentences as training corpora helps the model converge, increases the rigor and diversity of the generated text, and raises its quality.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic block diagram of the system of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, a text generation system based on a neural network vocabulary extension paragraph includes a data acquisition module, an article database, a storage module, a modeling module, an input module, a text generation module, and a server;

the data acquisition module is used for collecting article data and sending the collected article data to the article database for storage; the article database checks the received article data, and when the received article data duplicates existing article data, stores the copy with the most recent publication date and deletes the other copy; since nearly every published article carries a publication date, when the article data carries none, the time at which the article data was most recently acquired is taken as the publication date, and the publication date is marked; the article database segments the stored article data to generate segment data, wherein each item of segment data is a sentence of the article data, and the segmentation can be performed by paragraph and by sentence number; data preprocessing, comprising data cleaning and extraction, is performed on the segment data, and core words are extracted from the preprocessed segment data; keyword extraction is a conventional technique and not an improvement point of the invention, so it is not described in detail, and a neural network model can be trained for it; each item of segment data and its corresponding core words are combined and marked as training set data, and the training set data is sent to the storage module for storage; the modeling module establishes a prediction model and sends the prediction model to the text generation module;

the user inputs the prediction parameters through the input module, the input module obtains prediction input data according to the input prediction parameters, the prediction input data are sent to the text generation module, and the text generation module generates a text according to the obtained data and sends the text to the user.
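The duplicate check and the date fallback described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Article` record, the `deduplicate` helper and the rule that two articles are duplicates when their whitespace-normalized text is identical are all assumptions introduced here, since the source does not say how duplicates are detected.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Article:
    text: str
    published: Optional[date]  # None when the source carried no publication date
    fetched: date              # when the article was acquired (hypothetical field)


def deduplicate(articles: list[Article]) -> list[Article]:
    """Keep one copy of each duplicated article: the most recently published."""
    kept: dict[str, Article] = {}
    for art in articles:
        # Fallback from the description: use the acquisition time as the date.
        if art.published is None:
            art = Article(art.text, art.fetched, art.fetched)
        # Assumption: duplicates are texts that match after whitespace normalization.
        key = " ".join(art.text.split())
        best = kept.get(key)
        if best is None or art.published > best.published:
            kept[key] = art
    return list(kept.values())
```

A real system would more likely use fuzzy matching or content hashing to find near-duplicates; exact matching is used here only to keep the sketch short.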

The method for acquiring article data by the data acquisition module comprises the following steps:

specifying the fields from which article data is to be collected, and collecting article data from the Internet according to the specified fields.

The method for establishing the prediction model by the modeling module comprises the following steps:

writing a seq2seq model, wherein the encoder end and the decoder end of the seq2seq model each adopt an XLNet model structure, acquiring training set data from the storage module, training the seq2seq model with the training set data, and marking the trained seq2seq model as the prediction model.
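The training set consumed by the seq2seq model pairs each segment with its core words. The sketch below shows one way such (core words, segment) pairs might be assembled. The description treats core-word extraction as a conventional technique and does not specify it, so the frequency-based `core_words` function, the `STOPWORDS` list and the punctuation-based `split_segments` are stand-ins assumed here for illustration only.

```python
import re
from collections import Counter

# Illustrative stopword list; a real system would use a proper lexicon.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for"}


def split_segments(article: str) -> list[str]:
    """Split an article into sentence-level segments at end punctuation."""
    parts = re.split(r"(?<=[.!?。!?])\s*", article)
    return [p.strip() for p in parts if p.strip()]


def core_words(segment: str, k: int = 3) -> list[str]:
    """Stand-in extractor: the k most frequent non-stopword tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", segment.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(k)]


def build_training_pairs(article: str) -> list[tuple[str, str]]:
    """Return (source, target) pairs: core words as input, full segment as output."""
    return [(" ".join(core_words(s)), s) for s in split_segments(article)]
```

Each pair gives the model a short keyword sequence as the source and the full sentence as the target, which matches the described goal of expanding vocabulary into paragraphs.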

The input module is used for the user to input prediction parameters and for setting prediction input data according to the prediction parameters, wherein the prediction parameters comprise the industry field, keywords, the length of the generated article and the word-count range of each paragraph; the industry field is the industry field of the target article; the specific method comprises the following steps:

setting a retrieval unit and a keyword library, wherein the retrieval unit is used for retrieving keywords, and the keyword library is used for storing keywords and can be populated from the keywords in the training set data; selecting the industry field, recommending keywords to the user according to the industry field, the user selecting from the recommended keywords, and, when the recommended keywords do not include the keywords required by the user, retrieving the required keywords through the retrieval unit; setting the length of the generated article and the word-count range of each paragraph;

inputting keywords into an article database for matching to obtain a matched sentence, namely a sentence containing the keywords, obtaining a word number range required by a user and the word number of the matched sentence, screening the matched sentence meeting the requirements of the user, and performing data preprocessing, wherein the data preprocessing comprises data cleaning and extraction, a processing object is the screened matched sentence, the sentence subjected to the data preprocessing is marked as a basic sentence, and a core word in the basic sentence is extracted; marking the core words as prediction input data;
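The matching and screening step above can be sketched as a simple filter. The function name `match_sentences` and its parameters are hypothetical, and counting words by whitespace is an assumption made for English text; for Chinese text the count would more naturally be in characters.

```python
def match_sentences(sentences, keywords, min_words, max_words):
    """Return sentences that contain any keyword and fit the word-count range."""
    matched = []
    for sentence in sentences:
        # Assumption: word count = number of whitespace-separated tokens.
        n_words = len(sentence.split())
        has_keyword = any(k.lower() in sentence.lower() for k in keywords)
        if has_keyword and min_words <= n_words <= max_words:
            matched.append(sentence)
    return matched
```

The surviving sentences would then be preprocessed into basic sentences, and their core words extracted to form the prediction input data.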

the method for recommending keywords to the user according to the industry field comprises the following steps:

acquiring the industry field, matching it against the keyword library to obtain the keywords of that field, marking these as candidate keywords, acquiring the number of times each candidate keyword has been used, sorting the candidate keywords by usage count, and recommending the top N candidate keywords to the user, wherein N is a preset number and 10 ≤ N ≤ 50;
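The recommendation rule above amounts to a top-N selection by usage count. A minimal sketch follows, assuming the keyword library is a nested dictionary mapping each industry field to its keywords and their usage counts; that layout and the name `recommend_keywords` are assumptions, since the source does not describe how the library is stored.

```python
def recommend_keywords(keyword_library, industry, n=10):
    """Return the n most-used keywords of the given industry field."""
    if not 10 <= n <= 50:
        raise ValueError("N must satisfy 10 <= N <= 50")
    # Candidate keywords: every keyword stored under the same industry field.
    candidates = keyword_library.get(industry, {})
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _count in ranked[:n]]
```

If a field holds fewer than N keywords, the sketch simply returns all of them in descending order of use.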

the text generation module is used for generating a text according to the acquired data, and the specific method comprises the following steps:

acquiring the prediction input data and the prediction model, inputting the prediction input data into the prediction model to obtain sentences, and marking the sentences as output sentences, wherein the sentences are generated using beam search; ordering the output sentences according to heuristic rules to form text data; heuristic rules are common knowledge in the art and are therefore not described in detail.
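Beam search keeps the best few partial sentences at every decoding step instead of committing to the single most likely next token. The sketch below illustrates the idea on a tiny hand-written next-token table; `TOY_MODEL` and `toy_step` are hypothetical stand-ins for the trained seq2seq decoder, and the beam width, length limit and sentence markers are assumed values not given in the source.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (cumulative log-probability, tokens)

# Hypothetical toy next-token model standing in for the trained decoder.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"pump": math.log(0.5), "machine": math.log(0.5)},
    "a": {"pump": math.log(0.9), "machine": math.log(0.1)},
    "pump": {"</s>": 0.0},
    "machine": {"</s>": 0.0},
}


def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Return log-probabilities of each possible next token."""
    return TOY_MODEL[prefix[-1]]


def beam_search(
    step: Callable[[List[str]], Dict[str, float]],
    beam_width: int = 2,
    max_len: int = 10,
    bos: str = "<s>",
    eos: str = "</s>",
) -> List[Hypothesis]:
    """Return finished hypotheses, best first."""
    beams: List[Hypothesis] = [(0.0, [bos])]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates = [
            (score + logp, seq + [tok])
            for score, seq in beams
            for tok, logp in step(seq).items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        # Keep only the top beam_width; set aside those that have ended.
        for hyp in candidates[:beam_width]:
            (finished if hyp[1][-1] == eos else beams).append(hyp)
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[0], reverse=True)
```

With this toy table, a beam width of 1 (greedy decoding) commits to "the" at the first step and ends with probability 0.30, while a beam width of 2 keeps "a" alive and finds "a pump" with probability 0.36.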

For example, the user selects the industry field: machinery and industrial equipment/agricultural machinery; the recommended keywords are: rake teeth, micro-nano oxygen supply machine, manure scraper, stone mill, mixer, loader, branch crusher, etc.; the user selects the micro-nano oxygen supply machine and sets the article length to one, and the output is:

micro-nano oxygen supply machine:

the micro-nano bubbles are quickly generated, gas (such as air, oxygen, ozone and the like) is dissolved in water in a high-speed rotary cutting mode, nano bubble water is quickly prepared, the dissolving efficiency of the gas is improved, and the requirement of treating a water body is met, so that the micro-nano bubbles can be widely applied to the treatment of industrial, agricultural and domestic water.

The product is characterized in that:

1. the diameter of the air bubble is 100nm-10 μm;

2. the rising speed is slow;

3. pressurizing and dissolving;

4. the specific surface area is large;

5. the surface is charged;

6. the micro-nano bubble generating device is convenient to be combined with the existing equipment;

7. different kinds of gas and liquid can be freely combined, and different gas sources (air, oxygen, ozone, carbon dioxide and the like) can be used.

The application field is as follows:

agricultural production: oxygenation and disinfection of nutrient solution, oxygenation and irrigation;

aquatic product and livestock breeding: purifying and disinfecting water quality and oxygenating water body;

treating sewage: purifying water, sterilizing and oxygenating;

medical health preserving: sterilizing, bathing and protecting health;

food processing: cleaning, disinfecting and preserving fruits and vegetables.

A text generation method based on neural network vocabulary extension paragraphs specifically comprises the following steps:

step one: establishing a prediction model;

collecting article data and checking the collected article data; when the received article data duplicates existing article data, storing the copy with the most recent publication date and deleting the other copy; when the article data carries no publication date, taking the time at which the article data was most recently acquired as the publication date, and marking the publication date; segmenting the article data to generate segment data, and performing data preprocessing, comprising data cleaning and extraction, on the segment data; extracting the core words of the segment data, and combining each item of segment data with its corresponding core words and marking the result as training set data; writing a seq2seq model whose encoder end and decoder end each use an XLNet model structure, acquiring the training set data from the storage module, training the seq2seq model with the training set data, and marking the trained seq2seq model as the prediction model.

Step two: acquiring a prediction parameter input by a user, and setting prediction input data according to the prediction parameter;

setting a retrieval unit and a keyword library, wherein the retrieval unit is used for retrieving keywords, and the keyword library is used for storing keywords and can be populated from the keywords in the training set data; selecting the industry field, recommending keywords to the user according to the industry field, the user selecting from the recommended keywords, and, when the recommended keywords do not include the keywords required by the user, retrieving the required keywords through the retrieval unit; setting the length of the generated article and the word-count range of each paragraph;

inputting keywords into an article database for matching to obtain matched sentences, obtaining a word number range required by a user and the word number of the matched sentences, screening out the matched sentences meeting the requirements of the user, performing data preprocessing, wherein the data preprocessing comprises data cleaning and extraction, marking the sentences subjected to the data preprocessing as basic sentences, and extracting core words in the basic sentences; marking the core words as prediction input data;

step three: inputting the prediction input data into the prediction model to obtain output sentences, and ordering the output sentences according to heuristic rules to form text data;

when the obtained text data does not meet the requirements of the user, returning to the step two, and adding new keywords by the user;

step four: the text data is presented to the user.
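The return from step three to step two, where the user adds keywords until the text is acceptable, can be sketched as a small loop. Everything here is hypothetical scaffolding: `generate` stands in for steps two and three combined, `accept` stands in for the user's judgment, and `max_rounds` is an assumed safety limit that the source does not mention.

```python
def generate_with_feedback(generate, accept, keywords, extra_keywords, max_rounds=3):
    """Regenerate text, adding one new keyword per round, until it is accepted."""
    keywords = list(keywords)
    pending = list(extra_keywords)  # keywords the user may add, in order
    text = generate(keywords)
    for _ in range(max_rounds):
        if accept(text) or not pending:
            break
        # Return to step two: the user adds a new keyword.
        keywords.append(pending.pop(0))
        text = generate(keywords)
    return text, keywords
```

For instance, with `generate = lambda kws: " ".join(kws)` and `accept = lambda t: "oxygen" in t`, calling the function with `["pump"]` and `["bubble", "oxygen"]` stops once "oxygen" has been added.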

In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and there may be other divisions when the actual implementation is performed; the modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the method of the embodiment.

It will also be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.

Finally, it should be noted that the above examples are only intended to illustrate the technical process of the present invention and not to limit the same, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical process of the present invention without departing from the spirit and scope of the technical process of the present invention.
