Automatic manuscript writing system based on text analysis technology

Document No.: 1505362  Publication date: 2020-02-07

Reading note: This technology, "Automatic manuscript writing system based on text analysis technology", was designed by 陶敬伟, 包盛, 诸葛忠 and 杨谦 on 2019-09-10. The invention relates to the technical field of text analysis and natural language text processing, and discloses an automatic manuscript writing system based on text analysis technology, comprising the following steps: S1, annual reports and announcements published by enterprises, mainly in the form of PDF documents, are obtained from web pages in a timely manner; S2, after a PDF file is downloaded, it is input into an R-CNN neural network, the document is parsed by deep learning, and the pictures and tables are separated out. Through this automated manuscript writing system, the various kinds of information published by listed companies can be acquired quickly from the Internet and, after extraction, processing, summary generation and format conversion, presented to users as visual, easily understood summary manuscripts.

1. An automatic manuscript writing system based on text analysis technology, characterized by comprising the following steps:

S1, annual reports and announcements published by enterprises, mainly in the form of PDF documents, are obtained from web pages in a timely manner;

S2, after a PDF file is downloaded, it is input into an R-CNN neural network, the document is parsed by deep learning, and the pictures and tables are separated out;

S2-1, the table of contents of the document is parsed first, and the document is grouped by page number according to the table of contents;

S2-2, following the table of contents, the text of each paragraph heading is extracted as key data and organized by heading level, yielding the paragraph information of the whole document as a Sector[i] array;

S2-3, the pictures and tables in the document are separated from the text; the pictures are stored directly in a database, and the tables are input into a neural network;

S2-4, the data in the tables is extracted by the neural network and stored in a relational database;

S3, natural language processing is performed on the separated document to obtain the sentiment information it contains; when processing is complete, a sentiment index Motion[i] is assigned to the document;

S4, natural language processing is performed on the separated document to extract a text summary; the extracted text fragments are Text[i];

S5, a suitable template Model[i] is selected from the dynamic template library according to Sector[i] and Motion[i], and Text[i] is filled into the selected template Model[i], yielding a publishable manuscript Article[i];

S6, the document format of Article[i] is converted.

2. The automatic manuscript writing system based on text analysis technology according to claim 1, characterized in that: each Sector[i] array in step S2-2 contains the paragraph level, the paragraph heading information, and the page number range.

3. The automatic manuscript writing system based on text analysis technology according to claim 1, characterized in that: the sentiment information in step S3 mainly comprises three types: positive sentiment, negative sentiment, and neutral sentiment.

4. The automatic manuscript writing system based on text analysis technology according to claim 1, characterized in that: the document in step S6 may be converted into a web page format, an audio format, or a video format.

Technical Field

The invention relates to the technical field of text analysis and natural language text processing, and in particular to an automatic manuscript writing system based on text analysis technology.

Background

The core technology behind automatic manuscript writing is natural language processing (NLP), which also draws on several other artificial intelligence technologies such as data mining, machine learning, search, and knowledge graphs. Natural language processing refers to a machine's ability to understand and interpret human writing and speech. The goal is for computers to understand language as intelligently as humans do, and ultimately to bridge the gap between human communication and computer understanding (machine language).

With the rapid development of NLP, deep learning, and big data, and their growing range of industrial applications, automatic manuscript writing has become a trend driven by technical progress and industry change. At present there are three ways to implement it: template-based, extractive, and generative.

Each of the three automatic manuscript writing techniques has its own shortcomings in practice.

Template-based automatic manuscript writing suits scenarios in which the structure of the manuscript content is relatively fixed, and it struggles to respond quickly to unexpected events. Enterprise announcements frequently concern unexpected events, and if manual intervention is needed once the manuscript content is found to be abnormal, the timeliness of publication suffers;

Extractive automatic manuscript writing extracts information from a large body of existing text and reworks it. Extractive automatic text summarization is currently applied mainly in the news field, because large volumes of news text can be collected easily from the Internet. Enterprise announcements in the financial industry, by contrast, mostly exist as PDF/DOC documents, so no directly usable data source is available;

Generative automatic manuscript writing is mostly used to imitate the writing of a particular person or scenario. It requires collecting a large amount of target text for training before text close to the target documents can be generated automatically, and the existing technology is not yet mature.

Therefore, by combining text analysis, natural language processing, and dynamic template technology, the invention provides an automatic manuscript writing system based on text analysis technology.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides an automatic manuscript writing system based on text analysis technology, which solves the problems described in the Background.

The invention provides the following technical solution: an automatic manuscript writing system based on text analysis technology, comprising the following steps:

S1, annual reports and announcements published by enterprises, mainly in the form of PDF documents, are obtained from web pages in a timely manner;

S2, after a PDF file is downloaded, it is input into an R-CNN neural network, the document is parsed by deep learning, and the pictures and tables are separated out;

S2-1, the table of contents of the document is parsed first, and the document is grouped by page number according to the table of contents;

S2-2, following the table of contents, the text of each paragraph heading is extracted as key data and organized by heading level, yielding the paragraph information of the whole document as a Sector[i] array;

S2-3, the pictures and tables in the document are separated from the text; the pictures are stored directly in a database, and the tables are input into a neural network;

S2-4, the data in the tables is extracted by the neural network and stored in a relational database;

S3, natural language processing is performed on the separated document to obtain the sentiment information it contains; when processing is complete, a sentiment index Motion[i] is assigned to the document;

S4, natural language processing is performed on the separated document to extract a text summary; the extracted text fragments are Text[i];

S5, a suitable template Model[i] is selected from the dynamic template library according to Sector[i] and Motion[i], and Text[i] is filled into the selected template Model[i], yielding a publishable manuscript Article[i];

S6, the document format of Article[i] is converted.
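As an illustration of the summary extraction in step S4, the following minimal Python sketch scores each sentence by the average frequency of its words and keeps the highest-scoring sentences in document order. The source does not specify the extraction algorithm; the word-frequency heuristic, the function name extract_summary, and the max_sentences parameter are assumptions made for this example only.

```python
import re
from collections import Counter
from typing import List


def extract_summary(text: str, max_sentences: int = 2) -> List[str]:
    """Return the highest-scoring sentences of `text`, in document order."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Count word frequencies; skipping short words is a crude stop-word filter.
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    sample = ("Revenue grew strongly. Revenue growth came from cloud revenue. "
              "The weather was mild.")
    for fragment in extract_summary(sample):
        print(fragment)
```

A production system would more likely use a trained summarization model; the sketch only shows the shape of the input (separated document text) and output (the Text[i] fragments).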

Preferably, each Sector[i] array in step S2-2 contains the paragraph level, the paragraph heading information, and the page number range.

Preferably, the sentiment information in step S3 mainly comprises three types: positive sentiment, negative sentiment, and neutral sentiment.

Preferably, the document in step S6 may be converted into a web page format, an audio format, or a video format.

The invention has the following beneficial effects:

Through the automated manuscript writing system, the invention can quickly acquire the various kinds of information published by listed companies from the Internet and, after extraction, processing, summary generation and format conversion, present it to users as visual, easily understood summary manuscripts.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person skilled in the art can derive from these embodiments without creative effort fall within the protection scope of the invention.

Referring to FIG. 1, an automatic manuscript writing system based on text analysis technology comprises the following steps:

S1, annual reports and announcements published by enterprises, mainly in the form of PDF documents, are obtained from web pages in a timely manner;

S2, after a PDF file is downloaded, it is input into an R-CNN neural network, the document is parsed by deep learning, and the pictures and tables are separated out;

S2-1, the table of contents of the document is parsed first, and the document is grouped by page number according to the table of contents;

S2-2, following the table of contents, the text of each paragraph heading is extracted as key data and organized by heading level, yielding the paragraph information of the whole document as a Sector[i] array;

S2-3, the pictures and tables in the document are separated from the text; the pictures are stored directly in a database, and the tables are input into a neural network;

S2-4, the data in the tables is extracted by the neural network and stored in a relational database;

S3, natural language processing is performed on the separated document to obtain the sentiment information it contains; when processing is complete, a sentiment index Motion[i] is assigned to the document;

S4, natural language processing is performed on the separated document to extract a text summary; the extracted text fragments are Text[i];

S5, a suitable template Model[i] is selected from the dynamic template library according to Sector[i] and Motion[i], and Text[i] is filled into the selected template Model[i], yielding a publishable manuscript Article[i];

S6, the document format of Article[i] is converted.
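The template selection and filling of step S5 can be sketched in Python as a lookup keyed by section and sentiment, followed by placeholder substitution. This is only a sketch under stated assumptions: the source does not describe how the dynamic template library is organized, so the dictionary TEMPLATE_LIBRARY, the fallback DEFAULT_TEMPLATE, the function write_article, and the template wording are all hypothetical.

```python
from string import Template
from typing import Dict, Tuple

# Hypothetical template library: (section title, sentiment label) -> template.
TEMPLATE_LIBRARY: Dict[Tuple[str, str], Template] = {
    ("Financial Summary", "positive"):
        Template("$company reports strong results: $text"),
    ("Financial Summary", "negative"):
        Template("$company results come under pressure: $text"),
}

# Used when no template matches the section/sentiment pair.
DEFAULT_TEMPLATE = Template("$company announcement ($sector): $text")


def write_article(company: str, sector_title: str, motion: str, text: str) -> str:
    """Pick a template from the section and sentiment, then fill in the text."""
    model = TEMPLATE_LIBRARY.get((sector_title, motion), DEFAULT_TEMPLATE)
    # safe_substitute ignores keyword arguments the template does not use.
    return model.safe_substitute(company=company, sector=sector_title, text=text)


if __name__ == "__main__":
    print(write_article("ACME", "Financial Summary", "positive",
                        "Revenue rose 12%."))
    print(write_article("ACME", "Board Changes", "neutral",
                        "A new director was appointed."))
```

Keying the lookup on both values mirrors the step's wording, in which the template is chosen from Sector[i] and Motion[i] together; the fallback template keeps unexpected combinations from blocking publication.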

Further, each Sector[i] array in step S2-2 contains the paragraph level, the paragraph heading information, and the page number range.
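One possible in-memory representation of a Sector[i] entry, together with the page grouping of step S2-1, is sketched below in Python. The field names, the build_sectors helper, and the (level, title, start page) shape of the table-of-contents input are assumptions for illustration; the sketch also assumes each section starts on a new page, so a section ends on the page before the next heading of the same or a higher level.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sector:
    level: int        # paragraph level (1 = top-level heading)
    title: str        # paragraph heading text
    first_page: int   # first page of the section (inclusive)
    last_page: int    # last page of the section (inclusive)


def build_sectors(toc: List[Tuple[int, str, int]], total_pages: int) -> List[Sector]:
    """Build Sector entries from (level, title, start_page) rows in document order."""
    sectors = []
    for i, (level, title, start) in enumerate(toc):
        end = total_pages  # default: the section runs to the end of the document
        for next_level, _, next_start in toc[i + 1:]:
            if next_level <= level:
                # The next heading at the same or a higher level closes this section.
                end = max(start, next_start - 1)
                break
        sectors.append(Sector(level, title, start, end))
    return sectors


if __name__ == "__main__":
    toc = [
        (1, "Company Profile", 3),
        (2, "Basic Information", 3),
        (2, "Contact Details", 5),
        (1, "Financial Summary", 8),
    ]
    for sector in build_sectors(toc, total_pages=20):
        print(sector)
```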

Further, the sentiment information in step S3 mainly comprises three types: positive sentiment, negative sentiment, and neutral sentiment.
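To make the three-way classification of step S3 concrete, here is a deliberately simple Python sketch that counts terms from two small word lists. The source does not name the sentiment method, and a real system would more plausibly use a trained classifier; the word lists, the function name classify_motion, and the labels "positive", "negative" and "neutral" are assumptions chosen for this example.

```python
import re

# Hypothetical term lists; a real lexicon would be far larger and domain-tuned.
POSITIVE_TERMS = {"growth", "increase", "profit", "record", "exceeded"}
NEGATIVE_TERMS = {"loss", "decline", "decrease", "lawsuit", "impairment"}


def classify_motion(text: str) -> str:
    """Return 'positive', 'negative' or 'neutral' from simple term counts."""
    words = re.findall(r"[a-z]+", text.lower())
    positives = sum(w in POSITIVE_TERMS for w in words)
    negatives = sum(w in NEGATIVE_TERMS for w in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # no signal, or positive and negative terms balance out


if __name__ == "__main__":
    print(classify_motion("Net profit showed strong growth this year"))
    print(classify_motion("The company reported a loss and a decline in revenue"))
    print(classify_motion("The board meeting will be held on Friday"))
```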

Further, the document in step S6 may be converted into a web page format, an audio format, or a video format.
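Of the three target formats, the web page conversion is the easiest to illustrate. The Python sketch below wraps a manuscript in minimal HTML and escapes special characters; the function name article_to_html, the blank-line paragraph convention, and the chosen tags are assumptions, and audio or video output would require external speech or rendering tools that are not shown here.

```python
from html import escape


def article_to_html(title: str, body: str) -> str:
    """Render a plain-text manuscript as a minimal HTML fragment."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    parts = [f"<h1>{escape(title)}</h1>"]
    # escape() keeps characters such as < and & from being read as markup.
    parts += [f"<p>{escape(p)}</p>" for p in paragraphs]
    return "<article>\n" + "\n".join(parts) + "\n</article>"


if __name__ == "__main__":
    print(article_to_html("Q3 <Results>", "Revenue rose.\n\nCosts & fees fell."))
```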
