Nuclear power structural material data file automatic extraction system and method

Document No.: 1953413 Publication date: 2021-12-10

Abstract: This technology, "Nuclear power structural material data file automatic extraction system and method", was designed by 王卓, 王者, 许斌 and 颜达鹏 on 2021-09-10. The invention provides an automatic extraction system and method for nuclear power structural material data files. The system comprises a document classification module, a document conversion module, a document judgment module and a document extraction module, wherein: the document classification module classifies the nuclear power structural material data files stored in a computer; the document conversion module converts the classified documents into PDF files from which text can be extracted; the document judgment module compares the converted PDF file with the PDF files already existing in the system and extracts the specified content from the converted PDF file, thereby completing the extraction of nuclear power structural material data; the document extraction module automatically extracts the PDF file into nuclear power structural material structured data according to a known layout, and then organizes the structured data. The automatic extraction system provided by the invention supports the informatization and digitization of all data in the field of nuclear power structural materials, and can fully automatically extract and archive documents in known formats.

1. An automatic extraction system for nuclear power structural material data files, characterized by comprising a document classification module, a document conversion module, a document judgment module and a document extraction module, wherein:

the document classification module classifies the nuclear power structural material data files stored in a computer;

the document conversion module converts the classified documents into PDF files from which text can be extracted;

the document judgment module compares the converted PDF file with the PDF files already existing in the system, and extracts the specified content from the converted PDF file, thereby completing the extraction of nuclear power structural material data;

the document extraction module automatically extracts the PDF file into nuclear power structural material structured data according to a known layout, and then organizes the structured data.

2. The automatic extraction system for nuclear power structural material data files of claim 1, wherein the document classification module classifies the documents into PDF files, picture files and paper files.

3. The automatic extraction system for nuclear power structural material data files of claim 2, wherein the paper files are sorted manually.

4. The automatic extraction system for nuclear power structural material data files of claim 2, wherein, for a PDF file, the PDFBox framework for the Java language is used to test whether the text and picture content of the PDF file can be extracted normally; if so, the PDF file is considered not to need conversion; if not, the PDF file is retained to await conversion; picture documents are always converted.

5. The automatic extraction system for nuclear power structural material data files of claim 2, wherein, when a picture document yields no text in the document conversion module, the picture document is judged to be a molecular structure picture.

6. The automatic extraction system for nuclear power structural material data files of claim 1, wherein, when the document judgment module extracts the specified content from the PDF file, the nuclear power structural material data is extracted by position analysis, logic analysis and fuzzy matching.

7. A method using the automatic extraction system for nuclear power structural material data files of any one of claims 1 to 6, comprising the steps of:

S1, storing the nuclear power structural material data files in a computer, the computer classifying the files according to file extension, and sorting and classifying paper files manually;

S2, for a PDF file, using the PDFBox framework for the Java language to test whether the text and picture content of the PDF file can be extracted normally; if so, the PDF file is considered not to need conversion; if not, the PDF file is retained to await conversion; a picture document must be converted to generate a recognizable PDF file;

S3, performing document layout judgment on the recognizable PDF file obtained in S2; when a PDF file already existing in the system has the same document layout as the recognizable PDF file, using parsing logic written in Java to accurately extract the specified content from the PDF file, that is, obtaining the nuclear power structural material data by position analysis, logic analysis and fuzzy matching;

S4, when the PDF file has a layout unknown to the system, decomposing the PDF file into text data and picture data, and restoring the table data of the PDF file from the text data according to a layout and table tolerance strategy;

S5, automatically extracting the recognizable PDF file of S3 into nuclear power structural material structured data according to the known layout, and then organizing the structured data;

S6, storing the organized structured data in a nuclear power structural material database.

8. The method of claim 7, wherein in S3 the set of supported nuclear power structural material data document layouts is extensible, and a developer can add a new template for a document layout using the Java language.

Technical Field

The invention belongs to the technical field of nuclear power material data processing, and particularly relates to a nuclear power structural material data file automatic extraction system and method.

Background

With the continuous progress of modern computer technology, more and more industries choose to extract, store and analyze in digital form the enterprise data held in paper or electronic documents. How efficiently data can be extracted therefore affects the construction progress of a nuclear power structural material database and the research and development cycle of new materials.

In extracting nuclear power structural material data, data from different sources (text, tables, paper documents, electronic documents and so on) must be informatized, that is, extracted as structured data that a computer can store and recognize, and then placed in storage. Nuclear power structural material data come in complex types: different experiments, periodicals, documents and data sets record information in different ways, and valuable data are often hidden in documents with complicated structures. Such hard-to-find data are difficult to extract automatically, so the value of nuclear power structural material data is greatly reduced in the informatization process.

In the existing extraction process, the information in a paper or electronic nuclear power structural material data document is entered into a database purely by hand according to the entry requirements. The data entry personnel must have both knowledge of nuclear power structural materials and the ability to organize and extract data, and must read, extract and enter a document line by line. The learning cost is very high, and the process is inefficient and error-prone, which affects the use of the data.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide an automatic extraction system and method for nuclear power structural material data files that solve the technical problems described in the background above.

The purpose of the invention can be realized by the following technical scheme:

An automatic extraction system for nuclear power structural material data files comprises a document classification module, a document conversion module, a document judgment module and a document extraction module, wherein:

the document classification module classifies the nuclear power structural material data files stored in a computer;

the document conversion module converts the classified documents into PDF files from which text can be extracted;

the document judgment module compares the converted PDF file with the PDF files already existing in the system, and extracts the specified content from the converted PDF file, thereby completing the extraction of nuclear power structural material data;

the document extraction module automatically extracts the PDF file into nuclear power structural material structured data according to a known layout, and then organizes the structured data.

Further, the document classification module classifies the documents into PDF documents, picture documents and paper documents.

Further, the paper documents are sorted manually.

Further, for a PDF file, the PDFBox framework for the Java language is used to test whether the text and picture content of the PDF file can be extracted normally; if so, the PDF file is considered not to need conversion; if not, the PDF file is retained to await conversion; picture documents are always converted.

Further, when a picture document yields no text in the document conversion module, the picture document is judged to be a molecular structure picture of a nuclear power structural material.

Further, when the document judgment module extracts the specified content from the PDF file, the nuclear power structural material data is extracted by position analysis, logic analysis and fuzzy matching.

The method using the automatic extraction system for nuclear power structural material data files comprises the following steps:

S1, storing the nuclear power structural material data files in a computer, the computer classifying the files according to file extension, and sorting and classifying paper files manually;

S2, for a PDF file, using the PDFBox framework for the Java language to test whether the text and picture content of the PDF file can be extracted normally; if so, the PDF file is considered not to need conversion; if not, the PDF file is retained to await conversion; a picture document must be converted to generate a recognizable PDF file;

S3, performing document layout judgment on the recognizable PDF file obtained in S2; when a PDF file already existing in the system has the same document layout as the recognizable PDF file, using parsing logic written in Java to accurately extract the specified content from the PDF file, that is, obtaining the nuclear power structural material data by position analysis, logic analysis and fuzzy matching;

S4, when the PDF file has a layout unknown to the system, decomposing the PDF file into text data and picture data, and restoring the table data of the PDF file from the text data according to a layout and table tolerance strategy;

S5, automatically extracting the recognizable PDF file of S3 into nuclear power structural material structured data according to the known layout, and then organizing the structured data;

S6, storing the organized structured data in a nuclear power structural material database.

Further, in S3 the set of supported nuclear power structural material data document layouts is extensible, and a developer can add a new template for a document layout using the Java language.

The invention has the beneficial effects that:

1. The automatic extraction system provided by the invention supports the informatization and digitization of all data in the field of nuclear power structural materials, and can fully automatically extract and archive documents in known formats.

2. With this method, developers use the Java language to add new templates for nuclear power structural material data document layouts, so the parsing capability can be extended and the system continuously improved.

3. The extraction method provided by the invention improves the efficiency and accuracy of digitizing nuclear power structural material data, and reduces the difficulty of the work and the cost of data entry for the personnel involved.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of an automated extraction system according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of an automated extraction method according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides an automatic extraction system for nuclear power structural material data files, including a document classification module, a document conversion module, a document judgment module, and a document extraction module, where:

The document classification module classifies the nuclear power structural material data files stored in a computer. Documents are divided into PDF documents (extension .pdf), picture documents (extensions .jpg, .jpeg, .png, .bmp, .tif, etc.), office documents (.doc, .docx, .xls, .xlsx, etc.) and paper documents. Paper documents are nuclear power structural material data files kept as books, forms and the like, mostly recorded by hand or printed by equipment during production and research and development; they are sorted and classified manually by staff. Picture documents are mostly photographed or machine-generated during production and research and development, and are stored as paper photos or computer picture files.
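The extension-based part of this classification can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `DocumentClassifier`, the `DocType` enum and the `UNKNOWN` fallback are assumptions, and only the extensions listed explicitly above are covered (the "etc." cases are not).

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: classify a stored file by its extension. */
final class DocumentClassifier {

    enum DocType { PDF, PICTURE, OFFICE, UNKNOWN }

    // Extension lists copied from the description; anything else is UNKNOWN (an assumption).
    private static final Set<String> PICTURE_EXT = Set.of("jpg", "jpeg", "png", "bmp", "tif");
    private static final Set<String> OFFICE_EXT = Set.of("doc", "docx", "xls", "xlsx");

    static DocType classify(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DocType.UNKNOWN; // no usable extension: leave for manual sorting
        }
        // Compare case-insensitively so "REPORT.PDF" and "report.pdf" agree.
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (ext.equals("pdf")) {
            return DocType.PDF;
        }
        if (PICTURE_EXT.contains(ext)) {
            return DocType.PICTURE;
        }
        if (OFFICE_EXT.contains(ext)) {
            return DocType.OFFICE;
        }
        return DocType.UNKNOWN;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.PDF", "scan_001.tif", "tensile.xlsx", "notes"}) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

Paper documents never reach this code path; as described above, they are sorted by staff.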

The document conversion module converts the classified documents into PDF files from which text can be extracted. Specifically: for a PDF file, the method uses the PDFBox framework for the Java language to test whether the text and picture content of the PDF file can be extracted normally; if so, the PDF file is considered not to need conversion; if not, the PDF file is retained to await conversion. Picture documents always need conversion. Office documents are considered to need only a simple conversion.

The conversion step is as follows. The method uses ABBYY FineReader Engine (FREngine) as the OCR tool and PDFBox as the tool for converting office documents to PDF files. ABBYY FineReader Engine converts PDF files and picture documents into PDF files from which text can be extracted. If no text is extracted while converting a picture, the method judges the picture to be picture data such as a molecular structure picture of a nuclear power structural material. PDFBox directly converts an office document into a PDF file from which text can be extracted.
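The decision rules in this step can be sketched independently of any PDF or OCR library. In the sketch below, the `extractedText` argument stands in for whatever text a PDFBox extraction pass or an OCR pass returned; the class `ConversionPlanner`, its enums and the blank-text test are assumptions made for illustration, not details given by the source.

```java
/** Hypothetical sketch of the conversion decision rules described above. */
final class ConversionPlanner {

    enum Kind { PDF, PICTURE, OFFICE }

    enum Action { NONE, OCR, OFFICE_TO_PDF }

    /** Assumption: "text can be extracted" means at least one non-whitespace character. */
    static boolean hasExtractableText(String extracted) {
        return extracted != null && !extracted.isBlank();
    }

    /**
     * Chooses the conversion for one document.
     * extractedText is only consulted for PDFs; it may be null for other kinds.
     */
    static Action plan(Kind kind, String extractedText) {
        switch (kind) {
            case PDF:
                // A PDF with usable text needs no conversion; otherwise it waits for OCR.
                return hasExtractableText(extractedText) ? Action.NONE : Action.OCR;
            case PICTURE:
                // Picture documents are always converted.
                return Action.OCR;
            default:
                // Office documents only need a simple conversion to PDF.
                return Action.OFFICE_TO_PDF;
        }
    }

    /** A picture whose OCR result is empty is treated as picture data (e.g. a structure picture). */
    static boolean isTextlessPicture(Kind kind, String ocrText) {
        return kind == Kind.PICTURE && !hasExtractableText(ocrText);
    }

    public static void main(String[] args) {
        System.out.println(plan(Kind.PDF, "Tensile test report"));
        System.out.println(plan(Kind.PDF, ""));
        System.out.println(isTextlessPicture(Kind.PICTURE, ""));
    }
}
```

Keeping the rules separate from the extraction calls means the same logic can be exercised without a PDF on disk; in a real pipeline the string would presumably come from PDFBox's `PDFTextStripper` or from the OCR engine.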

The document judgment module compares the converted PDF file with the PDF files already in the system; an operator judges whether the document layout already exists. Layout parsing logic is written in the Java language, and for a PDF layout that already exists in the system this logic extracts the specified content from the PDF file efficiently and accurately. The logic extracts the nuclear power structural material data using position analysis (based on where the content to be extracted approximately appears), logic analysis (based on the context of the content to be extracted) and fuzzy matching (based on keywords of the content to be extracted).

The set of supported nuclear power structural material data document layouts is also extensible: developers can add new templates for document layouts using the Java language.
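The source does not say how fuzzy matching is implemented. One common way to realise keyword-based fuzzy matching is Levenshtein edit distance, sketched below under that assumption. The class `FuzzyFieldExtractor`, the "label: value" line format and the `maxEdits` threshold are all hypothetical; position analysis is not shown.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch: find a field by a keyword that OCR may have misspelled. */
final class FuzzyFieldExtractor {

    /** Levenshtein edit distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Scans "label: value" lines and returns the value of the first line whose
     * label is within maxEdits edits of the keyword (case-insensitive).
     */
    static Optional<String> extract(List<String> lines, String keyword, int maxEdits) {
        String key = keyword.toLowerCase(Locale.ROOT);
        for (String line : lines) {
            int sep = line.indexOf(':');
            if (sep < 0) {
                continue; // context rule: only lines with a label separator are candidates
            }
            String label = line.substring(0, sep).trim().toLowerCase(Locale.ROOT);
            if (editDistance(label, key) <= maxEdits) {
                return Optional.of(line.substring(sep + 1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Invented sample lines; "Yeild" imitates an OCR or typing error.
        List<String> lines = List.of("Material: sample A", "Yeild strength: 345 MPa");
        System.out.println(extract(lines, "yield strength", 2).orElse("(not found)"));
    }
}
```

A small `maxEdits` keeps unrelated labels from matching; a template for a given layout would choose its own keywords and threshold.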

The document extraction module automatically extracts the PDF file into nuclear power structural material structured data according to a known layout, and then organizes the structured data. When the layout of a nuclear power structural material data document does not exist in the system, the method decomposes the document automatically: the PDF is split into two kinds of metadata, text (numeric) data and picture data, and the table data of the PDF file is restored from the text (numeric) metadata according to a layout and table tolerance strategy.
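The source does not define the tolerance strategy. One plausible reading is that text fragments whose vertical positions differ by less than a tolerance are treated as one table row, which absorbs small misalignments from scanning or OCR. The sketch below illustrates that reading only; `TableRestorer`, `Fragment` and the `yTolerance` parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: rebuild table rows from positioned text fragments. */
final class TableRestorer {

    /** A piece of text with the page coordinates of its top-left corner. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into rows by y (within yTolerance), then orders each row by x. */
    static List<List<String>> restoreRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            // Start a new row when the fragment is too far below the row's first fragment.
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> table = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            table.add(cells);
        }
        return table;
    }

    public static void main(String[] args) {
        // Invented coordinates: two header cells and two value cells, slightly misaligned.
        List<Fragment> fragments = List.of(
                new Fragment("Rm", 80, 100.6), new Fragment("T", 10, 100.0),
                new Fragment("350", 80, 120.2), new Fragment("25", 10, 119.8));
        System.out.println(restoreRows(fragments, 1.0));
    }
}
```

Anchoring each row at its first fragment keeps the grouping simple; a production strategy would likely also align columns across rows, which this sketch does not attempt.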


As shown in fig. 2, the method for automatically extracting the nuclear power structural material data file includes the following steps:

S1, storing the nuclear power structural material data files in a computer, the computer classifying the documents according to file extension (PDF files: .pdf; picture files: .jpg, .jpeg, .png, .bmp, .tif, etc.; office files: .doc, .docx, .xls, .xlsx, etc.), and sorting and classifying paper documents manually;

S2, for a PDF file, using the PDFBox framework for the Java language to test whether the text and picture content of the PDF file can be extracted normally; if so, the PDF file is considered not to need conversion; if not, the PDF file is retained to await conversion; a picture document must be converted (including judging pictures such as molecular structure pictures of nuclear power structural materials) to generate a recognizable PDF file;

S3, performing document layout judgment on the recognizable PDF file obtained in S2; when a PDF file already existing in the system has the same document layout as the recognizable PDF file, using parsing logic written in Java to accurately extract the specified content from the PDF file, that is, extracting the nuclear power structural material data by position analysis (based on where the content to be extracted approximately appears), logic analysis (based on the context of the content to be extracted) and fuzzy matching (based on keywords of the content to be extracted).

The set of supported nuclear power structural material data document layouts is also extensible: developers can add new templates for document layouts using the Java language.

S4, when the PDF file has a layout unknown to the system, decomposing the PDF file into text data and picture data, and restoring the table data of the PDF file from the text data according to a layout and table tolerance strategy;

S5, automatically extracting the recognizable PDF file of S3 into nuclear power structural material structured data according to the known layout, and then organizing the structured data;

S6, completing the automatic extraction of the nuclear power structural material data file by storing the organized structured data in a nuclear power structural material database.
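The dispatch between known and unknown layouts in steps S3 to S6, together with the extensible templates, can be sketched as a small registry. Everything here is an assumption made for illustration: the `LayoutTemplate` interface, the in-memory list standing in for the nuclear power structural material database, and the pending queue for unknown layouts. In the source, layout judgment is performed by an operator, so `matches` is only a placeholder for that decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the known-layout / unknown-layout dispatch in S3 to S6. */
final class ExtractionPipeline {

    /** A developer-written template for one known document layout. */
    interface LayoutTemplate {
        /** Placeholder for layout judgment; the source has an operator make this call. */
        boolean matches(String pdfText);

        /** Extracts named fields from a document with this layout. */
        Map<String, String> extract(String pdfText);
    }

    private final List<LayoutTemplate> templates = new ArrayList<>();
    // Stand-in for the nuclear power structural material database (S6).
    private final List<Map<String, String>> database = new ArrayList<>();
    // Documents with unknown layouts, waiting for the decomposition path (S4).
    private final List<String> pendingUnknown = new ArrayList<>();

    /** Extends the system with a new layout template. */
    void register(LayoutTemplate template) {
        templates.add(template);
    }

    /** Returns true if a registered template handled the document. */
    boolean process(String pdfText) {
        for (LayoutTemplate template : templates) {
            if (template.matches(pdfText)) {
                database.add(template.extract(pdfText)); // S3 and S5, then S6
                return true;
            }
        }
        pendingUnknown.add(pdfText); // S4 would decompose this document
        return false;
    }

    List<Map<String, String>> stored() {
        return database;
    }

    List<String> pending() {
        return pendingUnknown;
    }
}
```

Adding support for a new layout then means writing one more `LayoutTemplate` and registering it, without touching the dispatch code.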

Through a complete and rigorous set of processing logic for nuclear power structural material documents, the extraction method provided by the invention supports the informatization and digitization of all data in the field of nuclear power structural materials, can fully automatically extract and archive documents in known formats, and can extend its parsing capability so that it is continuously improved. It improves the efficiency and accuracy of digitizing nuclear power structural material data, and reduces the difficulty of the work and the cost of data entry for the personnel involved.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed.
