Knowledge base construction method based on Word and control method thereof

Document No.: 1953396 Publication date: 2021-12-10

Abstract: Word-based knowledge base construction method and control method thereof (基于Word的知识库构建方法及其控制方法), created by Zhang Shaoju (张少举), Tao Jingyuan (陶静远), and Wu Hairong (吴海荣) on 2021-08-31. The invention relates to a Word-based knowledge base construction method and a control method thereof. The construction method comprises the following steps: (1) logging in to a website, and selecting or creating a document category; (2) selecting the Word document to be published from local storage; (3) submitting the document to a converter for conversion; (4) uploading the converted file to a file system; (5) saving the metadata of the Word document to a database and obtaining a database record ID; (6) indexing the content based on the converted text, and also indexing the author and the database ID; (7) refreshing the website page, whereupon the link to the newly uploaded document can be found under recent documents.

1. A Word-based knowledge base construction method, characterized by comprising the following steps:

(1) logging in to a website, and selecting or creating a document category;

(2) selecting the Word document to be published from local storage;

(3) submitting the document to a converter for conversion;

(4) uploading the converted file to a file system;

(5) saving the metadata of the Word document to a database and obtaining a database record ID;

(6) indexing the content based on the converted text, and also indexing the author and the database ID;

(7) refreshing the website page, whereupon the link to the newly uploaded document can be found under recent documents.
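The metadata and indexing steps of claim 1 (saving metadata to obtain a record ID, then indexing content and author under that ID) can be illustrated with a minimal sketch. The claim names no database or index engine, so everything here is an assumption for illustration: the SQLite in-memory database, the `documents` table and its columns, the toy inverted index, and the function names `publish` and `search`.

```python
import re
import sqlite3
from collections import defaultdict

# In-memory database standing in for the metadata store (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, author TEXT, category TEXT)"
)

# Toy inverted index: token -> set of database record IDs.
index = defaultdict(set)


def publish(title, author, category, converted_text):
    """Save metadata, then index content and author under the new record ID."""
    cur = conn.execute(
        "INSERT INTO documents (title, author, category) VALUES (?, ?, ?)",
        (title, author, category),
    )
    conn.commit()
    record_id = cur.lastrowid  # the database record ID returned by the insert
    # Index the converted text together with the author name.
    for token in re.findall(r"\w+", f"{converted_text} {author}".lower()):
        index[token].add(record_id)
    return record_id


def search(term):
    """Return the sorted record IDs whose content or author contains the term."""
    return sorted(index.get(term.lower(), set()))
```

Calling `publish(...)` returns the record ID, and `search("some word")` then returns the IDs of every document whose converted text or author contains that word. A production system would more likely use a dedicated index server, as claim 9 describes.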

2. The Word-based knowledge base construction method according to claim 1, wherein in the third step the conversion refers to converting Word to H5: the converter recognizes the internal elements of the Word document and converts them into the corresponding H5 elements, further comprising:

(3.1) reading the outline structure of the Word document and converting it into an H5 directory;

(3.2) reading the paragraph contents and converting them into H5 paragraphs;

(3.3) reading the text styles and converting them into H5 CSS3 styles;

(3.4) parsing the hyperlinks in the Word document and converting them into H5 hyperlinks;

(3.5) reading the picture files in the Word document, converting them into base64 encoding, and displaying them on the H5 page;

(3.6) creating an H5 pop-up album from the pictures in the document, providing a picture viewing experience that Word lacks;

(3.7) reading the attachment information, uploading the attachments to a file server, generating download links, and displaying the links on the H5 page;

(3.8) reading the Word mathematical formula information, converting it into XML code or a PNG picture, and displaying it on the H5 page;

(3.9) reading the table information and converting it into tables supported by H5.
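As an illustration of sub-step (3.5), the following sketch embeds picture bytes in an H5 page as a base64 data URI. The function name `image_to_img_tag` and its parameters are hypothetical; the claim states only that pictures are converted to base64 and displayed on the H5 page.

```python
import base64
import html


def image_to_img_tag(image_bytes, mime_type="image/png", alt=""):
    """Return an <img> tag whose src is a base64 data URI for the given bytes."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img src="data:{mime_type};base64,{encoded}" alt="{safe_alt}">'
```

Embedding the picture inline keeps the converted page self-contained, at the cost of roughly one third more bytes than the original binary.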

3. The Word-based knowledge base construction method according to claim 2, wherein, if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server by means of the file server's client, the file server returns a unique file ID, and the converter saves the file ID into the converted H5 page as a hyperlink; after the document is published to the website, the attachment can be downloaded by clicking the link.
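The attachment flow of claim 3 can be sketched as follows. The claim does not specify the file server or its client API, so `InMemoryFileServer`, `attachment_to_link`, and the `/files/` URL prefix are hypothetical stand-ins used only to show the upload, file ID, and hyperlink sequence.

```python
import html
import uuid


class InMemoryFileServer:
    """Hypothetical stand-in for the file server and its client."""

    def __init__(self):
        self._files = {}

    def upload(self, data):
        """Store the bytes and return a unique file ID."""
        file_id = uuid.uuid4().hex
        self._files[file_id] = data
        return file_id

    def download(self, file_id):
        """Return the bytes previously stored under file_id."""
        return self._files[file_id]


def attachment_to_link(server, filename, data, base_url="/files/"):
    """Upload an attachment and return (file_id, H5 download link)."""
    file_id = server.upload(data)
    name_attr = html.escape(filename, quote=True)
    link = f'<a href="{base_url}{file_id}" download="{name_attr}">{html.escape(filename)}</a>'
    return file_id, link
```

The converter would write the returned link into the H5 page at the position where the attachment appeared in the Word document.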

4. The Word-based knowledge base construction method according to claim 2, characterized in that, after the converter converts the Word document into H5, the basic elements of the Word document are indexed, including the author name, release date, and document content.

5. A control method for Word-based knowledge base construction, characterized by comprising the following steps:

(1) loading a Word document into memory;

(2) extracting all picture information and converting it into base64 for temporary storage, wherein each picture has a unique ID, and a one-to-one mapping between picture ID and picture content is established and temporarily stored;

(3) extracting all attachment information and uploading the attachments to a file server to obtain their unique file server IDs, wherein each attachment has a unique ID, and a one-to-one mapping between attachment ID and file server ID is established and temporarily stored;

(4) parsing the style XML inside the Word document, distinguishing the style levels, and establishing a hierarchical relationship, namely the hierarchy of the catalog, and generating hierarchical numbers from that hierarchy using a recursive algorithm;

(5) parsing the paragraphs;

(6) determining the type of each paragraph;

(7) for a table paragraph:

(8) reading the numbers of rows and columns, and traversing them for output;

(9) determining whether merged columns exist;

(10) if so, performing merged-column conversion;

(11) further determining whether merged rows exist;

(12) if so, performing merged-row conversion;

(13) ending the parsing of the table paragraph;

(14) continuing from step (6), for a picture paragraph:

(15) matching the picture ID against the initial picture list and, if the match succeeds, converting the picture into an H5 image tag with the base64 string stored in the src attribute;

(16) continuing from the determination in step (6), for an ordinary text paragraph:

(17) reading the style information, converting the style, and outputting it to the H5 page;

(18) determining whether the paragraph contains a hyperlink;

(19) if so, parsing the hyperlink;

(20) further determining whether an attachment is present;

(21) looking up its ID in the previously built attachment list and, if the match succeeds, converting it into an H5 hyperlink tag, and ending;

(22) continuing from step (16), determining that the paragraph is of the outline type;

(23) matching the paragraph content against the outline levels and, if the match succeeds, rendering it with the style of the corresponding level, and ending.
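The table branch of the steps above (reading rows and columns, then handling merged columns and merged rows) can be sketched as below. The input model is an assumption: each cell is a dict with a `text` key and optional `colspan` and `rowspan` keys, standing in for whatever the converter extracts from the Word table. The function name `table_to_h5` is likewise hypothetical.

```python
import html


def table_to_h5(rows):
    """Render rows of cells as an HTML table.

    Each cell is a dict with "text" and optional "colspan" / "rowspan"
    (an assumed, simplified model of a parsed Word table).
    """
    out = ["<table>"]
    for row in rows:
        out.append("<tr>")
        for cell in row:
            attrs = ""
            # Merged columns: emit colspan only when the cell spans several.
            if cell.get("colspan", 1) > 1:
                attrs += f' colspan="{cell["colspan"]}"'
            # Merged rows: likewise for rowspan.
            if cell.get("rowspan", 1) > 1:
                attrs += f' rowspan="{cell["rowspan"]}"'
            out.append(f"<td{attrs}>{html.escape(cell['text'])}</td>")
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)
```

Cells covered by a merge are assumed to be omitted from the input rows, which matches how HTML expects spanned cells to be written.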

6. The control method for Word-based knowledge base construction according to claim 5, wherein in the third step, if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server by means of the file server client, the file server returns a unique file ID, and the converter stores the unique file ID in the converted H5 page as a hyperlink; after the document is published to the website, the attachment can be downloaded by clicking the link.

7. The control method for Word-based knowledge base construction according to claim 5, wherein generating the hierarchical numbers with the recursive algorithm is in essence a process of building a catalog, with the following specific steps:

(1) looping over every paragraph and extracting the paragraphs that have a title style; the title styles are: heading1, heading2, heading3;

(2) during the loop of step (1), extracting the following feature values and expressing them as a tuple:

paragraph->catalog(content,level)

which is read as: the current paragraph (paragraph) is a directory entry (catalog) whose content is content and whose level is level;

(3) obtaining all directory nodes of the current Word document; at this point the hierarchical relationship between the nodes has not yet been established, and the next step is to establish the hierarchy among the catalog entries;

(4) abstracting the catalog into a tree structure and encapsulating it using the properties of a tree; note that a Word document has an implicit feature relationship, which is the key to constructing the directory tree;

(5) based on these characteristics, establishing the hierarchical relationship of the directory; hierarchical numbers such as 1.1, 1.2, and 1.2.1 are obtained during traversal simply by prepending the number of the current node's parent node.
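The extraction described in claim 7, mapping each title-style paragraph to a catalog(content, level) tuple, can be sketched as follows. The input format is an assumption: paragraphs are given as (style name, text) pairs, and the level is taken from the digit in the style name. `Catalog` and `extract_catalog` are illustrative names, not part of the claim.

```python
import re
from typing import NamedTuple


class Catalog(NamedTuple):
    content: str
    level: int


# Matches style names such as "heading1", "heading 2", "Heading3".
HEADING_STYLE = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)


def extract_catalog(paragraphs):
    """Map (style_name, text) paragraphs to Catalog(content, level) tuples."""
    entries = []
    for style_name, text in paragraphs:
        match = HEADING_STYLE.match(style_name.strip())
        if match:  # only title-style paragraphs become directory nodes
            entries.append(Catalog(content=text, level=int(match.group(1))))
    return entries
```

The result is a flat, ordered list of directory nodes; no parent and child links exist yet at this stage.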

8. The control method for Word-based knowledge base construction according to claim 7, wherein the feature relationship in the fourth step has the following characteristics:

a. the paragraphs of a Word document are ordered from top to bottom, so the extracted outline paragraphs are also ordered;

b. the first outline paragraph to appear must be a top-level paragraph, that is, one with the smallest level number;

c. a child paragraph must appear after its parent paragraph;

d. the upper paragraph closest to the child paragraph must be its parent paragraph.
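Characteristics a to d are enough to rebuild the directory tree in a single pass. The sketch below is one possible reading, not necessarily the inventors' implementation: a stack holds the path to the most recent node, so the nearest earlier entry with a smaller level becomes the parent (characteristic d), and a recursive walk then produces numbers such as 1.2.1 by prepending the parent's number. It assumes ordered (content, level) pairs with levels of 1 or greater; the dict-based node layout and the function names are illustrative.

```python
def build_tree(entries):
    """Build a directory tree from ordered (content, level) pairs (level >= 1)."""
    root = {"content": None, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent node
    for content, level in entries:
        # Characteristic d: the nearest earlier node with a smaller level
        # is the parent, so discard siblings and deeper nodes first.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"content": content, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def number_nodes(node, prefix=""):
    """Recursively yield (number, content), prepending each parent's number."""
    for i, child in enumerate(node["children"], start=1):
        number = f"{prefix}.{i}" if prefix else str(i)
        yield number, child["content"]
        yield from number_nodes(child, number)
```

Because the entries arrive in document order, the stack never needs to look further back than the current path, which keeps the pass linear in the number of outline paragraphs.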

9. The control method for Word-based knowledge base construction according to claim 5 or 6, wherein the servers comprise:

a file server for storing the attachments of the converted H5 pages;

an index server for indexing the document content, author, and release time to facilitate retrieval;

a web server for presenting the H5 pages and managing document categories, including: displaying the converted H5 pages; providing a document download function; providing a document upload function; and providing a document classification function.

Technical Field

The invention belongs to the technical field of online preview, retrieval and cloud storage of Word documents, and particularly relates to a Word-based knowledge base construction method and a control method thereof.

Background

Word documents within a company are numerous, scattered, and easily lost, and finding historical documents is almost impossible. When staff leave or change positions, the whereabouts of their historical documents become unknown. Word documents cannot be searched by content, so the documents a user needs cannot be found quickly.

CN201811043059.3 discloses a method and a device for converting a nuclear power plant Word file into a template-based HTML file, and aims to provide a finally generated HTML file with strong structuredness and inheriting the structure of the Word file content. The technical scheme is as follows: creating an HTML file template; setting a unique pseudo code for the key content; reading the text content and the graphic content from the Word file; loading the read text content into an array, and loading the read graphic content into a folder; opening the created HTML file template; reading the set unique pseudo code of the HTML file template; establishing a corresponding relation between the set unique pseudo code of the HTML file template and the text content and the graphic content in the Word file; and based on the corresponding relation between the set unique pseudo code of the HTML file template and the text content and the graphic content in the Word file, injecting the text content and the graphic content in the Word file into the HTML file. The conversion method of the invention can complete the conversion from the Word file to the HTML file only by engineering production personnel, and the conversion period is greatly shortened. The disadvantages are as follows:

The reference document targets a specific field and has limited ability to recognize Word elements; by its own description, it recognizes the text and graphic elements in a Word document.

The output of the reference's conversion is a static HTML web page.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a Word-based knowledge base construction method which, by converting Word documents to H5 and publishing them to a server, lets users manage Word documents centrally, including archiving, classifying, indexing, and viewing them, so that the knowledge carried by the documents is fully used and efficiency is improved. Another objective of the invention is to provide a control method for Word-based knowledge base construction that generates personalized typesetting through custom parsing of Word documents, that is, by loading custom CSS and JS files during the conversion process.

The technical solution of the invention is a Word-based knowledge base construction method, characterized by comprising the following steps: