Knowledge base construction method based on Word and control method thereof

Document No.: 1953396    Publication date: 2021-12-10

Reading note: this technology, "Knowledge base construction method based on Word and control method thereof", was designed by Zhang Shaoju (张少举), Tao Jingyuan (陶静远) and Wu Hairong (吴海荣) on 2021-08-31. Main content: the invention relates to a Word-based knowledge base construction method and a control method thereof. The construction method comprises the following steps: (1) logging in to the website, and selecting or creating a document category; (2) selecting the Word document to be published from the local machine; (3) submitting the document to the converter for conversion; (4) uploading the converted file to the file system; (5) saving the metadata of the Word document to the database, and obtaining a database record ID; (6) indexing the content based on the converted text, and also indexing the author and the database ID; (7) refreshing the website page, whereupon a link to the newly uploaded document can be found under recent documents.

1. A knowledge base construction method based on Word is characterized by comprising the following steps:

(1) logging in to the website, and selecting or creating a document category;

(2) selecting the Word document to be published from the local machine;

(3) submitting the document to the converter for conversion;

(4) uploading the converted file to the file system;

(5) saving the metadata of the Word document to the database, and obtaining a database record ID;

(6) indexing the content based on the converted text, and also indexing the author and the database ID;

(7) refreshing the website page, whereupon a link to the newly uploaded document can be found under recent documents.

2. The Word-based knowledge base construction method according to claim 1, wherein in step (3) the conversion is from Word to H5: the converter identifies the internal elements of the Word document and converts them into the corresponding H5 elements, further comprising:

(3.1) reading the outline structure of the Word document and converting it into an H5 directory (table of contents);

(3.2) reading the paragraph content and converting it into H5 paragraphs;

(3.3) reading the text styles and converting them into H5/CSS3 styles;

(3.4) parsing the hyperlinks in the Word document and converting them into H5 hyperlinks;

(3.5) reading the picture files in the Word document, converting them into base64 encoding, and displaying them on the H5 page;

(3.6) creating an H5 pop-up album from the pictures in the document, providing a picture-viewing experience that Word lacks;

(3.7) reading the attachment information, uploading the attachments to the file server, generating download links, and displaying the links on the H5 page;

(3.8) reading the mathematical formulas in the Word document, converting them into XML code or PNG pictures, and displaying them on the H5 page;

(3.9) reading the table information and converting it into tables supported by H5.

3. The Word-based knowledge base construction method according to claim 2, wherein, if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server by means of the file server's client; the file server returns a unique file ID, and the converter saves the file ID as a hyperlink in the converted H5 page; after the page is published to the website, the attachment can be downloaded by clicking the link.

4. A Word-based knowledge base construction method according to claim 2, characterized in that after the Word document is converted into H5 by a converter, the basic elements of the Word document are indexed, including author name, release date, document content.

5. A control method based on Word knowledge base construction is characterized by comprising the following steps:

(1) loading the Word document into memory;

(2) extracting all pictures and converting them into base64 for temporary storage, wherein each picture has a unique ID, and a one-to-one mapping between picture ID and picture content is established and temporarily stored;

(3) extracting all attachments and uploading them to the file server to obtain the unique IDs assigned by the file server, wherein each attachment has a unique ID, and a one-to-one mapping between attachment ID and file server ID is established and temporarily stored;

(4) parsing the style XML inside the Word document, distinguishing the style levels, establishing the hierarchical relationship, namely the hierarchical relationship of the directory, and generating hierarchical sequence numbers with a recursive algorithm based on that relationship;

(5) parsing the paragraphs;

(6) judging the type of each paragraph;

(7) if the paragraph is a table paragraph:

(8) reading the numbers of rows and columns, and traversing the cells for output;

(9) judging whether merged columns exist;

(10) if so, converting according to the merged columns;

(11) further judging whether merged rows exist;

(12) if so, converting according to the merged rows;

(13) finishing the parsing of the table paragraph;

(14) continuing from step (6), if the paragraph is a picture paragraph:

(15) matching the picture ID against the initial picture list and, if the match succeeds, converting the picture into an H5 image tag with the base64 string stored in the src attribute;

(16) continuing from step (6), if the paragraph is an ordinary text paragraph:

(17) reading the style information, converting the styles, and outputting them to the H5 page;

(18) judging whether the paragraph contains a hyperlink;

(19) if so, parsing the hyperlink;

(20) further judging whether the hyperlink is an attachment;

(21) looking up the ID in the previously built attachment list and, if the match succeeds, converting it into an H5 hyperlink tag, and ending;

(22) continuing from step (16), if the paragraph is judged to be an outline paragraph:

(23) matching the paragraph content against the outline levels and, if the match succeeds, rendering it with the style of the corresponding level, and ending.

6. The control method for Word-based knowledge base construction according to claim 5, wherein in step (3), if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server by means of a file server client; the file server returns a unique file ID, and the converter stores the file ID as a hyperlink in the converted H5 page; after the page is published to the website, the attachment can be downloaded by clicking the link.

7. The control method for Word-based knowledge base construction according to claim 5, wherein generating the hierarchical sequence numbers with the recursive algorithm is essentially a process of building the directory, with the following specific steps:

(i) looping over every paragraph and extracting the paragraphs that have a heading style, the heading styles being identified by the names heading1, heading2 and heading3;

(ii) during the loop of step (i), extracting the following characteristic values and combining them into a tuple:

paragraph->catalog(content,level)

which is interpreted as follows: the current paragraph is a directory entry (catalog) whose content is content and whose level is level;

(iii) obtaining all directory nodes of the current Word document; at this point the hierarchical relationship between the nodes has not yet been established, and the next step is to establish it;

(iv) abstracting the directory as a tree structure and encapsulating it using the properties of trees; attention must be paid to the implicit feature relationships of the Word document, which are the key to building the directory tree;

(v) establishing the hierarchical relationship of the directory based on these features; the hierarchical sequence numbers, such as 1.1, 1.2 and 1.2.1, are obtained during traversal simply by concatenating the sequence number of the parent node with that of the current node.

8. The control method for Word-based knowledge base construction according to claim 7, wherein the feature relationships referred to in the fourth step are as follows:

a. Word document paragraphs are ordered from top to bottom, so the extracted outline paragraphs are also ordered;

b. the first outline paragraph to appear must be a top-level paragraph, that is, the one with the smallest level number;

c. a child paragraph must appear after its parent paragraph;

d. the nearest preceding higher-level paragraph of a child paragraph must be its parent paragraph.

9. The control method based on Word knowledge base construction according to claim 5 or 6, wherein the server comprises:

a file server for storing the attachments of the converted H5 pages;

an index server for indexing the document content, author and release time to facilitate retrieval;

a web server for presenting the H5 pages and managing document classification, including: displaying the converted H5 pages; providing a document download function; providing a document upload function; and providing a document classification function.

Technical Field

The invention belongs to the technical field of online preview, retrieval and cloud storage of Word documents, and particularly relates to a Word-based knowledge base construction method and a control method thereof.

Background

Word documents inside a company are numerous and scattered, are easily lost, and make finding historical documents almost impossible. When staff leave or move on, the whereabouts of their historical documents become unknown. Word documents do not support content retrieval, so the documents that are useful to a user cannot be found quickly.

CN201811043059.3 discloses a method and a device for converting a nuclear power plant Word file into a template-based HTML file, and aims to provide a finally generated HTML file with strong structuredness and inheriting the structure of the Word file content. The technical scheme is as follows: creating an HTML file template; setting a unique pseudo code for the key content; reading the text content and the graphic content from the Word file; loading the read text content into an array, and loading the read graphic content into a folder; opening the created HTML file template; reading the set unique pseudo code of the HTML file template; establishing a corresponding relation between the set unique pseudo code of the HTML file template and the text content and the graphic content in the Word file; and based on the corresponding relation between the set unique pseudo code of the HTML file template and the text content and the graphic content in the Word file, injecting the text content and the graphic content in the Word file into the HTML file. The conversion method of the invention can complete the conversion from the Word file to the HTML file only by engineering production personnel, and the conversion period is greatly shortened. The disadvantages are as follows:

The reference targets a specific field and can identify only a limited set of Word elements; according to its own description, it recognizes only the text and graphic elements in a Word document.

The output of the reference's conversion is a static HTML web page.

Disclosure of Invention

In view of the defects of the prior art, one object of the invention is to provide a Word-based knowledge base construction method that converts Word documents to H5 and publishes them to a server, so that users can manage Word documents centrally, including archiving, classifying, indexing and viewing, thereby fully exploiting the knowledge value carried by the Word documents and improving efficiency. Another object of the invention is to provide a control method for Word-based knowledge base construction that parses Word documents in a customized way and generates personalized layouts, namely by loading custom CSS and JS files during conversion.

The technical scheme of the invention is that the method for constructing the knowledge base based on Word is characterized by comprising the following steps:

(1) logging in to the website, and selecting or creating a document category;

(2) selecting the Word document to be published from the local machine;

(3) submitting the document to the converter for conversion;

(4) uploading the converted file to the file system;

(5) saving the metadata of the Word document to the database, and obtaining a database record ID;

(6) indexing the content based on the converted text, and also indexing the author and the database ID;

(7) refreshing the website page, whereupon a link to the newly uploaded document can be found under recent documents.

Preferably: in step (3), the conversion is from Word to H5; the converter identifies the internal elements of the Word document and converts them into the corresponding H5 elements, further comprising:

(3.1) reading the outline structure of the Word document and converting it into an H5 directory (table of contents);

(3.2) reading the paragraph content and converting it into H5 paragraphs;

(3.3) reading the text styles and converting them into H5/CSS3 styles;

(3.4) parsing the hyperlinks in the Word document and converting them into H5 hyperlinks;

(3.5) reading the picture files in the Word document, converting them into base64 encoding, and displaying them on the H5 page;

(3.6) creating an H5 pop-up album from the pictures in the document, providing a picture-viewing experience that Word lacks;

(3.7) reading the attachment information, uploading the attachments to the file server, generating download links, and displaying the links on the H5 page;

(3.8) reading the mathematical formulas in the Word document, converting them into XML code or PNG pictures, and displaying them on the H5 page;

(3.9) reading the table information and converting it into tables supported by H5.

Preferably: if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server through the file server's client; the file server returns a unique file ID, and the converter stores the file ID in the converted H5 page as a hyperlink. After the page is published to the website, the attachment can be downloaded by clicking the link.

Preferably: after the Word document is converted into H5 by the converter, the basic elements of the Word document are indexed, including author name, release date and document content.

Another technical solution of the invention is a control method for Word-based knowledge base construction, characterized by comprising the following steps:

(1) loading the Word document into memory;

(2) extracting all pictures and converting them into base64 for temporary storage, wherein each picture has a unique ID, and a one-to-one mapping between picture ID and picture content is established and temporarily stored;

(3) extracting all attachments and uploading them to the file server to obtain the unique IDs assigned by the file server, wherein each attachment has a unique ID, and a one-to-one mapping between attachment ID and file server ID is established and temporarily stored;

(4) parsing the style XML inside the Word document, distinguishing the style levels, establishing the hierarchical relationship, namely the hierarchical relationship of the directory, and generating hierarchical sequence numbers with a recursive algorithm based on that relationship;

(5) parsing the paragraphs;

(6) judging the type of each paragraph;

(7) if the paragraph is a table paragraph:

(8) reading the numbers of rows and columns, and traversing the cells for output;

(9) judging whether merged columns exist;

(10) if so, converting according to the merged columns;

(11) further judging whether merged rows exist;

(12) if so, converting according to the merged rows;

(13) finishing the parsing of the table paragraph;

(14) continuing from step (6), if the paragraph is a picture paragraph:

(15) matching the picture ID against the initial picture list and, if the match succeeds, converting the picture into an H5 image tag with the base64 string stored in the src attribute;

(16) continuing from step (6), if the paragraph is an ordinary text paragraph:

(17) reading the style information, converting the styles, and outputting them to the H5 page;

(18) judging whether the paragraph contains a hyperlink;

(19) if so, parsing the hyperlink;

(20) further judging whether the hyperlink is an attachment;

(21) looking up the ID in the previously built attachment list and, if the match succeeds, converting it into an H5 hyperlink tag, and ending;

(22) continuing from step (16), if the paragraph is judged to be an outline paragraph:

(23) matching the paragraph content against the outline levels and, if the match succeeds, rendering it with the style of the corresponding level, and ending.

Preferably: in step (3), if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server by means of a file server client; the file server returns a unique file ID, and the converter stores the file ID in the converted H5 page as a hyperlink; after the page is published to the website, the attachment can be downloaded by clicking the link.

Preferably: the recursive algorithm of step (4) generates the hierarchical sequence numbers as follows:

generating the hierarchical sequence numbers with the recursive algorithm is essentially a process of building the directory, with the following specific steps:

(i) looping over every paragraph and extracting the paragraphs that have a heading style, the heading styles being identified by the names heading1, heading2 and heading3;

(ii) during the loop of step (i), extracting the following characteristic values and combining them into a tuple:

paragraph->catalog(content,level)

which is interpreted as follows: the current paragraph is a directory entry (catalog) whose content is content and whose level is level;

(iii) obtaining all directory nodes of the current Word document; at this point the hierarchical relationship between the nodes has not yet been established, and the next step is to establish it;

(iv) abstracting the directory as a tree structure and encapsulating it using the properties of trees; attention must be paid to the implicit feature relationships of the Word document, which are the key to building the directory tree;

(v) establishing the hierarchical relationship of the directory based on these features; the hierarchical sequence numbers, such as 1.1, 1.2 and 1.2.1, are obtained during traversal simply by concatenating the sequence number of the parent node with that of the current node.

Preferably: the feature relationships referred to in the fourth step are as follows:

a. Word document paragraphs are ordered from top to bottom, so the extracted outline paragraphs are also ordered;

b. the first outline paragraph to appear must be a top-level paragraph, that is, the one with the smallest level number;

c. a child paragraph must appear after its parent paragraph;

d. the nearest preceding higher-level paragraph of a child paragraph must be its parent paragraph.

Preferably: the server includes:

a file server for storing the attachments of the converted H5 pages;

an index server for indexing the document content, author and release time to facilitate retrieval;

a web server for presenting the H5 pages and managing document classification, including: displaying the converted H5 pages; providing a document download function; providing a document upload function; and providing a document classification function.

Compared with the prior art, the invention has the beneficial effects that:

In addition to text and graphics, the invention recognizes the outline directory structure, tables, mathematical formulas, attachments, hyperlinks and style information (bold, underline, strikethrough, font color, background color and the like), so the set of recognized elements is comparatively rich.

The H5 page generated by the invention can expose an access interface, so its content can later be adjusted dynamically, for example by repositioning the directory.

The method is general-purpose and easy to deploy: combined with a file server, a database server, a web server and an index server, a knowledge base system for an organization (government, enterprise or school) can readily be built from Word documents.

The invention provides technical support for the Word documents accumulated in enterprise management: the documents are converted to H5 (preserving the original Word layout as far as possible) and published to the server, so that the enterprise can manage them centrally, including archiving, classifying, indexing and viewing. The knowledge value carried by the Word documents is fully exploited, and the efficiency of enterprise management and decision-making is improved.

After a Word document is converted into an H5 web page, its content and layout are almost identical to the original Word document. In cooperation with the web server, documents can be classified; in cooperation with the index server, documents can be retrieved, for example by author or content; in cooperation with the file server, the attachments originally embedded in the Word document can be downloaded online. Word documents can thus be managed uniformly and efficiently within the company.

Drawings

FIG. 1 is a flow chart of the Word-based knowledge base construction method of the present invention;

FIG. 2 is a flow chart of the control method based on the Word knowledge base construction of the invention.

Detailed Description

The invention will be described in more detail below with reference to the accompanying drawings:

Referring to FIG. 1, the Word-based knowledge base construction method includes the following steps:

(1) logging in to the website, and selecting or creating a document category;

(2) selecting the Word document to be published from the local machine;

(3) submitting the document to the converter for conversion;

(4) uploading the converted file to the file system;

(5) saving the metadata of the Word document, including the author, the release time and the classification information, to the database, and obtaining a database record ID;

(6) indexing the content based on the converted text, and also indexing the author and the database ID;

(7) refreshing the website page, whereupon a link to the newly uploaded document can be found under recent documents.
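The seven steps above can be sketched as a single publish routine. The following is only an illustrative sketch, not the implementation of the invention: the class `KnowledgeBase`, its in-memory dictionaries standing in for the file system, database and index server, and the `convert` callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from datetime import date
from itertools import count


@dataclass
class KnowledgeBase:
    """Hypothetical in-memory stand-in for file system, database and index."""
    files: dict = field(default_factory=dict)     # path -> converted H5 text
    records: dict = field(default_factory=dict)   # record ID -> metadata
    index: list = field(default_factory=list)     # entries handed to the index
    _ids: count = field(default_factory=lambda: count(1))

    def publish(self, category, filename, author, convert):
        html = convert(filename)                  # step (3): run the converter
        path = f"{category}/{filename}.html"
        self.files[path] = html                   # step (4): store converted file
        record_id = next(self._ids)               # step (5): save metadata, get ID
        self.records[record_id] = {
            "author": author,
            "category": category,
            "path": path,
            "released": date.today().isoformat(),
        }
        # step (6): index the content together with author and record ID
        self.index.append({"id": record_id, "author": author, "text": html})
        return record_id

    def recent(self, limit=10):
        """Step (7): links to the most recently published documents."""
        newest = sorted(self.records, reverse=True)[:limit]
        return [self.records[record_id]["path"] for record_id in newest]
```

A caller would pass the category and file chosen in steps (1) and (2), for example `kb.publish("hr", "policy.docx", "author name", my_converter)`, and then read `kb.recent()` to list the new link.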

Wherein: in step (3), the conversion is from Word to H5; the converter identifies the internal elements of the Word document and converts them into the corresponding H5 elements, further comprising:

(3.1) reading the outline structure of the Word document and converting it into an H5 directory (table of contents);

(3.2) reading the paragraph content and converting it into H5 paragraphs;

(3.3) reading the text styles and converting them into H5/CSS3 styles;

(3.4) parsing the hyperlinks in the Word document and converting them into H5 hyperlinks;

(3.5) reading the picture files in the Word document, converting them into base64 encoding, and displaying them on the H5 page;

(3.6) creating an H5 pop-up album from the pictures in the document, providing a picture-viewing experience that Word lacks;

(3.7) reading the attachment information, uploading the attachments to the file server, generating download links, and displaying the links on the H5 page;

(3.8) reading the mathematical formulas in the Word document, converting them into XML code or PNG pictures, and displaying them on the H5 page;

(3.9) reading the table information and converting it into tables supported by H5.
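Two of the element conversions listed above, (3.3) and (3.5), can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names `image_to_img_tag` and `run_to_span`, their parameters, and the choice of inline `style` attributes are illustrative and are not taken from the invention; only the standard-library `base64` and `html` modules are used.

```python
import base64
from html import escape


def image_to_img_tag(data: bytes, mime: str = "image/png", alt: str = "") -> str:
    """Step (3.5): embed picture bytes in the page as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}" alt="{escape(alt)}">'


def run_to_span(text, bold=False, underline=False, strike=False, color=None):
    """Step (3.3): map a few Word text styles onto CSS declarations."""
    rules = []
    if bold:
        rules.append("font-weight:bold")
    decorations = [
        name
        for name, enabled in (("underline", underline), ("line-through", strike))
        if enabled
    ]
    if decorations:
        rules.append("text-decoration:" + " ".join(decorations))
    if color:
        rules.append(f"color:{color}")
    if not rules:
        return escape(text)  # unstyled text needs no wrapper element
    return f'<span style="{";".join(rules)}">{escape(text)}</span>'
```

Embedding pictures as data URIs keeps the converted page self-contained, at the cost of a larger page than one that references separate image files.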

In this embodiment, if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server through the file server's client; the file server returns a unique file ID, and the converter stores the file ID in the converted H5 page as a hyperlink. After the page is published to the website, the attachment can be downloaded by clicking the link.
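The attachment handling described above might look like the following sketch. The class `InMemoryFileServer`, the function `attachment_link` and the `/files/` URL prefix are hypothetical; a real deployment would call the client of whichever file server is actually used, and the ID scheme (a random UUID here) is likewise an assumption.

```python
import uuid
from html import escape


class InMemoryFileServer:
    """Hypothetical stand-in for the file server and its client."""

    def __init__(self):
        self._blobs = {}

    def upload(self, name: str, data: bytes) -> str:
        file_id = uuid.uuid4().hex          # unique file ID returned to the caller
        self._blobs[file_id] = (name, data)
        return file_id

    def download(self, file_id: str):
        return self._blobs[file_id]


def attachment_link(server, name: str, data: bytes, base_url: str = "/files/") -> str:
    """Upload one attachment and return the H5 hyperlink that points at it."""
    file_id = server.upload(name, data)
    href = escape(base_url + file_id)
    label = escape(name)
    return f'<a href="{href}" download="{label}">{label}</a>'
```

Because the hyperlink carries only the file ID, the attachment itself never has to be embedded in the H5 page.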

In this embodiment, after the Word document is converted into H5 by the converter, the basic elements of the Word document are indexed, including author name, release date, and document content.
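The indexing of author name, release date and document content can be sketched with a small inverted index. This is an illustrative assumption about how such an index could work, not a description of the index server used by the invention; the names `DocumentIndex` and `html_to_text` are hypothetical, and the simple `\w+` tokenizer would need to be replaced by language-aware word segmentation for Chinese text.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of a converted H5 page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class DocumentIndex:
    def __init__(self):
        # (field name, token) -> set of database record IDs
        self.postings = defaultdict(set)

    def add(self, record_id, author, release_date, html):
        fields = {"author": author, "content": html_to_text(html)}
        for name, value in fields.items():
            for token in re.findall(r"\w+", value.lower()):
                self.postings[(name, token)].add(record_id)
        # the release date is kept as one exact-match token
        self.postings[("date", release_date)].add(record_id)

    def search(self, field_name, term):
        return sorted(self.postings.get((field_name, term.lower()), set()))
```

Storing the database record ID in each posting is what lets a search hit be resolved back to the metadata saved in step (5).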

Referring to fig. 2, the control method based on Word knowledge base construction includes the steps of:

(1) loading the Word document into memory;

(2) extracting all pictures and converting them into base64 for temporary storage, wherein each picture has a unique ID, and a one-to-one mapping between picture ID and picture content is established and temporarily stored;

(3) extracting all attachments and uploading them to the file server to obtain the unique IDs assigned by the file server, wherein each attachment has a unique ID, and a one-to-one mapping between attachment ID and file server ID is established and temporarily stored;

(4) parsing the style XML inside the Word document, distinguishing the style levels, establishing the hierarchical relationship, namely the hierarchical relationship of the directory, and generating hierarchical sequence numbers, such as 1.1, 1.2 and 1.2.1, with a recursive algorithm based on that relationship;

(5) parsing the paragraphs;

(6) judging the type of each paragraph;

(7) if the paragraph is a table paragraph:

(8) reading the numbers of rows and columns, and traversing the cells for output;

(9) judging whether merged columns exist;

(10) if so, converting according to the merged columns;

(11) further judging whether merged rows exist;

(12) if so, converting according to the merged rows;

(13) finishing the parsing of the table paragraph;

(14) continuing from step (6), if the paragraph is a picture paragraph:

(15) matching the picture ID against the initial picture list and, if the match succeeds, converting the picture into an H5 image tag with the base64 string stored in the src attribute;

(16) continuing from step (6), if the paragraph is an ordinary text paragraph:

(17) reading the style information, converting the styles, and outputting them to the H5 page;

(18) judging whether the paragraph contains a hyperlink;

(19) if so, parsing the hyperlink;

(20) further judging whether the hyperlink is an attachment;

(21) looking up the ID in the previously built attachment list and, if the match succeeds, converting it into an H5 hyperlink tag, and ending;

(22) continuing from step (16), if the paragraph is judged to be an outline paragraph:

(23) matching the paragraph content against the outline levels and, if the match succeeds, rendering it with the style of the corresponding level, and ending.
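The paragraph-type dispatch of steps (6) to (23) can be sketched as below. The sketch assumes, purely for illustration, that each parsed paragraph is a dictionary with a `type` key, that table cells are `(text, colspan, rowspan)` tuples, and that `images` and `attachments` are the temporary mappings built in steps (2) and (3); none of these names or data shapes come from the invention, and style conversion (step (17)) is omitted for brevity.

```python
from html import escape


def render_table(rows):
    """Steps (8) to (13): traverse the cells; merged columns and merged rows
    become colspan and rowspan attributes."""
    out = ["<table>"]
    for row in rows:
        out.append("<tr>")
        for text, colspan, rowspan in row:
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            out.append(f"<td{attrs}>{escape(text)}</td>")
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)


def render_paragraph(par, images, attachments):
    """Step (6): dispatch on the paragraph type."""
    kind = par["type"]
    if kind == "table":
        return render_table(par["rows"])
    if kind == "picture":
        # step (15): look the picture ID up in the picture mapping
        data = images.get(par["image_id"])
        return f'<img src="data:image/png;base64,{data}">' if data else ""
    if kind == "outline":
        # step (23): render with the heading element of the matching level
        level = par["level"]
        title = escape(par["number"] + " " + par["text"])
        return f"<h{level}>{title}</h{level}>"
    # ordinary text paragraph, steps (16) to (21)
    body = escape(par["text"])
    link = par.get("link")
    if link is not None:
        if link in attachments:
            # the hyperlink is an attachment: point at the file server ID
            href = "/files/" + attachments[link]
        else:
            href = link
        body = f'<a href="{escape(href)}">{body}</a>'
    return f"<p>{body}</p>"
```

Keeping the picture and attachment mappings outside the renderer mirrors the flow above, where both are collected once before any paragraph is parsed.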

Wherein: in step (3), if the converter encounters an attachment while converting the Word document, it uploads the attachment to the file server by means of a file server client; the file server returns a unique file ID, and the converter stores the file ID in the converted H5 page as a hyperlink; after the page is published to the website, the attachment can be downloaded by clicking the link.

Wherein: the recursive algorithm of step (4) generates the hierarchical sequence numbers as follows:

generating the hierarchical sequence numbers with the recursive algorithm is essentially a process of building the directory, with the following specific steps:

(i) looping over every paragraph and extracting the paragraphs that have a heading style, the heading styles being identified by the names heading1, heading2 and heading3;

(ii) during the loop of step (i), extracting the following characteristic values and combining them into a tuple:

paragraph->catalog(content,level)

which is interpreted as follows: the current paragraph is a directory entry (catalog) whose content is content and whose level is level;

(iii) obtaining all directory nodes of the current Word document; note that only the directory nodes have been obtained at this point, the hierarchical relationship between them has not yet been established, and the next step is to establish it;

(iv) abstracting the directory as a tree structure and encapsulating it using the properties of trees; attention must be paid to the implicit feature relationships of the Word document, which are the key to building the directory tree;

(v) establishing the hierarchical relationship of the directory based on these features; the hierarchical sequence numbers, such as 1.1, 1.2 and 1.2.1, are obtained during traversal simply by concatenating the sequence number of the parent node with that of the current node;

the feature relationships that are the key to building the directory tree are described as follows:

a. Word document paragraphs are ordered from top to bottom, so the extracted outline paragraphs are also ordered;

b. the first outline paragraph to appear must be a top-level paragraph, that is, the one with the smallest level number;

c. a child paragraph must appear after its parent paragraph;

d. the nearest preceding higher-level paragraph of a child paragraph must be its parent paragraph.

Based on these features, the hierarchical relationship of the directory can be established; the hierarchical sequence numbers, such as 1.1, 1.2 and 1.2.1, then follow naturally: during traversal, the sequence number of the parent node is concatenated with the sequence number of the current node.
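The directory construction described above can be sketched as follows. The sketch is one possible reading, not the implementation of the invention: it builds the tree with a stack of open ancestors, which relies directly on features a to d, and then assigns the sequence numbers with a recursive traversal. The names `CatalogNode`, `build_catalog`, `assign_numbers` and `flatten` are hypothetical, and the input is assumed to be the ordered `(content, level)` tuples extracted in step (ii), with level 1 as the top level.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogNode:
    content: str
    level: int
    number: str = ""
    children: list = field(default_factory=list)


def build_catalog(headings):
    """Build the directory tree from ordered (content, level) tuples."""
    root = CatalogNode("", 0)           # artificial root above all level-1 entries
    stack = [root]                      # chain of currently open ancestors
    for content, level in headings:
        node = CatalogNode(content, level)
        # drop entries at the same or a deeper level: they cannot be the parent
        while stack[-1].level >= level:
            stack.pop()
        # feature d: the nearest preceding higher-level entry is the parent
        stack[-1].children.append(node)
        stack.append(node)
    return root


def assign_numbers(node, prefix=""):
    """Recursively concatenate the parent's number with the child's position."""
    for position, child in enumerate(node.children, start=1):
        child.number = f"{prefix}{position}"
        assign_numbers(child, child.number + ".")


def flatten(node):
    """Yield (number, content) pairs in document order."""
    for child in node.children:
        yield child.number, child.content
        yield from flatten(child)
```

For example, the ordered headings `[("Intro", 1), ("Scope", 2), ("Terms", 2), ("Defs", 3), ("Method", 1)]` are numbered 1, 1.1, 1.2, 1.2.1 and 2 respectively.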

In this embodiment, the server includes:

a file server for storing the attachments of the converted H5 pages;

an index server for indexing the document content, author and release time to facilitate retrieval;

a web server for presenting the H5 pages and managing document classification, including: displaying the converted H5 pages; providing a document download function; providing a document upload function; and providing a document classification function.

The above-mentioned embodiments are only preferred embodiments of the present invention, and all equivalent changes and modifications made within the scope of the claims of the present invention should be covered by the claims of the present invention.
