Paragraph type identification method and system and document structure identification method and system

Document No.: 749416. Publication date: 2021-04-23.

Reading note: This technology, "Paragraph type identification method and system and document structure identification method and system", was designed by 邓吉秋, 夏晨晨, 刘文毅, 雷玉娇, 何美香 and 路馥毓 on 2021-01-08. Main content: The invention relates to a paragraph type identification method and system and a document structure identification method and system. The paragraph type identification method comprises: S1, judging, according to a preset paragraph type identification rule, whether any paragraph in a text conforms to the paragraph type identification rule, and obtaining a judgment result, wherein the paragraph type identification rule comprises a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are judged, and a second-level rule, which is a paragraph identification association criterion; and S2, determining the paragraph type of the paragraph according to the judgment result, a preset paragraph type identification criterion and a first number, wherein the first number corresponds to the paragraph type identification criterion. The method solves the problems of excessive corpus labeling cost and scarce corpora in existing methods for identifying paragraph types in geological data.

1. A paragraph type identification method, comprising:

S1: judging, according to a preset paragraph type identification rule, whether any paragraph in a text conforms to the paragraph type identification rule, and obtaining a judgment result;

wherein the text comprises geological text in at least one of Markdown format, MID format and MIF format;

the paragraph type identification rule comprises:

a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are judged; and

a second-level rule, which is a paragraph identification association criterion;

S2: determining the paragraph type of the paragraph according to the judgment result, a preset paragraph type identification criterion and a first number;

wherein the first number corresponds to the paragraph type identification criterion.

2. The method according to claim 1, wherein step S1 comprises:

judging each paragraph of the text step by step, in the priority order corresponding to the paragraph type identification rules, to obtain the judgment result for the paragraph.

3. The method according to claim 2, wherein step S2 comprises:

if the judgment result shows that the paragraph conforms to the corresponding paragraph type identification rule, taking the first number corresponding to that paragraph type identification rule as the paragraph type of the paragraph.
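Steps S1 and S2 of claims 1 to 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Rule` record, the rule numbers `P01` to `P03`, the regular expressions, and the convention that a smaller priority value is judged earlier are all assumptions made for illustration.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Rule(NamedTuple):
    number: str                      # the "first number", used as the paragraph type
    priority: int                    # first-level rule (assumed: smaller is judged earlier)
    matches: Callable[[str], bool]   # second-level rule (association criterion)


# Hypothetical rules; the numbers and patterns are illustrative only.
RULES = [
    Rule("P03", 3, lambda p: True),  # catch-all for ordinary body text
    Rule("P01", 1, lambda p: re.match(r"^#{1,6}\s+\S", p) is not None),
    Rule("P02", 2, lambda p: re.match(r"^(Table|Figure)\s+\d+", p) is not None),
]


def identify_paragraph_type(paragraph: str, rules: List[Rule]) -> Optional[str]:
    """Judge the paragraph against each rule in priority order (S1) and
    return the first number of the first rule it satisfies (S2)."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(paragraph):
            return rule.number
    return None  # no rule matched


def identify_all(paragraphs: List[str], rules: List[Rule]) -> List[Optional[str]]:
    return [identify_paragraph_type(p, rules) for p in paragraphs]


SAMPLE = ["# 1 Regional geology", "Table 2 Ore grades", "The deposit lies in a fault zone."]
RESULT = identify_all(SAMPLE, RULES)
```

Because the rules are sorted before judging, the order in which they are listed in `RULES` does not matter; only the priority value decides which rule is tried first.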

4. The method of claim 3, wherein

the priorities comprise: preset paragraph type identification rules of a first level, a second level, a third level, a fourth level, a fifth level and a sixth level;

the priorities are judged in the following order: the preset paragraph type identification rules of the first level, then the second level, the third level, the fourth level, the fifth level and the sixth level.

5. The method of claim 4, wherein

the paragraph identification association criterion comprises one or more of: a multi-condition criterion, a regular expression, a paragraph type, a start-stop paragraph criterion, a structure criterion, a no-format criterion and a method criterion;

the multi-condition criterion comprises:

an AND rule, indicating that the paragraph must simultaneously satisfy the regular expressions or other expressions on both sides of the rule;

an OR rule, indicating that the paragraph need satisfy only one of the regular expressions or other expressions on the two sides of the rule;

a NOT rule, indicating that the paragraph does not satisfy the regular expression or other expression on the right side of the rule;

the regular expression describes paragraph features;

the paragraph type is a first number;

the start-stop paragraph criterion comprises:

a before-paragraph rule with a first number, indicating that the paragraph precedes a paragraph of the paragraph type corresponding to the first number;

a non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the paragraph type corresponding to the first number;

an after-paragraph rule with a first number, indicating that the paragraph follows a paragraph of the paragraph type corresponding to the first number;

a before-paragraph rule with a regular expression, indicating that the paragraph precedes a paragraph that satisfies the regular expression;

an after-paragraph rule with a regular expression, indicating that the paragraph follows a paragraph that satisfies the regular expression;

the structure criterion indicates that the paragraph type of the paragraph matches the paragraph type corresponding to the first number on the right side of the structure criterion;

the no-format criterion covers paragraphs other than those satisfying a multi-condition criterion, a start-stop paragraph criterion, a structure criterion or a method criterion;

the method criterion comprises: a preset label-marking criterion for title paragraphs; and a preset label-marking criterion for table-of-contents paragraphs.
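The multi-condition criterion and part of the start-stop paragraph criterion of claim 5 can be sketched as composable predicates. The following Python sketch is illustrative only: the function names, the type label `TOC_TITLE`, the patterns, and the reading of "follows" and "precedes" as "anywhere after" and "anywhere before" are assumptions, not details given in the claims.

```python
import re
from typing import Callable, List, Optional

# A criterion receives all paragraphs, the index of the paragraph being judged,
# and the paragraph types (first numbers) assigned so far (None if not yet assigned).
Criterion = Callable[[List[str], int, List[Optional[str]]], bool]


def regex(pattern: str) -> Criterion:
    """Regular expression describing a feature of the paragraph itself."""
    compiled = re.compile(pattern)
    return lambda paras, i, types: compiled.search(paras[i]) is not None


def and_(left: Criterion, right: Criterion) -> Criterion:
    """AND rule: both sides must be satisfied."""
    return lambda paras, i, types: left(paras, i, types) and right(paras, i, types)


def or_(left: Criterion, right: Criterion) -> Criterion:
    """OR rule: satisfying either side is enough."""
    return lambda paras, i, types: left(paras, i, types) or right(paras, i, types)


def not_(operand: Criterion) -> Criterion:
    """NOT rule: the expression on the right must not be satisfied."""
    return lambda paras, i, types: not operand(paras, i, types)


def after_type(number: str) -> Criterion:
    """After-paragraph rule with a first number (assumed: anywhere earlier in the text)."""
    return lambda paras, i, types: number in types[:i]


def before_regex(pattern: str) -> Criterion:
    """Before-paragraph rule with a regular expression (assumed: anywhere later in the text)."""
    compiled = re.compile(pattern)
    return lambda paras, i, types: any(compiled.search(p) for p in paras[i + 1:])


PARAGRAPHS = ["Contents", "1 Introduction ..... 1", "2 Geology ..... 5",
              "1 Introduction", "The survey area lies in the north."]
TYPES: List[Optional[str]] = ["TOC_TITLE", None, None, None, None]

# Hypothetical rule: a table-of-contents entry follows the contents title
# and ends with dot leaders and a page number.
toc_entry = and_(after_type("TOC_TITLE"), regex(r"\.{3,}\s*\d+$"))
```

With these combinators, `toc_entry(PARAGRAPHS, 1, TYPES)` holds for the dotted entry, while the real heading at index 3 fails the regular expression even though it also follows the contents title.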

6. The method of claim 5, wherein

the first number further corresponds to preset paragraph identification criterion description information, a paragraph identification criterion priority and a paragraph identification rule.

7. A paragraph type identification system, comprising:

at least one first processor; and

at least one memory communicatively coupled to the first processor, wherein the memory stores program instructions executable by the first processor, the first processor invoking the program instructions to perform a paragraph type identification method as recited in any one of claims 1 to 6.

8. A method for identifying a document structure, the method comprising:

A1: judging, according to a preset text structure identification rule, whether any paragraph among the paragraphs having paragraph types conforms to the text structure identification rule, and obtaining a second judgment result;

the text structure identification rule comprises:

text structure definition rules, comprising: a preset definition rule for a full-text structure, a definition rule for a full-text paragraph text structure, a definition rule for a table text structure, a definition rule for a geological time scale table text structure, a definition rule for a formula text structure and a definition rule for a picture text structure;

a structure identification association criterion, used to identify the hierarchical structure of covers, chapters and paragraphs in a text structure that conforms to the preset definition rule for the full-text structure, and used to identify the sequential structure of text structures that conform to the preset definition rules for the full-text paragraph text structure, the table text structure, the geological time scale table text structure, the formula text structure and the picture text structure;

A2: determining the text structure type of the paragraph according to the second judgment result, the preset text structure identification rule and a second number;

wherein the second number corresponds to the text structure identification rule;

and if the judgment result shows that the paragraph conforms to the corresponding text structure identification rule, the second number corresponding to that text structure identification rule is taken as the text structure type of the paragraph.
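The hierarchical cover, chapter and paragraph structure named in claim 8 can be illustrated by grouping already-typed paragraphs. This Python sketch is a simplification under stated assumptions: the type labels `COVER`, `CHAPTER_TITLE` and `BODY`, the dictionary layout, and the handling of body text that appears before any chapter title are hypothetical and are not taken from the claims.

```python
from typing import Dict, List, Tuple


def build_full_text_structure(typed: List[Tuple[str, str]]) -> Dict:
    """Group (paragraph_type, text) pairs into a cover / chapter / paragraph hierarchy."""
    doc: Dict = {"cover": [], "chapters": []}
    for ptype, text in typed:
        if ptype == "COVER":
            doc["cover"].append(text)
        elif ptype == "CHAPTER_TITLE":
            # A chapter title opens a new chapter node.
            doc["chapters"].append({"title": text, "paragraphs": []})
        elif doc["chapters"]:
            # Any other paragraph belongs to the most recently opened chapter.
            doc["chapters"][-1]["paragraphs"].append(text)
        else:
            # Assumption: text before the first chapter title is kept with the cover.
            doc["cover"].append(text)
    return doc


TYPED = [
    ("COVER", "Geological survey report"),
    ("CHAPTER_TITLE", "1 Introduction"),
    ("BODY", "The survey area lies in the north."),
    ("CHAPTER_TITLE", "2 Stratigraphy"),
    ("BODY", "Devonian strata are exposed."),
    ("BODY", "Carboniferous strata overlie them."),
]
DOC = build_full_text_structure(TYPED)
```

The input here is the output of paragraph type identification, which reflects the dependency stated in step A1: structure identification operates on paragraphs that already carry paragraph types.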

9. The method of claim 8, wherein

the structure identification association criterion comprises: a multi-condition criterion, a regular expression, a paragraph type, a second start-stop paragraph criterion, a structure criterion and a second method criterion;

the multi-condition criterion comprises:

an AND rule, indicating that the regular expressions or other expressions on both sides of the rule must be satisfied simultaneously;

an OR rule, indicating that only one of the regular expressions or other expressions on the two sides of the rule need be satisfied;

a NOT rule, indicating that the paragraph does not satisfy the regular expression or other expression on the right side of the rule;

the regular expression describes text structure features;

the paragraph type is a first number;

the second start-stop paragraph criterion comprises:

a before-paragraph rule with a first number, indicating that the paragraph precedes a paragraph of the paragraph type corresponding to the first number;

a non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the paragraph type corresponding to the first number;

an after-paragraph rule with a first number, indicating that the paragraph follows a paragraph of the paragraph type corresponding to the first number;

a before-paragraph rule with a regular expression, indicating that the paragraph precedes a paragraph that satisfies the regular expression;

an after-paragraph rule with a regular expression, indicating that the paragraph follows a paragraph that satisfies the regular expression;

a start-paragraph rule with a first number, indicating that the structure starts with a paragraph of the paragraph type corresponding to the first number in the start-paragraph rule;

an end-paragraph rule with a first number, indicating that the structure ends with a paragraph of the paragraph type corresponding to the first number in the end-paragraph rule;

the structure criterion indicates that the paragraph type of the paragraph matches the paragraph type corresponding to the first number on the right side of the structure criterion;

the second method criterion comprises:

a preset full-text structure marking method, used for marking the full-text structure;

a preset text structure marking method, used for marking the text structure; and

the second number further corresponds to a text structure identification criterion description and to the text structure identification rule, respectively.
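The start-paragraph and end-paragraph rules of claim 9 can be read as delimiting a span of paragraphs that receives a second number. The Python sketch below illustrates that reading only; the function name, the type labels `TBL_CAPTION` and `TBL_NOTE`, the structure number `S_TABLE`, and the choice to close a span at the nearest end paragraph are assumptions.

```python
from typing import List, Optional, Tuple


def find_structures(
    types: List[Optional[str]],
    start_number: str,
    end_number: str,
    second_number: str,
) -> List[Tuple[str, int, int]]:
    """Return (second_number, start_index, end_index) for each span that opens at a
    paragraph of type start_number and closes at the next paragraph of type end_number."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    for i, ptype in enumerate(types):
        if start is None:
            if ptype == start_number:
                start = i  # start-paragraph rule satisfied: open a span
        elif ptype == end_number:
            spans.append((second_number, start, i))  # end-paragraph rule: close it
            start = None
    return spans  # a span left open at the end of the text is dropped


# Hypothetical paragraph types produced by paragraph type identification.
TYPES = ["BODY", "TBL_CAPTION", "TBL_ROW", "TBL_ROW", "TBL_NOTE",
         "BODY", "TBL_CAPTION", "TBL_NOTE"]
TABLES = find_structures(TYPES, "TBL_CAPTION", "TBL_NOTE", "S_TABLE")
```

For the sample types this yields two table structures, covering paragraph indices 1 to 4 and 6 to 7.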

10. A document structure recognition system, comprising:

at least one second processor; and

at least one memory communicatively coupled to the second processor, wherein the memory stores program instructions executable by the second processor, and wherein the second processor invokes the program instructions to perform a document structure identification method according to any of claims 8 to 9.

Technical Field

The invention relates to the technical field of textual geological data identification, in particular to a paragraph type identification method and system and a document structure identification method and system.

Background

Textual geological data refers to geological data that exists in Markdown or MID/MIF format after digitized geological data has been converted to text. With the rapid growth of geological document resources, researchers in the geological field urgently need to retrieve, organize and classify knowledge quickly and accurately from massive collections of geological documents. The same term carries different degrees of semantic importance depending on where it appears in a geological document, so identifying the paragraph types and document structure of geological data is important.

The prior art is as follows. In one approach, a document structure processing template is designed in advance based on document chapter titles; the user calls the corresponding template according to actual requirements, and a logical outline of the document is generated through human-computer interaction. In another approach, the document structure template is fixed, and the content corresponding to the document structure is captured through semantic analysis and information extraction and integrated into the corresponding position of the document structure, so as to generate the final document structure. In a third approach, paragraph types and document structure are identified by machine learning, based on information such as document chapters, paragraphs and charts.

The prior art has the following disadvantages. To identify paragraph types and document structure, it relies on human-computer interaction with fixed document structure templates and standard document structure processing templates, or on machine learning. Human-computer interaction and purely manual processing are slow and inefficient, and cannot avoid judgment errors caused by negligence. A fixed document structure template largely fixes the outline structure, paragraph content and document format, so it lacks flexibility and cannot process documents with special formats.

Because geological data is complex in format and difficult to label, and little prior work has accumulated, applying machine learning to paragraph type and document structure identification for geological data faces high corpus labeling costs and a scarcity of corpora. In addition, existing studies have not addressed paragraph type and document structure identification for textual geological data.

Disclosure of Invention

(I) Technical problem to be solved

In view of the above disadvantages and shortcomings of the prior art, the present invention provides a paragraph type identification method and system and a document structure identification method and system. The invention solves two problems: the high corpus labeling cost and scarcity of corpora in existing methods for identifying paragraph types and document structure in geological data, and the lack of flexibility in existing document structure identification, where a fixed document structure template largely fixes the outline structure, paragraph content and document format.

(II) Technical solution

To achieve the above purpose, the invention adopts the following main technical solution:

In a first aspect, an embodiment of the present invention provides a paragraph type identification method, including: