Paragraph type identification method and system and document structure identification method and system

Document No.: 749416. Publication date: 2021-04-23.

Abstract: This technology, "Paragraph type identification method and system and document structure identification method and system", was designed by 邓吉秋, 夏晨晨, 刘文毅, 雷玉娇, 何美香 and 路馥毓 on 2021-01-08. The invention relates to a paragraph type identification method and system and a document structure identification method and system. The paragraph type identification method comprises: S1, judging, according to a preset paragraph type identification rule, whether any paragraph in the text conforms to the paragraph type identification rule, and obtaining a judgment result, where the paragraph type identification rule comprises a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are evaluated, and a second-level rule, which is the paragraph identification association criterion; S2, determining the paragraph type of the paragraph according to the judgment result, the preset paragraph type identification criterion and a first number, where the first number corresponds to the paragraph type identification criterion. The method addresses the high corpus labeling cost and the corpus scarcity of existing paragraph type identification methods for geological data.

1. A paragraph type identification method, comprising:

S1, judging, according to a preset paragraph type identification rule, whether any paragraph in the text conforms to the paragraph type identification rule, and obtaining a judgment result;

the text includes: geological text in at least one of Markdown format, MID format and MIF format;

the paragraph type identification rule includes:

a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are evaluated;

a second-level rule, which is the paragraph identification association criterion;

S2, determining the paragraph type of the paragraph according to the judgment result, the preset paragraph type identification criterion and a first number;

the first number corresponds to the paragraph type identification criterion.

2. The method according to claim 1, wherein S1 includes:

judging each paragraph of the text step by step, in the priority order corresponding to the paragraph type identification rules, to obtain the judgment result of the paragraph.

3. The method according to claim 2, wherein S2 includes:

if the judgment result of the paragraph conforms to the corresponding paragraph type identification rule, taking the first number corresponding to that paragraph type identification rule as the paragraph type of the paragraph.

4. The method of claim 3, wherein

the priorities include: preset first-level, second-level, third-level, fourth-level, fifth-level and sixth-level paragraph type identification rules;

the priorities are evaluated in the following order: the first-level rules, then the second-level, third-level, fourth-level, fifth-level and sixth-level rules.

5. The method of claim 4, wherein

the paragraph identification association criterion comprises one or more of: a multi-condition criterion, a regular expression, a paragraph type, a start-stop paragraph criterion, a structure criterion, a no-format criterion and a method criterion;

the multi-condition criterion includes:

an AND rule, indicating that the paragraph must simultaneously satisfy the regular expressions or other rule expressions on both sides of the AND rule;

an OR rule, indicating that the paragraph need only satisfy one of the regular expressions or other rule expressions on the two sides of the OR rule;

a NOT rule, indicating that the paragraph does not satisfy the regular expression or other rule expression on the right side of the NOT rule;

the regular expression describes paragraph features;

the paragraph type is a first number;

the start-stop paragraph criterion includes:

a before-paragraph rule with a first number, indicating that the paragraph precedes a paragraph of the paragraph type corresponding to the first number;

a non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the paragraph type corresponding to the first number;

an after-paragraph rule with a first number, indicating that the paragraph follows a paragraph of the paragraph type corresponding to the first number;

a before-paragraph rule with a regular expression, indicating that the paragraph precedes a paragraph that satisfies the regular expression;

an after-paragraph rule with a regular expression, indicating that the paragraph follows a paragraph that satisfies the regular expression;

the structure criterion indicates that the paragraph type of the paragraph matches the paragraph type corresponding to the first number on the right side of the structure criterion;

the no-format criterion covers paragraphs other than those meeting a multi-condition criterion, a start-stop paragraph criterion, a structure criterion or a method criterion;

the method criterion includes: a preset tag marking criterion for title paragraphs; and a preset tag marking criterion for table-of-contents paragraphs.

6. The method of claim 5, wherein

the first number also corresponds to the preset paragraph identification criterion description information, the paragraph identification criterion priority and the paragraph identification rule.

7. A paragraph type identification system, comprising:

at least one first processor; and

at least one memory communicatively coupled to the first processor, wherein the memory stores program instructions executable by the first processor, the first processor invoking the program instructions to perform a paragraph type identification method as recited in any one of claims 1 to 6.

8. A method for identifying a document structure, the method comprising:

A1, judging, according to a preset text structure identification rule, whether any paragraph among the paragraphs having paragraph types conforms to the text structure identification rule, and obtaining a second judgment result;

the text structure identification rule comprises:

text structure definition rules, including: a preset definition rule for the full-text structure, a definition rule for the full-text paragraph text structure, a definition rule for the table text structure, a definition rule for the geological year representation text structure, a definition rule for the formula text structure and a definition rule for the picture text structure;

a structure identification association criterion, used to identify the hierarchical structure of covers, chapters and paragraphs in a text structure that conforms to the preset definition rule for the full-text structure, and to identify the sequential structure of text structures that conform to the preset definition rules for the full-text paragraph text structure, the table text structure, the geological year representation text structure, the formula text structure and the picture text structure;

A2, determining the text structure type of the paragraph according to the second judgment result, the preset text structure identification rule and a second number;

the second number corresponds to the text structure identification rule;

if the second judgment result of the paragraph conforms to the corresponding text structure identification rule, taking the second number corresponding to that text structure identification rule as the text structure type of the paragraph.

9. The method of claim 8, wherein

the structure identification association criterion includes: a multi-condition criterion, a regular expression, a paragraph type, a second start-stop paragraph criterion, a structure criterion and a second method criterion;

the multi-condition criterion includes:

an AND rule, indicating that the regular expressions or other rule expressions on both sides of the AND rule must be satisfied simultaneously;

an OR rule, indicating that only one of the regular expressions or other rule expressions on the two sides of the OR rule needs to be satisfied;

a NOT rule, indicating that the paragraph does not satisfy the regular expression or other rule expression on the right side of the NOT rule;

the regular expression describes text structure features;

the paragraph type is a first number;

the second start-stop paragraph criterion includes:

a before-paragraph rule with a first number, indicating that the paragraph precedes a paragraph of the paragraph type corresponding to the first number;

a non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the paragraph type corresponding to the first number;

an after-paragraph rule with a first number, indicating that the paragraph follows a paragraph of the paragraph type corresponding to the first number;

a before-paragraph rule with a regular expression, indicating that the paragraph precedes a paragraph that satisfies the regular expression;

an after-paragraph rule with a regular expression, indicating that the paragraph follows a paragraph that satisfies the regular expression;

a start-paragraph rule with a first number, indicating that the structure starts at a paragraph of the paragraph type corresponding to the first number in the start-paragraph rule;

an end-paragraph rule with a first number, indicating that the structure ends at a paragraph of the paragraph type corresponding to the first number in the end-paragraph rule;

the structure criterion indicates that the paragraph type of the paragraph matches the paragraph type corresponding to the first number on the right side of the structure criterion;

the second method criterion includes:

a preset full-text structure marking method, used for marking the full-text structure;

a preset text structure marking method, used for marking the text structure;

the second number also corresponds to the text structure identification criterion description and the text structure identification rule, respectively.

10. A document structure recognition system, comprising:

at least one second processor; and

at least one memory communicatively coupled to the second processor, wherein the memory stores program instructions executable by the second processor, and wherein the second processor invokes the program instructions to perform a document structure identification method according to any of claims 8 to 9.

Technical Field

The invention relates to the technical field of textual geological data identification, in particular to a paragraph type identification method and system and a document structure identification method and system.

Background

Textual geological data refers to geological data stored in Markdown or MID/MIF format after digitized geological data has been converted to text. With the rapid growth of geological document resources, researchers in the geological field urgently need fast and accurate knowledge retrieval, organization and classification over massive collections of geological documents. The same word carries different semantic weight depending on where it appears in a geological document, so identifying the paragraph types and document structure of geological data is important.

The prior art is as follows. In one approach, a document structure processing template is designed in advance based on document chapter titles; the user calls the corresponding template as needed, and an outline of the document's logical structure is generated through human-computer interaction. In another approach, the document structure template is fixed, and the content corresponding to each part of the structure is captured through semantic analysis and information extraction and integrated into the corresponding position to generate the final document structure. In a third approach, paragraph types and document structures are identified by machine learning based on information such as document chapters, paragraphs and charts.

The prior art has the following disadvantages. To identify paragraph types and document structures, it relies on human-computer interaction with standard document structure processing templates, on fixed document structure templates, or on machine learning. Human-computer interaction and purely manual processing are slow and inefficient and cannot avoid judgment errors caused by negligence. Fixed document structure templates mostly fix the outline structure, paragraph content and document format, so they lack flexibility and cannot process documents with special formats.

Because geological data has complex formats and is difficult to label, little labeled material has accumulated, so applying machine learning to paragraph type and document structure identification for geological data faces high corpus labeling cost and corpus scarcity. In addition, existing research does not address paragraph type and document structure identification for textual geological data.

Disclosure of Invention

(I) Technical problem to be solved

In view of the above disadvantages and shortcomings of the prior art, the present invention provides a paragraph type identification method and system and a document structure identification method and system. It addresses the high corpus labeling cost and corpus scarcity of existing paragraph type and document structure identification methods for geological data, as well as the lack of flexibility of existing document structure identification, in which fixed document structure templates mostly fix the outline structure, paragraph content and document format.

(II) technical scheme

To achieve the above purpose, the invention adopts the following main technical scheme.

In a first aspect, an embodiment of the present invention provides a paragraph type identification method, including:

S1, judging, according to a preset paragraph type identification rule, whether any paragraph in the text conforms to the paragraph type identification rule, and obtaining a judgment result;

the text includes: geological text in at least one of Markdown format, MID format and MIF format;

the paragraph type identification rule includes:

a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are evaluated;

a second-level rule, which is the paragraph identification association criterion;

S2, determining the paragraph type of the paragraph according to the judgment result, the preset paragraph type identification criterion and a first number;

the first number corresponds to the paragraph type identification criterion.

Preferably, S1 includes:

judging each paragraph of the text step by step, in the priority order corresponding to the paragraph type identification rules, to obtain the judgment result of the paragraph.

Preferably, S2 includes:

if the judgment result of the paragraph conforms to the corresponding paragraph type identification rule, taking the first number corresponding to that paragraph type identification rule as the paragraph type of the paragraph.

Preferably,

the priorities include: preset first-level, second-level, third-level, fourth-level, fifth-level and sixth-level paragraph type identification rules;

the priorities are evaluated in the following order: the first-level rules, then the second-level, third-level, fourth-level, fifth-level and sixth-level rules.

Preferably,

the paragraph identification association criterion comprises one or more of: a multi-condition criterion, a regular expression, a paragraph type, a start-stop paragraph criterion, a structure criterion, a no-format criterion and a method criterion;

the multi-condition criterion includes:

an AND rule, indicating that the paragraph must simultaneously satisfy the regular expressions or other rule expressions on both sides of the AND rule;

an OR rule, indicating that the paragraph need only satisfy one of the regular expressions or other rule expressions on the two sides of the OR rule;

a NOT rule, indicating that the paragraph does not satisfy the regular expression or other rule expression on the right side of the NOT rule;

the regular expression describes paragraph features;

the paragraph type is a first number;

the start-stop paragraph criterion includes:

a before-paragraph rule with a first number, indicating that the paragraph precedes a paragraph of the paragraph type corresponding to the first number;

a non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the paragraph type corresponding to the first number;

an after-paragraph rule with a first number, indicating that the paragraph follows a paragraph of the paragraph type corresponding to the first number;

a before-paragraph rule with a regular expression, indicating that the paragraph precedes a paragraph that satisfies the regular expression;

an after-paragraph rule with a regular expression, indicating that the paragraph follows a paragraph that satisfies the regular expression;

the structure criterion indicates that the paragraph type of the paragraph matches the paragraph type corresponding to the first number on the right side of the structure criterion;

the no-format criterion covers paragraphs other than those meeting a multi-condition criterion, a start-stop paragraph criterion, a structure criterion or a method criterion;

the method criterion includes: a preset tag marking criterion for title paragraphs; and a preset tag marking criterion for table-of-contents paragraphs.
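The before-paragraph and after-paragraph rules can be sketched as small positional checks. This is an illustrative sketch only: the function names are hypothetical, and it assumes that "precedes" and "follows" mean anywhere earlier or later in the document rather than immediately adjacent, which the text does not specify.

```python
import re
from typing import List, Optional


def before_type(index: int, types: List[Optional[str]], number: str) -> bool:
    """True if some paragraph after `index` has the type given by `number`."""
    return number in types[index + 1:]


def after_type(index: int, types: List[Optional[str]], number: str) -> bool:
    """True if some paragraph before `index` has the type given by `number`."""
    return number in types[:index]


def before_regex(index: int, paragraphs: List[str], pattern: str) -> bool:
    """True if some paragraph after `index` satisfies the regular expression."""
    return any(re.search(pattern, p) for p in paragraphs[index + 1:])


def after_regex(index: int, paragraphs: List[str], pattern: str) -> bool:
    """True if some paragraph before `index` satisfies the regular expression."""
    return any(re.search(pattern, p) for p in paragraphs[:index])


# Hypothetical document: only the second paragraph has been typed so far.
paragraphs = ["Survey report", "Contents", "1 Introduction", "Body text"]
types = [None, "2020100", None, None]

print(before_type(0, types, "2020100"))            # True
print(after_type(2, types, "2020100"))             # True
print(before_regex(3, paragraphs, r"^Contents$"))  # False
```

Because these checks read the types of other paragraphs, they only make sense once higher-priority rules have already assigned those types.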

Preferably,

the first number also corresponds to the preset paragraph identification criterion description information, the paragraph identification criterion priority and the paragraph identification rule.

In a second aspect, an embodiment of the present invention provides a paragraph type identification system, including:

at least one first processor; and

at least one memory communicatively coupled to the first processor, wherein the memory stores program instructions executable by the first processor, and wherein the first processor invokes the program instructions to perform any of the paragraph type identification methods described above.

In a third aspect, an embodiment of the present invention provides a document structure identification method, including:

A1, judging, according to a preset text structure identification rule, whether any paragraph among the paragraphs having paragraph types conforms to the text structure identification rule, and obtaining a second judgment result;

the text structure identification rule comprises:

text structure definition rules, including: a preset definition rule for the full-text structure, a definition rule for the full-text paragraph text structure, a definition rule for the table text structure, a definition rule for the geological year representation text structure, a definition rule for the formula text structure and a definition rule for the picture text structure;

a structure identification association criterion, used to identify the hierarchical structure of covers, chapters and paragraphs in a text structure that conforms to the preset definition rule for the full-text structure, and to identify the sequential structure of text structures that conform to the preset definition rules for the full-text paragraph text structure, the table text structure, the geological year representation text structure, the formula text structure and the picture text structure;

A2, determining the text structure type of the paragraph according to the second judgment result, the preset text structure identification rule and a second number;

the second number corresponds to the text structure identification rule;

if the second judgment result of the paragraph conforms to the corresponding text structure identification rule, taking the second number corresponding to that text structure identification rule as the text structure type of the paragraph.
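The hierarchical structure of covers, chapters and paragraphs can be pictured as a nesting step over paragraphs that already carry a type. The sketch below is only an illustration: the labels "cover", "chapter" and "paragraph" are hypothetical stand-ins for first numbers, and the dictionary layout is an assumption rather than the format used by the invention.

```python
from typing import Dict, List, Tuple


def build_outline(typed: List[Tuple[str, str]]) -> Dict[str, list]:
    """Nest (kind, text) pairs into a cover / chapters / paragraphs hierarchy."""
    doc: Dict[str, list] = {"cover": [], "chapters": []}
    for kind, text in typed:
        if kind == "cover":
            doc["cover"].append(text)
        elif kind == "chapter":
            # A chapter title opens a new chapter node.
            doc["chapters"].append({"title": text, "paragraphs": []})
        elif doc["chapters"]:
            # Body paragraphs attach to the most recent chapter.
            doc["chapters"][-1]["paragraphs"].append(text)
        else:
            # Body text before any chapter is kept with the front matter.
            doc["cover"].append(text)
    return doc


typed_paragraphs = [
    ("cover", "Geological Survey Report"),
    ("chapter", "1 Introduction"),
    ("paragraph", "The survey area lies in the northern basin."),
    ("chapter", "2 Stratigraphy"),
    ("paragraph", "Three formations are exposed."),
]
outline = build_outline(typed_paragraphs)
print([c["title"] for c in outline["chapters"]])
```

A single pass suffices here because each paragraph's position in the hierarchy depends only on its own type and on the most recently opened chapter.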

Preferably,

the structure identification association criterion includes: a multi-condition criterion, a regular expression, a paragraph type, a second start-stop paragraph criterion, a structure criterion and a second method criterion;

the multi-condition criterion includes:

an AND rule, indicating that the regular expressions or other rule expressions on both sides of the AND rule must be satisfied simultaneously;

an OR rule, indicating that only one of the regular expressions or other rule expressions on the two sides of the OR rule needs to be satisfied;

a NOT rule, indicating that the paragraph does not satisfy the regular expression or other rule expression on the right side of the NOT rule;

the regular expression describes text structure features;

the paragraph type is a first number;

the second start-stop paragraph criterion includes:

a before-paragraph rule with a first number, indicating that the paragraph precedes a paragraph of the paragraph type corresponding to the first number;

a non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the paragraph type corresponding to the first number;

an after-paragraph rule with a first number, indicating that the paragraph follows a paragraph of the paragraph type corresponding to the first number;

a before-paragraph rule with a regular expression, indicating that the paragraph precedes a paragraph that satisfies the regular expression;

an after-paragraph rule with a regular expression, indicating that the paragraph follows a paragraph that satisfies the regular expression;

a start-paragraph rule with a first number, indicating that the structure starts at a paragraph of the paragraph type corresponding to the first number in the start-paragraph rule;

an end-paragraph rule with a first number, indicating that the structure ends at a paragraph of the paragraph type corresponding to the first number in the end-paragraph rule;

the structure criterion indicates that the paragraph type of the paragraph matches the paragraph type corresponding to the first number on the right side of the structure criterion;

the second method criterion includes:

a preset full-text structure marking method, used for marking the full-text structure;

a preset text structure marking method, used for marking the text structure.

The second number also corresponds to the text structure identification criterion description and the text structure identification rule, respectively.
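One way to read the start-paragraph and end-paragraph rules is as delimiters of a span of typed paragraphs that is then labeled with a second number. The following sketch assumes that reading; the type codes ("T_CAPTION", "T_ROW", "T_END", "T_BODY") and the second number "S_TABLE" are hypothetical placeholders, not values from the invention.

```python
from typing import List, Optional, Tuple


def find_structures(
    types: List[Optional[str]],
    start_type: str,
    end_type: str,
    second_number: str,
) -> List[Tuple[int, int, str]]:
    """Return (start_index, end_index, second_number) for each span that opens
    at a paragraph of start_type and closes at the next paragraph of end_type.
    A span that is opened but never closed is dropped."""
    spans = []
    start = None
    for i, t in enumerate(types):
        if start is None and t == start_type:
            start = i
        elif start is not None and t == end_type:
            spans.append((start, i, second_number))
            start = None
    return spans


# Hypothetical paragraph types produced by the paragraph type identification step.
types = ["T_CAPTION", "T_ROW", "T_ROW", "T_END", "T_BODY", "T_CAPTION", "T_END"]
print(find_structures(types, "T_CAPTION", "T_END", "S_TABLE"))
# [(0, 3, 'S_TABLE'), (5, 6, 'S_TABLE')]
```

Returning index spans keeps the sequential order of the detected structures, which matches the sequential structure that the association criterion is said to identify.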

In a fourth aspect, an embodiment of the present invention provides a document structure identification system, including:

at least one second processor; and

at least one memory communicatively coupled to the second processor, wherein the memory stores program instructions executable by the second processor, and wherein the second processor invokes the program instructions to perform any of the document structure identification methods described above.

(III) advantageous effects

The invention has the beneficial effects that:

The paragraph type identification method and system use a preset paragraph type identification rule to judge whether any paragraph in the text conforms to the rule and obtain a judgment result, and then determine the paragraph type of the paragraph according to the judgment result, a preset paragraph type identification criterion and a first number, where the first number corresponds to the paragraph identification association criterion. The paragraph type identification method and system greatly improve the efficiency and accuracy of paragraph type identification, and the method is general and extensible.

According to the document structure identification method and system, a preset text structure identification rule is adopted, whether any paragraph in paragraphs with paragraph types accords with the text structure identification rule is judged, and a second judgment result is obtained; determining the text structure type of the paragraph according to the judgment result, a preset text structure identification rule and a second number; the second number corresponds to the text structure identification rule; and if the judgment result of the paragraph accords with the corresponding text structure identification rule, taking a second number corresponding to the text structure identification rule as the text structure type of the paragraph. The document structure identification method and the document structure identification system greatly improve the working efficiency and accuracy of document structure identification.

Drawings

FIG. 1 is a flow chart of a paragraph type identification method of the present invention;

FIG. 2 is a flow chart of a document structure identification method of the present invention;

FIG. 3 is a diagram illustrating a paragraph type identification method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating a method for identifying a document structure based on a paragraph type identification method in an embodiment of the present invention.

Detailed Description

For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.

In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The paragraph type identification method and system and the document structure identification method and system are mainly aimed at textual geological data, that is, geological data stored in Markdown or MID/MIF format after digitized geological data has been converted to text.

Example one

Referring to fig. 1, the present embodiment provides a paragraph type identifying method, including:

S1, judging, according to a preset paragraph type identification rule, whether any paragraph in the text conforms to the paragraph type identification rule, and obtaining a judgment result.

The paragraph type identification rule includes:

a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are evaluated;

a second-level rule, which is the paragraph identification association criterion.

S2, determining the paragraph type of the paragraph according to the judgment result, the preset paragraph type identification criterion and a first number.

The first number corresponds to the paragraph type identification criterion.

Preferably, in this embodiment, S1 includes:

judging each paragraph of the text step by step, in the priority order corresponding to the paragraph type identification rules, to obtain the judgment result of the paragraph.

Preferably, in this embodiment, S2 includes: if the judgment result of the paragraph conforms to the corresponding paragraph type identification rule, taking the first number corresponding to that paragraph type identification rule as the paragraph type of the paragraph.

Preferably, in this embodiment, the priorities include: preset first-level, second-level, third-level, fourth-level, fifth-level and sixth-level paragraph type identification rules.

The priorities are evaluated in the following order: the first-level rules, then the second-level, third-level, fourth-level, fifth-level and sixth-level rules.
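Steps S1 and S2 with the six priority levels can be sketched as a loop over a rule table sorted by priority, returning the first number of the first rule that matches. This is a minimal sketch under stated assumptions: the `Rule` record, the regular expressions and the meanings attached to the numbers below are hypothetical and are not taken from the invention's actual rule table.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    number: str                     # "first number": the paragraph type code
    priority: int                   # 1 is evaluated first, 6 last
    matches: Callable[[str], bool]  # association criterion as a predicate


def classify(paragraph: str, rules: List[Rule]) -> Optional[str]:
    """S1: test rules in priority order. S2: return the first matching number."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(paragraph):
            return rule.number
    return None


# Hypothetical rule table; the numbers and patterns are illustrative only.
RULES = [
    Rule("2010200", 6, lambda p: True),                                  # fallback
    Rule("2020100", 1, lambda p: re.match(r"^#+\s", p) is not None),     # Markdown heading
    Rule("2030100", 2, lambda p: re.match(r"^\|.*\|$", p) is not None),  # table row
]

print(classify("# 1 Regional geology", RULES))              # 2020100
print(classify("The ore body strikes north-east.", RULES))  # 2010200
```

Placing the catch-all rule at the lowest priority means more specific rules always get the first chance to claim a paragraph.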

Preferably, in this embodiment, the paragraph identification association criterion includes one or more of: a multi-condition criterion, a regular expression, a paragraph type, a start-stop paragraph criterion, a structure criterion, a no-format criterion and a method criterion.

The multi-condition criterion includes:

an AND rule, indicating that the paragraph must simultaneously satisfy the regular expressions or other rule expressions on both sides of the AND rule.

In this embodiment and the rule & &: regular expressions or other regular expressions indicating that a paragraph needs to satisfy and two sides of a rule at the same time refer to rules that the paragraph needs to satisfy and the rule & & two sides (the rules include the regular expressions, other rule combinations (composed of start and stop paragraph criteria, structural criteria, paragraph types)) such as: is? Numbering: ? \ d +? $ & & font LINE 2020100 ", which means that the paragraph should both satisfy the sum rule & & left". > a? Numbering: ? \ d +? And $ regular expression, and further to satisfy other rule combinations with rule & & right side "font LINE 2020100".

OR rule: the paragraph needs to satisfy only one of the regular expressions or other rules on the two sides of the OR rule.

In this embodiment the OR rule is written ||. A paragraph needs to satisfy only one of the rules on the two sides of ||, where a rule may be a regular expression or another rule combination (including start-stop paragraph criteria, structural criteria and paragraph types). For example, "\{ 2 }? \ {2} $ && BEFORE ^ year? $ || BEFORE LINE 2020100" means that a paragraph needs to satisfy only one of "\{ 2 }? \ {2} $ && BEFORE ^ year? $" on the left of || and "BEFORE LINE 2020100" on the right of ||.

NOT rule: the paragraph does not satisfy the regular expression or other rule to the right of the NOT rule.

In this embodiment the NOT rule is written NOT. A paragraph must not satisfy the rule that follows NOT; for example, "NOT 2010200" means that the paragraph type is not 2010200 (a first number).

The regular expression: describes paragraph features.

The paragraph type: the first number.

The start-stop paragraph criteria include:

A before-paragraph rule with a first number, indicating that the paragraph precedes the paragraph whose paragraph type corresponds to the first number. In this embodiment this rule is written BEFORE LINE: the paragraph precedes the paragraph whose type follows LINE. For example, "BEFORE LINE 2010200" indicates that the paragraph precedes the paragraph of type 2010200 (a first number).

A non-paragraph rule with a first number, indicating that the paragraph type of the paragraph is not the type corresponding to the first number. In this embodiment this rule is written NOT LINE: the paragraph type is not the first number that follows LINE. For example, "NOT LINE 2010200" indicates that the paragraph type is not 2010200 (a first number).

An after-paragraph rule with a first number, indicating that the paragraph follows the paragraph whose paragraph type corresponds to the first number. In this embodiment this rule is written AFTER LINE: the paragraph follows the paragraph whose type follows LINE. For example, "AFTER LINE 2010200" indicates that the paragraph follows the paragraph of type 2010200.

A before paragraph rule with a regular expression indicates that a paragraph precedes a paragraph that satisfies the regular expression.

A rule after a paragraph with a regular expression, meaning that a paragraph follows a paragraph that satisfies the regular expression.

The structural criterion: the paragraph type of the paragraph satisfies the paragraph type corresponding to the first number to the right of the structural criterion.

In this embodiment the structural criterion is written IN PART, followed by a first number corresponding to a paragraph type; it indicates that the paragraph is of the type that follows PART. For example, "IN PART 2010200" indicates a paragraph of type 2010200. It is a structural criterion because PART may also be followed by a pattern such as ^20102\d{2}, which refers to every paragraph type beginning with 20102.

No-format criterion: paragraphs other than those satisfying a multi-condition criterion, a start-stop paragraph criterion, a structural criterion or a method criterion.

In this embodiment the no-format criterion is used for paragraphs without special format features. It sits at the last priority level, and that level contains only this criterion. The criteria at the other priority levels all rely on special paragraph format features (for example, a title is marked with "#" and a table with '| word | letter |'); once all paragraphs with such features have been identified, the remaining paragraphs are the paragraphs without format features.

The method criteria include: a preset label marking criterion for title paragraphs; and a preset label marking criterion for catalogue paragraphs.

Preferably, in this embodiment, the first number further corresponds to preset paragraph identification criterion description information, a paragraph identification criterion priority, and a paragraph identification rule.

In the paragraph type identification method provided by this embodiment, a preset paragraph type identification rule is adopted, and whether any paragraph in the text meets the paragraph type identification rule is determined, so as to obtain a determination result; determining the paragraph type of the paragraph according to the judgment result, a preset paragraph type identification criterion and a first number; the first number corresponds to the paragraph type identification criteria. By adopting the paragraph type identification method of the embodiment, the working efficiency and the accuracy of paragraph type identification are greatly improved; the method has universality and expansibility.

Example two

Referring to fig. 2, the present embodiment provides a document structure identification method, including:

A1. Judging, according to a preset text structure identification rule, whether any of the paragraphs that already have a paragraph type satisfies the text structure identification rule, and obtaining a second judgment result.

The text structure identification rule includes text structure definition rules, which include: a preset definition rule for the full-text structure, a definition rule for the full-text paragraph text structure, a definition rule for the table text structure, a definition rule for the geological year representation text structure, a definition rule for the formula text structure and a definition rule for the picture text structure.

The structure identification association criterion is used to identify the hierarchical structure of covers, chapters and paragraphs in text structures that conform to the preset definition rule of the full-text structure, and to identify the sequence structure in text structures that conform to the preset definition rules of the full-text paragraph text structure, the table text structure, the geological year representation text structure, the formula text structure and the picture text structure.

A2, determining the text structure type of the paragraph according to the judgment result, a preset text structure identification rule and a second number; the second number corresponds to the text structure recognition rule.

And if the judgment result of the paragraph accords with the corresponding text structure identification rule, taking a second number corresponding to the text structure identification rule as the text structure type of the paragraph.

Preferably, in this embodiment, the structure identification association criterion includes: a multi-condition criterion, a regular expression, a paragraph type, a second start-stop paragraph criterion, a structural criterion, a second method criterion.

The multi-condition criteria include:

AND rule: the regular expressions or other rules on both sides of the rule must be satisfied at the same time.

OR rule: only one of the regular expressions or other rules on the two sides of the rule needs to be satisfied.

NOT rule: the paragraph does not satisfy the regular expression or other rule to its right.

The regular expression: describes paragraph features.

The paragraph type: the first number.

The second start-stop paragraph criterion includes:

a before paragraph rule with a first number indicates that a paragraph precedes the paragraph of the paragraph type to which the first number corresponds.

A non-paragraph rule having a first number indicates that the paragraph type of the paragraph is not the paragraph type to which the first number corresponds.

A rule after a paragraph with a first number indicates that the paragraph follows the paragraph of the paragraph type to which the first number corresponds.

A before paragraph rule with a regular expression indicates that a paragraph precedes a paragraph that satisfies the regular expression.

A rule after a paragraph with a regular expression, meaning that a paragraph follows a paragraph that satisfies the regular expression.

A begin paragraph rule having a first number indicates that the paragraph begins with a paragraph that satisfies the paragraph type corresponding to the first number in the begin paragraph rule.

An end paragraph rule having a first number indicates that a paragraph ends when the paragraph type corresponding to the first number in the end paragraph rule is satisfied.
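The begin-paragraph and end-paragraph rules can be read as delimiting spans of consecutive paragraphs. The following is a minimal sketch of that reading, not the method of the embodiment: the function name `find_spans`, the use of exact equality on the first number, the inclusive end index and the type numbers in the example are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def find_spans(label_list: List[Optional[int]],
               begin_type: int,
               end_type: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, both inclusive (an assumption)."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, label in enumerate(label_list):
        if start is None and label == begin_type:
            start = i                    # a begin-type paragraph opens a span
        elif start is not None and label == end_type:
            spans.append((start, i))     # an end-type paragraph closes it
            start = None
    return spans


# Hypothetical first numbers, chosen only for the example.
labels = [2010100, 2020300, 2020301, 2020399, 2010100, 2020300, 2020399]
spans = find_spans(labels, begin_type=2020300, end_type=2020399)
```

A span that is opened but never closed is dropped in this sketch; whether such a span should instead run to the end of the text is not stated in the description.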

The structural criteria are as follows: the paragraph type representing a paragraph satisfies the paragraph type corresponding to the first number to the right of the structural criterion.

The second method criterion includes: the preset full text structure marking method is used for full text structure marking; the preset text structure marking method is used for marking the text structure.

The second number also corresponds to the text structure identification criterion description and to the text structure identification rule.

In the document structure identification method of the embodiment, because a preset text structure identification rule is adopted, whether any paragraph in the paragraphs with the paragraph types conforms to the text structure identification rule is judged, and a second judgment result is obtained; determining the text structure type of the paragraph according to the judgment result, a preset text structure identification rule and a second number; the second number corresponds to the text structure identification rule; and if the judgment result of the paragraph accords with the corresponding text structure identification rule, taking a second number corresponding to the text structure identification rule as the text structure type of the paragraph. By adopting the document structure identification method of the embodiment, the work efficiency and the data accuracy of the document structure identification are greatly improved.

Example three

Referring to fig. 1 and fig. 3, the present embodiment further provides a paragraph type identification method for geological data, including:

S1. Judging, according to the preset paragraph type identification rule, whether any paragraph in the text satisfies the paragraph type identification rule, and obtaining the judgment result.

The paragraph type identification rule includes:

a first-level rule, which is a priority rule specifying the order in which the paragraph type identification rules are judged; and

a second-level rule, which is the paragraph identification association criterion.

In practical application of this embodiment, step S1 specifically includes:

(1) Traversal for paragraph type identification: the text and the paragraph labels are traversed level by level, starting from the highest priority of the paragraph identification rules. Let rules be the paragraph identification rule set, currpri the priority of the current pass, rule the current rule with priority currpri, and texts the current text. In each pass, the paragraphs of texts are identified with the rules of priority currpri in the order in which those rules appear, one rule at a time; completing a rule yields the paragraph identification tag list paraList. When all rules of priority currpri have been traversed, processing of the current priority is complete and paraList is returned. The next priority is then entered, and paragraph types are identified from the rules of that priority and paraList, until all priorities and rules have been traversed.

(2) Establishing a paragraph type identification and document structure identification rule table, wherein the rule table has the following characteristics:

(2-1) It covers the possible types of paragraphs and document structures.

(2-2) defining two levels of rules for each type:

The first-level rule is the paragraph type identification priority rule (Priority), which defines the order in which the paragraph type identification rules are applied. Priority has levels 0 to 5: level 0 rules are judged first, then level 1, and so on.

The second-level rule is the paragraph identification association criterion, which is used for paragraph identification and paragraph type marking (described in section (3)). The paragraph identification association criterion is defined by multi-condition criteria (the AND rule &&, the OR rule || and the NOT rule NOT), regular expressions, paragraph types, start-stop paragraph criteria, structural criteria, no-format criteria and method criteria (/%...%/). The association criteria and their grammar are listed in Table 1, the basic association criteria definition table:

Table 1 Basic association criteria definition table

When a rule contains several multi-condition criteria, the NOT rule has the highest precedence, the OR rule || comes next, and the AND rule && has the lowest precedence.

The association criteria can be combined for use to form specific paragraph type identification criteria; such as "AFTER LINE 2020300| AFTER LINE 2020301& & ^ a? A graph (? A graph (.

The paragraph type identification rules for the Markdown and MIF/MID formats are defined in Table 2 (Markdown format paragraph type identification rule table) and Table 3 (MIF/MID format paragraph type identification rule table). In these tables, ID is the first number used to mark the paragraph type, Description is the description of the paragraph identification criterion, Priority is the priority of the paragraph identification criterion, and Rules is the paragraph identification rule.

Table 2 Markdown format paragraph type identification rule table

Table 3 MIF/MID format paragraph type identification rule table

S2. Determining the paragraph type of the paragraph according to the judgment result, the preset paragraph type identification criterion and the first number.

The first number corresponds to the paragraph type identification criteria.

In practical application of this embodiment, step S2 specifically includes:

(3) Traversing the text in paragraph type identification priority order: acquire the paragraph type identification rule table rules (chosen according to the text format: Table 2, the Markdown format paragraph type identification rule table, or Table 3, the MIF/MID format paragraph type identification rule table), and build the paragraph type identification priority list from the values of the "Priority" column of that table. Create a paragraph type mark list labelList, whose length equals the size of the text, to store the paragraph types. Traverse the priority list PRIORYLList in order; for the current priority currPRI, look up the rules with priority currPRI in rules. For each such rule, enter step (3-1) to find the paragraphs that satisfy the rule, take the ID of the rule's record as the paragraph type of those paragraphs, and store the ID at the corresponding positions of labelList. Finally, return labelList.
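The priority-ordered pass described in step (3) can be sketched as follows. This is a simplified illustration under stated assumptions, not the rule engine of the embodiment: each rule carries only a plain regular expression, a paragraph that already has a label is not relabelled by a later rule (the description does not say either way), and the `Rule` type, the IDs and the patterns are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    id: int        # first number, i.e. the paragraph type written to the label list
    priority: int  # 0 is judged first
    pattern: str   # plain regular expression only (no &&, ||, LINE or PART)


def label_paragraphs(texts: List[str], rules: List[Rule]) -> List[Optional[int]]:
    label_list: List[Optional[int]] = [None] * len(texts)
    priority_list = sorted({r.priority for r in rules})
    for curr_pri in priority_list:
        # Rules of one priority are applied in their order of appearance.
        for rule in (r for r in rules if r.priority == curr_pri):
            for i, text in enumerate(texts):
                # Assumption: earlier (higher-priority) labels are kept.
                if label_list[i] is None and re.search(rule.pattern, text):
                    label_list[i] = rule.id
    return label_list


# Hypothetical rule table and text, for illustration only.
rules = [
    Rule(id=2019900, priority=5, pattern=r".+"),        # fallback, judged last
    Rule(id=2010100, priority=0, pattern=r"^#+\s"),     # Markdown-style title
    Rule(id=2010300, priority=1, pattern=r"^\|.*\|$"),  # table row
]
texts = ["# Overview", "| a | b |", "Plain sentence."]
labels = label_paragraphs(texts, rules)
```

Because the rules are grouped by priority before matching, the order of rows in the rule table only matters among rules of the same priority.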

(3-1) Acquire a rule and judge whether it contains a method criterion from Table 1 (basic association criteria definition table). If so, extract the method criterion keyword as the method name f_name and keep the remainder as the rule; for example, if the rule is "/%DIRECTORY FORMAT%/\\ d + \\ (#.) $. && AFTER LINE 2020100", then f_name is "/%DIRECTORY FORMAT%/" and the rule becomes "\\ d + \\ (#.) $. && AFTER LINE 2020100". Then judge whether the rule contains a multi-condition criterion from Table 1: if it does, enter step (3-1-1) to judge whether each paragraph satisfies the rule and obtain the list eligibleList of paragraph sequence numbers that satisfy the paragraph type identification criterion; otherwise, enter step (3-1-2) to obtain the list eligibleList of paragraph sequence numbers that satisfy the rule. Then enter step (3-2) to mark the paragraphs that satisfy the rule, obtain the paragraph type mark list labelList, and return labelList.
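Separating the method criterion from the rest of a rule, as in step (3-1), might look like the sketch below. It assumes that a method criterion always appears at the start of the rule in the form /%...%/; the helper name `split_method` and the regular expression in the example rule are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed syntax: an optional leading "/%NAME%/" followed by the remaining rule.
METHOD_RE = re.compile(r"^(/%.*?%/)(.*)$", re.DOTALL)


def split_method(rule: str) -> Tuple[Optional[str], str]:
    """Return (f_name, remaining_rule); f_name is None without a method criterion."""
    m = METHOD_RE.match(rule)
    if m is None:
        return None, rule
    return m.group(1), m.group(2)


f_name, rest = split_method(r"/%DIRECTORY FORMAT%/^\d+$ && AFTER LINE 2020100")
plain = split_method("AFTER LINE 2020100")
```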

(3-1-1) Acquire the rule. If the rule contains the AND rule &&, enter step (3-1-1-1); otherwise enter step (3-1-1-2). Obtain the list eligibleList of paragraph sequence numbers that satisfy the rule, and return eligibleList.

(3-1-1-1) Acquire the paragraph sequence number list eligibleList satisfying the paragraph identification criterion so far, and the rule. Split the rule on the AND-rule keyword "&&" to obtain the rule subset list ruleList. Traverse ruleList: if the current sub-rule contains the OR-rule keyword "||", then, following the principle that sub-rules containing the OR rule are processed before the remaining AND sub-rules, enter step (3-1-1-2) to obtain the eligibleList satisfying the current criterion; if the current sub-rule does not contain "||", store it in the AND-rule subset andList. Continue until ruleList has been traversed. If eligibleList is not empty, traverse andList: if the current sub-rule contains none of the structure and start-stop keywords of Table 1, store it temporarily in the no-criterion table NoLabel; otherwise enter step (3-1-2) to obtain the eligibleList satisfying the current criterion. If eligibleList is then empty, return it directly; otherwise continue until andList has been traversed. If NoLabel is not empty, traverse NoLabel, entering step (3-1-2) for each entry in turn to obtain eligibleList. Return eligibleList.

(3-1-1-2) Acquire the paragraph sequence number list eligibleList and the rule. Split the rule on the OR-rule keyword "||" to obtain the OR-rule subset ruleList. Traverse ruleList: for each sub-rule, enter step (3-1-2) to obtain the sub-list eList of paragraph sequence numbers satisfying the current criterion; if eList is not empty, assign the union of eList and eligibleList to eligibleList. Continue until ruleList has been traversed. Return eligibleList.
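One way to read steps (3-1-1-1) and (3-1-1-2) is that && narrows the candidate paragraphs by intersection while || widens them by union. The sketch below illustrates that reading only. It assumes sub-rules are plain regular expressions that do not themselves contain "&&" or "||", it omits NOT and the LINE/PART keywords, and it evaluates the && parts in written order rather than OR parts first (the resulting set is the same because intersection is order-independent). All function names are hypothetical.

```python
import re
from typing import List


def match_regex(pattern: str, texts: List[str], candidates: List[int]) -> List[int]:
    """Indexes among candidates whose paragraph matches the pattern."""
    return [i for i in candidates if re.search(pattern, texts[i])]


def eval_or(rule: str, texts: List[str], candidates: List[int]) -> List[int]:
    """Union of the matches of every '||'-separated sub-rule."""
    result = set()
    for sub in rule.split("||"):
        result |= set(match_regex(sub.strip(), texts, candidates))
    return sorted(result)


def eval_and(rule: str, texts: List[str]) -> List[int]:
    """Each '&&'-separated part further narrows the eligible indexes."""
    eligible = list(range(len(texts)))
    for sub in rule.split("&&"):
        eligible = eval_or(sub, texts, eligible)
        if not eligible:
            break  # nothing can match any more
    return eligible


texts = ["Figure 1 map", "Table 2 data", "Figure caption", "note"]
eligible = eval_and(r"^Figure || ^Table && \d", texts)
```

Splitting on && first and on || second is what gives || the higher precedence of the two operators.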

(3-1-2) Acquire the rule and the text texts, and judge whether the rule contains a start-stop paragraph criterion, a structural criterion or a no-format criterion from Table 1. If it does, enter step (3-1-3) to obtain the list eligibleList of paragraph sequence numbers satisfying the rule; otherwise, traverse texts, find the paragraphs that match the rule, and store their sequence numbers in eligibleList. Return eligibleList.

(3-1-3) Acquire the rule, the paragraph type mark list labelList and the paragraph sequence number list eligibleList. Using the keywords of the start-stop paragraph criteria, structural criteria and no-format criterion defined in Table 1, extract the criterion keyword in the rule as the method name fun and the part after the keyword as the rule. Call the criterion method corresponding to fun (steps (3-1-3-1) to (3-1-3-4)) to obtain eligibleList, and return eligibleList.

(3-1-3-1) Before-paragraph method BEFORE and before-paragraph-type method BEFORE LINE: acquire the rule, the text texts (the BEFORE LINE method acquires the paragraph type mark list labelList instead) and the paragraph sequence number list eligibleList; the sub sequence number list searchList stores the paragraph sequence numbers that satisfy the rule, and the start index s_index is initially 0.

If eligibleList is not empty, traverse eligibleList, with index the current paragraph sequence number and seIndex its position in eligibleList. If the content of texts at position index (for BEFORE LINE, the content of labelList at position index) matches the rule, store the paragraph sequence numbers of eligibleList from position s_index up to, but not including, position seIndex in searchList, and assign seIndex to s_index; otherwise move on to the next paragraph sequence number. Repeat until eligibleList has been traversed.

If eligibleList is empty, traverse the full text texts (for BEFORE LINE, the paragraph type mark list labelList), with item the current paragraph content and itemIndex its index. If item matches the rule, store the indexes of texts (or labelList) from s_index up to the line before itemIndex in searchList, and assign itemIndex to s_index; otherwise move on to the next line. Repeat until texts (or labelList) has been traversed.

Return searchList.

(3-1-3-2) After-paragraph method AFTER and after-paragraph-type method AFTER LINE:

If eligibleList is not empty, traverse the paragraph sequence number list eligibleList, with index the current paragraph sequence number and seIndex its position in eligibleList. If the content of texts at position index (for AFTER LINE, the content of labelList at position index) matches the rule, store the paragraph sequence numbers of eligibleList from position seIndex to the end of eligibleList in the sub sequence number list searchList, and stop traversing eligibleList; otherwise move on to the next paragraph sequence number. Repeat until eligibleList has been traversed.

If eligibleList is empty, traverse the text texts (for AFTER LINE, labelList), with item the current paragraph and itemIndex its index. If item matches the rule, store the indexes from position itemIndex to the end of texts (or labelList) in searchList, and stop the traversal; otherwise move on to the next line. Repeat until texts (or labelList) has been traversed.

Return searchList.
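For the case where eligibleList is empty, the BEFORE LINE and AFTER LINE methods of steps (3-1-3-1) and (3-1-3-2) can be sketched over the label list as below. The sketch follows the wording of those steps literally, so the range returned by `after_line` starts at the matching paragraph itself; whether that paragraph is meant to be included is not certain from the description. The function names, the use of `None` for an unlabelled paragraph and the full-match comparison on the label are assumptions.

```python
import re
from typing import List, Optional


def _matches(label: Optional[int], type_pattern: str) -> bool:
    """True when the label exists and fully matches the type pattern."""
    return label is not None and re.fullmatch(type_pattern, str(label)) is not None


def before_line(type_pattern: str, label_list: List[Optional[int]]) -> List[int]:
    search_list: List[int] = []
    s_index = 0
    for item_index, label in enumerate(label_list):
        if _matches(label, type_pattern):
            # Store everything from s_index up to the line before the match.
            search_list.extend(range(s_index, item_index))
            s_index = item_index
    return search_list


def after_line(type_pattern: str, label_list: List[Optional[int]]) -> List[int]:
    for item_index, label in enumerate(label_list):
        if _matches(label, type_pattern):
            # First match wins; the traversal stops here.
            return list(range(item_index, len(label_list)))
    return []


# Hypothetical labels: None marks a paragraph that has no type yet.
labels = [None, 2010100, None, 2010200, None, None]
before = before_line("2010200", labels)
after = after_line("2010200", labels)
```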

(3-1-3-3) Structural criterion method IN PART and non-structural criterion method NOT IN PART: acquire the rule, the paragraph type mark list labelList and the paragraph sequence number list eligibleList; the sub sequence number list searchList stores the paragraph sequence numbers that satisfy the paragraph type identification rule.

If eligibleList is not empty, traverse eligibleList with index the current sequence number. If the content of labelList at position index matches the rule (for NOT IN PART, does not match the rule), store index in searchList; otherwise move on to the next paragraph sequence number. Repeat until eligibleList has been traversed.

If eligibleList is empty, traverse labelList with item the current content and itemIndex its index. If item matches the rule (for NOT IN PART, does not match the rule), store itemIndex in searchList; otherwise move on to the next entry of labelList. Repeat until labelList has been traversed.

Return searchList.
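A compact sketch of the IN PART and NOT IN PART methods of step (3-1-3-3) is given below. It assumes that the value after PART is treated as a regular expression matched against the whole label, and that an unlabelled paragraph (`None`) counts as not matching, so NOT IN PART selects it; neither point is stated in the description. The function name `in_part` and the `negate` flag are hypothetical.

```python
import re
from typing import List, Optional


def in_part(type_pattern: str,
            label_list: List[Optional[int]],
            eligible_list: Optional[List[int]] = None,
            negate: bool = False) -> List[int]:
    """IN PART when negate is False, NOT IN PART when negate is True."""
    # An empty or missing eligible list means "traverse the whole label list".
    candidates = eligible_list if eligible_list else range(len(label_list))
    search_list: List[int] = []
    for index in candidates:
        label = label_list[index]
        hit = label is not None and re.fullmatch(type_pattern, str(label)) is not None
        if hit != negate:
            search_list.append(index)
    return search_list


# Hypothetical labels for illustration.
labels = [2010200, 2010201, 2020100, None]
inside = in_part(r"^20102\d{2}", labels)
outside = in_part(r"^20102\d{2}", labels, negate=True)
narrowed = in_part(r"^20102\d{2}", labels, eligible_list=[1, 2])
```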

(3-1-3-4) No-format method informat: acquire the paragraph type mark list labelList and create the sub sequence number list searchList. Traverse labelList, with item the current content and itemIndex its index: if item carries no paragraph type mark, store itemIndex in searchList; if item carries a paragraph type mark, move on to the next entry. Repeat until labelList has been traversed. Return searchList.
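The no-format method of step (3-1-3-4) amounts to collecting the positions that are still unlabelled. A minimal sketch, assuming an unlabelled paragraph is stored as `None` (the description does not say how the absence of a mark is represented) and using a hypothetical function name:

```python
from typing import List, Optional


def no_format(label_list: List[Optional[int]]) -> List[int]:
    """Indexes of paragraphs that carry no paragraph type mark yet."""
    return [item_index for item_index, item in enumerate(label_list) if item is None]


# Hypothetical labels for illustration.
remaining = no_format([2010100, None, 2010300, None])
```

Since this criterion sits alone at the last priority level, it runs after every other rule has had the chance to label a paragraph.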

(3-2) Acquire the method name f_name, the paragraph type mark ID, the list eligibleList of paragraph sequence numbers satisfying the paragraph type identification rule, and the paragraph type mark list labelList. If f_name is null, traverse eligibleList: with index the current entry, set the content of labelList at position index to ID, then move on to the next paragraph sequence number, repeating until eligibleList has been traversed. If f_name is not null, call the marking method corresponding to f_name (steps (3-2-1) to (3-2-2)) to mark the paragraphs and obtain labelList. Return labelList.

(3-2-1) Marking method for the title paragraph type, /%TITLE%/: acquire the paragraph type mark ID (an integer), the text texts, the list eligibleList of paragraph sequence numbers satisfying the paragraph type identification rule, and the paragraph type mark list labelList. Traverse eligibleList with index the current sequence number: obtain the number n of "#" characters contained in the content of texts at position index according to the regular expression "# {1, }) ("; then move on to the next paragraph sequence number. Repeat until eligibleList has been traversed. Return labelList.
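Counting the leading "#" characters of a title paragraph, as step (3-2-1) requires, can be sketched as follows. The regular expression used here is an assumption standing in for the one in the description, and the sketch stops at computing n because the way n is combined with the paragraph type mark ID is not recoverable from the text. The function name `heading_level` is hypothetical.

```python
import re

# Assumed pattern: one or more '#' characters at the start of the paragraph.
HEADING_RE = re.compile(r"^(#{1,})")


def heading_level(paragraph: str) -> int:
    """Number n of leading '#' characters, or 0 when there are none."""
    m = HEADING_RE.match(paragraph)
    return len(m.group(1)) if m else 0


n_section = heading_level("## 2.1 Strata")
n_plain = heading_level("Plain sentence with a # inside")
```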

(3-2-2) Marking method for the catalogue paragraph type, /%DIRECTORY FORMAT%/: acquire the paragraph type mark ID (an integer), the text texts, the paragraph sequence number list eligibleList and the paragraph type mark list labelList. Traverse eligibleList, with index the current sequence number and sIndex its position in eligibleList. If the content of texts at position index matches the regular expression "^ \ [ (preamble |. prime. reference. -/J. -) -", set labelList at position index to the paragraph type mark ID. If the content of texts at position index matches the regular expression "(: if the paragraph corresponding to the sequence number does not match the regular expression "^ [ (preamble |. prime. prime.)', the paragraph type mark ID is set to the paragraph type mark of that paragraph plus 1; otherwise the paragraph type mark ID is unchanged. Traverse the sequence number type list oTypeList: with oIndex the current type index, form the identification criterion R as the pre-feature sPart + the type index oIndex + the post-feature oPart, and traverse eligibleList starting from position sIndex. If the paragraph label found for a paragraph index is smaller than the current paragraph type mark ID, end the traversal of oTypeList; otherwise, find the paragraph whose content matches R, set the corresponding position of labelList to the paragraph type mark ID, and move on to the next sIndex, until eligibleList has been traversed. Move on to the next sequence number index until eligibleList has been traversed. Return labelList.

In the paragraph type identification method provided by this embodiment, a preset paragraph type identification rule is adopted to judge whether any paragraph in the text conforms to the paragraph type identification rule, and a judgment result is obtained; the paragraph type of the paragraph is then determined according to the judgment result, a preset paragraph type identification criterion and a first number, where the first number corresponds to the paragraph type identification criterion. By adopting the paragraph type identification method of this embodiment, the working efficiency and data accuracy of paragraph type identification are greatly improved; the method is also general and extensible.

Example four

Referring to fig. 2 and fig. 4, the present embodiment further provides an automatic document structure identification method for geological data. The method of the fourth embodiment is based on the paragraph type identification method of the third embodiment: document structure identification is performed after paragraph type identification has been carried out on the geological data, and includes:

A1, judging, according to the preset text structure identification rule, whether any paragraph among the paragraphs with identified paragraph types conforms to the text structure identification rule, and obtaining a second judgment result.

The text structure identification rule comprises text structure definition rules and a structure identification association criterion.

The text structure definition rules include: a preset definition rule of the full-text structure, a definition rule of the full-text paragraph text structure, a definition rule of the table text structure, a definition rule of the geological year representation text structure, a definition rule of the formula text structure and a definition rule of the picture text structure.

The structure identification association criterion is used for identifying the hierarchical structure of covers, chapters and paragraphs within a text structure that conforms to the preset definition rule of the full-text structure, and for identifying the sequential structure within text structures that conform to the preset definition rules of the full-text paragraph, table, geological year representation, formula and picture text structures.

In practical application of this embodiment, step a1 specifically includes:

Starting from the first rule of the text structure identification rules, the paragraph labels and the text are traversed rule by rule. Let the structure identification rule set be rules, the rule of each traversal be rule, the text be texts, a specific line of the text be context with line index contextIndex, and a sub-label contained in the paragraph label set labelList be label, where label is the paragraph label corresponding to context. During the rule-by-rule traversal, processing starts from the current rule and proceeds in the order of the line index contextIndex (which corresponds to the order of the sub-labels label), handling one sub-label label of the paragraph label set labelList and its corresponding text context at a time; once all sub-labels label have been processed, identification of the document structure of the text texts under the current rule is complete; the next rule is then entered, until all rules have been traversed.
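The rule-by-rule traversal described above can be sketched in Python as follows. This is a minimal illustration only: the function name traverse_rules and the callback process are hypothetical, rules are treated as opaque objects, and the label values in the usage lines are placeholders; none of these details are specified in this embodiment.

```python
from typing import Callable, List, Sequence, Tuple


def traverse_rules(
    rules: Sequence[object],
    texts: Sequence[str],
    label_list: Sequence[int],
    process: Callable[[object, int, str, int], None],
) -> None:
    """Visit every (rule, line) pair: rules in order, lines in index order."""
    if len(texts) != len(label_list):
        raise ValueError("texts and label_list must have the same length")
    for rule in rules:  # outer loop: one full pass over the text per rule
        for context_index, (context, label) in enumerate(zip(texts, label_list)):
            # inner loop: one sub-label and its line of text at a time
            process(rule, context_index, context, label)


# Usage: record the order in which (rule, line index, label) triples are visited.
visited: List[Tuple[object, int, int]] = []
traverse_rules(
    rules=["rule_a", "rule_b"],                # placeholder rules
    texts=["Cover", "1 Introduction"],
    label_list=[9010100, 9010400],             # placeholder paragraph labels
    process=lambda rule, i, context, label: visited.append((rule, i, label)),
)
```

The outer loop fixes the rule and the inner loop walks the lines, which mirrors the statement that one rule is finished for the whole text before the next rule is entered.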

The first level of rules is the text structure definition: all text structures are defined as a six-level structure of full text, full-text paragraphs, tables, geological year representations, formulas and pictures; the structure is adjusted according to the actual content of the text (for Markdown format text).

The second level is the association criterion, which is mainly used for identifying the text structure.

For a full-text structure, the hierarchical structure of covers, chapters, paragraphs and the like is identified within its text structure.

For the full-text paragraph, table, geological year representation, formula and picture structures, the sequential structure within the text structure is identified.

Text structure identification and text structure type marking are described in process (4). The text structure association criterion is defined by a plurality of condition criteria (the and rule &, the or rule |, the not rule NOT), regular expressions, paragraph types, start-stop paragraph criteria, structure criteria and method criteria (for example /% FULLTEXT%/). Starting from the basic association criterion definition table in Table 1, the association criterion is obtained by adding two start-stop paragraph criteria, completely replacing the method criteria and removing the no-format criterion; the specific changes are listed in the basic association criterion change table in Table 4.

TABLE 4 change table of basic association criteria

The association criteria can be combined to form specific document structure identification criteria; for example, "/% FULLTEXT%/END LINE 9010400" indicates that the text structure ends at paragraph type 9010400 and that the full-text labeling method /% FULLTEXT%/ is invoked to label the text structure.
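As an illustration of how such a combined criterion might be split into its parts, the following Python sketch extracts the method keyword and the start-stop paragraph criterion from the example string. The function name parse_criterion and both regular expressions are assumptions made for this sketch; the complete criterion syntax is defined by Table 1 and Table 4 and is not reproduced here.

```python
import re
from typing import Optional, Tuple

# Assumed shapes: a method criterion "/% NAME %/" and a start-stop paragraph
# criterion "BEGIN LINE <paragraph type>" or "END LINE <paragraph type>".
METHOD_RE = re.compile(r"/%\s*([A-Za-z_ ]+?)\s*%/")
LINE_RE = re.compile(r"\b(BEGIN|END) LINE (\d+)")


def parse_criterion(criterion: str) -> Tuple[Optional[str], Optional[str], Optional[int]]:
    """Return (method name, 'BEGIN' or 'END', paragraph type); None where absent."""
    method = METHOD_RE.search(criterion)
    line = LINE_RE.search(criterion)
    method_name = method.group(1) if method else None
    boundary = line.group(1) if line else None
    paragraph_type = int(line.group(2)) if line else None
    return method_name, boundary, paragraph_type


parsed = parse_criterion("/% FULLTEXT%/END LINE 9010400")
```

Here parsed holds the keyword FULLTEXT, the boundary END and the paragraph type 9010400, matching the reading of the example given in the text.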

The Markdown format text structure recognition criteria table is shown in table 5:

TABLE 5 Markdown format text structure recognition criteria table

The identification criteria of the internal structure of the full-text structure in Markdown format are shown in Table 6.

Table 6 Markdown format full text structure internal structure identification criterion table

The MIF/MID format document structure recognition criteria are shown in Table 7.

TABLE 7 MIF/MID document structure recognition criteria table

ID2 is a second number used for marking the type of the text structure, Description is the Description of the text structure identification criteria, and Rules is the text structure identification rule.

The rule table is stored in an Excel file.

All the criteria of paragraph type, document structure type and document structure internal structure type are predefined in the corresponding table according to the rule.

A2, determining the text structure type of the paragraph according to the second judgment result, the preset text structure identification rule and the second number.

The second number corresponds to the text structure recognition rule.

And if the judgment result of the paragraph accords with the corresponding text structure identification rule, taking a second number corresponding to the text structure identification rule as the text structure type of the paragraph.

In practical application of this embodiment, step a2 specifically includes:

(4) Acquire the paragraph type mark list labelList output by process (3), the text texts and the text structure identification rule list strucRuleList (the text structure identification rule table of the corresponding format is acquired according to the text type of the text texts: Table 5 for Markdown format text and Table 7 for MIF/MID documents); set up a text structure mark list strucList of the same size as the text texts for storing the text structure identification marks; traverse the text structure identification rule list strucRuleList in order: let the currently traversed text structure identification rule be rule, enter step (4-1), search for the paragraphs matching rule and store their sequence numbers in the list eligibleList, acquire the number ID2 corresponding to the text structure identification rule as the structure type of the text structure, and store ID2 at the corresponding positions of the text structure mark list strucList; return the text structure mark list strucList.
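Process (4) can be sketched as the following loop. This is a simplified illustration under stated assumptions: each rule is modelled as a hypothetical pair of a number ID2 and a predicate over (paragraph label, paragraph text), the name mark_structures and the ID values are invented for the example, and a later rule overwrites the mark of an earlier one, which this embodiment does not state either way.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical rule model: (ID2, predicate on (paragraph label, paragraph text)).
Rule = Tuple[int, Callable[[int, str], bool]]


def mark_structures(
    texts: Sequence[str],
    label_list: Sequence[int],
    struc_rule_list: Sequence[Rule],
) -> List[Optional[int]]:
    """Return strucList: one text structure mark (or None) per paragraph."""
    struc_list: List[Optional[int]] = [None] * len(texts)
    for id2, matches in struc_rule_list:  # traverse the rules in order
        # eligibleList: sequence numbers of the paragraphs matching this rule
        eligible_list = [
            i for i, (label, text) in enumerate(zip(label_list, texts))
            if matches(label, text)
        ]
        for index in eligible_list:
            struc_list[index] = id2  # store ID2 at the matching positions
    return struc_list


example_rules: List[Rule] = [
    (100, lambda label, text: label == 1),               # placeholder rule
    (200, lambda label, text: text.startswith("Table")),  # placeholder rule
]
struc = mark_structures(["Cover", "Table 1 Samples", "Body"], [1, 2, 3], example_rules)
```

Paragraphs that match no rule keep the value None in this sketch, so unmatched positions remain distinguishable from marked ones.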

(4-1) Acquire the rule rule, the text texts, the paragraph type mark list labelList and the sequence number list eligibleList matching rule. The algorithm is almost the same as that of step (3-1); the main differences are that two start-stop paragraph criteria ((4-1-1) and (4-1-2)) are added and the no-format criterion is deleted. The sequence number list eligibleList is obtained from the result of matching rule against the paragraph type mark list labelList or the text texts; return the sequence number list eligibleList; enter step (4-2) and mark the paragraphs that conform to rule to obtain the text structure mark list strucList; return the text structure mark list strucList.

(4-1-1) Start paragraph criterion method BEGIN LINE: acquire the rule rule, the paragraph type mark list labelList and the sequence number list eligibleList conforming to the text structure criterion; a sub sequence number list searchList stores the sequence numbers that conform to rule. If the sequence number list eligibleList is not empty, traverse eligibleList, where the current paragraph sequence number index corresponds to the index number seIndex in eligibleList: if the content at position index of the paragraph type mark list labelList matches rule, store the paragraph sequence numbers from position seIndex to the end of eligibleList in the sub sequence number list searchList and break out of the traversal of eligibleList; otherwise, move to the next paragraph sequence number index and continue until the traversal of eligibleList is finished. If the sequence number list eligibleList is empty, traverse the paragraph type mark list labelList, where the currently traversed content item corresponds to the content index itemIndex: if item matches rule, store the paragraph sequence numbers from itemIndex to the end of labelList in searchList and break out of the traversal; otherwise, move to the next content item until labelList has been traversed. Return the sub sequence number list searchList.
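A minimal Python sketch of the BEGIN LINE step follows. It assumes that "matches the rule" can be modelled as equality between a paragraph label and a single paragraph type rule_label; the function name begin_line and the sample labels are hypothetical.

```python
from typing import List, Sequence


def begin_line(
    rule_label: int,
    label_list: Sequence[int],
    eligible_list: Sequence[int],
) -> List[int]:
    """Return searchList: paragraph numbers from the first match to the end."""
    if eligible_list:
        # Non-empty eligibleList: search only among the eligible paragraphs.
        for se_index, index in enumerate(eligible_list):
            if label_list[index] == rule_label:
                return list(eligible_list[se_index:])
        return []
    # Empty eligibleList: search the whole paragraph type mark list.
    for item_index, item in enumerate(label_list):
        if item == rule_label:
            return list(range(item_index, len(label_list)))
    return []


labels = [1, 2, 3, 2, 4]                         # placeholder paragraph labels
from_whole_text = begin_line(2, labels, [])      # starts at the first label 2
from_eligible = begin_line(2, labels, [0, 3, 4])
```

Returning on the first match corresponds to breaking out of the traversal in the step above: only the first matching paragraph opens the structure.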

(4-1-2) End paragraph criterion method END LINE: acquire the rule rule, the paragraph type mark list labelList and the sequence number list eligibleList conforming to the text structure criterion; a sub sequence number list searchList stores the paragraph sequence numbers that conform to rule, and a position store s_index is initialized to 0. If the sequence number list eligibleList is not empty, traverse eligibleList, where the current paragraph sequence number index corresponds to the index number seIndex in eligibleList: if the content at position index of the paragraph type mark list labelList matches rule, store the paragraph sequence numbers from position s_index to position seIndex of eligibleList in the sub sequence number list searchList and assign seIndex + 1 to s_index; move to the next paragraph sequence number index until the traversal of eligibleList is finished. If the sequence number list eligibleList is empty, traverse the paragraph type mark list labelList, where the currently traversed content item corresponds to the content index itemIndex: if item matches rule, store the paragraph sequence numbers from position s_index to position itemIndex in searchList and set s_index equal to itemIndex + 1; move to the next content item until the traversal of labelList is finished. Return the sub sequence number list searchList.
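The END LINE step can be sketched in the same style. Two points are assumptions of this sketch rather than statements of the embodiment: matching is again modelled as label equality, and the matching paragraph itself is included in the stored range. The function name end_line and the sample labels are hypothetical.

```python
from typing import List, Sequence


def end_line(
    rule_label: int,
    label_list: Sequence[int],
    eligible_list: Sequence[int],
) -> List[int]:
    """Return searchList: every paragraph up to and including each match."""
    search_list: List[int] = []
    s_index = 0  # start of the segment that the next match will close
    if eligible_list:
        for se_index, index in enumerate(eligible_list):
            if label_list[index] == rule_label:
                search_list.extend(eligible_list[s_index:se_index + 1])
                s_index = se_index + 1
        return search_list
    for item_index, item in enumerate(label_list):
        if item == rule_label:
            search_list.extend(range(s_index, item_index + 1))
            s_index = item_index + 1
    return search_list


labels = [1, 2, 3, 2, 4]                        # placeholder paragraph labels
up_to_last_match = end_line(2, labels, [])
within_eligible = end_line(2, labels, [0, 2, 3, 4])
```

Unlike BEGIN LINE, the traversal does not stop at the first match, so paragraphs after the last matching paragraph are the only ones left out.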

(4-2) Acquire the method name f_name, the paragraph type identifier ID, the sequence number list eligibleList conforming to the text type identification rule, the paragraph type mark list labelList and the text structure mark list strucList: if the method name f_name is null, traverse the sequence number list eligibleList and, for the current sequence number index, set position index of the text structure mark list strucList equal to ID; move to the next element and repeat until the traversal of eligibleList is complete; if the method name f_name is not null (that is, it is a method criterion keyword from Table 4), call the corresponding marking method ((4-2-1) to (4-2-2)) according to f_name to mark the text structure type and obtain the text structure mark list strucList; return the text structure mark list strucList.
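The dispatch in step (4-2) might be organised as below. The registry dictionary methods, the function name mark and the callable signature are assumptions of this sketch; the embodiment only states that a marking method is selected by f_name.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical signature shared by all marking methods in this sketch.
MarkMethod = Callable[[int, Sequence[int], Sequence[int], List[Optional[int]]], List[Optional[int]]]


def mark(
    f_name: Optional[str],
    struct_id: int,
    eligible_list: Sequence[int],
    label_list: Sequence[int],
    struc_list: List[Optional[int]],
    methods: Dict[str, MarkMethod],
) -> List[Optional[int]]:
    """Mark eligible paragraphs directly, or delegate to a named method."""
    if not f_name:
        # No method criterion: every eligible paragraph receives the same ID.
        for index in eligible_list:
            struc_list[index] = struct_id
        return struc_list
    # Method criterion present: look up and call the marking method by name.
    return methods[f_name](struct_id, eligible_list, label_list, struc_list)


direct = mark(None, 7, [1, 3], [0, 0, 0, 0], [None] * 4, {})
```

A lookup table keeps the set of marking methods open for extension, in line with adding new method criterion keywords to the rule table rather than changing the traversal code.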

(4-2-1) Full-text structure labeling method /% FULLTEXT%/: acquire the sequence number list eligibleList conforming to the text type identification rule, the full text texts and the text structure mark ID2; obtain the internal structure list r_List according to the full-text structure internal structure criterion table in Table 6, and assign the text structure mark ID2 to the positions of the text structure mark list strucList given by the sequence number list eligibleList; enter step (4-1) and obtain the text structure mark list strucList; return the text structure mark list strucList.

(4-2-2) Text structure labeling method /% CHAPTER%/: acquire the sequence number list eligibleList conforming to the text structure identification rule, the paragraph type mark list labelList and the text structure mark ID2; obtain the title level list deList from the correspondence between the sequence number list eligibleList and the paragraph type mark list labelList, and obtain the dictionary deDict that maps text paragraph types to title levels (a first-level title paragraph and its content correspond to the text structure mark ID2, a second-level title paragraph and its content correspond to the text structure mark ID2 + 1, and so on); enter step (4-2-2-1), mark the text structure and obtain the text structure mark list strucList; return the text structure mark list strucList.

(4-2-2-1) Acquire the sequence number list eligibleList conforming to the text structure identification rule, the paragraph type mark list labelList, the title level dictionary deDict and the title level list deList; traverse the sequence number list eligibleList: let the current traversal sequence number index correspond to the index number sIndex; if the content at position index of the paragraph type mark list labelList matches the content at position 0 of the title level list deList, set the value at position index of the text structure mark list strucList to the value of the title level dictionary deDict at key deList[0]; then traverse the sequence number list eligibleList starting from index number sIndex: let the current traversal sequence number be searchIndex, with corresponding index number i in eligibleList: if the content at position searchIndex of labelList matches the content at position 0 of deList, break out of the current traversal; otherwise, set the value at position searchIndex of strucList to the value of deDict at key deList[0] and move to the next traversal sequence number searchIndex, until the traversal of eligibleList is complete or has been broken out of; enter step (4-2-2-1) again (with some parameters changed: the sequence number list eligibleList becomes the paragraph sequence numbers of eligibleList from the traversal sequence number index + 1 to the index number i, and the title level list deList becomes deList from position 1 to the end) and obtain the text structure mark list strucList; the traversal of the sequence number list eligibleList then continues from index number i + 1 until eligibleList has been completely traversed; return the text structure mark list strucList.
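The recursive marking of steps (4-2-2) and (4-2-2-1) can be illustrated with the sketch below. It is one reading of the procedure, not a definitive implementation: matching is modelled as label equality, the inner search starts at the position after the heading, the outer traversal resumes at the next heading of the same level, and the names build_level_dict and mark_chapters as well as all label and ID values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def build_level_dict(de_list: Sequence[int], id2: int) -> Dict[int, int]:
    """deDict: first-level title type -> ID2, second-level -> ID2 + 1, ..."""
    return {title_type: id2 + level for level, title_type in enumerate(de_list)}


def mark_chapters(
    eligible_list: Sequence[int],
    label_list: Sequence[int],
    de_dict: Dict[int, int],
    de_list: Sequence[int],
    struc_list: List[Optional[int]],
) -> List[Optional[int]]:
    """Mark each title block, then recurse into its body with the next level."""
    if not de_list or not eligible_list:
        return struc_list
    top = de_list[0]  # paragraph type of the current title level
    s = 0
    while s < len(eligible_list):
        if label_list[eligible_list[s]] != top:
            s += 1
            continue
        # The block runs from this title up to the next title of the same level.
        i = s + 1
        while i < len(eligible_list) and label_list[eligible_list[i]] != top:
            i += 1
        for index in eligible_list[s:i]:
            struc_list[index] = de_dict[top]
        # Deeper title levels inside the block overwrite the outer mark.
        mark_chapters(eligible_list[s + 1:i], label_list, de_dict, de_list[1:], struc_list)
        s = i
    return struc_list


labels = [1, 0, 2, 0, 1, 0]        # 1: first-level title, 2: second-level, 0: body
levels = build_level_dict([1, 2], 100)
marks = mark_chapters(range(len(labels)), labels, levels, [1, 2], [None] * len(labels))
```

Each title and the content below it carry the mark of the deepest enclosing title level, which is how the hierarchy of chapters ends up encoded in a flat list.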

In the document structure identification method of this embodiment, a preset text structure identification rule is adopted to judge whether any paragraph among the paragraphs with paragraph types conforms to the text structure identification rule, and a second judgment result is obtained; the text structure type of the paragraph is determined according to the second judgment result, the preset text structure identification rule and the second number, where the second number corresponds to the text structure identification rule; if the judgment result of the paragraph conforms to the corresponding text structure identification rule, the second number corresponding to that rule is taken as the text structure type of the paragraph. By adopting the document structure identification method of this embodiment, the work efficiency and data accuracy of document structure identification are greatly improved.

Since the system described in the above embodiment of the present invention is a system used for implementing the method of the above embodiment of the present invention, a person skilled in the art can understand the specific structure and the modification of the system based on the method described in the above embodiment of the present invention, and thus the detailed description is omitted here. All systems adopted by the method of the above embodiments of the present invention are within the intended scope of the present invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like is for convenience only and does not denote any order. These words are to be understood as part of the name of the component.

Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., mean that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.
