Multi-source Markdown geological data text format standardization method and system

Document No.: 749415 Publication date: 2021-04-23

Reading note: This technology, "Multi-source Markdown geological data text format standardization method and system", was designed by 邓吉秋, 夏晨晨, 刘文毅, 雷玉娇, 何美香 and 路馥毓 on 2021-01-08. Main content: The invention relates to a method and a system for standardizing the text format of multi-source Markdown geological data. The method comprises: S1, judging, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion, and obtaining a judgment result, where the text cleaning judgment rule comprises a first-level rule, which is a priority rule specifying the order in which the cleaning judgment criteria are evaluated, and a second-level rule, which is the cleaning judgment criterion used to judge whether the text meets it; S2, performing normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method, to obtain a normalized text, where the normalization processing method corresponds to the text cleaning judgment rule. The invention solves the problems that normalizing the text format of geological data in Markdown format can only be completed by experienced operators, is slow and inefficient, and cannot avoid judgment errors caused by human negligence.

1. A multi-source Markdown geological data text format standardization method, characterized by comprising the following steps:

S1, judging, according to a preset text cleaning judgment rule, whether any line of text in the text meets a cleaning judgment criterion, and obtaining a judgment result;

wherein the text is a multi-source Markdown geological data text;

the text cleaning judgment rule comprises:

a first-level rule, which is a priority rule specifying the order in which the cleaning judgment criteria are evaluated;

a second-level rule, which is a cleaning judgment criterion used to judge whether the text meets that criterion;

S2, performing normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method, to obtain a normalized text;

wherein the normalization processing method corresponds to the text cleaning judgment rule.

2. The method according to claim 1, wherein the step S1 includes:

judging each line of text in the text step by step, in the priority order corresponding to the text cleaning judgment rule, and obtaining the judgment result for that line of text.

3. The method according to claim 2, wherein the step S2 includes:

if the judgment result for the line of text is that it meets the corresponding cleaning judgment criterion, performing normalization processing using the preset normalization processing method corresponding to that cleaning judgment criterion, to obtain the normalized text corresponding to the line of text.
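The flow of claims 1 to 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Rule` tuple, the `normalize_text` function and the two example rules are hypothetical, and the sketch assumes that every rule a line matches is applied in priority order (the claims do not say whether evaluation stops at the first match).

```python
import re
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical pairing of a criterion with its normalization method."""
    priority: int                     # 1 is evaluated first
    criterion: Callable[[str], bool]  # second-level rule: does the line match?
    normalize: Callable[[str], str]   # processing method tied to the criterion


def normalize_text(text: str, rules: List[Rule]) -> str:
    # First-level rule: evaluate criteria in ascending priority order.
    ordered = sorted(rules, key=lambda r: r.priority)
    out = []
    for line in text.splitlines():
        for rule in ordered:
            if rule.criterion(line):          # S1: judge the line
                line = rule.normalize(line)   # S2: apply the matching method
        out.append(line)
    return "\n".join(out)


# Example rules invented for illustration; they are not taken from the patent.
RULES = [
    # Priority 2: strip trailing whitespace.
    Rule(2, lambda s: re.search(r"[ \t]+$", s) is not None, str.rstrip),
    # Priority 1: insert the missing space after ATX heading markers.
    Rule(1, lambda s: re.match(r"^#+[^#\s]", s) is not None,
         lambda s: re.sub(r"^(#+)", r"\1 ", s)),
]

print(normalize_text("##Stratigraphy   \nGranite outcrop\t", RULES))
```

Sorting by `priority` keeps the evaluation order independent of the order in which rules are declared, which is one way to read the first-level priority rule.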

4. The method of claim 3, wherein

the priorities include: a first-level discrimination rule, a second-level discrimination rule, a third-level discrimination rule, a fourth-level discrimination rule, a fifth-level discrimination rule and a sixth-level discrimination rule.

5. The method of claim 4, wherein

the priorities are evaluated in the following order: the first-level discrimination rule, the second-level discrimination rule, the third-level discrimination rule, the fourth-level discrimination rule, the fifth-level discrimination rule and the sixth-level discrimination rule.

6. The method of claim 5, wherein

the cleaning judgment criteria comprise: preset multi-condition criteria, a preset regular expression criterion and a preset method identifier criterion;

the preset multi-condition criteria include:

a preset AND rule, a preset OR rule and a preset NOT rule;

the preset AND rule is evaluated before the preset OR rule in the priority order;

the preset OR rule is evaluated before the preset NOT rule in the priority order;

the preset AND rule indicates that the line text must simultaneously satisfy the regular expressions, or other preset regular expressions, on both sides of the AND rule;

the preset OR rule indicates that the line text only needs to satisfy one of the regular expressions, or other preset regular expressions, on either side of the OR rule;

the preset NOT rule indicates that the line text does not satisfy the regular expression, or other preset regular expression, that follows the NOT rule;

the preset regular expression criterion is that the line text satisfies a regular expression;

the preset method identifier criterion is that the line text satisfies a predefined method.
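As a rough illustration of the criteria in claim 6, the following Python sketch assumes that the three multi-condition rules correspond to logical AND, OR and NOT. All class names, the `METHODS` registry and the `is_blank` method are hypothetical. The sketch represents a criterion as an explicit tree, so it does not model the evaluation order among the three operators, which would only matter when parsing a rule written as a flat string.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Regex:
    pattern: str          # regular expression criterion


@dataclass(frozen=True)
class Method:
    name: str             # method identifier criterion: key into METHODS


@dataclass(frozen=True)
class And:
    left: "Criterion"
    right: "Criterion"


@dataclass(frozen=True)
class Or:
    left: "Criterion"
    right: "Criterion"


@dataclass(frozen=True)
class Not:
    operand: "Criterion"


Criterion = Union[Regex, Method, And, Or, Not]

# Hypothetical registry of predefined methods, invented for this example.
METHODS: Dict[str, Callable[[str], bool]] = {
    "is_blank": lambda line: line.strip() == "",
}


def matches(criterion: Criterion, line: str) -> bool:
    """Return True if the line satisfies the criterion tree."""
    if isinstance(criterion, Regex):
        return re.search(criterion.pattern, line) is not None
    if isinstance(criterion, Method):
        return METHODS[criterion.name](line)
    if isinstance(criterion, And):
        return matches(criterion.left, line) and matches(criterion.right, line)
    if isinstance(criterion, Or):
        return matches(criterion.left, line) or matches(criterion.right, line)
    if isinstance(criterion, Not):
        return not matches(criterion.operand, line)
    raise TypeError(f"unsupported criterion: {criterion!r}")


# Example: a line that is blank or holds only digits, but is not a heading.
PAGE_RESIDUE = And(
    Or(Method("is_blank"), Regex(r"^\s*\d+\s*$")),
    Not(Regex(r"^#")),
)
```

For instance, `matches(PAGE_RESIDUE, " 12 ")` holds because the line contains only digits and does not start with `#`, whereas `matches(PAGE_RESIDUE, "12 m thick")` does not.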

7. The method of claim 6, wherein

the normalization processing method further corresponds to: a preset description of the cleaning judgment criterion, the priority of the cleaning judgment criterion, the cleaning judgment rule, the original characters required when calling the normalization processing method, the replacement characters required when calling the normalization processing method, the antecedent of the value required when calling the normalization processing method, and the consequent of the value required when calling the normalization processing method.
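The items listed in claim 7 can be pictured as one record per rule. The Python sketch below is only an assumption about how such a record might look: the field names, the `"replace"` method name and the example record are hypothetical. The antecedent and consequent fields are stored but left unused, because this section does not specify how they are applied.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuleRecord:
    """Hypothetical record holding the items listed in claim 7."""
    description: str                         # description of the criterion
    priority: int                            # priority of the criterion
    rule: str                                # cleaning judgment rule (a regex here)
    method: str                              # name of the processing method
    original_chars: Optional[str] = None     # characters to be replaced
    replacement_chars: Optional[str] = None  # characters to substitute
    value_antecedent: Optional[str] = None   # semantics not specified here
    value_consequent: Optional[str] = None   # semantics not specified here


def apply_record(record: RuleRecord, line: str) -> str:
    """Apply the record's method to the line if the line matches its rule."""
    if re.search(record.rule, line) is None:
        return line
    if record.method == "replace" and record.original_chars is not None:
        return line.replace(record.original_chars, record.replacement_chars or "")
    return line  # unknown methods leave the line unchanged in this sketch


# Example record invented for illustration.
FULLWIDTH_COLON = RuleRecord(
    description="Replace full-width colons with an ASCII colon and a space",
    priority=3,
    rule="：",
    method="replace",
    original_chars="：",
    replacement_chars=": ",
)

print(apply_record(FULLWIDTH_COLON, "Lithology：granite"))
```

Keeping the rule and its call parameters together in one record means new rules can be added as data rather than as code, which fits the stated goal of reducing reliance on experienced operators.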

8. A multi-source Markdown geological data text format standardization system, characterized by comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the multi-source Markdown geological data text format standardization method as claimed in any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of document data content standardization processing, in particular to a method and a system for standardizing text formats of multi-source Markdown geological data.

Background

With the development of information technology, digitized geological data increasingly come in many formats and occupy large amounts of storage space, so storing and using the data have become urgent problems. The data format determines whether geological data can be preserved over the long term; existing formats include Word, PDF, TXT, HTML, CAJ, PPT, MapGIS and MapInfo. Whether the data can continue to be used depends on the text content being complete, highly recognizable and standardized.

Given the need to store and use geological data, normalized geological data has become a practical requirement. Normalization covers both document content and document format, but in most cases only the former is addressed. Experienced professionals can normalize formats manually according to the characteristics of each format. However, manual processing is time-consuming and laborious, omissions and errors cannot be ruled out, and the resulting formats remain varied, which hinders subsequent use.

The prior art is as follows: 1. A document normalization template containing data normalization processing methods is designed in advance, and the user calls the corresponding data normalization processing method as required. 2. A standardized template is fixed, and text content is captured and integrated into the corresponding positions of the template through semantic analysis and information extraction to generate the final document.

The prior art has the following disadvantages. For the normalization of document data content, it processes documents using a fixed standardized template, a normalization template for human-computer interactive processing, or a purely manual approach. Human-computer interactive and purely manual processing can only be completed by experienced operators, are slow and inefficient, and cannot avoid judgment errors caused by human negligence. A fixed standardized template mostly has a fixed schema structure, paragraph content and document format, and is not flexible enough. None of these approaches addresses normalization of the text format of geological data in multi-source Markdown format.

Disclosure of Invention

(I) Technical problem to be solved

In view of the above shortcomings of the prior art, the present invention provides a method and system for normalizing the text format of multi-source Markdown geological data. It solves the problems that normalizing the text format of geological data in Markdown format can only be completed by experienced operators, is slow and inefficient, and cannot avoid judgment errors caused by human negligence.

(II) Technical solution

To achieve the above purpose, the main technical solution adopted by the invention is as follows:

In a first aspect, an embodiment of the present invention provides a method for normalizing the text format of multi-source Markdown geological data, comprising: