Multi-source Markdown geological data text format standardization method and system

Document No.: 749415 Publication date: 2021-04-23

Reading note: This technology, "Multi-source Markdown geological data text format standardization method and system", was designed by 邓吉秋, 夏晨晨, 刘文毅, 雷玉娇, 何美香 and 路馥毓 on 2021-01-08. Main content: The invention relates to a method and a system for standardizing the text format of multi-source Markdown geological data. The method comprises: S1, judging, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion, and obtaining a judgment result, wherein the text cleaning judgment rule comprises a first-level rule, which is a priority rule specifying the order in which the cleaning judgment criteria are evaluated, and a second-level rule, which is the cleaning judgment criterion used to judge whether the text meets that criterion; S2, performing normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method, to obtain a normalized text, wherein the normalization processing method corresponds to the text cleaning judgment rule. The invention solves the problems that normalizing the text format of Markdown-format geological data could previously be done only by experienced operators, was slow and inefficient, and could not avoid judgment errors caused by human negligence.

1. A multi-source Markdown geological data text format standardization method, characterized by comprising the following steps:

S1, judging, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion, and obtaining a judgment result;

wherein the text is a multi-source Markdown geological data text;

the text cleaning judgment rule comprises:

a first-level rule, which is a priority rule specifying the order in which the cleaning judgment criteria are evaluated;

a second-level rule, which is the cleaning judgment criterion used to judge whether the text meets that criterion;

S2, performing normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method, to obtain a normalized text;

wherein the normalization processing method corresponds to the text cleaning judgment rule.

2. The method according to claim 1, wherein the step S1 includes:

judging each line of the text level by level, in the priority order specified by the text cleaning judgment rule, and obtaining the judgment result for that line.

3. The method according to claim 2, wherein the step S2 includes:

if the judgment result shows that the line text meets the corresponding cleaning judgment criterion, applying the preset normalization processing method corresponding to that criterion to obtain the normalized text for the line.

4. The method of claim 3, wherein

the priorities include: a first-level discrimination rule, a second-level discrimination rule, a third-level discrimination rule, a fourth-level discrimination rule, a fifth-level discrimination rule and a sixth-level discrimination rule.

5. The method of claim 4, wherein

the priorities are judged in the following order: the first-level discrimination rule, the second-level discrimination rule, the third-level discrimination rule, the fourth-level discrimination rule, the fifth-level discrimination rule and the sixth-level discrimination rule.

6. The method of claim 5, wherein

the cleaning judgment criterion comprises: a preset multi-condition criterion, a preset regular expression criterion and a preset method identifier criterion;

the preset multi-condition criterion includes:

a preset AND rule, a preset OR rule and a preset NOT rule;

the judging order of the priority corresponding to the preset AND rule precedes the judging order of the priority corresponding to the preset OR rule;

the judging order of the priority corresponding to the preset OR rule precedes the judging order of the priority corresponding to the preset NOT rule;

the preset AND rule means that the line text must simultaneously satisfy the regular expressions, or other preset regular expressions, on both sides of the AND rule;

the preset OR rule means that the line text only needs to satisfy one of the regular expressions, or other preset regular expressions, on either side of the OR rule;

the preset NOT rule means that the line text does not satisfy the regular expression, or other preset regular expression, that follows the NOT rule;

the preset regular expression criterion means that the line text satisfies a regular expression;

the preset method identifier criterion means that the line text satisfies a predefined method.

7. The method of claim 6, wherein

the normalization processing method further corresponds to a preset description of the cleaning judgment criterion, the priority of the cleaning judgment criterion, the cleaning judgment rule, the original characters required to call the normalization processing method, the replacement characters required to call the normalization processing method, the antecedent of the incoming value required to call the normalization processing method and the consequent of the incoming value required to call the normalization processing method.

8. A multi-source Markdown geological data text format standardization system, characterized by comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the multi-source Markdown geological data text format standardization method as claimed in any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of document data content standardization processing, in particular to a method and a system for standardizing text formats of multi-source Markdown geological data.

Background

With the development of information technology, digitized geological data increasingly exhibit a wide variety of formats and a large storage footprint, and storing and using the data has become an urgent problem. The data format is the key to whether geological data can be preserved over the long term. Such data exist in formats such as Word, PDF, TXT, HTML, CAJ, PPT, MapGIS and MapInfo, and the key to their continued use is that the text content be complete, highly recognizable and standardized.

Given the need to store and use geological data, normalized geological data has become a practical requirement. Normalization covers both document content and document format, but in most cases only the former is addressed. Experienced professionals can normalize a format manually according to its characteristics. However, manual processing is time-consuming and laborious, cannot rule out omissions and errors, and the diversity of formats makes subsequent use inconvenient.

The prior art is as follows: 1. A document normalization template containing data normalization processing methods is designed in advance, and the user invokes the corresponding method as needed. 2. A normalization template is fixed, and text content is captured and integrated into the corresponding positions of the template through semantic analysis and information extraction to generate the final document.

The prior art has the following disadvantages. For the normalization of document content, it relies on a fixed normalization template, on a normalization template driven by human-computer interaction, or on purely manual processing. Human-computer interaction and purely manual processing can be carried out only by experienced operators, so they are slow and inefficient and cannot avoid judgment errors caused by human negligence. Fixed normalization templates mostly prescribe the outline structure, paragraph content and document format, and are not flexible enough. None of these approaches addresses the normalization of the text format of multi-source Markdown-format geological data.

Disclosure of Invention

(I) Technical problem to be solved

In view of the above disadvantages of the prior art, the present invention provides a method and a system for normalizing the text format of multi-source Markdown geological data. The invention solves the problems that normalizing the text format of Markdown-format geological data could be done only by experienced operators, was slow and inefficient, and could not avoid judgment errors caused by human negligence.

(II) Technical solution

To achieve the above purpose, the invention adopts the following main technical solution:

In a first aspect, an embodiment of the present invention provides a method for normalizing the text format of multi-source Markdown geological data, comprising:

S1, judging, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion, and obtaining a judgment result;

wherein the text is a multi-source Markdown geological data text;

the text cleaning judgment rule comprises:

a first-level rule, which is a priority rule specifying the order in which the cleaning judgment criteria are evaluated;

a second-level rule, which is the cleaning judgment criterion used to judge whether the text meets that criterion;

S2, performing normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method, to obtain a normalized text;

wherein the normalization processing method corresponds to the text cleaning judgment rule.

Preferably, the step S1 includes:

judging each line of the text level by level, in the priority order specified by the text cleaning judgment rule, and obtaining the judgment result for that line.

Preferably, the step S2 includes:

if the judgment result shows that the line text meets the corresponding cleaning judgment criterion, applying the preset normalization processing method corresponding to that criterion to obtain the normalized text for the line.

Preferably,

the priorities include: a first-level discrimination rule, a second-level discrimination rule, a third-level discrimination rule, a fourth-level discrimination rule, a fifth-level discrimination rule and a sixth-level discrimination rule.

Preferably,

the priorities are judged in the following order: the first-level discrimination rule, the second-level discrimination rule, the third-level discrimination rule, the fourth-level discrimination rule, the fifth-level discrimination rule and the sixth-level discrimination rule.

Preferably,

the cleaning judgment criterion comprises: a preset multi-condition criterion, a preset regular expression criterion and a preset method identifier criterion;

the preset multi-condition criterion includes:

a preset AND rule, a preset OR rule and a preset NOT rule;

the judging order of the priority corresponding to the preset AND rule precedes the judging order of the priority corresponding to the preset OR rule;

the judging order of the priority corresponding to the preset OR rule precedes the judging order of the priority corresponding to the preset NOT rule;

the preset AND rule means that the line text must simultaneously satisfy the regular expressions, or other preset regular expressions, on both sides of the AND rule;

the preset OR rule means that the line text only needs to satisfy one of the regular expressions, or other preset regular expressions, on either side of the OR rule;

the preset NOT rule means that the line text does not satisfy the regular expression, or other preset regular expression, that follows the NOT rule;

the preset regular expression criterion means that the line text satisfies a regular expression;

the preset method identifier criterion means that the line text satisfies a predefined method.

Preferably,

the normalization processing method further corresponds to a preset description of the cleaning judgment criterion, the priority of the cleaning judgment criterion, the cleaning judgment rule, the original characters required to call the normalization processing method, the replacement characters required to call the normalization processing method, the antecedent of the incoming value required to call the normalization processing method and the consequent of the incoming value required to call the normalization processing method.

In a second aspect, an embodiment of the present invention provides a system for normalizing the text format of multi-source Markdown geological data, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method for normalizing the text format of multi-source Markdown geological data described in any of the above.

(III) Beneficial effects

The beneficial effects of the invention are as follows. The invention judges, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion and obtains a judgment result; it then performs normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method to obtain a normalized text. Compared with the prior art, it normalizes Markdown-format geological data automatically, improves working efficiency and data accuracy, and is general and extensible.

Drawings

FIG. 1 is a flow chart of the text format normalization method for multi-source Markdown geological data according to the invention;

FIG. 2 is a schematic diagram of the text format normalization method for multi-source Markdown geological data in the second embodiment of the invention.

Detailed Description

To better explain the present invention and facilitate understanding, exemplary embodiments of the invention are described in more detail below with reference to the accompanying drawings. The invention may be embodied in various forms and is not limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Example one

Referring to FIG. 1, this embodiment provides a method for normalizing the text format of multi-source Markdown geological data, which includes:

S1, judging, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion, and obtaining a judgment result.

The text is a multi-source Markdown geological data text.

The text cleaning judgment rule comprises:

A first-level rule, which is a priority rule specifying the order in which the cleaning judgment criteria are evaluated.

A second-level rule, which is the cleaning judgment criterion used to judge whether the text meets that criterion.

S2, performing normalization processing according to the judgment result, the preset text cleaning judgment rule and a normalization processing method, to obtain a normalized text.

The normalization processing method corresponds to the text cleaning judgment rule.

Preferably, the step S1 includes:

judging each line of the text level by level, in the priority order specified by the text cleaning judgment rule, and obtaining the judgment result for that line.

Preferably, in this embodiment, the step S2 includes:

if the judgment result shows that the line text meets the corresponding cleaning judgment criterion, applying the preset normalization processing method corresponding to that criterion to obtain the normalized text for the line.

Preferably, in this embodiment, the priorities include: a first-level discrimination rule, a second-level discrimination rule, a third-level discrimination rule, a fourth-level discrimination rule, a fifth-level discrimination rule and a sixth-level discrimination rule.

Preferably, in this embodiment, the priorities are judged in the following order: the first-level discrimination rule, the second-level discrimination rule, the third-level discrimination rule, the fourth-level discrimination rule, the fifth-level discrimination rule and the sixth-level discrimination rule.

The cleaning judgment criterion comprises: a preset multi-condition criterion, a preset regular expression criterion and a preset method identifier criterion.

The preset multi-condition criterion includes:

a preset AND rule, a preset OR rule and a preset NOT rule.

The judging order of the priority corresponding to the preset AND rule precedes the judging order of the priority corresponding to the preset OR rule.

The judging order of the priority corresponding to the preset OR rule precedes the judging order of the priority corresponding to the preset NOT rule.

The preset AND rule means that the line text must simultaneously satisfy the regular expressions, or other preset regular expressions, on both sides of the AND rule.

The preset OR rule means that the line text only needs to satisfy one of the regular expressions, or other preset regular expressions, on either side of the OR rule.

The preset NOT rule means that the line text does not satisfy the regular expression, or other preset regular expression, that follows the NOT rule.

The preset regular expression criterion means that the line text satisfies a regular expression.

The preset method identifier criterion means that the line text satisfies a predefined method.
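The AND, OR and NOT semantics above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the symbols `&&`, `||` and the keyword `NOT` are taken from the second embodiment, the function name `matches` is hypothetical, and the sample patterns are invented for illustration. It also assumes that the regular expressions themselves never contain `&&` or `||`.

```python
import re

def matches(criterion: str, context: str) -> bool:
    """Evaluate a multi-condition criterion against one line of text.

    Assumed grammar: parts joined by '&&' must all hold; within a part,
    alternatives joined by '||' need only one to hold; a leading 'NOT '
    inverts a single regular expression.
    """
    def atom(expr: str) -> bool:
        expr = expr.strip()
        if expr.startswith("NOT "):
            return re.search(expr[4:].strip(), context) is None
        return re.search(expr, context) is not None

    return all(
        any(atom(alternative) for alternative in part.split("||"))
        for part in criterion.split("&&")
    )

# A table row that is not a separator row (patterns are illustrative only).
print(matches(r"^\|&&NOT ---", "| a | b |"))
```

Splitting on `&&` first and on `||` second gives NOT the tightest binding and AND the loosest, which is one possible reading of the priority order of the control symbols.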

In this embodiment, the normalization processing method further corresponds, respectively, to a preset description of the cleaning judgment criterion, the priority of the cleaning judgment criterion, the cleaning judgment rule, the original characters required to call the normalization processing method, the replacement characters required to call it, the antecedent of the incoming value required to call it, and the consequent of the incoming value required to call it.

In the multi-source Markdown geological data text format standardization method of this embodiment, whether any line of the text meets the cleaning judgment criterion is judged according to the preset text cleaning judgment rule, and a judgment result is obtained; normalization processing is then performed according to the judgment result, the preset text cleaning judgment rule and the normalization processing method, to obtain a normalized text.

This embodiment also provides a multi-source Markdown geological data text format standardization system, which comprises:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method for normalizing the text format of multi-source Markdown geological data described in any of the above.

Example two

Referring to FIG. 1 and FIG. 2, this embodiment provides a multi-source Markdown geological data normalization processing method, which includes the following steps:

S1, judging, according to a preset text cleaning judgment rule, whether any line of the text meets a cleaning judgment criterion, and obtaining a judgment result. The text is a multi-source Markdown geological data text.

In practical application of step S1 in this embodiment, a full-text traversal is performed first: starting from the highest priority of the format cleaning rules, the full text is traversed level by level according to the rules. Let the rule set be rules, the text be texts, the current cleaning rule priority be curPri, the rule currently being traversed be rule, the current line text be context, and the line index be contextIndex, of integer type. In the level-by-level traversal, starting from the current priority curPri, the line texts context are processed in the order of the line index contextIndex using the rules whose priority is curPri. One rule of priority curPri is applied at a time; once all rules of priority curPri have been traversed, the processing of the line text context under the current priority curPri is complete. The traversal then moves to the next line of text and continues with all rules rule of the current priority curPri until every line of the text texts has been traversed, which completes the processing for priority curPri. It then moves to the next priority, until all priorities and rules rule have been traversed once.
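The traversal order described above amounts to three nested loops: priorities, then lines, then the rules of the current priority. The following Python sketch shows only that ordering. The `Rule` record, its `apply` callback and the two sample rules are assumptions made for illustration; the names `rules`, `texts`, `curPri`, `context` and `contextIndex` follow the text.

```python
from typing import Callable, List, NamedTuple

class Rule(NamedTuple):
    # Hypothetical minimal rule record: a priority and a line transformer.
    priority: int
    apply: Callable[[str], str]

def clean(texts: List[str], rules: List[Rule]) -> List[str]:
    """Process every line under all rules of one priority before moving on."""
    texts = list(texts)  # work on a copy
    for curPri in sorted({r.priority for r in rules}):   # priority by priority
        current = [r for r in rules if r.priority == curPri]
        for contextIndex, context in enumerate(texts):    # line by line
            for rule in current:                           # every rule of curPri
                context = rule.apply(context)
            texts[contextIndex] = context
    return texts

rules = [
    Rule(1, lambda s: s.replace("  ", " ")),  # assumed rule: collapse double spaces
    Rule(0, lambda s: s.strip()),             # assumed rule: trim, runs first
]
print(clean(["  # Title  ", "a  b"], rules))
```

Because a whole pass over the text is completed for one priority before the next begins, a lower-priority rule always sees lines that higher-priority rules have already normalized.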

In practical application of step S1 in this embodiment, the rules are then defined:

A text cleaning judgment rule table is established, which has the following characteristics:

It covers the situations in which Markdown-format geological data may need to be normalized.

Two levels of rules are defined for each situation:

The first-level rule is the cleaning judgment priority rule, which mainly defines the order in which rules are judged. Priority has levels 0 to 5: level 0 is judged first, then level 1, and so on.

The second-level rule is the cleaning judgment association criterion, which mainly judges the text content, that is, whether the text meets the cleaning judgment rule. The association criterion is defined using multi-condition criteria (AND rule &&, OR rule ||, NOT rule NOT), regular expressions and method identifiers (/%name%/); the association criteria and their syntax are shown in Table 1:

Table 1. Basic association criteria definition table

Here, name in /%name%/ is the name of the method. Among the multi-condition control symbols (NOT rule NOT, OR rule ||, AND rule &&), NOT has the highest priority, OR comes next, and AND has the lowest.

The association criteria can be combined to form a cleaning judgment criterion.

For example, tablenaamescomplete&&/%nexttwoline%/NOT tableContext indicates that the line text satisfies the tablenaamescomplete criterion and that the result returned by the nexttwoline method does not satisfy the tableContext criterion.
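As an illustration of how such a combined criterion could be taken apart, the sketch below splits it on the AND symbol and extracts the method identifier and the NOT keyword from each part. The exact spelling of the syntax (`&&`, `/%name%/`, `NOT` followed by a space) is an assumption, and `Term`, `parse_term` and `parse_rule` are hypothetical names that do not appear in the source.

```python
import re
from typing import List, NamedTuple, Optional

class Term(NamedTuple):
    method: Optional[str]  # method name taken from /%name%/, or None
    negated: bool          # True when the NOT keyword is present
    criterion: str         # remaining criterion name or regular expression

# Assumed spelling of the method identifier: /%name%/
METHOD = re.compile(r"/%\s*(\w+)\s*%/")

def parse_term(text: str) -> Term:
    """Split one sub-rule into method identifier, NOT flag and criterion."""
    found = METHOD.search(text)
    method = found.group(1) if found else None
    rest = METHOD.sub("", text).strip()
    negated = rest.startswith("NOT ")
    if negated:
        rest = rest[4:].strip()
    return Term(method, negated, rest)

def parse_rule(rule: str) -> List[Term]:
    """Split a combined criterion on the AND symbol and parse each part."""
    return [parse_term(part) for part in rule.split("&&")]

terms = parse_rule("tablenaamescomplete&&/%nexttwoline%/NOT tableContext")
for term in terms:
    print(term)
```

The second term records that the nexttwoline method should be applied and that its result is negated against tableContext, which matches the reading given in the example.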

The regular expressions are defined in Table 2:

Table 2. Regular expression table

The cleaning judgment rules and normalization processing are shown in Table 3. In the table header, Description is the description of the cleaning judgment criterion, Priority is the priority of the cleaning judgment criterion, Rules is the cleaning judgment rule, Processing is the normalization processing method, oriChar is the original characters, repChar is the replacement characters, and antecedent and consequent are the incoming values needed to call the normalization processing method.

Table 3. Cleaning judgment rules and normalization processing table
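One row of Table 3 could be represented as a small record. The sketch below is an assumption-laden illustration: the field names mirror the header names given above, but the `replace` processing method, the `process` function and the sample row are hypothetical, since the contents of Tables 3 and 4 are not reproduced in this text.

```python
import re
from dataclasses import dataclass

@dataclass
class CleaningRule:
    description: str      # Description
    priority: int         # Priority, level 0 (judged first) to 5
    rules: str            # Rules: the cleaning judgment criterion
    processing: str       # Processing: name of the normalization method
    ori_char: str = ""    # oriChar: original characters
    rep_char: str = ""    # repChar: replacement characters
    antecedent: str = ""  # incoming value for the processing method
    consequent: str = ""  # incoming value for the processing method

def process(row: CleaningRule, context: str) -> str:
    """Apply one row to one line; only a hypothetical 'replace' method is shown."""
    if re.search(row.rules, context) is None:
        return context  # the line does not meet the criterion, leave it alone
    if row.processing == "replace":
        return context.replace(row.ori_char, row.rep_char)
    raise ValueError(f"unknown processing method: {row.processing}")

# Invented sample row: replace tabs inside table rows with single spaces.
row = CleaningRule("tab inside a table row", 0, r"^\|.*\t", "replace", "\t", " ")
print(process(row, "|a\tb|"))
```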

The normalization processing methods are described in Table 4:

Table 4. Normalization processing method description table

The rule table is stored in an Excel file.

All criterion judgments and processing for Markdown-format geological data must be defined in advance, following the rule description conventions, in the cleaning judgment rules and normalization processing table (Table 3). The regular expressions used for cleaning judgment can be defined in the regular expression table (Table 2) or directly in Table 3.

In practical application of step S1 in this embodiment, the format cleaning judgment process for Markdown-format geological data is as follows:

Traverse the rule set rules to obtain the judgment priority set PRIORITYLIST, and traverse the text texts level by level according to rule priority. With the current priority set to curPri, traverse texts starting from its first line: take the current non-empty line text context and its line index contextIndex, traverse all rules rule of priority curPri, and go to step (3-1) to judge whether the line text context matches the rule rule, that is, whether it needs to be normalized. If so, go to step (4-1) to obtain the normalized line text; if not, the line text is already normalized with respect to this rule, and the judgment moves directly to the next rule.

(3-1) Determine whether the AND rule "&&" or the OR rule "||" appears in the rule rule: if so, go to step (3-2) to judge whether the text needs processing and obtain the return value isCleaning; otherwise, go to step (3-3) to judge whether the text needs processing and obtain the return value isCleaning.

(3-2) Obtain the rule rule and the line text context, split rule using the AND rule "&&" or the OR rule "||" as the delimiter, store the parts in the list RuleList, and traverse RuleList: set the current rule to midrule, go to step (3-4) to judge whether the line text context needs processing, and obtain the judgment result isCleaning. If isCleaning equals "True", go to step (4-1) to normalize the line text context, obtain the normalized line text context and return it; otherwise, move on to the judgment of the next rule.

(3-3) Obtain the rule rule and the line text context, assign rule to midrule, go to step (3-4) to judge whether the line text context needs processing, and obtain the judgment result isCleaning. If isCleaning equals "True", go to step (4-1) to normalize the line text context, obtain the normalized line text context and return it; otherwise, move on to the judgment of the next rule.

(3-4) Obtain the rule midrule and the line text context. Go to step (3-5) to determine whether midrule contains a method keyword, obtaining the method judgment result isMethod and the method name; also determine whether midrule contains the NOT keyword, giving the NOT judgment result isNot. If isMethod equals "True", go to step (3-6) to check whether the rule is the name of a regular expression, obtain the corresponding regular expression and assign it to midrule, then go to step (3-7) to run the method and judge whether the line text context needs processing, obtaining the judgment result isCleaning. If isMethod equals "False", determine whether midrule contains the length method Len, giving the length judgment result isLen: if isLen equals "True", go to step (3-8) to run the Len method and judge whether the line text context needs processing, obtaining isCleaning; if isLen equals "False", go to step (3-6) to check whether the rule is the name of a regular expression, obtain the corresponding regular expression and assign it to midrule, then go to step (3-9) to judge whether the line text context needs processing, obtaining isCleaning. Return the judgment result isCleaning.

(3-5) Obtain the rule midrule and determine whether the method identifier "/%name%/" matches midrule: if it does not match, the method judgment result isMethod equals "False". If the start rule sRule is not empty, set the start mark s_flag to sRule and the end mark e_flag to the end rule eRule, and return s_flag and e_flag; otherwise, set midrule to the end rule eRule and return midrule.

(3-6) Obtain the rule midrule, read the regular expression table (Table 2) and build the regular expression dictionary general Recect. Determine whether midrule is in the dictionary; if so, look up the regular expression corresponding to midrule and assign it to midrule. Return midrule.
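Step (3-6) is essentially a dictionary lookup with a fallback. A minimal Python sketch follows, assuming a hypothetical in-memory stand-in for Table 2; the entries in `REGEX_TABLE` and the function name `resolve` are invented for illustration and are not the patterns defined in the source table.

```python
# Hypothetical stand-in for Table 2: criterion name -> regular expression.
REGEX_TABLE = {
    "tableContext": r"^\|.*\|$",
    "heading": r"^#{1,6}\s",
}

def resolve(midrule: str) -> str:
    """Return the named regular expression, or midrule itself if it is not a name."""
    return REGEX_TABLE.get(midrule, midrule)

print(resolve("tableContext"))
print(resolve(r"^\d+\."))
```

Falling back to the input lets a rule carry either a name from the regular expression table or a regular expression written directly in the rule, as the text allows.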

(3-7) Obtain the method name functionName and the line text context, and call the corresponding method according to functionName (the methods are (3-10) the next-line method nextLine, (3-11) the next-two-lines method nexttwoline, and (3-12) the until method until) to judge whether the line text context needs processing; obtain and return the judgment result isCleaning.

(3-8) Obtain the rule midrule and the line text context, and determine whether the keyword ">" or "<" is in midrule: if not, return midrule. If so, take the keyword ">" or "<" as the middle character, obtain the number num that follows it, and take the string between "(" and ")" that precedes it and assign that string to midrule. Then judge whether the number of matches of midrule in the line text context is greater than (">") or less than ("<") num: if so, the judgment result isCleaning equals "True"; otherwise, isCleaning equals "False". Return the judgment result isCleaning.
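The length check in step (3-8) can be sketched as follows. The concrete spelling `Len(pattern)>num` is an assumption: the text only says that the rule contains a Len method, a ">" or "<" keyword, a number after the keyword and a pattern between parentheses before it. The function name `len_check` is hypothetical, and unlike the step above the sketch raises an error rather than returning the rule when the syntax does not match.

```python
import re

# Assumed rule syntax: Len(<regular expression>) followed by > or < and a number.
LEN_RULE = re.compile(r"^Len\((?P<pattern>.*)\)\s*(?P<op>[<>])\s*(?P<num>\d+)$")

def len_check(midrule: str, context: str) -> bool:
    """Compare the number of pattern matches in the line with the given number."""
    parsed = LEN_RULE.match(midrule.strip())
    if parsed is None:
        raise ValueError("not a Len rule")
    count = len(re.findall(parsed["pattern"], context))
    num = int(parsed["num"])
    return count > num if parsed["op"] == ">" else count < num

# More than three pipe characters suggests a table row with several cells.
print(len_check(r"Len(\|)>3", "| a | b | c |"))
```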

(3-9) Obtain the rule midrule, the NOT judgment result isNot and the line text context. If isNot equals "True": if the line text context matches midrule, the judgment result isCleaning equals "False"; if it does not match, isCleaning equals "True". If isNot equals "False": if the line text context matches midrule, the judgment result isCleaning equals "True"; if it does not match, isCleaning equals "False". Return the judgment result isCleaning.

(3-10) next line method nextLine: acquiring a line index contextIndex, a rule middle, a line text context and a text texts, wherein the data type is integer: if the line index contextIndex +1 does not exceed the line length of the text texts, acquiring the next line of the line text context as the next line text next corresponding to the next line index nextIndex, entering the step (3-6) to judge whether the rule is the name of a regular expression, acquiring the rule midule corresponding to the regular expression and assigning the rule midule to the regular expression, entering the step (3-9) to judge whether the line text context needs to be processed, acquiring a judgment result isclearing, and returning the judgment result isclearing; otherwise, the line index contextIndex +1 exceeds the line length of the text texts, and a judgment result iscCleaning is returned to be equal to 'False'; .

(3-11) Next-two-lines method nextTwoLine: obtain the line index contextIndex (of integer data type), the rule midule, the line text context and the text texts. If the line index contextIndex + 2 does not exceed the number of lines of the text texts, obtain the line two lines below the line text context as the next-two-lines text newCon, with the corresponding index newConIndex; go to step (3-6) to judge whether the rule is the name of a regular expression, obtain the regular expression corresponding to the rule midule and assign it to the rule midule; then go to step (3-9) to judge whether the line text needs to be processed, obtain the judgment result isCleaning, and return the judgment result isCleaning. Otherwise, the line index contextIndex + 2 exceeds the number of lines of the text texts, and the judgment result isCleaning equal to "False" is returned.
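Steps (3-10) and (3-11) differ only in the look-ahead distance, so both can be illustrated with one helper. The sketch below assumes that the rule is tested against the line that was looked up, that the rule is already a regular expression, and that booleans stand in for "True"/"False"; the name look_ahead and its parameters are hypothetical.

```python
import re


def look_ahead(context_index, midule, texts, offset=1, is_not=False):
    """Test the line `offset` lines below context_index against the rule.

    offset=1 corresponds to nextLine, offset=2 to the two-line variant.
    Returns False when the target index lies beyond the end of the text.
    """
    target = context_index + offset
    if target >= len(texts):
        return False
    matched = re.search(midule, texts[target]) is not None
    return (not matched) if is_not else matched
```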

(3-12) Until method until: obtain the start mark s_flag, the line text context and the text texts; set the rule midule equal to the start mark s_flag; judge whether the rule midule contains the non-rule NOT and obtain the non-rule judgment result isNot; go to step (3-6) to judge whether the rule is the name of a regular expression, obtain the regular expression corresponding to the rule midule and assign it to the rule midule; then go to step (3-9) to judge whether the line text context conforms to the rule midule: if it does, go to step (3-13) to obtain the end position endIndex, and return the judgment result isCleaning and the end position endIndex; if it does not, return the judgment result isCleaning equal to "False".

(3-13) Obtain the end mark e_flag, the line text context and the text texts, assign the end mark e_flag to the rule midule, and traverse the text texts starting from the line after the line where the line text context is located: go to step (3-4) to judge whether the traversed line text conforms to the rule midule; if so, return the judgment result isCleaning equal to "True"; if not, continue traversing until a line text conforms to the rule midule or the text texts ends.
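Steps (3-12) and (3-13) together locate a block that opens on the current line and closes on a later one. The following sketch makes two assumptions that the text leaves open: both marks are plain regular expressions, and reaching the end of the text without finding the end mark yields (False, None). The function name until_block is hypothetical.

```python
import re


def until_block(context_index, s_flag, e_flag, texts):
    """Return (isCleaning, endIndex) for a block starting at context_index."""
    # Step (3-12): the current line must match the start mark.
    if re.search(s_flag, texts[context_index]) is None:
        return False, None
    # Step (3-13): scan from the next line for the end mark.
    for end_index in range(context_index + 1, len(texts)):
        if re.search(e_flag, texts[end_index]) is not None:
            return True, end_index
    return False, None
```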

S2: Perform normalization according to the judgment result, the preset text cleaning judgment rules and the normalization processing methods, and obtain the normalized text.

In a practical application of this embodiment, the normalization of Markdown-format geological data specifically includes:

(4-1) Obtain the cleaning judgment rule, look up the correspondence between cleaning judgment rules and normalization processing methods in Table 3 to obtain the method name proName corresponding to the cleaning judgment rule, and call the method corresponding to the method name proName to normalize the line text context or the line text list. The normalization methods comprise: (4-2) the processing method MarkdownIncomplete for missing titles at each level; (4-3) the method htmltomd for converting an html table into a Markdown table; (4-4) the method multomd for converting a multi-line table into a Markdown table; (4-5) the method sintomd for converting a single-line table into a Markdown table; (4-6) the symbol replacement method symbol_suppression; (4-7) the method delnextline for deleting the next line; (4-8) the method merge for merging two lines; (4-9) the bolding method overtrik; (4-10) the method mulPhoto for processing multi-picture hyperlinks; (4-11) the method tablenameexchange for correcting a table name position error; (4-12) the method for dividing one line into two lines; (4-13) the method exchangeOrder for changing the order of content within a line; (4-14) the geological chronology table processing method choro; (4-15) the geological section introduction table normalization method choroIntro; (4-16) the special geological chronology table processing method choroSpecial; and (4-17) the title normalization method titleSymbol. The specific normalization process is as follows:

(4-2) Processing method MarkdownIncomplete for missing titles at each level: obtain the text texts, set the highest title level TiMaxDE equal to 10, and traverse the text texts for each title level TiDe starting from 1. For each line text context with line index contextIndex: if the line text conforms to the title characteristics and carries TiDe "#" marks, extract the title serial number orderPart and the characters on either side of it as the front feature starPart and the rear feature endPart (for example, for "# 第一章" (Chapter One), the front feature starPart is "第", the title serial number orderPart is "一" and the rear feature endPart is "章"), and obtain the full serial number list numberType for the serial number type to which the title serial number orderPart belongs. If the title serial number orderPart is equal to the first element of the serial number list numberType, set the serial number index tyIndex equal to 1 and the traversal start line index starIndex equal to the line index contextIndex + 1, and go to step (4-2-1) to supply the missing titles of this level. If the title serial number orderPart is not equal to the first element of the serial number list numberType, set the serial number index tyIndex equal to 0 and the traversal start line index starIndex equal to 0, and go to step (4-2-1) to check and supply the missing titles of this level. After the titles of all levels have been checked and supplemented, the normalized text texts is obtained and returned.

(4-2-1) Obtain the serial number index tyIndex, the traversal start line index starIndex, the text texts, the serial number list numberType, the front feature starPart and the rear feature endPart, and traverse the text texts starting from the traversal start line index starIndex. When the current line text seCon matches "front feature starPart + the element of the serial number list numberType at the serial number index tyIndex + rear feature endPart" and conforms to the title ending characteristics, increase the serial number index tyIndex by 1: if the line text seCon already carries TiDe "#" marks, go directly to the next line; otherwise assign the line text seCon the value of TiDe "#" marks + seCon, and go to the next line. Return the text texts when the text texts has been fully traversed or the serial number index tyIndex exceeds the serial number list numberType.
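The title completion of step (4-2-1) can be sketched as below. It is an illustration under stated assumptions rather than the method itself: the check for "title ending characteristics" is omitted, a space is placed between the "#" marks and the title text, and the function name complete_titles and the parameter level (the title level) are hypothetical.

```python
def complete_titles(texts, star_index, ty_index, number_type,
                    star_part, end_part, level=1):
    """Add missing '#' marks to titles that continue a numbering series."""
    marker = "#" * level
    for i in range(star_index, len(texts)):
        if ty_index >= len(number_type):
            break  # the serial number list is exhausted
        line = texts[i]
        bare = line.lstrip("#").lstrip()
        expected = star_part + number_type[ty_index] + end_part
        if not bare.startswith(expected):
            continue
        ty_index += 1
        hashes = len(line) - len(line.lstrip("#"))
        if hashes != level:
            # The title exists but lacks the marks for this level.
            texts[i] = marker + " " + bare
    return texts
```

With the series 第一章, 第二章, 第三章, a line "第二章 地质" that lost its mark becomes "# 第二章 地质", while an already marked "# 第三章 结论" is left untouched.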

(4-3) Method htmltomd for converting an html table into a Markdown table: obtain the html table list fileList to be converted into Markdown, and set up a result list dataList. Traverse the html table list fileList starting from the list index contentIndex equal to 0, and check whether the start of fileList contains a table name. If it does, extract the table name, give it the table name mark format, store it in the result list dataList, delete the corresponding line from fileList, and go to step (4-3-1) to convert fileList and obtain the result list resultList of the html-to-Markdown conversion. If fileList does not contain a table name, go directly to step (4-3-1) to convert the html table and obtain the result list resultList. Return the result list resultList.

(4-3-1) Obtain the table list fileList to be converted, convert it into a character string with "\n" appended to the end of each list element, and call the html2text package to convert that string into a character string htmlMarkdown in raw md format. Because the data and the conversion introduce deviations, the character string htmlMarkdown needs further processing. First, remove the unnatural breaks characterized by "\n\n" from htmlMarkdown. Then split htmlMarkdown using "|" as the separator to obtain the split result list htmlMdSplit, and obtain the number of table columns colum on the principle that the number of Markdown table "-" separator cells in htmlMdSplit equals the number of table columns. Next, convert htmlMarkdown into a list (one character per list element) and traverse it, taking "|" as the start or end mark of each cell and "\n" as the end of a row: if a "\n" appears before the row has enough cells, remove that "\n" and continue with the next data; if the row has enough cells, go on to the next row; repeat until all data has been traversed. Finally, write the data into a text file in txt format, read it back line by line to obtain the result list resultList, and return the result list resultList.
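The effect of step (4-3-1) can be illustrated without third-party packages. Note that this sketch swaps in Python's standard html.parser for the html2text package named in the text, and therefore does not reproduce the post-processing of html2text output; the class _TableParser and the function html_table_to_md are hypothetical names.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collect the cell texts of every <tr> row in an html table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append(" ".join(p for p in self._cell if p))
            self._cell = None


def html_table_to_md(file_list):
    """Convert the lines of one html table into Markdown table lines."""
    parser = _TableParser()
    parser.feed("".join(file_list))
    parser.close()
    if not parser.rows:
        return []
    columns = max(len(row) for row in parser.rows)
    result = []
    for i, row in enumerate(parser.rows):
        row = row + [""] * (columns - len(row))  # pad short rows
        result.append("| " + " | ".join(row) + " |\n")
        if i == 0:
            # Markdown needs a separator row after the header.
            result.append("|" + " --- |" * columns + "\n")
    return result
```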

(4-4) Method multomd for converting a multi-line table into a Markdown table: set up a result list resultList for storing the conversion result; obtain the multi-line table list to be converted, remove the "\n" at the end of each element of the list, and traverse the elements of the list to find the row start mark "^\+(-+\+)+$". Starting from the next line, take "|" as the cell separator and store the separated cells in their corresponding positions to obtain the data list resultItem; repeat the splitting and storing until the row end mark "^\+(-+\+)+$" is found; then convert the data list resultItem into a character string, with adjacent list elements separated by "|", and store it in the result list resultList. When the multi-line table list has been fully traversed, return the result list resultList.
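A minimal sketch of step (4-4), assuming the multi-line table is a grid table whose rows are delimited by border lines such as "+----+------+" and whose cells may wrap over several physical lines. The function name grid_table_to_md, the joining of wrapped cell text with a space and the inserted Markdown separator row are assumptions of the example.

```python
import re

BORDER = re.compile(r"^\+(-+\+)+$")  # row start/end mark of the grid table


def grid_table_to_md(line_list):
    """Convert a '+---+' bordered table into Markdown table rows."""
    result, cells = [], []
    for raw in line_list:
        line = raw.rstrip("\n").strip()
        if BORDER.match(line):
            if cells:  # a border closes the row collected so far
                result.append("| " + " | ".join(cells) + " |")
                if len(result) == 1:
                    result.append("|" + " --- |" * len(cells))
                cells = []
        elif line.startswith("|") and line.endswith("|"):
            parts = [p.strip() for p in line[1:-1].split("|")]
            if not cells:
                cells = parts
            else:
                # Continuation line: append text to the matching cell.
                cells = [(a + " " + b).strip() for a, b in zip(cells, parts)]
    return result
```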

(4-5) Method sintomd for converting a single-line table into a Markdown table: set up a result list resultList for storing the conversion result; obtain the single-line table list to be converted, remove the "\n" at the end of each element of the list, and traverse the elements of the list. If the current row data dataItem is the start position of the single-line table, split it on the single-line table mark to obtain the result list handList. For other data: if the row data dataItem conforms to the table name characteristics, give it the table name mark and store it in the result list resultList; if it is table data, then, where the row data contains Chinese characters, add a placeholder after each Chinese character in dataItem, obtain the start and end positions of each cell from the lengths of the elements of the result list handList, split dataItem into cells accordingly, convert it into a character string with adjacent list elements separated by "|", and store it in the result list resultList. Continue traversing until the end mark of the single-line table, and return the result list resultList.

(4-6) Symbol replacement method symbol_suppression: obtain the line text context to be processed, the original rule oriChar and the replacement rule repChar; match the original character string oChar in the line text context that conforms to the original rule oriChar, obtain the replacement character string rChar that conforms to the replacement rule repChar, replace the original character string oChar in the line text context with the replacement character string rChar, and return the line text context.

(4-7) Method delnextline for deleting the next line: obtain the index contextIndex of the line to be deleted and the text texts, delete the line of the text texts at the index contextIndex, and return the text texts.

(4-8) Method merge for merging two lines: obtain the front index contextIndex and the rear index nextIndex of the lines to be merged, and the text texts; merge the line texts of the text texts from the front index contextIndex to the rear index nextIndex into one line, delete the original lines between the minimum and maximum index, and return the text texts.

(4-9) Bolding method overtrik: obtain the line text context to be bolded and the start marker and end marker used for bolding; assign the line text context the value of start marker + line text context (without its final "\n") + end marker + "\n", and return the line text context.
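The four line-level operations of steps (4-6) to (4-9) are small enough to sketch together. The snake_case names are hypothetical renderings of the method names, the replacement rule is treated as a plain replacement string, the functions return new lists instead of modifying texts in place, and the default bold markers "**" are an assumption (ordinary Markdown emphasis) since the text does not fix them.

```python
import re


def symbol_replace(context, ori_char, rep_char):
    """(4-6) Replace every match of the original rule in one line."""
    return re.sub(ori_char, rep_char, context)


def del_line(context_index, texts):
    """(4-7) Return the text without the line at context_index."""
    return texts[:context_index] + texts[context_index + 1:]


def merge_lines(context_index, next_index, texts):
    """(4-8) Join the lines from the lower to the higher index into one."""
    lo, hi = sorted((context_index, next_index))
    merged = "".join(t.rstrip("\n") for t in texts[lo:hi + 1]) + "\n"
    return texts[:lo] + [merged] + texts[hi + 1:]


def bold(context, start="**", end="**"):
    """(4-9) Wrap one line in bold markers, keeping the final newline."""
    return start + context.rstrip("\n") + end + "\n"
```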

(4-10) Multi-picture hyperlink processing method mulPhoto: obtain the line text context containing multiple picture names to be processed, extract a picture name list and a picture hyperlink list according to the characteristics of picture hyperlinks and picture names, sort each list in ascending order, and pair the picture names with the picture hyperlinks in corresponding order; if there are fewer picture names than picture hyperlinks, pad the picture names with '' to obtain the picture name list nameList, and return the picture name list nameList.

(4-11) Method tablenameexchange for correcting a table name position error: obtain the table name line text context whose position is to be changed, its line index contextIndex and the text texts; traverse the text texts forwards from the current line index contextIndex, find the table closest to the line text, insert the line text at that position, delete the line text from its original position, and return the text texts.
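Step (4-11) leaves the search direction and the insertion point open to interpretation. The sketch below assumes one reading: the misplaced table name sits below its table, the search runs towards earlier lines, a table is a run of lines starting with "|", and the name is moved to just above the table's first row. The function name table_name_exchange is hypothetical.

```python
def table_name_exchange(context_index, texts):
    """Move a table name line to just above the nearest preceding table."""
    name = texts[context_index]
    i = context_index - 1
    # Walk up to the last row of the nearest table above the name.
    while i >= 0 and not texts[i].lstrip().startswith("|"):
        i -= 1
    if i < 0:
        return list(texts)  # no table found: leave the text unchanged
    # Walk up to the first row of that table.
    while i > 0 and texts[i - 1].lstrip().startswith("|"):
        i -= 1
    result = texts[:context_index] + texts[context_index + 1:]
    result.insert(i, name)
    return result
```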

(4-12) Method for dividing the content of one line into two lines: obtain the line text context to be divided, the front-item rule, the inverse rule and the back-item rule; extract the front item and the back item in the line text context that match the front-item rule, the inverse rule and the back-item rule as adent and sedent, generate the division result list resultList from them, and return the division result list resultList.

(4-13) Method exchangeOrder for changing the order of content within a line: obtain the line text context whose content order is to be changed, the front-item rule, the inverse rule and the back-item rule; extract the front item and the back item in the line text context that match the front-item rule, the inverse rule and the back-item rule as adent and sedent; assign the line text context the value of adent + sedent + "\n", and return the line text context.
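Steps (4-12) and (4-13) share the extraction of a front item and a back item and differ only in how the two are emitted. The sketch below is a simplification: it uses only a front-item and a back-item regular expression and leaves out the inverse rule, whose role the text does not spell out. The function names and the sample rules are hypothetical.

```python
import re


def _extract(context, front_rule, back_rule):
    """Return (adent, sedent), or None if either rule does not match."""
    front = re.search(front_rule, context)
    back = re.search(back_rule, context)
    if front is None or back is None:
        return None
    return front.group(0), back.group(0)


def split_line(context, front_rule, back_rule):
    """(4-12) Split one line into a front-item line and a back-item line."""
    items = _extract(context, front_rule, back_rule)
    if items is None:
        return [context]
    return [items[0] + "\n", items[1] + "\n"]


def exchange_order(context, front_rule, back_rule):
    """(4-13) Rewrite the line as front item followed by back item."""
    items = _extract(context, front_rule, back_rule)
    if items is None:
        return context
    return items[0] + items[1] + "\n"
```

For example, with a picture link as the front item and a figure caption as the back item, a line in which the caption precedes the link is rewritten so that the link comes first.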

(4-14) Geological chronology table processing method choro: obtain the line index contextIndex of the line that conforms to the start of a geological chronology table, and the text texts; set the end flag e_flag equal to "underburden|(not seen from bottom)"; go to step (3-13) to search the text texts for the end index endIndex of the line text that conforms to the end flag e_flag; go to step (4-14-1) to normalize the text content of the text texts from the line index contextIndex to the end index endIndex, obtain the result list resultList of the normalized geological chronology table, and return the result list resultList.

(4-14-1) Obtain the preliminarily delimited geological chronology table list geoList, that is, the part of the text texts from contextIndex to endIndex that needs processing; traverse the list geoList and find the start-of-processing index starIndex at which processing of geoList begins. Starting from starIndex, pre-normalize non-standard serial number marks, blank lines and dividing lines in geoList for subsequent processing. Then, starting from starIndex, handle broken content in geoList: the feature to be handled is that the current line text carries a serial number mark while the next line text carries no serial number mark and does not conform to "upper|lower|not see bottom|underburden|not measured (not see bottom)|(\-){2,}[\u4E00-\u9FA5]+(\-){2,}|!\[]"; in that case, merge the current line text with the next line text and delete the next line text. Repeat these operations until the list geoList has been fully traversed, and return the normalized geological chronology table list geoList.
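The merging of broken layer descriptions in step (4-14-1) can be sketched as follows. The serial number pattern and the Chinese stop words are assumptions chosen for illustration: the published text gives the stop words only in translation, so 上覆, 下伏, 未见底 and 未测 are guessed equivalents and not the patent's wording. The function name merge_broken_layers is hypothetical, and the pre-normalization of serial numbers, blank lines and dividing lines is not covered.

```python
import re

NUMBERED = re.compile(r"^\s*\d+[.、]")  # assumed serial number mark
# Assumed stop pattern: boundary keywords, contact lines such as
# "----整合----", and picture links. The keywords are illustrative guesses.
STOP = re.compile(r"上覆|下伏|未见底|未测|-{2,}[\u4E00-\u9FA5]+-{2,}|!\[")


def merge_broken_layers(geo_list):
    """Join a numbered layer line with its unnumbered continuation lines."""
    result = []
    for line in geo_list:
        continues = (
            bool(result)
            and NUMBERED.match(result[-1]) is not None
            and NUMBERED.match(line) is None
            and STOP.search(line) is None
        )
        if continues:
            result[-1] = result[-1].rstrip("\n") + line
        else:
            result.append(line)
    return result
```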

(4-15) Normalization method choroIntro for the geological section introduction table: obtain the line index contextIndex of the line that conforms to the start of a geological section introduction table, and the text texts; set the end flag e_flag equal to "! \\ \ (. -%.

(4-16) Special geological chronology table processing method choroSpecial: obtain the line index contextIndex of a line text that conforms to the end characteristics of a geological chronology table, and the text texts; set the start feature s_flag equal to "underburden", and traverse the text texts from the line index contextIndex until the start index sIndex of a line that conforms to the start feature s_flag is found; go to step (4-14-1) to normalize the text content of the text texts from the start index sIndex to the line index contextIndex, obtain the result list resultList of the normalized geological chronology table, and return the result list resultList.

(4-17) Title normalization method titleSymbol: obtain the text texts to which title marks need to be added, and traverse the text texts. For the current line text context: if the line text has the characteristics of a title without a serial number mark, such as "introduction|abstract|reference", add the title mark "#" directly; if the line text is a title with a serial number mark, does not contain a title "#" mark, and is the first item of its serial number type, search upwards from the current line index contextIndex for the nearest title carrying a "#" mark, obtain the title level upttGre of that title, and give the current title the level upttGre + 1; then obtain the front template and the rear template around the serial number mark of the line text context, and, starting from the line after the line index contextIndex, mark every text that has no title mark and conforms to the title template and serial number characteristics. Continue until the text texts has been fully traversed.

With the multi-source Markdown geological data normalization method of this embodiment, abnormal content in paragraphs, pictures, chapters and the like can be identified and normalized automatically from the text content of geological data in Markdown format, and a normalized document is generated. Working efficiency and data accuracy are greatly improved, and the method is general and extensible: automatic normalization of Markdown-format geological data can be achieved by changing or extending the rule table alone, with little or no change to the program code, so the method is also suitable for normalizing documents in other fields.

Since the system described in the above embodiment of the present invention is a system adopted for implementing the method of the above embodiment of the present invention, based on the method described in the above embodiment of the present invention, a person skilled in the art can understand the specific structure and the modification of the system/apparatus, and thus the detailed description is omitted here. All systems adopted by the method of the above embodiments of the present invention are within the intended scope of the present invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like are for convenience only and do not denote any order. These words are to be understood as part of the name of the component.

Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment", "some embodiments", "examples", "specific examples" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and a person skilled in the art may combine different embodiments or examples described in this specification, and features thereof, provided that they do not contradict one another.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.
