Identifying sequence titles in a document

Document No.: 1043313. Publication date: 2020-10-09.

Abstract: Identifying sequence titles in a document (标识文档中的序列标题), created by 达雷尔·E·贝勒特 (Darrell E. Bellert) on 2020-03-24. The present disclosure relates to identifying sequence titles in a document. A method is provided for processing an electronic document (ED) to infer a sequence of chapter titles in the ED. The method comprises: generating, by a computer processor, a list of candidate titles in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED; generating, by the computer processor and based on the list of candidate titles, a list of chain fragments for inferring portions of the sequence of chapter titles; and generating, by the computer processor, the sequence of chapter titles by merging at least two chain fragments in the list of chain fragments based on a predetermined criterion.

1. A method for processing an electronic document ED to infer a sequence of chapter titles in the ED, the method comprising:

generating, by a computer processor, a list of candidate titles in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED;

generating, by the computer processor and based on the list of candidate titles, a list of chain fragments for inferring portions of the sequence of chapter titles; and

generating, by the computer processor, the sequence of chapter titles by merging at least two chain fragments in the list of chain fragments based on a predetermined criterion.

2. The method of claim 1, further comprising:

generating a parsed version of the ED, wherein the parsed version of the ED includes style characteristics of the plurality of characters in the ED; and

determining a confidence level for each candidate title in the list of candidate titles based on a uniqueness measure of the style characteristics.

3. The method of claim 1 or 2, further comprising:

determining a confidence level for each chain segment in the list of chain segments based on the confidence level for each candidate title in the list of candidate titles; and

excluding at least one chain segment from the list of chain segments to infer the section title sequence based on a predetermined confidence threshold and a confidence of each chain segment.

4. The method of any of claims 1-3, wherein each candidate title in the list of candidate titles comprises one or more sequence characters according to the predetermined chapter title pattern, and wherein generating the list of chain fragments comprises:

determining a rank of each candidate title in the list of candidate titles based on a nesting level of the sequence characters,

wherein each chain fragment in the list of chain fragments comprises one or more candidate titles having a single rank that defines the rank of that chain fragment.

5. The method of any of claims 1 to 4, wherein generating the list of chain fragments further comprises:

traversing backward in the list of candidate titles to identify a leading candidate title for each chain fragment in the list of chain fragments; and

traversing forward from the leading candidate title in the list of candidate titles to identify the remaining candidate titles in that chain fragment,

wherein the leading candidate title includes a leading sequence character at the rightmost position of its sequence characters.

6. The method of any of claims 1 to 5, wherein the list of chain fragments is ordered according to a rank of each chain fragment in the list of chain fragments.

7. The method of any one of claims 1 to 6, wherein merging the at least two chain fragments comprises:

determining a proximity metric between a higher-level chain fragment and a lower-level chain fragment in the list of chain fragments, wherein the higher-level chain fragment is one of a plurality of higher-level chain fragments that are one level higher than the lower-level chain fragment;

generating a score for the higher-level chain fragment based on a weighted average of the confidence level of the higher-level chain fragment and the proximity metric; and

selecting, based on the score, the higher-level chain fragment from the plurality of higher-level chain fragments in the list of chain fragments as the chain fragment into which the lower-level chain fragment is merged.

8. A non-transitory computer readable medium CRM having stored thereon computer readable program code for processing an electronic document ED to infer a sequence of chapter titles in the ED, wherein said computer readable program code, when executed by a computer, comprises the functions of:

generating a candidate title list in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED;

generating a list of chain fragments for inferring portions of the section title sequence based on the list of candidate titles; and

generating the chapter header sequence by merging at least two chain segments in the list of chain segments based on a predetermined criterion.

9. The CRM of claim 8, the computer readable program code, when executed by a computer, further comprising functionality to:

generating a parsed version of the ED, wherein the parsed version of the ED includes style characteristics of the plurality of characters in the ED; and

determining a confidence level for each candidate title in the list of candidate titles based on a uniqueness measure of the style characteristics.

10. The CRM according to claim 8 or 9, the computer readable program code when executed by a computer further comprising the functions of:

determining a confidence level for each chain segment in the list of chain segments based on the confidence level for each candidate title in the list of candidate titles; and

excluding at least one chain segment from the list of chain segments to infer the section title sequence based on a predetermined confidence threshold and a confidence of each chain segment.

11. The CRM of any of claims 8-10, wherein each candidate title in the list of candidate titles comprises one or more sequence characters according to the predetermined chapter title pattern, and wherein generating the list of chain fragments comprises:

determining a rank of each candidate title in the list of candidate titles based on a nesting level of the sequence characters,

wherein each chain fragment in the list of chain fragments comprises one or more candidate titles having a single rank that defines the rank of that chain fragment.

12. The CRM of any of claims 8 to 11, wherein generating the list of chain fragments further comprises:

traversing backward in the list of candidate titles to identify a leading candidate title for each chain fragment in the list of chain fragments; and

traversing forward from the leading candidate title in the list of candidate titles to identify the remaining candidate titles in that chain fragment,

wherein the leading candidate title includes a leading sequence character at the rightmost position of its sequence characters.

13. The CRM of any of claims 8 to 12, wherein merging the at least two chain fragments comprises:

generating a score for the higher-level chain fragment based on a weighted average of the confidence level of the higher-level chain fragment and the proximity metric; and

selecting, based on the score, the higher-level chain fragment from the plurality of higher-level chain fragments in the list of chain fragments as the chain fragment into which the lower-level chain fragment is merged.

14. A system for processing an electronic document ED to infer a sequence of chapter titles in the ED, the system comprising:

a memory; and

a computer processor connected to the memory and configured to:

generate a list of candidate titles in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED;

generate, based on the list of candidate titles, a list of chain fragments for inferring portions of the sequence of chapter titles; and

generate the sequence of chapter titles by merging at least two chain fragments in the list of chain fragments based on a predetermined criterion.

15. The system of claim 14, the computer processor further configured to:

generate a parsed version of the ED, wherein the parsed version of the ED includes style characteristics of the plurality of characters in the ED; and

determine a confidence level for each candidate title in the list of candidate titles based on a uniqueness measure of the style characteristics.

16. The system of claim 14 or 15, the computer processor further configured to:

determine a confidence level for each chain fragment in the list of chain fragments based on the confidence level for each candidate title in the list of candidate titles; and

exclude at least one chain fragment from the list of chain fragments used to infer the sequence of chapter titles, based on a predetermined confidence threshold and the confidence level of each chain fragment.

17. The system of any of claims 14 to 16, wherein each candidate title in the list of candidate titles comprises one or more sequence characters according to the predetermined chapter title pattern, and wherein generating the list of chain fragments comprises:

determining a rank of each candidate title in the list of candidate titles based on a nesting level of the sequence characters,

wherein each chain fragment in the list of chain fragments comprises one or more candidate titles having a single rank that defines the rank of that chain fragment.

18. The system of any of claims 14 to 17, wherein generating the list of chain fragments further comprises:

traversing backward in the list of candidate titles to identify a leading candidate title for each chain fragment in the list of chain fragments; and

traversing forward from the leading candidate title in the list of candidate titles to identify the remaining candidate titles in that chain fragment,

wherein the leading candidate title includes a leading sequence character at the rightmost position of its sequence characters.

19. The system of any of claims 14 to 18, wherein the list of chain fragments is ordered according to a rank of each chain fragment in the list of chain fragments.

20. The system of any one of claims 14 to 19, wherein merging the at least two chain fragments comprises:

generating a score for the higher-level chain fragment based on a weighted average of the confidence level of the higher-level chain fragment and the proximity metric; and

selecting, based on the score, the higher-level chain fragment from the plurality of higher-level chain fragments in the list of chain fragments as the chain fragment into which the lower-level chain fragment is merged.

Technical Field

The invention relates to identification of sequence titles in documents.

Background

An author may organize the content of an electronic document (ED) (e.g., a PDF document or an OOXML document) into sections within the ED. Many different file formats exist, and each file format defines how the contents of the file are encoded. Regardless of the file format, semantic information implied by the author (such as chapters or chapter titles) may not be specified using computer-recognizable information within the ED.

Disclosure of Invention

In general, in one aspect, the invention relates to a method for processing an electronic document (ED) to infer a sequence of chapter titles in the ED. The method comprises: generating, by a computer processor, a list of candidate titles in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED; generating, by the computer processor and based on the list of candidate titles, a list of chain fragments for inferring portions of the sequence of chapter titles; and generating, by the computer processor, the sequence of chapter titles by merging at least two chain fragments in the list of chain fragments based on a predetermined criterion.

In general, in one aspect, the invention relates to a non-transitory computer readable medium (CRM) having stored thereon computer readable program code for processing an electronic document (ED) to infer a sequence of chapter titles in the ED. The computer readable program code, when executed by a computer, includes functionality for: generating a list of candidate titles in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED; generating, based on the list of candidate titles, a list of chain fragments for inferring portions of the sequence of chapter titles; and generating the sequence of chapter titles by merging at least two chain fragments in the list of chain fragments based on a predetermined criterion.

In general, in one aspect, the invention relates to a system for processing an electronic document (ED) to infer a sequence of chapter titles in the ED. The system includes a memory and a computer processor coupled to the memory, the computer processor being configured to: generate a list of candidate titles in the ED based on a predetermined chapter title pattern and regular expression matching of a plurality of characters in the ED; generate, based on the list of candidate titles, a list of chain fragments for inferring portions of the sequence of chapter titles; and generate the sequence of chapter titles by merging at least two chain fragments in the list of chain fragments based on a predetermined criterion.

Other aspects of the invention will become apparent from the following description and the appended claims.

Drawings

FIG. 1 shows a system in accordance with one or more embodiments of the invention.

FIGS. 2A-2B illustrate flow diagrams in accordance with one or more embodiments of the invention.

FIGS. 3A-3G illustrate an implementation example in accordance with one or more embodiments of the invention.

FIG. 4 illustrates a computing system in accordance with one or more embodiments of the invention.

Detailed Description

Specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings. Like elements in the various drawings are represented by like reference numerals for consistency.

In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.

Some electronic documents (EDs), such as PDF documents or OOXML documents, do not explicitly identify the sections or section titles of the document. Here, a section title is a piece of text that the author intends to mark the start of a section of the ED. To gain more meaningful insight, a user may request to view or search for information in a particular section of a large document. For example, a user may request information about a particular section by issuing a command such as "show me the section in this document that discusses the feeding habits of western Ehrlich hazelnuts". If the sections and/or section titles of the document are not explicitly identified, they must be inferred before such targeted queries can be answered.

In general, embodiments of the invention provide methods, non-transitory computer-readable media, and systems for inferring certain text in an ED as sequence section titles. In one or more embodiments of the invention, sequence section titles are section titles that form a sequence, wherein each section title has one or more sequence characters (e.g., 1.1, 1.2, 1.2.1, a., b., i., ii., iii., iv., etc.) at the leading position (i.e., the leftmost position) of the section title. The sequence characters may be separated by punctuation marks. The sequence characters of successive sequence section titles follow one another in the sequence. All sequence characters in a sequence section title belong to the same family type, which is one of numeric characters, upper-case Roman numerals, lower-case Roman numerals, upper-case alphabetic characters, and lower-case alphabetic characters. Thus, based on their sequence characters, section titles can be grouped into one or more of five possible families (numbers, upper-case Roman numerals, lower-case Roman numerals, upper-case letters, and lower-case letters).
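The grouping into families can be illustrated with a minimal Python sketch. The function name `possible_families`, the family labels, and the rule that alphabetic sequence characters are single letters are assumptions made for illustration and are not taken from the disclosure. The sketch returns a set because a token such as "i" or "C" is consistent with more than one family until its neighbors in the sequence are considered.

```python
import re

# Strict Roman-numeral pattern (1..3999); the empty string is rejected separately.
_ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def possible_families(token: str) -> set[str]:
    """Return every family a single sequence-character token could belong to.

    Assumption: alphabetic sequence characters are single letters ("a", "B").
    """
    families: set[str] = set()
    if not token or not token.isascii():
        return families
    if token.isdigit():
        families.add("numeric")
    elif token.isalpha():
        if token.isupper():
            case = "upper"
        elif token.islower():
            case = "lower"
        else:
            return families  # mixed case such as "Iv" fits no family
        if _ROMAN.match(token.upper()):
            families.add(f"{case}_roman")
        if len(token) == 1:
            families.add(f"{case}_alpha")
    return families


print(possible_families("12"))  # numeric only
print(possible_families("iv"))  # lower-case Roman only
print(possible_families("i"))   # ambiguous: Roman numeral or letter
```

Resolving the ambiguity is left to later stages, where a token's neighbors in a chain fragment show which family the sequence actually follows.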

In one or more embodiments of the invention, the inferred chapter title information is inserted or embedded (e.g., specified as an OOXML tag or according to some other standard) into an ED that previously lacked computer-recognizable identification of chapters or chapter titles. For example, the inferred chapter title information may be inserted or embedded near the corresponding text in the ED, or elsewhere in the ED, for example in the document properties. Further, the final document with the embedded inferred information may be in OOXML, PDF, or any other file format that allows searching through standard text search tools in an operating system or software application.

FIG. 1 shows a system 100 in accordance with one or more embodiments of the invention. As shown in FIG. 1, the system 100 has a number of components including, for example, a buffer 104, a parsing engine 108, and an inference engine 110. Each of these components 104, 108, 110 may be located on the same computing device (e.g., personal computer (PC), laptop computer, tablet computer, smart phone, multi-function printer, kiosk, server, etc.) or on different computing devices connected by a network of any size having wired and/or wireless network segments. Each of these components is discussed below.

In one or more embodiments of the invention, the buffer 104 may be implemented in hardware (i.e., circuitry), software, or any combination thereof. The buffer 104 is configured to store an ED 106 that includes one or more lines of text composed of characters. The ED 106 may also include images and graphics. The ED 106 may be obtained from any source (e.g., download, scan, etc.). The ED 106 may be part of a set of EDs. Further, the ED 106 may be of any size and in any format (e.g., PDF, OOXML, ODF, HTML, etc.). The ED 106 includes semantic content implied by the author as sections and section titles that are not specified or explicitly identified by the ED 106 itself. In other words, the chapters and chapter titles are not specified or explicitly identified using computer-identifiable information (e.g., tags or other identifiers) in the ED 106.

In one or more embodiments of the invention, the parsing engine 108 may be implemented in hardware (i.e., circuitry), software, or any combination thereof. The parsing engine 108 parses the ED 106 to extract content, layout, and style information for characters in the ED 106, and generates a parsed version of the ED 106, referred to as the parsed ED 107, based on the extracted information. In particular, the parsed ED 107 represents the original content of the ED 106 by way of the extracted information. The parsed ED 107 may be stored in the buffer 104.

In one or more embodiments, the parsed ED 107 is in a common, predetermined structured format, such as JSON or XML, that encodes the information extracted from the ED 106. This common format stores paragraphs, lines, and runs of text, as well as the corresponding bounding boxes and style information. In addition, this common format may store other document content, such as images and graphics. Examples of the ED 106 and the parsed ED 107 are depicted in FIG. 3A and FIG. 3B, respectively.

As shown in FIG. 3A, ED A 310 is an example of the ED 106 and includes a plurality of lines of text composed of characters. Text lines may be grouped into paragraphs 312. As shown in FIG. 3A, each paragraph may include a single line of text or multiple lines of text. After ED A 310 is parsed, a common-format representation of a subset of the document is as shown in FIG. 3B.

FIG. 3B shows a portion of a parsed version of ED A 310, referred to as parsed ED 321. The parsed ED 321 is an example of the parsed ED 107 and includes style information 324, layout information 323, and content information 322 for characters in the third section of ED A 310. For example, the content information 322 includes the characters "bomb-sniffing cat" shown in FIG. 3A. As seen in FIG. 3B, style information 324 is expressed as variables (e.g., v:4) that identify, via a style_id, the various features or aspects (i.e., styles) of the text. In particular, the content information 322 includes all characters in the text line to which the style information 324 applies.

Although the above is merely exemplary, the common format identifies the basic structure and style details of the document. In particular, individual paragraphs in a document are identified, where each paragraph is broken down into one or more text lines. In addition, each line is broken down into one or more runs of text, where all text in a run has the same style information. In the above example, style information is handled by referencing an ID (with the exact style details for a particular ID appearing in the "run_tips" list at the end of the file). In other examples, the style information may instead be encoded inline along with the run itself. Regardless, style details encode information such as font, point size, text color, bold, underline, and italics. In addition to style information, layout information (e.g., layout information 323) is provided via char_bbox/visual_bbox, which identify the bounding boxes of paragraphs, lines, and runs. Finally, the text of the document itself is provided as part of each run.
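As a rough illustration of the paragraph/line/run hierarchy described above, the following Python sketch models a parsed ED as nested dictionaries. Only `style_id`, `char_bbox`, and `visual_bbox` are names that appear in the description. Every other key (`paragraphs`, `lines`, `runs`, `text`, `styles`), the helper `iter_runs`, and all values (text, coordinates, fonts) are invented for the example and do not reproduce FIG. 3B.

```python
# Hypothetical parsed-ED layout: paragraphs -> lines -> runs, with styles by ID.
parsed_ed = {
    "paragraphs": [
        {
            "visual_bbox": [72, 100, 540, 118],
            "lines": [
                {
                    "visual_bbox": [72, 100, 540, 118],
                    "runs": [
                        {"text": "3. ", "style_id": 4, "char_bbox": [72, 100, 88, 118]},
                        {"text": "Bomb-sniffing cats", "style_id": 4,
                         "char_bbox": [88, 100, 230, 118]},
                    ],
                }
            ],
        },
        {
            "visual_bbox": [72, 130, 540, 144],
            "lines": [
                {
                    "visual_bbox": [72, 130, 540, 144],
                    "runs": [
                        {"text": "Body text of the section.", "style_id": 1,
                         "char_bbox": [72, 130, 260, 144]},
                    ],
                }
            ],
        },
    ],
    # Style table referenced by ID (key name assumed for this sketch).
    "styles": {
        1: {"font": "Times", "size": 11, "bold": False},
        4: {"font": "Helvetica", "size": 14, "bold": True},
    },
}


def iter_runs(doc):
    """Yield (paragraph_index, text, resolved_style) for every run in the document."""
    for p_index, paragraph in enumerate(doc["paragraphs"]):
        for line in paragraph["lines"]:
            for run in line["runs"]:
                yield p_index, run["text"], doc["styles"][run["style_id"]]


for p_index, text, style in iter_runs(parsed_ed):
    print(p_index, repr(text), style["font"], style["size"])
```

Resolving the style by ID at iteration time mirrors the first option described above. Encoding the style inline in each run would make the lookup unnecessary at the cost of repetition.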

Returning to the discussion of FIG. 1, in one or more embodiments of the invention, the inference engine 110 may be implemented in hardware (i.e., circuitry), software, or any combination thereof. In particular, the inference engine 110 is configured to infer certain text in the parsed ED 107 as sequence section titles of the ED 106. Based on the content information and style characteristics extracted by the parsing engine 108, the inference engine 110 generates candidate titles in the ED 106, which are assembled into a plurality of chain fragments that each form part of a sequence of chapter titles. The lower-level chain fragments are merged into the higher-level chain fragments to generate a complete sequence of chapter titles. As used herein, a candidate title is a piece of text identified as a candidate for a chapter title. A chain fragment (or simply fragment) is one or more candidate titles that may qualify as part of a sequence of chapter titles. Throughout this disclosure, the terms "candidate title" and "title" may be used interchangeably unless explicitly specified as "chapter title".

In one or more embodiments of the invention, the inference engine 110 generates metadata 112 for the ED 106 corresponding to one or more intermediate results of the inference engine 110, such as candidate titles, confidence and rank metrics for the candidate titles, chain fragments, parent/child relationships of the chain fragments, and the like. In other words, the metadata 112 includes information representing one or more intermediate results of the inference engine 110. In one or more embodiments, the inference engine 110 stores the metadata 112 in the buffer 104. Alternatively, in one or more embodiments, the inference engine 110 stores the metadata 112 back into the parsed ED 107. The metadata 112 may also be stored in an external buffer and retrieved for use by the inference engine 110.

In one or more embodiments of the invention, inference engine 110 performs the functions described above using the methods described below with reference to FIG. 2A.

Although the system 100 is shown with three components 104, 108, 110, in other embodiments of the invention, the system 100 may have more or fewer components. Further, the functionality of each of the components described above may be divided among the components. Further, each component 104, 108, 110 may be utilized multiple times to perform iterative operations.

FIG. 2A shows a flow diagram in accordance with one or more embodiments of the invention. The flow diagram describes a process for inferring one or more sequence chapter titles in an electronic document (ED). One or more of the steps in FIG. 2A may be performed by the components of the system 100 discussed above with reference to FIG. 1. In one or more embodiments of the invention, one or more of the steps shown in FIG. 2A may be omitted, repeated, and/or performed in a different order than that shown in FIG. 2A. Accordingly, the scope of the present invention should not be considered limited to the specific arrangement of steps shown in FIG. 2A.

Referring to FIG. 2A, first in step 200, the ED is parsed to generate a parsed version of the ED that includes content information, style characteristics, and layout characteristics of the characters. In particular, the ED includes chapters and chapter titles that are not specified or explicitly identified using computer-identifiable information (such as a tag or other identifier) in the ED.

In accordance with one or more embodiments, in step 201, a list of candidate titles in the ED is generated based on a predetermined chapter title pattern. In one or more embodiments of the present invention, the predetermined chapter title pattern is a regular expression, i.e., a character sequence that defines a search pattern. Candidate titles are the pieces of text in the ED that match the regular expression ("regular expression matching") and are used for inferring the sequence of chapter titles. The pattern <sequence characters><text> is used as the regular expression for searching for candidate titles in the ED. In other words, a piece of text having the <sequence characters><text> pattern is identified as one of the candidate titles. In this context, the <sequence characters> portion and the <text> portion of a candidate title are referred to as the sequence characters and the text of the candidate title, respectively. Each candidate title is contained in a single paragraph of the ED. In other words, each candidate title is bounded by the corresponding paragraph bounding box. The list of candidate titles is ordered according to the paragraph numbers of the candidate titles. Generating the list of candidate titles includes generating metadata identifying the candidate titles in the list and storing the metadata in association with the ED or the parsed version of the ED.
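The regular expression matching of step 201 might look like the following minimal Python sketch. The concrete pattern, the `Candidate` record, and the function `find_candidates` are assumptions for illustration, since the disclosure only specifies the general <sequence characters><text> shape. The sketch also assumes each paragraph is a single-line string and that sequence characters ending in a letter must be followed by "." or ")".

```python
import re
from dataclasses import dataclass

# One sequence character: digits, a Roman-numeral-like run, or a single letter.
TOKEN = r"(?:\d+|[ivxlcdm]+|[IVXLCDM]+|[A-Za-z])"

# <sequence characters><text>: tokens joined by ".", then "." or ")" (or plain
# whitespace when the sequence ends in a digit), then the title text.
HEADING = re.compile(
    rf"^\s*(?P<seq>{TOKEN}(?:\.{TOKEN})*)"
    r"(?:[.)]\s*|(?<=\d)\s+)"
    r"(?P<text>\S.*)$"
)


@dataclass
class Candidate:
    paragraph: int        # paragraph number in the ED
    sequence_chars: str   # e.g. "1.2.1" or "a"
    text: str             # remainder of the paragraph


def find_candidates(paragraphs):
    """Return candidate titles in paragraph order."""
    candidates = []
    for number, paragraph in enumerate(paragraphs):
        match = HEADING.match(paragraph)
        if match:
            candidates.append(Candidate(number, match["seq"], match["text"]))
    return candidates


doc = ["1. Introduction", "Cats are great.", "1.1 Background",
       "a. first item", "A cat sat."]
for candidate in find_candidates(doc):
    print(candidate)
```

A pattern like this deliberately over-matches (a paragraph starting with "2020 was ..." would also qualify). The rank, confidence, and chain-fragment stages are what discard such false positives.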

In accordance with one or more embodiments, in step 202, a rank is generated for each candidate title in the list of candidate titles. The rank of a candidate title is a measure of the level of nesting found in the sequence characters of the candidate title. For example, the rank may correspond to the number of sequence characters separated by punctuation in the sequence characters of the candidate title. The rank is stored as metadata in association with the ED or the parsed version of the ED.
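Under the example definition given above, the rank can be computed by counting the punctuation-separated tokens. The following is a minimal Python sketch; the function name `rank_of` is hypothetical.

```python
import re


def rank_of(sequence_chars: str) -> int:
    """Count the punctuation-separated sequence characters, e.g. "1.2.1" -> 3."""
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", sequence_chars) if t]
    return len(tokens)


for seq in ("2", "1.1", "1.2.1", "a."):
    print(seq, rank_of(seq))
```

With this convention a smaller rank number means a shallower nesting level, so "2" (rank 1) sits one level above "2.1" (rank 2).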

In accordance with one or more embodiments, in step 203, a confidence level is generated for each candidate title in the list of candidate titles. The confidence level of a candidate title is a measure of the style uniqueness of that candidate title. For example, the style uniqueness may correspond to a statistical measure (e.g., a percentage) of the characters in the ED having a particular style. The confidence level may be stored as metadata in association with the ED or the parsed version of the ED.
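One way to turn style uniqueness into a number is sketched below in Python. The formula (one minus the fraction of the document's characters that share the style) is an assumption chosen for illustration. The disclosure only says the measure may be a statistic such as a percentage. The function name `style_confidences` and the `(style_id, text)` input shape are likewise hypothetical.

```python
from collections import Counter


def style_confidences(runs):
    """Map each style ID to a confidence in [0, 1].

    runs: iterable of (style_id, text) pairs covering the whole document.
    Assumed formula: 1 - (characters in this style / all characters), so a
    style used by few characters is considered more heading-like.
    """
    chars_per_style = Counter()
    for style_id, text in runs:
        chars_per_style[style_id] += len(text)
    total = sum(chars_per_style.values())
    if total == 0:
        return {}
    return {style: 1.0 - count / total for style, count in chars_per_style.items()}


runs = [("body", "x" * 90), ("heading", "y" * 10)]
print(style_confidences(runs))
```

A candidate title would then take the confidence of the style applied to its text, so body-styled false positives score low.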

In accordance with one or more embodiments, in step 204, a list of chain fragments for inferring the sequence of chapter titles is generated based on the list of candidate titles. One or more candidate titles are grouped into a chain fragment according to rank and family type. In other words, all candidate titles in a chain fragment have the same rank and the same family type, which define the rank and family of the chain fragment. The chain fragments are ordered according to their respective ranks to form the list of chain fragments, and a confidence level for each chain fragment is determined based on the confidence level of each candidate title included in the chain fragment. In addition, one or more chain fragments whose underlying candidate titles have an average confidence level below a predetermined confidence threshold are removed or excluded from the list of chain fragments. Information representing the list of chain fragments is then stored as metadata in association with the ED or the parsed version of the ED.
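The grouping, thresholding, and ordering of step 204 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the disclosed procedure: it handles only the numeric family, it finds leading titles with a single forward scan, and the names `Candidate`, `Fragment`, and `build_fragments` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    paragraph: int
    seq: tuple          # numeric sequence characters, e.g. "1.2" -> (1, 2)
    confidence: float


@dataclass
class Fragment:
    rank: int
    titles: list = field(default_factory=list)

    @property
    def confidence(self) -> float:
        # Average confidence of the candidate titles in the fragment.
        return sum(t.confidence for t in self.titles) / len(self.titles)


def build_fragments(candidates, threshold):
    """Group candidates into chain fragments, drop weak ones, order by rank."""
    used = set()
    fragments = []
    for i, lead in enumerate(candidates):
        # A fragment starts at a leading title: rightmost sequence character is 1.
        if i in used or lead.seq[-1] != 1:
            continue
        fragment = Fragment(rank=len(lead.seq), titles=[lead])
        used.add(i)
        expected = lead.seq[:-1] + (2,)
        # Traverse forward to collect the successors of the leading title.
        for j in range(i + 1, len(candidates)):
            if j not in used and candidates[j].seq == expected:
                fragment.titles.append(candidates[j])
                used.add(j)
                expected = expected[:-1] + (expected[-1] + 1,)
        fragments.append(fragment)
    kept = [f for f in fragments if f.confidence >= threshold]
    return sorted(kept, key=lambda f: f.rank)  # stable: ties keep document order


candidates = [
    Candidate(0, (1,), 0.9), Candidate(2, (1, 1), 0.8), Candidate(5, (1, 2), 0.8),
    Candidate(6, (2,), 0.9), Candidate(9, (1,), 0.1), Candidate(10, (2,), 0.1),
]
for fragment in build_fragments(candidates, threshold=0.5):
    print(fragment.rank, [t.paragraph for t in fragment.titles], fragment.confidence)
```

In the sample data, the body-styled list items in paragraphs 9 and 10 form their own fragment, which the confidence threshold then removes.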

In accordance with one or more embodiments, in step 205, the sequence of chapter titles is generated by merging chain fragments based on predetermined criteria (e.g., confidence metrics and proximity metrics of the fragments to be merged). In particular, the merging is performed according to the respective ranks and families. Within the same family, a lower-level chain fragment is merged into a higher-level chain fragment that is one level higher than the lower-level chain fragment. Further, a proximity metric between the higher-level chain fragment and the lower-level chain fragment is generated. For example, the proximity metric may correspond to the paragraph number difference between the insertion point in the higher-level chain fragment and the leading candidate title of the lower-level chain fragment. In addition, a score is generated for each higher-level chain fragment based on a weighted average of the confidence metric and the proximity metric of that higher-level chain fragment. Based on the score, a higher-level chain fragment is selected as the parent chain fragment of the lower-level chain fragment. For example, if the score of a higher-level chain fragment is the highest among all possible higher-level chain fragments, it is selected as the parent chain fragment of the lower-level chain fragment. Information representing the sequence of chapter titles is stored as metadata in association with the ED or the parsed version of the ED.
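The parent selection in step 205 can be sketched as a small scoring function in Python. Several details are assumptions for illustration: the paragraph-number difference is mapped to a closeness value via 1/(1 + distance) so that nearer parents score higher, the two terms are weighted equally by default, and the names `ParentFragment`, `score`, and `choose_parent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParentFragment:
    name: str
    confidence: float
    insertion_paragraph: int  # paragraph of the title the child would attach under


def score(parent, child_lead_paragraph, w_conf=0.5):
    """Weighted average of the parent's confidence and its proximity to the child."""
    distance = abs(child_lead_paragraph - parent.insertion_paragraph)
    proximity = 1.0 / (1.0 + distance)  # assumed mapping: closer -> nearer to 1
    return w_conf * parent.confidence + (1.0 - w_conf) * proximity


def choose_parent(parents, child_lead_paragraph, w_conf=0.5):
    """Pick the highest-scoring higher-level fragment as the parent."""
    return max(parents, key=lambda p: score(p, child_lead_paragraph, w_conf))


parents = [ParentFragment("A", 0.9, 0), ParentFragment("B", 0.8, 40)]
print(choose_parent(parents, child_lead_paragraph=1).name)
print(choose_parent(parents, child_lead_paragraph=41).name)
```

The weight `w_conf` controls the trade-off: a value near 1 favors stylistically distinctive parents, while a value near 0 favors whichever parent is closest in the document.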

FIG. 2B shows a flow diagram in accordance with one or more embodiments of the invention. The flow diagram describes a process of searching in an ED in which the sections and section titles that the author semantically implies are not specified or explicitly identified using computer-identifiable information (such as tags or other identifiers) in the ED. To improve search results, the section title information for the ED may be inferred by components of the system 100 discussed above with reference to FIG. 1 using the method described above with reference to FIG. 2A. In one or more embodiments of the invention, one or more of the steps shown in FIG. 2B may be omitted, repeated, and/or performed in a different order than that shown in FIG. 2B. Accordingly, the scope of the present invention should not be considered limited to the specific arrangement of steps shown in FIG. 2B.

In step 210, a search request specifying a search phrase is received from a user. In one or more embodiments of the invention, a user may open an ED in a file viewer. The user may open a search dialog box in the file viewer and enter a search phrase to search for one or more matching phrases that may lead to the relevant information for the user in the ED.

In step 211, the ED is searched to identify the location of one or more matching phrases. For example, there may be multiple matching phrases in the ED, and the matching phrases found in a particular section of the ED may be more relevant to the user than the others. Inferred section title information is added to the ED, which existing (e.g., legacy) search engines can use to return the entire section in which a matching phrase was found. For example, section title information may be inferred and added to the ED before a search request is received from the user. In another example, section title information may be inferred and added to the ED in response to receiving a search request from the user. The section title information is inferred and added to the ED using the method described above with reference to FIG. 2A.

In one or more embodiments of the invention, the viewer search engine searches through the inferred section title information to identify the entire section where the matching phrase was found. When a match is found, the file viewer retrieves the location of the matching phrase and the section containing the matching phrase.

In step 212, in one or more embodiments of the invention, the matching phrase and the section containing the matching phrase are presented to the user. Presenting the matched phrase and the associated section may include highlighting the matched phrase in the associated section. A plurality of sections containing a plurality of matching phrases are presented to the user so that the user can select the section containing the information most relevant to the user.
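
As a minimal sketch of steps 210 to 212, assuming the inferred section titles are available as a list of paragraph positions, each matching paragraph can be paired with the section title that precedes it. The function name search_sections and the sample paragraphs are hypothetical and are not part of the disclosure.

```python
from bisect import bisect_right

def search_sections(paragraphs, title_positions, phrase):
    """Return (matching paragraph, position of its section title) pairs."""
    ordered = sorted(title_positions)
    hits = []
    for index, text in enumerate(paragraphs):
        if phrase.lower() in text.lower():
            # The containing section starts at the last title at or before the match.
            slot = bisect_right(ordered, index) - 1
            section_start = ordered[slot] if slot >= 0 else None
            hits.append((index, section_start))
    return hits

paragraphs = ["1. Intro", "alpha text", "2. Methods", "beta text with alpha"]
print(search_sections(paragraphs, [0, 2], "alpha"))
```

A viewer could then highlight the matching paragraph and display the whole section that starts at the returned title position.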

As shown in FIGS. 2A and 2B, one or more embodiments allow computerized searches of the ED such that not only the matching phrase(s) are returned, but also the section(s) of the ED in which the matching phrase(s) were found. Thus, the user is able to view other information related to the search phrase based on the section titles that the author semantically implies, without relying on computer-recognizable information in the ED (e.g., tags or other identifiers) that specifies or explicitly identifies those titles.

FIGS. 3C-3G illustrate implementation examples in accordance with one or more embodiments of the invention. The implementation examples shown in FIGS. 3C-3G are based on the system and method flow diagrams described above with reference to FIGS. 1, 2A, and 2B. In one or more embodiments of the invention, one or more of the elements shown in FIGS. 3C-3G may be omitted, repeated, and/or organized in a different arrangement. Thus, the scope of the present invention should not be considered limited to the particular arrangement of elements shown in FIGS. 3C-3G.

An example of generating candidate titles with associated rank and confidence metrics is described with reference to FIG. 3C and Table 1 below. As shown in FIG. 3C, ED B (330) includes 21 paragraphs, from paragraph 0 to paragraph 20, such as paragraph 0 (331), paragraph 2 (332), paragraph 5 (333), paragraph 6 (334), paragraph 18 (335), and paragraph 19 (336). Candidate titles in the ED are identified as the 16 row entries of Table 1 below by using regular expressions to search for pieces of text having a <sequence characters> <text> pattern. In particular, Table 1 shows an example of the candidate title list described above with reference to steps 201, 202, and 203 of FIG. 2A.
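
The regular-expression search for a <sequence characters> <text> pattern might be sketched as follows. The three patterns and the family names are assumptions made for illustration; the source does not give the actual predetermined patterns.

```python
import re

# Hypothetical patterns, one per family of sequence characters.
PATTERNS = {
    "number": re.compile(r"^(\d+(?:\.\d+)*)\.\s*(\S.*)$"),
    "lower_letter": re.compile(r"^([a-z])[.)]\s*(\S.*)$"),
    "lower_roman": re.compile(r"^([ivxlcdm]+)[.)]\s*(\S.*)$"),
}

def find_candidates(paragraphs):
    """Return (position, family, sequence characters) for every pattern match."""
    candidates = []
    for position, text in enumerate(paragraphs):
        for family, pattern in PATTERNS.items():
            match = pattern.match(text)
            if match:
                candidates.append((position, family, match.group(1)))
    return candidates

sample = ["1. This is a main heading", "Body text.", "i. first item",
          "2.1. This is a subheading"]
for candidate in find_candidates(sample):
    print(candidate)
```

In this sketch a paragraph such as "i. first item" yields two candidates, one per family, which mirrors the ambiguity that the later steps resolve.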

TABLE 1

In the candidate title list of Table 1, the position of a candidate title is the paragraph number of the candidate title in ED B (330). Throughout this disclosure, the term "position" means "position of a candidate title," unless otherwise specified. The rank of a candidate title is the number of sequence characters in the candidate title, and indicates the nesting level of the candidate title. For example, a candidate title having the sequence characters "3" is at level 1, a candidate title having the sequence characters "2.1" is at level 2, a candidate title having the sequence characters "2.2.1" is at level 3, and so on.
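
The relationship between sequence characters and level can be sketched as a count of dot-separated components. The helper name level_of is hypothetical.

```python
def level_of(sequence):
    """Nesting level = number of non-empty, dot-separated sequence characters."""
    return len([part for part in sequence.split(".") if part])

for sequence in ("3", "2.1", "2.2.1"):
    print(sequence, level_of(sequence))
```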

The confidence level of a candidate title indicates the style uniqueness of that candidate title. Generally, titles implied by the author of the ED have a unique style compared to other text in the ED. For example, paragraph 0 (331), paragraph 2 (332), and paragraph 19 (336) are all main headings implied by the author and share a common style that is specific to only these paragraphs. The confidence of these candidate titles is calculated as 1 minus the number of characters with this particular common style divided by the total number of characters in the ED. In the example of ED B (330), paragraph 0 (331), paragraph 2 (332), and paragraph 19 (336) have 90 characters in total, and the ED has a total of 503 characters. Thus, as listed above in Table 1, the confidence for each of paragraph 0 (331), paragraph 2 (332), and paragraph 19 (336) is calculated to be 1 - 90/503, which is approximately 0.82.
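
The confidence calculation above can be checked with a short sketch. The function name style_confidence is hypothetical; the figures 90 and 503 are those of the ED B (330) example.

```python
def style_confidence(chars_with_style, total_chars):
    """Confidence = 1 - (characters sharing the style) / (characters in the ED)."""
    return 1.0 - chars_with_style / total_chars

print(round(style_confidence(90, 503), 2))
```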

In particular, note that paragraph 6 (334) is identified in Table 1 as two candidate titles, one in the lower case letter family and the other in the lower case Roman numeral family. This is due to the ambiguity of "i.", which can be read either as a letter or as a Roman numeral. In other words, due to ambiguity, one or more candidate titles in the ED may be classified as belonging to multiple families, with the ambiguity being resolved in subsequent steps.

As an example of step 204 of FIG. 2A described above, chain fragments of the chapter title sequence are generated from the candidate titles based on the rank metric. As described above, a chain fragment (or simply a fragment) is one or more candidate titles that qualify as part of a chapter title sequence. A candidate title having a leading sequence character at the rightmost position of its sequence characters (i.e., "1" for numbers, "A." for capital letters, or "i." for lower case Roman numerals, etc.) forms a single-title chain fragment or serves as the start of a chain fragment having multiple candidate titles. The sequence characters of the candidate titles in a chain fragment follow one another from the start of the chain fragment. In one or more embodiments, the chain fragments are generated by traversing the candidate title list backward to search for the start of each chain fragment. As used herein, "backward" means toward the beginning or top of Table 1, and "forward" means toward the end or bottom of Table 1. The start of a chain fragment is also referred to as a chain fragment start. For example, the following sequence characters may all represent the start of a chain fragment. In other words, candidate titles that include the following sequence characters may be identified as potential chain fragment starts.

·4.1

·4.2.1

·4.3.1

·1.

·i.

·a)
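
A test for potential chain fragment starts, covering the examples listed above, might look like the following sketch. The family names and the table FIRST_IN_FAMILY are assumptions; mixed-family sequences such as a number followed by a letter are not handled.

```python
# Hypothetical table of the leading sequence character of each family.
FIRST_IN_FAMILY = {
    "number": "1",
    "lower_letter": "a",
    "upper_letter": "A",
    "lower_roman": "i",
    "upper_roman": "I",
}

def is_chain_start(sequence, family):
    """True if the rightmost sequence character is the first one of its family."""
    parts = [part for part in sequence.rstrip(".)").split(".") if part]
    return bool(parts) and parts[-1] == FIRST_IN_FAMILY[family]

print(is_chain_start("4.2.1", "number"))     # rightmost entry is 1
print(is_chain_start("i.", "lower_roman"))   # first Roman numeral
print(is_chain_start("i.", "lower_letter"))  # "i" is not the first letter
```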

Once a potential chain fragment start is identified, the chain fragment is constructed by searching, in sequence, for subsequent candidate titles that have the same text style as the potential chain fragment start and that have not already been included in another chain fragment of the same rank and family. Disambiguation of the different interpretations occurs in this step. For example, "i." found in a candidate title is resolved as either the beginning of a Roman numeral chain or the 9th entry of an alphabetic chain. In particular, the distinction is based on whether a chain fragment is generated using "i." as the start of the chain fragment. In other words, if a chain fragment is generated using "i." as the start of the chain fragment, then "i." is considered a Roman numeral. Conversely, if no chain fragment is generated using "i." as the start of the chain fragment, "i." is considered a letter.
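
The backward search for fragment starts and the forward search for continuations can be sketched for the numeric family only. The sample data mirror the numeric candidate titles of ED B (330); the style labels are hypothetical, and the style of paragraph 18 is assumed to differ from all others. Letter and Roman numeral families, and their disambiguation, are left out of this sketch.

```python
def build_fragments(candidates):
    """candidates: (position, numeric sequence tuple, style) in document order."""
    used = set()
    fragments = []
    # Travel backward through the candidate list looking for fragment starts.
    for start in range(len(candidates) - 1, -1, -1):
        position, sequence, style = candidates[start]
        if sequence[-1] != 1:
            continue  # rightmost entry is not the leading character
        fragment = [position]
        used.add(start)
        expected = sequence[:-1] + (sequence[-1] + 1,)
        # Travel forward collecting same-style titles that continue the sequence.
        for nxt in range(start + 1, len(candidates)):
            n_position, n_sequence, n_style = candidates[nxt]
            if nxt not in used and n_style == style and n_sequence == expected:
                fragment.append(n_position)
                used.add(nxt)
                expected = expected[:-1] + (expected[-1] + 1,)
        fragments.append(fragment)
    return fragments

candidates = [
    (0, (1,), "main"), (2, (2,), "main"), (4, (2, 1), "sub"),
    (10, (2, 2), "sub"), (16, (2, 2, 1), "minor"),
    (18, (2, 1), "other"),  # assumed style, not given in the source
    (19, (3,), "main"),
]
print(build_fragments(candidates))
```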

Continuing with the example of FIG. 3C and Table 1 above, the algorithm begins at paragraph 19 (336) and finds that the sequence characters of this candidate title end with 3, so it does not start a chain fragment. The next candidate title, counting backward from the end of Table 1, is paragraph 18 (335), whose sequence characters 2.1 end with 1. Thus, paragraph 18 (335) is selected as a chain fragment start. The algorithm then proceeds forward from paragraph 18 (335) toward the end of Table 1, searching for a candidate title with the next sequence characters 2.2 and the same style as paragraph 18 (335). However, no such candidate title is found in the list of Table 1, so chain fragment 1 has a single candidate title, as shown in Table 2 below.

TABLE 2

Chain fragment 1:

Similar to paragraph 18 in the list of Table 1, paragraph 16 is identified as a chain fragment start, from which chain fragment 2 with a single candidate title is generated, as shown in Table 3 below.

TABLE 3

Chain fragment 2:

Position	Family	Rank	Confidence	Text
16	Number	3	0.94	2.2.1. This is a minor heading.

The algorithm proceeds backward through the list of Table 1 and identifies paragraph 12 as a chain fragment start based on the "a" at the rightmost position of its sequence characters. The algorithm then moves forward in Table 1, searching for the next candidate title that shares the same family (lower case letters) and style and is next in sequence. Thus, paragraphs 13, 14, and 15 are included in chain fragment 3, as shown in Table 4 below.

TABLE 4

Chain fragment 3:

The algorithm again continues backward in the list of Table 1 and identifies paragraph 6 as a potential chain fragment start. Here, there are two possible interpretations of "i." in the candidate title. The first interpretation, "i." as a lower case letter, is not identified as a potential chain fragment start and is ignored. The second interpretation, "i." as a lower case Roman numeral, is identified as a potential chain fragment start and is used by the algorithm to proceed further. Thus, chain fragment 4 is generated using paragraph 6 as the chain fragment start, as shown in Table 5 below.

TABLE 5

Chain fragment 4:

Similarly, chain fragments 5 and 6 are generated as shown in Tables 6 and 7 below.

TABLE 6

Chain fragment 5:

Position	Family	Rank	Confidence	Text
4	Number	2	0.88	2.1. This is a subheading
10	Number	2	0.88	2.2. This is a second subheading

TABLE 7

Chain fragment 6:

The chain fragments are sorted by rank, as described above with reference to step 204 of FIG. 2A. In one or more embodiments, all chain fragments of level 1 are added to the fragment list first, then all chain fragments of level 2, then level 3, and so on. An example of the sorted fragment list 340 generated from ED B (330) is shown in FIG. 3D. As shown in FIG. 3D, the level 1 portion of fragment list 340 includes chain fragment 6 (346), chain fragment 4 (344), and chain fragment 3 (343); the level 2 portion includes chain fragment 5 (345) and chain fragment 1 (341); and the level 3 portion includes chain fragment 2 (342).

In addition, unqualified chain fragments are removed from the fragment list. Once all chain fragments have been constructed, the algorithm prunes chain fragments that are unlikely to form part of a larger sequence chain. In one or more embodiments, chain fragments that are identified as lists, or whose confidence falls below a certain threshold, are disqualified and removed from the fragment list.

If the "list probability" of a chain fragment reaches or exceeds a certain threshold, the chain fragment is identified as a list. The "list probability" is calculated as the ratio of the number of candidate titles in the chain fragment that are adjacent to another candidate title of the chain fragment to the total number of candidate titles in the chain fragment. For example, chain fragment 6 (346), composed of paragraphs 0, 2, and 19, has 0 adjacent candidate titles because 0, 2, and 19 are not adjacent paragraph positions. Thus, the "list probability" of chain fragment 6 (346) is 0/3 = 0. In another example, chain fragment 4 (344), which consists of paragraphs 6, 7, 8, and 9, has 4 adjacent candidate titles, and its "list probability" is 4/4 = 1. A chain fragment with a single candidate title is not considered for pruning based on "list probability" because there is not enough context to identify whether the chain fragment is an isolated title or a list element.

The confidence of a chain fragment is calculated as the average of the confidences of all candidate titles of the chain fragment. For example, the confidence of chain fragment 5 (345), which consists of paragraphs 4 and 10, is calculated to be (0.88 + 0.88)/2 = 0.88. Chain fragments with a confidence below a specified threshold are also pruned.

In one or more embodiments, a "list probability" threshold of 1 and a chain fragment confidence threshold of 0.8 are used. Thus, chain fragment 4 (344), which consists of paragraphs 6, 7, 8, and 9, chain fragment 3 (343), which consists of paragraphs 12, 13, 14, and 15, and chain fragment 1 (341), which consists of paragraph 18, are removed from fragment list 340 to generate pruned fragment list 350, as shown in FIG. 3E.
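
The pruning rules above can be sketched as follows, assuming that an "adjacent" candidate title is one whose paragraph position is directly next to another title of the same fragment. The function names are hypothetical. Where the source does not state a confidence value, a placeholder is used and marked as such.

```python
from statistics import mean

def list_probability(positions):
    """Share of titles whose paragraph is directly next to another title."""
    present = set(positions)
    adjacent = sum(1 for p in positions if p - 1 in present or p + 1 in present)
    return adjacent / len(positions)

def keep_fragment(titles, list_threshold=1.0, confidence_threshold=0.8):
    """titles: (position, confidence) pairs of one chain fragment."""
    positions = [position for position, _ in titles]
    confidences = [confidence for _, confidence in titles]
    # Single-title fragments are never pruned as lists.
    if len(titles) > 1 and list_probability(positions) >= list_threshold:
        return False
    return mean(confidences) >= confidence_threshold

print(list_probability([0, 2, 19]), list_probability([6, 7, 8, 9]))
print(keep_fragment([(4, 0.88), (10, 0.88)]))  # fragment 5 is kept
# Placeholder confidences: the list check fires before they are consulted.
print(keep_fragment([(6, 1.0), (7, 1.0), (8, 1.0), (9, 1.0)]))
```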

As an example of step 205 of FIG. 2A described above, a chapter title sequence is generated from the pruned fragment list by merging lower-level chain fragments into higher-level chain fragments. Starting with the lowest-level chain fragments, the possible parent chain fragments for each chain fragment are located. The best parent chain fragment is selected from all possible parent chain fragments, and the child chain fragment is merged into it. This process is then repeated at successively higher levels.

In one or more embodiments, merging chain segments at a particular level is based on the following process.

All chain fragments of a particular level are sorted by decreasing confidence so that the chain fragment with the highest confidence is processed first.

For each chain fragment in the above sorted list:

a. A list (potential_entries) of all parent chain fragments into which the chain fragment may potentially fit is generated. The set of parent chain fragments to be searched is one level higher than the level of the current chain fragment. Thus, for each parent chain fragment in the set of chain fragments one level higher, if the chain fragment potentially fits into the parent chain fragment, the parent chain fragment and the position of the appropriate parent title (parent_pos) are added to the list potential_entries. In other words, the parent title is the candidate title after which the child chain fragment can be inserted for merging. The function FitsWithin() used for this purpose is described in detail below.

b. For each fragment in potential_entries, the distance from parent_pos to the position of the first candidate title in the child chain fragment is identified, and the maximum distance is recorded as max_dist.

c. The best parent chain fragment in potential_entries is identified. This is done using a combination of proximity and chain fragment confidence. The function ScoreFit() is applied to each parent chain fragment to select the parent chain fragment with the highest score.

d. The chain fragment is merged into the best parent chain fragment. Specifically, each candidate title in the chain fragment is moved to the best parent chain fragment, and the now empty chain fragment is deleted.

Here is an example implementation of the function FitsWithin():

For each title in the parent chain fragment, the following steps are performed:

a. The next title (if any) after the current parent title in the parent chain fragment is identified and designated as next_header.

b. The placement_fit of the child chain fragment is determined. placement_fit is true if the first title position in the child chain fragment is greater than the current parent title position parent_pos and either 1) there is no next_header or 2) there is a next_header and the last title position in the child chain fragment is less than the position of next_header.

c. The sequence_fit of the child chain fragment is determined. sequence_fit is true if the sequence characters of the first title in the child chain fragment follow the sequence characters of the current parent title. For example, 2.3 and 2.2.1 both follow 2.2 and qualify as a sequence fit, while 2.2.2 and 2.4 do not follow 2.2 and do not qualify. The check of whether one candidate title follows another is handled by the function Follows() described later.

d. If both placement_fit and sequence_fit are true for the current parent title, then the parent fragment and the parent title position (parent_pos) after which the child fragment is to be inserted are identified, and the loop exits.

If a parent fragment has been identified, it is verified that the parent fragment does not contain a title whose sequence characters match the sequence characters of the first title in the child fragment. In other words, it is verified that the child chain fragment to be added is not already present in the parent chain fragment. If it is already present, or if no appropriate parent title is located, "NULL" is returned for the parent and "-1" is returned for parent_pos. Otherwise, a reference to the parent fragment and parent_pos is returned.
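
The steps of FitsWithin() can be sketched as below. The Title type, the tuple representation of sequence characters, and the compact follows() stand-in are assumptions made for illustration; the fragments used in the example are fragment 5 (345), fragment 2 (342), fragment 6 (346), and merged fragment A (361) of the worked example.

```python
from typing import NamedTuple

class Title(NamedTuple):
    position: int    # paragraph number in the ED
    sequence: tuple  # numeric level array, e.g. (2, 2, 1) for "2.2.1"

def follows(a, b):
    """Compact stand-in for Follows(): b increments an entry of a or appends a 1."""
    k = len(b) - 1
    if k < len(a):
        return b[:k] == a[:k] and b[k] == a[k] + 1
    return k == len(a) and b[:k] == a and b[k] == 1

def fits_within(child, parent):
    """Return (parent, parent_pos) if the child fits, else (None, -1)."""
    first, last = child[0], child[-1]
    for index, title in enumerate(parent):
        next_header = parent[index + 1] if index + 1 < len(parent) else None
        placement_fit = first.position > title.position and (
            next_header is None or last.position < next_header.position
        )
        sequence_fit = follows(title.sequence, first.sequence)
        if placement_fit and sequence_fit:
            # Reject the fit if the parent already holds this sequence.
            if any(t.sequence == first.sequence for t in parent):
                return None, -1
            return parent, title.position
    return None, -1

fragment_5 = [Title(4, (2, 1)), Title(10, (2, 2))]
fragment_2 = [Title(16, (2, 2, 1))]
print(fits_within(fragment_2, fragment_5)[1])  # parent_pos of the fit
```

Titles within a fragment are assumed to be kept in document order, so the first and last list elements give the first and last title positions.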

Here is an example implementation of the function ScoreFit():

A distance score is calculated from the distance between the child chain fragment and the parent title. For example, dist_score = 1.0 - (position of the first title in the child chain fragment - parent_pos) / max_dist.

confidence_score is calculated as the average confidence of all titles in the parent chain fragment.

The weighted average of dist_score and confidence_score is returned as final_score. For example, final_score = 0.75 * dist_score + 0.25 * confidence_score.
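
A minimal sketch of ScoreFit() with the example weights follows. The parameter names are hypothetical, and max_dist is assumed to be greater than zero, which holds whenever the child fragment lies after its parent title.

```python
def score_fit(child_first_pos, parent_pos, parent_confidences, max_dist,
              dist_weight=0.75, confidence_weight=0.25):
    """Weighted average of proximity and parent fragment confidence."""
    dist_score = 1.0 - (child_first_pos - parent_pos) / max_dist
    confidence_score = sum(parent_confidences) / len(parent_confidences)
    return dist_weight * dist_score + confidence_weight * confidence_score

# Fragment 2 (342) at position 16 against parent title 10 of fragment 5 (345).
print(score_fit(16, 10, [0.88, 0.88], max_dist=6))
```

When there is only one potential parent, max_dist equals that parent's own distance, so dist_score is 0 and the score reduces to the weighted parent confidence.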

Here is an example implementation of the function Follows(a, b), which determines whether title b follows title a:

An array of numeric levels is built for each of a and b. The size of each array is equal to the level of the title, and each entry in the array is the numeric equivalent of the corresponding sequence character. Here are a few examples of the numeric levels of some different sequence titles:

Initialize the Boolean found_an_increment to false.

Repeat for each entry in the numeric level array of b:

a. The position of the entry is identified and referred to as entry_num.

b. If found_an_increment is true, then false is returned. (Rationale: if an increment has already been found, there should be no more entries in the numeric level array of b. Example: 4.2.1.1 does not follow 4.2.)

c. If entry_num is less than the size of the numeric level array of a:

i. If the numeric level array of b at entry_num is less than the numeric level array of a at entry_num, then false is returned. (Example: 4.2.1 does not follow 4.2.3, since 1 is less than 3.)

ii. Set found_an_increment to true if the numeric level array of b at entry_num is equal to the numeric level array of a at entry_num plus 1. Otherwise, if the numeric level array of b at entry_num is not equal to the numeric level array of a at entry_num, false is returned. (Rationale: if the value of b is 1 greater than the corresponding value of a, then an increment is found. Otherwise, if the corresponding values are equal, simply proceed to the next entry in the numeric level array. Example: 4.2.2 follows 4.2.1.)

d. Otherwise, if entry_num is equal to the size of the numeric level array of a:

i. If the numeric level array of b at entry_num is equal to 1, found_an_increment is set to true. (Example: 4.2.1 follows 4.2.)

Return found_an_increment.
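
The steps of Follows(a, b) translate almost line for line into the sketch below. It assumes the numeric level arrays have already been built and are passed as tuples of integers, for example (4, 2, 1) for "4.2.1"; the conversion of letters and Roman numerals to numbers is not shown.

```python
def follows(a, b):
    """Return True if the title with level array b follows the one with level array a."""
    found_an_increment = False
    for entry_num, b_value in enumerate(b):
        if found_an_increment:
            return False  # b continues past the increment
        if entry_num < len(a):
            a_value = a[entry_num]
            if b_value < a_value:
                return False
            if b_value == a_value + 1:
                found_an_increment = True
            elif b_value != a_value:
                return False
        elif entry_num == len(a):
            # b is one level deeper than a: it must open that level with 1.
            if b_value == 1:
                found_an_increment = True
    return found_an_increment

print(follows((4, 2), (4, 2, 1)))     # one level deeper, starting at 1
print(follows((4, 2), (4, 2, 1, 1)))  # two levels deeper
```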

Continuing with the discussion of pruned fragment list 350 shown in FIG. 3E above, the process of merging chain fragments begins with all chain fragments at the lowest level, i.e., level 3, which contains only fragment 2 (342). All chain fragments at this level are sorted by decreasing confidence. Because there is only one chain fragment of level 3 (i.e., fragment 2 (342)), the sorting has no effect. The search for parent chain fragments then considers all chain fragments one level higher, i.e., level 2. In pruned fragment list 350, there is only one level 2 chain fragment (i.e., fragment 5 (345)). The function FitsWithin() is applied to determine whether level 3 fragment 2 (342), as a child fragment, fits within level 2 fragment 5 (345), as a parent fragment.

Inside FitsWithin(), each title in level 2 fragment 5 (345) is evaluated. The first title corresponds to paragraph 4 below.

Position	Family	Rank	Confidence	Text
4	Number	2	0.88	2.1. This is a subheading

For this title, placement_fit is false because there is a next title (position 10) in fragment 5 (345), and the last title position 16 in fragment 2 (342) is not less than position 10 of the next title. Furthermore, sequence_fit is false because sequence characters 2.2.1 in fragment 2 (342) do not follow sequence characters 2.1 in fragment 5 (345). Thus, the evaluation of FitsWithin() continues to the next title of level 2 fragment 5 (345). The next title corresponds to paragraph 10 below.

Position	Family	Rank	Confidence	Text
10	Number	2	0.88	2.2. This is a second subheading

For this title, placement_fit is true because there is no next title in fragment 5 (345) and the first title position 16 in child fragment 2 (342) is greater than the current parent title position 10 in parent fragment 5 (345). Further, sequence_fit is true because sequence characters 2.2.1 in child fragment 2 (342) follow sequence characters 2.2 in parent fragment 5 (345).

Finally, it is verified that no title with sequence characters 2.2.1 already exists in level 2 fragment 5 (345). Because the parent fragment does not already contain such a title, FitsWithin() returns fragment 5 (345) as the parent fragment and 10 as parent_pos, which are added to the list of potential parent fragments.

Because there is only one potential parent in the list, level 2 fragment 5 (345) is selected as the best parent for level 3 fragment 2 (342). Thus, level 3 fragment 2 (342) is merged into level 2 fragment 5 (345) to generate merged fragment list A (360) shown in FIG. 3F. As shown in FIG. 3F, merged fragment list A (360) includes level 1 fragment 6 (346) and level 2 merged fragment A (361). In particular, merged fragment A (361) is a combination of fragment 2 (342) and fragment 5 (345) of pruned fragment list 350.

No level 3 chain fragments remain, so the merging process is repeated a second time on merged fragment list A (360) for all chain fragments of level 2, which include only merged fragment A (361). According to the merging process, all chain fragments of level 2 are sorted by decreasing confidence. The sorting has no effect since there is only one chain fragment (i.e., merged fragment A (361)). The search for parent chain fragments then considers all chain fragments one level higher (level 1). In merged fragment list A (360), there is only one level 1 chain fragment (i.e., fragment 6 (346)). The function FitsWithin() is applied to determine whether level 2 merged fragment A (361), as a child fragment, fits within level 1 fragment 6 (346), as a parent fragment.

Inside FitsWithin(), each title in level 1 fragment 6 (346) is evaluated. The first title corresponds to paragraph 0 below.

Position	Family	Rank	Confidence	Text
0	Number	1	0.82	1. This is a main heading

For this title, placement_fit is false because there is a next title (position 2) in fragment 6 (346), and the last title position 16 in merged fragment A (361) is not less than position 2 of the next title. Further, sequence_fit is false because sequence characters 2.1 in merged fragment A (361) do not follow sequence character 1 in fragment 6 (346). Thus, the evaluation of FitsWithin() continues to the next title in level 1 fragment 6 (346). The next title corresponds to paragraph 2 below.

For this title, placement_fit is true because the first title position 4 in merged child fragment A (361) is greater than the current parent title position 2 in parent fragment 6 (346), and the last title position 16 in merged child fragment A (361) is less than the next title position 19 in parent fragment 6 (346). Further, sequence_fit is true because sequence characters 2.1 in merged child fragment A (361) follow sequence character 2 in parent fragment 6 (346).

Finally, it is verified that no title with sequence characters 2.1 already exists in level 1 fragment 6 (346). Because the parent fragment does not already include such a title, FitsWithin() returns fragment 6 (346) as the parent fragment and 2 as parent_pos, which are added to the list of potential parent fragments.

Because there is only one potential parent in the list, fragment 6 (346) is selected as the best parent for level 2 merged fragment A (361). Thus, level 2 merged fragment A (361) is merged into level 1 fragment 6 (346) to generate merged fragment list B (370) as shown in FIG. 3G. As shown in FIG. 3G, merged fragment list B (370) includes only level 1 merged fragment B (371). In particular, merged fragment B (371) is a combination of merged fragment A (361) and fragment 6 (346) of merged fragment list A (360).

The merging process is now complete, and merged fragment B (371) is identified as the chapter title sequence of ED B (330). From this information, chapters can be automatically identified as the text regions between chapter titles, and the overall nesting of chapters in the document can be identified from the rank information, allowing answers to queries such as "show me chapters about …".
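
Identifying chapters as the text regions between chapter titles can be sketched as follows, using the title positions of merged fragment B (371) and the 21 paragraphs of ED B (330). The function name section_ranges is hypothetical, and the sketch returns flat regions only; nesting by rank is not modeled.

```python
def section_ranges(title_positions, paragraph_count):
    """Return (start, end) paragraph ranges, end exclusive, one per title."""
    ordered = sorted(title_positions)
    ends = ordered[1:] + [paragraph_count]
    return list(zip(ordered, ends))

print(section_ranges([0, 2, 4, 10, 16, 19], 21))
```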

In various steps of the above examples, in one or more embodiments of the invention, inferred metadata is generated for the intermediate results. In particular, the inferred metadata includes representations of the candidate title list, the associated ranks and confidence levels, the chain fragment lists, the associated scores and parent/child relationships, and the like. In one or more embodiments of the invention, the inferred metadata is added to the ED and/or a parsed version of the ED.

Embodiments of the invention may be implemented on virtually any type of computing system regardless of the platform being used. For example, a computing system may be one or more mobile devices (e.g., laptop computers, smart phones, personal digital assistants, tablet computers, or other mobile devices), desktop computers, servers, blades (blades) in a server rack, or any other type of computing device or device that includes at least minimal processing power, memory, and input-output device(s) to perform one or more embodiments of the invention. For example, as shown in fig. 4, a computing system 400 may include one or more computer processors 402, associated memory 404 (e.g., Random Access Memory (RAM), cache memory, flash memory, etc.), one or more storage devices 406 (e.g., a hard disk, an optical drive such as a Compact Disk (CD) drive or Digital Versatile Disk (DVD) drive, a flash memory stick, etc.), and many other elements and functions. Computer processor(s) 402 may be an integrated circuit for processing instructions. For example, the computer processor(s) may be one or more cores or micro-cores of a processor. Computing system 400 may also include one or more input devices 410, such as a touch screen, keyboard, mouse, microphone, touch pad, electronic pen, or any other type of input device. In addition, computing system 400 may include one or more output devices 408, such as a screen (e.g., a Liquid Crystal Display (LCD), a plasma display, a touch screen, a Cathode Ray Tube (CRT) monitor, a projector, or other display device), a printer, an external storage, or any other output device. The one or more output devices may be the same or different than the input device(s). Computing system 400 may be connected to a network 412 (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) such as the internet, a mobile network, or any other type of network) via a network interface connection (not shown). 
The input and output device(s) may be connected to the computer processor(s) 402, memory 404, and storage device(s) 406, either locally or remotely (e.g., via network 412). Many different types of computing systems exist, and the aforementioned input and output device(s) may take other forms.

Software instructions in the form of computer-readable program code for carrying out embodiments of the invention may be stored in whole or in part, temporarily or permanently, on a non-transitory computer-readable medium such as a CD, DVD, a storage device, a floppy disk, a magnetic tape, a flash memory, a physical memory, or any other computer-readable storage medium. In particular, the software instructions may correspond to computer readable program code which, when executed by the processor(s), is configured to perform embodiments of the invention.

In addition, one or more of the elements of the aforementioned computing system 400 may be located at a remote location and connected to the other elements over a network 412. Furthermore, one or more embodiments of the invention may be implemented on a distributed system having multiple nodes, where each portion of the invention may be located on a different node within the distributed system. In one embodiment of the invention, a node corresponds to a distinct computing device. Alternatively, the node may correspond to a computer processor having associated physical memory. The node may alternatively correspond to a computer processor, or a micro-core of a computer processor, having shared memory and/or resources.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.
