Form detection in electronic forms
Reading note: This technology, Form detection in electronic forms (电子表单中的表格检测), was designed by 董浩宇, 韩石, 傅周宇, and 张冬梅 on 2018-06-29. Main content: The present disclosure relates to table detection in electronic forms. In accordance with implementations of the present disclosure, a scheme for determining a table in an electronic form is presented. In this scheme, a plurality of attributes of each of a plurality of cells included in the electronic form can be extracted. Then, based on the extracted plurality of attributes, respective features of the plurality of cells may be determined. Based on the features, the plurality of cells may be divided into at least one candidate region. Based on the at least one candidate region, at least one candidate table in the electronic form may be determined. With this scheme, the feature of each cell can be determined based on the respective attributes of the plurality of cells included in the electronic form, and the candidate regions in which a table may exist can then be determined based on the features of the cells.
1. A computer-implemented method, comprising:
extracting a plurality of attributes of each of a plurality of cells included in the electronic form;
determining features of each of the plurality of cells based on the extracted plurality of attributes;
dividing the plurality of cells into at least one candidate region based on the features; and
determining at least one candidate table in the electronic form based on the at least one candidate region.
2. The method of claim 1, wherein extracting the plurality of attributes for each of the plurality of cells included in the electronic form comprises: for a given cell of the plurality of cells, extracting a plurality of attributes of the given cell based on at least any one of:
characters of data in the given cell, a format of the data in the given cell, and a style of the given cell.
3. The method of claim 1, wherein dividing the plurality of cells into at least one candidate region comprises:
obtaining a mapping relationship between a table in an electronic form and features of a plurality of cells included in the table, the mapping relationship being trained based on the features of the plurality of cells included in a set of sample electronic forms and the plurality of tables included in the set of sample electronic forms; and
determining the at least one candidate region based on the mapping relationship and the respective features of the respective cells of the plurality of cells.
4. The method of claim 3, wherein determining the at least one candidate region further comprises:
for a given candidate region of the at least one candidate region,
adjusting the boundary of the given candidate region to update the given candidate region based on the degree of matching of the given candidate region with the mapping relation.
5. The method of claim 3, further comprising:
determining whether a potential error exists in a given candidate region of the at least one candidate region;
responsive to determining that there is a potential error in the given candidate region, retrieving at least one real table in the electronic form; and
updating the mapping relationship based on the at least one real table and the features of the plurality of cells in the electronic form.
6. The method of claim 5, further comprising: obtaining information describing a probability that a table is included in a candidate region of at least one candidate region obtained according to the mapping relationship, and wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining a probability of including a table in the given candidate region based on the information; and
determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
7. The method of claim 5, further comprising: obtaining information describing a probability that a cell in the electronic form is within a table, and wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining a probability that a cell in the given candidate region is within a table based on the information; and
determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
8. The method of claim 5, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining that a potential error exists in the given candidate region in response to an overlap between the given candidate region and another candidate region of the at least one candidate region.
9. The method of claim 5, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining that a potential error exists in the given candidate region in response to an edge of the given candidate region including at least any one of a blank row and a blank column.
10. The method of claim 5, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining whether a potential error exists in a given candidate region of the at least one candidate region based on a positional relationship between non-empty cells of the plurality of cells and the given candidate region.
11. An apparatus, comprising:
a processing unit; and
a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform acts comprising:
extracting a plurality of attributes of each of a plurality of cells included in the electronic form;
determining features of each of the plurality of cells based on the extracted plurality of attributes;
dividing the plurality of cells into at least one candidate region based on the features; and
determining at least one candidate table in the electronic form based on the at least one candidate region.
12. The apparatus of claim 11, wherein extracting the plurality of attributes for each of the plurality of cells included in the electronic form comprises: for a given cell of the plurality of cells, extracting a plurality of attributes of the given cell based on at least any one of:
characters of data in the given cell, a format of the data in the given cell, and a style of the given cell.
13. The apparatus of claim 11, wherein dividing the plurality of cells into at least one candidate region comprises:
obtaining a mapping relationship between a table in an electronic form and features of a plurality of cells included in the table, the mapping relationship being trained based on the features of the plurality of cells included in a set of sample electronic forms and the plurality of tables included in the set of sample electronic forms; and
determining the at least one candidate region based on the mapping relationship and the respective features of the respective cells of the plurality of cells.
14. The apparatus of claim 13, wherein determining the at least one candidate region further comprises:
for a given candidate region of the at least one candidate region,
adjusting the boundary of the given candidate region to update the given candidate region based on the degree of matching of the given candidate region with the mapping relation.
15. The apparatus of claim 13, further comprising:
determining whether a potential error exists in a given candidate region of the at least one candidate region;
responsive to determining that there is a potential error in the given candidate region, retrieving at least one real table in the electronic form; and
updating the mapping relationship based on the at least one real table and the features of the plurality of cells in the electronic form.
16. The apparatus of claim 15, further comprising: obtaining information describing a probability that a table is included in a candidate region of at least one candidate region obtained according to the mapping relationship, and wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining a probability of including a table in the given candidate region based on the information; and
determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
17. The apparatus of claim 15, further comprising: obtaining information describing a probability that a cell in the electronic form is within a table, and wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining a probability that a cell in the given candidate region is within a table based on the information; and
determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
18. The apparatus of claim 15, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining that a potential error exists in the given candidate region in response to an overlap between the given candidate region and another candidate region of the at least one candidate region.
19. The apparatus of claim 15, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining that a potential error exists in the given candidate region in response to an edge of the given candidate region including at least any one of a blank row and a blank column.
20. The apparatus of claim 15, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises:
determining whether a potential error exists in a given candidate region of the at least one candidate region based on a positional relationship between non-empty cells of the plurality of cells and the given candidate region.
21. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-10.
Background
With the advent of various electronic form (spreadsheet) editing tools, electronic forms have become an important data storage format in daily life. In particular, electronic forms have become the primary carrier of a wide variety of data in everyday workflows in industries such as banking, accounting, and statistics. An electronic form editing tool provides a flexible workspace, and the user of the editing tool can add one or more tables to the electronic form as desired.
However, different users may have their own preferences when creating electronic forms. For example, some users may insert a single table into an electronic form, while others may insert multiple tables and arrange them in the electronic form as they see fit. Each table may have a different size and location, and there may be one or more blank cells in the respective tables. How to detect the area occupied by each table in the electronic form has therefore become a research hotspot. Further, since subsequent processing of the electronic form depends to a large extent on accurately detecting the respective tables, it is desirable that the detection be performed with higher accuracy.
Disclosure of Invention
In accordance with implementations of the present disclosure, a scheme is provided for determining a table in an electronic form. In this scheme, a plurality of attributes of each of a plurality of cells included in the electronic form can be extracted. Then, based on the extracted plurality of attributes, respective features of the plurality of cells may be determined. Based on the features, the plurality of cells may be divided into at least one candidate region. Based on the at least one candidate region, at least one candidate table in the electronic form may be determined.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
FIG. 1 schematically illustrates a block diagram of a computing environment in which implementations of the present disclosure can be implemented;
FIG. 2 schematically illustrates a block schematic diagram of a scheme for detecting forms in an electronic form, according to one implementation of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a method for detecting forms in an electronic form, according to one implementation of the present disclosure;
FIG. 4 schematically illustrates a block diagram of a scheme for obtaining a mapping relationship, according to one implementation of the present disclosure;
FIG. 5 schematically illustrates a block diagram for adjusting a location of a boundary in a given candidate region, according to one implementation of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a method for updating a mapping relationship based on a detected error in a candidate region, according to one implementation of the present disclosure;
FIG. 7 schematically illustrates a block diagram for determining whether an error exists in a candidate region based on probabilities of whether individual cells in a spreadsheet are within a table according to one implementation of the present disclosure;
FIG. 8 schematically illustrates a block diagram for determining whether an error exists in a given candidate region based on whether the given candidate region overlaps with other candidate regions, according to one implementation of the present disclosure;
FIG. 9 schematically illustrates a block diagram for determining whether an error exists in a candidate region based on whether empty columns/rows are included in the candidate region, according to one implementation of the present disclosure;
FIG. 10A schematically illustrates a block diagram for determining whether an error exists in a candidate region based on blank cells included in the candidate region according to one implementation of the present disclosure; and
FIG. 10B schematically illustrates a block diagram for determining whether an error exists in a candidate region based on non-blank cells outside of the candidate region according to one implementation of the present disclosure.
In the drawings, the same or similar reference characters are used to designate the same or similar elements.
Detailed Description
The present disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only to enable those of ordinary skill in the art to better understand and thus implement the present disclosure, and are not intended to imply any limitation on the scope of the present subject matter.
As used herein, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to". The term "based on" is to be read as "based, at least in part, on". The terms "one implementation" and "an implementation" are to be read as "at least one implementation". The term "another implementation" is to be read as "at least one other implementation". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
Currently, several companies have developed tools for editing electronic forms. For example, in an electronic form editing tool from Microsoft Corporation, the user may add one or more tables to the spreadsheet. Because different users may have different preferences, users may add multiple tables to a page and distribute them at different locations in the spreadsheet. How to detect the position of each table in the spreadsheet has therefore become a research hotspot.
According to one technical solution, tables are detected based on the locations of blank cells and non-blank cells in an electronic form. However, since blank cells are allowed in a spreadsheet in practice, a table may contain a large number of blank cells, especially when it has not yet been filled with data. This solution therefore does not detect tables with high accuracy. According to another technical solution, a table is detected by identifying a title, a header, and a data part of the table. However, a table drawn by the user may have a complicated structure; for example, the title may be omitted, or a sub-table may exist in the table header. As a result, this solution cannot accurately detect tables in the electronic form either.
Accordingly, it is desirable to provide a solution for detecting tables in electronic forms in a convenient and efficient manner. It is further desirable that the solution be compatible with existing electronic form editing tools and achieve more convenient, faster, and more accurate table detection while changing the existing data storage format of electronic forms as little as possible.
Example Environment
The basic principles and several example implementations of the present disclosure are explained below with reference to the drawings. FIG. 1 illustrates a block diagram of a computing environment in which implementations of the present disclosure can be implemented.
As shown in FIG. 1,
In some embodiments, the
It will be appreciated that communication between the
Principle of operation
Hereinafter, the operation principle of the scheme of the present disclosure will be described in detail with reference to the accompanying drawings. In accordance with implementations of the present disclosure, a scheme is provided for detecting tables in an electronic form. FIG. 2 schematically illustrates a block diagram of such a scheme according to one implementation of the present disclosure. As shown in FIG. 2, the
For example, a user may edit individual cells in a
As shown in fig. 2, a plurality of attributes may be extracted from a given cell of a plurality of cells included in the
The plurality of cells may be divided into one or more candidate regions (e.g., 182 and 184) based on respective features of the cells of the plurality of cells. In this implementation, one or
It will be appreciated that multiple cells within a table may have some similarity. For example, in general, data in multiple cells within a table may have the same font, character size, and the same background color, among other attributes. With the above exemplary implementation, by extracting features of the respective cells and clustering the obtained features, one or more candidate regions in which a table may exist can be determined in a simple and efficient manner. Further, a table may be determined based on the determined one or more candidate regions.
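The grouping idea described above can be sketched roughly as follows. This is a minimal illustration only, not the scheme of the disclosure: it assumes that each cell already carries a feature tuple, that "similar" means exactly equal, and that cells are grouped by 4-neighbour adjacency. The names `candidate_regions` and `EMPTY` are hypothetical.

```python
from collections import deque

EMPTY = None  # hypothetical marker for a cell that carries no features


def candidate_regions(grid):
    """Group adjacent cells with equal features and return bounding boxes.

    Assumption for illustration: similarity is exact equality of feature
    tuples. Each box is (top, left, bottom, right), inclusive, zero-based.
    """
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] is EMPTY:
                continue
            feature = grid[r][c]
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill over equal-feature cells
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == feature):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box is only a candidate; in the disclosure, candidate regions are obtained from a trained mapping relationship rather than from exact feature equality.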
Example procedure
Hereinafter, a detailed operation flow of the method of the present disclosure will be described in detail with reference to fig. 3. FIG. 3 schematically illustrates a flow diagram of a
It will be understood that the
At
At
Then, at
Extracting attributes of cells
According to one exemplary implementation of the present disclosure, the attributes may include various aspects, and the plurality of attributes of the respective cells may be extracted based on at least any one of: characters of data in the corresponding cell, a format of the data in the corresponding cell, and a style of the corresponding cell. With the above exemplary implementation, aspects of the features of cells in an electronic form may be extracted. In this way, the accuracy of detecting forms in the
According to one exemplary implementation of the present disclosure, attributes of a cell may be extracted based on the characters of the data in the cell. In Table 1 below, the first column gives the serial number of each string-related attribute, the second column describes the attribute, and the third column gives its data type. Taking row 1 of Table 1 as an example, the first attribute indicates whether the character string of the data in the cell is empty. In the example of Table 1, if the string within a cell is empty (i.e., does not include any content), the attribute may be set to "0"; if the string is not empty (i.e., includes characters), the attribute may be set to "1". Similarly, row 2 of Table 1 represents the length of the character string in the cell, which may be represented as an integer. In rows 3-4 of Table 1, the percentage of digits or letters in the string may be represented by a value in [0, 1]. In rows 5-6 of Table 1, whether the character string includes "%" or "." may be represented in Boolean form.
TABLE 1. String-related attributes
Serial number | Description | Value type
1 | Whether the string is empty | Boolean
2 | Length of the string | Integer
3 | Percentage of digits in the string | [0, 1]
4 | Percentage of letters in the string | [0, 1]
5 | Whether the string includes a percent sign "%" | Boolean
6 | Whether the string includes a decimal point "." | Boolean
It will be appreciated that Table 1 above shows examples of attributes only schematically, and that string-related attributes may include more, fewer, or different items according to one exemplary implementation of the present disclosure.
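The string-related attributes listed in Table 1 can be computed as in the following minimal sketch. The function name `string_attributes` is hypothetical, and two details are assumptions: "digits" and "letters" are interpreted via Python's `str.isdigit` and `str.isalpha`, and both fractions are taken to be 0 for an empty string.

```python
def string_attributes(text):
    """Return the six string-related attributes of Table 1 for one cell.

    Assumptions for illustration: digits/letters follow str.isdigit and
    str.isalpha, and both fractions are 0.0 for an empty string.
    """
    length = len(text)
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return (
        1 if length > 0 else 0,               # 1: "1" if not empty, else "0"
        length,                               # 2: length of the string
        digits / length if length else 0.0,   # 3: fraction of digits
        letters / length if length else 0.0,  # 4: fraction of letters
        1 if "%" in text else 0,              # 5: includes a percent sign
        1 if "." in text else 0,              # 6: includes a decimal point
    )
```

For a cell containing the string "hello", this returns `(1, 5, 0.0, 1.0, 0, 0)`.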
In the following, examples of attributes associated with the data format in a cell are schematically shown in Table 2. In Table 2 below, the first column gives the serial number of each data format-related attribute, the second column describes the attribute, and the third column gives its data type. Taking row 1 of Table 2 as an example, the first attribute indicates whether the data format in the cell matches a predetermined template.
The templates herein may include multiple types, for example, fractional type templates may be denoted as ". x.", percent type templates may be denoted as ". x%", and so on. If the data format in the cell belongs to a certain type of template, the corresponding attribute may be set to "1"; otherwise, the corresponding attribute may be set to "0". Similarly, rows 2 and 3 of Table 2 indicate whether the data format in the cell matches the date template and the time template, respectively, row 4 indicates the length of the template that the data in the cell matches, and row 5 indicates whether the data in the cell includes a formula.
TABLE 2. Data format-related attributes
It will be appreciated that Table 2 above shows examples of attributes only schematically, and that data format-related attributes may include more, fewer, or different items according to one exemplary implementation of the present disclosure.
Hereinafter, an example of attributes associated with the style of a cell is schematically shown in Table 3. In Table 3 below, the first column gives the serial number of each style-related attribute, the second column describes the attribute, and the third column gives its data type. Taking row 1 of Table 3 as an example, the first attribute represents the background color of the cell. In this implementation, the color value may be quantized to a level between
TABLE 3. Style-related attributes
Serial number | Description | Value type
1 | Background color of the cell, e.g., white | Color value
2 | Color of the string in the cell, e.g., black | Color value
3 | Whether the string is bold | Boolean
4 | Whether the string is italicized | Boolean
5 | Whether the string is underlined | Boolean
6 | Whether there is a blank at the left boundary | Boolean
7 | Whether there is a blank at the right boundary | Boolean
8 | Whether there is a blank at the upper boundary | Boolean
9 | Whether there is a blank at the lower boundary | Boolean
10 | Whether the cell is merged with horizontally adjacent cells | Boolean
11 | Whether the cell is merged with vertically adjacent cells | Boolean
Determining characteristics of cells
Specific examples of how to extract attributes for a given cell have been described above with reference to tables 1 to 3. Hereinafter, how to determine the features of the cells based on the extracted attributes will be described in detail.
The string-related attributes of the cells may be determined based on the descriptions in Table 1. For example, assuming that the string "hello" is included within a given cell, the string-related attributes of that cell may be represented as vector 1: (1, 5, 0, 1, 0, 0). This vector indicates that the character string in the given cell is not empty, has a length of 5, includes no digits, consists entirely of letters, and includes neither a percent sign nor a decimal point.
The data format-related attributes of the cells may be determined based on the descriptions in table 2. For example, continuing the example above for the string "hello", the data format related attribute for that cell may be represented as vector 2: (0,0,0,0,0). The attribute indicates that the data included within a given cell does not match any type of template, the length of the matched template is 0, and no formula is included.
The style-related attributes of the cells may be determined based on the description in Table 3. Suppose that the cell containing the string "hello" has a white background and black text in a regular font, has no blank at any boundary, and is not merged with any other cell. The style-related attributes of the cell may then be represented as vector 3: (4, 96, 0, 0, 0, 0, 0, 0, 0, 0, 0).
According to one exemplary implementation of the present disclosure, the above vectors 1-3 may be combined to obtain a feature vector of a cell. For example, vectors 1-3 can be concatenated, and a weight can be set for each dimension of the vectors. For simplicity of description, in the following, the feature vector of a cell is determined based only on the following three attributes: whether the character string is empty, whether the data format matches the numeric template, and the background color of the cell. The feature vector of the given cell may then be represented as (1, 0, 4). It will be appreciated that although the above is merely an example of determining a feature vector based on attributes included in Tables 1-3, according to one exemplary implementation of the present disclosure, a feature vector may also be determined based on other attributes not included in Tables 1-3.
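The combination step can be illustrated with a short sketch that concatenates the three attribute vectors of the "hello" example and optionally scales each dimension by a weight. The function name `cell_feature_vector` and the element-wise multiplication used for weighting are assumptions; the disclosure does not specify how the weights are applied.

```python
def cell_feature_vector(string_attrs, format_attrs, style_attrs, weights=None):
    """Concatenate three attribute vectors into one feature vector.

    Assumption for illustration: if weights are given, each dimension is
    multiplied by its weight (one weight per dimension).
    """
    combined = list(string_attrs) + list(format_attrs) + list(style_attrs)
    if weights is None:
        return tuple(combined)
    if len(weights) != len(combined):
        raise ValueError("expected one weight per dimension")
    return tuple(w * v for w, v in zip(weights, combined))


# Vectors 1-3 for the cell containing "hello", as given in the text.
vector_1 = (1, 5, 0, 1, 0, 0)                  # string-related attributes
vector_2 = (0, 0, 0, 0, 0)                     # data format-related attributes
vector_3 = (4, 96, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # style-related attributes

feature = cell_feature_vector(vector_1, vector_2, vector_3)

# The simplified three-attribute vector picks the first entry of each part.
simplified = (feature[0], feature[6], feature[11])
```

Here `feature` has 22 dimensions and `simplified` equals `(1, 0, 4)`, matching the simplified feature vector used in the text.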
Determining candidate regions
According to an exemplary implementation of the present disclosure, the one or
According to one example implementation of the present disclosure, the mapping relationship may be trained based on features of a plurality of cells included in a set of sample electronic forms and a plurality of tables included in a set of sample electronic forms. Hereinafter, how to obtain the mapping relationship will be described with reference to fig. 4. Fig. 4 schematically illustrates a block diagram 400 of a scheme for obtaining a
It will be understood that the specific type of
According to an exemplary implementation of the present disclosure, the
Having obtained
According to one exemplary implementation of the present disclosure, for a given candidate region of the one or more candidate regions, the boundaries of the candidate region may be adjusted based on
According to one exemplary implementation of the present disclosure, where candidate regions have been determined, adjustments may also be made for respective boundaries of the candidate regions. How to adjust the upper boundary of the candidate region will be described hereinafter with reference to fig. 5. Fig. 5 schematically illustrates a block diagram 500 for adjusting the position of a boundary in a given
As shown in fig. 5, an adjustment range 520 (a block shown by a dotted line) may be provided around the
In this implementation, the height of the
It will be appreciated that although only schematically shown above how the upper boundary of the
Updating of mapping relationships
How to determine the candidate region based on the
In particular, fig. 6 schematically illustrates a block diagram of a
When a potential error is found in a given candidate region, then operational flow proceeds to block 620 to obtain one or more real forms in the
It will be appreciated that the form used as the training input at this time is a manually accurately labeled real form and may reflect the reality of the form within the spreadsheet. By updating
Detecting potential errors in candidate regions
In the context of the present disclosure, potential errors in a candidate region may be determined based on a variety of factors. According to one exemplary implementation of the present disclosure, whether a potential error exists may be determined based on probability information included in the knowledge model. In this implementation, the probability information describes a probability that a table is included in a candidate region of the one or more candidate regions obtained according to the mapping in
In this implementation, the probability information is obtained based on training of the sample spreadsheet, and the probability information may predict the probability that a table exists in a candidate region in the
Specifically, the probability information may include setting a table score (table score) for the given
When the features of the plurality of cells of the
According to an exemplary implementation of the present disclosure, template information included in the knowledge model describing whether a cell in the
In this implementation, the probability that a cell in a given candidate region is located within a table may be determined based on the template information. If the probability is below a predetermined threshold probability, it is determined that a potential error exists in the given candidate region. As shown in FIG. 7, for a candidate region 720, the probability values of the cells within the candidate region 720 may be averaged to determine whether a potential error exists in the candidate region 720. Here, the average for the candidate region 720 is (0.85 + 0.85 + … + 0.95)/12 = 0.941, which is smaller than the predetermined threshold probability of 0.95. It may therefore be considered that a potential error exists in the candidate region 720. In this manner, whether there is a potential error in the candidate region 720 may be determined more accurately.
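The averaging check can be sketched as follows. The function name `has_potential_error` is hypothetical, and the probability values in the usage example are made up for illustration; they are not the values of FIG. 7, which the text only partially lists.

```python
def has_potential_error(cell_probabilities, threshold=0.95):
    """Flag a candidate region whose mean in-table probability is too low.

    cell_probabilities holds, for each cell of the region, the probability
    that the cell lies within a table. The default threshold of 0.95 is the
    example threshold mentioned in the text.
    """
    if not cell_probabilities:
        raise ValueError("candidate region contains no cells")
    mean = sum(cell_probabilities) / len(cell_probabilities)
    return mean < threshold


# Hypothetical 12-cell region: two weaker cells pull the mean below 0.95.
example = [0.85, 0.85] + [0.96] * 10
flagged = has_potential_error(example)
```

With these made-up values the region is flagged, because the mean falls below the threshold even though most individual cells score above it.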
According to one exemplary implementation of the present disclosure, a potential error is determined to exist in a given candidate region if there is an overlap between the given candidate region and another candidate region of the one or more candidate regions. It will be appreciated that conventional experience shows that there is no overlap between two forms in an electronic form. If there is an overlap between two candidate regions divided in the electronic form, it is reasonable to assume that there may be an error in the
With the above-described exemplary implementation, by manually labeling a real table in the electronic form 710 that results in an average value less than a predetermined threshold probability, and updating the
Fig. 8 schematically illustrates a block diagram 800 for determining whether an error exists in a candidate region based on whether one candidate region overlaps with other candidate regions according to one implementation of the disclosure. As shown in fig. 8,
With the above-described exemplary implementation, by manually labeling a real table in the
According to one exemplary implementation of the present disclosure, it is determined that a potential error exists in a given candidate region if a boundary (edge) portion of the given candidate region includes at least one of a blank row and a blank column. It will be appreciated that, in practice, the boundary portions of a table typically do not contain blank rows or blank columns. If there is a blank row or blank column in the boundary portion of a candidate region obtained by the division, it is reasonable to consider that there may be an error in the
Fig. 9 schematically illustrates a block diagram 900 for determining whether an error exists in a
With the above-described exemplary implementation, by manually labeling a real table in the
According to an exemplary implementation of the present disclosure, whether a potential error exists in a given candidate region of the one or more candidate regions may also be determined based on a positional relationship between non-empty (non-blank) cells of the plurality of cells and the given candidate region. In this implementation, non-empty cells within the candidate region and non-empty cells outside the candidate region are considered separately.
According to one exemplary implementation of the present disclosure, if the proportion of non-empty cells within a given candidate region is below a predetermined threshold, it may be determined that a potential error exists. In other words, it is determined that a potential error exists in a given candidate region if the ratio of the number of blank cells within the given candidate region to the total number of cells within the given candidate region is above a predetermined threshold ratio.
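The blank-cell ratio test can be illustrated with a short sketch. It assumes the spreadsheet is available as a list of rows and that a candidate region is an inclusive, zero-based `(top, left, bottom, right)` tuple. These representations, the helper names, and the 0.2 default ratio are assumptions made for illustration only.

```python
def is_blank(value):
    """Treat None and whitespace-only strings as blank cells."""
    return value is None or str(value).strip() == ""


def blank_ratio(grid, region):
    """Fraction of blank cells inside an inclusive (top, left, bottom, right) region."""
    top, left, bottom, right = region
    cells = [grid[r][c]
             for r in range(top, bottom + 1)
             for c in range(left, right + 1)]
    return sum(1 for v in cells if is_blank(v)) / len(cells)


def is_too_sparse(grid, region, max_blank_ratio=0.2):
    """Flag the region when its blank ratio exceeds the (illustrative) threshold."""
    return blank_ratio(grid, region) > max_blank_ratio


grid = [["a", None, "b"],
        ["",  "c",  "d"]]
sparse = is_too_sparse(grid, (0, 0, 1, 2))  # 2 of the 6 cells are blank
```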
Fig. 10A schematically illustrates a block diagram 1000A for determining whether an error exists in a candidate region 1020A based on blank cells included in the candidate region 1020A according to one implementation of the present disclosure. In fig. 10A, the blank legend indicates blank cells and the shaded legend indicates non-blank cells. The electronic form 1010A may be processed as described above to obtain a candidate region 1020A. As shown in fig. 10A, the candidate region 1020A includes a large number of blank cells. Typically, a table includes only a small number of blank cells, e.g., less than a predetermined percentage (e.g., 20%, or another value) of the total number of cells. If the ratio of the number of blank cells to the total number of cells in the candidate region 1020A is above a predetermined threshold, it may be assumed that there is an error in the
With the above-described exemplary implementation, by manually labeling a real table in the electronic form 1010A and updating the
According to one exemplary implementation of the present disclosure, it is determined that a potential error exists in a given candidate region if a number of non-empty cells located outside of one or more candidate regions reaches a predetermined threshold.
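A sketch of this check, under the same illustrative assumptions (a list-of-rows grid and inclusive, zero-based `(top, left, bottom, right)` regions), might count the non-empty cells that no candidate region covers. The function names and the default threshold of 3 are hypothetical.

```python
def is_blank(value):
    """Treat None and whitespace-only strings as blank cells."""
    return value is None or str(value).strip() == ""


def count_uncovered_non_empty(grid, regions):
    """Count non-empty cells that lie outside every candidate region."""
    def covered(r, c):
        return any(top <= r <= bottom and left <= c <= right
                   for top, left, bottom, right in regions)

    return sum(1
               for r, row in enumerate(grid)
               for c, value in enumerate(row)
               if not is_blank(value) and not covered(r, c))


def has_stray_content(grid, regions, min_outside=3):
    """Flag a potential error once the count reaches the (illustrative) threshold."""
    return count_uncovered_non_empty(grid, regions) >= min_outside


grid = [["x"] * 3 for _ in range(3)]
stray = has_stray_content(grid, [(0, 0, 1, 1)])  # 5 non-empty cells lie outside
```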
Fig. 10B schematically illustrates a block diagram 1000B for determining whether an error exists in a candidate region based on non-blank cells outside of the candidate region according to one implementation of the present disclosure. In fig. 10B, the blank legend indicates blank cells and the shaded legend indicates non-blank cells. The
With the above-described exemplary implementation, by manually labeling a real table in the
Example implementations
Some example implementations of the present disclosure are listed below.
In one aspect, the present disclosure provides a computer-implemented method. The method comprises the following steps: extracting a plurality of attributes of each of a plurality of cells included in the electronic form; determining respective features of the plurality of cells based on the extracted plurality of attributes; dividing the plurality of cells into at least one candidate region based on the features; and determining at least one candidate form in the electronic form based on the at least one candidate area.
According to one exemplary implementation of the present disclosure, extracting a plurality of attributes of each of a plurality of cells included in an electronic form includes: for a given cell of the plurality of cells, extracting a plurality of attributes of the given cell based on at least any one of: the characters of the data in a given cell, the format of the data in a given cell, and the style of a given cell.
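As a rough illustration of attribute extraction from the three sources named above (characters, format, and style), the following sketch uses a hypothetical `Cell` record. The field names and the specific attributes computed are assumptions chosen for the example; the disclosure does not prescribe them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """Hypothetical cell record; all field names are assumptions."""
    text: str = ""                     # displayed content of the cell
    number_format: str = "General"     # data format string
    bold: bool = False                 # style: bold font
    has_border: bool = False           # style: any border drawn
    fill_color: Optional[str] = None   # style: background fill, if any


def _is_numeric(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def extract_attributes(cell: Cell) -> dict:
    """Collect a few illustrative attributes for one cell."""
    text = cell.text.strip()
    digits = sum(ch.isdigit() for ch in text)
    return {
        # Attributes based on the characters of the data.
        "length": len(text),
        "digit_ratio": digits / len(text) if text else 0.0,
        "is_numeric": _is_numeric(text) if text else False,
        # Attribute based on the format of the data.
        "is_percent_format": "%" in cell.number_format,
        # Attributes based on the style of the cell.
        "bold": cell.bold,
        "has_border": cell.has_border,
        "has_fill": cell.fill_color is not None,
    }


attrs = extract_attributes(Cell(text="12.5", number_format="0.00%", bold=True))
```

Such a dictionary could then be turned into a feature vector for each cell before the cells are divided into candidate regions.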
According to one exemplary implementation of the present disclosure, dividing the plurality of cells into at least one candidate region includes: obtaining a mapping relation between a table in the electronic form and features of a plurality of cells included in the table, wherein the mapping relation is obtained through training based on the features of the plurality of cells included in a group of sample electronic forms and the plurality of tables included in the group of sample electronic forms; and determining at least one candidate region based on the mapping relationship and the respective features of the respective cells of the plurality of cells.
According to an exemplary implementation of the present disclosure, determining at least one candidate region further comprises: for a given candidate region of the at least one candidate region, adjusting the boundary of the given candidate region to update the given candidate region based on the degree of matching of the given candidate region with the mapping relationship.
According to an exemplary implementation of the present disclosure, the method further comprises: determining whether a potential error exists in a given candidate region of the at least one candidate region; in response to determining that a potential error exists in the given candidate region, obtaining at least one real table in the electronic form; and updating the mapping relationship based on the at least one real table and the features of the plurality of cells in the electronic form.
According to an exemplary implementation of the present disclosure, the method further comprises: obtaining information describing a probability that a table is included in a candidate region of the at least one candidate region obtained according to the mapping relationship, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining, based on the information, a probability that a table is included in the given candidate region; and determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
According to an exemplary implementation of the present disclosure, the method further comprises: obtaining information describing a probability that a cell in the electronic form is within a table, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining, based on the information, a probability that a cell in the given candidate region is located within a table; and determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
According to one exemplary implementation of the present disclosure, determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining that a potential error exists in the given candidate region in response to there being an overlap between the given candidate region and another candidate region of the at least one candidate region.
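The overlap test can be sketched with a standard rectangle-intersection check. The representation of a candidate region as an inclusive, zero-based `(top, left, bottom, right)` tuple and the function names are assumptions made for illustration.

```python
def regions_overlap(a, b):
    """True if two inclusive (top, left, bottom, right) regions share a cell."""
    a_top, a_left, a_bottom, a_right = a
    b_top, b_left, b_bottom, b_right = b
    return (a_top <= b_bottom and b_top <= a_bottom and
            a_left <= b_right and b_left <= a_right)


def overlapping_region_indices(regions):
    """Indices of all candidate regions that overlap at least one other region."""
    flagged = set()
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            if regions_overlap(regions[i], regions[j]):
                flagged.update((i, j))
    return sorted(flagged)


# The first two regions share cell (2, 2); the third is disjoint from both.
regions = [(0, 0, 2, 2), (2, 2, 4, 4), (6, 6, 7, 7)]
flagged = overlapping_region_indices(regions)
```

The pairwise loop is quadratic in the number of regions, which is typically acceptable when a spreadsheet yields only a handful of candidate regions.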
According to one exemplary implementation of the present disclosure, determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining that a potential error exists in the given candidate region in response to the edge of the given candidate region including at least any one of a blank row and a blank column.
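A minimal sketch of the edge check follows, assuming a list-of-rows grid and an inclusive, zero-based `(top, left, bottom, right)` region. It inspects only the outermost rows and columns of the region; that choice, and the helper names, are illustrative assumptions.

```python
def is_blank(value):
    """Treat None and whitespace-only strings as blank cells."""
    return value is None or str(value).strip() == ""


def edge_has_blank_line(grid, region):
    """True if the first or last row, or the first or last column, of the
    region consists entirely of blank cells."""
    top, left, bottom, right = region
    edge_rows = [[grid[r][c] for c in range(left, right + 1)]
                 for r in (top, bottom)]
    edge_cols = [[grid[r][c] for r in range(top, bottom + 1)]
                 for c in (left, right)]
    return any(all(is_blank(v) for v in line)
               for line in edge_rows + edge_cols)


grid = [["",  "",  ""],
        ["a", "b", "c"],
        ["d", "e", "f"]]
blank_edge = edge_has_blank_line(grid, (0, 0, 2, 2))  # top row is blank
```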
According to one exemplary implementation of the present disclosure, determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining whether a potential error exists in a given candidate region of the at least one candidate region based on a positional relationship between non-empty cells of the plurality of cells and the given candidate region.
In another aspect, the present disclosure provides a computer-implemented apparatus. The apparatus comprises: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform the following actions. The actions include: extracting a plurality of attributes of each of a plurality of cells included in the electronic form; determining respective features of the plurality of cells based on the extracted plurality of attributes; dividing the plurality of cells into at least one candidate region based on the features; and determining at least one candidate form in the electronic form based on the at least one candidate region.
According to one exemplary implementation of the present disclosure, extracting a plurality of attributes of each of a plurality of cells included in an electronic form includes: for a given cell of the plurality of cells, extracting a plurality of attributes of the given cell based on at least any one of: the characters of the data in a given cell, the format of the data in a given cell, and the style of a given cell.
According to one exemplary implementation of the present disclosure, dividing the plurality of cells into at least one candidate region includes: obtaining a mapping relation between a table in the electronic form and features of a plurality of cells included in the table, wherein the mapping relation is obtained through training based on the features of the plurality of cells included in a group of sample electronic forms and the plurality of tables included in the group of sample electronic forms; and determining at least one candidate region based on the mapping relationship and the respective features of the respective cells of the plurality of cells.
According to an exemplary implementation of the present disclosure, determining at least one candidate region further comprises: for a given candidate region of the at least one candidate region, adjusting the boundary of the given candidate region to update the given candidate region based on the degree of matching of the given candidate region with the mapping relationship.
According to an exemplary implementation of the present disclosure, the actions further include: determining whether a potential error exists in a given candidate region of the at least one candidate region; in response to determining that a potential error exists in the given candidate region, obtaining at least one real table in the electronic form; and updating the mapping relationship based on the at least one real table and the features of the plurality of cells in the electronic form.
According to an exemplary implementation of the present disclosure, the actions further include: obtaining information describing a probability that a table is included in a candidate region of the at least one candidate region obtained according to the mapping relationship, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining, based on the information, a probability that a table is included in the given candidate region; and determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
According to an exemplary implementation of the present disclosure, the actions further include: obtaining information describing a probability that a cell in the electronic form is within a table, wherein determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining, based on the information, a probability that a cell in the given candidate region is located within a table; and determining that a potential error exists in the given candidate region in response to the probability being below a predetermined threshold probability.
According to one exemplary implementation of the present disclosure, determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining that a potential error exists in the given candidate region in response to there being an overlap between the given candidate region and another candidate region of the at least one candidate region.
According to one exemplary implementation of the present disclosure, determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining that a potential error exists in the given candidate region in response to the edge of the given candidate region including at least any one of a blank row and a blank column.
According to one exemplary implementation of the present disclosure, determining whether a potential error exists in a given candidate region of the at least one candidate region comprises: determining whether a potential error exists in a given candidate region of the at least one candidate region based on a positional relationship between non-empty cells of the plurality of cells and the given candidate region.
In yet another aspect, the present disclosure provides a non-transitory computer storage medium including machine executable instructions that, when executed by a device, cause the device to perform the method of any of the above aspects.
In yet another aspect, the present disclosure provides a computer program product tangibly stored in a non-transitory computer storage medium and comprising machine executable instructions that, when executed by a device, cause the device to perform the method of any of the above aspects.
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.