PDF (Portable Document Format) table extraction method, device, terminal, and computer-readable storage medium

Document No.: 1544833. Publication date: 2020-01-17.

Abstract: This technology, "PDF table extraction method, device, terminal, and computer-readable storage medium", was designed by 侯丽 on 2019-08-23. The invention discloses a PDF table extraction method comprising: acquiring a target PDF and parsing the target PDF to obtain table data; obtaining the length and width attributes and the position attributes of each cell in each table from the table data; acquiring the cell type of each cell according to its length and width attribute and position attribute; obtaining the vertex coordinates of each cell according to the cell type, the length and width attribute, and the position attribute; and generating a table in the corresponding format according to the vertex coordinates of each cell. The invention also provides a PDF table extraction device, a terminal, and a computer-readable storage medium. Because the table in the PDF is extracted on the basis of data query, the format of the generated table is consistent with that of the table in the PDF, which ensures the accuracy of table extraction.

1. A PDF table extraction method, characterized by comprising the following steps:

acquiring a target PDF, and analyzing the target PDF to obtain table data;

obtaining the length and width attributes and the position attributes of each cell in each table from the table data;

acquiring the cell type of each cell according to the length and width attribute and the position attribute of each cell;

obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute and the position attribute;

and generating a table with a corresponding format according to the vertex coordinates of each cell.

2. The PDF table extraction method according to claim 1, wherein the cell types include a first type, a second type, a third type, and a fourth type, and the step of acquiring the cell type of each cell according to the length and width attribute and the position attribute of each cell comprises:

determining the first cell among all cells before the first cell line break to be of the first type;

determining the cells other than the first cell among all cells before the first cell line break to be of the second type;

determining the first cell between two adjacent cell line breaks to be of the third type or the fourth type, as applicable;

and determining the cells other than the first cell between two adjacent cell line breaks to be of the fourth type.

3. The PDF table extraction method according to claim 2, wherein the step of determining the first cell between two adjacent cell line breaks to be of the third type or the fourth type comprises:

acquiring the number of cell rows before the first cell between the two adjacent cell line breaks;

if the number of cell rows is one, judging whether the widths of the cells before the first cell between the two adjacent cell line breaks are equal;

if the widths of the cells before the first cell between the two adjacent cell line breaks are equal, determining the first cell between the two adjacent cell line breaks to be of the third type;

and if the widths of the cells before the first cell between the two adjacent cell line breaks are not equal, determining the first cell between the two adjacent cell line breaks to be of the fourth type.

4. The PDF table extraction method according to claim 3, wherein after the step of acquiring the number of cell rows before the first cell between the two adjacent cell line breaks, the method further comprises:

if the number of cell rows is greater than one, starting from the row immediately preceding the first cell between the two adjacent cell line breaks, traversing the rows and comparing the width of the first cell with the width of the second cell of each row to judge whether they are equal;

if the width of the first cell is equal to the width of the second cell in every row, determining the first cell between the two adjacent cell line breaks to be of the third type;

if there is an unequal row, in which the width of the first cell differs from the width of the second cell, stopping the traversal, summing the width of the second cell of the unequal row and the widths of the first cells of the rows after the unequal row, and comparing the sum with the width of the first cell of the unequal row, wherein the rows after the unequal row are all rows located after the unequal row and before the first cell between the two adjacent cell line breaks;

if the sum is greater than or equal to the width of the first cell of the unequal row, determining the first cell between the two adjacent cell line breaks to be of the third type;

and if the sum is smaller than the width of the first cell of the unequal row, determining the first cell between the two adjacent cell line breaks to be of the fourth type.

5. The PDF table extraction method according to claim 4, wherein the step of obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute, and the position attribute comprises:

acquiring a preset standard cell length and a preset standard cell width;

establishing a two-dimensional coordinate system with the upper left corner of the table as the coordinate origin, the row direction of the table as the positive direction of the X axis, the column direction of the table as the positive direction of the Y axis, the standard cell length as the unit length of the X axis, and the standard cell width as the unit length of the Y axis;

and obtaining the vertex coordinates of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute, and the position attribute.

6. The PDF table extraction method according to claim 5, wherein the step of obtaining the vertex coordinates of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute, and the position attribute comprises:

for a cell of the first type, obtaining the coordinates of the upper left vertex as (0, 0), the coordinates of the upper right vertex as (length of the cell / standard cell length, 0), the coordinates of the lower left vertex as (0, width of the cell / standard cell width), and the coordinates of the lower right vertex as (length of the cell / standard cell length, width of the cell / standard cell width);

for a cell of the second type, obtaining the coordinates of the upper left vertex as (sum of the lengths of all cells on the left side of the cell / standard cell length, 0), the coordinates of the upper right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, 0), the coordinates of the lower left vertex as (0, width of the cell / standard cell width), and the coordinates of the lower right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, width of the cell / standard cell width);

for a cell of the third type, obtaining the coordinates of the upper left vertex as (0, sum of the widths of all cells above the cell / standard cell width), the coordinates of the upper right vertex as (length of the cell / standard cell length, sum of the widths of all cells above the cell / standard cell width), the coordinates of the lower left vertex as (0, {sum of the widths of all cells above the cell + width of the cell} / standard cell width), and the coordinates of the lower right vertex as (length of the cell / standard cell length, {sum of the widths of all cells above the cell + width of the cell} / standard cell width);

and for a cell of the fourth type, obtaining the coordinates of the upper left vertex as (sum of the lengths of all cells on the left side of the cell / standard cell length, sum of the widths of all cells above the cell / standard cell width), the coordinates of the upper right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, sum of the widths of all cells above the cell / standard cell width), the coordinates of the lower left vertex as (sum of the lengths of all cells on the left side of the cell / standard cell length, {sum of the widths of all cells above the cell + width of the cell} / standard cell width), and the coordinates of the lower right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, {sum of the widths of all cells above the cell + width of the cell} / standard cell width).

7. The PDF table extraction method according to any one of claims 1 to 6, wherein the step of acquiring a target PDF and parsing the target PDF to obtain table data comprises:

acquiring a target PDF;

analyzing the target PDF to obtain a corresponding byte stream;

and identifying a table identifier in the byte stream, and extracting the byte stream corresponding to the table identifier as table data.

8. A PDF table extraction device, comprising:

the analysis module is used for acquiring a target PDF and analyzing the target PDF to acquire table data;

an obtaining module, configured to obtain, from the table data, a length-width attribute and a position attribute of each cell in each table;

the obtaining module is used for obtaining the cell type of each cell according to the length and width attribute and the position attribute of each cell;

the calculation module is used for obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute and the position attribute;

and the generating module is used for generating a table with a corresponding format according to the vertex coordinates of each cell.

9. A terminal comprising a processor, a memory, and a PDF table extraction program stored on the memory and executable by the processor, wherein the PDF table extraction program, when executed by the processor, implements the steps of the PDF table extraction method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a PDF table extraction program stored thereon, wherein the PDF table extraction program, when executed by a processor, implements the steps of the PDF table extraction method according to any one of claims 1 to 7.

Technical Field

The present invention relates to the field of table extraction technologies, and in particular, to a method, an apparatus, a terminal, and a computer-readable storage medium for extracting a PDF table.

Background

At present, PDF supports two encoding modes, ASCII and binary. The document structure is a tree whose root node is also the root object of the PDF file, and the root node has four subtrees: the page tree, the bookmark tree, the thread tree, and the name tree.

The conventional method for parsing an Excel table in a PDF is to read it with PDFBox. However, the table information obtained in this way is often displayed out of order, especially for merged cells: PDFBox cannot recognize that content separated by very long runs of spaces belongs to a particular column of a particular row. As a result, the table extracted from the PDF is inconsistent with the original table in the PDF; that is, the accuracy of the extracted table is low.

Therefore, the low accuracy of tables extracted by existing PDF table extraction methods is a problem that urgently needs to be solved.

Disclosure of Invention

The main object of the invention is to provide a PDF table extraction method, a PDF table extraction device, a terminal, and a computer-readable storage medium, so as to solve the technical problem that existing PDF table extraction methods have low accuracy.

In order to achieve the above object, the present invention provides a PDF table extraction method, including:

acquiring a target PDF, and analyzing the target PDF to obtain table data;

obtaining the length and width attributes and the position attributes of each cell in each table from the table data;

acquiring the cell type of each cell according to the length and width attribute and the position attribute of each cell;

obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute and the position attribute;

and generating a table with a corresponding format according to the vertex coordinates of each cell.

Preferably, the cell types include a first type, a second type, a third type, and a fourth type, and the step of acquiring the cell type of each cell according to the length and width attribute and the position attribute of each cell includes:

determining the first cell among all cells before the first cell line break to be of the first type;

determining the cells other than the first cell among all cells before the first cell line break to be of the second type;

determining the first cell between two adjacent cell line breaks to be of the third type or the fourth type, as applicable;

and determining the cells other than the first cell between two adjacent cell line breaks to be of the fourth type.

Preferably, the step of determining the first cell between two adjacent cell line breaks to be of the third type or the fourth type includes:

acquiring the number of cell rows before the first cell between the two adjacent cell line breaks;

if the number of cell rows is one, judging whether the widths of the cells before the first cell between the two adjacent cell line breaks are equal;

if the widths of the cells before the first cell between the two adjacent cell line breaks are equal, determining the first cell between the two adjacent cell line breaks to be of the third type;

and if the widths of the cells before the first cell between the two adjacent cell line breaks are not equal, determining the first cell between the two adjacent cell line breaks to be of the fourth type.

Preferably, after the step of acquiring the number of cell rows before the first cell between the two adjacent cell line breaks, the method further includes:

if the number of cell rows is greater than one, starting from the row immediately preceding the first cell between the two adjacent cell line breaks, traversing the rows and comparing the width of the first cell with the width of the second cell of each row to judge whether they are equal;

if the width of the first cell is equal to the width of the second cell in every row, determining the first cell between the two adjacent cell line breaks to be of the third type;

if there is an unequal row, in which the width of the first cell differs from the width of the second cell, stopping the traversal, summing the width of the second cell of the unequal row and the widths of the first cells of the rows after the unequal row, and comparing the sum with the width of the first cell of the unequal row, wherein the rows after the unequal row are all rows located after the unequal row and before the first cell between the two adjacent cell line breaks;

if the sum is greater than or equal to the width of the first cell of the unequal row, determining the first cell between the two adjacent cell line breaks to be of the third type;

and if the sum is smaller than the width of the first cell of the unequal row, determining the first cell between the two adjacent cell line breaks to be of the fourth type.
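For illustration only, the decision between the third type and the fourth type described above can be sketched in Python as follows. The function name `first_cell_type`, the representation of each row as a list of `(length, width)` tuples, and the choice to skip rows that contain fewer than two cells during the traversal are assumptions made for this sketch; the invention does not specify them.

```python
def first_cell_type(previous_rows):
    """Return 3 or 4 for the first cell of a new row.

    previous_rows: rows already read, top to bottom. Each row is a list of
    (length, width) tuples, where width is the vertical extent of a cell.
    """
    if not previous_rows:
        raise ValueError("the first row only holds first- and second-type cells")

    if len(previous_rows) == 1:
        # One row above: third type only if every cell in it has the same width.
        widths = {width for _, width in previous_rows[0]}
        return 3 if len(widths) == 1 else 4

    # Several rows above: walk upward from the row just before the new one.
    for index in range(len(previous_rows) - 1, -1, -1):
        row = previous_rows[index]
        if len(row) < 2:
            continue  # assumption: a single-cell row cannot be compared
        first_width, second_width = row[0][1], row[1][1]
        if first_width != second_width:
            # Unequal row found: stop and add the first-cell widths of the rows below it.
            rows_after = previous_rows[index + 1:]
            total = second_width + sum(r[0][1] for r in rows_after)
            return 3 if total >= first_width else 4
    return 3  # every compared row had equal first and second widths
```

Under these assumptions, a first row whose first cell has width 3 next to a cell of width 1 leads to the fourth type for the first cell of the next two rows, and to the third type once the accumulated widths reach 3.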

Preferably, the step of obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute, and the position attribute includes:

acquiring a preset standard cell length and a preset standard cell width;

establishing a two-dimensional coordinate system with the upper left corner of the table as the coordinate origin, the row direction of the table as the positive direction of the X axis, the column direction of the table as the positive direction of the Y axis, the standard cell length as the unit length of the X axis, and the standard cell width as the unit length of the Y axis;

and obtaining the vertex coordinates of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute, and the position attribute.

Preferably, the step of obtaining the vertex coordinates of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute, and the position attribute includes:

for a cell of the first type, obtaining the coordinates of the upper left vertex as (0, 0), the coordinates of the upper right vertex as (length of the cell / standard cell length, 0), the coordinates of the lower left vertex as (0, width of the cell / standard cell width), and the coordinates of the lower right vertex as (length of the cell / standard cell length, width of the cell / standard cell width);

for a cell of the second type, obtaining the coordinates of the upper left vertex as (sum of the lengths of all cells on the left side of the cell / standard cell length, 0), the coordinates of the upper right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, 0), the coordinates of the lower left vertex as (0, width of the cell / standard cell width), and the coordinates of the lower right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, width of the cell / standard cell width);

for a cell of the third type, obtaining the coordinates of the upper left vertex as (0, sum of the widths of all cells above the cell / standard cell width), the coordinates of the upper right vertex as (length of the cell / standard cell length, sum of the widths of all cells above the cell / standard cell width), the coordinates of the lower left vertex as (0, {sum of the widths of all cells above the cell + width of the cell} / standard cell width), and the coordinates of the lower right vertex as (length of the cell / standard cell length, {sum of the widths of all cells above the cell + width of the cell} / standard cell width);

and for a cell of the fourth type, obtaining the coordinates of the upper left vertex as (sum of the lengths of all cells on the left side of the cell / standard cell length, sum of the widths of all cells above the cell / standard cell width), the coordinates of the upper right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, sum of the widths of all cells above the cell / standard cell width), the coordinates of the lower left vertex as (sum of the lengths of all cells on the left side of the cell / standard cell length, {sum of the widths of all cells above the cell + width of the cell} / standard cell width), and the coordinates of the lower right vertex as ({sum of the lengths of all cells on the left side of the cell + length of the cell} / standard cell length, {sum of the widths of all cells above the cell + width of the cell} / standard cell width).
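As a minimal sketch of the coordinate formulas above, the four vertices of a cell can be computed in Python from its type, its length and width, and the accumulated lengths and widths of its neighbours. The function name `cell_vertices` and its parameter names are hypothetical. One deliberate deviation is flagged here: the text lists the X coordinate of the lower left vertex of a second-type cell as 0, whereas this sketch assumes the lower left vertex shares the X coordinate of the upper left vertex so that the four vertices form a rectangle.

```python
def cell_vertices(cell_type, length, width, left_sum, top_sum, std_length, std_width):
    """Return the four vertices of a cell in standard-cell units.

    left_sum: sum of the lengths of all cells on the left side of the cell.
    top_sum:  sum of the widths of all cells above the cell.
    """
    # Types 1 and 3 have no cell on the left; types 1 and 2 have no cell above.
    left = 0 if cell_type in (1, 3) else left_sum
    top = 0 if cell_type in (1, 2) else top_sum

    x0 = left / std_length
    x1 = (left + length) / std_length
    y0 = top / std_width
    y1 = (top + width) / std_width

    return {
        "upper_left": (x0, y0),
        "upper_right": (x1, y0),
        "lower_left": (x0, y1),   # assumption: same X as the upper left vertex
        "lower_right": (x1, y1),
    }
```

For example, with a standard cell length of 2 and a standard cell width of 1, a second-type cell of length 4 and width 1 that has cells of total length 6 on its left spans X from 3 to 5 and Y from 0 to 1.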

Preferably, the step of obtaining the target PDF and analyzing the target PDF to obtain table data includes:

acquiring a target PDF;

analyzing the target PDF to obtain a corresponding byte stream;

and identifying a table identifier in the byte stream, and extracting the byte stream corresponding to the table identifier as table data.

The present invention also provides a PDF table extraction device, including:

the analysis module is used for acquiring a target PDF and analyzing the target PDF to acquire table data;

an obtaining module, configured to obtain, from the table data, a length-width attribute and a position attribute of each cell in each table;

the obtaining module is used for obtaining the cell type of each cell according to the length and width attribute and the position attribute of each cell;

the calculation module is used for obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute and the position attribute;

and the generating module is used for generating a table with a corresponding format according to the vertex coordinates of each cell.

The invention also provides a terminal, which comprises a processor, a memory, and a PDF table extraction program stored on the memory and executable by the processor, wherein the PDF table extraction program, when executed by the processor, implements the steps of the PDF table extraction method described above.

The present invention also provides a computer-readable storage medium on which a PDF table extraction program is stored, wherein the PDF table extraction program, when executed by a processor, implements the steps of the PDF table extraction method described above.

According to the technical solution of the invention, a target PDF is acquired and parsed to obtain table data; the length and width attributes and the position attributes of each cell in each table are obtained from the table data; the cell type of each cell is acquired according to its length and width attribute and position attribute; the vertex coordinates of each cell are obtained according to the cell type, the length and width attribute, and the position attribute; and a table in the corresponding format is generated according to the vertex coordinates of each cell. In other words, the table in the PDF is extracted on the basis of data query: the table data is first parsed from the PDF, the length and width attributes and the position attributes are then read from the table data, the vertex coordinates of each cell are calculated from these attributes, and a table in the corresponding format is finally generated from the vertex coordinates. The format of the generated table is therefore consistent with that of the table in the PDF, which ensures the accuracy of table extraction.

Drawings

FIG. 1 is a schematic diagram of the hardware structure of a terminal according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a PDF table extraction method according to a first embodiment of the present invention;

FIG. 3 is a detailed flowchart of the step of acquiring a target PDF and parsing the target PDF to obtain table data according to an embodiment of the present invention;

FIG. 4 is a detailed flowchart of the step of acquiring the cell type of each cell according to the length and width attribute and the position attribute of each cell according to an embodiment of the present invention;

FIG. 5 is a detailed flowchart of the step of determining the first cell between two adjacent cell line breaks to be of the third type or the fourth type according to an embodiment of the present invention;

FIG. 6 is a detailed flowchart of the step of obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute, and the position attribute according to an embodiment of the present invention;

FIG. 7 is a detailed flowchart of the step of obtaining the vertex coordinates of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute, and the position attribute according to an embodiment of the present invention;

FIG. 8 is a block diagram of a PDF table extraction device according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The PDF table extraction method of the embodiments of the invention is mainly applied to a terminal, which may be a device with display and processing functions, such as a PC, a portable computer, or a mobile terminal.

Referring to FIG. 1, FIG. 1 is a schematic diagram of a terminal structure according to an embodiment of the present invention. In this embodiment, the terminal may include a processor 1001 (e.g., a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 provides connection and communication among these components; the user interface 1003 may include a display screen and an input unit such as a keyboard; the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface); the memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., a magnetic disk memory), and optionally the memory 1005 may be a storage device independent of the processor 1001.

Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and the like. The sensors may include light sensors, motion sensors, and other sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and the magnitude and direction of gravity when the mobile terminal is stationary; it can be used in applications that recognize the attitude of the mobile terminal (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer and tap detection). The mobile terminal may also be equipped with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described further here.

Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not constitute a limitation of the apparatus, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

With continued reference to FIG. 1, the memory 1005 in FIG. 1, as a computer-readable storage medium, may include an operating system, a network communication module, and a PDF table extraction program.

In FIG. 1, the network communication module is mainly used to connect to a server and exchange data with the server; the processor 1001 may call the PDF table extraction program stored in the memory 1005 and execute the steps of the PDF table extraction method.

Based on the hardware structure of the terminal described above, various embodiments of the PDF table extraction method of the invention are provided.

The invention provides a PDF table extraction method.

Referring to FIG. 2, in the first embodiment of the present invention, the PDF table extraction method includes the following steps:

Step S100, acquiring a target PDF, and parsing the target PDF to obtain table data;

Specifically, when an Excel table in a PDF needs to be extracted, the target PDF is acquired first. The target PDF may be a PDF that the user uploads to the terminal when extraction is required, or a PDF pre-stored in a database of the terminal, in which case the user selects the PDF to be extracted from the database. After the target PDF is acquired, it can be parsed with a programming language such as Visual Basic, Python, or Java to obtain table data. The target PDF may contain multiple tables, and corresponding table data can be extracted for each table.

Specifically, referring to FIG. 3, which is a detailed flowchart of the step of acquiring a target PDF and parsing the target PDF to obtain table data in an embodiment of the present invention, step S100 includes:

step S110, acquiring a target PDF;

step S120, analyzing the target PDF to obtain a corresponding byte stream;

After the target PDF is acquired, it is parsed with a programming language such as Visual Basic, Python, or Java to obtain the byte stream corresponding to the target PDF.

Step S130, identifying a table identifier in the byte stream, and extracting the byte stream corresponding to the table identifier as table data.

Because each differently formatted part of a PDF has a corresponding identifier, after the target PDF is converted into a byte stream, the table identifiers in the byte stream can be recognized with a programming language such as Visual Basic, Python, or Java. The recognized table identifiers are used to locate the portion of the byte stream that corresponds to each table, and that portion is extracted as table data.
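The following Python sketch illustrates the idea of step S130 under strong assumptions. The text does not state what a table identifier looks like, so the start and end markers `TABLE_START` and `TABLE_END` below are purely hypothetical placeholders, as is the function name `extract_table_segments`; a real PDF would require a parser that understands the file's object structure.

```python
import re
from typing import List

# Hypothetical markers standing in for the unspecified table identifiers.
TABLE_START = b"<<TABLE>>"
TABLE_END = b"<<END>>"


def extract_table_segments(stream: bytes) -> List[bytes]:
    """Return the byte segments enclosed by the hypothetical table markers."""
    pattern = re.compile(
        re.escape(TABLE_START) + b"(.*?)" + re.escape(TABLE_END),
        re.DOTALL,  # table data may contain newline bytes
    )
    return [match.group(1) for match in pattern.finditer(stream)]
```

Each returned segment would then serve as the table data for one table, which matches the statement that table data is extracted separately for each table in the target PDF.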

Step S200, obtaining the length and width attributes and the position attributes of each cell in each table from the table data;

Specifically, after the table data corresponding to each table is obtained, a macro definition may be applied to obtain the length and width attribute of each cell in each table, where the length and width attribute of a cell refers to the actual length and width of that cell in the table. The position attribute of a cell refers to the order of the cell in the table and the row in which the cell is located. In the table data, the cell data is read in the following order: starting from the cell in the first row and first column, all cells in the first row are traversed from left to right; starting from the cell in the first column of the second row, all cells in the second row are traversed from left to right; starting from the cell in the first column of the third row, all cells in the third row are traversed from left to right; and so on until all cells in the table have been traversed. The row in which each cell is located can be represented by cell line breaks: by detecting the cell line breaks in the cell data stream, the cells before the first cell line break are taken as the first row, and the cells between two adjacent cell line breaks are taken as one row.
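The row grouping described above can be sketched in Python as follows. The function name `split_rows` and the use of the string `"\n"` as the cell line break marker are assumptions for illustration; the actual representation of a cell line break in the table data is not specified here.

```python
def split_rows(cell_stream, row_break="\n"):
    """Group a flat, left-to-right cell stream into rows.

    cell_stream: iterable of cells interleaved with row_break markers.
    row_break:   hypothetical marker standing in for the cell line break.
    """
    rows, current = [], []
    for item in cell_stream:
        if item == row_break:
            # A line break closes the current row.
            rows.append(current)
            current = []
        else:
            current.append(item)
    if current:
        # Cells after the last line break form the final row.
        rows.append(current)
    return rows
```

With this grouping, the cells before the first marker form the first row, and the cells between two adjacent markers form one row each.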

Step S300, obtaining the cell type of each cell according to the length and width attributes and the position attributes of each cell;

To make the vertex coordinates of each cell easier to calculate, the cells can be divided into four types: a first type, a second type, a third type and a fourth type. A cell of the first type has no cell on its left side or its upper side; a cell of the second type has a cell on its left side but none on its upper side; a cell of the third type has no cell on its left side but has a cell on its upper side; a cell of the fourth type has cells on both its upper side and its left side. After the length and width attribute and the position attribute of each cell are obtained, the cell type to which each cell belongs can be determined from these two attributes.
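The four types amount to the four combinations of "has a cell on the left" and "has a cell above". A minimal sketch, in which the names `CellType` and `classify` are hypothetical and the two flags are assumed to have been determined already:

```python
from enum import Enum


class CellType(Enum):
    FIRST = 1   # no cell on the left, no cell above
    SECOND = 2  # cell on the left, no cell above
    THIRD = 3   # no cell on the left, cell above
    FOURTH = 4  # cells on both the left and the upper side


def classify(has_left: bool, has_above: bool) -> CellType:
    """Map the two neighbour flags to one of the four cell types."""
    if not has_left and not has_above:
        return CellType.FIRST
    if has_left and not has_above:
        return CellType.SECOND
    if not has_left and has_above:
        return CellType.THIRD
    return CellType.FOURTH
```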

Step S400, obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute and the position attribute;

Specifically, once a cell has been determined to be of the first, second, third or fourth type, its vertex coordinates can be obtained from its length and width attribute and its position attribute. For example, a standard cell length and a standard cell width may be preset at the terminal. When the cell is of the first type, its upper left vertex is the origin of coordinates; the X coordinate of its upper right vertex is the length of the cell/the standard cell length, and the Y coordinate is 0; the X coordinate of its lower left vertex is 0, and the Y coordinate is the width of the cell/the standard cell width; the X coordinate of its lower right vertex is the length of the cell/the standard cell length, and the Y coordinate is the width of the cell/the standard cell width. When the cell is of the second type, the X coordinate of its upper left vertex is the sum of the lengths of all cells on the left side of the cell/the standard cell length, and the Y coordinate is 0; the X coordinate of its upper right vertex is {the sum of the lengths of all cells on the left side of the cell + the length of the cell}/the standard cell length, and the Y coordinate is 0; the X coordinate of its lower left vertex is the sum of the lengths of all cells on the left side of the cell/the standard cell length, and the Y coordinate is the width of the cell/the standard cell width; the X coordinate of its lower right vertex is {the sum of the lengths of all cells on the left side of the cell + the length of the cell}/the standard cell length, and the Y coordinate is the width of the cell/the standard cell width.

Step S500, generating a table with a corresponding format according to the vertex coordinates of each cell;

After the vertex coordinates of each cell are obtained, a table of the corresponding format can be generated in the coordinate system from the vertex coordinates of each cell, and the generated table is consistent with the format of the table in the PDF. In addition, after the table of the corresponding format is generated, the table content obtained using Python or the like can be filled into the corresponding cells, so that the extracted table is consistent with the original table in the PDF in both content and format. The generated table is an Excel table, and it can also be converted into a Word table for storage.
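As a rough illustration of how vertex coordinates could drive table generation, the sketch below converts the upper left and lower right vertices of one cell into a spreadsheet-style range such as `A1:B1`, which a spreadsheet library could then merge. It assumes that the coordinates are whole multiples of the standard cell; the helper names `column_letter` and `to_range` are hypothetical and are not part of the described method.

```python
from typing import Tuple

Point = Tuple[float, float]


def column_letter(index: int) -> str:
    """Convert a 1-based column index to letters: 1 -> 'A', 27 -> 'AA'."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_range(top_left: Point, bottom_right: Point) -> str:
    """Map vertex coordinates (in standard-cell units) to a cell range."""
    x0, y0 = (round(v) for v in top_left)
    x1, y1 = (round(v) for v in bottom_right)
    start = f"{column_letter(x0 + 1)}{y0 + 1}"
    end = f"{column_letter(x1)}{y1}"
    # A cell that covers exactly one standard cell needs no range.
    return start if start == end else f"{start}:{end}"


print(to_range((0, 0), (2, 1)))  # A1:B1
print(to_range((2, 1), (3, 2)))  # C2
```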

According to the technical scheme, a target PDF is obtained, and the target PDF is analyzed to obtain table data; obtaining the length and width attributes and the position attributes of each cell in each table from the table data; acquiring the cell type of each cell according to the length and width attribute and the position attribute of each cell; obtaining each vertex coordinate of each cell according to the cell type, the length and width attribute and the position attribute; and generating a table with a corresponding format according to the vertex coordinates of each cell. The technical scheme provided by the invention is that the table in the PDF is extracted based on data query, the table data is firstly analyzed from the PDF, the length and width attribute and the position attribute in the table data are then obtained, the vertex coordinates of each cell are calculated according to the length and width attribute and the position attribute, finally, the table with the corresponding format is generated according to the vertex coordinates, the format of the finally generated table is consistent with that of the table in the PDF, and the accuracy of table extraction is ensured.

Specifically, referring to fig. 4, fig. 4 is a schematic flowchart of obtaining the cell type to which each cell belongs according to the length and width attribute and the position attribute of each cell in the embodiment of the present invention, where the cell type includes a first type, a second type, a third type, and a fourth type. Based on the above embodiment, step S300 includes:

Step S310, determining the type of the first cell among all the cells before the first cell line break as the first type;

specifically, the table data corresponding to each table includes a cell line break, and by detecting the cell line break in the cell data stream, the cell before the first cell line break is taken as the first line, and the cell between two adjacent cell line breaks is taken as one line. The type of the first cell in all the cells before the first cell line break is judged as the first type, that is, when the table data corresponding to each table is read, the type of the first cell in the table read in the reading order is judged as the first type.

Step S320, determining the type of the cells other than the first cell among all the cells before the first cell line break as the second type;

specifically, the type of the cells other than the first cell among all the cells before the first cell line break is determined as the second type, that is, when the table data corresponding to each table is read, the type of the cells other than the first cell before the first cell line break in the table read in the reading order is determined as the second type.

Step S330, determining the type of the first cell between two adjacent cell line breaks as the third type or the fourth type, as appropriate;

The first cell between two adjacent cell line breaks may be of the third type or the fourth type: when no cell lies to its left, that is, when it is located in the first column of the table, it is of the third type, and when a cell lies to its left, it is of the fourth type. For example, if the first cell among all the cells before the first cell line break has a larger width than the others, it occupies the space on the left side of the first cell between the first cell line break and the second cell line break, and that cell is of the fourth type; if the widths of all the cells before the first cell line break are consistent, the space on the left side of the first cell between the first cell line break and the second cell line break is not occupied, and that cell is of the third type.

Specifically, referring to fig. 5, fig. 5 is a schematic flowchart of a step of determining a type of a first cell between two adjacent cell line breaks as a corresponding third type or fourth type in the embodiment of the present invention, where based on the embodiment, the step S330 includes:

Step S331, obtaining the number of cell rows before the first cell between two adjacent cell line breaks;

Since the cells before the first cell line break are taken as the first row and the cells between two adjacent cell line breaks are taken as one row, the number of rows before the first cell between two adjacent cell line breaks equals the number of cell line breaks before that cell. For example, the first cell between the first cell line break and the second cell line break is preceded by 1 cell line break, so the number of rows before it is 1; the first cell between the second cell line break and the third cell line break is preceded by 2 cell line breaks, so the number of rows before it is 2.

Step S332, if the number of cell rows is one, determining whether the widths of the cells before the first cell between the two adjacent cell line breaks are all equal;

if the number of the cell lines before the first cell between two adjacent cell line breaks is one, it indicates that what needs to be judged is the type of the first cell between the first cell line break and the second cell line break, and at this time, it only needs to compare the widths of the cells before the first cell, and judge whether the widths of the cells before the first cell between two adjacent cell line breaks are equal.

Step S333, if the widths of the cells before the first cell between the two adjacent cell line breaks are all equal, determining the type of the first cell between the two adjacent cell line breaks as the third type;

specifically, if the widths of the cells before the first cell between two adjacent cell line breaks are all equal, the type of the first cell between two adjacent cell line breaks is determined as the third type. That is, if the widths of all the cells before the first cell line break are equal, it means that all the cells before the first cell line break do not occupy the space on the left side of the first cell between the first cell line break and the second cell line break, that is, there is no cell on the left side of the first cell between the first cell line break and the second cell line break, and it means that the type of the first cell between the first cell line break and the second cell line break is the third type.

Step S334, if the widths of the cells before the first cell between the two adjacent cell line breaks are not all equal, determining the type of the first cell between the two adjacent cell line breaks as the fourth type;

specifically, if the widths of the cells before the first cell between two adjacent cell line breaks are not all equal, that is, if there are cells with unequal widths in the cells before the first cell between two adjacent cell line breaks, the type of the first cell between two adjacent cell line breaks is determined as the fourth type. That is, if there are cells with unequal widths in all the cells before the first cell line break, it means that the cells before the first cell line break occupy the space on the left side of the first cell between the first cell line break and the second cell line break, that is, there is a cell on the left side of the first cell between the first cell line break and the second cell line break, and it means that the type of the first cell between the first cell line break and the second cell line break is the fourth type.
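Steps S332 to S334 reduce to a single "are all widths in the first row equal" test. The sketch below is one way to express it; the function name `second_row_start_type` and the tolerance parameter `tol` are assumptions added for the example, the tolerance being there because widths read from a PDF are typically floating-point values.

```python
import math
from typing import Sequence


def second_row_start_type(first_row_widths: Sequence[float],
                          tol: float = 1e-6) -> str:
    """Type of the first cell of the second row (steps S332 to S334).

    first_row_widths holds the widths of all cells before the first
    cell line break, in reading order (assumed non-empty).
    """
    ref = first_row_widths[0]
    all_equal = all(math.isclose(w, ref, abs_tol=tol)
                    for w in first_row_widths)
    # Equal widths: nothing from the first row reaches down into the
    # second row, so the cell has no left neighbour (third type).
    return "third" if all_equal else "fourth"


print(second_row_start_type([1.0, 1.0, 1.0]))  # third
print(second_row_start_type([2.0, 1.0, 1.0]))  # fourth
```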

Step S335, if the number of cell rows is greater than one, starting from the row immediately preceding the first cell between the two adjacent cell line breaks, traversing the rows and comparing the width of the first cell with the width of the second cell of each row to determine whether they are equal;

Specifically, the first cell between two adjacent cell line breaks is defined as the cell to be judged. If the number of cell rows before the cell to be judged is greater than one, that is, two or more, then, starting from the row immediately preceding the cell to be judged, the rows are traversed and the width of the first cell of each row is compared with the width of the second cell of that row. In other words, the first and second cells of the row closest to the cell to be judged are compared first, and the remaining rows are then compared in turn, moving away from the cell to be judged. If the width of the first cell equals the width of the second cell in every row, the traversal continues through all the rows before the cell to be judged; if a row is reached in which the width of the first cell is not equal to the width of the second cell, the traversal stops and the subsequent calculation and judgment are performed.

Step S336, if the width of the first cell equals the width of the second cell in every row, determining the type of the first cell between the two adjacent cell line breaks as the third type;

If, in all the rows before the first cell between two adjacent cell line breaks, the width of the first cell equals the width of the second cell, none of those first cells occupies the space on the left side of the cell currently being judged, and the type of the first cell between the two adjacent cell line breaks can therefore be determined as the third type.

Step S337, if there is an unequal row, in which the width of the first cell is not equal to the width of the second cell, stopping the traversal, summing the width of the second cell of the unequal row and the widths of the first cells of the rows after the unequal row, and comparing the sum with the width of the first cell of the unequal row, where the rows after the unequal row are all the rows located after the unequal row and before the first cell between the two adjacent cell line breaks;

If an unequal row is found during the traversal, the first cell of that row may occupy the space on the left side of the cell to be judged. To decide this, it suffices to sum the width of the second cell of the unequal row and the widths of the first cells of the rows after the unequal row, and then to compare that sum with the width of the first cell of the unequal row.

Step S338, if the sum is greater than or equal to the width of the first cell of the unequal row, determining the type of the first cell between the two adjacent cell line breaks as the third type;

Specifically, if the sum is greater than or equal to the width of the first cell of the unequal row, the first cell of the unequal row does not occupy the space on the left side of the first cell between the two adjacent cell line breaks, and the type of that cell can be determined as the third type.

Step S339, if the sum is smaller than the width of the first cell of the unequal row, determining the type of the first cell between the two adjacent cell line breaks as the fourth type.

Specifically, if the obtained sum is smaller than the width of the first cell of the unequal row, it indicates that the first cell of the unequal row occupies the space on the left side of the first cell between two adjacent cell line breaks, and at this time, the type of the first cell between two adjacent cell line breaks may be determined as the fourth type.
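The traversal in steps S335 to S339 can be sketched as below for the case of two or more preceding rows. The function name `row_start_type` and the list-of-widths representation are assumptions for the example. The text does not say how a row with fewer than two cells should be handled, so the sketch simply treats such a row as "equal"; it also assumes every row contains at least one cell.

```python
from typing import List


def row_start_type(rows_above: List[List[float]]) -> str:
    """Type of the first cell of a row, given the rows above it.

    rows_above[i] holds the cell widths of row i (top to bottom),
    each row in reading order.
    """
    # Step S335: walk upwards, starting with the row closest to the
    # cell to be judged.
    for j in range(len(rows_above) - 1, -1, -1):
        row = rows_above[j]
        if len(row) < 2 or row[0] == row[1]:
            continue  # assumption: a single-cell row counts as equal
        # Step S337: unequal row found, stop traversing and sum the
        # second cell's width with the first-cell widths of later rows.
        total = row[1] + sum(r[0] for r in rows_above[j + 1:])
        # Steps S338 and S339: compare the sum with the first cell.
        return "third" if total >= row[0] else "fourth"
    return "third"  # step S336: first and second widths equal everywhere


print(row_start_type([[1, 1], [1, 1]]))  # third
print(row_start_type([[3, 1], [1, 1]]))  # fourth
print(row_start_type([[2, 1], [1, 1]]))  # third
```

In the second call the first cell of the top row is three units wide while the rows above the cell to be judged only account for two units, so that cell still occupies the space to the left of the cell being judged.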

Further, referring to fig. 6, fig. 6 is a detailed schematic view of a flow of a step of obtaining coordinates of each vertex of each cell according to the cell type, the length and width attribute, and the position attribute in the embodiment of the present invention, based on the above embodiment, step S400 includes:

step S410, acquiring a preset standard cell length and a preset standard cell width;

Specifically, in one embodiment, a fixed standard cell length and standard cell width may be preset at the terminal. In another embodiment, the standard cell length and standard cell width may be determined from the table data: the lengths and widths of all the cells in the table are obtained, a two-element [length, width] array is formed for each cell, the array that occurs most often is identified, and its length and width are used as the standard cell length and standard cell width. Alternatively, after the lengths and widths of all the cells in the table are obtained, the lengths are compared with one another and the widths are compared with one another, and the shortest length and the shortest width are used as the standard cell length and standard cell width. When the vertex coordinates of each cell need to be obtained, the standard cell length and standard cell width are obtained first.
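The two data-driven ways of choosing the standard cell size can be sketched as follows. The function names `standard_by_mode` and `standard_by_minimum` are hypothetical, and cells are assumed to be given as `(length, width)` tuples.

```python
from collections import Counter
from typing import List, Tuple

Size = Tuple[float, float]  # (length, width)


def standard_by_mode(cells: List[Size]) -> Size:
    """Use the (length, width) pair that occurs most often."""
    return Counter(cells).most_common(1)[0][0]


def standard_by_minimum(cells: List[Size]) -> Size:
    """Use the shortest length and the shortest width, taken separately."""
    return (min(length for length, _ in cells),
            min(width for _, width in cells))


cells = [(2.0, 1.0), (2.0, 1.0), (1.0, 3.0)]
print(standard_by_mode(cells))     # (2.0, 1.0)
print(standard_by_minimum(cells))  # (1.0, 1.0)
```

As the example shows, the two choices can differ: the minimum-based standard need not correspond to any actual cell in the table.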

Step S420, establishing a two-dimensional coordinate system by taking the upper left corner of the table as a coordinate origin, the row direction of the table as the positive direction of an X axis, the column direction of the table as the positive direction of a Y axis, the standard length of the cell as the unit length of the X axis and the standard width of the cell as the unit length of the Y axis;

after the preset standard cell length and the preset standard cell width are obtained, a two-dimensional coordinate system is established by taking the upper left corner of the table as the origin of coordinates, the row direction of the table as the positive direction of an X axis, the column direction of the table as the positive direction of a Y axis, the standard cell length as the unit length of the X axis and the standard cell width as the unit length of the Y axis.

Step S430, obtaining each vertex coordinate of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute and the position attribute;

after the two-dimensional coordinate system is established, the vertex coordinates of each cell in the two-dimensional coordinate system can be obtained according to the cell type of each cell, the length and width attribute of each cell and the position attribute of each cell.

Specifically, referring to fig. 7, fig. 7 is a schematic flowchart illustrating a step of obtaining coordinates of each cell at each vertex of the two-dimensional coordinate system according to the cell type, the length and width attribute, and the position attribute in the embodiment of the present invention, where based on the embodiment, the step S430 includes:

step S431, obtaining coordinates of an upper left vertex of the first type cell as (0, 0), coordinates of an upper right vertex of the first type cell as (length of the first type cell/standard cell length, 0), coordinates of a lower left vertex of the first type cell as (0, width of the first type cell/standard cell width), and coordinates of a lower right vertex of the first type cell as (length of the first type cell/standard cell length, width of the first type cell/standard cell width);

the left side in the present embodiment refers to the left side of the same row, and the upper side refers to the upper side of the same column.

Specifically, the upper side and the left side of the first type cell have no cell, and after the cell type and the length and width attribute are obtained, the upper left vertex of the first type cell can be obtained as the coordinate origin (0, 0); the X coordinate of the top right vertex of the first type cell is the length of the first type cell/the standard length of the cell, and the Y coordinate is 0; the X coordinate of the lower left vertex is 0, and the Y coordinate is the width of the first type of cell/the standard width of the cell; the X coordinate of the lower right vertex is the length of the first type cell/cell standard length, and the Y coordinate is the width of the first type cell/cell standard width.

Step S432, obtaining the coordinates of the upper left vertex of the second type cell as (the sum of the lengths of all cells on the left side of the second type cell/the standard cell length, 0), the coordinates of the upper right vertex as ({the sum of the lengths of all cells on the left side of the second type cell + the length of the second type cell}/the standard cell length, 0), the coordinates of the lower left vertex as (the sum of the lengths of all cells on the left side of the second type cell/the standard cell length, the width of the second type cell/the standard cell width), and the coordinates of the lower right vertex as ({the sum of the lengths of all cells on the left side of the second type cell + the length of the second type cell}/the standard cell length, the width of the second type cell/the standard cell width);

Specifically, a second type cell has no cell on its upper side and has a cell on its left side. After the cell type and the length and width attribute are obtained, the X coordinate of the upper left vertex of the second type cell is obtained as the sum of the lengths of all cells on the left side of the second type cell/the standard cell length, and the Y coordinate is 0; the X coordinate of the upper right vertex is {the sum of the lengths of all cells on the left side of the second type cell + the length of the second type cell}/the standard cell length, and the Y coordinate is 0; the X coordinate of the lower left vertex is the sum of the lengths of all cells on the left side of the second type cell/the standard cell length, and the Y coordinate is the width of the second type cell/the standard cell width; the X coordinate of the lower right vertex is {the sum of the lengths of all cells on the left side of the second type cell + the length of the second type cell}/the standard cell length, and the Y coordinate is the width of the second type cell/the standard cell width.

Step S433, obtaining the coordinates of the upper left vertex of the third type cell as (0, the sum of the widths of all cells on the upper side of the third type cell/the standard cell width), the coordinates of the upper right vertex as (the length of the third type cell/the standard cell length, the sum of the widths of all cells on the upper side of the third type cell/the standard cell width), the coordinates of the lower left vertex as (0, {the sum of the widths of all cells on the upper side of the third type cell + the width of the third type cell}/the standard cell width), and the coordinates of the lower right vertex as (the length of the third type cell/the standard cell length, {the sum of the widths of all cells on the upper side of the third type cell + the width of the third type cell}/the standard cell width);

Specifically, a third type cell has a cell on its upper side and no cell on its left side. After the cell type and the length and width attribute are obtained, the X coordinate of the upper left vertex of the third type cell is 0, and the Y coordinate is the sum of the widths of all cells on the upper side of the third type cell/the standard cell width; the X coordinate of the upper right vertex is the length of the third type cell/the standard cell length, and the Y coordinate is the sum of the widths of all cells on the upper side of the third type cell/the standard cell width; the X coordinate of the lower left vertex is 0, and the Y coordinate is {the sum of the widths of all cells on the upper side of the third type cell + the width of the third type cell}/the standard cell width; the X coordinate of the lower right vertex is the length of the third type cell/the standard cell length, and the Y coordinate is {the sum of the widths of all cells on the upper side of the third type cell + the width of the third type cell}/the standard cell width.

Step S434, obtaining the coordinates of the upper left vertex of the fourth type cell as (the sum of the lengths of all cells on the left side of the fourth type cell/the standard cell length, the sum of the widths of all cells on the upper side of the fourth type cell/the standard cell width), the coordinates of the upper right vertex as ({the sum of the lengths of all cells on the left side of the fourth type cell + the length of the fourth type cell}/the standard cell length, the sum of the widths of all cells on the upper side of the fourth type cell/the standard cell width), the coordinates of the lower left vertex as (the sum of the lengths of all cells on the left side of the fourth type cell/the standard cell length, {the sum of the widths of all cells on the upper side of the fourth type cell + the width of the fourth type cell}/the standard cell width), and the coordinates of the lower right vertex as ({the sum of the lengths of all cells on the left side of the fourth type cell + the length of the fourth type cell}/the standard cell length, {the sum of the widths of all cells on the upper side of the fourth type cell + the width of the fourth type cell}/the standard cell width).

Specifically, the upper side and the left side of the cell of the fourth type both have cells, and after the cell type, the length and width attribute and the position attribute are obtained, the X coordinate of the upper left vertex of the cell of the fourth type can be obtained as the sum of the lengths of all the cells on the left side of the cell of the fourth type/the standard cell length, and the Y coordinate is the sum of the widths of all the cells on the upper side of the cell of the fourth type/the standard cell width; the X coordinate of the top right vertex of the fourth type cell is { the sum of the lengths of all the cells on the left side of the fourth type cell + the length of the fourth type cell }/the standard length of the cell, and the Y coordinate is the sum of the widths of all the cells on the top side of the fourth type cell/the standard width of the cell; the X coordinate of the lower left vertex is the sum of the lengths of all the cells on the left side of the cell of the fourth type/the standard length of the cell, and the Y coordinate is { the sum of the widths of all the cells on the upper side of the cell of the fourth type + the width of the cell of the fourth type }/the standard width of the cell; the X coordinate of the lower right vertex is { the sum of the lengths of all the cells on the left side of the cell of the fourth type + the length of the cell of the fourth type }/the standard length of the cell, and the Y coordinate is { the sum of the widths of all the cells on the upper side of the cell of the fourth type + the width of the cell of the fourth type }/the standard width of the cell.

By obtaining the vertex coordinates of the cells of the first type, the second type, the third type and the fourth type, the vertex coordinates of all the cells in the table can be obtained, and by classifying the cells, the coordinates can be obtained more accurately, conveniently and quickly.
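The four cases can be folded into one calculation if the sum of the lengths on the left is taken as 0 for the first and third types, and the sum of the widths above is taken as 0 for the first and second types. The sketch below follows that reading; the function name `cell_vertices` and its parameters are hypothetical, and it assumes that the lower left vertex always shares the X coordinate of the upper left vertex, as it must for a rectangular cell.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def cell_vertices(length: float, width: float,
                  left_sum: float, top_sum: float,
                  std_length: float, std_width: float) -> Dict[str, Point]:
    """Vertex coordinates of one cell in standard-cell units.

    left_sum: total length of all cells to the left in the same row.
    top_sum:  total width of all cells above in the same column.
    """
    x0 = left_sum / std_length
    y0 = top_sum / std_width
    x1 = (left_sum + length) / std_length
    y1 = (top_sum + width) / std_width
    return {
        "top_left": (x0, y0),
        "top_right": (x1, y0),
        "bottom_left": (x0, y1),
        "bottom_right": (x1, y1),
    }


# First type: nothing to the left or above, so the cell starts at the origin.
first = cell_vertices(2.0, 1.0, 0.0, 0.0, 1.0, 1.0)
# Fourth type: two units of cells to the left, one unit of cells above.
fourth = cell_vertices(1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
```

Here `first["bottom_right"]` is `(2.0, 1.0)` and `fourth["top_left"]` is `(2.0, 1.0)`, so the two cells meet at a shared corner.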

In addition, referring to fig. 8, the present invention further provides a PDF form extracting apparatus 10, where the PDF form extracting apparatus 10 includes:

the analysis module 20 is used for acquiring a target PDF and analyzing the target PDF to acquire table data;

an obtaining module 30, configured to obtain, from the table data, a length-width attribute and a position attribute of each cell in each table;

an obtaining module 40, configured to obtain a cell type to which each cell belongs according to the length and width attribute and the position attribute of each cell;

a calculating module 50, configured to obtain vertex coordinates of each cell according to the cell type, the length and width attribute, and the position attribute;

a generating module 60 for generating a table of a corresponding format according to the vertex coordinates of each cell.

Further, the cell types include a first type, a second type, a third type, and a fourth type, and the obtaining module 40 is further configured to:

determining the type of the first cell among all the cells before the first cell line break as the first type;

determining the type of the cells other than the first cell among all the cells before the first cell line break as the second type;

determining the type of the first cell between two adjacent cell line breaks as the third type or the fourth type, as appropriate;

and determining the type of the cells other than the first cell between two adjacent cell line breaks as the fourth type.

Further, the obtaining module 40 is further configured to:

obtaining the number of cell rows before the first cell between two adjacent cell line breaks;

if the number of cell rows is one, determining whether the widths of the cells before the first cell between the two adjacent cell line breaks are all equal;

if the widths of the cells before the first cell between the two adjacent cell line breaks are all equal, determining the type of the first cell between the two adjacent cell line breaks as the third type;

and if the widths of the cells before the first cell between the two adjacent cell line breaks are not all equal, determining the type of the first cell between the two adjacent cell line breaks as the fourth type.

Further, the obtaining module 40 is further configured to:

if the number of cell rows is greater than one, traversing the rows starting from the row immediately preceding the first cell between the two adjacent cell line breaks, comparing the width of the first cell with the width of the second cell in each row, and determining whether the two widths are equal in every row;

if the width of the first cell is equal to the width of the second cell in every row, determining the type of the first cell between the two adjacent cell line breaks as the third type;

if there is an unequal row, in which the width of the first cell differs from the width of the second cell, stopping the traversal, summing the width of the second cell of the unequal row and the widths of the first cells of the rows after the unequal row, and comparing the sum with the width of the first cell of the unequal row, wherein the rows after the unequal row are all rows located after the unequal row and before the first cell between the two adjacent cell line breaks;

if the sum is greater than or equal to the width of the first cell of the unequal row, determining the type of the first cell between the two adjacent cell line breaks as the third type;

and if the sum is smaller than the width of the first cell of the unequal row, determining the type of the first cell between the two adjacent cell line breaks as the fourth type.
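
One possible reading of the multi-row check is sketched below. The representation is an assumption made for illustration: `rows` lists the preceding rows from top to bottom, each row is a sequence of cell widths (vertical extents) with at least two cells, and the traversal runs upward from the row nearest to the cell being classified. The function name, type constants, and `tol` parameter are likewise hypothetical.

```python
from typing import Sequence

# Hypothetical numeric labels for the cell types named in the text.
THIRD_TYPE = 3
FOURTH_TYPE = 4


def classify_first_cell_multi_row(rows: Sequence[Sequence[float]],
                                  tol: float = 1e-6) -> int:
    """Classify the first cell after a line break when several rows precede it.

    rows[i][0] and rows[i][1] are the widths of the first and second cell of
    preceding row i (assumed to exist for every row).
    """
    # Walk upward, starting with the row immediately before the cell.
    for i in range(len(rows) - 1, -1, -1):
        first, second = rows[i][0], rows[i][1]
        if abs(first - second) > tol:
            # Unequal row found: stop traversing and test whether the tall
            # first cell has already ended above the current row.
            later_first_widths = sum(row[0] for row in rows[i + 1:])
            total = second + later_first_widths
            return THIRD_TYPE if total >= first - tol else FOURTH_TYPE
    # The first and second cells have equal widths in every row.
    return THIRD_TYPE
```

For example, with `rows = [[60, 20, 20], [20, 20]]` the first cell of the top row is 60 units tall while the second cell plus the following row only account for 40, so the tall cell still occupies the first column of the current row and the function returns the fourth type.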

Further, the calculation module 50 is further configured to:

acquiring a preset standard cell length and a preset standard cell width;

establishing a two-dimensional coordinate system with the upper left corner of the table as the coordinate origin, the row direction of the table as the positive direction of the X axis, the column direction of the table as the positive direction of the Y axis, the standard cell length as the unit length of the X axis, and the standard cell width as the unit length of the Y axis;

and obtaining the vertex coordinates of each cell in the two-dimensional coordinate system according to the cell type, the length and width attribute and the position attribute.

Further, the calculation module 50 is further configured to:

obtaining the coordinates of the upper left vertex of the first type of cell as (0, 0), the coordinates of the upper right vertex of the first type of cell as (the length of the first type of cell/the standard length of the cell, 0), the coordinates of the lower left vertex of the first type of cell as (0, the width of the first type of cell/the standard width of the cell), and the coordinates of the lower right vertex of the first type of cell as (the length of the first type of cell/the standard length of the cell, the width of the first type of cell/the standard width of the cell);

obtaining the coordinates of the upper left vertex of the second type of cell as (sum of the lengths of all cells on the left side of the second type of cell/standard length of the cell, 0), the coordinates of the upper right vertex as ({sum of the lengths of all cells on the left side of the second type of cell + length of the second type of cell}/standard length of the cell, 0), the coordinates of the lower left vertex as (sum of the lengths of all cells on the left side of the second type of cell/standard length of the cell, width of the second type of cell/standard width of the cell), and the coordinates of the lower right vertex as ({sum of the lengths of all cells on the left side of the second type of cell + length of the second type of cell}/standard length of the cell, width of the second type of cell/standard width of the cell);

obtaining the coordinates of the upper left vertex of the third type of cell as (0, sum of the widths of all cells above the third type of cell/standard width of the cell), the coordinates of the upper right vertex as (length of the third type of cell/standard length of the cell, sum of the widths of all cells above the third type of cell/standard width of the cell), the coordinates of the lower left vertex as (0, {sum of the widths of all cells above the third type of cell + width of the third type of cell}/standard width of the cell), and the coordinates of the lower right vertex as (length of the third type of cell/standard length of the cell, {sum of the widths of all cells above the third type of cell + width of the third type of cell}/standard width of the cell);

and obtaining the coordinates of the upper left vertex of the fourth type of cell as (sum of the lengths of all cells on the left side of the fourth type of cell/standard length of the cell, sum of the widths of all cells above the fourth type of cell/standard width of the cell), the coordinates of the upper right vertex as ({sum of the lengths of all cells on the left side of the fourth type of cell + length of the fourth type of cell}/standard length of the cell, sum of the widths of all cells above the fourth type of cell/standard width of the cell), the coordinates of the lower left vertex as (sum of the lengths of all cells on the left side of the fourth type of cell/standard length of the cell, {sum of the widths of all cells above the fourth type of cell + width of the fourth type of cell}/standard width of the cell), and the coordinates of the lower right vertex as ({sum of the lengths of all cells on the left side of the fourth type of cell + length of the fourth type of cell}/standard length of the cell, {sum of the widths of all cells above the fourth type of cell + width of the fourth type of cell}/standard width of the cell).
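
The vertex formulas can be illustrated with a short Python sketch that treats every cell as an axis-aligned rectangle. The `Cell` class, its field names, and the `cell_vertices` function are assumptions introduced for illustration. Under this reading the four cell types share a single formula: the sum of lengths to the left is zero for the first and third types, and the sum of widths above is zero for the first and second types.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]


@dataclass
class Cell:
    length: float    # horizontal extent of the cell
    width: float     # vertical extent of the cell
    left_sum: float  # sum of the lengths of all cells to the left (0 in the first column)
    top_sum: float   # sum of the widths of all cells above (0 in the first row)


def cell_vertices(cell: Cell, std_length: float, std_width: float) -> Dict[str, Point]:
    """Return the four vertices in units of the standard cell length and width.

    The origin is the upper left corner of the table, X grows along the row
    direction, and Y grows downward along the column direction.
    """
    x0 = cell.left_sum / std_length
    x1 = (cell.left_sum + cell.length) / std_length
    y0 = cell.top_sum / std_width
    y1 = (cell.top_sum + cell.width) / std_width
    return {
        "upper_left": (x0, y0),
        "upper_right": (x1, y0),
        "lower_left": (x0, y1),
        "lower_right": (x1, y1),
    }


# A fourth-type cell that is two standard lengths long and one standard width
# tall, with 50 units of cells to its left and 40 units of cells above it.
example = cell_vertices(Cell(length=100, width=20, left_sum=50, top_sum=40),
                        std_length=50, std_width=20)
```

With these inputs the cell spans from (1.0, 2.0) at its upper left vertex to (3.0, 3.0) at its lower right vertex, that is, columns 1 to 3 and row 2 to 3 of the normalized grid.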

Further, the parsing module 20 is further configured to:

acquiring a target PDF;

parsing the target PDF to obtain a corresponding byte stream;

and identifying a table identifier in the byte stream, and extracting the portion of the byte stream corresponding to the table identifier as the table data.
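
The extraction step can be sketched as a scan over the byte stream. The specification does not say what a table identifier looks like, so the start and end markers below are purely hypothetical parameters supplied by the caller, and the function name is an assumption as well.

```python
from typing import List


def extract_table_data(stream: bytes, start_id: bytes, end_id: bytes) -> List[bytes]:
    """Return every byte segment enclosed by a start and an end table identifier.

    start_id and end_id are hypothetical markers; a real parser would use
    whatever identifier its PDF parsing step actually emits for tables.
    """
    segments: List[bytes] = []
    pos = 0
    while True:
        start = stream.find(start_id, pos)
        if start == -1:
            break  # no further table identifier in the stream
        body_start = start + len(start_id)
        end = stream.find(end_id, body_start)
        if end == -1:
            break  # unterminated table segment: ignore it
        segments.append(stream[body_start:end])
        pos = end + len(end_id)
    return segments
```

In practice the byte stream would come from the PDF parsing step; for a quick experiment it could be read from disk with `pathlib.Path(path).read_bytes()`, where `path` is the location of the already parsed data.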

Each module in the PDF form extraction device 10 corresponds to each step in the above embodiment of the PDF form extraction method, and the functions and implementation processes thereof are not described in detail herein.

In addition, the invention also provides a computer readable storage medium.

The computer readable storage medium of the present invention stores a PDF form extraction program, wherein when the PDF form extraction program is executed by a processor, the steps of the PDF form extraction method as described above are implemented.

For the method implemented when the PDF form extraction program is executed, reference may be made to the embodiments of the PDF form extraction method of the present invention, and details are not repeated here.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All modifications and equivalents made on the basis of the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are included in the scope of the present invention.
