Document processing method and device, electronic equipment and readable storage medium

Document No.: 421470 Publication date: 2021-12-21

Reading note: This technique, Document processing method and device, electronic equipment and readable storage medium, was designed and created by 余大雨蒋庆高 on 2021-08-13. Main content: The application discloses a document processing method, which comprises the following steps: converting a PDF file into an Excel table; determining the position of a target table in the Excel table; identifying a starting position and an ending position of a target sub-table in the target table; establishing a first data table according to the starting position, the ending position and the target sub-table; and matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with a database. The application also discloses a document processing device, an electronic device and a non-volatile computer-readable storage medium. The application can match table data in a PDF file to a database, which shortens the time needed to enter table data from the PDF file into the database, improves entry efficiency, and reduces labor cost.

1. A method of document processing, comprising:

converting the PDF file into an Excel table;

determining the position of a target table in the Excel table;

identifying a starting position and an ending position of a target sub-table in the target table;

establishing a first data table according to the starting position, the ending position and the target sub-table; and

matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with a database.

2. The method of claim 1, wherein the determining the location of the target table in the Excel table comprises:

acquiring a first index row of the target table in the Excel table according to the keywords of the primary directory of the target table;

matching keywords of a secondary directory starting from the first index row to obtain a second index row and a third index row of the secondary directory, wherein the second index row is used for indexing a starting row of the secondary directory, and the third index row is used for indexing an ending row of the secondary directory; and

intercepting the target table according to the second index row and the third index row to obtain the target sub-table corresponding to the secondary directory.

3. The method of claim 1, wherein the establishing a first data table according to the starting position, the ending position and the target sub-table comprises:

acquiring a fourth index row for indexing the starting position and a fifth index row for indexing the ending position;

intercepting the target sub-table according to the fourth index row and the fifth index row;

establishing a first initial data table according to the intercepted data; and

sorting the first initial data table to obtain the first data table.

4. The document processing method according to claim 3, wherein the data in the first row of the first initial data table is a column name of each column, and the sorting the first initial data table to obtain the first data table comprises:

identifying whether a misplaced cell exists in the first initial data table;

if a misplaced cell is identified, determining, according to the column name of the misplaced cell, whether a column-adjacent cell of the misplaced cell has data, wherein the column-adjacent cell and the misplaced cell are in adjacent columns; and

if the column-adjacent cell has no data and a row-adjacent cell in the row of the column-adjacent cell has data, merging the column-adjacent cell and the misplaced cell to obtain the first data table.

5. The document processing method according to claim 3, wherein the data in the first row of the first initial data table is a column name of each column, and the sorting the first initial data table to obtain the first data table comprises:

judging whether a plurality of combined columns exist in the first initial data table or not; and

if so, merging the merged row and the item row according to the data of the merged row and the data of the item row adjacent to the merged row to obtain the first data table.

6. The document processing method according to claim 1, further comprising:

establishing a second initial data table, wherein the data of each cell of the second initial data table is null;

acquiring the name of each preset node in a preset configuration file; and

inputting the name of each preset node into the second initial data table to obtain the second data table, wherein the column name of each column in the second data table is the name of each preset node.

7. The method of claim 1, wherein matching the data of the first data table to a second data table obtained in advance comprises:

matching the column name of each column of the first data table with the title of each column of the second data table; and

inputting the data of each column of the first data table that is successfully matched with the second data table into the corresponding column of the second data table.

8. A document processing apparatus, comprising:

the conversion module is used for converting the PDF file into an Excel table;

the determining module is used for determining the position of a target table in the Excel table;

the identification module is used for identifying the starting position and the ending position of the target sub-table in the target table;

the establishing module is used for establishing a first data table according to the starting position, the ending position and the target sub-table; and

the matching module is used for matching the data of the first data table to a second data table obtained in advance, and the second data table is matched with a database.

9. An electronic device, comprising:

one or more processors and a memory; and

one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the document processing method of any of claims 1 to 7.

10. A non-transitory computer-readable storage medium containing a computer program, wherein the computer program, when executed by one or more processors, causes the processors to implement the document processing method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a document processing method, a document processing apparatus, an electronic device, and a non-volatile computer-readable storage medium.

Background

Table data in a Portable Document Format (PDF) file is generally extracted manually. However, if a PDF file contains many tables or there are many PDF files, manual batch extraction of table data is inefficient and usually consumes a lot of manpower and material resources. In addition, the more manpower a customer invests to obtain the data in time, the higher the cost.

Disclosure of Invention

The embodiment of the application provides a document processing method, a document processing device, electronic equipment and a non-volatile computer readable storage medium.

The document processing method of the embodiment of the application comprises the following steps: converting the PDF file into an Excel table; determining the position of a target table in the Excel table; identifying a starting position and an ending position of a target sub-table in the target table; establishing a first data table according to the starting position, the ending position and the target sub-table; and matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with a database.

The document processing device comprises a conversion module, a determination module, an identification module, an establishment module and a matching module. The conversion module is used for converting the PDF file into an Excel table; the determination module is used for determining the position of a target table in the Excel table; the identification module is used for identifying the starting position and the ending position of a target sub-table in the target table; the establishment module is used for establishing a first data table according to the starting position, the ending position and the target sub-table; and the matching module is used for matching the data of the first data table to a second data table obtained in advance, and the second data table is matched with a database.

The electronic device of the embodiment of the application comprises one or more processors, a memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the document processing methods of embodiments of the present application. The document processing method comprises the following steps: converting the PDF file into an Excel table; determining the position of a target table in the Excel table; identifying a starting position and an ending position of a target sub-table in the target table; establishing a first data table according to the starting position, the ending position and the target sub-table; and matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with a database.

A non-transitory computer-readable storage medium containing a computer program according to an embodiment of the present application is characterized in that, when the computer program is executed by one or more processors, the processors are caused to implement a document processing method according to an embodiment of the present application. The document processing method comprises the following steps: converting the PDF file into an Excel table; determining the position of a target table in the Excel table; identifying a starting position and an ending position of a target sub-table in the target table; establishing a first data table according to the starting position, the ending position and the target sub-table; and matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with a database.

In the document processing method, the document processing apparatus, the electronic device, and the non-volatile computer-readable storage medium according to the embodiments of the present application, a PDF file is first converted into an Excel table; the position of a target table in the Excel table and the starting position and ending position of a target sub-table in the target table are then determined in turn; a first data table is established according to the starting position and ending position of the target sub-table; and the data in the first data table is matched into a second data table, so that target data in a table of the PDF file can be matched into a database. Therefore, the time needed to enter table data from the PDF file into the database can be shortened, entry efficiency is improved, and labor cost can be reduced.

Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow chart diagram of a document processing method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a document processing apparatus according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 4 is a schematic flow chart diagram of a document processing method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 7 is a flowchart illustrating a document processing method according to an embodiment of the present application;

FIG. 8 is a flowchart illustrating a document processing method according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 10 is a schematic flow chart diagram of a document processing method according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 12 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 13 is a flowchart illustrating a document processing method according to an embodiment of the present application;

FIG. 14 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 15 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 16 is a schematic diagram of a document processing method according to an embodiment of the present application;

FIG. 17 is a flowchart illustrating a document processing method according to an embodiment of the present application;

FIG. 18 is a flowchart illustrating a document processing method according to an embodiment of the present application;

FIG. 19 is a flowchart illustrating a document processing method according to an embodiment of the present application;

FIG. 20 is a schematic diagram illustrating a document processing method according to an embodiment of the present application;

FIG. 21 is a schematic diagram illustrating a connection relationship between a computer-readable storage medium and a processor according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the embodiments of the present application, and are not to be construed as limiting the embodiments of the present application.

Referring to fig. 1 to 3, a document processing method according to an embodiment of the present application includes the following steps:

01: converting the PDF file into an Excel table;

02: determining the position of a target table in the Excel table;

03: identifying the starting position and the ending position of a target sub-table in a target table;

04: establishing a first data table according to the starting position, the ending position and the target sub-table; and

05: matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with the database.

The document processing apparatus 10 of the embodiment of the present application includes a conversion module 11, a determination module 12, an identification module 13, an establishment module 14, and a matching module 15. The conversion module 11, the determination module 12, the identification module 13, the establishment module 14, and the matching module 15 may be configured to implement step 01, step 02, step 03, step 04, and step 05, respectively. That is, the conversion module 11 may be configured to convert a PDF file into an Excel table; the determination module 12 may be configured to determine the position of the target table in the Excel table; the identification module 13 may be configured to identify the starting position and the ending position of the target sub-table in the target table; the establishment module 14 may be configured to establish a first data table according to the starting position, the ending position and the target sub-table; and the matching module 15 may be configured to match the data of the first data table to a second data table obtained in advance, the second data table being matched with a database.

The electronic device 100 of the embodiments of the present application includes one or more processors 20, a memory 30, and one or more programs, where the one or more programs are stored in the memory 30 and executed by the one or more processors 20, the programs including instructions for performing the document processing method of the embodiments of the present application. When executing the programs, the processor 20 may implement step 01, step 02, step 03, step 04, and step 05. That is, the processor 20 may be configured to: convert the PDF file into an Excel table; determine the position of a target table in the Excel table; identify the starting position and the ending position of a target sub-table in the target table; establish a first data table according to the starting position, the ending position and the target sub-table; and match the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with the database.

In the document processing method, the document processing apparatus 10, and the electronic device 100 according to the embodiments of the present application, a PDF file is first converted into an Excel table; the position of a target table in the Excel table and the starting position and ending position of a target sub-table in the target table are then determined in turn; a first data table is established according to the starting position and ending position of the target sub-table; and the data in the first data table is matched into a second data table, so that target data in the PDF file can be matched into a database. Therefore, the time needed to enter table data from the PDF file into the database can be shortened, entry efficiency is improved, and labor cost can be reduced.

In addition, in some embodiments, the data of a table in a PDF file is located directly through keywords, and a plurality of keyword matches may occur in one PDF file, which makes the entered data inaccurate. Compared with locating the table data in the PDF file directly through keywords, the present application first determines the starting position and the ending position of the target sub-table of the target table in the PDF file and then enters the data in the target sub-table into the database, so that the data range is smaller, the data features are stronger, and the accuracy of data entry is higher.

The PDF file may be an announcement file of a company, for example an annual report, a revised or updated annual report, an annual report summary, a semiannual report summary, or the full text or main text of a quarterly report. The PDF file may include a plurality of tables, or it may include one large table in which a plurality of sub-tables are nested, and the data corresponding to each sub-table is different.

Specifically, in step 01, the PDF file is converted into an Excel table. Software or an algorithm may be used to convert the PDF file into an Excel file, that is, into an Excel table that includes the entire content of the PDF file. For example, SolidFramework can be used to convert the PDF file into an Excel table. Other software or algorithms may also be used and are not listed here.

In step 02, the position of the target table in the Excel table is determined. All data in the Excel table, that is, the content, border, font and other data of each cell, can be acquired through some software or algorithm. For example, in one embodiment, all data of an Excel table may be obtained through the expose. In some embodiments, multiple Excel tables are obtained from one PDF file. To avoid reading data from all of them, and to keep tables other than the target table from making the entered data inaccurate, the position of the target table in the Excel table can be determined from keywords that have been read, such as the header of the target table or the directory of the target table. The data in the target table can then be processed and analyzed further, which reduces resource usage and improves working efficiency.

In step 03, the starting position and the ending position of the target sub-table in the target table are identified. In some embodiments, the target table contains a plurality of sub-tables whose data differ, and only the data in the target sub-table is accurate, or only the data in the target sub-table needs to be entered. For accurate and efficient data entry, the starting position and the ending position of the target sub-table in the target table can be identified without processing the target table in its entirety. For example, the starting and ending positions of the target sub-table may be determined by identifying a keyword at the start and a keyword at the end of the target sub-table.

In step 04, a first data table is established according to the start position, the end position and the target sub-table. In step 03, the starting position and the ending position of the target sub-table are already identified, the data in the Excel table can be intercepted according to the starting position and the ending position, and then the first data table is established according to the intercepted data, so that the first data table is more accurate, the data in the first data table is more refined, and the efficiency of data entry is improved.
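The interception in step 04 can be sketched as follows. This is only a minimal illustration, assuming the Excel table has already been read into a list of rows (each row a list of cell values), that the starting and ending positions are zero-based row indexes, and that the first intercepted row holds the column names. The function name `build_first_table` and the sample data are hypothetical and do not come from the application.

```python
from typing import Any

Row = list[Any]


def build_first_table(rows: list[Row], start: int, end: int) -> dict[str, list[Any]]:
    """Cut rows[start..end] (inclusive) and use the first cut row as column names."""
    if not 0 <= start < end < len(rows):
        raise ValueError("start/end must satisfy 0 <= start < end < len(rows)")
    header, *body = rows[start:end + 1]
    # Each column name maps to the values found below it; short rows are padded with None.
    return {str(name): [r[i] if i < len(r) else None for r in body]
            for i, name in enumerate(header)}


# Hypothetical sheet content, for illustration only.
sheet = [
    ["Long-term deferred expenses"],                  # sub-table title, not data
    ["Item", "Opening balance", "Closing balance"],   # starting position (index 1)
    ["Renovation", 120.0, 90.0],
    ["Total", 120.0, 90.0],                           # ending position (index 3)
    ["Next section"],
]
first_table = build_first_table(sheet, start=1, end=3)
```

Storing the result column by column makes the later column-name matching against the second data table a simple dictionary lookup.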

In step 05, the data in the first data table is matched to a second data table obtained in advance, and the second data table is matched with the database. It can be understood that the first data table obtained in step 04 may not match the entry template of the database, which easily leads to entry errors or a time-consuming entry process. Therefore, a second data table that matches the database can be established first, and the data in the first data table can then be matched into the second data table, so that the target data can be matched into the database more easily and smoothly.
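A rough sketch of step 05 is given below. It assumes that the second data table starts out empty with the preset node names as its column names, and that a column of the first data table matches when its name is exactly equal to a column name of the second data table. The exact-equality rule, the function names `make_second_table` and `match_into`, and the sample data are assumptions made for illustration; the application does not specify them.

```python
from typing import Any


def make_second_table(node_names: list[str]) -> dict[str, list[Any]]:
    """Empty table whose column names are the preset node names."""
    return {name: [] for name in node_names}


def match_into(first: dict[str, list[Any]],
               second: dict[str, list[Any]]) -> dict[str, list[Any]]:
    """Copy every column of `first` whose name also appears in `second`."""
    n_rows = max((len(col) for col in first.values()), default=0)
    result: dict[str, list[Any]] = {}
    for name in second:
        if name in first:
            result[name] = list(first[name])   # matched column: copy its data
        else:
            result[name] = [None] * n_rows     # unmatched column stays empty
    return result


# Hypothetical data: "Remarks" has no counterpart in the second table and is dropped.
first = {"Item": ["Renovation"], "Closing balance": [90.0], "Remarks": ["n/a"]}
second = make_second_table(["Item", "Opening balance", "Closing balance"])
merged = match_into(first, second)
```

Because the result keeps the column order of the second data table, it lines up with the database template regardless of the column order in the PDF.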

Referring to fig. 4, in some embodiments, step 02 includes the following steps:

021: acquiring a first index row of the target table in the Excel table according to keywords of a primary directory of the target table;

022: matching keywords of the secondary directory from the first index row to obtain a second index row and a third index row of the secondary directory, wherein the second index row is used for indexing a starting row of the secondary directory, and the third index row is used for indexing an ending row of the secondary directory; and

023: intercepting the target table according to the second index row and the third index row to obtain a target sub-table corresponding to the secondary directory.

In some embodiments, the determination module 12 may be further configured to: acquire a first index row of the target table in the Excel table according to keywords of a primary directory of the target table; match keywords of the secondary directory from the first index row to obtain a second index row and a third index row of the secondary directory; and intercept the target table according to the second index row and the third index row to obtain a target sub-table corresponding to the secondary directory. That is, the determination module 12 may be used to implement steps 021, 022, and 023.

In some embodiments, the processor 20 may be further configured to: acquire a first index row of the target table in the Excel table according to keywords of a primary directory of the target table; match keywords of the secondary directory from the first index row to obtain a second index row and a third index row of the secondary directory; and intercept the target table according to the second index row and the third index row to obtain a target sub-table corresponding to the secondary directory. That is, the processor 20 may be used to implement steps 021, 022, and 023.

Specifically, the extracted Excel table may contain a plurality of tables, and the target table is one of them. The target table is not fixed: when the data to be entered changes, the target table changes as well. Likewise, the keywords of the primary directory of the target table can be adjusted according to the attributes of the data to be entered, so that different data can be entered separately and efficiently.

Referring to fig. 5, which shows part of the table data of an Excel table, in one embodiment the keyword of the primary directory of the target table is "VII. Notes to items of the consolidated financial statements". Matching may start from the first row of the Excel table; when the keyword is matched, the first index row, that is, the index of the row in which the keyword is located (framed data A in fig. 5), is obtained. In the embodiment shown in fig. 5, the first index row is 2992. Alternatively, all primary directories in the Excel table may first be identified by features such as font and font size and then compared one by one with the keyword of the primary directory of the target table; when one matches, the first index row of the matched primary directory is obtained, which in the embodiment shown in fig. 5 is again row 2992.

Further, matching continues downward from the first index row for the keyword of the secondary directory, which may be the header or the title of the target sub-table. From the matching result, a second index row and a third index row of the target sub-table can be determined: the second index row indexes the beginning of the target sub-table and the third index row indexes its end. The data of the target table can then be intercepted according to the second index row and the third index row to obtain the target sub-table. The second index row is obtained by matching the keyword of the secondary directory, and the third index row is obtained from the secondary directory that follows it. Attributes such as the font and font size of the secondary directory may differ from those of the primary directory.

Fig. 6 is taken as an example. In one example, the secondary directory is "long-term deferred expenses". Matching proceeds downward from row 2992. When "long-term deferred expenses" is matched, that is, the framed table data B in fig. 6, the row of the secondary directory is found to be row 3637, so the second index row is 3636. Matching then continues downward until the row of the next secondary directory is matched (it can be recognized by the serial number of the secondary directory); for example, the framed table data C of the next secondary directory in fig. 6 is in row 3645, so the third index row is 3644. The data of the target table can therefore be intercepted according to index rows 3636 and 3644, that is, the data between rows 3637 and 3645 is intercepted and used as the target sub-table.
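The search for the three index rows can be sketched as below. This is a simplified illustration, assuming each Excel row has been flattened into one text string, that index rows are zero-based (one less than the displayed row number, as in the 3637/3636 example), and that the third index row is the index of the next secondary directory heading. The function name `locate_sub_table`, the regular expressions and the sample rows are hypothetical.

```python
import re


def locate_sub_table(lines, primary_kw, secondary_kw, any_secondary):
    """Return (first, second, third) zero-based index rows, or None if not found.

    lines         -- text of each Excel row, top to bottom
    primary_kw    -- regex for the primary directory heading
    secondary_kw  -- regex for the wanted secondary directory heading
    any_secondary -- regex that recognises any secondary directory heading
    """
    first = next((i for i, t in enumerate(lines) if re.search(primary_kw, t)), None)
    if first is None:
        return None
    second = next((i for i in range(first + 1, len(lines))
                   if re.search(secondary_kw, lines[i])), None)
    if second is None:
        return None
    # The sub-table ends where the next secondary heading starts
    # (assumption: fall back to the end of the sheet if there is none).
    third = next((i for i in range(second + 1, len(lines))
                  if re.search(any_secondary, lines[i])), len(lines))
    return first, second, third


lines = [
    "VII. Notes to items of the consolidated financial statements",
    "1. Monetary funds",
    "Item | Closing balance",
    "Cash | 10",
    "2. Long-term deferred expenses",
    "Item | Closing balance",
    "Renovation | 90",
    "Total | 90",
    "3. Deferred tax assets",
]
index_rows = locate_sub_table(lines, r"^VII\.", r"Long-term deferred expenses", r"^\d+\.\s")
first_row, second_row, third_row = index_rows
sub_table = lines[second_row:third_row]
```

Restricting the secondary search to rows after the first index row is what keeps a same-named heading elsewhere in the document from being picked up.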

It should be noted that, when the attribute of the data to be entered changes, the corresponding primary directory and secondary directory may change. For example, if the data to be entered is monetary funds, the secondary directory of the target sub-table is "monetary funds". The primary directory changes in a similar way to the secondary directory, which is not described in detail here.

In one embodiment, the location of the target table may be configured using Extensible Markup Language (XML). For example, the XML configuration names the table whose data is to be extracted, and the corresponding table data is then extracted by matching the content of each node. Each value node under the RootParm node of the XML configuration defines the range in which the table to be acquired is located. A value node has the attributes firststartstart, firstLevelIsMust, firstLevelkeyLength, firstNoContian, patternStart, patternEnd, NoContian and keyLength. The firststartstart attribute is the keyword used to match the primary directory, expressed as a regular expression. The firstLevelIsMust attribute indicates whether the primary directory must be matched: Yes means true, and any other value means false. The firstLevelkeyLength attribute adds a constraint that applies when firststartstart is matched; for example, "financial statement:15" means that the current Excel row must contain the keyword "financial statement" (four characters in the original Chinese) and must not be longer than 15 characters, otherwise the primary directory is considered not matched. The firstNoContian attribute is a keyword that the matched data must not contain, expressed as a regular expression. The patternStart attribute is the keyword used to match the secondary directory, expressed as a regular expression. If firstLevelIsMust is Yes, the primary directory must be matched before the secondary directory is matched; otherwise the node cannot locate the table. If firstLevelIsMust is false, the secondary directory is matched even if the primary directory is not. The patternEnd attribute is the keyword that ends the matching of the secondary directory, expressed as a regular expression. NoContian and keyLength play the same roles for the secondary directory as firstNoContian and firstLevelkeyLength play for the primary directory. With several nodes configured in advance in the XML, the range of the target sub-table in the Excel table can be obtained directly, which is efficient and less error-prone.
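To make the attribute semantics concrete, the sketch below parses a hypothetical configuration and applies the primary-directory checks to one row of text. The XML layout (a RootParm element with value children carrying the attributes) is an assumption, as are all attribute values. The attribute names are spelled as they appear in the text where legible; `firststartstart` is garbled in the source and `patternEnd` is an assumed spelling, so both should be treated as placeholders. The helper names `key_length_ok` and `primary_matches` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration; layout, names and values are illustrative assumptions.
CONFIG = r"""
<RootParm>
  <value firststartstart="Notes to .*financial statement"
         firstLevelIsMust="Yes"
         firstLevelkeyLength="financial statement:60"
         firstNoContian="parent company"
         patternStart="Long-term deferred expenses"
         patternEnd="^\d+\.\s"/>
</RootParm>
"""


def key_length_ok(line: str, spec: str) -> bool:
    """spec is 'keyword:max_len': line must contain keyword and be at most max_len long."""
    keyword, _, max_len = spec.rpartition(":")
    return keyword in line and len(line) <= int(max_len)


def primary_matches(line: str, node: ET.Element) -> bool:
    """Apply the primary-directory attributes of one value node to one row of text."""
    if not re.search(node.get("firststartstart", ""), line):
        return False
    excluded = node.get("firstNoContian")
    if excluded and re.search(excluded, line):
        return False                      # row contains a forbidden keyword
    spec = node.get("firstLevelkeyLength")
    return key_length_ok(line, spec) if spec else True


node = ET.fromstring(CONFIG.strip()).find("value")
# "Yes" means the primary directory must be matched; any other value means it need not be.
must_match_primary = node.get("firstLevelIsMust") == "Yes"

heading = "VII. Notes to items of the consolidated financial statements"
other = "XVII. Notes to main items of parent company financial statements"
heading_ok = primary_matches(heading, node)
other_ok = primary_matches(other, node)
```

The secondary-directory attributes (patternStart, patternEnd, NoContian, keyLength) could be applied in the same way to the rows below the matched primary directory.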

Referring to fig. 7, in some embodiments, step 03 includes the following steps:

031: acquiring the starting position of the target sub-table according to a preset first keyword; and

032: acquiring the ending position of the target sub-table according to a preset second keyword.

In some embodiments, the identification module 13 may be further configured to: acquire the starting position of the target sub-table according to a preset first keyword; and acquire the ending position of the target sub-table according to a preset second keyword. That is, the identification module 13 can also be used to implement step 031 and step 032.

In some embodiments, the processor 20 may be further configured to: acquiring an initial position of a target sub-table according to a preset first keyword; and acquiring the end position of the target sub-table according to a preset second keyword. That is, the processor 20 may also be configured to implement step 031 and step 032.

Specifically, the second index row and the third index row of the target sub-table are acquired in step 022. However, in some embodiments, the row at which the data content of the target sub-table starts is not the second index row, and the row at which the data content ends is not the third index row. In that case, the start position and the end position of the data content in the target sub-table can be identified according to keywords, so that the data content of the target sub-table can be extracted more accurately.

The row at which the data content starts usually contains keywords such as item, column, or name, and the row at which the data content ends usually contains keywords such as total. The start position and the end position of the target sub-table can therefore be obtained according to these keywords.

For example, take fig. 6 as an illustration. The preset first keyword is "item" and the preset second keyword is "total". By recognizing the text content of the target sub-table in fig. 6, row 3639 is identified as the row whose content includes the "item" keyword, and row 3643 is identified as the row whose content includes the "total" keyword. The starting position of the target sub-table is therefore row 3639, and the ending position of the target sub-table is row 3643.
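The keyword-based identification of steps 031 and 032 can be sketched as follows. This is a minimal illustration under assumptions: the function name, the (row number, text) representation, and the sample row texts are hypothetical, and only the row numbers 3639 and 3643 come from the example above.

```python
import re

def find_start_end(rows, first_keyword, second_keyword):
    """Return (start_row, end_row) for the data content of a sub-table.

    rows is a list of (row_number, row_text) pairs in sheet order.
    Either element of the result is None if its keyword is not found.
    """
    start = end = None
    for number, text in rows:
        if start is None:
            # Step 031: the first row containing the first keyword is the start.
            if re.search(first_keyword, text):
                start = number
        elif re.search(second_keyword, text):
            # Step 032: the first later row containing the second keyword is the end.
            end = number
            break
    return start, end

# Hypothetical row contents; only the row numbers follow the fig. 6 example.
sample_rows = [
    (3638, "Notes"),
    (3639, "Item 2019 2018"),
    (3640, "Cash 10 20"),
    (3643, "Total 10 20"),
]
print(find_start_end(sample_rows, r"[Ii]tem", r"[Tt]otal"))
```

Searching for the end keyword only after the start row has been found keeps an earlier, unrelated "total" row from being picked up.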

In one embodiment, the starting position of the target sub-table may be located according to the nodes below ColumParm configured in XML, where each child node of ColumParm represents the name of a field of the table. The value node of a child node represents the field to be matched and is expressed as a regular expression, and its NoContian attribute represents keywords that the matched field must not contain, also expressed as a regular expression. As shown in fig. 6, if the "item" keyword is matched, the target sub-table starts at row 3639 (cell row index 3638). The XML configuration may determine the ending position according to whether columns other than the first column of the target sub-table have values, the borders of the cells, and whether a cell occupies multiple columns. As shown in fig. 6, the ending position of the target sub-table is recognized as row 3643, for which the cell row index is 3642.

Referring to fig. 8, in some embodiments, step 04 includes the following steps:

041: acquiring a fourth index row for indexing the initial position and a fifth index row for indexing the end position;

042: intercepting a target sub-table according to the fourth index row and the fifth index row;

043: establishing a first initial data table according to the intercepted data; and

044: sorting the first initial data table to obtain a first data table.

In some embodiments, the setup module 14 may also be configured to: acquiring a fourth index row for indexing the initial position and a fifth index row for indexing the end position; intercepting a target sub-table according to the fourth index row and the fifth index row; establishing a first initial data table according to the intercepted data; and sorting the first initial data table to obtain a first data table. The setup module 14 may also be used to implement step 041, step 042, step 043 and step 044.

In some embodiments, the processor 20 may be further configured to: acquiring a fourth index row for indexing the initial position and a fifth index row for indexing the end position; intercepting a target sub-table according to the fourth index row and the fifth index row; establishing a first initial data table according to the intercepted data; and sorting the first initial data table to obtain a first data table. Processor 20 may also be configured to implement step 041, step 042, step 043 and step 044.

Specifically, after the start position and the end position of the data content in the target sub-table have been obtained in step 03, a fourth index row that indexes the start position and a fifth index row that indexes the end position are required in order to intercept the data in the target sub-table accurately. Specifically, the fourth index row may be the row before the row where the start position is located, and the fifth index row may be the row before the row where the end position is located. The target sub-table can then be intercepted according to the fourth index row and the fifth index row, and a first initial data table corresponding to the intercepted target sub-table can be established from the intercepted data. In this way, all the data in the first initial data table is data that needs to be acquired and no unnecessary data is present, which reduces the time for subsequently entering the data into the database.
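Steps 041 and 042 can be sketched in Python as follows. The sketch assumes that "the row before" reflects a zero-based cell index (row 3639 has index 3638, as in the fig. 6 example); this reading, the function name, and the list-of-rows representation are assumptions made for illustration.

```python
def intercept_sub_table(sheet_rows, start_row, end_row):
    """Return the rows of the target sub-table from start_row to end_row inclusive.

    sheet_rows is a list of rows in sheet order; start_row and end_row are
    1-based row numbers as displayed in the spreadsheet.
    """
    fourth_index = start_row - 1  # zero-based index of the start row (assumed reading)
    fifth_index = end_row - 1     # zero-based index of the end row (assumed reading)
    return sheet_rows[fourth_index:fifth_index + 1]

# A tiny hypothetical sheet: rows 1..5, each row a list of cell values.
sheet = [["title"], ["Item", "2019"], ["Cash", "10"], ["Total", "10"], ["footer"]]
print(intercept_sub_table(sheet, 2, 4))
```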

Furthermore, in some embodiments, the data distribution in the first initial data table is irregular, which easily makes matching the first initial data table with the second data table difficult, slow, or incorrect. To avoid this, the first initial data table can be sorted, which improves the accuracy and efficiency of matching. If the data in the first initial data table is already regular, the first initial data table does not need to be sorted and can be used directly as the first data table.

Referring to fig. 9, the column name of each column in the first initial data table follows the naming rule "Column" + the serial number of the first column occupied by the current cell, and the data content of the intercepted target sub-table starts from the second row.

Referring to fig. 10, in some embodiments, the data in the first row of the first initial data table is the column name of each column, and step 044 includes the following steps:

0441: identifying whether a misplaced cell exists in the first initial data table;

0442: if a misplaced cell is identified, determining, according to the column name of the misplaced cell, whether the column-adjacent cell of the misplaced cell has data, wherein the column-adjacent cell and the misplaced cell are located in adjacent columns; and

0443: if the column-adjacent cell has no data while the cells row-adjacent to the column-adjacent cell have data, merging the column-adjacent cell and the misplaced cell to obtain the first data table.

In some embodiments, the setup module 14 may also be configured to: identifying whether a misplaced cell exists in the first initial data table; if a misplaced cell is identified, determining, according to the column name of the misplaced cell, whether the column-adjacent cell of the misplaced cell has data, wherein the column-adjacent cell and the misplaced cell are located in adjacent columns; and if the column-adjacent cell has no data while the cells row-adjacent to the column-adjacent cell have data, merging the column-adjacent cell and the misplaced cell to obtain the first data table. That is, the setup module 14 may also be used to implement step 0441, step 0442 and step 0443.

In some embodiments, the processor 20 may be further configured to: identifying whether a misplaced cell exists in the first initial data table; if a misplaced cell is identified, determining, according to the column name of the misplaced cell, whether the column-adjacent cell of the misplaced cell has data, wherein the column-adjacent cell and the misplaced cell are located in adjacent columns; and if the column-adjacent cell has no data while the cells row-adjacent to the column-adjacent cell have data, merging the column-adjacent cell and the misplaced cell to obtain the first data table. That is, processor 20 may also be used to implement step 0441, step 0442, and step 0443.

Specifically, the cells in the target sub-table may be irregular, so the first initial data table is prone to containing misplaced cells. Misplaced cells tend to introduce large errors when the data of the first initial data table is matched. Misplaced cells can therefore be identified and adjusted in time when they exist, so that the resulting first data table is accurate and has few errors.

Referring to fig. 11, cell irregularities exist in the target sub-table; for example, cell D in the framed column of fig. 11 occupies part of the cells of the adjacent column. The first initial data table created from the target sub-table of fig. 11 is shown in fig. 12: most of the cells of Column34 are blank, while the cell of Column33 in row 4, where data should be located, is blank. Evidently, the data in row 4 of Column34 should be the content of row 4 of Column33. It is understood that misplaced cells exist in the first initial data table shown in fig. 12, and that the misplaced cells are the cells of Column34. If the first initial data table has no misplaced cells, every cell in its second row has data. Therefore, if a cell in the second row of the first initial data table is recognized as having no data, misplaced cells may be considered to exist, and all cells in the column of that cell are misplaced cells.

Table misalignment is common in Excel and may affect the values of multiple rows. The misplaced cells are therefore sorted out so that no misplaced cells exist in the first data table. Continuing with the example of fig. 12, the sorting process may be as follows: identify the left and right adjacent columns of the column in which the misplaced cells are located (hereinafter, the misplaced column), and determine from the column names which of the two is closer to the misplaced column; for example, of Column33 and Column46, Column33 is closer to Column34. Then identify the rows in which the cells of Column34 have values, and check whether the corresponding cells of Column33 in those rows are without values. If so, merge the values of Column34 into Column33; otherwise, merge them into Column46. Finally, delete the misplaced column. The result of the sorting is shown in fig. 9.
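The sorting of misplaced columns can be sketched in Python as follows. This is a minimal sketch under several assumptions: the table is held as a dictionary from column name to a list of cell values (None for a blank cell), index 0 of each list is the item-name row, and all function names and cell values are hypothetical. Only the column names Column33, Column34 and Column46 follow the example above.

```python
def find_misplaced_columns(table):
    """Return the names of columns whose item-name cell (index 0) is blank."""
    return [name for name, cells in table.items() if cells[0] is None]

def column_number(name):
    """Extract the serial number from a name such as 'Column34'."""
    return int(name.replace("Column", ""))

def merge_misplaced_column(table, misplaced, left, right):
    """Merge the values of a misplaced column into a neighbour and delete it."""
    gap_left = abs(column_number(misplaced) - column_number(left))
    gap_right = abs(column_number(right) - column_number(misplaced))
    near, far = (left, right) if gap_left <= gap_right else (right, left)
    rows_with_value = [i for i, v in enumerate(table[misplaced]) if v is not None]
    # Merge into the nearer column only if it is blank in every affected row;
    # otherwise fall back to the other neighbour (existing values are overwritten).
    if all(table[near][i] is None for i in rows_with_value):
        target = near
    else:
        target = far
    for i in rows_with_value:
        table[target][i] = table[misplaced][i]
    del table[misplaced]
    return table

# Hypothetical cell contents.
table = {
    "Column33": ["Item", "a", None, "c"],
    "Column34": [None, None, "b", None],
    "Column46": ["2019", "1", "2", "3"],
}
misplaced_columns = find_misplaced_columns(table)
merged = merge_misplaced_column(table, "Column34", "Column33", "Column46")
```

The fallback to the farther neighbour follows the "otherwise" branch of the description; a stricter variant could refuse to overwrite non-blank cells.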

Referring to fig. 13, in some embodiments, the data in the first row of the first initial data table is the column name of each column, and step 044 further includes the following steps:

0444: determining whether a merged row spanning multiple columns exists in the first initial data table; and

0445: if so, merging the merged row and the item row according to the data in the merged row and the data in the item row adjacent to the merged row to obtain the first data table.

In some embodiments, the establishing module 14 may be further configured to determine whether a merged row spanning multiple columns exists in the first initial data table; if so, merging the merged row and the item row according to the data in the merged row and the data in the item row adjacent to the merged row to obtain the first data table. That is, the establishing module 14 may also be used to implement step 0444 and step 0445.

In some embodiments, the processor 20 may be further configured to determine whether a merged row spanning multiple columns exists in the first initial data table; if so, merging the merged row and the item row according to the data in the merged row and the data in the item row adjacent to the merged row to obtain the first data table. That is, processor 20 may also be used to implement steps 0444 and 0445.

Specifically, referring to fig. 14, the target sub-table shown in fig. 14 contains two merged rows, each spanning multiple columns, and the first initial data table created from it is shown in fig. 15. If data is imported using the table shown in fig. 15, the corresponding data may not be identified accurately; for example, a matching keyword such as "balance in 2019 s" cannot be matched. This embodiment processes the merged rows, which facilitates subsequent data matching and reduces the matching difficulty.

Whether a merged row spanning multiple columns exists can be determined by identifying the number of columns occupied by each cell in the first initial data table. When such merged rows exist in the first initial data table, each merged row may be merged with the adjacent item row, that is, the adjacent row that contains no multi-column merged cells. The table obtained after merging is the first data table, in which each cell occupies only one row and one column.

For example, the first initial data table obtained is shown in fig. 15. It can be recognized that the second row and the third row of the table are merged rows spanning multiple columns, and that the fourth row, adjacent to the third row, contains no multi-column merged cells. Most of the data in the columns of the fourth row contains the word "year", the data in the third row is 2019, and the keyword in the second row is "local group". The data in the second row, the third row and the fourth row can therefore be merged, and the merged data can be as shown in fig. 16. Alternatively, the other cells in the first initial data table may be merged according to the merging condition of the cells in the first row and the first column of the Excel table to obtain the first data table, as also shown in fig. 16.
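One possible way to flatten merged rows into the item row is sketched below in Python. The representation of a merged row as (text, column span) pairs, the space-separated concatenation order, the function names, and the sample header texts are all assumptions made for illustration; figs. 15 and 16 may use a different layout.

```python
def expand_merged_row(cells):
    """Repeat the text of each merged cell across the columns it spans.

    cells is a list of (text, column_span) pairs.
    """
    flat = []
    for text, span in cells:
        flat.extend([text] * span)
    return flat

def merge_header_rows(merged_rows, item_row):
    """Combine the merged rows with the item row into one name per column."""
    expanded = [expand_merged_row(row) for row in merged_rows]
    result = []
    for col, item in enumerate(item_row):
        # Collect the non-empty text above this column, top to bottom.
        parts = [row[col] for row in expanded if row[col]]
        if item:
            parts.append(item)
        result.append(" ".join(parts))
    return result

# Hypothetical header: two merged rows above one item row.
merged_rows = [
    [("", 1), ("Group", 2)],
    [("", 1), ("2019", 2)],
]
item_row = ["Item", "Opening balance", "Closing balance"]
print(merge_header_rows(merged_rows, item_row))
```

After this step every header cell occupies exactly one row and one column, which is the property the first data table is required to have.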

In certain embodiments, after step 043, or after step 0443 or step 0445, step 04 further comprises the steps of:

045: acquiring a starting column and an ending column of a first initial data table; and

046: deleting the columns in the first initial data table that are located before the starting column and the columns in the first initial data table that are located after the ending column.

In some embodiments, the establishing module may be further configured to: acquiring a starting column and an ending column of a first initial data table; and deleting the columns in the first initial data table that are located before the starting column and the columns in the first initial data table that are located after the ending column. That is, the establishing module may also be used to implement step 045 and step 046.

In some embodiments, the processor may be further configured to: acquiring a starting column and an ending column of a first initial data table; and deleting the columns in the first initial data table that are located before the starting column and the columns in the first initial data table that are located after the ending column. That is, the processor may also be configured to implement step 045 and step 046.

Specifically, blank columns may exist in the obtained first initial data table. In order to reduce the memory occupied by the first initial data table and to make its data more concentrated, a starting column and an ending column of the first initial data table may be determined. The starting column is the column in which the cell where the specific content of the first initial data table starts is located, and the ending column is the column in which the cell where the specific content ends is located. The columns before the starting column and the columns after the ending column may then be deleted, so that no redundant cells remain, and the table is more compact and occupies less memory.
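Steps 045 and 046 can be sketched as follows. The sketch assumes the table is a list of equal-length rows in which blank cells are None or empty strings; the function name and sample values are hypothetical.

```python
def trim_blank_columns(rows):
    """Delete the columns before the starting column and after the ending column."""
    if not rows:
        return []
    width = len(rows[0])
    # A column holds content if any of its cells is non-blank.
    filled = [
        col for col in range(width)
        if any(row[col] not in (None, "") for row in rows)
    ]
    if not filled:
        return [[] for _ in rows]
    start_col, end_col = filled[0], filled[-1]
    return [row[start_col:end_col + 1] for row in rows]

# Hypothetical table with one blank column on each side.
rows = [
    [None, "Item", "2019", None],
    [None, "Cash", None, None],
]
print(trim_blank_columns(rows))
```

Blank columns that lie between the starting column and the ending column are kept, since only the leading and trailing columns are removed in these steps.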

In certain embodiments, step 04 may further comprise the steps of:

047: and deleting the head row in the first data table so that the column name of each column in the first data table is the item name.

In some embodiments, the creating module 14 may be further configured to delete the first row in the first data table, so that the column name of each column in the first data table is the item name. That is, the setup module 14 may also be used to implement step 047.

In some embodiments, the processor 20 is further configured to delete the first row in the first data table, such that the column name of each column in the first data table is the item name. That is, processor 20 may also be used to implement step 047.

Specifically, the head row of the first initial data table contains names of the form "Column" + the serial number of the first column occupied by the current cell, and each cell in the second row of the first initial data table contains a specific item name. In order to better match the first data table with the second data table, the head row can be deleted, so that the head row of the resulting table contains the item names and each row below it contains the data to be extracted (i.e., entered).
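Step 047 can be sketched in Python as follows, assuming the table is a list of rows whose first row holds the generated names and whose second row holds the item names. The function name and the sample values are hypothetical.

```python
def promote_item_names(rows):
    """Drop the generated head row and use the item-name row as column names.

    Returns the item names and one dictionary per data row.
    """
    item_names = rows[1]  # rows[0] holds names such as Column33 and is discarded
    records = [dict(zip(item_names, row)) for row in rows[2:]]
    return item_names, records

# Hypothetical first initial data table.
first_initial = [
    ["Column33", "Column46"],
    ["Item", "2019"],
    ["Cash", "10"],
    ["Total", "10"],
]
item_names, records = promote_item_names(first_initial)
```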

Referring to FIG. 17, in some embodiments, the document processing method may further include:

06: establishing a second initial data table, wherein the data of each cell of the second initial data table is null;

07: acquiring the name of each preset node in a preset configuration file; and

08: inputting the name of each preset node into a second initial data table to obtain a second data table, wherein the column name of each column in the second data table is the name of each preset node.

In some embodiments, the document processing device 10 may be further configured to: establishing a second initial data table, wherein the data of each cell of the second initial data table is null; acquiring the name of each preset node in a preset configuration file; and inputting the name of each preset node into a second initial data table to obtain a second data table, wherein the column name of each column in the second data table is the name of each preset node. That is, the document processing apparatus 10 can also be used to implement step 06, step 07, and step 08.

In some embodiments, the processor 20 may be further configured to: establishing a second initial data table, wherein the data of each cell of the second initial data table is null; acquiring the name of each preset node in a preset configuration file; and inputting the name of each preset node into a second initial data table to obtain a second data table, wherein the column name of each column in the second data table is the name of each preset node. That is, processor 20 may also be used to implement step 06, step 07, and step 08.

Specifically, the second data table may be a standard table, and the second data table may be matched with the database to accurately and efficiently import the corresponding data into the database. First, a second initial data table with empty contents may be established. And then acquiring the name of each preset node in the preset configuration file, taking the name of each preset node as the column name of each column of the second initial data table, and filling the names into the cells of the first row of the second initial data table in sequence. Thus, the second data table is more standard and more matched to the database.

In one embodiment, the name of each predetermined node under ColumParm in the XML configuration file is converted into a column name of the second initial data table. Specifically, the Name attribute, the value of Value, and the NoContian attribute of each predetermined node are converted into the title (header) of each column. The Name attribute is separated from the Value node by "@", the value of Value and NoContain are separated by the special symbol ":", and multiple Value nodes are separated by "&". The type of the values of each column, such as text or number, is also defined, and the resulting table is the second data table.
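The construction of column titles from the configuration can be sketched in Python with the standard xml.etree.ElementTree module. The XML layout below is an assumption made for illustration: the element names Config and Field, the lowercase value tag, and all regular expressions are hypothetical, and only ColumParm, Name, NoContian and the separators "@", ":" and "&" follow the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real file layout is not given in the text.
CONFIG = """
<Config>
  <ColumParm>
    <Field Name="ItemName">
      <value NoContian="total">item|name</value>
    </Field>
    <Field Name="ClosingBalance">
      <value NoContian="opening">closing balance</value>
      <value NoContian="">balance at end</value>
    </Field>
  </ColumParm>
</Config>
"""

def build_titles(xml_text):
    """Build one title per field in the form Name@value:NoContian&value:NoContian."""
    root = ET.fromstring(xml_text)
    titles = []
    for field in root.find("ColumParm"):
        values = "&".join(
            f"{value.text}:{value.get('NoContian', '')}"
            for value in field.findall("value")
        )
        titles.append(f"{field.get('Name')}@{values}")
    return titles

titles = build_titles(CONFIG)
```

A regular expression that itself contains "@", ":" or "&" would make such a title ambiguous, so a real configuration would need to avoid or escape these characters.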

Further, referring to fig. 18, in some embodiments, step 05 includes the following steps:

051: matching the column name of each column of the first data table with the title of each column of the second data table; and

052: inputting the data of the column successfully matched with the second data table in the first data table into the corresponding column of the second data table.

In some embodiments, the matching module 15 may be further configured to match the column name of each column of the first data table with the title of each column of the second data table; and inputting the data of the column successfully matched with the second data table in the first data table into the corresponding column of the second data table. That is, the matching module 15 may also be used to implement step 051 and step 052.

In some embodiments, the processor 20 may be further configured to match the column name of each column of the first data table with the title of each column of the second data table; and inputting the data of the column successfully matched with the second data table in the first data table into the corresponding column of the second data table. That is, processor 20 may also be used to implement steps 051 and 052.

Specifically, the column name of each column in the first data table is an item name, and the title of each column in the second data table includes the name of a predetermined node. In order to fill the data of the first data table into the second data table accurately, the column names of the first data table may be matched with the titles of the columns of the second data table. If a column name and a title match, the data in the cells of that column of the first data table may be input into the cells of the corresponding column of the second data table. Once the second data table has been completely filled, its data may be further entered into the database, which makes data entry efficient and avoids the time spent on manual entry.

Still further, referring to fig. 19, in some embodiments, the header of the second data table includes a matchable value and an unmatchable value, and step 051 includes the steps of:

0511: matching the column names of each column of the first data table with a matchable value and a non-matchable value respectively; and

0512: when the column name of the first data table is successfully matched with the matchable value and is failed to be matched with the unmatchable value, determining that the matching is successful.

In some embodiments, matching module 15 may be further configured to match the column names of each column of the first data table with a matchable value and a non-matchable value, respectively; and when the column name of the first data table is successfully matched with the matchable value and is failed to be matched with the unmatchable value, determining that the matching is successful. That is, the matching module 15 may also be used to implement step 0511 and step 0512.

In some embodiments, the processor 20 is further operable to match the column names of each column of the first data table with a matchable value and a non-matchable value, respectively; and when the column name of the first data table is successfully matched with the matchable value and is failed to be matched with the unmatchable value, determining that the matching is successful. That is, processor 20 may also be used to implement step 0511 and step 0512.

Specifically, the header of the second data table includes a matchable value and an unmatchable value. The matchable value is a value that must be matched, and the unmatchable value is a value that must not be matched. If both the matchable value and the unmatchable value are matched, the matching is considered to have failed; if the matchable value is matched and the unmatchable value is not matched, the matching is considered successful. The column name of each column of the first data table can be matched against the matchable value and the unmatchable value of each column of the second data table using regular expressions, and the data of the successfully matched columns of the first data table is filled into the second data table until all data in the first data table has been extracted. Adding the check against the unmatchable values increases the accuracy of the data in the second data table and prevents data that fails the matching from being input into the second data table.
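Steps 0511, 0512 and 052 can be sketched in Python as follows. The data structures, function names, and sample patterns are hypothetical: the first data table is assumed to be a dictionary from column name to a list of values, and the second data table is described by a dictionary from field name to a (matchable, unmatchable) pair of regular expressions.

```python
import re

def column_matches(column_name, matchable, unmatchable=""):
    """Succeed only if the matchable pattern is found and the unmatchable one is not."""
    if not re.search(matchable, column_name):
        return False
    if unmatchable and re.search(unmatchable, column_name):
        return False
    return True

def fill_second_table(first_table, second_spec):
    """Copy each successfully matched column of the first table into the second."""
    second = {name: [] for name in second_spec}
    for name, (matchable, unmatchable) in second_spec.items():
        for column_name, data in first_table.items():
            if column_matches(column_name, matchable, unmatchable):
                second[name] = list(data)
                break  # take the first matching column for this field
    return second

# Hypothetical first data table and field specification.
first_table = {
    "Item": ["Cash", "Total"],
    "2019 opening balance": ["5", "5"],
    "2019 closing balance": ["10", "10"],
}
second_spec = {
    "ItemName": (r"[Ii]tem", ""),
    "Balance": (r"balance", r"opening"),
}
second_table = fill_second_table(first_table, second_spec)
```

In this example the "opening balance" column contains the matchable pattern but also the unmatchable one, so it is rejected and the "closing balance" column is used instead.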

Further, after the second data table has been completely filled, it can be matched with the database and its data entered into the database once the special fields in the second data table have been simply processed. For example, the symbols "@", ":" and "&" and the like, together with the value contents of "NoContain" and "value" in the header of the second data table, may be removed. That is, the header of the second data table retains only the matchable values, so that the second data table is easier to understand and the data is more intuitive, as shown in fig. 20.

Referring to fig. 1 and fig. 2 again, the memory 30 is used for storing a computer program that can be executed on the processor 20, and the processor 20 executes the computer program to implement the document processing method according to any of the above embodiments.

The memory 30 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory. Further, the electronic device 100 may also include a communication interface 40, the communication interface 40 being used for communication between the memory 30 and the processor 20.

If the memory 30, the processor 20 and the communication interface 40 are implemented independently, the communication interface 40, the memory 30 and the processor 20 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 3, but this does not mean only one bus or one type of bus.

Optionally, in a specific implementation, if the memory 30, the processor 20, and the communication interface 40 are integrated on a chip, the memory 30, the processor 20, and the communication interface 40 may complete communication with each other through an internal interface.

The processor 20 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.

Referring to fig. 21, a non-transitory computer-readable storage medium 300 according to an embodiment of the present application includes computer-executable instructions 301, which when executed by one or more processors 400, cause the processors 400 to perform a document processing method according to any embodiment of the present application. That is, the processor 20 may execute step 01, step 02, step 03, step 04, step 05, step 06, step 07, step 08, step 021, step 022, step 023, step 031, step 032, step 041, step 042, step 043, step 044, step 045, step 046, step 047, step 0441, step 0442, step 0443, step 0444, step 0445, step 051, step 052, step 0511, and step 0512 in the above embodiments.

For example, referring to fig. 1, when the computer-executable instructions 301 are executed by the processor 400, the processor 400 is configured to perform the following steps:

01: converting the PDF file into an Excel table;

02: determining the position of a target table in the Excel table;

03: identifying the starting position and the ending position of a target sub-table in a target table;

04: establishing a first data table according to the starting position, the ending position and the target sub-table; and

05: matching the data of the first data table to a second data table obtained in advance, wherein the second data table is matched with the database.

For another example, referring to fig. 8, when the computer-executable instructions 301 are executed by the processor 400, the processor 400 is configured to perform the following steps:

041: acquiring a fourth index row of the index starting position and a fifth index row of the index ending position;

042: intercepting a target sub-table according to the fourth index row and the fifth index row;

043: establishing a first initial data table according to the intercepted data; and

044: sorting the first initial data table to obtain a first data table.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," "some examples," or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, one skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Any process or method description in the flowcharts or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application belong.

Although embodiments of the present application have been shown and described above, it is to be understood that the above embodiments are exemplary and not to be construed as limiting the present application, and that changes, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
