Method, device and computer storage medium for automatically identifying PDF electronic receipt information

Document No.: 1310782 Publication date: 2020-07-10

Abstract: This technology, "Method, device and computer storage medium for automatically identifying PDF electronic receipt information", was created by 秦涛, 王士勇, 钟如玉, 李海彬 and 司慧杰 on 2020-03-11. At present, many group companies manually match hundreds or even thousands of bank receipts to fund settlement sheets and business report bills at the end of each month, so there is an urgent need to improve work efficiency and reduce cost. To address this problem, the invention provides a method for automatically identifying PDF electronic receipt information, comprising: presetting bank templates in a database, receiving a PDF electronic receipt task sent by a user, determining the corresponding bank template, reading the required business content, inserting it into a database business table, and automatically matching the fund settlement sheet and the business report bill. By presetting templates for each bank, the invention recognizes text-based bank electronic PDF receipt files as formatted data and parses them, and then automatically associates, in sequence, the fund settlement sheet and the business report bill. This removes the pain point of heavy workload, long processing time and low efficiency that cashiers face when reconciling accounts manually with paper bank receipts.

1. A method for automatically identifying PDF electronic receipt information, characterized by comprising the following steps:

S1: receiving a PDF electronic receipt task sent by a user;

S2: determining the corresponding bank template;

S3: reading the required business content;

S4: inserting the business content into a business table of the database;

S5: automatically matching the fund settlement sheet and the business report bill.

2. The method for automatically identifying PDF electronic receipt information according to claim 1, further comprising, before step S1:

S0: presetting the bank template.

3. The method for automatically identifying PDF electronic receipt information according to claim 2, wherein step S0 comprises:

S101: reading the PDF electronic receipt text information of each bank;

S102: establishing a template preset table according to the PDF electronic receipt text information of each bank;

S103: establishing a field preset table according to the PDF electronic receipt text information of each bank;

S104: analyzing the text data and presetting the data of the preceding (start) field and the following (end) field.

4. The method for automatically identifying the PDF electronic receipt information according to claim 3, wherein the template preset table in step S102 includes fields whose data types are all VARCHAR: internal code, bank number, bank name, template number and template name.

5. The method for automatically identifying PDF electronic receipt information according to claim 3, wherein the field preset table in step S103 includes fields whose data types are all VARCHAR: internal code, field name, field number, start field, end field and start field sequence number.

6. The method for automatically identifying the PDF electronic receipt information according to claim 5, wherein step S2 comprises:

looping over the data of the template preset table, acquiring the field name data of the field preset table corresponding to each template, and searching the read text content for them one by one until a unique template is matched; if multiple templates are found, prompting that multiple bank templates were found and that the template configuration should be checked; and if no matching template is found, prompting that the corresponding bank template cannot be found.

7. The method for automatically identifying the PDF electronic receipt information according to claim 5, wherein step S3 comprises:

after the template is determined, searching the read text content according to the start field and the end field in the field preset table; if the start field occurs more than once in the text content, determining the start position according to the start field sequence number; and then searching forward for the first matching end field, wherein the content between the two fields is the required business content.

8. An apparatus for automatically identifying PDF electronic receipt information, comprising:

a memory for storing a computer program;

a processor for executing said computer program to implement the method of automatically identifying PDF electronic receipt information according to any of the preceding claims 1 to 7.

9. A computer storage medium storing a computer program, wherein the computer program, when executed, causes the device in which the computer storage medium is located to perform the method for automatically identifying PDF electronic receipt information according to any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of computers, in particular to a method, a device and a storage medium for automatically identifying PDF electronic receipt information.

Background

A bank receipt is the original basis on which an enterprise prepares its bookkeeping vouchers, and every receipt or payment made by the enterprise has a corresponding receipt as evidence. A receipt mainly contains detailed information such as the date, serial number, account number, currency and amount, and there is one receipt for each transaction. A large number of receipts therefore have to be processed in the fund management of a group company.

At present, group companies exercise increasingly tight control over the receipts and payments of their subordinate enterprises. At the end of each month, hundreds or even thousands of bank receipts are matched manually to fund settlement sheets and business report bills. This simple, repetitive work is very time-consuming and is a pain point for cashiers, so there is an urgent need to improve work efficiency and reduce cost.

Disclosure of Invention

To address the above problems, the present invention provides a method for automatically identifying PDF electronic receipt information, which aims to obtain the required text content accurately and to free cashiers from reconciling accounts manually with paper bank receipts.

There are currently many ways to read a text-based PDF document. For example, iTextSharp and PDFBox can both read the content out as a character string. However, the format of electronic PDF receipts is not uniform across banks, and even a single bank may use several formats. The order of the characters read out therefore varies, and the required text content cannot be identified and acquired accurately in one fixed way.

The acquired character string can therefore only be parsed automatically according to certain logic rules. By presetting a template, together with the preceding field and the following field of each target field, the required text content can be acquired more accurately.

In order to achieve the above object, the present invention provides a method for automatically identifying PDF electronic receipt information, comprising:

S1: receiving a PDF electronic receipt task sent by a user;

S2: determining the corresponding bank template;

S3: reading the required business content;

S4: inserting the business content into a business table of the database;

S5: automatically matching the fund settlement sheet and the business report bill.

Preferably, step S1 is preceded by the following step:

S0: presetting the bank template.

Further, step S0 includes:

S101: reading the PDF electronic receipt text information of each bank;

S102: establishing a template preset table according to the PDF electronic receipt text information of each bank;

S103: establishing a field preset table according to the PDF electronic receipt text information of each bank;

S104: analyzing the text data and presetting the data of the preceding (start) field and the following (end) field.

Preferably, the template preset table in step S102 includes the following fields, all of data type VARCHAR: internal code, bank number, bank name, template number and template name.

Preferably, the field preset table in step S103 includes the following fields, all of data type VARCHAR: internal code, field name, field number, start field, end field and start field sequence number.

Further, step S2 includes:

looping over the data of the template preset table, acquiring the field name data of the field preset table corresponding to each template, and searching the read text content for them one by one until a unique template is matched; if multiple templates are found, prompting that multiple bank templates were found and that the template configuration should be checked; and if no matching template is found, prompting that the corresponding bank template cannot be found.

Further, step S3 includes:

after the template is determined, searching the read text content according to the start field and the end field in the field preset table; if the start field occurs more than once in the text content, determining the start position according to the start field sequence number; and then searching forward for the first matching end field, wherein the content between the two fields is the required business content.

The invention also provides a device for automatically identifying the PDF electronic receipt information, which comprises:

a memory for storing a computer program;

a processor for executing the computer program to implement any one of the above methods for automatically identifying PDF electronic receipt information.

The invention also provides a computer storage medium storing a computer program which, when executed, causes the device in which the computer storage medium is located to perform any one of the above methods for automatically identifying PDF electronic receipt information.

By presetting templates for each bank, the invention reads text-based bank electronic PDF receipt files into the system and recognizes them as formatted data. The formatted data is parsed according to the preset format, and the fund settlement sheet and the business report bill are then associated automatically, one after the other. This removes the pain point of heavy workload, long processing time and low efficiency that cashiers face when reconciling accounts manually with paper bank receipts.

In addition, the template format can be defined flexibly for each bank, and where a single bank uses several PDF receipt formats, a corresponding template can be defined for each of them. By presetting the preceding field and the following field of each target field and identifying them flexibly, the required text content can be acquired accurately, and the fund settlement sheet and the business report bill can then be matched automatically according to the acquired content.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of searching the read text content according to the start field and the end field in the field preset table;

FIG. 3 is a schematic diagram of a PDF electronic receipt of a certain bank in the embodiment.

Detailed Description

To better illustrate the method of the invention and make it easier to understand, an example is given below. The example is for illustration only and should not be taken as limiting the scope of the present invention.

The invention provides a method for automatically identifying PDF electronic receipt information, which comprises the following steps:

Taking the PDF electronic receipt of a certain bank (FIG. 3) as an example, the text content is read using the GetTextFromPage method of the PdfTextExtractor class in the iTextSharp assembly for C#.

The structure of the template preset table is shown in Table 1:

Serial No.   Field name        Field identifier      Data type
1            Internal code     ZJYHDZHDYSZB_NM       VARCHAR(40)
2            Bank number       ZJYHDZHDYSZB_YHBH     VARCHAR(100)
3            Bank name         ZJYHDZHDYSZB_YHMC     VARCHAR(100)
4            Template number   ZJYHDZHDYSZB_MBBH     VARCHAR(100)
5            Template name     ZJYHDZHDYSZB_MBMC     VARCHAR(100)

TABLE 1

The structure of the field preset table is shown in Table 2:

Serial No.   Field name                    Field identifier      Data type
1            Internal code                 ZJYHDZHDYS_NM         VARCHAR(40)
2            Field name                    ZJYHDZHDYS_ZDMC       VARCHAR(100)
3            Field number                  ZJYHDZHDYS_ZDBH       VARCHAR(40)
4            Start field                   ZJYHDZHDYS_KSZD       VARCHAR(100)
5            End field                     ZJYHDZHDYS_ZZZD       VARCHAR(100)
6            Start field sequence number   ZJYHDZHDYS_KSZDXH     VARCHAR(10)

TABLE 2
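The two preset tables can be sketched as database definitions. The following is a minimal illustration using Python's built-in sqlite3 module, not the implementation described in the patent. The column identifiers and lengths come from Tables 1 and 2; the table names ZJYHDZHDYSZB and ZJYHDZHDYS are an assumption inferred from the column prefixes, and the choice of SQLite is purely for illustration.

```python
import sqlite3

# Column identifiers and lengths follow Tables 1 and 2.
# Assumption: the table names are inferred from the column prefixes.
TEMPLATE_TABLE_DDL = """
CREATE TABLE ZJYHDZHDYSZB (
    ZJYHDZHDYSZB_NM   VARCHAR(40),   -- internal code
    ZJYHDZHDYSZB_YHBH VARCHAR(100),  -- bank number
    ZJYHDZHDYSZB_YHMC VARCHAR(100),  -- bank name
    ZJYHDZHDYSZB_MBBH VARCHAR(100),  -- template number
    ZJYHDZHDYSZB_MBMC VARCHAR(100)   -- template name
)
"""

FIELD_TABLE_DDL = """
CREATE TABLE ZJYHDZHDYS (
    ZJYHDZHDYS_NM     VARCHAR(40),   -- internal code
    ZJYHDZHDYS_ZDMC   VARCHAR(100),  -- field name
    ZJYHDZHDYS_ZDBH   VARCHAR(40),   -- field number
    ZJYHDZHDYS_KSZD   VARCHAR(100),  -- start field
    ZJYHDZHDYS_ZZZD   VARCHAR(100),  -- end field
    ZJYHDZHDYS_KSZDXH VARCHAR(10)    -- start field sequence number
)
"""


def create_preset_tables(conn):
    """Create the template preset table and the field preset table."""
    conn.execute(TEMPLATE_TABLE_DDL)
    conn.execute(FIELD_TABLE_DDL)


conn = sqlite3.connect(":memory:")
create_preset_tables(conn)
# PRAGMA table_info returns one row per column; index 1 is the column name.
field_columns = [row[1] for row in conn.execute("PRAGMA table_info(ZJYHDZHDYS)")]
print(field_columns)
```

Because every column is VARCHAR, the start field sequence number is stored as text and would need to be converted to an integer before it is used as an occurrence index.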

Analyzing the character sequence yields the preset data for the template and for the preceding and following fields of each business field. The resulting template preset table is shown in Table 3:

TABLE 3

The field preset table is shown in Table 4:

TABLE 4

Other bank templates are preset in the database in the same way.

When a PDF electronic receipt task is received from the user, the program parses the receipt according to the preset data, in the following steps.

Step 1: Determining the template. The data of the template preset table is looped over, the ZJYHDZHDYS_ZDMC column data of the field preset table corresponding to each template is acquired, and the read text content is searched for these values one by one until a unique template is matched. If multiple templates are found, the user is prompted that multiple bank templates were found and that the template configuration should be checked; if no matching template is found, the user is prompted that the corresponding bank template cannot be found.
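The template lookup can be sketched as follows. This is a minimal illustration rather than the patented implementation. It assumes that a template matches when every one of its preset field names (the ZJYHDZHDYS_ZDMC values) occurs in the extracted text; the source does not state the exact matching criterion. The function name `find_template`, the template numbers and the sample receipt text are hypothetical.

```python
def find_template(text, templates):
    """Return the number of the single template whose field names all occur in text.

    templates maps a template number to its list of preset field names.
    Assumption: a template matches only if all of its field names are found.
    """
    matches = [
        template_no
        for template_no, field_names in templates.items()
        if field_names and all(name in text for name in field_names)
    ]
    if len(matches) > 1:
        raise ValueError(
            "Multiple bank templates found; check the template configuration"
        )
    if not matches:
        raise LookupError("The corresponding bank template cannot be found")
    return matches[0]


# Hypothetical preset data and extracted receipt text.
templates = {
    "BANK_A_01": ["Payer account", "Payee account", "Amount"],
    "BANK_B_01": ["Debit account", "Credit account", "Transaction amount"],
}
text = "Payer account 6222 0001 Payee account 6222 0002 Amount 1,000.00"
print(find_template(text, templates))
```

Raising distinct exceptions for the two failure cases lets the caller show the two different prompts described above.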

Step 2: Reading the required business content. After the template is determined, the read text content is searched according to ZJYHDZHDYS_KSZD (start field) and ZJYHDZHDYS_ZZZD (end field) in the field preset table, as shown in FIG. 2. If the start field occurs more than once in the text content, the start position is determined according to ZJYHDZHDYS_KSZDXH (start field sequence number). The first matching end field after that position is then located, and the content between the two fields is the required business content.
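The start-field and end-field search can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented code: the start field sequence number is taken to be a 1-based occurrence count, surrounding whitespace is stripped from the result, and the function name `extract_field` and the sample text are hypothetical.

```python
def extract_field(text, start_field, end_field, start_index=1):
    """Return the text between the chosen occurrence of start_field and the
    first end_field that follows it, or None if either marker is missing.

    Assumption: start_index is a 1-based occurrence number.
    """
    pos = -1
    for _ in range(start_index):
        # Continue searching just past the previous occurrence.
        pos = text.find(start_field, pos + 1)
        if pos == -1:
            return None
    content_start = pos + len(start_field)
    end = text.find(end_field, content_start)
    if end == -1:
        return None
    return text[content_start:end].strip()


# Hypothetical extracted text in which "Account No:" appears twice.
text = (
    "Payer Account No: 111 Name: Alpha Co "
    "Payee Account No: 222 Name: Beta Co Amount: 500.00"
)
payer_account = extract_field(text, "Account No:", "Name:", 1)
payee_account = extract_field(text, "Account No:", "Name:", 2)
print(payer_account, payee_account)
```

Since the preset table stores the sequence number as VARCHAR, a caller would convert it with `int()` before passing it as `start_index`.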

Step 3: Inserting into the database business table. An SQL statement is formed from the business content read in step 2 and the corresponding ZJYHDZHDYS_ZDBH (field number), and the content is inserted into the business table.
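Forming the insert statement can be sketched as follows. This is a hedged sketch, not the patented implementation: it assumes that each field number (ZJYHDZHDYS_ZDBH) is the name of a column in the business table, which the source implies but does not state. The table name `RECEIPT_BUSINESS`, its columns and the function name `build_insert` are hypothetical, and SQLite is used only for illustration.

```python
import re
import sqlite3

# Column and table names cannot be bound as parameters, so they are validated.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_insert(table, values):
    """Build a parameterized INSERT from a {field number: content} mapping.

    Assumption: each field number is a column name of the business table.
    """
    for name in (table, *values):
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    columns = ", ".join(values)
    placeholders = ", ".join("?" for _ in values)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return sql, list(values.values())


# Hypothetical business table and extracted content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RECEIPT_BUSINESS (SERIAL_NO TEXT, AMOUNT TEXT)")
sql, params = build_insert(
    "RECEIPT_BUSINESS", {"SERIAL_NO": "20200311001", "AMOUNT": "1000.00"}
)
conn.execute(sql, params)
rows = conn.execute("SELECT SERIAL_NO, AMOUNT FROM RECEIPT_BUSINESS").fetchall()
print(sql, rows)
```

Binding the extracted values as parameters, rather than concatenating them into the statement, keeps receipt text that contains quotes from breaking the SQL.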

Step 4: Automatically matching the fund settlement sheet and the business report bill. The fund settlement sheet and the business report bill are looked up according to the business table data formed in step 3.
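The source does not say which fields the matching uses, so the following is only one possible sketch. It assumes that a receipt and a fund settlement sheet correspond when their account and amount agree, and that a receipt is matched only when exactly one sheet qualifies. The record keys (`serial_no`, `sheet_no`, `account`, `amount`) and the function names are hypothetical.

```python
from decimal import Decimal


def match_key(record):
    """Assumed matching key: account plus amount compared as a decimal."""
    return (record["account"], Decimal(record["amount"]))


def match_receipts(receipts, sheets):
    """Pair each receipt with the single settlement sheet sharing its key.

    Returns (matched, unmatched): matched maps receipt serial number to
    sheet number; unmatched lists receipts with zero or several candidates.
    """
    index = {}
    for sheet in sheets:
        index.setdefault(match_key(sheet), []).append(sheet["sheet_no"])
    matched, unmatched = {}, []
    for receipt in receipts:
        candidates = index.get(match_key(receipt), [])
        if len(candidates) == 1:
            matched[receipt["serial_no"]] = candidates[0]
        else:
            unmatched.append(receipt["serial_no"])
    return matched, unmatched


# Hypothetical business table rows and fund settlement sheets.
receipts = [
    {"serial_no": "R001", "account": "6222-0002", "amount": "500.00"},
    {"serial_no": "R002", "account": "6222-0003", "amount": "75.10"},
]
sheets = [
    {"sheet_no": "JS-01", "account": "6222-0002", "amount": "500.0"},
    {"sheet_no": "JS-02", "account": "6222-0009", "amount": "75.10"},
]
matched, unmatched = match_receipts(receipts, sheets)
print(matched, unmatched)
```

Under the same assumption, the business report bill could be located in a second pass keyed on the matched settlement sheet, which mirrors the sequential association the description mentions. Receipts left unmatched would still need manual review.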

The above is only one embodiment of the present invention and is not intended to limit its scope of protection. All equivalent changes made using the contents of the specification and the accompanying drawings of the present invention fall within the protection scope of the present invention.
