Method for extracting value-added tax invoice information

Document No.: 1490756 Publication date: 2020-02-04

Reading note: this technology, Method for extracting value-added tax invoice information, was designed and created by 罗中, 宋爽 and 王君健 on 2019-11-04. Its main content is as follows: The invention relates to a method for extracting value-added tax invoice information, comprising the following steps: first, the invoice picture is preprocessed so that it is free of rotation and correctly oriented, the invoice monitoring seal is located and recognized, and the invoice format is determined from the content of the seal; then, using the hue difference between the background table characters of the value-added tax invoice and the printed content, blue-hue pixels are extracted from the invoice picture and binarized to obtain a print content picture; next, an invoice information printing area template is constructed according to the invoice format, and the template is used to perform area matching on the text lines of the print content picture to obtain invoice information picture blocks; finally, text recognition and comprehensive analysis are performed on the invoice information picture blocks to obtain the invoice information extraction result. Compared with the prior art, the method has better fault tolerance and can still extract invoice information well when the printed content of the invoice overlaps and interferes with the background table characters.

1. A method for extracting value-added tax invoice information, characterized by comprising the following steps:

step (1), preprocessing the invoice picture and determining the value-added tax invoice format: acquiring a color scanned picture of the value-added tax invoice, and preprocessing the picture to obtain a preprocessed picture that is free of rotation and correctly oriented; recognizing the province name in the invoice monitoring seal in the invoice picture, and determining the value-added tax invoice format of the processed invoice picture;

step (2), extracting the invoice print content picture: using the hue difference between the background table characters of the value-added tax invoice and the printed content on the invoice, extracting the blue printed content pixels from the preprocessed picture, and performing binarization processing to obtain a print content picture;

step (3), constructing a template, and using the template to match and extract the invoice information item picture blocks: constructing an invoice information printing area template according to the invoice format determined in step (1), performing area matching on the print content picture obtained in step (2) by using the template, and extracting the picture blocks of the matched areas as the picture blocks of the information items of the invoice;

step (4), recognizing the invoice information item content: performing text recognition on each information item picture block of the invoice obtained in step (3), and comprehensively analyzing the results to obtain each item of information of the invoice.

2. The method for extracting value-added tax invoice information as claimed in claim 1, wherein the preprocessing of the picture in step (1) to obtain a preprocessed picture that is free of rotation and correctly oriented comprises the following steps:

identifying straight lines in the invoice picture, calculating the clockwise included angle between the uppermost straight line in the invoice picture and the horizontal direction, and, when the included angle is not equal to 0, rotating the picture counterclockwise by the included angle so that the picture is free of rotation;

locating the invoice monitoring seal in the invoice picture, and, if the invoice monitoring seal is not at the upper middle position of the picture, rotating the picture by 90 degrees, 180 degrees or 270 degrees so that the invoice monitoring seal is at the upper middle position of the picture.

3. The method for extracting value-added tax invoice information as claimed in claim 1, wherein determining the invoice format corresponding to the processed invoice picture in step (1) comprises:

performing character recognition on the located invoice monitoring seal, and matching the name of a Chinese province in the recognized characters, the value-added tax invoice format of the province whose name is matched being the invoice format of the processed invoice picture.

4. The method for extracting value-added tax invoice information as claimed in claim 1, wherein extracting, in step (2), a picture containing only the printed content of the invoice from the preprocessed picture comprises the following steps:

converting the preprocessed picture into the HSV (or HSL) color space model, copying every pixel of the converted picture whose H value lies within the range (240 degrees - delta, 240 degrees + delta) (blue pixels; the value of delta is between 0 and 60 degrees) to the corresponding position of a newly created blank picture of the same size as the preprocessed picture, and performing binarization processing on the resulting picture to obtain the print content picture.

5. The method for extracting value-added tax invoice information as claimed in claim 1, wherein constructing the value-added tax invoice template in step (3) comprises:

using the invoice format determined in step (1) and the size information of the invoice background table identified in the invoice picture, constructing a template picture that has the same size as the print content picture obtained in step (2) and marks the rectangular printing area of each invoice information item.

6. The method for extracting value-added tax invoice information as claimed in claim 1, wherein the comprehensive analysis in step (4) to obtain each item of information of the invoice comprises:

comprehensively analyzing the text line recognition results of the picture block corresponding to each information item of the invoice in combination with the business meaning and composition rules of the invoice information item and the font size and line height of the text lines in the invoice picture, and merging and splitting the text lines to obtain the exact content of the invoice information item.

Technical Field

The invention relates to a method for extracting value-added tax invoice information, and belongs to the field of value-added tax invoice automatic processing.

Background

In recent years, with the progress of enterprise informatization, more and more enterprises have come to manage financial data with information systems, which involves the electronic processing of bills such as invoices. In the traditional method of digitizing invoices, financial staff read the paper invoices and manually enter the invoice information into the information system; this consumes a large amount of manpower and is error-prone.

With the application of technologies such as OCR text recognition, methods for automatically recognizing and extracting invoice information based on OCR have also appeared. However, a large number of invoices are printed with an offset, so that the printed content overlaps the background table characters of the invoice. This interference greatly reduces the accuracy of current OCR technology in automatically recognizing invoice information and limits the potential of invoice informatization applications. As tax supervision is strengthened, the accuracy that enterprises require of invoice information recognition has risen markedly, and how to extract invoice information more accurately from invoices with printing position deviation is an urgent problem to be solved.

Disclosure of Invention

In order to solve the above problems, the invention provides a method for extracting value-added tax invoice information, which can quickly and accurately extract, from scanned pictures of value-added tax invoices, the various items of invoice information, such as the invoice code, invoice number, invoicing date, purchaser and seller information (name, taxpayer identification number, address and telephone, opening bank and account number), goods or services (including name, specification and model, unit, quantity, amount, tax amount), invoice amount (total), invoice tax amount (total), and the like.

The invention provides an extraction algorithm of value-added tax invoice information, which comprises the following steps:

step (1), preprocessing the invoice picture and determining the value-added tax invoice format: acquiring a color scanned picture of the value-added tax invoice, and preprocessing the picture to obtain a preprocessed picture that is free of rotation and correctly oriented; recognizing the province name in the invoice monitoring seal in the invoice picture, and determining the value-added tax invoice format of the processed invoice picture;

step (2), extracting the invoice print content picture: using the hue difference between the background table characters of the value-added tax invoice and the printed content on the invoice, extracting the blue printed content pixels from the preprocessed picture, and performing binarization processing to obtain a print content picture;

step (3), constructing a template, and using the template to match and extract the invoice information item picture blocks: constructing an invoice information printing area template according to the invoice format determined in step (1), performing area matching on the print content picture obtained in step (2) by using the template, and extracting the picture blocks of the matched areas as the picture blocks of the information items of the invoice;

step (4), recognizing the invoice information item content: performing text recognition on each information item picture block of the invoice obtained in step (3), and comprehensively analyzing the results to obtain each item of information of the invoice.

Further, in step (1), "free of rotation" means that every frame line of the background table in the invoice picture runs in the horizontal or vertical direction. To make the picture free of rotation, straight lines in the picture are identified, and the clockwise included angle between the uppermost straight line in the picture (which is one of the frame lines of the background table of the scanned invoice) and the horizontal direction is calculated (the included angle should range from -90 degrees to 90 degrees); when the included angle is not equal to 0, the picture is rotated counterclockwise by the included angle. In step (1), "correctly oriented" means that the invoice monitoring seal is located at the upper middle position of the picture. After the picture has been made free of rotation, the invoice monitoring seal is located by an object recognition technology; if the seal is not at the upper middle position of the picture, the picture is rotated clockwise by 90 degrees (when the located seal is at the vertical middle of the left side of the picture), 180 degrees (when it is at the lower middle position of the picture) or 270 degrees (when it is at the vertical middle of the right side of the picture), so that the invoice picture is correctly oriented. To facilitate subsequent processing, all pixels within 3% of the picture width from the left and right boundaries of the preprocessed picture are replaced with white, which prevents stray-colored pixels near the boundaries of the scanned invoice picture from interfering with subsequent processing. Character recognition is performed on the located invoice monitoring seal, and the name of a Chinese province (including municipalities directly under the central government and autonomous regions) is matched in the recognized characters; the matched province name gives the province of the processed invoice and hence the invoice format corresponding to the invoice picture. The picture obtained in step (1) is referred to as the preprocessed picture.
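The two geometric decisions described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the uppermost line and the seal center have already been found by some line detector and object detector, and the function names `skew_angle_clockwise` and `orientation_fix_clockwise` are hypothetical.

```python
import math


def skew_angle_clockwise(p1, p2):
    """Clockwise angle in degrees between a line and the horizontal.

    Points are (x, y) in image coordinates, where y grows downward, so a
    line whose right end is lower than its left end has a positive
    (clockwise) angle. The result lies between -90 and 90 degrees.
    """
    (x1, y1), (x2, y2) = sorted([p1, p2])  # order the endpoints left to right
    if x1 == x2:
        return 90.0  # vertical line
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def orientation_fix_clockwise(seal_center, width, height):
    """Clockwise rotation (0, 90, 180 or 270) that brings the seal to top-center.

    The seal is assigned to the nearest edge midpoint (an assumption of this
    sketch); the rotation for each edge follows the rule in the text.
    """
    cx, cy = seal_center
    distance_to_edge_midpoint = {
        0: math.hypot(cx - width / 2, cy),             # upper middle: already correct
        90: math.hypot(cx, cy - height / 2),           # left vertical middle
        180: math.hypot(cx - width / 2, cy - height),  # lower middle
        270: math.hypot(cx - width, cy - height / 2),  # right vertical middle
    }
    return min(distance_to_edge_midpoint, key=distance_to_edge_midpoint.get)


# The picture is then rotated counterclockwise by skew_angle_clockwise(...),
# and afterwards clockwise by orientation_fix_clockwise(...).
```

A negative skew angle means that the counterclockwise correction is in effect a clockwise rotation of the same magnitude.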

Further, the hue difference between the background table characters of the value-added tax invoice and the printed content in step (2) refers to the following: if the processed invoice picture is the invoice copy of a value-added tax invoice, the background table characters are brown (dominant hue red) and the printed content on the invoice is blue (dominant hue blue); if the processed invoice picture is the deduction copy of a value-added tax invoice, the background table characters are green (dominant hue green) and the printed content on the invoice is blue (dominant hue blue). In both cases the hue of the background table characters differs clearly from the hue of the printed content. To extract the printed content by using this hue difference, the preprocessed picture obtained in step (1) is converted into the HSV color space model (or the HSL color space model), a blank picture of the same size is created, and each pixel of the newly created picture is set according to the following rules: if the hue value H of the pixel at the corresponding position in the preprocessed picture lies within the range (240 degrees - delta, 240 degrees + delta), the pixel is judged to be blue, is considered to lie in the printed content of the invoice picture, and is copied to the corresponding position in the newly created picture (by the definition of the HSV or HSL color space model, a color whose H is equal or close to 240 degrees has a blue hue; delta is the threshold on the hue difference between the color of the pixel and "pure" blue, and its value is usually between 0 degrees and 60 degrees); otherwise, the pixel is judged not to be blue, is considered to lie in a blank part or a background table character part of the invoice picture, and the pixel at the corresponding position in the newly created picture is set to white. The newly created picture obtained by these rules contains only the (blue) printed content of the invoice. Binarization processing is then performed on the newly created picture. The picture obtained in step (2) is referred to as the print content picture.
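The hue rule above can be sketched with Python's standard `colorsys` module. This is a minimal sketch under stated assumptions: pixels are given as rows of 8-bit `(r, g, b)` tuples, the helper names `is_blue` and `extract_print_content` are hypothetical, and binarization is simplified to "every kept pixel becomes black", which is only one possible way to binarize.

```python
import colorsys

WHITE = (255, 255, 255)


def is_blue(rgb, delta=40.0):
    """True if the pixel's hue lies in the open range (240 - delta, 240 + delta) degrees."""
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)  # hue is in [0, 1)
    return abs(hue * 360.0 - 240.0) < delta


def extract_print_content(pixels, delta=40.0):
    """Return (blue_only, binary) pictures built from rows of (r, g, b) pixels.

    blue_only keeps blue pixels and sets everything else to white;
    binary uses 0 (black) for kept pixels and 255 (white) elsewhere.
    """
    mask = [[is_blue(p, delta) for p in row] for row in pixels]
    blue_only = [
        [p if keep else WHITE for p, keep in zip(row, mask_row)]
        for row, mask_row in zip(pixels, mask)
    ]
    binary = [[0 if keep else 255 for keep in mask_row] for mask_row in mask]
    return blue_only, binary


# One row: blue print, green table text, brown table text, white paper.
row = [(40, 60, 200), (0, 128, 0), (150, 75, 0), (255, 255, 255)]
blue_only, binary = extract_print_content([row])
```

Gray and white pixels have hue 0 in `colorsys`, so they fall outside the blue range for any delta up to 60 degrees and are treated as background.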

Further, constructing the value-added tax invoice template in step (3) means identifying the outer rectangular frame of the background table of the invoice in the preprocessed picture obtained in step (1) and, using the positions of the printing areas of the invoice information items relative to the rectangular frame of the invoice background table as specified in the invoice format determined in step (1), constructing an invoice printing area template picture that has the same size as the preprocessed picture, contains the same background table frame, and marks the printing area of each information item of the invoice. Text line positioning is performed on the print content picture, and area matching is then performed between the print content picture and the template picture (the two pictures are overlaid, the print content picture is held fixed, and the template picture is moved up, down, left and right to fine-tune the relative position of the two). When a rectangular area representing the printing position of an invoice information item in the template picture covers a text line in the print content picture, all picture blocks covered by the rectangular area are extracted from the print content picture as the picture blocks of the corresponding invoice information item. A picture block therefore contains one or more lines of text content.
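The template construction and the fine adjustment of the template position can be sketched as below. This is a hedged illustration only: rectangles are `(left, top, right, bottom)` tuples, the layout fractions in the example are made up rather than taken from any real invoice format, and the exhaustive offset search with a "most text lines covered, smallest shift wins" criterion is an assumption of this sketch, since the text does not specify how the best position is chosen.

```python
def build_template(table_box, layout):
    """Turn relative printing areas into pixel rectangles.

    table_box: outer frame of the background table, (left, top, right, bottom).
    layout: {item name: (x0, y0, x1, y1)} as fractions of the table width/height.
    """
    left, top, right, bottom = table_box
    w, h = right - left, bottom - top
    return {
        name: (left + x0 * w, top + y0 * h, left + x1 * w, top + y1 * h)
        for name, (x0, y0, x1, y1) in layout.items()
    }


def contains(region, line, dx=0, dy=0):
    """True if the region, shifted by (dx, dy), fully covers the text line box."""
    rl, rt, rr, rb = region
    ll, lt, lr, lb = line
    return rl + dx <= ll and rt + dy <= lt and lr <= rr + dx and lb <= rb + dy


def best_offset(template, text_lines, max_shift=20, step=2):
    """Search shifts of the template; keep the one covering the most text lines."""
    best = (-1, 0, 0)  # (covered line count, dx, dy)
    for dy in range(-max_shift, max_shift + 1, step):
        for dx in range(-max_shift, max_shift + 1, step):
            covered = sum(
                any(contains(region, line, dx, dy) for region in template.values())
                for line in text_lines
            )
            smaller_shift = abs(dx) + abs(dy) < abs(best[1]) + abs(best[2])
            if covered > best[0] or (covered == best[0] and smaller_shift):
                best = (covered, dx, dy)
    return best[1], best[2]


def assign_lines(template, text_lines, dx, dy):
    """Group the text line boxes by the shifted template region that covers them."""
    return {
        name: [line for line in text_lines if contains(region, line, dx, dy)]
        for name, region in template.items()
    }


# Hypothetical layout: one "date" area in the top-right corner of the table.
template = build_template((100, 100, 1100, 900), {"date": (0.75, 0.0, 1.0, 0.125)})
lines = [(860, 190, 1000, 210)]  # printed 10 px too low for the unshifted area
dx, dy = best_offset(template, lines)
blocks = assign_lines(template, lines, dx, dy)
```

In this example the unshifted "date" area ends at y = 200, so the text line is only covered after the template is moved down by 10 pixels.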

Further, in step (4), text recognition is performed on the picture blocks of the invoice information items obtained in step (3). If several text lines are recognized in the picture block of an invoice information item, they need to be analyzed comprehensively in combination with the business meaning and composition rules of the invoice information item and the font size and line height of the text lines in the invoice picture; adjacent text lines may need to be merged, or a single text line may need to be split into several pieces of information, in order to determine the exact content of the invoice information item.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a flow chart of a method for extracting value-added tax invoice information;

FIG. 2 is a schematic diagram of an original scanned picture of a value-added tax invoice from which information needs to be extracted;

FIG. 3 is a schematic diagram of detecting the uppermost line in a scanned picture of an invoice;

LL denotes the uppermost straight line identified in the scanned invoice picture;

FIG. 4 is a schematic diagram of the preprocessed picture after rotation and orientation correction;

SS denotes the (position of the) invoice monitoring seal in the preprocessed invoice picture;

FIG. 5 is a schematic diagram of the print content picture obtained by binarizing the blue-hue printed content extracted from the invoice picture;

FIG. 6 is a schematic diagram of the value-added tax invoice information printing area template constructed using the confirmed invoice format;

AA denotes the frame position of the invoice background table in the constructed invoice information printing area template picture;

01 denotes the printing position of the invoice information item "invoice code, invoice number" in the constructed invoice template picture;

02 denotes the printing position of the invoice information item "invoicing date" in the constructed invoice template picture;

03 denotes the printing position of the invoice information item "purchaser information" in the constructed invoice template picture;

04 denotes the printing position of the invoice information item "code area" in the constructed invoice template picture;

05 denotes the printing position of the invoice information item "name of goods or taxable labor and services" in the constructed invoice template picture;

06 denotes the printing position of the invoice information item "specification and model" in the constructed invoice template picture;

07 denotes the printing position of the invoice information item "unit" in the constructed invoice template picture;

08 denotes the printing position of the invoice information item "quantity" in the constructed invoice template picture;

09 denotes the printing position of the invoice information item "unit price" in the constructed invoice template picture;

10 denotes the printing position of the invoice information item "amount" in the constructed invoice template picture;

11 denotes the printing position of the invoice information item "tax rate" in the constructed invoice template picture;

12 denotes the printing position of the invoice information item "tax amount" in the constructed invoice template picture;

13 denotes the printing position of the invoice information item "amount (total)" in the constructed invoice template picture;

14 denotes the printing position of the invoice information item "tax amount (total)" in the constructed invoice template picture;

15 denotes the printing position of the invoice information item "total of price and tax (in words)" in the constructed invoice template picture;

16 denotes the printing position of the invoice information item "total of price and tax (in figures)" in the constructed invoice template picture;

17 denotes the printing position of the invoice information item "seller information" in the constructed invoice template picture;

18 denotes the printing position of the invoice information item "remarks" in the constructed invoice template picture;

19 denotes the printing position of the invoice information item "payee" in the constructed invoice template picture;

20 denotes the printing position of the invoice information item "reviewer" in the constructed invoice template picture;

21 denotes the printing position of the invoice information item "issuer" in the constructed invoice template picture;

FIG. 7 is a schematic diagram of the area matching between the invoice information item printing areas of the constructed invoice template of FIG. 6 and the invoice print content picture of FIG. 5.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides an extraction algorithm of value-added tax invoice information, which comprises the following steps:

step (1), preprocessing the invoice picture and determining the value-added tax invoice format: acquiring a color scanned picture of the value-added tax invoice, and preprocessing the picture to obtain a preprocessed picture that is free of rotation and correctly oriented; recognizing the province name in the invoice monitoring seal in the invoice picture, and determining the value-added tax invoice format of the processed invoice picture;

step (2), extracting the invoice print content picture: using the hue difference between the background table characters of the value-added tax invoice and the printed content on the invoice, extracting the blue printed content pixels from the preprocessed picture, and performing binarization processing to obtain a print content picture;

step (3), constructing a template, and using the template to match and extract the invoice information item picture blocks: constructing an invoice information printing area template according to the invoice format determined in step (1), performing area matching on the print content picture obtained in step (2) by using the template, and extracting the picture blocks of the matched areas as the picture blocks of the information items of the invoice;

step (4), recognizing the invoice information item content: performing text recognition on each information item picture block of the invoice obtained in step (3), and comprehensively analyzing the results to obtain each item of information of the invoice.

Further, in this embodiment, each step of the algorithm is described in detail with reference to a scanned picture of the deduction copy of a special value-added tax invoice (as shown in fig. 2). It should be emphasized that the method provided by the invention can also extract invoice information from pictures other than the one used in this embodiment, namely the invoice copies of ordinary value-added tax invoices and the invoice copies and deduction copies of other special value-added tax invoices. Before the steps of the embodiment are described in detail, the scanned invoice picture to be processed (as shown in fig. 2) is described. In the original scanned picture, for reasons such as printing deviation and scanning deviation, the invoice is rotated relative to the normal reading position and the printed content is offset, so that part of the printed content overlaps the background table characters of the invoice and makes the invoice harder to read. The following detailed description of the embodiment shows that the method provided by the invention copes well with the interference of printing deviation and scanning deviation on the recognition and extraction of invoice information.

Further, the method for making the picture free of rotation in step (1) is to identify straight lines in the picture (identifying straight lines in a picture belongs to the prior art and is not described further here), calculate the clockwise included angle between the uppermost straight line in the picture (which is one of the frame lines of the background table of the invoice in the scanned picture) and the horizontal direction (the included angle should range from -90 degrees to 90 degrees), and, when the included angle is not equal to 0, rotate the picture counterclockwise by the included angle so that the picture is free of rotation (that is, to eliminate the small rotation of the paper introduced when the invoice was printed or scanned). In this example, the clockwise included angle between the uppermost straight line identified in the invoice picture (shown as reference sign LL in fig. 3) and the horizontal direction is -3 degrees, so the invoice picture is rotated counterclockwise by -3 degrees (that is, clockwise by 3 degrees), which makes the invoice picture free of rotation. The picture is correctly oriented when the invoice monitoring seal of the value-added tax invoice is located at the upper middle position of the picture. The invoice monitoring seal in the invoice picture is located by an object recognition technology (locating a specific object in a picture belongs to the prior art and is not described here); the seal is found at the vertical middle of the left side of the invoice picture, so the invoice picture is rotated clockwise by 90 degrees, which makes the orientation correct and places the invoice monitoring seal at the upper middle position of the picture (as shown by reference sign SS in fig. 4). All pixels within 3% of the picture width from the left and right boundaries of the invoice picture are replaced with white, which prevents stray-colored pixels near the boundaries of the invoice picture from interfering with subsequent processing. Character recognition is performed on the invoice monitoring seal in the invoice picture (character recognition on the picture of the seal belongs to the prior art and is not repeated here), and the name of a Chinese province (including municipalities directly under the central government and autonomous regions) is matched in the recognized characters. The two characters of "Guangdong" are matched in the characters of the invoice monitoring seal of this invoice, so the invoice is known to be a value-added tax invoice of Guangdong province, and the format of the Guangdong province value-added tax invoice is used when the template is constructed later. Step (1) thus turns the original scanned invoice picture (as shown in fig. 2) into the preprocessed picture (as shown in fig. 4). It should be noted that both the original scanned invoice picture and the preprocessed picture are color pictures.
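Matching a province name in the recognized seal text can be sketched as a simple substring search. This is a minimal sketch: the province list below is deliberately partial, the sample OCR strings are hypothetical, and `match_province` is an illustrative name rather than part of the described method.

```python
# Partial list of province-level names for illustration; a real system would
# need the complete list of provinces, municipalities and autonomous regions.
PROVINCES = [
    "北京", "天津", "上海", "重庆", "河北", "山西", "内蒙古", "辽宁",
    "江苏", "浙江", "福建", "山东", "湖北", "湖南", "广东", "广西",
    "四川", "云南", "陕西", "新疆",
]


def match_province(seal_text):
    """Return the province name found in the recognized seal text, or None."""
    hits = [name for name in PROVINCES if name in seal_text]
    # Prefer the longest match if several names are found.
    return max(hits, key=len) if hits else None


# Hypothetical OCR output for a seal; not taken from the source document.
province = match_province("全国统一发票监制章广东税务局监制")
```

The matched name would then select the province's invoice format used when the template is constructed in step (3).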

Further, in step (2) the printed content in the invoice picture is extracted by using the hue difference between the background table characters of the value-added tax invoice and the printed content. The invoice picture processed in this embodiment is the deduction copy of a special value-added tax invoice: the background table characters are green (dominant hue green), the invoice monitoring seal and the seller's special invoice seal are red (dominant hue red), and the printed content on the invoice is blue (dominant hue blue), so the hue of the background table characters (and of the seals) differs clearly from the hue of the printed content. The preprocessed picture (as shown in fig. 4) is converted into the HSV color space model (or the HSL color space model; converting a picture into the HSV or HSL color space model belongs to the prior art and is not described here), a blank picture of the same size is created, and each pixel of the newly created picture is set according to the following rules: if the hue value H of the pixel at the corresponding position in the preprocessed picture lies within the range (240 degrees - delta, 240 degrees + delta) (the value of delta is generally between 0 degrees and 60 degrees; here delta = 40 degrees), the pixel is judged to be blue, is considered to lie in the printed content of the invoice picture, and is copied to the corresponding position in the newly created picture; otherwise, the pixel is judged not to be blue, is considered to lie in a blank part or a background table character part of the invoice picture, and the pixel at the corresponding position in the newly created picture is set to white. The newly created picture obtained by these rules contains only the (blue) printed content of the invoice. Binarization processing is performed on the newly created picture to obtain a black-and-white print content picture (as shown in fig. 5; binarization of a color picture belongs to the prior art and is not described here).

Further, constructing the value-added tax invoice template in step (3) means the following. The outer rectangular border of the background table is identified in the preprocessed picture obtained in step (1) (as shown in fig. 4; identifying a rectangle in a picture belongs to the prior art and is not described herein again). Then, using the stipulations of the Guangdong province value-added tax invoice format on the positions of the printing areas of the various invoice information items relative to the rectangular border of the table (the format of the invoice picture was determined to be the Guangdong province value-added tax invoice format in step (1)), an invoice information printing area template picture is constructed that has the same size as the preprocessed picture and contains a background table border of the same size and position (as shown in fig. 6; referred to as the template picture or the template for short; reference numerals 01-21 in fig. 6 denote the printing areas of the various invoice information items). Text line positioning is performed on the print content picture (positioning text lines in a picture belongs to the prior art and is not described herein again), and region matching is performed between the print content picture (as shown in fig. 5) and the template picture (as shown in fig. 6): the two pictures are overlapped, the print content picture is held fixed, and the template picture is moved up, down, left and right to fine-tune its position relative to the print content picture. When the rectangular areas in the template picture that represent the printing positions of the various invoice information items (such as the areas labeled 01-21, except 06, 07, 18 and the like, in fig. 6) cover the text lines positioned in the print content picture (fig. 7 shows the printing areas in the template picture fully matched with the text lines in the print content picture), all picture blocks covered by the rectangular areas are extracted from the print content picture as the picture blocks of the corresponding invoice information items. Each such picture block contains only a single line or multiple lines of text printed on the invoice.
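The region matching of step (3) can be illustrated with a small Python sketch. It rests on several assumptions that the description does not state: rectangles are `(x0, y0, x1, y1)` tuples in pixel coordinates, the template regions do not overlap one another, the fine adjustment is an exhaustive search over integer shifts of at most `max_shift` pixels, and the smallest shift that covers every text line is preferred. The names `covers`, `shift` and `match_template` and the sample coordinates are hypothetical.

```python
from itertools import product


def covers(region, box):
    """True if the rectangle `region` fully contains the rectangle `box`."""
    rx0, ry0, rx1, ry1 = region
    bx0, by0, bx1, by1 = box
    return rx0 <= bx0 and ry0 <= by0 and bx1 <= rx1 and by1 <= ry1


def shift(region, dx, dy):
    """Move a rectangle by (dx, dy)."""
    x0, y0, x1, y1 = region
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def match_template(template, line_boxes, max_shift=20):
    """Shift the template over the fixed text lines; return (dx, dy, assignment).

    `assignment` maps each region label to the text line boxes it covers
    at the chosen shift.
    """
    # Try small shifts first, so the smallest adequate adjustment wins.
    offsets = sorted(product(range(-max_shift, max_shift + 1), repeat=2),
                     key=lambda o: abs(o[0]) + abs(o[1]))
    best = None
    for dx, dy in offsets:
        assignment = {}
        for label, region in template.items():
            moved = shift(region, dx, dy)
            hits = [box for box in line_boxes if covers(moved, box)]
            if hits:
                assignment[label] = hits
        covered = sum(len(hits) for hits in assignment.values())
        if best is None or covered > best[0]:
            best = (covered, dx, dy, assignment)
        if covered == len(line_boxes):  # every text line lies inside a region
            break
    _, dx, dy, assignment = best
    return dx, dy, assignment


# Hypothetical template with two printing areas and two located text lines.
template = {"01": (100, 10, 200, 40), "03": (10, 50, 90, 120)}
text_lines = [(106, 8, 180, 20), (16, 48, 95, 60)]
dx, dy, blocks = match_template(template, text_lines)
```

With the labels assigned, the picture block of each information item is simply the part of the print content picture under the shifted rectangle.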

Further, in step (4), text recognition is performed on the picture blocks of the invoice information items obtained in step (3) (see the rectangular frame areas in fig. 7; recognizing a single line or multiple lines of text in a picture that contains only text lines belongs to the prior art and is not described herein again), so as to obtain the following preliminary results:

The exact content of each invoice item can be obtained by comprehensively analyzing the preliminary results in combination with the business meaning and composition rule of the corresponding invoice information item and the font size and line height of the text lines in its picture block. The specific analysis process is as follows:

The picture block at the position of label 01 in the invoice template picture corresponds to the printed invoice code and invoice number. Two text lines are recognized in the picture block of the invoice picture at this position. Text line 1 contains two numeric strings with a very large difference in font size, so it can be split into two numeric strings; text line 2 contains a numeric string identical to the first numeric string in text line 1. Combining this with the business meanings of the invoice code and the invoice number, the analysis gives 09650503 as the invoice number and 4400161130 as the invoice code.

The picture block at the position of label 03 in the invoice template picture corresponds to the printed buyer information. Five text lines are recognized in the picture block of the invoice picture at this position, and the font heights of text line 3 and text line 4 are much smaller than those of the other text lines. These two text lines are therefore considered to have originally been a single line that, because it contains many characters, was split into two lines and printed in a reduced font, and they are merged into one text line. The resulting four text lines correspond respectively to the name, the taxpayer identification number, the address and telephone, and the opening bank and account number in the buyer information. Their contents are, respectively: China Unicode corporation; 110108795109682; an address and telephone entry ending in "fifth floor, Room A502, 62956995"; and Bingshang Bank, Beijing Branch, century branch, 110902496210201.

The picture block at the position of label 04 in the invoice template picture corresponds to the printed content of the password area. Four text lines are recognized in the picture block of the invoice picture at this position; because the password area holds only one piece of password information, the four lines of text are joined to obtain the invoice password.

The positions of labels 05, 08, 09, 10, 11 and 12 in the invoice template picture correspond respectively to the printed name, quantity, unit price, amount, tax rate and tax of the goods or taxable services purchased. Two lines of text are recognized in the picture block at each of these six positions of the invoice picture, so the invoice contains purchase information for two goods or taxable services. The name, quantity, unit price, amount, tax rate and tax of item 1 are room charge, 1, 243.68932039, 243.69, 3% and 7.31, and those of item 2 are room charge, 1, 77.669902913, 77.67, 3% and 2.33.

The picture block at the position of label 17 in the invoice template picture corresponds to the printed seller information. Four text lines are recognized in the picture block of the invoice picture at this position, corresponding respectively to the name, the taxpayer identification number, the address and telephone, and the opening bank and account number in the seller information. Their contents are, respectively: Guangzhou loford hotel; 440103587609981; Guangzhou Liwan District, letter road No. 13B 1807055; and China Construction Bank, Guangzhou Liwan branch, 44001460802053002577.
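Three of the analysis rules described for labels 01, 03 and 05/08-12 can be sketched in Python. This is an illustration under assumptions, not the analysis procedure of the invention. The raw recognition output is not shown in the description, so the sample text lines and font heights are hypothetical. The "business meaning" of label 01 is assumed to be a 10-digit invoice code and an 8-digit invoice number. The threshold `ratio=0.75` for a "much smaller" font is arbitrary. The arithmetic check in `is_consistent` is an extra sanity check that the description does not mention. All function names are hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP
from statistics import median

COLUMNS = ("name", "quantity", "unit_price", "amount", "tax_rate", "tax")


def split_code_and_number(text_lines):
    """Label 01: return (invoice code, invoice number) from the recognized lines."""
    found = []
    for line in text_lines:
        for digits in re.findall(r"\d+", line):
            if digits not in found:  # the same string may be recognized twice
                found.append(digits)
    codes = [d for d in found if len(d) == 10]    # assumed: code has 10 digits
    numbers = [d for d in found if len(d) == 8]   # assumed: number has 8 digits
    if len(codes) != 1 or len(numbers) != 1:
        raise ValueError("cannot separate invoice code and invoice number")
    return codes[0], numbers[0]


def merge_wrapped_lines(lines, ratio=0.75):
    """Label 03: join consecutive (text, height) lines printed in a much smaller font."""
    typical = median(height for _, height in lines)
    merged, prev_small = [], False
    for text, height in lines:
        small = height < ratio * typical
        if small and prev_small:
            merged[-1] += text       # continuation of a wrapped line
        else:
            merged.append(text)
        prev_small = small
    return merged


def build_items(columns):
    """Labels 05, 08-12: line i of every column belongs to item i."""
    if len({len(columns[c]) for c in COLUMNS}) != 1:
        raise ValueError("columns have different numbers of text lines")
    return [dict(zip(COLUMNS, row)) for row in zip(*(columns[c] for c in COLUMNS))]


def is_consistent(item):
    """Extra check: amount = quantity x unit price and tax = amount x rate, to the cent."""
    cent = Decimal("0.01")
    amount = (Decimal(item["quantity"]) * Decimal(item["unit_price"])).quantize(
        cent, rounding=ROUND_HALF_UP)
    rate = Decimal(item["tax_rate"].rstrip("%")) / 100
    tax = (Decimal(item["amount"]) * rate).quantize(cent, rounding=ROUND_HALF_UP)
    return amount == Decimal(item["amount"]) and tax == Decimal(item["tax"])


# Hypothetical recognition output shaped like the description of label 01.
code, number = split_code_and_number(["4400161130 09650503", "4400161130"])

# Hypothetical (text, font height) pairs for label 03; lines 3 and 4 are small.
buyer = merge_wrapped_lines(
    [("name", 20), ("tax id", 20), ("address part 1 ", 12), ("part 2", 12), ("bank", 20)])

# Column values as listed in the description for the two purchased items.
items = build_items({
    "name": ["room charge", "room charge"],
    "quantity": ["1", "1"],
    "unit_price": ["243.68932039", "77.669902913"],
    "amount": ["243.69", "77.67"],
    "tax_rate": ["3%", "3%"],
    "tax": ["7.31", "2.33"],
})
```

`Decimal` is used rather than `float` so that rounding to the cent is exact; both items listed in the description pass the check.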

After the above comprehensive analysis, the final extraction result of the invoice information in the scanned invoice picture is obtained as follows:

[Figure: table of the final invoice information extraction result]

This completes the extraction of the invoice information.

The value-added tax invoice information extraction method provided by the invention can achieve the following technical effects:

1. The efficiency of value-added tax invoice information acquisition is improved. Value-added tax invoice pictures can be collected continuously and rapidly in batches with scanning equipment such as a high-speed scanner; each invoice can be processed with the method provided by the invention immediately after it is scanned, and the extracted invoice information can be stored in a database for subsequent use.

2. The accuracy of value-added tax invoice information extraction is improved. Enterprises use many printer models, print settings and printing modes, as well as many scanner models and different scanning operations, so when value-added tax invoices are issued the printed content of a large number of invoices is offset, and the printed content overlaps the background table characters. As a result, accuracy is low both when invoice content is entered manually by checking the invoice and when it is recognized and extracted with existing text OCR technology. The method for extracting value-added tax invoice information provided by the invention uses the hue difference between the background table characters and the printed content of the value-added tax invoice to separate the printed content cleanly from the invoice picture, which eliminates the negative influence of offsets arising during invoice printing and scanning on invoice content recognition and information extraction, and improves the accuracy of extracting value-added tax invoice information from scanned value-added tax invoice pictures.
