Method for intelligently recognizing table quotation pictures and processing them into standard data

Document No.: 1964288 Publication date: 2021-12-14

Reading note: this technology, "Method for intelligently recognizing table quotation pictures and processing them into standard data" (一种智能识别表格报价图片并处理成标准数据的方法), was designed by 李锦亮 (Li Jinliang) on 2021-09-07. Main content: the invention is a method for intelligently recognizing a table quotation picture and processing it into standard data, comprising the following steps: step one, feature identification; step two, data acquisition; step three, data arrangement; step four, data optimization; step five, data standardization; step six, data checking. The invention performs script-based conversion to the target format; it corrects and denoises the table to compensate for recognition anomalies of the Alibaba OCR interface, bringing the accuracy of the table quotation picture data to about 90%, and reaches 100% accuracy in combination with manual checking and adjustment; it can map the data to standardized data; suppliers, formats and stored procedures are configurable, so different processing is supported and the method is easy to extend.

1. A method for intelligently recognizing a table quotation picture and processing it into standard data, characterized by comprising the following steps:

step one, feature identification: configuring, according to the table quotation picture, the supplier, the format, and the stored procedure corresponding to the format;

step two, data acquisition: calling the Duguang (读光) table picture OCR interface to obtain the table picture data;

step three, data arrangement: two-dimensionalizing the raw table picture data and inserting it into the temporary table with the corresponding number of columns;

step four, data optimization: executing the stored procedure and inserting the data into the standard data table;

step five, data standardization: standardizing the data and displaying it on the front-end interface;

step six, data checking: manually checking and modifying the data on the front-end interface, then saving and updating the standard data.

2. The method as claimed in claim 1, wherein said OCR interface is Alibaba's Duguang (读光) table picture OCR interface.

3. The method as claimed in claim 1, wherein in step three, in a two-dimensional space, the full set of x lines and y lines cuts the space into smallest units; applying this two-dimensional concept to the data, the table is cut into cells that are all smallest units, and the data in the cells form two-dimensional data; a two-dimensional array is created, the raw data cells of the table quotation picture are traversed to fill the two-dimensional array, and any position left unfilled is filled with a cell whose value is empty and marked as compensation.

4. The method of claim 1, wherein in step four, the fields of the table holding the standard quotation data, taken together, represent one complete quotation record.

5. The method of claim 1, wherein in step three, the raw table quotation picture data is two-dimensionalized as follows:

A. cutting each cell whose x or y span is not 1 into minimum (atomic) cells, so that every cell satisfies:

ex-sx=1&&ey-sy=1;

B. judging whether the maximum column number (ex) of the atomic cells is less than or equal to the number of columns of the source picture;

(1) less than or equal to:

a. creating a two-dimensional array, traversing the atomic cells and filling them into the two-dimensional array, then traversing the two-dimensional array and, wherever no atomic cell was filled, filling in a cell whose value is empty and marking it as compensation; the two-dimensionalization of the data then ends; here the number of columns used is that of the source picture (compensation is made up to the number of columns of the source picture);

(2) greater than (in this case each cell to be deleted must be located):

a. creating a two-dimensional array, traversing the atomic cells and filling them into the two-dimensional array, then traversing the two-dimensional array and, wherever no atomic cell was filled, filling in a cell whose value is empty and marking it as compensation; here the number of columns used is the maximum column number (ex) of the atomic cells (compensation is made up to the maximum column number ex);

b. deleting empty columns at the head and tail until a non-empty column appears; judging whether the number of columns now matches that of the source picture and, if so, ending the two-dimensionalization of the data;

c. calculating the number of columns to be deleted (maximum ex minus the number of columns of the source picture), which gives the number of cells to be deleted in each row;

d. traversing the cell data row by row and processing each row as follows:

(d1) deleting one or more empty cells at the head of the row; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d2) deleting repeated cells whose text is not empty; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d3) deleting compensated cells; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d4) deleting cells whose text is empty; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d5) if cells still remain to be deleted, deleting them from the tail of the row.

6. The method for intelligently recognizing a table quotation picture and processing it into standard data according to claim 1, wherein data standardization maps source data to standard data, and the data standardization currently implemented in the system comprises:

a. product name standardization;

b. mark standardization;

c. manufacturer standardization;

d. category standardization.

Technical Field

The invention relates to the technical field of data recognition, and in particular to a method for intelligently recognizing a table quotation picture and processing it into standard data.

Background

Different suppliers provide commodity quotations in the form of table pictures. The present invention, a method for intelligently recognizing a table quotation picture and processing it into standard data, was developed for this scenario: a program replaces manual work in processing the data to obtain standard data, which greatly reduces labor.

Disclosure of Invention

(I) Technical problem to be solved

In view of the deficiencies of the prior art, the invention provides a method for intelligently recognizing a table quotation picture and processing it into standard data, which has high recognition accuracy and is easy to extend.

(II) technical scheme

To achieve this purpose, the invention provides the following technical scheme: a method for intelligently recognizing a table quotation picture and processing it into standard data, comprising the following steps:

step one, feature identification: configuring, according to the table quotation picture, the supplier, the format, and the stored procedure corresponding to the format;

step two, data acquisition: calling the Duguang (读光) table picture OCR interface to obtain the table picture data;

step three, data arrangement: two-dimensionalizing the raw table picture data and inserting it into the temporary table with the corresponding number of columns;

step four, data optimization: executing the stored procedure and inserting the data into the standard data table;

step five, data standardization: standardizing the data and displaying it on the front-end interface;

step six, data checking: manually checking and modifying the data on the front-end interface, then saving and updating the standard data.

Preferably, the table picture OCR interface is Alibaba's Duguang (读光) table picture OCR interface.

Preferably, in step three, in a two-dimensional space, the full set of x lines and y lines cuts the space into smallest units; applying this two-dimensional concept to the data, the table is cut into cells that are all smallest units, and the data in the cells form two-dimensional data; a two-dimensional array is created, the raw data cells of the table quotation picture are traversed to fill the two-dimensional array, and any position left unfilled is filled with a cell whose value is empty and marked as compensation.

Preferably, as an improvement of the invention, in step four, the fields of the "standard data table", i.e. the table holding the standard quotation data, taken together represent one complete quotation record.

Preferably, in step three, the raw table quotation picture data is two-dimensionalized as follows:

A. cutting each cell whose x or y span is not 1 into minimum (atomic) cells, so that every cell satisfies:

ex-sx=1&&ey-sy=1;

B. judging whether the maximum column number (ex) of the atomic cells is less than or equal to the number of columns of the source picture;

(1) less than or equal to:

a. creating a two-dimensional array, traversing the atomic cells and filling them into the two-dimensional array, then traversing the two-dimensional array and, wherever no atomic cell was filled, filling in a cell whose value is empty and marking it as compensation; the two-dimensionalization of the data then ends; here the number of columns used is that of the source picture (compensation is made up to the number of columns of the source picture);

(2) greater than (in this case each cell to be deleted must be located):

a. creating a two-dimensional array, traversing the atomic cells and filling them into the two-dimensional array, then traversing the two-dimensional array and, wherever no atomic cell was filled, filling in a cell whose value is empty and marking it as compensation; here the number of columns used is the maximum column number (ex) of the atomic cells (compensation is made up to the maximum column number ex);

b. deleting empty columns at the head and tail until a non-empty column appears; judging whether the number of columns now matches that of the source picture and, if so, ending the two-dimensionalization of the data;

c. calculating the number of columns to be deleted (maximum ex minus the number of columns of the source picture), which gives the number of cells to be deleted in each row;

d. traversing the cell data row by row and processing each row as follows:

(d1) deleting one or more empty cells at the head of the row; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d2) deleting repeated cells whose text is not empty; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d3) deleting compensated cells; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d4) deleting cells whose text is empty; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d5) if cells still remain to be deleted, deleting them from the tail of the row.
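The deletion priority of case (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Cell` class and the function names are hypothetical, a "repeated" cell is assumed to be one whose non-empty text equals that of the cell immediately to its left, and the number of cells to delete per row is taken from the grid width remaining after step b rather than from the original maximum ex.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cell:
    text: str
    compensated: bool = False  # True if the cell was filled in as compensation


def drop_empty_edge_columns(grid: List[List[Cell]]) -> List[List[Cell]]:
    """Step b: delete head and tail columns whose cells are all empty."""
    width = len(grid[0]) if grid else 0
    lo, hi = 0, width
    while lo < hi and all(row[lo].text == "" for row in grid):
        lo += 1
    while hi > lo and all(row[hi - 1].text == "" for row in grid):
        hi -= 1
    return [row[lo:hi] for row in grid]


def _delete_matching(row: List[Cell], budget: int,
                     pred: Callable[[List[Cell], int], bool]) -> int:
    """Delete up to `budget` cells matching `pred`; return the budget left."""
    i = 0
    while budget > 0 and i < len(row):
        if pred(row, i):
            del row[i]
            budget -= 1
        else:
            i += 1
    return budget


def trim_row(row: List[Cell], n_delete: int) -> List[Cell]:
    """Step d: delete `n_delete` cells from one row in priority order d1..d5."""
    row = list(row)
    budget = n_delete
    # d1: empty cells at the head of the row
    while budget > 0 and row and row[0].text == "":
        del row[0]
        budget -= 1
    # d2: repeated non-empty cells (assumed: same text as the left neighbour)
    budget = _delete_matching(
        row, budget,
        lambda r, i: i > 0 and r[i].text != "" and r[i].text == r[i - 1].text)
    # d3: compensated cells
    budget = _delete_matching(row, budget, lambda r, i: r[i].compensated)
    # d4: cells whose text is empty
    budget = _delete_matching(row, budget, lambda r, i: r[i].text == "")
    # d5: anything still owed is deleted from the tail
    if budget > 0:
        del row[max(0, len(row) - budget):]
    return row


def align_to_source(grid: List[List[Cell]], n_source_cols: int) -> List[List[Cell]]:
    """Steps b to d: shrink an over-wide grid to the source picture's column count."""
    grid = drop_empty_edge_columns(grid)
    width = len(grid[0]) if grid else 0
    if width <= n_source_cols:
        return grid
    return [trim_row(row, width - n_source_cols) for row in grid]
```

Each rule stops as soon as the row has lost the required number of cells, so the later, more destructive rules (in particular deletion from the tail) only run when the earlier ones did not find enough candidates.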

Preferably, as an improvement of the invention, data standardization maps source data to standard data, and the data standardization currently implemented in the system comprises:

a. product name standardization;

b. mark standardization;

c. manufacturer standardization;

d. category standardization.

(III) advantageous effects

Compared with the prior art, the invention provides a method for intelligently recognizing a table quotation picture and processing it into standard data, which has the following beneficial effects:

the invention performs script-based conversion to the target format; it corrects and denoises the table to compensate for recognition anomalies of the Alibaba OCR interface, bringing the accuracy of the table quotation picture data to about 90%, and reaches 100% accuracy in combination with manual checking and adjustment; it can map the data to standardized data; suppliers, formats and stored procedures are configurable, so different processing is supported and the method is easy to extend.

Drawings

FIG. 1 is a schematic diagram of the main flow of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the method of the present invention for intelligently recognizing a table quotation picture and processing it into standard data includes the following steps:

step one, feature identification: configuring, according to the table quotation picture, the supplier, the format, and the stored procedure corresponding to the format;

step two, data acquisition: calling the Duguang (读光) table picture OCR interface to obtain the table picture data;

step three, data arrangement: two-dimensionalizing the raw table picture data and inserting it into the temporary table with the corresponding number of columns;

step four, data optimization: executing the stored procedure and inserting the data into the standard data table;

step five, data standardization: standardizing the data and displaying it on the front-end interface;

step six, data checking: manually checking and modifying the data on the front-end interface, then saving and updating the standard data.

In this embodiment, the table picture OCR interface is Alibaba's Duguang (读光) table picture OCR interface, which improves recognition accuracy.

In this embodiment, in step three, in a two-dimensional space, the full set of x lines and y lines cuts the space into smallest units; applying this two-dimensional concept to the data, the table is cut into cells that are all smallest units, and the data in the cells form two-dimensional data; a two-dimensional array is created, the raw data cells of the table quotation picture are traversed to fill the two-dimensional array, and any position left unfilled is filled with a cell whose value is empty and marked as compensation, which improves the efficiency of data processing.
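As an illustration only, the cutting and filling described in this paragraph might be sketched in Python as follows. The `Cell` class and the function names are hypothetical. The sketch assumes 0-based coordinates with exclusive end positions, so that an atomic cell satisfies ex - sx = 1 and ey - sy = 1, and it assumes that every atomic cell cut from a merged cell inherits the merged cell's text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    sx: int  # start column (inclusive, 0-based; an assumption of this sketch)
    ex: int  # end column (exclusive)
    sy: int  # start row (inclusive)
    ey: int  # end row (exclusive)
    text: str
    compensated: bool = False  # True if filled in as compensation


def split_atomic(cells: List[Cell]) -> List[Cell]:
    """Cut every cell into unit cells with ex - sx == 1 and ey - sy == 1."""
    atoms = []
    for c in cells:
        for y in range(c.sy, c.ey):
            for x in range(c.sx, c.ex):
                # Assumption: each piece keeps the text of the merged cell.
                atoms.append(Cell(x, x + 1, y, y + 1, c.text))
    return atoms


def fill_grid(atoms: List[Cell], n_cols: int) -> List[List[Cell]]:
    """Place atomic cells in a 2-D array and compensate unfilled positions."""
    n_rows = max((a.ey for a in atoms), default=0)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for a in atoms:
        if a.sx < n_cols:  # cells beyond the chosen width are ignored here
            grid[a.sy][a.sx] = a
    for y in range(n_rows):
        for x in range(n_cols):
            if grid[y][x] is None:
                # Unfilled position: empty value, marked as compensation.
                grid[y][x] = Cell(x, x + 1, y, y + 1, "", compensated=True)
    return grid


if __name__ == "__main__":
    raw = [
        Cell(0, 2, 0, 1, "Quotation"),  # merged title spanning two columns
        Cell(0, 1, 1, 2, "ABS"),
        Cell(1, 2, 1, 2, "12.5"),
        Cell(0, 1, 2, 3, "PP"),         # the price cell of this row is missing
    ]
    grid = fill_grid(split_atomic(raw), n_cols=2)
    for row in grid:
        print([(c.text, c.compensated) for c in row])
```

Marking compensated cells separately from cells that the OCR returned as empty keeps the two cases distinguishable for later processing.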

In this embodiment, in step four, the fields of the "standard data table", i.e. the table holding the standard quotation data, taken together represent one complete quotation record, which makes the data convenient to check.

In this embodiment, in step three, the raw table quotation picture data is two-dimensionalized as follows:

A. cutting each cell whose x or y span is not 1 into minimum (atomic) cells, so that every cell satisfies:

ex-sx=1&&ey-sy=1;

B. judging whether the maximum column number (ex) of the atomic cells is less than or equal to the number of columns of the source picture;

(1) less than or equal to:

a. creating a two-dimensional array, traversing the atomic cells and filling them into the two-dimensional array, then traversing the two-dimensional array and, wherever no atomic cell was filled, filling in a cell whose value is empty and marking it as compensation; the two-dimensionalization of the data then ends; here the number of columns used is that of the source picture (compensation is made up to the number of columns of the source picture);

(2) greater than (in this case each cell to be deleted must be located):

a. creating a two-dimensional array, traversing the atomic cells and filling them into the two-dimensional array, then traversing the two-dimensional array and, wherever no atomic cell was filled, filling in a cell whose value is empty and marking it as compensation; here the number of columns used is the maximum column number (ex) of the atomic cells (compensation is made up to the maximum column number ex);

b. deleting empty columns at the head and tail until a non-empty column appears; judging whether the number of columns now matches that of the source picture and, if so, ending the two-dimensionalization of the data;

c. calculating the number of columns to be deleted (maximum ex minus the number of columns of the source picture), which gives the number of cells to be deleted in each row;

d. traversing the cell data row by row and processing each row as follows:

(d1) deleting one or more empty cells at the head of the row; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d2) deleting repeated cells whose text is not empty; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d3) deleting compensated cells; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d4) deleting cells whose text is empty; if the number deleted equals the number of cells to be deleted per row, processing of the row ends;

(d5) if cells still remain to be deleted, deleting them from the tail of the row.

Insertion into the temporary table with the corresponding number of columns is as follows:

after the two-dimensional data is obtained, SQL statements are spliced row by row and the quotation data is inserted in batches into the temporary table with the corresponding number of columns. The two-dimensionalization makes up for anomalies in the data returned by the Alibaba Duguang OCR interface; it cannot make up for 100% of them, but data accuracy can reach about 90%, which reduces manual effort as far as possible.
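The batch insertion into a temporary table chosen by column count can be sketched as follows, using Python's standard `sqlite3` module as a stand-in for whatever database the system actually uses. The table name `tmp_quote_<n>` and the column names `c1..cn` are hypothetical. Note also that this sketch binds the values as parameters instead of literally splicing them into the SQL text, which is a substitution made here for safety and is not stated in the source.

```python
import sqlite3
from typing import List


def insert_rows(conn: sqlite3.Connection, rows: List[List[str]]) -> str:
    """Insert 2-D quotation data into the temp table matching its column count."""
    if not rows:
        raise ValueError("no rows to insert")
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("rows must all have the same number of columns")

    # Hypothetical naming scheme: one temporary table per column count.
    table = f"tmp_quote_{n_cols}"
    columns = ", ".join(f"c{i + 1} TEXT" for i in range(n_cols))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (row_no INTEGER, {columns})")

    # One statement, executed for all rows in a single batch.
    placeholders = ", ".join("?" for _ in range(n_cols + 1))
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [(i, *row) for i, row in enumerate(rows)],
    )
    conn.commit()
    return table


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    table = insert_rows(conn, [["Name", "Price", "Unit"], ["ABS", "12.5", "kg"]])
    print(table, conn.execute(f"SELECT * FROM {table}").fetchall())
```

Keeping a `row_no` column preserves the row order of the picture, which a stored procedure would need in order to tell header rows from data rows.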

In this embodiment, data standardization maps source data to standard data, and the data standardization currently implemented in the system comprises:

a. product name standardization;

b. mark standardization;

c. manufacturer standardization;

d. category standardization.
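The source does not describe how each standardization is carried out. One plausible reading, sketched below purely as an assumption, is a lookup of each source value in an alias table per field, keeping the source value and the standard value side by side. The field names, the alias entries and the `_std` suffix are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical alias tables: normalized source spelling -> standard value.
ALIASES: Dict[str, Dict[str, str]] = {
    "product_name": {"abs resin": "ABS", "abs": "ABS"},
    "mark": {"pa-757": "PA-757", "pa757": "PA-757"},
    "manufacturer": {"chi mei": "CHIMEI", "chimei": "CHIMEI"},
    "category": {"general plastics": "General-purpose plastics"},
}


def standardize(record: Dict[str, str],
                aliases: Dict[str, Dict[str, str]] = ALIASES) -> Dict[str, Optional[str]]:
    """Return the record with a `<field>_std` value added for each known field."""
    out: Dict[str, Optional[str]] = dict(record)
    for field, table in aliases.items():
        key = record.get(field, "").strip().lower()
        # None means no standard value was found; such fields are left
        # for the manual check on the front-end interface.
        out[field + "_std"] = table.get(key)
    return out


if __name__ == "__main__":
    print(standardize({"product_name": " ABS Resin ", "mark": "PA757",
                       "manufacturer": "Chi Mei", "category": "unknown"}))
```

Leaving unmatched fields empty instead of guessing fits the workflow described in the text, in which a person reviews and corrects the data before it is finally saved.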

In summary, when the invention is used, table quotation pictures of different styles differ in what their data represent, so they require different processing. After step three is executed, the temporary table with the corresponding number of columns holds the two-dimensional data of each row and column of the picture. The stored procedure traverses this two-dimensional data, defines the meaning of each column in terms of the standard data table, and inserts the data into the standard data table. The function name of each stored procedure is associated with its format (the format table has a field that stores the stored procedure's function name).

The Alibaba Duguang table picture OCR interface returns, for each cell, its start and end coordinates and its content. The returned data has problems, however: for example, the start and end positions returned for a cell may be shifted although the cell's actual position is not, or the content of a cell may be recognized as empty. Step three performs a great deal of processing because of these problems, but its accuracy cannot reach 100%. Inserting into the temporary table with the corresponding number of columns means that, once the two-dimensional data is obtained, SQL statements are spliced row by row and the quotation data is inserted in batches into that temporary table. The two-dimensionalization makes up for anomalies in the data returned by the Alibaba Duguang table picture OCR interface; 100% compensation cannot be guaranteed, but the data can reach about 90% accuracy, which reduces manual effort as far as possible.

Executing the stored procedure in step four means executing the stored procedure function corresponding to the format, which inserts each record representing one quotation into the standard data table. In step five, the interface serving the front end returns the data to the front end, which renders it. Finally, because the program cannot reach 100% accuracy, manual intervention is needed in step six: the data is checked, modified and saved, and the updated data is stored, so that data accuracy reaches 100%.
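The association between a format and the function name of its stored procedure can be illustrated with a small Python sketch in which plain functions stand in for database stored procedures. The registry, the supplier and format keys, the procedure name and the column meanings are all hypothetical; only the idea of looking up the procedure by a name stored with the format comes from the text.

```python
from typing import Callable, Dict, List, Tuple

StandardRow = Dict[str, str]
Procedure = Callable[[List[List[str]]], List[StandardRow]]

# Stand-in for the database's stored procedures, keyed by function name.
PROCEDURES: Dict[str, Procedure] = {}


def procedure(name: str) -> Callable[[Procedure], Procedure]:
    """Register a function under the name stored in the format configuration."""
    def register(fn: Procedure) -> Procedure:
        PROCEDURES[name] = fn
        return fn
    return register


@procedure("proc_supplier_a_v1")
def proc_supplier_a_v1(rows: List[List[str]]) -> List[StandardRow]:
    # Hypothetical layout: header row, then (product name, price, unit).
    return [{"product_name": r[0], "price": r[1], "unit": r[2]} for r in rows[1:]]


# Stand-in for the format table: the "procedure" field stores the function name.
FORMATS: Dict[Tuple[str, str], Dict[str, object]] = {
    ("supplier_a", "format_1"): {"columns": 3, "procedure": "proc_supplier_a_v1"},
}


def run_format(supplier: str, fmt: str, rows: List[List[str]]) -> List[StandardRow]:
    """Look up the procedure configured for this format and apply it."""
    config = FORMATS[(supplier, fmt)]
    if any(len(r) != config["columns"] for r in rows):
        raise ValueError("row width does not match the configured format")
    return PROCEDURES[str(config["procedure"])](rows)


if __name__ == "__main__":
    data = [["Name", "Price", "Unit"], ["ABS", "12.5", "kg"]]
    print(run_format("supplier_a", "format_1", data))
```

Supporting a new supplier layout then only requires registering one more procedure and adding one configuration entry, which matches the extensibility the text claims for the configurable design.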

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
