OCR image character recognition and character correction method and system

Document No.: 1379194  Published: 2020-08-14

Reading note: This technology, "OCR image character recognition and character correction method and system" (一种ocr图像字符识别和字符校正的方法及系统), was designed by 宋国梁 and 颜长华 on 2020-04-26. The invention discloses a method for OCR image character recognition and character correction. In the character recognition module, a multi-stage neural network is used to construct and fit a Chinese character deformation degree function Pr; the network takes the image CNN data and four additional features (the horizontal, vertical, left-falling and right-falling strokes) as variables, uses the GAN discrimination results of different degrees as the training values of the "deformation degree", and reflects the deformation degree Pr of the target Chinese character. In the character correction module, a second-stage similar-character distinguishing network is added to perform high-precision similar-character discrimination on the best recognition result first determined by the training network; the second-stage network reduces the complexity of the first-stage network and improves the generalization ability of the network as a whole. The method and system are mainly intended for recognizing machine-printed invoices, forms and documents. They have high recognition accuracy, high recognition speed and strong adaptability, have a strong ability to correct partially missing information and recognition errors, and greatly improve recognition accuracy relative to traditional OCR technology.

1. A method of OCR image character recognition and character correction, characterized in that it comprises:

performing character recognition on an image to be recognized through a training network to obtain character recognition information;

checking the character recognition information against a preset correction rule to obtain a character correction result;

wherein the character recognition of the image to be recognized through the training network comprises:

constructing the training network by constructing and fitting a Pr function that takes four features (the horizontal, vertical, left-falling and right-falling strokes) as variables, and calculating the degree of deformation of the Chinese character;

and adding a second-stage similar-character distinguishing network for distinguishing similar characters in the best recognition result first determined by the training network.

2. The method of OCR image character recognition and character correction according to claim 1, wherein performing character recognition on the image to be recognized through the training network to obtain character recognition information comprises the following steps:

S11, setting up a neural network and training it on a known Chinese character image library together with the numbers of horizontal, vertical, left-falling and right-falling strokes of the corresponding Chinese characters;

and S12, constructing the training network using GANs of different degrees, and calculating the degree of deformation of the Chinese character.

3. The method of OCR image character recognition and character correction according to claim 2, wherein constructing the training network and calculating the degree of deformation of the Chinese character comprises:

constructing a training network comprising the neural network and a standard CNN, taking as input the image to be recognized, the Chinese character to be detected, and the numbers of horizontal, vertical, left-falling and right-falling strokes in the target Chinese character obtained by training, and calculating a quantitative error function Pr through the training network.

4. The method of OCR image character recognition and character correction according to claim 3, wherein the training data of the training network is obtained by processing original standard Chinese character image data; first, five GAN networks of different levels are set up, whose data sets correspond to five levels of data obtained by progressively expanding the set of similar-shaped characters: the GAN1 data are obtained by augmenting the similar-shaped characters of the Chinese character; the GAN2 data are obtained by augmenting the data set consisting of the Chinese character, its similar-shaped characters, and the similar-shaped characters of those similar-shaped characters; and so on, giving five GAN networks, GAN1 to GAN5;

then, the Pr value is defined as:

0.0: the original image;

0.1: cannot be distinguished by any of the five;

0.2: cannot be distinguished by four, can be distinguished by one;

0.4: cannot be distinguished by three, can be distinguished by two;

0.6: cannot be distinguished by two, can be distinguished by three;

0.8: cannot be distinguished by one, can be distinguished by four;

0.9: can be distinguished by all five;

1.0: noise-free images or images of other subjects, such as animals and plants;

and after the Pr values are determined, the Pr value of each target Chinese character is trained using the training network structure to form a training database.

5. The method of OCR image character recognition and character correction according to claim 3, wherein using the second-stage similar-character distinguishing network to accurately distinguish similar characters in the best recognition result first determined by the training network comprises:

after the training network first determines the best recognition result, looking up the similar-character library to which that character belongs for matching and comparison; and, if similar characters are matched, calling the pre-trained second-stage similar-character distinguishing network to distinguish among the matched similar characters.

6. The method of OCR image character recognition and character correction according to any one of claims 1-5, wherein checking the character recognition information against the preset correction rule to obtain a character correction result comprises:

presetting a correction rule, and verifying the character recognition information against it;

and constructing a feedback model that, according to the degree of conformity found by the correction rule check, feeds reliability information back upstream and gives suggestions for further processing.

7. A system for OCR image recognition, characterized in that it comprises a character recognition module and a character correction module; wherein

the character recognition module is used for performing character recognition on the image to be recognized through a training network to obtain character recognition information; the character recognition of the image to be recognized through the training network comprises the following steps:

constructing the training network by constructing and fitting a Pr function that takes four features (the horizontal, vertical, left-falling and right-falling strokes) as variables, and calculating the degree of deformation of the Chinese character;

adding a second-stage similar-character distinguishing network for distinguishing similar characters in the best recognition result first determined by the training network;

and the character correction module is used for checking the character recognition information against preset correction rules to obtain a character correction result.

8. The OCR image recognition system according to claim 7, wherein the steps executed by the character recognition module comprise:

setting up a neural network and training it on a known Chinese character image library together with the numbers of horizontal, vertical, left-falling and right-falling strokes of the corresponding Chinese characters;

constructing the training network using GANs of different degrees, and calculating the degree of deformation of the Chinese character, which includes:

constructing a training network comprising the neural network and a standard CNN, taking as input the image to be recognized, the Chinese character to be detected, and the numbers of horizontal, vertical, left-falling and right-falling strokes in the target Chinese character obtained by training, and calculating a quantitative error function Pr through the training network.

9. The OCR image recognition system according to claim 8, wherein the character recognition module executing the second-stage similar-character distinguishing network to distinguish similar characters in the best recognition result first determined by the training network comprises:

after the training network first determines the best recognition result, looking up the similar-character library to which that character belongs for matching and comparison; and, if similar characters are matched, calling the pre-trained second-stage similar-character distinguishing network to distinguish among the matched similar characters.

10. The OCR image recognition system according to claim 9, wherein the steps executed by the character correction module comprise:

presetting a correction rule, and verifying the character recognition information output by the character recognition module against it;

and constructing a feedback model that, according to the degree of conformity found by the correction rule check, feeds reliability information back upstream and gives suggestions for further processing.

Technical Field

The invention relates to the technical field of Chinese character recognition, in particular to a method and a system for OCR image character recognition and character correction.

Background

OCR (Optical Character Recognition) is a computer input technology that converts the characters of bills, newspapers, books, manuscripts and other printed matter into image information by optical input such as scanning, and then uses character recognition to convert that image information into text that a computer can use.

With the continuous development of image sensors, and in particular the exponential growth in the number of mobile phones and professional (for example, security) cameras, the volume of image data handled by computers has increased rapidly. The quality of these images, however, is lower than that obtained from traditional scanners or professional cameras. Traditional Chinese character OCR technology therefore suffers a severe drop in recognition rate when the source image quality is poor or the image is heavily contaminated.

Recognizing Chinese characters in computer images (OCR) is a hard problem in image recognition. Compared with English, Chinese has a large number of characters, many basic characters closely resemble one another, and recognition is easily disturbed. Bills are additionally affected by background shading, printing position, print clarity and overlying contaminants such as seals. According to market research from 2018, several traditional OCR vendors on the market gave unsatisfactory results on bills photographed with a mobile phone. The new generation of end-to-end OCR schemes based on deep neural networks works well for Western scripts, but because the Chinese character set is so large, the required training data set is (by a conservative estimate) thousands of times larger than that for a Western character set. As a result, Chinese character OCR on open AI platforms performs poorly on low-quality images, and end-to-end deep neural networks are inherently prone to misrecognition and vulnerable to attack.

In view of the above, the present invention is particularly proposed.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides an OCR image character recognition and character correction method and system to improve the accuracy of OCR.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A method for OCR image character recognition and character correction comprises:

performing character recognition on an image to be recognized through a training network to obtain character recognition information;

checking the character recognition information against a preset correction rule to obtain a character correction result;

wherein the character recognition of the image to be recognized through the training network comprises:

constructing the training network by constructing and fitting a Pr function that takes four features (the horizontal, vertical, left-falling and right-falling strokes) as variables, and calculating the degree of deformation of the Chinese character;

and adding a second-stage similar-character distinguishing network for distinguishing similar characters in the best recognition result first determined by the training network.

Further, in the above OCR image character recognition and character correction method, performing character recognition on the image to be recognized through the training network to obtain character recognition information includes:

S11, setting up a neural network and training it on a known Chinese character image library together with the numbers of horizontal, vertical, left-falling and right-falling strokes of the corresponding Chinese characters;

and S12, constructing the training network using GANs of different degrees, and calculating the degree of deformation of the Chinese character.

Further, in the above OCR image character recognition and character correction method, constructing the training network and calculating the degree of deformation of the Chinese character includes:

constructing a training network comprising the neural network and a standard CNN, taking as input the image to be recognized, the Chinese character to be detected, and the numbers of horizontal, vertical, left-falling and right-falling strokes in the target Chinese character obtained by training, and calculating a quantitative error function Pr through the training network.
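As a rough illustration of how the inputs described above might be combined, the following Python sketch concatenates a CNN feature vector with the four stroke counts and maps the result to a value between 0 and 1. It is only a sketch under stated assumptions: the source does not specify the network architecture, so a single logistic unit stands in for the multi-stage network, and the names `build_input`, `predict_pr`, `weights` and `bias` are hypothetical.

```python
import math
from typing import List, Sequence


def build_input(cnn_features: Sequence[float],
                stroke_counts: Sequence[int]) -> List[float]:
    """Concatenate CNN features with the four stroke counts.

    stroke_counts is assumed to be ordered as
    (horizontal, vertical, left-falling, right-falling).
    """
    if len(stroke_counts) != 4:
        raise ValueError("expected exactly four stroke counts")
    return list(cnn_features) + [float(c) for c in stroke_counts]


def predict_pr(x: Sequence[float],
               weights: Sequence[float],
               bias: float) -> float:
    """Stand-in for the Pr network: one logistic unit over the input.

    The real network is multi-stage; this only shows that Pr is a
    function of both the image features and the stroke counts.
    """
    if len(x) != len(weights):
        raise ValueError("input and weight lengths differ")
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy usage: two made-up CNN features plus the stroke counts of 大
# (one horizontal, no vertical, one left-falling, one right-falling).
x = build_input([0.5, 0.2], (1, 0, 1, 1))
pr = predict_pr(x, [0.0] * len(x), 0.0)  # untrained weights give 0.5
```

In a trained system the weights would be learned from the labelled Pr values; here they are zeros purely to keep the example self-contained.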

Further, in the above OCR image character recognition and character correction method, the training data of the training network is obtained by processing the original standard Chinese character image data. First, five GAN networks of different levels are set up, whose data sets correspond to five levels of data obtained by progressively expanding the set of similar-shaped characters: the GAN1 data are obtained by augmenting the similar-shaped characters of the Chinese character; the GAN2 data are obtained by augmenting the data set consisting of the Chinese character, its similar-shaped characters, and the similar-shaped characters of those similar-shaped characters; and so on, giving five GAN networks, GAN1 to GAN5.
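One way to read the five data sets is as a level-by-level expansion over a "similar-shaped character" relation: level 1 adds the similar characters of the target, level 2 adds the similar characters of those, and so on. The Python sketch below illustrates that reading only. The function name `expand_similar`, the lookup table `SIMILAR` and its entries are hypothetical, and whether the target character itself belongs to the level-1 set is an assumption.

```python
from typing import Dict, Iterable, Set


def expand_similar(char: str,
                   similar: Dict[str, Iterable[str]],
                   hops: int) -> Set[str]:
    """Return char plus every character reachable within `hops`
    steps of the similar-shaped-character relation."""
    seen = {char}
    frontier = {char}
    for _ in range(hops):
        reached: Set[str] = set()
        for c in frontier:
            reached.update(similar.get(c, ()))
        frontier = reached - seen   # only characters not visited yet
        seen |= frontier
    return seen


# Hypothetical similar-shaped-character table, for illustration only.
SIMILAR = {
    "己": ["已", "巳"],
    "已": ["己", "巳"],
    "巳": ["已", "巴"],
    "巴": ["巳"],
}

# Character sets for GAN1 .. GAN5 under this reading.
levels = [expand_similar("己", SIMILAR, k) for k in range(1, 6)]
```

Each set would then be augmented and used to train the GAN of the corresponding level; that training step is outside the scope of this sketch.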

Then, the Pr value is defined as:

0.0: the original image;

0.1: cannot be distinguished by any of the five;

0.2: cannot be distinguished by four, can be distinguished by one;

0.4: cannot be distinguished by three, can be distinguished by two;

0.6: cannot be distinguished by two, can be distinguished by three;

0.8: cannot be distinguished by one, can be distinguished by four;

0.9: can be distinguished by all five;

1.0: noise-free images or images of other subjects, such as animals and plants;

After the Pr values are determined, the Pr value of each target Chinese character is trained using the training network structure to form a database.
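The Pr labels listed above can be written as a small lookup keyed on how many of the five GAN levels distinguish the image. The Python sketch below shows this, assuming that "distinguishable" is a per-level yes/no outcome; the names `PR_BY_COUNT`, `pr_label`, `is_original` and `is_unrelated` are hypothetical.

```python
from typing import Sequence

# Number of GAN levels (out of five) that distinguish the image -> Pr label.
PR_BY_COUNT = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8, 5: 0.9}


def pr_label(distinguished: Sequence[bool] = (),
             *,
             is_original: bool = False,
             is_unrelated: bool = False) -> float:
    """Return the training label Pr for one image.

    is_original  -> 0.0 (the original standard image)
    is_unrelated -> 1.0 (the last category in the list, for example
                         images of animals or plants)
    otherwise    -> looked up from how many of the five levels
                    distinguish the image.
    """
    if is_original:
        return 0.0
    if is_unrelated:
        return 1.0
    if len(distinguished) != 5:
        raise ValueError("expected one outcome per GAN level (five)")
    return PR_BY_COUNT[sum(bool(d) for d in distinguished)]


label = pr_label([True, True, False, False, False])  # two of five -> 0.4
```

Note that the scale is not linear: the steps between 0.0, 0.1 and 0.2 and between 0.8, 0.9 and 1.0 are half the size of the steps in the middle of the range.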

Further, in the above OCR image character recognition and character correction method, using the second-stage similar-character distinguishing network to accurately distinguish similar characters in the best recognition result first determined by the training network includes:

after the training network first determines the best recognition result, looking up the similar-character library to which that character belongs for matching and comparison; and, if similar characters are matched, calling the pre-trained second-stage similar-character distinguishing network to distinguish among the matched similar characters.
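The two-stage flow described above can be sketched as a simple dispatch: take the first-stage result, check whether it belongs to a similar-character group, and only then call the specialized classifier for that group. The Python below is a minimal sketch with stub classifiers; `recognize`, `group_of`, `second_stage` and the example group are hypothetical names and data, not taken from the source.

```python
from typing import Callable, Dict, FrozenSet

Image = bytes                        # placeholder image type for the sketch
Classifier = Callable[[Image], str]  # maps an image to one character


def recognize(image: Image,
              first_stage: Classifier,
              group_of: Dict[str, FrozenSet[str]],
              second_stage: Dict[FrozenSet[str], Classifier]) -> str:
    """Run the first-stage network, then refine within a similar-character
    group if the first-stage result belongs to one."""
    best = first_stage(image)
    group = group_of.get(best)
    if group is None:
        return best                  # no similar characters: keep the result
    return second_stage[group](image)


# Stub setup: one hypothetical group of similar-shaped characters.
GROUP = frozenset("己已巳")
group_of = {c: GROUP for c in GROUP}
second_stage = {GROUP: lambda img: "己"}  # stands in for a trained network

result = recognize(b"", lambda img: "已", group_of, second_stage)
```

Because each second-stage classifier only has to separate a handful of characters, the first-stage network can stay smaller, which matches the motivation the source gives for adding the second stage.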

Further, in the above OCR image character recognition and character correction method, checking the character recognition information against the preset correction rule to obtain a character correction result includes:

presetting a correction rule, and verifying the character recognition information against it;

and constructing a feedback model that, according to the degree of conformity found by the correction rule check, feeds reliability information back upstream and gives suggestions for further processing.
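The source does not list concrete correction rules, so the Python sketch below uses two hypothetical invoice rules (the total equals quantity times unit price, and the date is in `YYYY-MM-DD` form) to show how a conformity score and a processing suggestion might be derived. The names `Feedback`, `check`, `total_matches` and `date_is_iso`, the field names, and the thresholds are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, Sequence

Fields = Dict[str, str]
Rule = Callable[[Fields], bool]


@dataclass
class Feedback:
    conformity: float   # fraction of rules satisfied, fed back as reliability
    suggestion: str     # suggested further processing


def total_matches(fields: Fields) -> bool:
    """Hypothetical rule: total == quantity * unit_price."""
    try:
        return (Decimal(fields["quantity"]) * Decimal(fields["unit_price"])
                == Decimal(fields["total"]))
    except (KeyError, InvalidOperation):
        return False    # a missing or non-numeric field fails the rule


def date_is_iso(fields: Fields) -> bool:
    """Hypothetical rule: date looks like YYYY-MM-DD."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", fields.get("date", "")) is not None


def check(fields: Fields, rules: Sequence[Rule],
          retry_at: float = 0.5) -> Feedback:
    """Apply every rule and turn the pass rate into a suggestion."""
    passed = sum(1 for rule in rules if rule(fields))
    conformity = passed / len(rules) if rules else 1.0
    if conformity == 1.0:
        suggestion = "accept"
    elif conformity >= retry_at:
        suggestion = "re-recognize"
    else:
        suggestion = "manual review"
    return Feedback(conformity, suggestion)


RULES = [total_matches, date_is_iso]
ok = check({"quantity": "3", "unit_price": "2.50", "total": "7.50",
            "date": "2020-04-26"}, RULES)
# A misread total (letter O instead of zero) fails the arithmetic rule.
bad = check({"quantity": "3", "unit_price": "2.50", "total": "7.5O",
             "date": "2020-04-26"}, RULES)
```

Using `Decimal` rather than floating point keeps the amount comparison exact, which matters when a rule is used to detect single-character misreads.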

An OCR image recognition system comprises a character recognition module and a character correction module; wherein

the character recognition module is used for performing character recognition on the image to be recognized through a training network to obtain character recognition information; the character recognition of the image to be recognized through the training network comprises the following steps:

constructing the training network by constructing and fitting a Pr function that takes four features (the horizontal, vertical, left-falling and right-falling strokes) as variables, and calculating the degree of deformation of the Chinese character;

adding a second-stage similar-character distinguishing network for distinguishing similar characters in the best recognition result first determined by the training network;

and the character correction module is used for checking the character recognition information against preset correction rules to obtain a character correction result.

Further, in the above OCR image recognition system, the steps executed by the character recognition module include:

setting up a neural network and training it on a known Chinese character image library together with the numbers of horizontal, vertical, left-falling and right-falling strokes of the corresponding Chinese characters;

constructing the training network using GANs of different degrees, and calculating the degree of deformation of the Chinese character, which includes:

constructing a training network comprising the neural network and a standard CNN, taking as input the image to be recognized, the Chinese character to be detected, and the numbers of horizontal, vertical, left-falling and right-falling strokes in the target Chinese character obtained by training, and calculating a quantitative error function Pr through the training network.

Further, in the above OCR image recognition system, the character recognition module executing the second-stage similar-character distinguishing network to distinguish similar characters in the best recognition result first determined by the training network includes:

after the training network first determines the best recognition result, looking up the similar-character library to which that character belongs for matching and comparison; and, if similar characters are matched, calling the pre-trained second-stage similar-character distinguishing network to distinguish among the matched similar characters.

Further, in the above OCR image recognition system, the steps executed by the character correction module include:

presetting a correction rule, and verifying the character recognition information output by the character recognition module against it;

and constructing a feedback model that, according to the degree of conformity found by the correction rule check, feeds reliability information back upstream and gives suggestions for further processing.

Compared with the prior art, the invention has the beneficial effects that:

The recognition accuracy is determined by constructing and fitting a Pr function: the degree of deformation relative to the standard Chinese character is calculated from the horizontal, vertical, left-falling and right-falling stroke variables and processed by a training network. In subsequent recognition, it is only necessary to input the image to be recognized, the Chinese character to be detected, and the numbers of the four basic strokes (horizontal, vertical, left-falling and right-falling) in the target Chinese character obtained by training; the network then calculates the reliability degree Pr of the target Chinese character, which indicates the quality of the current recognition result and is very effective for identifying abnormal pictures and adversarial pictures. The comprehension-based feedback algorithm (that is, the feedback model) improves the accuracy of OCR and has a strong ability to correct partially missing information and recognition errors, so that recognition accuracy is greatly improved relative to traditional OCR technology. The method is particularly suitable for recognizing machine-printed invoices, forms and documents, and has the advantages of high recognition accuracy, high recognition speed and strong adaptability.

Drawings

In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.

FIG. 1 is a flow chart of one embodiment of a method for OCR image character recognition and character correction in accordance with the present invention;

FIG. 2 is a logical block diagram of the OCR image recognition system of the present invention;

FIG. 3 is a logic block diagram of the training network constructed in the character recognition module of the system shown in FIG. 2.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
