Optical character recognition method, device, electronic equipment and storage medium

Document No.: 950258 | Publication date: 2020-10-30

Reading note: This technique, "Optical character recognition method, device, electronic equipment and storage medium", was designed and created by 恩孟一, 刘珊珊, 李轩, 章成全, 许海伦 and 张晓强 on 2020-06-16. Abstract: The application discloses an optical character recognition method, an optical character recognition apparatus, an electronic device and a storage medium, relating to the fields of artificial intelligence and deep learning. The method comprises: determining a bounding box of a text region in a picture to be recognized, and extracting a text region picture from the picture to be recognized according to the bounding box; determining a bounding box of a text line in the text region picture, and extracting a text line picture from the text region picture according to the bounding box; and performing text sequence recognition on the text line picture to obtain a recognition result. By applying this scheme, the recognition speed can be increased.

1. An optical character recognition method, comprising:

determining a bounding box of a text region in a picture to be recognized, and extracting a text region picture from the picture to be recognized according to the bounding box;

determining a bounding box of a text line in the text region picture, and extracting a text line picture from the text region picture according to the bounding box;

and performing text sequence recognition on the text line picture to obtain a recognition result.

2. The method of claim 1,

the method further comprises: before determining the bounding box of the text line in the text region picture, determining an adjustment manner for the text region picture, and resizing the text region picture according to the determined adjustment manner.

3. The method of claim 2,

the method further comprises: inputting the picture to be recognized into a pre-trained lightweight text scale prejudgment model to obtain an output single-channel text region mask map and a single-channel text scale map;

the value of each pixel in the text region mask map represents the probability that the pixel belongs to a text region, and the value of each pixel in the text scale map represents the ratio between the size of the shortest side of the text line to which the pixel belongs and a preset optimal size;

the determining of the bounding box of the text region comprises: determining the bounding box of the text region in the picture to be recognized according to the text region mask map;

the determining of the adjustment manner for the text region picture comprises: determining the adjustment manner for the text region picture according to the text scale map.

4. The method of claim 3,

the determining the bounding box of the text region in the picture to be recognized according to the text region mask map comprises:

determining text connected domains in the text region mask map through connected domain analysis;

and, for each text connected domain, determining the minimum rectangle containing the text connected domain in the picture to be recognized as the bounding box of the text region corresponding to that text connected domain.

5. The method of claim 3,

the determining the adjustment manner for the text region picture according to the text scale map comprises:

for each text region picture, determining the values in the text scale map of the pixels within the text region picture, wherein the values in the text scale map of all pixels within one text region picture are the same;

the adjustment manner for the text region picture comprises: adjusting the width and height of the text region picture while preserving its aspect ratio, so that the size of the shortest side of the adjusted text line equals the optimal size.

6. The method of claim 3,

the lightweight text scale prejudgment model comprises: a first feature extraction module, a first prediction module and a second prediction module; the first feature extraction module is used for extracting features of an input picture, the first prediction module is used for generating the text region mask map according to the feature extraction result, and the second prediction module is used for generating the text scale map according to the feature extraction result.

7. The method of claim 1,

the method further comprises: inputting the text region picture into a pre-trained lightweight text detection model to obtain an output single-channel text center line response map and a four-channel text boundary region offset map;

the value of each pixel in the text center line response map represents the probability that the pixel belongs to a text line center line region, and the values of each pixel in the text boundary region offset map represent the horizontal and vertical distances from the pixel to the upper boundary of the text line to which it belongs and the horizontal and vertical distances from the pixel to the lower boundary of that text line;

the determining the bounding box of the text line in the text region picture comprises: determining the bounding box of the text line in the text region picture by combining the text center line response map and the text boundary region offset map.

8. The method of claim 7,

the determining, by combining the text center line response map and the text boundary region offset map, the bounding box of the text line in the text region picture comprises:

determining the center line of each text line through connected domain analysis of the text center line response map;

and, for each center line, determining the bounding box of the text line corresponding to that center line by combining the values in the text boundary region offset map of the pixels on the center line, and mapping the bounding box onto the text region picture.

9. The method of claim 7,

the lightweight text detection model comprises: a second feature extraction module, a third prediction module and a fourth prediction module; the second feature extraction module is used for extracting features of an input picture, the third prediction module is used for generating the text center line response map according to the feature extraction result, and the fourth prediction module is used for generating the text boundary region offset map according to the feature extraction result.

10. The method of claim 1,

the performing text sequence recognition on the text line picture to obtain the recognition result comprises: inputting the text line picture into a pre-trained lightweight text sequence recognition model to obtain an output recognition result; the feature extraction convolutional network structure in the lightweight text sequence recognition model is determined through automatic machine learning model search.

11. An optical character recognition apparatus, comprising: a first picture processing module, a second picture processing module and a text recognition module;

the first picture processing module is used for determining a bounding box of a text region in a picture to be recognized and extracting a text region picture from the picture to be recognized according to the bounding box;

the second picture processing module is used for determining a bounding box of a text line in the text region picture and extracting a text line picture from the text region picture according to the bounding box;

and the text recognition module is used for performing text sequence recognition on the text line picture to obtain a recognition result.

12. The apparatus of claim 11,

the first picture processing module is further configured to determine an adjustment manner for the text region picture, and to resize the text region picture according to the determined adjustment manner.

13. The apparatus of claim 12,

the first picture processing module is further used for inputting the picture to be recognized into a pre-trained lightweight text scale prejudgment model to obtain an output single-channel text region mask map and a single-channel text scale map; the value of each pixel in the text region mask map represents the probability that the pixel belongs to a text region, and the value of each pixel in the text scale map represents the ratio between the size of the shortest side of the text line to which the pixel belongs and a preset optimal size;

the first picture processing module determines the bounding box of the text region in the picture to be recognized according to the text region mask map, and determines the adjustment manner for the text region picture according to the text scale map.

14. The apparatus of claim 13,

the first picture processing module determines text connected domains in the text region mask map through connected domain analysis, and, for each text connected domain, determines the minimum rectangle containing the text connected domain in the picture to be recognized as the bounding box of the text region corresponding to that text connected domain.

15. The apparatus of claim 13,

the first picture processing module, for each text region picture, determines the values in the text scale map of the pixels within the text region picture, wherein the values in the text scale map of all pixels within one text region picture are the same;

the adjustment manner for the text region picture comprises: adjusting the width and height of the text region picture while preserving its aspect ratio, so that the size of the shortest side of the adjusted text line equals the optimal size.

16. The apparatus of claim 13,

the lightweight text scale prejudgment model comprises: a first feature extraction module, a first prediction module and a second prediction module; the first feature extraction module is used for extracting features of an input picture, the first prediction module is used for generating the text region mask map according to the feature extraction result, and the second prediction module is used for generating the text scale map according to the feature extraction result.

17. The apparatus of claim 11,

the second picture processing module is further used for inputting the text region picture into a pre-trained lightweight text detection model to obtain an output single-channel text center line response map and a four-channel text boundary region offset map; the value of each pixel in the text center line response map represents the probability that the pixel belongs to a text line center line region, and the values of each pixel in the text boundary region offset map represent the horizontal and vertical distances from the pixel to the upper boundary of the text line to which it belongs and the horizontal and vertical distances from the pixel to the lower boundary of that text line;

the second picture processing module determines the bounding box of the text line in the text region picture by combining the text center line response map and the text boundary region offset map.

18. The apparatus of claim 17,

the second picture processing module determines the center line of each text line through connected domain analysis of the text center line response map; for each center line, it determines the bounding box of the text line corresponding to that center line by combining the values in the text boundary region offset map of the pixels on the center line, and maps the bounding box onto the text region picture.

19. The apparatus of claim 17,

the lightweight text detection model comprises: a second feature extraction module, a third prediction module and a fourth prediction module; the second feature extraction module is used for extracting features of an input picture, the third prediction module is used for generating the text center line response map according to the feature extraction result, and the fourth prediction module is used for generating the text boundary region offset map according to the feature extraction result.

20. The apparatus of claim 11,

the text recognition module inputs the text line picture into a pre-trained lightweight text sequence recognition model to obtain an output recognition result; the feature extraction convolutional network structure in the lightweight text sequence recognition model is determined through automatic machine learning model search.

21. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein:

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.

22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.

Technical Field

The present application relates to computer application technologies, and in particular to an optical character recognition method and apparatus, an electronic device and a storage medium in the fields of artificial intelligence and deep learning.

Background

Optical Character Recognition (OCR) technology is widely used in industry, for example in document recognition. Current OCR implementations are typically complex, resulting in slow recognition.

Disclosure of Invention

The present application provides an optical character recognition method and apparatus, an electronic device and a storage medium.

An optical character recognition method comprising:

determining a bounding box of a text region in a picture to be recognized, and extracting a text region picture from the picture to be recognized according to the bounding box;

determining a bounding box of a text line in the text region picture, and extracting a text line picture from the text region picture according to the bounding box;

and performing text sequence recognition on the text line picture to obtain a recognition result.

An optical character recognition apparatus comprising: a first picture processing module, a second picture processing module and a text recognition module;

the first picture processing module is used for determining a bounding box of a text region in a picture to be recognized and extracting a text region picture from the picture to be recognized according to the bounding box;

the second picture processing module is used for determining a bounding box of a text line in the text region picture and extracting a text line picture from the text region picture according to the bounding box;

and the text recognition module is used for performing text sequence recognition on the text line picture to obtain a recognition result.

An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein:

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.

A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.

One embodiment in the above application has the following advantages or benefits: the text region is first extracted from the picture to be recognized, text lines are then further extracted from the text region, and text sequence recognition is performed on the text lines to obtain a recognition result; the recognition speed can thus be increased. It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a flow chart of an embodiment of an optical character recognition method described herein;

FIG. 2 is a schematic diagram illustrating an overall implementation process of the optical character recognition method according to the present application;

FIG. 3 is a schematic diagram of an optical character recognition apparatus 30 according to an embodiment of the present disclosure;

fig. 4 is a block diagram of an electronic device according to the method of an embodiment of the present application.

Detailed Description

The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

FIG. 1 is a flow chart of an embodiment of an optical character recognition method according to the present application. As shown in FIG. 1, the method comprises the following steps.

In 101, for a picture to be recognized, a bounding box of a text region in the picture is determined, and a text region picture is extracted from the picture to be recognized according to the bounding box.

In 102, a bounding box of a text line in the text region picture is determined, and a text line picture is extracted from the text region picture according to the bounding box.

In 103, text sequence recognition is performed on the text line picture to obtain a recognition result.

The method has characteristics such as simple logic and a small amount of computation, so the recognition speed can be increased; moreover, the method can run in various computing environments such as a Graphics Processing Unit (GPU) or a Central Processing Unit (CPU), and therefore has wide applicability.
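Steps 101-103 can be sketched as a minimal pipeline. The three callables below are hypothetical placeholders standing in for the trained models described later in this document, and the string values are toy data used only to show the data flow:

```python
def ocr_pipeline(image, detect_regions, detect_lines, recognize_line):
    """Three-stage OCR: text regions -> text lines -> sequence recognition.

    detect_regions, detect_lines and recognize_line are hypothetical
    stand-ins for the trained models described in this document.
    """
    results = []
    # Step 101: bounding boxes of text regions, then crop region pictures.
    for region in detect_regions(image):
        # Step 102: bounding boxes of text lines inside each region picture.
        for line in detect_lines(region):
            # Step 103: text sequence recognition on each text line picture.
            results.append(recognize_line(line))
    return results

# Toy stand-ins demonstrating the flow of data through the three stages.
texts = ocr_pipeline(
    "img",
    detect_regions=lambda img: ["region1", "region2"],
    detect_lines=lambda region: [region + "-line1"],
    recognize_line=lambda line: line.upper(),
)
print(texts)  # ['REGION1-LINE1', 'REGION2-LINE1']
```

Because each stage only consumes the crops produced by the previous one, the per-region and per-line work can later be batched or parallelized without changing this structure.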

Before the bounding box of the text line in the text region picture is determined, an adjustment manner for the text region picture can be determined, and the text region picture can be resized according to the determined adjustment manner.

In practical applications, one or more text regions may be determined from the picture to be recognized; when a plurality of text regions are determined, each text region can be processed in the same manner.

In the same picture, text sizes may vary greatly, and for text that is too large or too small, detection with a single model at a single scale often fails to accurately detect complete text lines. The conventional approach is to scale the input picture to several different sizes, feed each scaled picture to a text detector, and finally merge the detection results across input sizes through post-processing strategies such as Non-Maximum Suppression (NMS) to obtain the final detection result. This rests on the assumption that, across the different scaling sizes, each text line will at least once be scaled to a size the text detector is well suited to detect. However, this approach has at least the following problems: 1) since multiple whole pictures of different sizes must be processed, the whole-picture-level computation is considerable and the efficiency is low; and if the scaling sizes are poorly chosen so that text lines are never scaled to a suitable size, computing resources are simply wasted; 2) if the same text line is detected by the text detector at multiple sizes, some prior rule is needed to select which result to retain, and manually designed prior rules are often not robust, causing a loss of precision.

In view of the above problems, this embodiment proposes inputting the picture to be recognized into a pre-trained lightweight text scale prejudgment model to obtain a single-channel text region mask (TM, Text Mask) map and a single-channel text scale (TS, Text Scale) map. The value of each pixel in the text region mask map represents the probability that the pixel belongs to a text region, and the value of each pixel in the text scale map represents the ratio between the size of the shortest side of the text line to which the pixel belongs and a preset optimal size; the sides of a text line are its width and height, and the height is generally smaller than the width. Correspondingly, the bounding box of the text region in the picture to be recognized can be determined according to the text region mask map, and the adjustment manner for the text region picture can be determined according to the text scale map.

The lightweight text scale prejudgment model comprises: a first feature extraction module, a first prediction module and a second prediction module. The first feature extraction module is used for extracting features of an input picture, the first prediction module is used for generating the text region mask map according to the feature extraction result, and the second prediction module is used for generating the text scale map according to the feature extraction result.

The lightweight text scale prejudgment model may be a fully convolutional network. The first feature extraction module may be a small Convolutional Neural Network (CNN); the first prediction module may segment text regions in the picture based on the feature extraction result of the first feature extraction module, and the second prediction module may predict the ratio between the size of the shortest side of the text line in each text region and the preset optimal size. The first prediction module and the second prediction module may each contain 3 convolutional layers. Accordingly, the final output of the lightweight text scale prejudgment model is two single-channel segmentation maps: the text region mask map and the text scale map. In the text region mask map, the value of each pixel represents the probability, a value between 0 and 1, that the pixel belongs to a text region; in the text scale map, the value of each pixel represents the ratio between the size of the shortest side of the text line to which the pixel belongs and the preset optimal size.

In the training stage, for the text region mask map, the value of each pixel in the background (non-text) region may be 0 and the value of each pixel in a text region may be 1; for the text scale map, the value of each pixel is the ratio between the size of the shortest side of the text line to which it belongs and the preset optimal size. The optimal size can be a hyper-parameter whose specific value is determined according to actual requirements. As for the loss functions, dice loss can be selected for the first prediction module and smooth-L1 loss for the second prediction module.
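The two losses named above can be written down in a few lines. This is a minimal NumPy sketch of the standard formulations (dice loss for the mask branch, smooth-L1 for the scale branch); the exact variants used when training the patent's models are not specified in the text:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Dice loss for the text region mask branch:
    # 1 - 2 * intersection / (|pred| + |target|), eps avoids division by zero.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def smooth_l1(pred, target):
    # Smooth-L1 loss for the text scale branch:
    # quadratic for |diff| < 1, linear beyond, averaged over pixels.
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

# Example: a 2x2 mask prediction against its ground truth.
mask_pred = np.array([[0.9, 0.1], [0.8, 0.2]])
mask_gt = np.array([[1.0, 0.0], [1.0, 0.0]])
print(round(float(dice_loss(mask_pred, mask_gt)), 3))  # 0.15
```

Dice loss directly targets region overlap, which suits the heavily imbalanced text/background segmentation, while smooth-L1 keeps the scale regression robust to outlier pixels.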

In the prediction stage, the picture to be recognized is input into the lightweight text scale prejudgment model to obtain the output text region mask map and text scale map. All text connected domains in the text region mask map are then determined through connected domain analysis, each text connected domain representing an independent text region. For each text connected domain, the minimum rectangle containing it in the picture to be recognized is determined as the bounding box of the corresponding text region, and the text region picture is then extracted from the picture to be recognized according to the bounding box.
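The connected-domain step can be sketched as follows. To stay self-contained this uses a plain BFS labeling and returns axis-aligned rectangles; a production version would typically use library routines such as OpenCV's connected-components analysis, and the "minimum rectangle" of the patent may in general be a rotated one (e.g. `cv2.minAreaRect`):

```python
import numpy as np
from collections import deque

def text_region_boxes(mask_prob, thresh=0.5):
    """Binarize the text region mask map, find 4-connected domains by BFS,
    and return one axis-aligned rectangle (x0, y0, x1, y1) per domain."""
    binary = mask_prob > thresh
    seen = np.zeros_like(binary, dtype=bool)
    boxes = []
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                # Flood-fill one text connected domain.
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                ys, xs = [sy], [sx]
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            ys.append(ny)
                            xs.append(nx)
                            q.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

mask = np.zeros((6, 8))
mask[1:3, 1:4] = 0.9   # one text connected domain
mask[4:6, 5:8] = 0.8   # a second, separate domain
print(text_region_boxes(mask))  # [(1, 1, 3, 2), (5, 4, 7, 5)]
```

Each returned rectangle is then used to crop the corresponding text region picture from the original picture.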

For each text region picture, the values in the text scale map of the pixels within it can be determined; these values are the same for all pixels within one text region picture, that is, this embodiment assumes that the text lines in the same text region have the same height, and the height is usually smaller than the width, so the height is the shortest side. Then, for each text region picture, the corresponding adjustment manner may be: adjusting the width and height of the text region picture while preserving its aspect ratio, so that the size of the shortest side of the adjusted text line equals the optimal size.
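The resize itself is a single division. A sketch under the assumptions above (one scale ratio per region, read from the text scale map; rounding to integer pixels is this sketch's choice):

```python
def adjusted_size(region_w, region_h, scale_ratio):
    """scale_ratio = (shortest side of the text line) / (optimal size),
    read from the text scale map. Dividing both dimensions by it preserves
    the aspect ratio and brings the adjusted shortest side to the optimal
    size."""
    return round(region_w / scale_ratio), round(region_h / scale_ratio)

# A region whose text lines are twice the optimal height is halved:
print(adjusted_size(400, 120, 2.0))   # (200, 60)
# A region whose text lines are half the optimal height is doubled:
print(adjusted_size(400, 120, 0.5))   # (800, 240)
```

The resized region picture is then what gets fed to the text detection model, so that model only ever sees text near its preferred scale.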

Through this processing, the text region in the picture to be recognized can be determined rapidly and accurately based on the text region mask map and the text scale map output by the lightweight text scale prejudgment model, and the text region picture can be directly resized to a suitable size. This facilitates subsequent processing, improves the accuracy of subsequent processing results, and avoids the low efficiency and precision loss caused by the conventional approach of scaling the picture to several different sizes.

For each text region picture, the bounding box of each text line in the text region picture is determined, and each text line picture is extracted from the text region picture according to its bounding box.

The text region picture can be input into a pre-trained lightweight text detection model to obtain a single-channel Text Center Line (TCL) response map and a four-channel Text Boundary Offset (TBO) map. The value of each pixel in the text center line response map represents the probability that the pixel belongs to a text line center line region, and the values of each pixel in the text boundary region offset map represent the horizontal and vertical distances from the pixel to the upper boundary of the text line to which it belongs and the horizontal and vertical distances from the pixel to the lower boundary of that text line.

The lightweight text detection model can comprise: a second feature extraction module, a third prediction module and a fourth prediction module. The second feature extraction module is used for extracting features of an input picture, the third prediction module is used for generating the text center line response map according to the feature extraction result, and the fourth prediction module is used for generating the text boundary region offset map according to the feature extraction result.

The lightweight text detection model can be obtained by appropriately simplifying the existing Single-shot Arbitrarily-Shaped Text (SAST) model. The second feature extraction module can adopt a lightweight deep residual network such as ResNet-18 to reduce the computation of feature extraction as much as possible, and the four prediction branches of SAST can be simplified to two, namely the third prediction module and the fourth prediction module, each of which can contain 4 convolutional layers. The lightweight text detection model is therefore a fully convolutional network whose final output comprises the text center line response map and the text boundary region offset map. The text center line response map is single-channel; the value of each pixel represents the probability, a value between 0 and 1, that the pixel belongs to a text line center line region. The text boundary region offset map is four-channel; the values of each pixel represent, respectively, the horizontal distance from the pixel to the upper boundary of the text line to which it belongs, the vertical distance to that upper boundary, the horizontal distance to the lower boundary, and the vertical distance to that lower boundary.

In the training stage, the SAST configuration can be used: the text center line response map can be supervised with dice loss, and the text boundary region offset map with smooth-L1 loss.

In the prediction stage, the text region picture is input into the lightweight text detection model to obtain the output text center line response map and text boundary region offset map, and the bounding box of each text line in the text region picture is then determined by combining the two. Preferably, connected domain analysis may be performed on the text center line response map to determine the center line of each text line; for each center line, the bounding box of the corresponding text line may be determined by combining the values in the text boundary region offset map of the pixels on the center line, and the bounding box may be mapped onto the text region picture so that the text line picture can be extracted from the text region picture according to the bounding box.
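The combination of a center line with the four offset channels can be sketched as below. For each center-line pixel, the offsets point to an upper and a lower boundary point; collecting those points yields the line's extent. For self-containment this sketch collapses the boundary points to an axis-aligned box, whereas a full SAST-style reconstruction would keep the boundary polygon for arbitrarily shaped text:

```python
def line_box_from_centerline(centerline_pts, tbo):
    """centerline_pts: [(x, y), ...] pixels on one text center line.
    tbo[(x, y)] = (dx_up, dy_up, dx_low, dy_low): horizontal and vertical
    offsets from the pixel to the upper and lower text-line boundaries.
    Returns the axis-aligned box (x0, y0, x1, y1) over all boundary points."""
    xs, ys = [], []
    for x, y in centerline_pts:
        dx_up, dy_up, dx_low, dy_low = tbo[(x, y)]
        # Project the center-line pixel onto both boundaries.
        xs += [x + dx_up, x + dx_low]
        ys += [y + dy_up, y + dy_low]
    return min(xs), min(ys), max(xs), max(ys)

# Horizontal center line at y=10; upper boundary 4 px above, lower 4 px below.
pts = [(5, 10), (6, 10), (7, 10)]
tbo = {p: (0, -4, 0, 4) for p in pts}
print(line_box_from_centerline(pts, tbo))  # (5, 6, 7, 14)
```

Because the box is recovered per center line, every text line in a region gets its own bounding box even when lines sit close together.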

With this processing mode, the text lines in the text region picture can be quickly and accurately determined based on the text center line response map and text boundary region offset map output by the lightweight text detection model, and the acquired text region pictures can be processed in parallel, further increasing the processing speed.

Text sequence recognition is then performed on each acquired text line picture to obtain a recognition result. Preferably, the text line picture can be input into a lightweight text sequence recognition model obtained through pre-training, so as to obtain the output recognition result. Specifically, for an input text line picture, the lightweight text sequence recognition model may first obtain the features of the text line picture through a feature extraction convolutional network, then serialize the features into a plurality of frames, and input the frames into a bidirectional Gated Recurrent Unit (GRU) to perform classification prediction on each frame; this part may follow existing practice. In the training stage, the recognition of text sequences can be supervised by the classic CTC loss.
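The serialization step, turning the convolutional feature map of a text line into one frame per horizontal position for the bidirectional GRU, can be sketched as follows (a minimal illustration; the (C, H, W) channel layout is an assumption):

```python
import numpy as np

def serialize_features(feature_map):
    """Reshape a conv feature map of shape (C, H, W) into W frames of
    dimension C*H: each frame stacks all channels and rows at one
    horizontal position of the text line."""
    c, h, w = feature_map.shape
    return feature_map.transpose(2, 0, 1).reshape(w, c * h)
```

Each of the W frames is then classified by the GRU, so a wider text line picture yields a longer frame sequence.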

In the structure of the lightweight text sequence recognition model, the feature extraction convolutional network accounts for a large part of the computation of the whole model; to lower the computation cost of the model, a lighter feature extraction convolutional network structure can be adopted.

In this embodiment, the traditional manner of manually designing a network structure may be abandoned; instead, an Automated Machine Learning (AutoML) technique may be adopted to obtain the network structure through automatic search, that is, the feature extraction convolutional network structure in the lightweight text sequence recognition model may be determined by automated machine learning model search.

Specifically, the whole network search task can be controlled by a Recurrent Neural Network (RNN) controller capable of predicting network configurations; the controller is optimized by reinforcement learning with model accuracy and prediction latency as the training targets, and the optimal network structure is selected by the controller. As for the search space, the whole feature extraction convolutional network can be divided into a plurality of sub-modules; for model lightweighting, the number of sub-modules can be 3, the sub-modules share the same structure, each sub-module can be composed of a plurality of layers, and each layer is composed of a plurality of operators, such as convolution, pooling and shortcut connections. The search space of the network search task may include: the specific configuration of the layers used in each sub-module (e.g., the selection of operators, the connection mode, etc.), the number of layers included in each sub-module, and the like.
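A toy sketch of drawing a candidate from such a search space is shown below. The operator and connection choices are illustrative placeholders, not the actual search space, and in the real setup the RNN controller would predict these choices step by step rather than sampling them at random:

```python
import random

# Hypothetical search space mirroring the description above: 3 identical
# sub-modules, each with a searched number of layers, where every layer
# picks one operator and one connection mode.
OPERATORS = ["conv3x3", "conv5x5", "maxpool", "shortcut"]
CONNECTIONS = ["sequential", "skip"]

def sample_architecture(rng, max_layers=6):
    """Sample one candidate configuration of the feature extraction
    network from the search space."""
    n_layers = rng.randint(2, max_layers)
    sub_module = [
        {"op": rng.choice(OPERATORS), "conn": rng.choice(CONNECTIONS)}
        for _ in range(n_layers)
    ]
    # all 3 sub-modules share the same searched structure
    return {"sub_module": sub_module, "repeats": 3}
```

Each sampled candidate would be trained and scored on accuracy and latency, and that reward drives the reinforcement-learning update of the controller.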

Compared with manually designing a network structure, this manner can greatly reduce labor cost and achieve higher accuracy; and since the accuracy is guaranteed, simple Connectionist Temporal Classification (CTC) decoding logic can be adopted when performing text sequence recognition, thereby reducing implementation complexity and further improving the processing speed.
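The simple CTC decoding mentioned above is typically greedy (best-path) decoding: take the most probable class per frame, collapse consecutive repeats, then remove blanks. A minimal sketch, assuming the blank class has index 0:

```python
import numpy as np

BLANK = 0  # index of the CTC blank class (an assumption)

def ctc_greedy_decode(logits):
    """Greedy CTC decoding.

    logits: (T, num_classes) per-frame class scores produced by the
    recognition model. Returns the collapsed label sequence.
    """
    best = np.argmax(logits, axis=1)  # best class per frame
    out, prev = [], None
    for c in best:
        # keep a class only when it differs from the previous frame
        # and is not the blank
        if c != prev and c != BLANK:
            out.append(int(c))
        prev = c
    return out
```

For example, the per-frame best path `1 1 0 2 2` collapses to the labels `1 2`.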

Based on the above description, fig. 2 is a schematic diagram of an overall implementation process of the optical character recognition method according to the present application, and for specific implementation, reference is made to the foregoing related description, which is not repeated.

In summary, this embodiment provides a lightweight general optical character recognition method composed of a lightweight text scale prejudging model, a lightweight text detection model, a lightweight text sequence recognition model and the like. On the premise of ensuring high recognition accuracy, the method has the characteristics of simple logic and small computation, which increases the recognition speed; it can run in various computing environments such as GPUs and CPUs, and therefore has wide applicability.

It should be noted that the foregoing method embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.

Fig. 3 is a schematic structural diagram of an optical character recognition apparatus 30 according to an embodiment of the present disclosure. As shown in fig. 3, includes: a first picture processing module 301, a second picture processing module 302, and a text recognition module 303.

The first picture processing module 301 is configured to determine, for a picture to be identified, a surrounding frame of a text region in the picture, and extract a picture of the text region from the picture to be identified according to the surrounding frame.

The second picture processing module 302 is configured to determine a bounding box of a text line in the text region picture, and extract the text line picture from the text region picture according to the bounding box.

And the text recognition module 303 is configured to perform text sequence recognition on the text line picture to obtain a recognition result.

For the extracted text region picture, the first picture processing module 301 may further determine an adjustment mode of the text region picture, and perform size adjustment on the text region picture according to the determined adjustment mode.

The first picture processing module 301 may input the picture to be recognized into a lightweight text scale prejudgment model obtained by pre-training, so as to obtain an output single-channel text region mask map and a text scale map; the values of the pixel points in the text region mask map respectively represent the probability that the corresponding pixel points belong to a text region, and the values of the pixel points in the text scale map respectively represent the ratio of the size of the shortest side of the text line to which the corresponding pixel points belong to a preset optimal size. Further, the first picture processing module 301 may determine the bounding box of the text region in the picture to be recognized according to the text region mask map, and determine the adjustment manner of the text region picture according to the text scale map.

Specifically, the first picture processing module 301 may determine a text connected component in the text region mask map through connected component analysis, and may determine, for any text connected component, a minimum rectangle containing the text connected component in the picture to be recognized as a bounding box of the text region corresponding to the text connected component.

The first picture processing module 301 may further determine, for any text region picture, values of pixel points in the text region picture in the text scale map, where the values of the pixel points in the text region picture in the text scale map are the same. Correspondingly, the determined adjustment mode of the text region picture may include: and on the premise of keeping the aspect ratio of the text region picture, adjusting the width and height of the text region picture to enable the size of the shortest side of the text line after adjustment to be equal to the optimal size.
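The adjustment rule above can be sketched as follows. `compute_resize` is a hypothetical helper; by the scale map's definition (shortest side of the text line divided by the optimal size), multiplying both sides by 1/scale makes the shortest side equal to the optimal size while keeping the aspect ratio:

```python
def compute_resize(region_w, region_h, text_scale):
    """Compute the adjusted width and height of a text region picture.

    text_scale: the value read from the text scale map for this region,
    i.e. (shortest side of the text line) / (optimal size). Scaling
    both sides by 1/text_scale brings the shortest side of the text
    line to the optimal size and preserves the aspect ratio.
    """
    factor = 1.0 / text_scale
    return round(region_w * factor), round(region_h * factor)
```

For instance, a scale value of 2.0 means the text is twice the optimal size, so the whole region is halved.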

The lightweight text scale prejudgment model comprises: a first feature extraction module, a first prediction module and a second prediction module, wherein the first feature extraction module is used for extracting features of an input picture, the first prediction module is used for generating the text region mask map according to the feature extraction result, and the second prediction module is used for generating the text scale map according to the feature extraction result.

The second picture processing module 302 may input the text region picture into a lightweight text detection model obtained by pre-training, so as to obtain an output single-channel text center line response map and a four-channel text boundary region offset map; the values of the pixel points in the text center line response map respectively represent the probability that the corresponding pixel points belong to the center line region of a text line, and the values of the pixel points in the text boundary region offset map respectively represent the horizontal and vertical distances from the corresponding pixel points to the upper boundary of the text line to which they belong and to the lower boundary of that text line. Accordingly, the second picture processing module 302 may determine the bounding boxes of the text lines in the text region picture by combining the text center line response map and the text boundary region offset map.

Specifically, the second picture processing module 302 may determine the center line of each text line by performing connected domain analysis on the text center line response map, and for any center line, may determine the bounding box of the corresponding text line by combining the values, in the text boundary region offset map, of the pixel points on that center line, and map the bounding box onto the text region picture.

The lightweight text detection model can comprise: a second feature extraction module, a third prediction module and a fourth prediction module, wherein the second feature extraction module is used for extracting features of an input picture, the third prediction module is used for generating the text center line response map according to the feature extraction result, and the fourth prediction module is used for generating the text boundary region offset map according to the feature extraction result.

The text recognition module 303 may input the text line picture into a lightweight text sequence recognition model obtained by pre-training to obtain an output recognition result; the feature extraction convolution network structure in the lightweight text sequence recognition model can be determined by adopting an automatic machine learning model searching mode.

For a specific work flow of the apparatus embodiment shown in fig. 3, reference is made to the related description in the foregoing method embodiment, and details are not repeated.

In a word, by adopting the scheme of this apparatus embodiment, optical character recognition can be performed in a lightweight general manner composed of a lightweight text scale prejudging model, a lightweight text detection model, a lightweight text sequence recognition model and the like; on the premise of ensuring high recognition accuracy, the scheme has the characteristics of simple logic and small computation, which improves the recognition speed, and it can run in various computing environments such as GPUs and CPUs, giving it wide applicability. In addition, the text region in the picture to be recognized can be quickly and accurately determined based on the text region mask map and text scale map output by the lightweight text scale prejudging model, and the text region picture can be directly adjusted to an appropriate size, which facilitates subsequent processing, improves the accuracy of subsequent processing results, and avoids the problems of low efficiency and precision loss caused by scaling the picture into several different sizes in the traditional manner. Moreover, the text lines in the text region picture can be quickly and accurately determined based on the text center line response map and text boundary region offset map output by the lightweight text detection model, and the acquired text region pictures can be processed in parallel, further increasing the processing speed. Furthermore, automated machine learning model search can be adopted to determine the feature extraction convolutional network structure in the lightweight text sequence recognition model, abandoning the traditional manner of manually designing the network structure, thereby greatly reducing labor cost while achieving higher precision.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 4 is a block diagram of an electronic device according to the method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 4, the electronic apparatus includes: one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor Y01 is taken as an example.

Memory Y02 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.

Memory Y02 is provided as a non-transitory computer readable storage medium that can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the methods of the embodiments of the present application. The processor Y01 executes various functional applications of the server and data processing, i.e., implements the method in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory Y02.

The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Additionally, the memory Y02 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.

The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03 and the output device Y04 may be connected by a bus or other means, and the bus connection is exemplified in fig. 4.

The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer, one or more mouse buttons, track ball, joystick, or other input device. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and the present application is not limited herein.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
