Image analysis method, system, device and medium based on OCR

Document No.: 1379192 · Publication date: 2020-08-14

Reading note: This technology, "An OCR-based image analysis method, system, device and medium", was designed and created by 周曦, 姚志强, 林旸焜 and 许梅芳 on 2020-04-21. Its main content is as follows: the invention provides an OCR-based image analysis method, system, device and medium, comprising: performing semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result; and analyzing the layout of the target image based on the semantic segmentation result. By semantically segmenting the target image and performing layout analysis on the segmentation result, the invention can quickly and accurately detect text in the target image under interference such as occlusion, imaging angle, rotation and illumination; and for non-standard image layouts or formats, it can also extract field information in a structured manner.

1. An OCR-based image analysis method, comprising the steps of:

performing semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and analyzing the layout of the target image based on the semantic segmentation result.

2. An OCR-based image analysis method according to claim 1, further comprising preprocessing the target image before acquiring the features of the target image, the preprocessing including at least one of:

correcting the position of text in the target image through a text position correction algorithm; and converting the text box annotations of the target image into pixel-level labels.

3. An OCR-based image analysis method according to claim 2, wherein the annotation content comprises one of the following: text line attributes, corner points, and offsets.

4. An OCR-based image analysis method according to any one of claims 1 to 3, characterized in that, based on an overall deep learning scheme, training is performed using stochastic gradient descent with momentum to obtain one or more layout analysis deep learning network models;

and the pre-acquired features of the target image and the correlation attributes between hierarchical contexts of the target image are input into the one or more trained layout analysis deep learning network models to perform semantic segmentation on the target image.

5. An OCR-based image analysis method according to claim 4, wherein the semantic segmentation performed on the target image comprises at least one of: regressing a text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image.

6. An OCR-based image analysis method according to claim 5, wherein the obtained semantic segmentation result comprises at least one of: corner positions obtained by regressing the text box of the target image; boundary positions of the text box obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box obtained by predicting pixel points in corner regions of the target image.

7. An OCR-based image analysis method according to claim 6, further comprising performing corner matching by combining a plurality of weights, the weights including at least one of: distance, aspect ratio, and angle.

8. An OCR-based image analysis method according to claim 6, further comprising:

classifying all predicted pixel points to obtain attributes of the text box of the target image;

and analyzing the layout of the target image based on the attributes of the text box.

9. An OCR-based image analysis method according to claim 4, further comprising adding one or more interference parameters when training the one or more layout analysis deep learning network models to improve the robustness of the one or more layout analysis deep learning network models.

10. An OCR-based image analysis method according to claim 9, wherein the interference parameters comprise at least one of: background, rotation, perspective, distortion, noise, Gaussian blur, and motion blur.

11. An OCR-based image analysis method according to claim 4, wherein, in training the one or more layout analysis deep learning network models, model losses are calculated using different methods for the predictions of different attributes, and all model losses are weighted to obtain the total loss of the model.

12. An OCR-based image analysis method according to claim 11, wherein the calculated model losses comprise at least one of: smooth L1 loss and cross-entropy loss.

13. An OCR-based image analysis method according to claim 1, wherein the features of the target image comprise at least one of: global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image.

14. An OCR-based image analysis method according to claim 1 or 13, wherein the features of the target image are obtained through a convolutional neural network or a fully convolutional network, the features including global features of the target image and local features of the target image.

15. An OCR-based image analysis method according to claim 14, further comprising enlarging the receptive field of the fully convolutional network using a parallel architecture formed jointly by cascaded dilated (atrous) convolutions and dilated convolutions with different sampling rates.

16. An OCR-based image analysis system, comprising:

a segmentation module, configured to perform semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and an analysis module, configured to analyze the layout of the target image based on the semantic segmentation result.

17. An OCR-based image analysis system according to claim 16, further comprising preprocessing of the target image before acquiring the features of the target image, the preprocessing including at least one of:

correcting the position of text in the target image through a text position correction algorithm; and converting the text box annotations of the target image into pixel-level labels.

18. An OCR-based image analysis system according to claim 17, wherein the annotation content comprises one of the following: text line attributes, corner points, and offsets.

19. An OCR-based image analysis system according to any one of claims 16 to 18, wherein, based on an overall deep learning scheme, training is performed using stochastic gradient descent with momentum to obtain one or more layout analysis deep learning network models;

and the pre-acquired features of the target image and the correlation attributes between hierarchical contexts of the target image are input into the one or more trained layout analysis deep learning network models to perform semantic segmentation on the target image.

20. An OCR-based image analysis system according to claim 19, wherein the semantic segmentation performed on the target image comprises at least one of: regressing a text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image.

21. An OCR-based image analysis system according to claim 20, wherein the obtained semantic segmentation result comprises at least one of: corner positions obtained by regressing the text box of the target image; boundary positions of the text box obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box obtained by predicting pixel points in corner regions of the target image.

22. An OCR-based image analysis system according to claim 21, further comprising performing corner matching by combining a plurality of weights, the weights including at least one of: distance, aspect ratio, and angle.

23. An OCR-based image analysis system according to claim 21, further comprising:

classifying all predicted pixel points to obtain attributes of the text box of the target image;

and analyzing the layout of the target image based on the attributes of the text box.

24. An OCR-based image analysis system according to claim 19, further comprising adding one or more interference parameters when training the one or more layout analysis deep learning network models to improve the robustness of the one or more layout analysis deep learning network models.

25. An OCR-based image analysis system according to claim 24, wherein the interference parameters comprise at least one of: background, rotation, perspective, distortion, noise, Gaussian blur, and motion blur.

26. An OCR-based image analysis system according to claim 19, wherein, in training the one or more layout analysis deep learning network models, model losses are calculated using different methods for the predictions of different attributes, and all model losses are weighted to obtain the total loss of the model.

27. An OCR-based image analysis system according to claim 16, wherein the calculated model losses comprise at least one of: smooth L1 loss and cross-entropy loss.

28. An OCR-based image analysis system according to claim 16, wherein the features of the target image comprise at least one of: global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image.

29. An OCR-based image analysis system according to claim 16 or 28, wherein the features of the target image are obtained through a convolutional neural network or a fully convolutional network, the features including global features of the target image and local features of the target image.

30. An OCR-based image analysis system according to claim 29, further comprising using a parallel architecture formed jointly by cascaded dilated (atrous) convolutions and dilated convolutions with different sampling rates to enlarge the receptive field of the fully convolutional network.

31. An OCR-based image analysis apparatus, comprising:

performing semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and analyzing the layout of the target image based on the semantic segmentation result.

32. An apparatus, comprising:

one or more processors; and

one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-15.

33. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-15.

Technical Field

The present invention relates to the field of image technologies, and in particular, to an image analysis method, system, device, and medium based on OCR.

Background

Layout analysis (also called format analysis) is an important problem in the field of OCR (Optical Character Recognition), and aims to determine whether a given picture or image contains a designated object and to obtain the accurate position and boundary of that object. In the field of OCR, semantic segmentation and general object detection frameworks have been widely adopted for scene text detection tasks. However, under interference such as occlusion, imaging angle, rotation and illumination, common object detection in the prior art can hardly meet the requirements of fast and accurate text detection at the same time, and field information cannot be extracted in a structured manner from non-standard image layouts or formats.

Disclosure of Invention

In view of the above shortcomings of the prior art, it is an object of the present invention to provide an OCR-based image analysis method, system, device and medium that solve the problems in the prior art.

To achieve the above and other related objects, the present invention provides an OCR-based image analysis method, including the steps of:

performing semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and analyzing the layout of the target image based on the semantic segmentation result.

Optionally, before acquiring the features of the target image, the method further includes preprocessing the target image, where the preprocessing includes at least one of:

correcting the position of text in the target image through a text position correction algorithm; and converting the text box annotations of the target image into pixel-level labels.

Optionally, the annotation content comprises one of the following: text line attributes, corner points, and offsets.

Optionally, based on an overall deep learning scheme, training is performed using stochastic gradient descent with momentum to obtain one or more layout analysis deep learning network models;

and the pre-acquired features of the target image and the correlation attributes between hierarchical contexts of the target image are input into the one or more trained layout analysis deep learning network models to perform semantic segmentation on the target image.

Optionally, the semantic segmentation performed on the target image comprises at least one of: regressing a text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image.

Optionally, the obtained semantic segmentation result includes at least one of: corner positions obtained by regressing the text box of the target image; boundary positions of the text box obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box obtained by predicting pixel points in corner regions of the target image.

Optionally, corner matching is performed by combining a plurality of weights, the weights including at least one of: distance, aspect ratio, and angle.

Optionally, all predicted pixel points are classified to obtain attributes of the text box of the target image;

and the layout of the target image is analyzed based on the attributes of the text box.

Optionally, when training the one or more layout analysis deep learning network models, the method further comprises adding one or more interference parameters to improve the robustness of the one or more layout analysis deep learning network models.

Optionally, the interference parameters include at least one of: background, rotation, perspective, distortion, noise, Gaussian blur, and motion blur.

Optionally, when the one or more layout analysis deep learning network models are trained, model losses are calculated using different methods for the predictions of different attributes, and all the model losses are weighted to obtain the total loss of the model.

Optionally, the calculated model losses include at least one of: smooth L1 loss and cross-entropy loss.

Optionally, the features of the target image include at least one of: global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image.

Optionally, the features of the target image are obtained through a convolutional neural network or a fully convolutional network, including global features of the target image and local features of the target image.

Optionally, the method further comprises enlarging the receptive field of the fully convolutional network, while keeping the number of parameters, by using a parallel architecture formed jointly by cascaded dilated (atrous) convolutions and dilated convolutions with different sampling rates.

The present invention also provides an OCR-based image analysis system, including:

a segmentation module, configured to perform semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and an analysis module, configured to analyze the layout of the target image based on the semantic segmentation result.

Optionally, before acquiring the features of the target image, the system further performs preprocessing on the target image, where the preprocessing includes at least one of:

correcting the position of text in the target image through a text position correction algorithm; and converting the text box annotations of the target image into pixel-level labels.

Optionally, the annotation content comprises one of the following: text line attributes, corner points, and offsets.

Optionally, based on an overall deep learning scheme, training is performed using stochastic gradient descent with momentum to obtain one or more layout analysis deep learning network models;

and the pre-acquired features of the target image and the correlation attributes between hierarchical contexts of the target image are input into the one or more trained layout analysis deep learning network models to perform semantic segmentation on the target image.

Optionally, the semantic segmentation performed on the target image comprises at least one of: regressing a text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image.

Optionally, the obtained semantic segmentation result includes at least one of: corner positions obtained by regressing the text box of the target image; boundary positions of the text box obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box obtained by predicting pixel points in corner regions of the target image.

Optionally, corner matching is performed by combining a plurality of weights, the weights including at least one of: distance, aspect ratio, and angle.

Optionally, all predicted pixel points are classified to obtain attributes of the text box of the target image;

and the layout of the target image is analyzed based on the attributes of the text box.

Optionally, when training the one or more layout analysis deep learning network models, the system further adds one or more interference parameters to improve the robustness of the one or more layout analysis deep learning network models.

Optionally, the interference parameters include at least one of: background, rotation, perspective, distortion, noise, Gaussian blur, and motion blur.

Optionally, when the one or more layout analysis deep learning network models are trained, model losses are calculated using different methods for the predictions of different attributes, and all the model losses are weighted to obtain the total loss of the model.

Optionally, the calculated model losses include at least one of: smooth L1 loss and cross-entropy loss.

Optionally, the features of the target image include at least one of: global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image.

Optionally, the features of the target image are obtained through a convolutional neural network or a fully convolutional network, including global features of the target image and local features of the target image.

Optionally, the system further enlarges the receptive field of the fully convolutional network, while keeping the number of parameters, by using a parallel architecture formed jointly by cascaded dilated (atrous) convolutions and dilated convolutions with different sampling rates.

The present invention also provides an OCR-based image analysis device, comprising:

performing semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and analyzing the layout of the target image based on the semantic segmentation result.

The present invention also provides an apparatus comprising:

one or more processors; and

one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as described in one or more of the above.

The present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the methods as described in one or more of the above.

As described above, the OCR-based image analysis method, system, device and medium provided by the present invention have the following beneficial effects: semantic segmentation is performed on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result, and the layout of the target image is analyzed based on the semantic segmentation result. By semantically segmenting the target image and performing layout analysis on the segmentation result, the invention can quickly and accurately detect text in the target image under interference such as occlusion, imaging angle, rotation and illumination; and for non-standard image layouts or formats, it can also extract field information in a structured manner.

Drawings

FIG. 1 is a flow diagram illustrating an OCR-based image analysis method according to an embodiment;

FIG. 2 is a diagram illustrating a hardware architecture of an OCR-based image analysis system according to an embodiment;

FIG. 3 is a schematic diagram of a hardware structure of a terminal device according to an embodiment;

FIG. 4 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.

Description of the element reference numerals

M10 segmentation module

M20 analysis module

1100 input device

1101 first processor

1102 output device

1103 first memory

1104 communication bus

1200 processing assembly

1201 second processor

1202 second memory

1203 communication assembly

1204 power supply assembly

1205 multimedia assembly

1206 voice assembly

1207 input/output interface

1208 sensor assembly

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only intended to illustrate the basic idea of the present invention; they show only the components related to the present invention rather than being drawn according to the number, shape and size of components in an actual implementation, in which the type, quantity and proportion of the components may vary freely and the component layout may be more complicated.

Referring to fig. 1, the present invention provides an OCR-based image analysis method, including the following steps:

s100, performing semantic segmentation on the target image according to the characteristics of the target image acquired in advance to acquire a semantic segmentation result;

s200, analyzing the layout of the target image based on the semantic segmentation result.

According to the above scheme, the method can quickly and accurately detect text in the target image under interference such as occlusion, imaging angle, rotation and illumination; and for non-standard image layouts or formats, the method can also extract field information in a structured manner.

In an exemplary embodiment, before acquiring the features of the target image, the method further includes preprocessing the target image, where the preprocessing includes at least one of: correcting the position of text in the target image through a text position correction algorithm, and converting the text box annotations of the target image into pixel-level labels, as sketched below. As an example, for region detection of a bill or document, the target image may be preprocessed by algorithms such as text position correction. In the embodiment of the application, a text box annotation of the target image is converted into a pixel-level label, where the annotation content comprises one of the following: text line attributes, corner points, and offsets.
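As a hedged illustration of this preprocessing step, the sketch below converts quadrilateral text-box annotations into pixel-level labels: a text mask (text line attribute), a corner-region mask, and per-pixel offsets to the four corners. The array layouts, label values and the `corner_radius` parameter are illustrative assumptions, not the patent's exact scheme.

```python
# Convert quadrilateral text-box annotations into pixel-level labels.
import numpy as np
import cv2

def boxes_to_pixel_labels(image_shape, quads, corner_radius=3):
    """quads: (N, 4, 2) array of text-box corners in (x, y) order."""
    h, w = image_shape[:2]
    text_mask = np.zeros((h, w), dtype=np.uint8)     # 1 inside any text box
    corner_mask = np.zeros((h, w), dtype=np.uint8)   # 1 near a box corner
    offsets = np.zeros((h, w, 8), dtype=np.float32)  # per-pixel offsets to 4 corners

    ys, xs = np.mgrid[0:h, 0:w]
    for quad in quads:
        poly = quad.astype(np.int32)
        cv2.fillPoly(text_mask, [poly], 1)
        inside = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(inside, [poly], 1)
        for k, (cx, cy) in enumerate(quad):
            # Offset of each in-box pixel to the k-th corner.
            offsets[..., 2 * k][inside == 1] = (cx - xs)[inside == 1]
            offsets[..., 2 * k + 1][inside == 1] = (cy - ys)[inside == 1]
            cv2.circle(corner_mask, (int(cx), int(cy)), corner_radius, 1, -1)
    return text_mask, corner_mask, offsets
```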

In an exemplary embodiment, based on an overall deep learning scheme, training is performed using stochastic gradient descent with momentum to obtain one or more layout analysis deep learning network models. As an example, in the overall deep learning scheme, training uses stochastic gradient descent (SGD) with momentum, together with L2 regularization and learning rate decay, to train one or more layout analysis deep learning network models; a sketch of this setup follows. The required field text is then classified through classification and regression methods to obtain the coordinate values of the text responses, thereby realizing structured text recognition.
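The following PyTorch sketch shows one plausible version of this training setup: SGD with momentum, L2 regularization via `weight_decay`, and stepwise learning rate decay. All hyperparameter values are illustrative assumptions.

```python
# Minimal training setup: SGD + momentum, L2 regularization, LR decay.
import torch

model = torch.nn.Conv2d(3, 8, 3, padding=1)  # stand-in for a layout-analysis network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,        # SGD with momentum
    weight_decay=5e-4,   # L2 regularization
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()     # learning-rate decay once per epoch
```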

The pre-acquired features of the target image and the correlation attributes between hierarchical contexts of the target image are then input into the one or more trained layout analysis deep learning network models to perform semantic segmentation on the target image.

In the embodiment of the application, when the one or more layout analysis deep learning network models are trained, the losses of the models are calculated using different methods for the predictions of different attributes, and all the losses are weighted to obtain the total loss of the models. For example, the smooth L1 loss (Smooth L1 Loss) and the cross-entropy loss (Cross Entropy Loss) of the model are calculated and weighted to obtain the total loss of the model. As an example, in the embodiment of the application, Online Hard Example Mining (OHEM) is also used during training to balance positive and negative samples, thereby improving the recall rate (Recall) of the overall result. A sketch of this loss design follows.
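A minimal sketch of such a weighted multi-task loss is given below: smooth L1 loss for offset regression, cross-entropy for pixel classification, an OHEM-style top-k selection of hard pixels, and a weighted sum as the total loss. The weights, the kept fraction and the tensor shapes are illustrative assumptions.

```python
# Weighted multi-task loss with OHEM-style hard-pixel selection.
import torch
import torch.nn.functional as F

def total_loss(pred_offsets, gt_offsets, pred_logits, gt_labels,
               w_reg=1.0, w_cls=1.0, ohem_keep=0.25):
    # Per-pixel regression loss (smooth L1) over the offset channels.
    reg = F.smooth_l1_loss(pred_offsets, gt_offsets, reduction="none").mean(dim=1)

    # Per-pixel classification loss (cross entropy).
    cls = F.cross_entropy(pred_logits, gt_labels, reduction="none")

    # OHEM: keep only the hardest fraction of pixels for the classification term.
    flat = cls.flatten()
    k = max(1, int(flat.numel() * ohem_keep))
    hard, _ = torch.topk(flat, k)

    return w_reg * reg.mean() + w_cls * hard.mean()

# Usage with dummy tensors: batch 2, 8 offset channels, 2 classes, 64x64 map.
pred_off = torch.randn(2, 8, 64, 64)
gt_off = torch.randn(2, 8, 64, 64)
logits = torch.randn(2, 2, 64, 64)
labels = torch.randint(0, 2, (2, 64, 64))
print(total_loss(pred_off, gt_off, logits, labels))
```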

In the embodiment of the application, after the one or more layout analysis deep learning network models are obtained, the models may be fine-tuned, quantized and pruned according to subsequent recognition requirements.

As can be seen from the above exemplary embodiments, the semantic segmentation performed on the target image includes at least one of: regressing the text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image. The obtained semantic segmentation result includes at least one of: corner positions obtained by regressing the text box of the target image; boundary positions of the text box obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box obtained by predicting pixel points in corner regions of the target image.

In the embodiment of the application, the text box is regressed using the AdvancedEAST method (a variant of EAST, an Efficient and Accurate Scene Text detector), and the corresponding four-point positions are obtained by predicting the offsets of pixel points relative to the text box, thereby realizing high-precision text box boundary regression; a decoding sketch follows. For example, only the pixel points in the corner regions are used for prediction: prediction of the text box corner regions is added so that the text box boundary is obtained accurately. Different convolution kernels are used to regress larger and smaller offsets.
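The sketch below illustrates this decoding step under stated assumptions: each predicted text pixel carries offsets to the four corners of its text box, and averaging the per-pixel corner votes over one connected region yields the four-point position. The (H, W, 8) offset layout is an assumption, not the patent's exact format.

```python
# Decode per-pixel corner offsets into a four-point text box.
import numpy as np

def decode_quad(offsets, region_mask):
    """offsets: (H, W, 8) per-pixel offsets to 4 corners (dx1, dy1, ..., dx4, dy4).
    region_mask: (H, W) boolean mask of one predicted text region."""
    ys, xs = np.nonzero(region_mask)
    quad = np.zeros((4, 2), dtype=np.float32)
    for k in range(4):
        # Each pixel votes for corner k at (x + dx_k, y + dy_k); average the votes.
        quad[k, 0] = (xs + offsets[ys, xs, 2 * k]).mean()
        quad[k, 1] = (ys + offsets[ys, xs, 2 * k + 1]).mean()
    return quad
```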

In an exemplary embodiment, after the text box boundary is obtained accurately, a depth-first search (DFS) is used on the predicted text pixel points to obtain text-line connected blocks, and four-point regression of the text box is performed in the corner regions. The text lines obtained in this way may suffer from vertical adhesion or overlap, so this embodiment may also perform corner matching with a plurality of weights; for example, the weights may include distance, aspect ratio and angle (see the sketch after this paragraph). Meanwhile, text boxes with a high overlap ratio are merged, reducing noise interference.
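The following sketch shows both post-processing steps under simple assumptions: an iterative depth-first search that groups predicted text pixels into connected blocks, and a weighted matching score combining distance, aspect ratio and angle. The box representation and weight values are illustrative.

```python
# DFS connected blocks over a text mask, plus a weighted corner-matching score.
import numpy as np

def connected_blocks(mask):
    """Group True pixels of a binary mask into 4-connected blocks via DFS."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    blocks = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                stack, block = [(sy, sx)], []
                seen[sy, sx] = True
                while stack:                      # iterative DFS
                    y, x = stack.pop()
                    block.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                blocks.append(block)
    return blocks

def match_score(box_a, box_b, w_dist=1.0, w_ratio=0.5, w_angle=0.5):
    """Lower is better: combines center distance, aspect-ratio gap and angle gap."""
    (ca, ra, aa), (cb, rb, ab) = box_a, box_b   # (center, aspect_ratio, angle)
    dist = np.hypot(ca[0] - cb[0], ca[1] - cb[1])
    return w_dist * dist + w_ratio * abs(ra - rb) + w_angle * abs(aa - ab)
```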

According to the description of the exemplary embodiments, the target image is preprocessed, and semantic segmentation is then performed based on the features of the target image to obtain predicted pixel points;

all predicted pixel points are classified to obtain attributes of the text box of the target image;

and the layout of the target image is analyzed based on the attributes of the text box.

In an exemplary embodiment, layout analysis may fail to accurately determine the text box of a target because of inherent variations in the target, for example when the designated target has complex details or is occluded, or because of changes in external conditions, such as imaging angle, in-plane rotation, out-of-plane rotation, illumination changes and acquisition path. Therefore, when the one or more layout analysis deep learning network models are trained, one or more interference parameters are added to improve the robustness or generalization ability of the models. As an example, the interference parameters include at least one of: background, rotation, perspective, distortion, noise, Gaussian blur and motion blur (see the augmentation sketch below). With the increased robustness or generalization ability, the method is robust to image problems in practical scenes such as illumination, shadow, wrinkles, stamps, background interference, printing misalignment and low image quality.
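A minimal OpenCV sketch of such interference parameters applied as training-time data augmentation follows; it covers rotation, perspective, noise, Gaussian blur and motion blur (background replacement is omitted). All parameter ranges are illustrative assumptions.

```python
# Training-time augmentation: rotation, perspective, noise, Gaussian/motion blur.
import numpy as np
import cv2

def augment(img, rng=np.random.default_rng()):
    h, w = img.shape[:2]
    # Rotation around the image center.
    angle = rng.uniform(-10, 10)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))
    # Mild random perspective.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-0.02 * w, 0.02 * w, src.shape).astype(np.float32)
    img = cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst), (w, h))
    # Additive Gaussian noise.
    img = np.clip(img + rng.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
    # Gaussian blur.
    img = cv2.GaussianBlur(img, (3, 3), 0)
    # Horizontal motion blur via a 1xK averaging kernel.
    k = 5
    kernel = np.zeros((k, k), dtype=np.float32)
    kernel[k // 2, :] = 1.0 / k
    return cv2.filter2D(img, -1, kernel)
```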

In some exemplary embodiments, the features of the target image include at least one of: global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image. As an example, the embodiment of the application may acquire the features of the target image through a convolutional neural network or a fully convolutional network, including global features and local features of the target image. Specifically, a convolutional neural network (CNN) is used to obtain target image features and extract low-level features; then, on the basis of the DeepLabV3+ framework (a semantic segmentation algorithm proposed by Google in 2018), fully convolutional networks such as ResNet50 (a 50-layer deep residual network) and MobileNetV2 (a lightweight network structure proposed by Google in 2018) are used to extract features, so as to obtain the global features and local features of the target image. In the embodiment of the application, fusing upsampling and convolution between different layers of the fully convolutional network not only balances global and local features, but also obtains global context information of the target image, effectively reducing the influence of noise and achieving coarse-to-fine pixel-level semantic segmentation.

The method further comprises enlarging the receptive field of the fully convolutional network, while keeping the number of parameters, by using a parallel architecture formed jointly by cascaded dilated (atrous) convolutions and dilated convolutions with different sampling rates, in the spirit of the sketch below. By enlarging the receptive field of the fully convolutional network, the method can meet the demands that layout analysis places on the receptive field in the OCR field, namely high accuracy of left and right boundary regression and large text length spans, for languages such as Chinese.
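The sketch below shows one plausible form of this parallel architecture, in the spirit of DeepLabV3+'s ASPP module: 3x3 dilated convolutions at several sampling rates run in parallel and are fused by a 1x1 convolution, enlarging the receptive field without extra downsampling. Channel counts and dilation rates are illustrative assumptions.

```python
# Parallel dilated (atrous) convolutions fused by a 1x1 convolution.
import torch
import torch.nn as nn

class ParallelDilatedBlock(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding=rate keeps the spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # 1x1 fusion

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Usage: a stride-16 feature map of 256 channels.
feat = torch.randn(1, 256, 32, 32)
print(ParallelDilatedBlock()(feat).shape)  # torch.Size([1, 256, 32, 32])
```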

According to the above description, as an example, in a specific embodiment, the method includes:

A target image is input, and features of the target image are acquired by downscaling the image by factors of 4, 8 and 16 (i.e., at ratios of 1:4, 1:8 and 1:16). The features include global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image (or interactions between the hierarchical contexts of the target image), which are used for feature fusion, as sketched below. The text box annotations of the target image are converted into pixel-level labels, where the annotation content comprises one of the following: text line attributes, corner points, and offsets.
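As a hedged sketch of this multi-scale feature fusion, the snippet below upsamples feature maps taken at 1:8 and 1:16 of the input resolution to the 1:4 scale and concatenates them; the backbone producing the maps is abstracted away, and the channel counts are illustrative assumptions.

```python
# Fuse feature maps at strides 4, 8 and 16 by upsampling and concatenation.
import torch
import torch.nn.functional as F

def fuse_scales(f4, f8, f16):
    """f4/f8/f16: feature maps at strides 4, 8 and 16 of the target image."""
    size = f4.shape[-2:]
    up8 = F.interpolate(f8, size=size, mode="bilinear", align_corners=False)
    up16 = F.interpolate(f16, size=size, mode="bilinear", align_corners=False)
    return torch.cat([f4, up8, up16], dim=1)   # fused features at stride 4

# Usage with dummy maps for a 256x256 input.
f4, f8, f16 = (torch.randn(1, c, s, s) for c, s in ((64, 64), (128, 32), (256, 16)))
print(fuse_scales(f4, f8, f16).shape)  # torch.Size([1, 448, 64, 64])
```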

Semantic segmentation is then performed on the feature-fused target image through the one or more layout analysis deep learning network models; the semantic segmentation includes regressing the text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image. A semantic segmentation result is obtained, namely: corner positions are obtained by regressing the text box of the target image; boundary positions of the text box are obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box are obtained by predicting pixel points in corner regions of the target image.

All predicted pixel points are acquired and classified to obtain the attributes of the text box, and layout analysis is performed on the target image according to the attributes of the text box.

In conclusion, the method first preprocesses the target image and enhances robustness or generalization ability; it then acquires the features of the target image, performs semantic segmentation on those features based on the trained layout analysis deep learning network models, and obtains the precise boundary positions of the text boxes in the target image; finally, all predicted pixel points are classified to obtain the attributes of the text boxes, and layout analysis is performed on the image according to those attributes.

Concretely, a convolutional neural network (CNN) is used to obtain image features and extract low-level features; a multi-scale fully convolutional network based on the DeepLabV3+ framework, with various structures including but not limited to ResNet50 and MobileNetV2, extracts features and performs pixel-level semantic segmentation to finely locate text/background labels. A parallel architecture of cascaded dilated convolutions and dilated convolutions with different sampling rates then enlarges the receptive field of the fully convolutional network while keeping the number of parameters. By fusing upsampling and convolution between different layers of the fully convolutional network, global and local features are balanced, more global context information of the target image is obtained, and the influence of noise is effectively reduced. Meanwhile, the text box annotations are converted into pixel-level labels, including text line attributes, corner points, offsets, and so on. A data augmentation algorithm increases the generalization ability or robustness of the layout analysis deep learning network models to background, rotation, perspective distortion, noise, Gaussian blur, motion blur, and the like. When a layout analysis deep learning network model is trained, based on the overall deep learning scheme, stochastic gradient descent with momentum is used for training together with L2 regularization and learning rate decay, yielding the required text localization and classification and thus text structuring; this includes correcting the orientation by detecting the target region. The AdvancedEAST method is used to regress the text box, and the offsets of pixel points relative to the text box are predicted to obtain the corresponding four-point positions. On this basis, prediction of the text box corner regions is added, and predicting pixel points in the corner regions improves the accuracy of text box boundary regression; different convolution kernels are used to regress larger and smaller offsets. For predictions of different attributes, losses are calculated with different methods, including smooth L1 loss, cross-entropy loss and the like, and the total loss is obtained by weighting the multiple losses. Online Hard Example Mining is used to balance positive and negative samples, improving the recall rate of the overall result. For the predicted text pixel points, depth-first search is then used to obtain text-line connected blocks, and four-point regression of the text box is performed in the corner regions.

To address vertical adhesion or overlap of text lines, corner matching combining a plurality of weights, including distance, aspect ratio, angle and the like, is added. Text boxes with a high overlap ratio are merged, reducing noise interference. Finally, layout analysis is realized through all predicted pixel points in the text boxes and their corresponding classification attributes.

As shown in fig. 2, the present invention also provides an OCR-based image analysis system, including:

a segmentation module M10, configured to perform semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and an analysis module M20, configured to analyze the layout of the target image based on the semantic segmentation result.

According to the above scheme, the system can quickly and accurately detect text in the target image under interference such as occlusion, imaging angle, rotation and illumination; and for non-standard image layouts or formats, the system can also extract field information in a structured manner.

In an exemplary embodiment, before acquiring the features of the target image, the system further performs preprocessing on the target image, where the preprocessing includes at least one of: correcting the position of text in the target image through a text position correction algorithm, and converting the text box annotations of the target image into pixel-level labels. As an example, for region detection of a bill or document, the target image may be preprocessed by algorithms such as text position correction. In the embodiment of the application, a text box annotation of the target image is converted into a pixel-level label, where the annotation content comprises one of the following: text line attributes, corner points, and offsets.

In an exemplary embodiment, based on an overall deep learning scheme, training is performed using stochastic gradient descent with momentum to obtain one or more layout analysis deep learning network models. As an example, in the overall deep learning scheme, training uses stochastic gradient descent (SGD) with momentum, together with L2 regularization and learning rate decay, to train one or more layout analysis deep learning network models. The required field text is then classified through classification and regression methods to obtain the coordinate values of the text responses, thereby realizing structured text recognition.

The pre-acquired features of the target image and the correlation attributes between hierarchical contexts of the target image are then input into the one or more trained layout analysis deep learning network models to perform semantic segmentation on the target image.

In the embodiment of the application, when the one or more layout analysis deep learning network models are trained, the losses of the models are calculated using different methods for the predictions of different attributes, and all the losses are weighted to obtain the total loss of the models. For example, the smooth L1 loss (Smooth L1 Loss) and the cross-entropy loss (Cross Entropy Loss) of the model are calculated and weighted to obtain the total loss of the model. As an example, in the embodiment of the application, Online Hard Example Mining (OHEM) is also used during training to balance positive and negative samples, thereby improving the recall rate (Recall) of the overall result.

In the embodiment of the application, after the one or more layout analysis deep learning network models are obtained, the models may be fine-tuned, quantized and pruned according to subsequent recognition requirements.

As can be seen from the above exemplary embodiments, the semantic segmentation performed on the target image includes at least one of: regressing the text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image. The obtained semantic segmentation result includes at least one of: corner positions obtained by regressing the text box of the target image; boundary positions of the text box obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box obtained by predicting pixel points in corner regions of the target image.

In the embodiment of the application, the text box is further regressed using the AdvancedEAST method (a variant of EAST, an Efficient and Accurate Scene Text detector), and the corresponding four-point positions are obtained by predicting the offsets of pixel points relative to the text box, thereby realizing high-precision text box boundary regression. For example, only the pixel points in the corner regions are used for prediction: prediction of the text box corner regions is added so that the text box boundary is obtained accurately. Different convolution kernels are used to regress larger and smaller offsets.

In an exemplary embodiment, after the text box boundary is obtained accurately, a depth-first search (DFS) is used on the predicted text pixel points to obtain text-line connected blocks, and four-point regression of the text box is performed in the corner regions. The text lines obtained in this way may suffer from vertical adhesion or overlap, so this embodiment may also perform corner matching with a plurality of weights; for example, the weights may include distance, aspect ratio and angle. Meanwhile, text boxes with a high overlap ratio are merged, reducing noise interference.

According to the description of the exemplary embodiments, the target image is preprocessed, and semantic segmentation is then performed based on the features of the target image to obtain predicted pixel points;

all predicted pixel points are classified to obtain attributes of the text box of the target image;

and the layout of the target image is analyzed based on the attributes of the text box.

In an exemplary embodiment, layout analysis may fail to accurately determine the text box of a target because of inherent variations in the target, for example when the designated target has complex details or is occluded, or because of changes in external conditions, such as imaging angle, in-plane rotation, out-of-plane rotation, illumination changes and acquisition path. Therefore, when the one or more layout analysis deep learning network models are trained, one or more interference parameters are added to improve the robustness or generalization ability of the models. As an example, the interference parameters include at least one of: background, rotation, perspective, distortion, noise, Gaussian blur and motion blur. With the increased robustness or generalization ability, the system is robust to image problems in practical scenes such as illumination, shadow, wrinkles, stamps, background interference, printing misalignment and low image quality.

In some exemplary embodiments, the features of the target image include at least one of: global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image. As an example, the embodiment of the application may acquire the features of the target image through a convolutional neural network or a fully convolutional network, including global features and local features of the target image. Specifically, a convolutional neural network (CNN) is used to obtain target image features and extract low-level features; then, on the basis of the DeepLabV3+ framework (a semantic segmentation algorithm proposed by Google in 2018), fully convolutional networks such as ResNet50 (a 50-layer deep residual network) and MobileNetV2 (a lightweight network structure proposed by Google in 2018) are used to extract features, so as to obtain the global features and local features of the target image. In the embodiment of the application, fusing upsampling and convolution between different layers of the fully convolutional network not only balances global and local features, but also obtains global context information of the target image, effectively reducing the influence of noise and achieving coarse-to-fine pixel-level semantic segmentation.

The system further enlarges the receptive field of the fully convolutional network, while keeping the number of parameters, by using a parallel architecture formed jointly by cascaded dilated (atrous) convolutions and dilated convolutions with different sampling rates. By enlarging the receptive field of the fully convolutional network, the system can meet the demands that layout analysis places on the receptive field in the OCR field, namely high accuracy of left and right boundary regression and large text length spans, for languages such as Chinese.

According to the above description, as an example, in a specific embodiment, the method includes:

A target image is input, and features of the target image are acquired by downscaling the image by factors of 4, 8 and 16 (i.e., at ratios of 1:4, 1:8 and 1:16). The features include global features of the target image, local features of the target image, and correlations between hierarchical contexts of the target image (or interactions between the hierarchical contexts of the target image), which are used for feature fusion. The text box annotations of the target image are converted into pixel-level labels, where the annotation content comprises one of the following: text line attributes, corner points, and offsets.

Semantic segmentation is then performed on the feature-fused target image through the one or more layout analysis deep learning network models; the semantic segmentation includes regressing the text box of the target image, regressing anchor detection boxes of the target image, predicting pixel points in the target image, and predicting pixel points in corner regions of the target image. A semantic segmentation result is obtained, namely: corner positions are obtained by regressing the text box of the target image; boundary positions of the text box are obtained by regressing the anchor detection boxes of the target image and predicting pixel points in the target image; and precise boundary positions of the text box are obtained by predicting pixel points in corner regions of the target image.

All predicted pixel points are acquired and classified to obtain the attributes of the text box, and layout analysis is performed on the target image according to the attributes of the text box.

In summary, the system first preprocesses the target image and enhances robustness or generalization ability; it then acquires the features of the target image, performs semantic segmentation on those features based on the trained layout analysis deep learning network models, and obtains the precise boundary positions of the text boxes in the target image; finally, all predicted pixel points are classified to obtain the attributes of the text boxes, and layout analysis is performed on the image according to those attributes.

Concretely, a convolutional neural network (CNN) is used to obtain image features and extract low-level features; a multi-scale fully convolutional network based on the DeepLabV3+ framework, with various structures including but not limited to ResNet50 and MobileNetV2, extracts features and performs pixel-level semantic segmentation to finely locate text/background labels. A parallel architecture of cascaded dilated convolutions and dilated convolutions with different sampling rates then enlarges the receptive field of the fully convolutional network while keeping the number of parameters. By fusing upsampling and convolution between different layers of the fully convolutional network, global and local features are balanced, more global context information of the target image is obtained, and the influence of noise is effectively reduced. Meanwhile, the text box annotations are converted into pixel-level labels, including text line attributes, corner points, offsets, and so on. A data augmentation algorithm increases the generalization ability or robustness of the layout analysis deep learning network models to background, rotation, perspective distortion, noise, Gaussian blur, motion blur, and the like. When a layout analysis deep learning network model is trained, based on the overall deep learning scheme, stochastic gradient descent with momentum is used for training together with L2 regularization and learning rate decay, yielding the required text localization and classification and thus text structuring; this includes correcting the orientation by detecting the target region. The AdvancedEAST method is used to regress the text box, and the offsets of pixel points relative to the text box are predicted to obtain the corresponding four-point positions. On this basis, prediction of the text box corner regions is added, and predicting pixel points in the corner regions improves the accuracy of text box boundary regression; different convolution kernels are used to regress larger and smaller offsets. For predictions of different attributes, losses are calculated with different methods, including smooth L1 loss, cross-entropy loss and the like, and the total loss is obtained by weighting the multiple losses. Online Hard Example Mining is used to balance positive and negative samples, improving the recall rate of the overall result. For the predicted text pixel points, depth-first search is then used to obtain text-line connected blocks, and four-point regression of the text box is performed in the corner regions.

To address vertical adhesion or overlap of text lines, corner matching combining a plurality of weights, including distance, aspect ratio, angle and the like, is added. Text boxes with a high overlap ratio are merged, reducing noise interference. Finally, layout analysis is realized through all predicted pixel points in the text boxes and their corresponding classification attributes.

The embodiment of the present application further provides an image analysis device based on OCR, including:

performing semantic segmentation on a target image according to pre-acquired features of the target image to obtain a semantic segmentation result;

and analyzing the layout of the target image based on the semantic segmentation result.

In this embodiment, the OCR-based image analysis device executes the above system or method; for specific functions and technical effects, refer to the above embodiments, which are not repeated here.

An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may serve as a terminal device or as a server; examples of the terminal device may include: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.

Embodiments of the present application also provide a non-transitory readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may execute instructions (instructions) included in the method in fig. 1 according to the embodiments of the present application.

Fig. 3 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.

Optionally, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.

Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.

In this embodiment, the processor of the terminal device includes functions for executing each module of the image analysis system described above; for specific functions and technical effects, refer to the above embodiments, which are not repeated here.

Fig. 4 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. Fig. 4 is a specific embodiment of fig. 3 in an implementation process. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.

The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.

The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication component 1203, power component 1204, multimedia component 1205, speech component 1206, input/output interfaces 1207, and/or sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.

The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the data processing method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.

The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.

The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the speech component 1206 further comprises a speaker for outputting speech signals.

The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.

The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.

The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.

As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 referred to in the embodiment of fig. 4 can be implemented as the input device in the embodiment of fig. 3.

The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.
