Detection method of fragment and link scene characters based on convolutional neural network

Document No.: 1505623 · Publication date: 2020-02-07

Note: This technology, "Detection method of fragment and link scene characters based on convolutional neural network", was designed and created by 严灵毓, 夏慧玲, 王春枝, 董新华, 叶志伟 and 李敏 on 2019-10-18. Its main content is as follows: The invention discloses a method for detecting scene text based on segments and links of a convolutional neural network. The segment link model constructed in the invention comprises several convolution feature layers and convolution predictors connected in sequence; because the model detects segments and links densely at multiple scales simultaneously in a single forward pass, detection efficiency is very high. Links are divided into within-layer links and cross-layer links: a within-layer link connects segments detected in the same feature layer, while a cross-layer link connects segments on different layers. On this basis, within-layer and cross-layer links can combine segments at the same or different scales, which effectively avoids missed and duplicate detections.

1. A method for detecting scene text based on segments and links of a convolutional neural network, characterized by comprising the following steps:

Step 1: obtain a text picture I of size w_I × h_I;

Step 2: construct a segment link model, the segment link model comprising several convolution feature layers and convolution predictors connected in sequence, the convolution feature layers and convolution predictors being numbered and indexed by L; each position on the feature map corresponding to each convolution feature layer corresponds to a reference box; for convolution feature layer L, let the feature map size be w_L × h_L; the position (i, j) on the feature map then corresponds to a square reference box whose center point (x_a, y_a) is calculated by the following formulas:

x_a = (w_I / w_L) · (i + 0.5)

y_a = (h_I / h_L) · (j + 0.5)

In the above formulas, the width and height of the reference box are set to a constant a_L.

Step 3: input the picture I into the constructed segment link model to obtain the output of the convolution predictors, comprising the following substeps:

Step 3.1: divide the text in picture I into segments through the segment link model, and obtain the scores and set coordinates of the segments;

Step 3.2: detect within-layer links between adjacent segments;

Step 3.3: detect cross-layer links between cross-layer neighboring segments;

Step 4: combine the segments output by the convolution predictors into whole words using an undirected graph;

Step 5: output the detection result of the whole words.

2. The method of claim 1, characterized in that: the segment link model in step 2 comprises 6 convolution feature layers and convolution predictors, the convolution feature layers being conv4_3, conv7_3, conv8_2, conv9_2, conv10_2 and conv11; conv4_3 has 1024 convolution kernels of size 3 and convolution step size 1; conv7_3 has 256 convolution kernels of size 1 and convolution step size 1; conv8_2 has 128 convolution kernels of size 1 and convolution step size 1; conv9_2 has 128 convolution kernels of size 1 and convolution step size 1; conv10_2 has 128 convolution kernels of size 3 and convolution step size 2; the convolution kernel size of the convolution predictors is 3.

3. The method of claim 1, characterized in that: the score of a segment in step 3.1 is obtained directly from the convolution predictor, and the set coordinates of the segment are calculated from the coordinates of the reference box corresponding to the convolution feature layer and the offsets predicted by the convolution predictor; the specific implementation is as follows:

assume that the rectangular box of a segment is denoted s = (x_s, y_s, w_s, h_s, θ_s), where x_s, y_s denote the position of the center of the rectangular box, w_s, h_s denote its width and height, and θ_s denotes its rotation angle relative to the horizontal direction; the set coordinates of the segment are calculated by the following formulas:

x_s = a_L · Δx_s + x_a

y_s = a_L · Δy_s + y_a

w_s = a_L · exp(Δw_s)

h_s = a_L · exp(Δh_s)

θ_s = Δθ_s

wherein the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) and the reference box center coordinates (x_a, y_a) are obtained from the segment link model, and the width and height a_L of the reference box control the size of the output segment.

4. The method for detecting scene text based on segments and links of a convolutional neural network as defined in claim 3, characterized in that: in step 3.2 the adjacent segments are defined as the segments detected at the 8-neighborhood positions on the feature map; assuming that a segment is indexed by the sequence number L of the feature map and the position (i, j) on the feature map, i.e. the segment is denoted s(i,j,L), the adjacent segments are defined as

N^w(s(i,j,L)) = { s(i',j',L) : |i' − i| ≤ 1, |j' − j| ≤ 1, (i', j') ≠ (i, j) }

5. The method of claim 4, characterized in that: in step 3.3 the cross-layer neighboring segments are defined as segment pairs on two consecutive feature layers that satisfy a certain spatial constraint, mathematically expressed as:

N^c(s(i,j,L)) = { s(i',j',L−1) : 2i ≤ i' ≤ 2i + 1, 2j ≤ j' ≤ 2j + 1 }

Technical Field

The invention belongs to the field of digital image processing applications, and particularly relates to a method for detecting scene text based on segments and links of a convolutional neural network.

Background

Understanding images is a major goal of computer vision. Image understanding occurs at different levels: edge detection of objects, for example, is low-level image understanding, while semantic segmentation of objects is mid-level understanding. Understanding the text carried in an image is understanding of high-level semantic information (high-level semantics); such information is compatible with human symbolic systems and can be used directly for high-level semantic and logical analysis. Because text is ubiquitous and textual information is important, understanding text in pictures has long occupied an important position in computer vision. The technology of recognizing characters from images is generally referred to as Optical Character Recognition (OCR), one of the earliest applications of computer vision; limited by the state of technology and hardware of its time, traditional OCR was designed only for document text. Text in natural pictures is generally called scene text, and scene text detection and recognition are the continuation and upgrade of traditional OCR to natural pictures, with extremely wide applications.

The deep neural network is the core of deep learning and a flexible, powerful machine learning model. From a mathematical perspective, a deep neural network is a highly nonlinear, highly parameterized general-purpose fitting function that, after training on a large number of samples, can achieve good prediction accuracy and generalization performance. Deep neural networks are an effective means of solving the scene text recognition problem.

A method for detecting scene text must overcome the interference caused by the rich variation of natural images, and thus differs greatly from traditional document text analysis. Many algorithms have been proposed in the past for text in natural images or videos. Early methods were mostly based on analysis of texture and connected domains, detecting scene text by analyzing low-level visual features of the image. However, because scene text differs in shape from general objects, general object detection methods are not an ideal solution. First, general object detection methods are limited by their candidate-region extraction algorithms, and the aspect ratio of the output bounding boxes is often variable only within a small range. This makes them difficult to use for detecting non-Latin text such as Chinese and Japanese, where there are no spaces between words and the detection target is often an extremely long, thin bounding box with an extremely high aspect ratio. Second, general object detection methods can only output horizontal bounding boxes and cannot detect text in non-horizontal directions.

Disclosure of Invention

In order to solve the above problems existing in the prior art, the present invention provides a method for detecting scene text based on segments and links of a convolutional neural network.

The technical scheme adopted by the invention comprises the following steps:

Step 1: obtain a text picture I of size w_I × h_I;

Step 2: construct a segment link model, the segment link model comprising several convolution feature layers and convolution predictors connected in sequence, the convolution feature layers and convolution predictors being numbered and indexed by L; each position on the feature map corresponding to each convolution feature layer corresponds to a reference box; for convolution feature layer L, let the feature map size be w_L × h_L; the position (i, j) on the feature map then corresponds to a square reference box whose center point (x_a, y_a) is calculated by the following formulas:

x_a = (w_I / w_L) · (i + 0.5)

y_a = (h_I / h_L) · (j + 0.5)

In the above formulas, the width and height of the reference box are set to a constant a_L.

Step 3: input the picture I into the constructed segment link model to obtain the output of the convolution predictors, comprising the following substeps:

Step 3.1: divide the text in picture I into segments through the segment link model, and obtain the scores and set coordinates of the segments;

Step 3.2: detect within-layer links between adjacent segments;

Step 3.3: detect cross-layer links between cross-layer neighboring segments;

Step 4: combine the segments output by the convolution predictors into whole words using an undirected graph;

Step 5: output the detection result of the whole words.

Further, the segment link model in step 2 comprises 6 convolution feature layers and convolution predictors, the convolution feature layers being conv4_3, conv7_3, conv8_2, conv9_2, conv10_2 and conv11; conv4_3 has 1024 convolution kernels of size 3 and convolution step size 1; conv7_3 has 256 convolution kernels of size 1 and convolution step size 1; conv8_2 has 128 convolution kernels of size 1 and convolution step size 1; conv9_2 has 128 convolution kernels of size 1 and convolution step size 1; conv10_2 has 128 convolution kernels of size 3 and convolution step size 2; the convolution kernel size of the convolution predictors is 3.

Furthermore, the score of a segment in step 3.1 is obtained directly from the convolution predictor, and the set coordinates of the segment are calculated from the coordinates of the reference box corresponding to the convolution feature layer and the offsets predicted by the convolution predictor; the specific implementation is as follows:

assume that the rectangular box of a segment is denoted s = (x_s, y_s, w_s, h_s, θ_s), where x_s, y_s denote the position of the center of the rectangular box, w_s, h_s denote its width and height, and θ_s denotes its rotation angle relative to the horizontal direction; the set coordinates of the segment are calculated by the following formulas:

x_s = a_L · Δx_s + x_a

y_s = a_L · Δy_s + y_a

w_s = a_L · exp(Δw_s)

h_s = a_L · exp(Δh_s)

θ_s = Δθ_s

wherein the offsets (Δx_s, Δy_s, Δw_s, Δh_s, Δθ_s) and the reference box center coordinates (x_a, y_a) are obtained from the segment link model, and the width and height a_L of the reference box control the size of the output segment.

Further, in step 3.2 the adjacent segments are defined as the segments detected at the 8-neighborhood positions on the feature map. Assuming that a segment is indexed by the sequence number L of the feature map and the position (i, j) on the feature map, i.e. the segment is denoted s(i,j,L), the adjacent segments of s(i,j,L) are the segments on the same feature map belonging to the same connected domain, mathematically represented as:

N^w(s(i,j,L)) = { s(i',j',L) : |i' − i| ≤ 1, |j' − j| ≤ 1, (i', j') ≠ (i, j) }

Further, in step 3.3 the cross-layer neighboring segments are defined as segment pairs on two consecutive feature layers that satisfy a certain spatial constraint, mathematically expressed as:

N^c(s(i,j,L)) = { s(i',j',L−1) : 2i ≤ i' ≤ 2i + 1, 2j ≤ j' ≤ 2j + 1 }

the invention has the beneficial effects that: a method for detecting scene words based on segments and links of a convolutional neural network is provided. The method is based on the thought of segment linking, and divides a character line into two elements of a segment and a link, wherein the segment is a small segment of a word or the character line, and the link connects adjacent segments belonging to the same word or the character line. The segments and links are densely detected at multiple scales through a full convolutional network and combined according to geometric rules to obtain whole words.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

fig. 2 illustrates the detection process of an embodiment of the present invention: (a) is the source image to be detected of embodiment 1; (b) is the image of (a) after segment detection; (c) is the image of (b) after within-layer linking; (d) is the image of (b) after cross-layer linking; (e) is the image of (d) after segment linking; (f) is the image obtained by combining the segments of (e).

FIG. 3 is the network structure diagram of the segment link model implemented in the present invention, in which grey blocks represent convolutional layer outputs, grey arrows represent convolution predictors, and the configuration of each convolutional layer is given by the number of convolution kernels, the convolution kernel size (k), and the convolution step size (s).

Detailed Description

In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples. It is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.

Referring to fig. 1, the technical solution adopted by the invention is as follows: a method for detecting scene text based on segments and links of a convolutional neural network, comprising the following steps:

Step 1: input a text picture I of size w_I × h_I, where w_I and h_I denote the width and height of the picture respectively;

Step 2: construct the segment link model, whose network structure diagram is shown in FIG. 3. The model comprises 6 feature layers, namely conv4_3, conv7_3, conv8_2, conv9_2, conv10_2 and conv11; conv4_3 has 1024 convolution kernels of size 3 and convolution step size 1; conv7_3 has 256 convolution kernels of size 1 and convolution step size 1; conv8_2 has 128 convolution kernels of size 1 and convolution step size 1; conv9_2 has 128 convolution kernels of size 1 and convolution step size 1; conv10_2 has 128 convolution kernels of size 3 and convolution step size 2. These layers provide features of different scales and depths: conv4_3 is the shallowest and has the smallest scale, while conv11 is the deepest and has the largest scale. Each of the 6 feature layers is connected to a convolution predictor with a convolution kernel size of 3 × 3, which is used to predict the scores of segments and links; both the feature layers and the convolution predictors are numbered and indexed by L = 1, ..., 6.

The position on the feature map corresponding to each feature layer corresponds to a reference box, and the confidence and offsets of the reference box are estimated from the corresponding features. Both segments and whole words are represented by rectangular boxes with angles. Assume the rectangular box representing a whole word is b = (x_b, y_b, w_b, h_b, θ_b) and the rectangular box representing a segment is s = (x_s, y_s, w_s, h_s, θ_s), where x_b, y_b and x_s, y_s denote the positions of the centers of the rectangular boxes, w_b, h_b and w_s, h_s denote their widths and heights, and θ_b and θ_s denote the rotation angles of the rectangular boxes relative to the horizontal direction.

For feature layer L, let the feature map size be w_L × h_L; the position (i, j) on the feature map corresponds to a square reference box whose center point (x_a, y_a) is calculated by the following formulas:

x_a = (w_I / w_L) · (i + 0.5)

y_a = (h_I / h_L) · (j + 0.5)

In the above formulas, the width and height of the reference box are set to a constant a_L; reference boxes in the same layer share the same width and height, while different layers use different sizes.
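As an illustration only, the following is a small sketch of this center-point computation under the reconstructed formulas above; NumPy and the helper name reference_box_centers are assumptions for illustration, not part of the patent.

```python
import numpy as np

def reference_box_centers(w_I, h_I, w_L, h_L):
    """Map every position (i, j) of the w_L x h_L feature map of layer L
    back to the center (x_a, y_a) of its square reference box in the
    w_I x h_I input image."""
    i = np.arange(w_L)
    j = np.arange(h_L)
    x_a = w_I / w_L * (i + 0.5)   # x_a = (w_I / w_L)(i + 0.5)
    y_a = h_I / h_L * (j + 0.5)   # y_a = (h_I / h_L)(j + 0.5)
    return np.meshgrid(x_a, y_a, indexing="ij")

# For a 512 x 512 input and a 64 x 64 feature map the centers step by 8 px:
xs, ys = reference_box_centers(512, 512, 64, 64)
print(xs[0, 0], ys[0, 0])   # 4.0 4.0
```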

Step 3: input the picture I into the constructed segment link model to obtain the output of the convolution predictors;

Step 3.1: divide the text in picture I into segments through the segment link model, and obtain the scores and set coordinates of the segments;

the confidence and offset of the reference box are predicted by the convolution predictor at each position on the feature map, and a 7-channel score map is obtained. Performing softmax normalization on 2 channels to obtain segment scores in (0,1), wherein the higher the segment score is, the higher the detection accuracy of the segment is, and the other 5 channels correspond to a plurality of offsets (delta x)s,Δys,Δws,Δhs,Δqs) The set coordinates of the segments are calculated from the coordinates and offset of the reference box, and the specific formula is as follows:

x_s = a_L · Δx_s + x_a

y_s = a_L · Δy_s + y_a

w_s = a_L · exp(Δw_s)

h_s = a_L · exp(Δh_s)

θ_s = Δθ_s

it can be seen that the width and height a of the reference boxLThe size of the output segment is controlled. a isLShould be selected in consideration of the domain size of the L layer, and the invention selects aLHas a value of

Figure BDA0002239116480000053

γ=1.5。
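The decoding step transcribes the five formulas above directly. The following sketch uses hypothetical names (decode_segment and its argument layout are not from the patent):

```python
import math

def decode_segment(offsets, x_a, y_a, a_L):
    """Turn the 5 predicted offsets at one reference box into a rotated
    rectangle s = (x_s, y_s, w_s, h_s, theta_s)."""
    dx, dy, dw, dh, dtheta = offsets
    x_s = a_L * dx + x_a
    y_s = a_L * dy + y_a
    w_s = a_L * math.exp(dw)      # exp keeps width and height positive
    h_s = a_L * math.exp(dh)
    theta_s = dtheta              # the angle is predicted directly
    return x_s, y_s, w_s, h_s, theta_s

# a_L = 1.5 * w_I / w_L, e.g. 1.5 * 512 / 64 = 12 for a 64 x 64 feature map:
print(decode_segment((0.1, -0.2, 0.0, 0.0, 0.05), x_a=4.0, y_a=4.0, a_L=12.0))
```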

Step 3.2, realizing same-layer linkage of adjacent fragments of the detection fragments;

links are used to connect two adjacent segments, indicating whether the two segments belong to the same word, and links can both connect segments of the same word and help separate segments of different words. Such as: even if two words are close in distance, the link between their segments will be predicted negative, leaving two different words not grouped together.

Due to the locality of segments, segments that are adjacent on the feature map are also close in position on the picture, so links can be detected from local features. Like segments, links are predicted by the convolution predictor, which outputs a 16-channel prediction score map corresponding to the link relationships between the segment at each feature map position and its 8 neighbors. The prediction of each link consists of two channels; softmax normalization over each pair of channels yields a link score, and the higher the link score, the more reliable the link.
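Putting the channel counts together (7 segment channels from step 3.1, 16 within-layer link channels here, and, on all layers except the first, 8 cross-layer link channels anticipating step 3.3, two per cross-layer neighbor), each predictor is a single 3 × 3 convolution over its feature layer. A minimal sketch follows; PyTorch and all names are assumptions rather than the patent's implementation:

```python
import torch
import torch.nn as nn

class SegLinkPredictor(nn.Module):
    """Hypothetical per-layer 3x3 convolution predictor."""

    def __init__(self, in_channels, with_cross_layer=True):
        super().__init__()
        # 2 segment-score + 5 offset channels, 16 within-layer link channels,
        # plus 8 cross-layer link channels (2 per cross-layer neighbor) on
        # every layer except the first.
        out_channels = 2 + 5 + 16 + (8 if with_cross_layer else 0)
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, feature_map):
        return self.conv(feature_map)

# e.g. the predictor attached to conv8_2, whose feature map has 128 channels:
predictor = SegLinkPredictor(in_channels=128)
out = predictor(torch.randn(1, 128, 64, 64))
print(out.shape)   # torch.Size([1, 31, 64, 64])
```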

Neighboring segments are defined as the segments detected at the 8-neighborhood positions on the feature map. Since only one segment is detected at each feature map position, a segment can be indexed by the sequence number L of its feature map and the position (i, j) on the feature map, i.e. denoted s(i,j,L). The adjacent segments of a segment are defined as the segments on the same feature map belonging to the same connected domain, mathematically represented as:

N^w(s(i,j,L)) = { s(i',j',L) : |i' − i| ≤ 1, |j' − j| ≤ 1, (i', j') ≠ (i, j) }
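A short sketch of the 8-neighborhood enumeration implied by this definition; the helper name and the clipping to feature-map bounds are illustrative additions:

```python
def within_layer_neighbors(i, j, w_L, h_L):
    """Feature-map positions whose segments are within-layer link
    candidates for the segment at (i, j) on the same layer L."""
    return [(i2, j2)
            for i2 in (i - 1, i, i + 1)
            for j2 in (j - 1, j, j + 1)
            if (i2, j2) != (i, j) and 0 <= i2 < w_L and 0 <= j2 < h_L]

print(len(within_layer_neighbors(5, 5, 64, 64)))   # 8 in the interior
print(len(within_layer_neighbors(0, 0, 64, 64)))   # 3 at a corner
```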

Step 3.3: detect cross-layer links between cross-layer neighboring segments;

the cross-layer link is connected with segments in different feature layers, namely cross-layer neighbor, so that aiming at the cross-layer link, the cross-layer neighbor segments are defined as segment pairs which accord with a certain space constraint relation on two continuous feature layers, and the mathematical expression is as follows:

Figure BDA0002239116480000062
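Under the reconstructed constraint above, a position (i, j) on layer L covers a 2 × 2 block of positions on the finer layer L−1; the halving of feature-map sizes between consecutive layers is an assumption consistent with the architecture. A sketch:

```python
def cross_layer_neighbors(i, j, w_prev, h_prev):
    """Positions on the finer layer L-1 whose segments are cross-layer
    link candidates for the segment at (i, j) on layer L."""
    return [(i2, j2)
            for i2 in (2 * i, 2 * i + 1)
            for j2 in (2 * j, 2 * j + 1)
            if 0 <= i2 < w_prev and 0 <= j2 < h_prev]

print(cross_layer_neighbors(3, 4, 128, 128))
# [(6, 8), (6, 9), (7, 8), (7, 9)]
```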

and 4, step 4: combining the segments of the result of the convolution predictor to obtain a whole word;

combining the fragments, firstly establishing a linkage relation of the fragments without a directed graph, wherein each node represents one fragment, each edge represents one link, and the two fragments are connected; then, all connected groups, fragments belonging to the same connected group, are found on the undirected graph by using depth-first search, and a path formed by links exists between every two connected groups.

Step 5: output the detection result of the whole words.

The detection method of scene text based on segments and links of a convolutional neural network can be applied in fields related to scene text detection and processing.

The following example is provided by the inventors to further illustrate the technical solution of the present invention. According to the technical solution of the invention, the embodiment performs scene text detection; the source image to be detected is shown in fig. 2(a) and has a size of 512 × 512 pixels. The confidence thresholds for segments and links were set to 0.9 and 0.7 respectively. Table 1 lists and compares the performance of segment linking and other methods, where the results of some methods are taken from the leaderboard of the official competition website. The F-score, which measures overall detection performance, is the harmonic mean of precision and recall. Table 1 is as follows:

TABLE 1 Objective index comparison of segment linking results

(Table 1 is reproduced as an image in the original publication; its contents are not recoverable here.)

As can be seen from Table 1, the F-score of segment linking is well ahead of the other methods, exceeding the second-best method by more than 10%. Considering that some methods achieve precision close to or even higher than that of segment linking, the performance improvement of segment linking can be attributed mainly to improved recall.

It should be understood that parts of the specification not set forth in detail are well within the prior art.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
