Natural scene text recognition method based on sequence transformation correction and attention mechanism

Document No.: 1338507    Publication date: 2020-07-17

Note: this technology, "Natural scene text recognition method based on sequence transformation correction and attention mechanism" (基于序列变换纠正及注意力机制的自然场景文本识别方法), was created by 林庆祥, 金连文, 罗灿杰, and 赖松轩 on 2020-03-27. Its main content: The invention discloses a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising data acquisition, data processing, label making, network training, and network testing. Network training includes constructing a recognition network and inputting training data and the processed labels into the pre-designed recognition network to complete its training. The recognition network comprises a sequence transformation corrector and an attention-based text recognizer. The sequence transformation corrector comprises several convolutional layers, nonlinear layers, and pooling layers, together with a decomposition layer and a localization network consisting of several fully connected layers. The attention-based text recognizer comprises a feature encoding network and an attention-based decoder. The method greatly reduces the recognition difficulty of the recognition model, achieves high recognition accuracy and strong robustness, and performs well on irregularly shaped text.

1. A natural scene text recognition method based on sequence transformation correction and an attention mechanism, characterized by comprising the following steps:

data acquisition: acquiring training set and test set samples;

data processing: scaling the pictures of the training set and the test set;

label making: labeling the training set pictures;

training the network: constructing a recognition network, inputting the training data and the processed labels into the pre-designed recognition network, and completing the training of the recognition network;

the recognition network comprises a sequence transformation corrector and an attention-based text recognizer; the sequence transformation corrector comprises several convolutional layers, nonlinear layers, and pooling layers; the sequence transformation corrector further comprises a decomposition layer and a localization network consisting of several fully connected layers; the attention-based text recognizer comprises a feature encoding network and an attention-based decoder;

testing the network: inputting the test data into the trained recognition network to obtain the recognition results for the text lines in the pictures.

2. The natural scene text recognition method based on sequence transformation correction and an attention mechanism as claimed in claim 1, wherein the sequence transformation corrector further comprises a scaling layer and a grid mapping module, and the method by which the sequence transformation corrector corrects a picture comprises:

obtaining a feature map of the picture to be corrected through the scaling layer, the convolutional layers, the nonlinear layers, and the pooling layers;

decomposing the feature map into N mutually disjoint image blocks along the horizontal direction through the decomposition layer, inputting the features of each image block into the localization network, and predicting the transformation parameters of each image block through the localization network;

inputting the transformation parameters of each image block into the grid mapping module to obtain a smooth sampling grid;

and sampling the original picture to be corrected by bilinear interpolation using the sampling grid to obtain the corrected picture.

3. The natural scene text recognition method based on sequence transformation correction and an attention mechanism as claimed in claim 1, wherein the convolutional layers can further perform edge padding, the specific method comprising: pasting a ring of pixels, each with pixel value 0, around the top, bottom, left, and right of the original picture or feature map.

4. The natural scene text recognition method based on sequence transformation correction and an attention mechanism as claimed in claim 1, wherein the feature encoding network uses a convolutional neural network and a long short-term memory network as basic units to convert the picture data into time-series features carrying contextual information.

5. The natural scene text recognition method based on sequence transformation correction and an attention mechanism as claimed in claim 4, wherein the attention-based decoder introduces a long short-term memory network (LSTM) to recognize each character in the picture step by step during decoding, the specific recognition method comprising:

the attention-based decoder calculates an attention weight matrix from the time-series features output by the feature encoding network and the hidden state of the LSTM at the previous time step;

normalizing the attention weight matrix to obtain its probability distribution;

performing a weighted sum over the time-series features produced by the feature encoding network according to the probability distribution of the attention weight matrix to obtain the attention feature at the current time step;

updating the hidden state of the LSTM according to the attention feature at the current time step combined with the character prediction probability distribution at the previous time step;

decoding through a fully connected layer, and feeding the decoding result into a softmax layer for probability normalization to obtain the probability distribution of the predicted character;

and selecting the character with the maximum confidence in the probability distribution as the current decoded output character, thereby completing the recognition of the characters in the picture.

6. The natural scene text recognition method based on sequence transformation correction and an attention mechanism as claimed in claim 5, wherein training the recognition network comprises:

computing the cross-entropy loss from the probability of each character of the training string output at its corresponding time step, and minimizing the cross-entropy loss with an adaptive gradient descent method.

7. The natural scene text recognition method based on sequence transformation correction and an attention mechanism as claimed in claim 1, wherein the weight parameters in the recognition network are initialized by a random Gaussian initialization method.

Technical Field

The invention relates to the technical field of pattern recognition and artificial intelligence, in particular to a natural scene text recognition method based on sequence transformation correction and attention mechanism.

Background

Characters have been widely used as information carriers from ancient times to the present. Through characters, people can understand and process visually obtained information more accurately, and communication between people is promoted. With the rapid development of computer technology, artificial intelligence is gradually changing people's lives; people hope to understand and process images efficiently through computers, and text information is important for image understanding. Therefore, natural scene text recognition has long been a significant and closely followed research direction.

Compared with document image recognition, characters in a natural scene often exhibit various deformations such as rotation, perspective distortion, and bending; natural scene text deformations are complex, varied, and irregular, and are difficult to model with a single mathematical transform, which poses a great challenge for natural scene text recognition systems.

Therefore, a text recognition method capable of effectively improving the recognition accuracy rate for the text data set of the irregular natural scene is needed.

Disclosure of Invention

The invention aims to provide a natural scene text recognition method based on sequence transformation correction and attention mechanism, which solves the problems in the prior art and can effectively improve the recognition accuracy of natural scene texts.

To achieve the above purpose, the invention provides the following scheme: a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the following steps:

data acquisition: acquiring training set and test set samples;

data processing: scaling the pictures of the training set and the test set;

label making: labeling the training set pictures;

training the network: constructing a recognition network, inputting the training data and the processed labels into the pre-designed recognition network, and completing the training of the recognition network;

the recognition network comprises a sequence transformation corrector and an attention-based text recognizer; the sequence transformation corrector comprises several convolutional layers, nonlinear layers, and pooling layers; the sequence transformation corrector further comprises a decomposition layer and a localization network consisting of several fully connected layers; the attention-based text recognizer comprises a feature encoding network and an attention-based decoder;

testing the network: inputting the test data into the trained recognition network to obtain the recognition results for the text lines in the pictures.

Preferably, the sequence transformation corrector further comprises a scaling layer and a grid mapping module, and the method by which the sequence transformation corrector corrects a picture comprises:

obtaining a feature map of the picture to be corrected through the scaling layer, the convolutional layers, the nonlinear layers, and the pooling layers;

decomposing the feature map into N mutually disjoint image blocks along the horizontal direction through the decomposition layer, inputting the features of each image block into the localization network, and predicting the transformation parameters of each image block through the localization network;

inputting the transformation parameters of each image block into the grid mapping module to obtain a smooth sampling grid;

and sampling the original picture to be corrected by bilinear interpolation using the sampling grid to obtain the corrected picture.

Preferably, the convolutional layers can also perform edge padding, the specific method comprising: pasting a ring of pixels, each with pixel value 0, around the top, bottom, left, and right of the original picture or feature map.

Preferably, the feature encoding network takes a convolutional neural network and a long short-term memory network as basic units and converts the picture data into time-series features carrying contextual information.

Preferably, the attention-based decoder introduces a long short-term memory network (LSTM) to recognize each character in a picture step by step during decoding, the specific recognition method comprising:

the attention-based decoder calculates an attention weight matrix from the time-series features output by the feature encoding network and the hidden state of the LSTM at the previous time step;

normalizing the attention weight matrix to obtain its probability distribution;

performing a weighted sum over the time-series features produced by the feature encoding network according to the probability distribution of the attention weight matrix to obtain the attention feature at the current time step;

updating the hidden state of the LSTM according to the attention feature at the current time step combined with the character prediction probability distribution at the previous time step;

decoding through a fully connected layer, and feeding the decoding result into a softmax layer for probability normalization to obtain the probability distribution of the predicted character;

and selecting the character with the maximum confidence in the probability distribution as the current decoded output character, thereby completing the recognition of the characters in the picture.

Preferably, training the recognition network comprises:

computing the cross-entropy loss from the probability of each character of the training string output at its corresponding time step, and minimizing the cross-entropy loss with an adaptive gradient descent method.

Preferably, the weight parameters in the recognition network are initialized by a random Gaussian initialization method.

The invention discloses the following technical effects:

(1) Thanks to the automatic learning ability of the deep network structure, effective representations can be learned directly from data, improving recognition accuracy.

(2) The invention adopts an end-to-end network design. Through weakly supervised training, the physically meaningful gradient information back-propagated from the recognition model effectively guides the correction network, which greatly reduces the recognition difficulty of the recognition model in practical applications and improves recognition accuracy.

(3) The method introduces the idea of decomposition into the design of the corrector: the irregular text picture is decomposed so that each resulting image block carries only a small deformation, greatly reducing the difficulty of correcting irregular text. A grid mapping module is designed in the correction network to keep the whole correction process smooth, making the correction transformation flexible and efficient. The method therefore corrects irregular text well, with high recognition accuracy, strong robustness, and good recognition performance on irregular text.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

FIG. 1 is a flow chart of a text recognition method of the present invention;

FIG. 2 is a schematic diagram of the overall structure of the text recognition method of the present invention;

FIG. 3 is a block diagram of a sequence transformation corrector network of the present invention;

FIG. 4 is a diagram illustrating a verification result according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Referring to fig. 1 to 3, the present embodiment provides a natural scene text recognition method based on sequence transformation correction and attention mechanism, including the following steps:

s1, data acquisition: acquiring training set and test set samples;

Natural scene text line pictures are synthesized as the training set using open-source code and a text corpus collected from the Internet; published natural scene text recognition datasets are used as the test set; each picture is saved in its corresponding folder.

S2, data processing:

firstly, the pictures of the training set and the test set are scaled so that each processed picture has size 64 × 192;

secondly, all training and test pictures are packed into LMDB (Lightning Memory-Mapped Database) files to speed up picture reading.

S3, label preparation:

the invention trains the recognition network with a supervised method, so the training set pictures are labeled; each label is the text content corresponding to its text line picture.

S4, training the network: a recognition network is constructed, comprising a sequence transformation corrector and an attention-based text recognizer; the training data and the processed labels are input into the pre-designed recognition network to complete its training. This specifically comprises the following steps:

s4.1, building the sequence transformation corrector; the network structure and parameter settings of the sequence transformation corrector in this embodiment are shown in Table 1;

TABLE 1

The specific edge-padding method for the convolutional layers in Table 1 is to paste a ring of pixels, each with pixel value 0, around the top, bottom, left, and right of the original picture or feature map; the nonlinear layers use the ReLU activation function, and the pooling layers use max pooling.
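As an illustrative sketch only (the patent gives no code), the zero padding described above can be written with NumPy's `np.pad`; the 4 × 12 feature-map size is taken from this embodiment's corrector:

```python
import numpy as np

def zero_pad(feature_map: np.ndarray, width: int = 1) -> np.ndarray:
    """Paste `width` rings of zero-valued pixels around a 2-D feature map."""
    return np.pad(feature_map, pad_width=width, mode="constant", constant_values=0)

fm = np.ones((4, 12))     # e.g. the 4 x 12 corrector feature map
padded = zero_pad(fm)     # shape becomes (6, 14), with a zero border
```

In a convolutional layer, this padding keeps the spatial size of the output equal to that of the input for a 3 × 3 kernel.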

By shrinking the picture, the scaling layer of the sequence transformation corrector effectively enlarges the network's receptive field, reduces the amount of computation, avoids feeding in large amounts of noise, and improves the robustness of the module.

The method by which the sequence transformation corrector corrects the picture is as follows:

First, the picture is input into the sequence transformation corrector and passed through the scaling, convolutional, nonlinear, and pooling layers in Table 1 to obtain a feature map of size 4 × 12.

Secondly, the feature map is decomposed into N mutually disjoint image blocks along the horizontal direction by the decomposition layer; the features of each image block are input into a localization network composed of two fully connected layers, which predicts the transformation parameters of each image block, as shown in formula (1):

where θ represents the parameters of the neural network, patch_i represents the i-th image block, i ∈ [1, N], and T(patch_i | θ) denotes inputting the features of the i-th image block into the localization network to obtain its transformation parameters.
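A minimal NumPy sketch of the decomposition and localization steps; the hidden width (32) and the number of transformation parameters per block (4) are illustrative assumptions, not the patent's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def decompose(feature_map: np.ndarray, n_blocks: int) -> list:
    """Split an (H, W) feature map into N disjoint blocks along the width."""
    return np.split(feature_map, n_blocks, axis=1)

def localize(block: np.ndarray, w1, b1, w2, b2) -> np.ndarray:
    """Two fully connected layers predicting transformation parameters."""
    h = np.maximum(block.reshape(-1) @ w1 + b1, 0.0)  # FC + ReLU
    return h @ w2 + b2                                 # FC -> parameters

fm = rng.standard_normal((4, 12))       # the 4 x 12 corrector feature map
blocks = decompose(fm, n_blocks=4)      # four disjoint 4 x 3 blocks

# Hypothetical sizes: 32 hidden units, 4 transformation parameters per block.
w1, b1 = rng.standard_normal((12, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 4)), np.zeros(4)
params = [localize(b, w1, b1, w2, b2) for b in blocks]
```

Because each block covers only a short horizontal span of the text, its deformation is small and easier to parameterize than the whole line's.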

Thirdly, the transformation parameters of each image block are input into the grid mapping module to obtain a smooth sampling grid. The specific process is as follows:

suppose that the height and width of the input image block of the sequence transformation corrector are respectively Hi,WiAfter being corrected by the sequence conversion corrector, the height and the width of the output image block are respectively Ho,Wo

Calculating the coordinate position (x) on the sampling grido,yo) Belonging to the image block number, as shown in equation (2):

will sample the coordinate location (x) on the grido,yo) Mapping into a hidden grid, resulting in coordinates (x)h,yh) The mapping calculation process is shown as formula (3):

wherein n and m respectively represent the width and height of each block grid in the hidden grid;

smoothly hiding coordinates (x) in the grid by means of bilinear interpolationh,yh) Mapping to a coordinate position (x) in the input image block gridi,yi) The mapping calculation process is shown as equation (4):

in summary, the whole grid mapping process is represented as: (x)i,yi)=P(xo,yo)

Wherein, P represents a grid mapping function, and combines formula (3) and formula (4), and the grid mapping function P is shown in formula (5):

and finally, obtaining a corrected picture by bilinear interpolation sampling on the original input picture by using a sampling grid, wherein the sampling calculation process is as shown in a formula (6):

wherein the content of the first and second substances,indicating a position (x) in an output pictureo,yo) A pixel value of (I)(u,v)Representing the pixel value at position (u, v) in the input picture.

The whole transformation process above is differentiable, which ensures that the sequence transformation corrector can update and optimize its parameters via a gradient descent algorithm.
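The bilinear sampling of formula (6) can be sketched in NumPy as follows; the sampling grid is assumed to be given here, and the grid mapping function P itself is not reproduced:

```python
import numpy as np

def bilinear_sample(image: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Sample `image` (H, W) at real-valued positions `grid` (Ho, Wo, 2),
    where grid[..., 0] = x (column) and grid[..., 1] = y (row)."""
    h, w = image.shape
    x, y = grid[..., 0], grid[..., 1]
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    dx, dy = x - x0, y - y0
    # Weighted sum of the four neighbouring pixels (formula (6)).
    return (image[y0, x0] * (1 - dx) * (1 - dy)
            + image[y0, x0 + 1] * dx * (1 - dy)
            + image[y0 + 1, x0] * (1 - dx) * dy
            + image[y0 + 1, x0 + 1] * dx * dy)

img = np.arange(12.0).reshape(3, 4)
# Identity grid: sampling at each integer position returns the image itself.
ys, xs = np.meshgrid(np.arange(3.0), np.arange(4.0), indexing="ij")
identity = np.stack([xs, ys], axis=-1)
out = bilinear_sample(img, identity)
```

Because each output pixel is a smooth (piecewise-linear) function of the grid coordinates, gradients can flow back through the sampler to the localization network, which is what makes weakly supervised training of the corrector possible.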

S4.2, establishing a text recognizer based on an attention mechanism

Firstly, a feature encoding network with a convolutional neural network and a long short-term memory network as basic units is constructed to convert the picture data into time-series features carrying contextual information.

The structure of the convolutional neural network is: Input (32 × 100) → 64C3 → MP22 → 128C3 → MP22 → 256C3 → 256C3 → MP21 → 512C3 → MP21 → 512C2, where in pCq, p represents the number of output channels, q the convolution kernel size, and C a convolutional layer; for example, 64C3 is a convolutional layer with kernel size 3 and 64 output channels. In MPef, e represents the kernel width and height and f the stride of a max pooling layer, and MP denotes max pooling; for example, MP22 is a max pooling layer with kernel size 2 and stride 2.

After the input picture passes through the convolutional neural network, a feature map with height 1 is obtained; this feature is input into a BLSTM network consisting of two Bidirectional Long Short-Term Memory (BLSTM) layers to extract time-series features with contextual associations.
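A toy NumPy sketch of how the height-1 convolutional features become a sequence and pass through a bidirectional recurrence; a simple tanh RNN stands in for each LSTM direction, and the hidden size (32) and sequence length (25) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def simple_rnn(seq, wx, wh):
    """Toy tanh RNN as a stand-in for one LSTM direction (a simplification)."""
    h = np.zeros(wh.shape[0])
    out = []
    for x in seq:
        h = np.tanh(x @ wx + h @ wh)
        out.append(h)
    return np.stack(out)

def blstm_like(seq, wx_f, wh_f, wx_b, wh_b):
    """Bidirectional pass: run forwards and backwards, then concatenate."""
    fwd = simple_rnn(seq, wx_f, wh_f)
    bwd = simple_rnn(seq[::-1], wx_b, wh_b)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

# CNN output with height 1: (channels=512, height=1, width=25) -> 25 vectors.
conv_out = rng.standard_normal((512, 1, 25))
seq = conv_out[:, 0, :].T                     # (25, 512) time-series features

d = 32  # hypothetical hidden size
wx_f, wh_f = rng.standard_normal((512, d)) * 0.01, rng.standard_normal((d, d)) * 0.01
wx_b, wh_b = rng.standard_normal((512, d)) * 0.01, rng.standard_normal((d, d)) * 0.01
H = blstm_like(seq, wx_f, wh_f, wx_b, wh_b)   # (25, 64) context features
```

Each output vector thus carries information from both the left and right context of its column, which the attention decoder exploits.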

Secondly, let the time-series features obtained by the feature encoding network be H = [h_1, h_2, …, h_L]. The attention-based decoder introduces a long short-term memory network (LSTM) in the decoding process to recognize each character step by step. The specific flow is as follows:

At time step t, the attention-based decoder calculates an attention weight matrix e_t from the time-series features H obtained by the feature encoding network and the hidden state s_{t-1} of the LSTM at the previous time step, as shown in formula (7):

e_{t,j} = w^T Tanh(W_s s_{t-1} + W_h h_j + b)…………………(7)

where w, W_s, W_h, and b each represent trainable parameters, Tanh represents the activation function, and j represents the index of the time series, j ∈ [1, L];

The attention weight matrix e_t is normalized to obtain the probability distribution a_t of the attention weights, as shown in formula (8):

a_{t,j} = exp(e_{t,j}) / Σ_{i=1}^{L} exp(e_{t,i})…………………(8)

According to the probability distribution of the attention weight matrix, a weighted sum over the time-series features obtained by the feature encoding network gives the attention feature g_t at the current time step, as shown in formula (9):

g_t = Σ_{j=1}^{L} a_{t,j} h_j…………………(9)

The hidden state of the LSTM is updated according to the attention feature at the current time step combined with the character prediction probability distribution y_{t-1} at the previous time step, as shown in formula (10):

s_t = LSTM(y_{t-1}, g_t, s_{t-1})………………………(10)

Decoding is performed through a fully connected layer, and the decoding result is fed into a softmax normalization layer for probability normalization, obtaining the probability distribution y_t of the predicted character, as shown in formula (11):

y_t = Softmax(U s_t + d)………………………(11)

where U and d each represent trainable parameters;

The character corresponding to the value with the maximum confidence in y_t is selected as the current decoded output character.

S4.3, training parameter setting:

The training data is fed into the network for training; the network traverses the training data set 10 times, with the batch size set to 64, the initial learning rate of the attention-based text recognizer set to 1, and the initial learning rate of the sequence transformation corrector set to 0.1; the learning rate of the whole network is then divided by 10 at the 6th and 8th traversals of the data set.

The optimization algorithm uses an adaptive gradient descent method, and the loss function L is shown in formula (12):

L = −Σ_{b=1}^{B} Σ_{a=1}^{T_b} log p^b_a(c^b_a)…………………(12)

where B represents the amount of data used in one batch optimization, p^b_a(c^b_a) denotes the probability of outputting the character c^b_a from the b-th sample picture at the a-th time step, and T_b denotes the length of the string label of the b-th sample.
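A NumPy sketch of the cross-entropy loss of formula (12), summing the negative log probabilities of the ground-truth characters over time steps and the batch; the class count and label lengths here are illustrative:

```python
import numpy as np

def sequence_cross_entropy(probs_per_sample, labels_per_sample):
    """Formula (12): negative log probability of each ground-truth character
    at its time step, summed over time steps and over the batch."""
    loss = 0.0
    for probs, label in zip(probs_per_sample, labels_per_sample):
        # probs: (T_b, num_classes) softmax outputs; label: T_b class indices
        loss -= np.log(probs[np.arange(len(label)), label]).sum()
    return loss

# Two hypothetical samples with different label lengths, 4 classes.
p1 = np.full((3, 4), 0.25)   # uniform predictions, T_1 = 3
p2 = np.full((2, 4), 0.25)   # T_2 = 2
loss = sequence_cross_entropy([p1, p2], [[0, 1, 2], [3, 0]])
```

With uniform predictions over 4 classes the loss is 5 · log 4, i.e. log 4 per supervised character, which is the expected starting point before training.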

S4.4, initializing the weights of the recognition network: at the start of training, all weight parameters in the network are initialized by a random Gaussian initialization method.

S4.5, training the recognition network: the cross-entropy loss is computed from the probability of each character of the training string output at its corresponding time step, and minimized with an adaptive gradient descent method, i.e., the loss function is minimized. The training of the sequence transformation corrector is guided by the attention-based text recognizer, realizing weak supervision of the recognition network's training process and effectively improving recognition accuracy on irregular natural scene text data.

S5, testing the network: the test data is input into the trained network to finally obtain the recognition results for the text lines in the pictures. This specifically comprises the following steps:

s5.1, input the test set samples, use a greedy algorithm to select the character with the highest confidence at each step as the predicted character, and concatenate the characters to obtain the final predicted text line;

s5.2, after recognition, compare the recognized text lines against the ground-truth labels and compute the line recognition accuracy and the edit distance.
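A pure-Python sketch of the greedy decoding of S5.1 and the edit-distance evaluation of S5.2; the character set and end-of-sequence handling are illustrative assumptions:

```python
import numpy as np

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"  # hypothetical label set
EOS = len(CHARSET)                                 # end-of-sequence class

def greedy_decode(prob_steps):
    """Pick the most confident character at each decoding step, stop at EOS."""
    chars = []
    for probs in prob_steps:
        idx = int(np.argmax(probs))
        if idx == EOS:
            break
        chars.append(CHARSET[idx])
    return "".join(chars)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Demo: one-hot decoder outputs spelling "cat", followed by EOS.
steps = [np.eye(EOS + 1)[CHARSET.index(c)] for c in "cat"] + [np.eye(EOS + 1)[EOS]]
pred = greedy_decode(steps)
```

Line accuracy counts a prediction as correct only when the whole string matches its label, while edit distance gives partial credit for near misses.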

To further verify the effectiveness and robustness of the text recognition method of the present invention, a 64 × 192 picture was selected in this embodiment; the correction and recognition results are shown in fig. 4. As fig. 4 shows, the text of the input picture is regularly arranged after being processed by the corrector, so the recognizer can accurately recognize the text in the picture.

The natural scene text recognition method based on sequence transformation correction and an attention mechanism reduces the recognition difficulty of the subsequent recognizer by correcting irregular text. Training is weakly supervised: the training of the correction network is guided by the recognition model, and no position coordinate labels are used in the training process.

Meanwhile, the invention introduces the idea of decomposition into the design of the correction network: the irregular text picture is decomposed so that each resulting image block carries only a small deformation, greatly reducing the difficulty of correcting irregular text. The invention also designs a grid mapping module in the correction network to keep the whole correction process smooth. By using an attention-based recognition algorithm in the recognition network design, the accuracy of natural scene text recognition can be effectively improved, with a particularly notable improvement on irregular natural scene text data sets.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.

The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
