Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Document No.: 1113576    Publication date: 2020-09-29

Reading note: This technique, "Multi-instance natural scene text detection method based on relevance hierarchy residual errors" (一种基于关联性层级残差的多实例自然场景文本检测方法), was designed and created by 田智强, 王春晖, 杜少毅 and 兰旭光 on 2020-05-27. Its main content is as follows: the invention provides a multi-instance natural scene text detection method based on relevance hierarchy residual errors. The feature extraction network adopted by the method uses relevance hierarchy residual errors and reverse stage-by-stage feature fusion to extract multi-scale features that combine coarse and fine granularity and therefore contain more accurate and complete text information, improving text detection precision. Secondly, the text detection box regression loss used by the method consists of a CIoU Loss term and an angle loss term; in particular, the CIoU Loss takes into account factors such as the overlapping area, center distance and aspect ratio between the predicted text detection box and the ground-truth text detection box, so the actual regression of the text detection box can be evaluated more accurately and the performance of the text detection method is improved. In addition, the method relieves hardware computational pressure in several steps in suitable ways, and finally it detects both regular text regions and small text regions well.

1. The multi-instance natural scene text detection method based on the relevance hierarchy residual error is characterized by comprising the following steps of:

firstly, extracting features of an original input image by adopting a feature extraction network based on relevance hierarchy residual errors so as to obtain feature maps with different scales containing rich text information from low level to high level;

step two, performing reverse step-by-step feature fusion on the feature maps of different scales extracted in the step one so as to obtain a multi-scale fusion feature map;

step three, performing text region detection on the multi-scale fusion feature map output in step two by means of feature mapping, and outputting a pixel-level text score feature map and a text region geometric feature map so as to represent candidate prediction text regions;

step four, pre-screening all candidate prediction text regions generated in step three according to the score of each candidate prediction text region and eliminating low-scoring ones, with the score threshold set to 0.5;

step five, merging and screening the candidate prediction text regions remaining after step four by using a locality-aware non-maximum suppression algorithm, thereby obtaining quasi-prediction text regions;

step six, calculating the average score of the regions of all the quasi-prediction text regions obtained in the step five, and removing the regions with the average score of the regions lower than the threshold value of 0.1, so as to obtain the final prediction text regions and the detection results;

the method comprises a training process, wherein a plurality of publicly available text detection data sets are used for training the method model;

using back propagation in the training process, continuously updating model parameters when the loss is large until the loss converges to a small value, and storing the model parameters;

and step seven, using the stored structure and parameters of the model to form a multi-instance natural scene text detection model.

2. The method for detecting multi-instance natural scene text based on relevance hierarchy residual error as claimed in claim 1, wherein in step one, the relevance hierarchy residual error based feature extraction network introduces a relevance hierarchy residual error structure on the basis of the ResNet-50 backbone network, so as to extract accurate and complete multi-scale text features combining coarse granularity and fine granularity; in the feature extraction stage, the original input image gradually acquires coarse-grained feature information of different scales from low level to high level through 5 convolutional layers Conv1-Conv5, and the feature map size after each convolutional layer becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn; in addition, a relevance hierarchy residual error structure is introduced into Conv2-Conv5 and used for fine-grained feature extraction between adjacent feature maps of different scales; in this way, the feature maps f1, f2, f3 and f4 of different scales generated in the extraction process simultaneously contain multi-scale feature information combining coarse granularity and fine granularity.

3. The method according to claim 2, wherein in the relevance hierarchy residual error based feature extraction network, Conv1 adopts a 7 × 7 convolution kernel and is downsampled by a MaxPool layer with a 3 × 3 kernel; Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, accompanied by residual connections so as to simplify the learning objectives and difficulty of the deep neural network; the 3 × 3 convolution group is the key to fine-grained feature extraction: first, the feature map generated by the 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension, and the 1st sub-feature map x1 is directly output as y1; each subsequent sub-feature map xi is passed through a 3 × 3 convolution operation Ki to obtain the output yi; starting from the 3rd sub-feature map, xi is first added to the output yi-1 of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are combined along the channel dimension to obtain the total output y.

4. The method according to claim 1, wherein in step two, the reverse step-by-step feature fusion starts from the feature map f1 generated by Conv5: f1 is first upsampled to output a feature map twice its size, so that the output feature map has the same size as the feature map f2 generated by Conv4, and the two feature maps are combined; two convolution operations of 1 × 1 and 3 × 3 are then added after the feature maps are combined, in order to reduce the channel dimension and the amount of parameter computation; in this way the feature maps f1, f2, f3 and f4 of different scales are fused step by step in sequence, the size of the fused feature map is 1/4 of the original input image size, and a 3 × 3 convolution layer is added to generate the final multi-scale fusion feature map.

5. The method for detecting multi-instance natural scene text based on relevance hierarchy residual error according to claim 1, wherein in step three, the feature mapping applied to the multi-scale fusion feature map adopts a 1 × 1 convolution operation; the output pixel-level text score feature map and text region geometric feature map then respectively indicate whether each pixel in the feature map lies inside a text region, the distances from the pixel to the boundaries of its text region, and the inclination angle of the text region to which the pixel belongs, so that the candidate prediction text regions can be represented.

6. The method for detecting the multi-instance natural scene text based on the relevance hierarchy residual error according to claim 1, wherein in the fifth step, the locality-aware non-maximum suppression algorithm first merges the remaining candidate prediction text regions row by row, and two candidate prediction text regions are merged when their intersection area is greater than the set threshold of 0.2; when merging, the vertex coordinates of the two original text regions are weighted-averaged to obtain the vertex coordinates of the merged text region, the weights being the scores of the two original regions, and the scores of the two original regions are added to obtain the score of the merged new text region; the merged candidate prediction text regions are then screened by a standard non-maximum suppression algorithm to obtain the quasi-prediction text regions.

7. The method according to claim 1, wherein in step six, a loss function is used in the training process, and parameter adjustment is performed when loss is propagated backwards.

8. The method according to claim 7, wherein the loss function is composed of two parts: the text classification loss is used to guide correct classification of the text regions, and the detection box regression loss is used to guide correct regression of the text detection boxes. The overall loss function is calculated as:

L = L_cls + λ·L_reg

wherein L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; λ is a parameter balancing the importance of the two losses and is set to 1;

the text classification loss calculation formula is as follows:

L_cls = 1 − 2·|Y ∩ Y*| / (|Y| + |Y*|)

wherein L_cls represents the text classification loss; Y represents all positive sample regions in the ground-truth text score feature map; Y* represents all positive sample regions in the predicted text score feature map; |Y ∩ Y*| represents the intersection of the positive sample regions in the predicted and ground-truth text score feature maps;

the detection frame regression loss calculation formula is as follows:

L_reg = L_g + λ_θ·L_θ

wherein L_reg is the detection box regression loss; L_g is the text detection box geometric regression loss without considering the angle; L_θ is the text detection box angle loss; λ_θ is a trade-off parameter between the two losses and is set to 20.

9. The method according to claim 8, wherein the text detection box geometric regression loss without considering the angle in the detection box regression loss is the CIoU Loss, and the calculation formula is:

L_g = 1 − IoU + R(A, B)

wherein L_g represents the text detection box geometric regression loss; IoU denotes the intersection-over-union between the predicted box A and the ground-truth box B; and R(A, B) is a penalty term calculated as:

R(A, B) = ρ²(a, b) / c² + α·v

wherein a and b respectively represent the centers of the predicted box A and the ground-truth box B; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the minimum enclosing rectangle covering both A and B; α is a trade-off parameter; and v is a parameter measuring the consistency of the aspect ratio, calculated as:

v = (4/π²)·(arctan(w_B/h_B) − arctan(w_A/h_A))²

wherein w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.

10. The method according to claim 8, wherein the calculation formula of the text detection box angle loss in the detection box regression loss is as follows:

L_θ = 1 − cos(θ* − θ)

wherein L_θ represents the text detection box angle loss; θ* is the predicted angle of the text region; θ is the true angle of the text region.

Technical Field

The invention belongs to the field of deep learning, computer vision and text detection, and particularly relates to a multi-instance natural scene text detection method based on relevance hierarchy residual errors.

Background

Text serves as a primary medium of information transfer and plays an indispensable role in people's daily lives. With the arrival of the big data era, how to acquire text information from massive numbers of images has become a problem that urgently needs to be solved. Therefore, driven by the development of deep learning, natural scene text detection has become a very popular research direction in the field of computer vision and is of great significance for image retrieval, scene understanding and other tasks.

At present, a large number of research achievements have enabled natural scene text detection to be widely applied across many industries. For example, many internet companies have developed related services and applications such as image retrieval and street-view navigation based on natural scene text detection technology. Some cloud service providers have also launched image text detection services covering education, logistics, video, e-commerce, tourism and other fields, providing users with direct text detection services, indirect text detection models, customized AI service system integration and so on. Although the technical achievements of natural scene text detection are remarkable, the text images it processes have complex backgrounds and diverse text, and the related techniques still suffer from problems such as insufficient detection accuracy.

Through investigation and research, two main deficiencies of existing natural scene text detection methods are found. On one hand, although text detection methods have begun to adopt the idea of multi-scale feature extraction and fusion by drawing on networks such as SPPNet, U-Net and FPN, they generally use a convolution kernel of only one size when extracting adjacent features of different scales, so the feature extraction is coarse-grained. Considering that a text instance in a natural scene image is usually much smaller than the natural background, and that existing methods detect small text regions poorly, features of finer granularity need to be extracted to suit the text detection task; coarse-grained multi-scale feature extraction therefore still leaves room for improvement. On the other hand, the regression loss function commonly used by current text detection methods is IoU Loss, which computes the intersection-over-union between the predicted text detection box and the ground-truth text detection box; however, under the same intersection-over-union, the relative position and overlap of the two boxes may differ, so evaluating the regression of the text detection box by the intersection-over-union alone is flawed, and the design of the loss function can still be improved.
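To make this limitation concrete, the following small Python example (illustrative only, not part of the patent) compares two predicted boxes that have the same IoU with a ground-truth box but whose centers lie at different distances from it; IoU alone scores them identically, while the CIoU penalty described later distinguishes them.

```python
# Two 10x10 predictions against a 10x10 ground-truth box spanning (0, 0)-(10, 10).
def iou_and_center_gap(px0, py0, px1, py1):
    gx0, gy0, gx1, gy1 = 0.0, 0.0, 10.0, 10.0
    iw = max(0.0, min(px1, gx1) - max(px0, gx0))      # width of the intersection
    ih = max(0.0, min(py1, gy1) - max(py0, gy0))      # height of the intersection
    inter = iw * ih
    union = (gx1 - gx0) * (gy1 - gy0) + (px1 - px0) * (py1 - py0) - inter
    center_gap_sq = ((px0 + px1) / 2 - 5.0) ** 2 + ((py0 + py1) / 2 - 5.0) ** 2
    return inter / union, center_gap_sq

print(iou_and_center_gap(5.0, 0.0, 15.0, 10.0))      # IoU = 0.333, squared center gap = 25.0
print(iou_and_center_gap(2.93, 2.93, 12.93, 12.93))  # IoU = 0.333, squared center gap = 17.2
```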

Disclosure of Invention

The invention aims to provide a multi-instance natural scene text detection method based on relevance hierarchy residual errors, and aims to solve the problems that the current text detection method is poor in detection effect on small text regions, and loss functions commonly used in text detection cannot well evaluate the actual regression condition of a text detection box.

In order to achieve the purpose, the invention adopts the following technical scheme:

firstly, extracting features of an original input image by adopting a feature extraction network based on relevance hierarchy residual errors so as to obtain feature maps with different scales containing rich text information from low level to high level;

step two, performing reverse step-by-step feature fusion on the feature maps of different scales extracted in the step one so as to obtain a multi-scale fusion feature map;

thirdly, performing text region detection on the multi-scale fusion characteristic graph output in the second step by adopting characteristic mapping, and outputting a pixel-level text score characteristic graph and a text region geometric characteristic graph so as to represent candidate prediction text regions;

step four, pre-screening all candidate prediction text regions generated in step three according to the score of each candidate prediction text region and eliminating low-scoring ones;

step five, merging and screening the candidate prediction text regions remaining after step four by using a locality-aware non-maximum suppression algorithm, thereby obtaining quasi-prediction text regions;

and step six, calculating the average score of the regions of all the quasi-prediction text regions obtained in the step five, and removing the regions with the average score of the regions lower than a certain threshold value, thereby obtaining the final prediction text regions and the detection results.

The method comprises a training process, wherein a plurality of publicly available text detection data sets are used for training the method model;

using back propagation in the training process, continuously updating model parameters when the loss is large until the loss converges to a small value, and storing the model parameters;

and step seven, using the stored structure and parameters of the model to form a multi-instance natural scene text detection model.

Further, in step one, the relevance hierarchy residual error based feature extraction network introduces a relevance hierarchy residual error structure on the basis of the ResNet-50 backbone network, so that accurate and complete multi-scale text features combining coarse granularity and fine granularity can be extracted. In the feature extraction stage, the original input image gradually acquires coarse-grained feature information of different scales from low level to high level through 5 convolutional layers Conv1-Conv5, and the feature map size after each convolutional layer becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn; in addition, a relevance hierarchy residual error structure is introduced into Conv2-Conv5 and used for fine-grained feature extraction between adjacent feature maps of different scales; in this way, the feature maps f1, f2, f3 and f4 of different scales generated in the extraction process simultaneously contain multi-scale feature information combining coarse granularity and fine granularity.

Further, in the feature extraction network based on relevance hierarchy residual errors, Conv1 adopts a 7 × 7 convolution kernel and is downsampled by a MaxPool layer with a 3 × 3 kernel. Conv2-Conv5 are each constructed as a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, accompanied by residual connections to simplify the learning objectives and difficulty of the deep neural network. The 3 × 3 convolution group is the key to fine-grained feature extraction: first, the feature map generated by the 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension, and the 1st sub-feature map x1 is directly output as y1; each subsequent sub-feature map xi is passed through a 3 × 3 convolution operation Ki to obtain the output yi; starting from the 3rd sub-feature map, xi is first added to the output yi-1 of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are combined along the channel dimension to obtain the total output y.
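As an illustration only (not part of the claimed method), the following is a minimal PyTorch sketch of such a bottleneck with the hierarchical 3 × 3 convolution group; the module name, channel widths and placement of activations are assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn


class HierarchicalResidualBlock(nn.Module):
    """Bottleneck block with a hierarchical (Res2Net-style) 3x3 convolution group."""

    def __init__(self, in_channels, mid_channels=128, out_channels=256, splits=4):
        super().__init__()
        assert mid_channels % splits == 0
        self.splits = splits
        width = mid_channels // splits
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        # One 3x3 convolution K_i for every sub-feature map except the first.
        self.convs = nn.ModuleList([
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(splits - 1)
        ])
        self.expand = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        # Residual shortcut; projected with a 1x1 convolution when the channel count changes.
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        xs = torch.chunk(out, self.splits, dim=1)        # x1 .. x4
        ys = [xs[0]]                                      # y1 = x1, passed through directly
        for i, conv in enumerate(self.convs, start=1):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]     # from the 3rd split on, add y_{i-1}
            ys.append(self.relu(conv(inp)))
        out = self.expand(torch.cat(ys, dim=1))           # merge along the channel dimension
        return self.relu(out + self.shortcut(x))


# Quick shape check: one block keeps the spatial size and outputs 256 channels.
y = HierarchicalResidualBlock(in_channels=256)(torch.randn(1, 256, 64, 64))
print(y.shape)  # torch.Size([1, 256, 64, 64])
```

The cascaded additions before each 3 × 3 convolution are what give each output sub-map a progressively larger receptive field, which is how coarse and fine granularity are combined within one block.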

Further, in step two, the reverse step-by-step feature fusion starts from the feature map f1 generated by Conv5: f1 is first upsampled to output a feature map twice its size, so that the output feature map has the same size as the feature map f2 generated by Conv4, and the two feature maps are combined; two convolution operations of 1 × 1 and 3 × 3 are then added after the feature maps are combined, in order to reduce the channel dimension and the amount of parameter computation; in this way the feature maps f1, f2, f3 and f4 of different scales are fused step by step in sequence, the size of the fused feature map is 1/4 of the original input image size, and a 3 × 3 convolution layer is added to generate the final multi-scale fusion feature map.
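For illustration, a minimal PyTorch sketch of one reverse fusion stage and its stage-by-stage chaining is given below; the channel widths (taken from a ResNet-50-like backbone) and module names are assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionStage(nn.Module):
    """One reverse fusion step: upsample the deeper map, concatenate, then 1x1 and 3x3 convs."""

    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(deep_channels + shallow_channels, out_channels, kernel_size=1)
        self.conv3x3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, deep, shallow):
        up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
        merged = torch.cat([up, shallow], dim=1)   # combine the two feature maps
        merged = self.relu(self.conv1x1(merged))   # reduce the channel dimension
        return self.relu(self.conv3x3(merged))


class ReverseFusion(nn.Module):
    """Fuses f1 (1/32 scale) .. f4 (1/4 scale) from deep to shallow into one 1/4-scale map."""

    def __init__(self, channels=(2048, 1024, 512, 256)):
        super().__init__()
        self.stage1 = FusionStage(channels[0], channels[1], 128)  # 1/32 -> 1/16
        self.stage2 = FusionStage(128, channels[2], 64)           # 1/16 -> 1/8
        self.stage3 = FusionStage(64, channels[3], 32)            # 1/8  -> 1/4
        self.final = nn.Conv2d(32, 32, kernel_size=3, padding=1)  # extra 3x3 conv

    def forward(self, f1, f2, f3, f4):
        x = self.stage1(f1, f2)
        x = self.stage2(x, f3)
        x = self.stage3(x, f4)
        return self.final(x)   # multi-scale fusion feature map at 1/4 of the input size
```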

Further, in the third step, a 1 × 1 convolution operation is adopted to perform feature mapping on the multi-scale fusion feature map; the output pixel-level text score feature map and text region geometric feature map then respectively indicate whether each pixel in the feature map lies inside a text region, the distances from the pixel to the boundaries of its text region, and the inclination angle of the text region to which the pixel belongs, so that the candidate prediction text regions can be represented.
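For illustration, a sketch of this 1 × 1 feature-mapping head is given below, assuming an EAST-style output layout: a 1-channel score map, a 4-channel map of distances from each pixel to the text box edges, and a 1-channel angle map; the channel counts, the sigmoid scaling and the max_text_size constant are assumptions.

```python
import math
import torch
import torch.nn as nn


class DetectionHead(nn.Module):
    """1x1 feature-mapping head producing the score map and the geometry maps."""

    def __init__(self, in_channels=32, max_text_size=512.0):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # text / non-text score per pixel
        self.dists = nn.Conv2d(in_channels, 4, kernel_size=1)  # distances to the 4 box edges
        self.angle = nn.Conv2d(in_channels, 1, kernel_size=1)  # inclination angle of the box
        self.max_text_size = max_text_size                     # assumed scale for the distances

    def forward(self, x):
        score = torch.sigmoid(self.score(x))
        dists = torch.sigmoid(self.dists(x)) * self.max_text_size
        angle = (torch.sigmoid(self.angle(x)) - 0.5) * math.pi / 2   # roughly (-pi/4, pi/4)
        return score, dists, angle
```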

Further, the score threshold for performing simple pre-screening and culling on the candidate predictive text regions in step four is set to 0.5.

Further, in the fifth step, the locality-aware non-maximum suppression algorithm first merges the remaining candidate prediction text regions row by row, and two candidate prediction text regions are merged when their intersection area is greater than the set threshold of 0.2; when merging, the vertex coordinates of the two original text regions are weighted-averaged to obtain the vertex coordinates of the merged text region, the weights being the scores of the two original regions, and the scores of the two original regions are added to obtain the score of the merged new text region. The merged candidate prediction text regions are then screened by a standard non-maximum suppression algorithm to obtain the quasi-prediction text regions.
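A NumPy sketch of this locality-aware merging is shown below for illustration; it represents each region as 8 vertex coordinates plus a score, uses IoU (computed with shapely) as the overlap measure for the 0.2 merge threshold, and the helper names are assumptions rather than the patent's implementation.

```python
import numpy as np
from shapely.geometry import Polygon   # only used to compute quadrilateral overlaps


def _overlap(g, p):
    """IoU-style overlap between two quads given as 9-dim arrays (8 coords + score)."""
    a, b = Polygon(g[:8].reshape(4, 2)), Polygon(p[:8].reshape(4, 2))
    if not (a.is_valid and b.is_valid):
        return 0.0
    inter = a.intersection(b).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def _weighted_merge(g, p):
    merged = np.zeros(9)
    merged[:8] = (g[8] * g[:8] + p[8] * p[:8]) / (g[8] + p[8])  # score-weighted vertex average
    merged[8] = g[8] + p[8]                                     # scores of the two regions are added
    return merged


def _standard_nms(quads, thresh):
    order = np.argsort([q[8] for q in quads])[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(quads[i])
        order = np.array([j for j in order[1:] if _overlap(quads[i], quads[j]) <= thresh], dtype=int)
    return keep


def locality_aware_nms(quads, merge_thresh=0.2, nms_thresh=0.2):
    """quads: iterable of 9-dim arrays (8 vertex coordinates + score), in row-major order."""
    merged, prev = [], None
    for q in quads:
        if prev is not None and _overlap(q, prev) > merge_thresh:
            prev = _weighted_merge(q, prev)        # merge neighbouring regions on the same row
        else:
            if prev is not None:
                merged.append(prev)
            prev = q
    if prev is not None:
        merged.append(prev)
    return _standard_nms(merged, nms_thresh) if merged else []
```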

Further, in the sixth step, a threshold value for performing region screening according to the region average score of the quasi-predictive text region is set to 0.1.

Further, in the sixth step, a loss function is used in the training process, and parameter adjustment is performed when the loss is propagated reversely.

Further, the loss function is composed of two parts: the text classification loss is used to guide correct classification of the text regions, and the detection box regression loss is used to guide correct regression of the text detection boxes. The overall loss function is calculated as:

L = L_cls + λ·L_reg

wherein L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; λ is a parameter balancing the importance of the two losses and is set to 1.

The text classification loss calculation formula is as follows:

L_cls = 1 − 2·|Y ∩ Y*| / (|Y| + |Y*|)

wherein L_cls represents the text classification loss; Y represents all positive sample regions in the ground-truth text score feature map; Y* represents all positive sample regions in the predicted text score feature map; |Y ∩ Y*| represents the intersection of the positive sample regions in the predicted and ground-truth text score feature maps.

The detection frame regression loss calculation formula is as follows:

L_reg = L_g + λ_θ·L_θ

wherein L_reg is the detection box regression loss; L_g is the text detection box geometric regression loss without considering the angle; L_θ is the text detection box angle loss; λ_θ is a trade-off parameter between the two losses and is set to 20.

Furthermore, the text detection box geometric regression loss without considering the angle in the detection box regression loss is the CIoU Loss, and the calculation formula is as follows:

L_g = 1 − IoU + R(A, B)

wherein L_g represents the text detection box geometric regression loss; IoU = |A ∩ B| / |A ∪ B| compares the areas of the predicted box and the ground-truth box; A and B respectively denote the predicted box region and the ground-truth box region; R(A, B) is a penalty term calculated as:

R(A, B) = ρ²(a, b) / c² + α·v

wherein a and b respectively represent the centers of the predicted box A and the ground-truth box B; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the minimum enclosing rectangle covering both A and B; α is a trade-off parameter; and v is a parameter measuring the consistency of the aspect ratio, calculated as:

v = (4/π²)·(arctan(w_B/h_B) − arctan(w_A/h_A))²

wherein w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.

Further, the calculation formula of the angle loss of the text detection box in the regression loss of the detection box is as follows:

L_θ = 1 − cos(θ* − θ)

wherein L_θ represents the text detection box angle loss; θ* is the predicted angle of the text region; θ is the true angle of the text region.
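The loss terms above can be sketched in PyTorch as follows; the dice-style form of the classification loss, the per-pixel geometry layout (distances to the top/right/bottom/left edges plus an angle), the definition of α, and the masking of the regression loss to text pixels are assumptions consistent with the formulas rather than the patent's exact implementation.

```python
import math
import torch


def dice_loss(score_pred, score_gt, eps=1e-6):
    """Assumed dice-style classification loss over the pixel-level score maps (N, H, W)."""
    inter = (score_pred * score_gt).sum()
    return 1.0 - 2.0 * inter / (score_pred.sum() + score_gt.sum() + eps)


def detection_loss(score_pred, score_gt, geo_pred, geo_gt, lam=1.0, lam_theta=20.0, eps=1e-7):
    """geo_* tensors: (N, 5, H, W) = distances to the top/right/bottom/left edges + angle."""
    t, r, b, l, theta = geo_pred.unbind(dim=1)
    tg, rg, bg, lg, theta_g = geo_gt.unbind(dim=1)

    # Per-pixel areas and intersection of the predicted and ground-truth boxes.
    area_p = (t + b) * (l + r)
    area_g = (tg + bg) * (lg + rg)
    w_i = torch.min(r, rg) + torch.min(l, lg)
    h_i = torch.min(t, tg) + torch.min(b, bg)
    inter = w_i * h_i
    iou = inter / (area_p + area_g - inter + eps)

    # Squared center distance over the squared diagonal of the smallest enclosing box.
    rho2 = ((r - l) / 2 - (rg - lg) / 2) ** 2 + ((b - t) / 2 - (bg - tg) / 2) ** 2
    w_c = torch.max(r, rg) + torch.max(l, lg)
    h_c = torch.max(t, tg) + torch.max(b, bg)
    c2 = w_c ** 2 + h_c ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight (standard CIoU form, assumed).
    v = (4 / math.pi ** 2) * (torch.atan((lg + rg) / (tg + bg + eps))
                              - torch.atan((l + r) / (t + b + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    l_g = 1 - iou + rho2 / c2 + alpha * v        # CIoU loss (geometric regression)
    l_theta = 1 - torch.cos(theta - theta_g)     # angle loss
    mask = score_gt                              # regress only at ground-truth text pixels
    l_reg = ((l_g + lam_theta * l_theta) * mask).sum() / (mask.sum() + eps)

    return dice_loss(score_pred, score_gt) + lam * l_reg
```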

Compared with the prior art, the invention has the following technical effects:

the adopted feature extraction network extracts multi-scale features combining coarse granularity and fine granularity by utilizing relevance hierarchy residual and reverse stage-by-stage feature fusion, wherein the multi-scale features comprise more accurate and complete text information, and the feature expression capability of the network is further enhanced, so that the text detection precision can be improved;

the regression Loss of the text detection box used by the invention consists of a CIoU Loss part and an angle Loss part, and particularly, the use of the CIoU Loss part considers factors such as the overlapping area, the central distance, the length-width ratio and the like between the predicted text detection box and the real text detection box, so that the actual regression condition of the text detection box can be more accurately evaluated, and the performance of the text detection method can be improved;

the invention relieves hardware computational stress in a number of steps in a suitable manner, such as: 1 × 1, 3 × 3 small convolutions, feature splitting and splicing and the like are used at multiple positions in network design to reduce feature dimensions and reduce parameter calculation amount; pre-simple threshold screening and the like for candidate prediction text regions are also carried out;

the method has high detection precision for the conventional text region, is sensitive to the detection of the small text region, and has higher application value in the field of natural scene text detection.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a diagram of a feature extraction and feature fusion network architecture of the present invention;

FIG. 3 is a basic structure diagram of the relevance hierarchy residual error used by the feature extraction network Conv2-Conv5 according to the present invention;

FIG. 4 is a graph of a portion of the test results of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

referring to fig. 1, the present invention comprises the steps of:

step 101, using a camera to acquire image data or directly uploading the image data as image input.

Step 102, a feature extraction network based on relevance hierarchy residual errors is used to extract features from the original input image, obtaining feature maps f1, f2, f3 and f4 that combine coarse and fine granularity at 1/32, 1/16, 1/8 and 1/4 of the original input image size; these multi-scale feature maps respectively represent rich feature information from low level to high level.

Step 103, reverse step-by-step feature fusion starts from the feature map f1; f1, f2, f3 and f4 are upsampled and spliced in sequence, finally generating a multi-scale fusion feature map whose size is 1/4 of the original input image.

Step 104, feature mapping is performed on the multi-scale fusion feature map to detect text regions, outputting a pixel-level text score feature map and a text region geometric feature map, which respectively indicate whether each pixel in the feature map lies inside a text region, the distances from the pixel to the boundaries of its text region, and the inclination angle of the text region to which the pixel belongs, so that candidate prediction text regions can be represented.
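For concreteness, the following sketch shows how a candidate quadrilateral can be recovered from one pixel's geometry values (four edge distances plus an angle); this EAST-style decoding is an assumption consistent with the geometry feature map described in step 104, not code from the patent.

```python
import numpy as np


def restore_quad(x, y, dists, angle):
    """Recover the 4 corner points of the rotated box predicted at pixel (x, y).

    dists = (top, right, bottom, left) distances from the pixel to the box edges,
    measured in the box's own rotated frame; angle is the box inclination in radians.
    """
    top, right, bottom, left = dists
    corners = np.array([
        [-left, -top],       # top-left
        [right, -top],       # top-right
        [right, bottom],     # bottom-right
        [-left, bottom],     # bottom-left
    ], dtype=np.float64)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return corners @ rot.T + np.array([x, y])    # rotate, then shift to image coordinates


# Example: a pixel at (100, 40) predicting a 60x20 box inclined by about 5 degrees.
quad = restore_quad(100, 40, dists=(10, 30, 10, 30), angle=np.deg2rad(5))
print(quad.round(1))
```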

Step 105, the candidate prediction text regions are pre-screened according to their region scores and low-scoring ones are eliminated; the screening score threshold is set to 0.5.

Step 106, a locality-aware non-maximum suppression algorithm is adopted to merge the remaining candidate prediction text regions row by row, and two candidate prediction text regions are merged when their intersection area is greater than the set threshold of 0.2; when merging, the vertex coordinates of the two original text regions are weighted-averaged to obtain the vertex coordinates of the merged text region, the weights being the scores of the two original regions, and the scores of the two original regions are added to obtain the score of the merged new text region. The merged candidate prediction text regions are then screened by a standard non-maximum suppression algorithm to obtain the quasi-prediction text regions.

Step 107, performing region screening according to the region average score of the quasi-predictive text region to obtain a final predictive text region (i.e. a text detection result), wherein the screening threshold is set to 0.1.
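Putting the thresholds of steps 105-107 together, a sketch of the post-processing flow is given below; it reuses the locality_aware_nms sketch given earlier, and region_mean_score (averaging the score map inside a quadrilateral with OpenCV) is an assumed helper, not part of the patent.

```python
import cv2
import numpy as np


def region_mean_score(quad, score_map):
    """Average of the pixel-level score map inside one quadrilateral (assumed helper)."""
    mask = np.zeros(score_map.shape, dtype=np.uint8)
    pts = quad[:8].reshape(4, 2).astype(np.int32)
    cv2.fillPoly(mask, [pts], 1)
    return float(score_map[mask == 1].mean()) if mask.any() else 0.0


def postprocess(quads, score_map, score_thresh=0.5, merge_thresh=0.2, region_thresh=0.1):
    """quads: (N, 9) array of candidate regions (8 vertex coordinates + score)."""
    # Step 105: drop low-scoring candidates before the more expensive merging step.
    quads = quads[quads[:, 8] >= score_thresh]

    # Step 106: locality-aware non-maximum suppression (see the earlier sketch).
    kept = locality_aware_nms(list(quads), merge_thresh=merge_thresh)

    # Step 107: keep only regions whose average score over the score map is high enough.
    return [q for q in kept if region_mean_score(q, score_map) >= region_thresh]
```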

In addition, as with most deep learning methods, the method model is first trained using a large amount of labeled image data, and back propagation and parameter optimization during the training process require the construction of a loss function. The loss function consists of two parts, wherein the text classification loss is used for guiding the correct classification of the text region; and the test box regression loss is used to guide the correct regression of the text test box. The overall loss function is calculated as:

L = L_cls + λ·L_reg

wherein L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; λ is a parameter balancing the importance of the two losses and is set to 1.

The text classification loss calculation formula is as follows:

L_cls = 1 − 2·|Y ∩ Y*| / (|Y| + |Y*|)

wherein L_cls represents the text classification loss; Y represents all positive sample regions in the ground-truth text score feature map; Y* represents all positive sample regions in the predicted text score feature map; |Y ∩ Y*| represents the intersection of the positive sample regions in the predicted and ground-truth text score feature maps.

The detection frame regression loss calculation formula is as follows:

L_reg = L_g + λ_θ·L_θ

wherein L_reg is the detection box regression loss; L_g is the text detection box geometric regression loss without considering the angle; L_θ is the text detection box angle loss; λ_θ is a trade-off parameter between the two losses and is set to 20.

The geometric regression Loss of the text detection box without considering the angle in the regression Loss of the detection box is CIoU Loss, and the calculation formula is as follows:

L_g = 1 − IoU + R(A, B)

wherein L_g represents the text detection box geometric regression loss; IoU = |A ∩ B| / |A ∪ B| compares the areas of the predicted box and the ground-truth box; A and B respectively denote the predicted box region and the ground-truth box region; R(A, B) is a penalty term calculated as:

R(A, B) = ρ²(a, b) / c² + α·v

wherein a and b respectively represent the centers of the predicted box A and the ground-truth box B; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the minimum enclosing rectangle covering both A and B; α is a trade-off parameter; and v is a parameter measuring the consistency of the aspect ratio, calculated as:

v = (4/π²)·(arctan(w_B/h_B) − arctan(w_A/h_A))²

wherein w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.

The calculation formula of the angle loss of the text detection box in the regression loss of the detection box is as follows:

L_θ = 1 − cos(θ* − θ)

wherein L_θ represents the text detection box angle loss; θ* is the predicted angle of the text region; θ is the true angle of the text region.

Referring to fig. 2, it depicts a structure diagram of the feature extraction and feature fusion network of the present invention, which includes the following parts:

step 201, using a camera to acquire image data or directly uploading the image data as image input.

Step 202, feature extraction is performed on the original input image using the feature extraction network based on relevance hierarchy residual errors, in which the 5 convolutional layers Conv1-Conv5 gradually acquire coarse-grained feature information of different scales from low level to high level, the feature map size after each convolutional layer becoming 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn; in addition, a relevance hierarchy residual error structure is introduced into Conv2-Conv5 for fine-grained feature extraction between adjacent feature maps of different scales; in this way, the feature maps f1, f2, f3 and f4 of different scales generated in the extraction process simultaneously contain multi-scale feature information combining coarse granularity and fine granularity. Conv1 uses a 7 × 7 convolution kernel and is followed by a MaxPool layer with a 3 × 3 kernel for downsampling. Conv2-Conv5 are each constructed as a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with residual connections to simplify the learning objectives and difficulty of the deep neural network.
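As a scale check only, the following snippet runs a plain ResNet-50 from torchvision in place of the relevance hierarchy residual backbone; the patent replaces the bottleneck blocks, but the stage strides, and therefore the 1/4 to 1/32 feature-map sizes, are the same.

```python
import torch
from torchvision.models import resnet50

# A plain ResNet-50 stands in for the backbone here; the feature-map sizes below
# match the 1/4 .. 1/32 scales of f4 .. f1 described above.
net = resnet50()
img = torch.randn(1, 3, 512, 512)
x = net.maxpool(net.relu(net.bn1(net.conv1(img))))   # Conv1 + 3x3 max pooling: 1/4
f4 = net.layer1(x)     # Conv2 output, 1/4 of the input size
f3 = net.layer2(f4)    # Conv3 output, 1/8
f2 = net.layer3(f3)    # Conv4 output, 1/16
f1 = net.layer4(f2)    # Conv5 output, 1/32
for name, f in [("f1", f1), ("f2", f2), ("f3", f3), ("f4", f4)]:
    print(name, tuple(f.shape))   # (1, 2048, 16, 16) ... (1, 256, 128, 128)
```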

Step 203, reverse step-by-step feature fusion starts from the feature map f1 generated by Conv5: f1 is first upsampled to output a feature map twice its size, so that the output feature map has the same size as the feature map f2 generated by Conv4, and the two feature maps are combined; two convolution operations of 1 × 1 and 3 × 3 are then added after the feature maps are combined to reduce the channel dimension and the amount of parameter computation; in this way the feature maps f1, f2, f3 and f4 of different scales are fused step by step in sequence, the size of the fused feature map is 1/4 of the original input image size, and a 3 × 3 convolution layer is added to generate the final multi-scale fusion feature map.

Referring to fig. 3, it depicts a basic structure diagram of the relevance hierarchy residual errors used by the inventive feature extraction network Conv2-Conv5, comprising the following parts:

in step 301, the feature map 1 is convolved by 1 × 1 to reduce the amount of parameter calculation.

Step 302, the feature map generated by the 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x1 is directly output as y1; each subsequent sub-feature map xi is passed through a 3 × 3 convolution operation Ki to obtain the output yi; starting from the 3rd sub-feature map, xi is first added to the output yi-1 of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are combined along the channel dimension to obtain the total output y.

Step 303, the channel dimension of the feature map y is restored by a 1 × 1 convolution, finally generating feature map 2.

Meanwhile, Conv2-Conv5 use residual connections in order to simplify the learning objectives and difficulty of the deep neural network.

Referring to fig. 4, partial detection results of the method are shown, and the results show that the method is relatively accurate in detection of the horizontal text, relatively sensitive in detection of small text regions, relatively accurate in distinguishing of multiple instances, and capable of eliminating interference of text similar objects.

The embodiments of the present invention have been described above with reference to the accompanying drawings. It will be appreciated by persons skilled in the art that the present invention is not limited by the embodiments described above. On the basis of the technical solution of the present invention, those skilled in the art can make various modifications or variations without creative efforts and still be within the protection scope of the present invention.
