Scene text detection method and system based on multi-scale feature pyramid

Document No.: 1862004 | Publication date: 2021-11-19 | Views: 7 | Chinese

Reading note: This technique, "Scene text detection method and system based on multi-scale feature pyramid" (一种基于多尺度特征金字塔的场景文本检测方法及系统), was designed and created by 万洪林, 王嘉鑫, 赵莹莹, and 王晓敏 on 2021-07-13. Main content: The present disclosure provides a scene text detection method and system based on a multi-scale feature pyramid, comprising: labeling the scene text of images; acquiring a data set with calibrated text positions and dividing it into a training set and a test set; performing top-down feature extraction on the training set with a feature pyramid network while simultaneously extracting features of different scales at the C2 layer of the feature pyramid network; using a feature fusion network to fuse the top-down output features with the multi-scale output features at the C2 layer to obtain a fused feature map; fusing and mapping the fused feature map with the branch outputs of the other layers of the feature pyramid network to obtain a trained model; and verifying the validity of the model with the test set. The disclosure uses a feature pyramid network as the backbone and adds to it a multi-scale module that adjusts the receptive field with dilated convolution, thereby improving detection accuracy for text whose sizes vary widely.

1. A scene text detection method based on a multi-scale feature pyramid is characterized by comprising the following steps:

inputting an image, and labeling a scene text of the image; acquiring a data set with a calibrated text position, and dividing the data set into a training set and a test set;

performing feature extraction on the training set from top to bottom by adopting a feature pyramid network, and simultaneously performing feature extraction on the training set at a C2 layer of the feature pyramid network in different scales;

fusing the top-down output features with the multi-scale output features at the C2 layer by using a feature fusion network to obtain a fused feature map; fusing and mapping the fused feature map with the branch outputs of the other layers of the feature pyramid network to obtain a training model;

verifying the validity of the model by using the test set; and detecting the scene text of the image by using the tested model.

2. The method for detecting scene text based on the multi-scale feature pyramid as claimed in claim 1, wherein each image has its own label txt file; for a single image, each line of the label file represents one text target: the first eight numbers are the coordinates of the four vertices of the rectangular text box, and the last column is the character content of the text.

3. The method for detecting scene text based on the multi-scale feature pyramid as claimed in claim 1, wherein the basic framework of the feature extraction network is a ResNet50 network, the network for preliminary feature extraction is a feature pyramid network, and a multi-scale module is fused at the C2 layer of the feature pyramid network to extract features of different scales; the multi-scale module adopts dilated (atrous) convolution to introduce multi-scale information; the module controls the receptive field without changing the feature map size.

4. The method for detecting scene text based on the multi-scale feature pyramid as claimed in claim 3, wherein the multi-scale module comprises:

a global average pooling layer to obtain image-level features, followed by a 1 × 1 convolution and bilinear interpolation back to the original size;

one 1 × 1 convolutional layer and three 3 × 3 dilated convolutions;

the five features of different scales are concatenated along the channel dimension and then passed through a 1 × 1 convolution for fused output.

5. The method for detecting scene text based on the multi-scale feature pyramid as claimed in claim 1, wherein the process of fusing the top-down output features with the multi-scale output features at the C2 layer comprises the following steps:

the top-down process of the feature pyramid network uses upsampling: the output P5 of the C5 layer is upsampled, and the upsampling result is fused with the feature map output by the C4 layer to give P4; the upsampled P4 is then fused with the feature map output by the C3 layer to give P3; and likewise P3 is fused with the C2 layer to give P2.

6. The method for detecting scene text based on the multi-scale feature pyramid as claimed in claim 5, wherein the process of fusing and mapping the fused feature map with the branch outputs of the other layers of the feature pyramid network comprises:

the output P2 of the C2 layer network is multiplied element-wise with the output of the multi-scale module to give M1; finally, the four branches M1, P3, P4, and P5 are added to detect texts of different scales and sizes.

7. The method for detecting scene text based on the multi-scale feature pyramid as claimed in claim 5, wherein the test set is input into the trained model and evaluated with standard natural scene text detection indicators; the evaluation indicators include precision, recall, and harmonic mean.

8. A scene text detection system based on a multi-scale feature pyramid is characterized by comprising a data acquisition module, a feature extraction module, a feature fusion module and a model test module;

the data acquisition module configured to: inputting an image, and labeling a scene text of the image; acquiring a data set with a calibrated text position, and dividing the data set into a training set and a test set;

the feature extraction module configured to: performing feature extraction on the training set from top to bottom by adopting a feature pyramid network, and simultaneously performing feature extraction on the training set at a C2 layer of the feature pyramid network in different scales;

the feature fusion module configured to: fuse the top-down output features with the multi-scale output features at the C2 layer by using a feature fusion network to obtain a fused feature map; and fuse and map the fused feature map with the branch outputs of the other layers of the feature pyramid network to obtain a training model;

the model testing module configured to: the validity of the model is verified with the test set.

9. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method for detecting scene text based on a multi-scale feature pyramid according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for multi-scale feature pyramid-based scene text detection as claimed in any one of claims 1-7 when executing the program.

Technical Field

The disclosure belongs to the technical field of computer vision, and particularly relates to a scene text detection method and system based on a multi-scale feature pyramid.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Text detection and recognition in natural scenes is a research hotspot in the field of computer vision, and many researchers have explored this direction for a long time. The results of scene text extraction research can be applied to intelligent urban traffic systems, visual assistance systems for the blind, autonomous driving, industrial automation, and other fields. Unlike traditional optical character extraction, natural scenes contain rich scene information, and text in them faces various difficulties such as variable fonts, complex backgrounds, and low image quality; effectively improving the accuracy of scene text extraction therefore has important practical application value.

Scene text refers to text content in any scene, such as road signs, billboards, shop signs, and product packaging. The goal is to locate the region or bounding box of each text instance in a natural image. Its development history parallels that of most computer vision tasks: before deep learning methods became dominant, text detection was mainly based on hand-crafted feature extraction; conventional detection methods include connected-component-based methods and sliding-window-based methods.

However, these two methods have the following disadvantages: (1) connected-component-based methods rely heavily on detecting connected text regions; in natural scenes, however, factors such as illumination changes, color fading, and noise interference make image processing very complex, so connected text regions are hard to detect accurately and natural scene text cannot be detected stably; (2) sliding-window-based methods are usually built on a single-character classifier applied to candidate windows; when the scene is complex, for example under illumination, shadow, or occlusion, character classification becomes unstable and detection results degrade. With the development of deep learning, scene text detection has entered a new stage, and a large number of deep-learning-based methods have emerged; these methods use more stable high-level semantic features and exploit more data to fit more complex models with stronger generalization ability. The main research directions are regression-based methods and segmentation-based methods; however, both fix the receptive field of the feature map, so detection accuracy drops when text sizes vary widely.

Disclosure of Invention

To solve the above problems, the present disclosure provides a scene text detection method and system based on a multi-scale feature pyramid.

To achieve this purpose, the present disclosure adopts the following technical scheme:

in a first aspect, the present disclosure provides a scene text detection method based on a multi-scale feature pyramid, including:

inputting an image, and labeling a scene text of the image; acquiring a data set with a calibrated text position, and dividing the data set into a training set and a test set;

performing feature extraction on the training set from top to bottom by adopting a feature pyramid network, and simultaneously performing feature extraction on the training set at a C2 layer of the feature pyramid network in different scales;

fusing the top-down output features with the multi-scale output features at the C2 layer by using a feature fusion network to obtain a fused feature map; fusing and mapping the fused feature map with the branch outputs of the other layers of the feature pyramid network to obtain a training model;

verifying the validity of the model by using the test set; and detecting the scene text of the image by using the tested model.

Furthermore, each image has its own label txt file; for a single image, each line of the label file represents one text target: the first eight numbers are the coordinates of the four vertices of the rectangular text box, and the last column is the character content of the text.

Further, the basic framework of the feature extraction network is a ResNet50 network, the network for preliminary feature extraction is a feature pyramid network, and a multi-scale module is fused at the C2 layer of the feature pyramid network to extract features of different scales; the multi-scale module adopts dilated (atrous) convolution to introduce multi-scale information; the module controls the receptive field without changing the feature map size.

Further, the multi-scale module comprises:

a global average pooling layer to obtain image-level features, followed by a 1 × 1 convolution and bilinear interpolation back to the original size;

one 1 × 1 convolutional layer and three 3 × 3 dilated convolutions;

the five features of different scales are concatenated along the channel dimension and then passed through a 1 × 1 convolution for fused output.

Further, the process of fusing the top-down output features with the multi-scale output features at the C2 layer is as follows:

the top-down process of the feature pyramid network uses upsampling: the output P5 of the C5 layer is upsampled, and the upsampling result is fused with the feature map output by the C4 layer to give P4; the upsampled P4 is then fused with the feature map output by the C3 layer to give P3; and likewise P3 is fused with the C2 layer to give P2.

Further, the process of fusing and mapping the fused feature map with the branch outputs of the other layers of the feature pyramid network is as follows:

the output P2 of the C2 layer network is multiplied element-wise with the output of the multi-scale module to give M1; finally, the four branches M1, P3, P4, and P5 are added to detect texts of different scales and sizes.

Further, the test set is input into the trained model and evaluated with standard natural scene text detection indicators; the evaluation indicators include precision, recall, and harmonic mean.

In a second aspect, the present disclosure further provides a scene text detection system based on a multi-scale feature pyramid, including a data acquisition module, a feature extraction module, a feature fusion module, and a model test module;

the data acquisition module configured to: inputting an image, and labeling a scene text of the image; acquiring a data set with a calibrated text position, and dividing the data set into a training set and a test set;

the feature extraction module configured to: performing feature extraction on the training set from top to bottom by adopting a feature pyramid network, and simultaneously performing feature extraction on the training set at a C2 layer of the feature pyramid network in different scales;

the feature fusion module configured to: fuse the top-down output features with the multi-scale output features at the C2 layer by using a feature fusion network to obtain a fused feature map; and fuse and map the fused feature map with the branch outputs of the other layers of the feature pyramid network to obtain a training model;

the model testing module configured to: the validity of the model is verified with the test set.

In a third aspect, the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the method for detecting scene text based on the multi-scale feature pyramid according to the first aspect.

In a fourth aspect, the present disclosure further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for detecting scene text based on the multi-scale feature pyramid according to the first aspect when executing the program.

Compared with the prior art, the beneficial effect of this disclosure is:

According to the present disclosure, the feature pyramid module extracts text feature information from top to bottom, enabling diversified extraction of texts of different shapes and scales; a multi-scale module is added to the bottom pyramid layer to further enhance text detection precision, making text detection in natural scenes more robust and improving scene text detection accuracy.

Drawings

The accompanying drawings, which form a part hereof, are included to provide a further understanding of the present embodiments, and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the present embodiments and together with the description serve to explain the present embodiments without unduly limiting the present embodiments.

Fig. 1 is a flow chart of example 1 of the present disclosure;

FIG. 2 shows the stage 1 and stage 2 network structures of embodiment 1 of the present disclosure;

FIG. 3 shows a stage3 network structure according to embodiment 1 of the present disclosure;

FIG. 4 shows a stage4 network structure according to embodiment 1 of the present disclosure;

FIG. 5 shows the structure of the stage5 network according to embodiment 1 of the present disclosure;

fig. 6 is an overall network of embodiment 1 of the present disclosure;

fig. 7 is a multi-scale module of embodiment 1 of the present disclosure.

Detailed Description

the present disclosure is further described with reference to the following drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

Example 1:

As shown in Fig. 1, the present embodiment provides a scene text detection method based on a multi-scale feature pyramid; the specific scheme is as follows:

(1) Acquiring a data set with calibrated text positions, and dividing the data set into a training set and a test set.

The data set provides a bounding-box label for each word in each picture, and each picture has its own label txt file. Taking a single picture as an example, each line of the label file represents one text object: the first eight numbers are the coordinates (x1, y1, x2, y2, x3, y3, x4, y4) of the four vertices of the rectangular text box; the last column is the character content of the text, which is replaced with ### if the text is blurred.
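For illustration, the label format described above can be parsed with a short Python sketch; the sample line and the helper name `parse_label_line` are hypothetical, not taken from the data set itself:

```python
def parse_label_line(line):
    """Parse one line of a label txt file: eight vertex coordinates
    followed by the text content (### marks illegible text)."""
    parts = line.strip().split(",")
    coords = [int(v) for v in parts[:8]]               # x1,y1,...,x4,y4
    vertices = list(zip(coords[0::2], coords[1::2]))   # four (x, y) vertices
    text = ",".join(parts[8:])                         # text may itself contain commas
    return vertices, text, text == "###"

# hypothetical example line in the described format
vertices, text, illegible = parse_label_line("377,117,463,117,465,130,378,130,Genaxis")
```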

(2) Inputting the training part of the data set into the feature extraction network. The basic framework of the feature extraction network is a ResNet50 network, the network for preliminary feature extraction is an FPN (feature pyramid network), and a multi-scale module is fused at the C2 layer of the FPN to extract features of different scales.

The FPN consists of a bottom-up pathway and a top-down pathway. The bottom-up pathway uses a conventional convolutional network for feature extraction: the features produced by the ResNet50 backbone are divided by stage and denoted C1, C2, C3, C4, and C5, where the number matches the stage number and indicates how many times the resolution has been halved; for example, C2 is the feature map output by stage 2, with resolution 1/4 of the input picture, and C5 is the feature map output by stage 5, with resolution 1/32 of the input picture. As convolution proceeds, spatial resolution decreases and spatial information is lost, but richer high-level semantic information is captured. The top-down pathway uses upsampling, and lateral connections fuse each upsampling result with the bottom-up feature map of the same size.
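The halving pattern above can be illustrated with a toy sketch; the stride-2 convolutions below are stand-ins for the actual ResNet50 residual blocks and the channel widths are arbitrary:

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    # each stage halves spatial resolution via a stride-2 convolution
    return nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)
c1 = stage(3, 64)(x)     # 1/2  of input resolution
c2 = stage(64, 64)(c1)   # 1/4  (stage 2 output)
c3 = stage(64, 64)(c2)   # 1/8
c4 = stage(64, 64)(c3)   # 1/16
c5 = stage(64, 64)(c4)   # 1/32 (stage 5 output)
```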

As shown in fig. 7, the multi-scale module adopts dilated (atrous) convolution mainly to enlarge the receptive field of the network and introduce multi-scale information. The module can control the receptive field without changing the size of the feature map, which is beneficial for extracting multi-scale information; the dilation rate controls the size of the receptive field, and the greater the rate, the larger the receptive field. The multi-scale module consists of the following parts: a. a global average pooling layer to obtain image-level features, followed by a 1 × 1 convolution and bilinear interpolation back to the original size; b. one 1 × 1 convolutional layer and three 3 × 3 dilated convolutions; c. the five features of different scales are concatenated along the channel dimension and then passed through a 1 × 1 convolution for fused output.
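A minimal PyTorch sketch of such a multi-scale module follows; the dilation rates (6, 12, 18) and channel counts are assumptions, since the source does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleModule(nn.Module):
    """Five parallel branches: image-level pooling, a 1x1 conv, and three
    3x3 dilated convs; the outputs are concatenated along the channel
    dimension and fused by a 1x1 conv. Spatial size is unchanged."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # image-level features
        self.pool_conv = nn.Conv2d(in_ch, out_ch, 1)
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.dilated = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 1)  # fuse the 5 branches

    def forward(self, x):
        h, w = x.shape[2:]
        branches = [
            # pooled branch is interpolated back to the original size
            F.interpolate(self.pool_conv(self.pool(x)), size=(h, w),
                          mode="bilinear", align_corners=False),
            self.conv1x1(x),
        ]
        branches += [conv(x) for conv in self.dilated]
        return self.fuse(torch.cat(branches, dim=1))

x = torch.randn(1, 16, 32, 32)
y = MultiScaleModule(16, 16)(x)  # same spatial size as the input
```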

(3) Inputting the extracted feature maps into the feature fusion network.

The feature fusion network first performs the top-down upsampling fusion of the FPN; at the same time, the output P2 of the bottom-up C2 layer of the FPN is multiplied element-wise with the output of the multi-scale module; finally, the fused feature map is fused and mapped with the features output by the other branches of the feature pyramid.

The FPN top-down process uses upsampling: the output P5 of C5 is first upsampled and the result is fused with the feature map output by C4 to give P4; the upsampled P4 is then fused with the feature map output by C3 to give P3; and likewise P3 is fused with C2 to give P2.
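The top-down fusion described above can be sketched as follows, assuming the C2-C5 maps have already been projected to a common channel count (the lateral 1 × 1 convolutions that FPN normally uses are omitted for brevity) and that "fusion" is element-wise addition:

```python
import torch
import torch.nn.functional as F

def top_down_fuse(c2, c3, c4, c5):
    """Top-down pathway: upsample each higher-level map by 2x and add it
    to the next bottom-up map of matching size."""
    p5 = c5
    p4 = F.interpolate(p5, scale_factor=2, mode="nearest") + c4
    p3 = F.interpolate(p4, scale_factor=2, mode="nearest") + c3
    p2 = F.interpolate(p3, scale_factor=2, mode="nearest") + c2
    return p2, p3, p4, p5

c2 = torch.randn(1, 8, 32, 32)
c3 = torch.randn(1, 8, 16, 16)
c4 = torch.randn(1, 8, 8, 8)
c5 = torch.randn(1, 8, 4, 4)
p2, p3, p4, p5 = top_down_fuse(c2, c3, c4, c5)
```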

The output P2 of the C2 layer network is multiplied element-wise with the output of the multi-scale module, which improves the detection accuracy of large-scale text; the result is M1. Finally, the four branches M1, P3, P4, and P5 are added to detect texts of different scales and sizes.
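The final fusion step can be sketched as below; the source does not specify how branches at different resolutions are "added", so upsampling P3-P5 to P2's resolution before summing is an assumption:

```python
import torch
import torch.nn.functional as F

def fuse_branches(p2, p3, p4, p5, ms_out):
    """Element-wise multiply P2 with the multi-scale module output to get
    M1, bring P3-P5 to M1's resolution, and sum the four branches."""
    m1 = p2 * ms_out  # point-wise multiplication with multi-scale features
    size = m1.shape[2:]
    up = lambda p: F.interpolate(p, size=size, mode="nearest")
    return m1 + up(p3) + up(p4) + up(p5)

p2 = torch.randn(1, 8, 32, 32)
p3 = torch.randn(1, 8, 16, 16)
p4 = torch.randn(1, 8, 8, 8)
p5 = torch.randn(1, 8, 4, 4)
ms_out = torch.randn(1, 8, 32, 32)  # multi-scale module output at P2's size
out = fuse_branches(p2, p3, p4, p5, ms_out)
```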

(4) The validity of the model is verified with the test set.

Inputting the test part of the data set into the trained model, and verifying the validity of the model through the standard natural scene text detection evaluation indexes, namely precision (P), recall (R), and harmonic mean (F), computed as:

P = TP / |B|, R = TP / |G|, F = 2PR / (P + R)

where TP denotes the number of correctly detected text boxes, B denotes the set of text boxes detected by the model, and G denotes the set of ground-truth text boxes.
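A minimal sketch of these three indicators (the counts in the usage line are illustrative, not measured results):

```python
def evaluate(tp, num_detected, num_gt):
    """Precision, recall, and harmonic mean (F-measure) from counts of
    correctly detected boxes (tp), detected boxes, and ground-truth boxes."""
    p = tp / num_detected if num_detected else 0.0
    r = tp / num_gt if num_gt else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

# hypothetical counts: 80 correct out of 100 detections, 90 ground-truth boxes
p, r, f = evaluate(tp=80, num_detected=100, num_gt=90)
```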

Example 2:

the embodiment provides a scene text detection system based on a multi-scale feature pyramid, which comprises a data acquisition module, a feature extraction module, a feature fusion module and a model test module;

the data acquisition module configured to: inputting an image, and labeling a scene text of the image; acquiring a data set with a calibrated text position, and dividing the data set into a training set and a test set;

the feature extraction module configured to: performing feature extraction on the training set from top to bottom by adopting a feature pyramid network, and simultaneously performing feature extraction on the training set at a C2 layer of the feature pyramid network in different scales;

the feature fusion module configured to: fusing output features extracted from top to bottom with output features extracted from features of different scales at a C2 layer by using a feature fusion network to obtain a fused feature map; fusing and mapping the fused feature graph and features of branch outputs of other layers of the feature pyramid network to obtain a training model;

the model testing module configured to: the validity of the model is verified with the test set.

Example 3:

the embodiment provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for detecting a scene text based on a multi-scale feature pyramid described in embodiment 1 when executing the program.

Example 4:

the present embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for detecting scene text based on a multi-scale feature pyramid described in embodiment 1 is implemented.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit it; those skilled in the art can make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure shall be included in its protection scope.
