License plate real-time detection method based on edge-guided sparse attention mechanism

Document No.: 1354693  Publication date: 2020-07-24

Description: this technique, "License plate real-time detection method based on edge-guided sparse attention mechanism", was designed and created by Qin Huabiao and Liang Jing on 2020-03-22. Main content: the invention discloses a license plate real-time detection method based on an edge-guided sparse attention mechanism, belonging to the technical field of target detection. The method first processes the input image with a convolutional neural network to extract semantic features; it then uses a new edge-guided sparse attention mechanism, consisting of an edge-guided component and a sparse attention component, to rapidly capture the salient region, namely the license plate region; cascade multi-task learning is then adopted to assist accurate license plate detection; finally, a loss mask method is adopted to suppress low-quality prediction boxes and improve system performance. The invention achieves real-time license plate detection in a variety of natural scenes with high accuracy, high recall, and high robustness, which is of practical importance. It achieves state-of-the-art performance on CCPD, the largest and most diverse public dataset; in particular, the detection accuracy on the CCPD-Base (100k) test set reaches 99.9%.

1. A license plate real-time detection method based on an edge-guided sparse attention mechanism is characterized by comprising the following steps: the edge-guided sparse attention mechanism comprises an edge-guided component and a sparse attention component, and the real-time detection method comprises the following steps:

s1, processing the input image by using a convolutional neural network, and extracting a semantic feature map X;

s2, the edge-guided sparse attention mechanism captures a license plate region, wherein,

the edge guide component is used for enhancing target edge information and reducing noise interference, and specifically operates as follows:

s21, extracting edge information of the image by using a convolutional neural network to generate an edge guide image I;

s22, obtaining linear model coefficients (a, b) by the aid of the semantic feature map X and the edge guide map I through a convolutional neural network;

s23, constructing a linear model g_i = a_i * I_i + b_i by using the linear model coefficients (a, b) and the edge guide map I, and obtaining a feature map X1 through the linear model;

the sparse attention component is used for reducing the computational complexity of a self-attention mechanism, the feature map X1 is input in the sparse attention component, and the specific operation in the sparse attention component is as follows:

s24, finding K most similar target pixels for each source pixel of the input feature map X1;

s25, for each source pixel, calculating an attention map by using K target pixels most similar to the source pixel;

s26, aggregating the K target pixels by using the attention map to obtain the corresponding output features;

s3, adopting cascade multi-task learning to assist accurate detection of the license plate;

s4, using a loss mask method to suppress the low-quality prediction box.

2. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 1, wherein: the specific steps of finding the K most similar target pixels in step S24 are as follows:

s241, predicting an offset map offset_(k,c',i,j) with 2K channels through a convolutional neural network, where K denotes the K target pixels most similar to the corresponding source pixel and 2 denotes the x-axis and the y-axis; a basic grid basic_(c,i,j) is generated from the feature map X1 and represents the original coordinates of each pixel; the basic grid has 2 channels, representing the x-axis and the y-axis respectively;

s242, summing the original coordinates of each target pixel in the basic grid and the offset coordinates of the corresponding K pixels in the offset map to obtain the absolute coordinates abs_offset_(k,c',i,j), according to the formula:

abs_offset_(k,c',i,j) = offset_(k,c',i,j) + basic_(c,i,j)

c = 0, 1;  c' = 2(k-1), 2(k-1)+1;  k = 1, 2, ..., K

wherein c and c' both represent channels, k represents the kth target pixel corresponding to the source pixel point located at the ith row and the jth column;

s243, based on the feature map X1 and the offset map offset_(k,c',i,j), finding the corresponding K most similar target pixels for each source pixel point by sampling, and obtaining the feature map X2.

3. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 2, wherein: the attention map is calculated using a dot product, a Gaussian function, or an embedded Gaussian function.

4. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 3, wherein: the specific calculation formula using the dot product is as follows:

a_(k,i,j) = Σ_c X1_(c,i,j) * X2_(k,c,i,j)

wherein a_(k,i,j) denotes the attention weight between the source pixel point located at the ith row and the jth column and its corresponding kth target pixel, and "*" denotes multiplication of corresponding positions.

5. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 4, wherein: the formula for obtaining the feature output is as follows:

o_(c,i,j) = Σ_k a_(k,i,j) * X2_(k,c,i,j)

wherein o_(c,i,j) denotes the aggregated output feature of the source pixel located at the ith row and the jth column with channel c; note that different channels of the feature map X2 at the same location share the same attention weight.

6. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 1, wherein: step S3 includes first-level task learning and second-level task learning; the task branches of the first-level task learning are the classification confidence prediction of the license plate, the relative position regression prediction of the bounding box, the classification confidence prediction of the keypoints, and the relative position regression of the keypoints, where the relative position regression of the keypoints is optional; the second-level task learning selectively fuses the prediction feature maps obtained by the first-level multi-task learning and further fine-tunes the target detection to obtain an accurate position.

7. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 1, wherein: the loss mask method uses a water ripple loss mask: the closer a bounding box is to the target center point, the larger its regression loss weight, while the regression loss weight of bounding boxes far from the target center is reduced.

8. The real-time license plate detection method based on the edge-guided sparse attention mechanism as claimed in claim 7, wherein: the water ripple loss mask is defined as:

Technical Field

The invention belongs to the technical field of target detection, and particularly relates to a license plate real-time detection method based on an edge-guided sparse attention mechanism.

Background

In recent years, the self-attention mechanism, whose core operation is a weighted summation over all positions of a feature map, has helped deep learning models capture long-range dependencies and focus on salient features, and it has advanced many computer vision tasks such as object detection, semantic segmentation, and human pose estimation. From an image filtering point of view, its essence is to reduce noise and to reorganize the most important long-range contextual semantic information.

Although popular, the self-attention mechanism also has limitations. Many self-attention-based methods, such as the non-local network proposed by Wang X. et al. in Non-local Neural Networks (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7794-7803), OCNet proposed in Object Context Network for Scene Parsing, and DANet proposed by Fu J. et al. in Dual Attention Network for Scene Segmentation, are designed to achieve excellent performance without considering speed and storage cost, and their computational complexity is O(N²C). Furthermore, from the work of Xie C. et al. in Feature Denoising for Improving Adversarial Robustness (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 501-509) and from our own experiments, we find that although denoising improves the robustness of the model, the attention mechanism blurs object contours on the feature map while reducing noise, which limits further improvement of detection accuracy.

In the last two decades, automatic license plate detection has been an active research topic, with applications such as speeding enforcement, highway tolling, and vehicle access management. Inspired by YOLO, which achieves a strong speed/accuracy trade-off, most license plate detection networks are based on YOLO. It is noted, however, that YOLO-based methods obtain a lower recall when detecting license plates far from the camera, because YOLO networks have difficulty detecting small objects. Hence Silva S. M. et al. in License Plate Detection and Recognition in Unconstrained Scenarios (European Conference on Computer Vision, Springer, Cham, 2018), Laroca R. et al. in A Robust Real-Time Automatic License Plate Recognition Based on the YOLO Detector (2018 International Joint Conference on Neural Networks, IEEE, 2018), and the document License Plate Recognition Based on Temporal Redundancy propose detecting the vehicle before detecting the license plate, which improves accuracy and recall.

Disclosure of Invention

In order to solve the problem that second-order license plate detection methods, which first detect the vehicle and then detect the license plate, require higher computational cost and more parameters and are difficult to run in real time, the invention provides a license plate real-time detection method based on an edge-guided sparse attention mechanism, in which the edge-guided sparse attention mechanism is embedded into the detection backbone network so that the license plate is detected in real time. To overcome the drawbacks that the existing self-attention mechanism has high computational complexity and blurs object contours on the feature map while reducing noise, the edge-guided component uses the edge guide feature map and the strong fitting capacity of a neural network to construct a linear model, which enhances the edge contour features of the target in the feature map and suppresses noise interference, thereby improving the accuracy and robustness of target detection. The sparse attention component greatly reduces the computational complexity to only O(NKC), where K ≪ N.

The purpose of the invention is realized by at least one of the following technical solutions.

A license plate real-time detection method based on an edge-guided sparse attention mechanism, wherein the edge-guided sparse attention mechanism comprises an edge-guided component and a sparse attention component, and the real-time detection method comprises the following steps:

s1, processing the input image by using a convolutional neural network, and extracting a semantic feature map X;

s2, the edge-guided sparse attention mechanism captures a license plate region, wherein,

the edge guide component is used for enhancing target edge information and reducing noise interference, and specifically operates as follows:

s21, extracting edge information of the image by using a convolutional neural network to generate an edge guide image I;

s22, obtaining linear model coefficients (a, b) by the aid of the semantic feature map X and the edge guide map I through a convolutional neural network;

s23, constructing a linear model g_i = a_i * I_i + b_i by using the linear model coefficients (a, b) and the edge guide map I, and obtaining a feature map X1 through the linear model;

the sparse attention component is used for reducing the computational complexity of a self-attention mechanism, the feature map X1 is input in the sparse attention component, and the specific operation in the sparse attention component is as follows:

s24, finding K most similar target pixels for each source pixel of the input feature map X1;

s25, for each source pixel, calculating an attention map by using K target pixels most similar to the source pixel;

s26, aggregating the K target pixels by using the attention map to obtain the corresponding output features;

s3, adopting cascade multi-task learning to assist accurate detection of the license plate;

s4, using a loss mask method to suppress the low-quality prediction box.

Further, the specific steps of finding the K most similar target pixels in step S24 are as follows:

s241, predicting an offset map offset_(k,c',i,j) with 2K channels through a convolutional neural network, where K denotes the K target pixels most similar to the corresponding source pixel and 2 denotes the x-axis and the y-axis; a basic grid basic_(c,i,j) is generated from the feature map X1 and represents the original coordinates of each pixel; the basic grid has 2 channels, representing the x-axis and the y-axis respectively;

s242, summing the original coordinates of each target pixel in the basic grid and the offset coordinates of the corresponding K pixels in the offset map to obtain the absolute coordinates abs_offset_(k,c',i,j), according to the formula:

abs_offset_(k,c',i,j) = offset_(k,c',i,j) + basic_(c,i,j)

c = 0, 1;  c' = 2(k-1), 2(k-1)+1;  k = 1, 2, ..., K

wherein c and c' both represent channels, and k represents the kth target pixel corresponding to the source pixel point located at the ith row and the jth column;

s243, based on the feature map X1 and the offset map offset_(k,c',i,j), finding the corresponding K most similar target pixels for each source pixel point by sampling, and obtaining the feature map X2.

Further, the attention map may be calculated using a dot product, a Gaussian function, or an embedded Gaussian function.

Further, the specific calculation formula using the dot product is as follows:

a_(k,i,j) = Σ_c X1_(c,i,j) * X2_(k,c,i,j)

wherein a_(k,i,j) denotes the attention weight between the source pixel point located at the ith row and the jth column and its corresponding kth target pixel, and "*" denotes multiplication of corresponding positions.

Further, the formula for obtaining the feature output is as follows:

o_(c,i,j) = Σ_k a_(k,i,j) * X2_(k,c,i,j)

wherein o_(c,i,j) denotes the aggregated output feature of the source pixel located at the ith row and the jth column with channel c; note that different channels of the feature map X2 at the same location share the same attention weight.

Further, step S3 includes first-level task learning and second-level task learning; the task branches of the first-level task learning are the classification confidence prediction of the license plate, the relative position regression prediction of the bounding box, the classification confidence prediction of the keypoints, and the relative position regression of the keypoints, where the relative position regression of the keypoints is optional; the second-level task learning selectively fuses the prediction feature maps obtained by the first-level multi-task learning and further fine-tunes the target detection to obtain an accurate position.

Further, the loss mask method uses a water ripple loss mask: the closer a bounding box is to the target center point, the larger its regression loss weight, while the regression loss weight of bounding boxes far from the target center is reduced.

Further, the water ripple loss mask is defined as:

compared with the prior art, the invention has the following beneficial effects:

(1) The edge-guided sparse attention mechanism greatly reduces the computational complexity of the self-attention mechanism while making up for the self-attention mechanism's tendency to blur object edges: it greatly enhances the edge contour of the target, suppresses noise interference, and improves target detection accuracy and robustness.

(2) The edge-guided sparse attention mechanism can quickly capture the salient target region, namely the license plate region, so embedding the edge-guided sparse attention module into the detection backbone network allows license plates to be detected directly in real time with high accuracy and recall. Compared with the mainstream second-order detection approach of first detecting the vehicle or a related region and then detecting the license plate, the method maintains high precision while greatly reducing the amount of computation and the number of network parameters, which benefits real-time detection.

Drawings

FIG. 1 is a general flowchart of a license plate detection method based on an edge-guided sparse attention mechanism according to the present invention.

Fig. 2 is a detailed flow diagram of the edge-guided component.

FIG. 3 is a detailed flow diagram of the sparse attention component.

Detailed Description

The following further describes the embodiments of the present invention with reference to the drawings, which are provided for illustration only and are not intended to limit the scope of the present invention.

Most existing license plate detection methods are based on YOLO and adopt a second-order approach, that is, the vehicle is detected first and then the license plate is further detected, which greatly increases the amount of computation and the number of network parameters, reduces detection speed, and makes real-time detection difficult.

A license plate real-time detection method based on an edge-guided sparse attention mechanism comprises the following steps:

step 1: the input image is processed by using a convolutional neural network (Backborn), and a semantic feature Map X (feature Map X) of the image is extracted. Semantic feature map X is extracted, for example, using the top 8 convolutional layers of VGG-19.

Step 2: a salient region, namely a license plate region, is rapidly captured by utilizing a novel Edge-Guided Sparse Attention mechanism (Edge-Guided Sparse Attention module), wherein the Edge-Guided Sparse Attention mechanism comprises two parts, namely an Edge-Guided Component (Edge-Guided Component) and a Sparse Attention Component (Sparse Attention Component).

The operation of the edge-guided component includes 3 steps:

(1) An edge guide map I (Edge-Guided Map I) is generated by a convolutional layer (CNN1), where the input to the convolution is a grayscale image converted from the original input image.

(2) The edge guide map I and the semantic feature map X generated in step 1 are concatenated (CONCAT) and then fed into a small convolutional network (CNN2) to obtain the linear model coefficients (a, b).

(3) The linear model coefficients (a, b) and the edge guide map I are fed into a linear layer (Linear Layer), and a linear model g_i = a_i * I_i + b_i is constructed for each pixel point in the edge guide map I, where g represents the output of the linear model and i represents the index of the pixel. Through this linear model, a feature map X1 (Feature Map X1) with low noise and salient contour information is generated.
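For illustration only, the edge-guided component can be sketched as follows; the layer widths, kernel sizes, and the stride used to match the spatial size of X are assumptions made for the example.

    # Hedged sketch of the edge-guided component (all layer dimensions are assumptions).
    import torch
    import torch.nn as nn

    class EdgeGuidedComponent(nn.Module):
        def __init__(self, feat_channels=256):
            super().__init__()
            # CNN1: edge guide map I from the grayscale image; stride 4 matches X's H/4 x W/4 size.
            self.cnn1 = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=4, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 1, 3, padding=1))
            # CNN2: predicts the per-pixel linear-model coefficients (a, b) from [X, I].
            self.cnn2 = nn.Conv2d(feat_channels + 1, 2 * feat_channels, 3, padding=1)

        def forward(self, x, gray):
            i = self.cnn1(gray)                              # edge guide map I
            a, b = self.cnn2(torch.cat([x, i], dim=1)).chunk(2, dim=1)
            return a * i + b                                 # linear model g_i = a_i * I_i + b_i -> X1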

The operation of the sparse attention component comprises 3 steps:

(1) The feature map X1 enters the sparse attention component, and the K most similar target pixels are found for each source pixel (Source Pixel) of the input feature map X1, which can be achieved by, but is not limited to, the following steps (a code sketch of these steps is given after step ③):

① An offset map (Offset Map) offset_(k,c',i,j) with 2K channels is predicted in a data-driven manner using a convolutional network (CNN3), where K denotes the K target pixels most similar to the corresponding source pixel, 2 denotes the x-axis and the y-axis, c' denotes the channel, and k denotes the kth target pixel corresponding to the source pixel point located at the ith row and the jth column. To obtain the absolute coordinates of the target pixels, a basic grid (Basic Grid) basic_(c,i,j) is generated from the input feature map X1 of the sparse attention component to represent the original coordinates of each pixel. The basic grid is generated by normalizing the coordinates of each pixel of the feature map X1 to [-1, 1]: the coordinates of the top-left pixel are (-1, -1) and those of the bottom-right pixel are (1, 1). Pixels at the same location on different channels have the same coordinates. The basic grid has 2 channels, representing the x-axis and the y-axis respectively.

② The original coordinates of each target pixel in the basic grid and the offset coordinates of the corresponding K pixels in the offset map are summed element-wise (Element-wise Sum) to obtain the absolute coordinate map (Abs_offset Map) abs_offset_(k,c',i,j), according to the formula:

abs_offset_(k,c',i,j) = offset_(k,c',i,j) + basic_(c,i,j)

c = 0, 1;  c' = 2(k-1), 2(k-1)+1;  k = 1, 2, ..., K

c and c' both represent channels, and k represents the kth target pixel corresponding to the source pixel point located at the ith row and jth column.

③ The corresponding K most similar target pixels (The K Similar Pixels) are found for each source pixel point by sampling (Sampling), thereby obtaining the feature map X2 (Feature Map X2).
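For illustration only, steps ① to ③ can be sketched as follows; the use of F.grid_sample as the sampling operator, the value of K, and the layer dimensions are assumptions made for the example.

    # Hedged sketch of offset prediction, basic grid, and sampling (dimensions are assumptions).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseSampler(nn.Module):
        def __init__(self, channels=256, k=9):
            super().__init__()
            self.k = k
            self.cnn3 = nn.Conv2d(channels, 2 * k, 3, padding=1)   # offset map with 2K channels

        def forward(self, x1):
            b, c, h, w = x1.shape
            # Basic grid: original (x, y) coordinates of every pixel, normalized to [-1, 1].
            ys = torch.linspace(-1, 1, h, device=x1.device)
            xs = torch.linspace(-1, 1, w, device=x1.device)
            gy, gx = torch.meshgrid(ys, xs, indexing="ij")
            basic = torch.stack([gx, gy], dim=-1)                               # (h, w, 2)
            # Offset map: per source pixel, (x, y) offsets to its K most similar target pixels.
            offset = self.cnn3(x1).view(b, self.k, 2, h, w).permute(0, 1, 3, 4, 2)
            abs_offset = offset + basic                                         # absolute coordinates
            # Sample the K target pixels for every source pixel (differentiable bilinear sampling).
            x2 = F.grid_sample(
                x1.unsqueeze(1).expand(-1, self.k, -1, -1, -1).reshape(b * self.k, c, h, w),
                abs_offset.reshape(b * self.k, h, w, 2),
                align_corners=True)
            return x2.view(b, self.k, c, h, w)                                  # feature map X2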

(2) For each source pixel, an attention map (Attention Map) is calculated using the K target pixels most similar to that source pixel. The calculation formula may be a dot product, a Gaussian function, an embedded Gaussian function, and so on; taking the dot product as an example:

a_(k,i,j) = Σ_c X1_(c,i,j) * X2_(k,c,i,j)

where a_(k,i,j) denotes the attention weight between the source pixel point located at the ith row and the jth column and its corresponding kth target pixel, X2_(k,c,i,j) denotes the cth channel value in the feature map X2 of the kth target pixel corresponding to the source pixel point located at the ith row and the jth column, X1_(c,i,j) denotes the cth channel value of the source pixel point located at the ith row and the jth column in the feature map X1, and "*" denotes multiplication of corresponding positions.
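For illustration only, the dot-product attention weights can be sketched as follows; the softmax normalization over the K candidates is an assumption, since the exact normalization is not reproduced in the source.

    # a[k, i, j] = sum_c X1[c, i, j] * X2[k, c, i, j]; softmax over K is an assumption.
    import torch

    def attention_weights(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # x1: (b, c, h, w) source features; x2: (b, k, c, h, w) sampled target features
        a = (x1.unsqueeze(1) * x2).sum(dim=2)      # (b, k, h, w) raw dot-product similarities
        return torch.softmax(a, dim=1)             # normalize across the K target pixels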

(3) The computed attention map is used to aggregate the K target pixels and obtain the corresponding discriminative output feature map (Feature Map O), using the following formula:

o_(c,i,j) = Σ_k a_(k,i,j) * X2_(k,c,i,j)

where o_(c,i,j) denotes the aggregated output feature of the source pixel located at the ith row and the jth column with channel c. Note that different channels of the feature map X2 at the same location share the same attention weight.
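For illustration only, the aggregation step can be sketched as follows, matching the formula above.

    # o[c, i, j] = sum_k a[k, i, j] * X2[k, c, i, j]; one weight is shared by all channels of a location.
    import torch

    def aggregate(a: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # a: (b, k, h, w) attention weights; x2: (b, k, c, h, w) -> output feature map O: (b, c, h, w)
        return (a.unsqueeze(2) * x2).sum(dim=1)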

Step 3: cascade multi-task learning (Cascade Multi-Task Detection Head) is adopted to assist accurate license plate detection.

The task branches in the first-level multi-task learning are the classification confidence prediction of the license plate (Classification), the relative position regression prediction of the bounding box (Regression), the keypoint classification confidence prediction (Landmark Classification), and the keypoint relative position regression (Landmark Regression), where the keypoint relative position regression is optional.

The second-level task learning fuses the license plate classification confidence prediction feature map and the keypoint classification confidence prediction obtained from the first-level multi-task learning, for example with a CONCAT operation, and performs further fine-tuning (Refined Classification) of the target detection to obtain an accurate position.
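For illustration only, the cascade multi-task head can be sketched as follows; the channel widths, the number of keypoints, and the refinement layer are assumptions made for the example.

    # Hedged sketch of the cascade multi-task detection head (all dimensions are assumptions).
    import torch
    import torch.nn as nn

    class CascadeHead(nn.Module):
        def __init__(self, channels=256, num_landmarks=4):
            super().__init__()
            self.cls = nn.Conv2d(channels, 1, 1)                       # plate classification confidence
            self.box = nn.Conv2d(channels, 4, 1)                       # box relative-position regression
            self.lmk_cls = nn.Conv2d(channels, num_landmarks, 1)       # landmark classification confidence
            self.lmk_reg = nn.Conv2d(channels, 2 * num_landmarks, 1)   # optional landmark regression
            # Second level: fuse the two confidence maps (CONCAT) and refine the classification.
            self.refine = nn.Conv2d(1 + num_landmarks, 1, 3, padding=1)

        def forward(self, feat):
            cls, box = self.cls(feat), self.box(feat)
            lmk_cls, lmk_reg = self.lmk_cls(feat), self.lmk_reg(feat)
            refined = self.refine(torch.cat([cls, lmk_cls], dim=1))
            return refined, box, lmk_cls, lmk_reg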

Step 4: a new loss mask method is used to suppress low-quality prediction boxes and improve system performance. The core of the new water ripple loss mask is that the closer a bounding box is to the target center point, the larger its regression loss weight, while the regression loss weight of bounding boxes far from the target center is reduced. Through back-propagation, low-quality prediction boxes are suppressed, further improving detection performance.

The water ripple loss mask is defined as follows:

wherein the first term denotes the predicted license plate classification confidence and the second denotes the ground-truth label; (x_i, y_i) denotes the coordinates of pixel point i, and the corresponding term denotes the mask value at pixel point i; (c_x, c_y) denotes the object center point obtained by downsampling the ground-truth annotated bounding box (by a factor of 2^2) and mapping it to the same size as the predicted license plate classification confidence map; the loss values of pixel points located in the gray-scale region do not need to be calculated; and σ denotes a hyper-parameter.
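For illustration only, and because the exact water-ripple formula is not reproduced here, the sketch below uses a Gaussian-shaped center-weighted mask as a stand-in; it only mirrors the stated behavior that the weight grows toward the center (c_x, c_y) and decays with distance, controlled by σ.

    # Assumption: a Gaussian-shaped mask standing in for the unreproduced water-ripple formula.
    import torch

    def center_weighted_mask(h: int, w: int, cx: float, cy: float, sigma: float) -> torch.Tensor:
        ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing="ij")
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))   # weight -> 1 at the center, decays with distance

    # The mask multiplies the per-pixel box-regression loss so that low-quality predictions
    # far from the object center are down-weighted during back-propagation.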

The above description of the embodiments is only intended to facilitate the understanding of the method of the invention and its core ideas. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
