Asphalt pavement damage identification method based on target detection

Document No.: 1938464 · Publication date: 2021-12-07

Reading note: this technology, "Asphalt pavement damage identification method based on target detection", was designed by 石光辉, 武海波, 陶国权, 王绍兵, 刘明, 徐同文, 李世国, 陈涛, 马登飞, 董博 and 万海峰 on 2021-09-13. Main content: The invention discloses an asphalt pavement damage identification method based on target detection. Building on a transformer-based neural network architecture, it proposes the AP-DETR model for asphalt pavement damage detection. A deformable convolution is constructed that uses offsets, learned from the target task, to increase the spatial sampling positions and thereby capture features of geometric deformation better. A convolution projection attention obtains additional model attention in local space, so that for the asphalt pavement damage addressed by the invention, such as cracks, pits and spalling, the overall features of the damage are captured better and the damage is classified more accurately. A locally enhanced feed-forward module (LeFF) is constructed that combines the strengths of a CNN, such as extracting low-dimensional features and strengthening locality, with the strengths of a transformer, such as associating long-range dependencies, to achieve a better modeling effect.

1. An asphalt pavement damage identification method based on target detection, characterized by comprising the following steps:

S201: loading an asphalt pavement image to be detected;

S202: preprocessing the image;

every input image is resized to 640 × 640;

S203: loading a trained improved AP-DETR target detection network model, wherein the AP-DETR model comprises three main parts: a convolutional neural network (CNN) for extracting the image feature representation, a transformer with an encoder-decoder structure, and a prediction head for the final target detection;

S204: detecting damage in the asphalt pavement image;

the image data is sent into the network model for prediction to obtain the classification and position information of each target;

S205: obtaining the target detection result for the asphalt pavement damage.

2. The method for identifying the damage of the asphalt pavement based on the target detection as claimed in claim 1, wherein the improved AP-DETR target detection network model is constructed by the following method:

S101: preparing a training image data set

An image of pavement damage is used as the initial input image, the image having a size of 3 × H0 × W0;

S103: constructing the AP-DETR target detection network model

The initial input image is fed into a convolutional neural network, which generates a feature map of size C × H × W; this output is sent to a deformable convolution function (DeformableConv2D), after which the channel dimension C of the high-order feature map becomes d, so that a new feature map of size d × H × W is generated;

a flattening function is applied to the feature map of size d × H × W, concatenating the columns of each H × W matrix into one continuous column vector, so that the size of the feature map becomes d × HW; the result is denoted A;

feature map A is provided as input to the transformer encoder, and the output of the encoder is provided as input to the decoder; at the same time, the decoder converts N embeddings of size d into output embeddings; these embeddings are learned position encodings, referred to as target queries;

the output embeddings are then decoded into detection box coordinates (the center point, height and width of the rectangular box) and class labels, giving N final predictions; through the FPN (i.e., feed-forward network) computation, the final prediction layer predicts the normalized center coordinates, height and width of the target bounding box; in the classification module, the class label is predicted by a softmax function;

S104: starting training and initializing the weight matrix

The weight matrix is initialized from a normal distribution: the initial weight values are sampled from a truncated normal distribution with a variance of 0.01;

S105: forward propagation

The input signal passes through the weight matrices to produce the output of each layer and finally the predicted value at the output layer;

S106: computing a focus efficient loss function

S107: back propagation

After forward propagation has produced a network prediction from an arbitrary set of randomly initialized parameters, the prediction is corrected by updating the parameters using the gradient of the loss function with respect to each parameter;

S108: updating the weight matrix

The weight matrix is updated according to the parameter gradients obtained by back propagation so as to reduce the loss function;

S109: if the maximum number of training iterations has not been reached, the method returns to step S105 and continues with forward propagation; otherwise, the best-performing model is saved in binary form.

3. The method for identifying damage to an asphalt pavement based on object detection as claimed in claim 1, wherein a step for augmenting the training data is further included between step S101 and step S103, specifically:

S102: training data augmentation

S1021: cropping out and augmenting existing asphalt pavement damage pictures: using Python-OpenCV, the damaged part of a picture containing asphalt pavement damage is cut out as an individual picture that contains only one type of pavement damage; Gaussian noise, salt-and-pepper noise, brightness reduction, image sharpening, proportional downscaling and the like are then applied to the picture to simulate asphalt pavement damage images;

S1022: generating multiple pictures containing asphalt pavement damage by a copy-paste method: an asphalt pavement picture without any damage is selected, 3 individual damage pictures generated in step S1021 are randomly selected and copy-pasted onto it, and the resulting pictures are combined with the image data set of step S101 to form the training image data set, increasing the number of images in the data set.

4. The method for identifying damage to an asphalt pavement based on object detection according to claim 2,

the encoder comprises a stack of M = 6 identical coding layers, and the decoder comprises a stack of M = 6 identical decoding layers; the output of the encoder is sent to each decoding layer, and the first decoding layer also receives the target queries and the sum of the position encoding and the feature map as data input; the output of the ResNet-50 backbone network is flattened and summed with the position encoding, and the sum is sent to a coding layer with multi-head self-attention;

within the stack of M = 6 identical coding layers, the output of each coding layer is the input of the next coding layer, and the output of the M-th (here M = 6) coding layer is the input of each decoding layer, all with the same size.

5. The method for identifying damage to an asphalt pavement based on object detection according to claim 4,

in the input of the encoder, Q represents a query vector, K represents a vector of the relevance of the queried information to other information, and V represents a vector of the queried information; in the coding layer, K, Q and V are calculated as follows: K and Q are the summation result and V is the output of the original backbone network; the output of the multi-head self-attention is sent to the convolution projection attention, which obtains additional attention; the output W of the original backbone network is then used as the residual connection input and added to the output of the convolution projection attention, and the result passes through a normalization layer; the resulting output is denoted X;

the output of the normalization layer is fed into a locally enhanced feed-forward (LeFF) module, with the size unchanged; the output X of the normalization layer is also added, as a residual connection, to the output of the LeFF module; finally, the result passes through a normalization layer, again with the size unchanged.

6. The method for identifying damage to an asphalt pavement based on object detection according to claim 4,

the decoding layer of the decoder starts with multi-head self-attention, which takes Q, K and V as input; in addition to the target queries, the output of the decoding layer (the target) serves as input to the multi-head self-attention of the decoding layer; for the first decoding layer the target is initialized to 0, and each subsequent decoding layer uses the output of the previous decoding layer, of size 256 × 100; Q and K are calculated as target query + target, and V is the target; after multi-head self-attention the size is still 256 × 100; the multi-head self-attention output is sent to the convolution projection attention module and is also added, as a residual connection, to the output of the convolution projection attention module, giving an output of size 256 × 100; this output passes through a normalization layer, again with size 256 × 100, and the result is denoted Y;

to prepare the multi-head attention of the decoding layer, its inputs Q, K and V must be obtained separately, which requires the output of the coding layer, the position encoding, the target queries and the layer normalization output Y; the values are calculated as follows: Q = Y + target query, K = position encoding + encoder output, and V = encoder output; the output of the multi-head attention is 256 × 100; the target is taken as the residual connection input and added to the output of the multi-head attention, again 256 × 100; the result is normalized, with size 256 × 100, and denoted Z; Z is then sent to a locally enhanced feed-forward (LeFF) module, with the size unchanged; Z is also added, as a residual connection, to the output of the LeFF module; the result passes through a further normalization layer, giving a result of size 256 × 100 denoted U; U is the output of the decoding layer and is sent to the next decoding layer as the target.

7. The method for identifying the damage to the asphalt pavement based on the target detection as claimed in claim 2, wherein the LeFF modules in the coding layer and the decoding layer specifically adopt the following steps:

first, the input given by the previous module is split into block data and a class data block; a linear projection then expands the block data to a higher dimension, where e is the expansion rate;

second, the block data is restored to the "original image" layout in the spatial dimension according to its original relative positions;

third, a depthwise convolution with kernel size k is applied to the restored data blocks, strengthening the representation of the relationships between neighboring data;

fourth, the data blocks are flattened into a sequence;

finally, the data blocks are projected back to the initial dimension and then concatenated with the class data; each linear projection and the depthwise convolution is followed by batch normalization and an activation function;

these steps can be described by the formula:

8. The method for identifying the damage to the asphalt pavement based on the target detection as claimed in claim 5 or 6, wherein the input of the convolution projection attention is first reshaped into a two-dimensional feature map: the vector is divided into small vectors, which are tiled into a two-dimensional matrix; the convolution projection is then implemented by a depthwise separable convolution layer with kernel size s;

the projection output is flattened into one dimension, the flattening operation concatenating the columns of the matrix into one continuous column vector, which provides the input for the subsequent multi-head attention and deformable convolution; the formula is expressed as follows:

$$x_i^{q/k/v} = \mathrm{Flatten}\big(\mathrm{Conv2d}(\mathrm{Reshape2D}(x_i),\ s)\big)$$

here, $x_i^{q/k/v}$ is the data input of Q/K/V at layer $i$, $x_i$ is the data before the convolution projection, Conv2d is a depthwise separable convolution, and $s$ is the size of the convolution kernel.

9. A server comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the computer program performs the steps of the method for identifying damage to an asphalt pavement based on object detection according to any one of claims 1 to 8.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for identifying damage to an asphalt pavement based on object detection according to any one of claims 1 to 8.

Technical Field

The invention relates to the technical field of traffic and computer vision, in particular to a method for identifying damage of an asphalt pavement based on target detection.

Background

Roads play an important role in modern traffic systems and road networks, so detecting and preventing damage to asphalt pavements with computer-vision-based methods is of great importance. Traditionally, asphalt pavement damage is detected by manual inspection, which is time-consuming, labor-intensive and extremely inefficient. Computer-vision-based detection can identify and assess pavement damage automatically, but it faces the following technical problems. First, existing models are not accurate enough: for damage with irregular geometry, such as cracks and spalling, the visual geometric features are hard to capture, so recognition errors occur easily. Second, existing models have difficulty combining local damage with the global characteristics of the whole pavement; they generally over-emphasize local damage features and neglect the damage of the pavement as a whole, so that parts of the pavement are misjudged. Third, pavement damage shows a certain correlation between neighboring regions, but existing models cannot take long-range feature correlations into account, so they identify only one region and usually ignore its neighbors.

Disclosure of Invention

To solve these problems, the invention proposes the AP-DETR model for asphalt pavement damage detection, built on a transformer-based neural network architecture. A deformable convolution is constructed that uses offsets, learned from the target task, to increase the spatial sampling positions and thereby capture features of geometric deformation better. A convolution projection attention obtains additional model attention in local space, so that the overall features of pavement damage such as cracks, pits and spalling are captured better and the damage is classified more accurately. A locally enhanced feed-forward module (LeFF) is constructed that combines the strengths of a CNN, such as extracting low-dimensional features and strengthening locality, with the strengths of a transformer, such as associating long-range dependencies, to achieve a better modeling effect. The model and method can therefore improve identification accuracy and achieve more accurate asphalt pavement damage detection.

Specifically, the method for identifying the damage of the asphalt pavement based on target detection comprises the following steps:

S201: loading an asphalt pavement image to be detected;

S202: preprocessing the image;

every input image is resized to 640 × 640;

S203: loading a trained improved AP-DETR target detection network model, wherein the AP-DETR model comprises three main parts: a convolutional neural network (CNN) for extracting the image feature representation, a transformer with an encoder-decoder structure, and a prediction head for the final target detection;

S204: detecting damage in the asphalt pavement image;

the image data is sent into the network model for prediction to obtain the classification and position information of each target;

S205: obtaining the target detection result for the asphalt pavement damage.

The improved AP-DETR target detection network model is constructed by the following method:

S101: preparing a training image data set

An image of pavement damage is used as the initial input image, the image having a size of 3 × H0 × W0;

S103: constructing the AP-DETR target detection network model

The initial input image is fed into a convolutional neural network, which generates a feature map of size C × H × W; this output is sent to a deformable convolution function (DeformableConv2D), after which the channel dimension C of the high-order feature map becomes d, so that a new feature map of size d × H × W is generated;

a flattening function is applied to the feature map of size d × H × W, concatenating the columns of each H × W matrix into one continuous column vector, so that the size of the feature map becomes d × HW; the result is denoted A;

feature map A is provided as input to the transformer encoder, and the output of the encoder is provided as input to the decoder; at the same time, the decoder converts N embeddings of size d into output embeddings; these embeddings are learned position encodings, referred to as target queries;

the output embeddings are then decoded into detection box coordinates (the center point, height and width of the rectangular box) and class labels, giving N final predictions; through the FPN (i.e., feed-forward network) computation, the final prediction layer predicts the normalized center coordinates, height and width of the target bounding box; in the classification module, the class label is predicted by a softmax function; a special class label indicates that no target was detected;

S104: starting training and initializing the weight matrix

The weight matrix is initialized from a normal distribution: the initial weight values are sampled from a truncated normal distribution with a variance of 0.01;

S105: forward propagation

The input signal passes through the weight matrices to produce the output of each layer and finally the predicted value at the output layer;

S106: computing a focus efficient loss function

S107: back propagation

After forward propagation has produced a network prediction from an arbitrary set of randomly initialized parameters, the prediction is corrected by updating the parameters using the gradient of the loss function with respect to each parameter;

S108: updating the weight matrix

The weight matrix is updated according to the parameter gradients obtained by back propagation so as to reduce the loss function;

S109: if the maximum number of training iterations has not been reached, the method returns to step S105 and continues with forward propagation; otherwise, the best-performing model is saved in binary form.
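
The control flow of steps S104 to S109 can be illustrated with a minimal sketch. It is not the AP-DETR training code: a single linear layer stands in for the network, mean squared error stands in for the loss of step S106, and the names `truncated_normal` and `train`, the learning rate and the truncation at two standard deviations are all assumptions made for illustration. Only the variance of 0.01 (standard deviation 0.1) comes from the text.

```python
import numpy as np

def truncated_normal(rng, shape, std=0.1, bound=2.0):
    """Sample N(0, std^2) and redraw values beyond bound*std (S104).
    std=0.1 matches a variance of 0.01; bound=2.0 is an assumption."""
    w = rng.normal(0.0, std, size=shape)
    mask = np.abs(w) > bound * std
    while mask.any():
        w[mask] = rng.normal(0.0, std, size=int(mask.sum()))
        mask = np.abs(w) > bound * std
    return w

def train(x, y, max_iters=200, lr=0.1, seed=0):
    """Toy loop over S104-S109 for a linear model y ~ x @ w."""
    rng = np.random.default_rng(seed)
    w = truncated_normal(rng, (x.shape[1], 1))   # S104: initialize weights
    best_loss, best_w = np.inf, w.copy()
    for _ in range(max_iters):                   # S109: stop at max iterations
        pred = x @ w                             # S105: forward propagation
        err = pred - y
        loss = float(np.mean(err ** 2))          # S106: stand-in loss (MSE)
        if loss < best_loss:                     # remember the best weights so far
            best_loss, best_w = loss, w.copy()
        grad = 2.0 * x.T @ err / len(x)          # S107: gradient w.r.t. weights
        w = w - lr * grad                        # S108: update the weight matrix
    return best_w, best_loss
```

In an actual implementation the best weights would be chosen on validation data and written to disk rather than returned.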

On the basis of this scheme, a step for augmenting the training data is further included between step S101 and step S103, specifically:

S102: training data augmentation

S1021: cropping out and augmenting existing asphalt pavement damage pictures: using Python-OpenCV, the damaged part of a picture containing asphalt pavement damage is cut out as an individual picture that contains only one type of pavement damage; Gaussian noise, salt-and-pepper noise, brightness reduction, image sharpening, proportional downscaling and the like are then applied to the picture to simulate asphalt pavement damage images;

S1022: generating multiple pictures containing asphalt pavement damage by a copy-paste method: an asphalt pavement picture without any damage is selected, 3 individual damage pictures generated in step S1021 are randomly selected and copy-pasted onto it, and the resulting pictures are combined with the image data set of step S101 to form the training image data set, increasing the number of images in the data set.

The invention also provides a server, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the asphalt pavement damage identification method based on the target detection.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described target detection-based asphalt pavement damage identification method.

For pavement damage such as cracks, pits and spalling, the method captures the overall features of the damage better and classifies it more accurately. The method constructs the locally enhanced feed-forward module LeFF, which combines the strengths of a CNN, such as extracting low-dimensional features and strengthening locality, with the strengths of a transformer, such as associating long-range dependencies, to achieve a better effect.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 shows the overall architecture and principle of the AP-DETR, the lower diagram being an enlarged view of the predicted image;

FIG. 3 shows the encoder and decoder;

FIG. 4 shows the structure and details of the coding layer and the decoding layer;

FIG. 5 shows the sampling positions in the standard convolution and the deformable convolution (kernel size 3 × 3);

FIG. 6 shows the deformable convolution (3 × 3);

FIG. 7 shows the convolution projection attention;

FIG. 8 shows the locally enhanced feed-forward (LeFF) module;

FIG. 9 shows the recognition results for asphalt pavement damage;

FIG. 10 is a graph of the training loss;

FIG. 11 is a graph of the validation loss;

FIG. 12 is a graph of the model metrics.

Detailed Description

In order to make the objects and advantages of the invention more apparent, the following description of the embodiments of the invention with reference to the accompanying drawings is provided for a clear and complete description of the invention.

Example 1

With reference to FIG. 1, the detection model used in the asphalt pavement damage identification method based on target detection is constructed through the following steps:

S101: preparing a training image data set

The initial input image of pavement damage has a size of 3 × H0 × W0 (there are 3 color channels, H0 is the height and W0 is the width). In the present invention, the original images from the camera or drone are all resized to 3 × 640 × 640.

S102: training data enhancement

From step S101, about 2000 pictures, selected from the data set and manually annotated with the LabelMe software, are available as training data. However, there are 3 different types of pavement damage (pits, cracks and spalling), and damage must also be recognized under various extreme conditions (e.g., very dim light or reflections on rainy days), so the training data is still insufficient and the data set needs to be enriched by data augmentation. This is done in two steps:

S1021: cropping out and augmenting existing asphalt pavement damage pictures: using a Python-OpenCV program, the damaged part of a picture containing asphalt pavement damage is cut out as an individual picture that contains only one type of pavement damage; Gaussian noise, salt-and-pepper noise, brightness reduction, image sharpening, proportional downscaling and the like are then applied to the picture to simulate asphalt pavement damage images;

S1022: the copy-paste method is used to generate multiple pictures containing asphalt pavement damage such as pits, cracks and spalling. An asphalt pavement picture without any damage is selected, 3 individual damage pictures generated in step S1021 are randomly selected and copy-pasted onto it, and the resulting pictures are combined with the image data set of step S101 to form the training image data set, increasing the number of images in the data set. This step generates 3000 additional pictures.
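
The two augmentation steps can be sketched as follows. This is an illustrative NumPy version rather than the Python-OpenCV program mentioned above; the function names, the noise level `sigma`, the corruption fraction `amount` and the brightness `factor` are assumptions, and images are taken to be `uint8` arrays of shape (H, W, 3).

```python
import numpy as np

def add_gaussian_noise(img, rng, sigma=10.0):
    """Add zero-mean Gaussian noise and clip back to the 0..255 range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_pepper(img, rng, amount=0.02):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    out = img.copy()
    r = rng.random(img.shape[:2])
    out[r < amount / 2] = 0          # pepper
    out[r > 1 - amount / 2] = 255    # salt
    return out

def reduce_brightness(img, factor=0.6):
    """Darken the image by scaling every pixel value."""
    return (img.astype(np.float64) * factor).astype(np.uint8)

def copy_paste(background, patches, rng):
    """Paste each damage patch at a random position on an undamaged image.
    Returns the new image and one (x, y, w, h) box per pasted patch."""
    out = background.copy()
    boxes = []
    bg_h, bg_w = out.shape[:2]
    for p in patches:
        h, w = p.shape[:2]
        y = int(rng.integers(0, bg_h - h + 1))
        x = int(rng.integers(0, bg_w - w + 1))
        out[y:y + h, x:x + w] = p
        boxes.append((x, y, w, h))
    return out, boxes
```

The boxes returned by `copy_paste` can serve directly as annotations for the generated picture, since the paste positions are known. Patches may overlap in this sketch; a real pipeline might reject overlapping placements.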

Through steps S101 and S102, a data set of 5000 images in total, containing various types of asphalt pavement damage, is obtained. All labeled data are then split in an 8:1:1 ratio into a training set (4000 images), a validation set (500 images) and a test set (500 images).
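
A minimal sketch of such an 8:1:1 split is given below. The function name `split_dataset` and the fixed shuffle seed are assumptions; the text does not say how the images are assigned to the three sets.

```python
import random

def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/validation/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)       # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # remainder goes to the test set
    return train, val, test
```

With 5000 items this yields 4000, 500 and 500 items respectively, matching the set sizes stated above.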

S103: constructing an AP-DETR target detection network model

The general architecture and principle of the AP-DETR are shown in FIG. 2.

It comprises three main parts: a convolutional neural network (CNN) for extracting the image feature representation, a transformer with an encoder-decoder structure, and a prediction head for the final target detection.

Specifically, a convolutional neural network (CNN) backbone (ResNet-50) generates a feature map of size C × H × W (C = 2048, H = H0/32 and W = W0/32), where C is the channel dimension of the high-order feature map. This output is fed to a deformable convolution function (DeformableConv2D) as its input. As a result, the channel dimension of the high-order feature map becomes d and a new feature map of size d × H × W is generated, where d is the lower dimension after reduction (in this embodiment d = 256, so the size is 256 × 20 × 20). A transformer-based model contains neither recurrence nor convolution. To take the order of the sequence into account, order information about the token at each position is injected. To this end, a position encoding of size 256 × 20 × 20 is obtained from the output of the ResNet-50 backbone network. However, the input of the coding layer must be a sequence, so the spatial dimensions are compressed into one dimension with a flattening function: the columns of the 20 × 20 matrix are concatenated into one continuous column, giving a column vector of length 400. After flattening, a feature map of size d × HW (256 × 400) is obtained. This feature map is denoted A and serves as input to the encoder and decoder.
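
The shape bookkeeping in this paragraph can be checked with a short sketch. The helper names are hypothetical, and the column-wise concatenation follows the wording above; other implementations flatten row by row, which works equally well as long as the position encoding is flattened in the same order.

```python
import numpy as np

def backbone_output_shape(h0, w0, channels=2048, stride=32):
    """Shape (C, H, W) of the backbone feature map for an h0 x w0 input."""
    return channels, h0 // stride, w0 // stride

def flatten_columns(feature_map):
    """(d, H, W) -> (d, H*W): concatenate the columns of each H x W slice."""
    d, h, w = feature_map.shape
    # Swapping H and W first makes the row-major reshape walk column by column.
    return feature_map.transpose(0, 2, 1).reshape(d, h * w)
```

For a 640 × 640 input, `backbone_output_shape(640, 640)` gives `(2048, 20, 20)`, and flattening a 256 × 20 × 20 map gives the 256 × 400 feature map A.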

The input size of the transformer encoder is 256 × 400, and its output has the same size. The encoder output, of size 256 × 400, is fed to the decoder as input. Meanwhile, the decoder converts N (in this invention, N = 100) embeddings of size d (here, d = 256) into output embeddings. As shown in FIG. 2, these embeddings are learned position encodings, referred to as target queries.

The output embeddings are then decoded into detection box coordinates (the center point, height and width of the rectangular box) and class labels, giving N final predictions. Through the FPN (i.e., feed-forward network) computation, the final prediction layer predicts the normalized center coordinates, height and width of the target bounding box. In the classification module, the class label is predicted by a softmax function (softmax is used for multi-class classification: it maps the outputs of several neurons into the interval (0, 1), which can be interpreted as probabilities). Note that N (N = 100) is large compared with the actual number of objects in an image. A special class label 0 indicates that no object is detected.
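
How the N predictions might be turned into final detections is sketched below. The function names are hypothetical, the logits and boxes are taken as given inputs, and treating index 0 as the "no object" class follows the sentence above; the conversion from normalized (center, width, height) boxes to pixel corners for a 640 × 640 image is an illustrative choice.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_predictions(class_logits, boxes, no_object=0, img_size=640):
    """class_logits: (N, classes); boxes: (N, 4) normalized (cx, cy, w, h).
    Returns (label, score, (x1, y1, x2, y2)) for every non-empty prediction."""
    probs = softmax(class_logits)
    labels = probs.argmax(axis=-1)
    results = []
    for label, p, (cx, cy, w, h) in zip(labels, probs, boxes):
        if label == no_object:
            continue                      # drop "no object" slots
        corners = ((cx - w / 2) * img_size, (cy - h / 2) * img_size,
                   (cx + w / 2) * img_size, (cy + h / 2) * img_size)
        results.append((int(label), float(p[label]), corners))
    return results
```

Because N is much larger than the number of objects in a typical image, most of the N slots are expected to fall into the "no object" class and be discarded.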

As shown in FIG. 3, the encoder comprises a stack of M = 6 identical coding layers, and the decoder comprises a stack of M = 6 identical decoding layers. The output of the encoder is fed to each decoding layer; the first decoding layer also receives the target queries and the sum of the position encoding and the feature map as data input. The structure of each coding layer and decoding layer is shown in FIG. 4. The flattened output of the backbone network has a size of 256 × 400 (denoted W here). The flattened backbone output and the position encoding, each of size 256 × 400, are summed; the sum is again 256 × 400 and is sent to the coding layer with multi-head self-attention.

In the attention model, Q represents a query vector, K represents a vector of the relevance of the queried information to other information, and V represents a vector of the queried information. In the coding layer, K, Q and V are calculated as follows: K and Q are the summation result and V is the output of the original backbone network. After multi-head self-attention, the size is still 256 × 400. The output is then sent to an additional attention layer, the convolution projection attention, which obtains additional attention; the size is again 256 × 400. The output W of the original backbone network is then used as the residual connection input and added to the output of the convolution projection attention. A residual connection adds the output of an earlier layer onto the output of the current layer, which helps avoid vanishing and exploding gradients in deeper networks. The result then passes through a normalization layer. The resulting output has a size of 256 × 400 and is denoted X.

The output of the normalization layer is fed into a locally enhanced feed-forward (LeFF) module, again with the size unchanged. The output X of the normalization layer is also added, as a residual connection, to the output of the LeFF module, which again helps avoid vanishing and exploding gradients in deeper networks. Finally, the result passes through a normalization layer and has a size of 256 × 400. Within the stack of M = 6 identical coding layers, the output of each coding layer is the input of the next coding layer, and the output of the M-th (M = 6) coding layer is the input of each decoding layer, all of size 256 × 400.
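
The data flow of one coding layer can be summarized in a short sketch. It makes several simplifying assumptions: a single attention head without learned projection matrices replaces the multi-head self-attention, the convolution projection attention and the LeFF module are passed in as placeholder callables, and tensors are stored tokens-first, i.e. as 400 × 256 rather than 256 × 400. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encoder_layer(w_feat, pos, conv_proj_attn, leff):
    """w_feat: flattened backbone output W; pos: position encoding."""
    s = w_feat + pos                               # summation result
    attn = attention(s, s, w_feat)                 # Q = K = sum, V = W
    x = layer_norm(w_feat + conv_proj_attn(attn))  # residual with W, norm -> X
    return layer_norm(x + leff(x))                 # residual with X, norm
```

Calling `encoder_layer(w, pos, lambda t: t, lambda t: t)` with identity placeholders preserves the 400 × 256 shape, mirroring the "size unchanged" remarks in the text.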

The decoding layer starts with multi-head self-attention which, like the other attention modules, needs Q, K and V as inputs. In addition to the target query, the output of the previous decoding layer (labelled "target" in the figure) also serves as input to the multi-head self-attention of the decoding layer. For the first decoding layer the target is initialized to 0; each subsequent decoding layer takes the output of the previous decoding layer, of size 256 × 100. Q and K are computed as Q = K = target query + target, and V as V = target. After multi-head self-attention the size is still 256 × 100. The output of the multi-head self-attention is sent to the convolution projection attention module and, as a residual connection, is also added to the output of that module, giving an output of size 256 × 100. This output passes through a normalization layer, again 256 × 100, and the result is denoted Y.

To prepare for the multi-head attention of the decoding layer, its inputs Q, K and V are obtained from the encoder output, the position encoding, the target query and the layer-normalization output Y: Q = Y + target query, K = position encoding + encoder output, and V = encoder output. The output of the multi-head attention is again 256 × 100. The target is the residual connection input and is added to the output of the multi-head attention, again giving 256 × 100. The result is passed through a normalization layer, and the 256 × 100 output is denoted Z. Z is then sent to the locally enhanced feed-forward (LeFF) module, which leaves the size unchanged. Z also serves as a residual connection and is added to the output of the LeFF module. The result is normalized once more to give a 256 × 100 output, denoted U. U is the output of the decoding layer and is sent to the next decoding layer as its target.
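
The two attention stages of a decoding layer can be traced with the following NumPy sketch. It is an illustration under assumptions only: single-head attention without learned weights replaces multi-head attention, the convolution projection attention and the LeFF module are left out, tokens are stored row-wise (100 × 256 and 400 × 256), and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (a simplification)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def decoder_attention(target, query, enc_out, pos):
    # Self-attention: Q = K = target query + target, V = target
    qk = query + target
    y = layer_norm(attention(qk, qk, target))
    # Second attention: Q = Y + target query,
    # K = position encoding + encoder output, V = encoder output
    cross = attention(y + query, pos + enc_out, enc_out)
    # The target is the residual input, followed by normalization
    z = layer_norm(target + cross)
    return y, z

rng = np.random.default_rng(0)
target = np.zeros((100, 256))               # first decoding layer: target = 0
query = rng.standard_normal((100, 256))     # target queries (random stand-in)
enc_out = rng.standard_normal((400, 256))   # encoder output
pos = rng.standard_normal((400, 256))       # position encoding (random stand-in)

Y, Z = decoder_attention(target, query, enc_out, pos)
print(Y.shape, Z.shape)
```

With the target initialized to 0, V is all zeros in the first self-attention, so Y is zero in this simplified sketch and the first layer is driven by the target queries and the encoder output.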

Deformable convolution is used in the ResNet-50 backbone network to enlarge the set of spatial sampling positions by adding offsets that are learned from the target task. The standard convolution module of a convolutional neural network (CNN), by contrast, is inherently limited in modeling geometric deformation because of its fixed geometric structure.

Compared with standard convolution, deformable convolution adds a two-dimensional offset to each sampling position of the regular grid, allowing the sampling grid to deform freely.

The sampling positions of the standard convolution and of the deformable convolution are depicted in fig. 5. For deformable convolution, additional convolutional layers learn the offsets from the preceding feature map. The deformation is therefore conditioned on the input features in a local, dense and adaptive manner.

In a standard convolutional neural network (CNN), both the feature maps and the convolutions are three-dimensional. For clarity of description and illustration, the two-dimensional case is used here as an example. A two-dimensional convolution consists of two steps:

1) sampling the input feature map x over a regular grid R;

2) summing the sampled values weighted by w.

The grid R defines the size and dilation of the receptive field.

For each position $p_0$ on the output feature map y, the following formula holds:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$

where $p_n$ enumerates the positions in the grid R.

In deformable convolution, the regular grid R is augmented with the offsets $\{\Delta p_n \mid n = 1, 2, \ldots, N\}$, where $N = |R|$.

Thus, the above formula can be rewritten as:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

Because of this change, sampling takes place at the irregular, offset positions $p_n + \Delta p_n$.

As shown in fig. 6, the offsets are obtained by applying a convolutional layer to the same input feature map. This convolutional layer has the same kernel spatial size and dilation as the current convolutional layer. The output offset field has the same spatial resolution as the input feature map, and its 2N channels correspond to N two-dimensional offsets. Deformable convolution therefore improves the ability of the convolutional neural network (CNN) to model geometric deformation, and using it when extracting feature maps in the ResNet-50 backbone network makes the model better suited to capturing the deformed features of asphalt pavement damage.
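
To make the sampling rule concrete, the sketch below evaluates y(p0) = sum over n of w(pn) · x(p0 + pn + Δpn) at a single output position of a single-channel feature map. It rests on assumptions that the text does not spell out: a 3 × 3 grid with dilation 1, bilinear interpolation for fractional positions (as is usual for deformable convolution), zero values outside the map, and offsets passed in directly instead of being predicted by an offset-learning convolutional layer. The function names are hypothetical.

```python
import numpy as np

# Regular 3x3 grid R (dilation 1): positions p_n relative to the center
GRID_R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def bilinear_sample(x, py, px):
    """Sample x at the fractional location (py, px); zero outside the map."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(py - yy)) * (1 - abs(px - xx))
                value += weight * x[yy, xx]
    return value

def deformable_conv_at(x, w, p0, offsets):
    """Sum of w(p_n) * x(p0 + p_n + delta_p_n) over the grid R."""
    total = 0.0
    for n, (dy, dx) in enumerate(GRID_R):
        oy, ox = offsets[n]
        total += w[dy + 1, dx + 1] * bilinear_sample(
            x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total

x = np.arange(25, dtype=float).reshape(5, 5)   # toy feature map
w = np.ones((3, 3))                            # toy kernel weights
zero_offsets = [(0.0, 0.0)] * 9                # reduces to standard convolution
half_shift = [(0.0, 0.5)] * 9                  # every sample moved half a pixel right

standard = deformable_conv_at(x, w, (2, 2), zero_offsets)
shifted = deformable_conv_at(x, w, (2, 2), half_shift)
print(standard, shifted)
```

With all offsets set to zero the result equals the standard convolution output, which is a convenient sanity check for any implementation of the offset mechanism.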

The purpose of the convolution projection attention is to obtain additional model attention in local space. In the convolution projection attention, the position-wise linear projection of the original multi-head self-attention is replaced by a depthwise separable convolution, so that the overall characteristics of pavement damage such as cracks, pits and spalling can be better captured and classified.

As shown in fig. 7, the input of the convolution projection attention (i.e., the multi-head self-attention output) is first reshaped into a two-dimensional feature map: the vectors are divided proportionally into several segments, and the resulting short vectors are tiled into a two-dimensional matrix. A depthwise separable convolutional layer with kernel size s then performs the convolutional projection.

Finally, the projection output is flattened into one dimension. The flattening operation concatenates the columns of the higher-dimensional matrix into a single column vector, which provides the input for the subsequent multi-head attention and deformable convolution. The formula can be expressed as follows:

$$x_i^{q/k/v} = \mathrm{Flatten}\big(\mathrm{Conv2d}(\mathrm{Reshape2D}(x_i),\ s)\big)$$

Here, $x_i^{q/k/v}$ is the Q/K/V input at the i-th layer, $x_i$ is the data before the convolution projection, Conv2d is a depthwise separable convolution (implemented as depthwise Conv2d -> batch normalization -> pointwise Conv2d), and s is the size of the convolution kernel. Reshape2D transforms its input into a two-dimensional feature map by rearranging the matrix; Flatten flattens the projection output into one dimension, again by rearranging the matrix.
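
The Reshape2D -> Conv2d -> Flatten chain can be illustrated with a small NumPy sketch. The following assumptions apply and are not taken from the original: stride 1 with zero "same" padding, an odd kernel size s, batch normalization omitted for brevity, tokens stored row-wise, and hypothetical function names.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """x: (H, W, C); kernels: (s, s, C); stride 1, zero 'same' padding."""
    s = kernels.shape[0]
    pad = s // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros(x.shape)
    for i in range(s):
        for j in range(s):
            # Each channel is filtered with its own s x s kernel
            out += xp[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out

def conv_projection(tokens, height, width, dw_kernels, pw_weights):
    n, c = tokens.shape
    assert n == height * width
    fmap = tokens.reshape(height, width, c)      # Reshape2D
    fmap = depthwise_conv2d(fmap, dw_kernels)    # depthwise s x s convolution
    fmap = fmap @ pw_weights                     # pointwise 1 x 1 convolution
    return fmap.reshape(n, -1)                   # Flatten

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))            # 4 x 4 map with 8 channels
identity_dw = np.zeros((3, 3, 8))
identity_dw[1, 1, :] = 1.0                       # kernel that passes the center pixel
identity_pw = np.eye(8)

projected = conv_projection(tokens, 4, 4, identity_dw, identity_pw)
print(projected.shape)
```

With identity kernels the projection returns its input unchanged; learned kernels would instead mix each token with its spatial neighbors before Q, K and V are formed.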

With convolution projection attention, the model integrates convolution into the transformer, capturing more local spatial context and reducing semantic ambiguity in the subsequent attention. The method makes full use of the advantages of a convolutional neural network (CNN), including local receptive fields, shared weights and spatial subsampling, while retaining the advantages of the transformer, including dynamic attention, global context fusion and better generalization. In particular, for pavement damage forms such as cracks, pits and spalling, the overall characteristics of the damage can be captured better through context and attention, so that damage types are classified more accurately.

In each encoding and decoding layer of the model, a locally enhanced feed-forward (LeFF) module replaces the original feed-forward module. This change strengthens the correlation between adjacent pixels in the spatial dimension, so that local information is extracted while long-range dependencies are still established.

As shown in fig. 8, the LeFF module takes the following steps:

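First, given the input $x_t^h \in \mathbb{R}^{(N+1) \times C}$ from the previous module (N blocks of dimension C plus one category entry), it is split into the block data $x_p^h \in \mathbb{R}^{N \times C}$ and the category data $x_c^h \in \mathbb{R}^{C}$. A linear projection then expands the block data to a higher dimension, giving $x_p^{l_1} \in \mathbb{R}^{N \times (e \times C)}$, where e is the expansion rate.

Second, based on their relative positions in the original image, the block data are restored to an "image" in the spatial dimension, giving $x_p^s \in \mathbb{R}^{\sqrt{N} \times \sqrt{N} \times (e \times C)}$.

Third, a depthwise convolution with kernel size k is applied to the restored data, which strengthens the representation of the relations between adjacent data and gives $x_p^d \in \mathbb{R}^{\sqrt{N} \times \sqrt{N} \times (e \times C)}$.

Fourth, the data are flattened into a sequence, giving $x_p^f \in \mathbb{R}^{N \times (e \times C)}$.

Finally, the data are projected back to the initial dimension, giving $x_p^{l_2} \in \mathbb{R}^{N \times C}$, and then concatenated with the category data to give $x_t^{h+1} \in \mathbb{R}^{(N+1) \times C}$. Each linear projection and the depthwise convolution is immediately followed by batch normalization (BN) and an activation function (Act).

These steps can be described by the formulas:

$$
\begin{aligned}
x_c^h,\ x_p^h &= \mathrm{Split}(x_t^h) \\
x_p^{l_1} &= \mathrm{Act}\big(\mathrm{BN}(\mathrm{Linear}_1(x_p^h))\big) \\
x_p^s &= \mathrm{SpatialRestore}(x_p^{l_1}) \\
x_p^d &= \mathrm{Act}\big(\mathrm{BN}(\mathrm{DWConv}(x_p^s))\big) \\
x_p^f &= \mathrm{Flatten}(x_p^d) \\
x_p^{l_2} &= \mathrm{Act}\big(\mathrm{BN}(\mathrm{Linear}_2(x_p^f))\big) \\
x_t^{h+1} &= \mathrm{Concat}(x_c^h,\ x_p^{l_2})
\end{aligned}
$$

where $x_t^h$ represents the input of the module; $x_c^h$ and $x_p^h$ represent the category data and the block data after splitting; $x_p^{l_1}$ represents the higher-dimensional vector produced by the first linear projection;

$x_p^s$ represents the image-shaped vector restored from the block data;

$x_p^d$ represents the vector with the enhanced representation of the relations between adjacent data;

$x_p^f$ represents the data flattened into a sequence;

$x_p^{l_2}$ represents the vector after the second linear projection;

$x_t^{h+1}$ represents the result vector obtained by concatenating the category data with the linearly projected vector.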

Therefore, by using the locally enhanced feed-forward (LeFF) module, the proposed model combines the strengths of a CNN, such as extracting low-dimensional features and strengthening locality, with the strengths of a transformer, such as modeling long-range dependencies, to achieve a better modeling effect.
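
The LeFF steps (split, expand, restore, depthwise convolution, flatten, project back, concatenate) can be sketched in NumPy as below. Several choices are assumptions made only for illustration: the category entry is taken to be the first row of the token sequence, ReLU stands in for the unspecified activation function, batch normalization is omitted, the number of blocks is a perfect square, and the weights are random. All names are hypothetical.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """x: (H, W, C); kernels: (k, k, C); stride 1, zero 'same' padding."""
    k = kernels.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros(x.shape)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def leff(tokens, w1, dw_kernels, w2):
    cls_tok, patches = tokens[:1], tokens[1:]     # split category / block data
    n = patches.shape[0]
    side = int(round(np.sqrt(n)))
    assert side * side == n, "block count must be a perfect square"
    x = relu(patches @ w1)                        # expand to e * C channels
    x = x.reshape(side, side, -1)                 # restore the spatial layout
    x = relu(depthwise_conv2d(x, dw_kernels))     # relate adjacent positions
    x = x.reshape(n, -1)                          # flatten to a sequence
    x = relu(x @ w2)                              # project back to C channels
    return np.concatenate([cls_tok, x], axis=0)   # re-attach the category data

rng = np.random.default_rng(0)
N, C, e, k = 16, 8, 4, 3
tokens = rng.standard_normal((N + 1, C))
w1 = rng.standard_normal((C, e * C)) * 0.1
dw = rng.standard_normal((k, k, e * C)) * 0.1
w2 = rng.standard_normal((e * C, C)) * 0.1

out = leff(tokens, w1, dw, w2)
print(out.shape)
```

The category data bypasses the convolutional path entirely, and the output has the same shape as the input, so the module is a drop-in replacement for a feed-forward block.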

S104: starting training, initializing weight matrix

Optionally, the weight matrix is initialized from a normal distribution; preferably, the initial weights are sampled from a truncated normal distribution with a variance of 0.01, so that the model converges faster in the subsequent training.
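
A truncated normal initialization can be sketched as follows. The variance of 0.01 (standard deviation 0.1) comes from the text; the truncation bound of two standard deviations, the zero mean, the resampling strategy and the function name are assumptions made for illustration.

```python
import numpy as np

def truncated_normal(shape, std=0.1, bound=2.0, seed=0):
    """Sample N(0, std^2) and redraw any value beyond bound * std."""
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, std, size=shape)
    mask = np.abs(values) > bound * std
    while mask.any():
        values[mask] = rng.normal(0.0, std, size=int(mask.sum()))
        mask = np.abs(values) > bound * std
    return values

weights = truncated_normal((256, 256))   # std 0.1 corresponds to variance 0.01
print(weights.shape, float(np.abs(weights).max()))
```

Note that 0.01 is taken here as the variance of the underlying normal distribution; after truncation the sample variance is somewhat smaller.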

S105: forward propagation

The input signal is propagated through the weight matrices to produce the output of each layer and, finally, the predicted value at the output layer;

S106: computing a focus efficient loss function

S107: back propagation

After forward propagation has produced a network prediction from an arbitrary set of random parameters, the parameters are corrected and updated using the gradient of the loss function with respect to each parameter;

S108: updating the weight matrix

The weight matrix is updated according to the parameter gradients obtained by back propagation, so as to reduce the loss function.

S109: if the maximum number of training iterations has not been reached, the method returns to step S105 and forward propagation continues; otherwise, the best-performing model is saved as a binary file.
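
The control flow of steps S104 to S109 can be shown on a toy problem. The sketch below is only an analogy: a single scalar weight replaces the weight matrices, a mean squared error replaces the loss function of S106, a fixed initial value replaces random initialization, and the best weight is kept in memory instead of being written to a file.

```python
def train(xs, ys, max_epochs=100, lr=0.01):
    w = 0.0                                   # S104: initialize the weight
    best_w, best_loss = w, float("inf")
    for _ in range(max_epochs):               # S109: loop until the maximum is reached
        preds = [w * x for x in xs]           # S105: forward propagation
        # S106: compute the loss (mean squared error as a stand-in)
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        if loss < best_loss:                  # remember the best model so far
            best_w, best_loss = w, loss
        # S107: back propagation, gradient of the loss with respect to w
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= lr * grad                        # S108: update the weight
    return best_w, best_loss

best_w, best_loss = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(best_w, best_loss)
```

On this toy data the weight converges towards 2, the slope that generated the targets.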

Example 2

An asphalt pavement damage identification method based on target detection comprises the following steps:

s201: loading an asphalt pavement image to be detected;

s202: preprocessing an image;

the size of the input image is uniformly processed to be 640 x 640 in size;

s203: loading a trained improved AP-DETR target detection network model;

s204: detecting a damaged image of the asphalt pavement;

sending the image data into a network model for prediction to obtain classification and position information of the target; the results are shown in FIG. 9;

s205: and obtaining a target detection result of the damage of the asphalt pavement.

To compare our proposed method with other existing methods, we performed a series of experiments.

Experiments were performed on an Ubuntu computer system with an Intel Core i7-8700 CPU, an NVIDIA GeForce GTX 1080 GPU and 16 GB of RAM.

The proposed AP-DETR model is trained by back propagation with Adam as the optimizer, a learning rate of 0.001 and a weight decay of 0.0005, for 100 epochs per training run. As with other object detection models, the loss of AP-DETR has two parts: a classification loss and a box regression loss. The box regression loss consists of the L1 regression loss and the GIoU loss.

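Let y be the set of ground-truth objects and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of N predictions. To find a bipartite matching between the ground truth and the predictions, we search for the permutation of N elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)$$

Here, $\mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$ is the pairwise matching cost between the ground truth $y_i$ and the corresponding prediction. The Hungarian algorithm is used to compute the optimal assignment efficiently.

The bounding box loss $\mathcal{L}_{\mathrm{box}}$ is defined as:

$$\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) = \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\big(b_i, \hat{b}_{\sigma(i)}\big) + \lambda_{L1}\, \big\lVert b_i - \hat{b}_{\sigma(i)} \big\rVert_1$$

Here, $\lambda_{\mathrm{iou}}$ and $\lambda_{L1}$ are hyperparameters, $\mathcal{L}_{\mathrm{iou}}$ is the GIoU loss, $b_i$ is the ground-truth bounding box and $\hat{b}_{\sigma(i)}$ is the predicted box.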
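
The matching and the box loss can be illustrated with a short, dependency-free sketch. It makes several simplifying assumptions: boxes are given as (x1, y1, x2, y2) with positive area, the matching cost uses only the box loss (the classification term is left out), the numbers of ground-truth boxes and predictions are equal, and an exhaustive search over permutations replaces the Hungarian algorithm, which is only feasible for very small N. The weights 2.0 and 5.0 are placeholder values, not values from the original.

```python
from itertools import permutations

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU of two boxes with positive area."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    hull = area((min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull

def box_loss(b, b_hat, lam_giou=2.0, lam_l1=5.0):
    """Weighted sum of the GIoU loss and the L1 distance (placeholder weights)."""
    l1 = sum(abs(u - v) for u, v in zip(b, b_hat))
    return lam_giou * (1.0 - giou(b, b_hat)) + lam_l1 * l1

def best_matching(truths, preds):
    """Brute-force search for the permutation sigma with the lowest total cost."""
    n = len(truths)
    return min(
        permutations(range(n)),
        key=lambda sigma: sum(box_loss(truths[i], preds[sigma[i]])
                              for i in range(n)),
    )

truths = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 7.0, 7.0)]
preds = [(5.0, 5.0, 7.0, 7.5), (0.0, 0.0, 2.0, 2.0)]
sigma = best_matching(truths, preds)
print(sigma)
```

Here `sigma[i]` is the index of the prediction assigned to ground-truth box i. For realistic numbers of predictions, an implementation of the Hungarian algorithm would replace the exhaustive search.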

We also ran the same experiments on popular existing object detection models such as SSD, YOLOv3, YOLOv4 and DETR, and evaluated their metrics on the same data set.

Precision (P) is used in the present invention to denote the proportion of true positives among all samples predicted as positive.

Recall (R) is used in the present invention to denote the proportion of all actual positives that are predicted as positive, i.e., how many positive samples are correctly detected. R = 100% means that no target is missed.
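
Written as formulas, with TP, FP and FN denoting the numbers of true positives, false positives and false negatives (standard notation assumed here, not taken from the original), these two definitions read:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

For example, 94 correct detections together with 6 false detections and 6 missed targets would give P = R = 94/100 = 94%.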

The mean average precision (mAP) is a widely used metric in the field of object detection. Two variants are used here: mAP_0.5 and mAP_0.5:0.95. In both, the intersection over union (IoU) between the predicted box and the ground-truth box is compared against a minimum threshold to decide whether a detection counts as true or false.

mAP_0.5 is calculated with a constant IoU threshold of 0.5. mAP_0.5:0.95 is calculated by averaging over the IoU thresholds in the interval [0.5, 0.95] in increments of 0.05.
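
The role of the IoU thresholds can be illustrated with a small sketch. It covers only the thresholding and the final averaging, not the full computation of average precision, and assumes boxes in (x1, y1, x2, y2) format; the function names and the per-threshold values passed to the averaging step are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP_0.5:0.95
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def matches_at_thresholds(pred, truth):
    """Whether a prediction counts as a true detection at each threshold."""
    overlap = iou(pred, truth)
    return {t: overlap >= t for t in THRESHOLDS}

def mean_over_thresholds(ap_by_threshold):
    """Average per-threshold AP values into a single 0.5:0.95 score."""
    return sum(ap_by_threshold[t] for t in THRESHOLDS) / len(THRESHOLDS)

flags = matches_at_thresholds((0.0, 0.0, 9.0, 9.0), (0.0, 0.0, 10.0, 10.0))
print(sum(flags.values()), "of", len(THRESHOLDS), "thresholds passed")
```

A prediction that overlaps the ground truth only moderately passes the 0.5 threshold but fails the stricter ones, which is why mAP_0.5:0.95 is lower than mAP_0.5 for the same model.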

In this embodiment, after 200 epochs of training, fig. 10 shows that the training losses (object confidence loss (obj), class loss (cls) and bounding box loss (box)) decrease very quickly at first and then decline steadily as training progresses, indicating that the model becomes increasingly stable.

As can be seen from fig. 11, the validation losses (object confidence loss (obj), class loss (cls) and bounding box loss (box)) also decrease very quickly at first and then decline steadily as training progresses, which means that the model performs stably on the validation set as well and does not merely fit the training set.

Fig. 12 shows precision and recall as training progresses. Both increase at the beginning of training and soon enter a period of steady growth. By 200 epochs, both precision and recall have essentially reached their maxima. In some individual epochs precision and recall are low, because a few noisy, difficult samples are present and the model is still adapting.

As can be seen from fig. 12, precision and recall increase steadily, reaching a final precision of about 94 and a recall of about 94. mAP_0.5, i.e., the mAP at an IoU threshold of 0.5, is about 96; mAP_0.5:0.95, the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, is about 91.

TABLE 1 comparative analysis table of test results of different models

As Table 1 shows, after 200 epochs of training, the model of the present invention achieves the best results on the augmented asphalt pavement damage data set, with a precision of 93.79, a recall of 94.41, an mAP_0.5 of 96.59 and an mAP_0.5:0.95 of 91.39. Compared with the second-best model, DETR, this is an improvement of 3.6 in precision, 1.0 in recall, 1.5 in mAP_0.5 and 4.64 in mAP_0.5:0.95, which is a clear gain.

These improvements arise for three reasons. First, by constructing deformable convolutions and learning the offsets from the target task, the model of the invention enlarges the set of spatial sampling positions and therefore captures geometric deformation better. Second, the model adopts convolution projection attention, which obtains additional model attention in local space, so that the overall characteristics of pavement damage such as cracks, pits and spalling are captured better and classified more accurately. Third, the invention constructs a locally enhanced feed-forward (LeFF) module, which combines the strengths of a CNN, such as extracting low-dimensional features and strengthening locality, with the strengths of a transformer, such as modeling long-range dependencies, to achieve a better modeling effect.

The technical features described above may be combined with one another to form embodiments not listed above, all of which are regarded as falling within the scope of the invention described in this specification. Moreover, modifications and variations may occur to those skilled in the art in light of the above teachings, and all such modifications and variations are intended to fall within the true spirit and scope of the invention as defined by the appended claims.
