Fine-grained image identification method based on saliency attention mechanism

Document No.: 1832045    Publication date: 2021-11-12

Reader's note: this technique, "Fine-grained image identification method based on saliency attention mechanism," was designed and created by 黄磊 (Huang Lei), 刘超 (Liu Chao), 魏志强 (Wei Zhiqiang), 李晓静 (Li Xiaojing) and 秦琦冰 (Qin Qibing) on 2021-07-12. Its main content is as follows: the invention discloses a fine-grained image recognition method based on a saliency attention mechanism, comprising: first, extracting features and processing each feature map in all channels with a saliency attention module to obtain a saliency map; generating and learning attention weights from the acquired saliency information, fusing the weights into the original feature map to obtain an attention-weighted feature map, and performing saliency attention feature enhancement; then constructing, in a top-down manner, groups of feature maps of different scales containing rich semantic and saliency information from the saliency-enhanced feature map groups; and completing fine-grained class mapping with these feature map groups while constraining the network-parameter optimization with two loss functions, a class loss and a ranking loss. The invention solves the problems of locating key regions in fine-grained images and the difficulty of extracting the corresponding features.

1. A fine-grained image identification method based on a saliency attention mechanism is characterized by comprising the following steps:

Step 1: extract features to obtain a three-dimensional original feature map;

Step 2: obtain a saliency map by processing each feature map in all channels with the saliency attention module: take the pixel-element maxima along the horizontal and vertical directions within each channel to obtain two sets of one-dimensional maximum-value matrices, thereby acquiring the horizontal and vertical saliency feature information of the feature map, determining the saliency information within each channel, and further locating the discriminative region in each channel's feature map;

Step 3: generate and learn attention weights from the saliency information obtained in Step 2, fuse the attention weights into the original feature maps of different sizes to obtain attention-weighted feature maps, and perform saliency attention feature enhancement;

Step 4: construct, in a top-down manner, feature map groups of different scales containing rich semantic and saliency information from the saliency-enhanced feature map groups;

Step 5: complete the fine-grained class mapping with the feature map groups obtained in the previous step, and constrain the network-parameter optimization process with two loss functions, a class loss and a ranking loss;

Step 6: train the model until it gradually converges, and save the network model with the highest experimental accuracy.

2. The fine-grained image recognition method based on the saliency attention mechanism of claim 1, characterized in that the three-dimensional feature map X obtained in Step 1, where X ∈ C × W × H, is denoted by the symbol M, and in Step 2 the saliency attention module obtains the saliency map through the following specific steps:

1) For each feature map in all channels, take the maximum pixel element of each row along the horizontal direction and concatenate these maxima into a one-dimensional matrix; that is, a one-dimensional matrix of dimension 1 × H is obtained from each feature map M_i. Accumulating over all channels yields C matrices of dimension 1 × H, giving the maximum-value matrix set of horizontal maxima.

2) For each feature map in all channels, take the maximum pixel element of each column along the vertical direction and concatenate these maxima into a one-dimensional matrix; that is, a one-dimensional matrix of dimension W × 1 is obtained from each feature map M_i. Accumulating over all channels yields C matrices of dimension W × 1, giving the maximum-value matrix set of vertical maxima.

Specifically, the two one-dimensional matrices of each feature map are obtained as

$$x''_{h}=\max_{1\le w\le W}x'(w,h),\qquad x''_{w}=\max_{1\le h\le H}x'(w,h),$$

where x′ denotes the pixel value after feature extraction and x″ the one-dimensional matrix after maximum-value screening; the two maximum-value matrix sets are then obtained as

$$S^{H}=\big\{x''^{(i)}_{h}\big\}_{i=1}^{C},\qquad S^{W}=\big\{x''^{(i)}_{w}\big\}_{i=1}^{C},$$

where S^{W} denotes the maximum-value matrix set of vertical (column-wise) values and S^{H} the maximum-value matrix set of horizontal (row-wise) values.

3. The fine-grained image recognition method based on the saliency attention mechanism of claim 1, characterized in that in Step 3, before the attention weights are learned, a dimension transformation is first performed so that the attention weights are consistent with the dimensions of the original feature map; specifically, the one-dimensional matrices of corresponding channels in the two maximum-value matrix sets obtained in Step 2 are multiplied using matrix multiplication, yielding a matrix group with the same spatial dimensions as the original feature map.

the matrix multiplication is specifically as follows:

wherein A is an initial attention diagram and is consistent with an original characteristic diagram in dimension, and the significance information of the characteristic diagram is calculated on the final output characteristic of each bottleneck block, so that N attention diagrams are shared; fAC(. cndot.) denotes asymmetric convolution.

4. A fine-grained image recognition method based on a saliency attention mechanism according to claim 3, characterized in that an asymmetric convolution is introduced into the attention mechanism and the saliency attention weights are strengthened by it; specifically, the initial attention map is convolved under the same sliding window with three different kernels of sizes 3 × 3, 1 × 3 and 3 × 1: 1) the feature map is first convolved with the 3 × 3 kernel; 2) within the same sliding window, the 1 × 3 convolution operates only on the horizontal skeleton of the window; and 3) within the same sliding window, the 3 × 1 convolution operates only on the vertical skeleton of the window.

5. The fine-grained image recognition method based on the saliency attention mechanism of claim 3, characterized in that in Step 4, the attention maps obtained in Step 3 are feature-fused with feature maps of different scales to construct a group of feature maps containing high-level semantic, multi-scale and saliency information; specifically, corresponding elements of A_a and X_a are multiplied point-wise to activate the saliency information of the original feature map X_a, with the formula:

$$C_a = A_a \cdot X_a,\qquad a = 2, 3, \dots, N,$$

where C_a denotes the feature maps of different scales obtained through the bottom-up feed-forward of the feature image pyramid network.

6. The fine-grained image recognition method based on the saliency attention mechanism of claim 5, characterized in that after the saliency information is fused into the feature maps of different scales, the key discriminative regions presented in those feature maps differ. Therefore, in Step 5, to increase the number of samples of key regions in each image and account for the influence of feature maps of different scales on the network classification, the mapping between the network output and the ground-truth classes is optimized through a class loss function, and a classification network taking regions of different scales as input is established: the spatial dimensions of the feature maps of each scale are compressed to obtain one-dimensional feature vectors, the feature vectors are fed into a fully connected layer and mapped onto the classes contained in the data set, and the cross-entropy loss against the ground-truth class is computed as

$$L_{P}=-\sum_{a}\log\Big(\mathrm{softmax}\big(W_{c}\,\theta(P_{a})\big)_{y}\Big),$$

where y is the ground-truth class, W_c denotes the fully connected operation, θ(·) denotes the global pooling operation that spatially compresses the feature map, and P_a denotes the most discriminative region part of the feature map.

7. The fine-grained image recognition method based on the saliency attention mechanism of claim 6, characterized in that, in the top-down construction of the pyramid network, the lower-layer feature maps are constrained by a ranking loss so that, compared with the upper-layer feature maps, they contain more semantic and fine-grained feature information; the ranking loss is

$$L_{\mathrm{rank}}=\max\big(0,\;P^{(\mathrm{up})}-P^{(\mathrm{low})}+\epsilon\big),$$

where P^{(·)} denotes the class probability of the corresponding feature map P_a predicted by the softmax function, and ε is a constant greater than zero serving as an adjustable hyperparameter. Under this ranking-loss constraint the network does not update its parameters when the classification probability of the lower-layer features already exceeds that of the upper-layer features; otherwise, the whole network updates its parameters in the direction that favors recognizing classes from the lower-layer feature maps;

the total loss function is:

$$L_{\mathrm{total}}=\alpha L_{P}+\beta L_{\mathrm{rank}},$$

where α and β are hyperparameters.

8. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the fine-grained image recognition method based on a saliency attention mechanism of any of claims 1 to 7.

Technical Field

The invention belongs to the technical field of computer vision, relates to deep learning and fine-grained image recognition, and in particular to a fine-grained image recognition method based on a saliency attention mechanism.

Background

Fine-grained image recognition, also referred to as sub-category image recognition, differs from traditional image recognition in that it aims to distinguish different sub-categories belonging to one class. Different subclasses are often highly similar, and interference factors such as pose, illumination, occlusion and background mean that the images may share similar appearances and shapes; fine-grained images are therefore characterized by small inter-class differences and large intra-class differences. Given the high accuracy demanded of image recognition in practice, fine-grained image recognition is an important research direction in computer vision.

A fine-grained image recognition method must address two problems. (1) Locating the distinctive key regions in fine-grained images: fine-grained images are often similar in geometry and appearance, and research shows that these distinctive regions usually lie in small local regions of the target object; such local regions are unlikely to change greatly with the target's pose or the camera's viewing angle, so localizing them makes the target categories easier to distinguish. (2) Extracting and representing effective features within those key regions: after the distinctive key regions are detected in the image, effective fine-grained features must be extracted, and features of the same category must be represented and mapped into the same semantic space, so that the recognition method outputs the same category.

At present, research on fine-grained recognition falls into three main categories of network-structure design: network-ensemble-based methods, position-detection-based methods, and attention-mechanism-based methods. Attention-based fine-grained image recognition differs from the general approach in that, within the feature-extraction network, an attention branch learns attention weights to strengthen the fine-grained features most useful to the recognition task. As shown in Fig. 1, after conventional feature extraction of the input image, the original feature map is fed into an attention module, the module learns which parts of the feature maps to attend to, and the attention feature maps are then fused back into the original feature map to reinforce key features. Most computer-vision tasks currently use a soft-attention mechanism based on deep learning, designed on top of a convolutional neural network and optimized by updating the attention weights with stochastic gradient descent.

In existing attention-based fine-grained recognition methods, the attention weights are mostly obtained by feeding the feature map directly into the attention module and adaptively learning image features for the task through large-scale conventional convolutions, without any preprocessing of the original feature map. Although this common form of attention has achieved some results in locating network modules, several key issues remain. Specifically: 1) the input to the attention module is an unpreprocessed original feature map that contains much visual information, such as texture and color saliency; a complex attention map can then only be learned through large-scale convolution or fully connected operations, and since only the labels constrain the network parameters, no visual features are collected in a directed way, wasting traditional visual-saliency cues; 2) in part-level regions of an image, the regions carrying key feature information are usually aggregated, and attention weights learned point-by-point struggle to express the advantage of attending to different regions distinctively.

Disclosure of Invention

To address the deficiencies of the prior art, the invention provides a fine-grained image recognition method based on a saliency attention mechanism, improved in the following two respects. (1) For the difficulty of locating key regions in fine-grained images: saliency information is first obtained from original feature maps of different sizes, attention weights are generated from that saliency information, and the saliency-bearing weights are fused into the original feature maps of different sizes to construct a group of feature maps containing high-level semantic, multi-scale and saliency information. (2) For the difficulty of extracting and expressing the key-region features of fine-grained images: the invention proposes a multi-feature-map loss in which several feature maps of different sizes jointly participate in the network-parameter computation; a ranking loss ensures that the lower-layer feature maps are more useful to the recognition task than the upper-layer ones, so that the top-down construction of the feature pyramid is a process of progressively refining the features, improving the performance of the whole network model.

In order to solve the technical problems, the invention adopts the technical scheme that:

a fine-grained image identification method based on a saliency attention mechanism comprises the following steps:

Step 1: extract features to obtain a three-dimensional original feature map;

Step 2: obtain a saliency map by processing each feature map in all channels with the saliency attention module: take the pixel-element maxima along the horizontal and vertical directions within each channel to obtain two sets of one-dimensional maximum-value matrices, thereby acquiring the horizontal and vertical saliency feature information of the feature map, determining the saliency information within each channel, and further locating the discriminative region in each channel's feature map;

Step 3: generate and learn attention weights from the saliency information obtained in Step 2, fuse the attention weights into the original feature map to obtain an attention-weighted feature map, and perform saliency attention feature enhancement;

Step 4: construct, in a top-down manner, feature map groups of different scales containing rich semantic and saliency information from the saliency-enhanced feature map groups;

Step 5: complete the fine-grained class mapping with the feature map groups obtained in the previous step, and constrain the network-parameter optimization process with two loss functions, a class loss and a ranking loss;

Step 6: train the model until it gradually converges, and save the network model with the highest experimental accuracy.

Further, the three-dimensional feature map X obtained in Step 1, where X ∈ C × W × H, is denoted by the symbol M; in Step 2, the specific steps by which the attention module obtains the saliency map are:

1) For each feature map in all channels, take the maximum pixel element of each row along the horizontal direction and concatenate these maxima into a one-dimensional matrix; that is, a one-dimensional matrix of dimension 1 × H is obtained from each feature map M_i. Accumulating over all channels yields C matrices of dimension 1 × H, giving the maximum-value matrix set of horizontal maxima.

2) For each feature map in all channels, take the maximum pixel element of each column along the vertical direction and concatenate these maxima into a one-dimensional matrix; that is, a one-dimensional matrix of dimension W × 1 is obtained from each feature map M_i. Accumulating over all channels yields C matrices of dimension W × 1, giving the maximum-value matrix set of vertical maxima.

Specifically, the two one-dimensional matrices of each feature map are obtained as

$$x''_{h}=\max_{1\le w\le W}x'(w,h),\qquad x''_{w}=\max_{1\le h\le H}x'(w,h),$$

where x′ denotes the pixel value after feature extraction and x″ the one-dimensional matrix after maximum-value screening; the two maximum-value matrix sets are then obtained as

$$S^{H}=\big\{x''^{(i)}_{h}\big\}_{i=1}^{C},\qquad S^{W}=\big\{x''^{(i)}_{w}\big\}_{i=1}^{C},$$

where S^{W} denotes the maximum-value matrix set of vertical (column-wise) values and S^{H} the maximum-value matrix set of horizontal (row-wise) values.

Further, in Step 3, before the attention weights are learned, a dimension transformation is first performed so that the attention weights are consistent with the dimensions of the original feature map; specifically, the one-dimensional matrices of corresponding channels in the two maximum-value matrix sets obtained in Step 2 are multiplied using matrix multiplication, yielding a matrix group with the same spatial dimensions as the original feature map.

The matrix multiplication is specifically:

$$A=F_{AC}\big(S^{W}\times S^{H}\big),$$

where A is the initial attention map, consistent in dimension with the original feature map; because the saliency information of the feature map is computed on the final output features of each bottleneck block, there are N attention maps in total; F_{AC}(·) denotes the asymmetric convolution.

Further, an asymmetric convolution is introduced into the attention mechanism and the saliency attention weights are strengthened by it; specifically, the initial attention map is convolved under the same sliding window with three different kernels of sizes 3 × 3, 1 × 3 and 3 × 1: 1) the feature map is first convolved with the 3 × 3 kernel; 2) within the same sliding window, the 1 × 3 convolution operates only on the horizontal skeleton of the window; and 3) within the same sliding window, the 3 × 1 convolution operates only on the vertical skeleton of the window.
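The three-branch window operation described above can be sketched in plain Python (a minimal illustration on nested lists; the function and kernel names are hypothetical, and a real implementation would use a deep-learning framework's convolution layers):

```python
def asymmetric_conv(att, k33, k13, k31):
    """For each 3x3 sliding window over the attention map `att`:
    sum the 3x3 branch output, the 1x3 branch applied to the window's
    horizontal skeleton (middle row), and the 3x1 branch applied to
    its vertical skeleton (middle column)."""
    H, W = len(att), len(att[0])
    out = []
    for i in range(H - 2):
        row = []
        for j in range(W - 2):
            # 3x3 branch over the full window
            s33 = sum(k33[a][b] * att[i + a][j + b]
                      for a in range(3) for b in range(3))
            # 1x3 branch over the horizontal skeleton (middle row)
            s13 = sum(k13[b] * att[i + 1][j + b] for b in range(3))
            # 3x1 branch over the vertical skeleton (middle column)
            s31 = sum(k31[a] * att[i + a][j + 1] for a in range(3))
            row.append(s33 + s13 + s31)
        out.append(row)
    return out
```

Because the three branches share the same window, their summed response is equivalent to a single 3 × 3 kernel whose central cross has been strengthened, which is what reinforces the saliency attention weights.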

Further, in Step 4, the attention maps obtained in Step 3 are feature-fused with feature maps of different scales to construct a group of feature maps containing high-level semantic, multi-scale and saliency information; specifically, corresponding elements of A_a and X_a are multiplied point-wise to activate the saliency information of the original feature map X_a, with the formula:

$$C_a = A_a \cdot X_a,\qquad a = 2, 3, \dots, N,$$

where C_a denotes the feature maps of different scales obtained through the bottom-up feed-forward of the feature image pyramid network.
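The point-wise activation above can be sketched as follows (plain Python on a single-channel 2-D map; the function name is an illustrative assumption):

```python
def activate_saliency(att_map, feat_map):
    """C_a = A_a * X_a: element-wise (Hadamard) product of the attention
    map with the original feature map, so that salient positions are
    kept and non-salient ones are suppressed."""
    return [[a * x for a, x in zip(a_row, x_row)]
            for a_row, x_row in zip(att_map, feat_map)]
```

For example, a binary attention map simply masks the feature map, while a real-valued one re-weights it.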

Further, after the saliency information is fused into the feature maps of different scales, the key discriminative regions presented in those feature maps differ. Therefore, in Step 5, to increase the number of samples of key regions in each image and account for the influence of feature maps of different scales on the network classification, the mapping between the network output and the ground-truth classes is optimized through a class loss function, and a classification network taking regions of different scales as input is established: the spatial dimensions of the feature maps of each scale are compressed to obtain one-dimensional feature vectors, the feature vectors are fed into a fully connected layer and mapped onto the classes contained in the data set, and the cross-entropy loss against the ground-truth class is computed as

$$L_{P}=-\sum_{a}\log\Big(\mathrm{softmax}\big(W_{c}\,\theta(P_{a})\big)_{y}\Big),$$

where y is the ground-truth class, W_c denotes the fully connected operation, θ(·) denotes the global pooling operation that spatially compresses the feature map, and P_a denotes the most discriminative region part of the feature map.
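A minimal sketch of this classification head (global pooling θ, fully connected map W_c, softmax cross-entropy) in plain Python; the weight layout and all names are assumptions for illustration, not the patented implementation:

```python
import math

def global_avg_pool(fmap):
    """theta(.): compress one H x W feature map to a single scalar."""
    return sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))

def class_loss(feature_maps, weights, true_class):
    """Pool each scale's map to a vector, apply a fully connected layer
    (weights: num_classes x num_features), then softmax cross-entropy
    against the ground-truth class."""
    vec = [global_avg_pool(f) for f in feature_maps]
    logits = [sum(w * v for w, v in zip(w_row, vec)) for w_row in weights]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[true_class])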

Further, in the top-down construction of the pyramid network, the lower-layer feature maps are constrained by a ranking loss so that, compared with the upper-layer feature maps, they contain more semantic and fine-grained feature information; the ranking loss is

$$L_{\mathrm{rank}}=\max\big(0,\;P^{(\mathrm{up})}-P^{(\mathrm{low})}+\epsilon\big),$$

where P^{(·)} denotes the class probability of the corresponding feature map P_a predicted by the softmax function, and ε is a constant greater than zero serving as an adjustable hyperparameter. Under this ranking-loss constraint the network does not update its parameters when the classification probability of the lower-layer features already exceeds that of the upper-layer features; otherwise, the whole network updates its parameters in the direction that favors recognizing classes from the lower-layer feature maps;

the total loss function is:

$$L_{\mathrm{total}}=\alpha L_{P}+\beta L_{\mathrm{rank}},$$

where α and β are hyperparameters.
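The two losses can be sketched together in plain Python (the margin ε and the weights α, β below are arbitrary placeholder values, not values prescribed by the source):

```python
def ranking_loss(p_lower, p_upper, eps=0.05):
    """Hinge-style ranking constraint: the loss is zero (no update)
    when the lower-layer class probability already exceeds the
    upper-layer one by at least the margin."""
    return max(0.0, p_upper - p_lower + eps)

def total_loss(l_class, l_rank, alpha=1.0, beta=1.0):
    """L_total = alpha * L_P + beta * L_rank."""
    return alpha * l_class + beta * l_rank
```

When the lower layer is already the better classifier, the hinge clamps the ranking term to zero and only the class loss drives the update.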

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a fine-grained image recognition method based on a saliency attention mechanism as described above.

Compared with the prior art, the invention has the advantages that:

(1) For the difficulty of locating key regions in fine-grained images: in fine-grained recognition, key discriminative regions must be located to distinguish the categories effectively, yet such distinctive regions are hard to locate because they are usually small. The method generates attention from the saliency information of the original feature map: to learn the salient features of an image in a directed way, after feature extraction it first acquires the saliency information of the feature map, strengthens it with an asymmetric convolution, and combines the saliency-bearing attention weights with a Feature Pyramid Network (FPN) to construct a group of feature maps with high-level semantic, multi-scale and saliency information.

Specifically, the saliency information in the horizontal and vertical directions of the feature map is first used to generate the saliency attention weights; these weights are then fused into the feature maps of different scales in the feature pyramid network, strengthening the salient features of the key regions at every scale. Through this design, the attention network can learn the saliency information of the feature maps in a directed way.

(2) For the difficulty of extracting and expressing the key-region features of fine-grained images: the invention proposes a multi-feature-map loss in which feature maps of several sizes jointly participate in the network-parameter computation.

First, the salient attention features are enhanced on the saliency attention feature maps by convolution; the feature maps of different sizes in the group are then compressed into one-dimensional feature vectors, loss values are computed from the cross-entropy loss function and the category labels, and finally the network parameters are updated by gradient descent.

Meanwhile, the method uses the ranking loss to ensure that the lower-layer feature maps are more useful to the recognition task than the upper-layer ones, so that the top-down construction of the feature pyramid is a process of progressively refining the features, improving the performance of the whole network model.

Drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in their description are briefly introduced below. The drawings described below are only some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a prior art attention network flow diagram;

FIG. 2 is a diagram of a feature map saliency information acquisition step of the present invention;

FIG. 3 is a diagram of the step of generating initial attention characteristics for saliency information of the present invention;

FIG. 4 is a schematic diagram of an asymmetric convolution structure according to the present invention;

fig. 5 is a diagram of a salient attention network architecture of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

A fine-grained image identification method based on a saliency attention mechanism comprises the following steps:

the method comprises the following steps: feature extraction

Let the input image be x. The image x is first fed into a feature-extraction network to obtain a three-dimensional original feature map X. The feature network may be VGGNet, ResNet or the like; here ResNet, whose downsampling effect is more pronounced, is used as the backbone for feature extraction. A series of convolution operations is denoted by F(·). The whole feature-extraction network contains N bottleneck blocks, each comprising convolution kernels of different sizes, numbers and strides, and is built by stacking these bottleneck blocks; the output feature of the a-th bottleneck block is:

$$X_a=F_a(x),\qquad a=1,2,3,\dots,N.$$

step two: design significance attention mechanism module

The invention processes each feature map in all channels through the saliency attention module to obtain a saliency map: the pixel-element maxima along the horizontal and vertical directions are taken within each channel to obtain two sets of one-dimensional maximum-value matrices, acquiring the horizontal and vertical saliency feature information of the feature map, determining the saliency information within each channel, and further locating the discriminative region in each feature map.

Using the three-dimensional feature map X (X ∈ C × W × H) obtained in Step 1, denoted by the symbol M, an attention module is designed to effectively obtain the saliency corresponding to the feature map. The specific steps for obtaining the saliency map are as follows:

1) For each feature map in all channels, take the maximum pixel elements along the horizontal (W) direction row by row and concatenate them into a one-dimensional matrix; that is, a one-dimensional matrix of dimension 1 × H is obtained from each feature map M_i (1 ≤ i ≤ C). Accumulating over all channels yields C matrices of dimension 1 × H, giving the maximum-value matrix set of horizontal maxima. The specific operation is shown in Fig. 2. The purpose of this step is to obtain the saliency feature information in the horizontal direction of the feature map.

2) For each feature map in all channels, take the maximum pixel elements along the vertical (H) direction column by column and concatenate them into a one-dimensional matrix; that is, a one-dimensional matrix of dimension W × 1 is obtained from each feature map M_i (1 ≤ i ≤ C). Accumulating over all channels yields C matrices of dimension W × 1, giving the maximum-value matrix set of vertical maxima. The specific operation is shown in Fig. 2. The purpose of this step is to acquire the saliency feature information in the vertical direction of the feature map and thereby determine its discriminative regions.

Within each channel, the two matrices obtained from the horizontal and vertical maxima have dimensions 1 × H and W × 1, respectively. Because the image regions attended to are not exactly the same in every channel, this step simultaneously captures richer saliency information across multiple channels.

Specifically, the two one-dimensional matrices of each feature map are obtained as follows:

x''_W(h) = max_{1≤w≤W} x'(w, h),   x''_H(w) = max_{1≤h≤H} x'(w, h)

where x' represents the pixel value after feature extraction and x'' represents the one-dimensional matrix after maximum-value screening. Specifically, the two maximum-value matrix arrays are obtained as follows:

M^H = {x''_H,i | 1 ≤ i ≤ C},   M^W = {x''_W,i | 1 ≤ i ≤ C}

where M^H denotes the maximum-value matrix array of the vertical (longitudinal) maxima and M^W denotes the maximum-value matrix array of the horizontal (transverse) maxima.
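The directional maximum-value screening above can be sketched in NumPy (a minimal illustration; the (C, H, W) axis order and all variable names are assumptions, not from the patent):

```python
import numpy as np

# Hypothetical backbone feature tensor X with C channels of size H x W.
rng = np.random.default_rng(0)
C, H, W = 4, 6, 5
X = rng.standard_normal((C, H, W))

# Horizontal direction: row-wise maxima over W -> one 1 x H vector per channel.
M_h = X.max(axis=2)   # shape (C, H): the horizontal maximum-value matrix array
# Vertical direction: column-wise maxima over H -> one W x 1 vector per channel.
M_v = X.max(axis=1)   # shape (C, W): the vertical maximum-value matrix array

assert M_h.shape == (C, H) and M_v.shape == (C, W)
```

Each channel thus contributes two short vectors that record where its strongest responses lie along each axis.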

Step three: fusion attention weight

Step two preliminarily obtains the saliency information in the original feature map. This saliency information is used to generate and learn the attention weights, and the attention weights containing saliency information are fused into the original feature maps of different sizes to obtain feature maps fused with attention weights, strengthening the saliency attention features. The selected saliency information is equivalent to an initialization of the attention module: it provides a basis for the subsequent learning of the attention weights, and strengthens the salient features in feature maps of different sizes in a directional manner.

However, the maximum-value matrix arrays containing the saliency information differ in dimension from the original feature map, so before learning the attention weights a dimension transformation is required to make them consistent with the original feature map. Specifically: the one-dimensional matrices of corresponding channels in the two maximum-value matrix arrays obtained in step two are multiplied using matrix multiplication, yielding a matrix group with the same spatial dimensions as the original feature map. As shown in fig. 3, the two one-dimensional maximum-value feature matrices containing the saliency information of the original feature map are matrix-multiplied; in fig. 3, H is the feature map height, W the feature map width, and C the number of feature map channels.

The matrix multiplication is specifically as follows:

A = f_AC(M^H × M^W)

where A is the initial attention map, consistent in dimension with the original feature map; because the saliency information of the feature map is computed on the final output features of each bottleneck block, there are N attention maps in total; f_AC(·) denotes the asymmetric convolution.
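The dimension restoration by the matrix multiplication amounts to a per-channel outer product of the two directional maximum vectors, which can be sketched as follows (illustrative values and variable names only):

```python
import numpy as np

# Hypothetical per-channel maximum vectors from step two.
rng = np.random.default_rng(1)
C, H, W = 4, 6, 5
M_h = rng.standard_normal((C, H))   # 1 x H horizontal maxima per channel
M_v = rng.standard_normal((C, W))   # W x 1 vertical maxima per channel

# (W x 1) @ (1 x H) per channel restores a W x H map, matching the
# spatial dimensions of the original feature map.
A = np.einsum('cw,ch->cwh', M_v, M_h)
assert A.shape == (C, W, H)
# Each element is the product of the two directional maxima of its row/column.
assert np.isclose(A[0, 2, 3], M_v[0, 2] * M_h[0, 3])
```

Positions lying on the row or column of a strong maximum receive large products, which is why the original one-dimensional chain information expands into two-dimensional spatial information.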

It should be noted that the present invention introduces asymmetric convolution into the attention mechanism and uses it to strengthen the saliency attention weights. Specifically, convolution operations are performed on the initial attention map under the same convolution sliding window with three different convolution kernels of sizes 3 × 3, 1 × 3 and 3 × 1, as shown in fig. 4: 1) the feature map is first convolved with the 3 × 3 kernel; 2) in the same sliding window, the 1 × 3 convolution operates only on the horizontal skeleton of the window; 3) in the same sliding window, the 3 × 1 convolution operates only on the vertical skeleton of the window.

Using the 3 × 1 and 1 × 3 convolution kernels as convolution skeletons effectively enhances the weights on the skeletons while suppressing the weights at the corners. With this convolution kernel configuration, the saliency feature information in the resulting matrix A, and hence the resulting attention weights, can be stabilized and strengthened. The resulting attention map is the one to be input into the attention module. Taking maxima in the horizontal and vertical directions of the original feature map amounts to extracting some of its saliency information; after the matrix multiplication, the effect of this saliency information within the global features is strengthened a second time, and the original one-dimensional chain information is expanded into two-dimensional spatial information, making the information richer. The region where the maximum lies in the attention map is unchanged, while the position of the original maximum and the element values along its horizontal and vertical directions are strengthened. Since local maxima in the feature map are not easily affected by factors such as the pose and illumination of the target object, the network's ability to locate key regions is enhanced.
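The skeleton-strengthening effect of the asymmetric convolution can be checked with a small sketch: summing the outputs of the 3 × 3, 1 × 3 and 3 × 1 branches under zero "same" padding is equivalent to a single 3 × 3 convolution whose central row and column are strengthened (a naive NumPy illustration; the padding scheme and all names are assumptions):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' cross-correlation with zero padding (illustration only)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(2)
x = rng.standard_normal((6, 5))      # one channel of the initial attention map
k33 = rng.standard_normal((3, 3))
k13 = rng.standard_normal((1, 3))    # horizontal-skeleton kernel
k31 = rng.standard_normal((3, 1))    # vertical-skeleton kernel

# Sum of the three branch outputs ...
y_branches = conv2d_same(x, k33) + conv2d_same(x, k13) + conv2d_same(x, k31)

# ... equals one 3x3 conv whose central row/column weights are enhanced,
# which is exactly the "strengthen skeletons, suppress corners" effect.
k_fused = k33.copy()
k_fused[1, :] += k13[0, :]
k_fused[:, 1] += k31[:, 0]
assert np.allclose(y_branches, conv2d_same(x, k_fused))
```

The corner weights of the fused kernel come only from the 3 × 3 branch, while the central cross accumulates all three branches.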

Step four: constructing a feature image pyramid network

The feature map groups with strengthened saliency information are used to construct, in a top-down manner, feature map groups of different scales containing rich semantic and saliency information; the construction of the feature image pyramid network is completed using top-down and lateral connection operations.

This embodiment takes the feature maps of three different scales (X_2, X_3, X_4) obtained in step three as an example to construct the feature image pyramid network, and performs feature fusion between the attention maps and the feature maps of different scales to construct a group of feature maps containing high-semantic, multi-scale and saliency information. Specifically, A_a and X_a are multiplied element-wise to activate the saliency information in the original feature map X_a, with the following formula:

C_a = A_a · X_a,   a = 2, 3, 4.

where C_a represents the feature maps of different scales obtained by the bottom-up feed-forward pass of the feature image pyramid network.
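The element-wise activation C_a = A_a · X_a can be sketched as follows (the channel count and spatial sizes are hypothetical examples, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical attention maps and backbone features at scales a = 2, 3, 4.
scales = {2: (64, 28, 28), 3: (64, 14, 14), 4: (64, 7, 7)}
C = {}
for a, shape in scales.items():
    A_a = rng.random(shape)              # attention map, values in [0, 1)
    X_a = rng.standard_normal(shape)     # original feature map at scale a
    # Element-wise product activates the salient locations of X_a.
    C[a] = A_a * X_a
    assert C[a].shape == shape
```

Because the product is element-wise, each C_a keeps the spatial layout of X_a while re-weighting every position by its saliency.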

Step five: fine-grained class mapping

Through the operations of the above steps, the attention weights containing the saliency feature information of the feature maps are obtained and fused into the original feature maps to obtain feature maps fused with attention weights, strengthening the fine-grained feature information of the original feature maps; a feature image pyramid network is then built from these feature maps, as shown in fig. 5. Fine-grained class mapping is then completed using the resulting feature map group.

Then, fine-grained class mapping is completed by using the feature map group obtained in the previous step, and the network parameter optimization process is constrained by two loss functions: the class loss and the ranking loss. Specifically:

Regarding the class loss:

Since the saliency information is fused into feature maps of different scales, and the key distinguishing regions in these feature maps differ, step five increases the number of key-region samples in each image by treating the feature maps of different scales (P_2, P_3, P_4) as inputs, optimizing the mapping relation between the network and the ground-truth class through a class loss function, and building a classification network that takes regions of different scales as input. The spatial dimensions of P_2, P_3, P_4 are each compressed to obtain one-dimensional feature vectors. Finally, the feature vectors are input into a fully connected layer and mapped to the categories contained in the dataset, and the cross-entropy loss against the ground-truth class is computed. The class loss formula is as follows:

L_P = Σ_{a=2}^{4} L_CE(W_c · θ(P_a), y)

where W_c represents the fully connected operation, θ(·) represents the global pooling operation that spatially compresses the feature map, P_a represents the most distinctive discriminative region part of the feature map, and y is the ground-truth class.
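A minimal sketch of the class loss, assuming global average pooling for θ(·) and a fully connected layer W_c shared across scales (all shapes, values and names are hypothetical):

```python
import numpy as np

def global_avg_pool(p):
    """theta(.): compress the spatial dimensions (C, H, W) -> (C,)."""
    return p.mean(axis=(1, 2))

def cross_entropy(logits, label):
    """Numerically stable cross-entropy against a single ground-truth label."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(4)
num_classes, label = 10, 3
scales = {2: (64, 28, 28), 3: (64, 14, 14), 4: (64, 7, 7)}
Wc = rng.standard_normal((num_classes, 64)) * 0.1  # shared FC weights

# Total class loss: sum of the per-scale cross-entropy losses.
L_P = 0.0
for a, shape in scales.items():
    P_a = rng.standard_normal(shape)          # stand-in for the pyramid output
    logits = Wc @ global_avg_pool(P_a)        # W_c . theta(P_a)
    L_P += cross_entropy(logits, label)
```

Summing one loss term per scale is what gives the data-expansion effect described later: one image yields three supervised samples with shared parameters.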

Regarding the ranking loss:

when a high-semantic, multi-scale and salient feature map group is constructed in a top-down and transverse connection mode, by means of the advantages of the feature image pyramid network and the effect of the salient attention module, some high-latitude semantic information, feature map salient information and some high-resolution feature information are fused into a feature map with a larger size, namely in the process of constructing the pyramid network from top to bottom, the high-level semantic information and the salient features are fused into the last P2In order to really achieve a structure effective for classification, the invention designs a lower-layer characteristic diagram P of ordering loss constraint2So that P is2Compared with the upper-layer feature map, the method contains more semantic information and fine-grained feature information. The sequencing loss formula is as follows:

wherein, PaRepresents PaClass probability of feature map predicted by softmax function, constant being greater than zeroThe adjustable hyper-parameters enable the network not to update the parameters when the classification probability of the lower-layer features is larger than that of the upper-layer features through the constraint of sequencing loss, otherwise, the whole network can update the parameters towards the direction which is more beneficial to identifying the categories of the lower-layer feature diagram.
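The ranking constraint can be sketched as a hinge on the softmax probabilities of the true class (a hedged reconstruction; the exact symbols in the patent formula are garbled, and all values below are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rank_loss(logits_low, logits_up, label, margin=0.05):
    """Hinge ranking loss: zero (no parameter update) once the lower-level
    map predicts the true class with higher probability than the upper one."""
    p_low = softmax(logits_low)[label]   # p_2: lower-level prediction
    p_up = softmax(logits_up)[label]     # p_a: upper-level prediction
    return max(0.0, p_up - p_low + margin)

label = 2
logits_low = np.array([0.1, 0.2, 3.0, 0.0])  # P2 confident on the true class
logits_up = np.array([0.1, 0.2, 1.0, 0.0])
loss = rank_loss(logits_low, logits_up, label)
```

With these values the lower-level map already beats the upper one by more than the margin, so the loss is zero; swapping the two logit vectors yields a positive loss that pushes P_2 to improve.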

The total loss function is:

L_total = α·L_P + β·L_rank,

where α and β are hyper-parameters.

Step six: the training model gradually converges, and the network model with the highest experimental accuracy is stored.

As another embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fine-grained image recognition method based on the saliency attention mechanism as described above. The specific implementation manner is as described above, and is not described herein again.

In summary, the present invention first obtains saliency information in original feature maps of different sizes, generates attention weights by using the saliency information, and fuses the attention weights containing saliency information into original feature maps of different sizes to construct a group of feature map groups containing high semantic, multi-scale, and saliency information. The selected significance information is equivalent to the initialization of an attention module, provides a basis for the learning of the attention weight of the next step, and strengthens significance characteristics in feature maps with different sizes in a directional strengthening mode. Through the design form, the attention network can be effectively utilized to conduct significance information orientation learning of the feature diagram.

The method of this patent considers the influence of feature maps of multiple sizes on the recognition task: it maps the feature maps of each size into categories, computes the loss against the ground-truth class label, and accumulates the loss values corresponding to the feature maps of different sizes to obtain the total class loss value. Generating several feature maps of different sizes from one picture and computing losses with shared parameters achieves a data-expansion effect, making the network model more robust. Meanwhile, the method uses the ranking loss to ensure that the lower-level feature map is more beneficial to the recognition task than the upper-level feature maps, so that the top-down construction of the feature image pyramid becomes a process of gradually seeking the best features, improving the performance of the whole network model. The method solves the problems of locating key regions in fine-grained images and the difficulty of extracting the corresponding features.

The steps of the present invention may be implemented in a general-purpose computer device, or alternatively, they may be implemented in program code executable by a computing device, such that they may be stored in a memory device and executed by the computing device, or separately fabricated into individual integrated circuit modules, or fabricated into a single integrated circuit module from a plurality of modules or steps. The present invention is not limited to any specific combination of hardware and software.

It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.
