Image feature extraction method based on combined attention mechanism

Document No.: 616058 | Publication date: 2021-05-07

Reading note: The technology "Image feature extraction method based on a combined attention mechanism" was created by 陈小波, 吴海昕, 邹玉华 and 李依晗 on 2020-12-31. Abstract: The invention discloses an image feature extraction method based on a combined attention mechanism, comprising the following steps. 1: input the image whose features are to be extracted into a convolutional neural network to obtain a feature map F; 2: obtain a spatial weight mask matrix W1 using a spatial attention module; 3: multiply the spatial weight mask matrix W1 with the feature map F to obtain a feature map F1; 4: obtain a channel weight mask matrix W2 of the feature map F using a channel attention module; 5: multiply the channel weight mask matrix W2 with the feature map F to obtain a feature map F2; 6: concatenate the feature map F1 with the feature map F2 along the channel dimension to obtain a feature map F3; 7: convolve the feature map F3 with c convolution kernels to obtain a feature map F4. The invention is an image feature extraction method based on a space-spectrum combined attention mechanism, addresses the technical problem in the prior art that considering only single-layer features leads to insufficient feature extraction, and can be widely applied in the field of computer vision.

1. An image feature extraction method based on a joint attention mechanism, characterized in that the method comprises the following steps:

step 1: inputting the image whose features are to be extracted into a convolutional neural network to obtain a feature map F;

step 2: obtaining a spatial weight mask matrix W1 of the feature map F by using a spatial attention module;

step 3: multiplying the spatial weight mask matrix W1 with the feature map F to obtain a feature map F1 of size h × w × c;

step 4: obtaining a channel weight mask matrix W2 of the feature map F by using a channel attention module;

step 5: multiplying the channel weight mask matrix W2 with the feature map F to obtain a feature map F2 of size h × w × c;

step 6: concatenating the feature map F1 with the feature map F2 along the channel dimension to obtain a feature map F3 of size h × w × 2c;

step 7: convolving the feature map F3 with c convolution kernels of size 3 × 3 to obtain a feature map F4 of size h × w × c.

2. The method of image feature extraction based on a joint attention mechanism as claimed in claim 1, wherein the specific process of step 1 is as follows: the input image enters a convolutional neural network, and the initial features of the image are extracted through multiple convolutional layers during forward propagation, yielding a feature map F of size h × w × c, where h is the height, w the width, and c the number of channels of the feature map F.

3. The method of image feature extraction based on a joint attention mechanism as claimed in claim 2, wherein step 2 comprises the following steps:

step 2.1: performing maximum pooling on the feature map F along the channel dimension to obtain a feature map f1 of size h × w × 1;

step 2.2: passing the feature map f1 through a sigmoid activation function to obtain a spatial weight mask matrix W1 of size h × w × 1.

4. The method of image feature extraction based on a joint attention mechanism as claimed in claim 3, wherein in step 2.1 the maximum pooling of the feature map F along the channel dimension proceeds as follows: for each position k_ij on the feature map F, the corresponding tensor of size 1 × 1 × c is compressed to a single value, namely the maximum of the elements in that tensor, where i = 1, 2, …, w and j = 1, 2, …, h.

5. The method of image feature extraction based on a joint attention mechanism as claimed in claim 4, wherein the specific process of step 3 is as follows: the 1 × 1 × c tensor corresponding to each position k_ij on the feature map F is multiplied by the weight at the corresponding position k_ij of the spatial weight mask matrix W1, where i = 1, 2, …, w and j = 1, 2, …, h.

6. The method of image feature extraction based on a joint attention mechanism as claimed in claim 5, wherein step 4 comprises the following steps:

step 4.1: performing maximum pooling on the feature map F over the spatial dimensions to obtain a feature map f2 of size 1 × 1 × c;

step 4.2: feeding the feature map f2 into a two-layer fully-connected network, in which the hidden layer has c/2 neurons and the output layer has c neurons, so that the output of the fully-connected network has size 1 × 1 × c and constitutes the feature map f3;

step 4.3: passing the feature map f3 through a sigmoid activation function to obtain a channel weight mask matrix W2 of size 1 × 1 × c.

7. The method of image feature extraction based on a joint attention mechanism as claimed in claim 6, wherein in step 4.1 the maximum pooling of the feature map F over the spatial dimensions proceeds as follows: the feature map F is split by channel into (m_1, …, m_i, …, m_c), where i = 1, 2, …, c; channel by channel, each tensor m_i of size h × w is compressed to a single value, namely the maximum of the elements in m_i.

8. The method of image feature extraction based on a joint attention mechanism as claimed in claim 7, wherein the specific process of step 5 is as follows: the feature map F is split by channel into (m_1, …, m_i, …, m_c), and the tensor m_i of size h × w corresponding to the i-th channel is multiplied by the i-th element of W2, where i = 1, 2, …, c.

9. The method of image feature extraction based on a joint attention mechanism as claimed in claim 8, further comprising step 8: replacing the feature map F with the feature map F4, inputting it into the subsequent convolutional layers of VGG16 to further extract and classify deep features, and keeping the original loss function of VGG16 to train the whole network.

Technical Field

The invention relates to the field of computer vision, and in particular to an image feature extraction method based on a joint attention mechanism.

Background

The attention mechanism stems from the study of human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on a part of the available information while ignoring the rest; this is called the attention mechanism. Different parts of the human retina process information with different acuity, that is, they differ in sensitivity to external objects, and only the fovea is maximally sensitive. To make reasonable use of limited visual processing resources, a human must select a specific part of the visual field and then concentrate on it; when reading, for example, people tend to attend to and process the words of interest. The attention mechanism therefore has two main aspects: deciding which part of the input needs attention, and allocating the limited information processing resources to the important part.

In recent years the attention mechanism has been introduced into the field of computer vision for visual information processing. It is a mechanism, or rather a methodology, without a strict mathematical definition; traditional local image feature extraction, saliency detection, and sliding-window methods can all be regarded as forms of attention. With the development of deep learning, attention has also been applied to neural networks: an additional attention module is added so that the network attends to key information and ignores irrelevant information. Typically the network either makes a hard selection of certain parts of the input or assigns different weights to different parts; the former is called hard attention and the latter soft attention. In short, an attention mechanism helps the network learn which parts of a given input matter most, thereby "summarizing" the input and making feature extraction more targeted.

The domains addressed by attention mechanisms include the spatial domain, channel domain, layer domain, hybrid domain, temporal domain, and so on. Most existing attention-based image feature extraction algorithms consider attention in a single domain only, which makes it difficult to drive the learning of complex features and degrades downstream tasks such as classification and object detection.

Disclosure of Invention

The object of the invention is to overcome the deficiencies of the background art by providing an image feature extraction method based on a combined (space-spectrum) attention mechanism, so as to solve the technical problem in the prior art that considering only single-layer features leads to insufficient feature extraction.

The invention provides an image feature extraction method based on a combined attention mechanism, comprising the following steps. Step 1: input the image whose features are to be extracted into a convolutional neural network to obtain a feature map F. Step 2: obtain a spatial weight mask matrix W1 of the feature map F using a spatial attention module. Step 3: multiply the spatial weight mask matrix W1 with the feature map F to obtain a feature map F1 of size h × w × c. Step 4: obtain a channel weight mask matrix W2 of the feature map F using a channel attention module. Step 5: multiply the channel weight mask matrix W2 with the feature map F to obtain a feature map F2 of size h × w × c. Step 6: concatenate the feature map F1 with the feature map F2 along the channel dimension to obtain a feature map F3 of size h × w × 2c. Step 7: convolve the feature map F3 with c convolution kernels of size 3 × 3 to obtain a feature map F4 of size h × w × c.

In the above technical solution, the specific process of step 1 is as follows: the input image enters a convolutional neural network, and the initial features of the image are extracted through multiple convolutional layers during forward propagation, yielding a feature map F of size h × w × c, where h is the height, w the width, and c the number of channels of the feature map F.

In the above technical solution, step 2 comprises the following steps. Step 2.1: perform maximum pooling on the feature map F along the channel dimension to obtain a feature map f1 of size h × w × 1. Step 2.2: pass the feature map f1 through a sigmoid activation function to obtain a spatial weight mask matrix W1 of size h × w × 1.

In the above technical solution, the maximum pooling of the feature map F along the channel dimension in step 2.1 proceeds as follows: for each position k_ij on the feature map F, the corresponding tensor of size 1 × 1 × c is compressed to a single value, namely the maximum of the elements in that tensor, where i = 1, 2, …, w and j = 1, 2, …, h.

In the above technical solution, the specific process of step 3 is as follows: the 1 × 1 × c tensor corresponding to each position k_ij on the feature map F is multiplied by the weight at the corresponding position k_ij of the spatial weight mask matrix W1, where i = 1, 2, …, w and j = 1, 2, …, h.

In the above technical solution, step 4 comprises the following steps. Step 4.1: perform maximum pooling on the feature map F over the spatial dimensions to obtain a feature map f2 of size 1 × 1 × c. Step 4.2: feed the feature map f2 into a two-layer fully-connected network in which the hidden layer has c/2 neurons and the output layer has c neurons, so that the output has size 1 × 1 × c and constitutes the feature map f3. Step 4.3: pass the feature map f3 through a sigmoid activation function to obtain a channel weight mask matrix W2 of size 1 × 1 × c.

In the above technical solution, the maximum pooling of the feature map F over the spatial dimensions in step 4.1 proceeds as follows: the feature map F is split by channel into (m_1, …, m_i, …, m_c), where i = 1, 2, …, c; channel by channel, each tensor m_i of size h × w is compressed to a single value, namely the maximum of the elements in m_i.

In the above technical solution, the specific process of step 5 is as follows: the feature map F is split by channel into (m_1, …, m_i, …, m_c), and the tensor m_i of size h × w corresponding to the i-th channel is multiplied by the i-th element of W2, where i = 1, 2, …, c.

In the above technical solution, the method further comprises step 8: replacing the feature map F with the feature map F4, inputting it into the subsequent convolutional layers of VGG16 to further extract and classify deep features, and keeping the original loss function of VGG16 to train the whole network.

The image feature extraction method based on the combined attention mechanism has the following beneficial effects:

1) The method jointly considers spatial and channel attention mechanisms: a spatial attention module and a channel attention module respectively learn the spatial and channel attention distributions of the feature map, producing spatial attention features and channel attention features, which are then fused. Fusing the features extracted by the two attention modules through a convolution realizes dynamic weight allocation, avoids manually set parameters, and enhances the robustness of the model. Compared with the prior art, the feature extraction method of the invention improves the ability to learn and represent complex features, and can therefore improve the accuracy of subsequent visual analysis tasks (image classification, image segmentation, object detection, image retrieval, and the like).

2) The feature extraction method of the invention can be embedded into most convolutional neural networks used for visual analysis tasks; the spatial and channel attention distributions of the feature map are learned adaptively during network training, without reprocessing the labeled data. Compared with existing image feature extraction techniques, the method is more widely applicable and can be used for ordinary images, hyperspectral images, multispectral images, SAR images, and the like.

Drawings

FIG. 1 is a schematic overall flow chart of an image feature extraction method based on a combined attention mechanism according to the present invention;

FIG. 2 is a working frame diagram of the image feature extraction method based on the combined attention mechanism according to the present invention;

FIG. 3 is a schematic block diagram of steps 2 to 3 of the image feature extraction method based on the combined attention mechanism according to the present invention;

FIG. 4 is a schematic block diagram of steps 4 to 5 of the image feature extraction method based on the joint attention mechanism according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the following figures and examples, which should not be construed as limiting the invention.

Referring to FIG. 1 and FIG. 2, the present invention relates to an image feature extraction method based on a joint attention mechanism, comprising the following steps:

Step 1: input the image whose features are to be extracted into a convolutional neural network to obtain a feature map F:

Step 1.1: the input image enters the convolutional neural network, and the initial features of the image are extracted through multiple convolutional layers during forward propagation, yielding a feature map F of size h × w × c, where h is the height, w the width, and c the number of channels of the feature map F; structurally, the convolutional neural network comprises basic convolutional layers, pooling layers, and batch normalization layers;

In a specific embodiment, the image whose features are to be extracted, of size 224 × 224, is input into the convolutional neural network VGG16, and the output of the fourth convolutional layer of VGG16 is selected as the feature map F; the height, width, and number of channels of the feature map F are 112, 112, and 128, respectively.
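As an illustrative sketch only (not a formal part of the claimed method), the following PyTorch snippet shows one way such a feature map F could be obtained; it assumes torchvision's VGG16 implementation, in which the fourth convolutional layer corresponds to index 7 of the `features` stack, so slicing `features[:9]` (through that layer's ReLU) yields a 112 × 112 × 128 output for a 224 × 224 input.

```python
# Hedged sketch: extracting the feature map F from torchvision's VGG16.
# Assumption: the "fourth convolutional layer" of the embodiment maps to
# features[7] (conv2_2); features[:9] includes its ReLU.
import torch
from torchvision.models import vgg16

backbone = vgg16(weights=None)        # untrained VGG16; load weights as needed
stem = backbone.features[:9]          # conv1_1 .. conv2_2 (+ReLU), one 2x2 max-pool
image = torch.randn(1, 3, 224, 224)   # stand-in for the 224 x 224 input image
F = stem(image)                       # feature map F
print(F.shape)                        # torch.Size([1, 128, 112, 112]): c=128, h=w=112
```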

referring to fig. 3, step 2: obtaining a space weight mask matrix W1 of the feature map F by using a space attention mechanism module:

Step 2.1: perform maximum pooling on the feature map F along the channel dimension to obtain a feature map f1 of size h × w × 1;

the method for realizing the maximum pooling of the characteristic diagram F in the channel dimension comprises the following steps: for each position k on the feature map FijThe corresponding tensor of size 1 × 1 × c is compressed to a value which is the maximum of the elements in the corresponding tensor, where i ═ 1,2, … w, j ═ 1,2, … h;

In a specific embodiment, the maximum is computed along the third dimension of the feature map F. Specifically, for each position k_ij on the feature map F (where i = 1, 2, …, 112 and j = 1, 2, …, 112), the maximum element of the corresponding tensor of size 1 × 1 × 128 is taken as the value of the feature map f1 at position k_ij; the height, width, and number of channels of the feature map f1 are 112, 112, and 1, respectively.

Step 2.2: pass the feature map f1 through a sigmoid activation function to obtain the spatial weight mask matrix W1 of size h × w × 1;

In a specific embodiment, a sigmoid activation function is applied to each element of the feature map f1, normalizing all values to between 0 and 1 and forming a spatial weight mask matrix W1 of size 112 × 112 × 1. The value W1_ij at each position of W1 (i = 1, 2, …, 112; j = 1, 2, …, 112) is the weight of the feature at the corresponding position of the feature map F; it characterizes the spatial importance of that feature, and features at important positions are highlighted by larger weights.

Step 3: multiply the spatial weight mask matrix W1 with the feature map F to obtain the feature map F1 of size h × w × c;

the space weight mask matrix W1 is multiplied by the feature graph F to obtain a feature graph F1 with the size h multiplied by W multiplied by c, and the implementation steps are as follows: each position k on the feature map FijThe corresponding 1 × c tensor and spatial weight mask matrix W1 kijMultiplying the corresponding weights at the positions, wherein i is 1,2, … w, and j is 1,2, … h;

In a specific embodiment, the tensor of size 1 × 1 × 128 corresponding to each position k_ij on the feature map F (i = 1, 2, …, 112; j = 1, 2, …, 112) is multiplied by the weight W1_ij at the corresponding position of the spatial weight mask matrix W1, giving the feature map F1, whose height, width, and number of channels are 112, 112, and 128, respectively.
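As a minimal, non-limiting sketch of steps 2 to 3 (PyTorch's (N, c, h, w) tensor layout is assumed, and the random tensor stands in for the feature map F):

```python
# Hedged sketch of steps 2-3: channel-dimension max pooling, sigmoid,
# and broadcast multiplication of the mask with the feature map.
import torch

F = torch.randn(1, 128, 112, 112)   # stand-in feature map F: c=128, h=w=112
f1 = F.amax(dim=1, keepdim=True)    # max over channels -> f1, shape (1, 1, 112, 112)
W1 = torch.sigmoid(f1)              # spatial weight mask matrix W1, values in (0, 1)
F1 = F * W1                         # each 1x1xc tensor at k_ij scaled by W1_ij
print(F1.shape)                     # torch.Size([1, 128, 112, 112])
```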

Referring to FIG. 4, step 4: obtain the channel weight mask matrix W2 of the feature map F using the channel attention module;

Step 4.1: perform maximum pooling on the feature map F over the spatial dimensions to obtain a feature map f2 of size 1 × 1 × c;

the method for realizing the maximum pooling of the characteristic diagram F in the space dimension comprises the following steps: dividing the characteristic diagram F into (m) according to channels1,…,mi,…,mc) Where i is 1,2 …, c. Channel-by-channel combining a tensor m of size h x wiCompressed to a value of miThe maximum value of the internal element;

In a specific embodiment, the tensor m_i of size 112 × 112 corresponding to each channel of the feature map F (i = 1, 2, …, c) is maximum-pooled one channel at a time, with the pooling layer configured so that each channel's full spatial extent is reduced to a single value (output size 1 × 1); the pooled results are concatenated by channel, giving the feature map f2 of size 1 × 1 × 128.

Step 4.2: feed the feature map f2 into a two-layer fully-connected network in which the hidden layer has c/2 neurons and the output layer has c neurons, so that the output of the fully-connected network has size 1 × 1 × c and constitutes the feature map f3;

In a specific embodiment, the two-layer fully-connected network is arranged as follows: the input tensor of the first fully-connected layer matches the feature map f2, of size 1 × 1 × 128, and its output layer has 64 neurons; the second fully-connected layer takes the output of the first as its input and has 128 output neurons, so the final output of the two-layer network, the feature map f3, has size 1 × 1 × 128.

Step 4.3: pass the feature map f3 through a sigmoid activation function to obtain the channel weight mask matrix W2 of size 1 × 1 × c;

In a specific embodiment, a sigmoid activation function is applied to each element of the feature map f3, normalizing all values to between 0 and 1 and forming a channel weight mask matrix W2 of size 1 × 1 × 128. The value W2_k of each channel of W2 (k = 1, 2, …, 128) is the weight of the corresponding channel feature of the feature map F; it characterizes the importance of that channel feature, and features in important channels are highlighted by larger weights.

Step 5: multiply the channel weight mask matrix W2 with the feature map F to obtain the feature map F2 of size h × w × c;

the channel weight mask matrix W2 is multiplied by the feature graph F to obtain a feature graph F2 with the size of h multiplied by W multiplied by c, and the implementation steps are as follows: the feature map F is divided into (m) by channel1,…,mi,…,mc) A tensor m with a size of h × w corresponding to the ith channeliMultiplying with the ith element in W2, wherein i is 1,2 …, c;

In a specific embodiment, the feature tensor corresponding to each channel k of the feature map F (k = 1, 2, …, 128) is multiplied by the weight W2_k of the corresponding channel of the channel weight mask matrix W2, giving the feature map F2, whose height, width, and number of channels are 112, 112, and 128, respectively.
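A corresponding sketch of steps 4 to 5 follows; the ReLU between the two fully-connected layers is an assumption of this sketch rather than a feature recited above, and the layer names are illustrative.

```python
# Hedged sketch of steps 4-5: spatial (global) max pooling, a two-layer
# fully-connected network with hidden width c/2, sigmoid, channel re-weighting.
import torch
import torch.nn as nn

F = torch.randn(1, 128, 112, 112)           # stand-in feature map F
c = F.shape[1]                              # number of channels, 128
f2 = F.amax(dim=(2, 3))                     # global max pool over h x w -> (1, c)
fc = nn.Sequential(nn.Linear(c, c // 2),    # hidden layer: c/2 = 64 neurons
                   nn.ReLU(),               # assumed nonlinearity (not specified)
                   nn.Linear(c // 2, c))    # output layer: c = 128 neurons
f3 = fc(f2)                                 # feature map f3, shape (1, c)
W2 = torch.sigmoid(f3).view(1, c, 1, 1)     # channel weight mask matrix W2
F2 = F * W2                                 # each channel m_i scaled by W2_i
print(F2.shape)                             # torch.Size([1, 128, 112, 112])
```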

Step 6: concatenate the feature map F1 with the feature map F2 along the channel dimension to obtain the feature map F3 of size h × w × 2c;

In a specific embodiment, the feature map F1 and the feature map F2 are concatenated along the channel dimension, giving the feature map F3 of size 112 × 112 × 256.

Step 7: convolve the feature map F3 with c convolution kernels of size 3 × 3 × 2c to obtain the feature map F4 of size h × w × c.

In a specific embodiment, the feature map F3 is convolved with 128 convolution kernels of size 3 × 3 × 256, giving the feature map F4 of size 112 × 112 × 128.
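Steps 6 and 7 can be sketched as follows; `padding=1` is an assumption made so that the convolution keeps the h × w size, which the output dimensions stated above imply.

```python
# Hedged sketch of steps 6-7: channel-wise concatenation and a 3x3 fusion
# convolution with c kernels, each spanning all 2c input channels.
import torch
import torch.nn as nn

c = 128
F1 = torch.randn(1, c, 112, 112)                      # spatial-attention features (step 3)
F2 = torch.randn(1, c, 112, 112)                      # channel-attention features (step 5)
F3 = torch.cat([F1, F2], dim=1)                       # (1, 2c, 112, 112) = 112x112x256
fuse = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)  # 128 kernels of size 3x3x256
F4 = fuse(F3)                                         # (1, c, 112, 112) = 112x112x128
```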

Step 8: replace the feature map F with the feature map F4, input it into the subsequent convolutional layers of VGG16 to further extract and classify deep features, and train the whole network without changing the original loss function of VGG16. With each training iteration, the parameters of the network are updated and the weight mask matrices W1 and W2 become progressively more stable; after training, a test image data set is input into the network to evaluate its classification performance on that data set.
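As a hedged end-to-end sketch of step 8, the module below condenses steps 2 to 7; the split of torchvision's VGG16 at `features[9:]` and the internal ReLU are assumptions of this sketch, not part of the claimed method.

```python
# Hedged sketch of step 8: splice a joint-attention block between VGG16's
# early and late stages, keeping VGG16's original classification loss.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class JointAttention(nn.Module):
    """Condenses steps 2-7: spatial mask, channel mask, concat, 3x3 fusion."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // 2), nn.ReLU(), nn.Linear(c // 2, c))
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)

    def forward(self, F):
        W1 = torch.sigmoid(F.amax(dim=1, keepdim=True))   # spatial weight mask
        W2 = torch.sigmoid(self.fc(F.amax(dim=(2, 3))))   # channel weight mask
        F3 = torch.cat([F * W1, F * W2[:, :, None, None]], dim=1)
        return self.fuse(F3)                              # F4 replaces F

base = vgg16(weights=None)
model = nn.Sequential(base.features[:9], JointAttention(128), base.features[9:],
                      base.avgpool, nn.Flatten(), base.classifier)
criterion = nn.CrossEntropyLoss()                 # VGG16's original loss, unchanged
logits = model(torch.randn(2, 3, 224, 224))
loss = criterion(logits, torch.tensor([0, 1]))    # toy labels for illustration
```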

The spatial attention mechanism takes the spatial information of the image into account and learns an input-dependent weight mask that helps the network highlight the target information of interest and suppress the background; a representative method is the Spatial Transformer Network. The channel attention mechanism typically considers the input channel information, computes an attention distribution at the channel level, and then weights the input information accordingly; a representative method is SENet.

The invention therefore considers the spatial and channel attention modules jointly: the spatial attention module and the channel attention module respectively learn the spatial and channel attention distributions of the feature map, producing spatial attention features and channel attention features, which are then fused. Fusing the features extracted by the two attention modules through a convolution realizes dynamic weight allocation, avoids manually set parameters, and enhances the robustness of the model. Compared with the prior art, the feature extraction method of the invention improves the ability to learn and represent complex features, and can therefore improve the accuracy of subsequent visual analysis tasks (image classification, image segmentation, object detection, image retrieval, and the like).

Moreover, the feature extraction method of the invention can be embedded into most convolutional neural networks used for visual analysis tasks; the spatial and channel attention distributions of the feature map are learned adaptively during network training, and the labeled data need not be reprocessed. Compared with existing image feature extraction techniques, the method is more widely applicable and can be used for ordinary images, hyperspectral images, multispectral images, SAR images, and the like.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Matters not described in detail in this specification are within the common knowledge of those skilled in the art.
