Fine-grained image classification method based on attention transfer mechanism

Document No.: 1737502    Publication date: 2019-12-20

Description: this technique, "Fine-grained image classification method based on attention transfer mechanism", was designed and created by Niu Yi, Jiao Yang, Li Fu and Shi Guangming on 2019-09-06. Its main content is as follows: the invention provides a fine-grained image classification method based on an attention transfer mechanism, which is used for improving the classification accuracy of fine-grained images. The implementation steps are: acquiring a training sample set and a test sample set containing fine-grained images; constructing a global perception network model and an attention transfer network model; training the global perception network and the attention transfer network with the training sample set; and classifying the test sample set with the trained global perception network and attention transfer network. By designing the global perception network and the attention transfer network to perform joint feature extraction on the image, the network further extracts the semantic correlation between different discriminative regions on the basis of extracting the global features and discriminative-region features, thereby strengthening the feature extraction capability of the network and effectively improving the fine-grained image classification accuracy.

1. A fine-grained image classification method based on an attention transfer mechanism is characterized by comprising the following steps:

(1) acquiring a training sample set and a testing sample set:

(1a) selecting natural images covering C fine-grained categories, with H images per category, and performing size normalization on all natural images so that each normalized natural image has a pixel size of h × w, wherein C ≥ 2, H ≥ 20, h ≥ 2, and w ≥ 2;

(1b) randomly selecting more than 50 percent of the normalized natural images of each category and labeling them, forming a training sample set from the labeled normalized natural images of each category and their labels, and forming a test sample set from the remaining normalized natural images of each category;

(2) constructing a global perception network GNet model:

(2a) constructing a global perception network GNet structure:

constructing a global perception network structure comprising M global feature extraction networks GNet1 arranged in parallel and a global feature fusion network GNet2 connected with M GNet1, wherein:

the global feature extraction network GNet1 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, a plurality of fully-connected layers, an automatic coding layer and a Softmax layer, and is used for generating M global feature maps F_G^m and M global prediction probabilities y_GNet1^m, wherein m is the global feature map index and m ∈ [1, M];

the global feature fusion network GNet2 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, a cascade layer, a plurality of fully-connected layers and a Softmax layer, and is used for generating a global feature vector f_glb and a global fusion probability y_GNet2;

(2b) defining the loss function of the global perception network GNet:

defining the loss function of the m-th of the M global feature extraction networks GNet1 as the weighted sum L_GNet1^m = α·L_cls^m + β·L_sep^m + γ·L_dis^m, and the loss function L_GNet2 of the global feature fusion network GNet2 as the cross-entropy loss computed from the global fusion probability y_GNet2;

wherein L_cls^m denotes the cross-entropy classification loss, y_gt is the one-hot encoded ground-truth category label of the training image, c is the image category index, y_gt_c is the probability value of the c-th category in y_gt, y_GNet1_c^m is the probability value of the c-th category in the global prediction probability y_GNet1^m, c ∈ [1, C], and α is the weight of L_cls^m in L_GNet1^m; L_sep^m denotes the separation loss, which penalizes overlap between the m-th global feature map F_G^m and the other global feature maps F_G^j, wherein j is a global feature map index, j ∈ [1, M], (x, y) are pixel coordinates in F_G^m, mrg is a margin threshold, and β is the weight of L_sep^m in L_GNet1^m; L_dis^m denotes the distance loss, which concentrates the response of F_G^m around its peak, wherein (t_xm, t_ym) is the position coordinate of the maximum pixel value in F_G^m, and γ is the weight of L_dis^m in L_GNet1^m; y_GNet2_c is the probability value of the c-th category in y_GNet2;

(3) constructing an attention transfer network (ANet) model:

(3a) constructing an attention transfer network ANet structure:

constructing an attention transfer network structure comprising N attention extraction networks ANet1 arranged in parallel and an attention decoding network ANet2 connected to N ANets 1, wherein:

the attention extraction network ANet1 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, an automatic coding layer, a recurrent neural network, a plurality of fully-connected layers and a Softmax layer, and is used for generating N attention feature maps F_A^n and N attention prediction probabilities y_ANet1^n, wherein n is the attention feature map index and n ∈ [1, N];

the attention decoding network ANet2 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, a recurrent neural network, a fully-connected layer and a Softmax layer, and is used for generating an attention decoding probability y_ANet2;

(3b) defining the loss function of the attention transfer network ANet:

defining the loss function L_ANet1^n of the n-th of the N attention extraction networks ANet1 and the loss function L_ANet2 of the attention decoding network ANet2;

wherein y_ANet2_c is the probability value of the c-th category in y_ANet2;

(4) training a global perception network GNet model:

(4a) initializing the parameters of the M global feature extraction networks GNet1 and the parameters of the global feature fusion network GNet2 connected to them, and setting the maximum number of training iterations T_G and the current iteration number t_G = 1;

(4b) randomly selecting BS natural images I_G from the training sample set, BS ≥ 8, using them as the input of each GNet1 and performing forward propagation to obtain the M global feature maps F_G^m and global prediction probabilities y_GNet1^m, and calculating the loss function L_GNet1^m of each GNet1 from F_G^m and y_GNet1^m;

(4c) judging whether t_G ≥ T_G holds: if yes, obtaining the trained GNet1 networks and executing step (4d); otherwise, using a back-propagation method to calculate the network parameter gradient of each GNet1 according to L_GNet1^m, updating the GNet1 network parameters with the gradients, setting t_G = t_G + 1, and executing step (4b);

(4d) computing the global discriminative region R_G^m corresponding to each global feature map F_G^m, using the M regions R_G^m as the input of the global feature fusion network GNet2 and performing forward propagation to obtain the global feature vector f_glb and the global fusion probability y_GNet2, and calculating the loss function L_GNet2 according to y_GNet2;

(4e) judging whether t_G ≥ T_G holds: if yes, obtaining the trained GNet2; otherwise, using the back-propagation method to calculate the network parameter gradient of GNet2 through L_GNet2, updating the GNet2 network parameters with the gradient, setting t_G = t_G + 1, and executing step (4b);

(5) training the attention transfer network ANet:

(5a) initializing the parameters of the N attention extraction networks ANet1 and the parameters of the attention decoding network ANet2 connected to them, and setting the maximum number of training iterations T_A and the current iteration number t_A = 1;

(5b) randomly selecting BS natural images I_A from the training sample set, using them as the input of each ANet1 and performing forward propagation to obtain the N attention feature maps F_A^n and attention prediction probabilities y_ANet1^n, and calculating the loss function L_ANet1^n of each ANet1 from F_A^n and y_ANet1^n;

(5c) judging whether t_A ≥ T_A holds: if yes, obtaining the trained ANet1 networks and executing step (5d); otherwise, using a back-propagation method to calculate the network parameter gradients of the N ANet1 networks according to L_ANet1^n, updating the ANet1 network parameters with the gradients, setting t_A = t_A + 1, and executing step (5b);

(5d) computing the attention discriminative region R_A^n corresponding to each attention feature map F_A^n, using f_glb and the N regions R_A^n as the input of the attention decoding network ANet2 and performing forward propagation to obtain the attention decoding probability y_ANet2, and calculating the loss function L_ANet2 according to y_ANet2;

(5e) judging whether t_A ≥ T_A holds: if yes, obtaining the trained ANet2; otherwise, using the back-propagation method to calculate the network parameter gradient of ANet2 through L_ANet2, updating the ANet2 network parameters with the gradient, setting t_A = t_A + 1, and executing step (5b);

(6) acquiring an image classification result:

(6a) using the test sample set as the input of each trained global feature extraction network GNet1 and each trained attention extraction network ANet1, and performing forward propagation to obtain the M global feature maps F_G^m and the N attention feature maps F_A^n, and computing the global discriminative region R_G^m corresponding to each F_G^m and the attention discriminative region R_A^n corresponding to each F_A^n;

(6b) using the M regions R_G^m as the input of the trained global feature fusion network GNet2 and performing forward propagation to obtain the global feature vector f_glb and the global fusion probability y_GNet2;

(6c) through the iterative Bayesian path, using f_glb and the N regions R_A^n as the input of the trained attention decoding network ANet2 and performing forward propagation to obtain the attention decoding probability y_ANet2;

(6d) performing decision-level fusion of y_GNet2 and y_ANet2 by a maximum voting method to obtain the final prediction probability p:

p = max(y_GNet2, y_ANet2);

(6e) calculating the subscript index idx corresponding to the maximum value in the prediction probability p, and searching an index table to obtain the fine-grained image classification category corresponding to idx.

2. The fine-grained image classification method based on an attention transfer mechanism according to claim 1, wherein the global feature extraction network GNet1 and the global feature fusion network GNet2 described in step (2a) are structured as follows:

the global feature extraction network GNet1 comprises thirteen convolutional layers, five pooling layers, thirteen activation layers, three full-connection layers and an automatic coding layer, and the specific structure is as follows:

the input layer → the first convolution layer → the first active layer → the second convolution layer → the second active layer → the first pooling layer → the third convolution layer → the third active layer → the fourth convolution layer → the fourth active layer → the second pooling layer → the fifth convolution layer → the fifth active layer → the sixth convolution layer → the sixth active layer → the seventh convolution layer → the seventh active layer → the third pooling layer → the eighth convolution layer → the eighth active layer → the ninth convolution layer → the ninth active layer → the tenth convolution layer → the tenth active layer → the fourth pooling layer → the eleventh convolution layer → the eleventh active layer → the twelfth convolution layer → the twelfth active layer → the thirteenth convolution layer → the thirteenth active layer → the fifth pooling layer → the auto-coded layer → the first fully-connected layer → the second fully-connected layer → the third fully-connected layer → the Softmax layer;

wherein, the pooling layer selects a maximum pooling mode, and the activation layer selects a ReLU activation function f (x):

f(x)=max(0,x)

wherein x is the convolutional layer output;

the global feature fusion network GNet2 comprises thirteen convolutional layers, five pooling layers, thirteen active layers, one cascade layer and three full-connection layers, and the specific structure is as follows:

the input layer → the first convolution layer → the first active layer → the second convolution layer → the second active layer → the first pooling layer → the third convolution layer → the third active layer → the fourth convolution layer → the fourth active layer → the second pooling layer → the fifth convolution layer → the fifth active layer → the sixth convolution layer → the sixth active layer → the seventh convolution layer → the seventh active layer → the third pooling layer → the eighth convolution layer → the eighth active layer → the ninth convolution layer → the ninth active layer → the tenth convolution layer → the tenth active layer → the fourth pooling layer → the eleventh convolution layer → the eleventh active layer → the twelfth convolution layer → the twelfth active layer → the thirteenth convolution layer → the thirteenth active layer → the fifth pooling layer → the cascade → the first fully-connected layer → the second fully-connected layer → the third fully-connected layer → the Softmax layer;

the activation layer selects a ReLU activation function f (x), and the pooling layer selects a maximum pooling mode.

3. The fine-grained image classification method based on attention-transfer mechanism as claimed in claim 1, wherein the attention-extracting network ANet1 and the attention-decoding network ANet2 in step (3a) are set as follows:

the attention extraction network ANet1 comprises thirteen convolutional layers, five pooling layers, thirteen activation layers, an automatic coding layer, a recurrent neural network and three full-connection layers, and has the specific structure that:

the main network structure: the input layer → the first convolution layer → the first active layer → the second convolution layer → the second active layer → the first pooling layer → the third convolution layer → the third active layer → the fourth convolution layer → the fourth active layer → the second pooling layer → the fifth convolution layer → the fifth active layer → the sixth convolution layer → the sixth active layer → the seventh convolution layer → the seventh active layer → the third pooling layer → the eighth convolution layer → the eighth active layer → the ninth convolution layer → the ninth active layer → the tenth convolution layer → the tenth active layer → the fourth pooling layer → the eleventh convolution layer → the eleventh active layer → the twelfth convolution layer → the twelfth active layer → the thirteenth convolution layer → the thirteenth active layer → the fifth pooling layer → the first fully-connected layer → the auto-coded layer → the recurrent neural network → the second fully-connected layer → the Softmax layer;

the branch network structure: fifth pooling layer → third fully-connected layer → recurrent neural network;

the method comprises the following steps that an active layer selects a ReLU active function f (x), a pooling layer selects a maximum pooling mode, and a circulating neural network selects a single-layer LSTM network;

the attention decoding network ANet2 comprises thirteen convolutional layers, five pooling layers, thirteen activation layers, a recurrent neural network and a full connection layer, and the specific structure is as follows:

the input layer → the first convolution layer → the first active layer → the second convolution layer → the second active layer → the first pooling layer → the third convolution layer → the third active layer → the fourth convolution layer → the fourth active layer → the second pooling layer → the fifth convolution layer → the fifth active layer → the sixth convolution layer → the sixth active layer → the seventh convolution layer → the seventh active layer → the third pooling layer → the eighth convolution layer → the eighth active layer → the ninth convolution layer → the ninth active layer → the tenth convolution layer → the tenth active layer → the fourth pooling layer → the eleventh convolution layer → the eleventh active layer → the twelfth convolution layer → the twelfth active layer → the thirteenth convolution layer → the thirteenth active layer → the fifth pooling layer → the recurrent neural network → the first fully connected layer → the Softmax layer;

the activation layer selects a ReLU activation function f (x), the pooling layer selects a maximum pooling mode, and the recurrent neural network selects a multi-layer LSTM network.

4. The fine-grained image classification method based on an attention transfer mechanism according to claim 1, wherein computing the global discriminative region R_G^m corresponding to each global feature map F_G^m in step (4d) comprises the following specific steps:

(4d1) computing the pixel coordinate (t_xm, t_ym) of the maximum value in F_G^m;

(4d2) calculating the pixel coordinate (t'_xm, t'_ym) in the input natural image I_G corresponding to (t_xm, t_ym):

t'_xm = t_xm · w / w_m, t'_ym = t_ym · h / h_m

wherein h_m and w_m are respectively the pixel height and pixel width of F_G^m, and h and w are the pixel height and pixel width of I_G;

(4d3) taking (t'_xm, t'_ym) as the center and k as the half side length, cropping from I_G an image block of pixel size 2k × 2k as the global discriminative region R_G^m, with k ≥ 8.

5. The fine-grained image classification method based on an attention transfer mechanism according to claim 1, wherein computing the attention discriminative region R_A^n corresponding to each attention feature map F_A^n in step (5d) comprises the following specific steps:

(5d1) computing the pixel coordinate (t_xn, t_yn) of the maximum value in F_A^n;

(5d2) calculating the pixel coordinate (t'_xn, t'_yn) in the input natural image I_A corresponding to (t_xn, t_yn):

t'_xn = t_xn · w / w_n, t'_yn = t_yn · h / h_n

wherein h_n and w_n are respectively the pixel height and pixel width of F_A^n, and h and w are the pixel height and pixel width of I_A;

(5d3) taking (t'_xn, t'_yn) as the center and k as the half side length, cropping from I_A an image block of pixel size 2k × 2k as the attention discriminative region R_A^n, with k ≥ 8.

6. The fine-grained image classification method based on an attention transfer mechanism according to claim 1, wherein in step (6c), using f_glb and the N attention discriminative regions R_A^n as the input of the trained attention decoding network ANet2 through the iterative Bayesian path and performing forward propagation to obtain the attention decoding probability y_ANet2 comprises the following specific steps:

(6c1) inputting the global feature vector f_glb into the recurrent neural network in ANet2 as the input at time step 0;

(6c2) inputting the N attention discriminative regions R_A^n into ANet2 for feature extraction, and sequentially feeding the extracted features into the recurrent neural network in ANet2 as the inputs at time steps 1 to N;

(6c3) performing forward propagation and taking the output of the recurrent neural network at time step N as the attention decoding probability y_ANet2.

Technical Field

The invention belongs to the technical field of computer vision, and relates to a fine-grained image classification method, in particular to a fine-grained image classification method based on an attention transfer mechanism, which can be used for fine classification tasks such as bird classification and automobile classification.

Background

Image classification is a processing technique that extracts the discriminative features of images by hand-crafted design or by learning, enabling intelligent devices to automatically identify the category to which the image subject belongs. Image classification methods are widely applied in many fields of society, such as face recognition and species recognition. According to the granularity of the classification target, image classification methods can generally be divided into two categories: general (coarse-grained) image classification and subclass (fine-grained) image classification. Coarse-grained image classification aims at identifying objects of different classes, such as distinguishing cats from dogs; fine-grained image classification aims at identifying different subclasses of the same class of objects, such as distinguishing blue cats from Persian cats, or Alaskan Malamutes from Huskies.

In recent years, with the rapid development of deep learning, the deepening of social intelligence and the continuous expansion of open-source image classification databases, the accuracy of coarse-grained image classification tasks has exceeded the average level of human beings. Therefore, fine-grained image classification is gaining more and more attention. Compared with coarse-grained images, fine-grained images not only exhibit variations in illumination and pose, but also have small inter-class distances and large intra-class distances. These problems cause great difficulty in locating discriminative regions and extracting features during fine-grained image classification. Therefore, how to accurately and adaptively locate the discriminative regions of a fine-grained image and accurately extract features from those regions has become the key to improving fine-grained classification accuracy.

According to different distinguishing area positioning methods, the fine-grained image classification method can be divided into two types: the method comprises a fine-grained image classification method based on strong supervision and a fine-grained image classification method based on weak supervision. Compared with a fine-grained image classification method based on strong supervision, the fine-grained image classification method based on weak supervision only utilizes image category labels to carry out discrimination area positioning, and does not use extra discrimination area labels. The general process of the fine-grained classification method based on weak supervision is as follows: firstly, extracting features of a fine-grained image; secondly, by adopting different methods, such as a method based on visual attention, a method based on an image pyramid and the like, distinguishing region positioning is carried out on the fine-grained image by using the extracted features, and further feature extraction is carried out on the obtained distinguishing region; and finally, classifying the images according to the extracted image characteristics. The fine-grained classification based on weak supervision has the advantages of high training efficiency, self-adaption of a discrimination area, low label cost and the like, and is a research hotspot in the field of current fine-grained image classification.

So far, there are the following methods for classifying fine-grained images based on weak supervision:

for example, the patent application with publication number CN 110084285A, entitled "Fish fine-grained classification method based on deep learning", discloses a fine-grained fish image classification method based on deep learning. The method preprocesses the acquired images, performs feature extraction with a deep neural network, constructs a feature pyramid network for region proposal, crops the discriminative regions and extracts their features, performs a preliminary classification with the extracted features, feeds the classification accuracy back to the region proposal network as a supervision signal, fuses the region features with the global features, sends them to a fully-connected layer for classification, and outputs the final classification result. The method locates discriminative regions by designing a feature pyramid, alleviating the low accuracy caused by large intra-class differences and small inter-class differences in fine-grained image classification tasks. However, the feature pyramid is constructed on only three scales, 14 × 14, 7 × 7 and 4 × 4, so the discriminative regions of a fine-grained image can only be generated at these scales and cannot be located accurately, and the classification accuracy therefore remains low.

For another example, Zheng Heliang et al. published a paper entitled "Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition" at the IEEE International Conference on Computer Vision in 2017, which discloses a fine-grained image classification method based on a multi-attention mechanism. The method alleviates the inaccurate localization of discriminative regions in fine-grained image classification tasks, but because it does not consider the semantic correlation between different discriminative regions, the feature extraction capability of the network is reduced and the classification accuracy remains low.

Disclosure of Invention

The invention aims to provide a fine-grained image classification method based on an attention transfer mechanism aiming at the defects of the prior art, and the fine-grained image classification method is used for solving the technical problem of low fine-grained image classification accuracy in the prior art.

The technical idea of the invention is as follows: firstly, extracting image global features through a global perception network GNet and positioning a global discrimination area based on the global features; then, an attention transfer network ANet is used for positioning an attention discrimination area, and correlation characteristics among different attention discrimination areas are extracted; and finally, fusing output results of the GNet and the ANet by adopting decision-level fusion to obtain a final fine-grained classification result. The method comprises the following specific steps:

(1) acquiring a training sample set and a testing sample set:

(1a) selecting natural images covering C fine-grained categories, with H images per category, and performing size normalization on all natural images so that each normalized natural image has a pixel size of h × w, wherein C ≥ 2, H ≥ 20, h ≥ 2, and w ≥ 2;

(1b) randomly selecting more than 50 percent of the normalized natural images of each category and labeling them, forming a training sample set from the labeled normalized natural images of each category and their labels, and forming a test sample set from the remaining normalized natural images of each category;

(2) constructing a global perception network GNet model:

(2a) constructing a global perception network GNet structure:

constructing a global perception network structure comprising M global feature extraction networks GNet1 arranged in parallel and a global feature fusion network GNet2 connected with M GNet1, wherein:

the global feature extraction network GNet1 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, a plurality of fully-connected layers, an automatic coding layer and a Softmax layer, and is used for generating M global feature maps F_G^m and M global prediction probabilities y_GNet1^m, wherein m is the global feature map index and m ∈ [1, M];

the global feature fusion network GNet2 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, a cascade layer, a plurality of fully-connected layers and a Softmax layer, and is used for generating a global feature vector f_glb and a global fusion probability y_GNet2;

(2b) defining the loss function of the global perception network GNet:

defining the loss function of the m-th of the M global feature extraction networks GNet1 as the weighted sum L_GNet1^m = α·L_cls^m + β·L_sep^m + γ·L_dis^m, and the loss function L_GNet2 of the global feature fusion network GNet2 as the cross-entropy loss computed from the global fusion probability y_GNet2;

wherein L_cls^m denotes the cross-entropy classification loss, y_gt is the one-hot encoded ground-truth category label of the training image, c is the image category index, y_gt_c is the probability value of the c-th category in y_gt, y_GNet1_c^m is the probability value of the c-th category in the global prediction probability y_GNet1^m, c ∈ [1, C], and α is the weight of L_cls^m in L_GNet1^m; L_sep^m denotes the separation loss, which penalizes overlap between the m-th global feature map F_G^m and the other global feature maps F_G^j, wherein j is a global feature map index, j ∈ [1, M], (x, y) are pixel coordinates in F_G^m, mrg is a margin threshold, and β is the weight of L_sep^m in L_GNet1^m; L_dis^m denotes the distance loss, which concentrates the response of F_G^m around its peak, wherein (t_xm, t_ym) is the position coordinate of the maximum pixel value in F_G^m, and γ is the weight of L_dis^m in L_GNet1^m; y_GNet2_c is the probability value of the c-th category in y_GNet2;
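For illustration, the composition of the GNet1 loss can be sketched in PyTorch as below. This is a minimal sketch, not the invention's exact formulas: the concrete forms of the separation loss and distance loss are assumptions (the text only names these terms and their symbols), and the weights alpha, beta, gamma and the margin mrg are the hyperparameters of step (2b).

```python
import torch
import torch.nn.functional as F

def gnet1_losses(feat_maps, probs, labels, alpha=1.0, beta=1.0, gamma=1.0, mrg=0.02):
    """Sketch of the per-branch loss L_GNet1^m = alpha*L_cls + beta*L_sep + gamma*L_dis.
    feat_maps: tensor [M, B, h_m, w_m] of global feature maps F_G^m
    probs:     list of M tensors [B, C] (Softmax outputs y_GNet1^m)
    labels:    tensor [B] of integer class labels
    """
    M, B, hm, wm = feat_maps.shape
    ys, xs = torch.meshgrid(torch.arange(hm), torch.arange(wm), indexing="ij")
    losses = []
    for m in range(M):
        # classification term: cross-entropy between y_GNet1^m and the ground-truth label
        l_cls = F.nll_loss(torch.log(probs[m] + 1e-8), labels)
        # separation term (assumed form): penalize responses shared with the other maps
        others = torch.stack([feat_maps[j] for j in range(M) if j != m])
        l_sep = F.relu(feat_maps[m].unsqueeze(0) * others - mrg).mean()
        # distance term (assumed form): pull the response of F_G^m toward its peak (t_xm, t_ym)
        flat = feat_maps[m].flatten(1)                       # [B, hm*wm]
        peak = flat.argmax(dim=1)
        ty = torch.div(peak, wm, rounding_mode="floor")
        tx = peak % wm
        d2 = (ys.flatten()[None] - ty[:, None]) ** 2 + (xs.flatten()[None] - tx[:, None]) ** 2
        l_dis = (flat * d2.float()).mean()
        losses.append(alpha * l_cls + beta * l_sep + gamma * l_dis)
    return losses  # one loss per GNet1 branch
```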

(3) constructing an attention transfer network (ANet) model:

(3a) constructing an attention transfer network ANet structure:

constructing an attention transfer network structure comprising N attention extraction networks ANet1 arranged in parallel and an attention decoding network ANet2 connected to N ANets 1, wherein:

the attention extraction network ANet1 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, an automatic coding layer, a recurrent neural network, a plurality of fully-connected layers and a Softmax layer, and is used for generating N attention feature maps F_A^n and N attention prediction probabilities y_ANet1^n, wherein n is the attention feature map index and n ∈ [1, N];

the attention decoding network ANet2 comprises an input layer, a plurality of convolutional layers, a plurality of pooling layers, a plurality of activation layers, a recurrent neural network, a fully-connected layer and a Softmax layer, and is used for generating an attention decoding probability y_ANet2;

(3b) defining the loss function of the attention transfer network ANet:

defining the loss function L_ANet1^n of the n-th of the N attention extraction networks ANet1 and the loss function L_ANet2 of the attention decoding network ANet2;

wherein y_ANet2_c is the probability value of the c-th category in y_ANet2;

(4) training a global perception network GNet model:

(4a) initializing the parameters of the M global feature extraction networks GNet1 and the parameters of the global feature fusion network GNet2 connected to them, and setting the maximum number of training iterations T_G and the current iteration number t_G = 1;

(4b) randomly selecting BS natural images I_G from the training sample set, BS ≥ 8, using them as the input of each GNet1 and performing forward propagation to obtain the M global feature maps F_G^m and global prediction probabilities y_GNet1^m, and calculating the loss function L_GNet1^m of each GNet1 from F_G^m and y_GNet1^m;

(4c) judging whether t_G ≥ T_G holds: if yes, obtaining the trained GNet1 networks and executing step (4d); otherwise, using a back-propagation method to calculate the network parameter gradient of each GNet1 according to L_GNet1^m, updating the GNet1 network parameters with the gradients, setting t_G = t_G + 1, and executing step (4b);

(4d) computing the global discriminative region R_G^m corresponding to each global feature map F_G^m, using the M regions R_G^m as the input of the global feature fusion network GNet2 and performing forward propagation to obtain the global feature vector f_glb and the global fusion probability y_GNet2, and calculating the loss function L_GNet2 according to y_GNet2;

(4e) judging whether t_G ≥ T_G holds: if yes, obtaining the trained GNet2; otherwise, using the back-propagation method to calculate the network parameter gradient of GNet2 through L_GNet2, updating the GNet2 network parameters with the gradient, setting t_G = t_G + 1, and executing step (4b);
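A compact sketch of the two-stage training schedule of steps (4a)-(4e) is given below. It assumes hypothetical module and helper names (gnet1_branches, gnet2, branch_loss, fusion_loss, crop_regions) and a switch iteration separating the GNet1 stage from the GNet2 stage, as in the embodiment described later; it is not the invention's reference implementation.

```python
import torch

def train_gnet(gnet1_branches, gnet2, loader, T_G, switch_iter, device="cuda"):
    """Sketch of steps (4a)-(4e): the GNet1 branches are optimized first; afterwards
    their discriminative regions feed GNet2, which is optimized for the remaining
    iterations. Module and helper names are illustrative."""
    opt1 = torch.optim.Adam([p for b in gnet1_branches for p in b.parameters()],
                            betas=(0.9, 0.999), weight_decay=0.001)
    opt2 = torch.optim.Adam(gnet2.parameters(), betas=(0.9, 0.999), weight_decay=0.001)
    t_G, data_iter = 1, iter(loader)
    while t_G <= T_G:
        try:
            images, labels = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            images, labels = next(data_iter)
        images, labels = images.to(device), labels.to(device)
        # each GNet1 branch returns a global feature map F_G^m and a prediction y_GNet1^m
        feat_maps, probs = zip(*[branch(images) for branch in gnet1_branches])
        if t_G <= switch_iter:
            # stage 1: update the GNet1 branches only
            loss = sum(branch_loss(f, p, labels) for f, p in zip(feat_maps, probs))
            opt1.zero_grad(); loss.backward(); opt1.step()
        else:
            # stage 2: crop the global discriminative regions and update GNet2 only
            # (only opt2 steps here, so the GNet1 parameters stay fixed)
            regions = [crop_regions(images, f, k=16) for f in feat_maps]
            f_glb, y_gnet2 = gnet2(regions)
            loss = fusion_loss(y_gnet2, labels)
            opt2.zero_grad(); loss.backward(); opt2.step()
        t_G += 1
    return gnet1_branches, gnet2
```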

(5) training the attention transfer network ANet:

(5a) initializing the parameters of the N attention extraction networks ANet1 and the parameters of the attention decoding network ANet2 connected to them, and setting the maximum number of training iterations T_A and the current iteration number t_A = 1;

(5b) randomly selecting BS natural images I_A from the training sample set, using them as the input of each ANet1 and performing forward propagation to obtain the N attention feature maps F_A^n and attention prediction probabilities y_ANet1^n, and calculating the loss function L_ANet1^n of each ANet1 from F_A^n and y_ANet1^n;

(5c) judging whether t_A ≥ T_A holds: if yes, obtaining the trained ANet1 networks and executing step (5d); otherwise, using a back-propagation method to calculate the network parameter gradients of the N ANet1 networks according to L_ANet1^n, updating the ANet1 network parameters with the gradients, setting t_A = t_A + 1, and executing step (5b);

(5d) computing the attention discriminative region R_A^n corresponding to each attention feature map F_A^n, using f_glb and the N regions R_A^n as the input of the attention decoding network ANet2 and performing forward propagation to obtain the attention decoding probability y_ANet2, and calculating the loss function L_ANet2 according to y_ANet2;

(5e) judging whether t_A ≥ T_A holds: if yes, obtaining the trained ANet2; otherwise, using the back-propagation method to calculate the network parameter gradient of ANet2 through L_ANet2, updating the ANet2 network parameters with the gradient, setting t_A = t_A + 1, and executing step (5b);

(6) acquiring an image classification result:

(6a) using the test sample set as the input of each trained global feature extraction network GNet1 and each trained attention extraction network ANet1, and performing forward propagation to obtain the M global feature maps F_G^m and the N attention feature maps F_A^n, and computing the global discriminative region R_G^m corresponding to each F_G^m and the attention discriminative region R_A^n corresponding to each F_A^n;

(6b) using the M regions R_G^m as the input of the trained global feature fusion network GNet2 and performing forward propagation to obtain the global feature vector f_glb and the global fusion probability y_GNet2;

(6c) through the iterative Bayesian path, using f_glb and the N regions R_A^n as the input of the trained attention decoding network ANet2 and performing forward propagation to obtain the attention decoding probability y_ANet2;

(6d) performing decision-level fusion of y_GNet2 and y_ANet2 by a maximum voting method to obtain the final prediction probability p:

p = max(y_GNet2, y_ANet2);

(6e) calculating the subscript index idx corresponding to the maximum value in the prediction probability p, and searching an index table to obtain the fine-grained image classification category corresponding to idx.
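The decision-level fusion and index lookup of steps (6d)-(6e) can be written directly; idx_to_class below is an assumed index table mapping subscript indices to fine-grained category names.

```python
import torch

def fuse_and_classify(y_gnet2, y_anet2, idx_to_class):
    """Steps (6d)-(6e) sketch: element-wise maximum voting over the two probability
    vectors, then an index-table lookup of the winning category.
    y_gnet2, y_anet2: [C] probability vectors; idx_to_class: list of C category names."""
    p = torch.maximum(y_gnet2, y_anet2)   # p = max(y_GNet2, y_ANet2)
    idx = int(torch.argmax(p))            # subscript index of the maximum value in p
    return idx_to_class[idx]
```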

Compared with the prior art, the invention has the following advantages:

the method performs combined feature extraction on the image by designing the global perception network GNet and the attention transfer network ANet, and further extracts semantic correlation among different discrimination regions through the ANet on the basis of extracting global features and discrimination region features of the image by the GNet, so that the feature extraction capability of the network is enhanced.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Fig. 2 is a diagram of the global sensing network architecture of the present invention.

Fig. 3 is a diagram of an attention transfer network architecture of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments.

Referring to fig. 1, a fine-grained image classification method based on an attention transfer mechanism includes the following steps:

step 1) obtaining a training sample set and a testing sample set:

step 1a) this embodiment adopts the Caltech bird database CUB-200-2011, which contains 11788 natural images of 200 bird species, such as the Black-footed Albatross, the Yellow-billed Cuckoo and the White-crowned Sparrow;

step 1b) performing data enhancement on all bird natural images in the database, which comprises randomly rotating each image by an angle in [-10°, +10°] and then randomly flipping it horizontally;

step 1c) performing size normalization on the enhanced bird natural images, the pixel size of each normalized image being 448 × 448;

step 1d) selecting 5994 images from all bird natural images with normalized sizes for category labeling, then using the labeled bird natural images with normalized sizes and labels thereof as a training sample set, and using the remaining 5794 bird natural images with normalized sizes as a test sample set.
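The preprocessing of steps 1b) and 1c) corresponds to standard torchvision transforms; the sketch below assumes the CUB-200-2011 images are loaded as PIL images.

```python
from torchvision import transforms

# Steps 1b)-1c): random rotation in [-10, +10] degrees, random horizontal flip,
# then size normalization to 448 x 448 pixels.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=10),    # rotate by an angle in [-10, +10]
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip
    transforms.Resize((448, 448)),            # size normalization
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])
```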

Step 2) constructing a global perception network GNet model shown in FIG. 2:

step 2a) setting the number of global feature extraction networks GNet1 to M = 4;

step 2b) constructing a VGG16 feature extraction network composed of a first convolution layer → a first active layer → a second convolution layer → a second active layer → a first pooling layer → a third convolution layer → a third active layer → a fourth convolution layer → a fourth active layer → a second pooling layer → a fifth convolution layer → a fifth active layer → a sixth convolution layer → a sixth active layer → a seventh convolution layer → a seventh active layer → a third pooling layer → an eighth convolution layer → an eighth active layer → a ninth convolution layer → a ninth active layer → a tenth convolution layer → a tenth active layer → a fourth pooling layer → an eleventh convolution layer → an eleventh active layer → a twelfth convolution layer → a twelfth active layer → a thirteenth convolution layer → a thirteenth active layer → a fifth pooling layer, stacked in this order;

wherein, the pooling layer selects a maximum pooling mode, and the activation layer selects a ReLU activation function f (x), which specifically comprises:

f(x)=max(0,x)

wherein x is the convolutional layer output;

parameter setting of the convolutional layer:

the size of the first convolution layer convolution kernel is set to be 3, and the number of the convolution kernels is set to be 64;

the size of the second convolutional layer convolution kernel is set to be 3, and the number of convolution kernels is set to be 64;

the size of the convolution kernel of the third convolution layer is set to be 3, and the number of the convolution kernels is set to be 128;

the size of the convolution kernel of the fourth convolution layer is set to be 3, and the number of the convolution kernels is set to be 128;

the size of the fifth convolutional layer convolution kernel is set to be 3, and the number of convolution kernels is set to be 256;

the size of the convolution kernel of the sixth convolution layer is set to be 3, and the number of the convolution kernels is set to be 256;

the size of the seventh convolutional layer convolution kernel is set to be 3, and the number of convolution kernels is set to be 256;

the size of the convolution kernel of the eighth convolution layer is set to be 3, and the number of the convolution kernels is set to be 512;

the size of the convolution kernel of the ninth convolution layer is set to be 3, and the number of the convolution kernels is set to be 512;

the size of the tenth convolutional layer convolution kernel is set to be 3, and the number of convolution kernels is set to be 512;

the size of the eleventh convolutional kernel is set to 3, and the number of convolutional kernels is set to 512;

the size of the twelfth convolutional kernel is set to be 3, and the number of convolutional kernels is set to be 512;

the size of the thirteenth convolutional kernel is set to 3, and the number of convolutional kernels is set to 512;
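The thirteen 3 × 3 convolutional layers with the kernel counts listed above, together with the ReLU activations and the five max-pooling layers, form the convolutional part of the standard VGG16 network, so the backbone can be taken from torchvision (the pre-trained ImageNet weights are only loaded later, in step 4a):

```python
import torchvision

# Convolutional part of VGG16: thirteen 3x3 convolution layers with
# 64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512 kernels,
# ReLU activations and five max-pooling layers (conv1_1 ... pool5).
vgg16_features = torchvision.models.vgg16().features
```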

step 2c), constructing an automatic coding layer consisting of a first full connection layer, a fourteenth active layer, a second full connection layer, a fifteenth active layer, a third full connection layer and a sixteenth active layer which are sequentially stacked;

wherein the fourteenth activation layer and the fifteenth activation layer use the ReLU activation function f(x), and the sixteenth activation layer uses a Sigmoid activation function s(z):

s(z) = 1 / (1 + e^(-z))

wherein z is the feature vector output by the third fully-connected layer;

setting parameters of a full connection layer:

the first fully connected layer length is set to 512;

the second fully connected layer length is set to 64;

the third fully connected layer length is set to 512;
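A sketch of the automatic coding layer of step 2c), assuming its input is the 512-dimensional vector produced by the preceding global pooling layer:

```python
import torch.nn as nn

# Automatic coding layer of step 2c): fully-connected lengths 512 -> 64 -> 512,
# with ReLU after the first two layers and Sigmoid after the third.
auto_coding_layer = nn.Sequential(
    nn.Linear(512, 512),   # first fully-connected layer  + fourteenth activation (ReLU)
    nn.ReLU(inplace=True),
    nn.Linear(512, 64),    # second fully-connected layer + fifteenth activation (ReLU)
    nn.ReLU(inplace=True),
    nn.Linear(64, 512),    # third fully-connected layer  + sixteenth activation (Sigmoid)
    nn.Sigmoid(),
)
```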

step 2d) constructing M global feature extraction networks GNet1, wherein each GNet1 consists of a VGG16 feature extraction network, a global pooling layer, an automatic coding layer and a Softmax layer which are sequentially stacked;

step 2e) constructing a global feature fusion network GNet2 consisting of a VGG16 feature extraction network, a cascade layer and a Softmax layer which are sequentially stacked;

step 3) constructing an attention transfer network ANet model:

step 3a) setting the number of attention extraction networks ANet1 to N = 4;

step 3b), constructing an H layer consisting of a full connection layer and a ReLU activation layer which are sequentially stacked, wherein the length of the full connection layer is set to be 512;

step 3c) constructing a C layer consisting of a full connection layer and a ReLU activation layer which are sequentially stacked, wherein the length of the full connection layer is set to be 512;

step 3d) constructing a recurrent neural network consisting of LSTM, wherein the length of a hidden layer in the LSTM is set to be 512;

step 3e) constructing N attention extraction networks ANet1, wherein each ANet1 consists of a VGG16 feature extraction network, a global pooling layer, an H layer, a C layer, an automatic coding layer, a cyclic neural network and a Softmax layer which are sequentially stacked;

step 3f), constructing an attention decoding network ANet2 composed of a VGG16 feature extraction network, a recurrent neural network and a Softmax layer which are sequentially stacked;
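A structural sketch of the attention decoding network ANet2 of step 3f) is given below; it assumes 512-dimensional region features from the VGG16 backbone after global pooling and the 200 classes of the CUB-200-2011 embodiment. Layer names are illustrative, not the invention's.

```python
import torch
import torch.nn as nn
import torchvision

class ANet2(nn.Module):
    """Sketch of step 3f): VGG16 feature extraction network -> recurrent neural
    network (LSTM, hidden size 512) -> fully-connected layer -> Softmax."""
    def __init__(self, num_classes=200, hidden=512):
        super().__init__()
        self.backbone = torchvision.models.vgg16().features
        self.pool = nn.AdaptiveAvgPool2d(1)            # global pooling to a 512-d vector
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def encode(self, x):
        # extract a 512-d feature vector from an image block
        return self.pool(self.backbone(x)).flatten(1)

    def forward(self, seq):
        # seq: [B, T, 512] sequence of features over the time steps
        out, _ = self.rnn(seq)
        # attention decoding probability y_ANet2 read from the last time step
        return torch.softmax(self.fc(out[:, -1]), dim=1)
```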

step 4), training a global perception network GNet:

step 4a) initializing the VGG16 feature extraction networks in the M global feature extraction networks GNet1 and in the global feature fusion network GNet2 connected to them with the VGG16 feature extraction network parameters pre-trained on the ImageNet database, and randomly initializing the parameters of the remaining layers;

step 4b) setting the maximum number of training iterations T_G = 2000 and the current iteration number t_G = 1;

step 4c) randomly selecting BS = 128 natural images I_G from the training sample set, using them as the input of each GNet1 and performing forward propagation to obtain M = 4 global feature maps F_G^m and global prediction probabilities y_GNet1^m, wherein m is the global feature map index and m ∈ [1, 4];

step 4d) determining whether t_G ≥ 1000 holds: if so, obtaining the trained GNet1 networks and executing step (4e); otherwise, using a back-propagation method to calculate the network parameter gradients of the 4 GNet1 networks according to L_GNet1^m, updating each GNet1 network's parameters with the Adam optimization method, setting t_G = t_G + 1, and executing step (4c); the Adam optimizer momentum parameters are set to [0.9, 0.999] and the weight decay coefficient to 0.001;

step 4e) computing the pixel coordinate (t_xm, t_ym) of the maximum value in each global feature map F_G^m and the corresponding pixel coordinate (t'_xm, t'_ym) in the input natural image I_G, wherein t'_xm = t_xm · 448 / w_m and t'_ym = t_ym · 448 / h_m, and, taking (t'_xm, t'_ym) as the center and k = 16 as the half side length, cropping from I_G an image block of pixel size 32 × 32 as the global discriminative region R_G^m;
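The peak localization, coordinate mapping and 32 × 32 cropping of step 4e) can be sketched as follows; the integer rounding and the clamping of the crop window to the image border are assumptions not spelled out in the text.

```python
import torch

def crop_discriminative_region(image, feat_map, k=16):
    """Step 4e) sketch: locate the peak of a global feature map, map it back to the
    448 x 448 input image, and crop a 2k x 2k (= 32 x 32) image block.
    image: [3, 448, 448]; feat_map: [h_m, w_m]."""
    h_m, w_m = feat_map.shape
    H, W = image.shape[1:]
    peak = torch.argmax(feat_map)                      # flattened index of the maximum value
    t_ym, t_xm = int(peak) // w_m, int(peak) % w_m
    # map the peak coordinate from the feature map to input-image coordinates
    cy, cx = int(t_ym * H / h_m), int(t_xm * W / w_m)
    # clamp so the 2k x 2k block stays inside the image (assumed boundary rule)
    cy, cx = min(max(cy, k), H - k), min(max(cx, k), W - k)
    return image[:, cy - k:cy + k, cx - k:cx + k]      # global discriminative region
```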

step 4f) using the 4 global discriminative regions R_G^m as the input of the global feature fusion network GNet2 and performing forward propagation to obtain the global feature vector f_glb and the global fusion probability y_GNet2, and calculating the loss function L_GNet2 according to y_GNet2;

step 4g) determining whether t_G ≥ 2000 holds: if so, obtaining the trained GNet2; otherwise, using the back-propagation method to calculate the network parameter gradient of GNet2 through L_GNet2, updating the GNet2 network parameters with the gradient, setting t_G = t_G + 1, and executing step (4c);

step 5) training an attention transfer network ANet:

step 5a) initializing the VGG16 feature extraction networks in the N attention extraction networks ANet1 and in the attention decoding network ANet2 connected to them with the VGG16 feature extraction network parameters pre-trained on the ImageNet database, and randomly initializing the parameters of the remaining layers;

step 5b) setting the maximum number of training iterations T_A = 2000 and the current iteration number t_A = 1;

step 5c) randomly selecting BS = 128 natural images I_A from the training sample set, using them as the input of each ANet1 and performing forward propagation to obtain N = 4 attention feature maps F_A^n and attention prediction probabilities y_ANet1^n, wherein n is the attention feature map index and n ∈ [1, 4];

step 5d) determining whether t_A ≥ 1000 holds: if so, obtaining the trained ANet1 networks and executing step (5e); otherwise, using a back-propagation method to calculate the network parameter gradients of the 4 ANet1 networks according to L_ANet1^n, updating each ANet1 network's parameters with the Adam optimization method, setting t_A = t_A + 1, and executing step (5c); the Adam optimizer momentum parameters are set to [0.9, 0.999] and the weight decay coefficient to 0.001;

step 5e) computing the pixel coordinate (t_xn, t_yn) of the maximum value in each attention feature map F_A^n and the corresponding pixel coordinate (t'_xn, t'_yn) in the input natural image I_A, wherein t'_xn = t_xn · 448 / w_n and t'_yn = t_yn · 448 / h_n, and, taking (t'_xn, t'_yn) as the center and k = 16 as the half side length, cropping from I_A an image block of pixel size 32 × 32 as the attention discriminative region R_A^n;

step 5f) using the 4 attention discriminative regions R_A^n as the input of the attention decoding network ANet2; after passing through the VGG16 feature extraction network, their features are sequentially used as the inputs of the recurrent neural network at time steps 1 to 4, while f_glb is input into the recurrent neural network in ANet2 at time step 0 according to the iterative Bayesian path; forward propagation is then performed to obtain the attention decoding probability y_ANet2, and the loss function L_ANet2 is calculated according to y_ANet2;
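A usage sketch of the decoding order in step 5f) and step (6c): f_glb enters the recurrent network at time step 0 and the region features at time steps 1 to N; anet2 is assumed to be a module like the ANet2 sketch shown after step 3f).

```python
import torch

def decode_attention(anet2, f_glb, regions):
    """Step 5f)/(6c) sketch: feed f_glb at time step 0 and the N attention
    discriminative regions at time steps 1..N, then read the last output.
    f_glb: [B, 512]; regions: list of N image-block batches [B, 3, 32, 32]."""
    region_feats = [anet2.encode(r) for r in regions]     # N tensors of shape [B, 512]
    seq = torch.stack([f_glb] + region_feats, dim=1)      # [B, N+1, 512], time steps 0..N
    return anet2(seq)                                     # attention decoding probability y_ANet2
```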

step 5g) determining whether t_A ≥ 2000 holds: if so, obtaining the trained ANet2; otherwise, using the back-propagation method to calculate the network parameter gradient of ANet2 through L_ANet2, updating the ANet2 network parameters with the gradient, setting t_A = t_A + 1, and executing step (5c);

step 6) classifying the test sample set through the trained GNet and ANet:

step 6a) inputting the test sample set into the trained global perception network GNet and attention transfer network ANet, and classifying the target contained in each natural image to obtain the classification results y_GNet2 and y_ANet2 of GNet and ANet on the fine-grained image targets;

step 6b) performing decision-level fusion of y_GNet2 and y_ANet2 by the maximum voting method to obtain the final prediction probability p:

p = max(y_GNet2, y_ANet2)

calculating the subscript index corresponding to the maximum value in the prediction probability p, obtaining the fine-grained image classification category of each image in the test sample set by searching an index table, and counting the classification accuracy of the bird images in the test sample set:

Accuracy = (1 / N_test) · Σ_{i=1..N_test} [c_i = c'_i]

wherein N_test is the number of natural images in the test sample set, [·] equals 1 when the condition inside holds and 0 otherwise, and c_i and c'_i are respectively the predicted category and the true category of the i-th natural image in the test sample set.
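The accuracy statistic can be computed as below; predicted and true categories are assumed to be integer class indices.

```python
def classification_accuracy(pred_classes, true_classes):
    """Fraction of test images whose predicted category c_i equals the true category c'_i."""
    n_test = len(true_classes)
    correct = sum(int(c == c_true) for c, c_true in zip(pred_classes, true_classes))
    return correct / n_test

# Example: an accuracy of 91.2% means 0.912 of the 5794 test images are classified correctly.
```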

The technical effects of the invention are further explained by combining simulation experiments as follows:

1. simulation conditions and contents:

the simulation experiment adopts 4 Nvidia Titan X GPUs to carry out network training and testing on a PyTorch deep Learning platform, carries out comparative simulation on the classification precision of the method and the existing 'Learning Multi-orientation conditional Neural network for Fine-Grained Image Recognition' (MA-CNN) method, and the result is shown in Table 1.

TABLE 1 Classification accuracy of the MA-CNN method and of the present invention

Network architecture       MA-CNN    The invention
Classification accuracy    86.5%     91.2%

2. Simulation result analysis:

as can be seen from Table 1, the classification accuracy of the method of the present invention is improved by 4.7% compared with MA-CNN, which indicates that the attention transfer mechanism adopted in the method effectively improves the classification accuracy of fine-grained images.
