Small sample target detection method based on feature weighting and network fine-tuning

Document No. 1832048, published 2021-11-12

Note: this technique, "Small sample target detection method based on feature weighting and network fine-tuning" (基于特征加权与网络微调的小样本目标检测方法), was designed and created by 朱得糠 (Zhu Dekang), 李彤 (Li Tong) and 孟志鹏 (Meng Zhipeng) on 2021-07-30. Its main content is as follows: The invention belongs to the technical field of computer vision image target detection, and provides a small sample target detection method based on feature weighting and network fine-tuning, which completes a target detection task with B base classes and N new classes (small samples). The invention adds a meta-learner that learns weight vectors for the detector to the classic Faster RCNN target detection algorithm, and accordingly divides the network training process into two stages, base class training and fine-tuning training; a feature weighting step is added to the classic Faster RCNN detector, and the subsequent region pooling and prediction layer networks are modified to accommodate the weighted feature pyramid. The invention effectively improves the precision of small sample target detection.

1. A small sample target detection method based on feature weighting and network fine-tuning, characterized by comprising the following steps:

S1, detection task modeling: suppose there are B classes of targets to be detected, each with a large number of training samples; these are defined as base classes. A further N classes of targets have only K training samples each (small samples); these are defined as new classes. The small sample target detection task is to detect targets of all (B + N) classes in the test samples. Assign a unique ID number to each of the (B + N) classes to be detected, with the B base class ID numbers denoted i_B ∈ {1, …, B} and the N new class ID numbers denoted i_N ∈ {B+1, …, B+N};

S2, network construction: the overall network structure mainly comprises two parts, an improved Faster RCNN detector and a meta-learner (Meta-learner) formed by a lightweight CNN;

the detector comprises 4 network modules, namely a backbone network, a region proposal network, a region pooling network and a prediction layer, together with a feature weighting operation;

the feature weighting operation acts on the feature pyramid and the weight tensor; only the 4 layers P['0'], P['1'], P['2'], P['3'] of the feature pyramid are weighted: the feature tensor of each pyramid layer is multiplied channel by channel by the corresponding elements of the weight tensor, and the results are concatenated along the channel dimension;

S3, data preparation

Sample annotation: each training picture is annotated with a txt label file conforming to the YOLO algorithm format;

base class training data preparation: all base class image samples are gathered to serve as the base class training data set;

fine-tuning training data preparation: to balance the quantities of base class and new class training data, K samples are randomly extracted from each base class at this stage; the (B + N) × K samples of the base and new classes serve as the fine-tuning training data set;

S4, model training

S4.1, base class training is carried out using the base class training data set; the network is trained by batch stochastic gradient descent, and each batch proceeds as follows:

S4.1.1, randomly extract a batch of bz image samples from the base class training data set; these are called query samples and are input into the detector;

S4.1.2, extract 1 picture for each base class from the base class training data set and arrange the pictures in class-number order; the B pictures form a support (feature) sample, which is input into the meta-learner;

S4.1.3, the detector backbone network outputs a feature pyramid, which is then split into two paths: one path passes through the RPN to obtain training boxes; the other path undergoes the weighting operation with the weight tensor output by the meta-learner. After the weighted feature pyramid and the training boxes pass through the region pooling layer and the prediction layer, the detection result is output, the loss function value is calculated by comparing it with the annotations of the query samples, and the network parameters are updated with an Adam optimizer;

S4.1.4, repeat the above steps until training is finished, and save the network parameters;

S4.2, fine-tuning training is carried out using the fine-tuning training data set; each batch of fine-tuning training proceeds as follows:

S4.2.1, the prediction layer is modified: the input of the prediction layer fully connected network is changed from the B × 1024 dimensions used in base class training to (B + N) × 1024 dimensions, and the rest of the structure is unchanged.

S4.2.2, network parameters are initialized: the prediction layer parameters are initialized randomly, while the remaining layers read the network parameters saved during base class training; the backbone network and RPN parameters are frozen.

S4.2.3, randomly extract a batch of query samples from the fine-tuning training data set (sampling with replacement) and input them into the detector;

S4.2.4, extract 1 picture for each class, base and new, from the fine-tuning training data set and arrange the pictures in class-number order; the (B + N) pictures form a support sample, which is input into the meta-learner;

S4.2.5, output the detection result following the same flow as S4.1.3, calculate the loss function value by comparison with the annotations of the query samples, and update the network parameters with an Adam optimizer;

S4.2.6, repeat steps S4.2.3 to S4.2.5 until training is finished, and save the network parameters;

S5, model inference

S5.1, weight tensor calculation: after the two training stages (base class training followed by fine-tuning training) are completed, the K samples of each class are input into the meta-learner in turn to obtain K weights, whose arithmetic mean gives the weight vector of that class; the per-class weight vectors are assembled, in class-number order, into a weight tensor of size (B + N, 1024);

S5.2, test data inference: the test sample is input directly into the detector, the output feature pyramid is weighted by multiplication with the pre-computed weight tensor to obtain weighted features arranged in ID order, and subsequent operations are carried out until the detection result is given.

2. The small sample target detection method based on feature weighting and network fine-tuning as claimed in claim 1, wherein the backbone network module of step S2 is composed of a residual network and a feature pyramid network: the input image is down-sampled bottom-up through the residual network, then up-sampled top-down with lateral connections through the feature pyramid network, which outputs the feature pyramid. With a batch of bz input pictures, the input is a tensor of size (bz, 3, 416, 416), where 3 denotes the RGB channels and 416 the number of width and height pixels; the output format is an ordered dictionary whose values are tensors;

Residual network: it is subdivided into 5 parts, conv1, conv2_x, conv3_x, conv4_x and conv5_x. A bracketed structure of the form [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] represents one residual block, and × 3 represents a concatenation of 3 residual blocks. Within a residual block, 1 × 1, 64 denotes a convolution layer of 64 channels of 1 × 1 convolution kernels, followed by a batch normalization layer and a ReLU layer; the output of the 3rd convolution layer is added to the input and passed through a ReLU to give the output of the residual block. Starting from conv2_x, the output of the last residual block of each part is taken as the output feature of that level, corresponding respectively to C2, C3, C4 and C5, in tensor form, with sizes 1/4, 1/8, 1/16 and 1/32 of the original input size, respectively;

The feature pyramid network is represented as an ordered dictionary P. Since a shared classifier/regressor is used for all levels of the pyramid, the output feature dimension (number of channels) is fixed across all feature maps; the method sets it to 256. From C5, a 1 × 1, 256 convolution layer gives the tensor P['3'] of size (bz, 256, 13, 13); P['3'] is up-sampled 2× by nearest-neighbour interpolation and added element-wise to the next level (C4 after a 1 × 1, 256 convolution) to obtain the tensor P['2'] of size (bz, 256, 26, 26); repeating these steps gives the tensors P['1'] and P['0'], of sizes (bz, 256, 52, 52) and (bz, 256, 104, 104), respectively. Pooling P['3'] gives the tensor P['pool'] of size (bz, 256, 6, 6).

3. The small sample target detection method based on feature weighting and network fine-tuning as claimed in claim 1, wherein the region proposal network module of step S2 is described as follows: the input is the feature pyramid P, and the output is the coordinates of the proposal boxes and, in the training stage, also the loss function values of the network. P passes through a 3 × 3 convolution layer and a ReLU layer, then enters the rpn_head layers, namely the 1 × 1, 3 and 1 × 1, 12 convolution layers, which output score predictions and box-coordinate predictions; combined with anchor boxes of 3 aspect ratios (0.5, 1.0, 2.0) and 5 scales (32, 64, 128, 256, 512), these give the coordinates of the proposal boxes. After processing, a list of length bz is obtained, in which each element corresponds to one picture and holds a certain number of proposal boxes. In the training phase, the loss function values are calculated from the scores and predicted box positions together with the true values.

4. The small sample target detection method based on feature weighting and network fine-tuning as claimed in claim 1, wherein the weight network module of step S2 is composed of a series of convolution, batch normalization and pooling layer modules. In class-number order, the following operations are carried out for each class: randomly extract 1 picture of the class, with its three RGB channels scaled to 416 × 416; initialize an all-0 tensor of size 416 × 416, set the positions of the instance boxes of this class in the scaled picture to 1, and take this tensor as the filtering channel of the picture; stack the RGB and filtering channels to form a feature sample of size (4, 416, 416) for the class. If the total number of classes is denoted Nc, then Nc = B during base class training and Nc = B + N during fine-tuning training. The feature samples of the Nc classes are concatenated, in class order, into a tensor of size (Nc, 4, 416, 416), which is input into the weight network as the support samples of the batch; the output is the weight tensor ω of size (Nc, 1024), i.e. Nc 1024-dimensional vectors.

5. The small sample target detection method based on feature weighting and network fine-tuning as claimed in claim 1, wherein the region pooling module of step S2 maps the proposal boxes output by the region proposal network onto a certain layer of the pyramid features and then extracts the features of each proposal box by RoI Pooling for subsequent RCNN classification and box regression. First, a certain number Nt of positive and negative sample boxes are selected as training boxes according to the intersection-over-union of the proposal boxes and the ground-truth boxes, and are combined with the weighted feature pyramid to obtain a foreground box feature tensor of size (bz × Nt, 256 × Nc, 7, 7). Because the features are multi-scale, the feature layer to which a training box is mapped is decided by the formula

k = ⌊k0 + log2(√(w·h)/224)⌋,

which determines the feature pyramid level corresponding to a training box of width w and height h, where k0 is the level to which a box with an area of 224 × 224 should be assigned.

6. The small sample target detection method based on feature weighting and network fine-tuning as claimed in claim 1, wherein the input of the prediction layer of step S2 is the pooled fixed-size training sample features of size (bz × Nt, 256 × Nc, 7, 7); dimensions 1, 2 and 3 of the tensor are flattened into a 256 × Nc × 7 × 7 vector, giving a tensor of size (bz × Nt, 256 × Nc × 7 × 7). The flattened feature tensor passes through two fully connected layers to obtain a feature tensor of size (bz × Nt, 1024). It is then input into a box classifier and a box regressor, which give a score tensor of size (bz × Nt, Nc) and a box-coordinate tensor of size (bz × Nt, 4 × Nc). The scores are used only for calculating the loss function value during training.

7. The small sample target detection method based on feature weighting and network fine-tuning as claimed in any one of claims 1-6, wherein in step S3, if the data is already in the YOLO annotation format it is used directly; if not, it is converted into the YOLO annotation format as follows: a txt label file with the same name is made for each training picture, containing the target boxes labelled as [id x, y, w, h], where the values are the class id number of the labelled target, the normalized centre-point x coordinate, the normalized centre-point y coordinate, the normalized target box width w and the normalized target box height h, with one instance box per line.

Technical Field

The invention belongs to the technical field of computer vision image target detection, and particularly relates to a small sample target detection method.

Background

In recent years, computer vision technology based on deep learning has made breakthrough progress, approaching or even exceeding human-level performance on tasks such as target classification, detection and segmentation. However, deep learning methods require a large amount of training data and fall far short of human learning ability in the small sample regime. There are currently many studies on small sample learning, mainly addressing target classification in the field of computer vision; for the more complex small sample target detection task, research remains scarce. The earlier paper "Few-Shot Object Detection via Feature Reweighting" proposed a meta-learning method: in the meta-learning stage, meta-features of pictures are learned from a large number of base class samples, and target detection is carried out after weighting. Another line of work divides training into two stages, base training and fine-tuning training, and achieves small sample target detection by fine-tuning only the last layer of the network with a small number of samples. The paper "FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding" argues that the drop in small sample detection performance is caused mainly by new classes with few samples being misclassified as other classes; the algorithm therefore introduces a CPE (Contrastive Proposal Encoding) loss that measures the consistency among instances within a class and the differences between instances of different classes, and improves classification performance by reducing this loss.

The above studies follow a common task paradigm for small sample target detection: target classes with a large number of training samples are defined as base classes, and target classes with only a small number of training samples are defined as new classes. A small sample target detector is trained with base class and new class samples to accomplish the task of detecting base and new classes simultaneously in test data.

Disclosure of Invention

The invention aims to provide a small sample target detection method capable of effectively improving detection precision, solving the technical problem that small sample target detection precision is low.

In order to achieve the above object and solve the above technical problem, the present invention provides a small sample target detection method based on feature weighting and network fine-tuning; the technical scheme is as follows:

S1, detection task modeling: suppose there are B classes of targets to be detected, each with a large number of training samples; these classes are defined as base classes. Another N classes of targets each have only K training samples (small samples); these classes are defined as new classes. The small sample target detection task is to detect all (B + N) classes in the test samples. A unique ID number is assigned to each of the (B + N) classes, with the B base class ID numbers denoted i_B ∈ {1, …, B} and the N new class ID numbers denoted i_N ∈ {B+1, …, B+N}.

S2, network construction: the overall network structure mainly comprises two parts, an improved Faster RCNN detector (Detector) and a meta-learner (Meta-learner) formed by a lightweight CNN, as shown in fig. 7;

the detector mainly comprises 4 network modules, (1) a backbone network (Backbone), (2) a region proposal network (RPN), (3) a region pooling network (RoI Pooling Network) and (4) a prediction layer (Prediction Layer), together with a feature weighting operation;

in the training stage, the detector's input data are the query samples, including pictures and their annotations; the meta-learner's input data are the support samples, arranged in the order of their corresponding classes. The feature pyramid output by the detector backbone and the per-class weight vectors output by the meta-learner undergo the feature weighting operation to obtain the weighted feature pyramid, which then participates in the subsequent region pooling and prediction, outputting detection boxes and their classification scores for training or as the detection result; the detailed structure of each network module is shown in the following figures and description.

S3, data preparation: (1) sample annotation: each training picture is annotated with a txt label file conforming to the YOLO algorithm format; (2) base class training data preparation: all base class image samples are gathered as the base class training data set; (3) fine-tuning training data preparation: to balance the quantities of base class and new class training data, K samples are randomly extracted from each base class at this stage, and the (B + N) × K base and new class samples form the fine-tuning training data set;

S4, model training: once the model and data are prepared, training can begin. Training is divided into two stages: (1) base class training with the base class training data set; (2) fine-tuning training with the fine-tuning training data set.

S4.1, base class training: query samples and support samples are extracted from the base class training data set and used as network input to train the network to detect only the B base classes. The query samples are used to extract the feature pyramid and propose regions where targets may be present. The support samples comprise B pictures, 1 picture per class; each support picture is processed into a weight vector, and the B vectors form a weight tensor. The weight tensor is weighted layer by layer against the feature tensors of the feature pyramid, and the weighted feature pyramid is obtained by channel concatenation; compared with the original pyramid, the number of feature channels in each layer is B times the original. The weighted feature pyramid then undergoes subsequent processing according to the classic Faster RCNN flow, giving the network predictions and loss function values; training proceeds iteratively, and the network parameters are saved when it finishes.

S4.2, fine-tuning training: query samples and support samples are extracted from the fine-tuning training data set as network input, and the network is trained to detect all (B + N) base and new classes simultaneously. First the base-class-trained network is modified: the input of the prediction layer fully connected network is changed from the B × 1024 dimensions used in base class training to (B + N) × 1024 dimensions, with the rest of the structure unchanged. The network parameters saved from base class training are loaded for all layers except the prediction layer, the backbone network and RPN are frozen, and the prediction layer parameters are initialized randomly. Fine-tuning trains in batch stochastic gradient descent mode, updating the weight network and prediction layer parameters. The training steps are the same as S4.1, and the network parameters are saved when training finishes.

S5, model inference test: once fine-tuning training is completed, the training process is finished and the network can be used to perform target detection inference tasks. Before inference, the weight of each class can be calculated from its K picture samples, and the weight vectors of the (B + N) classes are formed and stored. During testing, support samples need not be input online: only the test sample is input into the backbone network, the stored weight vectors are used for weighting, and the detection result is output.

For convenience of the following description, we introduce tensor notation: an n-dimensional tensor S is expressed as an n-dimensional array whose dimension indices are numbered 0, 1, …, n-1, and whose size is expressed as the n-tuple size(S) = (z0, …, zn-1), where zi denotes the length of the i-th dimension. The slice of a tensor indexed k along one dimension, e.g. dimension 1, is denoted S[:, k, :, …].
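This notation maps directly onto PyTorch tensors; a minimal illustration (PyTorch is an assumption, since the text itself names no framework):

```python
import torch

# A 4-dimensional tensor S with size(S) = (2, 3, 4, 5);
# dimension indices run 0..3 and z_i is the length of dimension i.
S = torch.zeros(2, 3, 4, 5)
print(tuple(S.shape))              # (2, 3, 4, 5)

# The slice indexed k along dimension 1, written S[:, k, :, ...] above.
k = 1
print(tuple(S[:, k, :, :].shape))  # (2, 4, 5) -- dimension 1 is consumed
```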

further, the S2 backbone network: as shown in fig. 1, the image processing apparatus is composed of a residual Network (ResNet101) and a Feature Pyramid Network (FPN), and an input image is (1) downsampled from bottom to top by the ResNet101 and (2) upsampled from top to bottom and horizontally linked by the FPN, so as to output a Feature Pyramid. The batch processing method is adopted for training and reasoning, a batch of bz pictures is set, the input pictures are expressed as a tensor with the size of (bz,3,416,416), wherein the bz of the 0 th dimension represents the number of the pictures, the 3 of the 1 st dimension represents an RGB channel, and the 416 of the 2 nd dimension and the 3 rd dimension represents the number of wide and high pixels. The input of the backbone network is a picture tensor, and the output is an ordered dictionary, namely a characteristic pyramid P.

(1) ResNet101 network: subdivided into 5 sections, conv1, conv2_x, conv3_x, conv4_x and conv5_x, respectively. A bracketed structure of the form [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] represents one residual block, and × 3 represents a concatenation of 3 residual blocks. The network structure of a residual block is shown in fig. 2: 1 × 1, 64 denotes a convolution layer of 64 channels of 1 × 1 convolution kernels, followed by a batch normalization layer and a ReLU layer; the output of the 3rd convolution layer is added to the input and passed through a ReLU layer to give the output of the residual block. Starting from conv2_x, the output of the last residual block of each section is taken as the output feature of that level, corresponding respectively to C2, C3, C4 and C5, in tensor form, with sizes 1/4, 1/8, 1/16 and 1/32 of the original input size, respectively.

(2) FPN network: the feature pyramid obtained in this step is denoted as the ordered dictionary P. Since a shared classifier/regressor is used for all pyramid levels, the output feature dimension (number of channels) is fixed across all feature maps; the method sets it to 256. From C5, a 1 × 1, 256 convolution layer gives the tensor P['3'] of size (bz, 256, 13, 13); P['3'] is up-sampled 2× by nearest-neighbour interpolation and added element-wise to the next level (C4 after a 1 × 1, 256 convolution) to obtain the tensor P['2'] of size (bz, 256, 26, 26); repeating these steps gives the tensors P['1'] and P['0'], of sizes (bz, 256, 52, 52) and (bz, 256, 104, 104), respectively. Pooling P['3'] gives the tensor P['pool'] of size (bz, 256, 6, 6).
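The ResNet101 + FPN backbone described here closely matches the stock torchvision backbone (same dictionary keys '0' to '3' and 'pool', same 256 channels); a sketch, assuming a recent torchvision (note that torchvision's 'pool' map comes from a stride-2 max pool and is 7 × 7 rather than the 6 × 6 stated above):

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet101 bottom-up + FPN top-down with lateral connections;
# returns an ordered dict of 256-channel feature maps.
backbone = resnet_fpn_backbone(backbone_name='resnet101', weights=None)

x = torch.randn(2, 3, 416, 416)   # a batch of bz = 2 query pictures
P = backbone(x)
for name, feat in P.items():
    print(name, tuple(feat.shape))
# '0' (2, 256, 104, 104)  '1' (2, 256, 52, 52)
# '2' (2, 256, 26, 26)    '3' (2, 256, 13, 13)  'pool' (2, 256, 7, 7)
```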

Further, the S2 region proposal network: the input is the feature pyramid P; the output is the coordinates of the proposal boxes and, in the training stage, also the loss function values of the network. P passes through a 3 × 3 convolution layer and a ReLU layer, then enters the rpn_head layers, i.e. the 1 × 1, 3 and 1 × 1, 12 convolution layers in fig. 3, which output score predictions and box-coordinate predictions; combined with anchor boxes of 3 aspect ratios (0.5, 1.0, 2.0) and 5 scales (32, 64, 128, 256, 512), these give the coordinates of the proposal boxes. After processing, a list of length bz is obtained, in which each element corresponds to one picture and holds a certain number (say Np) of proposal boxes. In the training phase, the loss function values are calculated from the scores and predicted box positions together with the true values.
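The anchor configuration above (3 aspect ratios, one scale per pyramid level) can be expressed with torchvision's AnchorGenerator; a sketch under the assumption that torchvision's RPN utilities are an acceptable stand-in for the layer shown in fig. 3:

```python
from torchvision.models.detection.rpn import AnchorGenerator

# One anchor scale per pyramid level, the same 3 aspect ratios everywhere,
# matching the (0.5, 1.0, 2.0) ratios and (32, 64, 128, 256, 512) scales.
anchor_generator = AnchorGenerator(
    sizes=((32,), (64,), (128,), (256,), (512,)),
    aspect_ratios=((0.5, 1.0, 2.0),) * 5,
)
print(anchor_generator.num_anchors_per_location())  # [3, 3, 3, 3, 3]
```

Three anchors per location is consistent with the 1 × 1, 3 score convolution and the 1 × 1, 12 (= 3 × 4) coordinate convolution of rpn_head.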

Further, the S2 weight network: its structure, shown in fig. 4, is composed of a series of convolution, batch normalization and pooling layer modules. In class-number order, the following operations are carried out for each class: randomly extract 1 picture of the class, with its three RGB channels scaled to 416 × 416; initialize an all-0 tensor of size 416 × 416, set the positions of the instance boxes of this class in the scaled picture to 1, and take this tensor as the filtering channel of the picture; stack the RGB and filtering channels to form a feature sample of size (4, 416, 416) for the class. If the total number of classes is denoted Nc, then Nc = B during base class training and Nc = B + N during fine-tuning training. The feature samples of the Nc classes are concatenated, in class order, into a tensor of size (Nc, 4, 416, 416), which is input into the weight network as the support samples of the batch; the output is the weight tensor ω of size (Nc, 1024), i.e. Nc 1024-dimensional vectors.
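The text fixes only the weight network's input shape (Nc, 4, 416, 416), its output shape (Nc, 1024) and its building blocks (convolution, batch normalization, pooling); the layer widths in the following sketch are therefore illustrative assumptions:

```python
import torch
import torch.nn as nn

class MetaLearner(nn.Module):
    """Lightweight CNN mapping support samples (Nc, 4, 416, 416) to weight
    vectors (Nc, 1024). Channel widths are assumptions; the patent only
    fixes the input/output shapes and the conv/BN/pool building blocks."""
    def __init__(self):
        super().__init__()
        blocks, in_ch = [], 4
        for out_ch in (32, 64, 128, 256, 512, 1024):
            blocks += [
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),          # 416 -> 208 -> ... -> 6
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling to 1x1

    def forward(self, x):                     # x: (Nc, 4, 416, 416)
        f = self.pool(self.features(x))       # (Nc, 1024, 1, 1)
        return f.flatten(1)                   # omega: (Nc, 1024)

omega = MetaLearner()(torch.randn(6, 4, 416, 416))
print(omega.shape)  # torch.Size([6, 1024])
```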

Further, the S2 weighting operation acts on the feature pyramid and the weight tensor. The 4 feature tensors P['0'], P['1'], P['2'], P['3'] of the feature pyramid are weighted in turn to obtain the weighted feature pyramid Pω, composed of the 4 layers Pω['0'], Pω['1'], Pω['2'], Pω['3'], whose weighted tensor sizes are (bz, 256 × Nc, 104, 104), (bz, 256 × Nc, 52, 52), (bz, 256 × Nc, 26, 26) and (bz, 256 × Nc, 13, 13), respectively. The weighting operation multiplies the feature tensors of the first 4 pyramid layers channel by channel by the corresponding elements of the weight tensor ω, giving Nc 256-channel feature maps per layer, which are then concatenated along the channel dimension. The weighted feature pyramid integrates the class feature information extracted from the support samples; exactly how the multiplication proceeds layer by layer and channel by channel is given in the specific implementation steps below.

Further, the S2 region pooling: first, a certain number Nt of positive and negative sample training boxes are randomly selected according to the intersection-over-union of the proposal boxes and the ground-truth boxes; the training boxes are then combined with the weighted feature pyramid, and the pooling operation yields fixed-size training sample features of size (bz × Nt, 256 × Nc, 7, 7) for subsequent RCNN classification and box regression. Because the features are multi-scale, the feature layer to which a training box is mapped is decided by equation (1):

k = ⌊k0 + log2(√(w·h)/224)⌋   (1)

which determines the feature pyramid level corresponding to a training box of width w and height h, where k0 is the level to which a box with an area of 224 × 224 should be assigned.
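A direct transcription of equation (1), clamped to the weighted levels 0 to 3; the value of k0 is left open in the text, so the default below (k0 = 2, placing a 224 × 224 box on the 1/16-resolution level) is an assumption:

```python
import math

def fpn_level(w: float, h: float, k0: int = 2) -> int:
    """Equation (1): k = floor(k0 + log2(sqrt(w*h) / 224)), where w, h are
    the training-box width and height in input pixels and k0 is the level
    for a 224x224 box. The result is clamped to the weighted levels 0..3."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(0, min(3, k))

print(fpn_level(224, 224))  # 2: the k0 level
print(fpn_level(48, 32))    # 0: small boxes fall to finer (lower) levels
```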

Further, the S2 prediction layer: the input is the pooled fixed-size training sample features of size (bz × Nt, 256 × Nc, 7, 7); dimensions 1, 2 and 3 of the tensor are flattened into a 256 × Nc × 7 × 7 vector, giving a tensor of size (bz × Nt, 256 × Nc × 7 × 7). The flattened feature tensor passes through two fully connected layers to obtain a feature tensor of size (bz × Nt, 1024). It is then input into a box classifier and a box regressor, which give a score tensor of size (bz × Nt, Nc) and a box-coordinate tensor of size (bz × Nt, 4 × Nc). The scores are used only for calculating the loss function value during training.
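A sketch of the prediction layer with the sizes given above; only the flatten step, the two 1024-d fully connected layers and the per-class score/box heads are specified in the text, so the remaining details (activations, initialisation) are assumptions:

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Flatten pooled (bz*Nt, 256*Nc, 7, 7) features, pass them through two
    fully connected layers to 1024-d, then score and regress boxes per class.
    Nc = B during base class training and B + N during fine-tuning."""
    def __init__(self, num_classes: int, channels: int = 256):
        super().__init__()
        in_dim = channels * num_classes * 7 * 7
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, num_classes)      # (bz*Nt, Nc)
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)  # (bz*Nt, 4*Nc)

    def forward(self, box_feature):            # (bz*Nt, 256*Nc, 7, 7)
        x = self.fc(box_feature.flatten(1))    # (bz*Nt, 1024)
        return self.cls_score(x), self.bbox_pred(x)
```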

Further, S3 is based on the YOLO annotation format: if the data is already in the YOLO annotation format it is used directly; if not, it is converted into the YOLO annotation format as follows: a txt label file with the same name is made for each training picture, containing the target boxes labelled as [id x, y, w, h], where the values are the class id number of the labelled target, the normalized centre-point x coordinate, the normalized centre-point y coordinate, the normalized target box width w and the normalized target box height h, with one instance box per line.
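Parsing one such txt label file is straightforward; a minimal reader for the layout described above (one instance box per line):

```python
from pathlib import Path

def load_yolo_labels(txt_path):
    """Parse one YOLO-format annotation file: each line is 'id x y w h',
    with the centre coordinates and box size normalised to [0, 1]."""
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        if not line.strip():
            continue
        cls_id, x, y, w, h = line.split()
        boxes.append((int(cls_id), float(x), float(y), float(w), float(h)))
    return boxes
```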

Further, S4.1 base class training: the network is trained in batch stochastic gradient descent mode, and each batch proceeds as follows: (1) a batch of bz image samples, called query samples, is randomly extracted from the base class training data set and input into the backbone network; (2) 1 picture is extracted for each base class from the base class training data set and arranged in class-number order; the B pictures form the support samples, which are input into the weight network; (3) the network executes according to the flow shown in fig. 7, the detection result is output, the loss function value is calculated by comparison with the annotations of the query samples, and the network parameters are updated with an Adam optimizer; (4) the above steps are repeated until training finishes, and the network parameters are saved.

Further, S4.2 fine-tuning training: each batch proceeds as follows: (1) a batch of query samples is randomly extracted (sampling with replacement) from the fine-tuning training data set and input into the backbone network; (2) 1 picture is extracted for each class (base and new) from the fine-tuning training data set and arranged in class-number order; the (B + N) pictures form the support samples, which are input into the weight network; (3) the network executes according to the flow shown in fig. 7, the detection result is output, the loss function value is calculated by comparison with the annotations of the query samples, and the network parameters are updated with an Adam optimizer; (4) the above steps are repeated until training finishes, and the network parameters are saved.

The S5 model inference test comprises the following steps: (1) weight vector calculation: after the two training stages are completed, the K samples of each class are input into the weight network in turn to obtain K weights, and their arithmetic mean gives the weight vector of that class; (2) test data inference: the test sample is input directly into the backbone network, the output feature pyramid is weighted directly with the weight vectors, and subsequent processing is carried out until the detection result is given.

The beneficial effects of the invention are:

the invention provides a small sample target detection method based on feature weighting and network fine tuning, and particularly completes a target detection task with B base classes and N new classes (small samples). The method adds a meta learner to a classic fast RCNN target detection algorithm, combines feature weighting and network fine tuning, constructs a small sample target detection network, can effectively improve detection precision compared with the existing method, does not increase extra computational burden during inference test, and has good practicability.

The invention is characterized in that a meta-learner is introduced to learn weight vectors for the detector, and the network training process is divided accordingly into two stages, base class training and fine-tuning training; a feature weighting step is added to the classic Faster RCNN detector, and the subsequent region pooling and prediction layer networks are modified to accommodate the weighted feature pyramid.

Drawings

FIG. 1 is a schematic diagram of a backbone network;

FIG. 2 is a schematic diagram of a residual block network;

FIG. 3 is a schematic diagram of the region proposal network;

FIG. 4 is a schematic diagram of a weighting network;

FIG. 5 is a schematic diagram of a regional pooling network;

FIG. 6 is a schematic diagram of a prediction layer network;

FIG. 7 is a schematic diagram of a network framework of the present invention;

FIG. 8 is an inference test flowchart.

Detailed Description

S1, detection task modeling. A unique ID number is assigned to each of the (B + N) classes to be detected, with the B base class ID numbers denoted i_B ∈ {1, …, B} and the N new class ID numbers denoted i_N ∈ {B+1, …, B+N}.

S2, a small sample target detection network is constructed based on the Faster RCNN network, as shown in fig. 7.

S3 data preparation.

S3.1, sample annotation: each training picture is annotated with a txt label file conforming to the YOLO algorithm format;

S3.2, base class training data preparation: all base class image samples are gathered as the base class training data set;

and S3.3, fine-tuning training data preparation is carried out, in order to balance the problem that the quantities of the base class training data and the new class training data are not balanced, only K samples are randomly extracted from each base class in the stage, and the base class and the new class are used as fine-tuning training data sets, wherein the total number of (B + N) multiplied by K samples.

Once the model and data are prepared, training can begin. Network training is divided into two stages: the first stage performs base class training using only base class samples; the second stage performs fine-tuning training using samples of both the base and new classes.

S4.1, base class training. The network is trained in batch stochastic gradient descent mode with the batch size set to bz; each batch proceeds as follows:

S4.1.1, randomly extract bz query samples from the base class training data set, scale them to 416 × 416 and form a tensor of size (bz, 3, 416, 416), which is input into the backbone network. The backbone network outputs a 5-layer feature pyramid dictionary P; each pyramid layer corresponds to one dictionary element, the dictionary keys are {'0', '1', '2', '3', 'pool'}, and the corresponding data are tensors of sizes P['0'] (bz, 256, 104, 104), P['1'] (bz, 256, 52, 52), P['2'] (bz, 256, 26, 26), P['3'] (bz, 256, 13, 13) and P['pool'] (bz, 256, 6, 6).

S4.1.2, extract 1 picture for the 1st base class from the base class training data set and scale it to 416 × 416, obtaining a picture RGB channel data tensor of size (3, 416, 416). From the corresponding annotation file, map the instance boxes of this class to their positions at the 416 × 416 scale; initialize an all-0 tensor of size 416 × 416, set the tensor elements at the instance box positions to 1, and take this as the filtering channel tensor of the picture, of size (1, 416, 416). Stack the data tensor and the filtering channel tensor along the channel dimension to obtain a tensor of size (4, 416, 416) as the support sample of this base class. Repeat the above operations for all B base classes in class-number order to obtain the support tensor of the batch, of size (B, 4, 416, 416), which is input into the weight network. The weight tensor ω output by the weight network has size (B, 1024), representing B 1024-dimensional vectors corresponding to the B classes.
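The 4-channel support sample construction of S4.1.2 can be sketched as follows; make_support_sample is a hypothetical helper, the input image is assumed already scaled to 416 × 416, and boxes_xywh holds the normalised YOLO annotations of this class:

```python
import torch

def make_support_sample(image_chw: torch.Tensor, boxes_xywh) -> torch.Tensor:
    """Build one (4, 416, 416) support sample: the RGB channels of the
    scaled image stacked with a binary filtering channel that is 1 inside
    each instance box of the class."""
    assert image_chw.shape == (3, 416, 416)
    mask = torch.zeros(1, 416, 416)
    for x, y, w, h in boxes_xywh:                 # normalised centre/size
        x0 = int((x - w / 2) * 416); x1 = int((x + w / 2) * 416)
        y0 = int((y - h / 2) * 416); y1 = int((y + h / 2) * 416)
        mask[:, max(y0, 0):y1, max(x0, 0):x1] = 1.0
    return torch.cat([image_chw, mask], dim=0)    # (4, 416, 416)

# Stacking B such samples in class-number order gives the (B, 4, 416, 416)
# support tensor fed to the weight network.
```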

S4.1.3, the 5-layer feature pyramid P is input into the region proposal network, which outputs a list of length bz; each element corresponds to 1 picture, and its content is two tensors of sizes (Np) and (Np, 4): the scores and position coordinates of the Np candidate boxes.

S4.1.4, the 4 layers {'0', '1', '2', '3'} of the feature pyramid P are weighted class by class with the weight vectors to obtain the weighted feature pyramid Pω. Pω is a dictionary with the same keys as P, whose element values are feature tensors of 256 × B channels; the ((i-1) × 256 + k)-th channel of layer 'l' is calculated according to equation (2):

Pω['l'][:, (i-1)×256+k, :, :] = P['l'][:, k, :, :] × ω[i, k]   (2)

Traversing k = 0, …, 255, i = 1, …, B and l = 0, …, 3 in turn gives the complete Pω.
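Equation (2) vectorises naturally. One point is left open above: ω is 1024-dimensional while each pyramid layer has 256 channels, and the stated traversal k = 0, …, 255 touches only 256 components. The sketch below follows that traversal literally and uses the first 256 components of ω for every layer (a per-layer 256-component slice of ω would be another plausible reading):

```python
import torch

def weight_pyramid(P: dict, omega: torch.Tensor) -> dict:
    """Equation (2): scale the 256 feature channels by each class's weight
    vector and concatenate the Nc scaled copies along the channel axis.
    P[l] is (bz, 256, H, W); omega is (Nc, 1024), of which the first 256
    components are used here (an assumption, see the note above)."""
    Nc = omega.shape[0]
    w = omega[:, :256]                         # (Nc, 256)
    P_w = {}
    for l in ('0', '1', '2', '3'):
        feat = P[l]                            # (bz, 256, H, W)
        # Broadcast to (bz, Nc, 256, H, W), then merge Nc into the channels.
        weighted = feat.unsqueeze(1) * w.view(1, Nc, 256, 1, 1)
        P_w[l] = weighted.flatten(1, 2)        # (bz, 256*Nc, H, W)
    return P_w
```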

S4.1.5, the proposal boxes of each picture obtained in S4.1.3 are compared with the corresponding annotation boxes (the true values) to compute intersection-over-union values; by thresholding the intersection-over-union, Nt positive and negative sample boxes, i.e. training boxes, are screened for each picture sample in a specified proportion, and their true-value identities and positions are recorded.

S4.1.6, RoI pooling is performed with the training boxes given by S4.1.5 and Pω to obtain box_feature, the weighted feature tensor of size (bz × Nt, 256 × B, 7, 7). The specific operation is: for each training box, first determine the corresponding pyramid level l according to equation (1), then extract the features of the training box from Pω['l'] by the roi_align algorithm, finally obtaining the foreground box feature box_feature.

S4.1.7, the feature dimensions of box_feature are flattened into 1 dimension and input into two fully connected layers, obtaining the flattened RoI feature tensor of size (bz × Nt, 1024).

S4.1.8, the feature tensor RoI is input into the prediction layer, which outputs the scores and box positions corresponding to the B base classes. The loss is calculated according to equation (3) by comparison with the true values recorded in S4.1.5, and the network parameters are updated using an Adam optimizer.

S4.1.9, repeat steps S4.1.1 to S4.1.8 until training finishes, and save the network parameters.

S4.2, fine-tuning training.

Fine-tuning training differs from base class training in that (1) the number of target classes to be detected increases; (2) the training data set is replaced; (3) part of the network parameters are frozen during training.

S4.2.1, the network model is modified: as shown in fig. 7, the input of the prediction layer fully connected network is changed from the B × 1024 dimensions used in base class training to (B + N) × 1024 dimensions, and the rest of the structure is unchanged.

S4.2.2, network parameters are initialized: the prediction layer parameters are initialized randomly, while the remaining layers read the network parameters saved during base class training; the backbone network and RPN parameters are frozen.
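S4.2.1 and S4.2.2 amount to rebuilding the prediction layer for B + N classes, reloading everything else, and freezing the backbone and RPN; a sketch reusing the PredictionLayer class from the earlier sketch, with detector.backbone, detector.rpn and detector.prediction_layer as hypothetical attribute names:

```python
def prepare_finetuning(detector, num_base: int, num_new: int):
    """Rebuild the prediction layer for B + N classes (randomly initialised)
    and freeze the backbone and RPN; all other loaded parameters stay
    trainable. Attribute names are hypothetical."""
    detector.prediction_layer = PredictionLayer(num_base + num_new)  # S4.2.1
    for module in (detector.backbone, detector.rpn):                 # S4.2.2
        for p in module.parameters():
            p.requires_grad = False
    return detector
```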

S4.2.3, network training is carried out by the same procedure as S4.1, with the training samples now randomly extracted from the fine-tuning training sample set; the differences are that the weight tensor ω has size (B + N, 1024) and the traversal is i = 1, …, B + N. The network parameters are saved when training finishes.

After fine-tuning training, the training process of the network is complete, and the fine-tuned network can be used to perform target detection inference test tasks.

S5, model inference test

S5.1, weight vector calculation: the K samples of each class are input into the weight network in turn to obtain K weights, whose arithmetic mean gives the weight vector ω[i, :] of that class, i ∈ {1, …, B + N}; traversing all (B + N) classes gives the complete weight tensor;
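The offline weight computation of S5.1 is a single averaging pass through the weight network; a sketch, where support_by_class is assumed to be a list of (K, 4, 416, 416) support tensors in class-number order:

```python
import torch

@torch.no_grad()
def compute_class_weights(meta_learner, support_by_class):
    """S5.1: run each class's K support samples through the weight network,
    average the K weight vectors, and stack the per-class means into the
    (B + N, 1024) weight tensor reused for every test image."""
    rows = [meta_learner(s).mean(dim=0) for s in support_by_class]
    return torch.stack(rows, dim=0)   # (B + N, 1024)
```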

S5.2, test sample inference: the test sample is input into the backbone network to obtain the feature pyramid dictionary P;

S5.3, the same steps as S4.1.3 and S4.1.4 are performed;

S5.4, the proposal boxes output in step S5.3 are taken as RoI boxes, and steps S4.1.6 and S4.1.7 are executed;

S5.5, the feature tensor RoI is input into the prediction layer, which outputs the scores and box positions corresponding to the (B + N) classes; the final detection result is obtained through NMS, SoftMax and other steps.

Example 1

A small sample target detection task with 6 base classes and 4 new classes is constructed from the NWPU VHR-10 data set. NWPU VHR-10 is a public 10-class geospatial object detection data set whose ten object classes are airplanes, ships, storage tanks, baseball diamonds, tennis courts, basketball courts, ground track fields, harbors, bridges and vehicles. The data set contains 650 high-resolution remote sensing images, manually annotated by experts.

The 6 base classes are: 1 airplane, 2 storage tank, 3 tennis court, 4 basketball court, 5 harbor, 6 vehicle.

The 4 new classes are: 7 ship, 8 baseball diamond, 9 ground track field, 10 bridge.

There are 363 training pictures, 91 validation pictures and 196 test pictures, 650 pictures in total. The algorithm of the invention and the algorithm of the document "Frustratingly Simple Few-Shot Object Detection" (abbreviated FS-FSOD) are compared and verified on the 6 base classes and 4 new classes of the test data set, comparing mAP precision with 1, 3, 5 and 10 small samples respectively; the results are shown in Table 1:

TABLE 1 comparison of results of two algorithms
