Fine-grained cross-media retrieval method based on self-attention and a generative adversarial network

Document No.: 1952840. Publication date: 2021-12-10.

Reading note: This technology, "Fine-grained cross-media retrieval method based on self-attention and a generative adversarial network", was designed and created by 姚亚洲, 孙泽人, 陈涛, 张传一 and 沈复民 on 2021-11-11. Its main content is as follows. The invention relates to the technical field of fine-grained cross-media retrieval and discloses a fine-grained cross-media retrieval method based on self-attention and a generative adversarial network, comprising: extracting features of the media data with a convolutional neural network and a recurrent neural network; inputting the outputs of the convolutional neural network and the recurrent neural network into a common semantic space learning module, and extracting semantic features A1 that are irrelevant to the media type; inputting the semantic features A1 from the common semantic space learning module into a media discriminator; performing adversarial learning between the generator and the media discriminator to obtain trained semantic features A2; and performing similarity measurement between the semantic features A2 and the semantic features A1 in the common semantic space learning module, and retrieving the corresponding data according to the measurement result. The invention is used for learning the small differences between fine-grained subcategories, reducing the heterogeneity gap between cross-media data and, while ensuring classification accuracy, making the features of different media data under the same category label as similar as possible.

1. A fine-grained cross-media retrieval method based on self-attention and a generative adversarial network, characterized by comprising the following steps: S1, extracting features of the media data with a convolutional neural network and a recurrent neural network, according to a feature extractor based on a self-attention mechanism in a generator; S2, presetting a common semantic space learning module, inputting the outputs of the convolutional neural network and the recurrent neural network into the common semantic space learning module, and extracting from it semantic features A1 that are irrelevant to the media type; S3, presetting a media discriminator, inputting the semantic features A1 from the common semantic space learning module into the media discriminator, and verifying with the media discriminator whether the semantic features A1 are related to the media type; if so, returning to step S2, and if not, proceeding to step S4; S4, performing adversarial learning between the generator and the media discriminator, and adversarially training the semantic features A1 to obtain trained semantic features A2; S5, performing similarity measurement between the semantic features A2 and the semantic features A1 in the common semantic space learning module, and retrieving the corresponding data according to the measurement result; fine-grained classification constraints, distance constraints and ranking constraints are preset, and the above steps S1-S5 follow these constraints.

2. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network according to claim 1, wherein step S1 comprises: the media data comprise image data, video data, audio feature data and text data; locating specific parts of the target with the convolutional neural network for the image data, video data and audio feature data; and acquiring word-level description features of the target with the recurrent neural network for the text data.

3. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network according to any one of claims 1-2, wherein step S2 comprises: randomly selecting any local attention area in the convolutional neural network, cropping and enlarging the local attention area, and inputting it into the common semantic space learning module; reducing the feature dimension by linear superposition in the recurrent neural network, and combining the description-word features of the media data extracted by the recurrent neural network with the important features extracted by the self-attention mechanism before inputting them into the common semantic space learning module.

4. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network according to claim 1, wherein step S2 further comprises: setting, in the common semantic space learning module, a media-type indistinguishability rule, a semantic discriminability rule and a similarity measurement rule; and bridging the heterogeneity gaps of the fine-grained media data in the common semantic space learning module, extracting the high-level semantic features of each media type separately to obtain the semantic features A1.

5. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network according to claim 1, wherein step S3 comprises: constructing the media discriminator from fully connected layers; defining the classification loss of the media discriminator from a cross-entropy loss function, a nonlinear function, the media type labels and the input media types; and verifying, according to the classification loss of the media discriminator, whether the features in the common semantic space learning module are irrelevant to the media type.

6. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network according to claim 1, wherein step S4 comprises: in the adversarial learning, the media discriminator distinguishes the media types according to the min-max game rule, and when the media discriminator cannot distinguish which media type the features of the common semantic space learning module belong to, the adversarial learning of the common feature space is judged to be finished, and the trained semantic features A2 are obtained.

7. The method of claim 1, wherein the similarity measurement in step S5 comprises: performing cosine similarity measurement between the semantic features A2 and the semantic features A1 in the common semantic space learning module.

8. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network of claim 1, wherein the classification constraint comprises: learning fine-grained semantic features of the media data according to the classification constraint.

9. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network of claim 1, wherein the distance constraint comprises: determining the similarity of the semantic features of the media data according to the distance constraint.

10. The fine-grained cross-media retrieval method based on self-attention and a generative adversarial network of claim 1, wherein the ranking constraint comprises: predicting the relative distance between samples of the input media data according to the ranking constraint.

Technical Field

The invention relates to the technical field of fine-grained cross-media retrieval, and in particular to a fine-grained cross-media retrieval method based on self-attention and generative adversarial networks, which is used for learning the fine differences between fine-grained subcategories, reducing the heterogeneity gaps between cross-media data, ensuring classification accuracy and making the features of different media data under the same category label as similar as possible.

Background

Fine-grained cross-media retrieval has long been a challenging task, with the following major difficulties:

(1) The differences between fine-grained subcategory objects are small, and the discriminative regions of the subcategories are difficult to find under the weak supervision of category labels, so the subcategories are difficult to distinguish.

(2) The features of data of different media types often have inconsistent distributions and representations, so there is a large media gap, i.e., heterogeneity, between them, which makes it difficult to evaluate semantic similarity across media data.

(3) The prior art proposes a unified deep common semantic space learning network (Cross Net), which learns the 4 media types simultaneously with the same feature extraction network. Compared with image, video and audio data, text data has some unique properties that make it difficult to normalize into the same two-dimensional matrix input as the other media. If a convolutional neural network is used directly on the text input, the sequence features between words are lost. Moreover, to describe a fine-grained target, text data usually contains many words, much of which is irrelevant to the target object; without an attention mechanism to assist text feature extraction, the feature extraction network can hardly learn the text features related to the target object. In addition, that algorithm does not address the first difficulty mentioned above. Therefore, when this method is used for the fine-grained cross-media retrieval task, the retrieval results still leave room for improvement.

In order to solve the above problems of media data feature extraction and the large heterogeneity gap in cross-media retrieval, a technical scheme is urgently needed that can effectively learn the small differences between fine-grained subcategories and effectively reduce the heterogeneity gap between cross-media data. The invention uses a generative adversarial network to complete this normalization process, ensuring classification accuracy while making the features of different media data under the same category label as similar as possible.

Disclosure of Invention

The invention aims to provide a fine-grained cross-media retrieval method based on self-attention and a generative adversarial network, which is applied to learning the fine differences between fine-grained subcategories, reducing the heterogeneity gaps between cross-media data, ensuring classification accuracy and making the features of different media data under the same category label as similar as possible.

The invention is realized by the following technical scheme: a fine-grained cross-media retrieval method based on self-attention and a generative adversarial network comprises the following steps:

S1, extracting features of the media data with a convolutional neural network and a recurrent neural network, according to a feature extractor based on a self-attention mechanism in a generator;

S2, presetting a common semantic space learning module, inputting the outputs of the convolutional neural network and the recurrent neural network into the common semantic space learning module, and extracting from it semantic features A1 that are irrelevant to the media type;

S3, presetting a media discriminator, inputting the semantic features A1 from the common semantic space learning module into the media discriminator, and verifying with the media discriminator whether the semantic features A1 are related to the media type; if so, returning to step S2, and if not, proceeding to step S4;

S4, performing adversarial learning between the generator and the media discriminator, and adversarially training the semantic features A1 to obtain trained semantic features A2;

S5, performing similarity measurement between the semantic features A2 and the semantic features A1 in the common semantic space learning module, and retrieving the corresponding data according to the measurement result;

and presetting fine-grained classification constraints, distance constraints and ranking constraints, wherein steps S1-S5 follow these constraints.

In this technical scheme, two feature extractors based on a self-attention mechanism are adopted to extract the features of the four media types. Under the self-attention mechanism, the network can effectively learn the discriminative differences between fine-grained subcategories. In the fine-grained cross-media retrieval task, the intra-class distance between sub-species of different media types needs to be reduced, and the inter-class distance between different sub-species of the same media type needs to be enlarged; if features were extracted from each medium directly with an ordinary convolutional neural network, the features would be inconsistent, so a normalization mechanism is needed to achieve this goal.

In order to better implement the present invention, step S1 further includes:

the media data include image data, video data, audio feature data, and text data;

locating specific parts of the target with the convolutional neural network for the image data, video data and audio feature data;

and acquiring word-level description features of the target with the recurrent neural network for the text data.

In this technical scheme, the text feature extraction algorithm based on the self-attention mechanism uses a recurrent neural network to better capture the temporal relations in text data, and combines the sequence features extracted by the recurrent neural network with the important features extracted by the self-attention mechanism to better extract text features.

In order to better implement the present invention, step S2 further includes:

randomly selecting any local attention area in the convolutional neural network, cropping and enlarging the local attention area, and inputting it into the common semantic space learning module;

reducing the feature dimension by linear superposition in the recurrent neural network, and combining the description-word features of the media data extracted by the recurrent neural network with the important features extracted by the self-attention mechanism before inputting them into the common semantic space learning module.

In this technical scheme, after the low-level features of the four media types are extracted by the two networks, all the low-level features are input into the common semantic space learning module. The data features in this space bridge the heterogeneity gaps of the four media types; only their high-level semantic features are extracted, separately for each medium, and these features are irrelevant to the media type, so similarity measurement can be performed on them.

In order to better implement the present invention, step S2 further includes:

setting a media-type indistinguishability rule and a semantic discriminability rule in the common semantic space learning module;

and bridging the heterogeneity gaps of the fine-grained media data in the common semantic space learning module, extracting the high-level semantic features of each media type separately to obtain the semantic features A1.

In this technical scheme, the purpose of the generator is to generate features that the media discriminator cannot distinguish, while the media discriminator tries as hard as possible to distinguish which media type the features generated by the generator belong to. The generator and the media discriminator perform adversarial learning; when the media discriminator cannot tell which media type the features of this space belong to, the learning of the common feature space is judged to be complete.

In order to better implement the present invention, step S3 further includes:

constructing the media discriminator from fully connected layers;

defining the classification loss of the media discriminator from a cross-entropy loss function, a nonlinear function, the media type labels and the input media types;

and verifying, according to the classification loss of the media discriminator, whether the features in the common semantic space learning module are irrelevant to the media type.

In this technical scheme, a media discriminator is also constructed to verify whether the features in the common representation space are independent of the media type, in order to train adversarially against the generator.

In order to better implement the present invention, step S4 further includes:

in the adversarial learning, the media discriminator distinguishes the media types according to the min-max game rule, and when the media discriminator cannot distinguish which media type the features of the common semantic space learning module belong to, the adversarial learning of the common feature space is judged to be finished, and the trained semantic features A2 are obtained.

In this technical scheme, when the media discriminator and the generator are trained adversarially, the loss of the generator is minimized while the loss of the media discriminator is maximized, yielding the optimal model of the algorithm.

To better implement the present invention, further, the similarity measurement in step S5 includes:

performing cosine similarity measurement between the semantic features A2 and the semantic features A1 in the common semantic space learning module.

In this technical scheme, cosine similarity is measured between the generated features and the features in the search library, so that data semantically similar to the query can be retrieved.
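As an illustration of this retrieval step, the following is a minimal PyTorch sketch of cosine-similarity ranking; the function name and the top-k interface are assumptions for illustration, not part of the patented method.

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, gallery_feats, top_k=10):
    """Rank search-library items by cosine similarity to the query feature.

    query_feat: (d,) feature from the common semantic space.
    gallery_feats: (N, d) features of the search library.
    """
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    scores, indices = sims.topk(min(top_k, gallery_feats.size(0)))
    return scores, indices   # most semantically similar items first
```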

In order to better implement the present invention, further, the classification constraint includes:

and learning fine-grained semantic features of the media data according to the classification constraint conditions.

In this technical scheme, the classification constraint ensures that the features of the common feature space sufficiently learn the semantic features of each media type.

To better implement the present invention, further, the distance constraint includes:

and determining the similarity of the semantic features of the media data according to the distance constraint conditions.

In this technical scheme, the distance constraint addresses the heterogeneity gap between the features of different media data. To narrow this gap, after the common semantic space learning module has been learned, a distance constraint is applied to the features of this space to pull together the features of samples of different media within the same subcategory, i.e., the distance constraint is used to narrow the intra-class gap.

To better implement the present invention, further, the ranking constraint includes:

the relative distance between samples of the input media data is predicted according to the ordering constraint.

In this technical scheme, the relative distance between input media data samples is predicted according to the ranking constraint, to ensure the compactness of sample features of the same subcategory across media and the separation of sample features of different subcategories.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The method is used for learning the tiny differences between fine-grained subcategories, reducing the heterogeneity gap between cross-media data, ensuring classification accuracy and making the features of different media data under the same category label as similar as possible.

(2) Addressing the fact that the sequence features of text data are easily lost or confused and that a heterogeneity gap exists between media, the method accurately finds the description sentences of target objects by means of a long short-term memory (LSTM) network based on a self-attention mechanism.

(3) The invention learns fine-grained features of cross-media data using a convolutional neural network based on a self-attention mechanism and local attention regions, and learns a common semantic space for the four media types with the help of a generative adversarial network.

(4) Experiments of the invention on multiple complex and challenging data sets fully verify the effectiveness of the self-attention-based mechanism for extracting fine-grained features.

Drawings

The invention is further described in connection with the following figures and embodiments, which are illustrative rather than limiting and fall within the protection scope of the invention.

Fig. 1 is a flowchart of the fine-grained cross-media retrieval method based on self-attention and a generative adversarial network provided by the present invention.

Fig. 2 is a schematic diagram of the text feature extraction algorithm based on the self-attention mechanism.

Fig. 3 is a schematic diagram of the convolutional neural network based on the self-attention mechanism.

Fig. 4 is a comparison of dual-media retrieval results of various methods on the PKU FG-XMedia dataset.

Fig. 5 is a comparison of multimedia retrieval results on the PKU FG-XMedia dataset.

Fig. 6 is a comparison of dual-media retrieval results on the PKU XMedia dataset.

Fig. 7 is a comparison of multimedia retrieval results on the PKU XMedia dataset.

Fig. 8 is a comparison of dual-media retrieval results on the Wikipedia dataset.

Fig. 9 is a comparison of dual-media retrieval results on the NUS-WIDE dataset.

Fig. 10 is a schematic diagram of the influence of each constraint in the dual-media task.

Fig. 11 is a schematic diagram of the influence of each constraint in the multimedia task.

Fig. 12 is a schematic diagram of the influence of each module of the algorithm in the dual-media mode.

Fig. 13 is a schematic diagram of the influence of each module in the multimedia task.

Detailed Description

Example 1:

In the fine-grained cross-media retrieval method based on self-attention and a generative adversarial network of this embodiment, as shown in Fig. 1, the feature extractor uses a convolutional neural network to extract the features of image, video and audio data, and a recurrent neural network to extract the features of text data.

This embodiment provides a fine-grained cross-media retrieval method based on a self-attention mechanism and a generative adversarial network. Addressing the fact that the sequence features of text data are easily lost or confused and that heterogeneity gaps exist between media, it accurately finds the description sentences of target objects by means of a long short-term memory network based on the self-attention mechanism; it then learns fine-grained features of cross-media data using a convolutional neural network based on the self-attention mechanism and local attention areas, and learns a common semantic space learning module for the four media types by means of a generative adversarial network. Experiments on multiple complex and challenging data sets fully validated the effectiveness of the self-attention-based mechanism for extracting fine-grained features. Step S1 uses two feature extractors based on the self-attention mechanism: a convolutional neural network for extracting image, video and audio features, and a recurrent neural network for extracting text features. Step S2 inputs all of them into the common semantic space learning module, and extracts only their high-level semantic features, which are irrelevant to the media type, so that similarity measurement can be performed on these features. Step S3 inputs the features of the common semantic space learning module into the media discriminator for discrimination, to verify that the features of this space are unrelated to the media type. Step S4 performs adversarial learning between the generator and the media discriminator. Step S5 measures the cosine similarity between the generated features and the features in the common semantic space learning module, so as to retrieve data with similar semantics.

In this embodiment, three constraints are considered together to better learn the common semantic space learning module: (1) the classification constraint: learning fine-grained semantic features of each media type; (2) the distance constraint: ensuring that intra-class sample features are as close as possible, i.e., samples that belong to the same subcategory but different media types should have features that are as similar as possible; (3) the ranking constraint: predicting the relative distance between input samples, so that features of samples of the same subcategory are closer together while features of samples of different subcategories are kept apart.

Example 2:

This embodiment is further optimized on the basis of embodiment 1. As described in embodiment 1, the generator consists of two feature extractors based on a self-attention mechanism, and the media data comprise image data, video data, audio feature data and text data; accordingly, one feature extractor is a convolutional neural network for extracting image, video and audio features, and the other is a recurrent neural network for extracting text features. The convolutional neural network based on the self-attention mechanism can pinpoint specific parts of the target, such as the eyes or beak of a bird, to learn fine-grained features between subcategories and distinguish them. The recurrent neural network based on the self-attention mechanism combines the accuracy of the self-attention mechanism in acquiring important features with the suitability of recurrent neural networks for sequence data, finding the most important features among many description words. This embodiment applies a recurrent neural network in a self-attention-based text feature extraction algorithm, together with a convolutional neural network based on the self-attention mechanism.

The text feature extraction algorithm based on the self-attention mechanism uses a recurrent neural network to better capture the temporal relations of text data, and combines the sequence features extracted by the recurrent neural network with the important features extracted by the self-attention mechanism to better extract text features.

As shown in Fig. 2, the algorithm consists of the text, character embedding, a long short-term memory network, and a self-attention mechanism comprising the self-attention weights and the embedded text matrix.

The text feature extraction algorithm based on the self-attention mechanism is as follows:

Given a sentence with $n$ words, the word embedding matrix $E$ of the sentence is represented as:

$E = [e_1, e_2, \ldots, e_n]$

where $e_i$ is the embedding vector of the $i$-th word of the sentence, and $E$ is an $n \times d$ matrix containing the word embedding vectors of the $n$ words. Since the word vectors in $E$ have dependency relationships, a bidirectional long short-term memory network (Bi-LSTM) is used to capture the dependencies between adjacent words in the sentence.

Assuming the hidden-layer size of the Bi-LSTM is $u$, the hidden-layer output $h_t$ at step $t$ can be expressed as the concatenation of the two directions:

$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

$H$ is the set of all hidden-layer outputs of the Bi-LSTM, denoted as:

$H = [h_1, h_2, \ldots, h_n]$

The size of $H$ is $n \times 2u$.

Since the fine-grained features used in this embodiment require richer semantic information, and the description text contains many words, the sequence input to the Bi-LSTM is long. To ensure that the long- and short-term memory in the Bi-LSTM does not vanish or become confused, this embodiment represents each character with a higher-dimensional feature vector, which would require the subsequent linear layers to have more depth and parameters when reducing the feature dimension, making fast convergence difficult. Therefore, this embodiment reduces the feature dimension by linear superposition (average sum) before the attention-weight extraction; the reduced matrix $\tilde{H}$ has size $n \times u$.

The self-attention mechanism aims to learn a weight matrix that increases the weights of the words related to the target in the sentence and decreases the weights of unrelated words, i.e., it concentrates the finally extracted features on a specific region. The self-attention mechanism takes the whole hidden state $\tilde{H}$ of the LSTM as input and outputs a weight matrix $M$, represented as follows:

$M = \sigma(W_1 \tilde{H}^{\top})$

Here $M$ is a $d_a \times n$ weight matrix, $W_1$ is a $d_a \times u$ parameter matrix, and $d_a$ is a hyper-parameter that can be set to an arbitrary value. $\sigma(\cdot)$ is an activation function that can be chosen as required. Since the size of $\tilde{H}$ is $n \times u$, the size of $M$ is $d_a \times n$.

Multiplying the weight matrix $M$ with the hidden state $\tilde{H}$ of the LSTM yields the embedded text matrix $L$:

$L = M \tilde{H}$

$L$ is the low-level feature of the text data obtained through the text processing channel. Its dimension is then adjusted through several fully connected layers to be consistent with the features of the other three media types, expressed as $f_T$.
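To make the text branch concrete, the following is a minimal PyTorch sketch of the pipeline reconstructed above (character/word embedding, Bi-LSTM, direction averaging, self-attention weighting, dimension alignment); the layer sizes, the softmax choice for the activation $\sigma$, and the class name are assumptions of this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveTextEncoder(nn.Module):
    """Sketch of the self-attention text branch: Bi-LSTM low-level features
    weighted by a learned attention matrix M (hyper-parameter values here
    are assumptions, not fixed by the patent text)."""

    def __init__(self, vocab_size, d=100, u=1024, d_a=64, out_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)            # E: n x d
        self.lstm = nn.LSTM(d, u, bidirectional=True, batch_first=True)
        self.W1 = nn.Linear(u, d_a, bias=False)             # d_a x u parameter
        self.fc = nn.Linear(d_a * u, out_dim)               # align with other media

    def forward(self, tokens):                              # tokens: (B, n)
        E = self.embed(tokens)                              # (B, n, d)
        H, _ = self.lstm(E)                                 # (B, n, 2u)
        # linear superposition: average the two directions, n x 2u -> n x u
        H_tilde = H.view(*H.shape[:2], 2, -1).mean(dim=2)   # (B, n, u)
        M = F.softmax(self.W1(H_tilde), dim=1)              # weights over the n words
        L = torch.einsum('bna,bnu->bau', M, H_tilde)        # (B, d_a, u), L = M @ H~
        return self.fc(L.flatten(1))                        # low-level text feature f_T
```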

As shown in Fig. 3, since the self-attention mechanism can accurately locate discriminative parts of the target, a convolutional neural network based on the self-attention mechanism is used here to extract the features of image data, video data and audio data. After the discriminative local attention areas are obtained, they are cropped, enlarged, and fed into the network again to learn fine-grained local regions.

Because the network processes image data, video data and audio data in the same way, the following explanation takes an image as input; the specific steps are as follows:

for given image dataInputting the data into convolutional neural network feature extraction, mapping the extracted featuresExpressed as:

wherein the content of the first and second substances,are parameters of this convolutional neural network. Then will beObtaining local attention area through a1 x 1 convolution layerIt is expressed as:

wherein the content of the first and second substances,is the parameter of the convolution layer of 1 x 1,for this purpose the mapping function is set to,byIn a clear view of the above, it is known that,is prepared byThe dimension is reduced to obtain the channel number of. According to the statement Hu of Hu,each channel of (a) represents a certain part of the object, for example: eyes of birds, wheels of cars, eaves of houses, etc. Is obtained byAndthereafter, the two are combined using a self-attention based pooling layer, i.e. the two are ready to be combinedAndthe specific schematic diagram is shown in fig. 3, and the calculation process is as follows:then all ofObtaining final characteristic matrix through pooling, dimension reduction and splicing operationThis is also the input to the common feature space afterwards.

In order for the network to learn fine-grained features, the present embodiment willLocal attention gainedAnd (3) cutting and amplifying the force area, inputting the force area into the convolutional neural network in the figure 3 again, and performing class prediction on the force area after passing through a common feature learning module and a classifier.

To avoid background interference, this embodiment randomly selects(n =1, 2, …, k) of any one of the local attention areasTo learn fine-grained features. Firstly, normalization processing is carried out to obtain a normalization resultComprises the following steps:

clipping mask in FIG. 3Can be obtained from the following formula:

as can be seen from the above equation, when the value of the point is greater than a certain threshold value Tc, it is 1, otherwise it is 0. Wherein Tc is ∈ [0,1 ]]. Then finding out the minimum surrounding frame and annotating all the '1' frames to obtain a local attention cutting picture, amplifying the local attention cutting picture to the size of the original input data, and finally obtaining an enlarged local attention picture
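The crop-and-zoom step can be illustrated with the sketch below, which follows the description above; the random map choice, the bounding-box scaling and the bilinear resizing are assumed implementation details.

```python
import torch
import torch.nn.functional as F

def attention_crop(image, attention_maps, tc=0.5):
    """Crop and enlarge one randomly chosen local attention area A_n.

    image: (C, H, W) input; attention_maps: (k, h, w) local attention maps.
    """
    k = attention_maps.size(0)
    a = attention_maps[torch.randint(k, (1,)).item()]        # random A_n
    a = (a - a.min()) / (a.max() - a.min() + 1e-12)          # normalize to [0, 1]
    mask = a > tc                                            # cropping mask C
    ys, xs = mask.nonzero(as_tuple=True)
    if ys.numel() == 0:                                      # nothing above threshold
        return image
    # minimum bounding box covering all 1s, scaled up to image resolution
    sy, sx = image.shape[1] / a.shape[0], image.shape[2] / a.shape[1]
    y0, y1 = int(ys.min() * sy), int((ys.max() + 1) * sy)
    x0, x1 = int(xs.min() * sx), int((xs.max() + 1) * sx)
    crop = image[:, y0:y1, x0:x1].unsqueeze(0)
    zoomed = F.interpolate(crop, size=image.shape[1:], mode='bilinear',
                           align_corners=False)
    return zoomed.squeeze(0)                                 # enlarged attention image
```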

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 3:

In this embodiment, after the low-level features of the four media types are extracted by the two networks, all the low-level features are input into the common semantic space learning module. The data features in this space bridge the heterogeneity gaps of the four media types; only the high-level semantic features are extracted, separately for each medium, and these features are independent of the media type, so similarity measurement can be performed on them. To verify that the features of this space are independent of the media type, the features of the common semantic space learning module are input into the media discriminator to be distinguished.

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 4:

This embodiment is further optimized on the basis of embodiment 1. The purpose of the generator is to generate features that the media discriminator cannot recognize, while the media discriminator tries as hard as possible to distinguish which media type the features generated by the generator belong to. The generator and the media discriminator perform adversarial learning; when the media discriminator cannot distinguish which media type the features of this space belong to, the learning of the common feature space is judged to be complete. Unlike traditional single-media retrieval, the fine-grained cross-media retrieval task studied in this embodiment suffers from heterogeneity gaps. To bridge the heterogeneity gaps between different media, a common semantic space learning module is required, in which samples containing the same semantic information can be represented as similar "features". Since the heterogeneity gap between different media can be large, converging the cross-media representation domains into one unified representation domain is a challenge. Thus, this embodiment builds an adversarial training mechanism between the generator and the media discriminator to drive the learning of the common semantic space learning module. The generator is used to generate semantic representations that are independent of the media type, while the media discriminator attempts to identify the media types of those semantic representations. These two main parts guide the learning of the common representation space through their antagonism, thereby effectively enhancing cross-media semantic consistency and improving retrieval accuracy.

Features in the common semantic space learning module should simultaneously have three attributes across media types: (1) semantic discriminability, which effectively enhances the semantic consistency of data of different media types; (2) media-type indistinguishability, which effectively reduces the heterogeneity gap, i.e., the media gap: the media discriminator tries to identify the media types of the "features" generated by the generator, while the feature generator reduces the representation differences between media types to confuse the media discriminator, and this constitutes the adversarial training process between the media discriminator and the feature generator; (3) measurability, so that similarity measurement can be performed in the common semantic space learning module.

The generator takes $(x_I, x_V, x_A, x_T)$ as input and can learn the features of the four media types simultaneously. The network that processes text extracts the low-level text feature as:

$f_T = T(x_T; \theta_t)$

The image, video and audio data pass through the convolutional neural network based on the self-attention mechanism to extract the features of the three media types, represented as follows:

$f_I = f(x_I; \theta_c), \quad f_V = f(x_V; \theta_c), \quad f_A = f(x_A; \theta_c)$

The enlarged local attention images of the three media types are obtained from the convolutional neural network based on the self-attention mechanism. In order to learn discriminative local features between fine-grained subcategories, they are also input into the convolutional neural network for training, yielding the features $f_I'$, $f_V'$ and $f_A'$.

Since the feature $f_T$ and the features $f_I, f_V, f_A$ often have inconsistent distributions, they cannot be compared directly. In order to make the sample features of different media types comparable, a common semantic space learning module is added after the feature extractors. The common semantic space learning module consists of several fully connected layers and an excitation function, forming a nonlinear mapping with projection function $P(\cdot)$. The common semantic space learning module aims to find a common subspace $S$; in the common subspace $S$, the extracted features of the four media types are expressed as:

$s_m = P(f_m; \theta_P), \quad m \in \{I, V, A, T\}$

where $\theta_P$ is the parameter of the mapping function $P$.
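A minimal sketch of the projection $P(\cdot; \theta_P)$ as a small fully connected network follows; the number of layers and their widths are assumptions for illustration.

```python
import torch.nn as nn

class CommonSemanticSpace(nn.Module):
    """Projection head P(.; theta_P): fully connected layers with a
    nonlinearity, shared by all four media branches."""

    def __init__(self, in_dim=2048, hidden=1024, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, f):      # f: low-level media feature f_m
        return self.net(f)     # s_m: media-agnostic common-space feature
```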

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 5:

This embodiment is further optimized on the basis of embodiment 1. A media discriminator is constructed to verify whether the features in the common representation space are independent of the media type, for the purpose of adversarial training against the generator. The media discriminator consists of several fully connected layers and aims to distinguish which media type the features of the common semantic space learning module in the generator belong to. With $s_m$ as input, the classification loss of the media discriminator is defined as:

$L_{adv} = -\dfrac{1}{N} \sum_{i=1}^{N} m_i \cdot \log D(s_i; \theta_D)$

where the loss is a cross-entropy loss function, $\theta_D$ is the parameter of the media discriminator $D$, which predicts which media type each of $s_I, s_V, s_A, s_T$ belongs to, and $m_i$ is the one-hot media-type label of the input data.
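The media discriminator and its cross-entropy loss could look like the following sketch; the depth and widths of the fully connected layers are assumptions.

```python
import torch
import torch.nn as nn

class MediaDiscriminator(nn.Module):
    """Fully connected media discriminator D(.; theta_D): predicts which of
    the four media types a common-space feature came from."""

    def __init__(self, dim=512, n_media=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(inplace=True),
            nn.Linear(dim // 2, n_media),
        )

    def forward(self, s):
        return self.net(s)     # logits over {image, video, audio, text}

def discriminator_loss(disc, feats, media_labels):
    """Cross-entropy between D's prediction and the media-type label."""
    return nn.functional.cross_entropy(disc(feats), media_labels)
```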

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 6:

This embodiment is further optimized on the basis of embodiment 1. When the media discriminator and the generator are trained adversarially, the loss of the generator is minimized while the loss of the media discriminator is maximized, yielding the optimal model of the algorithm; this process is also called the min-max game rule. Based on this theory, the parameters can be assigned to each loss function, and the loss function of the adversarial stage is defined as:

$L(\theta_c, \theta_t, \theta_P, \theta_D) = L_{emb}(\theta_c, \theta_t, \theta_P) - \lambda L_{adv}(\theta_D)$

where $L_{emb}$ denotes the generator-side embedding losses (the classification, distance and ranking losses described below), $\lambda$ is the balance parameter between the two loss terms in the training stage and takes a positive value, $\theta_c$ are the parameters of the feature extractor that processes images, video and audio, $\theta_t$ are the parameters of the feature extractor that processes text data, and $\theta_P$ is the parameter of the common semantic space learning module. Since the optimization objectives of the two loss terms are opposite, the two concurrent sub-processes can be defined as:

$(\hat{\theta}_c, \hat{\theta}_t, \hat{\theta}_P) = \arg\min_{\theta_c, \theta_t, \theta_P} L(\theta_c, \theta_t, \theta_P, \hat{\theta}_D)$

$\hat{\theta}_D = \arg\max_{\theta_D} L(\hat{\theta}_c, \hat{\theta}_t, \hat{\theta}_P, \theta_D)$

where the generator parameters minimize $L$ while $\theta_D$ maximizes it; this is the adversarial training process of the model. The objective is to find the saddle point of $L$: at this saddle point, the generator parameters minimize the generator loss, and the discriminator parameter maximizes the discriminator objective. Based on the above two formulas, the parameters are updated as follows:

$\theta \leftarrow \theta - \mu \dfrac{\partial L}{\partial \theta}$

where $\mu$ is the learning rate. This parameter update can be realized with the stochastic gradient descent (SGD) algorithm. The PyTorch framework conveniently implements SGD, automatically computing gradients and updating parameters.
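One min-max update could be organized as in the sketch below; `emb_loss_fn` stands in for the generator-side constraint losses and `lam` for the balance parameter $\lambda$, both assumptions of this illustration rather than fixed parts of the method.

```python
import torch
import torch.nn.functional as F

def adversarial_step(gen, disc, batch, media_labels, emb_loss_fn,
                     opt_g, opt_d, lam=0.1):
    """One min-max update: D learns to classify media types; the generator
    keeps its embedding losses low while fooling D."""
    feats = gen(batch)                                   # common-space features
    # discriminator step: maximize its ability to tell media types apart
    d_loss = F.cross_entropy(disc(feats.detach()), media_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # generator step: minimize L_emb - lambda * L_adv (i.e., confuse D)
    adv = F.cross_entropy(disc(feats), media_labels)
    g_loss = emb_loss_fn(feats) - lam * adv
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```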

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 7:

This embodiment is further optimized on the basis of embodiment 1. The cosine similarity between the generated features and the features in the search library is measured, so that data semantically similar to the query can be retrieved.

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 8:

This embodiment is further optimized on the basis of embodiment 1. The classification constraint enables the features of the common feature space to sufficiently learn the semantic features of each media type. A softmax function after the last fully connected layer of the common semantic space learning module serves as the classifier and finally outputs a group of probability values, which indicate which class the common semantic space learning module predicts for the sample. After the classifier, the common-space features $s_I, s_V, s_A, s_T$ are expressed as the predictions $p_I, p_V, p_A, p_T$. In addition, the common-space features of the attention-enlarged images obtained above are also input into the classifier to obtain the corresponding predictions $p_I', p_V', p_A'$; each is then combined with the prediction from the original features with a weight of one half each, so as to judge the class. The classification loss $L_{cls}$ is defined as follows:

$L_{cls} = \sum_{m \in \{I, V, A\}} l_{ce}\Big(\tfrac{1}{2} p_m + \tfrac{1}{2} p_m',\, y\Big) + l_{ce}(p_T, y)$

where $p_I, p_V, p_A, p_T$ are the predictions from the original image, video, audio and text features, and $p_I', p_V', p_A'$ are the predictions from the attention-enlarged images; all of these are output data of the last fully connected layer. $l_{ce}$ is the cross-entropy loss function and $C$ is the number of subcategories of the dataset; $l_{ce}$ is defined as follows:

$l_{ce}(p, y) = -\sum_{c=1}^{C} y_c \log p_c$
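A sketch of this half-and-half combination of original and attention-enlarged predictions, assuming classifier logits for both views and integer subcategory labels:

```python
import torch
import torch.nn.functional as F

def classification_loss(logits_orig, logits_zoom, labels):
    """L_cls sketch: softmax predictions of the original features and of the
    attention-enlarged image features are averaged with half weights before
    the cross-entropy against the subcategory label y."""
    p = 0.5 * F.softmax(logits_orig, dim=1) + 0.5 * F.softmax(logits_zoom, dim=1)
    return F.nll_loss(torch.log(p + 1e-12), labels)
```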

other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 9:

This embodiment is further optimized on the basis of embodiment 1. The distance constraint addresses the heterogeneity gap between the features of different media data. To narrow this gap, after the common semantic space learning module has been learned, a distance constraint is applied to the features of this space to pull together the features of samples of different media within the same subcategory, i.e., to narrow the intra-class gap. For convenience of expression, taking only two media (image and video) as an example, the distance between the sample features of image and video is:

$d(s_I, s_V) = \| s_I - s_V \|_2^2$

There is one such distance between every two media; finally, the 6 pairwise distances are summed to obtain $L_{dist}$:

$L_{dist} = \sum_{m \neq m'} d(s_m, s_{m'}), \quad m, m' \in \{I, V, A, T\}$

The smaller $L_{dist}$ is, the smaller the gap between sub-species samples of different media types. The distance loss is intended to minimize the intra-subcategory variance.
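The pairwise distance sum over the 6 media pairs can be sketched as follows, assuming one common-space feature batch per media type with matched subcategories:

```python
import torch

def distance_loss(s_img, s_vid, s_aud, s_txt):
    """L_dist sketch: squared Euclidean distance between common-space
    features of the same subcategory, summed over all 6 media pairs."""
    feats = [s_img, s_vid, s_aud, s_txt]
    loss = s_img.new_zeros(())
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            loss = loss + (feats[i] - feats[j]).pow(2).sum(dim=1).mean()
    return loss
```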

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 10:

This embodiment is further optimized on the basis of embodiment 1.

The ranking loss applied by the ranking constraint widens the distance between features of different fine-grained subcategories. It is defined as follows:

$L_{rank} = \max\big(0,\, D(x_i, x_j) - T_2\big) + \max\big(0,\, T_1 - D(x_k, x_l)\big)$

In the above formula, $x_i$ and $x_j$ are two samples belonging to the same subcategory, while $x_k$ and $x_l$ come from two different subcategories. For example, $x_i$ and $x_j$ may be the image data and the video data of the "Black-footed Albatross", $x_k$ a sample of the "Black-backed Albatross", and $x_l$ a sample belonging to the "Grey-backed Albatross". Note that these settings are independent of the media type, that is, $x_i$, $x_j$, $x_k$ and $x_l$ can be samples of any media type. $D(\cdot, \cdot)$ describes the divergence between two samples and is measured by the Euclidean distance. $T_1$ and $T_2$ are boundary thresholds, set to 1 and 0.5 respectively in this embodiment.
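Under the hinge-form reconstruction above (itself an assumption), the ranking loss can be sketched as:

```python
import torch

def ranking_loss(x_i, x_j, x_k, x_l, t1=1.0, t2=0.5):
    """L_rank sketch: x_i and x_j share a subcategory; x_k and x_l come from
    two different subcategories. Same-class distances are pulled inside T2,
    cross-class distances pushed beyond T1; media types are ignored."""
    d_pos = torch.norm(x_i - x_j, dim=1)      # Euclidean distance, same subcategory
    d_neg = torch.norm(x_k - x_l, dim=1)      # different subcategories
    pull = torch.clamp(d_pos - t2, min=0)     # compactness within a subcategory
    push = torch.clamp(t1 - d_neg, min=0)     # separation across subcategories
    return (pull + push).mean()
```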

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

In addition, based on the above embodiments, the experimental operating environment selected by the invention is Linux 4.15.0, a GeForce RTX 2080 Ti GPU and CUDA 10.2, the experimental language is Python 3.6, and the following operations are performed:

First, dataset processing is carried out. To normalize the datasets, the invention operates as follows. Each image is resized to 448 × 448. For text, following Conneau, each text item is converted into an n × d matrix, where d is the character embedding dimension, set to 100 in the invention. In addition, all text sentences are fixed to 448 characters in length, so the matrix size of each text item is 448 × 100. If a sentence has fewer than 448 characters, zero rows are padded above the text; if it exceeds 448 characters, it is truncated at the 448th character. Because no text description in the text datasets exceeds 448 characters, no information is lost. For each video, 25 frames are extracted at equal intervals as the video data. The audio data in the original datasets have already been processed by a short-time Fourier transform, so the audio is presented as spectrograms.
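The text normalization just described can be sketched as follows; `embed` is an assumed character-to-vector lookup, not an API from the patent.

```python
import numpy as np

def text_to_matrix(chars, embed, n=448, d=100):
    """Normalize one text item to a fixed n x d character-embedding matrix:
    truncate at the 448th character, or zero-pad shorter texts (zeros placed
    above the text rows, per the description above)."""
    chars = chars[:n]                                   # clip overly long text
    rows = [embed(c) for c in chars]
    mat = np.stack(rows) if rows else np.zeros((0, d), dtype=np.float32)
    if mat.shape[0] < n:                                # shorter text: zero-pad
        pad = np.zeros((n - mat.shape[0], d), dtype=mat.dtype)
        mat = np.concatenate([pad, mat], axis=0)        # zeros above the text
    return mat                                          # shape (448, 100)
```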

Second, algorithm details are handled. Samples of the four media types are input into the generator simultaneously; image, video and audio data are fed directly into the same convolutional neural network based on the self-attention mechanism, and the feature extractor used in the invention is a ResNet50 network. The retrieval stage differs from the training stage: during training, one attention area is randomly selected and cropped into an enlarged local attention image to assist training and prevent overfitting, whereas in the retrieval stage, to stabilize the retrieval results, all attention areas are averaged before being input into the network. To obtain better performance, some modifications were made: the input data size is 448 × 448, and an average pooling layer with kernel size 14 and stride 1 is added after the last convolutional layer of ResNet50. It should be noted that ResNet50 can be replaced by any other effective convolutional neural network, such as AlexNet or VGGNet. Text data are input into the recurrent neural network based on the self-attention mechanism. To better learn the sequence characteristics of the text, the adopted recurrent network is a bidirectional LSTM with a hidden-layer dimension of 2048. After the two feature extraction networks, the extracted low-level features are input together into the common semantic learning module, which is obtained under the combined action of the three constraints and the media discriminator. In the training phase, the learning rate starts at 0.001 and decreases by a factor of 0.5 every three epochs, where one epoch equals one pass over all samples in the training set. At retrieval time, the output of the last fully connected layer of the framework is used as the common semantic representation of the four media types. The cosine distance introduced above is then adopted to measure the similarity between different media data, and finally the retrieval results are returned according to the similarity.

Finally, the retrieval tasks are set. To demonstrate the effectiveness of the algorithm, the following two tasks are used to evaluate the fine-grained cross-media retrieval performance of different methods: dual-media fine-grained cross-media retrieval and multimedia fine-grained cross-media retrieval. Dual-media fine-grained cross-media retrieval: the query is an instance of any one media type and the retrieved results are instances of another media type. For example, if the query is an image of "California Gull", the result may be a text instance of "California Gull", denoted "I → T"; there are thus 12 dual-media fine-grained cross-media retrieval tasks in total: "I → T", "I → V", "I → A", "T → I", "T → V", "T → A", "V → I", "V → T", "V → A", "A → I", "A → T" and "A → V". Multimedia fine-grained cross-media retrieval: the query is an instance of any media type and the retrieved results are data of all media types. For example, if the query is an image of "California Gull", the results will be image, video, audio and text data of "California Gull", denoted "I → all"; multimedia fine-grained cross-media retrieval thus has 4 tasks in total: "I → all", "T → all", "V → all" and "A → all".

The results are analyzed through the above operations. First, the PKU FG-XMedia dataset (a fine-grained cross-media dataset) is analyzed. On this dataset, CFFCE is a fine-grained cross-media retrieval algorithm based on an entity-level common feature space, FGCN is a feedback graph convolutional network algorithm, and MHTN is a modal-adversarial hybrid transfer network algorithm. As shown in Fig. 4, the horizontal rows list the various cross-media retrieval algorithms, such as the MHTN retrieval algorithm, ACMR (a cross-modal retrieval algorithm based on adversarial training), and the JRL, GSPH, CMDN, SCAN and GXN retrieval algorithms, while I, V, A and T in the vertical column represent image, video, audio and text features respectively.

As shown in Fig. 4, the dual-media retrieval comparison results of various methods on the PKU FG-XMedia dataset are presented; Fig. 5 shows the multimedia retrieval comparison results on the same dataset. As can be seen from Fig. 4, the performance of the proposed method on the 12 dual-media retrieval tasks is significantly better than that of general cross-media retrieval methods and of the recently proposed fine-grained cross-media retrieval algorithm FG-CrossNet. In particular, the mean average precision (mAP) score of the method is about 12 percentage points higher than the FG-CrossNet score on the related tasks. Because FG-CrossNet uses the same convolutional neural network to extract the features of the four media types simultaneously, the sequence characteristics of the text are lost. The invention uses a text feature extraction branch based on the self-attention mechanism to preserve the sequence relations between text words, and adopts the self-attention mechanism to increase the weight of target-related information, thereby extracting text features better. Therefore, the mAP score of the method greatly exceeds FG-CrossNet and the other cross-media retrieval algorithms on all text-related retrieval tasks. Fig. 5 shows the mAP scores in the multimedia fine-grained cross-media retrieval mode, which follow the same trend as the dual-media retrieval. From these results, the algorithm of the invention achieves the highest performance on both the 12 dual-media retrieval tasks and the 4 multimedia retrieval tasks, mainly because: 1) a common semantic space learning module that uses a generative adversarial network to learn all four media types is constructed, which reduces the heterogeneity gaps between different media; 2) a text feature extraction branch based on the self-attention mechanism is constructed to capture the sequence features of the text, which focuses the learning process on the more important components of the text data.

As shown in Figs. 6 and 7, the experimental mAP scores of different algorithms on the PKU XMedia dataset are presented. Since SCAN and GXN are earlier cross-media retrieval methods, and due to space limitations, the subsequent experiments do not show the retrieval performance of these two methods. It can be seen from the figures that the performance of the proposed algorithm remains optimal on this non-fine-grained dataset. Unlike fine-grained datasets, the differences between data of different classes in a coarse-grained dataset are large. Taking images as an example, the overall visual appearance of birds and insects in a coarse-grained dataset differs greatly, while in a fine-grained dataset the difference between the "Grey-backed Albatross" and the "Black-backed Albatross", both birds, lies only in the color of the back and the feet. Thus, the SAGA network can better learn the characteristics of coarse-grained subclasses. It is also clear from the figures that the proposed algorithm outperforms all other cross-media retrieval algorithms on the 12 dual-media retrieval tasks. Even compared with FG-CrossNet, the previously best-performing method, the algorithm improves the performance of all tasks. The experimental results demonstrate the effectiveness of the generative adversarial network model in learning the common semantic space.

As shown in Fig. 8, the mAP scores of different algorithms on the Wikipedia dataset are presented. Like PKU XMedia, Wikipedia is also a coarse-grained retrieval dataset. Since the Wikipedia dataset has only image and text data, only the two retrieval tasks "I → T" and "T → I" are performed on it. The experimental results again demonstrate the effectiveness of constructing the common semantic space with the generative adversarial network model and the effectiveness of the self-attention mechanism in text feature extraction.

As shown in Fig. 9, the retrieval mAP scores of different algorithms on the NUS-WIDE dataset are presented. NUS-WIDE is also a coarse-grained dataset, and it can be seen that the SAGA algorithm still achieves the highest retrieval performance. SAGA not only accurately retrieves the information required by the user, but also returns as much additional valid data as possible to meet the user's needs.

In order to verify the effect of each constraint in the algorithm, the invention performed ablation experiments on the PKU FG-XMedia dataset; the results are shown in Figs. 10 and 11. From them it can be observed that when only the classification constraint is used to train the framework, the retrieval performance is relatively low on the 12 dual-media retrieval tasks and the 4 multimedia retrieval tasks. This indicates that the classification constraint can learn the features of fine-grained subcategories with the help of semantic categories, but does not work well on its own. "+ distance constraint" means that the distance constraint is added on top of the classification constraint. Compared with the classification constraint alone, whose average mAP is only 0.41 in the dual-media mode, this improves the mAP by 3.4 percentage points, and by 4.1 percentage points in the multimedia mode, giving better retrieval performance. This is because the distance constraint pulls features of the same subcategory, which share the same semantics, closer together. "+ ranking constraint" means that the ranking constraint is added on top of the classification and distance constraints. It can be seen that performance is best with the ranking constraint: retrieval performance improves by a further 1.9% in the dual-media retrieval mode and 3.5% in the multimedia retrieval mode. The ranking constraint separates the features of different subcategories as much as possible and also effectively improves retrieval performance.

As shown in Figs. 12 and 13, the proposed fine-grained cross-media retrieval algorithm has two new sub-modules: the text feature extraction algorithm based on the self-attention mechanism, and the generative adversarial network. The contribution of each sub-module is evaluated. Figs. 12 and 13 show, in an incremental manner, the mAP scores of different sub-module combinations in the two retrieval modes; the second row in each figure shows the retrieval performance of a basic framework that contains neither module. The third row shows the mAP score when the original text processing branch is replaced with the text feature extraction algorithm based on the self-attention mechanism; the performance of all retrieval tasks improves significantly after this branch is added. The fourth row shows the mAP score after the generative adversarial structure is added; it greatly improves retrieval performance compared with using only the self-attention-based text feature extractor. In summary, each newly proposed sub-module has its own role and advantages, and combining all the sub-modules yields the best retrieval performance.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
