Fine-grained cross-media retrieval method with deep supervision and feature fusion


Published 2021-12-10; filed 2021-11-11 by Yao Yazhou, Sun Zeren, Chen Tao, Zhang Chuanyi, and Shen Fumin.

Abstract: The invention relates to the technical field of deep learning and discloses a fine-grained cross-media retrieval method with deep supervision and feature fusion, comprising: step S1, inputting image data, video data, text data, and audio data into their respective media-specific networks, sharing the weights of the last fully-connected layer across the four media sub-networks, and extracting the corresponding original features; step S2, projecting the original features into a common space using semantic information and label information, connecting each of the four media sub-networks to a linear classifier, and classifying the extracted original features according to the label information in the linear classifiers; step S3, measuring the classification features of the media data in the common space under the preset triplet loss, classification loss, and discriminant loss constraints. The invention serves to minimize the discriminant loss and learn the similarity of the media features in the label space and the common feature representation space.

1. A fine-grained cross-media retrieval method based on deep supervision and feature fusion, characterized by comprising the following steps: S1, inputting image data, video data, text data, and audio data into their respective media-specific networks, sharing the weights of the last fully-connected layer across the four media sub-networks, and extracting the corresponding original features; S2, projecting the original features into a common space using semantic information and label information, connecting each of the four media sub-networks to a linear classifier, and classifying the extracted original features of the media data according to the label information in the linear classifiers; and S3, measuring the classification features of the media data in the common space under the preset triplet loss constraint, classification loss constraint, and discriminant loss constraint.

2. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein the media-specific network in step S1 comprises an image convolutional network, a video convolutional network, an audio convolutional network, and a text convolutional network.

3. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein step S1 comprises: extracting fine-grained features of the image data and the video data with a pre-trained bilinear convolutional neural network and taking them as the original semantic features; and feeding the original semantic features of the image data and the video data into a fully-connected layer to generate the original image and video features.

4. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein step S1 further comprises: extracting fine-grained features of the audio data with a pre-trained VGG16 convolutional neural network and taking them as the original semantic features; and feeding the original semantic features of the audio data into the fully-connected layer to generate the original audio features.

5. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein step S1 further comprises: extracting fine-grained features of the text data with a sentence CNN (a convolutional neural network for sentence classification) and taking them as the original semantic features; and feeding the original semantic features of the text data into the last fully-connected layer to generate the original text features.

6. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein presetting the triplet loss constraint in step S3 comprises: judging whether two media data samples belong to the same class; if so, constraining them with an equidistant constraint and a boundary constraint, and if not, introducing an equal-distribution constraint to constrain them.

7. The method of claim 6, wherein the constraints are applied as follows: the equidistant constraint reduces the intra-class variance of same-class media data samples; the boundary constraint keeps the sample features within a class distinguishable; same-class media data are constrained in the common space by the equidistant and boundary constraints together; and different-class media data samples are constrained in the common space by the equal-distribution constraint, using the mini-batch gradient descent algorithm.

8. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein presetting the classification loss constraint in step S3 comprises: using linear classifiers to predict the label information of the media data sample features in the common space, wherein the linear layers corresponding to the linear classifiers are connected to the tops of the four sub-networks, and the four sub-networks are classified according to a cross-entropy loss function.

9. The fine-grained cross-media retrieval method with deep supervision and feature fusion according to claim 1, wherein presetting the discriminant loss constraint in step S3 comprises: selecting two media data sample feature vectors, setting their pair label to 1 when the samples belong to the same class and to 0 when they belong to different classes; expressing the similarity of the feature vectors with a cosine function and a discriminant loss function to obtain a negative log-likelihood value and a negative log-likelihood function; and obtaining an activation function from the negative log-likelihood function and computing the discriminant loss of the media data samples from the activation function.

Technical Field

The invention relates to the technical field of deep learning, and in particular to a fine-grained cross-media retrieval method with deep supervision and feature fusion, which serves to minimize the discriminant loss and learn the similarity of the media features in the label space and the common feature representation space.

Background

In recent years, with the rapid growth of multimedia data, rich and diverse data have flooded into people's lives, and multimedia data such as images, texts, audio, and video have become a main way for people to know the world. The relevance among massive multimedia data keeps increasing, and users' retrieval requirements have become very flexible: beyond retrieving a single type of data, users expect to retrieve the data they are interested in quickly and accurately from massive collections. Fine-grained cross-media retrieval therefore has broad research and application value. In information retrieval, it enriches the diversity of the information users obtain and improves the efficiency of obtaining it. In pattern recognition and human-computer interaction, it makes communication and interaction between people and machines more convenient. In education, medicine, aviation, and other fields, a machine can teach a person the image, audio, and other characteristics of different fine-grained objects through cross-media retrieval.

Existing traditional methods usually learn inter-media and intra-media relationships through label information and perform fine-grained cross-media retrieval on that basis. These approaches have the following drawbacks:

1. The media gap problem: the feature representations of data samples of different media types differ greatly, so directly measuring the similarity between them is very difficult.

2. The semantic gap problem: the features a computer extracts are inconsistent with human understanding, producing a mismatch between low-level features and high-level semantics.

3. The fine-grained level causes small inter-class differences (different fine-grained classes look similar) and large intra-class differences (objects in the same class differ markedly due to pose, illumination, and the like). These drawbacks can result in poor cross-media retrieval performance.

Therefore, a method is needed that solves the above problems: eliminating the media gap while keeping the differences between samples of different semantic categories; constructing a common feature representation space for the four media types, minimizing the discriminant loss in the label space and the common feature representation space, and learning the similarity of the media features; and combining label features with semantic features for similarity measurement, with the weight of each feature assigned by variance. Such a method further improves cross-media retrieval performance and has good reference value and practicability.

Disclosure of Invention

The invention aims to provide a fine-grained cross-media retrieval method with deep supervision and feature fusion that minimizes the discriminant loss and learns the similarity of the media features in the label space and the common feature representation space.

The invention is realized by the following technical scheme: a fine-grained cross-media retrieval method based on deep supervision and feature fusion comprises the following steps:

S1, inputting image data, video data, text data, and audio data into their respective media-specific networks, sharing the weights of the last fully-connected layer across the four media sub-networks, and extracting the corresponding original features;

S2, projecting the original features into a common space using semantic information and label information, connecting each of the four media sub-networks to a linear classifier, and classifying the extracted original features of the media data according to the label information in the linear classifiers;

and S3, measuring the classification features of the media data in the common space under the preset triplet loss constraint, classification loss constraint, and discriminant loss constraint.

The invention provides a fine-grained cross-media retrieval method based on deep supervision and feature fusion whose main purposes are: to eliminate the media gap while keeping the differences between samples of different semantic categories; to construct a common feature representation space for the four media types, minimize the discriminant loss in the label space and the common feature representation space, and learn the similarity of the media features; and to combine label features with semantic features for similarity measurement, with feature weights assigned by variance. This further improves cross-media retrieval performance and has good reference value and practicability. The method finds a common space in which the features of different media data can be measured directly: feature representations of the different media are obtained in the common space by minimizing the classification loss in the label space and the discriminant loss and triplet loss in the common space. In addition, a new measurement method is proposed that considers the label information and semantic information of a sample simultaneously. The effectiveness of the proposed method is verified by extensive experiments and comprehensive analysis on five benchmark datasets.

To better implement the present invention, the media-specific network in step S1 further comprises an image convolutional network, a video convolutional network, an audio convolutional network, and a text convolutional network.

In this technical scheme, the distance between media features is reduced while the heterogeneous features of different media types are learned.

In order to better implement the present invention, step S1 further includes:

extracting fine-grained features of the image data and the video data with the pre-trained bilinear convolutional neural network and taking them as the original semantic features;

and feeding the original semantic features of the image data and the video data into a fully-connected layer to generate the original image and video features.

In this technical scheme, considering how difficult fine-grained features are to extract from image and video data, a pre-trained BCNN (bilinear convolutional neural network) is selected, and the feature vectors it produces are used as the original semantic features of the image and the video.

In order to better implement the present invention, step S1 further includes:

extracting fine-grained features of the audio data with the pre-trained VGG16 convolutional neural network and taking them as the original semantic features;

and feeding the original semantic features of the audio data into the fully-connected layer to generate the original audio features.

In this technical scheme, a pre-trained VGG16 convolutional neural network, which has 16 layers, is selected for the audio data, and the feature vectors generated by its last fully-connected layer are used as the original semantic features of the audio.

In order to better implement the present invention, step S1 further includes:

extracting fine-grained features of the text data with the sentence CNN and taking them as the original semantic features;

and feeding the original semantic features of the text data into the last fully-connected layer to generate the original text features.

In this technical scheme, to enable feature learning of text features alongside the other media features in the common space, a sentence CNN (a convolutional neural network for sentence classification) is selected as the text feature extraction network, with the same convolutional structure as the original sentence CNN.

To better implement the present invention, presetting the triplet loss constraint in step S3 further comprises:

judging whether two media data samples belong to the same class; if so, constraining them with an equidistant constraint and a boundary constraint, and if not, introducing an equal-distribution constraint to constrain them.

In this technical scheme, to eliminate the media gap, an equidistant, equal-distribution triplet loss function is introduced to reduce intra-class differences and increase inter-class differences.

To better realize the invention, further: the equidistant constraint reduces the intra-class variance of same-class media data samples; the boundary constraint keeps the sample features within a class distinguishable; and same-class media data are constrained in the common space by the equidistant and boundary constraints together;

different-class media data samples are constrained in the common space by the equal-distribution constraint, using the mini-batch gradient descent algorithm.

In the technical scheme, the class areas are constrained by maximizing the minimum distance between the classes, so that the compactness of the intra-class distribution is ensured while the inter-class distribution is maintained.

To better implement the present invention, presetting the classification loss constraint in step S3 further comprises:

using linear classifiers to predict the label information of the media data sample features in the common space, wherein the linear layers corresponding to the linear classifiers are connected to the tops of the four sub-networks, and the four sub-networks are classified according to a cross-entropy loss function.

In the technical scheme, the classification performance of the network is optimized by using a cross entropy loss function.

To better implement the present invention, presetting the discriminant loss constraint in step S3 further comprises:

selecting two media data sample feature vectors, setting their pair label to 1 when the samples belong to the same class and to 0 when they belong to different classes, and expressing the similarity of the feature vectors with a cosine function and a discriminant loss function to obtain a negative log-likelihood value and a negative log-likelihood function;

and obtaining an activation function from the negative log-likelihood function and computing the discriminant loss of the media data samples from the activation function.

In this technical scheme, the discriminant loss mainly considers the loss of the media sample features in the common space: to measure it over all media sample features, the discriminant loss is introduced so that the probability that two sample feature vectors of the same category are recognized as similar is as large as possible.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) the invention learns the correlation of the four media types using the label information and semantic information of the data, optimizes the network with three loss functions, and eliminates the media gap while keeping the differences between samples of different semantic categories;

(2) the invention minimizes the discriminant loss and learns the similarity of the media features in the label space and the common feature representation space;

(3) the invention provides a similarity measurement that combines the label features and semantic features of a sample, further improving cross-media retrieval performance.

Drawings

The invention is further described below in connection with the following figures and embodiments, all of which fall within the scope of protection of the invention.

Fig. 1 is a flowchart of a fine-grained cross-media retrieval method with deep supervision and feature fusion provided by the present invention.

FIG. 2 is a diagram of the bimodal fine-grained cross-media retrieval results of various methods on the PKU FG-XMedia dataset.

FIG. 3 is a diagram of the multimodal fine-grained cross-media retrieval results of various methods on the PKU FG-XMedia dataset.

Fig. 4 is a comparison graph of the retrieval results of all algorithms on the Wikipedia dataset provided by the present invention.

FIG. 5 is a comparison of the retrieval results of all algorithms on the Pascal Sentences dataset.

FIG. 6 is a comparison of the retrieval results of all algorithms on the NUS-WIDE dataset.

FIG. 7 is a comparison of the search results of all algorithms on the XMedia dataset according to the present invention.

Fig. 8 is a schematic structural diagram of a fine-grained cross-media retrieval method with deep supervision and feature fusion provided by the present invention.

Detailed Description

Example 1:

In the fine-grained cross-media retrieval method with deep supervision and feature fusion of this embodiment, as shown in Figs. 1 and 8, the network structure of the invention mainly comprises four sub-networks. To ensure that the four sub-networks can learn feature representations of images, videos, texts, and audio that are as similar as possible in the common space, the method shares the weights of the last fully-connected layer across the four sub-networks. Finally, the extracted features are further used to learn the correlation among the media with semantic information and label information; meanwhile, each of the four sub-networks is connected to a linear classifier, and the extracted features are classified using the label information.
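A minimal PyTorch sketch of this weight-sharing design (an illustration under assumptions, not the patent's code; the module names and input dimensions are hypothetical, and the 2048/1024 hidden sizes follow the experimental settings reported later):

```python
import torch
import torch.nn as nn

class CommonSpaceHeads(nn.Module):
    """Four media-specific heads whose last fully-connected layer is a single
    shared module, embedding all media into one common space (a sketch)."""
    def __init__(self, in_dims, hidden=2048, common=1024, num_classes=200):
        super().__init__()
        # One private first FC layer per medium (image, video, text, audio).
        self.private = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
            for m, d in in_dims.items()
        })
        # The last FC layer is shared by all four sub-networks: this is the
        # weight sharing described in step S1.
        self.shared = nn.Linear(hidden, common)
        # One linear classifier per medium for the label-information branch.
        self.classifiers = nn.ModuleDict({
            m: nn.Linear(common, num_classes) for m in in_dims
        })

    def forward(self, medium, x):
        z = self.shared(self.private[medium](x))  # common-space feature
        return z, self.classifiers[medium](z)     # feature + label prediction

# Hypothetical raw-feature dimensions from the four media-specific backbones.
heads = CommonSpaceHeads({"image": 4096, "video": 4096,
                          "text": 4096, "audio": 4096})
```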

Example 2:

the embodiment is further optimized on the basis of embodiment 1, and the embodiment reduces the distance of the media features while learning the heterogeneous features of different media types.

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 3:

This embodiment is further optimized on the basis of embodiment 1. Images, videos, texts, and audio are input into their respective media-specific networks to extract the original features. For image and video data, considering how difficult fine-grained features are to extract, a pre-trained BCNN (bilinear convolutional neural network) is selected; the feature vectors it produces are used as the original semantic features of the image and the video, and features are then generated through the fully-connected layer as the representations of the image and the video in the common space.
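A compact sketch of bilinear (BCNN) feature extraction as commonly implemented (an illustration of the cited technique, not the patent's code; the VGG16 backbone choice here is an assumption):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Truncated CNN backbone producing a convolutional feature map.
backbone = models.vgg16(weights="IMAGENET1K_V1").features

def bilinear_features(images):
    """Bilinear pooling: outer product of conv features averaged over all
    spatial locations, then signed square-root and L2 normalization."""
    fmap = backbone(images)                         # (B, C, H, W)
    b, c, h, w = fmap.shape
    fmap = fmap.reshape(b, c, h * w)
    bilinear = torch.bmm(fmap, fmap.transpose(1, 2)) / (h * w)  # (B, C, C)
    bilinear = bilinear.reshape(b, c * c)
    bilinear = torch.sign(bilinear) * torch.sqrt(bilinear.abs() + 1e-10)
    return F.normalize(bilinear)                    # original semantic feature
```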

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 4:

In this embodiment, a pre-trained VGG16 convolutional neural network, which has 16 layers, is selected for the audio data; the feature vectors generated by its last fully-connected layer are used as the original semantic features of the audio, and the features generated by several subsequent fully-connected layers then serve as the representation of the audio data in the common space.
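A sketch of the audio branch under common assumptions (audio rendered as a spectrogram image and fed through VGG16; treating the penultimate 4096-d layer as the "last fully-connected layer" output is an interpretation):

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights="IMAGENET1K_V1")
# Drop VGG16's final 1000-way classification layer so the network outputs the
# 4096-d activations of the last remaining fully-connected layer.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

def audio_semantic_features(spectrograms):
    """spectrograms: (B, 3, 224, 224) spectrograms rendered as 3-channel
    images (the rendering choice is an assumption, not stated in the patent)."""
    with torch.no_grad():
        return vgg(spectrograms)  # (B, 4096) original semantic features
```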

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 5:

In this embodiment, to enable feature learning of text features alongside the other media features in the common space, a sentence CNN (a convolutional neural network for sentence classification) is selected as the text feature extraction network, with the same convolutional structure as the original sentence CNN. The preprocessed K-dimensional text features are input into the text feature extraction network, and the output of the last fully-connected layer is selected as the original semantic features of the text; the features generated by several subsequent fully-connected layers are then used as the representation of the text data in the common space.
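A sketch of a Kim-style sentence CNN consistent with the description (hyperparameters here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    """Parallel 1-D convolutions over word embeddings with several kernel
    widths, max-pooled over time and concatenated (a sketch)."""
    def __init__(self, vocab_size, embed_dim=300, n_filters=100,
                 kernel_sizes=(3, 4, 5), out_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), out_dim)

    def forward(self, token_ids):                  # (B, L) word indices
        e = self.embed(token_ids).transpose(1, 2)  # (B, E, L)
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # original semantic feature
```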

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 6:

In this embodiment, to eliminate the media gap, an equidistant, equal-distribution triplet loss function is introduced to reduce intra-class differences and increase inter-class differences. Existing triplet methods generally constrain the distances of same-class and different-class sample pairs directly, without considering the distance differences inside the different-class pairs, so they are effective locally but not necessarily suitable globally. The equidistant, equal-distribution loss function can adjust the distances between samples adaptively. To further shorten the distances between same-class samples and enlarge the distances between different-class samples, relative distance constraints are applied to same-class and different-class samples respectively.

For same-class samples, the equidistant constraint reduces the intra-class variance so that similar samples are pulled closer in the feature space. In the loss formula, f denotes a sample's media feature, d the Euclidean distance, a and p index same-class samples, n a different-class sample, and N_b is the batch size.
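The original equation image is not reproduced in the source text. One plausible form of the equidistant constraint, consistent with the variable definitions above (an assumption, not the patent's verbatim formula), minimizes the spread of the intra-class distances so that same-class samples are pulled uniformly close:

$$\mathcal{L}_{eq}=\frac{1}{N_b}\sum_{(a,p)}\bigl(d(f_a,f_p)-\bar{d}\bigr)^2,\qquad \bar{d}=\frac{1}{N_b}\sum_{(a,p)}d(f_a,f_p),$$

where the sums run over the same-class pairs (a, p) in the mini-batch.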

Boundary constraints are introduced to keep sample features distinguishable, preventing the equidistant constraint alone from gathering all samples at one point and destroying the similarity structure of the samples. In the boundary-constraint formula, δ is a predefined distance threshold.
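The boundary-constraint equation image is likewise missing; a plausible margin form matching the description (an assumption) keeps same-class samples at least a threshold apart so that they do not collapse to a single point:

$$\mathcal{L}_{mar}=\frac{1}{N_b}\sum_{(a,p)}\max\bigl(0,\;\delta-d(f_a,f_p)\bigr).$$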

Since the boundary constraint does not constrain the position of each sample, some samples may be distributed unevenly in the feature space, reducing classification accuracy. Therefore, for different-class samples, the equal-distribution constraint pushes them into regions of the feature space farther from the class, so that the features of different classes are uniformly distributed. In the corresponding loss function, C_b denotes the number of classes in a mini-batch (the mini-batch gradient descent algorithm is used), and D(C_i, C_j) denotes the distance between classes i and j.

The class regions are constrained by maximizing the minimum distance between classes, maintaining the inter-class distribution while keeping the intra-class distribution compact. If the classes are evenly distributed, the minimum distance from one class to the samples of every other class is almost the same. The loss function therefore introduces an extra different-class sample n' so that the distance between samples a and n and the distance between samples n and n' are as equal as possible.
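The two equal-distribution equations are also missing from the source. Plausible forms consistent with the surrounding text (assumptions, not the patent's verbatim formulas) push the minimum inter-class distance up to a margin m and equalize the anchor-negative and negative-negative distances:

$$\mathcal{L}_{push}=\frac{1}{C_b(C_b-1)}\sum_{i\neq j}\max\bigl(0,\;m-D(C_i,C_j)\bigr),\qquad\mathcal{L}_{ed}=\frac{1}{N_b}\sum_{(a,n,n')}\bigl|\,d(f_a,f_n)-d(f_n,f_{n'})\,\bigr|.$$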

Combining the equidistant, boundary, and equal-distribution loss functions yields a loss function that simultaneously pulls same-class samples together and pushes different-class samples apart.
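Under the same assumptions, the combined triplet loss is the (possibly weighted) sum of the above terms:

$$\mathcal{L}_{tri}=\mathcal{L}_{eq}+\mathcal{L}_{mar}+\mathcal{L}_{push}+\mathcal{L}_{ed}.$$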

Similarity measurement with label features and semantic features: conventional methods generally measure cosine similarity on the label features and rank the retrieval results accordingly. Using the label features directly, however, lowers cross-media retrieval accuracy when the input samples are ambiguous or the network's classification accuracy on them is low. Therefore, the invention combines the label features with the semantic features and uses the variance to assign corresponding weights to the different features, completing the similarity measurement. In the measurement formula, A1 and B1 are the label features of the two samples, A2 and B2 are their semantic features, and β is the weight of the label features, computed from the variance, where U is the variance of the label features across samples and the variance threshold is set to 0.65.
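The measurement formula image is missing; a form consistent with the variable definitions above (an assumption) is a variance-weighted combination of the label-feature and semantic-feature cosine similarities:

$$\mathrm{sim}(A,B)=\beta\cos(A_1,B_1)+(1-\beta)\cos(A_2,B_2),$$

where β is derived from the variance U of the label features against the 0.65 threshold (the exact mapping from U to β is not recoverable from the text).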

The algorithm flow of the method is as follows (a code sketch follows the list):

input data include the training sets and labels, the learning rate, the batch size, and the maximum number of epochs;

randomly initializing network parameters;

randomly taking out samples from the training set aiming at each iteration to construct a mini-batch;

calculating four media characteristics by forward propagation;

calculating a loss function;

updating the optimization parameters of the four sub-networks by minimizing the loss;

finally, outputting the optimized parameters of the four sub-networks.
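A minimal PyTorch sketch of this flow (`sample_minibatch` and `total_loss` are hypothetical helpers standing in for the patent's sampler and combined loss; hyperparameter values are placeholders):

```python
import torch

def train(networks, train_set, lr=0.001, batch_size=64, max_epochs=200):
    """Sketch of the algorithm flow: mini-batch sampling, forward pass for
    the four media, loss computation, and parameter update."""
    params = [p for net in networks.values() for p in net.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)  # parameters start from
                                                 # PyTorch's random init
    for epoch in range(max_epochs):
        for batch in sample_minibatch(train_set, batch_size):  # hypothetical
            feats = {m: networks[m](batch[m])
                     for m in ("image", "video", "text", "audio")}
            loss = total_loss(feats, batch["labels"])  # triplet +
            optimizer.zero_grad()                      # classification +
            loss.backward()                            # discriminant losses
            optimizer.step()
    return {m: net.state_dict() for m, net in networks.items()}
```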

other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 7:

This embodiment is further optimized on the basis of embodiment 6: the classification performance of the network is optimized with a cross-entropy loss function.

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 8:

This embodiment is further optimized on the basis of embodiment 1. The classification loss mainly considers sample classification in the label space. It assumes that the common space is well suited for classification, and a linear classifier is used to predict the label information of the sample features in the common space. A linear layer is connected on top of each of the four media sub-networks, and the classification performance of the network is optimized with a cross-entropy loss function.

other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 9:

In this embodiment, the discriminant loss mainly considers the loss of the media sample features in the common space. To measure it over all media sample features in the common space, the discriminant loss is introduced so that the probability that two sample feature vectors belong to the same category is as large as possible. In the loss formula, u and v denote two different media types, i and j index sample instances of those media, u_i and v_j are the features of the two media types, θ_ij is the similarity of the two features computed with a cosine function, and S_ij is an indicator whose value is 1 when the samples belong to the same class and 0 when they belong to different classes. The first term in the equation is the negative log-likelihood of inter-modal sample similarity, whose likelihood function is defined below.

Here σ is the sigmoid activation function.
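The discriminant-loss equation images are missing from the source. A standard formulation consistent with the description (an assumption; this follows the common negative log-likelihood form used in cross-modal retrieval, e.g. DCMH) is:

$$p(S_{ij}\mid u_i,v_j)=\begin{cases}\sigma(\theta_{ij}), & S_{ij}=1,\\ 1-\sigma(\theta_{ij}), & S_{ij}=0,\end{cases}\qquad\sigma(x)=\frac{1}{1+e^{-x}},$$

so that the inter-modal term becomes

$$\mathcal{L}_{dis}=-\sum_{i,j}\Bigl(S_{ij}\,\theta_{ij}-\log\bigl(1+e^{\theta_{ij}}\bigr)\Bigr).$$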

it is easy to find that minimizing this negative log-likelihood function is equivalent to maximizing the likelihood function. The greater the similarity of two vectors, the greater the probability value that they belong to the same class, which also means that they can be classified into similar categories. Likewise, the second term is used to measure similarity of sample features within the media. Based on the function, the similarity of the media characteristics can be better learned.

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

In addition, based on the above embodiments, to verify the effectiveness of the method, two cross-media retrieval tasks are performed on datasets at both the fine-grained and coarse-grained levels, namely bimodal fine-grained cross-media retrieval and multimodal fine-grained cross-media retrieval, and the retrieval performance is compared with nine existing classical methods. To ensure fairness, the same preprocessing is applied to the media data and cosine similarity is selected as the measurement method; the experimental results are analyzed in Figs. 2-7.

The invention comprises four sub-networks (image, video, text, audio), and two fully-connected layers with ReLU activations are attached behind each sub-network, with 2048 and 1024 hidden units respectively. The correlation among the four media types is learned through weight sharing of the second fully-connected layer across the four sub-networks. In the training phase, a two-step strategy is employed with an Adam optimizer: first, the network is trained by minimizing the cross-entropy loss function and the discriminant loss function, with the learning rate set to 0.001 and the maximum number of epochs set to 200; second, the network is fine-tuned under the three constraints, with the learning rate set to 0.0001 and the maximum number of epochs set to 100.

The datasets adopted by the invention are the fine-grained PKU FG-XMedia dataset and the coarse-grained Wikipedia, Pascal Sentences, NUS-WIDE, and XMedia datasets, demonstrating that the proposed method applies to both the fine-grained and coarse-grained settings. PKU FG-XMedia is currently the only dataset oriented to fine-grained cross-media retrieval, covering four media types: images, video, text, and audio. The Wikipedia, Pascal Sentences, and NUS-WIDE datasets are widely used for coarse-grained cross-media retrieval and comprise two media types, images and texts; the XMedia dataset contains five media types (image, text, video, audio, 3D model) with more than 10,000 media instances.

The evaluation index adopted by the invention is the mean average precision (mAP); cross-media retrieval results are evaluated on the two retrieval tasks, multimodal fine-grained cross-media retrieval and bimodal fine-grained cross-media retrieval.

To verify the effectiveness of the method, the invention compares it with nine strong cross-media retrieval algorithms: FGCrossNet, MHTN, ACMR, JRL, GSPH, CMDN, CCA, Multimodal DBN, and Corr-AE.

As shown in FIG. 2, which presents the bimodal fine-grained cross-media retrieval results of this method and the existing methods on the PKU FG-XMedia dataset, this method achieves the best retrieval effect in all bimodal tasks, improving the mAP value by about 8% on average. FIG. 3 shows the multimodal fine-grained cross-media retrieval results on the PKU FG-XMedia dataset: this method obtains the highest mAP value, improving the average retrieval precision from 0.412 to 0.493, beyond the existing methods; the multimodal tasks include I→All, V→All, T→All, and A→All.

Based on the above results, it can be seen that: (1) the method is superior to FGCrossNet, the best existing method. FGCrossNet, a cross-media network designed for the fine-grained domain, also achieves good results, which shows the advantage of the four media sharing a common space. (2) On the premise of adopting a shared common space, the method also considers the particular characteristics of the different media, further improving performance and raising the mAP value by about 8% on average. (3) The method is superior to Corr-AE, CCA, Multimodal DBN, and other methods that model correlation loss on paired samples, which shows the advantage of considering same-class and different-class samples simultaneously when learning a common subspace. (4) The method is superior to MHTN, JRL, ACMR, and other methods that model intra-media loss with class label information, mainly because it uses label information and semantic information simultaneously and fully considers the intra-class, inter-class, and intra-media variances.

To verify the effectiveness of the method on coarse-grained cross-media retrieval, comparison experiments are carried out on the coarse-grained datasets. Fig. 4 shows the cross-media retrieval results of this method and the existing methods on the Wikipedia dataset, Fig. 5 on the Pascal Sentences dataset, Fig. 6 on the NUS-WIDE dataset, and Fig. 7 on the XMedia dataset.

The experimental results prove that the method is also suitable for coarse-grained cross-media retrieval. Compared with the prior art, the invention considers not only the label information of the samples but also the semantic information of the media, thereby further improving cross-media retrieval performance.

the above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
