Method for constructing entity-level public feature space based on fine-grained cross-media retrieval

Document No.: 1952842  Publication date: 2021-12-10

Abstract: The invention relates to the technical field of fine-grained cross-media retrieval and discloses a method for constructing an entity-level common feature space based on fine-grained cross-media retrieval, comprising: acquiring video data, presetting a noise-frame filtering rule, and filtering the acquired video data according to that rule to remove frame images that do not contain a fine-grained entity; acquiring low-level features for each media data type and linearly projecting them to obtain an entity-level common feature space; and extracting and learning high-level semantics from the entity-level common feature space. The invention can learn the projection matrix from low-level features to the entity-level common feature space and generate common features of cross-media data for cross-media retrieval.

1. A method for constructing an entity-level common feature space based on fine-grained cross-media retrieval, characterized by comprising the following steps: S1, acquiring media data and judging whether the input media type is video data; if so, entering step S2; otherwise, acquiring media data of the audio, image, or text types, and then entering step S3; S2, presetting a noise-frame filtering rule and filtering the acquired video data according to that rule to remove frame images that do not contain a fine-grained entity; S3, acquiring low-level features of the media data types and performing linear projection on the low-level features to obtain an entity-level common feature space;

S4, extracting and learning high-level semantics from the entity-level common feature space;

correlation constraints on same-class samples and entity-construction constraints on single media are preset, and steps S1-S4 obey these constraints.

2. The method for constructing an entity-level common feature space based on fine-grained cross-media retrieval according to claim 1, wherein step S2 comprises: S2.1, intercepting video frames from each video at equal time intervals to obtain the original key frames; S2.2, extracting features of the video frames with a feature extraction network; S2.3, determining a center frame by computing the distances between video frames;

S2.4, presetting a threshold; after the center frame and the threshold are determined, judging whether the distance from each frame to the center frame exceeds the threshold; if so, discarding the current frame, otherwise retaining it as a valid frame;

and S2.5, when the center frame no longer changes, obtaining the valid frame images, from which the frames not containing the fine-grained entity have been removed.

3. The method for constructing an entity-level common feature space based on fine-grained cross-media retrieval according to claim 1, wherein the low-level features in step S3 are obtained by: extracting features of the filtered video data, audio data, and image data with a convolutional neural network; extracting entities in the text data with an entity extractor, obtaining text entity features, and generating the corresponding text entity labels; and forming the low-level features from the features of the filtered video, audio, and image data together with the text entity features.

4. The method of claim 1, wherein generating the entity-level common feature space in step S3 comprises: generating the entity-level common feature space by linear projection, using entity labels, the correlation constraints on same-class samples, the entity-construction constraints on single media, and adversarial training between a media discriminator and a generator.

5. The method for constructing an entity-level common feature space based on fine-grained cross-media retrieval according to claim 1, wherein the step S4 comprises:

performing entity-level semantic abstraction with semantic category labels, computing the similarity between media data of different media types based on the unified semantic representation, and performing cross-media retrieval.

6. The method of claim 1, wherein the correlation constraints on same-class samples comprise: minimizing the distance between media data with positive correlation and maximizing the distance between media data with negative correlation.

7. The method of claim 1, wherein the entity-construction constraints on single media comprise: associating the text entity features extracted from the text with the features of the filtered video, audio, and image data.

Technical Field

The invention relates to the technical field of fine-grained cross-media retrieval, and in particular to a method for constructing an entity-level common feature space based on fine-grained cross-media retrieval, which can learn the projection matrix from low-level features to the entity-level common feature space and generate common features of cross-media data for cross-media retrieval.

Background

One of the difficulties in fine-grained cross-media retrieval is how to reduce the disparity between different media data. The most common approach is to find a mapping function that maps samples of different media types into a common feature space, and then rank the candidate objects in the database by their similarity to the features of the input data in that space to produce the retrieval result. However, such a common feature space has certain drawbacks: its features often have no clear meaning (i.e., what a feature represents is unknown), so the original semantic information contained in the media data may ultimately be ignored. Some algorithms propose a common semantic space that simply uses the relationship between low-level and high-level semantic features: the low-level features are generally extracted by a convolutional neural network, and the high-level features are semantics abstracted from them under the constraint of semantic category labels, so only coarse-grained features can be extracted.
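The mapping-and-ranking pipeline described above can be sketched as follows; the feature dimensions and projection matrices below are hypothetical stand-ins for the learned ones, used only to illustrate ranking by similarity in a common space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy low-level features: a text query (64-d) and image candidates (128-d).
query_text = rng.normal(size=64)
candidate_images = rng.normal(size=(5, 128))

# Hypothetical learned projections into a shared 32-d common space.
W_text = rng.normal(size=(64, 32))
W_image = rng.normal(size=(128, 32))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = query_text @ W_text             # query mapped into the common space
cands = candidate_images @ W_image  # candidates mapped into the common space

# Rank candidates by cosine similarity to the query, highest first.
scores = np.array([cosine(q, c) for c in cands])
ranking = np.argsort(-scores)
print(ranking)
```

In an actual system the projections would be learned so that paired cross-media samples land close together; here they merely demonstrate the retrieval step.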

Entities are naturally distinguishable from one another. In text, entities are object names, species names, place names, times, and so on; in an image, an entity is an individual object. Almost every kind of media data carries a notion of entities, and the corresponding entities exist within the corresponding high-level features. Moreover, because entities exist objectively, media data with similar semantics contain the same entities, which allows an entity to be extended into other media domains. In view of this distinguishability, relevance, and extensibility, the invention solves the fine-grained cross-media retrieval task with a method based on an entity-level common feature space, with the entity serving as the bridge between low-level features and high-level semantic features.

In order to solve the above problems, a new fine-grained cross-media retrieval method is needed, which can learn a projection matrix from low-level features to an entity-level common feature space and generate a common feature representation form of cross-media data to perform cross-media retrieval.

Disclosure of Invention

The invention aims to provide a method for constructing an entity-level public feature space based on fine-grained cross-media retrieval, which can learn a projection matrix from low-level features to the entity-level public feature space and generate a common feature representation form of cross-media data to perform cross-media retrieval.

The invention is realized by the following technical scheme: a method for constructing an entity-level common feature space based on fine-grained cross-media retrieval comprises the following steps: S1, acquiring media data and judging whether the input media type is video data; if so, entering step S2; otherwise, acquiring media data of the audio, image, or text types, and then entering step S3;

S2, presetting a noise-frame filtering rule and filtering the acquired video data according to that rule to remove frame images that do not contain a fine-grained entity;

s3, acquiring low-level features of the media data types, and performing linear projection on the low-level features to acquire an entity-level public feature space;

and S4, extracting and learning high-level semantics from the entity-level common feature space.

Correlation constraints on same-class samples and entity-construction constraints on single media are preset, and steps S1-S4 obey these constraints.

In this technical scheme, a new fine-grained cross-media retrieval method is proposed on the basis of generative adversarial networks, namely a fine-grained cross-media retrieval algorithm based on an entity-level common feature space (CFFCE). The algorithm learns common feature representations at the entity level, so that these high-level features are entities with well-defined meanings. The algorithm mainly comprises two parts. First, an entity extractor is applied to the text data to automatically extract fine-grained entities with explicit semantics; entity-level features with fine-grained semantics can thus be established between the low-level and high-level features, effectively reducing the semantic ambiguity of mid- and high-level features and aiding the understanding of their concepts. Then, by minimizing (maximizing) the distance between media features with positive (negative) correlation at the entity level, a projection matrix from low-level features to the entity-level common feature space is learned, and a common feature representation of the cross-media data is generated for cross-media retrieval.

In order to better implement the present invention, step S2 further includes:

s2.1, intercepting video frames in each video data at the same time interval to obtain original key frames;

S2.2, extracting features of the video frames with a feature extraction network;

S2.3, determining a center frame by computing the distances between video frames;

S2.4, presetting a threshold; after the center frame and the threshold are determined, judging whether the distance from each frame to the center frame exceeds the threshold; if so, discarding the current frame, otherwise retaining it as a valid frame;

and S2.5, when the center frame no longer changes, obtaining the valid frame images, from which the frames not containing the fine-grained entity have been removed.

In this technical scheme, video frames are intercepted from each video at equal intervals, so they inevitably contain frames irrelevant to the target object, such as opening and closing credits. Such noise frames seriously hinder the convergence of the network and the accuracy of retrieval. The noise-frame filtering rule therefore extracts the valid video frames that contain the target: with no noise frames as input, the features of the target can be learned better, the retrieval accuracy is effectively improved, the convergence of the network is accelerated, and the video data are preprocessed into cleaner video data.

To better implement the present invention, further, the low-level features in step S3 include:

extracting features of the filtered video data, audio data, and image data with a convolutional neural network;

extracting entities in the text data with an entity extractor, obtaining text entity features, and generating the corresponding text entity labels;

and forming the low-level features from the features of the filtered video, audio, and image data together with the text entity features.

In this technical scheme, the common feature space usually learned for cross-media retrieval tasks has no specific meaning; that is, what the features in the common space represent is unknown, which may discard the original semantic information in the media data. The technical scheme therefore learns an entity-level common feature space with a fine-grained cross-media retrieval algorithm based on that space.

To better implement the present invention, further, generating the entity-level common feature space in step S3 comprises: generating the entity-level common feature space by linear projection, using entity labels, the correlation constraints on same-class samples, the entity-construction constraints on single media, and adversarial training between a media discriminator and a generator.

In the present solution, the entity-level common feature space aims at learning a common representation containing entity semantics. This part takes the low-level features of images, videos, audio, and text as input, together with the semantic category labels of the paired data, and learns the mapping matrix that maps the low-level features of the four media into the entity-level common feature space.

In order to better implement the present invention, step S4 further includes:

performing entity-level semantic abstraction with semantic category labels, computing the similarity between media data of different media types based on the unified semantic representation, and performing cross-media retrieval.

In this technical scheme, the correlation between same-class data across media is exploited: the distance between samples with positive correlation is minimized, and the distance between samples with negative correlation is maximized.

To better implement the present invention, further, the correlation constraint conditions of the samples of the same type include:

the distance between media data with positive correlation is minimized, and the distance between media data with negative correlation is maximized.

In this technical scheme, to reduce the heterogeneity gap between different media and to exploit the correlation between same-class data across media, the distance between samples with positive correlation is minimized and the distance between samples with negative correlation is maximized.

In order to better implement the present invention, further, the entity construction constraint conditions of the single media include:

associating the text entity features extracted from the text with the features of the filtered video, audio, and image data.

In this technical scheme, although the entities extracted from the text differ in form from the entities of the other three media types, this constraint associates the text entities with the other three media.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) the invention provides a fine-grained cross-media retrieval algorithm based on an entity-level common feature space; a noise-frame filtering algorithm based on feature-space clustering is designed for the characteristic that video data contain noise frames, so that after preprocessing the video data fed into the network contain no noise, providing a precondition for the subsequent learning of the entity-level common feature space;

(2) the method extracts entities from the text, projects them, generates the entity-level common feature space through a media discriminator, positive-and-negative-sample constraints, and entity reconstruction constraints, and finally extracts entity-level semantic features with the aid of semantic labels, thereby realizing fine-grained cross-media retrieval;

(3) across various tasks, the fine-grained cross-media retrieval algorithm based on the entity-level common feature space achieves mAP scores from 1% to 15% higher than the second-ranked FGCrossNet algorithm, and its robustness is high;

(4) the invention provides a noise frame filtering algorithm based on feature space clustering, which is used for preprocessing video data to obtain purer video data.

Drawings

The invention is further described below with reference to the following figures and examples; the examples are illustrative and do not limit the scope of the invention.

Fig. 1 is a flowchart of a method for constructing an entity-level common feature space based on fine-grained cross-media retrieval according to the present invention.

FIG. 2 is a diagram showing a comparison of dual media retrieval on a PKU FG-XMedia dataset in a method for constructing an entity-level common feature space based on fine-grained cross-media retrieval according to the present invention.

FIG. 3 is a comparison diagram of multimedia retrieval on a PKU FG-XMedia experimental data set in a method for constructing an entity-level public feature space based on fine-grained cross-media retrieval according to the present invention.

Fig. 4 is a comparison diagram of dual media retrieval on a PKU Xmedia dataset in the method for constructing an entity-level common feature space based on fine-grained cross-media retrieval provided by the present invention.

Fig. 5 is a comparison diagram of multimedia retrieval on a PKU Xmedia experimental data set in the method for constructing an entity-level common feature space based on fine-grained cross-media retrieval provided by the present invention.

Fig. 6 is a comparison diagram of dual media retrieval on the Wikipedia dataset in the method for constructing the entity-level common feature space based on fine-grained cross-media retrieval provided by the present invention.

FIG. 7 is a comparison diagram of dual media retrieval on NUS-WIDE datasets in a method for constructing an entity-level common feature space based on fine-grained cross-media retrieval provided by the present invention.

Fig. 8 is an influence diagram of each module in the multimedia retrieval mode in the method for constructing the entity-level common feature space based on fine-grained cross-media retrieval according to the present invention.

Fig. 9 is an influence diagram of each module in the multimedia retrieval mode in the method for constructing the entity-level common feature space based on fine-grained cross-media retrieval according to the present invention.

Detailed Description

Example 1:

In the method for constructing an entity-level common feature space based on fine-grained cross-media retrieval, as shown in fig. 1, video data are formed by extracting frames from videos at equal intervals, so the extracted frames contain noise frames that do not show the target object. Before entity-level common feature learning, frame images not containing a fine-grained entity are therefore filtered out with the noise-frame filtering algorithm based on feature-space clustering provided by the invention, so that the entity-level common feature space can be learned better. An entity extractor extracts the entities in the text data and generates the corresponding entity labels; because the paired media data share the same semantics as the text, these entities are shared by all four kinds of media data. The entity-level common feature space can then be obtained from the entity labels by minimizing the cross-media association constraints and the single-media entity reconstruction constraints and by adversarial training of the media discriminator and generator. This reduces the ambiguity of mid- and high-level features to a certain degree and helps the understanding of their semantic concepts, and finally the high-level entity-level semantics are extracted and learned. The invention also uses the two constraints to assist the adversarial mechanism of the generative adversarial network in better learning the mapping from low-level features to the entity-level common feature space.

The present embodiment provides a new fine-grained cross-media retrieval method based on generative adversarial networks, namely the fine-grained cross-media retrieval algorithm based on an entity-level common feature space (CFFCE). The algorithm learns common feature representations at the entity level, so that these high-level features are entities with well-defined meanings. The algorithm mainly comprises two parts. First, an entity extractor is applied to the text data to automatically extract fine-grained entities with explicit semantics; entity-level features with fine-grained semantics can thus be established between the low-level and high-level features, effectively reducing the semantic ambiguity of mid- and high-level features and aiding the understanding of their concepts. Then, by minimizing (maximizing) the distance between media features with positive (negative) correlation at the entity level, a projection matrix from low-level features to the entity-level common feature space is learned, and a common feature representation of the cross-media data is generated for cross-media retrieval.

Explanation of terms:

CFFCE: the fine-grained cross-media retrieval algorithm based on an entity-level common feature space.

FGCN: feedback graph convolutional network.

MHTN: modal-adversarial hybrid transfer network.

ACMR: adversarial cross-modal retrieval.

Example 2:

This embodiment is further optimized on the basis of embodiment 1. The media data involved in fine-grained cross-media retrieval include video data; to extract video data, a fixed number of frames are intercepted at equal intervals as input, which inevitably includes frames unrelated to the target object, such as opening and closing credits, so the originally extracted key frames include one or more images unrelated to the target. This raises potential problems: when a noise frame in a video becomes an outlier, it distorts the feature distribution of the input data, so the network parameters adapt to that distribution and retrieval accuracy suffers; the offset introduced by noise frames also makes fast convergence of the adversarial network difficult. A noise-frame filtering rule is therefore preset to handle the noise frames in the video data.

In this embodiment, $N$ frames are cut from each video at equal intervals to form the original key frames, and a ResNet50 network pre-trained on ImageNet is used as the feature extractor for the $N$ frame images of each video. The extracted features are denoted $F_i = \{f_i^1, f_i^2, \dots, f_i^N\}$, $i = 1, \dots, M$, where $M$ is the total number of videos in the video dataset and $f_i^n$ is the feature of the $n$-th original key frame intercepted from the $i$-th video.

The sum of the distances between the features of each video frame and those of all other frames is then calculated. Using the $\ell_2$ norm, the distance between two features is $d(f_i^m, f_i^n) = \lVert f_i^m - f_i^n \rVert_2$, and

$$D_i^n = \sum_{m \ne n} d(f_i^m, f_i^n),$$

where $D_i^n$ is the sum of the distances from all other video frames to frame $n$.

The sequence $\{D_i^1, \dots, D_i^N\}$ obtained by the above formula is sorted. Suppose $D_i^c$ is the smallest; frame $c$ is then fixed as the center frame, and the threshold $T$ is defined as the average distance from the remaining frames to the center frame:

$$T = \frac{1}{N-1}\sum_{m \ne c} d(f_i^m, f_i^c) = \frac{D_i^c}{N-1}.$$

After the center frame and the threshold $T$ are determined, the distance from each frame to the center frame is compared with $T$: if $d(f_i^n, f_i^c) > T$, the current frame is discarded; otherwise it is retained as a valid frame. This operation is repeated until the center frame no longer changes (the termination condition), yielding the valid video frames that contain the target.

With the valid video frames containing the target obtained, and no noise frames as input, the features of the target can be learned better, so the retrieval accuracy is effectively improved and the convergence of the network is accelerated.
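The iterative filtering described above can be sketched as follows. This is a simplified sketch on synthetic features; the termination handling and the exact threshold update in the patent may differ:

```python
import numpy as np

def filter_noise_frames(frame_feats, max_iters=20):
    """Keep frames close to an iteratively chosen center frame.

    Sketch of the feature-space-clustering noise-frame filter: the center
    frame minimizes the sum of L2 distances to all other frames, the
    threshold is the mean distance to the center, and frames farther than
    the threshold are discarded until the center no longer changes.
    """
    feats = np.asarray(frame_feats, dtype=float)
    keep = np.arange(len(feats))
    prev_center = -1
    for _ in range(max_iters):
        d = np.linalg.norm(feats[keep][:, None] - feats[keep][None, :], axis=2)
        sums = d.sum(axis=1)                         # D^n: sum of distances
        c = int(np.argmin(sums))                     # center frame (local index)
        if keep[c] == prev_center:                   # center unchanged: stop
            break
        prev_center = keep[c]
        threshold = sums[c] / max(len(keep) - 1, 1)  # mean distance to center
        keep = keep[d[c] <= threshold]               # discard far-away frames
    return keep

# Six frames clustered near the origin plus two far-away "noise" frames.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 0.1, size=(6, 4)), np.full((2, 4), 5.0)])
valid = filter_noise_frames(feats)
print(valid)
```

On this toy input the two outlier frames fall well outside the mean-distance threshold and are dropped in the first pass.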

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 3:

This embodiment is further optimized on the basis of embodiment 1. The common feature space usually learned for cross-media retrieval has no specific meaning: what the features in the common space represent is unknown, which may discard the original semantic information in the media data. The fine-grained cross-media retrieval algorithm of this embodiment instead learns an entity-level common feature space, in which the features represent entities. First, the entity extractor of the Illinois Wikifier is adopted to extract the entities in the text data and generate the corresponding entity labels. In addition, convolutional neural networks are used for feature extraction from the video, audio, and image data. Since each group of data input to the network has undergone a pairing operation, the media data sharing the same semantic information share the entities and labels of the text; the entity features of the text are also taken as its low-level features. Owing to the sparsity of the entity-level label matrix, linear projection is used here to obtain the entity-level common feature space. To make the features in this space contain only the semantic information of the target object, ignoring media construction differences and data representation differences, the adversarial mechanism of the generator and the discriminator is used to achieve this goal.

A convolutional neural network is employed to extract the features of images, videos, and audio. The extracted low-level features of the three media are denoted $p_i$, $v_i$, and $a_i$, respectively; taking the image as an example, $p_i = \mathrm{CNN}(I_i)$, where $I_i$ is the $i$-th image.

the entity-level common feature space aims at learning common representations that contain entity semantics. This section takes as input low-level features of images, video, audio and text, when the paired data is represented asWhereinRepresenting semantic class labels, usingAs a mapping matrix that maps the low-level features of the four media to the entity-level common feature space.

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 4:

This embodiment is further optimized on the basis of embodiment 1. In the adversarial training, the network of the media discriminator and the generators takes the low-level features as input. The text data in the same group of data enter the entity extractor of the Illinois Wikifier, and the extracted entities are represented as $E = \{e_1, e_2, \dots, e_K\}$, where $K$ is the number of extracted entities and $e_k$ is the $k$-th extracted entity, whose corresponding label is $l_k$. The low-level text features are extracted with an LSTM: $t_i = \mathrm{LSTM}(s_i)$, where $s_i$ is the $i$-th text.

other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 5:

This embodiment is further optimized on the basis of embodiment 1. To ensure that the high-level entity-level semantic extraction is independent of the media type, a media discriminator is used here to discriminate the media type of the features in the space. The loss under the constraint of the media labels is defined as

$$L_{adv} = L_{ce}\big(D(h;\theta_D),\, m\big),$$

where $L_{ce}$ is the cross-entropy loss function, $D(\cdot;\theta_D)$ is a non-linear mapping function with parameters $\theta_D$ used to predict which media type each high-level entity-level semantic feature belongs to, and $m$ is the one-hot media-type label of the input data, representing image, video, audio, and text, respectively.

In the adversarial training, the loss of the media discriminator is maximized to obtain the entity-level common feature space while the loss of the generators is minimized. On this basis, balance parameters are assigned inside each loss function, and the loss function of the adversarial phase is defined as

$$L_{total}(\theta_f, \theta_p, \theta_D) = L_{gen}(\theta_f, \theta_p) - \lambda\, L_{adv}(\theta_D),$$

where $\lambda$, a positive value, is the balance parameter between the two loss terms in the training stage, $\theta_f$ denotes the parameters of the feature extractors that process images, videos, audio, and text, and $\theta_p$ denotes the parameters that project the low-level features of the four media into the common feature space. Since the optimization objectives of the two loss functions are opposite, the adversarial process of the two concurrent sub-processes can be defined as

$$(\hat{\theta}_f, \hat{\theta}_p) = \arg\min_{\theta_f,\, \theta_p} L_{total}(\theta_f, \theta_p, \hat{\theta}_D), \qquad \hat{\theta}_D = \arg\max_{\theta_D} L_{total}(\hat{\theta}_f, \hat{\theta}_p, \theta_D),$$

where the parameters $(\theta_f, \theta_p)$ minimize $L_{total}$ while $\theta_D$ maximizes it; this is the adversarial training process of the model. The object is to find $(\hat{\theta}_f, \hat{\theta}_p)$ and $\hat{\theta}_D$ that reach a saddle point of the adversarial-phase loss function. On this basis, the parameters are updated as follows:

$$\theta \leftarrow \theta - \mu \frac{\partial L_{total}}{\partial \theta},$$

where $\mu$ is the learning rate.
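The discriminator's cross-entropy objective and a single update of the form $\theta \leftarrow \theta - \mu\, \partial L/\partial\theta$ can be sketched numerically. The toy dimensions, learning rate, and the use of a numerical gradient are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def media_ce_loss(h, theta_D, media_id):
    """Cross-entropy of the media discriminator D(h; theta_D) against the
    media-type label (0=image, 1=video, 2=audio, 3=text)."""
    p = softmax(h @ theta_D)
    return -np.log(p[media_id] + 1e-12)

def num_grad(f, x, eps=1e-5):
    """Central-difference numerical gradient (illustration only)."""
    g = np.zeros_like(x)
    for i in np.ndindex(*x.shape):
        x[i] += eps; up = f(x)
        x[i] -= 2 * eps; down = f(x)
        x[i] += eps
        g[i] = (up - down) / (2 * eps)
    return g

common_dim, n_media = 8, 4
theta_D = rng.normal(scale=0.1, size=(common_dim, n_media))
h = rng.normal(size=common_dim)   # a common-space feature from the generator
media_id, mu = 2, 0.1             # true media type (audio) and learning rate

before = media_ce_loss(h, theta_D, media_id)
# One discriminator step: theta_D <- theta_D - mu * dL/dtheta_D.
theta_D = theta_D - mu * num_grad(lambda t: media_ce_loss(h, t, media_id), theta_D)
after = media_ce_loss(h, theta_D, media_id)
print(before, after)
```

In the full adversarial scheme the generator side would take the opposite gradient direction on the same loss, driving the common-space features toward media-indistinguishability.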

Example 6:

This embodiment is further optimized on the basis of embodiment 1. The correlation constraint on same-class samples reduces the heterogeneity gap between different media, i.e., it minimizes the distance between media data with positive correlation; to exploit the correlation between same-class data across media, the distance between samples with positive correlation is minimized and the distance between samples with negative correlation is maximized.

The correlation constraint $L_{cc}$ on same-class samples is used here to achieve this goal.

For convenience and efficiency, the loss is measured with the Euclidean distance, and the constraint can be expressed as

$$L_{cc} = \frac{1}{2}\sum_{x,y} S_{xy}\, \lVert h_x - h_y \rVert_2^2,$$

where $S_{xy}$ is the Jaccard coefficient at the entity level, representing the similarity between two media data; for example, $S_{pv}$ represents the correlation of image data and video data. This parameter combines the coarse-grained information of the semantic category labels with the fine-grained information of the entity-level space, and is defined by the following equation:

$$S_{xy} = J(L_x, L_y) = \frac{|L_x \cap L_y|}{|L_x \cup L_y|} \ \text{ if } c_x = c_y, \qquad S_{xy} = 0 \ \text{ otherwise},$$

where $L_x$ and $L_y$ are the entity-level tags of the image, video, audio, or text samples, and $c_x$ and $c_y$ are the corresponding semantic labels of media $x$ and media $y$.

The cross-media similarity is collected in a matrix $S \in \mathbb{R}^{U \times U}$:

where $U$ is the number of all samples in the dataset $U$, and the diagonal blocks of $S$ correspond to the similarity matrices within a single medium. Since the constraint mainly searches for the similarity of same-subclass data across different media, the intra-media entries are set to $0$. To balance the correlation effect of positive and negative samples, a regularization operation is performed on each row of $S$.

D is a diagonal matrix with entries D_ii = Σ_j S_ij. The graph Laplacian L_S is then defined as:

L_S = D − S

so that Lcc can be expressed in trace form as:

Lcc = tr(F · L_S · Fᵀ)

where the columns of F are the entity-level common features of the samples.
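The equivalence between the pairwise Euclidean form and the trace form can be checked numerically; the sketch below (with illustrative random data and symmetric S) verifies that tr(F L_S Fᵀ) equals the weighted pairwise sum:

```python
import numpy as np

# Graph-Laplacian form of the correlation constraint: given a symmetric
# similarity matrix S, D is the diagonal degree matrix and L_S = D - S,
# so the weighted pairwise Euclidean loss equals tr(F @ L_S @ F.T).
def correlation_loss(F, S):
    D = np.diag(S.sum(axis=1))
    L_S = D - S
    return np.trace(F @ L_S @ F.T)

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 6))              # 4-dim features for 6 samples (columns)
S = rng.random((6, 6))
S = (S + S.T) / 2                        # symmetrize the similarity matrix
pairwise = 0.5 * sum(S[i, j] * np.sum((F[:, i] - F[:, j]) ** 2)
                     for i in range(6) for j in range(6))
# correlation_loss(F, S) and pairwise agree up to floating-point error
```

This identity is what lets the constraint be minimized efficiently in matrix form rather than by summing over all sample pairs.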

other parts of this embodiment are the same as embodiment 1, and thus are not described again.

Example 7:

This embodiment further optimizes a constraint on the basis of embodiment 1. Although the entities extracted from the text differ in form from the entities of the other three media types, this constraint associates the text entities with the other three media data.

Since the entities are extracted from the text data, they are not directly associated with the other three media. Therefore, to associate subcategories with entities, an entity construction constraint Lec is employed, expressed as:

Lec = Σ_m ‖F_m − E_m‖_{2,1}

where F_m is the entity-level common representation of media m and E_m is the corresponding entity-label matrix.

Here, ‖·‖_{2,1} is the ℓ2,1 norm, which ensures that data of the same medium stay as close as possible to their entity labels. In addition, to avoid overfitting, a regularization term Lreg on the projection matrices is employed, as follows:

Lreg = Σ_m ‖P_m‖²_F

Finally, the three terms are combined into the overall objective L, as follows:

L = Lcc + λ1 · Lec + λ2 · Lreg

where λ1 and λ2 are trade-off weights.

After the entity-level high-level abstract semantics stage obtains the projection matrix P, for a given low-level feature f, its representation s in the entity-level common feature space can be obtained as follows:

s = Pᵀ · f

Finally, a multi-layer logistic regression is adopted to estimate the semantic label of each datum; to better extract the semantic features of the middle-layer entity-level representation, a multi-classification loss function is adopted to ensure that the goal is reached.
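The projection-then-classification step can be sketched as follows; the matrix shapes, random weights and the true-class index are illustrative assumptions, not values from the original method:

```python
import numpy as np

# Project a low-level feature into the common space via s = P.T @ f, then
# score semantic classes with a softmax (multi-class logistic regression).
# P, W and all dimensions here are hypothetical stand-ins.
rng = np.random.default_rng(1)
P = rng.normal(size=(2048, 200))   # low-level dim -> common-space dim
W = rng.normal(size=(200, 10))     # common-space dim -> 10 semantic classes

def softmax(z):
    z = z - z.max()                # numerical stability
    e = np.exp(z)
    return e / e.sum()

f = rng.normal(size=2048)          # a low-level feature vector
s = P.T @ f                        # entity-level common representation
probs = softmax(W.T @ s)           # class posteriors
loss = -np.log(probs[3])           # cross-entropy for an assumed true class 3
```

Minimizing this multi-class cross-entropy over the dataset is what drives the common representation toward the semantic category labels.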

Other parts of this embodiment are the same as embodiment 1, and thus are not described again.

In addition, the experimental environment of this embodiment may be Linux 4.15.0 with a GeForce RTX 2080Ti GPU and CUDA 10.2, and the implementation language may be Python 3.6. For each video, frames are sampled at equal intervals; because sampling 25 frames leaves too few frames after denoising to represent a video, 40 frames are first extracted from each video at equal intervals as video data, and the noise frames are then filtered out with the SCFR algorithm. In the result tables, the rows are the various cross-media retrieval algorithms, and the columns labelled I, V, A and T represent image, video, audio and text features, respectively.
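The equal-interval sampling of 40 frames per video can be sketched as follows (a simple index computation; the original implementation may differ in rounding details):

```python
# Pick n frame indices spread evenly over a video of total_frames frames.
def equal_interval_indices(total_frames, n):
    step = total_frames / n
    return [int(i * step) for i in range(n)]

idx = equal_interval_indices(total_frames=1200, n=40)
# 40 indices, evenly spaced: 0, 30, 60, ..., 1170
```

The selected frames are then passed to the noise-frame filter, which discards those that contain no fine-grained entity.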

Firstly, samples of the four media types are input into the generator simultaneously. Image, video and audio data are fed into a convolutional neural network (the feature extractor used here is a ResNet50 network), a text feature extraction algorithm based on a self-attention mechanism processes the text data, and the Illinois Wikifier entity extractor performs entity extraction on the text. The extracted features are then input into the entity-level common feature space, where two constraints (the same-class correlation constraint and the single-media entity construction constraint) pull positively correlated samples closer and push negatively correlated samples farther apart, and the entities extracted from the text are associated with the other three media data through entity labels. Because of the sparsity of the entity labels, a fully connected layer of dimension 2048 is used as the mapping function from the low-level features to the entity-level common feature space. The entity-level common feature space is then learned through the adversarial mechanism of the generator and the discriminator. Finally, entity-level high-level abstract semantic representations are learned with the help of semantic category labels, from which the similarity between cross-media data is computed. At retrieval time, the output of the last fully connected layer of the framework serves as the common semantic representation of the four media types; the cosine distance is used to measure the similarity between different media data, and the retrieval results are returned ranked by similarity.
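The retrieval step at the end of the pipeline can be sketched as follows: cosine similarity between a query's common representation and a gallery of another media type, with results ranked by score (the 3-dimensional vectors are toy stand-ins for the real common features):

```python
import numpy as np

# Rank a gallery of common-space representations by cosine similarity
# to a query representation, as in the retrieval stage described above.
def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, gallery):
    scores = [cosine_sim(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -scores[i])

query = np.array([1.0, 0.0, 1.0])
gallery = [np.array([1.0, 0.1, 0.9]),   # very similar to the query
           np.array([0.0, 1.0, 0.0]),   # orthogonal to the query
           np.array([1.0, 0.0, 0.0])]   # partially similar
ranking = retrieve(query, gallery)       # → [0, 2, 1]
```

In the actual system the vectors would be the outputs of the last fully connected layer for the four media types, so any two media can be compared in the same way.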

The retrieval tasks are set as follows: in dual-media fine-grained cross-media retrieval, the query is an example of any one media type and the results are examples of another media type; in multimedia fine-grained cross-media retrieval, the query is an example of any media type and the results are data of all media types. Fig. 2 and fig. 3 compare dual-media and multimedia retrieval on the PKUFG-XMedia dataset. Fig. 4 shows the dual-media retrieval comparison on the PKUXmedia dataset, fig. 5 the multimedia retrieval comparison on the PKUXmedia dataset, fig. 6 the dual-media retrieval comparison on the Wikipedia dataset, and fig. 7 the dual-media retrieval comparison on the NUS-WIDE dataset. Fig. 8 shows the influence of each module in the dual-media retrieval mode and fig. 9 the influence of each module in the multimedia retrieval mode.

In summary, the present embodiment proposes a fine-grained cross-media retrieval algorithm based on an entity-level common feature space (CFFCE) to address the earlier problems of semantic ambiguity in the common feature space and the neglect of fine-grained features. The proposed method comprises three main stages. (1) Noise frame filtering: since video data is formed by extracting frames from a video at equal intervals, the extracted frames may include noise frames (frames that do not contain the target object). Before entity-level common feature learning, the feature-space-clustering-based noise frame filtering algorithm proposed by the invention first removes the frame images that contain no fine-grained entity, so that the entity-level common feature space is learned better. (2) Entity-level common feature space learning: an entity extractor extracts the entities in the text data and generates the corresponding entity labels; because the matched data of the other media share the same semantics as the text, these entities are shared by all four media types. The entity-level common feature space is then obtained from the entity labels by minimizing the cross-media correlation constraint and the single-media entity reconstruction constraint, together with the adversarial training of the media discriminator and generator. This reduces the ambiguity of the mid- and high-level features to some extent and helps in understanding their semantic concepts. (3) High-level entity semantic extraction and learning: to retrieve the most relevant multimedia data belonging to the same high-level concept, entity-level semantic abstraction is performed using semantic class labels. Finally, based on the unified semantic representation, the similarity between data of different media types can be computed and cross-media retrieval carried out.
Experiments are conducted on 1 challenging fine-grained cross-media dataset (PKUFG-XMedia) and 3 challenging coarse-grained cross-media datasets (PKUXmedia, Wikipedia and NUS-WIDE), and the results are compared with 8 classical cross-media algorithms, demonstrating the effectiveness of the algorithm.
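The noise-frame filtering of stage (1) can be sketched as follows. This is a hedged, feature-space-clustering style stand-in that treats frames far from the video's mean frame feature as noise; it is illustrative only and not the exact SCFR algorithm of the invention:

```python
import numpy as np

# Keep the keep_ratio fraction of frames whose features lie closest to the
# cluster center (here the mean) of the video's frame features; frames far
# from the center are treated as noise frames and dropped.
def filter_noise_frames(frame_feats, keep_ratio=0.75):
    X = np.asarray(frame_feats, dtype=float)
    center = X.mean(axis=0)                      # feature-space cluster center
    dists = np.linalg.norm(X - center, axis=1)   # distance of each frame
    k = max(1, int(len(X) * keep_ratio))
    keep = np.argsort(dists)[:k]                 # indices of the k closest frames
    return sorted(keep.tolist())

feats = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]  # last one is an outlier
kept = filter_noise_frames(feats)                # → [0, 1, 2]
```

Removing such outlier frames before common-feature learning is what allows the remaining frames to reliably represent the fine-grained entity in the video.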

As shown in fig. 2 and 3, the mAP results of the CFFCE algorithm and the compared cross-media retrieval algorithms on 12 dual-media retrieval tasks and 4 multimedia retrieval tasks on the PKUFG-XMedia dataset are presented. Because of space limitations, FGCrossNet is abbreviated to FGCN. As can be seen from fig. 2, CFFCE performs significantly better than the 8 cross-media retrieval methods on the 12 dual-media retrieval tasks, because CFFCE can combine the coarse-grained information of the semantic category labels with the fine-grained information in the entities and thus better extract the common semantic features of the four media data.

As shown in fig. 3, the mAP score of each algorithm in the multimedia retrieval mode is presented, and the trend of the scores is similar to that of dual-media fine-grained retrieval. CFFCE exhibits superior performance compared to the other algorithms, mainly because: 1) the video data is filtered with the SCFR algorithm to remove noise frames, so that video frames containing the target object, i.e., valid entities, are obtained, providing a guarantee for the subsequent CFFCE algorithm; 2) in the entity-level common feature learning stage, by minimizing the cross-media correlation error and the single-media reconstruction error, a mapping matrix from the low-level features to the entity-level common feature space can be learned, after which a unified entity-level abstract semantic representation is obtained from the category labels.

As shown in fig. 4, the mAP scores of the various cross-media retrieval algorithms for dual-media retrieval on the PKUXmedia dataset are presented.

As shown in fig. 5, the comparison results for multimedia retrieval are presented. Unlike PKUFG-XMedia, PKUXmedia is a coarse-grained dataset; experience suggests that an algorithm that performs better on a fine-grained dataset should also maintain good performance on a coarse-grained one. As is clear from figs. 4 and 5, CFFCE is superior to the other algorithms on all 12 dual-media retrieval tasks. In table 3, the mAP scores of three video-related retrieval tasks improve by 16.3%, 10.9% and 3.4% respectively over the second-best FGCrossNet, which fully demonstrates the effectiveness of the video frame filtering algorithm. In addition, the CFFCE algorithm clearly improves on the other retrieval tasks as well, verifying the ability of the entity-level common feature space to learn fine-grained features.

As shown in fig. 6, the mAP scores of the CFFCE algorithm and 6 cross-media retrieval algorithms on the Wikipedia dataset are presented. On this non-fine-grained dataset, the performance of CFFCE remains optimal compared with the other cross-media algorithms. Unlike the PKUFG-XMedia and PKUXmedia datasets, Wikipedia has no video data, so the video frame filtering operation is not needed. As can be seen in fig. 6, the mAP results of CFFCE on the "I → T" and "T → I" retrieval tasks on this dataset improve by 31.9% and 34.8% over the JRL algorithm, and the experimental results again indicate the effectiveness of the entity-level common features.

As shown in fig. 7, the mAP scores of the CFFCE algorithm and 6 classical cross-media algorithms for retrieval on the NUS-WIDE dataset are presented. Like PKUXmedia and Wikipedia, NUS-WIDE is also a coarse-grained dataset. As on all the datasets above, the performance of CFFCE remains optimal, improving by 7.4% and 26.6% over FGCrossNet and CMDN respectively. Since the NUS-WIDE dataset, like Wikipedia, contains no video data, the video frame filtering algorithm is not required. The experimental results again indicate the effectiveness of CFFCE.

As shown in figs. 8 and 9, which present the results of the CFFCE experiments on the PKUFG-XMedia dataset (a fine-grained cross-media dataset), the mAP scores of the different modules are shown incrementally. The second row of figs. 8 and 9 shows the retrieval performance of the base framework alone (without video frame denoising and entity-level extraction). The third row shows the mAP score after the noisy video frames are removed with the feature-space-based video frame filtering algorithm on top of the base framework: the performance of all retrieval tasks improves noticeably after the denoising operation is added, with the T → V retrieval task improving the most, by 32%. The fourth row shows the mAP score after the entity-level common feature space is constructed, which improves the retrieval performance considerably compared with using the noise frame filtering algorithm alone.

In summary, the approach proposed by the present invention has two parts: one is the feature-space-based video frame filtering algorithm SCFR, which aims to remove the noise frames in each video; the other is the fine-grained cross-media retrieval algorithm CFFCE based on an entity-level common feature space. The performance of each part is evaluated experimentally.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
