Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism

Document No.: 1952903    Publication date: 2021-12-10

Note: This technology, "Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism", was designed and created by 李国徽, 孔浩, 袁凌, 魏明 and 金从元 on 2021-08-27. Abstract: The invention discloses a construction method and application of a cross-modal retrieval model based on a multi-layer attention mechanism, comprising the following steps: S1, building a cross-modal retrieval model; S2, calculating the invariance loss and the label space loss between the image modality and the text modality, and the discrimination losses between different modalities and within each modality in the common representation space, and combining them with different weights to obtain the loss function of the cross-modal retrieval model; S3, training the cross-modal retrieval model by minimizing its loss function. In the training process, by considering the mutual influence among different modalities, the method learns, for each image-text data pair and based on a multi-layer attention mechanism, the fine-grained context local information within its text features and image features as well as the global feature correspondence between its text features and image features. It thereby makes full use of the semantic information in the original data, fuses the multi-modal features, and greatly improves the accuracy of cross-modal retrieval.

1. A construction method of a cross-modal retrieval model based on a multi-layer attention mechanism is characterized by comprising the following steps:

s1, building a cross-modal retrieval model; the cross-modal retrieval model comprises a cascaded feature extraction module, a multi-modal fusion module and a common representation module;

in the training stage, the feature extraction module is used for respectively extracting features of image modalities and data of text modalities of image text data pairs in a pre-collected image text data set to obtain text features and image features of the image text data pairs; wherein the image-text data set comprises a plurality of image-text data pairs and semantic tag vectors corresponding thereto; the image text data pair comprises data of an image modality and data of a text modality corresponding to the data;

the multi-mode fusion module is used for learning the text features and the context local information of fine granularity in the image features of each image text data pair based on a multi-layer attention mechanism and learning the global feature corresponding relation between the text features and the image features of each image text data pair so as to further extract the text features and the image features of each image text data pair and output the extracted text features and image features to the common representation module;

the public representation module is used for respectively mapping the text features and the image features input by the multi-mode fusion module to a public representation space to obtain semantic representation vectors of each image text data pair; the semantic representation vector of the image text data pair comprises a semantic representation vector of image modality data and a semantic representation vector of text modality data;

the feature extraction module comprises a first input end and a second input end, data of an image modality are input into the cross-modal retrieval model through the first input end, and data of a text modality are input into the cross-modal retrieval model through the second input end; the common representation module comprises a first output end, via which the semantic representation vector of the image modality data is output from the cross-modal retrieval model, and a second output end, via which the semantic representation vector of the text modality data is output from the cross-modal retrieval model;

s2, calculating invariance loss and label space loss between an image mode and a text mode and discriminant loss between different modes and in each mode in a public expression space based on semantic label vectors and semantic expression vectors of each image text data pair, and adding different weights to obtain a loss function of the cross-mode retrieval model;

s3, training the cross-modal retrieval model by minimizing the loss function of the cross-modal retrieval model to obtain the trained cross-modal retrieval model.

2. The method according to claim 1, wherein the multi-modal fusion module comprises a plurality of cascaded MCA layers and a dimension reduction unit located after the last MCA layer;

the MCA layer is used for learning a global feature corresponding relation between the text features and the image features based on a cooperative attention mechanism and learning context local information of fine granularity in the text features and the image features based on an attention mechanism;

and the dimension reduction unit is used for respectively reducing the dimension of the text features and the image features input by the MCA layer of the last level.

3. The method of constructing a cross-modal search model of claim 2, wherein the MCA layer comprises a first self-attention unit, a first fully-connected layer, a second self-attention unit, a cooperative attention unit, and a second fully-connected layer; the output end of the first self-attention unit is connected with the input end of the first full-connection layer; the output end of the second self-attention unit is connected with the input end of the cooperative attention unit; the output end of the cooperative attention unit is connected with the input end of the second full connection layer;

the first self-attention unit is used for extracting potential semantic relations in a text mode based on a self-attention mechanism so as to learn fine-grained context local information in text features and further extract the text features;

the second self-attention unit is used for extracting potential semantic relations in the image modality based on a self-attention mechanism so as to learn fine-grained context local information in the image features and further extract the image features;

the cooperative attention unit is used for extracting a semantic relation between an image mode and a text mode based on a cooperative attention mechanism, and realizing a learning process of guiding image features by text features, so that a global feature corresponding relation between the text features and the image features is learned, and the text features are fused into the image features.

4. The method according to claim 3, wherein for the non-final MCA layer, the output terminal of the first full-link layer is connected to the input terminal of the cooperative attention unit and the input terminal of the first self-attention unit of the next MCA layer; the output end of the second full connection layer is connected with the input end of a second self-attention unit of the next MCA layer;

for the MCA layer of the last stage, the output end of the first full connection layer is respectively connected with the input end of the cooperative attention unit and the input end of the dimensionality reduction unit; and the output end of the second full connection layer is connected with the input end of the dimension reduction unit.

5. The method of claim 3, wherein for the non-final MCA layer, the output terminal of the first full link layer is connected to the input terminal of the first self-attention unit of the next MCA layer; the output end of the second full connection layer is connected with the input end of a second self-attention unit of the next MCA layer;

for the MCA layer of the last stage, the output end of the first full connection layer is respectively connected with the input end of the cooperative attention unit of the previous stage and the input end of the dimensionality reduction unit; and the output end of the second full connection layer is connected with the input end of the dimension reduction unit.

6. A cross-modal retrieval method based on a multi-layer attention mechanism is characterized by comprising the following steps:

mapping target query data to a public representation space based on the cross-modal retrieval model constructed by the construction method of the cross-modal retrieval model according to any one of claims 1 to 5 to obtain a semantic representation vector of the target query data, and calculating the similarity between the target query data and data in a database by calculating the similarity between the semantic representation vector of the target query data and the semantic representation vector of the data in the database to obtain a corresponding retrieval ordering result;

wherein the semantic representation vectors of the data in the database are obtained by mapping with the cross-modal retrieval model.

7. The cross-modal search method according to claim 6, wherein in the cross-modal search model, an image channel is formed from a first input end of the feature extraction module to a first output end of the common representation module, and is used for mapping data of an image modality into a common representation space; a text channel is formed from a second input end of the feature extraction module to a second output end of the public representation module and is used for mapping data of a text mode to a public representation space;

the target query data or data in the database is mapped into the common representation space using corresponding channels according to its modality.

8. The cross-modal retrieval method of claim 6, wherein the similarity between the target query data and the data in the database of a different modality is calculated to obtain the cross-modal data similarity; the similarity between the target query data and the data in the database of the same modality is calculated to obtain the same-modality data similarity; and the cross-modal data similarity and the same-modality data similarity are weighted and averaged to obtain the similarity between the target query data and the data in the database.

9. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of constructing a cross-modal search model according to any of claims 1 to 5 and/or the cross-modal search method according to any of claims 6 to 8.

Technical Field

The invention belongs to the technical field of data retrieval, and particularly relates to a construction method and application of a cross-modal retrieval model based on a multilayer attention mechanism.

Background

General-purpose search engines on the Internet, such as Google, Baidu and Yahoo, accept only text-based retrieval: they take text keywords as the query and cannot retrieve content using multi-modal information. Querying with text alone often fails to return the desired results, and because the keywords entered by users are usually subjective, it is difficult to describe multimedia information accurately and comprehensively, which greatly degrades the user experience of search engines.

Compared with traditional retrieval, the cross-modal retrieval task retrieves data across modalities: data of one modality is given as the query to search for relevant data of another modality. With the rapid development of information technology, multi-modal data has become the main form of expression of big data, and how to give computers the ability to understand multi-modal data has become an important research problem. Multi-modal data are diverse in type and rich in semantics, and data of different modalities describe the characteristics of an object from different angles, so perceiving the external world through multi-modal data is close to the principal way in which humans come to know the world. Semantic understanding and knowledge representation of multi-modal data require a computer to perceive and understand real data scenes and to reason over the perceived knowledge, yet computers still have great difficulty processing multimedia information of different forms such as dynamic video and audio, so this topic has very broad research significance.

On the other hand, cross-modal retrieval services over massive data have very wide application scenarios. Cross-modal retrieval has driven the development of traditional search engines and offers a novel way of searching for information, in which users may freely use data of one form to retrieve data of another, such as searching for songs with pictures or searching for videos with text news. Cross-modal retrieval also plays an important role in special information query scenarios, for example searching a video for facial images of a criminal based on the portrait description provided by a witness, or submitting a photo of a scenic spot while travelling to retrieve the corresponding text description.

Currently, data retrieval based on a single modality, such as text retrieval, image retrieval and video retrieval, has been widely and intensively studied; these approaches can be divided into keyword-based and content-based retrieval. The rapid growth of different types of pictures, texts and videos has greatly promoted the development of cross-modal retrieval technology and placed higher requirements on retrieval across different modalities. Cross-modal retrieval can help users obtain comprehensive information about a target event, so it has great application value in the real world and has attracted wide attention from industry and academia in recent years.

The main challenges of cross-modal retrieval are retaining the semantic information shared between different modalities and computing similarities between data of different modalities. First, data of different modalities have different dimensions, so reducing the heterogeneity between modalities is an important problem for representing them uniformly; at the same time, covering the semantic information of each modality in the final multi-modal representation without losing important semantics is also challenging. Many methods have been proposed to solve these problems, but all have shortcomings, specifically:

(1) Methods based on statistical correlation analysis. These methods eliminate the heterogeneity between modalities by learning a common subspace in which the similarity between data of different modalities is calculated. Although such methods are efficient to learn and easy to apply, it is difficult to fully model the complex correlations of real-world cross-media data with linear projections alone, and most of them can only handle two media types.

(2) Methods based on cross-modal hashing. Existing cross-modal hashing methods map data of different modalities into the same space by learning one or more hash functions; however, because they ignore the similarity between heterogeneous data that contain rich semantic information, many of them cannot learn strong hash codes and their retrieval performance is poor.

(3) Methods based on deep learning. Deep-learning-based methods can well preserve the differences between samples with different semantics and learn discriminative features by minimizing the discrimination loss of samples in the label space and the common representation space. However, most previous methods embed images and texts into a common space separately to compare their similarities, and the mutual relationship between images and texts is rarely considered.

Disclosure of Invention

Aiming at the defects or improvement requirements of the prior art, the present invention provides a construction method and application of a cross-modal retrieval model based on a multi-layer attention mechanism, so as to solve the technical problem that existing cross-modal retrieval methods do not fully utilize the semantic information in the original data, considering only relations within a single modality while ignoring the semantic relations within each modality and between different modalities, which results in low retrieval accuracy.

In order to achieve the above object, in a first aspect, the present invention provides a method for constructing a cross-modal search model based on a multi-layer attention mechanism, including the following steps:

s1, building a cross-modal retrieval model; the cross-modal retrieval model comprises a cascaded feature extraction module, a multi-modal fusion module and a public representation module;

in the training stage, the feature extraction module is used for respectively extracting features of image mode and text mode data of each image text data pair in the pre-collected image text data set to obtain text features and image features of each image text data pair; the image-text data set comprises a plurality of image text data pairs and semantic label vectors corresponding to the image text data pairs; the image text data pair comprises data of an image modality and data of a text modality corresponding to the data;

the multi-mode fusion module is used for learning the text features and the context local information of fine granularity in the image features of each image text data pair based on a multi-layer attention mechanism and learning the global feature corresponding relation between the text features and the image features of each image text data pair so as to further extract the text features and the image features of each image text data pair and output the extracted text features and image features to the common representation module;

the public representation module is used for respectively mapping the text features and the image features input by the multi-mode fusion module to a public representation space to obtain semantic representation vectors of each image text data pair; the semantic representation vector of the image text data pair comprises a semantic representation vector of image modality data and a semantic representation vector of text modality data;

the feature extraction module comprises a first input end and a second input end, data of an image modality are input into the cross-modal retrieval model through the first input end, and data of a text modality are input into the cross-modal retrieval model through the second input end; the common representation module comprises a first output end and a second output end, the semantic representation vector of the image modality data is output from the cross-modal retrieval model through the first output end, and the semantic representation vector of the text modality data is output from the cross-modal retrieval model through the second output end;

s2, calculating invariance loss and label space loss between an image mode and a text mode and discriminant loss between different modes and in each mode in a public expression space based on semantic label vectors and semantic expression vectors of each image text data pair, and adding different weights to obtain a loss function of a cross-mode retrieval model;

and S3, training the cross-modal retrieval model by minimizing the loss function of the cross-modal retrieval model to obtain the trained cross-modal retrieval model.

Further preferably, the multi-modal fusion module comprises a plurality of cascaded MCA layers and a dimensionality reduction unit located after the last MCA layer;

the MCA layer is used for learning a global feature corresponding relation between the text features and the image features based on a cooperative attention mechanism and learning context local information of fine granularity in the text features and the image features based on an attention mechanism;

and the dimension reduction unit is used for respectively reducing the dimensions of the text features and the image features output by the last MCA layer.

Further preferably, the MCA layer comprises a first self-attentive cell, a first fully-connected layer, a second self-attentive cell, a cooperative-attentive cell and a second fully-connected layer; the output end of the first self-attention unit is connected with the input end of the first full-connection layer; the output end of the second self-attention unit is connected with the input end of the cooperative attention unit; the output end of the cooperative attention unit is connected with the input end of the second full connection layer;

the first self-attention unit is used for extracting potential semantic relations in a text mode based on a self-attention mechanism so as to learn fine-grained context local information in text features and further extract the text features;

the second self-attention unit is used for extracting potential semantic relations in the image modality based on a self-attention mechanism so as to learn fine-grained context local information in the image features and further extract the image features;

the cooperative attention unit is used for extracting a semantic relation between an image mode and a text mode based on a cooperative attention mechanism and realizing a learning process of guiding image features by text features, so that a global feature corresponding relation between the text features and the image features is learned to fuse the text features into the image features.

Further preferably, for the non-last MCA layer, the output end of the first full connection layer is connected to the input end of the cooperative attention unit and the input end of the first self-attention unit of the next MCA layer respectively; the output end of the second full connection layer is connected with the input end of a second self-attention unit of the next MCA layer;

for the MCA layer of the last stage, the output end of the first full connection layer is respectively connected with the input end of the cooperative attention unit and the input end of the dimensionality reduction unit; and the output end of the second full connection layer is connected with the input end of the dimension reduction unit.

Further preferably, for the MCA layer not in the last stage, the output terminal of the first full-link layer is connected to the input terminal of the first self-attention unit in the MCA layer in the next stage; the output end of the second full connection layer is connected with the input end of a second self-attention unit of the next MCA layer;

for the MCA layer of the last stage, the output end of the first full connection layer is respectively connected with the input end of the cooperative attention unit of the previous stage and the input end of the dimensionality reduction unit; and the output end of the second full connection layer is connected with the input end of the dimension reduction unit.

In a second aspect, the present invention provides a cross-modal retrieval method based on a multi-layer attention mechanism, including:

mapping target query data to a common representation space based on the cross-modal retrieval model constructed by the above construction method to obtain a semantic representation vector of the target query data, and calculating the similarity between the target query data and the data in the database by calculating the similarity between the semantic representation vector of the target query data and the semantic representation vectors of the data in the database, so as to obtain a corresponding retrieval ordering result;

wherein the semantic representation vectors of the data in the database are obtained by mapping with the cross-modal retrieval model.

Further preferably, in the cross-modal retrieval model constructed by the construction method of the cross-modal retrieval model, an image channel is formed from the first input end of the feature extraction module to the first output end of the common representation module, and is used for mapping data of an image modality to a common representation space; a text channel is formed from a second input end of the feature extraction module to a second output end of the public representation module and is used for mapping data of a text mode to a public representation space;

the target query data or data in the database is mapped into the common representation space using corresponding channels according to its modality.

Further preferably, the similarity between the target query data and the data in the database of a different modality is calculated to obtain the cross-modal data similarity; the similarity between the target query data and the data in the database of the same modality is calculated to obtain the same-modality data similarity; and the cross-modal data similarity and the same-modality data similarity are weighted and averaged to obtain the similarity between the target query data and the data in the database.

In a third aspect, the present invention also provides a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement any one of the cross-modal retrieval model building methods described above and/or the cross-modal retrieval method described above.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

1. The invention provides a construction method of a cross-modal retrieval model based on a multi-layer attention mechanism. Considering the mutual influence among different modalities, the attention mechanism is cascaded into a multi-layer neural network, and different attention mechanisms are used to learn the global feature correspondence between texts and images as well as the fine-grained context local information within the texts and images.

2. The invention provides a construction method of a cross-modal retrieval model based on a multi-layer attention mechanism, in which the multi-modal fusion module comprises self-attention units and cooperative attention units: the context local information within a modality is learned by the self-attention units and the global interaction information between data of different modalities is learned by the cooperative attention units; the two kinds of units are then stacked into a multi-layer neural network structure, and the encoded text features are used to guide the learning process of the image features.

3. The invention provides a construction method of a cross-modal retrieval model based on a multi-layer attention mechanism, in which the label space loss, the discrimination losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities are calculated and combined with different weights to obtain the loss function of the retrieval model; the parameters of the retrieval model are then optimized by minimizing this loss function to obtain the optimized retrieval model. Thanks to the end-to-end deep-learning-based model framework, the global information of multi-modal data is fully utilized to learn fine-grained features across modalities, so that the discriminability of data samples with different semantics is preserved and the differences between cross-modal data are eliminated.

4. Because the learning capability of shallow models is limited and they cannot capture the deeper semantic associations between data of different modalities, the invention provides a cross-modal semantic fusion method based on a multi-layer attention mechanism.

Drawings

Fig. 1 is a schematic flowchart of a method for constructing a cross-modal search model based on a multi-layer attention mechanism according to embodiment 1 of the present invention;

fig. 2 is a schematic structural diagram of a cross-modal search model provided in embodiment 1 of the present invention;

fig. 3 is a schematic structural diagram of a self-attention unit SA according to embodiment 1 of the present invention;

fig. 4 is a schematic structural diagram of the cooperative attention unit GA provided in embodiment 1 of the present invention;

fig. 5 is a schematic structural diagram of a multi-modal fusion module based on a stacked structure according to embodiment 1 of the present invention;

fig. 6 is a schematic structural diagram of a multi-modal fusion module based on a codec structure according to embodiment 1 of the present invention;

FIG. 7 is a schematic diagram of the MCA layer structures provided in embodiment 1 of the present invention; wherein (a) is a schematic diagram of the first MCA layer structure, (b) is a schematic diagram of the second MCA layer structure, and (c) is a schematic diagram of the third MCA layer structure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Embodiment 1

A method for constructing a cross-modal retrieval model based on a multi-layer attention mechanism is disclosed, as shown in FIG. 1, and comprises the following steps:

s1, building a cross-modal retrieval model;

specifically, as shown in fig. 2, the cross-modal search model includes a cascaded feature extraction module, a multimodal fusion module, and a common representation module;

for the feature extraction module:

In the training stage, the feature extraction module is used for respectively extracting features from the image modality data and the text modality data of each image-text data pair in the pre-collected image-text data set, so as to obtain the text features and image features of each image-text data pair; the image-text data set comprises a plurality of image-text data pairs and the semantic label vectors corresponding to them; each image-text data pair comprises data of an image modality and the corresponding data of a text modality. The image-text data set of this embodiment comprises n image-text data pairs, denoted as $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the data of the ith image modality and $y_i$ is the data of the ith text modality. Each image-text pair corresponds to a semantic label vector $label_i = [label_{1i}, label_{2i}, \ldots, label_{ci}]$, where c denotes the number of categories; if the ith instance belongs to the jth category, the corresponding component of the label vector is 1, otherwise it is 0. It should be noted that the image modality data and the text modality data in the image-text data set are both preprocessed: preprocessing of the image modality data includes resizing, cropping, normalization and the like; preprocessing of the text modality data includes denoising, word segmentation, stop-word filtering and the like.
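To make the label-vector convention above concrete, a minimal example follows (Python is assumed as the illustration language throughout; the category count and the memberships below are made-up values, not taken from the patent):

import numpy as np

c = 5                                  # number of categories (assumed value)
categories_of_instance_i = [1, 3]      # the ith image-text pair belongs to categories 1 and 3 (assumed)
label_i = np.zeros(c, dtype=int)
label_i[categories_of_instance_i] = 1  # component j is 1 iff the instance belongs to category j
print(label_i)                         # -> [0 1 0 1 0]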

Specifically, the feature extraction module comprises a first input end and a second input end, data of an image modality is input into the cross-modal retrieval model through the first input end, and data of a text modality is input into the cross-modal retrieval model through the second input end; the feature extraction module comprises two parallel sub-networks for respectively extracting features of data in an image mode and data in a text mode; the input end of the first sub-network is a first input end of the feature extraction module and is used for extracting data of an image modality; the input of the second sub-network is a second input of the feature extraction module for extracting data of the textual modality.

Feature extraction for data of the image modality: the first sub-network comprises an image extraction sub-network and a first fully-connected layer connected in series. In general, a convolutional neural network is used as the image extraction sub-network to obtain the feature representation of the image modality data. Convolution kernels in the convolutional neural network are convolved with the input image, converting the read image information into vectors in feature maps; in addition, since an end-to-end training mode is adopted, the model controls the feature dimension of each layer through pooling layers, and finally a picture is represented as a feature vector of a specific dimension. By stacking multiple convolutional layers, the convolutional neural network can extract image features at a higher level of abstraction. Because convolutional neural networks have unique advantages in the field of image recognition, a specific convolutional neural network can be used for image feature extraction in the cross-modal retrieval process. In this embodiment, a VGG16 model is used as the image feature extraction tool: the preprocessed picture is input into the model, and the 4096-dimensional vector output by the second-to-last layer of the model is taken as the image feature extraction result.
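A minimal sketch of the image branch described above, using a pre-trained VGG16 from torchvision and taking the 4096-dimensional output of the second-to-last fully-connected layer as the image feature; the preprocessing values follow the common ImageNet recipe and are assumptions here, not taken from the patent:

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg16 = models.vgg16(pretrained=True).eval()                  # newer torchvision: weights="IMAGENET1K_V1"
fc_head = torch.nn.Sequential(*list(vgg16.classifier.children())[:-1])  # drop the final 1000-way layer

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv = vgg16.avgpool(vgg16.features(img)).flatten(1)  # convolution + pooling stages
        return fc_head(conv)                                  # 4096-d vector from the penultimate layer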

Feature extraction for data of the text modality: the second sub-network comprises a text extraction sub-network and a second fully-connected layer connected in series. The text extraction sub-network generally adopts a language-model-based feature extraction approach to convert the text information from unstructured raw text into a multi-dimensional vector; in this embodiment, the text extraction sub-network is BERT. BERT builds a multi-layer bidirectional encoder network from the Transformer structure, can directly convert raw text into high-level semantic vectors with semantic features, and has achieved good results in natural language processing tasks. When using BERT, a Chinese or English pre-trained model is first loaded, the preprocessed text is then input into the pre-trained model to obtain a vector of a specific dimension, and finally the generated multi-dimensional feature vector is output to the cross-modal retrieval model for attention learning.
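A minimal sketch of the text branch, using a pre-trained BERT encoder from the HuggingFace transformers library; taking the hidden state of the [CLS] token as the sentence-level text feature is an assumption here, since the text only states that BERT outputs a multi-dimensional vector:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # or a Chinese checkpoint such as "bert-base-chinese"
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def extract_text_feature(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]                    # 768-d vector of the [CLS] token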

For the multimodal fusion module:

in the training stage, the multi-mode fusion module is used for learning the text features and the context local information of fine granularity in the image features of each image text data pair based on a multi-layer attention mechanism and learning the global feature corresponding relation between the text features and the image features so as to further extract the text features and the image features and output the text features and the image features to the common representation module;

the invention introduces a multi-layer cascade attention mechanism in the cross-modal retrieval process, each layer is called an MCA unit, each layer is composed of one or more GAs and SAs, SA represents a self-attention unit and is used for extracting the potential semantic relationship in each mode, and GA represents a cooperative attention unit and is used for extracting the semantic relationship among different modes. One MCA layer can be formed by serially connecting SA and GA, and then a plurality of MCAs are serially connected to form a multi-layer attention mechanism, and the complex multi-layer structure can better realize complex reasoning operation compared with a shallow model. In consideration of the influence of the image modality on the representation form of the text modality, the invention introduces an attention mechanism and a multi-layer attention mechanism to further fuse the input text feature and the image feature. The attention mechanism comprises a self-attention and cooperative attention mechanism, the self-attention and cooperative attention mechanism can be used for mining content similarity of an image mode and a text mode, and the performance of cross-mode retrieval is improved by fusing multi-mode information. The multi-layer attention mechanism can achieve fine-grained interaction between multi-modal features. Specifically, the multi-modal fusion module comprises a multi-modal fusion module and a dimension reduction unit, wherein the multi-modal fusion module comprises a plurality of cascaded MCA layers and a dimension reduction unit positioned behind the last MCA layer; the MCA layer is used for learning the global feature corresponding relation between the text features and the image features based on the cooperative attention mechanism and learning the context local information of fine granularity in the text features and the image features based on the self-attention mechanism. In addition, after the feature vectors pass through a plurality of MCA layers, the dimension is usually high, and in order to reduce the extracted feature dimension, dimension reduction needs to be performed through a dimension reduction unit; in this embodiment, the dimension reduction unit is a multi-layer perceptron MLP.

Specifically, the MCA layer comprises a first self-attention unit SA1, a first fully-connected layer FF1, a second self-attention unit SA2, a cooperative attention unit GA and a second fully-connected layer FF2; the output of the first self-attention unit SA1 is connected to the input of the first fully-connected layer FF1; the output of the second self-attention unit SA2 is connected to the input of the cooperative attention unit GA; the output of the cooperative attention unit GA is connected to the input of the second fully-connected layer FF2. The first self-attention unit SA1 is used for extracting the latent semantic relations inside the text modality based on a self-attention mechanism so as to learn the fine-grained context local information in the text features and further extract the text features. The second self-attention unit SA2 is used for extracting the latent semantic relations inside the image modality based on a self-attention mechanism so as to learn the fine-grained context local information in the image features and further extract the image features. The cooperative attention unit GA is used for extracting the semantic relations between the image modality and the text modality based on a cooperative attention mechanism and realizing the process in which the text features guide the learning of the image features, so that the global feature correspondence between the text features and the image features is learned and the text features are fused into the image features. The text features input into the multi-modal fusion module first pass through the first self-attention unit SA1 and then the first fully-connected layer FF1 in each MCA layer; the image features first pass through the second self-attention unit SA2 in each MCA layer, then the cooperative attention unit GA, and finally the second fully-connected layer FF2. A multi-layer attention mechanism is formed by cascading a plurality of MCA layers, and compared with a shallow model this multi-layer structure can better realize complex reasoning. In this embodiment the number of MCA layers is L, and the image features X and the text features Y obtained by the feature extraction module are input into the multi-modal fusion module for attention learning. For the MCA layers, the input of each layer is the output of the previous layer, i.e. $[X^{(l)}, Y^{(l)}] = \mathrm{MCA}^{(l)}([X^{(l-1)}, Y^{(l-1)}])$, where $l = 2, 3, \ldots, L$.
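The PyTorch sketch below wires one MCA layer as described above: SA1 followed by FF1 on the text branch, and SA2 followed by GA and FF2 on the image branch, with the text output guiding GA. Treating each SA/GA unit as multi-head attention plus a feed-forward sub-layer with Add & LayerNorm follows fig. 3 and fig. 4; the layer sizes, head counts, and the reading that FF1 feeds the co-attention unit of the same layer are illustrative assumptions, not the patented implementation:

import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    # One SA or GA unit: multi-head attention + feed-forward layer + Add & LayerNorm.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    def forward(self, q, kv):
        q = self.norm1(q + self.mha(q, kv, kv)[0])   # SA: q == kv; GA: q = image, kv = text
        return self.norm2(q + self.ff(q))

class MCALayer(nn.Module):
    # Text branch: SA1 -> FF1; image branch: SA2 -> GA (guided by the text output) -> FF2.
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.sa1 = AttentionUnit(d_model, n_heads)   # text self-attention unit SA1
        self.sa2 = AttentionUnit(d_model, n_heads)   # image self-attention unit SA2
        self.ga = AttentionUnit(d_model, n_heads)    # cooperative attention unit GA
        self.ff1 = nn.Linear(d_model, d_model)       # first fully-connected layer FF1
        self.ff2 = nn.Linear(d_model, d_model)       # second fully-connected layer FF2
    def forward(self, x, y):                         # x: image features, y: text features
        y = self.ff1(self.sa1(y, y))                 # Y' = FF1(SA1(Y))
        x = self.ff2(self.ga(self.sa2(x, x), y))     # X' = FF2(GA(SA2(X), Y'))
        return x, y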

Specifically, the structure of the self-attention unit SA is shown in fig. 3 and that of the cooperative attention unit GA in fig. 4. The self-attention unit learns the relations inside a modality, while the cooperative attention unit learns the relations between different modalities; its calculation process is similar to that of the self-attention unit, the main difference being that its inputs are data of the two different modalities X and Y. In this embodiment, the self-attention unit SA and the cooperative attention unit GA are each composed of a multi-head attention layer, a feed-forward layer and a normalization layer (Add & LayerNorm).

The self-attention mechanism adopts Scaled Dot-Product Attention to compute intra-modal information, where Q, K and V denote the query, key and value respectively. K and Q are first multiplied (dot product) and the result is divided by $\sqrt{d_k}$, where $d_k$ denotes the dimension of K and Q (generally set to be the same); the weights of V are then obtained through softmax, and the attention result is finally obtained by weighted summation, specifically:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

Given a query $Q \in \mathbb{R}^{1\times d}$, where d is the dimension of the query, and n key-value pairs with $K \in \mathbb{R}^{n\times d}$ and $V \in \mathbb{R}^{n\times d}$, $1/\sqrt{d_k}$ is the scaling factor, and the dot-product form allows fast and efficient computation. The self-attention unit SA takes the input feature vector (image feature X or text feature Y) as K, V and Q respectively and feeds them into a multi-head attention layer, which learns the relations inside the feature vector point by point and weights and sums all outputs as the final result. Besides the multi-head attention layer, which performs linear transformations, another basic component, the feed-forward network layer, is introduced to increase the nonlinearity of the model. The feed-forward network layer takes the output features of the multi-head attention layer as its input and further transforms them with two fully-connected layers, including ReLU and Dropout; it further transforms the learned attention and finally applies normalization. The multi-head attention layer uses h parallel heads, each corresponding to a different scaled attention, so that features along multiple dimensions can be mined more effectively. Specifically, the output of the multi-head attention layer is expressed as:

$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h]\,W^{O}$

$\mathrm{head}_j = \mathrm{Attention}(QW_j^{Q}, KW_j^{K}, VW_j^{V}), \quad j = 1, 2, \ldots, h$

where $W_j^{Q}$, $W_j^{K}$ and $W_j^{V}$ are the projection matrices of the jth head, used to generate the attention result of the jth head; $W^{O}$ is used to splice the results of all heads; $d_h$ denotes the output dimension of each attention head, typically $d_h = d/h$. Setting $Q = [q_1, q_2, \ldots, q_m] \in \mathbb{R}^{m\times d}$, the output feature of the jth head is $F \in \mathbb{R}^{m\times d_h}$.
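A from-scratch sketch of the two formulas above; packing the projections of all h heads into single linear layers is a standard implementation convenience assumed here, not something stated in the text:

import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_h = h, d_model // h           # d_h = d / h
        self.w_q = nn.Linear(d_model, d_model)       # packs W_j^Q of all heads
        self.w_k = nn.Linear(d_model, d_model)       # packs W_j^K of all heads
        self.w_v = nn.Linear(d_model, d_model)       # packs W_j^V of all heads
        self.w_o = nn.Linear(d_model, d_model)       # W^O, splices the h heads back to dimension d
    def forward(self, q, k, v):
        b = q.size(0)
        split = lambda t: t.view(b, -1, self.h, self.d_h).transpose(1, 2)   # (b, m, d) -> (b, h, m, d_h)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        heads = scaled_dot_product_attention(q, k, v)                       # one attention result per head
        heads = heads.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_h)
        return self.w_o(heads)                       # MultiHead(Q, K, V) = [head_1, ..., head_h] W^O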

Similarly, the cooperative attention unit GA takes the input image feature X as Q, and the text feature Y as K and V to be input into the multi-head attention layer, learns the relationship between the two point by point, and finally realizes that the text feature Y guides the learning process of the image feature X.
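As a usage sketch of the module above, the same multi-head attention realizes both units: SA feeds one modality in as Q, K and V, while GA uses the image features as Q and the text features as K and V so that the text guides the image; the batch size and sequence lengths below are illustrative assumptions:

mha = MultiHeadAttention(d_model=512, h=8)
X = torch.randn(4, 49, 512)    # image features, e.g. 49 regions per picture (assumed shape)
Y = torch.randn(4, 32, 512)    # text features, e.g. 32 tokens per sentence (assumed shape)
sa_out = mha(X, X, X)          # self-attention: relations inside the image modality
ga_out = mha(X, Y, Y)          # cooperative attention: text features guide the image features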

It should be noted that, compared with shallow attention models, the model proposed by the invention stacks multiple MCA layers for interactive learning to form a deeper network, thereby generating more discriminative text and visual features. The multi-modal fusion module can be designed in two ways: one adopts a stacked structure, in which the output of each MCA layer is the input of the next MCA layer; the other adopts an encoder-decoder structure, in which the output of the last layer of an encoder composed of several MCA layers is taken as the input of every layer of the decoder, so that the learning of the image features is guided by the text features.

Specifically, fig. 5 is a schematic structural diagram of the multi-modal fusion module based on the stacked structure (taking L = 3 as an example). For every MCA layer other than the last one, the output of the first fully-connected layer is connected to the input of the cooperative attention unit and to the input of the first self-attention unit of the next MCA layer, and the output of the second fully-connected layer is connected to the input of the second self-attention unit of the next MCA layer. For the last MCA layer, the output of the first fully-connected layer is connected to the input of the cooperative attention unit and to the input of the dimension reduction unit, and the output of the second fully-connected layer is connected to the input of the dimension reduction unit. The multi-modal fusion module based on the stacked structure fuses the text features and image features within each MCA layer and then feeds the learned result into the next MCA layer; stacking several MCA layers gives the model deeper reasoning capability.
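A sketch of the stacked wiring of fig. 5, reusing the MCALayer sketch above: each MCA layer feeds both of its outputs to the next layer, and the outputs of the last layer pass through a small MLP acting as the dimension reduction unit (its width is an assumption):

import torch.nn as nn

class StackedFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=3, d_out=256):
        super().__init__()
        self.layers = nn.ModuleList([MCALayer(d_model, n_heads) for _ in range(n_layers)])
        self.reduce_x = nn.Sequential(nn.Linear(d_model, d_out), nn.ReLU(), nn.Linear(d_out, d_out))
        self.reduce_y = nn.Sequential(nn.Linear(d_model, d_out), nn.ReLU(), nn.Linear(d_out, d_out))
    def forward(self, x, y):
        for layer in self.layers:                    # the output of each MCA layer is the input of the next
            x, y = layer(x, y)
        return self.reduce_x(x), self.reduce_y(y)    # dimension-reduced image and text features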

Fig. 6 is a schematic structural diagram of the multi-modal fusion module based on the encoder-decoder structure (taking L = 3 as an example). For every MCA layer other than the last one, the output of the first fully-connected layer is connected to the input of the first self-attention unit of the next MCA layer, and the output of the second fully-connected layer is connected to the input of the second self-attention unit of the next MCA layer. For the last MCA layer, the output of the first fully-connected layer is connected to the inputs of the cooperative attention units of the preceding stages and to the input of the dimension reduction unit, and the output of the second fully-connected layer is connected to the input of the dimension reduction unit. In the multi-modal fusion module based on the encoder-decoder structure, the text feature Y is first taken as input and passed through the stack of first self-attention units SA1 to obtain the final text representation Y'; Y' is then input into the cooperative attention unit GA of each MCA layer to guide the learning process of the image modality X, while the image modality X selectively extracts its own key information through the second self-attention units SA2.
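A sketch of the encoder-decoder wiring of fig. 6, reusing the AttentionUnit sketch above: the text encoder is a stack of self-attention units whose final output Y' guides the cooperative attention unit of every decoder layer processing the image features (the fully-connected layers FF1/FF2 are folded away here for brevity, an assumption of this sketch):

class EncoderDecoderFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=3, d_out=256):
        super().__init__()
        self.text_sa = nn.ModuleList([AttentionUnit(d_model, n_heads) for _ in range(n_layers)])
        self.img_sa = nn.ModuleList([AttentionUnit(d_model, n_heads) for _ in range(n_layers)])
        self.img_ga = nn.ModuleList([AttentionUnit(d_model, n_heads) for _ in range(n_layers)])
        self.reduce_x = nn.Linear(d_model, d_out)    # dimension reduction, simplified to one layer
        self.reduce_y = nn.Linear(d_model, d_out)
    def forward(self, x, y):
        for sa in self.text_sa:                      # encoder: the text passes through all SA1 units first
            y = sa(y, y)
        for sa, ga in zip(self.img_sa, self.img_ga): # decoder: the final Y' guides the GA of every layer
            x = ga(sa(x, x), y)
        return self.reduce_x(x), self.reduce_y(y)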

The main difference between the two structures is whether the final text feature representation is used to guide the learning of the image features: the multi-modal fusion module based on the encoder-decoder structure performs cooperative attention learning with the final text representation, achieving deeper semantic interaction and understanding.

It should be noted that a global feature representation of a picture or a text may lose key information that relates to a local region of the picture or a single word of the text; to solve this problem, the invention uses attention mechanisms. A single-layer attention mechanism can only learn coarse interactions between modalities, whereas a multi-layer attention mechanism can realize fine-grained interaction between multi-modal features, which is why a multi-layer cascaded attention model is used in this method. The MCA layer, i.e. the modular co-attention layer, can be composed of different combinations of self-attention units SA and cooperative attention units GA; three forms are listed below. The first MCA layer structure, shown in (a) of fig. 7, uses only one cooperative attention unit GA on the image modality without processing the text features; the text features guide the learning process of the image features through the cooperative attention mechanism, thereby capturing the fine interactions between the data of the two modalities. The second MCA layer structure, shown in (b) of fig. 7, uses a self-attention unit SA on the text modality to selectively focus on the relations inside the text, and uses a cooperative attention unit GA on the picture modality to fuse the information between the text modality and the picture modality; it considers not only the latent semantic features inside the modalities but also the semantic relations between the modalities. The third MCA layer structure is the one adopted by the invention and is shown in (c) of fig. 7: the first self-attention unit SA1 is used for feature extraction on the text modality, and a second self-attention unit SA2 and a cooperative attention unit GA are used on the other modality; compared with the other two structures, the MCA layer structure adopted by the invention also takes the attention of the image modality itself into account, obtaining a richer image feature representation.

For a common representation module:

The common representation module is used for respectively mapping the text features and the image features input by the multi-modal fusion module into a common representation space to obtain the semantic representation vectors of each image-text data pair; the semantic representation vector of an image-text data pair comprises the semantic representation vector of its image modality data and the semantic representation vector of its text modality data. The common representation module comprises a first output end and a second output end; the semantic representation vector of the image modality data is output from the cross-modal retrieval model through the first output end, and the semantic representation vector of the text modality data is output from the cross-modal retrieval model through the second output end.

specifically, in this embodiment, the common representation module is a third fully connected layer, and maps the extracted image data features and text data features to the common representation space through the shared fully connected layer.

S2, calculating invariance loss and label space loss between an image mode and a text mode and discriminant loss between different modes and in each mode in a public expression space based on semantic label vectors and semantic expression vectors of each image text data pair, and adding different weights to obtain a loss function of a cross-mode retrieval model;

In particular, according to the relevant theory of deep learning, Loss is defined as the total objective function of the model, which is a linear weighting of three loss functions: the label space loss $L_1$ (i.e., the loss between the label space and the common representation space), the discrimination loss $L_2$ between and within the modalities in the common representation space, and the invariance loss $L_3$ between the image modality and the text modality (i.e., the distance loss between all image-text representation pairs). Since the three losses $L_1$, $L_2$ and $L_3$ have different influences, they are given different weights. Specifically, the loss function of the cross-modal retrieval model is:

$\mathrm{Loss} = \lambda L_1 + \mu L_2 + \eta L_3$

where λ, μ and η are the weight coefficients of $L_1$, $L_2$ and $L_3$ respectively.

Specifically, a linear classifier is added after the third fully-connected layer to predict the categories of the image modality data and the text modality data and compare them with the real categories, thereby calculating the label space loss. The label space loss measures the loss between the label space and the common representation space, i.e., the difference between the predicted categories of the image modality data and the text modality data and the corresponding real categories (i.e., the semantic labels). It is defined in terms of the following quantities: $\|\cdot\|_F$ denotes the Frobenius norm; α and β are the weights corresponding to the semantic representation vectors of the image modality data and the text modality data respectively (because the ways of extracting high-level semantic vectors from images and from texts differ, the prediction losses of the image and text features mapped into the common representation space are inconsistent, so different weights are applied to the image and text prediction labels to balance this difference); P is the projection matrix of the linear classifier; $U = [u_1, u_2, \ldots, u_n]$ is the predictive representation matrix of the image modality data of the image-text data set in the common representation space, where $u_i$ is the semantic representation vector of the data of the ith image modality in the image-text data set; $V = [v_1, v_2, \ldots, v_n]$ is the predictive representation matrix of the text modality data of the image-text data set in the common representation space, where $v_i$ is the semantic representation vector of the data of the ith text modality in the image-text data set; $\mathrm{Label} = [label_1, label_2, \ldots, label_n]$ is the matrix formed by the semantic label vectors; $i = 1, 2, \ldots, n$.
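The formula of $L_1$ itself is not reproduced in the text above; a form consistent with the quantities it defines (P, U, V, Label, α, β and the Frobenius norm), namely $L_1 = \alpha\|PU - \mathrm{Label}\|_F + \beta\|PV - \mathrm{Label}\|_F$, is sketched below as an assumption (any normalization constant used by the patent is omitted):

import torch
import torch.nn as nn

class LabelSpaceLoss(nn.Module):
    def __init__(self, dim_common, n_categories, alpha=0.5, beta=0.5):
        super().__init__()
        self.P = nn.Linear(dim_common, n_categories, bias=False)   # projection matrix P of the linear classifier
        self.alpha, self.beta = alpha, beta                        # weights for the image / text prediction losses
    def forward(self, U, V, Label):
        # U, V: (n, dim_common) image / text representations; Label: (n, c) multi-hot semantic labels (float)
        loss_img = torch.norm(self.P(U) - Label, p="fro")
        loss_txt = torch.norm(self.P(V) - Label, p="fro")
        return self.alpha * loss_img + self.beta * loss_txt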

The discrimination penalty between and within different modalities in the common representation space is:

where $\Gamma_{ij} = \cos(u_i, v_j)$, $\Phi_{ij} = \cos(u_i, u_j)$, $\Theta_{ij} = \cos(v_i, v_j)$; cos is the cosine function used to measure similarity; sgn is a sign function which is 1 if the two elements belong to the same class and 0 otherwise; $u_i$ is the semantic representation vector of the data of the ith image modality in the image-text data set; $v_j$ is the semantic representation vector of the data of the jth text modality in the image-text data set; $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.

The likelihood function for measuring the similarity between the modes is expressed as follows:

$L_2$ is used for measuring similarity in the common representation space and is the criterion for learning discriminative features. It is composed of three parts: the first part measures the similarity between the data of the image modality and the data of the text modality, while the second and third parts calculate the similarity within the image modality data and within the text modality data respectively. Since the first part takes the negative logarithm of the likelihood function (maximizing the likelihood is equivalent to minimizing its negative logarithm), it can be deduced that the larger the cosine similarity $\cos(u_i, v_j)$, the larger $\Gamma_{ij}$, and hence the larger the measured inter-modal similarity probability $p(1|u_i, v_j)$.
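The text describes $L_2$ only qualitatively (a negative log-likelihood over the cosine similarities $\Gamma$, $\Phi$ and $\Theta$, with an sgn indicator of whether two samples share a class) and does not reproduce the formula; the sketch below assumes the common form $-\frac{1}{n^2}\sum_{i,j}\left(\mathrm{sgn}_{ij}\,\Gamma_{ij} - \log(1 + e^{\Gamma_{ij}})\right)$ for each of the three parts, which matches that description but is not guaranteed to be the patent's exact expression:

import torch
import torch.nn.functional as F

def discrimination_loss(U, V, Label):
    def pairwise_cos(A, B):                              # matrix of cosine similarities
        return F.normalize(A, dim=1) @ F.normalize(B, dim=1).t()
    S = (Label @ Label.t() > 0).float()                  # sgn: 1 if two samples share a class, else 0 (assumed multi-label rule)
    def nll(sim):                                        # negative log-likelihood term over one similarity matrix
        return -(S * sim - torch.log1p(torch.exp(sim))).mean()
    gamma = pairwise_cos(U, V)                           # Γ_ij: inter-modal similarities
    phi = pairwise_cos(U, U)                             # Φ_ij: similarities inside the image modality
    theta = pairwise_cos(V, V)                           # Θ_ij: similarities inside the text modality
    return nll(gamma) + nll(phi) + nll(theta)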

The invariance loss $L_3$ between the image modality and the text modality is the distance loss between the semantic representation vectors $(u_i, v_i)$ of all image-text data pairs.
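The formula of $L_3$ is likewise not reproduced in the text; the sketch below assumes the mean Euclidean distance between the paired representations $u_i$ and $v_i$, and then combines the three losses with the weights λ, μ and η as in $\mathrm{Loss} = \lambda L_1 + \mu L_2 + \eta L_3$ (the weight values are placeholders, and LabelSpaceLoss and discrimination_loss refer to the sketches above):

def invariance_loss(U, V):
    return (U - V).norm(dim=1).mean()        # mean ||u_i - v_i||_2 over all image-text pairs (assumed form)

lam, mu, eta = 1.0, 1.0, 0.5                 # placeholder values for λ, μ, η
label_loss = LabelSpaceLoss(dim_common=256, n_categories=5)

def total_loss(U, V, Label):
    return (lam * label_loss(U, V, Label)
            + mu * discrimination_loss(U, V, Label)
            + eta * invariance_loss(U, V))   # Loss = λL1 + μL2 + ηL3, minimized in step S3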

and S3, training the cross-modal retrieval model by minimizing the loss function of the cross-modal retrieval model to obtain the trained cross-modal retrieval model.

The present invention supervises model learning discrimination features by minimizing discrimination losses in both the label space and the common representation space. At the same time, modality-invariant features in the common representation space are learned by minimizing loss of invariance between different modalities and using a weight sharing strategy. According to this learning strategy, pairs of label information and classification information are utilized as fully as possible to ensure that the learned representation is discriminative in semantic structure and invariant in all modalities.

Embodiment 2

A cross-modal retrieval method based on a multi-layer attention mechanism comprises the following steps:

mapping target query data to a public representation space based on the cross-modal retrieval model constructed by the construction method of the cross-modal retrieval model provided by the embodiment 1 to obtain a semantic representation vector of the target query data, and calculating the similarity between the target query data and data in a database by calculating the similarity between the semantic representation vector of the target query data and the semantic representation vector of the data in the database to obtain a corresponding retrieval ordering result;

The semantic representation vectors of the data in the database are obtained by mapping with the cross-modal retrieval model.

Specifically, in the cross-modal retrieval model, an image channel is formed from a first input end of the feature extraction module to a first output end of the common representation module, and is used for mapping data of an image modality to a common representation space; a text channel is formed from a second input end of the feature extraction module to a second output end of the public representation module and is used for mapping data of a text mode to a public representation space; the target query data or data in the database is mapped into the common representation space using corresponding channels according to its modality.

Preferably, in practical use, the similarity of same-modality data may be added to improve the retrieval accuracy, specifically: the similarity between the target query data and the data in the database of a different modality is calculated to obtain the cross-modal data similarity; the similarity between the target query data and the data in the database of the same modality is calculated to obtain the same-modality data similarity; the cross-modal data similarity and the same-modality data similarity are weighted and averaged to obtain the similarity between the target query data and the data in the database, expressed as:

S=εSimilarity(x,U')+γSimilarity(x,V')

where ε and γ are both weights; Similarity(·) is a function that measures similarity; x is the target query data, which can be data of an image modality or data of a text modality; $U' = [u'_1, u'_2, \ldots, u'_n]$ is the representation matrix of the image modality data of the database in the common representation space, where $u'_i$ is the semantic representation vector of the data of the ith image modality in the database; $V' = [v'_1, v'_2, \ldots, v'_n]$ is the representation matrix of the text modality data of the database in the common representation space, where $v'_i$ is the semantic representation vector of the data of the ith text modality in the database; $i = 1, 2, \ldots, n$. The database may be the image-text data set. S is the final returned score; the results are then sorted by S, and the top-ranked data are taken as the final result.
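A sketch of this retrieval step: the query is first mapped into the common representation space by the channel matching its modality, cosine similarity stands in for Similarity(·) (an assumption, since the text only calls it a similarity-measuring function), and the two scores are combined with the weights ε and γ before ranking (the numeric defaults are placeholders):

import torch
import torch.nn.functional as F

def retrieve(x, U_db, V_db, eps=0.7, gamma=0.3, topk=10):
    # x: (1, d) common-space vector of the query; U_db / V_db: (n, d) database image / text vectors
    sim_to_images = F.cosine_similarity(x, U_db)         # Similarity(x, U')
    sim_to_texts = F.cosine_similarity(x, V_db)          # Similarity(x, V')
    S = eps * sim_to_images + gamma * sim_to_texts       # S = ε·Similarity(x, U') + γ·Similarity(x, V')
    return torch.topk(S, k=topk).indices                 # indices of the top-ranked database entries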

The related technical scheme is the same as embodiment 1, and is not described herein.

Embodiment 3

A machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the method of constructing a cross-modal retrieval model of embodiment 1 and/or the cross-modal retrieval method of embodiment 2.

The related technical scheme is the same as that of embodiment 1 and embodiment 2, and is not described herein.

It should be noted that any reference to a storage medium, storage, database, or other medium used in connection with the present invention may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and bus dynamic RAM (RDRAM).

In summary, the invention provides a cross-modal semantic fusion method based on a multi-layer attention mechanism, which introduces the multi-layer attention mechanism into the cross-modal retrieval process to extract fine-grained features of text and image data and to learn the latent semantic relations inside single-modality data and between different modalities. In the model, self-attention units are used to learn the context local information within a modality and cooperative attention units are used to learn the global interaction information between data of different modalities; the self-attention and cooperative attention units are then stacked into a multi-layer neural network structure, and the encoded text features are used to guide the learning process of the image features. Finally, a common space is constructed in which samples from the same semantic category should be close, even though the data may come from different modalities, while samples from different semantic categories should be far apart. Data of different modalities are mapped into the common space through a shared neural network, and a classifier outputs the classification result. During model training, the distance between each image-text pair is minimized, the discrimination losses in the label space and the common representation space are minimized to learn discriminative representation features of the multi-modal data, and the intra-modal and inter-modal losses are minimized to reduce the cross-modal differences. After the model is trained, the optimized retrieval model maps the target query data into the common representation space, and the similarity between the target query data and the data in the image-text data set is calculated to obtain the corresponding retrieval ordering results.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
