Multi-modal tag recommendation method based on unidirectional supervised attention

Document No.: 1889366    Publication date: 2021-11-26

Reading note: this technology, "A multi-modal tag recommendation method based on unidirectional supervised attention", was designed and created on 2021-08-26 by 凌海峰, 徐旺, 钱洋, 姜元春, 刘业政, 柴一栋, 孙见山, 袁昆 and 孟祥睿. Its main content is as follows: the invention discloses a multi-modal tag recommendation method based on unidirectional supervised attention, comprising the steps of: 1. constructing and representing a multi-modal data set; 2. obtaining the feature representation of the text information; 3. obtaining the feature representation of the visual information; 4. deeply fusing the text and visual features; 5. recommending tags using the fused global features. When handling the tag recommendation task, the invention jointly exploits text and visual information to obtain effective data feature representations, thereby improving recommendation accuracy.

1. A multi-modal tag recommendation method based on unidirectional supervised attention, characterized by comprising the following steps:

step 1, constructing a multi-modal data set A;

step 1.1, constructing the text information set of the multi-modal data, denoted as T = {t_1, t_2, …, t_k, …, t_K}, where t_k represents the k-th piece of text information and K is the total number of text pieces; the words contained in the k-th piece of text information t_k are denoted as t_k = {w_{1,k}, w_{2,k}, …, w_{n,k}, …, w_{N,k}}, where w_{n,k} represents the n-th word of t_k and N is the maximum word length of a text piece;

step 1.2, constructing the visual information set of the multi-modal data, denoted as P = {U_1, U_2, …, U_k, …, U_K}, where U_k represents the k-th piece of visual information; the pictures contained in the k-th piece of visual information U_k are denoted as U_k = {p_{1,k}, p_{2,k}, …, p_{m,k}, …, p_{M,k}}, where p_{m,k} represents the m-th picture of U_k and M is the maximum number of pictures in a piece of visual information;

step 1.3, constructing the tag information set of the multi-modal data, denoted as Λ = {Λ_1, Λ_2, …, Λ_k, …, Λ_K}, where Λ_k represents the k-th piece of tag information; the tags contained in the k-th piece of tag information Λ_k are denoted as Λ_k = {λ_{1,k}, λ_{2,k}, …, λ_{o,k}, …, λ_{O,k}}, where λ_{o,k} represents the o-th tag of Λ_k and O is the maximum number of tags in a piece of tag information; all distinct tags together form the tag space Φ;

step 1.4, constructing the multi-modal data set A = {T, P, Λ} from the text information set T, the visual information set P and the tag information set Λ;

step 2, acquiring the feature representation of the text information;

step 2.1, performing word embedding on the k-th piece of text information t_k to obtain the word-vector sequence X_k = {x_{1,k}, x_{2,k}, …, x_{n,k}, …, x_{N,k}}, where x_{n,k} represents the word vector of the n-th word of t_k; text whose length is less than N is padded with zero vectors up to length N;

step 2.2, processing the word-vector sequence X_k with a bidirectional GRU to obtain the word-level text feature representation of t_k, whose n-th element is the word-level text feature of the n-th word of the k-th piece of text information t_k;

taking n as the current time step, for the n-th word vector x_{n,k} the GRU unit obtains the update gate z_{n,k}, the reset gate r_{n,k} and the candidate state by equations (1)-(3), respectively, and then combines them with the GRU output h_{n-1,k} of the (n-1)-th time step by equation (4) to obtain the GRU output h_{n,k} of the n-th time step:

z_{n,k} = σ(W_z · [h_{n-1,k}, x_{n,k}] + b_z)  (1)

r_{n,k} = σ(W_r · [h_{n-1,k}, x_{n,k}] + b_r)  (2)

In equations (1)-(4), ⊙ denotes element-wise multiplication, σ denotes the sigmoid activation function, W_z and b_z denote the weight matrix and bias vector of the update gate z_{n,k}, W_r and b_r denote the weight matrix and bias vector of the reset gate r_{n,k}, and W_h and b_h denote the weight matrix and bias vector of the candidate state; when n = 1, h_{n-1,k} = 0;

The forward hidden feature extracted by the forward GRU unit and the backward hidden feature extracted by the backward GRU unit are obtained by equations (5) and (6), respectively; the word-level text feature representation is then obtained by equation (7);

in equations (5) and (6), the two GRU operators denote the forward GRU unit and the backward GRU unit, respectively;

step 2.3, introducing the self-attention mechanism given by equations (8) and (9), and then obtaining the feature representation of the text information by equation (10);

in equations (8)-(10), d is the hidden-layer dimension, W_w and b_w are the weight matrix and bias vector used to compute the hidden representation of the word-level text features, the attention scores of the word-level text features are computed with a separate weight matrix, and l is the text feature dimension;

step 3, obtaining the feature representation of the visual information;

step 3.1, extracting picture region features through a pre-trained VGG-16 network:

the m-th picture p_{m,k} is divided into G regions to obtain its picture-region set, whose g-th element denotes the g-th region of p_{m,k}; the feature of the g-th region of the m-th picture p_{m,k} is then obtained through the VGG-16 network;

step 3.2, learning the feature representation of each picture through text-based visual attention:

the picture-region feature representation combined with the text context and the attention score of the g-th region are obtained by equations (11) and (12), respectively, and the feature of the m-th picture p_{m,k} is then obtained by equation (13);

in equations (11)-(13), the picture-region features and the text features are each multiplied by their own weight matrices, and the picture-region attention scores are computed with a further weight matrix and bias vector;

step 3.3, fusing the multiple pictures through an attention mechanism to obtain the final feature representation of the visual information:

in equations (14)-(16), the latent feature representation of the m-th picture in the k-th piece of visual information has its own weight matrix and bias vector, the attention score of the m-th picture p_{m,k} is computed with a further weight matrix, and the output of equation (16) is the finally learned visual feature;

step 4, deeply fusing the text and visual features:

obtaining the intermediate feature output by layer f of the deep neural network by equation (17), and thereby obtaining the global feature output by the deep neural network;

In equation (17), ReLU denotes the rectified linear unit, W_f and b_f denote the weight matrix and bias vector of the intermediate feature of layer f, f = 1, 2, …, F, where F is the number of layers of the deep neural network; when f = 1, the text feature and the visual feature are concatenated to obtain the initial feature;

step 5, recommending tags using the fused global features;

step 5.1, inputting the global feature into a classifier and obtaining the probability distribution over tags by equation (18);

in equation (18), W_p and b_p are the weight matrix and bias vector of the classifier;

step 5.2, adopting the binary cross-entropy given by equation (19) as the objective function J:

in equation (19), S denotes the number of data pieces of the multi-modal data set A used as the training set, the two probability terms are the ground-truth probability and the predicted probability that the s-th training sample contains tag λ, and λ ∈ Φ;

step 5.3, setting the maximum number of iterations epoch_number, training the objective function J by back propagation and gradient descent, and stopping when the iteration count reaches epoch_number so that J is minimized, thereby obtaining an optimal recommendation model that outputs the probability distribution over tags for the input text information set T and visual information set P; the top tags with the highest probabilities are then selected for recommendation.

Technical Field

The invention relates to the technical field of information retrieval over multi-modal data, and in particular to a tag recommendation method based on multi-modal data.

Background

A tag is a keyword that describes a feature of an object; users can freely choose related terms and assign them to a target according to their preferences. Tags not only concisely summarize the key information of objects but also interconnect objects through internal links, forming a relational network. Through a tag system, marketers can clearly display the important information of a product, consumers can retrieve products more effectively and enjoy a better user experience, and the platform can track and manage marketer-generated content more efficiently. Although the importance of tags for organizing and consuming content has been demonstrated, relatively few posts actually carry user tags. Tag recommendation therefore becomes a necessary and important task.

In recent years, researchers have applied a variety of methods to the tag recommendation task, such as collaborative filtering, generative models, and deep neural networks. Although much work exists on this topic, most of it focuses on textual features alone. Statistically, however, more than 42% of tweets contain multi-modal data. Picture features can also provide rich information for the tag recommendation task. Moreover, the information conveyed by pictures and text is often complementary, and together they reflect the characteristics of an object more fully. Using textual features alone is therefore insufficient for tag recommendation. Some studies have proposed multi-modal models that combine text and picture features, but they have drawbacks when applied to tag recommendation. The document [VQA: Visual Question Answering, 2015] models object features by extracting global text and picture feature vectors and multiplying them element-wise. This approach ignores the fact that tags are not necessarily associated with every part of the image and the text; using global features may instead introduce noise and reduce recommendation accuracy. The document [Hashtag Recommendation for Multimodal Microblog Using Co-Attention Network, 2017] introduced a co-attention mechanism into the tag recommendation task to fuse picture and text features, but this mechanism was proposed in [Hierarchical Question-Image Co-Attention for Visual Question Answering, 2016] and was originally designed for visual question answering. Since visual question answering matches text against picture features, it mainly attends to what the picture and text have in common, so parts unique to the text or the picture are easily ignored by this mechanism. The document [AMNN: Attention-Based Multimodal Neural Network Model for Hashtag Recommendation, 2020] extracts text and picture features separately with self-attention and then simply sums them to obtain an overall feature for tag recommendation, but this does not model the interaction between text and picture features. How to effectively extract and fuse picture and text features thus remains an open problem. Furthermore, previous studies on multi-modal data mostly assume that each piece of data contains only one picture. In many settings, such as microblog posts, a post may contain several pictures, and not all of them are helpful for tag recommendation. How to use the information in multiple pictures effectively is also a problem worth studying.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a multi-modal tag recommendation method based on unidirectional supervised attention, so that text and visual information can be jointly exploited to learn data feature representations quickly and accurately when handling the tag recommendation task on multi-modal data, thereby improving the accuracy and precision of tag recommendation.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention relates to a multi-mode label recommendation method based on one-way supervision attention, which is characterized by comprising the following steps of:

step 1, constructing a multi-modal data set A;

step 1.1, constructing the text information set of the multi-modal data, denoted as T = {t_1, t_2, …, t_k, …, t_K}, where t_k represents the k-th piece of text information and K is the total number of text pieces; the words contained in the k-th piece of text information t_k are denoted as t_k = {w_{1,k}, w_{2,k}, …, w_{n,k}, …, w_{N,k}}, where w_{n,k} represents the n-th word of t_k and N is the maximum word length of a text piece;

step 1.2, constructing the visual information set of the multi-modal data, denoted as P = {U_1, U_2, …, U_k, …, U_K}, where U_k represents the k-th piece of visual information; the pictures contained in the k-th piece of visual information U_k are denoted as U_k = {p_{1,k}, p_{2,k}, …, p_{m,k}, …, p_{M,k}}, where p_{m,k} represents the m-th picture of U_k and M is the maximum number of pictures in a piece of visual information;

step 1.3, constructing the tag information set of the multi-modal data, denoted as Λ = {Λ_1, Λ_2, …, Λ_k, …, Λ_K}, where Λ_k represents the k-th piece of tag information; the tags contained in the k-th piece of tag information Λ_k are denoted as Λ_k = {λ_{1,k}, λ_{2,k}, …, λ_{o,k}, …, λ_{O,k}}, where λ_{o,k} represents the o-th tag of Λ_k and O is the maximum number of tags in a piece of tag information; all distinct tags together form the tag space Φ;

step 1.4, constructing the multi-modal data set A = {T, P, Λ} from the text information set T, the visual information set P and the tag information set Λ;

step 2, acquiring the feature representation of the text information;

step 2.1, performing word embedding on the k-th piece of text information t_k to obtain the word-vector sequence X_k = {x_{1,k}, x_{2,k}, …, x_{n,k}, …, x_{N,k}}, where x_{n,k} represents the word vector of the n-th word of t_k; text whose length is less than N is padded with zero vectors up to length N;

step 2.2, processing the word-vector sequence X_k with a bidirectional GRU to obtain the word-level text feature representation of t_k, whose n-th element is the word-level text feature of the n-th word of the k-th piece of text information t_k;

taking n as the current time step, for the n-th word vector x_{n,k} the GRU unit obtains the update gate z_{n,k}, the reset gate r_{n,k} and the candidate state by equations (1)-(3), respectively, and then combines them with the GRU output h_{n-1,k} of the (n-1)-th time step by equation (4) to obtain the GRU output h_{n,k} of the n-th time step:

z_{n,k} = σ(W_z · [h_{n-1,k}, x_{n,k}] + b_z)  (1)

r_{n,k} = σ(W_r · [h_{n-1,k}, x_{n,k}] + b_r)  (2)

In equations (1)-(4), ⊙ denotes element-wise multiplication, σ denotes the sigmoid activation function, W_z and b_z denote the weight matrix and bias vector of the update gate z_{n,k}, W_r and b_r denote the weight matrix and bias vector of the reset gate r_{n,k}, and W_h and b_h denote the weight matrix and bias vector of the candidate state; when n = 1, h_{n-1,k} = 0. The forward hidden feature extracted by the forward GRU unit and the backward hidden feature extracted by the backward GRU unit are obtained by equations (5) and (6), respectively; the word-level text feature representation is then obtained by equation (7);

in equations (5) and (6), the two GRU operators denote the forward GRU unit and the backward GRU unit, respectively;

step 2.3, introducing the self-attention mechanism given by equations (8) and (9), and then obtaining the feature representation of the text information by equation (10);

in equations (8)-(10), d is the hidden-layer dimension, W_w and b_w are the weight matrix and bias vector used to compute the hidden representation of the word-level text features, the attention scores of the word-level text features are computed with a separate weight matrix, and l is the text feature dimension;

step 3, obtaining the feature representation of the visual information;

step 3.1, extracting picture region features through a pre-trained VGG-16 network:

the m-th picture p_{m,k} is divided into G regions to obtain its picture-region set, whose g-th element denotes the g-th region of p_{m,k}; the feature of the g-th region of the m-th picture p_{m,k} is then obtained through the VGG-16 network;

step 3.2, learning the feature representation of each picture through text-based visual attention:

the picture-region feature representation combined with the text context and the attention score of the g-th region are obtained by equations (11) and (12), respectively, and the feature of the m-th picture p_{m,k} is then obtained by equation (13);

in equations (11)-(13), the picture-region features and the text features are each multiplied by their own weight matrices, and the picture-region attention scores are computed with a further weight matrix and bias vector;

step 3.3, fusing the multiple pictures through an attention mechanism to obtain the final feature representation of the visual information:

in equations (14)-(16), the latent feature representation of the m-th picture in the k-th piece of visual information has its own weight matrix and bias vector, the attention score of the m-th picture p_{m,k} is computed with a further weight matrix, and the output of equation (16) is the finally learned visual feature;

step 4, deeply fusing the text and visual features:

obtaining the intermediate feature output by layer f of the deep neural network by equation (17), and thereby obtaining the global feature output by the deep neural network;

In equation (17), ReLU denotes the rectified linear unit, W_f and b_f denote the weight matrix and bias vector of the intermediate feature of layer f, f = 1, 2, …, F, where F is the number of layers of the deep neural network; when f = 1, the text feature and the visual feature are concatenated to obtain the initial feature;

step 5, recommending tags using the fused global features;

step 5.1, inputting the global feature into a classifier and obtaining the probability distribution over tags by equation (18);

in equation (18), W_p and b_p are the weight matrix and bias vector of the classifier;

step 5.2, adopting the binary cross-entropy given by equation (19) as the objective function J:

in equation (19), S denotes the number of data pieces of the multi-modal data set A used as the training set, the two probability terms are the ground-truth probability and the predicted probability that the s-th training sample contains tag λ, and λ ∈ Φ;

step 5.3, setting the maximum number of iterations epoch_number, training the objective function J by back propagation and gradient descent, and stopping when the iteration count reaches epoch_number so that J is minimized, thereby obtaining an optimal recommendation model that outputs the probability distribution over tags for the input text information set T and visual information set P; the top tags with the highest probabilities are then selected for recommendation.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention considers text and visual information jointly: text and pictures are different presentations of the same target object and describe different aspects of it from different perspectives, so combining them reflects the characteristics of the object better. On this basis, the invention provides an end-to-end multi-modal tag recommendation framework that effectively extracts and fuses visual and textual features, improving the precision of the tag recommendation task and making the framework more convenient and valuable in practical applications.

2. The invention processes text data with word embeddings and a bidirectional GRU, so the full context of the text is taken into account. An attention mechanism is also introduced to effectively screen out the text information that is helpful to the tag recommendation task, which improves the accuracy of the text features and the interpretability of the model.

3. The invention proposes a hierarchical attention mechanism to extract visual features. First, picture-region features are obtained through a pre-trained VGG-16 network; second, a text-based visual attention mechanism provides context for, and supervises, the learning of each picture's feature representation; finally, a self-attention mechanism effectively fuses the features of the multiple pictures in each piece of data to obtain its visual feature representation. An accurate visual representation is thus obtained in a layer-by-layer, progressive manner, reducing noise and improving the effectiveness of the visual features.

4. The method models the interactive fusion of text and visual features with a deep neural network, mapping the text and visual information into the same feature space and obtaining a global feature representation for each piece of data. Compared with traditional fusion schemes such as weighted sums and inner products, this fuses text and visual information more deeply, yielding a more comprehensive and effective feature representation and improving recommendation accuracy.

Drawings

Fig. 1 is a framework diagram of the multi-modal tag recommendation method based on unidirectional supervised attention proposed in the present invention.

Detailed Description

In this embodiment, the multi-modal tag recommendation method based on unidirectional supervised attention uses a deep learning model that considers image and text features jointly, employs a multi-layer attention mechanism to reduce noise and improve the interpretability of the model, and finally fuses the text and visual features through a deep neural network to learn an accurate data feature representation and thus provide more effective recommended tags. As shown in Fig. 1, the input of the model consists of the text information and the picture set; the text information is fed into the text feature extraction module and the picture set into the visual feature extraction module. The text feature extraction module produces the text feature and the visual feature extraction module produces the visual feature; in addition, the text feature is also fed into the visual feature extraction module, where it generates text-based visual attention and supervises the learning of the visual feature. The text feature and the visual feature are then passed to the feature fusion module to obtain the global feature, and finally the prediction layer computes the probability distribution over tags to generate candidate tags. Specifically, the method comprises the following steps:

step 1, constructing a multi-modal data set A;

step 1.1, constructing the text information set of the multi-modal data: after word segmentation, stop-word removal and noise-word removal on the text data, words with too low a frequency and texts that are too short are deleted; the set is denoted as T = {t_1, t_2, …, t_k, …, t_K}, where t_k represents the k-th piece of text information and K is the total number of text pieces; the words contained in the k-th piece of text information t_k are denoted as t_k = {w_{1,k}, w_{2,k}, …, w_{n,k}, …, w_{N,k}}, where w_{n,k} represents the n-th word of t_k and N is the maximum word length of a text piece;

step 1.2, constructing the visual information set of the multi-modal data, deleting data that contain no pictures; the set is denoted as P = {U_1, U_2, …, U_k, …, U_K}, where U_k represents the k-th piece of visual information; the pictures contained in the k-th piece of visual information U_k are denoted as U_k = {p_{1,k}, p_{2,k}, …, p_{m,k}, …, p_{M,k}}, where p_{m,k} represents the m-th picture of U_k and M is the maximum number of pictures in a piece of visual information;

step 1.3, constructing the tag information set of the multi-modal data, deleting tags with too low a frequency and data without tags; the set is denoted as Λ = {Λ_1, Λ_2, …, Λ_k, …, Λ_K}, where Λ_k represents the k-th piece of tag information; the tags contained in the k-th piece of tag information Λ_k are denoted as Λ_k = {λ_{1,k}, λ_{2,k}, …, λ_{o,k}, …, λ_{O,k}}, where λ_{o,k} represents the o-th tag of Λ_k and O is the maximum number of tags in a piece of tag information; all distinct tags together form the tag space Φ;

step 1.4, constructing the multi-modal data set A = {T, P, Λ} from the text information set T, the visual information set P and the tag information set Λ;
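As an illustration of step 1, a minimal sketch of how the multi-modal data set A = {T, P, Λ} could be organized in code is given below. The class and field names are hypothetical and are not part of the patent; they merely mirror the notation defined above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultiModalRecord:
    """One piece of multi-modal data: text t_k, pictures U_k, tags Λ_k."""
    words: List[str]          # w_{1,k} ... w_{N,k}, padded/truncated to length N
    picture_paths: List[str]  # p_{1,k} ... p_{M,k}, at most M pictures
    tags: List[str]           # λ_{1,k} ... λ_{O,k}, drawn from the tag space Φ

# The data set A is a list of K such records; the tag space Φ is the set
# of all distinct tags occurring in the data.
def build_tag_space(dataset: List[MultiModalRecord]) -> List[str]:
    return sorted({tag for record in dataset for tag in record.tags})
```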

step 2, acquiring the feature representation of the text information; compared with visual data, text data provides richer and higher-quality information about the object, and its importance is markedly higher than that of the visual data;

step 2.1, performing word embedding on the k-th piece of text information t_k to obtain the word-vector sequence X_k = {x_{1,k}, x_{2,k}, …, x_{n,k}, …, x_{N,k}}, where x_{n,k} represents the word vector of the n-th word of t_k; text whose length is less than N is padded with zero vectors up to length N;

step 2.2, processing the word-vector sequence X_k with a bidirectional GRU to obtain the word-level text feature representation of t_k, whose n-th element is the word-level text feature of the n-th word of the k-th piece of text information t_k;

taking n as the current time step, for the n-th word vector x_{n,k} the GRU unit obtains the update gate z_{n,k}, the reset gate r_{n,k} and the candidate state by equations (1)-(3), respectively, and then combines them with the GRU output h_{n-1,k} of the (n-1)-th time step by equation (4) to obtain the GRU output h_{n,k} of the n-th time step:

z_{n,k} = σ(W_z · [h_{n-1,k}, x_{n,k}] + b_z)  (1)

r_{n,k} = σ(W_r · [h_{n-1,k}, x_{n,k}] + b_r)  (2)

In equations (1)-(4), ⊙ denotes element-wise multiplication, σ denotes the sigmoid activation function, W_z and b_z denote the weight matrix and bias vector of the update gate z_{n,k}, W_r and b_r denote the weight matrix and bias vector of the reset gate r_{n,k}, and W_h and b_h denote the weight matrix and bias vector of the candidate state; when n = 1, h_{n-1,k} = 0;
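Equations (3) and (4) are referenced above but their bodies are not reproduced here. A standard GRU formulation that is consistent with the gates and parameters named above would read as follows; this is a reconstruction under that assumption, not a quotation of the patent's own formulas.

```latex
\tilde{h}_{n,k} = \tanh\!\left(W_h \cdot \left[r_{n,k} \odot h_{n-1,k},\, x_{n,k}\right] + b_h\right) \qquad (3)

h_{n,k} = \left(1 - z_{n,k}\right) \odot h_{n-1,k} + z_{n,k} \odot \tilde{h}_{n,k} \qquad (4)
```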

The forward hidden feature extracted by the forward GRU unit and the backward hidden feature extracted by the backward GRU unit are obtained by equations (5) and (6), respectively; the word-level text feature representation is then obtained by equation (7);

in equations (5) and (6), the two GRU operators denote the forward GRU unit and the backward GRU unit, respectively;
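A minimal PyTorch sketch of steps 2.1-2.2 (word embedding followed by a bidirectional GRU) is given below. The vocabulary size, embedding dimension and hidden dimension are placeholder values, and the module illustrates the described pipeline rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Word embedding + bidirectional GRU producing word-level text features."""
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, word_ids):      # word_ids: (batch, N), zero-padded to length N
        x = self.embedding(word_ids)  # (batch, N, embed_dim)
        h, _ = self.bigru(x)          # (batch, N, 2*hidden_dim): forward and backward
        return h                      # hidden states concatenated, cf. eq. (7)
```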

step 2.3, introducing the self-attention mechanism given by equations (8) and (9); the attention mechanism focuses on the text information that is helpful for tag recommendation and filters out noise, so a better feature representation of the text information is obtained by equation (10);

in equations (8)-(10), d is the hidden-layer dimension, W_w and b_w are the weight matrix and bias vector used to compute the hidden representation of the word-level text features, the attention scores of the word-level text features are computed with a separate weight matrix, and l is the text feature dimension;
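The word-level self-attention of step 2.3 can be sketched as follows. Since equations (8)-(10) are not reproduced in the text, the tanh-projection-plus-softmax form below is an assumed additive variant that is merely consistent with the symbols W_w, b_w and the hidden dimension d named above.

```python
import torch
import torch.nn as nn

class WordSelfAttention(nn.Module):
    """Pools word-level features into one text feature via additive self-attention."""
    def __init__(self, feat_dim=512, attn_dim=256):   # attn_dim plays the role of d
        super().__init__()
        self.proj = nn.Linear(feat_dim, attn_dim)     # W_w, b_w
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, word_feats):                    # (batch, N, feat_dim)
        hidden = torch.tanh(self.proj(word_feats))    # hidden representation of each word
        alpha = torch.softmax(self.score(hidden), dim=1)  # attention score per word
        text_feat = (alpha * word_feats).sum(dim=1)   # weighted sum -> (batch, feat_dim)
        return text_feat, alpha
```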

step 3, obtaining the feature representation of the visual information;

step 3.1, extracting picture region features through a pre-trained VGG-16 network:

the m-th picture p_{m,k} is divided into G regions to obtain its picture-region set, whose g-th element denotes the g-th region of p_{m,k}; the feature of the g-th region of the m-th picture p_{m,k} is obtained through the VGG-16 network, taking the output of the network's last pooling layer as the extracted region features; in this embodiment, all pictures are first resized uniformly to 224 × 224 and each picture is then divided into 49 regions;
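For step 3.1, the 49 regions follow from the fact that the last pooling layer of VGG-16 maps a 224 × 224 input to a 7 × 7 × 512 feature map, i.e. 49 regions with 512-dimensional features. A possible extraction with torchvision is sketched below; the ImageNet normalization constants are an assumption of this sketch, not a statement of the patent.

```python
import torch
from torchvision import models, transforms
from PIL import Image

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg16.eval()
region_extractor = vgg16.features           # ends with the last pooling layer

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def extract_region_features(image_path: str) -> torch.Tensor:
    """Returns a (49, 512) tensor: one 512-d feature per picture region."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = region_extractor(img)         # (1, 512, 7, 7)
    return fmap.squeeze(0).flatten(1).t()    # (49, 512)
```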

step 3.2, learning the feature representation of each picture through text-based visual attention; not all regions of a picture are helpful for the tag recommendation task, and the regions to attend to in the same picture differ across contexts. The invention uses the text feature representation learned in step 2 to provide the corresponding context for each picture, thereby filtering noise and finally obtaining an accurate picture representation:

the picture-region feature representation combined with the text context and the attention score of the g-th region are obtained by equations (11) and (12), respectively, and the feature of the m-th picture p_{m,k} is then obtained by equation (13);

in equations (11)-(13), the picture-region features and the text features are each multiplied by their own weight matrices, and the picture-region attention scores are computed with a further weight matrix and bias vector;
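A sketch of the text-supervised region attention of step 3.2 follows. Equations (11)-(13) are not reproduced in the text, so the additive combination below (projecting the regions and the text feature, adding them, and scoring through a tanh layer) is only an assumed form consistent with the weight matrices and bias vector named above.

```python
import torch
import torch.nn as nn

class TextGuidedRegionAttention(nn.Module):
    """Fuses the G region features of one picture under supervision of the text feature."""
    def __init__(self, region_dim=512, text_dim=512, attn_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, attn_dim)           # region weight matrix
        self.text_proj = nn.Linear(text_dim, attn_dim, bias=False)   # text weight matrix
        self.score = nn.Linear(attn_dim, 1)                          # score weights and bias

    def forward(self, regions, text_feat):
        # regions: (batch, G, region_dim), text_feat: (batch, text_dim)
        ctx = torch.tanh(self.region_proj(regions)
                         + self.text_proj(text_feat).unsqueeze(1))   # text-conditioned regions
        alpha = torch.softmax(self.score(ctx), dim=1)                # region attention scores
        picture_feat = (alpha * regions).sum(dim=1)                  # one feature per picture
        return picture_feat, alpha
```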

step 3.3, fusing the multiple pictures through an attention mechanism to obtain the final feature representation of the visual information; not every picture is helpful for the tag recommendation task, and uninformative pictures introduce noise into the model and reduce recommendation precision. This problem is handled well by the attention mechanism, with the following formulas:

in equations (14)-(16), the latent feature representation of the m-th picture in the k-th piece of visual information has its own weight matrix and bias vector, the attention score of the m-th picture p_{m,k} is computed with a further weight matrix, and the output of equation (16) is the finally learned visual feature.
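Step 3.3 can be sketched in the same style: an attention layer over the M picture features of one piece of data that yields a single visual feature. The layer sizes are placeholders, and the correspondence to equations (14)-(16) is approximate.

```python
import torch
import torch.nn as nn

class PictureAttentionFusion(nn.Module):
    """Fuses the M picture features of one piece of data into a single visual feature."""
    def __init__(self, pic_dim=512, attn_dim=256):
        super().__init__()
        self.proj = nn.Linear(pic_dim, attn_dim)           # latent picture representation
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, picture_feats):                      # (batch, M, pic_dim)
        latent = torch.tanh(self.proj(picture_feats))
        alpha = torch.softmax(self.score(latent), dim=1)   # picture attention scores
        visual_feat = (alpha * picture_feats).sum(dim=1)   # fused visual feature
        return visual_feat, alpha
```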

step 4, deeply fusing the text and visual features; text and visual information may describe the target from different perspectives, and their joint effect determines the tags of a post, so the two kinds of features need to be fused effectively. The invention models the fusion process of text and visual features in depth with the following formula:

obtaining the intermediate feature output by layer f of the deep neural network by equation (17), and thereby obtaining the global feature output by the deep neural network;

In equation (17), ReLU denotes the rectified linear unit, W_f and b_f denote the weight matrix and bias vector of the intermediate feature of layer f, f = 1, 2, …, F, where F is the number of layers of the deep neural network; when f = 1, the text feature and the visual feature are concatenated to obtain the initial feature;
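For step 4, a sketch of the deep fusion network follows: the text and visual features are concatenated to form the initial feature (the f = 1 case) and then passed through F ReLU layers in the spirit of equation (17). The number of layers and the layer width are placeholder choices.

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Concatenates text and visual features and fuses them with F ReLU layers."""
    def __init__(self, text_dim=512, visual_dim=512, hidden_dim=512, num_layers=3):
        super().__init__()
        dims = [text_dim + visual_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.Linear(dims[f], dims[f + 1]) for f in range(num_layers))

    def forward(self, text_feat, visual_feat):
        g = torch.cat([text_feat, visual_feat], dim=-1)   # initial feature (f = 1 case)
        for layer in self.layers:                         # g_f = ReLU(W_f g_{f-1} + b_f)
            g = torch.relu(layer(g))
        return g                                          # global feature
```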

step 5, recommending tags using the fused global features;

step 5.1, inputting the global feature into a classifier; in this embodiment the tag recommendation task is cast as a multi-label classification problem, a logistic regression model is adopted as the classifier, each output node of the model corresponds one-to-one to an actual tag, and the feature vector is mapped to a tag probability θ ∈ [0, 1], so that the probability distribution over tags is obtained by equation (18);

in equation (18), W_p and b_p are the weight matrix and bias vector of the classifier;
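Step 5.1 maps the global feature to one probability per tag in Φ. Casting the described per-tag logistic regression as a single linear layer with |Φ| sigmoid outputs gives the following sketch; the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TagClassifier(nn.Module):
    """One sigmoid output per tag in the tag space Φ (multi-label classification)."""
    def __init__(self, feature_dim=512, num_tags=4000):
        super().__init__()
        self.linear = nn.Linear(feature_dim, num_tags)     # W_p, b_p

    def forward(self, global_feat):                        # (batch, feature_dim)
        return torch.sigmoid(self.linear(global_feat))     # probability in [0, 1] per tag
```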

step 5.2, adopting the binary cross-entropy given by equation (19) as the objective function J:

in equation (19), S denotes the number of data pieces of the multi-modal data set A used as the training set; in this embodiment the multi-modal data set is split into a training set, a validation set and a test set in a ratio of 8:1:1. The two probability terms are the ground-truth probability and the predicted probability that the s-th training sample contains tag λ, with λ ∈ Φ;

step 5.3, setting the maximum number of iterations epoch_number to 100 and training the objective function J by back propagation and gradient descent; in this embodiment the gradient-descent method uses a learning rate η = 1e-4 with exponential decay rates β1 = 0.9 and β2. Training stops when the iteration count reaches epoch_number, so that the objective function J is minimized, thereby obtaining an optimal recommendation model that outputs the probability distribution over tags for the input text information set T and visual information set P; the top tags with the highest probabilities are then selected for recommendation.
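A sketch of the training and recommendation procedure of steps 5.2-5.3 is given below, assuming the modules sketched above are combined into one model. The binary cross-entropy of equation (19) corresponds to torch's BCELoss; the optimizer uses β1 = 0.9 and the learning rate 1e-4 stated above, while β2 = 0.999 is an assumed default because the source does not give its value.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epoch_number=100):
    """model maps (word_ids, pictures) -> tag probabilities; targets are 0/1 vectors over Φ."""
    criterion = nn.BCELoss()                            # binary cross-entropy objective
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999))    # β2 = 0.999 is an assumed default
    for epoch in range(epoch_number):
        for word_ids, pictures, targets in train_loader:
            probs = model(word_ids, pictures)           # (batch, |Φ|) tag probabilities
            loss = criterion(probs, targets.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def recommend(model, word_ids, pictures, top_k=5):
    """Returns the indices of the top-k tags with the highest predicted probability."""
    with torch.no_grad():
        probs = model(word_ids, pictures)
    return torch.topk(probs, k=top_k, dim=-1).indices
```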
