Method for solving a video question-answering task based on a multi-modal progressive attention model

Document No.: 1875576    Publication date: 2021-11-23

This technique, "Method for solving a video question-answering task based on a multi-modal progressive attention model", was designed and created by Sun Guanglu, Liu Xinyu, Liang Lili and Li Tianlin on 2021-08-10. The embodiment of the invention provides a method for solving a video question-answering task based on a multi-modal progressive attention model. The method comprises the following steps: first, for the several kinds of modal information in the video question-answering task, the corresponding modal features are extracted; second, the question is used to perform an initial attention pass over the extracted modal features and to compute the corresponding weight scores, and the question is then used to iteratively attend to the important modal features so as to locate the modal features most relevant to the question; third, a multi-modal fusion algorithm is used to achieve cross-modal fusion of the features, and the question is then used to attend to the multi-modal fused representation of the video to find the important video features related to the question; fourth, part of the effective outputs of the model are fused for answer generation. Compared with existing video question-answering solutions, the invention can locate the video frames or video picture regions related to the question more precisely, and achieves better results on the video question-answering task than traditional methods.

1. A method for solving a video question-answering task based on a multi-modal progressive attention model, comprising the following steps:

step S1, obtaining a video and a question to be processed;

step S2, extracting frame features, clip features and audio features of the video as a plurality of modal features of the video, and extracting text features of the question;

step S3, attending to the plurality of modal features of the video respectively using the question to obtain a plurality of modal representations with question guidance, calculating a weight score for each modality using the question, and selecting the modality representation with the highest weight score among the plurality of modalities as a key modality;

step S4, fusing the plurality of modal representations based on a multi-modal fusion algorithm according to the obtained modal representations and weight scores to obtain a video fusion representation of the video;

step S5, attending to the video fusion representation of the video using the question to obtain a video fusion representation with question guidance;

step S6, performing multi-step attention on the features of the key modality using the question, and locating the key-modality features more relevant to the question through multiple rounds of iteration;

step S7, obtaining a predicted answer based on at least the question features, the video fusion representation with question guidance, and the results of the multi-step attention and the multiple rounds of iteration.

2. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the step of extracting the frame features, the clip features and the audio features of the video in step S2 comprises:

step S21, extracting frame features v_f = {f_1, f_2, ..., f_N1} of the video using a pre-trained ResNet model, wherein f_i ∈ R^d denotes the frame feature of the i-th frame in the video, i = 1, 2, 3, …, N1, N1 denotes the number of frames, and d denotes the dimension of the frame features;

step S22, extracting clip features v_c = {c_1, c_2, ..., c_N2} of the video using a pre-trained TSN network, wherein c_j ∈ R^d denotes the clip feature of the j-th clip in the video, j = 1, 2, 3, …, N2, N2 denotes the number of clips, and the dimension of the clip features is the same as the dimension of the frame features;

step S23, converting the audio in the video into a spectrogram based on Mel-frequency cepstral coefficients as the input of a pre-trained GoogLeNet model, and then extracting audio features v_a = {a_1, a_2, ..., a_N3} of the video using the pre-trained GoogLeNet model, wherein a_k ∈ R^d denotes the audio feature of the k-th audio segment in the video, k = 1, 2, 3, …, N3, N3 denotes the number of audio segments, and the dimension of the audio features is the same as the dimension of the frame features;

and wherein the step of extracting the question features in step S2 comprises:

step S24, representing all words in the question with one-hot encoding to obtain the question representation q = {q_1, q_2, ..., q_T}, wherein q_t is the one-hot encoded representation of the t-th word in the question, t = 1, 2, ..., T, and T denotes the length of the question;

step S25, obtaining a word embedding matrix E ∈ R^(|N_vocab| × 300) using the pre-trained word embedding model GloVe, wherein |N_vocab| denotes the vocabulary size of the dataset and 300 denotes the feature dimension of each word vector;

step S26, embedding the question q into a low-dimensional continuous vector space through the word embedding matrix E to obtain word embedding vectors x_t = E·q_t, t = 1, 2, ..., T;

step S27, encoding the word embedding vectors with an LSTM to obtain the text feature h_q of the question, h_q = LSTM_q(x_1, x_2, ..., x_T), wherein LSTM_q(·) denotes the long short-term memory network that processes the word embedding vectors.

3. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1 or 2, wherein the plurality of modal representations with question guidance obtained in step S3 includes a frame representation with question guidance obtained by the following steps:

S31, performing dimension scaling on the question feature h_q and the frame features v_f = (f_1, f_2, ..., f_N1) using a compatibility function, namely mapping the question feature and the frame features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, to obtain a corresponding group of frame vectors e_f, each frame vector being computed as

e_f^i = (h_q · f_i) / √d,

wherein the compatibility function used is a scaled dot-product function, e_f denotes the resulting group of frame vectors, e_f^i denotes the i-th frame vector in the group, f_i denotes the frame feature of the i-th frame in the video, i = 1, 2, 3, …, N1, and d denotes a preset scaling factor;

S32, converting each frame vector e_f^i in the group e_f into a corresponding frame attention weight score α_f^i using an alignment function, so as to obtain the normalized similarity between the question feature and the frame features, the frame attention weight score corresponding to each frame vector being computed as

α_f^i = exp(e_f^i) / Σ_{i1=1..N1} exp(e_f^{i1}),

wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e;

S33, performing a weighted sum of each frame feature f_i and its corresponding frame attention weight score α_f^i using a context-generation function to obtain the frame representation p_f with question guidance, as follows:

p_f = W_1 (Σ_{i=1..N1} α_f^i · f_i) + b_1,

wherein W_1 denotes a trainable weight matrix and b_1 denotes a trainable bias vector.

4. The method for solving the video question-answering task based on the multi-modal progressive attention model according to any one of claims 1-3, wherein the plurality of modal representations with question guidance obtained in step S3 includes a clip representation with question guidance obtained by the following steps:

S34, performing dimension scaling on the question feature h_q and the clip features v_c = (c_1, c_2, ..., c_N2) using a compatibility function, namely mapping the question feature and the clip features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, to obtain a corresponding group of clip vectors e_c, each clip vector being computed as

e_c^j = (h_q · c_j) / √d,

wherein the compatibility function used is a scaled dot-product function, e_c denotes the resulting group of clip vectors, e_c^j denotes the j-th clip vector in the group, c_j denotes the clip feature of the j-th clip in the video, j = 1, 2, 3, …, N2, and d denotes a preset scaling factor;

S35, converting each clip vector e_c^j in the group e_c into a corresponding clip attention weight score α_c^j using an alignment function, so as to obtain the normalized similarity between the question feature and the clip features, the clip attention weight score corresponding to each clip vector being computed as

α_c^j = exp(e_c^j) / Σ_{j1=1..N2} exp(e_c^{j1}),

wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e;

S36, performing a weighted sum of each clip feature c_j and its corresponding clip attention weight score α_c^j using a context-generation function to obtain the clip representation p_c with question guidance, as follows:

p_c = W_2 (Σ_{j=1..N2} α_c^j · c_j) + b_2,

wherein W_2 denotes a trainable weight matrix and b_2 denotes a trainable bias vector.

5. The method for solving the video question-answering task based on the multi-modal progressive attention model according to any one of claims 1-4, wherein the plurality of modal representations with question guidance obtained in step S3 includes an audio representation with question guidance obtained by the following steps:

S37, performing dimension scaling on the question feature h_q and the audio features v_a = (a_1, a_2, ..., a_N3) using a compatibility function, namely mapping the question feature and the audio features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, to obtain a corresponding group of audio vectors e_a, each audio vector being computed as

e_a^k = (h_q · a_k) / √d,

wherein the compatibility function used is a scaled dot-product function, e_a denotes the resulting group of audio vectors, e_a^k denotes the k-th audio vector in the group, a_k denotes the audio feature of the k-th audio segment in the video, k = 1, 2, 3, …, N3, and d denotes a preset scaling factor;

S38, converting each audio vector e_a^k in the group e_a into a corresponding audio attention weight score α_a^k using an alignment function, so as to obtain the normalized similarity between the question feature and the audio features, the audio attention weight score corresponding to each audio vector being computed as

α_a^k = exp(e_a^k) / Σ_{k1=1..N3} exp(e_a^{k1}),

wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e;

S39, performing a weighted sum of each audio feature a_k and its corresponding audio attention weight score α_a^k using a context-generation function to obtain the audio representation p_a with question guidance, as follows:

p_a = W_3 (Σ_{k=1..N3} α_a^k · a_k) + b_3,

wherein W_3 denotes a trainable weight matrix and b_3 denotes a trainable bias vector.

6. The method for solving the video question-answering task based on the multi-modal progressive attention model according to any one of claims 3-5, wherein step S3 further comprises:

using the question feature h_q to calculate weight scores for the frame representation p_f with question guidance, the clip representation p_c with question guidance and the audio representation p_a with question guidance respectively, based on the cosine similarities H_m = <h_q, p_m>, m ∈ {f, c, a}, to obtain the weight score results s_f, s_c, s_a, and selecting from s_f, s_c, s_a the modality with the highest weight score as the key modality p,

wherein <·,·> denotes the cosine similarity calculation, P = {p_f, p_c, p_a} denotes the several modal features with question guidance, H = {H_f, H_c, H_a} denotes the similarity between the question feature h_q and the different modal features with question guidance P = {p_f, p_c, p_a}, S = {s_f, s_c, s_a} denotes the weight score results obtained after the question feature h_q attends to the different modal features with question guidance P = {p_f, p_c, p_a}, and p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}.

7. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the multi-modal fusion representation of the video in step S4 is obtained by:

fusing the frame representation p_f with question guidance, the clip representation p_c with question guidance, the audio representation p_a with question guidance and their respective weight scores s_f, s_c, s_a together with the multi-modal compact bilinear model MCB according to the following formula, to obtain the video fusion representation v_u:

v_u = MCBFusion(s_f·p_f, s_c·p_c, s_a·p_a).

8. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein step S5 comprises:

step S51, combining the obtained video fusion representation v_u with the hidden state h_q^t output at time t by the long short-term memory network LSTM_q that encodes the question, according to the following formula, and using the result as the input of a bidirectional long short-term memory network Bi_LSTM_a at time t:

h_a^t = Bi_LSTM_a(v_u ⊙ h_q^t),

wherein ⊙ denotes element-wise multiplication, Bi_LSTM_a(·) denotes the bidirectional long short-term memory network, and h_a^t denotes the hidden state of Bi_LSTM_a at the t-th time step of the encoding process;

step S52, using the hidden state h_a^t of Bi_LSTM_a at time t to attend to the obtained video fusion representation v_u, so as to obtain the video fusion representation v_o with question guidance,

wherein W_4, W_5 and W_6 denote trainable weight matrices, b_5 and b_6 denote trainable bias vectors, e_t denotes the weight obtained by calculating the similarity between the video fusion feature and the video features, and α_t denotes the attention weight distribution after weight normalization.

9. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein step S6 comprises:

step S61, initializing the query condition as z_0 = h_q, wherein h_q denotes the question feature;

step S62, using the query condition z_r to attend to the obtained key modality p to obtain a key-modality representation x_r with question guidance, with the attention weights computed as

α_r = softmax(W_8·e_r + b_8),

wherein W_7, W_8 and W_9 denote trainable weight matrices, b_7, b_8 and b_9 denote trainable bias vectors, p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}, z_r denotes the query condition updated at the r-th iteration, r = 0, 1, 2, …, R, e_r denotes the weight calculated from the similarity between the question feature and the key-modality features, and α_r denotes the attention weight distribution after weight normalization;

step S63, iteratively updating the query condition from z_{r-1}, the query condition of the (r-1)-th iteration, and x_{r-1}, the key-modality representation with question guidance obtained by the (r-1)-th query, r = 1, 2, …, R;

step S64, using the query condition updated in step S63 and executing step S62 to perform multi-step iterative attention on the key modality p, so as to obtain the key-modality feature x_R more relevant to the question.

10. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the predicted answer is obtained in step S7 by the following steps:

fusing the following four pieces of information for answer generation: the memory cell state c_q output by LSTM_q in step S2, the memory cell state c_a output by Bi_LSTM_a in step S5, the video representation v_o with question guidance obtained in step S5, and the iterative attention result x_R obtained in step S6,

wherein W_10 denotes a trainable weight matrix, W_answer denotes the weight matrix of the vocabulary, Answer denotes the generated answer, and argmax denotes selecting the highest score as the prediction result.

Technical Field

The embodiment of the invention relates to the technical field of video question answering, and in particular to a method for solving a video question-answering task based on a multi-modal progressive attention model.

Background

In recent years, video question answering has emerged as a very challenging new area that is attracting the attention of researchers. The task requires a model to understand the semantic information shared between the video and the question and to generate an answer based on that information. Open-ended questions are a difficult question type in current video question-answering tasks, since they require the model to generate natural-language answers automatically.

In the question-answering task, video information is more complex than image information. A video is an image sequence with strong temporal dynamics and contains a large number of redundant frames irrelevant to the question, which weakens the relevance between the video representation and the question and prevents the model from accurately locating the video information related to the question. Experiments show that applying attention models to the video question-answering task effectively alleviates this issue and clearly improves the accuracy of the model.

Most current video question-answering approaches extract only the frame features and clip features of the video and ignore its audio features entirely, so the effective information in the video is not fully exploited. Moreover, because different modal features carry overlapping information and have different forms of expression, feature fusion that relies only on basic operations such as dot products and concatenation cannot model the complex relationship between two modalities. To address these problems, the present method uses a multi-modal progressive attention model to locate, in a stage-by-stage manner, the video frames or video picture regions related to the question.

Disclosure of Invention

In this context, embodiments of the present invention are expected to provide a method for solving a video question-answering task based on a multi-modal progressive attention model, so as to overcome the inability of the prior art to provide sufficiently accurate answers for the video question-answering task.

In a first aspect of the embodiments of the present invention, there is provided a method for solving a video question-answering task based on a multi-modal progressive attention model, including: step S1, obtaining a video and a question to be processed; step S2, extracting frame features, clip features and audio features of the video as a plurality of modal features of the video, and extracting text features of the question; step S3, attending to the plurality of modal features of the video respectively using the question to obtain a plurality of modal representations with question guidance, calculating a weight score for each modality using the question, and selecting the modality representation with the highest weight score among the plurality of modalities as a key modality; step S4, fusing the plurality of modal representations based on a multi-modal fusion algorithm according to the obtained modal representations and weight scores to obtain a video fusion representation of the video; step S5, attending to the video fusion representation of the video using the question to obtain a video fusion representation with question guidance; step S6, performing multi-step attention on the features of the key modality using the question, and locating the key-modality features more relevant to the question through multiple rounds of iteration; and step S7, obtaining a predicted answer based on at least the question features, the video fusion representation with question guidance, and the results of the multi-step attention and the multiple rounds of iteration.

Further, the step of extracting the frame features, the clip features and the audio features of the video in step S2 includes: step S21, extracting frame features v_f = {f_1, f_2, ..., f_N1} of the video using a pre-trained ResNet model, wherein f_i ∈ R^d denotes the frame feature of the i-th frame in the video, i = 1, 2, 3, …, N1, N1 denotes the number of frames, and d denotes the dimension of the frame features; step S22, extracting clip features v_c = {c_1, c_2, ..., c_N2} of the video using a pre-trained TSN network, wherein c_j ∈ R^d denotes the clip feature of the j-th clip in the video, j = 1, 2, 3, …, N2, N2 denotes the number of clips, and the dimension of the clip features is the same as the dimension of the frame features; step S23, converting the audio in the video into a spectrogram based on Mel-frequency cepstral coefficients as the input of a pre-trained GoogLeNet model, and then extracting audio features v_a = {a_1, a_2, ..., a_N3} of the video using the pre-trained GoogLeNet model, wherein a_k ∈ R^d denotes the audio feature of the k-th audio segment in the video, k = 1, 2, 3, …, N3, N3 denotes the number of audio segments, and the dimension of the audio features is the same as the dimension of the frame features;

The step of extracting the question features in step S2 includes: step S24, representing all words in the question with one-hot encoding to obtain the question representation q = {q_1, q_2, ..., q_T}, wherein q_t is the one-hot encoded representation of the t-th word in the question, t = 1, 2, 3, …, T, and T denotes the length of the question; step S25, obtaining a word embedding matrix E ∈ R^(|N_vocab| × 300) using the pre-trained word embedding model GloVe, wherein |N_vocab| denotes the vocabulary size of the dataset and 300 denotes the feature dimension of each word vector; step S26, embedding the question q into a low-dimensional continuous vector space through the word embedding matrix E to obtain word embedding vectors x_t = E·q_t, t = 1, 2, ..., T; step S27, encoding the word embedding vectors with an LSTM to obtain the text feature h_q of the question, h_q = LSTM_q(x_1, x_2, ..., x_T), wherein LSTM_q(·) denotes the long short-term memory network that processes the word embedding vectors.

Further, the plurality of modal representations with question guidance obtained in step S3 includes a frame representation with question guidance obtained by the following steps: S31, performing dimension scaling on the question feature h_q and the frame features v_f = (f_1, f_2, ..., f_N1) using a compatibility function, namely mapping the question feature and the frame features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, to obtain the corresponding group of frame vectors e_f, each frame vector being computed as e_f^i = (h_q · f_i)/√d, wherein the compatibility function used is a scaled dot-product function, e_f denotes the resulting group of frame vectors, e_f^i denotes the i-th frame vector in the group, f_i denotes the frame feature of the i-th frame in the video, i = 1, 2, 3, …, N1, and d denotes a preset scaling factor; S32, converting each frame vector e_f^i in the group e_f into a corresponding frame attention weight score α_f^i using an alignment function to obtain the normalized similarity between the question feature and the frame features, each score being computed as α_f^i = exp(e_f^i)/Σ_{i1=1..N1} exp(e_f^{i1}), wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e; S33, performing a weighted sum of each frame feature f_i and its corresponding frame attention weight score α_f^i using a context-generation function to obtain the frame representation p_f with question guidance: p_f = W_1(Σ_{i=1..N1} α_f^i·f_i) + b_1, wherein W_1 denotes a trainable weight matrix and b_1 denotes a trainable bias vector.

Further, the plurality of modal representations with question guidance obtained in step S3 includes a clip representation with question guidance obtained by the following steps: S34, performing dimension scaling on the question feature h_q and the clip features v_c = (c_1, c_2, ..., c_N2) using a compatibility function, namely mapping the question feature and the clip features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, to obtain the corresponding group of clip vectors e_c, each clip vector being computed as e_c^j = (h_q · c_j)/√d, wherein the compatibility function used is a scaled dot-product function, e_c denotes the resulting group of clip vectors, e_c^j denotes the j-th clip vector in the group, c_j denotes the clip feature of the j-th clip in the video, j = 1, 2, 3, …, N2, and d denotes a preset scaling factor; S35, converting each clip vector e_c^j in the group e_c into a corresponding clip attention weight score α_c^j using an alignment function to obtain the normalized similarity between the question feature and the clip features, each score being computed as α_c^j = exp(e_c^j)/Σ_{j1=1..N2} exp(e_c^{j1}), wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e; S36, performing a weighted sum of each clip feature c_j and its corresponding clip attention weight score α_c^j using a context-generation function to obtain the clip representation p_c with question guidance: p_c = W_2(Σ_{j=1..N2} α_c^j·c_j) + b_2, wherein W_2 denotes a trainable weight matrix and b_2 denotes a trainable bias vector.

Further, the plurality of modal representations with question guidance obtained in step S3 includes an audio representation with question guidance obtained by the following steps: S37, performing dimension scaling on the question feature h_q and the audio features v_a = (a_1, a_2, ..., a_N3) using a compatibility function, namely mapping the question feature and the audio features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, to obtain the corresponding group of audio vectors e_a, each audio vector being computed as e_a^k = (h_q · a_k)/√d, wherein the compatibility function used is a scaled dot-product function, e_a denotes the resulting group of audio vectors, e_a^k denotes the k-th audio vector in the group, a_k denotes the audio feature of the k-th audio segment in the video, k = 1, 2, 3, …, N3, and d denotes a preset scaling factor; S38, converting each audio vector e_a^k in the group e_a into a corresponding audio attention weight score α_a^k using an alignment function to obtain the normalized similarity between the question feature and the audio features, each score being computed as α_a^k = exp(e_a^k)/Σ_{k1=1..N3} exp(e_a^{k1}), wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e; S39, performing a weighted sum of each audio feature a_k and its corresponding audio attention weight score α_a^k using a context-generation function to obtain the audio representation p_a with question guidance: p_a = W_3(Σ_{k=1..N3} α_a^k·a_k) + b_3, wherein W_3 denotes a trainable weight matrix and b_3 denotes a trainable bias vector.

Further, step S3 also includes: using the question feature h_q to calculate weight scores for the frame representation p_f with question guidance, the clip representation p_c with question guidance and the audio representation p_a with question guidance respectively, based on the cosine similarities H_m = <h_q, p_m>, m ∈ {f, c, a}, obtaining the weight score results s_f, s_c, s_a, and selecting from s_f, s_c, s_a the modality with the highest weight score as the key modality p, wherein <·,·> denotes the cosine similarity calculation, P = {p_f, p_c, p_a} denotes the several modal features with question guidance, H = {H_f, H_c, H_a} denotes the similarity between the question feature h_q and the different modal features with question guidance P, S = {s_f, s_c, s_a} denotes the weight score results obtained after the question feature h_q attends to the different modal features with question guidance P, and p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}.

Further, the multi-modal fusion representation of the video in step S4 is obtained by: fusing the frame representation p_f with question guidance, the clip representation p_c with question guidance, the audio representation p_a with question guidance and their respective weight scores s_f, s_c, s_a together with the multi-modal compact bilinear model MCB according to the following formula, to obtain the video fusion representation v_u: v_u = MCBFusion(s_f·p_f, s_c·p_c, s_a·p_a).

Further, step S5 includes: step S51, combining the obtained video fusion representation v_u with the hidden state h_q^t output at time t by the long short-term memory network LSTM_q that encodes the question, according to h_a^t = Bi_LSTM_a(v_u ⊙ h_q^t), and using the result as the input of the bidirectional long short-term memory network Bi_LSTM_a at time t, wherein ⊙ denotes element-wise multiplication, Bi_LSTM_a(·) denotes the bidirectional long short-term memory network, and h_a^t denotes the hidden state of Bi_LSTM_a at the t-th time step of the encoding process; step S52, using the hidden state h_a^t of Bi_LSTM_a at time t to attend to the obtained video fusion representation v_u to obtain the video fusion representation v_o with question guidance,

wherein W_4, W_5 and W_6 denote trainable weight matrices, b_5 and b_6 denote trainable bias vectors, e_t denotes the weight obtained by calculating the similarity between the video fusion feature and the video features, and α_t denotes the attention weight distribution after weight normalization.

Further, step S6 includes: step S61, initializing the query condition as z_0 = h_q, wherein h_q denotes the question feature; step S62, using the query condition z_r to attend to the obtained key modality p to obtain a key-modality representation x_r with question guidance, with the attention weights computed as

α_r = softmax(W_8·e_r + b_8),

wherein W_7, W_8 and W_9 denote trainable weight matrices, b_7, b_8 and b_9 denote trainable bias vectors, p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}, z_r denotes the query condition updated at the r-th iteration, r = 0, 1, 2, …, R, e_r denotes the weight calculated from the similarity between the question feature and the key-modality features, and α_r denotes the attention weight distribution after weight normalization; step S63, iteratively updating the query condition from z_{r-1}, the query condition of the (r-1)-th iteration, and x_{r-1}, the key-modality representation with question guidance obtained by the (r-1)-th query, r = 1, 2, …, R; step S64, using the query condition updated in step S63 and executing step S62 to perform multi-step iterative attention on the key modality p, obtaining the key-modality feature x_R more relevant to the question.

Further, in step S7, the predicted answer is obtained by the following steps: fusing the following four pieces of information for answer generation: the memory cell state c_q output by LSTM_q in step S2, the memory cell state c_a output by Bi_LSTM_a in step S5, the video representation v_o with question guidance obtained in step S5, and the iterative attention result x_R obtained in step S6,

wherein W_10 denotes a trainable weight matrix, W_answer denotes the weight matrix of the vocabulary, Answer denotes the generated answer, and argmax denotes selecting the highest score as the prediction result.
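The exact fusion formula referred to above is not reproduced in this text. As a rough illustration only, a minimal PyTorch sketch of one way such a fusion-and-decode step could look is given below; the concatenation followed by the two linear layers W_10 and W_answer, the tanh non-linearity, and the assumption that all four inputs share the same dimension are choices of the sketch, not part of the disclosure.

import torch
import torch.nn as nn

class AnswerGenerator(nn.Module):
    """Sketch of step S7: fuse the LSTM_q cell state, the Bi_LSTM_a cell state,
    the question-guided fusion v_o and the iterative-attention result, then
    score the vocabulary and take the argmax as the predicted answer."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.W10 = nn.Linear(4 * dim, dim)          # trainable fusion matrix W_10
        self.W_answer = nn.Linear(dim, vocab_size)  # vocabulary weight matrix W_answer

    def forward(self, c_q, c_a, v_o, p_hat):
        # Concatenation + tanh is an assumed fusion; the patent text only lists
        # the four inputs and the two weight matrices.
        fused = torch.tanh(self.W10(torch.cat([c_q, c_a, v_o, p_hat], dim=-1)))
        logits = self.W_answer(fused)
        return torch.softmax(logits, dim=-1).argmax(dim=-1)   # Answer = argmax(...)

gen = AnswerGenerator(dim=256, vocab_size=4000)
answer_id = gen(torch.randn(2, 256), torch.randn(2, 256),
                torch.randn(2, 256), torch.randn(2, 256))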

In a second aspect of embodiments of the present invention, there is provided a storage medium storing a program which, when executed by a processor, implements a method of solving a video question and answer task based on a multi-modal progressive attention model as described above.

In a third aspect of embodiments of the present invention, there is provided a computing device comprising the storage medium described above.

According to the method for solving the video question-answering task based on the multi-modal progressive attention model provided by the embodiments of the present invention, the following effects can be achieved:

(1) Compared with the prior art, the method uses the cooperative work of several attention models with different functions to more accurately locate the video frames or video picture regions related to the question.

(2) The invention realizes the cross-modal fusion of the features by utilizing an improved multi-modal fusion algorithm, and improves the representation capability after the features are fused.

Drawings

The foregoing and other objects, features and advantages of exemplary embodiments of the present invention will be readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 schematically illustrates a flow diagram of one exemplary process for a method for solving a video question-and-answer task based on a multi-modal progressive attention model, according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating an achievable system architecture of the method for solving the video question-answering task based on the multi-modal progressive attention model of the present invention;

FIG. 3 is a diagram illustrating an example of the results of the method of the present invention for solving a video question-answering task based on a multi-modal progressive attention model;

FIG. 4 schematically shows a schematic structural diagram of a computer according to an embodiment of the present invention;

FIG. 5 schematically shows an illustrative diagram of a computer-readable storage medium according to an embodiment of the invention.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given only for the purpose of enabling those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to an embodiment of the invention, a method for solving a video question-answering task based on a multi-modal progressive attention model is provided.

In this document, it is to be understood that the number of any element in the figures is intended to be illustrative rather than restrictive, and that any nomenclature is used for differentiation only and not in any limiting sense.

The principles and spirit of the present invention are explained in detail below with reference to several exemplary embodiments thereof.

Summary of the Invention

The inventors have found that, first, several kinds of modal features are extracted from the video and from the question respectively; the extracted multi-modal features are then fed into several attention models, and finally the effective information in the output results of all the modules is fused for answer generation.

The method for solving the video question-answering task based on the multi-modal progressive attention model comprises the following steps: step S1, obtaining a video and a question to be processed; step S2, extracting frame features, clip features and audio features of the video as a plurality of modal features of the video, and extracting text features of the question; step S3, attending to the plurality of modal features of the video respectively using the question to obtain a plurality of modal representations with question guidance, calculating a weight score for each modality using the question, and selecting the modality representation with the highest weight score among the plurality of modalities as a key modality; step S4, fusing the plurality of modal representations based on a multi-modal fusion algorithm according to the obtained modal representations and weight scores to obtain a video fusion representation of the video; step S5, attending to the video fusion representation of the video using the question to obtain a video fusion representation with question guidance; step S6, performing multi-step attention on the features of the key modality using the question, and locating the key-modality features more relevant to the question through multiple rounds of iteration; step S7, obtaining a predicted answer based on at least the question features, the video fusion representation with question guidance, and the results of the multi-step attention and the multiple rounds of iteration. A schematic sketch of this overall flow is given below.
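The flow above can be pictured with the following Python sketch; it is only a schematic of how steps S3-S7 chain together, with the per-step modules passed in as placeholder callables (attend, score_modality, fuse and so on are illustrative names, and the question-encoder cell states used in step S7 are omitted for brevity).

from typing import Callable, Dict
import numpy as np

def progressive_video_qa(
    video_feats: Dict[str, np.ndarray],      # {"frame": (N1,d), "clip": (N2,d), "audio": (N3,d)}
    question_feat: np.ndarray,               # (d,), e.g. final state of LSTM_q
    attend: Callable[[np.ndarray, np.ndarray], np.ndarray],     # question-guided attention (S3)
    score_modality: Callable[[np.ndarray, np.ndarray], float],  # modality weight score (S3)
    fuse: Callable[..., np.ndarray],                            # multi-modal fusion, e.g. MCB (S4)
    secondary_attend: Callable[[np.ndarray, np.ndarray], np.ndarray],  # step S5
    iterative_attend: Callable[[np.ndarray, np.ndarray], np.ndarray],  # step S6
    generate_answer: Callable[..., str],                        # step S7
) -> str:
    # Step S3: question-guided attention per modality + weight scores.
    guided = {m: attend(question_feat, f) for m, f in video_feats.items()}
    scores = {m: score_modality(question_feat, p) for m, p in guided.items()}
    key_modality = max(scores, key=scores.get)

    # Step S4: weighted multi-modal fusion of the guided representations.
    v_u = fuse(*[scores[m] * guided[m] for m in guided])

    # Step S5: question-guided attention over the fused representation.
    v_o = secondary_attend(question_feat, v_u)

    # Step S6: multi-step iterative attention over the key modality.
    p_hat = iterative_attend(question_feat, video_feats[key_modality])

    # Step S7: fuse the partial outputs and decode an answer.
    return generate_answer(question_feat, v_o, p_hat)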

Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.

Exemplary method

Fig. 1 schematically illustrates an exemplary process flow of a method for solving a video question-answering task based on a multi-modal progressive attention model according to an embodiment of the present disclosure. Fig. 2 shows a system structure that can be realized by the method.

As shown in fig. 1, when the process flow starts, step S1 is first executed.

Step S1, obtaining the video and the question to be processed.

For example, the video and questions to be processed may be user input, may be received externally by the system, or may be downloaded from a predetermined website, etc.

As an example, in the embodiment of the present invention, the specific processing flow is described using English as an example. It should be understood that the language of the question is not limited to the English shown in Fig. 2; it may be another language such as Chinese, Japanese, Korean or French. Accordingly, the language of the predicted answer may be the same as the language of the question, or may be set to one or more selectable languages according to the user's selection.

Step S2, extracting frame features, clip features and audio features of the video as a plurality of modal features of the video, and extracting text features of the question.

As an example, in step S2, the frame feature, the clip feature, and the audio feature of the video may be extracted, for example, by steps S21 to S23 described below.

In step S21, the frame features of the video are extracted using the previously trained ResNet model.

For example, with v_f representing the frame features of the video, then:

v_f = {f_1, f_2, ..., f_N1},

wherein f_i represents the frame feature of the i-th frame in the video, f_i ∈ R^d, i = 1, 2, 3, …, N1, N1 denotes the number of frames and N1 is a natural number, and d denotes a preset scaling factor, which here (corresponding to the frame features of the video) denotes the frame feature dimension.

As described above, in the embodiment of the present invention, the ResNet model is pre-trained on ImageNet, i.e. the trained ResNet model is used to extract the frame features of the video. It should be noted that the ResNet model is not intended to limit this example; in other examples, other pre-trained models for extracting frame features may also be used, and the details are not repeated here.

Thus, in step S22, the clip features of the video can be extracted using a pre-trained TSN (Temporal Segment Networks) network.

For example, with v_c representing the clip features of the video, then:

v_c = {c_1, c_2, ..., c_N2},

wherein c_j represents the clip feature of the j-th clip in the video, c_j ∈ R^d, j = 1, 2, 3, …, N2, N2 denotes the number of clips and N2 is a natural number, and d denotes a preset scaling factor, which here (corresponding to the clip features of the video) denotes the clip feature dimension. The dimension of the clip features is the same as the frame feature dimension.

As described above, in the embodiment of the present invention, the TSN network is trained in advance, i.e. the trained TSN network is used to extract the clip features of the video. It should be noted that the TSN network is not intended to limit this example; in other examples, other pre-trained network models for extracting clip features may also be used, and the details are not repeated here.

Then, in step S23, the audio in the video is converted into a spectrogram based on Mel-frequency cepstral coefficients, which serves as the input of a pre-trained GoogLeNet model; the pre-trained GoogLeNet model is then used to extract the audio features of the video.

For example, with v_a representing the audio features of the video, then:

v_a = {a_1, a_2, ..., a_N3},

wherein a_k represents the audio feature of the k-th audio segment in the video, a_k ∈ R^d, k = 1, 2, 3, …, N3, N3 denotes the number of audio segments and N3 is a natural number, and d denotes a preset scaling factor, which here (corresponding to the audio features of the video) denotes the audio feature dimension. The dimension of the audio features is the same as the frame feature dimension.

In this way, a plurality of modal features of the video may be extracted in step S2 by the method described above.
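As a minimal sketch of how the frame and audio branches of step S2 could be realized with publicly available pre-trained models, the following PyTorch/torchvision/librosa code extracts pooled ResNet features for sampled frames and GoogLeNet features for a spectrogram image; the clip branch is omitted because a standard pre-trained TSN is not bundled with torchvision, the mel spectrogram stands in for the MFCC-based spectrogram described above, and in practice the feature sets would still be projected to a common dimension d.

import torch
import torchvision.models as models
import torchvision.transforms as T
import numpy as np
import librosa

# Frame features: a pre-trained ResNet with its classification head removed.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()          # keep the 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames: list) -> torch.Tensor:
    """frames: list of HxWx3 uint8 arrays sampled from the video -> (N1, 2048)."""
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch)

# Audio features: mel spectrogram fed to a pre-trained GoogLeNet; replicating
# the single channel to fake an RGB image is an assumption of this sketch.
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = torch.nn.Identity()       # keep the 1024-d pooled feature
googlenet.eval()

@torch.no_grad()
def audio_features(waveform: np.ndarray, sr: int) -> torch.Tensor:
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=224)
    img = torch.tensor(librosa.power_to_db(mel), dtype=torch.float32)
    img = img.unsqueeze(0).repeat(3, 1, 1)                       # (3, 224, time)
    img = torch.nn.functional.interpolate(img.unsqueeze(0), size=(224, 224))
    return googlenet(img)                                        # (1, 1024)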

It is worth mentioning that, in the above example, three features of a frame feature, a clip feature, and an audio feature of a video are adopted as the plurality of modal features of the video, but the embodiment of the present invention is not limited thereto.

For example, in the embodiment of the present invention, at least two of the frame feature, the clip feature, the audio feature, the clip audio feature, and the frame audio feature of the video may be selected as the plurality of modal features of the video.

For example, the clip features v_c and the audio features v_a can be fused to obtain clip-audio features v_ca. Specific feature fusion methods include linear addition, linear multiplication, concatenation, and so on. Taking concatenation of the two features as an example, the clip-audio features v_ca are obtained as follows:

v_ca = [v_c, v_a],

wherein [·,·] denotes the operation of concatenating two features. Compared with the individual clip and audio features, the clip-audio features carry richer information and stronger semantic information.

As another example, frame-audio features v_fa can be extracted in a way similar to the clip-audio features v_ca, and the resulting frame-audio features v_fa are more informative than the features of a single modality.
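The concatenation-based fusion just described reduces to a single call; the sketch below assumes the two feature sequences have already been aligned to the same number of segments.

import numpy as np

def concat_fuse(v_c: np.ndarray, v_a: np.ndarray) -> np.ndarray:
    """Clip-audio feature v_ca = [v_c, v_a]: concatenation along the feature axis."""
    return np.concatenate([v_c, v_a], axis=-1)

v_ca = concat_fuse(np.random.rand(8, 512), np.random.rand(8, 512))   # shape (8, 1024)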

Further, in step S2, the question features may be extracted through steps S24 to S27 described below.

In step S24, all words in the question are represented by one-hot encoding, giving the corresponding question representation q = {q_1, q_2, ..., q_T}. For example, when the language of the question is English, all words in the question may be represented by one-hot encoding in step S24.

Here q_t is the one-hot encoded representation of the t-th word in the question, t = 1, 2, 3, …, T, T is the length of the question (i.e. the number of words it contains), and T is a natural number.

Then, in step S25, a word embedding matrix E ∈ R^(|N_vocab| × 300) is obtained using a pre-trained word embedding model (such as the GloVe model),

wherein |N_vocab| denotes the number of words in the predetermined dataset (the vocabulary size) and 300 denotes the feature dimension of each word vector in the word embedding matrix.

Next, in step S26, the question q is embedded into a low-dimensional continuous vector space through the obtained word embedding matrix, giving the corresponding word embedding vectors x_t = E·q_t, t = 1, 2, ..., T.

In this way, in step S27, the word embedding vectors obtained in step S26 may be encoded using an LSTM (Long Short-Term Memory network), thereby obtaining the text feature h_q of the question, h_q = LSTM_q(x_1, x_2, ..., x_T), where LSTM_q(·) denotes the long short-term memory network that processes the word embedding vectors. Since the question, unlike a picture, is sequence data, encoding the question features with an LSTM avoids information loss and better preserves the semantic information of each word in the question.
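A minimal PyTorch sketch of steps S24-S27 is given below; the hidden dimension and the use of the final hidden state as the question feature h_q are assumptions of the sketch, and the GloVe matrix is stubbed with random weights.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """One-hot word ids -> GloVe embeddings -> LSTM_q; assumes 300-d GloVe
    vectors already loaded into glove_weights (|N_vocab| x 300)."""
    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 512):
        super().__init__()
        # x_t = E * q_t: a one-hot lookup is just an embedding-table lookup.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm_q = nn.LSTM(input_size=glove_weights.size(1),
                              hidden_size=hidden_dim, batch_first=True)

    def forward(self, word_ids: torch.Tensor):
        x = self.embed(word_ids)                 # (B, T, 300)
        outputs, (h_T, c_T) = self.lstm_q(x)     # h_T: final hidden state, c_T: cell state
        # h_T[-1] is used as the question feature; c_T is kept for answer generation (step S7).
        return outputs, h_T[-1], c_T[-1]

enc = QuestionEncoder(torch.randn(10000, 300))
out, h_q, c_q = enc(torch.randint(0, 10000, (2, 12)))   # batch of 2 questions, 12 tokens each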

As shown in Fig. 2, for example, the corresponding video features can be extracted by the three models ResNet, TSN and GoogLeNet shown in the figure (as video feature extraction models), and the text features of the question can be extracted by the GloVe + LSTM model shown in Fig. 2 (as the question feature extraction model).

Step S3 attends to the plurality of modal features of the video respectively using the question to obtain a plurality of modal representations with question guidance, calculates the weight score of each modality using the question, and selects the modality representation with the highest weight score among the plurality of modalities as the key modality.

It should be noted that the plurality of modalities refer to modalities corresponding to the frame feature, the clip feature, and the audio feature (optionally, other features may be included).

In step S3, the obtained plurality of modal representations with question guidance includes, for example: a frame representation with question guidance; a clip representation with question guidance; and an audio representation with question guidance.

The above-described frame representation with the question guidance can be obtained by, for example, steps S31 to S33 described below.

In step S31, the question feature h_q and the frame features v_f = (f_1, f_2, ..., f_N1) are dimension-scaled using a compatibility function, i.e. the question feature h_q and the frame features v_f are mapped from their high-dimensional feature spaces to the same low-dimensional feature space for similarity calculation (i.e. the semantic similarity between the question feature and the frame features is computed), giving the corresponding group of frame vectors e_f, each frame vector of which is computed as:

e_f^i = (h_q · f_i) / √d,

wherein the compatibility function used in step S31 is a scaled dot-product function, e_f denotes the resulting group of frame vectors, e_f^i denotes the i-th frame vector in the group, f_i denotes the frame feature of the i-th frame in the video, i = 1, 2, 3, …, N1, and d denotes the preset scaling factor.

It should be noted that, in the embodiments of the present invention, mapping A and B from a high-dimensional feature space to the same low-dimensional feature space means that both A and B are mapped from a high-dimensional feature space to the same low-dimensional feature space; for example, A is mapped from a 2048-dimensional feature space to a 256-dimensional feature space and B is also mapped from a 2048-dimensional feature space to a 256-dimensional feature space, or A is mapped from a 2048-dimensional feature space to a 256-dimensional feature space and B is mapped from a 1024-dimensional feature space to a 256-dimensional feature space. In other words, A and B are each mapped from their respective high-dimensional spaces to a low-dimensional space of the same dimension.

Next, in step S32, each frame vector e_f^i in the group e_f is converted into a corresponding frame attention weight score α_f^i using an alignment function, so as to obtain the normalized similarity between the question feature and the frame features; the frame attention weight score corresponding to each frame vector is computed as:

α_f^i = exp(e_f^i) / Σ_{i1=1..N1} exp(e_f^{i1}),

wherein the normalization function used in step S32 is the softmax function, exp(·) denotes the exponential function with the natural base e, e_f^{i1} corresponds to e_f^i with i = i1, and i1 ranges from 1 to N1.

Thus, in step S33, each frame feature f_i and its corresponding frame attention weight score α_f^i are combined by a weighted sum using a context-generation function (i.e., the frame features are weighted and summed based on the frame attention weight score corresponding to each frame feature) to obtain the frame representation p_f with question guidance, as shown in the following formula:

p_f = W_1 (Σ_{i=1..N1} α_f^i · f_i) + b_1,

wherein W_1 denotes a trainable weight matrix and b_1 denotes a trainable bias vector.
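Steps S31-S33 amount to a single question-guided attention pass, sketched below in NumPy; the scaled dot-product form e_f^i = (h_q · f_i)/√d and the prior projection of h_q and the frame features to a common dimension are assumptions carried over from the reconstruction above.

import numpy as np

def question_guided_attention(h_q: np.ndarray, feats: np.ndarray,
                              W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Question-guided attention over one modality (frames, clips or audio).

    h_q: (d,) question feature; feats: (N, d) modality features;
    W: (d_out, d), b: (d_out,) trainable parameters (W_1, b_1 in the text)."""
    d = feats.shape[1]
    e = feats @ h_q / np.sqrt(d)                        # S31: compatibility scores e_f^i
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # S32: softmax attention weights
    context = alpha @ feats                             # weighted sum of the features
    return W @ context + b                              # S33: question-guided representation p_f

p_f = question_guided_attention(np.random.rand(256), np.random.rand(40, 256),
                                np.random.rand(256, 256), np.zeros(256))

The same function applies unchanged to the clip features (steps S34-S36, parameters W_2, b_2) and the audio features (steps S37-S39, parameters W_3, b_3).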

As another example, the above-described clip representation with question guidance can be obtained through steps S34 to S36 described below.

In step S34, the question feature h_q and the clip features v_c = (c_1, c_2, ..., c_N2) are dimension-scaled using a compatibility function, i.e. the question feature and the clip features are mapped from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation (i.e. the semantic similarity between the question feature and the clip features is computed), giving the corresponding group of clip vectors e_c, each clip vector of which is computed as:

e_c^j = (h_q · c_j) / √d,

wherein the compatibility function used in step S34 is a scaled dot-product function, e_c denotes the resulting group of clip vectors, e_c^j denotes the j-th clip vector in the group, c_j denotes the clip feature of the j-th clip in the video, j = 1, 2, 3, …, N2, and d denotes the preset scaling factor.

Next, in step S35, each clip vector e_c^j in the group e_c is converted into a corresponding clip attention weight score α_c^j using an alignment function, so as to obtain the normalized similarity between the question feature and the clip features; the clip attention weight score corresponding to each clip vector is computed as:

α_c^j = exp(e_c^j) / Σ_{i2=1..N2} exp(e_c^{i2}),

wherein the normalization function used in step S35 is the softmax function, e_c^{i2} denotes the i2-th clip vector in the group of clip vectors, i.e. e_c^{i2} corresponds to e_c^j with j = i2, and i2 ranges from 1 to N2.

Thus, in step S36, each clip feature c_j and its corresponding clip attention weight score α_c^j are combined by a weighted sum using a context-generation function (i.e., the clip features are weighted and summed based on the clip attention weight score corresponding to each clip feature) to obtain the clip representation p_c with question guidance, as shown in the following formula:

p_c = W_2 (Σ_{j=1..N2} α_c^j · c_j) + b_2,

wherein W_2 denotes a trainable weight matrix and b_2 denotes a trainable bias vector.

Further, the audio representation with question guidance described above can be obtained by steps S37 to S39 described below.

In step S37, the question feature h_q and the audio features v_a = (a_1, a_2, ..., a_N3) are dimension-scaled using a compatibility function, i.e. the question feature and the audio features are mapped from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation (i.e. the semantic similarity between the question feature and the audio features is computed), giving the corresponding group of audio vectors e_a, each audio vector of which is computed as:

e_a^k = (h_q · a_k) / √d,

wherein the compatibility function used in step S37 is a scaled dot-product function, e_a denotes the resulting group of audio vectors, e_a^k denotes the k-th audio vector in the group, a_k denotes the audio feature of the k-th audio segment in the video, k = 1, 2, 3, …, N3, and d denotes the preset scaling factor.

Next, in step S38, each audio vector e_a^k in the group e_a is converted into a corresponding audio attention weight score α_a^k using an alignment function, so as to obtain the normalized similarity between the question feature and the audio features; the audio attention weight score corresponding to each audio vector is computed as:

α_a^k = exp(e_a^k) / Σ_{i3=1..N3} exp(e_a^{i3}),

wherein the normalization function used in step S38 may be, for example, the softmax function, e_a^{i3} denotes the i3-th audio vector in the group of audio vectors, i.e. e_a^{i3} corresponds to e_a^k with k = i3, and i3 ranges from 1 to N3.

Thus, in step S39, each audio feature a_k and its corresponding audio attention weight score α_a^k are combined by a weighted sum using a context-generation function to obtain the audio representation p_a with question guidance, as shown in the following formula:

p_a = W_3 (Σ_{k=1..N3} α_a^k · a_k) + b_3,

wherein W_3 denotes a trainable weight matrix and b_3 denotes a trainable bias vector.

In this way, in the above steps S31 to S39, the frame attention weight score, the clip attention weight score, and the audio attention weight score are obtained separately, and for the sake of clarity, may be recorded as a first weight score so as to be distinguished from a second weight score which will be described later.

Further, in step S3, the question may be used to calculate weight scores again for the frame representation p_f with question guidance, the clip representation p_c with question guidance and the audio representation p_a with question guidance respectively, giving the weight score results s_f, s_c, s_a (e.g., as the respective second weight scores), and the modality with the highest weight score among s_f, s_c, s_a is selected as the key modality p, where:

P = {p_f, p_c, p_a};

H = {H_f, H_c, H_a};

S = {s_f, s_c, s_a}.

Here <·,·> denotes the cosine similarity calculation; for example, H_f = <h_q, p_f> denotes the cosine similarity between the question feature h_q and p_f. P = {p_f, p_c, p_a} denotes the several modal features with question guidance, and H = {H_f, H_c, H_a} denotes the similarity between the question feature h_q and the different modal features with question guidance P = {p_f, p_c, p_a}.

S = {s_f, s_c, s_a} denotes the weight score results obtained after the question feature h_q attends to the different modal features with question guidance P = {p_f, p_c, p_a}, and p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}.

Furthermore, s_f denotes the second weight score obtained after the question feature h_q attends to the frame representation p_f with question guidance, s_c denotes the second weight score obtained after the question feature h_q attends to the clip representation p_c with question guidance, and s_a denotes the second weight score obtained after the question feature h_q attends to the audio representation p_a with question guidance.

Thus, as shown in fig. 2, the above processing of step S3 can be completed by the video sequence attention module as shown based on the video features and question features extracted in step S2 to obtain the key modality.
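The second-stage scoring and key-modality selection can be sketched as follows; the softmax normalisation of the cosine similarities into the weight scores s_f, s_c, s_a is an assumption, since the original formula is not reproduced in this text.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_key_modality(h_q: np.ndarray, guided: dict):
    """guided: {"frame": p_f, "clip": p_c, "audio": p_a} question-guided representations."""
    sims = {m: cosine(h_q, p) for m, p in guided.items()}     # H = {H_f, H_c, H_a}
    exp = {m: np.exp(v) for m, v in sims.items()}
    z = sum(exp.values())
    scores = {m: v / z for m, v in exp.items()}               # S = {s_f, s_c, s_a} (assumed softmax)
    key = max(scores, key=scores.get)                         # key modality p
    return scores, key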

Step S4, fusing the plurality of modal representations based on a multi-modal fusion algorithm, according to the obtained modal representations and weight scores, to obtain a video fusion representation of the video.

In step S4, the frame with question guidance can be represented as p by using the multi-modal compressed bilinear model MCB, for example, according to the following formulafClip representation p with question guidancecAudio representation p with question guidanceaAnd their respective rightsA weight score sf,sc,sa(i.e., the respective second weight scores) are fused together to obtain a video fusion representation vu

vu=MCBFusion(sfpf,scpc,sapa)。

wherein MCBFusion(·) in the above formula represents the multi-modal fusion algorithm function corresponding to the multi-modal compact bilinear model MCB.
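The following sketch approximates MCB-style fusion of the three weighted modality representations: each input is projected by a count sketch and the sketches are multiplied element-wise in the FFT domain, which corresponds to their circular convolution. The output dimension, the random projections, and the placeholder weight values are illustrative assumptions; in practice the hash and sign vectors would be fixed once and the upstream parameters learned.

```python
import numpy as np

def count_sketch(x, h, s, out_dim):
    """Project x to out_dim dimensions with a count sketch (hash indices h, random signs s)."""
    y = np.zeros(out_dim)
    np.add.at(y, h, s * x)                 # y[h[i]] += s[i] * x[i]
    return y

def mcb_fusion(vectors, out_dim=1024, seed=0):
    """Approximate MCB fusion: count-sketch each input, multiply element-wise in the FFT domain."""
    rng = np.random.default_rng(seed)
    fused = np.ones(out_dim, dtype=complex)
    for x in vectors:
        h = rng.integers(0, out_dim, size=x.shape[0])        # random hash indices for this input
        s = rng.choice([-1.0, 1.0], size=x.shape[0])          # random signs for this input
        fused *= np.fft.fft(count_sketch(x, h, s, out_dim))   # product in the frequency domain
    return np.real(np.fft.ifft(fused))

rng = np.random.default_rng(1)
d = 512
p_f, p_c, p_a = (rng.standard_normal(d) for _ in range(3))   # question-guided modality representations
s_f, s_c, s_a = 0.5, 0.3, 0.2                                # second weight scores (placeholder values)
v_u = mcb_fusion([s_f * p_f, s_c * p_c, s_a * p_a])          # video fusion representation v_u
print(v_u.shape)                                             # (1024,)
```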

Thus, as shown in fig. 2, the processing of step S4 can be completed by the multi-modal fusion algorithm module as shown in the figure according to the modal representation and the weight score obtained in step S3, so as to obtain a multi-modal fusion representation of the video.

And step S5, paying attention to the video fusion representation of the video by using the problem to obtain the video fusion representation with the problem guide.

For example, step S5 may include steps S51 to S52 described below.

In step S51, according to the following formula, a calculation is performed using the obtained video fusion representation vu and the hidden state h_t^q output at time t by the long short-term memory network LSTM_q that encodes the question, and the calculation result is used as the input of the bidirectional long short-term memory network Bi_LSTM_a at time t:

h_t^a = Bi_LSTM_a(vu ⊙ h_t^q)

wherein ⊙ denotes element-wise multiplication, Bi_LSTM_a(·) represents a bidirectional long short-term memory network, and h_t^a represents the hidden state of the Bi_LSTM_a encoding at the t-th time step.

Next, in step S52, according to the following formulas, the hidden state h_t^a of Bi_LSTM_a at time t is used to attend to the obtained video fusion representation vu, so as to obtain the video fusion representation vo with question guidance:

α_t = exp(e_t) / Σ_{i4=1}^{T} exp(e_{i4})

wherein W4, W5 and W6 represent trainable weight matrices, b5 and b6 represent trainable bias vectors, e_t represents the weight calculated from the similarity between the video fusion feature and the video features (the video features being the frame, clip and audio features described above), and α_t represents the attention weight distribution after weight normalization. e_{i4} corresponds to e_t when t = i4, and i4 ranges from 1 to T.
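To make this secondary attention step concrete, the sketch below takes the Bi_LSTM_a hidden states over the element-wise products of step S51 as given (random stand-ins here) and applies a standard additive attention with parameters named W4 to W6, b5 and b6. The exact scoring form and all shapes are assumptions for illustration, not the claimed equations.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

rng = np.random.default_rng(0)
T, d = 15, 256
v_u = rng.standard_normal(d)              # video fusion representation from step S4
h_q = rng.standard_normal((T, d))         # LSTM_q hidden states for the question, one per time step

x = v_u * h_q                             # step S51: element-wise products fed to Bi_LSTM_a
h_a = rng.standard_normal((T, d))         # stand-in for the Bi_LSTM_a hidden states over x

# Step S52 (assumed additive-attention form using W4-W6, b5, b6):
W4 = rng.standard_normal((d, d)) * 0.01
W5 = rng.standard_normal((d, d)) * 0.01
W6 = rng.standard_normal(d) * 0.01
b5, b6 = np.zeros(d), 0.0

e = np.tanh(h_a @ W5.T + v_u @ W4.T + b5) @ W6 + b6   # e_t: similarity of v_u with each hidden state
alpha = softmax(e)                                     # α_t: normalized attention weights
v_o = alpha @ h_a                                      # question-guided video fusion representation
print(v_o.shape)                                       # (256,)
```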

Thus, as shown in FIG. 2, the above processing of step S5 can be accomplished by the secondary attention module as shown, based on the multi-modal fused representation of the video obtained in step S4, to find important video features (i.e., video fused representation with problem guidance) related to the problem.

And step S6, performing multi-step attention on the characteristics of the key modality by using the problem, and positioning the characteristics of the key modality which are more relevant to the problem in a multi-round iteration mode.

The processing of step S6 described above may be realized, for example, by steps S61 to S64 described below.

In step S61, the query condition is initialized according to the following formula: z_0 = q, wherein q represents the question feature.

Next, in step S62, the query condition z_r is used, according to the following formulas, to attend to the obtained key modality p, so as to obtain a key modality representation with question guidance:

αr=softmax(W8er+b8);

wherein W7, W8 and W9 represent trainable weight matrices, and b7, b8 and b9 represent trainable bias vectors; p represents the modality most relevant to the question, p ∈ {pf, pc, pa}; z_r represents the query condition after the r-th iteration update, where r = 0, 1, 2, …, R, R represents the total number of iterations, and R is a natural number; e_r represents the weight calculated from the similarity between the question features and the key modality features, and α_r represents the attention weight distribution after weight normalization.

Next, in step S63, the query condition is iteratively updated according to the following formula:

wherein z_{r-1} represents the query condition after the (r-1)-th iteration update and is combined with the key modality features with question guidance obtained by the (r-1)-th query, where r = 1, 2, …, R.

In this way, in step S64, using the query condition updated in step S63, step S62 is executed again, so that multi-step iterative attention is performed on the key modality p and the key modality features most relevant to the question are obtained.
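The progressive loop of steps S61 to S64 can be sketched as follows. Because the parameterized forms of e_r and of the query update are not reproduced above, this sketch substitutes a plain dot-product attention and an additive query update (new query = old query plus the attended representation); the feature shapes, the iteration count, and the treatment of the key modality as a feature sequence are likewise assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

rng = np.random.default_rng(0)
d, n, R = 256, 20, 3                      # feature size, number of key-modality features, iterations
q = rng.standard_normal(d)                # question feature
p = rng.standard_normal((n, d))           # key-modality features (assumed to form a sequence)

z = q                                     # step S61: initialize the query condition with the question
for r in range(1, R + 1):
    e_r = p @ z                           # similarity between the query and each key-modality feature
    alpha_r = softmax(e_r)                # normalized attention weights (cf. α_r in step S62)
    p_tilde = alpha_r @ p                 # question-guided key-modality representation
    z = z + p_tilde                       # step S63: additive query update (assumed form)

print(p_tilde.shape)                      # final, question-focused key-modality feature
```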

Thus, as shown in FIG. 2, the processing of step S6 can be accomplished, based on the key modality found in step S3, by the iterative attention module shown in the figure, so as to locate the key modality features most relevant to the question.

Step S7, obtaining a predicted answer based on at least the question features, the video fusion representation with question guidance, and the results of the multi-step attention and the multiple rounds of iteration.

In step S7, according to the following formula, the memory cell state output by the long short-term memory network LSTM_q in step S2, the memory cell state output by Bi_LSTM_a in step S5, the video representation vo with question guidance obtained in step S5, and the iterative attention result obtained in step S6 are fused; these four parts of information are used for answer generation:

wherein, W10Representing trainable weight matrices, WanswerA weight matrix representing the vocabulary, Answer representing the generated Answer, argmax representing the selection of the highest score as the prediction result.

Thus, as shown in fig. 2, the above processing of step S7 can be completed by the answer generation module shown in the figure: the partial valid output results of the previous steps are fused and then input to this module, so as to generate the predicted answer. The portion enclosed by the dashed box in fig. 2 is the multi-modal progressive attention model according to the embodiment of the present invention, which is used to perform the above steps.

As shown in fig. 3, given a video and a question, a predicted answer can be obtained as shown in the figure. Therefore, the method can be used to perform video question-answering processing and to predict more accurate answers.

In addition, the embodiment of the invention also provides a storage medium storing a program, and the program realizes the method for solving the video question-answering task based on the multi-modal progressive attention model when being executed by a processor.

In addition, the embodiment of the invention also provides a computing device which comprises the storage medium.

FIG. 4 illustrates a block diagram of an exemplary computer system/server 50 suitable for use in implementing embodiments of the present invention. The computer system/server 50 shown in FIG. 4 is only an example and should not be taken to limit the scope of use and functionality of embodiments of the present invention.

As shown in FIG. 4, computer system/server 50 is in the form of a general purpose computing device. Components of computer system/server 50 may include, but are not limited to: one or more processors or processing units 501, a system memory 502, and a bus 503 that couples the various system components (including the system memory 502 and the processing unit 501).

Computer system/server 50 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 50 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 502 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)5021 and/or cache memory 5022. The computer system/server 50 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, the ROM5023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 503 by one or more data media interfaces. At least one program product may be included in system memory 502 with a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.

A program/utility 5025 having a set (at least one) of program modules 5024 may be stored in, for example, system memory 502, and such program modules 5024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. The program modules 5024 generally perform the functions and/or methodologies of the described embodiments of the invention.

The computer system/server 50 may also communicate with one or more external devices 504 (e.g., keyboard, pointing device, display, etc.). Such communication may be through input/output (I/O) interfaces 505. Also, the computer system/server 50 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) through a network adapter 506. As shown in FIG. 4, the network adapter 506 communicates with other modules of the computer system/server 50 (e.g., processing unit 501, etc.) via the bus 503. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with computer system/server 50.

The processing unit 501 executes various functional applications and data processing, for example, executes and implements the steps of the above-described method, by running a program stored in the system memory 502.

A specific example of a computer-readable storage medium embodying the present invention is shown in fig. 5.

The computer-readable storage medium of fig. 5 is an optical disc 600, and a computer program (i.e., a program product) is stored thereon, and when the program is executed by a processor, the program implements the steps described in the above method embodiments, and specific implementations of the steps are not repeated here.

PREFERRED EMBODIMENTS

In the preferred embodiment, experiments were performed on the ZJL experimental dataset, which contains a total of 13161 short videos and 197415 question-answer pairs. In order to objectively evaluate the performance of the method of the present invention, the Accuracy evaluation criterion is used on the selected test set to evaluate the effect of the present invention; this criterion reflects the accuracy of model prediction. The experimental results obtained following the procedure described above are shown in Table 1.

TABLE 1

The present invention performs an ablation study to evaluate the effectiveness of each modality, wherein Q denotes Question only, i.e., predicting answers based on the question features alone; V+Q denotes Video and Question, i.e., predicting answers based on the video and the question; A+Q denotes Audio and Question, i.e., predicting answers based on the audio and the question; and V+A+Q denotes Video, Question and Audio, i.e., predicting answers based on the video, the question and the audio. The obtained experimental results are shown in Table 2.

TABLE 2

It should be noted that although several units, modules or sub-modules are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.

Further, while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features in those aspects cannot be combined to benefit; this division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
