Video question-answering system, method, computer and storage medium based on text attention and fine-grained information

Document No.: 1798804    Publication date: 2021-11-05

Reading note: This technology, "A video question-answering system, method, computer and storage medium based on text attention and fine-grained information", was designed and created by 孙广路, 梁丽丽, 谭涛, 于波 and 刘昕雨 on 2021-08-09. Its main content is as follows:

The invention provides a video question-answering system, method, computer and storage medium based on text attention and fine-grained information, belonging to the intersection of computer vision and natural language processing. First, a word-embedding representation of the question sentence is obtained; second, a question-guided video text-level feature representation is obtained; third, a question-guided video frame-level feature representation with spatial attention is obtained; fourth, the final question-guided video frame-level feature representation is obtained; fifth, a question-guided video clip-level feature representation is obtained; finally, a question-guided video feature representation is obtained and an answer is generated. This addresses the technical problem of low answer-reasoning accuracy in the prior art. By extracting fine-grained region features of the video, the invention understands the video information better, improves the accuracy of video question answering, and narrows the modal gap between the video and the question.

1. A video question-answering system based on text attention and fine-grained information is characterized by comprising a word embedding acquisition module, a video text level feature representation acquisition module, a video frame level feature representation acquisition module, a question-oriented video frame level feature representation acquisition module, a question-oriented video feature representation acquisition module, a video feature representation fusion module and a generated answer acquisition module;

the word embedding acquisition module is used for acquiring word embedding representation of the question sentence aiming at the question sentence;

the video text level feature representation acquisition module is used for generating, according to the obtained word-embedding representation of the question sentence, an attention score for each text description of the video by using a text attention mechanism, and obtaining a question-guided video text-level feature representation on the basis of the attention scores;

the video frame level feature representation acquisition module is used for extracting region features from each frame of a video to form a region feature set of the video frames, generating an attention score for each region feature by using a spatial attention mechanism according to the word-embedding representation of the question sentence, and obtaining a question-guided video frame-level feature representation with spatial attention on the basis of the attention scores;

the question-oriented video frame level feature representation acquisition module is used for obtaining the final question-oriented video frame-level feature representation from the video frame-level features with spatial attention;

the question-oriented video feature representation acquisition module is used for extracting video clip features by using a three-dimensional convolutional neural network C3D to obtain a question-oriented video clip-level feature representation;

the video feature representation fusion module is used for performing feature fusion on the video text-level feature representation, the video frame-level feature representation and the video clip-level feature representation to obtain a question-guided video feature representation;

the generated answer acquisition module is used for calculating and generating an answer by using a softmax function.

2. A video question-answering method based on text attention and fine-grained information is characterized by comprising the following steps:

S1, for a question sentence, obtaining a word-embedding representation of the question sentence;

s2, aiming at a video, generating video text description of M sentences for each video, generating an attention score for each text description of the video by using a text attention mechanism according to the word embedded representation of the question sentence obtained in the step S1, and obtaining a video text level feature representation of question guidance on the basis of the attention score;

s3, extracting regional features in each frame of the video aiming at a section of video to form a regional feature set of the video frame, generating an attention score for each regional feature by using a spatial attention mechanism according to the word embedding representation of the question sentence obtained in the step S1, and obtaining the video frame level feature representation with spatial attention of the question guidance on the basis of the attention score;

s4, obtaining the final problem-oriented video frame level feature representation of the video frame level features with the spatial attention obtained in the step S3;

s5, obtaining video clips by using a three-dimensional convolutional neural network C3D, and obtaining problem-oriented video clip level feature representation;

and S6, performing feature fusion on the video text-level feature representation obtained in step S2, the video frame-level feature representation obtained in step S4 and the video clip-level feature representation obtained in step S5 to obtain a question-guided video feature representation, and generating an answer.

3. The method according to claim 2, wherein the step S1 of obtaining the word-embedding representation of the question sentence specifically comprises: for a question sentence, first using the Word vector coding network Word2Vec to obtain a word-embedding vector representation of each word in the question sentence, then inputting each word-embedding vector into a gated recurrent unit network GRU, and taking the obtained hidden state representation h_q of the question as the word-embedding representation of the question.

4. The method according to claim 3, wherein the step S2 of obtaining the question-oriented video text-level feature representation specifically comprises: first, a pre-trained lexical fully convolutional network Lexical-FCN is used to generate M sentences of text description s_1, s_2, ..., s_M for each video to summarize the content of different regions in the video; then a GRU model is used to encode the M text descriptions, and the encoded features are recorded as h^s_1, h^s_2, ..., h^s_M, wherein h^s_i is the hidden state representation of the text description encoded at time i ∈ {1, 2, ..., M};

calculating the similarity between the question sentence and each sentence of text description by using the text attention mechanism, generating a weight for each text description, and performing weighted fusion to obtain the question-oriented text feature representation; the specific calculation process is as follows:

a) computing the attention score of each text feature h^s_i:

e^s_i = h_q^T · W_s · h^s_i

wherein h_q is the word-embedding representation of the question, the superscript T denotes the matrix transpose, and W_s is a parameter matrix used to calculate the text attention scores;

b) normalizing the weight of each text feature h^s_i using the softmax function, calculated as follows:

α^s_i = exp(e^s_i) / Σ_{j=1}^{M} exp(e^s_j)

c) performing weighted summation on h^s_i according to the weight coefficients α^s_i to obtain the question-oriented video text-level feature representation, calculated as follows:

v^s = Σ_{i=1}^{M} α^s_i · h^s_i

5. The method according to claim 4, wherein the step S3 of obtaining the question-guided video frame-level feature representation with spatial attention specifically comprises:

extracting the region features in each video frame by using a pre-trained RetinaNet detection network, recorded as R_i = {r_{i,1}, r_{i,2}, ..., r_{i,K}}, where K represents the number of region features in each frame, r_{i,j} represents the j-th region feature in the i-th frame, and R_i represents the region feature set extracted from the i-th video frame; a spatial attention mechanism is used to assign a weight to each region feature, and weighted fusion is performed to obtain a question-oriented region feature representation; the specific calculation process is as follows:

a) computing the attention score of each region feature r_{i,j}:

e^r_{i,j} = W · tanh(W_q · h_q + W_o · r_{i,j} + b_o)

wherein h_q is the word-embedding representation of the question, W, W_q and W_o are parameter matrices, and b_o is a bias vector;

b) normalizing the weight of each region feature r_{i,j} using the softmax function, calculated as follows:

α^r_{i,j} = exp(e^r_{i,j}) / Σ_{k=1}^{K} exp(e^r_{i,k})

c) performing weighted summation on r_{i,j} according to the weight coefficients α^r_{i,j} to obtain the video frame-level feature with spatial attention, calculated as follows:

f_i = Σ_{j=1}^{K} α^r_{i,j} · r_{i,j}

6. The method of claim 5, wherein the step S4 of obtaining the final question-oriented video frame-level feature representation specifically comprises: for the video frame-level feature representations with spatial attention f_1, f_2, ..., f_N calculated in step S3, where N represents the number of frames, a GRU model is used for encoding, and the result is recorded as h^f_1, h^f_2, ..., h^f_N, wherein h^f_i represents the hidden state feature of the video frame encoded at the i-th time step; a temporal attention mechanism is used to screen out redundant frames and assign a weight to each video frame feature, and weighted summation is performed according to the weights to obtain the question-oriented video frame feature representation; the specific calculation process is as follows:

a) computing the attention score of a given video frame feature h^f_i:

e^f_i = W · tanh(W_q · h_q + W_f · h^f_i + b_f)

wherein h_q is the word-embedding representation of the question, W, W_q and W_f are parameter matrices, and b_f is a bias vector;

b) normalizing the weight of each video frame-level feature h^f_i using the softmax function, calculated as follows:

α^f_i = exp(e^f_i) / Σ_{j=1}^{N} exp(e^f_j)

c) performing weighted summation on h^f_i according to the weight coefficients α^f_i to obtain the final question-oriented video frame-level feature representation, calculated as follows:

v^f = Σ_{i=1}^{N} α^f_i · h^f_i

7. The method of claim 6, wherein the step S5 of obtaining the question-oriented video clip-level feature representation specifically comprises:

extracting clip-level video feature representations using a three-dimensional convolutional neural network C3D, recorded as c_1, c_2, ..., c_L, where c_i represents the i-th clip feature and L is the number of clips; the clip features are encoded with a GRU model and recorded as h^c_1, h^c_2, ..., h^c_L, wherein h^c_i is the hidden state feature of the clip encoded at the i-th time step; a temporal attention mechanism is used to compute a weight for each video clip according to its relevance to the question, and weighted fusion of the features is performed according to the computed weights to obtain the question-oriented video clip feature representation; the specific calculation process is as follows:

a) computing the attention score of each video clip feature h^c_i:

e^c_i = W · tanh(W_q · h_q + W_c · h^c_i + b_c)

wherein h_q is the word-embedding representation of the question, W, W_q and W_c are parameter matrices, and b_c is a bias vector;

b) normalizing the weight of each video clip feature h^c_i using the softmax function, calculated as follows:

α^c_i = exp(e^c_i) / Σ_{j=1}^{L} exp(e^c_j)

c) performing weighted summation on h^c_i according to the weight coefficients α^c_i to obtain the question-oriented video clip-level feature representation, calculated as follows:

v^c = Σ_{i=1}^{L} α^c_i · h^c_i

8. The method of claim 7, wherein the step S6 of obtaining the question-oriented video feature representation and generating the answer specifically comprises:

performing feature fusion on the obtained question-oriented video text-level feature representation v^s, video frame-level feature representation v^f and video clip-level feature representation v^c to obtain a question-guided video feature representation, and calculating the probability distribution over the words in the answer set by using a softmax function; the final answer prediction is calculated as follows:

p = softmax(W_a · [v^s; v^f; v^c] + b_a)

wherein [ ; ] denotes the concatenation of vectors, W_a is a parameter matrix, and b_a is a bias vector.

9. A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of a video question-answering method based on textual attention and fine-grained information according to any one of claims 2 to 8 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a video question-answering method based on textual attention and fine-grained information according to any one of claims 2 to 8.

Technical Field

The invention relates to video question answering, and in particular to a video question-answering system and method, a computer and a storage medium based on text attention and fine-grained information, belonging to the intersection of computer vision and natural language processing.

Background

In recent years, with the development of computer vision and natural language processing, a great deal of vision and language research topics have attracted attention. The video question-answering is a cross-modal task and comprises data information of various different modalities, such as a video modality, a voice modality and a text modality.

The task of video question answering can be described as follows: given a piece of video and a known natural-language question related to the video content, the system acquires the visual and semantic information of the video and gives an appropriate answer to the question. When people watch a video, they perceive the characters, objects and environments appearing in it through the visual information shown by the video frames, and this scene-level visual information gives them an overall understanding of the video content.

Traditional video question-answering methods generally reason by directly extracting video features and ignore the modal gap between question sentences and videos, so the video feature representation produced after similarity calculation is not the most relevant to the question; in addition, most video question-answering methods ignore the fine-grained region features of video frames when extracting video features, which degrades the accuracy of the final answer reasoning.

Therefore, how to improve the accuracy of answer reasoning is important. To this end, the invention provides a video question-answering system, method, computer and storage medium based on text attention and fine-grained information.

Disclosure of Invention

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.

The invention provides a video question-answering system based on text attention and fine-grained information, which comprises a word embedding acquisition module, a video text level feature representation acquisition module, a video frame level feature representation acquisition module, a question-oriented video frame level feature representation acquisition module, a question-oriented video feature representation acquisition module, a video feature representation fusion module and a generated answer acquisition module;

the word embedding acquisition module is used for acquiring word embedding representation of the question sentence aiming at the question sentence;

the video text level feature representation acquisition module is used for generating, according to the obtained word-embedding representation of the question sentence, an attention score for each text description of the video by using a text attention mechanism, and obtaining a question-guided video text-level feature representation on the basis of the attention scores;

the video frame level feature representation acquisition module is used for extracting region features from each frame of a video to form a region feature set of the video frames, generating an attention score for each region feature by using a spatial attention mechanism according to the word-embedding representation of the question sentence, and obtaining a question-guided video frame-level feature representation with spatial attention on the basis of the attention scores;

the question-oriented video frame level feature representation acquisition module is used for obtaining the final question-oriented video frame-level feature representation from the video frame-level features with spatial attention;

the question-oriented video feature representation acquisition module is used for extracting video clip features by using a three-dimensional convolutional neural network C3D to obtain a question-oriented video clip-level feature representation;

the video feature representation fusion module is used for performing feature fusion on the video text-level feature representation, the video frame-level feature representation and the video clip-level feature representation to obtain a question-guided video feature representation;

the generated answer acquisition module is used for calculating and generating an answer by using a softmax function.

A video question-answering method based on text attention and fine-grained information comprises the following steps:

S1, for a question sentence, obtaining a word-embedding representation of the question sentence;

s2, aiming at a video, generating video text description of M sentences for each video, generating an attention score for each text description of the video by using a text attention mechanism according to the word embedded representation of the question sentence obtained in the step S1, and obtaining a video text level feature representation of question guidance on the basis of the attention score;

s3, extracting regional features in each frame of the video aiming at a section of video to form a regional feature set of the video frame, generating an attention score for each regional feature by using a spatial attention mechanism according to the word embedding representation of the question sentence obtained in the step S1, and obtaining the video frame level feature representation with spatial attention of the question guidance on the basis of the attention score;

s4, obtaining the final problem-oriented video frame level feature representation of the video frame level features with the spatial attention obtained in the step S3;

s5, obtaining video clips by using a three-dimensional convolutional neural network C3D, and obtaining problem-oriented video clip level feature representation;

and S6, performing feature fusion on the video text-level feature representation obtained in step S2, the video frame-level feature representation obtained in step S4 and the video clip-level feature representation obtained in step S5 to obtain a question-guided video feature representation, and generating an answer.

Preferably, the method for obtaining the word-embedding representation of the question sentence in step S1 specifically comprises: for a question sentence, first using the Word vector coding network Word2Vec to obtain a word-embedding vector representation of each word in the question sentence, then inputting each word-embedding vector into a gated recurrent unit network GRU, and taking the obtained hidden state representation h_q of the question as the word-embedding representation of the question.

Preferably, the specific method for obtaining the question-oriented video text-level feature representation in step S2 is as follows: first, a pre-trained lexical fully convolutional network Lexical-FCN is used to generate M sentences of text description s_1, s_2, ..., s_M for each video to summarize the content of different regions in the video; then a GRU model is used to encode the M text descriptions, and the encoded features are recorded as h^s_1, h^s_2, ..., h^s_M, wherein h^s_i is the hidden state representation of the text description encoded at time i ∈ {1, 2, ..., M};

calculating the similarity between the question sentence and each sentence of text description by using the text attention mechanism, generating a weight for each text description, and performing weighted fusion to obtain the question-oriented text feature representation; the specific calculation process is as follows:

a) computing the attention score of each text feature h^s_i:

e^s_i = h_q^T · W_s · h^s_i

wherein h_q is the word-embedding representation of the question, the superscript T denotes the matrix transpose, and W_s is a parameter matrix used to calculate the text attention scores;

b) normalizing the weight of each text feature h^s_i using the softmax function, calculated as follows:

α^s_i = exp(e^s_i) / Σ_{j=1}^{M} exp(e^s_j)

c) performing weighted summation on h^s_i according to the weight coefficients α^s_i to obtain the question-oriented video text-level feature representation, calculated as follows:

v^s = Σ_{i=1}^{M} α^s_i · h^s_i

Preferably, the specific method for obtaining the question-guided video frame-level feature representation with spatial attention in step S3 is as follows:

extracting the region features in each video frame by using a pre-trained RetinaNet detection network, recorded as R_i = {r_{i,1}, r_{i,2}, ..., r_{i,K}}, where K represents the number of region features in each frame, r_{i,j} represents the j-th region feature in the i-th frame, and R_i represents the region feature set extracted from the i-th video frame; a spatial attention mechanism is used to assign a weight to each region feature, and weighted fusion is performed to obtain a question-oriented region feature representation; the specific calculation process is as follows:

a) computing the attention score of each region feature r_{i,j}:

e^r_{i,j} = W · tanh(W_q · h_q + W_o · r_{i,j} + b_o)

wherein h_q is the word-embedding representation of the question, W, W_q and W_o are parameter matrices, and b_o is a bias vector;

b) normalizing the weight of each region feature r_{i,j} using the softmax function, calculated as follows:

α^r_{i,j} = exp(e^r_{i,j}) / Σ_{k=1}^{K} exp(e^r_{i,k})

c) performing weighted summation on r_{i,j} according to the weight coefficients α^r_{i,j} to obtain the video frame-level feature with spatial attention, calculated as follows:

f_i = Σ_{j=1}^{K} α^r_{i,j} · r_{i,j}

Preferably, the specific method for obtaining the final question-oriented video frame-level feature representation in step S4 is as follows: for the video frame-level feature representations with spatial attention f_1, f_2, ..., f_N calculated in step S3, where N represents the number of frames, a GRU model is used for encoding, and the result is recorded as h^f_1, h^f_2, ..., h^f_N, wherein h^f_i represents the hidden state feature of the video frame encoded at the i-th time step; a temporal attention mechanism is used to screen out redundant frames and assign a weight to each video frame feature, and weighted summation is performed according to the weights to obtain the question-oriented video frame feature representation; the specific calculation process is as follows:

a) computing the attention score of a given video frame feature h^f_i:

e^f_i = W · tanh(W_q · h_q + W_f · h^f_i + b_f)

wherein h_q is the word-embedding representation of the question, W, W_q and W_f are parameter matrices, and b_f is a bias vector;

b) normalizing the weight of each video frame-level feature h^f_i using the softmax function, calculated as follows:

α^f_i = exp(e^f_i) / Σ_{j=1}^{N} exp(e^f_j)

c) performing weighted summation on h^f_i according to the weight coefficients α^f_i to obtain the final question-oriented video frame-level feature representation, calculated as follows:

v^f = Σ_{i=1}^{N} α^f_i · h^f_i

Preferably, the specific method for obtaining the question-oriented video clip-level feature representation in step S5 is as follows:

extracting clip-level video feature representations using a three-dimensional convolutional neural network C3D, recorded as c_1, c_2, ..., c_L, where c_i represents the i-th clip feature and L is the number of clips; the clip features are encoded with a GRU model and recorded as h^c_1, h^c_2, ..., h^c_L, wherein h^c_i is the hidden state feature of the clip encoded at the i-th time step; a temporal attention mechanism is used to compute a weight for each video clip according to its relevance to the question, and weighted fusion of the features is performed according to the computed weights to obtain the question-oriented video clip feature representation; the specific calculation process is as follows:

a) computing the attention score of each video clip feature h^c_i:

e^c_i = W · tanh(W_q · h_q + W_c · h^c_i + b_c)

wherein h_q is the word-embedding representation of the question, W, W_q and W_c are parameter matrices, and b_c is a bias vector;

b) normalizing the weight of each video clip feature h^c_i using the softmax function, calculated as follows:

α^c_i = exp(e^c_i) / Σ_{j=1}^{L} exp(e^c_j)

c) performing weighted summation on h^c_i according to the weight coefficients α^c_i to obtain the question-oriented video clip-level feature representation, calculated as follows:

v^c = Σ_{i=1}^{L} α^c_i · h^c_i

Preferably, the specific method for obtaining the question-guided video feature representation and generating the answer in step S6 is as follows:

performing feature fusion on the obtained question-oriented video text-level feature representation v^s, video frame-level feature representation v^f and video clip-level feature representation v^c to obtain a question-guided video feature representation, and calculating the probability distribution over the words in the answer set by using a softmax function; the final answer prediction is calculated as follows:

p = softmax(W_a · [v^s; v^f; v^c] + b_a)

wherein [ ; ] denotes the concatenation of vectors, W_a is a parameter matrix, and b_a is a bias vector.

A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of the video question-answering method based on text attention and fine-grained information when executing the computer program.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements a video question-and-answer method based on textual attention and fine-grained information.

The invention has the following beneficial effects:

(1) Compared with the prior art, the method and the device reduce the modal gap between the video and the question by generating text descriptions for the video.

(2) By extracting the fine-grained region features of the video, the video information can be better understood, and the accuracy of answering questions is improved.

(3) By introducing a temporal attention mechanism over the video frame features and the video clip features, the video frames and video clips related to the question are obtained and redundant information is eliminated, which further improves the accuracy of answering the question and the efficiency of question reasoning.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a schematic structural diagram of a video question-answering system according to an embodiment of the present invention;

fig. 2 is a schematic flow chart of a video question-answering method according to a second embodiment of the present invention;

fig. 3 is a schematic diagram illustrating a principle of a video question answering method according to a second embodiment of the present invention;

fig. 4 is a diagram illustrating the result of the video question answering method according to the present invention.

Detailed Description

In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The first embodiment is as follows:

referring to fig. 1 to illustrate this embodiment, a video question-answering system based on text attention and fine-grained information in this embodiment includes a word embedding acquisition module, a video text level feature representation acquisition module, a video frame level feature representation acquisition module, a question-oriented video feature representation acquisition module, a video feature representation fusion module, and a generated answer acquisition module;

the word embedding acquisition module is used for acquiring word embedding representation of the question sentence aiming at the question sentence;

the video text level feature representation acquisition module is used for generating, according to the obtained word-embedding representation of the question sentence, an attention score for each text description of the video by using a text attention mechanism, and obtaining a question-guided video text-level feature representation on the basis of the attention scores;

the video frame level feature representation acquisition module is used for extracting region features from each frame of a video to form a region feature set of the video frames, generating an attention score for each region feature by using a spatial attention mechanism according to the word-embedding representation of the question sentence, and obtaining a question-guided video frame-level feature representation with spatial attention on the basis of the attention scores;

the question-oriented video frame level feature representation acquisition module is used for obtaining the final question-oriented video frame-level feature representation from the video frame-level features with spatial attention;

the question-oriented video feature representation acquisition module is used for extracting video clip features by using a three-dimensional convolutional neural network C3D to obtain a question-oriented video clip-level feature representation;

the video feature representation fusion module is used for performing feature fusion on the video text-level feature representation, the video frame-level feature representation and the video clip-level feature representation to obtain a question-guided video feature representation;

the generated answer acquisition module is used for calculating and generating an answer by using a softmax function.

The second embodiment is as follows:

referring to fig. 2 to 4 to illustrate the present embodiment, a video question-answering method based on text attention and fine-grained information in the present embodiment includes the following steps:

S1, for a question sentence, obtaining a word-embedding representation of the question sentence;

specifically, aiming at a question sentence, a Word embedding way is adopted to obtain Word embedding representation of the question sentence, firstly, a Word vector coding network Word2Wec is adopted to obtain Word embedding vector representation of each Word in the question sentence, then, each Word embedding vector representation is input into a gated cyclic unit network GRU, and hidden state representation h of the question is obtainedqWord-embedded representation as a problem.

S2, for a given video, generating M sentences of video text description, generating an attention score for each text description by using a text attention mechanism according to the word-embedding representation of the question sentence obtained in step S1, and obtaining a question-guided video text-level feature representation on the basis of the attention scores;

Specifically, for a section of video, first a pre-trained lexical fully convolutional network Lexical-FCN is used to generate M sentences of text description s_1, s_2, ..., s_M for each video to summarize the content of different regions in the video; then a GRU model is used to encode the M text descriptions, and the encoded features are recorded as h^s_1, h^s_2, ..., h^s_M, wherein h^s_i is the hidden state representation of the text description encoded at time i ∈ {1, 2, ..., M};

calculating the similarity between the question sentence and each sentence of text description by using the text attention mechanism, generating a weight for each text description, and performing weighted fusion to obtain the question-oriented text feature representation; the specific calculation process is as follows:

a) computing the attention score of each text feature h^s_i:

e^s_i = h_q^T · W_s · h^s_i

wherein h_q is the word-embedding representation of the question, the superscript T denotes the matrix transpose, and W_s is a parameter matrix used to calculate the text attention scores;

b) normalizing the weight of each text feature h^s_i using the softmax function, calculated as follows:

α^s_i = exp(e^s_i) / Σ_{j=1}^{M} exp(e^s_j)

c) performing weighted summation on h^s_i according to the weight coefficients α^s_i to obtain the question-oriented video text-level feature representation, calculated as follows:

v^s = Σ_{i=1}^{M} α^s_i · h^s_i
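A minimal PyTorch sketch of this text-attention step is given below, assuming the bilinear score e^s_i = h_q^T · W_s · h^s_i reconstructed above; the class name, batch layout and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAttention(nn.Module):
    """Question-guided attention over the M encoded sentence descriptions (sketch)."""
    def __init__(self, q_dim=512, s_dim=512):
        super().__init__()
        self.W_s = nn.Linear(s_dim, q_dim, bias=False)   # parameter matrix W_s

    def forward(self, h_q, h_s):
        # h_q: (batch, q_dim), h_s: (batch, M, s_dim)
        scores = torch.bmm(self.W_s(h_s), h_q.unsqueeze(2)).squeeze(2)  # e^s_i, (batch, M)
        alpha = F.softmax(scores, dim=1)                                # weights over the M sentences
        v_s = torch.bmm(alpha.unsqueeze(1), h_s).squeeze(1)             # v^s, (batch, s_dim)
        return v_s, alpha

# Example usage with dummy features for M = 5 descriptions.
v_s, alpha = TextAttention()(torch.randn(2, 512), torch.randn(2, 5, 512))
```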

s3, extracting regional features in each frame of the video aiming at a section of video to form a regional feature set of the video frame, generating an attention score for each regional feature by using a spatial attention mechanism according to the word embedding representation of the question sentence obtained in the step S1, and obtaining the video frame level feature representation with spatial attention of the question guidance on the basis of the attention score;

Specifically, for a section of video, region features are extracted from each frame of the video by using a pre-trained RetinaNet detection network to form a region feature set of the video frames; an attention score is generated for each region feature by using a spatial attention mechanism according to the word-embedding representation of the question sentence obtained in step S1, and a question-guided video frame-level feature representation with spatial attention is obtained on the basis of the attention scores;

Specifically, a pre-trained RetinaNet detection network is used to extract the region features in each video frame, recorded as R_i = {r_{i,1}, r_{i,2}, ..., r_{i,K}}, where K represents the number of region features in each frame, r_{i,j} represents the j-th region feature in the i-th frame, and R_i represents the region feature set extracted from the i-th video frame; a spatial attention mechanism is used to assign a weight to each region feature, and weighted fusion is performed to obtain a question-oriented region feature representation; the specific calculation process is as follows:

a) computing the attention score of each region feature r_{i,j}:

e^r_{i,j} = W · tanh(W_q · h_q + W_o · r_{i,j} + b_o)

wherein h_q is the word-embedding representation of the question, W, W_q and W_o are parameter matrices, and b_o is a bias vector;

b) normalizing the weight of each region feature r_{i,j} using the softmax function, calculated as follows:

α^r_{i,j} = exp(e^r_{i,j}) / Σ_{k=1}^{K} exp(e^r_{i,k})

c) performing weighted summation on r_{i,j} according to the weight coefficients α^r_{i,j} to obtain the video frame-level feature with spatial attention, calculated as follows:

f_i = Σ_{j=1}^{K} α^r_{i,j} · r_{i,j}
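The spatial-attention step can be sketched as follows in PyTorch, assuming the additive score e^r_{i,j} = W · tanh(W_q · h_q + W_o · r_{i,j} + b_o) reconstructed above, a region feature dimension of 1024 and K regions per frame; the class name and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Question-guided attention over the K region features of every frame (sketch)."""
    def __init__(self, q_dim=512, r_dim=1024, att_dim=512):
        super().__init__()
        self.W_q = nn.Linear(q_dim, att_dim, bias=False)
        self.W_o = nn.Linear(r_dim, att_dim)        # its bias plays the role of b_o
        self.W = nn.Linear(att_dim, 1, bias=False)  # outer projection W

    def forward(self, h_q, regions):
        # h_q: (batch, q_dim), regions: (batch, N_frames, K, r_dim)
        q = self.W_q(h_q)[:, None, None, :]                  # broadcast over frames and regions
        scores = self.W(torch.tanh(q + self.W_o(regions)))   # e^r_{i,j}, (batch, N, K, 1)
        alpha = F.softmax(scores, dim=2)                     # normalized over the K regions
        return (alpha * regions).sum(dim=2)                  # f_i per frame, (batch, N, r_dim)

# Example usage: 8 frames with 10 region features each.
frame_feats = SpatialAttention()(torch.randn(2, 512), torch.randn(2, 8, 10, 1024))
```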

s4, obtaining the final problem-oriented video frame level feature representation of the video frame level features with the spatial attention obtained in the step S3;

Specifically, a temporal attention mechanism is used to obtain the final question-oriented video frame-level feature representation;

In particular, for the video frame-level feature representations with spatial attention f_1, f_2, ..., f_N calculated in step S3, where N represents the number of frames, a GRU model is used for encoding, and the result is recorded as h^f_1, h^f_2, ..., h^f_N, wherein h^f_i represents the hidden state feature of the video frame encoded at the i-th time step; a temporal attention mechanism is used to screen out redundant frames and assign a weight to each video frame feature, and weighted summation is performed according to the weights to obtain the question-oriented video frame feature representation; the specific calculation process is as follows:

a) computing the attention score of a given video frame feature h^f_i:

e^f_i = W · tanh(W_q · h_q + W_f · h^f_i + b_f)

wherein h_q is the word-embedding representation of the question, W, W_q and W_f are parameter matrices, and b_f is a bias vector;

b) normalizing the weight of each video frame-level feature h^f_i using the softmax function, calculated as follows:

α^f_i = exp(e^f_i) / Σ_{j=1}^{N} exp(e^f_j)

c) performing weighted summation on h^f_i according to the weight coefficients α^f_i to obtain the final question-oriented video frame-level feature representation, calculated as follows:

v^f = Σ_{i=1}^{N} α^f_i · h^f_i
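Below is a minimal PyTorch sketch of this temporal attention over frames: the per-frame features f_i produced by the spatial-attention step are encoded with a GRU and weighted by a question-guided additive attention, mirroring the formulas above. The class name and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFrameAttention(nn.Module):
    """GRU over per-frame features plus question-guided temporal attention (sketch)."""
    def __init__(self, frame_dim=1024, hidden_dim=512, q_dim=512, att_dim=512):
        super().__init__()
        self.gru = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.W_q = nn.Linear(q_dim, att_dim, bias=False)
        self.W_f = nn.Linear(hidden_dim, att_dim)   # its bias plays the role of b_f
        self.W = nn.Linear(att_dim, 1, bias=False)

    def forward(self, h_q, frame_feats):
        # frame_feats: (batch, N, frame_dim), output of the spatial-attention step
        h_f, _ = self.gru(frame_feats)                                            # (batch, N, hidden_dim)
        scores = self.W(torch.tanh(self.W_q(h_q).unsqueeze(1) + self.W_f(h_f)))   # e^f_i, (batch, N, 1)
        alpha = F.softmax(scores, dim=1)                                          # weights over the N frames
        return (alpha * h_f).sum(dim=1)                                           # v^f, (batch, hidden_dim)

# Example usage: 8 frames of 1024-d spatially attended features.
v_f = TemporalFrameAttention()(torch.randn(2, 512), torch.randn(2, 8, 1024))
```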

s5, obtaining video clips by using a three-dimensional convolutional neural network C3D, and obtaining problem-oriented video clip level feature representation;

Specifically, a three-dimensional convolutional neural network C3D is used to extract clip features, and a temporal attention mechanism is used to obtain the question-oriented video clip-level feature representation;

Specifically, a three-dimensional convolutional neural network C3D is used to extract clip-level video feature representations, recorded as c_1, c_2, ..., c_L, where c_i represents the i-th clip feature and L is the number of clips; the clip features are encoded with a GRU model and recorded as h^c_1, h^c_2, ..., h^c_L, wherein h^c_i is the hidden state feature of the clip encoded at the i-th time step; a temporal attention mechanism is used to compute a weight for each video clip according to its relevance to the question, and weighted fusion of the features is performed according to the computed weights to obtain the question-oriented video clip feature representation; the specific calculation process is as follows:

a) computing the attention score of each video clip feature h^c_i:

e^c_i = W · tanh(W_q · h_q + W_c · h^c_i + b_c)

wherein h_q is the word-embedding representation of the question, W, W_q and W_c are parameter matrices, and b_c is a bias vector;

b) normalizing the weight of each video clip feature h^c_i using the softmax function, calculated as follows:

α^c_i = exp(e^c_i) / Σ_{j=1}^{L} exp(e^c_j)

c) performing weighted summation on h^c_i according to the weight coefficients α^c_i to obtain the question-oriented video clip-level feature representation, calculated as follows:

v^c = Σ_{i=1}^{L} α^c_i · h^c_i
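A compact sketch of the clip branch is shown below, assuming C3D yields one feature vector per clip (a 4096-dimensional vector is assumed here) and reusing the same GRU-plus-temporal-attention pattern as the frame branch; all variable names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy C3D features: 20 clips per video, 4096-d per clip (assumed size); in
# practice these come from a pretrained C3D network applied to short clips.
c3d_feats = torch.randn(2, 20, 4096)
h_q = torch.randn(2, 512)                                  # question representation from step S1

gru = nn.GRU(4096, 512, batch_first=True)
W_q = nn.Linear(512, 512, bias=False)
W_c = nn.Linear(512, 512)                                  # its bias plays the role of b_c
W = nn.Linear(512, 1, bias=False)

h_c, _ = gru(c3d_feats)                                    # clip hidden states, (2, 20, 512)
scores = W(torch.tanh(W_q(h_q).unsqueeze(1) + W_c(h_c)))   # e^c_i, (2, 20, 1)
alpha = F.softmax(scores, dim=1)                           # weights over the clips
v_c = (alpha * h_c).sum(dim=1)                             # v^c, question-oriented clip feature (2, 512)
```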

and S6, performing feature fusion on the video text-level feature representation obtained in step S2, the video frame-level feature representation obtained in step S4 and the video clip-level feature representation obtained in step S5 to obtain a question-guided video feature representation, and generating an answer.

Specifically, the probability distribution of words in the answer set is calculated using the Softmax function.

In particular, feature fusion is performed on the obtained question-oriented video text-level feature representation v^s, video frame-level feature representation v^f and video clip-level feature representation v^c to obtain a question-guided video feature representation, and the probability distribution over the words in the answer set is calculated using the softmax function; the final answer prediction is calculated as follows:

p = softmax(W_a · [v^s; v^f; v^c] + b_a)

wherein [ ; ] denotes the concatenation of vectors, W_a is a parameter matrix, and b_a is a bias vector.
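A minimal PyTorch sketch of this fusion-and-prediction step is given below; the feature dimensions are illustrative assumptions, and the answer-set size of 1000 follows the experimental setup described later in this document.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerDecoder(nn.Module):
    """Concatenate the three question-oriented video features and predict an answer (sketch)."""
    def __init__(self, text_dim=512, frame_dim=512, clip_dim=512, num_answers=1000):
        super().__init__()
        # Linear layer holding W_a and b_a from the formula above.
        self.W_a = nn.Linear(text_dim + frame_dim + clip_dim, num_answers)

    def forward(self, v_s, v_f, v_c):
        fused = torch.cat([v_s, v_f, v_c], dim=1)     # [v^s; v^f; v^c]
        return F.softmax(self.W_a(fused), dim=1)      # probability distribution over the answer set

# Example usage with dummy fused features.
p = AnswerDecoder()(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
answer_idx = p.argmax(dim=1)                           # index of the predicted answer word
```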

The method provided by the invention was subjected to experimental analysis:

The experimental analysis was carried out on the video question-answering dataset ActivityNet-QA.

ActivityNet-QA is derived from the ActivityNet dataset, which focuses on recognizing everyday, high-level, goal-oriented activities in user-generated video. It comprises 14429 video clips, 71961 natural language descriptions and 130161 question-answer pairs, and the average video length is 180 s. The question-answer pairs are generated manually from given question templates and fall into four types: Motion questions, which require a coarse understanding of temporal behavior; Spatial Relationship questions, which mainly involve spatial reasoning over static video frames; Temporal Relationship questions, which require finding relationships between consecutive video frames; and Free questions, which combine two or more of the above types. Based on answer attributes, the invention further classifies the questions into five categories: What, Who, How, Where, and Other.

The invention divides the ActivityNet-QA dataset into a 75% training set, a 5% validation set and a 20% test set, performs word statistics on the answers in the question-answer pairs, and selects the 1000 most frequent words in the vocabulary as the answer set.

The questions and answers are encoded with a Word2Vec model with an embedding dimension of 300; for words that cannot be found in the vocabulary, the word-embedding feature is obtained by averaging the existing word-embedding features.
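The following small sketch illustrates, under assumed data structures, how the answer set and the fallback embedding for out-of-vocabulary words described above could be built; the function names and the toy embedding table are illustrative assumptions.

```python
from collections import Counter
import numpy as np

def build_answer_set(train_answers, top_k=1000):
    """Keep the top_k most frequent answer words as the answer vocabulary."""
    counts = Counter(word for ans in train_answers for word in ans.split())
    return [word for word, _ in counts.most_common(top_k)]

def embed_word(word, w2v_table):
    """300-d Word2Vec lookup; words missing from the vocabulary get the mean of the known vectors."""
    if word in w2v_table:
        return w2v_table[word]
    return np.mean(np.stack(list(w2v_table.values())), axis=0)

# Toy usage with a stand-in embedding table.
w2v_table = {"dog": np.random.rand(300), "run": np.random.rand(300)}
print(build_answer_set(["dog", "run", "dog"], top_k=2))
print(embed_word("cat", w2v_table).shape)   # (300,) averaged fallback vector
```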

To objectively evaluate the performance of the invention, it was tested on the ActivityNet-QA dataset according to the procedure described in the embodiment, and the resulting accuracy reached 38.2%.

The working principle of the invention is as follows:

1) Generate M sentences of text description for each video using a pre-trained Lexical-FCN model, and generate an attention score for each text description using a text attention mechanism, so as to obtain text-level features related to the question.

2) Extract fine-grained region features from the video and score the extracted region features with a spatial attention mechanism, so as to obtain frame-level features related to the question.

3) Use a temporal attention mechanism to obtain the frames and video clips most relevant to the question, and predict the final answer using the video text-level, frame-level and clip-level features.

The method reduces the modal difference between the video and the question, fully utilizes the fine-grained information of the video, and can effectively improve the accuracy of the video question answering.

The computer device of the present invention may be a device including a processor, a memory and the like, for example a single-chip microcomputer including a central processing unit. The processor implements the steps of the above video question-answering method based on text attention and fine-grained information when executing the computer program stored in the memory.

The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function, an image playing function, etc.); the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.

Computer-readable storage medium embodiments

The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, etc.; a computer program is stored on the computer-readable storage medium, and when the computer program is read and executed by the processor of the computer device, the steps of the above video question-answering method based on text attention and fine-grained information can be implemented.

The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory, random access memory, electrical carrier signal, telecommunications signal, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.
