Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning

Document No.: 1798829 · Publication date: 2021-11-05

Note: This invention, "Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning", was designed and created by 孙广路, 梁丽丽, 李天麟, 张博伦 and 谭涛 on 2021-08-09. Abstract: The invention provides a video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning, and belongs to the cross field of computer vision and natural language processing. First, object spatial features and object category features are computed; next, the visual motion features of the sub-videos are extracted; next, the object spatial features, object category features and visual motion features are fused to obtain object joint features; next, the question is encoded to obtain problem features; next, the object joint features and problem features are input into an attention model to obtain joint features with problem awareness; next, object relation features with problem awareness are obtained; next, video features with temporal relations are obtained; finally, fused features are obtained and the answer is generated. The invention solves the problems that existing video question-answering systems cannot effectively represent the degree of association between different modalities and within the same modality, and cannot obtain the motion information of different objects in a targeted manner.

1. A video question-answering system based on dynamic attention and graph network reasoning is characterized by comprising an object space characteristic and object category characteristic calculating module, a visual motion characteristic extracting module, an object joint characteristic acquiring module, a question characteristic extracting module, a question awareness joint characteristic acquiring module, a question awareness object relation characteristic acquiring module, a video time sequence characteristic acquiring module, a multi-mode characteristic fusion module and an answer generating module, wherein:

the object space characteristic and object category characteristic calculating module is used for predicting an object marking frame and a category label in a video according to an object detection model to obtain an object space characteristic and an object category characteristic;

the visual motion characteristic extraction module is used for acquiring sub-videos with the same space-time dimension according to the sub-video alignment model and extracting the visual motion characteristics of the sub-videos;

the object joint feature acquisition module is used for fusing object space features, object category features and visual motion features to obtain object joint features;

the problem feature extraction module is used for coding a problem to obtain problem features;

the problem awareness joint feature acquisition module is used for inputting the object joint features and the problem features into the attention model to acquire joint features with problem awareness;

the problem-aware object relation feature acquisition module is used for constructing a graph convolution network with problem awareness to infer complex relations among different objects and acquiring problem-aware object relation features;

the video time sequence characteristic acquisition module is used for acquiring a hidden state at the last moment of a GRU decoding stage and representing the hidden state as a video characteristic with a time sequence relation;

the multi-modal feature fusion module is used for performing low-rank bilinear fusion on the problem features and the video features with the time sequence relation to obtain fusion features;

the answer generation module is used for calculating and generating an answer by using a softmax function.

2. A video question-answering method based on dynamic attention and graph network reasoning is characterized by comprising the following steps:

s1, for a section of video, predicting an object labeling frame and a class label in the video according to an object detection model to calculate object space characteristics and object class characteristics;

s2, aiming at a section of video, obtaining sub-videos with the same space-time dimension according to a sub-video alignment model, and extracting visual motion characteristics of the sub-videos;

s3, performing feature fusion on the object space features and the object category features obtained in the step S1 on the visual motion features obtained in the step S2 to obtain object joint features;

s4, aiming at the problem, coding the problem to obtain the problem characteristic;

s5, inputting the object joint characteristics obtained in the step S3 and the problem characteristics obtained in the step S4 into an attention model to obtain joint characteristics with problem awareness;

s6, initializing graph convolution network nodes by using the joint features with problem consciousness obtained in the step S5, and constructing a graph convolution network with problem consciousness to reason about complex relationships among different objects to obtain object relationship features with problem consciousness;

s7, coding the object relation characteristics with problem awareness obtained in the step S6 by using a GRU model, and taking the hidden state at the last moment as video characteristics with time sequence relation;

and S8, performing low-rank bilinear fusion on the video features with the time sequence relation obtained in the step S7 and the question features obtained in the step S4 to obtain fusion features, and generating an answer by using softmax.

3. The method according to claim 2, wherein the specific method for predicting the object labeling box and the class label in the video according to the object detection model to extract the object spatial feature and the object class feature in step S1 is: detecting k objects with the highest confidence score by adopting a target detection model Mask R-CNN, and calculating object space characteristics and object category characteristics according to the labeling frames and the category labels; the specific calculation process is as follows:

a) calculating the object spatial features:

F_S = MLP(S)

where S = {x, y, h, w}, x and y are the coordinates of the object in the video frame, and h and w are the height and width of the object bounding box; MLP(·) is a shallow feedforward neural network; F_S denotes the object spatial feature in the real-number domain with dimension d_s, and the subscript s marks the object spatial feature;

b) calculating the object category features:

F_L = MLP(E_L)

where E_L is the word-embedding vector representation of the object category label; F_L denotes the object category feature in the real-number domain with dimension d_l, and the subscript l marks the object category feature.

4. The method according to claim 3, wherein in step S2 sub-videos with the same object space-time dimensions are obtained according to a sub-video alignment model, and the visual motion features of the sub-videos are then extracted with the three-dimensional convolution model C3D; the specific method for acquiring the visual motion features of the sub-videos is: performing spatial and temporal alignment on the k objects with the highest confidence scores by using the sub-video alignment model to obtain the sub-video set S = {sv_k | 1 ≤ k ≤ K} of the k objects, where sv_k denotes the sub-video of the k-th object, calculated as follows:

a) calculating the spatial positions of the object image: the coordinates (x_i, y_j) of the pixel in row i and column j are computed by mapping the target grid back to the original image; the origin of the object-image coordinate system is the upper-left corner (0, 0), the pixel size of the original image is (W, H), and the pixel size after bilinear interpolation is (w, h);

b) calculating the spatial pixel values of the object image: the point P(x, y) is the computed pixel sampling point, and its value is determined by the values of the four nearest pixels, weighted by distance; the nearest pixels Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2) are therefore used, first computing the pixel values f(R1) and f(R2) of the points R1(x, y1) and R2(x, y2):

f(R1) = ((x2 - x)/(x2 - x1)) f(Q11) + ((x - x1)/(x2 - x1)) f(Q21)

f(R2) = ((x2 - x)/(x2 - x1)) f(Q12) + ((x - x1)/(x2 - x1)) f(Q22)

and then computing the pixel value f(P) of the point P from f(R1) and f(R2):

f(P) = ((y2 - y)/(y2 - y1)) f(R1) + ((y - y1)/(y2 - y1)) f(R2);

c) converting an original length-uncertain n-frame object image sequence into a length-fixed m-frame object sub-video by using a time alignment model of an object image;

d) splitting the sub-video sv_k into N clips to obtain a set containing N clips, in which each element denotes the n-th clip in the k-th sub-video; the motion features of the clips are computed with C3D:

F_M = C3D(C_sv)

where F_M denotes the visual motion feature in the real-number domain with dimension d_m, the subscript m marks the clip motion feature, and C_sv denotes the set of clips of all sub-videos.

5. The method according to claim 4, wherein in step S3 the object spatial features F_S and object category features F_L computed in step S1 are fused with the visual motion features F_M computed in step S2 to obtain the object joint features F_J; the specific method is:

F_J = [F_S; F_L; F_M],  d_j = d_m + d_s + d_l

where [·; ·; ·] denotes the concatenation fusion operation; F_J denotes the object joint feature in the real-number domain with dimension d_j, and the subscript j marks the object joint feature.

6. The method of claim 5, wherein the step S4 of coding the question to obtain the question feature comprises:

a) for a question Q = {q_1, q_2, ..., q_M}, each word in the question is processed as follows to obtain the word vector representation r_i:

p_i = One-Hot(q_i)

r_i = W_e p_i,  i ∈ {1, 2, ..., M}

where One-Hot(·) denotes one-hot encoding, which converts the word q_i in the question into a one-hot representation p_i; W_e is the word embedding matrix, L is the length of the vocabulary, and 300 is the dimension of each word vector;

b) the word vector set {r_1, r_2, ..., r_M} is encoded with the bidirectional gated recurrent unit model Bi-GRU, the forward GRU producing one hidden-state sequence and the backward GRU producing another, with i ∈ {1, 2, ..., M} indexing the word vectors;

c) the hidden states output by the forward and backward GRUs at the last time step of the Bi-GRU model are concatenated, where [·; ·] denotes the concatenation operation, to obtain the problem features F_Q; F_Q is in the real-number domain with dimension d_q, and the subscript q marks the problem feature.

7. The method according to claim 6, wherein in step S5 the object joint features calculated in step S3 and the problem features calculated in step S4 are input into the attention model to obtain the joint features with problem awareness, and the specific method is:

a) computing the interaction space of the problem features F_Q and the object joint features F_J: the problem features and the object joint features are mapped into the interaction space with learnable weight matrices W^Q and W^J, each a d_q × d_j matrix in the real-number field, and bias vectors b_Q and b_J;

b) computing the similarity matrix of the problem features and the object joint features with a scaled dot-product function: the transpose of the problem features mapped to the interaction space is multiplied by the object joint features mapped to the interaction space, scaled by the factor d, and normalized with softmax(·);

c) the similarity matrix is multiplied by the object joint features to obtain the joint features with problem awareness; the result is in the real-number domain with dimension d_j, and the subscript j marks the joint feature with problem awareness.

8. The method of claim 7,

the specific method for obtaining the object relationship characteristic with problem awareness in step S6 is:

a) initializing the nodes of the graph convolutional network and computing:

A^(1) = softmax(W_1 X^(0) (W_2 X^(0))^T)

where A^(1) denotes the adjacency matrix of the first-layer graph convolutional network; the initial graph nodes X^(0) are initialized with the joint features with problem awareness; W_1 and W_2 are mapping matrices, and the superscript T denotes the matrix transpose;

b) constructing the graph convolutional network structure, where the reasoning process of the p-th layer (1 ≤ p ≤ P) of the graph convolutional network is:

X^(p) = A^(p) X^(p-1) W^(p)

where X^(p) denotes the node features of the p-th graph convolutional layer, A^(p) is the p-th-layer adjacency matrix computed from the node features X^(p-1) of the (p-1)-th layer, and W^(p) are the weight parameters of the p-th graph convolutional layer;

c) the p-th-layer adjacency matrix A^(p) is computed from the similarity of the (p-1)-th-layer graph node features X^(p-1) with themselves, with the formula:

A^(p) = softmax(W_1 X^(p-1) (W_2 X^(p-1))^T)

where W_1 and W_2 are mapping matrices, and the superscript T denotes the matrix transpose;

d) the initial node features X^(0) and the node features X^(P) of the last layer are fused to obtain the GCN output, with the formula:

F_R = X^(P) + X^(0)

where F_R denotes the object relation features with problem awareness generated by the GCN, and k denotes the k objects with the highest detection confidence scores; the relation features of all time periods are recorded as the set {F_t^R | 1 ≤ t ≤ N}, where F_1^R denotes the object relation features captured by the first GCN at the first time period;

the method for obtaining the video features with temporal relations in step S7 is:

h_t = GRU(F_t^R, h_(t-1)),  1 ≤ t ≤ N

where h_(t-1) denotes the hidden-state feature at time t-1; the hidden state h_N at the last time step is output and recorded as the video feature F_V with temporal relations, in the real-number domain with dimension d_v, and the subscript v marks the video feature with temporal relations.

The specific method for calculating the generated answer by using the softmax function in step S8 is:

a) the video features F_V with temporal relations and the problem features F_Q are input into the low-rank bilinear fusion (MLB) model to obtain the fusion feature z:

z = P^T σ(U^T F_V ⊙ V^T F_Q) + b

where P, U and V are three trainable weight matrices, b is a bias vector, σ is the ReLU function, ⊙ denotes the Hadamard product, and the superscript T denotes the matrix transpose;

b) the fusion feature z is passed through a softmax classifier with parameter matrix W_a and bias vector b_a, and the answer category with the highest score is output as the final answer.

9. A computer comprising a memory storing a computer program and a processor, wherein the processor implements the steps of the video question-answering method based on dynamic attention and graph network inference as claimed in any one of claims 2 to 8 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a video question-answering method based on dynamic attention and graph network reasoning according to any one of claims 2 to 8.

Technical Field

The invention relates to a video question-answering method, in particular to a video question-answering system, a video question-answering method, a video question-answering computer and a video question-answering storage medium based on dynamic attention and graph network reasoning, and belongs to the cross field of computer vision and natural language processing.

Background

In recent years, with the development of computer vision and natural language processing, a great deal of vision and language research topics have attracted attention. The video question-answering is a cross-modal task and comprises data information of various different modalities, such as a video modality, a voice modality and a text modality. The task of video question answering can be described as a process of predicting answers to questions by acquiring visual information and semantic information of a known video and natural language description questions related to video contents. When people watch a section of video, people can acquire characters, objects, environments and the like appearing in the video through visual information displayed by the video, so that people can integrally know the content of the video.

Although the current video question-answering task solves the problem of interaction among different modal information, the following defects still exist: (1) when the existing neural network is used for modeling spatio-temporal information in a video, the obtained time sequence characteristics cannot effectively represent the degree of association between different modalities and within the same modality; (2) in general, a video contains a plurality of moving objects, but the current model cannot obtain the motion information of different objects in a targeted manner and carries a large amount of redundant information; (3) in methods that adopt a graph convolution network to solve the video question-answering task, graph node representations are constructed only from object information, and information such as the position and motion of an object is not used.

In the video question and answer task, how to make the obtained time sequence characteristics effectively express the degree of association between different modalities and within the same modality, obtain the motion information of different objects in a targeted manner, reduce the large amount of redundant information, and use the position and motion information of objects to construct graph nodes are key and difficult problems of video question answering. The invention provides a video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning to address these problems.

Disclosure of Invention

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.

The invention provides a video question-answering system based on dynamic attention and graph network reasoning, which comprises an object space characteristic and object category characteristic calculation module, a visual motion characteristic extraction module, an object joint characteristic acquisition module, a question characteristic extraction module, a question awareness joint characteristic acquisition module, a question awareness object relation characteristic acquisition module, a video time sequence characteristic acquisition module, a multi-mode characteristic fusion module and an answer generation module, wherein:

the object space characteristic and object category characteristic calculating module is used for predicting an object marking frame and a category label in a video according to an object detection model to obtain an object space characteristic and an object category characteristic;

the visual motion characteristic extraction module is used for acquiring sub-videos with the same space-time dimension according to the sub-video alignment model and extracting the visual motion characteristics of the sub-videos;

the object joint feature acquisition module is used for fusing object space features, object category features and visual motion features to obtain object joint features;

the problem feature extraction module is used for coding a problem to obtain problem features;

the problem awareness joint feature acquisition module is used for inputting the object joint features and the problem features into the attention model to acquire joint features with problem awareness;

the problem-aware object relation feature acquisition module is used for constructing a graph convolution network with problem awareness to infer complex relations among different objects and acquiring problem-aware object relation features;

the video time sequence characteristic acquisition module is used for acquiring a hidden state at the last moment of a GRU decoding stage and representing the hidden state as a video characteristic with a time sequence relation;

the multi-modal feature fusion module is used for performing low-rank bilinear fusion on the problem features and the video features with the time sequence relation to obtain fusion features;

the answer generation module is used for calculating and generating an answer by using a softmax function.

A video question-answering method based on dynamic attention and graph network reasoning comprises the following steps:

s1, for a section of video, predicting an object labeling frame and a class label in the video according to an object detection model to calculate object space characteristics and object class characteristics;

s2, aiming at a section of video, obtaining sub-videos with the same space-time dimension according to a sub-video alignment model, and extracting visual motion characteristics of the sub-videos;

s3, performing feature fusion on the object space features and the object category features obtained in the step S1 on the visual motion features obtained in the step S2 to obtain object joint features;

s4, aiming at the problem, coding the problem to obtain the problem characteristic;

s5, inputting the object joint characteristics obtained in the step S3 and the problem characteristics obtained in the step S4 into an attention model to obtain joint characteristics with problem awareness;

s6, initializing graph convolution network nodes by using the joint features with problem consciousness obtained in the step S5, and constructing a graph convolution network with problem consciousness to reason about complex relationships among different objects to obtain object relationship features with problem consciousness;

s7, coding the object relation characteristics with problem awareness obtained in the step S6 by using a GRU model, and taking the hidden state at the last moment as video characteristics with time sequence relation;

and S8, performing low-rank bilinear fusion on the video features with the time sequence relation obtained in the step S7 and the question features obtained in the step S4 to obtain fusion features, and generating an answer by using softmax.

Preferably, in step S1, the specific method for extracting the object spatial feature and the object class feature by predicting the object labeling box and the class label in the video according to the object detection model is as follows: detecting k objects with the highest confidence score by adopting a target detection model Mask R-CNN, and calculating object space characteristics and object category characteristics according to the labeling frames and the category labels; the specific calculation process is as follows:

a) calculating the spatial characteristics of the object:

F_S = MLP(S)

where S = {x, y, h, w}, x and y are the coordinates of the object in the video frame, and h and w are the height and width of the object bounding box; MLP(·) is a shallow feedforward neural network; F_S denotes the object spatial feature in the real-number domain with dimension d_s, and the subscript s marks the object spatial feature;

b) calculating object class characteristics:

F_L = MLP(E_L)

where E_L is the word-embedding vector representation of the object category label; F_L denotes the object category feature in the real-number domain with dimension d_l, and the subscript l marks the object category feature.

Preferably, in step S2, sub-videos with the same object space-time dimensions are obtained according to the sub-video alignment model, and the visual motion features of the sub-videos are then extracted with the three-dimensional convolution model C3D; the specific method for acquiring the visual motion features of the sub-videos is: performing spatial and temporal alignment on the k objects with the highest confidence scores by using the sub-video alignment model to obtain the sub-video set S = {sv_k | 1 ≤ k ≤ K} of the k objects, where sv_k denotes the sub-video of the k-th object, calculated as follows:

a) calculating the spatial positions of the object image: the coordinates (x_i, y_j) of the pixel in row i and column j are computed by mapping the target grid back to the original image; the origin of the object-image coordinate system is the upper-left corner (0, 0), the pixel size of the original image is (W, H), and the pixel size after bilinear interpolation is (w, h);

b) calculating the spatial pixel values of the object image: the point P(x, y) is the computed pixel sampling point, and its value is determined by the values of the four nearest pixels, weighted by distance; the nearest pixels Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2) are therefore used, first computing the pixel values f(R1) and f(R2) of the points R1(x, y1) and R2(x, y2):

f(R1) = ((x2 - x)/(x2 - x1)) f(Q11) + ((x - x1)/(x2 - x1)) f(Q21)

f(R2) = ((x2 - x)/(x2 - x1)) f(Q12) + ((x - x1)/(x2 - x1)) f(Q22)

and then computing the pixel value f(P) of the point P from f(R1) and f(R2):

f(P) = ((y2 - y)/(y2 - y1)) f(R1) + ((y - y1)/(y2 - y1)) f(R2);

c) converting an original length-uncertain n-frame object image sequence into a length-fixed m-frame object sub-video by using a time alignment model of an object image;

d) splitting the sub-video sv_k into N clips to obtain a set containing N clips, in which each element denotes the n-th clip in the k-th sub-video; the motion features of the clips are computed with C3D:

F_M = C3D(C_sv)

where F_M denotes the visual motion feature in the real-number domain with dimension d_m, the subscript m marks the clip motion feature, and C_sv denotes the set of clips of all sub-videos.

Preferably, in step S3 the object spatial features F_S and object category features F_L calculated in step S1 are fused with the visual motion features F_M calculated in step S2 to obtain the object joint features F_J; the specific method is:

F_J = [F_S; F_L; F_M],  d_j = d_m + d_s + d_l

where [·; ·; ·] denotes the concatenation fusion operation; F_J denotes the object joint feature in the real-number domain with dimension d_j, and the subscript j marks the object joint feature.

Preferably, the step S4 encodes the question to obtain the question feature, and the specific method includes:

a) for a question Q = {q_1, q_2, ..., q_M}, each word in the question is processed as follows to obtain the word vector representation r_i:

p_i = One-Hot(q_i)

r_i = W_e p_i,  i ∈ {1, 2, ..., M}

where One-Hot(·) denotes one-hot encoding, which converts the word q_i in the question into a one-hot representation p_i; W_e is the word embedding matrix, L is the length of the vocabulary, and 300 is the dimension of each word vector;

b) the word vector set {r_1, r_2, ..., r_M} is encoded with the bidirectional gated recurrent unit model Bi-GRU, the forward GRU producing one hidden-state sequence and the backward GRU producing another, with i ∈ {1, 2, ..., M} indexing the word vectors;

c) the hidden states output by the forward and backward GRUs at the last time step of the Bi-GRU model are concatenated, where [·; ·] denotes the concatenation operation, to obtain the problem features F_Q; F_Q is in the real-number domain with dimension d_q, and the subscript q marks the problem feature.

Preferably, in step S5 the object joint features calculated in step S3 and the problem features calculated in step S4 are input into the attention model to obtain the joint features with problem awareness, and the specific method is as follows:

a) computing the interaction space of the problem features F_Q and the object joint features F_J: the problem features and the object joint features are mapped into the interaction space with learnable weight matrices W^Q and W^J, each a d_q × d_j matrix in the real-number field, and bias vectors b_Q and b_J;

b) computing the similarity matrix of the problem features and the object joint features with a scaled dot-product function: the transpose of the problem features mapped to the interaction space is multiplied by the object joint features mapped to the interaction space, scaled by the factor d, and normalized with softmax(·);

c) the similarity matrix is multiplied by the object joint features to obtain the joint features with problem awareness; the result is in the real-number domain with dimension d_j, and the subscript j marks the joint feature with problem awareness.

Preferably, the specific method for obtaining the object relationship characteristic with problem awareness in step S6 is:

a) initializing the nodes of the graph convolutional network and computing:

A^(1) = softmax(W_1 X^(0) (W_2 X^(0))^T)

where A^(1) denotes the adjacency matrix of the first-layer graph convolutional network; the initial graph nodes X^(0) are initialized with the joint features with problem awareness; W_1 and W_2 are mapping matrices, and the superscript T denotes the matrix transpose;

b) constructing the graph convolutional network structure, where the reasoning process of the p-th layer (1 ≤ p ≤ P) of the graph convolutional network is:

X^(p) = A^(p) X^(p-1) W^(p)

where X^(p) denotes the node features of the p-th graph convolutional layer, A^(p) is the p-th-layer adjacency matrix computed from the node features X^(p-1) of the (p-1)-th layer, and W^(p) are the weight parameters of the p-th graph convolutional layer;

c) the p-th-layer adjacency matrix A^(p) is computed from the similarity of the (p-1)-th-layer graph node features X^(p-1) with themselves, with the formula:

A^(p) = softmax(W_1 X^(p-1) (W_2 X^(p-1))^T)

where W_1 and W_2 are mapping matrices, and the superscript T denotes the matrix transpose;

d) the initial node features X^(0) and the node features X^(P) of the last layer are fused to obtain the GCN output, with the formula:

F_R = X^(P) + X^(0)

where F_R denotes the object relation features with problem awareness generated by the GCN, and k denotes the k objects with the highest detection confidence scores; the relation features of all time periods are recorded as the set {F_t^R | 1 ≤ t ≤ N}, where F_1^R denotes the object relation features captured by the first GCN at the first time period;

the method for obtaining the video features with temporal relations in step S7 is:

h_t = GRU(F_t^R, h_(t-1)),  1 ≤ t ≤ N

where h_(t-1) denotes the hidden-state feature at time t-1; the hidden state h_N at the last time step is output and recorded as the video feature F_V with temporal relations, in the real-number domain with dimension d_v, and the subscript v marks the video feature with temporal relations.

The specific method for calculating the generated answer by using the softmax function in step S8 is:

a) the video features F_V with temporal relations and the problem features F_Q are input into the low-rank bilinear fusion (MLB) model to obtain the fusion feature z:

z = P^T σ(U^T F_V ⊙ V^T F_Q) + b

where P, U and V are three trainable weight matrices, b is a bias vector, σ is the ReLU function, ⊙ denotes the Hadamard product, and the superscript T denotes the matrix transpose;

b) the fusion feature z is passed through a softmax classifier with parameter matrix W_a and bias vector b_a, and the answer category with the highest score is output as the final answer.

A computer comprising a memory storing a computer program and a processor implementing the steps of a video question-and-answer method based on dynamic attention and graph network reasoning when executing said computer program.

A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a video question-and-answer method based on dynamic attention and graph network reasoning.

The invention has the following beneficial effects:

the method is different from the existing extraction mode of the visual characteristics in the video question-answering model, and the method strengthens the acquisition of the motion information of a single object by extracting the object joint characteristics of the video and removes the information irrelevant to the problem. According to the invention, through a new characteristic construction idea, information about object space and category is added in the visual motion characteristics, so that semantic information in the visual information can be enriched and better combined with problem characteristics. An attention mechanism is introduced into the graph convolution network, and the graph convolution network can gradually accumulate graph node information related to problems in the reasoning process and strengthen the relation between the motion characteristics of the objects. The invention fully combines an attention mechanism and a graph convolution network to reason the complex relation between different objects in the video, thereby effectively improving the accuracy of video question answering.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a schematic structural diagram of a system according to a first embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method according to a second embodiment of the present invention;

FIG. 3 is a diagram illustrating bilinear interpolation according to a second embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating a principle of a method according to a second embodiment of the present invention;

FIG. 5 is a diagram illustrating an example of a prediction result for solving the video question-answering task according to the present invention.

Detailed Description

In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The first embodiment is as follows:

referring to fig. 1 to illustrate this embodiment, a video question-answering system based on dynamic attention and graph network reasoning according to this embodiment includes an object space feature and object category feature calculation module, a visual motion feature extraction module, an object joint feature acquisition module, a question feature extraction module, a question-awareness joint feature acquisition module, a question-awareness object relationship feature acquisition module, a video time series feature acquisition module, a multi-modal feature fusion module, and an answer generation module;

the object space characteristic and object category characteristic calculating module is used for predicting an object marking frame and a category label in a video according to an object detection model to obtain an object space characteristic and an object category characteristic;

the visual motion characteristic extraction module is used for acquiring sub-videos with the same space-time dimension according to the sub-video alignment model and extracting the visual motion characteristics of the sub-videos;

the object joint feature acquisition module is used for fusing object space features, object category features and visual motion features to obtain object joint features;

the problem feature extraction module is used for coding a problem to obtain problem features;

the problem awareness joint feature acquisition module is used for inputting the object joint features and the problem features into the attention model to acquire joint features with problem awareness;

the problem-aware object relation feature acquisition module is used for constructing a graph convolution network with problem awareness to infer complex relations among different objects and acquiring problem-aware object relation features;

the video time sequence characteristic acquisition module is used for acquiring a hidden state at the last moment of a GRU decoding stage and representing the hidden state as a video characteristic with a time sequence relation;

the multi-modal feature fusion module is used for performing low-rank bilinear fusion on the problem features and the video features with the time sequence relation to obtain fusion features;

the answer generation module is used for calculating and generating an answer by using a softmax function.

Example two:

referring to fig. 2 to fig. 5, the present embodiment is described, and a video question-answering method based on dynamic attention and graph network inference in this embodiment includes the following steps:

s1, for a section of video, predicting an object labeling frame and a class label in the video according to an object detection model to calculate object space characteristics and object class characteristics;

specifically, the specific method for predicting the object labeling box and the class label in the video to extract the object spatial feature and the object class feature is as follows: detecting k objects with the highest confidence score by adopting a target detection model Mask R-CNN, and calculating object space characteristics and object category characteristics according to the labeling frames and the category labels; the specific calculation process is as follows:

a) calculating the spatial characteristics of the object:

F_S = MLP(S),

where S = {x, y, h, w}, x and y are the coordinates of the object in the video frame, and h and w are the height and width of the object bounding box; MLP(·) is a shallow feedforward neural network; F_S denotes the object spatial feature in the real-number domain with dimension d_s, and the subscript s marks the object spatial feature;

b) calculating object class characteristics:

F_L = MLP(E_L),

where E_L is the word-embedding vector representation of the object category label; F_L denotes the object category feature in the real-number domain with dimension d_l, and the subscript l marks the object category feature.
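For illustration only, a minimal PyTorch sketch of this feature computation follows; the layer widths, the output dimensions d_s and d_l, the class-vocabulary size and the 300-dimensional label embedding are assumed values for the sketch, not parameters fixed by this embodiment.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
d_s, d_l = 128, 128            # output dims of spatial / category features
num_classes, d_word = 80, 300  # class vocabulary size and label-embedding size

spatial_mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, d_s))
label_embed = nn.Embedding(num_classes, d_word)   # E_L: class label -> word vector
label_mlp = nn.Sequential(nn.Linear(d_word, 64), nn.ReLU(), nn.Linear(64, d_l))

# k detected objects, each with a box (x, y, h, w) and a class id from Mask R-CNN.
k = 5
boxes = torch.rand(k, 4)                          # S = {x, y, h, w} per object
class_ids = torch.randint(0, num_classes, (k,))

F_S = spatial_mlp(boxes)                          # F_S: (k, d_s) object spatial features
F_L = label_mlp(label_embed(class_ids))           # F_L: (k, d_l) object category features
print(F_S.shape, F_L.shape)
```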

S2, aiming at a section of video, obtaining sub-videos with the same space-time dimension according to a sub-video alignment model, and extracting visual motion characteristics of the sub-videos;

Specifically, sub-videos with the same object space-time dimensions are obtained according to the sub-video alignment model, and the visual motion features of the sub-videos are then extracted with the three-dimensional convolution model C3D; the specific method for acquiring the visual motion features of the sub-videos is: performing spatial and temporal alignment on the k objects with the highest confidence scores by using the sub-video alignment model to obtain the sub-video set S = {sv_k | 1 ≤ k ≤ K} of the k objects, where sv_k denotes the sub-video of the k-th object, calculated as follows:

a) calculating the spatial positions of the object image: the coordinates (x_i, y_j) of the pixel in row i and column j are computed by mapping the target grid back to the original image; the origin of the object-image coordinate system is the upper-left corner (0, 0), the pixel size of the original image is (W, H), and the pixel size after bilinear interpolation is (w, h);

b) calculating the spatial pixel values of the object image: the point P(x, y) is the computed pixel sampling point, and its value is determined by the values of the four nearest pixels, weighted by distance; the nearest pixels Q11(x1, y1), Q12(x1, y2), Q21(x2, y1) and Q22(x2, y2) are therefore used, first computing the pixel values f(R1) and f(R2) of the points R1(x, y1) and R2(x, y2):

f(R1) = ((x2 - x)/(x2 - x1)) f(Q11) + ((x - x1)/(x2 - x1)) f(Q21)

f(R2) = ((x2 - x)/(x2 - x1)) f(Q12) + ((x - x1)/(x2 - x1)) f(Q22)

and then computing the pixel value f(P) of the point P from f(R1) and f(R2):

f(P) = ((y2 - y)/(y2 - y1)) f(R1) + ((y - y1)/(y2 - y1)) f(R2);
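As a concrete rendering of the interpolation described above, the following sketch resizes a single-channel object crop with the standard bilinear formula, computing f(R1), f(R2) and then f(P) for every target pixel; it is a plain NumPy illustration under assumed image sizes, not the sub-video alignment module itself.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a (H, W) image to (out_h, out_w) with bilinear interpolation."""
    H, W = img.shape
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            # Map the target pixel (i, j) back to source coordinates (y, x).
            y = i * (H - 1) / max(out_h - 1, 1)
            x = j * (W - 1) / max(out_w - 1, 1)
            y1, x1 = int(np.floor(y)), int(np.floor(x))
            y2, x2 = min(y1 + 1, H - 1), min(x1 + 1, W - 1)
            dy, dx = y - y1, x - x1
            # f(R1), f(R2): interpolate along x, then f(P) along y.
            f_r1 = (1 - dx) * img[y1, x1] + dx * img[y1, x2]
            f_r2 = (1 - dx) * img[y2, x1] + dx * img[y2, x2]
            out[i, j] = (1 - dy) * f_r1 + dy * f_r2
    return out

crop = np.arange(12, dtype=np.float64).reshape(3, 4)   # a tiny 3x4 object crop
print(bilinear_resize(crop, 6, 8))
```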

c) converting an original length-uncertain n-frame object image sequence into a length-fixed m-frame object sub-video by using a time alignment model of an object image;

d) splitting the sub-video sv_k into N clips to obtain a set containing N clips, in which each element denotes the n-th clip in the k-th sub-video; the motion features of the clips are computed with C3D:

F_M = C3D(C_sv),

where F_M denotes the visual motion feature in the real-number domain with dimension d_m, the subscript m marks the clip motion feature, and C_sv denotes the set of clips of all sub-videos.
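A minimal sketch of the clip splitting and motion-feature extraction follows; a small randomly initialized 3D convolutional network stands in for the pre-trained C3D model, and the frame count m, clip count N and feature dimension d_m are assumed values.

```python
import torch
import torch.nn as nn

m, N = 16, 4            # frames per aligned sub-video and number of clips (assumed)
d_m = 256               # motion-feature dimension (assumed)

# A toy 3D CNN standing in for the pre-trained C3D feature extractor.
conv3d = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, d_m),
)

sub_video = torch.rand(3, m, 112, 112)            # one aligned sub-video sv_k: (C, T, H, W)
clips = sub_video.chunk(N, dim=1)                 # split the m frames into N clips
F_M = torch.stack([conv3d(c.unsqueeze(0)).squeeze(0) for c in clips])  # (N, d_m)
print(F_M.shape)
```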

S3, performing feature fusion on the object space features and the object category features obtained in the step S1 with the visual motion features obtained in the step S2 to obtain object joint features;

Specifically, the object spatial features F_S, the object category features F_L and the visual motion features F_M are fused to obtain the object joint features F_J.

The specific method for obtaining the object joint features is as follows:

F_J = [F_S; F_L; F_M],  d_j = d_m + d_s + d_l

where [·; ·; ·] denotes the concatenation fusion operation; F_J denotes the object joint feature in the real-number domain with dimension d_j, and the subscript j marks the object joint feature.
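The fusion itself reduces to a concatenation along the feature axis; a short PyTorch sketch with assumed feature sizes is shown below.

```python
import torch

k, d_s, d_l, d_m = 5, 128, 128, 256   # assumed sizes
F_S, F_L, F_M = torch.rand(k, d_s), torch.rand(k, d_l), torch.rand(k, d_m)

# F_J = [F_S; F_L; F_M]: concatenate along the feature axis, d_j = d_s + d_l + d_m.
F_J = torch.cat([F_S, F_L, F_M], dim=-1)
print(F_J.shape)   # (k, d_s + d_l + d_m)
```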

S4, aiming at the problem, coding the problem to obtain the problem characteristic;

specifically, the problem is coded to obtain the problem characteristics, and the specific method is as follows:

a) for a question Q = {q_1, q_2, ..., q_M}, each word in the question is processed as follows to obtain the word vector representation r_i:

p_i = One-Hot(q_i)

r_i = W_e p_i,  i ∈ {1, 2, ..., M}

where One-Hot(·) denotes one-hot encoding, which converts the word q_i in the question into a one-hot representation p_i; W_e is the word embedding matrix, L is the length of the vocabulary, and 300 is the dimension of each word vector.

b) The word vector set {r_1, r_2, ..., r_M} is encoded with the bidirectional gated recurrent unit model Bi-GRU, the forward GRU producing one hidden-state sequence and the backward GRU producing another, with i ∈ {1, 2, ..., M} indexing the word vectors.

c) The hidden states output by the forward and backward GRUs at the last time step of the Bi-GRU model are concatenated, where [·; ·] denotes the concatenation operation, to obtain the problem features F_Q; F_Q is in the real-number domain with dimension d_q, and the subscript q marks the problem feature.
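A minimal PyTorch sketch of this question encoding follows; the vocabulary size, question length and hidden size are assumed, and torch.nn.GRU with bidirectional=True stands in for the Bi-GRU, its two final hidden states being concatenated as F_Q.

```python
import torch
import torch.nn as nn

L_vocab, d_q = 1000, 256        # vocabulary size and question-feature size (assumed)
word_embed = nn.Embedding(L_vocab, 300)          # W_e: word index -> 300-d vector
bi_gru = nn.GRU(300, d_q // 2, batch_first=True, bidirectional=True)

question = torch.randint(0, L_vocab, (1, 8))     # token ids of an 8-word question
r = word_embed(question)                         # word vectors r_1..r_M: (1, M, 300)
outputs, h_n = bi_gru(r)                         # h_n: (2, 1, d_q // 2), one state per direction
F_Q = torch.cat([h_n[0], h_n[1]], dim=-1)        # concatenate forward/backward final states
print(F_Q.shape)                                 # (1, d_q)
```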

S5, inputting the object joint features obtained in the step S3 and the problem features obtained in the step S4 into an attention model to obtain joint features with problem awareness;

Specifically, the method for obtaining the joint features with problem awareness is as follows:

a) computing the interaction space of the problem features F_Q and the object joint features F_J: the problem features and the object joint features are mapped into the interaction space with learnable weight matrices W^Q and W^J, each a d_q × d_j matrix in the real-number field, and bias vectors b_Q and b_J;

b) computing the similarity matrix of the problem features and the object joint features with a scaled dot-product function: the transpose of the problem features mapped to the interaction space is multiplied by the object joint features mapped to the interaction space, scaled by the factor d, and normalized with softmax(·);

c) the similarity matrix is multiplied by the object joint features to obtain the joint features with problem awareness; the result is in the real-number domain with dimension d_j, and the subscript j marks the joint feature with problem awareness.
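The following sketch shows one plausible reading of this attention step, in which each object's joint feature is reweighted by its similarity to the problem feature; the projection size d and the feature dimensions are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, d_j, d_q, d = 5, 512, 256, 64       # object count and feature sizes (assumed)
W_Q, W_J = nn.Linear(d_q, d), nn.Linear(d_j, d)   # maps into a shared interaction space

F_J = torch.rand(k, d_j)               # object joint features (one per object)
F_Q = torch.rand(1, d_q)               # problem feature

q, j = W_Q(F_Q), W_J(F_J)              # features mapped to the interaction space
A = F.softmax(q @ j.t() / d, dim=-1)   # scaled dot-product similarity, d as scaling factor: (1, k)
F_J_aware = A.t() * F_J                # reweight each object's joint feature by its score
print(F_J_aware.shape)                 # (k, d_j)
```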

S6, initializing graph convolution network nodes by using the joint features with problem awareness obtained in the step S5, and constructing a graph convolution network with problem awareness to reason about complex relationships among different objects to obtain object relationship features with problem awareness;

Specifically, the method for obtaining the object relation features with problem awareness is as follows:

a) initializing the nodes of the graph convolutional network and computing:

A^(1) = softmax(W_1 X^(0) (W_2 X^(0))^T)

where A^(1) denotes the adjacency matrix of the first-layer graph convolutional network; the initial graph nodes X^(0) are initialized with the joint features with problem awareness; W_1 and W_2 are mapping matrices, and the superscript T denotes the matrix transpose;

b) constructing the graph convolutional network structure, where the reasoning process of the p-th layer (1 ≤ p ≤ P) of the graph convolutional network is:

X^(p) = A^(p) X^(p-1) W^(p)

where X^(p) denotes the node features of the p-th graph convolutional layer, A^(p) is the p-th-layer adjacency matrix computed from the node features X^(p-1) of the (p-1)-th layer, and W^(p) are the weight parameters of the p-th graph convolutional layer;

c) the p-th-layer adjacency matrix A^(p) is computed from the similarity of the (p-1)-th-layer graph node features X^(p-1) with themselves, with the formula:

A^(p) = softmax(W_1 X^(p-1) (W_2 X^(p-1))^T)

where W_1 and W_2 are mapping matrices, and the superscript T denotes the matrix transpose;

d) the initial node features X^(0) and the node features X^(P) of the last layer are fused to obtain the GCN output, with the formula:

F_R = X^(P) + X^(0)

where F_R denotes the object relation features with problem awareness generated by the GCN, and k denotes the k objects with the highest detection confidence scores. The video question-answering system provided by this patent uses N graph convolutional networks (GCNs) to capture the object relation features of the video in different time periods; finally, the relation features of all time periods are recorded as the set {F_t^R | 1 ≤ t ≤ N}, where F_1^R denotes the object relation features captured by the first GCN at the first time period.
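A minimal PyTorch sketch of one problem-aware GCN follows; the number of layers P, the node-feature size and the use of unbiased linear layers for W_1, W_2 and W^(p) are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, d, P = 5, 512, 2                    # nodes (objects), node-feature size, layers (assumed)
W1 = nn.Linear(d, d, bias=False)       # mapping matrices W_1, W_2
W2 = nn.Linear(d, d, bias=False)
W_layers = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(P))  # W^(p)

X = torch.rand(k, d)                   # X^(0): nodes initialized with problem-aware joint features
X0 = X
for p in range(P):
    A = F.softmax(W1(X) @ W2(X).t(), dim=-1)   # A^(p) from the previous layer's node features
    X = A @ W_layers[p](X)                     # X^(p) = A^(p) X^(p-1) W^(p)
F_R = X + X0                                   # residual fusion of first and last layer nodes
print(F_R.shape)                               # (k, d)
```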

S7, coding the object relation characteristics with problem awareness obtained in the step S6 by using a GRU model, and taking the hidden state at the last moment as video characteristics with time sequence relation;

Specifically, the object relation features are encoded with the GRU model, and the video features with temporal relations are obtained as follows:

h_t = GRU(F_t^R, h_(t-1)),  1 ≤ t ≤ N

where h_(t-1) denotes the hidden-state feature at time t-1; the hidden state h_N at the last time step is output and recorded as the video feature F_V with temporal relations, in the real-number domain with dimension d_v, and the subscript v marks the video feature with temporal relations.
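A minimal sketch of the temporal encoding follows; flattening the per-object relation features into one vector per time period is an assumption made only so that a standard torch.nn.GRU can consume the sequence.

```python
import torch
import torch.nn as nn

N, k, d, d_v = 4, 5, 512, 256          # time periods, objects, relation/video feature sizes (assumed)
gru = nn.GRU(k * d, d_v, batch_first=True)

# One relation feature F_t^R per time period, flattened so the GRU sees one vector per step.
F_R_seq = torch.rand(1, N, k * d)
outputs, h_N = gru(F_R_seq)            # h_N: hidden state at the last time step
F_V = h_N.squeeze(0)                   # video feature with temporal relations, (1, d_v)
print(F_V.shape)
```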

And S8, performing low-rank bilinear fusion on the video features with the time sequence relation obtained in the step S7 and the question features obtained in the step S4 to obtain fusion features, and generating an answer by using softmax.

Specifically, the specific method for generating the answer is as follows:

a) the video features F_V with temporal relations and the problem features F_Q are input into the low-rank bilinear fusion (MLB) model to obtain the fusion feature z:

z = P^T σ(U^T F_V ⊙ V^T F_Q) + b

where P, U and V are three trainable weight matrices, b is a bias vector, σ is the ReLU function, ⊙ denotes the Hadamard product, and the superscript T denotes the matrix transpose;

b) the fusion feature z is passed through a softmax classifier with parameter matrix W_a and bias vector b_a, and the answer category with the highest score is output as the final answer.
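A minimal sketch of the MLB fusion and answer prediction follows; the feature sizes, the joint-embedding size d_z and the answer-vocabulary size are assumed values, and nn.Linear layers stand in for the matrices P, U, V and W_a.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_v, d_q, d_z, n_answers = 256, 256, 512, 1000   # assumed sizes
U = nn.Linear(d_v, d_z, bias=False)
V = nn.Linear(d_q, d_z, bias=False)
P = nn.Linear(d_z, d_z)                           # includes the bias vector b
classifier = nn.Linear(d_z, n_answers)            # W_a, b_a

F_V, F_Q = torch.rand(1, d_v), torch.rand(1, d_q)
z = P(F.relu(U(F_V) * V(F_Q)))                    # z = P^T sigma(U^T F_V ⊙ V^T F_Q) + b
probs = F.softmax(classifier(z), dim=-1)          # answer distribution
answer = probs.argmax(dim=-1)                     # category with the highest score
print(answer)
```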

The method provided by the invention is subjected to experimental analysis:

the model provided by the invention is subjected to experimental analysis on a video question and answer data set MSVD-QA data set.

The MSVD-QA data set is built with an automatic generation method based on the Microsoft Research Video Description Corpus, and its questions are open-ended. The MSVD-QA data set contains 1,970 video clips and 50,505 question-answer pairs in total, and consists mainly of videos of people's daily lives.

The MSVD-QA data set is divided into a training set containing 61% of the videos, a validation set containing 13% of the videos, and a test set containing 26% of the videos.

Object features are extracted with Mask R-CNN, with a feature dimension of 1024. Dynamic information of the objects is captured with a C3D model pre-trained on the Sports-1M data set, and the output feature vector of the conv5b layer is taken as the extracted object motion feature, with a feature size of 1024. For the questions, a pre-trained GloVe model is used to extract semantic representations, with a feature dimension of 300 for each word.

In order to objectively evaluate the performance of the present invention, the performance of the present invention on the MSVD-QA dataset was tested according to the procedures described in the embodiments, and the accuracy of the obtained experimental results could reach 52.68%, which is 8.8% higher than that of the existing method.

The model provided by the invention is subjected to experimental analysis on the SVQA data set of the video question and answer data set.

The SVQA data set is a video question-answering data set requiring strong reasoning. Its videos are generated with the Unity3D tool, and its questions are automatically generated from fixed templates; it contains 2,000 videos and 120K question-answer pairs.

The SVQA data set is divided into a training set containing 65% of the videos, a validation set containing 15% of the videos, and a test set containing 20% of the videos.

In order to objectively evaluate the performance of the present invention, the performance of the present invention on the SVQA dataset was tested according to the procedures described in the embodiments, and the accuracy of the obtained experimental results can reach 77.6%, which is 5.4% higher than that of the existing method.

The computer device of the present invention may be a device comprising a processor, a memory and the like, for example a single-chip microcomputer comprising a central processing unit. The processor is used for implementing the steps of the above video question-answering method based on dynamic attention and graph network reasoning when executing the computer program stored in the memory.

The processor may be a central processing unit, but may also be other general purpose processors, digital signal processors, application specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other volatile solid state storage device.

Computer-readable storage medium embodiments

The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory and the like; the computer-readable storage medium stores a computer program which, when read and executed by the processor of the computer device, implements the steps of the above video question-answering method based on dynamic attention and graph network reasoning.

The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory, random access memory, electrical carrier signal, telecommunications signal, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.
