Scene text visual question-answering method based on a knowledge-guided deep attention network

Document No.: 1963813    Publication date: 2021-12-14

Reading note: this technology, "Scene text visual question-answering method based on a knowledge-guided deep attention network", was designed and created by 余宙 (Yu Zhou), 朱俊杰 (Zhu Junjie) and 俞俊 (Yu Jun) on 2021-08-05. The invention discloses a scene text visual question-answering method based on a knowledge-guided deep attention network, comprising the steps of: 1. data preprocessing and division of the data set; 2. constructing the language features of the question; 3. constructing the object comprehensive features of the image; 4. obtaining the text comprehensive features of the image; 5. constructing the features of previously predicted words; 6. constructing relative spatial relationship features; 7. constructing relative semantic relationship features; 8. constructing the deep neural network; 9. the loss function; 10. training the model; 11. computing the network predictions. By modeling the relative spatial relationship features between detected objects and text objects and the relative semantic relationship features between previously predicted words and text objects, the invention obtains prior knowledge relationships, and obtains richer information by deeply stacking knowledge-enhanced self-attention layers; compared with previous methods based on convolutional neural networks and on shallow inter-object relations, performance is greatly improved.

1. A scene text visual question-answering method based on a knowledge-guided deep attention network, characterized by comprising the following steps:

Step (1): division of the data set

Dividing the data set;

Step (2): constructing the language features of the question

A question consists of K words. Each word k is converted into a word vector containing semantic information using a pre-trained word vector model, where d_ques denotes the word-vector dimension of the question words. The word vectors of the K words of a question are concatenated into the language feature of the complete question and mapped to a d-dimensional space through a linear transformation, yielding the feature Q ∈ R^(K×d);

Step (3): constructing the object comprehensive features of the image

For an input image, a trained object detection network is used to compute M candidate boxes containing objects in the image. For each candidate box m, its spatial feature is obtained; the region of the image corresponding to the candidate box is fed into the object detection network and the output of a chosen layer is extracted as the visual feature of the candidate box. The visual feature and the spatial feature are projected to a d-dimensional space using learnable linear transformations to obtain the comprehensive feature of candidate box m. The comprehensive features of all objects in one image are concatenated into the overall object comprehensive feature X_obj ∈ R^(M×d);

Step (4): obtaining the text comprehensive features of the image

For an input image, an offline OCR system is used to obtain N pieces of text information in the image, each comprising candidate-box information and in-box character information. For each candidate box n, its spatial feature is obtained; the corresponding region of the image is fed into the trained object detection network and the output of a chosen layer is extracted as the visual feature of the candidate box. For each in-box character string n, one text character feature is obtained using the FastText algorithm and another text character feature is obtained using the PHOC algorithm. The spatial feature, the visual feature, and the two text character features are projected to a d-dimensional space using linear transformations to obtain the text comprehensive feature of text item n. All text comprehensive features in one image are concatenated into the overall text comprehensive feature X_ocr ∈ R^(N×d);

Step (5): constructing the features of previously predicted words

The model iteratively decodes to generate the answer by selecting words from the OCR text or a fixed vocabulary. In the iterative autoregressive decoding process, at the t-th decoding step, the feature, source, and position of the previously predicted word are used to construct the comprehensive feature of the previously predicted word, which is used as the input of the (t+1)-th decoding step. A complete answer consists of the words predicted over T decoding steps, and the features of all previously predicted words are concatenated into the complete previously-predicted-word feature X_dec ∈ R^(T×d);

Step (6): constructing relative spatial relationship features

For any two objects i and j in an image, a relative spatial relationship feature is generated. From all objects in one image, whose total number is M + N, the overall relative spatial relationship feature F_sp is constructed; here "object" covers both detected objects and text;

Step (7): constructing relative semantic relationship features

In the iterative autoregressive decoding process, at the t-th decoding step, the cosine similarity between the word c predicted at the (t-1)-th decoding step and each piece of character information n in the image is computed to obtain the relative semantic relationship feature between them. A complete answer consists of the words predicted over T decoding steps, and the relative semantic relationship features between all predicted words of the answer and all text objects in the image are concatenated into the overall relative semantic relationship feature F_se;

Step (8): constructing the deep neural network

The language feature Q of the question, the object comprehensive feature X_obj of the image, the text comprehensive feature X_ocr of the image, and the previously-predicted-word feature X_dec are concatenated into the feature I. The feature I and the relative spatial relationship feature F_sp are fed into the reasoning module to generate a vector Z that fuses the information of every modality. At the t-th decoding step, the output Z_t together with the corresponding text features and previously-predicted-word features is fed into a dynamic pointer network, whose output is combined with the relative semantic relationship feature F_se to predict the answer probabilities of the words in the text and the answer probabilities of the words in the fixed vocabulary. The two sets of probabilities are concatenated, the word with the highest probability is selected as the predicted answer of the t-th decoding step, and the next word is predicted iteratively until the end;

Step (9): loss function

The predicted answers output in step (8) and the corresponding correct answers are input into the loss function together, and the loss value is computed;

Step (10): training the model

According to the loss value produced by the loss function in step (9), gradients are back-propagated to the deep neural network model parameters of step (8) using the back-propagation algorithm, and the parameters are optimized continuously until the whole network model converges;

Step (11): computing the network prediction

According to the probabilities output in step (8), the word with the highest probability is selected as the predicted answer of the t-th decoding step, and the next word is predicted iteratively until the end, generating the final answer.

2. The knowledge-guided deep attention network-based scene text visual question answering method according to claim 1, wherein the language features of the constructed question in the step (2) are as follows:

Each word k in the question is converted into a word vector containing semantic information using a pre-trained word vector model, where d_ques denotes the word-vector dimension of the question words. The K word vectors of the question are concatenated into the language feature of the complete question and mapped to a d-dimensional space through a linear transformation, yielding Q ∈ R^(K×d). The specific formula is as follows:

where Linear is a linear transformation.

3. The knowledge-guided deep attention network-based scene text visual question answering method according to claim 2, wherein the object comprehensive characteristics of the constructed image in the step (3) are as follows:

For each candidate box m, its spatial position coordinates are (x_min, y_min, x_max, y_max), where (x_min, y_min) are the coordinates of the top-left corner of the candidate box and (x_max, y_max) are the coordinates of the bottom-right corner; from these the spatial feature is obtained. The specific formula is as follows:

wherein W, H are the width and height of the image, respectively;

For each candidate box m, the visual feature is extracted as described above. The visual feature and the spatial feature are projected to a d-dimensional space using learnable linear transformations to obtain the comprehensive feature of candidate box m. The specific formula is as follows:

where W_1 and W_2 are learnable mapping matrices and LN is layer normalization;

M candidate boxes containing objects are selected from each image, and the comprehensive features of all candidate boxes of an image are concatenated into the overall object comprehensive feature X_obj ∈ R^(M×d). The specific formula is as follows:

4. the knowledge-guided deep attention network-based scene text visual question answering method according to claim 3, wherein the text comprehensive characteristics of the constructed image in the step (4) are as follows:

For each candidate box n, its spatial position coordinates are (x_min, y_min, x_max, y_max), where (x_min, y_min) are the coordinates of the top-left corner of the candidate box and (x_max, y_max) are the coordinates of the bottom-right corner; from these the spatial feature is obtained. The specific formula is as follows:

For each candidate box n, the visual feature, one character feature (from FastText) and another character feature (from PHOC) are extracted as described above. The spatial feature, the visual feature, and the character features are projected to a d-dimensional space using linear transformations to obtain the text comprehensive feature of candidate box n. The specific formula is as follows:

where W_3, W_4 and W_5 are learnable mapping matrices and LN is layer normalization;

All text comprehensive features in each image are concatenated into the overall text comprehensive feature X_ocr ∈ R^(N×d). The specific formula is as follows:

5. the knowledge-guided deep attention network-based scene text visual question answering method according to claim 4, wherein the characteristics of the pre-constructed predicted words in the step (5) are as follows:

In the iterative autoregressive decoding process, the feature, source, and position of the word predicted at the t-th decoding step are used to construct the comprehensive feature of the previously predicted word, which serves as the input of the (t+1)-th decoding step; the input of the 1st decoding step is the special token '<s>'. A complete answer consists of the words predicted over T decoding steps, and the features of all previously predicted words are concatenated into the complete previously-predicted-word feature X_dec ∈ R^(T×d). The specific formula is as follows:

6. the knowledge-guided deep attention network-based scene text visual question answering method according to claim 5, wherein the relative spatial relationship features are constructed in the step (6), and specifically the following steps are carried out:

The relative spatial relationship feature is generated by modeling the relative position between any two candidate boxes. The spatial position of the i-th object is defined as (x_i, y_i, w_i, h_i), whose four coordinates are the horizontal and vertical coordinates of the object's center point, its width, and its height. The relative spatial relationship feature between the i-th object and the j-th object is then computed. The specific formula is as follows:

The relationship-pair features of all objects in each image are concatenated into the overall relative spatial relationship feature F_sp. The specific formula is as follows:

where P = M + N.

7. The knowledge-guided deep attention network-based scene text visual question answering method according to claim 6, wherein the construction of the relative semantic relationship features in the step (7) is as follows:

The previously predicted word c and a text word n in the image are each converted into a word vector containing semantic information using a pre-trained word vector model, and the relative semantic relationship feature between word c and word n is computed using cosine similarity; at the 1st prediction, the previously predicted word is the special token '<s>'. The specific formula is as follows:

A complete answer consists of T words and the image contains N text words; the relative semantic relationship features between all predicted words of the answer and all text words in the image are concatenated into the overall relative semantic relationship feature F_se. The specific formula is as follows:

8. the knowledge-guided deep attention network-based scene text visual question answering method according to claim 7, wherein the deep neural network is constructed in the step (8), and specifically comprises the following steps:

8-1. Fusing the language features of the question, the object comprehensive features of the image, the text comprehensive features of the image, and the features of the previously predicted words.

The language feature Q of the question, the object comprehensive feature X_obj of the image, the text comprehensive feature X_ocr of the image, and the previously-predicted-word feature X_dec come from their respective modalities and are mapped to the same dimension d through linear mappings; the four features are then concatenated into I. The specific formula is as follows:

I = [Q, X_obj, X_ocr, X_dec]   (formula 13)

8-2. Constructing the knowledge-enhanced self-attention network (KSA)

The feature I and the relative spatial relationship feature F_sp are taken as input, and the KSA produces the output B. The specific formula is as follows:

B' = LN(I + KMSA(I, I, I, F_sp))   (formula 14)

B = LN(B' + FFN(B'))   (formula 15)

where KMSA is a knowledge-enhanced multi-head self-attention network that is part of the KSA network; its inputs are I and F_sp, and its output is a feature vector rich in multi-modal information. The specific formula is as follows:

Q = Linear(I)   (formula 16)

K = Linear(I)   (formula 17)

V = Linear(I)   (formula 18)

where Q, K and V are each obtained by mapping I through a fully connected layer, and ε = 1e-6 is used to avoid underflow problems;

wherein, the MLP is a two-layer perceptron, and the specific formula is as follows:

where ReLU is the activation function and FC_1 is a fully connected layer;

wherein, the input of the FFN structure is B', and the specific formula is as follows:

FFN(B') = FC_d(Drop(ReLU(FC_4d(B'))))   (formula 21)

8-3. Deeply stacking the KSA network

With the fused feature I and the relative spatial relationship feature F_sp as input, KSA networks [KSA^(1), KSA^(2), …, KSA^(E)] are stacked in depth; the output feature of the e-th layer KSA^(e) together with the relative spatial relationship feature F_sp serves as the input of the (e+1)-th layer KSA^(e+1), iteratively. The formula is as follows:

I^(e+1) = KSA^(e+1)(I^(e), F_sp)   (formula 22)

where I^(0) = I, and the relation feature F_sp is kept the same for the KSA layers at different depths;

8-4. Feature fusion with multi-head attention

The output I^(e+1) of step 8-3 is taken as input and passed through fully connected layers to output a single head z_u. The formula is as follows:

z_u = Linear(I^(e+1)) Linear(I^(e+1))^T   (formula 23)

The multi-head attention Z is composed of U independent single heads, which further improves the representational capability of the attention features. The specific formula is as follows:

Z = [z_1, z_2, …, z_u, …, z_U]   (formula 24)

8-5. Iterative decoding to compute probabilities

At the t-th decoding step, the t-th output Z_t of step 8-4, the corresponding text features, and the previously-predicted-word features are fed into the dynamic pointer network to predict the raw answer probabilities of the OCR text words. The specific formula is as follows:

where W_ocr and b_ocr are learnable parameters;

The raw OCR probabilities and the relative semantic relationship feature F_se are combined to form the final answer probabilities of the OCR text words. The specific formula is as follows:

The answer probabilities of the words in the fixed vocabulary are then computed. The specific formula is as follows:

where the corresponding parameters are likewise learnable;

The OCR-word probabilities and the fixed-vocabulary probabilities are concatenated into the overall answer probabilities. The specific formula is as follows:

From these, the word with the highest probability is selected as the answer of the t-th decoding step, and the next word is predicted iteratively until the end.

9. The knowledge-guided deep attention network-based scene text visual question answering method according to claim 8, wherein the loss function in the step (9) is as follows:

The difference between the predicted answer and the ground-truth label is computed, here using a binary cross-entropy loss. The formula is as follows:

10. the knowledge-guided deep attention network-based scene text visual question answering method according to claim 9, wherein the training model in the step (10) is as follows:

According to the loss value produced by the loss function in step (9), gradients are back-propagated to the neural network model parameters of step (8) using the back-propagation algorithm, and the parameters are optimized continuously until the whole network model converges.

Technical Field

The invention provides a scene text visual question-answering (scene-text visual question answering) method based on a knowledge-guided deep attention network. The core of the method is a knowledge-guided deep attention network that obtains prior knowledge relationships by modeling the relative spatial relationship features between detected objects and text objects and the relative semantic relationship features between previously predicted words and text objects, thereby obtaining richer information and generating answers more accurately in the deep-learning task of scene text visual question answering, which verifies the superiority of the model. The method is the first to propose modeling the relative spatial relationship features between detected objects and text objects, thereby capturing richer spatial relationships among objects. It is also the first to propose modeling the relative semantic relationship features between previously predicted words and text objects, which further captures the deeper semantic relationship between them. In the scene text visual question-answering experiments, the multi-modal features and the above prior knowledge relationships are fed into the model, deep interaction of the multi-modal features is completed in the reasoning module (Reasoning Module), deep understanding of the previously predicted words and the text objects is completed in the generation module (Generation Module), and better experimental results are obtained.

Background

Visual question answering is a learning task involving computer vision and natural language processing: it takes a picture and a free-form, open-ended natural-language question as input, and outputs a natural-language answer through the visual question-answering system. The machine therefore needs some understanding of the content of the picture, the meaning and intent of the question, and related common knowledge.

Scene text visual question answering is a subtask of visual question answering. Like the visual question-answering task, it takes a picture and a corresponding question as input, but an optical character recognition (OCR) component is added on top of visual question answering to acquire the text information in the picture, and, like visual question answering, it must fuse data from different modalities. In contrast to general visual question answering, scene text visual question answering focuses on the text information in the picture: its questions mainly revolve around the text in the picture, and its answers also require that text. A scene text visual question-answering algorithm needs to learn to: (1) recognize that the question is about text; (2) detect the regions containing text; (3) convert the regions containing text into textual form; (4) combine the text with the visual content and establish the relationship between them; (5) decide whether the detected text should be used directly as the answer or processed further.

The correct answers in scene text visual question answering come from different annotators and have a high degree of freedom. Meanwhile, pictures taken in natural scenes cover diverse themes, their content is complex and highly variable, and the objects and texts in the images can differ greatly, so scene text visual question answering faces huge challenges.

The scene text visual question-answering task mainly involves technologies such as object detection, optical character recognition (OCR), word embedding, and multi-modal fusion. Multi-modal fusion has been a research focus of academia and industry in recent years; effective fusion provides the model with sufficiently rich and useful information, enabling it to predict more accurate answers. In multi-modal fusion, the main models fall into two types: models based on graph neural networks and models based on attention mechanisms. Scene text visual question answering must be built on the full use of multi-modal information, so effectively mining the relationships among the modalities is particularly important. By modeling the relative spatial relationship features between detected objects and text objects and the relative semantic relationship features between previously predicted words and text objects, prior knowledge relationships are obtained, so that the model gains a better understanding and better experimental results are achieved.

In terms of practical application, the scene text visual question-answering task has huge future potential, for example for visually impaired users.

China has nearly 90 million disabled people, of whom the visually impaired number about 18 million, roughly one visually impaired person for every 90 people in China. China has the largest visually impaired population in the world and also the largest community of blind people. The visual system is the main way humans perceive environmental information: more than 80% of information is obtained through the eyes. In today's China, where visually impaired people make up a large proportion of the disabled population, the difficulty of adequately meeting their needs for visual information perception and interaction in daily life is becoming an urgent problem. In the future, the method can be combined with speech technology: a visually impaired user uploads an image and poses a question to the system, the algorithm understands the user's input, and the system finally answers through voice broadcasting, improving the quality of life of visually impaired people to a certain extent; the method therefore has great market application value and social benefit.

In conclusion, scene text visual question answering is a topic worthy of in-depth research. This patent aims to start from several key points of the task, develop the discussion, solve the difficulties and key issues of current methods, and form a complete scene text visual question-answering system.

Disclosure of Invention

The invention provides a scene text visual question-answering method based on a knowledge-guided deep attention network. The invention mainly comprises two points:

1. By modeling the relative spatial relationship features between detected objects and text objects and feeding them into the relational attention network, the relative spatial relationships among objects are fully exploited and richer information is acquired.

2. By modeling the relative semantic relationship features between the previously predicted word and the text objects, the semantic relationship between the answer and the text objects is fully understood, and a deeper representation is obtained by combining these features with the output of the dynamic pointer network.

The technical solution adopted by the invention to solve the technical problem comprises the following steps:

Step (1): division of the data set

Dividing the data set;

Step (2): constructing the language features of the question

A question consists of K words. Each word k is converted into a word vector containing semantic information using a pre-trained word vector model, where d_ques denotes the word-vector dimension of the question words. The word vectors of the K words of a question are concatenated into the language feature of the complete question and mapped to a d-dimensional space through a linear transformation, yielding the feature Q ∈ R^(K×d);

Step (3): constructing the object comprehensive features of the image

For an input image, a trained object detection network is used to compute M candidate boxes containing objects in the image. For each candidate box m, its spatial feature is obtained; the region of the image corresponding to the candidate box is fed into the object detection network and the output of a chosen layer is extracted as the visual feature of the candidate box. The visual feature and the spatial feature are projected to a d-dimensional space using learnable linear transformations to obtain the comprehensive feature of candidate box m. The comprehensive features of all objects in one image are concatenated into the overall object comprehensive feature X_obj ∈ R^(M×d);

Step (4): obtaining the text comprehensive features of the image

For an input image, an offline OCR system is used to obtain N pieces of text information in the image, each comprising candidate-box information and in-box character information. For each candidate box n, its spatial feature is obtained; the corresponding region of the image is fed into the trained object detection network and the output of a chosen layer is extracted as the visual feature of the candidate box. For each in-box character string n, one text character feature is obtained using the FastText algorithm and another text character feature is obtained using the PHOC algorithm. The spatial feature, the visual feature, and the two text character features are projected to a d-dimensional space using linear transformations to obtain the text comprehensive feature of text item n. All text comprehensive features in one image are concatenated into the overall text comprehensive feature X_ocr ∈ R^(N×d);

Step (5): constructing the features of previously predicted words

The model iteratively decodes to generate the answer by selecting words from the OCR text or a fixed vocabulary. In the iterative autoregressive decoding process, at the t-th decoding step, the feature, source, and position of the previously predicted word are used to construct the comprehensive feature of the previously predicted word, which is used as the input of the (t+1)-th decoding step. A complete answer consists of the words predicted over T decoding steps, and the features of all previously predicted words are concatenated into the complete previously-predicted-word feature X_dec ∈ R^(T×d);

Step (6): constructing relative spatial relationship features

For any two objects i and j in an image, a relative spatial relationship feature is generated. From all objects in one image, whose total number is M + N, the overall relative spatial relationship feature F_sp is constructed; here "object" covers both detected objects and text;

Step (7): constructing relative semantic relationship features

In the iterative autoregressive decoding process, at the t-th decoding step, the cosine similarity between the word c predicted at the (t-1)-th decoding step and each piece of character information n in the image is computed to obtain the relative semantic relationship feature between them. A complete answer consists of the words predicted over T decoding steps, and the relative semantic relationship features between all predicted words of the answer and all text objects in the image are concatenated into the overall relative semantic relationship feature F_se;

Step (8): constructing the deep neural network

The language feature Q of the question, the object comprehensive feature X_obj of the image, the text comprehensive feature X_ocr of the image, and the previously-predicted-word feature X_dec are concatenated into the feature I. The feature I and the relative spatial relationship feature F_sp are fed into the reasoning module to generate a vector Z that fuses the information of every modality. At the t-th decoding step, the output Z_t together with the corresponding text features and previously-predicted-word features is fed into a dynamic pointer network, whose output is combined with the relative semantic relationship feature F_se to predict the answer probabilities of the words in the text and the answer probabilities of the words in the fixed vocabulary. The two sets of probabilities are concatenated, the word with the highest probability is selected as the predicted answer of the t-th decoding step, and the next word is predicted iteratively until the end;

Step (9): loss function

The predicted answers output in step (8) and the corresponding correct answers are input into the loss function together, and the loss value is computed;

Step (10): training the model

According to the loss value produced by the loss function in step (9), gradients are back-propagated to the deep neural network model parameters of step (8) using the back-propagation algorithm, and the parameters are optimized continuously until the whole network model converges;

Step (11): computing the network prediction

According to the probabilities output in step (8), the word with the highest probability is selected as the predicted answer of the t-th decoding step, and the next word is predicted iteratively until the end, generating the final answer.

Further, the language features of the question constructed in step (2) are specifically as follows:

Each word k in the question is converted into a word vector containing semantic information using a pre-trained word vector model, where d_ques denotes the word-vector dimension of the question words. The K word vectors of the question are concatenated into the language feature of the complete question and mapped to a d-dimensional space through a linear transformation, yielding Q ∈ R^(K×d). The specific formula is as follows:

where Linear is a linear transformation.

Further, the object comprehensive characteristics of the constructed image in the step (3) are specifically as follows:

For each candidate box m, its spatial position coordinates are (x_min, y_min, x_max, y_max), where (x_min, y_min) are the coordinates of the top-left corner of the candidate box and (x_max, y_max) are the coordinates of the bottom-right corner; from these the spatial feature is obtained. The specific formula is as follows:

wherein W, H are the width and height of the image, respectively;

For each candidate box m, the visual feature is extracted as described above. The visual feature and the spatial feature are projected to a d-dimensional space using learnable linear transformations to obtain the comprehensive feature of candidate box m. The specific formula is as follows:

where W_1 and W_2 are learnable mapping matrices and LN is layer normalization;

M candidate boxes containing objects are selected from each image, and the comprehensive features of all candidate boxes of an image are concatenated into the overall object comprehensive feature X_obj ∈ R^(M×d). The specific formula is as follows:

further, the text comprehensive features of the constructed image in the step (4) are specifically as follows:

For each candidate box n, its spatial position coordinates are (x_min, y_min, x_max, y_max), where (x_min, y_min) are the coordinates of the top-left corner of the candidate box and (x_max, y_max) are the coordinates of the bottom-right corner; from these the spatial feature is obtained. The specific formula is as follows:

For each candidate box n, the visual feature, one character feature (from FastText) and another character feature (from PHOC) are extracted as described above. The spatial feature, the visual feature, and the character features are projected to a d-dimensional space using linear transformations to obtain the text comprehensive feature of candidate box n. The specific formula is as follows:

where W_3, W_4 and W_5 are learnable mapping matrices and LN is layer normalization;

All text comprehensive features in each image are concatenated into the overall text comprehensive feature X_ocr ∈ R^(N×d). The specific formula is as follows:

further, the characteristics of the pre-constructed predicted word in the step (5) are specifically as follows:

In the iterative autoregressive decoding process, the feature, source, and position of the word predicted at the t-th decoding step are used to construct the comprehensive feature of the previously predicted word, which serves as the input of the (t+1)-th decoding step; the input of the 1st decoding step is the special token '<s>'. A complete answer consists of the words predicted over T decoding steps, and the features of all previously predicted words are concatenated into the complete previously-predicted-word feature X_dec ∈ R^(T×d). The specific formula is as follows:

further, the constructing of the relative spatial relationship feature in the step (6) is specifically as follows:

The relative spatial relationship feature is generated by modeling the relative position between any two candidate boxes. The spatial position of the i-th object is defined as (x_i, y_i, w_i, h_i), whose four coordinates are the horizontal and vertical coordinates of the object's center point, its width, and its height. The relative spatial relationship feature between the i-th object and the j-th object is then computed. The specific formula is as follows:

The relationship-pair features of all objects in each image are concatenated into the overall relative spatial relationship feature F_sp. The specific formula is as follows:

where P = M + N.

Further, the constructing of the relative semantic relationship features in the step (7) is specifically as follows:

The previously predicted word c and a text word n in the image are each converted into a word vector containing semantic information using a pre-trained word vector model, and the relative semantic relationship feature between word c and word n is computed using cosine similarity; at the 1st prediction, the previously predicted word is the special token '<s>'. The specific formula is as follows:

A complete answer consists of T words and the image contains N text words; the relative semantic relationship features between all predicted words of the answer and all text words in the image are concatenated into the overall relative semantic relationship feature F_se. The specific formula is as follows:

further, the deep neural network is constructed in the step (8), specifically as follows:

8-1. Fusing the language features of the question, the object comprehensive features of the image, the text comprehensive features of the image, and the features of the previously predicted words.

The language feature Q of the question, the object comprehensive feature X_obj of the image, the text comprehensive feature X_ocr of the image, and the previously-predicted-word feature X_dec come from their respective modalities and are mapped to the same dimension d through linear mappings; the four features are then concatenated into I. The specific formula is as follows:

I = [Q, X_obj, X_ocr, X_dec]   (formula 13)

8-2. Constructing the knowledge-enhanced self-attention network (KSA)

The feature I and the relative spatial relationship feature F_sp are taken as input, and the KSA produces the output B. The specific formula is as follows:

B' = LN(I + KMSA(I, I, I, F_sp))   (formula 14)

B = LN(B' + FFN(B'))   (formula 15)

where KMSA is a knowledge-enhanced multi-head self-attention network that is part of the KSA network; its inputs are I and F_sp, and its output is a feature vector rich in multi-modal information. The specific formula is as follows:

Q = Linear(I)   (formula 16)

K = Linear(I)   (formula 17)

V = Linear(I)   (formula 18)

where Q, K and V are each obtained by mapping I through a fully connected layer, and ε = 1e-6 is used to avoid underflow problems;

wherein, the MLP is a two-layer perceptron, and the specific formula is as follows:

where ReLU is the activation function and FC_1 is a fully connected layer;

wherein, the input of the FFN structure is B', and the specific formula is as follows:

FFN(B') = FC_d(Drop(ReLU(FC_4d(B'))))   (formula 21)

8-3. Deeply stacking the KSA network

With the fused feature I and the relative spatial relationship feature F_sp as input, KSA networks [KSA^(1), KSA^(2), …, KSA^(E)] are stacked in depth; the output feature of the e-th layer KSA^(e) together with the relative spatial relationship feature F_sp serves as the input of the (e+1)-th layer KSA^(e+1), iteratively. The formula is as follows:

I^(e+1) = KSA^(e+1)(I^(e), F_sp)   (formula 22)

where I^(0) = I, and the relation feature F_sp is kept the same for the KSA layers at different depths;

8-4. Feature fusion with multi-head attention

The output I^(e+1) of step 8-3 is taken as input and passed through fully connected layers to output a single head z_u. The formula is as follows:

z_u = Linear(I^(e+1)) Linear(I^(e+1))^T   (formula 23)

The multi-head attention Z is composed of U independent single heads, which further improves the representational capability of the attention features. The specific formula is as follows:

Z = [z_1, z_2, …, z_u, …, z_U]   (formula 24)

8-5. Iterative decoding to compute probabilities

At the t-th decoding step, the t-th output Z_t of step 8-4, the corresponding text features, and the previously-predicted-word features are fed into the dynamic pointer network to predict the raw answer probabilities of the OCR text words. The specific formula is as follows:

where W_ocr and b_ocr are learnable parameters;

The raw OCR probabilities and the relative semantic relationship feature F_se are combined to form the final answer probabilities of the OCR text words. The specific formula is as follows:

The answer probabilities of the words in the fixed vocabulary are then computed. The specific formula is as follows:

where the corresponding parameters are likewise learnable;

The OCR-word probabilities and the fixed-vocabulary probabilities are concatenated into the overall answer probabilities. The specific formula is as follows:

From these, the word with the highest probability is selected as the answer of the t-th decoding step, and the next word is predicted iteratively until the end.

Further, the loss function in step (9) is specifically as follows:

The difference between the predicted answer and the ground-truth label is computed, here using a binary cross-entropy loss. The formula is as follows:
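As an illustration only, a minimal PyTorch sketch of a multi-label binary cross-entropy loss over the concatenated OCR + vocabulary scores; the shapes and the multi-label target encoding are assumptions, not the patent's exact formulation:

import torch
import torch.nn as nn

T, num_candidates = 12, 5050                    # assumed: T decoding steps, OCR words + fixed vocabulary
logits = torch.randn(T, num_candidates)         # stand-in predicted answer scores for each decoding step
targets = torch.zeros(T, num_candidates)        # multi-label ground truth (a word may be valid in both sources)
targets[torch.arange(T), torch.randint(num_candidates, (T,))] = 1.0

loss = nn.BCEWithLogitsLoss()(logits, targets)  # binary cross-entropy between predictions and labels
print(float(loss))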

further, the training model in step (10) is specifically as follows:

According to the loss value produced by the loss function in step (9), gradients are back-propagated to the neural network model parameters of step (8) using the back-propagation algorithm, and the parameters are optimized continuously until the whole network model converges.

The invention has the following beneficial effects:

The invention provides a scene text visual question-answering method based on a knowledge-guided deep attention network, which obtains prior knowledge relationships by modeling the relative spatial relationship features between detected objects and text objects and the relative semantic relationship features between previously predicted words and text objects, and obtains richer information by deeply stacking the knowledge-enhanced self-attention network layers.

Drawings

FIG. 1: knowledge enhancement self-attention module

FIG. 2: knowledge-guided deep attention network architecture

Detailed Description

The detailed parameters of the present invention are described in more detail below.

As shown in fig. 1 and 2, the present invention provides a scene text visual question-answering method based on a knowledge-guided deep attention network.

The step (1) is specifically realized as follows:

Dividing the data set: the training set consists of 21953 pictures and 34602 questions; the validation set consists of 3166 pictures and 5000 questions; the test set consists of 3289 pictures and 5734 questions.

The language features of the question constructed in step (2) are specifically as follows:

A question consists of K words. Each word k is converted into a word vector containing semantic information using a pre-trained word vector model, where d_ques denotes the word-vector dimension of the question words. The word vectors of the K words of a question are concatenated into the language feature of the complete question and mapped to a d-dimensional space through a linear transformation, yielding the feature Q ∈ R^(K×d).
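As an illustration only, a minimal PyTorch sketch of this step with placeholder word vectors; the dimensions (d_ques = 300, d = 768) and the random vectors are assumptions, not the patent's actual pre-trained word-vector model:

import torch
import torch.nn as nn

K, d_ques, d = 20, 300, 768            # assumed: K question words, word-vector dimension, model dimension
word_vectors = torch.randn(K, d_ques)  # stand-in for the pre-trained word vectors of the K question words

to_d = nn.Linear(d_ques, d)            # the learnable linear transformation to the d-dimensional space
Q = to_d(word_vectors)                 # language feature of the complete question, shape (K, d)
print(Q.shape)                         # torch.Size([20, 768])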

The object comprehensive features of the image constructed in step (3) are specifically as follows:

For an input image, a trained object detection network is used to compute M candidate boxes containing objects in the image. For each candidate box m, its spatial feature is obtained; the region of the image corresponding to the candidate box is fed into the object detection network and the output of a chosen layer is extracted as the visual feature of the candidate box. The visual feature and the spatial feature are projected to a d-dimensional space using learnable linear transformations to obtain the comprehensive feature of candidate box m. The comprehensive features of all objects in one image are concatenated into the overall object comprehensive feature X_obj ∈ R^(M×d).
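As an illustration only, a minimal PyTorch sketch of the object comprehensive features; the 4-d normalized box encoding, the 2048-d detector feature, and the combination as a sum of layer-normalized projections are assumptions, and the patent's exact formula may differ:

import torch
import torch.nn as nn

M, d_vis, d = 100, 2048, 768   # assumed: M candidate boxes, detector feature size, model dimension
x_sp = torch.rand(M, 4)        # stand-in spatial features (x_min/W, y_min/H, x_max/W, y_max/H)
x_vis = torch.randn(M, d_vis)  # stand-in visual features from the detection network

W1 = nn.Linear(d_vis, d)       # learnable projection of the visual features
W2 = nn.Linear(4, d)           # learnable projection of the spatial features
ln_vis, ln_sp = nn.LayerNorm(d), nn.LayerNorm(d)

X_obj = ln_vis(W1(x_vis)) + ln_sp(W2(x_sp))  # overall object comprehensive feature, shape (M, d)
print(X_obj.shape)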

The text comprehensive features of the image constructed in step (4) are specifically as follows:

For an input image, an offline OCR system is used to obtain N pieces of text information in the image, each comprising candidate-box information and in-box character information. For each candidate box n, its spatial feature is obtained; the corresponding region of the image is fed into the trained object detection network and the output of a chosen layer is extracted as the visual feature of the candidate box. For each in-box character string n, one text character feature is obtained using the FastText algorithm and another text character feature is obtained using the PHOC algorithm. The spatial feature, the visual feature, and the two text character features are projected to a d-dimensional space using linear transformations to obtain the text comprehensive feature of text item n. All text comprehensive features in one image are concatenated into the overall text comprehensive feature X_ocr ∈ R^(N×d).
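As an illustration only, a minimal PyTorch sketch of the text comprehensive features; the FastText (300-d) and PHOC (604-d) sizes, the extra spatial projection, and the grouping of the projections are assumptions borrowed from common practice rather than the patent's exact formula:

import torch
import torch.nn as nn

N, d_vis, d_ft, d_phoc, d = 50, 2048, 300, 604, 768  # assumed dimensions
x_sp = torch.rand(N, 4)          # stand-in normalized OCR box coordinates
x_vis = torch.randn(N, d_vis)    # stand-in visual features of the OCR regions
x_ft = torch.randn(N, d_ft)      # stand-in FastText features of the recognized strings
x_phoc = torch.randn(N, d_phoc)  # stand-in PHOC character features

W3, W4, W5 = nn.Linear(d_ft, d), nn.Linear(d_vis, d), nn.Linear(d_phoc, d)  # learnable mappings
W_sp = nn.Linear(4, d)           # hypothetical extra projection for the spatial feature
ln_a, ln_b = nn.LayerNorm(d), nn.LayerNorm(d)

X_ocr = ln_a(W3(x_ft) + W4(x_vis) + W_sp(x_sp)) + ln_b(W5(x_phoc))  # shape (N, d)
print(X_ocr.shape)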

The features of the previously predicted words constructed in step (5) are specifically as follows:

The model iteratively decodes to generate the answer by selecting words from the OCR text or a fixed vocabulary. In the iterative autoregressive decoding process, at the t-th decoding step, the feature, source, and position of the previously predicted word are used to construct the comprehensive feature of the previously predicted word, which is used as the input of the (t+1)-th decoding step. A complete answer consists of the words predicted over T decoding steps, and the features of all previously predicted words are concatenated into the complete previously-predicted-word feature X_dec ∈ R^(T×d).

The input of the 1st decoding step is the special token '<s>'.
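As an illustration only, a minimal PyTorch sketch of how the feature, source, and position of the previously predicted word can be combined into the next decoding input; the additive composition and all sizes are assumptions:

import torch
import torch.nn as nn

d, T_max, vocab_size = 768, 12, 5000    # assumed model dimension, maximum decoding steps, vocabulary size

word_emb = nn.Embedding(vocab_size, d)  # embedding of the previously predicted word
source_emb = nn.Embedding(2, d)         # source of the word: 0 = fixed vocabulary, 1 = OCR token
pos_emb = nn.Embedding(T_max, d)        # position (decoding step) of the word

def prev_word_feature(word_id, source, step):
    # comprehensive feature of the previously predicted word, used as the next decoding input
    ids = torch.tensor([word_id]), torch.tensor([source]), torch.tensor([step])
    return word_emb(ids[0]) + source_emb(ids[1]) + pos_emb(ids[2])  # shape (1, d)

x_dec_1 = prev_word_feature(word_id=0, source=0, step=0)  # step 1 uses the special '<s>' token (id 0 assumed)
print(x_dec_1.shape)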

The relative spatial relationship features are constructed in step (6) as follows:

For any two objects i and j in an image, a relative spatial relationship feature is generated. From all objects in one image, whose total number is M + N, the overall relative spatial relationship feature F_sp is constructed; here "object" covers both detected objects and text.
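As an illustration only, a minimal PyTorch sketch using a common 4-d relative-geometry encoding between all P = M + N boxes; this standard formulation is an assumption and the patent's exact formula may differ:

import torch

P = 150                                          # assumed total number of objects, P = M + N
x, y = torch.rand(P), torch.rand(P)              # stand-in center coordinates of the P boxes
w, h = torch.rand(P) + 0.1, torch.rand(P) + 0.1  # stand-in widths and heights, kept positive

eps = 1e-6                                       # guards the logarithm
dx = torch.log(torch.abs(x[:, None] - x[None, :]) / w[:, None] + eps)
dy = torch.log(torch.abs(y[:, None] - y[None, :]) / h[:, None] + eps)
dw = torch.log(w[None, :] / w[:, None])
dh = torch.log(h[None, :] / h[:, None])

F_sp = torch.stack([dx, dy, dw, dh], dim=-1)     # overall relative spatial relationship feature, shape (P, P, 4)
print(F_sp.shape)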

The relative semantic relationship features are constructed in step (7) as follows:

The previously predicted word c and a text word n in the image are each converted into a word vector containing semantic information using a pre-trained word vector model, and the relative semantic relationship feature between word c and word n is computed using cosine similarity; at the 1st prediction, the previously predicted word is the special token '<s>'.

A complete answer consists of T words and the image contains N text words; the relative semantic relationship features between all predicted words of the answer and all text words in the image are concatenated into the overall relative semantic relationship feature F_se.
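As an illustration only, a minimal PyTorch sketch of the cosine-similarity computation; random vectors stand in for the pre-trained word-vector model:

import torch
import torch.nn.functional as F

T, N, d_w = 12, 50, 300           # assumed: T decoding steps, N OCR words, word-vector dimension
prev_words = torch.randn(T, d_w)  # stand-in word vectors of the previously predicted words ('<s>' at step 1)
ocr_words = torch.randn(N, d_w)   # stand-in word vectors of the OCR text words

F_se = F.cosine_similarity(prev_words[:, None, :], ocr_words[None, :, :], dim=-1)  # shape (T, N)
print(F_se.shape)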

The deep neural network is constructed in step (8) as follows:

8-1. Fusing the language features of the question, the object comprehensive features of the image, the text comprehensive features of the image, and the features of the previously predicted words.

The language feature Q of the question, the object comprehensive feature X_obj of the image, the text comprehensive feature X_ocr of the image, and the previously-predicted-word feature X_dec come from their respective modalities and are mapped to the same dimension d through linear mappings; the four features are then concatenated into I = [Q, X_obj, X_ocr, X_dec].

8-2. Constructing the knowledge-enhanced self-attention network (KSA)

The feature I and the relative spatial relationship feature F_sp are taken as input, and the KSA produces the output B. KMSA is a knowledge-enhanced multi-head self-attention network that is part of the KSA network; its inputs are I and F_sp, and its output is a feature vector rich in multi-modal information. Q, K and V are each obtained by mapping I through a fully connected layer, and ε = 1e-6 is used to avoid underflow problems.
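As an illustration only, a PyTorch sketch of a knowledge-enhanced self-attention block in the spirit of formulas 14-21; how F_sp modulates the attention map (here a log-space additive bias produced by a small MLP, guarded by eps = 1e-6), the head count and dimensions, and F_sp being padded to the full sequence length are all assumptions, not the patent's exact network:

import torch
import torch.nn as nn

class KSA(nn.Module):
    def __init__(self, d=768, heads=8, d_sp=4, drop=0.1):
        super().__init__()
        self.h, self.dh = heads, d // heads
        self.q, self.k, self.v, self.o = (nn.Linear(d, d) for _ in range(4))
        # two-layer MLP that turns the relation feature F_sp into one bias per attention head
        self.rel_mlp = nn.Sequential(nn.Linear(d_sp, d), nn.ReLU(), nn.Linear(d, heads))
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Dropout(drop), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, I, F_sp, eps=1e-6):
        P, d = I.shape
        q = self.q(I).view(P, self.h, self.dh).transpose(0, 1)  # (heads, P, dh)
        k = self.k(I).view(P, self.h, self.dh).transpose(0, 1)
        v = self.v(I).view(P, self.h, self.dh).transpose(0, 1)
        scores = q @ k.transpose(-1, -2) / self.dh ** 0.5       # scaled dot-product attention scores
        rel = self.rel_mlp(F_sp).permute(2, 0, 1)               # knowledge bias from F_sp, (heads, P, P)
        att = torch.softmax(scores + torch.log(torch.relu(rel) + eps), dim=-1)
        kmsa = self.o((att @ v).transpose(0, 1).reshape(P, d))  # KMSA(I, I, I, F_sp)
        B_prime = self.ln1(I + kmsa)                            # formula 14
        return self.ln2(B_prime + self.ffn(B_prime))            # formula 15

ksa = KSA()
out = ksa(torch.randn(150, 768), torch.randn(150, 150, 4))      # fused features, shape (150, 768)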

8-3. Deeply stacking the KSA network

With the fused feature I and the relative spatial relationship feature F_sp as input, KSA networks [KSA^(1), KSA^(2), …, KSA^(E)] are stacked in depth; the output feature of the e-th layer KSA^(e) together with the relative spatial relationship feature F_sp serves as the input of the (e+1)-th layer KSA^(e+1), iteratively, where I^(0) = I and the relation feature F_sp is kept the same for the KSA layers at different depths.
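As an illustration only, a sketch of the deep stacking of formula 22, reusing the KSA module sketched under step 8-2; the number of layers E is an assumption:

import torch
import torch.nn as nn

E = 8                                            # assumed number of stacked KSA layers
layers = nn.ModuleList(KSA() for _ in range(E))  # KSA as defined in the previous sketch

I_feat = torch.randn(150, 768)                   # fused feature I^(0) = I (stand-in)
F_sp = torch.randn(150, 150, 4)                  # shared relative spatial relationship feature (stand-in)
for ksa_layer in layers:                         # I^(e+1) = KSA^(e+1)(I^(e), F_sp)
    I_feat = ksa_layer(I_feat, F_sp)
print(I_feat.shape)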

8-4. Feature fusion with multi-head attention

The output I^(e+1) of step 8-3 is taken as input and passed through fully connected layers to output a single head z_u; the multi-head attention Z is composed of U independent single heads, which further improves the representational capability of the attention features.
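As an illustration only, a sketch that follows formulas 23 and 24 literally: each head is a bilinear map of the stacked-KSA output with itself, and the U heads are stacked; all sizes are assumptions:

import torch
import torch.nn as nn

P, d, U, d_head = 150, 768, 12, 64  # assumed sequence length, model dimension, number of heads, head size
I_out = torch.randn(P, d)           # stand-in output I^(e+1) of the stacked KSA layers

proj_a = nn.ModuleList(nn.Linear(d, d_head) for _ in range(U))
proj_b = nn.ModuleList(nn.Linear(d, d_head) for _ in range(U))

heads = [proj_a[u](I_out) @ proj_b[u](I_out).T for u in range(U)]  # z_u = Linear(I) Linear(I)^T, each (P, P)
Z = torch.stack(heads, dim=0)                                      # Z = [z_1, ..., z_U], shape (U, P, P)
print(Z.shape)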

8-5. Iterative decoding to compute probabilities

At the t-th decoding step, the t-th output Z_t of step 8-4, the corresponding text features, and the previously-predicted-word features are fed into the dynamic pointer network to predict the raw answer probabilities of the OCR text words.

The raw OCR probabilities and the relative semantic relationship feature F_se are combined to form the final answer probabilities of the OCR text words.

The answer probabilities of the words in the fixed vocabulary are then computed.

The OCR-word probabilities and the fixed-vocabulary probabilities are concatenated into the overall answer probabilities, from which the word with the highest probability is selected as the answer of the t-th decoding step; the next word is predicted iteratively until the end.
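As an illustration only, an M4C-style dynamic-pointer sketch for one decoding step; the bilinear pointer scores, the additive fusion of F_se, and all sizes are assumptions rather than the patent's exact formulas:

import torch
import torch.nn as nn

d, N, vocab_size = 768, 50, 5000  # assumed model dimension, number of OCR words, vocabulary size
z_ocr = torch.randn(N, d)         # stand-in decoder outputs at the OCR-token positions
z_dec = torch.randn(d)            # stand-in decoder output at the current (t-th) position
f_se_t = torch.randn(N)           # stand-in semantic relation of the previous word to each OCR word

W_ocr, W_dec = nn.Linear(d, d), nn.Linear(d, d)   # learnable pointer projections
W_voc = nn.Linear(d, vocab_size)                  # fixed-vocabulary classifier

y_ocr = (W_ocr(z_ocr) @ W_dec(z_dec)) / d ** 0.5  # raw OCR-word scores, shape (N,)
y_ocr = y_ocr + f_se_t                            # fuse the prior semantic knowledge (additive rule assumed)
y_voc = W_voc(z_dec)                              # fixed-vocabulary scores, shape (vocab_size,)

y_all = torch.cat([y_ocr, y_voc])                 # concatenated answer scores for step t
pred = int(torch.argmax(y_all))                   # index of the word predicted at step t
print(pred)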
