Image semantic description method and system based on multi-feature extraction

Document No.: 1243151  Publication date: 2020-08-18

Description: This technology, "Image semantic description method and system based on multi-feature extraction" (一种基于多特征提取的图像语义描述方法及系统), was created by Zhao Xiaohu, You Peng, Li Xiao, Chang Xianhong, Song Ruijun and Zhang Nan on 2020-04-17. Abstract: The invention relates to an image semantic description method and system based on multi-feature extraction, belongs to the technical field of image feature extraction, and solves the problems that the image features extracted in the prior art are single and that a long short-term memory network can only capture unidirectional temporal information. The method comprises the following steps: inputting an image to be semantically described into a global feature extraction model to obtain a global feature vector of the image; extracting an attribute feature vector of the image; inputting the global feature vector and the attribute feature vector simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function; and summing the forward joint loss function and the backward joint loss function, the semantic description that best matches the image being obtained when the sum is minimum. The method realizes the extraction of multiple image features and improves the precision of the semantic description.

1. An image semantic description method based on multi-feature extraction is characterized by comprising the following steps:

inputting an image to be semantically described into a global feature extraction model to obtain a global feature vector of the image;

extracting attribute feature vectors of the images; the attribute features comprise the outline, the texture and the position information of the image;

inputting the global feature vector and the attribute feature vector simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function;

and summing the forward joint loss function and the backward joint loss function, the semantic description that best matches the image being obtained when the sum is minimum.

2. The method according to claim 1, wherein the image is input into an attribute feature extraction model to obtain the attribute feature vector of the image, and the attribute feature extraction model is obtained by the following steps:

establishing an attribute feature extraction base model, wherein the base model comprises a modified SSD network, a prior frame module and a column-vector maximization module;

the modified SSD network is obtained by replacing the front-end network of the SSD network with a ResNet-50 residual structure; the modified SSD network comprises feature extraction layers, which perform feature mapping on an input image to obtain feature maps;

the prior frame module is used for setting prior frames with different sizes based on different feature maps and obtaining attribute detection matrixes corresponding to the feature maps by using the set prior frames with different sizes;

the column vector maximization module is used for performing column vector maximization on the attribute detection matrix to obtain an attribute feature vector;

and inputting a training image to train the established basic model to obtain the attribute feature extraction model.

3. The method of claim 2, wherein the size of the prior frame is set based on the following formula:

s_k = s_min + (s_max − s_min)(k − 1)/(n − 1), k ∈ [1, n]

where s_k is the ratio of the k-th prior frame to the image, s_min and s_max are respectively the minimum and maximum ratios of the prior frame relative to the image, and n is the number of feature maps.

4. The method according to claim 2 or 3, wherein the inputting the image into the attribute feature extraction model to obtain the attribute feature vector of the image comprises the following steps:

inputting an image into an attribute feature extraction model:

obtaining a feature map of the image based on a feature extraction layer;

based on the prior frame, obtaining an attribute detection matrix of the feature map;

and obtaining an attribute feature vector based on a column vector maximization module.

5. The method according to claim 1, wherein inputting the global features and the attribute features simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function comprises the following steps:

obtaining a predicted word based on the global feature vector and the attribute feature vector;

inputting the predicted word into an embedding layer to obtain a predicted word vector x_t;

feeding the predicted word vector x_t into the text LSTM to obtain the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b;

feeding the text-LSTM forward hidden-layer sequence h_t^f and backward hidden-layer sequence h_t^b into the multimodal LSTM to obtain the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b;

based on the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b, obtaining the word w_t with the maximum probability using a softmax function;

based on the word w_t with the maximum probability, obtaining the forward joint loss function and the backward joint loss function.

6. The method of claim 5, wherein obtaining the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b comprises the following steps:

based on the predicted word vector x_t, obtaining a forward sentence and a backward sentence;

encoding the forward sentence and the backward sentence to obtain the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b.

7. An image semantic description system based on multi-feature extraction is characterized by comprising:

the global feature vector acquisition module is used for inputting the image to be semantically described into the global feature extraction model to obtain a global feature vector of the image;

the attribute feature vector acquisition module is used for extracting the attribute feature vector of the image; the attribute features comprise the outline, the texture and the position information of the image;

a joint loss function obtaining module, configured to input the global feature vector and the attribute feature vector simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function;

and a semantic description obtaining module, used to sum the forward joint loss function and the backward joint loss function and to obtain the semantic description that best matches the image when the sum is minimum.

8. The system of claim 7, wherein the attribute feature vector obtaining module establishes the attribute feature extraction model as follows:

establishing an attribute feature extraction base model, wherein the base model comprises a modified SSD network, a prior frame module and a column-vector maximization module;

the modified SSD network is obtained by replacing the front-end network of the SSD network with a ResNet-50 residual structure; the modified SSD network comprises feature extraction layers, which perform feature mapping on an input image to obtain feature maps;

the prior frame module is used for setting prior frames with different sizes based on different feature maps and obtaining attribute detection matrixes corresponding to the feature maps by using the set prior frames with different sizes;

the column vector maximization module is used for performing column vector maximization on the attribute detection matrix to obtain an attribute feature vector;

and inputting a training image to train the established basic model to obtain the attribute feature extraction model.

9. The system according to claim 7 or 8, wherein the attribute feature vector obtaining module obtains the attribute feature vector by:

inputting an image into an attribute feature extraction model:

obtaining a feature map of the image based on a feature extraction layer;

based on the prior frame, obtaining an attribute detection matrix of the feature map;

and obtaining an attribute feature vector based on a column vector maximization module.

10. The system according to claim 7, wherein said joint loss function obtaining module obtains the forward joint loss function and the backward joint loss function by:

obtaining a predicted word based on the global feature vector and the attribute feature vector;

inputting the predicted word into an embedding layer to obtain a predicted word vector x_t;

feeding the predicted word vector x_t into the text LSTM to obtain the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b;

feeding the text-LSTM forward hidden-layer sequence h_t^f and backward hidden-layer sequence h_t^b into the multimodal LSTM to obtain the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b;

based on the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b, obtaining the word w_t with the maximum probability using a softmax function;

based on the word w_t with the maximum probability, obtaining the forward joint loss function and the backward joint loss function.

Technical Field

The invention relates to the technical field of image feature extraction, in particular to an image semantic description method and system based on multi-feature extraction.

Background

Image semantic description has long been one of the most important research directions in the field of artificial intelligence, and is a high-level image understanding task. At present, image semantic description methods based on deep neural networks have made significant breakthroughs in this field, in particular semantic description generation models that combine a convolutional neural network with a recurrent neural network.

Mao et al. creatively combined convolutional neural networks and recurrent neural networks, addressing the problems of image description and sentence retrieval. Kiros et al. were then the first to introduce the encoder-decoder framework into image semantic description research: they encode visual information using a deep convolutional neural network, while text data is encoded using a long short-term memory network (LSTM).

Under the encoder-decoder framework, semantic description has achieved beneficial effects, but the image features extracted by the prior art are single and insufficient to represent complete image information, so the semantic description effect is poor. Secondly, although using an LSTM can eliminate the gradient vanishing problem of RNNs (recurrent neural networks), the LSTM can only capture unidirectional temporal information, so true global context dependency is not realized and the precision of the semantic description is low.

Disclosure of Invention

In view of the foregoing analysis, embodiments of the present invention provide an image semantic description method and system based on multi-feature extraction, to solve the problem of low semantic description accuracy caused by the single image feature extracted in the prior art and by the LSTM capturing only unidirectional temporal information.

On one hand, the embodiment of the invention provides an image semantic description method based on multi-feature extraction, which comprises the following steps:

inputting an image to be semantically described into a global feature extraction model to obtain a global feature vector of the image;

extracting the attribute feature vector of the image, wherein the attribute features comprise the contour, texture and position information of the image;

inputting the global feature vector and the attribute feature vector simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function;

and summing the forward joint loss function and the backward joint loss function, the semantic description that best matches the image being obtained when the sum is minimum.

Further, the image is input into an attribute feature extraction model to extract an attribute feature vector of the image, and the attribute feature extraction model is obtained through the following steps:

establishing an attribute feature extraction base model, wherein the base model comprises a modified SSD network, a prior frame module and a column-vector maximization module;

the modified SSD network is obtained by replacing the front-end network of the SSD network with a ResNet-50 residual structure; the modified SSD network comprises feature extraction layers, which perform feature mapping on an input image to obtain feature maps;

the prior frame module is used for setting prior frames with different sizes based on different feature maps and obtaining attribute detection matrixes corresponding to the feature maps by using the set prior frames with different sizes;

the column vector maximization module is used for performing column vector maximization on the attribute detection matrix to obtain an attribute feature vector;

and inputting a training image to train the established basic model to obtain the attribute feature extraction model.

Further, the size of the prior frame is calculated by the following formula:

s_k = s_min + (s_max − s_min)(k − 1)/(n − 1), k ∈ [1, n]

where s_k is the ratio of the k-th prior frame to the image, s_min and s_max are respectively the minimum and maximum ratios of the prior frame relative to the image, and n is the number of feature maps.

Further, the inputting the image into the attribute feature extraction model to obtain the attribute feature vector of the image includes the following steps:

inputting an image into an attribute feature extraction model:

obtaining a feature map of the image based on a feature extraction layer;

based on the prior frame, obtaining an attribute detection matrix of the feature map;

and obtaining an attribute feature vector based on a column vector maximization module.

Further, inputting the global features and the attribute features simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function specifically comprises the following steps:

obtaining a predicted word based on the global feature vector and the attribute feature vector;

inputting the predicted word into an embedding layer to obtain a predicted word vector x_t;

feeding the predicted word vector x_t into the text LSTM to obtain the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b;

feeding the text-LSTM forward hidden-layer sequence h_t^f and backward hidden-layer sequence h_t^b into the multimodal LSTM to obtain the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b;

based on the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b, obtaining the word w_t with the maximum probability using a softmax function;

based on the word w_t with the maximum probability, obtaining the forward joint loss function and the backward joint loss function.
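The softmax step in the sequence above turns the scores produced by the multimodal LSTM into a probability distribution over the vocabulary, from which the maximum-probability word w_t is selected. The following is a minimal, numerically stable sketch; the vocabulary and scores are hypothetical, not taken from the patent:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_word(logits, vocab):
    # pick the vocabulary entry with the highest softmax probability
    probs = softmax(logits)
    return vocab[probs.index(max(probs))]
```

In the full model this selection is made independently for the forward and backward directions, and the two resulting losses are then combined.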

Further, obtaining the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b comprises the following steps:

based on the predicted word vector x_t, obtaining a forward sentence and a backward sentence;

encoding the forward sentence and the backward sentence to obtain the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b.

On the other hand, the embodiment of the invention provides an image semantic description system based on multi-feature extraction, which comprises:

the global feature vector acquisition module is used for inputting the image to be semantically described into the global feature extraction model to obtain a global feature vector of the image;

an attribute feature vector acquisition module, used to extract the attribute feature vector of the image, wherein the attribute features comprise the contour, texture and position information of the image;

a joint loss function obtaining module, configured to input the global feature vector and the attribute feature vector simultaneously into a bidirectional long short-term memory network to obtain a forward joint loss function and a backward joint loss function;

and a semantic description obtaining module, used to sum the forward joint loss function and the backward joint loss function and to obtain the semantic description that best matches the image when the sum is minimum.

Further, the step of establishing the attribute feature extraction model by the attribute feature vector acquisition module is as follows:

establishing an attribute feature extraction base model, wherein the base model comprises a modified SSD network, a prior frame module and a column-vector maximization module;

the modified SSD network is obtained by replacing the front-end network of the SSD network with a ResNet-50 residual structure; the modified SSD network comprises feature extraction layers, which perform feature mapping on an input image to obtain feature maps;

the prior frame module is used for setting prior frames with different sizes based on different feature maps and obtaining attribute detection matrixes corresponding to the feature maps by using the set prior frames with different sizes;

the column vector maximization module is used for performing column vector maximization on the attribute detection matrix to obtain an attribute feature vector;

and inputting a training image to train the established basic model to obtain the attribute feature extraction model.

Further, the attribute feature vector obtaining module obtains the attribute feature vector by:

inputting an image into an attribute feature extraction model:

obtaining a feature map of the image based on a feature extraction layer;

based on the prior frame, obtaining an attribute detection matrix of the feature map;

and obtaining an attribute feature vector based on a column vector maximization module.

Further, the joint loss function obtaining module obtains a forward joint loss function and a backward joint loss function by:

obtaining a predicted word based on the global feature vector and the attribute feature vector;

inputting the predicted word into an embedding layer to obtain a predicted word vector x_t;

feeding the predicted word vector x_t into the text LSTM to obtain the text-LSTM forward hidden-layer sequence h_t^f and the text-LSTM backward hidden-layer sequence h_t^b;

feeding the text-LSTM forward hidden-layer sequence h_t^f and backward hidden-layer sequence h_t^b into the multimodal LSTM to obtain the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b;

based on the multimodal-LSTM forward hidden-layer sequence m_t^f and backward hidden-layer sequence m_t^b, obtaining the word w_t with the maximum probability using a softmax function;

based on the word w_t with the maximum probability, obtaining the forward joint loss function and the backward joint loss function.

Compared with the prior art, the invention can realize at least one of the following beneficial effects:

1. Compared with the prior art, the image semantic description method based on multi-feature extraction solves the problems that the extracted image features are single and that only unidirectional temporal information can be captured. It realizes the extraction of multiple image features, while the bidirectional long short-term memory network captures bidirectional temporal information, so that the dependency between words is stronger and the resulting semantic description is more accurate.

2. In establishing the attribute feature extraction model, a ResNet-50 residual structure replaces the original VGG16 as the front-end network of the SSD network and feature extraction layers are added, which alleviates problems such as small target scale and low resolution in the input image and improves the feature extraction capability of the SSD network.

3. Bidirectional temporal information is captured by the bidirectional long short-term memory network, so that the dependency between words is stronger; this solves the problem that the LSTM can only capture unidirectional temporal information and improves the accuracy of the semantic description.

4. The expected sequence is finally obtained from the accumulated sum of the forward joint loss function and the backward joint loss function; expressing the joint loss function mathematically is simple to implement, improves reliability, and makes the semantic description result more accurate.
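The accumulated forward-plus-backward objective described in point 4 can be sketched as the sum of two per-direction cross-entropy losses over the target word sequence. This is a generic formulation for illustration; the patent's exact loss terms are not reproduced here:

```python
import math

def cross_entropy(probs, target_idx):
    # negative log-likelihood of the target word under the predicted distribution
    return -math.log(probs[target_idx])

def joint_loss(forward_probs, backward_probs, targets):
    # sum the forward and backward losses over the whole sequence;
    # training seeks the description that minimizes this sum
    loss_f = sum(cross_entropy(p, t) for p, t in zip(forward_probs, targets))
    loss_b = sum(cross_entropy(p, t) for p, t in zip(backward_probs, targets))
    return loss_f + loss_b
```

Minimizing this sum over candidate word sequences selects the description that both directions agree best matches the image.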

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

FIG. 1 is a schematic diagram of an image semantic description method based on multi-feature extraction;

FIG. 2 is a flow chart of an image semantic description method based on multi-feature extraction;

FIG. 3 is a diagram illustrating the creation of an attribute feature extraction model in one embodiment;

FIG. 4 is a flow diagram of extracting image attribute feature vectors in one embodiment;

FIG. 5 is a diagram illustrating an internal structure of a bidirectional long-short term memory network according to an embodiment;

FIG. 6 is a diagram of a semantic description system for images based on multi-feature extraction according to another embodiment;

reference numerals:

1000-global feature vector acquisition module; 2000-attribute feature vector acquisition module;

3000-a joint loss function obtaining module; 4000-semantic description obtaining module.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.

Compared with the prior art, an image semantic description method based on multi-feature extraction is provided, as shown in fig. 1. For an input image, a global feature vector V_img of the image is obtained through the global feature extraction model, and an attribute feature vector V_att of the image is obtained through the attribute feature extraction model; V_img and V_att are then input simultaneously into the bidirectional long short-term memory network to obtain the joint loss function, and when the joint loss function is minimum, the semantic description matching the input image is obtained.

The embodiment of the invention discloses an image semantic description method based on multi-feature extraction, which is shown in figure 2. The method comprises the following steps:

and S1, inputting the image to be semantically described into the global feature extraction model to obtain a global feature vector of the image.

The global feature extraction model is a basic image processing model; extracting the global features yields the important information of an image, including its color, texture, contour and position information, which is expressed by the resulting global feature vector. For example, if there is a girl in the input image, the image will have global features such as the girl's hair length, the color of her clothes, and her face contour.

Specifically, extracting the global feature vector of the image comprises: in the global feature extraction model, the input image passes through multiple layers of convolution kernels; as the number of convolution kernels per layer grows and the network deepens, the length and width of the feature maps shrink, so each feature map extracted by the convolutional layers becomes more representative. The last layer is a fully connected layer, and the vector obtained from this layer is used as the global feature vector of the image.

Extracting the global feature vector of the image to be semantically described supports the subsequent semantic description of the image and is important for finally obtaining an accurate description.
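As a toy illustration of the downsampling behavior described above, the sketch below computes how the spatial size of the feature maps shrinks layer by layer while the channel depth grows; the layer sizes are hypothetical, not the patent's actual network, and the final fully connected layer would then map the last feature map to the global feature vector:

```python
def conv_out_size(size, kernel=3, stride=2, pad=1):
    # standard convolution output-size formula
    return (size + 2 * pad - kernel) // stride + 1

def feature_map_shapes(input_size=224, channels=(64, 128, 256, 512)):
    # each stride-2 layer halves the spatial extent while channels grow
    shapes, size = [], input_size
    for ch in channels:
        size = conv_out_size(size)
        shapes.append((ch, size, size))
    return shapes
```

For a 224-pixel input this yields (64, 112, 112) down to (512, 14, 14), matching the intuition that deeper feature maps are smaller but more representative.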

S2, extracting the attribute feature vector of the image; the attribute features comprise texture, contour and position information.

The existing semantic description method obtains good description effect under the framework based on coding and decoding, but only extracts single features of the image, and the single image features are not enough to describe complete image information, so that the result accuracy of image description is low. Therefore, when the global feature vector is extracted, the attribute feature vector of the image is obtained by establishing the attribute feature extraction model, and the extracted multiple features are used for describing the semantics of the image, so that the semantic description precision of the image is enhanced.

Preferably, the image is input into the attribute feature extraction model to extract the attribute feature vector of the image, and the attribute feature extraction model is obtained by the following steps:

establishing an attribute feature extraction base model, wherein the base model comprises a modified SSD network, a prior frame module and a column-vector maximization module; the modified SSD network is obtained by replacing the front-end network of the SSD network with a ResNet-50 residual structure, and comprises feature extraction layers used to perform feature mapping on an input image to obtain feature maps; the prior frame module is used to set prior frames of different sizes for different feature maps and to use them to obtain the attribute detection matrix corresponding to each feature map; the column-vector maximization module is used to perform column-vector maximization on the attribute detection matrix to obtain the attribute feature vector;

and inputting a training image to train the established basic model to obtain the attribute feature extraction model.

Specifically, the established attribute feature extraction model comprises a modified SSD network, a prior frame module and a column-vector maximization module. The modified SSD network is obtained by replacing the VGG16 network with a ResNet-50 residual structure, an improvement on the front-end network of the original SSD network. As shown in fig. 3, the modified SSD network comprises 7 feature extraction layers, of which 4 are arranged in the ResNet-50 residual structure. Feature maps are obtained by feature mapping in the feature extraction layers. The prior frame module obtains the attribute detection matrix of a feature map as follows: a feature matrix is obtained for each prior frame, and all obtained feature matrices are combined with a concat operation to obtain the attribute detection matrix. The column-vector maximization module performs column-vector maximization on the attribute detection matrix obtained by the prior frame module to obtain the attribute feature vector of the image.

Traditional convolutional or fully connected networks suffer, to a greater or lesser extent, from information loss during transmission, as well as from gradient vanishing or gradient explosion, which makes deep networks difficult to train. In the SSD network, after the ResNet-50 residual structure replaces the original VGG16 network, the other feature extraction layers of the SSD network are unchanged. Using the ResNet-50 residual structure in place of the original VGG16 as the front-end network of the SSD network and adding feature extraction layers alleviates problems such as small target scale and low resolution in the input image, and improves the feature extraction capability of the SSD network.
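The identity shortcut that characterizes a ResNet residual block can be sketched in a few lines. This is a toy 1-D version for illustration, not the actual ResNet-50 layers: the block's output is the input plus a learned transform, so information and gradients can flow through the skip connection even when the transform contributes little:

```python
def residual_block(x, transform):
    """Toy residual connection: output = x + F(x), element-wise on a 1-D list."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]
```

When the transform outputs zeros, the block reduces to the identity, which is exactly why stacking many such blocks does not block gradient flow the way plain stacked layers can.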

Preferably, the size of the prior frame is calculated by the following formula:

s_k = s_min + (s_max − s_min)(k − 1)/(n − 1), k ∈ [1, n]

where s_k is the ratio of the k-th prior frame to the image, s_min and s_max are respectively the minimum and maximum ratios of the prior frame relative to the image, and n is the number of feature maps. For the aspect ratio a_r of a prior frame, the values a_r ∈ {1, 2, 3, 1/2, 1/3} are generally selected. The width and height of each prior frame are respectively w_k^a = s_k·sqrt(a_r) and h_k^a = s_k/sqrt(a_r). The center point of the prior frame is ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), i, j ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map.
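Under the linear-scale rule above, the scales and per-aspect-ratio box dimensions can be computed as follows. The defaults s_min = 0.2 and s_max = 0.9 are the commonly used SSD values and are an assumption here, since the patent does not state them explicitly:

```python
import math

def prior_box_scales(n, s_min=0.2, s_max=0.9):
    # s_k = s_min + (s_max - s_min) * (k - 1) / (n - 1), for k = 1..n
    return [s_min + (s_max - s_min) * (k - 1) / (n - 1) for k in range(1, n + 1)]

def prior_box_dims(s_k, aspect_ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    # width = s_k * sqrt(a_r), height = s_k / sqrt(a_r): the area stays s_k**2
    return [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
```

Note that every aspect ratio preserves the box area s_k², so varying a_r changes the shape of the prior frame without changing its scale.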

Through the attribute feature extraction model, more complete image information can be obtained from the input image. This alleviates the problem that the global features are coarse because important information is lost during feature extraction in the global feature extraction model; extracting attribute features on top of the extracted global features makes the extracted feature information more complete and enhances the semantic expression of the image.

Preferably, as shown in fig. 4, the inputting the image into the attribute feature extraction model to obtain the attribute feature vector of the image includes the following steps:

and S220, inputting the image into an attribute feature extraction model.

And S240, obtaining a feature map of the input image based on the feature extraction layer. The input image can obtain a feature map through a feature extraction layer.

And S260, obtaining an attribute detection matrix of the feature map based on the prior frame.

Specifically, the method for obtaining the attribute detection matrix is similar to the global feature extraction step: in each feature map, the size and position of the prior frames are obtained based on multiple layers of convolution kernels, a feature matrix is obtained for each prior frame, and finally all obtained feature matrices are combined with a concat operation to obtain the attribute detection matrix. The attribute features of the image are finally obtained by superimposing several different attribute detection matrices. The combination of multiple feature extraction layers and multiple prior frames makes the extracted image features more complete and improves the precision of the semantic description.

And S280, acquiring an attribute feature vector based on the column vector maximization module.

Specifically, based on the attribute detection matrix obtained in step S260, the column-vector maximization module obtains the attribute feature vector by the following formula:

V_att(j) = max_i D(i, j)

where V_att is the attribute feature vector, D is the attribute detection matrix, i and j index the rows and columns of the attribute detection matrix respectively, m is the number of prior frames, and c is the number of detection matrix classes obtained by training; in this embodiment, m = 14658 and c = 300.
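The column-vector maximization step reduces the m × c detection matrix to a length-c vector by keeping, for every attribute column, the strongest response over all prior frames. A minimal sketch on plain Python lists:

```python
def column_max(detection_matrix):
    """V_att[j] = max over i of D[i][j]: strongest response per attribute column."""
    m = len(detection_matrix)
    c = len(detection_matrix[0])
    return [max(detection_matrix[i][j] for i in range(m)) for j in range(c)]
```

With m prior frames and c attribute classes, the result is independent of which prior frame fired, which is what makes the attribute vector a fixed-length summary of the whole image.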

The attribute feature vector of the image is obtained by adopting the attribute feature extraction model, so that the problem that the global feature extraction model loses important information to cause coarse global features is solved, the completeness of feature information extraction is realized, and the semantic expression effect of the image is enhanced.
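A minimal sketch of the column vector maximization step, assuming the attribute detection matrix is an `m × c` array of scores (the toy values below are illustrative, not from the embodiment):

```python
import numpy as np

# For each of the c attribute categories, keep the highest score over the
# m prior frames, producing a c-dimensional attribute feature vector.
def column_vector_max(A):
    """A: (m, c) attribute detection matrix -> (c,) attribute feature vector."""
    return A.max(axis=0)

A = np.array([[0.1, 0.9, 0.2],
              [0.7, 0.3, 0.4]])   # m = 2 prior frames, c = 3 categories
v_att = column_vector_max(A)
print(v_att)  # [0.7 0.9 0.4]
```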

And S3, inputting the global feature vector and the attribute feature vector into a bidirectional long-time and short-time memory network simultaneously to obtain a forward joint loss function and a backward joint loss function.

The LSTM, as a variant of the RNN, solves the problems of vanishing and exploding gradients during training, but it can only capture unidirectional time-sequence information, which limits the precision of semantic description. The Bi-LSTM was proposed to better represent the input data and removes this limitation. For data such as sound and text, the content of a sentence depends on its context; by training on the forward and backward time sequences separately, the Bi-LSTM can obtain the probability of the word vector with the highest similarity to the image, and the sentence formed from these word vectors yields a context-aware semantic description. The Bi-LSTM comprises a forward LSTM and a backward LSTM. The LSTM passes on the important information needed at later time steps by forgetting old information and memorizing new information, discards useless information, and outputs a hidden state at each time step. The Bi-LSTM also avoids the vanishing gradient problem, captures bidirectional semantic dependence, and improves the model's semantic description performance on images.

Specifically, as shown in fig. 5, the bidirectional long short-term memory network comprises, in order, an embedding layer, a text LSTM (T-LSTM), a multi-modal LSTM (M-LSTM), and a Softmax probability calculation layer. Preferably, the global feature and the attribute feature are simultaneously input into the bidirectional long short-term memory network to obtain the forward joint loss function and the backward joint loss function, specifically comprising the following steps:

S310, obtaining a predicted word based on the global feature vector and the attribute feature vector. Specifically, stochastic gradient descent (SGD) is an existing end-to-end training method. During training, the bidirectional long short-term memory network performs recognition training on the input global feature vector and attribute feature vector, so that it can obtain predicted words directly from these vectors. In this step the bidirectional network has already been trained with SGD, so when the global feature vector and the attribute feature vector are input, the predicted word is obtained directly.
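As a small illustration of the SGD update rule referred to above (the parameter vector, gradient, and learning rate below are generic placeholders, not quantities from the patent):

```python
import numpy as np

# One stochastic gradient descent step: move the parameters against the
# gradient of the loss, scaled by the learning rate lr.
def sgd_step(theta, grad, lr=0.1):
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
theta = sgd_step(theta, grad)
print(theta)  # [ 0.95 -1.95]
```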

S320, inputting the predicted word into the embedding layer to obtain a predicted word vector $x_t$. The embedding layer is an indispensable processing layer of the bidirectional long short-term memory network; it converts the embedded words into vector form.

S330, embedding the predicted word vector $x_t$ into the text LSTM to obtain the text LSTM forward hidden layer sequence $\overrightarrow{h^T_t}$ and the text LSTM backward hidden layer sequence $\overleftarrow{h^T_t}$, specifically comprising the following two steps:

S331, based on the predicted word vector $x_t$, obtaining a forward sentence $\overrightarrow{s_t}$ and a backward sentence $\overleftarrow{s_t}$. Specifically, $x_t$ is a word vector, and different sentences can be obtained by combining words; since $x_t$ is in vector form, the sentences obtained by combination are also in vector form and are divided into the forward sentence $\overrightarrow{s_t}$ and the backward sentence $\overleftarrow{s_t}$.

S332, aiming at the forward sentenceAnd backward sentencesCoding to obtain text STM forward hidden layer sequenceAnd text LSTM backward hidden layer sequenceSpecifically, the method is realized by the following formula:

in the above formula, the first and second carbon atoms are,andrespectively a forward embedding matrix and a backward embedding matrix obtained from T-LSTM network learning,andrespectively a forward sentence and a backward sentence,andare respectively from T-LSTMAnd (4) the weights corresponding to the forward direction and the backward direction obtained by the network learning are obtained, and T represents T-LSTM.
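The forward and backward encodings can be illustrated with a toy recurrent cell standing in for the T-LSTM; all names and shapes here are hypothetical, and a simple tanh cell replaces the actual LSTM gates for brevity:

```python
import numpy as np

# Toy stand-in for the T-LSTM: E embeds word ids, W is the recurrent weight.
# Processing the sentence forward or reversed yields the forward or backward
# hidden layer sequence, respectively.
def encode(sentence_ids, E, W, reverse=False):
    ids = sentence_ids[::-1] if reverse else sentence_ids
    h = np.zeros(W.shape[0])
    hidden = []
    for t in ids:
        h = np.tanh(E[t] + W @ h)   # h_t depends on the embedded word and h_{t-1}
        hidden.append(h)
    return hidden

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))          # vocabulary of 10 words, hidden dim 4
W = rng.normal(size=(4, 4)) * 0.1
sent = [3, 1, 7]
h_fwd = encode(sent, E, W)            # forward hidden layer sequence
h_bwd = encode(sent, E, W, True)      # backward hidden layer sequence
print(len(h_fwd), len(h_bwd))  # 3 3
```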

S340, embedding the text LSTM forward hidden layer sequence $\overrightarrow{h^T_t}$ and the text LSTM backward hidden layer sequence $\overleftarrow{h^T_t}$ into the multi-modal LSTM to obtain the multi-modal LSTM forward hidden layer sequence $\overrightarrow{h^M_t}$ and the multi-modal LSTM backward hidden layer sequence $\overleftarrow{h^M_t}$. Specifically, this is realized by the following formulas:

$$\overrightarrow{h^M_t} = \mathrm{LSTM}\big(\big[\overrightarrow{h^T_t};\ V_{img}\big];\ \overrightarrow{W^M}\big), \qquad \overleftarrow{h^M_t} = \mathrm{LSTM}\big(\big[\overleftarrow{h^T_t};\ V_{img}\big];\ \overleftarrow{W^M}\big)$$

where $V_{img}$ is the global feature vector, $\overrightarrow{W^M}$ and $\overleftarrow{W^M}$ are the forward and backward weights learned by the M-LSTM network, and the superscript $M$ denotes the M-LSTM.

S350, based on the multi-modal LSTM forward hidden layer sequence $\overrightarrow{h^M_t}$ and backward hidden layer sequence $\overleftarrow{h^M_t}$, obtaining the word $w_t$ corresponding to the maximum probability using the softmax function. Specifically, this is realized by the following formulas:

$$\overrightarrow{p_t} = \mathrm{Softmax}\big(W_s\,\overrightarrow{h^M_t} + b_s\big), \qquad \overleftarrow{p_t} = \mathrm{Softmax}\big(W_s\,\overleftarrow{h^M_t} + b_s\big)$$

where $\overrightarrow{p_t}$ and $\overleftarrow{p_t}$ are the word probability distributions for the forward and backward sentences respectively, whose maxima give $w_t$, and $W_s$ and $b_s$ are the weight matrix and bias vector learned by the Bi-LSTM network.
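A sketch of the Softmax word-selection step, with an illustrative hidden state, weight matrix, and three-word vocabulary (none of these values come from the embodiment):

```python
import numpy as np

# Compute scores = W_s h + b_s, normalize with softmax, and select the
# word with maximum probability.
def predict_word(h, W_s, b_s, vocab):
    scores = W_s @ h + b_s
    p = np.exp(scores - scores.max())   # shift for numerical stability
    p /= p.sum()
    return vocab[int(np.argmax(p))], p.max()

h = np.array([0.2, -0.1, 0.5])
W_s = np.eye(3)
b_s = np.zeros(3)
word, p_max = predict_word(h, W_s, b_s, ["cat", "dog", "tree"])
print(word)  # tree
```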

S360, obtaining a forward joint loss function and a backward joint loss function from the word $w_t$ corresponding to the maximum probability.

The maximum probabilities of the forward and backward sentences are obtained through the bidirectional long short-term memory network, so the word $w_t$ corresponding to the maximum probability can be obtained in each direction. The words $w_t$ are recombined to obtain the expected sequence $I$. The forward joint loss function $\overrightarrow{L}$ and the backward joint loss function $\overleftarrow{L}$ are each computed from the words $w_t$ and the expected sequence $I$, as follows:

$$\overrightarrow{L} = -\sum_{t=1}^{T} \log \overrightarrow{p}(w_t \mid I), \qquad \overleftarrow{L} = -\sum_{t=1}^{T} \log \overleftarrow{p}(w_t \mid I)$$

where $T$ is the length of the expected sequence $I$, $p(\cdot)$ is the probability of the expected sequence, and $w_t$ is the word corresponding to the maximum probability.
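The two negative log-likelihood losses and their summation can be sketched as follows, with purely illustrative per-word probabilities in place of trained model outputs:

```python
import numpy as np

# Negative log-likelihood of a word sequence given its per-word probabilities.
def nll(word_probs):
    return -np.sum(np.log(word_probs))

p_forward = [0.9, 0.8, 0.7]    # illustrative p(w_t | I) from the forward pass
p_backward = [0.85, 0.75, 0.8]  # illustrative p(w_t | I) from the backward pass
L = nll(p_forward) + nll(p_backward)   # joint loss: sum of both directions
print(round(L, 4))  # 1.3585
```

Raising any per-word probability lowers the joint loss, which is exactly the minimization criterion used in step S4.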

And S4, accumulating and summing the forward joint loss function and the backward joint loss function, and extracting the semantic description which is best matched with the image when the sum is minimum.

A joint loss function is obtained from the words $w_t$ corresponding to the maximum probability: the forward joint loss function $\overrightarrow{L}$ and the backward joint loss function $\overleftarrow{L}$ are summed to give the joint loss function $L$, i.e.

$$L = \overrightarrow{L} + \overleftarrow{L}$$

The joint loss function $L$ measures the degree of inconsistency between the predicted value and the true value: the smaller its value, the better the fit, i.e. the closer the predicted value is to the true value and the higher the correlation between the finally generated sentence and the image. Since each direction's loss is a negative log-likelihood, the joint loss function $L$ is minimum exactly when the probability of the expected sequence is maximum:

$$\min L \iff \max \prod_{t=1}^{T} p(w_t \mid I)$$

When the probability $p(w_t \mid I)$ of the expected sequence obtained by the above formulas is maximum, the joint loss function value is smallest, the predicted value is closest to the true value, and the correlation between the final expected sequence $I$ and the image is highest; the expected sequence $I$ is then the semantic description that best matches the image.

Based on the accumulated summation of the forward and backward joint loss functions, the expected sequence is finally obtained. Quantifying the joint loss function as an explicit mathematical expression is simple and easy to implement, and makes the semantic description result more accurate and reliable.

Compared with the prior art, the image semantic description method based on multi-feature extraction solves the problems that the image features extracted by the prior art are single and only one-way time sequence information can be captured, achieves extraction of a plurality of image features, and meanwhile, the two-way long-time and short-time memory network can capture two-way time sequence information, so that the dependency between words is higher, and the obtained semantic description effect is more accurate.

Another embodiment of the present invention, as shown in fig. 6, provides an image semantic description system based on multi-feature extraction, including the following modules:

the global feature vector acquisition module 1000 is configured to input an image to be semantically described into the global feature extraction model to obtain a global feature vector of the image;

an attribute feature vector obtaining module 2000, configured to extract an attribute feature vector of the image; the attribute features comprise the outline, the texture and the position information of the image;

a joint loss function obtaining module 3000, configured to input the global feature vector and the attribute feature vector to a bidirectional long-and-short-term memory network at the same time, so as to obtain a forward joint loss function and a backward joint loss function;

a semantic description obtaining module 4000 for cumulatively summing the forward joint loss function and the backward joint loss function, and obtaining a semantic description which best matches with the image when the sum is minimum.

An image semantic description system based on multi-feature extraction solves the problems that image features extracted in the prior art are single and only one-way time sequence information can be captured, extraction of a plurality of image features is achieved, meanwhile, a two-way long-time and short-time memory network can capture two-way time sequence information, dependency among words is higher, and the obtained semantic description effect is more accurate.

Preferably, the step of establishing the attribute feature extraction model by the attribute feature vector acquisition module is as follows:

establishing an attribute feature extraction basic model, wherein the attribute feature extraction basic model comprises a changed SSD network, a prior frame module and a column vector maximization module;

the changed SSD network is obtained by replacing a preposed network of the SSD network with a Resnet-50 residual structure; the changed SSD network comprises a feature extraction layer, wherein the feature extraction layer is used for performing feature mapping on an input image to obtain a feature map;

the prior frame module is used for setting prior frames with different sizes based on different feature maps and obtaining attribute detection matrixes corresponding to the feature maps by using the set prior frames with different sizes;

the column vector maximization module is used for performing column vector maximization on the attribute detection matrix to obtain an attribute feature vector;

and inputting a training image to train the established basic model to obtain the attribute feature extraction model.

Preferably, the attribute feature vector obtaining module obtains the attribute feature vector by:

inputting an image into an attribute feature extraction model;

obtaining a feature map of the image based on a feature extraction layer;

based on the prior frame, obtaining an attribute detection matrix of the feature map;

and obtaining an attribute feature vector based on a column vector maximization module.

By the attribute feature vector acquisition module, an attribute feature extraction model is established, the attribute feature vector is acquired, the problem of single image feature acquired in the prior art is solved, the image feature is more complete, and the semantic description effect is improved.

Preferably, the joint loss function obtaining module obtains the forward joint loss function and the backward joint loss function by:

obtaining a predicted word based on the global feature vector and the attribute feature vector, using a network trained by stochastic gradient descent (SGD);

inputting the predicted word into the embedding layer to obtain a predicted word vector $x_t$;

embedding the predicted word vector $x_t$ into the text LSTM to obtain the text LSTM forward hidden layer sequence $\overrightarrow{h^T_t}$ and the text LSTM backward hidden layer sequence $\overleftarrow{h^T_t}$;

embedding the text LSTM forward hidden layer sequence $\overrightarrow{h^T_t}$ and backward hidden layer sequence $\overleftarrow{h^T_t}$ into the multi-modal LSTM to obtain the multi-modal LSTM forward hidden layer sequence $\overrightarrow{h^M_t}$ and backward hidden layer sequence $\overleftarrow{h^M_t}$;

based on the multi-modal LSTM forward hidden layer sequence $\overrightarrow{h^M_t}$ and backward hidden layer sequence $\overleftarrow{h^M_t}$, obtaining the word $w_t$ corresponding to the maximum probability using the softmax function;

and obtaining the forward joint loss function and the backward joint loss function based on the word $w_t$ corresponding to the maximum probability.

Through the joint loss function acquisition module, the bidirectional long short-term memory network captures bidirectional time sequence information, solving the problem that the prior art can only capture unidirectional time sequence information, so that the dependency between words is stronger. Meanwhile, the semantic description acquisition module quantifies the joint loss function as a mathematical expression, which is simple and easy to implement and makes the semantic description result more accurate and reliable.

Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing related hardware, the program being stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, or a random access memory.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
