Intelligent image automatic description method based on deep neural network

Document No.: 1738186    Publication date: 2019-12-20

Reading note: this technology, "Intelligent image automatic description method based on deep neural network", was designed and created by 纪荣嵘 (Ji Rongrong), 陈福海 (Chen Fuhai), and 沈忱 (Shen Chen) on 2019-08-06. Its main content: an intelligent image automatic description method based on a deep neural network, relating to intelligent image automatic description in the field of artificial intelligence. The method comprises the following steps: 1) preprocessing the description data; 2) extracting deep convolutional features and semantic information from the image; 3) intelligent image automatic description based on multi-level visual semantic embedding. By adopting a factorized attention mechanism module, the method addresses the problems that automatic image description has not considered the hierarchy of semantic information or scene information: scene-related semantic information is embedded explicitly to guide the embedding of object-related semantic information and of image features. This research on image automatic description based on multi-level visual semantic embedding facilitates the adoption of automatic image description in industry.

1. The intelligent image automatic description method based on the deep neural network is characterized by comprising the following steps:

1) preprocessing description data;

2) extracting deep convolutional features and semantic information from the image;

3) intelligent image automatic description based on multi-level visual semantic embedding.

2. The intelligent image automatic description method based on the deep neural network as claimed in claim 1, wherein in step 1), the specific method for preprocessing the description data is as follows: stop words are removed from the text content of all training data, and all English words are converted to lowercase; the text content is then split on spaces to obtain 9487 words; words occurring fewer than 5 times in the dataset descriptions are removed and replaced by "<UNK>", and a start symbol "<BOS>" and an end symbol "<END>" are added at the beginning and end of each description sentence, respectively.

3. The intelligent image automatic description method based on the deep neural network as claimed in claim 1, wherein in step 2), the specific steps of extracting the deep convolutional features and semantic information of the image are as follows:

(1) processing the image with a residual deep convolutional network to obtain the convolutional features, giving for each image a feature map denoted $F_I \in \mathbb{R}^{14 \times 14 \times 2048}$; here 14 × 14 is the grid of feature region blocks of the image, and 2048 is the feature dimension of each block;

(2) removing stop words from the text content of all training data and converting all English words to lowercase; then splitting the text content on spaces, extracting nouns with the NLTK toolkit as object semantic concepts, and counting the frequency of each concept; converting the noun frequencies into decimals between 0 and 1 with a Sigmoid function; each picture thereby yields a label vector whose length equals that of the object semantic concept vocabulary, namely 4267; performing multi-label classification training on the labels so formed, with per-label Sigmoid cross-entropy losses; to balance the numbers of positive and negative samples, increasing the loss weight of positive samples tenfold; using the F1 score on the validation set as the quality metric; adopting the Adam optimization algorithm with an initial learning rate of 0.4, decayed by a factor of 0.9 every 10 epochs, and iterating until convergence; when extracting the object semantic information on the MS COCO image automatic description dataset, outputting the probabilities obtained after a Sigmoid function layer, with a final output dimension of 4267;

(3) predicting scene categories for the picture with a residual deep convolutional network and retaining only the probabilities of the top 5 predicted scenes; the final output dimension is 366.

4. The intelligent image automatic description method based on the deep neural network as claimed in claim 1, wherein in step 3), the steps of the intelligent image automatic description based on multi-level visual semantic embedding are as follows:

(1) processing the attention over, and embedding of, object semantic concepts and local abstract features with a two-layer recurrent neural network whose processing unit is the long short-term memory (LSTM) neural network; the first-layer LSTM is a top-down attention LSTM, which handles the attention over and embedding of object semantic concepts and local abstract features according to the output and state of the second-layer language LSTM; in this attention LSTM, the input at each step is the previous output of the language LSTM concatenated with an encoding of the previously generated word:

$x_t^1 = [h_{t-1}^2, W_e z_t]$

where $z_t$ is the one-hot encoded input word at time $t$, and $W_e \in \mathbb{R}^{E \times Q}$ is a word embedding matrix over the vocabulary of size $Q$, learned from random initialization rather than pretrained;

(2) the top-down attention LSTM receives the hidden state $h_{t-1}^2$ of the language LSTM at the previous moment and the word encoding $W_e z_t$ at the current moment as inputs; the hidden state $h_t^1$ it generates enters a factorized attention module, in which the scene semantic information converted into a diagonal matrix, $S = \mathrm{diag}(v_{scene})$, guides the attention to obtain the weighted local abstract features and object semantic concepts, which are finally fed together into the language LSTM to obtain the word output;

to embed the scene semantic information $v_{scene}$ explicitly into the attention LSTM output $h_t^1$, the attention mechanism module is designed in factorized form: the hidden-state weight $W_h$ of the traditional attention mechanism is factorized into $U_h S V_h$, with the formula:

$W_h = U_h S V_h$

where $U_h$ and $V_h$ are learnable factor matrices and $S = \mathrm{diag}(v_{scene})$ is the diagonalized matrix of the scene semantic information $v_{scene}$; through $S$, the hidden state $h_t^1$ embeds the content describing the scene, which directly influences the two subsequent attention guides and the description generation;

(3) to let the model dynamically attend to the spatial features of the image while generating words, and thereby obtain richer detail in the description, the attention embedding over the local abstract features of the image is retained; for the $L$ local abstract image features $v_i$, a normalized weight $\alpha_{i,t}$ is computed from the hidden state $h_t^1$ of the first-layer attention LSTM and the diagonalized scene semantic information $S$ as follows:

$a_{i,t} = w_a^{\top} \tanh(W_{va} v_i + U_h S V_h h_t^1)$

$\alpha_t = \mathrm{Softmax}(a_t)$

where $w_a$ and $W_{va}$ are learnable parameters;

the image local abstract features $v_{conv}$ are multiplied pointwise by the weights $\alpha_t$ and summed, and the weighted result is used in the second-layer language LSTM:

$\hat{v}_t = \sum_{i=1}^{L} \alpha_{i,t} v_i$

(4) for the $C$-dimensional object semantic concepts $v_{obj}$, a normalized weight $\beta_{i,t}$ is computed from the output $h_t^1$ of the first-layer attention LSTM and the diagonalized scene semantic information $S$, with the formula:

$b_{i,t} = w_b^{\top} \tanh(W_{vb} v_{obj,i} + U_h S V_h h_t^1)$

$\beta_t = \mathrm{Softmax}(b_t)$

where $w_b$ and $W_{vb}$ are learnable parameters;

the $C$-dimensional object semantic concepts $v_{obj}$, after a dimension transformation, are multiplied pointwise by the weights $\beta_t$ to obtain the high-level object semantic concepts embedded with scene semantic information, $\hat{c}_t$, which are likewise used in the second-layer language LSTM;

(5) after obtaining the image information processed by the attention mechanism, the second-layer language LSTM generates the current word; its input comprises the scene-attended local abstract image features $\hat{v}_t$, the scene-attended semantic concepts $\hat{c}_t$, and the hidden state $h_t^1$ of the first-layer attention LSTM:

$x_t^2 = [\hat{v}_t, \hat{c}_t, h_t^1], \qquad p(y_t \mid y_{1:t-1}) = \mathrm{Softmax}(W_p h_t^2 + b_p)$

where $W_p$ and $b_p$ are the learnable parameters and bias;

the word probability distribution $p_2$ of the second-layer language LSTM not only participates in the optimization during training but is also sampled from during testing;

the probability distribution of the complete output sequence is the product of the conditional probability distributions:

$p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})$

Technical Field

The invention relates to intelligent image automatic description in the field of artificial intelligence, and in particular to an intelligent image automatic description method based on a deep neural network, which describes the objective content of a given image in natural language.

Background

Image automatic description (Image Captioning) is an ultimate machine-intelligence task put forward by the computer science community in recent years. The task is, given an image, to describe its objective content in natural language, as shown in Fig. 1. With the development of computer vision, the field is no longer limited to tasks such as detection, recognition and segmentation; the computer is also expected to describe the objective content of images automatically. Unlike image classification or object detection, image automatic description must describe, in natural language and as a whole, the important scenes and objects in a picture and the relationships between them. This is an important component of computer visual content understanding.

Image automatic description represents a new peak of artificial intelligence research and is an important embodiment of artificial intelligence approaching human intelligence. Owing to the particular nature of the task, progress in this research direction usually builds on the latest advances in both computer vision and natural language processing. Compared with traditional image detection, recognition and segmentation tasks, or other multi-modal research problems, image automatic description requires a machine model to understand both visual and textual content and to reason logically over that content to complete the final description. Image automatic description has vast industrial application scenarios, including intelligent captioning for social media, intelligent devices, and assistance for the visually impaired.

Mainstream image automatic description methods adopt an encoder-decoder structure: a convolutional neural network encodes the image, and a recurrent neural network decodes the sentence. However, plain extraction and embedding of image features cannot explicitly express the multi-level semantic information of an image, so some methods propose image automatic description based on visual concepts. Yet visual-concept-based methods still cannot represent the different levels of visual semantics well: they stay at a semantic level dominated by objects and do not touch the scene semantics. Scene semantic information captures the higher-level semantics encoded in a picture, such as where the picture was taken and the activities a person might perform in it, and can be used to influence the attention modules and text generation; for a particular scene type, the words of the generated description may differ.

Although the automatic description of images has been greatly developed in the last two years, problems still remain.

Disclosure of Invention

Aiming at the problems of current image automatic description methods, the invention provides a novel deep learning network design with a scene-information-based factorized attention mechanism, which solves the problems that the hierarchy of semantic information and the scene information are not considered in image automatic description.

The invention comprises the following steps:

1) preprocessing description data;

2) extracting deep convolutional features and semantic information from the image;

3) intelligent image automatic description based on multi-level visual semantic embedding.

In step 1), the specific method for preprocessing the description data may be: remove stop words from the text content of all training data and convert all English words to lowercase; then split the text content on spaces to obtain 9487 words; remove words occurring fewer than 5 times in the dataset descriptions and replace them by "<UNK>"; and add a start symbol "<BOS>" and an end symbol "<END>" at the beginning and end of each description sentence, respectively.

In step 2), the specific steps of extracting the deep convolutional features and semantic information of the image may be:

(1) processing the image with a residual deep convolutional network to obtain the convolutional features, giving for each image a feature map denoted $F_I \in \mathbb{R}^{14 \times 14 \times 2048}$; here 14 × 14 is the grid of feature region blocks of the image, and 2048 is the feature dimension of each block;

(2) removing stop words from the text content of all training data and converting all English words to lowercase; then splitting the text content on spaces, extracting nouns with the NLTK toolkit as object semantic concepts, and counting the frequency of each concept; converting the noun frequencies into decimals between 0 and 1 with a Sigmoid function; each picture thereby yields a label vector whose length equals that of the object semantic concept vocabulary, namely 4267; performing multi-label classification training on the labels so formed, with per-label Sigmoid cross-entropy losses; to balance the numbers of positive and negative samples, increasing the loss weight of positive samples tenfold; using the F1 score on the validation set as the quality metric; adopting the Adam optimization algorithm with an initial learning rate of 0.4, decayed by a factor of 0.9 every 10 epochs, and iterating until convergence; when extracting the object semantic information on the MS COCO image automatic description dataset, outputting the probabilities obtained after a Sigmoid function layer, with a final output dimension of 4267;

(3) predicting scene categories for the picture with a residual deep convolutional network and retaining only the probabilities of the top 5 predicted scenes; the final output dimension is 366.

In step 3), the specific steps of the intelligent image automatic description based on multi-level visual semantic embedding may be:

(1) processing the attention over, and embedding of, object semantic concepts and local abstract features with a two-layer recurrent neural network whose processing unit is the long short-term memory (LSTM) neural network; the first-layer LSTM is a top-down attention LSTM, which handles the attention over and embedding of object semantic concepts and local abstract features according to the output and state of the second-layer language LSTM; in this attention LSTM, the input at each step is the previous output of the language LSTM concatenated with an encoding of the previously generated word:

$x_t^1 = [h_{t-1}^2, W_e z_t]$

where $z_t$ is the one-hot encoded input word at time $t$, and $W_e \in \mathbb{R}^{E \times Q}$ is a word embedding matrix over the vocabulary of size $Q$, learned from random initialization rather than pretrained;

(2) the top-down attention LSTM receives the hidden state $h_{t-1}^2$ of the language LSTM at the previous moment and the word encoding $W_e z_t$ at the current moment as inputs; the hidden state $h_t^1$ it generates enters a factorized attention module, in which the scene semantic information converted into a diagonal matrix, $S = \mathrm{diag}(v_{scene})$, guides the attention to obtain the weighted local abstract features and object semantic concepts, which are finally fed together into the language LSTM to obtain the word output;

to embed the scene semantic information $v_{scene}$ explicitly into the attention LSTM output $h_t^1$, the attention mechanism module is designed in factorized form: the hidden-state weight $W_h$ of the traditional attention mechanism is factorized into $U_h S V_h$, with the formula:

$W_h = U_h S V_h$

where $U_h$ and $V_h$ are learnable factor matrices and $S = \mathrm{diag}(v_{scene})$ is the diagonalized matrix of the scene semantic information $v_{scene}$; through $S$, the hidden state $h_t^1$ embeds the content describing the scene, which directly influences the two subsequent attention guides and the description generation;

(3) to let the model dynamically attend to the spatial features of the image while generating words, and thereby obtain richer detail in the description, the attention embedding over the local abstract features of the image is retained; for the $L$ local abstract image features $v_i$, a normalized weight $\alpha_{i,t}$ is computed from the hidden state $h_t^1$ of the first-layer attention LSTM and the diagonalized scene semantic information $S$ as follows:

$a_{i,t} = w_a^{\top} \tanh(W_{va} v_i + U_h S V_h h_t^1)$

$\alpha_t = \mathrm{Softmax}(a_t)$

where $w_a$ and $W_{va}$ are learnable parameters;

the image local abstract features $v_{conv}$ are multiplied pointwise by the weights $\alpha_t$ and summed, and the weighted result is used in the second-layer language LSTM:

$\hat{v}_t = \sum_{i=1}^{L} \alpha_{i,t} v_i$

(4) for the $C$-dimensional object semantic concepts $v_{obj}$, a normalized weight $\beta_{i,t}$ is computed from the output $h_t^1$ of the first-layer attention LSTM and the diagonalized scene semantic information $S$, with the formula:

$b_{i,t} = w_b^{\top} \tanh(W_{vb} v_{obj,i} + U_h S V_h h_t^1)$

$\beta_t = \mathrm{Softmax}(b_t)$

where $w_b$ and $W_{vb}$ are learnable parameters;

the $C$-dimensional object semantic concepts $v_{obj}$, after a dimension transformation, are multiplied pointwise by the weights $\beta_t$ to obtain the high-level object semantic concepts embedded with scene semantic information, $\hat{c}_t$, which are likewise used in the second-layer language LSTM;

(5) after obtaining the image information processed by the attention mechanism, the second-layer language LSTM generates the current word; its input comprises the scene-attended local abstract image features $\hat{v}_t$, the scene-attended semantic concepts $\hat{c}_t$, and the hidden state $h_t^1$ of the first-layer attention LSTM:

$x_t^2 = [\hat{v}_t, \hat{c}_t, h_t^1], \qquad p(y_t \mid y_{1:t-1}) = \mathrm{Softmax}(W_p h_t^2 + b_p)$

where $W_p$ and $b_p$ are the learnable parameters and bias;

the word probability distribution $p_2$ of the second-layer language LSTM not only participates in the optimization during training but is also sampled from during testing;

the probability distribution of the complete output sequence is the product of the conditional probability distributions:

$p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})$

the invention provides a novel deep learning network structure, which innovatively adopts a factorized attention mechanism module and can explicitly embed scene-related semantic information so as to guide the embedding of object-related semantic information and image features. The invention is based on the image automatic description research of multilayer visual semantic embedding, and can facilitate the popularization and the use of the image automatic description in the industry.

Drawings

Fig. 1 is a model example diagram. After an image is given, the model describes the image.

Fig. 2 is a model frame diagram.

FIG. 3 is a diagram of a case of model output visualization.

Detailed Description

The following examples will further illustrate the present invention with reference to the accompanying drawings.

As shown in fig. 2, an embodiment of the present invention includes the following steps:

First, preprocessing of the description data

Step 1, remove stop words from the text content of all training data and convert all English words to lowercase. Then split the text content on spaces to obtain 9487 words; words occurring fewer than five times in the dataset descriptions are removed and replaced by "<UNK>", and a start symbol "<BOS>" and an end symbol "<END>" are added at the beginning and end of each description sentence, respectively. A minimal sketch of this step is given below.
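For illustration only, the following Python sketch assumes the captions are already loaded as a list of English strings; stop-word removal is omitted, and all helper names are hypothetical:

```python
# Sketch of the vocabulary construction and caption encoding described above.
from collections import Counter

MIN_FREQ = 5  # words seen fewer than 5 times become <UNK>

def build_vocab(captions):
    """Lowercase, split on spaces, and keep words meeting the threshold."""
    counts = Counter()
    for cap in captions:
        counts.update(cap.lower().split())
    vocab = ["<BOS>", "<END>", "<UNK>"] + sorted(
        w for w, c in counts.items() if c >= MIN_FREQ)
    return {w: i for i, w in enumerate(vocab)}

def encode_caption(cap, word2id):
    """Wrap a caption with <BOS>/<END> and map rare words to <UNK>."""
    tokens = ["<BOS>"] + cap.lower().split() + ["<END>"]
    return [word2id.get(t, word2id["<UNK>"]) for t in tokens]
```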

Second, extraction of the deep convolutional features and semantic information of the image

Step 1, process the image with a residual deep convolutional network to obtain the convolutional features, giving for each image a feature map denoted $F_I \in \mathbb{R}^{14 \times 14 \times 2048}$. Here 14 × 14 is the grid of feature region blocks of the image, and 2048 is the feature dimension of each block. A sketch of this extraction is given below.
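For illustration only, a sketch of this extraction with a torchvision residual network; the choice of ResNet-101 and a 448 × 448 input (which yields a 14 × 14 final convolutional map) are assumptions:

```python
# Sketch: extract a 14x14x2048 feature map with a residual network.
import torch
import torchvision.models as models

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop pool + fc
backbone.eval()

with torch.no_grad():
    img = torch.randn(1, 3, 448, 448)        # placeholder preprocessed image
    fmap = backbone(img)                     # (1, 2048, 14, 14)
    F_I = fmap.flatten(2).transpose(1, 2)    # (1, 196, 2048): L = 196 regions
```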

Step 2, remove stop words from the text content of all training data and convert all English words to lowercase. Then split the text content on spaces, extract nouns with the NLTK toolkit as object semantic concepts, and count the frequency of each concept. The noun frequencies are converted into decimals between 0 and 1 with the Sigmoid function. Each picture thus yields a label vector whose length equals that of the object semantic concept vocabulary, namely 4267. Multi-label classification training is performed on the labels so formed, with per-label Sigmoid cross-entropy losses. To balance the numbers of positive and negative samples, the loss weight of positive samples is increased tenfold. The F1 score on the validation set is used as the quality metric. The Adam optimization algorithm is adopted with an initial learning rate of 0.4, decayed by a factor of 0.9 every 10 epochs, iterating until convergence. When the object semantic information is extracted on the MS COCO image automatic description dataset, the probabilities obtained after a Sigmoid function layer are output, with a final output dimension of 4267. The training objective is sketched below.
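For illustration only, a sketch of this multi-label objective with the tenfold positive weighting and the stated Adam schedule; the linear head on 2048-dimensional pooled features is an assumption:

```python
# Sketch of the 4267-way multi-label concept objective and optimizer schedule.
import torch
import torch.nn.functional as F

NUM_CONCEPTS = 4267
head = torch.nn.Linear(2048, NUM_CONCEPTS)   # assumed concept head

def concept_loss(logits, targets):
    """Sigmoid cross-entropy; positive samples weighted tenfold."""
    pos_weight = torch.full((NUM_CONCEPTS,), 10.0)
    return F.binary_cross_entropy_with_logits(logits, targets,
                                              pos_weight=pos_weight)

# Adam as stated: initial lr 0.4, decayed by 0.9 every 10 epochs.
optimizer = torch.optim.Adam(head.parameters(), lr=0.4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
```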

Step 3, predict scene categories for the picture with a residual deep convolutional network and retain only the probabilities of the top 5 predicted scenes; the final output dimension is 366, as sketched below.
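For illustration only, a sketch of keeping just the top-5 scene probabilities out of the 366 classes (the scene classifier itself is assumed to be a separately trained residual network):

```python
# Sketch: zero out all but the 5 most probable of the 366 scene classes.
import torch

scene_logits = torch.randn(1, 366)           # placeholder classifier output
probs = scene_logits.softmax(dim=-1)
top5 = probs.topk(5, dim=-1)
v_scene = torch.zeros_like(probs).scatter_(-1, top5.indices, top5.values)
```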

Third, intelligent image automatic description based on multi-level visual semantic embedding

Step 1, a two-layer recurrent neural network is used to process the attention over, and embedding of, object semantic concepts and local abstract features; the processing unit of the recurrent network is the long short-term memory (LSTM) neural network. The first-layer LSTM is a top-down attention LSTM, which handles the attention over and embedding of object semantic concepts and local abstract features according to the output and state of the second-layer language LSTM. In this attention LSTM, the input at each step is the previous output of the language LSTM concatenated with an encoding of the previously generated word:

$x_t^1 = [h_{t-1}^2, W_e z_t]$

where $z_t$ is the one-hot encoded input word at time $t$, and $W_e \in \mathbb{R}^{E \times Q}$ is a word embedding matrix over the vocabulary of size $Q$, learned from random initialization rather than pretrained.

Step 2, the top-down attention LSTM receives the hidden state $h_{t-1}^2$ of the language LSTM at the previous moment and the word encoding $W_e z_t$ at the current moment as inputs. The hidden state $h_t^1$ it generates is sent into the factorized attention module, in which the scene semantic information converted into a diagonal matrix, $S = \mathrm{diag}(v_{scene})$, guides the attention to obtain the weighted local abstract features and object semantic concepts; these are finally fed together into the language LSTM to obtain the word output.

To embed the scene semantic information $v_{scene}$ explicitly into the attention LSTM output $h_t^1$, we design a factorized attention mechanism module that factorizes the hidden-state weight $W_h$ of the traditional attention mechanism into $U_h S V_h$, with the formula:

$W_h = U_h S V_h$

where $U_h$ and $V_h$ are learnable factor matrices and $S = \mathrm{diag}(v_{scene})$ is the diagonalized matrix of the scene semantic information $v_{scene}$. Through $S$, the hidden state $h_t^1$ embeds the content describing the scene, which directly influences the two subsequent attention guides and the description generation. A minimal numeric sketch of this factorization is given below.
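For illustration only, the following sketch rebuilds the scene-conditioned weight from its factors; the hidden and attention sizes are assumptions, while K = 366 matches the scene vector above:

```python
# Sketch: rebuild W_h = U_h S V_h with S = diag(v_scene).
# Sizes H and A are assumptions; K matches the 366-d scene vector.
import torch

H, A, K = 512, 512, 366
U_h = torch.randn(A, K) * 0.01   # learned left factor
V_h = torch.randn(K, H) * 0.01   # learned right factor

def factorized_weight(v_scene):
    S = torch.diag(v_scene)      # (K, K) diagonalized scene information
    return U_h @ S @ V_h         # (A, H): the weight changes with the scene
```

Because $S$ is diagonal, $U_h S V_h h$ can also be computed without materializing $W_h$, as the attention module sketched in the next step shows.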

Step 3, to let the model dynamically attend to the spatial features of the image while generating words, and thereby obtain richer detail in the description, the attention embedding over the local abstract features of the image is retained. For the $L$ local abstract image features $v_i$, a normalized weight $\alpha_{i,t}$ is computed from the hidden state $h_t^1$ of the first-layer attention LSTM and the diagonalized scene semantic information $S$ as follows:

$a_{i,t} = w_a^{\top} \tanh(W_{va} v_i + U_h S V_h h_t^1)$

$\alpha_t = \mathrm{Softmax}(a_t)$

where $w_a$ and $W_{va}$ are learnable parameters.

The image local abstract features $v_{conv}$ are multiplied pointwise by the weights $\alpha_t$ and summed, and the weighted result is used in the second-layer language LSTM:

$\hat{v}_t = \sum_{i=1}^{L} \alpha_{i,t} v_i$

A sketch of this scene-guided attention is given after this step.
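For illustration only, a sketch of the scene-guided attention over the $L$ region features, written generically so the same module can be reused for the concept embeddings in Step 4; all layer sizes are assumptions:

```python
# Sketch of the factorized attention: a_{i,t} = w_a^T tanh(W_va v_i +
# U_h S V_h h_t), alpha_t = Softmax(a_t), output = weighted sum of inputs.
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, att_dim, scene_dim=366):
        super().__init__()
        self.W_v = nn.Linear(feat_dim, att_dim, bias=False)   # feature proj
        self.V_h = nn.Linear(hid_dim, scene_dim, bias=False)  # right factor
        self.U_h = nn.Linear(scene_dim, att_dim, bias=False)  # left factor
        self.w_a = nn.Linear(att_dim, 1, bias=False)          # score vector

    def forward(self, feats, h, v_scene):
        # feats: (B, L, feat_dim); h: (B, hid_dim); v_scene: (B, scene_dim).
        # Multiplying elementwise by v_scene applies S = diag(v_scene).
        proj_h = self.U_h(v_scene * self.V_h(h))              # U_h S V_h h
        a = self.w_a(torch.tanh(self.W_v(feats) + proj_h.unsqueeze(1)))
        alpha = torch.softmax(a.squeeze(-1), dim=-1)          # (B, L) weights
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)       # weighted feature

region_att = FactorizedAttention(feat_dim=2048, hid_dim=512, att_dim=512)
```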

Step 4, for the $C$-dimensional object semantic concepts $v_{obj}$, a normalized weight $\beta_{i,t}$ is computed from the output $h_t^1$ of the first-layer attention LSTM and the diagonalized scene semantic information $S$, with the formula:

$b_{i,t} = w_b^{\top} \tanh(W_{vb} v_{obj,i} + U_h S V_h h_t^1)$

$\beta_t = \mathrm{Softmax}(b_t)$

where $w_b$ and $W_{vb}$ are learnable parameters.

The $C$-dimensional object semantic concepts $v_{obj}$, after a dimension transformation, are multiplied pointwise by the weights $\beta_t$ to obtain the high-level object semantic concepts embedded with scene semantic information, $\hat{c}_t$, which are likewise used in the second-layer language LSTM. Since this mirrors Step 3, the attention module sketched above can simply be reused, as shown below.
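For illustration only, the concept branch can reuse the attention module sketched in Step 3; the 512-dimensional concept embedding and the placeholder inputs are assumptions:

```python
# Reuse the factorized attention for the C embedded object concepts.
import torch

concept_att = FactorizedAttention(feat_dim=512, hid_dim=512, att_dim=512)

concept_vecs = torch.randn(1, 100, 512)      # placeholder: C = 100 concepts
h1 = torch.randn(1, 512)                     # attention-LSTM hidden state
v_scene = torch.randn(1, 366).softmax(-1)    # placeholder scene vector
c_hat = concept_att(concept_vecs, h1, v_scene)  # scene-guided concept feature
```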

Step 5, after obtaining the image information processed by the attention mechanism, the second-layer language LSTM generates the current word. Its input comprises the scene-attended local abstract image features $\hat{v}_t$, the scene-attended semantic concepts $\hat{c}_t$, and the hidden state $h_t^1$ of the first-layer attention LSTM:

$x_t^2 = [\hat{v}_t, \hat{c}_t, h_t^1], \qquad p(y_t \mid y_{1:t-1}) = \mathrm{Softmax}(W_p h_t^2 + b_p)$

where $W_p$ and $b_p$ are the learnable parameters and bias.

The word probability distribution $p_2$ of the second-layer language LSTM not only participates in the optimization during training but is also sampled from during testing.

The probability distribution of the complete output sequence is the product of the conditional probability distributions:

$p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})$

One decoding step of the two-layer network is sketched below.
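For illustration only, a sketch of a single decoding step; the wiring follows the description above, while all sizes are assumptions and `region_att`/`concept_att` are the modules sketched earlier:

```python
# Sketch of one step of the two-layer decoder and the per-step word
# distribution. Sizes (E, H, Q) and layer dimensions are assumptions.
import torch
import torch.nn as nn

E, H, Q = 512, 512, 9487                 # embedding, hidden, vocabulary sizes

embed = nn.Embedding(Q, E)               # W_e, learned from random init
att_lstm = nn.LSTMCell(H + E, H)         # layer 1: top-down attention LSTM
lang_lstm = nn.LSTMCell(2048 + 512 + H, H)  # layer 2: language LSTM
out_proj = nn.Linear(H, Q)               # word-distribution head

def step(z_t, state1, state2, feats, concept_vecs, v_scene):
    x1 = torch.cat([state2[0], embed(z_t)], dim=-1)  # [h2_{t-1}, W_e z_t]
    h1, c1 = att_lstm(x1, state1)
    v_hat = region_att(feats, h1, v_scene)           # scene-guided regions
    c_hat = concept_att(concept_vecs, h1, v_scene)   # scene-guided concepts
    h2, c2 = lang_lstm(torch.cat([v_hat, c_hat, h1], dim=-1), state2)
    logp = out_proj(h2).log_softmax(dim=-1)          # log p(y_t | y_<t)
    return logp, (h1, c1), (h2, c2)

# Full-sequence log-probability: log p(y_1..T) = sum_t log p(y_t | y_<t).
```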

the long-term memory neural network is trained to predict every word of the entire description. In the inference phase, there are many ways in which a description of a given picture can be generated. One method is "sampling", based on the first probability distribution p generated1And sampling to obtain a first word. Embedding corresponding words as input to obtain a second probability distribution p2This is repeated until the end flag is sampled or the maximum length is reached. The second method is to perform a bundle search, and consider the k best generated sentences as candidates during the iteration at time t, so as to generate sentences at time t +1 and keep only the best k sentences. When decoded using a Beam search (Beam search), at least one sentence of particularly high quality (i.e., a particularly high score) will typically be included, although typically the best sentence is not the highest of the generated log-probabilities. Therefore, an additional approximation is made in order to increase the speed of training. The cluster search is chosen as the set of samples, as opposed to sampling through the entire probability distribution.

In order to verify the feasibility and superiority of the proposed model, it was evaluated on the public MS COCO dataset. A quantitative comparison with recent image automatic description methods is shown in Table 1.

TABLE 1

Model        Bleu_1  Bleu_2  Bleu_3  Bleu_4  Meteor  Rouge_L  CIDEr  SPICE
Up-Down      0.798   -       -       0.363   0.277   0.569    1.201  0.214
DA           -       -       -       0.375   0.285   0.582    1.256  0.223
This patent  0.803   0.646   0.501   0.381   0.284   0.582    1.268  0.220

It can be seen that the proposed model performs favorably across the various evaluation metrics.

In addition, FIG. 3 shows example input images, and Table 2 lists the corresponding descriptions generated by the invention.

TABLE 2

It can be seen from FIG. 3 and Table 2 that the model of the present invention achieves a significant improvement in image description, since it models the intrinsic scene information.
