Video feature extraction method and video quantization method applying same

Document No.: 987774    Publication date: 2020-11-06

Description: This technology, "Video feature extraction method and video quantization method applying same", was designed and created by 宋井宽, 郎睿敏, 朱筱苏 and 高联丽 on 2020-08-04. Its main content is as follows: the invention relates to the technical field of computer vision, in particular to a video feature extraction method and a video quantization method applying the same; it provides a video feature extraction method to solve the technical problem of effectively obtaining video features containing rich context information, and at the same time provides a video quantization method applying the video feature extraction method. The video feature extraction method comprises the following steps: extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix comprises the spatial information of each frame of sampling image and the time sequence information between the sampled frames; generating a sampling image space attention heat map and a sampling image time sequence attention heat map according to the original feature matrix; and adding and fusing the original feature matrix, the sampling image space attention heat map and the sampling image time sequence attention heat map to obtain a target feature matrix.

1. A video feature extraction method, characterized by comprising the following steps:

extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix comprises spatial information of each frame of sampling image and time sequence information between each frame of sampling image;

generating a sampling image space attention heat map and a sampling image time sequence attention heat map according to the original characteristic matrix; and

adding and fusing the original feature matrix, the sampling image space attention heat map and the sampling image time sequence attention heat map to obtain a target feature matrix.

2. The video feature extraction method of claim 1,

A) generating a sampling image space attention heat map according to the original characteristic matrix comprises the following steps:

generating a line dimension attention heat map representing the information dependency relationship between each pixel point in each frame of sampling image and all other pixel points in the same line with the pixel point according to the original characteristic matrix; and

generating a column dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points in the same column with the pixel point according to the original characteristic matrix;

and/or,

B) generating a sampling image time sequence attention heat map according to the original feature matrix comprises the following steps:

generating a time sequence dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points which are in the same time sequence with the pixel point according to the original feature matrix.

3. The video feature extraction method of claim 2, wherein:

if the original feature matrix of the target video is set as O_i ∈ R^(T'×h×w×c), where h is the height of each frame of video, w is the width of each frame of video, c is the number of channels of each frame of video, and T' is the number of sampled image frames, then

A) generating a line dimension attention heat map representing the information dependency relationship between each pixel point in each frame of sampling image and all other pixel points in the same line with the pixel point according to the original feature matrix comprises the following steps:

reshaping the original feature matrix into {T'×h}×w×c; performing a convolution operation on the reshaped matrix with three convolution kernels of size c×1×1 (channels × height × width) respectively, to obtain three feature matrices r_θ, r_ρ, r_γ each of dimension {T'×h}×w×c; and operating on the three feature matrices r_θ, r_ρ, r_γ according to the formula

r = softmax(r_ρ · r_γ^T) · r_θ

to obtain the line dimension attention heat map r, where r_γ^T is the transposed matrix of the feature matrix r_γ;

and/or,

B) generating a column dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points in the same column with the pixel point according to the original feature matrix comprises the following steps:

reshaping the original feature matrix into {T'×w}×h×c; performing a convolution operation on the reshaped matrix with three convolution kernels of size c×1×1 (channels × height × width) respectively, to obtain three feature matrices c_θ, c_ρ, c_γ each of dimension {T'×w}×h×c; and operating on the three feature matrices c_θ, c_ρ, c_γ according to the formula

c = softmax(c_ρ · c_γ^T) · c_θ

to obtain the column dimension attention heat map c, where c_γ^T is the transposed matrix of the feature matrix c_γ;

and/or,

C) generating a time sequence dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points which are in the same time sequence with the pixel point according to the original feature matrix comprises the following steps:

reshaping the original feature matrix into {w×h}×T'×c; performing a convolution operation on the reshaped matrix with three convolution kernels of size c×1×1 respectively, to obtain three feature matrices t_θ, t_ρ, t_γ each of dimension {w×h}×T'×c; and operating on the three feature matrices t_θ, t_ρ, t_γ according to the formula

t = softmax(t_ρ · t_γ^T) · t_θ

to obtain the time sequence dimension attention heat map t, where t_γ^T is the transposed matrix of the feature matrix t_γ.

4. A video quantization method, comprising:

obtaining a target feature matrix according to the video feature extraction method of claim 1, 2 or 3;

converting the target characteristic matrix into a characteristic vector representing a target video; and

compressing the feature vector into a binary code to achieve video quantization.

5. The video quantization method of claim 4, wherein transforming the target feature matrix into the feature vector representing the target video comprises:

respectively reshaping a row dimension attention heat map r, a column dimension attention heat map c and a time sequence dimension attention heat map t into T' × h × w × c;

then adding the reshaped row dimension attention heat map r, column dimension attention heat map c and time sequence dimension attention heat map t to the original feature matrix O_i, to obtain a feature matrix O'_i fused with three-dimensional attention, whose dimensions are consistent with those of the original feature matrix O_i;

thereafter, taking the feature matrix O'_i fused with three-dimensional attention as input to the three-dimensional attention model again, and obtaining, through the above calculation, a feature matrix O''_i that has undergone two rounds of three-dimensional attention fusion, whose dimensions are equal to those of O'_i, namely T' × h × w × c;

finally, performing a global average pooling operation over the T', h and w dimensions of the feature matrix O''_i that has undergone two rounds of three-dimensional attention fusion, to obtain a final feature matrix of dimension 1 × 1 × 1 × c, i.e. a c-dimensional feature vector; and taking c as D to obtain the subsequent feature vector x of D-dimensional length.

6. The video quantization method of claim 4, wherein compressing the feature vectors into binary coding to achieve video quantization comprises a process of inputting the feature vectors into a progressive feature quantization network and then outputting the binary coding from the progressive feature quantization network, wherein,

the progressive feature quantization network comprises a plurality of quantization layers, and if the feature vector is a feature vector x with a length of D dimension, each quantization layer comprises a codebook comprising M D-dimensional codewords, and each codeword in the codebook corresponds to a corresponding index;

after any quantization layer in the progressive feature quantization network receives an input vector, the quantization layer calculates the distance d between the input vector and each codeword in the codebook of that quantization layer, thereby obtaining a distance vector D consisting of the M distances; the distance vector D is then passed through a normalized exponential function to obtain a normalized distance vector P; the index of the codeword corresponding to the maximum value in the normalized distance vector P is then extracted as a first output; and the difference between the input vector and an approximation of the input vector obtained by weighting and summing the codewords in the codebook of the quantization layer with the normalized distance vector P, i.e. the quantization error of the quantization layer, is used as a second output;

the binary code is obtained by concatenating the first outputs of the quantization layers in the progressive feature quantization network, the second output of each quantization layer is used as the input vector of the next quantization layer, and the feature vector x is used as the input vector of the first quantization layer in the progressive feature quantization network.

7. The video quantization method of claim 6, wherein: the codebook of each quantization layer of the progressive feature quantization network contains 256 codewords, and the first output of each quantization layer is an 8-bit binary code.

8. The video quantization method of claim 7, wherein: the progressive feature quantization network comprises four quantization layers, and the binary code obtained by connecting the first outputs of the four quantization layers in the progressive feature quantization network is a 32-bit binary code.

Technical Field

The invention relates to the technical field of computer vision, in particular to a video feature extraction method and a video quantization method applying the same.

Background

Video retrieval is a fundamental and challenging problem in computer vision that aims to retrieve, from a vast video library, the videos most similar to an input video. Unsupervised video quantization retrieval achieves fast retrieval by compressing the visual features of original, unlabeled videos into compact binary codes.

At present, a known unsupervised video quantization retrieval method extracts visual feature information from each video frame with a convolutional neural network, processes the frame features with a recurrent neural network to obtain video features, and compresses the feature information into an extremely short binary code with a hashing algorithm, thereby reducing the size of the database and increasing the retrieval speed.

The above method has two problems. First, it is difficult for a convolutional neural network combined with a recurrent neural network to capture information over a long time range, so the context information of the video is hard to retain and good video features cannot be obtained. Second, over a large-scale video library the video features become very complex, and a hashing algorithm struggles to achieve good accuracy.

Summary of the invention

The technical problem to be solved by the invention is as follows: to provide a video feature extraction method that effectively obtains video features containing rich context information, and to provide a video quantization method applying the video feature extraction method.

The technical scheme adopted by the invention for solving the technical problems is as follows: a video feature extraction method comprises the following steps: extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix comprises spatial information of each frame of sampling image and time sequence information between each frame of sampling image; generating a sampling image space attention heat map and a sampling image time sequence attention heat map according to the original characteristic matrix; and adding and fusing the original characteristic matrix, the sampling image space attention heat map and the sampling image time sequence attention heat map to obtain a target characteristic matrix.

According to an embodiment provided by the present specification, generating a spatial attention heat map of a sampled image from a raw feature matrix includes: generating a line dimension attention heat map representing the information dependency relationship between each pixel point in each frame of sampling image and all other pixel points in the same line with the pixel point according to the original characteristic matrix; and generating a column dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points in the same column with the pixel point according to the original characteristic matrix.

According to an embodiment provided by the present specification, generating a time-series attention heat map of a sampled image from the original feature matrix includes: generating a time sequence dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points which are in the same time sequence with the pixel point according to the original feature matrix.

According to the embodiments provided in the present specification, if the original feature matrix of the target video is set as O_i ∈ R^(T'×h×w×c), where h is the height of each frame of video, w is the width of each frame of video, c is the number of channels of each frame of video, and T' is the number of sampled image frames, then generating, according to the original feature matrix, a line dimension attention heat map representing the information dependency relationship between each pixel point in each frame of sampling image and all other pixel points in the same line with the pixel point comprises the following steps: reshaping the original feature matrix into {T'×h}×w×c; performing a convolution operation on the reshaped matrix with three convolution kernels of size c×1×1 (channels × height × width) respectively, to obtain three feature matrices r_θ, r_ρ, r_γ each of dimension {T'×h}×w×c; and operating on the three feature matrices r_θ, r_ρ, r_γ according to the formula

r = softmax(r_ρ · r_γ^T) · r_θ

to obtain the line dimension attention heat map r, where r_γ^T is the transposed matrix of the feature matrix r_γ.

According to the embodiments provided in the present specification, if the original feature matrix of the target video is set as O_i ∈ R^(T'×h×w×c), where h is the height of each frame of video, w is the width of each frame of video, c is the number of channels of each frame of video, and T' is the number of sampled image frames, then generating, according to the original feature matrix, a column dimension attention heat map representing the information dependency relationship between each pixel point in each frame of sampling image and all other pixel points in the same column with the pixel point comprises: reshaping the original feature matrix into {T'×w}×h×c; performing a convolution operation on the reshaped matrix with three convolution kernels of size c×1×1 (channels × height × width) respectively, to obtain three feature matrices c_θ, c_ρ, c_γ each of dimension {T'×w}×h×c; and operating on the three feature matrices c_θ, c_ρ, c_γ according to the formula

c = softmax(c_ρ · c_γ^T) · c_θ

to obtain the column dimension attention heat map c, where c_γ^T is the transposed matrix of the feature matrix c_γ.

According to the embodiments provided in the present specification, if the original feature matrix of the target video is set as O_i ∈ R^(T'×h×w×c), where h is the height of each frame of video, w is the width of each frame of video, c is the number of channels of each frame of video, and T' is the number of sampled image frames, then generating, according to the original feature matrix, a time sequence dimension attention heat map representing the information dependence relationship between each pixel point in each frame of sampling image and all other pixel points which are in the same time sequence with the pixel point comprises the following steps: reshaping the original feature matrix into {w×h}×T'×c; performing a convolution operation on the reshaped matrix with three convolution kernels of size c×1×1 respectively, to obtain three feature matrices t_θ, t_ρ, t_γ each of dimension {w×h}×T'×c; and operating on the three feature matrices t_θ, t_ρ, t_γ according to the formula

t = softmax(t_ρ · t_γ^T) · t_θ

to obtain the time sequence dimension attention heat map t, where t_γ^T is the transposed matrix of the feature matrix t_γ.

In order to achieve the above object, according to an aspect of the embodiments provided in the present specification, a video quantization method is provided. The method comprises the following steps: obtaining a target feature matrix according to any one of the video feature extraction methods described above; converting the target feature matrix into a feature vector representing the target video; and compressing the feature vector into a binary code to achieve video quantization.

According to an embodiment provided by the present specification, converting the target feature matrix into the feature vector representing the target video includes: respectively reshaping the row dimension attention heat map r, the column dimension attention heat map c and the time sequence dimension attention heat map t into T'×h×w×c; then adding the reshaped row dimension attention heat map r, column dimension attention heat map c and time sequence dimension attention heat map t to the original feature matrix O_i, to obtain a feature matrix O'_i fused with three-dimensional attention, whose dimensions are consistent with those of the original feature matrix O_i; thereafter, taking the feature matrix O'_i fused with three-dimensional attention as input to the three-dimensional attention model again, and obtaining, through the above calculation, a feature matrix O''_i that has undergone two rounds of three-dimensional attention fusion, whose dimensions are equal to those of O'_i, namely T'×h×w×c; finally, performing a global average pooling operation over the T', h and w dimensions of O''_i respectively, to obtain a final feature matrix of dimension 1×1×1×c, i.e. a c-dimensional feature vector; taking c as D then yields the subsequent feature vector x of D-dimensional length.

According to an embodiment provided in the present specification, compressing the feature vector into a binary code to realize video quantization comprises inputting the feature vector into a progressive feature quantization network and then outputting the binary code from the progressive feature quantization network, wherein the progressive feature quantization network comprises a plurality of quantization layers; assuming the feature vector is a feature vector x of D-dimensional length, each quantization layer comprises a codebook containing M D-dimensional codewords, and each codeword in the codebook corresponds to an index; after any quantization layer in the progressive feature quantization network receives an input vector, the quantization layer calculates the distance d between the input vector and each codeword in the codebook of that quantization layer, thereby obtaining a distance vector D consisting of the M distances; the distance vector D is then passed through a normalized exponential function to obtain a normalized distance vector P; the index of the codeword corresponding to the maximum value in the normalized distance vector P is then extracted as a first output; and the difference between the input vector and an approximation of the input vector obtained by weighting and summing the codewords in the codebook of the quantization layer with the normalized distance vector P, i.e. the quantization error of the quantization layer, is used as a second output; the binary code is obtained by concatenating the first outputs of the quantization layers in the progressive feature quantization network, the second output of each quantization layer is used as the input vector of the next quantization layer, and the feature vector x is used as the input vector of the first quantization layer in the progressive feature quantization network.

According to an embodiment provided in the present specification, the codebook of each quantization layer of the progressive feature quantization network includes 256 codewords, and the first output of each quantization layer is an 8-bit binary code.

According to an embodiment provided in the present specification, the progressive feature quantization network comprises four quantization layers, and the binary code obtained by connecting the first outputs of the four quantization layers in the progressive feature quantization network is a 32-bit binary code.

The video feature extraction method can effectively obtain video features containing rich context information. On this basis, efficient and accurate video quantization can be achieved through the designed progressive feature quantization network, which in turn enables fast video retrieval.

The embodiments provided in the present specification will be further described with reference to the drawings and detailed description. Additional aspects and advantages of the embodiments provided herein will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the embodiments provided herein.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to assist in understanding the embodiments provided herein; the drawings and their accompanying description serve to explain the embodiments provided herein and are not intended to limit them. In the drawings:

Fig. 1 is an overall framework diagram of an embodiment of a video quantization method provided in this specification.

Fig. 2 is a three-dimensional self-attention module structure diagram of an embodiment of a video quantization method provided in the present specification.

Detailed Description

The embodiments provided in the present specification will be described in detail and fully with reference to the accompanying drawings. Those skilled in the art will be able to implement the embodiments provided in this specification based on these descriptions. Before the embodiments provided in this specification are explained with reference to the drawings, it should be particularly pointed out that:

the technical solutions and the technical features provided in the embodiments provided in the present specification in the following portions including the following description may be combined with each other without conflict.

In addition, the embodiments of the present invention provided in this specification and referred to in the following description are generally only some, not all, of the embodiments of the invention. Therefore, all other embodiments obtained by a person of ordinary skill in the art without creative effort based on these embodiments shall fall within the scope of protection of the embodiments provided in this specification.

Regarding the terms and units in the embodiments provided in this specification: the terms "comprising," "including," "having," and any variations thereof in the description, the claims and any related portions are intended to cover non-exclusive inclusions. Other related terms and units in the embodiments provided by this specification can be reasonably interpreted based on the related contents of this specification.

Fig. 1 is an overall framework diagram of a video quantization method provided in this specification. Fig. 2 is a structure diagram of the three-dimensional self-attention module of an embodiment of the video quantization method provided in this specification. As shown in Figs. 1-2, the video quantization method includes two parts: video feature extraction and video quantization. In the video feature extraction part, a video feature extraction module based on a three-dimensional self-attention mechanism is adopted to simultaneously acquire the temporal information and the spatial information of a video. In the video quantization part, a gradient-descent-based progressive quantization algorithm is adopted to quantize the visual features of the whole video.

First, video feature extraction

The video feature extraction comprises the following steps: extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix comprises spatial information of each frame of sampling image and time sequence information between each frame of sampling image; generating a sampling image space attention heat map and a sampling image time sequence attention heat map according to the original characteristic matrix; and adding and fusing the original characteristic matrix, the sampling image space attention heat map and the sampling image time sequence attention heat map to obtain a target characteristic matrix.

In one embodiment, a deep convolutional neural network is used to extract the original features V ∈ {0, 1, …, 255}^(N×T×H×W×C) for the entire video library, which comprises N videos, each having T frames of height H, width W and C channels. A uniform sampling strategy is adopted for each video; in this embodiment, T' = 25 frames are extracted from each video at equal intervals. This embodiment therefore obtains a reduced feature set F ∈ {0, 1, …, 255}^(N×T'×H×W×C). These feature matrices mainly contain two aspects of information: 1) the spatial information of each frame, which may be shape, position or even semantic information; and 2) the timing information between frames, such as motion information. The information of the two aspects is highly correlated. Therefore, this embodiment designs a feature module based on a three-dimensional self-attention mechanism, which obtains attention heat maps for each pixel point from both the time sequence and the spatial aspects. This process can be interpreted as calculating the influence of every other neighbouring pixel point on the currently considered pixel point. For a given pixel point, each pass of the three-dimensional self-attention mechanism computes the information dependency relationships between that pixel point and all other pixel points in the same row, the same column and the same time sequence position. Therefore, this embodiment adopts a strategy of cycling this three-dimensional self-attention mechanism to acquire global information. Specifically, for one pixel point, its relationships with all other pixel points are obtained through two iterations of the three-dimensional self-attention.
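As an illustration of the sampling and feature-matrix construction described above, a minimal sketch in PyTorch-style notation follows; the `backbone` callable, the uint8 pixel range and the tensor layouts are assumptions made for the example and are not taken from the original disclosure.

```python
import torch

def sample_frames(video: torch.Tensor, t_prime: int = 25) -> torch.Tensor:
    """Pick t_prime frames at equal intervals from a (T, H, W, C) uint8 video."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, steps=t_prime).round().long()
    return video[idx]                                  # (T', H, W, C)

def build_original_feature_matrix(video: torch.Tensor, backbone) -> torch.Tensor:
    """Build O_i of shape (T', h, w, c) with a hypothetical 2D CNN backbone."""
    frames = sample_frames(video).permute(0, 3, 1, 2).float() / 255.0   # (T', C, H, W)
    with torch.no_grad():
        feats = backbone(frames)                       # assumed to return (T', c, h, w)
    return feats.permute(0, 2, 3, 1)                   # original feature matrix O_i
```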

In implementing this three-dimensional self-attention module, the present embodiment employs three independent attention operations along three directions: row, column, and time sequence. Taking the row direction as an example, the original feature matrix O_i ∈ R^(T'×h×w×c) of each video is used as input, where h is the height of each frame of video, w is the width of each frame of video, c is the number of channels of each frame of video, and T' is the number of sampled frames. First, this embodiment reshapes the feature matrix into {T'×h}×w×c and performs a convolution operation with three convolution kernels of size c×1×1 (channels × height × width), obtaining three feature matrices r_θ, r_ρ, r_γ of the same dimension. Then r_ρ is matrix-multiplied with the transpose r_γ^T of r_γ, the result is passed through a softmax function, and finally it is multiplied with r_θ to obtain the row dimension attention heat map r, where r_γ^T is the transposed matrix of the feature matrix r_γ. The above operation can be summarized by the following formula:

r = softmax(r_ρ · r_γ^T) · r_θ
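The row-direction pass above can be sketched as follows, assuming the three c×1×1 convolutions are realized as channel-wise linear projections (an implementation assumption); the column and time sequence directions reuse the same module after the corresponding reshapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttention(nn.Module):
    """Self-attention along one axis of the reshaped feature matrix."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel-wise projections standing in for the three c x 1 x 1 convolutions.
        self.theta = nn.Linear(channels, channels, bias=False)  # r_theta (values)
        self.rho = nn.Linear(channels, channels, bias=False)    # r_rho   (queries)
        self.gamma = nn.Linear(channels, channels, bias=False)  # r_gamma (keys)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, c), e.g. B = T'*h and L = w for the row direction.
        q, k, v = self.rho(x), self.gamma(x), self.theta(x)
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)  # softmax(r_rho . r_gamma^T)
        return attn @ v                                  # heat map, shape (B, L, c)

# Row direction:           O_i.reshape(T_prime * h, w, c)
# Column direction:        O_i.permute(0, 2, 1, 3).reshape(T_prime * w, h, c)
# Time sequence direction: O_i.permute(1, 2, 0, 3).reshape(h * w, T_prime, c)
```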

Attention in the column direction performs the similar operation described above on the feature matrix of dimension {T'×w}×h×c: the original feature matrix is reshaped into {T'×w}×h×c; a convolution operation is performed on the reshaped matrix with three convolution kernels of size c×1×1 (channels × height × width) respectively, yielding three feature matrices c_θ, c_ρ, c_γ each of dimension {T'×w}×h×c; and the three feature matrices c_θ, c_ρ, c_γ are operated on according to the formula

c = softmax(c_ρ · c_γ^T) · c_θ

to obtain the column dimension attention heat map c, where c_γ^T is the transposed matrix of the feature matrix c_γ.

Attention in the time sequence direction performs the similar operation described above on the feature matrix of dimension {w×h}×T'×c: the original feature matrix is reshaped into {w×h}×T'×c; a convolution operation is performed on the reshaped matrix with three convolution kernels of size c×1×1 respectively, yielding three feature matrices t_θ, t_ρ, t_γ each of dimension {w×h}×T'×c; and the three feature matrices t_θ, t_ρ, t_γ are operated on according to the formula

t = softmax(t_ρ · t_γ^T) · t_θ

to obtain the time sequence dimension attention heat map t, where t_γ^T is the transposed matrix of the feature matrix t_γ.

Finally, the row dimension attention heat map r, the column dimension attention heat map c, the time sequence dimension attention heat map t (all matrices) and the original feature matrix are added and fused to obtain the final feature matrix.

After two passes through the three-dimensional self-attention module, the obtained feature matrix fused with three-dimensional attention is averaged to obtain a D-dimensional feature vector x representing each video, which is taken as the input of the quantization module.

Specifically, as shown in fig. 2, the row dimension attention heat map r, the column dimension attention heat map c and the time sequence dimension attention heat map t are each reshaped into T'×h×w×c; then the reshaped row dimension attention heat map r, column dimension attention heat map c and time sequence dimension attention heat map t are added to the original feature matrix O_i, giving a feature matrix O'_i fused with three-dimensional attention whose dimensions are the same as those of the original feature matrix O_i; thereafter, the feature matrix O'_i fused with three-dimensional attention is fed into the three-dimensional attention model again as input, and the above calculation yields a feature matrix O''_i that has undergone two rounds of three-dimensional attention fusion, whose dimensions equal those of O'_i, namely T'×h×w×c; finally, a global average pooling operation is performed over the T', h and w dimensions of O''_i to obtain a final feature matrix of dimension 1×1×1×c, i.e. a c-dimensional feature vector; taking c as D then gives the subsequent D-dimensional feature vector x.
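A minimal sketch of the two fusion passes and the global average pooling, reusing the hypothetical `AxialAttention` helper from the earlier sketch; the permutation orders and the sharing of attention modules across both passes are assumptions made for illustration.

```python
import torch

def fuse_three_d_attention(o: torch.Tensor, attn_row, attn_col, attn_time) -> torch.Tensor:
    """One three-dimensional attention pass: O'_i = O_i + r + c + t."""
    t_, h, w, c = o.shape
    r = attn_row(o.reshape(t_ * h, w, c)).reshape(t_, h, w, c)
    col = attn_col(o.permute(0, 2, 1, 3).reshape(t_ * w, h, c))
    col = col.reshape(t_, w, h, c).permute(0, 2, 1, 3)
    tim = attn_time(o.permute(1, 2, 0, 3).reshape(h * w, t_, c))
    tim = tim.reshape(h, w, t_, c).permute(2, 0, 1, 3)
    return o + r + col + tim                           # same shape as O_i

def video_feature_vector(o_i: torch.Tensor, attn_row, attn_col, attn_time) -> torch.Tensor:
    """Two fusion passes followed by global average pooling over T', h, w."""
    o1 = fuse_three_d_attention(o_i, attn_row, attn_col, attn_time)   # O'_i
    o2 = fuse_three_d_attention(o1, attn_row, attn_col, attn_time)    # O''_i
    return o2.mean(dim=(0, 1, 2))                      # c-dimensional feature vector x
```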

Second, video quantization

Video quantization compresses the feature vector into a binary code. In one embodiment, compressing the feature vector into a binary code to achieve video quantization comprises inputting the feature vector into a progressive feature quantization network and outputting the binary code from the progressive feature quantization network, wherein the progressive feature quantization network comprises a plurality of quantization layers; assuming the feature vector is a feature vector x of D-dimensional length, each quantization layer comprises a codebook containing M D-dimensional codewords, and each codeword in the codebook corresponds to an index; after any quantization layer in the progressive feature quantization network receives an input vector, the quantization layer calculates the distance d between the input vector and each codeword in the codebook of that quantization layer, thereby obtaining a distance vector D consisting of the M distances; the distance vector D is then passed through a normalized exponential function (softmax function) to obtain a normalized distance vector P; the index of the codeword corresponding to the maximum value in the normalized distance vector P is then extracted as a first output; and the difference between the input vector and an approximation of the input vector obtained by weighting and summing the codewords in the codebook of the quantization layer with the normalized distance vector P, i.e. the quantization error of the quantization layer, is used as a second output; the binary code is obtained by concatenating the first outputs of the quantization layers in the progressive feature quantization network, the second output of each quantization layer is used as the input vector of the next quantization layer, and the feature vector x is used as the input vector of the first quantization layer in the progressive feature quantization network.
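A single quantization layer as described above might be sketched as follows. One assumption to note: the softmax is applied to the negated distances so that the maximum of P selects the nearest codeword; the codebook size M and dimension D are illustrative parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizationLayer(nn.Module):
    """One layer of the progressive feature quantization network (sketch)."""

    def __init__(self, num_codewords: int = 256, dim: int = 512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codewords, dim))   # M x D codewords

    def forward(self, x: torch.Tensor):
        # x: (D,) input vector.
        d = torch.cdist(x.unsqueeze(0), self.codebook).squeeze(0)  # distances to M codewords
        p = F.softmax(-d, dim=0)    # normalized distance vector P (negated: nearest -> largest)
        index = torch.argmax(p)     # first output: index of the selected codeword
        approx = p @ self.codebook  # soft approximation of x from the codebook
        residual = x - approx       # second output: quantization error of this layer
        return index, residual
```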

Thus, each quantization layer passes its quantization error to the next quantization layer as input, so that the error can be further reduced by quantization and the quantized outputs of the layers gradually approximate the feature vector x. When the required quantization precision is not high, only the quantization code of the first layer is used; when the precision requirement rises, the quantization codes of the first and second layers are used together. As the number of quantization layers increases, the quantization precision improves step by step, embodying a progressive process.

In one embodiment, the codebook of each quantization layer of the progressive feature quantization network contains 256 codewords, and the first output of each quantization layer is an 8-bit binary code. Meanwhile, the progressive feature quantization network comprises four quantization layers, and the binary code is a 32-bit binary code obtained by connecting first outputs of the four quantization layers in the progressive feature quantization network.
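Assembling the 32-bit code from four such layers could then look like the following sketch, which reuses the hypothetical `QuantizationLayer` above; each layer contributes an 8-bit index and its residual feeds the next layer, so using only the first k layers yields a shorter, coarser code.

```python
import torch

def progressive_quantize(x: torch.Tensor, layers) -> str:
    """Concatenate the per-layer 8-bit codes into one progressive binary code."""
    bits, residual = [], x
    for layer in layers:
        index, residual = layer(residual)          # quantization error feeds the next layer
        bits.append(format(int(index), "08b"))     # 8 bits per 256-codeword codebook
    return "".join(bits)                           # 32 bits for four layers

# Illustrative usage (dimensions are assumptions):
# layers = [QuantizationLayer(256, 512) for _ in range(4)]
# code = progressive_quantize(torch.randn(512), layers)   # e.g. '01011010...'
```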

The technical effects of the above embodiment are as follows:

1) This embodiment designs a novel three-dimensional self-attention module that obtains temporal and spatial context information simultaneously. After the original feature matrix passes through the three-dimensional self-attention module once, only the relation heat of the other pixel points in the same row, the same column and the same time sequence position as each pixel point is computed for that pixel point. The feature matrix fused with three-dimensional self-attention obtained from the first pass is then fed into the three-dimensional self-attention module again, so that for a specific pixel point the relations with all other pixel points can be obtained, acquiring more global information.

2) A quantization algorithm is introduced into the video retrieval task for the first time: a carefully designed gradient-descent-based deep quantization algorithm is adopted, the video features are quantized into extremely short binary codes, and a progressive quantization method is realized.

3) Extensive experimental results show that the video quantization method based on the three-dimensional self-attention mechanism proposed in this embodiment outperforms the latest video hashing algorithms, especially on the challenging FCVID data set. On FCVID, with 64-bit codes, the mAP@5 of this embodiment reaches 51.1%, which is 6.1 points higher than the same metric of the best existing method (45%).

The contents of the embodiments provided in the present specification are explained above. Those skilled in the art will be able to implement the embodiments provided in this specification based on these descriptions. Based on the above description of the embodiments provided in the present specification, all other embodiments obtained by those skilled in the art without any creative effort should fall within the protection scope of the embodiments provided in the present specification.
