Multilevel image compression method using Transformer

Document No.: 1893547    Publication date: 2021-11-26

Reading note: This technology, "一种使用Transformer的多层次图像压缩方法" (Multilevel image compression method using Transformer), was designed and created by 刘美琴, 梁甲名, 林春雨, 白慧慧 and 赵耀 on 2021-09-27. Its main content is as follows: The invention discloses a multi-level image compression method using a Transformer, a multi-level image compression framework built mainly from Transformer modules and assisted by convolutional layers. The Transformer module comprises multi-layer encoder components and decoder components; encoder components are used at the encoding end and decoder components at the decoding end. The decoder has a cross-attention mechanism, which jointly computes the self-attention features of the decoder input and the self-attention features of the encoder, making full use of the features learned by the encoding end of the compression framework. The invention retains the decoder component of the Transformer and its cross-attention mechanism and applies it at the decoding end to fully exploit the features learned by the encoding end, achieving better results. The framework of the invention also has lower hardware requirements.

1. A multilevel image compression method using a Transformer, characterized in that the method is a multi-level image compression framework built mainly from Transformer modules and assisted by convolutional layers; the framework uses the convolutional layers to reduce the resolution of the feature map and extract local features while taking the global feature distribution into account; the Transformer module comprises multi-layer encoder components and decoder components, with encoder components adopted at the encoding end and decoder components at the decoding end; the decoder has a cross-attention mechanism, which jointly computes the self-attention features of the decoder input and the self-attention features of the encoder so as to fully utilize the features learned by the encoding end of the compression framework.

2. The method of claim 1, wherein the encoding end and the decoding end of the image compression framework are symmetric hierarchical structures, each layer being called a stage, and each stage comprises an up-/down-sampling module, a Transformer module and the necessary feature transformations;

the up-/down-sampling module of each stage is implemented with a convolutional layer; stage-wise up-/down-sampling suits the image compression task, where the input image resolution is large while the extracted feature resolution is small; and a GDN/IGDN layer applies a nonlinear transformation to the convolved features so that the feature distribution is more favorable for compression.

3. The method of claim 1 or 2, wherein the image compression framework comprises:

the input image data x is first passed through the encoding end of the compression framework to obtain the latent features y of the image; the encoding part of the super-prior module then processes y to extract the side information z; z is first quantized to ẑ and then reconstructed into features by the decoding part of the super-prior module, and these features are concatenated with the mask-processed ŷ and input together into the context prediction module to predict the probability distribution of ŷ, where the probability model adopts a Gaussian mixture model consisting of three sub-Gaussian distributions; during training, y is directly quantized to ŷ and the reconstruction of the original image x̂ is obtained through the decoding end of the compression framework; during testing, ŷ is reconstructed autoregressively based on its predicted probability distribution, and the original image x̂ is then reconstructed.

4. The method of claim 2, wherein the feature transformation comprises two operations: dimension conversion and block stretching; the dimension conversion adapts to the difference in data dimensionality when the Transformer is transferred from the natural language processing field to the computer vision field, and the block stretching solves the out-of-memory errors caused by the excessive video-memory occupation of the attention mechanism in the Transformer.

5. The method of claim 4, wherein the operation of block stretching occurs after the up-/down-sampling convolution and before the dimension conversion; since the product of the height and width of an image feature map is very large compared with the length of a text sequence, performing block stretching and then dimension conversion before input to the Transformer greatly reduces the video-memory occupation of the attention mechanism, which is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q, K and V in the formula respectively denote the query, key and value vectors, which are generated from the input features by three different fully connected layers, and d_k is the single-head attention dimension; the part of the attention mechanism that occupies the most video memory is the attention matrix obtained from the dot product of Q and K, i.e. QK^T in the formula; the shapes of Q and K are the same as the Transformer input, so assuming the feature map after convolution has shape (B, H, W, C), if it is directly dimension-converted to (B, H×W, C) and fed into the Transformer, the shape of the QK^T product is (B, N, N) with N = H×W.

6. The method of claim 1, wherein the encoder component comprises two modules: a multi-head self-attention module and a feedforward neural network module, each designed with a residual connection and a layer normalization operation;

the multi-head self-attention module is designed into a multi-head form on the basis of an attention mechanism, and the formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

head_i denotes single-head attention and W_i denotes the weight of the corresponding fully connected layer; the formulas show that multi-head attention splits Q, K and V into several groups along the channel dimension when they pass through the fully connected layers, each group performs its own attention calculation, and the results are concatenated along the channel dimension for the subsequent operations;

the feedforward neural network module comprises two fully connected layers with a ReLU activation layer between them; its expression is as follows:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

Through this series of calculations, the encoder fully captures global information that complements the local features extracted by the convolutional layers at each stage, so the features are computed more accurately; at the same time, the attention mechanism reassigns weights to the features, so that their distribution is processed into a form more favorable to the task objective.

7. The method of claim 1, wherein the decoder component likewise comprises two modules: a multi-head cross-attention module and a feedforward neural network module; the decoder component uses cross attention in place of the self-attention mechanism: K and V of the cross attention are generated from the feature m learned by the encoding end, cross attention is computed between Q, obtained from the decoder input y, and K, obtained from the encoder output m, and the resulting attention weights are applied to V; this makes full use of the rich information learned by the encoding end and thereby improves performance.

8. The method of claim 3, wherein the super-prior module comprises an encoding part, a decoding part and a context prediction part; like the encoding and decoding ends of the compression framework, the super-prior encoding and decoding parts use the hierarchical Transformer structure and the same encoder and decoder components to realize the extraction of the side information.

9. The method of claim 8, wherein the context prediction part comprises mask processing implemented by a masked convolution component, a context prediction module, and a Gaussian mixture model (GMM) modeling module; the mask-processed features are concatenated with the information that the super-prior encoding/decoding part has learned from the features and input together into the context prediction module to learn the probability distribution of the features and output the related parameters; the context prediction module is implemented with convolutional layers of increasing channel number, and the probability modeling adopts a Gaussian mixture model with three sub-Gaussians, so the output parameters are three groups of means and variances together with three corresponding weights;

and probability modeling is performed on the feature y using the predicted probability parameters, with the following modeling formula:

p(ŷ) = ( Σ_{k=1}^{K} w^(k) · N(μ^(k), σ^(2,k)) ) * U(-1/2, 1/2)

In the formula, K = 3 denotes the three sub-Gaussian distributions; w, μ and σ² are respectively the weight, mean and variance of the corresponding sub-Gaussian distribution; U(-1/2, 1/2) is uniform noise used to simulate quantization, and * denotes convolution with it.

Technical Field

The application relates to the field of image processing in computer technology, and in particular to a multi-level image compression method using a Transformer.

Background

With the rapid development of the internet and digital media and the arrival of the big data era, a great amount of image data is generated, stored and transmitted every day on the internet, and the data occupies a large space and bandwidth. Image compression algorithms have evolved in order to achieve more efficient storage and transmission of image data. Image compression aims at achieving efficient compression of image data by removing redundant information in the image data to reduce the amount of data required to represent a digital image, which is also one of the basic research issues common in the field of image processing.

In earlier times, some classical traditional image compression algorithms were developed, such as the discrete cosine transform (DCT), vector quantization, differential pulse code modulation, run-length coding, Huffman coding, arithmetic coding, etc. Using these algorithms, traditional image compression frameworks achieve good compression results, for example the classical JPEG standard [G.K. Wallace, "The JPEG still picture compression standard", IEEE Trans. on Consumer Electronics, vol.38, no.1, pp.43-59, Feb. 1991], and later JPEG2000 [M. Rabbani, R. Joshi, "An overview of the JPEG 2000 still image compression standard", Signal Processing: Image Communication, no.1, pp.3-48, Jan. 2002], HEVC intra-frame coding [G.J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard", IEEE Trans. on Circuits and Systems for Video Technology, vol.22, no.12, pp.1649-1668, Dec. 2012], and VVC intra-frame coding [G.J. Sullivan and J.-R. Ohm, "Versatile video coding - towards the next generation of video compression", Picture Coding Symposium, Jun. 2018], whose performance improved step by step and which are widely used in the internet, industry and other fields.

Later, with the development of deep learning, many image compression frameworks based on convolutional neural networks were proposed and, in terms of performance, surpassed the traditional compression algorithms. The earliest end-to-end image compression framework, bls2017 [Ballé J, Laparra V, Simoncelli E P. End-to-end optimized image compression [J]. arXiv preprint arXiv:1611.01704, 2016], was proposed by the Google team in 2017, and later deep-learning-based image compression frameworks were improvements on this basis. The framework adopts an autoencoder [P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders", Intl. Conf. on Machine Learning (ICML), pp.1096-1103, July 5-9, 2008] style and uses a convolutional neural network to replace the transform part of the compression framework so as to learn features of the image data that are favorable to quantization and encoding; the same work also proposed the GDN/IGDN method, which applies divisive normalization to pixel-domain features, and the distribution of the processed features is more favorable to the compression task than that obtained with other nonlinear methods. In 2018, the Google team proposed an image compression framework using a super prior, bmshj2018 [Ballé J, Minnen D, Singh S, et al. Variational image compression with a scale hyperprior [J]. arXiv preprint arXiv:1802.01436, 2018]; by improving the entropy coding part, this work learns and predicts the distribution of the features and models it as a Gaussian distribution, and a super-prior module is added to the encoding process, making the reconstruction of the extracted features more accurate, reducing the amount of information, and improving performance. In the same year, Google improved the entropy coding part again and proposed a hierarchical super-prior framework with a context module, mbt2018 [Minnen D, Ballé J, Toderici G. Joint autoregressive and hierarchical priors for learned image compression [J]. arXiv preprint arXiv:1809.02736, 2018]; although the context module is autoregressive, which inevitably slows down execution, it further improves the accuracy of feature reconstruction and thus framework performance. In 2020, a group at Waseda University proposed a framework based on a Gaussian mixture model [Cheng Z, Sun H, Takeuchi M, et al. Learned image compression with discretized Gaussian mixture likelihoods and attention modules [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 7939-7948], which improves the single Gaussian distribution used in the super prior to a Gaussian mixture containing three sub-Gaussian distributions, further improving the accuracy of feature estimation. Besides these, many other works improve the compression framework from other aspects. For example, the work [Lee J, Cho S, Kim M. An end-to-end joint learning scheme of image compression and quality enhancement with improved entropy minimization [J]. arXiv preprint arXiv:1912.12817, 2019] adds a GRDN denoising model after the decoding end of the framework as a post-processing module and adjusts the training strategy to improve the quality of the reconstructed image; the work [Zhou, A. Nakagawa, K. Kato, S. Wen, K. Kazui and Z. Tan, "Variable Rate Image Compression Method with Dead-zone Quantizer", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp.624-628, doi:10.1109/CVPRW50498.2020.00089] improves the quantization part of the compression framework and approximately achieves variable bit rate (image compression yields different reconstruction quality at different bit rates, and usually one model is trained for each bit rate); the work [Zhang R, Isola P, Efros A, et al. The unreasonable effectiveness of deep features as a perceptual metric [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 586-595] reconsiders the evaluation criteria of image compression frameworks and improves the model from the perspective of subjective visual perception; and so on.

In 2021, migrating the Transformer [Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. 2017: 5998-6008], which originated in natural language processing, to the field of computer vision became a trend. Initially, IPT [Chen H, Wang Y, Guo T, et al. Pre-trained image processing transformer [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 12299-12310] and ViT [Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [J]. arXiv preprint arXiv:2010.11929, 2020] demonstrated through extensive experiments that Transformers can be used to solve vision tasks. Later, CvT [Wu H, Xiao B, Codella N, et al. CvT: Introducing convolutions to vision transformers [J]. arXiv preprint arXiv:2103.15808, 2021], PVT [Wang W, Xie E, Li X, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions [J]. arXiv preprint arXiv:2102.12122, 2021] and Swin Transformer [Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical vision transformer using shifted windows [J]. arXiv preprint arXiv:2103.14030, 2021] introduced convolutional and hierarchical designs that adapt the Transformer to high-resolution visual inputs. Uformer [Wang Z, Cun X, Bao J, et al. Uformer: A general U-shaped Transformer for image restoration [J]. arXiv preprint arXiv:2106.03106, 2021], on the basis of a hierarchical architecture, imitates the skip-connection design of UNet and achieves better image reconstruction by reusing features.

Among these Transformer works, IPT, Uformer and the like address tasks such as image denoising and image super-resolution, while ViT, CvT, PVT, Swin Transformer and the like address high-level vision tasks such as image classification; however, no work has used a Transformer to realize the image compression task, and because of the differences between image coding and the above computer vision tasks, realizing image compression with a Transformer faces many additional problems.

Existing deep-learning-based image compression frameworks are realized with convolutional neural networks, but convolutions operate within local receptive fields, so they cannot learn long-range or global information outside a local area well. The Transformer has an excellent ability to learn global information, and its attention mechanism also amounts to assigning weights to the feature values, so that the distribution of the processed features can be more beneficial to the compression task. Therefore, the invention realizes a multi-level image compression framework built mainly from Transformer modules and assisted by convolutional layers; the framework uses convolution to reduce the resolution of the feature map and extract local features while taking the global feature distribution into account, thereby achieving excellent image compression performance.

Recently, many efforts to migrate the Transformer to the field of computer vision have emerged. The Transformer includes two components, an encoder and a decoder; the decoder differs from the encoder in that it has a cross-attention mechanism, which jointly computes the self-attention features of the decoder input and the self-attention features of the encoder so as to fully utilize what the encoding side has learned. The above-mentioned works that draw on the Transformer, however, all discard the decoder component.

Disclosure of Invention

The invention aims to provide a multi-level image compression method using a Transformer, which retains the decoder component and applies it at the decoding end of the compression framework so that its cross-attention mechanism can fully utilize the features learned by the encoding end of the compression framework.

In order to realize the purpose of the invention, the technical scheme provided by the invention is as follows: a multilevel image compression method using a Transformer is a multi-level image compression framework built mainly from Transformer modules and assisted by convolutional layers; the framework uses the convolutional layers to reduce the resolution of the feature map and extract local features while taking the global feature distribution into account; the Transformer module comprises multi-layer encoder components and decoder components, with encoder components adopted at the encoding end and decoder components at the decoding end; the decoder has a cross-attention mechanism, which jointly computes the self-attention features of the decoder input and the self-attention features of the encoder so as to fully utilize the features learned by the encoding end of the compression framework.

The encoding end and the decoding end of the image compression framework are symmetric hierarchical structures; each layer is called a stage, and each stage comprises an up-/down-sampling module, a Transformer module and the necessary feature transformations;

the up-/down-sampling module of each stage is implemented with a convolutional layer; stage-wise up-/down-sampling suits the image compression task, where the input image resolution is large while the extracted feature resolution is small; and a GDN/IGDN layer applies a nonlinear transformation to the convolved features so that the feature distribution is more favorable for compression.

The image compression framework comprises:

the input image data x is first passed through the encoding end of the compression framework to obtain the latent features y of the image; the encoding part of the super-prior module then processes y to extract the side information z; z is first quantized to ẑ and then reconstructed into features by the decoding part of the super-prior module, and these features are concatenated with the mask-processed ŷ and input together into the context prediction module to predict the probability distribution of ŷ, where the probability model adopts a Gaussian mixture model consisting of three sub-Gaussian distributions; during training, y is directly quantized to ŷ and the reconstruction of the original image x̂ is obtained through the decoding end of the compression framework; during testing, ŷ is reconstructed autoregressively based on its predicted probability distribution, and the original image x̂ is then reconstructed.

The feature transformation includes two operations: dimension conversion and block stretching. Dimension conversion adapts to the difference in data dimensionality when the Transformer is transferred from the natural language processing field to the computer vision field, and block stretching solves the out-of-memory errors caused by the excessive video-memory occupation of the attention mechanism in the Transformer.

The operation of block stretching occurs after the up-/down-sampling convolution and before the dimension conversion; since the product of the height and width of an image feature map is very large compared with the length of a text sequence, performing block stretching and then dimension conversion before input to the Transformer greatly reduces the video-memory occupation of the attention mechanism, whose calculation formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q, K and V in the formula respectively denote the query, key and value vectors, which are generated from the input features by three different fully connected layers, and d_k is the single-head attention dimension; the part of the attention mechanism that occupies the most video memory is the attention matrix obtained from the dot product of Q and K, i.e. QK^T in the formula; the shapes of Q and K are the same as the Transformer input, so assuming the feature map after convolution has shape (B, H, W, C), if it is directly dimension-converted to (B, H×W, C) and fed into the Transformer, the shape of the QK^T product is (B, N, N) with N = H×W.

The encoder component includes two modules: a multi-head self-attention module and a feedforward neural network module; each module is designed with a residual connection and a layer normalization operation;

the multi-head self-attention module is designed into a multi-head form on the basis of an attention mechanism, and the formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

head_i denotes single-head attention and W_i denotes the weight of the corresponding fully connected layer; the formulas show that multi-head attention splits Q, K and V into several groups along the channel dimension when they pass through the fully connected layers, each group performs its own attention calculation, and the results are concatenated along the channel dimension for the subsequent operations;

the feedforward neural network module comprises two fully connected layers with a ReLU activation layer between them; its expression is as follows:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

Through this series of calculations, the encoder fully captures global information that complements the local features extracted by the convolutional layers at each stage, so the features are computed more accurately; at the same time, the attention mechanism reassigns weights to the features, so that their distribution is processed into a form more favorable to the task objective.

The decoder component likewise includes two modules: a multi-head cross-attention module and a feedforward neural network module. The decoder component uses cross attention in place of the self-attention mechanism: K and V of the cross attention are generated from the feature m learned by the encoding end; cross attention is computed between Q, obtained from the decoder input y, and K, obtained from the encoder output m, and the resulting attention weights are applied to V. This makes full use of the rich information learned by the encoding end and thereby improves performance.

The super-prior module comprises an encoding part, a decoding part and a context prediction part; like the encoding and decoding ends of the compression framework, the super-prior encoding and decoding parts use the hierarchical Transformer structure and the same encoder and decoder components to realize the extraction of the side information.

The context prediction part comprises mask processing implemented by a masked convolution component, a context prediction module, and a Gaussian mixture model (GMM) modeling module; the mask-processed features are concatenated with the information that the super-prior encoding/decoding part has learned from the features and input together into the context prediction module to learn the probability distribution of the features and output the related parameters; the context prediction module is implemented with convolutional layers of increasing channel number, and the probability modeling adopts a Gaussian mixture model with three sub-Gaussians, so the output parameters are three groups of means and variances together with three corresponding weights;

and probability modeling is performed on the feature y using the predicted probability parameters, with the following modeling formula:

p(ŷ) = ( Σ_{k=1}^{K} w^(k) · N(μ^(k), σ^(2,k)) ) * U(-1/2, 1/2)

In the formula, K = 3 denotes the three sub-Gaussian distributions; w, μ and σ² are respectively the weight, mean and variance of the corresponding sub-Gaussian distribution; U(-1/2, 1/2) is uniform noise used to simulate quantization, and * denotes convolution with it.

The invention has the beneficial effects that:

Most recent work migrating the Transformer to computer vision has high hardware requirements, generally needing around 32 GB of video memory, whereas the framework of the invention can be fully trained on a single NVIDIA GeForce RTX 2080Ti graphics card with 11 GB of video memory.

Compared with existing deep-learning compression frameworks that use convolutional neural networks, the invention realizes image compression by largely replacing the conventional convolutional neural network with a Transformer. The Transformer has a superior ability to learn global information, which is precisely a limitation of current frameworks using convolutional neural networks. Meanwhile, the attention mechanism of the Transformer can reassign weights to the features, so that their distribution is more favorable to the task. Therefore, the invention realizes a multi-level image compression framework built mainly from Transformer modules and assisted by convolutional layers; the framework uses convolution to reduce the resolution of the feature map and extract local features while taking the global feature distribution into account, thereby achieving excellent image compression performance.

Compared with existing work migrating the Transformer to other vision tasks, the invention retains the decoder component of the Transformer and its cross-attention mechanism and applies it at the decoding end to fully utilize the features learned by the encoding end, achieving better results. Moreover, the framework of the invention requires less hardware.

Drawings

FIG. 1 is an overall framework diagram of the invention;

FIG. 2 is a block diagram of the encoder component;

FIG. 3 is a block diagram of the decoder component;

FIG. 4 is a schematic diagram of mask convolution;

FIG. 5 is a rate-distortion curve on the Kodak dataset with PSNR as the evaluation criterion.

Detailed Description

The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

The invention provides an end-to-end multilevel image compression method implemented by using a Transformer, and an overall framework diagram is shown in FIG. 1.

Input image data x is first passed through the encoding end of the compression framework to obtain the latent features y of the image; the encoding part of the super-prior module then processes y to extract the side information z; z is first quantized to ẑ and then reconstructed into features by the decoding part of the super-prior module, and these features are concatenated with the mask-processed ŷ and input together into the context prediction module to predict the probability distribution of ŷ, where the probability model adopts a Gaussian mixture model consisting of three sub-Gaussian distributions. During training, y is directly quantized to ŷ and the reconstruction of the original image x̂ is obtained through the decoding end of the compression framework; during testing, ŷ is reconstructed autoregressively based on its predicted probability distribution, and the original image x̂ is then reconstructed.

X.1 Encoding end and decoding end

The encoding end and the decoding end of the compression framework of the invention are symmetric hierarchical structures; each layer is called a stage, and each stage comprises an up-/down-sampling module, a Transformer module and the necessary feature transformations.

The up-/down-sampling module of each stage is implemented with a convolutional layer; stage-wise up-/down-sampling suits the image compression task, where the input image resolution is large while the extracted feature resolution is small. A GDN/IGDN layer applies a nonlinear transformation to the convolved features so that the feature distribution is more favorable for compression. The Transformer module is composed of multi-layer encoder/decoder calculation components: encoder components are adopted at the encoding end and decoder components at the decoding end; both are adaptively fine-tuned versions of the structures in the original Transformer and are the most important parts of the framework for feature processing. The necessary feature transformation includes a change of feature dimensions and a block stretching operation on the features.
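For illustration, the following is a minimal PyTorch sketch of a downsampling stage front-end of the kind described above: a stride-2 convolution followed by a simplified GDN layer. The channel sizes, kernel size and GDN parameterization are assumptions for the example and are not specified by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Simplified GDN: y_c = x_c / sqrt(beta_c + sum_k gamma_ck * x_k^2);
    the inverse (IGDN) multiplies instead of divides. A sketch only."""
    def __init__(self, channels: int, inverse: bool = False):
        super().__init__()
        self.inverse = inverse
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):                                   # x: (B, C, H, W)
        # 1x1 convolution with non-negative weights pools the squared channels
        gamma = F.relu(self.gamma).view(*self.gamma.shape, 1, 1)
        norm = torch.sqrt(F.conv2d(x * x, gamma, bias=F.relu(self.beta)).clamp(min=1e-6))
        return x * norm if self.inverse else x / norm

# a downsampling stage front-end as described above: stride-2 convolution + GDN
stage_front = nn.Sequential(nn.Conv2d(3, 128, kernel_size=3, stride=2, padding=1),
                            SimpleGDN(128))
out = stage_front(torch.randn(1, 3, 256, 256))              # -> (1, 128, 128, 128)
```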

Next, the feature transformation, the encoder component, and the decoder component are described.

X.1.1 Feature transformation

The feature transformation includes two operations: dimension conversion and block stretching. The former adapts to the difference in data dimensionality when the Transformer migrates from the natural language processing field to the computer vision field; the latter solves the out-of-memory errors caused by the excessive video-memory occupation of the attention mechanism in the Transformer.

The first problem faced when migrating a Transformer from natural language processing to computer vision is that the data dimensions are inconsistent: data in natural language processing are one-dimensional text sequences of shape (B, N, T), while data in computer vision are two-dimensional images of shape (B, H, W, C), where B is the training batch size, i.e. the number of samples, N is the sequence length, i.e. the number of words in the sequence, T is the embedding dimension of each word vector, H and W are the height and width of the image or feature map, and C is the number of channels at each position, i.e. the dimension of the feature vector at each position. Therefore, to apply the Transformer to two-dimensional feature data, the image is "textualized": like many other works, the invention treats the image as a sequence of blocks, i.e. a feature map of height H and width W has N blocks, each block being processed into a feature vector; in this case N = H × W and T = C. The selection of blocks is realized by the convolutional layer of each stage, so the blocks here are overlapping blocks of the convolution kernel size and can also be regarded as "visual words".
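A minimal sketch of the dimension conversion described above, treating each spatial position of the feature map as a "visual word" (the feature sizes are arbitrary example values):

```python
import torch

B, H, W, C = 2, 32, 48, 192           # feature map after a stage's convolution (example sizes)
feat = torch.randn(B, H, W, C)

# dimension conversion: 2-D feature map -> 1-D "text-like" sequence of visual words
seq = feat.reshape(B, H * W, C)        # (B, N, T) with N = H*W and T = C

# ...Transformer processing on `seq` would happen here...

# inverse conversion back to the spatial layout for the next convolutional stage
feat_back = seq.reshape(B, H, W, C)
assert torch.equal(feat, feat_back)
```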

The operation of block stretching occurs after the up-/down-sampling convolution and before the dimension conversion; since the product of the height and width of an image feature map is very large compared with the length of a text sequence, performing block stretching and then dimension conversion before input to the Transformer greatly reduces the video-memory occupation of the attention mechanism, whose calculation formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q, K and V in the formula respectively denote the query, key and value vectors, which are generated from the input features by three different fully connected layers, and d_k is the single-head attention dimension. The part of the attention mechanism that occupies the most video memory is the attention matrix obtained from the dot product of Q and K, i.e. QK^T in the formula. The shapes of Q and K are the same as the Transformer input; assuming the feature map after convolution has shape (B, H, W, C), if it is directly dimension-converted to (B, H×W, C) and fed into the Transformer, the shape of the QK^T product is (B, N, N) with N = H×W. The block stretching operation divides the feature map into non-overlapping small blocks (note that this block concept differs from the one used for dimension conversion above). The block size adopted in the invention is 2×2: block stretching deforms each 2×2×C feature block into a 1×1×4C feature vector, so the feature map becomes (B, H/2, W/2, 4C) and, after dimension conversion, enters the Transformer as (B, (H×W)/4, 4C). At this point the shape of the QK^T product is (B, N, N) with N = (H×W)/4, and the video memory occupation is reduced to 1/4 of that without block stretching, thereby solving the program's out-of-memory problem.
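The following sketch illustrates the 2×2 block stretching followed by dimension conversion (shapes are example values; only the tensor reshaping is shown, not the framework's actual modules):

```python
import torch

B, H, W, C = 1, 64, 64, 128
feat = torch.randn(B, H, W, C)

# block stretching: each non-overlapping 2x2xC block becomes one 1x1x4C vector
blocks = feat.reshape(B, H // 2, 2, W // 2, 2, C)        # split H and W into 2x2 blocks
blocks = blocks.permute(0, 1, 3, 2, 4, 5)                # (B, H/2, W/2, 2, 2, C)
stretched = blocks.reshape(B, H // 2, W // 2, 4 * C)     # (B, H/2, W/2, 4C)

# dimension conversion before entering the Transformer
seq = stretched.reshape(B, (H * W) // 4, 4 * C)          # (B, N, 4C) with N = H*W/4

# without block stretching the sequence length would be N = H*W and the
# QK^T attention matrix would have shape (B, H*W, H*W);
# with block stretching it has shape (B, H*W/4, H*W/4)
print(seq.shape)
```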

X.1.2 Encoder component

FIG. 2 is a block diagram of the encoder component. As shown in FIG. 2, the Transformer module at each stage of the encoding end of the compression framework contains multiple encoder components, which are the important parts for feature processing. The encoder component includes two modules: a multi-head self-attention module (MSA) and a feedforward neural network module (MLP), each designed with a residual connection and a layer normalization operation.

The multi-head self-attention module is designed into a multi-head form on the basis of an attention mechanism, and the formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

head_i denotes single-head attention and W_i denotes the weight of the corresponding fully connected layer; the formulas show that multi-head attention splits Q, K and V into several groups along the channel dimension when they pass through the fully connected layers, each group performs its own attention calculation, and the results are concatenated along the channel dimension for the subsequent operations. Multi-head attention enables parallel computation, and the attention of different heads can capture information from different representation subspaces at different positions.

The feedforward neural network module comprises two fully connected layers with a ReLU activation layer between them; its expression is as follows:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

Through this series of calculations, the encoder fully captures global information that complements the local features extracted by the convolutional layers at each stage, realizing more accurate computation of the features. At the same time, the attention mechanism reassigns weights to the features, so that their distribution is processed into a form more favorable to the task objective.
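As an illustration, the following is a minimal PyTorch sketch of an encoder component as described above: a multi-head self-attention module and a feedforward module, each wrapped in a residual connection with layer normalization. The embedding width, number of heads, MLP ratio and the pre-norm placement are assumptions for the example.

```python
import torch
import torch.nn as nn

class EncoderComponent(nn.Module):
    """Multi-head self-attention + feedforward module, each with a residual
    connection and layer normalization (a sketch; exact widths and norm
    placement are assumptions)."""
    def __init__(self, dim: int = 192, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                     # FFN(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(dim, mlp_ratio * dim),
            nn.ReLU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                             # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]   # residual around self-attention
        x = x + self.mlp(self.norm2(x))                     # residual around the FFN
        return x

y = EncoderComponent()(torch.randn(2, 1024, 192))     # e.g. N = 32*32 visual words
```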

X.1.3 Decoder component

The Transformer module at each stage of the decoding end of the compression framework contains multiple decoder components, whose internal structure is shown in FIG. 3. The decoder component also includes two modules: a multi-head cross-attention module (MCA) and a feedforward neural network module (MLP); it differs from the encoder component in that cross attention is used instead of the self-attention mechanism.

Compared with self attention, the K and V of cross attention are no longer generated from the decoder input y but from the feature m learned by the encoding end; cross attention is computed between Q, obtained from the decoder input y, and K, obtained from the encoder output m, and the resulting attention weights are applied to V. In this way the rich information learned by the encoding end can be utilized more fully, improving performance. In the invention, m is taken from the output of the encoding end and equals y at the first stage; because each stage of the decoding end performs upsampling, m is upsampled accordingly, which is realized with a 1×1 convolution on a bypass branch.
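A corresponding sketch of a decoder component with multi-head cross attention, where Q is produced from the decoder input y and K and V from the encoder-side feature m; the bypass 1×1 convolution that aligns m at each stage is omitted here, and the widths are example assumptions.

```python
import torch
import torch.nn as nn

class DecoderComponent(nn.Module):
    """Multi-head cross-attention + feedforward module (a sketch).
    Q comes from the decoder input y; K and V come from the encoder-side
    feature m, so decoding reuses what the encoder learned."""
    def __init__(self, dim: int = 192, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm_y = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.ReLU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, y, m):                           # y, m: (B, N, dim)
        # cross attention: query from y, key/value from the encoder feature m
        attn = self.mca(self.norm_y(y), self.norm_m(m), self.norm_m(m),
                        need_weights=False)[0]
        y = y + attn
        y = y + self.mlp(self.norm2(y))
        return y

out = DecoderComponent()(torch.randn(1, 256, 192), torch.randn(1, 256, 192))
```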

X.2 Super-prior module

The super-prior module comprises an encoding part, a decoding part and a context prediction part. Like the encoding and decoding ends of the compression framework, the super-prior encoding and decoding parts use the hierarchical Transformer structure and the same encoder and decoder components to realize the extraction of the side information.

The context prediction part comprises mask processing realized by a masked convolution component, a context prediction module, and a Gaussian mixture model (GMM) modeling module. "Context" means that when encoding the information at the current position, the information at the current position and at the positions after it must be assumed invisible; this masking of invisible information is realized by masked convolution, shown schematically in FIG. 4, where the red part is set to 0, i.e. the invisible area.
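A minimal sketch of a masked convolution of this kind (the kernel is zeroed at the current position and at every raster-order position after it, so only already-decoded context is visible); the 5×5 kernel size and channel numbers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is zeroed at the current position and at all
    raster-order positions after it, so the context model only sees already
    decoded locations (a sketch; the 5x5 kernel size is an assumption)."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        mask = torch.ones_like(self.weight)              # (out_ch, in_ch, kH, kW)
        kh, kw = self.weight.shape[-2:]
        mask[..., kh // 2, kw // 2:] = 0                 # current position and to its right
        mask[..., kh // 2 + 1:, :] = 0                   # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

ctx_in = MaskedConv2d(192, 384)(torch.randn(1, 192, 16, 16))   # -> (1, 384, 16, 16)
```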

The mask-processed features are concatenated with the information that the super-prior encoding/decoding part has learned from the features and input into the context prediction module to learn the probability distribution of the features and output the related parameters. The context prediction module is realized by three convolutional layers with increasing channel numbers; the probability modeling of the invention adopts a Gaussian mixture model with three sub-Gaussians, so the output parameters are three groups of means and variances together with three corresponding weights.

Finally, probability modeling is performed on the feature y using the predicted probability parameters; the modeling formula is as follows:

p(ŷ) = ( Σ_{k=1}^{K} w^(k) · N(μ^(k), σ^(2,k)) ) * U(-1/2, 1/2)

In the formula, K = 3 denotes the three sub-Gaussian distributions; w, μ and σ² are respectively the weight, mean and variance of the corresponding sub-Gaussian distribution; U(-1/2, 1/2) is uniform noise used to simulate quantization, and * denotes convolution with it.
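The following sketch evaluates the likelihood implied by this model: a K = 3 Gaussian mixture convolved with U(-1/2, 1/2), realized by integrating each Gaussian over an interval of width 1 around the value. Tensor shapes and parameter names are assumptions for the example.

```python
import torch
from torch.distributions import Normal

def gmm_likelihood(y_hat, weights, means, scales):
    """Likelihood of the quantized feature under a 3-component Gaussian mixture
    convolved with uniform noise U(-1/2, 1/2) (a sketch of the model described above).
    y_hat: (B, C, H, W); weights/means/scales: (B, K, C, H, W) with K = 3."""
    y = y_hat.unsqueeze(1)                               # broadcast over the K components
    comp = Normal(means, scales.clamp(min=1e-6))
    # integrating each Gaussian over [y - 1/2, y + 1/2] realizes the convolution with U(-1/2, 1/2)
    per_comp = comp.cdf(y + 0.5) - comp.cdf(y - 0.5)
    w = torch.softmax(weights, dim=1)                    # mixture weights sum to 1 over K
    return (w * per_comp).sum(dim=1).clamp(min=1e-9)     # (B, C, H, W)

B, K, C, H, W = 1, 3, 192, 16, 16
p = gmm_likelihood(torch.randn(B, C, H, W).round(),
                   torch.randn(B, K, C, H, W),           # weight logits
                   torch.randn(B, K, C, H, W),           # means
                   torch.rand(B, K, C, H, W) + 0.1)      # scales (standard deviations)
bits = (-torch.log2(p)).sum()                            # estimated code length of y_hat
```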

Through this series of operations, the method can more accurately predict the latent features y extracted from the image x and reduce the amount of information needed to represent them, thereby improving the performance of the framework.

X.3 Experimental part

To verify the effectiveness of the method of the invention, a series of experiments were performed herein.

X.3.1 Experimental setup

The experiments of the invention used 13500 images randomly extracted from the validation set of the classical ImageNet dataset [J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database", IEEE Conf. on Computer Vision and Pattern Recognition, pp.1-8, June 20-25, 2009] as the training dataset; the images were randomly cropped into 256×256 patches before being input to the model. The batch size used for training was 8 images, and the optimizer was the Adam optimizer [D.P. Kingma and J. Ba, "Adam: A method for stochastic optimization", arXiv:1412.6980, pp.1-15, Dec. 2014] with parameters β1 = 0.9, β2 = 0.999 and ε = 10^-8. The initial learning rate was set to 5×10^-5 and was decayed to 2×10^-5 after 700,000 training iterations, after which training continued for another 100,000 iterations. In the training loss function, the parameter λ that balances code rate and distortion was set to 0.02, 0.015, 0.01 and 0.0075 respectively to obtain different reconstruction qualities at different code rates. The experimental equipment was an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of video memory.
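For reference, a sketch of a rate-distortion training objective of the form L = R + λ·D consistent with the setup above, where R is the estimated bits per pixel obtained from the predicted likelihoods and D is the mean squared error; the exact form and scaling of the distortion term are assumptions.

```python
import torch

def rd_loss(x, x_hat, likelihood_y, likelihood_z, lam=0.02):
    """Rate-distortion objective L = R + lambda * D (a sketch, not the exact loss of the text).
    R: estimated bits per pixel of the latent y and the side information z.
    D: mean squared error between input and reconstruction, scaled to the 8-bit
       range (the scaling is an assumption)."""
    num_pixels = x.shape[0] * x.shape[-2] * x.shape[-1]
    rate = (-torch.log2(likelihood_y).sum() - torch.log2(likelihood_z).sum()) / num_pixels
    distortion = torch.mean((x - x_hat) ** 2) * 255 ** 2
    return rate + lam * distortion

# example call with dummy tensors (shapes are illustrative only)
x = torch.rand(1, 3, 256, 256); x_hat = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)
loss = rd_loss(x, x_hat, torch.rand(1, 192, 16, 16) * 0.9 + 0.05,
               torch.rand(1, 128, 4, 4) * 0.9 + 0.05)
```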

X.3.2 Experimental results

Typical image compression algorithms from recent years were selected for comparison with the algorithm of the invention. The Kodak24 dataset, containing 24 images with a resolution of 512×768, was chosen as the test dataset [Kodak Lossless True Color Image Suite, download from http://r0k.us/graphics/Kodak/]. The experimental performance is measured by rate-distortion curves that combine the peak signal-to-noise ratio (PSNR) and the bit rate (bpp); the rate-distortion curves of the algorithms at different bit rates on the test dataset are shown in FIG. 5.

As shown in FIG. 5, the abscissa is the bit rate (bpp) and the ordinate is the peak signal-to-noise ratio (PSNR). The bit rate is the average number of bits per pixel, and the peak signal-to-noise ratio represents the ratio between the maximum signal and the noise introduced during image reconstruction; the lower the bit rate and the higher the peak signal-to-noise ratio, the better the corresponding compression algorithm. The curve marked with five-pointed stars is the result of the proposed algorithm. As can be seen from the figure, it outperforms many classical algorithms, including the common traditional algorithms JPEG and JPEG2000 and the deep-learning-based image compression algorithms bls2017 and bmshj2018, but is slightly inferior to BPG. Nevertheless, the invention is an image compression framework that uses the Transformer module as its main computation module; it outperforms many compression frameworks, including classical frameworks and frameworks using convolutional neural networks, and is a good starting point for the further development of Transformer-based image compression frameworks.

The described embodiments are only some embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
