Method and apparatus for picture encoding and decoding

Document No.: 144505    Publication date: 2021-10-22

Reading note: This technique, "Method and apparatus for picture encoding and decoding", was designed and created by F. Galpin, P. Bordes and F. Racape on 2020-02-06. Its main content is as follows: A method for video decoding is disclosed. A first reconstructed version of an image block of an encoded video is accessed. The first reconstructed version of the image block is filtered by a first neural network to form a second reconstructed version of the image block to be used as a reference. The second reconstructed version of the image block is filtered by a second neural network to form a third reconstructed version of the image block to be displayed. The first neural network and the second neural network are jointly trained.

1. A method for video decoding, comprising:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network (CNNa) to form a second reconstructed version of the image block adapted to be used as a reference;

filtering information representing the second reconstructed version of the image block by a second neural network (CNNb) to form a third reconstructed version of the image block suitable for display;

Wherein the first neural network and the second neural network are jointly trained using a joint loss function.

2. A device for video decoding, comprising one or more processors configured to:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network (CNNa) to form a second reconstructed version of the image block adapted to be used as a reference;

filtering information representing the second reconstructed version of the image block by a second neural network (CNNb) to form a third reconstructed version of the image block suitable for display;

Wherein the first neural network and the second neural network are jointly trained using a joint loss function.

3. The method of claim 1 or the apparatus of claim 2, wherein the first reconstructed version of the image block is used as an additional input for training the second neural network.

4. The method of claim 1 or the apparatus of claim 2, wherein filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block to be used as a reference comprises:

- filtering said first reconstructed version of said image block by an initial neural network (CNNa) to form an initial reconstructed version (S0) of the image block to be used as input to the first and second neural networks;

wherein the initial neural network is trained to optimize objective and perceptual filtering.

5. The method of claim 1 or the apparatus of claim 2, wherein the information representative of the second version is the output of the first neural network or an intermediate representation of the first neural network, the intermediate representation corresponding to the output of a layer preceding the last layer of the first neural network.

6. A method for video encoding, comprising:

accessing a first reconstructed version of an image block of a picture of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network (CNNa) to form a second reconstructed version of the image block adapted to be used as a reference;

Wherein the first neural network is jointly trained with a second neural network (CNNb) using a joint loss function, the second neural network being for filtering information representing the second reconstructed version of the image block to form a third reconstructed version of the image block suitable for display.

7. An apparatus for video encoding, comprising one or more processors configured to:

accessing a first reconstructed version of an image block of a picture of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network (CNNa) to form a second reconstructed version of the image block adapted to be used as a reference;

Wherein the first neural network is jointly trained with a second neural network (CNNb) using a joint loss function, the second neural network being for filtering information representing the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed.

8. The method of claim 6 or the apparatus of claim 7, wherein the first reconstructed version of the image block is used as an additional input for training the second neural network.

9. The method of claim 6 or the apparatus of claim 7, wherein filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block to be used as a reference comprises:

- filtering said first reconstructed version of said image block by an initial neural network (CNNa) to form an initial reconstructed version (S0) of the image block to be used as input to the first and second neural networks;

wherein the initial neural network is trained to optimize objective and perceptual filtering.

10. The method according to any of claims 6 and 8-9 or the apparatus according to any of claims 7-9, wherein the joint loss function comprises a first term representing a first loss function between the image block and the second reconstructed version of the image block.

11. The method of claim 10 or the apparatus of claim 10, wherein the first loss function is an L1 or L2 norm.

12. The method according to any of claims 6 and 8-11 or the apparatus according to any of claims 7-11, wherein the joint loss function comprises a second term representing a second loss function between the image block and the third reconstructed version of the image block.

13. The method of claim 12 or the apparatus of claim 12, wherein the second loss function is determined using a generative adversarial network.

14. The method of claim 6 or the apparatus of claim 7, wherein the first reconstructed version of the image block is used as an additional input for training the second neural network.

15. A signal comprising data generated by the method of any one of claims 6 and 8-14 or the apparatus of any one of claims 7-14.

16. A computer program comprising program code instructions for implementing a method according to claim 1 or any one of claims 3 to 5 or for implementing a method according to claim 6 or any one of claims 8 to 14.

17. An information storage device storing program code instructions for implementing a method according to claim 1 or any one of claims 3 to 5 or for implementing a method according to claim 6 or any one of claims 8 to 14.

1. Field of the invention

At least one embodiment of the present invention relates generally to a method and apparatus for picture encoding and decoding, and more particularly, to a method and apparatus for filtering using a plurality of neural networks in video encoding and decoding.

2. Background of the invention

To achieve high compression efficiency, image and video coding schemes typically employ prediction and transform to exploit spatial and temporal redundancy in the video content. Generally, intra or inter prediction is used to exploit intra- or inter-frame correlation; then the difference between the original image block and the predicted image block, usually denoted as prediction error or prediction residual, is transformed, quantized, and entropy coded. During encoding, the original image block is usually partitioned/split into sub-blocks, possibly using quadtree partitioning. To reconstruct the video, the compressed data is decoded by inverse processes corresponding to the prediction, transform, quantization, and entropy coding.

3. Summary of the invention

According to a general aspect of at least one embodiment, there is presented a method for video decoding, comprising:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

According to a general aspect of at least one embodiment, a device for video decoding is presented, comprising one or more processors configured to:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

According to a general aspect of at least one embodiment, there is presented a method for encoding video data, comprising:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

According to a general aspect of at least one embodiment, an apparatus for encoding video data is presented, comprising one or more processors configured to:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

According to another general aspect of at least one embodiment, a bitstream is formatted to include a signal generated according to the encoding method described above.

One or more of the present embodiments also provide a computer-readable storage medium having stored thereon instructions for encoding or decoding video data according to at least part of any of the above methods. One or more embodiments also provide a computer-readable storage medium on which a bitstream generated according to the above-described encoding method is stored. One or more embodiments also provide a method and apparatus for transmitting or receiving a bitstream generated according to the above-described encoding method. One or more embodiments also provide a computer program product comprising instructions for performing at least a portion of any of the methods described above.

4. Description of the drawings

Figure 1 shows a block diagram of an exemplary HEVC (high efficiency video coding) video encoder;

FIG. 2 shows four in-loop filters used in JEM 6.0;

fig. 3 is a block diagram illustrating an exemplary HEVC video decoder;

figure 4 shows an exemplary CNN (convolutional neural network);

fig. 5A, 5B and 5C show the training process, the encoding process and the decoding process, respectively, using CNN as an in-loop filter;

fig. 6 shows an exemplary method for filtering a reconstructed image block according to an embodiment;

fig. 7A shows the training process using a CNN as an in-loop filter and a CNN as an out-of-loop filter;

FIG. 7B depicts a discriminator-based loss function;

figures 7C and 7D show the encoding and decoding processes using CNN as in-loop filter and CNN as out-of-loop filter, respectively, according to the first embodiment;

figs. 8A, 8B and 8C show the training process, the encoding process and the decoding process, respectively, using a CNN as an in-loop filter and a CNN as an out-of-loop filter, according to the second embodiment;

fig. 9A, 9B and 9C show the training process, the encoding process and the decoding process, respectively, using a CNN as an in-loop filter and a CNN as an out-of-loop filter, according to the third embodiment; and

FIG. 10 illustrates a block diagram of an exemplary system in which aspects of the illustrative embodiments may be implemented.

5. Detailed description of the preferred embodiments

In HEVC coding, a picture is partitioned into square-shaped CTUs (coding tree units) with a configurable size, typically 64 × 64. In other video coding standards, the size may be 128 × 128 or 256 × 256. A CTU is the root of a quadtree partitioning into four equal-sized square Coding Units (CUs), each of half the parent block size in width and height. A quadtree is a tree in which a parent node may be split into four child nodes, each of which may become another parent node split into four child nodes. In HEVC, a Coding Block (CB) is partitioned into one or more Prediction Blocks (PBs) and forms the root of a quadtree partitioning into Transform Blocks (TBs). Corresponding to the coding block, prediction blocks, and transform blocks, a Coding Unit (CU) includes the Prediction Units (PUs) and the tree-structured set of Transform Units (TUs); a PU includes the prediction information for all color components, and a TU includes the residual coding syntax structure for each color component. The size of a CB, PB, and TB of the luma component applies to the corresponding CU, PU, and TU.

In more recent coding systems, the CTU is the root of a coding tree partitioned into Coding Units (CUs). A coding tree is a tree in which a parent node (usually corresponding to a block) may be split into child nodes (e.g., into 2, 3, or 4 child nodes), each of which may in turn become a parent node and be further split into child nodes. In addition to the quadtree split mode, new split modes (binary tree symmetric split mode, binary tree asymmetric split mode, and ternary tree split mode) are defined, which increases the total number of possible split modes. The coding tree has a unique root node, e.g., a CTU. The leaves of the coding tree are the terminal nodes of the tree. Each node of the coding tree represents a block that may be further split into smaller blocks, also referred to as sub-blocks. Once the partitioning of the CTU into CUs is determined, the CUs corresponding to the leaves of the coding tree are encoded. The partitioning of a CTU into CUs and the coding parameters used to encode each CU (corresponding to a leaf of the coding tree) may be determined at the encoder side by a rate-distortion optimization process.

In this application, the term "block" or "picture block" may be used to refer to any one of CTU, CU, PU, TU, CB, PB, and TB. In addition, the terms "block" or "picture block" may be used to refer to macroblocks, partitions, and sub-blocks as specified in h.264/AVC or other video coding standards, and more generally to arrays of samples having many sizes.

In this application, the terms "reconstruction" and "decoding" are used interchangeably, the terms "pixel" and "sample" are used interchangeably, and the terms "image", "picture" and "frame" are used interchangeably. Typically, but not necessarily, the term "reconstruction" is used at the encoder side, while "decoding" is used at the decoder side.

The present application describes a number of aspects, including tools, features, embodiments, models, methods, and the like. Many of these aspects are described with specificity and, at least to show the individual characteristics, are often described in a manner that may sound limiting. However, this is for clarity of description and does not limit the application or scope of those aspects. Indeed, all of the different aspects may be combined and interchanged to provide further aspects. Moreover, these aspects may also be combined and interchanged with the aspects described in earlier documents.

The aspects described and contemplated in this application can be embodied in many different forms. Fig. 1, 2, and 10 below provide some examples, but other examples are contemplated and the discussion of fig. 1, 2, and 10 does not limit the breadth of the implementation. At least one aspect generally relates to video encoding and decoding, and at least one other aspect generally relates to transmitting a generated or encoded bitstream. These and other aspects may be implemented as a method, apparatus, computer-readable storage medium having stored thereon instructions for encoding or decoding video data according to any of the methods described, and/or computer-readable storage medium having stored thereon a bitstream generated according to any of the methods described.


Various methods are described herein, and each method includes one or more steps or actions for achieving the described method. The order and/or use of specific steps and/or actions may be modified or combined unless a specific order of steps or actions is required for proper operation of the method.

Various methods and other aspects described in this application may be used to modify modules of the video encoder 100 and decoder 200, such as the in-loop filtering modules and the post-decoding process (165, 265, 285), as shown in fig. 1 and fig. 3. Furthermore, the present aspects are not limited to VVC or HEVC, and may be applied, for example, to other standards and recommendations, whether pre-existing or developed in the future, and to extensions of any such standards and recommendations, including VVC and HEVC. The aspects described in this application may be used alone or in combination unless otherwise indicated or technically excluded.

Various values are used in this application, such as the number of layers of the convolutional neural network, the number of convolutions per layer, and the like. The specific values are for example purposes and the described aspects are not limited to these specific values.

Fig. 1 shows an encoder 100. Variations of this encoder 100 are contemplated, but for clarity, the encoder 100 is described below, and not all contemplated variations are described.

Before being encoded, the video sequence may undergo a pre-encoding process (101), for example, applying a color transform to the input color pictures (e.g., conversion from RGB 4:4:4 to YCbCr 4:2:0), or performing a remapping of the input picture components in order to obtain a signal distribution more resilient to compression (for instance, using a histogram equalization of one of the color components). Metadata may be associated with the pre-processing and appended to the bitstream.

In the encoder 100, the pictures are encoded by an encoder element, as described below. A picture to be encoded is divided (102) and processed in units of, for example, CUs. Each unit is encoded using, for example, intra or inter modes. When a unit is encoded in intra mode, it performs intra prediction (160). In inter mode, motion estimation (175) and compensation (170) are performed. The encoder decides (105) to encode the unit using one of an intra mode or an inter mode, and indicates the intra/inter decision by, for example, a prediction mode flag. For example, a prediction residual is calculated by subtracting (110) the prediction block from the original image block.

The prediction residual is then transformed (125) and quantized (130). The quantized transform coefficients are entropy coded (145) along with motion vectors and other syntax elements to output a bitstream. The encoder may skip the transform and apply quantization directly to the untransformed residual signal. The encoder may bypass the transform and quantization, i.e. directly code the residual without applying a transform or quantization process.

The encoder decodes the encoded block to provide a reference for further prediction. The quantized transform coefficients are dequantized (140) and inverse transformed (150) to decode the prediction residual. The decoded prediction residual and the prediction block are combined (155), and the image block is reconstructed. An in-loop filter (165) is applied to the reconstructed picture to perform, for example, DBF (deblocking filter)/SAO (sample adaptive offset) filtering to reduce coding artifacts. More generally, for video compression, other filters may be used for in-loop filtering. For example, as shown in fig. 2, in the current JEM 6.0 (Joint Exploration Model 6.0) developed by JVET (Joint Video Exploration Team), four filters, namely the bilateral filter (BLF), the deblocking filter (DBF), SAO, and ALF (adaptive loop filter), are applied one after another. These different filters are generally based on: (1) sample analysis and pixel classification and (2) class-dependent filtering. The filtered image is stored in a reference picture buffer (180).

For ease of notation, we will refer to the input image of the encoder as S, to the input of the in-loop filtering as Ŝ, and to the output of the in-loop filtering as S̃. Ŝ may also be referred to as the initial reconstruction or the initial reconstructed version of the image. As shown in fig. 2, the input to the in-loop filtering is the sum of the predicted samples and the decoded prediction residual. For some blocks, when the prediction residual is zero or does not exist (e.g., in skip mode), the input to the in-loop filtering is directly the predicted samples.

In the current JEM, a bilateral filter is applied to the reconstructed samples Ŝ before the deblocking filter. The BLF works by basing the filter weights not only on the distance to neighboring samples but also on their values. Each sample in the initial reconstructed picture is replaced by a weighted average of itself and its neighbors. The weights are calculated based on the distance from the center sample and the difference in sample values. Because the filter has a small plus-sign shape (i.e., the filter uses four neighboring samples), all distances are 0 or 1.

The sample located at (i, j) is filtered using its neighboring samples. The weight ω(i, j, k, l) is the weight assigned to a neighboring sample (k, l) for filtering the current sample (i, j), and is defined as:

ω(i, j, k, l) = exp( −((i − k)² + (j − l)²) / (2σ_d²) − (I(i, j) − I(k, l))² / (2σ_r²) )

where I(i, j) and I(k, l) are the intensity values of samples (i, j) and (k, l), respectively, in the initial reconstruction Ŝ, σ_d is the spatial parameter, and σ_r is the range parameter. The characteristics (or strength) of the bilateral filter are controlled by the parameters σ_d and σ_r. In JEM 6.0, σ_d is set based on the transform unit size and the prediction mode, and σ_r is set based on the QP used for the current block.

The filtered sample value I_F(i, j) is then calculated as:

I_F(i, j) = ( Σ_(k,l) I(k, l) · ω(i, j, k, l) ) / ( Σ_(k,l) ω(i, j, k, l) )

where the sums run over the plus-shaped neighborhood of (i, j).

in the encoder and decoder, if a CU is larger than 16 × 16, the proposed bilateral filter is applied to each CU or block of maximum size 16 × 16. In JEM 6.0, the bilateral filter is executed within an RDO (rate distortion optimized) loop on the encoder side. Thus, the filtered block may also be used to predict subsequent blocks (intra prediction).

ALF is basically designed based on a Wiener filter, the aim of which is to design a linear filter (1D or 2D) that minimizes the L2 distortion, i.e., that minimizes the squared error between the filtered samples and the reference samples (usually the original samples). In JEM, ALF with block-based filter adaptation is applied. For the luminance component, one of 25 filters is selected for each 2 × 2 block based on the direction and activity of the signal.

Up to three circularly symmetric filter shapes are supported for the luminance component. An index is signaled at the picture level to indicate the filter shape for the luma component of the picture. For chroma components in a picture, a 5 × 5 diamond filter is always used.

A block classification is applied to each 2 × 2 block, which is classified into one of 25 classes based on local signal analysis (gradient, directionality). For the two chroma components in a picture, no classification method is applied, i.e., a single set of ALF coefficients is applied to each chroma component.

The filtering process of the luminance component can be controlled at the CU level. A flag is signaled to indicate whether ALF is applied to the luma component of a CU. For chroma components, whether or not ALF is applied is indicated only at the picture level. The ALF filter parameters are signaled in the first CTU before the SAO parameters of the first CTU. Up to 25 sets of luminance filter coefficients may be signaled. To reduce the bit overhead, filter coefficients of different classes may be combined. Furthermore, the ALF coefficients of the reference picture may be reused as the ALF coefficients of the current picture.

Fig. 3 shows a block diagram of a video decoder 200. In the decoder 200, the bitstream is decoded by decoder elements, as described below. The video decoder 200 typically performs a decoding pass reciprocal to the encoding pass described in fig. 1. The encoder 100 also typically performs video decoding as part of encoding the video data.

In particular, the input to the decoder comprises a video bitstream, which may be generated by the video encoder 100. The bitstream is first entropy decoded (230) to obtain transform coefficients, motion vectors, and other coding information. The picture segmentation information indicates how the picture is segmented. The decoder may thus divide (235) the picture according to the decoded picture partitioning information. The transform coefficients are dequantized (240) and inverse transformed (250) to decode the prediction residual. The decoded prediction residual is combined (255) with the prediction block, reconstructing the block. The prediction block may be obtained (270) from intra prediction (260) or motion compensated prediction (i.e., inter prediction) (275). An in-loop filter (265) is applied to the reconstructed image. The filtered image is stored in a reference picture buffer RPB (280), also referred to as a Decoded Picture Buffer (DPB).

The decoded pictures may further undergo a post-decoding process (285), e.g., a post-filtering for improving the picture rendering on a display, an inverse color transform (e.g., conversion from YCbCr 4:2:0 to RGB 4:4:4), or an inverse remapping that reverses the remapping process performed in the pre-encoding process (101). The post-decoding process may use metadata derived in the pre-encoding process and signaled in the bitstream.

It is known to perform in-loop filtering using deep learning. The field of deep learning relates to the use of deep neural networks. Neural networks include neurons organized by groups called layers. In a neural network there is an input layer, an output layer and a hidden layer(s). The deep neural network has two or more hidden layers. Video compression may be considered linked with pattern recognition, as compression typically looks for repetitive patterns in order to remove redundancy. Because artifact removal or artifact reduction in video compression can be considered to identify and restore the original image, neural networks can be used as filters to reduce artifacts. Herein, artifact reduction may also be referred to as image restoration, and a neural network for reducing artifacts may also be referred to as a restoration filter.

FIG. 4 illustrates an exemplary M-layer convolutional neural network, in which the reconstruction Ŝ, without filtering from the video codec, is restored as S̃ by adding a residual R calculated by the CNN. Mathematically, the network can be represented as:

F_0(Ŝ) = Ŝ,
F_i(Ŝ) = g(W_i ∗ F_(i−1)(Ŝ) + B_i), i = 1, ..., M,
R = F_M(Ŝ),  S̃ = Ŝ + R,

where W_i and B_i are, respectively, the weight and bias parameters for layer i, g() is an activation function (e.g., a sigmoid or rectified linear unit (ReLU) function), and ∗ denotes the convolution operation. The output S̃ of the CNN can be stored in the Reference Picture Buffer (RPB) to be used as a predictor for encoding or decoding subsequent frames.

The parameter set θ, including the W_i and B_i for i = 1, ..., M, may be trained from K training samples {S_k, Ŝ_k, k = 1, ..., K}, for example by minimizing a loss function defined based on the error between the restored image and the original image, such as:

L(θ) = (1/K) Σ_(k=1..K) ‖ Ŝ_k + R(Ŝ_k; θ) − S_k ‖²

note that the loss function may also contain other terms to stabilize convergence or avoid overfitting. These regularization terms may simply be added to the error function.

Fig. 5A shows a CNN trained on a large database of images, where the network attempts to restore the image reconstructed by the encoder by minimizing the error (i.e., the difference) with respect to the original image. Fig. 5B shows the reconstructed image being restored in the encoder using the resulting CNN. The restored picture can then be displayed or used as a reference for predicting other frames of the sequence. Symmetrically, the decoder as shown in fig. 5C receives the bitstream, reconstructs the image, and restores the image using the same CNN.

The input to the first layer of the CNN is typically the Y component of the reconstructed image, i.e., an image of size W x H x 1, where W and H are the width and height of the image. In order to also use other information as input, we treat the reconstructed image as one channel and use additional channels to input the other information. In an example, the first layer applies N convolutional filters of size n x n x 1, e.g., n = 3, to each sample of the input image. Thus, N values are obtained for each sample. A bias may then be added to each value, followed by a non-linear function such as a ReLU. The ReLU returns 0 if the input value is negative, and returns the input value if the input value is positive. In the second layer, M convolutional filters of size 3 x 3 x N are applied, again followed by the addition of a bias and the ReLU function. Several such layers may be applied in succession.
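For illustration, a restoration CNN with the layer structure just described could be sketched in PyTorch as follows; the class name, the number of layers, and the channel width N = 64 are assumptions made for the example, not values prescribed by this application.

```python
import torch.nn as nn

class RestorationCNN(nn.Module):
    """Residual restoration filter: predicts a residual R that is added to the
    reconstructed plane (input of size batch x 1 x H x W)."""
    def __init__(self, n_channels=64, kernel=3, n_layers=4):
        super().__init__()
        pad = kernel // 2
        layers = [nn.Conv2d(1, n_channels, kernel, padding=pad), nn.ReLU()]   # n x n x 1 filters
        for _ in range(n_layers - 2):
            layers += [nn.Conv2d(n_channels, n_channels, kernel, padding=pad), nn.ReLU()]
        layers += [nn.Conv2d(n_channels, 1, kernel, padding=pad)]             # residual output
        self.body = nn.Sequential(*layers)

    def forward(self, s_hat):
        # Restored image = initial reconstruction + predicted residual.
        return s_hat + self.body(s_hat)
```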

In-loop filters, such as SAO or DBF, are applied to the reconstructed samples of a video image with the aim of reducing coding artifacts, mainly due to the quantization of the signal. The outputs of these filters are stored in the RPB (reference picture buffer) of the encoder (respectively of the decoder). For in-loop filters, objective metrics (typically based on the L1 or L2 norm) are traditionally used to optimize the filtered image, since this image is used as a reference image for prediction.

Optimizing these filters using more perceptual metrics, such as MS-SSIM, typically reduces the performance of the compression, i.e., the performance of the prediction, since the resulting image is used as a reference for prediction.

Since the images are filtered to improve the picture rendering on the display, the out-of-loop filters (e.g., those running in a television set) are preferably optimized using perceptual metrics.

Such additional post-filters are applied, for example, in set-top boxes or television sets to adapt the picture rendering to the display characteristics. Since these filtered images are not used as reference images for prediction, perceptual improvement can be performed. Using objective metrics, such as the L2 norm, generally limits perceptual improvement.

These two types of filtering are optimized independently in current systems, which results in sub-optimal filtering and/or higher complexity.

In at least one embodiment, the multipart CNN is trained by jointly optimizing reconstructed image quality for later use as a reference in a codec and image visual quality for display. The multipart CNN includes at least two CNNs, for example, CNNa and CNNb as in fig. 7A, 8A and 9A.

Fig. 6 shows an exemplary method for filtering a reconstructed image block according to an embodiment. The method of fig. 6 may be used for an encoding method and a decoding method.

The method starts in step S10. At S12, a first reconstructed version of the image block is accessed. At S14, the first reconstructed version of the image block is filtered through the first neural network to form a second reconstructed version of the image block for use as a reference. At S16, the second reconstructed version of the image block is filtered through a second neural network to form a third reconstructed version of the image block to be displayed. According to this embodiment, the first neural network and the second neural network are jointly trained. The method ends at S18.
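As an illustrative sketch only, the decoder-side flow of fig. 6 (steps S12 to S16) could be expressed as follows, assuming two jointly trained, pre-loaded networks cnn_a and cnn_b such as the one sketched earlier; the reference buffer and display calls are placeholders.

```python
def filter_reconstruction(s_hat, cnn_a, cnn_b, reference_buffer, display):
    """Two-stage filtering of a reconstructed block (fig. 6, S12-S16)."""
    s_a = cnn_a(s_hat)            # second reconstructed version, to be used as a reference
    reference_buffer.append(s_a)  # e.g., stored in the reference picture buffer
    s_b = cnn_b(s_a)              # third reconstructed version, to be displayed
    display(s_b)
    return s_a, s_b
```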

Fig. 7A shows the joint training of CNNa and CNNb on a large image database according to the first embodiment, where CNNa is used as an in-loop filter and CNNb is used as an out-of-loop filter. The training uses the original image S and the reconstructed image Ŝ as inputs. The input of CNNa is the pixels (luminance and/or chrominance channels) of the reconstructed image Ŝ. The deep neural network CNNa is a fully convolutional neural network (FCNN) that learns the residual Ra to be added to the reconstructed image Ŝ in order to produce a restored image Ŝ_a. The latter is optimized to be used as a reference for encoding other pictures of the video.

The deep neural network CNNb is a fully convolutional neural network (FCNN) that learns the residual Rb to be added to Ŝ_a in order to produce the restored image Ŝ_b to be displayed. The output Ŝ_a of CNNa is used as input to CNNb. The entire CNN thus has two parts: CNNa for in-loop filtering and CNNb for out-of-loop filtering. The two parts of the network are jointly trained. The first loss function La, used to train the objective filtering, i.e., CNNa, is typically an L1 or L2 norm between the original image S and the image Ŝ_a filtered by CNNa:

La = ‖ S − Ŝ_a ‖₁  or  La = ‖ S − Ŝ_a ‖₂²

the second loss Lb, CNNb, for training the perceptual filtering is typically designed using high-level functions. For example, a discriminator such as in a generative countermeasure network (GAN) may be used. Generative Antagonistic Networks (GANs) are a class of artificial intelligence algorithms used for unsupervised machine learning, implemented by a system of two neural networks. A network, generative model G, captures the data distribution, and a network, discriminant model D, estimates the probability that the sample comes from training data instead of G. The training procedure for G is to maximize the probability that D produces an error. This framework corresponds to the infinitesimal two-player game.

Fig. 7B depicts a discriminator-based loss function. The discriminator is used to determine the loss Lb.

The reconstructed image Ŝ is filtered by the entire network, i.e., CNNa and then CNNb, to produce a filtered image Ŝ_b.

Ŝ_b is input into a discriminator to output a discriminator score D(G(Ŝ)), where D() represents the discriminator forward function and G() represents the generator forward function (the codec's output Ŝ being sent as the CNNa + CNNb input).

The original image S is also input to the discriminator.

During discriminator training, the following function is maximized:

L_D(θ_d) = log(D(S)) + log(1 − D(G(Ŝ)))

where θ_d denotes the network parameters of the discriminator.

During generator training, the following loss is minimized:

Lb(θ_g) = −log(D(G(Ŝ)))

where θ_g denotes the network parameters of the generator CNNa + CNNb.

Another method consists in building a discriminator network that takes both S and Ŝ_b as inputs and letting the network discriminate between the original image and the filtered image.

A typical D () function is a classification function. The discriminator tries to classify the original image as correct (score 1) and the filtered image as incorrect (score 0). A typical classifier uses a ResNet network, but more advanced networks, especially those based on multi-scale analysis, can be used as discriminators.

The final loss function used when the whole network CNNa + CNNb is trained can be expressed as:

L(θ_g) = La + λ · Lb + γ · R(θ_g)

where λ and γ are parameters intended to balance the different terms, and R is a typical weight regularization function for improved convergence, here set to the L2 norm of the weights of CNNa + CNNb.
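A minimal PyTorch sketch of one joint training iteration under a loss of this form is given below; it is only an illustration of the principle, where the optimizers, the balancing parameters lam and gam, and the discriminator disc (e.g., a ResNet-based classifier as mentioned above) are assumptions of the example rather than elements specified by this application.

```python
import torch
import torch.nn.functional as F

def generator_step(cnn_a, cnn_b, disc, opt_g, s, s_hat, lam=0.01, gam=1e-4):
    """One step for the generator CNNa + CNNb: L = La + lam * Lb + gam * R."""
    opt_g.zero_grad()
    s_a = cnn_a(s_hat)                        # in-loop restoration (reference)
    s_b = cnn_b(s_a)                          # out-of-loop restoration (display)
    la = F.mse_loss(s_a, s)                   # objective loss (L2 norm)
    lb = -torch.log(disc(s_b) + 1e-8).mean()  # adversarial (perceptual) loss
    params = list(cnn_a.parameters()) + list(cnn_b.parameters())
    reg = sum((p ** 2).sum() for p in params) # L2 weight regularization
    loss = la + lam * lb + gam * reg
    loss.backward()
    opt_g.step()
    return float(loss)

def discriminator_step(cnn_a, cnn_b, disc, opt_d, s, s_hat):
    """One step for the discriminator: score originals as 1, filtered images as 0."""
    opt_d.zero_grad()
    with torch.no_grad():
        s_b = cnn_b(cnn_a(s_hat))
    loss = -(torch.log(disc(s) + 1e-8) + torch.log(1 - disc(s_b) + 1e-8)).mean()
    loss.backward()
    opt_d.step()
    return float(loss)
```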

Fig. 7C shows the reconstructed image being restored in the encoder using the resulting CNNa. Symmetrically, the decoder shown in fig. 7D receives the bitstream, reconstructs the image, and restores it using the CNNa and CNNb determined according to the first embodiment.

FIG. 8A illustrates the joint training of CNNa and CNNb on a large database of images according to a second embodiment, where CNNa acts as an in-loop filter and CNNb acts as an out-of-loop filter. The input of CNNb is composed of the reconstructed image Ŝ (the input of the in-loop filter) and the restored image Ŝ_a. This allows the second network CNNb to use information about the restoration already performed by the first network CNNa. The two inputs of CNNb, Ŝ and Ŝ_a, are concatenated, i.e., for each pixel a quantity with two values is obtained (one value from Ŝ and one value from Ŝ_a). The concatenation of the two inputs thus produces a tensor of size W x H x 2. Accordingly, the first layer of CNNb applies N convolutional filters of size n x n x 2 per sample, e.g., n = 3. The same loss functions La and Lb as used in the first embodiment and described in fig. 7A may be used. In particular, the discriminator of fig. 7B may thus be used to define the loss function Lb.
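A short sketch of this channel concatenation, assuming PyTorch tensors of shape (batch, 1, H, W) and an illustrative output width of 64 channels for the first layer of CNNb:

```python
import torch
import torch.nn as nn

def cnnb_input(s_hat, s_a):
    """Concatenate the reconstruction and the CNNa output along the channel
    dimension, producing a (batch, 2, H, W) tensor used as CNNb input."""
    return torch.cat([s_hat, s_a], dim=1)

# The first layer of CNNb therefore uses n x n x 2 filters (here n = 3).
first_layer_of_cnnb = nn.Conv2d(in_channels=2, out_channels=64, kernel_size=3, padding=1)
```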

In a variant, CNNb does not use the image restored by CNNa (the final output of CNNa), but an intermediate representation, e.g., a latent representation (typically, the output of the layer preceding the last layer, of size W x H x N instead of the W x H x 1 of the final output). This allows CNNb to perform the perceptual filtering using the latent representation of CNNa.

Fig. 8B shows the reconstructed image being restored in the encoder using the resulting CNNa. The encoder is the same in the first (fig. 7C) and second (fig. 8B) embodiments. Symmetrically, the decoder as shown in fig. 8C receives the bitstream, reconstructs the image, and restores it using CNNa and CNNb determined according to the second embodiment.

Fig. 9A shows co-training CNNa, CNNa', and CNNb on a large image database according to the third embodiment.

The network comprises three parts: CNNa, CNNa', and CNNb. The output S0 of CNNa is used as input to both CNNa' and CNNb. The purpose of the network CNNa' is to restore the image. The main idea is to have a common part (CNNa) used by both the objective and the perceptual filtering, and dedicated parts (CNNa' for objective filtering and CNNb for perceptual filtering). The output S0 need not be an image; it may be a tensor. The tensor S0 may generally have a size of W x H x N, where W x H is the image size and N is the number of convolutions in the last layer of CNNa. The same loss functions as used in the first embodiment shown in fig. 7A may be used.
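The data flow of this third embodiment can be sketched as follows; the three networks are placeholders of the kind sketched earlier, and only the routing of the shared tensor S0 reflects the description above.

```python
def forward_third_embodiment(cnn_a, cnn_a_prime, cnn_b, s_hat):
    """Shared trunk CNNa produces a tensor S0 (W x H x N), which feeds both the
    objective head CNNa' (reference) and the perceptual head CNNb (display)."""
    s0 = cnn_a(s_hat)          # common features, not necessarily an image
    s_ref = cnn_a_prime(s0)    # restored image used as a prediction reference
    s_disp = cnn_b(s0)         # restored image intended for display
    return s_ref, s_disp
```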

Fig. 9B shows the reconstructed image being restored in the encoder using the resulting CNNa and CNNa'. Symmetrically, the decoder shown in fig. 9C receives the bitstream, reconstructs the image, and restores it using CNNa, CNNa', and CNNb determined according to the third embodiment.

In one variant, which can be applied to the three embodiments, the coefficients of the "out-of-loop filter", i.e., the network parameters of CNNb, are sent in the bitstream. In a second variant, also applicable to the three embodiments, the coefficients are learned by training on a large data set and are known to the post-filtering module.

By combining jointly trained CNN-based filters, these three embodiments make it possible to obtain an optimal image restoration for display while maintaining high codec performance. The approach also relies on the sharing of layers between CNNa and CNNb, thereby reducing the computational requirements of CNNb. For the third embodiment, the degree of sharing is even higher.

FIG. 10 illustrates a block diagram of an example of a system in which aspects and embodiments are implemented. The system 1000 may be implemented as a device including the various components described below and configured to perform one or more aspects described herein. Examples of such devices include, but are not limited to, various electronic devices such as personal computers, laptop computers, smart phones, tablet computers, digital multimedia set-top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers. The elements of system 1000 may be implemented individually or in combination in a single Integrated Circuit (IC), multiple ICs, and/or discrete components. For example, in at least one embodiment, the processing and encoder/decoder elements of system 1000 are distributed across multiple ICs and/or discrete components. In various embodiments, system 1000 is communicatively coupled to one or more other systems or other electronic devices via, for example, a communications bus or through dedicated input and/or output ports. In various embodiments, system 1000 is configured to implement one or more aspects described herein.

The system 1000 includes at least one processor 1010 configured to execute instructions loaded therein for implementing various aspects described herein, for example. The processor 1010 may include embedded memory, an input-output interface, and various other circuits known in the art. The system 1000 includes at least one memory 1020 (e.g., volatile memory devices and/or non-volatile memory devices). System 1000 includes a storage device 1040 that may include non-volatile memory and/or volatile memory, including but not limited to Electrically Erasable Programmable Read Only Memory (EEPROM), Read Only Memory (ROM), Programmable Read Only Memory (PROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, magnetic disk drives, and/or optical disk drives. By way of non-limiting example, the storage 1040 may include an internal storage, an attached storage (including removable and non-removable storage), and/or a network accessible storage.

The system 1000 includes an encoder/decoder module 1030 configured to, for example, process data to provide encoded video or decoded video, and the encoder/decoder module 1030 may include its own processor and memory. Encoder/decoder module 1030 represents module(s) that may be included in a device to perform encoding and/or decoding functions. As is known, a device may include one or both of an encoding and decoding module. Additionally, encoder/decoder module 1030 may be implemented as a separate element of system 1000 or may be incorporated within processor 1010 as a combination of hardware and software as known to those skilled in the art.

Program code to be loaded onto processor 1010 or encoder/decoder 1030 to perform the various aspects described in this document may be stored in storage device 1040 and subsequently loaded onto memory 1020 for execution by processor 1010. According to various embodiments, one or more of the processor 1010, the memory 1020, the storage 1040, and the encoder/decoder module 1030 may store one or more of various items during performance of the processes described herein. These stored items may include, but are not limited to, the input video or portions of the input video, the decoded video or portions of the decoded video, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.

In some embodiments, memory within processor 1010 and/or encoder/decoder module 1030 is used to store instructions and provide working memory for processing required during encoding or decoding. However, in other embodiments, memory external to the processing device (e.g., the processing device may be the processor 1010 or the encoder/decoder module 1030) is used for one or more of these functions. The external memory may be memory 1020 and/or storage 1040, such as dynamic volatile memory and/or non-volatile flash memory. In several embodiments, external non-volatile flash memory is used to store the operating system of, for example, a television. In at least one embodiment, fast external dynamic volatile memory such as RAM is used as working memory for video encoding and decoding operations, such as for MPEG-2 (MPEG refers to the Moving Picture Experts Group; MPEG-2 is also referred to as ISO/IEC 13818, with 13818-1 also known as H.222 and 13818-2 also known as H.262), HEVC (HEVC refers to High Efficiency Video Coding, also known as H.265 and MPEG-H Part 2), or VVC (Versatile Video Coding, a new standard being developed by JVET, the Joint Video Experts Team).

As shown in block 1130, input to the elements of system 1000 may be provided through a variety of input devices. Such input devices include, but are not limited to, (i) an RF portion that receives a Radio Frequency (RF) signal, for example, transmitted over the air by a broadcaster, (ii) a Component (COMP) input terminal (or set of COMP input terminals), (iii) a Universal Serial Bus (USB) input terminal, and/or (iv) a High Definition Multimedia Interface (HDMI) input terminal. Other examples not shown in fig. 10 include composite video.

In various embodiments, the input device of block 1130 has associated corresponding input processing elements known in the art. For example, the RF section may be associated with elements suitable for: (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a frequency band), (ii) downconverting the selected signal, (iii) band-limiting again to a narrower frequency band to select, for example, a signal band that may be referred to as a channel in some embodiments, (iv) demodulating the downconverted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select a desired data packet stream. The RF section of various embodiments includes one or more elements to perform these functions, such as frequency selectors, signal selectors, band limiters, channel selectors, filters, down-converters, demodulators, error correctors, and demultiplexers. The RF section may include a tuner that performs various of these functions including, for example, down-converting the received signal to a lower frequency (e.g., an intermediate or near baseband frequency) or baseband. In one set-top box embodiment, the RF section and its associated input processing elements receive RF signals transmitted over a wired (e.g., cable) medium and perform frequency selection by filtering, down-converting, and re-filtering to a desired frequency band. Various embodiments rearrange the order of the above (and other) elements, remove some of these elements, and/or add other elements that perform similar or different functions. Adding components may include inserting components between existing components, such as an amplifier and an analog-to-digital converter. In various embodiments, the RF section includes an antenna.

Additionally, USB and/or HDMI terminals may include respective interface processors for connecting the system 1000 to other electronic devices through USB and/or HDMI connections. It should be appreciated that various aspects of the input processing, such as Reed-Solomon error correction, may be implemented as desired, for example, within a separate input processing IC or processor 1010. Similarly, aspects of the USB or HDMI interface processing may be implemented within a separate interface IC or within the processor 1010, as desired. The demodulated, error corrected and demultiplexed stream is provided to various processing elements including, for example, a processor 1010 and an encoder/decoder 1030 that operate in conjunction with memory and storage elements to process the data stream as needed for presentation on an output device.

The various elements of system 1000 may be disposed within an integrated housing. Within the integrated housing, the various components may be interconnected and communicate data therebetween using a suitable connection arrangement 1140, such as internal buses known in the art, including inter-IC (I2C) buses, wiring, and printed circuit boards.

The system 1000 includes a communication interface 1050 that enables communication with other devices via a communication channel 1060. The communication interface 1050 may include, but is not limited to, a transceiver configured to transmit and receive data over the communication channel 1060. The communication interface 1050 may include, but is not limited to, a modem or network card, and the communication channel 1060 may be implemented, for example, within wired and/or wireless media.

In various embodiments, data is streamed or otherwise provided to system 1000 using a wireless network, such as a Wi-Fi network, such as IEEE 802.11(IEEE refers to the institute of Electrical and electronics Engineers). The Wi-Fi signals of these embodiments are received over a communication channel 1060 and a communication interface 1050 suitable for Wi-Fi communication. The communication channel 1060 of these embodiments is typically connected to an access point or router that provides access to external networks including the internet to allow streaming applications and other over-the-top communications. Other embodiments provide streaming data to the system 1000 using a set-top box that passes data over the HDMI connection of the input block 1130. Still other embodiments provide streaming data to the system 1000 using an RF connection of the input block 1130. As described above, various embodiments provide data in a non-streaming manner. In addition, various embodiments use wireless networks other than Wi-Fi, such as a cellular network or a Bluetooth network.

System 1000 may provide output signals to a variety of output devices, including a display 1100, speakers 1110, and other peripheral devices 1120. The display 1100 of various embodiments includes, for example, one or more of a touch screen display, an Organic Light Emitting Diode (OLED) display, a curved display, and/or a foldable display. The display 1100 may be used for a television, a tablet, a laptop, a cellular phone (mobile phone), or other device. Display 1100 may also be integrated with other components (e.g., as in a smart phone), or stand alone (e.g., an external monitor for a laptop computer). In examples of embodiments, other peripheral devices 1120 include one or more of a stand-alone digital video disc (or digital versatile disc) (DVR, both), a disc player, a stereo system, and/or a lighting system. Various embodiments use one or more peripherals 1120 that provide functionality based on the output of system 1000. For example, the disc player performs a function of playing an output of the system 1000.

In various embodiments, control signals are communicated between the system 1000 and the display 1100, speaker 1110, or other peripheral devices 1120 using signaling such as an AV. link, Consumer Electronics Control (CEC), or other communication protocol that enables device-to-device control with or without user intervention. Output devices may be communicatively coupled to system 1000 via dedicated connections through respective interfaces 1070, 1080, and 1090. Alternatively, an output device may be connected to system 1000 via communication interface 1050 using communication channel 1060. The display 1100 and speaker 1110 may be integrated in a single unit in an electronic device (e.g., a television) along with other components of the system 1000. In various embodiments, display interface 1070 includes a display driver, such as a timing controller (T con) chip.

For example, if the RF portion of input 1130 is part of a separate set-top box, display 1100 and speaker 1110 may alternatively be separate from one or more of the other components. In various embodiments where the display 1100 and speaker 1110 are external components, the output signals may be provided via a dedicated output connection, including, for example, an HDMI port, a USB port, or a COMP output.

The embodiments may be carried out by computer software implemented by the processor 1010, or by hardware, or by a combination of hardware and software. As a non-limiting example, the embodiments may be implemented by one or more integrated circuits. The memory 1020 may be of any type suitable to the technical environment and may be implemented using any suitable data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples. The processor 1010 may be of any type suitable to the technical environment, and may include one or more of microprocessors, general purpose computers, special purpose computers, and processors based on a multi-core architecture, as non-limiting examples.

Various implementations relate to decoding. As used herein, "decoding" may include, for example, all or part of the processing performed on a received encoded sequence to produce a final output suitable for display. In various embodiments, such processes include one or more of the processes typically performed by a decoder, such as entropy decoding, inverse quantization, inverse transformation, and differential decoding. In various embodiments, such processes also or alternatively include processes performed by decoders of various implementations described in this application, e.g., filtering image blocks with at least two neural networks that are jointly trained.

As a further example, "decoding" in one embodiment refers to entropy decoding only, in another embodiment refers to differential decoding only, and in another embodiment "decoding" refers to a combination of entropy decoding and differential decoding. Whether the phrase "decoding process" is intended to refer specifically to a subset of operations or to a broader decoding process in general will be clear based on the context of the specific description and is believed to be well understood by those skilled in the art.

Various implementations relate to encoding. In a similar manner to the discussion above regarding "decoding," encoding "as used in this application may include, for example, all or part of the process performed on an input video sequence to produce an encoded bitstream. In various embodiments, such processes include one or more processes typically performed by an encoder, such as partitioning, differential encoding, transformation, quantization, and entropy encoding. In various embodiments, such processes also or alternatively include processes performed by encoders of various implementations described herein, e.g., filtering image blocks with at least two neural networks that are jointly trained.

As a further example, "encoding" in one embodiment refers only to entropy encoding, in another embodiment "encoding" refers only to differential encoding, and in another embodiment "encoding" refers to a combination of differential encoding and entropy encoding. Whether the phrase "encoding process" is intended to refer specifically to a subset of operations or to a broader encoding process in general will become clear based on the context of the specific description and is believed to be well understood by those skilled in the art.

Note that syntax elements, such as coefficients of a neural network, as used herein, are descriptive terms. Therefore, they do not exclude the use of other syntax element names.

When a figure is presented as a flowchart, it should be understood that it also provides a block diagram of a corresponding apparatus. Similarly, when a figure is presented as a block diagram, it should be understood that it also provides a flowchart of a corresponding method/process.

Various embodiments relate to rate-distortion optimization. In particular, during the encoding process, a balance or trade-off between rate and distortion is typically considered, often giving constraints on computational complexity. Rate-distortion optimization is typically formulated as minimizing a rate-distortion function, which is a weighted sum of rate and distortion. There are different approaches to solve the rate-distortion optimization problem. For example, these methods may be based on extensive testing of all coding options, including all considered modes or coding parameter values, with a complete assessment of their coding cost and associated distortion of the reconstructed signal after encoding and decoding. Faster methods can also be used to save coding complexity, in particular to calculate the approximate distortion based on the prediction or prediction residual signal instead of the reconstructed signal. A mixture of these two approaches may also be used, for example by using approximate distortion only for some possible coding options, and full distortion for other coding options. Other methods evaluate only a subset of the possible coding options. More generally, many approaches employ any of a variety of techniques to perform optimization, but optimization is not necessarily a complete assessment of both coding cost and associated distortion.
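As a toy illustration of such a rate-distortion trade-off, assuming per-option distortion and rate estimates are already available (the mode names and numbers below are invented for the example):

```python
def best_coding_option(options, lam):
    """Pick the option minimizing the rate-distortion cost J = D + lambda * R."""
    return min(options, key=lambda o: o["distortion"] + lam * o["rate"])

# Example: three hypothetical coding modes evaluated for one block.
modes = [
    {"name": "intra", "distortion": 120.0, "rate": 40.0},
    {"name": "inter", "distortion": 150.0, "rate": 22.0},
    {"name": "skip",  "distortion": 210.0, "rate": 2.0},
]
print(best_coding_option(modes, lam=2.0)["name"])  # -> "inter" for this lambda
```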

The implementations and aspects described herein may be implemented in, for example, a method or process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (e.g., discussed only as a method), implementation of the features discussed may also be implemented in other forms (e.g., an apparatus or program). For example, the apparatus may be implemented in appropriate hardware, software and firmware. The method may be implemented, for example, in a processor, which refers generally to a processing device, including, for example, a computer, microprocessor, integrated circuit, or programmable logic device. Processors also include communication devices such as computers, cellular telephones, portable/personal digital assistants ("PDAs"), and other devices that facilitate the communication of information between end-users.

Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" as well as other variations means that a particular feature, structure, characteristic, etc. described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation," as well any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.

In addition, the present application may relate to "determining" various information. Determining the information may include, for example, one or more of estimating the information, calculating the information, predicting the information, or retrieving the information from memory.

Further, the present application may relate to "accessing" various information. Accessing information may include, for example, one or more of receiving information, retrieving information (e.g., from memory), storing information, moving information, copying information, calculating information, determining information, predicting information, or estimating information.

In addition, this application may refer to "receiving" various information. As with "accessing", receiving is intended to be a broad term. Receiving the information may include, for example, one or more of accessing the information or retrieving the information (for example, from memory). Further, "receiving" is typically involved, in one way or another, during operations such as storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.

It should be understood that the use of any of the following "/", "and/or", and "at least one of", for example in the cases of "A/B", "A and/or B", and "at least one of A and B", is intended to encompass the selection of only the first listed option (A), or only the second listed option (B), or the selection of both options (A and B). As a further example, in the cases of "A, B and/or C" and "at least one of A, B and C", such phrasing is intended to encompass the selection of only the first listed option (A), or only the second listed option (B), or only the third listed option (C), or only the first and second listed options (A and B), or only the first and third listed options (A and C), or only the second and third listed options (B and C), or all three options (A and B and C). This may be extended, as will be clear to one of ordinary skill in this and related arts, for as many items as are listed.

Furthermore, as used herein, the word "signal" refers to, among other things, indicating something to a corresponding decoder. For example, in certain embodiments the encoder signals a particular one of the coefficients of the neural network. In this way, in an embodiment, the same parameters are used at both the encoder side and the decoder side. Thus, for example, an encoder may transmit (explicit signaling) particular parameters to the decoder so that the decoder may use the same particular parameters. Conversely, if the decoder already has the particular parameters, as well as others, then signaling may be used without transmitting (implicit signaling) to simply allow the decoder to know and select the particular parameters. By avoiding the transmission of any actual functions, a bit saving is realized in various embodiments. It should be understood that signaling may be accomplished in a variety of ways. For example, in various embodiments, one or more syntax elements, flags, and so forth are used to signal information to a corresponding decoder. While the foregoing relates to the verb form of the word "signal", the word "signal" may also be used herein as a noun.
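As a purely illustrative sketch of explicit signaling, an encoder could write an index selecting one of several neural-network coefficient sets and the decoder could parse the same index to select identical parameters. The syntax element name nn_filter_set_idx and the bitstream reader/writer interfaces below are hypothetical assumptions introduced only for this sketch; they are not syntax elements defined by this application.

```python
# Hypothetical explicit-signaling sketch; the bitstream_writer/bitstream_reader objects
# are assumed to expose simple fixed-length unsigned-integer read/write methods.
def write_filter_choice(bitstream_writer, nn_filter_set_idx, num_sets):
    # Explicit signaling: the chosen index is transmitted in the bitstream.
    num_bits = max(1, (num_sets - 1).bit_length())
    bitstream_writer.write_uint(nn_filter_set_idx, num_bits)

def read_filter_choice(bitstream_reader, num_sets, filter_sets):
    # The decoder parses the same syntax element and selects the matching coefficient set.
    num_bits = max(1, (num_sets - 1).bit_length())
    idx = bitstream_reader.read_uint(num_bits)
    return filter_sets[idx]
```

With implicit signaling, no index would be transmitted; the decoder would instead derive the choice from information it already has.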

As will be apparent to one of ordinary skill in the art, implementations may produce various signals formatted to carry information that may be stored or transmitted, for example. The information may include, for example, instructions for performing a method, or data generated by one of the described implementations. For example, the signal may be formatted to carry a bitstream of the described embodiments. Such signals may be formatted, for example, as electromagnetic waves (e.g., using the radio frequency portion of the spectrum) or as baseband signals. Formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information carried by the signal may be, for example, analog or digital information. As is known, signals may be transmitted over a variety of different wired or wireless links. The signal may be stored on a processor readable medium.

We describe a number of embodiments. The features of these embodiments may be provided separately or in any combination across various claim categories and types. Furthermore, embodiments may include one or more of the following features, devices, or aspects, alone or in any combination, across the various claim categories and types:

Filtering the reconstructed image block with at least two neural networks, wherein the two neural networks are jointly trained.

A bitstream or signal comprising one or more of the described syntax elements or variants thereof.

A bitstream or signal comprising syntax conveying information generated according to any of the described embodiments.

Inserting, in the signaling, syntax elements that enable the decoder to adapt the in-loop and/or out-of-loop filtering in a manner corresponding to that used by the encoder.

Creating and/or transmitting and/or receiving and/or decoding a bitstream or signal comprising one or more of the described syntax elements or variants thereof.

Creation and/or transmission and/or reception and/or decoding according to any of the embodiments described.

A method, process, apparatus, medium storing instructions, medium storing data, or signal according to any of the embodiments described.

A TV, set-top box, cellular phone, tablet or other electronic device that performs in-loop and out-of-loop filtering processing according to any of the embodiments described.

A TV, set-top box, cellular phone, tablet or other electronic device that performs in-loop and out-of-loop filtering processes according to any of the described embodiments, and displays (e.g., using a monitor, screen or other type of display) the resulting image.

A TV, set-top box, cellular phone, tablet or other electronic device that selects (e.g., using a tuner) a channel to receive a signal comprising an encoded image, and performs in-loop and out-of-loop filtering processing according to any of the described embodiments.

A TV, set-top box, cellular phone, tablet or other electronic device that receives over the air (e.g., using an antenna) a signal comprising encoded images and performs in-loop and out-of-loop filtering processing according to any of the described embodiments.

According to a general aspect of at least one embodiment, there is presented a method for video decoding, comprising:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

According to a general aspect of at least one embodiment, an apparatus for video decoding is presented, comprising one or more processors configured to:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

In an embodiment, the first reconstructed version of the image block is used as an additional input to train the second neural network.

In an embodiment, filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block to be used as a reference comprises:

-filtering the first reconstructed version of the image block by an initial neural network to form an initial reconstructed version of the image block, the initial reconstructed version to be used as input to the first and second neural networks;

wherein the initial neural network is trained to optimize objective and perceptual filtering.

In an embodiment, an intermediate representation of the output of the first neural network is used as an input to the second neural network to train the second neural network.
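The following PyTorch-style sketch illustrates the decoder-side cascade described above: a first network produces the second reconstructed version used as a reference, and a second network produces the third reconstructed version sent to display. The simple residual convolutional architecture, the single-channel (e.g., luma-only) input, and all names are illustrative assumptions; the actual network architectures are not specified here.

```python
# Illustrative sketch only: the residual CNN below is an assumption, not an architecture
# mandated by this application.
import torch
import torch.nn as nn

class SimpleCNNFilter(nn.Module):
    """A small residual convolutional filter operating on a single-channel block."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # residual filtering of the reconstruction

def filter_for_reference_and_display(first_reconstruction, cnn_a, cnn_b):
    # First reconstruction -> first network -> second reconstruction (stored as a reference).
    second = cnn_a(first_reconstruction)
    # Second reconstruction -> second network -> third reconstruction (sent to display).
    third = cnn_b(second)
    return second, third
```

In the variant using an intermediate representation, the tensor passed to the second network would be taken from a layer preceding the last layer of the first network rather than from its final output.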

According to a general aspect of at least one embodiment, there is presented a method for encoding video data, comprising:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

According to a general aspect of at least one embodiment, an apparatus for encoding video data is presented, comprising one or more processors configured to:

accessing a first reconstructed version of an image block of an encoded video; and

filtering the first reconstructed version of the image block by a first neural network to form a second reconstructed version of the image block, the second reconstructed version to be used as a reference,

filtering, by a second neural network, the second reconstructed version of the image block to form a third reconstructed version of the image block to be displayed,

wherein the first neural network and the second neural network are jointly trained.

In an embodiment, the first reconstructed version of the image block is used as an additional input to train the second neural network.

In an embodiment, filtering, by a first neural network, the first reconstructed version of the image block to form a second reconstructed version of the image block to be used as a reference comprises:

-filtering the first reconstructed version of the image block by an initial neural network to form an initial reconstructed version of the image block, the initial reconstructed version to be used as input to the first and second neural networks;

wherein the initial neural network is trained to optimize objective and perceptual filtering.
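A minimal sketch of this variant, reusing the hypothetical SimpleCNNFilter from the earlier sketch, is given below: an initial network produces a shared initial reconstruction that is then fed to both the reference-oriented network and the display-oriented network. All names are assumptions introduced for illustration.

```python
def filter_with_initial_network(first_reconstruction, cnn_0, cnn_a, cnn_b):
    # Initial network: shared filtering trained for both objective and perceptual quality.
    initial = cnn_0(first_reconstruction)
    # Reference branch: second reconstructed version, to be used as a reference.
    second = cnn_a(initial)
    # Display branch: third reconstructed version, to be displayed.
    third = cnn_b(initial)
    return second, third
```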

In an embodiment, a first loss function between the image block and the second reconstructed version of the image block is used to train the first neural network.

In an embodiment, the first loss function is an L1 or L2 norm.

In an embodiment, a second loss function between the image block and the third reconstructed version of the image block is used to train the second neural network.

In an embodiment, the second loss function is determined using a generative adversarial network (GAN).

In an embodiment, an intermediate representation of the output of the first neural network is used as an input to the second neural network to train the second neural network.
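As a non-limiting sketch of the joint training described above (assuming the hypothetical networks from the earlier sketches), the first loss can be an L1 term on the reference-oriented output, the second loss a perceptual or adversarial term on the display-oriented output, and the two terms can be summed into a joint loss back-propagated through both networks. The perceptual_loss callable and the weights alpha/beta are placeholders; a full generative adversarial setup would additionally require a discriminator and its own update step.

```python
import torch
import torch.nn.functional as F

def joint_training_step(cnn_a, cnn_b, optimizer, first_reconstruction, original,
                        perceptual_loss, alpha: float = 1.0, beta: float = 1.0):
    # Forward pass through the cascade of the two networks.
    second = cnn_a(first_reconstruction)   # reference-oriented output
    third = cnn_b(second)                  # display-oriented output
    # First loss: L1 norm between the original block and the second reconstructed version
    # (an L2/MSE loss would be an alternative).
    loss_a = F.l1_loss(second, original)
    # Second loss: perceptual/adversarial term between the original block and the third version.
    loss_b = perceptual_loss(third, original)
    # Joint loss: gradients flow through both networks, so they are trained jointly.
    loss = alpha * loss_a + beta * loss_b
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

The optimizer is assumed to cover the parameters of both networks, for example torch.optim.Adam(list(cnn_a.parameters()) + list(cnn_b.parameters()), lr=1e-4), which makes the joint back-propagation explicit.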
