Predictive coding using neural networks

Document No.: 174671    Publication date: 2021-10-29

Abstract: This technology, 利用神经网络的预测性编码 (Predictive coding using neural networks), was created by 翟杰夫, X·张, 周小松, 辛军, 吴锡荣, and 苏野平 on 2020-01-13. Systems and methods for video compression are disclosed that utilize neural networks for predictive video coding. The process employed combines multiple neural network libraries with codec system components to perform encoding and decoding of video data.

1. A method for encoding a video stream, comprising:

generating, for a pixel block of an input frame to be encoded, a pixel block prediction based on input data derived from reference data of previously encoded data of the video stream;

generating a residual block representing a difference between the pixel block and the pixel block prediction;

encoding the residual block; and

encapsulating the coded residual block and associated coding parameters in a coded video stream,

wherein at least one of said generating a pixel block prediction and said encoding the residual block is performed using a neural network-based process.

2. The method of claim 1, further comprising:

generating a second pixel block prediction using one of an inter-frame-based prediction and an intra-frame-based prediction when the neural network-based process is a neural network-based prediction; and

generating a residual block using the second pixel block prediction when the estimate of distortion associated with the pixel block prediction is higher than the estimate of distortion associated with the second pixel block prediction.

3. The method of claim 1, wherein when the neural network-based process is a neural network-based prediction, the neural network-based prediction comprises extracting feature vectors from the reference data.

4. The method of claim 1, wherein when the neural network-based process is a neural network-based prediction, the neural network-based prediction comprises transforming the reference data from a pixel domain to a transform domain.

5. The method of claim 1, wherein the neural network-based process is based on a combination of outputs of a neural network library and corresponding likelihoods of those outputs.

6. The method of claim 1, wherein the neural network library used in the neural network-based process is characterized by a network architecture, an activation function, a hyper-parameter, a training cost function, a training data class, a weight, or any combination thereof.

7. The method of claim 1, wherein when the neural network-based process is neural network-based encoding, the encoding the residual block comprises performing transform-based encoding when a confidence score associated with the neural network-based encoding is below a threshold.

8. The method of claim 1, wherein when the neural network-based process is neural network-based encoding, the neural network-based encoding comprises encoding based on data associated with the generating pixel block prediction.

9. The method of claim 1, wherein when the neural network-based process is neural network-based encoding, the neural network-based encoding comprises extracting feature vectors from the residual block.

10. The method of claim 1, wherein when the neural network-based process is neural network-based encoding, the neural network-based encoding comprises transforming the residual block from a pixel domain to a transform domain.

11. The method of claim 1, wherein a neural network bank used in the neural network-based process is trained to operate at a different bit rate or with a different cost function than other neural network banks used in the neural network-based process.

12. The method of claim 1, wherein the generating a pixel block prediction is based on reference data within a spatio-temporal neighborhood of the pixel block and at one or more data resolutions.

13. The method of claim 1, wherein when the neural network-based process is neural network-based prediction, weights of a neural network library used in the neural network-based prediction are updated based on reference data and prediction data, the prediction data generated from the reference data during encoding of the video stream.

14. The method of claim 13, wherein the prediction data is generated using the neural network-based prediction, an intra-frame-based prediction, or an inter-frame-based prediction.

15. The method of claim 1, wherein when the neural network-based process is neural network-based encoding, weights of a neural network library used in the neural network-based encoding are updated based on a residual block and a coded residual block, the coded residual block being generated from the residual block during encoding of the video stream.

16. The method of claim 15, wherein the coded residual block is generated by the neural network-based encoding or transform-based encoding.

17. The method of claim 15, wherein the weights are updated based on prediction related data generated during encoding of the video stream.

18. A method for decoding an encoded video stream, the method comprising:

for a coded residual block to be decoded, extracting the coded residual block and associated coding parameters from the coded video stream;

decoding the encoded residual block based on the encoding parameters, thereby generating a decoded residual block;

generating a pixel block prediction based on input data derived from reference data of previously decoded data of the encoded video stream; and

generating a reconstructed pixel block, the reconstructed pixel block being a sum of the decoded residual block and the pixel block prediction,

wherein at least one of the generating a pixel block prediction and decoding the encoded residual block is performed using a neural network-based process.

19. The method of claim 18, wherein the neural network-based process is based on a combination of outputs of a neural network library and corresponding likelihoods of those outputs.

20. The method of claim 18, wherein the neural network library used in the neural network-based process is characterized by a network architecture, an activation function, a hyper-parameter, a training cost function, a training data class, a weight, or any combination thereof.

21. The method of claim 18, wherein when the neural network-based process is a neural network-based decoding, the neural network-based decoding comprises decoding based on data associated with the generating pixel block prediction.

22. The method of claim 18, wherein when the neural network-based process is a neural network-based prediction, the generating a pixel block prediction comprises:

generating a pixel block prediction based on weights derived from weights of a neural network library used by the neural network-based prediction for generating a prediction of neighboring pixel blocks.

23. The method of claim 18, wherein when the neural network-based process is neural network-based decoding, the decoding the encoded residual block comprises:

generating a block of decoded residues based on weights derived from weights of a neural network library used by the neural network-based decoding for decoding neighboring residual blocks.

24. A computer system, comprising:

at least one processor;

at least one memory including instructions configured to be executed by the at least one processor to perform a method comprising:

for a coded residual block to be decoded, extracting the coded residual block and associated coding parameters from the coded video stream;

decoding the encoded residual block based on the encoding parameters, thereby generating a decoded residual block;

generating a pixel block prediction based on input data derived from reference data of previously decoded data of the encoded video stream; and

generating a reconstructed pixel block, the reconstructed pixel block being a sum of the decoded residual block and the pixel block prediction,

wherein at least one of the generating a pixel block prediction and decoding the encoded residual block is performed using a neural network-based process.

25. The system of claim 24, wherein the neural network-based process is based on a combination of outputs of a neural network library and corresponding likelihoods of those outputs.

26. The system of claim 24, wherein the neural network library used in the neural network-based process is characterized by a network architecture, an activation function, a hyper-parameter, a training cost function, a training data class, a weight, or any combination thereof.

27. The system of claim 24, wherein when the neural network-based process is a neural network-based decoding, the neural network-based decoding comprises decoding based on data associated with the generating pixel block prediction.

28. The system of claim 24, wherein when the neural network-based process is a neural network-based prediction, the generating a pixel block prediction comprises:

generating a pixel block prediction based on weights derived from weights of a neural network library used by the neural network-based prediction for generating a prediction of neighboring pixel blocks.

29. The system of claim 24, wherein when the neural network-based process is neural network-based decoding, the decoding the encoded residual block comprises:

generating a block of decoded residues based on weights derived from weights of a neural network library used by the neural network-based decoding for decoding neighboring residual blocks.

Background

The present disclosure relates to video compression technology.

Neural networks have found use in myriad applications across a variety of fields, including medicine, finance, engineering, and physics. Generally speaking, neural networks are trained to predict information of interest based on observations. Training may be achieved through a supervised learning process, in which correlations between exemplary pairs of input data (observations) and output data (information of interest) are learned. The larger the neural network, the better it can model the complex relationship between the input data and the output data; however, the larger the network, the greater the computational complexity of training. The recent increase in computing power of end-user computers has made training of large neural networks more practical, making neural networks a reasonable solution for analyzing complex data. Meanwhile, recent developments in machine learning technology enable neural networks to be better applied in the field of image and video compression, thereby addressing the growing demands of streaming High Definition (HD), High Dynamic Range (HDR), and Wide Color Gamut (WCG) content.

Generally speaking, a neural network consists of a system of spatially connected nodes ("neurons") in a given architecture (usually layers), the nodes in one layer feeding information to the nodes in the next layer connected to it. Training the neural network results in "knowledge" represented by the strength of the connections between nodes ("synaptic weights"). The input data of the neural network is fed as a weighted combination into each node of the first layer of the network. Next, the weighted combination of inputs for each node is transformed according to the activation function, resulting in output data for the node. The output data from the first layer is then propagated and similarly processed in other intermediate layers of the network, with the last layer providing the output data. Thus, a neural network is characterized by the structure of its nodes and the activation functions of those nodes. The weights associated with the inputs to each node (i.e., the connection strength of each node) are learned by an iterative training process (e.g., a back propagation algorithm) from training parameters (learning rate and cost functions) and based on corresponding examples of input and output data.
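By way of illustration, the following is a minimal sketch of the feed-forward computation described above: each node forms a weighted combination of its inputs and applies an activation function, and the output of one layer is propagated to the next. The sigmoid activation and the layer sizes are illustrative assumptions, not specifics of the disclosure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Propagate input x through a list of (weights, bias) layers."""
    out = x
    for W, b in layers:
        out = sigmoid(W @ out + b)   # weighted combination of inputs, then activation
    return out

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 16)), np.zeros(8)),   # first (hidden) layer
          (rng.standard_normal((4, 8)), np.zeros(4))]    # last (output) layer
print(forward(rng.standard_normal(16), layers))
```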

Drawings

Fig. 1 is a functional block diagram of an encoding system according to one aspect of the present disclosure.

Fig. 2 is a functional block diagram of a decoding system according to one aspect of the present disclosure.

FIG. 3 is a block diagram of a single-bank neural network-based predictor in accordance with an aspect of the present disclosure.

FIG. 4 is a block diagram of a multi-bank neural network based predictor in accordance with an aspect of the present disclosure.

Fig. 5 is a block diagram of a multi-bank neural network based encoder and decoder in accordance with an aspect of the present disclosure.

Fig. 6 is a block diagram of an auto-encoder according to one aspect of the present disclosure.

Fig. 7 is a block diagram of an encoder and decoder according to an aspect of the present disclosure.

Detailed Description

In the present disclosure, machine learning techniques are integrated into predictive video coding systems, where neural network libraries are combined with codec system components to perform encoding and decoding of video data.

Reference will now be made in detail to aspects of the present disclosure, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the embodiments.

Systems and methods described in this disclosure include techniques for encoding video streams using Neural Network (NN)-based encoding as well as transform-based encoding. In one aspect, for a pixel block of an input frame to be encoded, a pixel block prediction may be generated using NN-based prediction or intra/inter-based prediction; the prediction may be performed based on input data derived from reference data of previously encoded data of the video stream. A residual block may be generated from the difference between the pixel block and the pixel block prediction. The residual block may then be encoded, and the resulting encoded residual block may be encapsulated with associated encoding parameters in the encoded video stream. The residual block may be encoded according to an NN-based encoding method; alternatively, the residual block may be encoded with transform-based encoding when a confidence score associated with the NN-based encoding is below a certain threshold.

Aspects of the present disclosure also describe systems and methods for decoding an encoded video stream. In one aspect, for a coded residual block to be decoded, the coded residual block and associated coding parameters may be extracted from the coded video stream. The encoded residual block may be decoded based on the encoding parameters using NN-based decoding or transform-based decoding, resulting in a decoded residual block. A pixel block prediction may then be generated based on reference data of previously decoded data of the encoded video stream, and the reconstructed pixel block may be obtained as the sum of the decoded residual block and the pixel block prediction. Similar to the prediction performed by the encoder, the pixel block prediction in the decoder may be performed by NN-based prediction or intra/inter-based prediction.
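A simplified sketch of this block-level decoding flow is given below, assuming the residual decoding and prediction stages are supplied as interchangeable functions (NN-based or conventional); only the final reconstruction step, summing the decoded residual and the prediction, is taken directly from the description above.

```python
import numpy as np

def reconstruct_block(coded_residual, params, reference_data,
                      decode_residual, predict_block):
    """Decode one pixel block: decode the residual, form the prediction, sum the two."""
    residual = decode_residual(coded_residual, params)   # NN-based or transform-based
    prediction = predict_block(reference_data, params)   # NN-based or intra/inter-based
    return residual + prediction                         # reconstructed pixel block

# Toy usage with pass-through stand-ins for the decoding and prediction stages.
block = reconstruct_block(
    coded_residual=np.ones((4, 4)),
    params={},
    reference_data=np.zeros((4, 4)),
    decode_residual=lambda r, p: r,
    predict_block=lambda ref, p: ref + 2.0,
)
print(block)
```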

Fig. 1 is a functional block diagram of an encoding system 100 according to one aspect of the present disclosure. The encoding system 100 may include an encoder 110, a decoder 120, a loop filter 130, a reference picture buffer 140, a predictor 150, a controller 160, an entropy encoder 170, and a syntax unit 180. The predictor 150 may predict image data used during encoding of the newly presented input frame 105; it may provide an estimate for the input frame 105 based on reference data retrieved from the reference picture buffer 140. The encoder 110 may then encode the differences between each input frame and its predicted version (i.e., the residual frame), applying NN-based encoding, transform-based encoding, or a combination of both techniques. Next, the encoder 110 may provide the encoded residual frame to the entropy encoder 170. In general, entropy coding is a lossless process, i.e., the coded data at the input of the entropy coder can be fully recovered from the entropy-coded data at its output. The entropy encoder 170 may further reduce the bandwidth of the code generated by the encoder 110 by applying entropy encoding methods such as run-length encoding, Huffman encoding, Golomb encoding, context-adaptive binary arithmetic encoding, or any other suitable encoding method. After entropy encoding, the entropy-encoded frame is presented to syntax unit 180. Syntax unit 180 may encapsulate the entropy-encoded frames along with the corresponding encoding parameters in an encoded video stream that conforms to a governing encoding protocol. The encoder 110 may also provide the encoded frames to the decoder 120. The decoder 120 may decode the encoded frame to generate a decoded frame, i.e., a reconstructed frame. Next, loop filter 130 may perform one or more filtering operations on the reconstructed frame to address artifacts introduced by the processing performed by encoder 110 and decoder 120. The reference picture buffer 140 may store the filtered reconstructed frame. The predictor 150 may then use these stored reference frames to predict later received frames.

The encoding process described herein with respect to a frame may also be performed at a lower granularity, with respect to sub-regions of the frame. For example, the encoder 110, decoder 120, and predictor 150 may operate independently on each pixel block, slice, largest coding unit ("LCU"), or coding tree unit ("CTU") of a frame, whether covering one frame or multiple frames.

The encoder 110 may include a subtractor 112, a transform-based encoder 114, an NN-based encoder 116, and a code selector 118. The transform-based encoder 114 typically includes a transformer and a quantizer. The encoder 110 may receive the input frame 105 at an input of the subtractor 112. Subtractor 112 may subtract, from the received frame, its corresponding predicted frame provided by predictor 150. This subtraction operation results in a residual frame. The encoder 110 may then decide which encoding technique to apply to the residual frame (or to each pixel block within the residual frame): NN-based encoding 116 or transform-based encoding 114. For example, the code selector 118 may receive a confidence score from the NN-based encoder 116. If the confidence score is below a certain threshold, the code selector 118 may select the code provided by the transform-based encoder 114; otherwise, it may select the code provided by the NN-based encoder 116. Alternatively, the encoder 110 may determine whether transform-based encoding 114 or NN-based encoding 116 is to be applied for a certain frame or block based on, e.g., encoding parameters (such as prediction metadata).

If transform-based encoding is applied, transform-based encoder 114 may transform the received residual frame or block of pixels — mapping the residual frame or block of pixels from its original pixel domain into a transform domain, thereby producing a transform frame or block composed of transform coefficients. After the transform, a quantizer may quantize the transform coefficients. Alternatively, if NN-based encoding is applied, NN-based encoder 116 may encode the received residual frame or block of pixels, as described in detail below. As discussed, the transform-based encoder 114 and the NN-based encoder 116 may both be used in parallel, and the code selector 118 may select the output of either, for example, based on the confidence scores.
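A minimal sketch of the selection logic of code selector 118 follows; the threshold value and the stand-in encoder functions are illustrative assumptions, with only the rule (fall back to transform-based coding when the NN confidence is below a threshold) taken from the description above.

```python
def select_code(residual_block, nn_encode, transform_encode, threshold=0.8):
    """Select between the NN-based and transform-based code for one residual block."""
    nn_code, confidence = nn_encode(residual_block)       # NN encoder reports a confidence score
    if confidence < threshold:
        return transform_encode(residual_block), "transform"
    return nn_code, "nn"

# Toy usage with stand-in encoders.
code, chosen = select_code(
    residual_block=[1, 2, 3],
    nn_encode=lambda blk: (blk, 0.6),          # returns (code, confidence)
    transform_encode=lambda blk: blk[::-1],    # stand-in for transform + quantization
)
print(chosen)   # -> "transform", since 0.6 < 0.8
```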

The transform-based encoder 114 may utilize a plurality of transform modes M, as may be determined by the controller 160. Generally, transform-based coding reduces spatial redundancy within a pixel block by compacting the energy of the pixels into fewer transform coefficients within the transform block, allowing more bits to be spent on high-energy coefficients while spending fewer bits, or no bits at all, on low-energy coefficients. For example, transform-based encoder 114 may apply a transform mode such as a discrete cosine transform ("DCT"), a discrete sine transform ("DST"), a Walsh-Hadamard transform, a Haar transform, or a Daubechies wavelet transform. In one aspect, the controller 160 may select the transform mode M to be applied, configure the transformer of the transform-based encoder 114 accordingly, and store the mode M explicitly or implicitly in a record of coding parameters. After operation of the transformer, the quantizer of the transform-based encoder 114 may operate according to one or more quantization parameters QP and may apply uniform or non-uniform quantization techniques according to settings that may be determined by the controller 160. In one aspect, the quantization parameter QP may be a vector; in this case, the quantization operation may employ different quantization parameters for each transform block and for each coefficient or coefficient group within each transform block.
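The following sketch illustrates one such transform mode, a 2-D DCT followed by uniform quantization with a single scalar QP, and its inverse; the block size, QP value, and the use of SciPy's DCT routines are assumptions made for the example only.

```python
import numpy as np
from scipy.fftpack import dct, idct   # assumes SciPy is available

def dct2(block):
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

def encode_block(residual, qp):
    coeffs = dct2(residual)                    # pixel domain -> transform domain
    return np.round(coeffs / qp).astype(int)   # uniform quantization (lossy)

def decode_block(levels, qp):
    return idct2(levels * qp)                  # dequantize, then inverse transform

residual = np.random.default_rng(1).standard_normal((8, 8))
rec = decode_block(encode_block(residual, qp=0.5), qp=0.5)
print("coding error:", float(np.abs(rec - residual).max()))   # non-zero: quantization is lossy
```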

As described above, the controller 160 may set encoding parameters that may be used to configure the encoder 110, including parameters of the transform-based encoder 114 (e.g., parameters of the transformer and quantizer) and parameters of the NN-based encoder 116. Such encoding parameters may further include parameters that control the logic determining whether a transform-based encoder or an NN-based encoder is used to encode a particular frame or pixel block. The controller 160 may also set encoding parameters used to configure the entropy encoder 170 and the syntax unit 180. The encoding parameters may be encapsulated in the encoded video stream 190 along with the encoded residual and are thus available to the decoding system 200 (Fig. 2). The relevant encoding parameters may also be provided to the decoder 120, making them available to the transform-based decoder 124 and the NN-based decoder 126.

Video coding systems that rely on predictive coding techniques typically include a decoding function. In one aspect, the encoding system 100 of Fig. 1 includes a decoder 120 that recovers image data for frames designated as "reference frames," which are frames to be used for prediction by the predictor 150. In the absence of transmission errors or other operational anomalies, the decoder 120 should produce the same restored reference frames as those generated by the far-end video decoding system 200 of Fig. 2. In general, the decoder 120 reverses the encoding operations applied by the encoder 110. For example, the decoder 120 may include a transform-based decoder 124, an NN-based decoder 126, and an adder 122. The transform-based decoder 124 may include an inverse quantizer and an inverse transformer. The inverse quantizer may reverse the operation of the quantizer of the transform-based encoder 114, performing uniform or non-uniform dequantization as specified by QP. Similarly, the inverse transformer may reverse the operation of the transformer of the transform-based encoder 114 using the transform mode specified by M. Therefore, to reverse the encoding operation of the transform-based encoder 114, the inverse quantizer and the inverse transformer use the same quantization parameter QP and transform mode M as their counterparts. Note that quantization is a lossy operation: the transform coefficients are truncated by the quantizer (according to QP), so the initial values of these coefficients cannot be recovered by the inverse quantizer, resulting in coding errors.

Adder 122 may reverse the operation performed by subtractor 112. Thus, the output of the transform-based decoder 124 or the NN-based decoder 126 may be an encoded/decoded version of the residual frame output by the subtractor 112, i.e., a reconstructed residual frame. Adder 122 may add the reconstructed residual frame to a predicted frame provided by predictor 150 (typically the same predicted frame used to generate the residual frame as provided by predictor 150 at the output of subtractor 112). Thus, an encoded/decoded version of the input frame 105, i.e., a reconstructed input frame, may be obtained at the output of the adder 122.

The loop filter 130 may obtain reconstructed input frames from the adder 122 and may perform various filtering operations on them, in particular to mitigate artifacts generated by independently processing data from different pixel blocks, as may be done by the encoder 110 and the decoder 120. Thus, loop filter 130 may include, for example, a deblocking filter 132 and a sample adaptive offset ("SAO") filter 134. Loop filter 130 may also employ other filters that perform adaptive loop filtering ("ALF"), maximum likelihood ("ML") based filtering schemes, deringing, debanding, sharpening, resolution scaling, and other such operations. After filtering, the filtered reconstructed input frame may be stored in the reference picture buffer 140.

The predictor 150 may include a mode selector 152, an intra-based predictor 154, an inter-based predictor 156, and an NN-based predictor 158. The predictor 150 may base the frame or pixel block prediction on previously encoded/decoded frames or pixel blocks accessible from reference data stored in the reference picture buffer 140. The prediction may be done according to one of a plurality of prediction modes that may be determined by the mode selector 152. For example, in an intra-based prediction mode, the predictor may use previously encoded/decoded pixel blocks from the currently encoded input frame itself to generate an estimate of a pixel block of that frame; for this purpose the reference picture buffer 140 may store the encoded/decoded pixel blocks of the input frame currently being encoded. In contrast, in an inter-based prediction mode, the predictor may use previously encoded/decoded pixel blocks from a previous frame, or from both the current frame and previous frames, to generate an estimate of a pixel block of the currently encoded input frame; the reference picture buffer 140 may store these encoded/decoded reference frames. Alternatively, the mode selector 152 may select the NN-based prediction mode to generate an estimate of the currently encoded input frame or of a pixel block within it.

Thus, the inter-based predictor 156 may receive an input pixel block of a new input frame 105 to be encoded. To this end, the inter-based predictor may search the reference picture buffer 140 for a matching pixel block to be used for predicting the input pixel block. The intra-based predictor 154 may likewise search the reference picture buffer 140, but limits its search to reference blocks belonging to the same input frame 105. The NN-based predictor 158 may use information from the same input frame and/or from previous frames to perform the prediction. All of these predictors may generate prediction metadata PM recording parameters used for the prediction, such as identifiers of the reference frame or frames used, the location of the reference blocks used (e.g., motion vectors), or indices and/or parameters of the neural network banks used.

The mode selector 152 may determine the prediction mode or select the final prediction mode. For example, based on the prediction performance of the intra-based predictor 154, the inter-based predictor 156, and/or the NN-based predictor 158, the mode selector 152 may select a prediction mode that results in more accurate prediction. The predicted frame or block of pixels corresponding to the selected prediction mode may then be provided to subtractor 112, which subtractor 112 may generate a residual frame or block based on the predicted frame or block of pixels. In general, mode selector 152 selects the mode that achieves the lowest coding distortion given a target bitrate budget. Exceptions may occur when the coding mode is selected to satisfy other policies that the coding system 100 may follow, such as satisfying a particular channel behavior, or supporting random access or data refresh policies. In one aspect, a multi-hypothesis prediction mode may be employed, in which case the operations of the intra-based predictor 154, the inter-based predictor 156, and/or the NN-based predictor 158 may be repeated for each of a plurality of prediction hypotheses.
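A sketch of this selection rule is given below; using the sum of squared errors as the distortion measure, and ignoring the rate term of the rate-distortion trade-off, are simplifying assumptions for illustration.

```python
import numpy as np

def select_prediction(input_block, candidates):
    """candidates: dict mapping mode name -> predicted block; returns the lowest-distortion mode."""
    def distortion(pred):
        return float(np.sum((input_block - pred) ** 2))   # sum of squared errors
    best_mode = min(candidates, key=lambda m: distortion(candidates[m]))
    return best_mode, candidates[best_mode]

# Toy usage: the inter candidate is closest to the input block, so it is selected.
blk = np.full((4, 4), 10.0)
mode, _ = select_prediction(blk, {"intra": blk + 3.0, "inter": blk + 1.0, "nn": blk + 2.0})
print(mode)   # -> "inter"
```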

The controller 160 may control the overall operation of the encoding system 100. The controller 160 may select the operating parameters of the encoder 110 and predictor 150 based on an analysis of the input pixel blocks and/or based on external constraints such as coding bit rate targets and other operating parameters. For example, the mode selector 152 may output the prediction metadata PM, including the prediction mode and corresponding parameters, to the controller 160. Controller 160 may then add those prediction metadata to a record of all other coding parameters (e.g., M and QP), and may deliver those coding parameters to syntax unit 180 to be packaged with the coded residual.

As described above, during operation the controller 160 can set the operating parameters of the encoder 110 at different granularities of the video frames, either pixel block by pixel block or at a larger granularity (e.g., frame by frame, slice by slice, LCU by LCU, or CTU by CTU). For example, the quantization parameters of the quantizer of transform-based encoder 114 may be modified on a per-pixel-block basis within the encoded frame. In addition, as discussed, the controller 160 may control the operations of the decoder 120, the loop filter 130, the predictor 150, the entropy encoder 170, and the syntax unit 180. For example, predictor 150 may receive control data regarding mode selection, such as the particular mode to be employed and the size of the search window within the reference data, and loop filter 130 may receive control data regarding filter selection and its parameters.

Fig. 2 is a functional block diagram of a decoding system 200 according to an aspect of the present disclosure. The decoding system 200 may include a decoder 210, a loop filter 230, a reference picture buffer 240, a predictor 250, a controller 260, an entropy decoder 270, and a syntax unit 280.

The syntax unit 280 may receive the encoded video stream 190 of fig. 1 and may parse the data stream into its constituent parts, including data representing the encoding parameters and the entropy encoded residuals. Data representing the encoding parameters may be delivered to the controller 260 and data representing the entropy encoded residue may be delivered to the entropy decoder 270. The entropy decoder 270 may perform entropy decoding to reverse the process performed by the entropy encoder 170, and may present the encoded residual (produced by the encoder 110 of fig. 1) to the decoder 210. The predictor 250 may use these reference frames or blocks of pixels specified by the prediction metadata PM provided in the coding parameters to predict the currently decoded frame or block based on the reference frames available in the reference picture buffer 240. The predicted frame or block may then be provided to decoder 210. The decoder 210 may generally generate reconstructed video frames by reversing the encoding operations applied by the encoder 110. Loop filter 230 may filter the reconstructed video frame. The filtered reconstructed video frame, i.e., output video 290, may then be output from the decoding system. If the filtered reconstructed video frames are designated for use as reference frames, they may be stored in the reference picture buffer 240.

In cooperation with the encoder 110, and in reverse order, the decoder 210 may include a transform-based decoder 214, an NN-based decoder 216, and an adder 212. The transform-based decoder 214 and the NN-based decoder 216 may reverse the processes performed by the transform-based encoder 114 and the NN-based encoder 116, respectively. For example, for those frames or pixel blocks encoded by the transform-based encoder 114, the inverse quantizer may invert the quantization operation and the inverse transformer may invert the transform operation. Thus, the inverse quantizer may use the quantization parameter QP provided by the coding parameters parsed from the coded video stream, and the inverse transformer may use the transform mode M provided by those same parameters. As discussed, in transform-based coding the quantization operation is generally the main source of coding distortion: the quantizer truncates the data it quantizes, so the output of the inverse quantizer, and in turn the residual frame reconstructed at the output of the inverse transformer, contains coding errors when compared to the data originally presented to the transform-based encoder 114.

The adder 212 may reverse the operation performed by the subtractor 112 of fig. 1. Receiving the predicted frame or block of pixels from predictor 250, adder 212 may add these prediction data to the corresponding reconstructed residual frame or block of pixels provided by transform-based decoder 214 or NN-based decoder 216. Accordingly, adder 212 may output the reconstructed video frame to loop filter 230.

Loop filter 230 may perform various filtering operations on the received reconstructed video frame, as specified by the encoding parameters parsed from encoded video stream 190. For example, the loop filter 230 may include a deblocking filter 232 and an SAO filter 234. Other filters may perform ALF, ML-based filtering schemes, deringing, debanding, sharpening, or resolution scaling, and other similar operations may also be used in loop filter 230. In this way, the operation of the loop filter 230 may simulate the operation of the corresponding loop filter 130 of the encoding system 100. Accordingly, loop filter 230 may output filtered reconstructed video frames, i.e., output video 290. The output video 290 may be used (e.g., displayed, stored, and/or processed) by a host system and/or may be further transmitted to another system.

The reference picture buffer 240 may store reference video frames, such as filtered reconstructed video frames provided by the loop filter 230. Those reference video frames may be used for subsequent prediction of other frames or blocks of pixels. Thus, the predictor 250 may access the reference frames or blocks of pixels stored in the reference picture buffer 240 and may retrieve those specified in the prediction metadata PM. Also, the predictor 250 may employ a prediction method and its associated parameters as specified in the prediction metadata PM. The prediction metadata may be part of the encoding parameters parsed from the encoded video stream 190. Predictor 250 may then perform the prediction and may provide the predicted frame or block of pixels to decoder 210.

The controller 260 may control the overall operation of the decoding system 200. Accordingly, the controller 260 may set the operation parameters of the decoder 210, the loop filter 230, the predictor 250, and the entropy decoder 270 based on the encoding parameters parsed from the encoded video stream 190. The encoding parameters can be set at various granularities of the video frame, e.g., pixel block by pixel block, frame by frame, slice by slice, LCU by LCU, CTU by CTU, or based on other types of regions defined for the input image. These operating parameters may include quantization parameter QP, transform mode M, and prediction metadata PM. The encoding parameters may also include NN-based encoding parameters to be used by the NN-based decoder 216 and NN-based predictor 250. The parameters associated with the neural network banks may include an index of the banks used with respect to a certain pixel block and a weight associated with each bank. For example, the encoding parameters may include parameters of a neural network library that are used to predict a certain frame or block and may be provided to predictor 250. The weights of the neural network library used in the prediction of a particular frame or block may be accessed from a server or may be part of the encoding parameters.

The NN-based predictor 158 may employ one or more neural networks to perform the prediction. FIG. 3 illustrates a single-bank NN-based predictor. For example, the neural network 310 may receive as input previously encoded pixel blocks (reference data) that can be extracted from a currently encoded frame and/or a previously encoded frame accessed from the reference picture buffer 140. Alternatively or in addition, the neural network 310 may receive as input a processed version of a previously encoded block of pixels. Then, based on the weights 320, W_Predictor, the network may generate prediction data, such as a pixel block prediction, that predicts the currently encoded pixel block. In general, W_Predictor is obtained through a training process in which a neural network "learns" the correlation between pairs of input data (observations) and output data (information of interest). Thus, when presented with a new observation, the bank predicts the information of interest based on its learned weights. As applied by the aspects disclosed herein, the neural network 310 may be trained to predict a currently encoded block of pixels based on previously encoded blocks of pixels (typically within the spatio-temporal neighborhood of the currently encoded block of pixels) or based on representations of these previously encoded blocks of pixels. Such representations may include, for example, transform coefficients provided by transforming reference data from a pixel domain to a transform domain or by extracting feature vectors from the reference data. As discussed, to render a certain pixel block prediction in the decoder, predictor 250 may obtain access to data related to the same bank used in the encoder in the prediction of that pixel block, e.g., the bank index, the weights of the bank, and reference data for the prediction (e.g., previously encoded pixel blocks in a predetermined neighborhood of the pixel block to be predicted).

In one aspect, more than one bank may be used to generate the prediction of the pixel block. Fig. 4 illustrates an NN-based predictor that may include a preprocessor 410, a plurality of neural network libraries (libraries 420, 430, and 440), and a mixer 450. The input to the preprocessor 410 may be reference data accessible from the reference picture buffer 140. Such reference data may include previously encoded image information, such as previously encoded pixel blocks from the currently encoded frame or from other previously encoded frames. The reference data may be further processed by the preprocessor 410 to generate a plurality of versions InB_n. For example, the preprocessor 410 may convert the reference data from its pixel domain to other data domains and/or may extract feature vectors from the reference data. Alternatively or in addition, the preprocessor 410 may deliver the reference data as is to the libraries 420, 430, and 440. In one aspect, the preprocessor 410 may extract data from reference data within various neighborhoods of the currently encoded block, where the spatio-temporal size and image resolution of these neighborhoods may vary. For example, the reference data may be sub-sampled to generate an image pyramid of different resolutions, which may be hierarchically processed to generate the prediction data. Thus, a predictor (NN-based or inter/intra-based) may perform an initial prediction based on a low-resolution version of the reference data, which may then be refined based on a higher-resolution version of the reference data. Such hierarchical processing provides more stable and accurate predictions, especially when processing high-motion video content.

The number of libraries that make up the NN-based predictor 158 and the manner in which their predictions may be combined may be learned by the neural network. Thus, for example, bank n may be trained to predict the pixel block, producing a prediction PB_n and a likelihood of that prediction (denoted P_n). The mixer 450 may then combine the various block predictions PB_1, PB_2, ..., PB_N based on their respective likelihoods P_1, P_2, ..., P_N. For example, mixer 450 may combine the predictions based on a linear combination, PB = Σ_n PB_n · P_n, or more generally via any non-linear function PB = f(PB_n, P_n), n = [1, N]. The number of banks to be used in the prediction of a certain pixel block may be determined by selecting a subset of predictors with the highest likelihoods, where a higher likelihood may correspond to a lower prediction error.
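A sketch of the linear-combination case of mixer 450 is shown below; normalizing the likelihoods so that they sum to one, and the optional top-k selection, are assumptions layered on the combination rule PB = Σ_n PB_n · P_n described above.

```python
import numpy as np

def mix_predictions(preds, likelihoods, top_k=None):
    """Combine bank predictions PB_n weighted by their likelihoods P_n."""
    preds = np.asarray(preds, dtype=float)        # shape (N, H, W)
    p = np.asarray(likelihoods, dtype=float)
    if top_k is not None:                         # keep only the most likely banks
        keep = np.argsort(p)[-top_k:]
        preds, p = preds[keep], p[keep]
    p = p / p.sum()                               # normalized mixing weights
    return np.tensordot(p, preds, axes=1)         # PB = sum_n P_n * PB_n

# Toy usage: three 2x2 bank predictions, weighted toward the most likely bank.
preds = [np.full((2, 2), v) for v in (1.0, 2.0, 3.0)]
print(mix_predictions(preds, likelihoods=[0.2, 0.3, 0.5]))
```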

In one aspect of the disclosure, a neural network may also be used to encode and decode residual frames. Thus, the NN-based encoder 116 may be trained to generate encoded residual frames when presented with residual frames, and the NN-based decoder 216 may be trained to generate decoded residual frames when presented with encoded residual frames. In one aspect, prediction related data may also be fed into the NN-based encoder 116 and the NN-based decoder 216. The prediction related data may comprise block predictions and corresponding prediction parameters. For example, the prediction parameters may include the type of predictor used (NN-based, inter-based, or intra-based predictor). If an NN-based predictor is used, the prediction parameters may also include the indices of the banks used and their likelihoods P_n.

Similar to the NN-based predictor, the NN-based encoder 116 and NN-based decoder 216 may include one or more neural network libraries. Fig. 5 shows an NN-based encoder 510 and an NN-based decoder 550. The NN-based encoder 510 may include a preprocessor 520, banks 530, 532, and 534, and a mixer 540. Likewise, the NN-based decoder 550 may include a preprocessor 560, banks 570, 572, and 574, and a mixer 580. The NN-based encoder 510 may receive input data 515 including residual data, prediction related data, or a combination thereof. The input data 515 may be at the granularity of a sub-region of the frame, such as a block, slice, LCU, or CTU. For example, when encoding with respect to a pixel block, the input data 515 may include a residual block and corresponding prediction related data (the block prediction and/or its prediction parameters). The NN-based encoder 510 may generate encoded residual data 545; for example, when encoding with respect to a pixel block, the encoded residual data 545 may include a block of encoded residues. Similarly, the NN-based decoder 550 may receive input data 555 including encoded residual data, prediction related data, or a combination thereof. The input data 555 may likewise be at the granularity of a sub-region of a frame, such as a block, slice, LCU, or CTU. For example, when decoding with respect to a pixel block, the input data 555 may comprise a block of coded residues and corresponding prediction related data (the block prediction and/or its prediction parameters). The NN-based decoder 550 may generate decoded residual data 585; for example, when decoding with respect to a pixel block, the decoded residual data 585 may include a block of decoded residues.

Preprocessor 520 may process the input data 515 to create several versions of these data, InB_1, InB_2, ..., InB_N, to be presented to the libraries 530, 532, and 534, respectively. For example, the preprocessor 520 may transform the input data 515 from one domain (e.g., a pixel domain) to another domain (e.g., a frequency domain) and/or may extract feature vectors from the input data 515. Alternatively or in addition, the preprocessor 520 may deliver the input data as is to the libraries 530, 532, and 534. Similarly, preprocessor 560 may process the input data 555, creating several versions of these data, InB_1, InB_2, ..., InB_N, to be presented to the libraries 570, 572, and 574, respectively. For example, the preprocessor 560 may transform the input data 555 from a code domain to a transform domain and/or may extract feature vectors from the input data 555. Alternatively or additionally, the preprocessor 560 may deliver the input data 555 as is to the libraries 570, 572, and 574.

The mixer 540 of the encoder and the mixer 580 of the decoder may combine the outputs generated by their respective banks. The number of libraries that make up the NN-based encoder 510 and NN-based decoder 550, and the manner in which their outputs may be combined, may be learned by the neural network. Thus, for example, bank n of the encoder 510 may be trained to generate a coded residual block, producing CRB_n and a likelihood of that coding (denoted P_n). The mixer 540 may then combine the various coded residual blocks CRB_1, CRB_2, ..., CRB_N based on their respective likelihoods P_1, P_2, ..., P_N. For example, mixer 540 may combine the coded residual blocks based on a linear combination, CRB = Σ_n CRB_n · P_n, or more generally via any non-linear function CRB = f(CRB_n, P_n), n = [1, N]. The number of banks to be used in the encoding of a certain residual block may be determined by selecting the CRB_1, CRB_2, ..., CRB_N with the highest likelihoods, where a higher likelihood may correspond to a lower encoding error. Likewise, bank n of the decoder 550 may be trained to generate a decoded residual block, producing RB_n and a likelihood of that decoded residual block (denoted P_n). The mixer 580 may then combine the various decoded residual blocks RB_1, RB_2, ..., RB_N based on their respective likelihoods P_1, P_2, ..., P_N. For example, mixer 580 may combine the decoded residual blocks based on a linear combination, RB = Σ_n RB_n · P_n, or more generally via any non-linear function RB = f(RB_n, P_n), n = [1, N]. The number of banks to be used in the decoding may likewise be determined by selecting the RB_1, RB_2, ..., RB_N with the highest likelihoods, where a higher likelihood may correspond to a lower coding error. It is noted that when the NN-based encoder 510 and the NN-based decoder 550 are trained, the output of the NN-based encoder 510 may be used in the training process of the NN-based decoder 550.

In one aspect, the encoding and decoding of the residual block may be accomplished using an auto-encoder. Fig. 6 illustrates the training of an auto-encoder with respect to a bank pair of the NN-based encoder 116 and the NN-based decoder 216. Generally, an auto-encoder is a neural network that is trained to output a copy of its input. Thus, the neural network of the auto-encoder can be divided into two networks (banks): an encoder bank 610 and a decoder bank 620. In one aspect, the input to the auto-encoder 600 may be input data X (e.g., a residual block) and the output of the auto-encoder 600 may be reconstructed input data X_R (e.g., a reconstructed residual block), while the output data of the hidden layer, denoted h, may constitute the output of the encoder bank 610 and the input of the decoder bank 620. Thus, an auto-encoder may be represented by an encoding function h = f(X), designed to provide a compressed representation of the input data X, and a reconstruction function X_R = g(h) = g(f(X)). In order to obtain reconstructed input data X_R that is sufficiently close to the input data X, the auto-encoder may be trained so that a distance function D(X, X_R) between X and X_R is minimized. The learning process may result in the weights 615 of the encoder bank and the weights 625 of the decoder bank, which may later be used (in real-time operation) for encoding and decoding, respectively, the residual block.

In one aspect, the auto-encoder 600 may utilize auxiliary data X_A in addition to the input data X to generate X_R. For example, encoder bank 610 may receive as inputs the residual block generated by predictor 150 and prediction related data (e.g., the block prediction and/or its prediction parameters). The auxiliary data may allow exploiting any correlation that may exist between the residual block and its corresponding prediction related data. In this case, the coding function may be h = f(X, X_A). Thus, in order to obtain reconstructed input data X_R that is sufficiently close to the input data X, the auto-encoder may be trained such that the distance function D(X, g(f(X, X_A))) is minimized.
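The sketch below shows how such an auto-encoder might be trained, assuming PyTorch, fully connected encoder/decoder banks, flattened 8x8 residual blocks, and mean squared error as the distance function D; all sizes and the training data are illustrative placeholders.

```python
import torch
import torch.nn as nn

BLOCK, AUX, CODE = 64, 16, 8    # flattened 8x8 residual block, auxiliary features, code size

encoder = nn.Sequential(nn.Linear(BLOCK + AUX, 32), nn.ReLU(), nn.Linear(32, CODE))   # encoder bank
decoder = nn.Sequential(nn.Linear(CODE, 32), nn.ReLU(), nn.Linear(32, BLOCK))         # decoder bank
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(100):                       # toy training loop on random data
    X = torch.randn(32, BLOCK)                # residual blocks
    X_A = torch.randn(32, AUX)                # prediction related auxiliary data
    h = encoder(torch.cat([X, X_A], dim=1))   # h = f(X, X_A)
    X_R = decoder(h)                          # X_R = g(h)
    loss = nn.functional.mse_loss(X_R, X)     # D(X, g(f(X, X_A)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```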

In one aspect of the disclosure, banks 420, 430, and 440 of Fig. 4 and banks 530, 532, 534, 570, 572, and 574 of Fig. 5 may be of different types, characterized by, for example, their network architecture, activation functions (e.g., unipolar sigmoid, bipolar sigmoid, hyperbolic tangent, or radial basis functions), hyper-parameters, training cost functions, and training data classes. Typical hyper-parameters are the learning rate, momentum, batch size, or weight decay rate. For example, different banks of the NN-based predictor 158, the NN-based encoder 116, and the NN-based decoder 126 may be trained to operate at different bit rates or with different end-to-end cost functions. Similarly, different libraries may be trained on data from different classes, since video frames may contain non-stationary data whose statistical properties vary across one frame (spatially) and across multiple frames (temporally). Accordingly, aspects of the present disclosure may model such different features by training each library for a particular class of video features.

Supporting multiple banks may increase the complexity of the operation of the decoder, particularly with respect to the memory bandwidth needed to switch from one bank to another when decoding a residual block (NN-based decoder 216) or when performing NN-based prediction (predictor 250). On the other hand, a large set of libraries may be needed to cover different video content characteristics. Therefore, it may be advantageous to take advantage of similarities between the parameters of the banks (i.e., the characteristics and weights of the banks) to reduce the total number of banks used in the decoding process. In addition, although the parameters of a library may be predefined (e.g., accessible from a server), they may also be encoded and encapsulated in the encoded video stream. In this case, exploiting similarities between the banks associated with adjacent pixel blocks may allow the parameters of these banks to be encoded into the encoded video stream more efficiently. For example, when predicting a certain pixel block using the NN-based predictor, the decoder 200 may utilize a neural network library that has already been used for the prediction of neighboring pixel blocks, by using the same library or a variant of it. Thus, with respect to a certain block, the encoder 100 may provide the decoder 200 (as part of the prediction parameters associated with that block) with the parameters of the library used to predict the block, or, conversely, the encoder may provide an indication instructing the decoder to derive the parameters of a new library from the parameters of a library associated with a previously predicted block. Thus, in one aspect, the parameters that a decoder may need for a bank that predicts a certain pixel block may be derived from the parameters of a bank that has already been used for predicting neighboring pixel blocks. Similarly, in another aspect, the parameters of the bank that the decoder may need for decoding a particular residual block may be derived from the parameters of a bank that has already been used for decoding neighboring residual blocks.

In one aspect disclosed herein, the weights of the library may be refined using dynamic training using online machine learning techniques. For example, the NN-based predictor 158 may include operations in which parameters of the library are updated based on the newly available training data. The newly available training data may be pairs of reference data (of the currently processed pixel block) and corresponding prediction data. These corresponding prediction data may be data generated by intra-based predictor 154, inter-based predictor 156, or any other prediction method. Similarly, the NN-based encoder 116 may include operations in which parameters of the library are updated based on the newly available training data. The newly available training data may be the currently processed residual block, the prediction related data and the corresponding coded residual block. For example, the corresponding coded residual block may be data generated by the transform-based encoder 114.
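A sketch of such an online refinement step is given below; the single linear layer standing in for a bank, the SGD update, and the number of refinement steps are illustrative assumptions, with only the idea of updating weights from newly available (reference data, prediction) pairs taken from the description above.

```python
import torch
import torch.nn as nn

bank = nn.Linear(64, 64)                       # stand-in for one NN-based predictor bank
opt = torch.optim.SGD(bank.parameters(), lr=1e-4)

def refine_online(reference, target_prediction, steps=3):
    """Update the bank's weights using a newly available (reference, prediction) pair."""
    for _ in range(steps):
        loss = nn.functional.mse_loss(bank(reference), target_prediction)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy usage: the prediction produced by the codec for the current block becomes the target.
refine_online(torch.randn(1, 64), torch.randn(1, 64))
```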

Fig. 7 illustrates another aspect of the present disclosure. A video encoder 710 may receive a video stream 715 and may generate an encoded video stream 755. The encoded video stream 755 may be provided to a video decoder 750, which may in turn generate a reconstructed video stream 790. A prediction for a pixel block may be generated based on the reference data using an NN-based predictor, an intra-based predictor, or an inter-based predictor (block 720). A residual block may be generated by subtracting the generated pixel block prediction from the pixel block (block 725). The encoding of the residual block may then be performed using an NN-based encoder or a transform-based encoder (block 730). The encoded residual may be encapsulated in the encoded video stream along with corresponding encoding parameters (block 735). The encoded video stream may be stored for later reference or may be transmitted over a communication link to a receiving terminal, where it may be decoded by decoder 750 (represented by arrow 755).

During the decoding process, the coded residual block and corresponding coding parameters may be extracted (block 760). The encoded residual block may be decoded (block 765) using an NN-based decoder or a transform-based decoder, resulting in a reconstructed residual block. A pixel block prediction is generated based on the reference data, thereby duplicating the prediction shown in block 720 (block 770). The decoder may add the reconstructed residual block to the pixel block prediction resulting in a reconstructed pixel block (block 775). The encoder 710 and decoder 750 can operate at the granularity of pixel blocks, as shown in fig. 7, or other levels of granularity as discussed.

The foregoing discussion has described the operation of various aspects of the present disclosure in the context of video encoders and decoders. These components are often provided as electronic devices. The video decoder and/or controller may be embedded in an integrated circuit, such as an application specific integrated circuit, a field programmable gate array, and/or a digital signal processor. Alternatively, they may be embedded in a computer program executed on a camera device, a personal computer, a laptop, a tablet, a smartphone, or a computer server. Such computer programs are typically stored in physical storage media such as electronic, magnetic, and/or optical based storage devices, where they are read into a processor and executed. Decoders are often packaged in consumer electronics devices such as smart phones, tablets, gaming systems, DVD players, portable media players, and the like. They may also be packaged in consumer software applications such as video games, media players, media editors, and the like. And, of course, these components may be provided as a hybrid system with distributed functionality on dedicated hardware components and programmed general-purpose processors, as desired.

Video encoders and decoders may exchange video over a channel in a variety of ways. They may communicate with each other via a communication and/or computer network, as shown in fig. 1. In other applications, the video encoder may output the video data to a storage device, such as an electronic, magnetic, and/or optical storage medium, which may be provided to the decoder at some later time. In such applications, the decoder may retrieve the encoded video data from the storage device and decode it.

Several embodiments of the present invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.
