Hybrid motion compensated neural network with side information based video coding

Document No.: 991724 | Published: 2020-10-20

Reading note: This technology, "Hybrid motion compensated neural network with side information based video coding," was designed and created by Debargha Mukherjee, Urvang Joshi, Yue Chen, and Sarah Parker on 2019-11-04. Its main content includes: A hybrid apparatus for coding a video stream includes a first encoder. The first encoder includes a neural network having at least one hidden layer, and the neural network receives source data from the video stream at a first hidden layer of the at least one hidden layer, receives side information related to the source data at the first hidden layer, and generates guide information using the source data and the side information. The first encoder outputs the guide information and the side information for a decoder to reconstruct the source data.

1. A hybrid apparatus for coding a video stream, comprising:

a first encoder comprising a neural network having at least one hidden layer, wherein the neural network is configured to:

receiving source data from the video stream at a first hidden layer of the at least one hidden layer;

receiving, at the first hidden layer, side information related to the source data;

generating guide information using the source data and the side information; and

outputting the guide information and the side information for a decoder to reconstruct the source data.

2. The hybrid apparatus of claim 1, further comprising:

a second encoder configured to generate the side information using the source data for input to the first encoder.

3. The hybrid apparatus of claim 2, wherein:

the second encoder includes a second decoder, and

the side information includes decoded source data from the second decoder.

4. The hybrid apparatus of any one of claims 1 to 3, wherein:

the first encoder comprises a first decoder configured to reconstruct the source data to form reconstructed source data, and

the neural network is trained to minimize a rate distortion value between the source data and the reconstructed source data.

5. The hybrid apparatus of any one of claims 1 to 4, wherein:

the first encoder includes a first decoder, and

the neural network comprises a plurality of hidden layers, at least the first hidden layer of the plurality of hidden layers forming the first encoder and at least a second hidden layer of the plurality of hidden layers forming the first decoder, and the first decoder is configured to receive the guide information and the side information for reconstructing the source data.

6. The hybrid apparatus of claim 5, wherein:

each hidden layer of the first encoder is configured to pass the side information such that the first layer of the first decoder receives the side information.

7. The hybrid apparatus of any one of claims 1 to 6, wherein the first encoder comprises a first decoder, the hybrid apparatus further comprising:

a deterministic transform configured to transform the side information prior to providing the side information to the first encoder and the first decoder.

8. The hybrid apparatus of any one of claims 1 to 7, wherein:

the side information includes a full resolution prediction signal generated using motion prediction.

9. The hybrid apparatus of claim 8, wherein:

the neural network is trained to select a transform for a block residual within the full resolution prediction signal to minimize a rate distortion value.

10. The hybrid apparatus of any one of claims 1 to 9, further comprising:

a second encoder configured to generate the side information using the source data for input to the first encoder, wherein the second encoder comprises a block-based encoder.

11. The hybrid apparatus of any one of claims 1 to 10, wherein:

the side information includes a per-frame reduced resolution reconstruction of the reduced resolution base layer.

12. The hybrid apparatus of claim 11, wherein:

the neural network is configured to generate a high resolution layer using the per frame reduced resolution reconstruction.

13. The hybrid apparatus of claim 12, further comprising:

a second encoder configured to generate the side information using the source data for input to the first encoder, wherein the second encoder comprises a block-based encoder; and

a reference frame buffer to store full resolution reference frames output from the neural network for use in predicting subsequent frames.

14. A method for coding a video stream, comprising:

providing source data from the video stream to a first encoder comprising a neural network;

generating side information using the source data;

inputting the side information to the neural network for encoding the source data; and

transmitting the source data and the side information from the first encoder to a decoder or to storage.

15. The method of claim 14, wherein generating the side information comprises performing motion prediction using the source data to output a prediction signal.

16. The method of claim 15, wherein performing motion prediction using the source data to output a prediction signal comprises using the first encoder to perform the motion prediction.

17. The method of any of claims 14 to 16, further comprising:

transforming the side information to the same resolution as the source data; and

generating difference information comprising a difference between the source data and the transformed side information, wherein providing the source data to the neural network comprises providing the difference information to the neural network.

18. The method of any of claims 14 to 17, wherein the first encoder comprises a first decoder, the neural network comprises a plurality of hidden layers, and the first encoder passes the side information through at least one hidden layer to only the first hidden layer of the first decoder.

19. A hybrid apparatus for coding a video stream, comprising:

a first encoder and a first decoder comprising a neural network having a plurality of hidden layers, wherein the neural network is configured to:

receiving source data from the video stream at a first hidden layer of the first encoder;

receiving side information related to the source data at a first hidden layer of the first encoder;

generating guide information using the source data and the side information; and

receiving the guide information and the side information at a first hidden layer of the first decoder for reconstructing the source data.

20. The hybrid apparatus of claim 19, wherein the neural network further comprises an expander layer configured to receive the guide information from the first encoder, increase an amount of data in the guide information, and transmit the guide information to a first hidden layer of the first decoder.

Background

A digital video stream may represent video using a sequence of frames or still images. Digital video can be used for a variety of applications including, for example, video conferencing, high definition video entertainment, video advertising, or user-generated video sharing. Digital video streams can contain large amounts of data and consume a large amount of computing or communication resources of a computing device for processing, transmission, or storage of video data. Various methods, including compression and other encoding techniques, have been proposed to reduce the amount of data in a video stream.

Disclosure of Invention

One aspect of the disclosed implementations is a first encoder that includes a neural network having at least one hidden layer, wherein the neural network receives source data from a video stream at a first hidden layer of the at least one hidden layer, receives side information related to the source data at the first hidden layer, and generates guide information using the source data and the side information. The first encoder outputs the guide information and the side information to a decoder for reconstructing the source data.

The method for coding a video stream described herein includes: providing source data from a video stream to a first encoder comprising a neural network; generating side information using the source data; inputting the side information to the neural network for encoding the source data; and transmitting the source data and the side information from the first encoder to a decoder.

Another hybrid apparatus for coding a video stream described herein includes a first encoder and a first decoder including a neural network having a plurality of hidden layers. The neural network receives source data from a video stream at a first hidden layer of the first encoder, receives side information related to the source data at the first hidden layer of the first encoder, generates guide information using the source data and the side information, and receives the guide information and the side information at a first hidden layer of the first decoder for reconstructing the source data.

These and other aspects of the disclosure are disclosed in the following detailed description of the embodiments, the appended claims and the accompanying drawings.

Drawings

The description herein makes reference to the accompanying drawings wherein like reference numerals refer to like parts throughout the several views.

Fig. 1 is a diagram of a video encoding and decoding system.

Fig. 2 is a block diagram of an example of a computing device capable of implementing a transmitting station or a receiving station.

Fig. 3 is a diagram of a video stream to be encoded and subsequently decoded.

Fig. 4 is a block diagram of an encoder in accordance with an implementation of the present disclosure.

Fig. 5 is a block diagram of a decoder according to an implementation of the present disclosure.

Fig. 6 is a block diagram of a representation of a portion of a frame according to an implementation of the present disclosure.

Fig. 7 is a block diagram of an example of a quadtree representation of a block according to an implementation of the present disclosure.

Fig. 8 is a flow diagram of a process for searching for a best mode to code a block.

Fig. 9 is a block diagram of a process of estimating a rate and a distortion cost for coding an image block by using a prediction mode.

Fig. 10 is a flow diagram of a process for encoding blocks of a video stream according to an implementation of the present disclosure.

Fig. 11 is a block diagram of an example of a codec including a neural network with side information according to an implementation of the present disclosure.

Fig. 12 is a block diagram of an example of a neural network that can be used to implement the codec of fig. 11.

Fig. 13 is a block diagram of another example of a neural network that can be used to implement the codec of fig. 11.

Fig. 14 is a block diagram of a variation in the example of the neural network of fig. 13.

Fig. 15 is a block diagram of an alternative example of a codec including a neural network with side information according to an implementation of the present disclosure.

Detailed Description

The coding technique may be designed to maximize coding efficiency. Coding efficiency may refer to encoding video at the lowest possible bit rate while minimizing distortion (e.g., while maintaining a certain level of video quality). Coding efficiency is typically measured in terms of both rate and distortion. Rate refers to the number of bits required for encoding, such as encoding a block, frame, etc. The distortion measure is, for example, the loss of quality between the source video block and the reconstructed version of the source video block. For example, the distortion may be calculated as a mean square error between pixel values of the source block and pixel values of the reconstructed block. By performing a rate-distortion optimization process, the video codec optimizes the amount of distortion against the rate required to encode the video.
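For illustration only, a minimal sketch of the distortion computation described above, i.e., the mean squared error between a source block and its reconstruction; the 8 × 8 block size and the pixel values are arbitrary assumptions.

```python
import numpy as np

def mse_distortion(source_block: np.ndarray, reconstructed_block: np.ndarray) -> float:
    """Distortion measured as the mean squared error between source and reconstructed pixels."""
    diff = source_block.astype(np.int64) - reconstructed_block.astype(np.int64)
    return float(np.mean(diff * diff))

# Toy 8x8 blocks of 8-bit pixel values (arbitrary example data).
rng = np.random.default_rng(0)
source = rng.integers(0, 256, (8, 8))
reconstructed = np.clip(source + rng.integers(-3, 4, (8, 8)), 0, 255)
print(mse_distortion(source, reconstructed))
```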

Modern video codecs (e.g., h.264, also known as MPEG-4 AVC; VP9; h.265, also known as HEVC; AVS2; and AV1) define and use a large number of tools and configurations (e.g., parameters) to improve coding efficiency. The video encoder can use mode decision to examine (e.g., test, evaluate, etc.) at least some of the valid combinations of parameters to select a combination that results in a relatively low rate-distortion value. An example of a mode decision is an intra prediction mode decision, which determines the best intra prediction mode for coding a block. Another example of a mode decision is a partition decision, which determines the best sub-partition of a coding unit (also referred to as a coding tree unit or CTU). Another example of mode decision includes a decision as to the type of transform to be used when transforming a block (such as a residual block or an image block) from the pixel domain to the frequency domain to form a transform block comprising transform coefficients.

To evaluate whether one combination is better than another, a metric can be calculated for each examined combination and the corresponding metrics compared. In one example, the metric can combine the rate and distortion described above to produce a rate-distortion (RD) value or cost. The RD value or cost may be a single scalar value.

Quantization parameters in video codecs can be used to control the trade-off between rate and distortion. In general, a larger quantization parameter means higher quantization (such as of the transform coefficients), which results in a lower rate but higher distortion; a smaller quantization parameter means lower quantization, which results in a higher rate but lower distortion. The variables QP, q, and Q may be used interchangeably in this disclosure to refer to a quantization parameter.

The value of the quantization parameter may be fixed. For example, the encoder can use one quantization parameter value to encode all frames and/or all blocks of the video. In other examples, the quantization parameter can change from frame to frame, for example. For example, in the case of a video conferencing application, the encoder can change the quantization parameter values based on fluctuations in the network bandwidth.

Since the quantization parameter can be used to control the trade-off between rate and distortion, the quantization parameter can be used to calculate the RD cost associated with the respective combination of parameters. The combination that results in the lowest cost (e.g., lowest RD cost) can be used to encode, for example, a block or frame in a compressed bitstream. That is, the QP value may be used to determine the RD cost whenever the encoder decision (e.g., mode decision) is based on the RD cost.

In one example, QP can be used to derive a multiplier that is used to combine the rate and distortion values into one metric. Some codecs may refer to the multiplier as the Lagrangian multiplier (denoted λ_mode); other codecs may use a similar multiplier called rdmult. Each codec may have a different method of computing the multiplier. Unless the context clearly indicates otherwise, the multiplier is referred to herein, regardless of the codec, as the Lagrangian multiplier or Lagrangian parameter.

Restated, the Lagrangian multiplier can be used to evaluate the RD costs of competing modes (i.e., competing combinations of parameters). Specifically, let r_m represent the rate (in bits) resulting from using a mode m, and let d_m represent the resulting distortion. The rate-distortion cost of selecting the mode m can be calculated as the scalar value d_m + λ_mode × r_m. By using the Lagrangian parameter λ_mode, the costs of two modes can be compared and the mode with the lower combined RD cost selected. This technique of evaluating the rate-distortion cost is the basis for a mode decision process in at least some video codecs.
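For illustration, a minimal sketch of the comparison just described: each candidate mode m has a rate r_m (in bits) and a distortion d_m, and the mode with the lower d_m + λ_mode × r_m is selected. The distortion, rate, and multiplier values below are made up; in practice λ_mode is derived from the quantization parameter, as discussed next.

```python
def rd_cost(distortion: float, rate_bits: float, lambda_mode: float) -> float:
    """Scalar rate-distortion cost d_m + lambda_mode * r_m for one candidate mode."""
    return distortion + lambda_mode * rate_bits

# Hypothetical (distortion, rate-in-bits) pairs for two competing modes; values are made up.
candidates = {"mode_1": (1200.0, 340.0), "mode_2": (900.0, 520.0)}
lambda_mode = 100.0  # in practice derived from the quantization parameter, see equations (1)-(4) below

best = min(candidates, key=lambda m: rd_cost(*candidates[m], lambda_mode))
print(best, rd_cost(*candidates[best], lambda_mode))
```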

Different video codecs may use different techniques to compute the lagrangian multiplier from the quantization parameter. This is due in part to the fact that: different codecs may have different meanings (e.g., definitions, semantics, etc.) for the quantization parameters and their methods of use.

A codec implementing the h.264 standard (referred to herein as the h.264 codec) may use equation (1) to derive the Lagrangian multiplier λ_mode:

λ_mode = 0.85 × 2^((QP - 12)/3)   (1)

A codec implementing the HEVC standard, referred to herein as an HEVC codec, may use a formula similar to formula (1). A codec implementing the h.263 standard (referred to herein as the h.263 codec) may use equation (2) to derive the Lagrangian multiplier λ_mode:

λ_mode = 0.85 · Q_H263^2   (2)

A codec implementing the VP9 standard (referred to herein as the VP9 codec) may derive the multiplier rdmult using equation (3):

rdmult = 88 · q^2 / 24   (3)

A codec implementing the AV1 standard (referred to herein as the AV1 codec) may use equation (4) to derive the Lagrangian multiplier λ_mode:

λ_mode = 0.12 · Q_AV1^2 / 256   (4)

As can be seen in the above case, the multiplier has a non-linear relationship with the quantization parameter. In the case of HEVC and h.264, the multiplier is exponential with QP; whereas in the case of h.263, VP9, and AV1, the multiplier has a quadratic relationship with QP. Note that the multipliers may undergo further changes before being used in the corresponding codecs to account for additional side information (side information) included in the compressed bitstream by the encoder. Examples of side information include picture type (e.g., intra-frame to inter-frame predicted frames), color components (e.g., luminance or chrominance), and/or regions of interest. In one example, such additional changes may be linear changes of multipliers.
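For illustration, a sketch that evaluates the multipliers exactly as written in equations (1) through (4) above; the input quantization values are arbitrary, and any codec-specific adjustments for side information (picture type, color component, region of interest, etc.) are ignored.

```python
def lambda_h264(qp: float) -> float:
    return 0.85 * 2 ** ((qp - 12) / 3)        # equation (1)

def lambda_h263(q_h263: float) -> float:
    return 0.85 * q_h263 ** 2                 # equation (2)

def rdmult_vp9(q: float) -> float:
    return 88 * q ** 2 / 24                   # equation (3)

def lambda_av1(q_av1: float) -> float:
    return 0.12 * q_av1 ** 2 / 256            # equation (4)

# Arbitrary example quantization values, one per codec, to show the different growth behavior.
print(lambda_h264(30))    # exponential in QP
print(lambda_h263(10))    # quadratic in Q
print(rdmult_vp9(60))     # quadratic in q
print(lambda_av1(120))    # quadratic in Q
```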

As described above, the best mode can be selected from many possible combinations. For example, the RD cost associated with a particular mode (or particular combination of tools) may be determined by performing at least a subset of the encoding steps of the encoder. Depending on the mode in which the RD cost is to be determined, the subset of encoding steps can include at least one of the following steps: determining a prediction block, determining a residual block, determining a transform type, determining an interpolation filter, quantizing a transform block, entropy encoding (e.g., using a hypothetical encoder), and so forth. Note that these encoding steps are neither intended to be an exhaustive list of encoding steps that a typical encoder may perform nor presented in any particular order (that is, the encoder does not necessarily perform these steps sequentially as listed). As the number of possible tools and parameters increases, the number of combinations also increases, which in turn increases the time required to determine the best mode.

Techniques such as machine learning may be utilized to reduce the time required to determine the best mode. Machine learning may be well suited to solving the computational complexity problem in video coding. For example, rather than performing all of the encoding steps (i.e., a brute-force or exhaustive method) to determine the rate and distortion (or, equivalently, the RD cost) associated with a mode, a machine learning model can be used to estimate the rate and distortion, or to estimate the RD cost, associated with the mode. The best mode may then be selected based on, for example, the lowest RD cost.

The machine learning model may be trained using a large amount of training data available from encoders performing standard encoding techniques, such as those described with respect to figs. 4 and 6-9. More specifically, the training data can be used during a learning phase of machine learning to derive (e.g., learn, infer, etc.) a machine learning model that is (e.g., defines, constitutes, etc.) a mapping from input data to an output, in this example RD costs that can be used to make one or more mode decisions.

The predictive power (i.e., accuracy) of a machine learning model is only as good as the inputs used to train the machine learning model and the inputs presented to the machine learning model to predict an outcome (e.g., the best mode). Once the machine learning model is trained, the model computes the output as a deterministic function of its inputs. In one example, the machine learning model may be a neural network model, such as a Convolutional Neural Network (CNN). Further details of neural network models, including CNNs, are discussed below with respect to figs. 12-14.

As can be seen from the above description, a machine learning model can be used to decide (e.g., select, choose, etc.) a mode from among a plurality of available modes in a coding process for a block, e.g., an image block, a prediction block, or a transform block. This can be a powerful tool for image compression. However, video compression relies heavily on exploiting temporal redundancy between frames, thereby introducing a third dimension, time (and hence motion), in addition to the horizontal and vertical dimensions of the pixels. It is possible to use machine learning to learn motion fields from a three-dimensional volume of data, but an additional degree of complexity is involved. Instead, information derived from conventional encoding methods (e.g., motion information) may be made available for the reconstruction of video data that is compressed, at least in part, using machine learning in accordance with the teachings herein. This is achieved using a deep neural network with structural constraints that enforce the availability of the information at the decoder.
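To make the structural idea concrete, the toy sketch below mimics the arrangement described above: the first hidden layer of an encoder network receives both the source data and the side information, a narrower guide signal is produced, and a decoder network receives that guide signal together with the same side information to reconstruct the source. The layer sizes, random (untrained) weights, and activation function are illustrative assumptions only, not the architecture of the codecs of figs. 11-15.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

src_dim, side_dim, hidden_dim, guide_dim = 64, 64, 32, 8  # arbitrary sizes

# Untrained, randomly initialized weights, used purely to show the data flow.
W_enc1 = rng.normal(size=(src_dim + side_dim, hidden_dim))
W_enc2 = rng.normal(size=(hidden_dim, guide_dim))
W_dec1 = rng.normal(size=(guide_dim + side_dim, hidden_dim))
W_dec2 = rng.normal(size=(hidden_dim, src_dim))

def encode(source: np.ndarray, side: np.ndarray) -> np.ndarray:
    """The first hidden layer sees the source data and the side information; the output is the guide information."""
    h = relu(np.concatenate([source, side]) @ W_enc1)
    return h @ W_enc2

def decode(guide: np.ndarray, side: np.ndarray) -> np.ndarray:
    """The decoder reconstructs the source from the guide information plus the same side information."""
    h = relu(np.concatenate([guide, side]) @ W_dec1)
    return h @ W_dec2

source = rng.normal(size=src_dim)   # e.g., a flattened block of source pixels
side = rng.normal(size=side_dim)    # e.g., a flattened motion-compensated prediction of that block
reconstruction = decode(encode(source, side), side)
print(reconstruction.shape)
```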

The neural network is first described herein with reference to a block-based codec into which the teachings may be incorporated. Although block-based codecs are described as examples, other codecs, including feature-based codecs, may be used with the present teachings.

Fig. 1 is a diagram of a video encoding and decoding system 100. Transmission station 102 may be, for example, a computer with an internal configuration of hardware, such as the computer described with respect to fig. 2. However, other suitable implementations of transmission station 102 are possible. For example, the processing of transmission station 102 can be distributed among multiple devices.

Network 104 may be capable of connecting transmitting station 102 and receiving station 106 for encoding and decoding video streams. In particular, the video stream can be encoded in the transmitting station 102 and the encoded video stream can be decoded in the receiving station 106. The network 104 may be, for example, the internet. The network 104 may also be a Local Area Network (LAN), Wide Area Network (WAN), Virtual Private Network (VPN), cellular telephone network, or any other means of transferring a video stream from the transmitting station 102 to the receiving station 106 in this example.

In one example, the receiving station 106 may be a computer having an internal configuration of hardware, such as the computer described with respect to fig. 2. However, other suitable implementations of the receiving station 106 are possible. For example, the processing of the receiving station 106 can be distributed among multiple devices.

Other implementations of the video encoding and decoding system 100 are possible. For example, one implementation can omit network 104. In another implementation, the video stream can be encoded and then stored for transmission to the receiving station 106 or any other device having memory at a later time. In one implementation, the receiving station 106 receives an encoded video stream (e.g., via the network 104, a computer bus, and/or some communication path) and stores the video stream for later decoding. In one example implementation, the real-time transport protocol (RTP) is used to transport encoded video over the network 104. In another implementation, transport protocols other than RTP may be used, such as the Hypertext transfer protocol (HTTP) based video streaming protocol.

For example, when used in a video conferencing system, transmitting station 102 and/or receiving station 106 may include the capability to both encode and decode video streams as described below. For example, the receiving station 106 may be a video conference participant that receives an encoded video bitstream from a video conference server (e.g., the transmitting station 102) for decoding and viewing, and further encodes and transmits its own video bitstream to the video conference server for decoding and viewing by other participants.

Fig. 2 is a block diagram of an example of a computing device 200 capable of implementing a transmitting station or a receiving station. For example, computing device 200 may be capable of implementing one or both of transmitting station 102 and receiving station 106 of fig. 1. Computing device 200 may be in the form of a computing system including multiple computing devices or in the form of a single computing device, such as a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, and so forth.

The CPU202 in the computing device 200 may be a central processing unit. Alternatively, CPU202 may be any other type of device, or devices now existing or hereafter developed, capable of manipulating or processing information. Although the disclosed implementations can be practiced with a single processor (e.g., CPU 202) as shown, speed and efficiency advantages can be realized by using more than one processor.

In one implementation, the memory 204 in the computing device 200 may be a Read Only Memory (ROM) device or a Random Access Memory (RAM) device. Any other suitable type of storage device can be used as memory 204. The memory 204 can include code and data 206 that are accessed by the CPU202 using the bus 212. The memory 204 can further include an operating system 208 and application programs 210, the application programs 210 including at least one program that allows the CPU202 to perform the methods described herein. For example, the application programs 210 can include applications 1 through N, which further include a video compilation application that performs the methods described herein. The computing device 200 can also include secondary storage 214, which secondary storage 214 can be, for example, a memory card for use with the mobile computing device 200. Because video communication sessions may contain a large amount of information, they can be stored in whole or in part in secondary storage 214 and loaded into memory 204 for processing as needed.

Computing device 200 can also include one or more output devices, such as a display 218. In one example, display 218 may be a touch sensitive display that combines the display with a touch sensitive element operable to sense touch inputs. A display 218 can be coupled to the CPU202 via the bus 212. In addition to or in lieu of display 218, other output devices can be provided that allow a user to program computing device 200 or otherwise utilize computing device 200. When the output device is or includes a display, the display can be implemented in a variety of ways, including as a Liquid Crystal Display (LCD); cathode Ray Tube (CRT) displays; or a Light Emitting Diode (LED) display, such as an organic LED (oled) display.

Computing device 200 can also include or be in communication with an image sensing device 220, such as a camera or any other image sensing device now existing or hereafter developed that can sense an image, such as an image of a user operating computing device 200. Image sensing device 220 can be positioned such that it is oriented toward a user operating the computing device 200. In one example, the position and optical axis of the image sensing device 220 can be configured such that the field of view includes an area directly adjacent to the display 218 from which the display 218 is visible.

The computing device 200 can also include or be in communication with a sound sensing device 222, such as a microphone or any other sound sensing device now existing or hereafter developed that can sense sound in the vicinity of the computing device 200. The sound sensing device 222 can be positioned such that it is directed toward a user operating the computing device 200, and can be configured to receive sound, such as a voice or other utterance, made by the user while the user is operating the computing device 200.

Although fig. 2 depicts the CPU202 and memory 204 of the computing device 200 as being integrated into a single unit, other configurations can be utilized. The operations of CPU202 can be distributed over multiple machines (each machine having one or more processors) that can be coupled directly or across a local area network or other network. Memory 204 can be distributed across multiple machines, such as a network-based memory or a memory in multiple machines that perform operations for computing device 200. Although depicted here as a single bus, bus 212 of computing device 200 can be comprised of multiple buses. In addition, secondary storage 214 can be directly coupled to other components of computing device 200 or accessible via a network and can include a single integrated unit (such as a memory card) or multiple units (such as multiple memory cards). Computing device 200 can be implemented in a wide variety of configurations.

Fig. 3 is a diagram of an example of a video stream 300 to be encoded and subsequently decoded. The video stream 300 includes a video sequence 302. At the next level, the video sequence 302 includes a plurality of adjacent frames 304. Although three frames are described as adjacent frames 304, the video sequence 302 can include any number of adjacent frames 304. The adjacent frame 304 can then be further subdivided into individual frames, such as frame 306. At the next level, the frame 306 can be divided into a series of segments 308 or planes. For example, the segment 308 may be a subset of frames that allow parallel processing. The segments 308 may also be a subset of frames that are capable of separating the video data into individual colors. For example, a frame 306 of color video data can include a luminance plane and two chrominance planes. The segments 308 may be sampled at different resolutions.

Regardless of whether the frame 306 is divided into segments 308, the frame 306 may be further subdivided into blocks 310, which blocks 310 can contain data corresponding to, for example, 16 × 16 pixels in the frame 306. The block 310 can also be arranged to include data from one or more segments 308 of pixel data. The block 310 may also have any other suitable size, such as 4 × 4 pixels, 8 × 8 pixels, 16 × 8 pixels, 8 × 16 pixels, 16 × 16 pixels, or larger.

Fig. 4 is a block diagram of an encoder 400 in accordance with an implementation of the present disclosure. As described above, encoder 400 can be implemented in transmitting station 102, for example, by providing a computer software program stored in a memory, such as memory 204. The computer software program can include machine instructions that, when executed by a processor such as CPU202, cause transmission station 102 to encode video data in the manner described herein. Encoder 400 can also be implemented as dedicated hardware included in, for example, transmission station 102. The encoder 400 has the following stages to perform various functions in the forward path (shown by the solid connecting lines) to generate an encoded or compressed bitstream 420 using the video stream 300 as input: an intra/inter prediction stage 402, a transform stage 404, a quantization stage 406, and an entropy coding stage 408. The encoder 400 may also include a reconstruction path (shown by the dashed connecting lines) to reconstruct the encoded frame for future blocks. In fig. 4, the encoder 400 has the following stages to perform various functions in the reconstruction path: a dequantization stage 410, an inverse transform stage 412, a reconstruction stage 414 and a loop filtering stage 416. Other structural variations of the encoder 400 can be used to encode the video stream 300.

When the video stream 300 is presented for encoding, the frames 306 can be processed in units of blocks. At the intra/inter prediction stage 402, a block can be encoded using either intra prediction (also referred to as intra prediction) or inter prediction (also referred to as inter prediction), or a combination of both. In any case, a prediction block can be formed. In the case of intra prediction, all or part of the prediction block may be formed from samples in the current frame that have been previously encoded and reconstructed. In the case of inter-prediction, all or a portion of the prediction block may be formed from samples in one or more previously constructed reference frames determined using motion vectors.

Next, still referring to FIG. 4, the prediction block can be subtracted from the current block at the intra/inter prediction stage 402 to generate a residual block (also referred to as a residual). The transform stage 404 uses a block-based transform to transform the residual into transform coefficients, e.g., in the frequency domain. Such block-based transforms (i.e., transform types) include, for example, Discrete Cosine Transform (DCT) and Asymmetric Discrete Sine Transform (ADST). Other block-based transforms are possible. In addition, a combination of different transforms may be applied to a single residual. In one example of applying a transform, the DCT transforms the residual block into a frequency domain where the transform coefficient values are based on spatial frequency. The lowest frequency (DC) coefficients are at the top left of the matrix and the highest frequency coefficients are at the bottom right of the matrix. It is noted that the size of the prediction block and thus the generated residual block may be different from the size of the transform block. For example, a prediction block may be partitioned into smaller blocks to which separate transforms are applied.

The quantization stage 406 uses quantizer values or quantization levels to convert the transform coefficients into discrete quantum values, which are referred to as quantized transform coefficients. For example, the transform coefficients may be divided by the quantizer values and truncated. The quantized transform coefficients are then entropy encoded by the entropy encoding stage 408. Entropy coding may be performed using any number of techniques including token trees and binary trees. The entropy encoded coefficients are then output to the compressed bitstream 420 along with other information used to decode the block (which may include, for example, the type of prediction, transform type, motion vectors, and quantizer values used). The information used to decode the block may be entropy coded into a block, frame, slice, and/or segment header within the compressed bitstream 420. The compressed bitstream 420 can also be referred to as an encoded video stream or an encoded video bitstream; these terms will be used interchangeably herein.

The reconstruction path in fig. 4 (shown by the dashed connecting lines) can be used to ensure that both the encoder 400 and the decoder 500 (described below) use the same reference frames and blocks to decode the compressed bitstream 420. The reconstruction path performs functions similar to those performed during the decoding process and discussed in more detail below, including dequantizing the quantized transform coefficients at a dequantization stage 410 and inverse transforming the dequantized transform coefficients at an inverse transform stage 412 to produce a derivative residual block (also referred to as a derivative residual). At the reconstruction stage 414, the prediction block that was predicted at the intra/inter prediction stage 402 can be added to the derivative residual to create a reconstructed block. A loop filtering stage 416 can be applied to the reconstructed block to reduce distortion, such as blocking artifacts.
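For illustration, a compact sketch of the forward path (residual, 2D transform, quantization) and the reconstruction path (dequantization, inverse transform, addition of the prediction) for a single block. Entropy coding and loop filtering are omitted, and the orthonormal DCT, the 8 × 8 block size, and the quantizer value are simplifying assumptions rather than any particular codec's transform or quantizer design.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m

def encode_block(block: np.ndarray, prediction: np.ndarray, q: float) -> np.ndarray:
    """Forward path for one block: residual -> 2D transform -> uniform quantization."""
    d = dct_matrix(block.shape[0])
    residual = block.astype(np.float64) - prediction.astype(np.float64)
    coefficients = d @ residual @ d.T
    return np.round(coefficients / q).astype(np.int32)   # quantized transform coefficients

def reconstruct_block(qcoeffs: np.ndarray, prediction: np.ndarray, q: float) -> np.ndarray:
    """Reconstruction path: dequantize -> inverse transform -> add the prediction."""
    d = dct_matrix(qcoeffs.shape[0])
    derivative_residual = d.T @ (qcoeffs.astype(np.float64) * q) @ d
    return np.clip(np.round(prediction + derivative_residual), 0, 255)

# Toy 8x8 example with an arbitrary prediction and quantizer value.
rng = np.random.default_rng(1)
block = rng.integers(0, 256, (8, 8))
prediction = np.clip(block + rng.integers(-5, 6, (8, 8)), 0, 255)
reconstructed = reconstruct_block(encode_block(block, prediction, 16.0), prediction, 16.0)
print(np.abs(reconstructed - block).mean())
```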

Other variations of the encoder 400 can be used to encode the compressed bitstream 420. For example, the non-transform based encoder 400 can quantize the residual signal directly without the transform stage 404 for certain blocks or frames. In another implementation, the encoder 400 enables the quantization stage 406 and the dequantization stage 410 to be combined into a single stage.

Fig. 5 is a block diagram of a decoder 500 in accordance with an implementation of the present disclosure. The decoder 500 can be implemented in the receiving station 106, for example, by providing a computer software program stored in the memory 204. The computer software program can include machine instructions that, when executed by a processor such as CPU202, cause receiving station 106 to decode video data in the manner described below. Decoder 500 can also be implemented in hardware included in, for example, transmitting station 102 or receiving station 106.

Similar to the reconstruction path of the encoder 400 discussed above, the decoder 500 in one example includes the following stages to perform various functions to generate an output video stream 516 from the compressed bitstream 420: an entropy decoding stage 502, a dequantization stage 504, an inverse transform stage 506, an intra/inter prediction stage 508, a reconstruction stage 510, a loop filtering stage 512, and a post filtering stage 514. Other structural variations of the decoder 500 can be used to decode the compressed bitstream 420.

When the compressed bitstream 420 is presented for decoding, data elements within the compressed bitstream 420 can be decoded by the entropy decoding stage 502 to produce a set of quantized transform coefficients. The dequantization stage 504 dequantizes the quantized transform coefficients (e.g., by multiplying the quantized transform coefficients by quantizer values), and the inverse transform stage 506 inverse transforms the dequantized transform coefficients using the selected transform type to produce derivative residuals, which may be the same as the derivative residuals created by the inverse transform stage 412 in the encoder 400. Using the header information decoded from the compressed bitstream 420, the decoder 500 can use the intra/inter prediction stage 508 to create the same prediction block as was created in the encoder 400, e.g., at the intra/inter prediction stage 402. At the reconstruction stage 510, the prediction block can be added to the derivative residual to create a reconstructed block. A loop filtering stage 512 can be applied to the reconstructed block to reduce blocking artifacts. Other filtering can be applied to the reconstructed block. In one example, post-filtering stage 514 is applied to the reconstructed block to reduce block distortion, and the result is output as output video stream 516. The output video stream 516 can also be referred to as a decoded video stream; these terms will be used interchangeably herein.

Other variations of the decoder 500 can be used to decode the compressed bitstream 420. For example, the decoder 500 can generate the output video stream 516 without the post-filtering stage 514. In some implementations of the decoder 500, a post-filtering stage 514 is applied after the loop filtering stage 512. The loop filtering stage 512 can include an optional deblocking filtering stage. Additionally or alternatively, encoder 400 includes an optional deblock filtering stage in loop filtering stage 416.

Fig. 6 is a block diagram of a representation of a portion 600 of a frame (e.g., frame 306 of fig. 3) according to an implementation of the present disclosure. As shown, the portion 600 of the frame includes four 64 x 64 blocks 610 in two rows and two columns in a matrix or cartesian plane, which may be referred to as a super block. The super-block can have a larger or smaller size. Although fig. 6 is illustrated with reference to a superblock of size 64 x 64, the description can be readily extended to larger (e.g., 128 x 128) or smaller superblock sizes.

In one example, and without loss of generality, a superblock may be a basic or largest Coding Unit (CU). Each CU can include four 32 × 32 blocks 620. Each 32 × 32 block 620 can include four 16 × 16 blocks 630. Each 16 × 16 block 630 can include four 8 × 8 blocks 640. Each 8 × 8 block 640 can include four 4 × 4 blocks 650. Each 4 × 4 block 650 can include 16 pixels, which can be represented in a cartesian plane or matrix in four rows and four columns in each respective block. The pixels can include information representing an image captured in a frame, such as luminance information, color information, and position information. In one example, a block such as the 16 × 16 pixel block shown can include: a luminance block 660, which can include luminance pixels 662; and two chroma blocks 670/680, such as a U or Cb chroma block 670 and a V or Cr chroma block 680. The chroma block 670/680 can include chroma pixels 690. For example, luma block 660 can include 16 × 16 luma pixels 662 and each chroma block 670/680 can include 8 × 8 chroma pixels 690 as shown. Although one arrangement of blocks is shown, any arrangement can be used. Although fig. 6 illustrates an N × N block, in some implementations, an N × M (where N ≠ M) block can be used. For example, 32 × 64 blocks, 64 × 32 blocks, 16 × 32 blocks, 32 × 16 blocks, or any other size of blocks can be used. In some implementations, an N × 2N block, a 2N × N block, or a combination thereof can be used.

Video coding can include ordered block-level coding. Ordered block-level coding can include coding blocks of a frame in a scanning order, such as a raster scan order, where blocks can be identified and processed starting with a block or portion of a frame in the upper left corner of the frame and proceeding from left to right along rows and from top to bottom, identifying each block in turn for processing. For example, a CU in the top row and left column of a frame may be the first block to be coded, and a CU immediately to the right of the first block may be the second block to be coded. The second row from the top may be the second row that is coded, such that the CUs in the left column of the second row can be coded after the CUs in the rightmost column of the first row.

In one example, coding a block can include using quadtree coding, which can include coding smaller block units within the block in raster scan order. For example, the 64 × 64 superblock shown in the lower-left corner of the portion of the frame shown in fig. 6 can be coded using quadtree coding, in which the top-left 32 × 32 block can be coded, then the top-right 32 × 32 block, then the bottom-left 32 × 32 block, and then the bottom-right 32 × 32 block. Each 32 × 32 block can be coded using quadtree coding, in which the top-left 16 × 16 block can be coded, then the top-right 16 × 16 block, then the bottom-left 16 × 16 block, and then the bottom-right 16 × 16 block. Each 16 × 16 block can be coded using quadtree coding, in which the top-left 8 × 8 block can be coded, then the top-right 8 × 8 block, then the bottom-left 8 × 8 block, and then the bottom-right 8 × 8 block. Each 8 × 8 block can be coded using quadtree coding, in which the top-left 4 × 4 block can be coded, then the top-right 4 × 4 block, then the bottom-left 4 × 4 block, and then the bottom-right 4 × 4 block. In some implementations, 8 × 8 blocks can be omitted for a 16 × 16 block, and the 16 × 16 block can be coded using quadtree coding in which the top-left 4 × 4 block can be coded and then the other 4 × 4 blocks in the 16 × 16 block can be coded in raster scan order.
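A small recursive sketch of the coding order just described: within each block, the four quadrants are visited top-left, top-right, bottom-left, bottom-right, recursing until a minimum block size is reached. The 64 × 64 superblock and 4 × 4 minimum size are the example values used above.

```python
def quadtree_order(x: int, y: int, size: int, min_size: int = 4):
    """Yield (x, y, size) for sub-blocks in quadtree coding order
    (top-left, top-right, bottom-left, bottom-right), down to min_size."""
    if size == min_size:
        yield (x, y, size)
        return
    half = size // 2
    for dy in (0, half):          # rows top to bottom
        for dx in (0, half):      # columns left to right
            yield from quadtree_order(x + dx, y + dy, half, min_size)

# First few 4x4 sub-blocks of a 64x64 superblock, in coding order.
order = list(quadtree_order(0, 0, 64))
print(order[:6], len(order))     # 256 leaf blocks in total
```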

Video coding can include compressing information included in an original or input frame by omitting some information in the original frame from a corresponding encoded frame. For example, the coding can include reducing spectral redundancy, reducing spatial redundancy, reducing temporal redundancy, or a combination thereof.

In one example, reducing spectral redundancy can include using a color model based on a luminance component (Y) and two chrominance components (U and V or Cb and Cr), which can be referred to as a YUV or YCbCr color model or color space. Using the YUV color model can include representing a luminance component of a portion of a frame using a relatively large amount of information and representing each corresponding chrominance component of the portion of the frame using a relatively small amount of information. For example, a portion of a frame can be represented by a high resolution luma component (which can include a 16 x 16 pixel block) and can be represented by two lower resolution chroma components (each of which represents the portion of the frame as an 8 x 8 pixel block). A pixel can indicate a value (e.g., a value in the range from 0 to 255) and can be stored or transmitted using, for example, eight bits. Although the present disclosure is described with reference to a YUV color model, any color model can be used.

Reducing spatial redundancy can include intra-predicting a block and transforming the residual block into the frequency domain as described above. For example, a unit of an encoder, such as transform stage 404 of fig. 4, can perform a DCT using transform coefficient values based on spatial frequencies after intra/inter prediction stage 402.

Reducing temporal redundancy can include using similarities between frames to encode frames using a relatively small amount of data based on one or more reference frames, which can be previously encoded, decoded, and reconstructed frames of a video stream. For example, a block or pixel of the current frame may be similar to a spatially corresponding block or pixel of the reference frame. The blocks or pixels of the current frame may be similar to blocks or pixels of the reference frame at different spatial locations. Thus, reducing temporal redundancy can include generating motion information indicative of spatial differences (e.g., a translation between a location of a block or pixel in a current frame and a corresponding location of a block or pixel in a reference frame). This is referred to above as inter prediction.

Reducing temporal redundancy can include identifying a block or pixel in a reference frame or a portion of a reference frame that corresponds to a current block or pixel of a current frame. For example, a reference frame or a portion of a reference frame that can be stored in memory can be searched for the best block or pixel to be used for encoding the current block or pixel of the current frame. For example, the search may identify a block of the reference frame in which a difference in pixel values between the reference block and the current block is minimized, and can be referred to as a motion search. The portion of the reference frame being searched can be limited. For example, the portion of the reference frame being searched that can be referred to as a search area can include a limited number of lines of the reference frame. In one example, identifying the reference block can include calculating a cost function, such as a Sum of Absolute Differences (SAD), between pixels of the block in the search region and pixels of the current block.

The spatial difference between the position of the reference block in the reference frame and the position of the current block in the current frame can be represented as a motion vector. The difference in pixel values between the reference block and the current block can be referred to as differential data, residual data, or a residual block. In some implementations, generating the motion vector can be referred to as motion estimation, and a pixel of the current block can be indicated based on its location in Cartesian coordinates, such as f_{x,y}. Similarly, a pixel of the search area of the reference frame can be indicated using, for example, r_{x,y}. The motion vector (MV) of the current block can be determined based on, for example, the SAD between the pixels of the current frame and the corresponding pixels of the reference frame.
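For illustration, a brute-force motion-search sketch following the description above: for every candidate position in a (deliberately small) search area of the reference frame, the SAD against the current block is computed, and the offset with the smallest SAD becomes the motion vector. The block size, search range, and synthetic frames are arbitrary assumptions.

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences between two equally sized pixel blocks."""
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def motion_search(current_block, reference, block_xy, search_range=8):
    """Full search in a +/- search_range window; returns (motion_vector, best_sad)."""
    bx, by = block_xy
    n = current_block.shape[0]
    best = (None, float("inf"))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            if 0 <= x and 0 <= y and x + n <= reference.shape[1] and y + n <= reference.shape[0]:
                cost = sad(current_block, reference[y:y + n, x:x + n])
                if cost < best[1]:
                    best = ((dx, dy), cost)
    return best

# Toy frames: the current frame is the reference shifted by two columns and one row.
rng = np.random.default_rng(2)
reference = rng.integers(0, 256, (64, 64))
current = np.roll(reference, shift=(1, 2), axis=(0, 1))
print(motion_search(current[16:32, 16:32], reference, (16, 16)))  # -> ((-2, -1), 0)
```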

Although other partitions are possible, as described above with respect to FIG. 6, a CU or block may be coded using quadtree partitioning or coding as shown in the example of FIG. 7. This example shows a quadtree partitioning of block 700. However, block 700 can be partitioned differently, for example, by an encoder (e.g., encoder 400 of fig. 4) or a machine learning model as described herein.

Block 700 is partitioned into four blocks, namely blocks 700-1, 700-2, 700-3, and 700-4. The block 700-2 is further partitioned into blocks 702-1, 702-2, 702-3, and 702-4. Thus, if, for example, the size of block 700 is N × N (e.g., 128 × 128), each of the sizes of blocks 700-1, 700-2, 700-3, and 700-4 is N/2 × N/2 (e.g., 64 × 64), and each of the sizes of blocks 702-1, 702-2, 702-3, and 702-4 is N/4 × N/4 (e.g., 32 × 32). If a block is partitioned, it is partitioned into four equal sized non-overlapping square sub-blocks.

The quadtree data representation is used to describe how block 700 is partitioned into sub-blocks, such as blocks 700-1, 700-2, 700-3, 700-4, 702-1, 702-2, 702-3, and 702-4. A quadtree 704 of partitions of block 700 is shown. Each node of the quadtree 704 is assigned a flag of "1" if the node is further partitioned into four child nodes, and is assigned a flag of "0" if the node is not partitioned. The flag can be referred to as a split bit (e.g., 1) or a stop bit (e.g., 0) and is compiled in the compressed bitstream. In a quadtree, a node has four children or no children. Nodes without children correspond to blocks that are not further partitioned. Each of the child nodes of the partition block corresponds to a child block.

In quad-tree 704, each node corresponds to a sub-block of block 700. The corresponding sub-blocks are shown between parentheses. For example, node 704-1 having a value of 0 corresponds to block 700-1.

Root node 704-0 corresponds to block 700. When the block 700 is partitioned into four sub-blocks, the value of the root node 704-0 is a partition bit (e.g., 1). At an intermediate level, the flag indicates whether the sub-block of block 700 is further partitioned into four sub-blocks. In this case, node 704-2 includes a flag of "1" because block 700-2 is partitioned into blocks 702-1, 702-2, 702-3, and 702-4. Each of nodes 704-1, 704-3, and 704-4 includes a flag of "0" because the corresponding block is not partitioned. Nodes 704-5, 704-6, 704-7, and 704-8 need not be labeled "0" or "1" since these nodes are at the bottom level of the quadtree. The lack of further segmentation of the blocks 702-5, 702-6, 702-7, and 702-8 can be inferred from the absence of additional flags corresponding to these blocks. In this example, the minimum sub-block is 32 × 32 pixels, but further partitioning is possible.

The quad tree data representation of the quad tree 704 can be represented by binary data "10100", where each bit represents a node of the quad tree 704. The binary data indicates the partitions of the block 700 to the encoder and decoder. In the event that the encoder needs to transmit binary data to a decoder (e.g., decoder 500 of fig. 5), the encoder can encode the binary data in a compressed bitstream (e.g., compressed bitstream 420 of fig. 4).
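For illustration, a sketch that writes the split/stop flags in the depth-first order described above and reproduces the binary data "10100" for the partition of fig. 7, in which only the second of the four sub-blocks is split further; bottom-level nodes carry no flag, matching the text.

```python
def write_flags(node, max_depth, depth=0, bits=None):
    """node is either a leaf (None) or a list of four child nodes; nodes at max_depth carry no flag."""
    if bits is None:
        bits = []
    if depth == max_depth:            # bottom level: no flag is coded
        return bits
    if node is None:
        bits.append("0")              # stop bit: the block is not partitioned
    else:
        bits.append("1")              # split bit: the block is partitioned into four sub-blocks
        for child in node:
            write_flags(child, max_depth, depth + 1, bits)
    return bits

# The partition of Fig. 7: the root block splits, and only its second child splits again.
quadtree = [None, [None, None, None, None], None, None]
print("".join(write_flags(quadtree, max_depth=2)))  # -> "10100"
```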

Blocks corresponding to leaf nodes of the quadtree 704 can be used as a basis for prediction. That is, prediction can be performed for each of blocks 700-1, 702-1, 702-2, 702-3, 702-4, 700-3, and 700-4, referred to herein as coded blocks. As mentioned with respect to fig. 6, the coded block may be a luminance block or a chrominance block. Note that in one example, block partitions can be determined with respect to luma blocks. The same partition or a different partition can be used with the chroma blocks.

The prediction type (e.g., intra prediction or inter prediction) is determined with the coded block. That is, the coded block is a decision point for prediction.

A mode decision process (e.g., a partition decision process) determines the partitioning of a coded block, such as block 700. The partition decision process calculates the RD costs of different combinations of coding parameters. That is, for example, different combinations of prediction blocks and predictions (e.g., intra-prediction, inter-prediction, etc.) are examined to determine the best partitioning.

As those skilled in the art will recognize, many mode decision processes can be performed by the encoder.

The machine learning model can be used to generate estimates of the RD costs associated with the respective modes, which in turn are used in mode decisions. That is, implementations according to the present disclosure can be used in cases where the best mode is selected from among a set of possible modes, typically using an RDO process.

Fig. 8 is a flow diagram of a process 800 for searching for a best mode to code a block. Process 800 is an illustrative, high-level example of a mode decision process that determines the best mode from among a plurality of available modes. For ease of description, process 800 is described with respect to selecting an intra-prediction mode for encoding a prediction block. Other examples of best modes that can be determined by a process similar to process 800 include determining a transform type and determining a transform size. Process 800 can be implemented by an encoder (e.g., encoder 400 of fig. 4) using a brute-force approach to mode decision.

At 802, process 800 receives a block. When process 800 is described with respect to determining an intra-prediction mode, a block may be a prediction unit. Referring to FIG. 7, for example, each of the leaf node coded blocks (e.g., blocks 700-1, 702-2, 702-3, 702-4, 700-3, or 700-4) can be partitioned into one or more prediction units. Thus, a block may be one such prediction unit.

At 804, the process 800 determines (e.g., selects, calculates, chooses, etc.) a list of modes. The list of modes can include K modes, where K is an integer. The list of modes can be represented as {m_1, m_2, ..., m_K}. The encoder can have a list of available intra prediction modes. For example, in the case of the AV1 codec, the list of available intra prediction modes may be {DC_PRED, V_PRED, H_PRED, D45_PRED, D135_PRED, D117_PRED, D153_PRED, D207_PRED, D63_PRED, SMOOTH_V_PRED, SMOOTH_H_PRED, and PAETH_PRED}. A description of these intra prediction modes is omitted because it is not relevant to an understanding of the present disclosure. The list of modes determined at 804 may be any subset of the list of available intra prediction modes.

At 806, process 800 initializes the BEST_COST variable to a high value (e.g., INT_MAX, which may be equal to 2,147,483,647) and initializes a loop variable i to 1, which corresponds to the first mode to be checked.

At 808, process 800 computes (e.g., calculates) the rate-distortion cost RD_COST_i for the mode m_i. At 810, process 800 tests whether the RD cost RD_COST_i of the current mode under examination, m_i, is less than the current best cost BEST_COST. If the test is positive, at 812, process 800 updates the best cost to the cost of the current mode (i.e., BEST_COST = RD_COST_i) and sets the current best mode index BEST_MODE to the loop variable i (i.e., BEST_MODE = i). Process 800 then proceeds to 814 to increment the loop variable i (i.e., i = i + 1) in preparation for checking the next mode, if any. If the test is negative, process 800 proceeds directly to 814.

At 816, if there are more modes to check, process 800 returns to 808; otherwise process 800 proceeds to 818. At 818, process 800 outputs the index BEST_MODE. Outputting the best mode can refer to returning the best mode to the caller of process 800. Outputting the best mode can refer to encoding the image using the best mode. Outputting the best mode can have other semantics. The process 800 then terminates.
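For illustration, a direct transcription of process 800; the rd_cost_for_mode callback stands in for the per-mode hypothetical encoding of fig. 9 and is an assumed placeholder, not an actual codec API, and the mode names and costs below are made up.

```python
import sys

def best_mode_search(block, modes, rd_cost_for_mode):
    """Brute-force mode decision: return the index of the mode with the lowest RD cost (process 800)."""
    best_cost = sys.maxsize                       # 806: BEST_COST initialized to a high value
    best_mode = None                              # index of the best mode found so far
    for i, mode in enumerate(modes, start=1):
        cost = rd_cost_for_mode(block, mode)      # 808: estimate the RD cost of mode m_i
        if cost < best_cost:                      # 810/812: keep the cheaper mode
            best_cost, best_mode = cost, i
    return best_mode, best_cost                   # 818: output BEST_MODE

# Illustrative call with made-up modes and a dummy cost function.
modes = ["DC_PRED", "V_PRED", "H_PRED"]
dummy_cost = lambda block, mode: {"DC_PRED": 410.0, "V_PRED": 385.5, "H_PRED": 402.2}[mode]
print(best_mode_search(None, modes, dummy_cost))  # -> (2, 385.5)
```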

Fig. 9 is a block diagram of a process 900 of estimating the rate and distortion cost of coding an image block X by using a coding mode m_i. Process 900 can be performed by an encoder (e.g., encoder 400 of fig. 4). Process 900 codes the image block X using the coding mode m_i to determine the RD cost of encoding the block. More specifically, the process 900 calculates the number of bits (RATE) required to encode the image block X. The process 900 also calculates the DISTORTION based on the difference between the image block X and a reconstructed version of the image block, X_d. Process 900 can be used by process 800 at 808. In this example, the coding mode m_i is a prediction mode.

At 904, a prediction is determined using mode m_i. The prediction can be determined as described with respect to the intra/inter prediction stage 402 of fig. 4. At 906, a residual is determined as the difference between the image block 902 and the prediction. At 908 and 910, the residual is transformed and quantized, such as described with respect to the transform stage 404 and the quantization stage 406, respectively, of fig. 4. The RATE is calculated by a rate estimator 912, which performs hypothetical encoding. In one example, the rate estimator 912 can perform entropy encoding, such as described with respect to the entropy encoding stage 408 of fig. 4.

The quantized residue is dequantized at 914 (such as described, for example, with respect to dequantization stage 410 of fig. 4), inverse transformed at 916 (such as described, for example, with respect to inverse transform stage 412 of fig. 4), and reconstructed at 918 (such as described, for example, with respect to reconstruction stage 414 of fig. 4) to generate a reconstructed block. The distortion estimator 920 calculates distortion between the image block X and the reconstructed block. In one example, the distortion may be a mean square error between pixel values of the image block X and pixel values of the reconstructed block. The distortion may be the sum of absolute difference errors between the pixel values of the image block X and the reconstructed block. Any other suitable distortion measure can be used.

The RATE and the DISTORTION are then combined into a scalar value (i.e., the RD cost) using a Lagrange multiplier, as shown in equation (5):

DISTORTION + λ_mode × RATE    (5)

The Lagrange multiplier λ_mode of equation (5) can vary (e.g., depending on the encoder performing the operations of process 900).
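The following Python sketch illustrates how process 900 combines the estimated rate and distortion into the scalar RD cost of equation (5). The prediction, transform, quantization, and rate-estimation steps are represented by hypothetical helpers bundled in a `stages` object standing in for the encoder stages of fig. 4; the mean-squared-error distortion is one of the measures mentioned above.

```python
import numpy as np

def rd_cost(block, mode, lam, stages):
    """Sketch of process 900: RD cost of coding `block` with `mode`.

    `stages` is a hypothetical object bundling the encoder stages of FIG. 4
    (prediction, transform, quantization, rate estimation, and the
    corresponding inverse operations).
    """
    prediction = stages.predict(block, mode)             # 904
    residual = block.astype(np.int64) - prediction       # 906
    coeffs = stages.transform(residual)                  # 908
    qcoeffs = stages.quantize(coeffs)                    # 910
    rate = stages.estimate_rate(qcoeffs, mode)           # 912: hypothetical coding

    dq = stages.dequantize(qcoeffs)                      # 914
    recon_residual = stages.inverse_transform(dq)        # 916
    recon = prediction + recon_residual                  # 918

    distortion = np.mean((block - recon) ** 2)           # 920: MSE distortion
    return distortion + lam * rate                       # equation (5)
```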

Figs. 8 and 9 illustrate mode decision in a block-based encoder, which is primarily a serial process that essentially determines the mode with the best cost by coding an image block X using each candidate mode. Techniques have been used to reduce the complexity of mode determination. For example, an early-termination technique terminates the loop of process 800 of fig. 8 once certain conditions are met, such as the rate-distortion cost falling below a threshold. Other techniques include, for example, selecting a subset of the available candidate modes based on heuristics or using multiple passes over the candidate modes.

Fig. 10 is a flow diagram of a process 1000 for encoding a block of a video stream using a machine learning model according to an implementation of the present disclosure. The process 1000 includes two stages: a training phase and an inference phase. For simplicity of illustration, the training phase and the inference phase are shown as phases of one process (i.e., process 1000). However, the training phase and the reasoning phase are often separate processes.

At 1002, process 1000 trains a machine learning (ML) model. The ML model can be trained using training data 1004 as input. The training data 1004 is a set of training instances, each indicated by an index i. Each training instance of the training data 1004 can include a video block (i.e., a training block_i) encoded by a conventional encoding method (e.g., by a block-based encoder), such as described with respect to fig. 4 and 6-9; one or more modes mode_i used by the encoder to encode the training block_i; and the resulting encoding cost_i of encoding the training block_i using the mode_i, as determined by the encoder. In the training phase, parameters of the ML model are generated such that, for at least some of the training data 1004, the ML model can infer the mode_i, the cost_i, or both for a training block_i. During the training phase at 1002, the ML model learns (e.g., trains, builds, derives, etc.) a mapping (i.e., a function) from inputs to outputs. The mode may be a partition decision or any other mode decision for compression or reconstruction in video coding. The mode can include a combination of mode decisions.

The block may be, for example, an image block, a prediction block, or a transform block of a source frame. The block may be a residual block, i.e., the difference between a source image block and a prediction block. Thus, the encoding mode can relate to any of these blocks. For example, the encoding mode can include a partition mode, an intra- or inter-prediction mode, a transform mode, etc., and the encoding cost can be the cost of encoding the block using that encoding mode. In addition to the input data shown, the input data can also include block features of the training block_i during the training phase. Which block features are computed (e.g., generated) and used as input to the machine learning model may depend on the encoding mode. For example, the block features extracted (e.g., calculated, determined, etc.) for encoding modes related to transform blocks can differ from those extracted for encoding modes related to prediction blocks.

In one example, the encoding cost can include two separate values, namely the rate and the distortion, which can be calculated as described above. In another example, the encoding cost can include or may be the RD cost value itself.

The ML model can then be used by process 1000 during the inference phase. As shown, the inference phase includes operations 1020 and 1022. The separation 1010 indicates that the training phase and the inference phase can be separated in time. Thus, the inference phase can be performed using a different encoder than the encoder used to train the machine learning model at 1002. In one example, the same encoder is used. In either case, the inference phase uses a machine learning model trained as described with respect to 1002.

Although not specifically shown, during the inference phase, process 1000 receives a source block for which a best mode is to be determined for encoding the block in the bitstream. The best mode may be the partition that minimizes the coding cost. The best mode may be a mode related to the block, such as transform type or transform size. The best mode may be a mode related to an intra prediction block, such as an intra prediction mode. The best mode may be a mode related to the inter-predicted block, such as an interpolation filter type. The best mode may be a combination of modes used to encode and optionally reconstruct the source block.

At 1020, the source block is presented to the model trained as described with respect to 1002. At 1022, the process 1000 obtains (e.g., generates, calculates, selects, determines, etc.), as an output of the machine learning model, a mode decision (e.g., the best mode) that minimizes the encoding cost. At 1024, process 1000 encodes the block in the compressed bitstream using the best mode.
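As a concrete illustration of the two phases of process 1000, the sketch below trains a small classifier on (block features, best mode) pairs produced by a conventional encoder's exhaustive search, and then uses it at inference time to pick a mode. It is a minimal sketch assuming PyTorch; the feature extraction, network sizes, and training loop details are placeholders rather than the disclosed model.

```python
import torch
import torch.nn as nn

class ModeDecisionModel(nn.Module):
    """Toy ML model mapping block features to a score per candidate mode."""
    def __init__(self, num_features, num_modes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 64), nn.ReLU(),
            nn.Linear(64, num_modes),
        )

    def forward(self, features):
        return self.net(features)

def train(model, dataset, epochs=10):
    """Training phase (1002): dataset yields (features_i, best_mode_i) batches
    produced by a conventional encoder's exhaustive mode search."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, best_mode in dataset:
            opt.zero_grad()
            loss = loss_fn(model(features), best_mode)
            loss.backward()
            opt.step()

def infer_best_mode(model, features):
    """Inference phase (1020-1022): present the source block's feature vector
    and take the mode the model scores highest."""
    with torch.no_grad():
        return int(model(features).argmax(dim=-1))
```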

Information derived from the source block during the inference phase of the encoding process 1000 is not readily available to other mode decisions of the encoder or to the decoder. In addition, process 1000 is well suited to image compression but is more difficult to apply to video compression. For at least these reasons, while neural network encoders (e.g., those implementing machine learning models) may be better at representing and recovering high-frequency information and residuals, conventional encoders are often better at capturing simple motion and coding low frequencies.

In the hybrid approach described herein, motion may be handled primarily by conventional means, and the neural network may operate on top of that, at the frame or block level. In this way, for example, side information becomes available that would not otherwise be utilized by the neural network. This improves the use of neural network encoders for video compression, for example. Such a structure may be generally represented by fig. 11, which is a block diagram of an example of a codec 1100 including a neural network with side information. This arrangement can be thought of as a modification of learned image compression, where the network learns (through training) how to approach the optimal rate-distortion function for the source X. The side information Y may be used in the neural network, together with the guide information, to guide the recovery. Given a source image represented by the source X and a degraded image represented by the input Y (also referred to as degraded source data), the goal of the design of fig. 11 and its variations is to send, from the source X, guide information enabling a transformation of the side information Y into X_d, where X_d is closer to the source X than the side information Y is. A conventional encoder pipeline may encode a bitstream that generates a base layer reconstruction. The base layer reconstruction can be used as the side information Y, while the separate guide information provided by the source X results in a recovered signal X_d (also referred to as reconstructed source data).

In fig. 11, a source (e.g., input) X 1102 is input to an encoder 1104 that incorporates a decoder 1108 for reconstructing the source X 1102 as a reconstructed source or output X_d 1110. The encoder 1104 and the decoder 1108 may include one or more neural networks embodying a machine learning model that can be developed in accordance with the teachings herein. For example, the encoder 1104 may be referred to as a neural network encoder, and the decoder 1108 may be referred to as a neural network decoder. The machine learning model may be trained to approximate an optimal rate-distortion function for the source information, e.g., source blocks. That is, the neural network may be trained such that the reconstructed source X_d 1110 is substantially similar to the source X 1102. For example, the reconstructed source X_d 1110 is substantially similar to the source X 1102 when the encoding cost is minimized. The encoding cost may be a rate-distortion value in some implementations. In fig. 11, the objective function R_X/Y(D) for which the neural network is trained is the rate R of transmitting the source X 1102 at distortion D, given that the side information Y is known.
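One way to train such a network end to end is to minimize a Lagrangian rate-distortion loss over the guide information, given the side information. The sketch below, assuming PyTorch and a differentiable rate proxy, illustrates the objective only; the actual network structures are described with respect to figs. 12-14, and the rate model and Lagrange multiplier value are assumptions.

```python
import torch

def rd_loss(x, x_d, guide_bits, lam):
    """Rate-distortion objective for the codec of FIG. 11.

    x          : source X
    x_d        : reconstructed source X_d produced by the decoder from the
                 guide information and the side information Y
    guide_bits : differentiable estimate of the bits spent on the guide
                 information (e.g., from an entropy model)
    lam        : Lagrange multiplier trading off rate against distortion
    """
    distortion = torch.mean((x - x_d) ** 2)
    return distortion + lam * guide_bits

# Training step sketch: the codec consumes (x, y) and yields (x_d, guide_bits).
# for x, y in loader:
#     x_d, guide_bits = codec(x, y)
#     loss = rd_loss(x, x_d, guide_bits, lam=0.01)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```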

Once trained, codec 1100 is capable of producing an output or compressed bitstream for transmission to a decoder or for storage. The compressed bitstream may be generated by quantizing the block residual from the encoder 1104 using a quantizer 1106 and entropy coding the quantized residual using an entropy coder 1112. The block residual may or may not be transformed. Quantizer 1106 may operate similarly to the quantization stage 406 of fig. 4. The entropy coder 1112 may operate similarly to the entropy encoding stage 408 of fig. 4. The bitstream from the entropy coder 1112 may be transmitted to a decoder, such as a decoder constructed similarly to the decoder 1108.

The codec 1100 receives the side information Y as input, an example of which is described below. Typically, side information is information related to the source X that can be utilized by both the encoder and the decoder without modification by their neural networks. The available side information is provided to the neural network to derive the guide information, which, together with the side information, enables reconstruction of the source. In this way, the guide information may be considered enhancement information. The structure of codec 1100 provides a powerful framework that enables many hybrid video coding architectures to be implemented by changing the side information Y.

FIG. 12 is a block diagram of a neural network that can be used to implement the codec of fig. 11. The neural network may include a CNN and/or a fully connected neural network. In this example, constraints are added to the neural network structure in order to pass the side information Y from the input, through the encoder, to a single (e.g., the first) layer of the decoder.

At a high level, and without loss of generality, machine learning models such as the classification deep learning model include two main parts: a feature extraction part and a classification part. The feature extraction section detects features of the model. The classification component attempts to classify the detected features into expected responses. Each of these portions can include one or more layers and/or one or more operations. The term "classification" is used herein to refer to one or more of the layers that output one or more values from the model. The output may be a discrete value, such as a class or category. The output may be a continuous value (e.g., a rate value, a distortion value, a RD cost value). Therefore, the classification section may be appropriately referred to as a regression section.

As described above, a CNN is an example of a machine learning model. In a CNN, the feature extraction portion often includes a set of convolution operations. A convolution operation may apply a series of filters to the input image, where each filter is typically (without loss of generality) a square of size k×k. For example, and in the context of machine vision, these filters can be used to find features in the input image. These features can include, for example, edges, corners, end points, and the like. As the number of stacked convolution operations (e.g., layers) increases, later convolution operations can find higher-level features. Note that the term "feature" is used in two different contexts within this disclosure. First, "features" can be extracted from the input image or block by the feature extraction portion of the CNN. Second, "features" can be computed (e.g., derived) from the input block and used as inputs to the machine learning model. The context clearly indicates which use of the term "feature" is intended.

In a CNN, the classification (e.g., regression) portion may be a set of fully connected layers. The fully connected layers can be thought of as looking at all of the input features of the image in order to generate a high-level classifier. Several stages (e.g., a series) of high-level classifiers ultimately generate the desired regression output.

As mentioned, a CNN may consist of a number of convolution operations (e.g., the feature extraction portion) followed by a number of fully connected layers. The type of each operation and its corresponding size may be determined during the training phase of the machine learning. Additional layers and/or operations can be included in each portion, as will be appreciated by those skilled in the art. For example, combinations of Pooling, MaxPooling, Dropout, Activation, Normalization, BatchNormalization, and other operations can be grouped together with convolution operations (i.e., in the feature extraction portion) and/or the fully connected operations (i.e., in the classification portion). A fully connected layer may be referred to as a Dense operation. A convolution operation can use a SeparableConvolution2D or Convolution2D operation.

As used in this disclosure, a convolutional layer may be a set of operations starting with a Convolution2D or SeparableConvolution2D operation, followed by zero or more operations (e.g., Pooling, Dropout, Activation, Normalization, BatchNormalization, other operations, or a combination thereof), until another convolutional layer, a Dense operation, or the output of the CNN is reached. Similarly, a Dense layer may be a set of operations or layers starting with a Dense operation (i.e., a fully connected layer), followed by zero or more operations (e.g., Pooling, Dropout, Activation, Normalization, BatchNormalization, other operations, or a combination thereof), until another convolutional layer, another Dense layer, or the output of the network is reached. Although not used in the example of fig. 12, the boundary between convolutional-network-based feature extraction and feature classification using Dense operations can be marked by a Flatten operation that flattens the multidimensional matrix from the feature extraction into a vector.

Each of the fully-connected operations is a linear operation in which each input is connected to each output by a weight. Thus, a fully connected layer with N inputs and M outputs can have a total of N × M weights. The Dense operation may be followed by a nonlinear activation function to generate the output of the layer.

In the neural network of fig. 12, three hidden layers 1200A, 1200B, and 1200C are included. The first hidden layer 1200A may be a feature extraction layer, whereas the second hidden layer 1200B and the third hidden layer 1200C may be classification layers.

The data of source X comprises input data from a video stream. The input data can include pixel data such as luminance or chrominance data, position data such as x and y coordinates, and the like. Along with the source X, the side information Y is provided to the first hidden layer 1200A for feature extraction. The resulting extracted features are then used for classification at the second hidden layer 1200B. In this example, the output of the quantizer 1106 includes a block residual (e.g., for luma and each chroma block) that may or may not be transformed as previously described. This is by way of example only and other information (e.g., partitions, etc.) needed to reconstruct the block may also be transmitted.

The encoder 1104 passes the side information Y through to a single layer of the decoder 1108, here the first layer of the decoder (e.g., the third hidden layer 1200C). That is, after being used for feature extraction in the first hidden layer 1200A, the side information Y passes through the layers of the encoder 1104 for use in reconstruction in the first layer of the decoder 1108. When referring to "passing through" the side information Y, the disclosure herein refers to transmitting, from the encoder 1104 to the decoder 1108, the side information Y or whatever information is needed to recreate the side information Y. In fig. 12, the side information Y passes through the hidden layers of the encoder 1104 to the decoder 1108. Alternatively, the side information Y (e.g., the information needed to recreate the side information Y) may skip (or bypass) one or more layers of the encoder 1104, as described below with respect to fig. 13. In either case, the neural network may be referred to as a constrained network because the neural network is constrained by the side information Y. That is, the layers in each of the encoder 1104 and the decoder 1108 depend on the side information Y.

FIG. 13 is a block diagram of another neural network that can be used to implement the codec of fig. 11. The neural network may include a CNN and/or a fully connected neural network similar to the neural network described with respect to fig. 12. Also in this example, constraints are added to the neural network structure in order to pass the side information Y from the encoder input through to the first layer of the decoder.

In the neural network of fig. 13, three hidden layers 1300A, 1300B, and 1300C are included. The first hidden layer 1300A may be a feature extraction layer, and the second hidden layer 1300B and the third hidden layer 1300C may be classification layers.

The data of source X comprises input data from a video stream. The input data can include pixel data such as luminance or chrominance data, position data such as x-and y-coordinates, and the like. Along with the source X, the side information Y is provided to the first hidden layer 1300A for feature extraction. The resulting extracted features are then used for classification at the second hidden layer 1300B. The output of the quantizer 1106 may include block residuals (e.g., for luma and each chroma block) that may or may not be transformed as previously described. Information required to reconstruct the block (e.g., partitions, etc.) may also be transmitted.

The encoder 1104 passes the side information Y to the first layer (e.g., the third hidden layer 1300C) of the decoder 1108. In the encoder 1104 of fig. 12, the side information Y is passed through one or more hidden layers after the first layer of the encoder 1104 for use in reconstruction in the first layer of the decoder 1108. In contrast, the side information Y bypasses one or more hidden layers of the encoder 1104 of fig. 13 in order to pass the side information Y, or the information required to recreate the side information Y, through to the first layer of the decoder 1108.

The third hidden layer 1300C uses the side information Y as an input along with the output of the quantizer 1106 and provides the reconstructed source or output X_d 1110. Fig. 13 is another example of a constrained network because the layers of each of the encoder 1104 and the decoder 1108 depend on the side information Y.
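A minimal sketch of this constrained structure, assuming PyTorch, is shown below. The side information Y is concatenated with the source X at the first encoder layer and fed again, bypassing the remaining encoder layers as in fig. 13, to the first decoder layer; in the fig. 12 variant, Y would instead be carried through each encoder layer. The layer sizes, channel counts, and the placement of quantization are placeholders, not the disclosed design.

```python
import torch
import torch.nn as nn

class ConstrainedCodec(nn.Module):
    """Sketch of a side-information-constrained codec in the spirit of FIGS. 12/13."""
    def __init__(self, channels=3, hidden=32, guide=8):
        super().__init__()
        # Encoder: the first hidden layer sees X and Y together (feature extraction).
        self.enc1 = nn.Conv2d(2 * channels, hidden, 3, padding=1)
        self.enc2 = nn.Conv2d(hidden, guide, 3, padding=1)   # produces guide information
        # Decoder: the first layer sees the (quantized) guide information and Y.
        self.dec1 = nn.Conv2d(guide + channels, hidden, 3, padding=1)
        self.dec2 = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, y):
        h = torch.relu(self.enc1(torch.cat([x, y], dim=1)))
        guide = self.enc2(h)
        # Stand-in for quantizer 1106; training would use a straight-through
        # estimator or additive uniform noise instead of a hard round.
        guide_q = torch.round(guide)
        h = torch.relu(self.dec1(torch.cat([guide_q, y], dim=1)))
        x_d = self.dec2(h)                          # reconstructed source X_d
        return x_d, guide_q
```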

Although three hidden layers are shown in fig. 12 and 13, this is by way of example only. A neural network includes at least one, and often multiple, layers, hidden or otherwise. Thus, more or fewer than three layers may be used in any implementation. Further, the number of layers used by each of the encoder 1104 and the decoder 1108 may differ from the number of layers shown. For example, the decoder 1108 may include one or more additional layers after the third hidden layer 1300C, where the subsequent layers do not receive the side information Y as input but instead receive input from the third hidden layer 1300C. A block diagram of a variation of the example neural network of fig. 13 is shown in fig. 14.

The neural network in fig. 14 includes at least four hidden layers 1400A, 1400B, 1400C, and 1400D. The first hidden layer 1400A may be a feature extraction layer, and the second, third, and fourth hidden layers 1400B, 1400C, 1400D may be classification layers. In this example, rather than using fully connected layers for classification, convolution operations are used for both feature extraction and classification. Therefore, this may be referred to as a convolution-only neural network. This is not required, and fully connected layers may also or alternatively be used for classification.

The data of source X comprises input data from a video stream. Input data 1102 can include pixel data such as luminance or chrominance data, position data such as x and y coordinates, and the like. Along with the source X, the side information Y is provided to the first hidden layer 1400A for feature extraction. The resulting extracted features are then used for classification at the second hidden layer 1400B. The output of the quantizer 1106 may include block residuals (e.g., for luma and each chroma block) that may or may not be transformed as previously described. Information required to reconstruct the block (e.g., partitions, etc.) may also be transmitted.

The encoder 1104 passes the side information Y to a single layer (e.g., the first layer) of the decoder 1108. More specifically, and as in fig. 13, the side information Y bypasses one or more hidden layers of the encoder 1104 of fig. 14 in order to pass the side information Y or information required to recreate the side information Y through to the decoder 1108.

Fig. 14 differs from fig. 13 in that fig. 14 includes, among other things, an expander layer 1410 that can be considered part of the decoder 1108. The expander layer 1410 may be another hidden convolutional layer. The expander layer 1410 receives as input the output from the quantizer 1106 and expands the information from that output for use as input to the third hidden layer 1400C. As discussed above with respect to fig. 11, it is desirable to minimize the guide information obtained from the source X for input to the decoder 1108 by training a codec such as the codec of fig. 14. Thus, the expander layer 1410 may be incorporated to allow the encoder 1104 to send less information than the codecs of fig. 12 and 13. The expander layer 1410 receives the output from the quantizer 1106 and expands the information, i.e., increases the amount of data by convolution operations, to form the guide information input to the first layer (e.g., the third hidden layer 1400C) of the decoder 1108. For example, the amount of data may be increased by increasing the resolution of the guide information from the encoder 1104.

While the expander layer 1410 may be referred to as part of the decoder 1108, it is not considered the first layer of the decoder 1108. Instead, it is considered a pre-layer that generates the guide information input to the first layer of the decoder. The first layer of the decoder is the first layer that receives both the guide information and the side information Y as inputs.

The third hidden layer 1400C uses the side information Y as an input together with the output of the expander layer 1410 and performs a convolution operation to provide an output to the fourth hidden layer 1400D. The fourth hidden layer 1400D performs a convolution operation on the output from the third hidden layer 1400C to output the reconstructed source or output X_d 1110. Fig. 14 may be described as a constrained network because the layers of each of the encoder 1104 and the decoder 1108 depend on (i.e., are constrained by) the side information Y.
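The expander layer of fig. 14 can be sketched as a learned upsampling stage placed in front of the decoder's first layer. The following PyTorch fragment is illustrative only; the use of a transposed convolution, the channel counts, and the scale factor are assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

class ExpanderLayer(nn.Module):
    """Sketch of expander layer 1410: expands the quantizer output so that the
    low-rate guide information matches the resolution expected by the first
    decoder layer (e.g., the third hidden layer 1400C)."""
    def __init__(self, guide_channels=8, out_channels=8, scale=2):
        super().__init__()
        # A transposed convolution increases the spatial resolution by `scale`.
        self.expand = nn.ConvTranspose2d(
            guide_channels, out_channels, kernel_size=scale * 2,
            stride=scale, padding=scale // 2)

    def forward(self, quantized_guide):
        return torch.relu(self.expand(quantized_guide))

# The expanded guide information is then concatenated with the side
# information Y and fed to the first decoder layer, as in FIG. 14.
```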

As is apparent from the above description, the expander layer 1410 may perform upscaling. Accordingly, the expander layer 1410 may be referred to as an upscaling layer. The presence of the layer 1410 indicates that the side information Y and the input data 1102 (or the output of the quantizer 1106, respectively) may not have the same resolution. Thus, the layer 1410 may alternatively be a reduction layer, i.e., a layer that reduces the amount of data by convolution operations to form the guide information input to the first layer (e.g., the third hidden layer 1400C) of the decoder 1108. For example, the amount of data may be reduced by reducing the resolution of the guide information from the encoder 1104. More generally, the layer 1410 may be referred to as a resolution adjustment layer or a normalization layer.

In some implementations, such as shown in fig. 14, the resolution adjustment layer may be a trained layer (implemented using machine learning). In other implementations, the resolution adjustment layer may perform any one or more standard algorithms to change the resolution of its input in accordance with the resolution of the side information Y. The layer 1410 can also be omitted. In such implementations, the expansion or reduction function performed by the layer 1410 may be performed by the third hidden layer 1400C by training that layer using, as inputs, two data sources at different resolutions, while the output of the fourth hidden layer 1400D is at full resolution or at reduced resolution, whichever is indicated for the reconstructed source or output X_d 1110.

The architecture of each of fig. 11-14 provides end-to-end training of the neural network. In this way, the network automatically learns how to use Y to recover X_d and what minimal information is to be sent as guidance.

In an alternative example to the examples shown in fig. 11 to 14, the neural network structure is not constrained by the side information Y. Fig. 15 is a block diagram of an alternative example of a codec 1500 including a neural network with side information according to an implementation of the present disclosure. Codec 1500 is similar to codec 1100 except for the structure of the neural network forming encoder 1504 and decoder 1508.

More specifically, the input, which may be the same source X 1102 of fig. 11, is input to an encoder 1504 that incorporates a decoder 1508 for reconstructing the source X 1102 as the reconstructed source or output X_d 1110. The encoder 1504 and the decoder 1508 may include one or more neural networks embodying machine learning models that can be developed in accordance with the teachings herein. For example, the encoder 1504 may be referred to as a neural network encoder, and the decoder 1508 may be referred to as a neural network decoder.

Codec 1500 generates an output or compressed bitstream R_X/Y(D) for transmission to a decoder or for storage. The compressed bitstream R_X/Y(D) may be generated by quantizing the block residual from the encoder 1504 using the quantizer 1106 and entropy coding the quantized residual using the entropy coder 1112. The block residual may or may not be transformed. Quantizer 1106 may operate similarly to the quantization stage 406 of fig. 4. The entropy coder 1112 may operate similarly to the entropy encoding stage 408 of fig. 4.

Codec 1500 receives the side information Y as input, an example of which is described below. In contrast to codec 1100, no structural constraints are imposed on the neural network to account for the side information Y in this example. Instead, a simplified model is used whereby the side information is used as input by generating a difference X - T(Y) that is coded using the neural network formed by the encoder 1504 and the decoder 1508. That is, the difference X - T(Y) is used as the input to the first layer of the encoder 1504. The machine learning model may be trained to approach the optimal rate-distortion function of the difference. The neural network may be trained such that the output of the decoder 1508 is substantially similar to the difference X - T(Y). For example, the output is substantially similar to the difference X - T(Y) when the encoding cost is minimized. The encoding cost may be a rate-distortion value in some implementations.

The side information T(Y) is defined by a deterministic transform of Y such that T(Y) is at the same resolution as the source X 1102. Thus, the side information T(Y) may be used to generate the reconstructed source or output X_d 1110 by adding T(Y) to the output of the decoder 1508.
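The simplified model of codec 1500 can be sketched as follows, again assuming PyTorch and illustrative layer sizes: an unconstrained autoencoder codes the difference X - T(Y), and T(Y) is added back to the decoder output to form X_d.

```python
import torch
import torch.nn as nn

class ResidualSideInfoCodec(nn.Module):
    """Sketch of codec 1500: the network codes X - T(Y); T(Y) is added back."""
    def __init__(self, channels=3, hidden=32, guide=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, guide, 3, padding=1))
        self.decoder = nn.Sequential(
            nn.Conv2d(guide, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1))

    def forward(self, x, t_y):
        # t_y is T(Y), a deterministic transform of Y at the resolution of X.
        diff = x - t_y                               # input to the first encoder layer
        guide_q = torch.round(self.encoder(diff))    # stand-in for quantizer 1106
        x_d = self.decoder(guide_q) + t_y            # add T(Y) back to form X_d
        return x_d, guide_q
```

Because the network itself never sees Y directly, any off-the-shelf learned image compression architecture could, in principle, be dropped in as the encoder/decoder pair in this sketch.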

Although the layers of the codec 1500 are not shown, the codec 1500 may have a structure similar to that of the codec 1100 shown in the examples of fig. 12 to 14, except that the codec 1500 does not include a structure for passing the side information Y through to the decoder. An advantage of the codec 1500 over the codec 1100 is that the structure of the neural network is simpler. Because there are no side-information constraints on the neural network structure, various neural network Application Programming Interfaces (APIs) may be used (e.g., for training). However, the codec 1500 may have reduced performance (e.g., higher rate-distortion values) due to the loss of the structure associated with the side information Y.

Note that details of machine learning, neural networks (including convolutional neural networks), and/or details that would be known to one skilled in the art based on the description herein are omitted. For example, those skilled in the art recognize that the values of the convolution filters and the weights of the connections between nodes (i.e., neurons) in a CNN are determined by the CNN during the training phase. Accordingly, such details are not discussed in detail herein. In each of fig. 11 to 15, an encoder that integrates a decoding stage, labeled as a decoder, is described. This is an arrangement similar to that described with respect to fig. 4. Although not explicitly shown in these figures, another decoder having a structure similar to that of the decoder 1108 or 1508 may be used to reconstruct the source data/information, using the side information and the guide information as inputs to the decoder (e.g., at least to its first hidden layer).

In the above discussion, the side information Y is described as being derived from a conventional encoding method. The side information Y may be any side information related to the source information. The side information Y may be the product of the encoder itself, such as where only part of the encoder is implemented via a neural network. That is, the encoder may be a hybrid encoder including certain block-based components as described in the examples of fig. 3-5 or object-based components as known to those skilled in the art, in combination with one or more neural networks. In this example, the hybrid encoder itself may generate (and use) the side information. The side information Y may also be determined by a second encoder, wherein the second encoder provides the side information Y to the first (e.g. neural network) encoder.

The side information Y (or a function of it) is used together with a deep neural network that may have structural constraints that enforce the availability of the side information Y at the decoder side. As briefly described above, the information derived from the neural network layers may be considered guide information or enhancement information for the video being coded. Many variations of the side information Y, and thus of the enhancement information, are possible.

In a first example of a hybrid codec according to the teachings herein, the side information Y may be used in prediction residual coding and reconstruction. For example, the side information Y may be a full-resolution predictor or prediction signal from a conventional motion-based predictor (e.g., a prediction block from inter prediction performed, for example, at the intra/inter prediction stage 402). The neural network may learn, for example, the optimal residual transform (i.e., the transform that yields the lowest rate-distortion value) for the residuals produced from the source X (e.g., a block) and the full-resolution predictor. The full-resolution predictor may also improve the reconstruction by the decoder because the structure (i.e., the features) of the full-resolution predictor is available. In this example, only the residual coding of the conventional encoder is modified in the hybrid video encoder. Since conventional video compression codes prediction residuals independently of the prediction, including the prediction as side information to a neural network may provide better reconstruction of the original video data.

In another example of a hybrid codec according to the teachings herein, a neural network may be used for recovery with guide information. For example, a conventional encoder may be used to encode a bitstream, which may be referred to as a base layer. The base layer reconstruction can then be used as the side information Y, which is refined toward the source X in a separate guide layer. In this way, a form of scalable coding is formed.

In an implementation of this latter technique, a conventional encoder may encode the base layer bitstream at a reduced resolution. While the high resolution information is encoded using source X, a reduced resolution reconstruction from the conventional base layer (e.g., a reduced resolution reconstruction per frame) may be used as side information Y. This process generates a form of spatial scalable coding in which the motion information is only at reduced resolution. Even in such designs, it is expected that advantages (e.g., in terms of rate-distortion values) will result for small reduction ratios, low bit rates, or both.

When the base layer bitstream is used in the spatial scalability model, the base layer bitstream can be decoded independently of the enhancement neural network layer. Some loss in coding efficiency is expected due to the loss in accuracy of the motion information. However, this allows a design whereby the training of the neural network can be open loop. This may be achieved, for example, with two encoders as described above.

In another design using the base layer bitstream in the spatial scalability model, in-loop super resolution can be achieved. That is, the output of the full-resolution enhancement neural network layer may be used to refresh the frame buffer used to encode subsequent frames. In this way, the reference frame buffer may always be at full resolution (e.g., the reference frame buffer can store full-resolution reference frames). Frames coded at lower resolution can use scaled motion compensation. In contrast to the open-loop design above, a single hybrid encoder can be used to implement such an in-loop design.

An artificial neural network with side information may also be used to generate a multi-modal predictor. In such an implementation, the source X may be the data to be predicted, and the side information Y may be the data used to predict the source X. For example, the side information Y may include the neighboring pixels used for intra prediction. The neural network would then be trained using, as inputs, the results of the exhaustive analysis described previously, where the results include the most efficient intra-prediction mode for the corresponding training block (e.g., the intra-prediction mode that yields the lowest encoding cost). Fully connected layers may be desirable for classification in such designs (e.g., instead of convolutional layers) because such designs may be used to more strictly control the dimensions of the layers (e.g., the number of parameters). For example, the available modes in the multi-mode predictor may be limited so that only a few bits of information need to be delivered. A sketch of such a predictor follows.
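The sketch below, assuming PyTorch, illustrates one possible form of such a multi-mode predictor: the side information Y (neighboring pixels) is flattened and classified into one of the available intra-prediction modes by fully connected layers. The feature dimensions, layer sizes, and number of modes are placeholders, not the disclosed design.

```python
import torch
import torch.nn as nn

class IntraModePredictor(nn.Module):
    """Sketch of a multi-mode predictor: fully connected layers classify the
    neighboring pixels (side information Y) into an intra-prediction mode."""
    def __init__(self, num_neighbor_pixels, num_modes=12):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(num_neighbor_pixels, 64), nn.ReLU(),
            # Small layers keep the output to only a few bits of information.
            nn.Linear(64, num_modes))

    def forward(self, neighbors):
        # neighbors: flattened above/left reconstructed pixels of the block.
        logits = self.classifier(neighbors)
        return logits  # argmax over the logits selects the predicted mode
```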

The codecs 1100 and 1500 of fig. 11-15 include both encoders and decoders. This arrangement represents the structure used to train the neural network and may represent an encoder at the transmitting station 102. Where the output of the entropy coder 1112 (and thus of the codec 1100 or the codec 1500) is transmitted or stored for playback by a separate decoder, e.g., where the codec 1100 or the codec 1500 is located at the transmitting station 102 and the separate decoder is at the receiving station 106, that decoder may structurally correspond to a trained implementation of the decoder 1108 or the decoder 1508, optionally combined with portions of another decoder, e.g., the decoder 500 of fig. 5. In either case, the decoder can use the output of the entropy coder 1112 to generate the side information for inclusion in the first layer of the neural network portion of the decoder, or can receive the side information as a signal separate from the output of the entropy coder 1112 for inclusion in that first layer.

The artificial neural network with side information as described above provides a powerful framework that can address many use cases of interest. Neural networks have structural constraints that enforce the availability of side information at the decoder side. The side information may be derived from conventional encoding methods or any other degradation process (i.e., a degradation process that degrades the source). Many variations are possible depending on what the side information is.

For simplicity of explanation, the processes herein are depicted and described as a series of blocks, steps, or operations. However, blocks, steps or operations in accordance with the present disclosure can occur in various orders and/or concurrently. Additionally, other steps or operations not presented and described herein may be used. Moreover, not all illustrated steps or operations may be required to implement a technique in accordance with the disclosed subject matter.

Aspects of the above-described encoding and decoding illustrate some encoding and decoding techniques. It should be understood, however, that "encoding" and "decoding" as those terms are used in the claims may mean compression, decompression, transformation, or any other processing or variation of data.

The word "example" or "implementation" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" or "implementation" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "example" or "implementation" is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise, or clearly indicated otherwise by context, "X comprises a or B" is intended to mean either of its natural inclusive permutations. That is, if X comprises A; x comprises B; or X includes both a and B, then "X includes a or B" is satisfied under any of the foregoing examples. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form. Moreover, unless so described, the use of the term "implementation" or "one implementation" in various places is not intended to refer to the same embodiment or implementation.

Implementations of transmitting station 102 and/or receiving station 106 (and the algorithms, methods, instructions, etc., stored thereon and/or executed thereby, including by encoder 400 and decoder 500) can be implemented in hardware, software, or any combination thereof. The hardware can include, for example, a computer, an Intellectual Property (IP) core, an Application Specific Integrated Circuit (ASIC), a programmable logic array, an optical processor, a programmable logic controller, microcode, a microcontroller, a server, a microprocessor, a digital signal processor, or any other suitable circuit. In the claims, the term "processor" should be understood to include any of the foregoing hardware, either alone or in combination. The terms "signal" and "data" are used interchangeably. In addition, portions of transmitting station 102 and receiving station 106 do not necessarily have to be implemented in the same manner.

Further, in one aspect, transmitting station 102 or receiving station 106 can be implemented, for example, using a general purpose computer or a general purpose processor having a computer program that, when executed, performs any of the respective methods, algorithms, and/or instructions described herein. Additionally or alternatively, for example, a special purpose computer/processor can be utilized, which can contain other hardware for performing any of the methods, algorithms, or instructions described herein.

For example, transmitting station 102 and receiving station 106 can be implemented on computers in a video conferencing system. Alternatively, the transmitting station 102 can be implemented on a server and the receiving station 106 can be implemented on a device separate from the server, such as a handheld communication device. In this example, transmission station 102, using encoder 400, is able to encode content into an encoded video signal and transmit the encoded video signal to a communication device. In turn, the communication device can then decode the encoded video signal using the decoder 500. Alternatively, the communication device can decode content stored locally on the communication device, e.g., content that was not transmitted by transmission station 102. Other implementations of transmitting station 102 and receiving station 106 are available. For example, the receiving station 106 may be a generally stationary personal computer rather than a portable communication device, and/or a device that includes the encoder 400 may also include the decoder 500.

Additionally, all or portions of implementations of the present disclosure can take the form of a computer program product accessible from, for example, a tangible computer-usable or computer-readable medium. A computer-usable or computer-readable medium may be any apparatus that can, for example, tangibly embody, store, communicate, or transport the program for use by or in connection with any processor. The medium may be, for example, an electronic, magnetic, optical, electromagnetic, or semiconductor device. Other suitable media are also available.

The above-described embodiments, implementations, and aspects have been described in order to allow easy understanding of the present disclosure and do not limit the present disclosure. On the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation as permitted under the law so as to encompass all such modifications and equivalent arrangements.
