Method and apparatus for optical flow prediction refinement

Document No.: 1909859 · Publication date: 2021-11-30

Note: This technology, "Method and apparatus for optical flow prediction refinement" (光流预测细化的方法和装置), was created by 修晓宇, 陈漪纹, 王祥林, 叶水明, 马宗全, and 朱弘正 on 2020-04-30. Abstract: A bit-width characterization method, apparatus, and non-transitory computer-readable storage medium for prediction refinement with optical flow (PROF) are provided. The method includes obtaining a first reference picture I associated with a video block within a video signal, obtaining prediction samples I(i, j) of the video block from a reference block in the first reference picture I, controlling the internal bit-widths of a PROF derivation process by applying right shifts to a plurality of internal PROF parameters based on different bit shift values to achieve various characterization accuracies of the internal PROF parameters, obtaining prediction refinement values for samples in the video block based on the PROF derivation process applied to the video block based on the prediction samples I(i, j), and obtaining prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

1. A bit-width characterization method for optical flow Prediction Refinement (PROF) for decoding video signals, comprising:

obtaining a first reference picture I associated with a video block within a video signal;

obtaining prediction samples I(i, j) of the video block from a reference block in the first reference picture I, wherein i and j represent coordinates of one sample relative to the video block;

controlling internal bit-widths of a PROF derivation process to achieve internal PROF parameters of different characterization accuracies by applying a right shift to the internal PROF parameters based on different bit-shift values, wherein the internal PROF parameters include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j);

obtaining prediction refinement values for samples in the video block according to the PROF derivation process applied to the video block based on the prediction samples I(i, j); and

obtaining prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

2. The method of claim 1, wherein controlling internal bit-widths of a PROF derivation process by applying a right shift to internal PROF parameters based on different bit shift values to achieve the internal PROF parameters of different characterization accuracies comprises:

obtaining a horizontal gradient value of a first prediction sample I(i, j) based on a difference value of a first prediction sample I(i+1, j) and a first prediction sample I(i-1, j);

obtaining a vertical gradient value of the first prediction sample I(i, j) based on a difference value of a first prediction sample I(i, j+1) and a first prediction sample I(i, j-1);

right shifting the horizontal gradient value by a first bit shift value; and

right shifting the vertical gradient value by the first bit shift value.

3. The method of claim 2, further comprising:

obtaining control point Motion Vectors (MVs) for the first prediction sample I(i, j), wherein the control point MVs comprise motion vectors for an upper-left block, an upper-right block, and a lower-left block of a block comprising the video block;

obtaining affine model parameters derived based on the control point MVs;

obtaining a horizontal MV difference Δvx(i, j) and a vertical MV difference Δvy(i, j) for the first prediction sample I(i, j) based on the affine model parameters;

right-shifting the horizontal MV difference Δvx(i, j) by a second bit shift value; and

right-shifting the vertical MV difference Δvy(i, j) by the second bit shift value.

4. The method of claim 3, wherein the second bit-shift value is equal to 13 minus the first bit-shift value.

5. The method of claim 3, wherein obtaining prediction refinements for samples in the video block comprises:

obtaining the prediction refinement value based on the horizontal gradient value, the horizontal MV difference Δvx(i, j), the vertical gradient value, and the vertical MV difference Δvy(i, j); and

right-shifting the prediction refinement value by 1.

6. The method of claim 3, wherein the first bit-shift value is greater than the second bit-shift value when a quantization parameter used to encode the video signal is less than a predetermined threshold.

7. The method of claim 3, wherein the first bit shift value is less than the second bit shift value when high frequency information in the video signal is less than a predetermined threshold.

8. The method according to claim 3, wherein the first bit shift value is smaller than the second bit shift value when the maximum difference between the control point MVs is smaller than a predetermined threshold.

9. A computing device, comprising:

one or more processors;

a non-transitory computer-readable storage medium storing instructions executable by the one or more processors, wherein the one or more processors are configured to:

obtaining a first reference picture I associated with a video block within a video signal;

obtaining prediction samples I(i, j) of the video block from a reference block in the first reference picture I, wherein i and j represent coordinates of one sample relative to the video block;

controlling internal bit-widths of a PROF derivation process to achieve internal PROF parameters of different characterization accuracies by applying a right shift to the internal PROF parameters based on different bit-shift values, wherein the internal PROF parameters include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j);

based on the prediction samples I(i, j), obtaining prediction refinement values for samples in the video block according to the PROF derivation process applied to the video block; and

obtaining prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

10. The computing device of claim 9, wherein the one or more processors configured to control internal bit widths of a PROF derivation process by applying a right shift to internal PROF parameters based on different bit shift values to achieve the internal PROF parameters of different characterization accuracies are further configured to:

obtaining a horizontal gradient value of a first prediction sample I(i, j) based on a difference value of a first prediction sample I(i+1, j) and a first prediction sample I(i-1, j);

obtaining a vertical gradient value of the first prediction sample I(i, j) based on a difference value of a first prediction sample I(i, j+1) and a first prediction sample I(i, j-1);

right shifting the horizontal gradient value by a first bit shift value; and

right shifting the vertical gradient value by the first bit shift value.

11. The computing device of claim 10, wherein the one or more processors are further configured to:

obtaining control point Motion Vectors (MVs) for the first prediction sample I(i, j), wherein the control point MVs comprise motion vectors for an upper-left block, an upper-right block, and a lower-left block of a block comprising the video block;

obtaining affine model parameters derived based on the control point MVs;

obtaining a horizontal MV difference Δvx(i, j) and a vertical MV difference Δvy(i, j) for the first prediction sample I(i, j) based on the affine model parameters;

right-shifting the horizontal MV difference Δvx(i, j) by a second bit shift value; and

right-shifting the vertical MV difference Δvy(i, j) by the second bit shift value.

12. The computing device of claim 11, wherein the second bit-shift value is equal to 13 minus the first bit-shift value.

13. The computing device of claim 11, wherein the one or more processors configured to obtain the prediction refinements for samples in the video block are further configured to:

obtaining the prediction refinement value based on the horizontal gradient value, the horizontal MV difference Δvx(i, j), the vertical gradient value, and the vertical MV difference Δvy(i, j); and

right-shifting the prediction refinement value by 1.

14. The computing device of claim 11, wherein the first bit-shift value is greater than the second bit-shift value when a quantization parameter used to encode the video signal is less than a predetermined threshold.

15. The computing device of claim 11, wherein the first bit shift value is less than the second bit shift value when high frequency information in the video signal is less than a predetermined threshold.

16. The computing device of claim 11, wherein the first bit shift value is less than the second bit shift value when the maximum difference between the control point MVs is less than a predetermined threshold.

17. A non-transitory computer readable storage medium storing a plurality of programs for execution by a computing device having one or more processors, wherein the plurality of programs, when executed by the one or more processors, cause the computing device to perform:

obtaining a first reference picture I associated with a video block within a video signal;

obtaining prediction samples I(i, j) of the video block from a reference block in the first reference picture I, wherein i and j represent coordinates of one sample relative to the video block;

controlling internal bit-widths of a PROF derivation process to achieve different characterization accuracies of a plurality of internal PROF parameters by applying a right shift to the internal PROF parameters based on different bit-shift values, wherein the internal PROF parameters include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j);

based on the prediction samples I(i, j), obtaining prediction refinement values for samples in the video block according to the PROF derivation process applied to the video block; and

obtaining prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

18. The non-transitory computer readable storage medium of claim 17, wherein the plurality of programs that cause the computing device to perform controlling internal bit widths of a PROF derivation process by applying a right shift to internal PROF parameters based on different bit shift values to achieve the internal PROF parameters of different characterization accuracies further cause the computing device to perform:

obtaining a horizontal gradient value of a first prediction sample I(i, j) based on a difference value of a first prediction sample I(i+1, j) and a first prediction sample I(i-1, j);

obtaining a vertical gradient value of the first prediction sample I(i, j) based on a difference value of a first prediction sample I(i, j+1) and a first prediction sample I(i, j-1);

right shifting the horizontal gradient value by a first bit shift value; and

right shifting the vertical gradient value by the first bit shift value.

19. The non-transitory computer readable storage medium of claim 18, wherein the plurality of programs further cause the computing device to perform:

obtaining control point Motion Vectors (MVs) for the first prediction sample I(i, j), wherein the control point MVs comprise motion vectors for an upper-left block, an upper-right block, and a lower-left block of a block comprising the video block;

obtaining affine model parameters derived based on the control point MVs;

obtaining a horizontal MV difference Δvx(i, j) and a vertical MV difference Δvy(i, j) for the first prediction sample I(i, j) based on the affine model parameters;

right-shifting the horizontal MV difference Δvx(i, j) by a second bit shift value; and

right-shifting the vertical MV difference Δvy(i, j) by the second bit shift value.

20. The non-transitory computer-readable storage medium of claim 19, wherein the second bit-shift value is equal to 13 minus the first bit-shift value.

21. The non-transitory computer-readable storage medium of claim 19, wherein the plurality of programs that cause the computing device to perform obtaining prediction refinements for samples in the video block further cause the computing device to perform:

obtaining the prediction refinement value based on the horizontal gradient value, the horizontal MV difference Δvx(i, j), the vertical gradient value, and the vertical MV difference Δvy(i, j); and

right-shifting the prediction refinement value by 1.

22. The non-transitory computer-readable storage medium of claim 19, wherein the first bit-shift value is greater than the second bit-shift value when a quantization parameter used to encode the video signal is less than a predetermined threshold.

23. The non-transitory computer-readable storage medium of claim 19, wherein the first bit shift value is less than the second bit shift value when high frequency information in the video signal is less than a predetermined threshold.

24. The non-transitory computer-readable storage medium of claim 19, wherein the first bit shift value is less than the second bit shift value when the maximum difference between the control point MVs is less than a predetermined threshold.

Technical Field

The present invention relates to video coding and compression. More particularly, the present invention relates to methods and apparatus for two inter prediction tools studied in the Versatile Video Coding (VVC) standard, namely, prediction refinement with optical flow (PROF) and bi-directional optical flow (BDOF).

Background

Video data may be compressed using various video coding techniques. Video coding is performed according to one or more video coding standards. For example, video coding standards include Versatile Video Coding (VVC), the Joint Exploration test Model (JEM), High-Efficiency Video Coding (H.265/HEVC), Advanced Video Coding (H.264/AVC), Moving Picture Experts Group (MPEG) coding, and the like. Video coding typically utilizes prediction methods (e.g., inter-prediction, intra-prediction, etc.) that exploit redundancy present in video images or sequences. An important goal of video coding techniques is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality.

Disclosure of Invention

Examples of the present application provide methods and apparatus for bit depth control for optical flow prediction refinement.

According to a first aspect of the present application, a bit-width characterization method for a PROF for decoding a video signal is provided. The method may include obtaining a first reference picture I associated with a video block within a video signal. The method may also include obtaining prediction samples I(i, j) for the video block from a reference block in the first reference picture I. i and j may characterize the coordinates of a sample relative to the video block. The method may include controlling the internal bit-widths of the PROF derivation process by applying a right shift to the internal PROF parameters based on different bit shift values to achieve the internal PROF parameters at different characterization accuracies. The internal PROF parameters may include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j). The method may also include obtaining prediction refinement values for samples in the video block according to the PROF derivation process applied to the video block based on the prediction samples I(i, j). The method may include obtaining prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

According to a second aspect of the present application, a computing device is provided. The computing device may include one or more processors and a non-transitory computer-readable storage medium storing instructions executable by the one or more processors. The one or more processors may be configured to obtain a first reference picture I associated with a video block within a video signal. The one or more processors may also be configured to obtain prediction samples I(i, j) of the video block from a reference block in the first reference picture I. i and j characterize the coordinates of a sample with respect to the video block. The one or more processors may be further configured to control the internal bit-widths of the PROF derivation process by applying a right shift to the internal PROF parameters based on different bit shift values to achieve the internal PROF parameters at different characterization accuracies. The internal PROF parameters may include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j). The one or more processors may also be configured to obtain prediction refinement values for samples in the video block according to the PROF derivation process applied to the video block based on the prediction samples I(i, j). The one or more processors may be configured to obtain prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

According to a third aspect of the present application, a non-transitory computer-readable storage medium having instructions stored therein is provided. The instructions, when executed by one or more processors of an apparatus, cause the apparatus to obtain a first reference picture I associated with a video block within a video signal. The instructions also cause the apparatus to obtain prediction samples I(i, j) of the video block from a reference block in the first reference picture I. i and j characterize the coordinates of a sample with respect to the video block. The instructions cause the apparatus to control the internal bit-widths of the PROF derivation process by applying a right shift to the internal PROF parameters based on different bit shift values to achieve the internal PROF parameters at different characterization accuracies. The internal PROF parameters include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j). The instructions also cause the apparatus to obtain, based on the prediction samples I(i, j), prediction refinement values for samples in the video block according to the PROF derivation process applied to the video block. The instructions cause the apparatus to obtain prediction samples for the video block based on a combination of the prediction samples and the prediction refinement values.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate examples consistent with the application and, together with the description, serve to explain the principles of the application.

Fig. 1 is a block diagram of an encoder according to an example of the present application.

Fig. 2 is a block diagram of a decoder according to an example of the present application.

Fig. 3A is a schematic diagram illustrating block partitioning in a multi-type tree structure according to an example of the present application.

Fig. 3B is a schematic diagram illustrating block partitioning in a multi-type tree structure according to an example of the present application.

Fig. 3C is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present application.

Fig. 3D is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present application.

Fig. 3E is a schematic diagram illustrating block partitioning in a multi-type tree structure according to an example of the present application.

Fig. 4 is a schematic diagram of a bi-directional optical flow (BDOF) model according to an example of the present application.

FIG. 5A is a schematic diagram of an affine model according to an example of the present application.

FIG. 5B is a schematic diagram of an affine model according to an example of the present application.

FIG. 6 is a schematic diagram of an affine model according to an example of the present application.

Fig. 7 is a schematic diagram of a PROF according to an example of the present application.

Fig. 8 is a workflow diagram of a BDOF according to an example of the present application.

Fig. 9 is a workflow diagram of a PROF according to an example of the present application.

Fig. 10 is a flowchart of a bit-width characterization method of a PROF according to the present application.

Fig. 11 is a flowchart of a method for controlling internal bit-widths of a PROF derivation process according to the present application.

Fig. 12 is a diagram of a workflow for a bipredictive PROF according to an example of the present application.

FIG. 13 is a view of a pipeline stage of BDOF and PROF according to the application.

Fig. 14 is a view of a gradient derivation method of the BDOF according to the present application.

Fig. 15 is a view of a gradient derivation method of a PROF according to the present application.

FIG. 16 is a schematic diagram illustrating a computing environment coupled with a user interface according to an example of the present application.

Detailed Description

Reference will now be made in detail to example embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which like reference numerals refer to the same or similar elements throughout the different views unless otherwise specified. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with aspects set forth in the claims below that are related to the present application.

The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein is intended to mean and include any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may be termed a second message without departing from the scope of the present invention; similarly, the second information may also be referred to as the first information. As used herein, the term "if" may be understood to mean "when" or "based on" or "in response to a determination", depending on the context.

The first version of the HEVC standard, finalized in October 2013, provides a bit rate saving of approximately 50%, or equivalent perceptual quality, compared to the previous-generation video coding standard H.264/MPEG AVC. Although the HEVC standard provides significant coding improvements over its predecessor, there is evidence that higher coding efficiency can be achieved with additional coding tools beyond HEVC. On this basis, both VCEG and MPEG began the exploration of new coding technologies for the standardization of future video coding. ITU-T VCEG and ISO/IEC MPEG established the Joint Video Exploration Team (JVET) in October 2015, and significant study of advanced technologies capable of substantially improving coding efficiency began. JVET maintains a reference software called the Joint Exploration Model (JEM) by integrating several additional coding tools on top of the HEVC test model (HM).

In October 2017, ITU-T and ISO/IEC issued a joint Call for Proposals (CfP) for video compression with capability beyond HEVC. In April 2018, 23 CfP responses were received and evaluated at the 10th JVET meeting, demonstrating an improvement in compression efficiency of about 40% over HEVC. Based on these evaluation results, JVET launched a new project to develop the new-generation video coding standard named Versatile Video Coding (VVC). In the same month, a reference software codebase called the VVC Test Model (VTM) was established for demonstrating a reference implementation of the VVC standard.

As with HEVC, VVC is built upon the block-based hybrid video coding framework. Fig. 1 shows an overall view of a block-based video encoder for VVC. In particular, Fig. 1 shows a typical encoder 100. Encoder 100 has a video input 110, motion compensation 112, motion estimation 114, intra/inter mode decision 116, block predictor 140, adder 128, transform 130, quantization 132, prediction related information 142, intra prediction 118, picture buffer 120, inverse quantization 134, inverse transform 136, adder 126, memory 124, loop filter 122, entropy coding 138, and bitstream 144.

In encoder 100, a video frame is partitioned into multiple video blocks for processing. For each given video block, a prediction is formed based on either an inter prediction method or an intra prediction method.

A prediction residual, representing the difference between the current video block (part of video input 110) and its predictor (part of block predictor 140), is sent from adder 128 to transform 130. The transform coefficients are then sent from transform 130 to quantization 132 to reduce entropy. The quantized coefficients are then fed to entropy encoding 138 to generate a compressed video bitstream. As shown in fig. 1, prediction related information 142 from the intra/inter mode decision 116, such as video block partitioning information, Motion Vectors (MVs), reference picture indices and intra prediction modes, is also fed through entropy coding 138 and saved as a compressed bitstream 144. The compressed bitstream 144 comprises a video bitstream.

In the encoder 100, circuitry associated with the decoder is also required for pixel reconstruction for prediction purposes. First, the prediction residual is reconstructed by inverse quantization 134 and inverse transform 136. This reconstructed prediction residual is combined with the block predictor 140 to generate an unfiltered reconstructed pixel for the current video block.

Spatial prediction (or "intra prediction") uses pixels from samples (called reference samples) from already-coded neighboring blocks in the same video frame as the current video block to predict the current video block.

Temporal prediction (also referred to as "inter prediction") uses reconstructed pixels from an encoded video picture to predict a current video block. Temporal prediction reduces temporal redundancy inherent in video signals. The temporal prediction signal for a given Coding Unit (CU), or coding block, is typically signaled by one or more MVs, which indicate the amount and direction of motion between the current CU and its temporal reference. Further, if multiple reference pictures are supported, one reference picture index is additionally sent, which is used to identify from which reference picture in the reference picture memory the temporal prediction signal comes.

Motion estimation 114 receives video input 110 and signals from picture buffer 120 and outputs a motion estimation signal to motion compensation 112. Motion compensation 112 receives video input 110, a signal from picture buffer 120, and a motion estimation signal from motion estimation 114, and outputs the motion compensated signal to intra/inter mode decision 116.

After spatial and/or temporal prediction, an intra/inter mode decision 116 in the encoder 100 selects the best prediction mode, for example, based on a rate-distortion optimization method. The block predictor 140 is then subtracted from the current video block, and the resulting prediction residual is decorrelated using the transform 130 and quantization 132. The resulting quantized residual coefficients are inverse quantized by inverse quantization 134 and inverse transformed by inverse transform 136 to form the reconstructed residual, which is then added back to the prediction block to form the reconstructed signal of the CU. Further, in-loop filtering, such as a deblocking filter, Sample Adaptive Offset (SAO), and/or an adaptive in-loop filter (ALF), may be applied to the reconstructed CU before it is placed in the reference picture store of picture buffer 120 and used to encode future video blocks. To form the output video bitstream 144, the coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to the entropy coding unit 138 to be further compressed and packed to form the bitstream.

For example, deblocking filters may be used in current versions of AVC, HEVC, and VVC. In HEVC, an additional loop filter, called SAO (sample adaptive offset), is defined to further improve coding efficiency. In the latest version of the VVC standard, another loop filter called ALF (adaptive loop filter) is actively studied and possibly included in the final standard.

These loop filter operations are optional. Performing these operations helps to improve coding efficiency and visual quality. They may also be turned off as a decision presented by the encoder 100 to save computational complexity.

It should be noted that intra-prediction is typically based on unfiltered reconstructed pixels, whereas inter-prediction is based on filtered reconstructed pixels if these filter options are turned on by the encoder 100.

The input video signal is processed block by block (referred to as coding units, CUs). In VTM-1.0, a CU may reach 128 × 128 pixels. However, unlike HEVC which partitions blocks based on only quadtrees, in VVC one Coding Tree Unit (CTU) is split into multiple CUs to accommodate different local features based on quadtrees/binary trees/ternary trees. Furthermore, the concept of multi-partition unit type is removed from HEVC, i.e. there is no longer a separation of CU, Prediction Unit (PU) and Transform Unit (TU) in VVC; instead, each CU is always used as a basic unit for prediction and transform without further partitioning. In the multi-type tree structure, one CTU is first divided by a quadtree structure. The leaf nodes of each quadtree may then be further partitioned by binary and ternary tree structures.

As shown in fig. 3A, 3B, 3C, 3D, and 3E, there are five types of segmentation, i.e., quaternary segmentation, horizontal binary segmentation, vertical binary segmentation, horizontal ternary segmentation, and vertical ternary segmentation.

FIG. 3A illustrates a schematic diagram of block quad-partitioning in a multi-type tree structure according to the present application.

FIG. 3B illustrates a schematic diagram of block vertical binary partitioning in a multi-type tree structure according to the present application.

FIG. 3C illustrates a schematic diagram of block horizontal binary partitioning in a multi-type tree structure according to the present application.

FIG. 3D illustrates a schematic diagram of block vertical tri-segmentation in a multi-type tree structure according to the present application.

FIG. 3E illustrates a schematic diagram of block horizontal tri-segmentation in a multi-type tree structure according to the present application.

In fig. 1, spatial prediction and/or temporal prediction may be performed. Spatial prediction (or "intra prediction") predicts the current video block using pixels from samples of already coded neighboring blocks (called reference samples) in the same video picture/slice. Spatial prediction reduces the spatial redundancy inherent in video signals. Temporal prediction (also referred to as "inter prediction" or "motion compensated prediction") uses reconstructed pixels from already coded video pictures to predict the current video block. Temporal prediction reduces the temporal redundancy inherent in video signals. The temporal prediction signal for a given CU is typically signaled by one or more MVs that indicate the amount and direction of motion between the current CU and its temporal reference. Furthermore, if multiple reference pictures are supported, one reference picture index is additionally sent, which is used to identify from which reference picture in the reference picture memory the temporal prediction signal comes. After spatial and/or temporal prediction, a mode decision block in the encoder selects the best prediction mode, e.g., based on a rate-distortion optimization method. The prediction block is then subtracted from the current video block, and the prediction residual is decorrelated using transform and quantization. The quantized residual coefficients are inverse quantized and inverse transformed to form the reconstructed residual, which is then added back to the prediction block to form the reconstructed signal of the CU. Furthermore, before the reconstructed CU is put into the reference picture memory and used to encode future video blocks, loop filtering, such as deblocking filters, Sample Adaptive Offset (SAO), and Adaptive Loop Filters (ALF), may be applied on the reconstructed CU. To form the output video bitstream, the (inter or intra) coding mode, the prediction mode information, the motion information, and the quantized residual coefficients are all sent to an entropy coding unit for further compression and packing to form the bitstream.

Fig. 2 shows a general block diagram of a video decoder for VVC. In particular, fig. 2 shows a block diagram of a typical decoder 200. The decoder 200 has a bitstream 210, entropy decoding 212, inverse quantization 214, inverse transform 216, adder 218, intra/inter mode selection 220, intra prediction 222, memory 230, loop filter 228, motion compensation 224, picture buffer 226, prediction related information 234, and video output 232.

The decoder 200 is similar to the reconstruction related part located in the encoder 100 of fig. 1. In the decoder 200, an input video bitstream 210 is first decoded by entropy decoding 212 to derive quantized coefficient levels and prediction related information. These quantized coefficient levels are then processed by inverse quantization 214 and inverse transform 216 to obtain the reconstructed prediction residual. The block predictor mechanism implemented in the intra/inter mode selector 220 is configured to perform intra prediction 222 or motion compensation 224 based on the decoded prediction information. A set of unfiltered reconstructed pixels is obtained by adding the reconstructed prediction residual from the inverse transform 216 to the prediction output generated by the block predictor mechanism using adder 218.

The reconstructed block may further pass through a loop filter 228 before being stored in a picture buffer 226, which serves as a reference picture memory. The reconstructed video in the picture buffer 226 may be sent to drive a display device and used to predict future video blocks. With the loop filter 228 turned on, these reconstructed pixels are subjected to a filtering operation to derive a final reconstructed video output 232.

In fig. 2, a general block diagram of a block-based video decoder is presented. The video bitstream is first entropy decoded in an entropy decoding unit. The coding mode and prediction information are sent to a spatial prediction unit (if intra-coded) or a temporal prediction unit (if inter-coded) to form the prediction block. The residual transform coefficients are sent to an inverse quantization unit and an inverse transform unit to reconstruct the residual block. The prediction block and the residual block are then added together. The reconstructed block may be further loop filtered before being stored in the reference picture memory. The reconstructed video in the reference picture store is then sent out to drive the display device and used to predict future video blocks.

Generally, the basic inter prediction techniques applied in VVC remain the same as those of HEVC, except that several modules are further extended and/or enhanced. In particular, in all the preceding video standards, a coding block can only be associated with one single MV when the coding block is uni-predicted, or with two MVs when the coding block is bi-predicted. Because of this limitation of conventional block-based motion compensation, small motion can still remain within the motion compensated prediction samples, thereby negatively affecting the overall efficiency of motion compensation. To improve both the granularity and the precision of the MVs, two sample-wise refinement methods based on optical flow, namely bi-directional optical flow (BDOF) and prediction refinement with optical flow (PROF) for affine mode, are currently being studied for the VVC standard. The main technical aspects of these two inter coding tools are briefly reviewed below.

Bi-directional optical flow

In VVC, BDOF is applied to refine the prediction samples of bi-predictively coded blocks. Specifically, as shown in Fig. 4, BDOF is sample-wise motion refinement performed on top of the block-based motion-compensated prediction when bi-prediction is used. The motion refinement (vx, vy) of each 4×4 sub-block is calculated by minimizing the difference between the L0 and L1 prediction samples after the BDOF is applied within one 6×6 window Ω around the sub-block. Specifically, the value of (vx, vy) is derived as

vx = S1 > 0 ? clip3(−thBDOF, thBDOF, −((S3 · 2^3) >> ⌊log2 S1⌋)) : 0
vy = S5 > 0 ? clip3(−thBDOF, thBDOF, −((S6 · 2^3 − ((vx · S2,m) << nS2 + vx · S2,s) / 2) >> ⌊log2 S5⌋)) : 0    (1)

where ⌊·⌋ is the floor function; clip3(min, max, x) is a function that clips a given value x within the range [min, max]; the symbol >> represents a bitwise right-shift operation; the symbol << represents a bitwise left-shift operation; thBDOF is a motion refinement threshold to prevent propagation errors due to irregular local motion, which is equal to 1 << max(5, bit-depth − 7), where bit-depth is the internal bit depth. In (1), S2,m = S2 >> nS2 and S2,s = S2 & (2^nS2 − 1), with nS2 = 12.

The values of S1, S2, S3, S5, and S6 are calculated as

S1 = Σ(i,j)∈Ω ψx(i,j) · ψx(i,j),  S3 = Σ(i,j)∈Ω θ(i,j) · ψx(i,j)
S2 = Σ(i,j)∈Ω ψx(i,j) · ψy(i,j)
S5 = Σ(i,j)∈Ω ψy(i,j) · ψy(i,j),  S6 = Σ(i,j)∈Ω θ(i,j) · ψy(i,j)    (2)

where

ψx(i,j) = (∂I(1)/∂x(i,j) + ∂I(0)/∂x(i,j)) >> 3
ψy(i,j) = (∂I(1)/∂y(i,j) + ∂I(0)/∂y(i,j)) >> 3
θ(i,j) = (I(1)(i,j) >> 6) − (I(0)(i,j) >> 6)    (3)

where I(k)(i, j) is the sample value at coordinate (i, j) of the prediction signal in list k, k = 0, 1, and these sample values are generated at intermediate high precision (i.e., 16 bits); ∂I(k)/∂x(i, j) and ∂I(k)/∂y(i, j) are the horizontal and vertical gradients of the sample, which are obtained by directly calculating the difference between its two neighboring samples, that is,

∂I(k)/∂x(i,j) = (I(k)(i+1, j) − I(k)(i−1, j)) >> max(6, bit-depth − 6)
∂I(k)/∂y(i,j) = (I(k)(i, j+1) − I(k)(i, j−1)) >> max(6, bit-depth − 6)    (4)

Based on the motion refinement derived in (1), the final bi-prediction samples of the CU are calculated by interpolating the L0/L1 prediction samples along the motion trajectory based on the optical flow model, as indicated by

predBDOF(x, y) = (I(0)(x, y) + I(1)(x, y) + b + ooffset) >> shift
b = rnd((vx · (∂I(1)/∂x(x, y) − ∂I(0)/∂x(x, y))) / 2) + rnd((vy · (∂I(1)/∂y(x, y) − ∂I(0)/∂y(x, y))) / 2)    (5)

where shift and ooffset are the right shift value and the offset value that are applied to combine the L0 and L1 prediction signals for bi-prediction, which are equal to 15 − BD and 1 << (14 − BD) + 2 · (1 << 13), respectively, where BD is the bit depth of the input video. Based on the above bit-depth control method, it is guaranteed that the maximum bit depth of the intermediate parameters of the whole BDOF process does not exceed 32 bits, and the maximum input to a multiplication is within 15 bits, i.e., one 15-bit multiplier is sufficient for the BDOF implementation.
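To make the shift/offset arithmetic above concrete, the following is a minimal C++ sketch (not VTM code; all function and variable names are illustrative assumptions) of combining one pair of 16-bit intermediate L0/L1 prediction samples with the BDOF correction term b into an output sample at bit depth BD, using shift = 15 − BD and ooffset = (1 << (14 − BD)) + 2 · (1 << 13) as stated above. The final clipping to the output range is a common implementation detail assumed here, not taken from the text.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Clip x into [lo, hi].
static int64_t clip3(int64_t lo, int64_t hi, int64_t x) {
    return std::min(hi, std::max(lo, x));
}

// Combine one pair of intermediate (16-bit) prediction samples i0/i1 and the
// per-sample BDOF correction b into a final sample at bit depth BD.
static int combineBiPred(int i0, int i1, int b, int BD) {
    const int shift  = 15 - BD;
    const int offset = (1 << (14 - BD)) + 2 * (1 << 13);  // o_offset from the text
    const int maxVal = (1 << BD) - 1;
    return (int)clip3(0, maxVal, (int64_t(i0) + i1 + b + offset) >> shift);
}

int main() {
    // Example: 10-bit video, two intermediate samples plus a small correction.
    std::cout << combineBiPred(8192, 8200, 3, 10) << "\n";  // prints 1023 (clipped)
    return 0;
}
```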

Affine mode

In HEVC, only the translational motion model is applied to motion compensated prediction. In the real world, there are many kinds of movements such as zoom-in/zoom-out, rotation, perspective movement, and other irregular movements. In VVC, affine motion compensated prediction is applied by signaling a flag for each inter-coded block to indicate whether translational motion or affine motion model is applied for inter prediction. In the current VVC design, two affine modes including a 4-parameter affine mode and a 6-parameter affine mode are supported for one affine coding block.

The 4-parameter affine model has the following parameters: two parameters for the translational motion in the horizontal and vertical directions, respectively, one parameter for the zooming motion in both directions and one parameter for the rotational motion. The horizontal scaling parameter is equal to the vertical scaling parameter. The horizontal rotation parameter is equal to the vertical rotation parameter. To better accommodate the MVs and affine parameters, in VVC, these affine parameters are converted into two MVs (also called Control Point Motion Vectors (CPMVs)) located at the top left and top right corners of the current block. As shown in fig. 5A and 5B, the affine motion field of the block is described by two control points MV (V0, V1).

FIG. 5A and FIG. 5B show diagrams of the 4-parameter affine model. Based on the control point motion, the motion field (vx, vy) of an affine coded block is described as

vx(x, y) = ((v1x − v0x) / w) · x − ((v1y − v0y) / w) · y + v0x
vy(x, y) = ((v1y − v0y) / w) · x + ((v1x − v0x) / w) · y + v0y    (6)

The 6-parameter affine mode has the following parameters: two parameters for translational motion in the horizontal and vertical directions, respectively, one parameter for zoom motion and one parameter for rotational motion in the horizontal direction, and one parameter for zoom motion and one parameter for rotational motion in the vertical direction. The 6-parameter affine motion model is coded with three MVs, i.e., three CPMVs.

FIG. 6 shows a diagram of the 6-parameter affine model. As shown in FIG. 6, the three control points of a 6-parameter affine block are located at the top-left, top-right, and bottom-left corners of the block, respectively. The motion at the top-left control point is related to the translational motion, the motion at the top-right control point is related to the rotation and zoom motion in the horizontal direction, and the motion at the bottom-left control point is related to the rotation and zoom motion in the vertical direction. Compared to the 4-parameter affine motion model, the horizontal rotation and zoom motion of the 6-parameter model may be different from those in the vertical direction. Assuming (V0, V1, V2) are the MVs of the top-left, top-right, and bottom-left corners of the current block in FIG. 6, the MV (vx, vy) of each sub-block is derived using the three MVs at the control points as follows:

vx(x, y) = v0x + ((v1x − v0x) / w) · x + ((v2x − v0x) / h) · y
vy(x, y) = v0y + ((v1y − v0y) / w) · x + ((v2y − v0y) / h) · y    (7)
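As a concrete illustration of equations (6) and (7), the following C++ sketch evaluates the affine motion field at a given position. It uses floating point for clarity rather than the fixed-point derivation of the actual standard, and all names and values are assumptions of this example.

```cpp
#include <iostream>

struct MV { double x, y; };

// 4-parameter model, equation (6): CPMVs v0 (top-left) and v1 (top-right),
// block width w, evaluated at position (x, y) inside the block.
MV affine4Param(MV v0, MV v1, double w, double x, double y) {
    const double c = (v1.x - v0.x) / w;   // horizontal zoom/rotation term
    const double e = (v1.y - v0.y) / w;   // vertical zoom/rotation term
    return { c * x - e * y + v0.x, e * x + c * y + v0.y };
}

// 6-parameter model, equation (7): adds CPMV v2 (bottom-left) and height h.
MV affine6Param(MV v0, MV v1, MV v2, double w, double h, double x, double y) {
    return { v0.x + (v1.x - v0.x) / w * x + (v2.x - v0.x) / h * y,
             v0.y + (v1.y - v0.y) / w * x + (v2.y - v0.y) / h * y };
}

int main() {
    const MV v0{0.0, 0.0}, v1{4.0, 1.0}, v2{-1.0, 2.0};
    const MV mv = affine6Param(v0, v1, v2, 16.0, 16.0, 6.0, 2.0); // sub-block center
    std::cout << mv.x << ", " << mv.y << "\n";                    // prints 1.375, 0.625
    return 0;
}
```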

optical flow prediction refinement for affine mode

To improve the accuracy of affine motion compensation, PROF is currently being studied in VVC, which refines subblock-based affine motion compensation based on optical flow models. Specifically, after sub-block-based affine motion compensation is performed, a luminance prediction sample of an affine block is modified into a sample refinement value derived based on an optical flow equation. Specifically, the operation of the PROF can be summarized as the following four steps:

Step one: sub-block-based affine motion compensation is performed to generate the sub-block prediction I(i, j) using the sub-block MVs, as derived in (6) for the 4-parameter affine model and in (7) for the 6-parameter affine model.

Step two: the spatial gradients gx(i, j) and gy(i, j) of each prediction sample are calculated as:

gx(i,j)=(I(i+1,j)-I(i-1,j))>>(max(2,14-bit-depth)-4)

gy(i,j)=(I(i,j+1)-I(i,j-1))>>(max(2,14-bit-depth)-4) (8)

to calculate these gradients, one extra row/column prediction sample needs to be generated on each side of a sub-block. To reduce memory bandwidth and complexity, samples on extended boundaries are copied from the nearest integer pixel position in the reference picture to avoid additional interpolation processes.
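The following C++ sketch shows the step-two gradient computation of equation (8) over a sub-block whose borders have been padded with one extra row/column on each side (copied from the nearest integer reference position, as described above). The names and the padded-buffer layout are assumptions of this example; it also assumes a bit depth of at most 10 so that the shift in (8) is non-negative, and an arithmetic right shift for negative differences.

```cpp
#include <algorithm>
#include <vector>

struct Gradients { std::vector<int> gx, gy; };

// pred is a (h + 2) x (w + 2) padded prediction block in row-major order;
// gradients are produced for the inner h x w samples per equation (8).
Gradients profGradients(const std::vector<int>& pred, int w, int h, int bitDepth) {
    const int shift  = std::max(2, 14 - bitDepth) - 4;  // shift from equation (8)
    const int stride = w + 2;
    Gradients g{ std::vector<int>(w * h), std::vector<int>(w * h) };
    for (int j = 0; j < h; ++j) {
        for (int i = 0; i < w; ++i) {
            const int c = (j + 1) * stride + (i + 1);   // center inside padded block
            g.gx[j * w + i] = (pred[c + 1]      - pred[c - 1])      >> shift;
            g.gy[j * w + i] = (pred[c + stride] - pred[c - stride]) >> shift;
        }
    }
    return g;
}
```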

Step three: the luminance prediction refinement value is calculated by

ΔI(i,j)=gx(i,j)*Δvx(i,j)+gy(i,j)*Δvy(i,j) (9)

where Δv(i, j) is the difference between the pixel MV calculated for sample position (i, j), denoted by v(i, j), and the sub-block MV of the sub-block in which pixel (i, j) is located.

Step four: in the current PROF design, after the prediction refinement is added to the original prediction sample, a truncation operation is performed to truncate the value of the refined prediction sample to within 15-bit, that is,

Ir(i,j)=I(i,j)+ΔI(i,j)

Ir(i,j)=clip3(-2^14, 2^14-1, Ir(i,j));

where I(i, j) and Ir(i, j) are the original and refined prediction samples at location (i, j), respectively.
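A minimal C++ sketch of steps three and four together: the refinement of equation (9) is added to the original prediction sample, and the result is truncated to the 15-bit intermediate range. All names are illustrative.

```cpp
#include <algorithm>

static int clip3(int lo, int hi, int x) { return std::min(hi, std::max(lo, x)); }

// Apply the PROF refinement of equation (9) to one prediction sample I and
// truncate the refined sample to [-2^14, 2^14 - 1] per the step above.
int refineSample(int I, int gx, int gy, int dvx, int dvy) {
    const int dI = gx * dvx + gy * dvy;   // equation (9)
    return clip3(-(1 << 14), (1 << 14) - 1, I + dI);
}
```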

Fig. 7 shows a view of the PROF process for the affine mode.

Since the affine model parameters and the pixel positions relative to the sub-block center do not change from sub-block to sub-block, Δv(i, j) can be calculated for the first sub-block and reused for the other sub-blocks in the same CU. Let Δx and Δy be the horizontal and vertical offsets from a sample position (i, j) to the center of the sub-block to which the sample belongs; Δv(i, j) can then be derived as

Δvx(i, j) = c · Δx + d · Δy
Δvy(i, j) = e · Δx + f · Δy    (10)

Based on the affine sub-block MV derivation equations (6) and (7), the MV difference Δv(i, j) can be derived. Specifically, for the 4-parameter affine model,

c = f = (v1x − v0x) / w
e = −d = (v1y − v0y) / w

For the 6-parameter affine model,

c = (v1x − v0x) / w,  d = (v2x − v0x) / h
e = (v1y − v0y) / w,  f = (v2y − v0y) / h

where (v0x, v0y), (v1x, v1y), and (v2x, v2y) are the top-left, top-right, and bottom-left control point MVs of the current coding block, and w and h are the width and height of the block. In the existing PROF design, the MV differences Δvx and Δvy are always derived at an accuracy of 1/32 pixel.
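The reuse noted above can be expressed directly in code: the parameters c, d, e, f depend only on the CPMVs and the block dimensions, and Δv(i, j) of equation (10) depends only on the offset (Δx, Δy) from the sub-block center, so one table of Δv values per sub-block geometry serves the whole CU. The C++ sketch below uses floating point for clarity; all names are illustrative assumptions.

```cpp
struct AffineParams { double c, d, e, f; };

// 4-parameter model: c = f = (v1x - v0x) / w, e = -d = (v1y - v0y) / w.
AffineParams params4(double v0x, double v0y, double v1x, double v1y, double w) {
    const double c = (v1x - v0x) / w;
    const double e = (v1y - v0y) / w;
    return { c, -e, e, c };
}

// 6-parameter model: independent horizontal and vertical terms.
AffineParams params6(double v0x, double v0y, double v1x, double v1y,
                     double v2x, double v2y, double w, double h) {
    return { (v1x - v0x) / w, (v2x - v0x) / h,
             (v1y - v0y) / w, (v2y - v0y) / h };
}

// MV difference of equation (10) at offset (dx, dy) from the sub-block center;
// computed once per offset and reused for every sub-block of the CU.
void mvDiff(const AffineParams& p, double dx, double dy, double& dvx, double& dvy) {
    dvx = p.c * dx + p.d * dy;
    dvy = p.e * dx + p.f * dy;
}
```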

Coding efficiency of affine mode

Although PROF can improve the encoding efficiency of affine mode, its design can be further improved. In particular, given that both PROF and BDOF are built on the optical flow concept, it is highly desirable to coordinate the design of PROF and BDOF as much as possible so that PROF can leverage the existing logic of BDOF to the maximum extent to facilitate hardware implementation. Based on this consideration, the following issues of interaction between current PROF and BDOF designs are identified in the present application.

First, as described in the section "optical flow prediction refinement for affine mode", in equation (8) the precision of the gradients is determined based on the internal bit depth. On the other hand, the MV differences, i.e., Δvx and Δvy, are always derived at an accuracy of 1/32 pixel. Accordingly, based on equation (9), the precision of the derived PROF refinement depends on the internal bit depth. However, like BDOF, PROF is applied on top of the prediction sample values at intermediate high bit depth (i.e., 16 bits) in order to keep the PROF derivation precision high. Therefore, regardless of the internal coding bit depth, the precision of the prediction refinements derived by PROF should match the precision of the intermediate prediction samples (i.e., 16 bits). In other words, the characterization bit depths of the MV differences and the gradients in the existing PROF design are not perfectly matched to derive accurate prediction refinements at the precision of the prediction samples (i.e., 16 bits). Meanwhile, based on a comparison of equations (1), (4), and (8), the existing PROF and BDOF use different precisions to characterize the sample gradients and the MV differences. As previously mentioned, such a non-unified design is undesirable for hardware because the existing BDOF logic cannot be reused.

Second, as described in the section "optical flow prediction refinement for affine mode", when a current affine block is bi-predicted, PROF is applied to the prediction samples in lists L0 and L1 separately; the refined L0 and L1 prediction signals are then averaged to generate the final bi-prediction signal. In contrast, rather than deriving the PROF refinement separately for each prediction direction, BDOF derives the prediction refinement once, which is then applied to enhance the combined L0 and L1 prediction signal.

Figs. 8 and 9 (described below) compare the bi-prediction workflows of the current BDOF and PROF. In practical codec hardware pipeline designs, each pipeline stage is usually assigned a different major encoding/decoding module, so that more coding blocks can be processed in parallel. However, due to the differences between the BDOF and PROF workflows, this may make it difficult for BDOF and PROF to share one and the same pipeline design, which is unfriendly to practical codec implementations.

Fig. 8 shows the workflow of BDOF. The workflow 800 includes L0 motion compensation 810, L1 motion compensation 820, and BDOF 830. For example, the L0 motion compensation 810 may be a list of motion compensated samples from a previous reference picture. The previous reference picture is a reference picture that precedes the current picture in the video block. For example, the L1 motion compensation 820 may be a list of motion compensated samples from the next reference picture. The next reference picture is a reference picture subsequent to the current picture in the video block. The BDOF 830 takes the motion compensated samples from the L0 motion compensation 810 and the L1 motion compensation 820 and outputs prediction samples, as described previously for Fig. 4.

Fig. 9 shows the workflow of the existing PROF. The workflow 900 includes L0 motion compensation 910, L1 motion compensation 920, L0 PROF 930, L1 PROF 940, and averaging 960. For example, the L0 motion compensation 910 may be a list of motion compensated samples from a previous reference picture. The previous reference picture is a reference picture that precedes the current picture in the video block. For example, the L1 motion compensation 920 may be a list of motion compensated samples from the next reference picture. The next reference picture is a reference picture subsequent to the current picture in the video block. The L0 PROF 930 takes the L0 motion compensated samples from the L0 motion compensation 910 and outputs motion refinement values, as described previously for Fig. 7. The L1 PROF 940 takes the L1 motion compensated samples from the L1 motion compensation 920 and outputs motion refinement values, as described previously for Fig. 7. Averaging 960 averages the outputs of the L0 PROF 930 and the L1 PROF 940.

Third, for BDOF and PROF, the gradient needs to be computed for each sample within the current coding block, which requires an additional row/column prediction sample to be generated on each side of the block. To avoid the additional computational complexity of sample interpolation, the prediction samples in the extended region around the block are copied directly from the reference samples at integer positions (i.e., without interpolation). However, according to existing designs, integer samples at different positions are selected to generate the gradient values for BDOF and PROF. Specifically, for BDOF, integer reference samples are used that are located to the left of the prediction sample (for horizontal gradients) and above the prediction sample (for vertical gradients); for PROF, the integer reference samples closest to the prediction samples are used for gradient computation. Similar to the bit depth characterization problem, this non-uniform gradient computation method is not desirable for the implementation of hardware codecs.

Fourth, as previously mentioned, the motivation of PROF is to compensate for the small MV difference between the MV of each sample and the sub-block MV derived at the center of the sub-block to which the sample belongs. According to the current PROF design, PROF is always invoked when a coding block is predicted by the affine mode. However, as shown in equations (6) and (7), the sub-block MVs of an affine block are derived from the control point MVs. Therefore, when the differences between the control point MVs are relatively small, the MV at each sample position should be consistent. In such a case, the benefit of applying PROF could be very limited, so it may not be worthwhile to perform PROF when the performance/complexity trade-off is considered. A sketch of such an early-termination check is given below.
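A minimal C++ sketch of this early-termination idea, assuming integer CPMVs in 1/16-pel units and an externally supplied threshold; neither the threshold value nor the function names come from the text.

```cpp
#include <algorithm>
#include <cstdlib>

struct MVi { int x, y; };   // MV components in 1/16-pel units (assumption)

// Return false when the largest component difference among the CPMVs is below
// the threshold, i.e., when the per-sample MVs are nearly identical to the
// sub-block MVs and PROF can be skipped.
bool profWorthApplying(const MVi* cpmv, int numCpmv, int threshold) {
    int maxDiff = 0;
    for (int a = 0; a < numCpmv; ++a) {
        for (int b = a + 1; b < numCpmv; ++b) {
            maxDiff = std::max(maxDiff, std::abs(cpmv[a].x - cpmv[b].x));
            maxDiff = std::max(maxDiff, std::abs(cpmv[a].y - cpmv[b].y));
        }
    }
    return maxDiff >= threshold;
}
```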

Improving efficiency of affine mode using PROF

In the present application, methods are provided that improve and simplify existing PROF designs to facilitate hardware codec implementations. Special attention is given to coordinating the design of BDOF and PROF to maximize sharing of existing BDOF logic with PROF. In general, the main aspects of the technology presented in this application are summarized below.

Fig. 10 illustrates a bit width characterization method of a PROF for decoding a video signal according to the present application.

In step 1010, a first reference picture I associated with a video block within a video signal is obtained. For example, the first reference picture may be a video picture adjacent to a current picture being encoded.

In step 1012, prediction samples I(i, j) of the video block are obtained from a reference block in the first reference picture I. i and j characterize the coordinates of a sample with respect to the video block. For example, the prediction sample I(i, j) may be a prediction sample using the MV of the first reference picture.

In step 1014, the internal bit-widths of the PROF derivation process are controlled by applying a right shift to the internal PROF parameters based on different bit shift values, to achieve the internal PROF parameters at different characterization accuracies. The internal PROF parameters include horizontal gradient values, vertical gradient values, horizontal motion differences, and vertical motion differences derived for the prediction samples I(i, j).

In step 1016, based on the prediction samples I(i, j), prediction refinement values for the samples in the video block are obtained according to the PROF derivation process applied to the video block.

In step 1018, prediction samples for the video block are obtained based on a combination of the prediction samples and the prediction refinement values.

To improve the coding efficiency of PROF while achieving a more unified design, a method is proposed below to unify the characterization bit-depths of the sample gradients and the MV differences used by BDOF and PROF.

To facilitate hardware pipeline design, it is proposed to harmonize the workflow of PROF with that of BDOF for bi-prediction. Specifically, unlike the existing PROF, which derives the prediction refinements separately for L0 and L1, the proposed method derives the prediction refinement once and applies it to the combined L0 and L1 prediction signal.

Two methods are proposed to coordinate the derivation of integer reference samples to compute the gradient values used by BDOF and PROF.

To reduce computational complexity, early termination methods are proposed to adaptively disable the PROF process for affine encoded blocks when certain conditions are met.

Improved bit depth characterization design for PROF gradient and MV differences

As analyzed in the section "coding efficiency of affine mode", the characterization bit-depths of the MV differences and the sample gradients are not aligned in the current PROF to derive accurate prediction refinements. Furthermore, the characterization bit-depths of the sample gradients and the MV differences are not consistent between BDOF and PROF, which is unfriendly to hardware. In this section, an improved bit-depth characterization method is proposed by extending the bit-depth characterization method of BDOF to PROF. Specifically, in the proposed method, the horizontal and vertical gradients at each sample position are calculated as:

gx(i,j)=(I(i+1,j)-I(i-1,j))>>max(6,bit-depth-6)

gy(i,j)=(I(i,j+1)-I(i,j-1))>>max(6,bit-depth-6) (11)

In addition, assuming Δx and Δy are the horizontal and vertical offsets, characterized at 1/4-pixel accuracy, from one sample position to the center of the sub-block to which the sample belongs, the corresponding PROF MV difference Δv(x, y) at the sample position is derived as:

Δvx(x, y) = (c · Δx + d · Δy) >> (13 − dMvBits)
Δvy(x, y) = (e · Δx + f · Δy) >> (13 − dMvBits)    (12)

where dMvBits is the bit depth of the gradient values used by the BDOF process, i.e., dMvBits = max(5, (bit-depth − 7)) + 1. In equations (11) and (12), c, d, e, and f are affine parameters derived based on the affine control point MVs. Specifically, for the 4-parameter affine model,

c = f = (v1x − v0x) / w
e = −d = (v1y − v0y) / w

for a 6-parameter affine model,

wherein (v)0x,v0y)、(v1x,v1y)、(v2x,v2y) Are the top left, top right and bottom left control points MV of the current coding block, characterized with an accuracy of 1/16-pel, w and h being the width and height of the block.
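
The affine parameter derivation above may be sketched as follows; the function name, the argument layout, and the use of floating-point division (an actual codec would use fixed-point shifts) are illustrative assumptions.

```python
def affine_params(v0, v1, v2, w, h, six_param):
    """Derive c, d, e, f from control point MVs given at 1/16-pel.
    v0, v1, v2 are (x, y) MVs of the top-left, top-right, and
    bottom-left control points; w, h are the block width and height."""
    c = (v1[0] - v0[0]) / w
    e = (v1[1] - v0[1]) / w
    if six_param:               # 6-parameter affine model
        d = (v2[0] - v0[0]) / h
        f = (v2[1] - v0[1]) / h
    else:                       # 4-parameter model: f = c, d = -e
        d, f = -e, c
    return c, d, e, f
```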

Fig. 11 illustrates a method 1100 for controlling internal bit widths of a PROF derivation process according to the present application.

In step 1110, a horizontal gradient value of the first prediction sample I(i, j) is obtained based on the difference between the first prediction sample I(i+1, j) and the first prediction sample I(i-1, j).

In step 1112, a vertical gradient value of the first prediction sample I(i, j) is obtained based on the difference between the first prediction sample I(i, j+1) and the first prediction sample I(i, j-1).

In step 1114, the horizontal gradient value is right-shifted by a first bit shift value.

In step 1116, the vertical gradient value is right-shifted by the first bit shift value.

In the discussion above, the gradient and MV difference values are calculated using a pair of fixed right shifts, as shown in equations (11) and (12). In practice, different right shifts may be applied in (11) and (12) to achieve various characterization accuracies of the gradients and MV differences, trading intermediate computational accuracy against the internal bit-widths of the PROF derivation process. For example, when the input video contains a large amount of noise, the derived gradients may not reliably characterize the true local horizontal/vertical gradient values for each sample. In this case, it makes more sense to use more bits to characterize the MV differences than the gradients. Since noise is high-frequency information, a large amount of noise can be detected, for example, by measuring the amount of high-frequency information. In another example, the quantization parameter may be used, where a high quantization parameter value may indicate a low noise level.

On the other hand, when the input video exhibits steady motion, the MV differences derived from the affine model should be very small. In that case, using high-precision MV differences does not provide the additional benefit of improving the precision of the derived PROF refinements. In other words, it is then more advantageous to use more bits to characterize the gradient values. For example, steady motion may be detected using the differences between the control point motion vectors.
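
To make the trade-off concrete, the following sketch selects the two shift values from the quantization parameter and the control point MV spread, along the lines of the two examples above; the function name and the threshold values are illustrative assumptions, not part of the disclosed design.

```python
def select_prof_shifts(qp, cpmv_spread, base_shift=6):
    """Pick (na, nb): na right-shifts the gradients in (13A), nb the MV
    differences in (14A). Thresholds 22 and 4 are illustrative only."""
    if qp < 22:
        # Low QP preserves noise: spend bits on the MV differences,
        # i.e. shift the (unreliable) gradients more.
        return base_shift + 1, base_shift - 1
    if cpmv_spread < 4:
        # Near-identical control point MVs imply steady motion: the MV
        # differences are tiny, so spend bits on the gradients instead.
        return base_shift - 1, base_shift + 1
    return base_shift, base_shift
```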

Based on the above considerations, in one embodiment of the present application, a general method to calculate the gradients and MV differences for the PROF is presented below. Specifically, the horizontal and vertical gradients at each sample position are calculated by applying a right shift to the difference of the neighboring prediction samples, i.e.,

gx(i,j) = (I(i+1, j) - I(i-1, j)) >> na

gy(i,j) = (I(i, j+1) - I(i, j-1)) >> na   (13A)

The corresponding PROF MV difference at the sample position should be calculated as:

Δvx(i,j) = (c*Δx + d*Δy) >> nb

Δvy(i,j) = (e*Δx + f*Δy) >> nb   (14A)

where Δx and Δy are the horizontal and vertical offsets, characterized at 1/4-pel accuracy, from the sample location to the center of the sub-block to which the sample belongs, and c, d, e, and f are affine parameters derived based on the 1/16-pel affine control point MVs. Finally, the final PROF refinement for the sample is calculated as:

ΔI(i,j) = (gx(i,j)*Δvx(i,j) + gy(i,j)*Δvy(i,j) + 1) >> 1   (15A)
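
Putting equations (13A) through (15A) together, a minimal per-sample sketch follows; the function and variable names, the padded array pred, and the fixed-point conventions are illustrative assumptions.

```python
def prof_refinement(pred, i, j, dx, dy, c, d, e, f, na, nb):
    """Per-sample PROF refinement with configurable shifts, per
    (13A)-(15A). dx, dy: 1/4-pel offsets from the sample to the
    sub-block center; c, d, e, f: fixed-point affine parameters."""
    g_x = (pred[j][i + 1] - pred[j][i - 1]) >> na   # (13A)
    g_y = (pred[j + 1][i] - pred[j - 1][i]) >> na
    dv_x = (c * dx + d * dy) >> nb                  # (14A)
    dv_y = (e * dx + f * dy) >> nb
    return (g_x * dv_x + g_y * dv_y + 1) >> 1       # (15A)
```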

As mentioned before, when an affine coding block is bi-predicted, the current PROF is applied in a unilateral manner. More specifically, the PROF sample refinements are derived separately and applied to the prediction samples in each of lists L0 and L1. Then, the refined prediction signals from lists L0 and L1 are averaged to generate the final bi-prediction signal of the block. This is in contrast to the BDOF design, where the sample refinement is derived once and applied to the combined bi-prediction signal. This difference between the bi-prediction workflows of the BDOF and the PROF may be unfriendly to practical codec pipeline designs.

In order to facilitate hardware pipeline design, one simplification according to the present application is to modify the bi-prediction process of the PROF so that the workflows of the two prediction refinement methods are coordinated. Specifically, instead of applying the refinement separately for each prediction direction, the proposed PROF method derives the prediction refinement once based on the control point MVs of lists L0 and L1; the derived prediction refinement is then applied to the combined L0 and L1 prediction signals to improve quality. Specifically, based on the MV difference derived in equation (12), the final bi-prediction samples of an affine coding block are calculated by the proposed method as:

predPROF(i,j) = (I(0)(i,j) + I(1)(i,j) + ΔI(i,j) + ooffset) >> shift

ΔI(i,j) = (gx(i,j)*Δvx(i,j) + gy(i,j)*Δvy(i,j) + 1) >> 1

Ir(i,j) = I(i,j) + ΔI(i,j)   (16)

where shift and ooffset are the right shift value and the offset value applied to combine the L0 and L1 prediction signals for bi-prediction, equal to (15 - bit-depth) and 1 << (14 - bit-depth) + (2 << 13), respectively. Furthermore, as shown in equation (16), the clipping operation in the existing PROF design (shown in equation (9)) is removed in the proposed method.
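
A minimal sketch of the bi-prediction combination of equation (16) follows; the function name and the assumption that i0 and i1 are intermediate-precision L0/L1 prediction samples are introduced only for illustration.

```python
def biprediction_with_prof(i0, i1, delta_i, bit_depth):
    """Combine the L0/L1 prediction samples with the shared PROF
    refinement, per equation (16); no clipping is applied here."""
    shift = 15 - bit_depth
    o_offset = (1 << (14 - bit_depth)) + (2 << 13)
    return (i0 + i1 + delta_i + o_offset) >> shift
```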

Fig. 12 shows the corresponding PROF process when the proposed bi-predictive PROF method is applied. The PROF process 1200 includes L0 motion compensation 1210, L1 motion compensation 1220, and bi-predictive PROF 1230. For example, the L0 motion compensation 1210 may be a list of motion-compensated samples from a previous reference picture, i.e., a reference picture that precedes the current picture containing the video block. The L1 motion compensation 1220 may be a list of motion-compensated samples from a next reference picture, i.e., a reference picture that follows the current picture containing the video block. As described above, the bi-predictive PROF 1230 takes the motion-compensated samples from the L0 motion compensation 1210 and the L1 motion compensation 1220 and outputs bi-prediction samples.

Fig. 13 shows a diagram of example pipeline stages when both the BDOF and the proposed PROF are applied, illustrating the potential benefit of the proposed method for hardware pipeline design. The pipeline 1300 includes three stages: parsing/decoding MVs and fetching reference samples 1310, motion compensation 1320, and BDOF/PROF 1330. The pipeline 1300 processes the video blocks BLK0, BLK1, BLK2, BLK3, and BLK4. Each video block starts at parsing/decoding MVs and fetching reference samples 1310, then moves through motion compensation 1320 and BDOF/PROF 1330 in turn. This means that BLK1 cannot enter the pipeline 1300 until BLK0 has moved on to motion compensation 1320. The stages advance over time from T0 through T1, T2, T3, and T4, with each video block occupying successive stages.

In fig. 13, the decoding process of one inter block mainly includes three steps:

first, the MV of the encoded block is parsed/decoded and reference samples are obtained.

Second, L0 and/or L1 prediction signals for the encoded block are generated.

Third, the generated bi-predictive samples are refined sample-by-sample based on the BDOF when the encoded block is predicted by a non-affine mode, or based on the PROF when the encoded block is predicted by an affine mode.

As shown in fig. 13, after applying the proposed coordination method, both BDOF and PROF are directly applied to these bi-predictive samples. Given that BDOF and PROF are applied to different types of encoded blocks (i.e., BDOF is applied to non-affine blocks and PROF is applied to affine blocks), these two encoding tools cannot be invoked simultaneously. Thus, their respective decoding processes may be performed by sharing the same pipeline stages. This is more efficient than existing PROF designs, which have difficulty assigning the same pipeline stage to BDOF and PROF because of their different bi-predictive workflows.

In the above discussion, the proposed method only considers the coordination of the workflows of the BDOF and the PROF. However, according to the existing designs, the basic operation units of these two coding tools also have different sizes. Specifically, for the BDOF, one coding block is divided into multiple sub-blocks of size Ws × Hs, where Ws = min(W, 16) and Hs = min(H, 16), and W and H are the width and height of the coding block. The BDOF operations, such as gradient calculation and sample refinement derivation, are performed independently for each sub-block. On the other hand, as described previously, an affine coding block is divided into 4 × 4 sub-blocks, with each sub-block assigned an individual MV derived based on either the 4-parameter or the 6-parameter affine model. Because the PROF is only applied to affine blocks, its basic operation unit is the 4 × 4 sub-block. Similar to the bi-prediction workflow problem, using different basic operation unit sizes for the PROF and the BDOF is also unfriendly to hardware implementations and makes it difficult for the BDOF and the PROF to share the same pipeline stage of the overall decoding process. To solve this issue, in one embodiment, it is proposed to align the sub-block size of the affine mode with that of the BDOF. Specifically, according to the proposed method, if one coding block is coded in affine mode, it is divided into sub-blocks of size Ws × Hs, where Ws = min(W, 16) and Hs = min(H, 16), and W and H are the width and height of the coding block. Each sub-block is assigned an individual MV and is treated as one independent PROF operation unit. It is worth mentioning that an independent PROF operation unit ensures that the PROF operation on it is performed without referencing information from neighboring PROF operation units. Specifically, the PROF MV difference at one sample position is calculated as the difference between the MV at that sample position and the MV at the center of the PROF operation unit in which the sample is located, and the gradients used by the PROF derivation are calculated by padding samples along the boundaries of each PROF operation unit. The asserted benefits of the proposed method mainly include the following aspects: 1) a simplified pipeline architecture with a unified basic operation unit size for both motion compensation and BDOF/PROF refinement; 2) reduced memory bandwidth usage due to the enlarged sub-block size for affine motion compensation; 3) reduced per-sample computational complexity of fractional sample interpolation.
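
The proposed sub-block alignment may be sketched as follows; the generator name and interface are illustrative assumptions.

```python
def prof_operation_units(w, h):
    """Yield (x, y, ws, hs) for each independent PROF operation unit of
    an affine block, with the sub-block size aligned to the BDOF:
    Ws = min(W, 16), Hs = min(H, 16)."""
    ws, hs = min(w, 16), min(h, 16)
    for y in range(0, h, hs):
        for x in range(0, w, ws):
            yield x, y, ws, hs
```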

It should also be mentioned that, because the proposed method reduces the computational complexity (i.e., item 3 above), the constraint restricting affine coding blocks to the existing 6-tap interpolation filter can be removed. Instead, the default 8-tap interpolation filter used for non-affine coding blocks is also used for affine coding blocks. In this case, the overall computational complexity remains comparable to that of the existing PROF design (i.e., 4 × 4 sub-blocks with a 6-tap interpolation filter).

Coordination of gradient derivation for BDOF and PROF

As previously described, both the BDOF and the PROF calculate the gradient of each sample within the current coding block, which requires access to one additional row/column of prediction samples on each side of the block. To avoid the additional interpolation complexity, the prediction samples needed in the extended region around the block boundaries are copied directly from the integer reference samples. However, as pointed out in the section "coding efficiency of affine mode", integer samples at different positions are used to calculate the gradient values of the BDOF and the PROF.

To achieve a more uniform design, two methods are presented below to unify the gradient derivation methods used by BDOF and PROF. In the first approach, it is proposed to align the gradient derivation method of PROF with that of BDOF. In particular, by the first approach, the integer positions used to generate these prediction samples in the extension region are determined by rounding down the fractional sample positions, i.e., the selected integer sample positions are to the left of the fractional sample positions (for horizontal gradients) and above the fractional sample positions (for vertical gradients).

In the second approach, it is proposed to align the gradient derivation method of BDOF with that of PROF. More specifically, when the second method is applied, the integer reference sample closest to the prediction sample is used for gradient computation.
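
The two alignment rules may be contrasted in the following sketch, assuming fractional positions expressed in 1/16-pel units; the function names and the unit convention are illustrative assumptions.

```python
def pad_position_floor(frac_pos_16th):
    """First method (align PROF to BDOF): round the fractional position
    down, i.e. take the integer sample to the left/above."""
    return frac_pos_16th >> 4

def pad_position_nearest(frac_pos_16th):
    """Second method (align BDOF to PROF): take the integer reference
    sample closest to the fractional prediction sample."""
    return (frac_pos_16th + 8) >> 4
```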

Fig. 14 illustrates an example of a gradient derivation method using the BDOF, in which a blank circle represents a reference sample 1410 at an integer position, a triangle represents a fractional prediction sample 1430 of a current block, and a gray circle represents an integer reference sample 1420 used to fill an extended region of the current block.

Fig. 15 shows an example of a gradient derivation method using a PROF, in which a blank circle represents a reference sample 1510 at an integer position, a triangle represents a fractional prediction sample 1530 of a current block, and a gray circle represents an integer reference sample 1520 for filling an extended region of the current block.

Fig. 14 and fig. 15 show the respective integer sample positions used to derive the gradients of the BDOF and the PROF when the first method (fig. 14) and the second method (fig. 15) are applied. In both figures, the blank circles represent reference samples at integer positions, the triangles represent the fractional prediction samples of the current block, and the gray circles represent the integer reference samples used to fill the extended region of the current block for gradient derivation.

Early termination of PROF based on control point MV differences

According to the current PROF design, the PROF is always invoked when one coding block is predicted in affine mode. However, as shown in equations (6) and (7), the sub-block MVs of an affine block are derived from the control point MVs. Therefore, when the differences between the control point MVs are relatively small, the MVs at all sample positions should be consistent, and the benefit of applying the PROF can be very limited. Therefore, to further reduce the average computational complexity of the PROF, it is proposed to adaptively skip the PROF-based sample refinement based on the maximum MV difference between the sample-wise MVs and the sub-block MV within one 4 × 4 sub-block. Since the PROF MV differences of the samples within a 4 × 4 sub-block are symmetric about the sub-block center, the maximum horizontal and vertical PROF MV differences can be calculated based on equation (10) as:

depending on the application, different metrics may be used to determine whether the MV difference is small enough to skip the PROF process.

In one example, based on equation (14), the PROF process may be skipped when the sum of the absolute maximum horizontal MV difference and the absolute maximum vertical MV difference is smaller than one predefined threshold, i.e.,

in another example, ifAndis not greater than the threshold, the PROF process may be skipped.

Where MAX (a, b) is a function that returns the larger of the input values a and b.

Furthermore, for both examples above, the spirit of the present application also applies to the case where other metrics are used to determine whether the MV difference is small enough to skip the PROF process.
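
The two example skip criteria may be sketched as follows; the function name and the threshold parameters are illustrative assumptions.

```python
def skip_prof(dvx_max, dvy_max, thres_sum, thres_max):
    """Early termination of the PROF based on the maximum PROF MV
    differences. Either criterion alone suffices in practice; both
    examples from the text are shown here."""
    if abs(dvx_max) + abs(dvy_max) < thres_sum:           # first example
        return True
    return max(abs(dvx_max), abs(dvy_max)) <= thres_max   # second example
```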

In the above method, the PROF is skipped based on the magnitude of the MV differences. On the other hand, besides the MV differences, the PROF sample refinements are also calculated based on the local gradient information at each sample position in one motion-compensated block. For prediction blocks that contain little high-frequency detail (e.g., flat regions), the gradient values tend to be small, such that the values of the derived sample refinements should be small. In view of this, according to another embodiment of the present application, it is proposed to apply the PROF only to the prediction samples of blocks that contain sufficient high-frequency information.

Different metrics may be used to determine whether one block contains enough high-frequency information for the PROF process to be worth invoking. In one example, the decision is made based on the average magnitude (i.e., absolute value) of the gradients of the samples within the prediction block. If the average magnitude is smaller than one threshold, the prediction block is classified as a flat region and the PROF should not be applied; otherwise, the prediction block is considered to contain sufficient high-frequency detail and the PROF remains applicable. In another example, the maximum magnitude of the gradients of the samples within the prediction block may be used: if the maximum magnitude is smaller than one threshold, the PROF is skipped for the block. In yet another example, the difference Imax - Imin between the maximum and minimum sample values of the prediction block may be used to decide whether to apply the PROF to the block: if the difference is smaller than one threshold, the PROF is skipped for the block. It is worth noting that the spirit of the present application also applies when some other metric is used to determine whether a given block contains sufficient high-frequency information.
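
The three example flatness metrics may be sketched as follows; the function name, the flattened input lists, and the thresholds are illustrative assumptions, and in practice any one of the metrics may be used on its own.

```python
def block_has_high_frequency(grads, pred, t_avg, t_max, t_range):
    """Flatness test before invoking the PROF. All three example
    metrics from the text are shown; a real design would pick one.
    grads: per-sample gradients; pred: prediction samples."""
    mags = [abs(g) for g in grads]
    if sum(mags) / len(mags) < t_avg:       # average gradient magnitude
        return False
    if max(mags) < t_max:                   # maximum gradient magnitude
        return False
    if max(pred) - min(pred) < t_range:     # Imax - Imin sample range
        return False
    return True
```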

FIG. 16 illustrates a computing environment 1610 coupled with a user interface 1660. The computing environment 1610 may be part of a data processing server. The computing environment 1610 includes a processor 1620, memory 1640, and I/O interfaces 1650.

The processor 1620 generally controls the overall operation of the computing environment 1610, such as operations associated with display, data acquisition, data communication, and image processing. Processor 1620 may include one or more processors to execute instructions to perform all or some of the steps of the above-described methods. Further, processor 1620 may include one or more modules that facilitate interaction between processor 1620 and other components. The processor may be a Central Processing Unit (CPU), microprocessor, single chip, GPU, etc.

The memory 1640 is configured to store various types of data to support operation of the computing environment 1610. Memory 1640 may include predetermined software 1642. Examples of such data include instructions for any application or method operating on the computing environment 1610, video data sets, image data, and so forth. The memory 1640 may be implemented using any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), magnetic memory, flash memory, magnetic disks, or optical disks.

I/O interface 1650 provides an interface between processor 1620 and peripheral interface modules (e.g., keyboard, click wheel, buttons, etc.). These buttons may include, but are not limited to, a home button, a start scan button, and a stop scan button. I/O interface 1650 may couple with an encoder and a decoder.

Here, there is also provided a non-transitory computer readable storage medium comprising a plurality of programs for performing the above-described methods, such as a plurality of programs contained in the memory 1640 executable by the processor 1620 in the computing environment 1610. For example, the non-transitory computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

The non-transitory computer readable storage medium has stored therein a plurality of programs for execution by a computing device having one or more processors, wherein the plurality of programs, when executed by the one or more processors, cause the computing device to perform the motion prediction method described above.

In one embodiment, the computing environment 1610 may be implemented with one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), controllers, micro-controllers, microprocessors, or other electronic components to perform the above-described methods.

The description of the present application has been presented for purposes of illustration and is not intended to be exhaustive or limited to the application. Many modifications, variations and alternative embodiments will become apparent to those of ordinary skill in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings.

The examples were chosen and described in order to explain the principles of the application and to enable others of ordinary skill in the art to understand the application for various implementations and to best utilize the basic principles and various implementations with various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of the application is not to be limited to the specific examples of the disclosed embodiments and that modifications and other embodiments are intended to be included within the scope of the application.
