Unified computing method using refined interframe prediction


Reading note: This document, "Unified computing method using refined interframe prediction", was designed and created by Liu Hongbin, Zhang Li, Zhang Kai, Xu Jizheng and Wang Yue on 2020-05-07. Its main content includes: A method of visual media processing includes performing a conversion between a current block of visual media data and a corresponding codec representation of the current video block, wherein the conversion for the current block includes computing a spatial gradient based on sample differences of spatially neighboring samples associated with the current block and/or computing a temporal gradient based on temporally neighboring samples associated with the current block, and wherein the technique for computing the spatial gradient is the same as the technique for computing the temporal gradient.

1. A method of visual media processing, comprising:

performing a conversion between a current block of visual media data and a corresponding codec representation of the current video block,

wherein the conversion of the current block comprises computing a spatial gradient based on sample difference values of spatially neighboring samples associated with the current block and/or computing a temporal gradient based on temporally neighboring samples associated with the current block, and

wherein the technique for calculating the spatial gradient is the same as the technique for calculating the temporal gradient.

2. The method of claim 1, wherein a technique for calculating the spatial gradient and/or the temporal gradient comprises left-shifting or right-shifting sample differences of the spatially neighboring samples and/or the temporally neighboring samples.

3. The method of claim 2, wherein the spatially neighboring samples and/or the temporally neighboring samples are left-shifted or right-shifted before the sample difference values are calculated.

4. The method according to claim 2, wherein the sample difference values of the spatial and/or temporal neighboring samples are computed before the left or right shift of the sample difference values.

5. The method of claim 4, wherein an offset is added to the sample difference value before left or right shifting of the sample difference value.

6. The method of claim 3, wherein an offset is added to the spatially neighboring samples or the temporally neighboring samples prior to the left-shifting or the right-shifting.

7. The method of any one or more of claims 5-6, wherein the offset is represented as 1 << (shift1 - 1), wherein shift1 represents the amount of the left shift or the right shift.

8. The method of any one or more of claims 5-6, wherein the offset is represented as (1 << shift1) >> 1, wherein shift1 represents the amount of the left shift or the right shift.

9. The method of any one or more of claims 1-8, wherein the spatial and/or temporal gradients are represented as SatShift(neig0 - neig1, shift1), wherein SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0, and SatShift(x, n) = -((-x + offset1) >> n), if x < 0,

wherein offset0 and/or offset1 are set to (1 << n) >> 1 and/or 1 << (n - 1) and/or zero, wherein neig0 and neig1 represent sample values of two neighboring samples of a sample included in the current block of the visual media data, and wherein shift1 represents an amount of left or right shift of the two neighboring samples.

10. The method of any one or more of claims 1-8, wherein the spatial and/or temporal gradients are represented as SatShift(neig0, shift1) - SatShift(neig1, shift1), wherein SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0, and SatShift(x, n) = -((-x + offset1) >> n), if x < 0,

wherein offset0 and/or offset1 are set to (1 << n) >> 1 and/or 1 << (n - 1) and/or zero, wherein neig0 and neig1 represent sample values of two neighboring samples of a sample included in the current block of the visual media data, and wherein shift1 represents an amount of left or right shift of the two neighboring samples.

11. The method of any one or more of claims 1-10, wherein the spatial neighboring samples are included in a prediction block of the current block corresponding to a prediction list.

12. The method of claim 11, wherein the prediction block is a first prediction block in a first prediction list and/or a second prediction block in a second prediction list.

13. The method of any one or more of claims 1-10, wherein the temporal gradient is based on at least two temporal neighboring samples associated with the current block, the at least two temporal neighboring samples being in a first prediction block of the current block corresponding to a first prediction list and in a second prediction block of the current block corresponding to a second prediction list.

14. The method of any one or more of claims 1-9, wherein the spatial gradient is calculated in a horizontal direction and/or a vertical direction.

15. The method of any one or more of claims 1-13, wherein the converting comprises using a bi-directional optical flow (BDOF) technique.

16. A method of visual media processing, comprising:

performing a conversion between a current block of visual media data and a corresponding codec representation of the current video block,

wherein the conversion of the current block comprises computing a spatial gradient based on sample difference values of spatially neighboring samples associated with the current block, and

wherein a left shift or a right shift is applied to the spatially neighboring samples before the sample difference values are calculated.

17. A video decoding apparatus comprising a processor configured to implement the method according to one or more of claims 1 to 16.

18. A video coding apparatus comprising a processor configured to implement the method according to one or more of claims 1 to 16.

19. A computer program product having computer code stored thereon, which when executed by a processor causes the processor to implement the method of any one of claims 1 to 16.

20. A method, apparatus or system as described in this document.

Technical Field

This document relates to video and image coding and decoding techniques.

Background

Digital video accounts for the largest bandwidth usage on the internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video grows, the bandwidth demand for digital video usage is expected to continue to grow.

Disclosure of Invention

In one example aspect, a method of visual media processing is disclosed. The method comprises performing a transformation between a current block of visual media data and a corresponding codec representation of the current video block, wherein the transformation of the current block comprises computing a spatial gradient based on a difference of samples of spatially neighboring samples associated with the current block and/or computing a temporal gradient based on temporally neighboring samples associated with the current block, and wherein a technique for computing the spatial gradient is the same as a technique for computing the temporal gradient.

In another example aspect, a method of visual media processing is disclosed. The method comprises performing a conversion between a current block of visual media data and a corresponding codec representation of the current video block, wherein the conversion of the current block comprises computing a spatial gradient based on sample difference values of spatially neighboring samples associated with the current block, and wherein a left or right shift is applied to the spatially neighboring samples before the sample difference values are computed.

In another example aspect, the above method may be implemented by a video decoder apparatus comprising a processor.

In another example aspect, the above method may be implemented by a video encoder apparatus comprising a processor.

In yet another example aspect, the methods may be embodied in the form of processor-executable instructions and stored on a computer-readable program medium.

These and other aspects will be further described in this document.

Drawings

FIG. 1 illustrates an example of bilateral matching.

Fig. 2 shows an example of template matching.

Fig. 3 shows an example of unilateral Motion Estimation (ME) in Frame-Rate Up Conversion (FRUC).

FIG. 4 shows an example of an optical flow trace.

Fig. 5A and 5B show examples of bi-directional optical flow (BIO) without block expansion.

FIG. 6 shows an example of bilateral matching using a 6-point search.

Fig. 7 shows examples of an adaptive integer search mode and a half-sample search mode.

Fig. 8 is a block diagram of an example of a video processing apparatus.

Fig. 9 shows a block diagram of an example implementation of a video encoder.

Fig. 10 is a flowchart of an example of a video processing method.

Fig. 11 is a flowchart of an example of a video processing method.

Fig. 12 is a flowchart of an example of a video processing method.

Fig. 13 is a flowchart of an example of a video processing method.

Fig. 14 is a flowchart of an example of a video processing method.

Fig. 15 is a flowchart of an example of a video processing method.

Fig. 16 is a flowchart of an example of a video processing method.

FIG. 17 is a block diagram of an example video processing system in which the disclosed techniques may be implemented.

Fig. 18 is a flowchart of an example of a visual media processing method.

Fig. 19 is a flowchart of an example of a visual media processing method.

Detailed Description

In order to improve the compression rate of video, researchers are constantly searching for new techniques for encoding video. This document provides various techniques that a decoder of a video bitstream can use to improve the quality of decompressed or decoded digital video. In addition, the video encoder may also implement these techniques during the encoding process in order to reconstruct the decoded frames for further encoding.

The section headings are used in this document to improve readability, and the scope of the techniques and embodiments described in each section is not limited to that section only. Furthermore, although certain terms from various existing video encoder standards are used, the disclosed techniques are not limited to these video standards or their successor standards, and are applicable to other video encoder standards. Further, in some cases, techniques are disclosed that use respective codec steps, and it should be understood that at the decoder, the respective decoding steps will be performed in reverse order. In addition, codecs may also be used to perform transcoding, where video is represented from one codec representation (e.g., one bitrate) to another codec representation (e.g., a different bitrate).

1. Overview

This patent document relates to video encoding/decoding techniques. In particular, it relates to motion compensation in video encoding/decoding. It can be applied to existing video codec standards, such as HEVC, or to the standard to be finalized (Versatile Video Coding, VVC). It may also be applicable to future video codec standards or video codecs.

2. Preliminary discussion

Video codec standards have evolved largely through the development of the well-known ITU-T and ISO/IEC standards. The ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video, H.264/MPEG-4 Advanced Video Coding (AVC) and H.265/HEVC standards. Since H.262, video codec standards have been based on hybrid video codec structures in which temporal prediction plus transform coding is utilized. In order to explore future video codec technologies beyond HEVC, VCEG and MPEG jointly founded the Joint Video Exploration Team (JVET) in 2015. Since then, many new methods have been adopted by JVET and put into a reference software named Joint Exploration Model (JEM). In April 2018, the Joint Video Experts Team (JVET) between VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 (MPEG) was created to work on the VVC standard, with the goal of reducing the bit rate by 50% compared with HEVC.

The latest version of the VVC draft, Versatile Video Coding (Draft 2), can be found at: http://phenix.it-sudparis.eu/jvet/doc_end_user/documents/11_Ljubljana/wg11/JVET-K1001-v7.zip. The latest reference software for VVC, named VTM, can be found at: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/tags/VTM-2.1.

Fig. 9 is a block diagram of an example implementation of a video encoder. Fig. 9 shows an encoder implementation with a built-in feedback path, where the video encoder also performs the video decoding function (reconstructing a compressed representation of the video data for encoding of the next video data).

2.1 Pattern matching motion vector derivation

The Pattern Matched Motion Vector Derivation (PMMVD) mode is a special Merge mode based on Frame Rate Up Conversion (FRUC) techniques. With this mode, the motion information of the block is not signaled but is obtained at the decoder side.

When the CU's Merge flag is true, a FRUC flag is signaled for it. When the FRUC flag is false, the Merge index is signaled and the regular Merge mode is used. When the FRUC flag is true, an additional FRUC mode flag is signaled to indicate which method (bilateral matching or template matching) is to be used to derive motion information for the block.

At the encoder side, the decision on whether to use FRUC Merge mode for a CU is based on RD cost selection, as is done for normal Merge candidates. That is, both matching modes (bilateral matching and template matching) are checked for a CU using RD cost selection. The mode with the lowest cost is further compared with other CU modes. If a FRUC matching mode is the most efficient one, the FRUC flag is set to true for the CU and the related matching mode is used.

The motion derivation process in FRUC Merge mode has two steps: a CU-level motion search is performed first, followed by sub-CU-level motion refinement. At the CU level, an initial motion vector is derived for the whole CU based on bilateral matching or template matching. First, a list of MV candidates is generated, and the candidate with the lowest matching cost is selected as the starting point for further CU-level refinement. Then a local search based on bilateral matching or template matching is performed around the starting point, and the MV with the minimum matching cost is taken as the MV of the whole CU. Subsequently, the motion information is further refined at the sub-CU level, with the derived CU motion vector as the starting point.

For example, the following derivation process is performed for the motion information derivation of a W x H CU. In the first stage, the MV of the whole W x H CU is derived. In the second stage, the CU is further divided into M x M sub-CUs, where M is derived as M = max{4, min{W, H} / 2^D} and D is a predefined division depth, set to 3 by default in JEM. Then the MV of each sub-CU is derived.

as shown in fig. 1, bilateral matching is used to derive motion information of a current CU by finding the closest match between two blocks along the motion trajectory of the current CU in two different reference pictures. Under the assumption of a continuous motion trajectory, the motion vectors MV0 and MV1 pointing to the two reference blocks should be proportional to the temporal distance between the current picture and the two reference pictures (i.e., TD0 and TD 1). As a special case, the bilateral matching becomes a mirror-based bi-directional MV when the current picture is temporally located between two reference pictures and the temporal distances from the current picture to the two reference pictures are the same.

As shown in fig. 2, template matching is used to derive motion information of a current CU by finding the closest match between a template in the current picture (neighboring blocks above and/or to the left of the current CU) and a block in a reference picture (of the same size as the template). In addition to the FRUC Merge mode described above, template matching is also applicable to AMVP mode. In JEM, there are two candidates for AMVP, just as in HEVC. New candidates are derived using a template matching method. If the newly derived candidate by template matching is different from the first existing AMVP candidate, it is inserted into the very beginning of the AMVP candidate list and then the list size is set to 2 (meaning the second existing AMVP candidate is deleted). When applied to AMVP mode, only CU level search is applied.

CU level MV candidate set

The CU-level MV candidate set may include:

if the current CU is in AMVP mode, the original AMVP candidate is included,

all of the Merge candidates are selected,

a few MVs in the interpolated MV field introduced in section 2.1.1.3, and

neighboring motion vectors on the top and left

When bilateral matching is used, each valid MV of a Merge candidate is used as an input to generate an MV pair under the bilateral matching assumption. For example, one valid MV of a Merge candidate is (MVa, refa) in reference list A. Then the reference picture refb of its paired bilateral MV is found in the other reference list B, such that refa and refb are temporally on different sides of the current picture. If such a refb is not available in reference list B, refb is determined as a reference different from refa whose temporal distance to the current picture is the minimum one in list B. After refb is determined, MVb is derived by scaling MVa based on the temporal distances between the current picture and refa and refb, respectively.
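As a rough illustration of the scaling step just described, the following Python sketch derives MVb from MVa using POC distances; the function name, the tuple representation of MVs, and the use of simple floating-point scaling (instead of a codec's fixed-point scaling) are assumptions made here for clarity.

```python
def derive_bilateral_mv_pair(mva, poc_cur, poc_refa, poc_refb):
    """Derive MVb by scaling MVa according to the temporal distances between
    the current picture and refa / refb (hypothetical helper, not from the spec)."""
    td_a = poc_cur - poc_refa   # temporal distance from current picture to refa
    td_b = poc_cur - poc_refb   # temporal distance from current picture to refb
    scale = td_b / td_a         # negative when refa and refb lie on opposite sides
    return (round(mva[0] * scale), round(mva[1] * scale))
```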

Four MVs from the interpolated MV field are also added to the CU level candidate list. More specifically, the interpolated MVs at positions (0,0), (W/2,0), (0, H/2)) and (W/2, H/2) of the current CU are added.

When FRUC is applied to AMVP mode, the original AMVP candidate is also added to the CU-level MV candidate set.

At the CU level, up to 15 MVs are added to the candidate list for AMVP CUs and up to 13 MVs are added to the candidate list for Merge CUs.

sub-CU level MV candidate set

The sub-CU level MV candidate set may include:

the MV determined from the CU level search,

upper, left, upper left and upper right neighboring MVs,

collocated MVs from a scaled version of a reference picture,

up to 4 ATMVP candidates, and

up to 4 STMVP candidates.

The scaled MV from the reference picture is derived as follows. All reference pictures in both lists are traversed. The MVs at the collocated positions of sub-CUs in the reference picture are scaled to the reference of the starting CU level MV.

ATMVP and STMVP candidates are limited to the first four.

At the sub-CU level, up to 17 MVs are added to the candidate list.

Generation of interpolated MV fields

Before encoding and decoding a frame, an interpolated motion field is generated for the whole picture based on unilateral ME. The motion field can later be used as CU-level or sub-CU-level MV candidates.

First, the motion field of each reference picture in both reference lists is traversed at the 4 x 4 block level. For each 4 x 4 block, if the motion associated with a reference block passes through a 4 x 4 block in the current picture (as shown in Fig. 3) and that block has not been assigned any interpolated motion, the motion of the reference block is scaled to the current picture according to the temporal distances TD0 and TD1 (in the same way as the MV scaling of TMVP in HEVC), and the scaled motion is assigned to the block in the current frame. If no scaled MV is assigned to a 4 x 4 block, the block's motion is marked as unavailable in the interpolated motion field.

1. Interpolation and matching costs

When the motion vector points to a fractional sample position, motion compensated interpolation may be performed. To reduce complexity, bilinear interpolation (rather than regular 8-tap HEVC interpolation) is used for both bilateral matching and template matching.

The calculation of the matching cost is a bit different at different steps. When selecting the candidate from the candidate set at the CU level, the matching cost is the sum of absolute differences (SAD) of bilateral matching or template matching. After the starting MV is determined, the matching cost C of bilateral matching at the sub-CU-level search is calculated as follows:

C = SAD + w * (|MVx - MVxs| + |MVy - MVys|)

where w is a weighting factor, which is empirically set to 4, and MV and MVs indicate the current MV and the starting MV, respectively. SAD is still used as the matching cost of template matching at the sub-CU-level search.
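A minimal sketch of the sub-CU-level bilateral matching cost described above; the function name is illustrative, and the weighting factor w = 4 follows the text.

```python
def bilateral_matching_cost(sad, mv, mv_start, w=4):
    """C = SAD + w * (|MVx - MVxs| + |MVy - MVys|), with w empirically set to 4."""
    return sad + w * (abs(mv[0] - mv_start[0]) + abs(mv[1] - mv_start[1]))
```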

In FRUC mode, MVs are derived by using only the luminance samples. The derived Motion will be used for luminance and chrominance for MC (Motion Compensation) inter prediction. After the MV is decided, an 8-tap interpolation filter is used for luminance and a 4-tap interpolation filter is used for chrominance to perform the final MC.

MV refinement

MV refinement is a pattern-based MV search with a criterion of bilateral matching cost or template matching cost. In JEM, two search patterns are supported: an unrestricted center-biased diamond search (UCBDS) and an adaptive cross search, for MV refinement at the CU level and the sub-CU level, respectively. For both CU-level and sub-CU-level MV refinement, the MV is directly searched at quarter-luma-sample MV precision, followed by one-eighth-luma-sample MV refinement. The search range of MV refinement for the CU and sub-CU steps is set equal to 8 luma samples.

3. Selection of prediction direction in template matching FRUC Merge mode

In the bilateral matching Merge mode, bi-prediction is always applied, since the motion information of a CU is derived based on the closest match between two blocks along the motion trajectory of the current CU in two different reference pictures. Template matching Merge patterns have no such limitation. In the template matching Merge mode, the encoder may select among unidirectional prediction from list 0, unidirectional prediction from list 1, or bidirectional prediction of a CU. The selection is based on the template matching cost, as follows:

if costBi & gt factor & ltmin (cost0, cost1)

Using bi-directional prediction;

otherwise, if cost0< ═ cost1

Using one-way prediction from list 0;

if not, then,

using unidirectional prediction from list 1;

where cost0 is the SAD of the list 0 template matching, cost1 is the SAD of the list 1 template matching, and costBi is the SAD of the bi-prediction template matching. The value of factor is equal to 1.25, which means that the selection process is biased towards bi-directional prediction.

Inter prediction direction selection is applied only to the CU-level template matching process.
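For illustration, the selection rule above can be written compactly as follows; a sketch assuming the three template matching costs have already been computed, with factor = 1.25 as stated.

```python
def select_prediction_direction(cost0, cost1, cost_bi, factor=1.25):
    """Choose bi-prediction or uni-prediction from list 0 / list 1 based on
    the template matching costs; factor biases the choice towards bi-prediction."""
    if cost_bi <= factor * min(cost0, cost1):
        return "bi-prediction"
    if cost0 <= cost1:
        return "uni-prediction from list 0"
    return "uni-prediction from list 1"
```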

2.2 hybrid Intra and inter prediction

In JVET-L0100, multi-hypothesis prediction is proposed, wherein hybrid intra and inter prediction is one way to generate multiple hypotheses.

When multi-hypothesis prediction is applied to improve intra mode, multi-hypothesis prediction combines one intra prediction and one Merge index prediction. In a Merge CU, one flag is signaled for Merge mode to select an intra mode from an intra candidate list when the flag is true. For the luma component, the intra candidate list is derived from 4 intra prediction modes including DC, planar, horizontal and vertical modes, and the size of the intra candidate list can be 3 or 4 depending on the block shape. When the CU width is larger than twice the CU height, the horizontal mode is excluded from the intra mode list, and when the CU height is larger than twice the CU width, the vertical mode is removed from the intra mode list. One intra prediction mode selected by the intra mode index and one Merge index prediction selected by the Merge index are combined using a weighted average. For the chroma component, DM is always applied without extra signaling. The weights for combining the predictions are described as follows. Equal weights are applied when DC or planar mode is selected, or when the CB width or height is smaller than 4. For those CBs with CB width and height larger than or equal to 4, when the horizontal/vertical mode is selected, one CB is first vertically/horizontally split into four equal-area regions. Each weight set, denoted as (w_intra_i, w_inter_i) with i from 1 to 4, where (w_intra_1, w_inter_1) = (6, 2), (w_intra_2, w_inter_2) = (5, 3), (w_intra_3, w_inter_3) = (3, 5) and (w_intra_4, w_inter_4) = (2, 6), is applied to the corresponding region. (w_intra_1, w_inter_1) is for the region closest to the reference samples and (w_intra_4, w_inter_4) is for the region farthest from the reference samples. The combined prediction can then be calculated by summing the two weighted predictions and right-shifting by 3 bits. Moreover, the intra prediction mode of the intra hypothesis of the prediction can be saved for reference by subsequent neighboring CUs.

2.3 Bi-directional optical flow

In BIO, motion compensation is first performed to generate a first prediction (in each prediction direction) of the current block. The first prediction is used to derive the spatial gradient, temporal gradient and optical flow for each subblock/pixel within the block, and then used to generate a second prediction, the final prediction of the subblock/pixel. The detailed description is as follows.

Bi-directional optical flow (BIO) is a point-by-point motion refinement performed on the basis of bi-directionally predicted block-by-block motion compensation. Sample level motion refinement does not use signaling.

Let I^(k) be the luma value from reference k (k = 0, 1) after block motion compensation, and let ∂I^(k)/∂x, ∂I^(k)/∂y be the horizontal and vertical components of the gradient of I^(k), respectively. Assuming the optical flow is valid, the motion vector field (vx, vy) is given by the equation:

∂I^(k)/∂t + vx * ∂I^(k)/∂x + vy * ∂I^(k)/∂y = 0.

Combining this optical flow equation with Hermite interpolation of the motion trajectory of each sample yields a unique third-order polynomial that matches both the function values I^(k) and the derivatives ∂I^(k)/∂x, ∂I^(k)/∂y at the ends. The value of this polynomial at t = 0 is the BIO prediction:

predBIO = 1/2 * (I^(0) + I^(1) + vx/2 * (τ1 * ∂I^(1)/∂x - τ0 * ∂I^(0)/∂x) + vy/2 * (τ1 * ∂I^(1)/∂y - τ0 * ∂I^(0)/∂y)).

here, τ0And τ1The distance to the reference frame is shown in fig. 4. Distance tau0And τ1Calculated based on POC of Ref0 and Ref 1: tau is0=POC(current)-POC(Ref0)、τ1POC (Ref1) -POC (current). If both predictions are from the same time direction (either both from the past or both from the future), then the sign is different (i.e., τ0·τ1< 0). In this case, only if the predictions are not from the same time instant (i.e., τ)0≠τ1) Both reference areas haveWith non-zero motion (MVx)0,MVy0,MVx1,MVy1Not equal to 0) and block motion vector is not equal to the temporal distance (MVx)0/MVx1=MVy0/MVy1=-τ01) When proportional, BIO is applied.

The motion vector field (vx, vy) is determined by minimizing the difference Δ between the values at points A and B (the intersections of the motion trajectory with the reference frame planes, shown in Fig. 4). The model uses only the first linear term of a local Taylor expansion for Δ:

Δ = I^(0) - I^(1) + vx * (τ1 * ∂I^(1)/∂x + τ0 * ∂I^(0)/∂x) + vy * (τ1 * ∂I^(1)/∂y + τ0 * ∂I^(0)/∂y)     (5)

all values in equation 5 depend on the sample position (i ', j'), which has been omitted from the annotation so far. Assuming that the motion is consistent in the local surrounding area, the value of Δ may be minimized inside a (2M +1) × (2M +1) square window Ω centered on the current predicted point (i, j), where M equals 2:

for this optimization problem, JEM uses a simplified approach, first minimizing in the vertical direction and then minimizing in the horizontal direction. This results in

where

s1 = Σ_{[i',j'] ∈ Ω} (τ1 * ∂I^(1)/∂x + τ0 * ∂I^(0)/∂x)^2

s3 = Σ_{[i',j'] ∈ Ω} (I^(1) - I^(0)) * (τ1 * ∂I^(1)/∂x + τ0 * ∂I^(0)/∂x)

s2 = Σ_{[i',j'] ∈ Ω} (τ1 * ∂I^(1)/∂x + τ0 * ∂I^(0)/∂x) * (τ1 * ∂I^(1)/∂y + τ0 * ∂I^(0)/∂y)

s5 = Σ_{[i',j'] ∈ Ω} (τ1 * ∂I^(1)/∂y + τ0 * ∂I^(0)/∂y)^2

s6 = Σ_{[i',j'] ∈ Ω} (I^(1) - I^(0)) * (τ1 * ∂I^(1)/∂y + τ0 * ∂I^(0)/∂y)     (9)

to avoid division by zero or very small values, regularization parameters r and m are introduced in equations 7 and 8, where:

r = 500 · 4^(d-8) (10)

m = 700 · 4^(d-8) (11)

where d is the bit depth of the video samples.

To keep the memory access of BIO the same as for regular bi-predictive motion compensation, all prediction and gradient values I^(k), ∂I^(k)/∂x, ∂I^(k)/∂y are calculated only for positions inside the current block. In equation 9, a (2M+1) x (2M+1) square window Ω centered on a currently predicted point on a prediction block boundary needs to access positions outside the block (as shown in Fig. 5A). In JEM, the values of I^(k), ∂I^(k)/∂x, ∂I^(k)/∂y outside the block are set equal to the nearest available value inside the block. This may be implemented as padding, for example, as shown in Fig. 5B.

With BIO, the motion field can be refined for each sample. To reduce the computational complexity, a block-based BIO design is used in JEM. The motion refinement is calculated based on 4 x 4 blocks. In block-based BIO, the sn values of equation 9 are aggregated over all samples in a 4 x 4 block, and the aggregated sn values are then used to derive the BIO motion vector offset for the 4 x 4 block. More specifically, the following formula is used for block-based BIO derivation:

s_{n,bk} = Σ_{(x,y) ∈ bk} sn(x, y), i.e., the per-sample window sums sn are further summed over all samples (x, y) of the 4 x 4 block bk.

where bk denotes the set of samples belonging to the k-th 4 x 4 block of the prediction block. The sn in equations 7 and 8 are replaced by (s_{n,bk}) >> 4 to derive the associated motion vector offsets.

In some cases, the MV refinement of BIO may be unreliable due to noise or irregular motion. Therefore, in BIO, the magnitude of the MV refinement is clipped to a threshold thBIO. The threshold is determined based on whether the reference pictures of the current picture are all from one direction. If all the reference pictures of the current picture come from one direction, the value of the threshold is set to 12 x 2^(14-d); otherwise, it is set to 12 x 2^(13-d).
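A small sketch of the threshold derivation and clipping described above; clip3 is the usual clamping helper, d is the sample bit depth, and the function names are assumptions.

```python
def clip3(lo, hi, x):
    # Standard clamping helper used throughout the JEM/VVC descriptions.
    return max(lo, min(hi, x))

def clip_bio_refinement(vx, vy, d, all_refs_same_direction):
    """Clip the BIO motion refinement (vx, vy) to thBIO, which is
    12 * 2^(14 - d) when all reference pictures of the current picture come
    from one direction, and 12 * 2^(13 - d) otherwise."""
    th_bio = 12 * (1 << (14 - d)) if all_refs_same_direction else 12 * (1 << (13 - d))
    return clip3(-th_bio, th_bio, vx), clip3(-th_bio, th_bio, vy)
```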

The gradients of BIO are calculated at the same time as the motion compensated interpolation, using operations consistent with the HEVC motion compensation process (2D separable FIR). The input to this 2D separable FIR is the same reference frame samples as for the motion compensation process, and the fractional position (fracX, fracY) according to the fractional part of the block motion vector. For the horizontal gradient ∂I/∂x, the signal is first interpolated vertically using BIOfilterS corresponding to the fractional position fracY with de-scaling shift d-8; then the gradient filter BIOfilterG is applied in the horizontal direction corresponding to the fractional position fracX with de-scaling shift 18-d. For the vertical gradient ∂I/∂y, the gradient filter is first applied vertically using BIOfilterG corresponding to the fractional position fracY with de-scaling shift d-8; then signal displacement is performed in the horizontal direction using BIOfilterS corresponding to the fractional position fracX with de-scaling shift 18-d. The length of the interpolation filter for gradient calculation, BIOfilterG, and of the signal displacement filter, BIOfilterF, is shorter (6 taps) in order to keep reasonable complexity. Table 1 shows the filters used for gradient calculation for different fractional positions of the block motion vector in BIO. Table 2 shows the interpolation filters used for prediction signal generation in BIO.

Table 1: filter for gradient calculation in BIO

Fractional pixel position Interpolation filter for gradient (BIOfilterG)
0 {8,-39,-3,46,-17,5}
1/16 {8,-32,-13,50,-18,5}
1/8 {7,-27,-20,54,-19,5}
3/16 {6,-21,-29,57,-18,5}
1/4 {4,-17,-36,60,-15,4}
5/16 {3,-9,-44,61,-15,4}
3/8 {1,-4,-48,61,-13,3}
7/16 {0,1,-54,60,-9,2}
1/2 {-1,4,-57,57,-4,1}

Table 2: interpolation filter for prediction signal generation in BIO

Fractional pixel position Interpolation filter for prediction signal (BIOfilterS)
0 {0,0,64,0,0,0}
1/16 {1,-3,64,4,-2,0}
1/8 {1,-6,62,9,-3,1}
3/16 {2,-8,60,14,-5,1}
1/4 {2,-9,57,19,-7,2}
5/16 {3,-10,53,24,-8,2}
3/8 {3,-11,50,29,-9,2}
7/16 {3,-11,44,35,-10,3}
1/2 {3,-10,35,44,-11,3}

In JEM, when the two predictions are from different reference pictures, the BIO is applied to all bi-directionally predicted blocks. When LIC is enabled for a CU, BIO is disabled.

In JEM, OBMC is applied for a block after the normal MC process. To reduce computational complexity, BIO is not applied during the OBMC process. This means that when using its own MV, the BIO is only applied in the MC procedure for a block, while when using the MVs of neighboring blocks during the OBMC procedure, the BIO is not applied in the MC procedure.

A two-stage early termination method is used to conditionally disable the BIO operations depending on the similarity between the two prediction signals. The early termination is first applied at the CU level and then at the sub-CU level. Specifically, the proposed method first calculates the SAD between the L0 and L1 prediction signals at the CU level. Given that BIO is only applied to luma, only the luma samples need to be considered in the SAD calculation. If the CU-level SAD is not larger than a predefined threshold, the BIO process is fully disabled for the whole CU. The CU-level threshold is set to 2^(BDepth-9) per sample. If the BIO process is not disabled at the CU level, and the current CU contains multiple sub-CUs, the SAD of each sub-CU within the CU is calculated. Then a decision on whether to enable or disable the BIO process is made at the sub-CU level based on a predefined sub-CU-level SAD threshold, which is set to 3 x 2^(BDepth-10) per sample.
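The two-stage early termination can be sketched as below; the predictions are assumed to be passed in as lists of luma sample rows, and the per-sample thresholds 2^(BDepth-9) and 3 * 2^(BDepth-10) follow the text.

```python
def bio_enabled_cu(pred_l0, pred_l1, bdepth):
    """CU-level early termination: keep BIO enabled only if the luma SAD between
    the L0 and L1 predictions exceeds 2^(BDepth-9) per sample."""
    pairs = [(a, b) for r0, r1 in zip(pred_l0, pred_l1) for a, b in zip(r0, r1)]
    sad = sum(abs(a - b) for a, b in pairs)
    return sad > len(pairs) * (1 << (bdepth - 9))

def bio_enabled_subcu(sub_l0, sub_l1, bdepth):
    """Sub-CU-level decision with a threshold of 3 * 2^(BDepth-10) per sample."""
    pairs = [(a, b) for r0, r1 in zip(sub_l0, sub_l1) for a, b in zip(r0, r1)]
    sad = sum(abs(a - b) for a, b in pairs)
    return sad > len(pairs) * 3 * (1 << (bdepth - 10))
```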

2.4 BDOF specification in VVC

The specification of BDOF (in JVET-N1001-v2) is as follows:

8.5.7.4 bidirectional optical flow prediction process

The inputs to the process include:

two variables nCbW and nCbH specify the width and height of the current codec block,

two (nCbW +2) x (nCbH +2) luma prediction sample point arrays predSamplesL0 and predSamplesL1,

the prediction list utilizes the flags predflag l0 and predflag l1,

reference indices refIdxL0 and refIdxL1,

- the bi-directional optical flow utilization flags bdofUtilizationFlag[xIdx][yIdx], with xIdx = 0..(nCbW >> 2) - 1, yIdx = 0..(nCbH >> 2) - 1.

The output of this process is the (nCbW)x(nCbH) array pbSamples of luma prediction sample values.

The variables bitDepth, shift1, shift2, shift3, shift4, offset4, and mvRefineThres are derived as follows (a direct code transcription follows this list):

-the variable bitDepth is set equal to bitDepthY

Variable shift1 is set to Max (2, 14-bitDepth).

Variable shift2 is set to Max (8, bitDepth-4).

Variable shift3 is set to Max (5, bitDepth-7).

Variable shift4 is set to Max(3, 15 - bitDepth), and variable offset4 is set to 1 << (shift4 - 1).

The variable mvRefineThres is set to Max(2, 1 << (13 - bitDepth)).
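For readability, the variable derivation above can be transcribed directly as follows; this is only a restatement of the listed formulas, with bit_depth standing for bitDepthY.

```python
def derive_bdof_variables(bit_depth):
    """Derive shift1..shift4, offset4 and mvRefineThres as listed above."""
    shift1 = max(2, 14 - bit_depth)
    shift2 = max(8, bit_depth - 4)
    shift3 = max(5, bit_depth - 7)
    shift4 = max(3, 15 - bit_depth)
    offset4 = 1 << (shift4 - 1)
    mv_refine_thres = max(2, 1 << (13 - bit_depth))
    return shift1, shift2, shift3, shift4, offset4, mv_refine_thres
```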

For xIdx = 0..(nCbW >> 2) - 1 and yIdx = 0..(nCbH >> 2) - 1, the following applies:

the variable xSb is set equal to (xIdx < <2) +1, ySb is set equal to (yIdx < <2) + 1.

- If bdofUtilizationFlag[xSbIdx][yIdx] is equal to FALSE, then for x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the prediction sample values of the current sub-block are derived as follows:

pbSamples[x][y] = Clip3(0, (2^bitDepth) - 1, (predSamplesL0[x+1][y+1] + offset2 + predSamplesL1[x+1][y+1]) >> shift2) (8-852)

- Otherwise (bdofUtilizationFlag[xSbIdx][yIdx] is equal to TRUE), the prediction sample values of the current sub-block are derived as follows:

For x = xSb - 1..xSb + 4, y = ySb - 1..ySb + 4, the following ordered steps apply:

1. The locations (hx, vy) for each of the corresponding sample locations (x, y) inside the prediction sample arrays are derived as follows:

hx = Clip3(1, nCbW, x) (8-853)

vy = Clip3(1, nCbH, y) (8-854)

2. the variables gradientHL0[ x ] [ y ], gradientVL0[ x ] [ y ], gradientHL1[ x ] [ y ] and gradientVL1[ x ] [ y ] are derived as follows:

gradientHL0[x][y]=(predSamplesL0[hx+1][vy]-predSampleL0[hx-1][vy])>>shift1 (8-855)

gradientVL0[x][y]=(predSampleL0[hx][vy+1]-predSampleL0[hx][vy-1])>>shift1 (8-856)

gradientHL1[x][y]=(predSamplesL1[hx+1][vy]-predSampleL1[hx-1][vy])>>shift1 (8-857)

gradientVL1[x][y]=(predSampleL1[hx][vy+1]-predSampleL1[hx][vy-1])>>shift1 (8-858)

3. The variables diff[x][y], tempH[x][y] and tempV[x][y] are derived as follows:

diff[x][y]=(predSamplesL0[hx][vy]>>shift2)-(predSamplesL1[hx][vy]>>shift2) (8-859)

tempH[x][y]=(gradientHL0[x][y]+gradientHL1[x][y])>>shift3 (8-860)

tempV[x][y]=(gradientVL0[x][y]+gradientVL1[x][y])>>shift3 (8-861)

The variables sGx2, sGy2, sGxGy, sGxdI and sGydI are derived as follows:

sGx2=ΣiΣj(tempH[xSb+i][ySb+j]*tempH[xSb+i][ySb+j]),

with i, j = -1..4 (8-862)

sGy2=ΣiΣj(tempV[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

with i, j = -1..4 (8-863)

sGxGy=ΣiΣj(tempH[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

with i, j = -1..4 (8-864)

sGxdI=ΣiΣj(-tempH[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

with i, j = -1..4 (8-865)

sGydI=ΣiΣj(-tempV[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

with i, j = -1..4 (8-866)

-horizontal motion offset and vertical motion offset of the current sub-block are derived as follows:

vx = sGx2 > 0 ? Clip3(-mvRefineThres, mvRefineThres, -(sGxdI << 3) >> Floor(Log2(sGx2))) : 0 (8-867)

vy = sGy2 > 0 ? Clip3(-mvRefineThres, mvRefineThres, ((sGydI << 3) - ((vx * sGxGym) << 12 + vx * sGxGys) >> 1) >> Floor(Log2(sGx2))) : 0 (8-868)

- For x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the prediction sample values of the current sub-block are derived as follows:

bdofOffset=Round((vx*(gradientHL1[x+1][y+1]-gradientHL0[x+1][y+1]))>>1)+Round((vy*(gradientVL1[x+1][y+1]-gradientVL0[x+1][y+1]))>>1) (8-869)

[ Ed. (JC): the Round () operation is defined for floating point inputs. Since the inputs are integer values, the Round () operation appears redundant here. Confirmed by the proposer ]

pbSamples[x][y] = Clip3(0, (2^bitDepth) - 1, (predSamplesL0[x+1][y+1] + offset4 + predSamplesL1[x+1][y+1] + bdofOffset) >> shift4) (8-870)

The spatial gradient is calculated as follows:

gradientHL0[x][y]=(predSamplesL0[hx+1][vy]-predSampleL0[hx-1][vy])>>shift1 (8-855)

on the other hand, the temporal gradient is calculated as follows:

diff[x][y]=(predSamplesL0[hx][vy]>>shift2)-(predSamplesL1[hx][vy]>>shift2) (8-859)

Therefore, the computations of the spatial gradient and the temporal gradient are not aligned.
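The mismatch is easiest to see when both computations are written side by side; a sketch using the spec variables, where pred_l0/pred_l1 are the padded prediction sample arrays indexed as [x][y].

```python
def spatial_gradient_h_l0(pred_l0, hx, vy, shift1):
    """Spatial gradient per (8-855): the right shift is applied AFTER the subtraction."""
    return (pred_l0[hx + 1][vy] - pred_l0[hx - 1][vy]) >> shift1

def temporal_gradient(pred_l0, pred_l1, hx, vy, shift2):
    """Temporal gradient per (8-859): each sample is right-shifted BEFORE the subtraction."""
    return (pred_l0[hx][vy] >> shift2) - (pred_l1[hx][vy] >> shift2)
```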

2.5 decoder-side motion vector refinement

In the bi-directional prediction operation, for the prediction of one block region, two prediction blocks formed using the motion vector (MV) of list 0 and the MV of list 1, respectively, are combined to form a single prediction signal. In JVET-K0217, with the decoder-side motion vector refinement (DMVR) method, the two motion vectors of the bi-prediction are further refined by a bilateral matching process.

In the proposed method, DMVR is applied only in Merge and skip modes, if the following condition is true:

(POC-POC0)*(POC-POC1)<0,

where POC is the picture order count of the current picture to be encoded, and POC0 and POC1 are the picture order counts of references to the current picture.

The signaled Merge candidate pair is used as input to the DMVR process and is denoted as the initial motion vectors (MV0, MV1). The search points searched by DMVR obey the motion vector difference mirroring condition. In other words, any point checked by DMVR, denoted by a candidate motion vector pair (MV0', MV1'), obeys the following two equations:

MV0′=MV0+MVdiff

MV1′=MV1-MVdiff

where MVdiff represents a point in the search space in one of the reference pictures.
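The mirroring condition can be expressed directly; mv0, mv1 and mv_diff are (x, y) integer tuples, and the function name is illustrative.

```python
def mirrored_candidate(mv0, mv1, mv_diff):
    """Candidate pair checked by DMVR: MV0' = MV0 + MVdiff, MV1' = MV1 - MVdiff."""
    mv0_prime = (mv0[0] + mv_diff[0], mv0[1] + mv_diff[1])
    mv1_prime = (mv1[0] - mv_diff[0], mv1[1] - mv_diff[1])
    return mv0_prime, mv1_prime
```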

After the search space is constructed, a regular 8-tap DCTIF interpolation filter is used to construct the single-sided prediction. The bilateral matching cost function is computed by using the MRSAD (sum of mean removed absolute differences) between the two predictions (FIG. 6), and the search point that results in the smallest cost is selected as the refined MV pair. For MRSAD calculation, samples of 16-bit precision (which is the output of the interpolation filtering) are used, and no clipping and rounding operations are applied prior to MRSAD calculation. The reason why rounding and clipping are not applied is to reduce the internal buffer requirements.
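A minimal MRSAD sketch matching the description above (the mean difference is removed before taking absolute differences); it operates directly on the 16-bit interpolation output without clipping or rounding, and the every-other-row simplification mentioned later in this section would simply restrict the loop to odd rows.

```python
def mrsad(pred0, pred1):
    """Mean-removed SAD between two equally sized prediction blocks (lists of rows)."""
    n = sum(len(row) for row in pred0)
    mean0 = sum(sum(row) for row in pred0) / n
    mean1 = sum(sum(row) for row in pred1) / n
    return sum(abs((a - mean0) - (b - mean1))
               for r0, r1 in zip(pred0, pred1)
               for a, b in zip(r0, r1))
```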

FIG. 6 shows an example of bilateral matching using a 6-point search.

In the proposed method, the integer-precision search points are selected by an adaptive pattern method. First, the cost corresponding to the center point (pointed to by the initial motion vector) is calculated. The other 4 costs (in a sign shape) are calculated from the two predictions located at opposite sides of the center point relative to each other. The last, 6th point at that angle is selected based on the gradient of the previously calculated costs (Fig. 7).

Fig. 7 shows examples of an adaptive integer search mode and a half-sample search mode.

The output of the DMVR process is a refined pair of motion vectors corresponding to the minimum cost.

If after one iteration the minimum cost is achieved at the center point of the search space, i.e. the motion vectors have not changed, the refinement process terminates. Otherwise, the best cost is further considered as the center, and the process continues while the minimum cost does not correspond to the center point and does not exceed the search range.

The half-pel precision search is applied only if the application of the half-pixel search does not exceed the search range. In this case, only 4 MRSAD calculations are performed, corresponding to plus-shaped points around a center point, which is selected as the best during the integer precision search. Finally, the refined motion vector pair corresponding to the minimum cost point is output.

Some simplifications and improvements are further proposed in JVET-L0163.

4. Reference sample padding

Reference sample padding is applied in order to extend the reference sample block pointed to by the initial motion vector. If the size of the coding block is given by "w" and "h", it is assumed that a block of size (w+7) x (h+7) is retrieved from the reference picture buffer. The retrieved buffer is then extended by 2 samples in each direction by repetitive sample padding using the nearest samples. Afterwards, once the refined motion vector is obtained (which can deviate from the initial motion vector by up to 2 samples in each direction), the final prediction is generated from the extended reference sample block.

Note that this modification eliminates the external memory access requirements of the DMVR entirely without any codec penalty.

5. Bilinear interpolation replacing 8-tap DCTIF

According to this proposal, bilinear interpolation is applied during the DMVR search process, which means that the predictions used in the MRSAD calculation are generated using bilinear interpolation. Once the final refined motion vector is obtained, a regular 8-tap DCTIF interpolation filter is applied to generate the final prediction.

6. Disabling DMVR for small blocks

DMVR is disabled for 4 x 4, 4 x 8 and 8 x 4 blocks.

7. Early termination based on MV difference between Merge candidates

An additional condition is imposed on the DMVR that limits the MV refinement process. With it, the DMVR is conditionally disabled when the following conditions are met.

The MV difference between the selected Merge candidate and any previous Merge candidate in the same Merge list is less than a predefined threshold (i.e., an interval of 1/4 pixels wide, 1/2 pixels wide, and 1 pixel wide, respectively, for CUs having less than 64 pixels, less than 256 pixels, and at least 256 pixels).

8. Early termination of SAD cost based on center search coordinates

The sum of absolute differences (SAD) between the two prediction signals (the L0 and L1 predictions) using the initial motion vectors of the current CU is calculated. If the SAD is not larger than a predefined threshold of 2^(BDepth-9) per sample, DMVR is skipped; otherwise, DMVR is still applied to refine the two motion vectors of the current block.

9. DMVR application condition

The condition (POC - POC1) x (POC - POC2) < 0 for applying DMVR, as implemented in BMS2.1, is replaced by the new condition (POC - POC1) == (POC2 - POC). This means DMVR is applied only when the reference pictures are in opposite temporal directions and are equidistant from the current picture.

10. MRSAD calculation using every other line

The MRSAD cost is computed only for the odd lines of the block, not considering the even sample lines. Therefore, the number of operations for the MRSAD calculation is halved.

2.6 related methods

In a patent application No. PCT/CN2018/098691 entitled "motion refinement of visual media codec" filed on 8/4 in 2018, which is incorporated herein by reference, an MV updating method and a two-step inter-frame prediction method are proposed. The derived MV between reference block 0 and reference block 1 in the BIO is scaled and added to the original motion vectors of list 0 and list 1. Meanwhile, the updated MV is used to perform motion compensation, and a second inter prediction is generated as a final prediction. The temporal gradient is modified by removing the mean difference between reference block 0 and reference block 1.

Further, in a patent application No. PCT/CN2018/114931 (which is incorporated herein by reference) filed on 11/10/2018, entitled "subblock-based prediction method and system," a set of MVs is generated for chroma components.

2.7 DMVR in VVC draft 4

The use of DMVR in JVET-M1001_v7 (VVC working draft 4, version 7) is defined as follows (a condensed checking sketch follows the list of conditions):

-the dmvrFlag is set to 1 when all of the following conditions are true:

-sps _ dmvr _ enabled _ flag is equal to 1

-the current block is not coded with triangle prediction mode, AMVR affine mode, sub-block mode (including Merge affine mode and ATMVP mode)

-merge _ flag [ xCb ] [ yCb ] equal to 1

predFlagL0[0] [0] and predFlagL1[0] [0] both equal 1

Mmvd _ flag [ xCb ] [ yCb ] equal to 0

- DiffPicOrderCnt(currPic, RefPicList[0][refIdxL0]) is equal to DiffPicOrderCnt(RefPicList[1][refIdxL1], currPic)

-cbHeight greater than or equal to 8

- cbHeight * cbWidth is greater than or equal to 64
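A condensed sketch collecting the conditions above into one predicate; the attribute names of the hypothetical blk structure are assumptions, and DiffPicOrderCnt is expressed as a plain POC difference.

```python
def dmvr_flag(blk, sps_dmvr_enabled_flag):
    """Sketch of the dmvrFlag derivation in VVC draft 4 (illustrative only)."""
    diff0 = blk.poc_cur - blk.poc_ref_l0   # DiffPicOrderCnt(currPic, RefPicList[0][refIdxL0])
    diff1 = blk.poc_ref_l1 - blk.poc_cur   # DiffPicOrderCnt(RefPicList[1][refIdxL1], currPic)
    return (sps_dmvr_enabled_flag == 1
            and not (blk.triangle_mode or blk.amvr_affine or blk.subblock_mode)
            and blk.merge_flag == 1
            and blk.pred_flag_l0 == 1 and blk.pred_flag_l1 == 1
            and blk.mmvd_flag == 0
            and diff0 == diff1
            and blk.cb_height >= 8
            and blk.cb_height * blk.cb_width >= 64)
```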

3. Examples of problems addressed by embodiments

In BIO, the difference between two reference blocks or sub-blocks is computed at the early termination stage, while the temporal gradient is also computed. Because the temporal gradient is actually the difference (or right-shifted difference) between two reference pixels, it makes no sense to calculate both the difference and the temporal gradient.

In DMVR, the MRSAD is calculated to determine the refined motion vector of a block.

In BIO, SAD calculation is used to decide whether to enable or disable BIO for a block or a sub-block using all samples of the block/sub-block, which increases computational complexity.

The spatial gradient and the temporal gradient are calculated in different ways.

4. Examples of the embodiments

SATD denotes the sum of absolute transformed differences, MRSATD denotes the sum of mean removed absolute transformed differences, SSE denotes the sum of squared errors, and MRSSE denotes the sum of mean removed squared errors.

The following detailed technology should be considered as an example to explain the general concept. These techniques should not be construed narrowly. Furthermore, these inventions may be combined in any manner.

In the discussion that follows, SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0, and SatShift(x, n) = -((-x + offset1) >> n), if x < 0.

Shift(x, n) is defined as Shift(x, n) = (x + offset0) >> n.

In one example, offset0 and/or offset1 are set to (1 << n) >> 1 or 1 << (n - 1). In another example, offset0 and/or offset1 are set to 0.

In another example, offset0 = offset1 = ((1 << n) >> 1) - 1 or (1 << (n - 1)) - 1.
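A direct Python rendering of these two helpers, assuming the piecewise SatShift definition given above; the offsets are passed in explicitly because the text allows several choices.

```python
def sat_shift(x, n, offset0=0, offset1=0):
    """SatShift(x, n): rounding right shift applied symmetrically around zero."""
    if x >= 0:
        return (x + offset0) >> n
    return -((-x + offset1) >> n)

def shift(x, n, offset0=0):
    """Shift(x, n) = (x + offset0) >> n."""
    return (x + offset0) >> n
```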

In gradient computation of BDOF, the difference between two nearby (spatially or temporally nearby) and/or non-nearby samples may be computed, and right-shifting may be performed during gradient computation. Suppose that the two nearby samples are neig0 and neig1, the right shift value is shift1, and the gradient to be calculated is grad. Note that shift1 may be different for spatial and temporal gradients.

1. It is proposed to align the methods used in computing the spatial and temporal gradients. (The alternatives listed in the sub-bullets below are illustrated in the sketch that follows this item.)

a. In one example, the gradient is calculated from the shifted sample point difference.

i. Alternatively, the gradient is calculated from the modified sample point difference (e.g., via a shift).

b. In one example, in gradient calculation, the subtraction may be performed before the right shift. For example, grad = (neig0 - neig1) >> shift1.

c. In one example, in gradient calculation, the subtraction may be performed after the right shift. For example, grad = (neig0 >> shift1) - (neig1 >> shift1).

d. In one example, in gradient calculation, the subtraction may be performed before the right shift, and an offset may be added before the right shift. For example, grad = (neig0 - neig1 + offset) >> shift1. The offset may be equal to 1 << (shift1 - 1) or (1 << shift1) >> 1.

e. In one example, in gradient calculation, the subtraction may be performed after the right shift, and an offset may be added before the right shift. For example, grad = ((neig0 + offset) >> shift1) - ((neig1 + offset) >> shift1). The offset may be equal to 1 << (shift1 - 1) or (1 << shift1) >> 1.

f. In one example, the gradient may be calculated as SatShift(neig0 - neig1, shift1).

i. Alternatively, the gradient may be calculated as SatShift(neig0, shift1) - SatShift(neig1, shift1).
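The alternatives of item 1 above correspond to the following sketch; neig0, neig1 and shift1 are as defined before item 1, the offset choices follow bullets 1.d/1.e, and the function names are illustrative.

```python
def sat_shift(x, n, offset0, offset1):
    # SatShift as defined at the start of this section.
    return (x + offset0) >> n if x >= 0 else -((-x + offset1) >> n)

def grad_subtract_then_shift(neig0, neig1, shift1, offset=0):
    """Bullets 1.b / 1.d: subtract first, add an optional offset, then right-shift."""
    return (neig0 - neig1 + offset) >> shift1

def grad_shift_then_subtract(neig0, neig1, shift1, offset=0):
    """Bullets 1.c / 1.e: add an optional offset and right-shift each sample, then subtract."""
    return ((neig0 + offset) >> shift1) - ((neig1 + offset) >> shift1)

def grad_satshift(neig0, neig1, shift1):
    """Bullet 1.f: grad = SatShift(neig0 - neig1, shift1), here with offsets (1 << shift1) >> 1."""
    off = (1 << shift1) >> 1
    return sat_shift(neig0 - neig1, shift1, off, off)
```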

2. It is proposed to use other criteria for deciding whether to enable/disable BIO or/and DMVR at the early termination stage, such as SATD or MRSATD or SSE or MRSSE or the mean value difference or gradient values.

a. In one example, the block-level and sub-block-level enable/disable decisions may select different rules, e.g., one using SAD and the other using SATD.

b. In one example, for a block/sub-block, if a gradient value (horizontal and/or vertical) or average gradient value or range of gradient values satisfies a condition (e.g., is greater than a threshold or outside a given range), the BIO and/or DMVR may be disabled.

c. The proposed criterion for deciding whether to enable/disable BIO/DMVR may be signaled from the encoder to the decoder in the VPS/SPS/PPS/slice header/slice group header.

3. It is proposed to use other criteria for determining the refined motion vector of a block in the DMVR process, such as SATD or MRSATD or SSE or MRSSE replacing MRSAD (a sketch of block-wise SATD follows the sub-bullets of this item).

a. In one example, the refined motion vector of a sub-block in the DMVR process is determined with SATD or MRSATD or SSE or MRSSE replacing MRSAD.

b. In one example, if SATD (or MRSATD) is applied, the entire block is divided into M × N sub-blocks, and SATD (or MRSATD) is calculated for each sub-block. The SATD (MRSATD) values of all or some of the sub-blocks are added to obtain the SATD (MRSATD) value of the whole block.
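As an illustration of bullet 3.b, the sketch below computes a 4x4 Hadamard-based SATD per sub-block and accumulates the values over the whole block; the 4x4 sub-block size and the plain (non-mean-removed) SATD are simplifying assumptions.

```python
H4 = [[1,  1,  1,  1],
      [1, -1,  1, -1],
      [1,  1, -1, -1],
      [1, -1, -1,  1]]

def _mul4(a, b):
    # Plain 4x4 matrix multiplication.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def satd_4x4(diff):
    """SATD of a 4x4 difference block: sum of |H4 * diff * H4^T| (H4 is symmetric)."""
    t = _mul4(_mul4(H4, diff), H4)
    return sum(abs(v) for row in t for v in row)

def block_satd(pred0, pred1, width, height):
    """Split the block into 4x4 sub-blocks, compute each sub-block's SATD, and add them up."""
    total = 0
    for y in range(0, height, 4):
        for x in range(0, width, 4):
            diff = [[pred0[y + j][x + i] - pred1[y + j][x + i] for i in range(4)]
                    for j in range(4)]
            total += satd_4x4(diff)
    return total
```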

4. When the mean difference of two reference blocks of a block is greater than a threshold (T1), the BIO or/and DMVR may be disabled.

a. When the mean difference of two reference sub-blocks of a sub-block is greater than a threshold (T2), BIO may be disabled.

b. The thresholds T1 and/or T2 may be predefined.

c. The thresholds T1 and/or T2 may depend on the block size.

5. It is proposed that at the early termination stage of the BIO, the reference block or/and sub-block may be modified first before calculating the difference between the two reference blocks/sub-blocks (e.g. SAD/SATD/SSE, etc.).

a. In one example, the mean value of the reference block or/and sub-block may be calculated and then subtracted from the reference block or/and sub-block.

b. In one example, the method disclosed in application PCT/CN2018/096384 entitled "motion prediction based on updated motion vectors" (which is incorporated herein by reference), filed on 20.7.2018, may be used to calculate the mean of reference blocks or/and sub-blocks, i.e. for some representative locations.

6. It is proposed that at the early termination stage of the BIO or/and DMVR, the difference between two reference blocks or/and sub-blocks (e.g. SAD/SATD/SSE/MRSAD/MRSSE, etc.) may be calculated for only some representative positions.

a. In one example, only the difference values for even rows are calculated for a block or/and sub-block.

b. In one example, only the difference of the four corner samples of a block/sub-block is calculated for the block or/and sub-block.

c. In one example, the method disclosed in U.S. provisional application 62/693,412 entitled "decoder-side motion vector derivation in video codec" (which is incorporated herein by reference) filed on 7/2/2018 may be used to select a representative location.

d. In one example, the difference between two reference blocks may be calculated for only some representative sub-blocks (e.g., SAD/SATD/SSE/MRSAD/MRSSE, etc.).

e. In one example, the difference values calculated for representative locations or sub-blocks (e.g., SAD/SATD/SSE/MRSAD/MRSATD/MRSSE, etc.) are added up to get the difference values for the entire block/sub-block.

7. It is proposed to use the temporal gradient (the temporal gradient at position (x, y) is defined as G(x, y) = P0(x, y) - P1(x, y), where P0(x, y) and P1(x, y) represent the predictions at (x, y) from two different reference pictures) or a modified temporal gradient as the difference (instead of SAD) in the early termination stage of BIO, and the threshold used in the early termination may be adjusted accordingly.

a. In one example, the absolute sum of the temporal gradients is calculated and used as the difference of two reference blocks or/and sub-blocks.

b. In one example, the absolute sum of the temporal gradients is computed over only some representative locations of the block or/and sub-block.

c. In one example, the method disclosed in U.S. provisional application 62/693,412 entitled "decoder-side motion vector derivation in video codec" (which is incorporated herein by reference) filed on 7/2/2018 may be used to select a representative location.

8. It is proposed that the temporal gradient modification process can be performed adaptively for different blocks/sub-blocks.

a. In one example, the temporal gradient is only modified when the absolute mean difference between the two reference blocks (or SAD/SATD/SSE, etc.) is greater than a threshold T, e.g., T = 4.

b. In one example, the temporal gradient is only modified when the absolute mean difference between the two reference blocks (or SAD/SATD/SSE, etc.) is less than a threshold T, e.g., T = 20.

c. In one example, the temporal gradient is only modified when the absolute mean difference between the two reference blocks (or SAD/SATD/SSE, etc.) is in the range [T1, T2], e.g., T1 = 4, T2 = 20.

d. In one example, if the absolute mean difference between the two reference blocks (or SAD/SATD/SSE, etc.) is greater than a threshold T (e.g., T = 40), then BIO is disabled.

e. In one example, these thresholds may be implicitly predefined.

f. In one example, these thresholds may be signaled in the SPS/PPS/picture/slice level.

g. In one example, the thresholds may be different for different CUs, LCUs, slices, or pictures.

i. In one example, these thresholds may be designed based on decoded/encoded pixel values.

in one example, these thresholds may be designed differently for different reference pictures.

h. In one example, the temporal gradient is modified only when the (absolute) mean of the two (or either of the two) reference blocks is greater than a threshold T, e.g., T = 40.

i. In one example, the temporal gradient is modified only when the (absolute) mean of the two (or either of the two) reference blocks is less than a threshold T, e.g., T = 100.

j. In one example, the temporal gradient is modified only when the (absolute) mean of the two (or either of the two) reference blocks is within the range [T1, T2], e.g., T1 = 40, T2 = 100.

k. In one example, the temporal gradient is modified only when the (absolute) mean of the two (or either of the two) reference blocks is greater/less than the absolute mean difference (or SAD/SATD, etc.) multiplied by T; in one example, T = 4.5.

l. In one example, the temporal gradient is modified only when the (absolute) mean of the two (or either of the two) reference blocks is within the range of the absolute mean difference (or SAD/SATD, etc.) multiplied by [T1, T2]; in one example, T1 = 4.5 and T2 = 7.

9. It is proposed that in a mixed intra and inter prediction mode, the two inter reference blocks may be modified when calculating the spatial gradient in the BIO, or they may be modified before performing the entire BIO process.

a. In one example, the intra-predicted block and the inter-predicted block in each prediction direction are weighted averaged (using the same weighting method as in the mixed intra and inter prediction) to generate two new predicted blocks, denoted wAvgBlkL0 and wAvgBlkL1, which are used to derive the spatial gradient in the BIO.

b. In one example, wAvgBlkL0 and wAvgBlkL1 are used to generate a prediction block of the current block, denoted as predBlk. Then wAvgBlkL0, wAvgBlkL1 and predBlk are further used in the BIO process, and the refined prediction block generated in BIO is used as the final prediction block.

10. It is proposed that the DMVR or/and BIO flag may be signaled at the block level to indicate whether the DMVR or/and BIO is enabled for the block (a sketch of the signaling conditions follows this list).

a. In one example, such a flag may be signaled only for AMVP mode, and in Merge mode, such a flag may be inherited from spatially or/and temporally adjacent blocks.

b. In one example, whether the BIO or/and DMVR are enabled may be determined by both the signaled flag and an immediate decision (e.g., SAD based decision at early termination stage). The signaled flag may indicate whether the immediate decision is correct.

c. For uni-directional prediction blocks, such a flag is not signaled.

d. Such a flag may not be signaled for a bi-predicted block whose two reference pictures are both preceding or following pictures in display order.

e. Such a flag may not be signaled for bi-prediction blocks if POC_diff(currPic, ref0) is not equal to POC_diff(ref1, currPic), where POC_diff() computes the POC difference between two pictures and ref0 and ref1 are the reference pictures of the current picture.

f. For intra-coded blocks, such a flag is not signaled. Alternatively or in addition, such a flag is not signaled for blocks coded with mixed intra and inter prediction modes.

Alternatively, for a current picture reference block (i.e., the reference picture is the current picture), such a flag is not signaled.

g. Whether or not to signal the flag may depend on the block dimension. For example, if the block size is less than the threshold, such a flag is not signaled. Alternatively, if the block width and/or height is equal to or greater than the threshold, such a flag is not signaled.

h. Whether or not to signal the flag may depend on the motion vector accuracy. For example, if the motion vector is integer precision, such a flag is not signaled.

i. If such a flag is not signaled, it can be implicitly deduced as true or false.

j. A flag may be signaled at the slice header/PPS/SPS/VPS level to indicate whether the method is enabled.

k. Such signaling methods may depend on the temporal layer of the picture, e.g., for pictures with high temporal layers, such signaling methods may be disabled.

l. This signaling method may depend on the QP of the picture, e.g., for pictures with high QP, this signaling method may be disabled.
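The following minimal Python sketch collects several of the conditions above into a single check of whether the block-level flag would be signaled. The field names, the dictionary representation of a block, and the dimension threshold are illustrative assumptions, and only a subset of the listed conditions (items a, c, e, f, g, and h) is modeled.

def flag_is_signalled(blk, min_size=8):
    if blk["merge_mode"]:                              # item a: inherited, not signaled, in Merge mode
        return False
    if not blk["bi_predicted"]:                        # item c: not signaled for uni-prediction
        return False
    if blk["poc_diff_cur_ref0"] != blk["poc_diff_ref1_cur"]:   # item e: POC distances must match
        return False
    if blk["intra"] or blk["mixed_intra_inter"]:       # item f: not for intra or mixed intra/inter blocks
        return False
    if blk["width"] < min_size or blk["height"] < min_size:    # item g: one of the dimension alternatives
        return False
    if blk["integer_mv"]:                              # item h: not for integer-precision motion vectors
        return False
    return True

blk = dict(merge_mode=False, bi_predicted=True, intra=False, mixed_intra_inter=False,
           poc_diff_cur_ref0=2, poc_diff_ref1_cur=2, width=16, height=16, integer_mv=False)
print(flag_is_signalled(blk))   # True: the DMVR/BIO flag would be coded for this block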

11. It is proposed to decide whether to enable or disable DMVR based only on block height, rather than checking both block height and block size.

a. In one example, DMVR may be enabled when the block height is greater than T1 (e.g., T1 = 4).

b. In one example, DMVR may be enabled when the block height is equal to or greater than T1 (e.g., T1 = 8).

12. The above-described methods applied to DMVR/BIO may also be applicable to other decoder-side motion vector derivation (DMVD) methods, such as affine-mode optical-flow-based prediction refinement.

a. In one example, the conditional checks for use determination of DMVR and BIO may be aligned, such as whether the block heights meet the same threshold.

i. In one example, DMVR and BIO may be enabled when the block height is equal to or greater than T1 (e.g., T1 = 8).

ii. In one example, DMVR and BIO may be enabled when the block height is greater than T1 (e.g., T1 = 4).

5. Examples of the embodiments

5.1 example #1

The use of DMVR in JVET-M1001_v7 (VVC working draft 4, version 7) is modified as follows. The text in bold capital letters will be deleted from JVET-M1001_v7:

-dmvrFlag is set to 1 when all of the following conditions are true:

-sps_dmvr_enabled_flag is equal to 1

-the current block is not coded with triangle prediction mode, AMVR affine mode, sub-block mode (including Merge affine mode and ATMVP mode)

-merge_flag[ xCb ][ yCb ] is equal to 1

-predFlagL0[ 0 ][ 0 ] and predFlagL1[ 0 ][ 0 ] are both equal to 1

-mmvd_flag[ xCb ][ yCb ] is equal to 0

-DiffPicOrderCnt(currPic, RefPicList[ 0 ][ refIdxL0 ]) is equal to DiffPicOrderCnt(RefPicList[ 1 ][ refIdxL1 ], currPic)

-cbHeight is greater than or equal to 8

That is, "cbHeight × cbWidth greater than or equal to 64" is deleted.
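As a minimal Python sketch of the check that remains after this modification, the block area no longer takes part in the decision and only the height is tested; the remaining bitstream-level conditions are collapsed into a single boolean argument here for brevity.

def dmvr_flag(other_conditions_true, cb_width, cb_height):
    # cb_width is kept only to show that it no longer influences the decision.
    return other_conditions_true and cb_height >= 8

print(dmvr_flag(True, cb_width=128, cb_height=4))   # False: height too small
print(dmvr_flag(True, cb_width=4, cb_height=8))     # True even though 4*8 < 64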

5.2 example 2

Newly added portions are highlighted in bold italics and deleted portions are highlighted in bold uppercase.

i. An example

8.5.7.4 bidirectional optical flow prediction process

The inputs to this process are:

two variables nCbW and nCbH specify the width and height of the current codec block,

two (nCbW +2) x (nCbH +2) luma prediction sample point arrays predSamplesL0 and predSamplesL1,

the prediction list utilization flags predFlagL0 and predFlagL1,

reference indices refIdxL0 and refIdxL1,

-the bi-directional optical flow utilization flag bdofUtilizationFlag[ xIdx ][ yIdx ], where xIdx = 0..(nCbW >> 2) - 1, yIdx = 0..(nCbH >> 2) - 1.

The output of this process is an (nCbW)x(nCbH) array of luma prediction sample values pbSamples.

The variables bitDepth, shift1, shift2, shift3, shift4, offset4, and mvRefineThres are derived as follows:

-the variable bitDepth is set equal to bitDepthY

Variable shift1 is set to Max (2, 14-bitDepth).

Variable shift2 is set to Max (8, bitDepth-4).

Variable shift3 is set to Max (5, bitDepth-7).

Variable shift4 is set to Max(3, 15 - bitDepth) and variable offset4 is set to 1 << (shift4 - 1).

The variable mvRefineThres is set to Max(2, 1 << (13 - bitDepth)).

For xIdx = 0..(nCbW >> 2) - 1, yIdx = 0..(nCbH >> 2) - 1, the following applies:

The variable xSb is set equal to (xIdx << 2) + 1 and ySb is set equal to (yIdx << 2) + 1.

-If bdofUtilizationFlag[ xSbIdx ][ yIdx ] is equal to FALSE, for x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the predicted sample values of the current subblock are derived as follows:

pbSamples[x][y]=Clip3(0,(2^bitDepth)-1,(predSamplesL0[x+1][y+1]+offset2+predSamplesL1[x+1][y+1])>>shift2) (8-852)

Otherwise (bdofUtilizationFlag[ xSbIdx ][ yIdx ] is equal to TRUE), the predicted sample value of the current subblock is derived as follows:

For x = xSb - 1..xSb + 4, y = ySb - 1..ySb + 4, the following ordered procedure applies:

4. The position (hx, vy) for each of the corresponding sample point positions (x, y) within the prediction sample point arrays is derived as follows:

hx=Clip3(1,nCbW,x) (8-853)

vy=Clip3(1,nCbH,y) (8-854)

5. The variables gradientHL0[ x ][ y ], gradientVL0[ x ][ y ], gradientHL1[ x ][ y ] and gradientVL1[ x ][ y ] are derived as follows:

gradientHL0[x][y]=(predSamplesL0[hx+1][vy]-predSampleL0[hx-1][vy])>>shift1 (8-855)

gradientVL0[x][y]=(predSampleL0[hx][vy+1]-predSampleL0[hx][vy-1])>>shift1 (8-856)

gradientHL1[x][y]=(predSamplesL1[hx+1][vy]-predSampleL1[hx-1][vy])>>shift1 (8-857)

gradientVL1[x][y]=(predSampleL1[hx][vy+1]-predSampleL1[hx][vy-1])>>shift1 (8-858)

6. The variables diff[ x ][ y ], tempH[ x ][ y ] and tempV[ x ][ y ] are derived as follows:

DIFF[X][Y]=(PREDSAMPLESL0[HX][VY]>>SHIFT2)-(PREDSAMPLESL1[HX][VY]>>SHIFT2) (8-859)

diff[x][y]=(predSamplesL0[hx][vy]-predSamplesL1[hx][vy])>>shift2 (8-859)

tempH[x][y]=(gradientHL0[x][y]+gradientHL1[x][y])>>shift3 (8-860)

tempV[x][y]=(gradientVL0[x][y]+gradientVL1[x][y])>>shift3 (8-861)

the variables sGx2, sGy2, sGxGy, sGxdI and sGydI are derived as follows:

sGx2=ΣiΣj(tempH[xSb+i][ySb+j]*tempH[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-862)

sGy2=ΣiΣj(tempV[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-863)

sGxGy=ΣiΣj(tempH[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-864)

sGxdI=ΣiΣj(-tempH[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-865)

sGydI=ΣiΣj(-tempV[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-866)

-horizontal motion offset and vertical motion offset of the current sub-block are derived as follows:

vx=sGx2>0?Clip3(-mvRefineThres,mvRefineThres,-(sGxdI<<3)>>Floor(Log2(sGx2))):0 (8-867)

vy=sGy2>0?Clip3(-mvRefineThres,mvRefineThres,((sGydI<<3)-((vx*sGxGym)<<12+vx*sGxGys)>>1)>>Floor(Log2(sGy2))):0 (8-868)

-For x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the predicted sample values for the current subblock are derived as follows:

bdofOffset=Round((vx*(gradientHL1[x+1][y+1]-gradientHL0[x+1][y+1]))>>1)+Round((vy*(gradientVL1[x+1][y+1]-gradientVL0[x+1][y+1]))>>1) (8-869)

[ Ed. (JC): Round () operation is defined for floating point inputs. Since the inputs are integer values, the Round () operation appears redundant here. Confirmed by the proposer ]

pbSamples[x][y]=Clip3(0,(2^bitDepth)-1,(predSamplesL0[x+1][y+1]+offset4+predSamplesL1[x+1][y+1]+bdofOffset)>>shift4) (8-870)
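The effect of the change to equation (8-859) in this example (shifting the difference instead of shifting each sample before the subtraction) can be seen in the following minimal Python sketch; shift2 and the two sample values are assumed for illustration only.

shift2 = 4
l0, l1 = 16823, 16809          # two co-located interpolated samples (assumed values)

diff_old = (l0 >> shift2) - (l1 >> shift2)   # deleted form: shift each sample first
diff_new = (l0 - l1) >> shift2               # added form: shift the difference

print(diff_old, diff_new)      # 1 versus 0: the two forms can differ by one unit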

ii. One example

8.5.7.4 bidirectional optical flow prediction process

The inputs to this process are:

two variables nCbW and nCbH specify the width and height of the current codec block,

two (nCbW +2) x (nCbH +2) luma prediction sample point arrays predSamplesL0 and predSamplesL1,

the prediction list utilization flags predFlagL0 and predFlagL1,

reference indices refIdxL0 and refIdxL1,

-the bi-directional optical flow utilization flag bdofUtilizationFlag[ xIdx ][ yIdx ], where xIdx = 0..(nCbW >> 2) - 1, yIdx = 0..(nCbH >> 2) - 1.

The output of this process is an (nCbW)x(nCbH) array of luma prediction sample values pbSamples.

The variables bitDepth, shift1, shift2, shift3, shift4, offset4, and mvRefineThres are derived as follows:

-the variable bitDepth is set equal to bitDepthY

Variable shift1 is set to Max (2, 14-bitDepth).

Variable shift2 is set to Max (8, bitDepth-4).

Variable shift3 is set to Max (5, bitDepth-7).

Variable shift4 is set to Max(3, 15 - bitDepth) and variable offset4 is set to 1 << (shift4 - 1).

The variable mvRefineThres is set to Max(2, 1 << (13 - bitDepth)).

For xIdx = 0..(nCbW >> 2) - 1, yIdx = 0..(nCbH >> 2) - 1, the following applies:

The variable xSb is set equal to (xIdx << 2) + 1, ySb is set equal to (yIdx << 2) + 1.

-If bdofUtilizationFlag[ xSbIdx ][ yIdx ] is equal to FALSE, for x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the predicted sample values of the current subblock are derived as follows:

pbSamples[x][y]=Clip3(0,(2^bitDepth)-1,(predSamplesL0[x+1][y+1]+offset2+predSamplesL1[x+1][y+1])>>shift2) (8-852)

Otherwise (bdofUtilizationFlag[ xSbIdx ][ yIdx ] is equal to TRUE), the predicted sample value of the current subblock is derived as follows:

For x = xSb - 1..xSb + 4, y = ySb - 1..ySb + 4, the following ordered procedure applies:

7. The position (hx, vy) for each of the corresponding sample point positions (x, y) within the prediction sample point arrays is derived as follows:

hx=Clip3(1,nCbW,x) (8-853)

vy=Clip3(1,nCbH,y) (8-854)

8. the variables gradientHL0[ x ] [ y ], gradientVL0[ x ] [ y ], gradientHL1[ x ] [ y ] and gradientVL1[ x ] [ y ] are derived as follows:

GRADIENTHL0[X][Y]=(PREDSAMPLESL0[HX+1][VY]-PREDSAMPLEL0[HX-1][VY])>>SHIFT1 (8-855)

GRADIENTVL0[X][Y]=(PREDSAMPLEL0[HX][VY+1]-PREDSAMPLEL0[HX][VY-1])>>SHIFT1 (8-856)

GRADIENTHL1[X][Y]=(PREDSAMPLESL1[HX+1][VY]-PREDSAMPLEL1[HX-1][VY])>>SHIFT1 (8-857)

GRADIENTVL1[X][Y]=(PREDSAMPLEL1[HX][VY+1]-PREDSAMPLEL1[HX][VY-1])>>SHIFT1 (8-858)

gradientHL0[x][y]=(predSamplesL0[hx+1][vy]>>shift1)-(predSampleL0[hx-1][vy]>>shift1) (8-855)

gradientVL0[x][y]=(predSampleL0[hx][vy+1]>>shift1)–(predSampleL0[hx][vy-1]>>shift1) (8-856)

gradientHL1[x][y]=(predSamplesL1[hx+1][vy]>>shift1)–(predSampleL1[hx-1][vy]>>shift1) (8-857)

gradientVL1[x][y]=(predSampleL1[hx][vy+1]>>shift1)–(predSampleL1[hx][vy-1]>>shift1) (8-858)

9. The variables diff[ x ][ y ], tempH[ x ][ y ] and tempV[ x ][ y ] are derived as follows:

diff[x][y]=(predSamplesL0[hx][vy]>>shift2)-(predSamplesL1[hx][vy]>>shift2) (8-859)

tempH[x][y]=(gradientHL0[x][y]+gradientHL1[x][y])>>shift3 (8-860)

tempV[x][y]=(gradientVL0[x][y]+gradientVL1[x][y])>>shift3 (8-861)

the variables sGx2, sGy2, sGxGy, sGxdI and sGydI are derived as follows:

sGx2=ΣiΣj(tempH[xSb+i][ySb+j]*tempH[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-862)

sGy2=ΣiΣj(tempV[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-863)

sGxGy=ΣiΣj(tempH[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-864)

sGxdI=ΣiΣj(-tempH[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-865)

sGydI=ΣiΣj(-tempV[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-866)

The horizontal and vertical motion offsets for the current sub-block are derived as follows:

vx=sGx2>0?Clip3(-mvRefineThres,mvRefineThres,-(sGxdI<<3)>>Floor(Log2(sGx2))):0 (8-867)

vy=sGy2>0?Clip3(-mvRefineThres,mvRefineThres,((sGydI<<3)-((vx*sGxGym)<<12+vx*sGxGys)>>1)>>Floor(Log2(sGy2))):0 (8-868)

-For x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the predicted sample values for the current subblock are derived as follows:

bdofOffset=Round((vx*(gradientHL1[x+1][y+1]-gradientHL0[x+1][y+1]))>>1)+Round((vy*(gradientVL1[x+1][y+1]-gradientVL0[x+1][y+1]))>>1) (8-869)

[ Ed. (JC): Round () operation is defined for floating point inputs. Since the inputs are integer values, the Round () operation appears redundant here. Confirmed by the proposer ]

pbSamples[x][y]=Clip3(0,(2^bitDepth)-1,(predSamplesL0[x+1][y+1]+offset4+predSamplesL1[x+1][y+1]+bdofOffset)>>shift4) (8-870)
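The gradient change in equations (8-855)-(8-858) of this example (right-shifting each prediction sample before the subtraction) is illustrated by the minimal Python sketch below. The padded array size, the ramp-shaped sample values, and shift1 are assumptions for illustration; the clipping of (hx, vy) follows equations (8-853) and (8-854).

def clip3(lo, hi, v):
    return max(lo, min(hi, v))

def l0_gradients(pred_l0, n_cbw, n_cbh, shift1):
    """pred_l0 is an (nCbW+2) x (nCbH+2) padded prediction array, indexed [x][y]."""
    grad_h = [[0] * (n_cbh + 2) for _ in range(n_cbw + 2)]
    grad_v = [[0] * (n_cbh + 2) for _ in range(n_cbw + 2)]
    for x in range(n_cbw + 2):
        for y in range(n_cbh + 2):
            hx = clip3(1, n_cbw, x)
            vy = clip3(1, n_cbh, y)
            grad_h[x][y] = (pred_l0[hx + 1][vy] >> shift1) - (pred_l0[hx - 1][vy] >> shift1)
            grad_v[x][y] = (pred_l0[hx][vy + 1] >> shift1) - (pred_l0[hx][vy - 1] >> shift1)
    return grad_h, grad_v

# 4x4 block -> 6x6 padded prediction array with a simple horizontal ramp.
n_cbw = n_cbh = 4
pred = [[1000 + 64 * x for _ in range(n_cbh + 2)] for x in range(n_cbw + 2)]
gh, gv = l0_gradients(pred, n_cbw, n_cbh, shift1=6)
print(gh[2][2], gv[2][2])   # horizontal ramp: non-zero H gradient, zero V gradient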

iii. One example

8.5.7.4 bidirectional optical flow prediction process

The inputs to the process are:

two variables nCbW and nCbH specify the width and height of the current codec block,

two (nCbW +2) x (nCbH +2) luma prediction sample point arrays predSamplesL0 and predSamplesL1,

the prediction list utilization flags predFlagL0 and predFlagL1,

reference indices refIdxL0 and refIdxL1,

-the bi-directional optical flow utilization flag bdofUtilizationFlag[ xIdx ][ yIdx ], where xIdx = 0..(nCbW >> 2) - 1, yIdx = 0..(nCbH >> 2) - 1.

The output of this process is an (nCbW)x(nCbH) array of luma prediction sample values pbSamples.

The variables bitDepth, shift1, shift2, shift3, shift4, offset4, offset5, offset6, and mvRefineThres are derived as follows:

-the variable bitDepth is set equal to bitDepthY

Variable shift1 is set to Max (2, 14-bitDepth).

Variable shift2 is set to Max (8, bitDepth-4).

Variable shift3 is set to Max (5, bitDepth-7).

Variable shift4 is set to Max(3, 15 - bitDepth) and variable offset4 is set to 1 << (shift4 - 1).

The variable mvRefineThres is set to Max(2, 1 << (13 - bitDepth)).

The variable offset5 is set equal to (1 << (shift1 - 1)).

The variable offset6 is set equal to (1 << (shift2 - 1)).

For xIdx = 0..(nCbW >> 2) - 1 and yIdx = 0..(nCbH >> 2) - 1, the following applies:

The variable xSb is set equal to (xIdx << 2) + 1, ySb is set equal to (yIdx << 2) + 1.

-If bdofUtilizationFlag[ xSbIdx ][ yIdx ] is equal to FALSE, for x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the predicted sample values of the current subblock are derived as follows:

pbSamples[x][y]=Clip3(0,(2^bitDepth)-1,(predSamplesL0[x+1][y+1]+offset2+predSamplesL1[x+1][y+1])>>shift2) (8-852)

Otherwise (bdofUtilizationFlag[ xSbIdx ][ yIdx ] is equal to TRUE), the predicted sample value of the current subblock is derived as follows:

For x = xSb - 1..xSb + 4, y = ySb - 1..ySb + 4, the following ordered procedure applies:

10. The position (hx, vy) for each of the corresponding sample point positions (x, y) within the prediction sample point arrays is derived as follows:

hx=Clip3(1,nCbW,x) (8-853)

vy=Clip3(1,nCbH,y) (8-854)

11. the variables gradientHL0[ x ] [ y ], gradientVL0[ x ] [ y ], gradientHL1[ x ] [ y ] and gradientVL1[ x ] [ y ] are derived as follows:

gradientHL0[x][y]=(predSamplesL0[hx+1][vy]-predSampleL0[hx-1][vy]+offset5)>>shift1 (8-855)

gradientVL0[x][y]=(predSampleL0[hx][vy+1]-predSampleL0[hx][vy-1]+offset5)>>shift1 (8-856)

gradientHL1[x][y]=(predSamplesL1[hx+1][vy]-predSampleL1[hx-1][vy]+offset5)>>shift1 (8-857)

gradientVL1[x][y]=(predSampleL1[hx][vy+1]-predSampleL1[hx][vy-1]+offset5)>>shift1 (8-858)

12. The variables diff[ x ][ y ], tempH[ x ][ y ] and tempV[ x ][ y ] are derived as follows:

DIFF[X][Y]=(PREDSAMPLESL0[HX][VY]>>SHIFT2)-(PREDSAMPLESL1[HX][VY]>>SHIFT2) (8-859)

diff[x][y]=(predSamplesL0[hx][vy]-predSamplesL1[hx][vy]+offset6)>>shift2 (8-859)

tempH[x][y]=(gradientHL0[x][y]+gradientHL1[x][y])>>shift3 (8-860)

tempV[x][y]=(gradientVL0[x][y]+gradientVL1[x][y])>>shift3 (8-861)

the variables sGx2, sGy2, sGxGy, sGxdI and sGydI are derived as follows:

sGx2=ΣiΣj(tempH[xSb+i][ySb+j]*tempH[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-862)

sGy2=ΣiΣj(tempV[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-863)

sGxGy=ΣiΣj(tempH[xSb+i][ySb+j]*tempV[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-864)

sGxdI=ΣiΣj(-tempH[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-865)

sGydI=ΣiΣj(-tempV[xSb+i][ySb+j]*diff[xSb+i][ySb+j]),

wherein i, j = -1..4 (8-866)

-horizontal motion offset and vertical motion offset of the current sub-block are derived as follows:

vx=sGx2>0?Clip3(-mvRefineThres,mvRefineThres,-(sGxdI<<3)>>Floor(Log2(sGx2))):0 (8-867)

vy=sGy2>0?Clip3(-mvRefineThres,mvRefineThres,((sGydI<<3)-((vx*sGxGym)<<12+vx*sGxGys)>>1)>>Floor(Log2(sGy2))):0 (8-868)

-For x = xSb - 1..xSb + 2, y = ySb - 1..ySb + 2, the predicted sample values for the current subblock are derived as follows:

bdofOffset=Round((vx*(gradientHL1[x+1][y+1]-gradientHL0[x+1][y+1]))>>1)+Round((vy*(gradientVL1[x+1][y+1]-gradientVL0[x+1][y+1]))>>1) (8-869)

[ Ed. (JC): Round () operation is defined for floating point inputs. Since the input is an integer value, Round () therefore appears redundant here. Confirmed by the proposer ]

pbSamples[x][y]=Clip3(0,(2^bitDepth)-1,(predSamplesL0[x+1][y+1]+offset4+predSamplesL1[x+1][y+1]+bdofOffset)>>shift4) (8-870)
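The following minimal Python sketch walks through the BDOF sub-block refinement described in the examples above, using the example #3 forms in which offset5 and offset6 make the gradient and diff shifts round to nearest. The sGxGym/sGxGys split of sGxGy (not reproduced in the text above) is folded into a single product, the vy normalization uses sGy2, bitDepth is assumed to be 10, and the prediction samples are synthetic ramps, so this illustrates the data flow rather than reproducing the draft text bit-exactly.

def clip3(lo, hi, v):
    return max(lo, min(hi, v))

bitDepth = 10
shift1 = max(2, 14 - bitDepth)                  # 4
shift2 = max(8, bitDepth - 4)                   # 8
shift3 = max(5, bitDepth - 7)                   # 5
shift4 = max(3, 15 - bitDepth)                  # 5
offset4 = 1 << (shift4 - 1)
offset5 = 1 << (shift1 - 1)
offset6 = 1 << (shift2 - 1)
mvRefineThres = max(2, 1 << (13 - bitDepth))    # 8

# Synthetic 6x6 padded prediction arrays (a 4x4 sub-block plus a one-sample
# border), indexed [x][y]; L0 and L1 are ramps with different slopes.
W = H = 4
predL0 = [[12000 + 640 * x + 160 * y for y in range(H + 2)] for x in range(W + 2)]
predL1 = [[12000 + 480 * x + 200 * y for y in range(H + 2)] for x in range(W + 2)]

gradH = {0: {}, 1: {}}
gradV = {0: {}, 1: {}}
diff, tempH, tempV = {}, {}, {}
for x in range(W + 2):
    for y in range(H + 2):
        hx, vy = clip3(1, W, x), clip3(1, H, y)
        for li, pred in ((0, predL0), (1, predL1)):
            gradH[li][x, y] = (pred[hx + 1][vy] - pred[hx - 1][vy] + offset5) >> shift1
            gradV[li][x, y] = (pred[hx][vy + 1] - pred[hx][vy - 1] + offset5) >> shift1
        diff[x, y] = (predL0[hx][vy] - predL1[hx][vy] + offset6) >> shift2
        tempH[x, y] = (gradH[0][x, y] + gradH[1][x, y]) >> shift3
        tempV[x, y] = (gradV[0][x, y] + gradV[1][x, y]) >> shift3

# Per-sub-block correlation sums over i, j = -1..4 (xSb = ySb = 1 here).
sGx2 = sGy2 = sGxGy = sGxdI = sGydI = 0
for i in range(-1, 5):
    for j in range(-1, 5):
        p = (1 + i, 1 + j)
        sGx2 += tempH[p] * tempH[p]
        sGy2 += tempV[p] * tempV[p]
        sGxGy += tempH[p] * tempV[p]
        sGxdI += -tempH[p] * diff[p]
        sGydI += -tempV[p] * diff[p]

vx = clip3(-mvRefineThres, mvRefineThres,
           (-(sGxdI << 3)) >> (sGx2.bit_length() - 1)) if sGx2 > 0 else 0
vy = clip3(-mvRefineThres, mvRefineThres,
           ((sGydI << 3) - ((vx * sGxGy) >> 1)) >> (sGy2.bit_length() - 1)) if sGy2 > 0 else 0

# Refined prediction for the top-left sample of the sub-block (block x = y = 1).
x = y = 1
bdofOffset = ((vx * (gradH[1][x + 1, y + 1] - gradH[0][x + 1, y + 1])) >> 1) \
           + ((vy * (gradV[1][x + 1, y + 1] - gradV[0][x + 1, y + 1])) >> 1)
pbSample = clip3(0, (1 << bitDepth) - 1,
                 (predL0[x + 1][y + 1] + offset4 + predL1[x + 1][y + 1] + bdofOffset) >> shift4)
print(vx, vy, bdofOffset, pbSample)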

Fig. 8 is a block diagram of a video processing device 800. The device 800 may be used to implement one or more of the methods described herein. The apparatus 800 may be embodied in a smartphone, tablet, computer, internet of things (IoT) receiver, or the like. The apparatus 800 may include one or more processors 802, one or more memories 804, and video processing hardware 806. The processor(s) 802 may be configured to implement one or more of the methods described in this document. The memory(s) 804 may be used to store data and code for implementing the methods and techniques described herein. Video processing hardware 806 may be used to implement some of the techniques described in this document in hardware circuitry. The video processing hardware 806 may be partially or completely included in the processor(s) 802 in the form of dedicated hardware, a Graphics Processor Unit (GPU), or a dedicated signal processing block.

Fig. 10 is a flow diagram of a method 1000 of processing video. Method 1000 includes performing a determination (1005) of a characteristic of a first video block, the characteristic including a difference between reference blocks associated with the first video block, the difference including one or more of a Sum of Absolute Transformed Differences (SATD), a sum of mean removed absolute transformed differences (MRSATD), a Sum of Squared Errors (SSE), a sum of mean removed squared errors (MRSSE), a mean difference, or a gradient value, determining (1010) an operating state of one or both of a bi-directional optical flow (BIO) technique or a decoder-side motion vector refinement (DMVR) technique based on the characteristic of the first video block, the operating state being one of enabled or disabled, and performing (1015) further processing of the first video block consistent with the operating state of the one or both of the BIO technique or the DMVR technique.

Fig. 11 is a flow diagram of a method 1100 of processing video. The method 1100 includes modifying (1105) a first reference block to generate a first modified reference block and modifying a second reference block to generate a second modified reference block, the first and second reference blocks being associated with a first video block, determining (1110) a difference between the first and second modified reference blocks, the difference comprising one or more of a Sum of Absolute Transformed Differences (SATD), a sum of absolute transformed differences with Mean Removal (MRSATD), a Sum of Squared Errors (SSE), a sum of squared errors with Mean Removal (MRSSE), a mean difference, or a gradient value, and performing (1115) further processing of the first video block based on the difference between the first and second modified reference blocks.
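A minimal Python sketch of one instance of the block modification in method 1100 is given below, assuming the modification is the mean subtraction mentioned later in this document; removing each reference block's mean before a SAD yields a mean-removed SAD (MRSAD)-style difference. The block contents are assumed values.

def mean(block):
    return sum(sum(row) for row in block) / (len(block) * len(block[0]))

def remove_mean(block):
    m = mean(block)
    return [[v - m for v in row] for row in block]

def sad(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def mrsad(ref0, ref1):
    return sad(remove_mean(ref0), remove_mean(ref1))

ref0 = [[100, 102], [101, 99]]
ref1 = [[120, 122], [121, 119]]            # same texture, offset by a constant of 20
print(sad(ref0, ref1), mrsad(ref0, ref1))  # plain SAD sees 80, MRSAD sees 0.0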

Fig. 12 is a flow diagram of a method 1200 of processing video. The method 1200 includes determining (1205) a difference between a portion of a first reference block and a portion of a second reference block associated with the first video block, the difference including one or more of a Sum of Absolute Transformed Differences (SATD), a sum of mean removed absolute transformed differences (MRSATD), a Sum of Squared Errors (SSE), a sum of mean removed squared errors (MRSSE), a mean difference, or a gradient value, and performing (1210) further processing of the first video block based on the difference.

Fig. 13 is a flow diagram of a method 1300 of processing video. Method 1300 includes determining (1305) a temporal gradient or a modified temporal gradient using a reference picture associated with the first video block, the temporal gradient or the modified temporal gradient indicating a difference between the reference pictures, and performing (1310) further processing of the first video block using a bi-directional optical flow (BIO) codec tool according to the difference.

Fig. 14 is a flow diagram of a method 1400 of processing video. Method 1400 includes determining (1405) a temporal gradient using a reference picture associated with the first video block, modifying (1410) the temporal gradient to generate a modified temporal gradient, and performing (1415) further processing of the first video block using the modified temporal gradient.

Fig. 15 is a flow chart of a method 1500 of processing video. The method 1500 includes modifying (1505) one or both of a first inter reference block and a second inter reference block associated with a first video block, determining (1510) a spatial gradient according to a bi-directional optical flow (BIO) codec tool using the modified one or both of the first inter reference block and the modified second inter reference block, and performing (1515) further processing of the first video block based on the spatial gradient.

Fig. 16 is a flow diagram of a method 1600 of processing video. The method 1600 includes determining (1605) that a flag signaled at the block level indicates that one or both of decoder-side motion vector refinement (DMVR) or bi-directional optical flow (BIO) is to be enabled for a first video block, and performing (1610) further processing of the first video block, the processing including applying one or both of DMVR or BIO in accordance with the flag.

Some examples of determining to use bi-directional optical flow (BIO) or decoder-side motion vector refinement (DMVR) are described in section 4 of this document with reference to methods 1000, 1100, 1200, 1300, 1400, 1500, and 1600. For example, as described in section 4, a difference between reference blocks may be determined, which may be used to enable or disable the BIO or DMVR.

Referring to methods 1000, 1100, 1200, 1300, 1400, 1500, and 1600, a video block may be encoded in a video bitstream, where bit efficiency may be achieved by using bitstream generation rules related to motion information prediction.

These methods may include where the operating state of the BIO technique or DMVR technique is different between the block level and the sub-block level.

The methods may include determining that one or more of the gradient values, the average of the gradient values, or the range of gradient values are within a threshold range, wherein the operating state is determined based on determining that the gradient values, the average of the gradient values, or the range of gradient values are within the threshold range.

The methods may include wherein determining the operational state is further based on information signaled from an encoder to a decoder in a Video Parameter Set (VPS), a Sequence Parameter Set (SPS), a Picture Parameter Set (PPS), a slice header, or a slice header.

The methods may include determining a refined motion vector for the first video block based on SATD, MRSATD, SSE, or MRSSE, and wherein further processing is performed based on the refined motion vector.

The methods may include wherein the refined motion vector is determined based on SATD or MRSATD, the method further comprising: determining a SATD or a MRSATD for each sub-block of the first video block; and generating the SATD or MRSATD of the first video block based on a sum of the SATD or MRSATD of each sub-block, wherein further processing of the first video block is based on the generated SATD or MRSATD.

The methods may include determining that a mean difference of two reference blocks of the first video block is greater than a threshold, and wherein one or both of the BIO or DMVR are in a disabled state of operation based on the mean difference of the two reference blocks.

The methods may include determining that a mean difference of two reference sub-blocks of a sub-block of the first video block is greater than a threshold, and wherein one or both of the BIO or DMVR are in a disabled state of operation based on the mean difference of the two reference sub-blocks.

These methods may include wherein the threshold is predefined.

The methods may include determining a size of the first video block, and wherein the threshold is based on the size of the first video block.

The methods may include wherein modifying the first reference block and the second reference block comprises subtracting a mean of the first reference block from the first reference block.

The methods may include wherein the portion of the first reference block and the second reference block includes even rows.

The methods may include wherein the portions of the first reference block and the second reference block include corner samples.

The methods may include wherein the portion of the first reference block and the second reference block includes a representative sub-block.

The methods may include wherein differences between the representative sub-blocks are added to generate a difference for the first reference block or the second reference block.

These methods may include those in which the difference is related to the absolute sum of the time domain gradients.

The methods may include wherein modifying the temporal gradient is based on an absolute mean difference between the reference blocks being greater than a threshold.

These methods may include wherein the threshold is 4.

The methods may include wherein modifying the temporal gradient is based on an absolute mean difference between the reference blocks being less than a threshold.

These methods may include wherein the threshold is 20.

The methods may include wherein modifying the temporal gradient is within a threshold range based on an absolute mean difference between the reference blocks.

These methods may include wherein the BIO is in a disabled operating state based on the absolute mean difference being greater than a threshold.

These methods may include wherein the threshold or threshold range is indicated in the VPS, SPS, PPS, picture, slice, or slice level.

These methods may include where the threshold or threshold range is different for different Coding Units (CUs), Largest Coding Units (LCUs), slices, or pictures.

These methods may include those in which the threshold or threshold range is based on decoded or encoded pixel values.

These methods may include wherein the threshold or threshold range is based on a reference picture.

The methods may include wherein determining the spatial gradient comprises determining a weighted average of the intra-predicted block and the inter-predicted block in each prediction direction.

These methods may include wherein the flag is provided in an Advanced Motion Vector Prediction (AMVP) mode and is inherited from one or both of spatially neighboring blocks or temporally neighboring blocks in a Merge mode.

The methods may include wherein the flag is not signaled for a unidirectional prediction block.

These methods may include wherein the flag is not signaled for a bi-predicted block whose two reference pictures are both preceding pictures or both following pictures in display order.

The methods may include wherein the flag is not signaled for bi-prediction blocks if POC_diff(currPic, ref0) is not equal to POC_diff(ref1, currPic).

These methods may include wherein the flag is not signaled for intra coded blocks.

The methods may include wherein the flag is not signaled for blocks that are coded with mixed intra and inter prediction modes.

The methods may include wherein the flag is signaled based on a size of the first video block.

These methods may include wherein the flag is signaled in the VPS, SPS or PPS.

The methods may include wherein the flag is based on a temporal layer of a picture associated with the first video block.

The methods may include wherein the flag is based on a Quantization Parameter (QP) of a picture associated with the first video block.

Fig. 17 is a block diagram illustrating an example video processing system 1700 in which various techniques disclosed herein may be implemented. Various embodiments may include some or all of the components of system 1700. The system 1700 may include an input 1702 for receiving video content. The video content may be received in a raw or uncompressed format, such as 8 or 10 bit multi-component pixel values, or may be received in a compressed or encoded format. Input 1702 may represent a network interface, a peripheral bus interface, or a storage interface. Examples of network interfaces include wired interfaces (such as ethernet, Passive Optical Network (PON), etc.) and wireless interfaces (such as Wi-Fi or cellular interfaces).

The system 1700 can include a codec component 1704 that can implement various codecs or encoding methods described in this document. The codec component 1704 may reduce the average bit rate of the video from the input 1702 to the output of the codec component 1704 to produce a codec representation of the video. Thus, codec techniques are sometimes referred to as video compression or video transcoding techniques. The output of the codec component 1704 can be stored or transmitted via a communication connection, as represented by component 1706. Component 1708 may use a stored or transmitted bitstream (or codec) representation of the video received at input 1702 to generate pixel values or displayable video that is sent to display interface 1710. The process of generating user-viewable video from a bitstream representation is sometimes referred to as video decompression. Further, while certain video processing operations are referred to as "codec" operations or tools, it should be understood that codec tools or operations are used at the encoder and the corresponding decoding tools or operations that invert the codec results will be performed by the decoder.

Examples of a peripheral bus interface or display interface may include a Universal Serial Bus (USB) or a high-definition multimedia interface (HDMI) or displayport, among others. Examples of storage interfaces include SATA (serial advanced technology attachment), PCI, IDE interfaces, and the like. The techniques described in this document may be embodied in various electronic devices, such as mobile phones, laptops, smartphones, or other devices capable of performing digital data processing and/or video display.

It should be appreciated that the disclosed techniques may be embodied in a video encoder or decoder to improve compression efficiency when the compressed codec unit has a shape that is significantly different from a conventional square or half-square rectangular block. For example, new codec tools using long or tall codec units (such as units of 4 x 32 or 32 x 4 size) may benefit from the disclosed techniques.

In some embodiments, a method of video processing may be performed as follows:

using a filtering method for calculating spatial and temporal gradients during a conversion between video blocks and bitstream representations of the video blocks, and

performing the conversion using the filtering.

Here, the conversion includes generating a bitstream representation from pixel values of the video block or generating pixel values from the bitstream representation.

In some embodiments, spatial and temporal gradients are computed using the shifted sample point differences.

In some embodiments, the spatial and temporal gradients are calculated using the modified samples.

Item 1, discussed in section 4, provides additional details of the method.

Fig. 18 is a flowchart of an example of a visual media processing method. The steps of this flowchart are discussed in connection with example embodiment 1 discussed in section 4 of this document. At step 1802, the process performs a transform between a current block of visual media data and a corresponding codec representation of the current video block, wherein the transform for the current block includes computing a spatial gradient based on a difference in samples of spatial vicinity samples associated with the current block and/or computing a temporal gradient based on temporal vicinity samples associated with the current block, and wherein techniques for computing the spatial gradient are the same as techniques for computing the temporal gradient.

Fig. 19 is a flowchart of an example of a visual media processing method. The steps of this flowchart are discussed in connection with example embodiment 1 discussed in section 4 of this document. At step 1902, the process performs a transform between a current block of visual media data and a corresponding codec representation of the current video block, wherein the transform for the current block includes computing a spatial gradient based on computing sample difference values for spatially neighboring samples associated with the current block, and wherein a left or right shift of the sample difference values for the spatially neighboring samples is applied prior to computing the sample difference values.
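The following minimal Python sketch illustrates the unified computation of Figs. 18 and 19 in one of the variants described above, namely right-shifting the two samples before taking their difference; the same operation serves for a spatial gradient (two spatially neighboring samples of one prediction block) and for a temporal gradient (two co-located samples from the L0 and L1 prediction blocks). The function name, shift value, and sample values are illustrative assumptions.

def gradient(neig0, neig1, shift1):
    """Unified gradient: right-shift each sample, then take the difference."""
    return (neig0 >> shift1) - (neig1 >> shift1)

shift1 = 6
# Spatial gradient of list 0 at one position: right neighbor minus left neighbor.
spatial_grad_l0 = gradient(17007, 16890, shift1)
# Temporal gradient at the same position: L0 sample minus the co-located L1 sample.
temporal_grad = gradient(16950, 16810, shift1)
print(spatial_grad_l0, temporal_grad)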

Some embodiments of the present document are now presented in a clause-based format.

1. A method of visual media processing, comprising:

a conversion between the current block of visual media data and a corresponding codec representation of the current video block is performed,

wherein the transformation of the current block comprises computing a spatial gradient based on a sample difference value of spatially neighboring samples associated with the current block and/or computing a temporal gradient based on temporally neighboring samples associated with the current block, and

wherein the technique for computing spatial gradients is the same as the technique for computing temporal gradients.

2. The method of clause 1, wherein the technique for calculating the spatial and/or temporal gradients comprises left or right shifting of sample difference values of spatially neighboring samples and/or temporally neighboring samples.

3. The method of clause 2, wherein the spatially neighboring samples and/or temporally neighboring samples are left-shifted or right-shifted prior to calculating the sample difference values.

4. The method of clause 2, wherein the sample difference values of the spatially neighboring samples and/or the temporally neighboring samples are calculated before left-shifting or right-shifting of the sample difference values.

5. The method of clause 4, wherein an offset is added to the sample difference value before the left or right shift of the sample difference value.

6. The method of clause 3, wherein the offset is added to the spatially neighboring samples or the temporally neighboring samples before left-shifting or right-shifting.

7. The method of any one or more of clauses 5-6, wherein the offset is represented as 1 << (shift1 - 1), wherein shift1 represents the amount of the left shift or the right shift.

8. The method of any one or more of clauses 5-6, wherein the offset is represented as (1 << shift1) >> 1, wherein shift1 represents the amount of the left shift or the right shift.

9. The method of any one or more of clauses 1-8, wherein the spatial and/or temporal gradients are represented as SatShift(neig0 - neig1, shift1), wherein SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0, and SatShift(x, n) = -((-x + offset1) >> n), if x < 0,

wherein offset0 and/or offset1 are set to (1 << n) >> 1 and/or (1 << (n - 1)) and/or zero, wherein neig0 and neig1 represent the sample values of two neighboring samples of a sample included in the current block of the visual media data, and wherein shift1 represents the amount of the left or right shift applied to the two neighboring samples.

10. The method of any one or more of clauses 1-8, wherein the spatial and/or temporal gradients are represented as SatShift(neig0, shift1) - SatShift(neig1, shift1), wherein SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0, and SatShift(x, n) = -((-x + offset1) >> n), if x < 0,

wherein offset0 and/or offset1 are set to (1 << n) >> 1 and/or (1 << (n - 1)) and/or zero, wherein neig0 and neig1 represent the sample values of two neighboring samples of a sample included in a current block of visual media data, and wherein shift1 represents the amount of the left or right shift applied to the two neighboring samples (a sketch of SatShift is given after these clauses).

11. The method of any one or more of clauses 1-10, wherein the spatial neighboring samples are included in a prediction block of the current block that corresponds to a prediction list.

12. The method of clause 11, wherein the prediction block is a first prediction block in a first prediction list and/or a second prediction block in a second prediction list.

13. The method of any one or more of clauses 1-10, wherein the temporal gradient is based on at least two temporal neighboring samples associated with the current block in a first prediction block of the current block corresponding to a first prediction list and in a second prediction block of the current block corresponding to a second prediction list.

14. The method of any one or more of clauses 1-9, wherein the spatial gradient is calculated in a horizontal direction and/or a vertical direction.

15. The method of any one or more of clauses 1-13, wherein the converting comprises using a bi-directional optical flow (BDOF) technique.

16. A method of visual media processing, comprising:

a conversion between the current block of visual media data and a corresponding codec representation of the current video block is performed,

wherein the transforming of the current block comprises computing a spatial gradient based on computing a difference of samples of spatially neighboring samples associated with the current block, and

wherein a left or right shift of the sample difference values of adjacent samples in the spatial domain is applied before the sample difference values are calculated.

17. A video decoding apparatus comprising a processor configured to implement the method according to one or more of clauses 1-16.

18. A video encoding apparatus comprising a processor configured to implement the method according to one or more of clauses 1-16.

19. A computer program product having computer code stored thereon, which when executed by a processor causes the processor to implement the method according to any of clauses 1 to 16.

20. A method, apparatus or system as described in this document.
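For reference, the following minimal Python sketch corresponds to the SatShift operation used in clauses 9 and 10 above. The piecewise definition with rounding offsets for non-negative and negative inputs is an assumption consistent with the offset choices listed in those clauses; here offset0 and offset1 are both set to (1 << n) >> 1, and the sample values are illustrative.

def sat_shift(x, n):
    offset0 = offset1 = (1 << n) >> 1       # one of the offset choices in clauses 9 and 10
    if x >= 0:
        return (x + offset0) >> n
    return -((-x + offset1) >> n)

def gradient_clause9(neig0, neig1, shift1):
    """Clause 9: shift the difference of the two neighboring samples."""
    return sat_shift(neig0 - neig1, shift1)

def gradient_clause10(neig0, neig1, shift1):
    """Clause 10: shift each neighboring sample, then take the difference."""
    return sat_shift(neig0, shift1) - sat_shift(neig1, shift1)

print(gradient_clause9(16890, 17007, 6), gradient_clause10(16890, 17007, 6))   # -2 -2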

In this document, the term "video processing" or "visual media processing" may refer to video encoding, video decoding, video compression, or video decompression. For example, a video compression algorithm may be applied during the conversion from a pixel representation of the video to a corresponding bitstream representation, and vice versa. As defined by the syntax, the bitstream representation of the current video block may, for example, correspond to bits that are co-located or distributed at different locations within the bitstream. For example, a macroblock may be encoded from transformed and coded error residual values, and may also be coded using bits in headers and other fields in the bitstream. Furthermore, during the transition, the decoder may parse the bitstream knowing that some fields may or may not be present, based on the determination as described in the above solution. Similarly, the encoder may determine that certain syntax fields are or are not included and generate the codec representation accordingly by including or excluding syntax fields from the codec representation.

The disclosed and other solutions, examples, embodiments, modules, and implementations of functional operations described in this document can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this document and their structural equivalents, or in combinations of one or more of them. The disclosed and other embodiments may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus.

A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this document can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not require such a device. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

While this patent document contains many specifics, these should not be construed as limitations on the scope of any subject matter or claim, but rather as descriptions of features specific to particular embodiments of particular technologies. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.

Only a few embodiments and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.
