Spatial motion compression

Document No.: 1617288    Publication date: 2020-01-10

Note: This technology, Spatial Motion Compression, was designed and created by Zhang Li (张莉), Zhang Kai (张凯), Liu Hongbin (刘鸿彬), and Wang Yue (王悦) on 2019-07-01. Its main content includes the following. The present disclosure relates to spatial motion compression. Devices, systems, and methods for video encoding are disclosed. One example method includes parsing a bitstream to recover motion information for one or more current blocks in a first portion of a video; generating a representative set of motion information using the motion information for blocks in the first portion of the video; and reconstructing a block in a second portion of the video using the representative set of motion information.

1. A video encoding method, comprising:

parsing a bitstream to recover motion information for one or more current blocks in a first portion of a video;

generating a representative set of motion information using the motion information for blocks in a first portion of the video; and

reconstructing a block in a second portion of the video using the representative set of motion information.

2. The method of claim 1, wherein the blocks in the first portion of the video comprise different sets of motion information.

3. The method of claim 1, wherein the first portion of video is a coding tree unit.

4. The method of claim 1, wherein the first portion of the video corresponds to a plurality of coding tree units.

5. The method of any one or more of claims 1-2, further comprising:

resetting motion information of a block in the first portion of the video to the representative set of motion information.

6. The method of claim 5, wherein the resetting is prior to applying a filtering process.

7. The method of claim 6, wherein the filtering process comprises a deblocking filter or a Sample Adaptive Offset (SAO) filter.

8. A method as claimed in any one or more of claims 1-7, wherein the representative set of motion information is used in the filtering process.

9. The method of any one or more of claims 1-8, wherein the representative set of motion information is associated with a representative region of a selected block in the first portion of the video.

10. The method of claim 9, wherein the representative region is a center or lower right region.

11. A method as claimed in any one or more of claims 9 to 10, wherein the representative region is determined by checking blocks in the first portion of the video according to a sequence.

12. The method of claim 11, wherein the checking of blocks is terminated when an inter-coded block is identified in the first portion of the video.

13. The method of claim 12, wherein the representative set of motion information is associated with an inter-coded block.

14. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of one or more of claims 1 to 13.

15. A computer program product stored on a non-transitory computer readable medium, the computer program product comprising program code for performing a method according to one or more of claims 1 to 13.

Technical Field

This document relates generally to picture and video coding techniques.

Background

Digital video accounts for the largest share of bandwidth used on the internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video increases, it is expected that the bandwidth demand for digital video usage will continue to grow.

Disclosure of Invention

Devices, systems, and methods related to identifying non-adjacent Merge candidates in a manner that reduces memory and complexity requirements and improves coding efficiency are described, as are apparatus, systems, and methods for reducing the complexity of non-adjacent Merge designs. For example, the presently disclosed technology discloses rules for selecting non-adjacent Merge candidates so that the size of the line buffer is kept below a threshold. The described methods may be applied to existing video coding standards, such as High Efficiency Video Coding (HEVC), and to future video coding standards or video codecs.

In one representative aspect, the disclosed techniques may be used to provide methods for video encoding and decoding. An example method includes parsing a bitstream to recover motion information for one or more current blocks in a first portion of a video; generating a representative set of motion information using the motion information for blocks in the first portion of the video; and reconstructing a block in a second portion of the video using the representative set of motion information. A list of solutions preferred by some embodiments is further provided in this document.
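By way of illustration only, the following Python sketch shows one possible realization of the example method: the blocks of the first video portion (e.g., a CTU) are checked in a scanning order, and the motion information of the first inter-coded block found is taken as the representative set. The data structures and function names are hypothetical and simplified, and this is not a normative implementation.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class BlockMotion:
    is_inter: bool                        # True if the block is inter-coded
    mv: Optional[Tuple[int, int]] = None  # (mv_x, mv_y) in quarter-luma-sample units
    ref_idx: Optional[int] = None         # reference picture index

def derive_representative_motion(blocks_in_scan_order: List[BlockMotion]) -> Optional[BlockMotion]:
    """Scan the blocks of the first video portion in a fixed order and return the
    motion information of the first inter-coded block; the checking terminates as
    soon as an inter-coded block is identified (cf. claims 11-13)."""
    for blk in blocks_in_scan_order:
        if blk.is_inter:
            return blk
    return None  # no inter-coded block in the first portion: no representative motion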

In yet another representative aspect, the above-described methods are embodied in the form of processor-executable code and stored in a computer-readable program medium.

In yet another representative aspect, an apparatus configured or operable to perform the above-described method is disclosed. The apparatus may include a processor programmed to implement the method.

The above and other aspects and features of the disclosed technology are described in more detail in the accompanying drawings, the description and the claims.

Drawings

Fig. 1 illustrates an example block diagram of a typical High Efficiency Video Coding (HEVC) video encoder and decoder.

FIG. 2 shows an example of Macroblock (MB) partitioning in H.264/AVC.

Fig. 3 shows an example of dividing the Coding Block (CB) into Prediction Blocks (PB).

Fig. 4A and 4B show examples of the subdivision of a Coding Tree Block (CTB) into CBs and Transform Blocks (TBs) and the corresponding quadtrees, respectively.

Fig. 5A and 5B show examples of subdivisions for a Largest Coding Unit (LCU) and corresponding QTBT (quadtree plus binary tree).

Fig. 6A-6E illustrate examples of dividing an encoded block.

Fig. 7 shows an example subdivision of QTBT-based CBs.

Fig. 8A-8I show an example of partitioning of CBs to support multi-tree types (MTT), which is a generalization of QTBT.

Fig. 9 shows an example of building a Merge candidate list.

Fig. 10 shows an example of the positions of spatial candidates.

Fig. 11 shows an example of a candidate pair subjected to a redundancy check of a spatial Merge candidate.

Fig. 12A and 12B illustrate examples of a location of a second Prediction Unit (PU) based on the size and shape of a current block.

Fig. 13 shows an example of motion vector scaling for the temporal domain Merge candidate.

Fig. 14 shows an example of candidate positions of the time domain Merge candidate.

Fig. 15 shows an example of generating combined bidirectional predictive Merge candidates.

Fig. 16 shows an example of constructing a motion vector prediction candidate.

Fig. 17 shows an example of motion vector scaling for spatial motion vector candidates.

Fig. 18 shows an example of motion prediction using an Alternative Temporal Motion Vector Prediction (ATMVP) algorithm for a Coding Unit (CU).

FIG. 19 shows an example of identification of source blocks and source pictures.

Fig. 20 shows an example of a Coding Unit (CU) having sub-blocks and neighboring blocks used by a spatial-temporal motion vector prediction (STMVP) algorithm.

Fig. 21 shows an example of bilateral matching in the pattern matched motion vector derivation (PMMVD) mode, which is a special Merge mode based on the Frame Rate Up Conversion (FRUC) algorithm.

Fig. 22 shows an example of template matching in the FRUC algorithm.

Fig. 23 shows an example of unilateral motion estimation in the FRUC algorithm.

Fig. 24 shows an example of a decoder-side motion vector refinement (DMVR) algorithm based on double-sided template matching.

Fig. 25 shows an example of spatial neighboring blocks used to derive spatial Merge candidates.

FIG. 26 illustrates exemplary pseudo code for adding non-adjacent Merge candidates.

Fig. 27 shows an example of a restricted area of a non-adjacent block.

Fig. 28 illustrates a flow diagram of an example method for video encoding in accordance with the presently disclosed technology.

Fig. 29 is a block diagram of an example of a hardware platform for implementing the visual media decoding or visual media encoding techniques described in this document.

Detailed Description

Due to the increasing demand for higher resolution video, video encoding methods and techniques are ubiquitous in modern technology. Video codecs typically include electronic circuits or software that compress or decompress digital video, and are continually being improved to provide higher coding efficiency. The video codec converts uncompressed video into a compressed format and vice versa. There is a complex relationship between video quality, the amount of data used to represent the video (determined by the bit rate), the complexity of the encoding and decoding algorithms, susceptibility to data loss and errors, ease of editing, random access, and end-to-end delay (lag). The compression format typically conforms to a standard video compression specification, such as the High Efficiency Video Coding (HEVC) standard (also known as h.265 or MPEG-H part 2), the general video coding standard to be completed, or other current and/or future video coding standards.

Embodiments of the disclosed techniques may be applied to existing video coding standards (e.g., HEVC, h.265) and future standards to improve compression performance. Section headings are used in this document to enhance readability of the description, and do not limit discussion or the embodiments (and/or implementations) in any way to only the corresponding sections. Although explained using certain terms of the HEVC, h.265, and VVC standards, the techniques are not limited to these codecs, and may be extended to other video encoding or decoding protocols.

1. Example embodiments of video coding

Fig. 1 shows an example block diagram of a typical HEVC video encoder and decoder. The encoding algorithm that produces an HEVC-compliant bitstream generally proceeds as follows. Each picture is divided into block-shaped regions, and the exact block partitioning is conveyed to the decoder. The first picture of the video sequence (and the first picture at each clean random access point into the video sequence) is coded using only intra-picture prediction (prediction that uses spatial data from region to region within the same picture and does not rely on other pictures). For all remaining pictures of the sequence, or for the pictures between random access points, inter-picture temporal prediction coding modes are typically used for most blocks. The encoding process for inter-picture prediction includes selecting motion data comprising the selected reference picture and a motion vector (MV) to be applied for predicting the samples of each block. The encoder and the decoder generate identical inter-picture prediction signals by applying motion compensation (MC) using the MV and mode decision data, which are transmitted as side information.

The residual signal of the intra picture prediction or the inter picture prediction, which is the difference between the initial block and its prediction, is transformed by a linear spatial transform. The transform coefficients are then scaled, quantized, entropy encoded, and transmitted along with the prediction information.

The encoder duplicates the decoder processing loop (see the gray shaded boxes in Fig. 1) so that both generate identical predictions for subsequent data. Thus, the quantized transform coefficients are reconstructed by inverse scaling and are then inverse transformed to duplicate the decoded approximation of the residual signal. The residual is then added to the prediction, and the result of that addition may then be fed into one or two loop filters to smooth out artifacts caused by block-wise processing and quantization. The final picture representation (i.e., a duplicate of the decoder output) is stored in a decoded picture buffer to be used for the prediction of subsequent pictures. In general, the order in which pictures are encoded or decoded often differs from the order in which they arrive from the source; it is therefore necessary to distinguish between the decoding order (i.e., bitstream order) and the output order (i.e., display order) of the decoder.

Video material to be encoded by HEVC is generally expected to be input as progressive-scan imagery (either because the source video originates in that format or because it results from de-interlacing prior to encoding). No explicit coding features are present in the HEVC design to support the use of interlaced scanning, as interlaced scanning is no longer used for displays and is becoming substantially less common for distribution. However, metadata syntax is provided in HEVC to allow an encoder to indicate that interlaced video has been sent, either by coding each field of the interlaced video (i.e., the even or odd lines of each video frame) as a separate picture, or by coding each interlaced frame as an HEVC coded picture. This provides an efficient method of coding interlaced video without the need to support special decoding processes for it in decoders.

1.1 Example of partitioning tree structure in H.264/AVC

The core of the coding layer in previous standards was the macroblock, containing a 16x16 block of luma samples and, in the usual case of 4:2:0 color sampling, two corresponding 8x8 blocks of chroma samples.

Intra-coded blocks use spatial prediction to exploit spatial correlation among pixels. Two partition sizes are defined: 16x16 and 4x4.

Inter-coded blocks use temporal prediction, instead of spatial prediction, by estimating motion between pictures. Motion can be estimated independently for a 16x16 macroblock or for any of its sub-macroblock partitions: 16x8, 8x16, 8x8, 8x4, 4x8, or 4x4, as shown in Fig. 2. Only one motion vector (MV) is allowed per sub-macroblock partition.

1.2 Example of partition tree structure in HEVC

In HEVC, various local characteristics are accommodated by dividing a Coding Tree Unit (CTU) into Coding Units (CUs) using a quadtree structure (denoted as coding tree). It is decided at the CU level whether to encode a picture region using inter-picture (temporal) prediction or intra-picture (spatial) prediction. Each CU may be further divided into one, two, or four PUs according to the division type of the Prediction Unit (PU). In one PU, the same prediction process is applied and the relevant information is transmitted to the decoder on a PU basis. After a residual block is obtained by applying a prediction process based on the PU partition type, the CU may be partitioned into Transform Units (TUs) according to another quadtree structure similar to a coding tree of the CU. An important feature of the HEVC structure is that it has multiple partitioning concepts, including CU, PU, and TU.

Some features involved in hybrid video coding using HEVC include:

(1) Coding Tree Unit (CTU) and Coding Tree Block (CTB) structure: the analogous structure in HEVC is the coding tree unit (CTU), which has a size selected by the encoder and can be larger than a traditional macroblock. A CTU consists of a luma CTB, the corresponding chroma CTBs, and syntax elements. The size L x L of a luma CTB can be chosen as L = 16, 32, or 64 samples, with the larger sizes typically enabling better compression. HEVC then supports the partitioning of the CTBs into smaller blocks using a tree structure and quadtree-like signaling.

(2) Coding Unit (CU) and Coding Block (CB): the quadtree syntax of the CTU specifies the size and positions of its luma and chroma CBs. The root of the quadtree is associated with the CTU. Hence, the size of the luma CTB is the largest supported luma CB size. The splitting of a CTU into luma and chroma CBs is signaled jointly. One luma CB and ordinarily two chroma CBs, together with the associated syntax, form a coding unit (CU). A CTB may contain only one CU or may be split to form multiple CUs, and each CU has an associated partitioning into prediction units (PUs) and a tree of transform units (TUs).

(3) Prediction Unit (PU) and Prediction Block (PB): the decision of whether to code a picture area using inter-picture or intra-picture prediction is made at the CU level. A PU partitioning structure has its root at the CU level. Depending on the basic prediction-type decision, the luma and chroma CBs can then be further split in size and predicted from luma and chroma prediction blocks (PBs). HEVC supports variable PB sizes from 64 x 64 down to 4 x 4 samples. Fig. 3 shows examples of the allowed PBs for an M x M CU.

(4) Transform Unit (TU) and Transform Block (TB): the prediction residual is coded using block transforms. A TU tree structure has its root at the CU level. The luma CB residual may be identical to the luma TB or may be further split into smaller luma TBs; the same applies to the chroma TBs. Integer basis functions similar to those of the Discrete Cosine Transform (DCT) are defined for the square TB sizes 4x4, 8x8, 16x16, and 32x32. For the 4x4 transform of luma intra-picture prediction residuals, an integer transform derived from a form of the Discrete Sine Transform (DST) may alternatively be specified.

1.2.1 Examples of tree structures divided into TBs and TUs

For residual coding, a CB can be recursively partitioned into transform blocks (TBs). The partitioning is signaled by a residual quadtree. Only square CB and TB partitioning is specified, where a block can be recursively split into quadrants as illustrated in Fig. 4A and 4B. For a given luma CB of size M x M, a flag signals whether it is split into four blocks of size M/2 x M/2. If further splitting is possible, as signaled by the maximum depth of the residual quadtree indicated in the Sequence Parameter Set (SPS), each quadrant is assigned a flag that indicates whether it is split into four quadrants. The leaf node blocks resulting from the residual quadtree are the transform blocks that are further processed by transform coding. The encoder indicates the maximum and minimum luma TB sizes that it will use. Splitting is implicit when the CB size is larger than the maximum TB size. Not splitting is implicit when splitting would result in a luma TB size smaller than the indicated minimum. The chroma TB size is half the luma TB size in each dimension, except when the luma TB size is 4x4, in which case a single 4x4 chroma TB is used for the region covered by four 4x4 luma TBs. In the case of intra-picture-predicted CUs, the decoded samples of the nearest-neighboring TBs (within or outside the CB) are used as reference data for intra-picture prediction.
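As a non-normative sketch of the implicit splitting rule described above (parameter names are hypothetical), the following Python function decides whether a square luma block of the residual quadtree is split into four quadrants.

def tb_split_decision(tb_size: int, max_tb_size: int, min_tb_size: int,
                      signaled_split_flag: bool) -> bool:
    """Return True if the block is split into four equal quadrants.
    Splitting is implied when the block exceeds the maximum TB size, implicitly
    disallowed when a split would produce a TB smaller than the minimum TB size,
    and otherwise follows the signaled residual-quadtree split flag."""
    if tb_size > max_tb_size:
        return True                       # implied split
    if tb_size // 2 < min_tb_size:
        return False                      # split would fall below the minimum TB size
    return signaled_split_flag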

Unlike previous standards, HEVC design allows TBs to span multiple PBs for inter-picture predicted CUs to maximize the potential coding efficiency of TB partitioning that benefits from a quadtree structure.

1.2.2 parent node and child node

The CTB is divided according to a quadtree structure, and nodes of the CTB are coding units. The plurality of nodes in the quadtree structure includes leaf nodes and non-leaf nodes. Leaf nodes have no children in the tree structure (i.e., leaf nodes are not further divided). The non-leaf nodes include the root nodes of the tree structure. The root node corresponds to an initial video block (e.g., CTB) of the video data. For each respective non-root node of the plurality of nodes, the respective non-root node corresponds to a video block that is a sub-block of a video block corresponding to a parent node in a tree structure of the respective non-root node. Each respective non-leaf node of the plurality of non-leaf nodes has one or more child nodes in the tree structure.

1.3 Example of quadtree plus binary tree block structure with larger CTUs in JEM

In some embodiments, reference software called Joint Exploration Model (JEM) is used to explore future video coding techniques. In addition to binary tree structures, JEM also describes quadtree plus binary tree (QTBT) and Ternary Tree (TT) structures.

1.3.1 example of QTBT Block partitioning Structure

Compared to HEVC, the QTBT structure removes the concepts of multiple partition types, i.e., it removes the separation of the CU, PU, and TU concepts and supports more flexibility for CU partition shapes. In the QTBT block structure, a CU can have either a square or a rectangular shape. As shown in Fig. 5A, a coding tree unit (CTU) is first partitioned by a quadtree structure. The quadtree leaf nodes are further partitioned by a binary tree structure. There are two splitting types in the binary tree splitting: symmetric horizontal splitting and symmetric vertical splitting. The binary tree leaf nodes are called coding units (CUs), and that segmentation is used for prediction and transform processing without any further partitioning. This means that the CU, PU, and TU have the same block size in the QTBT coding block structure. In JEM, a CU sometimes consists of coding blocks (CBs) of different color components; e.g., one CU contains one luma CB and two chroma CBs in the case of P and B slices of the 4:2:0 chroma format. A CU sometimes consists of a CB of a single component; e.g., one CU contains only one luma CB or only two chroma CBs in the case of I slices.

The following parameters are defined for the QTBT partitioning scheme.

-CTU size: the root node size of the quadtree is the same as the concept in HEVC.

-MinQTSize: minimum allowed quadtree leaf node size

-MaxBTSize: maximum allowed binary tree root node size

-MaxBTDepth: maximum allowed binary tree depth

-MinBTSize: minimum allowed binary tree leaf node size

In one example of the QTBT partitioning structure, the CTU size is set to 128 × 128 luma samples with two corresponding 64 × 64 blocks of chroma samples, MinQTSize is set to 16 × 16, MaxBTSize is set to 64 × 64, MinBTSize (for both width and height) is set to 4, and MaxBTDepth is set to 4. The quadtree partitioning is applied to the CTU first to generate quadtree leaf nodes. The quadtree leaf nodes may have a size from 16 × 16 (i.e., MinQTSize) to 128 × 128 (i.e., the CTU size). If a leaf quadtree node is 128 × 128, it will not be further split by the binary tree because its size exceeds MaxBTSize (i.e., 64 × 64). Otherwise, the leaf quadtree node can be further partitioned by the binary tree. Therefore, the quadtree leaf node is also the root node of the binary tree, and its binary tree depth is 0. When the binary tree depth reaches MaxBTDepth (i.e., 4), no further splitting is considered. When the binary tree node has a width equal to MinBTSize (i.e., 4), no further horizontal splitting is considered. Similarly, when the binary tree node has a height equal to MinBTSize, no further vertical splitting is considered. The leaf nodes of the binary tree are further processed by the prediction and transform processes without any further partitioning. In JEM, the maximum CTU size is 256 × 256 luma samples.
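To make the interplay of these parameters concrete, the Python sketch below (illustrative only, using the example values quoted above) lists which further binary tree splits are permitted for a node; it simply restates the rules of the preceding paragraph and is not a normative partitioning algorithm.

# Example QTBT parameters quoted in the text above (in luma samples).
CTU_SIZE = 128
MIN_QT_SIZE = 16
MAX_BT_SIZE = 64
MAX_BT_DEPTH = 4
MIN_BT_SIZE = 4

def may_split_qt(size: int) -> bool:
    """A quadtree node may be split further only while its quadrants stay >= MinQTSize."""
    return size // 2 >= MIN_QT_SIZE

def allowed_bt_splits(width: int, height: int, bt_depth: int) -> list:
    """Binary tree splits allowed for a quadtree leaf or binary tree node under the rules above."""
    splits = []
    if max(width, height) > MAX_BT_SIZE or bt_depth >= MAX_BT_DEPTH:
        return splits                      # node exceeds MaxBTSize, or MaxBTDepth reached
    if width > MIN_BT_SIZE:
        splits.append("horizontal")        # width equal to MinBTSize stops horizontal splitting (per the text)
    if height > MIN_BT_SIZE:
        splits.append("vertical")          # height equal to MinBTSize stops vertical splitting (per the text)
    return splits

# Example: a 128x128 quadtree leaf gets no binary tree split (it exceeds MaxBTSize);
# a 64x64 leaf at depth 0 may be split horizontally or vertically.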

Fig. 5A shows an example of block partitioning by using QTBT and fig. 5B shows the corresponding tree representation. The solid lines represent quad-tree partitions and the dashed lines represent binary tree partitions. In each partition (i.e., non-leaf) node of the binary tree, a flag is signaled to indicate which partition type (i.e., horizontal or vertical) to use, where 0 represents horizontal partition and 1 represents vertical partition. For the quad-tree partition, there is no need to indicate the partition type, because the quad-tree partition always divides one block horizontally and vertically to generate 4 subblocks of the same size.

Furthermore, the QTBT scheme supports the ability for luma and chroma to have separate QTBT structures. Currently, for P and B slices, the luma and chroma CTBs in one CTU share the same QTBT structure. However, for I slices, the luma CTB is partitioned into CUs by one QTBT structure, and the chroma CTBs are partitioned into chroma CUs by another QTBT structure. This means that a CU in an I slice consists of a coding block of the luma component or coding blocks of the two chroma components, and a CU in a P or B slice consists of coding blocks of all three color components.

In HEVC, to reduce memory access for motion compensation, inter prediction of small blocks is restricted such that 4 × 8 and 8 × 4 blocks do not support bi-prediction and 4 × 4 blocks do not support inter prediction. In the QTBT of JEM, these restrictions are removed.

1.4 Ternary Tree (TT) for Versatile Video Coding (VVC)

Fig. 6A shows an example of a Quadtree (QT) partition, and fig. 6B and 6C show examples of vertical and horizontal Binary Tree (BT) partitions, respectively. In some embodiments, in addition to quadtrees and binary trees, Ternary Tree (TT) partitioning is supported, such as horizontal and vertical center-side ternary trees (as shown in fig. 6D and 6E).

In some implementations, two levels of trees are supported: region trees (quadtrees) and prediction trees (binary or ternary). The CTUs are first partitioned with a Region Tree (RT). The RT leaves may be further partitioned with a Prediction Tree (PT). PT leaves may also be further partitioned with PT until a maximum PT depth is reached. The PT leaf is the basic coding unit. For convenience, it is still referred to as CU. The CU cannot be further divided. Both prediction and transformation are applied to the CU in the same way as JEM. The entire partition structure is called a "multi-type tree".

1.5 examples of partition structures in alternative video coding techniques

In some embodiments, a tree structure called multi-tree (MTT) is supported, which is a generalization of QTBT. In QTBT, as shown in fig. 7, a Coding Tree Unit (CTU) is first divided in a quadtree structure. The leaf nodes of the quadtree are then further partitioned using a binary tree structure.

The structure of MTT is constituted by two types of tree nodes: region tree (RT) nodes and prediction tree (PT) nodes, supporting nine types of partitions, as shown in Fig. 8A-8I. A region tree can recursively split a CTU into square blocks down to region tree leaf nodes of size 4x4. At each node in a region tree, a prediction tree can be formed from one of three tree types: binary tree, ternary tree, and asymmetric binary tree. In a PT split, quadtree partitioning is prohibited in the branches of the prediction tree. As in JEM, the luma tree and the chroma tree are separated in I slices.

2 Example of inter prediction in HEVC/H.265

Video coding standards have improved significantly over the years and now provide, in part, high coding efficiency and support for higher resolution. Recent standards such as HEVC and h.265 are based on hybrid video coding structures, where temporal prediction plus transform coding is utilized.

2.1 examples of prediction modes

Each inter-predicted PU (prediction unit) has motion parameters for one or two reference picture lists. In some embodiments, the motion parameters include a motion vector and a reference picture index. In other embodiments, the usage of one of the two reference picture lists may also be signaled using inter_pred_idc. In yet other embodiments, motion vectors may be explicitly coded as deltas relative to predictors.

When a CU is coded with skip mode, one PU is associated with the CU, and there are no significant residual coefficients, no coded motion vector delta, and no reference picture index. A Merge mode is specified whereby the motion parameters of the current PU are obtained from neighboring PUs, including spatial and temporal candidates. The Merge mode can be applied to any inter-predicted PU, not only to skip mode. The alternative to the Merge mode is the explicit transmission of motion parameters, where the motion vector, the corresponding reference picture index for each reference picture list, and the reference picture list usage are signaled explicitly for each PU.

When the signaling indicates that one of the two reference picture lists is to be used, the PU is produced from one block of samples. This is referred to as "uni-directional prediction". Uni-directional prediction is available for both P slices and B slices.

When the signaling indicates that two reference picture lists are to be used, the PU is generated from two sample blocks. This is called "bi-prediction". Bi-directional prediction is available only for B slices.

2.1.1 example of constructing candidates for Merge mode

When a PU is predicted using the Merge mode, an index pointing to an entry in the Merge candidate list is parsed from the bitstream and used to retrieve the motion information. The construction of this list can be summarized according to the following sequence of steps:

step 1: initial candidate derivation

Step 1.1: spatial candidate derivation

Step 1.2: redundancy check of spatial candidates

Step 1.3: time domain candidate derivation

Step 2: additional candidate insertions

Step 2.1: creation of bi-directional prediction candidates

Step 2.2: insertion of zero motion candidates

Fig. 9 shows an example of building a Merge candidate list based on the sequence of steps summarized above. For spatial Merge candidate derivation, a maximum of four Merge candidates are selected among candidates located at five different positions. For temporal Merge candidate derivation, a maximum of one Merge candidate is selected between two candidates. Since a constant number of candidates per PU is assumed at the decoder, additional candidates are generated when the number of candidates does not reach the maximum number of Merge candidates (MaxNumMergeCand) signaled in the slice header. Since the number of candidates is constant, the index of the best Merge candidate is encoded using truncated unary binarization (TU). If the size of a CU is equal to 8, all PUs of the current CU share a single Merge candidate list, which is identical to the Merge candidate list of the 2N × 2N prediction unit.

2.1.2 constructing spatial Merge candidates

In the derivation of the spatial Merge candidates, a maximum of four Merge candidates are selected among candidates located at the positions shown in Fig. 10. The order of derivation is A1, B1, B0, A0, and B2. Position B2 is considered only when any PU of positions A1, B1, B0, A0 is not available (e.g., because it belongs to another slice or tile) or is intra-coded. After the candidate at position A1 is added, the addition of the remaining candidates is subject to a redundancy check, which ensures that candidates with the same motion information are excluded from the list, so that coding efficiency is improved. To reduce computational complexity, not all possible candidate pairs are considered in the mentioned redundancy check. Instead, only the pairs linked with arrows in Fig. 11 are considered, and a candidate is only added to the list if the corresponding candidate used for the redundancy check does not have the same motion information. Another source of duplicate motion information is the "second PU" associated with partitions different from 2N × 2N. As an example, Fig. 12A and 12B depict the second PU for the N × 2N and 2N × N cases, respectively. When the current PU is partitioned as N × 2N, the candidate at position A1 is not considered for list construction; adding this candidate may lead to two prediction units having the same motion information, which is redundant to having just one PU in the coding unit. Similarly, position B1 is not considered when the current PU is partitioned as 2N × N.
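A simplified, non-normative Python sketch of this derivation is given below; it assumes that the availability and motion information of the five positions have already been determined, and that the arrow-linked redundancy-check pairs are those used in HEVC. The length-based B2 condition is a simplification of the availability condition stated above.

# Candidate positions in derivation order and the limited redundancy-check pairs.
DERIVATION_ORDER = ["A1", "B1", "B0", "A0", "B2"]
REDUNDANCY_PAIRS = {
    "B1": ["A1"],
    "B0": ["B1"],
    "A0": ["A1"],
    "B2": ["A1", "B1"],
}

def derive_spatial_merge_candidates(available_motion: dict) -> list:
    """available_motion maps a position name to its motion information, or to None
    (or has no entry) when the PU is unavailable or intra-coded."""
    merge_list = []
    for pos in DERIVATION_ORDER:
        if pos == "B2" and len(merge_list) == 4:
            continue            # B2 only considered when the first four positions did not all contribute
        mi = available_motion.get(pos)
        if mi is None:          # unavailable or intra-coded PU
            continue
        if any(mi == available_motion.get(other) for other in REDUNDANCY_PAIRS.get(pos, [])):
            continue            # pruned: same motion information as an arrow-linked candidate
        merge_list.append(mi)
    return merge_list           # at most four spatial Merge candidates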

2.1.3 construction of time-domain Merge candidates

In this step, only one candidate is added to the list. In particular, in the derivation of this temporal Merge candidate, a scaled motion vector is derived based on the collocated PU belonging to the picture that has the smallest POC difference with the current picture within the given reference picture list. The reference picture list to be used for the derivation of the collocated PU is explicitly signaled in the slice header.

Fig. 13 shows an example of the derivation of the scaled motion vector for a temporal Merge candidate (as the dashed line), which is scaled from the motion vector of the collocated PU using the POC distances tb and td, where tb is defined as the POC difference between the reference picture of the current picture and the current picture, and td is defined as the POC difference between the reference picture of the collocated picture and the collocated picture. The reference picture index of the temporal Merge candidate is set equal to zero. For a B slice, two motion vectors are obtained (one for reference picture list 0 and the other for reference picture list 1) and combined to form the bi-predictive Merge candidate.
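A simplified floating-point sketch of this POC-distance scaling is shown below; the fixed-point arithmetic, rounding, and clipping of the actual HEVC derivation are omitted.

def scale_temporal_mv(mv_col: tuple, tb: int, td: int) -> tuple:
    """Scale the collocated PU's motion vector by the ratio of POC distances.

    tb: POC difference between the reference picture of the current picture and the current picture.
    td: POC difference between the reference picture of the collocated picture and the collocated picture.
    """
    if td == 0:
        return mv_col
    scale = tb / td
    return (round(mv_col[0] * scale), round(mv_col[1] * scale))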

Within the collocated PU (Y) belonging to the reference frame, the position for the temporal candidate is selected between candidates C0 and C1, as shown in Fig. 14. If the PU at position C0 is not available, is intra-coded, or is outside the current CTU, position C1 is used. Otherwise, position C0 is used in the derivation of the temporal Merge candidate.

2.1.4 additional types of Merge candidates

Besides the spatio-temporal Merge candidates, there are two additional types of Merge candidates: combined bi-predictive Merge candidates and zero Merge candidates. Combined bi-predictive Merge candidates are generated by utilizing the spatio-temporal Merge candidates, and they are used for B slices only. A combined bi-predictive candidate is generated by combining the first-reference-picture-list motion parameters of an initial candidate with the second-reference-picture-list motion parameters of another candidate. If these two tuples provide different motion hypotheses, they form a new bi-predictive candidate.

Fig. 15 shows an example of this process, where two candidates in the initial list with mvL0 and refIdxL0 or mvL1 and refIdxL1 (1510 on the left) are used to create a combined bi-predictive Merge candidate (1520 on the right) that is added to the final list.

Zero motion candidates are inserted to fill the remaining entries in the Merge candidate list and thereby reach the MaxNumMergeCand capacity. These candidates have zero spatial displacement and a reference picture index that starts at zero and increases every time a new zero motion candidate is added to the list. These candidates use one reference frame for uni-directional prediction and two reference frames for bi-directional prediction. In some embodiments, no redundancy check is performed on these candidates.
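The Python sketch below illustrates, in simplified form, how the candidate list could be completed with combined bi-predictive candidates followed by zero-motion candidates; the candidate representation, the pair ordering, and the "different motion hypotheses" test are hypothetical simplifications of the predefined combination order.

def fill_additional_merge_candidates(merge_list, max_num_merge_cand,
                                     num_ref_l0, num_ref_l1, is_b_slice):
    """Append combined bi-predictive candidates (B slices only), then zero-motion
    candidates, until MaxNumMergeCand is reached. A candidate is a dict with
    optional 'l0'/'l1' entries of the form ((mv_x, mv_y), ref_idx)."""
    if is_b_slice:
        for i, src0 in enumerate(list(merge_list)):
            for j, src1 in enumerate(list(merge_list)):
                if len(merge_list) >= max_num_merge_cand:
                    break
                if i == j or "l0" not in src0 or "l1" not in src1:
                    continue
                if src0["l0"] != src1["l1"]:          # simplified "different motion hypotheses" test
                    merge_list.append({"l0": src0["l0"], "l1": src1["l1"]})
    ref_idx = 0
    while len(merge_list) < max_num_merge_cand:        # zero-motion candidates
        cand = {"l0": ((0, 0), min(ref_idx, max(num_ref_l0 - 1, 0)))}
        if is_b_slice and num_ref_l1 > 0:
            cand["l1"] = ((0, 0), min(ref_idx, max(num_ref_l1 - 1, 0)))
        merge_list.append(cand)
        ref_idx += 1                                   # reference index grows with each new zero candidate
    return merge_list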

2.1.5 examples of motion estimation regions processed in parallel

To speed up the encoding process, motion estimation may be performed in parallel, thereby deriving motion vectors for all prediction units within a given region simultaneously. Deriving the Merge candidates from the spatial neighborhood may interfere with parallel processing, because one prediction unit cannot derive motion parameters from neighboring PUs before their associated motion estimation is completed. To mitigate the trade-off between coding efficiency and processing delay, a Motion Estimation Region (MER) may be defined. The size of the MER may be signaled in a Picture Parameter Set (PPS) using the syntax element "log2_parallel_merge_level_minus2". When a MER is defined, Merge candidates falling into the same region are marked as unavailable and are therefore not considered in the list construction.

The Picture Parameter Set (PPS) raw byte sequence payload (RBSP) syntax is shown in Table 1, where log2_parallel_merge_level_minus2 plus 2 specifies the value of the variable Log2ParMrgLevel, which is used in the derivation process for luma motion vectors in Merge mode, as well as in the derivation process for spatial Merge candidates as specified in existing video coding standards. The value of log2_parallel_merge_level_minus2 shall be in the range of 0 to CtbLog2SizeY - 2, inclusive.

The variable Log2ParMrgLevel is derived as follows:

Log2ParMrgLevel = log2_parallel_merge_level_minus2 + 2

Note that the value of Log2ParMrgLevel indicates the built-in capability for parallel derivation of the Merge candidate lists. For example, when Log2ParMrgLevel is equal to 6, the Merge candidate lists for all prediction units (PUs) and coding units (CUs) contained in a 64 × 64 block can be derived in parallel.

Table 1 general picture parameter set RBSP syntax

[The PPS RBSP syntax table, including the log2_parallel_merge_level_minus2 syntax element, is reproduced as an image in the original publication.]
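The following short, non-normative Python sketch illustrates the two relations stated above: the derivation of Log2ParMrgLevel from the syntax element, and how a spatial Merge candidate lying in the same Motion Estimation Region (MER) as the current PU would be treated as unavailable. The coordinate-based MER test is an illustrative simplification.

def log2_par_mrg_level(log2_parallel_merge_level_minus2: int) -> int:
    """Log2ParMrgLevel = log2_parallel_merge_level_minus2 + 2."""
    return log2_parallel_merge_level_minus2 + 2

def in_same_mer(cur_xy, nb_xy, log2_par_mrg: int) -> bool:
    """A neighboring PU whose top-left sample falls in the same MER as the current
    PU is marked unavailable for Merge list construction."""
    return (cur_xy[0] >> log2_par_mrg) == (nb_xy[0] >> log2_par_mrg) and \
           (cur_xy[1] >> log2_par_mrg) == (nb_xy[1] >> log2_par_mrg)

# Example: log2_parallel_merge_level_minus2 = 4 gives Log2ParMrgLevel = 6, i.e.,
# the Merge lists of all PUs/CUs inside a 64 x 64 block can be derived in parallel.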

2.2 motion vector prediction embodiment

Motion vector prediction exploits the spatio-temporal correlation of a motion vector with neighboring PUs, and it is used for the explicit transmission of motion parameters. A motion vector candidate list is constructed by first checking the availability of the left and above spatially neighboring PU positions and of the temporally neighboring PU positions, removing redundant candidates, and adding a zero vector to make the candidate list a constant length. The encoder can then select the best predictor from the candidate list and transmit the corresponding index indicating the chosen candidate. Similarly to Merge index signaling, the index of the best motion vector candidate is encoded using a truncated unary code.

2.2.1 example of constructing motion vector prediction candidates

Fig. 16 summarizes the derivation process of motion vector prediction candidates and may be implemented with refidx as an input for each reference picture list.

In motion vector prediction, two types of motion vector candidates are considered: spatial motion vector candidates and temporal motion vector candidates. For the derivation of spatial motion vector candidates, two motion vector candidates are finally derived based on the motion vectors of each PU located at the five different positions previously shown in fig. 10.

For the derivation of temporal motion vector candidates, one motion vector candidate is selected from two candidates, derived based on two different collocated positions. After the first list of spatio-temporal candidates is made, the duplicate motion vector candidates in the list are removed. If the number of potential candidates is greater than two, the motion vector candidate with a reference picture index greater than 1 in the associated reference picture list is removed from the list. If the number of spatial-temporal motion vector candidates is less than two, additional zero motion vector candidates are added to the list.

2.2.2 construction of spatial motion vector candidates

In the derivation of spatial motion vector candidates, a maximum of two candidates are considered among five potential candidates, which are taken from PUs located at the positions previously shown in Fig. 10; those positions are the same as those of the motion Merge. The order of derivation for the left side of the current PU is defined as A0, A1, scaled A0, scaled A1. The order of derivation for the above side of the current PU is defined as B0, B1, B2, scaled B0, scaled B1, scaled B2. For each side there are therefore four cases that can be used as motion vector candidates, two cases that do not require spatial scaling and two cases where spatial scaling is used. The four different cases are summarized as follows:

-no spatial scaling

(1) Same reference picture list, and same reference picture index (same POC)

(2) Different reference picture lists, but the same reference picture index (same POC)

-spatial scaling

(3) Same reference picture list, but different reference picture indices (different POCs)

(4) Different reference picture lists, and different reference picture indices (different POCs)

The no-spatial-scaling cases are checked first, followed by the cases that allow spatial scaling. Spatial scaling is considered when the POC differs between the reference picture of the neighboring PU and that of the current PU, regardless of the reference picture list. If all PUs of the left candidates are not available or are intra-coded, scaling for the above motion vector is allowed to facilitate the parallel derivation of the left and above MV candidates. Otherwise, spatial scaling is not allowed for the above motion vector.

As shown in the example in Fig. 17, for the spatial scaling case, the motion vector of the neighboring PU is scaled in a similar manner as for temporal scaling. One difference is that the reference picture list and index of the current PU are given as input; the actual scaling process is the same as that of temporal scaling.
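A simplified floating-point Python sketch of this spatial AMVP candidate handling is shown below: when the neighboring PU references a picture with the same POC as the current PU's reference (cases 1 and 2), the MV is reused directly; otherwise (cases 3 and 4) it is scaled like the temporal case. Fixed-point rounding and clipping are omitted.

def spatial_amvp_candidate(mv_nb, poc_cur, poc_cur_ref, poc_nb_ref):
    """mv_nb is the neighboring PU's MV; since the neighbor lies in the current
    picture, both POC distances are measured from poc_cur."""
    tb = poc_cur - poc_cur_ref       # POC distance of the current PU's reference
    td = poc_cur - poc_nb_ref        # POC distance of the neighboring PU's reference
    if td == tb or td == 0:
        return mv_nb                 # same reference POC: no spatial scaling (cases 1-2)
    scale = tb / td                  # different reference POC: scale the MV (cases 3-4)
    return (round(mv_nb[0] * scale), round(mv_nb[1] * scale))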

2.2.3 construction of temporal motion vector candidates

Apart from the derivation of the reference picture index, all processes for the derivation of the temporal Merge candidate are the same as those for the derivation of the spatial motion vector candidates (as shown in the example in Fig. 14). In some embodiments, the reference picture index is signaled to the decoder.

2.2.4 Signaling of AMVP information

For AMVP mode, four parts can be signaled in the bitstream, including prediction direction, reference index, MVD, and MV prediction candidate index, which are described in the context of the syntax shown in tables 2 and 3.

Table 2: motion vector difference syntax

[The motion vector difference syntax table is reproduced as an image in the original publication.]

Table 3: motion vector difference semantics

3 Example of inter-frame prediction methods in the Joint Exploration Model (JEM)

In some embodiments, reference software known as the Joint Exploration Model (JEM) is used to explore future video coding techniques. In JEM, sub-block-based prediction is adopted in several coding tools, such as affine prediction, alternative temporal motion vector prediction (ATMVP), spatial-temporal motion vector prediction (STMVP), bi-directional optical flow (BIO), frame rate up-conversion (FRUC), locally adaptive motion vector resolution (LAMVR), overlapped block motion compensation (OBMC), local illumination compensation (LIC), and decoder-side motion vector refinement (DMVR).

3.1 example of sub-CU-based motion vector prediction

In a JEM with a quadtree plus binary tree (QTBT), each CU can have at most one set of motion parameters for each prediction direction. In some embodiments, two sub-CU level motion vector prediction methods are considered in the encoder by partitioning a large CU into sub-CUs and deriving motion information for all sub-CUs of the large CU. An Alternative Temporal Motion Vector Prediction (ATMVP) method allows each CU to obtain multiple sets of motion information from a plurality of blocks smaller than the current CU in the collocated reference picture. In the spatial-temporal motion vector prediction (STMVP) method, a motion vector of a sub-CU is recursively derived by using a temporal motion vector predictor and a spatial neighboring motion vector.

In some embodiments, motion compression of the reference frame is currently disabled in order to maintain a more accurate motion field for sub-CU motion prediction.

3.1.1 Example of alternative temporal motion vector prediction (ATMVP)

In the ATMVP method, the Temporal Motion Vector Prediction (TMVP) method is modified by fetching multiple sets of motion information (including motion vectors and reference indices) from blocks smaller than the current CU.

FIG. 18 shows an example of the ATMVP motion prediction process for CU 1800. The ATMVP method predicts the motion vector of sub-CU 1801 within CU 1800 in two steps. The first step is to identify a corresponding block 1851 in the reference picture 1850 using the time domain vector. The reference picture 1850 is also referred to as a motion source picture. The second step is to divide the current CU 1800 into sub-CUs 1801 and obtain the motion vector and reference index of each sub-CU from the block corresponding to each sub-CU.

In the first step, the reference picture 1850 and the corresponding block are determined by the motion information of the spatially neighboring blocks of the current CU 1800. To avoid a repeated scanning process of the neighboring blocks, the first Merge candidate in the Merge candidate list of the current CU 1800 is used. The first available motion vector and its associated reference index are set to be the temporal vector and the index of the motion source picture. In this way, the corresponding block can be identified more accurately than with TMVP, where the corresponding block (sometimes called the collocated block) is always in a bottom-right or center position relative to the current CU.

In one example, if the first Merge candidate is from the left neighboring block (i.e., A1 in Fig. 19), the associated MV and reference picture are used to identify the source block and the source picture.

In the second step, a corresponding block of the sub-CU 1851 is identified by the temporal vector in the motion source picture 1850, by adding the temporal vector to the coordinates of the current CU. For each sub-CU, the motion information of its corresponding block (the smallest motion grid covering the center sample) is used to derive the motion information of the sub-CU. After the motion information of a corresponding N × N block is identified, it is converted to the motion vectors and reference indices of the current sub-CU in the same way as in the TMVP method of HEVC, in which motion scaling and other processes apply. For example, the decoder checks whether the low-delay condition is fulfilled (i.e., the POCs of all reference pictures of the current picture are smaller than the POC of the current picture) and possibly uses the motion vector MVx (the motion vector corresponding to reference picture list X) to predict the motion vector MVy for each sub-CU (with X being equal to 0 or 1, and Y being equal to 1 - X).
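The two ATMVP steps can be sketched as follows in Python; the motion_source_picture interface (a motion_at(x, y) accessor returning the stored motion of the smallest motion grid covering a sample) is a hypothetical abstraction, and the per-sub-CU motion scaling is only indicated by a comment.

def atmvp_sub_cu_motion(cu_pos, cu_size, sub_size, temporal_vector, motion_source_picture):
    """Step 1 has already produced temporal_vector and motion_source_picture from the
    first Merge candidate; step 2 copies the motion of the corresponding block for
    every sub-CU of the current CU."""
    x0, y0 = cu_pos
    w, h = cu_size
    tvx, tvy = temporal_vector
    sub_cu_motion = {}
    for y in range(0, h, sub_size):
        for x in range(0, w, sub_size):
            # center sample of the corresponding block in the motion source picture
            cx = x0 + x + sub_size // 2 + tvx
            cy = y0 + y + sub_size // 2 + tvy
            mv, ref_idx = motion_source_picture.motion_at(cx, cy)
            # TMVP-style motion scaling to the current sub-CU's reference would be applied here
            sub_cu_motion[(x, y)] = (mv, ref_idx)
    return sub_cu_motion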

3.1.2 example of spatial motion vector prediction (STMVP)

In the STMVP method, the motion vectors of sub-CUs are recursively derived in raster scan order.

Fig. 20 shows an example of a CU with four sub-blocks and neighboring blocks. Consider an 8 × 8 CU2000 that contains four 4 × 4 sub-CUs a (2001), B (2002), C (2003), and D (2004). The neighboring 4 × 4 blocks in the current frame are labeled a (2011), b (2012), c (2013), and d (2014).

The motion derivation of sub-CU A starts by identifying its two spatial neighbors. The first neighbor is the N × N block above sub-CU A (2001) (block c 2013). If this block c (2013) is not available or is intra-coded, the other N × N blocks above sub-CU A (2001) are checked (from left to right, starting at block c 2013). The second neighbor is the block to the left of sub-CU A (2001) (block b 2012). If block b (2012) is not available or is intra-coded, the other blocks to the left of sub-CU A (2001) are checked (from top to bottom, starting at block b 2012). The motion information obtained from the neighboring blocks for each list is scaled to the first reference frame of the given list. Next, the temporal motion vector predictor (TMVP) of sub-block A (2001) is derived following the same procedure as the TMVP derivation specified in HEVC: the motion information of the collocated block at position D (2004) is fetched and scaled accordingly. Finally, after retrieving and scaling the motion information, all available motion vectors are averaged separately for each reference list. The averaged motion vector is assigned as the motion vector of the current sub-CU.
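The final averaging step of STMVP can be sketched as below (illustrative only); the inputs are the already scaled above-neighbor, left-neighbor, and TMVP motion vectors of the sub-CU, any of which may be unavailable.

def stmvp_sub_cu_mv(above_mv, left_mv, tmvp_mv):
    """Average all available motion vectors (each an (mv_x, mv_y) tuple or None)
    and assign the result to the current sub-CU; done separately per reference list."""
    available = [mv for mv in (above_mv, left_mv, tmvp_mv) if mv is not None]
    if not available:
        return None
    avg_x = sum(mv[0] for mv in available) / len(available)
    avg_y = sum(mv[1] for mv in available) / len(available)
    return (round(avg_x), round(avg_y))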

3.1.3 example of sub-CU motion prediction mode signaling

In some embodiments, the sub-CU modes are enabled as additional Merge candidates and no additional syntax element is needed to signal these modes. Two additional Merge candidates are added to the Merge candidate list of each CU to represent the ATMVP mode and the STMVP mode. In other embodiments, up to seven Merge candidates may be used if the sequence parameter set indicates that ATMVP and STMVP are enabled. The encoding logic of the additional Merge candidates is the same as that of the Merge candidates in the HM, which means that two more RD checks may need to be performed for the two additional Merge candidates for each CU in a P or B slice. In some embodiments, e.g., in JEM, all bins of the Merge index are context-coded by CABAC (context-based adaptive binary arithmetic coding). In other embodiments, e.g., in HEVC, only the first bin is context-coded and the remaining bins are bypass-coded.

3.2 example of adaptive motion vector difference resolution

In some embodiments, when use_integer_mv_flag is equal to 0 in the slice header, a motion vector difference (MVD) (between the motion vector of a PU and its predicted motion vector) is signaled in units of quarter luma samples. In JEM, a locally adaptive motion vector resolution (LAMVR) is introduced. In JEM, an MVD can be coded in units of quarter luma samples, integer luma samples, or four luma samples. The MVD resolution is controlled at the coding unit (CU) level, and an MVD resolution flag is conditionally signaled for each CU that has at least one non-zero MVD component.

For a CU that has at least one non-zero MVD component, a first flag is signaled to indicate whether quarter-luma-sample MV precision is used in the CU. When the first flag (being equal to 1) indicates that quarter-luma-sample MV precision is not used, another flag is signaled to indicate whether integer-luma-sample MV precision or four-luma-sample MV precision is used.

A CU uses quarter-luma-sample MV resolution when its first MVD resolution flag is zero or is not coded for the CU (meaning that all MVDs in the CU are zero). When a CU uses integer-luma-sample MV precision or four-luma-sample MV precision, the MVPs in the AMVP candidate list for the CU are rounded to the corresponding precision.
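A minimal sketch of the MVP rounding implied by the selected MVD resolution is given below, assuming MVs stored in quarter-luma-sample units; a simple truncating rounding is used for brevity, whereas an actual codec would apply its normative rounding rule.

def round_mvp_to_resolution(mv, resolution):
    """Round an MV predictor to the CU's MVD resolution before it is used in AMVP.
    resolution is one of 'quarter', 'integer', 'four' (luma samples)."""
    step = {"quarter": 1, "integer": 4, "four": 16}[resolution]   # step in quarter-sample units
    return tuple((component // step) * step for component in mv)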

In the encoder, RD checking at the CU level is used to determine which MVD resolution is to be used for the CU. That is, RD checking at the CU level is performed three times for each MVD resolution. To speed up the encoder speed, the following encoding scheme is applied in JEM.

- During the RD check of a CU with the normal quarter-luma-sample MVD resolution, the motion information of the current CU (with integer-luma-sample accuracy) is stored. The stored motion information (after rounding) is used as the starting point for further small-range motion vector refinement during the RD check for the same CU with integer-luma-sample and 4-luma-sample MVD resolutions, so that the time-consuming motion estimation process is not duplicated three times.

- The RD check of a CU with the 4-luma-sample MVD resolution is invoked conditionally. For a CU, when the RD check cost of the integer-luma-sample MVD resolution is much larger than that of the quarter-luma-sample MVD resolution, the RD check of the 4-luma-sample MVD resolution for the CU is skipped.

3.3 example of Pattern Matching Motion Vector Derivation (PMMVD)

The PMMVD mode is a special Merge mode based on a Frame Rate Up Conversion (FRUC) method. In this mode, the motion information of the block is not signaled, but is derived at the decoder side.

For a CU, the FRUC flag may be signaled when its Merge flag is true. When the FRUC flag is false, the Merge index may be signaled and the normal Merge mode used. When the FRUC flag is true, another FRUC mode flag may be signaled to indicate which mode (e.g., bilateral matching or template matching) will be used to derive motion information for the block.

At the encoder side, the decision on whether to use the FRUC Merge mode for a CU is based on RD cost selection, as is done for normal Merge candidates. For example, the multiple matching modes (bilateral matching and template matching) are both checked for a CU by using RD cost selection. The matching mode leading to the minimal cost is further compared to other CU modes. If a FRUC matching mode is the most efficient one, the FRUC flag is set to true for the CU and the related matching mode is used.

Typically, the motion derivation process in the FRUC Merge mode has two steps: a CU-level motion search is performed first, followed by sub-CU-level motion refinement. At the CU level, an initial motion vector is derived for the whole CU based on bilateral matching or template matching. First, a list of MV candidates is generated, and the candidate that leads to the minimum matching cost is selected as the starting point for further CU-level refinement. Then a local search based on bilateral matching or template matching is performed around the starting point, and the MV that results in the minimum matching cost is taken as the MV for the whole CU. Subsequently, the motion information is further refined at the sub-CU level, with the derived CU motion vectors as the starting points.

For example, the following derivation process is performed for the motion information derivation of a W × H CU. In the first stage, the MV of the whole W × H CU is derived. In the second stage, the CU is further split into M × M sub-CUs. The value of M is calculated according to equation (3) below, where D is a predefined splitting depth that is set to 3 by default in JEM. The MV for each sub-CU is then derived.

M = max(4, min(W, H) / 2^D)    (3)
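As an illustrative Python helper, assuming equation (3) as reconstructed above:

def fruc_sub_cu_size(width, height, depth=3):
    """Sub-CU size M per equation (3); D (depth) is set to 3 by default in JEM."""
    return max(4, min(width, height) >> depth)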

Fig. 21 shows an example of bilateral matching used in the Frame Rate Up Conversion (FRUC) method. Bilateral matching is used to derive the motion information of the current CU (2100) by finding the closest match between two blocks along the motion trajectory of the current CU in two different reference pictures (2110, 2111). Under the assumption of a continuous motion trajectory, the motion vectors MV0 (2101) and MV1 (2102) pointing to the two reference blocks are proportional to the temporal distances between the current picture and the two reference pictures (e.g., TD0 (2103) and TD1 (2104)). In some embodiments, when the current picture 2100 is temporally located between the two reference pictures (2110, 2111) and the temporal distances from the current picture to the two reference pictures are the same, the bilateral matching becomes a mirror-based bi-directional MV derivation.

Fig. 22 shows an example of template matching used in a Frame Rate Up Conversion (FRUC) method. Template matching is used to derive motion information for the current CU 2200 by finding the closest match between the template in the current picture (e.g., the top and/or left neighboring blocks of the current CU) and the block in the reference picture 2210 (e.g., the same size as the template). Template matching may also be applied to AMVP mode, in addition to FRUC Merge mode described above. In both JEM and HEVC, AMVP has two candidates. New candidates can be derived using a template matching method. If the newly derived candidate from template matching is different from the first existing AMVP candidate, it is inserted at the very beginning of the AMVP candidate list and then the list size is set to 2 (e.g., by removing the second existing AMVP candidate). When applied to AMVP mode, only CU level search is applied.

The MV candidate set at the CU level may include the following: (1) the original AMVP candidates, if the current CU is in AMVP mode; (2) all Merge candidates; (3) several MVs in the interpolated MV field (described later); and (4) the top and left neighboring motion vectors.

When bilateral matching is used, each valid MV of a Merge candidate may be used as an input to generate an MV pair under the assumption of bilateral matching. For example, one valid MV of a Merge candidate is (MVa, refa) at reference list A. Then the reference picture refb of its paired bilateral MV is found in the other reference list B, so that refa and refb are temporally on different sides of the current picture. If such a refb is not available in reference list B, refb is determined as a reference that is different from refa and whose temporal distance to the current picture is the minimal one in list B. After refb is determined, MVb is derived by scaling MVa based on the temporal distances between the current picture and refa and refb.

In some implementations, four MVs from the interpolated MV field are also added to the CU level candidate list. More specifically, the interpolated MVs at positions (0,0), (W/2, 0), (0, H/2) and (W/2, H/2) of the current CU are added. When FRUC is applied in AMVP mode, the initial AMVP candidates are also added to the CU-level MV candidate set. In some implementations, at the CU level, 15 MVs of an AMVP CU and 13 MVs of a Merge CU may be added to the candidate list.

The MV candidate set at the sub-CU level includes: (1) the MV determined from the CU-level search; (2) the top, left, top-left, and top-right neighboring MVs; (3) scaled versions of collocated MVs from the reference pictures; (4) one or more ATMVP candidates; and (5) one or more STMVP candidates (e.g., up to 4). The scaled MVs from the reference pictures are derived as follows. The reference pictures in both lists are traversed, and the MVs at the collocated positions of the sub-CU in a reference picture are scaled to the reference of the starting CU-level MV. The ATMVP and STMVP candidates are limited to the first four. At the sub-CU level, one or more MVs (up to 17) are added to the candidate list.

Generation of interpolated MV fields. Before encoding a frame, an interpolated motion field of the entire picture is generated based on unilateral ME. This motion field can then be used subsequently as CU-level or sub-CU-level MV candidates.

In some embodiments, the motion field of each reference picture in both reference lists is traversed at the 4x4 block level. Fig. 23 shows an example of unilateral Motion Estimation (ME) 2300 in the FRUC method. For each 4x4 block, if the motion associated with the block passes through a 4x4 block in the current picture and that block has not been assigned any interpolated motion, the motion of the reference block is scaled to the current picture according to the temporal distances TD0 and TD1 (in the same way as the MV scaling of TMVP in HEVC), and the scaled motion is assigned to the block in the current frame. If no scaled MV is assigned to a 4x4 block, the block's motion is marked as unavailable in the interpolated motion field.
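
The following Python sketch outlines this unilateral-ME interpolation under simplified assumptions: only a single reference picture is shown, MVs are stored in quarter-pel units in a dictionary keyed by 4x4 block position, and the temporal scaling is a plain linear scaling rather than the clipped fixed-point HEVC computation.

```python
from collections import namedtuple

MV = namedtuple("MV", "x y")   # quarter-pel units

def scale_mv(mv, td_cur, td_ref):
    # Simplified linear scaling by temporal distance (HEVC uses a clipped
    # fixed-point variant of this computation).
    s = td_cur / td_ref if td_ref else 0.0
    return MV(round(mv.x * s), round(mv.y * s))

def build_interpolated_mv_field(ref_mvs, ref_w, ref_h, cur_w, cur_h, td_cur, td_ref):
    """ref_mvs[(bx, by)] gives the MV of the 4x4 reference block (bx, by), or None."""
    grid_w, grid_h = cur_w // 4, cur_h // 4
    field = [[None] * grid_w for _ in range(grid_h)]
    for by in range(ref_h // 4):
        for bx in range(ref_w // 4):
            mv = ref_mvs.get((bx, by))
            if mv is None:
                continue
            # 4x4 block in the current picture that this motion passes through
            # (quarter-pel -> 4x4-block units: 16 quarter-pel = one 4x4 block).
            cx = bx + mv.x // 16
            cy = by + mv.y // 16
            if 0 <= cx < grid_w and 0 <= cy < grid_h and field[cy][cx] is None:
                field[cy][cx] = scale_mv(mv, td_cur, td_ref)
    return field   # entries left as None remain "unavailable"
```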

Interpolation and matching costs. When a motion vector points to a fractional sample position, motion-compensated interpolation is required. To reduce complexity, bilinear interpolation may be used for both bilateral matching and template matching instead of the conventional 8-tap HEVC interpolation.

The computation of the matching cost is somewhat different at different steps. When selecting a candidate from the CU-level candidate set, the matching cost may be the sum of absolute differences (SAD) of bilateral matching or template matching. After the starting MV is determined, the matching cost C of the bilateral matching search at the sub-CU level is calculated as follows:

C = SAD + w · (|MVx − MVsx| + |MVy − MVsy|)

Here, w is a weighting factor. In some embodiments, w is empirically set to 4. MV and MVs indicate the current MV and the starting MV, respectively, with (MVx, MVy) and (MVsx, MVsy) denoting their horizontal and vertical components. SAD may still be used as the matching cost of the template matching search at the sub-CU level.
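
A minimal sketch of this cost, assuming the SAD between the two matched blocks has already been computed:

```python
def bilateral_matching_cost(sad, mv, mv_start, w=4):
    """Regularize the SAD by the MV difference from the starting MV."""
    return sad + w * (abs(mv[0] - mv_start[0]) + abs(mv[1] - mv_start[1]))

# Example: a candidate one quarter-pel away from the starting MV adds w (= 4) to its SAD.
# bilateral_matching_cost(sad=1200, mv=(9, 4), mv_start=(8, 4))  -> 1204
```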

In FRUC mode, the MV is derived by using luma samples only. The derived motion will be used for both luma and chroma MC inter prediction. After the MV is determined, the final MC is performed using an 8-tap interpolation filter for luma and a 4-tap interpolation filter for chroma.

MV refinement is a pattern-based MV search, with the bilateral matching cost or template matching cost as the criterion. In JEM, two search patterns are supported: the unrestricted center-biased diamond search (UCBDS) and the adaptive cross search, for MV refinement at the CU level and sub-CU level, respectively. For MV refinement at both the CU level and sub-CU level, the MV is searched directly at quarter luma sample MV precision, followed by one-eighth luma sample MV refinement. The search range of MV refinement for the CU step and the sub-CU step is set to 8 luma samples.

In the bilateral matching Merge mode, bi-prediction is always applied, because the motion information of a CU is derived based on the closest match between two blocks along the motion trajectory of the current CU in two different reference pictures. In the template matching Merge mode, the encoder may select among uni-prediction from list 0, uni-prediction from list 1, or bi-prediction for a CU. The selection may be based on the template matching cost as follows:

If costBi <= factor × min(cost0, cost1),

then bi-prediction is used;

Otherwise, if cost0 <= cost1,

then uni-prediction from list 0 is used;

Otherwise,

uni-prediction from list 1 is used;

Here, cost0 is the SAD of the list 0 template matching, cost1 is the SAD of the list 1 template matching, and costBi is the SAD of the bi-prediction template matching. For example, a factor value equal to 1.25 means that the selection process is biased towards bi-prediction. The inter prediction direction selection may be applied only to the CU-level template matching process.
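
The selection rule can be summarized by the following sketch (variable names are illustrative):

```python
def select_prediction_direction(cost0, cost1, cost_bi, factor=1.25):
    """cost0, cost1 and cost_bi are the template matching SADs for list 0, list 1 and bi-prediction."""
    if cost_bi <= factor * min(cost0, cost1):
        return "bi-prediction"
    elif cost0 <= cost1:
        return "uni-prediction, list 0"
    else:
        return "uni-prediction, list 1"

# With factor = 1.25 the rule tolerates a slightly larger bi-prediction cost,
# i.e., the selection is biased towards bi-prediction.
```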

3.4 example of decoder-side motion vector refinement (DMVR)

In the bi-directional prediction operation, for the prediction of one block region, two prediction blocks formed using the MV of list 0 and the MV of list 1, respectively, are combined to form a single prediction signal. In the decoder-side motion vector refinement (DMVR) method, the two motion vectors of the bi-directional prediction are further refined by a bilateral template matching process. Bilateral template matching is applied in the decoder to perform a distortion-based search between the bilateral template and the reconstruction samples in the reference pictures, in order to obtain refined MVs without transmission of additional motion information.

In DMVR, the bilateral template is generated as a weighted combination (i.e., average) of the two prediction blocks from the initial MV0 of list 0 and MV1 of list 1, respectively, as shown in Fig. 24. The template matching operation consists of calculating cost measures between the generated template and the sample region (around the initial prediction block) in the reference picture. For each of the two reference pictures, the MV that yields the minimum template cost is considered as the updated MV of that list to replace the original one. In JEM, nine MV candidates are searched for each list. The nine MV candidates include the initial MV and 8 surrounding MVs with one luma sample offset from the initial MV in either the horizontal or vertical direction, or both. Finally, the two new MVs (i.e., MV0' and MV1', as shown in Fig. 24) are used to generate the final bi-directional prediction results. The sum of absolute differences (SAD) is used as the cost measure.

DMVR is applied to the Merge mode of bi-prediction with one MV from a past reference picture and another MV from a future reference picture, without transmitting additional syntax elements. In JEM, DMVR is not applied when LIC, affine motion, FRUC, or a sub-CU Merge candidate is enabled for a CU.
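
A simplified sketch of the per-list DMVR search follows; template_cost is a hypothetical callable returning the SAD between the bilateral template and the reference samples addressed by a candidate MV.

```python
def dmvr_refine(initial_mv, template_cost, step=4):
    """Evaluate the initial MV and its 8 one-luma-sample neighbors (step = 4 quarter-pel units)."""
    best_mv, best_cost = initial_mv, template_cost(initial_mv)
    for dx in (-step, 0, step):
        for dy in (-step, 0, step):
            if dx == 0 and dy == 0:
                continue
            cand = (initial_mv[0] + dx, initial_mv[1] + dy)
            cost = template_cost(cand)
            if cost < best_cost:
                best_mv, best_cost = cand, cost
    return best_mv   # applied independently to the list 0 and list 1 MVs
```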

3.5 examples of Merge/skip modes with bilateral matching refinement

The Merge candidate list is first constructed by inserting the motion vectors and reference indices of spatially neighboring and temporally neighboring blocks into the candidate list with a redundancy check, until the number of available candidates reaches the maximum candidate size of 19. The candidate list for the Merge/skip mode is constructed by inserting spatial candidates, temporal candidates, affine candidates, advanced temporal MVP (ATMVP) candidates, spatial-temporal MVP (STMVP) candidates, and the additional candidates used in HEVC (combined candidates and zero candidates), according to a predefined insertion order and in the context of the numbered blocks shown in Fig. 25:

(1) spatial candidates for blocks 1-4

(2) Extrapolated affine candidates for blocks 1-4

(3)ATMVP

(4)STMVP

(5) Virtual affine candidates

(6) Spatial candidates (Block 5) (used only if the number of available candidates is less than 6)

(7) Extrapolated affine candidates (Block 5)

(8) Time domain candidates (as derived in HEVC)

(9) Non-adjacent spatial candidates followed by extrapolated affine candidates (blocks 6 to 49)

(10) Combining candidates

(11) Zero candidates

It may be noted that the IC flag is also inherited from Merge candidates, except for STMVP and affine candidates. Further, for the first four spatial candidates, bi-prediction candidates are inserted before uni-prediction candidates.

3.5.1 non-neighboring Merge candidates

If the total number of available Merge candidates has not reached the maximum allowed number of Merge candidates, non-adjacent Merge candidates may be added to the Merge candidate list. In existing implementations, a non-adjacent Merge candidate may be inserted into the Merge candidate list after the TMVP Merge candidate. The process of adding non-adjacent Merge candidates may be performed by the pseudo code shown in Fig. 26.

4 example of existing implementation

In existing implementations, using non-neighboring Merge candidates that obtain motion information from non-neighboring blocks may result in suboptimal performance.

In one example, prediction from motion information of non-adjacent blocks located above the CTU line may significantly increase the line buffer size.

In another example, prediction of motion information from non-adjacent blocks may bring additional coding gain at the cost of storing all the motion information (typically at the 4x4 level) in a cache, which significantly increases the complexity of the hardware implementation.

5 Examples of methods for constructing non-neighboring Merge candidates

Embodiments of the presently disclosed technology overcome the drawbacks of existing implementations, thereby providing video coding with lower memory and complexity requirements and higher coding efficiency. The selection of non-adjacent Merge candidates based on the disclosed techniques, which may enhance both existing and future video coding standards, is elucidated in the examples described below for various implementations. The examples of the disclosed technology provided below explain general concepts and are not meant to be interpreted as limiting. In the examples, unless explicitly indicated to the contrary, the various features described in these examples may be combined.

Embodiments of the disclosed technology reduce the cache/line buffer size required for non-adjacent Merge candidates and provide methods for further improving the coding performance of non-adjacent Merge candidates.

For the examples discussed below, let the coordinates of the top-left sample in the current block be (Cx, Cy), and let the coordinates of the top-left sample in one non-adjacent block be (NAx, NAy); the origin (0, 0) is the top-left corner of the picture/slice/LCU row/LCU. The coordinate difference (i.e., the offset from the current block) is denoted by (offsetX, offsetY), where offsetX = Cx − NAx and offsetY = Cy − NAy.

Examples 1 and 2 advantageously provide at least a reduction in memory and buffer occupancy for some embodiments.

Example 1. In one example, only non-adjacent blocks located at certain positions are accessed when constructing the Merge candidate list.

(a) In one example, NAx and NAy should satisfy NAx % M == 0 and NAy % N == 0, where M and N are two non-zero integers, such as M = N = 8 or 16.

(b) In one example, if the top-left sample in one non-adjacent block does not satisfy the given condition, the checking of the motion information associated with this block is skipped. Therefore, the associated motion information cannot be added to the Merge candidate list.

(c) Alternatively, if the top-left sample in one non-adjacent block does not satisfy the given condition, the position of this block may be shifted, truncated, or rounded to make sure the condition is satisfied. For example, (NAx, NAy) may be modified to ((NAx/M) × M, (NAy/N) × N), where "/" is integer division. A sketch of this alignment and of the region restriction in (d) is given after this list.

(d) A restricted region size covering all the non-adjacent blocks may be predefined/signaled. In this case, when a non-adjacent block calculated by a given offset (OffsetX, OffsetY) is outside the region, it is marked as unavailable or treated as being in intra-coded mode. The corresponding motion information may not be added to the candidate list as a candidate. An example is depicted in Fig. 27.

(i) In one example, the region size is defined as one or more CTBs.

(ii) Alternatively, all blocks above the LCU row covering the current block are marked as unavailable or treated as being in intra-coded mode. The corresponding motion information may not be added to the candidate list as a candidate.

(iii) Alternatively, let the coordinates of the top-left sample of the LCU covering the current block be (LX, LY). (LX − NAx), and/or abs(LX − NAx), and/or (LY − NAy), and/or abs(LY − NAy) should be within a threshold.

(iv) One or more thresholds may be predefined. They may further depend on the minimum CU height/width, the LCU size, etc. For example, (LY − NAy) should be less than the minimum CU height, or (LY − NAy) should be less than twice the minimum CU height.

(v) The region size or threshold may be signaled in a Sequence Parameter Set (SPS), Picture Parameter Set (PPS), Video Parameter Set (VPS), slice header, etc.

(vi) In one example, all non-neighboring blocks outside the current slice/other kind of unit used for parallel encoding are marked as unavailable and the corresponding motion information should not be added to the candidate list as candidates.
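
The following Python sketch illustrates the position alignment of item (c) and a simple form of the region restriction of item (d)(iii); the helper names and thresholds are illustrative assumptions only.

```python
def align_position(nax, nay, m=8, n=8):
    """Item (c): truncate the position so that NAx % M == 0 and NAy % N == 0."""
    return (nax // m) * m, (nay // n) * n

def is_available(nax, nay, lcu_x, lcu_y, max_dx, max_dy):
    """Item (d)(iii): restrict the offset from the LCU covering the current block."""
    return abs(lcu_x - nax) <= max_dx and abs(lcu_y - nay) <= max_dy

# Example: a non-adjacent block at (NAx, NAy) = (131, 70) with M = N = 8 would
# first be truncated to (128, 64) and then checked against the allowed region
# before its motion information is fetched.
```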

Example 2. Spatial motion compression may be applied after encoding/decoding a region (e.g., one CTU or multiple CTUs).

(a) When motion compression is applied to a region, it means that all blocks within the region are forced to share a representative set of motion information.

(b) After encoding/decoding the region, spatial motion compression may be applied.

(c) Alternatively, in addition, subsequent blocks are predicted using the representative motion information set.

(d) Alternatively, in addition, spatial motion compression is applied prior to the filtering process, e.g., deblocking filtering, Sample Adaptive Offset (SAO) filtering.

(e) Alternatively, in addition, compressed motion information is used in the filtering process.

(f) In one example, a set of motion information from a particular (e.g., center or bottom right) location of the region is selected as the representative set.

(g) Alternatively, multiple positions may be checked in order; once a position belonging to an inter-coded block is identified, the checking process is terminated and the associated motion information is used as the representative set (see the sketch after this list).

(h) In one example, a set of motion information may be derived from the motion information associated with the region and used as the representative set.
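
A rough sketch of items (a), (f) and (g) is given below; motion_at and the region/motion-buffer layout are hypothetical placeholders used only for illustration.

```python
def representative_motion(region, motion_at):
    """Check preferred positions in order and take the first inter-coded block's motion."""
    positions = [
        (region.x + region.w // 2, region.y + region.h // 2),   # center
        (region.x + region.w - 1, region.y + region.h - 1),     # bottom-right
    ]
    for x, y in positions:
        info = motion_at(x, y)
        if info is not None and info.is_inter:
            return info          # checking terminates at the first inter-coded block
    return None                  # e.g., fall back to a derived set, as in item (h)

def compress_region_motion(region, motion_buffer, rep):
    # Item (a): every 4x4 unit in the region is forced to share the representative set.
    if rep is None:
        return
    for y in range(region.y, region.y + region.h, 4):
        for x in range(region.x, region.x + region.w, 4):
            motion_buffer[(x, y)] = rep
```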

Example 3 advantageously provides at least reduced computational complexity.

Example 3. When inserting new non-adjacent Merge candidates, pruning may be applied only to a subset of the available Merge candidates (a sketch is given after the items below).

(a) In one example, new non-adjacent Merge candidates are not pruned against other non-adjacent Merge candidates that have already been inserted.

(b) In one example, new non-adjacent Merge candidates are not pruned against temporal Merge candidates, such as TMVP or ATMVP.

(c) In one example, Merge candidates from certain specific neighboring blocks are used to prune a new non-adjacent Merge candidate, while Merge candidates from certain other specific neighboring blocks are not used to prune the new non-adjacent Merge candidate.
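
A minimal sketch of such partial pruning, where prune_against holds only the subset of existing candidates selected for redundancy checks:

```python
def insert_with_partial_pruning(merge_list, new_cand, prune_against):
    """Compare the new candidate only against the chosen subset, then insert if not redundant."""
    for cand in prune_against:
        if cand.mv == new_cand.mv and cand.ref_idx == new_cand.ref_idx:
            return False                 # redundant, not inserted
    merge_list.append(new_cand)
    return True
```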

Examples 4-7 advantageously provide at least coding performance gain.

Example 4. Not all kinds of motion information of non-adjacent blocks are considered for the Merge candidate list.

(a) In one example, only the motion information of non-neighboring blocks of AMVP coding is considered.

(b) In one example, only Merge-coded non-neighboring blocks' motion information is considered.

(c) In one example, only the motion information of Merge-coded non-adjacent blocks whose motion is not derived from a spatial Merge candidate is considered.

(d) In one example, only the motion information of Merge-coded non-neighboring blocks that invoke the motion refinement process is considered.

(e) In one example, motion information of non-neighboring blocks encoded with any kind of decoder-side MV refinement (such as FRUC, template matching, or PMVR) is not considered.

(f) In one example, motion information with low MVD/MV precision (e.g., 4-pixel precision) is not considered.

Example 5. For non-adjacent spatial neighbors, those coded with a first mode are given higher priority than those coded with a second mode (e.g., they are checked before all non-adjacent spatial neighbors coded with the second mode).

(a) Alternatively, for each distance, non-adjacent spatial neighbors coded with the first mode are given higher priority than those coded with the second mode, while across different distances, non-adjacent spatial neighbors with smaller distances are given higher priority (see the ordering sketch after this list).

(b) In one example, the first mode is defined as the AMVP mode and the second mode is defined as the Merge mode, or vice versa.

(c) Alternatively, the first mode is defined as bi-directional prediction mode and the second mode is defined as uni-directional prediction mode, or vice versa.

(d) Alternatively, the first mode is defined as the AMVP mode, and the second mode is defined as the Merge mode excluding the spatial Merge candidates.

(e) Alternatively, the first pattern is defined as an affine pattern and the second pattern is defined as a non-affine pattern, or vice versa.

(f) Alternatively, the first mode is defined as a high MV precision mode and the second mode is defined as a low MV precision mode.
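
A small sketch of the checking order implied by item (a), with the first mode taken as AMVP per item (b); the block attributes are illustrative assumptions.

```python
def checking_order(non_adjacent_blocks):
    """Each block is assumed to expose .distance and .mode ('AMVP' or 'Merge')."""
    mode_rank = {"AMVP": 0, "Merge": 1}
    # Smaller distances first; within each distance, first-mode blocks before second-mode blocks.
    return sorted(non_adjacent_blocks,
                  key=lambda b: (b.distance, mode_rank.get(b.mode, 2)))
```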

Example 6. Instead of checking non-adjacent spatial neighbors after a certain kind of Merge candidate (e.g., TMVP), multiple stages of non-adjacent Merge candidates may be derived and added to the Merge candidate list.

(a) In one example, for the first stage, those non-adjacent spatial neighbors coded with the first mode may be added after a first given kind of Merge candidate, while for the second stage, those non-adjacent spatial neighbors coded with the second mode may be added after a second given kind of Merge candidate, and so on.

(b) In one example, the first given kind of Merge candidate may be a last spatial Merge candidate, and the second given kind of Merge candidate may be a TMVP Merge candidate.

(c) Alternatively, further, the multiple stages may involve different positions of non-adjacent blocks.

(d) The first and second modes may be as defined in example 5.

Example 7. The motion information of non-adjacent blocks may be used as predictors in the AMVP mode.

(a) In one example, one or more MV candidates derived (e.g., scaled to a target reference picture)/inherited from one or more non-neighboring blocks may be added to the AMVP candidate list.

(b) Alternatively, one or more motion candidates inherited from one or more non-adjacent blocks may be added to the AMVP candidate list. In this case, the signaling of the reference picture index and prediction direction is skipped, and they are inherited from the motion candidates.

(c) The use of motion information associated with non-adjacent blocks may be applied under certain conditions, e.g., when the total number of AMVP candidates has not reached the maximum allowed size, and/or when an indication of the usage is signaled, and/or based on the block size (W, H) with W > th0 and H > th1 (th0 and th1 are two thresholds, e.g., both equal to 8), as sketched below.
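
A minimal sketch of this condition check, with illustrative parameter names:

```python
def may_use_non_adjacent_amvp(num_amvp_cands, max_amvp_cands,
                              use_flag_signaled, width, height, th0=8, th1=8):
    """Return True only if the AMVP list is not full, usage is signaled, and the block is large enough."""
    return (num_amvp_cands < max_amvp_cands
            and use_flag_signaled
            and width > th0 and height > th1)
```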

The examples described above may be incorporated in the context of the methods described below, e.g., method 2800, which may be implemented at a video decoder and/or a video encoder.

Some embodiments of the present document may preferably provide the following solutions. These embodiments include further variations and aspects of the techniques described in Example 2 of the methods for constructing non-adjacent Merge candidates.

1. A video encoding method (e.g., method 2800 shown in fig. 28), comprising:

parsing (2810) the bitstream to recover motion information for one or more current blocks in a first portion of the video;

generating (2820) a representative set of motion information using motion information of blocks in the first portion of the video; and

reconstructing (2830) a block in the second portion of the video using the representative set of motion information.

2. The method of solution 1 wherein the blocks in the first portion of the video comprise different sets of motion information.

3. The method of solution 1 wherein the first portion of the video is a coding tree unit.

4. The method of solution 1 wherein the first portion of the video corresponds to a plurality of coding tree units.

5. The method as in any one or more of solutions 1-2, further comprising:

the motion information of the block in the first portion of the video is reset to the representative set of motion information.

6. The method of solution 5 wherein the resetting is prior to applying the filtering process.

7. The method of solution 6, wherein the filtering process comprises a deblocking filter or a Sample Adaptive Offset (SAO) filter.

8. The method as in any one or more of solutions 1-7, wherein a representative set of motion information is used in the filtering process.

9. The method as in any one or more of solutions 1-8, wherein the representative set of motion information is associated with a representative region of a selected block in the first portion of the video.

10. The method of solution 9, wherein the representative region is the center or lower right region.

11. The method as in any one or more of solutions 9-10, wherein the representative region is determined by checking blocks in the first portion of the video according to a sequence.

12. The method of solution 11, wherein the checking of blocks is terminated when an inter-coded block is identified in the first portion of the video.

13. The method of solution 12 wherein the representative set of motion information is associated with an inter-coded block.

14. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement a method of one or more of solutions 1 to 13.

15. A computer program product stored on a non-transitory computer readable medium, the computer program product comprising program code for performing a method according to one or more of solutions 1 to 13.

In some embodiments, the above-listed solutions may be preferably used in video coding operations. In some embodiments, the above-listed solutions may be preferably used in video decoding operations. In some embodiments, the above listed solutions may preferably be used in video transcoding operations.

6 example implementation of the disclosed technology

Fig. 29 is a block diagram of the video processing apparatus 2900. The apparatus 2900 may be used to implement one or more of the methods described herein. The apparatus 2900 may be implemented in a smartphone, tablet, computer, Internet of Things (IoT) receiver, and so on. The apparatus 2900 may include one or more processors 2902, one or more memories 2904, and video processing hardware 2906. The processor(s) 2902 may be configured to implement one or more of the methods described herein (including, but not limited to, the method 2800). The memory (or memories) 2904 may be used to store data and code for implementing the methods and techniques described herein. The video processing hardware 2906 may be used to implement, in hardware circuitry, some of the techniques described herein.

In some embodiments, a video decoder device may implement a method using zero units as described herein for video decoding. Various features of the method may be similar to the method 2800 described above.

In some embodiments, the video decoding method may be implemented using a decoding apparatus implemented on a hardware platform, as described with respect to fig. 29.

From the foregoing, it will be appreciated that specific embodiments of the presently disclosed technology have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the presently disclosed technology is not limited except as by the appended claims.

Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible and non-transitory computer-readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The terms "data processing unit" and "data processing apparatus" include all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

It is intended that the specification, together with the drawings, be considered exemplary only, where exemplary means an example. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, the use of "or" is intended to include "and/or", unless the context clearly indicates otherwise.

While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Likewise, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described herein should not be understood as requiring such separation in all embodiments.

Only a few implementations and examples have been described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.
