Selective use of coding and decoding tools in video processing

Document No.: 426165; Publication date: 2021-12-21

Reader's note: This technology, "Selective use of coding and decoding tools in video processing", was devised by Kai Zhang, Li Zhang, Hongbin Liu, and Yue Wang on 2020-05-11. Its main content is as follows: there is provided a video processing method comprising: determining, for a conversion between a current video block of a current picture of a video and a codec representation of the video, that use of a codec tool is disabled for the current video block due to the current video block being coded into the codec representation using a reference picture having a size different from a size of the current picture; and performing the conversion based on the determination.

1.A video processing method, comprising:

determining, for a conversion between a current video block of a current picture of a video and a codec representation of the video, that use of a codec tool is disabled for the current video block due to the current video block being coded into the codec representation using a reference picture having a size different from a size of the current picture; and

performing the conversion based on the determination.

2. The method of claim 1, wherein information related to the codec tool is not signaled in the event that use of the codec tool is disabled.

3. The method of claim 1, wherein motion information of the current video block is not inserted into a history-based motion vector prediction (HMVP) table, wherein the HMVP table includes one or more entries corresponding to motion information of one or more previously processed blocks.

4. The method of any of claims 1-3, wherein the codec tool corresponds to a decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode, Symmetric Motion Vector Differences (SMVD), Merge mode with motion vector differences (MMVD), inter-intra prediction, Local Illumination Compensation (LIC), history-based motion vector prediction (HMVP), Multiple Transform Selection (MTS), sub-block transform (SBT), prediction refinement with optical flow (PROF), low-frequency non-separable transform (LFNST), or a filtering tool.

5. The method of any of claims 1-4, wherein the size comprises at least one of a width and a height of the current picture.

6. The method of any of claims 1 to 5, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

7. A video processing method, comprising:

for a current video block of a current picture of a video, generating a prediction block by applying a coding tool to the current video block based on a rule that determines whether and/or how to use a reference picture having a size different from a size of the current picture; and

performing a conversion between the current video block and a codec representation of the video using the prediction block.

8. The method of claim 7, wherein the rule specifies that the reference picture is not used to generate the prediction block due to the coding tool applied to the current video block.

9. The method of claim 8, wherein Merge candidates that refer to the reference picture are skipped or not placed into a Merge candidate list.

10. The method of claim 8, wherein signaling of a reference index corresponding to the reference picture is skipped or not allowed.

11. The method of claim 7, wherein the rule specifies scaling the reference picture according to a resolution of the current picture and a resolution of the reference picture prior to applying the coding tool.

12. The method of claim 7, wherein the rule specifies scaling a motion vector or a motion vector difference according to a resolution of the current picture and a resolution of the reference picture before applying the coding tool.

13. The method of any of claims 7 to 12, wherein the codec tool corresponds to a decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode, Symmetric Motion Vector Differences (SMVD), Merge mode with motion vector differences (MMVD), inter-intra prediction, Local Illumination Compensation (LIC), history-based motion vector prediction (HMVP), Multiple Transform Selection (MTS), sub-block transform (SBT), prediction refinement with optical flow (PROF), low-frequency non-separable transform (LFNST), or a filtering tool.

14. The method of any of claims 7 to 12, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

15. A video processing method, comprising:

for a conversion between a current video block of a current picture of a video and a codec representation of the video, determining whether to disable a codec tool for the current video block based on a first resolution of a reference picture associated with one or more reference picture lists and/or a second resolution of a current reference picture used to derive a prediction block for the current video block; and

performing the conversion based on the determination.

16. The method of claim 15, wherein the determining determines to disable the codec tool for one reference picture list and to enable the codec tool for another reference picture list.

17. The method of claim 15, wherein the determining determines to disable the codec tool for a first reference picture of a reference picture list and to enable the codec tool for a second reference picture of the reference picture list or of another reference picture list.

18. The method of claim 15, wherein the determining determines whether the codec tool is disabled for a first reference picture list without regard to a second reference picture list that is different from the first reference picture list.

19. The method of claim 18, wherein the determining determines whether to disable the codec tool based on a reference picture of the first reference picture list and the current picture.

20. The method of claim 18, wherein the determining determines whether to disable the codec tool for the first reference picture list in a case where a reference picture associated with the first reference picture list is different from the current picture.

21. The method of claim 15, wherein the determining determines whether to disable the codec tool further based on resolutions of reference pictures associated with the one or more reference picture lists and/or a resolution of the current picture.

22. The method of claim 21, wherein the determining determines to disable the codec tool in a case where at least one reference picture has a resolution different from a resolution of the current picture.

23. The method of claim 21, wherein the determining does not disable the codec tool in a case where at least one reference picture has a resolution different from a resolution of the current picture and the reference pictures all have the same resolution as each other.

24. The method of claim 21, wherein the determining determines to disable the codec tool in a case where at least one reference picture has a resolution different from a resolution of the current picture and the reference pictures have resolutions different from each other.

25. The method of any of claims 15 to 24, wherein the codec tool corresponds to a decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode, Symmetric Motion Vector Differences (SMVD), Merge mode with motion vector differences (MMVD), inter-intra prediction, Local Illumination Compensation (LIC), history-based motion vector prediction (HMVP), Multiple Transform Selection (MTS), sub-block transform (SBT), prediction refinement with optical flow (PROF), low-frequency non-separable transform (LFNST), or a filtering tool.

26. The method of any of claims 15 to 25, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

27. A video processing method, comprising:

performing a conversion between a video picture comprising one or more video blocks and a codec representation of the video,

wherein at least some of the one or more video blocks are coded by referencing a reference picture list of the video picture according to a rule,

wherein the rule specifies that the reference picture list includes reference pictures having a maximum of K different resolutions, where K is an integer.

28. The method of claim 27, wherein K equals 2.

29. The method of claim 27 or 28, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

30. A video processing method, comprising:

performing a conversion between N consecutive video pictures of a video and a codec representation of the video,

wherein the N consecutive video pictures comprise one or more video blocks coded at different resolutions according to a rule,

wherein the rule specifies that a maximum of K different resolutions are allowed for the N consecutive pictures, where N and K are integers.

31. The method of claim 30, wherein N and K are equal to 3.

32. The method of claim 30, wherein N equals 10 and K equals 3.

33. The method of claim 30, wherein the K different resolutions are allowed within a group of pictures (GOP) in the codec representation.

34. The method of claim 30, wherein the K different resolutions are allowed between two pictures having the same temporal layer identifier.

35. The method of any of claims 30 to 34, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

36. A video processing method, comprising:

performing a conversion between a video comprising a plurality of pictures and a codec representation of the video,

wherein at least some of the plurality of pictures are coded into the codec representation using different video resolutions, and

wherein the codec representation conforms to a format rule specifying that a first coding resolution of a previous frame is changed to a second coding resolution of a next frame only in a case where the next frame, which follows the previous frame in an order, is coded as an intra frame.

37. The method of claim 36, wherein the order corresponds to an encoding order in which the plurality of pictures are encoded.

38. The method of claim 36, wherein the order corresponds to a decoding order in which the plurality of pictures are decoded.

39. The method of claim 36, wherein the order corresponds to a display order in which the plurality of pictures are displayed after decoding.

40. The method of any of claims 36-39, wherein the intra frame is an intra random access point picture.

41. The method of any one of claims 36-39, wherein the intra frame is an Instantaneous Decoding Refresh (IDR) frame.

42. The method of any of claims 36 to 41, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

43. A video processing method, comprising:

parsing a codec representation of a video to determine that a current video block of a current picture of the video references a reference picture associated with a resolution different from a resolution of the current picture;

generating a prediction block for the current video block by converting a bi-directional prediction mode into a uni-directional prediction mode to be applied to the current video block; and

generating the video from the codec representation using the prediction block.

44. The method of claim 43, wherein generating the prediction block comprises dropping prediction from a list, wherein the list has reference pictures associated with a resolution different from a resolution of the current picture.

45. A video processing method, comprising:

generating a prediction block for a current video block of a current picture of a video by enabling or disabling inter prediction from reference pictures having different resolutions from each other according to a motion vector precision and/or a resolution ratio; and

performing a conversion between the current video block and a codec representation of the video using the prediction block.

46. The method of claim 45, wherein the inter prediction is enabled if a motion vector scaled according to the resolution ratio points to an integer position.

47. The method of claim 45, wherein the inter prediction is enabled if a motion vector scaled according to a resolution ratio points to a sub-pixel position.

48. The method of claim 45, wherein bi-prediction is disabled if a reference picture is associated with a resolution different from a resolution of the current picture.

49. The method of claim 45, wherein bi-prediction is enabled if one reference picture has a resolution different from a resolution of the current picture and another reference picture is associated with the same resolution as the resolution of the current picture.

50. The method of claim 45, wherein unidirectional prediction is not allowed if a reference picture is associated with a resolution different from a resolution of the current picture and a block size of the current video block satisfies a particular condition.

51. The method of any of claims 45 to 50, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

52. A video processing method, comprising:

determining, based on a codec characteristic of a current video block of a current picture of a video, whether a reference picture having a size different from a size of the current picture is allowed to be used to generate a prediction block for the current video block during a conversion between the current video block and a codec representation of the video; and

performing the conversion according to the determination.

53. The method of claim 52, wherein the codec characteristic comprises a size of the current video block and/or a prediction mode of the current video block.

54. The method of claim 52, wherein the determining determines that reference pictures are allowed in a case where WB ≥ T1 and HB ≥ T2, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T1 and T2 are positive integers.

55. The method of claim 52, wherein the determining determines that a reference picture is allowed in a case where WB × HB ≥ T, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T is a positive integer.

56. The method of claim 52, wherein the determining determines that a reference picture is allowed in a case where Min(WB, HB) ≥ T, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T is a positive integer.

57. The method of claim 52, wherein the determining determines that a reference picture is allowed in a case where Max(WB, HB) ≥ T, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T is a positive integer.

58. The method of claim 52, wherein the determining determines that reference pictures are allowed in a case where WB ≤ T1 and HB ≤ T2, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T1 and T2 are positive integers.

59. The method of claim 52, wherein the determining determines that a reference picture having a size different from a size of the current picture is allowed in a case where WB × HB ≤ T, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T is a positive integer.

60. The method of claim 52, wherein the determining determines that a reference picture is allowed in a case where Min(WB, HB) ≤ T, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T is a positive integer.

61. The method of claim 52, wherein the determining determines that a reference picture is allowed in a case where Max(WB, HB) ≤ T, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T is a positive integer.

62. The method of claim 52, wherein the determining determines that reference pictures are not allowed in a case where WB ≤ T1 and/or HB ≤ T2, wherein WB and HB correspond to a width and a height, respectively, of the current video block, and T1 and T2 are positive integers.

63. The method of any of claims 52-62, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

64. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of claims 1-63.

65. A computer program product stored on a non-transitory computer readable medium, the computer program product comprising program code for performing the method of any of claims 1 to 63.

Technical Field

This patent document relates to video processing techniques, devices, and systems.

Background

Despite advances in video compression, digital video still accounts for the largest share of bandwidth used on the internet and other digital communication networks. As the number of connected user devices capable of receiving and displaying video grows, the bandwidth demand for digital video usage is expected to continue to grow.

Disclosure of Invention

Apparatus, systems, and methods related to digital video processing, such as adaptive loop filtering for video processing, are described. The described methods may be applied to existing video coding standards (e.g., High Efficiency Video Coding (HEVC)) and future video coding standards (e.g., Versatile Video Coding (VVC)) or codecs.

Video coding standards have evolved primarily through the development of the well-known ITU-T and ISO/IEC standards. ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video, H.264/MPEG-4 Advanced Video Coding (AVC), and H.265/HEVC standards. Since H.262, video coding standards have been based on a hybrid video coding structure in which temporal prediction plus transform coding is utilized. To explore future video coding technologies beyond HEVC, the Joint Video Exploration Team (JVET) was founded jointly by VCEG and MPEG in 2015. Since then, many new methods have been adopted by JVET and put into the reference software named Joint Exploration Model (JEM). In April 2018, the Joint Video Experts Team (JVET) between VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 (MPEG) was established, working on the VVC standard with the target of a 50% bitrate reduction compared to HEVC.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: deriving, for a conversion between a current video block of a current picture of a video and a codec representation of the video, one or more motion vector offsets based on resolutions of one or more reference pictures associated with the current video block and a resolution of the current picture; and performing the conversion using the one or more motion vector offsets.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: for a conversion between a current video block of a current picture of a video and a codec representation of the video, constructing a motion candidate list in which motion candidates are included in a priority order such that priorities of the motion candidates are based on resolutions of reference pictures associated with the motion candidates; and performing the conversion using the motion candidate list.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: determining parameters of an adaptive loop filter for a current video picture based on a size of the current video picture, the current video picture including one or more video units; and performing a conversion between the current video picture and a codec representation of the current video picture by filtering the one or more video units according to the parameters of the adaptive loop filter.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: applying, to a current video block of a current picture of a video, a luma mapping with chroma scaling (LMCS) process in which luma samples of the current video block are reshaped between a first domain and a second domain and a chroma residual is scaled in a luma-dependent manner, by using LMCS parameters associated with respective sizes; and performing a conversion between the current video block and a codec representation of the video.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: determining, for a conversion between a current video block of a video and a codec representation of the video, whether and/or how to enable a codec tool that divides the current video block into a plurality of sub-partitions according to a rule based on reference picture information of the plurality of sub-partitions; and performing the conversion based on the determination.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: determining, for a conversion between a current video block of a current picture of a video and a codec representation of the video, that use of a codec tool is disabled for the current video block due to a reference picture having a size different from a size of the current picture being used to code the current video block into the codec representation; and performing the conversion based on the determination.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: generating a prediction block for a current video block of a current picture of a video by applying a coding tool to the current video block based on a rule that determines whether and/or how to use a reference picture having a size different from a size of the current picture; and performing a conversion between the current video block and a codec representation of the video using the prediction block.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: determining, for a conversion between a current video block of a current picture of a video and a codec representation of the video, whether to disable a codec tool for the current video block based on a first resolution of a reference picture associated with one or more reference picture lists and/or a second resolution of a current reference picture used to derive a prediction block for the current video block; and performing the conversion based on the determination.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method includes performing a conversion between a video picture comprising one or more video blocks and a codec representation of the video, wherein at least some of the one or more video blocks are coded by referring to a reference picture list of the video picture according to a rule, wherein the rule specifies that the reference picture list includes reference pictures having at most K different resolutions, wherein K is an integer.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method includes performing a conversion between N consecutive video pictures of a video and a codec representation of the video, wherein the N consecutive video pictures include one or more video blocks coded at different resolutions according to a rule, wherein the rule specifies that at most K different resolutions are allowed for the N consecutive pictures, wherein N and K are integers.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises performing a conversion between a video comprising a plurality of pictures and a codec representation of the video, wherein at least some of the plurality of pictures are coded into the codec representation using different video resolutions, and wherein the codec representation conforms to a format rule specifying that a first coding resolution of a previous frame is changed to a second coding resolution of a next frame only in a case where the next frame, which follows the previous frame in an order, is coded as an intra frame.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method includes parsing a codec representation of a video to determine that a current video block of a current picture of the video references a reference picture associated with a resolution different from a resolution of the current picture; generating a prediction block for the current video block by converting a bi-directional prediction mode into a uni-directional prediction mode to be applied to the current video block; and generating the video from the codec representation using the prediction block.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: generating a prediction block for a current video block of a current picture of a video by enabling or disabling inter prediction from reference pictures having different resolutions from each other according to a motion vector precision and/or a resolution ratio; and performing a conversion between the current video block and a codec representation of the video using the prediction block.

In one representative aspect, the disclosed techniques can be used to provide a video processing method. The method comprises the following steps: determining, based on a codec characteristic of a current video block of a current picture of a video, whether a reference picture having a size different from a size of the current picture is allowed to be used to generate a prediction block for the current video block during a conversion between the current video block and a codec representation of the video; and performing the conversion according to the determination.

In yet another representative aspect, the above-described methods are embodied in the form of processor-executable code and stored in a computer-readable program medium.

In yet another representative aspect, an apparatus configured or operable to perform the above-described method is disclosed. The apparatus may include a processor programmed to implement the method.

In yet another representative aspect, a video decoder apparatus may implement the methods described herein.

The above and other aspects and features of the disclosed technology are described in more detail in the accompanying drawings, the description and the claims.

Drawings

Fig. 1 shows an example of adaptive streaming of two representations of the same content coded at different resolutions.

Fig. 2 shows an example of adaptive streaming of two representations of the same content coded at different resolutions.

Fig. 3 shows an example of two representative open GOP prediction structures.

Fig. 4 shows an example of representation switching at an open GOP position.

Fig. 5 shows an example of a decoding process of RASL pictures by using resampled reference pictures from other bitstreams as references.

Figs. 6A-6C show examples of MCTS-based RWMR viewport-dependent 360° streaming.

Fig. 7 shows an example of collocated sub-picture representations with different IRAP intervals and different sizes.

Fig. 8 shows an example of a segment received when a change in viewing direction results in a change in resolution.

Fig. 9 shows an example of a viewing direction change slightly up and towards the right cube face compared to fig. 6.

Fig. 10 shows an example of an implementation in which a sub-picture representation of two sub-picture positions is presented.

Figs. 11 and 12 show implementations of an ARC encoder and decoder, respectively.

Fig. 13 shows an example of ARC resampling based on tile groups.

Fig. 14 shows an example of adaptive resolution change.

Fig. 15 shows an ATMVP motion prediction example for a CU.

Fig. 16A and 16B show examples of a simplified 4-parameter affine motion model and a simplified 6-parameter affine motion model, respectively.

Fig. 17 shows an example of affine MVF of each sub-block.

Fig. 18A and 18B show examples of a 4-parameter affine model and a 6-parameter affine model, respectively.

Fig. 19 shows motion vector predictors (MVPs) of AF_INTER for inherited affine candidates.

Fig. 20 shows MVPs of AF_INTER for the constructed affine candidates.

Figs. 21A and 21B show five neighboring blocks and CPMV predictor derivation, respectively.

Fig. 22 shows an example of candidate positions for the affine Merge mode.

Fig. 23A and 23B are block diagrams of examples of hardware platforms for implementing the visual media decoding or visual media encoding techniques described in this document.

Fig. 24A-24E show a flow diagram of an example method of video processing in accordance with some implementations of the disclosed technology.

Fig. 25A through 25I show flowcharts of example methods of video processing according to some implementations of the disclosed technology.

Detailed Description

Techniques and apparatus disclosed in this document provide coding tools with adaptive resolution conversion. AVC and HEVC do not have the ability to change resolution without introducing an IDR or intra random access point (IRAP) picture; this capability may be referred to as adaptive resolution change (ARC). There are use cases or application scenarios that would benefit from an ARC feature, including the following:

Rate adaptation in video telephony and conferencing: to adapt the coded video to changing network conditions, the encoder can respond to deteriorating network conditions and lower available bandwidth by encoding pictures at a smaller resolution. At present, picture resolution can be changed only after an IRAP picture; this has several problems. An IRAP picture of reasonable quality will be much larger than an inter-coded picture and will be correspondingly more complex to decode: this costs time and resources. This is a problem if the resolution change is requested by the decoder for loading reasons. It can also break low-latency buffer conditions, force audio resynchronization, and the end-to-end latency of the stream will increase, at least temporarily. This can lead to a poor user experience.

-change of active speaker in multi-party video conferencing: for multi-party video conferencing, the active speaker is typically shown in a larger video size than the videos of the other conference participants. When the active speaker changes, the picture resolution for each participant may also need to be adjusted. The need for an ARC feature becomes more important because such changes of the active speaker happen frequently.

-fast start in streaming: for streaming applications, it is common for the application to buffer a certain length of decoded pictures before starting display. Starting the bitstream at a smaller resolution would allow the application to have enough pictures in the buffer to start displaying sooner.

Adaptive stream switching in streaming: the Dynamic Adaptive Streaming over HTTP (DASH) specification includes a feature named @mediaStreamStructureId. This enables switching between different representations at open-GOP random access points with non-decodable leading pictures, e.g., CRA pictures with associated RASL pictures in HEVC. When two different representations of the same video have different bitrates but the same spatial resolution, and they have the same value of @mediaStreamStructureId, switching between the two representations may be performed at a CRA picture with associated RASL pictures, and the RASL pictures associated with the CRA picture at the switch point can be decoded with acceptable quality, thereby enabling seamless switching. With ARC, the @mediaStreamStructureId feature would also be usable for switching between DASH representations with different spatial resolutions (see the sketch following this list).
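As an illustration of this switching condition only, the following Python sketch (all names are hypothetical and not part of any DASH API) models the check a client might perform before switching representations at a CRA picture; with ARC, the resolution-equality requirement can be relaxed.

from dataclasses import dataclass

@dataclass
class Representation:
    # Hypothetical model of the relevant properties of a DASH representation.
    media_stream_structure_id: str
    width: int
    height: int

def can_switch_seamlessly(src, dst, arc_supported):
    # Seamless switching at a CRA picture with associated RASL pictures
    # requires matching @mediaStreamStructureId; without ARC it also
    # requires identical spatial resolution.
    if src.media_stream_structure_id != dst.media_stream_structure_id:
        return False
    if arc_supported:
        return True  # RASL references can be resampled to the new resolution
    return (src.width, src.height) == (dst.width, dst.height)

a = Representation("s1", 1920, 1080)
b = Representation("s1", 1280, 720)
print(can_switch_seamlessly(a, b, arc_supported=False))  # False
print(can_switch_seamlessly(a, b, arc_supported=True))   # True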

ARC is also known as dynamic resolution conversion.

ARC can also be considered a special case of reference picture resampling (RPR), such as H.263 Annex P.

1.1 Reference picture resampling in H.263 Annex P

This mode describes an algorithm that warps the reference picture before it is used for prediction. It can be useful for resampling a reference picture that has a different source format than the picture being predicted. It can also be used for global motion estimation or rotational motion estimation by warping the shape, size, and position of the reference picture. The syntax includes the warping parameters to be used and the resampling algorithm. The simplest level of operation for the reference picture resampling mode is an implicit factor-of-4 resampling, since only an FIR filter needs to be applied for the upsampling and downsampling processes. In this case, no additional signaling overhead is required, since its use is understood when the size of the new picture (indicated in the picture header) differs from the size of the previous picture.
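As a toy illustration of the implicit signaling described above (the function name is hypothetical; this is not the normative Annex P process), resampling is inferred purely from a picture-size change:

def resampling_implied(prev_size, curr_size):
    # Implicit mode: no extra syntax is needed; a difference between the
    # previous picture size and the size indicated in the current picture
    # header alone indicates that the reference picture must be resampled.
    return prev_size != curr_size

print(resampling_implied((352, 288), (176, 144)))  # True: CIF -> QCIF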

1.2 VVC proposals for ARC

1.2.1 JVET-M0135

The initial design of ARC described below, parts of which are taken from JCTVC-F158, is suggested as a placeholder, merely to trigger discussion. Deleted text is enclosed in double brackets.

2.2.1.1 Description of the basic tools

The basic tool constraints for supporting ARC are as follows:

the spatial resolution may differ from the nominal resolution by a factor of 0.5, applicable to both dimensions. The spatial resolution may be increased or decreased, resulting in 0.5 and 2.0 scaling.

The aspect ratio and chroma format of the video are not changed.

-the clipping region is scaled in proportion to the spatial resolution.

Reference pictures are simply rescaled as needed, and inter prediction is applied as usual.

2.2.1.2 Scaling operation

It is proposed to use simple zero-phase separable down-scaling and up-scaling filters. Note that these filters are only used for prediction; the decoder may use more complex scaling for output purposes.

The following zero-phase 5-tap filter is used for 1:2 down-scaling:

(-1, 9, 16, 9, -1) / 32

The down-sampled points are located at even sample positions and are co-located with the input samples. The same filter is used for luma and chroma.

For 2:1 upsampling, additional samples at odd grid positions are generated using the half-pixel motion compensation interpolation filter coefficients in the latest VVC WD.

The combined up- and down-sampling does not change the position or phase of the chroma sampling points.
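To make the filtering concrete, here is a small Python sketch (the function name and the rounding behavior are assumptions, not part of the proposal) applying the zero-phase 5-tap 1:2 down-scaling filter above to one row of samples, with clamping at the picture edges:

def downscale_1to2_row(samples):
    # Zero-phase 5-tap down-scaling filter (-1, 9, 16, 9, -1) / 32.
    # Output samples are taken at even positions, co-located with the input.
    taps = (-1, 9, 16, 9, -1)
    n = len(samples)
    out = []
    for x in range(0, n, 2):  # keep even positions only
        acc = 0
        for k, t in enumerate(taps):
            idx = min(max(x + k - 2, 0), n - 1)  # clamp at the edges
            acc += t * samples[idx]
        out.append((acc + 16) >> 5)  # divide by 32 with rounding (assumed)
    return out

# Example: halve an 8-sample row.
print(downscale_1to2_row([10, 12, 14, 16, 18, 20, 22, 24]))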

2.2.1.3 Resolution description in parameter sets

The signaling changes for picture resolution in the SPS are as follows.

Table 1: sequence parameter set RBSP syntax and semantics.

[[pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.]]

num_pic_size_in_luma_samples_minus1 plus 1 specifies the number of picture sizes (width and height) in units of luma samples that may be present in the coded video sequence.

pic_width_in_luma_samples[i] specifies the i-th width of decoded pictures in units of luma samples that may be present in the coded video sequence. pic_width_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

pic_height_in_luma_samples[i] specifies the i-th height of decoded pictures in units of luma samples that may be present in the coded video sequence. pic_height_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

Table 2: picture parameter set RBSP syntax and semantics

pic_size_idx specifies the index to the i-th picture size in the sequence parameter set. The width of pictures referring to the picture parameter set is pic_width_in_luma_samples[pic_size_idx] in luma samples. Likewise, the height of pictures referring to the picture parameter set is pic_height_in_luma_samples[pic_size_idx] in luma samples.
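The relation between the SPS size list and the PPS index can be modeled as follows (a sketch only; the container classes are hypothetical, while the field names follow the syntax above):

from dataclasses import dataclass, field

@dataclass
class SPS:
    # (width, height) pairs, i.e. pic_width_in_luma_samples[i] and
    # pic_height_in_luma_samples[i]; the list holds
    # num_pic_size_in_luma_samples_minus1 + 1 entries.
    pic_sizes: list = field(default_factory=list)

@dataclass
class PPS:
    pic_size_idx: int  # index into the SPS picture-size list

def picture_size(sps, pps):
    # A picture referring to this PPS has the indexed width and height.
    return sps.pic_sizes[pps.pic_size_idx]

sps = SPS(pic_sizes=[(1920, 1080), (960, 540)])
print(picture_size(sps, PPS(pic_size_idx=1)))  # (960, 540)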

1.2.2 JVET-M0259

1.2.2.1. Background: sub-picture

The term sub-picture track is defined in the Omnidirectional Media Format (OMAF) as follows: a track that has spatial relationships with other tracks and that represents a spatial subset of the original video content, which was divided into spatial subsets before video encoding at the content production side. A sub-picture track of HEVC may be constructed by rewriting the parameter sets and slice segment headers of a motion-constrained tile set so that it becomes a self-standing HEVC bitstream. A sub-picture representation may be defined as a DASH representation carrying a sub-picture track.

JVET-M0261 uses the term sub-picture as a spatial partitioning unit for VVC, summarized as follows:

1. Pictures are divided into sub-pictures, tile groups, and tiles.

2. A sub-picture is a rectangular set of tile groups, starting with a tile group whose tile_group_address is equal to 0.

3. Each sub-picture may refer to its own PPS and may hence have its own tile partitioning.

4. In the decoding process, a sub-picture is treated like a picture.

5. The reference picture for decoding a sub-picture is generated by extracting the region collocated with the current sub-picture from a reference picture in the decoded picture buffer. The extracted region shall be a decoded sub-picture, i.e., inter prediction takes place between sub-pictures of the same size and the same location within the picture (see the sketch after this list).

6. A tile group is a sequence of tiles in the tile raster scan of a sub-picture.

In this context, the term "sub-picture" is to be understood as defined in JVET-M0261. However, a track encapsulating a sub-picture sequence as defined in JVET-M0261 has properties very similar to the sub-picture track defined in OMAF, and the examples given below apply to both cases.
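A minimal sketch of item 5 above, assuming a picture modeled as a list of sample rows (the function and its arguments are illustrative, not from the proposal):

def extract_collocated_subpicture(ref_picture, x0, y0, width, height):
    # Crop the region collocated with the current sub-picture from a
    # decoded reference picture, so that inter prediction happens between
    # sub-pictures of the same size and position within the picture.
    return [row[x0:x0 + width] for row in ref_picture[y0:y0 + height]]

# Example: a 2x2 sub-picture at position (2, 1) of a 4x4 picture.
pic = [[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]]
print(extract_collocated_subpicture(pic, 2, 1, 2, 2))  # [[7, 8], [11, 12]]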

1.2.2.2. Use case

1.2.2.2.1. Adaptive resolution change in streaming

Requirement for support of adaptive streaming

Section 5.13 of MPEG N17074 ("support for adaptive streaming") includes the following requirement for VVC: the standard shall support fast representation switching in the case of adaptive streaming services that offer multiple representations of the same content, each having different properties (e.g. spatial resolution or sample bit depth). The standard shall enable the use of efficient prediction structures (e.g. so-called open groups of pictures) without compromising the capability for fast and seamless representation switching between representations of different properties, such as different spatial resolutions.

Example of representation switching with an open GOP prediction structure

The content generation for adaptive bitrate streaming includes generating different representations, which may have different spatial resolutions. The client requests segments from a representation and can thereby decide at which resolution and bitrate the content is received. At the client, the segments of the different representations are concatenated, decoded, and played. The client should be able to achieve seamless playout with one decoder instance. As shown in fig. 1, closed GOP structures (starting with an IDR picture) are conventionally used. Fig. 1 shows adaptive streaming of two representations of the same content coded at different resolutions.

An open GOP prediction structure (starting with a CRA picture) enables better compression performance than the corresponding closed GOP prediction structure. For example, with IRAP pictures spaced 24 pictures apart, an average bitrate reduction of 5.6% in terms of luma Bjontegaard delta bitrate is achieved. For convenience, the simulation conditions and results are summarized in section YY.

Open GOP prediction structures are also reported to reduce subjective visual quality pumping.

The challenge in using open GOPs in streaming is that RASL pictures cannot be decoded with the correct reference pictures after switching representations. Fig. 2 illustrates this challenge: it shows adaptive streaming of two representations of the same content coded at different resolutions, where the segments use either a closed GOP or an open GOP prediction structure.

A segment starting with a CRA picture includes RASL pictures for which at least one reference picture is in the previous segment. This is illustrated in fig. 3, which shows two representative open GOP prediction structures. In fig. 3, picture 0 in both bitstreams is located in the previous segment and is used as a reference for predicting the RASL pictures.

The representation switch marked with a dashed rectangle in fig. 2 is shown in fig. 4, which depicts representation switching at an open GOP position. It can be observed that the reference picture of the RASL pictures ("picture 0") is not decoded. Consequently, the RASL pictures are not decodable and there would be a gap in the playback of the video.

However, it has been found that decoding RASL pictures with resampled reference pictures is subjectively acceptable; see section 4. The resampling of "picture 0" and its use as a reference picture for decoding the RASL pictures is illustrated in fig. 5. Fig. 5 shows the decoding process of RASL pictures using a resampled reference picture from the other bitstream as a reference.

2.2.2.2.2. Viewport changes in region-wise mixed resolution (RWMR) 360° video streaming

Background: HEVC-based RWMR streaming

RWMR 360° streaming provides a higher effective spatial resolution on the viewport. Schemes in which the tiles covering the viewport originate from 6K (6144×3072) ERP pictures, or from an equivalent CMP resolution, as shown in fig. 6, with "4K" decoding capability (HEVC Level 5.1), are included in clauses D.6.3 and D.6.4 of OMAF and were adopted in the VR Industry Forum guidelines. This resolution is considered suitable for head-mounted displays using quad-HD (2560×1440) display panels.

Encoding: the content is encoded at two spatial resolutions, with cube face sizes of 1536×1536 and 768×768, respectively. In both bitstreams, a 6×4 tile grid is used and a motion-constrained tile set (MCTS) is coded at each tile position.

Encapsulation: each MCTS sequence is encapsulated as a sub-picture track and made available as a sub-picture representation in DASH.

Selection of streamed MCTSs: 12 MCTSs are selected from the high-resolution bitstream, and the complementary 12 MCTSs are extracted from the low-resolution bitstream. Hence, a hemisphere (180°×180°) of the streamed content originates from the high-resolution bitstream.

Merging MCTSs into a bitstream to be decoded: the received MCTSs of a single time instance are merged into a coded picture of 1920×4608 that conforms to HEVC Level 5.1. Another option for the merged picture is to use four tile columns of width 768, two tile columns of width 384, and three tile rows of height 768 luma samples, resulting in a picture of 3840×2304 luma samples.

Figs. 6A-6C show an example of MCTS-based RWMR viewport-dependent 360° streaming. Fig. 6A shows an example of coded bitstreams, fig. 6B shows an example of the MCTSs selected for streaming, and fig. 6C shows an example of a picture merged from the MCTSs.

Background: several representations with different IRAP intervals for viewport-dependent 360° streaming

When the viewing direction changes in HEVC-based viewport-dependent 360° streaming, a new selection of sub-picture representations can take effect at the next segment boundary aligned with an IRAP picture. The sub-picture representations are merged into coded pictures for decoding, and hence the VCL NAL unit types are aligned in all selected sub-picture representations.

To provide a trade-off between the response time to viewing direction changes and the rate-distortion performance when the viewing direction is stable, multiple versions of the content can be coded at different IRAP intervals. This is illustrated in fig. 7 for one set of collocated sub-picture representations of the encoding presented in fig. 6, and is discussed in more detail in section 3 of JVET-L0368 ("Separate list for sub-block merge", H. Yang, J. Chen, October 2018).

Fig. 7 shows an example of collocated sub-picture representations with different IRAP intervals and different sizes.

Fig. 8 presents an example in which a sub-picture position is first selected to be received at the lower resolution (384×384). A viewing direction change results in a new selection of the sub-picture position, received at the higher resolution (768×768). In the example of fig. 8, the viewing direction change causes a change in resolution for the segments received starting from segment 4: the viewing direction changes such that segment 4 is received from the short-IRAP-interval sub-picture representation. Thereafter the viewing direction is stable, so the long-IRAP-interval version can be used starting from segment 5.

Disadvantage of updating all sub-picture positions

Because the viewing direction moves gradually under typical viewing conditions, the resolution changes only in a subset of the sub-picture positions in RWMR viewport-dependent streaming. Fig. 9 shows a viewing direction change slightly up and towards the right cube face compared to fig. 6. The cube face partitions whose resolution differs from before are marked with "C". It can be observed that the resolution of 6 of the 24 cube face partitions changed. However, as described above, in response to a viewing direction change, segments starting with an IRAP picture must be received for all 24 cube face partitions. Updating all sub-picture positions with segments starting with an IRAP picture is inefficient in terms of streaming rate-distortion performance.

Furthermore, the ability to use an open GOP prediction structure with sub-picture representations of RWMR 360° streaming is desirable to improve rate-distortion performance and to avoid the visual picture quality pumping caused by closed GOP prediction structures.

Suggested design examples

The following design goals are suggested:

VVC design should allow merging a sub-picture originating from a random access picture and another sub-picture originating from a non-random access picture into the same coded picture conforming to VVC.

VVC design should enable the use of an open GOP prediction structure in sub-picture representations without compromising the capability for fast and seamless representation switching between sub-picture representations of different properties (such as different spatial resolutions), while allowing the sub-picture representations to be merged into a single VVC bitstream.

An example of the design goals is illustrated in fig. 10, where sub-picture representations for two sub-picture positions are presented. For both sub-picture positions, a separate version of the content is coded for each combination of two resolutions and two random access intervals. Some of the segments start with an open GOP prediction structure. A viewing direction change causes the resolution at sub-picture position 1 to be switched at the start of segment 4. Since segment 4 starts with a CRA picture that has associated RASL pictures, those reference pictures of the RASL pictures that are in segment 3 need to be resampled. Note that this resampling applies to sub-picture position 1, while the decoded sub-pictures of some other sub-picture positions are not resampled. In this example, the viewing direction change does not cause a resolution change for sub-picture position 2, and hence the decoded sub-pictures of sub-picture position 2 are not resampled. In the first picture of segment 4, the sub-picture at sub-picture position 1 originates from a CRA picture, while the sub-picture at sub-picture position 2 originates from a non-random-access picture. It is proposed to allow merging these sub-pictures into a coded picture in VVC.

2.2.2.2.3. Adaptive resolution change in video conferencing

The adaptive resolution change proposed in JCTVC-F158 was mainly intended for video conferencing. The following subsections are copied from JCTVC-F158 and present the use cases in which adaptive resolution change was considered useful.

Seamless network adaptation and error recovery capability

Applications such as video conferencing and streaming over packet networks frequently require the coded stream to adapt to changing network conditions, particularly when the bitrate is too high and data is being lost. Such applications typically have a return channel allowing the encoder to detect errors and to perform adjustments. The encoder has two main tools at its disposal: reducing the bitrate and changing the temporal or spatial resolution. Temporal resolution changes can be achieved efficiently by using a layered prediction structure for coding. However, for best quality it is also necessary to change the spatial resolution, which should be part of a well-designed encoder for video communication.

Changing the spatial resolution within AVC requires sending an IDR frame and resetting the stream. This causes significant problems. An IDR frame of reasonable quality will be much larger than an inter picture and will be correspondingly more complex to decode: this costs time and resources. This is a problem if the resolution change is requested by the decoder for loading reasons. It can also break low-latency buffer conditions, force audio resynchronization, and the end-to-end latency of the stream will increase, at least temporarily. This results in a poor user experience.

To minimize these problems, IDR frames are typically sent at low quality, using a number of bits similar to a P frame, and a significant time is needed to recover back to full quality for the given resolution. To obtain a sufficiently low delay, the quality can be very low, and there is usually a noticeable blurring before the image is "refocused". In effect, the intra frame does very little useful work in compression terms: it is simply a method of restarting the stream.

Therefore, the approach in HEVC needs to allow the resolution to be changed with minimal impact on the subjective experience, especially under challenging network conditions.

Quick start

It is useful to have a "fast start" mode in which the first frame is sent at a reduced resolution and the resolution is increased over the next few frames, in order to reduce latency and reach normal quality more quickly without an unacceptable picture blur at the start.

Conference "composition"

Video conferencing also typically has a feature whereby the person speaking is displayed full screen and the other participants are displayed in smaller-resolution windows. To support this efficiently, the smaller pictures are typically sent at a lower resolution. The resolution is then increased when the participant becomes the speaker and is shown full screen. Sending an intra frame at this point causes an unpleasant hiccup in the video stream. This effect can be very noticeable and objectionable if the speakers alternate rapidly.

2.2.2.3. Proposed design goals

The following high-level design choices are proposed for VVC version 1:

1. It is proposed to include a reference picture resampling process in VVC version 1, for the following use cases:

Efficient prediction structures (e.g. so-called open groups of pictures) can be used in adaptive streaming without compromising the capability for fast and seamless representation switching between representations of different properties, such as different spatial resolutions.

Adapting low-latency session video content to network conditions and application-induced resolution changes without significant latency or latency variations.

2. The proposed VVC design allows merging a sub-picture originating from a random-access picture and another sub-picture originating from a non-random-access picture into the same coded picture conforming to VVC. This is asserted to enable efficient handling of viewing direction changes in mixed-quality and mixed-resolution viewport-adaptive 360° streaming.

3. It is proposed to include a sub-picture-wise resampling process in VVC version 1. This is asserted to enable efficient prediction structures for more efficient handling of viewing direction changes in mixed-resolution viewport-adaptive 360° streaming.

2.2.3 JVET-N0048

The use cases and design goals of adaptive resolution change (ARC) were discussed in detail in JVET-M0259. A summary follows:

1. Real-time communication

The following use cases for adaptive resolution change were initially included in JCTVC-F158:

a. seamless network adaptation and error recovery capability (through dynamic adaptive resolution change)

b. Rapid start (resolution gradually increasing at session start or reset)

c. Conference "composition" (the person speaking is given a higher resolution)

2. Adaptive streaming

Section 5.13 ("Support for Adaptive Streaming") of MPEG N17074 includes the following requirements for VVC: in case of an adaptive streaming service providing multiple representations of the same content, the standard should support fast representation switching, each representation having different properties (e.g. spatial resolution or sampling bit depth). The standard should be able to use an efficient prediction structure (e.g. a so-called open group of pictures) without compromising the fast and seamless representation switching capability between representations of different properties (e.g. different spatial resolutions).

JVET-M0259 discusses how this requirement can be satisfied by resampling the reference pictures of leading pictures.

3. 360-degree viewport-dependent streaming

JVET-M0259 discusses how this use case can be addressed by resampling certain independently coded picture regions of the reference pictures of leading pictures.

This contribution proposes an adaptive resolution coding method that is asserted to meet all of the above use cases and design goals. The 360-degree viewport-dependent streaming and conference "composition" use cases are handled by this proposal together with JVET-N0045 (which proposes independent sub-picture layers).

Proposed specification text

Deleted text is enclosed in double brackets.

Signaling

Table 3: sps_max_rpr

sps_max_rpr specifies the maximum number of active reference pictures in reference picture list 0 or 1 of any tile group in the CVS that have pic_width_in_luma_samples and pic_height_in_luma_samples not equal, respectively, to the pic_width_in_luma_samples and pic_height_in_luma_samples of the current picture.

Tables 4 and 5: picture width and height

max_width_in_luma_samples specifies that it is a requirement of bitstream conformance that pic_width_in_luma_samples in any active PPS, for any picture of any CVS for which this SPS is active, is less than or equal to max_width_in_luma_samples.

max_height_in_luma_samples specifies that it is a requirement of bitstream conformance that pic_height_in_luma_samples in any active PPS, for any picture of any CVS for which this SPS is active, is less than or equal to max_height_in_luma_samples.

High layer decoding process

The decoding process for the current picture CurrPic operates as follows:

1. clause 8.2 specifies the decoding of NAL units.

2. The processes in clause 8.3 specify the following decoding processes that use syntax elements in the tile group header layer and above:

variables and functions related to picture order count are derived as specified in clause 8.3.1. This need only be invoked for the first slice group of pictures.

-at the start of the decoding process for each slice group of the non-IDR picture, invoking the decoding process of the reference picture list construction specified in clause 8.3.2 to derive reference picture list 0(RefPicList [0]) and reference picture list 1(RefPicList [1 ]).

-invoking the decoding process of reference picture marking in clause 8.3.3, wherein a reference picture can be marked as "unused for reference" or "used for long term reference". This only needs to be called for the first slice group of a picture.

For each active reference picture in RefPicList [0] and RefPicList [1] with pic _ width _ in _ luma _ samples or pic _ height _ in _ luma _ samples not equal to pic _ width _ in _ luma _ samples or pic _ height _ in _ luma _ samples of CurrPic, respectively, the following applies:

-the resampling process in clause x.y.z is invoked, wherein the output has the same reference picture marking and picture order count as the input.

The reference picture used as input to the resampling process is marked as "unused for reference".

The invocation of decoding procedures for coding and decoding tree elements, scaling, transformation, in-loop filtering, etc. may be further discussed.

After all slice groups of the current picture are decoded, the current decoded picture is marked as "used for short-term reference".

Resampling process

It is proposed to add the following to the SHVC resampling process (HEVC clause H.8.1.4.2):

If sps_ref_wraparound_enabled_flag is equal to 0, the sample value tempArray[n], with n = 0..7, is derived as follows:

tempArray[n]=
(fL[xPhase,0]*rlPicSampleL[Clip3(0,refW-1,xRef-3),yPosRL]+
fL[xPhase,1]*rlPicSampleL[Clip3(0,refW-1,xRef-2),yPosRL]+
fL[xPhase,2]*rlPicSampleL[Clip3(0,refW-1,xRef-1),yPosRL]+
fL[xPhase,3]*rlPicSampleL[Clip3(0,refW-1,xRef),yPosRL]+
fL[xPhase,4]*rlPicSampleL[Clip3(0,refW-1,xRef+1),yPosRL]+    (H-38)
fL[xPhase,5]*rlPicSampleL[Clip3(0,refW-1,xRef+2),yPosRL]+
fL[xPhase,6]*rlPicSampleL[Clip3(0,refW-1,xRef+3),yPosRL]+
fL[xPhase,7]*rlPicSampleL[Clip3(0,refW-1,xRef+4),yPosRL])>>shift1

Otherwise, the sample value tempArray[n], with n = 0..7, is derived as follows:

refOffset=(sps_ref_wraparound_offset_minus1+1)*MinCbSizeY

tempArray[n]=
(fL[xPhase,0]*rlPicSampleL[ClipH(refOffset,refW,xRef-3),yPosRL]+
fL[xPhase,1]*rlPicSampleL[ClipH(refOffset,refW,xRef-2),yPosRL]+
fL[xPhase,2]*rlPicSampleL[ClipH(refOffset,refW,xRef-1),yPosRL]+
fL[xPhase,3]*rlPicSampleL[ClipH(refOffset,refW,xRef),yPosRL]+
fL[xPhase,4]*rlPicSampleL[ClipH(refOffset,refW,xRef+1),yPosRL]+
fL[xPhase,5]*rlPicSampleL[ClipH(refOffset,refW,xRef+2),yPosRL]+
fL[xPhase,6]*rlPicSampleL[ClipH(refOffset,refW,xRef+3),yPosRL]+
fL[xPhase,7]*rlPicSampleL[ClipH(refOffset,refW,xRef+4),yPosRL])>>shift1

If sps_ref_wraparound_enabled_flag is equal to 0, the sample value tempArray[n], with n = 0..3, is derived as follows:

tempArray[n]=
(fC[xPhase,0]*rlPicSampleC[Clip3(0,refWC-1,xRef-1),yPosRL]+
fC[xPhase,1]*rlPicSampleC[Clip3(0,refWC-1,xRef),yPosRL]+
fC[xPhase,2]*rlPicSampleC[Clip3(0,refWC-1,xRef+1),yPosRL]+    (H-50)
fC[xPhase,3]*rlPicSampleC[Clip3(0,refWC-1,xRef+2),yPosRL])>>shift1

Otherwise, the sample value tempArray[n], with n = 0..3, is derived as follows:

refOffset=((sps_ref_wraparound_offset_minus1+1)*MinCbSizeY)/SubWidthC

tempArray[n]=
(fC[xPhase,0]*rlPicSampleC[ClipH(refOffset,refWC,xRef-1),yPosRL]+
fC[xPhase,1]*rlPicSampleC[ClipH(refOffset,refWC,xRef),yPosRL]+
fC[xPhase,2]*rlPicSampleC[ClipH(refOffset,refWC,xRef+1),yPosRL]+
fC[xPhase,3]*rlPicSampleC[ClipH(refOffset,refWC,xRef+2),yPosRL])>>shift1
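To make the structure of these formulas concrete, the following is a minimal Python sketch (not the normative SHVC text) of one horizontal luma filtering step; Clip3 and ClipH mirror the clipping functions used above, and the table fL and all function names are illustrative.

def clip3(lo, hi, v):
    # Clip3(lo, hi, v) as used in the formulas above.
    return max(lo, min(hi, v))

def clip_h(ref_offset, ref_w, x):
    # Horizontal wraparound clipping, used when sps_ref_wraparound_enabled_flag is 1.
    if x < 0:
        return x + ref_offset
    if x > ref_w - 1:
        return x - ref_offset
    return x

def temp_array_luma(rl_pic_sample_l, y_pos_rl, x_ref, x_phase, fL, ref_w,
                    shift1, wraparound=False, ref_offset=0):
    # One horizontally filtered intermediate sample: 8 taps at xRef-3..xRef+4.
    acc = 0
    for k in range(8):
        x = x_ref - 3 + k
        xc = clip_h(ref_offset, ref_w, x) if wraparound else clip3(0, ref_w - 1, x)
        acc += fL[x_phase][k] * rl_pic_sample_l[y_pos_rl][xc]
    return acc >> shift1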

2.2.4.JVET-N0052

Adaptive resolution change has existed as a concept in video compression standards since at least 1996, notably in the H.263+ related proposals for reference picture resampling (RPR, Annex P) and reduced resolution update (Annex Q). It has recently received renewed attention, first in a Cisco proposal during JCT-VC, then in the context of VP9 (where it is now moderately widely deployed), and most recently in the context of VVC. ARC allows reducing the number of samples that need to be coded for a given picture, with the resulting reference picture upsampled back to a higher resolution when needed.

ARC is considered of particular interest in the following scenarios:

1) Intra-coded pictures, such as IDR pictures, are typically much larger than inter pictures. Downsampling pictures intended for intra coding may provide a better input for future prediction, whatever the reason for the intra coding. This is also clearly advantageous from a rate-control point of view, at least in low-delay applications.

2) When operating codecs near the breaking point, as at least some cable and satellite operators routinely do, ARC may become convenient even for non-intra-coded pictures, for example at scene transitions without hard cut points.

3) Looking forward, perhaps a bit too far: is the concept of a fixed resolution reasonable at all? With the demise of CRTs and the ubiquity of scaling engines in rendering devices, the hard binding between rendering and coding resolutions is a thing of the past. Furthermore, some available studies indicate that most people cannot concentrate on fine details (possibly associated with high resolution) when there is a lot of activity in the video sequence, even if the activity is elsewhere in the spatial domain. If this is true and generally accepted, fine-granularity resolution change may be a better rate-control mechanism than adaptive QP. This is put up for discussion. Abandoning the concept of fixed-resolution bitstreams has myriad system-layer and implementation implications; these implications are clear at least as to the layers where they exist, if not in their detailed nature.

Technically, ARC may be implemented as reference picture resampling. Implementing reference picture resampling has two main aspects: the resampling filters, and the signaling of the resampling information in the bitstream. This document focuses on the latter and touches on the former to the extent of available implementation experience. More study of suitable filter designs is encouraged, and any suggestions in this regard that considerably improve the strawman design provided here will be carefully considered and supported where appropriate.

Overview of Tencent's ARC implementation

Figs. 11 and 12 present Tencent's ARC encoder and decoder implementations, respectively. The implementation enables picture width and height to vary at a per-picture granularity, regardless of picture type. At the encoder, the input picture data is downsampled to the picture size selected for the current picture's encoding. After the first input picture is encoded as an intra picture, the decoded picture is stored in the Decoded Picture Buffer (DPB). When subsequent pictures are downsampled at a different sampling ratio and encoded as inter pictures, the reference picture(s) in the DPB are scaled up/down according to the spatial ratio between the reference picture size and the current picture size. At the decoder, decoded pictures are stored in the DPB without resampling. However, when used for motion compensation, the reference pictures in the DPB are scaled up/down according to the spatial ratio between the current decoded picture and the reference. The decoded picture is upsampled to the original picture size or, when needed, to the desired output picture size. In the motion estimation/compensation process, motion vectors are scaled according to the picture size ratio and the picture order count difference.
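As a rough illustration of the motion vector scaling just mentioned, the following sketch scales an MV by the spatial ratio between the reference and current picture sizes using a 14-bit fixed-point ratio; the helper names and the exact rounding are assumptions, not taken from the Tencent implementation.

def ratio_fp(ref_size, cur_size, bits=14):
    # Fixed-point ratio between a reference and the current picture dimension.
    return ((ref_size << bits) + (cur_size >> 1)) // cur_size

def scale_mv_spatial(mv, ref_w, ref_h, cur_w, cur_h, bits=14):
    mvx, mvy = mv
    sx = ratio_fp(ref_w, cur_w, bits)
    sy = ratio_fp(ref_h, cur_h, bits)
    # Round-to-nearest scaling of each component; POC-based temporal scaling
    # (as in TMVP) would be applied on top of this.
    mvx = (mvx * sx + (1 << (bits - 1))) >> bits
    mvy = (mvy * sy + (1 << (bits - 1))) >> bits
    return mvx, mvy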

Signaling of ARC parameters

The term "ARC parameters" is used herein as a combination of any parameters required for ARC to work. In the simplest case this could be a scaling factor, or an index to a table with defined scaling factors. It may be the target resolution (e.g. in samples or maximum CU size granularity) or an index to a table providing the target resolution, as proposed in jfet-M0135. But also the filter selector and even the filter parameters (up to the filter coefficients) of the up/down sampling filter used.

From the outset, implementations proposed herein allow, at least conceptually, different parts of a picture to have different ARC parameters. According to the current VVC draft, a suitable syntax structure is proposed to be a rectangular slice group (TG). Those using scan order TG will be limited to using ARC only for the full picture or to the extent that scan order TG is contained in a rectangular TG. This can be easily specified by the bitstream constraints.

Since different TGs may have different ARC parameters, the appropriate location of the ARC parameters may be in the TG header, or in a parameter set with a range of TGs and referenced by the TG header, i.e. the adaptive parameter set in the current VVC draft, or a more detailed reference (index) in a table in a higher parameter set. In these three options, it is proposed to use the TG header to codec a reference to a table entry comprising ARC parameters at this point, and the table is in SPS, with the largest table value being coded in the (upcoming) DPS. The scaling factor can be coded directly into the TG header without using any parameter set values. If per slice group signaling of ARC parameters is a design criterion, then using PPS as a reference (as proposed in jfet-M0135) is the opposite.

As for the table entries themselves, the following options are available:

- Code the downsampling factors jointly (one factor for both dimensions) or independently for the X and Y dimensions? This is mainly a (hardware) implementation discussion; some may prefer an outcome in which the scaling factor in the X dimension is quite flexible while the scaling factor in the Y dimension is fixed to 1 or to only a few choices. The proposed syntax is the wrong place to express such constraints; bitstream conformance requirements are the preferred place, if such constraints are needed at all. In other words, the flexibility of the syntax is maintained.

- It is proposed to code the target resolution. There may be more or less complex restrictions on these resolutions relative to the current resolution, possibly expressed as bitstream conformance requirements.

- Per-slice-group target resolutions are preferred, to allow picture composition/extraction. However, this is not critical from a signaling point of view. If the group makes the (arguably unwise) decision to allow ARC only at picture granularity, a bitstream conformance requirement that all TGs use the same ARC parameters can be included.

- Control information related to ARC. In the design below, the reference picture size is included.

- Is flexibility in filter design required? Are more than a handful of code points needed? If so, should they be put into the APS? In some implementations, bitstream overhead would have to be paid if the downsampling filter changed while the ALF stayed unchanged.

Currently, to keep the proposed technology consistent and simple (to the extent possible), the following is proposed:

- Fixed filter design.

- Target resolutions in a table in the SPS, with bitstream restrictions TBD.

- Minimum/maximum target resolution in the DPS, to facilitate capability exchange/negotiation.

The resulting syntax may be as follows:

Table 6: Decoder parameter set RBSP syntax

max_pic_width_in_luma_samples specifies the maximum width of decoded pictures in units of luma samples. max_pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. The value of dec_pic_width_in_luma_samples[i] cannot be greater than the value of max_pic_width_in_luma_samples.

max_pic_height_in_luma_samples specifies the maximum height of decoded pictures in units of luma samples. max_pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. The value of dec_pic_height_in_luma_samples[i] cannot be greater than the value of max_pic_height_in_luma_samples.

Table 7: Sequence parameter set RBSP syntax

adaptive_pic_resolution_change_flag equal to 1 specifies that an output picture size (output_pic_width_in_luma_samples, output_pic_height_in_luma_samples), an indication of the number of decoded picture sizes (num_dec_pic_size_in_luma_samples_minus1) and at least one decoded picture size (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]) are present in the SPS. The presence of a reference picture size (reference_pic_width_in_luma_samples, reference_pic_height_in_luma_samples) is conditioned on the value of reference_pic_size_present_flag.

output_pic_width_in_luma_samples specifies the width of the output picture in units of luma samples. output_pic_width_in_luma_samples should not be equal to 0.

output_pic_height_in_luma_samples specifies the height of the output picture in units of luma samples. output_pic_height_in_luma_samples should not be equal to 0.

reference_pic_size_present_flag equal to 1 specifies that reference_pic_width_in_luma_samples and reference_pic_height_in_luma_samples are present.

reference_pic_width_in_luma_samples specifies the width of the reference picture in units of luma samples. reference_pic_width_in_luma_samples should not be equal to 0. When not present, the value of reference_pic_width_in_luma_samples is inferred to be equal to dec_pic_width_in_luma_samples[i].

reference_pic_height_in_luma_samples specifies the height of the reference picture in units of luma samples. reference_pic_height_in_luma_samples should not be equal to 0. When not present, the value of reference_pic_height_in_luma_samples is inferred to be equal to dec_pic_height_in_luma_samples[i].

Note 1: The size of the output picture should be equal to the values of output_pic_width_in_luma_samples and output_pic_height_in_luma_samples. When the reference picture is used for motion compensation, the size of the reference picture should be equal to the values of reference_pic_width_in_luma_samples and reference_pic_height_in_luma_samples.

num_dec_pic_size_in_luma_samples_minus1 plus 1 specifies the number of decoded picture sizes (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]), in units of luma samples, in the coded video sequence.

dec_pic_width_in_luma_samples[i] specifies the i-th width, in units of luma samples, of the decoded picture sizes in the coded video sequence. dec_pic_width_in_luma_samples[i] should not be equal to 0 and should be an integer multiple of MinCbSizeY.

dec_pic_height_in_luma_samples[i] specifies the i-th height, in units of luma samples, of the decoded picture sizes in the coded video sequence. dec_pic_height_in_luma_samples[i] should not be equal to 0 and should be an integer multiple of MinCbSizeY.

Note 2: The i-th decoded picture size (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]) may be equal to the decoded picture size of a decoded picture in the coded video sequence.

Table 8: slice header grammar

dec _ pic _ size _ idx specifies that the width of a decoded picture should be equal to pic _ width _ in _ luma _ samples [ dec _ pic _ size _ idx ], and the height of a decoded picture should be equal to pic _ height _ in _ luma _ samples [ dec _ pic _ size _ idx ].
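A minimal sketch of how a decoder could resolve the coded picture size from the SPS-level size list and the dec_pic_size_idx in the slice group header; the container types and names are illustrative.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SpsSizes:
    # (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i])
    dec_pic_sizes: List[Tuple[int, int]] = field(default_factory=list)

def resolve_picture_size(sps: SpsSizes, dec_pic_size_idx: int) -> Tuple[int, int]:
    return sps.dec_pic_sizes[dec_pic_size_idx]

sps = SpsSizes(dec_pic_sizes=[(1920, 1080), (960, 540)])
assert resolve_picture_size(sps, 1) == (960, 540)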

Filter

The proposed design conceptually includes four different sets of filters: a downsampling filter from the original picture to the input picture, downsampling and upsampling filters to rescale reference pictures for motion estimation/compensation, and an upsampling filter from the decoded picture to the output picture. The first and the last can be treated as non-normative matters. Within the scope of the specification, the up/downsampling filters need to be either signaled explicitly in an appropriate parameter set or predefined.

Our implementation uses the downsampling filter of SHVC (SHM ver. 12.4), a 12-tap 2D-separable filter, for downsampling to adjust the size of reference pictures for motion compensation. In the current implementation, only dyadic sampling is supported; accordingly, the phase of the downsampling filter is set to zero by default. For upsampling, an 8-tap interpolation filter with 16 phases is used to shift the phase and align the luma and chroma sample positions with the original positions.

Tables 9 and 10 provide the 8-tap filter coefficients fL[p, x], with p = 0..15 and x = 0..7, used for the luma upsampling process, and the 4-tap filter coefficients fC[p, x], with p = 0..15 and x = 0..3, used for the chroma upsampling process.

Table 11 provides the 12-tap filter coefficients for the downsampling process. The same filter coefficients are used for luma and chroma downsampling.

Table 9: brightness upsampling filter with 16 phases

Table 10: chroma upsampling filter with 16 phases

Table 11: downsampled filter coefficients for luminance and chrominance

When using filters adapted to the content and/or the scaling factor, a (possibly significant) subjective and objective gain can be expected.

Slice group boundary discussion

As is perhaps true for much of the slice-group-related work, our implementation of slice group (TG) based ARC is not yet finished. We intend to revisit the implementation once the discussion of spatial composition and extraction of multiple sub-pictures into a composite picture in the compressed domain has produced at least one working draft. This does not prevent us from extrapolating the results to some extent and adjusting our signaling design accordingly.

Currently, the slice group header is the right place for the dec_pic_size_idx proposed above, for the reasons already stated. A single ue(v) code point, dec_pic_size_idx, is conditionally present in the slice group header to indicate the ARC parameters employed. To match a per-picture-only ARC implementation in the normative space, either only a single slice group per picture must be coded, or it must be a condition of bitstream conformance that all TG headers of a given coded picture have the same value of dec_pic_size_idx (when present).

The parameter dec_pic_size_idx may be moved into whatever header starts a sub-picture. That may well continue to be the slice group header.

Beyond these syntax considerations, some additional work is needed to enable slice-group- or sub-picture-based ARC. Perhaps the most difficult part is the question of how to handle the samples of a picture that are not needed when a sub-picture has been resampled to a smaller size.

Fig. 13 shows an example of slice-group-based ARC resampling. Consider the right-hand picture, which consists of four sub-pictures (possibly represented as four rectangular slice groups in the bitstream syntax). On the left, the bottom-right TG has been subsampled to half size. How the samples outside the relevant area, labeled "Half", are to be processed needs to be discussed.

Many (perhaps most) video coding standards and implementations assume that a coded picture has no "holes": each sample of the picture is represented by one or more syntax elements, and each syntax element affects at least one sample. To maintain this, it may be necessary to somehow fill the area around the samples covered by the downsampled TG labeled "Half". H.263+ Annex P solves this problem by padding; in fact, the sample values of the padding samples may be signaled in the bitstream (within certain strict limits).

An alternative, which may constitute a significant departure from previous assumptions but may be needed anyway to support sub-bitstream extraction (and composition) based on rectangular parts of a picture, is to relax the current understanding that every sample of the reconstructed picture must be represented by something in the coded picture (even if that something is just a skipped block).

Implementation considerations, system implications and profiles/levels

It is proposed that basic ARC be included in the "baseline/main" profiles. Sub-profiling may be used to remove it for application scenarios that do not need it. Certain restrictions may be acceptable. In this respect, it is noted that certain H.263+ profiles and "recommended modes" (which predate profiles) included the restriction that Annex P could only be used as an "implicit factor of 4", i.e. dyadic downsampling in both dimensions. That was sufficient to support fast start (getting the complete I frame through quickly) in video conferencing.

The design allows all filtering to be done "on the fly", with no or only a negligible increase in memory bandwidth. For the moment, it does not seem necessary to confine ARC to particular profiles.

Complex tables and the like may not be usable efficiently for capability exchange, as discussed in Marrakech in connection with JVET-M0135. Given offer-answer and similarly limited-depth handshakes, the number of options is far too large to allow meaningful cross-vendor interoperability. Realistically, to support ARC meaningfully in a capability-exchange scenario, we have to fall back to at most a handful of interoperability points, for example: no ARC; ARC with implicit factor of 4; full ARC. Alternatively, support for all of ARC could be required, with bitstream complexity constraints left to higher-layer SDOs. In any case, this is a strategic discussion to be had at some point (beyond what has already been discussed in the context of sub-profiles and flags).

As for levels: the basic design principle needs to be that, as a condition of bitstream conformance, the sample count of a picture after upsampling must fit within the level of the bitstream no matter what upsampling is signaled in the bitstream, for all coded pictures. Note that this was not the case in H.263+, where certain samples could be absent.

2.2.5.JVET-N0118

Several aspects are proposed:

1. A list of picture resolutions is signaled in the SPS, and an index into the list is signaled in the PPS to specify the size of an individual picture.

2. For any picture to be output, the decoded picture before resampling is cropped (as needed) and output; i.e., a resampled picture is not used for output, only for inter prediction reference.

3. Resampling ratios of 1.5x and 2x are supported. Arbitrary resampling ratios are not supported. Whether one or more other resampling ratios are needed is for further study.

4. Between picture-level resampling and block-level resampling, the proponents prefer block-level resampling.

a. However, if picture-level resampling is chosen, the following aspects are proposed:

i. When a reference picture is resampled, both the resampled version and the original, non-resampled version of the reference picture are stored in the DPB, and thus both affect DPB fullness.

ii. A resampled reference picture is marked as "unused for reference" when the corresponding non-resampled reference picture is marked as "unused for reference".

iii. The RPL signaling syntax is kept unchanged, while the RPL construction process is modified as follows: when a reference picture needs to be included in an RPL entry and a version of that reference picture with the same resolution as the current picture is not in the DPB, the picture resampling process is invoked and the resampled version of the reference picture is included in the RPL entry.

iv. The number of resampled reference pictures that may be present in the DPB should be limited, e.g., to less than or equal to 2.

b. Otherwise (block-level resampling is chosen), the following is suggested:

i. To limit worst-case decoder complexity, bi-prediction of a block from a reference picture with a resolution different from that of the current picture is not allowed.

ii. Another option is to combine the two filters and apply the operation in one step when both resampling and quarter-pixel interpolation are needed.

5. Whichever of the picture-based and block-based resampling methods is chosen, it is proposed that temporal motion vector scaling be applied as needed.

2.2.5.1. Implementation

The ARC software was implemented on top of VTM-4.0.1, with the following changes:

- A list of supported resolutions is signaled in the SPS.

- Spatial resolution signaling was moved from the SPS to the PPS.

- A picture-based resampling scheme was implemented for resampling reference pictures. After a picture is decoded, the reconstructed picture may be resampled to a different spatial resolution. Both the original reconstructed picture and the resampled reconstructed picture are stored in the DPB and can be referenced by future pictures in decoding order.

- The implemented resampling filters are based on the filters tested in JCTVC-H0234, as follows (a sketch of their application is given after this list):

- Upsampling filter: a 4-tap, +/- quarter-phase DCTIF with taps (-4, 54, 16, -2)/64

- Downsampling filter: an H11 filter with taps (1, 0, -3, 0, 10, 16, 10, 0, -3, 0, 1)/32

- When constructing the reference picture lists of the current picture (i.e., L0 and L1), only reference pictures with the same resolution as the current picture are used. Note that a reference picture may be available in its original size or in a resampled size.

- TMVP and ATMVP may be enabled; however, when the original coded resolutions of the current picture and a reference picture differ, TMVP and ATMVP are disabled for that reference picture.

- For ease and simplicity of the starting-point software implementation, the decoder outputs the highest available resolution when outputting a picture.
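The following is a minimal sketch of applying the downsampling filter listed above, assuming simple dyadic (2:1) horizontal operation with zero phase and clamped borders; the border handling and helper names are simplifications, not taken from JCTVC-H0234.

DOWN_TAPS = [1, 0, -3, 0, 10, 16, 10, 0, -3, 0, 1]  # taps sum to 32, so normalize by >> 5

def filter_at(samples, taps, center, norm_shift):
    # Apply a centered odd-length FIR filter with border clamping.
    half = len(taps) // 2
    acc = 0
    for k, t in enumerate(taps):
        idx = min(max(center + k - half, 0), len(samples) - 1)
        acc += t * samples[idx]
    return acc >> norm_shift

def downsample_2to1(row):
    # Filter, then keep every other sample.
    return [filter_at(row, DOWN_TAPS, 2 * i, 5) for i in range(len(row) // 2)]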

2.2.5.2. Signaling of picture size and picture output

1. List of spatial resolutions for coded pictures in a bitstream

Currently, all coded pictures in a CVS have the same resolution, so it is straightforward to signal just one resolution (i.e. picture width and height) in the SPS. With ARC support, a list of picture resolutions needs to be signaled instead of a single one. It is proposed to signal this list in the SPS, and an index into the list in the PPS, to specify the size of an individual picture.

2. Regarding picture output

It is proposed that, for any picture to be output, the decoded picture before resampling is cropped (as needed) and output; i.e., a resampled picture is not used for output, only for inter prediction reference. ARC resampling filters need to be designed to optimize the use of resampled pictures for inter prediction, and such filters may not be optimal for picture output/display purposes, whereas video terminal devices typically already have optimized output zooming/scaling functionality implemented.

2.2.5.3. With respect to resampling

Resampling of a decoded picture can be either picture-based or block-based. For the final ARC design in VVC, block-based resampling is preferred over picture-based resampling. It is recommended that the two approaches be discussed and that JVET decide which of the two is to be specified for ARC support in VVC.

Picture-based resampling

In picture-based resampling for ARC, a picture is resampled only once for a given resolution and then stored in the DPB, while the non-resampled version of the same picture is kept in the DPB as well.

There are two issues with picture-based resampling for ARC: 1) additional DPB buffer space is needed to store resampled reference pictures, and 2) additional memory bandwidth is needed because of the increased reading of reference picture data from, and writing of reference picture data to, the DPB.

Keeping only one version of a reference picture in the DPB is not a good idea for picture-based resampling. If only the non-resampled version is kept, a reference picture may need to be resampled multiple times, since multiple pictures may refer to the same reference picture. On the other hand, if a reference picture is resampled and only the resampled version is kept, inverse resampling must be applied when the reference picture is to be output, since, as described above, it is preferable to output non-resampled pictures. This is a problem because the resampling process is not a lossless operation. Take a picture A, downsample it, and then upsample it to obtain A' at the same resolution as A: A and A' are not the same, and A' contains less information than A, because some high-frequency information is lost during the downsampling and upsampling.
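A toy numeric illustration of this information loss (an averaging downsampler and a sample-repeat upsampler are chosen purely for simplicity):

A = [10, 200, 10, 200, 10, 200, 10, 200]             # alternating high-frequency signal
down = [(A[2 * i] + A[2 * i + 1]) // 2 for i in range(len(A) // 2)]  # [105, 105, 105, 105]
A_prime = [v for v in down for _ in range(2)]        # upsample back to 8 samples
assert A_prime != A                                  # the alternating detail is lost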

To address the additional DPB buffer and memory bandwidth issues, it is proposed that if the ARC design in VVC uses picture-based resampling, the following applies:

1. When a reference picture is resampled, both the resampled version and the original, non-resampled version of the reference picture are stored in the DPB, and thus both affect DPB fullness.

2. A resampled reference picture is marked as "unused for reference" when the corresponding non-resampled reference picture is marked as "unused for reference".

3. The Reference Picture List (RPL) of each slice group contains reference pictures with the same resolution as the current picture. While the RPL signaling syntax needs no change, the RPL construction process is modified to ensure this, as follows: when a reference picture needs to be included in an RPL entry but a version of that reference picture with the same resolution as the current picture is not yet available, the picture resampling process is invoked and the resampled version of the reference picture is included.

4. The number of resampled reference pictures that may be present in the DPB should be limited, e.g., to less than or equal to 2.

Furthermore, to enable the use of temporal MVs (e.g., in Merge mode and ATMVP) when a temporal MV comes from a reference frame with a resolution different from the current one, it is proposed to scale the temporal MV to the current resolution as needed, as sketched below.
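A minimal sketch of such temporal MV scaling to the current resolution; the rounding convention here is an assumption.

def scale_temporal_mv(mv, src_w, src_h, cur_w, cur_h):
    # Scale each MV component by the ratio of current to source picture dimensions.
    mvx, mvy = mv
    mvx = (mvx * cur_w + (src_w >> 1)) // src_w
    mvy = (mvy * cur_h + (src_h >> 1)) // src_h
    return mvx, mvy

# e.g. an MV taken from a 1920x1080 picture, reused in a 960x540 picture:
assert scale_temporal_mv((16, -8), 1920, 1080, 960, 540) == (8, -4)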

Block-based ARC resampling

In block-based resampling for ARC, a reference block is resampled whenever necessary, and no resampled picture is stored in the DPB.

The main issue here is the additional decoder complexity. This is because a block in a reference picture may be referenced multiple times, by multiple blocks in another picture and by blocks in multiple pictures.

When a block in a reference picture is referenced by a block in the current picture and the resolutions of the reference picture and the current picture differ, the reference block is resampled by invoking an interpolation filter, so that the reference block is obtained at integer-pixel resolution. When the motion vector points to a quarter-pixel position, the interpolation process is invoked again to obtain the resampled reference block at quarter-pixel resolution. Thus, for each motion compensation operation of a current block from a reference block at a different resolution, up to two interpolation filtering operations are needed instead of one. Without ARC support, at most one interpolation filtering operation (i.e., generating the reference block at quarter-pixel resolution) is needed.

To limit the worst-case complexity, it is proposed that if the ARC design in VVC uses block-based resampling, the following applies:

- Bi-prediction of a block from a reference picture with a resolution different from that of the current picture is not allowed.

More precisely, the constraint is as follows: for a current block blkA in the current picture picA to refer to a reference block blkB in a reference picture picB, when picA and picB have different resolutions, blkA should be a uni-directionally predicted block.

With this constraint, the worst-case number of interpolation operations needed to decode a block is limited to two. If a block references a block from a picture of a different resolution, the number of interpolation operations needed is two, as described above. This is the same as when the block references a reference block from a same-resolution picture and is coded as a bi-predicted block, since the number of interpolation operations is then also two (i.e., one for obtaining the quarter-pixel-resolution block from each reference picture).

To simplify the implementation, another variant is proposed: if the ARC design in VVC uses block-based resampling, the following applies:

If the reference frame and the current frame have different resolutions, the corresponding positions of all pixels of the prediction are first calculated, and then interpolation is applied only once. That is, the two interpolation operations (one for resampling and one for quarter-pixel interpolation) are combined into a single interpolation operation. The sub-pixel interpolation filters in the current VVC can be reused, but in this case the granularity of the interpolation should be enlarged, while the number of interpolation operations is reduced from two to one.

To enable the use of temporal MVs (e.g., in Merge mode and ATMVP) when the temporal MV comes from a reference frame with a resolution different from the current one, it is proposed to scale the temporal MV to the current resolution as needed.

Resampling ratio

In JVET-M0135, to start the discussion on ARC, it was proposed that only 2x resampling ratios (meaning 2×2 for upsampling and 1/2×1/2 for downsampling) be considered as a starting point. Further discussion of the topic after the Marrakech meeting found that supporting only 2x resampling ratios is quite limiting, since in some cases a smaller difference between the resampled and non-resampled resolutions would be more beneficial.

While support for arbitrary resampling ratios may be desirable, it appears hard to achieve. This is because the number of resampling filters that would have to be defined and implemented to support arbitrary resampling ratios appears too large and would place a heavy burden on decoder implementations.

It is proposed that more than one, but only a small number of, resampling ratios be supported: at least the 1.5x and 2x resampling ratios, and no arbitrary resampling ratios.

2.2.5.4. Maximum DPB buffer size and buffer fullness

With ARC, the DPB may contain decoded pictures of different spatial resolutions within the same CVS. For DPB management and related aspects, counting DPB size and fullness in units of decoded pictures no longer works.

The following is a discussion of some specific aspects and possible solutions that need to be addressed in the final VVC specification if ARC is supported:

1. Instead of using the value of PicSizeInSamplesY (PicSizeInSamplesY = pic_width_in_luma_samples * pic_height_in_luma_samples) to derive MaxDpbSize (i.e., the maximum number of reference pictures that may be present in the DPB), MaxDpbSize is derived based on the value of MinPicSizeInSamplesY, i.e. the product of the width and height of the smallest picture resolution in the bitstream.

2. Each decoded picture is associated with a value called PictureSizeUnit. PictureSizeUnit is an integer value that specifies the size of a decoded picture relative to MinPicSizeInSamplesY. The definition of PictureSizeUnit depends on the resampling ratios that ARC supports in VVC.

For example, if ARC supports only the resampling ratio of 2, PictureSizeUnit is defined as follows:

- A decoded picture with the smallest resolution in the bitstream is associated with a PictureSizeUnit of 1.

- A decoded picture with 2-by-2 the smallest resolution in the bitstream is associated with a PictureSizeUnit of 4 (i.e., 1 * 4).

As another example, if ARC supports both the resampling ratios 1.5 and 2, PictureSizeUnit is defined as follows:

- A decoded picture with the smallest resolution in the bitstream is associated with a PictureSizeUnit of 4.

- A decoded picture with 1.5-by-1.5 the smallest resolution in the bitstream is associated with a PictureSizeUnit of 9 (i.e., 2.25 * 4).

- A decoded picture with 2-by-2 the smallest resolution in the bitstream is associated with a PictureSizeUnit of 16 (i.e., 4 * 4).

For any other resampling ratios supported by ARC, the PictureSizeUnit values for each picture size should be determined using the same principle as in the examples above.

3. Let the variable MinPictureSizeUnit be the smallest possible value of PictureSizeUnit. That is, if ARC supports only the resampling ratio of 2, MinPictureSizeUnit is 1; if ARC supports the resampling ratios 1.5 and 2, MinPictureSizeUnit is 4; and likewise, the same principle is used to determine the value of MinPictureSizeUnit in all other cases.

4. The value of sps_max_dec_pic_buffering_minus1[i] is in the range of 0 to MinPictureSizeUnit * (MaxDpbSize - 1), where MinPictureSizeUnit is the smallest possible value of PictureSizeUnit.

5. The DPB fullness operations are specified based on PictureSizeUnit as follows (a minimal bookkeeping sketch follows this list):

- The HRD is initialized at decoding unit 0, with both the CPB and the DPB set to be empty (DPB fullness set to 0).

- When the DPB is flushed (i.e., all pictures are removed from the DPB), DPB fullness is set to 0.

- When a picture is removed from the DPB, DPB fullness is decremented by the PictureSizeUnit value associated with the removed picture.

- When a picture is inserted into the DPB, DPB fullness is incremented by the PictureSizeUnit value associated with the inserted picture.
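The sketch below models this PictureSizeUnit-based bookkeeping; the class and method names are illustrative.

class Dpb:
    # Minimal DPB fullness bookkeeping in PictureSizeUnit units.
    def __init__(self):
        self.fullness = 0
        self.pictures = []          # list of (picture_id, picture_size_unit)

    def flush(self):
        # Removing all pictures resets fullness to 0.
        self.pictures.clear()
        self.fullness = 0

    def insert(self, pic_id, picture_size_unit):
        self.pictures.append((pic_id, picture_size_unit))
        self.fullness += picture_size_unit

    def remove(self, pic_id):
        for i, (pid, psu) in enumerate(self.pictures):
            if pid == pic_id:
                del self.pictures[i]
                self.fullness -= psu
                return

# With ratios 1.5 and 2 supported: smallest picture -> 4, a 2x2 picture -> 16.
dpb = Dpb()
dpb.insert("pic0", 4)
dpb.insert("pic1", 16)
assert dpb.fullness == 20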

2.2.5.5. Resampling filter

In the software implementation, the resampling filters were simply taken from the previously available filters described in JCTVC-H0234. Other resampling filters should be tested and used if they provide better performance and/or lower complexity. Various resampling filters are proposed to be tested to strike a trade-off between complexity and performance. Such testing can be done in a CE.

2.2.5.6. Other necessary modifications of existing tools

To support ARC, some modifications and/or additional operations may be needed for some existing coding tools. For example, in the picture-based resampling implemented in the ARC software, TMVP and ATMVP are disabled, for simplicity, when the original coded resolutions of the current picture and the reference picture differ.

2.2.6.JVET-N0279

According to the "requirements of the future video codec standard", the standard should support fast representation switching in case of an adaptive streaming service providing multiple representations of the same content, each representation having different properties (e.g. spatial resolution or sampling bit depth). In real-time video communication, resolution is allowed to change in a coding and decoding video sequence without inserting I pictures, so that video data can be seamlessly adapted to dynamic channel conditions or user preferences, and the jumping effect caused by the I pictures can be eliminated. Fig. 14 shows a hypothetical example of adaptive resolution change, where a current picture is predicted from a reference picture of a different size.

This proposal proposes a high level syntax to signal adaptive resolution changes and modifications to the current motion compensated prediction process in the VTM. These modifications are limited to motion vector scaling and sub-pixel position derivation without changing the existing motion compensated interpolator. This would allow existing motion compensated interpolators to be reused and no new processing blocks are needed to support the adaptive resolution change, which would introduce additional cost.

2.2.6.1. Adaptive resolution change signaling

Deleted text is enclosed in double brackets.

Table 12: SPS

[[pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY.

pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY.]]

max_pic_width_in_luma_samples specifies the maximum width, in units of luma samples, of decoded pictures referring to the SPS. max_pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY.

max_pic_height_in_luma_samples specifies the maximum height, in units of luma samples, of decoded pictures referring to the SPS. max_pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY.

Table 13: PPS

pic_size_different_from_max_flag equal to 1 specifies that the PPS signals a picture width or picture height different from max_pic_width_in_luma_samples and max_pic_height_in_luma_samples in the referenced SPS. pic_size_different_from_max_flag equal to 0 specifies that pic_width_in_luma_samples and pic_height_in_luma_samples are the same as max_pic_width_in_luma_samples and max_pic_height_in_luma_samples in the referenced SPS.

pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. When pic_width_in_luma_samples is not present, it is inferred to be equal to max_pic_width_in_luma_samples.

pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples should not be equal to 0 and should be an integer multiple of MinCbSizeY. When pic_height_in_luma_samples is not present, it is inferred to be equal to max_pic_height_in_luma_samples.

It is a requirement of bitstream conformance that the horizontal and vertical scaling ratios be in the range of 1/8 to 2, inclusive, for every active reference picture. The scaling ratios are defined as follows:

horizontal_scaling_ratio = ((reference_pic_width_in_luma_samples << 14) + (pic_width_in_luma_samples / 2)) / pic_width_in_luma_samples

vertical_scaling_ratio = ((reference_pic_height_in_luma_samples << 14) + (pic_height_in_luma_samples / 2)) / pic_height_in_luma_samples

Table 14

Reference picture scaling procedure

When the resolution changes within a CVS, a picture may have a different size from one or more of its reference pictures. This proposal normalizes all motion vectors to the current picture grid rather than to their corresponding reference picture grids. This is asserted to be beneficial for keeping the design consistent and for making resolution changes transparent to the motion vector prediction process. Otherwise, neighboring motion vectors pointing to reference pictures of different sizes could not be used directly for spatial motion vector prediction, due to the different scaling.

When the resolution changes, both the motion vectors and the reference blocks must be scaled when performing motion-compensated prediction. The scaling range is limited to [1/8, 2]; i.e., upscaling is limited to 1:8 and downscaling is limited to 2:1. Note that upscaling refers to the case where the reference picture is smaller than the current picture, while downscaling refers to the case where the reference picture is larger than the current picture. The following sections describe the scaling process in more detail.

Luma block

The scaling factors and their fixed-point representations are defined as

hori_scale_fp = ((reference_pic_width_in_luma_samples << 14) + (pic_width_in_luma_samples / 2)) / pic_width_in_luma_samples

vert_scale_fp = ((reference_pic_height_in_luma_samples << 14) + (pic_height_in_luma_samples / 2)) / pic_height_in_luma_samples

The scaling process includes two parts:

1. Mapping the top-left pixel of the current block to the reference picture;

2. Addressing the reference positions of the other pixels of the current block using horizontal and vertical step sizes.

If the coordinates of the top-left pixel of the current block are (x, y), the sub-pixel position (x', y') in the reference picture pointed to by the motion vector (mvX, mvY), in units of 1/16 pixel, is specified as follows:

The horizontal position in the reference picture is

x' = ((x << 4) + mvX) * hori_scale_fp    (Equation 3)

and x' is further reduced to keep only 10 fractional bits:

x' = Sign(x') * ((Abs(x') + (1 << 7)) >> 8)    (Equation 4)

Likewise, the vertical position in the reference picture is

y' = ((y << 4) + mvY) * vert_scale_fp    (Equation 5)

and y' is further reduced to

y' = Sign(y') * ((Abs(y') + (1 << 7)) >> 8)    (Equation 6)

At this point, the reference position of the top-left pixel of the current block is (x', y'). The other reference sub-pixel/pixel positions are computed relative to (x', y') using horizontal and vertical step sizes. These step sizes are derived from the horizontal and vertical scaling factors above with 1/1024-pixel accuracy, as follows:

x_step = (hori_scale_fp + 8) >> 4    (Equation 7)

y_step = (vert_scale_fp + 8) >> 4    (Equation 8)

For example, if a pixel in the current block is i columns and j rows away from the top-left pixel, the horizontal and vertical coordinates of its corresponding reference pixel are derived as

x'_i = x' + i * x_step    (Equation 9)

y'_j = y' + j * y_step    (Equation 10)

In sub-pixel interpolation, x'_i and y'_j must be decomposed into a full-pixel part and a fractional-pixel part:

The full-pixel part used for addressing the reference block is equal to

(x'_i + 32) >> 10    (Equation 11)

(y'_j + 32) >> 10    (Equation 12)

The fractional-pixel part used for selecting the interpolation filter is equal to

Δx = ((x'_i + 32) >> 6) & 15    (Equation 13)

Δy = ((y'_j + 32) >> 6) & 15    (Equation 14)

Once the full-pixel and fractional-pixel positions in the reference picture are determined, the existing motion compensation interpolators can be used without any additional changes. The full-pixel position is used to fetch the reference block patch from the reference picture, and the fractional-pixel position is used to select the appropriate interpolation filter.
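The luma derivation in Equations 3 to 14 can be summarized by the following sketch, which mirrors the bit-exact operations above (hori_scale_fp/vert_scale_fp carry 14 fractional bits, positions carry 10 fractional bits, and MVs are in 1/16-pixel units); the function names are illustrative.

def sign(v):
    return -1 if v < 0 else (1 if v > 0 else 0)

def reduce_to_10_bits(v):
    # Equations 4 and 6: keep 10 fractional bits with round-to-nearest on |v|.
    return sign(v) * ((abs(v) + (1 << 7)) >> 8)

def luma_ref_positions(x, y, mvx, mvy, hori_scale_fp, vert_scale_fp, i, j):
    xp = reduce_to_10_bits(((x << 4) + mvx) * hori_scale_fp)       # Eq. 3-4
    yp = reduce_to_10_bits(((y << 4) + mvy) * vert_scale_fp)       # Eq. 5-6
    x_step = (hori_scale_fp + 8) >> 4                              # Eq. 7
    y_step = (vert_scale_fp + 8) >> 4                              # Eq. 8
    xi = xp + i * x_step                                           # Eq. 9
    yj = yp + j * y_step                                           # Eq. 10
    full_x, full_y = (xi + 32) >> 10, (yj + 32) >> 10              # Eq. 11-12
    frac_x, frac_y = ((xi + 32) >> 6) & 15, ((yj + 32) >> 6) & 15  # Eq. 13-14
    return (full_x, full_y), (frac_x, frac_y)

# Sanity check: at 1:1 scale (1 << 14) with zero MV, pixel (i, j) of a block at
# (x, y) maps to full-pixel (x + i, y + j) with zero fractional part.
assert luma_ref_positions(64, 32, 0, 0, 1 << 14, 1 << 14, 3, 2) == ((67, 34), (0, 0))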

Chroma block

When the chroma format is 4:2:0, chroma motion vectors have 1/32-pixel accuracy. The scaling process for chroma motion vectors and chroma reference blocks is almost the same as for luma blocks, except for chroma-format-related adjustments.

When the coordinates of the top-left pixel of the current chroma block are (xc, yc), the initial horizontal and vertical positions in the reference chroma picture are

xc' = ((xc << 5) + mvX) * hori_scale_fp    (Equation 15)

yc' = ((yc << 5) + mvY) * vert_scale_fp    (Equation 16)

where mvX and mvY are the original luma motion vectors, now examined with 1/32-pixel accuracy.

xc' and yc' are further reduced to maintain 1/1024-pixel accuracy:

xc' = Sign(xc') * ((Abs(xc') + (1 << 8)) >> 9)    (Equation 17)

yc' = Sign(yc') * ((Abs(yc') + (1 << 8)) >> 9)    (Equation 18)

Compared with the corresponding luma equations, the right shift is larger by one extra bit.

The step sizes used are the same as for luma. For a chroma pixel located at (i, j) relative to the top-left pixel, the horizontal and vertical coordinates of its reference pixel are derived as

xc'_i = xc' + i * x_step    (Equation 19)

yc'_j = yc' + j * y_step    (Equation 20)

In sub-pixel interpolation, xc'_i and yc'_j are also decomposed into a full-pixel part and a fractional-pixel part:

The full-pixel part used for addressing the reference block is equal to

(xc'_i + 16) >> 10    (Equation 21)

(yc'_j + 16) >> 10    (Equation 22)

The fractional-pixel part used for selecting the interpolation filter is equal to

Δx = ((xc'_i + 16) >> 5) & 31    (Equation 23)

Δy = ((yc'_j + 16) >> 5) & 31    (Equation 24)
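The chroma counterpart can be sketched the same way; note the initial left-shift of 5, the reduction shift of 9, and the 1/32-pixel fractional part, which are the only differences from the luma sketch above.

def chroma_ref_positions(xc, yc, mvx, mvy, hori_scale_fp, vert_scale_fp, i, j):
    def sign(v):
        return -1 if v < 0 else (1 if v > 0 else 0)
    def reduce9(v):                                   # Eq. 17-18
        return sign(v) * ((abs(v) + (1 << 8)) >> 9)
    xcp = reduce9(((xc << 5) + mvx) * hori_scale_fp)  # Eq. 15, 17
    ycp = reduce9(((yc << 5) + mvy) * vert_scale_fp)  # Eq. 16, 18
    x_step = (hori_scale_fp + 8) >> 4                 # same step sizes as luma
    y_step = (vert_scale_fp + 8) >> 4
    xci, ycj = xcp + i * x_step, ycp + j * y_step     # Eq. 19-20
    full = ((xci + 16) >> 10, (ycj + 16) >> 10)       # Eq. 21-22
    frac = (((xci + 16) >> 5) & 31, ((ycj + 16) >> 5) & 31)  # Eq. 23-24
    return full, frac

# Sanity check at 1:1 scale with zero MV:
assert chroma_ref_positions(8, 4, 0, 0, 1 << 14, 1 << 14, 0, 0) == ((8, 4), (0, 0))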

Interaction with other codec tools

Because some coding tools interact with reference picture scaling at the cost of additional complexity and memory bandwidth, it is proposed to add the following restrictions to the VVC specification:

When tile_group_temporal_mvp_enabled_flag is equal to 1, the current picture and its collocated picture should have the same size.

When resolution change is allowed within a sequence, decoder-side motion vector refinement should be turned off.

When resolution change is allowed within a sequence, sps_bdof_enabled_flag should be equal to 0.

2.3. Coding Tree Block (CTB) based Adaptive Loop Filter (ALF) in JVET-N0415

Slice-level temporal filter

Adaptive Parameter Sets (APS) were adopted in VTM4. Each APS contains one set of signaled ALF filters, and up to 32 APSs are supported. In this proposal, a slice-level temporal filter is tested. A slice group can reuse the ALF information from an APS to reduce the overhead. The APSs are updated as a first-in-first-out (FIFO) buffer.

CTB-based ALF

For the luma component, when ALF is applied to a luma CTB, a choice is made among 16 fixed filter sets, 5 temporal filter sets, and 1 signaled filter set. Only the filter set index is signaled. For one slice, only one new set of 25 filters can be signaled. If a new set is signaled for a slice, all luma CTBs in the same slice share that set. The fixed filter sets may be used to predict the new slice-level filter set and may also be used as candidate filter sets for a luma CTB. The total number of filters is 64.

For the chroma component, when ALF is applied to a chroma CTB, the CTB uses a new filter if one is signaled for the slice; otherwise, the most recent temporal chroma filter satisfying the temporal scalability constraint is applied.

As with the slice-level temporal filter, the APSs are updated as a first-in-first-out (FIFO) buffer.

2.4. Alternative temporal motion vector prediction (also known as subblock-based temporal Merge candidate in VVC)

In the Alternative Temporal Motion Vector Prediction (ATMVP) method, Temporal Motion Vector Prediction (TMVP) is modified by fetching multiple sets of motion information (including motion vectors and reference indices) from blocks smaller than the current CU. As shown in Fig. 14, the sub-CUs are square N×N blocks (N is set to 8 by default).

ATMVP predicts the motion vectors of the sub-CUs within a CU in two steps. The first step is to identify the corresponding block in a reference picture with a so-called temporal vector; the reference picture is called the motion source picture. The second step is to split the current CU into sub-CUs and obtain the motion vectors and reference indices of each sub-CU from the block corresponding to that sub-CU, as shown in Fig. 15, which gives an example of ATMVP motion prediction for a CU.

In the first step, the reference picture and the corresponding block are determined from the motion information of the spatially neighboring blocks of the current CU. To avoid a repetitive scanning process over neighboring blocks, the Merge candidate from block A0 (the left block) in the Merge candidate list of the current CU is used. The first available motion vector from block A0 that references the collocated reference picture is set as the temporal vector. In this way, the corresponding block can be identified more accurately in ATMVP than in TMVP, where the corresponding block (sometimes called the collocated block) is always at the bottom-right or center position relative to the current CU.

In the second step, the corresponding block of each sub-CU is identified by the temporal vector in the motion source picture, i.e., by adding the temporal vector to the coordinates of the current CU. For each sub-CU, the motion information of its corresponding block (the smallest motion grid covering the center sample) is used to derive the motion information of the sub-CU. After the motion information of the corresponding N×N block has been identified, it is converted into the motion vectors and reference indices of the current sub-CU in the same way as TMVP in HEVC, with motion scaling and other procedures applied.
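The two steps can be summarized in the following self-contained sketch, where the motion field of the motion source picture is modeled as a dict keyed by 4×4 grid positions; all helper names are illustrative, and the final TMVP-style scaling is left as a comment.

def atmvp_sub_cu_motion(cu_x, cu_y, cu_w, cu_h, a0_mv, a0_refs_collocated,
                        source_motion_field, n=8):
    # Step 1: the temporal vector is the MV of Merge candidate A0, provided it
    # references the collocated (motion source) picture.
    if not a0_refs_collocated:
        return None
    tvx, tvy = a0_mv
    # Step 2: for each NxN sub-CU, fetch the motion of the corresponding block,
    # i.e. the 4x4 motion grid covering the sub-CU's center sample.
    sub_motion = {}
    for sy in range(0, cu_h, n):
        for sx in range(0, cu_w, n):
            cx = (cu_x + sx + n // 2 + tvx) & ~3   # snap to the 4x4 motion grid
            cy = (cu_y + sy + n // 2 + tvy) & ~3
            # The fetched (mv, ref_idx) would then be scaled as in HEVC TMVP.
            sub_motion[(sx, sy)] = source_motion_field.get((cx, cy))
    return sub_motion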

2.5. Affine motion prediction

In HEVC, only a translational motion model is applied for Motion Compensated Prediction (MCP). In the real world, there are many kinds of motion, such as zoom in/out, rotation, perspective motion and other irregular motions. In VVC, a simplified affine transform motion compensated prediction is applied, with a 4-parameter affine model and a 6-parameter affine model. Figs. 16a and 16b show the simplified 4-parameter and 6-parameter affine motion models, respectively. As shown there, the affine motion field of a block is described by two Control Point Motion Vectors (CPMVs) for the 4-parameter affine model and by three CPMVs for the 6-parameter affine model.

The Motion Vector Field (MVF) of a block is described by the following equations: the 4-parameter affine model in equation (1) (in which the 4 parameters are defined as the variables a, b, e and f) and the 6-parameter affine model in equation (2) (in which the 6 parameters are defined as the variables a, b, c, d, e and f):

mv^h(x,y) = a*x - b*y + e = ((mv1^h - mv0^h)/w)*x - ((mv1^v - mv0^v)/w)*y + mv0^h
mv^v(x,y) = b*x + a*y + f = ((mv1^v - mv0^v)/w)*x + ((mv1^h - mv0^h)/w)*y + mv0^v    (1)

mv^h(x,y) = a*x + c*y + e = ((mv1^h - mv0^h)/w)*x + ((mv2^h - mv0^h)/h)*y + mv0^h
mv^v(x,y) = b*x + d*y + f = ((mv1^v - mv0^v)/w)*x + ((mv2^v - mv0^v)/h)*y + mv0^v    (2)

where (mv0^h, mv0^v) is the motion vector of the top-left corner control point, (mv1^h, mv1^v) is the motion vector of the top-right corner control point, and (mv2^h, mv2^v) is the motion vector of the bottom-left corner control point; all three are referred to as Control Point Motion Vectors (CPMVs). (x, y) represents the coordinate of a representative point relative to the top-left sample within the current block, and (mv^h(x,y), mv^v(x,y)) is the motion vector derived for the sample located at (x, y). The CP motion vectors may be signaled (as in affine AMVP mode) or derived on the fly (as in affine Merge mode). w and h are the width and height of the current block. In practice, the division is implemented by a right shift with rounding. In the VTM, the representative point is defined as the center position of a sub-block; e.g., when the coordinate of the top-left corner of a sub-block relative to the top-left sample within the current block is (xs, ys), the coordinate of the representative point is defined as (xs + 2, ys + 2). For each sub-block (i.e., each 4×4 block in the VTM), the representative point is used to derive the motion vector for the whole sub-block.

To further simplify motion compensated prediction, sub-block based affine transform prediction is applied. To derive the motion vector of each M×N (M and N are both set to 4 in the current VVC) sub-block, the motion vector of the center sample of each sub-block, as shown in Fig. 17, is calculated according to equations (1) and (2) and rounded to 1/16 fractional accuracy. Then the motion compensation interpolation filters for 1/16-pixel accuracy are applied to generate the prediction of each sub-block with the derived motion vector. The 1/16-pixel interpolation filters are introduced by the affine mode.

After MCP, the high precision motion vector of each sub-block is rounded and saved to the same precision as the regular motion vector.
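A minimal sketch of this per-sub-block MV derivation from the CPMVs, evaluating the 4-/6-parameter model of equations (1) and (2) at each sub-block's representative (center) point; the fixed-point rounding to 1/16 accuracy is omitted for brevity, and all names are illustrative.

def affine_subblock_mvs(cpmv, w, h, six_param, sb=4):
    # cpmv[0] = top-left CPMV, cpmv[1] = top-right CPMV, cpmv[2] = bottom-left CPMV.
    (mv0x, mv0y), (mv1x, mv1y) = cpmv[0], cpmv[1]
    a = (mv1x - mv0x) / w
    b = (mv1y - mv0y) / w
    if six_param:
        (mv2x, mv2y) = cpmv[2]
        c = (mv2x - mv0x) / h
        d = (mv2y - mv0y) / h
    else:                       # 4-parameter model: c = -b, d = a
        c, d = -b, a
    mvs = {}
    for ys in range(0, h, sb):
        for xs in range(0, w, sb):
            x, y = xs + sb // 2, ys + sb // 2   # representative (center) point
            mvs[(xs, ys)] = (mv0x + a * x + c * y, mv0y + b * x + d * y)
    return mvs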

2.5.1. Signaling of affine prediction

Similar to the translational motion model, there are two modes for signaling the side information of affine prediction: AFFINE_INTER mode and AFFINE_MERGE mode.

2.5.2. AF_INTER mode

For CUs with both width and height larger than 8, AF_INTER mode can be applied. An affine flag at CU level is signaled in the bitstream to indicate whether AF_INTER mode is used.

In this mode, for each reference picture list (list 0 or list 1), an affine AMVP candidate list is constructed with three types of affine motion predictors in the following order, where each candidate includes the estimated CPMVs of the current block. The differences between the best CPMVs found at the encoder side (such as mv_0, mv_1, mv_2 in Figs. 18a and 18b) and the estimated CPMVs are signaled. In addition, the index of the affine AMVP candidate from which the estimated CPMVs are derived is further signaled.

1) Inherited affine motion predictor

The checking order is similar to that of the spatial MVPs in HEVC AMVP list construction. First, a left inherited affine motion predictor is derived from the first block in {A1, A0} that is affine-coded and has the same reference picture as the current block. Second, an above inherited affine motion predictor is derived from the first block in {B1, B0, B2} that is affine-coded and has the same reference picture as the current block. The five blocks A1, A0, B1, B0, B2 are depicted in Fig. 19.

Once a neighboring block is found to be coded in affine mode, the CPMVs of the coding unit covering that block are used to derive predictors for the CPMVs of the current block. For example, if A1 is coded with a non-affine mode and A0 is coded with the 4-parameter affine mode, the left inherited affine MV predictor is derived from A0. In this case, the CPMVs of the CU covering A0, denoted mv_0^N for the top-left CPMV and mv_1^N for the top-right CPMV in Fig. 18b, are used to derive the estimated CPMVs of the current block, denoted mv_0^C, mv_1^C and mv_2^C, for the top-left (with coordinates (x0, y0)), top-right (with coordinates (x1, y1)) and bottom-right (with coordinates (x2, y2)) positions of the current block.

2) Constructed affine motion predictor

The constructed affine motion predictor consists of control point motion vectors (CPMVs) that are derived from neighboring inter-coded blocks, as shown in Fig. 20, that have the same reference picture. If the current affine motion model is a 4-parameter affine, the number of CPMVs is 2; otherwise, if the current affine motion model is a 6-parameter affine, the number of CPMVs is 3. The top-left CPMV mv̄_0 is derived from the MV of the first block in the group {A, B, C} that is inter-coded and has the same reference picture as the current block. The top-right CPMV mv̄_1 is derived from the MV of the first block in the group {D, E} that is inter-coded and has the same reference picture as the current block. The bottom-left CPMV mv̄_2 is derived from the MV of the first block in the group {F, G} that is inter-coded and has the same reference picture as the current block.

If the current affine motion model is a 4-parameter affine, the constructed affine motion predictor is inserted into the candidate list only if both mv̄_0 and mv̄_1 are found (i.e., available), where mv̄_0 and mv̄_1 are used as the estimated CPMVs for the top-left (with coordinates (x0, y0)) and top-right (with coordinates (x1, y1)) positions of the current block.

If the current affine motion model is a 6-parameter affine, the constructed affine motion predictor is inserted into the candidate list only if mv̄_0, mv̄_1 and mv̄_2 are all found, where they are used as the estimated CPMVs for the top-left (with coordinates (x0, y0)), top-right (with coordinates (x1, y1)) and bottom-right (with coordinates (x2, y2)) positions of the current block.

When the constructed affine motion predictor is inserted into the candidate list, no pruning process is applied.

3) Conventional AMVP motion prediction

The following applies until the number of affine motion predictors reaches a maximum.

i. Derive an affine motion predictor by setting all CPMVs equal to mv̄_0 (if available).

ii. Derive an affine motion predictor by setting all CPMVs equal to mv̄_1 (if available).

iii. Derive an affine motion predictor by setting all CPMVs equal to mv̄_2 (if available).

iv. Derive an affine motion predictor by setting all CPMVs equal to the HEVC TMVP (if available).

v. Derive an affine motion predictor by setting all CPMVs to the zero MV.

Note that mv̄_i (i = 0, 1, 2) have already been derived in the constructed affine motion predictor step.

In the AF_INTER mode, when the 4/6-parameter affine mode is used, 2/3 control points are required, and therefore 2/3 MVDs (motion vector differences) need to be coded for these control points, as shown in Figs. 18a and 18b. In JVET-K0337, it is proposed to predict mvd_1 and mvd_2 from mvd_0, i.e.,

mv_0 = mv̄_0 + mvd_0
mv_1 = mv̄_1 + mvd_1 + mvd_0
mv_2 = mv̄_2 + mvd_2 + mvd_0

where mv̄_i, mvd_i and mv_i are the predicted motion vector, the motion vector difference and the motion vector of the top-left pixel (i = 0), the top-right pixel (i = 1) or the bottom-left pixel (i = 2), respectively, as shown in Fig. 18b. Note that the addition of two motion vectors (e.g., mvA(xA, yA) and mvB(xB, yB)) is performed component-wise, i.e., newMV = mvA + mvB, with the two components of newMV set to (xA + xB) and (yA + yB), respectively.

2.5.2.1. AF_MERGE mode

When a CU is coded in AF_MERGE mode, it obtains the first block coded in affine mode from the valid neighboring reconstructed blocks. Fig. 21 shows the candidates for AF_MERGE. The candidate blocks are selected in order from left, above, above-right, below-left to above-left, as shown in Fig. 21a (in order A, B, C, D, E). For example, if the neighboring below-left block is coded in affine mode, as denoted by A0 in Fig. 21b, the control point motion vectors mv^N_0, mv^N_1 and mv^N_2 of the top-left, top-right and bottom-left corners of the neighboring CU/PU that contains block A are fetched. The motion vectors mv^C_0, mv^C_1 and mv^C_2 of the top-left/top-right/bottom-left corners of the current CU/PU are then calculated based on mv^N_0, mv^N_1 and mv^N_2 (mv^C_2 is used for the 6-parameter affine model only). It should be noted that in VTM-2.0, if the current block is affine-coded, the sub-block located at the top-left corner (e.g., a 4 x 4 block in VTM) stores mv0 and the sub-block located at the top-right corner stores mv1. If the current block is coded with the 6-parameter affine model, the sub-block located at the bottom-left corner stores mv2; otherwise (with the 4-parameter affine model), LB (the bottom-left sub-block) stores mv2'. Other sub-blocks store the MVs used for MC.

After the CPMVs mv^C_0, mv^C_1 and mv^C_2 of the current CU are derived, the MVF of the current CU is generated according to the simplified affine motion model in equations 25 and 26. In order to identify whether the current CU is coded in AF_MERGE mode, an affine flag is signaled in the bitstream when there is at least one neighboring block coded in affine mode.

In JVET-L0142 and JVET-L0632, an affine Merge candidate list is constructed by the following steps:

1) inserting inherited affine candidates

Inherited affine candidates are candidates derived from the affine motion model of a valid neighboring affine-coded block. At most two inherited affine candidates are derived from the affine motion models of the neighboring blocks and inserted into the candidate list. For the left predictor, the scan order is {A0, A1}; for the above predictor, the scan order is {B0, B1, B2}.

2) Insert constructed affine candidates

If the number of candidates in the affine Merge candidate list is less than MaxNumAffineCand (e.g., 5), constructed affine candidates are inserted into the candidate list. A constructed affine candidate is a candidate that is constructed by combining the neighboring motion information of each control point.

a) The motion information of the control points is first derived from the specified spatial and temporal neighbors shown in Fig. 22, which shows an example of candidate positions for the affine Merge mode. CPk (k = 1, 2, 3, 4) represents the k-th control point. A0, A1, A2, B0, B1, B2 and B3 are the spatial positions for predicting CPk (k = 1, 2, 3). T is the temporal position for predicting CP4.

The coordinates of CP1, CP2, CP3 and CP4 are (0, 0), (W, 0), (0, H) and (W, H), respectively, where W and H are the width and height of the current block.

The motion information of each control point is obtained according to the following priority order:

for CP1, the check priority is B2- > B3- > a 2. If B2 is present, it is used. Otherwise, if B2 is available, B3 is used. If neither B2 nor B3 is available, then A2 is used. If none of the three candidates are present, no motion information for CP1 can be obtained.

For CP2, the checking priority is B1 -> B0.

For CP3, the checking priority is A1 -> A0.

For CP4, T is used.

b) Next, affine Merge candidates are constructed using combinations of control points.

Motion information of three control points is needed to construct a 6-parameter affine candidate. The three control points may be selected from one of the following four combinations: {CP1, CP2, CP4}, {CP1, CP2, CP3}, {CP2, CP3, CP4}, {CP1, CP3, CP4}. The combinations {CP1, CP2, CP3}, {CP2, CP3, CP4} and {CP1, CP3, CP4} are converted to a 6-parameter motion model represented by the top-left, top-right and bottom-left control points.

Motion information of two control points is needed to construct a 4-parameter affine candidate. The two control points may be selected from one of the following two combinations: {CP1, CP2}, {CP1, CP3}. The two combinations are converted to a 4-parameter motion model represented by the top-left and top-right control points.

The combinations of constructed affine candidates are inserted into the candidate list in the following order:

{CP1, CP2, CP3}, {CP1, CP2, CP4}, {CP1, CP3, CP4}, {CP2, CP3, CP4}, {CP1, CP2}, {CP1, CP3}

i. For each combination, the reference indices of list X for each CP are checked; if they are all the same, the combination has valid CPMVs for list X. If the combination does not have valid CPMVs for both list 0 and list 1, the combination is marked as invalid. Otherwise, it is valid, and the CPMVs are put into the sub-block Merge list.

3) Padding with zero motion vectors

If the number of candidates in the affine Merge candidate list is less than 5, a zero motion vector with a zero reference index is inserted into the candidate list until the list is full.

More specifically, for the sub-block Merge candidate list, the MVs of a 4-parameter Merge candidate are set to (0, 0) and the prediction direction is set to uni-prediction from list 0 (for P slices) or bi-prediction (for B slices). A consolidated sketch of steps 2 and 3 is given below.
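The validity check of step 2b-i and the padding of step 3 can be summarized in the following Python sketch. The candidate representation, function names and the handling of inherited candidates are illustrative assumptions, not VTM code:

```python
COMBOS = [("CP1", "CP2", "CP3"), ("CP1", "CP2", "CP4"), ("CP1", "CP3", "CP4"),
          ("CP2", "CP3", "CP4"), ("CP1", "CP2"), ("CP1", "CP3")]

def build_sub_block_merge_list(cp_info, inherited, max_cand=5):
    """cp_info maps a CP name to {list_idx: (mv, ref_idx)}; entries are
    missing when a CP (or one of its lists) is unavailable. `inherited`
    holds the step-1 candidates."""
    candidates = list(inherited)[:max_cand]
    for combo in COMBOS:
        if len(candidates) >= max_cand:
            break
        cand = {}
        for lst in (0, 1):
            infos = [cp_info.get(cp, {}).get(lst) for cp in combo]
            # The combination is valid for list `lst` only if every CP has
            # motion information and all CPs share one reference index.
            if all(infos) and len({ref for _, ref in infos}) == 1:
                cand[lst] = [mv for mv, _ in infos]
        if cand:  # invalid only when neither list 0 nor list 1 is valid
            candidates.append(cand)
    while len(candidates) < max_cand:  # step 3: zero-MV padding
        candidates.append({0: [(0, 0)] * 2})  # uni-pred from list 0
    return candidates
```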

Disadvantages of the existing implementations

When ARC is applied in VVC, the following problems may arise:

It is unclear how coding tools in VVC should be applied, such as ALF, luma mapping with chroma scaling (LMCS), decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode (TPM), symmetric motion vector differences (SMVD), Merge with motion vector differences (MMVD), inter-intra prediction (also called combined inter and intra prediction, CIIP, in VVC), local illumination compensation (LIC), and history-based motion vector prediction (HMVP).

Exemplary method of codec tool with adaptive resolution conversion

Embodiments of the disclosed technology overcome the disadvantages of existing implementations. The examples of the disclosed technology provided below are discussed to facilitate understanding of the disclosed technology and should not be construed in a manner that limits the disclosed technology. Various features described in these examples may be combined unless explicitly stated to the contrary.

In the discussion that follows, SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0
SatShift(x, n) = -((-x + offset1) >> n), if x < 0

Shift(x, n) is defined as Shift(x, n) = (x + offset0) >> n.

In one example, offset0 and/or offset1 are set to (1 << n) >> 1 or (1 << (n - 1)). In another example, offset0 and/or offset1 are set to 0.

In another example, offset0 = offset1 = ((1 << n) >> 1) - 1 or ((1 << (n - 1))) - 1.

Clip3(min, max, x) is defined as

Clip3(min, max, x) = min, if x < min
Clip3(min, max, x) = max, if x > max
Clip3(min, max, x) = x, otherwise

Floor (x) is defined as the largest integer less than or equal to x.

Ceil (x) is the smallest integer greater than or equal to x.

Log2(x) is defined as the base 2 logarithm of x.
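A minimal Python rendering of these helper operators, useful for checking the examples below (a sketch that follows the definitions above):

```python
import math

def Shift(x, n, offset0=0):
    # Right shift with a rounding offset.
    return (x + offset0) >> n

def SatShift(x, n, offset0=0, offset1=0):
    # Sign-symmetric right shift: shifts the magnitude toward zero.
    return (x + offset0) >> n if x >= 0 else -((-x + offset1) >> n)

def Clip3(lo, hi, x):
    # Clip x to the inclusive range [lo, hi].
    return lo if x < lo else hi if x > hi else x

def Floor(x):
    return math.floor(x)

def Ceil(x):
    return math.ceil(x)

def Log2(x):
    return math.log2(x)

# Example: Shift(5, 1, offset0=1) == 3, while
# SatShift(-5, 1, offset1=1) == -3 (symmetric rounding of the negative value).
```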

Some aspects of implementations of the disclosed technology are listed below.

1. The derivation of the MV offsets in MMVD/SMVD and/or of the refined motion vectors in a decoder-side derivation process may depend on the resolution of the reference picture associated with the current block and the resolution of the current picture.

a. For example, the second MV offset referring to the second reference picture may be scaled from the first MV offset referring to the first reference picture. The scaling factor may depend on the resolution of the first and second reference pictures.
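A sketch of the scaling in example 1a: the MV offset derived for a first reference picture is rescaled for a second reference picture whose resolution differs. The scaling factor shown here (the ratio of picture widths/heights) is one plausible choice; the exact factor is not fixed by the text, and the names are ours.

```python
def scale_mv_offset(offset, ref0_size, ref1_size):
    """Scale an MV offset derived for reference picture 0 so that it
    applies to reference picture 1, based on their resolutions.

    offset: (dx, dy) in 1/16-pel units; ref sizes: (width, height) in pels.
    """
    dx, dy = offset
    w0, h0 = ref0_size
    w1, h1 = ref1_size
    # Integer-friendly scaling; a real codec would use fixed-point shifts.
    return (dx * w1 // w0, dy * h1 // h0)
```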

2. The motion candidate list construction process may depend on the reference picture resolutions associated with the spatial/temporal/history-based motion candidates.

a. In one example, a Merge candidate that refers to a reference picture with a higher resolution may have a higher priority than a Merge candidate that refers to a reference picture with a lower resolution. In this discussion, a resolution W0 x H0 is lower than a resolution W1 x H1 if W0 < W1 and H0 < H1.

b. For example, in the Merge candidate list, a Merge candidate referring to a reference picture having a higher resolution may be placed before a Merge candidate referring to a reference picture having a lower resolution.

c. For example, a motion vector referring to a reference picture whose resolution is lower than that of the current picture cannot be placed in the Merge candidate list.

d. In one example, whether and/or how to update the history buffer (lookup table) may depend on the reference picture resolution associated with the decoded motion candidate.

i. In one example, if a reference picture associated with a decoded motion candidate has a resolution different from that of the current picture, such a motion candidate is not allowed to update the history buffer.
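For example 2d, a sketch of the gated HMVP update (the table layout and names are illustrative; VVC's HMVP table additionally performs pruning, which is omitted here):

```python
def maybe_update_hmvp(hmvp_table, cand, cur_resolution, max_size=5):
    """Append a decoded motion candidate to the HMVP table only if all of
    its reference pictures have the same resolution as the current picture.

    cand.ref_resolutions: list of (w, h) of the pictures the candidate
    references (an assumed attribute for this sketch).
    """
    if any(res != cur_resolution for res in cand.ref_resolutions):
        return  # candidate references a rescaled picture: skip the update
    if len(hmvp_table) == max_size:
        hmvp_table.pop(0)  # FIFO eviction of the oldest entry
    hmvp_table.append(cand)
```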

3. It is proposed to filter pictures with ALF parameters associated with the respective sizes.

a. In one example, an ALF parameter signaled in a video unit, such as an APS, may be associated with one or more picture sizes.

b. In one example, the video unit (e.g., an APS) in which the ALF parameters are signaled may be associated with one or more picture sizes.

c. For example, a picture may only apply the ALF parameters associated with the same size signaled in a video unit (e.g., APS).

d. An indication of the resolution and/or the index of the PPS may be signaled in the ALF APS.

e. The ALF parameters are restricted to be inherited/predicted from only the parameters for pictures with the same resolution.

4. It is proposed that the ALF parameters associated with the first respective size may be inherited or predicted from the ALF parameters associated with the second respective size.

a. In one example, the first respective size must be the same as the second respective size.

b. In one example, the first respective size may be different from the second respective size.

5. It is proposed to reshape the samples in the picture with LMCS parameters associated with the respective sizes.

a. In one example, LMCS parameters signaled in a video unit such as an APS may be associated with one or more picture sizes.

b. In one example, the video unit (e.g., an APS) in which the LMCS parameters are signaled may be associated with one or more picture sizes.

c. For example, pictures may only apply LMCS parameters associated with the same size signaled in a video unit (e.g., APS).

d. An indication of the resolution and/or the index of the PPS may be signaled in the LMCS APS.

e. The LMCS parameters are restricted to be inherited/predicted from parameters for pictures with the same resolution only.

6. It is proposed that the LMCS parameters associated with the first respective size may be inherited or predicted from the LMCS parameters associated with the second respective size.

a. In one example, the first respective size must be the same as the second respective size.

b. In one example, the first respective size may be different from the second respective size.

7. Whether and/or how to enable TPM (triangle prediction mode)/GEO (inter prediction with geometric partitioning) or other coding tools that may divide a block into two or more sub-partitions may depend on the associated reference picture information of the two or more sub-partitions.

a. In one example, it may depend on the resolution of one of the two reference pictures and the resolution of the current picture.

i. In one example, such a coding tool is disabled if at least one of the two reference pictures is associated with a different resolution than the current picture.

b. In one example, this may depend on whether the resolutions of the two reference pictures are the same.

i. In one example, such a coding tool may be disabled if two reference pictures are associated with different resolutions.

ii. In one example, such a coding tool is disabled if both reference pictures are associated with resolutions different from that of the current picture.

iii. Alternatively, such a coding tool may still be disabled if both reference pictures are associated with resolutions different from that of the current picture, even when the two reference pictures have the same resolution as each other.

iv. Alternatively, coding tool X may be disabled if at least one reference picture has a resolution different from that of the current picture and the two reference pictures have different resolutions from each other.

c. Alternatively, or in addition, this may depend on whether the two reference pictures are the same reference picture.

d. Alternatively, or in addition, this may depend on whether the two reference pictures are in the same reference list.

e. Alternatively, if RPR (reference picture resampling) is enabled in the slice/picture header/sequence parameter set, such coding tools may always be disabled. A decision sketch for this example follows.
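A sketch of one variant of example 7, covering 7a.i (disable when any sub-partition references a rescaled picture) and 7e (always disable under RPR). The function name and block representation are ours:

```python
def tpm_geo_allowed(ref_sizes, cur_size, rpr_enabled=False):
    """Decide whether a two-partition tool (TPM/GEO) may be used.

    ref_sizes: iterable of (w, h) of the reference picture of each
    sub-partition; cur_size: (w, h) of the current picture.
    """
    if rpr_enabled:
        return False  # example 7e: always off when RPR is enabled
    # Example 7a.i: off if any reference has a size different from the
    # current picture (this also forces the references to match each other).
    return all(size == cur_size for size in ref_sizes)
```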

8. It is proposed that coding tool X may be disabled for a block if the block refers to at least one reference picture having a different size than the current picture.

a. In one example, information related to codec tool X may not be signaled.

b. In one example, motion information for such a block may not be inserted into the HMVP table.

c. Alternatively, if the coding tool X is applied in a block, the block cannot refer to a reference picture having a different size from the current picture.

i. In one example, a Merge candidate that refers to a reference picture having a different size than the current picture may be skipped or not placed in the Merge candidate list.

ii. In one example, a reference index corresponding to a reference picture having a different size than the current picture may be skipped or not allowed to be signaled.

d. Alternatively, the coding tool X may be applied after scaling two reference blocks or pictures according to the resolution of the current picture and the resolution of the reference picture.

e. Alternatively, the coding tool X may be applied after scaling the two MVs or MVDs according to the resolution of the current picture and the resolution of the reference picture.

f. In one example, whether to disable or enable coding tool X for a block (e.g., a bi-predicted block; a block with multiple hypotheses from the same reference picture list with different reference pictures or different MVs; or a block with multiple hypotheses from different reference picture lists) may depend on the resolutions of the reference pictures associated with the reference picture lists and/or the current picture. A decision sketch is given after the list of tools in item 8g.

i. In one example, codec tool X may be disabled for one reference picture list but enabled for another reference picture list.

ii. In one example, coding tool X may be disabled for one reference picture but enabled for another reference picture. Here, the two reference pictures may be from different reference picture lists or from the same reference picture list.

iii. In one example, for each reference picture list L, whether to enable/disable the coding tool is determined regardless of the reference pictures in any reference picture list other than list L.

1. In one example, it may be determined by the reference picture of list L and the current picture.

2. In one example, if the reference picture associated with list L has a resolution different from that of the current picture, the tool may be disabled for list L.

iv. Alternatively, enabling/disabling the coding tool is determined by the resolutions of all reference pictures and/or of the current picture.

1. In one example, coding tool X may be disabled if at least one reference picture has a different resolution than the current picture.

2. In one example, if at least one reference picture has a different resolution than the current picture, but the reference pictures have the same resolution, then coding tool X may still be enabled.

3. In one example, if at least one reference picture has a resolution different from that of the current picture, and the reference pictures have different resolutions from each other, coding tool X may be disabled.

g. Codec tool X may be any of the following.

i. DMVR

ii. BDOF

iii. Affine prediction

iv. Triangle prediction mode

v. SMVD

vi. MMVD

vii. Inter-intra prediction in VVC

viii. LIC

ix. HMVP

x. Multiple transform set (MTS)

xi. Sub-block transform (SBT)

xii. PROF and/or other decoder-side motion/prediction refinement methods

xiii. LFNST (low-frequency non-square transform)

xiv. Filtering methods (e.g., deblocking filter/SAO/ALF)

xv. GEO/TPM/cross-component ALF
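A sketch of the per-list decision in example 8f.iii (the variant of iii.2: a list is disabled when its reference picture differs in resolution from the current picture). The data representation is illustrative:

```python
def tool_enabled_per_list(ref_resolution_by_list, cur_resolution):
    """Return {list_idx: bool}: tool X is enabled for list L iff the
    reference picture of list L matches the current picture's resolution."""
    return {lst: (res == cur_resolution)
            for lst, res in ref_resolution_by_list.items()}

# Example: list 0 references a rescaled picture, list 1 does not.
print(tool_enabled_per_list({0: (960, 540), 1: (1920, 1080)}, (1920, 1080)))
# -> {0: False, 1: True}
```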

9. The reference picture list of a picture may include no more than K different resolutions.

a. In one example, K is equal to 2.

10. For N consecutive pictures (in decoding order or display order), more than K different resolutions are not allowed. A checking sketch follows the examples below.

a. In one example, N = 3 and K = 3.

b. In one example, N = 10 and K = 3.

c. In one example, no more than K different resolutions are allowed in a GOP (group of pictures).

d. In one example, no more than K different resolutions are allowed between two pictures with the same specific temporal layer id (denoted tid).

i. For example, K = 3 and tid = 0.
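A conformance-style check for example 10 might look as follows. This is a sketch; a real checker would operate on the resolutions signaled in the bitstream's parameter sets, and the windowed check is our reading of "N consecutive pictures":

```python
def check_resolution_limits(picture_resolutions, n=10, k=3):
    """picture_resolutions: list of (w, h) in decoding order.

    Verify that every window of n consecutive pictures uses at most
    k distinct resolutions.
    """
    for start in range(len(picture_resolutions) - n + 1):
        window = picture_resolutions[start:start + n]
        if len(set(window)) > k:
            return False
    return True
```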

11. The resolution change may only be allowed for intra pictures.

12. If one or two reference pictures of a block have different resolutions from the current picture, bi-prediction can be converted to uni-prediction in the decoding process.

a. In one example, a prediction from list X with a corresponding reference picture having a different resolution than the current picture may be discarded.

13. Whether inter prediction from reference pictures with resolutions different from that of the current picture is enabled or disabled may depend on the motion vector precision and/or the resolution ratios. A sketch combining examples 12 and 13 follows the examples below.

a. In one example, if the motion vectors scaled according to the resolution ratio point to integer positions, inter prediction may still be applied.

b. In one example, if the motion vector scaled according to the resolution ratio points to a sub-pel position that is allowed in the case without ARC (e.g., 1/4-pel), inter prediction may still be applied.

c. Alternatively, when both reference pictures are different in resolution from the current picture, bi-prediction may not be allowed.

d. Alternatively, bi-prediction may be enabled when one reference picture has a different resolution than the current picture and the other has the same resolution.

e. Alternatively, when the reference picture has a different resolution from the current picture and the block size satisfies a certain condition, unidirectional prediction of the block may not be allowed.
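A sketch combining examples 12 and 13c: bi-prediction is demoted to uni-prediction when reference resolutions disagree with the current picture, and the prediction from a mismatched list is discarded per example 12a. The fallback choice and all names are assumptions:

```python
def resolve_prediction(ref_res_l0, ref_res_l1, cur_res):
    """Return the reference picture lists to use for a bi-predicted block.

    ref_res_l0/l1: (w, h) of the list-0/list-1 reference, or None if unused.
    """
    lists = []
    if ref_res_l0 is not None and ref_res_l0 == cur_res:
        lists.append(0)
    if ref_res_l1 is not None and ref_res_l1 == cur_res:
        lists.append(1)
    if not lists:
        # Both references are rescaled: fall back to uni-prediction from
        # list 0 (one possible choice; the text leaves this open).
        lists = [0] if ref_res_l0 is not None else [1]
    return lists
```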

14. A first flag (e.g., pic_disable_X_flag) indicating whether coding tool X is disabled may be signaled in the picture header.

a. Whether a coding tool is enabled for slices/bricks/sub-pictures/other video units smaller than a picture may be controlled by this flag in the picture header and/or the slice type.

b. In one example, when the first flag is true, codec tool X is disabled.

i. Alternatively, when the first flag is false, codec tool X is enabled.

ii. In one example, it enables/disables the tool for all samples within a picture.

c. In one example, the signaling of the first flag may further depend on one or more syntax elements in the SPS/VPS/DPS/PPS.

i. In one example, the signaling of the flag may depend on the enabling flag of codec tool X in the SPS.

ii. Alternatively, or in addition, a second flag (e.g., sps_X_slice_present_flag) may be signaled in the SPS to indicate the presence of the first flag in the picture header.

1) Alternatively, or in addition, the second flag may be conditionally signaled when coding tool X is enabled for the sequence (e.g., sps_X_enabled_flag is true).

2) Alternatively, or in addition, the first flag may be signaled in the picture header only if the second flag indicates the presence of the first flag (a parsing sketch is given after the list of tools below).

d. In one example, the first and/or second flag is coded with 1 bit.

e. The codec tool X may be:

i. In one example, coding tool X is PROF.

ii. In one example, coding tool X is DMVR.

iii. In one example, coding tool X is BDOF.

iv. In one example, coding tool X is cross-component ALF.

v. In one example, coding tool X is GEO.

vi. In one example, coding tool X is TPM.

vii. In one example, coding tool X is MTS.
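A sketch of the conditional signaling in example 14c (mirroring VVC-style parsing; the reader interface and flag names follow the examples above and are illustrative):

```python
def parse_picture_header(reader, sps):
    """Parse pic_disable_X_flag following examples 14.c.i/ii.

    reader.u(1) reads one bit (an assumed interface); sps carries the
    already-parsed SPS flags.
    """
    pic_disable_X_flag = 0  # default: tool X is not disabled for this picture
    if sps.sps_X_enabled_flag and sps.sps_X_slice_present_flag:
        # Present only when the tool is enabled for the sequence and the
        # second flag announces the first flag's presence (example 14.c.ii.2).
        pic_disable_X_flag = reader.u(1)
    return pic_disable_X_flag
```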

15. Whether a block is allowed to refer to a reference picture with a size different from that of the current picture may depend on the width (WB) and/or height (HB) of the block, and/or the block prediction mode (i.e., bi-prediction or uni-prediction). A decision sketch follows this list.

a. In one example, the block may refer to a reference picture with a size different from that of the current picture if WB >= T1 and HB >= T2. For example, T1 = T2 = 8.

b. In one example, the block may refer to such a reference picture if WB * HB >= T. For example, T = 64.

c. In one example, the block may refer to such a reference picture if Min(WB, HB) >= T. For example, T = 8.

d. In one example, the block may refer to such a reference picture if Max(WB, HB) >= T. For example, T = 8.

e. In one example, the block may refer to such a reference picture if WB <= T1 and HB <= T2. For example, T1 = T2 = 64.

f. In one example, the block may refer to such a reference picture if WB * HB <= T. For example, T = 4096.

g. In one example, the block may refer to such a reference picture if Min(WB, HB) <= T. For example, T = 64.

h. In one example, the block may refer to such a reference picture if Max(WB, HB) <= T. For example, T = 64.

i. Alternatively, the block is not allowed to refer to a reference picture with a size different from that of the current picture if WB <= T1 and/or HB <= T2. For example, T1 = T2 = 8.
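A sketch of the size gating in example 15a (thresholds per the example; a real implementation would pick exactly one of the listed conditions):

```python
def may_ref_rescaled_picture(wb, hb, t1=8, t2=8):
    """Example 15a: allow referencing a reference picture whose size differs
    from the current picture only for blocks with WB >= T1 and HB >= T2."""
    return wb >= t1 and hb >= t2

assert may_ref_rescaled_picture(16, 16)    # large block: allowed
assert not may_ref_rescaled_picture(8, 4)  # HB < T2: disallowed
```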

Fig. 23A is a block diagram of the video processing device 2300. Device 2300 may be used to implement one or more methods described herein. The device 2300 may be embodied in a smartphone, tablet, computer, internet of things (IoT) receiver, or the like. The device 2300 may include one or more processors 2302, one or more memories 2304, and video processing hardware 2306. The processor(s) 2302 may be configured to implement one or more of the methods described in this document (including, but not limited to, the methods illustrated in fig. 24A-25I). The memory(s) 2304 may be used to store data and code for implementing the methods and techniques described herein. The video processing hardware 2306 may be used to implement some of the techniques described in this document in hardware circuits.

Fig. 23B is a block diagram illustrating an example video processing system 2400 in which various techniques disclosed herein may be implemented. Various implementations may include some or all of the components of system 2400. System 2400 can include an input 2402 for receiving video content. The video content may be received in a raw or uncompressed format, such as 8- or 10-bit multi-component pixel values, or may be in a compressed or encoded format. Input 2402 may represent a network interface, a peripheral bus interface, or a storage interface. Examples of network interfaces include wired interfaces, such as Ethernet and passive optical networks (PONs), and wireless interfaces, such as Wi-Fi or cellular interfaces.

System 2400 can include a codec component 2404 that may implement the various codecs or encoding methods described in this document. Codec component 2404 may reduce the average bit rate of the video from input 2402 to the output of codec component 2404 to produce a codec representation of the video. The codec techniques are therefore sometimes referred to as video compression or video transcoding techniques. The output of codec component 2404 may be stored, or transmitted via a connected communication, as represented by component 2406. Component 2408 may use the stored or communicated bitstream (or codec representation) of the video received at input 2402 to generate pixel values or displayable video that is sent to display interface 2409. The process of generating user-viewable video from a bitstream representation is sometimes referred to as video decompression. Further, while certain video processing operations are referred to as "codec" operations or tools, it should be understood that codec tools or operations are used at the encoder, and that corresponding decoding tools or operations that reverse the codec results will be performed by the decoder.

Examples of a peripheral bus interface or display interface may include a Universal Serial Bus (USB) or a high-definition multimedia interface (HDMI) or displayport, among others. Examples of storage interfaces include SATA (serial advanced technology attachment), PCI, IDE interfaces, and the like. The techniques described in this document may be embodied in various electronic devices, such as mobile phones, laptops, smartphones, or other devices capable of performing digital data processing and/or video display.

Fig. 24A shows a flow diagram of an exemplary method of video processing. Referring to fig. 24A, a method 2410 includes, at step 2412, deriving one or more motion vector offsets based on one or more resolutions of a reference picture associated with a current video block and a resolution of a current picture for a conversion between the current video block of the current picture and a coded representation of the video. The method 2410 further includes, at step 2414, performing a conversion using the one or more motion vector offsets.

Fig. 24B shows a flow diagram of an exemplary method of video processing. Referring to fig. 24B, method 2420 comprises, at step 2422, constructing a motion candidate list for conversion between a current video block of a current picture of the video and a codec representation of the video, the motion candidates in the motion candidate list being included in a priority order such that the priority of the motion candidates is based on the resolution of the reference picture associated with the motion candidates. The method 2420 further includes, at step 2424, performing a transformation using the motion candidate list.

Fig. 24C shows a flow diagram of an exemplary method of video processing. Referring to fig. 24C, a method 2430 includes, at step 2432, determining parameters of an adaptive loop filter for a current video picture based on a size of the current video picture including one or more video units. The method 2430 further includes, at step 2434, performing a conversion between the current video picture and the codec representation of the current video picture by filtering the one or more video units according to the parameters of the adaptive loop filter.

Fig. 24D shows a flow diagram of an exemplary method of video processing. Referring to fig. 24D, a method 2440 includes, at step 2442, applying a luma mapping with chroma scaling (LMCS) process to a current video block of a current picture of a video, in which luma samples of the current video block are reshaped between a first domain and a second domain, and chroma residuals are scaled in a luma-related manner using LMCS parameters associated with respective sizes. Method 2440 also includes, at step 2444, performing a conversion between the current video block and a codec representation of the video.

Fig. 24E shows a flow diagram of an exemplary method of video processing. Referring to fig. 24E, method 2450 includes, at step 2452, determining whether and/or how to enable a coding tool that divides a current video block into a plurality of sub-partitions, according to a rule based on reference picture information for the plurality of sub-partitions, for a transition between the current video block of the video and a coded representation of the video. The method 2450 also includes, at step 2454, performing a conversion based on the determination.

Fig. 25A shows a flow diagram of an exemplary method of video processing. Referring to fig. 25A, a method 2510 includes, at step 2512, determining, for a transition between a current video block of a current picture of a video and a codec representation of the video, that use of a codec tool is disabled for the current video block due to the current video block being codec into the codec representation using a reference picture having a size different from a size of the current picture. The method 2510 also includes, at step 2514, performing a conversion based on the determination.

Fig. 25B shows a flow diagram of an exemplary method for video processing. Referring to fig. 25B, method 2520 includes, at step 2522, for a current video block of a current picture of the video, generating a prediction block by applying a coding tool to the current video block based on rules that determine whether and/or how to use a reference picture having a size different from the size of the current picture. The method 2520 also includes, at step 2524, performing a conversion between the current video block and a codec representation of the video using the prediction blocks.

Fig. 25C shows a flow diagram of an exemplary method of video processing. Referring to fig. 25C, method 2530 includes, at step 2532, determining whether to disable the codec tool for a current video block of a video based on a first resolution of a reference picture associated with one or more reference picture lists and/or a second resolution of the current reference picture used to derive a prediction block for the current video block for a transition between the current video block and a codec representation of the video. The method 2530 further includes, at step 2534, performing a conversion based on the determination.

Fig. 25D shows a flow diagram of an exemplary method of video processing. Referring to FIG. 25D, method 2540 includes, at step 2542, performing a conversion between a video picture including one or more video blocks and a codec representation of the video. In some implementations, at least some of the one or more video blocks are coded by referencing a reference picture list of the video picture according to a rule, and the rule specifies that the reference picture list includes reference pictures having at most K different resolutions, where K is an integer.

Fig. 25E shows a flow diagram of an exemplary method of video processing. Referring to fig. 25E, method 2550 includes, at step 2552, performing a conversion between N consecutive video pictures of video and a codec representation of the video. In some implementations, the N consecutive video pictures include one or more video blocks that are coded at different resolutions according to a rule, and the rule specifies that at most K different resolutions are allowed for the N consecutive pictures, where N and K are integers.

Fig. 25F shows a flow diagram of an exemplary method of video processing. Referring to fig. 25F, method 2560 includes, at step 2562, performing a conversion between video including a plurality of pictures and a codec representation of the video. In some implementations, at least some of the plurality of pictures are codec into the codec representation using different codec video resolutions, and the codec representation conforms to a format rule specifying that a first codec resolution of a previous frame is changed to a second codec resolution of a next frame only in the case where the next frame, which sequentially follows the previous frame, is codec as an intra frame.

Fig. 25G shows a flow diagram of an exemplary method of video processing. Referring to fig. 25G, method 2570 includes, at step 2572, parsing a codec representation of the video to determine that a current video block of a current picture of the video references a reference picture associated with a resolution different from a resolution of the current picture. The method 2570 also includes, at step 2574, generating a prediction block for the current video block by converting the bi-prediction mode to a uni-prediction mode to be applied to the current video block. The method 2570 further includes, at step 2576, generating video from the codec representation using the prediction block.

Fig. 25H shows a flow diagram of an exemplary method of video processing. Referring to fig. 25H, method 2580 includes, at step 2582, generating a prediction block for a current video block of a current picture of a video by enabling or disabling inter prediction from reference pictures having different resolutions from each other according to a motion vector precision and/or a resolution ratio. Method 2580 further includes, at step 2584, performing a conversion between the current video block and a codec representation of the video using the prediction blocks.

Fig. 25I shows a flow diagram of an exemplary method for video processing. Referring to fig. 25I, method 2590 includes, at step 2592, determining, based on a coding characteristic of a current video block of a current picture of the video, whether a reference picture having a size different from a size of the current picture is allowed to generate a prediction block for the current video block during a transition between the current video block and a coded representation of the video. The method 2590 further includes, at step 2594, performing a conversion based on the determination.

In some embodiments, the video codec method may be implemented using an apparatus implemented on a hardware platform, as described with reference to fig. 23A or 23B. It should be appreciated that the disclosed methods and techniques would be beneficial to video encoder and/or decoder embodiments incorporated in video processing devices such as smart phones, laptops, desktops, and similar devices by allowing the use of the techniques disclosed in this document.

Some embodiments of the disclosed technology include making a decision or determination to enable a video processing tool or mode. In an example, when a video processing tool or mode is enabled, the encoder will use or implement the tool or mode in the processing of blocks of video, but does not necessarily modify the resulting bitstream based on the use of the tool or mode. That is, the conversion from a video block to a bitstream representation of the video will use the video processing tool or mode when the video processing tool or mode is enabled based on the decision or determination. In another example, when a video processing tool or mode is enabled, the decoder will process the bitstream knowing that the bitstream has been modified based on the video processing tool or mode. That is, the conversion from a bitstream representation of the video to a video block will be performed using a video processing tool or mode that is enabled based on the decision or determination.

Some embodiments of the disclosed technology include making a decision or determination to disable a video processing tool or mode. In an example, when a video processing tool or mode is disabled, the encoder will not use that tool or mode in the conversion of blocks of video to bitstream representations of video. In another example, when a video processing tool or mode is disabled, the decoder will process the bitstream knowing that the bitstream was not modified using the video processing tool or mode that was determined or determined to be disabled.

In this document, the term "video processing" may refer to video encoding, video decoding, video compression, or video decompression. For example, a video compression algorithm may be applied during the conversion from a pixel representation of the video to a corresponding bitstream representation, and vice versa. As defined by the syntax, the bitstream representation of the current video block may, for example, correspond to bits located at different locations within the bitstream or extending at different locations within the bitstream. For example, video blocks may be encoded according to the transform and the coded error residual values, and bits in the header and other fields in the bitstream may also be used. Here, the video block may be a logical unit corresponding to a processing operation being performed, for example, a codec unit, a transform unit, a prediction unit, and the like.

A first set of clauses describes certain features and aspects of the disclosed technology set forth in the previous section.

1.A method for video processing, comprising: deriving a motion vector offset based on a resolution of a reference picture associated with the current video block and a resolution of the current picture during a transition between the current video block and a bitstream representation of the current video block; and performing the conversion using the motion vector offset.

2. The method of clause 1, wherein deriving the motion vector offset comprises: deriving a first motion vector offset with reference to a first reference picture; and deriving a second motion vector offset with reference to a second reference picture based on the first motion vector offset.

3. The method according to clause 1, further comprising: performing, for the current video block, a motion candidate list construction process based on reference picture resolutions associated with spatial, temporal, or history-based motion candidates.

4. The method according to clause 1, wherein whether or how to update the lookup table depends on a reference picture resolution associated with the decoded motion candidate.

5. The method according to clause 1, further comprising: for a current picture, a filtering operation is performed using Adaptive Loop Filter (ALF) parameters associated with a respective size.

6. The method of clause 5, wherein the ALF parameters include a first ALF parameter associated with a first respective size and a second ALF parameter associated with a second respective size, the second ALF parameter inherited or predicted from the first ALF parameter.

7. The method according to clause 1, further comprising: the samples in the current picture are reshaped using LMCS (luma-mapped chroma scaling) parameters associated with the respective sizes.

8. The method of clause 7, wherein the LMCS parameters comprise a first LMCS parameter associated with a first respective size and a second LMCS parameter associated with a second respective size, the second LMCS parameter inherited or predicted from the first LMCS parameter.

9. The method of clause 5 or 7, wherein the ALF parameters or LMCS parameters signaled in the video unit are associated with one or more picture sizes.

10. The method of clause 1, further comprising disabling a coding tool for the current video block when the current video block references at least one reference picture that has a different size than the current picture.

11. The method according to clause 10, further comprising skipping or omitting the Merge candidate referring to the reference picture whose size is different from the size of the current picture.

12. The method according to clause 1, further comprising applying the coding tool after scaling two reference blocks or two reference pictures based on the resolution of the reference picture and the resolution of the current picture, or after scaling two MVs or MVDs (motion vector differences) based on the resolution of the reference picture and the resolution of the current picture.

13. The method of clause 1, wherein the current picture includes no more than K different resolutions, K being a natural number.

14. The method according to clause 13, wherein K different resolutions are allowed for N consecutive pictures, N being a natural number.

15. The method of clause 1, further comprising applying the resolution change to the current picture as an intra picture.

16. The method of clause 1, further comprising converting bi-prediction to uni-prediction when one or both reference pictures of the current video block have a resolution different from a resolution of the current picture.

17. The method of clause 1, further comprising enabling or disabling inter prediction from reference pictures of different resolutions according to at least one of a motion vector precision or a resolution ratio between the current block size and the reference block size.

18. The method of clause 1, further comprising applying bi-prediction according to whether two reference pictures or one reference picture has a resolution different from a resolution of the current picture.

19. The method of clause 1, wherein whether the current video block refers to a reference picture whose size is different from the size of the current picture depends on at least one of a size of the current video block or a block prediction mode.

20. The method of clause 1, wherein performing the conversion comprises generating the current video block from a bit stream representation.

21. The method of clause 1, wherein performing the conversion comprises generating a bitstream representation from the current video block.

22. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of clauses 1-21.

23. A computer program product stored on a non-transitory computer readable medium, the computer program product comprising program code for performing the method of any of clauses 1 to 21.

A second set of clauses describes certain features and aspects of the disclosed technology listed in the previous section, including, for example, example implementations 1-7.

1.A video processing method, comprising: deriving, for a conversion between a current video block of a current picture of a video and a codec representation of the video, one or more motion vector offsets based on one or more resolutions of a reference picture associated with the current video block and a resolution of the current picture; and performing the conversion using the one or more motion vector offsets.

2. The method according to clause 1, wherein the one or more motion vector offsets correspond to motion vector offsets used in a Merge mode with motion vector differences (MMVD) or in symmetric motion vector differences (SMVD), the MMVD comprising a motion vector expression that includes a distance index specifying a distance between two motion candidates, and the SMVD processing motion vector differences symmetrically.

3. The method according to clause 1, wherein the one or more motion vector offsets correspond to refined motion vectors used in the decoder-side derivation process.

4. The method of clause 1, wherein deriving the one or more motion vector offsets comprises: deriving a first motion vector offset with reference to a first reference picture; and deriving a second motion vector offset with reference to a second reference picture based on the first motion vector offset.

5. The method according to any of clauses 1-4, wherein the one or more motion vector offsets comprise a first offset (offset0) and a second offset (offset1), and wherein performing the conversion comprises calculating the motion vector using the SatShift formula defined as: SatShift(x, n) = (x + offset0) >> n if x >= 0, and SatShift(x, n) = -((-x + offset1) >> n) if x < 0.

6. The method of any of clauses 1-5, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.

7. A video processing method, comprising: constructing a motion candidate list for conversion between a current video block of a current picture of the video and a codec representation of the video, the motion candidates in the motion candidate list being included in order of priority such that the priority of the motion candidates is based on a resolution of a reference picture associated with the motion candidates; and performing the conversion using the motion candidate list.

8. The method according to clause 7, wherein the motion candidate referring to the reference picture with the higher resolution has a higher priority than another motion candidate referring to another reference picture with a lower resolution.

9. The method according to clause 7, wherein the motion candidate referring to the reference picture with the higher resolution is placed in the motion candidate list before another Merge candidate referring to another reference picture with a lower resolution.

10. The method of clause 7, wherein constructing the motion candidate list is performed without including motion candidates that reference a reference picture having a resolution lower than a resolution of a current picture including the current video block.

11. The method according to clause 7, wherein whether and/or how to update the lookup table depends on a resolution of a reference picture associated with the decoded motion candidate.

12. The method of clause 11, wherein the reference picture is associated with a decoded motion candidate and has a resolution different from a resolution of a current picture comprising the current video block, the motion candidate from the reference picture not allowing the lookup table to be updated.

13. The method according to any of clauses 7-12, wherein the motion candidates are temporal motion candidates.

14. The method according to any of clauses 7-12, wherein the motion candidates are spatial motion candidates.

15. The method according to any of clauses 7-12, wherein the motion candidate is a history-based motion candidate.

16. The method according to any of clauses 7 to 15, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

17. A video processing method, comprising: determining parameters of an adaptive loop filter for a current video picture based on a size of the current video picture including one or more video units; and performing a conversion between the current video picture and the codec representation of the current video picture by filtering the one or more video units according to the parameters of the adaptive loop filter.

18. The method of clause 17, wherein the parameters are signaled in a video unit and are associated with one or more picture sizes.

19. The method of clause 17, wherein the video unit for signaling the parameter is associated with one or more picture sizes.

20. The method according to clause 17, wherein the parameters signaled in the video unit and associated with the same size are applied to the current picture.

21. The method of clause 17, wherein the codec representation comprises a data structure that signals at least one of a resolution, an index of a Picture Parameter Set (PPS), an indication of the resolution.

22. The method of clause 17, wherein the parameter is inherited or predicted from a parameter for another picture having the same resolution as the current picture.

23. The method of clause 17, wherein the parameters comprise a first parameter set associated with a first respective size and a second parameter set associated with a second respective size, the second parameter set being inherited or predicted from the first parameter set.

24. The method of clause 23, wherein the first respective size is the same as the second respective size.

25. The method of clause 23, wherein the first respective size is different than the second respective size.

26. The method of any of clauses 17 to 25, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.

27. A video processing method, comprising: applying a Luma Mapping (LMCS) process with chroma scaling, in which luma samples of a current video block are reshaped between a first domain and a second domain, to a current video block of a current picture of a video, and scaling a chroma residual in a luma-dependent manner by using LMCS parameters associated with respective sizes; and performing a conversion between the current video block and a codec representation of the video.

28. The method according to clause 27, wherein the LMCS parameters are signaled in a video unit and are associated with one or more picture sizes.

29. The method of clause 27, wherein the video unit used to signal the LMCS parameters is associated with one or more picture sizes.

30. The method according to clause 27, wherein the LMCS parameters signaled in the video unit and associated with the same size are applied to the current picture.

31. The method of clause 27, wherein the codec representation comprises a data structure that signals at least one of a resolution, an index of a picture parameter set, an indication of a resolution.

32. The method of clause 27, wherein the LMCS parameters are inherited or predicted from parameters for another picture having the same resolution as the current picture.

33. The method of clause 27, wherein the LMCS parameters comprise a first LMCS parameter associated with a first respective size and a second LMCS parameter associated with a second respective size, the second LMCS parameter inherited or predicted from the first LMCS parameter.

34. The method of clause 33, wherein the first respective size is the same as the second respective size.

35. The method of clause 33, wherein the first respective size is different than the second respective size.

36. The method according to any of clauses 27 to 35, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.

37. A video processing method, comprising: determining, for a transition between a current video block of a video and a codec representation of the video, whether and/or how to enable a codec tool that divides the current video block into a plurality of sub-partitions according to rules based on reference picture information for the plurality of sub-partitions; and performing a conversion based on the determination.

38. The method of clause 37, wherein the coding tool corresponds to a triangular prediction mode (TPM) in which at least one of the sub-partitions is a non-rectangular partition, or to inter prediction with geometric partitioning (GEO) in which a video block may be split along a partition line that is not restricted to the horizontal or vertical direction.

39. The method of clause 37, wherein the rule specifies whether and/or how to enable the coding tool based on whether a resolution of one of two reference pictures corresponding to the two sub-partitions is the same as or different from a resolution of a current picture comprising the current video block.

40. The method according to clause 39, wherein the rule specifies that the coding tool is not enabled if at least one of the two reference pictures is associated with a resolution different from the resolution of the current picture.

41. The method of clause 39, wherein the rule specifies that the coding tool is not enabled if the two reference pictures have different resolutions from each other.

42. The method of clause 39, wherein the rule specifies that the coding tool is not enabled if the two reference pictures are associated with a resolution different from the resolution of the current picture.

43. The method of clause 39, wherein the rule specifies that the coding tool is not enabled if the two reference pictures are associated with the same resolution that is different from the resolution of the current picture.

44. The method according to clause 39, wherein the rule specifies that the coding tool is not enabled if at least one of the two reference pictures is associated with a resolution different from the resolution of the current picture.

45. The method of clause 37, wherein the rule specifies whether and/or how to enable the coding tool based on whether two reference pictures corresponding to the two sub-partitions are the same reference picture.

46. The method of clause 37, wherein the rule specifies whether and/or how to enable the coding tool based on whether two reference pictures corresponding to the two sub-partitions are in the same reference list.

47. The method according to clause 37, wherein the rule specifies that the coding tool is always disabled if reference picture resampling is enabled in a video unit of the video.

48. The method according to any of clauses 37 to 47, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.

49. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of clauses 1-48.

50. A computer program product stored on a non-transitory computer readable medium, the computer program product comprising program code for performing the method of any of clauses 1-48.

A third set of clauses describes certain features and aspects of the disclosed technology listed in the previous section, including, for example, example implementations 8-13 and 15.

1.A video processing method, comprising: determining, for a transition between a current video block of a current picture of a video and a codec representation of the video, that use of a codec tool is disabled for the current video block due to the current video block being codec into the codec representation using a reference picture having a size different from a size of the current picture; and performing a conversion based on the determination.

2. The method according to clause 1, wherein information related to the codec tool is not signaled in a case where use of the codec tool is disabled.

3. The method of clause 1, wherein the motion information of the current video block is not inserted into a history-based motion vector prediction (HMVP) table, wherein the HMVP table includes one or more entries corresponding to motion information of one or more previously processed blocks.

4. The method according to any of clauses 1 to 3, wherein the codec tool corresponds to decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode, Symmetric Motion Vector Differences (SMVD), Merge mode with motion vector differences (MMVD), inter-frame intra prediction, Local Illumination Compensation (LIC), history-based motion vector prediction (HMVP), Multiple Transform Sets (MTS), sub-block transform (SBT), prediction refinement with optical flow (PROF), low-frequency non-separable transform (LFNST), or a filtering tool.

5. The method of any of clauses 1-4, wherein the size comprises at least one of a width and a height of the current picture.

6. The method of any of clauses 1-5, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.
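
As a minimal C++ sketch of the condition in clauses 1 and 5 (illustrative only; the Size type and the function name are hypothetical), the tool is disabled exactly when the reference picture differs from the current picture in width or height, and the tool's syntax would then be omitted from the bitstream per clause 2:

    struct Size { int w; int h; };

    // Clauses 1 and 5: the codec tool is allowed only when the reference
    // picture matches the current picture in both width and height.
    bool codecToolAllowed(const Size& curPic, const Size& refPic) {
        return curPic.w == refPic.w && curPic.h == refPic.h;
    }

    // Encoder-side signaling guard per clause 2 (sketch):
    //   if (codecToolAllowed(cur, ref)) writeToolFlag(bitstream, useTool);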

7. A video processing method, comprising: for a current video block of a current picture of the video, generating a prediction block by applying a coding tool to the current video block based on rules that determine whether and/or how to use a reference picture of a size different from a size of the current picture; and performing a conversion between the current video block and a codec representation of the video using the prediction block.

8. The method of clause 7, wherein the rule specifies that the prediction block is not generated using the reference picture due to a coding tool applied to the current video block.

9. The method according to clause 8, wherein Merge candidates that refer to the reference picture are skipped or not placed into the Merge candidate list.

10. The method according to clause 8, wherein the signaling of the reference index corresponding to the reference picture is skipped or not allowed.

11. The method according to clause 7, wherein the rule specifies scaling the reference picture according to the resolution of the current picture and the resolution of the reference picture before applying the coding tool.

12. The method according to clause 7, wherein the rule specifies scaling the motion vector or motion vector difference according to the resolution of the current picture and the resolution of the reference picture before applying the coding tool.
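
As an illustration of the scaling in clause 12, the following C++ sketch scales a motion vector by the ratio of the reference and current picture dimensions before the coding tool is applied; the MV type is hypothetical, and the fixed-point arithmetic and rounding of a real codec are deliberately omitted:

    struct MV { int x; int y; };

    // Sketch of clause 12: scale an MV from the current-picture grid to the
    // reference-picture grid using per-axis resolution ratios (assumed form).
    MV scaleMvToReference(MV mv, int curW, int curH, int refW, int refH) {
        MV out;
        out.x = mv.x * refW / curW;  // truncating division; real codecs round
        out.y = mv.y * refH / curH;
        return out;
    }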

13. The method according to any of clauses 7 to 12, wherein the codec tool corresponds to decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode, Symmetric Motion Vector Differences (SMVD), Merge mode with motion vector differences (MMVD), inter-frame intra prediction, Local Illumination Compensation (LIC), history-based motion vector prediction (HMVP), Multiple Transform Sets (MTS), sub-block transform (SBT), prediction refinement with optical flow (PROF), low-frequency non-separable transform (LFNST), or a filtering tool.

14. The method according to any of clauses 7 to 12, wherein performing the conversion comprises generating the codec representation from the video or generating the video from the codec representation.

15. A video processing method, comprising: determining, for a transition between a current video block of a current picture of the video and a codec representation of the video, whether to disable a codec tool for the current video block based on a first resolution of a reference picture associated with one or more reference picture lists and/or a second resolution of the reference picture currently used to derive a prediction block for the current video block; and performing a conversion based on the determination.

16. The method of clause 15, wherein the determining determines to disable the coding tool for one reference picture list and enable the coding tool for another reference picture list.

17. The method of clause 15, wherein the determining determines that the coding tool is disabled for a first reference picture of the reference picture list and the coding tool is enabled for a second reference picture of the reference picture list or another reference picture list.

18. The method of clause 15, wherein the determining determines whether the coding tool is disabled for the first reference picture list without regard to a second reference picture list that is different from the first reference picture list.

19. The method of clause 18, wherein the determining determines whether to disable the coding tool based on the reference picture of the first reference picture list and the current picture.

20. The method of clause 18, wherein, in the event that the reference picture associated with the first reference picture list is different from the current picture, the determining determines whether to disable the coding tool for the first reference picture list.

21. The method of clause 15, wherein the determining further determines whether to disable the coding tool based on other resolutions of reference pictures and/or current pictures associated with the one or more reference picture lists.

22. The method of clause 21, wherein the determining determines to disable the coding tool in the event that the at least one reference picture has a resolution different from a resolution of the current picture.

23. The method of clause 21, wherein the determining determines not to disable the coding tool in the event that the at least one reference picture has a different resolution than the resolution of the current picture and the reference pictures are associated with the same resolution.

24. The method of clause 21, wherein the determining determines to disable the coding tool in the event that the at least one reference picture has a different resolution than the resolution of the current picture and the reference pictures are associated with different resolutions from each other.

25. The method according to any of clauses 15 to 24, wherein the codec tool corresponds to decoder-side motion vector refinement (DMVR), bi-directional optical flow (BDOF), affine prediction, triangle prediction mode, Symmetric Motion Vector Differences (SMVD), Merge mode with motion vector differences (MMVD), inter-frame intra prediction, Local Illumination Compensation (LIC), history-based motion vector prediction (HMVP), Multiple Transform Sets (MTS), sub-block transform (SBT), prediction refinement with optical flow (PROF), low-frequency non-separable transform (LFNST), or a filtering tool.

26. The method according to any of clauses 15 to 25, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.
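
A minimal C++ sketch of the per-list determination in clauses 16 to 20 follows (illustrative only; the Size type and the function name are hypothetical). Per clause 18, the decision for one reference picture list is made without regard to the other list:

    struct Size { int w; int h; };

    // Clauses 18 to 20: decide per list, against that list's reference picture.
    bool toolDisabledForList(const Size& curPic, const Size& refPicInList) {
        return refPicInList.w != curPic.w || refPicInList.h != curPic.h;
    }

    // Usage per clause 16: the tool may be disabled for list 0 but not list 1.
    //   bool offL0 = toolDisabledForList(cur, refL0);
    //   bool offL1 = toolDisabledForList(cur, refL1);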

27. A video processing method, comprising: performing a conversion between a video picture comprising one or more video blocks and a codec representation of the video, wherein at least some of the one or more video blocks are codec by referencing a reference picture list of the video picture according to a rule, wherein the rule specifies that the reference picture list comprises reference pictures having at most K different resolutions, wherein K is an integer.

28. The method of clause 27, wherein K equals 2.

29. The method of clause 27 or 28, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.

30. A video processing method, comprising: performing a conversion between N consecutive video pictures of the video and a codec representation of the video, wherein the N consecutive video pictures comprise one or more video blocks codec at different resolutions according to a rule, wherein the rule specifies that at most K different resolutions are allowed for the N consecutive pictures, wherein N and K are integers.

31. The method of clause 30, wherein N and K are equal to 3.

32. The method of clause 30, wherein N equals 10 and K equals 3.

33. The method of clause 30, wherein K different resolutions are allowed in a group of pictures (GOP) in the codec representation.

34. The method of clause 30, wherein K different resolutions are allowed between two pictures having the same temporal layer identification.

35. The method of any of clauses 30 to 34, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.
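
The constraint of clause 30 lends itself to a conformance-style check; the following C++ sketch verifies that every window of N consecutive pictures uses at most K distinct resolutions (the container choices and the function name are illustrative assumptions):

    #include <cstddef>
    #include <set>
    #include <utility>
    #include <vector>

    // Clause 30: among any N consecutive pictures, at most K resolutions.
    bool atMostKResolutions(const std::vector<std::pair<int, int>>& picSizes,
                            std::size_t N, std::size_t K) {
        for (std::size_t i = 0; i + N <= picSizes.size(); ++i) {
            std::set<std::pair<int, int>> res(picSizes.begin() + i,
                                              picSizes.begin() + i + N);
            if (res.size() > K) return false;
        }
        return true;
    }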

36. A video processing method, comprising: performing a conversion between video comprising a plurality of pictures and a codec representation of the video, wherein at least some of the plurality of pictures are codec into the codec representation using different codec video resolutions, and wherein the codec representation complies with a format rule that a first codec resolution of a previous frame is changed to a second codec resolution of a next frame only in a case where the next frame, sequentially following the previous frame, is codec as an intra frame.

37. The method of clause 36, wherein the order corresponds to an encoding order in which the plurality of pictures are encoded.

38. The method of clause 36, wherein the order corresponds to a decoding order of decoding the plurality of pictures.

39. The method of clause 36, wherein the order corresponds to a display order in which the plurality of pictures are displayed after decoding.

40. The method according to any of clauses 36 to 39, wherein the intra frame is an intra random access point picture.

41. The method of any of clauses 36-39, wherein the intra frame is an Instantaneous Decoding Refresh (IDR) frame.

42. The method according to any of clauses 36 to 41, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.
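
The format rule of clauses 36 to 41 can be checked pairwise over frames in the chosen order, as in this illustrative C++ sketch (the Frame type is hypothetical; isIntra would be true for an IRAP or IDR picture per clauses 40 and 41):

    struct Frame { int w; int h; bool isIntra; };

    // Clause 36: a resolution change from prev to next is conformant only
    // if next is coded as an intra frame.
    bool resolutionChangeConformant(const Frame& prev, const Frame& next) {
        const bool changed = (prev.w != next.w) || (prev.h != next.h);
        return !changed || next.isIntra;
    }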

43. A video processing method, comprising: parsing a codec representation of a video to determine that a current video block of a current picture of the video references a reference picture associated with a resolution different from a resolution of the current picture; generating a prediction block of the current video block by converting a bi-directional prediction mode into a unidirectional prediction mode to be applied to the current video block; and generating the video from the codec representation using the prediction block.

44. The method of clause 43, wherein generating the prediction block comprises discarding the prediction from a reference picture list that includes a reference picture associated with a resolution different from the resolution of the current picture.

45. A video processing method, comprising: generating a prediction block for a current video block of a current picture of a video by enabling or disabling inter prediction from reference pictures having different resolutions from each other according to a motion vector precision and/or a resolution ratio; and performing a conversion between the current video block and a codec representation of the video using the prediction block.

46. The method according to clause 45, wherein inter prediction is enabled if the motion vector scaled according to the resolution ratio points to an integer position.

47. The method of clause 45, wherein inter prediction is enabled if the motion vector scaled according to the resolution ratio points to a sub-pixel position.
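
For clauses 46 and 47, whether a scaled motion vector points to an integer sample position depends on the MV storage precision; the following C++ sketch assumes quarter-sample precision (two fractional bits), which is an assumption of this illustration rather than a requirement of the clauses:

    struct MV { int x; int y; };  // components in quarter-sample units (assumed)

    // Clause 46: enable inter prediction when both scaled components fall on
    // integer sample positions, i.e., their fractional bits are zero.
    bool pointsToIntegerPosition(const MV& scaledMv) {
        const int fracMask = (1 << 2) - 1;  // 2 fractional bits assumed
        return (scaledMv.x & fracMask) == 0 && (scaledMv.y & fracMask) == 0;
    }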

48. The method of clause 45, wherein bi-prediction is disabled if the reference picture is associated with a resolution different from the resolution of the current picture.

49. The method of clause 45, wherein bi-prediction is enabled in the event that one reference picture has a different resolution than the resolution of the current picture and another reference picture is associated with the same resolution as the resolution of the current picture.

50. The method of clause 45, wherein unidirectional prediction is not allowed if the reference picture is associated with a resolution different from the resolution of the current picture and the block size of the current video block satisfies a particular condition.

51. The method according to any of clauses 45 to 50, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.

52. A video processing method, comprising: determining, based on a codec characteristic of a current video block of a current picture of the video, whether use of a reference picture having a size different from a size of the current picture is allowed for generating a prediction block for the current video block during a transition between the current video block and a codec representation of the video; and performing a conversion according to the determination.

53. The method of clause 52, wherein the codec characteristics comprise a size of the current video block and/or a prediction mode of the current video block.

54. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of WB >= T1 and HB >= T2, wherein WB and HB correspond to the width and height of the current video block, respectively, and T1 and T2 are positive integers.

55. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of WB × HB >= T, where WB and HB correspond to the width and height of the current video block, respectively, and T is a positive integer.

56. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of Min(WB, HB) >= T, where WB and HB correspond to the width and height of the current video block, respectively, and T is a positive integer.

57. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of Max(WB, HB) >= T, where WB and HB correspond to the width and height of the current video block, respectively, and T is a positive integer.

58. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of WB <= T1 and HB <= T2, where WB and HB correspond to the width and height of the current video block, respectively, and T1 and T2 are positive integers.

59. The method of clause 52, wherein the determining determines that a reference picture having a size different from the size of the current picture is allowed in the case of WB × HB <= T, where WB and HB correspond to the width and height of the current video block, respectively, and T is a positive integer.

60. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of Min(WB, HB) <= T, where WB and HB correspond to the width and height of the current video block, respectively, and T is a positive integer.

61. The method of clause 52, wherein the determining determines that the reference picture is allowed in the case of Max(WB, HB) <= T, where WB and HB correspond to the width and height of the current video block, respectively, and T is a positive integer.

62. The method of clause 52, wherein the determining determines that the reference picture is not allowed in the case of WB <= T1 and/or HB <= T2, where WB and HB correspond to the width and height of the current video block, respectively, and T1 and T2 are positive integers.

63. The method according to any of clauses 52 to 62, wherein performing the conversion comprises generating a codec representation from the video or generating the video from the codec representation.
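
The block-size conditions of clauses 54 to 62 reduce to simple threshold tests; this C++ sketch shows the reading of clauses 54 and 62, with the threshold values chosen only for illustration:

    // Clauses 54 and 62 (sketch): a reference picture of a different size is
    // allowed only when the block is at least T1 x T2; T1 and T2 are
    // hypothetical example thresholds, not normative values.
    bool differentSizeRefAllowed(int WB, int HB) {
        const int T1 = 8, T2 = 8;
        return WB >= T1 && HB >= T2;
    }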

64. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of clauses 1-63.

65. A computer program product stored on a non-transitory computer readable medium, the computer program product comprising program code for performing the method of any of clauses 1 to 63.

From the foregoing, it will be appreciated that specific embodiments of the presently disclosed technology have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the disclosed technology. Accordingly, the presently disclosed technology is not limited except as by the appended claims.

Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing unit" or "data processing apparatus" includes all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., programmable read-only memory and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

The specification and drawings are to be regarded as exemplary only, where exemplary means serving as an example. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Moreover, use of "or" is intended to include "and/or" unless the context clearly indicates otherwise.

Although this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.

Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.
