Method for resampling reference picture by offset in video code stream

Document No.: 1821937    Publication date: 2021-11-09

Reading note: This technology, "Method for resampling reference picture by offset in video code stream", was designed and created by Byeongdoo Choi, Stephan Wenger, and Shan Liu on 2020-09-22. Its main content includes: a method, apparatus, and computer-readable medium for decoding an encoded video bitstream using at least one processor, comprising: obtaining a flag indicating that a conformance window is not used for reference picture resampling; determining whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling; based on determining that the resampled picture size is signaled, determining a resampling ratio based on the resampled picture size; based on determining that the resampled picture size is not signaled, determining the resampling ratio based on an output picture size; and performing reference picture resampling on the current picture using the resampling ratio.

1. A method for decoding an encoded video stream using at least one processor, the method comprising:

obtaining a flag indicating that a conformance window is not used for reference picture resampling;

determining whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling;

based on determining that the resampled picture size is signaled, determining a resampling ratio based on the resampled picture size;

based on determining that the resampled picture size is not signaled, determining the resampling ratio based on an output picture size; and

performing the reference picture resampling for a current picture using the resampling ratio.

2. The method of claim 1, wherein the flag is signaled in a picture parameter set.

3. The method of claim 1, wherein the resampled picture size is signaled in the encoded video stream as at least one of a width of the resampled picture size and a height of the resampled picture size.

4. The method of claim 3, wherein at least one of the width and the height is signaled in a picture parameter set.

5. The method of claim 3, wherein at least one of the width and the height is expressed as a number of luma samples included in the at least one of the width and the height.

6. The method of claim 3, wherein at least one of the width and the height is determined based on at least one offset distance between a boundary of a current picture and a predetermined luma sample of a reference region.

7. The method of claim 6, wherein the at least one offset distance is signaled in a picture parameter set.

8. The method of claim 6, wherein the at least one offset distance comprises at least one of:

a horizontal offset distance between a left boundary of the current picture and a top-left luma sample of the reference region,

a vertical offset distance between a top boundary of the current picture and the top-left luma sample of the reference region,

a horizontal offset distance between a right boundary of the current picture and a bottom-right luma sample of the reference region, and

a vertical offset distance between a bottom boundary of the current picture and the bottom-right luma sample of the reference region.

9. An apparatus for decoding an encoded video stream, the apparatus comprising:

at least one memory configured to store program code; and

at least one processor configured to read the program code and to operate according to instructions of the program code, the program code comprising:

obtaining code configured to cause the at least one processor to obtain a flag indicating that a conformance window is not used for reference picture resampling;

first determining code configured to cause the at least one processor to determine whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling;

second determining code configured to cause the at least one processor to determine a resampling ratio based on the resampled picture size based on determining that the resampled picture size is signaled;

third determining code configured to cause the at least one processor to determine the resampling ratio based on an output picture size based on determining that the resampled picture size is not signaled; and

executing code configured to cause the at least one processor to perform the reference picture resampling for a current picture using the resampling ratio.

10. The apparatus of claim 9, wherein the flag is signaled in a picture parameter set.

11. The apparatus of claim 9, wherein the resampled picture size is signaled in the encoded video stream as at least one of a width of the resampled picture size and a height of the resampled picture size.

12. The apparatus of claim 11, wherein at least one of the width and the height is signaled in a picture parameter set.

13. The apparatus of claim 11, wherein at least one of the width and the height is expressed as a number of luma samples included in the at least one of the width and the height.

14. The apparatus of claim 11, wherein at least one of the width and the height is determined based on at least one offset distance between a boundary of a current picture and a predetermined luma sample of a reference region.

15. The apparatus of claim 14, wherein the at least one offset distance is signaled in a picture parameter set.

16. The apparatus of claim 14, wherein the at least one offset distance comprises at least one of:

a horizontal offset distance between a left boundary of the current picture and a top-left luma sample of the reference region,

a vertical offset distance between a top boundary of the current picture and the top-left luma sample of the reference region,

a horizontal offset distance between a right boundary of the current picture and a bottom-right luma sample of the reference region, and

a vertical offset distance between a bottom boundary of the current picture and the bottom-right luma sample of the reference region.

17. A non-transitory computer-readable medium storing instructions, the instructions comprising: one or more instructions that, when executed by one or more processors of a device for decoding an encoded video stream, cause the one or more processors to:

obtaining a flag indicating that a conformance window is not used for reference picture resampling;

determining whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling;

based on determining that the resampled picture size is signaled, determining a resampling ratio based on the resampled picture size;

based on determining that the resampled picture size is not signaled, determining the resampling ratio based on an output picture size; and

performing the reference picture resampling for a current picture using the resampling ratio.

18. The non-transitory computer readable medium of claim 17, wherein the resampled picture size is signaled in the encoded video stream in at least one of a width of the resampled picture size and a height of the resampled picture size.

19. The non-transitory computer-readable medium of claim 18, wherein the at least one of the width and the height is expressed as a number of luma samples included in the at least one of the width and the height.

20. The non-transitory computer-readable medium of claim 18, wherein at least one of the width and the height is determined based on at least one offset distance between a boundary of a current picture and a predetermined luma sample of a reference region.

Technical Field

The subject matter disclosed herein relates to video encoding and decoding, and more particularly to signaling reference picture resampling with a resampled picture size indication.

Background

Video encoding and decoding using inter picture prediction with motion compensation is known. Uncompressed digital video may consist of a series of pictures, each picture having certain spatial dimensions, e.g., 1920 × 1080 luma samples and associated chroma samples. The series of pictures may have a fixed or variable picture rate (informally also referred to as frame rate), for example, 60 pictures per second or 60 hertz (Hz). Uncompressed video has significant bit rate requirements. For example, 1080p60 4:2:0 video (1920 × 1080 luma sample resolution at a 60 Hz frame rate) with 8 bits per sample requires a bandwidth of approximately 1.5 Gbit/s. One hour of such video requires more than 600 GB of storage space.
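
For illustration, the figures above can be verified with a short calculation (a non-normative Python sketch; in the 4:2:0 format, the two chroma planes together contribute half a sample per luma sample, i.e., 1.5 samples per pixel):

    # Bandwidth of uncompressed 1080p60 4:2:0 video at 8 bits per sample.
    width, height, fps, bits_per_sample = 1920, 1080, 60, 8
    samples_per_picture = width * height * 1.5  # luma plus two quarter-size chroma planes
    bits_per_second = samples_per_picture * bits_per_sample * fps
    print(f"{bits_per_second / 1e9:.2f} Gbit/s")              # ~1.49 Gbit/s
    print(f"{bits_per_second * 3600 / 8 / 1e9:.0f} GB/hour")  # ~672 GB for one hour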

One purpose of video encoding and decoding may be to reduce redundancy in the input video signal through compression. Compression can help reduce the bandwidth or storage requirements described above, in some cases by two or more orders of magnitude. Both lossless and lossy compression, as well as combinations thereof, may be used for video encoding and decoding. Lossless compression refers to techniques by which an exact copy of the original signal can be reconstructed from the compressed original signal. When lossy compression is used, the reconstructed signal may not be identical to the original signal, but the distortion between the original signal and the reconstructed signal is small enough that the reconstructed signal can be used for the intended application. Lossy compression is widely used in video. The amount of distortion allowed depends on the application; for example, users of certain consumer streaming applications may tolerate higher distortion than users of television distribution applications. The achievable compression ratio can reflect that higher allowable/tolerable distortion can yield a higher compression ratio.

Video encoders and decoders may use several broad classes of techniques including, for example, motion compensation, transform, quantization, and entropy coding. Some of these classes of technologies are described below.

Historically, video encoders and decoders have tended to operate at a given picture size, which in most cases is defined and kept constant for a coded video sequence (CVS), a group of pictures (GOP), or a similar multi-picture timeframe. In MPEG-2, for example, system designs are known to change the horizontal resolution (and thereby the picture size) depending on factors such as scene activity, but only at I pictures, so the change generally applies to a GOP. The resampling of reference pictures so that different resolutions can be used within a CVS is known from, for example, ITU-T Recommendation H.263 Annex P. However, because the picture size itself does not change and only the reference picture is resampled, this may result in only parts of the picture canvas being used (in the case of downsampling) or only parts of the scene being captured (in the case of upsampling). Furthermore, H.263 Annex Q allows the resampling of an individual macroblock upward or downward by a factor of two in each dimension; again, the picture size remains unchanged. The size of a macroblock is fixed in H.263 and therefore does not need to be signaled.

Changes of picture size within sequences of predicted pictures are becoming more prevalent in modern video coding. For example, VP9 allows reference picture resampling and resolution change for an entire picture. Similarly, certain proposals made for VVC, including, for example, "Adaptive Resolution Change (ARC) for VVC" by Hendry et al., Joint Video Experts Team document JVET-M0135-v1, 9-19 January 2019, incorporated herein in its entirety, allow resampling of an entire reference picture to a different (higher or lower) resolution. In that document, it is proposed to encode different candidate resolutions in the sequence parameter set and to reference them by per-picture syntax elements in the picture parameter set.

Disclosure of Invention

In one embodiment, a method of decoding an encoded video bitstream using at least one processor is provided, the method comprising: obtaining a flag indicating that a conformance window is not used for reference picture resampling; determining whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling; based on determining that the resampled picture size is signaled, determining a resampling ratio based on the resampled picture size; based on determining that the resampled picture size is not signaled, determining the resampling ratio based on an output picture size; and performing the reference picture resampling for the current picture using the resampling ratio.

In one embodiment, there is provided an apparatus for decoding an encoded video stream, the apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and to operate according to instructions of the program code, the program code comprising: obtaining code configured to cause the at least one processor to obtain a flag indicating that a conformance window is not used for reference picture resampling; first determining code configured to cause the at least one processor to determine whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling; second determining code configured to cause the at least one processor to determine a resampling ratio based on the resampled picture size based on determining that the resampled picture size is signaled; third determining code configured to cause the at least one processor to determine the resampling ratio based on an output picture size based on determining that the resampled picture size is not signaled; and executing code configured to cause the at least one processor to perform the reference picture resampling for a current picture using the resampling ratio.

In one embodiment, a non-transitory computer-readable medium storing instructions is provided, the instructions comprising: one or more instructions that, when executed by one or more processors of a device for decoding an encoded video stream, cause the one or more processors to perform: obtaining a flag indicating that a conformance window is not used for reference picture resampling; determining whether a resampled picture size is signaled based on the flag indicating that the conformance window is not used for reference picture resampling; based on determining that the resampled picture size is signaled, determining a resampling ratio based on the resampled picture size; based on determining that the resampled picture size is not signaled, determining the resampling ratio based on an output picture size; and performing the reference picture resampling for the current picture using the resampling ratio.
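
The decision logic shared by the three embodiments above can be summarized in the following non-normative Python sketch; the structure and field names (pps, use_conf_win_for_rpr_flag, and so on) are illustrative stand-ins for the syntax elements described in the detailed description below, not definitive definitions:

    def derive_resampling_ratio(pps, ref_pic, cur_pic):
        # The flag indicating that the conformance window is not used for
        # reference picture resampling gates whether a resampled picture
        # size is signaled.
        if pps.use_conf_win_for_rpr_flag == 0:
            # Resampled picture size is signaled: derive the ratio from it.
            ref_w = pps.resampled_pic_width_in_luma_samples
            ref_h = pps.resampled_pic_height_in_luma_samples
        else:
            # Not signaled: fall back to the output picture size of the
            # reference picture (derived from its conformance window).
            ref_w = ref_pic.output_width_in_luma_samples
            ref_h = ref_pic.output_height_in_luma_samples
        # Resampling ratio applied when resampling the reference picture
        # for prediction of the current picture.
        return (ref_w / cur_pic.output_width_in_luma_samples,
                ref_h / cur_pic.output_height_in_luma_samples)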

Drawings

Other features, properties, and various advantages of the disclosed subject matter will be more apparent from the following detailed description and the accompanying drawings, in which:

fig. 1 is a schematic diagram of a simplified block diagram of a communication system according to an embodiment.

Fig. 2 is a schematic diagram of a simplified block diagram of a communication system according to an embodiment.

Fig. 3 is a schematic diagram of a simplified block diagram of a decoder according to an embodiment.

Fig. 4 is a schematic diagram of a simplified block diagram of an encoder according to an embodiment.

Fig. 5 is a schematic illustration of an option for signaling ARC parameters according to an embodiment.

Fig. 6A to 6B are schematic diagrams of examples of syntax tables according to an embodiment.

Fig. 7 is a schematic diagram of signaling picture size and conformance window in a PPS, according to an embodiment.

Fig. 8 is a schematic diagram of signaling picture size and conformance window in a PPS according to an embodiment.

Fig. 9 is a flow diagram of an example method for decoding an encoded video bitstream, according to an embodiment.

Fig. 10 is a schematic diagram of a computer system, according to an embodiment.

Detailed Description

Fig. 1 is a simplified block diagram of a communication system (100) according to an embodiment disclosed herein. The communication system (100) comprises at least two terminal devices (110) and (120), which may be interconnected via a network (150). For unidirectional data transmission, the first terminal device (110) may encode video data at a local location for transmission over the network (150) to the second terminal device (120). The second terminal device (120) may receive the encoded video data of the first terminal device from the network (150), decode the encoded video data, and display the recovered video data. Unidirectional data transmission is common in applications such as media serving.

Fig. 1 also illustrates a second pair of terminal devices (130, 140) provided to support the bidirectional transmission of encoded video, which may occur, for example, during a video conference. For bidirectional data transmission, each of the third terminal device (130) and the fourth terminal device (140) may encode video data captured at a local location for transmission over the network (150) to the other of the third terminal device (130) and the fourth terminal device (140). Each of the third terminal device (130) and the fourth terminal device (140) may also receive the encoded video data transmitted by the other of the third terminal device (130) and the fourth terminal device (140), decode the encoded video data, and display the recovered video data on a local display device.

In the embodiment of fig. 1, the first terminal device (110), the second terminal device (120), the third terminal device (130), and the fourth terminal device (140) may be, for example, servers, personal computers, and smart phones, but the principles disclosed herein are not limited thereto. Embodiments disclosed herein are applicable to laptop computers, tablet computers, media players, and/or dedicated video conferencing equipment. The network (150) represents any number of networks that convey encoded video data between the first terminal device (110), the second terminal device (120), the third terminal device (130), and the fourth terminal device (140), including, for example, wired and/or wireless communication networks. The communication network (150) may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks, wide area networks, and/or the Internet. For purposes of the present discussion, the architecture and topology of the network (150) may be immaterial to the operation disclosed herein, unless explained below.

As an example of an application for the subject matter disclosed in this application, fig. 2 shows the placement of a video encoder and a video decoder in a streaming environment. The subject matter disclosed herein is equally applicable to other video-enabled applications, including, for example, video conferencing, digital TV, and the storing of compressed video on digital media including CDs, DVDs, memory sticks, and the like. The streaming system may include an acquisition subsystem (213), which may include a video source (201), such as a digital camera, that creates a stream of uncompressed video samples (202). The stream of video samples (202) is depicted as a bold line to emphasize its high data volume compared to the encoded video codestream, and can be processed by an encoder (203) coupled to the video source (201). The encoder (203) may comprise hardware, software, or a combination of hardware and software to implement or embody aspects of the disclosed subject matter as described in more detail below. The encoded video codestream (204) is depicted as a thin line to emphasize its lower data volume compared to the video sample stream (202), and may be stored on a streaming server (205) for future use. At least one streaming client (206, 208) can access the streaming server (205) to retrieve copies (207, 209) of the encoded video codestream (204). The client (206) may include a video decoder (210) that decodes an incoming copy (207) of the encoded video codestream and generates an output stream of video samples (211) that may be presented on a display (212) or another presentation device (not depicted). In some streaming systems, the video codestreams (204, 207, 209) may be encoded according to certain video encoding/compression standards. Examples of such standards include ITU-T Recommendation H.265. A video coding standard under development is informally known as Versatile Video Coding (VVC), and the disclosed subject matter may be used in the context of the VVC standard.

Fig. 3 may be a block diagram of a video decoder (210) according to an embodiment of the present disclosure.

The receiver (310) may receive at least one encoded video sequence to be decoded by the video decoder (210). In the same or another embodiment, the receiver (310) may receive one encoded video sequence at a time, wherein each encoded video sequence is decoded independently of the other encoded video sequences. The encoded video sequence may be received from a channel (312), which may be a hardware/software link to a storage device that stores the encoded video data. The receiver (310) may receive the encoded video data together with other data, e.g., encoded audio data and/or ancillary data streams, which may be forwarded to their respective using entities (not depicted). The receiver (310) may separate the encoded video sequence from the other data. To combat network jitter, a buffer memory (315) may be coupled between the receiver (310) and the entropy decoder/parser (320) (hereinafter "parser"). The buffer memory (315) may not be needed, or can be small, when the receiver (310) receives data from a store/forward device of sufficient bandwidth and controllability, or from an isosynchronous network. For use on best-effort packet networks such as the Internet, the buffer memory (315) may be required, can be comparatively large, and can advantageously be of adaptive size.

The video decoder (210) may include a parser (320) to reconstruct symbols (321) from the entropy-coded video sequence. The categories of those symbols include information used to manage the operation of the video decoder (210), and potentially information to control a display device, such as the display screen (212), that is not an integral part of the video decoder but can be coupled to it, as shown in fig. 3. The control information for the display device(s) may be in the form of Supplemental Enhancement Information (SEI) messages or Video Usability Information (VUI). The parser (320) may parse/entropy-decode the received encoded video sequence. The coding of the encoded video sequence may be in accordance with a video coding technology or standard, and may follow various principles, including variable length coding, Huffman coding, arithmetic coding with or without context sensitivity, and so forth. The parser (320) may extract from the encoded video sequence, based on at least one parameter corresponding to the group, a set of subgroup parameters for at least one of the subgroups of pixels in the video decoder. Subgroups may include Groups of Pictures (GOPs), pictures, sub-pictures, tiles, slices, bricks, macroblocks, Coding Tree Units (CTUs), Coding Units (CUs), blocks, Transform Units (TUs), Prediction Units (PUs), and so forth. A tile may refer to a rectangular region of CUs/CTUs within a particular tile row and column in a picture. A brick may refer to a rectangular region of CU/CTU rows within a particular tile. A slice may refer to at least one brick of a picture that is contained in a NAL unit. A sub-picture may refer to a rectangular region of at least one slice in a picture. The entropy decoder/parser may also extract from the encoded video sequence information such as transform coefficients, quantizer parameter values, motion vectors, and so forth.

The parser (320) may perform entropy decoding/parsing operations on the video sequence received from the buffer memory (315), thereby creating symbols (321).

The reconstruction of the symbol (321) may involve a number of different units depending on the type of the encoded video picture or a portion of the encoded video picture (e.g., inter and intra pictures, inter and intra blocks), among other factors. Which units are involved and the way in which they are involved can be controlled by subgroup control information parsed from the coded video sequence by the parser (320). For the sake of brevity, such a subgroup control information flow between parser (320) and the following units is not described.

In addition to the functional blocks already mentioned, the video decoder (210) may be conceptually subdivided into several functional units as described below. In a practical implementation operating under commercial constraints, many of these units interact closely with each other and may be integrated with each other. However, for the purpose of describing the disclosed subject matter, the conceptual subdivision into the functional units below is appropriate.

The first unit is the scaler/inverse transform unit (351). The scaler/inverse transform unit (351) receives quantized transform coefficients as symbols (321) from the parser (320), along with control information including which transform to use, the block size, the quantization factor, quantization scaling matrices, and so forth. The scaler/inverse transform unit (351) may output blocks comprising sample values that can be input into the aggregator (355).

In some cases, the output samples of the scaler/inverse transform unit (351) may pertain to an intra-coded block; that is, a block that does not use predictive information from previously reconstructed pictures but can use predictive information from previously reconstructed parts of the current picture. Such predictive information may be provided by an intra picture prediction unit (352). In some cases, the intra picture prediction unit (352) generates a block of the same size and shape as the block being reconstructed, using surrounding already-reconstructed information fetched from the current (partially reconstructed) picture (358). In some cases, the aggregator (355) adds, on a per-sample basis, the prediction information generated by the intra prediction unit (352) to the output sample information provided by the scaler/inverse transform unit (351).

In other cases, the output samples of the scaler/inverse transform unit (351) may pertain to an inter-coded, and potentially motion-compensated, block. In such a case, the motion compensated prediction unit (353) may access the reference picture memory (357) to fetch samples used for prediction. After motion compensating the fetched samples in accordance with the symbols (321) pertaining to the block, these samples may be added by the aggregator (355) to the output of the scaler/inverse transform unit (351) (in this case called residual samples or a residual signal) so as to generate output sample information. The addresses within the reference picture memory from which the motion compensation unit fetches prediction samples may be controlled by motion vectors, available to the motion compensation unit in the form of symbols (321) that can have, for example, X, Y, and reference picture components. Motion compensation may also include interpolation of sample values as fetched from the reference picture memory when sub-sample-exact motion vectors are in use, motion vector prediction mechanisms, and so forth.

The output samples of the aggregator (355) may be subject to various loop filtering techniques in the loop filter unit (356). Video compression technologies may include in-loop filter techniques that are controlled by parameters included in the encoded video codestream and made available to the loop filter unit (356) as symbols (321) from the parser (320), but that may also be responsive to meta-information obtained during the decoding of previous (in decoding order) parts of the encoded picture or encoded video sequence, as well as to previously reconstructed and loop-filtered sample values.

The output of the loop filter unit (356) may be a stream of samples that may be output to a display device (212) and stored in a reference picture memory for subsequent inter picture prediction.

Once fully reconstructed, certain encoded pictures may be used as reference pictures for future prediction. For example, once an encoded picture is fully reconstructed and has been identified as a reference picture (by, for example, the parser (320)), the current reference picture (358) may become part of the reference picture buffer (357), and fresh current picture memory may be reallocated before reconstruction of the following encoded picture begins.

The video decoder (210) may perform decoding operations according to a predetermined video compression technology, such as that documented in ITU-T Recommendation H.265. The encoded video sequence may conform to the syntax specified by the video compression technology or standard in use, in the sense that it adheres to the syntax of that technology or standard and to the profiles documented therein. Also necessary for compliance is that the complexity of the encoded video sequence stays within the bounds defined by the level of the video compression technology or standard. In some cases, levels restrict the maximum picture size, the maximum frame rate, the maximum reconstructed sample rate (measured in, for example, megasamples per second), the maximum reference picture size, and so on. In some cases, the limits set by levels may be further restricted through Hypothetical Reference Decoder (HRD) specifications and metadata for HRD buffer management signaled in the encoded video sequence.

In an embodiment, the receiver (310) may receive additional (redundant) data along with the encoded video. The additional data may be included as part of the encoded video sequence. The additional data may be used by the video decoder (210) to properly decode the data and/or to more accurately reconstruct the original video data. The additional data may be in the form of, for example, temporal, spatial, or signal-to-noise ratio (SNR) enhancement layers, redundant slices, redundant pictures, forward error correction codes, and so forth.

Fig. 4 may be a block diagram of a video encoder (203) according to an embodiment of the disclosure.

The encoder (203) may receive video samples from a video source (201) (not part of the encoder) that may capture video images to be encoded by the encoder (203).

The video source (201) may provide the source video sequence to be encoded by the encoder (203) in the form of a stream of digital video samples, which can be of any suitable bit depth (e.g., 8-bit, 10-bit, or 12-bit), any color space (e.g., BT.601 Y CrCb or RGB), and any suitable sampling structure (e.g., Y CrCb 4:2:0 or Y CrCb 4:4:4). In a media serving system, the video source (201) may be a storage device that stores previously prepared video. In a video conferencing system, the video source (201) may be a camera that captures local image information as a video sequence. Video data may be provided as a plurality of individual pictures that impart motion when viewed in sequence. The pictures themselves may be organized as spatial arrays of pixels, wherein each pixel can comprise at least one sample depending on the sampling structure, color space, and so on in use. A person skilled in the art can readily understand the relationship between pixels and samples. The description below focuses on samples.

According to an embodiment, the encoder (203) may encode and compress the pictures of the source video sequence into an encoded video sequence (443) in real time or under any other time constraints required by the application. Enforcing the appropriate encoding speed is one function of the controller (450). The controller (450) controls, and is functionally coupled to, other functional units as described below. For simplicity, the couplings are not labeled in the figure. Parameters set by the controller (450) may include rate-control-related parameters (picture skip, quantizer, lambda value of rate-distortion optimization techniques, and so forth), picture size, GOP layout, maximum motion vector search range, and so on. A person skilled in the art can readily identify other functions of the controller (450), as they may pertain to a video encoder (203) optimized for a certain system design.

Some video encoders operate in what a person skilled in the art readily recognizes as a "coding loop". As a simplified description, the coding loop may comprise an encoder (430) (hereinafter referred to as the "source encoder"), responsible for creating symbols based on the input picture to be encoded and the reference picture(s), and a (local) decoder (433) embedded in the encoder (203) that reconstructs the symbols to create the sample data that a (remote) decoder would also create (since, in the video compression technologies considered in this application, any compression between the symbols and the encoded video codestream is lossless). The reconstructed sample stream is input to the reference picture memory (434). Since the decoding of a symbol stream leads to bit-exact results independent of decoder location (local or remote), the reference picture buffer content is also bit-exact between the local encoder and the remote encoder. In other words, the reference picture samples that the prediction part of the encoder "sees" are exactly the same sample values the decoder would "see" when using prediction during decoding. This fundamental principle of reference picture synchronicity (and the resulting drift, if synchronicity cannot be maintained because of, for example, channel errors) is also used in some related technologies.

The operation of the "local" decoder (433) may be the same as the "remote" decoder (210) that has been described in detail above in connection with fig. 3. However, referring briefly also to fig. 4, when symbols are available and the entropy encoder (445) and parser (320) are able to losslessly encode/decode the symbols into an encoded video sequence, the entropy decoding portion of the video decoder (210), including the channel (312), receiver (310), buffer memory (315), and parser (320), may not be fully implemented in the local decoder (433).

At this point it can be observed that any decoder technique other than the parsing/entropy decoding present in the decoder must also be present in the corresponding encoder in substantially the same functional form. For this reason, the present application focuses on decoder operation. The description of the encoder techniques may be simplified because the encoder techniques are reciprocal to the fully described decoder techniques. A more detailed description is only needed in certain areas and is provided below.

During operation, in some embodiments, the source encoder (430) may perform motion compensated predictive coding. The motion compensated predictive coding predictively codes an input frame with reference to at least one previously coded frame from the video sequence that is designated as a "reference frame". In this way, the encoding engine (432) encodes the difference between a block of pixels of an input frame and a block of pixels of a reference frame, which may be selected as a prediction reference for the input frame.

The local video decoder (433) may decode encoded video data of frames that may be designated as reference frames, based on the symbols created by the source encoder (430). The operations of the encoding engine (432) may be lossy processes. When the encoded video data is decoded at a video decoder (not shown in fig. 4), the reconstructed video sequence typically may be a replica of the source video sequence with some errors. The local video decoder (433) replicates the decoding processes that may be performed by the video decoder on reference frames, and may cause reconstructed reference frames to be stored in the reference picture memory (434). In this manner, the encoder (203) may locally store copies of reconstructed reference frames that have common content (absent transmission errors) with the reconstructed reference frames to be obtained by a remote video decoder.

The predictor (435) may perform a prediction search for the coding engine (432). That is, for a new frame to be encoded, predictor (435) may search reference picture memory (434) for sample data (as candidate reference pixel blocks) or some metadata, such as reference picture motion vectors, block shapes, etc., that may be referenced as appropriate predictions for the new frame. The predictor (435) may operate on a block-by-block basis of samples to find a suitable prediction reference. In some cases, from search results obtained by predictor (435), it may be determined that the input picture may have prediction references taken from multiple reference pictures stored in reference picture memory (434).

The controller (450) may manage the encoding operations of the source encoder (430), including, for example, setting parameters and subgroup parameters for encoding the video data.

The outputs of all of the above functional units may be entropy encoded in the entropy encoder (445). The entropy encoder (445) losslessly compresses the symbols generated by the various functional units according to techniques such as Huffman coding, variable length coding, or arithmetic coding, thereby converting the symbols into an encoded video sequence.

The transmitter (440) may buffer the encoded video sequence created by the entropy encoder (445) in preparation for transmission over a communication channel (460), which may be a hardware/software link to a storage device that will store the encoded video data. The transmitter (440) may combine the encoded video data from the video encoder (430) with other data to be transmitted, such as encoded audio data and/or an auxiliary data stream (sources not shown).

The controller (450) may manage the operation of the encoder (203). During encoding, the controller (450) may assign to each encoded picture a certain encoded picture type, which may affect the encoding techniques applicable to the respective picture. For example, a picture may typically be assigned one of the following frame types:

Intra pictures (I pictures), which may be pictures that can be encoded and decoded without using any other frame in the sequence as a source of prediction. Some video codecs allow different types of intra pictures, including, for example, Independent Decoder Refresh ("IDR") pictures. A person skilled in the art is aware of those variants of I pictures and their respective applications and features.

Predictive pictures (P pictures), which may be pictures that may be encoded and decoded using intra prediction or inter prediction that uses at most one motion vector and reference index to predict sample values of each block.

Bi-predictive pictures (B-pictures), which may be pictures that can be encoded and decoded using intra-prediction or inter-prediction that uses at most two motion vectors and reference indices to predict sample values of each block. Similarly, multiple predictive pictures may use more than two reference pictures and associated metadata for reconstructing a single block.

A source picture may typically be spatially subdivided into blocks of samples (e.g., blocks of 4 x 4, 8 x 8, 4 x 8, or 16 x 16 samples) and encoded block-wise. These blocks may be predictively encoded with reference to other (encoded) blocks that are determined according to the encoding allocation applied to their respective pictures. For example, a block of an I picture may be non-predictive encoded, or the block may be predictive encoded (spatial prediction or intra prediction) with reference to an already encoded block of the same picture. The pixel block of the P picture can be prediction-coded by spatial prediction or by temporal prediction with reference to one previously coded reference picture. A block of a B picture may be prediction coded by spatial prediction or by temporal prediction with reference to one or two previously coded reference pictures.

The video encoder (203) may perform encoding operations according to a predetermined video encoding technique or standard, such as the ITU-T h.265 recommendation. In operation, the video encoder (203) may perform various compression operations, including predictive encoding operations that exploit temporal and spatial redundancies in the input video sequence. Thus, the encoded video data may conform to syntax specified by the video coding technique or standard used.

In an embodiment, the transmitter (440) may transmit the additional data while transmitting the encoded video. The video encoder (430) may treat such data as part of an encoded video sequence. The additional data may include temporal/spatial/SNR enhancement layers, redundant pictures and slices, among other forms of redundant data, SEI messages, VUI parameter set segments, and the like.

Recently, the compressed-domain aggregation or extraction of multiple semantically independent picture parts into a single video picture has received some attention. In particular, in the context of, for example, 360-degree coding or certain surveillance applications, multiple semantically independent source pictures (e.g., the six cube surfaces of a cube-projected 360-degree scene, or individual camera inputs in the case of a multi-camera surveillance setup) may require separate adaptive resolution settings to cope with different scene activity at a given point in time. In other words, at a given point in time, the encoder may choose to use different resampling factors for the different semantically independent pictures that make up the whole 360-degree or surveillance scene. When combined into a single picture, this in turn requires that reference picture resampling be performed on parts of the encoded picture, and that adaptive resolution coding signaling be available.

Some terms that will be mentioned in the rest of this specification are introduced below.

A "sub-picture" may in some cases refer to a rectangular arrangement of samples, blocks, macroblocks, coded units, or similar entities that are semantically grouped and may be independently coded with varying resolution. At least one sub-picture may form one picture. The at least one encoded sub-picture may form an encoded picture. At least one sub-picture may be assembled into one picture, and at least one sub-picture may be extracted from the one picture. In some circumstances, at least one encoded sub-picture may be assembled into one encoded picture in the compressed domain without transcoding it to the sample level, in which case, or in some other case, at least one encoded sub-picture may be extracted from the encoded picture in the compressed domain.

Reference picture resampling, or adaptive resolution change (ARC), may refer to mechanisms that allow the resolution of a picture or sub-picture within an encoded video sequence to be changed by means of, for example, reference picture resampling. RPR/ARC parameters hereinafter refer to the control information required to perform the adaptive resolution change, which may include, for example, filter parameters, scaling factors, resolutions of output and/or reference pictures, various control flags, and so forth.

In an embodiment, encoding and decoding may be performed on a single, semantically independent encoded video picture. Before describing the meaning of encoding/decoding of multiple sub-pictures with independent RPR/ARC parameters and their implicit additional complexity, the options for signaling the RPR/ARC parameters will be described.

Referring to fig. 5A-5E, several embodiments for signaling RPR/ARC parameters are shown. As noted with each of the embodiments, they have certain advantages and certain disadvantages from the viewpoints of coding efficiency, complexity, and architecture. A video coding standard or technology may select at least one of these embodiments, or options known from the related art, for signaling RPR/ARC parameters. The embodiments may not be mutually exclusive and conceivably may be interchanged based on application needs, the standards technology involved, or the encoder's choice.

The categories of RPR/ARC parameters may include:

-an upsampling factor and/or a downsampling factor separated or combined in the X-dimension and the Y-dimension

-an up-sampling factor and/or a down-sampling factor, augmented by a temporal dimension, indicating constant-speed zooming in and/or out for a given number of pictures

Either of the above may involve the coding of one or more presumably short syntax elements that may point to a table containing the factor(s).

-resolution: resolution in X-dimension or Y-dimension of input pictures, output pictures, reference pictures, encoded pictures, in samples, blocks, macroblocks, CUs, or any other suitable granularity, combined, or split. If there is more than one resolution (e.g., one resolution for the input picture and one resolution for the reference picture), then in some cases, another set of values may be inferred from the set of values. For example, the resolution may be gated by using flags. For more detailed examples, please refer to the following.

"deformation" (waring) coordinates: similar to the coordinates used in annex P of the h.263 standard, can have the appropriate granularity as described above. One effective way of encoding such warped coordinates is defined in annex P of the h.263 standard, but other potentially more efficient methods are also envisaged, as can be imagined. For example, the variable length reversible "huffman" coding of the warped coordinates of appendix P may be replaced by a binary coding of appropriate length, wherein the length of the binary codeword may be derived, for example, from the maximum picture size (possibly multiplied by a factor and offset by a value) to allow "warping" outside the boundaries of the maximum picture size.

-up-sampling or down-sampling filter parameters: in an embodiment, there may be only a single filter for up-sampling and/or down-sampling. However, in an embodiment, it may be desirable to allow more flexibility in filter design, which may require the signaling of filter parameters. Such parameters may be selected through an index into a list of possible filter designs, the filter may be fully specified (for example, through a list of filter coefficients, using suitable entropy coding techniques), or the filter may be implicitly selected through the up-/down-sampling ratio, which in turn is signaled according to any of the mechanisms mentioned above, and so forth.
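
As a sketch of the binary-coding idea for warping coordinates mentioned in the list above (the function name and the headroom factor are hypothetical, for illustration only):

    import math

    def warp_codeword_bits(max_pic_dimension, headroom_factor=2):
        # Fixed-length binary coding of one warping coordinate: the codeword
        # length is derived from the maximum picture size, multiplied by a
        # factor so that "warping" outside the picture boundary stays
        # representable (a hypothetical scheme, not a normative definition).
        value_range = max_pic_dimension * headroom_factor
        return math.ceil(math.log2(value_range))

    print(warp_codeword_bits(4096))  # 13 bits cover coordinates 0..8191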

In the following, the description assumes that a limited set of up-sampling and/or down-sampling factors (the same factor being used in the X and Y dimensions) is coded, indicated by a codeword. The codeword may be variable-length coded, for example using the Exp-Golomb (exponential Golomb) code common to certain syntax elements in video coding specifications such as H.264 and H.265. One suitable mapping of values to up-/down-sampling factors is shown, for example, in Table 1:

TABLE 1

Codeword    Exp-Golomb code    Original/target resolution
0           1                  1/1
1           010                1/1.5 (upscale by 50%)
2           011                1.5/1 (downscale by 50%)
3           00100              1/2 (upscale by 100%)
4           00101              2/1 (downscale by 100%)

Many similar mappings could be devised in accordance with the needs of an application and the capabilities of the up- and down-scaling mechanisms available in a video compression technology or standard. Table 1 could be extended to more values. Values may also be represented by an entropy coding mechanism other than an Exp-Golomb code, for example by binary coding. Binary coding may have certain advantages when the resampling factors are of interest outside the video processing engines themselves (foremost the encoder and decoder), for example by MANEs (Media-Aware Network Elements). It should be noted that, for the situation where no resolution change is required, a short Exp-Golomb code can be chosen; in Table 1 above, it is only a single bit. For that most common case, this may have a coding efficiency advantage over using binary codes.
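
For illustration, the following non-normative Python sketch decodes an unsigned Exp-Golomb codeword from a bit string and maps it to a resampling factor per Table 1:

    # Map codeword values to the factors of Table 1.
    FACTOR_TABLE = {0: "1/1", 1: "1/1.5", 2: "1.5/1", 3: "1/2", 4: "2/1"}

    def read_ue(bits, pos=0):
        # Unsigned Exp-Golomb decode: count leading zeros, skip the
        # terminating '1', then read that many suffix bits.
        leading_zeros = 0
        while bits[pos + leading_zeros] == "0":
            leading_zeros += 1
        pos += leading_zeros + 1
        suffix = bits[pos:pos + leading_zeros]
        value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
        return value, pos + leading_zeros

    codeword, _ = read_ue("00101")           # an Exp-Golomb code from Table 1
    print(codeword, FACTOR_TABLE[codeword])  # 4 2/1 (downscale by 100%)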

The number of entries in Table 1, as well as their semantics, may be fully or partially configurable. For example, the basic outline of the table may be conveyed in a "higher-layer" parameter set, such as the sequence parameter set or the decoder parameter set. In an embodiment, one or more such tables may be defined in a video coding technology or standard, and may be selected through, for example, the decoder or sequence parameter set.

Described below is how an up-/down-sampling factor (ARC information), coded as described above, may be included in a video coding technology or standard syntax. Similar considerations may apply to one, or a few, codewords controlling the up-/down-sampling filters. See the discussion below for cases where a filter or other data structures require comparatively large amounts of data.

As shown in fig. 5A, H.263 Annex P includes the ARC information (502), in the form of four warping coordinates, in the picture header (501), specifically in the H.263 PLUSPTYPE (503) header extension. This can be a sensible design choice when (a) there is a picture header available and (b) frequent changes of the ARC information are expected. However, the overhead when using H.263-style signaling can be quite high, and scaling factors may not pertain across picture boundaries, as picture headers can be of a transient nature.

In the same or another embodiment, the signaling of ARC parameters may follow the detailed examples outlined in fig. 6A-6B. Fig. 6A-6B depict syntax diagrams in a representation roughly following C-style programming, as has been used in video coding standards since at least 1993. Lines in boldface indicate syntax elements present in the codestream; lines without boldface often indicate control flow or the setting of variables.

As shown in fig. 6A, a tile group header (601), as an exemplary syntax structure applicable to a header of a (possibly rectangular) part of a picture, may conditionally contain a variable-length, Exp-Golomb-coded syntax element dec_pic_size_idx (602) (depicted in boldface). The presence of this syntax element in the tile group header may be gated by the use of adaptive resolution (603); here, the value of the flag is not depicted in boldface, which, per the convention noted above, means that the flag is a variable rather than a syntax element coded at this point in the codestream. Whether or not adaptive resolution is in use for the picture or parts thereof may be signaled in any high-level syntax structure, inside or outside the codestream. In the example shown, it is signaled in the sequence parameter set, as outlined below.

Referring to fig. 6B, also shown is an excerpt of a sequence parameter set (610). The first syntax element shown is adaptive_pic_resolution_change_flag (611). When true, that flag may indicate the use of adaptive resolution, which in turn may require certain control information. In this example, such control information is conditionally present based on the value of the flag, through the if() statements in the sequence parameter set (612) and the tile group header (601).

When adaptive resolution is in use, in this example, what is coded is the output resolution in units of samples (613). The numeral 613 refers to both output_pic_width_in_luma_samples and output_pic_height_in_luma_samples, which together can define the resolution of the output picture. Certain restrictions on either value may be defined elsewhere in the video coding technology or standard. For example, a level definition may limit the number of total output samples, which may be the product of the values of those two syntax elements. Also, certain video coding technologies or standards, or external technologies or standards such as system standards, may limit the numbering range (for example, one or both dimensions must be divisible by a power-of-2 number) or the aspect ratio (for example, width and height must be in a relation such as 4:3 or 16:9). Such restrictions may be introduced to facilitate hardware implementations or for other reasons, and are well known in the art.
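
A minimal sketch of such restrictions follows; the sample cap, alignment, and aspect ratios below are invented example values, as the real limits come from the level definitions of the standard in use:

    def check_output_size(width, height, max_total_samples=8_912_896, align=8):
        # Level-style cap on the product of the two syntax elements above.
        assert width * height <= max_total_samples, "exceeds example level limit"
        # Example divisibility restriction (a power-of-2 alignment).
        assert width % align == 0 and height % align == 0, "not aligned"
        # Example aspect-ratio restriction: 16:9 or 4:3 only.
        assert width * 9 == height * 16 or width * 3 == height * 4, \
            "aspect ratio not 16:9 or 4:3"

    check_output_size(1920, 1080)  # 16:9, aligned, under the cap: passes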

In certain applications, it can be advisable that the encoder instructs the decoder to use a certain reference picture size, rather than implicitly assuming that size to be the output picture size. In this example, the syntax element reference_pic_size_present_flag (614) gates the conditional presence of the reference picture size (615) (where, again, the numeral refers to both width and height).

Finally, shown is a table of possible decoded picture widths and heights. Such a table can be expressed, for example, by a table indication (num_dec_pic_size_in_luma_samples_minus1) (616). The "minus1" can refer to the interpretation of the value of that syntax element: for example, if the coded value is zero, there is one table entry; if the value is five, there are six table entries. For each "row" in the table, the decoded picture width and height are then included in the syntax (617).

The table entries presented (617) may be indexed using the syntax element dec_pic_size_idx (602) in the tile group header, thereby allowing each tile group a different decoded size, in effect a different scaling factor.
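
The "minus1" convention and the per-tile-group indexing can be sketched as follows (non-normative; read_ue() stands in for a real bitstream reader):

    def parse_dec_pic_size_table(read_ue):
        # num_dec_pic_size_in_luma_samples_minus1: coded value 0 means one entry.
        num_entries = read_ue() + 1
        # One (width, height) pair of syntax (617) per table "row".
        return [(read_ue(), read_ue()) for _ in range(num_entries)]

    def decoded_size_for_tile_group(table, dec_pic_size_idx):
        # Each tile group header selects its own entry, i.e., its own
        # decoded size (in effect, its own scaling factor).
        return table[dec_pic_size_idx]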

Certain video coding technologies or standards, for example VP9, support spatial scalability by implementing certain forms of reference picture resampling (signaled quite differently from the disclosed subject matter) in conjunction with temporal scalability. In particular, certain reference pictures may be upsampled to a higher resolution using ARC-style technologies to form the base of a spatial enhancement layer. Those upsampled pictures can then be refined, using normal prediction mechanisms at the high resolution, so as to add detail.

The embodiments discussed herein can be used in such an environment. In certain cases, in the same or another embodiment, a value in the Network Abstraction Layer (NAL) unit header, for example the Temporal ID field, can be used to indicate not only the temporal layer but also the spatial layer. Doing so may have certain advantages for certain system designs; for example, existing Selective Forwarding Units (SFUs) created and optimized for temporal-layer selective forwarding based on the NAL unit header Temporal ID value can be used without modification in scalable environments. To enable this, there may be a requirement for a mapping between the coded picture size and the temporal layer indicated by the Temporal ID field in the NAL unit header.

In an embodiment, the conformance window size may be signaled in the PPS. When the conformance window size of a reference picture differs from the conformance window size of the current picture, the resampling ratio can be calculated using the conformance window parameters. A decoder may need to identify the conformance window size of each picture to determine whether a resampling process is needed.

In an embodiment, the scaling factor (scale factor) for reference picture resampling (RPR) may be calculated based on the output width and output height of the current picture and a reference picture, which may be derived from the conformance window parameters. This allows the scaling factor to be calculated more accurately than when using the decoded picture size, and may be suitable for most video sequences, whose output picture size is almost equal to the decoded picture size, with a small padding area.
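
As a numeric illustration (with hypothetical values) of how the output-size-based ratio relates to the decoded-size-based ratio:

    # Reference picture: 1920 decoded width, cropped to 1904 by the window.
    ref_decoded_w, ref_output_w = 1920, 1904
    # Current picture at half resolution with a proportional window.
    cur_decoded_w, cur_output_w = 960, 952

    print(ref_decoded_w / cur_decoded_w)  # 2.0, from decoded sizes
    print(ref_output_w / cur_output_w)    # 2.0, from output sizes; the two can
                                          # differ when padding or offsets are
                                          # not proportional between pictures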

However, this may also cause various problems. For example, when the conformance window size is entirely different from the decoded picture size, with large offset values, calculating the scaling factor based on the conformance window size may not guarantee the quality of inter prediction across different resolutions for immersive media applications (e.g., 360-degree cube map, stereoscopic, point cloud). In an extreme case, a co-located region of the current CU may not exist in the reference picture. When RPR is used for multi-layer scalability, the conformance window offsets may not be usable for calculating the reference region across layers. Note that, in the HEVC scalability extension (SHVC), the reference region of each directly dependent layer may be explicitly signaled in the PPS extension. When a sub-stream covering a specific region (sub-picture) is extracted from the entire stream, the conformance window size may not match the picture size at all. Note also that, once the codestream has been encoded, the conformance window parameters cannot be updated as long as those parameters are used to calculate the scaling factor.

Given the potential problems described above, calculating the scaling factor from the conformance window size has corner cases that call for alternative parameters. As an alternative, when the conformance window parameters cannot be used to calculate the scaling factor, the present application proposes signaling scaled reference region parameters that can be used to calculate RPR and scalability.

In an embodiment, referring to FIG. 7, a conformance_window_flag may be signaled in the PPS. conformance_window_flag equal to 1 may indicate that conformance cropping window offset parameters follow next in the PPS. conformance_window_flag equal to 0 may indicate that the conformance cropping window offset parameters are not present.

In an embodiment, still referring to FIG. 7, conf_win_left_offset, conf_win_right_offset, conf_win_top_offset, and conf_win_bottom_offset specify the samples of the pictures referring to the PPS that are output from the decoding process, in terms of a rectangular region specified in picture coordinates for output. When conformance_window_flag is equal to 0, the values of conf_win_left_offset, conf_win_right_offset, conf_win_top_offset, and conf_win_bottom_offset may be inferred to be equal to 0.

In an embodiment, a flag may be present in the PPS or another parameter set that indicates whether the resampled picture size (width and height) is explicitly signaled in the PPS or that other parameter set. If the resampled picture size parameters are explicitly signaled, the resampling ratio between the current picture and the reference picture may be calculated based on the resampled picture size parameters.

In an embodiment, referring to FIG. 7, use_conf_win_for_rpr_flag equal to 0 may indicate that resampled_pic_width_in_luma_samples and resampled_pic_height_in_luma_samples immediately follow at a suitable location, such as next in the PPS.

In an embodiment, use_conf_win_for_rpr_flag equal to 1 may indicate that resampled_pic_width_in_luma_samples and resampled_pic_height_in_luma_samples are not present.

In an embodiment, resampled_pic_width_in_luma_samples may specify the width, in units of luma samples, of each resampled reference picture referring to the PPS. resampled_pic_width_in_luma_samples may not be equal to 0, may be an integer multiple of Max(8, MinCbSizeY), and may be less than or equal to pic_width_max_in_luma_samples.

In an embodiment, resampled_pic_height_in_luma_samples may specify the height, in units of luma samples, of each resampled reference picture referring to the PPS. resampled_pic_height_in_luma_samples may not be equal to 0, may be an integer multiple of Max(8, MinCbSizeY), and may be less than or equal to pic_height_max_in_luma_samples.

When the syntax element resampled_pic_width_in_luma_samples is not present, its value may be inferred to be equal to PicOutputWidthL.

When the syntax element resampled_pic_height_in_luma_samples is not present, its value may be inferred to be equal to PicOutputHeightL.
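A sketch of this conditional signaling and inference, with a hypothetical bitstream reader interface (read_flag, read_ue) standing in for the actual entropy decoding:

```python
def parse_resampled_pic_size(reader, pic_output_width_l, pic_output_height_l):
    use_conf_win_for_rpr_flag = reader.read_flag()
    if use_conf_win_for_rpr_flag == 0:
        # Explicitly signaled: the resampling ratio will be computed
        # from these values (see equations 1 and 2 below).
        width = reader.read_ue()   # resampled_pic_width_in_luma_samples
        height = reader.read_ue()  # resampled_pic_height_in_luma_samples
    else:
        # Not present: inferred from the output picture size.
        width = pic_output_width_l
        height = pic_output_height_l
    return width, height
```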

In an embodiment, an example of the fractional sample interpolation process with reference picture resampling may be as follows:

The variable fRefWidth may be set equal to the resampled_pic_width_in_luma_samples of the reference picture, in luma samples.

The variable fRefHeight may be set equal to the resampled_pic_height_in_luma_samples of the reference picture, in luma samples.

The motion vector mvLX may be set equal to (refMvLX-mvOffset).

If cIdx is equal to 0, the scaling factor and its fixed-point representation can be defined according to equations 1 and 2 below:

hori_scale_fp = ((fRefWidth << 14) + (resampled_pic_width_in_luma_samples >> 1)) / resampled_pic_width_in_luma_samples (equation 1)

vert_scale_fp = ((fRefHeight << 14) + (resampled_pic_height_in_luma_samples >> 1)) / resampled_pic_height_in_luma_samples (equation 2)
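A minimal sketch of equations 1 and 2 in Python; the spec's "/" is integer division (Python's //), and the "+ (denominator >> 1)" term rounds the 14-bit fixed-point ratio to the nearest integer. Here the resampled sizes are those of the current picture's PPS and fRefWidth/fRefHeight are those of the reference picture.

```python
def rpr_scale_factors(f_ref_width, f_ref_height,
                      resampled_pic_width_in_luma_samples,
                      resampled_pic_height_in_luma_samples):
    # Equal sizes yield 1 << 14 = 16384, i.e. a ratio of 1.0 in
    # 14-bit fixed point.
    hori_scale_fp = ((f_ref_width << 14) +
                     (resampled_pic_width_in_luma_samples >> 1)) // \
                    resampled_pic_width_in_luma_samples
    vert_scale_fp = ((f_ref_height << 14) +
                     (resampled_pic_height_in_luma_samples >> 1)) // \
                    resampled_pic_height_in_luma_samples
    return hori_scale_fp, vert_scale_fp
```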

In an embodiment, referring to FIG. 8, use_conf_win_for_rpr_flag equal to 0 may specify that resampled_pic_width_in_luma_samples and resampled_pic_height_in_luma_samples follow in the PPS. use_conf_win_for_rpr_flag equal to 1 specifies that resampled_pic_width_in_luma_samples and resampled_pic_height_in_luma_samples are not present.

In an embodiment, ref_region_left_offset may specify the horizontal offset of the top-left luma sample of the reference region in the decoded picture. The value of ref_region_left_offset may be in the range of −2^14 to 2^14 − 1, inclusive. When ref_region_left_offset is not present, its value may be inferred to be equal to conf_win_left_offset.

In an embodiment, ref_region_top_offset may specify the vertical offset of the top-left luma sample of the reference region in the decoded picture. The value of ref_region_top_offset may be in the range of −2^14 to 2^14 − 1, inclusive. When ref_region_top_offset is not present, its value may be inferred to be equal to conf_win_top_offset.

In an embodiment, ref_region_right_offset may specify the horizontal offset of the bottom-right luma sample of the reference region in the decoded picture. The value of ref_region_right_offset may be in the range of −2^14 to 2^14 − 1, inclusive. When ref_region_right_offset is not present, its value may be inferred to be equal to conf_win_right_offset.

In an embodiment, ref_region_bottom_offset may specify the vertical offset of the bottom-right luma sample of the reference region in the decoded picture. The value of ref_region_bottom_offset may be in the range of −2^14 to 2^14 − 1, inclusive. When ref_region_bottom_offset is not present, its value may be inferred to be equal to conf_win_bottom_offset.

The variables PicRefWidthL and PicRefHeightL may be derived as shown in equations 3 and 4 below:

PicRefWidthL = pic_width_in_luma_samples − SubWidthC * (ref_region_right_offset + ref_region_left_offset) (equation 3)

PicRefHeightL = pic_height_in_luma_samples − SubHeightC * (ref_region_bottom_offset + ref_region_top_offset) (equation 4)

The variable fRefWidth may be set equal to the PicRefWidthL of the reference picture, in luma samples.

The variable fRefHeight may be set equal to the PicRefHeightL of the reference picture, in luma samples.

The motion vector mvLX may be set equal to (refMvLX-mvOffset).

If cIdx is equal to 0, the scaling factor and its fixed-point representation can be defined as shown in equations 5 and 6, as follows:

hori_scale_fp = ((fRefWidth << 14) + (PicRefWidthL >> 1)) / PicRefWidthL (equation 5)

vert_scale_fp = ((fRefHeight << 14) + (PicRefHeightL >> 1)) / PicRefHeightL (equation 6)
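A sketch of equations 3 through 6 together, assuming 4:2:0 chroma so that SubWidthC = SubHeightC = 2; as before, integer division stands in for the spec's "/".

```python
def ref_region_scale_factors(pic_width_in_luma_samples,
                             pic_height_in_luma_samples,
                             ref_region_left_offset, ref_region_right_offset,
                             ref_region_top_offset, ref_region_bottom_offset,
                             f_ref_width, f_ref_height,
                             sub_width_c=2, sub_height_c=2):
    # Equations 3 and 4: shrink the decoded picture by the signaled
    # reference region offsets (offsets are in chroma units).
    pic_ref_width_l = pic_width_in_luma_samples - sub_width_c * (
        ref_region_right_offset + ref_region_left_offset)
    pic_ref_height_l = pic_height_in_luma_samples - sub_height_c * (
        ref_region_bottom_offset + ref_region_top_offset)
    # Equations 5 and 6: rounded 14-bit fixed-point ratios computed
    # against the reference region size rather than the full picture.
    hori_scale_fp = ((f_ref_width << 14) +
                     (pic_ref_width_l >> 1)) // pic_ref_width_l
    vert_scale_fp = ((f_ref_height << 14) +
                     (pic_ref_height_l >> 1)) // pic_ref_height_l
    return hori_scale_fp, vert_scale_fp
```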

The top-left coordinate of the bounding block for reference sample padding (xSbIntL, ySbIntL) may be set equal to (xSb + (mvLX[0] >> 4), ySb + (mvLX[1] >> 4)).

For each luma sample location (xL = 0..sbWidth − 1 + brdExtSize, yL = 0..sbHeight − 1 + brdExtSize) within the prediction luma sample array predSamplesLX, the corresponding prediction luma sample value predSamplesLX[xL][yL] may be derived as follows:

Let (refxSbL, refySbL) and (refxL, refyL) be luma locations pointed to by a motion vector (refMvLX[0], refMvLX[1]) given in units of 1/16 sample. The variables refxSbL, refxL, refySbL, and refyL may be derived as shown in equations 7 through 10, as follows:

refxSbL = (((xSb + ref_region_left_offset) << 4) + refMvLX[0]) * hori_scale_fp (equation 7)

refxL = ((Sign(refxSbL) * ((Abs(refxSbL) + 128) >> 8) + xL * ((hori_scale_fp + 8) >> 4)) + 32) >> 6 (equation 8)

refySbL = (((ySb + ref_region_top_offset) << 4) + refMvLX[1]) * vert_scale_fp (equation 9)

refyL = ((Sign(refySbL) * ((Abs(refySbL) + 128) >> 8) + yL * ((vert_scale_fp + 8) >> 4)) + 32) >> 6 (equation 10)
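A sketch of equations 7 through 10, mapping one luma sample of the current subblock into the reference picture. Motion vectors are in 1/16-sample units and the scale factors in 14-bit fixed point; the result is a 1/16-sample reference position. Python's >> is an arithmetic right shift, which matches the spec's ">>" here.

```python
def sign(v):
    # Sign() per the spec: -1, 0, or 1.
    return (v > 0) - (v < 0)

def luma_reference_position(x_sb, y_sb, x_l, y_l, ref_mv_lx,
                            ref_region_left_offset, ref_region_top_offset,
                            hori_scale_fp, vert_scale_fp):
    refx_sb = (((x_sb + ref_region_left_offset) << 4)
               + ref_mv_lx[0]) * hori_scale_fp              # equation 7
    refx = ((sign(refx_sb) * ((abs(refx_sb) + 128) >> 8)
             + x_l * ((hori_scale_fp + 8) >> 4)) + 32) >> 6  # equation 8
    refy_sb = (((y_sb + ref_region_top_offset) << 4)
               + ref_mv_lx[1]) * vert_scale_fp               # equation 9
    refy = ((sign(refy_sb) * ((abs(refy_sb) + 128) >> 8)
             + y_l * ((vert_scale_fp + 8) >> 4)) + 32) >> 6  # equation 10
    return refx, refy
```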

Fig. 9 is a flow diagram of an example method 900 for decoding an encoded video bitstream. In some embodiments, at least one of the method blocks of fig. 9 may be performed by the decoder 210. In some embodiments, at least one method block of fig. 9 may be performed by another device or group of devices (such as encoder 203) separate from decoder 210 or including decoder 210.

As shown in fig. 9, the method 900 may include obtaining a flag indicating that the conformance window is not used for reference picture resampling (block 910).

As further shown in fig. 9, the method 900 may include determining whether a resampled picture size is signaled based on a flag indicating that the conformance window is not used for reference picture resampling (block 920).

If it is determined that the resampled picture size is signaled ("yes" at block 920), the method 900 may proceed to block 930 and then to block 950. At block 930, the method 900 may include determining the resampling ratio based on the resampled picture size.

If it is determined that the resampled picture size is not signaled ("no" at block 920), the method 900 may proceed to block 940 and then to block 950. At block 940, the method 900 may include determining the resampling ratio based on the output picture size.

As further shown in fig. 9, the method 900 may include performing reference picture resampling for the current picture using the resampling ratio (block 950).
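A minimal sketch of the decision in blocks 920 through 950; sizes are passed in as (width, height) pairs, and the float ratios stand in for the normative fixed-point computation shown earlier.

```python
def choose_resampling_ratio(resampled_size_signaled,
                            cur_resampled_size, ref_resampled_size,
                            cur_output_size, ref_output_size):
    if resampled_size_signaled:          # "yes" at block 920
        cur, ref = cur_resampled_size, ref_resampled_size   # block 930
    else:                                # "no" at block 920
        cur, ref = cur_output_size, ref_output_size         # block 940
    # Block 950 performs reference picture resampling of the current
    # picture using these per-axis ratios.
    return ref[0] / cur[0], ref[1] / cur[1]
```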

In an embodiment, the flag may be signaled in a picture parameter set.

In an embodiment, the resampled picture size may be signaled in the encoded video bitstream in at least one of a width of the resampled picture size and a height of the resampled picture size.

In an embodiment, at least one of the width and the height may be signaled in a picture parameter set.

In an embodiment, at least one of the width and the height may be expressed as a number of luminance samples included in the at least one of the width and the height.

In an embodiment, at least one of the width and the height may be determined based on at least one offset distance between a boundary of the current picture and a predetermined luma sample of the reference region.

In an embodiment, the at least one offset distance may be signaled in a picture parameter set.

In an embodiment, the at least one offset distance may comprise at least one of: a horizontal offset distance between a left boundary of the current picture and an upper left luma sample of the reference region, a vertical offset distance between an upper boundary of the current picture and an upper left luma sample of the reference region, a horizontal offset distance between a right boundary of the current picture and a lower right luma sample of the reference region, and a vertical offset distance between a lower boundary of the current picture and a lower right luma sample of the reference region.

Although fig. 9 shows example blocks of the method 900, in some implementations, the method 900 may include additional blocks, fewer blocks, different blocks, or different arrangements of blocks than those depicted in fig. 9. Additionally or alternatively, two or more of the blocks of method 900 may be performed in parallel.

Furthermore, the proposed method may be implemented by a processing circuit (e.g. at least one processor or at least one integrated circuit). In one example, at least one processor executes a program stored in a non-transitory computer readable medium to perform at least one of the proposed methods.

The techniques described above may be implemented as computer software using computer readable instructions and physically stored in one or more computer readable storage media. For example, fig. 10 illustrates a computer system 1000 suitable for implementing certain embodiments of the disclosed subject matter.

The computer software may be encoded using any suitable machine code or computer language that may be subject to assembly, compilation, linking, or similar mechanisms to create code comprising instructions that are executable by a computer Central Processing Unit (CPU), Graphics Processing Unit (GPU), etc., either directly or through interpretation, microcode execution, etc.

The instructions may be executed on various types of computers or computer components, including, for example, personal computers, tablets, servers, smart phones, gaming devices, internet of things devices, and the like.

The components shown in FIG. 10 for computer system 1000 are exemplary in nature and are not intended to suggest any limitation as to the scope of use or functionality of the computer software implementing embodiments of the application. Neither should the configuration of components be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiments of the computer system 1000.

The computer system 1000 may include some human interface input devices. Such human interface input devices may be responsive to input by one or more human users through, for example, tactile input (e.g., keys, swipes, data glove movements), audio input (e.g., speech, taps), visual input (e.g., gestures), olfactory input (not depicted). The human interface device may also be used to capture certain media that are not necessarily directly related to human conscious input, such as audio (e.g., speech, music, ambient sounds), images (e.g., scanned images, photographic images obtained from still-image cameras), video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).

The input human interface device may include one or more of the following (only one depicted each): keyboard 1001, mouse 1002, track pad 1003, touch screen 1010 and associated graphics adapter 1050, data gloves, joystick 1005, microphone 1006, scanner 1007, camera 1008.

The computer system 1000 may also include some human interface output devices. Such human interface output devices may stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human interface output devices may include haptic output devices (such as haptic feedback through the touch screen 1010, data glove, or joystick 1005, though there may also be haptic feedback devices that do not serve as input devices), audio output devices (such as speakers 1009 or headphones (not depicted)), visual output devices (such as screens 1010, including Cathode Ray Tube (CRT) screens, Liquid Crystal Display (LCD) screens, plasma screens, and Organic Light Emitting Diode (OLED) screens, each with or without touch-screen input capability and each with or without haptic feedback capability, some of which may be capable of two-dimensional visual output or more-than-three-dimensional output through means such as stereographic output; virtual reality glasses (not depicted); holographic displays; and smoke tanks (not depicted)), and printers (not depicted).

The computer system 1000 may also include human-accessible storage devices and their associated media, such as optical media including CD/DVD ROM/RW 1020 with CD/DVD or similar media 1021, thumb drives 1022, removable hard drives or solid-state drives 1023, legacy magnetic media such as magnetic tape and floppy disks (not depicted), specialized ROM/ASIC/PLD-based devices such as security devices (not depicted), and so forth.

Those skilled in the art will also appreciate that the term "computer-readable medium" used in connection with the presently disclosed subject matter does not encompass transmission media, carrier waves, or other transitory signals.

The computer system 1000 may also include an interface to one or more communication networks. The network 1055 may be, for example, wireless, wired, or optical. The network may further be local, wide-area, metropolitan, vehicular and industrial, real-time, delay-tolerant, and so on. Examples of networks include local area networks such as Ethernet and wireless LANs; cellular networks including global system for mobile communications (GSM), third generation (3G), fourth generation (4G), fifth generation (5G), and Long Term Evolution (LTE) networks; TV wireline or wireless wide-area digital networks including cable TV, satellite TV, and terrestrial broadcast TV; and vehicular and industrial networks including the CAN bus. Certain networks commonly require an external network interface adapter 1054 attached to some general-purpose data port or peripheral bus 1049 (such as a USB port of the computer system 1000); others are commonly integrated into the core of the computer system 1000 by attachment to a system bus as described below (for example, an Ethernet interface into a PC computer system, or a cellular network interface into a smartphone computer system). As an example, the network 1055 may be connected to the peripheral bus 1049 using the network interface 1054. Using any of these networks, the computer system 1000 may communicate with other entities. Such communication may be unidirectional receive-only (e.g., broadcast TV), unidirectional send-only (e.g., CAN bus to certain CAN bus devices), or bidirectional, for example to other computer systems using local or wide-area digital networks. Certain protocols and protocol stacks may be used on each of those networks and network interfaces as described above.

The human interface device, human accessible storage device, and network interface described above may be attached to the core 1040 of the computer system 1000.

The core 1040 may include one or more Central Processing Units (CPUs) 1041, Graphics Processing Units (GPUs) 1042, specialized programmable processing units in the form of Field Programmable Gate Arrays (FPGAs) 1043, hardware accelerators 1044 for certain tasks, and so forth. These devices, along with Read Only Memory (ROM) 1045, Random Access Memory (RAM) 1046, and internal mass storage 1047 such as internal non-user-accessible hard drives and Solid State Drives (SSDs), may be connected through a system bus 1048. In some computer systems, the system bus 1048 may be accessible in the form of one or more physical plugs to enable expansion by additional CPUs, GPUs, and the like. Peripheral devices may be attached either directly to the core's system bus 1048 or through a peripheral bus 1049. Architectures for a peripheral bus include Peripheral Component Interconnect (PCI), USB, and the like.

The CPU 1041, GPU 1042, FPGA 1043, and accelerator 1044 can execute certain instructions that, in combination, can constitute the aforementioned computer code. That computer code may be stored in ROM 1045 or RAM 1046. Transitional data may also be stored in RAM 1046, whereas permanent data may be stored, for example, in the internal mass storage 1047. Fast storage and retrieval to any of the memory devices may be enabled through the use of cache memory, which may be closely associated with one or more CPUs 1041, GPUs 1042, mass storage 1047, ROM 1045, RAM 1046, and the like.

Computer readable media may have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present application, or they may be of the kind well known and available to those having skill in the computer software arts.

By way of example, and not limitation, a computer system having the architecture 1000, and specifically the core 1040, may provide functionality as a result of one or more processors (including CPUs, GPUs, FPGAs, accelerators, and so forth) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with the user-accessible mass storage devices introduced above, as well as certain storage of the core 1040 that is of a non-transitory nature (e.g., core-internal mass storage 1047 or ROM 1045). Software implementing various embodiments of the present application may be stored in such devices and executed by the core 1040. A computer-readable medium may include one or more memory devices or chips, according to particular needs. In addition or as an alternative, logic hardwired or otherwise embodied in circuitry may operate in lieu of or in conjunction with software to perform certain processes or certain portions of certain processes described herein. References to software may encompass logic, and vice versa, where appropriate. Where appropriate, reference to a computer-readable medium may encompass circuitry (e.g., an Integrated Circuit (IC)) that stores software for execution, circuitry that embodies logic for execution, or both. The present application contemplates any suitable combination of hardware and software.

While this application describes several exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of this application. It will thus be appreciated that those skilled in the art will be able to devise various systems and methods which, although not explicitly shown or described herein, embody the principles of the application and are thus within the spirit and scope of the application.
