Method for signalling the output layer set of a scalable video stream

Document No.: 411937    Publication Date: 2021-12-17

Abstract: This technology, "Method for signalling the output layer set of a scalable video stream" (可缩放视频流的输出层集的信令方法), was created by Byeongdoo Choi (崔秉斗), Stephan Wenger (史蒂芬·文格尔), and Shan Liu (刘杉) on 2020-09-29. Its main content is as follows: a method of decoding an encoded video bitstream using at least one processor includes obtaining, from the encoded video bitstream, an encoded video sequence comprising a plurality of output layer sets; obtaining a first flag indicating whether each of the plurality of output layer sets includes more than one layer; based on the first flag indicating that each output layer set includes more than one layer, obtaining a first syntax element indicating an output layer set mode; selecting at least one layer as at least one output layer from among layers included in the plurality of output layer sets based on at least one of the first flag and the first syntax element; and outputting the at least one output layer.

1. A method for decoding an encoded video bitstream using at least one processor, the method comprising:

obtaining an encoded video sequence comprising a plurality of output layer sets from the encoded video bitstream;

obtaining a first flag indicating whether each of the plurality of output layer sets includes more than one layer;

based on the first flag indicating that each output layer set includes more than one layer, obtaining a first syntax element indicating an output layer set mode;

selecting at least one layer from among layers included in the plurality of output layer sets as at least one output layer based on at least one of the first flag and the first syntax element; and

outputting the at least one output layer.

2. The method of claim 1, wherein the first flag and the first syntax element are signaled in a Video Parameter Set (VPS).

3. The method of claim 1, wherein, based on the first flag indicating that each output layer set includes only one layer, the one layer is selected as the at least one output layer.

4. The method of claim 1, wherein a highest layer of each output layer set is selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is a first mode.

5. The method of claim 1, wherein all layers included in the plurality of output layer sets are selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is a second mode.

6. The method of claim 1, wherein, based on the first syntax element indicating that the output layer set mode is a third mode, the at least one output layer is selected from among the layers included in the plurality of output layer sets based on a second syntax element signaled in the encoded video bitstream.

7. The method of claim 6, wherein, based on the first syntax element indicating that the output layer set mode is the third mode, unselected ones of the layers included in the plurality of output layer sets are used as reference layers for the at least one output layer.

8. The method of claim 6, wherein the output layer set mode is inferred to be the third mode based on the first flag indicating that each output layer set includes more than one layer and a second flag indicating that all layers included in the plurality of output layer sets are independently encoded.

9. An apparatus for decoding an encoded video bitstream, the apparatus comprising:

at least one memory configured to store program code; and

at least one processor configured to read the program code and operate as directed by the program code, the program code comprising:

first obtaining code configured to cause the at least one processor to obtain an encoded video sequence comprising a plurality of output layer sets from the encoded video bitstream;

second obtaining code configured to cause the at least one processor to obtain a first flag indicating whether each of the plurality of output layer sets includes more than one layer;

third obtaining code configured to cause the at least one processor to obtain a first syntax element indicating an output layer set mode based on the first flag indicating that each output layer set includes more than one layer;

selecting code configured to cause the at least one processor to select at least one layer as at least one output layer from among layers included in the plurality of output layer sets based on at least one of the first flag and the first syntax element; and

output code configured to cause the at least one processor to output the at least one output layer.

10. The apparatus of claim 9, wherein the first flag and the first syntax element are signaled in a Video Parameter Set (VPS).

11. The apparatus of claim 9, wherein, based on the first flag indicating that each output layer set includes only one layer, the one layer is selected as the at least one output layer.

12. The apparatus of claim 9, wherein a highest layer of each output layer set is selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is a first mode.

13. The apparatus of claim 9, wherein all layers included in the plurality of output layer sets are selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is a second mode.

14. The apparatus of claim 9, wherein, based on the first syntax element indicating that the output layer set mode is a third mode, the at least one output layer is selected from among the layers included in the plurality of output layer sets based on a second syntax element signaled in the encoded video bitstream.

15. The apparatus of claim 14, wherein a layer not selected from the layers included in the plurality of output layer sets is used as a reference layer for the at least one output layer based on the first syntax element indicating that the output layer set mode is the third mode.

16. The apparatus of claim 14, wherein the output layer set mode is inferred to be the third mode based on the first flag indicating that each output layer set includes more than one layer and a second flag indicating that all layers included in the plurality of output layer sets are independently coded.

17. A non-transitory computer-readable medium storing instructions, the instructions comprising: one or more instructions that, when executed by one or more processors of a device for decoding an encoded video bitstream, cause the one or more processors to:

obtain an encoded video sequence comprising a plurality of output layer sets from the encoded video bitstream;

obtain a first flag indicating whether each of the plurality of output layer sets includes more than one layer;

based on the first flag indicating that each output layer set includes more than one layer, obtain a first syntax element indicating an output layer set mode;

select at least one layer from among layers included in the plurality of output layer sets as at least one output layer based on at least one of the first flag and the first syntax element; and

output the at least one output layer.

18. The non-transitory computer-readable medium of claim 17, wherein the highest layer of each output layer set is selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is a first mode.

19. The non-transitory computer-readable medium of claim 17, wherein all layers included in the plurality of output layer sets are selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is a second mode.

20. The non-transitory computer-readable medium of claim 17, wherein, based on the first syntax element indicating that the output layer set mode is a third mode, the at least one output layer is selected from among the layers included in the plurality of output layer sets based on a second syntax element signaled in the encoded video bitstream.

Technical Field

The subject matter disclosed herein relates to video encoding and decoding, and more particularly to signaling mechanisms for the output layer sets of a scalable video stream.

Background

The H.265/HEVC (High Efficiency Video Coding) standard was published in 2013 (version 1), 2014 (version 2), 2015 (version 3), and 2016 (version 4) by ITU-T VCEG (International Telecommunication Union Telecommunication Sector Video Coding Experts Group) (Q6/16) and ISO/IEC MPEG (International Organization for Standardization / International Electrotechnical Commission Moving Picture Experts Group) (JTC 1/SC 29/WG 11). In 2015, these two standards organizations jointly formed JVET (the Joint Video Exploration Team) to explore the potential of developing the next video coding standard beyond HEVC. In October 2017, they issued a joint Call for Proposals (CfP) on video compression with capability beyond HEVC. By February 15, 2018, a total of 22 CfP responses for the Standard Dynamic Range (SDR) category, 12 CfP responses for the High Dynamic Range (HDR) category, and 12 CfP responses for the 360-degree video category had been received. All received CfP responses were evaluated at the 122nd MPEG meeting / 10th JVET meeting in April 2018. As a result of that meeting, JVET formally launched the standardization process for next-generation video coding beyond HEVC. The new standard was named Versatile Video Coding (VVC), and JVET was renamed the Joint Video Experts Team.

Disclosure of Invention

In an embodiment, there is provided a method of decoding an encoded video bitstream using at least one processor, comprising:

obtaining an encoded video sequence comprising a plurality of output layer sets from an encoded video bitstream;

obtaining a first flag indicating whether each of a plurality of output layer sets includes more than one layer;

obtaining a first syntax element indicating an output layer set mode based on the first flag indicating that each output layer set includes more than one layer;

selecting at least one layer as at least one output layer from among layers included in the plurality of output layer sets based on at least one of the first flag and the first syntax element; and

outputting the at least one output layer.

In an embodiment, there is provided an apparatus for decoding an encoded video bitstream, comprising:

at least one memory configured to store program code; and

at least one processor configured to read the program code and operate as directed by the program code, the program code comprising:

first obtaining code configured to cause the at least one processor to obtain an encoded video sequence comprising a plurality of output layer sets from an encoded video bitstream;

second obtaining code configured to cause the at least one processor to obtain a first flag indicating whether each of a plurality of output layer sets includes more than one layer;

third obtaining code configured to cause the at least one processor to obtain, based on the first flag indicating that each output layer set includes more than one layer, a first syntax element indicating an output layer set mode;

selecting code configured to cause the at least one processor to select at least one layer as at least one output layer from among layers included in a plurality of output layer sets based on at least one of a first flag and a first syntax element; and

output code configured to cause the at least one processor to output at least one output layer.

In an embodiment, there is provided a non-transitory computer-readable medium storing instructions, the instructions comprising: one or more instructions that, when executed by one or more processors of a device for decoding an encoded video bitstream, cause the one or more processors to:

obtain an encoded video sequence comprising a plurality of output layer sets from the encoded video bitstream;

obtain a first flag indicating whether each of the plurality of output layer sets includes more than one layer;

obtain a first syntax element indicating an output layer set mode based on the first flag indicating that each output layer set includes more than one layer;

select at least one layer as at least one output layer from among layers included in the plurality of output layer sets based on at least one of the first flag and the first syntax element; and

output the at least one output layer.

Drawings

Other features, properties, and various advantages of the disclosed subject matter will be more apparent from the following detailed description and the accompanying drawings, in which:

fig. 1 is a schematic diagram of a simplified block diagram of a communication system according to an embodiment.

Fig. 2 is a schematic diagram of a simplified block diagram of a streaming system according to an embodiment.

Fig. 3 is a schematic diagram of a simplified block diagram of a decoder according to an embodiment.

Fig. 4 is a schematic diagram of a simplified block diagram of an encoder according to an embodiment.

Fig. 5 is a schematic diagram of an example of a syntax table according to an embodiment.

Fig. 6 is a flow diagram of an example method for decoding an encoded video bitstream, according to an embodiment.

Fig. 7 is a schematic diagram of a computer system, according to an embodiment.

Detailed Description

Fig. 1 illustrates a simplified block diagram of a communication system (100) according to an embodiment of the present application. The system (100) may include at least two terminals (110) and (120) interconnected by a network (150). For unidirectional transmission of data, a first terminal (110) may encode video data at a local location for transmission over a network (150) to another terminal (120). The second terminal (120) may receive encoded video data of another terminal from the network (150), decode the encoded video data, and display the restored video data. Unidirectional data transmission may be common in applications such as media services.

Fig. 1 illustrates a second pair of terminals (130,140) for supporting bidirectional transmission of encoded video, which may occur, for example, during a video conference. For bi-directional transmission of data, each terminal (130,140) may encode video data captured at a local location for transmission over a network (150) to another terminal. Each terminal (130,140) may also receive encoded video data transmitted by the other terminal, may decode the encoded data, and display the recovered video data on a local display device.

In fig. 1, the terminals (110-140) may be illustrated as servers, personal computers, and smartphones, but the principles disclosed herein are not limited thereto. Embodiments disclosed herein are applicable to laptop computers, tablet computers, media players, and/or dedicated video conferencing equipment. The network (150) represents any number of networks that convey encoded video data between the terminals (110-140), including, for example, wired and/or wireless communication networks. The communication network (150) may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks, wide area networks, and/or the internet. For purposes of the present discussion, the architecture and topology of the network (150) may be immaterial to the operation disclosed herein, unless explained below.

As an example of the application of the subject matter disclosed in this application, fig. 2 shows the placement of a video encoder and a video decoder in a streaming environment. The subject matter disclosed herein is equally applicable to other video-enabled applications including, for example, video conferencing, digital TV, storing compressed video on digital media including CDs, DVDs, memory sticks, and the like.

The streaming system may include an acquisition subsystem (213), which may include a video source (201), such as a digital camera, that creates, for example, a stream of uncompressed video samples (202). The sample stream (202), depicted as a thick line to emphasize its high data volume relative to the encoded video bitstream, can be processed by an encoder (203) coupled to the camera (201). The encoder (203) may comprise hardware, software, or a combination of hardware and software to implement or embody aspects of the subject matter disclosed herein, as described in more detail below. The encoded video bitstream (204), depicted as a thin line to emphasize its lower data volume relative to the sample stream, may be stored on a streaming server (205) for future use. One or more streaming clients (206, 208) may access the streaming server (205) to retrieve copies (207, 209) of the encoded video bitstream (204). A client (206) may include a video decoder (210) that decodes an incoming copy (207) of the encoded video bitstream and generates an output stream of video samples (211) that can be presented on a display (212) or another presentation device (not depicted). In some streaming systems, the video bitstreams (204, 207, 209) may be encoded according to certain video encoding/compression standards. Examples of such standards include ITU-T Recommendation H.265. The video coding standard under development is informally known as Versatile Video Coding (VVC). The subject matter disclosed herein may be used in the context of VVC.

Fig. 3 may be a functional block diagram of a video decoder (210) according to an embodiment disclosed herein.

The receiver (310) may receive one or more codec video sequences to be decoded by the decoder (210); in the same or another embodiment, the encoded video sequences are received one at a time, wherein each encoded video sequence is decoded independently of the other encoded video sequences. The encoded video sequence may be received from a channel (312), which may be a hardware/software link to a storage device that stores encoded video data. The receiver (310) may receive encoded video data as well as other data, e.g. encoded audio data and/or auxiliary data streams, which may be forwarded to their respective usage entities (not depicted). The receiver (310) may separate the encoded video sequence from other data. To prevent network jitter, a buffer memory (315) may be coupled between the receiver (310) and the entropy decoder/parser (320) (hereinafter "parser"). When the receiver (310) receives data from a store/forward device with sufficient bandwidth and controllability, or from an isochronous network, the buffer (315) may not need to be configured, or may be made smaller. For use over best effort traffic packet networks such as the internet, a buffer (315) may be required, which may be relatively large and advantageously may be of adaptive size.

The video decoder (210) may include a parser (320) to reconstruct symbols (321) from the entropy encoded video sequence. The categories of these symbols include information for managing the operation of the decoder (210), as well as potential information to control a presentation device such as a display (212) that is not an integral part of the decoder, but may be coupled to the decoder, as shown in fig. 3. The control Information for the presentation device may be a Supplemental Enhancement Information (SEI) message or a parameter set fragment (not depicted) of Video Usability Information (VUI). The parser (320) may parse/entropy decode the received encoded video sequence. Encoding of the encoded video sequence may be in accordance with video coding techniques or standards and may follow various principles known to those skilled in the art, including variable length coding, Huffman coding, arithmetic coding with or without context sensitivity, and so forth. A parser (320) may extract a subgroup parameter set for at least one of the subgroups of pixels in the video decoder from the encoded video sequence based on at least one parameter corresponding to the group. A sub-Group may include a Group of Pictures (GOP), a picture, a sub-picture, a tile (tile), a slice (slice), a brick (brick), a macroblock, a Coding Tree Unit (CTU), a Coding Unit (CU), a block, a Transform Unit (TU), a Prediction Unit (PU), and so on. A tile may refer to a rectangular region of a CU/CTU within a particular row or column of tiles in a picture. A brick may refer to a rectangular area of a CU/CTU row within a particular block. A slice may refer to one or more bricks of a picture contained in a NAL unit. A sub-picture may refer to a rectangular region of one or more slices in a picture. The entropy decoder/parser may also extract information from the encoded video sequence, such as transform coefficients, quantizer parameter values, motion vectors, and so on.

The parser (320) may perform entropy decoding/parsing operations on the video sequence received from the buffer (315), thereby creating symbols (321).

The reconstruction of the symbol (321) may involve a number of different units depending on the type of the encoded video picture or a portion of the encoded video picture (e.g., inter and intra pictures, inter and intra blocks), among other factors. Which units are involved and the way in which they are involved can be controlled by subgroup control information parsed from the coded video sequence by the parser (320). For the sake of brevity, such a subgroup control information flow between parser (320) and the following units is not described.

In addition to the functional blocks already mentioned, the decoder 210 may be conceptually subdivided into several functional units as described below. In a practical embodiment operating under business constraints, many of these units interact closely with each other and may be integrated with each other. However, for the purposes of describing the disclosed subject matter, a conceptual subdivision into the following functional units is appropriate.

The first unit is the scaler/inverse transform unit (351). The scaler/inverse transform unit (351) receives quantized transform coefficients as symbols (321) from the parser (320), along with control information including which transform to use, block size, quantization factor, quantization scaling matrices, and so on. The scaler/inverse transform unit (351) may output blocks comprising sample values, which may be input into the aggregator (355).

In some cases, the output samples of the scaler/inverse transform unit (351) may belong to an intra-coded block; that is, a block that does not use predictive information from previously reconstructed pictures, but can use predictive information from previously reconstructed portions of the current picture. Such predictive information may be provided by the intra picture prediction unit (352). In some cases, the intra picture prediction unit (352) generates a block of the same size and shape as the block being reconstructed, using surrounding already-reconstructed information fetched from the current (partially reconstructed) picture (358). In some cases, the aggregator (355) adds, on a per-sample basis, the prediction information generated by the intra prediction unit (352) to the output sample information provided by the scaler/inverse transform unit (351).

In other cases, the output samples of the scaler/inverse transform unit (351) may belong to an inter-coded and potentially motion-compensated block. In this case, the motion compensated prediction unit (353) may access the reference picture memory (357) to fetch samples used for prediction. After motion compensation of the fetched samples according to the symbols (321) pertaining to the block, these samples may be added by the aggregator (355) to the output of the scaler/inverse transform unit (in this case called residual samples or residual signal), thereby generating output sample information. The addresses within the reference picture memory from which the motion compensation unit fetches prediction samples may be controlled by motion vectors, available to the motion compensation unit in the form of symbols (321) that can include, for example, X, Y, and reference picture components. Motion compensation may also include interpolation of sample values fetched from the reference picture memory when sub-sample exact motion vectors are in use, motion vector prediction mechanisms, and so forth.

The output samples of the aggregator (355) may be employed in a loop filter unit (356) by various loop filtering techniques. The video compression techniques may include in-loop filter techniques that are controlled by parameters included in the encoded video code stream and which are available to the loop filter unit (356) as symbols (321) from the parser (320), and may also be responsive to meta-information obtained during decoding of a previous (in decoding order) portion of an encoded picture or encoded video sequence, and to sample values that have been previously reconstructed and have been loop filtered.

The output of the loop filter unit (356) may be a stream of samples that may be output to a rendering device (212) and stored in a reference picture memory for subsequent inter picture prediction.

Once fully reconstructed, some of the coded pictures may be used as reference pictures for future prediction. Once the encoded picture is fully reconstructed and the encoded picture is identified as a reference picture (by, for example, parser (320)), the current reference picture (358) may become part of reference picture buffer (357) and new current picture memory may be reallocated before reconstruction of a subsequent encoded picture begins.

The video decoder (210) may perform decoding operations according to a predetermined video compression technique, such as that documented in ITU-T Recommendation H.265. The encoded video sequence may conform to the syntax specified by the video compression technique or standard in use, as documented in the video compression technique or standard and, in particular, in the profiles therein. For conformance, the complexity of the encoded video sequence is also required to be within bounds defined by the level of the video compression technique or standard. In some cases, levels restrict the maximum picture size, maximum frame rate, maximum reconstruction sample rate (measured in, for example, megasamples per second), maximum reference picture size, and so on. In some cases, the limits set by levels may be further restricted through the Hypothetical Reference Decoder (HRD) specification and the metadata for HRD buffer management signaled in the encoded video sequence.

In an embodiment, the receiver (310) may receive additional (redundant) data along with the encoded video. The additional data may be part of an encoded video sequence. The additional data may be used by the video decoder (210) to properly decode the data and/or more accurately reconstruct the original video data. The additional data may be in the form of, for example, a temporal, spatial, or signal-to-noise ratio (SNR) enhancement layer, redundant stripes, redundant pictures, forward error correction codes, and so forth.

Fig. 4 may be a block diagram of a video encoder (203) according to an embodiment of the disclosure.

The encoder (203) may receive video samples from a video source (201) (not part of the encoder) that may capture video images to be encoded by the encoder (203).

The video source (201) may provide the source video sequence to be encoded by the encoder (203) in the form of a digital video sample stream, which can be of any suitable bit depth (e.g., 8 bits, 10 bits, 12 bits, ...), any color space (e.g., BT.601 Y CrCb, RGB, ...), and any suitable sampling structure (e.g., Y CrCb 4:2:0, Y CrCb 4:4:4). In a media serving system, the video source (201) may be a storage device that stores previously prepared video. In a video conferencing system, the video source (201) may be a camera that captures local image information as a video sequence. Video data may be provided as a plurality of individual pictures that impart motion when viewed in sequence. The pictures themselves may be organized as a spatial array of pixels, wherein each pixel can comprise one or more samples depending on the sampling structure, color space, and so on that is in use. A person skilled in the art can readily understand the relationship between pixels and samples. The description below focuses on samples.

According to an embodiment, the encoder (203) may encode and compress pictures of a source video sequence into an encoded video sequence (443) in real-time or under any other temporal constraints required by the application. It is a function of the controller (450) to implement the appropriate encoding speed. The controller controls other functional units as described below and is functionally coupled to these units. For simplicity, the couplings are not labeled in the figures. The parameters set by the controller may include rate control related parameters (picture skip, quantizer, lambda value of rate distortion optimization technique, etc.), picture size, GOP layout, maximum motion vector search range, etc. Other functions of the controller (450), which may involve the video encoder (203) being optimized for a certain system design, may be readily identified by those skilled in the art.

Some video encoders operate in what a person skilled in the art readily recognizes as a "coding loop". As a simplified description, a coding loop may consist of the encoding part of a source encoder (430) (responsible for creating symbols based on the input picture to be encoded and reference pictures), and a (local) decoder (433) embedded in the encoder (203) that reconstructs the symbols to create sample data in a manner similar to the way a (remote) decoder would create it (since, in the video compression technologies considered in the disclosed subject matter, any compression between the symbols and the encoded video bitstream is lossless). The reconstructed sample stream is input to the reference picture memory (434). Since the decoding of a symbol stream leads to bit-exact results independent of decoder location (local or remote), the reference picture buffer content is also bit-exact between the local encoder and the remote encoder. In other words, the prediction part of the encoder "sees" as reference picture samples exactly the same sample values a decoder would "see" when using prediction during decoding. This fundamental principle of reference picture synchronicity (and resulting drift, if synchronicity cannot be maintained, for example because of channel errors) is well known to a person skilled in the art.

The operation of the "local" decoder (433) may be the same as the "remote" decoder (210) that has been described in detail above in connection with fig. 3. However, referring briefly also to fig. 4, when symbols are available and the entropy encoder (445) and parser (320) are able to losslessly encode/decode the symbols into an encoded video sequence, the entropy decoding portion of the decoder (210), including the channel (312), receiver (310), buffer (315), and parser (320), may not be fully implemented in the local decoder (433).

At this point it can be observed that any decoder technique other than the parsing/entropy decoding present in the decoder must also be present in the corresponding encoder in substantially the same functional form. For this reason, the present application focuses on decoder operation. The description of the encoder techniques may be simplified because the encoder techniques are reciprocal to the fully described decoder techniques. A more detailed description is only needed in certain areas and is provided below.

As part of its operation, the source encoder (430) may perform motion compensated predictive coding, which predictively codes the input frame, with reference to one or more previously coded frames from the video sequence that are designated as "reference frames". In this way, the encoding engine (432) encodes the difference between a block of pixels of an input frame and a block of pixels of a reference frame, which may be selected as a prediction reference for the input frame.

The local video decoder (433) may decode encoded video data of a frame that may be designated as a reference frame based on symbols created by the source encoder (430). The operation of the encoding engine (432) may advantageously be a lossy process. When the encoded video data can be decoded at a video decoder (not shown in fig. 4), the reconstructed video sequence may typically be a copy of the source video sequence with some errors. The local video decoder (433) replicates a decoding process that may be performed on reference frames by the video decoder, and may cause reconstructed reference frames to be stored in a reference picture cache (434). In this way, the encoder (203) may locally store a copy of the reconstructed reference frame that has common content (no transmission errors) with the reconstructed reference frame to be obtained by the remote video decoder.

The predictor (435) may perform a prediction search for the coding engine (432). That is, for a new frame to be encoded, predictor (435) may search reference picture memory (434) for sample data (as candidate reference pixel blocks) or some metadata, such as reference picture motion vectors, block shapes, etc., that may be referenced as appropriate predictions for the new picture. The predictor (435) may operate on a block-by-block basis of samples to find a suitable prediction reference. In some cases, from search results obtained by predictor (435), it may be determined that the input picture may have prediction references taken from multiple reference pictures stored in reference picture memory (434).

The controller (450) may manage encoding operations of the video encoder (430), including, for example, setting parameters and subgroup parameters for encoding the video data.

The outputs of all of the above functional units may be entropy coded in the entropy encoder (445). The entropy encoder losslessly compresses the symbols generated by the various functional units according to techniques known to those skilled in the art, such as Huffman coding, variable length coding, arithmetic coding, and so forth, converting the symbols into an encoded video sequence.

The transmitter (440) may buffer the encoded video sequence created by the entropy encoder (445) in preparation for transmission over a communication channel (460), which may be a hardware/software link to a storage device that will store the encoded video data. The transmitter (440) may combine the encoded video data from the video encoder (430) with other data to be transmitted, such as encoded audio data and/or an auxiliary data stream (sources not shown).

The controller (450) may manage the operation of the encoder (203). During encoding, the controller (450) may assign each encoded picture a certain encoded picture type, which may affect the encoding techniques applicable to the respective picture. For example, a picture may be generally assigned to any of the following frame types:

Intra pictures (I pictures), which may be pictures that can be encoded and decoded without using any other frame in the sequence as a source of prediction. Some video codecs allow different types of intra pictures, including, for example, Independent Decoder Refresh ("IDR") pictures. A person skilled in the art is aware of those variants of I pictures and their respective applications and features.

Predictive pictures (P pictures), which may be pictures that may be encoded and decoded using intra prediction or inter prediction that uses at most one motion vector and reference index to predict sample values of each block.

Bi-predictive pictures (B-pictures), which may be pictures that can be encoded and decoded using intra-prediction or inter-prediction that uses at most two motion vectors and reference indices to predict sample values of each block. Similarly, multiple predictive pictures may use more than two reference pictures and associated metadata for reconstructing a single block.

A source picture may typically be spatially subdivided into blocks of samples (e.g., blocks of 4 x 4, 8 x 8, 4 x 8, or 16 x 16 samples) and encoded block-wise. These blocks may be predictively encoded with reference to other (encoded) blocks that are determined according to the encoding allocation applied to their respective pictures. For example, a block of an I picture may be non-predictive encoded, or the block may be predictive encoded (spatial prediction or intra prediction) with reference to an already encoded block of the same picture. The pixel block of the P picture may be non-predictively encoded by spatial prediction or by temporal prediction with reference to one previously encoded reference picture. A block of a B picture may be non-predictively encoded by spatial prediction or by temporal prediction with reference to one or two previously encoded reference pictures.

The video encoder (203) may perform encoding operations according to a predetermined video encoding technique or standard, such as ITU-T Recommendation H.265. In its operation, the video encoder (203) may perform various compression operations, including predictive encoding operations that exploit temporal and spatial redundancies in the input video sequence. The encoded video data may therefore conform to a syntax specified by the video coding technique or standard being used.

In an embodiment, the transmitter (440) may transmit additional data together with the encoded video. The video encoder (430) may include such data as part of the encoded video sequence. The additional data may comprise temporal/spatial/SNR enhancement layers, other forms of redundant data such as redundant pictures and slices, Supplemental Enhancement Information (SEI) messages, Video Usability Information (VUI) parameter set fragments, and so on.

Embodiments may relate to a signaling mechanism for the output layer sets of a scalable video stream. When output layers are signaled for each output layer set, embodiments may involve methods that derive the corresponding (directly dependent/independent) layers for each output layer set from the direct reference layer information.

In an embodiment, an output layer may refer to a layer of an output layer set that is output. In an embodiment, an output layer set may refer to a specified set of layers, where one or more layers in the set of layers are specified as output layers. In an embodiment, an output layer set layer index may refer to an index of a layer in an output layer set into the list of layers in the output layer set.

Fig. 5 illustrates an example of a syntax table related to a Video Parameter Set (VPS) original byte sequence payload (RBSP) syntax according to an embodiment.

In an embodiment, the VPS RBSP may be available to the decoding process before it is referenced, included in at least one access unit with TemporalId equal to 0 or provided through external means, and the VPS NAL unit containing the VPS RBSP may have nuh_layer_id equal to vps_layer_id[0].

Within a CVS, all VPS NAL units having a particular value of vps_video_parameter_set_id may have the same content.

vps_video_parameter_set_id provides an identifier for the VPS for reference by other syntax elements.

vps_max_layers_minus1 plus 1 may specify the maximum number of allowed layers in each CVS referring to the VPS.

vps_all_independent_layers_flag equal to 1 may specify that all layers in the CVS are independently coded without using inter-layer prediction. vps_all_independent_layers_flag equal to 0 may specify that one or more of the layers in the CVS may use inter-layer prediction. When vps_all_independent_layers_flag is not present, its value may be inferred to be equal to 1. When vps_all_independent_layers_flag is equal to 1, the value of vps_independent_layer_flag[i] may be inferred to be equal to 1.

vps_layer_id[i] may specify the nuh_layer_id value of the i-th layer. For any two non-negative integer values m and n, when m is less than n, the value of vps_layer_id[m] may be less than vps_layer_id[n].

vps_independent_layer_flag[i] equal to 1 may specify that the layer with index i does not use inter-layer prediction. vps_independent_layer_flag[i] equal to 0 may specify that the layer with index i may use inter-layer prediction and that vps_layer_dependency_flag[i] is present in the VPS. When vps_independent_layer_flag[i] is not present, its value may be inferred to be equal to 1.

vps_direct_ref_layer_flag[i][j] equal to 0 may specify that the layer with index j is not a direct reference layer for the layer with index i. vps_direct_ref_layer_flag[i][j] equal to 1 may specify that the layer with index j is a direct reference layer for the layer with index i. When vps_direct_ref_layer_flag[i][j] is not present for i and j in the range of 0 to vps_max_layers_minus1, inclusive, its value may be inferred to be equal to 0.

The variables NumDirectRefLayers[i], DirectRefLayerIdx[i][d], NumRefLayers[i], and RefLayerIdx[i][r] may be derived as follows:
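
The derivation itself is not reproduced in this text. The following sketch reconstructs it from the semantics of vps_direct_ref_layer_flag given above; it is illustrative only, and dependencyFlag is a temporary variable introduced here to mark direct or indirect dependence:

for( i = 0; i <= vps_max_layers_minus1; i++ )
    for( j = 0; j <= vps_max_layers_minus1; j++ ) {
        dependencyFlag[ i ][ j ] = vps_direct_ref_layer_flag[ i ][ j ]
        for( k = 0; k < i; k++ )  /* transitive closure over already-processed layers */
            if( vps_direct_ref_layer_flag[ i ][ k ] && dependencyFlag[ k ][ j ] )
                dependencyFlag[ i ][ j ] = 1
    }
for( i = 0; i <= vps_max_layers_minus1; i++ ) {
    for( j = 0, d = 0, r = 0; j <= vps_max_layers_minus1; j++ ) {
        if( vps_direct_ref_layer_flag[ i ][ j ] )
            DirectRefLayerIdx[ i ][ d++ ] = j
        if( dependencyFlag[ i ][ j ] )
            RefLayerIdx[ i ][ r++ ] = j
    }
    NumDirectRefLayers[ i ] = d
    NumRefLayers[ i ] = r
}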

the variable GeneralLayerIdx [ i ] specifying the layer index for the layer with nuh _ layer _ id equal to vps _ layer _ id [ i ] can be derived as follows:

for(i=0;i<=vps_max_layers_minus1;i++)

GeneralLayerIdx[vps_layer_id[i]]=i

each output layer set may be specified to contain only one layer and each layer in the bitstream is itself an output layer set, where a single included layer is the only output layer. The reach _ layer _ is _ an _ outputLayerSet _ flag is equal to 0, and it can be specified that the output layer set may include more than one layer. If vps _ max _ layers _ minus1 is equal to 0, it can be inferred that the value of reach _ layer _ is _ an _ outputLayerSet _ flag is equal to 1. Otherwise, when vps _ all _ independent _ layers _ flag is equal to 0, it can be inferred that the value of reach _ layer _ is _ an _ outputLayerSet _ flag is equal to 0.

outputLayerSet_mode_idc equal to 0 may specify that the total number of output layer sets specified by the VPS is equal to vps_max_layers_minus1 + 1, that the i-th output layer set includes the layers with layer indices from 0 to i, inclusive, and that, for each output layer set, only the highest layer in the output layer set is output.

outputLayerSet_mode_idc equal to 1 may specify that the total number of output layer sets specified by the VPS is equal to vps_max_layers_minus1 + 1, that the i-th output layer set includes the layers with layer indices from 0 to i, inclusive, and that, for each output layer set, all layers in the output layer set are output.

outputLayerSet_mode_idc equal to 2 may specify that the total number of output layer sets specified by the VPS is explicitly signaled, that for each output layer set the output layers are explicitly signaled, and that the other layers are the layers that are direct or indirect reference layers of the output layers of the output layer set.

The value of outputLayerSet_mode_idc may be in the range of 0 to 2, inclusive. The value 3 of outputLayerSet_mode_idc is reserved for use by ITU-T | ISO/IEC.

When vps_all_independent_layers_flag is equal to 1 and each_layer_is_an_outputLayerSet_flag is equal to 0, the value of outputLayerSet_mode_idc may be inferred to be equal to 2.
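
Reading the presence and inference rules for each_layer_is_an_outputLayerSet_flag and outputLayerSet_mode_idc together, the decoder-side behaviour can be sketched as follows. This is an illustrative reading only; the exact presence conditions are carried by the VPS syntax of Fig. 5, which is not reproduced here:

if( vps_max_layers_minus1 == 0 )
    each_layer_is_an_outputLayerSet_flag = 1    /* inferred */
else if( !vps_all_independent_layers_flag )
    each_layer_is_an_outputLayerSet_flag = 0    /* inferred */
/* otherwise each_layer_is_an_outputLayerSet_flag is parsed from the VPS */
if( !each_layer_is_an_outputLayerSet_flag ) {
    if( vps_all_independent_layers_flag )
        outputLayerSet_mode_idc = 2             /* inferred */
    /* otherwise outputLayerSet_mode_idc is parsed from the VPS */
}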

num_output_layer_sets_minus1 plus 1 may specify the total number of output layer sets specified by the VPS when outputLayerSet_mode_idc is equal to 2.

The variable TotalNumOutputLayerSets, specifying the total number of output layer sets specified by the VPS, may be derived as follows:

if( vps_max_layers_minus1 == 0 )
    TotalNumOutputLayerSets = 1
else if( each_layer_is_an_outputLayerSet_flag || outputLayerSet_mode_idc == 0 || outputLayerSet_mode_idc == 1 )
    TotalNumOutputLayerSets = vps_max_layers_minus1 + 1
else if( outputLayerSet_mode_idc == 2 )
    TotalNumOutputLayerSets = num_output_layer_sets_minus1 + 1

When outputLayerSet_mode_idc is equal to 2, outputLayerSet_output_layer_flag[i][j] equal to 1 may specify that the layer with nuh_layer_id equal to vps_layer_id[j] is an output layer of the i-th output layer set. When outputLayerSet_mode_idc is equal to 2, outputLayerSet_output_layer_flag[i][j] equal to 0 may specify that the layer with nuh_layer_id equal to vps_layer_id[j] is not an output layer of the i-th output layer set.

The variable NumOutputLayersInOutputLayerSet[i], specifying the number of output layers in the i-th output layer set, and the variable OutputLayerIdInOutputLayerSet[i][j], specifying the nuh_layer_id value of the j-th output layer in the i-th output layer set, may be derived as follows:
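
The derivation of these variables is not reproduced in this text. The following sketch reconstructs it from the mode semantics above and is illustrative only; it assumes that outputLayerSet_output_layer_flag[i][j] is available for j from 0 to vps_max_layers_minus1 when outputLayerSet_mode_idc is equal to 2:

for( i = 0; i < TotalNumOutputLayerSets; i++ )
    if( each_layer_is_an_outputLayerSet_flag || outputLayerSet_mode_idc == 0 ) {
        /* a single output layer: the only layer, or the highest layer, of the i-th output layer set */
        NumOutputLayersInOutputLayerSet[ i ] = 1
        OutputLayerIdInOutputLayerSet[ i ][ 0 ] = vps_layer_id[ i ]
    } else if( outputLayerSet_mode_idc == 1 ) {
        /* all layers of the i-th output layer set (layer indices 0..i) are output */
        NumOutputLayersInOutputLayerSet[ i ] = i + 1
        for( j = 0; j <= i; j++ )
            OutputLayerIdInOutputLayerSet[ i ][ j ] = vps_layer_id[ j ]
    } else { /* outputLayerSet_mode_idc == 2: output layers are explicitly signaled */
        for( j = 0, k = 0; j <= vps_max_layers_minus1; j++ )
            if( outputLayerSet_output_layer_flag[ i ][ j ] )
                OutputLayerIdInOutputLayerSet[ i ][ k++ ] = vps_layer_id[ j ]
        NumOutputLayersInOutputLayerSet[ i ] = k
    }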

The variable NumLayersInOutputLayerSet[i], specifying the number of layers in the i-th output layer set, and the variable LayerIdInOutputLayerSet[i][j], specifying the nuh_layer_id value of the j-th layer in the i-th output layer set, may be derived as follows:
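
Again, the derivation is not reproduced here. The following sketch is illustrative only; for outputLayerSet_mode_idc equal to 2 it collects the signaled output layers together with their direct and indirect reference layers using NumRefLayers and RefLayerIdx derived earlier, and layerIncludedFlag is a temporary variable introduced here:

for( i = 0; i < TotalNumOutputLayerSets; i++ )
    if( each_layer_is_an_outputLayerSet_flag ) {
        NumLayersInOutputLayerSet[ i ] = 1
        LayerIdInOutputLayerSet[ i ][ 0 ] = vps_layer_id[ i ]
    } else if( outputLayerSet_mode_idc == 0 || outputLayerSet_mode_idc == 1 ) {
        NumLayersInOutputLayerSet[ i ] = i + 1
        for( j = 0; j <= i; j++ )
            LayerIdInOutputLayerSet[ i ][ j ] = vps_layer_id[ j ]
    } else { /* outputLayerSet_mode_idc == 2 */
        for( k = 0; k <= vps_max_layers_minus1; k++ )
            layerIncludedFlag[ k ] = outputLayerSet_output_layer_flag[ i ][ k ]
        for( k = 0; k <= vps_max_layers_minus1; k++ )
            if( outputLayerSet_output_layer_flag[ i ][ k ] )
                for( r = 0; r < NumRefLayers[ k ]; r++ )
                    layerIncludedFlag[ RefLayerIdx[ k ][ r ] ] = 1
        for( k = 0, j = 0; k <= vps_max_layers_minus1; k++ )
            if( layerIncludedFlag[ k ] )
                LayerIdInOutputLayerSet[ i ][ j++ ] = vps_layer_id[ k ]
        NumLayersInOutputLayerSet[ i ] = j
    }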

the variable OutputLayerSetLayeIdx [ i ] [ j ] specifying the OUTPUTLayerLayerSetLayeIdx layer index for the layer with nuh _ layer _ id equal to LayerIdInOutputLayerSet [ i ] [ j ] can be derived as follows:

for( i = 0; i < TotalNumOutputLayerSets; i++ )
    for( j = 0; j < NumLayersInOutputLayerSet[ i ]; j++ )
        OutputLayerSetLayerIdx[ i ][ LayerIdInOutputLayerSet[ i ][ j ] ] = j

The lowest layer in each output layer set may be an independent layer. In other words, for each i in the range of 0 to TotalNumOutputLayerSets - 1, inclusive, the value of vps_independent_layer_flag[GeneralLayerIdx[LayerIdInOutputLayerSet[i][0]]] may be equal to 1.

Each layer may be included in at least one output layer set specified by the VPS. In other words, for each layer with a particular value nuhLayerId of nuh_layer_id, equal to one of vps_layer_id[k] for k in the range of 0 to vps_max_layers_minus1, inclusive, there may be at least one pair of values i and j, where i is in the range of 0 to TotalNumOutputLayerSets - 1, inclusive, and j is in the range of 0 to NumLayersInOutputLayerSet[i] - 1, inclusive, such that the value of LayerIdInOutputLayerSet[i][j] is equal to nuhLayerId.
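
This constraint can be checked over the variables derived above. The following non-normative sketch illustrates one way to do so; includedFlag is a temporary variable introduced here:

for( k = 0; k <= vps_max_layers_minus1; k++ ) {
    includedFlag[ k ] = 0
    for( i = 0; i < TotalNumOutputLayerSets; i++ )
        for( j = 0; j < NumLayersInOutputLayerSet[ i ]; j++ )
            if( LayerIdInOutputLayerSet[ i ][ j ] == vps_layer_id[ k ] )
                includedFlag[ k ] = 1
    /* conformance to the constraint above means includedFlag[ k ] is equal to 1 for every k */
}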

vps_constraint_info_present_flag equal to 1 may specify that the general_constraint_info() syntax structure is present in the VPS. vps_constraint_info_present_flag equal to 0 may specify that the general_constraint_info() syntax structure is not present in the VPS.

In bitstreams conforming to this version of this specification, vps_reserved_zero_7bits may be equal to 0. Other values of vps_reserved_zero_7bits are reserved for use by ITU-T | ISO/IEC. Decoders may ignore the value of vps_reserved_zero_7bits.

vps_extension_flag equal to 0 may specify that no vps_extension_data_flag syntax elements are present in the VPS RBSP syntax structure. vps_extension_flag equal to 1 may specify that vps_extension_data_flag syntax elements are present in the VPS RBSP syntax structure.

vps_extension_data_flag may have any value. Its presence and value do not affect decoder conformance to the profiles specified in this version of this specification. Decoders conforming to this version of this specification may ignore all vps_extension_data_flag syntax elements.

Fig. 6 is a flow diagram of an example method 600 for decoding an encoded video bitstream. In some embodiments, one or more of the method blocks of fig. 6 may be performed by decoder 210. In some implementations, one or more of the method blocks of fig. 6 may be performed by another device or group of devices, such as encoder 203, separate from decoder 210 or including decoder 210.

As shown in fig. 6, method 600 may include obtaining an encoded video sequence including a plurality of output layer sets from an encoded video bitstream (block 601).

As further shown in fig. 6, the method 600 may include obtaining a first flag (block 602).

As further shown in fig. 6, the method 600 may include determining whether each of the plurality of output layer sets includes more than one layer based on the first flag (block 603).

As further shown in fig. 6, method 600 may include: based on the first flag indicating that each output layer set includes only a single layer ("no" at block 603), the single layer of each output layer set is selected as at least one output layer (block 604), and the at least one output layer is output (block 608).

As further shown in fig. 6, based on the first flag indicating that each output layer set includes more than one layer ("yes" at block 603), the method 600 may proceed to blocks 605, 606, 607, and 608.

As further shown in fig. 6, the method 600 may include obtaining a first syntax element indicating an output layer set mode (block 605).

As further shown in fig. 6, the method 600 may include determining an output layer set mode based on the first syntax element (block 606).

As further shown in fig. 6, the method 600 may include selecting at least one layer as at least one output layer from among the layers included in the plurality of output layer sets based on the output layer set mode (block 607).

As further shown in fig. 6, the method 600 may include outputting at least one output layer (block 608).
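
The decision logic of blocks 601 to 608 can be summarized by the following illustrative sketch. The variable names mirror the VPS semantics described earlier; the sketch is a high-level outline, not the exact decoder implementation:

/* blocks 601-602: obtain the coded video sequence and the first flag
   (each_layer_is_an_outputLayerSet_flag) from the VPS */
if( each_layer_is_an_outputLayerSet_flag ) {
    /* blocks 603-604: each output layer set contains a single layer,
       which is selected as the output layer */
} else {
    /* blocks 605-606: obtain outputLayerSet_mode_idc and determine the output layer set mode */
    /* block 607: select the output layers according to the mode:
       mode 0 - the highest layer of each output layer set,
       mode 1 - all layers of each output layer set,
       mode 2 - the layers explicitly signaled by outputLayerSet_output_layer_flag */
}
/* block 608: output the selected output layer(s) */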

In an embodiment, the first flag and the first syntax element may be signaled in a Video Parameter Set (VPS).

In an embodiment, based on the first flag indicating that each output layer set includes only one layer, the one layer of each output layer set may be selected as the at least one output layer.

In an embodiment, the highest layer of each output layer set may be selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is the first mode.

In an embodiment, all layers included in the plurality of output layer sets may be selected as the at least one output layer based on the first syntax element indicating that the output layer set mode is the second mode.

In an embodiment, based on the first syntax element indicating that the output layer set mode is a third mode, the at least one output layer may be selected from among the layers included in the plurality of output layer sets based on a second syntax element signaled in the encoded video bitstream.

In an embodiment, based on the first syntax element indicating that the output layer set mode is the third mode, a layer not selected from layers included in the plurality of output layer sets may be used as a reference layer for the at least one output layer.

In an embodiment, the output layer set mode may be inferred to be the third mode based on the first flag indicating that each output layer set includes more than one layer and the second flag indicating that all layers included in the plurality of output layer sets are independently encoded.

Although fig. 6 shows example blocks of the method 600, in some implementations, the method 600 may include more blocks, fewer blocks, different blocks, or a different arrangement of blocks than those depicted in fig. 6. Additionally or alternatively, two or more blocks of method 600 may be performed in parallel.

Further, the proposed method may be implemented by means of a processing circuit (e.g. one or more processors or one or more integrated circuits). In one example, at least one processor executes a program stored in a non-transitory computer readable medium to perform one or more of the proposed methods.

The techniques described above may be implemented as computer software using computer readable instructions and physically stored in one or more computer readable storage media. For example, fig. 7 illustrates a computer system 700 suitable for implementing certain embodiments of the disclosed subject matter.

The computer software may be encoded using any suitable machine code or computer language that may be subject to assembly, compilation, linking, or similar mechanisms to create code comprising instructions that are executable by a computer Central Processing Unit (CPU), Graphics Processing Unit (GPU), etc., either directly or through interpretation, microcode execution, etc.

The instructions may be executed on various types of computers or computer components, including, for example, personal computers, tablets, servers, smart phones, gaming devices, internet of things devices, and the like.

The components shown in FIG. 7 for computer system 700 are exemplary in nature and are not intended to suggest any limitation as to the scope of use or functionality of the computer software implementing embodiments of the application. Neither should the configuration of components be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiment of the computer system 700.

The computer system 700 may include some human interface input devices. Such human interface input devices may be responsive to input by one or more human users through, for example, tactile input (e.g., keys, swipes, data glove movements), audio input (e.g., speech, taps), visual input (e.g., gestures), olfactory input (not depicted). The human interface device may also be used to capture certain media that are not necessarily directly related to human conscious input, such as audio (e.g., speech, music, ambient sounds), images (e.g., scanned images, photographic images obtained from still-image cameras), video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).

The input human interface device may include one or more of the following (only one depicted each): keyboard 701, mouse 702, track pad 703, touch screen 710 and associated graphics adapter 750, data gloves, joystick 705, microphone 706, scanner 707, camera 708.

The computer system 700 may also include some human interface output devices. Such human interface output devices may stimulate the perception of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human interface output devices may include haptic output devices (e.g., haptic feedback of touchscreen 710, data glove, or joystick 705, although there may also be haptic feedback devices that do not act as input devices), audio output devices (e.g., speaker 709, headphones (not depicted)), visual output devices (e.g., screen 710, including Cathode Ray Tube (CRT) screens, Liquid Crystal Display (LCD) screens, plasma screens, Organic Light Emitting Diode (OLED) screens, each with or without touch screen input capability, each with or without haptic feedback capability-some of which are capable of outputting two-dimensional visual output or output greater than three-dimensional by way of, for example, stereographic output; virtual reality glasses (not depicted), holographic displays, and smoke boxes (not depicted)), and printers (not depicted).

The computer system 700 may also include human-accessible storage and associated media for the storage, e.g., optical media, including CD/DVD ROM/RW720 with CD/DVD like media 721, thumb drive 722, removable hard drive or solid state drive 723, magnetic tape and floppy disk (not depicted) like legacy magnetic media, ROM/Application Specific Integrated Circuit (ASIC)/Programmable Logic Device (PLD) based application specific devices, e.g., security devices (not depicted), etc.

Those skilled in the art will also appreciate that the term "computer-readable medium" used in connection with the presently disclosed subject matter does not encompass transmission media, carrier waves, or other transitory signals.

The computer system 700 may also include an interface to one or more communication networks (755). The network may be, for example, wireless, wired, or optical. The network may further be local, wide-area, metropolitan, vehicular and industrial, real-time, delay-tolerant, and so on. Examples of networks include local area networks such as Ethernet and wireless LANs, cellular networks including global system for mobile communications (GSM), third generation (3G), fourth generation (4G), fifth generation (5G), Long Term Evolution (LTE), and the like, TV wired or wireless wide-area digital networks including cable TV, satellite TV, and terrestrial broadcast TV, vehicular networks including the controller area network bus (CANBus), industrial networks, and so forth. Certain networks commonly require an external network interface adapter (754) attached to certain general-purpose data ports or peripheral buses (749), such as a Universal Serial Bus (USB) port of the computer system 700; others are commonly integrated into the core of the computer system 700 by attachment to a system bus as described below (for example, an Ethernet interface into a PC computer system, or a cellular network interface into a smartphone computer system). As an example, the network 755 may be connected to the peripheral bus 749 using the network interface 754. Using any of these networks, the computer system 700 may communicate with other entities. Such communication may be unidirectional, receive only (for example, broadcast TV), unidirectional send-only (for example, CANBus to certain CANBus devices), or bidirectional, for example, to other computer systems using local or wide area digital networks. Certain protocols and protocol stacks may be used on each of those networks and network interfaces (754) as described above.

The human interface device, human accessible storage device, and network interface described above may be attached to the core 740 of the computer system 700.

The core 740 may include one or more Central Processing Units (CPUs) 741, Graphics Processing Units (GPUs) 742, dedicated programmable processing units 743 in the form of Field Programmable Gate Arrays (FPGAs), hardware accelerators 744 for certain tasks, and so forth. These devices, along with Read Only Memory (ROM)745, random access memory 746, internal mass storage devices 747 such as internal non-user accessible hard drives, Solid State Drives (SSDs), etc., may be connected by a system bus 748. In some computer systems, the system bus 748 may be accessible by one or more physical plugs to enable expansion by additional CPUs, GPUs, and the like. Peripheral devices may be attached to the system bus 748 of the core either directly or through a peripheral bus 749. Architectures for peripheral buses include Peripheral Component Interconnect (PCI), USB, and the like.

The CPU 741, GPU 742, FPGA 743, and accelerator 744 can execute certain instructions that, in combination, can constitute the above-described computer code. The computer code may be stored in ROM 745 or RAM 746. Transitional data may also be stored in RAM 746, while persistent data may be stored, for example, in the internal mass storage device 747. Fast storage and retrieval of any memory device may be achieved through the use of cache memory, which may be closely associated with one or more of CPU 741, GPU 742, mass storage 747, ROM 745, RAM 746, and the like.

Computer readable media may have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present application, or they may be of the kind well known and available to those having skill in the computer software arts.

By way of example, and not limitation, a computer system having architecture 700, and in particular core 740, may provide functionality resulting from a processor (including a CPU, GPU, FPGA, accelerator, etc.) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with the user-accessible mass storage devices described above as well as certain storage devices of a non-transitory nature of the core 740 (e.g., the core internal mass storage device 747 or the ROM 745). Software implementing various embodiments of the present application may be stored in such devices and executed by the core 740. The computer readable medium may include one or more memory devices or chips, according to particular needs. The software may cause the core 740, and in particular the processors therein (including CPUs, GPUs, FPGAs, etc.), to perform certain processes or certain portions of certain processes described herein, including defining data structures stored in RAM 746 and modifying such data structures according to processes defined by the software. Additionally or alternatively, the computer system may provide functionality that results from logic that is hardwired or otherwise embodied in circuitry (e.g., the accelerator 744), which may operate in place of or in conjunction with software to perform a particular process or a particular portion of a particular process described herein. References to software may encompass logic, and vice versa, as appropriate. Where appropriate, reference to a computer-readable medium may encompass circuitry (e.g., an Integrated Circuit (IC)) that stores software for execution, circuitry that embodies logic for execution, or both. This application contemplates any suitable combination of hardware and software.

While this application describes several exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of this application. It will thus be appreciated that those skilled in the art will be able to devise various systems and methods which, although not explicitly shown or described herein, embody the principles of the application and are thus within the spirit and scope of the application.
