Multi-signal audio coding using signal whitening as pre-processing

Document No.: 789746  Publication date: 2021-04-09

Abstract: This technology, "Multi-signal audio coding using signal whitening as pre-processing", was created on 2019-06-27 by Eleni Fotopoulou, Markus Multrus, Sascha Dick, Goran Markovic, Pallavi Maben, Srikanth …. A multi-signal encoder for encoding at least three audio signals comprises: a signal pre-processor (100) for individually pre-processing each audio signal to obtain at least three pre-processed audio signals, wherein the pre-processing is performed such that the pre-processed audio signals are whitened with respect to the signals prior to the pre-processing; an adaptive joint signal processor (200) for performing processing on the at least three pre-processed audio signals to obtain at least three jointly processed signals or an unprocessed signal and at least two jointly processed signals; a signal encoder (300) for encoding each signal to obtain one or more encoded signals; and an output interface (400) for transmitting or storing an encoded multi-signal audio signal comprising the one or more encoded signals, side information related to said pre-processing and side information related to said processing.

1. A multi-signal encoder for encoding at least three audio signals, comprising:

a signal pre-processor (100) for individually pre-processing each audio signal to obtain at least three pre-processed audio signals, wherein the pre-processing is performed such that the pre-processed audio signals are whitened with respect to the signals prior to the pre-processing;

an adaptive joint signal processor (200) for performing processing on the at least three pre-processed audio signals to obtain at least three jointly processed signals or an unprocessed signal and at least two jointly processed signals;

a signal encoder (300) for encoding each signal to obtain one or more encoded signals; and

an output interface (400) for transmitting or storing an encoded multi-signal audio signal comprising the one or more encoded signals, side information related to the pre-processing and side information related to the processing.

2. Multi-signal encoder in accordance with claim 1, in which the adaptive joint signal processor (200) is configured to perform a wideband energy normalization (210) on the at least three preprocessed audio signals such that each preprocessed audio signal has a normalized energy, and

wherein the output interface (400) is configured to include a wideband energy normalization value (534) for each pre-processed audio signal as further side information.

3. Multi-signal encoder in accordance with claim 2, in which the adaptive joint signal processor (200) is configured to:

calculating (212) information on the average energy of the pre-processed audio signals;

calculating (211) information on the energy of each pre-processed audio signal; and

calculating (213, 214) the energy normalization value based on the information on the average energy and the information on the energy of the particular pre-processed audio signal.

4. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured to calculate (213, 214) a scaling (534b) of a particular preprocessed audio signal based on the average energy and the energy of the preprocessed audio signal, and

wherein the adaptive joint signal processor (200) is configured to determine a flag (534a) indicating whether the scaling is an up-scaling or a down-scaling, and wherein the flag of each signal is included in the encoded signal.

5. A multi-signal encoder according to claim 4,

wherein the adaptive joint signal processor (200) is configured to quantize (214) the scaling to the same quantization range regardless of whether the scaling is an up-scaling or a down-scaling.

6. Multi-signal encoder in accordance with one of the preceding claims, in which the adaptive joint signal processor (200) is configured to:

normalizing (210) each preprocessed audio signal with respect to a reference energy to obtain at least three normalized signals;

calculating (220) a cross-correlation value for each possible normalized signal pair of the at least three normalized signals;

selecting (229) the signal pair with the highest cross-correlation value;

determining (232a) a joint stereo processing mode for the selected signal pair; and

performing joint stereo processing (232b) on the selected signal pair according to the determined joint stereo processing mode to obtain a processed signal pair.

7. Multi-signal encoder in accordance with claim 6, in which the adaptive joint signal processor (200) is configured to apply cascaded signal pair pre-processing, or in which the adaptive joint signal processor (200) is configured to apply non-cascaded signal pair processing,

wherein in the cascaded signal pair pre-processing, the signals in the processed signal pair are selectable in a further iteration step consisting of: calculating updated cross-correlation values, selecting the signal pair with the highest cross-correlation value, determining a joint stereo processing mode for the selected signal pair, and joint stereo processing the selected signal pair according to the determined joint stereo processing mode, or

wherein, in the non-cascaded signal pair processing, a signal in a processed signal pair is not selectable in a further step consisting of: selecting the signal pair having the highest cross-correlation value, determining a joint stereo processing mode for the selected signal pair, and joint stereo processing of the selected signal pair according to the determined joint stereo processing mode.

8. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured to determine a signal to be individually encoded as a signal remaining after a pair-wise processing procedure, and

wherein the adaptive joint signal processor (200) is configured to modify the energy normalization applied to the signal prior to the pair-wise processing, e.g. by reverting (237) the energy normalization fully or at least partially.

9. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured to determine bit allocation information (536) for each signal to be processed by the signal encoder (300), wherein the output interface (400) is configured to introduce the bit allocation information (536) of each signal into the encoded signal.

10. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured for calculating (282) signal energy information for each signal to be processed by the signal encoder (300),

calculating (284) a total energy of the plurality of signals to be encoded by the signal encoder (300);

calculating (286) bit allocation information (536) for each signal based on the signal energy information and the total energy information, and

wherein the output interface (400) is configured to introduce the bit allocation information for each signal into the encoded signal.

11. A multi-signal encoder according to claim 10,

wherein the adaptive joint signal processor (200) is configured to: optionally assigning (290) an initial number of bits to each signal, assigning (291) a number of bits based on the bit allocation information, optionally performing (292) a further refinement step, and optionally performing (293) a final donation step, and

wherein the signal encoder (300) is configured to perform signal encoding using the assigned bits of each signal.

12. Multi-signal encoder in accordance with one of the preceding claims, in which the signal pre-processor (100) is configured to perform, for each audio signal:

a time-to-spectrum conversion operation (108, 110, 112) to obtain a spectrum for each audio signal;

a temporal noise shaping operation (114a, 114b) and/or a frequency domain noise shaping operation (116) for each signal spectrum, and

wherein the signal pre-processor (100) is configured to feed the signal spectra to the adaptive joint signal processor (200) after the temporal noise shaping operation and/or the frequency domain noise shaping operation, and

wherein the adaptive joint signal processor (200) is configured to perform joint signal processing on the received signal spectrum.

13. Multi-signal encoder in accordance with one of the preceding claims, in which the adaptive joint signal processor (200) is configured to

for each signal of the selected signal pair, determining: the necessary bit rate for a full-band separate coding mode, such as L/R, or the necessary bit rate for a full-band joint coding mode, such as M/S, or the necessary bit rate for a band-wise joint coding mode, such as M/S, plus the necessary bits for the band-wise signaling, such as an M/S mask,

determining a separate coding mode or a joint coding mode as a particular mode for all frequency bands of the signal pair, which is the case when the majority of the frequency bands has been determined for said particular mode and only a small fraction of the frequency bands, such as less than 10% of all frequency bands, has been determined for the other coding mode, or determining the coding mode requiring the minimum number of bits, and

wherein the output interface (400) is configured to include, in the encoded signal, an indication that the particular mode is to be used for all frequency bands of a frame, instead of an encoding mode mask for the frame.

14. Multi-signal encoder according to one of the preceding claims,

wherein the signal encoder (300) comprises a rate loop processor for each individual signal or across two or more signals, the rate loop processor being configured to receive and use bit allocation information (536) for a particular signal or for two or more signals.

15. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured to adaptively select signal pairs for joint encoding, or wherein the adaptive joint signal processor (200) is configured to determine, for each selected signal pair, a band-wise mid/side coding mode, a full band mid/side coding mode or a full band left/right coding mode, and wherein the output interface (400) is configured to indicate the selected coding mode as side information (532) in the encoded multi-signal audio signal.

16. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured to form a band-wise mid/side versus left/right decision based on the bit rate estimated for encoding each band in the mid/side mode and in the left/right mode, and wherein the final joint encoding mode is determined based on the result of the band-wise mid/side versus left/right decision.

17. Multi-signal encoder in accordance with one of the preceding claims, in which the adaptive joint signal processor (200) is configured for performing (260) a spectral band replication process or an intelligent gap-filling process for determining parametric side information for the spectral band replication process or the intelligent gap-filling process, and in which the output interface (400) is configured for including the spectral band replication or intelligent gap-filling side information (532) as additional side information in the encoded signal.

18. The multi-signal encoder of claim 17,

wherein the adaptive joint signal processor (200) is configured for performing a stereo intelligent gap-filling process on the encoded signal pairs and additionally a mono intelligent gap-filling process on at least one signal to be encoded separately.

19. Multi-signal encoder according to one of the preceding claims,

wherein the at least three audio signals comprise low frequency enhancement signals, and wherein the adaptive joint signal processor (200) is configured to apply a signal mask indicating for which signals the adaptive joint signal processor (200) is to be active, and wherein the signal mask indicates that the low frequency enhancement signals are not to be used in the pairwise processing of the at least three preprocessed audio signals.

20. Multi-signal encoder in accordance with one of claims 1 to 5, in which the adaptive joint signal processor (200) is configured to calculate an energy of an MDCT spectrum of a signal as information on the energy of the signal, or

to calculate an average energy of the MDCT spectra of the at least three pre-processed audio signals as information on the average energy of the at least three pre-processed audio signals.

21. Multi-signal encoder according to one of claims 1 to 5,

wherein the adaptive joint signal processor (200) is configured for calculating (213) a scaling factor for each signal based on energy information of a particular signal and energy information on an average energy of the at least three audio signals,

wherein the adaptive joint signal processor (200) is configured for quantizing (214) the scalings to obtain quantized scaling values, which are used to derive the scaling side information for each signal comprised in the encoded signal, and

wherein the adaptive joint signal processor (200) is configured to derive a quantized scaling from the quantized scaling value, wherein a pre-processed audio signal is scaled using the quantized scaling before the scaled signal is used for pair-wise processing with another correspondingly scaled signal.

22. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured for calculating (221) normalized inter-signal cross-correlation values of the possible signal pairs in order to decide and select, among the at least three pre-processed audio signals, which signal pair has the highest degree of similarity and is thus best suited for pair-wise processing,

wherein the normalized cross-correlation value for each signal pair is stored in a cross-correlation vector, and

wherein the adaptive joint signal processor (200) is configured for determining whether one or more previous frame signal pair selections are to be retained by comparing (222, 223) the cross-correlation vector of a previous frame with the cross-correlation vector of the current frame, and wherein the previous frame signal pair selections are retained (225) when the difference between the cross-correlation vector of the current frame and the cross-correlation vector of the previous frame is less than a predefined threshold.

23. Multi-signal encoder according to one of the preceding claims,

wherein the signal pre-processor (100) is configured for performing a time-frequency transformation using a window length selected from a plurality of different window lengths,

wherein the adaptive joint signal processor (200) is configured to, upon comparing the pre-processed audio signals to determine a signal pair to be pair-wise processed, determine whether the signal pair has the same associated window length, and

wherein the adaptive joint signal processor (200) is configured to allow pair-wise processing of two signals only when the two signals are associated with the same window length applied by the signal pre-processor (100).

24. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured to apply a non-cascaded signal pair processing in which signals in a processed signal pair are not selectable in a further signal pair processing, wherein the adaptive joint signal processor (200) is configured for selecting a signal pair based on a cross-correlation between the signal pairs for pair-wise processing, and wherein the pair-wise processing of several selected signal pairs is performed in parallel.

25. The multi-signal encoder of claim 24,

wherein the adaptive joint signal processor (200) is configured to determine a stereo coding mode for the selected signal pair, and wherein, when the stereo coding mode is determined to be dual mono mode, the signals involved in that signal pair are at least partially rescaled and indicated as signals to be encoded separately.

26. Multi-signal encoder according to one of claims 17 and 18,

wherein the adaptive joint signal processor (200) is configured for performing a stereo IGF operation on a pair-wise processed signal pair if the stereo mode of the core region is different from the stereo mode of the intelligent gap filling (IGF) region, or if the stereo mode of the core is flagged as band-wise mid/side coding, or

wherein the adaptive joint signal processor (200) is configured to apply a mono IGF analysis to the signals of the pair-wise processed signal pair if the stereo mode of the core region is not different from the stereo mode of the IGF region, or if the stereo mode of the core is not flagged as a band-wise mid/side coding mode.

27. Multi-signal encoder according to one of the preceding claims,

wherein the adaptive joint signal processor (200) is configured for: performing an intelligent gap-filling operation before the results of the IGF operations are separately encoded by the signal encoder (300),

wherein a power spectrum is used for the quantization and the tonality/noise determination in the intelligent gap filling (IGF), and wherein the signal pre-processor (100) is configured for performing the same frequency domain noise shaping on the MDST spectrum as has been applied to the MDCT spectrum, and

wherein the adaptive joint signal processor (200) is configured to perform the same mid/side processing on the pre-processed MDST spectrum, such that the processed MDST spectrum can be used within the quantization performed by the signal encoder (300) or within the intelligent gap filling process performed by the adaptive joint signal processor (200), or

wherein the adaptive joint signal processor (200) is configured to apply, to the MDST spectrum, the same normalization scaling based on the quantized full-band scaling vector that is used for scaling the MDCT spectrum.

28. Multi-signal encoder in accordance with one of the preceding claims, in which the adaptive joint signal processor (200) is configured for performing a pair-wise processing of the at least three pre-processed audio signals to obtain the at least three jointly processed signals or a signal to be encoded separately and at least two jointly processed signals.

29. Multi-signal encoder in accordance with one of the preceding claims, in which an audio signal of the at least three audio signals is an audio channel, or

wherein an audio signal of the at least three audio signals is an audio component signal of a sound field description, such as an Ambisonics sound field description, a B-format description, an A-format description, or any other sound field description, such as a sound field description describing a sound field with respect to a reference position.

30. Multi-signal encoder according to one of the preceding claims,

wherein the signal encoder (300) is configured for encoding each signal separately to obtain at least three separately encoded signals or for performing (entropy) encoding on more than one signal.

31. A multi-signal decoder for decoding an encoded signal, comprising:

a signal decoder (700) for decoding at least three encoded signals;

a joint signal processor (800) for performing joint signal processing on the basis of side information comprised in the encoded signal to obtain at least three processed decoded signals; and

a post-processor (900) for post-processing the at least three processed decoded signals in dependence on side information comprised in the encoded signal, wherein the post-processing is performed such that the post-processed signal is less whitened than the signal before post-processing, and wherein the post-processed signal represents the decoded audio signal.

32. The multi-signal decoder of claim 31, wherein the joint signal processor (800):

is configured to extract (610) an energy normalization value for each joint stereo decoded signal from the encoded signal;

is configured to pair-wise process (820) the decoded signal using a joint stereo mode indicated by side information in the encoded signal to obtain a joint stereo decoded signal; and

is configured to energy rescale (830) the joint stereo decoded signal using the energy normalization value to obtain a processed decoded signal.

33. The multi-signal decoder of claim 32,

wherein the joint signal processor (800) is configured to check whether an energy normalization value for a specific signal extracted from the encoded signal has a predefined value, and

wherein the joint signal processor (800) is configured to: when the energy normalization value has the predefined value, no energy rescaling or only reduced energy rescaling is performed on the particular signal.

34. The multi-signal decoder of one of claims 32 to 33, wherein the signal decoder (700) is configured to:

extracting (620) a bit allocation value for each encoded signal from the encoded signal,

determining (720) the bit allocation used for a signal using the bit allocation value for the signal, the number of bits remaining for all signals, and optionally a further refinement step or optionally a final donation step; and

performing (710, 730) individual decoding based on the used bit allocation for each signal.

35. Multi-signal decoder in accordance with one of claims 32 to 34, in which the joint signal processor (800) is configured to:

performing (820) a spectral band replication or intelligent gap filling process on the individually decoded signals using the side information in the encoded signal to obtain spectrally enhanced individual signals; and

performing joint processing (820) according to a joint processing mode using the spectrally enhanced individual signals.

36. The multi-signal decoder of claim 35,

wherein the joint signal processor (800) is configured to transform a source range from one stereo representation to another stereo representation when the destination range is indicated as having the other stereo representation.

37. Multi-signal decoder in accordance with one of claims 32 to 36, in which the joint signal processor (800) is configured to:

extracting an energy normalization value (534b) for each joint stereo decoded signal from the encoded signal and additionally extracting a flag (534a), the flag (534a) indicating whether the energy normalization value is an up-scaling value or a down-scaling value, and

performing (830) a rescaling using the energy normalization value, the rescaling being a down-scaling when the flag has a first value and an up-scaling when the flag has a second value different from the first value.

38. Multi-signal decoder in accordance with one of claims 32 to 37, in which the joint signal processor (800) is configured to:

extracting (630), from the encoded signal, side information indicative of the signal pairs resulting from the joint encoding operation,

performing (820) inverse stereo or multi-channel processing, starting from the last signal pair, to convert the encoded signals back to the original pre-processed spectrum of each signal, wherein the inverse stereo processing is performed based on the stereo mode and/or the band-wise mid/side decisions indicated in the side information (532) of the encoded signal.

39. Multi-signal decoder according to one of claims 32 to 38,

wherein the joint signal processor (800) is configured to denormalize (830) all signals involved in signal pair processing to their corresponding original energy levels based on the quantized energy scaling information included for each individual signal, and wherein other signals not involved in the signal pair processing are not denormalized in the way the signals involved in the signal pair processing are.

40. Multi-signal decoder according to one of claims 32 to 39,

wherein the post-processor (900) is configured to perform, for each individually processed decoded signal, a temporal noise shaping operation and/or a frequency domain noise shaping operation (910), a conversion (920) from the spectral domain into the time domain, and a subsequent overlap/add operation (930) between subsequent time frames of the post-processed signal.

41. Multi-signal decoder according to one of claims 32 to 40,

wherein the joint signal processor (800) is configured to extract, from the encoded signal, a flag indicating whether mid/side or left/right encoding is to be used for the inverse processing of the frequency bands of a time frame of a signal pair, and wherein the joint signal processor (800) is configured to use the flag to subject the corresponding frequency bands of the signal pair in their entirety to inverse mid/side processing or to left/right processing, depending on the value of the flag, and

wherein, for different time frames of the same signal pair, or for different signal pairs in the same time frame, an encoding mode mask indicating an individual encoding mode for each individual frequency band is extracted from the side information of the encoded signal, and wherein the joint signal processor (800) is configured to apply the inverse mid/side processing or the left/right processing to a frequency band as indicated by the bit associated with the corresponding frequency band.

42. Multi-signal decoder in accordance with one of claims 32 to 41, in which the encoded signal is an encoded multi-channel signal, in which the multi-signal decoder is a multi-channel decoder, in which the signal decoder (700) is a channel decoder, in which the encoded signals are encoded channels, in which the joint signal processing is a joint channel processing, in which the at least three processed decoded signals are at least three processed decoded channels, and in which the post-processed signals are channels, or

wherein the encoded signal is an encoded multi-component signal representing audio component signals of a sound field description, such as an Ambisonics sound field description, a B-format description, an A-format description, or any other sound field description, such as a sound field description describing a sound field with respect to a reference position, wherein the multi-signal decoder is a multi-component decoder, wherein the signal decoder (700) is a component decoder, wherein the encoded signals are encoded components, wherein the joint signal processing is a joint component processing, wherein the at least three processed decoded signals are at least three processed decoded components, and wherein the post-processed signals are component audio signals.

43. A method for performing multi-signal encoding of at least three audio signals, comprising:

individually pre-processing each audio signal to obtain at least three pre-processed audio signals, wherein the pre-processing is performed such that the pre-processed audio signals are whitened with respect to the signals prior to the pre-processing;

performing processing on the at least three preprocessed audio signals to obtain at least three jointly processed signals or a signal to be encoded separately and at least two jointly processed signals;

encoding each signal to obtain one or more encoded signals; and

transmitting or storing an encoded multi-signal audio signal comprising the one or more encoded signals, side information related to the pre-processing and side information related to the processing.

44. A method for multi-signal decoding of an encoded signal, comprising:

separately decoding at least three encoded signals;

performing joint signal processing according to side information included in the encoded signal to obtain at least three processed decoded signals; and

post-processing the at least three processed decoded signals according to side information comprised in the encoded signal, wherein the post-processing is performed such that the post-processed signal is less whitened than the signal before post-processing, and wherein the post-processed signal represents the decoded audio signal.

45. A computer program for performing the method of claim 43 or the method of claim 44 when running on a computer or processor.

46. An encoded signal, comprising:

at least three separately encoded signals (510);

side information (520) relating to a pre-processing performed to obtain the at least three separately encoded signals; and

side information (532) relating to the pair-wise processing performed for obtaining the at least three individually encoded signals, and

wherein the encoded signal comprises an energy scaling value (534) for each of the at least three encoded signals obtained by multi-signal encoding or a bit allocation value (536) for each of the separately encoded signals.

Technical Field

Embodiments relate to MDCT-based multi-signal encoding and decoding systems with signal-adaptive joint channel processing, wherein a signal may be a channel and the multi-signal is a multi-channel signal, or, alternatively, an audio signal is a component of a sound field description, such as an Ambisonics component, i.e., W, X, Y, Z in first-order Ambisonics, or any other component of a higher-order Ambisonics description. A signal may also be a signal of an A-format or B-format or any other format description of the sound field.

Background

In MPEG USAC [1], joint stereo coding of two channels is performed using complex prediction, MPS 2-1-2, or unified stereo with band-limited or full-band residual signals.

MPEG surround [2] hierarchically combines OTT and TTT boxes for joint coding of multi-channel audio with or without transmission of a residual signal.

MPEG-H quad channel elements [3] hierarchically apply MPS 2-1-2 stereo boxes followed by complex prediction/MS stereo boxes, building a "fixed" 4x4 remixing tree.

AC-4 [4] introduces new 3-, 4-, and 5-channel elements that allow remixing of the transmitted channels via a transmitted mixing matrix, plus subsequent joint stereo coding information.

Previous publications suggested the use of orthogonal transforms such as the Karhunen-Loeve transform (KLT) for enhanced multi-channel audio coding [5].

The multi-channel coding tool (MCT) [6] supports joint coding of more than two channels, enabling flexible and signal-adaptive joint channel coding in the MDCT domain. This is achieved by iteratively combining and concatenating stereo coding techniques, such as real-valued complex stereo prediction and rotated stereo coding (KLT), on two designated channels at a time.

In a 3D audio context, the loudspeaker channels are distributed over several height layers, resulting in horizontal and vertical channel pairs. Joint coding of only two channels, as defined in USAC, is not sufficient to take the spatial and perceptual relationships between the channels into account. If MPEG Surround is applied in an additional pre/post-processing step, the residual signals are transmitted individually without the possibility of joint stereo coding, e.g. to exploit dependencies between the left and right vertical residual signals. AC-4 introduces dedicated N-channel elements that allow efficient encoding of joint coding parameters, but they do not cover generic speaker setups with more channels, as proposed for new immersive playback scenarios (7.1+4, 22.2). The MPEG-H four-channel element is also restricted to 4 channels and cannot be applied dynamically to arbitrary channels, but only to a pre-configured and fixed number of channels. MCT introduces the flexibility of signal-adaptive joint channel coding for arbitrary channels, but the stereo processing is done on windowed and transformed, non-normalized (non-whitened) signals. Furthermore, the coding of the prediction coefficients or angles for each stereo box in each frequency band requires a considerable number of bits.

Disclosure of Invention

It is an object of the present invention to provide an improved and more flexible concept for multi-signal encoding or decoding.

This object is achieved by the multi-signal encoder of claim 1, the multi-signal decoder of claim 31, the method for performing multi-signal encoding of claim 43, the method for performing multi-signal decoding of claim 44, the computer program of claim 45, or the encoded signal of claim 46.

The present invention is based on the finding that multi-signal coding efficiency is significantly enhanced by performing the adaptive joint signal processing not on the original signals but on pre-processed audio signals, wherein the pre-processing is performed such that a pre-processed audio signal is whitened with respect to the signal prior to the pre-processing. With respect to the decoder side, this means that post-processing is performed after the joint signal processing that yields the at least three processed decoded signals. The at least three processed decoded signals are post-processed in accordance with side information comprised in the encoded signal, wherein the post-processing is performed such that the post-processed signals are less whitened than the signals before the post-processing. The post-processed signals finally represent, directly or after further signal processing operations, the decoded audio signals, i.e., the decoded multi-signal.

Especially for immersive 3D audio formats, an efficient multi-channel coding that exploits the properties of the multiple signals is obtained to reduce the amount of transmitted data while preserving the overall perceptual audio quality. In a preferred embodiment, signal-adaptive joint coding within a multichannel system is performed using perceptually whitened and additionally inter-channel level difference (ILD) compensated spectra. The joint encoding is preferably performed using a simple band-wise M/S transform decision, driven by the estimated number of bits needed by the entropy encoder.

A multi-signal encoder for encoding at least three audio signals comprises a signal pre-processor for individually pre-processing each audio signal to obtain at least three pre-processed audio signals, wherein the pre-processing is performed such that the pre-processed audio signals are whitened with respect to the signals prior to the pre-processing. Adaptive joint signal processing of the at least three pre-processed audio signals is performed to obtain at least three jointly processed signals, i.e., the processing operates on whitened signals. The pre-processing extracts certain signal characteristics (e.g., the spectral envelope) which, if they remained in the signals, would reduce the efficiency of the joint signal processing (e.g., joint stereo or joint multi-channel processing). Furthermore, to improve the joint signal processing efficiency, a wideband energy normalization is performed on the at least three pre-processed audio signals such that each pre-processed audio signal has a normalized energy. The wideband energy normalization is signaled in the encoded audio signal as side information, so that the wideband energy normalization can be reversed at the decoder side after the inverse joint stereo or joint multi-channel signal processing. By means of this preferred additional wideband energy normalization, the adaptive joint signal processing efficiency is improved, such that the number of frequency bands, or even complete frames, that can be subjected to mid/side processing rather than left/right (dual mono) processing is substantially increased. The efficiency of the overall stereo encoding process increases with the number of frequency bands, or even complete frames, that are subjected to genuine joint stereo or multi-channel processing such as mid/side processing.

From a stereo processing perspective, the lowest efficiency is obtained for a band or for a frame when the adaptive joint signal processor has to adaptively decide that the band or frame is to be processed by "dual mono" or left/right processing. Here, the left and right channels are processed as they are, but naturally in a whitened and energy normalized domain. However, when the adaptive joint signal processor adaptively determines to perform mid/side processing for a certain frequency band or frame, a mid signal is calculated by adding a first channel and a second channel, and a side signal is calculated by calculating a difference of the first channel and the second channel in a channel pair. Typically, the mid signal is comparable to one of the first and second channels with respect to its range of values, but the side signal will typically be a less energetic signal, which can be efficiently encoded, or even in the most preferred case, the side signal is zero or close to zero, so that the spectral region of the side signal can even be quantized to zero and thus entropy encoded in an efficient manner. The entropy encoding is performed by a signal encoder for encoding each signal to obtain one or more encoded signals, and an output interface of the multi-signal encoder transmits or stores an encoded multi-signal audio signal comprising the one or more encoded signals, side information related to the pre-processing, and side information related to the adaptive joint signal processing.
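
For illustration only, the following Python sketch shows such a mid/side computation on two whitened, energy-normalized spectra; the 1/sqrt(2) normalization and all names are assumptions made for this sketch, not the codec's actual routine:

```python
import numpy as np

def ms_transform(left: np.ndarray, right: np.ndarray):
    """Hypothetical mid/side transform of two whitened, energy-normalized
    spectra; the 1/sqrt(2) factor keeps the M/S energies comparable to L/R."""
    mid = (left + right) / np.sqrt(2.0)
    side = (left - right) / np.sqrt(2.0)
    return mid, side

# For strongly correlated channels the side signal is close to zero, so its
# spectral values quantize to zero and entropy-code very cheaply.
rng = np.random.default_rng(0)
l = rng.standard_normal(64)
r = l + 0.01 * rng.standard_normal(64)   # nearly identical second channel
m, s = ms_transform(l, r)
print(np.sum(s**2) / np.sum(m**2))       # tiny side-to-mid energy ratio
```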

On the decoder side, the signal decoder, which usually comprises an entropy decoder, usually relies on the preferably included bit allocation information to decode the at least three encoded signals. This bit allocation information is included as side information in the encoded multi-signal audio signal and may for example be derived at the encoder side by looking at the energy of the signal at the input of the signal (entropy) encoder. The outputs of the signal decoders within the multi-signal decoder are input into a joint signal processor for performing joint signal processing on the basis of side information included in the encoded signal to obtain at least three processed decoded signals. The joint signal processor preferably undoes the joint signal processing performed at the encoder side and typically performs inverse stereo or inverse multi-channel processing. In a preferred embodiment, the joint signal processor applies a processing operation to calculate the left/right signal from the mid/side signal. However, when the joint signal processor determines from the side information that dual mono processing already exists for a certain channel pair, this will be noted and used in the decoder for further processing.

Like the adaptive joint signal processor on the encoder side, the joint signal processor on the decoder side may operate on a cascaded channel pair tree or in a reduced tree mode. The reduced tree also represents a kind of cascaded processing, but differs from a cascaded channel pair tree in that the output of a processed pair cannot be used as the input of another pair to be processed.

It may be the case that the first channel pair used by the joint signal processor at the multi-signal decoder side to start the joint signal processing, which is the last channel pair processed at the encoder side, has side information for a certain frequency band indicating dual mono, but these dual mono signals may later be used in the channel pair processing as mid or side signals. This is signaled by the corresponding side information related to the pair-wise processing performed to obtain the at least three separately encoded channels to be decoded at the decoder side.

Embodiments relate to MDCT-based multi-signal encoding and decoding systems with signal-adaptive joint channel processing, wherein a signal may be a channel and the multi-signal is a multi-channel signal, or, alternatively, an audio signal is a component of a sound field description, such as an Ambisonics component, i.e., W, X, Y, Z in first-order Ambisonics, or any other component of a higher-order Ambisonics description. A signal may also be a signal of an A-format or B-format or any other format description of the sound field.

Further advantages of the preferred embodiments are indicated subsequently. The codec uses new concepts to merge the flexibility of signal-adaptive joint coding of arbitrary channels described in [6] with the concepts described in [7] for joint stereo coding. These new concepts are:

a) perceptually whitened signals are used for further encoding (similar to the way they are used in speech coders). This has several advantages:

simplified codec architecture

Compact representation of noise shaping characteristics/masking thresholds (e.g. as LPC coefficients)

Unified transform and speech codec architecture and thus combined audio/speech coding

b) Using ILD parameters of arbitrary channels to efficiently encode panned sources

c) Flexible bit allocation between the processed channels based on their energies.

Furthermore, the codec uses Frequency Domain Noise Shaping (FDNS) to perceptually whiten the signal, with the rate loop as described in [8] in combination with the spectral envelope warping as described in [9]. The codec further normalizes the FDNS-whitened spectra towards an average energy level using ILD parameters. Channel pairs for joint coding are selected in an adaptive manner as described in [6], and the stereo coding consists of a band-wise M/S versus L/R decision. The band-wise M/S decision is based on the bit rate estimated for encoding each band in the L/R mode and in the M/S mode, as described in [7]. The bit rate distribution among the band-wise M/S processed channels is based on the energy.
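
As a rough illustration of the band-wise M/S versus L/R decision, consider the following sketch; the bit estimate is a crude stand-in (an assumption), not the entropy-coder estimate of [7]:

```python
import numpy as np

def estimated_bits(band: np.ndarray) -> float:
    # Crude placeholder for an entropy-coder bit estimate (assumption):
    # larger spectral values -> more non-zero quantized lines -> more bits.
    return float(np.sum(np.log2(1.0 + np.abs(band))))

def bandwise_ms_decision(left_bands, right_bands):
    """Return a per-band mask: True where M/S is estimated cheaper than L/R."""
    mask = []
    for l, r in zip(left_bands, right_bands):
        m = (l + r) / np.sqrt(2.0)
        s = (l - r) / np.sqrt(2.0)
        mask.append(estimated_bits(m) + estimated_bits(s)
                    < estimated_bits(l) + estimated_bits(r))
    return mask
```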

Drawings

Preferred embodiments of the present invention will be described subsequently with reference to the accompanying drawings, in which:

FIG. 1 shows a block diagram of single channel preprocessing in a preferred embodiment;

FIG. 2 shows a preferred embodiment of a block diagram of a multi-signal encoder;

FIG. 3 illustrates a preferred embodiment of the cross-correlation vector and channel pair selection process of FIG. 2;

FIG. 4 illustrates an indexing scheme for channel pairs in a preferred embodiment;

FIG. 5a shows a preferred embodiment of a multi-signal encoder according to the present invention;

FIG. 5b shows a schematic representation of an encoded multi-channel audio signal frame;

FIG. 6 shows a process performed by the adaptive joint signal processor of FIG. 5 a;

FIG. 7 illustrates a preferred embodiment of the processing performed by the adaptive joint signal processor of FIG. 5a;

FIG. 8 shows another preferred embodiment of the processing performed by the adaptive joint signal processor of FIG. 5a;

FIG. 9 illustrates another process for performing the bit allocation to be used by the quantization and encoding processor of FIG. 5a;

FIG. 10 shows a block diagram of a preferred embodiment of a multi-signal decoder;

FIG. 11 shows a preferred embodiment performed by the joint signal processor of FIG. 10;

FIG. 12 shows a preferred embodiment of the signal decoder of FIG. 10;

fig. 13 illustrates another preferred embodiment of the joint signal processor in the context of bandwidth extension or Intelligent Gap Filling (IGF);

FIG. 14 shows another preferred embodiment of the joint signal processor of FIG. 10;

FIG. 15a shows preferred processing blocks performed by the signal decoder and joint signal processor of FIG. 10; and

fig. 15b illustrates an embodiment of a post processor for performing de-whitening operations and optionally other processes.

Detailed Description

Fig. 5a shows a preferred embodiment of a multi-signal encoder for encoding at least three audio signals. The at least three audio signals are input into the signal pre-processor 100, which individually pre-processes each audio signal to obtain at least three pre-processed audio signals 180, wherein the pre-processing is performed such that the pre-processed audio signals are whitened with respect to the corresponding signals prior to the pre-processing. The at least three pre-processed audio signals 180 are input into an adaptive joint signal processor 200, which is configured to perform processing on the at least three pre-processed audio signals to obtain at least three jointly processed signals or, in an embodiment, an unprocessed signal and at least two jointly processed signals, as will be explained later. The multi-signal encoder comprises a signal encoder 300, which is connected to the output of the adaptive joint signal processor 200 and is configured to encode each signal output by the adaptive joint signal processor 200 to obtain one or more encoded signals. These encoded signals at the output of the signal encoder 300 are forwarded to the output interface 400. The output interface 400 is configured for transmitting or storing the encoded multi-signal audio signal 500, which comprises the one or more encoded signals generated by the signal encoder 300, side information 520 related to the pre-processing performed by the signal pre-processor 100 (i.e., the whitening information), and, additionally, side information 530 related to the processing performed by the adaptive joint signal processor 200 (i.e., the side information related to the adaptive joint signal processing).

In a preferred embodiment, the signal encoder 300 comprises a rate loop processor controlled by bit allocation information 536, which bit allocation information 536 is generated by the adaptive joint signal processor 200 and forwarded not only from block 200 to block 300 but also within the side information 530 to the output interface 400 and thus into the encoded multi-signal audio signal. The encoded multi-signal audio signal 500 is typically generated in a frame-by-frame manner, wherein framing and typically corresponding windowing and time-frequency conversion is performed within the signal pre-processor 100.

An exemplary illustration of a frame of an encoded multi-signal audio signal 500 is shown in fig. 5 b. Fig. 5b shows a bitstream portion 510 for the separately encoded signal generated by block 300. Block 520 is directed to the pre-processed side information generated by block 100 and forwarded to the output interface 400. Furthermore, the joint processing side information 530 is generated by the adaptive joint signal processor 200 of fig. 5a and introduced into the encoded multi-signal audio signal frame shown in fig. 5 b. At the right of the diagram of fig. 5b, the next frame of the encoded multi-signal audio signal will be written into the serial bit stream, while at the left of the diagram of fig. 5b, the earlier frame of the encoded multi-signal audio signal will be written.
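
Conceptually, one frame of fig. 5b can be pictured as the following container; this is a hypothetical sketch with assumed field names and types, whereas the real bitstream is a serialized, bit-exact syntax:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EncodedFrame:
    """Hypothetical view of one encoded multi-signal audio frame."""
    encoded_signals: bytes                 # 510: entropy-coded spectra
    whitening_info: bytes                  # 520: TNS/FDNS/LTP/window side info
    pairwise_info: bytes                   # 532: channel pairs, stereo modes, M/S masks
    energy_scaling: List[int] = field(default_factory=list)  # 534: quantized ILDs + flags
    bit_allocation: List[int] = field(default_factory=list)  # 536: per-signal bit split
```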

As will be explained later, the pre-processing comprises a temporal noise shaping process and/or a frequency domain noise shaping process, a long term prediction (LTP) process, or a windowing operation. The corresponding pre-processing side information 520 may include at least one of temporal noise shaping (TNS) information, frequency domain noise shaping (FDNS) information, long term prediction (LTP) information, or windowing information.

Temporal noise shaping comprises a prediction of the spectrum of a frame over frequency. Spectral values at higher frequencies are predicted from a weighted combination of spectral values at lower frequencies. The TNS side information comprises the weights of this weighted combination, also referred to as LPC coefficients derived by the prediction over frequency. The whitened spectral values are the prediction residual values, i.e., for each spectral value, the difference between the original spectral value and the predicted spectral value. On the decoder side, an inverse prediction, i.e., an LPC synthesis filtering over frequency, is performed in order to undo the TNS processing of the encoder side.
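
A minimal sketch of this prediction over frequency follows; it assumes real-valued spectral lines and solves plain normal equations, whereas the actual TNS filter structure and coefficient coding are not reproduced here:

```python
import numpy as np

def tns_analysis(spectrum: np.ndarray, order: int = 4):
    """Linear prediction *over frequency*: higher-frequency lines are
    predicted from lower-frequency ones; the residual is the whitened
    spectrum and the weights are the TNS side information."""
    n = len(spectrum)
    # Autocorrelation of the spectral lines over the frequency index
    r = np.array([np.dot(spectrum[:n - k], spectrum[k:]) for k in range(order + 1)])
    # Normal (Yule-Walker) equations for the predictor weights
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    residual = spectrum.astype(float).copy()
    for k in range(order, n):
        # Predict line k from the `order` lines below it
        residual[k] -= np.dot(a, spectrum[k - order:k][::-1])
    return residual, a
```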

The FDNS processing comprises weighting the spectral values of a frame with a weighting factor for each spectral value, wherein the weighting values are derived from LPC coefficients calculated from a block/frame of the windowed time domain signal. The FDNS side information comprises a representation of the LPC coefficients derived from the time domain signal.
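
The weighting itself can be sketched as follows, assuming the LPC coefficients come from a standard time-domain LPC analysis of the windowed frame (the coefficient derivation and quantization are omitted):

```python
import numpy as np

def fdns_whiten(spectrum: np.ndarray, lpc_a: np.ndarray) -> np.ndarray:
    """Divide each spectral line by the LPC-derived envelope so that the
    result is approximately spectrally flat.  lpc_a = [1, a1, ..., ap]."""
    n = len(spectrum)
    w = np.pi * (np.arange(n) + 0.5) / n          # bin center frequencies
    # A(e^{jw}) = 1 + a1*e^{-jw} + ... + ap*e^{-jpw}
    A = np.polyval(lpc_a[::-1], np.exp(-1j * w))
    envelope = 1.0 / np.abs(A)                    # LPC spectral envelope
    return spectrum / envelope                    # weighted = whitened lines
```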

Another whitening process that is also useful with the present invention is a spectral equalization using scaling factors, such that the equalized spectrum represents a more whitened version than the unequalized version. The side information consists of the scaling factors used for the weighting, and the reverse process is a de-equalization at the decoder side using the transmitted scaling factors.

Another whitening process involves performing inverse filtering on the spectrum using an inverse filter controlled by LPC coefficients derived from the time-domain frame, as is known in the art of speech coding. The side information is inverse filter information and the inverse filtering is removed in the decoder using the transmitted side information.

Another whitening process involves performing LPC analysis in the time domain and computing temporal residual values, which are then converted to the spectral range. Typically, the spectral values thus obtained are similar to those obtained by FDNS. On the decoder side, the post-processing comprises performing LPC synthesis using the transmitted LPC coefficient representation.

In the preferred embodiment, the joint processing side information 530 comprises pair-wise processing side information 532, energy scaling information 534, and bit allocation information 536. The pair-wise processing side information may comprise at least one of: channel pair side information bits, a full-frame mid/side or dual mono or band-wise mid/side indication, and, in the case of a band-wise mid/side indication, a mid/side mask indicating, for each band in a frame, whether the band is processed by mid/side processing or by L/R processing. The pair-wise processing side information may additionally include intelligent gap filling (IGF) information or other bandwidth extension information such as SBR (spectral band replication) information.

For each whitened (i.e., pre-processed) signal 180, the energy scaling information 534 may include an energy scaling value and a flag indicating whether the scaling is an up-scaling or a down-scaling. In the case of eight channels, for example, block 534 would include eight scaling values (e.g., eight quantized ILD values) and eight flags indicating, for each of the eight channels, whether an up-scaling or a down-scaling was performed in the encoder and, correspondingly, has to be undone in the decoder. Up-scaling in the encoder is necessary when the actual energy of a certain pre-processed channel in a frame is lower than the average energy of all channels in the frame, and down-scaling is necessary when the actual energy of a certain channel in a frame is higher than the average energy of all channels in the frame. The joint processing side information may further comprise bit allocation information for each of the jointly processed signals, or for the unprocessed signal (if present) and each of the jointly processed signals, and this bit allocation information is used by the signal encoder 300 (as shown in fig. 5a) and, correspondingly, by the signal decoder shown in fig. 10, which receives the bitstream information from the encoded signal via the input interface.

Fig. 6 shows a preferred embodiment of the wideband energy normalization performed by the adaptive joint signal processor. The adaptive joint signal processor 200 is configured to perform a wideband energy normalization on the at least three pre-processed audio signals such that each pre-processed audio signal has a normalized energy. The output interface 400 is configured to include, as further side information, a wideband energy normalization value for each pre-processed audio signal, where this value corresponds to the energy scaling information 534 of fig. 5b. In step 211, the wideband energy of each channel is calculated. The input into block 211 consists of the pre-processed (whitened) channels, and the result is a wideband energy value for each of the C_total channels. In block 212, the average wideband energy is calculated, typically by adding the individual energy values and dividing the sum by the number of channels. However, other averaging procedures may be used as well, such as a geometric mean.

In step 213, each channel is normalized. To this end, a scaling factor or scaling value and the up-scaling or down-scaling information are determined. Accordingly, block 213 outputs the scaling flag for each channel, indicated at 534a. In block 214, the actual quantization of the scaling determined in block 213 is performed, and the quantized scaling for each channel is output at 534b. The quantized scaling is also referred to as the inter-channel level difference ILDk, i.e., the level of a certain channel k relative to a reference channel having the average energy. In block 215, the spectrum of each channel is scaled using the quantized scaling. The scaling operation in block 215 is controlled by the output of block 213, i.e., by the information on whether an up-scaling or a down-scaling is to be performed. The output of block 215 represents the scaled spectrum of each channel.
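
Blocks 211 to 215 can be summarized in the following sketch; the quantizer and its resolution are assumptions made only for illustration:

```python
import numpy as np

def normalize_channels(spectra, max_ild=15):
    """Sketch of blocks 211-215 (names and quantizer are illustrative).
    Returns scaled spectra plus per-channel up/down flags (534a) and
    quantized ILD values (534b)."""
    energies = np.array([float(np.sum(s ** 2)) for s in spectra])  # 211
    avg = energies.mean()                                          # 212
    scaled, flags, ilds = [], [], []
    for s, e in zip(spectra, energies):
        ratio = np.sqrt(avg / (e + 1e-12))    # 213: linear scaling factor
        up = ratio >= 1.0                     # flag: up-scale in the encoder
        # 214: quantize the >=1 ratio, so up- and down-scaling share one
        # quantization range (cf. claim 5)
        q = max(1, min(int(round(ratio if up else 1.0 / ratio)), max_ild))
        flags.append(up)
        ilds.append(q)
        scaled.append(s * (q if up else 1.0 / q))                  # 215
    return scaled, flags, ilds
```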

Fig. 7 shows a preferred embodiment of the adaptive joint signal processor 200 with respect to cascaded pair processing. As indicated by block 221, the adaptive joint signal processor 200 is configured to calculate a cross-correlation value for each possible channel pair. Block 229 selects the pair with the highest cross-correlation value, and in block 232a, a joint stereo processing mode is determined for this pair. The joint stereo processing mode may be mid/side coding for the complete frame, band-wise mid/side coding, i.e., a decision for each of a plurality of frequency bands whether the band is to be processed in the mid/side mode or in the L/R mode, or full-band dual mono processing for the particular pair in the actual frame. In block 232b, the joint stereo processing of the selected pair is actually performed using the mode determined in block 232a.
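
A minimal sketch of blocks 221 and 229 follows (names are assumptions):

```python
import numpy as np
from itertools import combinations

def select_channel_pair(spectra):
    """Compute a normalized cross-correlation for every possible channel
    pair (block 221) and pick the most similar pair (block 229)."""
    best, best_corr = None, -1.0
    for i, j in combinations(range(len(spectra)), 2):
        a, b = spectra[i], spectra[j]
        corr = abs(float(np.dot(a, b))) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if corr > best_corr:
            best, best_corr = (i, j), corr
    return best, best_corr
```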

In blocks 235, 238, the cascaded processing, using a full tree or a simplified tree, or the non-cascaded processing continues until a termination criterion is met. When the termination criterion is met, the pair indications output by block 229 and the stereo mode information output by block 232a are written into the bitstream as the pair-wise processing side information 532, as explained with respect to fig. 5b.

Fig. 8 shows a preferred embodiment of the adaptive joint signal processor for preparing the signal encoding performed by the signal encoder 300 of fig. 5a. To this end, in block 282, the adaptive joint signal processor 200 calculates the signal energy of each stereo-processed signal. Block 282 receives the joint stereo processed signals as input, and in case a channel has not been subjected to stereo processing, because it was not found to have a sufficient cross-correlation with any other channel to form a useful channel pair, this channel is input into block 282 with its energy normalization reverted or modified. This is indicated as the "energy reverted signal", although the energy normalization performed in block 215 of fig. 6 does not necessarily have to be fully reverted. There are several alternatives for treating a channel that has not been selected for channel pair processing with another channel. One procedure is to fully revert the scaling originally performed in block 215 of fig. 6. Another procedure is to revert the scaling only partially, or to weight the scaled channel in some other appropriate way.

In block 284, the total energy of all signals output by the adaptive joint signal processor 200 is calculated. In block 286, bit allocation information is calculated for each stereo-processed signal based on the signal energy of each signal or, where applicable, of each energy-reverted or energy-weighted signal, and based on the total energy output by block 284. On the one hand, the side information 536 generated by block 286 is forwarded to the signal encoder 300 of fig. 5a; additionally, it is forwarded to the output interface 400 via the logical connection 530, so that the bit allocation information is included in the encoded multi-signal audio signal 500 of fig. 5a or 5b.

In the preferred embodiment, the actual bit allocation is performed based on the process shown in fig. 9. In a first procedure, a minimum number of bits is assigned to each non-LFE (low frequency enhancement) channel and, if present, to each low frequency enhancement channel. The signal encoder 300 requires this minimum number of bits irrespective of the signal content. The remaining bits are then assigned in block 291 according to the bit allocation information 536 generated by block 286 of fig. 8. This assignment is done based on the quantized energy ratios and preferably uses the quantized energy ratios rather than the non-quantized energies.

In step 292, a refinement is performed. When the quantization is such that the assignment of the remaining bits results in a number higher than the number of available bits, bits have to be subtracted from those assigned in block 291. Conversely, when the quantization of the energy ratios leaves bits still to be assigned after block 291, these bits can additionally be allocated in the refinement step 292. If, after the refinement step, there are still bits left for the signal encoder to use, a final donation step 293 is performed, in which the remaining bits are donated to the channel with the largest energy. At the output of step 293, the assigned bit budget for each signal is available.

In step 300, quantization and entropy coding are performed on each signal using the assigned bit budget generated by the process of steps 290, 291, 292, 293. Basically, the bit allocation is performed in such a way that higher-energy channels/signals are quantized more accurately than lower-energy channels/signals. Importantly, it is neither the original signal nor the whitened signal that is used for the bit allocation, but the signal at the output of the adaptive joint signal processor 200, whose energy differs from the energy of the signal input into the adaptive joint signal processor due to the joint channel processing. In this context, it is also noted that although channel pair processing is the preferred implementation, other channel groups may be selected by means of the cross-correlation and processed accordingly. For example, groups of three or even four channels may be formed by the adaptive joint signal processor and processed in a cascaded full-tree process, a cascaded process with a simplified tree, or a non-cascaded process.

The bit allocation shown in blocks 290, 291, 292, 293 is performed in the same way on the decoder side by the signal decoder 700 of fig. 10, using the allocation information 536 extracted from the encoded multi-signal audio signal 500.

PREFERRED EMBODIMENTS

In this embodiment, the codec uses new concepts to fuse the flexibility of the signal-adaptive joint coding of arbitrary channels described in [6] with the joint stereo coding concepts described in [7]. These new concepts are:

a) Perceptually whitened signals are used for further encoding (similar to the way they are used in speech coders). This has several advantages:

simplified codec architecture

Compact representation of noise shaping characteristics/masking thresholds (e.g. as LPC coefficients)

Unified transform and speech codec architecture and thus combined audio/speech coding

b) Using ILD parameters of arbitrary channels to efficiently encode panned sources

c) Flexible bit allocation between the processed channels based on their energies.

The codec uses Frequency Domain Noise Shaping (FDNS) with a rate loop as described in [8], in combination with the spectral envelope warping described in [9], to perceptually whiten the signal. The codec further normalizes the FDNS-whitened spectra towards the average energy level using ILD parameters. Channel pairs for joint coding are selected in an adaptive manner as described in [6], where the stereo coding consists of band-wise M/S versus L/R decisions. The band-wise M/S decision is based on the bit rate estimated in each band for encoding in L/R mode and in M/S mode, as described in [7]. The bit rate allocation among the band-wise M/S processed channels is based on the energies.

Embodiments relate to MDCT-based multi-signal encoding and decoding systems with signal-adaptive joint channel processing, wherein a signal may be a channel and the multi-signal is a multi-channel signal. Alternatively, the audio signals may be components of a sound field description, such as Ambisonics components, i.e., W, X, Y, Z of a first-order Ambisonics representation, or any other components of a higher-order Ambisonics description. The signals may also be signals of an A-format or B-format or any other format describing the sound field. Thus, everything disclosed for "channels" is equally valid for "components" of a multi-signal audio signal or for other "signals".

Encoder single-channel processing for obtaining the whitened spectrum

Each individual channel is analyzed and transformed into a whitened MDCT domain spectrum according to the processing steps shown in the block diagram of fig. 1.

The processing blocks of the time-domain transient detector, windowing, MDCT, MDST and OLA are described in [8]. MDCT and MDST form a Modulated Complex Lapped Transform (MCLT); performing the MDCT and the MDST separately is equivalent to performing the MCLT; "MCLT to MDCT" means that only the MDCT portion of the MCLT is taken and the MDST is discarded.
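
As an illustration of this relationship, the following minimal NumPy sketch (an illustration only, not the codec's implementation) computes the MDCT and the MDST of a windowed frame separately and combines them into the complex MCLT spectrum; keeping only the real part yields the "MCLT to MDCT" case:

```python
import numpy as np

def mclt(frame):
    """Direct-form MCLT of a 2N-sample windowed frame -> N complex bins.
    The real part is the MDCT, the imaginary part the MDST."""
    two_n = len(frame)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half).reshape(-1, 1)
    phase = np.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
    mdct = (frame * np.cos(phase)).sum(axis=1)
    mdst = (frame * np.sin(phase)).sum(axis=1)
    return mdct + 1j * mdst

# Example: the power spectrum (squared MCLT magnitude) of a windowed tone,
# usable, e.g., for tonality/noise measures.
frame = np.sin(2 * np.pi * 0.06 * np.arange(512)) * np.hanning(512)
power_spectrum = np.abs(mclt(frame)) ** 2
```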

Temporal Noise Shaping (TNS) is done similarly to the description in [8], and the order of TNS and Frequency Domain Noise Shaping (FDNS) is adaptive. The presence of two TNS blocks in the figure is to be understood as the possibility of changing the order of FDNS and TNS. The decision on the order of TNS and FDNS can be, for example, the decision described in [9].

Frequency Domain Noise Shaping (FDNS) and the calculation of the FDNS parameters are similar to the procedure described in [9]. One difference is that the FDNS parameters of frames where TNS is not active are calculated from the MCLT spectrum. In frames where TNS is active, the MDST spectrum is estimated from the MDCT spectrum.

Fig. 1 shows a preferred embodiment of a signal preprocessor 100 that performs whitening on at least three audio signals to obtain the individually preprocessed, whitened signals 180. The signal preprocessor 100 comprises an input for the time-domain input signal of a channel k. The signal is input into a windower 102, a transient detector 104 and an LTP parameter calculator 106. The transient detector 104 detects whether the current portion of the input signal is transient and, if so, controls the windower 102 to set a smaller window length. A window indication, i.e., which window length has been selected, is also included in the side information, in particular in the preprocessing side information 520 of fig. 5b. In addition, the LTP parameters calculated by block 106 are introduced into the side information; these LTP parameters may be used, for example, to perform some kind of post-processing of the decoded signal or other procedures known in the art. The windower 102 generates windowed time-domain frames, which are introduced into the time-to-spectrum converter 108. The time-to-spectrum converter 108 preferably performs a complex lapped transform. From the complex lapped transform, the real part may be taken to obtain the result of the MDCT transform, as indicated by block 112. The result of block 112 (i.e., the MDCT spectrum) is input into the TNS block 114a and the subsequently connected FDNS block 116. Alternatively, only FDNS is performed without the TNS block 114a, or vice versa, or the TNS processing is performed after the FDNS processing, as indicated by block 114b. Typically, either block 114a or block 114b is present. At the output of block 114b (when block 114a is not present) or at the output of block 116 (when block 114b is not present), a whitened, individually processed signal, i.e., a preprocessed signal, is obtained for each channel k. The TNS block 114a or 114b and the FDNS block 116 generate preprocessing information and forward it into the side information 520.

In any case, it is not necessary to perform a complex transform in block 108. For some applications it may also be sufficient for the time-to-spectrum converter to perform only an MDCT, and if the imaginary part of the transform is needed, it may optionally be estimated from the real part. A feature of the TNS/FDNS processing is that when TNS is inactive, the FDNS parameters are calculated from the complex spectrum (i.e., from the MCLT spectrum), whereas in frames where TNS is active, the MDST spectrum is estimated from the MDCT spectrum, so that a complete complex spectrum is always available for the frequency-domain noise shaping operation.

Joint channel coding system description

In the described system, after each channel has been transformed into the whitened MDCT domain, the time-varying similarities between arbitrary channels are exploited in a signal-adaptive manner for joint coding, based on the algorithm described in [6]. With this procedure, channel pairs to be jointly encoded using a band-wise M/S transform can be detected and selected.

Fig. 2 gives an overview of the encoding system. For simplicity, the block arrows represent single channel processing (i.e., applying the processing blocks to each channel), and the block "MDCT domain analysis" is shown in detail in fig. 1.

In the following paragraphs, the various steps of the algorithm applied for each frame will be described in detail. Figure 3 gives a data flow diagram of the described algorithm.

It should be noted that in an initial configuration of the system, there is a channel mask indicating which channels the multi-channel joint coding tool is active for. Thus, for inputs where LFE (low frequency effects/enhancement) channels are present, these LFE channels are not taken into account in the processing steps of the tool.

Energy normalization towards average energy for all channels

If ILDs are present, that is, if the channels are panned, an M/S transform is not efficient. To avoid this problem, the magnitudes of the perceptually whitened spectra of all channels are normalized to an average energy level Ē.

- Calculate the energy E_k of each channel k = 0, ..., C_total − 1:

E_k = (1/N) * Σ_{i=0}^{N−1} X_k(i)^2

where N is the total number of spectral coefficients and X_k is the whitened MDCT spectrum of channel k.

- Calculate the mean energy over all channels:

Ē = (1/C_total) * Σ_{k=0}^{C_total−1} E_k

- Normalize the spectrum of each channel to the average energy:

If E_k > Ē (scaling down), the scaling is

a = sqrt(Ē / E_k)

where a is the scaling. The scale is uniformly quantized and sent to the decoder as side information bits:

ILD_k = ⌊a * ILD_RANGE⌋, where ILD_RANGE = 1 << ILD_bits

The quantized scale with which the spectrum is finally scaled is then given by

â_k = ILD_k / ILD_RANGE, X_k(i) ← â_k * X_k(i)

If E_k < Ē (scaling up), the scaling is instead

a = sqrt(E_k / Ē)

and â_k is calculated as in the previous case; the spectrum is then scaled by the inverse, X_k(i) ← X_k(i) / â_k.

To distinguish at the decoder whether a channel was scaled down or up, and to be able to revert the normalization, a 1-bit flag is sent for each channel in addition to ILD_k (0: scaling down / 1: scaling up). ILD_RANGE determines the resolution of the transmitted quantized scaling values ILD_k; its value is known to both the encoder and decoder and does not have to be sent in the encoded audio signal.
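
The following Python sketch summarizes this normalization and its decoder-side inverse. It is a minimal illustration of the steps as reconstructed above, not the codec's implementation; the value of ILD_BITS is an arbitrary assumption:

```python
import numpy as np

ILD_BITS = 5                 # illustrative choice, not from the codec
ILD_RANGE = 1 << ILD_BITS

def normalize_channels(spectra):
    """spectra: list of per-channel whitened MDCT arrays. Returns scaled
    spectra plus (ILD index, 1-bit up/down flag) side info per channel."""
    energies = [np.mean(x ** 2) for x in spectra]
    e_mean = np.mean(energies)
    scaled, side_info = [], []
    for x, e_k in zip(spectra, energies):
        upscale = e_k < e_mean                        # flag: 0 down / 1 up
        a = np.sqrt(e_k / e_mean if upscale else e_mean / e_k)  # a <= 1
        ild = max(1, min(ILD_RANGE - 1, int(a * ILD_RANGE)))  # uniform quant.
        a_q = ild / ILD_RANGE
        scaled.append(x / a_q if upscale else x * a_q)
        side_info.append((ild, int(upscale)))
    return scaled, side_info

def denormalize_channel(x, ild, upscale):
    """Decoder-side inverse using only the transmitted side info."""
    a_q = ild / ILD_RANGE
    return x * a_q if upscale else x / a_q
```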

Computing normalized inter-channel cross-correlation values for all possible channel pairs

In this step, in order to decide which channel pair has the highest degree of similarity and is therefore suited to be selected for joint stereo coding, the normalized inter-channel cross-correlation value is calculated for each possible channel pair. The normalized cross-correlation value of each channel pair is given by the cross-spectrum, as follows:

r_xy = ( Σ_{i=0}^{N−1} X_MDCT(i) * Y_MDCT(i) ) / sqrt( Σ_{i=0}^{N−1} X_MDCT(i)^2 * Σ_{i=0}^{N−1} Y_MDCT(i)^2 )

where N is the total number of spectral coefficients per frame, and X_MDCT and Y_MDCT are the corresponding spectra of the channel pair under consideration.

The normalized cross-correlation values of all channel pairs are stored in a cross-correlation vector

CC = [r_0, r_1, ..., r_{P−1}]

where P = (C_total * (C_total − 1))/2 is the maximum number of possible pairs.

As shown in fig. 1, different frame sizes are possible (e.g., 10 or 20 ms window frame size) depending on the transient detector. The inter-channel cross-correlation is therefore calculated only if the spectral resolution of the two channels is the same. Otherwise, the value is set to 0, thereby ensuring that such channel pairs are not selected for joint encoding.

An indexing scheme is used that uniquely represents each channel pair. An example of such a scheme for indexing six input channels is shown in fig. 4.

The same indexing scheme is maintained throughout the algorithm and is also used to signal the channel pairs to the decoder. The number of bits required to signal a channel pair is ⌈log2(P)⌉.
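
A small sketch of this indexing and of the cross-correlation vector computation is given below. The enumeration order (0,1) → 0, (0,2) → 1, ... is one plausible reading of fig. 4, not a normative layout:

```python
import numpy as np
from math import ceil, log2

def pair_index(ch_a, ch_b, n_channels):
    """Unique index for channel pair (ch_a < ch_b), row-by-row enumeration:
    (0,1),(0,2),...,(0,C-1),(1,2),..."""
    return ch_a * n_channels - ch_a * (ch_a + 1) // 2 + (ch_b - ch_a - 1)

def cross_correlation_vector(spectra):
    c = len(spectra)
    cc = np.zeros(c * (c - 1) // 2)
    for a in range(c):
        for b in range(a + 1, c):
            x, y = spectra[a], spectra[b]
            if len(x) != len(y):       # different frame/spectral resolution
                continue               # leave 0 -> pair never selected
            denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
            if denom > 0:
                cc[pair_index(a, b, c)] = np.sum(x * y) / denom
    return cc

P = 6 * 5 // 2                         # 15 pairs for six channels
bits_per_pair = ceil(log2(P))          # bits to signal one pair index
```

With this enumeration, pair (2, 5) of six channels maps to index 11, matching the example given for fig. 4 further below.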

Channel pair selection and joint coding stereo processing

After the cross-correlation vector has been calculated, the first channel pair considered for joint coding is the channel pair having the highest cross-correlation value, provided it is above a minimum threshold, preferably 0.3.

The selected channel pair is used as input to the stereo encoding process, i.e., the band-wise M/S transform. The decision whether, for each spectral band, the channels are encoded using M/S coding or using discrete L/R coding depends on the estimated bit rate for each case. The coding method that is less demanding in terms of bits is selected. This process is described in detail in [7].

The output of this process is an updated spectrum for each channel of the selected channel pair. Furthermore, side information about the channel pair that needs to be shared with the decoder is created, i.e., which stereo mode is selected (full M/S, dual mono or band-wise M/S) and, if band-wise M/S is the selected mode, a corresponding mask indicating for each band whether M/S (1) or L/R (0) coding is used.

For the next steps, there are two variants of this algorithm:

○ Cascaded channel pair tree

For this variant, the cross-correlation vector is updated for the channel pairs affected by the changed spectra of the selected channel pair (if an M/S transform was used). For example, in the case of 6 channels, if the selected and processed channel pair is the pair with index 0 in fig. 4, i.e., the joint coding of channel 0 and channel 1, then after the stereo processing the cross-correlations of the affected channel pairs (i.e., indices 0, 1, 2, 3, 4, 5, 6, 7, 8) need to be recalculated.

The process then continues as described before: the channel pair with the largest cross-correlation is selected, confirmed to be above the minimum threshold, and the stereo operation is applied. This means that channels that are part of a previous channel pair can be re-selected as input for a new channel pair, which is referred to as "cascading". This can occur when residual correlation remains between the output of a channel pair and another arbitrary channel representing a different direction in the spatial domain. Of course, the same channel pair should not be selected twice.

The iterative selection terminates when the maximum allowed number of channel pairs is reached (P being the absolute maximum), or when, after updating the cross-correlation vector, no channel pair value exceeds the threshold of 0.3 (i.e., no significant correlation remains between any channels); the process then continues with the next step.

○ Simplified tree

The cascaded channel pair tree process is theoretically optimal, since it attempts to remove the correlation between all arbitrary channels and provides maximum energy compaction. On the other hand, it is rather complex, since the number of selected channel pairs can be larger than ⌊C_total/2⌋, resulting in additional computational complexity (from the M/S decision process of the stereo operation) and also in additional metadata that must be sent to the receiver for each channel pair.

For the simplified tree variant, "cascading" is not allowed. This is ensured by the following change to the above process: when updating the cross-correlation vector, the values of the channel pairs affected by a previous stereo operation are not recalculated but set to 0. It is therefore not possible to select a channel pair of which one channel is already part of an existing channel pair.

These two variants describe the "adaptive joint channel processing" block in fig. 2.

This results in a complexity similar to a system with predefined channel pairs (e.g., L and R, rear L and rear R), because the maximum number of channel pairs that can be selected is ⌊C_total/2⌋.

It should be noted that there may be situations in which the stereo operation of a selected channel pair does not change the spectra of the channels. This occurs when the M/S decision algorithm decides that the coding mode should be "dual mono". In this case, the channels involved are no longer considered a channel pair, as they are encoded separately, and updating the cross-correlation vector would have no effect. To continue the process, the channel pair with the next highest cross-correlation value is considered, and the steps continue as described above.
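
The following sketch outlines the iterative pair selection, covering both variants. The stereo_process placeholder stands in for the band-wise M/S decision and transform of [7] (here simplified to an unconditional full M/S butterfly), and cross_correlation_vector is the function from the earlier sketch; both are illustrative assumptions rather than the actual implementation:

```python
import numpy as np

MIN_XCORR = 0.3  # minimum cross-correlation threshold from the text

def stereo_process(x, y):
    """Placeholder for the band-wise M/S stage of [7]: here always a full
    M/S butterfly, in place, for illustration only."""
    m, s = (x + y) / np.sqrt(2.0), (x - y) / np.sqrt(2.0)
    x[:], y[:] = m, s
    return "full_ms"

def select_channel_pairs(spectra, cascaded=True):
    c = len(spectra)
    pairs = [(a, b) for a in range(c) for b in range(a + 1, c)]
    cc = cross_correlation_vector(spectra)   # from the earlier sketch
    selected, tried = [], set()
    max_pairs = len(pairs) if cascaded else c // 2
    while len(selected) < max_pairs:
        cands = [p for p in range(len(pairs))
                 if p not in tried and cc[p] > MIN_XCORR]
        if not cands:
            break                            # no correlated pair left
        best = max(cands, key=lambda p: cc[p])
        tried.add(best)                      # never select the same pair twice
        a, b = pairs[best]
        mode = stereo_process(spectra[a], spectra[b])
        if mode == "dual_mono":
            continue                         # spectra unchanged: next pair
        selected.append((a, b, mode))
        fresh = cross_correlation_vector(spectra) if cascaded else None
        for p, (i, j) in enumerate(pairs):   # update affected pairs only
            if p not in tried and {i, j} & {a, b}:
                cc[p] = fresh[p] if cascaded else 0.0  # simplified tree: 0
    return selected
```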

Retaining the channel pair selection (stereo tree) of the previous frame

In many cases, the normalized cross-correlation values of different channel pairs are close to each other from frame to frame, so the selection could frequently switch between them. This could result in frequent switching of the channel pair tree and possibly in audible instabilities of the output. Therefore, the selection uses a stabilization mechanism, in which a new set of channel pairs is selected only when a significant change in the signal occurs and the similarities between the channels change. To detect this, the cross-correlation vectors of the current and the previous frame are compared, and only when their difference exceeds a certain threshold is the selection of new channel pairs allowed.

The change of the cross-correlation vector over time is calculated as:

C_diff = Σ_{p=0}^{P−1} | CC_cur[p] − CC_prev[p] |

If C_diff > t, the selection of new channel pairs to be jointly encoded is allowed (as described in the previous step). The threshold is chosen as

t = 0.15 * C_total * (C_total − 1)/2

On the other hand, if the difference is small, the same channel pair tree as in the previous frame is used. For each given channel pair, the band-wise M/S operation is applied as described before. However, if the normalized cross-correlation value of a given channel pair no longer exceeds the threshold of 0.3, the selection of new channel pairs to create a new tree is initiated.
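
A compact sketch of this stabilization test (illustrative only):

```python
import numpy as np

def allow_new_tree(cc_cur, cc_prev, c_total):
    """True if the channel pair tree may be rebuilt for the current frame."""
    t = 0.15 * c_total * (c_total - 1) / 2     # threshold from the text
    c_diff = np.sum(np.abs(cc_cur - cc_prev))  # change of the CC vector
    return c_diff > t
```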

Energy recovery of individual channels

After the iterative channel pair selection process terminates, there may be channels that are not part of any channel pair and are therefore coded separately. For those channels, the initial normalization towards the average energy level is reverted, restoring the channel to its original energy level. The energy of these channels is restored using the inverse of the quantized scaling â_k, depending on the flag signaling scaling up or scaling down.

IGF for multichannel processing

With regard to the IGF analysis, in the case of stereo channel pairs, additional joint stereo processing is applied (described in detail in [10]). This is necessary because, for a certain destination range in the IGF spectrum, the signal can be a highly correlated panned sound source. If the source regions chosen for this particular region are not well correlated, then, although the energies are matched for the destination regions, the spatial image can suffer from the uncorrelated source regions.

Therefore, stereo IGF is applied to a channel pair if the stereo mode of the core region differs from the stereo mode of the IGF region, or if the stereo mode of the core is band-wise M/S. If these conditions do not apply, a single-channel IGF analysis is performed. Individual channels that are not jointly coded in a channel pair also undergo a single-channel IGF analysis.
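
Expressed as a condition (a sketch; the mode names are illustrative placeholders, not identifiers from the codec):

```python
def use_stereo_igf(core_stereo_mode, igf_stereo_mode):
    """Stereo IGF is applied to a channel pair if the core and IGF regions
    use different stereo modes, or if the core uses band-wise M/S."""
    return (core_stereo_mode != igf_stereo_mode
            or core_stereo_mode == "bandwise_ms")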

Allocation of available bits for encoding the spectrum of each channel

After the joint channel pair stereo processing, each channel is quantized and coded separately by an entropy coder. Therefore, the number of available bits must be given for each channel. In this step, the total available bits are allocated to the channels using the energies of the processed channels.

Due to the joint processing, the spectrum of each channel may have changed, so the energy of each channel is recalculated (the calculation is described in the normalization step above). The new energies are denoted Ẽ_k. As a first step, the energy-based ratio that will be used to allocate the bits is calculated for each channel as:

rt_k = Ẽ_k / Σ_{j=0}^{C_total−1} Ẽ_j

It should be noted that if an LFE channel is among the input components, it is not taken into account in the ratio calculation. For an LFE channel, a minimum number of bits bits_LFE is assigned only if the channel has non-zero content. The ratio is uniformly quantized:

r̂t_k = ⌊rt_k * rt_RANGE⌋, where rt_RANGE = 1 << rt_bits

The quantized ratios r̂t_k are stored in the bitstream, so that the decoder can assign the same number of bits to each channel for reading the transmitted channel spectral coefficients.

The bit allocation scheme is described below:

○ Assign each channel the minimum number of bits bits_min required by the entropy coder (and, where applicable, bits_LFE to the LFE channel).

○ The remaining bits, i.e., bits_rem = bits_total − Σ_k bits_min, are split among the channels using the quantized ratios:

bits_k = bits_min + r̂t_k * bits_rem / rt_RANGE

○ Because of the quantized ratios, the bits are only roughly allocated, so the sum bits_split = Σ_k bits_k may differ from bits_total. Thus, in a second refinement step, the difference bits_diff = bits_split − bits_total is proportionally subtracted from the channel bits bits_k.

○ After the refinement step, if bits_split still does not match bits_total, the difference (typically a small number of bits) is donated to the channel with the greatest energy.

The decoder follows exactly the same procedure to determine the number of bits to read for decoding the spectral coefficients of each channel. rt_RANGE determines the resolution of the bit allocation information r̂t_k; its value is known to both the encoder and decoder and does not have to be sent in the encoded audio signal.
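
The sketch below illustrates the complete allocation; encoder and decoder run the same steps on the quantized ratios. It is a minimal illustration of the reconstruction above; RT_BITS and the exact refinement arithmetic are assumptions, not the codec's implementation:

```python
RT_BITS = 7                 # illustrative choice, not from the codec
RT_RANGE = 1 << RT_BITS

def allocate_bits(energies, bits_total, bits_min):
    """Split bits_total over the channels proportionally to their energies."""
    n = len(energies)
    e_sum = sum(energies)
    rt_q = [int(e / e_sum * RT_RANGE) for e in energies]  # quantized ratios

    # 1) minimum bits per channel, 2) split the remainder by quantized ratio
    bits_rem = bits_total - n * bits_min
    bits = [bits_min + rt * bits_rem // RT_RANGE for rt in rt_q]

    # 3) refinement: remove/add the rounding difference proportionally
    diff = sum(bits) - bits_total
    total = sum(bits)
    for k in range(n):
        bits[k] -= diff * bits[k] // total

    # 4) donate (or take) the final few bits at the highest-energy channel
    bits[max(range(n), key=lambda k: energies[k])] += bits_total - sum(bits)
    return bits, rt_q

bits, rt_q = allocate_bits([4.0, 1.0, 0.5], bits_total=1000, bits_min=50)
assert sum(bits) == 1000    # the budget is always met exactly
```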

Quantization and coding per channel

Quantization, noise filling and entropy coding, including the rate loop, are performed as described in [8]. The estimated gain G_est may be used to optimize the rate loop. The power spectrum P (magnitude of the MCLT) is used for the tonality/noise measures in the quantization and the Intelligent Gap Filling (IGF), as described in [8]. Since the whitened and band-wise M/S processed MDCT spectrum is used for the power spectrum, the same FDNS and M/S processing must also be applied to the MDST spectrum, as must the same ILD-based normalization scaling as done for the MDCT. For frames in which TNS is active, the MDST spectrum used for the power spectrum calculation is estimated from the whitened and M/S processed MDCT spectrum.

Fig. 2 shows a block diagram of a preferred embodiment of an encoder, and in particular of the adaptive joint signal processor 200. The at least three preprocessed audio signals 180 are all input into an energy normalization block 210, which generates at its output the channel energy ratio side information bits 534, comprising, on the one hand, a quantized scale and, on the other hand, a flag indicating scaling up or scaling down for each channel. However, other procedures without explicit up/down flags may be used as well.

The normalized channels are input into block 220, which performs the cross-correlation vector calculation and the channel pair selection. Based on the procedure in block 220, which is preferably an iterative procedure using a cascaded full-tree or cascaded simplified-tree process, or alternatively a non-iterative, non-cascaded process, the corresponding stereo operation is performed in block 240. Block 240 may perform full-band or band-wise mid/side processing or any other corresponding stereo processing operation, such as a rotation, a scaling, or any weighted or unweighted, linear or non-linear combination.

At the output of block 240, a stereo Intelligent Gap Filling (IGF) processing or any other bandwidth extension processing, such as spectral band replication or harmonic bandwidth extension, may be performed. The processing of the individual channel pairs is signaled via channel pair side information bits, and although not shown in fig. 2, the IGF or general bandwidth extension parameters generated by block 260 are also written into the bitstream, into the joint processing side information 530 and in particular the pairwise processing side information 532 of fig. 5b.

The final stage of fig. 2 is the channel bit allocation processor 280, which calculates the bit allocation ratios, as explained, for example, with respect to fig. 9. Fig. 2 further shows a schematic representation of the signal encoder 300 as quantizer and coder, controlled by the channel bit allocation side information 530, and additionally a schematic representation of the output interface 400 or bitstream writer 400, which combines the results of the signal encoder 300 with all required side information bits 520, 530 of fig. 5b.

Fig. 3 shows a preferred embodiment of the essential procedure performed by blocks 210, 220, 240. After the process starts, the ILD normalization indicated at 210 in fig. 2 or fig. 3 is performed. In step 221, the cross-correlation vector is calculated. The cross-correlation vector consists of the normalized cross-correlation values of each possible pair of the channels output by block 210. For the example of fig. 4 with six channels, 15 different possibilities with indices from 0 to 14 are checked. The first element of the cross-correlation vector holds the cross-correlation value between channel 0 and channel 1, and, for example, the element of the cross-correlation vector with index 11 holds the cross-correlation between channel 2 and channel 5.

In step 222, a calculation is performed to determine whether the tree determined for the previous frame is to be kept. To this end, the change of the cross-correlation vector over time is calculated, preferably as the sum of the magnitudes of the element-wise differences between the cross-correlation vectors of the current and the previous frame. In step 223, it is determined whether this sum of differences is greater than the threshold. If so, the flag keepTree is set to 0 in step 224, meaning that the tree is not kept and a new tree is computed. If the sum is below the threshold, block 225 sets the flag keepTree to 1, so that the tree determined for the previous frame is also applied to the current frame.

In step 226, the iteration termination criterion is checked. If the maximum number of channel pairs (CP) has not yet been reached (which is of course the case when block 226 is reached for the first time), and the flag keepTree is set to 0 as determined at block 228, the process continues with block 229, which selects the channel pair with the greatest cross-correlation in the cross-correlation vector. If, however, the tree of the earlier frame is to be kept, i.e., keepTree equals 1 as checked in block 228, block 230 determines whether the cross-correlation of the "forced" channel pair is greater than the threshold. If it is not, the process continues with step 227, meaning that, although the procedure in block 223 came to the opposite conclusion, a new tree is determined after all. The evaluation in block 230 and the corresponding result in block 227 can thus override the determinations of blocks 223 and 225.

In block 231, it is determined whether the cross-correlation of the channel pair with the greatest cross-correlation is above 0.3. If so, the stereo operation of block 232 is performed, which is also indicated as 240 in fig. 2. When the stereo operation is determined in block 233 to be dual mono, the value keepTree is set to 0 in block 234. When the stereo mode is different from dual mono, however, the cross-correlation vector must be recalculated in block 235, since the mid/side operation has been performed and the output of the stereo operation block 240 (or 232) has changed as a result. The CC vector only has to be updated in block 235 when an actual mid/side stereo operation, or generally a stereo operation different from dual mono, has taken place.

When the check in block 226 or the check in block 231 results in a "no" answer, however, control proceeds to block 236 to check whether single channels exist. If this is the case, i.e., if a channel is found that was not processed together with another channel in the channel pair processing, the ILD normalization of this channel is reverted in block 237. Alternatively, the reversal in block 237 may be only partial, or may be some other weighting.

When the iterations are finished, and also when blocks 236 and 237 are finished, the procedure ends and all channel pairs have been processed. At the output of the adaptive joint signal processor there are then at least three jointly processed signals if block 236 results in a "no" answer, or at least two jointly processed signals and an unprocessed signal corresponding to a "single channel" if block 236 has resulted in a "yes" answer.

Decoding system description

The decoding process starts with the decoding and inverse quantization of the spectra of the jointly coded channels, followed by noise filling, e.g., as described in section 6.2.2 "MDCT based TCX" of [11] or [12]. The number of bits allocated to each channel is determined based on the window length, the stereo mode and the bit allocation ratios r̂t_k coded in the bitstream. The number of bits allocated to each channel must be known before the bitstream can be fully decoded.

In the Intelligent Gap Filling (IGF) block, lines quantized to zero within a certain range of the spectrum, called the target block, are filled with processed content from a different range of the spectrum, called the source block. Due to the band-wise stereo processing, the stereo representation (i.e., L/R or M/S) can differ between the source and the target block. To ensure good quality, if the representation of the source block differs from that of the target block, the source block is processed in the decoder to transform it into the representation of the target block before the gap filling. This procedure is already described in [10]. In contrast to [11] and [12], the IGF itself is applied in the whitened spectral domain instead of the original spectral domain. In contrast to known stereo codecs (e.g., [10]), the IGF is applied in the whitened, ILD-compensated spectral domain.
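
For illustration, converting a source block between the L/R and M/S representations can be done with an orthonormal butterfly. This is one common convention; the codec's exact scaling is not specified here:

```python
import numpy as np

def lr_to_ms(left, right):
    """L/R -> M/S, e.g., for a source block whose target block is M/S coded."""
    return (left + right) / np.sqrt(2.0), (left - right) / np.sqrt(2.0)

def ms_to_lr(mid, side):
    """Inverse M/S -> L/R butterfly."""
    return (mid + side) / np.sqrt(2.0), (mid - side) / np.sqrt(2.0)
```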

From the bitstream signaling it is also known whether there are jointly coded channel pairs. The inverse processing has to start with the last channel pair formed in the encoder, especially for the cascaded channel pair tree, in order to convert each channel back to its original whitened spectrum. For each channel pair, the inverse stereo processing is applied based on the stereo mode and the band-wise M/S decisions.

Based on the scaling information sent from the encoder, i.e., the quantized ILD values and the scaling flags of all channels involved in jointly coded channel pairs, the spectra are denormalized to the original energy level.

Fig. 10 shows a preferred embodiment of a multi-signal decoder for decoding an encoded signal 500. The multi-signal decoder comprises an input interface 600 and a signal decoder 700 for decoding the at least three encoded signals output by the input interface 600. The multi-signal decoder comprises a joint signal processor 800 for performing joint signal processing in accordance with side information included in the encoded signal, to obtain at least three processed decoded signals. The multi-signal decoder further comprises a post-processor 900 for post-processing the at least three processed decoded signals in accordance with side information included in the encoded signal. In particular, the post-processing is performed in such a way that the post-processed signals are less white than the signals before the post-processing. The post-processed signals represent, directly or indirectly, the decoded audio signal 1000.

The side information extracted by the input interface 600 and forwarded to the joint signal processor 800 is the side information 530 shown in fig. 5b, and the side information extracted by the input interface 600 from the encoded multi-signal audio signal and forwarded to the post-processor 900 for performing the de-whitening operation is the side information 520 shown and described with respect to fig. 5b.

The joint signal processor 800 is configured to extract or receive from the input interface 600 an energy normalization value for each jointly stereo decoded signal. The energy normalization value of each jointly stereo decoded signal corresponds to the energy scaling information 534 of fig. 5b. The joint signal processor 800 is configured to pairwise process 820 the decoded signals using the joint stereo modes indicated by the joint stereo side information 532 included in the encoded audio signal 500, to obtain the jointly stereo decoded signals at the output of block 820. In block 830, a rescaling of the jointly stereo decoded signals, in particular an energy rescaling, is performed using the energy normalization values, to obtain the processed decoded signals at the output of block 800 of fig. 10.

To account for channels whose ILD normalization was already reverted in the encoder, as explained with respect to block 237 of fig. 3, the joint signal processor 800 is configured to check whether the energy normalization value extracted from the encoded signal for a particular signal has a predefined value. If this is the case, no energy rescaling, or only a reduced energy rescaling, is performed for that particular signal, or some other weighting operation is applied to this individual channel.

In an embodiment, the signal decoder 700 is configured to receive from the input interface 600 a bit allocation value for each encoded signal, as indicated by block 620. The bit allocation values, shown at 536 in fig. 12, are forwarded to block 720, in which the signal decoder 700 determines the bit allocation to be used. Preferably, the same steps as described for the encoder side with respect to figs. 6 and 9 (i.e., steps 290, 291, 292, 293) are performed by the signal decoder 700 to determine the bit allocation used in block 720 of fig. 12. In block 710/730, the individual decoding is performed to obtain the input to the joint signal processor 800 of fig. 10.

Using some of the side information included in the side information block 532, the joint signal processor 800 provides a spectral band replication, bandwidth extension or Intelligent Gap Filling processing functionality. This side information is forwarded to block 810, and block 820 performs the joint stereo (decoder) processing using the results of the bandwidth extension procedure applied in block 810. In block 810, the Intelligent Gap Filling processing is configured to transform the source range from one stereo representation into another stereo representation when the destination range of the bandwidth extension or IGF processing is indicated as having that other stereo representation. When the destination range is indicated as having a mid/side stereo mode and the source range is indicated as having an L/R stereo mode, the L/R source range is transformed into a mid/side representation, and the IGF processing is then performed on the mid/side representation of the source range.

Fig. 14 shows a preferred embodiment of the joint signal processor 800. The joint signal processor is configured to extract the ordered signal pair information, as shown at block 630. This extraction may be performed by the input interface 600, or the joint signal processor may extract this information from the output of the input interface, or may extract it directly without a specific input interface; the same applies to the other extraction procedures described with respect to the joint signal processor or the signal decoder.

In block 820, the joint signal processor performs the inverse processing, preferably as a cascade starting with the last signal pair, where "last" refers to the order of processing determined and performed by the encoder: in the decoder, the "last" signal pair is the first signal pair to be processed. Block 820 receives side information 532 indicating, for each signal pair indicated by the signal pair information shown in block 630 (and implemented, for example, in the manner explained with respect to fig. 4), whether the particular pair uses a dual mono, full M/S or band-wise M/S procedure with an associated M/S mask.

After the inverse processing in block 820, the denormalization of the signals involved in the channel pairs is performed in block 830, again based on the side information 534 indicating the normalization information for each channel. The denormalization illustrated in block 830 of fig. 14 is preferably a rescaling using the energy normalization values: the rescaling is performed as a scaling-down when the flag 534a has a first value, and as a scaling-up when the flag 534a has a second value different from the first value.

Fig. 15a shows a preferred embodiment as a block diagram of the signal decoder and joint signal processor of fig. 10, and fig. 15b shows a block diagram representation of a preferred embodiment of the post processor 900 of fig. 10.

The signal decoder 700 comprises a decoder and dequantizer stage 710 for the spectra included in the encoded signal 500. The signal decoder 700 further comprises a bit allocator 720, which preferably receives, as side information, the window length, the stereo mode and the bit allocation information of each encoded signal. In a preferred embodiment, the bit allocator 720 performs the bit allocation using, in particular, steps 290, 291, 292, 293, wherein the bit allocation information of each encoded signal is used in step 291, and the information on the window length and the stereo mode is used in block 290 or 291.

In block 730, noise filling, preferably also using noise filling side information, is performed on ranges of the spectrum that are quantized to zero and do not lie within the IGF range. The noise filling is preferably limited to the low-band portion of the signal output by block 710. In block 810, using further side information, the Intelligent Gap Filling or general bandwidth extension processing is performed; importantly, it operates on the whitened spectrum.

In block 820, using the side information, the inverse stereo processor undoes the processing performed in block 240 of fig. 2. The final de-scaling is performed using the transmitted quantized ILD parameters of each channel included in the side information. The output of block 830 is input into block 910 of the post-processor, which performs an inverse TNS processing and/or an inverse frequency-domain noise shaping processing or any other de-whitening operation. The output of block 910 is a de-whitened spectrum, which is converted into the time domain by a frequency-to-time converter 920. The outputs of block 920 for adjacent frames are overlap-added in an overlap-add processor 930 in accordance with the applicable coding or decoding rule, to finally obtain the multitude of decoded audio signals or, generally, the decoded audio signal 1000. This signal 1000 may consist of individual channels, of components of a sound field description such as Ambisonics components, or of any other components of a higher-order Ambisonics description. The signals may also be signals of an A-format or B-format or any other format describing the sound field. All these alternatives are collectively referred to as the decoded audio signal 1000 in fig. 15b.

Other advantages and specific features of the preferred embodiments are subsequently indicated.

The scope of the present invention includes providing a solution for applying the principle of [6] to perceptually whitened and ILD-compensated signals.

FDNS with a rate loop as described in [8], in combination with the spectral envelope warping described in [9], provides a simple yet very efficient way of separating the perceptual shaping of the quantization noise from the rate loop.

Normalizing the FDNS-whitened spectra of all channels to the average energy level provides a simple and efficient way to decide, for each channel pair selected for joint coding, whether M/S processing is advantageous, as described in [7].

For the described system, it is sufficient to encode a single wideband ILD per channel, and bit savings are thus achieved compared to known methods.

Selecting channel pairs with highly cross-correlated signals for joint coding typically results in a full-spectrum M/S transform, which saves additional signaling bits, since the per-band M/S-versus-L/R signaling is mostly replaced by a single bit signaling a full M/S transform.

Flexible and simple bit allocation based on the energy of the processed channels.

Features of the preferred embodiments

As described in the previous paragraphs, in this embodiment the codec uses a new approach that fuses the flexibility of the signal-adaptive joint coding of arbitrary channels described in [6] with the joint stereo coding concepts described in [7]. The novelty of the proposed invention is summarized by the following differences:

regarding global ILD compensation, the joint processing for each channel pair differs from the multi-channel processing described in [6 ]. The global ILD equalizes the levels of the channels before channel pairs are selected and M/S decision and processing are performed and thus enables more efficient stereo coding, especially of panning sources.

Regarding global ILD compensation, the joint processing of each channel pair also differs from the stereo processing described in [7]. In the proposed system, there is no global ILD compensation per channel pair. Instead, in order to be able to use the M/S decision mechanism described in [7] for arbitrary channels, all channels are normalized to a single energy level, namely the average energy level. This normalization takes place before the channel pairs are selected for joint processing.

After the adaptive channel pair selection process, the energy level of any channel that is not part of a channel pair used for joint processing is reverted to its initial energy level.

The bit allocation for the entropy coding is not implemented per channel pair as described in [7]. Instead, the energies of all channels are considered and the bits are allocated as described in the corresponding section of this document.

There is an explicit "low complexity" mode of the adaptive channel pair selection described in [6], in which a channel that becomes part of a channel pair during the iterative channel pair selection process is not allowed to become part of another channel pair later in the selection process.

The use of the signal-adaptive channel pair selection of [6] enhances the advantage of the simple band-wise M/S processing of each channel pair and thus reduces the amount of information that needs to be sent in the bitstream. By selecting highly correlated channels for joint coding, the full-spectrum M/S transform is optimal in most cases, i.e., M/S coding is used for all frequency bands. This can be signaled with a single bit and therefore requires much less signaling information than band-wise M/S decisions. It significantly reduces the total amount of information bits that need to be transmitted for all channel pairs.

Embodiments of the present invention relate to signal-adaptive joint coding for multi-channel systems with perceptually whitened and ILD-compensated spectra, where the joint coding consists of simple per-band M/S transform decisions based on the number of bits estimated for the entropy coder.

Although some aspects have been described in the context of an apparatus, it will be clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of method steps also represent a description of the corresponding block or item or of a feature of the corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device, such as a microprocessor, programmable computer, or electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such an apparatus.

The novel encoded audio signals may be stored on a digital storage medium or may be transmitted over a transmission medium such as a wireless transmission medium or a wired transmission medium (e.g., the internet).

Embodiments of the invention may be implemented in hardware or in software, depending on certain implementation requirements. Implementation may be performed using a digital storage medium (e.g. a floppy disk, a DVD, a blu-ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a flash memory) having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Accordingly, the digital storage medium may be computer-readable.

Some embodiments according to the invention comprise a data carrier with electronically readable control signals capable of cooperating with a programmable computer system so as to perform one of the methods described herein.

Generally, embodiments of the invention can be implemented as a computer program product having a program code operable to perform one of the methods when the computer program product runs on a computer. The program code may be stored, for example, on a machine-readable carrier.

Other embodiments include a computer program stored on a machine-readable carrier for performing one of the methods described herein.

In other words, an embodiment of the inventive method is thus a computer program with a program code for performing one of the methods described herein, when the computer program runs on a computer.

Thus, another embodiment of the inventive method is a data carrier (or digital storage medium or computer readable medium) having a computer program recorded thereon for performing one of the methods described herein. The data carrier, the digital storage medium or the recording medium is typically tangible and/or non-transitory.

A further embodiment of the inventive method is thus a data stream or a signal sequence thus representing a computer program for performing one of the methods described herein. The data stream or signal sequence may for example be arranged to be transmitted via a data communication connection (e.g. via the internet).

Another embodiment comprises a processing apparatus, e.g., a computer or a programmable logic device, configured or adapted to perform one of the methods described herein.

Another embodiment comprises a computer having a computer program installed thereon for performing one of the methods described herein.

Another embodiment according to the present invention comprises an apparatus or system configured to transmit a computer program (e.g., electronically or optically) to a receiver, the computer program being for performing one of the methods described herein. The receiver may be, for example, a computer, a mobile device, a storage device, etc. The apparatus or system may for example comprise a file server for transmitting the computer program to the receiver.

In some embodiments, a programmable logic device (e.g., a field programmable gate array) may be used to perform some or all of the functions of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor to perform one of the methods described herein. In general, the method is preferably performed by any hardware device.

The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

The methods described herein may be performed using a hardware device, or using a computer, or using a combination of a hardware device and a computer.

The above-described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and details described herein will be apparent to other persons skilled in the art. It is therefore intended that the invention be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

References (all incorporated herein by reference in their entirety)

[1] "Information technology - MPEG audio technologies - Part 3: Unified speech and audio coding," ISO/IEC 23003-3, 2012.

[2] "Information technology - MPEG audio technologies - Part 1: MPEG Surround," ISO/IEC 23003-1, 2007.

[3] J. Herre, J. Hilpert, A. Kuntz and J. Plogsties, "MPEG-H 3D Audio - The New Standard for Coding of Immersive Spatial Audio," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 770-779, August 2015.

[4] "Digital Audio Compression (AC-4) Standard," ETSI TS 103 190 V1.1.1, April 2014.

[5] D. Yang, H. Ai, C. Kyriakakis and C. Kuo, "High-fidelity multichannel audio coding with Karhunen-Loeve transform," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, pp. 365-380, July 2003.

[6] F. Schuh, S. Dick, R. Füg, C. R. Helmrich, N. Rettelbach and T. Schwegler, "Efficient Multichannel Audio Transform Coding with Low Delay and Complexity," in AES Convention, Los Angeles, September 20, 2016.

[7] G. Markovic, E. Fotopoulou, M. Multrus, S. Bayer, G. Fuchs, J. Herre, E. Ravelli, M. Schnell, S. Doehla, W. Jaegers, M. Dietz and C. Helmrich, "Apparatus and method for MDCT M/S stereo with global ILD with improved mid/side decision," International Patent WO 2017125544 A1, 27 July 2017.

[8] 3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description.

[9] G. Markovic, F. Guillaume, N. Rettelbach, C. Helmrich and B. Schubert, "Linear prediction based coding scheme using spectral domain noise shaping," EU Patent 2676266 B1, 14 February 2011.

[10] S. Disch, F. Nagel, R. Geiger, B. N. Thoshkahna, K. Schmidt, S. Bayer, C. Neukam, B. Edler and C. Helmrich, "Audio Encoder, Audio Decoder and Related Methods Using Two-Channel Processing Within an Intelligent Gap Filling Framework," International Patent PCT/EP2014/065106, 15 July 2014.

[11] "Codec for Enhanced Voice Services (EVS); Detailed algorithmic description," 3GPP TS 26.445 V12.5.0, December 2015.

[12] "Codec for Enhanced Voice Services (EVS); Detailed algorithmic description," 3GPP TS 26.445 V13.3.0, September 2016.

[13] S. Dick, F. Schuh, N. Rettelbach, T. Schwegler, R. Fueg, J. Hilpert and M. Neusinger, "Apparatus and Method for Encoding or Decoding a Multi-Channel Signal," International Patent PCT/EP2016/054900, 8 March 2016.
