Method and apparatus for controlling enhancement of low bit rate encoded audio

Document No.: 789747 Publication date: 2021-04-09

Abstract: This technology, "Method and apparatus for controlling enhancement of low bit rate encoded audio", was created by A. Biswas, Jia Dai (戴佳), and A. S. Master on 2019-08-29. A method for low bitrate coding of audio data and generating enhancement metadata for controlling audio enhancement of said low bitrate coded audio data at a decoder side is described, said method comprising the steps of: (a) core encoding original audio data at a low bitrate to obtain encoded audio data; (b) generating enhancement metadata to be used for controlling a type and/or amount of audio enhancement at the decoder side after core decoding of the encoded audio data; and (c) outputting the encoded audio data and the enhancement metadata. An encoder configured to perform the method is further described. Furthermore, a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata and a decoder configured to perform the method are described.

1. A method for low bitrate coding of audio data and generating enhancement metadata for controlling audio enhancement of the low bitrate coded audio data at a decoder side, the method comprising the steps of:

(a) core encoding original audio data at a low bitrate to obtain encoded audio data;

(b) generating enhancement metadata to be used for controlling a type and/or amount of audio enhancement at the decoder side after core decoding of the encoded audio data; and

(c) outputting the encoded audio data and the enhancement metadata.

2. The method of claim 1, wherein the generating of the enhancement metadata in step (b) comprises:

(i) core decoding the encoded audio data to obtain core-decoded initial audio data;

(ii) inputting the core-decoded initial audio data to an audio enhancer to process the core-decoded initial audio data based on candidate enhancement metadata for controlling a type and/or amount of audio enhancement to audio data input to the audio enhancer;

(iii) obtaining enhanced audio data as an output from the audio enhancer;

(iv) determining applicability of the candidate enhancement metadata based on the enhanced audio data; and

(v) generating enhancement metadata based on a result of the determining.

3. The method of claim 2, wherein determining the applicability of the candidate enhancement metadata in step (iv) comprises: (vi) presenting the enhanced audio data to a user and receiving a first input from the user in response to the presenting, and wherein the generating of the enhancement metadata in step (v) is based on the first input.

4. The method of claim 3, wherein the first input from the user includes an indication of whether the candidate enhancement metadata is accepted or rejected by the user.

5. The method of claim 4, wherein, in the event that the user rejects the candidate enhancement metadata, a second input is received from the user indicating a modification to the candidate enhancement metadata, and the generating of the enhancement metadata in step (v) is based on the second input.

6. The method of claim 4 or 5, wherein steps (ii) to (v) are repeated in the event that the user rejects the candidate enhancement metadata.

7. The method of any of claims 1-6, wherein the enhancement metadata comprises one or more items of enhancement control data.

8. The method of claim 7, wherein the enhancement control data comprises information regarding one or more types of audio enhancement including one or more of speech enhancement, music enhancement, and applause enhancement.

9. The method of claim 8, wherein the enhancement control data further comprises information regarding respective admissibility of one or more types of the audio enhancement.

10. The method of any of claims 7 to 9, wherein the enhancement control data further comprises information about the amount of audio enhancement.

11. The method of any of claims 7 to 10, wherein the enhancement control data further comprises information on the admissibility of audio enhancement being performed by an automatically updated audio enhancer at the decoder side.

12. The method of any of claims 7 to 11, wherein the processing of the core-decoded initial audio data based on the candidate enhancement metadata in step (ii) is performed by applying one or more predefined audio enhancement modules, and wherein the enhancement control data further comprises information on the admissibility of using one or more different enhancement modules at the decoder side that achieve the same or substantially the same type of enhancement.

13. The method of any of claims 2 to 12, wherein the audio enhancer is a generator.

14. An encoder for generating enhancement metadata for controlling enhancement of low-bitrate encoded audio data, wherein the encoder comprises one or more processors configured to perform the method of any of claims 1 to 13.

15. A method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata, wherein the method comprises the steps of:

(a) receiving audio data encoded at a low bit rate, and enhancement metadata;

(b) core decoding the encoded audio data to obtain core-decoded initial audio data;

(c) inputting the core-decoded initial audio data to an audio enhancer to process the core-decoded initial audio data based on the enhancement metadata;

(d) obtaining enhanced audio data as an output from the audio enhancer; and

(e) outputting the enhanced audio data.

16. The method of claim 15, wherein processing the core-decoded initial audio data based on the enhancement metadata is performed by applying one or more audio enhancement modules in accordance with the enhancement metadata.

17. The method of claim 15 or 16, wherein the audio enhancer is a generator.

18. A decoder for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata, wherein the decoder comprises one or more processors configured to perform the method of any of claims 15 to 17.

Technical Field

The present disclosure relates generally to a method for low bit-rate encoding of audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at a decoder side, and more particularly to generating enhancement metadata to be used for controlling a type and/or amount of audio enhancement at the decoder side after core decoding of the encoded audio data. Furthermore, the present disclosure relates to a corresponding encoder, a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata, and a corresponding decoder.

Although some embodiments will be described herein with particular reference to this disclosure, it will be appreciated that the present disclosure is not limited to this field of use and may be applied in a broader context.

Background

Any discussion of the background art throughout the disclosure should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.

In recent years, it has been observed that in particular deep learning approaches can provide breakthrough audio enhancements.

Audio recording systems are used to encode an audio signal into an encoded signal suitable for transmission or storage, and then receive or retrieve the encoded signal and decode it to obtain a version of the original audio signal for playback. Low bit rate audio coding is a perceptual audio compression technique that allows bandwidth and storage requirements to be reduced. Examples of perceptual audio coding systems include Dolby-AC3, Advanced Audio Coding (AAC), and the Dolby AC-4 audio coding system, which was recently standardized by ETSI and included in ATSC 3.0.

However, low bitrate audio coding introduces unavoidable coding artifacts. Audio encoded at low bit rates may, inter alia, suffer from a loss of detail in the audio signal, and noise introduced by quantization and encoding may degrade the quality of the audio signal. A particular problem in this respect is the so-called pre-echo artifact. Pre-echo artifacts are generated when a transient audio signal is quantized in the frequency domain, which causes quantization noise to spread out in time ahead of the transient. Pre-echo noise can seriously impair the quality of an audio codec, such as, for example, an MPEG AAC codec or any other transform-based (e.g. MDCT-based) audio codec.

Several methods have been developed to date to reduce pre-echo noise and thus improve the quality of low bit rate coded audio. These methods include short block switching and temporal noise shaping (TNS). The latter technique is based on applying a prediction filter in the frequency domain to shape the quantization noise in the time domain so that the noise appears less disturbing to the user.

Lapierre and R. Lefebvre presented a recent method for reducing pre-echo noise in frequency-domain audio codecs at the 2017 International Conference on Acoustics, Speech and Signal Processing (ICASSP). This recently developed method is based on an algorithm that operates at the decoder using data from the received bitstream. In particular, the decoded bitstream is tested on a frame-by-frame basis for the presence of transient signals that may produce pre-echo artifacts. Upon detection of such a signal, the audio signal is divided into a pre-transient portion and a post-transient portion, which are then fed to a noise reduction algorithm together with specific transient characteristics and codec parameters. First, for each frequency band or frequency coefficient, the amount of quantization noise present in the frame is estimated using the scale factors and coefficient amplitudes from the bitstream. This estimate is then used to shape a random noise signal that is added to the post-transient portion in the oversampled DFT domain, which is then transformed to the time domain, multiplied by the pre-transient window and returned to the frequency domain. In this way, spectral subtraction can be applied to the pre-transient portion without adding any artifacts. In order to preserve the total frame energy, and to take into account that signal content will have been dragged from the post-transient portion into the pre-transient portion by the quantization noise, the energy removed from the pre-transient portion is added back to the post-transient portion. After adding the two portions together and transforming to the MDCT domain, the rest of the decoder may use the modified MDCT coefficients instead of the original coefficients. However, the authors have identified a drawback in that, although the algorithm can be used in today's systems, it increases the computational load of the decoder.

Raghuram et al. presented a novel post-processing toolkit for enhancing audio signals encoded at low bit rates in Convention Paper 7221 of the Audio Engineering Society, presented at the 123rd Convention held in New York, NY, USA, on October 5-8, 2007. The paper addresses the problem of noise in low bitrate coded audio and presents an Automatic Noise Removal (ANR) algorithm based on adaptive filtering techniques to remove wideband background noise. In particular, one aspect of the ANR algorithm is that, by detailed harmonic analysis of the signal and by utilizing perceptual modeling and accurate signal analysis and synthesis, the primary signal sounds can be preserved, because the primary signal components are separated from the signal prior to the noise removal step. A second aspect of the ANR algorithm is that it continuously and automatically updates the noise profile/statistics by means of a novel signal activity detection algorithm, thereby fully automating the noise removal process. The noise removal algorithm uses a denoising Kalman filter at its core.

In addition to pre-echo artifacts, the quality of low bit rate coded audio is also affected by quantization noise. To reduce the information capacity requirements, the spectral components of the audio signal are quantized. However, quantization may inject noise into the signal. In general, perceptual audio coding systems involve the use of psychoacoustic models to control the amplitude of quantization noise so that it is masked or rendered inaudible by spectral components in the signal.

The spectral components within a given frequency band are typically quantized to the same quantization resolution, and, according to a psychoacoustic model, a minimum signal-to-noise ratio (SNR) can be determined that corresponds to the coarsest quantization resolution that does not inject an audible level of quantization noise. For wider frequency bands, the information capacity requirement limits the coding system to a relatively coarse quantization resolution. As a result, if the magnitude of smaller-valued spectral components is less than the minimum quantization level, those spectral components are quantized to zero. Even if the quantization noise remains low enough to be inaudible or psychoacoustically masked, the presence of many spectral components quantized to zero (spectral holes) in the encoded signal may degrade the quality of the audio signal. This degradation may be due to nominally inaudible quantization noise, because the actual psychoacoustic masking turns out to be less than what the model used to determine the quantization resolution predicts. Furthermore, many spectral components quantized to zero may audibly reduce the energy or power of the decoded audio signal compared to the original audio signal. For coding systems that use distortion-cancelling filter banks, the ability of the synthesis filter bank to cancel distortion during decoding may be severely compromised if the values of one or more spectral components change significantly during the coding process, which also compromises the quality of the decoded audio signal.

Companding is a new type of coding tool in the Dolby AC-4 coding system that improves perceptual coding of speech and dense transient events (e.g., applause). The benefits of companding include reducing the short-time dynamics of the input signal, thereby reducing the bit rate requirements on the encoder side while ensuring proper time-domain noise shaping on the decoder side.

During the last years, deep learning approaches have become increasingly attractive in various application areas, including speech enhancement. In this context, D. Michelsanti and Z.-H. Tan describe, in the publication "Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification", published at INTERSPEECH 2017, a conditional generative adversarial network (GAN) approach that performs better than the classical short-time spectral amplitude minimum mean square error speech enhancement algorithm and is comparable to deep-neural-network-based approaches for speech enhancement.

However, this excellent performance can also lead to a dilemma: a listener may prefer a deep-learning-enhanced version of the original audio over the original audio itself, which may not be the artistic intent of the content creator. It is therefore desirable to provide the content creator with a measure of control on the encoder side, allowing the creator to choose whether, to what extent, and which type of enhancement may be applied at the decoder side, and in which cases. This allows the content creator to retain ultimate control over the intent and quality of the enhanced audio.

Disclosure of Invention

According to a first aspect of the present disclosure, a method for low bit-rate encoding audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at a decoder side is provided. The method may comprise the steps of: (a) the original audio data is core encoded at a low bit rate to obtain encoded audio data. The method may further comprise the steps of: (b) generating enhancement metadata to be used for controlling a type and/or amount of audio enhancement at the decoder side after core decoding of the encoded audio data. And the method may comprise the steps of: (c) outputting the encoded audio data and the enhancement metadata.

In some embodiments, generating the enhancement metadata in step (b) may include:

(i) core decoding the encoded audio data to obtain core-decoded initial audio data;

(ii) inputting the core-decoded initial audio data to an audio enhancer to process the core-decoded initial audio data based on candidate enhancement metadata for controlling a type and/or amount of audio enhancement to audio data input to the audio enhancer;

(iii) obtaining enhanced audio data as an output from the audio enhancer;

(iv) determining applicability of the candidate enhancement metadata based on the enhanced audio data; and

(v) generating enhancement metadata based on a result of the determining.

In some embodiments, determining the applicability of the candidate enhancement metadata in step (iv) may comprise: (vi) presenting the enhanced audio data to a user and receiving a first input from the user in response to the presenting, and wherein the generating of the enhancement metadata in step (v) may be based on the first input.

In some embodiments, the first input from the user may include an indication of whether the candidate enhancement metadata was accepted or rejected by the user.

In some embodiments, in the event that the user rejects the candidate enhancement metadata, a second input may be received from the user indicating a modification to the candidate enhancement metadata, and the generating of the enhancement metadata in step (v) may be based on the second input.

In some embodiments, in the event that the user rejects the candidate enhancement metadata, steps (ii) through (v) may be repeated.

In some embodiments, the enhancement metadata may include one or more items of enhancement control data.

In some embodiments, the enhancement control data may include information regarding one or more audio enhancement types including one or more of speech enhancement, music enhancement, and applause enhancement.

In some embodiments, the enhancement control data may further comprise information on respective admissibility (allowability) of the one or more audio enhancement types.

In some embodiments, the enhancement control data may further comprise information regarding the amount of audio enhancement.

In some embodiments, the enhancement control data may further comprise information on the admissibility of audio enhancement being performed by an automatically updated audio enhancer at the decoder side.

In some embodiments, processing the core-decoded initial audio data based on the candidate enhancement metadata in step (ii) may be performed by applying one or more predefined audio enhancement modules, and the enhancement control data may further comprise information on the admissibility of using one or more different enhancement modules at the decoder side that achieve the same or substantially the same type of enhancement.

In some embodiments, the audio enhancer may be a generator.
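To make the encoder-side flow of this first aspect concrete, a minimal Python sketch of the iterative metadata-generation loop (steps (a)-(c) and (i)-(v)) is given below. All helper names (core_encode, core_decode, audio_enhancer, present_to_user) are placeholders for whatever core codec, audio enhancer and user interface are actually used; the sketch only illustrates the control flow, not an actual implementation.

```python
def generate_enhancement_metadata(original_audio, candidate_metadata,
                                  core_encode, core_decode, audio_enhancer,
                                  present_to_user):
    """Illustrative encoder-side loop: (a) core encode, (b) derive enhancement
    metadata with user feedback, (c) output encoded audio plus metadata."""
    encoded_audio = core_encode(original_audio, bitrate="low")        # step (a)

    while True:
        initial_audio = core_decode(encoded_audio)                    # step (i)
        enhanced_audio = audio_enhancer(initial_audio,
                                        metadata=candidate_metadata)  # steps (ii)-(iii)
        decision = present_to_user(enhanced_audio)                    # steps (iv)/(vi)
        if decision.accepted:                                         # first input
            enhancement_metadata = candidate_metadata                 # step (v)
            break
        candidate_metadata = decision.modified_metadata               # second input

    return encoded_audio, enhancement_metadata                        # step (c)
```

If the user rejects the candidate metadata, the loop repeats steps (ii) to (v) with the modified candidate, mirroring the embodiments described above.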

According to a second aspect of the present disclosure, there is provided an encoder for generating enhancement metadata for controlling enhancement of low-bitrate coded audio data. The encoder may include one or more processors configured to perform a method for low bit-rate encoding of audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at a decoder side.

According to a third aspect of the present disclosure, a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata is provided. The method may comprise the steps of: (a) audio data encoded at a low bit rate, and enhancement metadata are received. The method may further comprise the steps of: (b) core decoding the encoded audio data to obtain core-decoded initial audio data. The method may further comprise the steps of: (c) the core decoded initial audio data is input to an audio enhancer to process the core decoded initial audio data based on the enhancement metadata. The method may further comprise the steps of: (d) obtaining enhanced audio data as output from the audio enhancer. And the method may comprise the steps of: (e) outputting the enhanced audio data.

In some embodiments, processing the core decoded initial audio data based on the enhancement metadata may be performed by applying one or more audio enhancement modules in accordance with the enhancement metadata.

In some embodiments, the audio enhancer may be a generator.

According to a fourth aspect of the present disclosure, a decoder for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata is provided. The decoder may include one or more processors configured to perform a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata.

Drawings

Example embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:

fig. 1 illustrates a flow chart of an example of a method for low bitrate encoding of audio data and generating enhancement metadata for controlling audio enhancement of the low bitrate encoded audio data at a decoder side.

Fig. 2 illustrates a flow diagram of generating enhancement metadata to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of encoded audio data.

Fig. 3 illustrates a flow diagram of another example of generating enhancement metadata to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of encoded audio data.

Fig. 4 illustrates a flow diagram of yet another example of generating enhancement metadata to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of encoded audio data.

Fig. 5 illustrates an example of an encoder configured to perform a method for low bit-rate encoding of audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at a decoder side.

Fig. 6 illustrates an example of a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata.

Fig. 7 illustrates an example of a decoder configured to perform a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata.

Fig. 8 illustrates an example of a system having an encoder configured to perform a method for low bit-rate encoding of audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at the decoder side, and a decoder configured to perform a method for generating enhanced audio data from the low bit-rate encoded audio data based on the enhancement metadata.

Fig. 9 illustrates an example of a device having two or more processors configured to perform the methods described herein.

Detailed Description

Audio enhancement summary

Enhanced audio data may be generated at the decoding side from a low bit-rate encoded audio bitstream, for example as given below and described in 62/733,409 (which is incorporated herein by reference in its entirety). A low bit-rate encoded audio bitstream of any codec used in lossy audio compression, such as AAC (advanced audio coding), Dolby-AC3, HE-AAC, USAC, or Dolby-AC4, may be received. Decoded initial audio data obtained from the received and decoded low bitrate encoded audio bitstream may be input into a generator for enhancing the initial audio data. The initial audio data may then be enhanced by the generator. In general, the enhancement process aims to enhance the quality of the original audio data by reducing coding artifacts. Thus, enhancing the initial audio data by the generator may include one or more of: reducing pre-echo noise, reducing quantization noise, filling spectral gaps, and calculating an adjustment to one or more missing frames. The term spectral gap may include both spectral holes and missing high-frequency bandwidth. User-generated parameters may be used to calculate the adjustment to one or more missing frames. As an output of the generator, enhanced audio data may then be obtained.

The above-described method for performing audio enhancement may be performed in the time domain and/or at least partially in the intermediate (codec) transform domain. For example, the initial audio data may be transformed into an intermediate transform domain before being input into the generator, and the obtained enhanced audio data may be transformed back into the time domain. The intermediate transform domain may be, for example, an MDCT domain.

Audio enhancement may be implemented on any decoder in the time domain or intermediate (codec) transform domain. Alternatively or additionally, audio enhancement may also be guided by encoder-generated metadata. The encoder-generated metadata may generally include one or more of encoder parameters and/or bitstream parameters.

Audio enhancement may also be performed, for example, by a system having a decoder for generating enhanced audio data from a low bit rate encoded audio bitstream and a generative adversarial network arrangement including a generator and a discriminator.

As already mentioned above, the audio enhancement performed by the decoder may be guided by the metadata generated by the encoder. The encoder generated metadata may for example comprise an indication of the encoding quality. The indication of the encoding quality may comprise, for example, information about the presence of encoding artifacts and the effect of the encoding artifacts on the quality of the decoded audio data compared to the original audio data. Thus, the indication of the coding quality may be used to guide the enhancement of the initial audio data in the generator. The indication of the encoding quality may also be used as additional information to modify the audio data in the encoded audio feature space (also referred to as the bottleneck layer) of the generator.

The metadata may also include, for example, bitstream parameters. The bitstream parameters may for example comprise one or more of the following: bit rate, scale factor values associated with an AAC based codec and a dolby AC-4 codec, and global gain associated with an AAC based codec and a dolby AC-4 codec. The bitstream parameters may be used to guide the enhancement of the initial audio data in the generator. The bitstream parameters may also be used as additional information in the encoded audio feature space of the generator.

The metadata may further include, for example, an indication of whether the decoded initial audio data is to be enhanced by the generator. This information can thus be used as a trigger for audio enhancement. If the indication is yes, enhancement may be performed. If the indication is no, the decoder may bypass the enhancement and may perform the decoding process conventionally performed at the decoder, based on the received bitstream including the metadata.
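For illustration, a minimal decoder-side sketch of this metadata-gated flow follows. The helper names (core_decode, generator_enhance) and the metadata fields shown are placeholders chosen for this sketch, not fields defined by this disclosure.

```python
def decode_with_enhancement(encoded_audio, enhancement_metadata,
                            core_decode, generator_enhance):
    """Illustrative decoder-side flow: core decode, then enhance only as allowed
    by the enhancement metadata received with the bitstream."""
    initial_audio = core_decode(encoded_audio)

    # Example gate: the metadata may simply indicate "do not enhance".
    if not enhancement_metadata.get("enhancement_allowed", False):
        return initial_audio

    # Type and amount of enhancement (e.g. speech/music/applause, 0.0..1.0)
    # are taken from the metadata rather than chosen freely by the decoder.
    return generator_enhance(
        initial_audio,
        enhancement_type=enhancement_metadata.get("type", "speech"),
        amount=enhancement_metadata.get("amount", 1.0),
    )
```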

Generative adversarial network setting

As described above, the generator may be used on the decoding side to enhance the initial audio data, in order to reduce coding artifacts introduced by low bit rate coding and thus enhance the quality of the initial audio data compared to the original uncoded audio data.

Such a generator may be a generator trained in a generative adversarial network setting (GAN setting). The GAN setting typically includes a generator G and a discriminator D, both trained by an iterative process. During training in the generative adversarial network setting, the generator G generates enhanced audio data x̂ based on a random noise vector z and initial audio data x̃, which is derived from the original audio data x by encoding and decoding at a low bit rate. However, the random noise vector may be set to z = 0, which was found to be best for reducing coding artifacts; training can also be done without inputting a random noise vector z. In addition, metadata may be input into the generator to modify the enhanced audio data in the encoded audio feature space. In this way, during training, the generation of the enhanced audio data may be adjusted based on the metadata. The generator G tries to output enhanced audio data x̂ that is indistinguishable from the original audio data x. The generated enhanced audio data x̂ and the original audio data x are fed, one at a time, to the discriminator D, which judges in a false/true manner whether the input data is enhanced audio data x̂ (false) or original audio data x (true). In this way, the discriminator D tries to discriminate the original audio data x from the enhanced audio data x̂. During the iterative process, the generator G adjusts its parameters to generate enhanced audio data x̂ that compares better and better to the original audio data x, and the discriminator D learns to better discriminate between the enhanced audio data x̂ and the original audio data x. The adversarial learning process can be described by the following equation (1):
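In the usual conditional GAN notation (an assumed form, with x̂ = G(z, x̃) denoting the enhanced audio data produced by the generator), such an adversarial game may be written as:

$$\min_G \max_D \; V(D,G) = \mathbb{E}_{x,\tilde{x}}\!\left[\log D(x,\tilde{x})\right] + \mathbb{E}_{z,\tilde{x}}\!\left[\log\!\left(1 - D\!\left(G(z,\tilde{x}),\tilde{x}\right)\right)\right] \qquad (1)$$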

it should be noted that to train generator G in the final step, arbiter D may be trained first. Training and updating the discriminator D may involve maximizing the probability of assigning high scores to the original audio data x and low scores to the enhanced audio data x. The goal of training the discriminator D may be that the (unencoded) original audio data is identified as true, while the (generated) enhanced audio data x is identified as false. The parameters of generator G may remain fixed while discriminant D is trained and updated.

Then, training and updating the generator G may involve minimizing the difference between the original audio data x and the generated enhanced audio data x̂. The goal of training the generator G may be to get the discriminator D to recognize the generated enhanced audio data x̂ as true.

The training of the generator G may, for example, involve the following. The initial audio data x̃ and a random noise vector z may be input into the generator G. The initial audio data x̃ may be obtained by encoding the original audio data x at a low bit rate and then decoding it. Based on this input, the generator G may then generate enhanced audio data x̂. If a random noise vector z is used, it may be set to z = 0, or training may be performed without inputting the random noise vector z. In addition, in the encoded audio feature space, the generator G may be trained using metadata as input to modify the enhanced audio data x̂. The original audio data x (from which the initial audio data x̃ has been derived) and the generated enhanced audio data x̂ are then input, one at a time, to the discriminator D. As additional information, the initial audio data x̃ may also be input to the discriminator D each time. The discriminator D may then determine whether the input data is enhanced audio data x̂ (false) or original audio data x (true). In a next step, the parameters of the generator G may then be adjusted until the discriminator D is no longer able to distinguish the enhanced audio data x̂ from the original audio data x. This may be done by an iterative process.

The decision of the discriminator D may be based on one or more perceptually motivated objective functions, for example as per equation (2) below:
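Assuming a SEGAN-style conditional least-squares objective (a sketch of the form consistent with the description below, not necessarily the exact formulation), equation (2) for training the generator may take the form:

$$\min_G V_{LS}(G) = \frac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\!\left[\left(D\!\left(G(z,\tilde{x}),\tilde{x}\right) - 1\right)^{2}\right] + \lambda\,\left\lVert G(z,\tilde{x}) - x \right\rVert_{1} \qquad (2)$$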

The index LS refers to the least squares method. In addition, as can be seen from the first term in equation (2), the initial audio data x̃ is input as additional information to the discriminator, so as to apply a conditional generative adversarial network setting.

However, it was found that, especially by introducing the last term in equation (2) above, it can be ensured that lower frequencies are not disturbed during the iterative process, since these frequencies are typically encoded with a higher number of bits. The last term is the 1-norm distance scaled by a factor λ (lambda). The value of λ may be selected from 10 to 100, depending on the application and/or the length of the signal input to the generator. For example, λ = 100 may be chosen.

The training of the discriminator D may follow the same general procedure as described above for the training of the generator G, except that the parameters of the generator G may be fixed while the parameters of the discriminator D may be variable. For example, the training of the discriminator D may be described by the following equation (3), which allows the discriminator D to determine the enhanced audio data x̂ as false:
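Under the same least-squares assumption, the corresponding discriminator objective may read:

$$\min_D V_{LS}(D) = \frac{1}{2}\,\mathbb{E}_{x,\tilde{x}}\!\left[\left(D(x,\tilde{x}) - 1\right)^{2}\right] + \frac{1}{2}\,\mathbb{E}_{z,\tilde{x}}\!\left[D\!\left(G(z,\tilde{x}),\tilde{x}\right)^{2}\right] \qquad (3)$$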

in the above case, by also converting the initial audio dataThe Least Squares (LS) and conditional countermeasure generation network settings are applied as additional information input to the arbiter.

In addition to the least squares method, other training methods may be used to train the generator and the discriminator in the generative adversarial network setting. For example, the so-called Wasserstein approach may be used. In this case, the Earth Mover's distance (also called the Wasserstein distance) may be used instead of the least squares distance. In general, different training methods make the training of the generator and the discriminator more stable. However, the kind of training method applied does not affect the architecture of the generator, which is exemplarily detailed below.

Architecture of generator

Although the architecture of the generator is generally not limited, the generator may, for example, include an encoder stage and a decoder stage. The encoder and decoder stages of the generator may be fully convolutional. The decoder stage may mirror the encoder stage, and the encoder stage and the decoder stage may each comprise a plurality of L layers, with N filters in each layer L. L may be a natural number equal to or greater than 1, and N may be a natural number equal to or greater than 1. The size of the N filters (also referred to as kernel size) is not limited and may be selected according to the requirements of the generator for enhancing the quality of the original audio data. However, in each L layer, the filter size may be the same.

In more detail, the generator may have a first encoder layer (layer number L = 1), which may include N = 16 filters of size 31. The second encoder layer (layer number L = 2) may include N = 32 filters of size 31. A subsequent encoder layer (layer number L = 11) may include N = 512 filters of size 31. The number of filters thus increases from layer to layer. Each filter may operate on the audio data input to each encoder layer with a stride of 2. Thus, as the width (duration of the signal) narrows, the depth becomes larger, and a learnable downsampling by a factor of 2 may be performed. Alternatively, the filters may operate with a stride of 1 in each encoder layer, followed by a downsampling by a factor of 2 (as in known signal processing).

In at least one encoder layer and at least one decoder layer, a non-linear operation may additionally be performed as an activation. The non-linear operation may, for example, comprise one or more of the following: a parametric rectified linear unit (PReLU), a rectified linear unit (ReLU), a leaky rectified linear unit (LReLU), an exponential linear unit (ELU), and a scaled exponential linear unit (SELU).

The decoder layers may mirror the corresponding encoder layers. Although the number of filters in each layer of the decoder stage and the filter width in each layer may be the same as in the encoder stage, upsampling of the audio signal starting from a narrower width (duration of the signal) may be performed by two alternative methods. A fractionally-strided convolution (also known as transposed convolution) operation may be used in the layers of the decoder stage to increase the width of the audio signal back to the full duration, i.e., the frame of the audio signal input into the generator.

Alternatively, in each layer of the decoder stage, the filters may operate on the audio data input into each layer with a stride of 1, after upsampling by a factor of 2 and interpolation have been performed as in conventional signal processing.

In addition, an output layer (a convolutional layer) may follow the decoder stage before the enhanced audio data is output in a final step. For example, the output layer may include N = 1 filters of size 31.

In the output layer, the activation may be different from the activation performed in at least one of the encoder layers and at least one of the decoder layers. The activation may be any non-linear function defined in the same range as the audio signal input to the generator. The time signal to be enhanced may be defined, for example, between +/-1. Activation may then be based on, for example, a tanh operation.

Between the encoder stage and the decoder stage, the audio data may be modified to generate the enhanced audio data. The modification may be based on the encoded audio feature space (also referred to as the bottleneck layer). The modification in the encoded audio feature space may be done, for example, by concatenating the random noise vector z with the vector representation c of the initial audio data, which is the output of the last layer of the encoder stage. However, the random noise vector may be set to z = 0. It has been found that setting the random noise vector to z = 0 can produce the best results for reducing coding artifacts. As additional information, the bitstream parameters and encoder parameters included in the metadata may be input at this point to modify the enhanced audio data. In this way, the generation of the enhanced audio data may be adjusted based on the given metadata.

There may be skip connections between homologous layers of the encoder stage and the decoder stage. In this way, the enhanced audio may maintain the temporal structure or texture of the encoded audio, since the encoded audio feature space described above can be bypassed, so that information loss is prevented. The skip connections may be implemented using one or more of concatenation and signal addition. Due to the skip connections, the number of filter outputs may "virtually" double.

The architecture of the generator can be summarized, for example, as follows (omitting the skip connections):

inputting: initial audio data

Encoder layer L ═ 1: filter number N equal to 16, filter size equal to 31, activation equal to PreLU

Encoder layer L ═ 2: number of filters N32, size of filters 31, activation PreLU

Encoder layer L ═ 11: the number of filters N is 512 and the size of the filters is 31

Encoder layer L-12: the number of filters N is 1024, and the size of the filter is 31

Encoded audio feature space

Decoder layer L ═ 1: the number of filters N is 512 and the size of the filters is 31

Decoder layer L10: number of filters N-32 and filter size 31, activate PreLU

Decoder layer L ═ 11: filter number N of 16 and filter size 31, activate PreLU

An output layer: the number of filters N is 1, the size of the filter is 31, and tanh is activated

And (3) outputting: enhanced audio data

However, depending on the application, the number of layers of the encoder and decoder stages of the generator may be scaled down or up, respectively.
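For illustration only, a minimal PyTorch-style sketch of such a fully convolutional encoder/decoder generator is given below, using fewer layers than the twelve-layer example above for brevity. The class and variable names are hypothetical, and the skip connections as well as the noise/metadata conditioning at the bottleneck are omitted; this is a sketch of the described structure under those assumptions, not the actual implementation.

```python
# Illustrative sketch (not the patented implementation) of a fully convolutional
# encoder/decoder generator: kernel size 31, stride 2, PReLU activations, tanh output.
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    def __init__(self, n_filters=(16, 32, 64, 128, 256, 512, 1024), kernel_size=31):
        super().__init__()
        pad = kernel_size // 2
        # Encoder: strided 1-D convolutions (learnable downsampling by 2 per layer).
        enc, in_ch = [], 1
        for out_ch in n_filters:
            enc.append(nn.Conv1d(in_ch, out_ch, kernel_size, stride=2, padding=pad))
            enc.append(nn.PReLU())
            in_ch = out_ch
        self.encoder = nn.Sequential(*enc)
        # Decoder: mirrored transposed (fractionally strided) convolutions.
        dec = []
        for out_ch in reversed(n_filters[:-1]):
            dec.append(nn.ConvTranspose1d(in_ch, out_ch, kernel_size, stride=2,
                                          padding=pad, output_padding=1))
            dec.append(nn.PReLU())
            in_ch = out_ch
        self.decoder = nn.Sequential(*dec)
        # Output layer: a single filter; tanh keeps samples within the +/-1 range.
        self.out = nn.Sequential(
            nn.ConvTranspose1d(in_ch, 1, kernel_size, stride=2, padding=pad,
                               output_padding=1),
            nn.Tanh(),
        )

    def forward(self, x_tilde):  # x_tilde: (batch, 1, samples), core-decoded audio
        c = self.encoder(x_tilde)   # encoded audio feature space (bottleneck layer);
                                    # noise/metadata conditioning omitted in this sketch
        return self.out(self.decoder(c))
```

A full implementation along the lines described above would additionally concatenate (or add) the homologous encoder outputs into the decoder layers (skip connections) and condition the bottleneck on the random noise vector z and the metadata.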

Structure of discriminator

The architecture of the discriminator may follow the same one-dimensional convolutional structure as the encoder stage of the exemplary generator described above. Thus, the discriminator architecture can mirror the decoder stage of the generator. The discriminator may therefore comprise a plurality of L layers, wherein each layer may comprise N filters. L may be a natural number equal to or greater than 1, and N may be a natural number equal to or greater than 1. The size of the N filters is not limited and may also be selected according to the requirements of the discriminator. However, the filter size may be the same in each of the L layers. The non-linear operation performed in at least one encoder layer of the discriminator may include a leaky ReLU (LReLU).

Following the encoder stage, the discriminator may comprise an output layer. The output layer may have N = 1 filters of filter size 1. In this way, the filter size of the output layer may be different from the filter size of the encoder layers. The output layer is thus a one-dimensional convolutional layer that does not downsample the hidden activations. This means that the filter in the output layer may operate with a stride of 1, while all previous layers of the encoder stage of the discriminator may use a stride of 2. The activation in the output layer may be different from the activation in the at least one encoder layer. The activation may be a sigmoid. However, if a least squares training method is used, the sigmoid activation may not be needed and is therefore optional.

The architecture of the discriminator can be exemplarily summarized as follows:

Input: enhanced audio data or original audio data

Encoder layer L = 1: number of filters N = 16, filter size = 31, activation = LReLU

Encoder layer L = 2: number of filters N = 32, filter size = 31, activation = LReLU

Encoder layer L = 11: number of filters N = 1024, filter size = 31, activation = LReLU

Output layer: number of filters N = 1, filter size = 1, optionally: activation = sigmoid

Output (not shown): the input is judged true/false with respect to the original audio data and the enhanced audio data generated by the generator.

Depending on the application, the number of layers of the encoder stage of the discriminator may be scaled down or up, for example.
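As a purely illustrative counterpart to the generator sketch above (again with fewer layers for brevity; the class name DiscriminatorSketch and the choice to pass the conditioning signal as a second input channel are assumptions of this sketch, not taken from this disclosure), a compact PyTorch-style discriminator could look as follows:

```python
import torch
import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    """Illustrative 1-D convolutional discriminator mirroring the encoder stage."""
    def __init__(self, n_filters=(16, 32, 64, 128, 256, 512, 1024), kernel_size=31):
        super().__init__()
        layers, in_ch = [], 2          # 2 channels: candidate audio + conditioning signal
        for out_ch in n_filters:
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size, stride=2,
                                    padding=kernel_size // 2))
            layers.append(nn.LeakyReLU(0.2))
            in_ch = out_ch
        # Output layer: a single size-1 filter, stride 1, no downsampling of activations.
        layers.append(nn.Conv1d(in_ch, 1, kernel_size=1))
        # Sigmoid omitted here, assuming a least squares (LSGAN-style) training objective.
        self.net = nn.Sequential(*layers)

    def forward(self, x_candidate, x_tilde):
        # Score per (downsampled) time step; the mean gives one true/false score per clip.
        return self.net(torch.cat([x_candidate, x_tilde], dim=1)).mean(dim=(1, 2))
```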

Companding

The companding technique described in US patent US 9,947,335 B2, which is incorporated herein by reference in its entirety, achieves temporal noise shaping of the quantization noise in an audio codec by using a companding algorithm implemented in the QMF (quadrature mirror filter) domain. Generally, companding is a parametric coding tool operating in the QMF domain, which can be used to control the temporal distribution of quantization noise (e.g., noise introduced in the MDCT (modified discrete cosine transform) domain). Thus, companding techniques may involve a QMF analysis step, followed by the application of the actual companding operation/algorithm, and a QMF synthesis step.

Companding can be viewed as an example technique to reduce the dynamic range of a signal, and equivalently to remove the time-domain envelope from the signal. Improving the quality of audio in a reduced dynamic range domain may be particularly valuable for applications using companding techniques.

Audio enhancement of audio data in the reduced dynamic range domain from the low bit rate audio bitstream may be performed, for example, as described in detail below and in 62/850,117 (which is incorporated herein by reference in its entirety). A low bit-rate audio bitstream of any codec used in lossy audio compression, such as AAC (advanced audio coding), Dolby-AC3, HE-AAC, USAC, or Dolby-AC4, may be received. In particular, the low bit rate audio bitstream may be in AC-4 format. The low bit rate audio bitstream may be core decoded, and initial audio data with reduced dynamic range may be obtained based on the low bit rate audio bitstream. For example, the low bit rate audio bitstream may be core decoded to obtain initial audio data with reduced dynamic range based on the low bit rate audio bitstream. The reduced dynamic range audio data may be encoded in the low bit-rate audio bitstream. Alternatively, the dynamic range reduction may be performed before or after core decoding of the low bit rate audio bitstream. The initial audio data with reduced dynamic range may be input into a generator to process the initial audio data with reduced dynamic range. The initial audio data with reduced dynamic range may then be enhanced in the reduced dynamic range domain by the generator. The enhancement process performed by the generator aims at enhancing the quality of the original audio data by reducing coding artifacts and quantization noise. As an output, enhanced reduced dynamic range audio data may be obtained for subsequent expansion to the expanded domain. Such a method may further include expanding the enhanced dynamic range reduced audio data to an expanded dynamic range domain by performing an expansion operation. The expansion operation may be a companding operation in which a p-norm of the spectral magnitudes is used to calculate the corresponding gain values.

Generally, in companding (compression/expansion), gain values for compression and expansion are calculated and applied in a filter bank. Short prototype filters may be applied to address potential problems associated with the application of varying gain values. Referring to the companding operation above, the enhanced dynamic range reduced audio data output by the generator may be analyzed by a filter bank, and a wideband gain may be applied directly in the frequency domain. The corresponding effect in the time domain is to smooth the gain application naturally, depending on the shape of the prototype filter applied. The modified frequency signal is then converted back to the time domain in a corresponding synthesis filter bank. Analyzing the signal using a filter bank provides access to its spectral content and allows the calculation of gains that preferentially boost the contribution due to high frequencies (or boost the contribution due to any weaker spectral content), providing gain values that are not dominated by the strongest component in the signal, thus addressing the problems associated with audio sources that comprise a mix of different sources. In this context, the gain value may be calculated using a p-norm of the spectral magnitudes, where p is typically less than 2, which has been found to be more effective in shaping the quantization noise than an energy-based measure with p = 2.
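As a purely illustrative sketch of this idea (not the AC-4 companding algorithm itself; the default p, the exponent and the normalization below are assumptions), a per-slot compression/expansion gain based on a p-norm of QMF magnitudes might be computed as follows:

```python
import numpy as np

def pnorm_gain(qmf_slot, p=1.0, exponent=0.65, eps=1e-9):
    """Illustrative companding gain for one QMF time slot (vector of complex bins).

    The slot's "loudness" is measured with a p-norm (p < 2 weights weak spectral
    content more than an energy measure would); the compressor gain is a power of
    that measure, and the expander uses the reciprocal gain.
    """
    magnitudes = np.abs(qmf_slot)
    s = (np.mean(magnitudes ** p) + eps) ** (1.0 / p)   # p-norm based slot measure
    g_compress = s ** (exponent - 1.0)                  # attenuate loud, boost quiet slots
    g_expand = 1.0 / g_compress                         # inverse gain at the decoder side
    return g_compress, g_expand

# Example: one slot of 64 QMF bins
g_c, g_e = pnorm_gain(np.random.randn(64) + 1j * np.random.randn(64))
```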

The above method may be implemented on any decoder. If the application of the above method incorporates companding, the above method can be implemented on an AC-4 decoder.

Alternatively or additionally, the above method may also be performed by a system having an apparatus for generating enhanced audio data from a low bit-rate audio bitstream in a reduced dynamic range domain and a generative adversarial network arrangement comprising a generator and a discriminator. The apparatus may be a decoder.

The above method may also be performed by an apparatus for generating enhanced audio data from a low bit-rate audio bit-stream in a domain of reduced dynamic range, wherein the apparatus may comprise: a receiver for receiving a low bit rate audio bitstream; a core decoder for core decoding a received low bit-rate audio bit-stream to obtain initial audio data with a reduced dynamic range based on the low bit-rate audio bit-stream; and a generator for enhancing the initial audio data of reduced dynamic range in the domain of reduced dynamic range. The apparatus may further comprise a signal splitter. The apparatus may further comprise an extension unit.

Alternatively or additionally, the apparatus may be part of a system that also has means for applying dynamic range reduction to input audio data and for encoding the dynamic range reduced audio data in a bitstream at a low bitrate.

Alternatively or additionally, the above method may be implemented by a respective computer program product comprising a computer readable storage medium having instructions adapted for causing a device having processing capabilities to perform the above method when executed on the device.

Alternatively or additionally, the above method may involve metadata. The received low bit-rate audio bitstream may comprise metadata, and the method may further comprise signal separation of the received low bit-rate audio bitstream. The generator may then enhance the dynamic range reduced initial audio data based on the metadata. If the application incorporates companding, the metadata may include one or more items of companding control data. In general, companding may provide benefits for speech and transient signals while degrading the quality of some stationary signals, since modifying each QMF time slot individually with a gain value during encoding may lead to discontinuities in the envelope of the shaped noise at the companding decoder, resulting in audible artifacts. By means of corresponding companding control data, it is possible to selectively switch companding on for transient signals and off for stationary signals, or to apply average companding where appropriate. In this context, average companding refers to applying a constant gain to an audio frame, similar to the gain of adjacent frames where companding is active. The companding control data may be detected during encoding and transmitted to the decoder via the low bit rate audio bitstream. The companding control data may comprise information about a companding mode, of one or more companding modes, that has been used to encode the audio data. The companding modes may include a companding mode in which companding is on, a companding mode in which companding is off, and a companding mode of average companding. The enhancement of the dynamic range reduced initial audio data by the generator may depend on the companding mode indicated in the companding control data. If the companding mode is companding off, the generator may not perform the enhancement.
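A toy sketch of how such companding control data might be selected per frame at the encoder is shown below; the detection inputs and the enum are purely illustrative assumptions about the control flow, not the actual AC-4 detector:

```python
from enum import Enum

class CompandingMode(Enum):
    ON = "on"            # transient signals (e.g. speech, applause)
    OFF = "off"          # stationary signals
    AVERAGE = "average"  # constant gain similar to neighbouring companded frames

def select_companding_mode(frame_is_transient: bool, frame_is_stationary: bool) -> CompandingMode:
    """Illustrative per-frame selection of the companding control data."""
    if frame_is_transient:
        return CompandingMode.ON
    if frame_is_stationary:
        return CompandingMode.OFF
    return CompandingMode.AVERAGE
```

At the decoder side, the same control data can then gate the generator, e.g. no enhancement is applied when the received mode is the companding-off mode.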

Generative adversarial network setting in the reduced dynamic range domain

The generator may enhance the dynamic range reduced initial audio data in the reduced dynamic range domain. Through this enhancement, coding artifacts introduced by the low bit rate coding are reduced, so that the quality of the dynamic range reduced initial audio data, as compared to the original uncoded dynamic range reduced audio data, is already enhanced before the dynamic range is expanded.

Thus, the generator may be a generator trained in the reduced dynamic range domain in a generative adversarial network setting (GAN setting). For example, the reduced dynamic range domain may be the AC-4 companded domain. In some cases (as in AC-4 companding), dynamic range reduction may be equivalent to removing (or suppressing) the temporal envelope of the signal. Thus, it can be said that the generator may be a generator trained in the domain obtained after the temporal envelope has been removed from the signal. Furthermore, although a GAN setting will be described below, it is noted that this should not be construed as limiting and that other generative models are also conceivable.

As already described above, the GAN setting typically includes a generator G and a discriminator D trained by an iterative process. During training in the generative adversarial network setting, the generator G generates enhanced dynamic range reduced audio data x̂ based on dynamic range reduced initial audio data x̃, which is derived from the original dynamic range reduced audio data x by core encoding and core decoding. The dynamic range reduction may be performed by applying a companding operation. The companding operation may be a companding operation as specified for the AC-4 codec and performed in an AC-4 encoder.

Also in this case, in addition to the dynamic range reduced initial audio data x̃, a random noise vector z may be input into the generator, and the enhanced dynamic range reduced audio data x̂ may be generated by the generator additionally based on the random noise vector z. However, the random noise vector may be set to z = 0, since it was found that setting it to z = 0 may be best for reducing coding artifacts, especially in cases where the bit rate is not too low. Alternatively, training may be performed without inputting the random noise vector z. Alternatively or additionally, metadata may be input into the generator, and the enhancement of the dynamic range reduced initial audio data x̃ may additionally be based on the metadata. During training, the generation of the enhanced dynamic range reduced audio data x̂ may thus be adjusted based on the metadata. The metadata may include one or more items of companding control data. The companding control data may comprise information about a companding mode, of one or more companding modes, used to encode the audio data. The companding modes may include a companding mode in which companding is on, a companding mode in which companding is off, and a companding mode of average companding. The generation of the enhanced dynamic range reduced audio data by the generator may depend on the companding mode indicated by the companding control data. Thus, during training, the generator may be conditioned on the companding mode. If the companding mode is companding off, this may indicate that the incoming initial audio data is not dynamic range reduced, and in this case the generator may not perform enhancement. As described above, companding control data may be detected during encoding of the audio data and enables selective application of companding: companding is turned on for transient signals, turned off for stationary signals, and average companding is applied where appropriate.

During training, the generator attempts to output enhanced dynamic range reduced audio data x̂ that is indistinguishable from the original dynamic range reduced audio data x. The generated enhanced dynamic range reduced audio data x̂ and the original dynamic range reduced audio data x are fed, one at a time, to the discriminator, which judges in a false/true manner whether the input data is the enhanced dynamic range reduced audio data x̂ or the original dynamic range reduced audio data x. In this way, the discriminator attempts to discriminate the original dynamic range reduced audio data x from the enhanced dynamic range reduced audio data x̂. During the iterative process, the generator then adjusts its parameters to generate enhanced dynamic range reduced audio data x̂ that compares better and better to the original dynamic range reduced audio data x, and the discriminator learns to better discriminate between the enhanced dynamic range reduced audio data x̂ and the original dynamic range reduced audio data x.

It should be noted that, in order to train the generator in a final step, the discriminator may be trained first. Training and updating of the discriminator may also be performed in the reduced dynamic range domain. Training and updating the discriminator may involve maximizing the probability of assigning high scores to the original dynamic range reduced audio data x and low scores to the enhanced dynamic range reduced audio data x̂. The goal of training the discriminator may be to identify the original dynamic range reduced audio data x as true, while the enhanced dynamic range reduced audio data x̂ (the generated data) is identified as false. The parameters of the generator may remain fixed while the discriminator is trained and updated.

Training and updating the generator may involve minimizing the difference between the original dynamic range reduced audio data x and the generated enhanced dynamic range reduced audio data x̂. The goal of training the generator may be to get the discriminator to recognize the generated enhanced dynamic range reduced audio data x̂ as true.

In detail, training the generator G in the reduced dynamic range domain in the generative adversarial network setting may, for example, involve the following.

A dynamic range reduction may be performed on the original audio data x_ip to obtain the original dynamic range reduced audio data x. The dynamic range reduction may be performed by applying a companding operation, in particular an AC-4 companding operation, followed by a QMF (quadrature mirror filter) synthesis step. Since the companding operation is performed in the QMF domain, a subsequent QMF synthesis step is required. The original dynamic range reduced audio data x may additionally be core encoded and core decoded to obtain the dynamic range reduced initial audio data x̃ before being input to the generator G. The dynamic range reduced initial audio data x̃ and the random noise vector z are then input into the generator G. Based on this input, the generator G generates the enhanced dynamic range reduced audio data x̂ in the reduced dynamic range domain. The random noise vector z may be set to z = 0. Alternatively, training may be performed without inputting the random noise vector z. Alternatively or additionally, in the encoded audio feature space of the reduced dynamic range domain, the generator G may be trained using metadata as input to modify the enhanced dynamic range reduced audio data x̂. The original dynamic range reduced audio data x (from which the dynamic range reduced initial audio data x̃ has been derived) and the generated enhanced dynamic range reduced audio data x̂ are input, one at a time, to the discriminator D. As additional information, the dynamic range reduced initial audio data x̃ may also be input to the discriminator D each time. The discriminator D then determines whether the input data is enhanced dynamic range reduced audio data x̂ (false) or original dynamic range reduced audio data x (true).

In a next step, the parameters of the generator G are then adjusted until the discriminator D is no longer able to distinguish between the enhanced reduced dynamic range audio data x̂ and the original reduced dynamic range data x. This may be done in an iterative process.
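Purely for illustration, the data preparation and a single training pass described above might be sketched as follows in PyTorch. The helpers compand, qmf_synthesis, core_encode and core_decode are hypothetical stand-ins for the AC-4 companding, QMF synthesis and core codec steps, the noise shape is an assumption, and the generator G and discriminator D are assumed to follow the sketches given further below.

```python
import torch

def gan_training_pass(x_ip, G, D, compand, qmf_synthesis, core_encode, core_decode):
    """One illustrative forward pass in the reduced dynamic range domain.

    x_ip: original (full dynamic range) audio, shape (batch, 1, samples).
    G, D: generator and discriminator modules (see the sketches further below).
    compand, qmf_synthesis, core_encode, core_decode: hypothetical callables
    standing in for the AC-4 companding, QMF synthesis and core codec steps.
    """
    # Dynamic range reduction: companding in the QMF domain, then QMF synthesis.
    x = qmf_synthesis(compand(x_ip))            # original reduced dynamic range data x

    # Core encode/decode at a low bit rate to obtain the initial data x~ (x_tilde).
    x_tilde = core_decode(core_encode(x))

    # Random noise vector z; may be set to zero, or omitted entirely.
    z = torch.zeros(x_tilde.shape[0], 1024, x_tilde.shape[-1] // 128)  # shape is an assumption

    # The generator produces enhanced reduced dynamic range data x^ (x_hat).
    x_hat = G(x_tilde, z)

    # The discriminator judges real (x) and fake (x_hat) one at a time,
    # conditioned on x_tilde as additional information.
    score_real = D(x, x_tilde)
    score_fake = D(x_hat, x_tilde)
    return x, x_tilde, x_hat, score_real, score_fake
```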

The judgment of the discriminator may enter into one or more perceptually motivated objective functions, for example as per the following equation (1):

$$\min_G V_{LS}(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\,\tilde{x} \sim p_{\mathrm{data}}(\tilde{x})}\Big[\big(D(G(z,\tilde{x}),\tilde{x}) - 1\big)^2\Big] + \lambda\,\big\lVert G(z,\tilde{x}) - x\big\rVert_1 \qquad (1)$$

The index LS refers to the least squares approach. In addition, as can be seen from the first term in equation (1), the core-decoded reduced dynamic range initial audio data x̃ is input as additional information to the discriminator, so that a conditional generative adversarial network setting is applied.

However, it has been found that, especially with the introduction of the last term in equation (1) above, it can be ensured that lower frequencies are not disturbed during the iterative process, since these frequencies are typically encoded with a higher number of bits. The last term is the 1-norm distance scaled by the factor lambda (λ). The value of λ may be selected from the range 10 to 100 depending on the application and/or the length of the signal input to the generator. For example, λ = 100 may be chosen.

Training of the discriminator D in the reduced dynamic range domain in the generative adversarial network setting may follow the same general iterative process as described above for training the generator G, with the enhanced reduced dynamic range audio data x̂, the original reduced dynamic range audio data x and the reduced dynamic range initial audio data x̃ being input to the discriminator D one at a time, except that the parameters of the generator G may be fixed while the parameters of the discriminator D may vary. The training of the discriminator D may be described by the following equation (2), which enables the discriminator D to determine the enhanced reduced dynamic range audio data x̂ as fake:

$$\min_D V_{LS}(D) = \frac{1}{2}\,\mathbb{E}_{x,\tilde{x} \sim p_{\mathrm{data}}(x,\tilde{x})}\Big[\big(D(x,\tilde{x}) - 1\big)^2\Big] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\,\tilde{x} \sim p_{\mathrm{data}}(\tilde{x})}\Big[D\big(G(z,\tilde{x}),\tilde{x}\big)^2\Big] \qquad (2)$$

Also in this case, the core-decoded reduced dynamic range initial audio data x̃ is input as additional information to the discriminator, so that the least squares (LS) approach and the conditional generative adversarial network setting are applied.

Besides the least squares approach, other training methods may also be used to train the generator and the discriminator in the generative adversarial network setting in the reduced dynamic range domain. Alternatively or additionally, for example, the so-called Wasserstein approach may be used. In this case, the earth mover's distance (also known as the Wasserstein distance) may be used instead of the least squares distance. In general, different training methods may make the training of the generator and the discriminator more stable. However, the kind of training method applied does not affect the architecture of the generator described in detail below.
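Assuming the least squares objectives take the form of equations (1) and (2) above, the corresponding loss computations might be sketched as follows (batch means stand in for the expectations, and the exact weighting of the 1-norm term is an assumption):

```python
import torch

def generator_loss_ls(D, x, x_tilde, x_hat, lam=100.0):
    """Least squares generator objective as in equation (1): an adversarial term
    plus a 1-norm term scaled by lambda that keeps the lower frequencies intact."""
    adversarial = 0.5 * torch.mean((D(x_hat, x_tilde) - 1.0) ** 2)
    l1_distance = torch.mean(torch.abs(x_hat - x))
    return adversarial + lam * l1_distance

def discriminator_loss_ls(D, x, x_tilde, x_hat):
    """Least squares discriminator objective as in equation (2): score the
    original data x as real (1) and the generated data x_hat as fake (0)."""
    loss_real = 0.5 * torch.mean((D(x, x_tilde) - 1.0) ** 2)
    loss_fake = 0.5 * torch.mean(D(x_hat.detach(), x_tilde) ** 2)
    return loss_real + loss_fake
```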

Architecture of the generator trained in the reduced dynamic range domain

The generator may, for example, comprise an encoder stage and a decoder stage. The encoder stage and the decoder stage of the generator may be fully convolutional. The decoder stage may mirror the encoder stage, and the encoder stage and the decoder stage may each comprise a plurality of L layers, with N filters in each layer L. L may be a natural number equal to or greater than 1, and N may be a natural number equal to or greater than 1. The size of the N filters (also referred to as the kernel size) is not limited and may be selected according to the quality enhancement requirements that the generator has to meet for the reduced dynamic range initial audio data. However, the filter size may be the same in each of the L layers.

In a first step, the reduced dynamic range initial audio data may be input into the generator. The first encoder layer (layer number L = 1) may include N = 16 filters of size 31. The second encoder layer (layer number L = 2) may include N = 32 filters of size 31. A subsequent encoder layer (layer number L = 11) may include N = 512 filters of size 31. The number of filters in each layer may thus be increased. Each filter may operate on the reduced dynamic range audio data input to each encoder layer with a stride greater than 1. Each filter may operate on the reduced dynamic range audio data input to each encoder layer, for example, with a stride of 2. Thus, a learnable down-sampling by a factor of 2 may be performed. Alternatively, the filters may also operate with a stride of 1 in each encoder layer, followed by a down-sampling by a factor of 2 (as known in signal processing). Alternatively, each filter may operate on the reduced dynamic range audio data input to each encoder layer with a stride of, for example, 4, which may halve the overall number of layers in the generator.

In at least one encoder layer and at least one decoder layer of the generator, a non-linear operation may additionally be performed as an activation. The non-linear operation may include one or more of the following: a parametric rectified linear unit (PReLU), a rectified linear unit (ReLU), a leaky rectified linear unit (LReLU), an exponential linear unit (eLU) and a scaled exponential linear unit (SeLU).

The decoder layers may mirror the corresponding encoder layers. While the number of filters in each layer and the filter width in each layer of the decoder stage may be the same as in the encoder stage, up-sampling of the audio signal in the decoder stage may be performed by two alternative approaches. Fractionally strided convolution (also known as transposed convolution) operations may be used in the layers of the decoder stage. Alternatively, in each layer of the decoder stage, up-sampling and interpolation with an up-sampling factor of 2 may first be performed as in conventional signal processing, after which the filters may operate on the audio data input into each layer with a stride of 1.

In addition, an output layer (a convolution layer) may follow the last layer of the decoder stage, before the enhanced reduced dynamic range audio data is output in a final step. For example, the output layer may include N = 1 filters of size 31.

In the output layer, the activation may be different from the activation performed in at least one of the encoder layers and at least one of the decoder layers. Activation may be based on, for example, a tanh operation.

Between the encoder stage and the decoder stage, the audio data may be modified to generate the enhanced reduced dynamic range audio data. The modification may be based on the encoded audio feature space with reduced dynamic range (also referred to as the bottleneck layer). A random noise vector z may be used in the encoded audio feature space with reduced dynamic range to modify the audio in the reduced dynamic range domain. The modification in the encoded audio feature space with reduced dynamic range may, for example, be done by concatenating the random noise vector z with the vector representation c of the reduced dynamic range initial audio data output by the last layer of the encoder stage. The random noise vector may be set to z = 0, since it was found that setting the random noise vector to z = 0 may yield the best results for reducing coding artifacts. Alternatively or additionally, metadata may be input at this point to modify the enhanced reduced dynamic range audio data. In this way, the generation of the enhanced audio data may be conditioned on given metadata.

There may be skip connections between homologous layers of the encoder stage and the decoder stage. In this way, the encoded audio feature space with reduced dynamic range described above may be bypassed, preventing loss of information. The skip connections may be implemented using one or more of concatenation and signal addition. Due to the implementation of the skip connections, the number of filter outputs may "virtually" double.

The architecture of the generator can be summarized, for example, as follows (omitting the skip connections):

Input: initial audio data with reduced dynamic range

Encoder layer L = 1: number of filters N = 16, filter size = 31, activation = PReLU

Encoder layer L = 2: number of filters N = 32, filter size = 31, activation = PReLU

Encoder layer L = 11: number of filters N = 512, filter size = 31

Encoder layer L = 12: number of filters N = 1024, filter size = 31

Encoded audio feature space with reduced dynamic range (bottleneck layer)

Decoder layer L = 1: number of filters N = 512, filter size = 31

Decoder layer L = 10: number of filters N = 32, filter size = 31, activation = PReLU

Decoder layer L = 11: number of filters N = 16, filter size = 31, activation = PReLU

Output layer: number of filters N = 1, filter size = 31, activation = tanh

Output: enhanced audio data

Depending on the application, the number of layers of the encoder stage and of the decoder stage of the generator may, for example, be scaled down or up. In general, the above generator architecture offers the possibility of reducing artifacts in a single pass, since no complex operations as in WaveNet or sampleRNN need to be performed.
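A minimal sketch of such a generator is given below, assuming kernel size 31, stride 2 and PReLU activations as described above; the number of layers and the channel progression are compressed assumptions rather than the exact layer listing above, and the skip connections are implemented by concatenation.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Illustrative encoder/decoder generator operating in the reduced dynamic
    range domain: kernel size 31, stride 2, PReLU activations, a bottleneck where
    the noise vector z is concatenated, skip connections, and a tanh output layer.
    The channel progression is a compressed assumption (fewer layers than listed)."""

    def __init__(self, channels=(16, 32, 64, 128, 256, 512, 1024)):
        super().__init__()
        # Encoder: learnable down-sampling by a factor of 2 per layer (stride 2).
        self.enc = nn.ModuleList()
        in_ch = 1
        for out_ch in channels:
            self.enc.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=31, stride=2, padding=15),
                nn.PReLU()))
            in_ch = out_ch
        # Decoder: transposed convolutions mirroring the encoder.
        dec_channels = [512, 256, 128, 64, 32, 16, 16]
        skip_channels = list(channels[:-1])[::-1] + [0]   # skips for all but the last layer
        self.dec = nn.ModuleList()
        in_ch = channels[-1] * 2                          # bottleneck features + noise z
        for out_ch, skip_ch in zip(dec_channels, skip_channels):
            self.dec.append(nn.Sequential(
                nn.ConvTranspose1d(in_ch, out_ch, kernel_size=31, stride=2,
                                   padding=15, output_padding=1),
                nn.PReLU()))
            in_ch = out_ch + skip_ch
        # Output layer: a single filter of size 31 with tanh activation.
        self.out = nn.Sequential(
            nn.Conv1d(dec_channels[-1], 1, kernel_size=31, padding=15), nn.Tanh())

    def forward(self, x_tilde, z):
        # x_tilde: (batch, 1, samples), samples assumed divisible by 2**len(self.enc).
        skips, h = [], x_tilde
        for layer in self.enc:
            h = layer(h)
            skips.append(h)
        c = skips.pop()                       # bottleneck representation c
        h = torch.cat([c, z], dim=1)          # concatenate the noise vector z with c
        for layer in self.dec:
            h = layer(h)
            if skips:
                h = torch.cat([h, skips.pop()], dim=1)   # skip connection
        return self.out(h)
```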

Architecture of the discriminator trained in the reduced dynamic range domain

Although the architecture of the discriminator is not limited, it may follow the same one-dimensional convolutional structure as the encoder stage of the generator described above, so the discriminator architecture may mirror the encoder stage of the generator. The discriminator may thus comprise a plurality of L layers, wherein each layer may comprise N filters. L may be a natural number equal to or greater than 1, and N may be a natural number equal to or greater than 1. The size of the N filters is not limited and may also be selected according to the requirements of the discriminator. However, the filter size may be the same in each of the L layers. The non-linear operation performed in at least one encoder layer of the discriminator may include a leaky ReLU.

Following the encoder stage, the discriminator may comprise an output layer. The output layer may have N = 1 filters with a filter size of 1. In this way, the filter size of the output layer may differ from the filter size of the encoder layers. The output layer may thus be a one-dimensional convolution layer that does not down-sample the hidden activations. This means that the filter in the output layer may operate with a stride of 1, while all previous layers of the encoder stage of the discriminator may use a stride of 2. Alternatively, each filter in the previous layers of the encoder stage may operate with a stride of 4, which may halve the overall number of layers in the discriminator.

The activation in the output layer may be different from the activation in the at least one encoder layer. The activation may be a sigmoid. However, if a least squares training method is used, the sigmoid activation may not be needed and is therefore optional.

The architecture of the discriminator can be summarized, for example, as follows:

Input: enhanced reduced dynamic range audio data or original reduced dynamic range audio data

Encoder layer L = 1: number of filters N = 16, filter size = 31, activation = leaky ReLU

Encoder layer L = 2: number of filters N = 32, filter size = 31, activation = leaky ReLU

Encoder layer L = 11: number of filters N = 1024, filter size = 31, activation = leaky ReLU

Output layer: number of filters N = 1, filter size = 1, optionally: activation = sigmoid

Output (not shown): judgment of the input as real/fake relative to the original reduced dynamic range data and the enhanced reduced dynamic range audio data generated by the generator.

Depending on the application, the number of layers of the encoder stage of the discriminator may, for example, be scaled down or up.
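A corresponding minimal sketch of the discriminator is given below, assuming the same kernel size 31 and stride 2 as the generator's encoder stage, leaky ReLU activations (the slope of 0.2 is an assumption) and a final size-1 convolution with an optional sigmoid:

```python
import torch
import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    """Illustrative conditional discriminator mirroring the generator's encoder:
    Conv1d layers with kernel size 31 and stride 2, leaky ReLU activations, and a
    final size-1 convolution (optionally followed by a sigmoid)."""

    def __init__(self, channels=(16, 32, 64, 128, 256, 512, 1024), use_sigmoid=False):
        super().__init__()
        layers = []
        in_ch = 2  # judged waveform plus the conditioning x_tilde, stacked as channels
        for out_ch in channels:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=31, stride=2, padding=15),
                       nn.LeakyReLU(0.2)]
            in_ch = out_ch
        layers.append(nn.Conv1d(in_ch, 1, kernel_size=1))  # output layer, stride 1
        if use_sigmoid:
            layers.append(nn.Sigmoid())  # optional; not needed with least squares training
        self.net = nn.Sequential(*layers)

    def forward(self, x_judged, x_tilde):
        # Condition on the core-decoded initial data by channel-wise concatenation.
        return self.net(torch.cat([x_judged, x_tilde], dim=1))
```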

Artistically controlled audio enhancement

Audio coding and audio enhancement may become more closely related than they are today, because in the future a decoder implementing a deep learning based approach, for example as described above, may "guess" the original audio signal, and the result may sound like an enhanced version of the original audio signal. Examples may include bandwidth extension, or post-processing or decoding of decoded speech into clean speech. At the same time, the result may not be what was "explicitly coded" and may sound wrong; for example, a phoneme error may occur in the decoded speech signal, and it may be unclear whether the system, rather than the speaker, caused the error. This may result in audio that sounds "more natural, but different" from the original sound.

Audio enhancement may change artistic intent. For example, an artist may desire to have coding noise or intentional band limiting in a pop song. There may be coding systems (or at least decoders) that can make the quality better than the original uncoded audio, which may be desirable in some situations. However, it has only recently been demonstrated that the output of a decoder (e.g., for speech and applause) may "sound better" than the input to the encoder.

In this context, the methods and apparatus described herein provide benefits to content creators and to everyone who uses enhanced audio, in particular enhanced audio based on deep learning. These methods and apparatus are particularly relevant in low bit rate situations, where codec artifacts are most likely to be noticed. A content creator may wish to opt in to or out of allowing the decoder to enhance the audio signal in a way that sounds "more natural, but different" from the original sound. This may occur in particular in AC-4 multi-stream coding. In broadcast applications, where the bitstream may include multiple streams each with a low bit rate, the creator may use control parameters included in the enhancement metadata of the lowest bit rate stream to maximize quality and thereby mitigate low bit rate coding artifacts.

In general, the enhancement metadata may be, for example, encoder-generated metadata for guiding audio enhancement by a decoder, in a similar manner to the metadata already mentioned above, which includes, for example, one or more of a coding quality, bitstream parameters, an indication as to whether the original audio data is to be fully enhanced, and companding control data. Depending on the respective requirements, the enhancement metadata may be generated by the encoder instead of or in addition to one or more of the aforementioned metadata, and may be transmitted together with the encoded audio data via a bitstream. In some embodiments, the enhancement metadata may be generated based on the aforementioned metadata. Also, the enhancement metadata may be generated based on presets (candidate enhancement metadata), which may be modified one or more times at the encoder side to generate the enhancement metadata to be transmitted and used at the decoder side. This process may involve user interaction (as detailed below), allowing for artistically controlled enhancement. In some embodiments, the presets for this purpose may be based on the metadata described above.

This provides a significant benefit compared to general audio enhancement of arbitrary signals, since most signals are delivered via bit-rate-constrained codecs. If an enhancement system enhances the audio before encoding, the benefits of the enhancement are lost when a low bit rate codec is applied. If the audio is enhanced in the decoder without input from the content creator, the enhancement may not follow the creator's intent. Table 1 below illustrates this benefit:

System                              | High quality output allowed at the decoder? | Follows the creator's intent?
Encoder-side-only enhancement       | No                                          | Yes
Decoder-side-only enhancement       | Yes                                         | No
Artistically controlled enhancement | Yes                                         | Yes

Table 1: Benefits of artistically controlled audio enhancement

Thus, the methods and apparatus described herein provide a solution for encoding and/or enhancing audio, in particular using deep learning, which is also able to preserve artistic intent, since the content creator is allowed to decide on the encoder side which decoding mode or modes are available. In addition, the settings selected by the content creator may be transmitted to the decoder as enhancement metadata parameters in the bitstream, indicating in which mode the decoder should operate and which (generative) model should be applied.

For purposes of understanding, it should be noted that the methods and apparatus described herein may be used in the following modes:

Mode 1: The encoder may enable the content creator to audition the decoder-side enhancement, so that he or she can directly approve the corresponding enhancement, or reject it, change it and then approve the changed enhancement. In this process, the audio is encoded, decoded and enhanced, and the content creator can listen to the enhanced audio. He or she may say yes or no to the enhanced audio (yes or no to various types and amounts of enhancement). This yes-or-no decision may be used to generate the enhancement metadata that will be passed along with the audio content to the decoder for use by later consumers (in contrast to mode 2, detailed below). Mode 1 may take some time (up to minutes or hours), because the content creator must actively listen to the audio. Of course, an automated version of mode 1 is also conceivable, which may take less time. In mode 1, the audio is typically not delivered to the consumer, except for live broadcasts, as detailed below. In mode 1, the only purpose of decoding and enhancing the audio is audition (or automatic evaluation).

Mode 2: A distributor (e.g., Netflix or the BBC) may transmit the encoded audio content. The distributor may also include the enhancement metadata generated in mode 1 to guide decoder-side enhancement. The encoding and transmission process may be instantaneous and may not involve audition, since audition was already part of the generation of the enhancement metadata in mode 1. The encoding and transmission process may also take place on a different date than mode 1. The consumer's decoder then receives the encoded audio and the enhancement metadata generated in mode 1, decodes the audio and enhances it according to the enhancement metadata; this may again occur on a different date.

It should be noted that for live broadcasts (e.g., sports, news), the content creator may select the allowed enhancements on-site in real time, which also affects the enhancement metadata sent in real time. In this case, mode 1 and mode 2 occur simultaneously, since the signal being auditioned may be the same as the signal delivered to the consumer.

In the following, the methods and apparatus are described in more detail with reference to the accompanying drawings, wherein figs. 1, 2 and 5 relate to automatic generation of enhancement metadata at the encoder side, and figs. 3 and 4 additionally relate to content creator audition. Furthermore, figs. 6 and 7 relate to the decoder side. Fig. 8 relates to a system with an encoder and a decoder according to mode 1 described above.

It should be noted that, in the following, the terms creator, artist, producer and user (where the user is assumed to refer to a creator, artist or producer) may be used interchangeably.

Generating enhancement metadata for controlling audio enhancement of low-bitrate encoded audio data at a decoding side

Referring to the example of fig. 1, a flow diagram of an example of a method for low bit-rate encoding audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at the decoder side is illustrated. In step S101, original audio data is core-encoded to obtain encoded audio data. The original audio data may be encoded at a low bit rate. The codec used to encode the original audio data is not limited, and any codec (e.g., OPUS codec) may be used.

In step S102, enhancement metadata to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of the encoded audio data is generated. As mentioned above, enhancement metadata may be generated by the encoder for directing the decoder for audio enhancement in a similar way as already mentioned above, the metadata for example comprising one or more of the following: coding quality, bitstream parameters, an indication as to whether the original audio data is to be fully enhanced, and companding control data. Depending on the respective requirements, the enhancement metadata may be generated instead of or in addition to one or more of these other metadata. Generating the enhanced metadata may be performed automatically. Alternatively or additionally, generating the enhanced metadata may involve user interaction (e.g., input by the content creator).

The encoded audio data and the enhancement metadata are then output in step S103, e.g., for subsequent transmission to the decoder of the respective consumer via a low bit rate audio bitstream (mode 1) or to a distributor (mode 2). Generating the enhancement metadata at the encoder side may allow, for example, a user (e.g., a content creator) to determine control parameters that control the type and/or amount of audio enhancement at the decoder side when the content is delivered to the consumer.

Referring now to the example of fig. 2, a flow diagram of an example of generating enhancement metadata to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of encoded audio data is illustrated. In an embodiment, generating the enhanced metadata in step S102 may include step S201: core decoding the encoded audio data to obtain core decoded initial audio data.

The initial audio data thus obtained may then be input to an audio enhancer in step S202 to process the core-decoded initial audio data based on candidate enhancement metadata for controlling the type and/or amount of audio enhancement to the audio data input to the audio enhancer. The candidate enhancement metadata may be said to correspond to presets that may still be modified at the encoding side in order to generate enhancement metadata to be transmitted and used at the decoding side for guiding audio enhancement. The candidate enhancement metadata may be predefined presets that can be easily implemented in the encoder, or may be presets entered by a user (e.g., a content creator). In some embodiments, the presets may be based on the metadata mentioned above. The modification of the candidate enhanced metadata may be performed automatically. Alternatively or additionally, the candidate enhancement metadata may be modified based on user input, as described in detail below.

In step S203, the enhanced audio data is then obtained as an output from the audio enhancer. In an embodiment, the audio enhancer may be a generator. The generator itself is not limited. The generator may be a generator trained in a generative adversarial network (GAN) setting, but other generative models are also conceivable. A sampleRNN- or WaveNet-based model is also conceivable.

In step S204, the suitability of the candidate enhancement metadata is then determined based on the enhanced audio data. For example, the suitability may be determined by comparing the enhanced audio data to the original audio data, in order to determine whether, for example, coding noise or band limiting is intentional. The determination of the suitability of the candidate enhancement metadata may be an automated process, i.e., it may be performed automatically by the respective encoder. Alternatively or additionally, determining the suitability of the candidate enhancement metadata may involve audition by a user. Thus, a user (e.g., a content creator) may be enabled to determine the suitability of the candidate enhancement metadata, as also described in further detail below.

In step S205, based on the result of this determination, enhancement metadata is generated. In other words, if the candidate enhancement metadata is determined to be suitable, the enhancement metadata is generated based on the suitable candidate enhancement metadata.

Referring now to the example of fig. 3, a further example of generating enhancement metadata to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of encoded audio data is illustrated.

In an embodiment, the step S204 of determining suitability of candidate enhancement metadata based on the enhanced audio data may comprise the step S204a of: the enhanced audio data is presented to a user and a first input from the user is received in response to the presentation. Then, the generation of the enhanced metadata in step S205 may be based on the first input. The user may be a content creator. When presenting the enhanced audio data to the content creator, the content creator may listen to the enhanced audio data and decide whether the enhanced audio data reflects artistic intent.

As illustrated in the example of fig. 4, in an embodiment, the first input from the user may include an indication of whether the candidate enhanced metadata was accepted or rejected by the user, yes/no as illustrated in decision block S204 b. In an embodiment, in the event that the user rejects the candidate enhancement metadata, a second input may be received from the user indicating a modification to the candidate enhancement metadata in step S204c, and the generating of the enhancement metadata in step S205 may be based on the second input. Such second input may be, for example, input regarding a different set of candidate enhancement metadata (e.g., a different preset) or input according to a change to the current set of candidate enhancement metadata (e.g., a modification to the type and/or amount of enhancement that may be indicated by the respective enhancement control data). Alternatively or additionally, in an embodiment, steps S202 to S205 may be repeated in case the user rejects the candidate enhanced metadata. Thus, a user (e.g., a content creator) may, for example, be able to repeatedly determine the suitability of respective candidate enhancement metadata in order to obtain suitable results in an iterative process. In other words, the content creator may repeatedly listen to the enhanced audio data and decide whether the enhanced audio data subsequently reflects artistic intent in response to the second input. In step S205, the enhancement metadata may then also be based on the second input.
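As a purely illustrative sketch of the encoder-side loop of steps S201 to S205 together with the content creator interaction of figs. 3 and 4, the following outlines mode 1; the helpers core_encode, core_decode, present_to_user and get_user_modification are hypothetical placeholders:

```python
def generate_enhancement_metadata(original_audio, candidate_metadata, enhancer,
                                  core_encode, core_decode,
                                  present_to_user, get_user_modification):
    """Iteratively audition candidate enhancement metadata until the content
    creator accepts it (mode 1). All helpers are hypothetical placeholders."""
    encoded = core_encode(original_audio)                  # step S101 (low bit rate)
    initial = core_decode(encoded)                         # step S201
    while True:
        enhanced = enhancer(initial, candidate_metadata)   # steps S202 and S203
        accepted = present_to_user(enhanced)               # steps S204a and S204b
        if accepted:
            return encoded, candidate_metadata             # used as enhancement metadata (S205)
        # Step S204c: second input from the user modifying the candidate metadata.
        candidate_metadata = get_user_modification(candidate_metadata)
```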

In an embodiment, the enhancement metadata may include one or more items of enhancement control data. Such enhancement control data may be used at the decoding side to control the audio enhancer to perform a desired type and/or amount of enhancement on the corresponding core decoded initial audio data.

In an embodiment, the enhancement control data may comprise information on one or more audio enhancement types (types of content clean-up), including one or more of speech enhancement, music enhancement and applause enhancement.

Thus, there may be a set of (generative) models (e.g., GAN-based music models or sampleRNN-based speech models) implementing various forms of deep learning based enhancement, which may be applied at the decoder side according to the creator's input at the encoder side (e.g., dialog-centric, music-centric, etc., i.e., categories depending on the signal source). Since the audio enhancement may be content-specific in the short term, the creator may also select from the available types of audio enhancement and indicate, by setting the enhancement control data, the respective type of audio enhancement to be used by the corresponding audio enhancer at the decoder side.

In an embodiment, the enhancement control data may further comprise information on the respective admissibility of one or more audio enhancement types.

In this context, a user (e.g., a content creator) may also be allowed to enable or disable detection of the audio type for which an enhancement is to be performed, for example in view of generic enhancers being developed (e.g., for speech, music and other content) or automatic detectors that may select a particular type of enhancement (e.g., speech, music or other). Thus, the term admissibility may also be said to encompass the admissibility of detecting an audio type for subsequently performing a type of audio enhancement. The term admissibility may also be said to cover a simple "just make it sound good" option. In this case, all aspects of the audio enhancement may be left to the decoder. This setting may be presented to the user as "aiming to create the most natural sounding, highest quality perceptual audio, without the artifacts that codecs tend to produce". Thus, if a user (e.g., a content creator) intends to create codec noise, he or she would deactivate this mode during those segments. An automated system for detecting codec noise may also be used to detect this situation and automatically deactivate the enhancement (or suggest deactivating the enhancement) at the relevant times.

Alternatively or additionally, in an embodiment, the enhancement control data may further comprise information about the amount of audio enhancement (amount of content cleanup allowed).

Such an amount may range from "none" to "a lot". In other words, such a setting may correspond, at one end, to encoding the audio in a generic way using typical audio coding (none) and, at the other end, to producing professionally generated audio content regardless of the audio input (a lot). This setting may also be allowed to vary with the bit rate, with the default value increasing as the bit rate decreases. Alternatively or additionally, in embodiments, the enhancement control data may further comprise information on the admissibility of audio enhancement (e.g., updated enhancement) being performed by an automatically updated audio enhancer at the decoder side.

Since deep learning based enhancement is an active research and future product area whose capabilities are rapidly increasing, this setting allows a user (e.g., a content creator) to opt in to or out of allowing future versions of the enhancement (e.g., Dolby enhancements) to be applied, not just the version that the user was able to audition when making his or her selection.

Alternatively or additionally, the processing of the core-decoded initial audio data based on the candidate enhancement metadata in step S202 may be performed by applying one or more predefined audio enhancement modules, and the enhancement control data may further comprise information on the admissibility of using one or more different enhancement modules at the decoder side that achieve the same or substantially the same type of enhancement.

Thus, even if the enhancement modules on the encoding side and the decoding side are different, artistic intent can be preserved during audio enhancement, since the same or substantially the same type of enhancement is achieved.
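By way of illustration only, the items of enhancement control data discussed above might be grouped as follows; the field names and value ranges are assumptions and do not represent a normative bitstream syntax:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EnhancementControlData:
    """Hypothetical grouping of the enhancement control data items."""
    # Audio enhancement types (content clean-up types) and their admissibility.
    allowed_types: Dict[str, bool] = field(default_factory=lambda: {
        "speech": True, "music": True, "applause": False})
    # Amount of enhancement allowed, e.g. 0.0 ("none") to 1.0 ("a lot");
    # the default may increase as the bit rate decreases.
    amount: float = 0.5
    # Whether an automatically updated (future) enhancer may be used at the decoder.
    allow_updated_enhancer: bool = False
    # Whether different enhancement modules achieving the same type of
    # enhancement may be used at the decoder side.
    allow_equivalent_modules: bool = True
```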

Referring now to the example of fig. 5, there is illustrated an example of an encoder configured to perform the above-described method. The encoder 100 may comprise a core encoder 101 configured to core encode raw audio data at a low bitrate to obtain encoded audio data. The encoder 100 may further be configured to generate enhancement metadata 102 to be used for controlling the type and/or amount of audio enhancement at the decoder side after core decoding of the encoded audio data. As already mentioned above, the generation of the enhanced metadata may be performed automatically. Alternatively or additionally, the generation of the enhanced metadata may involve user input. And the encoder may comprise an output unit 103 configured to output the encoded audio data and enhancement metadata (which is then passed to the consumer to control audio enhancement on the decoding side according to mode 1, or to the distributor according to mode 2). Alternatively or additionally, the encoder may be implemented as a device 400 comprising one or more processors 401, 402 configured to perform the above-described method, as exemplarily illustrated in fig. 9.

Alternatively or additionally, the above method may be implemented by a respective computer program product comprising a computer readable storage medium having instructions adapted for causing a device having processing capabilities to perform the above method when executed on the device.

Generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata

Referring now to the example of fig. 6, an example of a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata is illustrated. In step S301, audio data encoded at a low bit rate and enhancement metadata are received. The encoded audio data and the enhancement metadata may be received, for example, as a low bit rate audio bitstream.

The received low bit rate audio bitstream (i.e., the signal) may then be separated into the encoded audio data and the enhancement metadata, where the encoded audio data is provided to a core decoder for core decoding and the enhancement metadata is provided to an audio enhancer for audio enhancement.

In step S302, the encoded audio data is core-decoded to obtain core-decoded initial audio data, and then the core-decoded initial audio data is input into the audio enhancer in step S303 to process the core-decoded initial audio data based on the enhancement metadata. As such, audio enhancement may be guided by one or more items of enhancement control data included in the enhancement metadata as detailed above. Since the enhancement metadata may be generated (automatically and/or based on input from the content creator) in view of artistic intent, the enhanced audio data obtained as output from the audio enhancer in step S304 may reflect and preserve artistic intent. In step S305, the enhanced audio data is then output to, for example, a listener (consumer).

In an embodiment, processing the core-decoded initial audio data based on the enhancement metadata may be performed by applying one or more audio enhancement modules in accordance with the enhancement metadata. The audio enhancement module(s) to be applied may be indicated by the enhancement control data included in the enhancement metadata, as detailed above.

Alternatively or additionally, processing the core-decoded initial audio data based on the enhancement metadata may be performed by an automatically updated audio enhancer, if a corresponding admissibility is indicated in the enhancement control data as detailed above.

Although the type of audio enhancer is not limited, in an embodiment, the audio enhancer may be a generator. The generator itself is not limited. The generator may be a generator trained in a generative adversarial network (GAN) setting, but other generative models are also conceivable. A sampleRNN- or WaveNet-based model is also conceivable.
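A minimal sketch of the decoder-side flow of steps S301 to S305 under these assumptions is given below; parse_bitstream and core_decode are hypothetical placeholders, and the control data fields follow the illustrative EnhancementControlData sketch given earlier:

```python
def decode_and_enhance(bitstream, core_decode, parse_bitstream, enhancers):
    """Illustrative decoder-side flow (steps S301 to S305); helpers are placeholders."""
    encoded_audio, enhancement_metadata = parse_bitstream(bitstream)    # step S301
    initial_audio = core_decode(encoded_audio)                          # step S302
    control = enhancement_metadata.control_data                         # e.g. EnhancementControlData
    enhanced = initial_audio
    for name, enhancer in enhancers.items():                            # step S303
        if control.allowed_types.get(name, False):
            enhanced = enhancer(enhanced, amount=control.amount)
    return enhanced                                                     # steps S304 and S305
```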

Referring to the example of fig. 7, an example of a decoder configured to perform a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata is illustrated. The decoder 300 may comprise a receiver 301 configured to receive audio data encoded at a low bit rate, and enhancement metadata, e.g., via a low bit rate audio bitstream. The receiver 301 may be configured to provide the enhancement metadata to an audio enhancer 303 (as illustrated by the dashed line) and to provide the encoded audio data to a core decoder 302. In case a low bitrate audio bitstream is received, the receiver 301 may further be configured to separate the received low bitrate audio bitstream into the encoded audio data and the enhancement metadata. Alternatively or additionally, the decoder 300 may comprise a demultiplexer. As already mentioned above, the decoder 300 may comprise a core decoder 302 configured to core decode the encoded audio data to obtain core-decoded initial audio data. The core-decoded initial audio data may then be input to the audio enhancer 303, which is configured to process the core-decoded initial audio data based on the enhancement metadata and to output the enhanced audio data. The audio enhancer 303 may comprise one or more audio enhancement modules to be applied to the core-decoded initial audio data according to the enhancement metadata. Although the type of audio enhancer is not limited, in an embodiment, the audio enhancer may be a generator. The generator itself is not limited. The generator may be a generator trained in a generative adversarial network (GAN) setting, but other generative models are also conceivable. A sampleRNN- or WaveNet-based model is also conceivable.

Alternatively or additionally, the decoder may be implemented as a device 400 comprising one or more processors 401, 402 configured to perform a method for generating enhanced audio data from low bitrate encoded audio data based on enhancement metadata, as exemplarily illustrated in fig. 9. Alternatively or additionally, the above method may be implemented by a respective computer program product comprising a computer readable storage medium having instructions adapted for causing a device having processing capabilities to perform the above method when executed on the device.

Referring now to the example of fig. 8, the above method may also be implemented by a system having an encoder and a corresponding decoder, the encoder being configured to perform a method for low bit-rate encoding of audio data and generating enhancement metadata for controlling audio enhancement of the low bit-rate encoded audio data at the decoder side, and the decoder being configured to perform a method for generating enhanced audio data from the low bit-rate encoded audio data based on the enhancement metadata. As illustrated by the example of fig. 8, enhancement metadata is transmitted from an encoder to a decoder via a bitstream of encoded audio data.

The enhancement metadata parameters may further be updated at some reasonable frequency; for example, a fraction of a second or a few frames is a reasonable temporal resolution for segment boundaries, with segments on the order of seconds to hours. The interface of the system may allow switching settings live in real time, changing settings at particular points in time in a file, or both.
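For example, segment-wise updates of the enhancement metadata parameters might be represented as a simple list of timed settings, reusing the illustrative EnhancementControlData sketch from above (times and values are purely hypothetical):

```python
# Hypothetical timed enhancement settings: (start time in seconds, control data).
timed_enhancement_settings = [
    (0.0,   EnhancementControlData(amount=0.8)),   # full clean-up allowed
    (212.5, EnhancementControlData(amount=0.0)),   # intentional codec noise: no enhancement
    (245.0, EnhancementControlData(amount=0.8)),   # resume clean-up
]
```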

Additionally, a cloud storage mechanism may be provided for a user (e.g., content creator) to update enhanced metadata parameters for given content. This may work in conjunction with the IDAT (ID and timing) metadata information carried in the codec, which may provide an index for the content item.

Interpretation

Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present disclosure, discussions utilizing terms such as "processing," "computing," "calculating," "determining," "analyzing," or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.

In a similar manner, the term "processor" may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory, to transform that electronic data into other electronic data that may be stored, e.g., in registers and/or memory. A "computer" or "computing machine" or "computing platform" may include one or more processors.

In one example embodiment, the methods described herein may be performed by one or more processors that accept computer-readable (also referred to as machine-readable) code containing a set of instructions that, when executed by the one or more processors, perform at least one of the methods described herein. Including any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further comprise a memory subsystem comprising main RAM and/or static RAM and/or ROM. A bus subsystem may be included for communication between the components. The processing system may further be a distributed processing system in which the processors are coupled together by a network. If the processing system requires a display, such a display may be included, for example, a Liquid Crystal Display (LCD) or Cathode Ray Tube (CRT) display. If manual data entry is required, the processing system may also include input devices such as one or more of an alphanumeric input unit (e.g., a keyboard), a pointing control device (e.g., a mouse), etc. The processing system may also encompass a storage system such as a disk drive unit. The processing system in some configurations may include a sound output device and a network interface device. The memory subsystem thus includes a computer-readable carrier medium carrying computer-readable code (e.g., software) comprising a set of instructions which, when executed by one or more processors, cause performance of one or more of the methods described herein. It should be noted that when the method includes several elements (e.g., several steps), no order of any of the elements is implied unless specifically stated. The software may reside in the hard disk, or may also reside, completely or at least partially, within the RAM and/or the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute a computer readable carrier medium carrying computer readable code. Furthermore, a computer readable carrier medium may be formed or included in the computer program product.

In alternative example embodiments, one or more processors may operate as a standalone device or may be connected (e.g., networked) to other processors in a networked deployment, and may operate in the capacity of a server or a user machine in a server-user network environment, or as peer machines in a peer-to-peer or distributed network environment. The one or more processors may form a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

It should be noted that the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

Accordingly, one example embodiment of each method described herein is in the form of a computer-readable carrier medium carrying a set of instructions, e.g., a computer program for execution on one or more processors (e.g., one or more processors that are part of a web server apparatus). Thus, as will be appreciated by one skilled in the art, example embodiments of the present disclosure may be embodied as a method, an apparatus, such as a special purpose apparatus, an apparatus, such as a data processing system, or a computer readable carrier medium (e.g., a computer program product). A computer-readable carrier medium carries computer-readable code comprising a set of instructions which, when executed on one or more processors, causes the one or more processors to implement a method. Accordingly, aspects of the present disclosure may take the form of a method, an entirely hardware exemplary embodiment, an entirely software exemplary embodiment or an exemplary embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.

The software may further be transmitted or received over a network via the network interface device. While the carrier medium is a single medium in the example embodiments, the term "carrier medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "carrier medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by one or more of the processors and that cause the one or more processors to perform any one or more of the methodologies of the present disclosure. A carrier medium may take many forms, including but not limited to, non-volatile media, and transmission media. Non-volatile media includes, for example, optical, magnetic disks, and magneto-optical disks. Volatile media includes dynamic memory, such as main memory. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise the bus subsystem. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications. For example, the term "carrier medium" shall accordingly be taken to include, but not be limited to, solid-state memories, computer products embodied in optical and magnetic media; a medium carrying a propagated signal detectable by at least one processor or one or more processors and representing a set of instructions which, when executed, implement a method; and a transmission medium in the network carrying a propagated signal that is detectable by at least one of the one or more processors and represents a set of instructions.

It will be understood that in one example embodiment, the steps of the discussed method are performed by an appropriate processor (or processors) in a processing (e.g., computer) system executing instructions (computer-readable code) stored in a storage device. It will also be understood that the present disclosure is not limited to any particular implementation or programming technique, and that the present disclosure may be implemented using any suitable technique for implementing the functionality described herein. The present disclosure is not limited to any particular programming language or operating system.

Reference throughout this disclosure to "one example embodiment," "some example embodiments," or "example embodiments" means that a particular feature, structure, or characteristic described in connection with the example embodiments is included in at least one example embodiment of the present disclosure. Thus, the appearances of the phrases "in one example embodiment," "in some example embodiments," or "in an example embodiment" in various places throughout this disclosure are not necessarily all referring to the same example embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art in view of this disclosure, in one or more example embodiments.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

In the claims below and the description herein, any one of the terms "comprising", "comprised of" or "which comprises" is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limited to the means or elements or steps listed thereafter. For example, the scope of the expression "a device comprising elements A and B" should not be limited to devices consisting of only elements A and B. Any one of the terms "including" or "which includes" or "that includes" as used herein is also an open term that likewise means including at least the elements/features that follow the term, but not excluding others. Thus, "including" is synonymous with and means "comprising".

It should be appreciated that in the foregoing description of example embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single example embodiment/figure or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed example embodiment. Thus, the claims following the description are hereby expressly incorporated into this description, with each claim standing on its own as a separate example embodiment of this disclosure.

Moreover, although some example embodiments described herein include some but not other features included in other example embodiments, combinations of features of different example embodiments are intended to be within the scope of the present disclosure and form different example embodiments, as will be appreciated by those of skill in the art. For example, in the following claims, any of the claimed example embodiments may be used in any combination.

In the description provided herein, numerous specific details are set forth. However, it is understood that example embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail to avoid obscuring the understanding of this description.

Therefore, while there has been described what is considered to be the best mode of the disclosure, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the disclosure, and it is intended to claim all such changes and modifications as fall within the scope of the disclosure. For example, any of the formulas given above are merely representative of processes that may be used. Functions may be added or deleted from the block diagrams and operations may be interchanged among the functional blocks. Steps may be added or deleted to methods described within the scope of the present disclosure.
