Apparatus for post-processing audio signals using transient position detection

Document No.: 1472248 Publication date: 2020-02-21

Abstract: "Apparatus for post-processing audio signals using transient position detection" was designed by 萨沙·迪施, 克里斯蒂安·乌勒, 帕特里克·甘普, 丹尼尔·里奇特, 奥利弗·赫尔穆特 and 于尔根·赫 on 2018-03-28. Main content: An apparatus for post-processing an audio signal, comprising: a converter (100) for converting the audio signal into a time-frequency representation; a transient position estimator (120) for estimating a temporal position of a transient portion using the audio signal or the time-frequency representation; and a signal manipulator (140) for manipulating the time-frequency representation, wherein the signal manipulator (140) is configured to reduce or eliminate pre-echoes in the time-frequency representation at a time position before the transient position, or to perform shaping of the time-frequency representation at the transient position, to amplify the attack of the transient portion.

1. An apparatus for post-processing (20) an audio signal, comprising:

a converter (100) for converting the audio signal into a time-frequency representation;

a transient position estimator (120) for estimating a temporal position of a transient portion using the audio signal or the temporal frequency representation; and

a signal manipulator (140) for manipulating a time-frequency representation, wherein the signal manipulator is configured to reduce (220) or eliminate pre-echoes in the time-frequency representation at a time position before a transient position, or to perform shaping (500) of the time-frequency representation at a transient position, to amplify an attack of the transient portion.

2. The apparatus of claim 1,

wherein the signal manipulator (140) comprises a pitch estimator (200) for detecting tonal signal components in the time-frequency representation temporally preceding the transient portion, and

wherein the signal manipulator (140) is configured to apply the pre-echo reduction or cancellation (220) in a frequency-selective manner such that, at frequencies where tonal signal components have been detected, the signal manipulation is reduced or switched off compared to frequencies where tonal signal components have not been detected.

3. The apparatus of claim 1 or 2, wherein the signal manipulator (140) comprises a pre-echo width estimator (240) for estimating a temporal width of a pre-echo before the transient position based on a development of the signal energy of the audio signal over time, in order to determine a pre-echo start frame in the time-frequency representation comprising a plurality of subsequent audio signal frames.

4. The apparatus of any one of the preceding claims,

wherein the signal manipulator (140) comprises a pre-echo threshold estimator (260) for estimating a pre-echo threshold for spectral values in the time-frequency representation within a pre-echo width, wherein the pre-echo threshold indicates a magnitude threshold of the corresponding spectral values after the pre-echo reduction or cancellation.

5. The apparatus of claim 4,

wherein the pre-echo threshold estimator (260) is configured to determine the pre-echo threshold using a weighting curve having an increasing characteristic from the start of the pre-echo width to the transient position.

6. The apparatus of any one of the preceding claims, wherein the pre-echo threshold estimator (260) is configured to:

smooth (330) the time-frequency representation over a plurality of subsequent frames of the time-frequency representation, and

weight (340) the smoothed time-frequency representation using a weighting curve having an increasing characteristic from the start of the pre-echo width to the transient position.

7. The apparatus of any one of the preceding claims, wherein the signal manipulator (140) comprises:

a spectral weight calculator (300, 160) for calculating respective spectral weights for spectral values of the time-frequency representation; and

a spectral weighter (320) for weighting the spectral values of the time-frequency representation using the spectral weights to obtain a manipulated time-frequency representation.

8. The apparatus according to claim 7, wherein the spectral weight calculator (300) is configured to:

determine (450) original spectral weights using actual spectral values and target spectral values, or

smooth (460) the original spectral weights over frequency within a frame of the time-frequency representation, or

fade in (430) the pre-echo reduction or cancellation over a plurality of frames at the beginning of the pre-echo width using a fading curve, or

determine (420) the target spectral values such that spectral values having an amplitude below the pre-echo threshold are unaffected by the signal manipulation, or

determine (420) the target spectral values using a pre-masking model (410) such that the attenuation of spectral values in the pre-echo region is reduced based on the pre-masking model (410).

9. The apparatus of any one of the preceding claims,

wherein the time-frequency representation comprises complex-valued spectral values, and

wherein the signal manipulator (140) is configured to apply real-valued spectral weighting values to the complex-valued spectral values.

10. The apparatus of any one of the preceding claims,

wherein the signal manipulator (140) is configured to amplify (500) spectral values within a transient frame of the time-frequency representation.

11. The apparatus of any one of the preceding claims,

wherein the signal manipulator (140) is configured to amplify only spectral values above a minimum frequency, the minimum frequency being higher than 250 Hz and lower than 2 kHz.

12. The apparatus of any one of the preceding claims,

wherein the signal manipulator (140) is configured to divide (630) the time-frequency representation at the transient position into a sustained portion and a transient portion,

wherein the signal manipulator (140) is configured to amplify only the transient portion and not the sustained portion.

13. The apparatus of any one of the preceding claims,

wherein the signal manipulator (140) is configured to also amplify a time portion of the time-frequency representation following the transient position using a fade-out characteristic (685).

14. The apparatus of any one of the preceding claims,

wherein the signal manipulator (140) is configured to calculate (680) spectral weighting factors for the spectral values using the sustained portion of the spectral values, the amplified transient portion and the magnitudes of the spectral values, wherein the amount of amplification of the amplified portion is predetermined and is between 150% and 300%, or

wherein the spectral weights are smoothed (690) over frequency.

15. The apparatus of any one of the preceding claims, further comprising:

a spectrum-to-time converter for converting (370) the manipulated time-frequency representation into the time domain using an overlap-and-add operation involving at least adjacent frames of the time-frequency representation.

16. The apparatus of any one of the preceding claims,

wherein the converter (100) is configured to apply a hop size between 1 ms and 3 ms or an analysis window having a window length between 2 ms and 6 ms, or

wherein the spectrum-to-time converter (370) is configured to use an overlap range corresponding to the overlap size of the overlapping windows or to the hop size between 1 ms and 3 ms used by the converter, or to use a synthesis window having a window length between 2 ms and 6 ms, or wherein the analysis window and the synthesis window are identical to each other.

17. A method for post-processing (20) an audio signal, comprising:

converting (100) the audio signal into a time-frequency representation;

estimating (120) a temporal position of a transient portion using the audio signal or the time-frequency representation; and

manipulating (140) the time-frequency representation to reduce (220) or eliminate pre-echoes in the time-frequency representation at a time position before the transient position, or performing shaping (500) of the time-frequency representation at the transient position to amplify an attack of the transient portion.

18. A computer program for performing the method of claim 17 when run on a computer or processor.

Technical Field

The present invention relates to audio signal processing and, in particular, to audio signal post-processing to enhance audio quality by removing coding artifacts.

Background

Audio coding is the field of signal compression that uses psychoacoustic knowledge to deal with redundancy and irrelevancy in audio signals. Under low-bit-rate conditions, unwanted artifacts are often introduced into the audio signal. Prominent artifacts are temporal pre- and post-echoes triggered by transient signal components.

These pre- and post-echoes occur especially in block-based audio processing, because quantization noise, e.g. of the spectral coefficients in a frequency-domain transform coder, spreads over the entire duration of a block. Semi-parametric coding tools such as gap filling, parametric spatial audio or bandwidth extension may also cause band-limited echo artifacts, since parameter-driven adjustments typically occur within a time block of samples.

The present invention relates to a non-guided post-processor that reduces or mitigates transient subjective quality impairments that have been introduced by perceptual transform coding.

Prior-art methods for preventing pre- and post-echo artifacts within the codec include transform codec block switching and temporal noise shaping. Prior-art methods for suppressing pre- and post-echo artifacts by post-processing after the codec chain are disclosed in [1] and [2]:

[1] Imen Samali, Mania Turki-Hadj Alauane, Gael Mahe, "Temporal Envelope Correction for Attack Restoration in Low Bit-Rate Audio Coding", 17th European Signal Processing Conference (EUSIPCO 2009), Glasgow, Scotland, August 24-28, 2009; and

[2] Jimmy Lapierre and Roch Lefebvre, "Pre-Echo Noise Reduction in Frequency-Domain Audio Codecs", ICASSP 2017, New Orleans.

The first category of methods has to be inserted into the codec chain and cannot be applied a posteriori to items that have been encoded previously (e.g. archived sound material). Even though the second category is implemented essentially as a post-processor to the decoder, it still requires control information derived from the original input signal at the encoder side.

Disclosure of Invention

It is an object of the invention to provide an improved concept for post-processing an audio signal.

This object is achieved by an apparatus for post-processing an audio signal according to claim 1, a method of post-processing an audio signal according to claim 17 or a computer program according to claim 18.

One aspect of the present invention is based on the finding that transients can still be located in audio signals that have been subjected to earlier encoding and decoding, because such earlier encoding/decoding operations, although degrading the perceptual quality, do not completely eliminate transients. Accordingly, a transient position estimator is provided for estimating a temporal position of a transient portion using the audio signal or a time-frequency representation of the audio signal. According to the invention, the time-frequency representation of the audio signal is manipulated to reduce or eliminate pre-echoes in the time-frequency representation at time positions preceding the transient position, or to perform shaping of the time-frequency representation at the transient position and, depending on the implementation, after the transient position, so that the attack of the transient portion is amplified.

According to the invention, signal manipulation is performed within a time-frequency representation of the audio signal based on the detected transient position. Processing in the frequency domain yields, on the one hand, a rather accurate transient position detection and a correspondingly useful pre-echo reduction and, on the other hand, an attack amplification. Owing to the overlap-add operation, the final frequency-time conversion then automatically smooths and distributes the manipulation over the entire frame and over more than one frame. This avoids audible clicks caused by the manipulation of the audio signal and results in an improved audio signal without any pre-echo or with a reduced amount of pre-echo on the one hand and/or with sharpened attacks for transient portions on the other hand.

The preferred embodiments relate to a non-guided post-processor that reduces or mitigates transient subjective quality impairments that have been introduced by perceptual transform coding.

According to another aspect of the invention, the transient improvement processing is performed without a specific need for a transient position estimator. In this aspect, a time-spectrum converter is used for converting the audio signal into a spectral representation comprising a sequence of spectral frames. A prediction analyzer then calculates prediction filter data for a prediction over frequency within a spectral frame, and a subsequently connected shaping filter controlled by the prediction filter data shapes the spectral frame to enhance a transient portion within the spectral frame. The post-processing of the audio signal is completed by a spectrum-time conversion that converts the sequence of spectral frames, including the shaped spectral frame, back into the time domain.

Thus, once again, all modifications are made within a spectral representation rather than a time-domain representation, which avoids audible clicks or similar artifacts caused by time-domain processing. Furthermore, because the prediction analyzer calculates prediction filter data for a prediction over frequency within a spectral frame, the corresponding temporal envelope of the audio signal is automatically affected by the subsequent shaping. In particular, owing to the processing in the spectral domain and the prediction over frequency, the shaping enhances the temporal envelope of the audio signal, i.e. the temporal envelope has higher peaks and deeper valleys. In other words, the shaping performs the opposite of smoothing and thereby enhances transients automatically without actually locating them.

Preferably, two kinds of prediction filter data are derived. The first prediction filter data is prediction filter data for a flattening filter characteristic, and the second prediction filter data is prediction filter data for a shaping filter characteristic. In other words, the flattening filter characteristic is an inverse filter characteristic, and the shaping filter characteristic is a prediction synthesis filter characteristic. Again, both sets of filter data are derived by performing a prediction over frequency within a spectral frame. Preferably, the time constants used for deriving the two sets of filter coefficients differ: a first time constant is used for calculating the first prediction filter coefficients and a second time constant for calculating the second prediction filter coefficients, the second time constant being larger than the first. This processing again automatically ensures that transient signal portions are affected more strongly than non-transient signal portions. In other words, although the processing does not rely on an explicit transient detection method, the flattening and the subsequent shaping based on different time constants affect transient portions more than non-transient portions.

Thus, according to the invention and owing to the application of the prediction over frequency, an automatic kind of transient improvement procedure is obtained, in which the temporal envelope is enhanced rather than smoothed.

Embodiments of the present invention are designed as post-processors that operate on previously encoded sound material without the need for further guidance information. These embodiments can therefore be applied to archived sound material that was impaired by perceptual coding applied before it was archived.

A preferred embodiment of the first aspect comprises the following main process steps:

unguided detection of transient positions within the signal;

estimating the pre-echo duration and strength preceding the transient;

deriving a suitable temporal gain curve for attenuating the pre-echo artifact;

ducking/attenuating the estimated pre-echo by means of the adapted temporal gain curve before the transient (to mitigate the pre-echo);

at the attack, mitigating the dispersion of the attack;

excluding tonal or other quasi-stationary spectral bands from the ducking.

A preferred embodiment of the second aspect comprises the following main process steps:

unguided detection of transient positions within the signal (this step is optional);

sharpening the attack envelope by applying a frequency domain linear prediction coefficient (FD-LPC) flattening filter and a subsequent FD-LPC shaping filter, the flattening filter representing a smoothed temporal envelope and the shaping filter representing a less smooth temporal envelope, wherein the prediction gains of both filters are compensated.
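As an illustration of the second aspect, the following Python sketch flattens and then reshapes a single spectral frame by linear prediction over frequency. It is a minimal sketch under stated assumptions, not the implementation of the embodiments: the function names, the use of an FFT frame, the prediction order, and the Gaussian lag window that stands in for the two time constants (`tau_flat`, `tau_shape`) are all hypothetical choices.

```python
import numpy as np

def lpc_over_frequency(spec, order, tau):
    """Coefficients a_i predicting spec[k] from spec[k-1..k-order] (autocorrelation method)."""
    n = len(spec)
    # Autocorrelation along the frequency axis: r[l] = sum_k conj(X[k]) * X[k+l].
    r = np.array([np.vdot(spec[:n - l], spec[l:]) for l in range(order + 1)])
    # Hypothetical Gaussian lag window; a small tau gives a smoother temporal envelope.
    r = r * np.exp(-0.5 * (np.arange(order + 1) / tau) ** 2)
    r[0] = r[0] * (1.0 + 1e-9) + 1e-12          # light regularization
    idx = np.arange(1, order + 1)
    diff = idx[:, None] - idx[None, :]
    # Hermitian Toeplitz matrix R(j - i) of the normal equations.
    big_r = np.where(diff >= 0, r[np.abs(diff)], np.conj(r[np.abs(diff)]))
    return np.linalg.solve(big_r, r[1:])

def flatten(spec, a):
    """FIR prediction-error (inverse) filter applied along frequency."""
    out = spec.astype(complex)
    for i, a_i in enumerate(a, start=1):
        out[i:] -= a_i * spec[:-i]
    return out

def shape(residual, b):
    """All-pole prediction synthesis filter applied along frequency."""
    out = np.zeros(len(residual), dtype=complex)
    for k in range(len(residual)):
        acc = residual[k]
        for i in range(1, min(k, len(b)) + 1):
            acc += b[i - 1] * out[k - i]
        out[k] = acc
    return out

def enhance_frame(frame, order=8, tau_flat=2.0, tau_shape=8.0):
    """Flatten with the smoother envelope model, reshape with the less smooth one."""
    spec = np.fft.rfft(frame)
    a = lpc_over_frequency(spec, order, tau_flat)    # first prediction filter data
    b = lpc_over_frequency(spec, order, tau_shape)   # second prediction filter data
    shaped = shape(flatten(spec, a), b)
    # Compensate the overall gain change introduced by the two filters.
    gain = np.sqrt(np.sum(np.abs(spec) ** 2) / max(np.sum(np.abs(shaped) ** 2), 1e-20))
    return np.fft.irfft(gain * shaped, n=len(frame))
```

When both lag-window parameters are equal, the shaping filter exactly inverts the flattening filter and the frame is returned unchanged; any envelope change arises only from the difference between the two filters.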

A preferred embodiment is a post-processor that implements unguided transient enhancement as the last step in a multi-step processing chain. If other enhancement techniques are to be applied, e.g. unguided bandwidth extension or spectral gap filling, the transient enhancement is preferably the last in the chain, so that it includes and is effective on the signal modifications introduced by the preceding enhancement stages.

All aspects of the invention may be implemented as a post-processor. One, two or three modules may be computed serially or may share common modules (e.g. (I)STFT, transient detection, pitch detection) for computational efficiency.

It should be noted that the two aspects described herein may be used independently of each other or together for post-processing an audio signal. The first aspect, which relies on transient position detection with pre-echo reduction and attack amplification, may be used to enhance a signal without the second aspect. Correspondingly, the second aspect, which is based on an LPC analysis over frequency and the corresponding shaping filtering in the frequency domain, does not necessarily rely on transient detection but enhances transients automatically without an explicit transient position detector. This embodiment may be enhanced by a transient position detector, but such a detector is not required. Furthermore, the second aspect may be applied independently of the first aspect. It is also emphasized that, in other embodiments, the second aspect may be applied to an audio signal that has been post-processed by the first aspect. Alternatively, the order may be reversed, so that the second aspect is applied in a first step and the first aspect is applied subsequently, in order to improve the audio quality of the signal by removing earlier introduced coding artifacts.

Furthermore, it should be noted that the first aspect basically has two sub-aspects. The first sub-aspect is pre-echo reduction based on transient position detection, and the second sub-aspect is attack amplification based on transient position detection. Preferably, the two sub-aspects are combined in series, wherein even more preferably pre-echo reduction is performed first, followed by attack amplification. However, in other embodiments, the two sub-aspects may be implemented independently of each other and may even be combined with the second aspect as appropriate. Thus, pre-echo reduction may be combined with the prediction-based transient enhancement process without any attack amplification. In other embodiments, no pre-echo reduction is performed, but attack amplification is performed together with the subsequent LPC-based transient shaping, which does not necessarily require transient position detection.

In a combined embodiment, the first aspect, including both sub-aspects, and the second aspect are performed in a specific order: pre-echo reduction is performed first, attack amplification second, and the LPC-based attack/transient enhancement procedure, which is based on a prediction over frequency within a spectral frame, third.

Drawings

Preferred embodiments of the present invention will be discussed subsequently with reference to the accompanying drawings, in which:

FIG. 1 is a schematic block diagram according to a first aspect;

FIG. 2a is a preferred embodiment of the first aspect based on a pitch estimator;

FIG. 2b is a preferred embodiment of the first aspect based on pre-echo width estimation;

FIG. 2c is a preferred embodiment of the first aspect based on pre-echo threshold estimation;

FIG. 2d is a preferred embodiment of the first sub-aspect relating to pre-echo reduction/cancellation;

FIG. 3a is a preferred embodiment of the first sub-aspect;

FIG. 3b is a preferred embodiment of the first sub-aspect;

FIG. 4 is a further preferred embodiment of the first sub-aspect;

FIG. 5 illustrates two sub-aspects of the first aspect of the invention;

FIG. 6a shows an overview of the second sub-aspect;

FIG. 6b shows a preferred embodiment of the second sub-aspect relying on a division into a transient portion and a sustained portion;

FIG. 6c illustrates a further embodiment of the division of FIG. 6b;

FIG. 6d shows a further embodiment of the second sub-aspect;

FIG. 6e shows a further embodiment of the second sub-aspect;

FIG. 7 shows a block diagram of an embodiment of a second aspect of the present invention;

FIG. 8a shows a preferred embodiment of the second aspect based on two different filter data;

FIG. 8b shows a preferred embodiment of the second aspect for calculating two different prediction filter data;

FIG. 8c shows a preferred embodiment of the shaping filter of FIG. 7;

FIG. 8d shows a further embodiment of the shaping filter of FIG. 7;

FIG. 8e shows a further embodiment of the second aspect of the invention;

FIG. 8f shows a preferred embodiment of LPC filter estimation using different time constants;

FIG. 9 shows an overview of a preferred embodiment of a post-processing procedure that relies on the first and second sub-aspects of the first aspect of the present invention and additionally on the second aspect of the present invention, performed on the output of the procedure based on the first aspect;

FIG. 10a shows a preferred embodiment of a transient position detector;

FIG. 10b illustrates a preferred embodiment of the detection function calculation of FIG. 10a;

FIG. 10c shows a preferred embodiment of the onset selector of FIG. 10a;

FIG. 11 shows a general setup of the invention according to the first and/or second aspect as a transient enhancement post-processor;

FIG. 12.1 shows moving average filtering;

FIG. 12.2 shows single-pole recursive averaging and high-pass filtering;

FIG. 12.3 shows temporal signal prediction and residual;

FIG. 12.4 shows the autocorrelation of the prediction error;

FIG. 12.5 shows spectral envelope estimation using LPC;

FIG. 12.6 shows temporal envelope estimation using LPC;

FIG. 12.7 illustrates attack transients versus frequency domain transients;

FIG. 12.8 shows the spectrum of the "frequency domain transient";

FIG. 12.9 illustrates the difference between transients, onset points and attack;

FIG. 12.10 shows absolute thresholds in quiet and simultaneous masking;

FIG. 12.11 shows temporal masking;

FIG. 12.12 shows the general structure of a perceptual audio encoder;

FIG. 12.13 shows the general structure of a perceptual audio decoder;

FIG. 12.14 shows bandwidth limitation in perceptual audio coding;

FIG. 12.15 illustrates a degraded attack feature;

FIG. 12.16 shows pre-echo artifacts;

FIG. 13.1 shows a transient enhancement algorithm;

FIG. 13.2 shows transient detection: detection function (castanets);

FIG. 13.3 shows transient detection: detection function (park);

FIG. 13.4 shows a block diagram of a pre-echo reduction method;

FIG. 13.5 illustrates the detection of tonal components;

FIG. 13.6 shows pre-echo width estimation: an exemplary method;

FIG. 13.7 shows pre-echo width estimation: example;

FIG. 13.8 shows pre-echo width estimation: detection function;

FIG. 13.9 shows pre-echo reduction: spectrogram (castanets);

FIG. 13.10 is a graphical representation of pre-echo threshold determination (castanets);

FIG. 13.11 is a graphical illustration of pre-echo threshold determination for tonal components;

FIG. 13.12 shows a parametric fading curve for pre-echo reduction;

FIG. 13.13 shows a model of the pre-masking threshold;

FIG. 13.14 shows the calculation of target amplitude after pre-echo reduction;

FIG. 13.15 shows pre-echo reduction: spectrogram (bell);

FIG. 13.16 illustrates adaptive transient attack enhancement;

FIG. 13.17 shows a fade-out curve for adaptive transient attack enhancement;

FIG. 13.18 shows an autocorrelation window function;

FIG. 13.19 shows the time-domain transfer function of the LPC shaping filter; and

FIG. 13.20 shows LPC envelope shaping: input and output signals.

Detailed Description

FIG. 1 shows an apparatus for post-processing an audio signal using transient position detection. In particular, the apparatus for post-processing is placed within a general framework as shown in FIG. 11. FIG. 11 shows the input of an impaired audio signal at 10. This input is forwarded to the transient enhancement post-processor 20, which outputs an enhanced audio signal as shown at 30 in FIG. 11.

The apparatus 20 for post-processing shown in FIG. 1 comprises a converter 100 for converting the audio signal into a time-frequency representation. Furthermore, the apparatus comprises a transient position estimator 120 for estimating a temporal position of a transient portion. The transient position estimator 120 operates either on the time-frequency representation, as shown by the connection between the converter 100 and the transient position estimator 120, or on the audio signal in the time domain. This alternative is shown by the dashed line in FIG. 1. Furthermore, the apparatus comprises a signal manipulator 140 for manipulating the time-frequency representation. The signal manipulator 140 is configured to reduce or eliminate a pre-echo in the time-frequency representation at a time position before the transient position, wherein the transient position is signaled by the transient position estimator 120. Alternatively or additionally, the signal manipulator 140 is configured to perform shaping of the time-frequency representation, received via the line between the converter 100 and the signal manipulator 140, at the transient position so that the attack of the transient portion is amplified.

Thus, the apparatus for post-processing in FIG. 1 reduces or eliminates pre-echoes and/or shapes the time-frequency representation to amplify the attack of the transient portion.

FIG. 2a shows a pitch estimator 200. In particular, the signal manipulator 140 of FIG. 1 comprises this pitch estimator 200 for detecting tonal signal components in the time-frequency representation temporally preceding the transient portion. The signal manipulator 140 is configured to apply the pre-echo reduction or cancellation in a frequency-selective manner such that, at frequencies where tonal signal components have been detected, the signal manipulation is reduced or turned off compared to frequencies where tonal signal components have not been detected. In this embodiment, the pre-echo reduction/cancellation shown in block 220 is thus switched on or off, or at least gradually reduced, in a frequency-selective manner at frequency locations in a particular frame where tonal signal components have been detected. This ensures that tonal signal components are not manipulated, since a tonal signal component typically cannot be a pre-echo or a transient at the same time. The reason is that a transient is typically a broadband effect that affects many frequency bins simultaneously, whereas a tonal component is a specific frequency bin with peak energy in a given frame, while the other frequencies in that frame have only low energy.
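The frequency-selective switching described above can be sketched in Python as follows. This is only an illustrative sketch: the detector (a bin counts as tonal if it stands well above the median magnitude of every frame preceding the transient), the names `tonal_bins` and `protect_tonal_bins`, and the ratio `peak_ratio` are assumptions made here and are not taken from the embodiments.

```python
import numpy as np

def tonal_bins(mag, first, last, peak_ratio=4.0):
    """Boolean mask over frequency bins that look tonal in frames first..last-1.

    mag has shape (frames, bins). A bin is flagged when it exceeds
    peak_ratio times the median magnitude of its frame in every
    inspected frame (hypothetical criterion).
    """
    segment = mag[first:last]
    floor = np.median(segment, axis=1, keepdims=True) + 1e-12
    return np.all(segment > peak_ratio * floor, axis=0)

def protect_tonal_bins(gains, tonal):
    """Switch the pre-echo attenuation off (gain 1.0) at tonal bins."""
    out = gains.copy()
    out[:, tonal] = 1.0
    return out
```

A gradual reduction instead of a hard switch could be obtained by interpolating between the computed gain and 1.0 at the flagged bins.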

Furthermore, as shown in fig. 2b, the signal manipulator 140 comprises a pre-echo width estimator 240. The block is configured to estimate a temporal width of a pre-echo prior to the transient position. This estimation ensures that the appropriate time portion before the transient position is manipulated by the signal manipulator 140 in order to reduce or eliminate the pre-echo. The estimation of the pre-echo width in time is based on the development of the signal energy of the audio signal over time in order to determine a pre-echo start frame in a time-frequency representation comprising a plurality of subsequent audio signal frames. Typically, this development of signal energy of the audio signal over time will be an increasing or constant signal energy, but will not be a decreasing energy development over time.
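A pre-echo width estimate of this kind can be sketched in Python as follows. The rule used here, walking backward from the transient frame while the frame energy does not grow toward the past and stopping at a maximum width, is a simplifying assumption for illustration; the names `frame_energy`, `pre_echo_start` and `max_width` are hypothetical.

```python
import numpy as np

def frame_energy(spec):
    """Energy per frame of a time-frequency representation of shape (frames, bins)."""
    return np.sum(np.abs(spec) ** 2, axis=1)

def pre_echo_start(energy, transient_frame, max_width=8):
    """Index of the assumed pre-echo start frame.

    Steps backward from the transient frame as long as the energy is
    increasing or constant when read forward in time, limited to
    max_width frames.
    """
    start = transient_frame
    while (start > 0
           and transient_frame - start < max_width
           and energy[start - 1] <= energy[start]):
        start -= 1
    return start
```

For the energies `[1, 3, 2, 1, 1.5, 2.5, 4, 50]` with the transient in frame 7, the walk stops at frame 3, giving a pre-echo width of four frames.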

FIG. 2d shows a block diagram of a preferred embodiment of the post-processing according to the first sub-aspect of the first aspect of the present invention, i.e. where pre-echo reduction or cancellation or, as labeled in FIG. 2d, pre-echo "ducking" is performed.

The impaired audio signal is provided at input 10 and fed into the converter 100, which is preferably implemented as a short-time Fourier transform analyzer operating with a certain block length and with overlapping blocks.

Furthermore, the pitch estimator 200 discussed with respect to FIG. 2a is provided for controlling the pre-echo ducking stage 320, which applies the pre-echo ducking curve 160 to the time-frequency representation generated by block 100 in order to reduce or eliminate the pre-echo. The output of block 320 is then converted back into the time domain by the frequency-to-time converter 370. This frequency-to-time converter is preferably implemented as an inverse short-time Fourier transform synthesis block that uses an overlap-add operation to fade in/fade out from each block to the next, thereby avoiding blocking artifacts.

The result of block 370 is an output of the enhanced audio signal 30.
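The analysis, weighting and overlap-add synthesis chain of blocks 100, 320 and 370 can be sketched in Python with NumPy alone. The square-root Hann windows, the 128-sample window with a 64-sample hop, and the function names (`stft`, `istft`, `post_process`, `weight_fn`) are assumptions chosen for this illustration, not parameters prescribed by the embodiments.

```python
import numpy as np

def sqrt_hann(n):
    """Square root of a periodic Hann window, used for analysis and synthesis."""
    return np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n))

def stft(x, n_win, hop):
    """Short-time Fourier transform; returns an array of shape (frames, bins)."""
    win = sqrt_hann(n_win)
    xp = np.concatenate([np.zeros(n_win), x, np.zeros(n_win)])
    starts = range(0, len(xp) - n_win + 1, hop)
    return np.array([np.fft.rfft(win * xp[s:s + n_win]) for s in starts])

def istft(spec, n_win, hop, length):
    """Inverse STFT with overlap-add and window-sum normalization."""
    win = sqrt_hann(n_win)
    out = np.zeros(n_win + hop * (len(spec) - 1))
    norm = np.zeros_like(out)
    for m, frame in enumerate(spec):
        s = m * hop
        out[s:s + n_win] += win * np.fft.irfft(frame, n=n_win)
        norm[s:s + n_win] += win ** 2
    out /= np.maximum(norm, 1e-12)
    return out[n_win:n_win + length]

def post_process(x, weight_fn, n_win=128, hop=64):
    """Apply real-valued spectral weights to the complex STFT and resynthesize."""
    spec = stft(x, n_win, hop)
    weights = weight_fn(np.abs(spec))   # same shape as spec, real-valued
    return istft(weights * spec, n_win, hop, len(x))
```

Because each output sample is the sum of several overlapping, windowed frames, a weight applied to one frame is cross-faded into its neighbours rather than switched abruptly. With all weights equal to one, the chain reproduces its input.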

Preferably, the pre-echo ducking curve block 160 is controlled by a pre-echo estimator 150, which collects pre-echo-related characteristics such as the pre-echo width determined by block 240 of FIG. 2b, the pre-echo threshold determined by block 260, or other pre-echo characteristics discussed with respect to FIGS. 3a, 3b and 4.

Preferably, as depicted in FIG. 3a, the pre-echo ducking curve 160 can be regarded as a weighting matrix that holds a specific frequency-domain weighting factor for each frequency bin of the plurality of time frames generated by block 100. FIG. 3a shows the pre-echo threshold estimator 260 controlling the spectral weighting matrix calculator 300, which corresponds to block 160 in FIG. 2d and in turn controls the spectral weighter 320, which corresponds to the pre-echo ducking operation 320 of FIG. 2d.

Preferably, the pre-echo threshold estimator 260 is controlled by the pre-echo width and also receives information about the time-frequency representation. The same holds for the spectral weighting matrix calculator 300 and, of course, for the spectral weighter 320. The spectral weighter 320 finally applies the weighting factor matrix to the time-frequency representation to generate a frequency-domain output signal in which pre-echoes are reduced or eliminated. Preferably, the spectral weighting matrix calculator 300 operates in a specific frequency range at or above 700 Hz, and preferably at or above 800 Hz. Furthermore, the spectral weighting matrix calculator 300 computes weighting factors only for the pre-echo region, which additionally depends on the overlap-add characteristic applied by the converter 100 of Fig. 1. Furthermore, the pre-echo threshold estimator 260 is configured to estimate a pre-echo threshold for spectral values in the time-frequency representation within the pre-echo width, e.g. as determined by block 240 of Fig. 2b, wherein the pre-echo threshold indicates the magnitude that the corresponding spectral values should have after the pre-echo reduction or elimination, i.e. a magnitude that should correspond to the true signal magnitude without the pre-echo.

Preferably, the pre-echo threshold estimator 260 is configured to determine the pre-echo threshold using a weighting curve having an increasing characteristic from the start of the pre-echo width to the transient position. In particular, this weighting curve is determined by block 350 in Fig. 3b based on the pre-echo width indicated by M_pre. Then, in block 340, the weighting curve C_m is applied to the spectral values, which have previously been smoothed by block 330. Then, as shown in block 360, the minimum value is selected as the threshold for each frequency index k. Thus, according to a preferred embodiment, the pre-echo threshold estimator 260 is configured to smooth 330 the time-frequency representation over a plurality of subsequent frames of the time-frequency representation, and to weight 340 the smoothed time-frequency representation using a weighting curve having an increasing characteristic from the start of the pre-echo width to the transient position. This increasing characteristic ensures that a certain increase or decrease of the energy of the normal "signal", i.e. the signal without pre-echo artifacts, is allowed.
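The smoothing, weighting and minimum selection of blocks 330 to 360 can be illustrated with a minimal sketch. The causal moving average used for smoothing, the linear ramp used as weighting curve and the parameter names `smooth_len`, `c_start` and `c_end` are assumptions made for illustration only; the text does not specify these forms.

```python
def pre_echo_threshold(mag, m_pre, smooth_len=3, c_start=1.0, c_end=2.0):
    """Estimate one threshold th_k per frequency bin k.

    mag[k][m] holds the magnitudes |X_{k,m}| of bin k for the frames
    of the pre-echo width (frame m_pre - 1 is the last one before the transient).
    """
    thresholds = []
    for row in mag:
        frames = row[:m_pre]
        # Block 330 (assumed form): causal moving average over neighbouring frames.
        smoothed = []
        for m in range(len(frames)):
            window = frames[max(0, m - smooth_len + 1):m + 1]
            smoothed.append(sum(window) / len(window))
        # Blocks 350/340 (assumed form): linear ramp C_m rising towards the transient.
        span = max(1, len(frames) - 1)
        weighted = [(c_start + (c_end - c_start) * m / span) * s
                    for m, s in enumerate(smoothed)]
        # Block 360: the minimum over the pre-echo frames is the threshold of bin k.
        thresholds.append(min(weighted))
    return thresholds


th = pre_echo_threshold([[1.0, 1.0, 1.0, 1.0], [4.0, 2.0, 1.0, 1.0]], m_pre=4)
```

Because the ramp grows towards the transient, magnitudes close to the transient must be clearly smaller than earlier ones before they lower the threshold.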

In a further embodiment, the signal manipulator 140 is configured to calculate individual spectral weights for the spectral values of the time-frequency representation using the spectral weight calculator 300, 160. Furthermore, a spectral weighter 320 is provided for weighting the spectral values of the time-frequency representation with the spectral weights to obtain a manipulated time-frequency representation. Thus, the manipulation is performed in the frequency domain by using the weights and by weighting the individual time/frequency bins as generated by the converter 100 of Fig. 1.

Preferably, the spectral weights are calculated as shown in the particular embodiment of Fig. 4. The spectral weighter 320 receives the time-frequency representation X_{k,m} as a first input and the spectral weights as a second input. These spectral weights are calculated by a raw weight calculator 450, which is configured to determine raw spectral weights from the actual spectral values and the target spectral values, both of which are input into this block. The raw weight calculator operates as shown in equation 4.18 given later, but other implementations that rely on actual values on the one hand and target values on the other hand are also useful. Furthermore, alternatively or additionally, the spectral weights are smoothed over time in order to avoid artifacts and to avoid overly strong changes from one frame to the next.

Preferably, the target values input into the raw weight calculator 450 are calculated by the look-ahead masking modeler 420 in particular. The look-ahead masking modeler 420 preferably operates according to equation 4.26, defined later, but other implementations that rely on psychoacoustic effects and in particular on look-ahead masking characteristics that typically occur for transients may also be used. The look-ahead masking modeler 420 is controlled on the one hand by the masking estimator 410, which masking estimator 410 specifically computes the masking in dependence of look-ahead masking type acoustic effects. In an embodiment, the masking estimator 410 operates according to equation 4.21 described later, but alternatively other masking estimates that rely on psychoacoustic look-ahead masking effects may be applied.

Furthermore, fader 430 is used to fade in the reduction or elimination of the pre-echo using a fading curve over a number of frames at the beginning of the pre-echo width. This fading curve is preferably controlled by the actual values in a particular frame and by the determined pre-echo threshold th_k. The fader 430 ensures that the pre-echo reduction/elimination does not start abruptly but is faded in smoothly. A preferred implementation is shown later in connection with equation 4.20, but other fading operations are also useful. Preferably, fader 430 is controlled by a fading curve estimator 440, which is in turn controlled by the pre-echo width M_pre as determined, e.g., by pre-echo width estimator 240. Embodiments of the fading curve estimator operate according to equation 4.19 discussed later, but other implementations are also useful. All these operations of blocks 410, 420, 430, 440 serve to calculate a specific target value so that finally, together with the actual value, a specific weight can be determined by block 450, which is then applied to the time-frequency representation and, in particular, to the specific time/frequency bin, preferably after smoothing.

Naturally, the target value can also be determined without any look-ahead masking psychoacoustic effects and without any fading. The target value would then directly be the threshold th_k. It has been found, however, that the particular calculations performed by blocks 410, 420, 430, 440 result in an improved pre-echo reduction in the output signal of the spectral weighter 320.

Thus, it is preferred to determine the target spectral values such that spectral values having an amplitude below the pre-echo threshold are not affected by the signal manipulation, or to determine the target spectral values using the look-ahead masking model 410, 420 such that the attenuation of spectral values in the pre-echo region is reduced based on the look-ahead masking model 410.

Preferably, the algorithm executed in the converter 100 is such that the time-frequency representation comprises complex-valued spectral values. However, on the other hand, the signal manipulator is configured to apply real-valued spectral weighting values to complex-valued spectral values such that after the manipulation in block 320, only the amplitude has changed, but the phase is the same as before the manipulation.
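The effect of real-valued weights on complex-valued spectral values can be checked with a small sketch. The function name `apply_spectral_weights` and the example values are hypothetical; the weights are assumed to be non-negative, since a negative real weight would flip the phase by 180 degrees.

```python
import cmath


def apply_spectral_weights(spectrum, weights):
    """Multiply each complex spectral value X[k][m] by its real weight W[k][m]."""
    return [[w * x for x, w in zip(x_row, w_row)]
            for x_row, w_row in zip(spectrum, weights)]


X = [[3 + 4j, 1 - 1j], [0.5j, -2 + 0j]]   # toy time-frequency representation
W = [[0.5, 1.0], [0.25, 0.8]]             # non-negative real-valued weights
Y = apply_spectral_weights(X, W)

# Magnitudes are scaled by the weights, phases are left untouched.
magnitudes_scaled = all(
    abs(abs(y) - w * abs(x)) < 1e-12
    for xr, wr, yr in zip(X, W, Y) for x, w, y in zip(xr, wr, yr))
phases_equal = all(
    abs(cmath.phase(y) - cmath.phase(x)) < 1e-12
    for xr, yr in zip(X, Y) for x, y in zip(xr, yr))
```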

Fig. 5 illustrates a preferred embodiment of the signal manipulator 140 of Fig. 1. In particular, the signal manipulator 140 includes a pre-echo reducer/eliminator shown at 220 operating before the transient position, or includes an attack amplifier shown at block 500 operating after/at the transient position. Both blocks 220, 500 are controlled by the transient position determined by the transient position estimator 120. Within the first aspect of the invention, the pre-echo reducer 220 corresponds to the first sub-aspect and block 500 corresponds to the second sub-aspect. Both sub-aspects may be used as alternatives to each other, i.e. one without the other, as indicated by the dashed lines in Fig. 5. It is preferred, however, to use both operations in the particular order shown in Fig. 5, i.e. the pre-echo reducer 220 is operative and the output of the pre-echo reducer/eliminator 220 is input into the attack amplifier 500.

Fig. 6a shows a preferred embodiment of the attack amplifier 500. The attack amplifier 500 includes a spectral weight calculator 610 and a subsequently connected spectral weighter 620. Thus, the signal manipulator is configured to amplify 500 spectral values within the transient frame of the time-frequency representation, and preferably additionally to amplify spectral values within one or more frames following the transient frame within the time-frequency representation.

Preferably, the signal manipulator 140 is configured to amplify only spectral values above a minimum frequency, wherein the minimum frequency is greater than 250 Hz and lower than 2 kHz. Since the attack at the beginning of the transient position typically extends over the entire high frequency range of the signal, amplification up to the upper boundary frequency may be performed.

Preferably, the signal manipulator 140, and in particular the attack amplifier 500 of Fig. 5, comprises a divider 630 for dividing the frame into a transient part on the one hand and a sustained part on the other hand. The transient part is then subjected to spectral weighting, and the spectral weights are additionally calculated depending on information about the transient part. Only the transient part is spectrally weighted, and the result of blocks 610, 620 in Fig. 6b on the one hand and the sustained part output by the divider 630 on the other hand are finally combined in a combiner 640 to output an audio signal in which the attack has been amplified. Thus, the signal manipulator 140 is configured to divide 630 the time-frequency representation at the transient position, and preferably also in frames following the transient position, into a sustained part and a transient part. The signal manipulator 140 is configured to amplify only the transient part and not to amplify or manipulate the sustained part.

As depicted, the signal manipulator 140 is configured to also amplify a time portion of the time-frequency representation after the transient position using a fade-out characteristic 685, as shown in block 680. In particular, the spectral weight calculator 610 comprises a weighting factor determiner 680 that receives information on the transient part on the one hand and on the sustained part on the other hand, the fade-out curve G_m 685, and preferably also information on the amplitude of the corresponding spectral value X_{k,m}. Preferably, the weighting factor determiner 680 operates according to equation 4.29 discussed later, but other embodiments that rely on information about the transient part, the sustained part and the fade-out characteristic 685 are also useful.

After the weighting factor determination 680, smoothing across frequency is performed in block 690. At the output of block 690, the weighting factors for the individual frequency values are available and ready for use by the spectral weighter 620 for spectrally weighting the time/frequency representation. Preferably, the amount of amplification of the amplified portion, determined e.g. by the maximum of the fade-out characteristic 685, is predetermined and lies between 300% and 150%. In the preferred embodiment, a maximum amplification factor of 2.2 is used, which decreases over a number of frames to a value of 1, this value being reached after 60 frames, for example, as shown in Fig. 13.17. Although Fig. 13.17 shows an exponential decay, other decays, such as linear or cosine decays, may be used.
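A fade-out characteristic of this kind can be sketched as follows. Only the maximum factor of 2.2, the end value of 1 and the length of 60 frames come from the text; the function name `fade_out_gain`, the `decay` rate and the normalisation that forces the exponential curve to reach exactly 1 at the last frame are assumptions for illustration.

```python
import math


def fade_out_gain(num_frames=60, g_max=2.2, shape="exponential", decay=0.1):
    """Return gains G_m for m = 0..num_frames, falling from g_max to 1."""
    gains = []
    for m in range(num_frames + 1):
        t = m / num_frames
        if shape == "exponential":
            # Normalised exponential so that the curve ends exactly at 0.
            end = math.exp(-decay * num_frames)
            e = (math.exp(-decay * m) - end) / (1.0 - end)
        elif shape == "linear":
            e = 1.0 - t
        elif shape == "cosine":
            e = 0.5 * (1.0 + math.cos(math.pi * t))
        else:
            raise ValueError("unknown shape: " + shape)
        gains.append(1.0 + (g_max - 1.0) * e)
    return gains


curves = {s: fade_out_gain(shape=s) for s in ("exponential", "linear", "cosine")}
```

All three shapes start at the maximum amplification in the transient frame and return to unity gain, so frames far from the transient are left unchanged.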

Preferably, the result of the signal manipulation 140 is converted from the frequency domain to the time domain using a spectral-to-time converter 370 shown in fig. 2 d. Preferably, the spectro-temporal converter 370 applies overlap-add operations involving at least two adjacent frames of the time-frequency representation, but a multi-folding procedure may also be used, where an overlap of three or four frames is used.

Preferably, the converter 100 on the one hand and the converter 370 on the other hand apply the same hop size between 1 ms and 3 ms, or an analysis window with a window length between 2 ms and 6 ms. Preferably, the overlap range on the one hand and the hop size or the window on the other hand, as applied by the time-to-frequency converter 100 and the frequency-to-time converter 370, are equal to each other.

Fig. 7 shows an apparatus 20 for post-processing an audio signal according to a second aspect of the present invention. The apparatus comprises a time-to-spectrum converter 700 for converting the audio signal into a spectral representation comprising a sequence of spectral frames. Additionally, a prediction analyzer 720 is used for calculating prediction filter data for a prediction over frequency within a spectral frame. The prediction analyzer 720, operating over frequency, produces filter data for a frame, and this filter data is used by the shaping filter 740 to shape the frame so as to enhance a transient portion within the spectral frame. The output of the shaping filter 740 is forwarded to a spectrum-to-time converter 760, which converts the sequence of spectral frames comprising the shaped spectral frame into the time domain.

Preferably, the predictive analyzer 720 on the one hand or the shaping filter 740 on the other hand operate without explicit transient location detection. Instead, the temporal envelope of the audio signal is manipulated due to the prediction of the frequency applied by block 720 and due to the shaping of the enhanced transient portion generated by block 740, such that the transient portion is automatically enhanced without any specific transient detection. However, as appropriate, blocks 720, 740 may also be supported by explicit transient position detection to ensure that any possible artifacts are not pushed into the audio signal at non-transient portions.

Preferably, the prediction analyzer 720 is configured to calculate first prediction filter data 720a for a flattening filter characteristic 740a and second prediction filter data 720b for a shaping filter characteristic 740b, as shown in Fig. 8a. In particular, the prediction analyzer 720 receives as input a complete frame of the sequence of frames and then performs a prediction analysis over frequency in order to obtain the flattening filter data or the shaping filter data. The flattening filter characteristic is a filter characteristic that ultimately resembles an inverse filter, which can also be represented by an FIR (finite impulse response) characteristic 740a, whereas the second filter data used for shaping corresponds to the synthesis or IIR filter characteristic (IIR = infinite impulse response) shown at 740b.

Preferably, the degree of shaping represented by the second filter data 720b is larger than the degree of flatness 720a represented by the first filter data, so that after applying the shaping filters having the characteristics 740a, 740b, a kind of "over-shaping" of the signal is obtained, which results in a time envelope that is less flat than the original time envelope. This is exactly what is needed for transient enhancement.

Although fig. 8a shows the case where two different filter characteristics (one shaping filter and one flattening filter) are calculated, other embodiments rely on a single shaping filter characteristic. This is due to the fact that the signal can of course also be shaped without prior flattening, so that finally a highly shaped signal with automatically improved transients is obtained again. This effect of over-shaping can be controlled by the transient position detector, but is not required due to the preferred implementation of signal manipulation that affects the non-transient portion less automatically than the transient portion. Both processes rely entirely on the fact that the prediction analyzer 720 applies a prediction of the frequency in order to obtain information about the temporal envelope of the time domain signal, which is then processed in order to enhance the transient characteristics of the audio signal.

In this embodiment, the autocorrelation signal 800 is calculated from the spectral frame, as shown at 800 in fig. 8 b. The results of block 800 are then windowed using a window having a first time constant, as shown in block 802. Further, as shown at block 804, a window having a second time constant greater than the first time constant is used to window the autocorrelation signal obtained by block 800. From the resulting signal obtained from block 802, first prediction filter data is preferably computed by applying the Levinson-Durbin recursion, as shown in block 806. Similarly, second prediction filter data 808 is computed from the block 804 using the larger time constant. Again, block 808 preferably uses the same Levinson-Durbin algorithm.
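The first half of this processing chain (blocks 800, 802 and 804) can be sketched as follows. For brevity the sketch assumes a real-valued spectral frame, and it assumes a Gaussian lag window parameterised by a time constant `tau`; the function names, the example frame and the values of `tau` are hypothetical. The two windowed autocorrelation sequences would then each be passed to a Levinson-Durbin recursion (blocks 806 and 808).

```python
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation of a spectral frame along the frequency axis (block 800)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lag_window(r, tau):
    """Assumed Gaussian lag window; a larger tau decays more slowly over the lag k."""
    return [r_k * math.exp(-0.5 * (k / tau) ** 2) for k, r_k in enumerate(r)]


frame = [1.0, 0.5, -0.3, 0.8, 0.1, -0.6, 0.2, 0.4]   # toy spectral frame
r = autocorrelation(frame, max_lag=4)
r_first = lag_window(r, tau=1.0)    # block 802: first (smaller) time constant
r_second = lag_window(r, tau=3.0)   # block 804: second (larger) time constant
```

The window with the larger time constant keeps more of the autocorrelation at high lags, which is where the two branches start to differ for transient frames.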

Because the autocorrelation signal is windowed with windows having two different time constants, an automatic transient enhancement is obtained. Typically, the windowing is such that the different time constants have an effect only on one class of signals and no effect on the other class. Transient signals are indeed affected by the two different time constants, whereas non-transient signals have an autocorrelation signal for which windowing with the second, larger time constant yields almost the same output as windowing with the first time constant. With respect to Fig. 13.18, this is because non-transient signals do not have any significant peaks at high time lags, so that using two different time constants makes no difference for these signals. This is different for transient signals, however. Transient signals have peaks at higher time lags, so that applying different time constants to an autocorrelation signal that actually has peaks at higher time lags, as shown at 1300 in Fig. 13.18 for example, results in different outputs for the different windowing operations.

The shaping filter may be implemented in many different ways depending on the implementation. In fig. 8c is shown a way of cascading a flat sub-filter controlled by first filter data 806, indicated by 809, and a shaping sub-filter controlled by second filter data 808, indicated by 810, and a gain compensator 811, also implemented in cascade.

However, these two different filter characteristics and gain compensation may also be implemented within a single shaping filter 740, and the combined filter characteristic of shaping filter 740 is calculated by a filter characteristic combiner 820, filter characteristic combiner 820 depending on the first and second filter data on the one hand and on the gain of the first and second filter data on the other hand to finally also implement gain compensation function 811. Thus, with respect to the fig. 8d embodiment applying a combined filter, the frame is input into a single shaping filter 740 and the output is a shaped frame having on the one hand the filter characteristics and on the other hand the gain compensation function implemented thereon.

Fig. 8e shows a further embodiment of the second aspect of the invention, where the function of the combined shaping filter 740 of fig. 8d is shown to be identical to that of fig. 8c, but it should be noted that fig. 8e may actually be an embodiment of three separate stages 809, 810, 811, but at the same time be seen as a logical representation actually implemented using a single filter with filter characteristics with a numerator having inverse/flat filter characteristics and a denominator having composite characteristics, and where gain compensation is additionally included, as shown for example in equation 4.33, which is determined later.

Fig. 8f shows the windowing operation performed by blocks 802, 804 of Fig. 8b, where r(k) is the autocorrelation signal, w_lag is the window and r'(k) is the windowed output, i.e. the output of blocks 802, 804. Additionally, an exemplary window function is shown, which ultimately represents an exponential decay filter with two different time constants that can be set by using a specific value for a in Fig. 8f.

Thus, applying a window to the autocorrelation values prior to the Levinson-Durbin recursion results in an extension of the temporal support at local temporal peaks. In particular, Fig. 8f depicts an extension using a Gaussian window. The embodiments herein rely on this idea to derive a temporal flattening filter that has a larger extension of temporal support at local non-flat envelopes than the subsequent shaping filter, by selecting different values for a. Together, these filters result in a sharpening of temporal onsets in the signal. Finally, the prediction gain of the filters is compensated so that the spectral energy of the filtered spectral region is preserved.

Thus, as shown in Figs. 8a to 8e, a signal flow of the frequency-domain LPC based attack shaping is obtained.

Fig. 9 shows a preferred implementation of an embodiment that relies on the first aspect, shown by blocks 100 to 370 in Fig. 9, and on the subsequently executed second aspect, shown by blocks 700 to 760. Preferably, the second aspect relies on a separate time-spectrum conversion using a large frame size, e.g. a frame size of 512 samples with 50% overlap. The first aspect, on the other hand, relies on a small frame size in order to have a better time resolution for the transient position detection. Such a smaller frame size is, for example, a frame size of 128 samples with 50% overlap. In general, however, it is preferred to use separate time-spectrum conversions for the first and the second aspect, with a larger frame size for the second aspect (lower time resolution but higher frequency resolution) and a higher time resolution with a correspondingly lower frequency resolution for the first aspect.
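To relate the frame sizes in samples to durations, a short calculation can be made. The sampling rate of 44.1 kHz is an assumption for illustration only; it is not stated in the text, and the function name `frame_timing` is hypothetical.

```python
def frame_timing(frame_size, overlap, sample_rate):
    """Return (hop size in samples, window length in ms, hop size in ms)."""
    hop = int(round(frame_size * (1.0 - overlap)))
    return hop, 1000.0 * frame_size / sample_rate, 1000.0 * hop / sample_rate


SAMPLE_RATE = 44100  # assumed sampling rate in Hz

# First aspect: small frames for good time resolution.
hop_small, win_ms_small, hop_ms_small = frame_timing(128, 0.5, SAMPLE_RATE)
# Second aspect: large frames for good frequency resolution.
hop_large, win_ms_large, hop_ms_large = frame_timing(512, 0.5, SAMPLE_RATE)
```

Under this assumed sampling rate, the 128-sample frame with 50% overlap corresponds to a window of roughly 2.9 ms and a hop of roughly 1.45 ms.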

Fig. 10a illustrates a preferred embodiment of the transient position estimator 120 of fig. 1. The transient position estimator 120 may be implemented as known in the art, but in a preferred embodiment it relies on the detection function calculator 1000 and a subsequently connected start point selector 1100, such that a binary value for each frame is finally obtained indicating the presence of a transient start point in the frame.

The detection function calculator 1000 relies on several steps shown in Fig. 10b. In block 1020, energy values are summed. In block 1030, the temporal envelopes are computed. Subsequently, in step 1040, the temporal envelope of each band-pass signal is high-pass filtered. In step 1050, the resulting high-pass filtered signals are summed in the frequency direction, and in block 1060, time lag masking is taken into account, so that the detection function is finally obtained.

Fig. 10c shows a preferred way of choosing from the starting point of the detection function as obtained by block 1060. In step 1110, a local maximum (peak) is found in the detection function. In block 1120, a threshold comparison is performed so that only peaks above a certain minimum threshold are kept for further implementation.

In block 1130, the area around each peak is scanned for a larger peak in order to determine the relevant peaks. The area around a peak extends l_b frames before the peak and l_a frames after the peak.

In block 1140, peaks that are close to each other are discarded, so that the transient onset frame indices m_i are finally determined.
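The selection steps 1110 to 1140 can be sketched as follows. The function name `pick_onsets`, the parameter `min_distance` and the rule of keeping the earlier of two close peaks are assumptions; the text only states that close peaks are discarded.

```python
def pick_onsets(detection, threshold, l_b, l_a, min_distance):
    """Return transient onset frame indices m_i from a detection function."""
    n = len(detection)
    # Step 1110: local maxima of the detection function.
    peaks = [m for m in range(1, n - 1)
             if detection[m - 1] < detection[m] >= detection[m + 1]]
    # Step 1120: keep only peaks above the minimum threshold.
    peaks = [m for m in peaks if detection[m] > threshold]
    # Step 1130: drop a peak if a larger value lies within l_b frames
    # before it or l_a frames after it.
    relevant = []
    for m in peaks:
        lo, hi = max(0, m - l_b), min(n, m + l_a + 1)
        if detection[m] >= max(detection[lo:hi]):
            relevant.append(m)
    # Step 1140 (assumed rule): discard peaks too close to the last accepted one.
    onsets = []
    for m in relevant:
        if not onsets or m - onsets[-1] > min_distance:
            onsets.append(m)
    return onsets


d = [0, 1, 0, 0.2, 0, 5, 3, 4, 0, 0, 0, 0.4, 0, 6, 0]
onsets = pick_onsets(d, threshold=0.5, l_b=2, l_a=2, min_distance=3)
```

In this toy example the peak at frame 7 is removed in step 1130 because the larger peak at frame 5 lies within its search area.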

Subsequently, techniques and auditory concepts used in the proposed transient enhancement method are disclosed. First, some basic digital signal processing techniques will be introduced with respect to selected filtering operations and linear prediction, followed by the definition of transients. Subsequently, the psychoacoustic concept of auditory masking, which is used in perceptual coding of audio content, is explained. This section ends with a brief description of a generic perceptual audio codec and the resulting compression artifacts, which are subject to the enhancement method according to the invention.

Smoothing and differentiating filter

The transient enhancement method described later frequently uses some specific filtering operations. These filters are introduced in the following section. For a more detailed description see [9, 10]. Equation (2.1) describes a finite impulse response (FIR) low-pass filter that computes the current output sample $y_n$ as the mean of the current and past samples of the input signal $x_n$. The filtering process of such a so-called moving average filter is given by

$$y_n = \frac{1}{p+1} \sum_{i=0}^{p} x_{n-i}, \qquad (2.1)$$

where $p$ is the filter order. The top image of Fig. 12.1 shows the result of the moving average filter operation in equation (2.1) for an input signal $x_n$. The output signal $y_n$ in the bottom image was computed by applying the moving average filter twice to $x_n$, once in the forward and once in the backward direction. This compensates for the filter delay and also results in a smoother output signal $y_n$, since $x_n$ is filtered twice.
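The forward and backward application of the moving average filter can be sketched as follows. The sketch assumes that the filter averages over the current and the p previous samples and that samples before the start of the signal are zero; the function names are hypothetical.

```python
def moving_average(x, p):
    """Causal moving average over the current and the p previous samples."""
    return [sum(x[n - i] for i in range(p + 1) if n - i >= 0) / (p + 1)
            for n in range(len(x))]


def moving_average_zero_delay(x, p):
    """Filter forward, then backward, so that the filter delay cancels."""
    forward = moving_average(x, p)
    backward = moving_average(forward[::-1], p)
    return backward[::-1]


x = [0.0, 0.0, 3.0, 0.0, 0.0]
y_forward = moving_average(x, 2)
y_both = moving_average_zero_delay(x, 2)
```

After the forward pass the impulse is smeared towards later samples only; after the additional backward pass the result is symmetric around the original impulse position.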

A different way to smooth the signal is to apply a single-pole recursive averaging filter, which is given by the following difference equation:

$$y_n = b \cdot x_n + (1-b) \cdot y_{n-1}, \quad 1 \le n \le N, \qquad (2.2)$$

where $y_0 = x_1$ and $N$ denotes the number of samples in $x_n$. Fig. 12.2(a) shows the result of a single-pole recursive averaging filter applied to a rectangular function. In (b), the filter is applied in both directions to further smooth the signal. By using two auxiliary signals [equations not reproduced here], where $x_n$ and $y_n$ are the input and output signals of equation (2.2), respectively, the resulting output signals directly follow the attack or decay phase of the input signal. Fig. 12.2(c) shows the first of these signals as a solid black curve and the second as a dashed black curve.
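The single-pole recursive averaging filter of equation (2.2) and its application in both directions can be sketched as follows. The function names and the value b = 0.5 are chosen for illustration only.

```python
def recursive_average(x, b):
    """y_n = b * x_n + (1 - b) * y_{n-1}, initialised with the first input sample."""
    y = []
    prev = x[0]
    for sample in x:
        prev = b * sample + (1.0 - b) * prev
        y.append(prev)
    return y


def recursive_average_both_directions(x, b):
    """Apply the filter forward and then backward for additional smoothing."""
    forward = recursive_average(x, b)
    return recursive_average(forward[::-1], b)[::-1]


rect = [0.0] * 3 + [1.0] * 5 + [0.0] * 3   # rectangular test function
y = recursive_average(rect, 0.5)
y2 = recursive_average_both_directions(rect, 0.5)
```

A smaller b gives stronger smoothing, at the cost of a slower response to the edges of the rectangle.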

Strong amplitude increments or decrements of the input signal $x_n$ can be detected by filtering $x_n$ with an FIR high-pass filter with coefficients $b = [b_0, \ldots, b_p]$ as follows:

$$y_n = \sum_{i=0}^{p} b_i \, x_{n-i},$$

where $b = [1, -1]$ or $b = [1, 0, \ldots, -1]$. The resulting signal after high-pass filtering the rectangular function is shown as a black curve in Fig. 12.2(d).
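The behaviour of these two differentiating filters on a rectangular function can be sketched as follows. The function name `fir_filter` is hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def fir_filter(x, b):
    """y_n = sum_i b_i * x_{n-i}, with x treated as zero before its first sample."""
    return [sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            for n in range(len(x))]


rect = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
d_short = fir_filter(rect, [1.0, -1.0])        # difference of adjacent samples
d_long = fir_filter(rect, [1.0, 0.0, -1.0])    # difference over two samples
```

The output is positive at the rising edge and negative at the falling edge of the rectangle; the longer filter responds over two samples at each edge.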

Linear prediction

Linear prediction (LP) is a useful method for audio coding. Some past studies describe in particular its ability to model the speech production process [11, 12, 13], while others apply it to the analysis of audio signals in general [14, 15, 16, 17]. The following sections are based on [11, 12, 13, 15, 18].

In linear predictive coding (LPC), a sampled time signal $s(nT) = s_n$ (where $T$ is the sampling period) can be predicted by a weighted linear combination of its past values, in the form of

$$s_n = \sum_{r=1}^{p} a_r s_{n-r} + G u_n, \qquad (2.6)$$

where $n$ is the time index identifying a particular time sample of the signal, $p$ is the prediction order, $a_r$ (with $1 \le r \le p$) are the linear prediction coefficients (and, in this case, the filter coefficients of an all-pole infinite impulse response (IIR) filter), $G$ is the gain factor, and $u_n$ is some input signal that excites the model. Taking the z-transform of equation (2.6), the corresponding all-pole transfer function $H(z)$ of the system is

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{r=1}^{p} a_r z^{-r}}, \qquad (2.7)$$

where

$$z = e^{j 2 \pi f T} = e^{j \omega T}.$$

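The IIR filter $H(z)$ is called the synthesis or LPC filter, and the FIR filter

$$A(z) = 1 - \sum_{r=1}^{p} a_r z^{-r}$$

is referred to as the inverse filter. Using the prediction coefficients $a_r$ as the filter coefficients of an FIR filter, a prediction of the signal $s_n$ can be obtained by

$$\hat{s}_n = \sum_{r=1}^{p} a_r s_{n-r}.$$

The prediction error between the predicted signal $\hat{s}_n$ and the actual signal $s_n$ can be expressed as

$$e_{n,p} = s_n - \hat{s}_n = s_n - \sum_{r=1}^{p} a_r s_{n-r}, \qquad (2.10)$$

where the equivalent representation of the prediction error in the z-domain is

$$E_p(z) = S(z) \, A(z). \qquad (2.11)$$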

Fig. 12.3 shows the original signal $s_n$, the predicted signal $\hat{s}_n$ and the difference signal $e_{n,p}$ for a prediction order of $p = 10$. This difference signal $e_{n,p}$ is also known as the residual. In Fig. 12.4, the autocorrelation function of the residual shows almost complete decorrelation between adjacent samples, which indicates that $e_{n,p}$ can be regarded approximately as white Gaussian noise. Using $e_{n,p}$ from equation (2.10) as the input signal $u_n$ in equation (2.6), or filtering $E_p(z)$ from equation (2.11) with the all-pole filter $H(z)$ from equation (2.7) with $G = 1$, the original signal can be perfectly recovered by

$$s_n = \sum_{r=1}^{p} a_r s_{n-r} + e_{n,p}$$

and

$$S(z) = E_p(z) \, H(z), \qquad (2.13)$$

respectively.

As the prediction order $p$ increases, the energy of the residual decreases. Besides the number of predictor coefficients, the residual energy also depends on the coefficients themselves. Therefore, the problem in linear predictive coding is how to obtain the optimal filter coefficients $a_r$ such that the energy of the residual is minimized. First, the total squared error (total energy) of the residual between a windowed signal block $x_n = s_n \cdot w_n$, where $w_n$ is some window function of width $N$, and its prediction $\hat{x}_n$ is taken as

$$E = \sum_{n} \left( x_n - \hat{x}_n \right)^2, \qquad (2.14)$$

where

$$\hat{x}_n = \sum_{r=1}^{p} a_r x_{n-r}.$$

To minimize the total squared error $E$, the gradient of equation (2.14) is computed with respect to each $a_r$ and set to 0 by setting

$$\frac{\partial E}{\partial a_i} = 0, \quad 1 \le i \le p.$$

This leads to the so-called normal equations:

$$\sum_{r=1}^{p} a_r R_{|i-r|} = R_i, \quad 1 \le i \le p. \qquad (2.17)$$

$R_i$ denotes the autocorrelation of the signal $x_n$,

$$R_i = \sum_{n} x_n x_{n-i}.$$

Equation (2.17) forms a system of $p$ linear equations from which the $p$ unknown prediction coefficients $a_r$, $1 \le r \le p$, that minimize the total squared error can be computed. Using equations (2.14) and (2.17), the minimum total squared error $E_p$ is obtained by

$$E_p = R_0 - \sum_{r=1}^{p} a_r R_r. \qquad (2.19)$$

A fast method for solving the normal equations in equation (2.17) is the Levinson-Durbin algorithm [19]. The algorithm works recursively, which brings the advantage that, as the prediction order increases, it yields the predictor coefficients for the current order and for all previous orders smaller than $p$. First, the algorithm is initialized by setting

$$E_0 = R_0.$$

Then, for the prediction orders $m = 1, \ldots, p$, the prediction coefficients $a_r^{(m)}$, which are the coefficients $a_r$ of the current order $m$, are computed as follows, with $k_m$ denoting the reflection coefficient of order $m$:

$$k_m = \frac{R_m - \sum_{r=1}^{m-1} a_r^{(m-1)} R_{m-r}}{E_{m-1}}, \qquad (2.21)$$

$$a_m^{(m)} = k_m, \qquad (2.22)$$

$$a_r^{(m)} = a_r^{(m-1)} - k_m \, a_{m-r}^{(m-1)}, \quad 1 \le r \le m-1, \qquad (2.23)$$

$$E_m = \left( 1 - k_m^2 \right) E_{m-1}. \qquad (2.24)$$

With each iteration, the minimum total squared error $E_m$ of the current order $m$ is computed in equation (2.24). Since $E_m$ is always positive and $E_0 = R_0$, it can be shown that the minimum total energy decreases as $m$ increases, so that

$$0 \le E_m \le E_{m-1}.$$

Thus, the recursion brings the further advantage that the computation of the predictor coefficients can be stopped once $E_m$ falls below a certain threshold.
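The recursion can be illustrated with a minimal sketch. It assumes the sign convention in which the prediction is a positively weighted sum of past samples, so that the normal equations read sum_r a_r R_|i-r| = R_i; the function name `levinson_durbin`, the variable `k` for the reflection coefficient and the parameter `stop_below` are hypothetical names chosen here.

```python
def levinson_durbin(R, p, stop_below=0.0):
    """Solve the normal equations recursively.

    R: autocorrelation values R[0..p]; returns (coefficients a_1..a_m, error E_m),
    where m < p if the error fell to or below stop_below before order p.
    """
    a = []          # a[j] holds a_{j+1} of the current order
    E = R[0]        # initialisation: E_0 = R_0
    for m in range(1, p + 1):
        if E <= stop_below:
            break   # early stop once the residual energy is small enough
        acc = sum(a[j] * R[m - 1 - j] for j in range(m - 1))
        k = (R[m] - acc) / E
        # Update the lower-order coefficients and append the new one.
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        E *= 1.0 - k * k
    return a, E


# Autocorrelation of a first-order autoregressive process with coefficient 0.5.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], p=2)
```

For this input the second coefficient comes out as zero, which reflects that a first-order predictor already captures the correlation structure.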

Envelope estimation in time and frequency domain

An important feature of LPC filters is their ability to model the characteristics of the signal in the frequency domain if the filter coefficients are computed on a time signal. Equivalent to the prediction of the time series, the linear prediction approximates the spectrum of the series. Depending on the prediction order, the LPC filter may be used to calculate a more or less detailed envelope of the signal frequency response. The following sections are based on [11, 12, 13, 14, 16, 17, 20, 21 ].

From equation (2.13) it can be seen that the original signal spectrum can be perfectly reconstructed from the residual spectrum by filtering it with the all-pole filter $H(z)$. By setting $u_n = \delta_n$ in equation (2.6), where $\delta_n$ is the Dirac delta function, the signal spectrum $S(z)$ can be modeled by the all-pole filter $\tilde{S}(z)$ from equation (2.7) as follows:

$$\tilde{S}(z) = H(z) = \frac{G}{1 - \sum_{r=1}^{p} a_r z^{-r}}. \qquad (2.26)$$

With the prediction coefficients $a_r$ computed using the Levinson-Durbin algorithm in equations (2.21)-(2.24), only the gain factor $G$ remains to be determined. With $u_n = \delta_n$, equation (2.6) becomes

$$h_n = \sum_{r=1}^{p} a_r h_{n-r} + G \delta_n, \qquad (2.27)$$

where $h_n$ is the impulse response of the synthesis filter $H(z)$. Analogously to equation (2.17), the autocorrelation $\tilde{R}_i$ of the impulse response $h_n$ is

$$\tilde{R}_i = \sum_{r=1}^{p} a_r \tilde{R}_{|i-r|}, \quad 1 \le i < \infty. \qquad (2.28)$$

By squaring $h_n$ in equation (2.27) and summing over all $n$, the 0th autocorrelation coefficient of the synthesis filter impulse response becomes

$$\tilde{R}_0 = \sum_{n} h_n^2 = \sum_{r=1}^{p} a_r \tilde{R}_r + G^2. \qquad (2.29)$$

Since $R_0 = \sum_n s_n^2$, the 0th autocorrelation coefficient corresponds to the total energy of the signal $s_n$. Under the condition that the total energies in the original signal spectrum $S(z)$ and in its approximation $\tilde{S}(z)$ should be equal, it follows that $\tilde{R}_0 = R_0$. With this conclusion, the autocorrelations of the signal $s_n$ and of the impulse response $h_n$ in equation (2.17) and equation (2.28), respectively, are related by $\tilde{R}_i = R_i$ for $0 \le i \le p$. The gain factor $G$ can be computed by rearranging equation (2.29) and using equation (2.19) as follows:

$$G^2 = R_0 - \sum_{r=1}^{p} a_r R_r = E_p.$$

Fig. 12.5 shows the spectrum S(z) of one frame (1024 samples) of a speech signal s_n. The smoother black curve is the spectral envelope calculated according to equation (2.26) with a prediction order of p = 20. As the prediction order p increases, the approximation fits the original spectrum S(z) ever more closely. The dashed curve is calculated with the same formula as the black curve, but with a prediction order of p = 100. This approximation is visibly more detailed and provides a better fit to S(z). In the limit p → length(s_n), the all-pole filter can even model S(z) exactly, so that the approximation equals S(z), provided that the time signal s_n is minimum phase.

Due to the duality between time and frequency, linear prediction can also be applied to the spectrum of a signal in the frequency domain in order to model its temporal envelope. The temporal envelope estimate is computed in the same way, except that the predictor coefficients are calculated on the signal spectrum and the impulse response of the resulting all-pole filter is then transformed into the time domain. Fig. 2.6 shows the absolute values of the original time signal and two approximations with prediction orders of p = 10 and p = 20. As with the estimation of the frequency response, the temporal approximation with the higher order is the more accurate one.

Transients

In the literature, many different definitions of transients can be found. Some refer to them as onsets or attacks [22, 23, 24, 25], while others use these terms to describe parts of transients [26, 27]. This section describes, for the purposes of this disclosure, different approaches to defining and characterizing transients.

Characterization

Some early definitions of transients describe them as time domain phenomena only, such as found in Kliewer and Mertins [24 ]. They describe the transients as signal segments in the time domain whose energy rises rapidly from a low value to a high value. To define the boundaries of these segments, they use the ratio of the energies within two sliding windows on the time domain energy signal just before and just after the signal sample n. Dividing the energy of the window immediately after n by the energy of the preceding window yields a simple criterion function c (n), the peak of which corresponds to the beginning of the transient period. These peaks occur when the energy just after n is substantially greater than the previous energy, marking the onset of a sharp energy rise. The end of the transient is then defined as the time after the starting point at which c (n) falls below a certain threshold.
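The energy-ratio criterion described above can be illustrated with a short Python sketch. The exact window placement used by Kliewer and Mertins is not given in this text, so the sketch assumes two rectangular windows of equal length L, with the window "after n" starting at sample n itself; the name `energy_ratio_criterion` and the small constant `eps` (which avoids division by zero) are assumptions for illustration.

```python
def energy_ratio_criterion(x, L, eps=1e-12):
    """Ratio of the energy just after sample n to the energy just before it.

    x: time-domain signal (sequence of floats)
    L: length of each sliding rectangular window in samples
    Returns a list c with c[n] = E_after(n) / E_before(n); positions where
    one of the windows would leave the signal are set to 0.
    """
    energy = [v * v for v in x]
    c = [0.0] * len(x)
    for n in range(L, len(x) - L + 1):
        before = sum(energy[n - L:n])     # window immediately preceding n
        after = sum(energy[n:n + L])      # window starting at n
        c[n] = after / (before + eps)
    return c
```

For a signal that jumps from a low to a high level, c(n) peaks at the first sample of the high-level segment, which is the behaviour the criterion relies on.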

Masri and Bateman [28] describe transients as drastic changes in the temporal envelope of the signal, where the signal segments before and after the onset of the transient are highly uncorrelated. The spectrum of a narrow time frame containing a percussive transient event typically shows a large burst of energy across all frequencies, which can be seen in the spectrogram of the castanets in fig. 2.7 (b). Other studies [23, 29, 25] also characterize transients in a time-frequency representation of the signal, where they correspond to time frames with sharp increases in energy occurring simultaneously in several adjacent frequency bands. Rodet and Jaillet [25] additionally point out that this sudden increase in energy is particularly noticeable at higher frequencies, since the total energy of the signal is mainly concentrated in the low-frequency region.

Herre [20] and Zhang et al. [30] characterize transients by the degree of flatness of the temporal envelope. With a sudden increase in energy over time, a transient signal has a very non-flat temporal structure with a correspondingly flat spectral envelope. One way to determine spectral flatness is to apply the Spectral Flatness Measure (SFM) in the frequency domain [31]. The spectral flatness SF of a signal can be calculated as the ratio of the geometric mean Gm to the arithmetic mean Am of the power spectrum:

where |X_k| denotes the magnitude of the spectral coefficient with index k and K denotes the total number of coefficients of the spectrum X_k. If SF → 0, the signal has a non-flat spectral structure and is therefore more likely to be tonal. In contrast, if SF → 1, the spectral envelope is flatter, which may correspond to a transient or to a noise-like signal. A flat spectrum does not strictly indicate a transient, however: in contrast to a noise signal, the phase response of a transient is highly correlated. To determine the flatness of the temporal envelope, the measure in equation (2.31) can be applied analogously in the time domain.
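As a small illustration of the flatness measure, the following Python sketch computes the ratio of geometric to arithmetic mean. It assumes that the input already holds the power spectrum values |X_k|^2; whether equation (2.31) squares the magnitudes is not visible in this text, so that choice, the function name `spectral_flatness` and the constant `eps` are assumptions.

```python
import math

def spectral_flatness(power, eps=1e-12):
    """Spectral flatness SF = Gm / Am of a power spectrum.

    power: non-negative power spectrum values, one per coefficient k
    eps:   small constant that keeps log() and the division well defined
    Returns a value close to 0 for tonal and close to 1 for flat spectra.
    """
    K = len(power)
    # geometric mean computed in the log domain for numerical stability
    log_gm = sum(math.log(p + eps) for p in power) / K
    am = sum(power) / K
    return math.exp(log_gm) / (am + eps)
```

A perfectly flat spectrum gives a value of about 1, while a spectrum dominated by a single strong coefficient gives a value close to 0. Passing squared time-domain samples instead of a power spectrum gives the analogous flatness of the temporal envelope.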

Suresh Babu et al. [27] additionally distinguish attack transients from frequency domain transients. They characterize frequency domain transients by abrupt changes in the spectral envelope between adjacent time frames rather than by the energy changes in the time domain described above. Such signal events may be produced, for example, by a bowed instrument such as a violin, or by a human voice changing the pitch of the sound it produces. Fig. 12.7 shows the difference between attack transients and frequency domain transients. The signal in (c) is an audio signal produced by a violin. The vertical dashed line marks the instant at which the pitch of the signal changes, i.e. the start of a new note or of a frequency domain transient, respectively. In contrast to the attack transient produced by the castanets in (a), this new note onset does not cause a significant change in signal amplitude. The instant of this change in spectral content can be seen in the spectrogram in (d). The spectral difference before and after the transient is more evident in fig. 2.8, which shows two spectra of the violin signal in fig. 12.7 (c), one from the time frame before and one from the time frame after the onset of the frequency domain transient. The harmonic components clearly differ between the two spectra. However, perceptual coding of frequency domain transients does not cause the kinds of artifacts addressed by the restoration algorithms presented here, so frequency domain transients are disregarded. Henceforth, the term "transient" refers only to attack transients.

Distinction between transients, onsets and attacks

A distinction between the concepts of transients, onsets and attacks can be found in Bello et al. [26], and it is adopted here. The differences between these terms are also illustrated in fig. 12.9, using the example of a transient signal produced by castanets.

In general, the authors do not give a complete definition of a transient, but they characterize it as a short time interval rather than a distinct time instant. During this transient period, the amplitude of the signal rises rapidly in a relatively unpredictable manner. However, it is not precisely defined where the transient ends after its amplitude has reached its peak.

In their rather informal definition, they also include a part of the amplitude decay in the transient interval. With this characterization, acoustic instruments produce transients while they are being excited (e.g. when a guitar string is plucked or a snare drum is struck) and subsequently damped. After this initial decay, the following slower signal decay is caused only by the resonance frequencies of the instrument body.

The onset (starting point) is the instant at which the amplitude of the signal starts to rise. For this study, the onset is defined as the start time of the transient.

The attack of a transient is the time period within the transient, between its onset and its peak, during which the amplitude increases.

Psychoacoustics

This section gives a basic introduction to the psychoacoustic concepts used in perceptual audio coding and in the transient enhancement algorithms described later. The objective of psychoacoustics is to describe the relationship between the measurable physical properties of sound signals and the internal percepts these sounds evoke in a listener [32]. Human auditory perception has limitations that perceptual audio encoders can exploit when encoding audio content in order to substantially reduce the bit rate of the encoded audio signal. Although the goal of perceptual audio coding is to encode audio material in such a way that the decoded audio signal sounds exactly like, or as close as possible to, the original signal [1], it may still introduce some audible coding artifacts. This section provides the background needed to understand the origin of these artifacts and how perceptual audio coders make use of psychoacoustic models. The reader is referred to [33, 34] for a more detailed description of psychoacoustics.

Simultaneous masking

Simultaneous masking refers to the psychoacoustic phenomenon that a sound (the masked sound) may be inaudible to a human listener when it is presented simultaneously with a stronger sound (the masking sound) that is close to it in frequency. A widely used example of this phenomenon is a conversation between two people at the side of a road. Without disturbing noise, they can hear each other perfectly, but when a car or truck passes by, they need to raise their voices to keep understanding each other.

Simultaneous masking can be explained by examining how the human auditory system works. When a probe sound is presented to a listener, it induces a traveling wave along the basilar membrane (BM) within the cochlea, which propagates from its base at the oval window to its apex [17]. Starting from the oval window, the vertical displacement of the traveling wave initially rises slowly, reaches its maximum at a specific position, and then falls off abruptly [33, 34]. The position of maximum displacement depends on the frequency of the stimulus. The BM is narrow and stiff at the base and about three times wider and less stiff at the apex. Thus, each position along the BM is most sensitive to a particular frequency, with high-frequency signal components causing the maximum displacement near the base of the BM and low frequencies near the apex. This particular frequency is commonly referred to as the characteristic frequency (CF) [33, 34, 35, 36]. The cochlea can therefore be regarded as a frequency analyzer consisting of a bank of highly overlapping band-pass filters with asymmetric frequency responses, referred to as auditory filters [17, 33, 34, 37]. The passbands of these auditory filters have non-uniform bandwidths, referred to as critical bandwidths. The concept of critical bands was first introduced by Fletcher in 1933 [38, 39]. He assumed that the audibility of a probe sound presented simultaneously with a noise signal depends only on the amount of noise energy that is close in frequency to the probe sound. If the signal-to-noise ratio (SNR) in this frequency region is below a certain threshold, i.e. the energy of the noise signal exceeds the energy of the probe sound by a certain amount, the probe sound is inaudible to a human listener [17, 33, 34]. However, simultaneous masking does not occur only within a single critical band. In fact, a masking sound at the CF of a critical band can also affect the audibility of a masked sound outside the boundaries of this critical band, although to a lesser extent [17].

The simultaneous masking effect is shown in fig. 12.10. The dashed curve represents the threshold in quiet, which "describes the minimum sound pressure level required for a human listener to detect a narrowband sound without other sounds" [32]. The black curve is the simultaneous masking threshold for a narrowband noise masking sound, depicted as a dark gray bar. The masking sound masks a probe sound (light gray bar) if the sound pressure level of the probe sound is below the simultaneous masking threshold at the frequency of the probe sound.

Temporal masking

Masking is effective not only when a masking sound and a masked sound are presented simultaneously, but also when they are separated in time. A probe sound may be masked before and after the time period during which the masking sound is presented [40], which is referred to as pre-masking (leading masking) and post-masking (lag masking), respectively. A graphical representation of the temporal masking effects is shown in fig. 2.11. Pre-masking occurs before the onset of the masking sound, which is depicted for negative values of t. After the pre-masking period, simultaneous masking is active, with an overshoot effect immediately after the masking sound is switched on, during which the simultaneous masking threshold is temporarily increased [37]. After the masking sound is switched off (depicted for positive values of t), post-masking is active. Pre-masking can be explained by the integration time the auditory system needs to produce the perception of a presented sound [40]. Additionally, the auditory system processes loud sounds faster than weak sounds [33]. The time period during which pre-masking occurs depends strongly on the amount of training of the particular listener [17, 34] and may last up to 20 ms [33], but it is significant only for a period of 1-5 ms before the onset of the masking sound [17, 37]. The amount of post-masking depends on the frequency, level and duration of both the masking sound and the probe sound, and on the time period between the instants at which the probe sound and the masking sound are switched off [17, 34]. According to Moore [34], post-masking is effective for at least 20 ms, while other studies show even longer durations of up to about 200 ms [33]. Furthermore, Painter and Spanias state that lag masking "also exhibits frequency dependent behavior similar to simultaneous masking, which can be observed when the relationship of masking tone and detection frequency changes" [17, 34].

Perceptual audio coding

The objective of perceptual audio coding is to compress an audio signal in such a way that the resulting bit rate is as small as possible compared to the original audio while maintaining transparent sound quality, meaning that the reconstructed (decoded) signal should not be distinguishable from the uncompressed signal [1, 17, 32, 37, 41, 42]. This is done by removing redundant and irrelevant information from the input signal, exploiting some of the limitations of the human auditory system. Redundancy can be removed, for example, by exploiting the correlation between successive signal samples, between spectral coefficients or even between different audio channels, and by appropriate entropy coding, whereas irrelevant information is handled by the quantization of the spectral coefficients.

General architecture of a perceptual audio encoder

The basic structure of a mono perceptual audio encoder is depicted in fig. 12.12. First, the input audio signal is transformed into a frequency domain representation by applying an analysis filter bank. In this way, the resulting spectral coefficients can be quantized selectively "depending on their frequency content" [32]. The quantization block rounds the continuous values of the spectral coefficients to a set of discrete values in order to reduce the amount of data in the encoded audio signal. The compression thus becomes lossy, since the exact values of the original signal can no longer be reconstructed at the decoder. The quantization error introduced in this way can be regarded as an additive noise signal, which is referred to as quantization noise. The quantization is controlled by the output of a perceptual model, which calculates temporal and simultaneous masking thresholds for each spectral coefficient in each analysis window. The absolute threshold in quiet can also be used, by assuming that a "4 kHz signal with ± 1 peak amplitude of the least significant bit of the 16 bit integer is at the absolute threshold of hearing" [31]. In the bit allocation block, these masking thresholds are used to determine the number of bits needed so that the induced quantization noise remains inaudible to a human listener. In addition, spectral coefficients that lie below the calculated masking thresholds (and are thus irrelevant to human auditory perception) need not be transmitted and can be quantized to zero. The quantized spectral coefficients are then entropy coded (e.g. by applying Huffman coding or arithmetic coding), which reduces the redundancy in the signal data. Finally, the encoded audio signal and additional side information (e.g. the quantization scale factors) are multiplexed to form a single bitstream, which is then transmitted to the receiver. The audio decoder at the receiver side (see fig. 12.13) then performs the inverse operations by demultiplexing the input bitstream, reconstructing the spectral values using the transmitted scale factors, and applying a synthesis filter bank that is complementary to the analysis filter bank of the encoder in order to reconstruct the output time signal.

Transient coding artifacts

Although the goal of perceptual audio coding is to produce a transparent sound quality of the decoded audio signal, it still exhibits audible artifacts. Some of these artifacts affecting the perceived quality of the transient will be described below.

Birdies and bandwidth limitation

Only a limited number of bits is available to the bit allocation process for quantizing a block of the audio signal. If the bit demand of a frame is too high, some spectral coefficients may be deleted by quantizing them to zero [1, 43, 44]. This essentially causes a temporary loss of some high-frequency content and is mainly a problem in low bit rate coding or when processing very demanding signals, e.g. signals with frequent transient events. The allocation of bits varies from one block to the next, so that the frequency content of a spectral coefficient may be deleted in one frame and present in the next. The resulting spectral gaps are called "birdies" and can be seen in the bottom image of fig. 2.14. Transients in particular are prone to producing birdie artifacts, since the energy in these signal portions is spread across the entire spectrum. A common approach is to limit the bandwidth of the audio signal prior to the encoding process in order to save the available bits for the quantization of the low-frequency (LF) content, which is also shown for the encoded signal in fig. 2.14. This trade-off is reasonable because birdies have a greater impact on the perceived audio quality than a constant loss of bandwidth, which is generally more tolerable. However, birdies may still occur even with bandwidth limitation. Although the transient enhancement methods described later are not aimed at correcting spectral gaps or extending the bandwidth of the coded signal, the loss of high frequencies also causes reduced energy and a degraded transient attack (see fig. 12.15), which is addressed by the attack enhancement method described later.

Pre-echo

Another common compression artifact is the so-called pre-echo [1, 17, 20, 43, 44]. Pre-echoes can occur if a sharp increase in signal energy (i.e. a transient) occurs near the end of a signal block. The substantial energy contained in the transient signal portion is distributed over a wide range of frequencies, which leads to the estimation of comparatively high masking thresholds in the psychoacoustic model and therefore to the allocation of only a few bits for the quantization of the spectral coefficients. During the decoding process, the resulting large amount of quantization noise is then spread over the entire duration of the signal block. For a stationary signal, the quantization noise is assumed to be completely masked, but for a signal block that contains a transient, the quantization noise may precede the transient onset and become audible if it "exceeds the leading masking [ … ] period" [1]. These artifacts are still the subject of ongoing research, even though several methods for dealing with pre-echoes have been proposed. Fig. 12.16 shows an example of a pre-echo artifact for a castanets transient. The dashed black curve is the waveform of the original signal, which has no substantial signal energy before the transient onset. Therefore, the pre-echo preceding the transient of the coded signal (gray curve) is not simultaneously masked and can be perceived even without direct comparison with the original signal. The proposed method for the supplementary reduction of pre-echo noise is described later.

Several methods have been proposed over the past few years to improve the quality of the transient. These enhancement methods can be classified into those methods that are integrated in an audio codec and those methods that work as a post-processing module on a decoded audio signal. An overview of previous studies and methods regarding transient enhancement and transient event detection is given below.

Transient detection

Edler [6] proposed an early method of transient detection in 1989. This detection is used to control the adaptive window switching method, which will be described later in this section. The proposed method only detects, at the audio encoder, whether a transient is present in a signal frame of the original input signal, but not the exact position of the transient within the frame. Two decision criteria are calculated to determine the likelihood that a transient is present in a particular signal frame. For the first criterion, the input signal x(n) is filtered with an FIR high-pass filter according to equation (2.5) with the filter coefficients b = [1, -1]. The resulting difference signal d(n) shows large peaks at the instants where the amplitude changes rapidly between adjacent samples. The ratio of the sums of the magnitudes of d(n) in two neighboring blocks is then used to calculate the first criterion:

where the variable m denotes the frame number and N the number of samples within a frame. However, c₁(m) has difficulty detecting very small transients at the end of a signal frame, because their contribution to the total energy within the frame is rather small. Therefore, a second criterion is defined, which calculates the ratio of the maximum magnitude of x(n) to the mean magnitude within a frame:

If c₁(m) or c₂(m) exceeds a certain threshold, the frame m is determined to contain a transient event.
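The two criteria can be sketched in Python as follows. The formulas are reconstructed from the verbal description only, so the exact summation limits, the handling of the first frame and the constant `eps` are assumptions; the function name `edler_criteria` is hypothetical.

```python
def edler_criteria(x, N, eps=1e-12):
    """Per-frame transient criteria in the spirit of Edler's detector.

    x: input signal, N: number of samples per frame
    Returns (c1, c2), one value per complete frame m.
    c1[m]: sum of |d(n)| in frame m divided by the sum in frame m - 1,
           where d(n) = x(n) - x(n-1) is the output of the FIR high-pass
           filter with coefficients [1, -1]; c1[0] is set to 0 (assumed).
    c2[m]: maximum magnitude of x(n) divided by the mean magnitude in frame m.
    """
    d = [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]
    n_frames = len(x) // N
    diff_sums = [sum(abs(v) for v in d[m * N:(m + 1) * N])
                 for m in range(n_frames)]
    c1 = [0.0] + [diff_sums[m] / (diff_sums[m - 1] + eps)
                  for m in range(1, n_frames)]
    c2 = []
    for m in range(n_frames):
        mags = [abs(v) for v in x[m * N:(m + 1) * N]]
        c2.append(max(mags) / (sum(mags) / N + eps))
    return c1, c2
```

A frame would then be flagged as transient when either value exceeds its threshold; suitable threshold values are not given in the text and would have to be chosen empirically.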

Kliewer and Mertins [24] also propose a detection method that operates exclusively in the time domain. Their approach aims to determine the exact start and end samples of a transient by applying two sliding rectangular windows to the signal energy. The signal energies within the two windows are calculated as

and

where L is the window length and n denotes the signal sample exactly in the middle between the left and the right window. The detection function D(n) is then calculated as

with

Peaks of D(n) that exceed a certain threshold T_b correspond to the onset of a transient. The end of the transient event is determined as the "maximum value of D(n) that is less than some threshold T_e immediately after the onset point" [24].

Other detection methods are based on linear prediction in the time domain and exploit the predictability of the signal waveform to distinguish between transient and steady-state signal portions [45]. Lee and Kuo proposed a method using linear prediction in 2006. They decompose the input signal into several sub-bands and calculate a detection function for each resulting narrowband signal. The detection function is obtained as the output of filtering the narrowband signal with the inverse filter according to equation (2.10). A subsequent peak picking algorithm determines the local maxima of the resulting prediction error signals as onset time candidates for each sub-band signal, which are then used to determine a single transient onset time for the wideband signal.

The method of Niemeyer and Edler [23] works on a complex time-frequency representation of the input signal and determines transient onsets as sharp increases of the signal energy in adjacent frequency bands. Each band-pass signal is filtered according to equation (2.3) to calculate a temporal envelope that follows sudden energy increases and serves as the detection function. The transient criterion is then calculated not only for one frequency band k, but also takes into account K = 7 neighboring frequency bands on either side of k.

In the following, different strategies for enhancing transient signal portions are described. The block diagram in fig. 13.1 gives an overview of the different parts of the restoration algorithm. The algorithm takes the coded signal s_n, represented in the time domain, and transforms it into a time-frequency representation X_{k,m} by means of the short-time Fourier transform (STFT). The enhancement of the transient signal portions is then carried out in the STFT domain. In the first stage of the enhancement algorithm, the pre-echo just before the transient is reduced. The second stage enhances the attack of the transient, and the third stage sharpens the transient using a linear prediction based method. The enhanced signal Y_{k,m} is then transformed back into the time domain with the inverse short-time Fourier transform (ISTFT) to obtain the output signal y_n.

By applying the STFT, the input signal s_n is first divided into frames of length N, which overlap by L samples and are windowed with the analysis window function w_{n,m} to obtain the signal blocks x_{n,m} = s_n · w_{n,m}. Each frame x_{n,m} is then transformed into the frequency domain using the discrete Fourier transform (DFT). This yields the spectrum X_{k,m} of the windowed signal frame x_{n,m}, where k is the spectral coefficient index and m is the frame number. The analysis by the STFT can be expressed by the following equation:

where

and

(N - L) is also referred to as the hop size. For the analysis window w_{n,m}, a sine window of the following form has been used

To capture the fine temporal structure of transient events, the frame size is chosen to be relatively small. For the purposes of this study, it is set to N = 128 samples per time frame, with an overlap of L = N/2 = 64 samples between two adjacent frames. K in equation (4.2) defines the number of DFT points and is set to K = 256. This corresponds to the number of spectral coefficients of the two-sided spectrum of X_{k,m}. Prior to the STFT analysis, each windowed input signal frame is zero-padded to a vector of length K to match the number of DFT points. These parameters give a time resolution fine enough to isolate the transient signal portion in one frame from the rest of the signal, while providing enough spectral coefficients for the subsequent frequency-selective enhancement operations.
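The analysis with the stated parameters (N = 128, L = 64, K = 256) can be sketched in plain Python as follows. The exact window formula is not reproduced in this text, so the sine window w_n = sin(π(n + 0.5)/N) is an assumption, as are the function names `sine_window` and `stft`; a direct DFT is used instead of an FFT purely to keep the sketch dependency-free.

```python
import cmath
import math

def sine_window(N):
    """Sine analysis window (assumed form: sin(pi * (n + 0.5) / N))."""
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]

def stft(s, N=128, L=64, K=256):
    """Short-time Fourier transform with zero-padding to K DFT points.

    s: input signal, N: frame length, L: overlap in samples,
    K: number of DFT points (K >= N, frames are zero-padded to length K).
    Returns a list of frames; each frame holds K complex coefficients X[k].
    """
    hop = N - L                                    # hop size
    w = sine_window(N)
    twiddle = [cmath.exp(-2j * math.pi * i / K) for i in range(K)]
    frames = []
    for start in range(0, len(s) - N + 1, hop):
        x = [s[start + n] * w[n] for n in range(N)]   # windowed block
        # the zero-padded tail contributes nothing, so only N terms are summed
        X = [sum(x[n] * twiddle[(k * n) % K] for n in range(N))
             for k in range(K)]
        frames.append(X)
    return frames
```

With a hop size of 64 samples, a 512-sample input yields 7 complete frames, and each frame carries 256 two-sided spectral coefficients.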

Transient detection

In an embodiment, the methods for transient enhancement are applied exclusively to the transient events themselves, rather than constantly modifying the signal. Therefore, the instants of the transients must be detected. For the purposes of this study, a transient detection method has been implemented that is adjusted individually for each audio signal. This means that the particular parameters and thresholds of the transient detection method, which are described later in this section, are tuned specifically for each sound file to yield an optimal detection of the transient signal portions. The result of this detection is a binary value for each frame, indicating the presence of a transient onset.

The implemented transient detection method can be divided into two separate stages: the calculation of a suitable detection function, and an onset picking method that uses the detection function as its input signal. To incorporate the transient detection into a real-time processing algorithm, an appropriate look-ahead is required, since the subsequent pre-echo reduction method operates in the time interval preceding the detected transient onset.

Calculation of a detection function

For the calculation of the detection function, the input signal is transformed into a representation that enables improved onset detection compared to the original signal. The input to the transient detection block in fig. 13.1 is the time-frequency representation X_{k,m} of the input signal s_n. The calculation of the detection function is done in five steps:

1. For each frame, the energy values of several adjacent spectral coefficients are summed.

2. The temporal envelope of the resulting band-pass signals is calculated over all time frames.

3. The temporal envelope of each band-pass signal is high-pass filtered.

4. The resulting high-pass filtered signals are summed in the frequency direction.

5. Temporal post-masking (lag masking) is taken into account.

Table 4.1: Border frequencies f_low and f_high and bandwidth Δf of the resulting passbands of X_{K,m} after combining n adjacent spectral coefficients of the magnitude energy spectrum of the signal X_{k,m}.

First, for each time frame m, the energies of several adjacent spectral coefficients of X_{k,m} are summed, with n = {2⁰, 2¹, 2², ..., 2⁶} = 2^κ adjacent coefficients per band.

Here, K denotes the index of the resulting sub-band signal. Thus, X_{K,m} represents, for each frame m, the energy contained in a specific frequency band of the spectrum X_{k,m}. The border frequencies f_low and f_high, the passband bandwidth Δf and the number n of combined spectral coefficients are listed in table 4.1. The values of the band-pass signals in X_{K,m} are then smoothed over all time frames. This is done by filtering each sub-band signal X_{K,m} along the time direction with an IIR low-pass filter according to equation (2.2), as follows:

where the filter output is the resulting smoothed energy signal for each channel K. The filter coefficients b and a = 1 - b are adjusted individually for each processed audio signal to yield a satisfactory time constant. The slope of the smoothed signal is then computed by high-pass (HP) filtering each band-pass signal using equation (2.5), as follows:

where S_{K,m} is the differentiated envelope, b_i are the filter coefficients of the deployed FIR high-pass filter and p is the filter order. The specific filter coefficients b_i are also defined individually for each signal. Subsequently, S_{K,m} is summed across all K in the frequency direction to obtain the overall envelope slope F_m. Large peaks in F_m correspond to time frames in which a transient event occurs. To neglect smaller peaks, especially those following larger ones, the amplitude of F_m is reduced by a threshold of 0.1 such that F_m = max(F_m - 0.1, 0). Post-masking after larger peaks is also taken into account by filtering F_m with a single-pole recursive averaging filter equivalent to equation (2.2) and taking, for each frame m, the larger of the filtered value and F_m according to equation (2.3), which yields the resulting detection function D_m.

Fig. 13.2 shows the castanets signal in the time domain and in the STFT domain, with the resulting detection function D_m shown in the bottom image. D_m is then used as the input signal of the onset selection method, which is described in the following section.

Onset selection

Basically, the onset selection method determines the peaks of the detection function D_m as the onset time frames of the transient events in s_n. For the detection function of the castanets signal in Fig. 13.2, this is obviously a trivial task. The result of the onset selection method is shown as red circles in the bottom image. However, other signals do not always produce such an easy-to-handle detection function, so determining the actual transient onsets becomes somewhat more complex. For example, the detection function of the music signal at the bottom of Fig. 13.3 exhibits several local peaks that are not associated with a transient onset frame. The onset selection algorithm must therefore distinguish between "false" and "true" transient onsets.

First, the amplitude of a peak in D_m needs to be above a certain threshold th_peak for the peak to be considered an onset candidate. This prevents small amplitude variations in the envelope of the input signal s_n, which are not handled by the smoothing and post-masking filters in equations (4.5) and (4.7), from being detected as transient onsets. For each value D_m=l of the detection function D_m, the onset selection algorithm scans the area before and after the current frame l for a value larger than D_m=l. If no larger value exists in the l_b frames before and the l_a frames after the current frame, l is determined to be a transient frame. The numbers of "look-back" and "look-ahead" frames l_b and l_a as well as the threshold th_peak are defined individually for each audio signal. After the relevant peaks have been identified, detected transient onset frames that are closer than 50 ms to the previous onset are discarded [50, 51]. The output of the onset selection method (and of the transient detection in general) is the set of indices of the transient onset frames m_i, which are required by the subsequent transient enhancement blocks.
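The selection rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the conversion of 50 ms into frames assumes the hop size of 64 samples at 44.1 kHz mentioned elsewhere in this description, rounded up.

```python
import math

def pick_onsets(detection, th_peak, look_back, look_ahead, min_gap_frames):
    """Return the frame indices that qualify as transient onsets."""
    onsets = []
    for l, value in enumerate(detection):
        # Rule 1: the peak must exceed the threshold th_peak.
        if value <= th_peak:
            continue
        # Rule 2: no larger value within l_b frames before and l_a frames after.
        start = max(0, l - look_back)
        stop = min(len(detection), l + look_ahead + 1)
        neighbours = detection[start:l] + detection[l + 1:stop]
        if any(v > value for v in neighbours):
            continue
        # Rule 3: discard onsets that are too close to the previous onset.
        if onsets and l - onsets[-1] < min_gap_frames:
            continue
        onsets.append(l)
    return onsets

# 50 ms expressed in frames (assumed hop size 64 samples at 44.1 kHz).
MIN_GAP = math.ceil(0.050 * 44100 / 64)
```

For example, `pick_onsets(d, th_peak=0.5, look_back=2, look_ahead=2, min_gap_frames=MIN_GAP)` keeps only the first of two peaks that lie a few frames apart.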

Pre-echo reduction

The purpose of this enhancement stage is to reduce the coding artifact known as pre-echo, which is audible for a certain period of time before the onset of a transient. An overview of the pre-echo reduction algorithm is shown in Fig. 13.4. The pre-echo reduction stage takes the output X_k,m of the STFT analysis (100) and the previously detected transient onset frame indices m_i as input signals. In the worst case, the pre-echo starts before the transient event by up to the length of the long-block analysis window on the encoder side (2048 samples regardless of the codec sampling rate). The duration of this window depends on the sampling frequency of the particular encoder. For the worst case, a minimum codec sampling frequency of 8 kHz is assumed. At the sampling rate of 44.1 kHz of the decoded and resampled input signal s_n, the length of the long analysis window (and thus the potential extent of the pre-echo region) corresponds to N_long = 2048 · 44.1 kHz / 8 kHz ≈ 11290 samples (or 256 ms) of the time signal s_n. Since the enhancement methods described in this section operate on the time-frequency representation X_k,m, N_long needs to be converted into M_long = (N_long − L)/(N − L) = (11290 − 64)/(128 − 64) ≈ 176 frames (rounded up). N and L are the frame size and the overlap of the STFT analysis block (100) in Fig. 13.1. M_long is set as the upper limit of the pre-echo width and is used to limit the search area for the pre-echo start frame before a detected transient onset frame m_i. For this study, the sampling rate of the decoded signal before resampling was taken as ground truth, so that the upper limit M_long of the pre-echo width is adapted to the particular codec used to encode s_n.
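The worst-case figures quoted above can be reproduced with a few lines of Python. The function name and the choice of rounding (nearest integer for samples, rounding up for frames) are assumptions made so that the result matches the stated values of 11290 samples and 176 frames.

```python
import math

def pre_echo_search_limit(window_len=2048, fs_out=44100, fs_codec=8000,
                          frame_size=128, overlap=64):
    """Return (N_long in samples, M_long in frames, duration in seconds)."""
    # Length of the encoder's long window after resampling to fs_out.
    n_long = round(window_len * fs_out / fs_codec)
    # Number of STFT frames needed to cover n_long samples (assumed: round up).
    m_long = math.ceil((n_long - overlap) / (frame_size - overlap))
    return n_long, m_long, n_long / fs_out

n_long, m_long, duration_s = pre_echo_search_limit()
```

A higher codec sampling rate shortens the search area accordingly; for instance, passing `fs_codec=44100` gives N_long = 2048 samples.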

Before the actual width of the pre-echo is estimated, tonal frequency components located before the transient are detected (200). Thereafter, the pre-echo width is determined (240) within the region of M_long frames before the transient frame. Using this estimate, a threshold for the signal envelope in the pre-echo region can be calculated (260) in order to reduce the energy in those spectral coefficients whose magnitude exceeds this threshold. For the final pre-echo reduction, a spectral weighting matrix is calculated (450) that contains a multiplication factor for each k and m and is then multiplied element by element with the pre-echo region of X_k,m.

Detection of tonal signal components prior to transients

The spectral coefficients detected here, which correspond to tonal frequency components before the transient onset, are used in the subsequent pre-echo width estimation, as described in the next subsection. It could also be beneficial to use them in the subsequent pre-echo reduction algorithm to skip the energy reduction for those tonal spectral coefficients, since the pre-echo artifacts are likely to be masked by the tonal components present. However, in some cases skipping the tonal coefficients introduced additional artifacts in the form of an audible energy increase at some frequencies near the detected tonal frequencies, so this approach has been omitted from the pre-echo reduction method in this embodiment.

Fig. 13.5 shows a spectrogram of the potential pre-echo region before a transient of the glockenspiel audio signal. The spectral coefficients of the tonal components between the two horizontal dashed lines are detected by combining two different methods:

1. linear prediction along the frames of each spectral coefficient, and

2. a comparison of the energy in each k over all M_long frames before the transient onset with a running average of the energy of all previous potential pre-echo regions of length M_long.

First, a linear prediction analysis across time is performed for each complex-valued STFT coefficient k, where the prediction coefficients a_k,r are calculated with the Levinson-Durbin algorithm according to equations (2.21)-(2.24). Using these prediction coefficients, a prediction gain R_p,k [52, 53, 54] can be calculated for each k from the variances of the input signal X_k,m and of its prediction error E_k,m. E_k,m is calculated according to equation (2.10). The prediction gain indicates how accurately X_k,m can be predicted with the prediction coefficients a_k,r, with a high prediction gain corresponding to good predictability of the signal. Transient and noise-like signals tend to yield a lower prediction gain for time-domain linear prediction, so if R_p,k is sufficiently high for a particular k, this spectral coefficient is likely to contain tonal signal components. For this method, the threshold for a prediction gain corresponding to a tonal frequency component is set to 10 dB.
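The idea behind the prediction-gain criterion can be illustrated with a small Python sketch. Two simplifications are assumed here and are not taken from this description: the gain is computed with the usual definition 10·log10(σ²_X/σ²_E), and a first-order least-squares predictor stands in for the higher-order Levinson-Durbin solution. The test signals and all names are hypothetical.

```python
import cmath
import math
import random

def variance(values):
    """Variance of a (possibly complex) sequence."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) ** 2 for v in values) / len(values)

def one_tap_prediction_gain_db(x):
    """Prediction gain of a first-order predictor along the frames of one bin."""
    num = sum(x[m] * x[m - 1].conjugate() for m in range(1, len(x)))
    den = sum(abs(x[m - 1]) ** 2 for m in range(1, len(x)))
    a = num / den  # least-squares predictor coefficient
    error = [x[m] - a * x[m - 1] for m in range(1, len(x))]
    return 10.0 * math.log10(variance(x[1:]) / variance(error))

frames = 176  # M_long frames of one spectral coefficient
# A steady complex exponential with a small disturbance (tonal case).
tone = [cmath.exp(0.3j * m) + 0.01 * (-1) ** m for m in range(frames)]
# Complex white noise (noise-like case), seeded for repeatability.
rng = random.Random(0)
noise = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(frames)]

tone_is_tonal = one_tap_prediction_gain_db(tone) > 10.0
noise_is_tonal = one_tap_prediction_gain_db(noise) > 10.0
```

The steady component is predicted well and exceeds the 10 dB threshold, whereas the noise-like sequence stays far below it.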

In addition to a high prediction gain, tonal frequency components should also have a relatively high energy compared to the rest of the signal spectrum. The energy ε_i,k in the potential pre-echo region of the current i-th transient is therefore compared to a specific energy threshold. This energy threshold is calculated from a running average of the energy in past pre-echo regions, which is updated for each new transient. Note that the running average does not yet take the energy in the current pre-echo region of the i-th transient into account; the index i merely indicates that it is used for the detection for the current transient. The running average is computed from the total energy over all spectral coefficients k and frames m of the previous pre-echo region, with a weighting coefficient b = 0.7.

Accordingly, a spectral coefficient index k in the current pre-echo region is defined to contain a tonal component if

R_p,k > 10 dB

and ε_i,k exceeds the energy threshold.

The result of the tonal signal component detection method (200) is, for each pre-echo region preceding a detected transient, a vector k_tonal,i that specifies the spectral coefficient indices k satisfying the condition in equation (4.11).

Estimation of pre-echo width

Since no information is available about the exact framing of the decoder used for decoding the signal s_n (and thus about the actual pre-echo width), the actual pre-echo start frame needs to be estimated (240) for each transient before the pre-echo reduction process. This estimate is crucial for the resulting sound quality of the processed signal after the pre-echo reduction. If the estimated pre-echo region is too small, part of the pre-echo present will remain in the output signal. If it is too large, too much of the signal amplitude before the transient will be attenuated, potentially resulting in audible signal drop-outs. As described before, M_long represents the size of the long analysis window used in the audio encoder and is regarded as the maximum possible number of frames over which the pre-echo spreads before the transient event. This maximum range M_long of the pre-echo spread is referred to as the pre-echo search area.

Fig. 13.6 shows a schematic representation of the pre-echo estimation approach. The estimation method follows the assumption that the induced pre-echo causes an increase in the amplitude of the temporal envelope before the onset of the transient. This is shown in Fig. 13.6 for the area between the two vertical dashed lines. During decoding of the encoded audio signal, the quantization noise is not spread equally over the entire synthesis block, but is shaped by the particular form of the window function used. The induced pre-echo therefore causes a gradual rise in amplitude rather than a sudden increase. Before the onset of the pre-echo, the signal may contain silence or other signal components, such as the sustained part of another acoustic event that occurred some time before. The purpose of the pre-echo width estimation method is therefore to find the time instant at which the rise of the signal amplitude corresponds to the onset of the induced quantization noise, i.e. the pre-echo artifact.

The detection algorithm uses only the HF content of X_k,m above 3 kHz, because most of the energy of the input signal is concentrated in the LF region. For the specific STFT parameters used here, this corresponds to the spectral coefficients with k ≥ 18. In this way, the detection of the pre-echo onset becomes more robust, since other signal components that might complicate the detection process are assumed to be absent. Furthermore, tonal spectral coefficients k_tonal that have been detected by the previously described tonal component detection method and correspond to frequencies above 3 kHz are also excluded from the estimation process. The remaining coefficients are then used to calculate a suitable detection function that simplifies the pre-echo estimation. First, the signal energy is summed in the frequency direction for all frames in the pre-echo search area to obtain a magnitude signal L_m. Here, k_max corresponds to the cut-off frequency of the low-pass filter that was used to limit the bandwidth of the original audio signal during the encoding process. Thereafter, L_m is smoothed to reduce fluctuations in the signal level. The smoothing is performed by filtering L_m across time with a 3-tap running average filter in both the forward and the backward direction, which yields the smoothed magnitude signal. In this way, the filter delay is compensated and the filter becomes zero-phase. The smoothed magnitude signal is then differentiated to obtain its slope L'_m. L'_m is subsequently filtered with the same running average filter used before for L_m. This yields the smoothed slope, which is used as the resulting detection function D_m to determine the start frame of the pre-echo.
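The chain of steps described above (energy summation, zero-phase smoothing, slope, second smoothing) can be sketched in Python as follows. The sketch makes several assumptions that are not confirmed by this description: the slope is taken as a first-order difference, the running average uses a shortened window at the start of the sequence, and all function names are illustrative.

```python
def running_average(values, taps):
    """Causal running average; the first samples use the available history."""
    out = []
    for m in range(len(values)):
        window = values[max(0, m - taps + 1):m + 1]
        out.append(sum(window) / len(window))
    return out

def zero_phase_average(values, taps=3):
    """Filter forward, then backward, so that the filter delays cancel."""
    forward = running_average(values, taps)
    backward = running_average(forward[::-1], taps)
    return backward[::-1]

def pre_echo_detection_function(power, k_min=18, k_max=None, tonal=()):
    """power[m][k] holds |X_k,m|^2 for the frames of the search area."""
    k_max = len(power[0]) - 1 if k_max is None else k_max
    excluded = set(tonal)
    bins = [k for k in range(k_min, k_max + 1) if k not in excluded]
    # Sum the energy over the HF bins of every frame to obtain L_m.
    level = [sum(frame[k] for k in bins) for frame in power]
    smooth = zero_phase_average(level)
    # Assumed slope: first-order difference of the smoothed level.
    slope = [0.0] + [smooth[m] - smooth[m - 1] for m in range(1, len(smooth))]
    return zero_phase_average(slope)
```

A steadily rising energy in the search area produces non-negative values of the detection function, while a decaying sustained sound produces negative values.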

The basic idea of the pre-echo estimation is to find the last frame with a negative value of D_m, which marks the time instant after which the signal energy rises until the onset of the transient. Fig. 13.7 shows two examples of the calculation of the detection function D_m and of the pre-echo start frame estimated from it. For both signals (a) and (b), the upper image shows the magnitude signal L_m and its smoothed version, and the lower image shows the slope L'_m and its smoothed version, which is also the detection function D_m. For the signal in Fig. 13.7(a), the detection simply requires finding the last frame with a negative value of D_m in the lower image. The pre-echo start frame m_pre determined in this way is shown as a vertical line. The plausibility of this estimate can be verified by visual inspection of the upper image of Fig. 13.7(a). However, simply taking the last negative value of D_m would not give a suitable result for the lower signal in (b). Here, the detection function ends with a negative value, and taking this last frame as m_pre would effectively result in no pre-echo reduction at all. Furthermore, there may be earlier frames with negative values of D_m that do not coincide with the actual start of the pre-echo either. This can be seen, for example, in the detection function of signal (b) for 52 ≤ m ≤ 58. The search algorithm therefore needs to take into account such fluctuations of the magnitude signal, which can also occur within the actual pre-echo region.

The pre-echo start frame m_pre is estimated with an iterative search algorithm. The procedure is described using the example detection function shown in Fig. 13.8, which is the same detection function as for the signal in Fig. 13.7(b). The top and bottom images of Fig. 13.8 show the first two iterations of the search algorithm. The estimation method scans D_m in reverse order, from the estimated transient onset to the beginning of the pre-echo search area, and determines several frames at which the sign of D_m changes. These frames are shown in the figure as numbered vertical lines. The first iteration in the top image starts at the last frame with a positive value of D_m (line 1), and the preceding frame at which the sign changes from + to − is taken as the pre-echo start frame candidate (line 2). To decide whether the candidate frame should be regarded as m_pre, two additional frames with a sign change, m⁺ (line 3) and m⁻ (line 4), are determined before the candidate frame. The decision whether the candidate frame is taken as the resulting pre-echo start frame m_pre is based on a comparison between the summed values of the gray and black areas (A⁺ and A⁻). This comparison checks whether the black area A⁻, in which D_m exhibits a negative slope, can be regarded as the sustained part of the input signal before the onset of the pre-echo, or whether it is a temporary amplitude decrease within the actual pre-echo region. Using the summed slopes A⁺ and A⁻, the candidate pre-echo start frame at line 2 is defined as the resulting start frame m_pre if

A⁻ > a · A⁺.

For the first iteration of the estimation algorithm, the factor a is initially set to a = 0.5 and is then adjusted to a = 0.92 · a for each subsequent iteration. This puts more emphasis on the negative-slope area A⁻, which is necessary for some signals whose magnitude signal L_m exhibits stronger amplitude variations throughout the search area. If the stop criterion in equation (4.15) does not hold (which is the case for the first iteration in the top image of Fig. 13.8), the next iteration takes the previously determined m⁺ as the last considered frame, as shown in the bottom image, and proceeds in the same way as the previous iteration. It can be seen that equation (4.15) holds for the second iteration, since A⁻ is significantly larger than A⁺, so the candidate frame at line 2 is taken as the final estimate of the pre-echo start frame m_pre.
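One possible reading of this iterative search is sketched below in Python. Because the defining equations for A⁺ and A⁻ are not reproduced in this text, the sketch assumes that A⁻ is the summed magnitude of the negative run ending at the candidate frame and that A⁺ is the sum of the positive run immediately before it. The handling of the boundaries of the search area and all names are likewise assumptions.

```python
def estimate_pre_echo_start(d, a=0.5, decay=0.92):
    """Scan D_m backwards and return the estimated start frame, or None."""

    def last_index(stop, positive):
        """Largest index <= stop whose sign matches `positive`, else -1."""
        m = stop
        while m >= 0 and (d[m] > 0) != positive:
            m -= 1
        return m

    m_last = last_index(len(d) - 1, True)        # line 1: last positive frame
    while m_last >= 0:
        candidate = last_index(m_last, False)    # line 2: sign change + to -
        if candidate < 0:
            return 0                             # rising since the search start
        m_plus = last_index(candidate, True)     # line 3
        if m_plus < 0:
            return candidate                     # only decay before the candidate
        m_minus = last_index(m_plus, False)      # line 4 (-1 if none exists)
        # Assumed areas: negative run before the candidate, positive run before that.
        a_minus = sum(-d[m] for m in range(m_plus + 1, candidate + 1))
        a_plus = sum(d[m] for m in range(m_minus + 1, m_plus + 1))
        if a_minus > a * a_plus:
            return candidate
        a *= decay                               # a = 0.92 * a for the next pass
        m_last = m_plus
    return None                                  # D_m never becomes positive
```

In this reading, a short dip inside the rising part is skipped because its area A⁻ is small compared with the preceding rise A⁺, and the search continues further back.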

Adaptive pre-echo reduction

The adaptive pre-echo reduction that follows can be divided into three phases, as shown in the bottom layer of the block diagram in Fig. 13.4: determining a pre-echo magnitude threshold th_k, calculating a spectral weighting matrix W_k,m, and reducing the pre-echo noise by an element-by-element multiplication of W_k,m with the complex-valued input signal X_k,m. Fig. 13.9 shows the input signal X_k,m in the upper image and the processed output signal Y_k,m, in which the pre-echo has been reduced, in the middle image. The pre-echo reduction is performed by an element-by-element multiplication of X_k,m with the calculated spectral weights W_k,m (shown in the lower image of Fig. 13.9):

Y_k,m = X_k,m · W_k,m.

The purpose of the pre-echo reduction method is to weight the values of X_k,m in the previously estimated pre-echo region such that the resulting magnitude values of Y_k,m fall below a specific threshold th_k. The spectral weighting matrix W_k,m is created by determining this threshold th_k for each spectral coefficient of X_k,m over the pre-echo region and calculating the weighting factors required for the pre-echo attenuation for each frame m. The calculation of W_k,m is limited to the spectral coefficients k_min ≤ k ≤ k_max, where k_min is the spectral coefficient index corresponding to the frequency closest to f_min = 800 Hz, so that coefficients with k < k_min and k > k_max are left unweighted. f_min was chosen to avoid an amplitude reduction in the low-frequency region, since most fundamental frequencies of musical instruments and speech lie below 800 Hz. Amplitude attenuation in this frequency region tends to produce audible signal drop-outs before the transient, especially for complex musical audio signals. Furthermore, the calculation of W_k,m is limited to the estimated pre-echo region m_pre ≤ m ≤ m_i − 2, where m_i is the detected transient onset. Due to the 50% overlap between adjacent time frames in the STFT analysis of the input signal s_n, the frame immediately preceding the transient onset frame m_i may also contain the transient event. The pre-echo attenuation is therefore limited to the frames m ≤ m_i − 2.

Pre-echo threshold determination

As mentioned before, a threshold th_k needs to be determined (260) for each spectral coefficient of X_k,m with k_min ≤ k ≤ k_max. This threshold is used to determine the spectral weights required for the pre-echo attenuation in the respective pre-echo region preceding each detected transient onset. th_k corresponds to the magnitude value to which the signal magnitude values of X_k,m should be reduced in order to obtain the output signal Y_k,m. An intuitive approach would be to simply take the magnitude value of the first frame m_pre of the estimated pre-echo region, since this should correspond to the time instant at which the signal amplitude starts to rise steadily as a result of the induced pre-echo quantization noise. However, this value is not necessarily the smallest magnitude value for all signals, for example if the pre-echo region was estimated too large or because of possible fluctuations of the magnitude signal within the pre-echo region. In Fig. 13.10, two examples of the magnitude signal |X_k,m| in the pre-echo region before a transient onset are shown as solid gray curves. The top image shows a spectral coefficient of the castanets signal, and the bottom image shows the glockenspiel signal in a subband containing the sustained tonal component of the previous glockenspiel tone. To calculate a suitable threshold, |X_k,m| is first filtered across time in both the forward and the backward direction with a 2-tap running average filter to obtain a smoothed envelope (shown as dashed black curves). The smoothed signal is then multiplied with a weighting curve C_m in order to increase the magnitude values towards the end of the pre-echo region. C_m is shown in Fig. 13.11 and is generated depending on M_pre, the number of frames in the pre-echo region. In both plots of Fig. 13.10, the envelope after weighting with C_m is shown as a dashed gray curve. The pre-echo noise threshold th_k is then taken as the minimum of the weighted envelope, indicated by black circles. The resulting thresholds th_k for the two signals are drawn as horizontal dotted lines. For the castanets signal in the top image, it would be sufficient to simply take the minimum of the smoothed magnitude signal without weighting it with C_m. For the glockenspiel signal in the bottom image, however, applying the weighting curve is necessary, because the minimum of the smoothed magnitude signal is located at the end of the pre-echo region. Taking this value as th_k would result in a strong attenuation of the tonal signal component and hence in audible drop-out artifacts. Moreover, because of the higher signal energy in this tonal spectral coefficient, the pre-echo is probably masked and thus inaudible. As can be seen, the multiplication with the weighting curve C_m hardly changes the smoothed signal in the upper image of Fig. 13.10, while it results in a suitably high th_k for the tonal glockenspiel component shown in the bottom image.

Calculation of spectral weights

The resulting threshold th_k is used to calculate the spectral weights W_k,m required for reducing the magnitude values of X_k,m. To this end, a target magnitude signal is calculated (450) for each spectral coefficient index k, which represents the optimal output signal with reduced pre-echo for each individual k. Using the target magnitude signal, the spectral weighting matrix W_k,m can be calculated according to equation (4.18). W_k,m is then smoothed (460) across frequency by applying a 2-tap running average filter in the forward and backward direction for each frame m, in order to reduce large differences between the weighting factors of adjacent spectral coefficients k before the multiplication with the input signal X_k,m. The pre-echo attenuation is not applied to its full extent immediately at the pre-echo start frame m_pre, but is faded in over the duration of the pre-echo region. This is achieved by using (430) a parametric fading curve f_m with adjustable steepness, which is generated (440) with an exponent 10^c that determines the steepness of f_m. Fig. 13.12 shows the fading curve for different values of c, which was set to −0.5 for this study. Using f_m and th_k, the target magnitude signal can be calculated according to equation (4.20). This effectively reduces the values of |X_k,m| that are higher than the threshold th_k, while values below th_k remain unchanged.

Application of a pre-masking model

Transient events act as maskers that can temporally mask preceding and subsequent weaker sounds. A pre-masking model is therefore also applied (420) here, such that the values of |X_k,m| are only reduced until they fall below the pre-masking threshold, at which they are assumed to be inaudible. The pre-masking model used first calculates a "prototype" pre-masking threshold, which is then adjusted to the signal level of the particular masker transient in X_k,m. The parameters for calculating the pre-masking threshold were chosen according to B. Edler (personal communication, November 22, 2016) [55]. The prototype threshold is generated as an exponential function according to equation (4.21), in which the parameters L and α determine the level and the slope of the threshold. The level parameter L is set to

L = L_fall + L₀ = 50 dB + 10 dB = 60 dB.

At t_fall = 3 ms before the masker, the pre-masking threshold should have dropped by L_fall = 50 dB. First, t_fall needs to be converted into the corresponding number of frames m_fall, where (N − L) is the hop size of the STFT analysis and f_s is the sampling frequency. Using L, L_fall and m_fall, equation (4.21) becomes equation (4.24), so that the parameter α can be determined by rearranging equation (4.24). The resulting prototype pre-masking threshold is shown in Fig. 13.13 for the time period before the onset of the masker (which occurs at m = 0). The vertical dashed line marks the time instant m_fall frames before the masker onset, corresponding to t_fall ms, at which the threshold has dropped by L_fall = 50 dB. According to Fastl and Zwicker [33] and Moore [34], pre-masking can last up to 20 ms. For the framing parameters used in the STFT analysis, this corresponds to a pre-masking duration of M_mask ≈ 14 frames, so the prototype threshold is set to −∞ for frames m ≤ −M_mask.
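The frame counts used by the pre-masking model can be checked with a short Python calculation. The conversion t · f_s / (N − L) and the STFT parameters (N = 128, L = 64, f_s = 44.1 kHz) are taken from this description; the variable and function names are illustrative, and no rounding rule for m_fall is assumed.

```python
HOP = 128 - 64   # N - L, hop size of the STFT analysis in samples
FS = 44100       # sampling rate of the decoded signal in Hz

def ms_to_frames(duration_ms, hop=HOP, fs=FS):
    """Convert a duration in milliseconds into a (fractional) frame count."""
    return duration_ms * 1e-3 * fs / hop

m_mask = ms_to_frames(20.0)   # pre-masking duration, about 13.8 frames
m_fall = ms_to_frames(3.0)    # t_fall = 3 ms, about 2.07 frames
level_db = 50.0 + 10.0        # L = L_fall + L_0 in dB
```

Rounding `m_mask` to the nearest integer gives the 14 frames stated above.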

To calculate the specific signal-dependent pre-masking threshold mask_k,m,i in each pre-echo region of X_k,m, the detected transient frame m_i and the following M_mask frames are regarded as time instances of potential maskers. For each spectral coefficient, the prototype threshold is therefore shifted to each m_i ≤ m < m_i + M_mask and adjusted to the signal level of X_k,m with a signal-to-mask ratio of −6 dB (i.e. the distance between the masker level and the prototype threshold at the masker frame). Thereafter, the maximum of the overlapping thresholds is taken as the resulting pre-masking threshold mask_k,m,i for the respective pre-echo region. Finally, mask_k,m,i is smoothed across frequency in both directions by applying a single-pole recursive averaging filter equivalent to the filtering operation in equation (2.2), with a filter coefficient b = 0.3.

The pre-masking threshold mask_k,m,i is then used to adjust the values of the target magnitude signal (as calculated in equation (4.20)).

Fig. 13.14 shows the same two signals as Fig. 13.10, with the resulting target magnitude signal shown as solid black curves. For the castanets signal in the top image, it can be seen how the reduction of the signal magnitude down to the threshold th_k is faded in across the pre-echo region, as well as the influence of the pre-masking threshold in the last frame, m = 16. The bottom image (tonal spectral component of the glockenspiel signal) shows that the adaptive pre-echo reduction method has only a minor effect on sustained tonal signal components: it only slightly attenuates smaller peaks while maintaining the overall magnitude of the input signal X_k,m.

The resulting spectral weights W_k,m are then calculated (450) from X_k,m and the target magnitude signal according to equation (4.18) and are smoothed across frequency before they are applied to the input signal X_k,m. Finally, the output signal Y_k,m of the adaptive pre-echo reduction method is obtained by applying (320) the spectral weights W_k,m to X_k,m by element-by-element multiplication according to equation (4.16). Note that W_k,m is real-valued and therefore does not alter the phase response of the complex-valued X_k,m. Fig. 13.15 shows the result of the pre-echo reduction for a glockenspiel transient with a tonal component before the transient onset. The spectral weights W_k,m in the bottom image show values of about 0 dB in the frequency band of the tonal component, which preserves the sustained tonal part of the input signal.

Enhancement of transient attack

The methods discussed in this section aim to enhance degraded transient attacks and to emphasize the amplitude of transient events.

Adaptive transient attack enhancement

In addition to the transient frame m_i, the signal in the time period after the transient is amplified as well, with the amplification gain fading out over this interval. The adaptive transient attack enhancement method takes the output signal of the preceding pre-echo reduction stage as its input signal X_k,m. Similar to the pre-echo reduction method, a spectral weighting matrix W_k,m is calculated (610) and applied (620) to X_k,m as

Y_k,m = X_k,m · W_k,m.

In this case, however, W_k,m is used to raise the amplitude of the transient frame m_i and, to a lesser extent, of the subsequent frames, rather than to modify the time period before the transient. The amplification is limited to frequencies above f_min = 400 Hz and below the cut-off frequency f_max of the low-pass filter applied in the audio encoder. First, the input signal X_k,m is divided into a sustained part and a transient part. The subsequent signal amplification is applied only to the transient signal part, while the sustained part is fully retained.

The sustained signal part is calculated (650) by filtering the magnitude signal |X_k,m| with a single-pole recursive averaging filter according to equation (2.4), with the filter coefficient set to b = 0.41. The top image of Fig. 13.16 shows the input signal magnitude |X_k,m| as a gray curve and the corresponding sustained signal part as a dashed curve. The transient signal part is then calculated (670) from |X_k,m| and the sustained signal part. In the bottom image of Fig. 13.16, the transient part of the input signal magnitude |X_k,m| in the top image is shown as a gray curve. Instead of only multiplying the transient part at m_i by a certain gain factor G, the amount of amplification is faded out (680) over a number of frames after the transient frame. The fade-out gain curve G_m is shown in Fig. 13.17. The gain factor for the transient frame is set to G₁ = 2.2, which corresponds to a magnitude level increase of 6.85 dB, and the gain for the subsequent frames decreases according to G_m. Using the gain curve G_m and the sustained and transient signal parts, the spectral weighting matrix W_k,m is obtained (680). W_k,m is then smoothed (690) across frequency in both the forward and the backward direction according to equation (2.2), before the transient attack is enhanced according to equation (4.27). In the bottom image of Fig. 13.16, the result of amplifying the transient signal part with the gain curve G_m is shown as a black curve. In the top image, the output signal magnitude |Y_k,m| with the enhanced transient attack is shown as a solid black curve.
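The split into a sustained and a transient part and the resulting weights can be sketched in Python for a single spectral coefficient. Since the corresponding equations are not reproduced in this text, the sketch rests on assumptions: the recursion s[m] = b·|X[m]| + (1 − b)·s[m−1] for the sustained part, the transient part as the non-negative remainder |X[m]| − s[m], and the weight as (sustained + G_m · transient)/|X[m]|. The gain values after G₁ = 2.2 and all names are hypothetical.

```python
import math

def attack_enhancement_weights(mag, onset, gains, b=0.41):
    """Weights for one spectral coefficient; mag[m] holds |X_k,m|."""
    # Sustained part: assumed single-pole recursive average of the magnitude.
    sustained = []
    state = mag[0]
    for value in mag:
        state = b * value + (1.0 - b) * state
        sustained.append(state)

    weights = [1.0] * len(mag)
    for offset, gain in enumerate(gains):
        m = onset + offset
        if m >= len(mag) or mag[m] <= 0.0:
            continue
        # Transient part: what exceeds the sustained part (assumed, never negative).
        transient = max(mag[m] - sustained[m], 0.0)
        # Amplify only the transient part and keep the sustained part as it is.
        weights[m] = (mag[m] - transient + gain * transient) / mag[m]
    return weights

gain_db = 20.0 * math.log10(2.2)   # level increase for G_1 = 2.2
```

With `gains=[2.2, 1.6, 1.0]` (only the first value is taken from the text), the transient frame is raised most, and frames without a transient part keep a weight of 1.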

Temporal envelope shaping using linear prediction

In contrast to the adaptive transient attack enhancement method described above, this method aims to sharpen the attack of a transient event without increasing its amplitude. Instead, the transient is "sharpened" by applying (720) linear prediction in the frequency domain and using two different sets of prediction coefficients a_r for the inverse filter (720a) and the synthesis filter (720b) in order to shape (740) the temporal envelope of the time signal s_n. By filtering the input signal spectrum with the inverse filter (740a), the prediction residual E_k,m can be obtained according to equations (2.9) and (2.10). The inverse filter (740a) decorrelates the filtered input signal X_k,m in both the frequency and the time domain, effectively flattening the temporal envelope of the input signal s_n. Filtering E_k,m with the synthesis filter (740b) according to equation (2.12) perfectly reconstructs the input signal X_k,m if the synthesis filter uses the same prediction coefficients as the inverse filter. The goal of the attack enhancement is to calculate the two sets of prediction coefficients in such a way that the combination of the inverse filter and the synthesis filter emphasizes the transient while attenuating the signal parts before and after the transient in the particular transient frame.

The LPC shaping method works with different framing parameters than the previously described enhancement method. Therefore, the output signal of the preceding adaptive attack enhancement stage needs to be re-synthesized with the ISTFT and re-analyzed with the new parameters. For this method, a frame size of N = 512 samples is used, with a 50% overlap of L = N/2 = 256 samples. The DFT size is set to 512. The larger frame size is chosen to improve the computation of the prediction coefficients in the frequency domain, since a high frequency resolution is more important here than a high temporal resolution. The prediction coefficients a_r^flat and a_r^synth are calculated on the complex spectrum of the input signal s_n for the band between f_min = 800 Hz and f_max (which corresponds to the spectral coefficients with k_min = 10 ≤ k_lpc ≤ k_max), using the Levinson-Durbin algorithm after equations (2.21) - (2.24) and an LPC order of p = 24. Before that, the autocorrelation function R_i of the band-pass signal is multiplied (802, 804) by two different window functions W_i^flat and W_i^synth, used for the calculation of a_r^flat and a_r^synth respectively, in order to smooth the temporal envelope described by the corresponding LPC filter [56]. The window functions are generated by

W_i = c^i, 0 ≤ i ≤ k_max - k_min,

where c_flat = 0.4 and c_synth = 0.94. The top image of fig. 4.13 shows the two window functions, which are then multiplied by R_i. The autocorrelation function of an exemplary input signal frame is depicted in the bottom image, together with its two windowed versions (R_i · W_i^flat) and (R_i · W_i^synth). Using the obtained prediction coefficients as the filter coefficients of the flattening filter and the shaping filter, the input signal X_{k,m} is shaped using the result of equation (4.30) together with equation (2.6).
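The coefficient computation described above (autocorrelation of the band-limited complex spectrum, lag windowing with W_i = c^i, then Levinson-Durbin) might be sketched as follows. This is a sketch under stated assumptions, not the procedure of equations (2.21) - (2.24): the autocorrelation definition, the complex-valued form of the recursion, the test spectrum, the value of k_max and the reduced order p = 4 are all illustrative choices, and the function names are hypothetical.

```python
import cmath
import math

def autocorrelation(x, max_lag):
    """R[i] = sum_k x[k] * conj(x[k-i]) for lags 0..max_lag (assumed definition)."""
    return [sum(x[k] * x[k - i].conjugate() for k in range(i, len(x)))
            for i in range(max_lag + 1)]

def lag_window(R, c):
    """Multiply the autocorrelation by the window W_i = c**i."""
    return [R[i] * c ** i for i in range(len(R))]

def levinson_durbin(R, p):
    """Complex Levinson-Durbin recursion.

    Returns prediction coefficients a_1..a_p, the partial correlation
    (reflection) coefficients, and the final prediction error energy.
    """
    a, rho = [], []
    err = R[0].real
    for m in range(1, p + 1):
        acc = R[m] - sum(a[r - 1] * R[m - r] for r in range(1, m))
        k = acc / err
        a = [a[r - 1] - k * a[m - r - 1].conjugate() for r in range(1, m)] + [k]
        rho.append(k)
        err *= 1.0 - abs(k) ** 2
    return a, rho, err

def normal_equation_residual(R, a):
    """Largest deviation from R[j] = sum_r a_r * R[j-r], j = 1..p."""
    def lag(i):
        return R[i] if i >= 0 else R[-i].conjugate()
    p = len(a)
    return max(abs(lag(j) - sum(a[r - 1] * lag(j - r) for r in range(1, p + 1)))
               for j in range(1, p + 1))

# Illustrative band-pass spectrum for bins k_min..k_max (k_max is assumed).
k_min, k_max, p = 10, 60, 4   # the text uses p = 24; 4 keeps the example small
X_band = [cmath.exp(0.35j * k) * (1.0 + 0.5 * math.sin(0.9 * k))
          + 0.2 * cmath.exp(-1.1j * k) for k in range(k_min, k_max + 1)]

R = autocorrelation(X_band, p)
R_flat, R_synth = lag_window(R, 0.4), lag_window(R, 0.94)
a_flat, rho_flat, err_flat = levinson_durbin(R_flat, p)
a_synth, rho_synth, err_synth = levinson_durbin(R_synth, p)
res_flat = normal_equation_residual(R_flat, a_flat)
res_synth = normal_equation_residual(R_synth, a_synth)
```

The smaller window constant damps higher lags more strongly, so the two coefficient sets describe the same frame with different amounts of envelope detail.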

This describes a filtering operation with the resulting shaping filter, which can be interpreted as the combined application (820) of the inverse filter (809) and the synthesis filter (810). Transforming equation (4.32) with the FFT yields the time-domain filter transfer function (TF) of the system, which is composed of the FIR (inverse/flattening) filter (1 - P_n) and the IIR (synthesis) filter A_n. Equation (4.32) can be equivalently formulated in the time domain as the product of the input signal frame s_n and the time-domain TF of the shaping filter.
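The equivalence between filtering across frequency bins and multiplying the time-domain frame by a transfer function can be checked numerically. The sketch below assumes circular filtering over all N bins of a small frame, which simplifies the band-limited filtering described in the text, and it assumes P_n = sum_r a_r * exp(j*2*pi*r*n/N) and A_n = 1 / (1 - P_n^synth); the coefficients, the frame and the function names are illustrative assumptions.

```python
import cmath

def dft(s):
    """Naive DFT, adequate for a small illustrative frame."""
    N = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def prediction_tf(a, N):
    """Time-domain function 1 - P_n of an all-zero filter run across the bins."""
    return [1 - sum(a[r - 1] * cmath.exp(2j * cmath.pi * r * n / N)
                    for r in range(1, len(a) + 1))
            for n in range(N)]

def shaping_tf(a_flat, a_synth, N, G=1.0):
    """Combined TF: G * (1 - P_n^flat) / (1 - P_n^synth)."""
    return [G * f / d
            for f, d in zip(prediction_tf(a_flat, N), prediction_tf(a_synth, N))]

N = 16
s = [((n * 7) % 5 - 2) * 0.5 for n in range(N)]   # arbitrary real frame
a_flat = [0.3, -0.1]                              # illustrative coefficients
a_synth = [0.5, -0.2]

# Inverse (FIR) filtering across bins, applied circularly in the frequency domain.
X = dft(s)
E_freq = [X[k] - sum(a_flat[r - 1] * X[(k - r) % N] for r in range(1, 3))
          for k in range(N)]

# The same operation expressed as a time-domain multiplication.
H_flat = prediction_tf(a_flat, N)
E_time = dft([s[n] * H_flat[n] for n in range(N)])
max_diff = max(abs(u - v) for u, v in zip(E_freq, E_time))

# Shaped frame as a product of the input frame and the shaping TF.
H_shape = shaping_tf(a_flat, a_synth, N)
y = [s[n] * H_shape[n] for n in range(N)]

# Identical coefficient sets give a TF of 1 everywhere (no shaping).
identity_dev = max(abs(h - 1) for h in shaping_tf(a_synth, a_synth, N))
```

In this simplified setting the shaped frame y is complex-valued in general; a practical system would filter a band-limited analytic spectrum and take care of the mirrored bins.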

Fig. 13.13 shows the different time-domain TFs of equation (4.33). The two dashed curves correspond to the inverse filter and the synthesis filter, and the solid gray curve represents the combination (820) of the inverse filter and the synthesis filter before multiplication by the gain factor G (811). It can be seen that, for 140 < n < 426, a filtering operation with a gain factor of G = 1 would result in a strong amplitude increase of the transient event. An appropriate gain factor G can be calculated as the ratio of the two prediction gains of the inverse filter and the synthesis filter.

The prediction gain R_p is calculated from the partial correlation coefficients ρ_m (where 1 ≤ m ≤ p), which are related to the prediction coefficients a_r and are computed together with a_r in equation (2.21) of the Levinson-Durbin algorithm. The prediction gain (811) is then obtained from ρ_m.

The final shaping filter TF with adjusted amplitude is shown as the solid black curve in fig. 4.13. Fig. 4.13 shows, in the top image, the resulting output signal y_n after the LPC envelope shaping together with the input signal s_n in the transient frame. The bottom image compares the input signal magnitude spectrum X_{k,m} with the filtered magnitude spectrum Y_{k,m}.
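The gain computation can be sketched in a few lines of Python. The formula used here, R_p = 1 / prod(1 - |ρ_m|^2), is the textbook definition of the prediction gain and is assumed rather than taken from the text; the direction of the ratio (flattening gain over synthesis gain) is likewise an assumption, chosen so that G reduces the amplitude increase described above. The reflection coefficients are made-up illustrative values.

```python
def prediction_gain(rho):
    """Signal-to-residual energy ratio, 1 / prod(1 - |rho_m|^2) (assumed form)."""
    gain = 1.0
    for k in rho:
        gain /= 1.0 - abs(k) ** 2
    return gain

# Hypothetical partial correlation coefficients from two Levinson-Durbin runs.
rho_flat = [0.3, 0.1]     # strongly windowed autocorrelation: weak prediction
rho_synth = [0.8, 0.4]    # mildly windowed autocorrelation: strong prediction

R_p_flat = prediction_gain(rho_flat)
R_p_synth = prediction_gain(rho_synth)

# Assumed direction of the ratio; the text only states that G is the ratio
# of the two prediction gains.
G = R_p_flat / R_p_synth
```

With these example values G is below 1, which is the behaviour needed to counteract the amplitude increase of the uncompensated combination.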

Furthermore, examples of embodiments are set forth subsequently, particularly in relation to the second aspect:

1. An apparatus for post-processing (20) an audio signal, comprising:

a time-spectrum converter (700) for converting the audio signal into a spectral representation comprising a sequence of spectral frames;

a prediction analyzer (720) for computing prediction filter data for a prediction of frequencies within a spectral frame;

a shaping filter (740) controlled by the prediction filter data for shaping the spectral frame to enhance transient portions within the spectral frame; and

a spectrum-time converter (760) for converting a sequence of spectral frames comprising the shaped spectral frame into the time domain.

2. The apparatus of example 1,

wherein the prediction analyzer (720) is configured to calculate first prediction filter data (720a) for a flattening filter characteristic (740a) and second prediction filter data (720b) for a shaping filter characteristic (740 b).

3. The apparatus as set forth in example 2,

wherein the prediction analyzer (720) is configured to calculate the first prediction filter data (720a) using a first time constant and to calculate the second prediction filter data (720b) using a second time constant, the second time constant being greater than the first time constant.

4. The apparatus as described in example 2 or 3,

wherein the flattening filter characteristic (740a) is an analysis FIR filter characteristic or an all-zero filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a flatter temporal envelope compared to the temporal envelope of the spectral frame; or

wherein the shaping filter characteristic (740b) is a synthesis IIR filter characteristic or an all-pole filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a less flat temporal envelope compared to the temporal envelope of the spectral frame.

5. The apparatus as in any one of the preceding examples,

wherein the prediction analyzer (720) is configured to:

calculating (800) an autocorrelation signal from the spectral frame;

windowing (802, 804) the autocorrelation signal using a window having a first time constant or having a second time constant, the second time constant being greater than the first time constant;

calculating (806, 808) first prediction filter data from the windowed autocorrelation signal windowed using the first time constant or calculating second prediction filter coefficients from the windowed autocorrelation signal windowed using the second time constant; and

wherein the shaping filter (740) is configured to shape the spectral frame using the second prediction filter coefficients or using the second prediction filter coefficients and first prediction filter coefficients.

6. The apparatus as in any one of the preceding examples,

wherein the shaping filter (740) comprises a cascade of two controllable sub-filters (809, 810), a first sub-filter (809) being a flattening filter having a flattening filter characteristic and a second sub-filter (810) being a shaping filter having a shaping filter characteristic,

wherein the sub-filters (809, 810) are both controlled by the prediction filter data derived by the prediction analyzer (720), or

wherein the shaping filter (740) is a filter having a combined filter characteristic derived by combining (820) a flattening characteristic and a shaping characteristic, wherein the combined characteristic is controlled by the prediction filter data derived by the prediction analyzer (720).

7. The apparatus as set forth in example 6,

wherein the prediction analyzer (720) is configured to determine the prediction filter data such that using the prediction filter data for the shaping filter (740) results in a degree of shaping that is higher than the degree of flattening obtained by using the prediction filter data for the flattening filter characteristic.

8. The apparatus as in any one of the preceding examples,

wherein the prediction analyzer (720) is configured to apply (806, 808) a Levinson-Durbin algorithm to a filtered autocorrelation signal derived from the spectral frame.

9. The apparatus as in any one of the preceding examples,

wherein the shaping filter (740) is configured to apply gain compensation such that the energy of the shaped spectral frame is equal to, or within a tolerance range of ± 20% of, the energy of the spectral frame generated by the time-spectrum converter (700).

10. The apparatus as in any one of the preceding examples,

wherein the shaping filter (740) is configured to apply a flattening filter characteristic (740a) with a flattening gain and a shaping filter characteristic (740b) with a shaping gain, and

wherein the shaping filter (740) is configured to perform gain compensation for compensating for the effects of the flattening gain and the shaping gain.

11. The apparatus as set forth in example 6,

wherein the prediction analyzer (720) is configured to calculate a flattening gain and a shaping gain,

wherein the cascade of two controllable sub-filters (809, 810) further comprises a separate gain stage (811) for applying a gain derived from the flattening gain and/or the shaping gain, or a gain function comprised in at least one of the two sub-filters, or

wherein the filter (740) having the combined characteristic is configured to apply a gain derived from the flattening gain and/or the shaping gain.

12. The apparatus as set forth in example 5,

wherein the window comprises a Gaussian window having a time lag as a parameter.

13. The apparatus as in any one of the preceding examples,

wherein the prediction analyzer (720) is configured to calculate prediction filter data for a plurality of frames such that the shaping filter (740) controlled by the prediction filter data performs signal manipulation on a frame of the plurality of frames that includes a transient portion, and such that the shaping filter (740) does not perform signal manipulation or performs less signal manipulation on another frame of the plurality of frames that does not include a transient portion than the frame that includes a transient portion.

14. The apparatus as in any one of the preceding examples,

wherein the spectrum-time converter (760) is configured to apply an overlap-add operation involving at least two adjacent frames of the spectral representation.

15. The apparatus as in any one of the preceding examples,

wherein the time-spectrum converter (700) is configured to apply an analysis window using a hop size between 3 ms and 8 ms or having a window length between 6 ms and 16 ms, or

wherein the spectrum-time converter (760) is configured to use an overlap range corresponding to the overlap size of the overlapping windows or corresponding to a hop size, used by the converter, of between 3 ms and 8 ms, or to use a synthesis window having a window length between 6 ms and 16 ms, or wherein the analysis window and the synthesis window are identical to each other.

16. The apparatus as described in example 2 or 3,

wherein the flattening filter characteristic (740a) is an inverse filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a flatter temporal envelope compared to the temporal envelope of the spectral frame; or

wherein the shaping filter characteristic (740b) is a synthesis filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a temporal envelope that is less flat than the temporal envelope of the spectral frame.

17. The apparatus of any of the preceding examples, wherein the prediction analyzer (720) is configured to calculate prediction filter data for a shaping filter characteristic (740b), and wherein the shaping filter (740) is configured to filter the spectral frame as obtained by the time-spectrum converter (700), e.g. without prior flattening.

18. The apparatus of any of the preceding examples, wherein the shaping filter (740) is configured to represent a shaping action in accordance with the temporal envelope of the spectral frame at a maximum or less than maximum temporal resolution, and wherein the shaping filter (740) is configured to represent no flattening action, or a flattening action in accordance with a temporal resolution that is lower than the temporal resolution associated with the shaping action.

19. A method of post-processing (20) an audio signal, comprising:

converting (700) the audio signal into a spectral representation comprising a sequence of spectral frames;

calculating (720) prediction filter data for a prediction of frequencies within a spectral frame;

shaping (740) the spectral frame in response to the prediction filter data to enhance transient portions within the spectral frame; and

converting (760) a sequence of spectral frames comprising the shaped spectral frame into the time domain.

20. A computer program for performing the method of example 19 when run on a computer or processor.
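The numerical ranges in examples 9 and 15 can be related to the framing parameters of the description with a short Python sketch. The sample rate of 44.1 kHz is an assumption (it is not stated in this section), the frame values are arbitrary, and the helper names are hypothetical; the energy-matching gain is one simple way to satisfy the ± 20% tolerance, not necessarily the compensation used in the embodiments.

```python
def frame_timing_ms(frame_size, hop_size, sample_rate):
    """Window length and hop size in milliseconds."""
    return 1000.0 * frame_size / sample_rate, 1000.0 * hop_size / sample_rate

def energy(frame):
    """Sum of squared magnitudes of the spectral coefficients."""
    return sum(abs(v) ** 2 for v in frame)

def gain_compensate(original, shaped):
    """Scale the shaped frame so that its energy matches the original frame."""
    e_shaped = energy(shaped)
    if e_shaped == 0.0:
        return list(shaped)
    g = (energy(original) / e_shaped) ** 0.5
    return [g * v for v in shaped]

def within_tolerance(original, shaped, tol=0.2):
    """True if the shaped energy is within +/- tol of the original energy."""
    e_orig = energy(original)
    return abs(energy(shaped) - e_orig) <= tol * e_orig

# N = 512 and L = 256 from the description; 44100 Hz is an assumed sample rate.
window_ms, hop_ms = frame_timing_ms(512, 256, 44100)

original = [1 + 1j, 0.5 - 0.5j, -0.25j, 0.1]   # arbitrary spectral frame
shaped = [3 * v for v in original]             # exaggerated, uncompensated shaping
compensated = gain_compensate(original, shaped)
```

Under the assumed sample rate, the window length is about 11.6 ms and the hop size about 5.8 ms, both inside the ranges named in example 15.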

Although some aspects are described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.

Embodiments of the present invention may be implemented in hardware or software, depending on the particular implementation requirements. The implementation can be performed using a digital storage medium, e.g. a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a flash memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.

Some embodiments according to the invention comprise a data carrier with electronically readable control signals capable of cooperating with a programmable computer system so as to perform one of the methods described herein.

In general, embodiments of the invention may be implemented as a computer program product with a program code operable to perform one of the methods when the computer program product runs on a computer. The program code may be stored on a machine-readable carrier, for example.

Other embodiments include a computer program stored on a machine-readable carrier or non-transitory storage medium for performing one of the methods described herein.

In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.

Thus, a further embodiment of the inventive method is a data carrier (or digital storage medium, or computer readable medium) comprising a computer program recorded thereon for performing one of the methods described herein.

A further embodiment of the inventive method is thus a data stream or a signal sequence representing a computer program for performing one of the methods described herein. The data stream or signal sequence may for example be arranged to be transmitted via a data communication connection, for example via the internet.

Further embodiments include a processing apparatus, such as a computer or programmable logic device, configured or adapted to perform one of the methods described herein.

A further embodiment comprises a computer having installed thereon a computer program for performing one of the methods described herein.

In some embodiments, a programmable logic device (e.g., a field programmable gate array) may be used to perform some or all of the functions of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.

The above-described embodiments are merely illustrative of the principles of the present invention. It is to be understood that modifications and variations of the arrangements and details described herein will be apparent to others skilled in the art. It is the intention, therefore, to be limited only by the scope of the claims appended hereto, and not by the specific details presented by way of description and explanation of the embodiments herein.

References

[1] Brandenburg, "MP3 and AAC explained," in Audio Engineering Society Conference: 17th International Conference: High-Quality Audio Coding, September 1999.

[2] Brandenburg and G. Stoll, "ISO/MPEG-1 audio: A generic standard for coding of high-quality digital audio," J. Audio Eng. Soc., vol. 42, pp. 780-792, October 1994.

[3] ISO/IEC 11172-3, "MPEG-1: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio," international standard, ISO/IEC, 1993. JTC1/SC29/WG11.

[4] ISO/IEC 13818-1, "Information technology - Generic coding of moving pictures and associated audio information: Systems," international standard, ISO/IEC, 2000. ISO/IEC JTC1/SC29.

[5] J. Herre and J. D. Johnston, "Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS)," in 101st Audio Engineering Society Convention, no. 4384, AES, November 1996.

[6] Edler, "Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen," Frequenz - Zeitschrift für Telekommunikation, vol. 43, pp. 253-.

[7] Samaali, M. T.-H. Alouane, and G. Mahé, "Temporal envelope correction for attack restoration in low bit-rate audio coding," in 17th European Signal Processing Conference (EUSIPCO), (Glasgow, Scotland), IEEE, August 2009.

[8] Lapierre and R. Lefebvre, "Pre-echo noise reduction in frequency-domain audio codecs," in 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 686-690, IEEE, March 2017.

[9] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Harlow, UK: Pearson Education Limited, 3. ed., 2014.

[10] J. G. Proakis and D. G. Manolakis, Digital Signal Processing - Principles, Algorithms, and Applications. New Jersey, US: Pearson Education Limited, 4. ed., 2007.

[11] Benesty, J. Chen, and Y. Huang, Springer Handbook of Speech Processing, ch. 7. Linear Prediction, pp. 121-134. Berlin: Springer, 2008.

[12] J. Makhoul, "Spectral analysis of speech by linear prediction," in IEEE Transactions on Audio and Electroacoustics, vol. 21, pp. 140-148, IEEE, June 1973.

[13] J. Makhoul, "Linear prediction: A tutorial review," in Proceedings of the IEEE, vol. 63, pp. 561-.

[14] M. Athineos and D. P. W. Ellis, "Frequency-domain linear prediction for temporal features," in IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 261-266, IEEE, November 2003.

[15] F. Keiler, D. Arfib, and U., "Efficient linear prediction for digital audio effects," in COST G-6 Conference on Digital Audio Effects (DAFX-00), (Verona, Italy), December 2000.

[16] J. Makhoul, "Spectral linear prediction: Properties and applications," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 283-.

[17] T. Painter and A. Spanias, "Perceptual coding of digital audio," in Proceedings of the IEEE, vol. 88, April 2000.

[18] J. Makhoul, "Stable and efficient lattice methods for linear prediction," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, pp. 423-428, IEEE, October 1977.

[19] Levinson, "The Wiener rms (root mean square) error criterion in filter design and prediction," Journal of Mathematics and Physics, vol. 25, pp. 261-.

[20] J. Herre, "Temporal noise shaping, quantization and coding methods in perceptual audio coding: A tutorial introduction," in Audio Engineering Society Conference: 17th International Conference: High-Quality Audio Coding, vol. 17, AES, August 1999.

[21] Schroeder, "Linear prediction, entropy and signal analysis," IEEE ASSP Magazine, vol. 1, pp. 3-11, July 1984.

[22] L. Daudet, S. Molla, and B. Torrésani, "Transient detection and encoding using wavelet coefficient trees," Colloques sur le Traitement du Signal et des Images, September 2001.

[23] Edler and O. Niemeyer, "Detection and extraction of transients for audio coding," in Audio Engineering Society Convention 120, no. 6811, (Paris, France), May 2006.

[24] Kliewer and A. Mertins, "Audio subband coding with improved representation of transient signal segments," in 9th European Signal Processing Conference, vol. 9, (Rhodes), pp. 1-4, IEEE, September 1998.

[25] Jaillet, "Detection and modeling of fast attack transients," in Proceedings of the International Computer Music Conference, (Havana, Cuba), pp. 30-33, 2001.

[26] Bello, L. Daudet, S. Abdallah, C. Duxbury, and M. Davies, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 1035-.

[27] Suresh Babu, A. K. Malot, V. Vijayachandar, and M. Vinay, "Transient detection for transform domain coders," in Audio Engineering Society Convention 116, no. 6175, (Berlin, Germany), May 2004.

[28] Masri and A. Bateman, "Improved modeling of attack transients in music analysis-resynthesis," in International Computer Music Conference, pp. 100-.

[29] Kwong and R. Lefebvre, "Transient detection of audio signals based on an adaptive comb filter in the frequency domain," in Conference on Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar, vol. 1, pp. 542-.

[30] Zhang, C. Cai, and J. Zhang, "A transient signal detection technique based on flatness measure," in 6th International Conference on Computer Science and Education, (Singapore), pp. 310-.

[31] Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE Journal on Selected Areas in Communications, vol. 6, pp. 314-.

[32] J. Herre and S. Disch, Academic Press Library in Signal Processing, vol. 4, ch. 28. Perceptual Audio Coding, pp. 757-799. Academic Press, 2014.

[33] H. Fastl and E. Zwicker, Psychoacoustics - Facts and Models. Heidelberg: Springer, 3. ed., 2007.

[34] B. C. J. Moore, An Introduction to the Psychology of Hearing. London: Emerald, 6. ed., 2012.

[35] P. Dallos, A. N. Popper, and R. R. Fay, The Cochlea. New York: Springer, 1. ed., 1996.

[36] W. M. Hartmann, Signals, Sound, and Sensation. Springer, 5. ed., 2005.

[37] Brandenburg, C. Faller, J. Herre, J. D. Johnston, and B. Kleijn, "Perceptual coding of high-quality digital audio," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 101, pp. 1905-.

[38] Fletcher and W. A. Munson, "Loudness, its definition, measurement and calculation," The Bell System Technical Journal, vol. 12, no. 4, pp. 377-430, 1933.

[39] Fletcher, "Auditory patterns," Reviews of Modern Physics, vol. 12, no. 1, pp. 47-65, 1940.

[40] M. Bosi and R. E. Goldberg, Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 1. ed., 2003.

[41] Noll, "MPEG digital audio coding," IEEE Signal Processing Magazine, vol. 14, pp. 59-81, September 1997.

[42] Pan, "A tutorial on MPEG/audio compression," IEEE MultiMedia, vol. 2, no. 2, pp. 60-74, 1995.

[43] Erne, "Perceptual audio coders 'what to listen for'," in 111th Audio Engineering Society Convention, no. 5489, AES, September 2001.

[44] C.-M. Liu, H.-W. Hsu, and W. Lee, "Compression artifacts in perceptual audio coding," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 681-.

[45] L. Daudet, "A review on techniques for the extraction of transients in musical signals," in Proceedings of the Third International Conference on Computer Music, pp. 219-.

[46] W.-C. Lee and C.-C. J. Kuo, "Musical onset detection based on adaptive linear prediction," in IEEE International Conference on Multimedia and Expo, (Toronto, Ontario), pp. 957-.

[47] M. Link, "An attack processing of audio signals for optimizing the temporal characteristics of a low bit-rate audio coding system," in Audio Engineering Society Convention, vol. 95, October 1993.

[48] T. Vaupel, Ein Beitrag zur Transformationscodierung von Audiosignalen unter Verwendung der Methode der "Time Domain Aliasing Cancellation (TDAC)" und einer Signalkompandierung im Zeitbereich. Ph.D. thesis, Duisburg, Germany, April 1991.

[49] Bertini, M. Magrini, and T. Giunti, "A time-domain system for transient enhancement in recorded music," in 14th European Signal Processing Conference (EUSIPCO), (Florence, Italy), IEEE, September 2013.

[50] Duxbury, M. Sandler, and M. Davies, "A hybrid approach to music onset detection," in Proc. of the 5th Int. Conference on Digital Audio Effects (DAFx-02), (Hamburg, Germany), pp. 33-38, September 2002.

[51] Klapuri, "Sound onset detection by applying psychoacoustic knowledge," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, March 1999.

[52] S. L. Goh and D. P. Mandic, "Nonlinear adaptive prediction of complex-valued PRNN," in IEEE Transactions on Signal Processing, vol. 53, pp. 1827-1836, IEEE, May 2005.

[53] Haykin and L. Li, "Nonlinear adaptive prediction of nonstationary signals," in IEEE Transactions on Signal Processing, vol. 43, pp. 526-535, IEEE, February 1995.

[54] D. P. Mandic, S. Javidi, S. L. Goh, and K. Aihara, "Complex-valued prediction of wind profile using augmented complex statistics," in Renewable Energy, vol. 34, pp. 196-.

[55] Edler, "Parametrization of a pre-masking model," personal communication, November 22, 2016.

[56] ITU-R Recommendation BS.1116-3, "Method for the subjective assessment of small impairments in audio systems," recommendation, International Telecommunication Union, Geneva, Switzerland, February 2015.

[57] ITU-R Recommendation BS.1534-3, "Method for the subjective assessment of intermediate quality level of audio systems," recommendation, International Telecommunication Union, Geneva, Switzerland, October 2015.

[58] ITU-R Recommendation BS.1770-4, "Algorithms to measure audio programme loudness and true-peak audio level," recommendation, International Telecommunication Union, Geneva, Switzerland, October 2015.

[59] S. M. Ross, Introduction to Probability and Statistics for Engineers and Scientists. Elsevier, 3. ed., 2004.
