Apparatus for post-processing audio signals using transient position detection
Reading note: This technology, "Apparatus for post-processing audio signals using transient position detection", was designed by Sascha Disch, Christian Uhle, Patrick Gampp, Daniel Richter, Oliver Hellmuth and Jürgen Herre on 2018-03-28. Abstract: An apparatus for post-processing an audio signal, comprising: a converter (100) for converting an audio signal into a time-frequency representation; a transient position estimator (120) for estimating a temporal position of a transient portion using the audio signal or the time-frequency representation; and a signal manipulator (140) for manipulating the time-frequency representation, wherein the signal manipulator (140) is configured to reduce or eliminate pre-echoes in the time-frequency representation at a time position before the transient position, or to perform shaping of the time-frequency representation at the transient position, to amplify the attack of the transient portion.
1. An apparatus for post-processing (20) an audio signal, comprising:
a converter (100) for converting the audio signal into a time-frequency representation;
a transient position estimator (120) for estimating a temporal position of a transient portion using the audio signal or the time-frequency representation; and
a signal manipulator (140) for manipulating a time-frequency representation, wherein the signal manipulator is configured to reduce (220) or eliminate pre-echoes in the time-frequency representation at a time position before a transient position, or to perform shaping (500) of the time-frequency representation at a transient position, to amplify an attack of the transient portion.
2. The apparatus of claim 1,
wherein the signal manipulator (140) comprises a tonality estimator (200) for detecting tonal signal components in the time-frequency representation that temporally precede the transient portion, and
wherein the signal manipulator (140) is configured to apply the pre-echo reduction or elimination (220) in a frequency-selective manner such that, at frequencies where tonal signal components have been detected, the signal manipulation is reduced or switched off compared to frequencies where no tonal signal components have been detected.
3. The apparatus of claim 1 or 2, wherein the signal manipulator (140) comprises a pre-echo width estimator (240) for estimating a temporal width of a pre-echo preceding the transient position, based on a development of a signal energy of the audio signal over time, in order to determine a pre-echo start frame in the time-frequency representation, which comprises a plurality of subsequent audio signal frames.
4. The apparatus of any one of the preceding claims,
wherein the signal manipulator (140) comprises a pre-echo threshold estimator (260) for estimating a pre-echo threshold for spectral values in the time-frequency representation within a pre-echo width, wherein the pre-echo threshold indicates a magnitude threshold of a corresponding spectral value after the pre-echo reduction or elimination.
5. The apparatus of claim 4,
wherein the pre-echo threshold estimator (260) is configured to determine the pre-echo threshold using a weighting curve having an increasing characteristic from the start of the pre-echo width to the transient position.
6. The apparatus of any one of the preceding claims, wherein the pre-echo threshold estimator (260) is configured to:
smooth (330) the time-frequency representation over a plurality of subsequent frames of the time-frequency representation, and
weight (340) the smoothed time-frequency representation using a weighting curve having an increasing characteristic from the start of the pre-echo width to the transient position.
7. The apparatus of any one of the preceding claims, wherein the signal manipulator (140) comprises:
a spectral weight calculator (300, 160) for calculating respective spectral weights for spectral values of the time-frequency representation; and
a spectral weighter (320) for weighting spectral values of the time-frequency representation using the spectral weights to obtain a manipulated time-frequency representation.
8. The apparatus according to claim 7, wherein the spectral weight calculator (300) is configured to:
determine (450) raw spectral weights using actual spectral values and target spectral values, or
smooth (460) the raw spectral weights over frequency within a frame of the time-frequency representation, or
fade in (430) the pre-echo reduction or elimination over multiple frames at the beginning of the pre-echo width using a fading curve, or
determine (420) the target spectral values such that spectral values having an amplitude below a pre-echo threshold are unaffected by the signal manipulation, or
determine (420) the target spectral values using a pre-masking model (410), so that the attenuation of spectral values in a pre-echo region is reduced based on the pre-masking model (410).
9. The apparatus of any one of the preceding claims,
wherein the time-frequency representation comprises complex-valued spectral values, and
wherein the signal manipulator (140) is configured to apply real-valued spectral weighting values to the complex-valued spectral values.
10. The apparatus of any one of the preceding claims,
wherein the signal manipulator (140) is configured to amplify (500) spectral values within a transient frame of the time-frequency representation.
11. The apparatus of any one of the preceding claims,
wherein the signal manipulator (140) is configured to amplify only spectral values above a minimum frequency, the minimum frequency being greater than 250 Hz and lower than 2 kHz.
12. The apparatus of any one of the preceding claims,
wherein the signal manipulator (140) is configured to divide (630) the time-frequency representation at the transient position into a sustained part and a transient part,
wherein the signal manipulator (140) is configured to amplify only the transient part and not the sustained part.
13. The apparatus of any one of the preceding claims,
wherein the signal manipulator (140) is configured to additionally amplify a portion of the time-frequency representation that follows the transient position in time, using a fade-out characteristic (685).
14. The apparatus of any one of the preceding claims,
wherein the signal manipulator (140) is configured to calculate (680) spectral weighting factors for the spectral values using the sustained part of the spectral values, the amplified transient part and the magnitudes of the spectral values, wherein the amount of amplification of the amplified part is predetermined and is between 150% and 300%, or
wherein the spectral weights are smoothed (690) over frequency.
15. The apparatus of any one of the preceding claims, further comprising:
a spectrum-to-time converter for converting (370) the manipulated time-frequency representation into the time domain using an overlap-add operation involving at least adjacent frames of the time-frequency representation.
16. The apparatus of any one of the preceding claims,
wherein the converter (100) is configured to apply an analysis window with a hop size between 1 ms and 3 ms or with a window length between 2 ms and 6 ms, or
wherein the spectrum-to-time converter (370) is configured to use an overlap range corresponding to an overlap size of overlapping windows, or corresponding to the hop size between 1 ms and 3 ms used by the converter, or to use a synthesis window having a window length between 2 ms and 6 ms, or wherein the analysis window and the synthesis window are identical to each other.
17. A method for post-processing (20) an audio signal, comprising:
converting (100) the audio signal into a time-frequency representation;
estimating (120) a transient position in time of a transient portion using the audio signal or the time-frequency representation; and
manipulating (140) the time-frequency representation to reduce (220) or eliminate pre-echoes in the time-frequency representation at time positions preceding the transient position, or performing shaping (500) of the time-frequency representation at the transient position to amplify the onset of the transient portion.
18. A computer program for performing the method of claim 17 when run on a computer or processor.
Technical Field
The present invention relates to audio signal processing and, in particular, to audio signal post-processing to enhance audio quality by removing coding artifacts.
Background
Audio coding is the field of signal compression that exploits psychoacoustic knowledge to deal with redundancy and irrelevancy in audio signals. Under low bit rate conditions, unwanted artifacts are often introduced into the audio signal. Prominent artifacts are temporal pre- and post-echoes, which are triggered by transient signal components.
Especially in block-based audio processing, these pre- and post-echoes occur because quantization noise, for example of the spectral coefficients in a frequency-domain transform coder, is spread over the entire duration of a block. Semi-parametric coding tools, such as gap filling, parametric spatial audio, or bandwidth extension, may also cause parameter-band-limited echo artifacts, since parameter-driven adjustments typically occur within a time block of samples.
The present invention relates to a non-guided post-processor that reduces or mitigates transient subjective quality impairments that have been introduced by perceptual transform coding.
Prior art methods to prevent pre- and post-echo artifacts within the codec include transform codec block switching and temporal noise shaping. Prior art methods of suppressing pre- and post-echo artifacts using post-processing techniques after the codec chain are disclosed in [1] and [2]:
[1] Imen Samaali, Mânia Turki-Hadj Alouane, Gaël Mahé, "Temporal Envelope Correction for Attack Restoration in Low Bit-Rate Audio Coding", 17th European Signal Processing Conference (EUSIPCO 2009), Scotland, August 24-28, 2009; and
[2] Jimmy Lapierre and Roch Lefebvre, "Pre-Echo Noise Reduction in Frequency-Domain Audio Codecs", ICASSP 2017, New Orleans.
The first category of methods needs to be inserted into the codec chain and cannot be applied a posteriori to items that have been previously encoded (e.g., archived sound material). Even though the second category is implemented essentially as a post-processor of the decoder, it still requires control information derived from the original input signal at the encoder side.
Disclosure of Invention
It is an object of the invention to provide an improved concept for post-processing an audio signal.
This object is achieved by an apparatus for post-processing an audio signal according to
One aspect of the present invention is based on the following finding: transients can still be located in audio signals that have been subjected to earlier encoding and decoding, because such earlier encoding/decoding operations, although degrading the perceptual quality, do not completely eliminate the transients. Accordingly, a transient position estimator is provided for estimating a temporal position of a transient portion using the audio signal or a time-frequency representation of the audio signal. According to the invention, the time-frequency representation of the audio signal is manipulated to reduce or eliminate pre-echoes in the time-frequency representation at time positions preceding the transient position, or to perform shaping of the time-frequency representation at the transient position and, depending on the implementation, after the transient position, so that the attack of the transient portion is amplified.
According to the invention, the signal manipulation is performed within a time-frequency representation of the audio signal based on the detected transient position. Processing in the frequency domain thus yields a rather accurate transient position detection and, based on it, a useful pre-echo reduction on the one hand and an attack amplification on the other hand. The final frequency-to-time conversion then automatically smooths and distributes the manipulation over the entire frame and, owing to the overlap-add operation, over more than one frame. This avoids audible clicks due to the manipulation of the audio signal and results in an improved audio signal without any pre-echo, or with a reduced amount of pre-echo, on the one hand, and/or with sharpened attacks for the transient portions on the other hand.
The preferred embodiments relate to a non-guided post-processor that reduces or mitigates transient subjective quality impairments that have been introduced by perceptual transform coding.
According to another aspect of the invention, the transient enhancement processing is performed without a specific need for a transient position estimator. In this aspect, a time-spectrum converter is used for converting an audio signal into a spectral representation comprising a sequence of spectral frames. A prediction analyzer then calculates prediction filter data for a prediction over frequency within a spectral frame, and a subsequently connected shaping filter, controlled by the prediction filter data, shapes the spectral frame to enhance a transient portion within the spectral frame. The post-processing of the audio signal is completed by a spectrum-time conversion that converts the sequence of spectral frames, including the shaped spectral frame, back into the time domain.
Thus, once again, all modifications are made within a spectral representation rather than a time-domain representation, which avoids audible clicks and similar artifacts of time-domain processing. Furthermore, because the prediction analyzer calculates prediction filter data for a prediction over frequency within a spectral frame, the subsequent shaping automatically affects the corresponding temporal envelope of the audio signal. In particular, owing to the processing in the spectral domain and to the prediction over frequency, the shaping enhances the temporal envelope of the audio signal, i.e. the temporal envelope obtains higher peaks and deeper valleys. In other words, the shaping performs the opposite of smoothing and automatically enhances transients without actually locating them.
Preferably, two kinds of prediction filter data are derived. The first prediction filter data are prediction filter data for a flattening filter characteristic, and the second prediction filter data are prediction filter data for a shaping filter characteristic. In other words, the flattening filter characteristic is an inverse filter characteristic, and the shaping filter characteristic is a prediction synthesis filter characteristic. Again, however, both sets of filter data are derived by performing a prediction over frequency within the spectral frame. Preferably, the time constants used for deriving the different filter coefficients differ, so that a first time constant is used for calculating the first prediction filter coefficients and a second time constant is used for calculating the second prediction filter coefficients, wherein the second time constant is larger than the first time constant. This processing again automatically ensures that transient signal portions are affected more strongly than non-transient signal portions. In other words, although the processing does not rely on an explicit transient detection method, the flattening and subsequent shaping based on different time constants affect transient portions more strongly than non-transient portions.
Thus, according to the invention and owing to the prediction over frequency, an automatic kind of transient enhancement procedure is obtained, in which the temporal envelope is enhanced (rather than smoothed).
Embodiments of the present invention are designed as a post-processor that operates on previously encoded sound material without the need for further guidance information. Thus, these embodiments may be applied to archived sound material that was impaired by perceptual coding applied before the material was archived.
A preferred embodiment of the first aspect comprises the following main process steps:
non-guided detection of transient positions within the signal;
estimating the pre-echo duration and strength preceding the transient;
deriving a suitable temporal gain curve for attenuating the pre-echo artifact;
ducking/attenuating the estimated pre-echo before the transient by means of the adapted temporal gain curve (to mitigate the pre-echo);
mitigating the dispersion of the attack at the attack position;
excluding tonal or other quasi-stationary spectral bands from the ducking.
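The ducking steps listed above can be illustrated with a minimal sketch. It assumes that the transient frame, the pre-echo start frame and the tonal bins have already been estimated, and it applies a real-valued gain per frame to the spectral values. The function name `duck_pre_echo`, the linear fade and the `floor` parameter are hypothetical choices made for illustration only; the gain curve actually used by the embodiments is derived from a pre-echo threshold and is not reproduced here.

```python
def duck_pre_echo(frames, transient_frame, pre_echo_start, tonal_bins=(), floor=0.25):
    """Attenuate the frames between pre_echo_start and transient_frame.

    frames: list of spectral frames, each a list of (real or complex) bin values.
    The gain fades linearly from just below 1.0 at the pre-echo start down to
    `floor` in the last frame before the transient (assumed shape, see lead-in).
    Bins listed in tonal_bins are excluded from the ducking.
    """
    tonal = set(tonal_bins)
    width = transient_frame - pre_echo_start
    out = [list(frame) for frame in frames]  # leave the input untouched
    for m in range(max(pre_echo_start, 0), transient_frame):
        progress = (m - pre_echo_start + 1) / width  # in (0, 1]
        gain = 1.0 - (1.0 - floor) * progress
        for k in range(len(out[m])):
            if k not in tonal:
                out[m][k] *= gain
    return out


# Four frames with two bins each; the transient sits in frame 3, the
# pre-echo is assumed to start in frame 1, and bin 1 is treated as tonal.
frames = [[1.0, 1.0] for _ in range(4)]
ducked = duck_pre_echo(frames, transient_frame=3, pre_echo_start=1, tonal_bins=[1])
```

Because the gain is real-valued, the same loop can be applied to complex-valued spectral values without changing their phase.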
A preferred embodiment of the second aspect comprises the following main process steps:
unguided detection of transient position within the signal to find the transient position (this step is optional);
sharpening the attack envelope by applying a frequency domain linear prediction coefficient (FD-LPC) flattening filter and a subsequent FD-LPC shaping filter, the flattening filter representing a smoothed temporal envelope and the shaping filter representing a less smooth temporal envelope, wherein the prediction gains of both filters are compensated.
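The flattening and shaping step can be sketched as two filters that run along the frequency axis of one spectral frame. The sketch assumes that the two coefficient sets are already available (for example from a linear prediction over frequency) and that each starts with 1. The names `flatten`, `shape` and `sharpen_frame` are hypothetical, and the gain compensation is modeled here simply as preserving the energy of the frame, which is an assumption rather than the exact compensation of the embodiments.

```python
import math


def flatten(a, spectrum):
    """FIR prediction-error filter A(z), applied along the frequency axis."""
    return [
        sum(a[j] * spectrum[k - j] for j in range(len(a)) if k - j >= 0)
        for k in range(len(spectrum))
    ]


def shape(a, spectrum):
    """All-pole synthesis filter 1/A(z), applied along the frequency axis (a[0] == 1)."""
    out = []
    for k, value in enumerate(spectrum):
        feedback = sum(a[j] * out[k - j] for j in range(1, len(a)) if k - j >= 0)
        out.append(value - feedback)
    return out


def sharpen_frame(spectrum, a_flat, a_shape):
    """Flatten with a_flat, shape with a_shape, then restore the frame energy."""
    shaped = shape(a_shape, flatten(a_flat, spectrum))
    energy_in = sum(abs(v) ** 2 for v in spectrum)
    energy_out = sum(abs(v) ** 2 for v in shaped)
    gain = math.sqrt(energy_in / energy_out) if energy_out > 0 else 1.0
    return [gain * v for v in shaped]


# With identical coefficient sets the two filters cancel each other.
frame = [1.0, 2.0, 3.0]
unchanged = sharpen_frame(frame, [1.0, -0.5], [1.0, -0.5])
# With different coefficient sets the frame is reshaped, at equal energy.
reshaped = sharpen_frame(frame, [1.0, -0.5], [1.0, -0.8])
```

When both coefficient sets are equal, the frame passes through unchanged; any sharpening therefore comes only from the difference between the flattening and the shaping characteristics.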
The preferred embodiment is a post-processor that implements unguided transient enhancement as the last step in a multi-step processing chain. If other enhancement techniques are to be applied, such as unguided bandwidth extension or spectral gap filling, the transient enhancement is preferably the last in the chain, so that the enhancement includes and is effective for the signal modifications introduced by the previous enhancement stages.
All aspects of the invention may be implemented as a post-processor. One, two or three modules may be computed serially or may share common modules for computational efficiency (e.g., (I)STFT, transient detection, tonality detection).
It should be noted that the two aspects described herein may be used independently of each other or together for post-processing an audio signal. The first aspect, which relies on transient position detection, pre-echo reduction and attack amplification, may be used to enhance the signal without the second aspect. Correspondingly, the second aspect, which is based on an LPC analysis over frequency and the corresponding shaping filtering in the frequency domain, does not necessarily rely on transient detection, but enhances transients automatically without an explicit transient position detector. This embodiment may be enhanced by a transient position detector, but such a detector is not necessary. Furthermore, the second aspect may be applied independently of the first aspect. It is also emphasized that, in other embodiments, the second aspect may be applied to audio signals that have been post-processed by the first aspect. Alternatively, the order may be reversed, so that the second aspect is applied in a first step and the first aspect is applied subsequently, in order to post-process the audio signal and improve its audio quality by removing earlier introduced coding artifacts.
Furthermore, it should be noted that the first aspect basically has two sub-aspects. The first sub-aspect is pre-echo reduction based on transient position detection, and the second sub-aspect is attack amplification based on transient position detection. Preferably, the two sub-aspects are combined in series, wherein, even more preferably, pre-echo reduction is performed first, followed by attack amplification. However, in other embodiments, the two sub-aspects may be implemented independently of each other and may even be combined with the second aspect as appropriate. Thus, pre-echo reduction may be combined with the prediction-based transient enhancement process without any attack amplification. In other embodiments, no pre-echo reduction is performed, but attack amplification is performed together with the subsequent LPC-based transient shaping, which does not necessarily require transient position detection.
In a combined embodiment, the first aspect, comprising its two sub-aspects, and the second aspect are performed in a specific order: pre-echo reduction is performed first, attack amplification second, and the LPC-based attack/transient enhancement procedure, which is based on a prediction over frequency within a spectral frame, third.
Drawings
Preferred embodiments of the present invention will be discussed subsequently with reference to the accompanying drawings, in which:
FIG. 1 is a schematic block diagram according to a first aspect;
FIG. 2a is a preferred embodiment of the first aspect based on a tonality estimator;
FIG. 2b is a preferred embodiment of the first aspect based on pre-echo width estimation;
FIG. 2c is a preferred embodiment of the first aspect based on pre-echo threshold estimation;
FIG. 2d is a preferred embodiment of the first sub-aspect relating to pre-echo reduction/elimination;
FIG. 3a is a preferred embodiment of the first sub-aspect;
FIG. 3b is a preferred embodiment of the first sub-aspect;
FIG. 4 is a further preferred embodiment of the first sub-aspect;
FIG. 5 illustrates the two sub-aspects of the first aspect of the invention;
FIG. 6a shows an overview of the second sub-aspect;
FIG. 6b shows a preferred embodiment of the second sub-aspect relying on a division into a transient part and a sustained part;
FIG. 6c illustrates a further embodiment of the division of FIG. 6b;
FIG. 6d shows a further embodiment of the second sub-aspect;
FIG. 6e shows a further embodiment of the second sub-aspect;
FIG. 7 shows a block diagram of an embodiment of a second aspect of the present invention;
FIG. 8a shows a preferred embodiment of the second aspect based on two different filter data;
FIG. 8b shows a preferred embodiment of the second aspect for calculating two different prediction filter data;
FIG. 8c shows a preferred embodiment of the shaping filter of FIG. 7;
FIG. 8d shows a further embodiment of the shaping filter of FIG. 7;
FIG. 8e shows a further embodiment of the second aspect of the invention;
FIG. 8f shows a preferred embodiment of LPC filter estimation using different time constants;
FIG. 9 shows an overview of a preferred embodiment of a post-processing procedure that relies on the first and second sub-aspects of the first aspect of the present invention and additionally performs the second aspect of the present invention on the output of the procedure based on the first aspect;
FIG. 10a shows a preferred embodiment of a transient position detector;
FIG. 10b illustrates a preferred embodiment of the detection function calculation of FIG. 10a;
FIG. 10c shows a preferred embodiment of the onset selector of FIG. 10a;
FIG. 11 shows a general arrangement of the invention according to the first and/or second aspect as a transient enhancement post-processor;
FIG. 12.1 shows moving average filtering;
FIG. 12.2 shows single-pole recursive averaging and high-pass filtering;
FIG. 12.3 shows temporal signal prediction and residual;
FIG. 12.4 shows the autocorrelation of the prediction error;
FIG. 12.5 shows spectral envelope estimation using LPC;
FIG. 12.6 shows temporal envelope estimation using LPC;
FIG. 12.7 illustrates attack transients versus frequency domain transients;
FIG. 12.8 shows the spectrum of the "frequency domain transient";
FIG. 12.9 illustrates the difference between transient, onset and attack;
FIG. 12.10 shows the absolute threshold in quiet and simultaneous masking;
FIG. 12.11 shows temporal masking;
FIG. 12.12 shows the general structure of a perceptual audio encoder;
FIG. 12.13 shows the general structure of a perceptual audio decoder;
FIG. 12.14 shows bandwidth limitation in perceptual audio coding;
FIG. 12.15 illustrates a degraded attack characteristic;
FIG. 12.16 shows pre-echo artifacts;
FIG. 13.1 shows a transient enhancement algorithm;
FIG. 13.2 shows transient detection: detection function (castanets);
FIG. 13.3 shows transient detection: detection function (park);
FIG. 13.4 shows a block diagram of a pre-echo reduction method;
FIG. 13.5 illustrates the detection of tonal components;
FIG. 13.6 shows pre-echo width estimation: exemplary method;
FIG. 13.7 shows pre-echo width estimation: example;
FIG. 13.8 shows pre-echo width estimation: detection function;
FIG. 13.9 shows pre-echo reduction: spectrogram (castanets);
FIG. 13.10 is a graphical illustration of pre-echo threshold determination (castanets);
FIG. 13.11 is a graphical illustration of pre-echo threshold determination for tonal components;
FIG. 13.12 shows a parametric fading curve for pre-echo reduction;
FIG. 13.13 shows a model of the pre-masking threshold;
FIG. 13.14 shows the calculation of the target amplitude after pre-echo reduction;
FIG. 13.15 shows pre-echo reduction: spectrogram (bell);
FIG. 13.16 illustrates adaptive transient attack enhancement;
FIG. 13.17 shows a fade-out curve for adaptive transient attack enhancement;
FIG. 13.18 shows an autocorrelation window function;
FIG. 13.19 shows the time-domain transfer function of the LPC shaping filter; and
FIG. 13.20 shows LPC envelope shaping: input and output signals.
Detailed Description
Fig. 1 shows an apparatus for post-processing an audio signal using transient position detection. In particular, as shown in fig. 11, the device for post-processing is placed with respect to a general frame. In particular, fig. 11 shows the input of the corrupted audio signal shown at 10. This input is forwarded to the transient
The
Thus, the apparatus for post-processing in fig. 1 reduces or eliminates pre-echoes and/or shapes the time-frequency representation to amplify the onset of transient portions.
Fig. 2a shows a
Furthermore, as shown in fig. 2b, the
Fig. 2b shows a block diagram of a preferred embodiment of the post-processing according to the first sub-aspect of the first aspect of the present invention, i.e. where pre-echo reduction or cancellation is performed, or pre-echo "ducking" as described in fig. 2 d.
The marred audio signal is provided at
Furthermore, a
The result of
Preferably, pre-echo
Preferably, as depicted in FIG. 3a, the
Preferably, the
Preferably, the
In a further embodiment, the
Preferably, the spectral weights are calculated as shown in the particular embodiment shown in fig. 4. The
Preferably, the target values input into the
Furthermore,
Naturally, the target value can also be determined without any pre-masking (look-ahead masking) psychoacoustic effect and without any fading. The target value is then directly the threshold th_k. It has been found, however, that the particular calculations performed by the
Thus, it is preferred to determine the target spectral values such that spectral values having an amplitude below the pre-echo threshold are unaffected by signal manipulation, or to determine the target spectral values using the look-
Preferably, the algorithm executed in the
Fig. 5 illustrates a preferred embodiment of the
Fig. 6a shows a preferred embodiment of a
Preferably, the
Preferably, the
As depicted, the
After the weighting factor determination 680, smoothing across frequency is performed in
Preferably, the result of the
Preferably, the
Fig. 7 shows an
Preferably, the
Preferably, the
Preferably, the degree of shaping represented by the
Although fig. 8a shows the case where two different filter characteristics (one shaping filter and one flattening filter) are calculated, other embodiments rely on a single shaping filter characteristic. This is due to the fact that the signal can of course also be shaped without prior flattening, so that finally a highly shaped signal with automatically improved transients is obtained again. This effect of over-shaping can be controlled by the transient position detector, but is not required due to the preferred implementation of signal manipulation that affects the non-transient portion less automatically than the transient portion. Both processes rely entirely on the fact that the
In this embodiment, the
Because the autocorrelation signal is windowed with windows having two different time constants, an automatic transient enhancement is obtained. Typically, the windowing is such that the different time constants affect only one class of signals and not the other. Transient signals are actually affected by the two different time constants, whereas non-transient signals have an autocorrelation signal for which windowing with the second, larger time constant yields almost the same output as windowing with the first time constant. With respect to fig. 13.18, this is because non-transient signals do not have any significant peaks at high time lags, so that using two different time constants makes no difference for these signals. This is different for transient signals. Transient signals have peaks at higher time lags, and applying different time constants to an autocorrelation signal that actually has peaks at higher time lags, as shown at 1300 in fig. 13.18, for example, results in different outputs for the different windowing operations.
The shaping filter may be implemented in many different ways depending on the implementation. In fig. 8c is shown a way of cascading a flat sub-filter controlled by
However, these two different filter characteristics and gain compensation may also be implemented within a
Fig. 8e shows a further embodiment of the second aspect of the invention, where the function of the combined shaping
FIG. 8f shows the windowing function obtained by
Thus, applying a window to the autocorrelation values prior to the Levinson-Durbin recursion results in an extension of the temporal support at local temporal peaks. In particular, FIG. 8f depicts the extension obtained using a Gaussian window. The embodiments herein rely on this idea to derive a temporal flattening filter that, by selecting different values for a, has a larger extension of temporal support at local non-flat envelopes than the subsequent shaping filter. Together, these filters result in a sharpening of temporal attacks in the signal. Finally, the prediction gains of the filters are compensated, so that the spectral energy of the filtered spectral region is preserved.
Thus, as shown in fig. 8a to 8e, a signal flow of the attack shaping based on frequency-domain LPC is obtained.
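The windowing of the autocorrelation values before the Levinson-Durbin recursion can be sketched as follows. The sketch assumes real-valued input and a Gaussian lag window of the form exp(-0.5 (a · lag)²); this parameterization, like the function names, is a hypothetical choice for illustration and is not taken from the embodiments. The returned coefficients follow the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the prediction error filter.

```python
import math


def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of a real-valued sequence."""
    return [
        sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        for lag in range(max_lag + 1)
    ]


def gaussian_lag_window(r, a):
    """Multiply r[lag] by a Gaussian in the lag (assumed parameterization)."""
    return [value * math.exp(-0.5 * (a * lag) ** 2) for lag, value in enumerate(r)]


def levinson_durbin(r, order):
    """Solve the normal equations; requires r[0] > 0.

    Returns (coeffs, error) with coeffs[0] == 1.0.
    """
    coeffs = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(coeffs[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient of stage i
        updated = coeffs[:]
        for j in range(1, i):
            updated[j] = coeffs[j] + k * coeffs[i - j]
        updated[i] = k
        coeffs = updated
        error *= 1.0 - k * k
    return coeffs, error


# Two coefficient sets from the same autocorrelation, using two window widths.
r = autocorrelation([1.0, 0.5, 0.25, 0.125, 0.0625], max_lag=2)
narrow, _ = levinson_durbin(gaussian_lag_window(r, a=0.8), order=2)
wide, _ = levinson_durbin(gaussian_lag_window(r, a=0.2), order=2)
```

In this parameterization a larger value of `a` makes the window decay faster over the lags, so the two coefficient sets differ mainly when the autocorrelation still has significant values at higher lags.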
Fig. 9 shows a preferred implementation of an embodiment relying on the first aspect shown by
Fig. 10a illustrates a preferred embodiment of the
The
Fig. 10c shows a preferred way of choosing from the starting point of the detection function as obtained by
In
In
Subsequently, techniques and auditory concepts used in the proposed transient enhancement method are disclosed. First, some basic digital signal processing techniques will be introduced with respect to selected filtering operations and linear prediction, followed by the definition of transients. Subsequently, the psychoacoustic concept of auditory masking, which is used in perceptual coding of audio content, is explained. This section ends with a brief description of a generic perceptual audio codec and the resulting compression artifacts, which are subject to the enhancement method according to the invention.
Smoothing and differentiating filter
The transient enhancement method described later makes frequent use of some specific filtering operations, which are introduced in the following. For a more detailed description, see [9, 10]. Equation (2.1) describes a finite impulse response (FIR) low-pass filter that calculates the current output sample $y_n$ as the average of the current and past samples of the input signal $x_n$. The filtering process of this so-called moving average filter is given by
$$y_n = \frac{1}{p+1} \sum_{i=0}^{p} x_{n-i}, \qquad (2.1)$$
where $p$ is the filter order. The top image of Fig. 12.1 shows the result of the moving average filter operation in Equation (2.1) for an input signal $x_n$. The output signal $y_n$ in the bottom image was calculated by applying the moving average filter twice to $x_n$, once in the forward and once in the backward direction. This compensates for the filter delay and also yields a smoother output signal $y_n$, because $x_n$ is filtered twice.
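The forward and backward application of the moving average filter can be illustrated with a short sketch. The function names and the zero-valued boundary handling are assumptions made for this illustration only; the text does not specify how samples before the start of the signal are treated.

```python
def moving_average(x, p):
    """Causal moving average of order p: mean of the current and p past samples.

    Samples before the start of the signal are treated as zero (an assumption).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(p + 1):
            if n - i >= 0:
                acc += x[n - i]
        y.append(acc / (p + 1))
    return y


def forward_backward(x, p):
    """Filter forward, then filter the reversed result, to cancel the delay."""
    forward = moving_average(x, p)
    backward = moving_average(forward[::-1], p)
    return backward[::-1]


# Rectangular pulse as a test signal.
x = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
y_single = moving_average(x, 4)      # delayed, smoothed pulse
y_double = forward_backward(x, 4)    # smoother pulse without delay
```

For the symmetric pulse used here, `y_double` is again symmetric about the center of the pulse, whereas `y_single` is shifted towards later samples.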
A different way to smooth the signal is to apply a single-pole recursive averaging filter, which is given by the following difference equation:
$$y_n = b \cdot x_n + (1-b) \cdot y_{n-1}, \qquad 1 \le n \le N, \qquad (2.2)$$
where $y_0 = x_1$ and $N$ denotes the number of samples in $x_n$. Fig. 12.2(a) shows the result of a single-pole recursive averaging filter applied to a rectangular function. In (b), the filter was applied in both directions to smooth the signal further. By forming two modified output signals according to Equations (2.3) and (2.4), where $x_n$ and $y_n$ are the input and output signals of Equation (2.2), respectively, output signals are obtained that directly follow the attack or the decay phase of the input signal. Fig. 12.2(c) shows the first of these signals as a solid black curve and the second as a dashed black curve.
Strong amplitude increments or decrements of the input signal $x_n$ can be detected by filtering $x_n$ with an FIR high-pass filter as follows,
$$y_n = \sum_{i=0}^{p} b_i \, x_{n-i}, \qquad (2.5)$$
where $b = [1, -1]$ or $b = [1, 0, \ldots, -1]$. The resulting signal after high-pass filtering the rectangular function is shown as the black curve in Fig. 12.2(d).
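As a minimal sketch of the two filters described above, the following code applies the single-pole recursive averaging filter and an FIR differentiator with b = [1, -1] to a rectangular function. The function names, the value b = 0.5 and the zero-valued history of the FIR filter are assumptions chosen for illustration.

```python
def recursive_average(x, b):
    """Single-pole recursive averaging: y[n] = b*x[n] + (1-b)*y[n-1].

    The state is initialised with the first input sample, which mirrors the
    initial condition y_0 = x_1 in the 1-based notation of the text.
    """
    y = []
    previous = x[0]
    for sample in x:
        previous = b * sample + (1.0 - b) * previous
        y.append(previous)
    return y


def fir_filter(x, b):
    """FIR filter y[n] = sum_i b[i]*x[n-i]; samples before the start count as zero."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(b):
            if n - i >= 0:
                acc += coeff * x[n - i]
        y.append(acc)
    return y


x = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
smoothed = recursive_average(x, 0.5)
# Second pass over the reversed signal smooths both edges of the pulse.
smoothed_both = recursive_average(smoothed[::-1], 0.5)[::-1]
# The differentiator responds only at the rising and falling edge.
edges = fir_filter(x, [1.0, -1.0])
```

The differentiator output is zero everywhere except at the two edges of the rectangle, which is why this kind of filter is useful for locating strong amplitude increments or decrements.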
Linear prediction
Linear prediction (LP) is a useful method for audio coding. Some past studies have described in particular its ability to model the speech production process [11, 12, 13], while others have applied it to the analysis of audio signals in general [14, 15, 16, 17]. The following sections are based on [11, 12, 13, 15, 18].
In linear predictive coding (LPC), a sampled time signal $s(nT)$, written $s_n$ for short (where $T$ is the sampling period), can be predicted by a weighted linear combination of its past values, as expressed by Equation (2.6). There, $n$ is the time index identifying a particular time sample of the signal, $p$ is the prediction order, $a_r$ (with $1 \le r \le p$) are the linear prediction coefficients (and, in this case, the filter coefficients of an all-pole infinite impulse response (IIR) filter), $G$ is the gain factor, and $u_n$ is some input signal that excites the model. Taking the z-transform of Equation (2.6) gives the corresponding all-pole transfer function $H(z)$ of the system, Equation (2.7), where
$$z = e^{j 2\pi f T} = e^{j \omega T}.$$
The IIR filter $H(z)$ is called the synthesis or LPC filter, and the associated FIR filter is referred to as the inverse filter. Using the prediction coefficients $a_r$ as the filter coefficients of the FIR filter, a prediction of the signal $s_n$ can be obtained. The prediction error between this predicted signal and the actual signal $s_n$ is given by Equation (2.10), and its equivalent representation in the z-domain by Equation (2.11).
Fig. 12.3 shows the original signal $s_n$, the predicted signal, and the difference signal $e_{n,p}$ for a prediction order of $p = 10$. This difference signal $e_{n,p}$ is also known as the residual. In Fig. 12.4, the autocorrelation function of the residual shows almost complete decorrelation between adjacent samples, which indicates that $e_{n,p}$ can be regarded approximately as white Gaussian noise. Using $e_{n,p}$ from Equation (2.10) as the input signal $u_n$ in Equation (2.6), or filtering $E_p(z)$ from Equation (2.11) with the all-pole filter $H(z)$ from Equation (2.7), with $G = 1$ in both cases, the original signal can be perfectly recovered in the time domain and in the z-domain, respectively (Equations (2.12) and (2.13)).
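The perfect recovery of the signal from its residual can be checked numerically. The sketch below assumes the common sign convention in which the model reads s_n = -sum_r a_r s_{n-r} + G u_n, so that the inverse filter is 1 + sum_r a_r z^{-r}; this convention, the function names and the example coefficients are assumptions for illustration and are not taken from the text.

```python
import math


def inverse_filter(s, a):
    """FIR inverse filter: e[n] = s[n] + sum_r a[r-1] * s[n-r] (assumed sign convention)."""
    e = []
    for n in range(len(s)):
        acc = s[n]
        for r in range(1, len(a) + 1):
            if n - r >= 0:
                acc += a[r - 1] * s[n - r]
        e.append(acc)
    return e


def synthesis_filter(e, a, gain=1.0):
    """All-pole synthesis filter: s[n] = gain * e[n] - sum_r a[r-1] * s[n-r]."""
    s = []
    for n in range(len(e)):
        acc = gain * e[n]
        for r in range(1, len(a) + 1):
            if n - r >= 0:
                acc -= a[r - 1] * s[n - r]
        s.append(acc)
    return s


# Hypothetical, stable second-order coefficient set (not derived from data).
a = [-1.2, 0.8]
s = [math.exp(-0.05 * n) * math.sin(0.4 * n) for n in range(50)]
residual = inverse_filter(s, a)
recovered = synthesis_filter(residual, a, gain=1.0)
max_error = max(abs(u - v) for u, v in zip(s, recovered))
```

With G = 1, feeding the residual back into the all-pole filter reproduces the input up to floating-point rounding, regardless of how well the coefficients actually predict the signal.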
As the prediction order $p$ increases, the energy of the residual decreases. Besides the number of predictor coefficients, the residual energy also depends on the coefficients themselves. The problem in linear predictive coding is therefore how to obtain the optimal filter coefficients $a_r$ that minimize the energy of the residual. First, the total squared error (total energy) $E$ of the residual is formed from a windowed signal block $x_n = s_n \cdot w_n$ and its prediction, where $w_n$ is some window function of width $N$ (Equation (2.14)). To minimize the total squared error $E$, the gradient of Equation (2.14) with respect to each $a_r$ is calculated and set to 0. This leads to the so-called normal equations, Equation (2.17), in which $R_i$ denotes the autocorrelation of the signal $x_n$. Equation (2.17) forms a system of $p$ linear equations from which the $p$ unknown prediction coefficients $a_r$, $1 \le r \le p$, that minimize the total squared error can be calculated. Using Equations (2.14) and (2.17), the minimum total squared error $E_p$ can then be obtained.
A fast method for solving the normal equations in Equation (2.17) is the Levinson-Durbin algorithm [19]. The algorithm works recursively, which has the advantage that, as the prediction order increases, it yields the predictor coefficients for the current order and for all lower orders. First, the algorithm is initialized by setting
$$E_0 = R_0.$$
Then, for each prediction order $m$ up to $p$, the recursion of Equations (2.21) to (2.24) is carried out. With each iteration, the minimum total squared error $E_m$ of the current order $m$ is calculated in Equation (2.24). Since $E_m$ is always positive, and with $E_0 = R_0$, it can be shown that the minimum total energy decreases as $m$ increases, so that
$$0 \le E_m \le E_{m-1}.$$
The recursion thus brings the further advantage that the computation of predictor coefficients can be stopped once $E_m$ falls below a certain threshold.
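A minimal sketch of the Levinson-Durbin recursion is given below. It assumes that the normal equations take the form sum_r a_r R_|i-r| = -R_i for 1 <= i <= p, which is one common sign convention and may differ from the one used in the original equations; the function names and the example autocorrelation values are hypothetical.

```python
def autocorrelation(x, max_lag):
    """R_i = sum_n x[n] * x[n+i] for i = 0..max_lag."""
    return [sum(x[n] * x[n + i] for n in range(len(x) - i))
            for i in range(max_lag + 1)]


def levinson_durbin(R, p):
    """Solve sum_r a_r R_|i-r| = -R_i (assumed convention) recursively.

    Returns the coefficients [a_1, ..., a_p] and the list of minimum total
    squared errors [E_0, ..., E_p] for all orders up to p.
    """
    a = []
    E = [R[0]]                       # initialisation E_0 = R_0
    for m in range(1, p + 1):
        acc = R[m] + sum(a[r - 1] * R[m - r] for r in range(1, m))
        k = -acc / E[-1]             # reflection coefficient of order m
        # Update the lower-order coefficients and append the new one.
        a = [a[r - 1] + k * a[m - r - 1] for r in range(1, m)] + [k]
        E.append((1.0 - k * k) * E[-1])
    return a, E


# Hypothetical autocorrelation values R_0, R_1, R_2.
R = [1.0, 0.5, 0.1]
a, E = levinson_durbin(R, 2)
```

Because the error list `E` holds the minimum error for every order, a caller can inspect it to decide at which order the prediction is already good enough, in line with the stopping criterion mentioned above.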
Envelope estimation in time and frequency domain
An important feature of LPC filters is their ability to model the characteristics of the signal in the frequency domain if the filter coefficients are computed on a time signal. Equivalent to the prediction of the time series, the linear prediction approximates the spectrum of the series. Depending on the prediction order, the LPC filter may be used to calculate a more or less detailed envelope of the signal frequency response. The following sections are based on [11, 12, 13, 14, 16, 17, 20, 21 ].
From Equation (2.13) it can be seen that the original signal spectrum can be perfectly reconstructed from the residual spectrum by filtering it with the all-pole filter $H(z)$. By setting $u_n = \delta_n$ in Equation (2.6), where $\delta_n$ is the Dirac delta function, the signal spectrum $S(z)$ can be modeled by the all-pole filter from Equation (2.7), which yields the approximation of Equation (2.26). With the prediction coefficients $a_r$ calculated using the Levinson-Durbin algorithm in Equations (2.21) to (2.24), only the gain factor $G$ remains to be determined. With $u_n = \delta_n$, Equation (2.6) becomes Equation (2.27), where $h_n$ is the impulse response of the synthesis filter $H(z)$. In analogy to Equation (2.17), the autocorrelation of the impulse response $h_n$ satisfies Equation (2.28). By squaring $h_n$ in Equation (2.27) and summing over all $n$, the zeroth autocorrelation coefficient of the synthesis filter impulse response is obtained (Equation (2.29)). The zeroth autocorrelation coefficient corresponds to the total energy of the signal $s_n$. Under the condition that the total energy in the original signal spectrum $S(z)$ and in its approximation should be equal, the zeroth autocorrelation coefficients of the signal and of the impulse response must be equal. Using this conclusion, the autocorrelation of the signal $s_n$ in Equation (2.17) and the autocorrelation of the impulse response $h_n$ in Equation (2.28) become equal for $0 \le i \le p$. The gain factor $G$ can then be calculated by rearranging Equation (2.29) and using Equation (2.19).
Fig. 12.5 shows the spectrum $S(z)$ of one frame (1024 samples) of a speech signal $s_n$. The smoother black curve is the spectral envelope calculated according to Equation (2.26) with a prediction order of $p = 20$. As the prediction order $p$ increases, the approximation adapts ever more closely to the original spectrum $S(z)$. The dashed curve is calculated with the same formula as the black curve, but with a prediction order of $p = 100$. This approximation is considerably more detailed and provides a better fit to $S(z)$. In the limit $p \to \mathrm{length}(s_n)$, the all-pole filter can model $S(z)$ exactly, provided that the time signal $s_n$ is minimum phase.
Due to the duality between time and frequency, linear prediction can also be applied to the spectrum of a signal in the frequency domain in order to model its temporal envelope. The temporal estimate is calculated in the same way, except that the predictor coefficients are computed on the signal spectrum and the impulse response of the resulting all-pole filter is then transformed into the time domain. Fig. 12.6 shows the absolute values of the original time signal and two approximations using different prediction orders.
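The role of the gain factor can be illustrated with a small numerical sketch. It assumes the sign convention in which the inverse filter is 1 + sum_r a_r z^{-r} and the gain satisfies G^2 = R_0 + sum_r a_r R_r; both are common textbook forms and are assumptions here, as are the function names and the example autocorrelation values.

```python
import cmath
import math


def levinson_durbin(R, p):
    """Return (a, E_p) for the assumed convention sum_r a_r R_|i-r| = -R_i."""
    a, E = [], R[0]
    for m in range(1, p + 1):
        k = -(R[m] + sum(a[r - 1] * R[m - r] for r in range(1, m))) / E
        a = [a[r - 1] + k * a[m - r - 1] for r in range(1, m)] + [k]
        E *= 1.0 - k * k
    return a, E


def lpc_envelope(a, gain, num_points):
    """Magnitude of G / (1 + sum_r a_r e^{-j*omega*r}) on a grid from 0 to pi."""
    envelope = []
    for k in range(num_points):
        omega = math.pi * k / num_points
        A = 1.0 + sum(a[r - 1] * cmath.exp(-1j * omega * r)
                      for r in range(1, len(a) + 1))
        envelope.append(gain / abs(A))
    return envelope


def impulse_response(a, gain, length):
    """Response of the all-pole synthesis filter to a unit impulse."""
    h = []
    for n in range(length):
        acc = gain if n == 0 else 0.0
        for r in range(1, len(a) + 1):
            if n - r >= 0:
                acc -= a[r - 1] * h[n - r]
        h.append(acc)
    return h


R = [1.0, 0.5, 0.1]                      # hypothetical autocorrelation values
a, E_p = levinson_durbin(R, 2)
gain = math.sqrt(E_p)                    # assumed relation G^2 = E_p
envelope = lpc_envelope(a, gain, 8)
energy = sum(v * v for v in impulse_response(a, gain, 200))
```

In this example the energy of the impulse response comes out equal to R_0, which is the energy-matching condition used above to fix the gain factor.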
Transients
In the literature, many different definitions of transients can be found. Some refer to them as onsets or attacks [22, 23, 24, 25], while others use these terms to describe transients [26, 27]. This section describes the different approaches to defining and characterizing transients for the purposes of this disclosure.
Characterization of transients
Some early definitions of transients describe them as time domain phenomena only, such as found in Kliewer and Mertins [24 ]. They describe the transients as signal segments in the time domain whose energy rises rapidly from a low value to a high value. To define the boundaries of these segments, they use the ratio of the energies within two sliding windows on the time domain energy signal just before and just after the signal sample n. Dividing the energy of the window immediately after n by the energy of the preceding window yields a simple criterion function c (n), the peak of which corresponds to the beginning of the transient period. These peaks occur when the energy just after n is substantially greater than the previous energy, marking the onset of a sharp energy rise. The end of the transient is then defined as the time after the starting point at which c (n) falls below a certain threshold.
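The energy-ratio criterion described above can be sketched as follows. The window length, the small constant `eps` that guards against division by zero, and the test signal are assumptions introduced for this illustration; the exact criterion function of the cited work may differ in detail.

```python
def energy_ratio(x, L, eps=1e-12):
    """c(n): energy of the L samples starting at n divided by the energy
    of the L samples directly before n. Positions without two full
    windows are left at zero.
    """
    c = [0.0] * len(x)
    for n in range(L, len(x) - L + 1):
        before = sum(v * v for v in x[n - L:n])
        after = sum(v * v for v in x[n:n + L])
        c[n] = after / (before + eps)    # eps is a hypothetical safeguard
    return c


# Low-level signal followed by a sudden rise in amplitude at sample 50.
x = [0.01] * 50 + [1.0] * 50
c = energy_ratio(x, L=8)
onset = max(range(len(c)), key=lambda n: c[n])
```

The criterion peaks at the sample where the window after n contains only high-energy samples and the window before n contains only low-energy samples, i.e. at the start of the energy rise.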
Masri and Bateman [28] describe transients as radical changes in the temporal envelope of the signal, where the signal segments before and after the onset of the transient are highly uncorrelated. The spectrum of a narrow time frame containing a percussive transient event typically shows a large burst of energy across all frequencies, which can be seen in the spectrogram of the castanets in Fig. 12.7(b). Other studies [23, 29, 25] also characterize transients in a time-frequency representation of the signal, where they correspond to time frames with a sharp increase in energy occurring simultaneously in several adjacent frequency bands. Rodet and Jaillet [25] further note that this sudden increase in energy is particularly noticeable at higher frequencies, since the total energy of the signal is mainly concentrated in the low-frequency region.
Herre [20] and Zhang et al. [30] characterize transients by the degree of flatness of the temporal envelope. Because of the sudden increase in energy over time, a transient signal has a very non-flat temporal structure with a correspondingly flat spectral envelope. One way to determine spectral flatness is to apply the spectral flatness measure (SFM) [31] in the frequency domain. The spectral flatness SF of a signal is calculated as the ratio of the geometric mean Gm to the arithmetic mean Am of the power spectrum (Equation (2.31)).
Here, $|X_k|$ denotes the magnitude of the spectral coefficient with index $k$, and $K$ denotes the total number of coefficients of the spectrum $X_k$. If SF → 0, the signal has a non-flat frequency structure and is therefore more likely to be tonal. If, in contrast, SF → 1, the spectral envelope is flatter, which may correspond to a transient or to a noise-like signal. A flat spectrum alone does not strictly indicate a transient, however: in contrast to a noise signal, the phase response of a transient is highly correlated. To determine the flatness of the temporal envelope, the measure in Equation (2.31) can be applied analogously in the time domain.
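A minimal sketch of the spectral flatness computation is shown below. It assumes that the measure operates on power values (squared magnitudes) and adds a small floor `eps` to avoid taking the logarithm of zero; both choices, as well as the function name, are assumptions for illustration.

```python
import math


def spectral_flatness(power, eps=1e-12):
    """Ratio of the geometric mean to the arithmetic mean of power values."""
    K = len(power)
    # Geometric mean computed in the log domain for numerical robustness.
    geometric = math.exp(sum(math.log(p + eps) for p in power) / K)
    arithmetic = sum(power) / K
    return geometric / (arithmetic + eps)


flat_spectrum = [1.0] * 16                 # noise-like or transient-like
tonal_spectrum = [1e-6] * 15 + [1.0]       # one dominant spectral peak
sf_flat = spectral_flatness(flat_spectrum)
sf_tonal = spectral_flatness(tonal_spectrum)
```

The flat example gives a value close to 1, while the example with a single dominant coefficient gives a value close to 0, matching the interpretation of the two limiting cases in the text. Passing squared time-domain samples instead of a power spectrum applies the same measure to the temporal envelope.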
Suresh Babu et al. [27] additionally distinguish between attack transients and frequency-domain transients. They characterize frequency-domain transients by an abrupt change of the spectral envelope between adjacent time frames rather than by an energy change in the time domain as described before. Such signal events can be produced, for example, by bowed instruments like violins or by human speech when the pitch of the presented sound changes. Fig. 12.7 shows the difference between attack transients and frequency-domain transients. The signal in (c) is an audio signal produced by a violin. The vertical dashed line marks the instant at which the pitch of the presented signal changes, i.e. the start of a new tone or of a frequency-domain transient, respectively. In contrast to the attack transient produced by the castanets in (a), this new note onset does not cause a noticeable change in signal amplitude. The instant of this change in spectral content can be seen in the spectrogram in (d). The spectral difference before and after the transient is more obvious in Fig. 12.8, which shows two spectra of the violin signal in Fig. 12.7(c): one for the time frame preceding the onset of the frequency-domain transient and one for the time frame following it. The harmonic components clearly differ between the two spectra. However, the perceptual coding of frequency-domain transients does not cause the kinds of artifacts that are addressed by the restoration algorithms presented herein, and such transients are therefore disregarded. Henceforth, the term "transient" refers to attack transients only.
Distinction between transients, onsets and attacks
A distinction between the concepts of transients, onsets and attacks can be found in Bello et al. [26] and is adopted here. The differences between these terms are also illustrated in Fig. 12.9, using the example of a transient signal produced by castanets.
In general, the authors do not give a complete definition of the concept of a transient either, but they characterize it as a short time interval rather than a distinct time instant. During this transient interval, the amplitude of the signal rises rapidly in a relatively unpredictable manner. It is not precisely defined, however, where the transient ends after its amplitude has reached its peak.
In their rather informal definition, they also include part of the amplitude decay in the transient interval. With this characterization, acoustic instruments produce transients during which they are excited (e.g., when a guitar string is plucked or a snare drum is struck) and subsequently damped. After this initial decay, the slower signal decay that follows is caused only by the resonance frequencies of the instrument body.
The onset is the instant at which the amplitude of the signal starts to rise. Herein, the onset is defined as the start time of the transient.
The attack of a transient is the period within the transient between its onset and its peak, during which the amplitude increases.
Psychoacoustics
This section gives a basic introduction to the psychoacoustic concepts that are used in perceptual audio coding and in the transient enhancement algorithm described later. The aim of psychoacoustics is to describe the relation between the measurable physical properties of sound signals and the internal percepts that these sounds evoke in a listener [32]. Human auditory perception has limits that perceptual audio coders can exploit when encoding audio content in order to reduce the bit rate of the coded audio signal substantially. Although the goal of perceptual audio coding is to encode audio material such that the decoded audio signal sounds exactly like, or as close as possible to, the original signal [1], it may still introduce some audible coding artifacts. This section provides the background needed to understand the origin of these artifacts and how the psychoacoustic model is used by perceptual audio coders. For a more detailed description of psychoacoustics, the reader is referred to [33, 34].
Simultaneous masking
Simultaneous masking refers to the psychoacoustic phenomenon that a sound (the masked sound) can be inaudible to a human listener when it is presented simultaneously with a stronger sound (the masking sound) that is close to it in frequency. A widely used example of this phenomenon is a conversation between two people at the side of a road. Without interfering noise they can understand each other perfectly, but when a car or truck passes by they have to raise their voices to keep understanding each other.
The concept of simultaneous masking can be explained by examining the functionality of the human auditory system. When a probe sound is presented to a listener, it induces a traveling wave along the basilar membrane (BM) within the cochlea, spreading from its base at the oval window to its apex [17]. Starting at the oval window, the vertical displacement of the traveling wave initially rises slowly, reaches its maximum at a certain position, and then declines abruptly [33, 34]. The position of the maximum displacement depends on the frequency of the stimulus. The BM is narrow and stiff at the base and about three times wider and less stiff at the apex. Each position along the BM is thus most sensitive to one specific frequency, with high-frequency signal components causing a maximum displacement near the base and low frequencies near the apex of the BM. This specific frequency is commonly referred to as the characteristic frequency (CF) [33, 34, 35, 36]. The cochlea can therefore be regarded as a frequency analyzer with a bank of highly overlapping band-pass filters with asymmetric frequency responses, called auditory filters [17, 33, 34, 37]. The passbands of these auditory filters have a non-uniform bandwidth, which is referred to as the critical bandwidth. The concept of critical bands was first introduced by Fletcher in 1933 [38, 39]. He assumed that the audibility of a probe sound presented simultaneously with a noise signal depends only on the amount of noise energy that is close in frequency to the probe sound. If the signal-to-noise ratio (SNR) in this frequency region is below a certain threshold, i.e. the energy of the noise signal is to a certain degree higher than the energy of the probe sound, the probe sound is inaudible to a human listener [17, 33, 34]. However, simultaneous masking does not occur only within a single critical band. In fact, a masking sound at the CF of a critical band can also affect the audibility of a masked sound outside the boundaries of this critical band, but to a lesser extent [17]. The simultaneous masking effect is illustrated in Fig. 12.10. The dashed curve represents the threshold in quiet, which "describes the minimum sound pressure level required for a human listener to detect a narrowband sound without other sounds" [32]. The black curve is the simultaneous masking threshold corresponding to a narrow-band noise masking sound, depicted as the dark gray bar. The masking sound masks a probe sound (light gray bar) if the sound pressure level of the probe sound is below the simultaneous masking threshold at the frequency of the masked sound.
Temporal masking
Masking is effective not only when the masking sound and the masked sound are presented simultaneously, but also when they are separated in time. A probe sound can be masked before and after the time period in which the masking sound is present [40], which is referred to as pre-masking and post-masking, respectively. The temporal masking effects are illustrated in Fig. 12.11. Pre-masking takes place before the onset of the masking sound, which is depicted for negative values of t. After the pre-masking period, simultaneous masking is effective, with an overshoot effect directly after the masking sound is switched on, during which the simultaneous masking threshold is temporarily increased [37]. After the masking sound is switched off (depicted for positive values of t), post-masking is effective. Pre-masking can be explained by the integration time the auditory system needs to produce the perception of a presented sound [40]. In addition, the auditory system processes louder sounds faster than weaker ones [33]. The time period in which pre-masking occurs depends strongly on the amount of training of the particular listener [17, 34] and can last up to 20 ms [33], but it is significant only in a period of 1-5 ms before the onset of the masking sound [17, 37]. The amount of post-masking depends on the frequency, level and duration of the masking sound and the probe sound, and on the time period between the probe sound and the instant at which the masking sound is switched off [17, 34]. According to Moore [34], post-masking is effective for at least 20 ms, while other studies report even longer durations of up to about 200 ms [33]. In addition, Painter and Spanias state that post-masking "also exhibits frequency dependent behavior similar to simultaneous masking, which can be observed when the relationship of masking tone and detection frequency changes" [17, 34].
Perceptual audio coding
The objective of perceptual audio coding is to compress an audio signal such that the resulting bit rate is as small as possible compared with the original audio while a transparent sound quality is maintained, i.e. the reconstructed (decoded) signal should not be distinguishable from the uncompressed signal [1, 17, 32, 37, 41, 42]. This is done by removing redundant and irrelevant information from the input signal, exploiting some of the limitations of the human auditory system. Redundancy can be removed, for example, by exploiting the correlation between subsequent signal samples, between spectral coefficients or even between different audio channels, and by appropriate entropy coding, whereas irrelevance is handled by the quantization of the spectral coefficients.
General architecture of perceptual Audio encoder
The basic structure of a monophonic perceptual audio encoder is depicted in Fig. 12.12. First, the input audio signal is transformed into a frequency-domain representation by applying an analysis filter bank. In this way, the resulting spectral coefficients can be quantized selectively "depending on their frequency content" [32]. The quantization block rounds the continuous values of the spectral coefficients to a discrete set of values in order to reduce the amount of data in the coded audio signal. The compression thereby becomes lossy, since the exact values of the original signal cannot be reconstructed at the decoder. The quantization error introduced in this way can be regarded as an additive noise signal, referred to as quantization noise. The quantization is steered by the output of a perceptual model, which calculates the temporal and simultaneous masking thresholds for each spectral coefficient in each analysis window. The absolute threshold in quiet can also be used, by assuming that "a 4 kHz signal with a peak amplitude of ±1 least significant bit of a 16 bit integer is at the absolute threshold of hearing" [31]. In the bit allocation block, these masking thresholds are used to determine the number of bits needed so that the induced quantization noise becomes inaudible to a human listener. In addition, spectral coefficients that lie below the calculated masking thresholds (and are therefore irrelevant to human auditory perception) need not be transmitted and can be quantized to zero. The quantized spectral coefficients are then entropy coded (e.g., by Huffman coding or arithmetic coding), which reduces the redundancy in the signal data. Finally, the coded audio signal and additional side information, such as the quantization scale factors, are multiplexed into a single bit stream, which is then transmitted to the receiver. The audio decoder on the receiver side (see Fig. 12.13) then performs the inverse operations by demultiplexing the input bit stream, reconstructing the spectral values with the transmitted scale factors, and applying a synthesis filter bank that is complementary to the analysis filter bank of the encoder to reconstruct the output time signal.
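The view of quantization as additive noise can be made concrete with a very small sketch. It uses a plain uniform quantizer with a single step size, which is a simplifying assumption; real perceptual codecs use band-wise scale factors and typically non-uniform quantizers. All names and values below are hypothetical.

```python
def quantize(coefficients, step):
    """Round each coefficient to the nearest integer multiple of the step size."""
    return [round(c / step) for c in coefficients]


def dequantize(indices, step):
    """Reconstruct coefficient values from the transmitted integer indices."""
    return [i * step for i in indices]


coefficients = [0.93, -0.41, 0.04, 0.27, -0.02]
step = 0.1    # a larger step means fewer bits and more quantization noise
indices = quantize(coefficients, step)
decoded = dequantize(indices, step)
# The quantization noise is the difference between decoded and original values.
noise = [d - c for d, c in zip(decoded, coefficients)]
```

Coefficients whose magnitude is below half the step size are quantized to zero, and the magnitude of the noise never exceeds half the step size. Choosing the step size per frequency band such that this noise stays below the masking threshold is the task of the bit allocation described above.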
Transient coding artifacts
Although the goal of perceptual audio coding is to produce a transparent sound quality of the decoded audio signal, it still exhibits audible artifacts. Some of these artifacts affecting the perceived quality of the transient will be described below.
Birdies and bandwidth limitation
Only a limited number of bits are available to the bit allocation process for quantizing a block of the audio signal. If the bit demand of a frame is too high, some spectral coefficients may be deleted by quantizing them to zero [1, 43, 44]. This essentially causes a temporary loss of some high-frequency content and is mainly a problem in low-bit-rate coding or when dealing with very demanding signals, e.g. signals with frequent transient events. The bit allocation varies from one block to the next, so the frequency content of a spectral coefficient may be deleted in one frame and present in the following one. The resulting spectral gaps are called "birdies" and can be seen in the bottom image of Fig. 12.14. The coding of transients in particular is prone to birdie artifacts, since the energy in these signal parts is spread over the entire frequency spectrum. A common approach is to limit the bandwidth of the audio signal prior to the encoding process in order to save the available bits for quantizing the low-frequency (LF) content, which is also illustrated for the coded signal in Fig. 12.14. This trade-off is reasonable because birdies affect the perceived audio quality more than a constant loss of bandwidth, which is generally more tolerable. Even with bandwidth limitation, however, birdies may still occur. Although the transient enhancement methods described later do not aim to correct spectral gaps or to extend the bandwidth of the coded signal as such, the loss of high frequencies also reduces the energy and degrades the attack of transients (see Fig. 12.15), which is addressed by the attack enhancement method described later.
Pre-echoes
Another common compression artifact is the so-called pre-echo [1, 17, 20, 43, 44]. Pre-echoes occur when a sharp increase in signal energy (i.e., a transient) takes place near the end of a signal block. The substantial energy contained in transient signal parts is distributed over a wide range of frequencies, which leads to the estimation of comparatively high masking thresholds in the psychoacoustic model and therefore to the allocation of only a few bits for the quantization of the spectral coefficients. In the decoding process, the resulting high amount of quantization noise is then spread over the entire duration of the signal block. For a stationary signal, the quantization noise is assumed to be completely masked, but for a signal block containing a transient, the quantization noise can precede the transient onset and become audible if it "exceeds the pre-masking [...] period" [1]. Although several methods for dealing with pre-echoes have been proposed, these artifacts are still the subject of current research. Fig. 12.16 shows an example of a pre-echo artifact for a castanet transient. The dashed black curve is the waveform of the original signal, which has no substantial signal energy prior to the transient onset. The pre-echo induced before the transient of the coded signal (gray curve) is therefore not simultaneously masked and can be perceived even without direct comparison with the original signal. The proposed method for the supplementary reduction of pre-echo noise is described later.
Several methods have been proposed over the past few years to improve the quality of the transient. These enhancement methods can be classified into those methods that are integrated in an audio codec and those methods that work as a post-processing module on a decoded audio signal. An overview of previous studies and methods regarding transient enhancement and transient event detection is given below.
Transient detection
Edler [6] proposed an early method for transient detection in 1989. This detection is used to control the adaptive window switching method, which is described later in this section. The proposed method only detects, at the audio encoder, whether a transient is present in a signal frame of the original input signal, but not its exact position within the frame. Two decision criteria are computed to determine the likelihood that a transient is present in a particular signal frame. For the first criterion, the input signal x(n) is filtered with an FIR high-pass filter according to Equation (2.5) with the filter coefficients b = [1, -1]. The resulting difference signal d(n) shows large peaks at the instants where the amplitude changes rapidly between adjacent samples. The first criterion $c_1(m)$ is then calculated as the ratio of the magnitude sums of d(n) for two neighboring blocks, where the variable m denotes the frame number and N the number of samples within one frame. With $c_1(m)$, however, very small transients at the end of a signal frame are difficult to detect, since their contribution to the total energy within the frame is rather small. A second criterion $c_2(m)$ is therefore formulated, which calculates the ratio of the maximum magnitude value of x(n) to the mean magnitude within one frame. If $c_1(m)$ or $c_2(m)$ exceeds a certain threshold, the particular frame m is determined to contain a transient event.
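The two frame-wise criteria can be sketched as follows. Since the exact formulas are not reproduced here, the sketch assumes that c1(m) divides the magnitude sum of d(n) in frame m by that of the preceding frame and that c2(m) divides the maximum magnitude of x(n) by the mean magnitude in frame m. The thresholds 3.0 and 5.0, the constant `eps` and the test signal are hypothetical.

```python
def edler_criteria(x, N, eps=1e-12):
    """Frame-wise criteria c1 (difference-signal ratio) and c2 (peak-to-mean ratio)."""
    # Difference signal d(n) from the FIR high-pass filter with b = [1, -1].
    d = [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]
    c1, c2 = [], []
    for m in range(len(x) // N):
        magnitudes = [abs(v) for v in x[m * N:(m + 1) * N]]
        diff_now = sum(abs(v) for v in d[m * N:(m + 1) * N])
        if m == 0:
            c1.append(0.0)               # no preceding block available
        else:
            diff_prev = sum(abs(v) for v in d[(m - 1) * N:m * N])
            c1.append(diff_now / (diff_prev + eps))
        c2.append(max(magnitudes) / (sum(magnitudes) / N + eps))
    return c1, c2


# Low-level alternating signal with a single click at sample 40.
x = [0.01 * (-1) ** n for n in range(64)]
x[40] = 1.0
c1, c2 = edler_criteria(x, N=16)
detected = [a > 3.0 or b > 5.0 for a, b in zip(c1, c2)]
```

With a frame length of 16 samples, only the third frame, which contains the click, is flagged. The result is a per-frame decision without any information about where in the frame the transient lies, as noted above.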
Kliewer and Mertins [24] also propose a detection method that operates exclusively in the time domain. Their approach aims to determine the exact start and end samples of a transient by applying two sliding rectangular windows to the signal energy. The signal energies within the left and the right window are calculated first, where L is the window length and n denotes the signal sample exactly in the middle between the left and right windows. The detection function D(n) is then calculated from these two energies. Peaks of D(n) that exceed a certain threshold $T_b$ correspond to the onset of a transient. The end of the transient event is determined as the "maximum value of D(n) that is less than some threshold $T_e$ immediately after the onset" [24].
Other detection methods are based on linear prediction in the time domain and exploit the predictability of the signal waveform to distinguish between transient and steady-state signal parts [45]. Lee and Kuo proposed a method using linear prediction in 2006. They decompose the input signal into several subbands and compute a detection function for each of the resulting narrow-band signals. The detection function is obtained as the output of filtering the narrow-band signal with the inverse filter according to Equation (2.10). A subsequent peak selection algorithm determines the local maxima of the resulting prediction error signals as onset time candidates for each subband signal, and these candidates are then used to determine a single transient onset time for the wideband signal.
The approach of Niemeyer and Edler [23] works on a complex time-frequency representation of the input signal and determines transient onsets as a sharp increase in signal energy in adjacent frequency bands. Each band-pass signal is filtered according to Equation (2.3) to compute a temporal envelope that follows sudden energy increases, which serves as the detection function. A transient criterion is then computed not only for frequency band k, but also taking into account K = 7 neighboring frequency bands on either side of k.
In the following, different strategies for enhancing transient signal parts are described. The block diagram in Fig. 13.1 gives an overview of the different parts of the restoration algorithm. The algorithm takes the coded signal $s_n$, represented in the time domain, and transforms it into a time-frequency representation $X_{k,m}$ by means of a short-time Fourier transform (STFT). The enhancement of the transient signal parts is then carried out in the STFT domain. In the first stage of the enhancement algorithm, the pre-echo directly preceding the transient is reduced. The second stage enhances the attack of the transient, and the third stage sharpens the transient using a linear-prediction-based method. The enhanced signal $Y_{k,m}$ is then transformed back into the time domain with an inverse short-time Fourier transform (ISTFT) to obtain the output signal $y_n$.
In the STFT, the input signal $s_n$ is first divided into frames of length $N$ that overlap by $L$ samples, and each frame is windowed with an analysis window function $w_{n,m}$ to obtain the signal blocks $x_{n,m} = s_n \cdot w_{n,m}$. Each frame $x_{n,m}$ is then transformed into the frequency domain using a discrete Fourier transform (DFT). This yields the spectrum $X_{k,m}$ of the windowed signal frame $x_{n,m}$, where $k$ is the spectral coefficient index and $m$ is the frame number. The analysis by the STFT can be written as

$$X_{k,m} = \sum_{i=0}^{N-1} x_{n+i,m}\, e^{-j 2 \pi k i / K}$$

with $n = m \cdot (N - L)$ and $0 \le k \le K - 1$.
$(N - L)$ is also referred to as the hop size. For the analysis window $w_{n,m}$, a sine window has been used.
To capture the fine temporal structure of the transient events, the frame size is chosen to be relatively small. For the purposes of this study, it is set to $N = 128$ samples per time frame, with an overlap of $L = N/2 = 64$ samples between two adjacent frames. $K$ in equation (4.2) defines the number of DFT points and is set to $K = 256$. This corresponds to the number of spectral coefficients of the two-sided spectrum of $X_{k,m}$. Prior to the STFT analysis, each windowed input signal frame is zero-padded to a longer vector of length $K$ to match the number of DFT points. These parameters give a time resolution that is fine enough to isolate the transient signal portion in one frame from the rest of the signal, while providing enough spectral coefficients for the subsequent frequency-selective enhancement operations.
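The framing parameters above can be illustrated with a minimal NumPy sketch. It is not the implementation of the embodiment: the function name `stft_frames` is hypothetical, and the exact sine window formula (with a half-sample offset) is an assumption, since the text only states that a sine window is used.

```python
import numpy as np

def stft_frames(s, N=128, L=64, K=256):
    """Sketch of the STFT analysis: frames of N samples, overlap L, K-point DFT."""
    hop = N - L                                    # hop size (N - L)
    # Sine analysis window; the half-sample offset is an assumption.
    w = np.sin(np.pi * (np.arange(N) + 0.5) / N)
    n_frames = 1 + (len(s) - N) // hop             # only complete frames are used
    X = np.empty((K, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = s[m * hop : m * hop + N] * w       # windowed block x_{n,m}
        X[:, m] = np.fft.fft(frame, n=K)           # zero-pads the frame to K points
    return X

rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
X = stft_frames(s)
```

With these settings, a 1024-sample input yields 15 frames of 256 two-sided spectral coefficients each.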
Transient detection
In an embodiment, the method for transient enhancement is applied specifically to the transient event itself, rather than constantly modifying the signal. Therefore, the instant of the transient must be detected. For the purpose of this research, transient detection methods have been implemented that have been adjusted independently for each individual audio signal. This means that for each particular sound file, the particular parameters and thresholds of the transient detection method, which will be described later in this section, are specifically adjusted to produce the best detection of the transient signal portion. The result of this detection is a binary value for each frame, indicating the presence of a transient start point.
The implemented transient detection method can be divided into two independent stages: the calculation of a suitable detection function and the method of selecting a starting point using the detection function as its input signal. In order to incorporate transient detection into a real-time processing algorithm, a proper look-ahead is required, since the subsequent pre-echo reduction method operates in a time interval before the start point of the detected transient.
Calculation of a detection function
For the calculation of the detection function, the input signal is transformed into a representation that enables improved onset detection compared with the original signal. The input to the transient detection block in fig. 13.1 is the time-frequency representation $X_{k,m}$ of the input signal $s_n$. The detection function is calculated in five steps:
1. For each frame, the energy values of several adjacent spectral coefficients are summed.
2. The temporal envelope of the resulting bandpass signals over all time frames is calculated.
3. The temporal envelope of each bandpass signal is high-pass filtered.
4. The resulting high-pass filtered signals are summed in the frequency direction.
5. Temporal post-masking is taken into account.
TABLE 4.1: Boundary frequencies $f_{low}$ and $f_{high}$ and bandwidth $\Delta f$ of the passbands of $X_{K,m}$ that result from combining $n$ adjacent spectral coefficients of the energy spectrum of the signal $X_{k,m}$.
First, for each time frame $m$, the energies of $n$ adjacent spectral coefficients of $X_{k,m}$ are summed to obtain the subband signals $X_{K,m}$, where $n \in \{2^0, 2^1, 2^2, \ldots, 2^6\} = 2^{\kappa}$ and $K$ denotes the index of the resulting subband signal. Thus, for each frame $m$, $X_{K,m}$ represents the energy contained in a specific frequency band of the spectrum $X_{k,m}$. The boundary frequencies $f_{low}$ and $f_{high}$, the passband bandwidth $\Delta f$ and the number $n$ of combined spectral coefficients are given in Table 4.1. The values of the bandpass signals in $X_{K,m}$ are then smoothed over all time frames. This is done by filtering each subband signal $X_{K,m}$ in the time direction with an IIR low-pass filter according to equation (2.2),

$$\tilde{X}_{K,m} = b \cdot X_{K,m} + a \cdot \tilde{X}_{K,m-1},$$

where $\tilde{X}_{K,m}$ is the resulting smoothed energy signal for each channel $K$. The filter coefficients $b$ and $a = 1 - b$ are adapted independently for each processed audio signal to yield a satisfactory time constant. The slope of $\tilde{X}_{K,m}$ is then calculated via high-pass (HP) filtering of $\tilde{X}_{K,m}$ using equation (2.5),

$$S_{K,m} = \sum_{i=0}^{p} b_i \cdot \tilde{X}_{K,m-i},$$

where $S_{K,m}$ is the differentiated envelope, $b_i$ are the filter coefficients of the deployed FIR high-pass filter and $p$ is the filter order. The specific filter coefficients $b_i$ are also defined independently for each individual signal. $S_{K,m}$ is then summed over all $K$ in the frequency direction to obtain the overall envelope slope $F_m$. Large peaks in $F_m$ correspond to time frames in which a transient event occurs. To ignore smaller peaks, especially those following larger ones, the amplitude of $F_m$ is reduced by a threshold of 0.1 such that $F_m = \max(F_m - 0.1, 0)$. Post-masking after larger peaks is also taken into account: $F_m$ is filtered with a single-pole recursive averaging filter equivalent to equation (2.2) to give $\tilde{F}_m$, and, according to equation (2.3), the larger of $\tilde{F}_m$ and $F_m$ is taken for each frame $m$ to generate the resulting detection function

$$D_m = \max(\tilde{F}_m, F_m).$$
Fig. 13.2 shows the castanets signal in the time domain and in the STFT domain, with the resulting detection function $D_m$ shown in the bottom image. $D_m$ is then used as the input signal of the onset selection method, which is described in the following section.
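The five steps of the detection function can be sketched as follows. This is an illustration under stated assumptions rather than the tuned implementation: the band layout (widths 1, 2, 4, ..., 64 starting at bin 1), the filter coefficients `b`, `hp` and `b_mask`, and the helper names are hypothetical, since the text states that these parameters are adjusted per audio signal. The fixed threshold of 0.1 also presumes a signal normalization that is not specified here.

```python
import numpy as np

def one_pole(x, b):
    """Recursive average y[m] = b*x[m] + (1-b)*y[m-1] along the last axis."""
    y = np.zeros_like(x, dtype=float)
    prev = np.zeros(x.shape[:-1])
    for m in range(x.shape[-1]):
        prev = b * x[..., m] + (1.0 - b) * prev
        y[..., m] = prev
    return y

def detection_function(X, b=0.3, hp=(1.0, -1.0), b_mask=0.3, floor=0.1):
    """X: complex STFT of shape (bins, frames) with at least 128 bins."""
    power = np.abs(X) ** 2
    # Step 1: sum energies of n = 1, 2, 4, ..., 64 adjacent bins (assumed layout).
    edges = np.concatenate(([1], 1 + np.cumsum(2 ** np.arange(7))))
    bands = np.stack([power[lo:hi].sum(axis=0)
                      for lo, hi in zip(edges[:-1], edges[1:])])
    # Step 2: temporal envelope of each band via a one-pole low-pass.
    env = one_pole(bands, b)
    # Step 3: FIR high-pass along time (here a first difference).
    slope = np.zeros_like(env)
    n_frames = env.shape[1]
    for i, b_i in enumerate(hp):
        slope[:, i:] += b_i * env[:, :n_frames - i]
    # Step 4: sum over bands, then subtract the floor and clip at zero.
    F = np.maximum(slope.sum(axis=0) - floor, 0.0)
    # Step 5: post-masking, keep the larger of F and its recursive average.
    return np.maximum(F, one_pole(F, b_mask))

X = np.zeros((256, 20), dtype=complex)
X[:, 10] = 10.0            # a single energy burst in frame 10
D = detection_function(X)
```

For the synthetic burst, the detection function is zero before frame 10, peaks at frame 10 and decays afterwards because of the post-masking step.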
Starting point selection
In essence, the onset selection method determines the local maxima of the detection function $D_m$ as the onset time frames of the transient events in $s_n$. For the detection function of the castanets signal in fig. 13.2, this is a trivial task; the results of the onset selection are shown as red circles in the bottom image. Other signals, however, do not always yield a detection function that is this easy to handle, so determining the actual transient onsets becomes somewhat more involved. For example, the detection function of the music signal at the bottom of fig. 13.3 exhibits several local peaks that are not associated with transient onset frames. The onset selection algorithm must therefore distinguish between these "false" and "true" transient onsets.
First, the amplitude of a peak in $D_m$ must exceed a certain threshold $th_{peak}$ for the peak to be considered an onset candidate. This prevents small amplitude variations of the input signal $s_n$ that are not removed by the smoothing and post-masking filters in equations (4.5) and (4.7) from being detected as transient onsets. For each value $D_{m=l}$ of the detection function $D_m$, the onset selection algorithm scans the regions before and after the current frame $l$ for values larger than $D_{m=l}$. If no larger value exists within $l_b$ frames before and $l_a$ frames after the current frame, $l$ is determined to be a transient frame. The numbers of "look-back" and "look-ahead" frames $l_b$ and $l_a$ and the threshold $th_{peak}$ are defined individually for each audio signal. After the relevant peaks have been identified, detected transient onset frames closer than 50 ms to the preceding onset are discarded [50, 51]. The output of the onset selection method (and of the transient detection as a whole) is the set of indices of the transient onset frames $m_i$, which are required by the subsequent transient enhancement blocks.
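The selection rules described above (peak threshold, look-back and look-ahead scan, 50 ms minimum distance) can be sketched as below. The function name and the example parameter values are hypothetical; the text states that $th_{peak}$, $l_b$ and $l_a$ are tuned per signal. The hop size of 64 samples and the sampling rate of 44.1 kHz are taken from the STFT settings of this study.

```python
import numpy as np

def select_onsets(D, th_peak, l_b, l_a, hop=64, fs=44100, min_gap_s=0.050):
    """Return frame indices of D that qualify as transient onsets."""
    min_gap = int(round(min_gap_s * fs / hop))     # 50 ms expressed in frames
    onsets = []
    for l in range(len(D)):
        if D[l] <= th_peak:                        # peak must exceed the threshold
            continue
        lo, hi = max(0, l - l_b), min(len(D), l + l_a + 1)
        if D[l] < D[lo:hi].max():                  # a larger value exists nearby
            continue
        if onsets and l - onsets[-1] < min_gap:    # too close to the previous onset
            continue
        onsets.append(l)
    return onsets

D = np.zeros(200)
D[20], D[30], D[60], D[120] = 1.0, 0.8, 0.05, 0.5
onsets = select_onsets(D, th_peak=0.1, l_b=3, l_a=3)
```

In this toy example the peak at frame 30 is dropped because it lies within 50 ms (34 frames) of the onset at frame 20, and the peak at frame 60 is below the threshold.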
Pre-echo reduction
The purpose of this enhancement stage is to reduce the coding artifact known as pre-echo, which is audible for a certain period of time before the onset of a transient. An overview of the pre-echo reduction algorithm is shown in fig. 13.4. The pre-echo reduction stage takes the output $X_{k,m}$ of the STFT analysis (100) and the previously detected transient onset frame indices $m_i$ as input. In the worst case, the pre-echo starts before the transient event by up to the length of the long-block analysis window on the encoder side (2048 samples regardless of the codec sampling rate). The duration of this window depends on the sampling frequency of the particular encoder. For the worst case, a minimum codec sampling frequency of 8 kHz is assumed. At the sampling rate of 44.1 kHz of the decoded and resampled input signal $s_n$, the length of the long analysis window (and thus the potential extent of the pre-echo region) corresponds to $N_{long} = 2048 \cdot 44.1\,\text{kHz} / 8\,\text{kHz} \approx 11290$ samples (or 256 ms) of the time signal $s_n$. Since the enhancement methods described in this section operate on the time-frequency representation $X_{k,m}$, $N_{long}$ needs to be converted into $M_{long} = (N_{long} - L)/(N - L) = (11290 - 64)/(128 - 64) \approx 176$ frames, where $N$ and $L$ are the frame size and overlap of the STFT analysis block (100) in fig. 13.1. $M_{long}$ is set as the upper limit of the pre-echo width and is used to limit the search area for the pre-echo start frame before the detected transient onset frame $m_i$. For this study, the sampling rate of the decoded signal before resampling was taken as ground truth, so that the upper limit $M_{long}$ of the pre-echo width is adapted to the particular codec used to encode $s_n$.
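The conversion from the encoder window length to the frame count $M_{long}$ is simple arithmetic and can be checked with a short sketch. The function name is hypothetical, and rounding both results up is an assumption that reproduces the figures quoted in the text.

```python
import math

def pre_echo_search_frames(codec_fs, out_fs=44100, long_window=2048, N=128, L=64):
    """Upper limit of the pre-echo width in samples (N_long) and frames (M_long)."""
    n_long = math.ceil(long_window * out_fs / codec_fs)   # window length at out_fs
    m_long = math.ceil((n_long - L) / (N - L))            # convert samples to frames
    return n_long, m_long

worst_case = pre_echo_search_frames(8000)      # minimum codec sampling rate
same_rate = pre_echo_search_frames(44100)      # codec running at the output rate
```

For a codec running at the output rate of 44.1 kHz, the same formula gives a much shorter search region of 31 frames.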
Before the actual width of the pre-echo is estimated, tonal frequency components located before the transient are detected (200). The pre-echo width is then determined (240) within the region of $M_{long}$ frames before the transient frame. Using this estimate, a threshold for the signal envelope in the pre-echo region can be calculated (260) in order to reduce the energy in those spectral coefficients whose amplitude values exceed the threshold. For the final pre-echo reduction, a spectral weighting matrix is calculated (450) that contains a multiplicative factor for each $k$ and $m$ and is then multiplied element by element with the pre-echo region of $X_{k,m}$.
Detection of tonal signal components prior to transients
The spectral coefficients detected in this step, which correspond to tonal frequency components preceding the transient onset, are used in the subsequent pre-echo width estimation, as described in the next subsection. It could also be beneficial to use them in the subsequent pre-echo reduction algorithm in order to skip the energy reduction for those tonal spectral coefficients, since pre-echo artifacts are likely to be masked by the tonal components that are present. In some cases, however, skipping the tonal coefficients introduced additional artifacts in the form of an audible energy increase at some frequencies around the detected tonal frequencies, so this approach has been omitted from the pre-echo reduction method in this embodiment.
Fig. 13.5 shows a spectrogram of the potential pre-echo region before a transient of the harmonica audio signal. The spectral coefficients of the tonal components between the two horizontal dashed lines are detected by combining two different methods:
1. linear prediction along the frames of each spectral coefficient, and
2. a comparison of the energy in each $k$ over all $M_{long}$ frames before the transient onset with the running average energy of all previous potential pre-echo regions of length $M_{long}$.
First, a linear prediction analysis across time is performed for each complex-valued STFT coefficient $k$, where the prediction coefficients $a_{k,r}$ are calculated using the Levinson-Durbin algorithm according to equations (2.21)-(2.24). Using these prediction coefficients, a prediction gain $R_{p,k}$ [52, 53, 54] can be calculated for each $k$ as

$$R_{p,k} = 10 \log_{10} \frac{\sigma^2_{X_k}}{\sigma^2_{E_k}}\ \text{dB},$$

where $\sigma^2_{X_k}$ and $\sigma^2_{E_k}$ are the variances of the input signal $X_{k,m}$ and of its prediction error $E_{k,m}$, respectively, for each $k$. $E_{k,m}$ is calculated according to equation (2.10). The prediction gain indicates how accurately $X_{k,m}$ can be predicted with the prediction coefficients $a_{k,r}$, where a high prediction gain corresponds to good predictability of the signal. Transient and noise-like signals tend to yield a lower prediction gain for time-domain linear prediction, so if $R_{p,k}$ is sufficiently high for a particular $k$, that spectral coefficient is likely to contain tonal signal components. For this method, the prediction gain threshold for tonal frequency components is set to 10 dB.

In addition to a high prediction gain, tonal frequency components should also have a relatively high energy compared with the rest of the signal spectrum. Therefore, the energy $\varepsilon_{i,k}$ in the potential pre-echo region of the current $i$-th transient is compared with a specific energy threshold. The energy threshold is calculated from a running average energy $\bar{\varepsilon}_i$ of the past pre-echo regions, which is updated for each subsequent transient. Note that $\bar{\varepsilon}_i$ does not yet include the energy in the current pre-echo region of the $i$-th transient; the index $i$ merely indicates that $\bar{\varepsilon}_i$ is used for the detection at the current transient. $\bar{\varepsilon}_i$ is updated recursively, with the coefficient $b = 0.7$, from the total energy over all spectral coefficients $k$ and frames $m$ of the previous pre-echo region.

Therefore, if $R_{p,k} > 10$ dB and $\varepsilon_{i,k}$ exceeds the energy threshold (equation (4.11)), the spectral coefficient index $k$ in the current pre-echo region is defined to contain a tonal component.
The result of the tonal signal component detection method (200) is a vector $k_{tonal,i}$ for each pre-echo region preceding a detected transient, which specifies the spectral coefficient indices $k$ that satisfy the condition in equation (4.11).
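A minimal sketch of the prediction-gain test follows. It deviates from the text in one labeled respect: the predictor is fitted by least squares (`numpy.linalg.lstsq`) instead of the Levinson-Durbin recursion, which keeps the example short. The predictor order, the function names and the `energy_threshold` argument are hypothetical; in particular, the derivation of the energy threshold from the running average energy is not reproduced here and must be supplied by the caller.

```python
import numpy as np

def prediction_gain_db(x, order=2):
    """Prediction gain of a complex sequence x (one STFT bin over time), in dB."""
    x = np.asarray(x, dtype=complex)
    # Column r holds x[m - r]; the target is x[m] for m = order, ..., len(x) - 1.
    A = np.column_stack([x[order - r : len(x) - r] for r in range(1, order + 1)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)   # least-squares predictor
    err = target - A @ coeffs                             # prediction error E_{k,m}
    p_x = np.mean(np.abs(target) ** 2)
    p_e = np.mean(np.abs(err) ** 2)
    eps = 1e-12                                           # avoids division by zero
    return 10.0 * np.log10((p_x + eps) / (p_e + eps))

def tonal_bins(X_region, energy_threshold, gain_db=10.0, order=2):
    """Indices k whose prediction gain and energy both exceed their thresholds."""
    gains = np.array([prediction_gain_db(row, order) for row in X_region])
    energy = np.sum(np.abs(X_region) ** 2, axis=1)        # energy per bin k
    return np.flatnonzero((gains > gain_db) & (energy > energy_threshold))

rng = np.random.default_rng(1)
m = np.arange(176)
tone = np.exp(1j * 0.3 * m)                               # stationary partial
noise = rng.standard_normal(176) + 1j * rng.standard_normal(176)
k_tonal = tonal_bins(np.stack([tone, noise]), energy_threshold=0.0)
```

The stationary complex exponential is almost perfectly predictable and is flagged as tonal, whereas the noise row yields a prediction gain close to 0 dB and is not.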
Estimation of pre-echo width
Since no information is available about the exact framing of the decoder used to decode $s_n$ (and thus about the actual pre-echo width), the actual pre-echo start frame must be estimated (240) for each transient before the pre-echo reduction process. This estimate is crucial for the resulting sound quality of the processed signal after pre-echo reduction. If the estimated pre-echo region is too small, part of the existing pre-echo remains in the output signal. If it is too large, too much of the signal amplitude before the transient is attenuated, potentially resulting in audible signal drop-outs. As described above, $M_{long}$ represents the size of the long analysis window used in the audio encoder and is regarded as the maximum possible number of frames over which the pre-echo spreads before the transient event. This maximum range $M_{long}$ of the pre-echo spread is referred to as the pre-echo search region.
Fig. 13.6 shows a schematic representation of the pre-echo estimation method. The estimation method follows the assumption that the induced pre-echo causes an increase in the amplitude of the temporal envelope before the transient onset. This is shown in fig. 13.6 for the region between the two vertical dashed lines. During decoding of the encoded audio signal, the quantization noise is not spread evenly over the entire synthesis block but is shaped by the particular form of the window function used. The induced pre-echo therefore causes a gradual rise in amplitude rather than a sudden increase. Before the start of the pre-echo, the signal may contain silence or other signal components, such as the sustained portion of another acoustic event that occurred some time earlier. The purpose of the pre-echo width estimation method is therefore to find the instant at which the rise in signal amplitude corresponds to the onset of the induced quantization noise (i.e. the pre-echo artifact).
The detection algorithm uses only the HF content of $X_{k,m}$ above 3 kHz, because most of the energy of the input signal is concentrated in the LF region. For the specific STFT parameters used here, this corresponds to the spectral coefficients with $k \ge 18$. In this way, the detection of the pre-echo onset becomes more robust, since it is assumed that no other signal components are present that might complicate the detection process. Furthermore, tonal spectral coefficients $k_{tonal}$ that have been detected with the tonal component detection method described above and that correspond to frequencies above 3 kHz are also excluded from the estimation process. The remaining coefficients are then used to calculate a suitable detection function that simplifies the pre-echo estimation. First, the signal energy is summed in the frequency direction, over the remaining coefficients up to $k_{max}$, for all frames in the pre-echo search region to obtain the amplitude signal $L_m$. $k_{max}$ corresponds to the cut-off frequency of the low-pass filter that was used to limit the bandwidth of the original audio signal during encoding. $L_m$ is then smoothed to reduce fluctuations of the signal level. The smoothing is performed by filtering $L_m$ with a 3-tap running average filter in both the forward and the backward direction across time, which yields the smoothed amplitude signal $\tilde{L}_m$. In this way, the filter delay is compensated and the filter becomes zero-phase. $\tilde{L}_m$ is then differentiated to calculate its slope $L'_m$, and $L'_m$ is filtered with the same running average filter that was used before for $L_m$. This results in the smoothed slope $\tilde{L}'_m$.
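The smoothing and differentiation steps can be sketched as follows. The function names are hypothetical, the first difference used as the derivative is an assumption, and the edge handling (implicit zero padding at the boundaries) is deliberately simplistic, so the first and last few frames of the result should not be trusted.

```python
import numpy as np

def zero_phase_average(x, taps=3):
    """Running average applied forward and backward, which cancels the delay."""
    h = np.ones(taps) / taps
    fwd = np.convolve(x, h)[: len(x)]              # causal forward pass
    bwd = np.convolve(fwd[::-1], h)[: len(x)]      # same filter on the reversed signal
    return bwd[::-1]

def smoothed_slope(X_region, k_min=18, k_max=None, k_tonal=()):
    """Smoothed slope of the HF energy over the frames of the search region."""
    k_max = X_region.shape[0] - 1 if k_max is None else k_max
    keep = np.setdiff1d(np.arange(k_min, k_max + 1),
                        np.asarray(k_tonal, dtype=int))   # drop tonal bins
    L = np.sum(np.abs(X_region[keep]) ** 2, axis=0)       # amplitude signal L_m
    L_s = zero_phase_average(L)                           # smoothed L_m
    slope = np.diff(L_s, prepend=L_s[0])                  # first difference
    return zero_phase_average(slope)                      # smoothed slope

# Toy region: flat energy for 20 frames, then a linear rise (a pre-echo-like ramp).
frames = np.arange(40)
energy = np.where(frames < 20, 1.0, 1.0 + (frames - 19))
X_region = np.sqrt(energy)[None, :] * np.ones((129, 1), dtype=complex)
D = smoothed_slope(X_region)

impulse = np.zeros(21)
impulse[10] = 1.0
response = zero_phase_average(impulse)
```

The impulse response of the forward-backward filter is symmetric around the impulse position, which is what makes the smoothing zero-phase.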
The smoothed slope $\tilde{L}'_m$ is used as the resulting detection function $D_m = \tilde{L}'_m$ for determining the start frame of the pre-echo.
The basic idea of the pre-echo estimation is to find the last frame with a negative value of $D_m$, which marks the instant after which the signal energy increases until the transient onset. Fig. 13.7 shows two examples of the calculation of the detection function $D_m$ and of the pre-echo start frame estimated from it. For the signals in (a) and (b), the amplitude signals $L_m$ and $\tilde{L}_m$ are shown in the upper image, and the lower image shows the slopes $L'_m$ and $\tilde{L}'_m$, the latter being the detection function $D_m$. For the signal in fig. 13.7 (a), the detection simply requires finding the last frame with a negative value of $D_m$ in the lower image. The pre-echo start frame $m_{pre}$ determined in this way is shown as a vertical line. The plausibility of this estimate can be verified by visual inspection of the upper image of fig. 13.7 (a). However, simply taking the last negative value of $D_m$ would not give a suitable result for the lower signal in (b). Here, the detection function ends with a negative value, and taking this last frame as $m_{pre}$ would effectively result in no pre-echo reduction at all. Furthermore, there may be earlier frames with negative values of $D_m$ that do not coincide with the actual start of the pre-echo either. This can be seen, for example, in the detection function of signal (b) for $52 \le m \le 58$. The search algorithm therefore needs to take into account such fluctuations of the amplitude signal, which may also be present within the actual pre-echo region.
The pre-echo start frame $m_{pre}$ is estimated with an iterative search algorithm. The procedure is described using the example detection function shown in fig. 13.8, which is the same as the detection function of the signal in fig. 13.7 (b). The top and bottom images of fig. 13.8 show the first two iterations of the search algorithm. The estimation method scans $D_m$ in reverse order, from the estimated transient onset to the beginning of the pre-echo search region, and determines the frames at which the sign of $D_m$ changes. These frames are shown in the figure as numbered vertical lines. The first iteration in the top image starts at the last frame with a positive value of $D_m$ (line 1), and the preceding frame at which the sign changes from + to - is determined as the pre-echo start frame candidate (line 2). To decide whether the candidate frame should be taken as $m_{pre}$, two further frames with a sign change preceding the candidate frame are determined, $m^{+}$ (line 3) and $m^{-}$ (line 4). The decision whether the candidate frame is taken as the resulting pre-echo start frame $m_{pre}$ is based on a comparison between the summed values of the gray and black areas ($A^{+}$ and $A^{-}$). This comparison checks whether the black area $A^{-}$ (where $D_m$ exhibits a negative slope) can be regarded as a sustained portion of the input signal before the start of the pre-echo, or whether it is a temporary amplitude decrease within the actual pre-echo region. The summed slopes $A^{+}$ and $A^{-}$ are calculated by summing the slope values over the frames of the respective area. Using $A^{+}$ and $A^{-}$, the candidate pre-echo start frame is defined as the resulting start frame $m_{pre}$ if

$$A^{-} > a \cdot A^{+}. \qquad (4.15)$$

For the first iteration of the estimation algorithm, the factor $a$ is initially set to $a = 0.5$, and for each subsequent iteration it is adjusted to $a = 0.92 \cdot a$. This places greater emphasis on the negative slope region $A^{-}$, which is necessary for some signals whose amplitude signal $L_m$ exhibits stronger amplitude variations throughout the search region. If the stop criterion in equation (4.15) does not hold (which is the case for the first iteration in the top image of fig. 13.8), the next iteration takes the previously determined $m^{+}$ as the last frame under consideration, as shown in the bottom image, and proceeds in the same way as the preceding iteration. It can be seen that equation (4.15) holds for the second iteration, since $A^{-}$ is significantly greater than $A^{+}$, so the candidate frame of that iteration is taken as the pre-echo start frame $m_{pre}$.
Adaptive pre-echo reduction
The adaptive pre-echo reduction described in the following can be divided into three stages, as shown in the bottom layer of the block diagram of fig. 13.4: determining a pre-echo amplitude threshold $th_k$, calculating a spectral weighting matrix $W_{k,m}$, and reducing the pre-echo noise by element-by-element multiplication of $W_{k,m}$ with the complex-valued input signal $X_{k,m}$. Fig. 13.9 shows the spectrogram of the input signal $X_{k,m}$ in the upper image and the processed output signal $Y_{k,m}$, in which the pre-echo has been reduced, in the middle image. The pre-echo reduction is performed by element-by-element multiplication of $X_{k,m}$ with the calculated spectral weights $W_{k,m}$ (shown in the lower image of fig. 13.9):
$$Y_{k,m} = X_{k,m} \cdot W_{k,m}. \qquad (4.16)$$
The purpose of the pre-echo reduction method is to weight the values of $X_{k,m}$ in the previously estimated pre-echo region such that the resulting amplitude values of $Y_{k,m}$ fall below a specific threshold $th_k$. The spectral weighting matrix $W_{k,m}$ is created by determining this threshold $th_k$ for each spectral coefficient of $X_{k,m}$ over the pre-echo region and calculating the weighting factors required for the pre-echo attenuation for each frame $m$. The calculation of $W_{k,m}$ is restricted to the spectral coefficients $k_{min} \le k \le k_{max}$, where $k_{min}$ is the spectral coefficient index corresponding to the frequency closest to $f_{min} = 800$ Hz, so that no weighting is applied for $k < k_{min}$ and $k > k_{max}$. $f_{min}$ was chosen to avoid amplitude reduction in the low-frequency region, because most of the fundamental frequencies of instruments and speech lie below 800 Hz. Amplitude attenuation in this frequency region tends to produce audible signal drop-outs before transients, especially for complex musical audio signals. Furthermore, the calculation of $W_{k,m}$ is restricted to the estimated pre-echo region $m_{pre} \le m \le m_i - 2$, where $m_i$ is the detected transient onset. Because adjacent time frames overlap by 50% in the STFT analysis of the input signal $s_n$, the frame immediately preceding the transient onset frame $m_i$ may also contain the transient event. Pre-echo attenuation is therefore limited to frames $m \le m_i - 2$.
Pre-echo threshold determination
As mentioned above, a threshold $th_k$ needs to be determined (260) for each spectral coefficient of $X_{k,m}$ with $k_{min} \le k \le k_{max}$. The threshold is used to determine the spectral weights required for the pre-echo attenuation in the respective pre-echo region preceding each detected transient onset. $th_k$ corresponds to the amplitude value to which the signal amplitude values of $X_{k,m}$ should be reduced in order to obtain the output signal $Y_{k,m}$. An intuitive approach would be to simply take the amplitude value of the first frame $m_{pre}$ of the estimated pre-echo region, since it should correspond to the instant at which the signal amplitude starts to rise steadily as a result of the induced pre-echo quantization noise. However, $|X_{k,m_{pre}}|$ is not necessarily the minimum amplitude value for all signals, for example if the pre-echo region was estimated too large or because of possible fluctuations of the amplitude signal in the pre-echo region. Fig. 13.10 shows two examples of the amplitude signal $|X_{k,m}|$ in the pre-echo region before a transient onset as solid gray curves. The top image shows a spectral coefficient of the castanets signal, and the bottom image shows the harmonica signal in a subband containing the sustained tonal component of a preceding harmonica tone. To calculate a suitable threshold, $|X_{k,m}|$ is first filtered forward and backward over time with a 2-tap running average filter to obtain a smoothed envelope $|\tilde{X}_{k,m}|$ (shown as dashed black curves). The smoothed signal $|\tilde{X}_{k,m}|$ is then multiplied by a weighting curve $C_m$ so that the amplitude values increase toward the end of the pre-echo region. $C_m$ is shown in fig. 13.11 and is generated as a function of $m$ and of $M_{pre}$, the number of frames in the pre-echo region. The weighted envelopes obtained by multiplying $|\tilde{X}_{k,m}|$ by $C_m$ are shown as dashed gray curves in both plots of fig. 13.10. The pre-echo noise threshold $th_k$ is then taken as the minimum of $|\tilde{X}_{k,m}| \cdot C_m$, indicated by black circles. The resulting thresholds $th_k$ for the two signals are depicted as horizontal dotted lines. For the castanets signal in the top image, it would be sufficient to simply take the minimum of the smoothed amplitude signal $|\tilde{X}_{k,m}|$ without weighting it with $C_m$. For the harmonica signal in the bottom image, however, applying the weighting curve is necessary, because there the minimum of $|\tilde{X}_{k,m}|$ is located at the end of the pre-echo region. Taking this value as $th_k$ would result in a strong attenuation of the tonal signal component and hence in audible drop-out artifacts. Moreover, because of the higher signal energy in tonal spectral coefficients, the pre-echo is likely to be masked and thus inaudible. As can be seen, multiplying $|\tilde{X}_{k,m}|$ by the weighting curve $C_m$ changes the minimum of $|\tilde{X}_{k,m}|$ in the upper signal of fig. 13.10 only slightly, while it results in a suitably high $th_k$ for the tonal harmonica component shown in the bottom plot.
Calculation of spectral weights
The threshold $th_k$ obtained in this way is used to calculate the spectral weights $W_{k,m}$ required to reduce the amplitude values of $X_{k,m}$. For this purpose, a target amplitude signal $|\hat{X}_{k,m}|$ is calculated (450) for each spectral coefficient index $k$, which represents the optimal output signal with reduced pre-echo for each individual $k$. Using $|\hat{X}_{k,m}|$, the spectral weighting matrix $W_{k,m}$ can be calculated as

$$W_{k,m} = \frac{|\hat{X}_{k,m}|}{|X_{k,m}|}. \qquad (4.18)$$

$W_{k,m}$ is then smoothed (460) over frequency by applying a 2-tap running average filter in the forward and backward direction for each frame $m$, in order to reduce large differences between the weighting factors of adjacent spectral coefficients $k$ before the multiplication with the input signal $X_{k,m}$. The attenuation of the pre-echo is not applied to its full extent immediately at the pre-echo start frame $m_{pre}$ but is faded in over the duration of the pre-echo region. This is achieved by using (430) a parametric fading curve $f_m$ with adjustable steepness, which is generated (440) as a function of the frame position in the pre-echo region, where the exponent $10^c$ determines the steepness of $f_m$. Fig. 13.12 shows the fading curve for different values of $c$; for this study, $c$ has been set to $-0.5$. Using $f_m$ and $th_k$, the target amplitude signal $|\hat{X}_{k,m}|$ can be calculated (equation (4.20)). This effectively reduces the values of $|X_{k,m}|$ that are above the threshold $th_k$ while leaving the values below $th_k$ unchanged.
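The weighting step can be illustrated with a deliberately simplified sketch. It assumes that the weights are the ratio of target amplitude to input amplitude, and it replaces the target amplitude of the text by a plain clamp to $th_k$: the fading curve $f_m$ and the masking model are omitted, so this is not the method of the embodiment. The function names and the edge handling of the 2-tap smoothing are hypothetical.

```python
import numpy as np

def smooth_over_frequency(W):
    """2-tap running average along k (axis 0), forward and then backward."""
    fwd = W.copy()
    fwd[1:] = 0.5 * (W[1:] + W[:-1])           # forward pass; first bin unchanged
    out = fwd.copy()
    out[:-1] = 0.5 * (fwd[:-1] + fwd[1:])      # backward pass; last bin unchanged
    return out

def pre_echo_weights(X_region, th):
    """Real-valued weights that pull |X| toward the per-bin threshold th[k]."""
    mag = np.abs(X_region)
    target = np.minimum(mag, th[:, None])      # simplified target: clamp only
    W = np.where(mag > 0, target / np.maximum(mag, 1e-12), 1.0)
    return smooth_over_frequency(W)

rng = np.random.default_rng(2)
X_region = rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5))
W = pre_echo_weights(X_region, th=np.full(8, 0.5))
Y = X_region * W                               # element-by-element multiplication
```

Because the weights are real and positive, the multiplication changes only the magnitude of each coefficient and leaves its phase untouched. Note that after smoothing over frequency, individual magnitudes may end slightly above the threshold.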
Application of a pre-masking model
Transient events act as maskers that can temporally mask preceding and subsequent weaker sounds. A pre-masking model is therefore also applied (420) here, such that the values of $|X_{k,m}|$ are only reduced until they fall below the pre-masking threshold, where they are assumed to be inaudible. The pre-masking model used first calculates a "prototype" pre-masking threshold, which is then adjusted to the signal level of the particular masker transient in $X_{k,m}$. The parameters for calculating the pre-masking threshold were selected according to B. Edler (personal communication, November 22, 2016) [55]. The prototype threshold is generated as an exponential function whose level and slope are determined by the parameters $L$ and $\alpha$. The level parameter $L$ is set to

$$L = L_{fall} + L_0 = 50\,\text{dB} + 10\,\text{dB} = 60\,\text{dB}.$$

At a time $t_{fall}$ before the onset of the masker, the pre-masking threshold should have dropped by $L_{fall} = 50$ dB. For this purpose, $t_{fall}$ needs to be converted into the corresponding number of frames $m_{fall} = t_{fall} \cdot f_s / (N - L)$ (with $t_{fall}$ expressed in seconds), where $(N - L)$ is the hop size of the STFT analysis and $f_s$ is the sampling frequency. Inserting $L$, $L_{fall}$ and $m_{fall}$ into equation (4.21) yields equation (4.24), which can be solved for the parameter $\alpha$. The resulting prototype pre-masking threshold is shown in fig. 13.13 for the period before the onset of the masker (which occurs at $m = 0$). The vertical dotted line marks the time instant $m_{fall}$, corresponding to $t_{fall}$ before the masker onset, at which the threshold has dropped by $L_{fall} = 50$ dB. According to Fastl and Zwicker [33] and Moore [34], pre-masking can last up to 20 ms. For the framing parameters used in the STFT analysis, this corresponds to a pre-masking duration of $M_{mask} \approx 14$ frames, so that the prototype threshold is set to $-\infty$ for frames $m \le -M_{mask}$.
To calculate the signal-dependent pre-masking threshold $mask_{k,m,i}$ in each pre-echo region of $X_{k,m}$, the detected transient frame $m_i$ and the following $M_{mask}$ frames are regarded as time instants of potential maskers. Thus, for each spectral coefficient, the prototype threshold is shifted to each $m_i \le m < m_i + M_{mask}$ and adjusted to the signal level of $X_{k,m}$ with a signal-to-mask ratio of $-6$ dB (i.e. the distance between the masker level and the threshold at the masker frame). The maximum of the overlapping thresholds is then used as the resulting pre-masking threshold $mask_{k,m,i}$ for the respective pre-echo region. Finally, $mask_{k,m,i}$ is smoothed over frequency in both directions by applying a single-pole recursive averaging filter equivalent to the filtering operation in equation (2.2), with the filter coefficient $b = 0.3$.
The pre-masking threshold $mask_{k,m,i}$ is then used to adjust the values of the target amplitude signal $|\hat{X}_{k,m}|$ (as calculated in equation (4.20)).
Fig. 13.14 shows the same two signals as fig. 13.10, with the resulting target amplitude signal $|\hat{X}_{k,m}|$ as solid black curves. For the castanets signal in the top image, it can be seen how the reduction of the signal amplitude to the threshold $th_k$ is faded in over the entire pre-echo region, as well as the influence of the pre-masking threshold for the last frame ($m = 16$). The bottom image (tonal spectral component of the harmonica signal) shows that the adaptive pre-echo reduction method has only a minor effect on sustained tonal signal components: it only slightly attenuates the smaller peaks while retaining the overall amplitude of the input signal $X_{k,m}$.
The resulting spectral weights $W_{k,m}$ are then calculated (450) from $X_{k,m}$ and $|\hat{X}_{k,m}|$ according to equation (4.18) and smoothed over frequency before they are applied to the input signal $X_{k,m}$. Finally, the output signal $Y_{k,m}$ of the adaptive pre-echo reduction method is obtained by applying (320) the spectral weights $W_{k,m}$ to $X_{k,m}$ by element-by-element multiplication according to equation (4.16). Note that $W_{k,m}$ is real-valued and therefore does not change the phase response of the complex-valued $X_{k,m}$. Fig. 13.15 shows the result of the pre-echo reduction for a harmonica transient with a tonal component before the transient onset. The spectral weights $W_{k,m}$ in the bottom image show values of about 0 dB in the frequency band of the tonal component, so that the sustained tonal portion of the input signal is preserved.
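The prototype masking threshold used in this section can be sketched under explicit assumptions. The text only states that the prototype is an exponential function with level $L$ and slope $\alpha$; the concrete form `L * exp(alpha * m)` for $m \le 0$, evaluated in dB, is an assumption, as is the choice that the threshold equals $L - L_{fall}$ at $m = -m_{fall}$. The function name is hypothetical, and the value of `t_fall_s` in the usage line is an arbitrary illustration, not a parameter from the text.

```python
import math
import numpy as np

def prototype_mask(t_fall_s, fs=44100, hop=64, L_fall=50.0, L_0=10.0,
                   t_max_s=0.020):
    """Prototype threshold in dB for the frames before a masker at m = 0."""
    L = L_fall + L_0                            # level parameter, 60 dB
    m_fall = t_fall_s * fs / hop                # t_fall converted to frames
    M_mask = round(t_max_s * fs / hop)          # masking lasts up to 20 ms
    # Assumed form: mask(m) = L * exp(alpha * m) with mask(-m_fall) = L - L_fall.
    alpha = math.log(L / (L - L_fall)) / m_fall
    m = np.arange(-M_mask + 1, 1)               # frames with a finite threshold
    return m, L * np.exp(alpha * m), M_mask

m, mask_db, M_mask = prototype_mask(t_fall_s=0.003)   # 3 ms is illustrative only
```

With a hop size of 64 samples at 44.1 kHz, the 20 ms limit corresponds to 14 frames, which matches the value of $M_{mask}$ given in the text.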
Enhancement of transient attack
The methods discussed in this section aim to enhance degraded transient attacks and to emphasize the amplitude of transient events.
Adaptive transient attack enhancement
In addition to the transient frame m_i, the signal in the period after the transient is also amplified, with the amplification gain fading out over this interval. The adaptive transient attack enhancement method uses the output signal of the preceding pre-echo reduction stage as its input signal X_{k,m}. Similar to the pre-echo reduction method, a spectral weighting matrix W_{k,m} is calculated (610) and applied (620) to X_{k,m} as
Y_{k,m} = X_{k,m} · W_{k,m}.
In this case, however, W_{k,m} is used to raise the amplitude of the transient frame m_i and, to a lesser extent, of the subsequent frames, rather than to modify the time period before the transient. The amplification is restricted to frequencies above f_min = 400 Hz and below the cut-off frequency f_max of the low-pass filter applied in the audio encoder. First, the input signal X_{k,m} is divided into a sustained part and a transient part. The subsequent signal amplification is applied only to the transient signal part, while the sustained part is fully retained.
The sustained part is calculated (650) by filtering the amplitude signal |X_{k,m}| with a single-pole recursive averaging filter according to equation (2.4), using a filter coefficient of b = 0.41. The top image of Fig. 13.16 shows the input signal amplitude |X_{k,m}| as a gray curve and the corresponding sustained signal part as a dashed curve. The transient signal part is then calculated (670).
In the bottom image of Fig. 13.16, the transient part of the corresponding input signal amplitude |X_{k,m}| from the top image is shown as a gray curve. Rather than multiplying the transient part by a certain gain factor G only at m_i, the amount of amplification fades out (680) over a period of frames following the transient frame. The fade-out gain curve G_m is shown in Fig. 4.17. The gain factor for the transient frame is set to G_1 = 2.2, which corresponds to an amplitude level increase of 6.85 dB; the gain for the subsequent frames decreases according to G_m. Using the gain curve G_m together with the sustained and transient signal parts, the spectral weighting matrix W_{k,m} is obtained (680).
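As an illustration of the sustained/transient split and the fading gain, the following sketch processes the amplitude of a single frequency bin over time. Several details are assumptions because the corresponding equations and the gain curve of Fig. 4.17 are not reproduced in this text: the recursion s[m] = b·|X[m]| + (1 - b)·s[m-1] for the sustained part, the transient part taken as the non-negative remainder, a linear fade-out over a hypothetical `fade_frames` frames, and the weight formed as (sustained + G_m · transient) / |X|.

```python
import math


def sustained_part(amplitudes, b=0.41):
    """Single-pole recursive average over time (assumed form of eq. 2.4)."""
    out, state = [], amplitudes[0]
    for a in amplitudes:
        state = b * a + (1.0 - b) * state
        out.append(state)
    return out


def gain_curve(num_frames, transient_frame, peak_gain=2.2, fade_frames=4):
    """Gain G_m: peak at the transient frame, linear fade to 1 (assumed shape)."""
    gains = [1.0] * num_frames
    for offset in range(fade_frames):
        m = transient_frame + offset
        if m < num_frames:
            gains[m] = peak_gain - (peak_gain - 1.0) * offset / fade_frames
    return gains


def attack_weights(amplitudes, transient_frame, floor=1e-12):
    """Weights that amplify only the transient part of each frame."""
    sustained = sustained_part(amplitudes)
    gains = gain_curve(len(amplitudes), transient_frame)
    weights = []
    for a, s, g in zip(amplitudes, sustained, gains):
        transient = max(a - s, 0.0)      # part exceeding the smoothed level
        kept = a - transient             # sustained share, left unchanged
        weights.append((kept + g * transient) / max(a, floor))
    return weights


# Hypothetical amplitude trajectory |X_{k,m}| of one bin, transient at m = 3.
amplitudes = [0.1, 0.1, 0.1, 1.0, 0.6, 0.3, 0.2, 0.1]
weights = attack_weights(amplitudes, transient_frame=3)
print([round(w, 4) for w in weights])
print(f"peak gain in dB: {20 * math.log10(2.2):.2f}")
```

Frames without a transient part receive a weight of 1, so the sustained signal passes through unchanged, which mirrors the behaviour described above.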
Then, before the transient attack is enhanced according to equation (4.27), W_{k,m} is smoothed (690) over frequency in both the forward and backward directions according to equation (2.2). In the bottom image of Fig. 13.16, the result of amplifying the transient signal part with the gain curve G_m is shown as a black curve. In the top image, the output signal amplitude |Y_{k,m}| with the enhanced transient attack is shown as a solid black curve.
Temporal envelope shaping using linear prediction
In contrast to the adaptive transient attack enhancement method described above, this method aims to sharpen the attack of a transient event without increasing its amplitude. Instead, the "sharpening" of the transient is accomplished by applying (720) linear prediction in the frequency domain and shaping (740) the temporal envelope of the time signal s_n using two different sets of prediction coefficients a_r for the inverse filter (720a) and the synthesis filter (720b). By filtering the input signal spectrum with the inverse filter (740a), the prediction residual E_{k,m} is obtained according to equations (2.9) and (2.10).
The inverse filter (740a) decorrelates the filtered input signal X_{k,m} in the frequency domain and thereby effectively flattens the temporal envelope of the input signal s_n in the time domain. Filtering E_{k,m} with the synthesis filter (740b) according to equation (2.12), using the same prediction coefficients, would perfectly reconstruct the input signal X_{k,m}. The goal of the attack enhancement is to calculate the prediction coefficients a_r^flat and a_r^synth such that the combination of inverse filter and synthesis filter amplifies the transient while attenuating the signal portions before and after the transient in the particular transient frame.
The LPC shaping method works with different framing parameters than the previously described enhancement methods. Therefore, the output signal of the preceding adaptive attack enhancement stage needs to be re-synthesized with the ISTFT and re-analyzed with the new parameters. For this method, a frame size of N = 512 samples is used with 50% overlap, i.e. L = N/2 = 256 samples. The DFT size is set to 512. The larger frame size is chosen to improve the computation of the prediction coefficients in the frequency domain, since a high frequency resolution is more important here than a high temporal resolution. The prediction coefficients are calculated on the complex spectrum of the input signal X_{k,m}, band-limited to the range between f_min = 800 Hz and f_max (corresponding to the spectral coefficients k_min = 10 ≤ k_lpc ≤ k_max), using the Levinson-Durbin algorithm according to equations (2.21)-(2.24) with an LPC order of p = 24. Before the prediction coefficients a_r^flat and a_r^synth are calculated, the autocorrelation function R_i of the band-pass signal is multiplied (802, 804) by two different window functions W_i^flat and W_i^synth, one for a_r^flat and one for a_r^synth, in order to smooth the temporal envelope described by the respective LPC filter [56]. The window functions are generated by
W_i = c^i, 0 ≤ i ≤ k_max - k_min,
where c_flat = 0.4 and c_synth = 0.94. The top image of Fig. 4.13 shows the two window functions, which are then multiplied by R_i. The autocorrelation function of an exemplary input signal frame is depicted in the bottom image together with its two windowed versions (R_i · W_i^flat) and (R_i · W_i^synth). Using the resulting prediction coefficients as filter coefficients of the flattening filter and the shaping filter, the input signal X_{k,m} is shaped using the result of equation (4.30) together with equation (2.6).
This describes a filtering operation with the resulting shaping filter, which can be interpreted as the combined application (820) of the inverse filter (809) and the synthesis filter (810). With the FIR (inverse/flattening) filter (1 - P_n) and the IIR (synthesis) filter A_n, the time-domain transfer function (TF) of the system is obtained by transforming equation (4.32) with the FFT.
Equation (4.32) can equivalently be formulated in the time domain as the product of the input signal frame s_n and the TF of the shaping filter.
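The equivalence between filtering along the frequency axis and multiplication in the time domain can be checked numerically. The sketch below is a simplified illustration, not the filter of the described apparatus: it assumes a circular FIR prediction-error filter over all DFT bins (the text applies the filter to a band-limited range of coefficients), and the coefficients `a` and the frame `s` are hypothetical values.

```python
import cmath


def dft(signal):
    """Plain O(N^2) DFT, sufficient for a short illustrative frame."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse of dft() above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def filter_over_frequency(spectrum, a):
    """E_k = X_k - sum_j a_j * X_{k-j}, with circular indexing (assumption)."""
    n = len(spectrum)
    return [spectrum[k] - sum(a[j] * spectrum[(k - 1 - j) % n]
                              for j in range(len(a)))
            for k in range(n)]


def time_domain_factor(a, n):
    """h_n = 1 - sum_j a_j * exp(+2*pi*i*j*n/N): the filter seen in time."""
    return [1.0 - sum(a[j] * cmath.exp(2j * cmath.pi * (j + 1) * t / n)
                      for j in range(len(a)))
            for t in range(n)]


s = [0.0, 0.0, 0.1, 1.0, 0.6, 0.3, 0.1, 0.0]   # hypothetical time frame s_n
a = [0.5 + 0.2j, -0.1j]                         # hypothetical coefficients

filtered = idft(filter_over_frequency(dft(s), a))
product = [s_n * h_n for s_n, h_n in zip(s, time_domain_factor(a, len(s)))]
max_error = max(abs(x - y) for x, y in zip(filtered, product))
print(f"max deviation between both formulations: {max_error:.2e}")
```

Under these assumptions both formulations agree up to rounding error, which is why the magnitude of the time-domain factor can be read as the temporal envelope imposed by the filter.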
Fig. 13.13 shows the different time-domain TFs of equation (4.33). The two dashed curves correspond to the inverse filter and the synthesis filter, and the solid gray curve represents the combination (820) of the inverse filter and the synthesis filter before multiplication by the gain factor G (811). It can be seen that a filtering operation with a gain factor of G = 1 would result in a strong amplitude increase of the transient event for 140 < n < 426. An appropriate gain factor G can be calculated as the ratio of the two prediction gains of the inverse filter and the synthesis filter.
The prediction gain R_p is derived from the partial correlation coefficients ρ_m (with 1 ≤ m ≤ p), which are related to the prediction coefficients a_r and are calculated together with a_r in equation (2.21) of the Levinson-Durbin algorithm. The prediction gain (811) is then obtained from ρ_m.
The final TF with adjusted amplitude is shown as a solid black curve in Fig. 4.13. Fig. 4.13 shows, in the top image, the resulting output signal y_n after LPC envelope shaping together with the input signal s_n in the transient frame. The bottom image compares the input signal amplitude spectrum |X_{k,m}| with the filtered amplitude spectrum |Y_{k,m}|.
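The chain from windowed autocorrelation to gain factor can be sketched as follows. This is a minimal illustration under stated assumptions: the band spectrum is synthetic, the autocorrelation is the biased estimate over the frequency index, the prediction gain uses the standard relation R_p = 1 / ∏(1 - |ρ_m|²), and the "ratio of the two prediction gains" is read as flat over synth, which is one plausible interpretation because the exact equation is not reproduced in this text.

```python
import cmath
import math
import random


def autocorrelation(spectrum, max_lag):
    """Biased autocorrelation R_i over the frequency index."""
    n = len(spectrum)
    return [sum(spectrum[k + i] * spectrum[k].conjugate() for k in range(n - i))
            for i in range(max_lag + 1)]


def lag_window(r, c):
    """Multiply R_i by the window W_i = c**i."""
    return [value * c ** i for i, value in enumerate(r)]


def levinson_durbin(r, order):
    """Complex Levinson-Durbin recursion.

    Returns the prediction coefficients a_1..a_p (x[k] ~ sum_j a_j * x[k-j])
    and the partial correlation coefficients rho_1..rho_p.
    """
    a, rho = [], []
    error = r[0].real
    for m in range(1, order + 1):
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / error
        a = [a[j] - k * a[m - 2 - j].conjugate() for j in range(m - 1)] + [k]
        error *= 1.0 - abs(k) ** 2
        rho.append(k)
    return a, rho


def prediction_gain(rho):
    """R_p = 1 / prod(1 - |rho_m|^2)."""
    gain = 1.0
    for value in rho:
        gain /= 1.0 - abs(value) ** 2
    return gain


# Hypothetical band spectrum: an impulse-like time signal plus some noise.
rng = random.Random(0)
band = [cmath.exp(-2j * math.pi * 40 * k / 128)
        + 0.3 * complex(rng.gauss(0, 1), rng.gauss(0, 1)) for k in range(100)]

order = 24
r = autocorrelation(band, order)
a_flat, rho_flat = levinson_durbin(lag_window(r, 0.4), order)     # c_flat
a_synth, rho_synth = levinson_durbin(lag_window(r, 0.94), order)  # c_synth

# Assumed reading of the gain factor: flat prediction gain over synth gain.
G = prediction_gain(rho_flat) / prediction_gain(rho_synth)
print(f"R_flat={prediction_gain(rho_flat):.3f}  "
      f"R_synth={prediction_gain(rho_synth):.3f}  G={G:.3f}")
```

The strongly decaying window (c = 0.4) yields a smooth envelope with a small prediction gain, whereas the slowly decaying window (c = 0.94) follows the envelope more closely, so the ratio in this sketch comes out below 1 and counteracts the amplitude increase of the cascade.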
Furthermore, examples of embodiments are set forth subsequently, particularly in relation to the second aspect:
1. An apparatus for post-processing (20) an audio signal, comprising:
a temporal-to-spectral converter (700) for converting the audio signal into a spectral representation comprising a sequence of spectral frames;
a prediction analyzer (720) for computing prediction filter data for a prediction of frequencies within a spectral frame;
a shaping filter (740) controlled by the prediction filter data for shaping the spectral frame to enhance transient portions within the spectral frame; and
a spectrum-time converter (760) for converting a sequence of spectrum frames comprising the shaped spectrum frames into the time domain.
2. The apparatus of example 1,
wherein the prediction analyzer (720) is configured to calculate first prediction filter data (720a) for a flattening filter characteristic (740a) and second prediction filter data (720b) for a shaping filter characteristic (740 b).
3. The apparatus as set forth in example 2,
wherein the prediction analyzer (720) is configured to calculate the first prediction filter data (720a) using a first time constant and to calculate the second prediction filter data (720b) using a second time constant, the second time constant being greater than the first time constant.
4. The apparatus as described in example 2 or 3,
wherein the flattening filter characteristic (740a) is an analysis FIR filter characteristic or an all-zero filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a flatter temporal envelope compared to a temporal envelope of the spectral frame; or
wherein the shaping filter characteristic (740b) is a synthesis IIR filter characteristic or an all-pole filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a less flat temporal envelope than a temporal envelope of the spectral frame.
5. The apparatus as in any one of the preceding examples,
wherein the predictive analyzer (720) is configured to:
calculating (800) an autocorrelation signal from the spectral frame;
windowing (802, 804) the autocorrelation signal using a window having a first time constant or having a second time constant, the second time constant being greater than the first time constant;
calculating (806, 808) first prediction filter data from the windowed autocorrelation signal windowed using the first time constant or calculating second prediction filter coefficients from the windowed autocorrelation signal windowed using the second time constant; and
wherein the shaping filter (740) is configured to shape the spectral frame using the second prediction filter coefficients or using the second prediction filter coefficients and first prediction filter coefficients.
6. The apparatus as in any one of the preceding examples,
wherein the shaping filter (740) comprises a cascade of two controllable sub-filters (809, 810), a first sub-filter (809) being a flattening filter having a flattening filter characteristic and a second sub-filter (810) being a shaping filter having a shaping filter characteristic,
wherein the sub-filters (809, 810) are all controlled by the prediction filter data derived by the prediction analyzer (720), or
wherein the shaping filter (740) is a filter having a combined filter characteristic derived by combining (820) a flattening characteristic and a shaping characteristic, wherein the combined characteristic is controlled by the prediction filter data derived from the prediction analyzer (720).
7. The apparatus as set forth in example 6,
wherein the prediction analyzer (720) is configured to determine the prediction filter data such that using the prediction filter data for the shaping filter (740) results in a degree of shaping that is higher than the degree of flattening obtained by using the prediction filter data for the flattening filter characteristic.
8. The apparatus as in any one of the preceding examples,
wherein the predictive analyzer (720) is configured to apply (806, 808) a Levinson-Durbin algorithm to a filtered autocorrelation signal derived from the spectral frame.
9. The apparatus as in any one of the preceding examples,
wherein the shaping filter (740) is configured to apply gain compensation such that the energy of the shaped spectral frames is equal to or within a tolerance range of ± 20% of the energy of the spectral frames generated by the temporal-to-spectral converter (700).
10. The apparatus as in any one of the preceding examples,
wherein the shaping filter (740) is configured to apply a flattening filter characteristic (740a) with a flattening gain and a shaping filter characteristic (740b) with a shaping gain, and
wherein the shaping filter (740) is configured to perform gain compensation for compensating for the effects of the flattening gain and the shaping gain.
11. The apparatus as set forth in example 6,
wherein the prediction analyzer (720) is configured to calculate a flattening gain and a shaping gain,
wherein the cascade of two controllable sub-filters (809, 810) further comprises a separate gain stage (811) for applying a gain derived from the flattening gain and/or the shaping gain, or a gain function comprised in at least one of the two sub-filters, or
wherein the filter (740) having the combined characteristic is configured to apply a gain derived from the flattening gain and/or the shaping gain.
12. The apparatus as set forth in example 5,
wherein the window comprises a Gaussian window with a time lag as a parameter.
13. The apparatus as in any one of the preceding examples,
wherein the prediction analyzer (720) is configured to calculate prediction filter data for a plurality of frames such that the shaping filter (740) controlled by the prediction filter data performs signal manipulation on a frame of the plurality of frames that includes a transient portion, and such that the shaping filter (740) does not perform signal manipulation or performs less signal manipulation on another frame of the plurality of frames that does not include a transient portion than the frame that includes a transient portion.
14. The apparatus as in any one of the preceding examples,
wherein the spectrum-time converter (760) is configured to apply an overlap-add operation involving at least two adjacent frames of the spectral representation.
15. The apparatus as in any one of the preceding examples,
wherein the time-to-spectrum converter (700) is configured to apply an analysis window with a hop size between 3 ms and 8 ms or with a window length between 6 ms and 16 ms, or
wherein the spectrum-time converter (760) is configured to use an overlap range corresponding to the overlap size of the overlapping windows or corresponding to the hop size between 3 ms and 8 ms used by the converter, or to use a synthesis window having a window length between 6 ms and 16 ms, or wherein the analysis window and the synthesis window are identical to each other.
16. The apparatus as described in example 2 or 3,
wherein the flattening filter characteristic (740a) is an inverse filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a flatter temporal envelope compared to a temporal envelope of the spectral frame; or
wherein the shaping filter characteristic (740b) is a synthesis filter characteristic that, when applied to a spectral frame, results in a modified spectral frame having a temporal envelope that is less flat than a temporal envelope of the spectral frame.
17. The apparatus of any of the preceding examples, wherein the prediction analyzer (720) is configured to calculate prediction filter data for a shaping filter characteristic (740b), and wherein the shaping filter (740) is configured to filter the spectral frame obtained by the temporal-to-spectral converter (700), e.g. without prior flattening.
18. The apparatus of any of the preceding examples, wherein the shaping filter (740) is configured to represent a shaping action in accordance with a temporal envelope of the spectral frame at maximum or less than maximum temporal resolution, and wherein the shaping filter (740) is configured to represent no flattening action, or a flattening action in accordance with a temporal resolution that is lower than the temporal resolution associated with the shaping action.
19. A method of post-processing (20) an audio signal, comprising:
converting (700) the audio signal into a spectral representation comprising a sequence of spectral frames;
calculating (720) prediction filter data for a prediction of frequencies within a spectral frame;
shaping (740) the spectral frame in response to the prediction filter data to enhance transient portions within the spectral frame; and
converting (760) a sequence of spectral frames comprising the shaped spectral frame into the time domain.
20. A computer program for performing the method of example 19 when run on a computer or processor.
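Two of the numerical constraints in the examples above, the energy tolerance of example 9 and the hop and window durations of example 15, can be illustrated with a short sketch. The function names, the example frames and the sample rate of 48 kHz are assumptions made for illustration; only the frame size of 512 samples with 50% overlap and the stated ranges come from the text.

```python
import math


def frame_energy(frame):
    """Energy of a (complex) spectral frame."""
    return sum(abs(x) ** 2 for x in frame)


def compensate_energy(original, shaped, floor=1e-30):
    """Scale the shaped frame so that its energy matches the original frame."""
    scale = math.sqrt(frame_energy(original) / max(frame_energy(shaped), floor))
    return [x * scale for x in shaped]


def within_tolerance(original, shaped, tolerance=0.2):
    """True if the shaped energy is within +/- tolerance of the original."""
    e_in = frame_energy(original)
    return abs(frame_energy(shaped) - e_in) <= tolerance * e_in


def hop_and_window_ms(frame_size, hop_size, sample_rate):
    """Hop size and window length in milliseconds."""
    return 1000.0 * hop_size / sample_rate, 1000.0 * frame_size / sample_rate


# Hypothetical frames: the shaped frame has four times the original energy.
original = [1 + 0j, 0.5j, -0.25 + 0j, 0.1j]
shaped = [2 * x for x in original]
compensated = compensate_energy(original, shaped)

print(within_tolerance(original, shaped))        # energy too high
print(within_tolerance(original, compensated))   # energy matched

hop_ms, win_ms = hop_and_window_ms(512, 256, 48000)  # assumed 48 kHz
print(f"hop = {hop_ms:.2f} ms, window = {win_ms:.2f} ms")
```

With the assumed sample rate, the framing used for the LPC shaping method falls inside the ranges of 3 ms to 8 ms for the hop size and 6 ms to 16 ms for the window length.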
Although some aspects are described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Embodiments of the present invention may be implemented in hardware or software, depending on the particular implementation requirements. The implementation can be performed using a digital storage medium, e.g. a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a flash memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier with electronically readable control signals capable of cooperating with a programmable computer system so as to perform one of the methods described herein.
In general, embodiments of the invention may be implemented as a computer program product with a program code operable to perform one of the methods when the computer program product runs on a computer. The program code may be stored on a machine-readable carrier, for example.
Other embodiments include a computer program stored on a machine-readable carrier or non-transitory storage medium for performing one of the methods described herein.
In other words, an embodiment of the inventive methods is thus a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
Thus, a further embodiment of the inventive method is a data carrier (or digital storage medium, or computer readable medium) comprising a computer program recorded thereon for performing one of the methods described herein.
A further embodiment of the inventive method is thus a data stream or a signal sequence representing a computer program for performing one of the methods described herein. The data stream or signal sequence may for example be arranged to be transmitted via a data communication connection, for example via the internet.
Further embodiments include a processing apparatus, such as a computer or programmable logic device, configured or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon a computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (e.g., a field programmable gate array) may be used to perform some or all of the functions of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor to perform one of the methods described herein. Generally, the method is preferably performed by any hardware means.
The above-described embodiments are merely illustrative of the principles of the present invention. It is to be understood that modifications and variations of the arrangements and details described herein will be apparent to others skilled in the art. It is the intention, therefore, to be limited only by the scope of the claims appended hereto, and not by the specific details presented by way of description and explanation of the embodiments herein.
References
[1] Brandenburg, "
[2] Brandenburg and G. Stoll, "ISO/MPEG-1 audio: A generic standard for coding of high-quality digital audio," J. Audio Eng. Soc., vol. 42, pp. 780-792, October 1994.
[3] ISO/IEC 11172-3, "MPEG-1: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio," international standard, ISO/IEC, 1993. JTC1/SC29/WG11.
[4] ISO/IEC 13818-1, "Information technology - Generic coding of moving pictures and associated audio information: Systems," international standard, ISO/IEC, 2000. ISO/IEC JTC1/SC29.
[5] J. Herre and J. D. Johnston, "Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS)," in 101st Audio Engineering Society Convention, no. 4384, AES, November 1996.
[6] Edler, "Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen," Frequenz - Zeitschrift für Telekommunikation, vol. 43, pp. 253-.
[7] Samaali, M. T.-H. Alouane, and G. Mahé, "Temporal envelope correction for attack restoration in low bit-rate audio coding," in 17th European Signal Processing Conference (EUSIPCO), (Glasgow, Scotland), IEEE,
[8] Lapierre and R. Lefebvre, "Pre-echo noise reduction in frequency-domain audio codecs," in 42nd IEEE International Conference on Acoustics, Speech and Signal Processing,
[9] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Harlow, UK: Pearson Education Limited, 3. ed., 2014.
[10] J. G. Proakis and D. G. Manolakis, Digital Signal Processing - Principles, Algorithms, and Applications. New Jersey, US: Pearson Education Limited, 4. ed., 2007.
[11] Benesty, J. Chen, and Y. Huang, Springer Handbook of Speech Processing, ch. 7. Linear Prediction, pp. 121-134. Berlin: Springer, 2008.
[12] J. Makhoul, "Spectral analysis of speech by linear prediction," in IEEE Transactions on Audio and Electroacoustics, vol. 21,
[13] Makhoul, "Linear prediction: A tutorial review," in Proceedings of the IEEE, vol. 63, pp. 561-.
[14] M. Athineos and D. P. W. Ellis, "Frequency-domain linear prediction for temporal features," in IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 261-266, IEEE, November 2003.
[15] F. Keiler, D. Arfib, and U. "Efficient linear prediction for digital audio effects," in COST G-6 Conference on Digital Audio Effects (DAFX-00), (Verona, Italy),
[16] J. Makhoul, "Spectral linear prediction: Properties and applications," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 283-.
[17] T. Painter and A. Spanias, "Perceptual coding of digital audio," in Proceedings of the IEEE, vol. 88, April 2000.
[18] J. Makhoul, "Stable and efficient lattice methods for linear prediction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, pp. 423-428, IEEE, October 1977.
[19] Levinson, "The Wiener RMS (root mean square) error criterion in filter design and prediction," Journal of Mathematics and Physics, vol. 25, pp. 261-.
[20] Herre, "Temporal noise shaping, quantization and coding methods in perceptual audio coding: A tutorial introduction," in Audio Engineering Society Conference: 17th International Conference: High-Quality Audio Coding, vol. 17, AES,
[21] Schroeder, "Linear prediction, entropy and signal analysis," IEEE ASSP Magazine,
[22] Daudet, S. Molla, and B. Torrésani, "Transient detection and encoding using wavelet coefficient trees," Colloques sur le Traitement du Signal et des Images, September 2001.
[23] Edler and O. Niemeyer, "Detection and extraction of transients for audio coding," in Audio
[24] Kliewer and A. Mertins, "Audio subband coding with improved representation of transient signal segments," in 9th European Signal Processing Conference, vol. 9, (Rhodes), pp. 1-4, IEEE, September 1998.
[25] Jaillet, "Detection and modeling of fast attack transients," in Proceedings of the International Computer Music Conference, (Havana, Cuba), pp. 30-33, 2001.
[26] Bello, L. Daudet, S. Abdallah, C. Duxbury, and M. Davies, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 1035-.
[27] Suresh Babu, A. K. Malot, V. Vijayachandar, and M. Vinay, "Transient detection for transform domain coders," in Audio Engineering Society Convention 116, no. 6175, (Berlin, Germany), May 2004.
[28] Masri and A. Bateman, "Improved modelling of attack transients in music analysis-resynthesis," in International Computer Music Conference, pp. 100-.
[29] Kwong and R. Lefebvre, "Transient detection of audio signals based on an adaptive comb filter in the frequency domain," in Conference on Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar, vol. 1, pp. 542-.
[30] Zhang, C. Cai, and J. Zhang, "A transient signal detection technique based on flatness measure," in 6th International Conference on Computer Science and Education, (Singapore), pp. 310-.
[31] Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE Journal on Selected Areas in Communications,
[32] Herre and S. Disch, Academic Press Library in Signal Processing,
[33] H. Fastl and E. Zwicker, Psychoacoustics - Facts and Models. Heidelberg: Springer, 3. ed., 2007.
[34] B. C. J. Moore, An Introduction to the Psychology of Hearing. London: Emerald, 6. ed., 2012.
[35] P. Dallos, A. N. Popper, and R. R. Fay, The Cochlea. New York: Springer, 1. ed., 1996.
[36] W. M. Hartmann, Signals, Sound, and Sensation. Springer, 5. ed., 2005.
[37] Brandenburg, C. Faller, J. Herre, J. D. Johnston, and B. Kleijn, "Perceptual coding of high-quality digital audio," in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 101, pp. 1905-.
[38] Fletcher and W. A. Munson, "Loudness, its definition, measurement and calculation," The Bell System Technical Journal,
[39] Fletcher, "Auditory patterns," Reviews of Modern Physics,
[40] M. Bosi and R. E. Goldberg, Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 1. ed., 2003.
[41] Noll, "MPEG digital audio coding," IEEE Signal Processing Magazine,
[42] Pan, "A tutorial on MPEG/audio compression," IEEE MultiMedia,
[43] Erne, "Perceptual audio coders 'what to listen for'," in 111th Audio Engineering Society Convention, no. 5489, AES, September 2001.
[44] C.-M. Liu, H.-W. Hsu, and W. Lee, "Compression artifacts in perceptual audio coding," in IEEE Transactions on Audio, Speech, and Language Processing,
[45] Daudet, "A review on techniques for the extraction of transients in musical signals," in Proceedings of the Third International Conference on Computer Music, pp. 219-.
[46] W.-C. Lee and C.-C. J. Kuo, "Musical onset detection based on adaptive linear prediction," in IEEE International Conference on Multimedia and Expo, (Toronto, Ontario), pp. 957-.
[47] M. Link, "An attack processing of audio signals for optimizing the temporal characteristics of a low bit-rate audio coding system," in Audio Engineering Society Convention, vol. 95, October 1993.
[48] T. Vaupel, Ein Beitrag zur Transformationscodierung von Audiosignalen unter Verwendung der Methode der "Time Domain Aliasing Cancellation (TDAC)" und einer Signalkompandierung im Zeitbereich. Ph.D. thesis, Duisburg, Duisburg, Germany, April 1991.
[49] Bertini, M. Magrini, and T. Giunti, "A time-domain system for transient enhancement in recorded music," in 14th European Signal Processing Conference (EUSIPCO), (Florence, Italy), IEEE, September 2013.
[50] Duxbury, M. Sandler, and M. Davies, "A hybrid approach to musical note onset detection," in Proc. of the 5th Int. Conference on Digital Audio Effects (DAFx-02), (Hamburg, Germany), pp. 33-38, September 2002.
[51] Klapuri, "Sound onset detection by applying psychoacoustic knowledge," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing,
[52] S. L. Goh and D. P. Mandic, "Nonlinear adaptive prediction of complex-valued signals by complex-valued PRNN," in IEEE Transactions on Signal Processing, vol. 53, pp. 1827-1836, IEEE,
[53] Haykin and L. Li, "Nonlinear adaptive prediction of nonstationary signals," in IEEE Transactions on Signal Processing, vol. 43, pp. 526-535, IEEE, 1995.
[54] D. P. Mandic, S. Javidi, S. L. Goh, and K. Aihara, "Complex-valued prediction of wind profile using augmented complex statistics," in Renewable Energy, vol. 34, pp. 196-.
[55] Edler, "Parameter of a pre-masking model," Personal communication, November 22, 2016.
[56] ITU-R Recommendation BS.1116-3, "Methods for the subjective assessment of small impairments in audio systems," Recommendation, International Telecommunication Union, Geneva, Switzerland,
[57] ITU-R Recommendation BS.1534-3, "Method for the subjective assessment of intermediate quality level of audio systems," Recommendation, International Telecommunication Union, Geneva, Switzerland, October 2015.
[58] ITU-R Recommendation BS.1770-4, "Algorithms to measure audio programme loudness and true-peak audio level," Recommendation, International Telecommunication Union, Geneva, Switzerland,
[59] S. M. Ross, Introduction to Probability and Statistics for Engineers and Scientists. Elsevier, 3. ed., 2004.