Spatialized audio codec with rotated interpolation and quantization

Document No.: 328161. Publication date: 2021-11-30.

Abstract: the technology "Spatialized audio codec with rotated interpolation and quantization" was designed by S. Ragot and P. Mahé on 2020-02-10. The invention relates to an encoding method for compressing an audio signal forming, over time, a succession of sample frames in each of N channels of a surround sound representation of order higher than 0, the method comprising: -forming an inter-channel covariance matrix from the channels for the current frame, and searching (S3) for the eigenvectors of the covariance matrix to obtain an eigenvector matrix, -testing (S5) the eigenvector matrix to verify that it represents a rotation in N-dimensional space and, if it does not, correcting (S6) the eigenvector matrix until a rotation matrix is obtained for the current frame, and -applying the rotation matrix (S7) to the signals of the N channels before separate-channel encoding of the signals.

1. An encoding method for compressing an audio signal forming a succession of sample frames (t-1, t) over time in each of N channels in a surround sound representation above 0 th order, the method comprising:

-forming an inter-channel covariance matrix based on the channels and for the current frame (t), and searching eigenvectors of the covariance matrix for obtaining an eigenvector matrix,

-testing the eigenvector matrix to verify that it represents a rotation in an N-dimensional space, and if not, correcting the eigenvector matrix until a rotation matrix is obtained for the current frame (t), and

-applying the rotation matrix to the signals of the N channels before separate channel encoding of the signals.

2. The method of claim 1, further comprising:

-comparing said eigenvector matrix obtained for the current frame (t) with the rotation matrix obtained for the frame (t-1) preceding the current frame (t), and

-permuting the columns of the eigenvector matrix of the current frame (t) to ensure agreement with the rotation matrix of the previous frame (t-1).

3. The method of claim 2, wherein the permuting of the columns makes it possible to ensure agreement of the axes of the vectors, and the method further comprises:

-for each eigenvector of the current frame (t), verifying that its direction coincides with that of the column vector at the corresponding position in the rotation matrix of the previous frame (t-1), and

-in case of disagreement, inverting the sign of the elements of that eigenvector in the eigenvector matrix of the current frame (t).

4. The method according to one of the preceding claims, further comprising:

-estimating a difference between a rotation matrix obtained for a current frame (t) and a rotation matrix obtained for a frame (t-1) preceding the current frame,

-determining, based on the estimated difference, whether to perform at least one interpolation between the rotation matrix of the current frame (t) and the rotation matrix of the previous frame (t-1).

5. The method of claim 4, wherein:

-determining, based on the estimated difference, the number of interpolations to be performed between the rotation matrix of the current frame (t) and the rotation matrix of the previous frame (t-1),

-the current frame is divided into a number of sub-frames corresponding to the number of interpolations to be performed, and

-encoding at least the number of interpolations for transmission via the network.

6. The method according to one of the preceding claims, wherein, a permutation between columns of the eigenvector matrix inverting the sign of the determinant of the eigenvector matrix, and the determinant of a rotation matrix being equal to 1,

if the determinant of the eigenvector matrix is equal to -1, the signs of the elements of a selected column of the eigenvector matrix are inverted so that the determinant becomes equal to 1, thereby forming a rotation matrix.

7. The method according to one of the preceding claims, wherein the surround sound representation is first order and the number of channels N is four, and wherein the rotation matrix of the current frame is represented by two quaternions.

8. The method of claim 7 in combination with claim 6, wherein each interpolation of the current subframe is a spherical linear interpolation (SLERP) carried out as a function of the interpolation of subframes preceding the current subframe and based on the quaternion of the preceding subframe.

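9. The method of claim 8, wherein spherical linear interpolation of the current subframe is performed to obtain the quaternions of the current subframe as follows:

$$Q_{L,\mathrm{interp}}(\alpha) = \frac{\sin((1-\alpha)\,\Omega_L)}{\sin \Omega_L}\, Q_{L,t-1} + \frac{\sin(\alpha\,\Omega_L)}{\sin \Omega_L}\, Q_{L,t}$$

$$Q_{R,\mathrm{interp}}(\alpha) = \frac{\sin((1-\alpha)\,\Omega_R)}{\sin \Omega_R}\, Q_{R,t-1} + \frac{\sin(\alpha\,\Omega_R)}{\sin \Omega_R}\, Q_{R,t}$$

wherein:

Q_L,t-1 is one of the quaternions of the previous subframe t-1,

Q_R,t-1 is the other quaternion of the previous subframe t-1,

Q_L,t is one of the quaternions of the current subframe t,

Q_R,t is the other quaternion of the current subframe t,

Ω_L = arccos(Q_L,t-1 · Q_L,t); Ω_R = arccos(Q_R,t-1 · Q_R,t)

and α corresponds to the interpolation factor.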

10. The method according to one of the preceding claims, wherein the search for eigenvectors is carried out in the time domain by Principal Component Analysis (PCA) or by Karhunen-Loève transform (KLT).

11. The method according to one of the preceding claims, wherein a prior step of predicting the bit allocation budget for each surround sound channel is implemented, comprising:

-for each surround sound channel, estimating the current acoustic energy in the channel,

-selecting, from a memory, a predetermined quality score (MOS) based on the surround sound channel and on a current bit rate in the network,

-estimating a weighting to be applied to the bit allocation of the channel by multiplying the selected score by the estimated energy.

12. A method for decoding an audio signal forming a succession of sample frames (t-1, t) over time in each of N channels in a surround sound representation above 0 th order, the method comprising:

-for a current frame (t), receiving the parameters of the rotation matrix in addition to the signals of the N channels of the current frame,

-constructing an inverse rotation matrix from said parameters,

-applying the inverse rotation matrix to the signals of the N channels received, prior to separate-channel decoding of the signals.

13. An encoding device comprising processing circuitry for implementing the method according to one of claims 1 to 11.

14. A decoding device comprising processing circuitry for implementing the method of claim 12.

15. A computer program comprising instructions for implementing the method according to one of claims 1 to 12 when the instructions are executed by a processor of a processing circuit.

Drawings

Other features and advantages of the present invention will become apparent from reading the exemplary embodiments presented in the following detailed description, and from viewing the accompanying drawings, in which:

figure 1 shows a multi-channel-single-channel codec (prior art),

figure 2 shows a sequence of main steps of an exemplary method in the sense of the present invention,

figure 3 shows the general structure of an example of an encoder according to the invention,

figure 4 shows details of the PCA/KLT analysis and transformation performed by block 310 of the encoder of figure 3,

figure 5 shows an example of a decoder according to the invention,

figure 6 shows the decoding and the PCA/KLT synthesis performed at the decoder, which is the inverse of the analysis of figure 4,

figure 7 shows an exemplary structural embodiment of an encoder and a decoder within the meaning of the present invention.

Detailed Description

The invention aims to achieve optimized coding by:

-adaptive matrixing in the time domain, in particular with an adaptive transformation obtained by PCA/KLT ("PCA" for principal component analysis and "KLT" for Karhunen-Loève transform),

-preferably followed by multi-channel-single-channel encoding.

Adaptive matrixing allows for more efficient decomposition into channels than fixed matrixing. The matrixing according to the invention advantageously makes it possible to decorrelate the channels before multi-channel-single-channel encoding, so that the distortion of the spatial image by the codec noise introduced by encoding each channel is as small as possible overall when the channels are recombined in order to reconstruct the surround sound signal in decoding.

Furthermore, the invention is able to ensure a gentle adaptation of the matrixing parameters in order to avoid "click" type artifacts at the frame edges or too fast fluctuations in the spatial image, or even codec artifacts due to too strong variations in the various individual channels resulting from matrixing (which is subsequently encoded by different instances of a single-channel codec), e.g. linked to an untimely permutation of audio sources between the channels. Multi-channel-single-channel coding is presented below, preferably with variable bit allocation between channels (after adaptive matrixing), but in some variations multiple instances of the stereo core codec or others may be used.

To facilitate an understanding of the invention, some explanatory concepts regarding n-dimensional rotation and PCA/KLT or SVD type decomposition ("SVD" stands for singular value decomposition) are summarized below.

Rotations and "quaternions"

The signal is represented by successive blocks of audio samples, which blocks are referred to below as "sub-frames".

The invention uses a representation of n-dimensional rotations whose parameters are suited to quantization at each frame and, in particular, to efficient interpolation by sub-frame. The rotation representations used in 2, 3 and 4 dimensions are defined below.

Rotation (around the origin) is an n-dimensional spatial transformation that changes one vector to another such that:

the amplitude of the vector is preserved

The cross product of the vectors defining the orthogonal coordinate system before rotation is preserved after rotation (no reflection).

A matrix M of size n × n is a rotation matrix if and only if M^T·M = I_n, where I_n denotes the identity matrix of size n × n (i.e. M is a unitary matrix, M^T denoting the transpose of M), and its determinant is +1.

Several representations equivalent to those of the rotation matrix are used in the present invention:

In two dimensions (2D plane, n = 2): the rotation angle is used as the representation, as follows.

Given the rotation angle θ, the rotation matrix is derived:

$$M = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

Given a rotation matrix, the angle θ can be computed by observing that the trace of the matrix is 2 cos θ. Note that θ can also be estimated directly from the covariance matrix, before applying the principal component analysis (PCA) and eigenvalue decomposition (EVD) presented below.

Two rotations of respective angles θ₁ and θ₂ can be interpolated by linear interpolation between θ₁ and θ₂, taking into account the shortest-path constraint on the unit circle between these two angles.
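The shortest-path interpolation of 2D rotation angles can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and wrapping the angle difference into [-π, π) is one common way of enforcing the shortest-path constraint.

```python
import math

def interpolate_angle(theta_prev, theta_cur, alpha):
    """Angle for interpolation factor alpha in [0, 1], along the shortest arc."""
    # Wrap the difference into [-pi, pi) so the shorter way round the circle is used.
    diff = (theta_cur - theta_prev + math.pi) % (2.0 * math.pi) - math.pi
    return theta_prev + alpha * diff

def rotation_2d(theta):
    """2 x 2 rotation matrix for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def angle_from_matrix(m):
    """Recover theta from a 2 x 2 rotation matrix.

    The trace gives 2*cos(theta) but loses the sign of theta,
    so atan2 on the first column is used here instead (an assumption).
    """
    return math.atan2(m[1][0], m[0][0])
```

Interpolating halfway between 350° and 10° with this sketch passes through 0° rather than 180°, which is the behavior the shortest-path constraint is meant to guarantee.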

In three-dimensional (3D) space (n = 3): Euler angles and quaternions are used as representations. In some variants, an axis-angle representation, not detailed here, may also be used.

A rotation matrix of size 3 × 3 can be decomposed as the product of 3 elementary rotations, each of angle θ about the x, y or z axis.

Depending on the combination of axes, the angles are called Euler angles or Cardan angles.

However, another representation of 3D rotations is given by quaternions. Quaternions generalize complex numbers to numbers with four components, of the form q = a + bi + cj + dk, where i² = j² = k² = ijk = -1.

The real part a is called the scalar part, and the three imaginary parts (b, c, d) form a 3D vector. The norm of a quaternion is $|q| = \sqrt{a^2 + b^2 + c^2 + d^2}$. Unit quaternions (of norm 1) represent rotations; however, this representation is not unique: if q represents a rotation, -q represents the same rotation.

Given the unit quaternion q = a + bi + cj + dk (where a² + b² + c² + d² = 1), the associated rotation matrix is:

$$R = \begin{pmatrix} a^2+b^2-c^2-d^2 & 2bc-2ad & 2ac+2bd \\ 2ad+2bc & a^2-b^2+c^2-d^2 & 2cd-2ab \\ 2bd-2ac & 2ab+2cd & a^2-b^2-c^2+d^2 \end{pmatrix}$$
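The mapping from a unit quaternion to a 3 × 3 rotation matrix can be sketched as follows, assuming the usual Hamilton convention with q = (a, b, c, d); the function name is hypothetical. Because every entry is quadratic in the components, q and -q give the same matrix, which illustrates the non-uniqueness noted above.

```python
import math

def quat_to_matrix(q):
    """3 x 3 rotation matrix of the unit quaternion q = (a, b, c, d)."""
    a, b, c, d = q
    return [
        [a*a + b*b - c*c - d*d, 2.0 * (b*c - a*d),     2.0 * (b*d + a*c)],
        [2.0 * (b*c + a*d),     a*a - b*b + c*c - d*d, 2.0 * (c*d - a*b)],
        [2.0 * (b*d - a*c),     2.0 * (c*d + a*b),     a*a - b*b - c*c + d*d],
    ]

# A quarter turn about the z axis: q = cos(pi/4) + sin(pi/4) k.
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
R_z90 = quat_to_matrix(q_z90)
```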

Euler angles do not allow 3D rotations to be interpolated correctly; for this purpose, the quaternion or axis-angle representation is used instead. The SLERP ("spherical linear interpolation") method interpolates according to the following equation:

$$\mathrm{slerp}(q_1, q_2, \alpha) = \frac{\sin((1-\alpha)\,\Omega)}{\sin\Omega}\, q_1 + \frac{\sin(\alpha\,\Omega)}{\sin\Omega}\, q_2$$

where 0 ≤ α ≤ 1 is the interpolation factor for going from q₁ to q₂, and Ω is the angle between the two quaternions:

Ω = arccos(q₁·q₂)

where q₁·q₂ denotes the dot product of the two quaternions (identical to the dot product of two 4-dimensional vectors).

This amounts to interpolating at constant angular velocity (as a function of α) along a great circle of the 4D sphere. The shortest path must be used for the interpolation, by changing the sign of one of the quaternions when q₁·q₂ < 0. Note that other methods of quaternion interpolation (normalized linear or non-linear interpolation, splines, etc.) may be used.
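A minimal SLERP sketch with the shortest-path sign change is given below. The function name and the `eps` guard are assumptions added for illustration; the fallback for nearly identical quaternions simply returns q₁, which a production implementation might replace with normalized linear interpolation.

```python
import math

def slerp(q1, q2, alpha, eps=1e-8):
    """Spherical linear interpolation from q1 (alpha=0) to q2 (alpha=1)."""
    dot = sum(a * b for a, b in zip(q1, q2))
    if dot < 0.0:
        # q2 and -q2 encode the same rotation: flip the sign to take the shortest path.
        q2 = [-b for b in q2]
        dot = -dot
    omega = math.acos(min(1.0, dot))  # clamp guards against rounding above 1
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Quaternions (almost) coincide: the formula would divide by ~0.
        return list(q1)
    w1 = math.sin((1.0 - alpha) * omega) / sin_omega
    w2 = math.sin(alpha * omega) / sin_omega
    return [w1 * a + w2 * b for a, b in zip(q1, q2)]
```

For a pair of unit quaternions, the result stays on the unit sphere, so no renormalization is needed after interpolation.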

Note that 3D rotations can also be interpolated by an axis-angle representation; in this case, the angles are interpolated as in the 2D case, and the axes can be interpolated, for example, by the SLERP method (in 3D), while ensuring that the shortest path is taken on the 3D unit sphere, and taking into account the fact that the representation given by the axis r and the angle θ is equivalent to the representation given by the opposite axis -r and the angle 2π - θ.

In four dimensions (n = 4), a rotation can be parameterized by 6 angles (n(n-1)/2), and it can be shown that the product of two matrices of size 4 × 4, called the quaternion matrix (Q₁) and the anti-quaternion matrix, associated respectively with the unit quaternions q₁ = a + bi + cj + dk and q₂ = w + xi + yj + zk, gives a rotation matrix of size 4 × 4.

For an associated pair of quaternions (q₁, q₂), the quaternion matrix and the anti-quaternion matrix can be written out explicitly; their product gives a matrix of size 4 × 4, and it can be verified that this matrix satisfies the properties of a rotation matrix (unitary matrix with determinant equal to 1).
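The construction of a 4 × 4 rotation from a pair of unit quaternions can be sketched as follows. The sign convention is an assumption: the two matrices below are the left- and right-multiplication matrices of Hamilton quaternions, and the source's own matrices may differ by signs or a transpose. All function names are hypothetical.

```python
import math

def left_matrix(q):
    """Matrix of p -> q*p (quaternion product) for q = (a, b, c, d)."""
    a, b, c, d = q
    return [[a, -b, -c, -d], [b, a, -d, c], [c, d, a, -b], [d, -c, b, a]]

def right_matrix(q):
    """Matrix of p -> p*q (quaternion product) for q = (w, x, y, z)."""
    w, x, y, z = q
    return [[w, -x, -y, -z], [x, w, z, -y], [y, -z, w, x], [z, y, -x, w]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def det(M):
    """Determinant by Laplace expansion along the first row (fine for n <= 4)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * v * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j, v in enumerate(M[0]))

def is_rotation(M, tol=1e-9):
    """True if the columns of M are orthonormal and det(M) = +1."""
    n = len(M)
    for i in range(n):
        for j in range(n):
            dot = sum(M[k][i] * M[k][j] for k in range(n))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return abs(det(M) - 1.0) <= tol

def normalize(q):
    norm = math.sqrt(sum(v * v for v in q))
    return [v / norm for v in q]

q1 = normalize([1.0, 2.0, 3.0, 4.0])
q2 = normalize([0.5, -1.0, 0.25, 2.0])
R4 = matmul(left_matrix(q1), right_matrix(q2))
```

Under this convention the two factors commute, since left and right quaternion multiplications are associative with each other.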

Conversely, given a 4 × 4 rotation matrix, this matrix can be factored into the product of a quaternion matrix and an anti-quaternion matrix, for example by the method known as "Cayley's factorization". This involves computing an intermediate matrix called the "tetragonal transform" (or associated matrix) and deriving the quaternions from it, up to an indeterminacy in the sign of the two quaternions (which can be removed by the additional "shortest path" constraint mentioned further below).

Singular value decomposition (or "SVD")

Singular Value Decomposition (SVD) consists of factoring a real matrix A of size m × n in the form:

A＝U∑V^T

where U is a unitary matrix of size m × m (U^T U = I_m); Σ is a rectangular diagonal matrix of size m × n whose coefficients σ_i ≥ 0 are real and non-negative (i = 1 … p, where p = min(m, n)); V is a unitary matrix of size n × n (V^T V = I_n); and V^T is the transpose of V. The coefficients σ_i on the diagonal of Σ are the singular values of the matrix A. By convention, they are usually listed in descending order, in which case the diagonal matrix Σ associated with A is unique.

The rank r of A is given by the number of non-zero coefficients σ_i. The singular value decomposition can therefore be rewritten as:

$$A = U_r \Sigma_r V_r^T$$

where U_r = [u₁, u₂, …, u_r] contains the left singular vectors (or output vectors) of A, Σ_r = diag(σ₁, …, σ_r), and V_r = [v₁, v₂, …, v_r] contains the right singular vectors (or input vectors) of A. This matrix formula can also be rewritten as:

$$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^T$$

If the sum is truncated at an index i < r, a "filtered" matrix is obtained that represents only the "main" information.

We can also write as:

Av_i＝σ_iu_i

which shows that the matrix A transforms v_i into σ_i u_i.

The SVD of A is related to the eigenvalue decompositions of A^T A and AA^T, because:

A^TA＝V(∑^T∑)V^T

AA^T＝U(∑∑^T)U^T

The non-zero eigenvalues of Σ^T Σ and ΣΣ^T are the σ_i². The columns of U are the eigenvectors of AA^T, and the columns of V are the eigenvectors of A^T A.

The SVD can be interpreted geometrically: the image under the matrix A of a sphere in dimension n is, in dimension m, a hyper-ellipse whose principal axes have directions u₁, u₂, ..., u_m and lengths σ₁, ..., σ_m.

Karhunen-Loève transform (or "KLT")

The Karhunen-Loève transform (KLT) of a random vector x centered at 0, with covariance matrix R_xx = E[x x^T], is defined as follows:

y = V^T x

where V is the matrix of eigenvectors (by convention, the eigenvectors are column vectors) obtained by decomposing R_xx into eigenvalues:

R_xx = V Λ V^T

where Λ = diag(λ₁, ..., λ_n) is the diagonal matrix whose coefficients are the eigenvalues. The matrix V = [v₁, v₂, ..., v_n] contains the eigenvectors of R_xx (as columns), such that

R_xx v_i = λ_i v_i

The KLT can be viewed as a change of basis, because the product V^T x expresses the vector x in the basis given by the eigenvectors.

The inverse transform is given by:

x＝Vy

The KLT makes it possible to decorrelate the components of x; the variances of the components of the transformed vector y are the eigenvalues of R_xx.
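For two channels, the KLT reduces to a single rotation angle that can be computed directly from the covariance matrix. The sketch below illustrates this under stated assumptions: the closed-form angle θ = ½·atan2(2·r01, r00 − r11) is the standard diagonalization of a symmetric 2 × 2 matrix (it is not quoted from the source), the covariance is normalized by 1/N, the signals are assumed centered, and the function names are hypothetical.

```python
import math

def covariance_2ch(x0, x1):
    """Entries (r00, r01, r11) of the 2 x 2 covariance of two centered channels."""
    n = len(x0)
    r00 = sum(a * a for a in x0) / n
    r01 = sum(a * b for a, b in zip(x0, x1)) / n
    r11 = sum(b * b for b in x1) / n
    return r00, r01, r11

def klt_2ch(x0, x1):
    """Rotate two channels so that the outputs are decorrelated (y = V^T x)."""
    r00, r01, r11 = covariance_2ch(x0, x1)
    theta = 0.5 * math.atan2(2.0 * r01, r00 - r11)
    c, s = math.cos(theta), math.sin(theta)
    # V = [[c, -s], [s, c]], so V^T x = (c*x0 + s*x1, -s*x0 + c*x1).
    y0 = [c * a + s * b for a, b in zip(x0, x1)]
    y1 = [-s * a + c * b for a, b in zip(x0, x1)]
    return theta, y0, y1

x0 = [1.0, -1.0, 2.0, -2.0]
x1 = [0.5, -0.5, 1.5, -1.5]
theta, y0, y1 = klt_2ch(x0, x1)
```

After the rotation, the cross-covariance of y0 and y1 vanishes, the first output carries the larger variance, and the total energy is unchanged because V is a rotation.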

Principal component analysis (or "PCA")

Principal Component Analysis (PCA) is a dimension reduction technique that produces orthogonal variables and maximizes the variance of the variables (or equivalently minimizes reconstruction errors) after projection.

The PCA presented below is also based on a decomposition into eigenvalues (like the KLT), but with an estimated covariance matrix R̂_xx calculated from N observed vectors x_i (i = 1 … N) of dimension n, these vectors being assumed centered.

Decomposing R̂_xx into eigenvalues, in the form R̂_xx = V Λ V^T, allows the principal components to be calculated: y_n = V^T x_n.

PCA is thus a transformation by the matrix V^T, which projects the data into a new basis so as to maximize the variance of the variables after projection.

Note that PCA can also be obtained from an SVD of the signals x_i, arranged in the form of a matrix X of size n × N. In this case, we can write:

X＝UDV^T

It can be shown that XX^T = U D D^T U^T, which corresponds to the diagonalization of XX^T. The projection vectors of the PCA thus correspond to the column vectors of U, and the projection gives U^T X = D V^T as its result.

It should also be noted that PCA is generally regarded as a dimension reduction technique, used to "compress" a high-dimensional data set into a set containing few principal components. In the present invention, PCA advantageously makes it possible to decorrelate the multi-dimensional input signal, but the elimination of channels (and thus a reduction in the number of channels) is avoided so as not to introduce artifacts. A minimum encoding bit rate is therefore imposed to avoid "truncating" the spatial image, unless, in certain variants, the eigenvalues are so low that a zero bit rate can be allowed (e.g., to better encode artificially created surround sound with synthetically spatialized single sources).

We now describe with reference to fig. 2 the general principle of the steps implemented for the current frame t in a method within the meaning of the invention.

Step S1 consists of obtaining, for each frame t, the individual signals of the surround-sound channels (here four channels W, Y, Z, X in the example described), using the ACN (Ambisonic Channel Number) channel ordering convention. These signals can be placed in the form of an n × L matrix (for n surround sound channels (here 4) and L samples per frame).

In a next step S2, the signals of these channels may optionally be pre-processed, e.g. by a high-pass filter, as described below with reference to fig. 3.

In a next step S3, principal component analysis (PCA) or, equivalently, a Karhunen-Loève transform (KLT) is applied to these signals to obtain the eigenvalues and the eigenvector matrix from the covariance matrix of the n channels. In variants of the invention, an SVD may be used.

In step S4, the eigenvector matrix obtained for the current frame t undergoes signed permutations so that it is aligned as closely as possible with the matrix of the same kind from the previous frame t-1. In principle, we ensure that the axes of the column vectors in the eigenvector matrix correspond as closely as possible to the axes of the column vectors at the same positions in the matrix of the previous frame and, if they do not, the non-corresponding eigenvectors are permuted within the matrix of the current frame t. We then ensure that the directions of the eigenvectors also agree from one matrix to the other. In other words, we are initially interested only in the straight lines carrying the eigenvectors (axis only, without direction), and for each line we look for the nearest line in the matrix of the previous frame t-1; to this end, vectors are permuted in the matrix of the current frame. Then, in a second step, we match the directions of the vectors, by inverting the sign of any eigenvector that does not point in the right direction.

Such an embodiment makes it possible to ensure maximum correspondence between the two matrices, avoiding audible clicks between the two frames during sound playback.
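The signed permutation of step S4 can be sketched as follows. The matching strategy is an assumption: the text does not say how the nearest axis is searched, so this sketch uses a simple greedy match on the absolute dot product between columns, followed by a sign flip when the dot product is negative. The function name is hypothetical.

```python
def align_to_previous(v_cur, v_prev):
    """Permute and sign-flip the columns of v_cur to match those of v_prev.

    Both arguments are n x n matrices (lists of rows) whose columns are
    eigenvectors. Greedy matching is used here for illustration only.
    """
    n = len(v_cur)

    def column(m, j):
        return [m[i][j] for i in range(n)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    unused = list(range(n))
    aligned_cols = []
    for j in range(n):
        ref = column(v_prev, j)
        # First pass: match axes only (absolute value ignores direction).
        best = max(unused, key=lambda k: abs(dot(column(v_cur, k), ref)))
        unused.remove(best)
        vec = column(v_cur, best)
        # Second pass: match directions by inverting the sign if needed.
        if dot(vec, ref) < 0:
            vec = [-a for a in vec]
        aligned_cols.append(vec)
    # Rebuild a row-major matrix from the aligned columns.
    return [[aligned_cols[j][i] for j in range(n)] for i in range(n)]
```

A greedy match can be suboptimal when two axes are almost equally close to the same reference axis; an exhaustive search over permutations is affordable for n ≤ 4.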

In step S5, we also ensure that the eigenvector matrix of the current frame t (thus corrected by signed permutations) does represent a rotation (a rotation by an angle for n = 2 channels; a rotation given by three Euler angles, by an axis and an angle, or by a quaternion for n = 3, corresponding to a first-order planar surround representation W, Y, Z; and a rotation given by two quaternions for n = 4 in a first-order surround representation of type W, Y, Z, X).

To be a rotation, the eigenvector matrix of the current frame t corrected by the permutations must have a determinant that is positive and equal to (or in practice close to) +1. If the determinant is equal to (or close to) -1, then in step S6:

-two eigenvectors are permuted again (e.g. two eigenvectors associated with low-energy channels, which are therefore not very representative), or

-preferably, the signs of all elements of one column (e.g. the column associated with a low-energy channel) are inverted.

Then, in step S7, we obtain an eigenvector matrix for the current frame t that does correspond to a rotation.
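The determinant test and the preferred correction (sign inversion of one column) can be sketched as follows. Choosing the last column as the low-energy one is an assumption that holds only if the eigenvectors are sorted by decreasing eigenvalue; the function names and the tolerance are hypothetical.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for n <= 4)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * v * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, v in enumerate(m[0]))

def make_rotation(v, column=-1, tol=1e-6):
    """Return a copy of the orthogonal matrix v with determinant +1.

    If det(v) is close to -1, the signs of one column are inverted
    (by default the last one, assumed to carry the least energy).
    """
    d = det(v)
    if abs(abs(d) - 1.0) > tol:
        raise ValueError("matrix is not orthogonal: |det| differs from 1")
    fixed = [row[:] for row in v]
    if d < 0:
        for row in fixed:
            row[column] = -row[column]
    return fixed
```

Inverting a whole column keeps the columns orthonormal and only changes the sign of the corresponding transformed channel, which the inverse rotation at the decoder undoes.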

Then, in step S8, the parameters of the matrix (e.g. an angle value, an axis and an angle, or the quaternion(s), depending on the dimension) may be encoded using a number of bits allocated for this purpose. In another optional but advantageous embodiment, if a significant difference (e.g. greater than a threshold) is observed in step S9 between the rotation matrix estimated for the current frame t and the rotation matrix of the previous frame t-1, a variable number of interpolation sub-frames may be determined; otherwise, the number of sub-frames is fixed at a predetermined value. Step S10 includes:

-dividing the current frame into sub-frames, and

-interpolating the matrices to be applied to successive sub-frames from the matrix of the previous frame t-1 to the matrix of the current frame t, in order to smooth the difference over time between the two matrices.

In step S11, the interpolated rotation matrices are applied to the n × (L/K) matrices representing each of the K sub-frames of the surround-sound channel signals of step S1 (or optionally S2), in order to decorrelate these signals as much as possible prior to the multi-channel-single-channel encoding of step S14. Recall that, according to the general approach, the signals are to be decorrelated as much as possible before this encoding. The bit allocation for the separate channels is carried out in step S12 and is encoded in step S13.
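Steps S10 and S11 can be illustrated for the two-channel case, where the rotation is a single angle. This is a sketch under assumptions: the interpolation factor (k+1)/K for sub-frame k, so that the last sub-frame uses the matrix of the current frame, is one plausible schedule and is not taken from the source; the function name is hypothetical.

```python
import math

def rotate_subframes(x0, x1, theta_prev, theta_cur, num_subframes):
    """Apply interpolated 2D rotations (y = V^T x) sub-frame by sub-frame."""
    length = len(x0)
    if length % num_subframes != 0:
        raise ValueError("frame length must be a multiple of the sub-frame count")
    sub_len = length // num_subframes
    # Shortest-path angle difference, wrapped into [-pi, pi).
    diff = (theta_cur - theta_prev + math.pi) % (2.0 * math.pi) - math.pi
    y0, y1 = [], []
    for k in range(num_subframes):
        alpha = (k + 1) / num_subframes  # assumed schedule: ends at theta_cur
        theta = theta_prev + alpha * diff
        c, s = math.cos(theta), math.sin(theta)
        start, stop = k * sub_len, (k + 1) * sub_len
        for a, b in zip(x0[start:stop], x1[start:stop]):
            y0.append(c * a + s * b)
            y1.append(-s * a + c * b)
    return y0, y1
```

With 20 ms frames of L = 960 samples and, say, K = 4 sub-frames, each interpolated matrix would be applied to a block of 240 samples per channel.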

In step S14, before the multiplexing of step S15 that ends the compression encoding method, the number of bits to be allocated per channel may be decided according to the representativeness of the channel and the bit rate available on the network RES (fig. 7). In one embodiment, the energy in each channel is estimated for the current frame and multiplied by a predefined score for that channel and for a given bit rate (e.g. a MOS score, as explained below with reference to fig. 3). The number of bits to be allocated to each channel is thus weighted. Such an embodiment is advantageous in itself and may be the subject of separate protection in a surround sound context.

Fig. 7 shows an encoding device DCOD and a decoding device DDEC within the meaning of the invention, these devices being dual (meaning "invertible") with respect to each other and being connected to each other by a communication network RES.

The encoding device DCOD comprises processing circuitry, generally comprising:

a memory MEM1 for storing instruction data of a computer program within the meaning of the invention (these instructions may be distributed between the encoder DCOD and the decoder DDEC);

an interface INT1 for receiving surround sound signals distributed over different channels (e.g. four first-order channels W, Y, Z, X) for compression encoding them within the meaning of the invention;

a processor PROC1 for receiving these signals and processing them by executing computer program instructions stored in the memory MEM1 for encoding them; and

a communication interface COM1 for transmitting coded signals via a network.

The decoding device DDEC comprises its own processing circuitry, typically comprising:

a memory MEM2 for storing instruction data of a computer program within the meaning of the invention (these instructions may be distributed between the encoder DCOD and the decoder DDEC, as described above);

an interface COM2 for receiving coded signals from the RES network for decoding them from compression within the meaning of the invention;

a processor PROC2 for processing these signals by executing computer program instructions stored in the memory MEM2 for decoding them; and

an output interface INT2 for transmitting the decoded signals in the form of surround sound channels W ', Y', Z ', X', for example for playback thereof.

Of course, this fig. 7 shows an example of a structural embodiment of a codec (coder or decoder) within the meaning of the present invention. Detailed embodiments of these more powerful codecs are described below with respect to fig. 3-6.

An encoder apparatus within the meaning of the present invention will now be described with reference to fig. 3.

The strategy of the encoder is to decorrelate the channels of the surround sound signal as much as possible and to encode them with a core codec. This strategy makes it possible to limit artifacts in the decoded surround sound signal. More specifically, we seek here to apply an optimized decorrelation to the input channels prior to multi-channel-single-channel encoding. Furthermore, the computational cost of interpolation is limited for both the encoder and the decoder, since it is carried out in a specific domain (an angle in 2D, a quaternion in 3D, a pair of quaternions in 4D); this avoids having to interpolate the covariance matrix computed for the PCA/KLT analysis and to repeat a decomposition into eigenvalues and eigenvectors several times per frame.

However, before discussing the core encoding performed within the meaning of the present invention, some advantageous features of the encoder are presented here, in particular such as the optimization of the allocated bit budget for encoding as a function of perceptual criteria, as follows.

In the embodiments described here, the encoder may typically be an extension of the standardized 3GPP EVS ("Enhanced Voice Services") encoder. Advantageously, the EVS encoding bit rates can be used without modifying the structure of the EVS bitstream. The multi-channel-single-channel coding (block 340 of fig. 3, described below) thus operates here with an allocation to each transformed channel restricted to the following bit rates for super-wideband coding: 9.6, 13.2, 16.4, 24.4, 32, 48, 64, 96 and 128 kbps.

Of course, additional bit rates can be added by modifying the EVS codec (to obtain finer granularity in the allocation). Codecs other than EVS may also be used.

In general, it should be kept in mind that the finer the granularity of the allocation, the more bits must be reserved to represent the possible combinations of bit rates. A trade-off must therefore be made between the fineness of the allocation and the additional information describing the bit allocation. This allocation is optimized here by block 320 of fig. 3, which is described below. This is an advantageous feature in itself, independent of the decomposition into eigenvectors used to establish a rotation matrix within the meaning of the invention. Thus, the bit allocation performed by block 320 may be the subject of separate protection.

Referring to fig. 3, block 300 receives an input signal Y in the current frame of index t. The index is omitted here to keep the notation simple. Y is a matrix of size n × L. In an embodiment suited to the first-order surround sound context, there are n = 4 channels W, Y, Z, X (thus defined in ACN order), which may be normalized according to the SN3D convention. In a variant, the order of the channels may instead be, for example, W, X, Y, Z (following the FuMa convention), and the normalization may be different (N3D or FuMa). The channels W, Y, Z, X thus correspond to the successive rows y_1,l, y_2,l, y_3,l, y_4,l, which are written as one-dimensional signals y_i(l), l = 1, …, L. Each is therefore a succession of samples from 1 to L occupying the frame t.

Without loss of generality, the signal (in each channel) is assumed to be sampled at 48 kHz. The frame length is fixed at 20 ms, i.e. L = 960 consecutive samples, again without loss of generality. Alternatively, sampling may be performed at 32 kHz, for example, with a frame length of L = 640 samples.

The PCA/KLT analysis and PCA/KLT transformation described below are performed in the time domain. It will therefore be appreciated that we remain here in the time domain, without having to perform sub-band transforms or more generally frequency transforms.

At each frame, block 300 of the encoder applies an (optional) pre-processing to obtain the pre-processed input signal, denoted X. This may be a high-pass filtering (with a cutoff frequency typically of 20 Hz) of each new 20 ms frame of the input signal channels. This operation removes the DC component, which may bias the estimate of the covariance matrix, so that the signal output from block 300 can be considered to have zero mean. Denoting the transfer function H_pre(z), we have for each channel: X_i(z) = H_pre(z) Y_i(z). If block 300 is not applied, X = Y. Such filtering may also be applied within block 340, which performs the multi-channel-single-channel encoding; however, when block 300 is applied, the high-pass filtering in the pre-processing of the single-channel encoding used in block 340 is preferably disabled, to avoid repeating the same pre-processing and thus to reduce overall complexity.

The transfer function H_pre(z) above may be of the following type:

$$H_{pre}(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 - a_1 z^{-1} - a_2 z^{-2}}$$

This filter is applied to each of the n channels of the input signal; its coefficients, as a function of the sampling frequency, may be as follows:

Coefficient	8 kHz	16 kHz	32 kHz	48 kHz
b₀	0.988954248067140	0.994461788958195	0.997227049904470	0.998150511190452
b₁	-1.977908496134280	-1.988923577916390	-1.994454099808940	-1.996301022380904
b₂	0.988954248067140	0.994461788958195	0.997227049904470	0.998150511190452
a₁	1.977786483776764	1.988892905899653	1.994446410541927	1.996297601769122
a₂	-0.978030508491796	-0.988954249933127	-0.994461789075954	-0.996304442992686

Alternatively, another type of filter may be used, for example a sixth-order Butterworth filter with a cutoff frequency of 50 Hz.
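The second-order high-pass pre-filter can be sketched as a direct-form difference equation using the 48 kHz coefficients listed above. The sign convention is an assumption: the feedback terms are added (y[n] = b₀x[n] + b₁x[n-1] + b₂x[n-2] + a₁y[n-1] + a₂y[n-2]), because only this convention yields a stable filter with the listed values of a₁ and a₂. The names below are hypothetical.

```python
# 48 kHz coefficients copied from the table above.
HP_COEFFS_48K = {
    "b0": 0.998150511190452,
    "b1": -1.996301022380904,
    "b2": 0.998150511190452,
    "a1": 1.996297601769122,
    "a2": -0.996304442992686,
}

def highpass(x, coeffs=HP_COEFFS_48K):
    """Filter one channel with the second-order high-pass pre-filter."""
    b0, b1, b2 = coeffs["b0"], coeffs["b1"], coeffs["b2"]
    a1, a2 = coeffs["a1"], coeffs["a2"]
    x1 = x2 = y1 = y2 = 0.0  # filter memory, zero at start-up
    y = []
    for xn in x:
        # Assumed convention: feedback terms are added, not subtracted.
        yn = b0 * xn + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

# One second of a constant (DC) input at 48 kHz.
dc_response = highpass([1.0] * 48000)
```

Since b₀ + b₁ + b₂ = 0, the filter has a zero at DC, so the response to a constant input decays towards zero. In a frame-based codec, the filter memory would be carried over from one 20 ms frame to the next rather than reset as in this sketch.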

In some variations, the pre-processing may include a fixed matrixing step that may maintain the same number of channels or reduce the number of channels. An example of matrixing applied to four channels of a B-format surround sound signal is given below:

Note that in this case, the decoder must matrix the decoded signal by M_A→B = M_B→A^-1 in order to reverse the pre-processing and recover the channels in the original format.

The next block 310 estimates at each frame t a transformation matrix, obtained by determining the eigenvectors by PCA/KLT and by verifying that the transformation matrix formed by these eigenvectors does indeed characterize a rotation. Further details of the operation of block 310 are provided below with reference to fig. 4. The transformation matrix performs a matrixing of the channels in order to decorrelate them, which makes it possible to apply independent coding of the multi-mono type in block 340. As detailed below, block 310 sends to the multiplexer the quantization indices representing the transformation matrix and, optionally, information coding the number of interpolations of the transformation matrix over the subframes of the current frame t.

Block 320 determines the optimal bit rate allocation for each channel (after the PCA/KLT transform) based on a given budget of B bits. This block searches for the distribution of the bit rate among the channels by calculating a score for each possible combination of bit rates; the optimal allocation is the combination that maximizes this score.

Several criteria may be used to define the score for each combination.

For example, the possible bit rates for the mono coding of one channel may be limited to the nine discrete bit rates of the EVS codec in super-wideband: 9.6, 13.2, 16.4, 24.4, 32, 48, 64, 96 and 128 kbps. However, if the codec according to the invention operates at a given bit rate, associated with a budget of B bits in the current frame of index t, only a subset of these bit rates can typically be used. For example, if the codec bit rate is fixed at 4 × 13.2 = 52.8 kbps to represent four channels, and if each channel receives a minimum of 9.6 kbps to guarantee a super-wide band for each channel, the possible combinations of bit rates for coding the individual channels must satisfy the constraint that the bit rate used remains below the available bit rate, which corresponds to:

B_multimono＝B-B_overhead,

where B_overhead is the bit budget for the additional information coded in each frame (bit allocation + rotation data), as described below. For example, for four-channel surround sound coding, B_overhead may be on the order of 55 bits per 20 ms frame (i.e. 2.75 kbps); this comprises 51 bits for coding the rotation matrix and 4 bits for coding the bit allocation of the separate channel coding (as described below). For a total bit rate of 4 × 13.2 = 52.8 kbps, this leaves a budget B_multimono of 50.05 kbps.

In terms of bit rate per channel, this gives the following combinations, each standing for all of its permutations across the channels:

- (9.6, 9.6, 9.6, 9.6): total 38.4 kbps
- permutations of (13.2, 9.6, 9.6, 9.6): total 42 kbps
- permutations of (13.2, 13.2, 9.6, 9.6): total 45.6 kbps
- permutations of (13.2, 13.2, 13.2, 9.6): total 49.2 kbps
- permutations of (16.4, 9.6, 9.6, 9.6): total 45.2 kbps
- permutations of (16.4, 13.2, 9.6, 9.6): total 48.8 kbps

It can be seen that some combinations satisfying the maximum budget constraint have a much lower total bit rate than others, and that ultimately only two relevant combinations need be retained:

- permutations of (13.2, 13.2, 13.2, 9.6): 4 cases, with an unused bit rate of 50.05 - 49.2 = 0.85 kbps
- permutations of (16.4, 13.2, 9.6, 9.6): 12 cases, with an unused bit rate of 50.05 - 48.8 = 1.25 kbps

This shows that only sixteen combinations are of real interest and that they can be coded with 4 bits (16 values). Furthermore, depending on the allocation selected, a certain number of bits may remain unused.
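The enumeration above can be checked with a short sketch. It works in bits per 20 ms frame (bit rate divided by 50) to avoid floating-point sums; the variable names are illustrative only.

```python
from itertools import combinations_with_replacement, permutations

# Per-frame budgets in bits (20 ms frames) for the nine EVS bit rates, 9.6 to 128 kbps.
BUDGETS = [192, 264, 328, 488, 640, 960, 1280, 1920, 2560]

frame_bits = 1056                             # 4 x 13.2 kbps over 20 ms
overhead_bits = 55                            # rotation data (51) + bit allocation (4)
multimono_bits = frame_bits - overhead_bits   # 1001 bits, i.e. 50.05 kbps

# Unordered sets of four budgets that fit in the multi-mono budget.
feasible = [c for c in combinations_with_replacement(BUDGETS, 4)
            if sum(c) <= multimono_bits]

# Keep the two sets that waste the fewest bits, then list their distinct orderings.
best_two = sorted(feasible, key=sum, reverse=True)[:2]
candidates = [p for c in best_two for p in sorted(set(permutations(c)))]

for c in best_two:
    unused_kbps = (multimono_bits - sum(c)) * 50 / 1000
    print(c, "unused:", round(unused_kbps, 2), "kbps")
print(len(candidates), "ordered allocations")
```

The six feasible sets correspond to the six lines of the list, and the two retained sets yield 4 + 12 = 16 ordered allocations.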

It can be seen that coding based on PCA/KLT processing and allowing adaptive matrixing for flexible bit allocation may result in unused bits and, for some channels, a lower bit rate (e.g. 9.6kbps) than that evenly distributed between each channel (e.g. 13.2kbps per channel).

To improve this, block 320 may then evaluate all possible (relevant) combinations of bit rates for the 4 channels produced by the PCA/KLT transform (output of block 310) and assign each of them a score. The score is calculated based on:

-energy per channel, and

-average scores that can be pre-stored and generated by subjective or objective tests; this score is expressed as MOS (for "mean opinion score", which is the average score of a group of testers), associated with the allocated bit rate.

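This score can then be defined by the following equation:

$$S(b_1, \ldots, b_n) = \sum_{i=1}^{n} Q(b_i)\, E_i$$

where E_i is the energy, in the current frame (of index t), of the signal s_i(l), l = 0, ..., L-1, on channel i:

$$E_i = \sum_{l=0}^{L-1} s_i(l)^2$$

The optimal allocation is then the one such that:

$$(b_1^{opt}, \ldots, b_n^{opt}) = \arg\max_{(b_1, \ldots, b_n)} S(b_1, \ldots, b_n) \quad \text{subject to} \quad \sum_{i=1}^{n} b_i \le B_{multimono}$$

Alternatively, the factor E_i may be set to the eigenvalue associated with channel i, obtained from the eigenvalue decomposition of the signal input to block 310, after any signed permutation.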

For a budget b_i (in bits) per 20 ms frame, corresponding to a bit rate R_i = 50 b_i (in bits/s), the MOS score Q(b_i) is preferably the subjective quality score of the codec used for the multi-mono coding in block 340. As a first approach, we can take the (average) subjective MOS scores of the EVS codec, given by the following table:

κ_i	0	1	2	3	4	5	6	7	8
b_i (bits)	192	264	328	488	640	960	1280	1920	2560
R_i (bits/s)	9600	13200	16400	24400	32000	48000	64000	96000	128000
Q(b_i)	3.62	3.79	4.25	4.60	4.53	4.82	4.83	4.85	4.87

Alternatively, other MOS scores for each listed bit rate may be derived from other (subjective or objective) tests predicting the quality of the codec. The MOS scores used in the current frame can also be adapted according to a classification of the signal type (e.g. speech without background noise, speech with ambient noise, music or mixed content), by reusing the classification methods implemented in the EVS codec and applying them to the W channel of the surround sound input signal before carrying out the bit allocation. The MOS score may also correspond to an average score produced by different types of methods and rating scales: MOS (absolute, from 1 to 5), DMOS (from 1 to 5), MUSHRA (from 0 to 100).

In a variant in which the EVS encoder is replaced by another codec, the list of budgets b_i and scores Q(b_i) can be replaced according to that other codec. Additional coding bit rates may also be added to the EVS encoder to supplement the list of bit rates and MOS scores, or the EVS encoder, and potentially the associated MOS scores, may even be modified.

In another alternative, the distribution between channels is refined by weighting the energy with a power α, where α takes a value between 0 and 1. By varying the value of α, we can thus control the influence of the energy on the distribution: the closer α is to 1, the more weight the energy has in the score and thus the more unequal the distribution between the channels. Conversely, the closer α is to 0, the less weight the energy has and the more even the distribution between the channels. The score is then expressed in the following form:

$$S(b_1, \ldots, b_n) = \sum_{i=1}^{n} Q(b_i)\, E_i^{\alpha}$$
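The score-based selection can be sketched as follows for the 52.8 kbps example. The form of the score (sum over channels of the MOS score weighted by the energy raised to the power α) is an assumption consistent with the description; the function names and the channel energies are hypothetical.

```python
from itertools import permutations

# MOS scores Q(b_i) taken from the table above, keyed by per-frame budget in bits.
MOS = {192: 3.62, 264: 3.79, 328: 4.25}

# The 16 candidate allocations of the 52.8 kbps example.
CANDIDATES = (sorted(set(permutations((264, 264, 264, 192)))) +
              sorted(set(permutations((328, 264, 192, 192)))))


def score(allocation, energies, alpha=1.0):
    """ASSUMED score: sum over channels of Q(b_i) * E_i**alpha."""
    return sum(MOS[b] * e ** alpha for b, e in zip(allocation, energies))


def best_allocation(energies, alpha=1.0):
    """Exhaustive search over the candidate allocations."""
    return max(CANDIDATES, key=lambda c: score(c, energies, alpha))


energies = [100.0, 10.0, 1.0, 0.1]   # hypothetical channel energies after PCA/KLT
best = best_allocation(energies)     # the most energetic channel gets the most bits
```

With only sixteen candidates, an exhaustive search is cheap; a larger candidate set would call for the same loop over a precomputed table.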

In another alternative, a second weighting may be added to the score function to penalize inter-frame bit rate changes in order to make the allocation more stable. A penalty is applied to the score if the bit rate combination in frame t differs from the bit rate combination in frame t-1. The score is expressed in the following form:

where β_i takes a predetermined constant value (e.g. 0.1) when b_t,i = b_t-1,i, and β_i = 0 when b_t,i ≠ b_t-1,i.

This additional weighting makes it possible to limit excessively frequent fluctuations of the bit rate between the channels. With this weighting, only a significant change in energy results in a change of bit rate. In addition, the value of the constant may be varied to adjust the stability of the allocation.

Referring again to fig. 3, once the bit rate allocation has been determined for the frame, it is coded by block 330, for example by exhaustively coding all bit rate combinations. In the case of 9 bit rates and 4 channels, this requires ⌈log₂(9⁴)⌉ = 13 bits, where ⌈·⌉ denotes rounding up to the next integer. The combination of the 4 bit rates may be coded in the form of a single index. However, it may be preferable to enumerate (initially offline) the different bit rate combinations associated with a given bit budget and to use the minimum number of bits to represent these combinations. The index can then be represented by a coding of the "permutation code" + "combinatorial offset" type; for example, in our example where a 4-bit index codes 16 bit rate combinations comprising 4 permutations of (13.2, 13.2, 13.2, 9.6) and 12 permutations of (16.4, 13.2, 9.6, 9.6), we can code the first 4 permutations with indices 0-3 (offset 0, code range 0-3) and the other 12 permutations with indices 4-15 (offset 4, code range 0-11).
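A minimal sketch of this "offset + permutation code" indexing is given below. The ordering of the permutations inside each group (lexicographic) is an arbitrary choice made for illustration; any ordering shared by encoder and decoder would do.

```python
import math
from itertools import permutations

# Exhaustive coding: 9 possible rates on each of 4 channels.
exhaustive_bits = math.ceil(math.log2(9 ** 4))

# Offline enumeration for the 52.8 kbps example (rates in kbps).
GROUPS = [(13.2, 13.2, 13.2, 9.6), (16.4, 13.2, 9.6, 9.6)]
TABLES = [sorted(set(permutations(g))) for g in GROUPS]
OFFSETS = [0, len(TABLES[0])]            # offsets 0 and 4


def encode(allocation):
    """Index = group offset + rank of the permutation inside its group."""
    for offset, table in zip(OFFSETS, TABLES):
        if allocation in table:
            return offset + table.index(allocation)
    raise ValueError("allocation not in the enumerated set")


def decode(index):
    group = 1 if index >= OFFSETS[1] else 0
    return TABLES[group][index - OFFSETS[group]]


all_allocations = TABLES[0] + TABLES[1]
indices = [encode(a) for a in all_allocations]
```

The sixteen allocations map to indices 0 to 15, so 4 bits replace the 13 bits of the exhaustive coding.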

Referring again to fig. 3, the multi-mono coding block 340 takes as input the n matrixed channels from block 310 and the bit rates allocated to each channel by block 320, and then codes the different channels separately with a core codec, for example the EVS codec. If the core codec used allows stereo or multi-channel coding, the multi-mono approach can be replaced by a multi-stereo or multi-channel approach. Once the channels are coded, the associated bitstreams are sent to the multiplexer (block 350).

In frames where part of the total budget is not used, the multiplexer (block 350) may apply zero-bit padding to reach the bit budget allocated to the current frame, i.e. B bits. Alternatively, the remaining bit budget may be redistributed to the coding of the transformed channels so as to use the whole available budget; if the multi-mono coding is based on EVS-type technology, the specified 3GPP EVS coding algorithm may be modified to introduce additional bit rates. In this case, these additional bit rates can also be integrated into the table defining the correspondence between b_i and Q(b_i).

A bit may also be reserved to enable switching between two coding modes:

- coding according to the invention with a rotation matrix, and
- coding according to the invention with a rotation matrix restricted to the identity matrix (and therefore not transmitted), which is equivalent to direct multi-mono coding, if the rotation matrix of the previous frame is also the identity matrix (e.g. when the surround sound signal comprises very diffuse sound sources, or several sources spatially spread around certain preferred directions, in which case the correlation between the surround sound channels is lower than for a sound mix of more isolated point sources).

The choice between these two modes implies using one bit in the stream to indicate whether the current frame uses a rotation matrix restricted to the identity matrix, without transmitting rotation parameters (bit = 0), or a coded rotation matrix (bit = 1). When the bit is 0, in some variants a fixed bit allocation may be assigned to the separate channels, without transmitting the bit allocation.

Block 310, which applies the PCA/KLT analysis and transformation, will now be described in detail with reference to fig. 4. In this block, the encoder calculates in block 400 a covariance matrix from the (pre-processed) surround sound channels:

$$C = \frac{1}{L-1}\, X X^{T}$$

Alternatively, this matrix may be replaced by a correlation matrix, in which the channels are pre-normalized by their respective standard deviations, or a weighting, generally reflecting relative importance, may be applied to each channel; furthermore, the normalization term 1/(L-1) may be omitted or replaced by another value (e.g. 1/L). The values C_ij correspond to the covariance between x_i and x_j.

The encoder then carries out, in block 410, an eigenvalue decomposition (EVD) by calculating the eigenvalues and eigenvectors of the matrix C. The eigenvector matrix is denoted here V_t, with the index t of the frame, because the eigenvectors V_t-1 obtained in the previous frame, of index t-1, are preferably stored and used subsequently. The eigenvalues are denoted λ₁, λ₂, ..., λ_n.

Alternatively, a singular value decomposition (SVD) of the pre-processed channels X may be used. We then obtain the singular vectors (U on the left and V on the right) and the singular values σ_i. In this case, the eigenvalues λ_i can be taken to be σ_i² (up to the normalization factor of the covariance matrix), and the eigenvectors V_t are given by the n left singular vectors (the columns of U).
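The following sketch illustrates the covariance estimate, the eigenvalue decomposition and its link with the SVD on a random stand-in for one frame. It uses NumPy; the normalization 1/(L-1) follows the text, and the random data are of course not an audio signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 4, 960
X = rng.standard_normal((n, L))          # stand-in for one pre-processed frame (n x L)

C = X @ X.T / (L - 1)                    # n x n inter-channel covariance matrix

eigvals, V = np.linalg.eigh(C)           # ascending eigenvalues, orthonormal columns
order = np.argsort(eigvals)[::-1]        # reorder by decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

# Same information from the SVD of X: C = U diag(sigma^2 / (L-1)) U^T.
U, sigma, _ = np.linalg.svd(X, full_matrices=False)
eigvals_from_svd = sigma ** 2 / (L - 1)

# Columns of U and V agree up to sign.
column_match = np.abs(np.sum(U * V, axis=0))
```

The sign ambiguity visible in `column_match` is precisely what the signed permutation of the next block has to resolve from frame to frame.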

The encoder then applies, in block 420, a first signed permutation to the columns of the transform matrix for frame t (whose columns are the eigenvectors), in order to avoid too large a difference from the transform matrix of the previous frame t-1, which would cause click problems at the boundary with the previous frame.

Thus, once a first draft of the transform matrix has been obtained for frame t, block 420 takes the n eigenvectors estimated in the current frame of index t, V_t = [v_t,0, ..., v_t,n-1], and the n eigenvectors stored from the previous frame of index t-1, V_t-1, and applies a signed permutation to the estimated vectors V_t so that they are as close as possible to V_t-1. The eigenvectors of frame t are thus permuted so that the associated basis is as close as possible to the basis of frame t-1. This improves the frame-to-frame continuity of the transformed signals (after the transform matrix is applied to the channels).

Another constraint is that the transformation matrix must correspond to a rotation. This constraint ensures that the encoder can convert the transform matrix into generalized Euler angles (block 430) in order to quantize them with a predetermined bit budget as described above (block 440). For this purpose, the determinant of the matrix must be positive (and therefore equal to +1).

Preferably, the optimal signed permutation is obtained in two steps:

- the first step (S4 in fig. 2 presented above) matches the closest vectors between the two frames, considering only the axes and not the direction (orientation) of the axes. This problem can be formulated as a combinatorial assignment problem, in which the goal is to find the configuration that optimizes a criterion. The criterion may be defined here as the trace of the absolute value of the cross-correlation between the eigenvector matrices of frames t and t-1, which is maximized when each vector is matched to its closest counterpart:

C_t = tr(abs(corr(V_t, V_t-1)))

where tr() denotes the trace of a matrix, abs() applies the absolute value to all the coefficients of the matrix, and corr(V1, V2) gives the correlation matrix between the vectors of V1 and V2.

In one embodiment, the "Hungarian" method (or "Hungarian algorithm") is used to determine the optimal assignment, which gives a permutation of the eigenvectors of frame t;

- the second step (S6 in fig. 2) consists in determining the direction/orientation of each permuted eigenvector. Block 420 computes the cross-correlation Γ_t between the eigenvectors of frame t-1 and the permuted eigenvectors of frame t.

If a value on the diagonal of the cross-correlation matrix Γ_t is negative, this indicates a sign change between the directions of the corresponding eigenvectors. The sign of the corresponding eigenvector of frame t is then inverted.

At the end of these two steps, the transformation matrix at frame t is denoted V_t, so that at the next frame the stored matrix becomes V_t-1.
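The two steps can be sketched as follows. For n = 4 there are only 24 permutations, so this sketch replaces the Hungarian algorithm by a brute-force search, and it takes corr(·,·) to be the matrix of inner products between unit-norm eigenvectors; both choices are assumptions made for brevity, and the function name `align_basis` is hypothetical.

```python
from itertools import permutations
import numpy as np


def align_basis(V_t, V_prev):
    """Reorder and re-sign the columns of V_t to be as close as possible to V_prev."""
    n = V_t.shape[1]
    corr = V_prev.T @ V_t                 # corr[i, j] = <v_prev_i, v_t_j>
    # Step 1: assignment maximizing sum_i |corr[i, perm[i]]| (axes only, brute force).
    best = max(permutations(range(n)),
               key=lambda p: sum(abs(corr[i, p[i]]) for i in range(n)))
    V_perm = V_t[:, list(best)]
    # Step 2: flip any column whose correlation with its match is negative.
    signs = np.sign(np.sum(V_prev * V_perm, axis=0))
    signs[signs == 0] = 1.0
    return V_perm * signs


rng = np.random.default_rng(1)
V_prev, _ = np.linalg.qr(rng.standard_normal((4, 4)))            # orthonormal basis
V_t = V_prev[:, [2, 0, 3, 1]] * np.array([1.0, -1.0, 1.0, -1.0])  # shuffled, re-signed
aligned = align_basis(V_t, V_prev)
```

In this toy case the current basis is an exact signed permutation of the previous one, so the alignment recovers it exactly; with real frames the result is only the closest signed permutation.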

Alternatively, the search for the best signed permutation may be carried out by computing the change-of-basis matrix between the two frames (in 3D or 4D) and by converting this change-of-basis matrix into a unit quaternion or a pair of unit quaternions, respectively. The search then becomes a nearest-neighbor search in a dictionary representing the set of possible signed permutations. For example, in the 4D case, the 12 possible even permutations of 4 values (out of 24 permutations in total) are associated with the following pairs of unit quaternions, written as 4D vectors:

- (1, 0, 0, 0) and (1, 0, 0, 0)
- (0, 0, 0, 1) and (0, 0, -1, 0)
- (0, 1, 0, 0) and (0, 0, 0, -1)
- (0, 0, 1, 0) and (0, -1, 0, 0)
- (0.5, -0.5, -0.5, -0.5) and (0.5, 0.5, 0.5, 0.5)
- (0.5, 0.5, 0.5, 0.5) and (0.5, -0.5, -0.5, -0.5)
- (0.5, -0.5, 0.5, -0.5) and (0.5, -0.5, 0.5, 0.5)
- (0.5, -0.5, 0.5, 0.5) and (0.5, -0.5, -0.5, 0.5)
- (0.5, 0.5, -0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
- (0.5, -0.5, -0.5, 0.5) and (0.5, 0.5, -0.5, 0.5)
- (0.5, 0.5, -0.5, -0.5) and (0.5, 0.5, 0.5, -0.5)
- (0.5, 0.5, 0.5, -0.5) and (0.5, -0.5, 0.5, -0.5)

By using the above list as a dictionary of predefined quaternion pairs and by performing a nearest neighbor search for quaternion pairs associated with variations of the base matrix, a search for the (even) optimal permutation can be done. One advantage of this approach is to reuse quaternion and quaternion pair type rotation parameters.

The operations carried out in the next block 460 assume that the transformation matrix after the sign permutation is indeed a rotation matrix; the transformation matrix must necessarily be unitary, but its determinant must also be equal to 1

det(V_t)＝1

However, the transform matrices generated from blocks 410 and 420 (after EVD and signed permutation) are orthogonal (unitary) matrices, which may have a determinant of-1 or 1, meaning a reflection or rotation matrix.

If the transformation matrix is a reflection matrix (i.e. if its determinant is equal to -1), it can be turned into a rotation matrix by inverting the sign of one eigenvector (e.g. the eigenvector associated with the lowest eigenvalue) or by swapping two columns (eigenvectors).

Some methods of eigenvalue decomposition (e.g. by Givens rotations) or of singular value decomposition may yield a transformation matrix that is intrinsically a rotation matrix (determinant +1); in this case, the step of verifying that the determinant is +1 is optional.
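A minimal sketch of this test and correction is given below, using the first of the two options (negating the eigenvector with the lowest eigenvalue). The function name `to_rotation` and the toy input are illustrative only.

```python
import numpy as np


def to_rotation(V, eigvals):
    """Return V unchanged if det(V) = +1; otherwise negate the column (eigenvector)
    associated with the smallest eigenvalue, turning the reflection into a rotation."""
    V = V.copy()
    if np.linalg.det(V) < 0:
        V[:, np.argmin(eigvals)] *= -1.0
    return V


reflection = np.eye(4)[:, [1, 0, 2, 3]]     # orthogonal, determinant -1
eigvals = np.array([4.0, 3.0, 2.0, 1.0])    # hypothetical eigenvalues, one per column
rotation = to_rotation(reflection, eigvals)
det_after = np.linalg.det(rotation)
```

Negating a single column changes the sign of the determinant while keeping the matrix orthogonal, and choosing the least energetic eigenvector limits the impact on the transformed channels.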

Block 430 converts the rotation matrix into parameters. In a preferred embodiment, an angular representation is used for quantization (6 generalized Euler angles in the 4D case, 3 Euler angles in the 3D case, and one angle in 2D). For the surround sound case (4 channels), the 6 generalized Euler angles are obtained according to the method described in the article "Generalization of Euler Angles to N-Dimensional Orthogonal Matrices" by David K. Hoffman, Richard C. Raffenetti and Klaus Ruedenberg, Journal of Mathematical Physics 13, 528 (1972); for planar surround sound (3 channels) we obtain 3 Euler angles, and for the stereo case we obtain the rotation angle, according to methods well known in the art. In a preferred embodiment, scalar quantization is used, with for example the same quantization step for every angle. For example, in the case of 4 channels, we use 3 × (8 + 9) = 51 bits for the 6 generalized Euler angles: 3 angles defined on the interval [-π/2, π/2] are coded on 8 bits with a step of π/256, and the other 3 angles, defined on the interval [-π, π], are coded on 9 bits with a step of π/256. The quantization indices of the transform matrix are sent to the multiplexer (block 350). Furthermore, if the parameters used for quantization do not match the parameters used for interpolation, block 440 may convert the quantized parameters into a quantized rotation matrix.
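The 51-bit budget can be illustrated with a uniform scalar quantizer. The placement of the reconstruction points (lower bound plus a multiple of the step, with clamping at the top of the range) is an assumption, since the text only gives the intervals, the step and the number of bits; the angle values are hypothetical.

```python
import math

STEP = math.pi / 256


def quantize(angle, lo, bits):
    """Uniform scalar quantizer; returns an index on `bits` bits.
    ASSUMPTION: reconstruction points are lo + index*STEP, clamped to the index range."""
    index = round((angle - lo) / STEP)
    return min(max(index, 0), 2 ** bits - 1)


def dequantize(index, lo):
    return lo + index * STEP


# Three angles in [-pi/2, pi/2] on 8 bits and three in [-pi, pi] on 9 bits.
LAYOUT = [(-math.pi / 2, 8)] * 3 + [(-math.pi, 9)] * 3
total_bits = sum(bits for _, bits in LAYOUT)

angles = [0.3, -1.2, 0.0, 2.5, -3.0, 1.0]     # hypothetical generalized Euler angles
indices = [quantize(a, lo, b) for a, (lo, b) in zip(angles, LAYOUT)]
decoded = [dequantize(i, lo) for i, (lo, _) in zip(indices, LAYOUT)]
```

Away from the clamped upper bound, the reconstruction error is at most half a step, i.e. π/512 radians.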

Alternatively, blocks 430 and 440 may be replaced as follows:

the block 430 may perform the conversion of the rotation matrix into a pair of unit quaternions (case of 4 channels), into a unit quaternion (case of 3 channels), and into an angle (case of 2 channels).

For the 4D case, this conversion to a pair of quaternions can be carried out by the following pseudo code on a rotation matrix (whose coefficients are denoted R[i,j], i, j = 0…3):

The associated matrix A[i,j] is calculated as follows:

A[0,0] = R[0,0]+R[1,1]+R[2,2]+R[3,3]
A[1,0] = R[1,0]-R[0,1]+R[3,2]-R[2,3]
A[2,0] = R[2,0]-R[3,1]-R[0,2]+R[1,3]
A[3,0] = R[3,0]+R[2,1]-R[1,2]-R[0,3]
A[0,1] = R[1,0]-R[0,1]-R[3,2]+R[2,3]
A[1,1] = -R[0,0]-R[1,1]+R[2,2]+R[3,3]
A[2,1] = -R[3,0]-R[2,1]-R[1,2]-R[0,3]
A[3,1] = R[2,0]-R[3,1]+R[0,2]-R[1,3]
A[0,2] = R[2,0]+R[3,1]-R[0,2]-R[1,3]
A[1,2] = R[3,0]-R[2,1]-R[1,2]+R[0,3]
A[2,2] = -R[0,0]+R[1,1]-R[2,2]+R[3,3]
A[3,2] = -R[1,0]-R[0,1]-R[3,2]-R[2,3]
A[0,3] = R[3,0]-R[2,1]+R[1,2]-R[0,3]
A[1,3] = -R[2,0]-R[3,1]-R[0,2]-R[1,3]
A[2,3] = R[1,0]+R[0,1]-R[3,2]-R[2,3]
A[3,3] = -R[0,0]+R[1,1]+R[2,2]-R[3,3]
A = A/4

The 2 quaternions are computed from the associated matrix:

A2 = square(A)             # square of each coefficient
q1 = sqrt(A2.sum(axis=1))  # sum over each row
q2 = sqrt(A2.sum(axis=0))  # sum over each column

Determination of the signs (for a given row index i, then a given column index j):

for k = 0..3: if sign(A[i,k]) < 0, then q2[k] = -q2[k]
for k = 0..3: if sign(A[k,j]) != sign(q1[k]*q2[j]), then q1[k] = -q1[k]
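The pseudo code above can be turned into the following NumPy sketch. The reference row i and column j are taken as those of largest magnitude, which is an assumption (the text leaves them unspecified), and the helper functions `left_matrix` and `right_matrix`, used here only to build a test rotation and to check the result, follow one common convention for the left and right isoclinic factors.

```python
import numpy as np


def left_matrix(q):
    a, b, c, d = q
    return np.array([[a, -b, -c, -d], [b, a, -d, c], [c, d, a, -b], [d, -c, b, a]])


def right_matrix(q):
    p, q_, r, s = q
    return np.array([[p, -q_, -r, -s], [q_, p, s, -r], [r, -s, p, q_], [s, r, -q_, p]])


def rotation_to_quaternion_pair(R):
    """Factor a 4x4 rotation R into (q1, q2) with R = left_matrix(q1) @ right_matrix(q2)."""
    A = 0.25 * np.array([
        [R[0,0]+R[1,1]+R[2,2]+R[3,3], R[1,0]-R[0,1]-R[3,2]+R[2,3],
         R[2,0]+R[3,1]-R[0,2]-R[1,3], R[3,0]-R[2,1]+R[1,2]-R[0,3]],
        [R[1,0]-R[0,1]+R[3,2]-R[2,3], -R[0,0]-R[1,1]+R[2,2]+R[3,3],
         R[3,0]-R[2,1]-R[1,2]+R[0,3], -R[2,0]-R[3,1]-R[0,2]-R[1,3]],
        [R[2,0]-R[3,1]-R[0,2]+R[1,3], -R[3,0]-R[2,1]-R[1,2]-R[0,3],
         -R[0,0]+R[1,1]-R[2,2]+R[3,3], R[1,0]+R[0,1]-R[3,2]-R[2,3]],
        [R[3,0]+R[2,1]-R[1,2]-R[0,3], R[2,0]-R[3,1]+R[0,2]-R[1,3],
         -R[1,0]-R[0,1]-R[3,2]-R[2,3], -R[0,0]+R[1,1]+R[2,2]-R[3,3]],
    ])
    A2 = np.square(A)
    q1 = np.sqrt(A2.sum(axis=1))          # |q1[i]|: norm of row i of A
    q2 = np.sqrt(A2.sum(axis=0))          # |q2[k]|: norm of column k of A
    i = int(np.argmax(q1))                # reference row (assumed choice)
    q2 = np.where(A[i, :] < 0, -q2, q2)   # q1[i] is taken positive
    j = int(np.argmax(np.abs(q2)))        # reference column (assumed choice)
    q1 = np.where(A[:, j] * q2[j] < 0, -q1, q1)
    return q1, q2


rng = np.random.default_rng(3)
ql = rng.standard_normal(4); ql /= np.linalg.norm(ql)
qr = rng.standard_normal(4); qr /= np.linalg.norm(qr)
R = left_matrix(ql) @ right_matrix(qr)    # a 4D rotation built from a known pair
q1, q2 = rotation_to_quaternion_pair(R)
R_rebuilt = left_matrix(q1) @ right_matrix(q2)
```

The recovered pair equals the original pair up to a joint sign, since (q1, q2) and (-q1, -q2) describe the same 4D rotation.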

For a matrix R[i,j] (i, j = 0…2) of size 3 × 3, the conversion to a quaternion in the 3D case can be carried out as follows:

Simplified calculation from the associated matrix:

q[0] = (R[0,0]+R[1,1]+R[2,2]+1)^2 + (R[2,1]-R[1,2])^2 + (R[0,2]-R[2,0])^2 + (R[1,0]-R[0,1])^2
q[1] = (R[2,1]-R[1,2])^2 + (R[0,0]-R[1,1]-R[2,2]+1)^2 + (R[1,0]+R[0,1])^2 + (R[2,0]+R[0,2])^2
q[2] = (R[0,2]-R[2,0])^2 + (R[1,0]+R[0,1])^2 + (R[1,1]-R[0,0]-R[2,2]+1)^2 + (R[2,1]+R[1,2])^2
q[3] = (R[1,0]-R[0,1])^2 + (R[2,0]+R[0,2])^2 + (R[2,1]+R[1,2])^2 + (R[2,2]-R[0,0]-R[1,1]+1)^2

for i = 0…3: q[i] = sqrt(q[i])/4

Calculation of the signs of quaternion q:

if (R[2,1]-R[1,2]) < 0, then q[1] = -q[1]
if (R[0,2]-R[2,0]) < 0, then q[2] = -q[2]
if (R[1,0]-R[0,1]) < 0, then q[3] = -q[3]
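The 3D conversion above can be written as the following NumPy sketch. The helper `quaternion_to_matrix` is included only to build a test rotation; it uses the usual formula for a unit quaternion q = (w, x, y, z). Note that the sign rule keeps w non-negative and cannot resolve the signs when w is exactly zero.

```python
import numpy as np


def quaternion_to_matrix(q):
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])


def matrix_to_quaternion(R):
    d = np.array([R[2,1]-R[1,2], R[0,2]-R[2,0], R[1,0]-R[0,1]])   # 4wx, 4wy, 4wz
    s = np.array([R[1,0]+R[0,1], R[2,0]+R[0,2], R[2,1]+R[1,2]])   # 4xy, 4xz, 4yz
    q = np.array([
        (R[0,0]+R[1,1]+R[2,2]+1)**2 + d[0]**2 + d[1]**2 + d[2]**2,
        d[0]**2 + (R[0,0]-R[1,1]-R[2,2]+1)**2 + s[0]**2 + s[1]**2,
        d[1]**2 + s[0]**2 + (R[1,1]-R[0,0]-R[2,2]+1)**2 + s[2]**2,
        d[2]**2 + s[1]**2 + s[2]**2 + (R[2,2]-R[0,0]-R[1,1]+1)**2,
    ])
    q = np.sqrt(q) / 4
    q[1:] = np.where(d < 0, -q[1:], q[1:])     # w is kept non-negative
    return q


q0 = np.array([0.5, -0.3, 0.7, -0.2])
q0 /= np.linalg.norm(q0)                       # unit quaternion with w > 0
R = quaternion_to_matrix(q0)
q_back = matrix_to_quaternion(R)
```

Summing four squared terms per component makes the square root well conditioned for every rotation, unlike formulas that divide by a single component.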

For the case of a 2 × 2 matrix, the angle is calculated according to methods known in the art.

In some variants, the unit quaternions q1, q2 (4D case) and q (3D case) may be converted to an axis-angle representation, as known in the art.

Block 440 may then perform the quantization in the domain indicated above:

- case of 4 channels: the pair of unit quaternions q₁ and q₂ is quantized with a spherical quantization dictionary in dimension 4; by convention, q₁ is quantized with a hemispherical dictionary (because q₁ and -q₁ correspond to the same 3D rotation), and q₂ is quantized with a spherical dictionary. An example of a dictionary may be given by predefined points based on 4-dimensional polytopes; in some variants, the pair of associated axis-angle representations, which is equivalent to the pair of quaternions, may be quantized instead;

- case of 3 channels: the unit quaternion is quantized with a spherical quantization dictionary in dimension 4; an example of a dictionary may be given by predefined points based on 4-dimensional polytopes;

- case of 2 channels: the angle is quantized by uniform scalar quantization.

We now describe block 460, which interpolates the rotation matrices between two successive frames. This smooths out discontinuities in the channels after these matrices are applied. In general, if the two sets of angles or quaternions differ too much from the previous frame t-1 to the current frame t, audible clicks may occur unless a smooth transition is applied over subframes between the two frames. A transitional interpolation is therefore carried out between the rotation matrix calculated for frame t-1 and the rotation matrix calculated for frame t. In block 460, the encoder interpolates the (quantized) representation of the rotation between the current frame and the previous frame in order to avoid overly rapid fluctuations of the channels after transformation. The number of interpolations may be fixed (equal to a predetermined value) or adaptive. Each frame is then divided into subframes according to the number of interpolations determined in block 450. If adaptive interpolation is used, block 450 codes, on a chosen number of bits, the number of interpolations to be carried out and hence the number of subframes to be provided; in the case of fixed interpolation, no information needs to be coded.

Next, block 460 converts the rotation matrix into a specific domain representing the rotation. The frame is divided into subframes and, in the chosen domain, the interpolation is carried out for each subframe.

For a first-order surround sound input signal (with 4 channels W, X, Y, Z), in block 460, the encoder reconstructs the quantized 4D rotation matrix from the 6 quantized Euler angles and then converts it into two unit quaternions for interpolation purposes. In a variant where the input to the encoder is a planar surround sound signal (3 channels W, X, Y), in block 460 the encoder reconstructs the quantized 3D rotation matrix from the 3 quantized Euler angles and then converts it into a unit quaternion for interpolation purposes. In a variant where the encoder input is a stereo signal, the encoder uses in block 460 the representation of the 2D rotation given by the quantized rotation angle.

In the embodiment with 4 channels, for the interpolation of the rotation matrix between frame t-1 and frame t, the rotation matrix calculated for frame t is factorized into two quaternions (a pair of quaternions) by means of the Cayley factorization, and we use the pair of quaternions stored for the previous frame t-1, denoted (Q_L,t-1, Q_R,t-1).

The quaternions are interpolated pairwise in each subframe.

For the left quaternion (Q_L,t), the block determines the shortest path among the two possibilities (Q_L,t or -Q_L,t). The sign of the quaternion of the current frame is inverted where appropriate. The interpolation is then calculated for the left quaternion using spherical linear interpolation (SLERP):

$$Q_L(\alpha) = \frac{\sin((1-\alpha)\,\Omega_L)}{\sin \Omega_L}\, Q_{L,t-1} + \frac{\sin(\alpha\,\Omega_L)}{\sin \Omega_L}\, Q_{L,t}$$

where α is the interpolation factor (α = 1/K, 2/K, ..., 1) and Ω_L = arccos(Q_L,t-1 · Q_L,t).

For the right quaternion (Q_R,t), if the sign of the left quaternion was inverted, the sign of the right quaternion must be inverted as well in order to preserve parity, since the pair (-Q_L, -Q_R) represents the same rotation as (Q_L, Q_R). This sign constraint is hereinafter called the "joint shortest path constraint". The interpolation is then calculated in the same way as for the left quaternion:

$$Q_R(\alpha) = \frac{\sin((1-\alpha)\,\Omega_R)}{\sin \Omega_R}\, Q_{R,t-1} + \frac{\sin(\alpha\,\Omega_R)}{\sin \Omega_R}\, Q_{R,t}$$

where α is the interpolation factor (α = 1/K, 2/K, ..., 1) and Ω_R = arccos(Q_R,t-1 · Q_R,t).
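The interpolation of a quaternion pair with the joint shortest path constraint can be sketched as follows. The SLERP formula is the standard one; the function names, the small-angle guard and the example quaternions are illustrative choices, not taken from the text.

```python
import numpy as np


def slerp(q0, q1, alpha):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    dot = float(np.clip(np.dot(q0, q1), -1.0, 1.0))
    omega = np.arccos(dot)
    if omega < 1e-8:                      # nearly identical: avoid dividing by ~0
        return q1.copy()
    return (np.sin((1 - alpha) * omega) * q0 + np.sin(alpha * omega) * q1) / np.sin(omega)


def interpolate_pair(prev, cur, K):
    """Interpolated (Q_L, Q_R) pairs for alpha = 1/K, 2/K, ..., 1."""
    (ql0, qr0), (ql1, qr1) = prev, cur
    if np.dot(ql0, ql1) < 0:              # shortest path for the left quaternion...
        ql1, qr1 = -ql1, -qr1             # ...and the same flip on the right (joint constraint)
    return [(slerp(ql0, ql1, k / K), slerp(qr0, qr1, k / K)) for k in range(1, K + 1)]


identity = np.array([1.0, 0.0, 0.0, 0.0])
ql_t = -np.array([np.cos(0.4), np.sin(0.4), 0.0, 0.0])   # deliberately on the far hemisphere
qr_t = np.array([np.cos(0.3), 0.0, np.sin(0.3), 0.0])
steps = interpolate_pair((identity, identity), (ql_t, qr_t), K=10)
```

Flipping both quaternions together leaves the target rotation unchanged; the left quaternion then follows its shortest arc, while the right quaternion may follow the longer one, which is the price of keeping the pair consistent.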

Once the interpolation has been calculated for the two quaternions, a rotation matrix of size 4 x 4 (3 x 3 for planar surround or 2 x 2 for stereo respectively) is calculated.

This conversion to the rotation matrix may be carried out according to the following pseudo code:

- 4D case: for a pair of quaternions, compute the quaternion matrix and the antiquaternion matrix, as described previously, and compute their matrix product.

- 3D case: for a quaternion q = (w, x, y, z), we obtain a matrix M[i][j] (i, j = 0…2) of size 3 × 3:

xy = 2*x*y
xz = 2*x*z
yz = 2*y*z
wx = 2*w*x
wy = 2*w*y
wz = 2*w*z
xx = 2*x*x
yy = 2*y*y
zz = 2*z*z
M[0][0] = 1-(yy+zz)
M[0][1] = (xy-wz)
M[0][2] = (xz+wy)
M[1][0] = (xy+wz)
M[1][1] = 1-(xx+zz)
M[1][2] = (yz-wx)
M[2][0] = (xz-wy)
M[2][1] = (yz+wx)
M[2][2] = 1-(xx+yy)

Finally, the matrices calculated per subframe in the interpolation block 460 (or their transposes) are used in the transform block 470, which produces n transformed channels by applying the rotation matrices thus found to the surround sound channels pre-processed by block 300.

Next, we return to the determination, in block 450, of the number K of subframes, for the case where this number is adaptive. The difference between the current frame and the previous frame is either measured on the resulting rotation matrices or determined directly from the angular difference of the parameters describing the rotation matrix. In the latter case, we want to ensure that the angular variation between successive subframes is not perceptible. Using an adaptive number of subframes is particularly advantageous for reducing the average complexity of the codec; however, if a reduction in complexity is preferred, interpolation with a fixed number of subframes is preferably used.

The difference between the corrected rotation matrix of frame t and the rotation matrix of frame t-1 gives a measure of the magnitude of the change in channel matrixing between the two frames. The larger this difference, the greater the number of subframes used for the interpolation carried out in block 460. To measure this difference, we use the sum of the absolute values of the difference between the identity matrix and the cross-correlation matrix of the transformation matrices of the current and previous frames, as follows:

δ_t = ‖I_n - corr(V_t, V_t-1)‖

where I_n is the identity matrix, V_t the eigenvectors of the frame of index t, and ‖M‖ a norm of the matrix M, here the sum of the absolute values of all its coefficients. Other matrix norms (e.g. the Frobenius norm) may be used.

If the two matrices are identical, this difference is equal to 0. The more dissimilar the matrices, the larger the value of δ_t. Predetermined thresholds may be applied to δ_t, each threshold being associated with a predetermined number of interpolations, for example according to the following decision logic:

Thresholds: {4.0, 5.0, 6.0, 7.0}

Number of subframes for interpolation K: {10, 48, 96, 192}

Thus, two bits are sufficient to code the 4 possible values of the number of subdivisions (subframes).
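The decision logic can be sketched as follows. Two points are assumptions: corr(V_t, V_t-1) is taken to be the product V_t^T V_t-1 of the orthonormal eigenvector matrices, and each K is selected once δ_t reaches the corresponding threshold, the smallest K being used below the first threshold. The text does not spell out either point.

```python
import numpy as np

THRESHOLDS = [4.0, 5.0, 6.0, 7.0]
SUBFRAMES = [10, 48, 96, 192]


def rotation_difference(V_t, V_prev):
    """delta_t: sum of absolute coefficients of I - corr(V_t, V_prev),
    with corr taken as V_t^T V_prev (assumption)."""
    n = V_t.shape[0]
    return float(np.abs(np.eye(n) - V_t.T @ V_prev).sum())


def number_of_subframes(delta):
    """ASSUMED rule: K of the highest threshold reached, smallest K otherwise."""
    K = SUBFRAMES[0]
    for threshold, k in zip(THRESHOLDS, SUBFRAMES):
        if delta >= threshold:
            K = k
    return K


V = np.eye(4)
delta_same = rotation_difference(V, V)          # identical bases
K_same = number_of_subframes(delta_same)
K_index = SUBFRAMES.index(K_same)               # 2-bit index sent to the multiplexer
```

All four values of K divide the frame length L = 960, so every subframe has an integer number of samples (96, 20, 10 or 5).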

The number of interpolations K determined by block 450 is then sent to the interpolation module 460 and, in the adaptive case, the number of subframes is coded in the form of a binary index, which is sent to the multiplexer (block 350).

The use of interpolation ultimately makes it possible to optimize the decorrelation of the input channels before the multi-mono coding. Indeed, because of this search for decorrelation, the rotation matrices calculated for the previous frame t-1 and the current frame t may be very different, yet the interpolation makes it possible to smooth out this difference.

The interpolation used requires only a limited computational cost at the encoder and the decoder, since it is carried out in a specific domain (angle in 2D, quaternion in 3D, pair of quaternions in 4D). This approach is more advantageous than interpolating the covariance matrices calculated for the PCA/KLT analysis and repeating an EVD-type eigenvalue decomposition several times per frame.

Block 470 then carries out the matrixing of the surround sound channels for each subframe using the transform matrices calculated in block 460. For each subframe, this matrixing amounts to multiplying the corresponding interpolated rotation matrix (or its transpose) by X(α), where X(α) is the sub-block of size n × (L/K) of the pre-processed channels (for α = 1/K, 2/K, ..., 1). The signals contained in these transformed channels are then sent to block 340 for multi-mono coding.
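The per-subframe matrixing and its inverse at the decoder can be sketched as follows. The convention that the encoder applies the transpose of the rotation matrix and the decoder the matrix itself is an assumption (the text allows either the matrix or its transpose); the planar rotations used as stand-ins for the interpolated matrices are purely illustrative.

```python
import numpy as np


def apply_subframe_rotations(X, rotations, inverse=False):
    """Matrix each of the K sub-blocks of X (n x L) with its own n x n rotation.
    ASSUMPTION: the forward transform is V^T X; the inverse is then V Y."""
    K = len(rotations)
    n, L = X.shape
    assert L % K == 0, "the frame length must be a multiple of the number of subframes"
    sub = L // K
    Y = np.empty_like(X)
    for k, V in enumerate(rotations):
        M = V if inverse else V.T
        Y[:, k * sub:(k + 1) * sub] = M @ X[:, k * sub:(k + 1) * sub]
    return Y


def plane_rotation(theta, n=4):
    """Rotation by theta in the plane of the first two channels (illustrative only)."""
    V = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    V[0, 0], V[0, 1], V[1, 0], V[1, 1] = c, -s, s, c
    return V


rng = np.random.default_rng(4)
X = rng.standard_normal((4, 960))                         # one pre-processed frame
rotations = [plane_rotation(0.1 * k) for k in range(1, 11)]   # K = 10 subframes
Y = apply_subframe_rotations(X, rotations)                # encoder side
X_rec = apply_subframe_rotations(Y, rotations, inverse=True)  # decoder side
```

Because each matrix is a rotation, its inverse is its transpose, so the decoder needs no matrix inversion and the transform preserves the total energy of each sub-block.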

Referring now to fig. 5, a decoder in an exemplary embodiment of the invention is described.

After demultiplexing the bitstream of the current frame t at block 500, the allocation information is decoded (block 510), which makes it possible to demultiplex and decode the bitstream(s) received for each of the n transformed channels (block 520).

Block 520 invokes multiple instances of core decoding that are performed separately. The core decoding may be of EVS type, optionally modified to improve its performance. Each channel is decoded separately using a multi-channel-single-channel approach. If the previously used coding is stereo or multi-channel coding, the multi-channel-to-mono method can be replaced with multi-channel-stereo or multi-channel for decoding. The so decoded channels are sent to block 530, which decodes the rotation matrix of the current frame and optionally the number of subframes K to be used for interpolation (if the interpolation is adaptive). For each matrix, the interpolation block 460 divides the frame into subframes for which the number K can be read in the stream encoded by the block 610 (fig. 6) and interpolates the rotation matrix in order to find the same matrix as in the block 460 of the encoder in the absence of transmission errors, so as to be able to invert the transformation previously done in the block 470.

Block 530 performs the inverse of the matrixing of block 470 in order to reconstruct the decoded signal, as described in detail below with reference to fig. 6. This matrixing amounts to applying, for each sub-frame, the inverse of the interpolated transform matrix to the corresponding decoded sub-block, the sub-blocks being consecutive and of size n × (L/K) (for α = 1/K, 2/K, ..., 1).

Block 530 typically performs the decoding and the PCA/KLT synthesis that is the inverse of the analysis performed by block 310 of fig. 3. In block 600, the quantization indices of the rotation parameters of the current frame are decoded. Scalar quantization may be used, with the same quantization step size for every angle. In the adaptive case, the number of interpolation sub-frames is decoded (block 610) to find the number of sub-frames K in the set {10, 48, 96, 192}; in variants where the frame length L is different, this set of values may be adjusted. The interpolation at the decoder is the same as the interpolation performed at the encoder (block 460).
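The uniform scalar quantization of the angles mentioned above can be sketched as follows. The step size of 2π/256 and the function names are hypothetical; the text only states that the step is the same for every angle.

```python
import math

STEP = 2.0 * math.pi / 256  # hypothetical step size, identical for all angles

def quantize_angle(theta, step=STEP):
    """Encoder side: map an angle to the index of the nearest reconstruction level."""
    return int(round(theta / step))

def dequantize_angle(index, step=STEP):
    """Decoder side (block 600): rebuild the angle from its quantization index."""
    return index * step

angles = [0.0, 0.3, -1.2, 3.0]                      # example rotation angles in radians
indices = [quantize_angle(a) for a in angles]       # values written to the bitstream
decoded = [dequantize_angle(i) for i in indices]    # values recovered by the decoder
```

With a uniform step, the reconstruction error on each angle is bounded by half the step size.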

Block 620 performs inverse matrixing of the surround sound channels of each sub-frame using the inverse (effectively the transpose) of the transform matrix calculated in block 460.
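The per-sub-frame matrixing and its inversion by the transpose can be sketched for the 2D case (n = 2), where the rotation is described by a single angle interpolated linearly across sub-frames. This is a minimal sketch under assumptions: the encoder is taken to apply the matrix and the decoder its transpose, the frame length L is taken to be divisible by K, and all names and numerical values are illustrative.

```python
import math

def rot2(theta):
    """2x2 rotation matrix for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matrix_by_subframe(x, mats, inverse=False):
    """Apply one n x n matrix per sub-block of L/K samples.

    x is a list of n channels of L samples; mats holds K matrices.
    With inverse=True the transpose (inverse of a rotation) is applied.
    """
    n, length, k_sub = len(x), len(x[0]), len(mats)
    step = length // k_sub
    out = [[] for _ in range(n)]
    for k, m in enumerate(mats):
        block = [row[k * step:(k + 1) * step] for row in x]
        y = matmul(transpose(m) if inverse else m, block)
        for ch in range(n):
            out[ch].extend(y[ch])
    return out

K = 4
theta_prev, theta_curr = 0.1, 0.5  # angles of frames t-1 and t (example values)
mats = [rot2(theta_prev + (k / K) * (theta_curr - theta_prev)) for k in range(1, K + 1)]

x = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     [0.5, -0.5, 1.5, -1.5, 2.5, -2.5, 3.5, -3.5]]
y = matrix_by_subframe(x, mats)                      # encoder-side matrixing
x_hat = matrix_by_subframe(y, mats, inverse=True)    # decoder-side inverse matrixing
```

Because each matrix is a rotation, its transpose is its inverse, so `x_hat` matches `x` up to rounding when the channels in between are left unquantized.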

The invention therefore uses an approach that is entirely different from that of the MPEG-H codec with overlap-add. It is based on a specific representation of the transformation matrices, which are restricted in the time domain to rotation matrices from one frame to the next. This in particular makes it possible to interpolate the transformation matrices, with a matching that ensures directional consistency (including the sign of each direction).

The general approach of the invention is to encode the surround sound signal in the time domain by PCA, in particular by expressing the PCA transformation matrix as a rotation matrix in an optimized manner (notably in the domain of quaternions and quaternion pairs) and by interpolating it over sub-frames in order to improve quality. The interpolation step is either fixed or adaptive, depending on a criterion measuring the difference between the cross-correlation matrix and a reference matrix (the identity), or between the matrices to be interpolated. The quantization of the rotation matrices may be implemented in the domain of generalized Euler angles. Preferably, however, 3-dimensional and 4-dimensional matrices are quantized in the domain of quaternions and quaternion pairs respectively, which keeps quantization and interpolation in the same domain.

Furthermore, the eigenvectors are aligned from one frame to the next in order to avoid clicks and channel inversions between frames.
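One possible way to perform such an alignment is sketched below: each eigenvector of the previous frame is matched to the most collinear eigenvector of the current frame, and the sign is flipped when the two point in opposite directions. The greedy matching and the names used are assumptions made for illustration, not the procedure specified by the text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align_eigenvectors(prev_vecs, curr_vecs):
    """Reorder and sign-flip curr_vecs so that entry i best matches prev_vecs[i].

    Both arguments are lists of n unit-norm vectors (the matrix columns).
    """
    unused = list(range(len(curr_vecs)))
    aligned = []
    for p in prev_vecs:
        # Greedy choice: the unused current vector most collinear with p.
        best = max(unused, key=lambda j: abs(dot(p, curr_vecs[j])))
        unused.remove(best)
        v = list(curr_vecs[best])
        if dot(p, v) < 0.0:
            v = [-c for c in v]  # same axis, opposite direction: flip the sign
        aligned.append(v)
    return aligned

prev_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Current frame: same axes, but permuted and with two signs inverted.
curr_vecs = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, -1.0]]
aligned = align_eigenvectors(prev_vecs, curr_vecs)
```

Sign flips and permutations can change the determinant of the resulting matrix, so the check that the matrix is a rotation still has to be applied after alignment.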

Of course, the invention is not limited to the embodiments described above as examples, but extends to other variants.

The above description thus discusses the case of 4 channels.

However, in some variants, more than four channels may be encoded. The embodiment then remains the same (in terms of functional blocks) as in the case n = 4, but the interpolation by quaternion pairs is replaced by the following general method.

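The transformation matrices at frames t-1 and t are denoted $V_{t-1}$ and $V_t$. The interpolation between $V_{t-1}$ and $V_t$ can be performed with a factor α, such that:

$$V_{interp}(\alpha) = V_{t-1}\left(V_{t-1}^{T} V_t\right)^{\alpha}$$

The term $\left(V_{t-1}^{T} V_t\right)^{\alpha}$ may be calculated directly by eigenvalue decomposition of $V_{t-1}^{T} V_t$. Indeed, if $V_{t-1}^{T} V_t = Q \Lambda Q^{-1}$, then we have $\left(V_{t-1}^{T} V_t\right)^{\alpha} = Q \Lambda^{\alpha} Q^{-1}$.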
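The following short derivation illustrates why a matrix power obtained by eigenvalue decomposition yields a valid interpolation. It assumes that the interpolated matrix takes the usual form $V_{interp}(\alpha) = V_{t-1}(V_{t-1}^{T} V_t)^{\alpha}$ with α between 0 and 1; the symbols $R$, $Q$, $\Lambda$ and $\theta_k$ are introduced here for illustration.

```latex
\begin{aligned}
R &= V_{t-1}^{T} V_t = Q \Lambda Q^{-1},
  &\Lambda &= \operatorname{diag}\!\left(e^{i\theta_1}, \dots, e^{i\theta_n}\right),\\
R^{\alpha} &= Q \Lambda^{\alpha} Q^{-1},
  &\Lambda^{\alpha} &= \operatorname{diag}\!\left(e^{i\alpha\theta_1}, \dots, e^{i\alpha\theta_n}\right),\\
V_{interp}(0) &= V_{t-1}\, I = V_{t-1}, \\
V_{interp}(1) &= V_{t-1} V_{t-1}^{T} V_t = V_t .
\end{aligned}
```

Since $R$ is a rotation, its eigenvalues lie on the unit circle, so raising them to the power α simply scales each rotation angle $\theta_k$ by α. The interpolated matrix therefore moves from $V_{t-1}$ at α = 0 to $V_t$ at α = 1.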

Note that this variant could also replace the interpolation by pairs of unit quaternions (4D case), by unit quaternions (3D case) or by angles. However, this would be less advantageous, since it requires an additional diagonalization step and power calculations, whereas the embodiments described above are more efficient for these cases of 2, 3 or 4 channels.
