Audio object coding method suitable for personalized interactive system

Document No.: 1578620 | Published: 2020-01-31

Abstract: This technology, an audio object coding method suitable for a personalized interactive system (一种适应于个性化交互系统的音频对象编码方法), was designed by 胡瑞敏 (Hu Ruimin), 胡晨昊 (Hu Chenhao), 王晓晨 (Wang Xiaochen), 武庭照 (Wu Tingzhao) and 吴玉林 (Wu Yulin) on 2019-10-14. The invention discloses an audio object coding method suitable for a personalized interactive system. In the encoding stage, the audio objects to be coded are first framed, windowed and transformed from the time domain to the frequency domain; the objects are sorted by energy to determine the coding order; the object coded in each step and the corresponding downmix signal are extracted in a loop, and the parameters and residual of each step are calculated from them; the large residual matrices are decomposed and compressed by singular value decomposition; and the final downmix signal, the parameters and the residual decomposition matrices are combined into a code stream. In the decoding stage, the residuals are reconstructed from the decomposition matrices, and the objects are then decoded and reconstructed from the downmix signal step by step according to the residual and parameters of each object. Through ordered multi-step coding and decoding combined with residual decomposition, the invention achieves both a low bit rate and high-quality reconstruction of every audio object.

1. An audio object coding method adapted to a personalized interactive system, characterized by comprising the following steps:

step A1: performing frame windowing on an input audio object sequence, converting a time domain signal into a frequency domain signal, and obtaining a time-frequency matrix of each audio object;

step A2: according to the time-frequency matrix of each object, calculating the frequency domain energy of the objects to sort, and determining the object to be coded in each step in multi-step progressive coding;

step A3: according to the determined coding order, performing step-by-step downmixing and calculating the corresponding side information; wherein step-by-step downmixing means performing matrix addition on the data of the objects input in the current processing step to obtain a sum matrix, the step-by-step (intermediate) downmix signals are not transmitted in the code stream, the side information comprises object residuals and an object gain parameter matrix, and the object gain parameters are calculated from the energy ratio of the two input signals in an object pair;

step A4: decomposing the object residual error in the side information into a left singular matrix, a right singular matrix and singular values by singular value decomposition;

step A5: quantizing the singular matrix, the singular value and the object gain parameter to obtain a side information code stream;

step A6: coding the final downmix signal in the step A3 to obtain a downmix signal code stream;

step A7: and synthesizing the code streams obtained in the step A5 and the step A6 into an output code stream, and transmitting the output code stream to a decoding end.

2. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A1, the one-dimensional sound signal in the original time domain is transformed into a two-dimensional spectrogram in the frequency domain by framing, windowing and the modified discrete cosine transform (MDCT), and the resulting object data is output in matrix form.

3. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A2, the frequency domain energy of each object is calculated from the object data in matrix form, the objects are sorted by energy from large to small, and the order in which the objects are coded in each step is determined; the coding order means that audio objects with larger energy are coded first;

the calculation of the frequency domain energy of the object is shown as follows:

$$O_i = \frac{\|S_i\|}{\sum_{j=1}^{N} \|S_j\|}$$

wherein ||S_i|| represents the total energy of the i-th audio object and O_i represents the proportion of the i-th object in the total energy of all objects; the objects are sorted by their O_i values from large to small, in the order D (S_1), B (S_2), A (S_3), …, C (S_N), where N is the number of objects to be coded, and objects with large O_i values are coded first.

4. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A3, downmixing is performed and the side information of the coded objects is calculated step by step, and the side information of only one object is calculated in each step;

the calculation formula of the object residual and the object gain parameter is as follows:

Figure FDA0002232447770000021

Figure FDA0002232447770000022

wherein R(i) is the residual signal of the (i+1)-th object, G_o(i) is the gain parameter of the (i+1)-th object, and G_d(i) is the gain parameter of the i-th downmix signal; X_i denotes the downmix signal obtained in step i, P_o(i) is the energy of object i, and P_d(i) is the energy of the downmix signal of the i-th step; N denotes the number of objects to be coded.

5. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A4, the residual matrices of the multiple objects are reduced in dimension and compressed by singular value decomposition, limiting the increase in data volume caused by the residual information; each residual matrix is decomposed into three small matrices, namely a left singular matrix, a singular value matrix and a right singular matrix; and for the singular value matrix only the values on the matrix diagonal are transmitted.

6. The method of claim 1, wherein in the step A5, the side information is quantized by a table lookup method, the element values of the residual decomposition matrix and the gain parameter matrix are normalized before quantization, the closest quantization value is looked up in a quantization table according to the size of each element value, and the corresponding quantization index is outputted as the side information quantization code stream.

7. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A6, the final downmix signal is encoded by an AAC encoder and the resulting code stream is output.

8. The audio object coding method adapted to a personalized interactive system according to claim 1, characterized in that: in step A7, synthesizing the output code stream means merging the final downmix signal code stream and the side information code stream and adding flag bits for identification and parsing; the final downmix signal code stream is the code stream output by AAC encoding, and the side information code stream is the quantization index code stream output after the residual decomposition matrices and the gain parameters are quantized.

9. An audio object decoding method adapted to a personalized interactive system, characterized by decoding the code stream generated by the method of any one of claims 1-8;

the specific implementation comprises the following substeps:

step B1: analyzing the received code stream to obtain a side information code stream and a down-mixing signal code stream;

step B2: carrying out AAC decoding on the down-mixed signal code stream to obtain a down-mixed signal;

step B3: the side information is dequantized to obtain a left singular matrix, a right singular matrix, a singular value and an object gain parameter;

step B4: performing matrix synthesis on the left singular matrix, the right singular matrix and the singular value to recover an object residual error;

step B5: decoding in the reverse of the coding order, and reconstructing the audio object frequency domain signals from the transmitted downmix signal in a loop by using the side information;

step B6: converting the audio object signals in the frequency domain to the time domain by using the inverse time-frequency transform.

10. The audio object decoding method adapted to a personalized interactive system according to claim 9, characterized in that: in step B4, the matrix synthesis is to multiply the left singular matrix, the singular value matrix, and the right singular matrix to obtain an approximate object residual error.

Technical Field

The invention belongs to the technical field of digital audio signal processing, and particularly relates to an audio object coding and decoding method based on multi-step progressive downmixing and reconstruction, which is suitable for a personalized interactive spatial audio system and allows users to adjust audio objects according to their own needs.

Background

Channel-based spatial audio coding technology can encode and reconstruct three-dimensional audio scenes and provides a more immersive auditory experience than mono or stereo audio; examples include MPEG spatial audio coding and the NHK 22.2 loudspeaker array. Spatial audio technology has therefore become increasingly popular.

Many scholars and research institutes around the world have carried out research on audio object coding and proposed various audio object coding methods. The most representative is Spatial Audio Object Coding (SAOC), proposed by the German research institute Fraunhofer [Document 1], which encodes and transmits a downmix signal of multiple audio objects together with side information, and separates and reconstructs the audio objects from the downmix signal at the decoding end based on the side information. SAOC can transmit a large number of audio objects at a low bit rate, greatly improving the coding efficiency of audio objects and allowing users to make personalized adjustments and interact according to their listening needs [Document 2].

In the SAOC framework, to obtain a lower coding bit rate, the same parameters are used as side information throughout a subband. This causes aliasing distortion in the frequency domain and seriously degrades the listening experience; for example, a reconstructed audio object signal may contain components of other objects when played back [Document 3]. This problem may further affect spatial audio personalized interactive services at the user end.

Document 1: Breebaart, J., Engdegård, J., Falch, C., et al.: Spatial audio object coding (SAOC) - the upcoming MPEG standard on parametric object based audio coding. In: Audio Engineering Society Convention 124. Audio Engineering Society (2008).

Document 2: Coleman, P., Franck, A., Francombe, J., et al.: An audio-visual system for object-based audio: From recording to listening. IEEE Transactions on Multimedia 20(8), 1919-.

Document 3: Wu, T., Hu, R., Wang, X., Ke, S.: Audio object coding based on optimal parameter frequency resolution. Multimedia Tools and Applications, pp. 1-16 (2019).

Document 4: Spatial audio object coding with two-step coding structure for interactive audio service. IEEE Transactions on Multimedia 13(6), 1208-1216 (2011).

Document 5: Lee, B., Kim, K., Hahn, M.: Effective residual coding method of spatial audio object coding with two-step coding structure for interactive audio services. IEICE Transactions on Information and Systems 99(7), 1949-.

Disclosure of Invention

To solve the above technical problems, the invention provides an audio object coding and decoding method based on multi-step progressive downmixing and reconstruction, which performs high-quality audio coding and decoding at medium and low bit rates and ensures that all audio objects have good decoded sound quality.

An audio object coding method suitable for a personalized interactive system, characterized by comprising the following steps:

step A1: performing frame windowing on an input audio object sequence, converting a time domain signal into a frequency domain signal, and obtaining a time-frequency matrix of each audio object;

step A2: according to the time-frequency matrix of each object, calculating the frequency domain energy of the objects to sort, and determining the object to be coded in each step in multi-step progressive coding;

step A3: according to the determined coding order, performing step-by-step downmixing and calculating the corresponding side information; wherein step-by-step downmixing means performing matrix addition on the data of the objects input in the current processing step to obtain a sum matrix, the step-by-step (intermediate) downmix signals are not transmitted in the code stream, the side information comprises object residuals and an object gain parameter matrix, and the object gain parameters are calculated from the energy ratio of the two input signals in an object pair;

step A4: decomposing the object residual error in the side information into a left singular matrix, a right singular matrix and singular values by singular value decomposition;

step A5: quantizing the singular matrix, the singular value and the object gain parameter to obtain a side information code stream;

step A6: coding the final downmix signal in the step A3 to obtain a downmix signal code stream;

step A7: and synthesizing the code streams obtained in the step A5 and the step A6 into an output code stream, and transmitting the output code stream to a decoding end.

Compared with existing audio object coding technology, the invention has the following advantages: multi-step progressive coding and decoding is used, and residuals compensate the decoding distortion as far as possible, so that every audio object has good listening quality; at the same time, singular value decomposition is introduced to decompose and compress the residual information, which reduces the bit rate. The invention can therefore decode high-quality audio objects at medium and low bit rates, meeting the needs of personalized interactive audio systems.

Drawings

FIG. 1 is a diagram of the encoding principle of an embodiment of the present invention;

fig. 2 is a decoding schematic diagram of an embodiment of the present invention.

Detailed Description

To help those skilled in the art understand and practice the present invention, the technical solution is described below with reference to the accompanying drawings and specific examples. It should be understood that the examples described here are only for illustrating and explaining the present invention and are not intended to limit it.

First, the coding order is determined from the frequency domain energy of the objects, which fixes the object to be coded and the side information to be calculated in each step; the residual information of each object is finally obtained, which effectively reduces signal distortion and aliasing in all reconstructed objects. The residual information is then decomposed into three low-dimensional matrices by singular value decomposition, compressing the residual information and reducing the bit rate.

Referring to fig. 1, the present invention proposes a multi-audio-object coding method adapted to a personalized interactive system. This embodiment takes four input objects A, B, C and D as an example, and comprises the following steps:

step A1: inputting audio objects A, B, C, D (which may include various objects such as human voice, piano, guitar, etc.), framing and windowing each object, converting the time domain signal to the frequency domain signal, and obtaining a time-frequency matrix of each audio object;

in this embodiment, the one-dimensional sound signal in the original time domain is converted into a two-dimensional spectrogram in the frequency domain by framing, windowing and the modified discrete cosine transform (MDCT), and the resulting object data is output in matrix form.

The input audio object signals are in WAV format with a sampling rate of 44.1 kHz and a bit depth of 16 bits.

It should be noted that the audio parameters and object types specified herein are only for illustrating the implementation process of the present invention, and are not used to limit the present invention.

In the framing and windowing, each frame has a length of 1024, a Hanning window is used as the window function, and adjacent frames overlap by 50% in the time domain; the time-frequency transform is the modified discrete cosine transform (MDCT) with a transform length of 2048 points. The output is a set of audio object signals in matrix form, in which the number of rows equals the number of frames and the number of columns equals the number of frequency points (or, alternatively, the number of columns equals the number of frames and the number of rows equals the number of frequency points).

It should be noted that the frame length, the type of window function, the transformation method, etc. specified herein are only for illustrating the specific implementation steps of the present invention, and are not used to limit the present invention.
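The framing, windowing and MDCT of step A1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes that the 2048-point transform is applied to Hanning-windowed blocks that advance by 1024 samples (50% overlap), it uses a direct matrix form of the MDCT, and the function name `mdct_spectrogram` is hypothetical. Rows of the result are frequency points and columns are frames.

```python
import numpy as np

def mdct_spectrogram(x, half_len=1024):
    """Return an MDCT time-frequency matrix of shape (half_len, n_frames)."""
    x = np.asarray(x, dtype=float)
    block_len = 2 * half_len                      # transform length (2048 by default)
    if len(x) < block_len:
        raise ValueError("signal shorter than one transform block")
    n_frames = (len(x) - block_len) // half_len + 1
    window = np.hanning(block_len)                # Hanning window, as in the text
    n = np.arange(block_len)
    k = np.arange(half_len)
    # Direct MDCT basis: cos(pi/M * (n + 1/2 + M/2) * (k + 1/2)), shape (M, 2M)
    basis = np.cos(np.pi / half_len * np.outer(k + 0.5, n + 0.5 + half_len / 2))
    # Blocks advance by half_len samples, i.e. 50% overlap between neighbours
    blocks = np.stack(
        [x[t * half_len: t * half_len + block_len] * window for t in range(n_frames)],
        axis=1,
    )
    return basis @ blocks
```

Under these assumptions a one-second signal at 44.1 kHz yields a 1024 x 42 matrix, since (44100 - 2048) // 1024 + 1 = 42 blocks fit into the signal.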

Step A2: according to the time-frequency matrix of each object, calculating the frequency domain energy of the objects to sort, and determining the object to be coded in each step in multi-step progressive coding;

in this embodiment, the frequency domain energy of each object is calculated from the object data in matrix form, the objects are sorted by energy from large to small, and the order in which the objects are coded in each step is determined; the coding order means that audio objects with larger energy are coded first.

The calculation of the object frequency domain energy is shown as follows:

$$O_i = \frac{\|S_i\|}{\sum_{j=1}^{N} \|S_j\|}$$

wherein ||S_i|| represents the total energy of the i-th audio object and O_i represents the proportion of the i-th object in the total energy of all objects; the objects are sorted by their O_i values from large to small, giving the order D (S_1), B (S_2), A (S_3), C (S_4), and objects with large O_i values are coded first. It should be noted that i ∈ [1, 4] and the large-to-small ordering specified here are only examples for illustrating the specific implementation steps of the present invention and are not intended to limit it.
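The energy-based ordering of step A2 can be illustrated with a short sketch. It assumes that the total energy of an object is the sum of its squared time-frequency coefficients (the text does not spell out the norm), and the function name `coding_order` is hypothetical.

```python
import numpy as np

def coding_order(objects):
    """objects: dict mapping a name to its time-frequency matrix.

    Returns the names sorted by energy share (largest first) and the shares.
    """
    # ASSUMPTION: energy is the sum of squared coefficients
    energy = {name: float(np.sum(np.square(m))) for name, m in objects.items()}
    total = sum(energy.values())
    share = {name: (e / total if total > 0 else 0.0) for name, e in energy.items()}
    order = sorted(share, key=share.get, reverse=True)
    return order, share

base = np.ones((4, 3))
order, share = coding_order({"A": 2 * base, "B": 3 * base, "C": 1 * base, "D": 4 * base})
print(order)  # ['D', 'B', 'A', 'C']
```

With these toy amplitudes the energies are 48, 108, 12 and 192 for A, B, C and D, which reproduces the D, B, A, C order used in the embodiment.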

Step A3: according to the coding order, performing step-by-step downmixing and calculating the corresponding side information (object residuals and object gain parameters);

in this embodiment, step-by-step downmixing means performing matrix addition on the data of the objects input in the current processing step to obtain a sum matrix; the step-by-step (intermediate) downmix signals are not transmitted in the code stream; the side information comprises object residuals and an object gain parameter matrix, and the object gain parameters are calculated from the energy ratio of the two input signals in an object pair;

the calculation formula of the object residual and the object gain parameter is as follows:

Figure BDA0002232447780000042

Figure BDA0002232447780000051

wherein R(i) is the residual signal of the (i+1)-th object, G_o(i) is the gain parameter of the (i+1)-th object, and G_d(i) is the gain parameter of the i-th downmix signal; X_i denotes the downmix signal obtained in step i, P_o(i) is the energy of object i, and P_d(i) is the energy of the downmix signal of the i-th step. In this embodiment N = 4, where N denotes the number of objects to be coded.

It should be noted that the number N of objects defined herein is 4, which is merely an example of the implementation steps of the present invention and is not used to limit the present invention.

In this example, following the coding order determined in step A2, the multi-step downmix calculation according to the above formulas proceeds as follows. In the first step, objects D and B form an object pair for downmixing and parameter extraction (in the first step D is treated as the downmix signal for the calculation), giving the downmix signal X_1 of the two objects, the gain parameter G_o(1) of the second object B and its residual R(1). In the second step, the downmix signal X_1 and object A form an object pair for downmixing and parameter extraction, giving the second-step downmix signal X_2, the gain parameter G_o(2) of the third object A and its residual R(2). In the third step, the downmix signal X_2 and object C form an object pair for downmixing and parameter extraction, giving the third-step downmix signal X_3 (that is, the final downmix signal that must be transmitted to the decoding end), the gain parameter G_o(3) of the fourth object C and its residual R(3). The four objects thus complete downmixing and parameter extraction in three steps.

It should be noted that the encoding sequence and the number of steps specified herein are only for illustrating the specific implementation steps of the present invention, and are not used to limit the present invention.
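The three-step procedure above can be sketched in code. Because the formula images for the gain parameters and the residual are not reproduced in this text, the two expressions marked ASSUMPTION below are illustrative stand-ins (energy shares of the two inputs of an object pair, and a residual defined against the gain-scaled downmix); they are not taken from the patent. The function name `progressive_downmix` is hypothetical.

```python
import numpy as np

def progressive_downmix(ordered_objects):
    """Downmix objects one at a time, highest energy first.

    Returns the final downmix and a list with one side-information entry per step.
    """
    downmix = np.array(ordered_objects[0], dtype=float)   # step 1 treats the first object as the downmix
    side_info = []
    for obj in ordered_objects[1:]:
        obj = np.asarray(obj, dtype=float)
        p_o = float(np.sum(obj ** 2))          # energy of the object added in this step
        p_d = float(np.sum(downmix ** 2))      # energy of the downmix entering this step
        total = p_o + p_d
        # ASSUMPTION: gains are the energy shares of the two inputs of the object pair
        g_o = p_o / total if total > 0 else 0.0
        g_d = p_d / total if total > 0 else 0.0
        downmix = downmix + obj                # matrix addition gives X_i
        # ASSUMPTION: the residual is what the gain-scaled downmix fails to explain
        residual = obj - g_o * downmix
        side_info.append({"g_o": g_o, "g_d": g_d, "residual": residual})
    return downmix, side_info
```

For the four-object example, calling `progressive_downmix([D, B, A, C])` returns X_3 and three side-information entries; only X_3 and the side information would then be passed on to steps A4 to A7.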

Step A4: decomposing the object residuals in the side information into a left singular matrix, a right singular matrix and singular values by singular value decomposition;

in the embodiment, the dimension reduction compression is carried out on the residual error matrixes of a plurality of objects by a singular value decomposition method, so that the data volume increase caused by residual error information is reduced; the residual matrix is decomposed into three small matrixes which are a left singular matrix, a singular value matrix and a right singular matrix respectively; wherein the singular value matrix transmits only the values on the matrix diagonal.

Singular value decomposition (SVD) is a matrix factorization method that reduces a matrix to its constituent parts, so that a high-dimensional matrix can be represented by several low-dimensional matrices, achieving data compression.

Figure BDA0002232447780000052

Figure BDA0002232447780000061

wherein R(i)_{P×Q} is the residual signal of the (i+1)-th object, the number of rows P is half of the MDCT transform length, and the number of columns Q is the number of frames of the audio object; U is the left singular matrix, Λ is the singular value matrix and V is the right singular matrix; the singular values on the diagonal of Λ are sorted from large to small.

For dimensionality reduction, the first r singular values (r = 50) and the corresponding parts of the singular matrices can be selected to approximate R(i), as follows:

Figure BDA0002232447780000062

Figure BDA0002232447780000063

wherein Figure BDA0002232447780000064 is a portion of the singular value matrix, and Figure BDA0002232447780000065 and Figure BDA0002232447780000066 are the first 50 rows (or columns) of the original left and right singular matrices. The residual signal can be approximately represented by these three matrices, which reduces the matrix dimensionality and compresses the side information data volume.
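The formula images for the decomposition are not reproduced in this text. As a hedged reading based only on the symbols defined above, the standard singular value decomposition and its rank-r truncation would be written as follows; the subscripted names U_r, Λ_r and V_r are notation assumed here for the truncated matrices.

```latex
R(i)_{P \times Q} = U \Lambda V^{T}, \qquad
R(i)_{P \times Q} \approx U_{r} \Lambda_{r} V_{r}^{T}, \quad
U_{r} \in \mathbb{R}^{P \times r},\;
\Lambda_{r} = \operatorname{diag}(\lambda_{1}, \dots, \lambda_{r}),\;
V_{r} \in \mathbb{R}^{Q \times r}
```

Under this reading the three matrices hold r(P + Q + 1) values instead of PQ. For example, with P = 1024, r = 50 and a hypothetical Q = 1000 frames, that is 101,250 values instead of 1,024,000.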

It should be noted that r = 50 is given only to illustrate the specific implementation steps of the present invention and is not intended to limit it.
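A numerical sketch of the truncation in step A4, together with the matrix synthesis that the decoder performs in step B4, is given below. It relies on NumPy's `numpy.linalg.svd`; the function names `compress_residual` and `restore_residual` are hypothetical, and quantization of the factors is left out.

```python
import numpy as np

def compress_residual(residual, r=50):
    """Keep the r largest singular values and the matching singular vectors."""
    # full_matrices=False returns the thin SVD; s is sorted from large to small
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    r = min(r, len(s))                      # cannot keep more values than exist
    return u[:, :r], s[:r], vt[:r, :]

def restore_residual(u_r, s_r, vt_r):
    """Matrix synthesis at the decoder: multiply the three factors back together."""
    return (u_r * s_r) @ vt_r               # scales column j of u_r by s_r[j]
```

Only the diagonal of the singular value matrix is kept (the vector `s_r`), which matches the statement that just the values on the matrix diagonal are transmitted.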

Step A5: quantizing the singular value, the singular matrix and the object gain parameter to obtain a side information code stream;

in the quantization operation, the elements of the residual decomposition matrices and of the gain parameters have different value ranges, so they are normalized before quantization in order to use a unified quantization table; the closest quantization value is then looked up in the quantization table according to the value of each element, and the corresponding quantization index is output as the side information quantization code stream.
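The normalize-then-look-up quantization can be sketched as follows. The patent does not give the quantization table or the normalization rule, so this sketch assumes peak normalization to [-1, 1] and a uniform table; it also assumes that the scale factor is carried alongside the indices so that the decoder can undo the normalization. The names `quantize` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize(values, table):
    """Normalize values, then return the index of the nearest table entry for each one."""
    values = np.asarray(values, dtype=float)
    peak = float(np.max(np.abs(values)))
    scale = peak if peak > 0 else 1.0            # ASSUMPTION: peak normalization
    normalized = values / scale                  # now within [-1, 1]
    # Nearest-neighbour search over the table for every element
    indices = np.abs(normalized[..., np.newaxis] - table).argmin(axis=-1)
    return indices, scale

def dequantize(indices, scale, table):
    """Look the indices up in the same table and undo the normalization."""
    return table[indices] * scale

table = np.linspace(-1.0, 1.0, 5)                # ASSUMPTION: uniform 5-entry table
indices, scale = quantize([-2.0, 0.0, 1.0, 2.0], table)
print(indices.tolist(), scale)                   # [0, 2, 3, 4] 2.0
```

The broadcasted search builds an array of size (number of elements) x (table size), which is fine for illustration but would be replaced by a sorted-table search for large matrices.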

Step A6: coding the final downmix signal in the step A3 to obtain a downmix signal code stream;

in this embodiment, the final downmix signal is a basis for reconstructing the object signal at the decoding end, and is encoded by using AAC128 k.

It should be noted that the AAC128k coding of the final downmix signal is only to illustrate the specific implementation steps of the present invention and is not used to limit the present invention.

Step A7: and synthesizing the code streams obtained in the step A5 and the step A6 into an output code stream, and transmitting the output code stream to a decoding end.
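One possible way to merge the two code streams with identifying flags, and to reverse the merge at the decoder (step B1), is sketched below. The container layout is entirely an assumption made for illustration: a one-byte flag followed by a four-byte big-endian length in front of each payload. The patent only states that flag bits are added for identification and parsing.

```python
import struct

HEADER = struct.Struct(">BI")                 # 1-byte flag + 4-byte big-endian payload length
FLAG_DOWNMIX, FLAG_SIDE_INFO = 0x01, 0x02     # hypothetical flag values

def pack_stream(downmix_bytes, side_info_bytes):
    """Concatenate both code streams, each preceded by a flag and a length."""
    out = bytearray()
    for flag, payload in ((FLAG_DOWNMIX, downmix_bytes), (FLAG_SIDE_INFO, side_info_bytes)):
        out += HEADER.pack(flag, len(payload)) + payload
    return bytes(out)

def parse_stream(stream):
    """Reverse pack_stream: walk the headers and split the payloads by flag."""
    parts, offset = {}, 0
    while offset < len(stream):
        flag, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        parts[flag] = stream[offset: offset + length]
        offset += length
    return parts[FLAG_DOWNMIX], parts[FLAG_SIDE_INFO]
```

Length-prefixed sections let the decoder skip to the side information without decoding the AAC payload first, which is one reason to prefer them over a single delimiter.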

Referring to fig. 2, the invention also provides a multi-audio-object decoding method suitable for a personalized interactive system. This embodiment takes four input objects A, B, C and D as an example, and comprises the following steps:

step B1: analyzing the received code stream to obtain a side information code stream and a final downmix signal code stream;

in this embodiment, parsing the code stream means reversing the procedure used to synthesize the output code stream, so as to obtain the final downmix signal code stream and the side information code stream.

Step B2: carrying out AAC decoding on the down-mixed signal code stream to obtain a down-mixed signal;

in this embodiment, the final downmix signal code stream is a data stream obtained after AAC encoding and compressing, and the final downmix signal before transmission can be obtained after AAC decoding.

Step B3: the side information code stream is dequantized to obtain a left singular matrix, a right singular matrix, singular values and object gain parameters;

in this embodiment, the side information was normalized when it was quantized, so it is denormalized when it is dequantized.

Step B4: performing matrix synthesis on the left singular matrix, the right singular matrix and the singular value to recover an object residual error;

in this embodiment, the matrix synthesis is to multiply the left singular matrix, the singular value matrix, and the right singular matrix to obtain an approximate object residual, which is specifically shown in the formula:

Figure BDA0002232447780000071

Figure BDA0002232447780000072

Step B5: decoding in the reverse of the coding order, and reconstructing the audio object frequency domain signals from the transmitted downmix signal in a loop by using the side information;

separating the object from the corresponding downmix signal by using the object gain parameter, and calculating with the residual signal to compensate for aliasing distortion to obtain a reconstructed audio object frequency domain signal, as shown in the following formula:

Figure BDA0002232447780000073

Figure BDA0002232447780000074

Figure BDA0002232447780000075

wherein S'_i is a reconstructed frequency domain object signal, X'_i is a reconstructed step-by-step downmix signal, G_d(i) is the gain parameter of the downmix signal corresponding to each step, and Figure BDA0002232447780000076 is the residual information obtained by matrix synthesis at the decoding end, that is, the result of step B4. The decoding order of the objects is the reverse of the coding order, and each object is reconstructed from the step-by-step downmix signal in its corresponding decoding step.

In this example, following the decoding order determined in step B5, the multi-step progressive reconstruction of the objects according to equations (8), (9) and (10) above proceeds as follows. In the first step, object C (that is, S'_4) is reconstructed from the final downmix signal X_3 using the gain parameter G_o(3) and its residual (Figure BDA0002232447780000081), and the step-by-step downmix signal X'_2 is reconstructed from the final downmix signal X_3 using the gain parameter G_d(3). In the second step, object A (that is, S'_3) is reconstructed from the step-by-step downmix signal X'_2 using the gain parameter G_o(2) and its residual (Figure BDA0002232447780000082), and the step-by-step downmix signal X'_1 is reconstructed from X'_2 using the gain parameter G_d(2). In the third step, object B (that is, S'_2) is reconstructed from the step-by-step downmix signal X'_1 using the gain parameter G_o(1) and its residual (Figure BDA0002232447780000083), and the reconstructed object B is subtracted from X'_1 to obtain the reconstructed object D (that is, S'_1). In this way, the objects are restored one after another from the corresponding step-by-step downmix signals through three decoding steps, and the reconstructed signals are compensated with the residual information to reduce the loss of sound quality caused by aliasing distortion.

It should be noted that the four objects A, B, C and D and the number of decoding steps are only used to illustrate the implementation steps of the present invention and are not intended to limit it.
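The reverse-order reconstruction described above can be sketched as a loop. The reconstruction formulas are images that are not reproduced in this text, so the two update rules marked ASSUMPTION (object = gain times downmix plus residual, previous downmix = downmix gain times current downmix) are illustrative readings of the prose, not the patent's equations. The function name `progressive_reconstruct` and the dictionary keys are hypothetical.

```python
import numpy as np

def progressive_reconstruct(final_downmix, side_info):
    """Rebuild all objects from the final downmix.

    side_info[i] holds "g_o", "g_d" and "residual" for coding step i + 1.
    Returns the reconstructed objects in coding order.
    """
    x = np.asarray(final_downmix, dtype=float)
    objects = []
    # Walk the coding steps backwards, stopping before the first step
    for step in reversed(side_info[1:]):
        # ASSUMPTION: object = object gain * downmix + decoded residual
        objects.append(step["g_o"] * x + step["residual"])
        # ASSUMPTION: previous downmix is estimated by scaling with the downmix gain
        x = step["g_d"] * x
    first = side_info[0]
    second_obj = first["g_o"] * x + first["residual"]
    first_obj = x - second_obj          # last step: subtract, as described in the text
    objects += [second_obj, first_obj]
    return objects[::-1]
```

In this sketch the object coded last is recovered exactly when its residual is lossless, while earlier objects depend on how well the scaled downmix approximates the intermediate downmix, so their quality rests on the residual compensation.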

Step B6: and converting the audio object signal in the frequency domain into the time domain by using time-frequency inverse transformation.

In this embodiment, the step-by-step reconstructed object signals are still frequency domain signals, and they are converted to the time domain by the inverse time-frequency transform so that subsequent functions such as rendering, personalized interaction and playback can be performed. The inverse transform in the decoding method therefore applies windowing and the inverse modified discrete cosine transform (IMDCT) to the object frequency domain signals to obtain continuous time domain signals.
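A round trip through the MDCT and the windowed IMDCT with overlap-add can be sketched as follows. To make the time-domain aliasing cancel exactly, this sketch swaps in a sine window (the square root of a half-sample-shifted Hanning window) on both the analysis and synthesis sides; that choice, the 2/M scaling and all function names are assumptions of the sketch, not details given in the patent.

```python
import numpy as np

def mdct_basis(M):
    """Cosine basis of shape (M, 2M) shared by the forward and inverse transforms."""
    n = np.arange(2 * M)
    k = np.arange(M)
    return np.cos(np.pi / M * np.outer(k + 0.5, n + 0.5 + M / 2))

def sine_window(M):
    # ASSUMPTION: sine window, which satisfies w[n]^2 + w[n + M]^2 = 1
    return np.sin(np.pi * (np.arange(2 * M) + 0.5) / (2 * M))

def mdct(x, M):
    """Forward MDCT of 2M-sample blocks that advance by M samples."""
    basis, w = mdct_basis(M), sine_window(M)
    n_frames = len(x) // M - 1
    blocks = np.stack([x[t * M: t * M + 2 * M] * w for t in range(n_frames)], axis=1)
    return basis @ blocks                       # shape (M, n_frames)

def imdct_overlap_add(spec, M):
    """Inverse MDCT, synthesis windowing and overlap-add back to the time domain."""
    basis, w = mdct_basis(M), sine_window(M)
    n_frames = spec.shape[1]
    blocks = (2.0 / M) * (basis.T @ spec) * w[:, np.newaxis]   # shape (2M, n_frames)
    out = np.zeros((n_frames + 1) * M)
    for t in range(n_frames):
        out[t * M: t * M + 2 * M] += blocks[:, t]   # neighbouring blocks overlap by M samples
    return out
```

The first and last M samples are covered by only one block, so they are not reconstructed; every sample in between is recovered because the aliasing terms of neighbouring blocks cancel.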

Compared with the existing audio object coding method, the method has the advantages and characteristics that:

Multi-step progressive coding and decoding is used, and residuals compensate the decoding distortion as far as possible, so that every audio object has good listening quality; at the same time, singular value decomposition is introduced to decompose and compress the residual information, which reduces the bit rate. The invention can therefore decode high-quality audio objects at medium and low bit rates, meeting the needs of personalized interactive audio systems.
