Audio processing method, medium, apparatus, and computing device

Document No.: 1925730    Publication date: 2021-12-03

Note: this technology, "Audio processing method, medium, apparatus, and computing device", was created by 赵翔宇, 曹偲 and 刘华平 on 2021-09-03. Its main content is as follows: Embodiments of the present disclosure provide an audio processing method, medium, apparatus, and computing device. The audio processing method comprises: acquiring audio data of audio to be processed; determining at least one sound source audio signal in the audio data, the sound source audio signal being a split-track audio signal that is present in the sound field for a corresponding duration; acquiring, according to a preset spatial position placement rule, a spatial parameter corresponding to the sound source audio signal, the spatial parameter being the relative position of the sound source audio signal when it sounds; and rendering the audio to be processed according to the spatial parameter to obtain a target audio corresponding to the audio to be processed. By rendering the audio to be processed with the spatial parameters, the present disclosure obtains a target audio with a better immersive effect, yields highly separated sound source audio signals without requiring split-track files, and can convert existing non-immersive audio data into immersive target audio, bringing users a better experience.

1. An audio processing method, comprising:

acquiring audio data of audio to be processed;

determining at least one sound source audio signal in the audio data, wherein the sound source audio signal is a split-track audio signal that is present in the sound field for a corresponding duration;

acquiring a spatial parameter corresponding to the sound source audio signal according to a preset spatial position placement rule, wherein the spatial parameter is the relative position of the sound source audio signal when it sounds;

and rendering the audio to be processed according to the spatial parameters to obtain a target audio corresponding to the audio to be processed.

2. The audio processing method of claim 1, wherein the determining at least one sound source audio signal in the audio data comprises:

performing Fourier transform on the audio data to obtain a frequency spectrum corresponding to the audio data;

inputting the frequency spectrum into a sound source audio signal separation model to obtain the frequency spectrum parameters of the split-track audio signals corresponding to the audio data, wherein the sound source audio signal separation model is used to separate out the frequency spectrum parameters of the split-track audio signal corresponding to one sound source audio signal;

and carrying out inverse Fourier transform on the frequency spectrum parameters of the split-track audio signals corresponding to the audio data to obtain the sound source audio signals.

3. The audio processing method according to claim 1, wherein the acquiring the spatial parameter corresponding to the sound source audio signal according to a preset spatial position placement rule comprises:

determining a corresponding spatial position placement template according to the sound source audio signal;

wherein the spatial position placement template comprises: spatial parameters corresponding to the at least one sound source audio signal, predefined according to different music styles;

and determining, according to the spatial position placement template, the spatial parameters corresponding to the sound source audio signal.

4. The audio processing method according to claim 3, wherein the rendering the audio to be processed according to the spatial parameter to obtain a target audio corresponding to the audio to be processed comprises:

determining a target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal;

and obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal.

5. The audio processing method of claim 4, wherein the at least one sound source audio signal comprises at least one audio object, and the spatial position placement template further comprises: a head-related transfer function corresponding to the audio object, the head-related transfer function being used to describe the transmission of sound waves from a sound source to the two ears and being preset according to the spatial position coordinates of the audio object; and wherein the determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal comprises the following steps:

determining a head-related transfer function corresponding to the audio object based on the spatial position placement template;

performing convolution processing on the audio object and the corresponding head-related transfer function to obtain a first spatial audio signal;

and determining a target spatial audio signal corresponding to the audio to be processed according to the first spatial audio signal.

6. The audio processing method according to claim 5, wherein the at least one sound source audio signal further comprises a sound bed signal, the sound bed signal being an audio signal in the audio to be processed other than the audio object, and the spatial position placement template further comprises: a spatial response function corresponding to the sound bed signal, the spatial response function being used to describe the attenuation of sound waves in the environment and being preset according to the spatial parameters of the sound bed signal; and wherein the determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal further comprises the following steps:

determining a spatial response function corresponding to the sound bed signal based on the spatial position placement template;

convolving the sound bed signal with the spatial response function to obtain a second spatial audio signal;

and obtaining a target spatial audio signal according to the first spatial audio signal and the second spatial audio signal.

7. The audio processing method according to any of claims 4 to 6, wherein the obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal comprises:

and compensating the target spatial audio signal according to the compensation response of a preset playback device to obtain the target audio corresponding to the audio to be processed.

8. An audio processing apparatus comprising:

the first acquisition module is used for acquiring audio data of audio to be processed;

a determining module, configured to determine at least one sound source audio signal in the audio data, where the sound source audio signal is an audio element with a corresponding duration in a sound field;

the second acquisition module is used for acquiring a spatial parameter corresponding to the sound source audio signal according to a preset spatial position placing rule, wherein the spatial parameter is a relative position of the sound source audio signal when the sound source audio signal sounds;

and the rendering module is used for rendering the audio to be processed according to the spatial parameters to obtain a target audio corresponding to the audio to be processed.

9. A computer readable storage medium having computer program instructions stored therein which, when executed, implement the method of any one of claims 1 to 7.

10. A computing device, comprising: a memory and a processor;

the memory is configured to store program instructions;

the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1 to 7.

Technical Field

Embodiments of the present disclosure relate to the field of immersive audio technology, and more particularly, to an audio processing method, medium, apparatus, and computing device.

Background

This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

Immersive audio is the effect obtained by rendering audio channels or sound source audio signals so that sound surrounds the listener from above, below, left, and right according to positions in 3-dimensional space; the use of panoramic sound in movie scenes is one example.

For music scenes, left/right 2-channel audio is typically extended to multi-channel audio for an immersive audio experience. Specifically, using the correlation and loudness difference between channels, the component common to both channels is placed in the center channel, and the inter-channel differences are placed, with loudness attenuation, in the surround and height channels. However, the inter-channel differences are typically very small, so most of the audio still emanates from the center channel, and this approach therefore cannot achieve a convincing immersive audio effect.
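
As a minimal illustration of this upmixing idea (the prior approach, not the method of the present disclosure), the common and difference components of a stereo pair can be derived as follows; the gain value and function names are illustrative assumptions:

```python
import numpy as np

def naive_upmix(left: np.ndarray, right: np.ndarray, surround_gain: float = 0.7):
    """Sketch of correlation-based upmixing: the component shared by both
    channels goes to the center; the (usually tiny) inter-channel difference
    goes, attenuated, to the surround channels."""
    center = 0.5 * (left + right)        # common component
    diff = 0.5 * (left - right)          # inter-channel difference
    return {
        "center": center,
        "surround_left": surround_gain * diff,
        "surround_right": -surround_gain * diff,
    }
```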

Disclosure of Invention

In this context, embodiments of the present disclosure are intended to provide an audio processing method, medium, apparatus, and computing device to solve the problem that the prior art cannot achieve a good immersive audio effect.

In a first aspect of embodiments of the present disclosure, there is provided an audio processing method, comprising: acquiring audio data of audio to be processed; determining at least one sound source audio signal in the audio data, wherein the sound source audio signal is a split-track audio signal that is present in the sound field for a corresponding duration; acquiring a spatial parameter corresponding to the sound source audio signal according to a preset spatial position placement rule, wherein the spatial parameter is the relative position of the sound source audio signal when it sounds; and rendering the audio to be processed according to the spatial parameter to obtain a target audio corresponding to the audio to be processed.

In one embodiment of the present disclosure, determining at least one sound source audio signal in the audio data comprises: carrying out Fourier transform on the audio data to obtain a frequency spectrum corresponding to the audio data; inputting the frequency spectrum into a sound source audio signal separation model to obtain the frequency spectrum parameters of the split-track audio signals corresponding to the audio data, wherein the sound source audio signal separation model is used for separating and obtaining the frequency spectrum parameters of the split-track audio signals corresponding to the sound source audio signals; and carrying out inverse Fourier transform on the frequency spectrum parameters of the split-track audio signals corresponding to the audio data to obtain the sound source audio signals.

In another embodiment of the present disclosure, obtaining the spatial parameter corresponding to the sound source audio signal according to the preset spatial position placement rule includes: determining a corresponding spatial position placement template according to the sound source audio signal, wherein the spatial position placement template comprises spatial parameters corresponding to at least one sound source audio signal, predefined according to different music styles; and determining, according to the spatial position placement template, the spatial parameters corresponding to the sound source audio signal.

In another embodiment of the present disclosure, rendering the audio to be processed according to the spatial parameters to obtain a target audio corresponding to the audio to be processed includes: determining a target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal; and obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal.

In yet another embodiment of the present disclosure, the at least one sound source audio signal comprises at least one audio object, and the spatial position placement template further comprises: a head-related transfer function corresponding to the audio object. The spatial parameters corresponding to the sound source audio signal include the spatial position coordinates corresponding to the audio object; the head-related transfer function is used to describe the transmission of sound waves from a sound source to the two ears and is preset according to the spatial position coordinates of the audio object. Determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal then includes: determining the head-related transfer function corresponding to the audio object based on the spatial position placement template; convolving the audio object with the corresponding head-related transfer function to obtain a first spatial audio signal; and determining the target spatial audio signal corresponding to the audio to be processed according to the first spatial audio signal.

In yet another embodiment of the present disclosure, the at least one sound source audio signal further includes a sound bed signal, the sound bed signal being an audio signal in the audio to be processed other than the audio objects, and the spatial position placement template further includes: a spatial response function corresponding to the sound bed signal. The spatial response function is used to describe the attenuation of sound waves in the environment and is preset according to the spatial parameters of the sound bed signal. Determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal then further includes: determining the spatial response function corresponding to the sound bed signal based on the spatial position placement template; convolving the sound bed signal with the spatial response function to obtain a second spatial audio signal; and obtaining the target spatial audio signal according to the first spatial audio signal and the second spatial audio signal.

In another embodiment of the present disclosure, obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal includes: compensating the target spatial audio signal according to the compensation response of a preset playback device to obtain the target audio corresponding to the audio to be processed.

In still another embodiment of the present disclosure, the preset playback device includes binaural headphones, and compensating the target spatial audio signal according to the compensation response of the preset playback device to obtain the target audio corresponding to the audio to be processed includes: obtaining a transfer function and a regularization factor corresponding to the binaural headphones; determining a corresponding compensation function according to the transfer function and the regularization factor; and convolving the compensation function with the target spatial audio signal to obtain the target audio.

In another embodiment of the present disclosure, before performing the Fourier transform on the audio data to obtain the frequency spectrum corresponding to the audio data, the method further includes: acquiring a training sample, wherein the training sample comprises an audio data sample and a sound source audio signal corresponding to the audio data sample; inputting the audio data sample into a sound source audio signal separation model to obtain a training output sound source audio signal corresponding to the audio data sample; and adjusting parameters of the sound source audio signal separation model according to the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample, to obtain the trained sound source audio signal separation model.

In another embodiment of the present disclosure, the sound source audio signal separation model includes K layers of convolutional networks, each layer including an encoder and a decoder, and inputting an audio data sample into the sound source audio signal separation model to obtain a training output sound source audio signal corresponding to the audio data sample includes: down-sampling the audio data sample with the K encoders to obtain an intermediate feature image, where the output of the i-th encoder is the input of the (i+1)-th encoder and i takes the values 1, 2, …, K-1 in turn; and up-sampling the intermediate feature image with the K decoders to obtain the training output sound source audio signal corresponding to the audio data sample, where the output of the j-th decoder is the input of the (j-1)-th decoder and j takes the values K, K-1, …, 1 in turn.

In another embodiment of the present disclosure, up-sampling the intermediate feature image with the K decoders to obtain the training output sound source audio signal corresponding to the audio data sample includes: acquiring a first output of the encoder and a second output of the decoder in the k-th layer convolutional network; determining a correlation result according to the first output and the second output; and inputting the correlation result into the decoder of the (k-1)-th convolutional network layer, which performs up-sampling to obtain the training output sound source audio signal corresponding to the audio data sample, where k takes the values K, K-1, …, 2 in turn.

In yet another embodiment of the disclosure, determining the correlation result according to the first output and the second output includes: determining a similarity factor for the first output and the second output; and determining a correlation result according to the similarity factor and the second output.

In yet another embodiment of the present disclosure, determining the similarity factor of the first output and the second output includes: determining the sum of the first output and the second output as a third output; and performing convolution processing on the third output to obtain the similarity factor.
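
A minimal sketch of such a similarity-gated connection is given below; the 1x1 kernel, the sigmoid, and the elementwise product are assumptions, since the disclosure states only that the summed outputs are convolved to obtain a similarity factor that is then combined with the second output:

```python
import torch
import torch.nn as nn

class SimilarityGate(nn.Module):
    """Combines an encoder output (first output) with a decoder output
    (second output) into the correlation result fed to the next decoder.
    Assumes the two tensors have matching shapes."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # assumed 1x1 conv
        self.act = nn.Sigmoid()                                   # assumed squashing

    def forward(self, first_out: torch.Tensor, second_out: torch.Tensor) -> torch.Tensor:
        third_out = first_out + second_out             # the sum is the "third output"
        similarity = self.act(self.conv(third_out))    # convolution -> similarity factor
        return similarity * second_out                 # correlation result (assumed product)
```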

In another embodiment of the present disclosure, inputting an audio data sample into a sound source audio signal separation model to obtain a training output sound source audio signal corresponding to the audio data sample, includes: carrying out Fourier transform on the audio data sample to obtain a frequency spectrum corresponding to the audio data sample; inputting the frequency spectrum corresponding to the audio data sample into a sound source audio signal separation model to obtain the frequency spectrum of a training output sound source audio signal corresponding to the audio data sample; and carrying out inverse Fourier transform on the frequency spectrum of the audio signal of the training output sound source to obtain the audio signal of the training output sound source.

In another embodiment of the present disclosure, adjusting the parameters of the sound source audio signal separation model according to the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample, to obtain the trained sound source audio signal separation model, includes: determining a loss function corresponding to the sound source audio signal separation model according to the audio data sample, the training output sound source audio signal, and the sound source audio signal corresponding to the audio data sample; and adjusting the parameters of the sound source audio signal separation model according to the loss function to obtain the trained sound source audio signal separation model.
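
A compact sketch of one such parameter update follows; the L1 spectral loss is an assumption, as the disclosure leaves the form of the loss function open:

```python
import torch.nn.functional as F

def train_step(model, optimizer, mixture_spec, target_spec):
    """One update of a separation model on a (mixture spectrum, reference source
    spectrum) pair; the model is assumed to output the masked, separated spectrum."""
    optimizer.zero_grad()
    estimate = model(mixture_spec)           # training output sound source spectrum
    loss = F.l1_loss(estimate, target_spec)  # assumed loss; the text leaves it open
    loss.backward()
    optimizer.step()
    return loss.item()
```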

In a second aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored therein computer program instructions which, when executed, implement a method as in any one of the above.

In a third aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising: the first acquisition module is used for acquiring audio data of audio to be processed;

the determining module is used for determining at least one sound source audio signal in the audio data, wherein the sound source audio signal is an audio element with corresponding duration existing in a sound field;

the second acquisition module is used for acquiring a spatial parameter corresponding to the sound source audio signal according to a preset spatial position placing rule, wherein the spatial parameter is a relative position of the sound source audio signal when the sound source audio signal sounds;

and the rendering module is used for rendering the audio to be processed according to the spatial parameters to obtain a target audio corresponding to the audio to be processed.

In yet another embodiment of the disclosure, the determining module includes:

the first transformation unit is used for carrying out Fourier transformation on the audio data to obtain a frequency spectrum corresponding to the audio data;

the system comprises an input unit, a frequency spectrum acquisition unit and a frequency spectrum analysis unit, wherein the input unit is used for inputting the frequency spectrum into a sound source audio signal separation model to obtain the frequency spectrum parameters of the split-track audio signal corresponding to the audio data, and the sound source audio signal separation model is used for separating and acquiring the frequency spectrum parameters of the split-track audio signal corresponding to a sound source audio signal;

and the second transformation unit is used for performing inverse Fourier transformation on the frequency spectrum parameters of the split-track audio signals corresponding to the audio data to obtain the sound source audio signals.

In yet another embodiment of the present disclosure, the second obtaining module includes:

the first determining unit is used for determining a corresponding spatial position placement template according to the sound source audio signal, wherein the spatial position placement template comprises: spatial parameters corresponding to at least one sound source audio signal, predefined according to different music styles;

and the second determining unit is used for determining, according to the spatial position placement template, the spatial parameters corresponding to the sound source audio signal.

In yet another embodiment of the disclosure, a rendering module includes:

the third determining unit is used for determining a target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal;

and the fourth determining unit is used for obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal.

In yet another embodiment of the present disclosure, the at least one sound source audio signal comprises at least one audio object, the spatial position placement template further comprises: the head-related transfer function corresponding to the audio object, the spatial parameter corresponding to the audio signal of the sound source including the spatial position coordinate corresponding to the audio object, the head-related transfer function being used to describe the transmission process of the sound wave from the sound source to the ears, the head-related transfer function being preset according to the spatial position coordinate of the audio object, the third determining unit includes:

a first determining subunit, configured to determine, based on the spatial position template, a head-related transfer function corresponding to the audio object;

the first convolution subunit is used for performing convolution processing on the audio object and the corresponding head-related transfer function to obtain a first spatial audio signal;

and the second determining subunit is used for determining a target spatial audio signal corresponding to the audio to be processed according to the first spatial audio signal.

In yet another embodiment of the present disclosure, the at least one sound source audio signal further includes a sound bed signal, the sound bed signal is an audio signal other than an audio object in the audio to be processed, and the spatial position placement template further includes: the sound bed signal corresponds to a spatial response function, the spatial response function is used for describing the attenuation of sound waves in the environment, the spatial response function is preset according to the spatial parameters of the sound bed signal, and the third determining unit further comprises:

the third determining subunit is used for determining a spatial response function corresponding to the sound bed signal based on the spatial position template;

the second convolution subunit is used for convolving the sound bed signal and the spatial response function to obtain a second spatial audio signal;

and the fourth determining subunit is used for obtaining the target spatial audio signal according to the first spatial audio signal and the second spatial audio signal.

In still another embodiment of the present disclosure, the fourth determining unit includes:

and the compensation subunit is used for compensating the target spatial audio signal according to the compensation response of the preset playing device to obtain a target audio corresponding to the audio to be processed.

In still another embodiment of the present disclosure, the preset playback device includes binaural headphones, and the compensation subunit is specifically configured to:

obtaining a transfer function and a regularization factor corresponding to the binaural headphones;

determining a corresponding compensation function according to the transfer function and the regularization factor;

and convolving the compensation function with the target spatial audio signal to obtain the target audio.

In still another embodiment of the present disclosure, further comprising:

the third acquisition module is used for acquiring a training sample, wherein the training sample comprises an audio data sample and a sound source audio signal corresponding to the audio data sample;

the input module is used for inputting the audio data sample into the sound source audio signal separation model to obtain a training output sound source audio signal corresponding to the audio data sample;

and the adjusting module is used for adjusting the parameters of the sound source audio signal separation model according to the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample, to obtain the trained sound source audio signal separation model.

In still another embodiment of the present disclosure, the sound source audio signal separation model includes K layers of convolutional networks, each layer of convolutional network including an encoder and a decoder, an input module including:

the first sampling unit is used for down-sampling the audio data samples with the K encoders to obtain an intermediate feature image, where the output of the i-th encoder is the input of the (i+1)-th encoder and i takes the values 1, 2, …, K-1 in turn;

and the second sampling unit is used for up-sampling the intermediate feature image with the K decoders to obtain the training output sound source audio signal corresponding to the audio data sample, where the output of the j-th decoder is the input of the (j-1)-th decoder and j takes the values K, K-1, …, 1 in turn.

In another embodiment of the present disclosure, the second sampling unit is specifically configured to:

acquiring a first output of the encoder and a second output of the decoder in the k-th layer convolutional network;

determining a correlation result according to the first output and the second output;

and inputting the correlation result into the decoder of the (k-1)-th convolutional network layer, which performs up-sampling to obtain the training output sound source audio signal corresponding to the audio data sample, where k takes the values K, K-1, …, 2 in turn.

In a further embodiment of the present disclosure, when determining the correlation result according to the first output and the second output, the second sampling unit is specifically configured to:

determining a similarity factor for the first output and the second output;

and determining a correlation result according to the similarity factor and the second output.

In a further embodiment of the disclosure, the second sampling unit, when determining the similarity factor of the first output and the second output, is specifically configured to:

determining a sum of the first output and the second output as a third output;

and performing convolution processing on the third output to obtain the similarity factor.

In yet another embodiment of the present disclosure, the input module is specifically configured to:

carrying out Fourier transform on the audio data sample to obtain a frequency spectrum corresponding to the audio data sample; inputting the frequency spectrum corresponding to the audio data sample into a sound source audio signal separation model to obtain the frequency spectrum of a training output sound source audio signal corresponding to the audio data sample;

and performing an inverse Fourier transform on the spectrum of the training output sound source audio signal to obtain the training output sound source audio signal.

In another embodiment of the present disclosure, the adjusting module is specifically configured to:

determining a loss function corresponding to a sound source audio signal separation model according to the audio data sample, the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample;

and adjusting parameters of the sound source audio signal separation model according to the loss function to obtain the trained sound source audio signal separation model.

In a fourth aspect of embodiments of the present disclosure, there is provided a computing device having stored thereon computer program instructions which, when executed, implement a method as in any one of the first aspect.

According to the embodiments of the present disclosure, at least one sound source audio signal in the audio data is determined; a spatial parameter corresponding to the sound source audio signal, namely the relative position of the sound source audio signal when it sounds, is obtained according to a preset spatial position placement rule; and the audio to be processed is rendered according to the spatial parameter to obtain a target audio corresponding to the audio to be processed. Rendering the audio to be processed with the spatial parameters yields a target audio with a better immersive effect; highly separated sound source audio signals are obtained without requiring split-track files; and existing non-immersive audio data can be converted into an immersive target audio, bringing users a better experience.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 schematically illustrates an application scenario diagram according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow chart of steps of an audio processing method according to an embodiment of the present disclosure;

fig. 3 schematically shows a structural diagram of a sound source audio signal separation model according to an embodiment of the present disclosure;

fig. 4 schematically shows a structural schematic diagram of spatial arrangement of audio signals of a sound source according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a flow chart of steps of a method of training a sound source audio signal separation model according to an embodiment of the present disclosure;

FIG. 6 schematically shows a structural diagram of a storage medium according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a block diagram of an audio processing device according to an embodiment of the present disclosure;

fig. 8 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to an embodiment of the disclosure, an audio processing method, medium, apparatus, and computing device are provided.

In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.

The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.

Summary of the Invention

The present disclosure observes that, among methods of processing audio data, one is stereo audio generation based on channel upmixing. Specifically, channels are upmixed according to differences in speaker placement: for example, 2-channel audio data is expanded to 5.1-channel audio data, with speakers placed at the left, center, right, rear-left, and rear-right positions in angular proportion. According to the inter-channel correlation and loudness difference in the 2-channel audio data, the component common to both channels is placed in the center channel, and the inter-channel differences are placed, with loudness attenuation, in the surround and height channels. However, the left and right channels of 2-channel audio data are highly similar and differ little. Expanding 2-channel audio data into 5.1-channel audio data in this way therefore produces insufficient directional difference, and the upmixed height and rear surround channels carry essentially the same content, so the user perceives no vertical sound difference and the immersive effect is unsatisfactory.

The other is a method of extracting stereo audio based on audio objects. Specifically, multi-channel audio data is first obtained, and audio objects are extracted according to the spectral similarity between the channels. Because the spectral difference between channels is small (for 2-channel audio in particular, the spectral envelopes and shapes of the left and right channels are essentially identical), this method yields only one audio object or several near-identical ones. Moreover, it cannot recover the directional information of the audio objects, so a good immersive audio effect cannot be achieved.

Based on the above problems, the present disclosure provides an audio processing method that determines at least one sound source audio signal in the audio data, determines the spatial parameter corresponding to each sound source audio signal, and renders the audio to be processed with those spatial parameters to obtain a target audio with a better immersive effect. Highly separated sound source audio signals are obtained without requiring split-track files, and existing non-immersive audio data can be converted into immersive target audio, bringing users a better experience.

Application scene overview

Referring to fig. 1, an application scenario diagram of the audio processing method provided by the present disclosure, the scenario includes a terminal 11 and a playback device 12. The terminal 11 can download the audio data of the audio to be processed, e.g. music, and then process the audio data to obtain a target audio, which can be played through the playback device 12. In fig. 1 the playback device is a pair of headphones; it may also be a sound box, a loudspeaker, etc., which is not limited here.

Exemplary method

In connection with the application scenario of fig. 1, a method for audio processing according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.

Referring to fig. 2, a flow chart of an audio processing method provided by the present disclosure is shown. The embodiment of the present disclosure provides an audio processing method, which specifically includes the following steps:

s201, audio data of the audio to be processed is obtained.

The audio data may be audio data corresponding to music, and may be 2-channel, 5.1-channel, or 7.1.2-channel audio data.

Specifically, acquiring the audio data of the audio to be processed includes: receiving a user's trigger operation on the audio to be processed; and, in response to the trigger operation, downloading the audio data of the audio to be processed. Illustratively, when a user of a music application wants to listen to the music "ABC", the user may trigger the identifier of "ABC", and the terminal downloads or caches the audio data corresponding to that identifier.

In another implementation of the present disclosure, acquiring the audio data of the audio to be processed includes: obtaining the audio data of the audio to be processed from an audio database. Illustratively, the audio database stores the audio data of multiple audios to be processed, and the present disclosure acquires them in turn for subsequent audio processing.

The present disclosure may also adopt other ways to obtain the audio data of the audio to be processed, which is not limited here.

S202, at least one sound source audio signal in the audio data is determined.

Wherein the sound source audio signal is a split-track audio signal that is present in the sound field for a corresponding duration.

In the present disclosure, the audio data is composed of multiple split-track audio signals, each of which corresponds to one sound source audio signal. Illustratively, when the audio data is music, the sound source audio signals correspond to the audio emitted by the lead vocal, the vocal accompaniment, the piano, the guitar, the string section, the drum kit, the bass, and the sound bed (corresponding to the environment). It is understood that one sound source audio signal comprises the audio signal emitted by one audio object. Typically, the audio data is composed of the audio signals of several sound sources.

Specifically, determining at least one sound source audio signal in the audio data specifically includes the following steps:

s2021, performing fourier transform on the audio data to obtain a frequency spectrum corresponding to the audio data.

Specifically, the audio data is framed by a preset duration to obtain audio data subframes, and a Fourier transform is then applied to each subframe. The frequency spectrum corresponding to the audio data thus comprises the spectra of a plurality of audio data subframes. Illustratively, the resulting first spectral image is 512 × 128, where 512 is the width of the first spectral image in pixels and 128 is its height.
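
A minimal sketch of this framing-and-transform step using SciPy's STFT; the sampling rate, frame length, and overlap are illustrative assumptions (nperseg=1022 yields the 512 frequency bins of the example image):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 44100
audio = np.random.randn(3 * fs)          # stand-in for the audio data

# Frame the signal and Fourier-transform each frame.
freqs, times, spec = stft(audio, fs=fs, nperseg=1022, noverlap=511)
print(spec.shape)                        # (512, n_frames): a spectral image

# The inverse transform (used later, in S2023) recombines frames in time order.
_, reconstructed = istft(spec, fs=fs, nperseg=1022, noverlap=511)
```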

S2022, inputting the frequency spectrum into the sound source audio signal separation model to obtain the frequency spectrum parameters of the split-track audio signal corresponding to the audio data.

The sound source audio signal separation model is used to separate out the frequency spectrum parameters of the split-track audio signal corresponding to one sound source audio signal.

In the present disclosure, each sound source audio signal corresponds to one sound source audio signal separation model, so that the frequency spectrum is input into different sound source audio signal separation models to obtain the frequency spectrum parameters of different split-track audio signals.

Further, the spectral parameter is a second spectral image of the split-track audio signal. Illustratively, the spectral parameter of the split-track audio signal corresponding to the audio data is a second spectral image of size 512 × 128.

Specifically, referring to fig. 3, which shows a schematic structural diagram of the sound source audio signal separation model provided by the present disclosure, the model comprises multiple convolutional network layers (K1 to K6), each layer comprising an encoder En and a decoder De. Each encoder En comprises a 2-dimensional convolution and a ReLU activation (linear rectification function). Each decoder De comprises one strided deconvolution (an inverse 2-dimensional convolution) and one ReLU activation. Each convolutional network layer also comprises an activation function sublayer; the output of each encoder or decoder passes through the activation function sublayer of the corresponding convolutional network layer.

The first spectral image (In) is input to the encoder En of the first convolutional network layer K1 to obtain a corresponding output, which is input to the encoder of the second layer K2, and so on until the output of the encoder En of the sixth layer K6 is obtained. The output of the sixth-layer encoder is input to the decoder De of the sixth layer K6 to obtain a corresponding output, which is input to the decoder of the fifth layer K5, and so on until the output (Out) of the decoder De of the first layer K1 is obtained.

Illustratively, with reference to fig. 3, P0 is the first spectral image (512 × 128 × 1) obtained by the Fourier transform of the audio data, where 1 indicates a single spectral image. P0 is input to the encoder En of the first layer K1, which outputs the spectral image P1 (256 × 64 × 16), where 16 indicates 16 spectral images. P1 is input to the encoder of the second layer K2, which outputs P2 (128 × 32 × 32). P2 is input to the encoder of the third layer K3, which outputs P3 (64 × 16 × 64). P3 is input to the encoder of the fourth layer K4, which outputs P4 (32 × 8 × 128). P4 is input to the encoder of the fifth layer K5, which outputs P5 (16 × 4 × 256). P5 is input to the encoder of the sixth layer K6, which outputs P6 (8 × 2 × 512). P6 is input to the decoder De of the sixth layer K6, which outputs P7 (16 × 4 × 256). P7 is input to the decoder of the fifth layer K5, which outputs P8 (32 × 8 × 128). P8 is input to the decoder of the fourth layer K4, which outputs P9 (64 × 16 × 64). P9 is input to the decoder of the third layer K3, which outputs P10 (128 × 32 × 32). P10 is input to the decoder of the second layer K2, which outputs P11 (256 × 64 × 16). P11 is input to the decoder of the first layer K1, which outputs P12 (512 × 128 × 1). P12 is the output Out of the sound source audio signal separation model, specifically a 2-dimensional masking factor f(P0, Θ) (512 × 128 × 1); multiplying the masking factor f(P0, Θ) by P0 yields the frequency spectrum parameters of the corresponding output sound source audio signal.
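
The following PyTorch sketch mirrors the six-layer encoder/decoder stack and tensor sizes above; the kernel size, stride, and the sigmoid bounding the masking factor are assumptions, and the similarity-factor skip connections described in the training embodiments are omitted:

```python
import torch
import torch.nn as nn

class SeparationModel(nn.Module):
    """One sound source's separation model: 512x128x1 -> ... -> 8x2x512 -> ... -> 512x128x1."""
    CH = [1, 16, 32, 64, 128, 256, 512]

    def __init__(self):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(self.CH[i], self.CH[i + 1], 3, stride=2, padding=1),
                nn.ReLU())
            for i in range(6))
        self.decoders = nn.ModuleList(
            nn.Sequential(
                nn.ConvTranspose2d(self.CH[i + 1], self.CH[i], 3, stride=2,
                                   padding=1, output_padding=1),
                nn.ReLU() if i > 0 else nn.Sigmoid())   # last layer emits the mask
            for i in reversed(range(6)))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:  # spec: (B, 1, 512, 128)
        x = spec
        for enc in self.encoders:        # P1 ... P6
            x = enc(x)
        for dec in self.decoders:        # P7 ... P12 = masking factor f(P0, Theta)
            x = dec(x)
        return x * spec                  # masked spectrum of one sound source
```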

In the present disclosure, by inputting audio data into different sound source audio signal separation models, spectral parameters corresponding to respective sound source audio signals in the audio data can be accurately obtained.

And S2023, performing inverse Fourier transform on the frequency spectrum parameters of the split-track audio signals corresponding to the audio data to obtain sound source audio signals.

Specifically, an inverse Fourier transform is performed on the frequency spectrum parameters of the split-track audio signal corresponding to each audio data subframe to obtain a plurality of corresponding sub-sound-source audio signals, and the sub-sound-source audio signals are combined in time order to obtain the sound source audio signal.

In the embodiment of the present disclosure, each sound source audio signal in the audio data can be obtained by using the sound source audio signal separation model.

And S203, acquiring the spatial parameters corresponding to the sound source audio signals according to the preset spatial position placement rule.

Wherein the spatial parameter is the relative position of the sound source audio signal when it sounds. In the present disclosure, sound source audio signals have different spatial parameters according to their different sounding objects; that is, each sound source audio signal has one corresponding spatial parameter.

In addition, the spatial parameters may be expressed as coordinates, with the origin of the coordinates simulating the position of the user. A spatial parameter may thus represent the position of a sound source audio signal relative to the user when it sounds.

Specifically, obtaining the spatial parameters corresponding to the sound source audio signals according to the preset spatial position placement rule comprises the following steps:

s2031, according to the sound source audio signal, determining a corresponding spatial position placing template.

Wherein the spatial position placement template comprises: spatial parameters corresponding to at least one sound source audio signal, predefined according to different music styles.

Referring to Table 1, a spatial position placement template is shown. In Table 1, the sound source audio signals include audio objects, one sound source audio signal corresponding to one audio object. An audio object is the audio signal corresponding to a sounding object, including but not limited to a human voice or a musical instrument, for example the audio signal corresponding to the lead vocal, the guitar, or the bass.

Each audio object has a corresponding spatial parameter, expressed in XYZ coordinates, and a corresponding head-related transfer function.

Table 1

  Audio object          Spatial parameter (X, Y, Z)   Head-related transfer function
  Lead vocal            (1, 0, 0)                     HRTF(1)
  Vocal accompaniment   (0.4, 0.4, 0.4)               HRTF(2)
  Piano                 (2.5, 2.5, 2.5)               HRTF(3)
  Guitar                (1.5, 1.5, 0)                 HRTF(4)
  Strings               (1.5, -1.5, 0)                HRTF(5)
  Drum kit              (-1, 0, 0)                    HRTF(6)
  Bass                  (-1, 0, 0)                    HRTF(7)

Referring to Table 2, another spatial position placement template provided by the present disclosure is shown. Here the sound source audio signals also include a sound bed signal, i.e. an audio signal other than the audio objects, for example ambient sound or background sound. The sound bed signal has a corresponding spatial reverberation function; processing the sound bed signal through this function improves the overall sense of surround.

Table 2

  Sound source audio signal       Rendering function
  Audio objects (see Table 1)     Head-related transfer functions HRTF(1) to HRTF(7)
  Sound bed signal O_other        Spatial response function H(z)

In addition, the spatial position placement templates shown in Table 1 and Table 2 can be obtained experimentally; that is, referring to fig. 4, with the user at the origin of the coordinates, a loudspeaker is placed at the position given by each spatial parameter and plays the corresponding sound source audio signal, and the placement that yields a good immersive effect is kept. Different spatial position placement templates can also be set for different style types of the audio data, which is not limited here.

Illustratively, in fig. 4 the spatial parameter of the user R is (0, 0, 0). The lead vocal 41, at (1, 0, 0), is placed directly in front of the user R, nearly level with the user. The vocal accompaniment 42, at (0.4, 0.4, 0.4), may be placed in front of, beside, or behind the user's ears. The piano 43, at (2.5, 2.5, 2.5), can be upmixed into 7.1.2 channels for a stronger surround effect. The guitar 44, at (1.5, 1.5, 0), is placed to the front side of the user R. The string section 45, at (1.5, -1.5, 0), may be placed to the front side of or above the user R. The drum kit 46 and the bass 47, at (-1, 0, 0), may be placed directly behind the user R. If an audio object corresponding to a lute exists, its spatial parameter can be the same as the guitar's.

In the present disclosure, different spatial position placement rules correspond to different music styles and different audio objects, and different spatial position placement templates are designed accordingly, so that each audio object is given an orientation suited to the music.

S2032, determining, according to the spatial position placement template, the spatial parameters corresponding to the sound source audio signal.

After a sound source audio signal is determined, its spatial parameter can be read from the spatial position placement template. For example, referring to Table 1, if the sound source audio signal is a human voice (the lead vocal), its spatial parameter is (1, 0, 0).

In the present disclosure, the spatial parameters of each sound source audio signal can be determined based on the spatial position placement template.
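
A minimal sketch of such a template as a lookup table, using the coordinates from fig. 4 and the HRTF numbering from the text; the entries marked as assumed, and the names and data layout, are illustrative:

```python
# Hypothetical encoding of the Table 1 placement template.
PLACEMENT_TEMPLATE = {
    "lead_vocal":    {"xyz": (1.0, 0.0, 0.0),   "hrtf_id": 1},
    "accompaniment": {"xyz": (0.4, 0.4, 0.4),   "hrtf_id": 2},
    "piano":         {"xyz": (2.5, 2.5, 2.5),   "hrtf_id": 3},   # assumed id
    "guitar":        {"xyz": (1.5, 1.5, 0.0),   "hrtf_id": 4},
    "strings":       {"xyz": (1.5, -1.5, 0.0),  "hrtf_id": 5},   # assumed id
    "drums":         {"xyz": (-1.0, 0.0, 0.0),  "hrtf_id": 6},
    "bass":          {"xyz": (-1.0, 0.0, 0.0),  "hrtf_id": 7},
}

def spatial_parameter(source_name: str) -> tuple:
    """S2032: look up the spatial parameter of a separated sound source."""
    return PLACEMENT_TEMPLATE[source_name]["xyz"]
```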

And S204, rendering the audio to be processed according to the spatial parameters to obtain a target audio corresponding to the audio to be processed.

Rendering the audio to be processed with the spatial parameters gives the audio data a sense of direction, so that the resulting target audio is immersive.

Specifically, rendering the audio to be processed according to the spatial parameters to obtain the target audio corresponding to the audio to be processed includes the following steps:

s2041, determining a target space audio signal corresponding to the audio to be processed according to the space parameters and the sound source audio signal.

In one alternative, the at least one sound source audio signal includes at least one audio object (one sound source audio signal corresponding to one audio object), and the spatial position placement template further includes: the head-related transfer function corresponding to each audio object. The spatial parameters corresponding to the sound source audio signal include the spatial position coordinates corresponding to the audio object; the head-related transfer function describes the transmission of sound waves from a sound source to the two ears and is preset according to the spatial position coordinates of the audio object. Determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal then includes: determining the head-related transfer function corresponding to the audio object based on the spatial position placement template; convolving the audio object with the corresponding head-related transfer function to obtain a first spatial audio signal; and determining the target spatial audio signal corresponding to the audio to be processed according to the first spatial audio signal.

Referring to Table 1, the spatial position placement template includes at least one audio object and further includes the head-related transfer function corresponding to each audio object. The head-related transfer function describes the transmission of sound waves from a sound source to the two ears and is preset according to the spatial position coordinates of the audio object.

Specifically, the Head-Related Transfer Function (HRTF) is an audio localization algorithm, and different audio objects correspond to different head-related transfer functions. Illustratively, in Table 1, the head-related transfer function HRTF(1) corresponds to the lead vocal, and HRTF(2) corresponds to the vocal accompaniment.

Specifically, each audio object is convolved with its corresponding head-related transfer function, and the convolution results are summed to obtain the first spatial audio signal. Illustratively, the audio data is separated into the lead vocal object O_vocal, the guitar object O_guitar, the drum object O_drum, and the bass object O_bass, whose head-related transfer functions are HRTF(1), HRTF(4), HRTF(6), and HRTF(7) respectively. The first spatial audio signal is then S1 = O_vocal * HRTF(1) + O_guitar * HRTF(4) + O_drum * HRTF(6) + O_bass * HRTF(7).
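
As a minimal sketch of this summation, the following treats each HRTF as a pair of left/right head-related impulse responses and renders a two-channel S1; the per-ear handling, equal-length impulse responses, and function names are assumptions, since the text writes one convolution per object without ear indices:

```python
import numpy as np
from scipy.signal import fftconvolve

def first_spatial_signal(objects: dict, hrirs: dict) -> np.ndarray:
    """S1 = sum over objects of O_k convolved with HRTF(k), rendered per ear.
    objects: name -> mono signal; hrirs: name -> (h_left, h_right),
    where left and right impulse responses are assumed equal in length."""
    length = max(len(sig) + len(hrirs[name][0]) - 1 for name, sig in objects.items())
    s1 = np.zeros((2, length))
    for name, sig in objects.items():
        h_left, h_right = hrirs[name]
        yl = fftconvolve(sig, h_left)    # left-ear contribution of this object
        yr = fftconvolve(sig, h_right)   # right-ear contribution of this object
        s1[0, :len(yl)] += yl
        s1[1, :len(yr)] += yr
    return s1
```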

In another alternative, the at least one sound source audio signal further includes a sound bed signal, i.e. an audio signal in the audio to be processed other than the audio objects, and the spatial position placement template further includes: a spatial response function corresponding to the sound bed signal. The spatial response function describes the attenuation of sound waves in the environment and is preset according to the spatial parameters of the sound bed signal. Determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal then further includes: determining the spatial response function corresponding to the sound bed signal based on the spatial position placement template; convolving the sound bed signal with the spatial response function to obtain a second spatial audio signal; and obtaining the target spatial audio signal from the first spatial audio signal and the second spatial audio signal.

Referring to Table 2, the spatial position placement template further comprises the sound bed signal O_other. The spatial response function corresponding to the sound bed signal is a comb filter, e.g. H(z) = 1 / (1 - g·z^(-m)), or an all-pass filter, e.g. H(z) = (-g + z^(-m)) / (1 - g·z^(-m)), where m is the number of delay samples, corresponding to a delay of 20 ms to 40 ms.

The second spatial audio signal is S2 = O_other * H(z). The corresponding target spatial audio signal S is the sum of the first spatial audio signal S1 and the second spatial audio signal S2, where the first spatial audio signal S1 is the sum of the convolutions of each audio object with its corresponding head related transfer function. Exemplarily, referring to table two, S1 = O_vocal * HRTF(1) + O_guitar * HRTF(4) + O_drum * HRTF(6) + O_bass * HRTF(7).
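The sketch below illustrates applying such a spatial response function to the sound bed signal. The comb and all-pass filter forms follow the standard Schroeder structures named above, and the delay of about 30 ms sits inside the stated 20-40 ms range; the feedback gain g and the random stand-in signal are assumptions for illustration.

```python
import numpy as np
from scipy.signal import lfilter

fs = 44100
m = int(0.03 * fs)   # delay of ~30 ms, within the stated 20-40 ms range
g = 0.7              # feedback gain (assumed; not specified in the text)

def comb(x, m, g):
    # H(z) = z^-m / (1 - g z^-m): feedback comb filter (assumed Schroeder form)
    b = np.zeros(m + 1); b[m] = 1.0
    a = np.zeros(m + 1); a[0] = 1.0; a[m] = -g
    return lfilter(b, a, x)

def allpass(x, m, g):
    # H(z) = (-g + z^-m) / (1 - g z^-m): all-pass filter (assumed Schroeder form)
    b = np.zeros(m + 1); b[0] = -g; b[m] = 1.0
    a = np.zeros(m + 1); a[0] = 1.0; a[m] = -g
    return lfilter(b, a, x)

o_other = np.random.randn(fs)   # stand-in for the sound bed signal O_other
s2 = allpass(o_other, m, g)     # second spatial audio signal S2 = O_other * H(z)
```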

S2042, obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal.

Obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal includes: compensating the target spatial audio signal according to the compensation response of a preset playing device to obtain the target audio corresponding to the audio to be processed.

Further, compensating the target spatial audio signal according to the compensation response of the preset playing device to obtain the target audio includes: acquiring a compensation function corresponding to the preset playing device, and convolving the target spatial audio signal with the compensation function to obtain the target audio corresponding to the audio to be processed. The compensation function improves the playing quality of the target spatial audio signal on the preset playing device; specifically, it adjusts the frequency response curve of the target spatial audio signal.

Illustratively, the preset playing device includes binaural headphones. Compensating the target spatial audio signal according to the compensation response of the preset playing device to obtain the target audio then includes: obtaining the transfer function and the regularization factor corresponding to the binaural headphones; determining the corresponding compensation function according to the transfer function and the regularization factor; and convolving the compensation function with the target spatial audio signal to obtain the target audio.

The transfer function and the regularization factor corresponding to the binaural headphones may be pre-stored in a memory and retrieved from the memory when needed.

Specifically, for the headphone usage scenario, the headphone transfer function is H_p, and the compensation function H_c of the headphones may be calculated, for example, in the regularized-inverse form:

H_c = H_p* / (|H_p|^2 + β)

where H_p* is the complex conjugate of H_p, and β is a regularization factor, typically a small constant independent of the frequency response, which avoids divergence of the compensation function.

The target audio is S_out = S * H_c, where S is the target spatial audio signal and * denotes convolution.

In addition, the preset playing device may also be another device, such as a loudspeaker box or a speaker. The compensation functions corresponding to various preset playing devices can likewise be pre-stored, so that once the target spatial audio signal is obtained, the compensation function can be determined according to the preset playing device and used to compensate the target spatial audio signal, thereby obtaining the target audio.

Specifically, the convolution can be expressed as a multiplication in the frequency domain, giving the target audio S_out = S · H_c.
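A minimal sketch of this compensation step follows, working in the frequency domain as described. It assumes the regularized-inverse form of H_c given above; the headphone impulse response, the value of β, and the stand-in signal are illustrative assumptions.

```python
import numpy as np

def compensate(s, hp_ir, beta=1e-3):
    """Apply headphone compensation in the frequency domain.

    s:      target spatial audio signal (one channel), shape (n,)
    hp_ir:  headphone transfer function H_p as a time-domain impulse response
    beta:   small regularization factor keeping H_c from diverging where
            |H_p| is near zero (value assumed for illustration)
    """
    n = len(s) + len(hp_ir) - 1
    S = np.fft.rfft(s, n)
    Hp = np.fft.rfft(hp_ir, n)
    # Regularized inverse: H_c = conj(H_p) / (|H_p|^2 + beta)  (assumed form)
    Hc = np.conj(Hp) / (np.abs(Hp) ** 2 + beta)
    # Convolution in time = multiplication in frequency: S_out = S * H_c
    return np.fft.irfft(S * Hc, n)

s = np.random.randn(44100)           # stand-in for the target spatial signal S
hp_ir = np.random.randn(512) * 0.05  # stand-in headphone impulse response
s_out = compensate(s, hp_ir)
```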

Referring to fig. 5, which shows a flowchart of the training steps of the sound source audio signal separation model, the following steps may be performed before S2022:

S501, obtaining a training sample, where the training sample includes an audio data sample and the sound source audio signal corresponding to the audio data sample.

The sound source audio signal corresponding to the audio data sample is the split-track audio of that audio data sample. In the present disclosure, training samples may be obtained from the public dataset MUSDB18, which contains 150 full-length music tracks of different genres, about 10 hours in total, together with the isolated source audio signal of each track, for example: drums, bass, vocals, and other audio. The present disclosure may also obtain training samples from a music library that includes split-track audio (the sound source audio signals) provided by the original musicians and the corresponding mixed audio (the audio data samples), covering pop, electronic, rap, rock, Latin, guofeng (Chinese traditional style), ACG (anime-style), and other music styles.

In the present disclosure, the training samples may also be obtained in other manners, which are not limited herein.

S502, inputting the audio data sample into the sound source audio signal separation model to obtain the training output sound source audio signal corresponding to the audio data sample.

Inputting the audio data sample into the sound source audio signal separation model to obtain the corresponding training output sound source audio signal includes: performing Fourier transform on the audio data sample to obtain the spectrum corresponding to the audio data sample; inputting that spectrum into the sound source audio signal separation model to obtain a masking factor corresponding to the audio data sample, and multiplying the masking factor by the spectrum to obtain the spectrum of the training output sound source audio signal; and performing inverse Fourier transform on the spectrum of the training output sound source audio signal to obtain the training output sound source audio signal.
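The transform-mask-inverse-transform pipeline just described can be sketched as follows. The separation model is represented here by a placeholder callable that maps a magnitude spectrogram to a mask of the same shape; the STFT parameters and the identity-mask example are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(audio, fs, model):
    """Sketch of the mask-based separation described above.

    model is assumed to be a callable mapping a magnitude spectrogram to a
    masking factor of the same shape, with values in [0, 1].
    """
    f, t, spec = stft(audio, fs, nperseg=2048)    # Fourier transform
    mask = model(np.abs(spec))                    # masking factor from the model
    masked = mask * spec                          # masking factor x spectrum
    _, source = istft(masked, fs, nperseg=2048)   # inverse Fourier transform
    return source

# Hypothetical pass-through model for illustration only.
audio = np.random.randn(44100)
source = separate(audio, 44100, model=lambda mag: np.ones_like(mag))
```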

The sound source audio signal separation model includes K layers of convolutional networks, each layer including an encoder and a decoder. Inputting the audio data sample into the model to obtain the corresponding training output sound source audio signal includes: down-sampling the audio data sample with the K encoders to obtain an intermediate feature image, where the output of the i-th encoder is the input of the (i+1)-th encoder and i takes 1, 2, …, K−1 in turn; and up-sampling the intermediate feature image with the K decoders to obtain the training output sound source audio signal corresponding to the audio data sample, where the decoders run from the deepest layer upward, the output of the j-th decoder feeding the (j−1)-th decoder, with j taking K, K−1, …, 2 in turn.

Specifically, referring to fig. 3, the initial sound source audio signal separation model is a U-net-like network structure, and the sound source audio signal separation model used in the present disclosure is obtained by training this initial model on the training samples. As shown in fig. 3, the initial model includes K layers of convolutional networks, such as K1 to K6 in fig. 3. Each layer includes an encoder En, a decoder De, and an activation-function sublayer, and the outputs of the encoder and the decoder leave the convolutional layer through the activation-function sublayer. The encoder includes a 2-dimensional convolution sublayer and a first linear rectification sublayer; the decoder includes a strided deconvolution sublayer and a second linear rectification sublayer.
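A minimal PyTorch sketch of one such layer follows: the encoder as a 2-D convolution plus linear rectification (ReLU), the decoder as a strided transposed convolution plus ReLU. The channel counts, kernel sizes, and strides are assumptions; the disclosure does not specify them.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """2-dimensional convolution sublayer + first linear rectification sublayer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.relu(self.conv(x))

class Decoder(nn.Module):
    """Strided deconvolution sublayer + second linear rectification sublayer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.relu(self.deconv(x))

# A 2x downsample then 2x upsample round trip on a spectrogram-shaped tensor.
x = torch.randn(1, 1, 256, 256)      # (batch, channel, frequency, time)
enc, dec = Encoder(1, 16), Decoder(16, 1)
y = dec(enc(x))                       # y.shape == x.shape
```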

Up-sampling the intermediate feature image with the K decoders to obtain the training output sound source audio signal corresponding to the audio data sample includes: acquiring the first output of the encoder and the second output of the decoder in the k-th layer convolutional network; determining a correlation result according to the first output and the second output; and inputting the correlation result into the decoder of the (k−1)-th layer convolutional network, which performs up-sampling to obtain the training output sound source audio signal corresponding to the audio data sample, with k taking K, K−1, …, 2 in turn.

Specifically, the first output of the encoder is U_i and the second output of the decoder is D_i, where i is the layer index of the convolutional network: i = 1 for the first layer K1, i = 2 for the second layer K2, and so on.

Determining the correlation result according to the first output and the second output includes: determining a similarity factor from the first output and the second output; and determining the correlation result according to the similarity factor and the second output.

In particular, the similarity factor A(P_i), after a ReLU convolution, is multiplied by the second output D_i to obtain the correlation result, which is input into the decoder of the (k−1)-th layer convolutional network, thereby establishing a connection within the same layer of the convolutional network. The same operation is performed on every layer in the same way, establishing the attention mechanism of the sound source audio signal separation model. Determining the similarity factor from the first output and the second output includes: determining the sum of the first output and the second output as a third output; and performing convolution processing on the third output to obtain the similarity factor.

In summary, the attention mechanism of the sound source audio signal separation model can be written as:

Q_i = A(P_i) ⊗ D_i, with A(P_i) = f(U_i + D_i), i = 1, …, K

where Q_i is the attention (correlation) result, K is the number of layers of the convolutional network, A(P_i) is the similarity factor, f denotes the ReLU-convolution operation, D_i is the output of the decoder of the i-th layer convolutional network, and ⊗ denotes element-wise multiplication.

The correlation result Q_i is calculated by adding the first output U_i and the second output D_i, performing a ReLU convolution on the sum to obtain the similarity factor A(P_i), and multiplying A(P_i), after a further ReLU convolution, by the second output D_i to obtain Q_i.

Illustratively, referring to fig. 3, the spectral image P_0 is input into the first layer convolutional network K1 to obtain the spectral image P_1. The spectral image P_1 is input into the second layer convolutional network K2, and so on, until the encoder En of the sixth layer convolutional network K6 outputs the spectral image P_6. The spectral image P_6 is then input into the decoder of the sixth layer K6 to obtain the spectral image P_7. Here P_6 serves as the first output U_6 of the encoder En of the sixth layer K6, and P_7 serves as the second output D_6 of the decoder De of K6. P_6 and P_7 are added and passed through a ReLU convolution to obtain the similarity factor A(P_6); A(P_6), after a further ReLU convolution, is multiplied by P_7 to obtain the correlation result Q_6. The correlation result Q_6 is input into the decoder of the fifth layer convolutional network K5 to obtain the second output D_5. In the same way, the correlation result Q_5 is obtained from the spectral image P_5 output by the encoder En of the fifth layer K5 and the second output D_5. This procedure continues until the second output D_1 of the decoder De of the first layer convolutional network K1 is obtained. The second output D_1 is the 2-dimensional masking factor output by the sound source audio signal separation model, and D_1 is multiplied by P_0 to obtain the spectrum of the training output sound source audio signal.
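The attention gate described above can be sketched as follows: the encoder output U_i and decoder output D_i are added, passed through a ReLU and a convolution to form the similarity factor A(P_i), which then gates D_i by element-wise multiplication. The 1x1 kernel and the sigmoid squashing the factor into [0, 1] are assumptions borrowed from common attention-gate designs, not details given in the text.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Correlation result Q_i = A(P_i) * D_i, where the similarity factor
    A(P_i) comes from adding the encoder output U_i and the decoder output
    D_i, then applying a ReLU and a convolution (the "ReLU convolution" in
    the text; the 1x1 kernel and the sigmoid are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()
    def forward(self, u, d):
        a = self.sigmoid(self.conv(self.relu(u + d)))  # similarity factor A(P_i)
        return a * d                                   # correlation result Q_i

u = torch.randn(1, 16, 128, 128)  # first output U_i of the i-th encoder
d = torch.randn(1, 16, 128, 128)  # second output D_i of the i-th decoder
q = AttentionGate(16)(u, d)       # fed to the decoder of layer i-1
```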

In the present disclosure, the attention mechanism is employed so that the deeper sound source audio signal separation model converges better during training.

S503, adjusting the parameters of the sound source audio signal separation model according to the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample, to obtain the trained sound source audio signal separation model.

Adjusting the parameters of the sound source audio signal separation model according to the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample includes: determining the loss function of the sound source audio signal separation model according to the audio data sample, the training output sound source audio signal, and the sound source audio signal corresponding to the audio data sample; and adjusting the parameters of the model according to the loss function to obtain the trained sound source audio signal separation model.

Specifically, the loss function is determined as follows:

L(P_0, Y; Θ) = || f(P_0; Θ) * P_0 − Y ||

where f(P_0; Θ) is the 2-dimensional masking factor output by the sound source audio signal separation model, P_0 is the spectral image obtained by performing Fourier transform on the audio data sample, Y is the spectral image obtained by performing Fourier transform on the sound source audio signal corresponding to the audio data sample, Θ denotes the parameters of the sound source audio signal separation model, and * here denotes element-wise multiplication of the mask and the spectrum.
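The loss can be sketched directly from this formula. Since the text does not name the norm, the L1 norm is assumed here; the random tensors stand in for the model output and the spectrograms.

```python
import torch

def separation_loss(mask, p0, y):
    """L(P0, Y; Theta) = || f(P0; Theta) * P0 - Y ||.

    mask: f(P0; Theta), the 2-D masking factor output by the model
    p0:   spectrogram of the audio data sample
    y:    spectrogram of the corresponding source audio signal
    The L1 norm is an assumption; the text does not specify the norm.
    """
    return torch.norm(mask * p0 - y, p=1)

p0 = torch.rand(1, 1, 256, 256)
y = torch.rand(1, 1, 256, 256)
mask = torch.rand(1, 1, 256, 256, requires_grad=True)  # stand-in model output
loss = separation_loss(mask, p0, y)
loss.backward()  # gradients flow back to the model parameters Theta in training
```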

In the present disclosure, different sound source audio signals correspond to different sound source audio signal separation models; the different models are trained in the same manner, but the parameters Θ obtained for each model differ.

According to the present disclosure, at least one sound source audio signal in the audio data is determined; a spatial parameter corresponding to the sound source audio signal, namely the relative position of the sound source audio signal when it sounds, is obtained according to a preset spatial position placing rule; and the audio to be processed is rendered according to the spatial parameter to obtain the target audio corresponding to the audio to be processed. By rendering the audio to be processed with the spatial parameters, a target audio with a better immersive effect is obtained; sound source audio signals with a high degree of separation are obtained without acquiring split-track files, and existing non-immersive audio data can be converted into target audio with an immersive effect, bringing a better experience to the user.

Exemplary Medium

Having described the method of the exemplary embodiment of the present disclosure, next, a storage medium of the exemplary embodiment of the present disclosure will be described with reference to fig. 6.

Referring to fig. 6, a program product 60 for implementing the above method according to an embodiment of the present disclosure is described. It may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN).

Exemplary devices

Having described the media of the exemplary embodiments of the present disclosure, next, an audio processing apparatus of the exemplary embodiments of the present disclosure will be described with reference to fig. 7.

As shown in fig. 7, the audio processing apparatus 70 includes: a first acquisition module 701, a determination module 702, a second acquisition module 703 and a rendering module 704. Wherein:

a first obtaining module 701, configured to obtain audio data of an audio to be processed;

a determining module 702, configured to determine at least one sound source audio signal in the audio data, where the sound source audio signal is a split-track audio signal with a corresponding duration in the sound field;

a second obtaining module 703, configured to obtain a spatial parameter corresponding to the sound source audio signal according to a preset spatial position placement rule, where the spatial parameter is a relative position of the sound source audio signal when the sound source audio signal sounds;

and the rendering module 704 is configured to perform rendering processing on the audio to be processed according to the spatial parameter, so as to obtain a target audio corresponding to the audio to be processed.

In yet another embodiment of the present disclosure, the determining module 702 includes:

the first transformation unit is used for carrying out Fourier transformation on the audio data to obtain a frequency spectrum corresponding to the audio data;

the system comprises an input unit, a frequency spectrum acquisition unit and a frequency spectrum analysis unit, wherein the input unit is used for inputting the frequency spectrum into a sound source audio signal separation model to obtain the frequency spectrum parameters of the split-track audio signal corresponding to the audio data, and the sound source audio signal separation model is used for separating and acquiring the frequency spectrum parameters of the split-track audio signal corresponding to a sound source audio signal;

and the second transformation unit is used for performing inverse Fourier transformation on the frequency spectrum parameters of the split-track audio signals corresponding to the audio data to obtain the sound source audio signals.

In yet another embodiment of the present disclosure, the second obtaining module 703 includes:

the first determining unit is used for determining a corresponding spatial position placing template according to the sound source audio signal, where the spatial position placing template includes at least one spatial parameter corresponding to the sound source audio signal, predefined according to different music styles;

and the second determining unit is used for determining the spatial parameters corresponding to the sound source audio signal according to the spatial position placing template.

In yet another embodiment of the present disclosure, the rendering module 704 includes:

the third determining unit is used for determining the target spatial audio signal corresponding to the audio to be processed according to the spatial parameters and the sound source audio signal;

and the fourth determining unit is used for obtaining the target audio corresponding to the audio to be processed according to the target spatial audio signal.

In yet another embodiment of the present disclosure, the at least one sound source audio signal includes at least one audio object, and the spatial position placing template further includes the head related transfer function corresponding to the audio object. The spatial parameter corresponding to the sound source audio signal includes the spatial position coordinates corresponding to the audio object; the head related transfer function describes the transmission of sound waves from the sound source to the two ears and is preset according to the spatial position coordinates of the audio object. The third determining unit includes:

a first determining subunit, configured to determine, based on the spatial position template, a head-related transfer function corresponding to the audio object;

the first convolution subunit is used for performing convolution processing on the audio object and the corresponding head-related transfer function to obtain a first spatial audio signal;

and the second determining subunit is used for determining a target spatial audio signal corresponding to the audio to be processed according to the first spatial audio signal.

In yet another embodiment of the present disclosure, the at least one sound source audio signal further includes a sound bed signal, which is the audio signal in the audio to be processed other than the audio objects, and the spatial position placing template further includes the spatial response function corresponding to the sound bed signal. The spatial response function describes the attenuation of sound waves in the environment and is preset according to the spatial parameters of the sound bed signal. The third determining unit further includes:

the third determining subunit is used for determining a spatial response function corresponding to the sound bed signal based on the spatial position template;

the second convolution subunit is used for convolving the sound bed signal and the spatial response function to obtain a second spatial audio signal;

and the fourth determining subunit is used for obtaining the target spatial audio signal according to the first spatial audio signal and the second spatial audio signal.

In still another embodiment of the present disclosure, the fourth determining unit includes:

and the compensation subunit is used for compensating the target spatial audio signal according to the compensation response of the preset playing device to obtain a target audio corresponding to the audio to be processed.

In still another embodiment of the present disclosure, the preset playing device includes binaural headphones, and the compensation subunit is specifically configured to:

obtain the transfer function and the regularization factor corresponding to the binaural headphones;

determine the corresponding compensation function according to the transfer function and the regularization factor;

and convolve the compensation function with the target spatial audio signal to obtain the target audio.

In still another embodiment of the present disclosure, further comprising:

the third acquisition module is used for acquiring a training sample, wherein the training sample comprises an audio data sample and a sound source audio signal corresponding to the audio data sample;

the input module is used for inputting the audio data sample into the sound source audio signal separation model to obtain the training output sound source audio signal corresponding to the audio data sample;

and the adjusting module is used for adjusting parameters of the sound source audio signal separation model according to the sound source audio signal corresponding to the training output sound source audio signal and the audio data sample to obtain the trained sound source audio signal separation model.

In still another embodiment of the present disclosure, the sound source audio signal separation model includes K layers of convolutional networks, each layer of convolutional network including an encoder and a decoder, an input module including:

the first sampling unit is used for down-sampling the audio data sample with K encoders to obtain an intermediate feature image, where the output of the i-th encoder is the input of the (i+1)-th encoder and i takes 1, 2, …, K−1 in turn;

and the second sampling unit is used for up-sampling the intermediate feature image with K decoders to obtain the training output sound source audio signal corresponding to the audio data sample, where the output of the j-th decoder feeds the (j−1)-th decoder and j takes K, K−1, …, 2 in turn.

In another embodiment of the present disclosure, the second sampling unit is specifically configured to:

acquiring the first output of the encoder and the second output of the decoder in the k-th layer convolutional network;

determining a correlation result according to the first output and the second output;

and inputting the correlation result into the decoder of the (k−1)-th layer convolutional network, which performs up-sampling to obtain the training output sound source audio signal corresponding to the audio data sample, with k taking K, K−1, …, 2 in turn.

In a further embodiment of the present disclosure, when determining the correlation result according to the first output and the second output, the second sampling unit is specifically configured to:

determining a similarity factor for the first output and the second output;

and determining a correlation result according to the similarity factor and the second output.

In a further embodiment of the disclosure, the second sampling unit, when determining the similarity factor of the first output and the second output, is specifically configured to:

determining a sum of the first output and the second output as a third output;

and performing convolution processing on the third output to obtain the similarity factor.

In yet another embodiment of the present disclosure, the input module is specifically configured to:

carrying out Fourier transform on the audio data sample to obtain a frequency spectrum corresponding to the audio data sample;

inputting the frequency spectrum corresponding to the audio data sample into a sound source audio signal separation model to obtain the frequency spectrum of a training output sound source audio signal corresponding to the audio data sample;

and carrying out inverse Fourier transform on the frequency spectrum of the audio signal of the training output sound source to obtain the audio signal of the training output sound source.

In another embodiment of the present disclosure, the adjusting module is specifically configured to:

determining a loss function corresponding to a sound source audio signal separation model according to the audio data sample, the training output sound source audio signal and the sound source audio signal corresponding to the audio data sample;

and adjusting parameters of the sound source audio signal separation model according to the loss function to obtain the trained sound source audio signal separation model.

It should be noted that the audio processing apparatus provided by the present disclosure is capable of executing the methods shown in fig. 2 and fig. 5, which are not described again here.

Exemplary computing device

Having described the methods, media, and apparatus of the exemplary embodiments of the present disclosure, a computing device of the exemplary embodiments of the present disclosure is described next with reference to fig. 8.

The computing device 80 shown in fig. 8 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the disclosure.

As shown in fig. 8, computing device 80 is embodied in the form of a general purpose computing device. Components of computing device 80 may include, but are not limited to: the at least one processing unit 801 and the at least one memory unit 802, and a bus 803 connecting the various system components (including the processing unit 801 and the memory unit 802).

The bus 803 includes a data bus, a control bus, and an address bus.

The storage unit 802 may include readable media in the form of volatile memory, such as a random access memory (RAM) 8021 and/or a cache memory 8022, and may further include readable media in the form of non-volatile memory, such as a read-only memory (ROM) 8023.

Storage unit 802 can also include a program/utility 8025 having a set (at least one) of program modules 8024, such program modules 8024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Computing device 80 may also communicate with one or more external devices 804 (e.g., keyboard, pointing device, etc.). Such communication may be through input/output (I/O) interfaces 805. Moreover, computing device 80 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 806. As shown in fig. 8, a network adapter 806 communicates with the other modules of the computing device 80 via the bus 803. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 80, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

It should be noted that although in the above detailed description several units/modules or sub-units/modules of the audio processing apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.

Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed; the division into aspects is for convenience of description only and does not mean that features in these aspects cannot be combined to advantage. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
