Audio processing method and device, electronic equipment and storage medium

Document No.: 170864    Publication date: 2021-10-29

Note: This technology, "Audio processing method and device, electronic equipment and storage medium," was created by 鲍枫, 李娟娟 and 李岳鹏 on 2021-01-28. Abstract: The present application relates to the field of computer technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the steps of obtaining original audio features corresponding to audio data to be processed; calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features; calling a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is greater than that of the first audio features; calling a fully-connected network model to obtain a gain result corresponding to the audio data to be processed according to the second audio features and the original audio features; and generating denoised audio data according to the gain result and the audio data to be processed. The method improves the denoising effect, so that speech in the audio can be identified more accurately.

1. A method of audio processing, comprising:

acquiring original audio characteristics corresponding to audio data to be processed;

calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;

calling a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;

calling a full-connection network model to obtain a gain result corresponding to the audio data to be processed according to the second audio characteristic and the original audio characteristic;

and generating de-noising audio data according to the gain result and the audio data to be processed.

2. The method according to claim 1, wherein the obtaining of the original audio features corresponding to the audio data to be processed comprises:

dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;

dividing the frequencies of the first frequency interval and the second frequency interval into sub-bands, and performing sparsification on the sub-bands of the second frequency interval to obtain a sub-band set, wherein the number of sub-bands into which the first frequency interval is divided is greater than the number of sub-bands into which the second frequency interval is divided, and the sub-band set comprises audio segment data corresponding to each sub-band;

and calculating the original audio characteristics according to the subband set.

3. The method of claim 2, wherein said computing the original audio features from the set of subbands comprises:

calculating a bark frequency cepstrum coefficient of each subband in the subband set to obtain a first characteristic set;

for at least two sub-bands in the sub-band set, calculating a difference coefficient and a discrete cosine transform value between the sub-bands to obtain a second characteristic set;

and determining the original audio features according to the first feature set and the second feature set.

4. The method according to claim 1, wherein the invoking a full-connection network model according to the second audio feature and the original audio feature to obtain a gain result corresponding to the audio data to be processed comprises:

calling a third network model to process the original audio features, the first audio features and the second audio features to obtain third audio features, wherein the feature quantity of the third audio features is larger than that of the second audio features;

and calling a full-connection network model according to the third audio characteristics to obtain a gain result corresponding to the audio data to be processed.

5. The method of claim 1, wherein generating denoised audio data from the gain result and the audio data to be processed comprises:

performing multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;

and carrying out inverse fast Fourier transform on the audio gain result to obtain de-noised audio data.

6. The method according to claim 1, wherein the audio processing model comprises the first network model, the second network model and the fully-connected network model, and before the obtaining of the original audio features corresponding to the audio data to be processed, the method further comprises:

acquiring training audio features corresponding to audio data to be trained;

calling a first network model included in a model to be trained, and processing the training audio features to obtain first audio features, wherein the first audio features include at least one-dimensional features;

calling a second network model included in the model to be trained, and processing the training audio features and the first audio features to obtain second audio features, wherein the dimensionality of the second audio features is larger than that of the first audio features;

calling a full-connection network model included in the model to be trained according to the second audio characteristic and the training audio characteristic to obtain a gain result corresponding to the audio data to be trained;

and adjusting the model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be trained to obtain an audio processing model.

7. The method according to claim 1, wherein before the obtaining of the original audio features corresponding to the audio data to be processed, the method further comprises:

collecting the audio data to be processed through an audio collecting device;

after generating denoised audio data according to the gain result and the audio data to be processed, the method further comprises:

identifying the de-noised audio data to obtain an audio identification result;

and if the audio recognition result indicates that the audio data to be processed is human voice, controlling the audio acquisition device to transmit the audio data, otherwise, controlling the audio acquisition device to stop transmitting the audio data.

8. An audio processing apparatus, comprising:

the acquisition module is used for acquiring the original audio characteristics corresponding to the audio data to be processed;

the calling module is used for calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;

the calling module is further configured to call a second network model to process the original audio features and the first audio features to obtain second audio features, where the feature quantity of the second audio features is greater than the feature quantity of the first audio features;

the calling module is further configured to call a full-connection network model according to the second audio feature and the original audio feature to obtain a gain result corresponding to the audio data to be processed;

and the generating module is used for generating de-noising audio data according to the gain result and the audio data to be processed.

9. An electronic device, comprising:

a processor;

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the method of audio processing of any of claims 1 to 7 via execution of the executable instructions.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of audio processing according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of computer technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a computer-readable storage medium.

Background

With the development of computer technology, web conferences have gradually gained acceptance and become a preferred solution for teleconferencing. In a web conference, participants often choose to turn their microphones off when they are not speaking, to avoid interfering with the current speaker. The conference moderator can also keep the conference in order by muting some or all of the other participants through permission-control functions and the like.

Currently, the microphone may be turned on and off by the conference program while the user participates in the conference. The online conference program listens to the user's speech and, when it determines that the user is speaking, actively turns on the microphone to allow the user to speak.

However, noise interference is common in the user's conference environment, so the online conference program may mistakenly identify ambient noise as the user's speech and turn on the microphone, which reduces the accuracy of speech detection and degrades the user experience.

Disclosure of Invention

To address this technical problem, the present application provides an audio processing method that improves the denoising effect, so that speech in the audio can be identified more accurately and the accuracy of speech detection is improved.

Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.

According to an aspect of an embodiment of the present application, there is provided an audio processing method, including:

acquiring original audio characteristics corresponding to audio data to be processed;

calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;

calling a second network model to process the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is larger than that of the first audio features;

calling a full-connection network model to obtain a gain result corresponding to the audio data to be processed according to the second audio characteristic and the original audio characteristic;

and generating de-noising audio data according to the gain result and the audio data to be processed.

According to an aspect of an embodiment of the present application, there is provided an audio processing apparatus including:

the acquisition module is used for acquiring the original audio characteristics corresponding to the audio data to be processed;

the calling module is used for calling a first network model to process the original audio features to obtain first audio features, wherein the first audio features comprise at least one-dimensional features;

the calling module is further configured to call a second network model to process the original audio features and the first audio features to obtain second audio features, where the feature quantity of the second audio features is greater than the feature quantity of the first audio features;

the calling module is further configured to call a full-connection network model according to the second audio feature and the original audio feature to obtain a gain result corresponding to the audio data to be processed;

and the generating module is used for generating de-noising audio data according to the gain result and the audio data to be processed.

In some embodiments of the present application, based on the above technical solutions, the obtaining module includes:

the interval dividing unit is used for dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;

a subband dividing unit, configured to divide the frequencies of the first frequency interval and the second frequency interval into subbands and perform sparsification on the subbands of the second frequency interval to obtain a subband set, where the number of subbands into which the first frequency interval is divided is greater than the number of subbands into which the second frequency interval is divided, and the subband set includes audio segment data corresponding to each subband;

and the characteristic calculating unit is used for calculating the original audio characteristic according to the subband set.

In some embodiments of the present application, based on the above technical solutions, the feature calculating unit includes:

the first calculating subunit is configured to calculate a bark frequency cepstrum coefficient of each subband in the subband set to obtain a first feature set;

the second calculating subunit is configured to calculate, for at least two subbands in the subband set, a difference coefficient and a discrete cosine transform value between the subbands to obtain a second feature set;

and the characteristic determining subunit determines the original audio characteristic according to the first characteristic set and the second characteristic set.

In some embodiments of the present application, based on the above technical solutions, the invoking module includes:

the model calling unit is used for calling a third network model to process the original audio features, the first audio features and the second audio features to obtain third audio features, wherein the feature quantity of the third audio features is greater than that of the second audio features;

the model calling unit is further configured to call a full-connection network model according to the third audio feature, and obtain a gain result corresponding to the audio data to be processed.

In some embodiments of the present application, based on the above technical solutions, the generating module includes:

the gain calculation unit is used for performing multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;

and the audio transformation unit is used for carrying out inverse fast Fourier transformation on the audio gain result to obtain de-noised audio data.

In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus further includes:

the acquisition module is also used for acquiring training audio features corresponding to the audio data to be trained;

the calling module is further configured to call a first network model included in the model to be trained, and process the training audio features to obtain first audio features, where the first audio features include at least one-dimensional features;

the calling module is further configured to call a second network model included in the model to be trained, and process the training audio feature and the first audio feature to obtain a second audio feature, where a dimension of the second audio feature is greater than a dimension of the first audio feature;

the calling module is further configured to call a full-connection network model included in the model to be trained according to the second audio feature and the training audio feature, and obtain a gain result corresponding to the audio data to be trained;

and the training module is used for adjusting the model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be trained to obtain the audio processing model.

In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus further includes:

the acquisition module is used for acquiring the audio data to be processed through an audio acquisition device;

the identification module is used for identifying the de-noised audio data to obtain an audio identification result;

and the switching module is used for controlling the audio acquisition device to transmit the audio data if the audio recognition result indicates that the audio data to be processed is human voice, and otherwise, controlling the audio acquisition device to stop transmitting the audio data.

According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform the method of audio processing as in the above solution via execution of executable instructions.

According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements a method of audio processing as in the above technical solution.

According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of providing audio processing in the various alternative implementations described above.

According to the technical solutions provided by some embodiments of the present application, the audio data to be processed is denoised through network models. During processing, each subsequent network model receives both the original input features and the output of the preceding network model, so the noise characteristics in the original audio features are fully taken into account throughout the model computation. Noise is therefore filtered more thoroughly, the denoising effect is improved, and speech in the audio can be identified more accurately.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:

fig. 1 is a schematic interface diagram of a conference application in an embodiment of the present application;

FIG. 2 is a flow chart of a method of audio processing in an embodiment of the present application;

FIG. 3 is a flow chart of a method of audio processing in an embodiment of the present application;

FIG. 4 is a flow chart of a method of audio processing in an embodiment of the present application;

FIG. 5 is a flow chart of a method of audio processing in an embodiment of the present application;

FIG. 6 is a diagram illustrating an algorithm structure of an audio processing apparatus according to an embodiment of the present application;

FIG. 7 is a flow chart of a method of audio processing in an embodiment of the present application;

FIG. 8 is a flow chart of a method of audio processing in an embodiment of the present application;

FIG. 9 is a flow chart of a method of audio processing in an embodiment of the present application;

fig. 10 schematically shows a block diagram of the audio processing apparatus in the embodiment of the present application;

FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

Online web conferencing programs are becoming an increasingly preferred way of teleconferencing. Participants access the cloud conference server through the online conference program and listen to the conference and speak through the loudspeaker and microphone on their terminals.

It can be understood that the audio processing method and the related apparatus in the embodiments of the present application can be applied to voice call devices such as computers and mobile phones, to internet devices such as smart phones and smart televisions, and to dedicated devices such as landline phones and teleconference cameras. The application to these devices is similar: the user's voice audio data is collected through a microphone and denoised with the audio processing method of the embodiments of the present application to obtain audio with the noise removed. For a specific implementation, refer to the following detailed description of applying the embodiments of the present application to a cloud conference.

Referring to fig. 1, fig. 1 is a schematic interface diagram of a conference application according to an embodiment of the present application. The conference application runs on a terminal (e.g., a computer) which is connected to a cloud conference server via the internet and which transmits and receives video, audio and text information to participate in a conference. The computer has a built-in or external microphone. After connecting to the online conference, the user may switch the microphone in the conference application to a silent mode. At this point, the conference application will not send audio signals to the cloud conference server. However, the microphone on the computer is not turned off; it continues to capture audio information for the conference application to analyze and determine the user's speaking status. When the user needs to speak during the conference, the user can simply start speaking. The conference application first processes the audio information captured by the microphone with the audio processing method of this application to filter out noise (for example, mouse and keyboard sounds, notification sounds from other applications or mobile phone messages, and the sound of objects, tables or chairs being moved), and then analyzes the denoised audio information using Voice Activity Detection (VAD). When it detects that the user is speaking, the conference application prompts the user to turn on the microphone to enter a speaking mode, or turns on the microphone directly so the user can speak.
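
As a rough, purely illustrative sketch of this control flow (every class and function name below is a hypothetical placeholder, not an API of any real conferencing product):

```python
class StubDenoiser:
    """Placeholder for the audio processing method of this application."""
    def process(self, frame):
        return frame  # a real implementation would filter keyboard/notification/desk noise

class StubVad:
    """Placeholder voice activity detector operating on the denoised frame."""
    def is_speech(self, frame):
        return sum(abs(x) for x in frame) / len(frame) > 0.1  # crude energy threshold

def handle_microphone_frame(frame, denoiser, vad):
    """Denoise the captured frame first, then decide whether to prompt for unmute."""
    clean = denoiser.process(frame)
    return "prompt user to unmute" if vad.is_speech(clean) else "stay muted"

print(handle_microphone_frame([0.2, -0.3, 0.4], StubDenoiser(), StubVad()))
```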

It can be understood that the cloud conference server may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal may be a computer (such as a notebook computer, a desktop computer, etc.), a smart phone, a tablet computer, a smart speaker, a smart watch, etc., but is not limited thereto. The terminal and the cloud conference server may be directly or indirectly connected through a wired or wireless communication manner, which is not limited herein.

The audio processing method in the embodiment of the application can also be realized in a machine learning manner when being implemented specifically, and can be applied to a cloud conference specifically.

Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.

The cloud conference is an efficient, convenient, and low-cost conference form based on cloud computing technology. Users only need simple, easy-to-use operations through an internet interface to quickly and efficiently share voice, data files, and video with teams and clients all over the world, while the cloud conference service provider handles the complex technologies such as data transmission and processing in the conference.

At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network, and video; video conferences based on cloud computing are called cloud conferences.

In the cloud conference era, data transmission, processing and storage are all processed by computer resources of video conference manufacturers, users do not need to purchase expensive hardware and install complicated software, and efficient teleconferencing can be performed only by opening a browser and logging in a corresponding interface.

The cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, which greatly improves conference stability, security, and availability. In recent years, video conferencing has become popular with many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management; it is widely used in government, military, transportation, finance, operators, education, enterprises, and other fields. After video conferencing adopts cloud computing, it undoubtedly becomes even more attractive in terms of convenience, speed, and ease of use, which will surely stimulate a new wave of video conference applications.

The solution of the present application is suitable for filtering noise out of voice audio information to obtain denoised voice information for subsequent processing, thereby improving the accuracy of subsequent voice processing operations. The technical solutions provided in the present application are described in detail below with reference to specific embodiments. The method of this embodiment may be applied to a computer terminal and is specifically executed by an audio processing apparatus.

Referring to fig. 2, fig. 2 is a flowchart illustrating a method for audio processing according to an embodiment of the present application, where the flowchart at least includes the following steps S201 to S205:

step S201, obtaining an original audio feature corresponding to the audio data to be processed.

In the embodiment of the present application, the audio processing apparatus may acquire the audio data to be processed through a microphone. The audio data to be processed may include noise data as well as voice data, and is typically sampled at a frequency of 16000 Hz. According to a preset frequency band division rule, the audio processing apparatus divides the audio data to be processed into subbands to obtain a plurality of subbands. Subsequently, parameter features are calculated for each subband.

The number and manner of subband division and the selection of the parameter features may follow various suitable schemes. Specifically, the subbands may be divided as Bark bands in the Bark domain. For each Bark band, parameters such as the cepstrum coefficients and difference coefficients within the band may be calculated as its parameter features.

Step S202, a first network model is called, and the original audio features are processed to obtain first audio features, wherein the first audio features comprise at least one-dimensional features.

In the embodiment of the present application, the audio processing apparatus inputs the original audio features into the first network model for processing to obtain the first audio features. The first network model may be part of a recurrent neural network model, in which sequence data is input, recursion is performed along the evolution direction of the sequence, and all recurrent units are connected in a chain. The first network model is one of these recurrent units and may be implemented with a model such as a Long Short-Term Memory network (LSTM) or a Gated Recurrent Unit (GRU). The first network model receives the original audio features as input and outputs a multi-dimensional vector, i.e., the first audio features.

Step S202b is followed by step S203, calling a second network model, and processing the original audio features and the first audio features to obtain second audio features, wherein the feature quantity of the second audio features is greater than that of the first audio features.

in the embodiment of the application, the audio processing device combines the first audio feature and the original audio feature into an input feature and inputs the input feature into the second network model for processing, so as to obtain a second audio feature.

In particular, the second network model is also a recurrent unit in a recurrent neural network model of the same type as the first network model, and it receives the output of the first network model together with the original audio features as input. The number of dimensions of the output of the second network model (i.e., the second audio features) is typically larger than that of the output of the first network model (i.e., the first audio features), in order to further highlight the features in the audio data. Specifically, if the first audio features include 60 feature values, the second audio features include at least 61, for example 70 or 80, feature values. Because the input of the second network model includes both the first audio features and the original audio features, and the feature quantity of the second audio features is greater than that of the first audio features, the second audio features are large enough to contain the feature details of the original audio features; this increases the weight of the original audio features when computing the second audio features and improves the learning capability and denoising effect of the model. If the feature quantity of the second audio features were equal to or less than that of the first audio features, the second audio features would be too small, the weight of the original audio features would be low, the learning capability would be reduced, feature details in the original audio features would be lost, and the denoising effect would be weakened.

The activation function used in the second network model may be the same as or different from that used in the first network model, and is not limited herein.

It is to be understood that, in the present application, each feature value in the audio features obtained by the network model processing may be a value within a predetermined range. Each feature value is an intermediate value in the learning process; it does not necessarily correspond to an actual physical meaning and does not necessarily have a direct relationship with the audio data to be processed. The value range of each feature value is determined by the activation function adopted by the corresponding network model. For example, if the first network model uses a hyperbolic tangent function, the first audio features output by the first network model will include, for example, 60 feature values, each lying within the value range of that function (between -1 and 1), and each feature value does not necessarily have an actual physical meaning.

And step S204, calling a full-connection network model according to the second audio characteristic and the original audio characteristic, and acquiring a gain result corresponding to the audio data to be processed.

In particular, the audio processing apparatus may invoke at least two recurrent units. In the case where only the first network model and the second network model are called, the audio processing apparatus may input the second audio features output by the second network model into the fully-connected network model for processing to obtain the gain result of the audio. The number of dimensions of the gain result is the same as the number of subbands in the original audio features. For example, if the audio data to be processed is divided into 50 subbands, the gain result has 50 dimensions.

When the audio processing apparatus calls three or more recurrent units, it continues, on the basis of the second audio features and the original audio features, to call the subsequent recurrent units for further processing, each time taking the original audio features and the output of the previous recurrent unit as the input of the next recurrent unit, until the last recurrent unit in the sequence finishes processing and produces the final audio features. A fully-connected network model is then used for processing to obtain the gain result.

It can be understood that the number of dimensions of the output of each recurrent unit should increase step by step, so as to gradually and fully represent the speech characteristics and the noise characteristics in the audio data to be processed, which facilitates the extraction of noise.

And S205, generating denoising audio data according to the gain result and the audio data to be processed.

Specifically, after the gain result is obtained, a denoising operation may be performed with the gain result and the audio data to be processed to obtain the denoised audio data. For example, the gain result consists of M dimensions, which correspond to the M subbands into which the audio data to be processed is divided. The denoising operation is performed on the signal values of each subband using the corresponding feature value in the gain result to obtain denoised signal values, and all the calculated signal values are combined into the denoised audio data.

In the embodiment of the present application, the audio data to be processed is processed through the neural network model. During processing, each recurrent unit in the neural network model receives both the original input features and the output of the preceding recurrent unit, so the noise characteristics in the original audio features are fully taken into account during the model computation; noise is thus filtered thoroughly, the denoising effect is improved, and speech in the audio can be identified more accurately.

In an embodiment of the present application, in order to reduce resource consumption of an algorithm and improve computational efficiency on the basis of sufficiently recognizing a speech feature, specifically, as shown in fig. 3, the step S201 of obtaining an original audio feature corresponding to audio data to be processed may include the following steps S301 to S303, which are described in detail as follows:

step S301, dividing audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is less than the minimum frequency of the second frequency interval;

step S302, frequency division is carried out on the frequencies of a first frequency interval and a second frequency interval, and sparsification processing is carried out on the sub-bands of the second frequency interval, so as to obtain a sub-band set, wherein the number of the sub-bands divided by the first frequency interval is greater than that of the sub-bands divided by the second frequency interval, and the sub-band audio set comprises audio fragment data corresponding to each sub-band;

step S303, calculating the original audio characteristics according to the subband set.

In the embodiment of the application, audio data to be processed is sampled at 16000Hz, so that a broadband voice signal with a bandwidth of 8000Hz is obtained. The audio processing device divides the audio data to be processed into a first frequency interval and a second frequency interval. The first frequency interval is divided according to the usual speech frequency when the person speaks, which may typically comprise a relatively low frequency band. For example, the first frequency interval may be 0 to 2000 Hz. The second frequency interval mainly includes frequency intervals related to various types of environmental noise, and the range of the second frequency interval is not overlapped with the first frequency interval, and for example, the second frequency interval may be 2000Hz to 8000 Hz.

The first frequency interval and the second frequency interval are each divided into a plurality of subbands, each subband corresponding to one piece of audio segment data. In the present application, the frequency bands are divided in the Bark domain. Before obtaining the various feature parameters of the audio data to be processed, the audio data to be processed may be Fourier-transformed to obtain its amplitude spectrum, and the amplitude spectrum of the current audio data to be processed may then be divided into Bark subbands according to the critical-band definition to obtain the feature parameters of the subbands.

Illustratively, a short-time Fourier transform may be applied to the audio data to be processed, and the amplitude spectrum of the current audio signal segment may be calculated. The audio data to be processed is a noisy audio signal composed of a clean human voice signal s(t) and uncorrelated noise w(t); for example, w(t) may be noise in the environment. The time-domain expression of the audio data to be processed satisfies: x(t) = s(t) + w(t), where t denotes time. Applying the short-time Fourier transform to both sides of this expression, the frequency-domain expression of the current audio signal segment satisfies: X(k) = S(k) + W(k), where X(k) denotes the amplitude spectrum of the noisy audio signal, S(k) denotes the amplitude spectrum of the speech signal, W(k) denotes the noise amplitude spectrum, and k denotes the frequency bin; for example, a 512-point short-time Fourier transform may be applied to the audio data to be processed.
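
As a rough illustration of this step, the following NumPy sketch computes the amplitude spectrum of a single frame; the Hann window, the frame length, and the use of np.fft.rfft are illustrative assumptions rather than requirements of the embodiment:

```python
import numpy as np

def frame_magnitude_spectrum(x_frame, n_fft=512):
    """Amplitude spectrum X(k) of one windowed frame of the noisy signal x(t)."""
    window = np.hanning(len(x_frame))              # analysis window (an assumption)
    spectrum = np.fft.rfft(x_frame * window, n=n_fft)
    return np.abs(spectrum)                        # 257 bins for a 512-point FFT

# Example: one 512-sample frame (32 ms at 16000 Hz) of x(t) = s(t) + w(t)
x_frame = np.random.randn(512)                     # placeholder noisy frame
mag = frame_magnitude_spectrum(x_frame)            # |X(k)|, k = 0 .. 256
```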

The first frequency interval may be divided directly into a preset number of bark bands, the number of specific divisions often being empirically determined, e.g. the first frequency interval may be divided into 36 bark bands. The second frequency interval is first sparsely represented before the division into the bark bands, so that the influence of the noise signals therein on the result is moderately reduced. The number of subbands in the second frequency interval may be less than the number of subbands in the first frequency interval so as not to confuse the calculation results. For example, the second frequency interval may be divided into 28 sub-bands.

The audio segments corresponding to the subbands in the first frequency interval and the second frequency interval constitute a subband set, i.e. 64 subbands. For each of the 64 subbands, various types of audio features may be calculated to obtain original audio features, such as speech spectrogram, short-time power spectral density, fundamental frequency, formants, cepstrum coefficients, and the like.
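
A minimal sketch of how such a subband set might be formed from a 512-point amplitude spectrum, assuming the 36 + 28 = 64 split mentioned above; the band edges are linearly spaced here purely for brevity (a real implementation would use Bark-scale edges), and the per-band average is only one possible per-subband statistic:

```python
import numpy as np

def subband_values(mag, band_edges):
    """Average amplitude of each subband, given bin indices of the band edges."""
    return np.array([mag[lo:hi].mean()
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

# Illustrative edges for a 257-bin spectrum at 16000 Hz:
# bins 0..64 cover 0-2000 Hz (36 subbands), bins 64..256 cover 2000-8000 Hz (28 subbands).
low_edges = np.linspace(0, 64, 37, dtype=int)      # first frequency interval
high_edges = np.linspace(64, 257, 29, dtype=int)   # second (sparsified) interval
band_edges = np.concatenate([low_edges[:-1], high_edges])

mag = np.abs(np.fft.rfft(np.random.randn(512), n=512))  # placeholder frame spectrum
subband_set = subband_values(mag, band_edges)            # 64 per-subband values
```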

In this embodiment, the audio data to be processed is divided into two different intervals, and the interval that mainly involves noise is sparsified, so the noise signal in the audio data to be processed can be represented by a small number of features. Because more features are calculated for the low-frequency interval and fewer for the high-frequency interval, resource consumption of the algorithm can be reduced and computational efficiency improved while the speech features are still fully captured.

In an embodiment of the present application, in order to determine the speech with normal speech and the speech without speech more accurately, as shown in fig. 4, the step S303 of calculating the original audio feature according to the subband set may include the following steps S401 to S403, which are described in detail as follows:

step S401, calculating a bark frequency cepstrum coefficient of each subband in a subband set to obtain a first characteristic set;

step S402, calculating a difference coefficient and a discrete cosine transform value between sub-bands aiming at least two sub-bands in a sub-band set to obtain a second characteristic set;

step S403, determining an original audio feature according to the first feature set and the second feature set.

The audio processing apparatus may calculate the parameter features within each subband. Specifically, suppose the audio data to be processed is divided into 56 Bark bands, of which 32 Bark bands cover the lower frequency part (0 to 1000 Hz) and 24 Bark bands cover the higher frequency part (1000 to 8000 Hz). For each of the 56 Bark bands, the Bark Frequency Cepstrum Coefficients (BFCC) within the band, i.e., the parameter features of the Bark domain, are calculated, thereby yielding 56 features that form a first feature set.

It should be understood that the division of the frequency bands and their parameter features are only examples; other parameters may be used instead of the BFCC coefficients, such as Mel Frequency Cepstrum Coefficients (MFCC), which is not limited here.

For some subbands, the audio processing apparatus may calculate the difference coefficients and discrete cosine transform values of the BFCC coefficients within the band. Specifically, for the first 6 Bark bands, for example, the first-order and second-order difference coefficients of the in-band BFCC coefficients may be calculated, and the discrete cosine transform values of the in-band signal cross-correlation coefficients may also be calculated, thereby obtaining 18 features that form a second feature set.

The first-order difference is the difference between the BFCC coefficients of two adjacent subbands and can be used to represent the relationship between the two adjacent subbands. Illustratively, the first-order difference of the subband BFCC coefficients may be obtained according to the following formula: Y(b) = X(b+1) - X(b), where X(b) is the BFCC coefficient of subband b and Y(b) is the first-order difference. The second-order difference of the BFCC coefficients is the difference between two adjacent first-order differences and represents the relationship between the previous difference and the next difference; it can be used to express the dynamic relationship among three adjacent subbands of the audio amplitude spectrum. Illustratively, the second-order difference of the BFCC coefficients may be obtained according to the following formula: Z(b) = Y(b+1) - Y(b) = X(b+2) - 2X(b+1) + X(b), where X(b) is the BFCC coefficient of subband b, Y(b) is the first-order difference, and Z(b) is the second-order difference.
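
The two formulas above translate directly into a few lines of NumPy; the BFCC values below are placeholders for the first six Bark bands:

```python
import numpy as np

def bfcc_differences(x):
    """First- and second-order differences of per-subband BFCC coefficients:
    Y(b) = X(b+1) - X(b), Z(b) = X(b+2) - 2*X(b+1) + X(b)."""
    y = x[1:] - x[:-1]        # first-order differences Y(b)
    z = y[1:] - y[:-1]        # second-order differences Z(b)
    return y, z

bfcc = np.array([12.1, 10.4, 9.8, 7.5, 7.9, 6.2])  # placeholder BFCC values
y, z = bfcc_differences(bfcc)                       # 5 first-order, 4 second-order values
```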

In one embodiment, because the frequency resolution of the subbands is limited, noise between the harmonics of the fundamental frequency is not finely suppressed, which affects the quality of the speech. A post-filtering method may therefore be added: inter-harmonic noise is removed within one pitch period using a comb filter. Accordingly, the original audio features may include the fundamental frequency (pitch) of the pitch period used by the comb filter and an energy parameter as additional features.

Thus, from the first and second feature sets and the additional features described above, the original audio features comprising 76 dimensions can be determined.
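
The 76-dimensional original audio feature described above could then be assembled roughly as follows; the arrays are placeholders and the exact ordering of the features is an assumption:

```python
import numpy as np

bfcc_56 = np.zeros(56)      # first feature set: one BFCC per Bark band (placeholder)
diff_dct_18 = np.zeros(18)  # second feature set: difference + DCT features (placeholder)
pitch_hz = 120.0            # fundamental frequency of the pitch period (placeholder)
pitch_energy = 0.3          # energy parameter for the comb filter (placeholder)

original_audio_features = np.concatenate([bfcc_56, diff_dct_18, [pitch_hz, pitch_energy]])
assert original_audio_features.shape == (76,)
```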

It is to be understood that the above-described numbers for the first set of features, the second set of features, and the additional features are merely examples and are not limiting, and that a person skilled in the art may determine the number of features depending on the specific implementation.

In the embodiment of the present application, the initial audio features are determined by calculating the Bark frequency cepstrum coefficients, difference coefficients, and discrete cosine transform values within the subbands, which can fully express the speech and noise conditions in the audio to be processed. Because the Bark domain reflects human auditory perception of the signal more faithfully, speech and non-speech audio can be distinguished more accurately, improving the accuracy of speech detection.

In an embodiment of the present application, in order to filter noise data in the audio data to be processed more fully, specifically, as shown in fig. 5, the step S204 calls a full-connection network model according to the second audio feature and the original audio feature to obtain a gain result corresponding to the audio data to be processed, which may include the following steps S501 to S502, which are described in detail as follows:

step S501, calling a third network model, and processing the original audio features, the first audio features and the second audio features to obtain third audio features, wherein the feature quantity of the third audio features is greater than that of the second audio features;

step S502, according to the third audio characteristic, calling a full-connection network model to obtain a gain result corresponding to the audio data to be processed.

In the embodiment of the present application, the audio processing apparatus calls three network models. For convenience of introduction, please refer to fig. 6, which is a diagram of the algorithm structure of an audio processing apparatus according to an embodiment of the present application. Specifically, the three network models are all implemented with GRU models. For example, assume that the original audio features include 76 features calculated from 56 subbands. The audio processing apparatus inputs the values of these 76 features into the first GRU model. The first GRU model uses the hyperbolic tangent (tanh) function as its activation function, and the first audio features it outputs include 60 features. The 60 features of this first output and the 76 original audio features are then input into a second GRU model, which uses the ReLU function as its activation function, and the second audio features it outputs include 70 features. Similarly, the audio processing apparatus calls a third GRU model, which also uses the tanh function as its activation function to process the 76 original audio features, the 60 features of the first audio features, and the 70 features of the second audio features, and the third audio features it outputs include 130 features. It can be understood that, following the sequence order of the GRU models, the number of output features increases gradually so as to retain more detailed features; this makes the representation of the speech signal and the noise signal more specific, allows the gain to be calculated accurately, and improves the denoising effect.

After obtaining the 130 features of the third audio features, the audio processing apparatus inputs them into the fully-connected model. In this embodiment, the fully-connected model uses the Sigmoid function as its activation function and, from the 130 input features, obtains 56 feature values corresponding to the 56 subbands as the output gain result.

Similar to the description about the first network model and the second network model, the feature quantity of the third audio feature output by the third network model is greater than the feature quantity of the second audio feature, and each feature value in the third audio feature does not necessarily have an actual physical meaning, which is specifically referred to the above description about the first audio feature and the second audio feature, and is not described herein again.

It should be noted that the GRU model employed above may be replaced with other neural network models, such as a long-short term memory artificial neural network model or a recurrent neural network model. The activation functions of the respective GRU models may be replaced with other homogeneous activation functions. The dimension of the output result of each GRU model may also depend on the input values and the implementation details, as long as a trend of gradually increasing according to the model sequence is met. The types of the neural network model, the types of the activation function, and the dimension of the output result are not limited here.
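
The three-GRU example above can be sketched, for instance, with the Keras functional API; the layer sizes (60, 70, 130), the activations, and the 76-dimensional input follow the example in the text, while the function name and all other details are illustrative assumptions rather than the exact model of the embodiment:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_denoiser(num_features=76, num_subbands=56):
    # Input: a sequence of per-frame feature vectors (batch, time, 76).
    x_in = layers.Input(shape=(None, num_features))

    # First recurrent unit: tanh GRU producing 60 features per frame.
    h1 = layers.GRU(60, activation="tanh", return_sequences=True)(x_in)

    # Second recurrent unit: sees the original features plus the first output.
    h2 = layers.GRU(70, activation="relu", return_sequences=True)(
        layers.Concatenate()([x_in, h1]))

    # Third recurrent unit: sees the original features and both previous outputs.
    h3 = layers.GRU(130, activation="tanh", return_sequences=True)(
        layers.Concatenate()([x_in, h1, h2]))

    # Fully-connected output: one gain per subband, in (0, 1) via sigmoid.
    gains = layers.Dense(num_subbands, activation="sigmoid")(h3)
    return tf.keras.Model(x_in, gains)

model = build_denoiser()
model.summary()
```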

In the embodiment of the present application, by calling exactly these three network model units, the audio processing apparatus improves its denoising capability while keeping the quantized model size small enough to meet the requirements of real-time communication, thereby improving the accuracy of the subsequent voice detection algorithm and the user experience.

In an embodiment of the present application, in order to obtain the denoised audio data, specifically, as shown in fig. 7, the step S205 of generating the denoised audio data according to the gain result and the audio data to be processed may include the following steps S701 to S702, which are described in detail as follows:

step S701, performing multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;

and S702, performing inverse fast Fourier transform on the audio gain result to obtain de-noised audio data.

Specifically, for each subband of the audio data to be processed, the gain result includes a corresponding gain feature value. The frequency components of the subband are multiplied by the gain feature value to filter out the noise signal and amplify the speech signal, thereby performing the denoising operation. The calculation results of all subbands multiplied by their corresponding gain feature values are combined to obtain the audio gain result.

Then, an inverse fast Fourier transform is performed on the audio gain result so that its data is converted from the frequency domain to the time domain, thereby obtaining the denoised audio data.
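
A compact NumPy sketch of these two steps; the band edges and the all-ones gain vector are placeholders, and in practice the gains would come from the fully-connected network model:

```python
import numpy as np

def apply_gains(spectrum, gains, band_edges):
    """Multiply each subband of the complex spectrum by its gain, then inverse FFT."""
    enhanced = spectrum.copy()
    for g, lo, hi in zip(gains, band_edges[:-1], band_edges[1:]):
        enhanced[lo:hi] *= g                      # per-subband gain in [0, 1]
    return np.fft.irfft(enhanced, n=512)          # back to the time domain

# Placeholder frame, illustrative band edges, and an all-pass gain vector
frame = np.random.randn(512)
spectrum = np.fft.rfft(frame, n=512)
band_edges = np.linspace(0, 257, 57, dtype=int)   # 56 subbands (illustrative)
gains = np.ones(56)                               # all-pass gains for the example
denoised_frame = apply_gains(spectrum, gains, band_edges)
```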

In the embodiment of the application, the audio data to be processed is denoised by using the gain result, so that the influence of external noise factors is effectively eliminated, and the quality and the effect of the generated denoised audio data are improved.

In an embodiment of the present application, the audio processing model includes the first network model, the second network model, and the fully-connected network model, and in order to obtain the trained audio processing model, as shown in fig. 8, before the step S201, obtaining the original audio features corresponding to the audio data to be processed, the following steps S801 to S805 may be included, which are described in detail as follows:

step S801, acquiring training audio features corresponding to audio data to be trained;

step S802, calling a first network model included in a model to be trained, and processing training audio features to obtain first audio features, wherein the first audio features include at least one-dimensional features;

step S803, a second network model included in the model to be trained is called, the training audio features and the first audio features are processed, and second audio features are obtained, wherein the dimensionality of the second audio features is larger than that of the first audio features;

step S804, according to the second audio characteristic and the training audio characteristic, calling a full-connection network model included in the model to be trained, and acquiring a gain result corresponding to the audio data to be processed;

step S805, adjusting model parameters of the model to be trained according to the gain result, the audio data to be trained, and the noiseless audio data corresponding to the audio data to be processed, to obtain an audio processing model.

The audio processing model comprises a plurality of sub-models, specifically comprises a first network model, a second network model and a fully-connected network model. In one embodiment, the audio processing model may include further network models, such as a third network model, each connected in sequence order and having the output of the preceding model and the original input as its own input features. And the last model in the sequence inputs the output result into the fully-connected network model to obtain a final gain result. The number of network models included in the audio processing model may depend on the specific implementation and is not limiting in this application.

Specifically, the audio data to be trained includes audio data containing noise. A training set for the neural network can be constructed from the collected fundamental frequency information of a large number of audio signals and the characteristic parameters of a plurality of subbands. The original noisy training data satisfy X(b) = S(b) + W(b), and the target enhancement training data satisfy X'(b) = g(b) · S(b) + W(b), which are used for parameter training. The objective of the algorithm is to optimize the target enhancement factor g(b). Here, b is the subband index, X(b) denotes the original noisy amplitude spectrum, X'(b) denotes the amplitude spectrum after human voice enhancement, S(b) denotes the noiseless human voice amplitude spectrum, and W(b) denotes the noise amplitude spectrum. The loss function measures the deviation between the target enhancement result and the enhancement result output by the audio model to be trained, and may be L(p(x), p'(x)) = (p(x) - p'(x))², where p(x) denotes the target enhancement result and p'(x) denotes the enhancement result output by the audio model to be trained. The target enhancement result can be calculated from the audio data to be trained and the corresponding noiseless audio data. In a neural network, the loss function is usually used to measure how well the network fits the data: minimizing the loss function means the fit is the best, and the corresponding model parameters are the optimal parameters.
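The relationships above can be illustrated by the following short sketch, in which the per-subband spectra and the enhancement factor are given as arrays; the function names and the averaging over subbands are illustrative assumptions.

import numpy as np

def make_training_pair(S, W, g):
    # S: noiseless human voice amplitude spectrum S(b); W: noise amplitude spectrum W(b);
    # g: target enhancement factor g(b); all arrays indexed by subband b.
    X = S + W              # original noisy amplitude spectrum X(b) = S(b) + W(b)
    X_target = g * S + W   # target enhanced spectrum X'(b) = g(b) * S(b) + W(b)
    return X, X_target

def squared_error_loss(p_target, p_pred):
    # L(p(x), p'(x)) = (p(x) - p'(x))^2, here averaged over the subbands.
    return np.mean((p_target - p_pred) ** 2)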

Therefore, in the training process of the audio processing model, subband division and feature calculation are first performed on the audio data to be trained in the parameter feature calculation manner described above, so as to obtain the training audio features. Then, according to the number of recurrent units in the model to be trained, each unit takes the output of the preceding unit together with the original training audio features as its input and calculates its output. For a model to be trained with a two-layer structure, the first network model included in the model to be trained is called first to process the training audio features to obtain the first audio features, where the first audio features include at least one-dimensional features; then the second network model included in the model to be trained is called to process the training audio features and the first audio features to obtain the second audio features, where the dimensionality of the second audio features is larger than that of the first audio features.

Then, the output of the second network model is processed by the fully-connected network model to obtain the final gain result. A target gain result is obtained according to the audio data to be trained and the corresponding noiseless audio data. The loss function is calculated according to the target gain result and the gain result output by the model to be trained, and the model parameters of the model to be trained are adjusted according to the loss value to obtain the audio processing model.

The training process for the model to be trained may be performed iteratively. Specifically, a plurality of training batches may be set, and each batch inputs a certain number of pieces of audio data to be trained as a training data set. In the iterative training process, the loss value can be iteratively minimized by an Adam optimizer (Adaptive Moment Estimation optimizer).
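A minimal training loop matching this description, assuming the cascade model sketched earlier and batches of (training audio features, target gain) pairs, could look as follows; the epoch count, learning rate and batch format are assumptions rather than values fixed by this application.

import torch

def train(model, batches, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # adaptive moment estimation optimizer
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for feats, target_gain in batches:        # one batch of audio data to be trained
            pred_gain = model(feats)              # gain result output by the model to be trained
            loss = mse(pred_gain, target_gain)    # deviation between target and predicted gains
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                      # adjust the model parameters
    return model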

In this embodiment, the audio data to be trained is used to train the model to be trained, so as to obtain the audio processing model, which is beneficial to improving the feasibility of the scheme.

In an embodiment of the present application, in order to control the state of the audio collecting device so that the user can speak, as shown in fig. 9, the method may include the following steps S901 to S903, which are described in detail as follows:

before acquiring the original audio features corresponding to the audio data to be processed in step S201, the method further includes:

step S901, collecting audio data to be processed by an audio collecting device;

after generating the de-noised audio data according to the gain result and the audio data to be processed in step S205, the method further includes:

step S902, carrying out recognition processing on the de-noised audio data to obtain an audio recognition result;

step S903, if the audio recognition result indicates that the audio data to be processed is human voice, the audio acquisition device is controlled to transmit the audio data, otherwise, the audio acquisition device is controlled to stop transmitting the audio data.

The audio capturing device may be any kind of microphone or other device with audio capturing functionality. Specifically, after the user participates in the cloud conference server through the conference application, the conference application acquires the audio data to be processed through the microphone. The user can switch the microphone to a mute state, at this time, the conference application program cannot send audio data to the cloud conference server to speak, but still collects the audio data to be processed through the microphone, so that the background program of the conference application program can analyze whether the user speaks.

When the user speaks, the conference application performs denoising processing on the acquired audio data to be processed by using the audio processing apparatus in this embodiment to obtain the denoised audio data. Then, in step S902, voice recognition may be performed on the denoised audio data by using a VAD (Voice Activity Detection) algorithm or another type of detection algorithm to obtain an audio recognition result.

If the audio recognition result indicates that the audio data to be processed includes voice, it can be determined that the user is speaking currently, and the microphone can be switched to a conversation state. The microphone sends audio data to a remote cloud conference server through the conference application in a call state so as to allow the user to speak. Otherwise, if the audio recognition result indicates that the audio data to be processed does not include the voice of the human voice, the microphone is kept in a mute state. In the mute state, the microphone will stop transmitting audio data. In one embodiment, if the microphone is already in a talk state, no processing may be done. In an embodiment, before the denoised audio data is identified or before the audio data to be processed is collected by the microphone, the state of the microphone may be monitored first, and if the microphone is in a talk state, no operation is performed, and if the microphone is in a mute state, the above steps are started to be executed.
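The control logic of steps S901 to S903 can be summarized by the following sketch, in which the microphone object, the conference application object and the two callables are hypothetical placeholders rather than interfaces defined in this application.

def control_microphone(mic, app, frame, denoise, detect_voice):
    # Sketch under assumed interfaces: mic.state holds "talk" or "mute",
    # denoise() applies the gain result, detect_voice() is a VAD-style check.
    if mic.state == "talk":
        return                        # already in the call state, nothing to do
    denoised = denoise(frame)         # de-noised audio data obtained from the gain result
    if detect_voice(denoised):        # audio recognition result indicates human voice
        mic.state = "talk"            # switch to the call state ...
        app.send_audio(frame)         # ... and start transmitting audio data to the server
    else:
        mic.state = "mute"            # stay muted; no audio data is transmitted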

It will be understood that the state of the microphone mentioned above refers to a state set for the microphone in an application environment such as a conference application, and not to the on-off state of the microphone itself. The mute state and the talk state are used to distinguish whether the conference application is sending audio data to the remote server, in both states, the microphone is in a powered on state and can collect audio data.

In this embodiment, the collected audio is denoised by the method in the embodiments of the present application, the presence of human voice is then judged according to the denoised audio, and whether the audio device transmits the audio data is controlled according to the judgment result. In this way, when the user forgets to turn on the audio device, the application can turn on the audio device on behalf of the user so that the user's speech is transmitted, which avoids the user having to repeat what was said and improves the usability of the application.

It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

The following describes an apparatus implementation of the present application that may be used to perform the methods of audio processing in the above-described embodiments of the present application. Fig. 10 schematically shows a block diagram of the audio processing apparatus in the embodiment of the present application. As shown in fig. 10, the audio processing device 1000 may mainly include:

an obtaining module 1001, configured to obtain an original audio feature corresponding to audio data to be processed;

a calling module 1002, configured to call a first network model to process the original audio features to obtain first audio features, where the first audio features include at least one-dimensional features;

the calling module is further configured to call a second network model to process the original audio features and the first audio features to obtain second audio features, where the feature quantity of the second audio features is greater than the feature quantity of the first audio features;

the calling module is further configured to call a full-connection network model according to the second audio feature and the original audio feature to obtain a gain result corresponding to the audio data to be processed;

a generating module 1003, configured to generate denoising audio data according to the gain result and the audio data to be processed.

In some embodiments of the present application, based on the above technical solutions, the obtaining module 1001 includes:

the interval dividing unit is used for dividing the audio data to be processed into a first frequency interval and a second frequency interval, wherein the maximum frequency of the first frequency interval is smaller than the minimum frequency of the second frequency interval;

a subband dividing unit, configured to perform frequency division on the frequencies of the first frequency interval and the second frequency interval and perform sparsification on subbands of the second frequency interval to obtain a subband set, where the number of subbands divided by the first frequency interval is greater than the number of subbands divided by the second frequency interval, and the subband audio set includes audio fragment data corresponding to each subband;

and the characteristic calculating unit is used for calculating the original audio characteristic according to the subband set.

In some embodiments of the present application, based on the above technical solutions, the feature calculating unit includes:

the first calculating subunit is configured to calculate a bark frequency cepstrum coefficient of each subband in the subband set to obtain a first feature set;

the second calculating subunit is configured to calculate, for at least two subbands in the subband set, a difference coefficient and a discrete cosine transform value between the subbands to obtain a second feature set;

and the characteristic determining subunit determines the original audio characteristic according to the first characteristic set and the second characteristic set.

In some embodiments of the present application, based on the above technical solutions, the invoking module 1002 includes:

the model calling unit is used for calling a third network model to process the original audio features, the first audio features and the second audio features to obtain third audio features, wherein the feature quantity of the third audio features is greater than that of the second audio features;

the model calling unit is further configured to call a full-connection network model according to the third audio feature, and obtain a gain result corresponding to the audio data to be processed.

In some embodiments of the present application, based on the above technical solutions, the generating module 1003 includes:

the gain calculation unit is used for performing multiplication calculation according to the gain result and the audio data to be processed to obtain an audio gain result;

and the audio transformation unit is used for carrying out inverse fast Fourier transformation on the audio gain result to obtain de-noised audio data.

In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus 1000 further includes:

the obtaining module 1001 is further configured to obtain a training audio feature corresponding to audio data to be trained;

the calling module 1002 is further configured to call a first network model included in a model to be trained, and process the training audio features to obtain first audio features, where the first audio features include at least one-dimensional features;

the calling module 1002 is further configured to call a second network model included in the model to be trained, and process the training audio feature and the first audio feature to obtain a second audio feature, where a dimension of the second audio feature is greater than a dimension of the first audio feature;

the calling module 1002 is further configured to call a full-connection network model included in the model to be trained according to the second audio feature and the training audio feature, and obtain a gain result corresponding to the audio data to be trained;

and the training module is used for adjusting the model parameters of the model to be trained according to the gain result, the audio data to be trained and the noiseless audio data corresponding to the audio data to be trained, to obtain the audio processing model.

In some embodiments of the present application, based on the above technical solutions, the audio processing apparatus 1000 further includes:

the acquisition module is used for acquiring the audio data to be processed through an audio acquisition device;

the identification module is used for identifying the de-noised audio data to obtain an audio identification result;

and the switching module is used for controlling the audio acquisition device to transmit the audio data if the audio recognition result indicates that the audio data to be processed is human voice, and otherwise, controlling the audio acquisition device to stop transmitting the audio data.

It should be noted that the apparatus provided in the foregoing embodiment and the method provided in the foregoing embodiment belong to the same concept, and the specific manner in which each module performs operations has been described in detail in the method embodiment, and is not described again here.

FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

It should be noted that the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 11, a computer system 1100 includes a Central Processing Unit (CPU) 1101, which can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for system operation are also stored. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to the bus 1104.

The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and the like; an output section 1107 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) display, a speaker, and the like; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as necessary. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed into the storage section 1108 as necessary.

In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1109 and/or installed from the removable medium 1111. When the computer program is executed by the Central Processing Unit (CPU) 1101, various functions defined in the system of the present application are executed.

It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
