Audio processing method and device

Document No.: 972885 | Publication date: 2020-11-03

Note: This technology, "Audio processing method and device", was designed and created by Zhuang Xiaobin (庄晓滨) on 2020-07-27. Abstract: The embodiment of the application discloses an audio processing method and apparatus. The method comprises: for each sample speech signal in a set of sample speech signals, extracting a phoneme vector sequence and a target spectrum sequence from the sample speech signal; inputting the target spectrum sequence into an initial tone extraction model to obtain a tone feature vector; inputting the joint feature vector sequence generated from the tone feature vector and the phoneme vector sequence into an initial sequence conversion model to obtain a predicted spectrum sequence; and adjusting the initial tone extraction model and the initial sequence conversion model according to the target spectrum sequence and the predicted spectrum sequence. When the minimum mean square error between the predicted spectrum sequence output by the adjusted models and the target spectrum sequence is not greater than a preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model. The embodiment of the application improves the accuracy of the tone extraction model and has high applicability.

1. A method of audio processing, the method comprising:

acquiring a sample voice signal set, wherein the sample voice signal set comprises at least one sample voice signal;

for each sample speech signal, extracting a sequence of phoneme vectors from the sample speech signal and a sequence of target spectra from the sample speech signal;

inputting the target frequency spectrum sequence into an initial tone extraction model to obtain a tone characteristic vector output by the initial tone extraction model;

generating a combined feature vector sequence according to the tone feature vector and the phoneme vector sequence, and inputting the combined feature vector sequence into an initial sequence conversion model to obtain a prediction frequency spectrum sequence output by the initial sequence conversion model;

adjusting the initial tone extraction model and the initial sequence conversion model according to the target frequency spectrum sequence and the predicted frequency spectrum sequence;

and when the minimum mean square error between the target frequency spectrum sequence and the predicted frequency spectrum sequence corresponding to each sample voice signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, determining the adjusted initial tone extraction model as the target tone extraction model, wherein the target tone extraction model is used for extracting tone characteristic vectors of voice signals to be detected.

2. The method of claim 1, wherein said extracting a sequence of phoneme vectors from the sample speech signal comprises:

performing framing and windowing processing on the sample voice signal to obtain at least one framing signal forming the sample voice signal;

extracting text information included in each frame signal, and determining at least one phoneme constituting the text information;

acquiring a preset phoneme vector query table, and determining a phoneme vector corresponding to each phoneme from the phoneme vector query table, wherein the phoneme vector query table comprises a plurality of phoneme vectors corresponding to a plurality of phonemes, and each phoneme corresponds to one phoneme vector;

and splicing the phoneme vectors corresponding to the frame signals to obtain a phoneme vector sequence corresponding to the sample voice signal.

3. The method of claim 2, wherein the target sequence of spectra comprises a target sequence of mel-frequency spectra; the extracting a target spectrum sequence from the sample voice signal comprises:

acquiring a linear frequency spectrum corresponding to each framing signal, and inputting the linear frequency spectrum corresponding to each framing signal into a Mel filter bank to obtain a Mel spectrum corresponding to each framing signal output by the Mel filter bank;

splicing the Mel spectrums corresponding to the frame signals to obtain a Mel spectrum sequence corresponding to the sample voice signal;

and determining a target Mel spectrum sequence corresponding to the sample voice signal according to the Mel spectrum sequence.

4. The method according to claim 3, wherein said determining a target Mel spectral sequence corresponding to the sample speech signal from the Mel spectral sequence comprises:

and randomly extracting the Mel spectra corresponding to n consecutive framing signals from the Mel spectrum sequence as the target Mel spectrum sequence, wherein n is a positive integer.

5. The method of claim 1, wherein the generating a joint feature vector sequence according to the tone feature vector and the phoneme vector sequence comprises:

and splicing or summing the tone characteristic vector and the phoneme vector sequence to generate a joint characteristic vector sequence.

6. The method of claim 1, further comprising:

acquiring a voice signal to be detected, and extracting a target frequency spectrum sequence from the voice signal to be detected;

inputting the target frequency spectrum sequence of the voice signal to be detected into a target tone extraction model to obtain a target tone characteristic vector output by the target tone extraction model;

and determining the speaker to which the voice signal to be detected belongs according to the target tone characteristic vector.

7. The method according to claim 6, wherein the determining the speaker to which the speech signal to be detected belongs according to the target timbre feature vector comprises:

acquiring a tone characteristic vector set of registered users, wherein one registered user corresponds to one tone characteristic vector;

determining tone similarity between the target tone feature vector and each tone feature vector in the tone feature vector set to obtain a plurality of tone similarities;

and determining the registered user corresponding to the maximum tone similarity from the plurality of tone similarities as the speaker to which the voice signal to be detected belongs.

8. An audio processing apparatus, characterized in that the apparatus comprises:

a sample voice acquisition module, configured to acquire a sample voice signal set, wherein the sample voice signal set comprises at least one sample voice signal;

the sequence acquisition module is used for extracting a phoneme vector sequence from the sample voice signal and extracting a target spectrum sequence from the sample voice signal aiming at each sample voice signal;

a tone characteristic vector obtaining module, configured to input the target frequency spectrum sequence into an initial tone extraction model to obtain a tone characteristic vector output by the initial tone extraction model;

the prediction frequency spectrum sequence acquisition module is used for generating a joint feature vector sequence according to the tone feature vector and the phoneme vector sequence, and inputting the joint feature vector sequence into an initial sequence conversion model to obtain a prediction frequency spectrum sequence output by the initial sequence conversion model;

the model adjusting module is used for adjusting the initial tone extraction model and the initial sequence conversion model according to the target frequency spectrum sequence and the predicted frequency spectrum sequence;

and the model determining module is used for determining the adjusted initial tone extraction model as the target tone extraction model when the minimum mean square error between the target frequency spectrum sequence and the predicted frequency spectrum sequence corresponding to each sample voice signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, wherein the target tone extraction model is used for extracting tone characteristic vectors of the voice signal to be detected.

9. A terminal device, comprising a processor and a memory, the processor and the memory being interconnected;

the memory for storing a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.

Technical Field

The present application relates to the field of audio signal processing, and in particular, to an audio processing method and apparatus.

Background

With the development of artificial intelligence technology, intelligent speech applications are increasing. In the film and television dubbing industry, the timbre characteristics of the original actors can be added into localized film dubbing, so that the dubbing stays closer to the original performance. In digital entertainment industries such as karaoke, pitch information can be modified while the timbre is retained, realizing pitch correction. In the medical field, voice conversion techniques can improve the intelligibility of the voices of patients with impaired vocal tracts. In the military and defense field, voice conversion technology can disguise a speaker's voice in communication. In daily life, mobile phone voice assistants, question-and-answer robots, electronic reading, virtual singers, and the like all involve timbre and pitch. In the prior art, training and modeling usually require the speaker label of the audio corresponding to each mel cepstral coefficient feature, so it is difficult to overcome label errors and the deviation caused by variations in the same person's timbre at different times, and the timbre recognition model obtained by training performs poorly.

Disclosure of Invention

The embodiment of the application provides an audio processing method and device, which can improve the accuracy of a tone extraction model and have high applicability.

In a first aspect, an embodiment of the present application provides an audio processing method, where the method includes:

acquiring a sample voice signal set, wherein the sample voice signal set comprises at least one sample voice signal;

for each sample speech signal, extracting a sequence of phoneme vectors from the sample speech signal and a sequence of target spectra from the sample speech signal;

inputting the target frequency spectrum sequence into an initial tone extraction model to obtain a tone characteristic vector output by the initial tone extraction model;

generating a combined feature vector sequence according to the tone feature vector and the phoneme vector sequence, and inputting the combined feature vector sequence into an initial sequence conversion model to obtain a prediction frequency spectrum sequence output by the initial sequence conversion model;

adjusting the initial tone extraction model and the initial sequence conversion model according to the target frequency spectrum sequence and the predicted frequency spectrum sequence;

and when the minimum mean square error between the target frequency spectrum sequence and the predicted frequency spectrum sequence corresponding to each sample voice signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, determining the adjusted initial tone extraction model as the target tone extraction model, wherein the target tone extraction model is used for extracting tone characteristic vectors of voice signals to be detected.

With reference to the first aspect, in one possible implementation manner, the extracting a phoneme vector sequence from the sample speech signal includes:

performing framing and windowing processing on the sample voice signal to obtain at least one framing signal forming the sample voice signal;

extracting text information included in each frame signal, and determining at least one phoneme constituting the text information;

acquiring a preset phoneme vector query table, and determining a phoneme vector corresponding to each phoneme from the phoneme vector query table, wherein the phoneme vector query table comprises a plurality of phoneme vectors corresponding to a plurality of phonemes, and each phoneme corresponds to one phoneme vector;

and splicing the phoneme vectors corresponding to the frame signals to obtain a phoneme vector sequence corresponding to the sample voice signal.

With reference to the first aspect, in one possible implementation, the target spectrum sequence includes a target mel-frequency spectrum sequence; the extracting a target spectrum sequence from the sample voice signal comprises:

acquiring a linear frequency spectrum corresponding to each framing signal, and inputting the linear frequency spectrum corresponding to each framing signal into a Mel filter bank to obtain a Mel spectrum corresponding to each framing signal output by the Mel filter bank;

splicing the Mel spectrums corresponding to the frame signals to obtain a Mel spectrum sequence corresponding to the sample voice signal;

and determining a target Mel spectrum sequence corresponding to the sample voice signal according to the Mel spectrum sequence.

With reference to the first aspect, in a possible implementation manner, the determining, according to the mel-spectrum sequence, a target mel-spectrum sequence corresponding to the sample speech signal includes:

and randomly extracting the Mel spectra corresponding to n consecutive framing signals from the Mel spectrum sequence as the target Mel spectrum sequence, wherein n is a positive integer.

With reference to the first aspect, in one possible implementation manner, the generating a joint feature vector sequence according to the tone feature vector and the phoneme vector sequence includes:

and splicing or summing the tone characteristic vector and the phoneme vector sequence to generate a joint characteristic vector sequence.

With reference to the first aspect, in one possible implementation, the method further includes:

acquiring a voice signal to be detected, and extracting a target frequency spectrum sequence from the voice signal to be detected;

inputting the target frequency spectrum sequence of the voice signal to be detected into a target tone extraction model to obtain a target tone characteristic vector output by the target tone extraction model;

and determining the speaker to which the voice signal to be detected belongs according to the target tone characteristic vector.

With reference to the first aspect, in a possible implementation manner, the determining, according to the target tone feature vector, a speaker to which the speech signal to be detected belongs includes:

acquiring a tone characteristic vector set of registered users, wherein one registered user corresponds to one tone characteristic vector;

determining tone similarity between the target tone feature vector and each tone feature vector in the tone feature vector set to obtain a plurality of tone similarities;

and determining the registered user corresponding to the maximum tone similarity from the plurality of tone similarities as the speaker to which the voice signal to be detected belongs.

In a second aspect, an embodiment of the present application provides an audio processing apparatus, including:

a sample voice acquisition module, configured to acquire a sample voice signal set, wherein the sample voice signal set comprises at least one sample voice signal;

the sequence acquisition module is used for extracting a phoneme vector sequence from the sample voice signal and extracting a target spectrum sequence from the sample voice signal aiming at each sample voice signal;

a tone characteristic vector obtaining module, configured to input the target frequency spectrum sequence into an initial tone extraction model to obtain a tone characteristic vector output by the initial tone extraction model;

the prediction frequency spectrum sequence acquisition module is used for generating a joint feature vector sequence according to the tone feature vector and the phoneme vector sequence, and inputting the joint feature vector sequence into an initial sequence conversion model to obtain a prediction frequency spectrum sequence output by the initial sequence conversion model;

the model adjusting module is used for adjusting the initial tone extraction model and the initial sequence conversion model according to the target frequency spectrum sequence and the predicted frequency spectrum sequence;

and the model determining module is used for determining the adjusted initial tone extraction model as the target tone extraction model when the minimum mean square error between the target frequency spectrum sequence and the predicted frequency spectrum sequence corresponding to each sample voice signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, wherein the target tone extraction model is used for extracting tone characteristic vectors of the voice signal to be detected.

With reference to the second aspect, in a possible implementation manner, the sequence obtaining module includes a phoneme vector sequence extracting unit, and the phoneme vector sequence extracting unit includes:

a signal framing subunit, configured to perform framing and windowing on the sample speech signal to obtain at least one framing signal constituting the sample speech signal;

a phoneme determining subunit, configured to extract text information included in each of the framing signals, and determine at least one phoneme constituting the text information;

a phoneme vector determining subunit, configured to obtain a preset phoneme vector lookup table, and determine a phoneme vector corresponding to each phoneme from the phoneme vector lookup table, where the phoneme vector lookup table includes a plurality of phoneme vectors corresponding to a plurality of phonemes, and each phoneme corresponds to one phoneme vector;

and the phoneme vector sequence determining subunit is used for splicing the phoneme vectors corresponding to the frame signals to obtain a phoneme vector sequence corresponding to the sample voice signal.

With reference to the second aspect, in one possible implementation, the target spectrum sequence includes a target mel-frequency spectrum sequence; the sequence acquisition module comprises a target spectrum sequence extraction unit, and the target spectrum sequence extraction unit comprises:

the Mel spectrum acquiring subunit is used for acquiring a linear spectrum corresponding to each framing signal, and inputting the linear spectrum corresponding to each framing signal into the Mel filter bank to obtain a Mel spectrum corresponding to each framing signal output by the Mel filter bank;

a Mel spectrum sequence obtaining subunit, configured to splice Mel spectra corresponding to the frame signals to obtain a Mel spectrum sequence corresponding to the sample speech signal;

and the target Mel spectrum sequence determining subunit is used for determining a target Mel spectrum sequence corresponding to the sample voice signal according to the Mel spectrum sequence.

With reference to the second aspect, in a possible implementation manner, the target mel-spectrum sequence determining subunit is specifically configured to:

and randomly extracting the Mel spectra corresponding to n consecutive framing signals from the Mel spectrum sequence as the target Mel spectrum sequence, wherein n is a positive integer.

With reference to the second aspect, in a possible implementation manner, the predicted spectrum sequence obtaining module includes a joint feature vector sequence determining unit and a predicted spectrum sequence determining unit, where the joint feature vector sequence determining unit is configured to:

and splicing or summing the tone characteristic vector and the phoneme vector sequence to generate a joint characteristic vector sequence.

With reference to the second aspect, in a possible implementation manner, the apparatus further includes:

the voice signal preprocessing module is used for acquiring a voice signal to be detected and extracting a target frequency spectrum sequence from the voice signal to be detected;

the tone characteristic vector determining module is used for inputting the target frequency spectrum sequence of the voice signal to be detected into a target tone extraction model so as to obtain a target tone characteristic vector output by the target tone extraction model;

and the voice recognition module is used for determining the speaker to which the voice signal to be detected belongs according to the target tone characteristic vector.

With reference to the second aspect, in one possible implementation manner, the voice recognition module includes:

the vector set acquisition unit is used for acquiring a tone characteristic vector set of registered users, wherein one registered user corresponds to one tone characteristic vector;

a tone similarity determining unit, configured to determine tone similarities between the target tone feature vector and each tone feature vector in the tone feature vector set, so as to obtain multiple tone similarities;

and the speaker determining unit is used for determining the registered user corresponding to the maximum tone similarity from the plurality of tone similarities as the speaker to which the voice signal to be detected belongs.

In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor and a memory connected to each other. The memory is configured to store a computer program supporting the terminal device in executing the method provided by the first aspect and/or any possible implementation manner of the first aspect, the computer program includes program instructions, and the processor is configured to call the program instructions to execute that method.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method provided by the first aspect and/or any one of the possible implementation manners of the first aspect.

In the embodiment of the present application, a sample speech signal set comprising at least one sample speech signal is obtained. For each sample speech signal, a phoneme vector sequence and a target spectrum sequence may be extracted from the sample speech signal. The target spectrum sequence is input into the initial tone extraction model to obtain the tone feature vector output by the initial tone extraction model. A joint feature vector sequence is generated according to the tone feature vector and the phoneme vector sequence and input into the initial sequence conversion model to obtain the predicted spectrum sequence output by the initial sequence conversion model. The initial tone extraction model and the initial sequence conversion model are adjusted according to the target spectrum sequence and the predicted spectrum sequence. When the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted models, is not greater than a preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model, which is used for extracting tone feature vectors of speech signals to be detected. By adopting the embodiment of the application, the accuracy of the tone extraction model can be improved, and the applicability is high.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic flow chart of an audio processing method according to an embodiment of the present application;

fig. 1a is a schematic view of an application scenario of a target spectrum sequence provided in an embodiment of the present application;

FIG. 1b is a schematic structural diagram of a tone extraction model provided in an embodiment of the present application;

fig. 2 is another schematic flow chart of an audio processing method provided in an embodiment of the present application;

fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The audio processing method provided by the embodiment of the application can be widely applied to terminal devices capable of processing audio signals. The terminal device includes, but is not limited to, a server, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like, which is not limited herein. For convenience of description, a terminal device is taken as the executing entity in the following description. In the method of the embodiment of the present application, a sample speech signal set comprising at least one sample speech signal is obtained. For each sample speech signal, a phoneme vector sequence and a target spectrum sequence may be extracted from the sample speech signal. The target spectrum sequence is input into the initial tone extraction model to obtain the tone feature vector output by the initial tone extraction model. A joint feature vector sequence is generated according to the tone feature vector and the phoneme vector sequence and input into the initial sequence conversion model to obtain the predicted spectrum sequence output by the initial sequence conversion model. The initial tone extraction model and the initial sequence conversion model are adjusted according to the target spectrum sequence and the predicted spectrum sequence. When the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted models, is not greater than a preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model, which is used for extracting tone feature vectors of speech signals to be detected. By adopting the embodiment of the application, the accuracy of the tone extraction model can be improved, and the applicability is high.

The method and the related apparatus provided by the embodiments of the present application are described in detail below with reference to fig. 1 to fig. 4. The method comprises a data processing stage in which a sample speech signal set is acquired, a phoneme vector sequence and a target spectrum sequence are extracted from each sample speech signal, a tone feature vector is obtained, a joint feature vector sequence is determined according to the tone feature vector and the phoneme vector sequence, a predicted spectrum sequence is obtained, and the initial tone extraction model and the initial sequence conversion model are adjusted according to the target spectrum sequence and the predicted spectrum sequence. The implementation of each data processing stage can be seen in the implementations shown in fig. 1 to fig. 2 below.

Referring to fig. 1, fig. 1 is a schematic flow chart of an audio processing method according to an embodiment of the present disclosure. The method provided by the embodiment of the application can comprise the following steps S101 to S106:

s101, acquiring a sample voice signal set.

In some possible embodiments, the set of sample speech signals may be obtained from a local storage of the terminal device, or from an external memory connected to the terminal device, or from a cloud storage space of the terminal device. Wherein, the sample voice signal set comprises at least one sample voice signal.

Optionally, in some possible embodiments, the audio recorded by the microphone of the terminal device may also be obtained in real time to serve as the sample voice signal, without limitation.

S102, aiming at each sample voice signal, extracting a phoneme vector sequence from the sample voice signal, and extracting a target spectrum sequence from the sample voice signal.

In some possible embodiments, for each sample speech signal of the set of sample speech signals, a sequence of phoneme vectors may be extracted from each sample speech signal, and a sequence of target spectra may be extracted from each sample speech signal.

Specifically, for each sample speech signal, at least one framing signal constituting the sample speech signal may be obtained by performing framing and windowing on the sample speech signal. The frame length used in the framing and windowing process may be within 8-32 ms; for example, the frame length may be 10 ms. The window function used may be a Hanning window or a Hamming window, which is determined according to the actual application scenario and is not limited herein.
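As a minimal sketch of this framing-and-windowing step, assuming a 16 kHz sample rate and non-overlapping 10 ms frames (the sample rate and hop length are not specified in the text):

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sr: int = 16000,
                     frame_ms: float = 10.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a speech signal into windowed framing signals."""
    frame_len = int(sr * frame_ms / 1000)   # 160 samples at 16 kHz and 10 ms
    hop_len = int(sr * hop_ms / 1000)
    window = np.hanning(frame_len)          # a Hamming window also works here
    n_frames = (len(signal) - frame_len) // hop_len + 1
    assert n_frames >= 1, "signal shorter than one frame"
    return np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                     for i in range(n_frames)])   # shape: (n_frames, frame_len)
```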

Further, for each framing signal, at least one phoneme constituting the text information can be determined by extracting the text information included in the framing signal. It should be understood that a phoneme is the smallest phonetic unit constituting a syllable, analyzed according to the pronunciation actions within the syllable, where one pronunciation action constitutes one phoneme. In general, phonemes can be divided into Chinese phonemes and English phonemes. A Chinese phoneme is the smallest unit constituting Chinese speech; for example, the Chinese syllable "a" contains only one phoneme, a; "ai" contains two phonemes, a and i; "dai" contains three phonemes, d, a, and i; and "zhuang" contains four phonemes, zh, u, a, and ng. Accordingly, English phonemes are the smallest units constituting English speech; specifically, there are 48 English phonemes in total, comprising 20 vowel phonemes and 28 consonant phonemes.

It should be understood that the phonemes that make up a syllable can be represented in two different ways: at the speech-frame level or at the pronunciation-action level. Generally, one pronunciation action may correspond to multiple speech frames. For example, for the Chinese syllable zhuang, if represented by phonemes at the speech-frame level, it may be represented as zh(3) u(4) a(3) ng(5), where the value in parentheses is the number of speech frames, i.e., the Chinese syllable zhuang may be represented as zh zh zh u u u u a a a ng ng ng ng ng. If represented at the pronunciation-action level, one phoneme corresponds to exactly one pronunciation action, so it can be represented as zh(1) u(1) a(1) ng(1), i.e., the Chinese syllable zhuang may be represented as zh u a ng.

The phoneme vector corresponding to each phoneme can be determined from the phoneme vector lookup table by obtaining a preset phoneme vector lookup table. Here, the phoneme vector lookup table includes a plurality of phoneme vectors corresponding to a plurality of phonemes, wherein each phoneme corresponds to a phoneme vector. It is understood that, by splicing the phoneme vectors corresponding to the frame signals, a complete phoneme vector sequence corresponding to the sample speech signal can be obtained.
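A sketch of this lookup-and-splice step; the phoneme inventory, table contents, and 8-dimensional vectors below are placeholders standing in for the preset phoneme vector lookup table, which in the method is fixed in advance rather than random:

```python
import numpy as np

# Hypothetical preset lookup table: one vector per phoneme (random stand-ins here).
rng = np.random.default_rng(0)
phoneme_table = {p: rng.normal(size=8) for p in ["zh", "u", "a", "ng"]}

def phoneme_vector_sequence(frame_phonemes):
    """Look up each framing signal's phoneme and splice the vectors in order."""
    return np.stack([phoneme_table[p] for p in frame_phonemes])

# Frame-level phonemes of the syllable zhuang, i.e. zh(3) u(4) a(3) ng(5):
seq = phoneme_vector_sequence(["zh"] * 3 + ["u"] * 4 + ["a"] * 3 + ["ng"] * 5)
print(seq.shape)  # (15, 8): one phoneme vector per speech frame
```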

A Fourier transform is performed on each framing signal to obtain the linear spectrum corresponding to that framing signal. It should be understood that, by splicing the linear spectra corresponding to the framing signals constituting the sample speech signal, a complete linear spectrum sequence corresponding to the sample speech signal can be obtained. The complete linear spectrum sequence of the sample speech signal can then be determined as the target spectrum sequence. Optionally, in order to avoid overfitting of the trained model, the linear spectra corresponding to n consecutive framing signals may be randomly extracted from the complete linear spectrum sequence to serve as the target spectrum sequence. That is, a continuous partial linear spectrum sequence may be randomly extracted from the complete linear spectrum sequence as the target spectrum sequence, where n is a positive integer.

Optionally, in some possible embodiments, the target spectrum sequence may also be a mel spectrum sequence or a Bark-scale spectrum sequence, which is not limited herein. For ease of understanding, the embodiments of the present application take the target spectrum sequence being a mel spectrum sequence as an example. Specifically, the mel spectrum corresponding to each framing signal output by the mel filter bank can be obtained by acquiring the linear spectrum corresponding to each framing signal and inputting it into the mel filter bank. The mel spectrum sequence corresponding to the sample speech signal can be obtained by splicing the mel spectra corresponding to the framing signals. Further, the obtained complete mel spectrum sequence can be determined as the target spectrum sequence. Optionally, in order to avoid overfitting of the trained model, the mel spectra corresponding to n consecutive framing signals may be randomly extracted from the complete mel spectrum sequence to serve as the target mel spectrum sequence. That is, a continuous partial mel spectrum sequence may be randomly extracted from the complete mel spectrum sequence as the target spectrum sequence, where n is a positive integer.

For example, please refer to fig. 1a, which is a schematic view of an application scenario of a target spectrum sequence according to an embodiment of the present application. As shown in fig. 1a, assume that a sample speech signal in the sample speech signal set is subjected to framing and windowing, yielding framing signal 1 through framing signal 6. By performing a Fourier transform on each framing signal, linear spectrum 1 corresponding to framing signal 1 through linear spectrum 6 corresponding to framing signal 6 can be obtained. Further, by inputting the linear spectrum corresponding to each framing signal into the mel filter bank, mel spectrum 1 corresponding to linear spectrum 1 through mel spectrum 6 corresponding to linear spectrum 6 can be obtained. Splicing these mel spectra yields the complete mel spectrum sequence of the sample speech signal. Assuming that n equals 4, the mel spectra corresponding to 4 consecutive framing signals can be randomly extracted from the complete mel spectrum sequence to serve as the target mel spectrum sequence. As shown in fig. 1a, mel spectrum 2, mel spectrum 3, mel spectrum 4, and mel spectrum 5, corresponding to framing signal 2 through framing signal 5 respectively, can be used as the target spectrum sequence.
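The linear-spectrum-to-mel pipeline in fig. 1a can be sketched with librosa as follows; the FFT size, hop length, and number of mel bands are assumptions not fixed by the text:

```python
import numpy as np
import librosa

def target_mel_sequence(signal, sr=16000, n=4, n_fft=256, hop=160, n_mels=80):
    # Linear (magnitude) spectrum of each framing signal via the short-time Fourier transform.
    linear = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop))   # (1 + n_fft/2, T)
    # The mel filter bank maps each frame's linear spectrum to a mel spectrum.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # (n_mels, 1 + n_fft/2)
    mel_seq = mel_fb @ linear                                            # complete mel spectrum sequence
    # Randomly extract the mel spectra of n consecutive framing signals (assumes T >= n).
    start = np.random.randint(0, mel_seq.shape[1] - n + 1)
    return mel_seq[:, start:start + n]
```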

S103, inputting the target frequency spectrum sequence into the initial tone extraction model to obtain the tone characteristic vector output by the initial tone extraction model.

In some possible embodiments, the tone feature vector output by the initial tone extraction model can be obtained by inputting the obtained target spectrum sequence into the initial tone extraction model. The initial tone extraction model can be composed of two residual connection layers and one unidirectional gated recurrent unit (GRU) layer. Referring to fig. 1b, fig. 1b is a schematic structural diagram of a tone extraction model provided in the embodiment of the present application. As shown in fig. 1b, the initial tone extraction model includes two residual connection layers and one unidirectional GRU layer, where each residual connection layer may be composed of a one-dimensional convolution unit and a rectified linear unit (ReLU). The arrows in the figure represent the data flow direction. It should be understood that the size of the convolution kernel used in the one-dimensional convolution unit can be determined according to the dimension of the input target spectrum sequence; generally, the kernel size may be 3 or 5, etc., which is determined according to the actual application scenario and is not limited herein. The input of the tone extraction model is the target spectrum sequence, and the output of its unidirectional GRU is the tone feature vector. It is to be understood that the dimension of the tone feature vector coincides with the phoneme vector dimension described above.
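A hedged PyTorch sketch of the structure in fig. 1b; the channel sizes, kernel size, and embedding dimension are assumptions, since the text fixes only the overall layout (two residual Conv1d+ReLU blocks feeding a unidirectional GRU):

```python
import torch
import torch.nn as nn

class ToneExtractor(nn.Module):
    """Two residual connection layers (Conv1d + ReLU) followed by a unidirectional GRU."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.gru = nn.GRU(n_mels, emb_dim, batch_first=True)  # unidirectional

    def forward(self, mel):                       # mel: (batch, n_mels, T)
        x = mel + self.relu(self.conv1(mel))      # residual connection 1
        x = x + self.relu(self.conv2(x))          # residual connection 2
        _, h = self.gru(x.transpose(1, 2))        # run the GRU along the time axis
        return h[-1]                              # tone feature vector: (batch, emb_dim)

vec = ToneExtractor()(torch.randn(2, 80, 4))      # e.g. a batch of 4-frame mel sequences
print(vec.shape)                                  # torch.Size([2, 256])
```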

And S104, generating a combined feature vector sequence according to the tone feature vector and the phoneme vector sequence, and inputting the combined feature vector sequence into the initial sequence conversion model to obtain a prediction frequency spectrum sequence output by the initial sequence conversion model.

In some possible embodiments, for each sample speech signal, after obtaining its corresponding tone feature vector and phoneme vector sequence, a joint feature vector sequence may be generated according to the tone feature vector and the phoneme vector sequence. The joint feature vector sequence can then be input into the initial sequence conversion model to obtain the predicted spectrum sequence output by the initial sequence conversion model. Specifically, the tone feature vector and the phoneme vector sequence may be spliced or summed to generate the joint feature vector sequence. It is to be understood that a classical sequence conversion model may consist of an encoder and a decoder.
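A sketch of the two combination options; broadcasting the single tone feature vector across all time steps of the phoneme vector sequence is an assumption, consistent with the dimension note at the end of step S103:

```python
import torch

def joint_sequence(tone_vec, phoneme_seq, mode="concat"):
    """tone_vec: (batch, D); phoneme_seq: (batch, T, D)."""
    tiled = tone_vec.unsqueeze(1).expand(-1, phoneme_seq.size(1), -1)  # copy per time step
    if mode == "concat":
        return torch.cat([phoneme_seq, tiled], dim=-1)  # splicing: (batch, T, 2D)
    return phoneme_seq + tiled                          # summing: (batch, T, D)
```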

And S105, adjusting the initial tone extraction model and the initial sequence conversion model according to the target spectrum sequence and the predicted spectrum sequence.

In some possible embodiments, the model parameters of the initial tone extraction model and the initial sequence conversion model may be adjusted according to the target spectrum sequence and the predicted spectrum sequence.

And S106, when the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, determining the adjusted initial tone extraction model as the target tone extraction model.

In some possible embodiments, when the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, the adjusted initial tone extraction model may be determined as the target tone extraction model. The target tone extraction model is used for extracting tone feature vectors of speech signals to be detected.
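Steps S103-S106 together form a standard training loop. A minimal sketch, reusing the ToneExtractor and joint_sequence sketches above; the Adam optimizer, per-sample updates, and reading the stopping rule as "every sample's error is within the threshold" are assumptions, and `seq_model` stands for any encoder-decoder sequence conversion model whose output matches the target spectrum shape:

```python
import torch
import torch.nn as nn

def train(tone_model, seq_model, dataset, threshold=1e-3, max_epochs=100):
    mse = nn.MSELoss()
    opt = torch.optim.Adam(list(tone_model.parameters()) + list(seq_model.parameters()))
    for _ in range(max_epochs):
        worst = 0.0
        for phoneme_seq, target_mel in dataset:            # one sample speech signal at a time
            tone_vec = tone_model(target_mel)              # S103: tone feature vector
            joint = joint_sequence(tone_vec, phoneme_seq)  # S104: joint feature vector sequence
            pred_mel = seq_model(joint)                    # S104: predicted spectrum sequence
            loss = mse(pred_mel, target_mel)               # S105: adjust both models
            opt.zero_grad(); loss.backward(); opt.step()
            worst = max(worst, loss.item())
        if worst <= threshold:                             # S106: every sample's error is
            break                                          # within the preset threshold
    return tone_model                                      # the target tone extraction model
```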

In the embodiment of the present application, a sample speech signal set comprising at least one sample speech signal is obtained. For each sample speech signal, a phoneme vector sequence and a target spectrum sequence may be extracted from the sample speech signal. The target spectrum sequence is input into the initial tone extraction model to obtain the tone feature vector output by the initial tone extraction model. A joint feature vector sequence is generated according to the tone feature vector and the phoneme vector sequence and input into the initial sequence conversion model to obtain the predicted spectrum sequence output by the initial sequence conversion model. The initial tone extraction model and the initial sequence conversion model are adjusted according to the target spectrum sequence and the predicted spectrum sequence. When the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted models, is not greater than a preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model, which is used for extracting tone feature vectors of speech signals to be detected. By adopting the embodiment of the application, the accuracy of the tone extraction model can be improved, and the applicability is high.

Referring to fig. 2, fig. 2 is another schematic flow chart of an audio processing method provided in an embodiment of the present application. The method provided by the embodiment of the present application can be explained by the following implementation manners provided in steps S201 to S209:

s201, acquiring a sample voice signal set.

S202, aiming at each sample voice signal, extracting a phoneme vector sequence from the sample voice signal, and extracting a target spectrum sequence from the sample voice signal.

S203, inputting the target frequency spectrum sequence into the initial tone extraction model to obtain the tone characteristic vector output by the initial tone extraction model.

And S204, generating a combined feature vector sequence according to the tone feature vector and the phoneme vector sequence, and inputting the combined feature vector sequence into the initial sequence conversion model to obtain a prediction frequency spectrum sequence output by the initial sequence conversion model.

S205, adjusting the initial tone extraction model and the initial sequence conversion model according to the target spectrum sequence and the predicted spectrum sequence.

And S206, when the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, determining the adjusted initial tone extraction model as the target tone extraction model.

The specific implementation manner of steps S201 to S206 may refer to the description of steps S101 to S106 in the embodiment corresponding to fig. 1, and is not described herein again.

S207, acquiring the voice signal to be detected, and extracting a target frequency spectrum sequence from the voice signal to be detected.

In some possible embodiments, by acquiring the speech signal to be detected, the target spectrum sequence may be extracted from the speech signal to be detected. It should be understood that, the method for extracting the target spectrum sequence in the speech signal to be detected can refer to the above step of extracting the target spectrum sequence from the sample speech signal, and is not described herein again.

S208, inputting the target frequency spectrum sequence of the voice signal to be detected into the target tone extraction model to obtain a target tone characteristic vector output by the target tone extraction model.

In some possible embodiments, the target frequency spectrum sequence of the voice signal to be detected is input into the trained target tone extraction model to obtain the tone feature vector of the voice signal to be detected output by the model, i.e., the target tone feature vector.

S209, determining the speaker to which the voice signal to be detected belongs according to the target tone characteristic vector.

In some possible embodiments, the speaker to which the speech signal to be detected belongs can be determined according to the target tone feature vector. Specifically, by obtaining a tone feature vector set of registered users, the tone feature vectors corresponding to a plurality of registered users included in the set can be obtained, where one registered user corresponds to one tone feature vector. Further, by calculating the tone similarity between the target tone feature vector and each tone feature vector in the set, a plurality of tone similarities can be obtained. The registered user corresponding to the maximum tone similarity can then be determined from the plurality of tone similarities as the speaker to which the speech signal to be detected belongs. The tone similarity may be calculated using the Euclidean distance, the Manhattan distance, the Minkowski distance, the cosine similarity, and the like, which is not limited herein. For convenience of description, the embodiments of the present application mainly take the Euclidean distance as an example. Specifically, the Euclidean distance between the target tone feature vector and each tone feature vector in the set can be calculated and then converted into a similarity value, which serves as the tone similarity between the target tone feature vector and that tone feature vector. For example, assuming that the target tone feature vector A = {a1, a2, ..., am} and any tone feature vector B = {b1, b2, ..., bm} in the tone feature vector set, the Euclidean distance D between A and B can be calculated based on formula 1:

$D = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2}$    (formula 1)

After the Euclidean distance between the target tone feature vector A and any tone feature vector B in the tone feature vector set is obtained through calculation, the Euclidean distance may be converted into a similarity value, for example based on formula 2.

In addition to formula 2, the formula for converting the Euclidean distance into a similarity value may also be defined in different ways according to different requirements, which is not limited herein.
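A minimal sketch of the identification step. The 1/(1+D) distance-to-similarity conversion is only one common choice and an assumption here, since formula 2 is not reproduced in the source:

```python
import numpy as np

def identify_speaker(target_vec, registered):
    """registered: dict mapping each registered user to their tone feature vector."""
    def similarity(a, b):
        d = np.linalg.norm(a - b)        # Euclidean distance D (formula 1)
        return 1.0 / (1.0 + d)           # assumed conversion to a similarity value
    sims = {user: similarity(target_vec, vec) for user, vec in registered.items()}
    return max(sims, key=sims.get)       # registered user with the maximum tone similarity
```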

Optionally, in some possible embodiments, after obtaining the target tone feature vector based on the target tone extraction model, the target tone feature vector may be applied to scenes such as tone conversion, speech synthesis, singing voice synthesis, and the like, which is not limited herein.

In the embodiment of the present application, a sample speech signal set comprising at least one sample speech signal is obtained. For each sample speech signal, a phoneme vector sequence and a target spectrum sequence may be extracted from the sample speech signal. The target spectrum sequence is input into the initial tone extraction model to obtain the tone feature vector output by the initial tone extraction model. A joint feature vector sequence is generated according to the tone feature vector and the phoneme vector sequence and input into the initial sequence conversion model to obtain the predicted spectrum sequence output by the initial sequence conversion model. The initial tone extraction model and the initial sequence conversion model are adjusted according to the target spectrum sequence and the predicted spectrum sequence. When the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted models, is not greater than a preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model, which is used for extracting tone feature vectors of speech signals to be detected. Further, by acquiring the speech signal to be detected and extracting its target spectrum sequence, the target spectrum sequence of the speech signal to be detected can be input into the target tone extraction model to obtain the target tone feature vector output by the target tone extraction model, and the speaker to which the speech signal to be detected belongs can be determined according to the target tone feature vector. By adopting the embodiment of the application, the accuracy of the tone extraction model and the accuracy of speech recognition can be improved, and the applicability is high.

Referring to fig. 3, fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application. The audio processing device provided by the application comprises:

a sample voice obtaining module 31, configured to obtain a sample voice signal set, where the sample voice signal set includes at least one sample voice signal;

a sequence obtaining module 32, configured to, for each sample speech signal, extract a phoneme vector sequence from the sample speech signal, and extract a target spectrum sequence from the sample speech signal;

a tone characteristic vector obtaining module 33, configured to input the target frequency spectrum sequence into an initial tone extraction model to obtain a tone characteristic vector output by the initial tone extraction model;

a predicted spectrum sequence obtaining module 34, configured to generate a joint feature vector sequence according to the tone feature vector and the phoneme vector sequence, and input the joint feature vector sequence into an initial sequence conversion model to obtain a predicted spectrum sequence output by the initial sequence conversion model;

a model adjusting module 35, configured to adjust the initial tone extraction model and the initial sequence conversion model according to the target spectrum sequence and the predicted spectrum sequence;

and a model determining module 36, configured to determine the adjusted initial tone extraction model as the target tone extraction model when the minimum mean square error between the target frequency spectrum sequence and the predicted frequency spectrum sequence corresponding to each sample speech signal, output based on the adjusted initial tone extraction model and the adjusted initial sequence conversion model, is not greater than a preset threshold, where the target tone extraction model is used to extract tone feature vectors of the speech signal to be detected.

Referring to fig. 4, fig. 4 is a schematic view of another structure of an audio processing apparatus according to an embodiment of the present disclosure. Wherein:

In some possible embodiments, the sequence obtaining module 32 includes a phoneme vector sequence extracting unit 321, and the phoneme vector sequence extracting unit 321 includes:

a signal framing subunit 3211, configured to perform framing and windowing on the sample voice signal to obtain at least one framing signal forming the sample voice signal;

a phoneme determining subunit 3212, configured to extract text information included in each frame signal, and determine at least one phoneme constituting the text information;

a phoneme vector determining subunit 3213, configured to obtain a preset phoneme vector lookup table, and determine a phoneme vector corresponding to each phoneme from the phoneme vector lookup table, where the phoneme vector lookup table includes a plurality of phoneme vectors corresponding to a plurality of phonemes, and each phoneme corresponds to one phoneme vector;

the phoneme vector sequence determining subunit 3214 is configured to splice the phoneme vectors corresponding to the frame signals to obtain a phoneme vector sequence corresponding to the sample speech signal.

In some possible embodiments, the target sequence of spectra comprises a target sequence of mel-frequency spectra; the sequence acquiring module 32 includes a target spectrum sequence extracting unit 322, where the target spectrum sequence extracting unit 322 includes:

a mel spectrum obtaining subunit 3221, configured to obtain a linear spectrum corresponding to each framing signal, and input the linear spectrum corresponding to each framing signal into a mel filter bank, so as to obtain a mel spectrum corresponding to each framing signal output by the mel filter bank;

a mel-spectrum sequence obtaining subunit 3222, configured to splice mel-spectra corresponding to the frame signals to obtain a mel-spectrum sequence corresponding to the sample speech signal;

and a target Mel spectrum sequence determining subunit 3223, configured to determine, according to the Mel spectrum sequence, the target Mel spectrum sequence corresponding to the sample speech signal.
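Subunits 3221-3222 can be sketched with librosa, whose STFT gives the per-frame linear magnitude spectrum and whose Mel filter bank maps it to a Mel spectrum. The sample rate, FFT size, hop length, and number of Mel bands are illustrative assumptions:

```python
import numpy as np
import librosa

def mel_spectrum_sequence(signal, sr=16000, n_fft=400, hop_length=160, n_mels=80):
    """Linear magnitude spectrum of every framed, windowed signal, passed
    through a Mel filter bank; the columns are the per-frame Mel spectra,
    i.e. the Mel spectrum sequence of the sample speech signal."""
    linear = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return mel_fb @ linear  # shape: (n_mels, n_frames)
```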

In some possible embodiments, the above target Mel spectrum sequence determining subunit 3223 is specifically configured to:

randomly select, from the Mel spectrum sequence, the Mel spectra corresponding to n consecutive framed signals as the target Mel spectrum sequence, where n is a positive integer.
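A minimal sketch of this random selection, assuming the Mel spectrum sequence is an array of shape (n_mels, n_frames) with n not larger than n_frames:

```python
import numpy as np

def random_target_mel(mel_sequence, n):
    """Randomly take the Mel spectra of n consecutive framed signals
    as the target Mel spectrum sequence."""
    start = np.random.randint(0, mel_sequence.shape[1] - n + 1)
    return mel_sequence[:, start:start + n]
```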

In some possible embodiments, the predicted spectrum sequence obtaining module 34 includes a joint feature vector sequence determining unit 341 and a predicted spectrum sequence determining unit 342, where the joint feature vector sequence determining unit 341 is configured to:

concatenate or sum the tone feature vector with the phoneme vector sequence to generate the joint feature vector sequence.
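Both combination modes can be sketched as follows. Broadcasting the single tone feature vector across every position of the sequence, and requiring matching dimensions for the summation mode, are illustrative choices not fixed by the embodiment:

```python
import numpy as np

def joint_feature_sequence(timbre_vec, phoneme_seq, mode="concat"):
    """Combine one tone feature vector with every phoneme vector in the
    sequence, either by concatenation or by element-wise summation
    (summation assumes both vectors share the same dimensionality)."""
    tiled = np.broadcast_to(timbre_vec, (phoneme_seq.shape[0], timbre_vec.shape[-1]))
    if mode == "concat":
        return np.concatenate([phoneme_seq, tiled], axis=-1)
    return phoneme_seq + tiled
```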

In some possible embodiments, the apparatus further comprises:

a to-be-detected speech signal preprocessing module 37, configured to acquire a speech signal to be detected and extract a target spectrum sequence from the speech signal to be detected;

a tone feature vector determining module 38, configured to input the target spectrum sequence of the speech signal to be detected into the target tone extraction model, so as to obtain the target tone feature vector output by the target tone extraction model;

and a speech recognition module 39, configured to determine, according to the target tone feature vector, the speaker to which the speech signal to be detected belongs.
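Modules 37-38 amount to the following inference path, reusing `mel_spectrum_sequence` and `random_target_mel` from the sketches above. Here `timbre_model` stands for any trained target tone extraction model exposed as a callable, and the crop length is an assumption:

```python
def timbre_vector_for_signal(signal, timbre_model, n=128):
    """Extract the target spectrum sequence from the speech signal to be
    detected and feed it to the trained target tone extraction model to
    obtain the target tone feature vector."""
    mel = mel_spectrum_sequence(signal)                    # sketched above
    target = random_target_mel(mel, min(n, mel.shape[1]))  # sketched above
    return timbre_model(target)
```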

In some possible embodiments, the speech recognition module 39 includes:

a vector set obtaining unit 391, configured to obtain a tone feature vector set of registered users, where one registered user corresponds to one tone feature vector;

a tone similarity determining unit 392, configured to determine tone similarities between the target tone feature vector and each tone feature vector in the set of tone feature vectors, so as to obtain multiple tone similarities;

and a speaker determining unit 393, configured to determine, from the multiple tone similarities, the registered user corresponding to the maximum tone similarity as the speaker to which the speech signal to be detected belongs.
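Units 391-393 can be sketched as a nearest-neighbour search over the registered users' tone feature vectors. The embodiment does not fix the similarity measure; cosine similarity is used here as one common assumption:

```python
import numpy as np

def identify_speaker(target_vec, registered_vectors):
    """Return the registered user whose stored tone feature vector has the
    maximum similarity to the target tone feature vector.
    `registered_vectors` maps user id -> tone feature vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    similarities = {user: cosine(target_vec, vec)
                    for user, vec in registered_vectors.items()}
    return max(similarities, key=similarities.get)
```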

In a specific implementation, the audio processing apparatus may execute the implementation manners provided in the steps of fig. 1 to fig. 2 through its built-in functional modules. Specifically, the sample speech obtaining module 31 may be configured to obtain the sample speech signal set; the sequence obtaining module 32 may be configured to extract the phoneme vector sequence and the target spectrum sequence from each sample speech signal; the tone feature vector obtaining module 33 may be configured to obtain the initial tone extraction model and output the tone feature vector based on the initial tone extraction model and the target spectrum sequence; the predicted spectrum sequence obtaining module 34 may be configured to generate the joint feature vector sequence according to the tone feature vector and the phoneme vector sequence, obtain the initial sequence conversion model, and output the predicted spectrum sequence based on the initial sequence conversion model and the joint feature vector sequence; the model adjusting module 35 may be configured to adjust the initial tone extraction model and the initial sequence conversion model based on the target spectrum sequence and the predicted spectrum sequence; the model determining module 36 may be configured to determine the adjusted initial tone extraction model that meets the convergence condition as the target tone extraction model; the to-be-detected speech signal preprocessing module 37 may be configured to acquire the speech signal to be detected and extract the target spectrum sequence from it; the tone feature vector determining module 38 may be configured to input that target spectrum sequence into the target tone extraction model to obtain the target tone feature vector; and the speech recognition module 39 may be configured to determine, according to the target tone feature vector, the speaker to which the speech signal to be detected belongs. For details, reference may be made to the implementation manners provided in the respective steps, which are not described herein again.

In the embodiment of the present application, the audio processing apparatus obtains a sample speech signal set including at least one sample speech signal. For each sample speech signal, a phoneme vector sequence and a target spectrum sequence may be extracted from the sample speech signal. The target spectrum sequence is input into the initial tone extraction model to obtain the tone feature vector output by the initial tone extraction model. A joint feature vector sequence is generated according to the tone feature vector and the phoneme vector sequence, and is input into the initial sequence conversion model to obtain the predicted spectrum sequence output by the initial sequence conversion model. The initial tone extraction model and the initial sequence conversion model are then adjusted according to the target spectrum sequence and the predicted spectrum sequence. When the minimum mean square error between the target spectrum sequence and the predicted spectrum sequence corresponding to each sample speech signal, output based on the adjusted initial tone extraction model and initial sequence conversion model, is not greater than a preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model, which is used to extract the tone feature vector of a speech signal to be detected. By adopting the embodiment of the present application, the accuracy of the tone extraction model can be improved, and the applicability is high.
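Putting the pieces together, one joint training update might look like the following PyTorch sketch. The linear stand-in networks, dimensions, optimizer, learning rate, mean pooling over frames, and the assumption that the phoneme and target sequences share a frame count are all illustrative choices, not the embodiment's architecture:

```python
import torch
import torch.nn as nn

MEL_DIM, PHONEME_DIM, TIMBRE_DIM = 80, 64, 32

# Stand-ins for the initial tone extraction and sequence conversion models.
timbre_model = nn.Linear(MEL_DIM, TIMBRE_DIM)
seq_model = nn.Linear(PHONEME_DIM + TIMBRE_DIM, MEL_DIM)
optimizer = torch.optim.Adam(
    list(timbre_model.parameters()) + list(seq_model.parameters()), lr=1e-3)

def train_step(target_mel, phoneme_seq):
    """target_mel: (frames, MEL_DIM); phoneme_seq: (frames, PHONEME_DIM).
    Extract the tone feature vector, build the joint feature vector
    sequence, predict the spectrum sequence, and adjust both models
    against the MSE to the target spectrum sequence."""
    timbre_vec = timbre_model(target_mel).mean(dim=0)  # pooled over frames
    joint = torch.cat(
        [phoneme_seq, timbre_vec.expand(phoneme_seq.size(0), -1)], dim=-1)
    predicted_mel = seq_model(joint)
    loss = nn.functional.mse_loss(predicted_mel, target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()  # stop once every sample's MSE <= the preset threshold
```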

Referring to fig. 5, fig. 5 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 5, the terminal device in this embodiment may include: one or more processors 401 and memory 402. The processor 401 and the memory 402 are connected by a bus 403. The memory 402 is configured to store a computer program, where the computer program includes program instructions, and the processor 401 is configured to execute the program instructions stored in the memory 402, and perform the steps in the foregoing embodiments, which are not described herein again.

It should be appreciated that, in some possible implementations, the processor 401 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory 402 may include both read-only memory and random access memory, and provides instructions and data to the processor 401. A portion of the memory 402 may also include non-volatile random access memory. For example, the memory 402 may also store device type information.

In a specific implementation, the terminal device may execute the implementation manners provided in the steps in fig. 1 to fig. 2 through the built-in functional modules, which may specifically refer to the implementation manners provided in the steps, and are not described herein again.

In the embodiment of the present application, the terminal device likewise obtains a sample speech signal set including at least one sample speech signal and trains the models in the manner described above: for each sample speech signal it extracts the phoneme vector sequence and the target spectrum sequence, obtains the tone feature vector from the initial tone extraction model, obtains the predicted spectrum sequence from the initial sequence conversion model fed with the joint feature vector sequence, and adjusts both models according to the target and predicted spectrum sequences. When the minimum mean square error for each sample speech signal is not greater than the preset threshold, the adjusted initial tone extraction model is determined as the target tone extraction model, which is used to extract the tone feature vector of a speech signal to be detected. By adopting the embodiment of the present application, the accuracy of the tone extraction model can be improved, and the applicability is high.

An embodiment of the present application further provides a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, implement the audio processing method provided in the steps of fig. 1 to fig. 2.

The computer-readable storage medium may be an internal storage unit of the audio processing apparatus or terminal device provided in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is configured to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.

The terms "first", "second", "third", "fourth", and the like in the claims, the description, and the drawings of the present application are used to distinguish different objects rather than to describe a particular order. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but optionally further includes steps or elements that are not listed, or other steps or elements inherent to the process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation should not be considered beyond the scope of the present application.

The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowcharts and/or structural diagrams provided by the embodiments of the present application. Each flow and/or block of the flowcharts and/or structural diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural diagrams. These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the structural diagrams.
