Speech emotion recognition model training method and electronic equipment

Document No.: 193315 · Publication date: 2021-11-02

Note: This technology, "Speech emotion recognition model training method and electronic equipment", was designed and created by 简仁贤, 许曜麒 and 林长洲 on 2021-08-31. Its main content is as follows: The invention discloses a speech emotion recognition model training method and an electronic device. The method comprises: obtaining a speaker recognition corpus; extracting frequency domain feature data from the speaker recognition corpus; training with the frequency domain feature data to obtain a speech emotion feature extractor; obtaining a speech emotion corpus; extracting speech emotion feature data from the speech emotion corpus with the speech emotion feature extractor; and training with the speech emotion feature data to obtain a speech emotion recognition model. With this method, only a small amount of speech emotion corpus is needed for the trained speech emotion recognition model to achieve high accuracy.

1. A speech emotion recognition model training method is characterized by comprising the following steps:

obtaining a speaker recognition corpus;

extracting frequency domain feature data from the speaker recognition corpus;

training with the frequency domain feature data to obtain a speech emotion feature extractor;

obtaining a speech emotion corpus;

extracting speech emotion feature data from the speech emotion corpus by using the speech emotion feature extractor;

and training with the speech emotion feature data to obtain a speech emotion recognition model.

2. The speech emotion recognition model training method according to claim 1, wherein extracting the frequency domain feature data from the speaker recognition corpus comprises:

performing a Fourier transform on the speech of the speaker recognition corpus to obtain a first transform result;

and passing the first transform result through a Mel filter to generate a first Mel-frequency cepstral coefficient feature as the frequency domain feature data.

3. The speech emotion recognition model training method according to claim 1, wherein training with the frequency domain feature data to obtain the speech emotion feature extractor comprises:

sequentially completing a plurality of iteration processes, wherein each iteration process comprises:

randomly selecting a part of the frequency domain feature data as a current speaker model input;

training a current speaker recognition model with the current speaker model input, and obtaining a speech emotion feature value through a forward propagation algorithm;

recording the difference between the speech emotion feature value and a ground-truth speech emotion feature value as a first minimized cross entropy;

judging, according to the first minimized cross entropy, whether the current speaker recognition model meets a convergence condition; if so, taking the current speaker recognition model as a final speaker recognition model; if not, increasing the iteration count by 1, updating the parameters of the current speaker recognition model through a back propagation algorithm so that the speech emotion feature value gradually approaches the ground-truth speech emotion feature value, and performing the next iteration process;

and taking the final speaker recognition model as the speech emotion feature extractor.

4. The speech emotion recognition model training method according to claim 1, wherein extracting the speech emotion feature data from the speech emotion corpus by using the speech emotion feature extractor comprises:

performing a Fourier transform on the speech of the speech emotion corpus to obtain a second transform result;

passing the second transform result through a Mel filter to generate a second Mel-frequency cepstral coefficient feature;

and inputting the second Mel-frequency cepstral coefficient feature into the speech emotion feature extractor to obtain the speech emotion feature data.

5. The speech emotion recognition model training method according to claim 1, wherein training with the speech emotion feature data to obtain the speech emotion recognition model comprises:

sequentially completing a plurality of iteration processes, wherein each iteration process comprises:

randomly selecting a part of the speech emotion feature data as a current emotion model input;

training a current emotion recognition model with the current emotion model input, and obtaining a speech emotion category value through a forward propagation algorithm;

recording the difference between the speech emotion category value and a ground-truth speech emotion category value as a second minimized cross entropy;

judging, according to the second minimized cross entropy, whether the current emotion recognition model meets a convergence condition; if so, taking the current emotion recognition model as a final emotion recognition model; if not, increasing the iteration count by 1, updating the parameters of the current emotion recognition model through a back propagation algorithm so that the speech emotion category value gradually approaches the ground-truth speech emotion category value, and performing the next iteration process;

and taking the final emotion recognition model as the speech emotion recognition model.

6. The speech emotion recognition model training method according to claim 3, wherein

the current speaker recognition model adopts ECAPA-TDNN;

and the ground-truth speech emotion feature value is obtained from the speaker recognition corpus.

7. The speech emotion recognition model training method according to claim 5, wherein

the current emotion recognition model adopts a multilayer perceptron;

and the ground-truth speech emotion category value is obtained from the speech emotion corpus.

8. The speech emotion recognition model training method according to claim 4, wherein the second Mel-frequency cepstral coefficient feature is input into the speech emotion feature extractor, and a vector value generated at the second-to-last layer of the speech emotion feature extractor is extracted as the speech emotion feature data.

9. The speech emotion recognition model training method according to claim 3 or 5, wherein the convergence condition is that condition one or condition two is satisfied,

condition one: the first minimized cross entropy or the second minimized cross entropy stops changing;

and condition two: the number of iterations reaches 200.

10. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the speech emotion recognition model training method of any one of claims 1 to 8.

Technical Field

The invention relates to the technical field of speech emotion recognition, in particular to a speech emotion recognition model training method and electronic equipment.

Background

At present, deep learning performs excellently in many fields, driven by improvements in computing hardware and by deeper model architectures, and the scale of the training corpus is the most critical factor in achieving these results. Speech emotion recognition is one task that deep learning could address, but speech emotion corpora are very scarce, so deep learning cannot be applied to it with good recognition results. In contrast, the training corpora obtainable for speech recognition and speaker recognition are many times, even tens of thousands of times, larger than those for speech emotion recognition.

The most direct way to solve the problem of insufficient training data for speech emotion recognition is to collect and record corpora on a large scale, but corpus collection is a high-cost task, and the cost of collecting speech emotion corpora is even higher than in other fields. Generally speaking, speech emotion recognition first distinguishes four common categories: anger, happy, neutral, and sad. Collecting the related speech emotion corpora requires professional actors to record them; ordinary speakers can hardly perform the emotions. For more advanced emotion categories such as surprise, fear, disgust, contempt, and confusion, the collection difficulty increases greatly. This approach is therefore not very feasible.

Disclosure of Invention

The invention aims to provide a speech emotion recognition model training method with which a trained speech emotion recognition model of high accuracy can be obtained using only a small amount of speech emotion corpus.

The technical solution for achieving this purpose is as follows:

The application provides a speech emotion recognition model training method, which comprises the following steps:

obtaining a speaker recognition corpus;

extracting frequency domain feature data from the speaker recognition corpus;

training with the frequency domain feature data to obtain a speech emotion feature extractor;

obtaining a speech emotion corpus;

extracting speech emotion feature data from the speech emotion corpus by using the speech emotion feature extractor;

and training with the speech emotion feature data to obtain a speech emotion recognition model.

In an embodiment, extracting the frequency domain feature data from the speaker recognition corpus includes:

performing a Fourier transform on the speech of the speaker recognition corpus to obtain a first transform result;

and passing the first transform result through a Mel filter to generate a first Mel-frequency cepstral coefficient feature as the frequency domain feature data.

In an embodiment, training with the frequency domain feature data to obtain the speech emotion feature extractor includes:

sequentially completing a plurality of iteration processes, wherein each iteration process comprises:

randomly selecting a part of the frequency domain feature data as a current speaker model input;

training a current speaker recognition model with the current speaker model input, and obtaining a speech emotion feature value through a forward propagation algorithm;

recording the difference between the speech emotion feature value and a ground-truth speech emotion feature value as a first minimized cross entropy;

judging, according to the first minimized cross entropy, whether the current speaker recognition model meets a convergence condition; if so, taking the current speaker recognition model as a final speaker recognition model; if not, increasing the iteration count by 1, updating the parameters of the current speaker recognition model through a back propagation algorithm so that the speech emotion feature value gradually approaches the ground-truth speech emotion feature value, and performing the next iteration process;

and taking the final speaker recognition model as the speech emotion feature extractor.

In an embodiment, extracting the speech emotion feature data from the speech emotion corpus by using the speech emotion feature extractor includes:

performing a Fourier transform on the speech of the speech emotion corpus to obtain a second transform result;

passing the second transform result through a Mel filter to generate a second Mel-frequency cepstral coefficient feature;

and inputting the second Mel-frequency cepstral coefficient feature into the speech emotion feature extractor to obtain the speech emotion feature data.

In an embodiment, training with the speech emotion feature data to obtain the speech emotion recognition model includes:

sequentially completing a plurality of iteration processes, wherein each iteration process comprises:

randomly selecting a part of the speech emotion feature data as a current emotion model input;

training a current emotion recognition model with the current emotion model input, and obtaining a speech emotion category value through a forward propagation algorithm;

recording the difference between the speech emotion category value and a ground-truth speech emotion category value as a second minimized cross entropy;

judging, according to the second minimized cross entropy, whether the current emotion recognition model meets a convergence condition; if so, taking the current emotion recognition model as a final emotion recognition model; if not, increasing the iteration count by 1, updating the parameters of the current emotion recognition model through a back propagation algorithm so that the speech emotion category value gradually approaches the ground-truth speech emotion category value, and performing the next iteration process;

and taking the final emotion recognition model as the speech emotion recognition model.

In one embodiment, the current speaker recognition model is an ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network);

and the ground-truth speech emotion feature value is obtained from the speaker recognition corpus.

In one embodiment, the current emotion recognition model employs a multilayer perceptron;

and the ground-truth speech emotion category value is obtained from the speech emotion corpus.

In one embodiment, the second Mel-frequency cepstral coefficient feature is input into the speech emotion feature extractor, and the vector value generated at the second-to-last layer of the speech emotion feature extractor is extracted as the speech emotion feature data.

In one embodiment, the convergence condition is that condition one or condition two is satisfied,

condition one: the first minimized cross entropy or the second minimized cross entropy stops changing;

and condition two: the number of iterations reaches 200.

The application provides an electronic device, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the speech emotion recognition model training method described above.

The beneficial effects of the invention are as follows: the invention uses a large amount of speaker recognition corpus to train a speaker recognition model with the speaker recognition task as the target, which allows richer speech features to be extracted; based on these speech features, a speech emotion recognition model is then obtained by training with only a small amount of speech emotion corpus. This solves the problem of insufficient speech emotion corpus in the prior art: a high-accuracy speech emotion recognition model can be obtained from a small amount of speech emotion corpus, and its performance is better than speech emotion recognition with a traditional Support Vector Machine (SVM). Moreover, the predictions across emotion categories are more balanced than with traditional methods and are less likely to be biased toward the emotion categories with more data.

Drawings

FIG. 1 is a flow chart of a speech emotion recognition model training method provided by an embodiment of the present application;

FIG. 2 is a flow chart of a speech emotion recognition method provided by an embodiment of the present application;

FIG. 3 is a block diagram of a training apparatus for speech emotion recognition models provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The invention will be further explained with reference to the drawings.

At present, speech emotion recognition corpora are very scarce, mainly because collecting speech emotion corpora is difficult and costly. Generally, collecting related speech emotion corpora requires finding professional actors to record them; ordinary speakers can hardly perform the emotions, so the collection difficulty is high and the collected amount cannot reach a large scale. In addition, some advanced emotion categories (such as surprise, fear, disgust, contempt, and confusion) greatly increase the collection difficulty. Therefore, if only a small amount of speech emotion corpus is used for training with the speech emotion recognition task as the target, the resulting speech emotion recognition model has a limited recognition effect and can hardly achieve high accuracy.

To solve this problem, so that a speech emotion recognition model with high accuracy can be obtained with only a small amount of speech emotion corpus, the invention provides a speech emotion recognition model training method, a speech emotion recognition model training apparatus, an electronic device, and a computer-readable storage medium. A speaker recognition corpus and a speaker recognition model are used to assist speech emotion recognition: a large amount of speaker recognition corpus is applied, a deep learning model trained with the speaker recognition task as the target is used to extract richer speech features, and, based on these speech features, a model is trained on a small amount of speech emotion corpus to obtain a speech emotion recognition model with high accuracy. The invention can be implemented by corresponding software, hardware, or a combination of software and hardware; embodiments of the invention are described in detail below.

Referring to FIG. 1, an embodiment of the present application provides a speech emotion recognition model training method, which may be executed by an electronic device and includes the following steps:

step S100, obtaining speaker identification corpora.

In this embodiment, because of the richness and the availability of the speaker identification corpus, a large amount of speaker identification corpuses can be conveniently obtained. And training the deep learning model by using the speaker recognition corpus.

Step S101, extracting frequency domain characteristic data from the speaker identification corpus.

In the embodiment, firstly, the voice of the speaker recognition corpus is subjected to Fourier transform to obtain a first transform result; and generating a first Mel frequency cepstrum coefficient characteristic as frequency domain characteristic data by passing the first transformation result through a Mel filter.
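
The following is a minimal sketch of this feature extraction in Python, assuming the librosa library; the 16 kHz sampling rate and the 40 coefficients are illustrative assumptions, not values specified by this application.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Load one utterance and return its MFCC features, shape (frames, n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=sr)
    # librosa performs the short-time Fourier transform (the "first transform
    # result") and applies a Mel filter bank before taking the cepstral
    # coefficients, matching the Fourier transform + Mel filter steps above.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T
```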

Step S102: training with the frequency domain feature data to obtain a speech emotion feature extractor.

In this embodiment, ECAPA-TDNN is used as the speaker recognition model, and the parameters of the ECAPA-TDNN are updated with classifying different speakers as the target. Step S102 is specifically realized by the following steps.

1) Sequentially complete a plurality of iteration processes, where each iteration process comprises:

11) Randomly select a part of the frequency domain feature data as the current speaker model input.

12) Train the current speaker recognition model with the current speaker model input, and obtain a speech emotion feature value through a forward propagation algorithm.

13) Record the difference between the speech emotion feature value and the ground-truth speech emotion feature value as a first minimized cross entropy.

14) Judge, according to the first minimized cross entropy, whether the current speaker recognition model meets the convergence condition; if so, take the current speaker recognition model as the final speaker recognition model; if not, increase the iteration count by 1, update the parameters of the current speaker recognition model through a back propagation algorithm so that the speech emotion feature value gradually approaches the ground-truth speech emotion feature value, and perform the next iteration process.

2) Take the final speaker recognition model as the speech emotion feature extractor.

In this embodiment, the cross entropy is a loss function; machine learning model training generally defines a loss function and updates parameters with the goal of minimizing it. The cross entropy roughly measures the difference between the speech emotion feature value and the ground-truth speech emotion feature value, and the whole training process iterates continuously with minimizing the cross entropy as the target. The ground-truth speech emotion feature value is the reference answer obtained from the speaker recognition corpus, and minimizing the cross entropy is the parameter estimation target that makes the representation of the speech emotion feature value closer to the ground-truth value. In each iteration process the parameters of the current speaker recognition model are updated once, and the model with updated parameters serves either as the current speaker recognition model of the next iteration process (when another iteration is needed) or as the final speaker recognition model (when the convergence condition is satisfied). The initial current speaker recognition model is the initial ECAPA-TDNN.

The convergence condition means that condition one or condition two is satisfied, where condition one is that the first minimized cross entropy stops changing, and condition two is that the number of iterations reaches 200. The speech emotion feature extractor obtained in this way can extract speech emotion features. A minimal training-loop sketch is given below.
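
The following is a minimal sketch of the iterative training in step S102, assuming PyTorch. `SpeakerTDNN` is a hypothetical placeholder for an ECAPA-TDNN implementation; the batch size, learning rate, and loss-plateau tolerance are illustrative assumptions, while the 200-iteration cap mirrors convergence condition two above.

```python
import torch
import torch.nn as nn

def train_speaker_model(model: nn.Module, features: torch.Tensor,
                        labels: torch.Tensor, batch_size: int = 64,
                        max_iters: int = 200, tol: float = 1e-4) -> nn.Module:
    """Train a speaker recognition model; the result is used as the feature extractor."""
    criterion = nn.CrossEntropyLoss()            # the cross entropy to be minimized
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev_loss = float("inf")
    for _ in range(max_iters):                   # condition two: at most 200 iterations
        idx = torch.randperm(features.size(0))[:batch_size]  # random part of the data
        logits = model(features[idx])            # forward propagation
        loss = criterion(logits, labels[idx])    # difference from ground-truth speaker labels
        optimizer.zero_grad()
        loss.backward()                          # back propagation updates the parameters
        optimizer.step()
        if abs(prev_loss - loss.item()) < tol:   # condition one: the loss stops changing
            break
        prev_loss = loss.item()
    return model                                 # final speaker recognition model
```

A call such as train_speaker_model(SpeakerTDNN(), mfcc_features, speaker_labels) would then yield the speech emotion feature extractor, with mfcc_features and speaker_labels prepared from the speaker recognition corpus.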

Step S103: obtaining a speech emotion corpus.

In this embodiment, a small amount of speech emotion corpus can be obtained, and the speech emotion recognition model is trained on this basis. The obtained speech emotion corpus can be divided into four categories: anger, happy, neutral, and sad. The corpus focuses on the speaking mood of the recorded speaker rather than on the speech content, and the content is not restricted by language or textual intent.

Step S104: extracting speech emotion feature data from the speech emotion corpus by using the speech emotion feature extractor.

In this embodiment, the speech of the speech emotion corpus is subjected to a Fourier transform to obtain a second transform result, and the second transform result is passed through a Mel filter to generate a second Mel-frequency cepstral coefficient feature. The second Mel-frequency cepstral coefficient feature is then input into the speech emotion feature extractor to obtain the speech emotion feature data. Specifically, the second Mel-frequency cepstral coefficient feature is input into the speech emotion feature extractor, and the vector value generated at the second-to-last layer of the extractor is extracted as the speech emotion feature data, where the second-to-last layer refers to the second layer counting from the output layer of the ECAPA-TDNN toward the input layer. A minimal sketch of this extraction is given below.
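
The following is a minimal sketch of taking the second-to-last layer output, assuming a PyTorch feature extractor whose layers are held in an nn.Sequential. Real ECAPA-TDNN implementations expose embeddings differently, so this only illustrates the idea of stopping one layer before the output layer.

```python
import torch
import torch.nn as nn

def extract_emotion_features(extractor: nn.Sequential,
                             mfcc: torch.Tensor) -> torch.Tensor:
    """Return the second-to-last layer activation as speech emotion feature data."""
    with torch.no_grad():                              # inference only, no training
        x = mfcc
        for layer in list(extractor.children())[:-1]:  # skip the final (output) layer
            x = layer(x)
    return x
```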

Step S105: training with the speech emotion feature data to obtain a speech emotion recognition model.

In this embodiment, a multilayer perceptron (MLP) is used as the speech emotion recognition model, and the parameters of the multilayer perceptron are updated with classifying the different emotion categories as the target. Step S105 is specifically realized by the following steps.

1) Sequentially complete a plurality of iteration processes, where each iteration process comprises:

11) Randomly select a part of the speech emotion feature data as the current emotion model input.

12) Train the current emotion recognition model with the current emotion model input, and obtain a speech emotion category value through a forward propagation algorithm.

13) Record the difference between the speech emotion category value and the ground-truth speech emotion category value as a second minimized cross entropy.

14) Judge, according to the second minimized cross entropy, whether the current emotion recognition model meets the convergence condition; if so, take the current emotion recognition model as the final emotion recognition model; if not, increase the iteration count by 1, update the parameters of the current emotion recognition model through a back propagation algorithm so that the speech emotion category value gradually approaches the ground-truth speech emotion category value, and perform the next iteration process.

2) Take the final emotion recognition model as the speech emotion recognition model.

In each iteration process the parameters of the current emotion recognition model are updated once, and the model with updated parameters serves either as the current emotion recognition model of the next iteration process (when another iteration is needed) or as the final emotion recognition model (when the convergence condition is met). The initial current emotion recognition model is the initial multilayer perceptron.

In this embodiment, the convergence condition means that condition one or condition two is satisfied, where condition one is that the second minimized cross entropy stops changing, and condition two is that the number of iterations reaches 200.

The ground-truth speech emotion category value is the reference answer obtained from the speech emotion corpus, and minimizing the cross entropy is the parameter estimation target that makes the representation of the speech emotion category value closer to the ground-truth value. A minimal sketch of the emotion classifier is given below.
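
The following is a minimal sketch of the multilayer perceptron used as the emotion recognition model, assuming PyTorch; the 192-dimensional input (the assumed size of the extracted emotion feature vector) and the hidden size are illustrative assumptions. Its training follows the same random-subset, cross-entropy, 200-iteration loop shown for step S102, with the speech emotion feature data and emotion category labels as inputs.

```python
import torch.nn as nn

# Four output units, one per emotion category: anger, happy, neutral, sad.
emotion_mlp = nn.Sequential(
    nn.Linear(192, 128),   # emotion feature vector -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 4),     # hidden layer -> emotion category logits
)
```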

In this embodiment, after the processing of steps S100 to S105, a speech emotion recognition model is obtained. Speech emotion features can then be extracted with the speech emotion feature extractor and input into the speech emotion recognition model for speech emotion recognition, which performs better than a conventional support vector machine. The emotion classification is also more balanced and less likely to be biased toward emotion categories with larger amounts of data.

In one embodiment, the application performs speech emotion recognition on a user utterance, "The fairness I want is all made up" (spoken in a highly agitated, angry tone). Specifically, the present application provides a speech emotion recognition method, as shown in FIG. 2, comprising the following steps:

and step S201, obtaining a speech emotion characteristic extractor according to the steps S100 to S102.

Step S202, obtaining a speech emotion recognition model according to the steps S103 to S105.

Step S203, extracting the speech emotion characteristics of the speech to be recognized by using the speech emotion characteristic extractor.

In this embodiment, the speech to be recognized is "the fairness i want is made up of nothing". Specifically, the voice to be recognized is changed from a time domain to a frequency domain through Fourier transform to obtain a corresponding transform result, and then the transform result is used for generating a Mel frequency cepstrum coefficient characteristic through a Mel filter, wherein the Mel frequency cepstrum coefficient characteristic is the frequency domain characteristic of the voice to be recognized. And inputting the frequency domain characteristics into a voice emotion characteristic extractor, and extracting voice emotion characteristics.

Step S204, predicting the speech emotion characteristics of the speech to be recognized by using the speech emotion recognition model, and obtaining a speech emotion recognition result (anger) of the speech to be recognized (the fair that we want is fictitious to all) of the speech to be recognized.

In the above, through the processing in steps S201-204, the speech emotion recognition result [ anger (anger) ] is obtained, and compared with the conventional speech emotion recognition technology, the recognition accuracy is higher, and the type of emotion category is not biased due to the large amount of some data.
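
The following is a minimal end-to-end sketch of steps S201 to S204, reusing the helper functions and models from the earlier sketches; the file name "utterance.wav", the frame-wise mean pooling, and the assumption that the extractor yields 192-dimensional frame embeddings are all illustrative.

```python
import torch

# `extractor` and `emotion_mlp` stand for the trained speech emotion feature
# extractor (step S201) and the trained emotion recognition model (step S202).
mfcc = torch.from_numpy(extract_mfcc("utterance.wav")).float()   # (frames, 40)
emotion_feat = extract_emotion_features(extractor, mfcc)         # step S203
utterance_vec = emotion_feat.mean(dim=0, keepdim=True)           # pool frames -> (1, 192)
with torch.no_grad():
    logits = emotion_mlp(utterance_vec)                          # step S204
labels = ["anger", "happy", "neutral", "sad"]
print(labels[int(logits.argmax(dim=-1))])                        # expected output: "anger"
```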

The following is an apparatus embodiment of the present application, which can be used to execute the above embodiments of the speech emotion recognition model training method. For details not disclosed in the apparatus embodiment, please refer to the embodiments of the speech emotion recognition model training method described above.

Referring to FIG. 3, the present invention provides a speech emotion recognition model training apparatus, which comprises a speaker recognition corpus acquisition module 301, a frequency domain feature data extraction module 302, a speech emotion feature extractor training module 303, a speech emotion corpus acquisition module 304, a speech emotion feature data extraction module 305, and a speech emotion recognition model training module 306.

The speaker recognition corpus acquisition module 301 obtains a speaker recognition corpus.

The frequency domain feature data extraction module 302 extracts frequency domain feature data from the speaker recognition corpus.

The speech emotion feature extractor training module 303 performs training with the frequency domain feature data to obtain a speech emotion feature extractor.

The speech emotion corpus acquisition module 304 obtains a speech emotion corpus.

The speech emotion feature data extraction module 305 extracts speech emotion feature data from the speech emotion corpus by using the speech emotion feature extractor.

The speech emotion recognition model training module 306 performs training with the speech emotion feature data to obtain a speech emotion recognition model.

In this embodiment, the frequency domain feature data extraction module 302 further includes the following sub-modules.

The first Fourier transform module performs a Fourier transform on the speech of the speaker recognition corpus to obtain a first transform result.

The first Mel filtering module passes the first transform result through a Mel filter to generate a first Mel-frequency cepstral coefficient feature as the frequency domain feature data.

The speech emotion feature extractor training module 303 further includes the following sub-modules.

The first selection module randomly selects a part of the frequency domain feature data as the current speaker model input.

The speech emotion feature value acquisition module trains the current speaker recognition model with the current speaker model input and obtains a speech emotion feature value through a forward propagation algorithm.

The first minimized cross entropy module records the difference between the speech emotion feature value and the ground-truth speech emotion feature value as a first minimized cross entropy.

The first convergence judging module judges, according to the first minimized cross entropy, whether the current speaker recognition model meets the convergence condition; if so, it takes the current speaker recognition model as the final speaker recognition model; if not, it increases the iteration count by 1, updates the parameters of the current speaker recognition model through a back propagation algorithm so that the speech emotion feature value gradually approaches the ground-truth speech emotion feature value, and performs the next iteration process, where an iteration process refers to reselecting a part of the frequency domain feature data. The convergence condition means that condition one or condition two is satisfied, where condition one is that the first minimized cross entropy stops changing, and condition two is that the number of iterations reaches 200. The speech emotion feature extractor obtained in this way can extract speech emotion features.

The speech emotion feature extractor acquisition module takes the final speaker recognition model as the speech emotion feature extractor.

The speech emotion feature data extraction module 305 further includes the following sub-modules.

The second Fourier transform module performs a Fourier transform on the speech of the speech emotion corpus to obtain a second transform result.

The second Mel filtering module passes the second transform result through a Mel filter to generate a second Mel-frequency cepstral coefficient feature.

The speech emotion feature data extraction sub-module inputs the second Mel-frequency cepstral coefficient feature into the speech emotion feature extractor, and the vector value generated at the second-to-last layer of the speech emotion feature extractor is extracted as the speech emotion feature data.

The speech emotion recognition model training module 306 further includes the following sub-modules.

The second selection module randomly selects a part of the speech emotion feature data as the current emotion model input.

The speech emotion category value module trains the current emotion recognition model with the current emotion model input and obtains a speech emotion category value through a forward propagation algorithm.

The second minimized cross entropy module records the difference between the speech emotion category value and the ground-truth speech emotion category value as a second minimized cross entropy.

The second convergence judging module judges, according to the second minimized cross entropy, whether the current emotion recognition model meets the convergence condition; if so, it takes the current emotion recognition model as the final emotion recognition model; if not, it increases the iteration count by 1, updates the parameters of the current emotion recognition model through a back propagation algorithm so that the speech emotion category value gradually approaches the ground-truth speech emotion category value, and performs the next iteration process, where an iteration process refers to reselecting a part of the speech emotion feature data. The convergence condition means that condition one or condition two is satisfied, where condition one is that the second minimized cross entropy stops changing, and condition two is that the number of iterations reaches 200.

The speech emotion recognition model acquisition module takes the final emotion recognition model as the speech emotion recognition model.

Referring to FIG. 4, an electronic device 400 includes a processor 401 and a memory 402 for storing instructions executable by the processor 401, wherein the processor 401 is configured to execute the speech emotion recognition model training method of any of the above embodiments.

The processor 401 may be an integrated circuit chip having signal processing capability. The processor 401 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; or it may be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.

The memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The memory 402 further stores one or more modules, which are executed by the one or more processors 401 to complete the steps of the speech emotion recognition model training method in the above embodiments.

A computer-readable storage medium is provided in an embodiment of the present application, where the storage medium stores a computer program, and the computer program is executable by the processor 401 to perform the method for training the speech emotion recognition model in any of the above embodiments.

In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

The above embodiments are provided only for illustrating the present invention and not for limiting the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, and therefore all equivalent technical solutions should also fall within the scope of the present invention, and should be defined by the claims.
