Cross-language speech emotion recognition method and system based on common feature extraction

Document No.: 1800760    Publication date: 2021-11-05

Note: This technology, "Cross-language speech emotion recognition method and system based on common feature extraction" (一种基于共性特征提取的跨语种语音情感识别方法和系统), was created by 李太豪, 郑书凯, 刘逸颖, 阮玉平 and 张晓宁 on 2021-10-08. Its main content is as follows: The invention belongs to the field of artificial intelligence and relates to a cross-language speech emotion recognition method and system based on common feature extraction. The system comprises: a voice signal acquisition module, which acquires the user voice signal with a high-fidelity single microphone or a microphone array; a voice signal preprocessing module, which preprocesses the acquired voice signal, performs endpoint detection, removes the silent sections before and after the speech, and generates data that can be processed by a neural network; a cross-language speech emotion recognition module, which processes spectrogram features through the designed complex network model and predicts the emotion type of the user audio; and an analysis and storage module, which stores the user's voice data and emotion label data and performs statistical analysis according to the actual business. The invention can effectively solve the problem of cross-language speech emotion recognition and the problem of phase feature processing in audio, thereby extracting finer pronunciation features from the audio and improving speech emotion recognition accuracy.

1. A cross-language speech emotion recognition method based on common feature extraction is characterized by comprising the following steps:

step one, collecting English emotion voice data containing labeled information and emotion voice data of other languages without labeled information;

step two, preprocessing the emotion voice data to generate a spectrogram containing phase;

step three, removing the front and rear silent sections of the spectrogram, inputting the spectrogram into a network to obtain speech depth feature information, and calculating the maximum mean error of the speech depth features;

step four, inputting the speech depth feature information into a classification network to obtain the classification probability for the labeled data, and calculating the classification error of the labeled English emotion voice data by combining the label representation obtained by vectorizing the label data;

step five, training a cross-language emotion voice classification model according to the maximum mean error of the speech depth features and the classification error of the labeled English emotion voice data;

and step six, inputting the spectrogram obtained by processing the audio to be predicted into the trained cross-language emotion voice classification model, and predicting the speech emotion.

2. The method for recognizing cross-language speech emotion based on common feature extraction as claimed in claim 1, wherein said step one specifically includes the steps of:

S1, searching open-source data sets through the network and downloading English speech data with emotion labels, together with the corresponding label data;

S2, downloading non-English speech data without emotion labels through network search, or collecting such data through active recording.

3. The method for recognizing cross-language speech emotion based on common feature extraction as claimed in claim 2, wherein said step two specifically comprises:

S3, for the speech data collected in S1 and S2, generating Mel spectrogram signals, namely spectrogram information carrying phase information, through short-time Fourier transform.

4. The method for recognizing cross-language speech emotion based on common feature extraction as claimed in claim 3, wherein said step three specifically includes the steps of:

S4, for the Mel spectrogram signals generated in S3, calculating the energy of the spectrogram frames at different times and cutting off the front and rear silent sections by setting a threshold, to obtain trimmed spectrogram information;

S5, inputting the trimmed spectrograms obtained in S4 into a feature extraction sub-network composed of complex network structures, to obtain the speech depth feature information of the two language groups;

S6, for the speech depth feature information of the two language groups obtained in S5, obtaining the model feature similarity loss through the maximum mean error, expressed as:

$$L_{mmd} = \frac{1}{n_s^{2}}\sum_{i=1}^{n_s}\sum_{j=1}^{n_s} k(x_i, x_j) + \frac{1}{n_t^{2}}\sum_{i=1}^{n_t}\sum_{j=1}^{n_t} k(y_i, y_j) - \frac{2}{n_s n_t}\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} k(x_i, y_j)$$

wherein $n_s$ is the number of English samples input to the model, $n_t$ is the number of other-language samples input to the model, $x_i$ and $x_j$ are the elements with subscripts $i$ and $j$ of the English depth feature matrix, which is represented as:

$$[x_1, x_2, \ldots, x_{n_s}]$$

$y_i$ and $y_j$ are the elements with subscripts $i$ and $j$ of the other-language depth feature matrix, which is represented as:

$$[y_1, y_2, \ldots, y_{n_t}]$$

$k$ represents a Gaussian kernel function, which can be expressed as:

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{b}\right)$$

and $b$ is a bandwidth value adjusted according to the data set.

5. The method for recognizing cross-language speech emotion based on common feature extraction as claimed in claim 4, wherein said step four specifically comprises the steps of:

S7, inputting the speech depth feature information of the English data obtained in S5 into an emotion classification processing network to obtain the emotion prediction probability features;

S8, characterizing the label data obtained in S1 using one-hot encoding, to obtain the label representation;

S9, predicting the emotion by the emotion prediction probability characteristics obtained in S7And the tag characterization obtained at S8Calculating the model loss through a cross entropy functionThe expression is:

where C is the number of emotion categories.

6. The method for recognizing cross-language speech emotion based on common feature extraction as claimed in claim 5, wherein said step five specifically comprises:

S10, summing the model feature similarity loss obtained in S6 and the model classification loss obtained in S9, and then optimizing the network model through neural network gradient updating to obtain the trained cross-language emotion voice classification model.

7. The method for recognizing cross-language speech emotion based on common feature extraction as claimed in claim 6, wherein said step six specifically comprises:

S11, preprocessing the speech of any language to be predicted to generate a phase-containing Mel spectrogram signal, inputting this Mel spectrogram signal into the trained cross-language emotion voice classification model, and predicting the emotion category of the speech.

8. A cross-language speech emotion recognition system based on common feature extraction is characterized by comprising:

the voice signal acquisition module is used for acquiring a user voice signal, wherein the voice signal comprises English emotion voice data containing labeled information and other language emotion voice data not containing labeled information;

the voice signal preprocessing module is used for preprocessing the acquired voice signals to generate a spectrogram containing phases, then carrying out end point detection to remove front and rear mute sections of the spectrogram signals and generate data which can be used for neural network processing;

the cross-language voice emotion recognition module is used for processing the spectrogram through the designed complex network model to obtain voice depth characteristic information, training an emotion recognition model and predicting the emotion type of the user audio;

and the analysis storage module is used for storing the voice data and the emotion label data of the user by utilizing an Oracle database and carrying out statistical analysis according to the actual service.

9. The system according to claim 8, wherein the preprocessing specifically includes pre-emphasis, framing, windowing, short-time Fourier transform and silence removal operations, which convert the speech signal from a time-domain signal to a frequency-domain signal, i.e. from audio samples to audio spectral features; silence denoising is performed on the speech by spectral subtraction, pre-emphasis is performed on the speech by a Z-transform method, and spectral feature extraction is performed on the speech by short-time Fourier transform.

Technical Field

The invention belongs to the field of artificial intelligence, and relates to a cross-language speech emotion recognition method and system based on common feature extraction.

Background

Speech is the main way that humans express emotion in everyday communication. With the development of artificial intelligence technology, applications such as human-computer interaction have developed rapidly, and human-like interaction, namely human-computer interaction based on emotional intelligence, has become an urgent need; speech emotion recognition is a key technical support for realizing such emotional interaction.

Current speech emotion recognition technologies fall into two categories: traditional methods based on hand-crafted features, and end-to-end methods based on artificial neural networks. The traditional methods usually require a large amount of expert knowledge, with recognition features designed and models constructed according to specific pronunciation characteristics, so their cost is usually high. The neural-network-based methods generally only require designing a network model and then letting the model learn autonomously from a large amount of labeled data to realize speech emotion recognition. At present, the neural-network-based methods outperform the traditional methods in speech emotion recognition.

Realizing speech emotion recognition with end-to-end neural network technology requires a large amount of labeled data to train the model. However, speech emotion annotation can only be performed by annotators who know the language concerned, and labeling the data needed to train a model consumes a large amount of time. This makes speech emotion recognition feasible only for certain languages with a large amount of labeled data; for languages without labeled data, speech emotion recognition is difficult to realize.

Disclosure of Invention

In order to solve the problem of cross-language speech emotion recognition in the prior art, the invention provides a method and a system for cross-language speech emotion recognition based on common feature extraction, which can effectively solve the problem of cross-language speech emotion recognition and solve the problem of phase feature processing in audio through a complex network, thereby extracting more precise pronunciation features in the audio and improving speech emotion recognition precision, and the specific technical scheme is as follows:

a cross-language speech emotion recognition method based on common feature extraction comprises the following steps:

step one, collecting English emotion voice data containing labeled information and emotion voice data of other languages without labeled information;

step two, preprocessing the emotion voice data to generate a spectrogram containing phase;

step three, removing the front and rear silent sections of the spectrogram, inputting the spectrogram into a network to obtain speech depth feature information, and calculating the maximum mean error of the speech depth features;

step four, inputting the speech depth feature information into a classification network to obtain the classification probability for the labeled data, and calculating the classification error of the labeled English emotion voice data by combining the label representation obtained by vectorizing the label data;

step five, training a cross-language emotion voice classification model according to the maximum mean error of the speech depth features and the classification error of the labeled English emotion voice data;

and step six, inputting the spectrogram obtained by processing the audio to be predicted into the trained cross-language emotion voice classification model, and predicting the speech emotion.

Further, the step one specifically includes the following steps:

S1, searching open-source data sets through the network and downloading English speech data with emotion labels, together with the corresponding label data;

S2, downloading non-English speech data without emotion labels through network search, or collecting such data through active recording.

Further, the second step specifically includes:

S3, for the speech data collected in S1 and S2, generating Mel spectrogram signals, namely spectrogram information carrying phase information, through short-time Fourier transform.

Further, the third step specifically includes the following steps:

S4, for the Mel spectrogram signals generated in S3, calculating the energy of the spectrogram frames at different times and cutting off the front and rear silent sections by setting a threshold, to obtain trimmed spectrogram information;

S5, inputting the trimmed spectrograms obtained in S4 into a feature extraction sub-network composed of complex network structures, to obtain the speech depth feature information of the two language groups;

S6, for the speech depth feature information of the two language groups obtained in S5, obtaining the model feature similarity loss through the maximum mean error, expressed as:

$$L_{mmd} = \frac{1}{n_s^{2}}\sum_{i=1}^{n_s}\sum_{j=1}^{n_s} k(x_i, x_j) + \frac{1}{n_t^{2}}\sum_{i=1}^{n_t}\sum_{j=1}^{n_t} k(y_i, y_j) - \frac{2}{n_s n_t}\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} k(x_i, y_j)$$

wherein $n_s$ is the number of English samples input to the model, $n_t$ is the number of other-language samples input to the model, $x_i$ and $x_j$ are the elements with subscripts $i$ and $j$ of the English depth feature matrix, which is represented as:

$$[x_1, x_2, \ldots, x_{n_s}]$$

$y_i$ and $y_j$ are the elements with subscripts $i$ and $j$ of the other-language depth feature matrix, which is represented as:

$$[y_1, y_2, \ldots, y_{n_t}]$$

$k$ represents a Gaussian kernel function, which can be expressed as:

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{b}\right)$$

and $b$ is a bandwidth value adjusted according to the data set.

Further, the fourth step specifically includes the following steps:

S7, inputting the speech depth feature information of the English data obtained in S5 into an emotion classification processing network to obtain the emotion prediction probability features;

S8, characterizing the label data obtained in S1 using one-hot encoding, to obtain the label representation;

S9, predicting the emotion by the emotion prediction probability characteristics obtained in S7And the tag characterization obtained at S8Calculating the model loss through a cross entropy functionThe expression is:

where C is the number of emotion categories.

Further, the fifth step specifically includes:

S10, summing the model feature similarity loss obtained in S6 and the model classification loss obtained in S9, and then optimizing the network model through neural network gradient updating to obtain the trained cross-language emotion voice classification model.

Further, the sixth step specifically includes:

S11, preprocessing the speech of any language to be predicted to generate a phase-containing Mel spectrogram signal, inputting this Mel spectrogram signal into the trained cross-language emotion voice classification model, and predicting the emotion category of the speech.

A cross-language speech emotion recognition system based on common feature extraction comprises:

the voice signal acquisition module is used for acquiring a user voice signal, wherein the voice signal comprises English emotion voice data containing labeled information and other language emotion voice data not containing labeled information;

the voice signal preprocessing module is used for preprocessing the acquired voice signals to generate a spectrogram containing phases, then carrying out end point detection to remove front and rear mute sections of the spectrogram signals and generate data which can be used for neural network processing;

the cross-language voice emotion recognition module is used for processing the spectrogram through the designed complex network model to obtain voice depth characteristic information, training an emotion recognition model and predicting the emotion type of the user audio;

and the analysis storage module is used for storing the voice data and the emotion label data of the user by utilizing an Oracle database and carrying out statistical analysis according to the actual service.

Further, the preprocessing specifically includes pre-emphasis, framing, windowing, short-time Fourier transform and silence removal operations, which convert the speech signal from a time-domain signal to a frequency-domain signal, i.e. from audio samples to audio spectral features; silence denoising is performed on the speech by spectral subtraction, pre-emphasis is performed on the speech by a Z-transform method, and spectral feature extraction is performed on the speech by short-time Fourier transform.

The invention has the advantages that:

1. according to the cross-language speech emotion recognition method based on common feature extraction, emotion information common to audios of different languages is extracted by minimizing the maximum mean error of the implicit features of different languages extracted by the network, and the purpose of cross-language speech emotion recognition is effectively achieved;

2. according to the cross-language speech emotion recognition method based on common feature extraction, speech spectrogram information is extracted using a complex network, so that phase information related to emotional pronunciation can be extracted from the speech and the model achieves higher recognition accuracy;

3. the cross-language speech emotion recognition system based on common feature extraction integrates a cross-language speech emotion recognition model, can realize cross-language speech emotion recognition, and is suitable for cross-region speech emotion recognition scenes, such as: cross-regional telephones, automatic analysis of conference content for video conferencing systems, etc.

Drawings

FIG. 1 is a schematic diagram of a cross-language speech emotion recognition system according to the present invention;

FIG. 2 is a flowchart illustrating a cross-language speech emotion recognition method according to the present invention;

FIG. 3 is a schematic diagram of a network structure of the cross-language speech emotion recognition method of the present invention.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.

As shown in fig. 2, a cross-language speech emotion recognition method based on common feature extraction includes the following steps:

s1, collecting English emotion voice data containing marking information:

Searching open-source data sets through the network and downloading English speech data with emotion labels, together with the corresponding label data.

S2, acquiring emotion voice data of other languages without labeling information:

Downloading non-English speech data without emotion labels through network search, or collecting such data through active recording.

S3, preprocessing the voice signal to generate a spectrogram containing a phase:

For the speech data collected in S1 and S2, Mel spectrogram signals, i.e. spectrogram information carrying phase information, are generated through short-time Fourier transform and related operations.
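The patent gives no code for this step; the following is a minimal Python sketch, assuming librosa and a 16 kHz sampling rate, of how a phase-bearing spectrogram can be produced. Here the complex STFT is kept as a two-channel (real, imaginary) array; the file path, FFT size and hop length are illustrative values, not values specified by the invention.

```python
import numpy as np
import librosa

def complex_spectrogram(path, sr=16000, n_fft=512, hop_length=128):
    """Load an utterance and return a spectrogram that retains phase.

    The complex STFT is split into real and imaginary channels so that
    downstream complex-valued layers can use both magnitude and phase.
    """
    y, _ = librosa.load(path, sr=sr)                             # waveform resampled to sr
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)   # complex array [freq, time]
    return np.stack([spec.real, spec.imag], axis=0)              # shape [2, freq, time]
```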

S4, removing front and rear silent sections from the spectrogram:

For the Mel spectrogram signals generated in S3, the energy of the spectrogram frames at different times is calculated, and the front and rear silent sections are cut off by setting a threshold, giving trimmed spectrogram information.
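One possible realization of this energy-threshold trimming is sketched below; the threshold value is an assumed default, not a value taken from the patent.

```python
import numpy as np

def trim_silence(spec, threshold_db=-40.0):
    """Drop leading and trailing low-energy frames of a [2, freq, time] spectrogram.

    Frames whose total energy is more than `threshold_db` below the loudest
    frame are treated as silence; only the span between the first and last
    voiced frames is kept.
    """
    power = spec[0] ** 2 + spec[1] ** 2                      # per-bin power (real^2 + imag^2)
    frame_db = 10.0 * np.log10(power.sum(axis=0) + 1e-10)    # energy per time frame, in dB
    keep = np.where(frame_db > frame_db.max() + threshold_db)[0]
    if keep.size == 0:                                       # nothing above the threshold
        return spec
    return spec[:, :, keep[0]:keep[-1] + 1]
```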

S5, inputting the spectrogram to the network to obtain the speech depth feature information:

The trimmed spectrograms obtained in S4 are input into a feature extraction sub-network composed of complex network structures, to obtain the speech depth feature information. As shown in FIG. 3, the complex (complex-valued) network structure is a neural network structure that has been applied in the signal processing field in recent years.
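The patent does not disclose the exact layers of the complex network; a common building block for such networks is a complex convolution implemented with two real convolutions, sketched below in PyTorch with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex-valued 2-D convolution built from two real convolutions.

    For input z = a + ib and kernel W = A + iB, the output is
    (A*a - B*b) + i(A*b + B*a), the usual complex-network construction.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv_re = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.conv_im = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, real, imag):
        out_re = self.conv_re(real) - self.conv_im(imag)   # real part of W * z
        out_im = self.conv_re(imag) + self.conv_im(real)   # imaginary part of W * z
        return out_re, out_im
```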

S6, calculating the maximum mean error of the speech depth features:

The speech depth feature information of the two language groups obtained in S5 is used to compute the model feature similarity loss through the maximum mean error, so that the two kinds of extracted features share a common distribution.

Specifically, the model feature similarity loss is calculated as follows:

$$L_{mmd} = \frac{1}{n_s^{2}}\sum_{i=1}^{n_s}\sum_{j=1}^{n_s} k(x_i, x_j) + \frac{1}{n_t^{2}}\sum_{i=1}^{n_t}\sum_{j=1}^{n_t} k(y_i, y_j) - \frac{2}{n_s n_t}\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} k(x_i, y_j)$$

where $n_s$ is the number of English samples input to the model and $n_t$ is the number of other-language samples input to the model. $x_i$ and $x_j$ are the elements with subscripts $i$ and $j$ of the English depth feature matrix, which can be represented as:

$$[x_1, x_2, \ldots, x_{n_s}]$$

$y_i$ and $y_j$ are the elements with subscripts $i$ and $j$ of the other-language depth feature matrix, which can be represented as:

$$[y_1, y_2, \ldots, y_{n_t}]$$

$k$ represents a Gaussian kernel function, which can be expressed as:

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{b}\right)$$

and $b$ is adjusted according to the data set and can take values such as 1.
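A compact PyTorch sketch of this maximum mean discrepancy loss with the Gaussian kernel above is given next; the bandwidth `b` is a tunable value, as stated.

```python
import torch

def gaussian_kernel(x, y, b=1.0):
    """k(x, y) = exp(-||x - y||^2 / b) for every pair of rows of x and y."""
    dist_sq = torch.cdist(x, y) ** 2        # [n_x, n_y] squared Euclidean distances
    return torch.exp(-dist_sq / b)

def mmd_loss(f_s, f_t, b=1.0):
    """Maximum mean discrepancy between English features f_s [n_s, d] and other-language features f_t [n_t, d]."""
    k_ss = gaussian_kernel(f_s, f_s, b).mean()   # (1/n_s^2) * sum k(x_i, x_j)
    k_tt = gaussian_kernel(f_t, f_t, b).mean()   # (1/n_t^2) * sum k(y_i, y_j)
    k_st = gaussian_kernel(f_s, f_t, b).mean()   # (1/(n_s*n_t)) * sum k(x_i, y_j)
    return k_ss + k_tt - 2.0 * k_st
```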

S7, inputting the speech depth characteristic information to a classification network to calculate the classification probability of the output of the labeled data:

The speech depth feature information of the English data obtained in S5 is input into an emotion classification processing network to obtain the emotion prediction probability features.

S8, vectorizing the label of the annotation data:

The label data obtained in S1 are characterized using one-hot encoding, giving the label representation.

S9, calculating the classification error of the labeling data:

The model classification loss is calculated from the emotion prediction probability features obtained in S7 and the label representation obtained in S8 through a cross-entropy function.

Specifically, the cross-entropy function is calculated as follows:

$$L_{cls} = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} \hat{y}_{i,c}\,\log p_{i,c}$$

where $C$ is the number of emotion categories, usually taking the value 7, $n_s$ is the number of labeled English samples input into the training model in one batch, $\hat{y}_{i,c}$ is the one-hot label of sample $i$ for category $c$, and $p_{i,c}$ is the corresponding predicted probability.
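The one-hot characterization of S8 and the cross-entropy loss of S9 can be realized as in the sketch below; the example label values follow the usual-value note above (C = 7) and are illustrative only.

```python
import torch
import torch.nn.functional as F

def one_hot_cross_entropy(probs, labels_onehot, eps=1e-8):
    """L_cls = -(1/n_s) * sum_i sum_c y_hat_{i,c} * log(p_{i,c}).

    `probs` are softmax outputs of shape [n_s, C]; `labels_onehot` is the
    one-hot label matrix of the same shape.
    """
    return -(labels_onehot * torch.log(probs + eps)).sum(dim=1).mean()

# One-hot characterization of integer emotion labels, as in S8 (C = 7 categories).
labels = torch.tensor([0, 3, 6])                            # three example utterances
labels_onehot = F.one_hot(labels, num_classes=7).float()    # shape [3, 7]
```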

S10, updating the training network according to the two errors to obtain the emotion recognition model M:

The model feature similarity loss from S6 and the model classification loss from S9 are summed, and the network model is then optimized through neural network gradient updating, giving the trained cross-language emotion voice classification model M.
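One possible training step combining the two losses is sketched below, using the `mmd_loss` and `one_hot_cross_entropy` helpers shown earlier; `model.extract` and `model.classify` are placeholder names for the feature-extraction sub-network and the classification network, which the patent does not name.

```python
import torch

def train_step(model, optimizer, spec_src, labels_onehot, spec_tgt, b=1.0):
    """One gradient update on the summed loss L_mmd + L_cls, as in S10."""
    optimizer.zero_grad()
    f_s = model.extract(spec_src)        # depth features of the labeled English batch
    f_t = model.extract(spec_tgt)        # depth features of the unlabeled other-language batch
    probs = model.classify(f_s)          # emotion prediction probabilities for labeled data
    loss = mmd_loss(f_s, f_t, b) + one_hot_cross_entropy(probs, labels_onehot)
    loss.backward()                      # gradients of the accumulated loss
    optimizer.step()                     # neural network gradient update
    return loss.item()
```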

S11, inputting the spectrogram obtained from the audio to be predicted into the model M, and predicting the speech emotion:

The speech of any language to be predicted is preprocessed to generate a phase-containing Mel spectrogram signal; this Mel spectrogram signal is input into the classification model M, and the emotion category of the speech is obtained through the neural network inference step.
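Prediction can then reuse the preprocessing helpers sketched above; again, `model.extract`, `model.classify` and the emotion name list are illustrative assumptions rather than names given by the patent.

```python
import torch

def predict_emotion(model, wav_path, emotion_names):
    """Preprocess one utterance of any language and return its predicted emotion (S11)."""
    spec = trim_silence(complex_spectrogram(wav_path))   # S3 + S4 preprocessing
    x = torch.from_numpy(spec).float().unsqueeze(0)      # batch of one: [1, 2, freq, time]
    with torch.no_grad():
        probs = model.classify(model.extract(x))         # forward pass through model M
    return emotion_names[int(probs.argmax(dim=1))]
```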

As shown in fig. 1, a cross-language speech emotion recognition system based on common feature extraction includes:

the voice signal acquisition module adopts a high-fidelity single microphone or a microphone array and is used for acquiring a user voice signal;

the voice signal preprocessing module is used for preprocessing the acquired voice signals to generate a spectrogram containing phase, then performing endpoint detection to remove the front and rear silent sections of the spectrogram signal, and generating data that can be used for neural network processing; the preprocessing specifically includes pre-emphasis, framing, windowing, short-time Fourier transform and silence removal operations, which convert the speech signal from a time-domain signal to a frequency-domain signal, i.e. from audio samples to audio spectral features; silence denoising is performed on the speech by spectral subtraction, pre-emphasis is performed by a Z-transform method, and spectral feature extraction is performed by short-time Fourier transform (a preprocessing sketch is given after this module list);

the cross-language voice emotion recognition module is used for processing the spectrogram through the designed complex network model to obtain voice depth characteristic information, training an emotion recognition model and predicting the emotion type of the user audio;

and the analysis storage module is used for storing the voice data and the emotion label data of the user by utilizing an Oracle database and the like and carrying out statistical analysis according to the actual service.
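As a sketch of the pre-emphasis and spectral-subtraction steps mentioned for the preprocessing module: the filter coefficient and the number of noise frames below are assumed defaults, not values given by the invention.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Pre-emphasis filter H(z) = 1 - alpha * z^-1, applied in the time domain."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def spectral_subtraction(magnitude, noise_frames=5):
    """Basic spectral subtraction: estimate the noise magnitude from the first
    few frames and subtract it from every frame, flooring the result at zero."""
    noise = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    return np.maximum(magnitude - noise, 0.0)
```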

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.
