Data processing method, data processing device, computer equipment and storage medium

Document No.: 1628119    Publication date: 2020-01-14

Note: This technology, "Data processing method, data processing device, computer equipment and storage medium", was designed and created by 黄海杰 on 2019-08-13. Abstract: The application relates to the technical field of data analysis, and provides a data processing method, a data processing device, computer equipment and a storage medium. The method comprises the following steps: obtaining first voice emotion data according to micro voice characteristics, converting audio data of an interviewer into character data, dividing the character data into a plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to words in each sentence, determining confidence coefficients of the character data belonging to all preset emotion categories according to the searching and matching results to obtain second voice emotion data, inputting the character data into the trained grammar analysis network to obtain grammar scores of the character data, obtaining video data confidence coefficients according to the micro expression characteristics, and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar scores and the video data confidence coefficients. By adopting the method, the identification accuracy can be improved, and the interview result is closer to the real situation.

1. A method of data processing, the method comprising:

acquiring audio data and video data of an interviewee;

extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features;

converting the interviewee audio data into character data, splitting the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence of the character data belonging to each preset emotion category according to a searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training first character data;

inputting the character data into a trained grammar analysis network to obtain grammar scores of sentences in the character data, calculating a grammar score average value of each sentence to obtain grammar scores of the character data, wherein the grammar analysis network is obtained by training second sample character data;

randomly intercepting video frames from the interviewer video data, extracting micro-expression characteristics of the interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics;

and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

2. The method of claim 1, wherein said extracting interviewer micro-voice features from the interviewer audio data comprises:

and calling a voice feature extraction tool, and extracting micro voice features of the interviewee according to the audio data of the interviewee, wherein the micro voice features comprise a speech speed feature, a Mel frequency cepstrum coefficient and a pitch feature.

3. The method of claim 1, wherein the obtaining of the first speech emotion data according to the micro speech feature comprises:

acquiring gender information of an interviewer, and acquiring a speech emotion classification model matched with the gender information of the interviewer from a trained speech emotion classification model set, wherein the speech emotion classification model is obtained by training sample speech data carrying labeled information, and the labeled information comprises emotion category information and gender information;

acquiring a pitch characteristic, a Mel frequency cepstrum coefficient and a speech speed characteristic in the micro voice characteristic;

and inputting the pitch feature, the Mel frequency cepstrum coefficient and the speech speed feature into a matched speech emotion classification model, and acquiring confidence coefficients of the micro speech features belonging to all preset emotion categories to obtain first speech emotion data of the micro speech features.

4. The method of claim 3, wherein before obtaining the speech emotion classification model matching the gender information of the interviewer from the trained speech emotion classification model set, further comprising:

acquiring sample voice data carrying labeling information;

dividing the sample voice data into a training set and a verification set;

performing model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set;

and performing model verification according to the verification set, and adjusting each speech emotion classification model in the speech emotion classification model set.

5. The method of claim 1, wherein the searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining the confidence level that the text data belongs to each preset emotion category according to the search and matching result to obtain the second speech emotion data comprises:

searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining a corresponding serial number of each word in each sentence in the dictionary;

inputting the corresponding serial numbers of the words in the sentences in the dictionary into the emotion classification network to obtain the confidence coefficient of each sentence in the character data belonging to each preset emotion category;

and acquiring the average value of the confidence degrees of the sentences in the character data belonging to the preset emotion categories, and acquiring the confidence degrees of the character data belonging to the preset emotion categories according to the average value of the confidence degrees.

6. The method of claim 1, wherein said determining an interview result of an interviewer based on the first speech emotion data, the second speech emotion data, the grammar score, and the video data confidence level comprises:

obtaining an audio data confidence according to the first voice emotion data, the second voice emotion data and the grammar score;

and determining an interview result of the interviewer according to the audio data confidence coefficient, the video data confidence coefficient and a preset confidence coefficient parameter.

7. A data processing apparatus, characterized in that the apparatus comprises:

the acquisition module is used for acquiring audio data and video data of the interviewer;

the first extraction module is used for extracting micro voice features of the interviewer according to the audio data of the interviewer and obtaining first voice emotion data according to the micro voice features;

the first processing module is used for converting the audio data of the interviewee into character data, dividing the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence of the character data belonging to each preset emotion category according to a searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

the second processing module is used for inputting the character data into a trained grammar analysis network to obtain grammar scores of sentences in the character data, calculating a grammar score average value of each sentence to obtain grammar scores of the character data, and the grammar analysis network is obtained by training second sample character data;

the second extraction module is used for randomly intercepting video frames from the interviewer video data, extracting micro-expression characteristics of the interviewer according to the video frames and obtaining video data confidence coefficient according to the micro-expression characteristics;

and the analysis module is used for determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence coefficient.

8. The apparatus of claim 7, wherein the first extraction module is further configured to invoke a speech feature extraction tool to extract micro speech features of the interviewer from the interviewer audio data, the micro speech features including speech rate features, mel-frequency cepstral coefficients, and pitch features.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

Technical Field

The present application relates to the field of data analysis technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.

Background

With the development of artificial intelligence, intelligent interview systems have appeared. Most traditional intelligent interview systems identify facial micro-expressions to detect abnormal expressions of the interviewee and use them as one of the bases for risk assessment. "Micro-expression" is a term from psychology. People deliberately make certain expressions to show the other party what they want them to see, yet between different expressions, or within a single expression, the face can leak other information. A micro-expression may last as little as 1/25 of a second, and although such a subconscious expression lasts only a moment, it sometimes expresses the opposite of the emotion on display.

However, relying on micro-expression characteristics alone is not enough to capture the psychological state of the interviewee accurately and comprehensively, which easily leads to a large gap between the interview result and the real situation and results in low recognition accuracy.

Disclosure of Invention

Therefore, it is necessary to provide a data processing method, an apparatus, a computer device, and a storage medium, which can improve the recognition accuracy and make the interview result closer to the real situation, so as to provide convenience for the user.

A method of data processing, the method comprising:

acquiring audio data and video data of an interviewee;

extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features;

converting audio data of the interviewee into character data, splitting the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence degree of the character data belonging to each preset emotion category according to the searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

inputting the character data into a trained grammar analysis network to obtain grammar scores of the sentences in the character data, and calculating the average of the sentence grammar scores to obtain the grammar score of the character data, wherein the grammar analysis network is obtained by training on second sample character data;

randomly intercepting video frames from interviewer video data, extracting micro-expression characteristics of an interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics;

and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

In one embodiment, extracting the micro-voice features of the interviewer from the interviewer audio data comprises:

and calling a voice feature extraction tool, and extracting micro voice features of the interviewee according to the audio data of the interviewee, wherein the micro voice features comprise a speech speed feature, a Mel frequency cepstrum coefficient and a pitch feature.

In one embodiment, obtaining the first speech emotion data according to the micro-speech feature comprises:

acquiring gender information of an interviewer, and acquiring a speech emotion classification model matched with the gender information of the interviewer from a trained speech emotion classification model set, wherein the speech emotion classification model is obtained by training sample speech data carrying labeling information, and the labeling information comprises emotion category information and gender information;

acquiring a pitch characteristic, a Mel frequency cepstrum coefficient and a speech speed characteristic in the micro-voice characteristic;

and inputting the pitch characteristic, the Mel frequency cepstrum coefficient and the speech speed characteristic into the matched speech emotion classification model, and acquiring confidence coefficients of the micro speech characteristics belonging to all preset emotion categories to obtain first speech emotion data of the micro speech characteristics.

In one embodiment, before obtaining the speech emotion classification model matched with the gender information of the interviewer from the trained speech emotion classification model set, the method further comprises the following steps:

acquiring sample voice data carrying labeling information;

dividing sample voice data into a training set and a verification set;

performing model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set;

and performing model verification according to the verification set, and adjusting each speech emotion classification model in the speech emotion classification model set.

In one embodiment, searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining confidence that the character data belongs to each preset emotion category according to the search and matching result to obtain the second speech emotion data includes:

searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining a corresponding serial number of each word in each sentence in the dictionary;

inputting the corresponding serial numbers of all words in each sentence in the dictionary into an emotion classification network to obtain the confidence coefficient of each sentence in the character data belonging to each preset emotion category;

and acquiring the average value of the confidence degrees of the sentences in the character data belonging to the preset emotion categories, and acquiring the confidence degrees of the character data belonging to the preset emotion categories according to the average value of the confidence degrees.

In one embodiment, determining the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence comprises:

obtaining an audio data confidence coefficient according to the first voice emotion data, the second voice emotion data and the grammar score;

and determining the interview result of the interviewer according to the audio data confidence coefficient, the video data confidence coefficient and the preset confidence coefficient parameter.

A data processing apparatus, the apparatus comprising:

the acquisition module is used for acquiring audio data and video data of the interviewer;

the first extraction module is used for extracting micro voice features of the interviewer according to the audio data of the interviewer and obtaining first voice emotion data according to the micro voice features;

the first processing module is used for converting audio data of the interviewee into character data, dividing the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, determining confidence coefficient of the character data belonging to each preset emotion category according to the searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

the second processing module is used for inputting the character data into the trained grammar analysis network to obtain grammar scores of sentences in the character data, calculating a grammar score average value of each sentence to obtain grammar scores of the character data, and the grammar analysis network is obtained by training the second sample character data;

the second extraction module is used for randomly intercepting video frames from the video data of the interviewer, extracting micro-expression characteristics of the interviewer according to the video frames and obtaining the confidence coefficient of the video data according to the micro-expression characteristics;

and the analysis module is used for determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence coefficient.

In one embodiment, the first extraction module is further configured to invoke a speech feature extraction tool to extract micro speech features of the interviewer according to the audio data of the interviewer, wherein the micro speech features include a speech rate feature, a mel-frequency cepstrum coefficient and a pitch feature.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

acquiring audio data and video data of an interviewee;

extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features;

converting audio data of the interviewee into character data, splitting the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence degree of the character data belonging to each preset emotion category according to the searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

inputting the character data into a trained grammar analysis network to obtain grammar scores of the sentences in the character data, and calculating the average of the sentence grammar scores to obtain the grammar score of the character data, wherein the grammar analysis network is obtained by training on second sample character data;

randomly intercepting video frames from interviewer video data, extracting micro-expression characteristics of an interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics;

and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring audio data and video data of an interviewee;

extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features;

converting audio data of the interviewee into character data, splitting the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence degree of the character data belonging to each preset emotion category according to the searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

inputting the character data into a trained grammar analysis network to obtain grammar scores of the sentences in the character data, and calculating the average of the sentence grammar scores to obtain the grammar score of the character data, wherein the grammar analysis network is obtained by training on second sample character data;

randomly intercepting video frames from interviewer video data, extracting micro-expression characteristics of an interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics;

and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

According to the data processing method, the data processing device, the computer equipment and the storage medium, micro voice features are extracted from the audio data of the interviewee, first voice emotion data are obtained according to the micro voice features, the audio data of the interviewee are converted into character data, the character data are analyzed to obtain second voice emotion data and a grammar score, micro-expression features are extracted from the video data of the interviewee, a video data confidence is obtained according to the micro-expression features, and the interview result of the interviewee is determined according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence. By identifying multiple characteristics of the interviewee through multiple modalities and combining the multiple recognition results to determine the interview result, the psychological state of the interviewee can be captured accurately and comprehensively, the recognition accuracy is improved, and the interview result is closer to the real situation.

Drawings

FIG. 1 is a diagram of an exemplary implementation of a data processing method;

FIG. 2 is a flow diagram illustrating a data processing method according to one embodiment;

FIG. 3 is a schematic illustration of a sub-flow chart of step S204 in FIG. 2 according to an embodiment;

FIG. 4 is a schematic view of a sub-flow chart of step S204 in FIG. 2 according to another embodiment;

FIG. 5 is a schematic view of a sub-flow chart of step S204 in FIG. 2 according to still another embodiment;

FIG. 6 is a schematic sub-flow chart illustrating step S206 in FIG. 2 according to an embodiment;

FIG. 7 is a schematic sub-flow chart illustrating step S212 of FIG. 2 according to an embodiment;

FIG. 8 is a block diagram showing the structure of a data processing apparatus according to an embodiment;

FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The data processing method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 obtains interviewer audio data and interviewer video data, extracts micro voice features of the interviewer according to the interviewer audio data, and obtains first voice emotion data according to the micro voice features. The server converts the interviewer audio data into character data, splits the character data into a plurality of sentences, performs word segmentation on the sentences, searches and matches a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, and determines the confidence that the character data belongs to each preset emotion category according to the search and match results to obtain second voice emotion data, wherein the emotion classification network is obtained by training on first sample character data. The server inputs the character data into a trained grammar analysis network to obtain the grammar score of each sentence in the character data, and calculates the average of the sentence grammar scores to obtain the grammar score of the character data, wherein the grammar analysis network is obtained by training on second sample character data. The server randomly intercepts video frames from the interviewer video data, extracts micro-expression features of the interviewer according to the video frames, and obtains a video data confidence according to the micro-expression features. Finally, the server determines an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence, and pushes the interview result to the terminal 102. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

In one embodiment, as shown in fig. 2, a data processing method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:

s202: interviewer audio data and interviewer video data are obtained.

The interviewer video data refers to the video recorded of the interviewer while being interviewed, the interviewer audio data refers to the audio of the interviewer while being interviewed, and the interviewer audio data can be extracted from the interviewer video data.

S204: and extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features.

The server can extract micro voice features of the interviewee from the audio data of the interviewee by calling a voice feature extraction tool, wherein the micro voice features comprise a speech rate feature, a pitch feature and Mel frequency cepstrum coefficients. The speech rate refers to the number of words per second in the voice data (the words can be Chinese or English); the pitch refers to the voice frequency; the Mel frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on a nonlinear Mel scale of the voice frequency, and the Mel frequency cepstrum coefficients are the coefficients that make up the Mel frequency cepstrum. The server inputs the micro voice features into the voice emotion classification model in the trained voice emotion classification model set that matches the gender information of the interviewer, and thereby obtains first voice emotion data corresponding to the micro voice features, where the first voice emotion data refers to the confidence that the micro voice features belong to each preset emotion category.
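As an illustration of this step, the sketch below extracts the three micro voice features; librosa, the 16 kHz sample rate and the transcript-based word count are assumptions, since the application only refers to a generic voice feature extraction tool.

```python
# Illustrative sketch only; librosa and the transcript argument are assumptions.
import librosa
import numpy as np

def extract_micro_voice_features(wav_path: str, transcript: str) -> dict:
    y, sr = librosa.load(wav_path, sr=16000)     # interviewer audio segment
    duration = len(y) / sr

    # Speech rate: words per second (for Chinese text, characters could be counted instead).
    speech_rate = len(transcript.split()) / duration if duration > 0 else 0.0

    # Pitch: per-frame fundamental frequency, summarized by mean and standard deviation.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)
    pitch_mean, pitch_std = float(np.mean(f0)), float(np.std(f0))

    # Mel frequency cepstrum coefficients: FFT -> Mel filter bank -> log -> DCT.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return {
        "speech_rate": speech_rate,
        "pitch_mean": pitch_mean,
        "pitch_std": pitch_std,
        "mfcc": mfcc.mean(axis=1),               # frame-averaged MFCC vector
    }
```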

The trained speech emotion classification model set comprises speech emotion classification models obtained by training sample data of interviewers with different genders, namely an emotion classification model for analyzing male speech data and an emotion classification model for analyzing female speech data. The server acquires the sex information of the interviewer, matches the trained speech emotion classification model set according to the sex information of the interviewer, and acquires the speech emotion classification model matched with the sex information of the interviewer from the trained speech emotion classification model set. The voice emotion classification model is obtained by training sample voice data carrying labeling information, and the labeling information comprises emotion category information and gender information. The server divides the sample voice data according to the gender information, and respectively performs model training according to the divided sample voice data to obtain a voice emotion classification model set.

S206: the interviewee audio data are converted into text data, the text data are divided into a plurality of sentences, the plurality of sentences are subjected to word segmentation, a preset dictionary corresponding to the trained emotion classification network is searched and matched according to words in each sentence, the confidence degree of the text data belonging to each preset emotion category is determined according to the searching and matching result, and second voice emotion data are obtained, wherein the emotion classification network is obtained by training the first text word data.

The emotion classification network may be a network in which a classification layer containing N neurons (assuming N preset emotion categories) is superimposed on a BERT base model. The server divides the character data into a plurality of sentences, segments each sentence into words, looks each word in each sentence up in the dictionary matched with BERT, converts each word into its corresponding serial number in the BERT dictionary, and inputs the serial numbers of the whole sentence into BERT to obtain the confidence that each sentence belongs to each preset emotion category; the confidence that the character data belongs to each preset emotion category is then determined from the per-sentence confidences to obtain the second voice emotion data. The emotion classification network can be obtained by training on first sample character data, where each sample sentence in the first sample character data carries labeling information, namely the emotion category of the sample sentence.
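As a rough sketch of such a network, the snippet below stacks a classification head on a BERT encoder with the Hugging Face transformers library and averages the per-sentence confidences; the pretrained model name, the value of N and the use of transformers are assumptions not specified here.

```python
# Hypothetical sketch: BERT plus an N-way classification layer for sentence emotion.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

N_EMOTIONS = 6  # assumed number of preset emotion categories

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese",
                                                      num_labels=N_EMOTIONS)
model.eval()

def sentence_emotion_confidences(sentences: list[str]) -> torch.Tensor:
    """Per-sentence confidences over the preset emotion categories."""
    # The tokenizer looks each word up in BERT's dictionary and replaces it
    # with its serial number (token id), as the method describes.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)

def second_voice_emotion_data(sentences: list[str]) -> torch.Tensor:
    # Confidence of the whole text per category = average of the sentence confidences.
    return sentence_emotion_confidences(sentences).mean(dim=0)
```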

Because text data requires less cache space than audio data and video data, converting the interviewer audio data into text data and processing the text data saves cache space on the server during processing and thus optimizes the server's cache usage.

S208: and inputting the character data into a trained grammar analysis network to obtain grammar scores of sentences in the character data, calculating a grammar score average value of each sentence to obtain grammar scores of the character data, and training the grammar analysis network by second sample character data to obtain the grammar scores of the character data.

When training the grammar analysis network, the second sample text data may be a corpus of linguistic acceptability judgments such as CoLA (Corpus of Linguistic Acceptability) or a Chinese equivalent, in which the data set contains a number of single sentences carrying labels indicating whether the grammar is correct (0 for incorrect, 1 for correct). After training, the grammar analysis network can be used to judge the grammatical accuracy of a sentence: the grammar score ranges from 0 to 1, where 0 represents incorrect grammar, 1 represents correct grammar, and a confidence between 0 and 1 is understood as the degree of grammatical accuracy. After obtaining the grammar score of each sentence in the character data, the server calculates the average of the sentence grammar scores to obtain the grammar score of the character data. The grammar analysis network learns automatically from the character data, without the need to split and match the grammatical structures of the sentences in the character data.
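A minimal sketch of such a grammar scorer is shown below, assuming a binary acceptability classifier (label 1 = grammatical) whose positive-class probability is read as the sentence grammar score; the pretrained model name and the transformers library are assumptions.

```python
# Hypothetical sketch of the grammar score; the classifier is assumed to have been
# fine-tuned on CoLA-style acceptability data (0 = incorrect, 1 = correct).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
grammar_net = BertForSequenceClassification.from_pretrained("bert-base-chinese",
                                                            num_labels=2)
grammar_net.eval()

def text_grammar_score(sentences: list[str]) -> float:
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(grammar_net(**inputs).logits, dim=-1)
    per_sentence = probs[:, 1]           # grammatical-accuracy confidence per sentence
    return float(per_sentence.mean())    # grammar score of the text = sentence average
```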

S210: randomly intercepting video frames from interviewer video data, extracting micro-expression characteristics of the interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics.

The server randomly intercepts video frames from interviewer video data according to preset time intervals, acquires micro-expression characteristics of an interviewer according to the video frames, inputs the micro-expression characteristics into a trained micro-expression model, obtains confidence coefficients of the micro-expression characteristics belonging to each preset emotion category, sorts the confidence coefficients of the micro-expression characteristics belonging to each preset emotion category, acquires the maximum confidence coefficient, and obtains video data confidence coefficients. Wherein, the micro-expression model is obtained by training sample micro-expression data.
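The snippet below sketches this step with OpenCV: one frame is sampled at random inside each preset interval, an assumed micro-expression model returns per-category confidences for each frame, and the largest aggregated category confidence becomes the video data confidence. OpenCV, the interval length and the frame-averaging step are assumptions.

```python
# Hypothetical sketch of S210; `micro_expression_model` is an assumed callable that
# maps a frame to confidences over the preset emotion categories.
import random
import cv2
import numpy as np

def video_data_confidence(video_path: str, micro_expression_model,
                          interval_s: float = 2.0) -> float:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, int(fps * interval_s))

    per_frame = []
    for start in range(0, total, step):
        # Randomly intercept one frame within each preset time interval.
        idx = min(start + random.randint(0, step - 1), total - 1)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            per_frame.append(np.asarray(micro_expression_model(frame), dtype=float))
    cap.release()

    if not per_frame:
        return 0.0
    per_category = np.mean(per_frame, axis=0)  # aggregate the frames per category (assumed)
    return float(np.max(per_category))         # maximum confidence across emotion categories
```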

S212: and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

The server can obtain the audio data confidence by inputting the first voice emotion data, the second voice emotion data and the grammar score into a trained audio classification model, and then determines the interview result of the interviewer according to the audio data confidence, the video data confidence and the confidence coefficient parameters. Specifically, the input parameters of the audio classification model include the confidences of the audio data belonging to the preset emotion categories in the first voice emotion data, the confidences of the text data belonging to the preset emotion categories in the second voice emotion data, and the grammar score. When the audio classification model is trained, sample voice data and sample text data carrying labeling information can be used as the training set, where the labeling information marks whether the interviewee corresponding to the sample voice data and sample text data is lying. The confidence coefficient parameters are adjustable and can be set as required.
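One way to realize such an audio classification model is sketched below: the per-category confidences from the first and second voice emotion data are concatenated with the grammar score and fed to a small classifier trained on the lie/no-lie labels mentioned above; scikit-learn and logistic regression are illustrative assumptions, since the application does not name a model type.

```python
# Hypothetical sketch of the audio classification model; logistic regression stands
# in for the unspecified model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def audio_feature_vector(first_emotion: np.ndarray, second_emotion: np.ndarray,
                         grammar_score: float) -> np.ndarray:
    # Input parameters: both sets of emotion confidences plus the grammar score.
    return np.concatenate([first_emotion, second_emotion, [grammar_score]])

def train_audio_model(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    # y marks whether the interviewee in each training sample is lying (1) or not (0).
    return LogisticRegression(max_iter=1000).fit(X, y)

def audio_data_confidence(model: LogisticRegression, first_emotion: np.ndarray,
                          second_emotion: np.ndarray, grammar_score: float) -> float:
    x = audio_feature_vector(first_emotion, second_emotion, grammar_score)
    return float(model.predict_proba(x.reshape(1, -1))[0, 1])
```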

In one embodiment, as shown in fig. 3, S204 includes:

s302: calling a voice feature extraction tool, and extracting micro voice features of the interviewee according to audio data of the interviewee, wherein the micro voice features comprise a speech speed feature, a Mel frequency cepstrum coefficient and a pitch feature;

s304: and inputting the micro voice features into the matched voice emotion classification model to obtain first voice emotion data corresponding to the micro voice features.

Calling the voice feature extraction tool, the Mel frequency cepstrum coefficients are extracted as follows: perform a fast Fourier transform on the interviewer audio data to obtain a frequency spectrum, map the spectrum onto the Mel scale, take the logarithm, and then perform a discrete cosine transform to obtain the Mel frequency cepstrum coefficients. The pitch characteristics include a current segment pitch average, a current segment pitch standard deviation, a historical pitch average and a historical pitch standard deviation. The current segment pitch average is extracted as follows: perform a fast Fourier transform on the interviewer audio data to obtain a spectrogram of the audio data, calculate the variance between each frequency band and the center value of the spectrum, sum the variances, and take the square root. The historical pitch average and standard deviation refer to the average and standard deviation of the interviewer's pitch from the beginning of the interview up to the current segment; these values are stored in the server once the interview starts. For convenience of calculation, they can be approximated with an exponential moving average, updated as follows:

historical pitch average = α × historical pitch average + (1 − α) × current segment pitch average

historical pitch standard deviation = α × historical pitch standard deviation + (1 − α) × current segment pitch standard deviation

where α is a weighting parameter between 0 and 1 that can be set as required; it defaults to 0.9 here.

The speech rate characteristics include the current speech rate, the historical speech rate average and the historical speech rate standard deviation; the historical speech rate average and standard deviation are calculated and stored by the server once the interview starts. Similarly, for convenience of calculation, they can be approximated with an exponential moving average, updated as follows:

historical speech rate average = α × historical speech rate average + (1 − α) × current speech rate

historical speech rate mean square error = α × historical speech rate mean square error + (1 − α) × (current speech rate − historical speech rate average)²

historical speech rate standard deviation = square root of the historical speech rate mean square error
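A direct implementation of these running statistics is sketched below (α defaults to 0.9, as stated); the class and field names are illustrative, and the mean square update uses the pre-update historical average, a detail the formulas leave open.

```python
# Running micro voice statistics maintained across the interview (illustrative names).
from dataclasses import dataclass
import math

@dataclass
class RunningVoiceStats:
    alpha: float = 0.9        # weighting parameter, default 0.9 as in the text
    pitch_avg: float = 0.0    # historical pitch average
    pitch_std: float = 0.0    # historical pitch standard deviation
    rate_avg: float = 0.0     # historical speech rate average
    rate_msq: float = 0.0     # historical speech rate mean square error

    def update_pitch(self, cur_avg: float, cur_std: float) -> None:
        a = self.alpha
        self.pitch_avg = a * self.pitch_avg + (1 - a) * cur_avg
        self.pitch_std = a * self.pitch_std + (1 - a) * cur_std

    def update_speech_rate(self, cur_rate: float) -> None:
        a = self.alpha
        dev = cur_rate - self.rate_avg                      # deviation from the historical average
        self.rate_avg = a * self.rate_avg + (1 - a) * cur_rate
        self.rate_msq = a * self.rate_msq + (1 - a) * dev ** 2

    @property
    def rate_std(self) -> float:
        # Historical speech rate standard deviation = sqrt of the mean square error.
        return math.sqrt(self.rate_msq)
```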

In this embodiment, a voice feature extraction tool is called to extract the micro voice features of the interviewer from the interviewer audio data, thereby completing the extraction of the micro voice features.

In one embodiment, as shown in fig. 4, S204 includes:

s402: acquiring gender information of an interviewer, and acquiring a speech emotion classification model matched with the gender information of the interviewer from a trained speech emotion classification model set, wherein the speech emotion classification model is obtained by training sample speech data carrying labeling information, and the labeling information comprises emotion category information and gender information;

s404: acquiring a pitch characteristic, a Mel frequency cepstrum coefficient and a speech speed characteristic in the micro-voice characteristic;

s406: and inputting the pitch characteristic, the Mel frequency cepstrum coefficient and the speech speed characteristic into the matched speech emotion classification model, and acquiring confidence coefficients of the micro speech characteristics belonging to all preset emotion categories to obtain first speech emotion data of the micro speech characteristics.

The trained speech emotion classification model set comprises speech emotion classification models obtained by training sample data of interviewers with different genders, namely an emotion classification model for analyzing male speech data and an emotion classification model for analyzing female speech data. The server acquires the sex information of the interviewer, matches the trained speech emotion classification model set according to the sex information of the interviewer, and acquires the speech emotion classification model matched with the sex information of the interviewer from the trained speech emotion classification model set. The voice emotion classification model is obtained by training sample voice data carrying labeling information, and the labeling information comprises emotion category information and gender information. The server divides the sample voice data according to the gender information, and respectively performs model training according to the divided sample voice data to obtain a voice emotion classification model set.

The pitch characteristics include a current segment pitch average, a current segment pitch standard deviation, a historical pitch average and a historical pitch standard deviation, and the speech rate characteristics include the current speech rate, the historical speech rate average and the historical speech rate standard deviation. The server can input all of these three types of features as parameters into the matched speech emotion classification model, and the convolutional neural network in the speech emotion classification model synthesizes all the features to give the confidence that the micro voice features belong to each preset emotion category.

In this embodiment, the matched speech emotion classification model is obtained according to the gender information of the interviewer, and the pitch feature, the Mel frequency cepstrum coefficients and the speech rate feature are input into the matched model to obtain the confidence that the micro voice features belong to each preset emotion category, thereby obtaining the first voice emotion data of the micro voice features.

In one embodiment, as shown in fig. 5, before S402, the method further includes:

s502: acquiring sample voice data carrying labeling information;

s504: dividing sample voice data into a training set and a verification set;

s506: performing model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set;

s508: and performing model verification according to the verification set, and adjusting each speech emotion classification model in the speech emotion classification model set.

After obtaining sample voice data carrying labeling information, the server first divides the sample voice data into a first sample voice data set and a second sample voice data set according to the gender information in the labeling information, and then divides the first sample voice data set and the second sample voice data set into a training set and a verification set respectively. Model training is performed on the training sets of the first and second sample voice data sets to obtain a first speech emotion classification model and a second speech emotion classification model, and model verification is performed on the corresponding verification sets to adjust the first and second speech emotion classification models. The first sample voice data set and the second sample voice data set each contain only sample voice data of interviewers of the same gender.
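The sketch below illustrates this data preparation: label-carrying samples are split by gender and each gender subset is further divided into a training set and a verification set; the 80/20 ratio and the sample layout are assumptions.

```python
# Hypothetical sketch of the gender-wise training/verification split (S502-S506).
import random
from typing import Dict, List, Tuple

Sample = Dict  # e.g. {"features": ..., "emotion": ..., "gender": "male" or "female"}

def split_by_gender(samples: List[Sample], train_ratio: float = 0.8,
                    seed: int = 0) -> Dict[str, Tuple[List[Sample], List[Sample]]]:
    rng = random.Random(seed)
    splits = {}
    for gender in ("male", "female"):
        subset = [s for s in samples if s["gender"] == gender]
        rng.shuffle(subset)
        cut = int(len(subset) * train_ratio)
        splits[gender] = (subset[:cut], subset[cut:])   # (training set, verification set)
    return splits

# One speech emotion classification model is then trained per gender subset and
# adjusted on its verification set, forming the speech emotion classification model set.
```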

In this embodiment, sample voice data carrying labeling information is acquired and divided into a training set and a verification set; model training is performed on the training set and model verification on the verification set, thereby obtaining and adjusting the speech emotion classification models in the speech emotion classification model set.

In one embodiment, as shown in fig. 6, S206 includes:

s602: searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining a corresponding serial number of each word in each sentence in the dictionary;

s604: inputting the corresponding serial numbers of all words in each sentence in the dictionary into an emotion classification network to obtain the confidence coefficient of each sentence in the character data belonging to each preset emotion category;

s606: and acquiring the average value of the confidence degrees of the sentences in the character data belonging to the preset emotion categories, and acquiring the confidence degrees of the characters data belonging to the preset emotion categories according to the average value of the confidence degrees.

The emotion classification network may be a network in which a classification layer containing N neurons (assuming N preset emotion categories) is superimposed on a BERT base model. The server divides the character data into a plurality of sentences, segments each sentence into words, looks each word in each sentence up in the dictionary matched with BERT, converts each word into its corresponding serial number in the BERT dictionary, and inputs the serial numbers of the whole sentence into BERT to obtain the confidence that each sentence belongs to each preset emotion category; the confidence that the character data belongs to each preset emotion category is then determined from the per-sentence confidences to obtain the second voice emotion data. The emotion classification network can be obtained by training on first sample character data, where each sample sentence in the first sample character data carries labeling information, namely the emotion category of the sample sentence.

In this embodiment, the dictionary serial numbers corresponding to the words in each sentence are input into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category, and the confidence that the text data as a whole belongs to each preset emotion category is then derived from these per-sentence confidences.

In one embodiment, as shown in fig. 7, S212 includes:

s702: obtaining an audio data confidence coefficient according to the first voice emotion data, the second voice emotion data and the grammar score;

s704: and determining the interview result of the interviewer according to the audio data confidence coefficient, the video data confidence coefficient and the preset confidence coefficient parameter.

The server can obtain the audio data confidence by inputting the first voice emotion data, the second voice emotion data and the grammar score into a trained audio classification model, and then determines the interview result of the interviewer according to the audio data confidence, the video data confidence and the confidence coefficient parameters. Specifically, the input parameters of the audio classification model include the confidences of the audio data belonging to the preset emotion categories in the first voice emotion data, the confidences of the text data belonging to the preset emotion categories in the second voice emotion data, and the grammar score. When the audio classification model is trained, sample voice data and sample text data carrying labeling information can be used as the training set, where the labeling information marks whether the interviewee corresponding to the sample voice data and sample text data is lying. The confidence coefficient parameters are adjustable and can be set as required. The interview result can be obtained from an interview score, which can be calculated as: interview score = A × audio data confidence + B × video data confidence, where A and B are the preset confidence coefficient parameters.
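A minimal sketch of this interview score, with A and B as the adjustable confidence coefficient parameters (the default values shown are assumptions):

```python
# Weighted combination of the audio and video data confidences (defaults are illustrative).
def interview_score(audio_confidence: float, video_confidence: float,
                    A: float = 0.5, B: float = 0.5) -> float:
    # A and B are the preset, adjustable confidence coefficient parameters.
    return A * audio_confidence + B * video_confidence
```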

It should be understood that although the various steps in the flow charts of fig. 2-7 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-7 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages need not be performed sequentially, and may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 8, there is provided a data processing apparatus including: an obtaining module 802, a first extracting module 804, a first processing module 806, a second processing module 808, a second extracting module 810, and an analyzing module 812, wherein:

an obtaining module 802, configured to obtain interviewer audio data and interviewer video data;

the first extraction module 804 is used for extracting micro voice features of the interviewer according to the audio data of the interviewer and obtaining first voice emotion data according to the micro voice features;

the first processing module 806 is configured to convert audio data of the interviewer into text data, split the text data into a plurality of sentences, perform word segmentation on the plurality of sentences, search and match a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determine confidence levels of the text data belonging to each preset emotion category according to the search and match results, and obtain second speech emotion data, wherein the emotion classification network is obtained by training on first sample text data;

a second processing module 808, configured to input the text data into a trained grammar analysis network to obtain grammar scores of the sentences in the text data, and calculate the average of the sentence grammar scores to obtain the grammar score of the text data, wherein the grammar analysis network is obtained by training on second sample text data;

the second extraction module 810 is configured to randomly intercept a video frame from the interviewer video data, extract micro-expression features of the interviewer according to the video frame, and obtain a video data confidence level according to the micro-expression features;

and the analysis module 812 is configured to determine an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

The data processing device extracts micro voice features from the audio data of the interviewee, obtains first voice emotion data according to the micro voice features, converts the audio data of the interviewee into character data, analyzes the character data to obtain second voice emotion data and a grammar score, extracts micro-expression features from the video data of the interviewee, obtains a video data confidence according to the micro-expression features, and determines the interview result of the interviewee according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence. By identifying multiple characteristics of the interviewee through multiple modalities and combining the multiple recognition results to determine the interview result, the psychological state of the interviewee can be captured accurately and comprehensively, the recognition accuracy is improved, and the interview result is closer to the real situation.

In one embodiment, the first extraction module is further configured to invoke a speech feature extraction tool to extract micro speech features of the interviewer according to the audio data of the interviewer, wherein the micro speech features include a speech rate feature, a mel-frequency cepstrum coefficient and a pitch feature.

In one embodiment, the first extraction module is further configured to obtain gender information of the interviewer, obtain a speech emotion classification model matched with the gender information of the interviewer from a trained speech emotion classification model set, where the speech emotion classification model is obtained by training sample speech data carrying tagging information, where the tagging information includes emotion classification information and gender information, obtain a pitch feature, a mel-frequency cepstrum coefficient, and a speech rate feature in micro-speech features, input the pitch feature, the mel-frequency cepstrum coefficient, and the speech rate feature into the matched speech emotion classification model, obtain confidence coefficients of the micro-speech features belonging to respective preset emotion classifications, and obtain first speech emotion data of the micro-speech features.

In one embodiment, the first extraction module is further configured to obtain sample voice data carrying tagging information, divide the sample voice data into a training set and a verification set, perform model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set, perform model verification according to the verification set, and adjust each speech emotion classification model in the speech emotion classification model set.

In one embodiment, the first processing module is further configured to search and match a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, determine a serial number corresponding to each word in each sentence in the dictionary, input the serial number corresponding to each word in each sentence in the dictionary into the emotion classification network, obtain a confidence level that each sentence in the text data belongs to each preset emotion category, obtain an average value of the confidence levels that each sentence in the text data belongs to each preset emotion category, and obtain a confidence level that the text data belongs to each preset emotion category according to the average value of the confidence levels.

In one embodiment, the analysis module is further configured to obtain an audio data confidence level according to the first voice emotion data, the second voice emotion data and the grammar score, and determine an interview result of the interviewer according to the audio data confidence level, the video data confidence level and a preset confidence level parameter.

For specific limitations of the data processing apparatus, reference may be made to the above limitations of the data processing method, which are not described herein again. The various modules in the data processing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data processing method.

Those skilled in the art will appreciate that the structure shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:

acquiring audio data and video data of an interviewee;

extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features;

converting audio data of the interviewee into character data, splitting the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence degree of the character data belonging to each preset emotion category according to the searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

inputting the character data into a trained grammar analysis network to obtain grammar scores of sentences in the character data, and calculating an average value of the grammar scores of the sentences to obtain the grammar score of the character data, wherein the grammar analysis network is obtained by training second sample character data (a minimal sketch of this sentence-level scoring and averaging is given after these steps);

randomly intercepting video frames from interviewer video data, extracting micro-expression characteristics of an interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics;

and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.
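As referenced in the steps above, a minimal sketch of the sentence-level grammar scoring and averaging is given here; `grammar_net` stands in for the trained grammar analysis network and is assumed to return a numeric grammar score for one sentence, and the sentence-splitting rule is an illustrative assumption.

import re
import numpy as np

def grammar_score_of_text(text, grammar_net):
    """Split the character data into sentences, score each sentence with the
    grammar analysis network, and average the scores to obtain the grammar
    score of the character data."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", text) if s.strip()]
    sentence_scores = [grammar_net(sentence) for sentence in sentences]
    return float(np.mean(sentence_scores))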

The above data processing computer device extracts micro voice features according to audio data of the interviewer, obtains first voice emotion data according to the micro voice features, converts the audio data of the interviewer into character data, analyzes the character data to obtain second voice emotion data and a grammar score, extracts micro expression features according to video data of the interviewer, obtains a video data confidence degree according to the micro expression features, and determines the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence degree. By recognizing multiple characteristics of the interviewee in multiple modalities and combining the multiple recognition results to determine the interview result, the psychological state of the interviewee can be captured accurately and comprehensively, the recognition accuracy is improved, and the interview result is closer to the real situation.

In one embodiment, the processor, when executing the computer program, further performs the steps of:

and calling a voice feature extraction tool to extract the micro voice features of the interviewee according to the audio data of the interviewee, wherein the micro voice features comprise a speech rate feature, a Mel frequency cepstrum coefficient, and a pitch feature.
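One possible way to implement such a voice feature extraction tool is sketched below with librosa; this application does not name a specific tool, so the library choice is an assumption, and the onsets-per-second value is only a rough stand-in for the speech rate feature.

import numpy as np
import librosa

def extract_micro_voice_features(audio_path):
    """Extract a pitch feature, Mel frequency cepstrum coefficients, and a
    rough speech rate feature from one audio file."""
    y, sr = librosa.load(audio_path, sr=None)
    duration = len(y) / sr
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # Mel frequency cepstrum coefficients
    f0 = librosa.yin(y, fmin=65, fmax=500, sr=sr)                    # frame-level pitch estimates
    pitch = float(np.mean(f0))                                       # pitch feature
    onsets = librosa.onset.onset_detect(y=y, sr=sr)                  # syllable-like onsets
    speech_rate = len(onsets) / duration                             # speech rate proxy (onsets per second)
    return np.concatenate(([pitch, speech_rate], mfcc))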

In one embodiment, the processor, when executing the computer program, further performs the steps of:

acquiring gender information of an interviewer, and acquiring a speech emotion classification model matched with the gender information of the interviewer from a trained speech emotion classification model set, wherein the speech emotion classification model is obtained by training sample speech data carrying labeling information, and the labeling information comprises emotion category information and gender information;

acquiring a pitch characteristic, a Mel frequency cepstrum coefficient and a speech speed characteristic in the micro-voice characteristic;

and inputting the pitch characteristic, the Mel frequency cepstrum coefficient and the speech speed characteristic into the matched speech emotion classification model, and acquiring confidence coefficients of the micro speech characteristics belonging to all preset emotion categories to obtain first speech emotion data of the micro speech characteristics.

In one embodiment, the processor, when executing the computer program, further performs the steps of:

acquiring sample voice data carrying labeling information;

dividing sample voice data into a training set and a verification set;

performing model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set;

and performing model verification according to the verification set, and adjusting each speech emotion classification model in the speech emotion classification model set.

In one embodiment, the processor, when executing the computer program, further performs the steps of:

searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining a corresponding serial number of each word in each sentence in the dictionary;

inputting the corresponding serial numbers of all words in each sentence in the dictionary into an emotion classification network to obtain the confidence coefficient of each sentence in the character data belonging to each preset emotion category;

and acquiring an average value of the confidence degrees of the sentences in the character data belonging to each preset emotion category, and obtaining the confidence degree of the character data belonging to each preset emotion category according to the average value of the confidence degrees.

In one embodiment, the processor, when executing the computer program, further performs the steps of:

obtaining an audio data confidence coefficient according to the first voice emotion data, the second voice emotion data and the grammar score;

and determining the interview result of the interviewer according to the audio data confidence coefficient, the video data confidence coefficient and the preset confidence coefficient parameter.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

acquiring audio data and video data of an interviewee;

extracting micro voice features of the interviewer according to the audio data of the interviewer, and obtaining first voice emotion data according to the micro voice features;

converting audio data of the interviewee into character data, splitting the character data into a plurality of sentences, segmenting the plurality of sentences, searching and matching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining confidence degree of the character data belonging to each preset emotion category according to the searching and matching result, and obtaining second voice emotion data, wherein the emotion classification network is obtained by training the first character data;

inputting the character data into a trained grammar analysis network to obtain grammar scores of sentences in the character data, and calculating an average value of the grammar scores of the sentences to obtain the grammar score of the character data, wherein the grammar analysis network is obtained by training second sample character data;

randomly intercepting video frames from interviewer video data, extracting micro-expression characteristics of an interviewer according to the video frames, and obtaining video data confidence according to the micro-expression characteristics;

and determining an interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the video data confidence.

The above data processing storage medium extracts micro voice features according to audio data of the interviewee, obtains first voice emotion data according to the micro voice features, converts the audio data of the interviewee into character data, analyzes the character data to obtain second voice emotion data and a grammar score, extracts micro expression features according to video data of the interviewee, obtains a video data confidence degree according to the micro expression features, and determines the interview result of the interviewee according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence degree. By recognizing multiple characteristics of the interviewee in multiple modalities and combining the multiple recognition results to determine the interview result, the psychological state of the interviewee can be captured accurately and comprehensively, the recognition accuracy is improved, and the interview result is closer to the real situation.

In one embodiment, the computer program when executed by the processor further performs the steps of:

and calling a voice feature extraction tool to extract the micro voice features of the interviewee according to the audio data of the interviewee, wherein the micro voice features comprise a speech rate feature, a Mel frequency cepstrum coefficient, and a pitch feature.

In one embodiment, the computer program when executed by the processor further performs the steps of:

acquiring gender information of an interviewer, and acquiring a speech emotion classification model matched with the gender information of the interviewer from a trained speech emotion classification model set, wherein the speech emotion classification model is obtained by training sample speech data carrying labeling information, and the labeling information comprises emotion category information and gender information;

acquiring a pitch characteristic, a Mel frequency cepstrum coefficient and a speech speed characteristic in the micro-voice characteristic;

and inputting the pitch characteristic, the Mel frequency cepstrum coefficient and the speech speed characteristic into the matched speech emotion classification model, and acquiring confidence coefficients of the micro speech characteristics belonging to all preset emotion categories to obtain first speech emotion data of the micro speech characteristics.

In one embodiment, the computer program when executed by the processor further performs the steps of:

acquiring sample voice data carrying labeling information;

dividing sample voice data into a training set and a verification set;

performing model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set;

and performing model verification according to the verification set, and adjusting each speech emotion classification model in the speech emotion classification model set.

In one embodiment, the computer program when executed by the processor further performs the steps of:

searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining a corresponding serial number of each word in each sentence in the dictionary;

inputting the corresponding serial numbers of all words in each sentence in the dictionary into an emotion classification network to obtain the confidence coefficient of each sentence in the character data belonging to each preset emotion category;

and acquiring an average value of the confidence degrees of the sentences in the character data belonging to each preset emotion category, and obtaining the confidence degree of the character data belonging to each preset emotion category according to the average value of the confidence degrees.

In one embodiment, the computer program when executed by the processor further performs the steps of:

obtaining an audio data confidence coefficient according to the first voice emotion data, the second voice emotion data and the grammar score;

and determining the interview result of the interviewer according to the audio data confidence coefficient, the video data confidence coefficient and the preset confidence coefficient parameter.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to be within the scope of this specification.

The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
