Voice classification method and device based on voiceprint recognition and related equipment

Document No.: 36636    Publication date: 2021-09-24

Note: This technology, "Voice classification method and device based on voiceprint recognition and related equipment" (基于声纹识别的语音分类方法、装置及相关设备), was designed and created by 李少军 (Li Shaojun) and 杨杰 (Yang Jie) on 2021-07-30. Its main content is as follows: The application relates to data processing technology and provides a voice classification method, apparatus, computer device and storage medium based on voiceprint recognition, comprising the following steps: preprocessing historical voice session data; transforming the voice session segments into target spectrum data; training a target audio coding model; calling the target audio coding model to process the audio data of all agents to obtain agent audio codes, and creating a voiceprint library; receiving voice session data to be verified, and calling the target audio coding model to process it to obtain a set of audio codes to be verified; calculating the similarity between each audio code to be verified and the agent audio codes in the voiceprint library, and detecting whether there is a target audio code to be verified whose similarity with the agent audio codes does not exceed a preset similarity threshold; and when such a code exists, determining that the voice session data to be verified is real voice session data. The application can improve the accuracy and efficiency of voice classification and promote the rapid development of smart cities.

1. A voice classification method based on voiceprint recognition is characterized by comprising the following steps:

preprocessing historical voice session data to obtain a set of voice session segments of a preset duration;

calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data;

acquiring an initial audio coding model, and deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model;

calling the improved initial audio coding model to aggregate the target spectrum data to obtain spectral features;

taking the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors, training the improved initial audio coding model to obtain a trained target audio coding model;

calling the target audio coding model to process the audio data of all agents to obtain the agent audio code of each agent, and creating a voiceprint library corresponding to all agents according to the agent audio codes;

receiving voice session data to be verified, and calling the target audio coding model to process the voice session data to be verified to obtain a set of audio codes to be verified;

calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library, and detecting whether there is a target audio code to be verified whose similarity with the agent audio codes does not exceed a preset similarity threshold;

and when the detection result shows that the similarity between the target audio code to be verified and the agent audio codes does not exceed the preset similarity threshold, determining that the voice session data to be verified is real voice session data.

2. The voice classification method based on voiceprint recognition according to claim 1, wherein the preprocessing historical voice session data to obtain a set of voice session segments of a preset duration comprises:

acquiring the start time and the end time of each voice session in the historical voice session data, and determining the session duration of the voice session according to the start time and the end time;

deleting, from the historical voice session data, the voice sessions whose session duration does not exceed a preset duration threshold to obtain first voice session data;

calling a VAD (voice activity detection) technique to detect noise segments in each voice session of the first voice session data, and deleting the voice sessions whose number of noise segments exceeds a preset quantity threshold to obtain second voice session data;

and cutting the second voice session data according to a preset duration to obtain a set of voice session segments.

3. The voice classification method based on voiceprint recognition according to claim 1, wherein the calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data comprises:

extracting the spectrum information of each voice session segment in the set of voice session segments;

generating a first waveform diagram in the time domain according to the spectrum information, and framing the first waveform diagram to obtain a plurality of first single-frame waveform diagrams;

performing a fast Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame spectrograms, wherein the horizontal axis of each first single-frame spectrogram represents frequency and the vertical axis represents amplitude;

performing an inversion operation and a gray-scale operation on each first single-frame spectrogram to obtain a plurality of first one-dimensional gray-scale amplitude maps;

and synthesizing the plurality of first one-dimensional gray-scale amplitude maps into a speech spectrogram, and obtaining the target spectrum data based on the coordinate information in the speech spectrogram.

4. The method of claim 1, wherein the deleting redundant channels from the initial audio coding model to obtain the improved initial audio coding model comprises:

presetting the number of redundant channels;

detecting whether redundant channels exist in the last dimension of each layer of the initial audio coding model;

and when the detection result shows that redundant channels exist in the last dimension of a layer of the initial audio coding model, deleting the redundant channels to obtain the improved initial audio coding model.

5. The method of claim 1, wherein the calling the improved initial audio coding model to aggregate the target spectrum data to obtain the spectral features comprises:

acquiring the target spectrum data, and extracting a preset number of spectrum frames from the target spectrum data to obtain a set of spectrum frames, wherein each spectrum frame corresponds to a unique timestamp in the target spectrum data;

vectorizing each spectrum frame in the set of spectrum frames to obtain frame feature vectors;

and performing aggregation analysis on the frame feature vectors to obtain the spectral features corresponding to the target spectrum data.

6. The voice classification method based on voiceprint recognition according to claim 1, wherein before the target audio coding model is called to process the voice session data to be verified, the method further comprises:

dividing the voice session data to be verified into a plurality of data frames according to a preset rule;

computing the spectral energy of the current data frame, and comparing the spectral energy with a preset energy threshold;

if the spectral energy is less than or equal to the preset energy threshold, determining that the current data frame is a normal audio signal;

and if the spectral energy is greater than the preset energy threshold, determining that the current data frame contains an abnormal signal.

7. The method according to claim 1, wherein the calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library comprises:

converting the audio code to be verified and the agent audio code into vector form;

processing the audio code to be verified and the agent audio code in vector form with a preset angle-cosine calculation model to obtain the cosine of the angle between the two vectors;

and determining the similarity value between the audio code to be verified and the agent audio code according to the cosine of the angle.

8. A voice classification device based on voiceprint recognition, comprising:

the data preprocessing module is used for preprocessing historical voice session data to obtain a set of voice session segments of a preset duration;

the data transformation module is used for calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data;

the model improvement module is used for acquiring an initial audio coding model and deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model;

the aggregation processing module is used for calling the improved initial audio coding model to aggregate the target spectrum data to obtain spectral features;

the model training module is used for training the improved initial audio coding model by taking the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors, to obtain a trained target audio coding model;

the voiceprint library creation module is used for calling the target audio coding model to process the audio data of all agents to obtain the agent audio code of each agent, and creating a voiceprint library corresponding to all agents according to the agent audio codes;

the code acquisition module is used for receiving voice session data to be verified and calling the target audio coding model to process the voice session data to be verified to obtain a set of audio codes to be verified;

the similarity calculation module is used for calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library, and detecting whether there is a target audio code to be verified whose similarity with the agent audio codes does not exceed a preset similarity threshold;

and the data determination module is used for determining that the voice session data to be verified is real voice session data when the detection result shows that the similarity between the target audio code to be verified and the agent audio codes does not exceed the preset similarity threshold.

9. A computer device, characterized in that the computer device comprises a processor configured to implement the voice classification method based on voiceprint recognition according to any one of claims 1 to 7 when executing a computer program stored in a memory.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the voice classification method based on voiceprint recognition according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for voice classification based on voiceprint recognition, a computer device, and a medium.

Background

Guided by the "finance + technology" and "finance + ecology" strategies of the insurance industry, and at the important juncture where technology is accelerating the digital transformation of insurance, a large number of offline service scenarios have been moved online, generating a large volume of customer-meeting audio data from service communication between agents and customers. Investigation shows that a large amount of the current customer-meeting audio data is practice audio recorded by agents themselves; that is, no customer appears anywhere in the recording, which is used only by agents to rehearse. However, from the audio obtained at the back end it cannot be distinguished whether a recording is agent practice or a genuine explanation communicated to a customer, and such data cannot meet the requirements of downstream AI tasks.

In the process of implementing the present application, the inventors found the following technical problems in the prior art: current methods for distinguishing real agent-customer dialogue data rely mainly on voiceprint recognition, and generally consider parameters such as linear prediction coefficients, Mel cepstral coefficients, and speech spectrogram features. The speech spectrogram is a feature representation commonly used for voiceprint recognition in current deep learning research; however, besides local spatial features and temporal features rich in speaker-specific information, it also contains blank speech segments and segments with insufficient speech energy. This large amount of redundant information prevents network training from converging quickly and consumes substantial computation, so neither the speed nor the accuracy of voice classification can be guaranteed.

Therefore, it is necessary to provide a voice classification method based on voiceprint recognition that can improve the speed and accuracy of voice classification.

Disclosure of Invention

In view of the foregoing, there is a need for a voice classification method, apparatus, computer device and medium based on voiceprint recognition that can improve the speed and accuracy of voice classification.

A first aspect of an embodiment of the present application provides a voice classification method based on voiceprint recognition, where the voice classification method based on voiceprint recognition includes:

preprocessing historical voice session data to obtain a set of voice session segments of a preset duration;

calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data;

acquiring an initial audio coding model, and deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model;

calling the improved initial audio coding model to aggregate the target spectrum data to obtain spectral features;

taking the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors, training the improved initial audio coding model to obtain a trained target audio coding model;

calling the target audio coding model to process the audio data of all agents to obtain the agent audio code of each agent, and creating a voiceprint library corresponding to all agents according to the agent audio codes;

receiving voice session data to be verified, and calling the target audio coding model to process the voice session data to be verified to obtain a set of audio codes to be verified;

calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library, and detecting whether there is a target audio code to be verified whose similarity with the agent audio codes does not exceed a preset similarity threshold;

and when the detection result shows that the similarity between the target audio code to be verified and the agent audio codes does not exceed the preset similarity threshold, determining that the voice session data to be verified is real voice session data.

Further, in the voice classification method based on voiceprint recognition provided in an embodiment of the present application, the preprocessing historical voice session data to obtain a set of voice session segments of a preset duration includes:

acquiring the start time and the end time of each voice session in the historical voice session data, and determining the session duration of the voice session according to the start time and the end time;

deleting, from the historical voice session data, the voice sessions whose session duration does not exceed a preset duration threshold to obtain first voice session data;

calling a VAD (voice activity detection) technique to detect noise segments in each voice session of the first voice session data, and deleting the voice sessions whose number of noise segments exceeds a preset quantity threshold to obtain second voice session data;

and cutting the second voice session data according to a preset duration to obtain a set of voice session segments.

Further, in the voice classification method based on voiceprint recognition provided in an embodiment of the present application, the calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data includes:

extracting the spectrum information of each voice session segment in the set of voice session segments;

generating a first waveform diagram in the time domain according to the spectrum information, and framing the first waveform diagram to obtain a plurality of first single-frame waveform diagrams;

performing a fast Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame spectrograms, wherein the horizontal axis of each first single-frame spectrogram represents frequency and the vertical axis represents amplitude;

performing an inversion operation and a gray-scale operation on each first single-frame spectrogram to obtain a plurality of first one-dimensional gray-scale amplitude maps;

and synthesizing the plurality of first one-dimensional gray-scale amplitude maps into a speech spectrogram, and obtaining the target spectrum data based on the coordinate information in the speech spectrogram.

Further, in the voice classification method based on voiceprint recognition provided in an embodiment of the present application, the deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model includes:

presetting the number of redundant channels;

detecting whether redundant channels exist in the last dimension of each layer of the initial audio coding model;

and when the detection result shows that redundant channels exist in the last dimension of a layer of the initial audio coding model, deleting the redundant channels to obtain the improved initial audio coding model.

Further, in the voice classification method based on voiceprint recognition provided in an embodiment of the present application, the calling the improved initial audio coding model to aggregate the target spectrum data to obtain the spectral features includes:

acquiring the target spectrum data, and extracting a preset number of spectrum frames from the target spectrum data to obtain a set of spectrum frames, wherein each spectrum frame corresponds to a unique timestamp in the target spectrum data;

vectorizing each spectrum frame in the set of spectrum frames to obtain frame feature vectors;

and performing aggregation analysis on the frame feature vectors to obtain the spectral features corresponding to the target spectrum data.

Further, in the voice classification method based on voiceprint recognition provided in an embodiment of the present application, before the target audio coding model is called to process the voice session data to be verified to obtain a set of audio codes to be verified, the method further includes:

dividing the voice session data to be verified into a plurality of data frames according to a preset rule;

computing the spectral energy of the current data frame, and comparing the spectral energy with a preset energy threshold;

if the spectral energy is less than or equal to the preset energy threshold, determining that the current data frame is a normal audio signal;

and if the spectral energy is greater than the preset energy threshold, determining that the current data frame contains an abnormal signal.

Further, in the voice classification method based on voiceprint recognition provided in an embodiment of the present application, the calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library includes:

converting the audio code to be verified and the agent audio code into vector form;

processing the audio code to be verified and the agent audio code in vector form with a preset angle-cosine calculation model to obtain the cosine of the angle between the two vectors;

and determining the similarity value between the audio code to be verified and the agent audio code according to the cosine of the angle.

A second aspect of the embodiments of the present application further provides a voice classification device based on voiceprint recognition, where the device includes:

the data preprocessing module is used for preprocessing historical voice session data to obtain a set of voice session segments of a preset duration;

the data transformation module is used for calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data;

the model improvement module is used for acquiring an initial audio coding model and deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model;

the aggregation processing module is used for calling the improved initial audio coding model to aggregate the target spectrum data to obtain spectral features;

the model training module is used for training the improved initial audio coding model by taking the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors, to obtain a trained target audio coding model;

the voiceprint library creation module is used for calling the target audio coding model to process the audio data of all agents to obtain the agent audio code of each agent, and creating a voiceprint library corresponding to all agents according to the agent audio codes;

the code acquisition module is used for receiving voice session data to be verified and calling the target audio coding model to process the voice session data to be verified to obtain a set of audio codes to be verified;

the similarity calculation module is used for calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library, and detecting whether there is a target audio code to be verified whose similarity with the agent audio codes does not exceed a preset similarity threshold;

and the data determination module is used for determining that the voice session data to be verified is real voice session data when the detection result shows that the similarity between the target audio code to be verified and the agent audio codes does not exceed the preset similarity threshold.

A third aspect of the embodiments of the present application further provides a computer device, where the computer device includes a processor configured to implement any one of the above voice classification methods based on voiceprint recognition when executing a computer program stored in a memory.

A fourth aspect of the embodiments of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the above voice classification methods based on voiceprint recognition.

According to the voice classification method and device based on voiceprint recognition, the computer device and the computer-readable storage medium provided herein, a target audio coding model is trained and called to process the audio data of all agents and the voice session data to be verified, so as to obtain the agent audio code of each agent and the audio codes to be verified; voice classification is then performed using the similarity between audio codes. This avoids the poor classification results caused by the large amount of redundant information in voiceprint data when a network model is called to classify voiceprint information directly, and thus improves both the speed and the accuracy of voice classification. In addition, an improved initial audio coding model is adopted: the channel count of the initial audio coding model is reduced, and the model is called to aggregate the target spectrum data along the time axis to obtain spectral features. This optimizes the spectral features, shields them as far as possible from the influence of poor frames, and improves the accuracy and efficiency of audio code extraction, thereby improving the accuracy and efficiency of voice classification. The application can be applied to functional modules of smart cities such as smart government affairs and smart transportation, for example a voiceprint-recognition-based voice classification module for smart government affairs, and can promote the rapid development of smart cities.

Drawings

Fig. 1 is a flowchart of a method for voice classification based on voiceprint recognition according to an embodiment of the present application.

Fig. 2 is a block diagram of a voice classification apparatus based on voiceprint recognition according to a second embodiment of the present application.

Fig. 3 is a schematic structural diagram of a computer device provided in the third embodiment of the present application.

The following detailed description will further illustrate the present application in conjunction with the above-described figures.

Detailed Description

In order that the above objects, features and advantages of the present application can be more clearly understood, a detailed description of the present application will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present application, and the described embodiments are a part, but not all, of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

The voice classification method based on voiceprint recognition provided by the embodiments of the invention is executed by a computer device, and correspondingly, the voice classification device based on voiceprint recognition runs on the computer device.

Fig. 1 is a flowchart of a voice classification method based on voiceprint recognition according to a first embodiment of the present application. As shown in fig. 1, the voice classification method based on voiceprint recognition may include the following steps; according to different requirements, the order of the steps in the flowchart may be changed and some steps may be omitted:

and S11, preprocessing historical voice conversation data to obtain a voice conversation fragment set with preset duration.

In at least one embodiment of the present application, the historical voice session data may be the content of a practice voice session between agent A and agent B, or the content of a real voice session between agent A and client B. The historical voice session data includes the voice session content as well as preset audio codes of the different speakers; each speaker corresponds to a unique audio code, and an audio code may consist purely of digits, purely of letters, or of a combination of digits and letters, which is not limited herein. The original historical voice session data may be audio data in PCM format; to facilitate voiceprint recognition processing, the application converts the PCM audio data into WAV audio data. Since converting audio data from PCM format to WAV format is prior art, it is not described here again.
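As a minimal sketch of this conversion (a hedged example only: 16-bit mono PCM at an 8 kHz telephony sampling rate is assumed, since the patent does not specify these parameters, and the file names are hypothetical), Python's standard wave module can wrap raw PCM samples in a WAV container:

import wave

def pcm_to_wav(pcm_path, wav_path, channels=1, sample_width=2, sample_rate=8000):
    # Read the raw PCM samples (a headerless byte stream).
    with open(pcm_path, "rb") as f:
        pcm_data = f.read()
    # Write the same samples into a WAV container with the assumed parameters.
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)      # mono assumed
        wav_file.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)   # 8 kHz telephony rate assumed
        wav_file.writeframes(pcm_data)

pcm_to_wav("session_0001.pcm", "session_0001.wav")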

Optionally, the preprocessing historical voice session data to obtain a set of voice session segments of a preset duration includes:

acquiring the start time and the end time of each voice session in the historical voice session data, and determining the session duration of the voice session according to the start time and the end time;

deleting, from the historical voice session data, the voice sessions whose session duration does not exceed a preset duration threshold to obtain first voice session data;

calling a VAD (voice activity detection) technique to detect noise segments in each voice session of the first voice session data, and deleting the voice sessions whose number of noise segments exceeds a preset quantity threshold to obtain second voice session data;

and cutting the second voice session data according to a preset duration to obtain a set of voice session segments.

The first voice session data includes a plurality of voice sessions, and any part of a voice session may belong either to a section containing only background noise or to a section containing both background noise and speech. The VAD technique divides a voice session into a plurality of frames along the time signal and determines which kind of section a given frame falls in. It can be understood that when the detection result places a frame in a background-noise-only section, that part of the voice session is a noise segment; when the detection result places a frame in a section containing both background noise and speech, that part is a non-noise segment. The preset voice duration threshold is a duration preset by system personnel, and the preset quantity threshold is likewise a value preset by system personnel.

In at least one embodiment of the present application, the preset duration may be 2 to 6 seconds, and the longest silence segment contained in a cut voice session segment does not exceed 0.3 seconds. The preset duration may be preset by system personnel or determined by machine learning. Setting the preset duration to 2 to 6 seconds ensures that a cut voice session segment contains as much effective information as possible, where effective information refers to the information in a voice session that expresses the topic of the conversation, excluding silence segments and filler words.

By preprocessing the historical voice session data, voice sessions that are too short or too noisy are deleted, which avoids the low model-training accuracy caused by invalid sessions among the training samples. In addition, the application cuts the historical voice session data into a set of voice session segments of a preset duration, replacing lengthy session data with many short segments as training samples, which improves model-training efficiency. A sketch of this pipeline follows.
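The following is a minimal sketch of the preprocessing pipeline. The patent names a VAD technique without fixing an implementation, so a simple per-frame energy test stands in for it here, and all numeric thresholds are assumptions:

import numpy as np

def preprocess(sessions, sr=8000, min_dur=10.0, max_noise_frames=50,
               seg_dur=4.0, frame_ms=30, energy_thresh=1e-3):
    # sessions: list of 1-D float arrays, one array per voice session.
    frame_len = int(sr * frame_ms / 1000)
    seg_len = int(sr * seg_dur)              # preset duration in the 2-6 s range
    segments = []
    for audio in sessions:
        # Step 1: delete sessions whose duration does not exceed the threshold.
        if len(audio) / sr <= min_dur:
            continue
        # Step 2: VAD stand-in - frames whose energy falls below the threshold
        # are treated as background-noise (noise) segments.
        n_frames = len(audio) // frame_len
        frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
        noise_frames = int(np.sum((frames ** 2).mean(axis=1) < energy_thresh))
        if noise_frames > max_noise_frames:
            continue
        # Step 3: cut each surviving session into fixed-duration segments.
        segments += [audio[i:i + seg_len]
                     for i in range(0, len(audio) - seg_len + 1, seg_len)]
    return segments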

S12, calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data.

In at least one embodiment of the present application, the Fast Fourier Transform (FFT) is a fast algorithm for the discrete Fourier transform that converts a time-domain signal into a frequency-domain signal. Some signals show no obvious characteristics in the time domain but reveal them once transformed into the frequency domain. In addition, the FFT can extract the spectrum of a signal, reflecting how the signal's energy is distributed over frequency.

Optionally, the calling a fast Fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectrum data includes:

extracting the spectrum information of each voice session segment in the set of voice session segments;

generating a first waveform diagram in the time domain according to the spectrum information, and framing the first waveform diagram to obtain a plurality of first single-frame waveform diagrams;

performing a fast Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame spectrograms, wherein the horizontal axis of each first single-frame spectrogram represents frequency and the vertical axis represents amplitude;

performing an inversion operation and a gray-scale operation on each first single-frame spectrogram to obtain a plurality of first one-dimensional gray-scale amplitude maps;

and synthesizing the plurality of first one-dimensional gray-scale amplitude maps into a speech spectrogram, and obtaining the target spectrum data based on the coordinate information in the speech spectrogram.

The inversion operation swaps the horizontal and vertical axes of a first single-frame spectrogram, and the gray-scale operation represents the amplitude in the inverted single-frame spectrogram as gray values. The speech spectrogram is an image reflecting the relationship between signal frequency and energy, and the first waveform diagram (Wave) is a continuous sound-waveform signal diagram generated from the spectrum information. In an embodiment, the speech spectrogram can be obtained by processing the spectrum information as follows. The spectrum information is first converted into a first waveform diagram in the time domain; the first waveform diagram is divided into a plurality of first single-frame waveform diagrams of equal duration; each first single-frame waveform diagram is sampled continuously to obtain a plurality of sampling points; a fast Fourier transform (FFT) operation is then performed on these sampling points to obtain a plurality of first single-frame spectrograms (Spectrum), where the horizontal axis of each single-frame spectrogram represents frequency and the vertical axis represents amplitude (Amplitude); each first single-frame spectrogram undergoes the inversion operation and the gray-scale operation to yield a first one-dimensional gray-scale amplitude map; finally, the plurality of first one-dimensional gray-scale amplitude maps are spliced to obtain the speech spectrogram corresponding to the spectrum information. For example, with 4096 sampling points per frame, the duration of each first single-frame waveform diagram is 1/10 second (s), and each point in the resulting speech spectrogram corresponds to the amplitude value at the corresponding frequency. The speech spectrogram corresponding to the spectrum information therefore reflects the frequency distribution of the audio over time.
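The framing-FFT-splicing procedure above can be sketched as follows. The 4096-point frame follows the example in the text; the hop size and the Hann window are assumptions added as common practice, not details stated in the patent:

import numpy as np

def speech_spectrogram(audio, frame_len=4096, hop=4096):
    # Frame the waveform, FFT each frame, and splice the gray-scale
    # amplitude columns into a 2-D spectrogram (frequency x time).
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hanning(frame_len)
    columns = []
    for i in range(n_frames):
        frame = audio[i * hop:i * hop + frame_len]
        # Single-frame spectrum: frequency bins with their amplitudes.
        amplitude = np.abs(np.fft.rfft(frame * window))
        # Gray-scale operation: map log-amplitude to 0-255 gray values.
        gray = np.log1p(amplitude)
        gray = (255 * gray / max(gray.max(), 1e-9)).astype(np.uint8)
        columns.append(gray)  # inversion: each frame becomes one column
    return np.stack(columns, axis=1)  # rows: frequency bins, columns: time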

S13, obtaining an initial audio coding model, and deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model.

In at least one embodiment of the present application, the initial audio coding model may be an initialized ResNet34 model. The initial audio coding model encodes audio data to obtain the audio code corresponding to each piece of audio data. The application improves the ResNet34 model in two respects. On the one hand, the channel count in the last dimension of each layer of the ResNet34 model is processed: unneeded channels are deleted, simplifying the original model from roughly 22 million parameters to roughly 3 million parameters, which improves the efficiency of audio code extraction and hence of voice classification. On the other hand, a NetVLAD method is added to aggregate features along the time axis, which improves the accuracy of audio code extraction and hence of voice classification. The essence of NetVLAD is to compute residuals of features against cluster centers and to aggregate different time frames into a new feature.

Optionally, the deleting redundant channels from the initial audio coding model to obtain an improved initial audio coding model includes:

presetting the number of redundant channels;

detecting whether redundant channels exist in the last dimension of each layer of the initial audio coding model;

and when the detection result shows that redundant channels exist in the last dimension of a layer of the initial audio coding model, deleting the redundant channels to obtain the improved initial audio coding model.

The number of redundant channels may be preset by system personnel and stored in the preset database. In other embodiments, the redundant channels in the model may also be determined by constructing a mathematical model, which is not particularly limited here.
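A minimal sketch of deleting redundant output channels from one convolutional layer and reconciling the next layer (the selection criterion used here, smallest L1 weight norms, is an assumption; the patent only states that the redundant channel count is preset or determined by a model):

import torch
import torch.nn as nn

def prune_conv_channels(conv, next_conv, keep_ratio=0.5):
    # Keep the output channels of `conv` with the largest L1 weight norms
    # and shrink the input channels of `next_conv` to match.
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))   # one norm per output channel
        n_keep = max(1, int(keep_ratio * conv.out_channels))
        keep = torch.topk(norms, n_keep).indices.sort().values
        new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                             conv.stride, conv.padding, bias=conv.bias is not None)
        new_conv.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
        new_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                             next_conv.stride, next_conv.padding,
                             bias=next_conv.bias is not None)
        new_next.weight.copy_(next_conv.weight[:, keep])
        if next_conv.bias is not None:
            new_next.bias.copy_(next_conv.bias)
    return new_conv, new_next

Applied layer by layer across the residual blocks of ResNet34 (where skip connections must also be kept consistent), this kind of width reduction is what brings the parameter count down from tens of millions to a few million.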

S14, calling the improved initial audio coding model to aggregate the target spectrum data to obtain the spectral features.

In at least one embodiment of the present application, the calling the improved initial audio coding model to aggregate the target spectrum data to obtain the spectral features includes:

acquiring the target spectrum data, and extracting a preset number of spectrum frames from the target spectrum data to obtain a set of spectrum frames, wherein each spectrum frame corresponds to a unique timestamp in the target spectrum data;

vectorizing each spectrum frame in the set of spectrum frames to obtain frame feature vectors;

and performing aggregation analysis on the frame feature vectors to obtain the spectral features corresponding to the target spectrum data.

The aggregation analysis may comprise adaptive-weight aggregation or temporal-correlation aggregation. By adding the NetVLAD method to the ResNet34 processing pipeline, features are aggregated along the time axis, which optimizes the spectral features, shields them as far as possible from the influence of poor frames, and improves the accuracy of audio code extraction and hence of voice classification.
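A minimal PyTorch sketch of NetVLAD-style aggregation along the time axis (the cluster count and feature dimension are assumptions; the patent only states that NetVLAD computes residuals to cluster centers and aggregates different time frames into a new feature):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, num_clusters=8, dim=128):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)   # soft-assignment scores

    def forward(self, x):                             # x: (batch, frames, dim)
        a = F.softmax(self.assign(x), dim=-1)         # (B, T, K) soft assignments
        residual = x.unsqueeze(2) - self.centers      # residuals to every center
        vlad = (a.unsqueeze(-1) * residual).sum(1)    # aggregate over frames: (B, K, D)
        vlad = F.normalize(vlad, dim=-1)              # intra-normalize per cluster
        return F.normalize(vlad.flatten(1), dim=-1)   # (B, K*D) spectral feature

feats = NetVLAD()(torch.randn(2, 100, 128))           # 100 spectrum frames -> one vector

Because every frame contributes only through soft-weighted residuals, a few poor frames have limited influence on the aggregated feature, which is the robustness property described above.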

S15, taking the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors, training the improved initial audio coding model to obtain a trained target audio coding model.

In at least one embodiment of the present application, the spectral features are used as input vectors and the audio codes corresponding to the spectral features as output vectors to train the improved initial audio coding model, yielding a trained target audio coding model. The audio codes may be preset coding information.

Optionally, the training the improved initial audio coding model to obtain the trained target audio coding model includes:

acquiring the spectral features as sample data and splitting the sample data into a training set and a test set, wherein the sample data takes the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors;

inputting the training set into the improved initial audio coding model to obtain a trained audio coding model;

inputting the test set into the trained audio coding model to obtain an evaluation index of the model;

detecting whether the evaluation index of the model exceeds a preset index threshold;

when the detection result shows that the evaluation index exceeds the preset index threshold, determining that model training is finished and obtaining the trained target audio coding model; when the detection result shows that the evaluation index does not exceed the preset index threshold, enlarging the training set and retraining the model until the evaluation index exceeds the preset index threshold.

The preset index threshold is a preset value; for example, the preset index threshold may be 95%.
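A compact sketch of this train-evaluate-retrain loop (the 95% threshold follows the example above; the loss, optimizer, epoch count, and the get_more_data callback are assumptions for illustration):

import torch
import torch.nn as nn

def train_until_threshold(model, train_loader, test_loader, get_more_data,
                          index_threshold=0.95, epochs=10):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    while True:
        model.train()
        for _ in range(epochs):
            for spectra, codes in train_loader:   # spectral features -> audio codes
                optimizer.zero_grad()
                loss = criterion(model(spectra), codes)
                loss.backward()
                optimizer.step()
        # Evaluation index: classification accuracy on the test set.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for spectra, codes in test_loader:
                correct += (model(spectra).argmax(dim=1) == codes).sum().item()
                total += codes.numel()
        if correct / total > index_threshold:
            return model                            # training finished
        train_loader = get_more_data(train_loader)  # enlarge the training set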

The improved ResNet34 model is used as the audio coding model: the channel count of the standard ResNet34 model is reduced, and the NetVLAD method is added to aggregate features along the time axis. This optimizes the spectral features, shields them as far as possible from the influence of poor frames, and improves the accuracy and efficiency of audio code extraction, thereby improving the accuracy and efficiency of voice classification.

S16, calling the target audio coding model to process the audio data of all agents to obtain the agent audio code of each agent, and creating a voiceprint library corresponding to all agents according to the agent audio codes.

In at least one embodiment of the present application, a preset database is established in which the audio data of all agents is stored; the audio data may be voice recordings of each agent reading a preset text, the preset text being preset by system personnel. The target audio coding model is called to process the audio data of all agents to obtain the agent audio code of each agent; the agent audio codes have a mapping relationship with the agents, one agent corresponding to one agent audio code. A voiceprint library is created, storing the basic information of each agent together with the corresponding agent audio code. By querying the mapping relationship, the agent audio code of each agent can be obtained. The basic information of an agent may include information identifying the agent, such as a name and an ID, which is not limited herein. The voiceprint library may be updated at a preset interval, which may be 7 days.

By establishing an audio coding model for the agents, audio feature codes can be extracted for each agent, and each communication session can be quickly and accurately judged, through voiceprint features, as either agent practice audio or real agent-client audio. A sketch of building the library follows.
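A minimal sketch of building the voiceprint library (the in-memory dict, the numpy persistence, and the names used here are assumptions; the patent does not fix a storage format):

import numpy as np

def build_voiceprint_library(encode, agent_audio):
    # agent_audio: dict mapping agent ID -> that agent's preprocessed audio input;
    # encode: the trained target audio coding model as a callable.
    # Returns a library mapping agent ID -> agent audio code (embedding vector).
    return {agent_id: encode(audio) for agent_id, audio in agent_audio.items()}

# The library can be refreshed on a schedule (e.g., every 7 days) and persisted:
# np.savez("voiceprint_library.npz", **build_voiceprint_library(encode, agent_audio))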

S17, receiving voice session data to be verified, and calling the target audio coding model to process the voice session data to be verified to obtain a set of audio codes to be verified.

In at least one embodiment of the present application, the voice session data to be verified is either practice voice session content between agents or real voice session content between an agent and a client, and it is the current voice session data that needs to be verified. The set of audio codes to be verified refers to the set containing the audio code of each speaker in the voice session data to be verified, and it contains two or more audio codes to be verified.

Optionally, before the target audio coding model is called to process the voice session data to be verified to obtain a set of audio codes to be verified, the method further includes:

dividing the voice session data to be verified into a plurality of data frames according to a preset rule;

computing the spectral energy of the current data frame, and comparing the spectral energy with a preset energy threshold;

if the spectral energy is less than or equal to the preset energy threshold, determining that the current data frame is a normal audio signal;

and if the spectral energy is greater than the preset energy threshold, determining that the current data frame contains an abnormal signal.

The preset rule refers to the division length of a data frame; for example, the preset rule may convert an audio signal lasting 10 ms or 20 ms into one data frame. In an embodiment, the audio signal of each data frame is examined in real time to determine whether an abnormal signal exists. For each data frame, the energy of the frame is determined by an energy statistic, which may be a periodic RMS (Root Mean Square) statistic. In one embodiment, the spectral energy of the audio signal in each data frame is compared in turn with the preset energy threshold: if the spectral energy of the audio signal in a data frame is less than or equal to the preset energy threshold, the current data frame is determined to be a normal audio signal, no processing is performed on it, and detection continues with the next data frame; if the spectral energy of the audio signal in a data frame is greater than the preset energy threshold, the signal is determined to be abnormal, the remaining audio signals of the frame continue to be checked until the frame is finished, and detection then continues with the next data frame.

This prevents abnormal signals caused by aging hardware from degrading the audio, eliminates abnormal signals at low cost, and improves the quality of the audio signal. A sketch of the per-frame check follows.
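A minimal sketch of the per-frame RMS check (the 20 ms frame length and the threshold value are assumptions):

import numpy as np

def flag_abnormal_frames(audio, sr=8000, frame_ms=20, energy_threshold=0.1):
    # Split the audio into frames, compute the RMS energy of each frame,
    # and flag frames whose energy exceeds the preset threshold as abnormal.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))   # periodic RMS statistic
    return rms > energy_threshold               # True = contains an abnormal signal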

S18, calculating the similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library, detecting whether the similarity between the target audio code to be verified and the agent audio codes does not exceed a preset similarity threshold, and executing step S19 when the detection result shows that the similarity between the target audio code to be verified and the agent audio codes does not exceed the preset similarity threshold.

In at least one embodiment of the present application, cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. The vectors are drawn into a vector space according to their coordinate values, most commonly a two-dimensional space; the angle between them is obtained, and the cosine corresponding to that angle represents the similarity of the two vectors. The smaller the angle, the closer the cosine is to 1 and the more the directions coincide, hence the more similar the vectors; the larger the angle, the closer the cosine is to 0, the closer the vectors are to orthogonal, and the poorer the similarity.

Optionally, the calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each agent audio code in the voiceprint library includes:

converting the audio code to be verified and the agent audio code into vector form;

processing the audio code to be verified and the agent audio code in vector form with a preset angle-cosine calculation model to obtain the cosine of the angle between the two vectors;

and determining the similarity value between the audio code to be verified and the agent audio code according to the cosine of the angle.

The larger the cosine of the angle, the closer the audio code to be verified is to the agent audio code; the smaller the cosine, the less related they are. A preset similarity threshold is set: when the cosine exceeds the preset similarity threshold, the audio code to be verified is determined to be close to the agent audio code, i.e., it is in the voiceprint library, meaning the corresponding party in the session is an agent in the voiceprint library; when the cosine does not exceed the preset similarity threshold, the audio code to be verified is determined to be unrelated to the agent audio codes, i.e., it is not in the voiceprint library, meaning the corresponding party is a stranger (i.e., a client). The preset similarity threshold is a value preset by system personnel; for example, it may be 95%, which is not limited here.
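A minimal sketch of the similarity test (the 95% threshold follows the example above; the library format matches the earlier sketch):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two audio-code vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_agent(code_to_verify, voiceprint_library, threshold=0.95):
    # True if the code is close to at least one agent audio code.
    return any(cosine_similarity(code_to_verify, agent_code) > threshold
               for agent_code in voiceprint_library.values())

A session is then classified as real when at least one of its audio codes to be verified fails this test, i.e., when some speaker is not found in the voiceprint library.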

S19, determining that the voice session data to be verified is real voice session data.

In at least one embodiment of the present application, it is detected whether the similarity between the target audio code to be verified and the agent audio codes does not exceed the preset similarity threshold. When the detection result shows that the similarity does not exceed the preset similarity threshold, the voice session data to be verified is determined to be real voice session data; when the detection result shows that the similarity exceeds the preset similarity threshold, the voice session data to be verified is determined to be practice voice session data. Real voice session data refers to session data between an agent and a client, while practice voice session data refers to session data between two agents.

According to the voice classification method based on voiceprint recognition provided above, a target audio coding model is trained and called to process the audio data of all agents and the voice session data to be verified, so as to obtain the agent audio code of each agent and the audio codes to be verified; voice classification is then performed using the similarity between audio codes. This avoids the poor classification results caused by the large amount of redundant information in voiceprint data when a network model is called to classify voiceprint information directly, and thus improves both the speed and the accuracy of voice classification. In addition, an improved initial audio coding model is adopted: the channel count of the initial audio coding model is reduced, and the model is called to aggregate the target spectrum data along the time axis to obtain spectral features. This optimizes the spectral features, shields them as far as possible from the influence of poor frames, and improves the accuracy and efficiency of audio code extraction, thereby improving the accuracy and efficiency of voice classification. The application can be applied to functional modules of smart cities such as smart government affairs and smart transportation, for example a voiceprint-recognition-based voice classification module for smart government affairs, and can promote the rapid development of smart cities.

Fig. 2 is a block diagram of a voice classification apparatus based on voiceprint recognition according to a second embodiment of the present application.

In some embodiments, the voice classification apparatus 20 based on voiceprint recognition may comprise a plurality of functional modules consisting of computer program segments. The computer program of each program segment in the voice classification apparatus 20 based on voiceprint recognition may be stored in a memory of a computer device and executed by at least one processor to perform the voice classification functions based on voiceprint recognition (described in detail with reference to fig. 1).

In this embodiment, the voice classification device 20 based on voiceprint recognition can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: the data preprocessing module 201, the data transformation module 202, the model improvement module 203, the aggregation processing module 204, the model training module 205, the voiceprint library creation module 206, the code acquisition module 207, the similarity calculation module 208 and the data determination module 209. A module as referred to herein is a series of computer program segments capable of being executed by at least one processor, capable of performing a fixed function, and stored in a memory. In the present embodiment, the functions of the modules are described in detail in the following embodiments.

The data preprocessing module 201 is configured to preprocess historical voice session data to obtain a set of voice session segments of a preset duration.

In at least one embodiment of the present application, the historical voice session data may be the content of a practice voice session between agent A and agent B, or the content of a real voice session between agent A and client B. The historical voice session data includes the voice session content as well as preset audio codes of the different speakers; each speaker corresponds to a unique audio code, and an audio code may consist purely of digits, purely of letters, or of a combination of digits and letters, which is not limited herein. The original historical voice session data may be audio data in PCM format; to facilitate voiceprint recognition processing, the application converts the PCM audio data into WAV audio data. Since converting audio data from PCM format to WAV format is prior art, it is not described here again.

Optionally, the preprocessing historical voice session data to obtain a set of voice session segments of a preset duration includes:

acquiring the start time and the end time of each voice session in the historical voice session data, and determining the session duration of the voice session according to the start time and the end time;

deleting, from the historical voice session data, the voice sessions whose session duration does not exceed a preset duration threshold to obtain first voice session data;

calling a VAD (voice activity detection) technique to detect noise segments in each voice session of the first voice session data, and deleting the voice sessions whose number of noise segments exceeds a preset quantity threshold to obtain second voice session data;

and cutting the second voice session data according to a preset duration to obtain a set of voice session segments.

Wherein the first voice session data comprises a plurality of voice sessions, and any part of a voice session may belong either to a section containing only background noise or to a section containing both background noise and speech. The VAD voice detection technology divides the voice session into a plurality of frames according to the time signal and determines in which kind of section a given frame lies. It can be understood that when the detection result is that a frame lies in a background-noise-only section, the frame is a noise segment; when the detection result is that a frame lies in a section containing both background noise and speech, the frame is a non-noise segment. The preset voice duration threshold and the preset number threshold are values preset by system personnel.

In at least one embodiment of the present application, the preset duration may be 2 to 6 seconds, and the longest silence segment within a clipped voice conversation segment does not exceed 0.3 seconds. The preset duration can be set in advance by system personnel or determined through machine learning. Setting the preset duration to 2-6 seconds ensures that each clipped voice conversation segment contains as much effective information as possible, where effective information refers to information that expresses the conversation subject, excluding silence segments and filler words.

By preprocessing the historical voice conversation data, voice sessions that are too short or too noisy are deleted, which avoids the low model training accuracy that invalid sessions would cause if used as training samples. In addition, the present application cuts the historical voice conversation data into a set of voice conversation fragments of preset duration; replacing lengthy session data with a plurality of short-duration fragments as training samples improves model training efficiency.
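A minimal sketch of this preprocessing pipeline is given below, assuming 16-bit mono WAV input at 16 kHz and using the webrtcvad package as the VAD voice detection technology; the three threshold constants are illustrative stand-ins for the system personnel's preset values.

```python
import wave
import webrtcvad  # pip install webrtcvad

MIN_SESSION_SECONDS = 5   # assumed preset voice duration threshold
MAX_NOISE_FRAMES = 50     # assumed preset number threshold for noise segments
SEGMENT_SECONDS = 4       # assumed preset clip length, within the 2-6 s range

def preprocess_session(wav_path, sample_rate=16000, frame_ms=30):
    with wave.open(wav_path, "rb") as f:
        pcm = f.readframes(f.getnframes())
    duration = len(pcm) / (2 * sample_rate)         # 16-bit mono samples
    if duration <= MIN_SESSION_SECONDS:             # step 1: drop short sessions
        return []
    vad = webrtcvad.Vad(2)                          # aggressiveness 0-3
    frame_bytes = 2 * sample_rate * frame_ms // 1000
    frames = [pcm[i:i + frame_bytes]
              for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
    noise = sum(1 for fr in frames if not vad.is_speech(fr, sample_rate))
    if noise > MAX_NOISE_FRAMES:                    # step 2: drop noisy sessions
        return []
    seg_bytes = 2 * sample_rate * SEGMENT_SECONDS   # step 3: cut fixed clips
    return [pcm[i:i + seg_bytes] for i in range(0, len(pcm), seg_bytes)]
```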

The data transformation module 202 is configured to invoke a fast fourier transform algorithm to transform each voice conversation fragment in the set of voice conversation fragments into target spectrum data.

In at least one embodiment of the present application, the Fast Fourier Transform (FFT) is a fast algorithm for the discrete Fourier transform that can transform a time-domain signal into a frequency-domain signal; some signals exhibit no obvious characteristics in the time domain but do once transformed into the frequency domain. Additionally, the FFT can extract the spectrum of a signal to reflect the distribution of the signal's energy in the frequency domain.

Optionally, the invoking a fast fourier transform algorithm to transform each voice session segment in the set of voice session segments into target spectral data includes:

extracting the frequency spectrum information of each voice conversation fragment in the voice conversation fragment set;

generating a first waveform diagram in the time domain according to the spectrum information, and performing framing processing on the first waveform diagram to obtain a plurality of first single-frame waveform diagrams;

performing a fast Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame spectrograms, wherein the horizontal axis of each first single-frame spectrogram represents frequency and the vertical axis represents amplitude;

and carrying out an inversion operation and a grayscale operation on each first single-frame spectrogram to obtain a plurality of first one-dimensional grayscale amplitude maps, synthesizing the plurality of first one-dimensional grayscale amplitude maps to obtain a voice spectrogram, and obtaining the target spectrum data based on the coordinate information in the voice spectrogram.

The inversion operation exchanges the horizontal and vertical axes of a first single-frame spectrogram, and the grayscale operation represents the amplitudes in the inverted first single-frame spectrogram as grayscale values. The voice spectrogram is an image reflecting the relation between signal frequency and energy, and the first waveform diagram (Wave) is a continuous sound waveform signal diagram generated from the spectrum information. In an embodiment, the voice spectrogram is obtained by processing the spectrum information as follows: the spectrum information is first converted into a first waveform diagram in the time domain; the first waveform diagram is divided into a plurality of first single-frame waveform diagrams of equal duration; each first single-frame waveform diagram is sampled continuously to obtain a plurality of sampling points; a fast Fourier transform (FFT) operation is performed on the sampling points to obtain a plurality of first single-frame spectrograms (Spectrum), where the horizontal axis of each first single-frame spectrogram represents frequency and the vertical axis represents amplitude (Amplitude); each first single-frame spectrogram is then subjected to the inversion operation and the grayscale operation to obtain a first one-dimensional grayscale amplitude map; finally, the plurality of first one-dimensional grayscale amplitude maps are spliced to obtain the voice spectrogram corresponding to the spectrum information. For example, when 4096 sampling points are taken per frame and the duration of each first single-frame waveform diagram is 1/10 second (s), each point in the voice spectrogram corresponding to the first waveform diagram corresponds to the amplitude value of the corresponding frequency. The voice spectrogram corresponding to the spectrum information therefore reflects the frequency distribution of the audio over time.
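The frame-FFT-grayscale pipeline described above can be sketched in a few lines of NumPy; the 4096-point frame length mirrors the example, while the hop size and 8-bit grayscale mapping are assumptions.

```python
import numpy as np

def speech_spectrogram(signal, frame_len=4096, hop=4096):
    """Frame the waveform, FFT each frame, and stack the grayscale amplitude
    columns into a spectrogram (frequency on one axis, time on the other)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    columns = []
    for frame in frames:
        amplitude = np.abs(np.fft.rfft(frame))       # single-frame spectrum
        # grayscale operation: map amplitudes onto 0-255 intensity values
        gray = (255 * amplitude / (amplitude.max() + 1e-9)).astype(np.uint8)
        columns.append(gray)                         # inversion: bins as a column
    return np.stack(columns, axis=1)                 # rows: frequency, cols: time
```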

The model improvement module 203 is configured to obtain an initial audio coding model and delete the useless channels in the initial audio coding model to obtain an improved initial audio coding model.

In at least one embodiment of the present application, the initial audio coding model may be an initialized ResNet34 model. The initial audio coding model is used to encode audio data to obtain the audio code corresponding to each piece of audio data. The improvement of the ResNet34 model in the present application lies in two aspects. On one hand, the present application processes the channel count of the last dimension of each layer of the ResNet34 model, deleting the channels that are not needed and simplifying the original model from about 22 million parameters to about 3 million parameters, which improves the efficiency of audio coding extraction and thus the efficiency of voice classification. On the other hand, a NetVLAD method is added to aggregate features along the time axis, which improves the accuracy of audio coding extraction and thus the accuracy of voice classification. The essence of NetVLAD is to compute residuals of the features and aggregate them across different times and frames to obtain new features.

Optionally, the deleting the useless channels in the initial audio coding model to obtain an improved initial audio coding model includes:

presetting the number of useless channels;

detecting whether useless channels exist in the last dimension of each layer of the initial audio coding model;

and when the detection result shows that useless channels exist in the last dimension of a layer of the initial audio coding model, deleting the useless channels to obtain the improved initial audio coding model.

The number of useless channels can be preset by system personnel and stored in the preset database. In other embodiments, the number of useless channels in the model may also be determined by constructing a mathematical model, which is not particularly limited here.
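The channel pruning can be pictured as shrinking the stage widths of the encoder. The following PyTorch sketch is a simplified stand-in, not the patent's exact ResNet34 surgery: the slimmed widths are assumed values chosen only to illustrate how reducing per-layer channel counts moves the parameter budget from the 22-million toward the 3-million range.

```python
import torch.nn as nn

STANDARD_WIDTHS = (64, 128, 256, 512)  # standard ResNet34 stage widths
SLIM_WIDTHS = (24, 48, 96, 192)        # assumed pruned widths, for illustration

def conv_stage(in_ch, out_ch, blocks):
    """A simplified stage of stacked 3x3 conv/BN/ReLU units."""
    layers = []
    for i in range(blocks):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def slim_encoder(widths=SLIM_WIDTHS, blocks=(3, 4, 6, 3)):
    stages, in_ch = [], 1              # 1 input channel: grayscale spectrogram
    for w, b in zip(widths, blocks):
        stages += [conv_stage(in_ch, w, b), nn.MaxPool2d(2)]
        in_ch = w
    return nn.Sequential(*stages)
```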

The aggregation processing module 204 is configured to invoke the improved initial audio coding model to aggregate the target spectrum data, so as to obtain a spectrum feature.

In at least one embodiment of the present application, the invoking the improved initial audio coding model to aggregate the target spectral data to obtain the spectral feature includes:

acquiring target frequency spectrum data, and extracting a preset number of frequency spectrum frames from the target frequency spectrum data to obtain a frequency spectrum frame set, wherein each frequency spectrum frame corresponds to a unique timestamp in the target frequency spectrum data;

vectorizing each frequency spectrum frame in the frequency spectrum frame set to obtain a frame feature vector;

and performing aggregation analysis on the frame feature vectors to obtain the frequency spectrum features corresponding to the target frequency spectrum data.

Wherein the aggregation analysis may comprise adaptive weight aggregation or temporal correlation aggregation. By adding the NetVLAD method to the ResNet34 processing pipeline, the features are aggregated along the time axis, the spectrum features are optimized, and the influence of poorly extracted frames on the spectrum features is avoided as far as possible, which improves the accuracy of audio coding extraction and thus of voice classification.
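A minimal NetVLAD layer, sketched in PyTorch under the usual formulation (soft assignment of each frame vector to learned cluster centers, residual accumulation, then normalization); the cluster count is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assign each frame vector to K learned centers, accumulate the
    residuals, and aggregate over the time axis into one fixed-size feature."""
    def __init__(self, dim, num_clusters=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, x):                             # x: (batch, frames, dim)
        soft = F.softmax(self.assign(x), dim=-1)      # (B, T, K) assignments
        residual = x.unsqueeze(2) - self.centers      # (B, T, K, D) residuals
        vlad = (soft.unsqueeze(-1) * residual).sum(1) # aggregate over time
        vlad = F.normalize(vlad, dim=-1)              # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)   # (B, K*D) feature
```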

The model training module 205 is configured to train the improved initial audio coding model by using the spectral features as input vectors and the audio codes corresponding to the spectral features as output vectors, so as to obtain a trained target audio coding model.

In at least one embodiment of the present application, the spectral feature is used as an input vector, and the audio coding corresponding to the spectral feature is used as an output vector to train the improved initial audio coding model, so as to obtain a trained target audio coding model. The audio coding may be preset coding information.

Optionally, the training the improved initial audio coding model to obtain the trained target audio coding model includes:

acquiring the spectral features as sample data, and splitting the sample data into a training set and a test set, wherein the sample data takes the spectral features as input vectors, and audio codes corresponding to the spectral features are output vectors;

inputting the training set into the improved initial audio coding model to obtain a trained audio coding model;

inputting the test set into the trained audio coding model to obtain an evaluation index of the model;

detecting whether the evaluation index of the model exceeds a preset index threshold value;

when the detection result is that the evaluation index of the model exceeds the preset index threshold, determining that model training is finished and obtaining the trained target audio coding model; and when the detection result is that the evaluation index of the model does not exceed the preset index threshold, augmenting the training set and retraining the model until the evaluation index exceeds the preset index threshold.

The preset index threshold is a preset value, for example, the preset index threshold is 95%.
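The train/evaluate/augment loop can be sketched as follows; train_fn, eval_fn, and augment_fn are assumed user-supplied callables standing in for the actual training, scoring, and sample-collection logic, and the 95% threshold mirrors the example above.

```python
from sklearn.model_selection import train_test_split

INDEX_THRESHOLD = 0.95   # the preset evaluation index threshold (example value)

def train_until_threshold(model, features, codes, train_fn, eval_fn, augment_fn):
    """Split sample data, train, evaluate, and keep retraining with an
    augmented training set until the evaluation index exceeds the threshold."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, codes, test_size=0.2)
    while True:
        train_fn(model, x_train, y_train)
        if eval_fn(model, x_test, y_test) > INDEX_THRESHOLD:
            return model                                  # training finished
        x_train, y_train = augment_fn(x_train, y_train)   # add training data
```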

The improved ResNet34 model is used as the audio coding model: the channel count of the standard ResNet34 model is simplified, and the NetVLAD method is added to aggregate features along the time axis, which optimizes the spectral features, avoids as far as possible the influence of poorly extracted frames on the spectral features, and improves the accuracy and efficiency of audio coding extraction and thus of voice classification.

The voiceprint library creating module 206 is configured to invoke the target audio coding model to process audio data of all the agents, obtain an agent audio code of each agent, and create a voiceprint library corresponding to all the agents according to the agent audio codes.

In at least one embodiment of the present application, there is a preset database in which the audio data of all agents are stored. The audio data may be voice data obtained by asking each agent to read a preset text, the preset text being a text preset by system personnel. The target audio coding model is called to process the audio data of all agents to obtain the agent audio code of each agent; the agent audio codes and the agents have a mapping relation, one agent corresponding to one agent audio code. A voiceprint library is created, which stores the basic information of each agent and the corresponding agent audio code; by querying the mapping relation, the agent audio code of each agent can be obtained. The basic information of an agent may include information identifying the agent, such as a name and an ID, which is not limited herein. The voiceprint library may be updated at a preset time interval, which may be 7 days.
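A voiceprint library of this shape can be as simple as a mapping from agent ID to basic information plus the agent audio code; encode_fn below stands in for the trained target audio coding model, and the field names are assumptions.

```python
import time

def build_voiceprint_library(agents, encode_fn):
    """agents: iterable of (agent_id, name, audio) records.
    Returns {agent_id: basic info + agent audio code}; one code per agent."""
    library = {}
    for agent_id, name, audio in agents:
        library[agent_id] = {
            "name": name,                    # basic identifying information
            "audio_code": encode_fn(audio),  # agent audio code (embedding)
            "updated_at": time.time(),       # supports the periodic refresh
        }
    return library
```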

By establishing the audio coding model for the agents, audio feature coding can be performed on each agent, and each communication session can be judged rapidly and accurately from the voiceprint features as either a training session between agents or a real session between an agent and a client.

The code obtaining module 207 is configured to receive voice session data to be verified, and call the target audio coding model to process the voice session data to be verified, so as to obtain an audio coding set to be verified.

In at least one embodiment of the present application, the voice session data to be verified is either the practice voice session content between agents or the real voice session content between an agent and a client, where the current voice session data needs to be verified. The audio code set to be verified refers to the set of audio codes of all the speakers in the voice session data to be verified, and contains two or more audio codes to be verified.

Optionally, before invoking the target audio coding model to process the session data to be verified to obtain an audio coding set to be verified, the method further includes:

dividing the voice session data to be verified into a plurality of data frames according to a preset rule;

counting the spectrum energy of the current data frame, and comparing the spectrum energy with the preset energy threshold;

if the spectrum energy is less than or equal to the preset energy threshold, determining that the current data frame is a normal audio signal;

and if the spectrum energy is larger than the preset energy threshold, determining that the current data frame contains an abnormal signal.

The preset rule refers to the division length of a data frame; for example, the preset rule may be to convert an audio signal with a duration of 10 ms or 20 ms into one data frame. In an embodiment, the audio signal of each data frame is detected in real time to determine whether an abnormal signal exists. For each data frame, the energy of the frame is determined by an energy statistic method, which may be a periodic RMS (Root Mean Square) statistic. In one embodiment, the spectral energy of the audio signal in each data frame is compared in turn with the preset energy threshold: if the spectral energy of the audio signal in a data frame is less than or equal to the preset energy threshold, the current data frame is determined to be a normal audio signal, no processing is performed on it, and detection continues with the next data frame; if the spectral energy of the audio signal in a data frame is greater than the preset energy threshold, the signal is determined to be abnormal, the remaining audio signals of the frame continue to be detected until the frame is finished, and detection then continues with the next data frame.
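A minimal sketch of the per-frame RMS energy check, assuming a floating-point signal normalized to [-1, 1]; the frame length and energy threshold are illustrative stand-ins for the preset values.

```python
import numpy as np

FRAME_MS = 20            # preset rule: one data frame per 20 ms of audio
ENERGY_THRESHOLD = 0.5   # assumed preset energy threshold

def flag_abnormal_frames(signal, sample_rate=16000):
    """Split the signal into data frames, compute periodic RMS energy,
    and flag frames whose energy exceeds the preset threshold."""
    frame_len = sample_rate * FRAME_MS // 1000
    flags = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))    # periodic RMS statistic
        flags.append(rms > ENERGY_THRESHOLD)  # True = abnormal data frame
    return flags
```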

The method and the device can avoid the problem that the audio effect is influenced by the abnormal signals caused by aging of hardware equipment, realize low-cost elimination of the abnormal signals and improve the quality of the audio signals.

The similarity calculation module 208 is configured to calculate a similarity value between each audio code to be verified in the audio code set to be verified and each proxy audio code in the voiceprint library, and detect whether a similarity between a target audio code to be verified and the proxy audio code does not exceed a preset similarity threshold.

In at least one embodiment of the present application, cosine similarity evaluates the similarity of two vectors by calculating the cosine of the included angle between them. The vectors are plotted in a vector space according to their coordinate values, for example the most common two-dimensional space; the included angle between them is obtained, and the cosine value corresponding to that angle represents the similarity of the two vectors. The smaller the angle, the closer the cosine value is to 1 and the more aligned the directions, so the more similar the vectors; the larger the angle, the closer the cosine value is to 0 and the closer the vectors are to being orthogonal, so the poorer the similarity.

Optionally, the calculating a similarity value between each audio code to be verified in the set of audio codes to be verified and each proxy audio code in the voiceprint library includes:

converting the audio code to be verified and the proxy audio code into a vector form;

processing the audio code to be verified and the proxy audio code in a vector form by adopting a preset included angle cosine value calculation model to obtain an included angle cosine value;

and determining the similarity value of the audio code to be verified and the proxy audio code according to the cosine value of the included angle.

The larger the cosine value of the included angle, the closer the audio code to be verified is to the proxy audio code; the smaller the cosine value, the less related they are. A preset similarity threshold is set. When the cosine value exceeds the preset similarity threshold, the audio code to be verified is determined to be close to the proxy audio code, that is, the audio code to be verified is in the voiceprint library and the corresponding conversation party is an agent in the voiceprint library. When the cosine value does not exceed the preset similarity threshold, the audio code to be verified is determined to be unrelated to the proxy audio code, that is, it is not in the voiceprint library and the corresponding conversation party is a stranger (that is, a client). The preset similarity threshold is a value preset by system personnel; for example, it may be 95%, which is not limited here.
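Putting the similarity calculation and the classification rule together, a minimal sketch might look like this; the library is assumed to map agent IDs directly to proxy audio codes, and the 95% threshold mirrors the example above.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95   # the preset similarity threshold (example value)

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_real_session(codes_to_verify, voiceprint_library):
    """Real session: at least one audio code to be verified matches no
    proxy audio code above the threshold (i.e. a client is speaking)."""
    for code in codes_to_verify:
        best = max(cosine_similarity(code, proxy_code)
                   for proxy_code in voiceprint_library.values())
        if best <= SIMILARITY_THRESHOLD:
            return True      # a stranger (client) was found: real session
    return False             # every speaker matched an agent: training session
```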

The data determining module 209 is configured to determine that the voice session data to be verified is real voice session data when a detection result indicates that the similarity between the target audio code to be verified and the proxy audio code does not exceed the preset similarity threshold.

In at least one embodiment of the present application, whether the similarity between the target audio code to be verified and the proxy audio code does not exceed the preset similarity threshold is detected. When the detection result indicates that the similarity between the target audio code to be verified and the proxy audio code does not exceed the preset similarity threshold, the voice session data to be verified is determined to be real voice session data; when the detection result indicates that the similarity exceeds the preset similarity threshold, the voice session data to be verified is determined to be training voice session data. Real voice session data refers to session data between an agent and a client, and training voice session data refers to session data between agents.

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present application. In the preferred embodiment of the present application, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.

It will be appreciated by those skilled in the art that the configuration of the computer device shown in Fig. 3 does not limit the embodiments of the present application; the configuration may be bus-type or star-type, and the computer device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.

In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.

It should be noted that the computer device 3 is only an example; other existing or future electronic products that can be adapted to the present application are also included in the scope of protection of the present application and are incorporated herein by reference.

In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the described voice classification method based on voiceprint recognition. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In some embodiments, the at least one processor 32 is a Control Unit (Control Unit) of the computer device 3, connects various components of the entire computer device 3 by using various interfaces and lines, and executes various functions and processes data of the computer device 3 by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or part of the steps of the voiceprint recognition based speech classification method described in the embodiments of the present application; or implement all or part of the functions of the voice classification device based on voiceprint recognition. The at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.

In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.

Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present application and not for limiting, and although the present application is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.
