Dialect recognition model training method, readable storage medium and terminal device

Document No. 662571, published 2021-04-27.

Note: this invention, "Dialect recognition model training method, readable storage medium and terminal device", was designed and created by 吴洁 (Wu Jie) on 2019-10-25. Its main content is as follows: The application belongs to the field of computer technology, and in particular relates to a dialect recognition model training method, a computer-readable storage medium and a terminal device. In the method, a preset dialect recognition model determines the output dialect category corresponding to a voice sample according to preset centroids, where each centroid represents the characteristics of one dialect category; the dialect recognition model adjusts its parameters according to the output dialect category and the target dialect category corresponding to the voice sample, and repeats the step of determining the output dialect category according to the preset centroids until a preset training condition is met, so as to obtain a trained dialect recognition model. Because the preset centroids of the various dialect categories are used during training, and each centroid can represent the characteristics of its dialect category, a reliable basis is provided for dialect recognition.

1. A method for training a dialect recognition model, comprising:

the preset dialect recognition model determines an output dialect category corresponding to the voice sample according to preset centroids, wherein each centroid is used for representing the characteristics of one dialect category;

and the dialect recognition model adjusts model parameters according to the output dialect category and the target dialect category corresponding to the voice sample, and continues to execute the step of determining the output dialect category corresponding to the voice sample according to the preset centroid until a preset training condition is met, so as to obtain a trained dialect recognition model.

2. The method for training a dialect recognition model according to claim 1, wherein the dialect recognition model comprises a word vector extraction module and a classification module;

the determining of the output dialect category corresponding to the voice sample according to the preset centroid includes:

inputting the frequency spectrum of the voice sample into the word vector extraction module to obtain a word vector of the voice sample;

and inputting the word vectors of the voice samples into the classification module, and obtaining the output dialect categories corresponding to the voice samples by the classification module according to the preset centroids and the word vectors of the voice samples.

3. The method for training a dialect recognition model according to claim 2, wherein the obtaining an output dialect class corresponding to the voice sample according to the preset centroid and the word vector of the voice sample comprises:

respectively calculating the similarity between the word vector of the voice sample and the preset centroids of various dialect categories;

and determining the dialect class corresponding to the maximum similarity as the output dialect class corresponding to the voice sample.

4. The method for training a dialect recognition model according to claim 3, further comprising, before calculating the similarity between the word vector of the speech sample and the centroid of each dialect class, respectively:

for each dialect category, acquiring a voice sample set corresponding to the dialect category, wherein the voice sample set comprises M voice samples, and M is a positive integer;

respectively calculating word vectors of all voice samples in the voice sample set to obtain M word vectors;

calculating the average of the M word vectors and determining the average as the centroid of the dialect class.

5. The method for training a dialect recognition model according to claim 4, wherein the calculating the similarity between the word vector of the voice sample and the preset centroids of the dialect categories respectively comprises:

calculating cosine similarity between the word vectors of the voice samples and the centroids of various dialect categories respectively;

and calculating the similarity between the word vector of the voice sample and the centroid of each dialect category according to the cosine similarity corresponding to each dialect category, the preset weight coefficient and the preset bias coefficient.

6. The method for training a dialect recognition model according to claim 5, wherein the adjusting model parameters according to the target dialect class and the output dialect class comprises:

calculating a training loss value of the dialect recognition model according to the similarity between the word vector of the voice sample and the centroids of various dialect categories;

and adjusting the model parameters according to the training loss value.

7. The method for training a dialect recognition model according to any one of claims 1 to 6, further comprising, after obtaining the trained dialect recognition model:

testing the dialect recognition model by using preset test data, and counting the number of successful tests and the number of failed tests respectively;

calculating the recognition accuracy of the dialect recognition model according to the number of successful tests and the number of failed tests;

if the recognition accuracy is smaller than a preset accuracy threshold, continuing to train the dialect recognition model;

and if the recognition accuracy is greater than or equal to the accuracy threshold, ending the test of the dialect recognition model.

8. A dialect identification method, comprising:

acquiring a frequency spectrum of a voice to be recognized;

inputting the frequency spectrum of the speech to be recognized into a trained dialect recognition model, and acquiring a dialect category which is output by the dialect recognition model and corresponds to the speech to be recognized, wherein the dialect recognition model is trained by the method of any one of claims 1 to 7.

9. The dialect identification method of claim 8, wherein the dialect identification model includes a word vector extraction module and a classification module;

the inputting the frequency spectrum of the speech to be recognized into a trained dialect recognition model and acquiring the dialect category output by the dialect recognition model and corresponding to the speech to be recognized comprises:

inputting the frequency spectrum of the voice to be recognized into the word vector extraction module to obtain a word vector of the voice to be recognized;

and inputting the word vector of the voice to be recognized into the classification module to obtain a dialect category corresponding to the voice to be recognized.

10. The dialect recognition method of claim 9, wherein the inputting the word vector of the speech to be recognized into the classification module to obtain the dialect class corresponding to the speech to be recognized comprises:

respectively calculating the similarity between the word vector of the voice to be recognized and the preset centroids of various dialect categories;

and determining the dialect class corresponding to the maximum similarity as the dialect class corresponding to the voice to be recognized.

11. The dialect recognition method of claim 10, further comprising, before calculating the similarity between the word vector of the speech to be recognized and the centroid of each of the preset dialect classes, respectively:

for each dialect category, acquiring a voice sample set corresponding to the dialect category, wherein the voice sample set comprises M voice samples, and M is a positive integer;

respectively calculating word vectors of all voice samples in the voice sample set to obtain M word vectors;

calculating the average of the M word vectors and determining the average as the centroid of the dialect class.

12. The dialect recognition method of claim 10, wherein the calculating the similarity between the word vector of the speech to be recognized and the centroids of the preset dialect categories respectively comprises:

calculating cosine similarity between the word vectors of the voice to be recognized and the centroids of various dialect categories respectively;

and calculating the similarity between the word vector of the voice to be recognized and the centroids of the various dialect categories according to the cosine similarity corresponding to the dialect categories, the preset weight coefficient and the preset bias coefficient.

13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

14. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.

Technical Field

The application belongs to the technical field of computers, and particularly relates to a dialect recognition model training method, a computer-readable storage medium and a terminal device.

Background

With the development of science and technology, more and more intelligent devices have entered people's lives and made them more convenient. At present, people interact with intelligent devices by voice most frequently through household appliances and vehicle-mounted equipment. Voice interaction largely frees the user's hands and limbs, and is especially valuable for elderly people and children who cannot operate a complex electronic system, since the system can be driven directly by spoken instructions. However, many elderly people cannot speak Mandarin, and even the Mandarin of some speakers carries a local accent, while the voice interaction function of existing intelligent devices is designed for Mandarin and cannot effectively recognize dialects.

Disclosure of Invention

In view of this, embodiments of the present application provide a dialect recognition model training method, a computer-readable storage medium, and a terminal device, so as to solve the problem that the voice interaction function of existing intelligent devices is designed for Mandarin and cannot effectively recognize dialects.

A first aspect of an embodiment of the present application provides a method for training a dialect recognition model, which may include:

the preset dialect recognition model determines an output dialect category corresponding to the voice sample according to preset centroids, wherein each centroid is used for representing the characteristics of one dialect category;

and the dialect recognition model adjusts model parameters according to the output dialect category and the target dialect category corresponding to the voice sample, and continues to execute the step of determining the output dialect category corresponding to the voice sample according to the preset centroid until a preset training condition is met, so as to obtain a trained dialect recognition model.

Further, the dialect recognition model comprises a word vector extraction module and a classification module;

the determining of the output dialect category corresponding to the voice sample according to the preset centroid includes:

inputting the frequency spectrum of the voice sample into the word vector extraction module to obtain a word vector of the voice sample;

and inputting the word vectors of the voice samples into the classification module, and obtaining the output dialect categories corresponding to the voice samples by the classification module according to the preset centroids and the word vectors of the voice samples.

Further, the obtaining an output dialect category corresponding to the voice sample according to the preset centroid and the word vector of the voice sample includes:

respectively calculating the similarity between the word vector of the voice sample and the preset centroids of various dialect categories;

and determining the dialect class corresponding to the maximum similarity as the output dialect class corresponding to the voice sample.

Further, before calculating the similarity between the word vector of the speech sample and the centroid of each preset dialect class, the method further includes:

for each dialect category, acquiring a voice sample set corresponding to the dialect category, wherein the voice sample set comprises M voice samples, and M is a positive integer;

respectively calculating word vectors of all voice samples in the voice sample set to obtain M word vectors;

calculating the average of the M word vectors and determining the average as the centroid of the dialect class.

Further, the calculating the similarity between the word vector of the voice sample and the preset centroids of various dialect categories respectively includes:

calculating cosine similarity between the word vectors of the voice samples and the centroids of various dialect categories respectively;

and calculating the similarity between the word vector of the voice sample and the centroid of each dialect category according to the cosine similarity corresponding to each dialect category, the preset weight coefficient and the preset bias coefficient.

Further, the adjusting the model parameters according to the target dialect category and the output dialect category includes:

calculating a training loss value of the dialect recognition model according to the similarity between the word vector of the voice sample and the centroids of various dialect categories;

and adjusting the model parameters according to the training loss value.

Further, after obtaining the trained dialect recognition model, the method further includes:

testing the dialect recognition model by using preset test data, and counting the number of successful tests and the number of failed tests respectively;

calculating the recognition accuracy of the dialect recognition model according to the number of successful tests and the number of failed tests;

if the recognition accuracy is smaller than a preset accuracy threshold, continuing to train the dialect recognition model;

and if the recognition accuracy is greater than or equal to the accuracy threshold, ending the test of the dialect recognition model.

A second aspect of an embodiment of the present application provides a dialect identifying method, which may include:

acquiring a frequency spectrum of a voice to be recognized;

inputting the frequency spectrum of the speech to be recognized into a trained dialect recognition model, and acquiring the dialect category output by the dialect recognition model and corresponding to the speech to be recognized, wherein the dialect recognition model is obtained by training through any one of the dialect recognition model training methods.

Further, the dialect recognition model comprises a word vector extraction module and a classification module;

the inputting the frequency spectrum of the speech to be recognized into a trained dialect recognition model and acquiring the dialect category output by the dialect recognition model and corresponding to the speech to be recognized comprises:

inputting the frequency spectrum of the voice to be recognized into the word vector extraction module to obtain a word vector of the voice to be recognized;

and inputting the word vector of the voice to be recognized into the classification module to obtain a dialect category corresponding to the voice to be recognized.

Further, the inputting the word vector of the speech to be recognized into the classification module to obtain a dialect category corresponding to the speech to be recognized includes:

respectively calculating the similarity between the word vector of the voice to be recognized and the preset centroids of various dialect categories;

and determining the dialect class corresponding to the maximum similarity as the dialect class corresponding to the voice to be recognized.

Further, before calculating the similarity between the word vector of the speech to be recognized and the centroid of each preset dialect category, the method further includes:

for each dialect category, acquiring a voice sample set corresponding to the dialect category, wherein the voice sample set comprises M voice samples, and M is a positive integer;

respectively calculating word vectors of all voice samples in the voice sample set to obtain M word vectors;

calculating the average of the M word vectors and determining the average as the centroid of the dialect class.

Further, the calculating the similarity between the word vector of the speech to be recognized and the preset centroids of various dialect categories respectively includes:

calculating cosine similarity between the word vectors of the voice to be recognized and the centroids of various dialect categories respectively;

and calculating the similarity between the word vector of the voice to be recognized and the centroids of the various dialect categories according to the cosine similarity corresponding to the dialect categories, the preset weight coefficient and the preset bias coefficient.

A third aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, which when executed by a processor, implements the steps of the training method of any of the dialect recognition models described above.

A fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the training method for any dialect recognition model described above when executing the computer program.

Compared with the prior art, the embodiment of the application has the advantages that: the preset dialect recognition model in the embodiment of the application determines the output dialect category corresponding to the voice sample according to the preset mass center, wherein each mass center is used for representing the characteristic of one dialect category; and the dialect recognition model adjusts model parameters according to the output dialect category and the target dialect category corresponding to the voice sample, and continues to execute the step of determining the output dialect category corresponding to the voice sample according to the preset centroid until a preset training condition is met, so as to obtain a trained dialect recognition model. Through the training mode, the dialect recognition model is continuously trained by using the training data, the centroids corresponding to various preset dialect categories are used in the training process, and the centroid of each dialect category can represent the characteristics of the dialect category, so that a reliable basis is provided for dialect recognition, model parameters are continuously adjusted according to the training result, and finally the dialect recognition model meeting the training conditions can be obtained.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a flowchart of an embodiment of a method for training a dialect recognition model according to an embodiment of the present application;

FIG. 2 is a schematic flow diagram of a process for preprocessing training data;

FIG. 3 is a schematic diagram of a data processing process of the dialect recognition model;

FIG. 4 is a schematic flow chart of the dialect recognition model processing the spectrum of a speech sample to obtain an output dialect class corresponding to the speech sample;

FIG. 5 is a flowchart of an embodiment of a dialect identification method according to an embodiment of the present application;

FIG. 6 is a schematic flow diagram of a pre-processing process of speech to be recognized;

FIG. 7 is a schematic flow chart of the dialect recognition model processing the frequency spectrum of the speech to be recognized to obtain the dialect class corresponding to the speech to be recognized;

fig. 8 is a schematic block diagram of a terminal device in an embodiment of the present application.

Detailed Description

In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, an embodiment of a method for training a dialect recognition model in an embodiment of the present application may include:

and S101, determining an output dialect category corresponding to the voice sample according to a preset centroid by a preset dialect recognition model.

Wherein each centroid is used to characterize a dialect class.

The spectrum of the voice sample can be obtained by a preprocessing device (including but not limited to a computer, a server, and other terminal devices with computing power) through a preprocessing process as shown in fig. 2:

step S201, a voice sample is obtained.

When the dialect recognition model is trained, a large amount of training data is generally required, where the training data may include multiple sets of training data, and each set of training data includes a frequency spectrum of a voice sample and a target dialect category corresponding to the voice sample. The specific number of training data may be set according to actual situations, for example, the dialect recognition model may obtain 1000 sets, 2000 sets, 3000 sets or other numbers of training data for training.

Generally, the voice samples may be obtained from a preset voice sample library. The voice sample library may be created by collecting voice samples from users of different dialect categories and dividing all the voice samples in the library into respective voice sample sets according to dialect category. Preferably, the voice samples in any one voice sample set may be collected from the different regions where the corresponding dialect is spoken. For example, if a voice sample set of the Shaanxi dialect requires 6000 voice samples, 60 sentences can be collected from each of 100 Shaanxi speakers, or 1 sentence from each of 6000 Shaanxi speakers, and the speakers should come from different areas of Shaanxi, such as Baoji, Yan'an, Xi'an and Hanzhong, so as to generate the voice sample set of the Shaanxi dialect. The same operation is performed for the dialects of other regions to obtain a voice sample set for each dialect. Any voice sample in the voice sample library has a corresponding dialect category, namely the target dialect category; for example, if a certain voice sample belongs to the voice sample set of the Shaanxi dialect, its target dialect category is the Shaanxi dialect.

For simplicity, dialect labels may be used to represent the various dialect categories. For example, if the voice sample library contains 5 dialect categories, they may be represented by the dialect labels 0, 1, 2, 3 and 4 respectively. Illustratively, the dialect label of Southern Min is 0, Hakka is 1, Sichuan is 2, Shanghai is 3, and Guizhou is 4. It should be noted that the above is merely an example; in practical applications, different dialect labels may be set according to the specific situation, which is not detailed in this embodiment.

Step S202, the voice sample is processed to obtain the frequency spectrum of the voice sample.

Typically, the original data format of the voice samples is the WAV audio format, which is closest to lossless audio, so its files are relatively large. In practical applications, the voice samples may be converted from the WAV audio format to the PCM audio format in advance in order to reduce the amount of subsequent computation. Preferably, considering that the voice samples may contain mute signals, which generally occur before the user speaks, after the user speaks and during pauses in speech, and carry no useful information, the mute signals may be removed from the voice samples to reduce interference with the final recognition result.

The voice samples are presented in the form of sound waves, whose amplitude represents loudness; however, sound waves do not reflect the characteristics of speech well for speech recognition, so the time-domain sound wave is converted into a frequency spectrum that better reflects the characteristics of the speech. In this embodiment, the spectrum may be a Mel spectrum, a representation of short-term audio based on a logarithmic spectrum on the nonlinear Mel scale and a linear cosine transform thereof. In one specific implementation, the voice sample may first be converted from the time domain to the frequency domain by Fourier transform, its logarithmic energy spectrum is then convolved by a set of triangular filters distributed according to the Mel scale, and finally the vector formed by the filter outputs is subjected to a discrete cosine transform to obtain the Mel spectrum.
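As an illustration of this preprocessing pipeline, the following is a minimal Python sketch assuming the librosa library (the embodiment names no specific toolkit; function and parameter values are illustrative). The final discrete cosine transform described above corresponds to what librosa.feature.mfcc computes; the sketch stops at the log-Mel energies for brevity.

```python
# A minimal preprocessing sketch, assuming librosa (an assumption; the
# embodiment names no toolkit). Parameter values are illustrative.
import librosa

def speech_to_mel_spectrum(wav_path, sr=16000, n_mels=40):
    """Load a speech sample, trim silence, and return its log-Mel spectrum."""
    signal, sr = librosa.load(wav_path, sr=sr)           # decode WAV/PCM audio
    signal, _ = librosa.effects.trim(signal, top_db=30)  # strip leading/trailing silence
    # Fourier transform -> Mel-scale triangular filter bank -> log energies
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                      # shape: (n_mels, n_frames)
```

Note that librosa.effects.trim only removes leading and trailing silence; removing internal pauses, as the text suggests, would need an extra splitting step such as librosa.effects.split.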

In this embodiment, after the dialect recognition model obtains the training sample, the dialect recognition model may process the spectrum of the voice sample through the process shown in fig. 3, and calculate and output a dialect category corresponding to the voice sample, that is, the output dialect category. The dialect recognition model comprises a word vector extraction module and a classification module, wherein the word vector extraction module is used for extracting word vectors of the voice samples according to the frequency spectrums of the voice samples, and the classification module is used for determining output dialect categories corresponding to the voice samples according to the word vectors of the voice samples.

Specifically, step S101 may include a process as shown in fig. 4:

step S1011, inputting the spectrum of the voice sample into the word vector extraction module to obtain the word vector of the voice sample.

The word vector extraction module may be any suitable existing network, such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Preferably, in this embodiment, a Long Short-Term Memory (LSTM) network may be used as the word vector extraction module, and the spectrum of the voice sample is input into the LSTM network to obtain the word vector of the voice sample.

The LSTM network used in this embodiment may be composed of three recurrent layers and one fully-connected layer. The spectrum of the voice sample passes through the feature extraction of each recurrent layer to obtain a two-dimensional feature map (step a in fig. 3), which is further condensed by the fully-connected layer into a one-dimensional vector, namely the word vector of the voice sample (step b in fig. 3).

The number of features in a word vector (i.e., the length of the word vector) is determined by the number of nodes in the fully-connected layer. For example, if the total number of dialect categories is 5, the number of nodes in the full-connected layer may also be set to 5, and the number of features of the word vector obtained after the full-connected layer processing is also 5.

Preferably, in order to prevent overfitting, that is, the case where the recognition accuracy of the model on the training data is extremely high but its accuracy on data outside the training data is extremely low, the word vector obtained in this embodiment may be further regularized. Regularization is a generic term for methods that introduce additional information into the model to prevent overfitting and improve generalization, including but not limited to L1 regularization and L2 regularization; L2 regularization is preferably used in this embodiment.
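As a concrete illustration, a hedged Keras sketch of such a word vector extraction module follows: three recurrent layers plus one fully-connected layer whose node count equals the number of dialect classes. The layer width (128) is an assumption, and the L2 treatment of the word vector is interpreted here as L2 normalization of the embedding, which is one common reading, not necessarily the patent's exact scheme.

```python
# A hedged Keras sketch of the word vector extraction module: three LSTM
# layers plus one fully-connected layer. Layer sizes are assumptions.
import tensorflow as tf

NUM_DIALECTS = 5  # total number of dialect categories (example value)

def build_word_vector_extractor(n_mels=40):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, n_mels)),       # (frames, Mel bins)
        tf.keras.layers.LSTM(128, return_sequences=True),  # recurrent layer 1
        tf.keras.layers.LSTM(128, return_sequences=True),  # recurrent layer 2
        tf.keras.layers.LSTM(128),                         # recurrent layer 3
        tf.keras.layers.Dense(NUM_DIALECTS),               # fully-connected layer
    ])

def extract_word_vector(model, log_mel):
    x = tf.transpose(log_mel)[tf.newaxis, ...]  # (1, n_frames, n_mels)
    e = model(x)[0]                             # one-dimensional word vector
    return e / tf.norm(e)                       # L2-normalize the embedding
```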

Step S1012, inputting the word vector of the voice sample into the classification module, and the classification module obtains an output dialect category corresponding to the voice sample according to the preset centroid and the word vector of the voice sample.

Specifically, the similarity between the word vector of the speech sample and the centroids of the preset dialect categories may be calculated first.

The centroid of each dialect category can be calculated in advance, and for each dialect category, a voice sample set corresponding to the dialect category is obtained first, wherein the voice sample set includes M voice samples, and M is a positive integer. And then, respectively calculating word vectors of all the voice samples in the voice sample set to obtain M word vectors. Next, the average of the M word vectors is calculated and determined as the centroid of the dialect class.

Taking the k-th dialect class as an example (1 ≤ k ≤ K, where K is the total number of dialect classes), its centroid can be calculated according to the following formula:

c_k = (1/M) · Σ_{m=1}^{M} e_km

where m is the index of each voice sample in the voice sample set corresponding to the k-th dialect class, 1 ≤ m ≤ M, e_km is the word vector of the m-th voice sample in that set, and c_k is the centroid of the k-th dialect class.

The above centroid calculation is executed for each dialect class to obtain the centroids of the various dialect classes. Illustratively, given 5 dialect classes, 200 utterances each of Southern Min, Hakka, Sichuan, Shanghai and Guizhou speech are selected; the word vectors of the 200 Southern Min utterances are extracted through the LSTM network and their average is calculated to obtain the centroid of Southern Min, and the centroids of Hakka, Sichuan, Shanghai and Guizhou are obtained by analogy.
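A minimal NumPy sketch of this centroid computation (helper names are illustrative):

```python
# Centroid of a dialect class = mean of the word vectors of its M samples.
import numpy as np

def dialect_centroid(word_vectors):
    """word_vectors: array of shape (M, D) -> centroid c_k of shape (D,)."""
    return word_vectors.mean(axis=0)

# Stacking the per-class centroids gives a (K, D) matrix used below:
# centroids = np.stack([dialect_centroid(v) for v in per_class_word_vectors])
```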

When the similarity between the word vector of the voice sample and the preset centroids of various dialect categories is calculated respectively, the cosine similarity between the word vector of the voice sample and the centroids of various dialect categories can be calculated firstly, and then the similarity between the word vector of the voice sample and the centroids of various dialect categories is calculated according to the cosine similarity corresponding to the dialect categories respectively, the preset weight coefficient and the preset bias coefficient.

For example, the similarity between the word vectors of the speech samples and the centroids of the various dialect classes may be calculated according to:

S_k = ω · cos(e, c_k) + b

where e is the word vector of the voice sample, cos(e, c_k) is the cosine similarity between the word vector of the voice sample and the centroid of the k-th dialect class, ω is the weight coefficient (the same for all dialect classes), b is the bias coefficient (also the same for all dialect classes), and S_k is the similarity between the word vector of the voice sample and the centroid of the k-th dialect class.

As shown in step c in fig. 3, after the similarities between the word vector of the voice sample and the preset centroids of the various dialect categories are calculated, the results form the similarity matrix shown in the figure, and the dialect category with the maximum similarity may be determined as the output dialect category corresponding to the voice sample. For example, if the similarities between the word vector of the voice sample and the centroids of the 5 dialect categories Southern Min, Hakka, Sichuan, Shanghai and Guizhou are S_1, S_2, S_3, S_4 and S_5 respectively, and S_4 is the largest, the Shanghai dialect is determined as the output dialect category corresponding to the voice sample.
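This classification step can be sketched as follows; ω and b stand for the learned weight and bias coefficients of the classification module, and the numeric values here are placeholders, not values from the patent:

```python
# Scaled cosine similarity against every centroid, then argmax.
import numpy as np

def classify(e, centroids, omega=10.0, bias=-5.0):  # omega/bias: placeholders
    """e: word vector (D,); centroids: (K, D) -> (best class index, scores)."""
    cos = centroids @ e / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(e))
    s = omega * cos + bias       # S_k = ω·cos(e, c_k) + b
    return int(np.argmax(s)), s  # dialect class with maximum similarity
```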

And S102, the dialect recognition model adjusts model parameters according to the target dialect type and the output dialect type, and continues to execute the step of determining the output dialect type corresponding to the voice sample according to the preset centroid until the preset training condition is met, so that the trained dialect recognition model is obtained.

Specifically, the target dialect type and the output dialect type may be compared, and if the target dialect type and the output dialect type are not consistent, it is determined that the output of the model is still inaccurate, a training loss value of the dialect recognition model may be calculated, and model parameters of the dialect recognition model may be adjusted.

For example, the training loss value of the dialect recognition model may be calculated with TensorFlow's loss function tf.nn.sparse_softmax_cross_entropy_with_logits(labels=None, logits=None), which combines the softmax function and the cross-entropy loss. Here labels is the target dialect category and logits is the output dialect category y, which satisfies y = w·x + b, where x is the input of the dialect recognition model (the frequency spectrum of the voice sample), and w and b are the model parameters of the dialect recognition model (the weight parameter and the bias parameter, respectively). With the target dialect category (labels) and the output dialect category (logits) as input parameters of the loss function, the computed output value is the training loss value of the dialect recognition model.
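A minimal usage sketch of that TensorFlow loss (TF 2.x eager style; the batch values are illustrative):

```python
# Softmax cross-entropy between target labels and per-class scores.
import tensorflow as tf

labels = tf.constant([3, 0])                       # target dialect labels
logits = tf.constant([[0.1, 0.2, 0.1, 2.3, 0.4],   # per-class scores, shape (batch, K)
                      [1.9, 0.3, 0.2, 0.1, 0.0]])
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
mean_loss = tf.reduce_mean(loss)                   # scalar training loss value
```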

In this embodiment, the training loss value of the dialect recognition model may also be calculated according to the similarity between the word vector of the voice sample and the centroids of various dialect categories.

In a first specific implementation of this embodiment, the training loss value of the dialect recognition model may be calculated according to the following formula:

L_s = -S_tg + log Σ_{k=1}^{K} exp(S_k)

where S_tg is the similarity between the word vector of the voice sample and the centroid of the target dialect class, and L_s is the training loss value of the dialect recognition model in the first implementation.

In a second specific implementation of this embodiment, the training loss value of the dialect recognition model may instead be calculated according to the following formula:

L_c = 1 - σ(S_tg) + max_{k≠tg} σ(S_k)

where σ is the Sigmoid function, i.e. σ(x) = 1/(1 + exp(-x)), and L_c is the training loss value of the dialect recognition model in the second implementation.

In a third specific implementation of this embodiment, the sum of the two training loss values may be used as the training loss value of the dialect recognition model, that is:

L_g = L_s + L_c

where L_g is the training loss value of the dialect recognition model in the third implementation.
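The three loss variants can be sketched in NumPy as below. The original formula images are not reproduced here, so the formulas above are reconstructions from the stated definitions, and this code is a sketch under those assumptions rather than a definitive implementation:

```python
# L_s (softmax loss), L_c (sigmoid/contrast loss), and their sum L_g.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def training_losses(s, target):
    """s: (K,) similarities S_k; target: index of the target dialect class."""
    l_s = -s[target] + np.log(np.sum(np.exp(s)))              # L_s
    others = np.delete(s, target)                             # S_k for k != target
    l_c = 1.0 - sigmoid(s[target]) + np.max(sigmoid(others))  # L_c
    return l_s, l_c, l_s + l_c                                # L_g = L_s + L_c
```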

After the training loss value of the dialect recognition model is obtained through calculation, the model parameters can be adjusted according to the training loss value.

In this embodiment, assume the model parameter of the dialect recognition model is W1. The training loss value is back-propagated to modify W1, yielding a modified parameter W2. After the parameters are modified, the step of determining the output dialect category according to the preset centroids is executed again, i.e. the next training iteration begins: the frequency spectrum of a new group of voice samples is processed, the corresponding training loss value is calculated and back-propagated to modify W2 into W3, and so on. This process is repeated, training on a new group of voice samples and modifying the model parameters in each iteration, until a preset training condition is met. The training condition may be that the number of training iterations reaches a preset threshold (optionally, 100000), or that the dialect recognition model converges. Since the model may converge before the iteration threshold is reached (so further iterations would be wasted work), or may never converge (so the loop would never end), the training condition may also be that the iteration count reaches the threshold or the model converges, whichever occurs first. When the training condition is met, the trained dialect recognition model is obtained.
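Schematically, this loop can be written as below (a sketch only: helper names, the optimizer, and the convergence test are illustrative assumptions):

```python
# Iterate: forward pass -> loss -> back-propagate -> update parameters,
# stopping when the step budget is reached or the loss stops improving.
import tensorflow as tf

def train(model, dataset, optimizer, max_steps=100000, tol=1e-4):
    prev_loss = float("inf")
    for step, (spectrum, target) in enumerate(dataset):
        with tf.GradientTape() as tape:
            logits = model(spectrum, training=True)
            loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=target, logits=logits))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if step + 1 >= max_steps or abs(prev_loss - float(loss)) < tol:
            break  # step budget reached, or model has converged
        prev_loss = float(loss)
    return model
```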

Further, after obtaining the trained dialect recognition model, the following test procedure may be performed:

First, the dialect recognition model is tested by using preset test data, and the numbers of successful tests and failed tests are counted respectively.

The test data, similar to the training data, also includes a speech sample and a target dialect class corresponding to the speech sample. In a specific application, in establishing the voice sample library, the voice sample library may be divided into two parts, one part is the training data, and the other part is the testing data. For example, if there are 6000 speech samples for each dialect category in the speech sample library, 500 speech samples for each dialect category may be randomly selected from the speech sample library as the test data of the dialect category, and the remaining 5500 speech samples may be used as the training data of the dialect category.

If, after a voice sample of a piece of test data is processed by the dialect recognition model, the obtained output dialect category is consistent with the target dialect category, the test of that piece of test data succeeds; otherwise, if the output dialect category is inconsistent with the target dialect category, the test fails. After all the test data have been used to test the dialect recognition model, the numbers of successful and failed tests can be obtained by statistics.

Then, the recognition accuracy of the dialect recognition model is calculated according to the numbers of successful and failed tests.

Specifically, the recognition accuracy of the dialect recognition model may be calculated according to the following formula:

AcRt = N1 / (N1 + N2)

where N1 is the number of successful tests, N2 is the number of failed tests, and AcRt is the recognition accuracy of the dialect recognition model.

If the recognition accuracy is smaller than a preset accuracy threshold, the dialect recognition model continues to be trained; if the recognition accuracy is greater than or equal to the accuracy threshold, the test of the dialect recognition model ends. The accuracy threshold may be set according to actual conditions, for example, to 90%, 95%, 98%, or other values.

In a specific application, in order to facilitate verification of the test result, the voice samples may be labelled according to their dialect categories, specifically by dialect labels. Illustratively, if 10 voice samples, all of the Shanghai dialect, are used to test the dialect recognition model, the dialect label is set to [3,3,3,3,3,3,3,3,3,3] when the voice samples are input. If the output result of the dialect recognition model is [0,2,1,4,3,1,2,0,2,3], the classification result is obviously poor, so the model structure and parameters can be further adjusted to optimize the model until the final classification result approaches [3,3,3,3,3,3,3,3,3,3].

In a specific application, in order to measure the test result more accurately, 500 voice samples of each dialect can be tested at a time: the dialect recognition model classifies the 500 voice samples of each dialect, and with 5 dialects in total, every output value lies in [0,1,2,3,4]. Whether the output meets the preset accuracy threshold is then judged. Specifically, the output result is compared position by position with the dialect labels marked at input time, the number of matching positions is counted, the ratio of that number to the total number is calculated, and it is judged whether the ratio is greater than or equal to the accuracy threshold. Illustratively, for 500 voice samples of Southern Min, the dialect label is marked as [0,0,0,...,0,0,0]. If the test result is [0,0,0,...,1,0,2,...,0,1,0], 500 results in total, the number of positions where the test result matches the dialect label is counted and denoted by N, N/500 is calculated and taken as the recognition accuracy of the dialect recognition model, and it is judged whether N/500 is greater than or equal to the preset accuracy threshold. If so, the test result meets the requirement and the construction of the model is complete; if not, the test result does not meet the requirement, and the model parameters are adjusted and the model is trained again until the test result meets the requirement.
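A small sketch of this accuracy check (names are illustrative):

```python
# AcRt = N1 / (N1 + N2): position-by-position comparison of outputs vs labels.
import numpy as np

def recognition_accuracy(predicted, labels):
    predicted, labels = np.asarray(predicted), np.asarray(labels)
    n1 = int(np.sum(predicted == labels))  # number of successful tests
    n2 = predicted.size - n1               # number of failed tests
    return n1 / (n1 + n2)

# e.g. 500 Southern Min samples, all labelled 0:
# acc = recognition_accuracy(model_outputs, np.zeros(500, dtype=int))
# keep training while acc < accuracy_threshold (e.g. 0.95)
```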

In summary, in the embodiment of the present application, a preset dialect recognition model processes a spectrum of a voice sample according to preset centroids of various dialect categories to obtain output dialect categories corresponding to the voice sample, where each centroid is used to represent a feature of one dialect category; and the dialect recognition model adjusts model parameters according to the output dialect category and the target dialect category corresponding to the voice sample, and continues to execute the step of determining the output dialect category corresponding to the voice sample according to the preset centroid until a preset training condition is met, so as to obtain a trained dialect recognition model. Through the training mode, the dialect recognition model is continuously trained by using the training data, the centroids corresponding to various preset dialect categories are used in the training process, and the centroid of each dialect category can represent the characteristics of the dialect category, so that a reliable basis is provided for dialect recognition, model parameters are continuously adjusted according to the training result, and finally the dialect recognition model meeting the training conditions can be obtained.

Referring to fig. 5, an embodiment of a dialect identifying method in an embodiment of the present application may include:

step S501, obtaining the frequency spectrum of the voice to be recognized.

The spectrum of the speech to be recognized can be obtained in advance by a preprocessing device (including but not limited to a computer, a server, and other terminal devices with computing capability) through a preprocessing process as shown in fig. 6:

step S5011, obtains a voice to be recognized.

The voice to be recognized can be the voice instantly collected by the user through a microphone of a terminal device such as a mobile phone and a tablet personal computer. In a specific usage scenario of this embodiment, when a user wants to perform dialect recognition immediately, before acquiring a speech to be recognized, the dialect recognition mode of the terminal device may be opened by clicking a specific physical key or a virtual key, and in this mode, the terminal device may process each sentence of speech acquired by the user according to subsequent steps to obtain a dialect category corresponding to the speech, where a specific processing procedure will be described in detail later.

The voice to be recognized may also be a voice originally stored in the terminal device, or a voice acquired by the terminal device from a cloud server or other terminal devices through a network. In another specific use scenario of this embodiment, when a user wants to perform dialect recognition on one or more existing voices to be recognized, the dialect recognition mode of the terminal device may be opened by clicking a specific physical key or virtual key, and the voices to be recognized are selected (the order of clicking the key and selecting the voices may be interchanged, that is, the voices may also be selected first, and then the dialect recognition mode of the terminal device is opened), and then the terminal device may process the voices to be recognized according to subsequent steps to obtain dialect categories corresponding to the voices, where a specific processing procedure will be described in detail later.

Step S5012, processing the voice to be recognized to obtain a frequency spectrum of the voice to be recognized.

Generally, the original data format of the speech to be recognized is the WAV audio format, which is closest to lossless audio, so its files are relatively large. In practical applications, in order to reduce the amount of subsequent computation, the speech to be recognized may be converted from the WAV audio format to the PCM audio format in advance. Preferably, considering that the speech to be recognized may contain mute signals, which generally occur before the user speaks, after the user speaks and during pauses in speech, and carry no useful information, the mute signals may be removed from the speech to be recognized to reduce interference with the final recognition result.

The speech to be recognized is presented in the form of sound waves, whose amplitude represents loudness; however, sound waves do not reflect the characteristics of speech well for speech recognition, so the time-domain sound wave is converted into a frequency spectrum that better reflects the characteristics of the speech. In this embodiment, the spectrum may be a Mel spectrum, a representation of short-term audio based on a logarithmic spectrum on the nonlinear Mel scale and a linear cosine transform thereof. In one specific implementation, the speech to be recognized may first be converted from the time domain to the frequency domain by Fourier transform, its logarithmic energy spectrum is then convolved by a set of triangular filters distributed according to the Mel scale, and finally the vector formed by the filter outputs is subjected to a discrete cosine transform to obtain the Mel spectrum.

Step S502, inputting the frequency spectrum of the voice to be recognized into a trained dialect recognition model, and acquiring a dialect category which is output by the dialect recognition model and corresponds to the voice to be recognized.

The dialect recognition model is obtained by training through any one of the training methods of the dialect recognition models.

In this embodiment, after obtaining the frequency spectrum of the speech to be recognized, the dialect recognition model may process the frequency spectrum of the speech to be recognized according to the preset centroids respectively corresponding to various dialect categories, and calculate and output the dialect categories corresponding to the speech to be recognized, where each centroid is used to represent a feature of one dialect category. The dialect recognition model comprises a word vector extraction module and a classification module, wherein the word vector extraction module is used for extracting word vectors of the voice to be recognized according to the frequency spectrum of the voice to be recognized, and the classification module is used for determining dialect categories corresponding to the voice to be recognized according to the word vectors of the voice to be recognized.

Specifically, step S502 may include the process as shown in fig. 7:

step S5021, inputting the frequency spectrum of the voice to be recognized into the word vector extraction module to obtain the word vector of the voice to be recognized.

The word vector extraction module may be any suitable existing network, such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). Preferably, in this embodiment, a Long Short-Term Memory (LSTM) network may be used as the word vector extraction module, and the spectrum of the speech to be recognized is input into the LSTM network to obtain the word vector of the speech to be recognized.

Step S5022, the word vectors of the voice to be recognized are input into the classification module, and dialect categories corresponding to the voice to be recognized are obtained.

Specifically, the similarity between the word vector of the speech to be recognized and the centroid of each of the preset dialect categories may be calculated first.

The centroid of each dialect category can be calculated in advance, and for each dialect category, a voice sample set corresponding to the dialect category is obtained first, wherein the voice sample set includes M voice samples, and M is a positive integer. And then, respectively calculating word vectors of all the voice samples in the voice sample set to obtain M word vectors. Next, the average of the M word vectors is calculated and determined as the centroid of the dialect class.

For example, the centroid of the k-th dialect class may be calculated according to:

c_k = (1/M) · Σ_{m=1}^{M} e_km

where m is the index of each voice sample in the voice sample set corresponding to the k-th dialect class, 1 ≤ m ≤ M, e_km is the word vector of the m-th voice sample in that set, and c_k is the centroid of the k-th dialect class.

The above centroid calculation is executed for each dialect class to obtain the centroids of the various dialect classes. Illustratively, given 5 dialect classes, 200 utterances each of Southern Min, Hakka, Sichuan, Shanghai and Guizhou speech are selected; the word vectors of the 200 Southern Min utterances are extracted through the LSTM network and their average is calculated to obtain the centroid of Southern Min, and the centroids of Hakka, Sichuan, Shanghai and Guizhou are obtained by analogy.

When the similarities between the word vector of the speech to be recognized and the preset centroids of the various dialect categories are calculated, the cosine similarity between the word vector of the speech to be recognized and the centroid of each dialect category can be calculated first, and then the similarity between the word vector and the centroid of each dialect category is calculated according to the corresponding cosine similarity, the preset weight coefficient and the preset bias coefficient.

For example, the similarity between the word vector of the speech to be recognized and the centroids of the various dialect classes may be calculated according to:

S_k = ω · cos(e, c_k) + b

where e is the word vector of the speech to be recognized, cos(e, c_k) is the cosine similarity between the word vector of the speech to be recognized and the centroid of the k-th dialect class, ω is the weight coefficient (the same for all dialect classes), b is the bias coefficient (also the same for all dialect classes), and S_k is the similarity between the word vector of the speech to be recognized and the centroid of the k-th dialect class.

After the similarities between the word vector of the speech to be recognized and the preset centroids of the various dialect categories are calculated, the dialect category with the maximum similarity can be determined as the dialect category corresponding to the speech to be recognized. For example, if the similarities between the word vector of the speech to be recognized and the centroids of the 5 dialect categories Southern Min, Hakka, Sichuan, Shanghai and Guizhou are S_1, S_2, S_3, S_4 and S_5 respectively, and S_4 is the largest, the Shanghai dialect is determined as the dialect category corresponding to the speech to be recognized.
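Tying the pieces together, inference can be sketched end to end using the illustrative helpers from the training section (speech_to_mel_spectrum, extract_word_vector, classify); the file name, variable names and dialect list are examples only:

```python
# End-to-end dialect recognition for one utterance (illustrative names;
# `extractor` and `centroids` are assumed to have been built as sketched above).
DIALECTS = ["Southern Min", "Hakka", "Sichuan", "Shanghai", "Guizhou"]

log_mel = speech_to_mel_spectrum("utterance.wav")    # preprocessing (fig. 6)
e = extract_word_vector(extractor, log_mel).numpy()  # word vector module
k, scores = classify(e, centroids)                   # similarity + argmax
print("Recognized dialect:", DIALECTS[k])            # e.g. "Shanghai"
```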

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Fig. 8 shows a schematic block diagram of a terminal device provided in an embodiment of the present application, and only shows a part related to the embodiment of the present application for convenience of description.

As shown in fig. 8, the terminal device 8 of this embodiment includes: a processor 80, a memory 81 and a computer program 82 stored in said memory 81 and executable on said processor 80. The processor 80, when executing the computer program 82, implements the steps in the above-described embodiments of the training method for dialect recognition models, such as the steps S101 to S102 shown in fig. 1.

Illustratively, the computer program 82 may be partitioned into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 82 in the terminal device 8.

The terminal device 8 may be a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a cloud server, or other computing devices. Those skilled in the art will appreciate that fig. 8 is merely an example of a terminal device 8 and does not constitute a limitation of terminal device 8 and may include more or less components than those shown, or combine certain components, or different components, for example, terminal device 8 may also include input-output devices, network access devices, buses, etc.

The processor 80 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or an internal memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used to store the computer program and the other programs and data required by the terminal device 8, and may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, the division of the functional units and modules described above is merely illustrative. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from one another and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

Each of the above embodiments is described with its own emphasis; for parts that are not detailed or illustrated in a given embodiment, reference may be made to the related descriptions of the other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals or telecommunications signals.

The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and replacements do not substantially depart from the spirit and scope of the embodiments of the present application, and are intended to be included within the protection scope of the present application.
