Home CNN classification and feature matching combined voiceprint recognition method

Document No.: 1075063  Publication date: 2020-10-16

Note: This technology, "Home CNN classification and feature matching combined voiceprint recognition method" (面向家居CNN分类与特征匹配联合的声纹识别方法), was designed and created by 张晖, 张金鑫, 赵海涛, 孙雁飞, 倪艺洋 and 朱洪波 on 2020-05-22. Abstract: The invention discloses a home CNN classification and feature matching combined voiceprint recognition method, which comprises the following steps: carrying out a short-time Fourier transform on the voice to generate a spectrogram; inputting the spectrogram into a trained convolutional neural network for classification, and ending the process if the speaker is identified as a non-family member; extracting MFCC characteristic parameters from the voice signal; and matching the MFCC characteristic parameters with the k-means feature template to obtain the final recognition result. Based on spectrogram generation, the convolutional neural network, the k-means algorithm and the cosine similarity measure, the method effectively reduces the false detection rate and the missed detection rate of voiceprint recognition while maintaining recognition accuracy, solving the problem of high false detection and missed detection rates and ensuring the absolute safety of the home environment.

1. A home CNN classification and feature matching combined voiceprint recognition method, characterized by comprising the following steps:

s1: carrying out short-time Fourier transform on the voice to generate a spectrogram;

s2: inputting the spectrogram into a trained convolutional neural network for classification, and if the speaker is identified as a non-family member, ending the process; otherwise, turning to the step S3;

s3: extracting MFCC characteristic parameters from the voice signals;

s4: and matching the MFCC characteristic parameters with the k-means characteristic template to obtain a final recognition result.

2. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 1, wherein: the speech is subjected to a preprocessing operation before the short-time Fourier transform in said step S1.

3. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 2, wherein: the preprocessing operations in step S1 include sample quantization, pre-emphasis, windowing and framing, and endpoint detection.

4. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 1, wherein: the convolutional neural network in the step S2 comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer, wherein the pooling layer adopts average pooling, the output layer adopts a softmax function, and the convolutional neural network is trained by adopting a BP algorithm.

5. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 1, wherein: in step S3, MFCC characteristic parameters are extracted by adjusting the order of the Mel filter.

6. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 5, wherein: the extraction process of the MFCC characteristic parameters in step S3 is as follows:

A) preprocessing an input voice signal to generate a time domain signal, and processing each frame of voice signal through fast Fourier transform or discrete Fourier transform to obtain a voice linear frequency spectrum;

B) inputting the linear frequency spectrum into a Mel filter bank for filtering to generate a Mel frequency spectrum, and taking the logarithmic energy of the Mel frequency spectrum to generate a corresponding logarithmic frequency spectrum;

C) the log spectrum is converted to MFCC feature parameters using a discrete cosine transform.

7. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 1, wherein: the generation process of the k-means feature template in the step S4 is as follows: randomly selecting cluster centers; traversing all samples in the data set, calculating the distance from each sample to every cluster center, recording the nearest center, and assigning the sample to that cluster; traversing all the cluster centers and moving each cluster center to the mean of all the samples assigned to its cluster; and repeating the above steps, continuously updating the cluster center positions, until the cluster centers no longer move.

8. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 1, wherein: in step S4, a cosine similarity method is used for matching, and similarity is evaluated by calculating a cosine value of an included angle between two vectors.

9. The home-oriented CNN classification and feature matching combined voiceprint recognition method of claim 1, wherein: the generation process of the spectrogram in the step S1 is as follows:

a) performing framing processing on the voice signal to obtain x(m, n), wherein m represents the frame index and n represents the frame length, and then converting x(m, n) into X(m, n) through the short-time Fourier transform;

b) converting X(m, n) into a periodogram Y(m, n) through the formula Y(m, n) = X(m, n) × X(m, n)*, i.e. the squared magnitude spectrum;

c) taking the logarithm of the periodogram, and mapping m and n to the time and frequency scales respectively to generate a two-dimensional spectrogram.

Technical Field

The invention belongs to the field of voiceprint recognition, and particularly relates to a voiceprint recognition method for home CNN classification and feature matching combination.

Background

Voiceprint recognition, also known as speaker recognition, includes speaker identification and speaker verification. Voiceprint recognition is applied in a very wide range of fields, including the financial field, the military security field, the medical field, the home security field and so on. In most voiceprint recognition systems, besides the preprocessing operations carried out before recognition, the feature parameters and the model matching stage are critical to recognition accuracy. Existing voiceprint recognition algorithms cannot achieve one hundred percent recognition accuracy, their false detection rate and missed detection rate are high, and the absolute safety of people and property in the home environment cannot be guaranteed.

Disclosure of Invention

The purpose of the invention is as follows: in order to overcome the defects in the prior art, a voiceprint recognition method combining home CNN classification and feature matching is provided, which reduces the false detection rate and the missed detection rate on the premise of ensuring the recognition accuracy. The method improves the existing model, thereby solving the problem of high false detection and missed detection rates.

The technical scheme is as follows: in order to achieve the above object, the present invention provides a home CNN classification and feature matching combined voiceprint recognition method, which comprises the following steps:

s1: carrying out short-time Fourier transform on the voice to generate a spectrogram;

s2: inputting the spectrogram into a trained convolutional neural network for classification, and if the speaker is identified as a non-family member, ending the process; otherwise, turning to the step S3;

s3: extracting MFCC characteristic parameters from the voice signals;

s4: and matching the MFCC characteristic parameters with the k-means characteristic template to obtain a final recognition result.

Further, the speech is subjected to a preprocessing operation before the short-time Fourier transform in step S1.

Further, the preprocessing operation in step S1 includes sample quantization, pre-emphasis, windowing and framing, and endpoint detection.

Further, the convolutional neural network in step S2 includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer, where the pooling layer uses average pooling, the output layer uses a softmax function, and the convolutional neural network is trained using a BP algorithm.

Further, in step S3, MFCC characteristic parameters are extracted through order adjustment of the Mel filter.

Further, the extraction process of the MFCC characteristic parameters in step S3 is as follows:

A) preprocessing an input voice signal to generate a time domain signal, and processing each frame of voice signal through fast Fourier transform or discrete Fourier transform to obtain a voice linear frequency spectrum;

B) inputting the linear frequency spectrum into a Mel filter bank for filtering to generate a Mel frequency spectrum, and taking the logarithmic energy of the Mel frequency spectrum to generate a corresponding logarithmic frequency spectrum;

C) the log spectrum is converted to MFCC feature parameters using a discrete cosine transform.

Further, the generation process of the k-means feature template in the step S4 is as follows: randomly selecting cluster centers; traversing all samples in the data set, calculating the distance from each sample to every cluster center, recording the nearest center, and assigning the sample to that cluster; traversing all the cluster centers and moving each cluster center to the mean of all the samples assigned to its cluster; and repeating the above steps, continuously updating the cluster center positions, until the cluster centers no longer move.

Further, in step S4, a cosine similarity method is used for matching, and similarity is evaluated by calculating a cosine value of an included angle between two vectors.

Further, the generation process of the spectrogram in step S1 is as follows:

a) performing framing processing on the voice signal to obtain x(m, n), wherein m represents the frame index and n represents the frame length, and then converting x(m, n) into X(m, n) through the short-time Fourier transform;

b) converting X(m, n) into a periodogram Y(m, n) through the formula Y(m, n) = X(m, n) × X(m, n)*, i.e. the squared magnitude spectrum;

c) taking the logarithm of the periodogram, and mapping m and n to the time and frequency scales respectively to generate a two-dimensional spectrogram.

The method first generates a spectrogram of the speaker's speech, which is then fed as input into the convolutional neural network; if the speaker is identified as a non-family member, the process ends, otherwise the speaker's identity needs to be confirmed again. The MFCC features of the speech are then extracted, measured against the feature template using the template matching method with cosine similarity, and the final recognition result is output.

Has the advantages that: compared with the prior art, the invention, based on spectrogram generation, the convolutional neural network, the k-means algorithm and the cosine similarity measure, effectively reduces the false detection rate and the missed detection rate of voiceprint recognition while ensuring the recognition accuracy, solves the problem of high false detection and missed detection rates, and ensures the absolute safety of the home environment.

Drawings

FIG. 1 is a block diagram showing the general structure of the method of the present invention;

fig. 2 is a flowchart of MFCC feature parameter extraction.

Detailed Description

The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.

As shown in fig. 1, the present invention provides a home CNN classification and feature matching combined voiceprint recognition method, which includes the following steps:

1) preprocessing the input speaker's voice, where the preprocessing includes sampling and quantization, pre-emphasis, windowing and framing, endpoint detection and the like. The purpose of preprocessing is to eliminate the interference introduced by the vocal organs and the voice acquisition equipment and to improve the recognition rate of the system.

2) Carrying out short-time Fourier transform on the preprocessed voice to generate a spectrogram, wherein the specific process comprises the following steps:

a) performing framing processing on the voice signal to obtain x(m, n), wherein m represents the frame index and n represents the frame length, and then converting x(m, n) into X(m, n) through the short-time Fourier transform;

b) converting X(m, n) into a periodogram Y(m, n) through the formula Y(m, n) = X(m, n) × X(m, n)*, i.e. the squared magnitude spectrum;

c) taking the logarithm of the periodogram, and mapping m and n to the time and frequency scales respectively to generate a two-dimensional spectrogram (a code sketch of steps a) to c) is given after this list).

3) Inputting the spectrogram into a trained convolutional neural network for classification; if the speaker is identified as a non-family member, the process ends, otherwise turn to step 4;

4) extracting MFCC characteristic parameters from the voice signals;

5) and matching the MFCC characteristic parameters with the k-means characteristic template to obtain a final recognition result.
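For illustration only, the following is a minimal NumPy sketch of steps a) to c) above (spectrogram generation). The frame length, frame shift and the Hamming window are assumptions of this sketch and are not fixed by the embodiment.

```python
import numpy as np

def spectrogram(speech, frame_len=256, frame_shift=128, eps=1e-10):
    """Generate a log-scale spectrogram: framing -> STFT -> periodogram -> log."""
    window = np.hamming(frame_len)                       # assumed window type
    n_frames = 1 + (len(speech) - frame_len) // frame_shift
    # a) framing: x(m, n), m = frame index, n = sample index within a frame
    frames = np.stack([speech[m * frame_shift: m * frame_shift + frame_len]
                       for m in range(n_frames)])
    # short-time Fourier transform of each windowed frame: X(m, n)
    X = np.fft.rfft(frames * window, axis=1)
    # b) periodogram Y(m, n) = X(m, n) * conj(X(m, n)) = |X(m, n)|^2
    Y = (X * np.conj(X)).real
    # c) logarithm; rows map to time (frames), columns to frequency bins
    return 10.0 * np.log10(Y + eps)

# usage: 1 s of dummy speech sampled at 16 kHz
if __name__ == "__main__":
    sig = np.random.randn(16000)
    S = spectrogram(sig)
    print(S.shape)   # (n_frames, frame_len // 2 + 1)
```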

The convolutional neural network in the embodiment comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer, wherein the pooling layer adopts average pooling, and the output layer adopts a softmax function.

The softmax function is softmax(z_i) = exp(z_i) / Σ_j exp(z_j).

And training the convolutional neural network by adopting a BP algorithm.
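As a non-limiting illustration of the network described above (input, convolution, average pooling, fully connected and softmax output layers), a small PyTorch sketch follows. The number of channels, kernel sizes, the two-class output (family member / non-family member) and the assumed 64x64 spectrogram input are choices of this sketch, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Minimal CNN: convolution -> average pooling -> fully connected -> softmax."""
    def __init__(self, n_classes=2):                      # assumed: member / non-member
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AvgPool2d(2),                               # average pooling, as in the text
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AvgPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)  # assumes a 64x64 input map

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)                # softmax output layer

# training with back-propagation (BP) uses a standard optimizer, for example:
# model = SpectrogramCNN(); opt = torch.optim.SGD(model.parameters(), lr=0.01)
```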

As can be seen from the structure of the convolutional neural network, the network includes the following parameters: the convolution kernels, the bias terms, the weights of the fully connected network, and so on. Solving for these parameters requires the back-propagation algorithm.

To compute the error signal (sensitivity) of a map in convolutional layer l, the sensitivities of the corresponding map in the next (down-sampling) layer must be summed over the connected neurons and multiplied by the weights of layer l+1; since the weights in the down-sampling layer are all equal to β (a constant, see the gradient calculation of the down-sampling layer below), the result only needs to be scaled by β to obtain δ_j^l. This procedure is repeated for each map j in the convolutional layer, pairing it with its corresponding map in the down-sampling layer:

δ_j^l = β_j^(l+1) · ( f′(u_j^l) ∘ up(δ_j^(l+1)) )

where f′(·) denotes the first derivative of the activation function and up(·) denotes the upsampling operation, which simply repeats each input pixel n times in the horizontal and vertical directions, matching the factor n used in the down-sampling operation. A simple implementation uses the Kronecker product:

up(x) = x ⊗ 1_{n×n}
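A minimal NumPy sketch of the up(·) operation just described, repeating each pixel n times horizontally and vertically via the Kronecker product:

```python
import numpy as np

def up(delta, n):
    """Upsample a sensitivity map by pixel repetition: up(x) = x ⊗ 1_{n×n}."""
    return np.kron(delta, np.ones((n, n)))

# example: a 2x2 map upsampled by a factor of 2 becomes 4x4
print(up(np.array([[1.0, 2.0], [3.0, 4.0]]), 2))
```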

Now, with the error signal of a given map, the gradient of the bias can be calculated by summing over all entries of the error signal map:

∂E/∂b_j = Σ_{u,v} (δ_j^l)_{u,v}

Finally, the gradient of the kernel weights is computed by back-propagation, except that here many connections share the same weight, so all gradients associated with a given weight are summed:

∂E/∂k_{ij}^l = Σ_{u,v} (δ_j^l)_{u,v} · (p_i^(l-1))_{u,v}

where (p_i^(l-1))_{u,v} denotes the patch of the input map x_i^(l-1) that was multiplied element-wise by k_{ij}^l during convolution to produce the element at (u, v) of the output map. This seems difficult to calculate, since it requires determining which regions of the input map correspond to each output pixel, but the formula can be realized in matlab with 'valid' convolution:

∂E/∂k_{ij}^l = rot180( conv2( x_i^(l-1), rot180(δ_j^l), 'valid' ) )
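The MATLAB expression above can be reproduced with SciPy's 2-D convolution. This is a sketch under the assumption that x_prev is the input map x_i^(l-1) and delta is the sensitivity map δ_j^l of the corresponding output map:

```python
import numpy as np
from scipy.signal import convolve2d

def kernel_gradient(x_prev, delta):
    """dE/dk_ij = rot180( conv2(x_i^(l-1), rot180(delta_j^l), 'valid') )."""
    rot180 = lambda a: np.rot90(a, 2)
    return rot180(convolve2d(x_prev, rot180(delta), mode="valid"))

# example: a 5x5 input map and a 3x3 sensitivity map give a 3x3 kernel gradient
print(kernel_gradient(np.random.randn(5, 5), np.random.randn(3, 3)).shape)
```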

Gradient calculation for the down-sampling layer: this layer produces a down-sampled version of each input map. If there are N input maps, there are also N output maps, although each output map is smaller than its input map:

x_j^l = f( β_j^l · down(x_j^(l-1)) + b_j^l )

where down(·) denotes the down-sampling function.

The difficulty here lies in computing the error signal map. The only learnable parameters are β and b. Assume that the layers above and below the sampling layer are convolutional layers; if the down-sampling layer is instead followed by a fully connected network, its error signal map can be obtained directly by the standard back-propagation algorithm.

As in the convolutional layer, it is necessary to find which block in the current layer's sensitivity map corresponds to a given pixel in the next layer's sensitivity map. The weights multiplying the connections between that input block and the output pixel are exactly the weights of the (rotated) convolution kernel, so this can also be realized efficiently by the following formula:

δ_j^l = f′(u_j^l) ∘ conv2( δ_j^(l+1), rot180(k_j^(l+1)), 'full' )
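Under the same assumptions as the previous sketch, the sensitivity map of the down-sampling layer can be sketched with a 'full' convolution of the next layer's sensitivity map with the rotated kernel:

```python
import numpy as np
from scipy.signal import convolve2d

def downsample_layer_delta(delta_next, kernel, f_prime_u):
    """delta_j^l = f'(u_j^l) ∘ conv2(delta_j^(l+1), rot180(k_j^(l+1)), 'full')."""
    full = convolve2d(delta_next, np.rot90(kernel, 2), mode="full")
    return f_prime_u * full   # element-wise product with the activation derivative

# example: 3x3 next-layer sensitivities and a 3x3 kernel give a 5x5 sensitivity map
d = downsample_layer_delta(np.ones((3, 3)), np.ones((3, 3)), np.ones((5, 5)))
print(d.shape)
```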

Now the gradients of β and b can be calculated. The gradient of b is again the sum over the entries (u, v) of the error signal map:

∂E/∂b_j = Σ_{u,v} (δ_j^l)_{u,v}

The multiplicative bias β is related to the original down-sampled map computed from the current layer's input during forward propagation (the feature map obtained by down-sampling, before any bias is added). It follows that saving these maps during the forward pass helps the subsequent calculation. Accordingly, define:

d_j^l = down( x_j^(l-1) )

Then the gradient of β is given by the following equation:

∂E/∂β_j = Σ_{u,v} ( δ_j^l ∘ d_j^l )_{u,v}
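A sketch of the two gradient sums above, assuming the down-sampled map d_j^l was saved during the forward pass:

```python
import numpy as np

def downsample_layer_grads(delta, d_saved):
    """Gradients of the additive bias b and the multiplicative bias beta."""
    grad_b = delta.sum()                 # dE/db_j    = sum_{u,v} (delta_j^l)_{u,v}
    grad_beta = (delta * d_saved).sum()  # dE/dbeta_j = sum_{u,v} (delta_j^l ∘ d_j^l)_{u,v}
    return grad_b, grad_beta
```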

As shown in fig. 2, the specific steps of extracting MFCC characteristic parameters in this embodiment are as follows:

(1) The input speech signal s(n) is preprocessed to generate a time-domain signal x(n) (the length N of the signal sequence is 256); then the linear spectrum X(k) of the speech is obtained by performing a fast Fourier transform or discrete Fourier transform on each frame, which can be expressed as:

X(k) = Σ_{n=0}^{N−1} x(n) · e^(−j2πnk/N), 0 ≤ k ≤ N−1

(2) The linear spectrum X(k) is input to the Mel filter bank for filtering to generate the Mel spectrum, and its logarithmic energy is then taken to generate the corresponding logarithmic spectrum S(m).

Here, the Mel filter bank is a set of triangular band-pass filters H_m(k), 0 ≤ m ≤ M, where M denotes the number of filters and is usually 20-28. The transfer function of each band-pass filter can be expressed as:

H_m(k) = 0, for k < f(m−1) or k > f(m+1);
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) < k ≤ f(m+1),

where f(m) is the center frequency of the m-th filter.

The logarithm of the Mel energy spectrum is taken to improve the performance of the voiceprint recognition system. The mapping from the linear spectrum X(k) of the speech to the logarithmic spectrum S(m) is:

S(m) = ln( Σ_{k=0}^{N−1} |X(k)|² · H_m(k) ), 0 ≤ m ≤ M

(3) The logarithmic spectrum S(m) is converted into the MFCC characteristic parameters using the discrete cosine transform (DCT); the expression of the n-th dimension feature component C(n) of the MFCC characteristic parameters is:

C(n) = Σ_{m=0}^{M−1} S(m) · cos( πn(m + 0.5) / M ), n = 1, 2, ..., L

where L is the order of the MFCC parameters.

the MFCC characteristic parameters obtained through the steps only reflect the static characteristics of the voice signals, and the dynamic characteristic parameters can be obtained by solving the first-order difference and the second-order difference of the static characteristics.
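For illustration, a compact NumPy sketch of steps (1)-(3) follows. The sampling rate, the number of filters M, the number of cepstral coefficients, and the Mel-scale conversion formula (2595·log10(1 + f/700)) are assumptions of this sketch rather than values fixed by the embodiment.

```python
import numpy as np

def mfcc_frame(frame, fs=16000, n_fft=256, n_filters=24, n_ceps=13):
    """MFCC of one pre-processed frame: DFT -> Mel filter bank -> log -> DCT."""
    # (1) linear spectrum X(k) via the FFT, kept as a power spectrum
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # (2) triangular Mel filter bank H_m(k) on the FFT bins
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_points) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    log_spec = np.log(H @ power + 1e-10)          # logarithmic spectrum S(m)
    # (3) DCT of the log spectrum gives the cepstral coefficients C(n)
    n = np.arange(1, n_ceps + 1)[:, None]
    m_idx = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (m_idx + 0.5) / n_filters)
    return dct @ log_spec

# usage on one 256-sample frame of dummy speech
print(mfcc_frame(np.random.randn(256)).shape)     # (13,)
```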

In this embodiment, the k-means feature template is generated with the k-means algorithm. The k-means algorithm is an unsupervised machine learning algorithm; since unsupervised learning requires no labels, the workload of data labelling can be greatly reduced and the range of application is wider. The k-means algorithm first requires choosing k, i.e. the number of clusters, and a training data set x^(1), x^(2), ..., x^(m).

First, the cluster centers μ_1, μ_2, ..., μ_k are selected at random. Then all samples in the data set of size m are traversed: for each sample x^(i), the distance to every cluster center μ_1, μ_2, ..., μ_k is calculated, the nearest center μ_j is recorded, and the sample is assigned to that cluster. The distance is typically calculated as ‖x^(i) − μ_j‖. Next, all cluster centers are traversed, and each cluster center is moved to the mean of all the samples belonging to its cluster, i.e.

μ_j = (1/e) · Σ_{d=1}^{e} x^(d)

where e denotes the number of training sample points belonging to this cluster center and x^(d) denotes a point belonging to the cluster μ_j. These steps are repeated, continuously updating the cluster center positions, until the cluster centers no longer move.
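A minimal NumPy sketch of this clustering procedure, which can be used to build the feature template from a speaker's MFCC vectors; the values of k, the iteration limit and the random seed are assumptions of the sketch:

```python
import numpy as np

def kmeans_template(X, k=8, n_iter=100, seed=0):
    """X: (m, d) matrix of MFCC feature vectors. Returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign every sample x^(i) to the nearest center mu_j (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of the samples assigned to its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # centers stopped moving
            break
        centers = new_centers
    return centers

# usage: template built from 200 dummy 13-dimensional MFCC vectors
template = kmeans_template(np.random.randn(200, 13))
print(template.shape)   # (8, 13)
```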

In this embodiment, the cosine similarity method is adopted to match the MFCC characteristic parameters with the k-means feature template, and the similarity between two vectors is evaluated by calculating the cosine of the included angle between them. If the vectors a and b are (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n) respectively, then the cosine similarity of a and b can be expressed as:

cos θ = ( Σ_{i=1}^{n} x_i · y_i ) / ( √(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²) )

If the directions of the two vectors are consistent, the included angle is close to zero, the vectors are considered more similar, and the cosine similarity is closer to 1. When comparing similarity in voiceprint recognition, if the speech to be tested is closer to the speech of the target speaker, i.e. the cosine similarity value is larger, the speakers are judged to be the same person.
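A minimal NumPy sketch of this matching step: each test MFCC vector is compared with the template centers by cosine similarity, and a decision can be taken by thresholding the average score. The threshold value is an assumption of the sketch, not a value given by the embodiment.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a · b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def match(mfcc_vectors, template, threshold=0.8):
    """Average best cosine similarity between test vectors and template centers."""
    scores = [max(cosine_similarity(v, c) for c in template) for v in mfcc_vectors]
    score = float(np.mean(scores))
    return score, score >= threshold    # larger score -> more likely the same speaker

# usage with dummy data: 50 test vectors against an 8-center template
score, accepted = match(np.random.randn(50, 13), np.random.randn(8, 13))
print(score, accepted)
```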
