Voiceprint recognition method for home multi-feature parameter fusion


Reading note: This technology, "Voiceprint recognition method for home multi-feature parameter fusion" (面向家居多特征参数融合的声纹识别方法), was designed and created by 张晖, 张金鑫, 赵海涛, 孙雁飞, 倪艺洋 and 朱洪波 on 2020-05-22. Its main content is as follows: The invention discloses a voiceprint recognition method for home multi-feature parameter fusion, which comprises the following steps: respectively calculating and extracting the MFCC, GFCC and LPCC characteristic parameters of the voice signal; training three Gaussian mixture models using the MFCC, GFCC and LPCC characteristic parameters respectively; and weighting and fusing the results of the three Gaussian mixture models, performing a soft decision with a set threshold, obtaining the optimal weight coefficients by stochastic gradient descent, and outputting the final recognition result. The invention fuses the MFCC, GFCC and LPCC characteristic parameters, overcoming the defect that a single characteristic parameter cannot fully express the characteristics of a speaker, and thereby greatly improving voiceprint recognition accuracy.

1. A voiceprint recognition method for home multi-feature parameter fusion, characterized by comprising the following steps:

s1: respectively calculating and extracting MFCC characteristic parameters, GFCC characteristic parameters and LPCC characteristic parameters of the voice signals;

s2: training three Gaussian mixture models by respectively using MFCC characteristic parameters, GFCC characteristic parameters and LPCC characteristic parameters;

s3: weighting and fusing the results of the three Gaussian mixture models, performing a soft decision with a set threshold, obtaining the optimal weight coefficients by stochastic gradient descent, and outputting the final recognition result.

2. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 1, wherein: the speech signal undergoes a preprocessing operation in step S1 before characteristic parameter extraction.

3. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 2, wherein: the preprocessing operation in step S1 includes sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection.

4. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 1, wherein: the extraction process of the MFCC characteristic parameters in step S1 is as follows:

A1) preprocessing an input voice signal to generate a time domain signal, and processing each frame of voice signal through fast Fourier transform or discrete Fourier transform to obtain a voice linear frequency spectrum;

A2) inputting the linear frequency spectrum into a Mel filter bank for filtering to generate a Mel frequency spectrum, and taking the logarithmic energy of the Mel frequency spectrum to generate a corresponding logarithmic frequency spectrum;

A3) the logarithmic spectrum is converted into the MFCC characteristic parameters by using a discrete cosine transform.

5. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 1, wherein: the extraction process of the GFCC characteristic parameters in step S1 is as follows:

B1) preprocessing a voice signal to generate a time domain signal, and obtaining a discrete power spectrum through fast Fourier transform or discrete Fourier transform processing;

B2) squaring the discrete power spectrum to generate a voice energy spectrum, and performing filtering processing by using a Gammatone filter bank;

B3) performing exponential compression on the output of each Gammatone filter to obtain a set of exponential energy spectra;

B4) the exponential energy spectrum is converted into GFCC characteristic parameters using a discrete cosine transform.

6. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 1, wherein: the extraction process of the LPCC characteristic parameters in step S1 is as follows:

C1) setting a system function of the vocal tract model;

C2) setting the impulse response of a system function, and calculating the complex cepstrum of the impulse response;

C3) and calculating to obtain the LPCC characteristic parameters according to the relation between the complex cepstrum and the cepstrum coefficient.

7. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 1, wherein: the recognition result in step S3 is determined as follows: when the weighted fusion result is greater than or equal to the threshold value, the speaker is identified as the target speaker; otherwise, the speaker is identified as a non-target speaker.

8. The home-oriented multi-feature parameter fusion voiceprint recognition method according to claim 5, wherein: the Gammatone filter bank in step B2 is used to simulate the auditory characteristics of the cochlear basilar membrane, and its time-domain expression is as follows:

g(f_i, t) = t^{n-1} e^{-2\pi b_i t} \cos(2\pi f_i t + \varphi_i) U(t), \quad 1 \le i \le N

wherein N is the number of filters, n is the filter order, i is the filter index, f_i is the center frequency of the i-th filter, U(t) is the unit step function, b_i is the attenuation factor of the i-th filter, and φ_i is the phase of the i-th filter.

Technical Field

The invention belongs to the field of voiceprint recognition, and particularly relates to a voiceprint recognition method for home multi-feature parameter fusion.

Background

Voiceprint recognition, also known as speaker recognition, includes speaker identification and speaker verification. Voiceprint recognition has a very wide range of applications, including the financial, military security, medical, and home security fields. In a voiceprint recognition system, beyond the preprocessing operations, the choice of characteristic parameters and the model matching scheme are critical to recognition accuracy.

A traditional single characteristic parameter cannot fully express the voice characteristics of a speaker, overfitting may occur, and MFCC characteristic parameters are easily imitated. Beyond single features, many researchers directly concatenate the GFCC and MFCC features into a new characteristic parameter vector, which may cause the curse of dimensionality and increase the computational load of the system. Therefore, current home voiceprint recognition algorithms cannot adequately express the characteristics of the speaker, and their recognition accuracy needs to be improved.

Disclosure of Invention

The purpose of the invention is as follows: in order to overcome the defects in the prior art, a home-oriented voiceprint recognition method based on multi-feature parameter fusion is provided, which effectively solves the problem that a single characteristic parameter cannot fully express the voice characteristics of a speaker and improves the accuracy of voiceprint recognition.

The technical scheme is as follows: in order to achieve the above purpose, the invention provides a home-oriented multi-feature parameter fusion voiceprint recognition method, which comprises the following steps:

s1: respectively calculating and extracting MFCC characteristic parameters, GFCC characteristic parameters and LPCC characteristic parameters of the voice signals;

s2: training three Gaussian mixture models by respectively using MFCC characteristic parameters, GFCC characteristic parameters and LPCC characteristic parameters;

s3: weighting and fusing the results of the three Gaussian mixture models, performing a soft decision with a set threshold, obtaining the optimal weight coefficients by stochastic gradient descent, and outputting the final recognition result.

Further, the speech signal is subjected to a preprocessing operation before feature parameter extraction in step S1.

Further, the preprocessing operation in step S1 includes sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection.

Further, the extraction process of the MFCC characteristic parameters in step S1 is as follows:

A1) preprocessing an input voice signal to generate a time domain signal, and processing each frame of voice signal through fast Fourier transform or discrete Fourier transform to obtain a voice linear frequency spectrum;

A2) inputting the linear frequency spectrum into a Mel filter bank for filtering to generate a Mel frequency spectrum, and taking the logarithmic energy of the Mel frequency spectrum to generate a corresponding logarithmic frequency spectrum;

A3) the logarithmic spectrum is converted into the MFCC characteristic parameters by using a discrete cosine transform.

Further, the extraction process of the GFCC characteristic parameters in step S1 is as follows:

B1) preprocessing a voice signal to generate a time domain signal, and obtaining a discrete power spectrum through fast Fourier transform or discrete Fourier transform processing;

B2) squaring the discrete power spectrum to generate a voice energy spectrum, and performing filtering processing by using a Gammatone filter bank;

B3) performing exponential compression on the output of each Gammatone filter to obtain a set of exponential energy spectra;

B4) the exponential energy spectrum is converted into GFCC characteristic parameters using a discrete cosine transform.

Further, the extraction process of the LPCC characteristic parameters in step S1 is as follows:

C1) setting a system function of the vocal tract model;

C2) setting the impulse response of a system function, and calculating the complex cepstrum of the impulse response;

C3) and calculating to obtain the LPCC characteristic parameters according to the relation between the complex cepstrum and the cepstrum coefficient.

Further, the recognition result in step S3 is determined as follows: when the weighted fusion result is greater than or equal to the threshold value, the speaker is identified as the target speaker; otherwise, the speaker is identified as a non-target speaker.

Beneficial effects: compared with the prior art, the invention fuses the MFCC, GFCC and LPCC characteristic parameters, overcoming the defect that a single characteristic parameter cannot fully express the characteristics of a speaker, and thereby greatly improving the accuracy of voiceprint recognition.

Drawings

FIG. 1 is a block diagram showing the general structure of the method of the present invention;

FIG. 2 is a flowchart of MFCC feature parameter extraction;

FIG. 3 is a flowchart of GFCC characteristic parameter extraction.

Detailed Description

The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. These embodiments are to be understood as merely illustrative of the invention and not limiting its scope in any way; after reading the present specification, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims of this application.

As shown in fig. 1, the present invention provides a home-oriented multi-feature parameter fusion voiceprint recognition method, which includes the following steps:

1) Preprocess the input speaker speech; the preprocessing includes sampling and quantization, pre-emphasis, framing and windowing, endpoint detection, and the like. The purpose of preprocessing is to mitigate the interference introduced by the vocal organs and the speech acquisition equipment and to improve the recognition rate of the system. A minimal sketch of these steps is given below.
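Where helpful, the embodiments below are accompanied by short Python sketches. The following sketch covers pre-emphasis, framing and windowing; the frame length of 256 samples (matching the FFT length N = 256 used later), the 50% frame overlap and the pre-emphasis coefficient 0.97 are assumed values that the patent does not fix, and endpoint detection is omitted.

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame_len=256, frame_shift=128):
    """Pre-emphasis, framing and Hamming windowing of a 1-D speech signal."""
    # Pre-emphasis: s'(n) = s(n) - alpha * s(n-1), boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Slice into overlapping frames and apply a Hamming window to each frame
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)
```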

2) Respectively calculating and extracting MFCC characteristic parameters, GFCC characteristic parameters and LPCC characteristic parameters of the voice signals;

3) Train three Gaussian mixture models, namely GMM model A, GMM model B and GMM model C, using the MFCC, GFCC and LPCC characteristic parameters respectively;

4) Weight and fuse the results of GMM model A, GMM model B and GMM model C, perform a soft decision with a set threshold, obtain the optimal weight coefficients by stochastic gradient descent, and output the final recognition result.

As shown in fig. 2, the extraction process of the MFCC characteristic parameters in this embodiment is as follows:

A1) The input speech signal s(n) is preprocessed to generate a time-domain signal x(n) (the length N of the signal sequence is 256). Each frame of the speech signal is then processed by fast Fourier transform or discrete Fourier transform to obtain the linear speech spectrum X(k), which can be expressed as:

X(k) = \sum_{n=0}^{N-1} x(n) e^{-j 2\pi nk / N}, \quad 0 \le k \le N-1 \quad (1)

A2) The linear spectrum X(k) is input into the Mel filter bank for filtering to generate the Mel spectrum; the logarithmic energy of the Mel spectrum is then taken to generate the corresponding logarithmic spectrum S(m).

Here, the Mel filter bank is a set of triangular band-pass filters H_m(k), 0 ≤ m ≤ M, where M denotes the number of filters and is usually taken as 20 to 28. The transfer function of each band-pass filter can be expressed as:

H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \quad (2)

in equation (2), f(m) is the center frequency of the m-th filter.

Taking the logarithm of the Mel energy spectrum improves the performance of the voiceprint recognition system. The transformation from the linear speech spectrum X(k) to the logarithmic spectrum S(m) is:

S(m) = \ln\left( \sum_{k=0}^{N-1} |X(k)|^2 H_m(k) \right), \quad 0 \le m \le M \quad (3)

A3) The logarithmic spectrum S(m) is converted into the MFCC characteristic parameters by using the discrete cosine transform (DCT); the n-th dimension characteristic component C(n) of the MFCC characteristic parameters is:

C(n) = \sum_{m=0}^{M-1} S(m) \cos\left( \frac{\pi n (m + 0.5)}{M} \right), \quad n = 1, 2, \ldots, L \quad (4)

The MFCC characteristic parameters obtained through the above steps reflect only the static characteristics of the speech signal; dynamic characteristic parameters can be obtained by computing the first-order and second-order differences of these static features. A sketch of the whole extraction chain follows.
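A compact numpy sketch of steps A1) to A3); the Mel filter bank is built with the usual triangular design of equation (2). M = 24 filters and 13 cepstral dimensions are assumed values (the text only fixes the 20-28 range), and `frames` is the output of the preprocessing sketch above.

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=256, n_filters=24):
    """Triangular Mel filters H_m(k) covering 0 .. sr/2, per equation (2)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)   # center bins f(m)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        H[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return H

def mfcc(frames, sr=16000, n_fft=256, n_filters=24, n_ceps=13):
    """MFCC per frame: FFT (A1) -> Mel filtering + log (A2) -> DCT (A3)."""
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2          # |X(k)|^2
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_filters).T + 1e-12)
    m, n = np.arange(n_filters), np.arange(1, n_ceps + 1)
    dct = np.cos(np.pi * np.outer(n, m + 0.5) / n_filters)     # equation (4)
    return log_mel @ dct.T                                     # (num_frames, n_ceps)
```

First-order and second-order differences of these coefficients along the frame axis (e.g. via `np.diff`) give the dynamic delta and delta-delta parameters mentioned above.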

In this embodiment, the Gammatone filter used in the extraction of the GFCC (Gammatone frequency cepstral coefficient) characteristic parameters is designed as follows:

the Gammatone filter bank is used for simulating the auditory characteristics of a cochlear basilar membrane, and the time domain expression of the Gammatone filter bank is as follows:

g(f_i, t) = t^{n-1} e^{-2\pi b_i t} \cos(2\pi f_i t + \varphi_i) U(t), \quad 1 \le i \le N \quad (5)

wherein N is the number of filters;

n - the filter order, typically taken as 4;

i - the filter index;

f_i - the center frequency of the filter;

U(t) - the unit step function;

b_i - the attenuation factor of the filter;

φ_i - the phase of the i-th filter, typically taken as 0.

The bandwidth of each filter is related to the auditory critical band of the human ear; according to psychoacoustic theory, it can be expressed in terms of the equivalent rectangular bandwidth (ERB):

ERB(f_i) = 24.7 \left( 4.37 \frac{f_i}{1000} + 1 \right) \quad (6)

The attenuation factor b_i of the filter determines the decay rate of the impulse response and is determined by the bandwidth. Its expression is:

b_i = 1.019 \, \mathrm{ERB}(f_i) \quad (7)

The time-domain impulse response of the Gammatone filter is an analog (continuous-time) function; to facilitate computation, it must be discretized by transforming equation (5) into the discrete domain.

The output of each Gammatone filter is then obtained by convolving the input speech signal s(n) with the discretized impulse response g_i(n), as sketched below.
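A sketch of equations (5)-(7), discretized by direct sampling of the impulse response; the 64 ms response length and the unnormalized filter gain are simplifying assumptions of this sketch.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth, equation (6)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, sr=16000, duration=0.064, order=4):
    """Sampled Gammatone impulse response g_i(n) per equation (5), phase 0."""
    t = np.arange(int(sr * duration)) / sr
    b = 1.019 * erb(fc)                       # attenuation factor, equation (7)
    return t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)

def gammatone_output(signal, fc, sr=16000):
    """Filter output: convolution of s(n) with the sampled g_i(n)."""
    return np.convolve(signal, gammatone_ir(fc, sr), mode="same")
```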

The extraction process of the GFCC characteristic parameters is similar to that of the MFCC characteristic parameters; the traditional Mel filter bank is simply replaced with the Gammatone filter bank, which effectively exploits the cochlear basilar-membrane characteristics of the Gammatone filters and handles the nonlinear processing of speech signals well.

Based on the above Gammatone filter, as shown in fig. 3, the process of extracting the GFCC (Gammatone frequency cepstrum coefficient) characteristic parameters is as follows:

B1) First, the input speech signal s(n) is preprocessed to generate a time-domain signal x(n), and the discrete power spectrum X(k) is obtained through fast Fourier transform or discrete Fourier transform processing, analogously to equation (1).

B2) The discrete power spectrum X(k) is squared to generate the speech energy spectrum, which is then filtered using the Gammatone filter bank.

B3) To further improve the performance of the voiceprint recognition system, the output of each filter is exponentially compressed to obtain a set of exponential energy spectra s_1, s_2, …, s_M, where E(f) denotes the exponentially compressed output value and M is the number of filter channels.

B4) Finally, the exponential energy spectrum is converted into the GFCC characteristic parameters by using the discrete cosine transform (DCT):

GFCC(n) = \sum_{m=1}^{M} s_m \cos\left( \frac{\pi n (m - 0.5)}{M} \right), \quad n = 1, 2, \ldots, L

where L represents the dimension of the characteristic parameters.
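A sketch of steps B1) to B4), reusing `gammatone_ir` from the previous sketch. Here the Gammatone filtering is realized as spectral weighting with each filter's magnitude response sampled on the FFT grid, and the "exponential compression" of B3) is taken to be cube-root compression; both, like the linear center-frequency spacing, are assumptions of this sketch, since the patent does not fix the implementation or the exponent.

```python
import numpy as np

def gfcc(frames, sr=16000, n_fft=256, n_filters=32, n_ceps=13):
    """GFCC per frame: FFT (B1) -> energy + Gammatone weighting (B2)
    -> exponential compression (B3) -> DCT (B4)."""
    energy = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2        # speech energy spectrum
    fcs = np.linspace(100.0, sr / 2.0 - 100.0, n_filters)     # assumed linear spacing
    # Magnitude response of each filter on the rfft grid (IR truncated to n_fft)
    bank = np.stack([
        np.abs(np.fft.rfft(gammatone_ir(fc, sr, duration=n_fft / sr), n=n_fft))
        for fc in fcs
    ])
    compressed = (energy @ bank.T) ** (1.0 / 3.0)             # s_1 .. s_M per frame
    m, n = np.arange(n_filters), np.arange(1, n_ceps + 1)
    dct = np.cos(np.pi * np.outer(n, m + 0.5) / n_filters)
    return compressed @ dct.T                                 # (num_frames, n_ceps)
```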

In this embodiment, the extraction process of the LPCC (linear prediction cepstral coefficient) characteristic parameters is as follows:

Assume that the system function of the vocal tract model is the all-pole model:

H(z) = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} \quad (12)

in equation (12), p is the order of the predictor and a_i are the linear prediction coefficients.

Let h(n) be the impulse response of H(z), and let ĥ(n) be the complex cepstrum of h(n); then

\hat{H}(z) = \sum_{n=1}^{\infty} \hat{h}(n) z^{-n} = \log H(z) \quad (13)

Combining equations (12) and (13) and differentiating both sides with respect to z^{-1}, simplification gives:

\left( 1 - \sum_{i=1}^{p} a_i z^{-i} \right) \sum_{n=1}^{\infty} n \, \hat{h}(n) \, z^{-n+1} = \sum_{i=1}^{p} i \, a_i \, z^{-i+1} \quad (14)

Equating the coefficients of like powers of z^{-1} on both sides of equation (14) yields the complex cepstrum recursion:

\hat{h}(1) = a_1, \qquad \hat{h}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n} \hat{h}(k) \, a_{n-k} \ (1 < n \le p), \qquad \hat{h}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n} \hat{h}(k) \, a_{n-k} \ (n > p) \quad (15)

According to the relationship between the complex cepstrum and the cepstral coefficients (for the minimum-phase vocal tract model the two coincide),

c(n) = \hat{h}(n) \quad (16)

the linear prediction cepstral coefficients can be calculated recursively:

c(1) = a_1, \qquad c(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n} c(k) \, a_{n-k} \ (1 < n \le p), \qquad c(n) = \sum_{k=n-p}^{n-1} \frac{k}{n} c(k) \, a_{n-k} \ (n > p) \quad (17)

where c(n) are the linear prediction cepstral coefficients (LPCC) and a_n are the linear prediction coefficients.
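A sketch of the recursion in equation (17). The linear prediction coefficients a_n can be obtained by any standard method; a small Levinson-Durbin solver over the frame autocorrelation is included here to keep the example self-contained (the patent does not prescribe how the a_n are computed, and p = 12 and 16 cepstral dimensions are assumed values).

```python
import numpy as np

def lpc(frame, p=12):
    """LPC coefficients a_1..a_p via Levinson-Durbin on the autocorrelation."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1: len(frame) + p]
    a, e = np.zeros(p + 1), r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e        # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a, e = a_new, e * (1.0 - k * k)
    return a[1:]                                              # a_1 .. a_p

def lpcc(a, n_ceps=16):
    """Cepstral coefficients c(1)..c(n_ceps) from LPC coefficients, equation (17)."""
    p, c = len(a), np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0                     # a_n term for n <= p
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]              # (k/n) c(k) a_{n-k}
        c[n] = acc
    return c[1:]

# Example: features = lpcc(lpc(frame, p=12), n_ceps=16) for each preprocessed frame.
```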

In step 4 of this embodiment, the mixture orders of GMM model A, GMM model B and GMM model C are all 1024. The output results of the three models are a, b and c respectively; the three results are weighted and fused with weight coefficients ω_i satisfying

ω_1 + ω_2 + ω_3 = 1

The final result is D = ω_1 a + ω_2 b + ω_3 c. A threshold γ is set; when D ≥ γ, the speaker is identified as the target speaker, otherwise as a non-target speaker. A sketch of this fusion and of the stochastic-gradient weight search follows.
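A sketch of the score fusion and of the stochastic-gradient search for the weight coefficients. The patent does not specify the loss or the training setup, so a squared loss against binary trial labels (1 = target speaker, 0 = non-target) and renormalization of the weights onto the simplex after each update are assumptions of this sketch; the per-trial scores a, b, c stand for the outputs (e.g. log-likelihoods) of the three 1024-mixture GMMs.

```python
import numpy as np

def fuse(scores, w):
    """Fused result D = w1*a + w2*b + w3*c for one trial's three GMM scores."""
    return float(np.dot(w, scores))

def learn_weights(trials, labels, lr=0.01, epochs=50, seed=0):
    """SGD over labelled trials; trials is a (num_trials, 3) array of GMM scores."""
    rng = np.random.default_rng(seed)
    w = np.full(3, 1.0 / 3.0)                         # start from uniform weights
    for _ in range(epochs):
        for i in rng.permutation(len(trials)):
            d = np.dot(w, trials[i])
            w -= lr * 2.0 * (d - labels[i]) * trials[i]   # gradient of squared loss
            w = np.clip(w, 0.0, None)
            w /= w.sum()                              # enforce w1 + w2 + w3 = 1
    return w

def decide(scores, w, gamma):
    """Target speaker iff the fused result D is >= the threshold gamma."""
    return fuse(scores, w) >= gamma
```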
