Voice digital recognition method based on MFCC

Document No.: 1674035. Publication date: 2019-12-31.

Reading note: this technology, "一种基于mfcc的语音数字识别方法" (Voice digital recognition method based on MFCC), was designed by 朱静, 杨盛元, 尹邦政, 陈明希, 杨强, 魏慧棠, 何海城 and 李浩明 on 2019-09-20. Main content: The invention relates to speech recognition technology, specifically to an MFCC-based method for recognizing spoken digits. The input speech signal is first sampled and the sampled signal is preprocessed. Endpoint detection is then performed on the sampled and preprocessed signal to extract the individual digit speech signals. The MFCC features of each digit speech signal are extracted and matched, using the mean square error (MSE), against MFCC digit speech parameter templates obtained through training, thereby recognizing the digits in the speech signal. By combining MFCC features with MSE matching, the method achieves a high recognition rate while avoiding a large amount of computation, has high recognition efficiency, and can be applied in relatively complex environments.

1. An MFCC-based speech digit recognition method, characterized by comprising the following steps:

S1, sampling the input speech signal, and preprocessing the sampled speech signal;

S2, performing endpoint detection on the sampled and preprocessed speech signal, and extracting the individual digit speech signals;

S3, extracting the MFCC features of each digit speech signal;

S4, matching the MFCC features of each digit speech signal against the MFCC digit speech parameter templates obtained through training, using the mean square error (MSE), and recognizing the digits in the speech signal.

2. The speech digit recognition method of claim 1, wherein the preprocessing in step S1 includes channel conversion, pre-emphasis, framing and windowing of the speech signal.

3. The speech digit recognition method of claim 2, wherein the pre-emphasis process is calculated as:

H(z) = 1 - a z^{-1}

where a is the pre-emphasis factor, 0.9 < a < 1.0.

4. The speech digit recognition method of claim 2, wherein the windowing process is calculated as:

[Equation image FDA0002208400380000011: windowing function, not reproduced in this text]

where N is the window length.

5. The speech digit recognition method of claim 1, wherein step S2 implements the endpoint detection of the speech signal as follows:

The speech signal is first divided into frames, and an energy threshold TL, an energy threshold TH, a zero-crossing rate threshold ZCR, the maximum silence length allowed within a speech segment and the minimum length of a speech segment are set according to the speech signal; the zero-crossing rate and the short-time energy of the speech signal are calculated and adjusted to obtain the corresponding threshold ranges. When a frame exceeds the energy threshold TL or the zero-crossing rate threshold ZCR, it is regarded as a starting point of the speech signal; when a frame exceeds the energy threshold TH, it is regarded as formal speech. By comparing each frame against these thresholds, the speech state of the frame and whether it belongs to a required speech signal are determined, and finally the range from the starting point to the end point of the speech signal is obtained.

6. The speech digit recognition method of claim 1, wherein the extraction process of step S3 is: first, an FFT (fast Fourier transform) is applied to each frame of the speech signal to obtain the spectrum of that frame and hence its magnitude spectrum; a Mel filter bank is then applied to the magnitude spectrum and the logarithm is taken to obtain the output corresponding to each Mel filter; finally, a DCT is applied to obtain the MFCC features.

Technical Field

The invention relates to a voice recognition technology, in particular to a voice digital recognition method based on MFCC.

Background

With the development of computer and information technology, voice interaction has become an essential means of human-computer interaction. Speech recognition is an important direction in the development of computer technology. It has formed a theoretical system of considerable scale and has a very wide range of applications: functions such as voice dialing, emotion recognition and voiceprint recognition can be realized. These functions are closely related to people's daily lives, and the technology has very broad prospects.

Mel-frequency cepstral coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. The Mel is the unit of subjective pitch, while the Hz (hertz) is the unit of objective pitch. The Mel frequency scale is derived from the auditory characteristics of the human ear, and Mel frequency and Hz frequency are related nonlinearly. MFCCs are spectral features computed from the Hz spectrum by using this relationship. Because of the nonlinear correspondence between Mel frequency and Hz frequency, the calculation accuracy of the MFCCs decreases as frequency increases. Therefore, applications often use only the low-frequency MFCCs and discard the medium- and high-frequency ones. The key to understanding speech is that the sounds a person produces are filtered by the shape of the vocal tract, including the tongue, teeth and so on. This shape determines what sound comes out. If the shape can be determined accurately, the phoneme being produced can be represented accurately. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the role of the MFCCs is to represent this envelope accurately. MFCCs have long been among the most widely used features in speech recognition.
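The text does not state the Mel/Hz mapping it uses. As an illustration of the nonlinear relation described above, the sketch below assumes one commonly used convention, mel = 2595 * log10(1 + f / 700); the formula and the function names `hz_to_mel` and `mel_to_hz` are assumptions, not taken from the patent.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to mels (assumed 2595*log10(1 + f/700) convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel under the same assumed convention."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz correspond to shrinking steps in mels as frequency rises.
    for f in (0.0, 1000.0, 2000.0, 4000.0):
        print(f, round(hz_to_mel(f), 1))
```

Under this convention the scale is compressed at high frequencies: the step from 1000 Hz to 2000 Hz spans fewer mels than the step from 0 Hz to 1000 Hz.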

Mean square error (MSE) is a metric that reflects the degree of difference between an estimator and the quantity being estimated. Let t be an estimate of a population parameter θ determined from a sample; the mathematical expectation of (θ - t)^2 is called the mean square error of the estimate t. It is equal to σ^2 + b^2, where σ^2 and b are the variance and bias of t, respectively.
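The decomposition into variance plus squared bias follows from a short expansion. The sketch below uses the standard definitions b = E[t] - θ for the bias and σ^2 = E[(t - E[t])^2] for the variance; the cross term vanishes because E[t - E[t]] = 0.

```latex
\begin{aligned}
\mathrm{MSE}(t) &= E\big[(t-\theta)^2\big] \\
                &= E\big[\big((t - E[t]) + (E[t] - \theta)\big)^2\big] \\
                &= E\big[(t - E[t])^2\big]
                   + 2\,(E[t]-\theta)\,E\big[t - E[t]\big]
                   + (E[t]-\theta)^2 \\
                &= \sigma^2 + 0 + b^2 \;=\; \sigma^2 + b^2 .
\end{aligned}
```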

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides an MFCC-based speech digit recognition method, which first extracts the MFCC features corresponding to each spoken digit and then uses the mean square error (MSE) to match and recognize the corresponding digit.

The MFCC-based speech digit recognition method comprises the following steps:

S1, sampling the input speech signal, and preprocessing the sampled speech signal;

S2, performing endpoint detection on the sampled and preprocessed speech signal, and extracting the individual digit speech signals;

S3, extracting the MFCC features of each digit speech signal;

S4, matching the MFCC features of each digit speech signal against the MFCC digit speech parameter templates obtained through training, using the mean square error (MSE), and recognizing the digits in the speech signal.
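The matching step S4 can be sketched as a nearest-template search under MSE. This is a minimal illustration, not the patented implementation: it assumes every feature vector has already been brought to the same length as the templates (the text does not say how utterances of different duration are aligned), and the names `mse` and `recognize_digit` and the toy templates are hypothetical.

```python
from typing import Dict, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean square error between two equal-length feature vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def recognize_digit(features: Sequence[float],
                    templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the template with the smallest MSE to `features`."""
    return min(templates, key=lambda label: mse(features, templates[label]))


if __name__ == "__main__":
    # Toy 3-dimensional "MFCC" templates, made up for illustration only.
    toy_templates = {"0": [0.0, 0.0, 0.0], "1": [1.0, 1.0, 1.0]}
    print(recognize_digit([0.9, 1.1, 1.0], toy_templates))
```

Because each comparison is a single pass over the feature vector, recognition cost grows linearly with the number of templates, which is consistent with the low computational load the text claims.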

Preferably, the preprocessing process in step S1 includes channel conversion, pre-emphasis, framing, and windowing processes of the speech signal.

Preferably, the calculation formula of the pre-emphasis process is as follows:

H(z) = 1 - a z^{-1}

where a is the pre-emphasis factor, 0.9 < a < 1.0.
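In the time domain, the filter H(z) = 1 - a z^{-1} corresponds to y[n] = x[n] - a * x[n-1]. The sketch below applies it to a list of samples; the default a = 0.97 is an assumed value inside the stated range, and the function name `pre_emphasis` is illustrative.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], a: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - a * x[n-1]; the first sample is passed through unchanged."""
    if not 0.9 < a < 1.0:
        raise ValueError("pre-emphasis factor a must satisfy 0.9 < a < 1.0")
    if not signal:
        return []
    out = [float(signal[0])]
    for n in range(1, len(signal)):
        out.append(signal[n] - a * signal[n - 1])
    return out


if __name__ == "__main__":
    # A constant (pure low-frequency) input is strongly attenuated after the first sample.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0], a=0.95))
```

The filter attenuates slowly varying components and leaves rapid changes largely intact, which boosts the high-frequency part of the speech spectrum before framing.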

Preferably, the calculation formula of the windowing process is:

[Equation image BDA0002208400390000021: windowing function, not reproduced in this text]

where N is the window length.
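The windowing formula survives here only as an image reference, so the window type cannot be recovered from this text. The sketch below assumes a Hamming window, w(n) = 0.54 - 0.46 cos(2πn / (N - 1)), which is a common choice for speech frames; the window type, the function names and the frame length and hop used in the example are all assumptions.

```python
import math
from typing import List, Sequence


def hamming(n_points: int) -> List[float]:
    """Hamming window of length N (assumed; the source formula is not available)."""
    if n_points < 2:
        return [1.0] * n_points
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def frame_signal(signal: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a signal into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [list(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]


def window_frames(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Multiply every frame sample-by-sample with a window of matching length."""
    return [[x * w for x, w in zip(frame, hamming(len(frame)))] for frame in frames]


if __name__ == "__main__":
    frames = frame_signal([float(i) for i in range(10)], frame_len=4, hop=2)
    print(len(frames), window_frames(frames)[0])
```

Tapering each frame towards its edges reduces the spectral leakage that an abrupt cut would introduce in the subsequent FFT.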

Preferably, step S2 implements the endpoint detection of the speech signal as follows. The speech signal is first divided into frames, and an energy threshold TL, an energy threshold TH, a zero-crossing rate threshold ZCR, the maximum silence length allowed within a speech segment and the minimum length of a speech segment are set according to the speech signal; the zero-crossing rate and the short-time energy of the speech signal are calculated and adjusted to obtain the corresponding threshold ranges. When a frame exceeds the energy threshold TL or the zero-crossing rate threshold ZCR, it is regarded as a starting point of the speech signal; when a frame exceeds the energy threshold TH, it is regarded as formal speech. By comparing each frame against these thresholds, the speech state of the frame and whether it belongs to a required speech signal are determined, and finally the range from the starting point to the end point of the speech signal is obtained.
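One way to read the endpoint detection described above is as a small state machine over frames. The sketch below is an interpretation under stated assumptions, not the patent's exact procedure: TL is taken to be lower than TH, a candidate segment is kept only if some frame exceeds TH, and the thresholds are passed in by the caller rather than adapted to the signal. All function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def short_time_energy(frame: Sequence[float]) -> float:
    return sum(x * x for x in frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)


def detect_segments(frames: Sequence[Sequence[float]], tl: float, th: float,
                    zcr_thr: float, max_silence: int,
                    min_len: int) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, inclusive, for detected speech segments."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # index of the candidate starting frame
    confirmed = False            # True once a frame in the candidate exceeds TH
    silence = 0                  # consecutive inactive frames inside the candidate

    def close(last_active: int) -> None:
        if confirmed and last_active - start + 1 >= min_len:
            segments.append((start, last_active))

    for i, frame in enumerate(frames):
        energy = short_time_energy(frame)
        active = energy > tl or zero_crossing_rate(frame) > zcr_thr
        if start is None:
            if active:  # possible starting point
                start, confirmed, silence = i, energy > th, 0
            continue
        if active:
            silence = 0
            confirmed = confirmed or energy > th
        else:
            silence += 1
            if silence > max_silence:  # pause too long: the segment has ended
                close(i - silence)
                start = None
    if start is not None:  # signal ended while a candidate was still open
        close(len(frames) - 1 - silence)
    return segments
```

With 4-sample frames, two bursts of alternating ±1 samples separated by silence are returned as two segments, provided the gap is longer than `max_silence` frames and each burst spans at least `min_len` frames.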

In a preferred embodiment, the extraction process of step S3 is as follows: first, an FFT (fast Fourier transform) is applied to each frame of the speech signal to obtain the spectrum of that frame and hence its magnitude spectrum; a Mel filter bank is then applied to the magnitude spectrum and the logarithm is taken to obtain the output corresponding to each Mel filter; finally, a DCT is applied to obtain the MFCC features.
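The pipeline above (spectrum, Mel filter bank, logarithm, DCT) can be sketched for a single frame using only the Python standard library. Several details are assumptions not given in the text: a direct DFT stands in for the FFT (same result, slower), the Mel scale is mel = 2595 * log10(1 + f / 700), the filters are triangular, and the filter and coefficient counts (20 and 12) are illustrative defaults. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitudes of DFT bins 0..N/2 (direct DFT; an FFT gives the same values)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters: int, n_fft: int, sample_rate: float) -> List[List[float]]:
    """Triangular filters with centres equally spaced on the (assumed) Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):    # rising edge
            bank[m - 1][k] = (k - left) / (centre - left)
        for k in range(centre, right):   # falling edge, peak of 1.0 at the centre bin
            bank[m - 1][k] = (right - k) / (right - centre)
    return bank


def dct_ii(values: Sequence[float], n_coeffs: int) -> List[float]:
    """Unnormalised DCT-II, keeping the first n_coeffs coefficients."""
    m = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / m) for i, v in enumerate(values))
            for k in range(n_coeffs)]


def mfcc_frame(frame: Sequence[float], sample_rate: float,
               n_filters: int = 20, n_coeffs: int = 12) -> List[float]:
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small constant keeps the logarithm finite when a filter output is zero.
    log_outputs = [math.log(sum(w * s for w, s in zip(f, spectrum)) + 1e-10)
                   for f in bank]
    return dct_ii(log_outputs, n_coeffs)


if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(256)]
    print([round(c, 2) for c in mfcc_frame(tone, 8000.0)])
```

In practice each frame would be pre-emphasised and windowed before this step, and the per-frame coefficients of a whole digit would be collected into the feature used for template matching.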

Compared with the prior art, the invention has the following beneficial effects. Many methods can be used for isolated-word speech recognition in the prior art. Among them, the HMM algorithm requires a large amount of speech data in the training stage, and its model parameters can only be obtained through repeated computation; although its recognition accuracy is considerable, it is overkill for a task of this kind. Even in complex environments, for example with high noise and demanding real-time recognition requirements, the method of the invention can perform speech recognition accurately and quickly.

Drawings

FIG. 1 is a flow chart of the identification method of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
