Voice endpoint detection method and device based on non-uniform sub-band separation variance

Document No.: 1650344 | Published: 2019-12-24

Abstract: The invention, by 黄翔东 (Huang Xiangdong), 曹璐 (Cao Lu) and 刘子楠 (Liu Zinan), dated 2019-09-25, discloses a voice endpoint detection method and device based on non-uniform sub-band separation variance. The method comprises: computing the amplitude spectrum of each frame of the speech signal after framing; converting the effective frequency band of the speech signal to the Mel domain, dividing it uniformly into q sub-bands in the Mel domain, and converting the center, lower-limit and upper-limit frequencies of each sub-band back to actual frequencies in Hz; expanding the amplitude spectrum by interpolation, using the converted frequencies to compute the average spectral amplitude within each sub-band, taking the mean over the sub-bands, and from these computing the sub-band variance of each frame; and computing the average variance of the noise from the leading non-speech segment, setting upper and lower thresholds accordingly, and applying a double-threshold decision to obtain the final voice endpoint detection result. The device comprises an analog-to-digital converter and a DSP chip. The method is efficient to implement and has strong robustness.

1. A voice endpoint detection method based on non-uniform sub-band separation variance, characterized by comprising the following steps:

calculating the amplitude spectrum of each frame signal after framing;

converting the effective frequency band of each frame of the speech signal to the Mel domain, dividing it uniformly into q sub-bands in the Mel domain, and converting the center frequency, lower-limit frequency and upper-limit frequency of each sub-band back to actual frequencies in Hz;

expanding the amplitude spectrum by interpolation, computing the average spectral amplitude within each sub-band using the converted actual frequencies, taking the mean over the sub-bands, and from these computing the sub-band variance of each frame;

and computing the average variance of the noise from the per-frame sub-band variances, setting an upper threshold and a lower threshold accordingly, and applying a double-threshold decision to obtain the final voice endpoint detection result.

2. The voice endpoint detection method based on non-uniform sub-band separation variance according to claim 1, wherein setting the upper and lower thresholds and applying the double-threshold decision to obtain the final voice endpoint detection result specifically comprises:

1) making a first, coarse decision with an upper threshold selected on the envelope of the Mel-domain sub-band separation variance of the speech: frames whose variance is above the upper threshold are taken to be speech, and the speech start and end points lie outside the time points at which the threshold intersects the sub-band variance curve;

2) determining a lower threshold and finding the two points at which the sub-band variance envelope intersects it; the segment between these two points is the final speech segment.

3. A voice endpoint detection device based on non-uniform sub-band separation variance, characterized in that the device comprises an analog-to-digital converter and a DSP chip, wherein

the audio signal passes through the analog-to-digital converter and is fed to the DSP chip as parallel digital input;

the DSP chip, when executing a program, implements the method steps of claim 1.

Technical Field

The invention relates to the technical field of digital signal processing, in particular to a voice endpoint detection method and device based on non-uniform sub-band separation variance, and specifically to how to determine the start and end points of speech both in quiet environments and in the presence of noise.

Background

Voice endpoint detection, also known as voice activity detection (VAD), is commonly used at the front end of speech processing systems. Its aim is to separate the useful speech signal from undesired interference in sampled data recorded under various kinds of environmental noise, laying a foundation for better performance in subsequent speech processing. In general, noise-robust features must be extracted from the samples to distinguish speech from non-speech and to determine the start and end points of each speech segment. For the speech recognition and speech enhancement systems in wide use today, endpoint detection accuracy is one of the key factors in overall system performance [1].

Since speech endpoint detection was first proposed at Bell Laboratories, the technology has matured over nearly half a century of development, and many excellent methods continue to emerge. They fall roughly into two categories, threshold-based and model-based. Threshold-based methods extract a characteristic value that distinguishes speech from noise, compare it with a preset threshold, and make the final decision accordingly [2]. The characteristic values used are mainly time-domain, frequency-domain and cepstral-domain parameters, such as energy, zero-crossing rate, cepstral coefficients, spectral distance and spectral entropy [3]. Compared with model-based methods, they are simple to operate and easy to implement, but their detection accuracy is low. Model-based methods are more complex: they usually transform the speech signal to another domain (such as the discrete cosine transform domain), extract multi-dimensional features (such as Mel cepstra) from the transformed signal, and depend heavily on the model that has been built. Because the feature dimension is large, they need a long transition time from transient to steady state to adapt to changes in noise and interference, and their computational complexity is high, so they are unsuitable for real-time use (for example, online real-time endpoint detection in a hearing aid).

For clean speech, both kinds of method can locate the speech boundaries very accurately. In practice, however, most speech is embedded in a complex background containing more than one type of noise, and reliably distinguishing speech segments from noise segments becomes the primary problem in endpoint detection. For threshold-based decision specifically, a threshold must first be set; when the decision parameter of the signal exceeds it, the signal is judged to be speech, and otherwise noise. The choice of characteristic parameters is therefore crucial, and a good detection method should have the following properties:

1) Accuracy: the boundary points of speech segments must be determined accurately;

2) Stability: the detection algorithm must be robust, with strong noise immunity;

3) Adaptivity: the decision criterion should be adaptive rather than relying on a fixed threshold alone;

4) Computational complexity: the algorithm should be computationally light, which eases hardware implementation.

References

[1] Zhao Li. Speech Signal Processing [M]. 3rd ed. Beijing: China Machine Press, 2016.

[2] Hu Hang. Speech Signal Processing [M]. Harbin: Harbin Institute of Technology Press, 2000: 163-17.

[3] Sumin. Research on speech enhancement and related technologies under low signal-to-noise ratio [D]. Nanjing University of Posts and Telecommunications, 2018.

[4] Mark Marzinzik, et al. Speech Pause Detection for Noise Spectrum Estimation by Tracking Power Envelope Dynamics. IEEE Transactions on Speech and Audio Processing, 2002, 10(2): 109-111.

[5] Von, large. Research on adaptive voice endpoint detection technology [D]. Beijing University of Posts and Telecommunications, 2008.

[6] Lejia Anna. Research on voice endpoint detection methods in noisy environments [D]. South China University of Technology, 2015.

[7] Ishizuka, J, et al. Study of Noise Robust Voice Activity Detection Based on Periodic Component To Aperiodic Component Ratio. Proc. of SAPA, 2006, 06(9): 65-70.

[8] Li Zuipeng, Yao Yiyang. A new method for detecting the start and end points of a speech segment [J]. Telecommunication Technology, 2003, 3: 68-70.

[9] Tanyer S G, Ozer H. Voice Activity Detection in Non-stationary Noise [J]. IEEE Transactions on Speech and Audio Processing, 2000, 8(4): 478-482.

[10] Research on an optimized voice endpoint detection algorithm based on self-contained energy characteristics [J]. Acoustics Report, 2005, 24(2): 171-.

[11] Zhanhui. Digital Speech Processing and MATLAB Simulation [M]. Electronics Industry Press, 2016.

[12] Application of MATLAB in Speech Signal Analysis and Synthesis [M]. Beijing University of Aerospace Publishers, 2013.

Disclosure of Invention

The invention provides a voice endpoint detection method and device based on non-uniform sub-band separation variance, which divides the spectrum into sub-bands in the Mel domain, computes the variance across the sub-bands, and makes the final decision with double thresholds, as described in detail below:

A voice endpoint detection method based on non-uniform sub-band separation variance, the method comprising:

calculating the amplitude spectrum of each frame of voice signals after framing;

converting the effective frequency band of the speech signal to the Mel domain, dividing it uniformly into q sub-bands in the Mel domain, and converting the center frequency, lower-limit frequency and upper-limit frequency of each sub-band back to actual frequencies in Hz;

expanding the amplitude spectrum by interpolation, computing the average spectral amplitude within each sub-band using the converted actual frequencies, taking the mean over the sub-bands, and from these computing the sub-band variance of each frame;

and computing the average variance of the noise from the per-frame sub-band variances, setting an upper threshold and a lower threshold accordingly, and applying a double-threshold decision to obtain the final voice endpoint detection result.
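The feature-extraction steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the Mel mapping 2595·log10(1 + f/700) is the commonly used formula (the text does not say which one it uses), the interpolation is done by resampling the amplitude spectrum on a dense grid inside each sub-band, and the names `mel_band_edges`, `subband_variance` and `points_per_band` are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common Mel mapping (assumption: the source does not state its formula).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(f_low, f_high, q):
    """Lower and upper edges (Hz) of q sub-bands that are uniform on the Mel scale."""
    mel_edges = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), q + 1)
    hz_edges = mel_to_hz(mel_edges)
    return hz_edges[:-1], hz_edges[1:]

def subband_variance(frames, fs, q=7, f_low=0.0, f_high=None, points_per_band=32):
    """Per-frame variance of the mean amplitude across q Mel-spaced sub-bands.

    frames: 2-D array with one frame per row.
    """
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    f_high = fs / 2.0 if f_high is None else f_high
    spectrum = np.abs(np.fft.rfft(frames, axis=1))           # amplitude spectrum
    bin_freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    lows, highs = mel_band_edges(f_low, f_high, q)

    band_means = np.empty((frames.shape[0], q))
    for i, (lo, hi) in enumerate(zip(lows, highs)):
        # Interpolate onto a dense grid inside the band so that narrow
        # low-frequency bands still contain enough spectral points.
        grid = np.linspace(lo, hi, points_per_band)
        for t in range(frames.shape[0]):
            band_means[t, i] = np.interp(grid, bin_freqs, spectrum[t]).mean()

    # Sample variance of the q band means within each frame.
    return band_means.var(axis=1, ddof=1)
```

A frame with a flat spectrum gives a variance near zero, while a frame whose energy is concentrated in a few sub-bands gives a large one, which is the contrast the decision stage relies on.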

Setting the upper and lower thresholds and applying the double-threshold decision to obtain the final voice endpoint detection result is specifically as follows:

1) making a first, coarse decision with an upper threshold selected on the envelope of the Mel-domain sub-band separation variance of the speech: frames whose variance is above the upper threshold are taken to be speech, and the speech start and end points lie outside the time points at which the threshold intersects the sub-band variance curve;

2) determining a lower threshold and finding the two points at which the sub-band variance envelope intersects it; the segment between these two points is the final speech segment.
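The two decision steps can be sketched as follows. This is an illustrative reading of the double-threshold procedure, assuming the variance envelope is available as a per-frame array; the function name `double_threshold_segments` and the choice to return inclusive frame indices are assumptions, not part of the source.

```python
import numpy as np

def double_threshold_segments(envelope, upper, lower):
    """Return inclusive (start, end) frame indices of detected speech segments.

    Coarse pass: frames above `upper` are taken to be speech.
    Refinement: each coarse run is widened outward until the envelope
    drops to or below `lower`.
    """
    envelope = np.asarray(envelope, dtype=float)
    n = envelope.size
    segments = []
    t = 0
    while t < n:
        if envelope[t] <= upper:
            t += 1
            continue
        # Coarse run of consecutive frames above the upper threshold.
        start = end = t
        while end + 1 < n and envelope[end + 1] > upper:
            end += 1
        # Widen both ends to the crossings with the lower threshold.
        while start > 0 and envelope[start - 1] > lower:
            start -= 1
        while end + 1 < n and envelope[end + 1] > lower:
            end += 1
        segments.append((start, end))
        t = end + 1
    return segments
```

For example, with the envelope `[0, 1, 3, 6, 7, 6, 3, 1, 0, 0]`, an upper threshold of 5 and a lower threshold of 2, the coarse pass marks frames 3 to 5 and the refinement widens the segment to frames 2 to 6.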

A voice endpoint detection device based on non-uniform sub-band separation variance comprises an analog-to-digital converter and a DSP chip, wherein

the audio signal passes through the analog-to-digital converter and is fed to the DSP chip as parallel digital input;

the DSP chip, when executing a program, implements the method steps of claim 1.

The technical scheme provided by the invention has the beneficial effects that:

1. the variance across the Mel sub-bands of the speech spectrum is computed, the average variance of the noise is estimated from the leading non-speech segment, and different thresholds are selected for the decision according to the signal-to-noise ratio;

2. the detection result can be used at the front end of speech separation and enhancement systems, allowing silent segments and speech segments to be processed differently;

3. the method is efficient to implement and robust.
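The threshold setting mentioned in item 1 might look like the sketch below. The text says only that the thresholds are derived from the average noise variance and chosen according to the signal-to-noise ratio; the multipliers `k_upper` and `k_lower`, their default values and the function name are hypothetical placeholders that would need tuning for a given SNR.

```python
import numpy as np

def thresholds_from_leading_noise(variance, n_noise_frames, k_upper=3.0, k_lower=1.5):
    """Scale the mean variance of the leading noise-only frames into two thresholds.

    k_upper and k_lower are illustrative multipliers, not values from the source.
    Returns (upper, lower).
    """
    variance = np.asarray(variance, dtype=float)
    if not 0 < n_noise_frames <= variance.size:
        raise ValueError("n_noise_frames must be between 1 and the number of frames")
    if k_lower >= k_upper:
        raise ValueError("k_lower must be smaller than k_upper")
    noise_mean = variance[:n_noise_frames].mean()
    return k_upper * noise_mean, k_lower * noise_mean
```

Deriving both thresholds from the same noise estimate keeps them proportional to the background level, so the decision adapts when the noise floor changes between recordings.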

Drawings

FIG. 1 is a schematic diagram of speech signal framing;

FIG. 2 is a graph of an energy envelope of a speech signal and a noise signal;

FIG. 3 is a diagram illustrating Mel-domain sub-band division of 7 sub-bands;

FIG. 4 is a schematic diagram of a decision process of the double threshold method;

FIG. 5 is a diagram of a frequency domain subband variance envelope;

FIG. 6 is a comparison of the results of different detection methods;

FIG. 7 is a diagram of a hardware implementation of the present invention;

FIG. 8 is a flow chart of the internal processing of the DSP.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

The invention exploits the difference between the attenuation characteristics of the logarithmic energy envelopes of speech and background noise: it computes the Mel-domain sub-band separation variance, sets upper and lower thresholds from the mean sub-band separation variance of the leading non-speech segment, and makes the final decision with the double-threshold method.
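For orientation, the quantities involved can be written out as below. This is a sketch under assumptions: the Mel mapping shown is the commonly used one, the variance uses the 1/(q-1) normalization, and the symbols m, M_i(n), E(n) and D(n) are introduced here for illustration and do not appear in the excerpt.

```latex
% Hz <-> Mel mapping (common form; assumed, not stated in the source)
m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
f = 700\left(10^{\,m/2595} - 1\right)

% M_i(n): average spectral amplitude in sub-band i of frame n, i = 1, \dots, q
E(n) = \frac{1}{q} \sum_{i=1}^{q} M_i(n)

% Sub-band separation variance of frame n
D(n) = \frac{1}{q-1} \sum_{i=1}^{q} \bigl( M_i(n) - E(n) \bigr)^{2}
```

Because the sub-bands are equally wide in m but not in f, low-frequency bands are narrow and high-frequency bands are wide in Hz, which is what makes the division non-uniform.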
