Voice feature extraction method for monitoring manic episodes

Document No.: 1157627 Publication date: 2020-09-15

Abstract: This technology, a voice feature extraction method for monitoring manic episodes (一种用于监测躁狂发作的语音特征提取方法), was devised by 崔东红, 杜礼钊 and 林奥 on 2019-03-07. The invention discloses a voice feature extraction method for monitoring manic episodes and belongs to the technical field of speech processing. It comprises the following steps: (1) endpoint detection of the speech signal using the spectral entropy method; (2) pitch extraction from the speech signal using the cepstrum method; (3) voiced segment identification using average energy; (4) monitoring of manic episode results. Feature extraction yields the pitch and the pause time (PT) of the speech signal. Because pitch and PT reflect whether a person shows elevated mood and talkativeness, and because mood and talkativeness are intrinsically linked to manic symptoms, the extracted pitch and PT are compared with their normal-state values and serve as two monitoring indices for judging whether a manic episode is occurring.

1. A speech feature extraction method for monitoring manic episodes, comprising the steps of:

(1) endpoint detection of speech signals using spectral entropy method

Let the time-domain waveform of the speech signal be $x(i)$, and let the $n$-th frame obtained after windowing and framing be $x_n(m)$. Its Fourier transform (FFT) is denoted $X_n(k)$, where the subscript $n$ indexes the frame and $k$ indexes the spectral line. The short-time energy of the speech frame in the frequency domain is:

$$E_n = \sum_{k=0}^{N/2} X_n(k)\,X_n^{*}(k) \qquad (1)$$

in formula (1), $N$ is the length of the FFT, and only the positive-frequency part is taken;

the energy spectrum of spectral line $k$ is

$$Y_n(k) = X_n(k)\,X_n^{*}(k)$$

the normalized spectral probability density function of each frequency component is defined as:

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)} = \frac{Y_n(k)}{E_n} \qquad (2)$$

the short-time spectral entropy of the speech frame is defined as:

$$H_n = -\sum_{k=0}^{N/2} p_n(k)\,\log p_n(k) \qquad (3)$$

a discrimination threshold is set, and the spectral entropy of each frame of the speech is compared with the threshold to obtain the endpoint information;

(2) pitch extraction of speech signal using cepstrum

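pitch is one of the most basic indicators of a speech signal. The underlying principle is that speech $x(i)$ is produced by filtering the glottal pulse excitation $u(i)$ through the vocal tract response $v(i)$, that is, by their convolution:

$$x(i) = u(i) * v(i) \qquad (4)$$

taking the cepstrum of $x(i)$, $u(i)$ and $v(i)$ respectively gives:

$$\hat{x}(i) = \hat{u}(i) + \hat{v}(i) \qquad (5)$$

in the cepstrum, $\hat{u}(i)$ and $\hat{v}(i)$ are relatively well separated, so $\hat{u}(i)$ can be isolated in the cepstral domain and $u(i)$ then recovered to obtain the pitch. After the cepstrum is computed, the maximum of the cepstrum function is searched over the quefrency range; the corresponding sample index gives the pitch period of the current frame, and the pitch frequency is the sampling rate divided by that period;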

(3) voiced segment identification using average energy

The stored speech signal is a function of amplitude over time. The energy of every sample point is computed and summed, and the sum is divided by the number of samples to give the average energy. This average energy is compared with the average energies of voiced and unvoiced segments labelled in trials, which define a threshold: a segment is marked as voiced (speech present) when its average energy exceeds the threshold and as unvoiced (pause) otherwise. PT equals the ratio of the unvoiced time to the total conversation time of the segment;

(4) monitoring manic episode results

The pitch extracted from the speech signal is compared with the pitch in the normal state by plotting both against time; a clear separation between the two curves serves as one index of a manic episode. A pause time PT smaller than the normal-state PT set value serves as the other index of a manic episode.

Technical Field

The invention relates to a voice feature extraction method for monitoring manic episodes, and belongs to the technical field of voice processing.

Background

Bipolar disorder is a complex mood disorder characterized by episodes of mania and depression. According to the World Mental Health Survey of the World Health Organization, more than 1% of the world population suffers from bipolar disorder, and it ranks second in its impact on patients' lives. Mortality and suicide rates are much higher among patients with bipolar disorder than in the general population; the suicide rate is as much as 20 times higher. Compared with depressive episodes, the other pole of bipolar disorder, manic patients often act without restraint and are more likely to do dangerous things during an episode, such as impulsive investing or reckless driving. These impulsive behaviors can damage family and social relationships, cause accidents, and even directly threaten the lives of other people. The highly destructive behavior of manic patients during an episode makes mania a major public health and safety concern. However, because its pathogenesis is unclear and biomarkers are lacking, the diagnostic accuracy, therapeutic effect and prognosis of mania (bipolar disorder) remain unsatisfactory. Given its high incidence, difficult diagnosis, difficult cure and frequent recurrence, and given the current state of research, objective indices for detecting manic episodes are urgently needed. Nevertheless, relatively few studies currently try to detect or predict manic episodes. The main approaches are questionnaire analysis and interviews with patients and their immediate relatives. Both rely on subjective judgment grounded in the prior knowledge of the interviewer or interviewee, cannot do without the knowledge that professionals acquire through systematic theory and practice, and cannot support real-time analysis. Consequently, although they have some clinical and research value, they are not feasible in practical applications. Besides these two methods, one approach worth trying is speech analysis of manic patients.

The speech signal is one of the main objective biomarkers associated with mood swings, which are the main feature of bipolar affective disorder. According to the International Classification of Diseases, the most fundamental of all symptoms of bipolar affective disorder is a change in mood or affect, usually toward depression (with or without anxiety) or elation. Speech Signal Processing (SSP) therefore offers an effective and objective means of detecting manic symptoms.

Most SSP-based research uses machine learning to detect the manic state. In general, such methods extract a sufficient number of speech features and then use a classifier, such as a support vector machine or a Gaussian mixture model, to identify the patient's manic state. However, because a machine learning result is usually decided by maximum probability, that is, by choosing directly between "yes" and "no", its reliability can be questionable. In addition, the result is a probability obtained after all selected features have been classified together, so the relative importance of individual features cannot be measured effectively. In clinical practice, doctors prefer to see how one specific indicator, such as blood pressure, changes over time, because such indicators are effective, intuitive and targeted. In the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders, a manic episode is defined as a distinct period of abnormally and persistently elevated, expansive or irritable mood and abnormally increased goal-directed activity or energy, occurring nearly every day and lasting at least 1 week. Correspondingly, two basic parameters in speech signal processing, the pitch and the pause time (PT) of the speech signal, can broadly reflect "elevated mood" and "talkativeness". Pitch is one of the most important features in SSP and is closely related to emotional expression. It is a basic property of sound tied to the fundamental frequency, determined by the temporal regularity and the average repetition rate of the acoustic waveform. PT here refers to the proportion of the whole conversation occupied by pauses (no speech signal, or a speech signal close to 0). Therefore, when the patient is more talkative, PT should be smaller than normal.

Patients in a manic state should therefore have a higher pitch and a smaller PT than in the normal state. Accordingly, the technical scheme extracts the pitch and the pause time PT from the speech signal and compares them with their normal-state values in order to monitor manic episodes.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: a speech feature extraction method for monitoring manic episodes is provided, which solves the problem of monitoring manic episodes based on speech feature extraction.

The technical problem to be solved by the invention is realized by adopting the following technical scheme:

a method of speech feature extraction for monitoring manic episodes comprising the steps of:

(1) endpoint detection of speech signals using spectral entropy method

Let the time-domain waveform of the speech signal be $x(i)$, and let the $n$-th frame obtained after windowing and framing be $x_n(m)$. Its Fourier transform (FFT) is denoted $X_n(k)$, where the subscript $n$ indexes the frame and $k$ indexes the spectral line. The short-time energy of the speech frame in the frequency domain is:

$$E_n = \sum_{k=0}^{N/2} X_n(k)\,X_n^{*}(k) \qquad (1)$$

in formula (1), $N$ is the length of the FFT, and only the positive-frequency part is taken;

the energy spectrum of spectral line $k$ is

$$Y_n(k) = X_n(k)\,X_n^{*}(k)$$

The normalized spectral probability density function of each frequency component is defined as:

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)} = \frac{Y_n(k)}{E_n} \qquad (2)$$

the short-time spectral entropy of the speech frame is defined as:

$$H_n = -\sum_{k=0}^{N/2} p_n(k)\,\log p_n(k) \qquad (3)$$

a discrimination threshold is set, and the spectral entropy of each frame of the speech is compared with the threshold to obtain the endpoint information;

(2) pitch extraction of speech signal using cepstrum

pitch is one of the most basic indicators of a speech signal. The underlying principle is that speech $x(i)$ is produced by filtering the glottal pulse excitation $u(i)$ through the vocal tract response $v(i)$, that is, by their convolution:

$$x(i) = u(i) * v(i) \qquad (4)$$

taking the cepstrum of $x(i)$, $u(i)$ and $v(i)$ respectively gives:

$$\hat{x}(i) = \hat{u}(i) + \hat{v}(i) \qquad (5)$$

in the cepstrum, $\hat{u}(i)$ and $\hat{v}(i)$ are relatively well separated, so $\hat{u}(i)$ can be isolated in the cepstral domain and $u(i)$ then recovered to obtain the pitch. After the cepstrum is computed, the maximum of the cepstrum function is searched over the quefrency range; the corresponding sample index gives the pitch period of the current frame, and the pitch frequency is the sampling rate divided by that period;

(3) voiced segment identification using average energy

The stored speech signal is a function of amplitude over time. The energy of every sample point is computed and summed, and the sum is divided by the number of samples to give the average energy. This average energy is compared with the average energies of voiced and unvoiced segments labelled in trials, which define a threshold: a segment is marked as voiced (speech present) when its average energy exceeds the threshold and as unvoiced (pause) otherwise. PT equals the ratio of the unvoiced time to the total conversation time of the segment;

(4) monitoring manic episode results

The pitch extracted from the speech signal is compared with the pitch in the normal state by plotting both against time; a clear separation between the two curves serves as one index of a manic episode. A pause time PT smaller than the normal-state PT set value serves as the other index of a manic episode. To establish the pause time and the normal-state PT set value, the data of all subjects must be aggregated and a t-test performed.

The invention has the following beneficial effects: feature extraction yields the pitch and the pause time PT of the speech signal. Because pitch and PT reflect whether a person shows elevated mood and talkativeness, and because mood and talkativeness are intrinsically linked to manic symptoms, the extracted pitch and PT are compared with their normal-state values and serve as two monitoring indices for judging whether a manic episode is occurring.

Drawings

FIG. 1 is a graph showing the change with time of pitch values in the manic state and the normal state in example 1 of the present invention;

FIG. 2 is a graph showing the change with time of pitch values in the manic state and the normal state in example 2 of the present invention;

FIG. 3 is a graph showing the change with time of pitch values in the manic state and the normal state in example 3 of the present invention.

Detailed Description

To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further explained below.

A method of speech feature extraction for monitoring manic episodes comprising the steps of:

(1) endpoint detection of speech signals using spectral entropy method

Let the time-domain waveform of the speech signal be $x(i)$, and let the $n$-th frame obtained after windowing and framing be $x_n(m)$. Its Fourier transform (FFT) is denoted $X_n(k)$, where the subscript $n$ indexes the frame and $k$ indexes the spectral line. The short-time energy of the speech frame in the frequency domain is:

$$E_n = \sum_{k=0}^{N/2} X_n(k)\,X_n^{*}(k) \qquad (1)$$

in formula (1), $N$ is the length of the FFT, and only the positive-frequency part is taken;

the energy spectrum of spectral line $k$ is

$$Y_n(k) = X_n(k)\,X_n^{*}(k)$$

The normalized spectral probability density function of each frequency component is defined as:

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)} = \frac{Y_n(k)}{E_n} \qquad (2)$$

the short-time spectral entropy of the speech frame is defined as:

$$H_n = -\sum_{k=0}^{N/2} p_n(k)\,\log p_n(k) \qquad (3)$$

a discrimination threshold is set, and the spectral entropy of each frame of the speech is compared with the threshold to obtain the endpoint information;
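The endpoint detection step can be illustrated with a minimal Python sketch. It is not the patented implementation: the Hamming window, the frame length and hop, the function names (`frame_signal`, `spectral_entropy`, `detect_speech_frames`) and the midpoint default threshold are all assumptions made for illustration. The sketch also assumes that the background is noise-like, so frames whose spectral entropy falls below the threshold are treated as speech.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping Hamming-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def spectral_entropy(frames):
    """Short-time spectral entropy of each frame, positive frequencies only."""
    spec = np.fft.rfft(frames, axis=1)            # FFT lines k = 0..N/2
    energy = (spec * np.conj(spec)).real          # energy spectrum per line
    total = energy.sum(axis=1, keepdims=True)     # short-time frame energy
    p = energy / np.maximum(total, 1e-12)         # normalized spectral density
    return -(p * np.log(p + 1e-12)).sum(axis=1)   # entropy of each frame

def detect_speech_frames(x, frame_len=256, hop=128, threshold=None):
    """Flag frames as speech when their spectral entropy is below the threshold."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    h = spectral_entropy(frames)
    if threshold is None:
        # Assumption: midpoint of the observed entropy range as a crude default.
        threshold = 0.5 * (h.min() + h.max())
    return h < threshold, h
```

Endpoints are then the frame indices at which the returned flags switch between `False` and `True`. Note that digitally silent frames (all zeros) have zero entropy under this sketch, so a practical version would need an additional energy check.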

(2) pitch extraction of speech signal using cepstrum

pitch is one of the most basic indicators of a speech signal. The underlying principle is that speech $x(i)$ is produced by filtering the glottal pulse excitation $u(i)$ through the vocal tract response $v(i)$, that is, by their convolution:

$$x(i) = u(i) * v(i) \qquad (4)$$

taking the cepstrum of $x(i)$, $u(i)$ and $v(i)$ respectively gives:

$$\hat{x}(i) = \hat{u}(i) + \hat{v}(i) \qquad (5)$$

in the cepstrum, $\hat{u}(i)$ and $\hat{v}(i)$ are relatively well separated, so $\hat{u}(i)$ can be isolated in the cepstral domain and $u(i)$ then recovered to obtain the pitch. After the cepstrum is computed, the maximum of the cepstrum function is searched over the quefrency range; the corresponding sample index gives the pitch period of the current frame, and the pitch frequency is the sampling rate divided by that period;
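The cepstral peak search can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: it uses the real cepstrum (inverse FFT of the log magnitude spectrum), a Hamming window, and a hypothetical search range of 60 to 500 Hz. The function name `cepstral_pitch` and its parameters `fmin` and `fmax` are not from the source.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=500.0):
    """Estimate the pitch period (samples) and frequency (Hz) of one frame."""
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    spectrum = np.fft.fft(frame)
    # The log turns the product of excitation and vocal tract spectra into a
    # sum, so their cepstra add and occupy different quefrency regions.
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real
    # Restrict the peak search to quefrencies of plausible pitch periods
    # (assumed range), which skips the low-quefrency vocal tract part.
    q_lo = int(fs / fmax)
    q_hi = min(int(fs / fmin), len(frame) // 2)
    period = q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))
    return period, fs / period
```

Applying the function to every frame flagged as speech gives a pitch value per frame, which can then be plotted against time.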

(3) voiced segment identification using average energy

The stored speech signal is a function of amplitude over time. The energy of every sample point is computed and summed, and the sum is divided by the number of samples to give the average energy. This average energy is compared with the average energies of voiced and unvoiced segments labelled in trials, which define a threshold: a segment is marked as voiced (speech present) when its average energy exceeds the threshold and as unvoiced (pause) otherwise. PT equals the ratio of the unvoiced time to the total conversation time of the segment;
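A minimal Python sketch of the average energy computation and the PT ratio is given below. The text does not say how the threshold is derived from the labelled segments, so here it is simply passed in as `energy_threshold`. The fixed non-overlapping segmentation and the function name `pause_time_ratio` are likewise assumptions for illustration.

```python
import numpy as np

def pause_time_ratio(x, frame_len, energy_threshold):
    """Return PT (pause time / total time) and the per-segment voiced flags."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one segment")
    # Non-overlapping segments; trailing samples that do not fill one are dropped.
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Average energy: sum of squared samples divided by the number of samples.
    avg_energy = (frames ** 2).mean(axis=1)
    voiced = avg_energy > energy_threshold
    # All segments have equal duration, so the time ratio is a count ratio.
    pt = 1.0 - voiced.mean()
    return pt, voiced
```

For example, a recording in which two of five equal segments fall below the threshold yields PT = 0.4.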

(4) monitoring manic episode results

The pitch extracted from the speech signal is compared with the pitch in the normal state by plotting both against time; a clear separation between the two curves serves as one index of a manic episode. A pause time PT smaller than the normal-state PT set value serves as the other index of a manic episode. To establish the pause time and the normal-state PT set value, the data of all subjects must be aggregated and a t-test performed.
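The comparison step might be sketched in Python as follows. Several things here are assumptions, because the text does not specify them: the t-test variant (Welch's unequal-variance form is used), the numeric stand-in for "clearly separated" (a minimum gap between mean pitch values, `min_gap_hz`, with manic pitch assumed to be the higher of the two), and the function names `welch_t` and `manic_indices`. In the source, the pitch separation is judged from the time plot.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)   # squared standard error of sample a
    vb = variance(b) / len(b)   # squared standard error of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def manic_indices(pitch, pitch_normal, pt, pt_normal_setpoint, min_gap_hz):
    """Return the two monitoring indices as booleans (pitch index, PT index)."""
    # Assumed numeric criterion for "clearly separated" pitch curves.
    pitch_index = mean(pitch) - mean(pitch_normal) > min_gap_hz
    # PT index: pause time below the normal-state set value.
    pt_index = pt < pt_normal_setpoint
    return pitch_index, pt_index
```

The P value follows from the t statistic and the degrees of freedom through the t distribution. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns the P value directly.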

As shown in FIGS. 1 to 3, in the graphs of embodiments 1 to 3 the upper curve is the pitch value in the manic state and the lower curve is the pitch value in the normal state. The graphs show that the pitch values in the two states are clearly separated, so pitch can be used as an index for judging whether a manic episode is occurring.

Pitch was thus found to distinguish between the manic and normal states of a patient. Note that the speech signal must be long enough for this characteristic to be detected: the pitch in the manic and normal states may start at the same or a similar level, and the difference only emerges gradually over time, at which point pitch is able to reveal it.

The following table shows the pause time comparison and the corresponding P value for the two states:

State                 Normal            Mania             P value
Pause/total length    0.4987 ± 0.1161   0.3638 ± 0.0966   0.00028685

In the above table, the proportion of pause time in the manic state is 0.3638 ± 0.0966, which is significantly lower than the 0.4987 ± 0.1161 observed in the normal state. The P value is used to decide the outcome of a hypothesis test; equivalently, the test statistic can be compared against the rejection region of the relevant distribution. The smaller the P value, the more significant the result.

With this technical scheme, feature extraction yields the pitch and the pause time PT of the speech signal. Because pitch and PT reflect whether a person shows elevated mood and talkativeness, and because mood and talkativeness are intrinsically linked to manic symptoms, the extracted pitch and PT are compared with their normal-state values and serve as two monitoring indices for judging whether a manic episode is occurring.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.
