Method for recognizing coughs and sneezes in a real-time voice stream

Document No.: 1364317 · Published: 2020-08-11 · Views: 40 · Chinese

Reading note: This technique, "Method for recognizing coughs and sneezes in a real-time voice stream", was designed and created by Sun Baoshi on 2020-03-24. Its main content is as follows: The invention discloses a method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration. The method comprises: continuously acquiring voice signals and framing the acquired voice data; performing endpoint detection on the voice data frames to locate the starting frame of candidate target speech, where the endpoint detection adopts a three-threshold method; employing both time-domain and frequency-domain features, with feature values selected for the characteristics of coughs and sneezes; accelerating the processing of the feature vector queue; and flexibly switching among three working modes, together forming a complete operation and continuous-optimization workflow. Beneficial effects of the invention: 1. High recognition rate: the method includes several key innovations and treats coughs and sneezes specially, so its recognition rate is markedly higher than that of other existing methods.

1. A method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration, comprising:

continuously acquiring voice signals, and framing the acquired voice data;

carrying out endpoint detection on the voice data frames to locate the starting frame of candidate target speech; the endpoint detection adopts a three-threshold method, namely:

(1) the average energy of the frame's samples is greater than threshold 1, and

(2) the frame's zero-crossing rate (the percentage of frame sample values greater than zero) is greater than threshold 2, and

(3) the average energy difference between the frame and the previous frame (the absolute value of the difference between the two frames' average energies) is greater than threshold 3. If the frame satisfies all three thresholds, the frame's Frame Mute Flag (FMF) is set to 1 (true); otherwise the FMF is set to 0 (false);

updating the Recognition Activated Flag (RAF), whose initial value is 0 (not activated): RAF = RAF | FMF;

checking the RAF: if the RAF is false, the recognition process is not activated, so the current frame is discarded and processing jumps back to the beginning to continue voice sampling;

if the RAF is true, the recognition process is activated, and the current frame is characterized to obtain a frame feature vector of 20 feature values;

adding the frame feature vector to the tail of the feature vector queue;

if the length of the feature vector queue reaches the recognizable length (RecoLen), feeding the feature vector queue (a RecoLen × 20 feature matrix) into a machine learning model trained in advance for recognition; otherwise, continuing voice sampling;

the recognizable length RecoLen is one dimension of the machine learning model's two-dimensional input sample and indicates how many data frames one input sample contains; RecoLen takes values between 20 and 32 frames, corresponding to about 1.25-2 seconds of speech data, which is essentially the time window of a single cough or sneeze;

if the confidence level (CL) of the recognition result exceeds a recognition threshold set by the system, considering that one cough or sneeze has been effectively detected, counting it, outputting the recognition result, emptying the feature vector queue, and setting the RAF to 0; then jumping to the beginning to start a new recognition process;

if the confidence level (CL) of the recognition result does not exceed the recognition threshold set by the system, considering that no cough or sneeze has been effectively detected, and performing acceleration processing on the feature vector queue according to the specific value of the CL;

after the acceleration process is completed, a new recognition process is started.

2. The method according to claim 1, wherein the whole processing flow constitutes the "operation mode" of the method; together with a "training mode" and a "collection mode" there are three working modes in total, and the working mode is controlled by system parameters;

if the device works in a training mode, the frame feature vectors need to be reported to a server or a cloud platform while being enqueued;

if the system works in the acquisition mode, the framed voice data needs to be uploaded to a server or a cloud platform.

3. The method of claim 1, wherein threshold 1 performs absolute-silence filtering, threshold 2 performs relative-silence filtering, and threshold 3 targets the abrupt energy changes characteristic of coughs and sneezes so as to filter out smoother normal speech.

4. The method of claim 1, wherein the acceleration processing of the feature vector queue comprises:

(1) acceleration 1: removing the leading (100% - CL) proportion of frames from the feature vector queue; for example, assuming RecoLen is 20, if recognition yields a CL of 60%, the leading 40% of the frames, i.e. 8 frames, are removed from the queue;

(2) acceleration 2: finding the first frame whose FMF is 1 (true) among the remaining frames in the feature vector queue and discarding all frames before it; if no frame with FMF = 1 (true) is found, emptying the feature vector queue and setting the RAF to 0.

5. The method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration according to claim 1, wherein the training process of the cough and sneeze machine learning model comprises:

the training process is divided into off-line training and on-line training, and can be used independently or cooperatively;

the off-line training can obtain voice data from external sources, or the recognition device can be set to collection mode to collect raw voice data;

preprocessing the voice data by dividing it into segments whose length equals RecoLen frames; the preprocessing can be done manually or with dedicated voice-file processing software;

classifying and labeling voice files, comprising: coughing, sneezing, and others;

extracting the feature vector queue of each voice segment using the framing and characterization methods described in the recognition process; if the length is less than RecoLen, padding with zero vectors, and if the length exceeds RecoLen, truncating;

on a server or cloud platform, feeding the feature values and labels into the model in batches for training and validation;

importing the satisfactorily trained model into the recognition device and updating the recognition model;

when online learning is carried out, the operation mode of the recognition device is set to training mode so as to directly obtain the feature vectors of the voice data frames;

the feature vectors are uploaded online to a server or cloud platform;

the server or cloud platform treats every RecoLen consecutive feature vectors as one training sample;

if a recognition result is received, a new training sample is started; if the previous sample's length is less than RecoLen, it is padded with zero vectors;

meanwhile, the samples are manually labeled online with the classes: coughing, sneezing, and others;

the existing model is incrementally optimized with newly obtained training samples using a transfer learning method;

the recognition results of the optimized model can be compared with those of the existing model to evaluate the optimization effect;

and the satisfactorily trained model is imported into the MCU recognition device, updating the recognition model.

6. The method of claim 1, wherein the voice data frame characterization process comprises:

Respectively performing time domain characterization and frequency domain characterization on an input voice data frame;

time-domain characterization: based on the instantaneous amplitude changes characteristic of cough and sneeze sounds, three feature values are calculated, including:

(1) the frame's sampling fluctuation value = maximum sample value - minimum sample value;

(2) the energy difference between the current frame and the previous frame = abs(average of the current frame's samples - average of the previous frame's samples), where abs is the absolute-value function;

(3) the energy variance of the frame's slices, representing the energy fluctuation within the frame;

frequency-domain characterization, comprising two parts: the first part is the Mel-Frequency Cepstral Coefficients (MFCC), standard for frequency-domain analysis of voice signals and consisting mainly of a Fast Fourier Transform (FFT), a mel-frequency filter bank, and a Discrete Cosine Transform (DCT);

the second part of the frequency-domain characterization takes the 16 feature values of the first part and computes the band energy variance using the standard-deviation formula, yielding one additional feature value.

7. The method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration according to claim 1, wherein the "machine learning model" includes but is not limited to a two-dimensional convolutional neural network (2D CNN), a long short-term memory network (LSTM), and a random forest (RF).

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.

Technical Field

The invention relates to the field of cough and sneeze recognition, and in particular to a method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration.

Background

Coughing and sneezing are common symptoms of respiratory diseases. In public places such as classrooms, offices, meeting rooms, and restaurants, automatically detecting the coughs and sneezes of unspecified people makes it possible to spot potential disease sources in time and carry out effective prevention and control.

The prior art for this problem can be regarded as a special application of general speech recognition technology: frequency-domain features of the audio signal are extracted by methods such as the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC), and feature matching is then performed by pattern recognition or machine learning.

Meanwhile, cough and sneeze detection products that fit the above scenario (unspecified people in public places) are hardly available on the market. The products that can be found include medical contact-type personal cough detectors and cough-detection mobile apps (aimed at individuals).


For the scenario described above, the conventional technology has the following problems:

1. Low accuracy: in the prior art, frequency-domain features of the audio signal are extracted by methods such as the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC), and feature matching is then performed by pattern recognition or machine learning. Existing methods lack special treatment of coughs and sneezes and of the public-place, unspecified-crowd environment, so in practical applications their accuracy and robustness are not high.

2. Poor practicability: existing methods, especially those described in academic articles, basically operate in relatively ideal experimental environments and are optimized only for individual metrics. The complex environments and large-scale deployment of real applications are not comprehensively considered, so the related methods are difficult to put into practice.

3. Heavy resource usage, making offline recognition difficult: existing methods involve many invalid operations and highly redundant feature data, demanding substantial computing and storage resources; they can hardly run standalone on a conventional microcontroller, so offline recognition is difficult to achieve and the range of application is greatly limited.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration, which comprises: three-threshold endpoint detection; time-domain plus frequency-domain (two-domain) characterization that optimizes the feature vector and compresses its dimension; feature-vector-queue acceleration; flexible switching among three working modes; and a complete operation and continuous-optimization workflow. The method offers high recognition efficiency and accuracy, low resource usage, good robustness, and suitability for large-scale deployment.

In order to solve the above technical problems, the present invention provides a method for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration, comprising:

continuously acquiring voice signals, and framing the acquired voice data;

carrying out endpoint detection on the voice data frames to locate the starting frame of candidate target speech; the endpoint detection adopts a three-threshold method, namely:

(1) the average energy of the frame's samples is greater than threshold 1, and

(2) the frame's zero-crossing rate (the percentage of frame sample values greater than zero) is greater than threshold 2, and

(3) the average energy difference between the frame and the previous frame (the absolute value of the difference between the two frames' average energies) is greater than threshold 3;

if the frame satisfies all three thresholds, the frame's Frame Mute Flag (FMF) is set to 1 (true); otherwise the FMF is set to 0 (false);

updating the Recognition Activated Flag (RAF), whose initial value is 0 (not activated): RAF = RAF | FMF;

checking the RAF: if the RAF is false, the recognition process is not activated, so the current frame is discarded and processing jumps back to the beginning to continue voice sampling;

if the RAF is true, the recognition process is activated, and the current frame is characterized to obtain a frame feature vector of 20 feature values;

adding the frame feature vector to the tail of the feature vector queue;

if the length of the feature vector queue reaches the recognizable length (RecoLen), feeding the feature vector queue (a RecoLen × 20 feature matrix) into a machine learning model trained in advance for recognition; otherwise, continuing voice sampling;

the recognizable length RecoLen is one dimension of the machine learning model's two-dimensional input sample and indicates how many data frames one input sample contains; RecoLen takes values between 20 and 32 frames, corresponding to about 1.25-2 seconds of speech data, which is essentially the time window of a single cough or sneeze;

if the confidence level (CL) of the recognition result exceeds a recognition threshold set by the system, considering that one cough or sneeze has been effectively detected, counting it, outputting the recognition result, emptying the feature vector queue, and setting the RAF to 0; then jumping to the beginning to start a new recognition process;

if the confidence level (CL) of the recognition result does not exceed the recognition threshold set by the system, considering that no cough or sneeze has been effectively detected, and performing acceleration processing on the feature vector queue according to the specific value of the CL;

after the acceleration process is completed, a new recognition process is started.

In one embodiment, the whole processing flow constitutes the "operation mode" of the method; together with a "training mode" and a "collection mode" there are three working modes in total, and the working mode is controlled by system parameters;

if the device works in a training mode, the frame feature vectors need to be reported to a server or a cloud platform while being enqueued;

if the system works in the acquisition mode, the framed voice data needs to be uploaded to a server or a cloud platform.

In one embodiment, threshold 1 performs absolute-silence filtering, threshold 2 performs relative-silence filtering, and threshold 3 targets the abrupt energy changes characteristic of coughs and sneezes so as to filter out smoother normal speech.

In one embodiment, the acceleration processing of the feature vector queue specifically comprises:

(1) acceleration 1: removing the leading (100% - CL) proportion of frames from the feature vector queue; for example, assuming RecoLen is 20, if recognition yields a CL of 60%, the leading 40% of the frames, i.e. 8 frames, are removed from the queue;

(2) acceleration 2: finding the first frame whose FMF is 1 (true) among the remaining frames in the feature vector queue and discarding all frames before it; if no frame with FMF = 1 (true) is found, emptying the feature vector queue and setting the RAF to 0.

In one embodiment, the training process of the cough and sneeze machine learning model comprises:

the training process is divided into off-line training and on-line training, and can be used independently or cooperatively;

the off-line training can obtain voice data from external sources, or the recognition device can be set to collection mode to collect raw voice data;

preprocessing the voice data by dividing it into segments whose length equals RecoLen frames; the preprocessing can be done manually or with dedicated voice-file processing software;

classifying and labeling voice files, comprising: coughing, sneezing, and others;

extracting the feature vector queue of each voice segment using the framing and characterization methods described in the recognition process; if the length is less than RecoLen, padding with zero vectors, and if the length exceeds RecoLen, truncating;

on a server or cloud platform, feeding the feature values and labels into the model in batches for training and validation;

importing the satisfactorily trained model into the recognition device and updating the recognition model;

when online learning is carried out, the operation mode of the recognition device is set to training mode so as to directly obtain the feature vectors of the voice data frames;

the feature vectors are uploaded online to a server or cloud platform;

the server or cloud platform treats every RecoLen consecutive feature vectors as one training sample;

if a recognition result is received, a new training sample is started; if the previous sample's length is less than RecoLen, it is padded with zero vectors;

meanwhile, the samples are manually labeled online with the classes: coughing, sneezing, and others;

the existing model is incrementally optimized with newly obtained training samples using a transfer learning method;

the recognition results of the optimized model can be compared with those of the existing model to evaluate the optimization effect;

and the satisfactorily trained model is imported into the MCU recognition device, updating the recognition model.

In one embodiment, the voice data frame characterization process comprises:

Respectively performing time domain characterization and frequency domain characterization on an input voice data frame;

time-domain characterization: based on the instantaneous amplitude changes characteristic of cough and sneeze sounds, three feature values are calculated, including:

(1) the frame's sampling fluctuation value = maximum sample value - minimum sample value;

(2) the energy difference between the current frame and the previous frame = abs(average of the current frame's samples - average of the previous frame's samples), where abs is the absolute-value function;

(3) the energy variance of the frame's slices, representing the energy fluctuation within the frame;

frequency-domain characterization, comprising two parts: the first part is the Mel-Frequency Cepstral Coefficients (MFCC), standard for frequency-domain analysis of voice signals and consisting mainly of a Fast Fourier Transform (FFT), a mel-frequency filter bank, and a Discrete Cosine Transform (DCT);

the second part of the frequency-domain characterization takes the 16 feature values of the first part and computes the band energy variance using the standard-deviation formula, yielding one additional feature value.

In one embodiment, the "machine learning model" includes but is not limited to a two-dimensional convolutional neural network (2D CNN), a long short-term memory network (LSTM), and a random forest (RF).

Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.

Based on the same inventive concept, the present application further provides a processor for executing a program, wherein the program executes to perform any one of the methods.

The invention has the beneficial effects that:

1. High recognition rate: the method includes several key innovations and treats coughs and sneezes specially, so its recognition rate is markedly higher than that of other existing methods;

2. High efficiency: invalid processing is greatly reduced, the feature vector dimension is heavily compressed, and the feature-vector-queue acceleration technique makes the operating efficiency markedly higher than that of other existing methods;

3. Low resource usage: the method is simple in structure and highly efficient, can run standalone on a general-purpose microcontroller (MCU), and offers dual advantages of function and cost.

Drawings

FIG. 1 is a flow chart of the method of the present invention for recognizing coughs and sneezes in a real-time voice stream using two-domain characterization and queue acceleration.

FIG. 2 is a flow chart of the voice data frame characterization process of the method.

FIG. 3 is a diagram of the training process of the cough and sneeze machine learning algorithm of the method.

Detailed Description

The present invention is further described below in conjunction with the figures and specific examples so that those skilled in the art may better understand and practice it; the examples are not intended to limit the invention.

The invention discloses a method for efficiently and automatically recognizing cough and sneeze segments in a real-time voice stream; the method can run on a mainstream 32-bit microcontroller (MCU) such as an STM32. The flow is shown in FIG. 1, and the concrete steps include:

A microphone (digital or analog) continuously collects the voice signal and feeds it into an MCU port; the port includes but is not limited to an analog-to-digital converter (A/D), a Serial Peripheral Interface (SPI), an I2C bus interface, and the like. The sampling rate is 16 kHz with 16-bit samples.

The MCU frames the collected voice data: each frame is 125 milliseconds and overlaps the previous frame by 50%, corresponding to 2048 sample values per frame with an advance of 1024 sample values each time.
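
The framing step above can be sketched as follows; the function name is illustrative, not from the patent, and the sketch assumes the stated 2048-sample frames with a 1024-sample hop.

```python
import numpy as np

FRAME_LEN = 2048  # samples per frame (the patent's ~125 ms frame)
HOP = 1024        # 50% overlap: each new frame advances half a frame

def frame_signal(samples: np.ndarray) -> np.ndarray:
    """Split a 1-D sample array into overlapping frames of FRAME_LEN samples."""
    n_frames = max(0, (len(samples) - FRAME_LEN) // HOP + 1)
    if n_frames == 0:
        return np.empty((0, FRAME_LEN), dtype=samples.dtype)
    return np.stack([samples[i * HOP : i * HOP + FRAME_LEN] for i in range(n_frames)])

signal = np.arange(16000, dtype=np.int16)  # one second of dummy samples at 16 kHz
frames = frame_signal(signal)              # each frame starts 1024 samples after the previous one
```

On an actual MCU the same layout is usually achieved with a circular buffer rather than copying, but the frame/hop geometry is identical.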

Endpoint detection is performed on the voice data frames in order to skip silent frames and frames that are determinably not target speech (the targets being coughs and sneezes), so as to locate the starting frame of candidate target speech. The endpoint detection adopts a three-threshold method, namely:

(1) the average energy of the frame's samples is greater than threshold 1, and

(2) the frame's zero-crossing rate (the percentage of frame sample values greater than zero) is greater than threshold 2, and

(3) the average energy difference between the frame and the previous frame (the absolute value of the difference between the two frames' average energies) is greater than threshold 3.

Note: threshold 1 performs absolute-silence filtering; threshold 2 performs relative-silence filtering; threshold 3 targets the abrupt energy changes characteristic of coughs and sneezes and filters out smoother normal sounds such as background noise, speech, and music.

If the frame satisfies all three thresholds, the frame's Frame Mute Flag (FMF) is set to 1 (true); otherwise the FMF is set to 0 (false).

Update the Recognition Activated Flag (RAF), whose initial value is 0 (not activated): RAF = RAF | FMF.

Check the RAF: if the RAF is false, the recognition process is not activated, so the current frame is discarded and processing jumps back to the beginning to continue voice sampling;

if the RAF is true, the recognition process is activated and the current frame is characterized (even if the frame's FMF is 0), yielding a frame feature vector of 20 values; the characterization flow is described with reference to FIG. 2 and the related description.
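
The three-threshold test and the FMF/RAF bookkeeping described above can be sketched as follows. The threshold values are illustrative assumptions (the source does not specify them), and the average-energy and zero-crossing formulas follow the definitions given in the text.

```python
import numpy as np

def frame_mute_flag(frame, prev_frame, thr1=500.0, thr2=0.4, thr3=200.0):
    """Three-threshold endpoint test for one frame; thr1/thr2/thr3 are
    placeholder values, not values taken from the patent."""
    energy = np.mean(np.abs(frame))                    # (1) average frame energy
    zcr = np.mean(frame > 0)                           # (2) share of samples > 0, per the definition above
    delta = abs(energy - np.mean(np.abs(prev_frame)))  # (3) energy jump vs. previous frame
    return int(energy > thr1 and zcr > thr2 and delta > thr3)

raf = 0  # Recognition Activated Flag, initially 0 (not activated)
quiet = np.zeros(2048)
loud = np.full(2048, 1000.0)
fmf = frame_mute_flag(loud, quiet)  # a loud onset frame trips all three thresholds
raf = raf | fmf                     # RAF = RAF | FMF: stays set once activated
```

Because the update is a bitwise OR, the RAF remains 1 for later frames whose FMF is 0, which is why those frames are still characterized above.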

Adding the frame feature vector to the tail of the feature vector queue;

if the length of the feature vector queue reaches the recognizable length (RecoLen), feeding the feature vector queue (a RecoLen × 20 feature matrix) into a machine learning model trained in advance for recognition; otherwise, jumping to the beginning to continue voice sampling.

Note: the "machine learning model" includes but is not limited to a two-dimensional convolutional neural network (2D CNN), a long short-term memory network (LSTM), a random forest (RF), etc.; its training process is described with reference to FIG. 3 and the related description.

The recognizable length RecoLen is one dimension of the machine learning model's two-dimensional input sample, indicating how many data frames one input sample contains. RecoLen takes values between 20 and 32 frames, corresponding to about 1.25-2 seconds of speech data, which is essentially the time window of a single cough or sneeze.
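
A quick arithmetic check of the RecoLen range, under the framing parameters given earlier (1024-sample hop at 16 kHz, so each new frame advances 64 ms; the extra length of the first frame is ignored here):

```python
HOP_SECONDS = 1024 / 16000  # seconds advanced per frame at the stated rate
for reco_len in (20, 32):
    # RecoLen frames cover roughly RecoLen * 64 ms of audio
    print(reco_len, "frames ->", round(reco_len * HOP_SECONDS, 3), "s")
```

20 frames come to roughly 1.28 s and 32 frames to roughly 2.05 s, consistent with the "about 1.25-2 seconds" stated above.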

If the confidence level (CL) of the recognition result exceeds the recognition threshold set by the system, it is considered that one cough or sneeze has been effectively detected: it is counted, the recognition result is output, the feature vector queue is emptied, and the RAF is set to 0. Processing then jumps to the beginning to start a new recognition process.

Note: the "output identification result" includes but is not limited to sending a message, driving an indicator light, an alarm device, a display or other peripheral devices.

If the confidence level (CL) of the recognition result does not exceed the recognition threshold set by the system, it is considered that no cough or sneeze has been effectively detected, and the feature vector queue undergoes acceleration processing according to the specific value of the CL. This specifically comprises:

(1) acceleration 1: removing the leading (100% - CL) proportion of frames from the feature vector queue; for example, assuming RecoLen is 20, if recognition yields a CL of 60%, the leading 40% of the frames, i.e. 8 frames, are removed from the queue;

(2) acceleration 2: the first frame whose FMF is 1 (true) is found among the remaining frames in the feature vector queue, and all frames before it are discarded. If no frame with FMF = 1 (true) is found, the feature vector queue is emptied and the RAF is set to 0.
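
The two acceleration steps can be sketched as one function; the names are illustrative, and plain lists stand in for the on-device queue.

```python
def accelerate_queue(queue, fmfs, cl, reco_len):
    """Feature-vector-queue acceleration after a low-confidence result.

    queue -- list of frame feature vectors
    fmfs  -- the matching Frame Mute Flags, one per queued frame
    cl    -- confidence level of the failed recognition, in [0, 1]
    Returns (queue, fmfs, raf)."""
    # Acceleration 1: drop the leading (100% - CL) share of the frames.
    drop = round(reco_len * (1.0 - cl))
    queue, fmfs = queue[drop:], fmfs[drop:]
    # Acceleration 2: keep only frames from the first FMF == 1 frame onward.
    for i, flag in enumerate(fmfs):
        if flag == 1:
            return queue[i:], fmfs[i:], 1
    return [], [], 0  # no active frame left: empty the queue, RAF = 0

# The example above: RecoLen = 20, CL = 60% -> the leading 8 frames go.
q = [[float(i)] for i in range(20)]
flags = [0] * 10 + [1] + [0] * 9
q2, f2, raf = accelerate_queue(q, flags, 0.60, 20)
```

Compared with clearing the queue outright, this keeps the most recent, still-plausible frames, so a cough that straddles two recognition windows is not lost.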

And after the acceleration processing is finished, jumping to the beginning, and starting a new identification process.

The whole processing flow above is the "operation mode" of the method; together with a "training mode" and a "collection mode" there are three working modes in total, controlled by system parameters; their specific purposes and characteristics are shown in Table 1.

If the device works in a training mode, the frame feature vectors need to be reported to a server or a cloud platform while being enqueued;

if the system works in the acquisition mode, the framed voice data needs to be uploaded to a server or a cloud platform.

Table 1: comparison of three working modes of the method

The voice data frame characterization processing flow comprises:

A frame of input voice data (125 ms, 2048 sample values) undergoes time-domain characterization and frequency-domain characterization, respectively.

Time-domain characterization: based on the instantaneous amplitude changes characteristic of cough and sneeze sounds, three feature values are calculated, including:

(1) the frame's sampling fluctuation value = maximum sample value - minimum sample value;

(2) the energy difference between the current frame and the previous frame = abs(average of the current frame's samples - average of the previous frame's samples), where abs is the absolute-value function;

(3) the energy variance of the frame's slices represents the energy fluctuation within the frame. Specifically, the 2048 sample values of the frame are divided evenly into several slices (4 to 10), and the variance is computed using the standard-deviation formula.
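
The three time-domain values can be sketched as follows; n_slices = 8 is one arbitrary choice within the 4-10 range given above, and the spread of slice energies is taken with the standard-deviation formula as the text describes.

```python
import numpy as np

def time_domain_features(frame, prev_frame, n_slices=8):
    """The three time-domain feature values described above (sketch)."""
    fluctuation = float(frame.max() - frame.min())              # (1) max sample - min sample
    energy_diff = abs(float(frame.mean() - prev_frame.mean()))  # (2) |mean(current) - mean(previous)|
    slices = np.abs(frame.reshape(n_slices, -1)).mean(axis=1)   # per-slice average energy
    energy_var = float(np.std(slices))                          # (3) spread of slice energies
    return [fluctuation, energy_diff, energy_var]

prev = np.zeros(2048)
burst = np.zeros(2048)
burst[:256] = 1000.0  # an abrupt onset concentrated in the first slice
feats = time_domain_features(burst, prev)
```

All three values spike for a sharp onset like a cough, while staying small for smooth speech or steady background noise, which is exactly what the three-threshold filter exploits.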

Frequency-domain characterization comprises two parts. The first part is the Mel-Frequency Cepstral Coefficients (MFCC), standard for frequency-domain analysis of voice signals and consisting mainly of a Fast Fourier Transform (FFT), a mel-frequency filter bank, and a Discrete Cosine Transform (DCT). Since this is a standard method, it is not described again here. Specifically, a 16-band mel-frequency filter bank is adopted, yielding 16 feature values;

the second part of the frequency-domain characterization takes the 16 feature values of the first part and computes the band energy variance using the standard-deviation formula, yielding one additional feature value.
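
How the 20 values fit together can be sketched as follows (3 time-domain + 16 mel-band + 1 band variance). The inputs are placeholders standing in for the real time-domain and MFCC outputs; the full MFCC chain itself is the standard method referenced above and is not re-implemented here.

```python
import numpy as np

def assemble_feature_vector(time_feats, mel_feats):
    """Combine 3 time-domain values with 16 mel-band values and their
    spread (the 'band energy variance' above, here via np.std) into the
    20-value frame feature vector."""
    assert len(time_feats) == 3 and len(mel_feats) == 16
    band_var = float(np.std(mel_feats))  # 17th frequency-domain value
    return list(time_feats) + list(mel_feats) + [band_var]

vec = assemble_feature_vector([1.0, 2.0, 3.0], [float(i) for i in range(16)])
```

Keeping the vector at 20 values per frame is what lets a RecoLen × 20 sample fit comfortably in MCU memory.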

The training process of the cough and sneeze machine learning model comprises the following steps:

the training process is divided into off-line training and on-line training, and can be used independently or cooperatively;

the off-line training can obtain voice data from external sources, or the MCU recognition device can be set to collection mode to collect raw voice data;

preprocessing the voice data by dividing it into segments whose length equals RecoLen frames; the preprocessing can be done manually or with dedicated voice-file processing software;

classifying and labeling voice files, comprising: coughing, sneezing, and others;

extracting the feature vector queue of each voice segment using the framing shown in the recognition flow of FIG. 1 and the characterization method shown in FIG. 2; if the length is less than RecoLen, padding with zero vectors, and if it exceeds RecoLen, truncating;

on a server or cloud platform, feeding the feature values and labels into the model in batches for training and validation;

importing the satisfactorily trained model into the MCU recognition device and updating the recognition model.

When online learning is carried out, the operation mode of the MCU recognition device is set to training mode so as to directly obtain the feature vectors of the voice data frames;

the feature vectors are uploaded online to a server or cloud platform;

the server or cloud platform treats every RecoLen consecutive feature vectors as one training sample;

if an MCU recognition result is received, a new training sample is started; if the previous sample's length is less than RecoLen, it is padded with zero vectors;
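
The server-side grouping of uploaded frame vectors into RecoLen-sized samples can be sketched as follows; the function name and the result_marks representation (indices at which a recognition result arrived) are illustrative assumptions.

```python
def cut_training_samples(stream, reco_len, result_marks):
    """Group uploaded frame feature vectors into RecoLen-sized training
    samples. A recognition result at index i (listed in result_marks)
    closes the current sample early; short samples are padded with
    zero vectors, as described above."""
    dim = len(stream[0]) if stream else 0
    samples, current = [], []
    for i, vec in enumerate(stream):
        current.append(vec)
        if len(current) == reco_len or i in result_marks:
            current += [[0.0] * dim] * (reco_len - len(current))  # zero-vector fill
            samples.append(current)
            current = []
    return samples

# Five uploaded vectors, RecoLen = 3, a recognition result arriving at index 3:
samples = cut_training_samples([[1.0]] * 5, 3, {3})
```

Closing a sample on each recognition result keeps the online training samples aligned with the windows the MCU actually classified.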

Meanwhile, the samples are manually labeled online with the classes: coughing, sneezing, and others;

the existing model is incrementally optimized with newly obtained training samples using a transfer learning method; transfer learning is a publicly available method and toolset in the machine learning field and is not described again here.

The recognition results of the optimized model can be compared with those of the existing model to evaluate the optimization effect;

and the satisfactorily trained model is imported into the MCU recognition device, updating the recognition model.

An application scenario of the present invention is given below:
