Method and device for audio discrimination by using audio discrimination model

Document No.: 1906568    Publication date: 2021-11-30

Note: This technology, "Method and device for audio discrimination by using audio discrimination model," was created by Yan Yonghong, Zhang Xueshuai, and Zhang Pengyuan on 2021-08-30. Its main content is as follows: The embodiments of this specification provide a method and a device for audio discrimination using an audio discrimination model. The method determines the probability that cough audio within collected audio belongs to novel coronavirus pneumonia (COVID-19). One specific implementation of the method comprises: first, obtaining multiple frames of cough audio to be discriminated from the collected audio, and extracting a feature vector from each frame. Next, extracting information from the feature vectors of the multiple frames using at least one first time-delay neural network to obtain audio information. Then, extracting multi-dimensional information from the audio information along multiple dimensions using at least one residual time-delay neural network, and obtaining fixed-length audio features from the multi-dimensional information using at least one second time-delay neural network. Finally, feeding the fixed-length audio features into a fully connected layer to obtain the probability that the cough audio to be discriminated belongs to COVID-19.

1. A method for audio discrimination using an audio discrimination model, for determining the probability that cough audio within collected audio belongs to novel coronavirus pneumonia; the audio discrimination model comprises at least one first time-delay neural network, at least one second time-delay neural network, at least one residual time-delay neural network, and a fully connected layer, and the method comprises:

acquiring multiple frames of cough audio to be discriminated from the collected audio;

extracting a feature vector from each frame of cough audio to be discriminated;

extracting information from the feature vectors of the multiple frames of cough audio to be discriminated by using the at least one first time-delay neural network, to obtain audio information;

extracting multi-dimensional information from the audio information along multiple dimensions by using the at least one residual time-delay neural network;

obtaining fixed-length audio features from the multi-dimensional information by using the at least one second time-delay neural network; and

inputting the audio features into the fully connected layer to obtain the probability that the cough audio to be discriminated belongs to novel coronavirus pneumonia.

2. The method of claim 1, wherein each of the at least one residual time-delay neural network comprises a squeeze-excitation module and at least one time-delay neural network, wherein the squeeze-excitation module comprises a first linear layer, a first activation function, a second linear layer, and a second activation function; and

each of the at least one residual time-delay neural network processes its input information as follows:

reducing the dimensionality of the information extracted by the at least one time-delay neural network by using the first linear layer, so as to remove information common to novel-coronavirus-pneumonia cough sounds and non-novel-coronavirus-pneumonia cough sounds;

raising the dimensionality of the output of the first activation function by using the second linear layer, so as to increase the number of network parameters; and

multiplying the output of the second activation function by the output of the last time-delay neural network in the residual time-delay neural network, weighting the product together with the input information of the residual time-delay neural network, and taking the weighted result as the output of the residual time-delay neural network.

3. The method of claim 1, wherein the at least one first time-delay neural network comprises two first time-delay neural networks; the at least one second time-delay neural network comprises two second time-delay neural networks; and the at least one residual time-delay neural network comprises three residual time-delay neural networks.

4. The method of claim 1, wherein the method further comprises:

outputting the probability, to assist a user in judging whether the speaker of the cough audio to be discriminated is a patient with novel coronavirus pneumonia.

5. The method of claim 1, wherein the audio discrimination model is trained by:

acquiring a sample set, wherein the samples in the sample set comprise positive samples and negative samples, each positive sample comprising a feature vector of cough audio corresponding to novel coronavirus pneumonia together with the probability value 1, and each negative sample comprising a feature vector of cough audio not corresponding to novel coronavirus pneumonia together with the probability value 0; and

taking the feature vectors of the samples as input and the probability values corresponding to the input feature vectors as expected output, and training to obtain the audio discrimination model.

6. The method of claim 1, wherein the acquiring multiple frames of cough audio to be discriminated from the collected audio comprises:

preprocessing audio collected by an audio collection device to obtain processed audio;

determining, by using a pre-trained cough sound discrimination model, whether the processed audio comprises cough sound audio; and

in response to determining that the processed audio comprises cough sound audio, extracting the cough sound audio as the cough audio to be discriminated.

7. The method of claim 1, wherein the extracting a feature vector from each frame of cough audio to be discriminated comprises:

extracting Mel-frequency cepstral coefficients from the cough audio to be discriminated as the feature vector.

8. A device for audio discrimination using an audio discrimination model, for determining the probability that cough audio within collected audio belongs to novel coronavirus pneumonia; the device comprises:

an acquisition module configured to acquire multiple frames of cough audio to be discriminated from the collected audio;

an extraction module configured to extract a feature vector from each frame of cough audio to be discriminated;

a first time-delay neural network module, comprising at least one first time-delay neural network, configured to receive the feature vectors output by the extraction module and extract information from the feature vectors of the multiple frames of cough audio to be discriminated, to obtain audio information;

a residual time-delay neural network module, comprising at least one residual time-delay neural network, configured to receive the audio information output by the first time-delay neural network module and extract multi-dimensional information from the audio information along multiple dimensions;

a second time-delay neural network module, comprising at least one second time-delay neural network, configured to receive the multi-dimensional information output by the residual time-delay neural network module and obtain fixed-length audio features from the multi-dimensional information; and

a fully connected layer module configured to receive the fixed-length audio features output by the second time-delay neural network module and obtain the probability that the cough audio to be discriminated belongs to novel coronavirus pneumonia.

9. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.

10. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, implements the method of any of claims 1-7.

Technical Field

The embodiments of this specification relate to the field of computer technology, and in particular to a method and a device for audio discrimination using an audio discrimination model.

Background

With the advancement of data processing technology, computer technology has been widely adopted in many fields, including audio processing. Various kinds of information can be obtained by analyzing audio produced by humans; for example, the identity of a speaker can be determined through voiceprint recognition. Cough sounds, as a physiological manifestation of many diseases, differ in character from disease to disease, so doctors can obtain information about a patient's condition by analyzing the patient's cough. For highly infectious diseases with high fatality rates, such as novel coronavirus pneumonia (COVID-19), large amounts of manpower and material resources are required for disease detection in order to prevent their spread. COVID-19 patients generally develop a cough during illness, and cough sounds are relatively easy to collect. If the probability that a person is a COVID-19 patient could be generated automatically from cough sounds and displayed to assist the user in further diagnosis, it would greatly help contain the spread of the disease.

Disclosure of Invention

The embodiments of this specification describe a method and a device for audio discrimination using an audio discrimination model. The method determines the probability that a cough sound in collected audio belongs to COVID-19, and the audio discrimination model is constructed and trained based on the characteristics of COVID-19 cough sounds, so that the probability output by the model is more accurate.

According to a first aspect, a method for audio discrimination using an audio discrimination model is provided, for determining the probability that cough audio within collected audio belongs to novel coronavirus pneumonia. The audio discrimination model includes at least one first time-delay neural network, at least one second time-delay neural network, at least one residual time-delay neural network, and a fully connected layer, and the method includes: acquiring multiple frames of cough audio to be discriminated from the collected audio; extracting a feature vector from each frame of cough audio to be discriminated; extracting information from the feature vectors of the multiple frames using the at least one first time-delay neural network, to obtain audio information; extracting multi-dimensional information from the audio information along multiple dimensions using the at least one residual time-delay neural network; obtaining fixed-length audio features from the multi-dimensional information using the at least one second time-delay neural network; and inputting the audio features into the fully connected layer to obtain the probability that the cough audio to be discriminated belongs to novel coronavirus pneumonia.

In one embodiment, each of the at least one residual time-delay neural network comprises a squeeze-excitation module and at least one time-delay neural network, wherein the squeeze-excitation module comprises a first linear layer, a first activation function, a second linear layer, and a second activation function. Each residual time-delay neural network processes its input information as follows: reducing the dimensionality of the information extracted by the at least one time-delay neural network using the first linear layer, so as to remove information common to COVID-19 and non-COVID-19 cough sounds; raising the dimensionality of the output of the first activation function using the second linear layer, so as to increase the number of network parameters; and multiplying the output of the second activation function by the output of the last time-delay neural network in the residual block, weighting the product together with the block's input information, and taking the weighted result as the output of the residual time-delay neural network.

In one embodiment, the at least one first time-delay neural network includes two first time-delay neural networks; the at least one second time-delay neural network includes two second time-delay neural networks; and the at least one residual time-delay neural network includes three residual time-delay neural networks.

In one embodiment, the method further comprises: outputting the probability to assist the user in judging whether the speaker of the cough audio to be discriminated is a COVID-19 patient.

In one embodiment, the audio discrimination model is trained by: acquiring a sample set whose samples comprise positive and negative samples, each positive sample comprising a feature vector of COVID-19 cough audio together with the probability value 1, and each negative sample comprising a feature vector of non-COVID-19 cough audio together with the probability value 0; and taking the feature vectors of the samples as input and the corresponding probability values as expected output, and training to obtain the audio discrimination model.

In one embodiment, the acquiring multiple frames of cough audio to be discriminated from the collected audio includes: preprocessing audio collected by an audio collection device to obtain processed audio; determining, using a pre-trained cough sound discrimination model, whether the processed audio includes cough sound audio; and in response to determining that the processed audio includes cough sound audio, extracting the cough sound audio as the cough audio to be discriminated.

In one embodiment, the extracting a feature vector from each frame of cough audio to be discriminated includes: extracting Mel-frequency cepstral coefficients from the cough audio to be discriminated as the feature vector.

According to a second aspect, a device for audio discrimination using an audio discrimination model is provided, for determining the probability that cough audio within collected audio belongs to novel coronavirus pneumonia. The device includes: an acquisition module configured to acquire multiple frames of cough audio to be discriminated from audio collected by an audio collection device; an extraction module configured to extract a feature vector from each frame of cough audio to be discriminated; a first time-delay neural network module, comprising at least one first time-delay neural network, configured to receive the feature vectors output by the extraction module and extract information from the feature vectors of the multiple frames to obtain audio information; a residual time-delay neural network module, comprising at least one residual time-delay neural network, configured to receive the audio information output by the first time-delay neural network module and extract multi-dimensional information from it along multiple dimensions; a second time-delay neural network module, comprising at least one second time-delay neural network, configured to receive the multi-dimensional information output by the residual time-delay neural network module and obtain fixed-length audio features from it; and a fully connected layer module configured to receive the fixed-length audio features output by the second time-delay neural network module and obtain the probability that the cough audio to be discriminated belongs to novel coronavirus pneumonia.

According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed in a computer, causes the computer to perform the method of any implementation of the first aspect.

According to a fourth aspect, a computing device is provided, comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method of any implementation of the first aspect.

With the method and device for audio discrimination provided by the embodiments of this specification, multiple frames of cough audio to be discriminated are first acquired from the collected audio, and a feature vector is extracted from each frame. Then, at least one first time-delay neural network extracts information from the feature vectors of the multiple frames to obtain audio information. Next, at least one residual time-delay neural network extracts multi-dimensional information from the audio information along multiple dimensions, and at least one second time-delay neural network obtains fixed-length audio features from the multi-dimensional information. Finally, the fixed-length audio features are fed into a fully connected layer to obtain the probability that the cough audio to be discriminated belongs to COVID-19. In this way, the probability is generated automatically from the cough audio itself, assisting the user in diagnosing COVID-19.

Drawings

FIG. 1 shows a flow diagram of a method for audio discrimination using an audio discrimination model according to one embodiment;

FIG. 2 illustrates a schematic structural diagram of one implementation of an audio discrimination model;

FIG. 3 shows a schematic structural diagram of one implementation of a residual delay neural network;

FIG. 4 shows a schematic block diagram of an apparatus for audio discrimination using an audio discrimination model according to one embodiment.

Detailed Description

The technical solutions provided in the present specification are described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. It should be noted that the embodiments and features of the embodiments in the present specification may be combined with each other without conflict.

FIG. 1 illustrates a flow diagram of a method for audio discrimination using an audio discrimination model according to one embodiment. It is to be appreciated that the method can be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities.

The method for audio discrimination using an audio discrimination model shown in FIG. 1 is used to determine the probability that cough audio belongs to COVID-19. The audio discrimination model used by the method comprises at least one first Time-Delay Neural Network (TDNN), at least one second time-delay neural network, at least one Residual Time-Delay Neural Network (RES-TDNN), and a fully connected layer.

In one implementation, the at least one first time-delay neural network may include two first time-delay neural networks, the at least one second time-delay neural network may include two second time-delay neural networks, and the at least one residual time-delay neural network may include three residual time-delay neural networks. The audio discrimination model constructed in this way can be as shown in FIG. 2, which gives a schematic structural diagram of the model in this implementation.

In practice, analysis shows that, compared with the cough audio of a common cold, COVID-19 cough audio exhibits intermittent vibration and a longer cough duration. The audio discrimination model of this implementation therefore uses two first time-delay neural networks to extract audio information over a longer time span. In addition, the energy of a COVID-19 patient's cough is generally concentrated in the low- and mid-frequency regions, and the cough is typically hoarse and dry, often accompanied by the sound of phlegm. Accordingly, three residual time-delay neural networks are cascaded after the first time-delay neural networks to extract multi-dimensional information from the cough sound and enrich its overall representation. Two second time-delay neural network layers are then cascaded after the residual time-delay neural networks to obtain robust, fixed-length audio features. Finally, a fully connected layer containing hidden nodes is attached after the second time-delay neural networks, with a softmax function as the activation function, to determine the probability that the cough sound belongs to a COVID-19 patient. Because this implementation builds the audio discrimination model on an analysis of the cough characteristics of COVID-19 patients, the model better matches those characteristics and its output is more accurate.
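A time-delay neural network layer operates on spliced context windows of consecutive frames, which makes it effectively a dilated 1-D convolution over the frame axis. The following is a minimal numpy sketch of one such layer; the feature dimension, hidden size, and context offsets are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def tdnn_layer(x, w, b, context):
    """Minimal TDNN layer: splice frames at the given context offsets,
    then apply an affine transform followed by ReLU.
    x: (T, D) frame features; w: (len(context)*D, H); b: (H,)."""
    T, D = x.shape
    lo, hi = -min(context), max(context)
    out = []
    for t in range(lo, T - hi):
        spliced = np.concatenate([x[t + c] for c in context])  # (len(context)*D,)
        out.append(np.maximum(0.0, spliced @ w + b))           # affine + ReLU
    return np.stack(out)                                       # (T - lo - hi, H)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 24))           # 100 frames of 24-dim features
w = rng.standard_normal((3 * 24, 32)) * 0.1  # context of 3 frames -> 32 hidden units
h = tdnn_layer(x, w, np.zeros(32), context=[-1, 0, 1])
print(h.shape)  # (98, 32): two frames lost to the context window
```

Stacking such layers with widening context offsets is what lets the two first time-delay networks "extract audio information over more time," as described above.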

In one implementation, each of the at least one residual time-delay neural network may include a Squeeze-and-Excitation (SE) module and at least one time-delay neural network, where the squeeze-excitation module may include a first linear layer, a first activation function, a second linear layer, and a second activation function. As an example, FIG. 3 shows a schematic structural diagram of a residual time-delay neural network comprising a squeeze-excitation module (shown as a dashed box) and three time-delay neural networks. In the example of FIG. 3, the three time-delay neural networks are connected in series: the output of the last time-delay neural network is the input of the first linear layer, the output of the first linear layer is the input of the first activation function, the output of the first activation function is the input of the second linear layer, the output of the second linear layer is the input of the second activation function, and the output of the second activation function is multiplied by the output of the last time-delay neural network. Finally, the product and the input of the whole residual time-delay neural network form a residual connection. It should be understood that FIG. 3 is only illustrative and does not limit the number of time-delay neural networks in the residual block, which may be chosen according to actual needs.

In this implementation, each residual time-delay neural network may process its input information as follows. First, the first linear layer reduces the dimensionality of the information extracted by the at least one time-delay neural network, so as to remove information common to COVID-19 and non-COVID-19 cough sounds. Then, the second linear layer raises the dimensionality of the output of the first activation function, so as to increase the number of network parameters. Finally, the output of the second activation function is multiplied by the output of the last time-delay neural network in the block, the product is weighted together with the block's input information, and the weighted result is taken as the output of the residual time-delay neural network.

Practical analysis shows that COVID-19 patients usually have a predominantly dry cough in the early stage of illness, with the duration and frequency of coughing gradually increasing as the illness worsens. This implementation therefore uses several (for example, three) time-delay neural networks within the residual block, so as to enlarge the receptive field as much as possible and exploit more of the cough audio's contextual information. Furthermore, COVID-19 cough audio has some typical characteristics, such as a hoarse, dry cough sound with intermittent vibration, so the squeeze-excitation module is attached after the time-delay neural networks to extract the characteristic information of COVID-19 cough audio along multiple dimensions. In practice, the squeeze-excitation module may include a first linear layer, a first activation function, a second linear layer, and a second activation function, with the network structure shown by the dashed box in FIG. 3. As an example, the first linear layer may use a smaller number of hidden nodes to reduce the dimensionality of the audio information, removing information common to COVID-19 coughs and the coughs of other patients. The second linear layer raises the dimensionality of the information; its purpose is to increase the parameter count of the whole network, improving its modeling and discrimination capability and thus the accuracy of cough-audio modeling.
In this implementation, the second linear layer may use a sigmoid function as the second activation function to obtain weighted excitation coefficients, which are multiplied by the output of the last time-delay neural network to obtain the audio information of each dimension of the cough sound. Finally, the product and the input of the whole residual time-delay neural network form a residual connection, which helps the network preserve the overall spectral information of the cough sound while reducing its dimensionality. Because the residual time-delay neural network is constructed on an analysis of COVID-19 cough characteristics, it better matches those characteristics, the information it extracts is more accurate, and so is the output of the audio discrimination model.
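The squeeze-excitation processing described above can be sketched in numpy as follows. This is a minimal illustration, not the patent's implementation: here `x` plays the role of the last time-delay network's output, mean pooling is assumed as the "squeeze" statistic, and the bottleneck width and residual weighting coefficient `alpha` are illustrative assumptions (the patent does not give concrete values).

```python
import numpy as np

def se_residual_block(x, w1, b1, w2, b2, alpha=0.5):
    """Squeeze-excitation over a (T, C) TDNN output:
    reduce -> ReLU -> expand -> sigmoid -> channel-wise rescale,
    then a weighted residual combination with the block input."""
    z = x.mean(axis=0)                          # squeeze: per-channel statistic, (C,)
    s = np.maximum(0.0, z @ w1 + b1)            # first linear layer + ReLU (reduce dim)
    e = 1.0 / (1.0 + np.exp(-(s @ w2 + b2)))    # second linear layer + sigmoid, (C,)
    scaled = x * e                              # excitation: reweight each channel
    return alpha * scaled + (1.0 - alpha) * x   # weighted residual connection

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 64))               # 50 frames, 64 channels
w1 = rng.standard_normal((64, 16)) * 0.1        # bottleneck: 64 -> 16
w2 = rng.standard_normal((16, 64)) * 0.1        # expand back: 16 -> 64
y = se_residual_block(x, w1, np.zeros(16), w2, np.zeros(64))
print(y.shape)  # (50, 64): same shape as the input, channels reweighted
```

The bottleneck (64 → 16 → 64) mirrors the dimensionality-reduction and dimensionality-raising roles of the two linear layers described in the text.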

As an example, in actual use, the first activation function may be the ReLU activation function and the second activation function may be the sigmoid function. The ReLU activation function is:

ReLU(x) = max(0, x)

The sigmoid function is:

s(x) = 1 / (1 + e^(-x))

The value of s(x) lies between 0 and 1, and the sigmoid function has good symmetry; it may therefore be selected as the classifier in the audio discrimination model.

The tanh activation function in the network is:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

The BatchNorm function in the network is:

y = γ · (x − μ) / √(σ² + ε) + β

where μ and σ² are the mean and variance of x over the current batch, ε is a small constant for numerical stability, and γ and β are learnable parameter vectors.
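The functions above can be checked numerically with a few lines of numpy. The BatchNorm here is the standard batch-statistics form, with ε as the usual numerical-stability constant; batch size and feature width are arbitrary.

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift
    with the learnable vectors gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

print(relu(np.array([-2.0, 0.0, 2.0])))  # [0. 0. 2.]
print(sigmoid(0))                        # 0.5, the midpoint of the (0, 1) range
batch = np.random.default_rng(2).standard_normal((8, 4))
y = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # True: zero mean per feature
```

With gamma = 1 and beta = 0 the normalized output has zero mean and (approximately) unit variance per feature, which is what makes the subsequent scale/shift parameters meaningful to learn.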

In one implementation, the audio discrimination model may be trained, by an apparatus for training the audio discrimination model, as follows:

first, a sample set is obtained.

In this implementation, the apparatus for training the audio discrimination model may acquire a sample set. The samples in the sample set may include positive samples and negative samples: each positive sample may include a feature vector of cough audio corresponding to COVID-19 together with the probability value 1, and each negative sample may include a feature vector of cough audio not corresponding to COVID-19 together with the probability value 0.

Then, the feature vectors of the samples are taken as input and the probability values corresponding to the input feature vectors are taken as expected output, and the audio discrimination model is obtained by training.

In this implementation, the feature vector of a sample may be fed into the audio discrimination model to obtain a predicted probability value, and the model may be trained by machine learning with the probability value corresponding to the input feature vector as the expected output. For example, the difference between the predicted probability value and the expected output may first be computed with a preset loss function. The network parameters of the audio discrimination model may then be adjusted based on the computed difference, and training ends when a preset stopping condition is met. The preset stopping condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset count; the computed difference falls below a preset threshold; and so on. Various methods may be used to adjust the network parameters based on the difference between the predicted probability value and the expected output; for example, the BP (Back Propagation) algorithm or the SGD (Stochastic Gradient Descent) algorithm may be used.
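The training procedure above can be illustrated with a deliberately simplified stand-in: a logistic-regression "model" trained on a toy sample set with binary cross-entropy and gradient descent. The real audio discrimination model is the TDNN stack described earlier; this sketch only shows the positive/negative labeling (1 for COVID-19 cough samples, 0 otherwise) and the SGD-style parameter update. All sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np

def train_discriminator(feats, labels, dim, lr=0.1, epochs=200):
    """Simplified training loop: logistic regression trained with
    binary cross-entropy and full-batch gradient descent."""
    rng = np.random.default_rng(0)
    w, b = rng.standard_normal(dim) * 0.01, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # predicted probabilities
        grad = p - labels                            # dBCE/dlogit
        w -= lr * feats.T @ grad / len(labels)       # gradient-descent update
        b -= lr * grad.mean()
    return w, b

# Toy sample set: positive-class features shifted away from negative-class ones.
rng = np.random.default_rng(3)
pos = rng.standard_normal((40, 8)) + 1.0   # stand-in "COVID-19 cough" vectors
neg = rng.standard_normal((40, 8)) - 1.0   # stand-in "non-COVID-19 cough" vectors
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(40), np.zeros(40)])
w, b = train_discriminator(X, y, dim=8)
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print(((p > 0.5) == y).mean())  # training accuracy; near 1.0 on this separable toy data
```

The `grad` line is the point of the sketch: for a sigmoid output with binary cross-entropy loss, the gradient with respect to the logit reduces to (predicted probability − label), which is what makes the 0/1 probability labels in the sample set directly usable as training targets.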

In this implementation, training the model with both positive and negative samples makes the output of the trained audio discrimination model more accurate.

Returning to FIG. 1, the method for audio discrimination using an audio discrimination model may include the following steps:

step 101, acquiring multiple frames of to-be-determined cough audio from the acquired audio.

In this embodiment, the apparatus executing the method may acquire multiple frames of cough audio to be discriminated from audio collected by an audio collection device (e.g., a microphone). As an example, various methods may be used to analyze the collected audio, identify which audio frames contain cough sounds, and take those frames as the cough audio to be discriminated.

In one implementation, step 101 may be performed as follows:

First, preprocess the audio collected by the audio collection device to obtain processed audio. For example, denoising and silence removal may be applied to the collected audio.

Then, determine whether the processed audio contains cough sounds using a pre-trained cough sound discrimination model. This model is used to identify which audio frames contain cough sounds; as an example, it may be a binary classification model.

Finally, in response to determining that the processed audio contains cough sounds, extract those frames as the cough audio to be discriminated. In this way, multiple frames of cough audio to be discriminated can be acquired from the collected audio for subsequent processing. The influence of noise, silence, non-cough audio and the like is removed, making the subsequent processing results more accurate.
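The silence-removal preprocessing mentioned above is commonly implemented with a simple short-time energy gate; the frame length and energy threshold below are illustrative assumptions, not values specified by this document:

```python
def remove_silence(samples, frame_len=400, threshold=1e-3):
    """Drop frames whose mean squared amplitude falls below a threshold.

    samples: list of floats in [-1, 1]; returns the retained frames.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [f for f in frames if sum(x * x for x in f) / frame_len >= threshold]

# toy signal: a stretch of silence followed by a loud burst
signal = [0.0] * 800 + [0.5, -0.5] * 400
voiced = remove_silence(signal)  # only the burst frames survive
```

A trained cough sound discrimination model would then classify each surviving frame; the energy gate only cheaply discards the silent portions first.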

And 102, extracting a characteristic vector from each frame of cough audio to be distinguished.

In this embodiment, for each frame of the multiple frames of cough audio to be discriminated acquired in step 101, a feature vector may be extracted from that frame. Here, the feature vector may be any of various audio-related feature vectors.

In one implementation, step 102 may be performed as follows: Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from each frame of cough audio to be discriminated as the feature vector.
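For illustration, the standard MFCC pipeline (window, magnitude spectrum, mel filterbank, log, DCT) can be sketched in pure Python; in practice a library routine would be used, and the sample rate, filter count, and coefficient count below are illustrative assumptions:

```python
import math

def mfcc_frame(frame, sr=16000, n_filt=20, n_mfcc=13):
    """Simplified single-frame MFCC: window -> |DFT| -> mel filterbank -> log -> DCT."""
    n = len(frame)
    # Hamming window to reduce spectral leakage
    win = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
           for i, x in enumerate(frame)]
    # magnitude spectrum via a naive DFT (first n // 2 + 1 bins)
    nbins = n // 2 + 1
    mag = []
    for k in range(nbins):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(win))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(win))
        mag.append(math.hypot(re, im))
    # triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595 * math.log10(1 + f / 700)
    mel_inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = [j * mel(sr / 2) / (n_filt + 1) for j in range(n_filt + 2)]
    bins = [min(int((nbins - 1) * mel_inv(m) / (sr / 2)), nbins - 1) for m in pts]
    feats = []
    for j in range(1, n_filt + 1):
        e = 0.0
        for k in range(bins[j - 1], bins[j + 1] + 1):
            if k <= bins[j]:
                d = bins[j] - bins[j - 1]
                w = (k - bins[j - 1]) / d if d else 1.0
            else:
                d = bins[j + 1] - bins[j]
                w = (bins[j + 1] - k) / d if d else 1.0
            e += w * mag[k]
        feats.append(math.log(e + 1e-10))  # log filterbank energy
    # DCT-II decorrelates the log energies into cepstral coefficients
    return [sum(feats[m] * math.cos(math.pi * c * (m + 0.5) / n_filt)
                for m in range(n_filt)) for c in range(n_mfcc)]

# 25 ms frame of a 440 Hz tone sampled at 16 kHz
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(400)]
coeffs = mfcc_frame(tone)
```

Real front-ends add pre-emphasis and an FFT in place of the naive DFT, but the chain of transforms is the same.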

Step 103, use at least one first time-delay neural network to extract information from the feature vectors of the multiple frames of cough audio to be discriminated, obtaining audio information.

In this embodiment, at least one first time-delay neural network in the audio discrimination model may be used to extract information from the feature vectors of the multiple frames of cough audio to be discriminated, obtaining the audio information.
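A time-delay neural network layer can be viewed as a 1-D convolution over the frame axis: each output frame aggregates input frames at fixed temporal offsets. A minimal sketch follows, with scalar context weights standing in for the full weight matrices; the offsets and weights are illustrative assumptions:

```python
def tdnn_layer(frames, weights, context=(-1, 0, 1)):
    """Apply one TDNN layer: each output frame is a weighted sum of the input
    frames at the given temporal offsets, passed through ReLU.

    frames: list of feature vectors (lists of floats), all the same length.
    weights: one scalar weight per context offset (a real TDNN uses matrices).
    """
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        vec = [0.0] * dim
        for w, off in zip(weights, context):
            src = min(max(t + off, 0), len(frames) - 1)  # clamp at sequence edges
            for d in range(dim):
                vec[d] += w * frames[src][d]
        out.append([max(0.0, v) for v in vec])  # ReLU activation
    return out

# three frames of 2-dimensional features
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
audio_info = tdnn_layer(feats, weights=[0.25, 0.5, 0.25])
```

Stacking such layers widens the temporal receptive field, which is how the first time-delay networks gather context across the cough frames.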

And 104, extracting multi-dimensional information of the audio information from multiple dimensions by using at least one residual error time delay neural network.

And 105, acquiring the audio features with fixed length from the multi-dimensional information by utilizing at least one second time delay neural network.

Step 106, input the audio features into the fully connected layer to obtain the probability that the cough audio to be discriminated belongs to the novel coronavirus pneumonia.

In this embodiment, the probability output by the fully connected layer of the audio discrimination model may be a value between 0 and 1; a larger value indicates a higher probability that the cough audio to be discriminated belongs to the novel coronavirus pneumonia.

In one implementation, the method for audio discrimination using an audio discrimination model may further include the following step, not shown in fig. 1: output the probability to assist the user in judging whether the person who produced the cough audio is a patient with the novel coronavirus pneumonia.

In this implementation, the device executing the method may also output the probability generated by the audio discrimination model for display, to assist the user in judging whether the speaker of the cough audio to be discriminated is a patient with the novel coronavirus pneumonia. As an example, the user may be a person using the device, who may or may not be the speaker. It should be understood that the probability only assists this judgment; in practice, for a more accurate determination, it may be combined with examination results such as chest radiography and CT (Computed Tomography). In this way, the probability generated by the audio discrimination model can be output and displayed for the user to review.

The method for audio discrimination using an audio discrimination model provided by the above embodiment of the present specification discriminates the cough audio with the audio discrimination model, automatically generating the probability that the cough audio to be discriminated belongs to the novel coronavirus pneumonia. In addition, because the audio discrimination model is built on the cough sound characteristics of the novel coronavirus pneumonia, its output probability is more accurate.

According to an embodiment of another aspect, an apparatus for audio discrimination using an audio discrimination model is provided. The apparatus may be deployed in any device, platform, or device cluster with computing and processing capabilities.

FIG. 4 shows a schematic block diagram of an apparatus for audio discrimination using an audio discrimination model according to one embodiment. As shown in fig. 4, the apparatus 400 is used for discriminating the probability that a cough audio belongs to the novel coronavirus pneumonia. The apparatus 400 includes: an acquisition module 401, configured to acquire multiple frames of cough audio to be discriminated from the collected audio; an extraction module 402, configured to extract a feature vector from each frame of cough audio to be discriminated; a first time-delay neural network module 403, including at least one first time-delay neural network, configured to receive the feature vectors output by the extraction module 402 and extract information from them to obtain audio information; a residual time-delay neural network module 404, including at least one residual time-delay neural network, configured to receive the audio information output by module 403 and extract its multi-dimensional information from multiple dimensions; a second time-delay neural network module 405, including at least one second time-delay neural network, configured to receive the multi-dimensional information output by module 404 and obtain fixed-length audio features from it; and a fully connected layer module 406, configured to receive the fixed-length audio features output by module 405 and obtain the probability that the cough audio to be discriminated belongs to the novel coronavirus pneumonia.

In some optional implementations of this embodiment, each of the at least one residual time-delay neural network includes a squeeze-excitation module and at least one time-delay neural network, where the squeeze-excitation module includes a first linear layer, a first activation function, a second linear layer, and a second activation function. Each residual time-delay neural network processes its input as follows: the first linear layer reduces the dimensionality of the information extracted by the at least one time-delay neural network, removing information common to cough sounds of the novel coronavirus pneumonia and cough sounds of other causes; the second linear layer raises the dimensionality of the output of the first activation function, increasing the network's parameter capacity; the output of the second activation function is multiplied by the output of the last time-delay neural network in the residual time-delay neural network; and the product is weighted together with the input of the residual time-delay neural network to produce the output of the residual time-delay neural network.
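The squeeze-excitation path described above can be sketched in miniature on per-channel values; the layer sizes, the ReLU and sigmoid activation choices, and the equal weighting of the residual combination are illustrative assumptions not fixed by this document:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def linear(v, W):
    """Dense layer without bias: W has shape out_dim x in_dim."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def se_res_block(block_input, tdnn_out, W_down, W_up):
    """Squeeze-excitation residual step on per-channel descriptors.

    tdnn_out: output of the last TDNN in the block (one value per channel).
    W_down reduces dimensionality, W_up restores it; the gated result is
    averaged with the block input as the weighted residual combination.
    """
    gate = sigmoid(linear(relu(linear(tdnn_out, W_down)), W_up))  # excitation
    scaled = [g * y for g, y in zip(gate, tdnn_out)]              # channel reweighting
    return [0.5 * s + 0.5 * x for s, x in zip(scaled, block_input)]

x = [1.0, 2.0, 3.0, 4.0]                                   # block input (4 channels)
y = [0.5, 1.0, 1.5, 2.0]                                   # last-TDNN output
W_down = [[0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4]]  # 4 -> 2 (squeeze)
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]    # 2 -> 4 (excite)
out = se_res_block(x, y, W_down, W_up)
```

The bottleneck-then-restore pair of linear layers is what lets the block learn which channels to emphasize, while the residual connection preserves the original input signal.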

In some optional implementations of this embodiment, the at least one first time-delay neural network includes two first time-delay neural networks; the at least one second time-delay neural network includes two second time-delay neural networks; and the at least one residual time-delay neural network includes three residual time-delay neural networks.

In some optional implementations of this embodiment, the apparatus 400 further includes an output module (not shown in the figure), configured to output the probability to assist the user in judging whether the person who produced the cough audio to be discriminated is a patient with the novel coronavirus pneumonia.

In some optional implementations of this embodiment, the audio discrimination model is trained as follows: acquire a sample set whose samples include positive samples and negative samples, where a positive sample comprises a feature vector of cough audio from the novel coronavirus pneumonia with a corresponding probability value of 1, and a negative sample comprises a feature vector of cough audio not from the novel coronavirus pneumonia with a corresponding probability value of 0; then train the audio discrimination model by taking the feature vector of a sample as input and the probability value corresponding to that feature vector as the expected output.

In some optional implementations of this embodiment, the acquisition module 401 is further configured to: preprocess the audio collected by the audio collection device to obtain processed audio; determine whether the processed audio contains cough sounds using a pre-trained cough sound discrimination model; and, in response to determining that the processed audio contains cough sounds, extract them as the cough audio to be discriminated.

In some optional implementations of this embodiment, the extraction module 402 is further configured to extract Mel-frequency cepstral coefficients from the cough audio to be discriminated as the feature vector.

It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; the components and steps of the examples have been described above in general functional terms to illustrate the interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
