Sound detection method and device

Document No.: 344509    Publication date: 2021-12-03

Reading note: This technology, "Sound detection method and device" (声音检测方法及装置), was designed and created by 佘积洪 and 朱宸都 on 2021-09-13. Main content: The application provides a sound detection method and device. The sound detection method comprises: acquiring audio data to be detected; determining the type of each frame of data in the audio data, the type comprising speech and silence; and inputting the speech data corresponding to the frames of the speech type in the audio data into a deep neural network to obtain the sound data belonging to the target. Because little sound exists in silence data, the silence data is removed from the audio data in advance and only the speech data is fed into the deep neural network for target sound detection. This avoids invalid detection of the silence data by the deep neural network and reduces the amount of computation the network performs on the audio data, improving the efficiency of target sound detection while ensuring its accuracy.

1. A method of sound detection, the method comprising:

acquiring audio data to be detected;

determining a type of each frame of data in the audio data, the type comprising speech and silence;

and inputting the voice data corresponding to the frame belonging to the voice type in the audio data into a deep neural network to obtain sound data belonging to a target.

2. The method of claim 1, wherein the determining the type of each frame of data in the audio data comprises:

acquiring first frame data in the audio data;

performing Fourier transform on the first frame data to obtain a first frequency spectrum, wherein the first frequency spectrum comprises a plurality of frequency points;

calculating the power of each frequency point in the first frequency spectrum;

and determining the type of the first frame data based on the power of each frequency point in the first frequency spectrum.

3. The method according to claim 2, wherein the determining the type of the first frame data based on the power of each frequency point in the first spectrum comprises:

subtracting the power of each frequency point in the first frequency spectrum from the minimum power of the corresponding frequency point to obtain the power of the de-noised sound signal of each frequency point; the minimum power is obtained based on the nonlinear tracking of the frequency points;

normalizing the power of the de-noised sound signals of each frequency point to obtain the normalized power of each frequency point;

and determining the type of the first frame data based on the normalized power of each frequency point in the first frequency spectrum.

4. The method according to claim 3, wherein the determining the type of the first frame data based on the normalized power of each frequency point in the first spectrum comprises:

when the normalized power of each frequency point in the first frequency spectrum is within a preset power range, marking a first label for the corresponding frequency point, wherein the first label is used for representing that the first frame data belongs to voice on the corresponding frequency point;

when the normalized power of each frequency point in the first frequency spectrum is not within the preset power range, marking a second label for the corresponding frequency point, wherein the second label is used for representing that the first frame data belongs to silence on the corresponding frequency point;

generating a label sequence of each frequency point in the first frequency spectrum;

determining a type of the first frame data based on the tag sequence.

5. The method according to claim 4, wherein before determining whether the normalized power of each frequency point in the first spectrum is within the preset power range, the method further comprises:

constructing a probability distribution model of each frequency point in the first frequency spectrum;

integrating the probability curve in the probability distribution model to obtain two power values corresponding to the preset probability in the probability distribution model;

and taking the two power values as the preset power range of the corresponding frequency point.

6. The method of claim 4, wherein prior to the determining the type of the first frame data based on the tag sequence, the method further comprises:

determining the weight corresponding to each frequency point according to the normalized power of each frequency point in the first frequency spectrum, wherein the weight is positively correlated with the normalized power;

the determining the type of the first frame data based on the tag sequence comprises:

carrying out weighted average on the label sequence and the weight of the corresponding frequency point to obtain the voice confidence of the first frame data corresponding to the first frequency spectrum;

when the voice confidence coefficient of the first frame data is greater than a preset voice confidence coefficient, determining that the type of the first frame data is voice;

and when the voice confidence coefficient of the first frame data is less than or equal to a preset voice confidence coefficient, determining that the type of the first frame data is mute.

7. The method according to claim 3, wherein before subtracting the minimum power of each frequency point in the first spectrum from the power of the corresponding frequency point, the method further comprises:

when the power of a target frequency point in the first frequency spectrum is greater than the minimum power of a corresponding frequency point in a second frequency spectrum, calculating the minimum power of the target frequency point according to the power of the target frequency point in the first frequency spectrum, the power of the corresponding frequency point in the second frequency spectrum and the minimum power of the corresponding frequency point in the second frequency spectrum, wherein second frame data corresponding to the second frequency spectrum is previous frame data of the first frame data in the audio data;

and when the power of the target frequency point in the first frequency spectrum is less than or equal to the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is the power of the target frequency point in the first frequency spectrum.

8. The method according to claim 7, wherein before determining whether the power of the target frequency point in the first spectrum is smaller than the minimum power of the corresponding frequency point in the second spectrum, the method further comprises:

and smoothing the original power of each frequency point in the first frequency spectrum to obtain the power of each frequency point in the first frequency spectrum.

9. The method according to any one of claims 1 to 8, wherein the inputting speech data corresponding to frames belonging to a speech type in the audio data into a deep neural network to obtain sound data belonging to a target comprises:

inputting a first voice segment in the voice data into the deep neural network, and obtaining a prediction result of a first voice frame and a second voice frame in the first voice segment, wherein the prediction result is used for representing whether the corresponding voice frame comes from the target or not;

inputting a second voice segment in the voice data into the deep neural network to obtain a prediction result of a second voice frame and a third voice frame in the second voice segment, wherein the second voice frame in the first voice segment and the second voice frame in the second voice segment are the same frame in the voice data;

determining whether a second speech frame in the first speech segment is from the target according to the prediction result of the second speech frame in the first speech segment and the prediction result of the second speech frame in the second speech segment;

and taking the data from the target determined from the voice data as the sound data of the target.

10. A sound detection device, characterized in that the device comprises:

the acquisition module is used for acquiring audio data to be detected;

a determining module, configured to determine a type of each frame of data in the audio data, where the type includes voice and silence;

and the prediction module is used for inputting the voice data corresponding to the frame belonging to the voice type in the audio data into the deep neural network to obtain sound data belonging to the target.

Technical Field

The present disclosure relates to the field of sound detection technologies, and in particular, to a sound detection method and apparatus, an electronic device, and a storage medium.

Background

Sound detection means detecting a target sound from a segment of audio, and it has wide application prospects. For example, sound detection can serve as front-end pre-processing for speech recognition: human voice data is detected from the audio data, and speech recognition is performed only on the human voice data, which improves recognition efficiency. As another example, sound detection can be used to form a meeting summary: the voice data of a speaker is detected from the audio data of a conference, and a conference summary is formed from it.

Generally, two methods are mainly used for sound detection. In the first method, the sound of the target in a segment of audio is distinguished from non-target sound by a traditional algorithm (such as a double-threshold algorithm or a Gaussian mixture model), and the target's sound data is thereby obtained. In the second method, the target's sound is distinguished from non-target sound by a deep neural network, and the target's sound data is obtained.

However, the first method, which detects the target sound with a traditional algorithm, cannot reliably distinguish the target sound from transient sounds (e.g., a knock on a table or footsteps), so its detection performance is poor. In the second method, which detects the target sound with a deep neural network, the network must evaluate every frame in the audio and output a label indicating whether that frame is the target sound; the resulting amount of computation is large, which hurts the efficiency of sound detection.

Disclosure of Invention

An object of the embodiments of the present application is to provide a sound detection method, device, electronic device and storage medium, so as to improve efficiency and accuracy of sound detection.

In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:

a first aspect of the present application provides a sound detection method, the method comprising: acquiring audio data to be detected; determining a type of each frame of data in the audio data, the type comprising speech and silence; and inputting the speech data corresponding to the frames belonging to the speech type in the audio data into a deep neural network to obtain sound data belonging to a target.

A second aspect of the present application provides a sound detection apparatus, the apparatus comprising: an acquisition module, configured to acquire audio data to be detected; a determining module, configured to determine a type of each frame of data in the audio data, where the type includes speech and silence; and a prediction module, configured to input the speech data corresponding to the frames belonging to the speech type in the audio data into a deep neural network to obtain sound data belonging to a target.

Compared with the prior art, in the sound detection method provided by the first aspect of the present application, after the audio data to be detected is acquired, each frame of data in the audio data is determined to be of the speech type or the silence type, and only the speech data corresponding to the frames of the speech type is input into the deep neural network to obtain the sound data belonging to the target. Because little sound exists in silence data, removing the silence data in advance and feeding only the speech data into the deep neural network for target sound detection avoids invalid detection of the silence data, reduces the amount of computation the network performs on the audio data, and thus improves the efficiency of target sound detection while preserving its accuracy.

The sound detection apparatus provided by the second aspect, the electronic device provided by the third aspect, and the computer-readable storage medium provided by the fourth aspect of the present application have the same or similar beneficial effects as the sound detection method provided by the first aspect.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:

FIG. 1 is a first flowchart illustrating a voice detection method according to an embodiment of the present application;

FIG. 2 is a second flowchart illustrating a voice detection method according to an embodiment of the present application;

fig. 3 is a probability distribution curve of normalized power of a 19 th frequency point in a first spectrum in the embodiment of the present application;

FIG. 4 is a schematic diagram of an architecture for performing speech detection according to an embodiment of the present application;

FIG. 5 is a diagram illustrating a structure of voice data according to an embodiment of the present application;

FIG. 6 is a schematic flowchart illustrating target voice recognition performed on audio data according to an embodiment of the present application;

FIG. 7 is a first schematic structural diagram of an exemplary embodiment of a sound detection apparatus;

FIG. 8 is a second schematic structural diagram of an exemplary sound detection apparatus;

fig. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.

In the prior art, when target sound data needs to be detected from a section of audio data, if a traditional algorithm is used to distinguish the target sound data from non-target sound data in the audio data, the traditional algorithm cannot effectively distinguish the target sound from the transient sound under the condition that the non-target sound is the transient sound, thereby reducing the accuracy of target sound detection. If the deep neural network is adopted to distinguish the target sound data from the non-target sound data in the audio data, the deep neural network detects the target sound for each frame in the audio data, so that the calculation amount of the target sound detection through the deep neural network is large, and the efficiency of the target sound detection is reduced.

The inventor finds that the deep neural network's computation load during sound detection is large because the network must run once for every frame of data. However, not every frame of the audio data needs to be fed into the deep neural network for target sound detection, because some frames may contain no sound at all, i.e., silence. Feeding such silence data into the deep neural network keeps the detection accuracy high but increases the amount of computation and lowers the computational efficiency.

In view of this, an embodiment of the present application provides a sound detection method: before the audio data to be detected is input into a deep neural network to detect the target sound data, each frame of data in the audio data is determined to be speech data or silence data, and only the speech data is then input into the deep neural network to detect the target's sound data. Because little sound exists in silence data, removing the silence data in advance and feeding only the speech data into the deep neural network for target sound detection avoids invalid detection of the silence data, reduces the amount of computation the network performs on the audio data, and thus improves the efficiency of target sound detection while preserving its accuracy.

In practical applications, the target may be a human. Then, the detected sound in the embodiment of the present application is the voice of the human speaking. Furthermore, the voice of the detected person speaking can be subjected to semantic recognition and the like. Of course, the target may also be an animal. In the present example, the sounds made by the animal are then detected. Further, the emotion, intention, etc. of the animal can be known by the sound made by the animal. Of course, the target may also be an object. In the embodiment of the present application, the sound emitted from the object is detected. The condition of the environment in which the object is located can then be determined by the sound emitted by the object. In the embodiments of the present application, the specific type of the object is not limited.

Next, a sound detection method provided in an embodiment of the present application will be described in detail.

Fig. 1 is a schematic flow chart of a sound detection method in an embodiment of the present application, and referring to fig. 1, the method may include:

s101: and acquiring audio data to be detected.

In order to detect the target sound, it is first necessary to acquire audio data including the target sound, i.e., audio data to be detected.

The audio data to be detected may not only include the target audio data, but also include non-target audio data. Non-target herein may refer to everything unrelated to the target. For example: when the target is a person, the non-target may be another person, various animals, various objects, environmental noise, or the like.

Of course, the audio data to be detected may contain only the target's sound data. However, whoever performs detection on the audio data does not know in advance that it contains only target sound data and no non-target sound data, so target sound detection still needs to be performed. The specific content of the audio data is not limited here.

S102: the type of each frame of data in the audio data is determined, the type including speech and silence.

Sound does not necessarily exist continuously along the time axis of the audio data. For example, when a person speaks at a meeting, the speech is not continuous: there are slight pauses between sentences, or no speech at all while the speaker performs the next action after finishing a passage. Thus, in the audio data, some frames correspond to sound data while others contain no sound data or only weak ambient sound.

To improve recognition efficiency and avoid the extra computation caused by the deep neural network still processing frames that contain no sound or only weak sound, the type of each frame of data in the audio data can first be identified, i.e., whether each frame is speech data or silence data. The deep neural network then processes only the speech data in the audio data, which saves computation.

For example, assume that a person is speaking, and continuously speaks the content a in the time corresponding to the 1 st frame to the 100 th frame. Then, at the time corresponding to the 101 st frame to the 200 th frame, a drink is drunk, and the presentation is turned to the next page. Next, the content B is continuously described for the time corresponding to the 201 st frame to the 300 th frame. During the time from the 1 st frame to the 300 th frame, the speech is recorded, and audio data is formed. It can be seen that, in the audio data, the type of the 1 st frame data is voice data, … …, the type of the 100 th frame data is voice data, the type of the 101 th frame data is mute data, … …, the type of the 200 th frame data is mute data, the type of the 201 th frame data is voice data, … …, and the type of the 300 th frame data is voice data.

The specific manner for determining the type of each frame of data in the audio data may be spectral power, amplitude of the sound signal, etc., and is not limited herein.

And speech-type data may include sounds made by the target as well as sounds made by non-targets. For example, the sound made by the target may be a human voice, and the sounds made by non-targets may be those of animals such as cats or dogs, or of objects such as tables or chairs.

S103: and inputting the voice data corresponding to the frame belonging to the voice type in the audio data into the deep neural network to obtain the voice data belonging to the target.

After the data of each frame in the audio data is divided into the voice data and the mute data, only the voice data in the audio data can be input into the deep neural network for target sound detection, and then the sound data of the target can be obtained from the voice data of the audio data.

The deep neural network may be any deep neural network capable of performing sound detection. The specific type of the deep neural network is not limited herein.

After the voice data is input into the deep neural network, the deep neural network can calculate each frame data in the voice data and output a probability value that each frame data is the target sound data. According to the probability value of each frame data output by the deep neural network, the target sound data can be extracted from the voice data.

As can be seen from the above, in the sound detection method provided in this embodiment, after the audio data to be detected is obtained, each frame of data is determined to be of the speech type or the silence type, and the speech data corresponding to the frames of the speech type is then input into the deep neural network to obtain the sound data belonging to the target. Because little sound exists in silence data, removing the silence data in advance and feeding only the speech data into the deep neural network for target sound detection avoids invalid detection of the silence data, reduces the amount of computation the network performs on the audio data, and thus improves the efficiency of target sound detection while preserving its accuracy.

Further, as a refinement and an extension of the method shown in fig. 1, the embodiment of the present application also provides a sound detection method. Fig. 2 is a schematic flowchart of a second sound detection method in an embodiment of the present application, and referring to fig. 2, the method may include:

s201: and acquiring audio data to be detected.

Step S201 has the same or similar implementation as step S101, and is not described herein again.

S202: first frame data in the audio data is acquired.

In essence, each frame of data in the audio data needs to be processed separately to determine whether it belongs to the speech type or the silence type. Here, the procedure is described using one frame of the audio data, called the first frame data, as an example. The term "first frame data" does not mean the data of the starting frame of the audio data; the first frame data may be the data of any frame in the audio data.

In practical application, the frame length and frame shift of a frame of data in audio data can be set according to practical requirements. For example: when the sampling frequency of the audio data is 16000Hz, the frame length may be 25ms, and the frame shift may be 10 ms.

S203: and performing Fourier transform on the first frame data to obtain a first frequency spectrum, wherein the first frequency spectrum comprises a plurality of frequency points.

In the time domain, the type of the first frame data is not easily determined. In the frequency domain, there is a lot of information that is not known in the time domain. Accordingly, the first frame data may be converted into a frequency domain, thereby determining the type of the first frame data through information in the frequency domain.

Specifically, the first frame data is Fourier-transformed to obtain its corresponding frequency spectrum, namely the first spectrum. The first spectrum comprises a plurality of frequency points, each representing one frequency component of the time-domain waveform of the first frame data. The frequency points of the first spectrum clearly reveal how the frequency components of the first frame data are combined, and each frequency point can be used to judge how likely it is that the first frame data is speech-type data.

S204: and calculating the original power of each frequency point in the first frequency spectrum.

And calculating the original power of each frequency point through the first frequency spectrum. Specifically, the following formula (1) can be used for the calculation.

P'_signal(λ, k) = |Y(λ, k)|²   Formula (1)

where λ denotes the λ-th frame, k denotes the k-th frequency point in the λ-th frame, Y(λ, k) denotes the spectrum at the k-th frequency point of the λ-th frame, and P'_signal(λ, k) denotes the raw power of the k-th frequency point of the λ-th frame.

Of course, the power of each frequency point in the frequency spectrum can also be calculated by adopting other modes of calculating the power of the frequency point in the frequency spectrum. The specific calculation method is not limited herein.

It should be particularly noted that, owing to the symmetry of the spectrum, the power does not have to be computed for every frequency point in the first spectrum; it is sufficient to compute the power of the first half of the frequency points, i.e., of a preset set of frequency points. This speeds up the processing of the frequency points. Subsequently, only the types corresponding to these preset frequency points need to be judged, which speeds up the type decision and in turn improves the efficiency of sound detection.

For example, assume that 150,000 audio files need to be processed. Each file is divided into frames with a frame length of 25 ms and a frame shift of 10 ms, yielding multiple frames of data. For each frame of data, a 512-point short-time Fourier transform is performed to obtain its spectrum. Because of the symmetry of the spectrum, the first 257 frequency points are enough to represent each frame of data, so only the power of the first 257 frequency points of the spectrum is computed.
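As an illustration of steps S202 to S204, the following minimal NumPy sketch frames a 16 kHz signal into 25 ms frames with a 10 ms shift, applies a 512-point FFT per frame, and keeps the power of the first 257 frequency points as in formula (1); the function names and the Hamming window are assumptions that the application does not specify.

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (assumes len(x) >= one frame)."""
    frame_len = int(sr * frame_ms / 1000)            # 400 samples at 16 kHz
    frame_shift = int(sr * shift_ms / 1000)          # 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx]                                    # shape: (n_frames, frame_len)

def raw_bin_power(frames, n_fft=512, n_bins=257):
    """Formula (1): P'_signal(lambda, k) = |Y(lambda, k)|^2 for the first 257 bins."""
    win = np.hamming(frames.shape[1])                # window choice is an assumption
    spectrum = np.fft.rfft(frames * win, n=n_fft, axis=1)  # rfft yields 257 bins for n_fft=512
    return np.abs(spectrum[:, :n_bins]) ** 2         # shape: (n_frames, 257)
```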

S205: and smoothing the original power of each frequency point in the first frequency spectrum to obtain the power of each frequency point in the first frequency spectrum.

After the original power of each frequency point in the first frequency spectrum is calculated, in order to improve the accuracy of the power value of each frequency point and further improve the accuracy of determining the frame data type, the original power of each frequency point in the first frequency spectrum may be firstly smoothed. Specifically, the processing can be performed by the following formula (2).

P_signal(λ, k) = α·P'_signal(λ, k) + (1 − α)·|Y(λ, k)|²   Formula (2)

where λ denotes the λ-th frame, k denotes the k-th frequency point in the λ-th frame, Y(λ, k) denotes the spectrum at the k-th frequency point of the λ-th frame, P'_signal(λ, k) denotes the original power of the k-th frequency point of the λ-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, and α denotes a smoothing factor. In general, α lies between 0.5 and 1. If α is too small, the smoothing effect is poor; if α is too large, the smoothing is excessive and the detail information of the original power of the frequency point is lost.
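A minimal NumPy sketch of the smoothing in step S205 is given below, under the assumption that each frame's raw power is smoothed recursively against the previous frame's smoothed power; the function name and the default value of α are likewise assumptions.

```python
import numpy as np

def smooth_power(raw_power, alpha=0.7):
    """First-order recursive smoothing of per-bin power across frames.

    raw_power: array of shape (n_frames, n_bins) from formula (1).
    alpha: smoothing factor, typically between 0.5 and 1.
    The frame-recursive form below is an assumed reading of formula (2).
    """
    smoothed = np.empty_like(raw_power)
    smoothed[0] = raw_power[0]                 # first frame: nothing earlier to smooth against
    for lam in range(1, raw_power.shape[0]):
        smoothed[lam] = alpha * smoothed[lam - 1] + (1 - alpha) * raw_power[lam]
    return smoothed
```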

Because the power of each frequency point needs to be normalized later, the minimum power of each frequency point must be determined first; the minimum power of a frequency point is an estimate of the noise power. The minimum power of each frequency point in the first spectrum can be determined by continuous spectral-minimum tracking. The procedure is described below using one frequency point of the first spectrum, the target frequency point, as an example. This does not mean that the minimum power is determined for only one frequency point; the minimum power must be determined once for every frequency point in the spectrum.

Of course, the original power of each frequency point in the first spectrum may not be smoothed, that is, after step S204 is executed, step S205 may be skipped and step S206 may be executed as it is. Therefore, the original power of each frequency point in the first frequency spectrum is directly used as the power of each frequency point in the first frequency spectrum for subsequent processing.

S206: and when the power of the target frequency point in the first frequency spectrum is greater than the minimum power of the corresponding frequency point in the second frequency spectrum, calculating the minimum power of the target frequency point according to the power of the target frequency point in the first frequency spectrum, the power of the corresponding frequency point in the second frequency spectrum and the minimum power of the corresponding frequency point in the second frequency spectrum.

S207: and when the power of the target frequency point in the first frequency spectrum is less than or equal to the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is the power of the target frequency point in the first frequency spectrum.

The second frame data corresponding to the second spectrum is the previous frame data of the first frame data in the audio data. When determining the minimum power of each frequency point corresponding to the first frame data, the minimum power of each frequency point in the second spectrum, i.e., the spectrum of the previous frame (the second frame data), must be referred to.

Specifically, it is determined whether the following formula (3) is satisfied.

P_signal,min(λ−1, k) < P_signal(λ, k)   Formula (3)

where P_signal,min(λ−1, k) denotes the minimum power of the k-th frequency point of the (λ−1)-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, λ denotes the λ-th frame, and k denotes the k-th frequency point in the λ-th frame.

If the above formula (3) is satisfied, which indicates that the power of the frequency point corresponding to the current frame data is increased, the minimum power of the frequency point is calculated by the following formula (4).

where P_signal,min(λ, k) denotes the minimum power of the k-th frequency point of the λ-th frame, P_signal,min(λ−1, k) denotes the minimum power of the k-th frequency point of the (λ−1)-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, P_signal(λ−1, k) denotes the power of the k-th frequency point of the (λ−1)-th frame, λ denotes the λ-th frame, k denotes the k-th frequency point in the λ-th frame, and β and γ are related parameters.

Further, both β and γ can be between 0 and 1. Preferably, β may take the value of 0.96 and γ may take the value of 0.998. Of course, β, γ may also take other values between 0 and 1. Specific values of β and γ are not limited herein.

In essence, formula (4) implements a first-order difference operation, which approximates differentiation in the discrete case and improves the calculation speed. When the power of the noisy signal, i.e., of the frequency point in the current frame, increases, the derivative is positive and the noise estimate increases; when it decreases, the derivative is negative and the noise estimate decreases.

If the above formula (3) does not hold, which indicates that the power of the frequency point corresponding to the current frame data is reduced or unchanged, the minimum power of the frequency point is calculated through the following formula (5).

P_signal,min(λ, k) = P_signal(λ, k)   Formula (5)

where P_signal,min(λ, k) denotes the minimum power of the k-th frequency point of the λ-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, λ denotes the λ-th frame, and k denotes the k-th frequency point in the λ-th frame.

It should be noted that steps S206 and S207 are alternatively performed.
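Steps S206 and S207 can be pictured with the sketch below. Because formula (4) is not reproduced above, the update applied when formula (3) holds is written as the standard continuous minimum-tracking (first-order difference) recursion with the parameters β and γ described above; treat that exact form, like the function name, as an assumption.

```python
import numpy as np

def track_minimum_power(power, beta=0.96, gamma=0.998):
    """Continuous spectral-minimum (noise) tracking per frequency bin.

    power: smoothed per-bin power, shape (n_frames, n_bins).
    When the noisy power rises (formula (3) holds), the minimum is updated with a
    first-order-difference recursion (assumed form of formula (4)); otherwise the
    minimum is reset to the current power (formula (5)).
    """
    p_min = np.empty_like(power)
    p_min[0] = power[0]
    for lam in range(1, power.shape[0]):
        rising = p_min[lam - 1] < power[lam]                        # formula (3)
        update = (gamma * p_min[lam - 1]
                  + (1 - gamma) / (1 - beta) * (power[lam] - beta * power[lam - 1]))
        p_min[lam] = np.where(rising, update, power[lam])           # (4) or (5)
    return p_min
```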

After the minimum power of each frequency point in the first frequency spectrum is determined, subtracting the minimum power of the corresponding frequency point in the first frequency spectrum from the power of each frequency point in the first frequency spectrum to obtain the power of the de-noised sound signal of each frequency point in the first frequency spectrum. And then normalizing the power of the de-noised sound signals of each frequency point, and judging whether the corresponding frequency point belongs to voice or silence based on the normalized power. Before this, a criterion for judgment, that is, a preset power range, needs to be determined in advance.

S208: and constructing a probability distribution model of each frequency point in the first frequency spectrum.

S209: and integrating the probability curve in the probability distribution model to obtain two power values corresponding to the preset probability in the probability distribution model.

S210: and taking the two power values as the preset power range of the corresponding frequency point.

And aiming at each frequency point in the first frequency spectrum, a probability distribution model is required to be constructed. And then determining a reference, namely a preset power range, for judging the type of each frequency point based on each constructed probability distribution model. A specific process of determining the preset power range of the 19 th frequency point will be described below by taking the probability distribution map of the normalized power of one frequency point, for example, the 19 th frequency point, in the first spectrum as an example. Of course, this is not intended to limit the first spectrum to having the 19 th frequency point, and the 19 th frequency point is merely an example.

Fig. 3 shows the probability distribution curve of the normalized power of the 19th frequency point in the first spectrum in the embodiment of the present application. As shown in Fig. 3, the abscissa is the normalized power and the ordinate is the probability. A horizontal straight line T1 intersects the probability distribution curve, cutting out a segment of the curve, S1. Integrating the curve segment S1 gives the probability P. The horizontal line T1 is moved; when the probability P is 75%, the range between the two normalized power thresholds thr1 and thr2 corresponding to the intersection points of the curve segment S1 with the horizontal line T1 is the preset power range.

Of course, the above 75% probability is also merely illustrative. The specific value of the probability can be set according to actual needs. The specific value of the probability is not limited herein.

The preset power range is set in this way because, if a frame of data belongs to speech rather than silence, the normalized power of its 19th frequency point, i.e., P_voice(λ, 19), will with high probability fall within the preset power range thr1 to thr2. The preset power range can therefore be used to judge whether the 19th frequency point of that frame belongs to speech, and the same preset power range can subsequently be used to judge whether the 19th frequency point of other frames belongs to speech. That is, the preset power range of each frequency point can be determined in advance from one frame of data, and the corresponding frequency points of all other frames can reuse these preset power ranges. This reduces the amount of computation spent on the preset power ranges and improves the efficiency of sound detection.

It should be further noted that the probability distribution curve of the normalized power of each frequency point may be obtained by any method for generating a frequency point power probability distribution curve. The specific manner of obtaining the probability distribution curve of the frequency point power is not limited here.
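The construction of the preset power range in steps S208 to S210 might look like the sketch below, which builds a histogram-based probability distribution from normalized powers of one frequency point collected over known speech data (an assumption about how the probability distribution model is obtained) and lowers a horizontal level until the enclosed probability reaches 75%, returning thr1 and thr2.

```python
import numpy as np

def preset_power_range(norm_power_samples, prob=0.75, n_bins=200):
    """Estimate the preset power range [thr1, thr2] for one frequency point.

    norm_power_samples: normalized powers of this frequency point over known speech data.
    A horizontal level is lowered over the density curve until the area it encloses
    reaches `prob`; the outermost covered power values give thr1 and thr2.
    """
    hist, edges = np.histogram(norm_power_samples, bins=n_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    # Sweep candidate levels from high to low; keep the first level whose
    # above-level region integrates to the requested probability.
    for level in np.sort(hist)[::-1]:
        mask = hist >= level
        if np.sum(hist[mask] * widths[mask]) >= prob:
            selected = centers[mask]
            return selected.min(), selected.max()    # thr1, thr2
    return centers.min(), centers.max()
```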

After the power of each frequency point is obtained in the step S205, the minimum power of each frequency point is obtained in the steps S206 to S207, and the preset power range of each frequency point is obtained in the steps S208 to S210, normalization processing may be performed on the power of each frequency point in the first spectrum, and it is determined whether the corresponding frequency point belongs to speech or silence based on the normalized power.

S211: and subtracting the minimum power of the corresponding frequency point from the power of each frequency point in the first frequency spectrum to obtain the power of the de-noised sound signal of each frequency point, and further carrying out normalization processing on the power of the de-noised sound signal of each frequency point to obtain the normalized power of each frequency point.

Here, the corresponding frequency point is a frequency point corresponding to each frequency point in the first spectrum. For example: the 1 st to 257 th frequency points exist in the first frequency spectrum, and then the 1 st to 257 th frequency points all correspond to the minimum power respectively. For example: the 1 st frequency point corresponds to the minimum power a, the 2 nd frequency point corresponds to the minimum power b, … …, and the 257 th frequency point corresponds to the minimum power s. The minimum power is obtained by the non-linear tracking based on the frequency points, i.e. the steps S206 to S207. And the minimum power of the corresponding frequency point represents the noise corresponding to the corresponding frequency point.

Specifically, the power of the de-noised sound signal at each frequency point in the first spectrum can be obtained by the following formula (6).

P_voice(λ, k) = P_signal(λ, k) − P_signal,min(λ, k)   Formula (6)

where P_voice(λ, k) denotes the power of the de-noised sound signal at the k-th frequency point of the λ-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, P_signal,min(λ, k) denotes the minimum power of the k-th frequency point of the λ-th frame, λ denotes the λ-th frame, and k denotes the k-th frequency point in the λ-th frame.

After the power of the de-noised sound signal of each frequency point is obtained, the power of the de-noised sound signal of each frequency point can be normalized, and then the normalized power of each frequency point is obtained. Here, any normalization processing method may be used to normalize the power of the de-noised sound signal of each frequency point. The detailed description of the normalization process is omitted here.
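A compact sketch of step S211 is shown below; formula (6) fixes the de-noising subtraction, while the min-max normalization used here is only one possible choice, since the application does not prescribe a particular normalization method.

```python
import numpy as np

def normalized_bin_power(power, p_min, eps=1e-12):
    """Formula (6) followed by a simple per-bin normalization.

    power, p_min: arrays of shape (n_frames, n_bins).
    """
    p_voice = np.maximum(power - p_min, 0.0)          # de-noised power, formula (6)
    lo = p_voice.min(axis=0, keepdims=True)           # per-bin min across frames
    hi = p_voice.max(axis=0, keepdims=True)           # per-bin max across frames
    return (p_voice - lo) / (hi - lo + eps)           # normalized power in [0, 1]
```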

S212: and when the normalized power of each frequency point in the first frequency spectrum is within a preset power range, marking a first label for the corresponding frequency point.

S213: and when the normalized power of each frequency point in the first frequency spectrum is not within the preset power range, marking a second label for the corresponding frequency point.

S214: and generating a label sequence of each frequency point in the first frequency spectrum.

The preset power range may be obtained by performing statistics according to a large amount of known voice data and silence data. The first label is used for representing that the first frame data belongs to voice on corresponding frequency points. The second label is used for representing that the first frame data belong to silence on corresponding frequency points.

After the normalized power of each frequency point in the first spectrum is obtained, it is determined whether the normalized power of each frequency point lies within the corresponding preset power range. If it does, the frequency point belongs to speech and may be marked as 1. If it does not, the frequency point does not belong to speech and may belong to silence, so it may be marked as 0.

Of course, other types of marks can be used to distinguish whether a frequency point belongs to voice. The specific type of the mark is not limited herein.

After every frequency point in the first spectrum has been marked, a sequence [i1, i2, ..., in] is obtained, where each of i1, i2, ..., in is 0 or 1 and n is the number of marked frequency points. Generally, the first half of all the frequency points are marked, depending on how many frequency points the spectrum contains.
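Steps S212 to S214 reduce to an elementwise range check, as in the sketch below; the function name and array layout are assumptions.

```python
import numpy as np

def label_sequence(norm_power_frame, thr1, thr2):
    """Mark each frequency point of one frame as speech (1) or silence (0).

    norm_power_frame: normalized powers of the marked frequency points, shape (n,).
    thr1, thr2: per-point preset power range bounds, each of shape (n,).
    """
    in_range = (norm_power_frame >= thr1) & (norm_power_frame <= thr2)
    return in_range.astype(int)                       # e.g. [1, 0, 1, ..., 1]
```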

S215: and determining the weight corresponding to each frequency point according to the normalized power of each frequency point in the first frequency spectrum, wherein the weight is positively correlated with the normalized power.

Generally, the frequency of speech lies between 300 Hz and 3400 Hz, while the Fourier spectrum of audio data acquired at a sampling frequency of 16000 Hz covers the range from 0 Hz to 8000 Hz. That is, the frequency content of speech-type data is mainly distributed over the first 128 frequency points. In addition, different data have different power at different frequency points: the larger the power of a piece of data at a frequency point, the more important that frequency point is for judging whether the data is speech. In other words, different frequency points contribute differently to the final decision of whether the data is speech data. Therefore, the weights [w1, w2, ..., wn] of the frequency points can be determined according to the normalized power of each frequency point in the first spectrum, where n is the number of frequency points.

Specifically, the larger the normalized power of the corresponding frequency point in the first spectrum is, the larger the weight corresponding to the corresponding frequency point is. Similarly, the smaller the normalized power of the corresponding frequency point in the first spectrum is, the smaller the weight corresponding to the corresponding frequency point is.

S216: and carrying out weighted average on the tag sequence of the first frequency spectrum and the weight of each frequency point in the first frequency spectrum to obtain the voice confidence of the first frame data corresponding to the first frequency spectrum.

Specifically, the speech confidence of the first frame data can be calculated by the following equation (7).

where C denotes the speech confidence, [ω1, ω2, ..., ωn] denotes the weights of the frequency points in the first spectrum, [i1, i2, ..., in] denotes the tag sequence of the first spectrum, and n denotes the number of frequency points.

S217: and when the voice confidence coefficient of the first frame data is greater than the preset voice confidence coefficient, determining that the type of the first frame data is voice.

S218: and when the voice confidence coefficient of the first frame data is less than or equal to the preset voice confidence coefficient, determining the type of the first frame data to be mute.

In practical application, the preset voice confidence can be set according to actual needs. For example: 0.4, 0.5, 0.6, etc. When voice data needs to be acquired more comprehensively from audio data, the preset voice confidence coefficient can be set to be lower; when it is necessary to more accurately acquire voice data from audio data, the preset voice confidence may be set higher. The specific value of the preset speech confidence is not limited herein.
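Steps S215 to S218 can be sketched as follows. The weights are taken proportional to the normalized power, which satisfies the positive correlation required in step S215; because equation (7) is not reproduced above, the normalized weighted average used here is an assumed form of it, and the default confidence threshold of 0.5 is only an example.

```python
import numpy as np

def frame_is_speech(labels, norm_power_frame, conf_threshold=0.5):
    """Decide speech vs. silence for one frame from its per-point labels.

    labels: 0/1 tag sequence of the frame, shape (n,).
    norm_power_frame: normalized powers of the same frequency points, shape (n,).
    """
    weights = norm_power_frame / (norm_power_frame.sum() + 1e-12)  # positively correlated weights
    confidence = float(np.dot(weights, labels))                    # speech confidence C
    return confidence > conf_threshold                             # True -> speech, False -> silence
```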

Fig. 4 is a schematic diagram of an architecture for performing voice detection in the embodiment of the present application, and referring to fig. 4, for n frequency points of first frame data in a first frequency spectrum of a frequency domain, whether the n frequency points belong to voices is respectively determined, so as to generate a sequence of the n frequency points. And meanwhile, determining the weight of the n frequency points. And then, calculating the voice confidence of the first frame data based on the weight of the n frequency points and the sequence of the n frequency points. And finally, judging a threshold value according to the voice confidence of the first frame data, and outputting a judgment result, wherein the judgment result is used for representing whether the first frame data belongs to the voice data.

Through steps S201 to S207, the noise estimate is continuously updated while the audio data is being examined. Through steps S208 to S218, prior knowledge about speech data (the normalized power range of speech at each frequency point, and which frequency points matter more for judging that data is speech) is introduced into the threshold decision to roughly separate the speech frames from the silence frames in the audio data. Only the speech frames are subsequently sent to the deep neural network for sound detection, so the amount of computation is greatly reduced and the detection efficiency improved while the detection accuracy is preserved.

In practical application, a convolutional neural network can be used to perform classification and identification on target sounds and non-target sounds in the voice data.

Taking a convolutional neural network as an example, before it can be used for target sound detection, a training data set must be acquired, the network must be built, and the network must be trained. Once the network is trained, the target sound in the speech data can be detected accurately. The process of target sound detection is described below in four parts: acquiring the training data set, building the network, training the network, and predicting with the network.

In a first aspect: a training data set is obtained.

In practical applications, a training data set for network training can be generated from a subtitle-aligned movie corpus (SAM). Taking human voice recognition as an example, human voice data is collected from the movies in the corpus; it may include pure voice, voice plus noise, voice plus music, and so on, amounting to about 50,000 samples. Non-human-voice data is also collected from the movies, i.e., other sounds occurring in the movies that contain no human voice, amounting to about 170,000 samples. Each sample is approximately 0.63 s long. In this way, the training data set is obtained.

After the training data set is obtained, the data set may be further processed. Specifically, each 0.63 s-long speech segment in the training data set is framed, the frame length is set to 25ms, the frame shift is set to 10ms, and each speech segment can be divided into 64 speech frames. Then, for each speech frame, a 64-dimensional mel-frequency cepstral coefficient (MFCC) feature is extracted to express each speech frame with the MFCC feature. And each speech segment is represented by a 64 x 64 speech map.

Here, the MFCC is based on human auditory characteristics and has a nonlinear correspondence with frequency; the MFCC feature is a spectral feature computed using this relationship.
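Feature extraction for one 0.63 s segment could be sketched as follows with librosa; the exact window, padding, and cropping needed to land on exactly a 64 × 64 map are assumptions.

```python
import librosa

def segment_to_feature_map(segment, sr=16000):
    """Turn one ~0.63 s speech segment into a 64 x 64 MFCC map.

    64 MFCCs are extracted per 25 ms frame (400 samples) with a 10 ms shift (160 samples).
    """
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=64,
                                n_fft=400, hop_length=160, win_length=400)
    mfcc = mfcc[:, :64]                               # (n_mfcc, frames) -> crop to 64 frames
    return mfcc.T                                     # (64 frames, 64 MFCCs)
```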

In a second aspect: and (5) building a network.

Specifically, an 8-layer deep residual network may be employed. Because a deep residual network alleviates the degradation problem of deep networks through residual learning, it is well suited to target sound recognition.
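The application only fixes the depth (8 layers) of the residual network, so every other architectural detail in the PyTorch sketch below (channel widths, strides, the per-frame output head) is an assumption about what such a network could look like for 1 × 64 × 64 MFCC maps.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers with an identity (or 1x1-projected) shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.short = (nn.Identity() if stride == 1 and in_ch == out_ch else
                      nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                    nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.short(x))

class ResNet8(nn.Module):
    """1 stem conv + 3 residual blocks (6 convs) + linear head = 8 weight layers."""
    def __init__(self, n_frames=64):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 16, 3, 1, 1, bias=False),
                                  nn.BatchNorm2d(16), nn.ReLU())
        self.blocks = nn.Sequential(BasicBlock(16, 32, 2),
                                    BasicBlock(32, 64, 2),
                                    BasicBlock(64, 64, 1))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(64, n_frames)    # one target/non-target logit per speech frame

    def forward(self, x):                      # x: (batch, 1, 64, 64) MFCC maps
        z = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.head(z)                    # (batch, 64) frame-level logits
```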

In a third aspect: and training the network.

Each speech segment in the data set is fed into the constructed deep neural network for training.

Specifically, the AdamW optimization algorithm may be used to train the network until its parameters are optimized. A learning-rate warmup technique is also adopted during training. The weights of the network are initialized randomly at the beginning of training, and choosing a large learning rate at that point may make the network unstable (oscillate). With learning-rate warmup, the learning rate is slowly increased from a small initial value to the nominal larger value at the beginning of training, over some number of training steps. The network can thus slowly stabilize under the small warmup learning rate; once it is relatively stable, the larger learning rate is used, which accelerates convergence and further improves the prediction performance of the network.
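A training-step sketch with AdamW and a linear learning-rate warmup is given below; the base learning rate, weight decay, warmup length, and loss function are all assumed values rather than settings stated by the application.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model, base_lr=1e-3, warmup_steps=1000):
    """AdamW with a linear learning-rate warmup (assumed schedule and values)."""
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=1e-2)
    warmup = lambda step: min(1.0, (step + 1) / warmup_steps)   # ramp from ~0 up to base_lr
    scheduler = LambdaLR(optimizer, lr_lambda=warmup)
    return optimizer, scheduler

def train_step(model, batch, labels, optimizer, scheduler,
               loss_fn=torch.nn.BCEWithLogitsLoss()):
    """One optimization step: frame-level binary targets vs. frame-level logits."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels.float())     # labels: (batch, 64) in {0, 1}
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # advance the warmup schedule per step
    return loss.item()
```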

In a fourth aspect: the prediction is performed using a network.

When the network predicts, the whole voice data is not predicted at one time, but the voice data is divided into a plurality of voice segments according to the preset frame length and frame shift, and each frame in each voice segment is predicted respectively. The following describes the network prediction process in detail by taking two adjacent voice segments, i.e. a first voice segment and a second voice segment, in the voice data as an example.

S219: and inputting the first voice segment in the voice data into a deep neural network to obtain the prediction results of the first voice frame and the second voice frame in the first voice segment.

S220: and inputting the second voice segment in the voice data into the deep neural network to obtain the prediction results of the second voice frame and the third voice frame in the second voice segment.

S221: and determining whether the second speech frame comes from the target according to the prediction result of the second speech frame in the first speech segment and the prediction result of the second speech frame in the second speech segment.

And the second voice frame in the first voice segment and the second voice frame in the second voice segment are the same frame in the voice data. The prediction result is used to characterize whether the corresponding speech frame is from the target.

That is, the voice data includes a plurality of voice segments, each of the voice segments includes a plurality of voice frames, and adjacent voice segments partially overlap on some voice frames. Fig. 5 is a schematic structural diagram of speech data in an embodiment of the present application, and referring to fig. 5, in the speech data, there are multiple partially overlapping speech segments, and for a certain frame in the speech data, for example: the current frame. There are multiple speech segments that contain the current frame. When determining whether the current frame is from the target, it is necessary to predict whether the current frame is from the target from a plurality of speech segments including the current frame, and finally determine whether the current frame is from the target based on the prediction result of the current frame in the plurality of speech segments including the current frame. For example: in the speech data, a duration of 0.63s is used as a speech segment, a duration of 25ms is used as a speech frame, when it is determined whether a current frame is from a target, prediction of whether the current frame is from the target is required to be performed in 63 speech segments, and then based on the prediction result of the current frame in 63 speech segments, it is finally determined whether the current frame is from the target.

In general, in practical applications, the number of speech frames in the first speech segment may be 64, i.e., the first speech frame and the second speech frame stand for 64 speech frames. Correspondingly, the number of speech frames in the second speech segment may also be 64, i.e., the second speech frame and the third speech frame stand for 64 speech frames. Thus, after the first speech segment is input into the deep neural network, prediction results for the 64 speech frames of the first speech segment are obtained, and inputting the second speech segment yields prediction results for the 64 speech frames of the second speech segment. Because the speech frames of the first speech segment partially overlap those of the second speech segment, whether a given speech frame comes from the target is finally decided from the prediction results that the 63 speech segments containing it give for that frame. The speech frame data determined to come from the target is then taken as the target's sound data.

For the deep neural network, after a speech segment is received, the prediction results of all speech frames in the speech segment can be output. The prediction result of a certain speech frame in the speech segment also includes the prediction result of the speech frame in other speech segments. Therefore, all the prediction results of the speech frame are obtained from each speech segment, and whether the speech frame belongs to the target or not is determined according to all the prediction results of the speech frame, so that the accuracy of speech frame prediction can be improved, and the accuracy of target sound detection is further improved.

Specifically, each prediction result of a speech frame takes one of two values: belonging to the target or not belonging to the target. The number of occurrences of each value can therefore be counted, and the value with the larger count is used as the final prediction result of the speech frame.

In practical applications, after a speech segment is input into the deep neural network, the network outputs a prediction result for each speech frame in the segment. The prediction result may be represented by 0 and 1, or by a probability. When represented by 0 and 1, an output of 0 indicates that the current speech frame does not belong to the target, and an output of 1 indicates that it does. When represented by a probability, the value indicates the probability that the current frame belongs to the target. Of course, the prediction result of the network may also be expressed in other ways, which is not limited here.
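
The following sketch shows one way to combine these per-segment outputs into a single decision per frame by majority voting. It assumes that segment s starts at frame index s (one-frame hop), and that probability outputs are binarized with a 0.5 threshold; both are illustrative assumptions.

```python
import numpy as np

def aggregate_frame_predictions(segment_preds: np.ndarray, segment_len: int = 64,
                                threshold: float = 0.5) -> np.ndarray:
    """Combine per-segment predictions into one decision per frame by majority vote.

    segment_preds: shape (num_segments, segment_len); entry [s, k] is the network
    output (0/1 label or probability) for the k-th frame of segment s, where
    segment s is assumed to start at frame index s.
    Returns a boolean array of length num_frames: True if the frame is judged
    to come from the target (ties are resolved as not-target).
    """
    num_segments = segment_preds.shape[0]
    num_frames = num_segments + segment_len - 1
    votes_for_target = np.zeros(num_frames)
    votes_total = np.zeros(num_frames)
    binary = segment_preds >= threshold  # works for both 0/1 labels and probabilities
    for s in range(num_segments):
        votes_for_target[s:s + segment_len] += binary[s]
        votes_total[s:s + segment_len] += 1
    # a frame is attributed to the target when the majority of its votes say so
    return votes_for_target * 2 > votes_total
```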

Fig. 6 is a schematic flow chart of performing target sound identification on audio data in the embodiment of the present application. Referring to Fig. 6, after the audio data is acquired, silence detection is first performed on the audio data through noise estimation; then the silent frames are removed from the audio data to obtain speech frames; then a plurality of speech segments are generated from the speech frames; then the speech segments are input into the deep neural network for target sound detection; and finally the prediction result of each speech frame is obtained, i.e., whether each speech frame belongs to the target.
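
The flow of Fig. 6 can be summarized, in simplified form, as the following sketch. The crude energy gate used here in place of the noise-estimation based silence detection, the frame and segment sizes, and the `model` callable are all placeholders for illustration, not part of this application.

```python
import numpy as np

FRAME_LEN = 400      # 25 ms at 16 kHz (illustrative)
SEGMENT_LEN = 64     # frames per segment (illustrative)

def detect_target_sound(audio: np.ndarray, model, energy_floor: float = 1e-4) -> np.ndarray:
    """Simplified end-to-end flow: frame the audio, drop silent frames,
    build overlapping segments, run the network, majority-vote per frame.

    `model` is assumed to map a (SEGMENT_LEN, FRAME_LEN) array to SEGMENT_LEN
    outputs (labels or probabilities). Returns one boolean per retained speech frame.
    """
    n = len(audio) // FRAME_LEN
    frames = audio[:n * FRAME_LEN].reshape(n, FRAME_LEN)
    # crude stand-in for the noise-estimation based silence detection
    speech = frames[np.mean(frames ** 2, axis=1) > energy_floor]
    if len(speech) < SEGMENT_LEN:
        return np.zeros(len(speech), dtype=bool)
    segments = np.stack([speech[i:i + SEGMENT_LEN]
                         for i in range(len(speech) - SEGMENT_LEN + 1)])
    preds = np.stack([np.asarray(model(seg)) for seg in segments])
    votes = np.zeros(len(speech))
    total = np.zeros(len(speech))
    for s in range(len(segments)):
        votes[s:s + SEGMENT_LEN] += (preds[s] >= 0.5)
        total[s:s + SEGMENT_LEN] += 1
    return votes * 2 > total
```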

S222: using the data determined from the speech data to come from the target as the sound data of the target.

After determining, through the deep neural network, whether each speech frame in the speech data belongs to the target, the sound data of the target can be extracted from the speech data.

Based on the same inventive concept, as an implementation of the method, the embodiment of the application also provides a sound detection device. Fig. 7 is a schematic structural diagram of a sound detection apparatus in an embodiment of the present application, and referring to fig. 7, the apparatus may include:

the obtaining module 701 is configured to obtain audio data to be detected.

A determining module 702, configured to determine a type of each frame of data in the audio data, where the type includes speech and silence.

The prediction module 703 is configured to input the speech data corresponding to the frame belonging to the speech type in the audio data into a deep neural network, so as to obtain sound data belonging to a target.

Further, as a refinement and an extension of the apparatus shown in fig. 7, the embodiment of the present application further provides a sound detection apparatus. Fig. 8 is a schematic structural diagram of a second sound detection apparatus in an embodiment of the present application, and referring to fig. 8, the apparatus may include:

an obtaining module 801, configured to obtain audio data to be detected.

The smoothing module 802 is configured to smooth the power of each frequency point in the first frequency spectrum to obtain the processed power of each frequency point in the first frequency spectrum.

The first preset module 803 is specifically configured to:

and constructing a probability distribution model of each frequency point in the first frequency spectrum.

And integrating the probability curve in the probability distribution model to obtain two power values corresponding to the preset probability in the probability distribution model.

And taking the two power values as the preset power range of the corresponding frequency point.
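
As one possible realization of this module, the sketch below fits a Gaussian to the recent normalized powers of each frequency point and takes the two power values enclosing a preset probability mass as the preset power range. The Gaussian family and the symmetric split of the residual probability are assumptions; this application only requires some probability distribution model per frequency point.

```python
import numpy as np
from scipy.stats import norm

def preset_power_range(power_history: np.ndarray, preset_prob: float = 0.95):
    """Estimate a per-bin power range covering `preset_prob` of the probability mass.

    power_history: shape (num_frames, num_bins), normalized powers of past frames.
    Returns (low, high), each of shape (num_bins,): the two power values whose
    enclosed probability (the integral of the fitted density between them)
    equals preset_prob.
    """
    mean = power_history.mean(axis=0)
    std = power_history.std(axis=0) + 1e-12   # avoid a zero scale
    tail = (1.0 - preset_prob) / 2.0
    low = norm.ppf(tail, loc=mean, scale=std)
    high = norm.ppf(1.0 - tail, loc=mean, scale=std)
    return low, high
```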

The second preset module 804 is specifically configured to:

when the power of a target frequency point in the first frequency spectrum is greater than the minimum power of the corresponding frequency point in a second frequency spectrum, the minimum power of the target frequency point is determined from the power of the target frequency point in the first frequency spectrum, the power of the corresponding frequency point in the second frequency spectrum, and the minimum power of the corresponding frequency point in the second frequency spectrum; the second frame data corresponding to the second frequency spectrum is the frame preceding the first frame data in the audio data.

And when the power of the target frequency point in the first frequency spectrum is less than or equal to the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is the power of the target frequency point in the first frequency spectrum.
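
A possible form of this nonlinear minimum tracking is sketched below. The specific recursion and the constants gamma and beta follow the classic continuous minimum-tracking rule and are assumptions rather than values fixed by this application.

```python
import numpy as np

def track_minimum_power(power_cur, power_prev, min_prev, gamma=0.998, beta=0.96):
    """Nonlinear minimum tracking of per-bin power (noise floor estimate).

    Where the current power rises above the previous minimum, the minimum is
    updated from the current power, the previous power and the previous minimum;
    otherwise it simply follows the current power.
    All arguments are arrays of shape (num_bins,).
    """
    rising = power_cur > min_prev
    tracked = gamma * min_prev + ((1.0 - gamma) / (1.0 - beta)) * (power_cur - beta * power_prev)
    return np.where(rising, tracked, power_cur)
```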

A weight module 805, configured to determine a weight corresponding to each frequency point according to the normalized power of each frequency point in the first spectrum, where the weight is positively correlated to the normalized power.

The determining module 806 includes:

the acquiring unit 8061 is configured to acquire first frame data in the audio data.

The transforming unit 8062 is configured to perform fourier transform on the first frame data to obtain a first frequency spectrum, where the first frequency spectrum includes multiple frequency points.

And the calculating unit 8063 is configured to calculate power of each frequency point in the first spectrum.

The calculating unit 8063 is specifically configured to calculate the power of preset frequency points in the first frequency spectrum, where the preset frequency points are the frequency points in the first half of the frequency points of the first frequency spectrum.

A determining unit 8064, configured to determine the type of the first frame data based on the power of each frequency point in the first spectrum.

The determining unit 8064 is specifically configured to:

subtracting the minimum power of the corresponding frequency point from the power of each frequency point in the first frequency spectrum to obtain the power of the de-noised sound signal of each frequency point, wherein the minimum power is obtained based on nonlinear tracking of the frequency points;

and normalizing the power of the de-noised sound signals of each frequency point to obtain the normalized power of each frequency point.

And determining the type of the first frame data based on the normalized power of each frequency point in the first frequency spectrum.
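
A minimal sketch of the de-noising and normalization steps follows. Clamping the result at zero and normalizing by the per-frame maximum are assumptions; this application only states that the de-noised powers are normalized.

```python
import numpy as np

def denoise_and_normalize(power_cur, min_power):
    """Subtract the tracked minimum (noise floor) from each bin's power and
    normalize the result across the bins of the frame.

    power_cur, min_power: arrays of shape (num_bins,).
    Returns the normalized de-noised powers in [0, 1].
    """
    denoised = np.maximum(power_cur - min_power, 0.0)
    peak = denoised.max()
    return denoised / peak if peak > 0 else denoised
```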

The determining unit 8064 is specifically configured to:

and when the normalized power of each frequency point in the first frequency spectrum is within a preset power range, marking a first label for the corresponding frequency point, wherein the first label is used for representing that the first frame data belongs to voice on the corresponding frequency point.

And when the normalized power of each frequency point in the first frequency spectrum is not within the preset power range, marking a second label for the corresponding frequency point, wherein the second label is used for representing that the first frame data belongs to silence on the corresponding frequency point.

And generating a label sequence of each frequency point in the first frequency spectrum.

Determining a type of the first frame data based on the tag sequence.

The determining unit 8064 is specifically configured to:

and carrying out a weighted average of the label sequence with the weights of the corresponding frequency points to obtain the speech confidence of the first frame data corresponding to the first frequency spectrum.

And when the voice confidence coefficient of the first frame data is greater than a preset voice confidence coefficient, determining that the type of the first frame data is voice.

And when the voice confidence coefficient of the first frame data is less than or equal to a preset voice confidence coefficient, determining that the type of the first frame data is mute.
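
Putting the labelling, weighting and confidence steps together, a frame-level decision might look like the following sketch. Using the normalized power itself as the positively correlated weight and a confidence threshold of 0.5 are assumptions for illustration.

```python
import numpy as np

def classify_frame(norm_power, low, high, confidence_threshold=0.5):
    """Decide whether a frame is speech or silence from its per-bin normalized powers.

    norm_power, low, high: arrays of shape (num_bins,); (low, high) is the preset
    power range of each bin. Bins whose normalized power falls inside the range
    are labelled 1 (speech), otherwise 0 (silence). The labels are then combined
    by a weighted average to give the speech confidence of the frame.
    """
    labels = ((norm_power >= low) & (norm_power <= high)).astype(float)
    weights = norm_power                      # weight positively correlated with power
    if weights.sum() <= 0:
        return "silence"
    confidence = float(np.average(labels, weights=weights))
    return "speech" if confidence > confidence_threshold else "silence"
```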

The prediction module 807 includes:

a first prediction unit 8071, configured to input a first speech segment in the speech data into the deep neural network, to obtain a prediction result of a first speech frame and a second speech frame in the first speech segment, where the prediction result is used to characterize whether a corresponding speech frame comes from the target.

A second prediction unit 8072, configured to input a second speech segment in the speech data into the deep neural network, to obtain prediction results of a second speech frame and a third speech frame in the second speech segment, where the second speech frame in the first speech segment and the second speech frame in the second speech segment are the same frame in the speech data.

A target prediction unit 8073, configured to determine whether the second speech frame is from the target according to a prediction result of the second speech frame in the first speech segment and a prediction result of the second speech frame in the second speech segment.

An extracting unit 8074, configured to use the data determined from the speech data to come from the target as the sound data of the target.

It should be noted here that the above description of the apparatus embodiments is similar to the description of the method embodiments and has similar advantageous effects. For technical details not disclosed in the apparatus embodiments of the present application, refer to the description of the method embodiments of the present application.

Based on the same inventive concept, the embodiment of the application also provides an electronic device. Fig. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application. Referring to Fig. 9, the electronic device may include: a processor 901, a memory 902, and a bus 903; the processor 901 and the memory 902 communicate with each other through the bus 903; the processor 901 is configured to call program instructions in the memory 902 to perform the method in one or more of the embodiments described above.

It should be noted here that the above description of the electronic device embodiment is similar to the description of the method embodiments and has similar advantageous effects. For technical details not disclosed in the electronic device embodiment of the present application, refer to the description of the method embodiments of the present application.

Based on the same inventive concept, the embodiment of the present application further provides a computer-readable storage medium. The storage medium may include a stored program; when the program runs, the device in which the storage medium is located is controlled to execute the method in one or more of the above embodiments.

It should be noted here that the above description of the storage medium embodiment is similar to the description of the method embodiments and has similar advantageous effects. For technical details not disclosed in the storage medium embodiment of the present application, refer to the description of the method embodiments of the present application.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
