Mute voice detection method, device, terminal equipment and storage medium

Document No.: 662570    Publication date: 2021-04-27

Reading note: This technique, "Mute voice detection method, device, terminal equipment and storage medium", was designed and created by 许锋刚 (Xu Fenggang) on 2020-12-08. Its main content is as follows: The application belongs to the technical field of voice detection and provides a mute voice detection method, apparatus, terminal device and storage medium. The method comprises: segmenting the user voice to obtain segmented voices; performing silence endpoint detection on each segmented voice; if a silence segment is detected in any segmented voice, marking the silence segment and calculating the total silence duration across the different silence segments according to the silence marks; and if the total silence duration is greater than a duration threshold, judging that the user voice is mute voice. By using silence endpoint detection to detect whether silence segments exist in the segmented voices, and judging from the total silence duration across the detected silence segments, the application can accurately determine whether the user voice is mute voice without converting it from a time-domain signal to a frequency-domain signal, which improves both the efficiency and the accuracy of mute voice detection.

1. A method for detecting mute speech, comprising:

acquiring user voice, and segmenting the user voice to obtain segmented voice;

performing silence endpoint detection on the segmented voices respectively, wherein the silence endpoint detection is used for detecting whether silence segments exist in the segmented voices, and the silence segments comprise silence starting points and silence end points;

if the silence segment exists in any segmented voice, carrying out silence marking on the silence segment, and calculating the total silence duration between different silence segments according to the silence mark;

and if the total mute time is greater than a time threshold, judging that the user voice is mute voice.

2. The method according to claim 1, wherein the performing silence end point detection on the segmented voices respectively comprises:

respectively extracting sample entropies of the voice frames in the segmented voice;

if the sample entropy of the voice frame is larger than a first threshold value, judging that the voice frame is the mute starting point in the corresponding segmented voice;

if the sample entropy of the speech frame is greater than a second threshold and smaller than the first threshold, acquiring a short-time zero-crossing rate of the speech frame, wherein the second threshold is smaller than the first threshold;

if the short-time zero crossing rate of the voice frame is smaller than a third threshold value, judging that the voice frame is the mute end point in the corresponding segmented voice;

and if the same segmented voice has the mute starting point and the mute end point, judging that the voice formed by the voice frames between the mute starting point and the mute end point is the mute segment, wherein no voice frame serving as a mute starting point or as a mute end point exists between the mute starting point and the mute end point.

3. The method according to claim 1, wherein said silence marking the silence segments and calculating the total silence duration between different silence segments according to the silence markers comprises:

marking the starting point of the voice frame corresponding to the mute starting point in the mute segment, and marking the end point of the voice frame corresponding to the mute end point in the mute segment;

acquiring a mute time starting point corresponding to the mute segment according to the starting point mark, and acquiring a mute time end point corresponding to the mute segment according to the end point mark;

and calculating the mute duration corresponding to the mute sections according to the start point and the end point of the mute time, and calculating the sum of the mute durations among different mute sections to obtain the total mute duration.

4. The method of claim 1, wherein after the obtaining the user speech, the method further comprises:

inputting the user voice into a low-pass filter for voice filtering, and performing voice sampling and voice quantization on the user voice after voice filtering;

pre-emphasis processing is performed on the user speech after speech sampling and speech quantization, wherein the pre-emphasis processing is used for increasing the high-frequency resolution of the user speech.

5. The method of claim 1, wherein the segmenting the user speech to obtain segmented speech comprises:

acquiring a voice recording scene of the user voice, and inquiring a voice segmentation value according to the voice recording scene;

and segmenting the user voice according to the queried voice segmentation value to obtain the segmented voice.

6. The method according to claim 2, further comprising, after the extracting sample entropies of the speech frames in the segmented speech respectively:

and if the sample entropy of the voice frame is smaller than the second threshold value, judging that the voice frame is voice noise in the corresponding segmented voice, and performing voice filtering on the segmented voice according to the voice noise.

7. The method according to claim 3, wherein after calculating the mute duration corresponding to the mute segment according to the start point and the end point of the mute time, the method further comprises:

and if any mute time length is greater than the time length threshold value, directly judging that the user voice is the mute voice.

8. A silent speech detection device, comprising:

the voice segmentation unit is used for acquiring user voice and segmenting the user voice to obtain segmented voice;

a silence end point detection unit, configured to perform silence end point detection on the segmented voices respectively, where the silence end point detection is configured to detect whether a silence segment exists in the segmented voices, and the silence segment includes a silence start point and a silence end point;

a silence marking unit, configured to perform silence marking on the silence segments if it is detected that the silence segments exist in any of the segmented voices, and calculate total silence durations between different silence segments according to the silence marks;

and the mute judgment unit is used for judging that the user voice is mute voice if the total mute time is greater than a time threshold.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.

10. A storage medium storing a computer program, characterized in that the computer program realizes the steps of the method according to any one of claims 1 to 7 when executed by a processor.

Technical Field

The present application relates to the field of voice detection, and in particular, to a method and an apparatus for detecting a mute voice, a terminal device, and a storage medium.

Background

With the rapid development of artificial intelligence, the robot industry has also risen rapidly, and voice recording has received increasing attention as an important step by which a robot uploads voice and receives and issues instructions. Automatic Speech Recognition (ASR) plays an important role as an integral part of recognizing user speech for services. When the terminal device triggers the microphone to open and a user makes a voice call, silence detection needs to be performed on the received user voice to judge whether the user's call has ended; when it is detected that the call has ended, a stop-recording instruction is issued to the terminal.

In the existing mute voice detection process, voice silence detection is performed by means of spectral envelope detection. However, because spectral envelope detection must first convert the user voice from a time-domain signal to a frequency-domain signal, the mute voice detection operation is complex and its accuracy is low.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and an apparatus for detecting a mute speech, a terminal device, and a storage medium, so as to solve the problem of low accuracy of mute speech detection due to performing speech mute detection in a spectral envelope detection manner in the mute speech detection process in the prior art.

A first aspect of an embodiment of the present application provides a method for detecting a mute speech, including:

acquiring user voice, and segmenting the user voice to obtain segmented voice;

performing silence endpoint detection on the segmented voices respectively, wherein the silence endpoint detection is used for detecting whether silence segments exist in the segmented voices, and the silence segments comprise silence starting points and silence end points;

if the silence segment exists in any segmented voice, carrying out silence marking on the silence segment, and calculating the total silence duration between different silence segments according to the silence mark;

and if the total mute time is greater than a time threshold, judging that the user voice is mute voice.

Further, the performing silence endpoint detection on the segmented voices respectively includes:

respectively extracting sample entropies of the voice frames in the segmented voice;

if the sample entropy of the voice frame is larger than a first threshold value, judging that the voice frame is the mute starting point in the corresponding segmented voice;

if the sample entropy of the speech frame is greater than a second threshold and smaller than the first threshold, acquiring a short-time zero-crossing rate of the speech frame, wherein the second threshold is smaller than the first threshold;

if the short-time zero crossing rate of the voice frame is smaller than a third threshold value, judging that the voice frame is the mute end point in the corresponding segmented voice;

and if the same segmented voice has the mute starting point and the mute end point, judging that the voice formed by the voice frames between the mute starting point and the mute end point is the mute segment, wherein no voice frame serving as a mute starting point or as a mute end point exists between the mute starting point and the mute end point.

Further, said mute marking said mute segments, and calculating the total mute duration between different said mute segments according to said mute marking, includes:

marking the starting point of the voice frame corresponding to the mute starting point in the mute segment, and marking the end point of the voice frame corresponding to the mute end point in the mute segment;

acquiring a mute time starting point corresponding to the mute segment according to the starting point mark, and acquiring a mute time end point corresponding to the mute segment according to the end point mark;

and calculating the mute duration corresponding to the mute sections according to the start point and the end point of the mute time, and calculating the sum of the mute durations among different mute sections to obtain the total mute duration.

Further, after the obtaining of the user voice, the method further includes:

inputting the user voice into a low-pass filter for voice filtering, and performing voice sampling and voice quantization on the user voice after voice filtering;

pre-emphasis processing is performed on the user speech after speech sampling and speech quantization, wherein the pre-emphasis processing is used for increasing the high-frequency resolution of the user speech.

Further, the segmenting the user speech to obtain segmented speech includes:

acquiring a voice recording scene of the user voice, and inquiring a voice segmentation value according to the voice recording scene;

and segmenting the user voice according to the queried voice segmentation value to obtain the segmented voice.

Further, after the extracting sample entropies of the speech frames in the segmented speech respectively, the method further includes:

and if the sample entropy of the voice frame is smaller than the second threshold value, judging that the voice frame is voice noise in the corresponding segmented voice, and performing voice filtering on the segmented voice according to the voice noise.

Further, after the calculating the mute duration corresponding to the mute segment according to the start point of the mute time and the end point of the mute time, the method further includes:

and if any mute time length is greater than the time length threshold value, directly judging that the user voice is the mute voice.

A second aspect of the embodiments of the present application provides a silent speech detection apparatus, including:

the voice segmentation unit is used for acquiring user voice and segmenting the user voice to obtain segmented voice;

a silence end point detection unit, configured to perform silence end point detection on the segmented voices respectively, where the silence end point detection is configured to detect whether a silence segment exists in the segmented voices, and the silence segment includes a silence start point and a silence end point;

a silence marking unit, configured to perform silence marking on the silence segments if it is detected that the silence segments exist in any of the segmented voices, and calculate total silence durations between different silence segments according to the silence marks;

and the mute judgment unit is used for judging that the user voice is mute voice if the total mute time is greater than a time threshold.

A third aspect of the embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the mute voice detection method provided by the first aspect when executing the computer program.

A fourth aspect of the embodiments of the present application provides a storage medium, which stores a computer program that, when executed by a processor, implements the steps of the silent speech detection method provided by the first aspect.

The mute voice detection method, apparatus, terminal device and storage medium provided by the embodiments of the present application have the following beneficial effects. Acquiring and segmenting the user voice effectively facilitates silence endpoint detection on the different segmented voices and prevents the low accuracy that results from performing silence detection directly on the user voice. Marking the silence segments effectively facilitates calculating the total silence duration across different silence segments, which improves detection efficiency, and whether the user voice is mute voice can be effectively detected by comparing the calculated total silence duration against the duration threshold. Because silence endpoint detection is used to detect whether silence segments exist in the segmented voices, and the judgment is based on the total silence duration across the detected silence segments, the user voice can be accurately judged to be mute voice or not without converting it from a time-domain signal to a frequency-domain signal, improving both the detection efficiency and accuracy of mute voice detection.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without inventive effort.

Fig. 1 is a flowchart illustrating an implementation of a method for detecting a mute speech according to an embodiment of the present application;

fig. 2 is a flowchart illustrating an implementation of a method for detecting a mute speech according to another embodiment of the present application;

fig. 3 is a block diagram of a mute speech detection apparatus according to an embodiment of the present application;

fig. 4 is a block diagram of a terminal device according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The mute speech detection method according to the embodiment of the present application may be executed by a control device or a terminal (hereinafter referred to as a "mobile terminal").

Referring to fig. 1, fig. 1 shows a flowchart of an implementation of a mute speech detection method according to an embodiment of the present application, including:

step S10, obtaining user voice, and segmenting the user voice to obtain segmented voice.

Segmenting the user voice effectively facilitates silence endpoint detection on the different segmented voices, and prevents the low silence-detection accuracy caused by performing spectral envelope detection directly on the user voice.

Specifically, in this step, the segmenting the user speech to obtain segmented speech includes:

acquiring a voice recording scene of the user voice, and inquiring a voice segmentation value according to the voice recording scene;

segmenting the user voice according to the queried voice segmentation value to obtain the segmented voice;

the method comprises the steps of obtaining environment information of a user corresponding to user voice, determining a corresponding voice recording scene according to the obtained environment information, matching the determined voice recording scene with a partition value query table to obtain a voice partition value corresponding to the user, wherein the voice partition value is preset duration, and the voice partition value is used for partitioning the user voice into voice segments with the same duration based on the preset duration, and the partition value query table stores corresponding relations between different voice recording scenes and the corresponding voice partition values.

The voice segmentation values corresponding to different voice recording scenes in the segmentation value lookup table may be the same or different, so that the segmentation of the user voice can be effectively adapted to different voice recording scenes, which improves the practicality of mute voice detection.
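A minimal sketch of this scene-driven segmentation, assuming the signal is a plain list of samples; the scene names and durations in the lookup table are purely illustrative, since the text does not specify concrete entries:

```python
def split_speech(samples, scene, segmentation_table, sample_rate=8000):
    """Split a speech signal into segments of equal duration.

    `segmentation_table` stands in for the segmentation value lookup
    table described above: it maps a voice recording scene to a voice
    segmentation value (a preset duration, in seconds)."""
    seg_len = int(segmentation_table[scene] * sample_rate)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]

# Hypothetical lookup table entries (not given in the text):
table = {"call_center": 1.0, "quiet_room": 2.0}
segments = split_speech([0.0] * 24000, "call_center", table)
# 3 seconds of 8 kHz audio split into three 1-second segments
```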

Optionally, in this step, after the obtaining of the user voice, the method further includes:

inputting the user voice into a low-pass filter for voice filtering, and performing voice sampling and voice quantization on the user voice after voice filtering;

pre-emphasis processing is carried out on the user voice after voice sampling and voice quantization;

the voice filtering is carried out by inputting the user voice into the low-pass filter, so that the noise in the user voice is effectively removed, and the accuracy of subsequent mute end point detection on the segmented voice is improved.

Specifically, voice sampling and voice quantization are performed on the user voice after voice filtering, which effectively preprocesses the user voice.

In this step, the purpose of the pre-emphasis processing is to emphasize the high-frequency part of the user voice, so as to remove the influence of lip radiation and increase the high-frequency resolution of the user voice.
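Pre-emphasis is conventionally a first-order high-pass filter; a sketch, with the common coefficient 0.97 assumed since the text does not specify one:

```python
def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    Boosts the high-frequency part of the signal, offsetting the
    roll-off attributed to lip radiation."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

# A constant (DC) signal is strongly attenuated after the first sample:
y = pre_emphasis([1.0, 1.0, 1.0])
```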

Step S20, performing silence endpoint detection on the segmented voices respectively.

Wherein the silence endpoint detection is used for detecting whether a silence segment exists in the segmented voice, and the silence segment comprises a silence starting point and a silence ending point.

In this step, if the segmented speech corresponding to the user speech does not have the silent segment, it is determined that the user speech is not the silent speech, the user speech is continuously acquired, and the step of "segmenting the user speech to obtain the segmented speech" is continuously performed on the acquired user speech.

Step S30, if it is detected that the silence segment exists in any of the segmented voices, the silence segment is marked, and the total silence duration between different silence segments is calculated according to the silence mark.

The mute mark is used for marking the voice position of the corresponding mute segment in the segmented voice, so that the calculation of the mute duration corresponding to the mute segment is effectively facilitated, and the accuracy of the calculation of the total mute duration between different mute segments is further ensured.

Specifically, in this step, the mute marking the mute sections and calculating the total mute duration between different mute sections according to the mute marking includes:

starting point marking is carried out on the voice frame corresponding to the mute starting point in the mute segment, and end point marking is carried out on the voice frame corresponding to the mute end point in the mute segment. The start point mark and the end point mark may use preset identifiers, which can be set according to requirements; for example, in this step the identifier "0" may be used to mark the voice frame corresponding to the mute starting point, and the identifier "1" may be used to mark the voice frame corresponding to the mute end point;

acquiring a mute time starting point corresponding to the mute segment according to the starting point mark, and acquiring a mute time end point corresponding to the mute segment according to the end point mark;

and calculating the mute duration corresponding to the mute segments according to the start point and the end point of the mute time, and calculating the sum of the mute durations among different mute segments to obtain the total mute duration, wherein the total mute duration expresses the sum of the mute durations of the different mute segments in the user voice.

For example, after the user speech is subjected to speech segmentation, segmented speech a1, segmented speech a2 and segmented speech A3 are obtained, when it is detected that a silence segment a1 exists in the segmented speech a1, a silence segment a2 exists in the segmented speech a2 and a silence segment A3 exists in the segmented speech A3, silence marking is performed on a silence segment a1, a silence segment a2 and a silence segment A3 respectively, silence durations corresponding to the silence segment a1, the silence segment a2 and the silence segment A3 are obtained according to silence marking results of the silence segment a1, the silence segment a2 and the silence segment A3 respectively, a silence duration b1, a silence duration b2 and a silence duration b3 are obtained, and the sum of the silence durations b1, the silence duration b2 and the silence duration b3 is calculated, so that the total silence duration corresponding to the user speech is obtained.
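The duration bookkeeping in this example can be sketched as follows; the 10 ms frame length is an assumption, since the text does not fix one:

```python
def total_silence_duration(silence_marks, frame_ms=10):
    """Sum the silence durations of all marked silence segments.

    `silence_marks` holds one (start_frame, end_frame) pair per silence
    segment, i.e. the frame indices carrying the start-point mark ("0")
    and the end-point mark ("1"); `frame_ms` is the assumed frame length
    in milliseconds."""
    return sum((end - start + 1) * frame_ms for start, end in silence_marks)

# Silence segments a1, a2, a3 marked at these frame ranges yield
# durations b1 = 100 ms, b2 = 100 ms, b3 = 200 ms:
total = total_silence_duration([(5, 14), (30, 39), (60, 79)])
```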

Step S40, if the total mute duration is greater than the duration threshold, determining that the user speech is a mute speech.

The duration threshold may be set according to requirements; for example, it may be set to 100 milliseconds, 200 milliseconds, or 300 milliseconds. The duration threshold is used to determine whether the user voice corresponding to the total mute duration is mute voice.

Preferably, in this embodiment, after the calculating the mute duration corresponding to the mute segment according to the start point of the mute time and the end point of the mute time, the method further includes: and if any mute time length is greater than the time length threshold value, directly judging that the user voice is the mute voice.
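Both decision rules (the comparison of the total duration against the threshold, and the direct judgment on any single duration) can be sketched together; the 200 ms default is one of the example thresholds above:

```python
def is_mute_speech(silence_durations_ms, threshold_ms=200):
    """Decide whether the user voice is mute voice.

    Any single silence duration above the threshold decides directly;
    otherwise the total silence duration is compared."""
    if any(d > threshold_ms for d in silence_durations_ms):
        return True
    return sum(silence_durations_ms) > threshold_ms
```

For example, three short silences of 50, 80, and 90 ms total 220 ms and exceed a 200 ms threshold, while a single 250 ms silence triggers the direct judgment.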

In this embodiment, acquiring and segmenting the user voice effectively facilitates silence endpoint detection on the different segmented voices and prevents the low accuracy that results from performing silence detection directly on the user voice. Marking the silence segments effectively facilitates calculating the total silence duration across different silence segments, which improves detection efficiency, and whether the user voice is mute voice can be effectively detected by comparing the calculated total silence duration against the duration threshold. Because silence endpoint detection is used to detect whether silence segments exist in the segmented voices, and the judgment is based on the total silence duration across the detected silence segments, the user voice can be accurately judged to be mute voice or not without converting it from a time-domain signal to a frequency-domain signal, improving both the detection efficiency and accuracy of mute voice detection.

Referring to fig. 2, fig. 2 is a flowchart illustrating an implementation of a mute speech detection method according to another embodiment of the present application. With respect to the embodiment corresponding to fig. 1, the mute speech detection method provided in this embodiment is used to further refine step S20, and includes:

step S21, respectively extracting sample entropies of the voice frames in the segmented voice;

the Sample Entropy (Sample Entropy) measures the complexity of a time sequence by measuring the probability of generating a new pattern in a signal, the greater the probability of generating the new pattern is, the greater the complexity of the sequence is, and the lower the value of the Sample Entropy is, the higher the sequence self-similarity is; the larger the value of the sample entropy, the more complex the sample sequence.

Specifically, in this step, silence endpoint detection can be effectively performed on the segmented speech based on the sample entropy, so as to identify the silence start point and the silence end point in the speech sample, thereby improving the accuracy of subsequent speech filtering on the speech sample.

Step S22, if the sample entropy of the speech frame is greater than a first threshold, determining that the speech frame is the silence starting point in the corresponding segmented speech.

In this step, if the sample entropy remains greater than the first threshold for longer than a first preset time, the speech frame is determined to be the silence start point in the corresponding segmented speech.

Step S23, if the sample entropy of the speech frame is greater than the second threshold and smaller than the first threshold, obtaining the short-time zero-crossing rate of the speech frame.

The second threshold is smaller than the first threshold. The short-time zero-crossing rate refers to the number of times the signal within each frame passes through zero: for a continuous speech signal with time on the horizontal axis, the zero crossings can be observed where the time-domain waveform of the speech crosses the horizontal axis; for a discrete-time speech signal, a zero crossing is known to occur when adjacent sampling points have different algebraic signs. The short-time zero-crossing rate of a speech frame can therefore be calculated from the detected zero-crossing occurrences.
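The zero-crossing count described here reduces to a sign comparison between adjacent sampling points; a sketch, normalizing by the number of adjacent pairs (one convention among several):

```python
def short_time_zcr(frame):
    """Short-time zero-crossing rate of one speech frame: the fraction
    of adjacent sampling points whose algebraic signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# An alternating signal crosses zero at every pair; a monotone one never does.
high = short_time_zcr([1.0, -1.0, 1.0, -1.0])
low = short_time_zcr([1.0, 2.0, 3.0])
```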

In this step, if the sample entropy of the speech frame is greater than the second threshold and smaller than the first threshold, it indicates that whether the corresponding speech frame is a silence endpoint cannot be determined from the sample entropy alone. In this case, the short-time zero-crossing rate of the speech frame is acquired, and silence analysis is performed again on that basis to determine whether the speech frame corresponding to the sample entropy is a silence end point.

Step S24, if the short-time zero-crossing rate of the speech frame is less than a third threshold, determining that the speech frame is the mute end point in the corresponding segmented speech.

In this step, if the short-time zero crossing rate of the speech frame is less than the third threshold, it is determined that the speech frame is a mute end point in the corresponding segmented speech.

Step S25, if the same segmented speech includes the silence start point and the silence end point, determining that the speech formed by the speech frames between the silence start point and the silence end point is the silence segment.

And the voice frame as the mute starting point does not exist between the mute starting point and the mute end point, and the voice frame as the mute end point does not exist.
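The per-frame threshold logic of steps S21 to S24 can be sketched as a single classification function; the threshold values passed in are placeholders, since the text leaves them configurable:

```python
def classify_frame(entropy, zcr, first, second, third):
    """Classify one speech frame by the three thresholds of steps
    S21-S24: sample entropy above the first threshold marks a silence
    start point; entropy between the second and first thresholds defers
    to the short-time zero-crossing rate, which marks a silence end
    point when below the third threshold; entropy below the second
    threshold marks speech noise."""
    if entropy > first:
        return "silence_start"
    if second < entropy < first:
        return "silence_end" if zcr < third else "speech"
    if entropy < second:
        return "noise"
    return "speech"
```

A segment is then formed whenever the same segmented speech yields a "silence_start" frame followed by a "silence_end" frame with neither label appearing in between.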

Optionally, in this embodiment, after the respectively extracting sample entropies of the speech frames in the segmented speech, the method further includes: if the sample entropy of a speech frame is smaller than the second threshold, judging that the speech frame is speech noise in the corresponding segmented speech, and performing speech filtering on the segmented speech according to the speech noise. The second threshold can thus also be used to judge whether the speech frame corresponding to a sample entropy value is speech noise; when speech noise is detected in any segmented speech, filtering that segmented speech removes the noise, which improves the speech quality of the user voice and thereby the accuracy of mute voice detection.

In this embodiment, the sample entropies of the speech frames in the segmented speech are respectively extracted, and whether the speech frame corresponding to a sample entropy is a silence start point is judged by comparing the sample entropy against the first threshold. If the sample entropy of a speech frame is greater than the second threshold and smaller than the first threshold, the short-time zero-crossing rate of the speech frame is acquired and compared against the third threshold to judge whether the frame is a silence end point. This effectively improves the detection of silence segments in the segmented speech.

Referring to fig. 3, fig. 3 is a block diagram of a mute speech detection apparatus 100 according to an embodiment of the present application. In this embodiment, the mute speech detection apparatus 100 includes units for executing the steps in the embodiments corresponding to fig. 1 and fig. 2; for details, refer to fig. 1 and fig. 2 and the related descriptions in their corresponding embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 3, the apparatus comprises a speech segmentation unit 10, a silence endpoint detection unit 11, a silence marking unit 12, and a silence judgment unit 13, wherein:

the speech segmentation unit 10 is configured to acquire a user speech and segment the user speech to obtain a segmented speech.

Wherein the speech segmentation unit 10 is further configured to: input the user speech into a low-pass filter for speech filtering, and perform speech sampling and speech quantization on the filtered user speech;

and perform pre-emphasis processing on the user speech after speech sampling and speech quantization, wherein the pre-emphasis processing is used to increase the high-frequency resolution of the user speech.
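Pre-emphasis as described here is conventionally a first-order high-pass difference; a minimal sketch, assuming the common coefficient alpha = 0.97 (the embodiment does not specify a value):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies, raising the high-frequency
    resolution of the user speech as the embodiment describes."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]  # y[0] is passed through unchanged
    return y
```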

Optionally, the speech segmentation unit 10 is further configured to: acquire the speech recording scene of the user speech, and query the speech segmentation value according to the speech recording scene;

and segment the user speech according to the queried speech segmentation value to obtain the segmented speech.
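The scene-to-segmentation-value lookup can be sketched as below; the scene names, segment lengths, and default are hypothetical placeholders, since the embodiment does not enumerate concrete scenes or values.

```python
# Hypothetical mapping from recording scene to the speech
# segmentation value (segment length in seconds).
SCENE_SEGMENT_SECONDS = {
    "phone_call": 1.0,
    "meeting": 2.0,
    "dictation": 0.5,
}

def split_user_speech(samples, fs, scene, default_seconds=1.0):
    """Query the segmentation value for the recording scene, then cut
    the user speech into fixed-length segments of that value."""
    seconds = SCENE_SEGMENT_SECONDS.get(scene, default_seconds)
    seg_len = int(seconds * fs)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
```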

A silence endpoint detection unit 11, configured to perform silence endpoint detection on the segmented speeches respectively, where the silence endpoint detection is used to detect whether a silence segment exists in the segmented speech, and the silence segment includes a silence start point and a silence end point.

Wherein the silence endpoint detection unit 11 is further configured to: respectively extract the sample entropies of the speech frames in the segmented speech;

if the sample entropy of a speech frame is greater than a first threshold, judge that the speech frame is the silence start point in the corresponding segmented speech;

if the sample entropy of a speech frame is greater than a second threshold and smaller than the first threshold, acquire the short-time zero-crossing rate of the speech frame, wherein the second threshold is smaller than the first threshold;

if the short-time zero-crossing rate of the speech frame is smaller than a third threshold, judge that the speech frame is the silence end point in the corresponding segmented speech;

and if the same segmented speech contains both the silence start point and the silence end point, judge that the speech formed by the speech frames between the silence start point and the silence end point is the silence segment, wherein no speech frame serving as a silence start point or a silence end point exists between the silence start point and the silence end point.

Optionally, the silence endpoint detection unit 11 is further configured to: if the sample entropy of a speech frame is smaller than the second threshold, judge that the speech frame is speech noise in the corresponding segmented speech, and perform speech filtering on the segmented speech according to the speech noise.

A silence marking unit 12, configured to perform silence marking on the silence segments if a silence segment is detected in any of the segmented speeches, and to calculate the total silence duration between different silence segments according to the silence marks.

Wherein the silence marking unit 12 is further configured to: apply a start-point mark to the speech frame corresponding to the silence start point in the silence segment, and apply an end-point mark to the speech frame corresponding to the silence end point in the silence segment;

acquire the silence time start point of the silence segment according to the start-point mark, and acquire the silence time end point of the silence segment according to the end-point mark;

and calculate the silence duration of each silence segment according to its silence time start point and silence time end point, and sum the silence durations of the different silence segments to obtain the total silence duration.

Optionally, the silence marking unit 12 is further configured to: if any single silence duration is greater than the duration threshold, directly judge that the user speech is mute speech.

A silence judgment unit 13, configured to judge that the user speech is mute speech if the total silence duration is greater than a duration threshold.
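The duration accounting and final decision, including the optional single-segment shortcut, can be sketched as follows. The `(start_frame, end_frame)` mark representation and the `frame_seconds` parameter are illustrative assumptions; the embodiment only specifies marks, durations, and threshold comparisons.

```python
def segment_durations(marked_segments, frame_seconds):
    """Per-segment silence durations, given (start_frame, end_frame)
    marks and the duration of one speech frame in seconds."""
    return [(end - start + 1) * frame_seconds
            for start, end in marked_segments]

def is_mute_speech(marked_segments, frame_seconds, threshold_s):
    durations = segment_durations(marked_segments, frame_seconds)
    # Optional shortcut: any single silence duration over the
    # threshold decides immediately, without summing the rest.
    if any(d > threshold_s for d in durations):
        return True
    # Otherwise compare the total silence duration with the threshold.
    return sum(durations) > threshold_s
```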

In this embodiment, acquiring and segmenting the user speech effectively facilitates silence endpoint detection on the different segmented speeches, and prevents the low accuracy that results from performing silence detection directly on the whole user speech. Applying silence marks to the silence segments makes it convenient to calculate the total silence duration between different silence segments, which improves the efficiency of mute speech detection, and comparing the calculated total silence duration with the duration threshold effectively detects whether the user speech is mute speech. By detecting whether silence segments exist in the segmented speech through silence endpoint detection, and judging based on the total silence duration between the detected silence segments, whether the user speech is mute speech can be determined accurately without converting the user speech from a time-domain signal to a frequency-domain signal, which improves both the detection efficiency and the accuracy of mute speech detection.

Fig. 4 is a block diagram of a terminal device 2 according to another embodiment of the present application. As shown in fig. 4, the terminal device 2 of this embodiment includes: a processor 20, a memory 21, and a computer program 22, such as a program of the mute speech detection method, stored in the memory 21 and executable on the processor 20. The processor 20, when executing the computer program 22, implements the steps in the embodiments of the mute speech detection method described above, such as S10-S40 shown in fig. 1, or S21-S25 shown in fig. 2. Alternatively, when the processor 20 executes the computer program 22, the functions of the units in the embodiment corresponding to fig. 3 are implemented, for example the functions of the units 10 to 13 shown in fig. 3; for details, refer to the relevant description in the embodiment corresponding to fig. 3, which is not repeated herein.

Illustratively, the computer program 22 may be divided into one or more units, which are stored in the memory 21 and executed by the processor 20 to complete the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution of the computer program 22 in the terminal device 2. For example, the computer program 22 may be divided into a speech segmentation unit 10, a silence endpoint detection unit 11, a silence marking unit 12, and a silence judgment unit 13, each unit functioning as described above.

The terminal device may include, but is not limited to, a processor 20 and a memory 21. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the terminal device 2 and does not constitute a limitation of the terminal device 2, which may include more or fewer components than those shown, combine some components, or have different components; for example, the terminal device may also include input and output devices, network access devices, buses, etc.

The processor 20 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program and other programs and data required by the terminal device. The memory 21 may also be used to temporarily store data that has been output or is to be output.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
