Long voice endpoint detection method and device, storage medium and electronic equipment

Document No.: 1203030 Publication date: 2020-09-01

Reading note: This technology, "Long voice endpoint detection method and device, storage medium and electronic equipment", was designed by Huang Hongyun (黄洪运), Li Hongyan (李红岩) and Liu Yan (刘岩) on 2020-07-06. The disclosure belongs to the technical field of speech detection and relates to a long speech endpoint detection method and apparatus, a computer-readable storage medium and an electronic device. The method comprises: acquiring a speech signal of a long speech to be detected, and windowing the speech signal to obtain a detection window; determining a sampling point in the detection window as a start detection point, and calculating a start energy ratio of the start detection point; determining another sampling point as a termination detection point according to the start detection point, and calculating a termination energy ratio of the termination detection point; and determining the start detection point as a speech start point of the long speech to be detected according to the start energy ratio, and determining the termination detection point as a speech termination point of the long speech to be detected according to the termination energy ratio. The disclosure reduces the isolation of short speech segments, makes endpoint detection more coherent, reduces the complexity of subsequent merging, avoids missing valid speech segments, and improves the accuracy of endpoint detection.

1. A method for long speech endpoint detection, the method comprising:

acquiring a speech signal of a long speech to be detected, and windowing the speech signal to obtain a detection window;

determining a sampling point in the detection window as a start detection point, and calculating a start energy ratio of the start detection point;

determining another sampling point as a termination detection point according to the start detection point, and calculating a termination energy ratio of the termination detection point;

and determining the start detection point as a speech start point of the long speech to be detected according to the start energy ratio, and determining the termination detection point as a speech termination point of the long speech to be detected according to the termination energy ratio.

2. The method of claim 1, wherein said calculating a start energy ratio of the start detection point comprises:

acquiring a first preamble amplitude of a first preamble sampling point before the start detection point, and acquiring a first subsequent amplitude of a first subsequent sampling point after the start detection point;

and performing a calculation on the first preamble amplitude and the first subsequent amplitude to obtain the start energy ratio.

3. The method of claim 2, wherein obtaining the start energy ratio from the first preamble amplitude and the first subsequent amplitude comprises:

calculating a first preamble energy value from the first preamble amplitude, and calculating a first subsequent energy value from the first subsequent amplitude;

and calculating the start energy ratio from the first preamble energy value and the first subsequent energy value.

4. The long speech endpoint detection method of claim 1, wherein said calculating a termination energy ratio of the termination detection point comprises:

acquiring a second preamble amplitude of a second preamble sampling point before the termination detection point, and acquiring a second subsequent amplitude of a second subsequent sampling point after the termination detection point;

and calculating the second preamble amplitude and the second subsequent amplitude to obtain a termination energy ratio.

5. The method of claim 4, wherein obtaining the termination energy ratio from the second preamble amplitude and the second subsequent amplitude comprises:

calculating the second preamble amplitude to obtain a second preamble energy value, and calculating the second subsequent amplitude to obtain a second subsequent energy value;

and calculating the second preamble energy value and the second subsequent energy value to obtain a termination energy ratio.

6. The method according to claim 1, wherein said determining the start detection point as the speech start point of the long speech to be detected according to the start energy ratio comprises:

determining a starting ratio threshold corresponding to the starting energy ratio and comparing the starting energy ratio to the starting ratio threshold;

and determining the starting detection point as the voice starting point of the long voice to be detected according to the comparison result.

7. The long speech endpoint detection method according to claim 6, wherein said determining the termination detection point as the speech termination point of the long speech to be detected according to the termination energy ratio comprises:

calculating a reciprocal of the start ratio threshold and determining the reciprocal as a termination ratio threshold corresponding to the termination energy ratio;

and comparing the termination energy ratio with the termination ratio threshold, and determining the termination detection point as the voice termination point of the long voice to be detected according to the comparison result.

8. A long speech endpoint detection apparatus, comprising:

a windowing module configured to acquire a speech signal of a long speech to be detected and window the speech signal to obtain a detection window;

a start detection module configured to determine a sampling point in the detection window as a start detection point and calculate a start energy ratio of the start detection point;

a termination detection module configured to determine another sampling point as a termination detection point according to the start detection point and calculate a termination energy ratio of the termination detection point;

and a detection determining module configured to determine the start detection point as a speech start point of the long speech to be detected according to the start energy ratio, and determine the termination detection point as a speech termination point of the long speech to be detected according to the termination energy ratio.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a long speech endpoint detection method according to any one of claims 1 to 7.

10. An electronic device, comprising:

a processor;

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the long speech endpoint detection method of any of claims 1-7 via execution of the executable instructions.

Technical Field

The present disclosure relates to the field of voice detection technologies, and in particular, to a long voice endpoint detection method, a long voice endpoint detection apparatus, a computer-readable storage medium, and an electronic device.

Background

The voice signal end point detection technology is a very important direction in the voice signal processing technology, and aims to accurately detect a starting point and an end point of voice from a segment of signal containing voice so as to distinguish the voice signal from a non-voice signal. The effective voice signal endpoint detection can not only reduce the cost of voice data acquisition in systems such as voice recognition, voiceprint recognition and the like, save the processing time, but also eliminate the interference of a silent section and a noise section and improve the performance of the system.

The most widely used speech signal endpoint detection technique at present is the dual-threshold method, which distinguishes speech from non-speech according to two characteristic parameters: the short-term energy and the short-term zero-crossing rate of the signal. However, the dual-threshold method requires many thresholds to be set, and it is impractical to readjust them for every speech signal with different background noise. In addition, because the dual-threshold method relies on short-term features, it is prone to misjudgment when applied to long speech.

In view of the above, there is a need in the art to develop a new method and apparatus for detecting a long speech endpoint.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present disclosure is directed to a long voice endpoint detection method, a long voice endpoint detection apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to some extent, the problems of inaccurate detection and inapplicability of long voice due to the limitations of related technologies.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to a first aspect of the embodiments of the present invention, there is provided a long speech endpoint detection method, including: acquiring a speech signal of a long speech to be detected, and windowing the speech signal to obtain a detection window;

determining a sampling point in the detection window as a start detection point, and calculating a start energy ratio of the start detection point;

determining another sampling point as a termination detection point according to the start detection point, and calculating a termination energy ratio of the termination detection point;

and determining the start detection point as a speech start point of the long speech to be detected according to the start energy ratio, and determining the termination detection point as a speech termination point of the long speech to be detected according to the termination energy ratio.

In an exemplary embodiment of the present invention, the calculating the start energy ratio of the start detection point includes: acquiring a first preamble amplitude of a first preamble sampling point before the start detection point, and acquiring a first subsequent amplitude of a first subsequent sampling point after the start detection point;

and performing a calculation on the first preamble amplitude and the first subsequent amplitude to obtain the start energy ratio.

In an exemplary embodiment of the present invention, the calculating the first preamble amplitude and the first subsequent amplitude to obtain a starting energy ratio comprises: calculating the first preamble amplitude to obtain a first preamble energy value, and calculating the first subsequent amplitude to obtain a first subsequent energy value;

and calculating the first preamble energy value and the first subsequent energy value to obtain an initial energy ratio.

In an exemplary embodiment of the present invention, the calculating a termination energy ratio of the termination detection point includes: acquiring a second preamble amplitude of a second preamble sampling point before the termination detection point, and acquiring a second subsequent amplitude of a second subsequent sampling point after the termination detection point;

and calculating the second preamble amplitude and the second subsequent amplitude to obtain a termination energy ratio.

In an exemplary embodiment of the present invention, obtaining the termination energy ratio from the second preamble amplitude and the second subsequent amplitude comprises: calculating a second preamble energy value from the second preamble amplitude, and calculating a second subsequent energy value from the second subsequent amplitude;

and calculating the second preamble energy value and the second subsequent energy value to obtain a termination energy ratio.

In an exemplary embodiment of the present invention, the determining, according to the start energy ratio, the start detection point as the speech start point of the long speech to be detected includes: determining a starting ratio threshold corresponding to the starting energy ratio and comparing the starting energy ratio to the starting ratio threshold;

and determining the starting detection point as the voice starting point of the long voice to be detected according to the comparison result.

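In an exemplary embodiment of the present invention, the determining, according to the termination energy ratio, that the termination detection point is a speech termination point of the long speech to be detected includes: calculating a reciprocal of the start ratio threshold and determining the reciprocal as a termination ratio threshold corresponding to the termination energy ratio;

and comparing the termination energy ratio with the termination ratio threshold, and determining the termination detection point as the speech termination point of the long speech to be detected according to the comparison result.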

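According to a second aspect of the embodiments of the present invention, there is provided a long speech endpoint detection apparatus, including: a windowing module configured to acquire a speech signal of a long speech to be detected and window the speech signal to obtain a detection window;

a start detection module configured to determine a sampling point in the detection window as a start detection point and calculate a start energy ratio of the start detection point;

a termination detection module configured to determine another sampling point as a termination detection point according to the start detection point and calculate a termination energy ratio of the termination detection point;

and a detection determining module configured to determine the start detection point as a speech start point of the long speech to be detected according to the start energy ratio, and determine the termination detection point as a speech termination point of the long speech to be detected according to the termination energy ratio.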

According to a third aspect of embodiments of the present invention, there is provided an electronic apparatus including: a processor and a memory; wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the long speech endpoint detection method of any of the above exemplary embodiments.

According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the long speech endpoint detection method in any of the above-described exemplary embodiments.

As can be seen from the foregoing technical solutions, the long voice endpoint detection method, the long voice endpoint detection apparatus, the computer storage medium and the electronic device in the exemplary embodiments of the present invention have at least the following advantages and positive effects:

in the method and apparatus provided by the exemplary embodiments of the present disclosure, all the voice start points and voice end points in the long voice to be detected may be determined by calculating the start detection point and the end detection point of each detection window after performing windowing on the long voice to be detected. On one hand, the length of the detection window can be adaptively set to be an ultra-long window suitable for the long voice to be detected, so that the isolation of short voice segments is reduced, the end point detection is more coherent, and the complexity caused by the subsequent merging processing work is further reduced; on the other hand, the determination mode of the voice starting point and the voice ending point is more rigorous and meticulous, the condition that effective voice segments are missed is avoided, and the accuracy of end point detection is higher.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.

FIG. 1 schematically illustrates a flow chart of a long speech endpoint detection method in an exemplary embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a method of calculating a starting energy ratio in an exemplary embodiment of the disclosure;

FIG. 3 schematically illustrates a flow chart of a method of further calculating a starting energy ratio in an exemplary embodiment of the disclosure;

FIG. 4 schematically illustrates a flow chart of a method of calculating a termination energy ratio in an exemplary embodiment of the disclosure;

FIG. 5 schematically illustrates a flow chart of a method of further calculating a termination energy ratio in an exemplary embodiment of the disclosure;

FIG. 6 schematically illustrates a flow chart of a method of determining a speech onset in an exemplary embodiment of the disclosure;

FIG. 7 is a flow chart schematically illustrating a method for determining a speech termination point in an exemplary embodiment of the present disclosure;

FIG. 8 is a diagram schematically illustrating the effect of using a dual threshold method for voice endpoint detection in the prior art;

FIG. 9 schematically illustrates a block diagram of a long speech endpoint detection method in an application scenario in an exemplary embodiment of the present disclosure;

FIG. 10 schematically illustrates a structural diagram of a long speech endpoint detection apparatus in an exemplary embodiment of the present disclosure;

FIG. 11 schematically illustrates an electronic device for implementing a long speech endpoint detection method in an exemplary embodiment of the present disclosure;

FIG. 12 schematically illustrates a computer-readable storage medium for implementing a long speech endpoint detection method in an exemplary embodiment of the disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.

In order to solve the problems in the related art, the present disclosure provides a long speech endpoint detection method. FIG. 1 shows a flow chart of the long speech endpoint detection method. As shown in FIG. 1, the method comprises at least the following steps:

S110: acquiring a speech signal of the long speech to be detected, and windowing the speech signal to obtain a detection window.

S120: determining a sampling point in the detection window as a start detection point, and calculating a start energy ratio of the start detection point.

S130: determining another sampling point as a termination detection point according to the start detection point, and calculating a termination energy ratio of the termination detection point.

S140: determining the start detection point as the speech start point of the long speech to be detected according to the start energy ratio, and determining the termination detection point as the speech termination point of the long speech to be detected according to the termination energy ratio.

In the exemplary embodiment of the present disclosure, by calculating the start detection point and the end detection point of each detection window after performing windowing processing on the long speech to be detected, all speech start points and speech end points in the long speech to be detected can be determined. On one hand, the length of the detection window can be adaptively set to be an ultra-long window suitable for the long voice to be detected, so that the isolation of short voice segments is reduced, the end point detection is more coherent, and the complexity caused by the subsequent merging processing work is further reduced; on the other hand, the determination mode of the voice starting point and the voice ending point is more rigorous and meticulous, the condition that effective voice segments are missed is avoided, and the accuracy of end point detection is higher.
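The decision in step S140 can be illustrated with a short Python sketch. As stated in the embodiments above, the termination ratio threshold is the reciprocal of the start ratio threshold. The comparison directions (start ratio above the threshold, termination ratio below its reciprocal), the function names and the example threshold of 4.0 are assumptions made purely for illustration.

```python
def is_speech_start(start_ratio, start_threshold):
    """Assumed rule: a start point is accepted when its ratio exceeds the threshold."""
    return start_ratio > start_threshold


def is_speech_end(termination_ratio, start_threshold):
    """Assumed rule: an end point is accepted when its ratio falls below
    the reciprocal of the start threshold."""
    termination_threshold = 1.0 / start_threshold  # reciprocal threshold
    return termination_ratio < termination_threshold


# Hypothetical threshold value, chosen only for the example.
START_THRESHOLD = 4.0
start_ok = is_speech_start(5.0, START_THRESHOLD)
end_ok = is_speech_end(0.2, START_THRESHOLD)
```

Deriving one threshold from the other means only a single value has to be tuned, which fits the stated aim of avoiding the many thresholds of the dual-threshold method.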

The following describes each step of the long speech endpoint detection method in detail.

In step S110, a voice signal of the long voice to be detected is obtained, and the voice signal is subjected to windowing processing to obtain a detection window.

In an exemplary embodiment of the present disclosure, the long speech to be detected may be a speech with a duration longer than 3 seconds, and the speech includes a speech signal, which may provide a processing basis for subsequent endpoint detection.

For example, the long speech to be detected may be speech in any condition, such as vehicle-mounted environment speech, indoor environment speech, abnormal speech, and the like, and this is not particularly limited in this exemplary embodiment.

Further, after the voice signal of the long voice to be detected is obtained, windowing processing can be further performed on the voice signal.

For example, the window length of the windowing process may be set to 1 second, and the overlapping length of the windows may be set to 0.1 second, so as to perform the windowing process to obtain each detection window.

Thus, the first detection window may contain the speech samples from S(0) to S(n), the second detection window may contain the speech samples from S(0.9n) to S(1.9n), and so on, yielding multiple detection windows.

The number n of sampling points in each detection window can be determined according to formula (1):

$$ n = W_L \times \mathrm{SimpleRate} \tag{1} $$

where W_L is the window length, i.e., 1 second, and SimpleRate is the sampling rate, that is, the number of samples per second taken from the continuous signal to form the discrete signal, expressed in hertz (Hz).

It should be noted that, in order to keep every detection window complete, when the last detection window is shorter than one window length, the remainder of that window may be padded with the mean value of the speech signal.
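The windowing described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name split_into_windows and its parameters are hypothetical, the signal is assumed to be a plain list of amplitudes, and the padding value is assumed to be the mean of the whole signal (the text does not say whether the mean is taken over the whole signal or over the last window).

```python
def split_into_windows(signal, sample_rate, window_s=1.0, overlap_s=0.1):
    """Split a signal into overlapping detection windows of equal length."""
    n = int(round(window_s * sample_rate))         # samples per window, formula (1)
    hop = n - int(round(overlap_s * sample_rate))  # distance between window starts
    if n <= 0 or hop <= 0:
        raise ValueError("window must be longer than the overlap")
    # Assumption: pad with the mean of the whole signal.
    mean = sum(signal) / len(signal) if signal else 0.0
    windows = []
    start = 0
    while start < len(signal):
        window = list(signal[start:start + n])
        # Pad a short final window so that every window holds n samples.
        window.extend([mean] * (n - len(window)))
        windows.append(window)
        if start + n >= len(signal):
            break  # this window already reached the end of the signal
        start += hop
    return windows


# Toy example: 25 samples at a hypothetical rate of 10 Hz.
toy_windows = split_into_windows(list(range(25)), sample_rate=10)
```

With an 8000 Hz signal and the default settings, each window would hold 8000 samples and consecutive windows would start 7200 samples apart, which matches the S(0.9n) offset given above.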

In step S120, a sampling point in the detection window is determined as a start detection point, and a start energy ratio of the start detection point is calculated.

In an exemplary embodiment of the disclosure, taking the first detection window as an example, the second sampling point of the first detection window may be selected as the start detection point. If the signal is sampled at 8000 points per second, the first detection window contains 8000 sampling points, and the second sampling point is selected as the start detection point for calculating the start energy ratio.

It should be noted that other detection windows may also select the second sampling point in the window as the start detection point, and the first detection window and the other detection windows may also select other sampling points as the start detection points, which is not particularly limited in this exemplary embodiment.

In an alternative embodiment, FIG. 2 shows a flow chart of a method of calculating the start energy ratio. As shown in FIG. 2, the method comprises at least the following steps. In step S210, a first preamble amplitude of a first preamble sampling point before the start detection point is obtained, and a first subsequent amplitude of a first subsequent sampling point after the start detection point is obtained.

For example, when the second sampling point of the first detection window is selected as the initial detection point, it may be determined that only the first sampling point of the first detection window is the first preamble sampling point, and thus the amplitude of the first sampling point is obtained as the first preamble amplitude; and when the fifth sampling point of the second detection window is selected as the initial detection point, the first four sampling points in the second detection window are jointly used as first preamble sampling points, and the amplitudes of the four sampling points are respectively obtained and jointly used as first preamble amplitudes.

Correspondingly, when the second sampling point of the first detection window is selected as the start detection point, all sampling points from the third sampling point to the end of the first detection window can jointly be used as first subsequent sampling points, and the amplitudes of all these remaining sampling points jointly serve as first subsequent amplitudes; and when the penultimate sampling point of the first detection window is the start detection point, only the last sampling point of the first detection window is used as the first subsequent sampling point, and its amplitude is used as the first subsequent amplitude.

It should be noted that there may be one or more of the first preamble sampling point, the first preamble amplitude, the first subsequent sampling point, and the first subsequent amplitude, and this exemplary embodiment is not particularly limited in this respect.

In addition, the other detection windows except the first detection window are also applicable to the determination manner of the first preamble sampling point, the first preamble amplitude, the first subsequent sampling point and the first subsequent amplitude, and are not described herein again.

In step S220, the first preamble amplitude and the first subsequent amplitude are calculated to obtain an initial energy ratio.

The start energy ratio may be calculated once the first preamble amplitude and the first subsequent amplitude have been obtained. Whether there is one or more of each, the ratio can be calculated in the following manner.

In an alternative embodiment, fig. 3 shows a flow diagram of a method for further calculating the starting energy ratio, which method comprises at least the following steps, as shown in fig. 3:

In step S310, a first preamble energy value is calculated from the first preamble amplitude, and a first subsequent energy value is calculated from the first subsequent amplitude.

Wherein the first preamble energy value may be a short-time energy average of the first preamble sample point, and the first subsequent energy value may be a short-time energy average of the first subsequent sample point.

Short-term energy is one of the characteristic parameters of speech and is an intuitive representation of the speech signal. Energy analysis of a speech signal is based on the fact that the amplitude of the signal varies with time. Short-term energy can be used to distinguish unvoiced segments from voiced segments, with lower short-term energy corresponding to unvoiced segments and higher short-term energy corresponding to voiced segments.

For signals with a high signal-to-noise ratio, short-term energy can be used to judge whether speech is present. The short-term energy of noise without speech is small, and it increases markedly when speech is present, so the start point and end point of the speech signal can be distinguished. In addition, short-term energy can be used to locate the boundaries between initials and finals, the boundaries between connected words, and the like.

Specifically, the corresponding first preamble energy value may be calculated according to equation (2):

where a is the start detection point, and s(i) is the first preamble amplitude of the i-th first preamble sampling point.

Correspondingly, a corresponding first subsequent energy value may be calculated according to equation (3):

where, similarly, a is the start detection point, and s(i) is the first subsequent amplitude of the i-th first subsequent sampling point.
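Based on the description of the two energy values as short-time energy averages, equations (2) and (3) might take a form such as the following. This is a hedged reconstruction, not the original formulas: the symbols $E_{\mathrm{pre}}$ and $E_{\mathrm{sub}}$, the zero-based indexing of the n samples in the window, and the exclusion of the detection point a from both sums are assumptions.

```latex
E_{\mathrm{pre}}(a) = \frac{1}{a} \sum_{i=0}^{a-1} s(i)^2,
\qquad
E_{\mathrm{sub}}(a) = \frac{1}{n-1-a} \sum_{i=a+1}^{n-1} s(i)^2
```

Under these assumptions, a detection point at a = 1 has a single preamble sample s(0), which agrees with the example of the second sampling point of the first detection window given above.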

In addition, the first preamble energy value and the first subsequent energy value may also be other parameters characterizing the speech signal, which is not particularly limited in this exemplary embodiment.

In step S320, the first preamble energy value and the first subsequent energy value are calculated to obtain an initial energy ratio.

After the first preamble energy value and the first subsequent energy value are calculated, a start energy ratio corresponding to the start detection point may be further calculated.

Specifically, the calculation can be performed with reference to formula (4):

In this exemplary embodiment, the start energy ratio is calculated from the first preamble energy value and the first subsequent energy value, which provides a data basis for determining the speech start point and makes the subsequent determination more accurate and logically rigorous.
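A minimal Python sketch of the start energy ratio is given below. It assumes that the energy values are means of squared amplitudes and that the ratio is the subsequent energy divided by the preamble energy, so that a jump from silence to speech yields a large value. The direction of the ratio, the helper names mean_energy and energy_ratio, and the small constant eps guarding against division by zero are assumptions and are not taken from formula (4).

```python
def mean_energy(samples):
    """Average of squared amplitudes; 0.0 for an empty list."""
    if not samples:
        return 0.0
    return sum(x * x for x in samples) / len(samples)


def energy_ratio(window, a, eps=1e-12):
    """Assumed ratio: mean energy after index a over mean energy before it."""
    preamble = mean_energy(window[:a])        # samples before the detection point
    subsequent = mean_energy(window[a + 1:])  # samples after the detection point
    return subsequent / (preamble + eps)      # eps avoids division by zero


# Toy window: quiet samples followed by loud samples, detection point at index 3.
toy_window = [0.1, 0.1, 0.1, 1.0, 1.0, 1.0, 1.0]
toy_ratio = energy_ratio(toy_window, 3)
```

In the toy window the preamble energy is about 0.01 and the subsequent energy is 1.0, so the ratio is close to 100. The same function can be applied at a termination detection point, where a drop from speech to silence would give a value well below 1.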

In step S130, another sampling point is determined as an end detection point from the start detection point, and an end energy ratio of the end detection point is calculated.

In an exemplary embodiment of the present disclosure, there are two cases when another sampling point is determined as the termination detection point according to the start detection point.

Specifically, taking the first detection window as an example, when the second sampling point of the first detection window is the start detection point, other sampling points follow it, and the third sampling point can be determined as the termination detection point. Alternatively, any sampling point after the second sampling point may be selected as the termination detection point, which is not particularly limited in the present exemplary embodiment.

In the other case, when the last sampling point of the first detection window is the start detection point and no further sampling points exist in the first detection window, a sampling point in the second detection window can be determined as the termination detection point. The termination detection point may be any sampling point in the second detection window, and this exemplary embodiment is not particularly limited thereto.

In summary, the end detection point determined according to the start detection point may be in the same detection window as the start detection point or in a different detection window.

When no other sampling points remain in the current detection window, the end detection point for the current detection window can be determined in the next detection window, after which the speech start point of the next detection window is determined.
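The two cases above can be sketched as a small helper that returns the position of the next candidate end detection point. The text allows any later sampling point to be chosen; taking the immediately following one is just one possible choice, and the function name and the `(window, sample)` index pair are hypothetical.

```python
def next_candidate(window_lengths, window_index, sample_index):
    """Return the (window, sample) position following the given one, or None.

    window_lengths lists the number of sampling points in each detection
    window; every window is assumed to be non-empty.
    """
    if sample_index + 1 < window_lengths[window_index]:
        # Case 1: sampling points remain in the current detection window.
        return window_index, sample_index + 1
    if window_index + 1 < len(window_lengths):
        # Case 2: move on to the first sampling point of the next window.
        return window_index + 1, 0
    return None  # the signal is exhausted


same_window = next_candidate([4, 4], 0, 1)
next_window = next_candidate([4, 4], 0, 3)
```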

After the termination detection point is determined, a termination energy ratio corresponding to the termination detection point may be further calculated.

In an alternative embodiment, fig. 4 shows a flowchart of a method of calculating the termination energy ratio. As shown in fig. 4, the method comprises at least the following steps. In step S410, a second preamble amplitude of a second preamble sampling point before the termination detection point is obtained, and a second subsequent amplitude of a second subsequent sampling point after the termination detection point is obtained.

For example, when the second sampling point of the first detection window is selected as the termination detection point, it may be determined that only the first sampling point of the first detection window is the second preamble sampling point, and thus the amplitude of the first sampling point is obtained as the second preamble amplitude; and when the fifth sampling point of the second detection window is selected as the termination detection point, the first four sampling points in the two detection windows are jointly used as second preamble sampling points, and the amplitudes of the four sampling points are respectively obtained and jointly used as second preamble amplitudes.

Correspondingly, when the second sampling point of the first detection window is selected as the termination detection point, all sampling points in the first detection window from the third sampling point to the end of the first detection window can be determined to be used as second subsequent sampling points together, and the amplitudes of all the remaining sampling points are used as second subsequent amplitudes together; and when the last sampling point of the first detection window is the termination detection point, only taking the last sampling point of the first detection window as a second subsequent sampling point, and taking the amplitude of the last sampling point as a second subsequent amplitude.

It should be noted that there may be one or more of the second preamble sampling point, the second preamble amplitude, the second subsequent sampling point, and the second subsequent amplitude, and this exemplary embodiment is not particularly limited in this respect.

In addition, the other detection windows except the first detection window are also applicable to the determination manner of the second preamble sampling point, the second preamble amplitude, the second subsequent sampling point and the second subsequent amplitude, and are not described herein again.
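For a termination detection point inside a single detection window, the selection of second preamble and second subsequent amplitudes described above could be sketched as follows. The function name is hypothetical. Reusing the point's own amplitude when it is the last sampling point follows the example given in the text; doing the same for the first sampling point is an assumption added here by symmetry.

```python
def split_around(window, index):
    """Return (preceding, subsequent) amplitudes around window[index]."""
    if not 0 <= index < len(window):
        raise IndexError("detection point lies outside the window")
    # Fall back to the point's own amplitude when one side is empty.
    preceding = window[:index] or [window[index]]
    subsequent = window[index + 1:] or [window[index]]
    return preceding, subsequent


window = [0.1, 0.2, 0.3, 0.4, 0.5]
second_point = split_around(window, 1)  # second sampling point as detection point
last_point = split_around(window, 4)    # last sampling point as detection point
```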

In step S420, the second preamble amplitude and the second subsequent amplitude are calculated to obtain the termination energy ratio.

The termination energy ratio may be calculated after the second preamble amplitude and the second subsequent amplitude are obtained. Whether there is one or more of each, the ratio can be calculated in the following manner.

In an alternative embodiment, fig. 5 shows a flowchart of a method for further calculating the termination energy ratio. As shown in fig. 5, the method comprises at least the following steps. In step S510, a second preamble energy value is calculated from the second preamble amplitude, and a second subsequent energy value is calculated from the second subsequent amplitude.

Wherein the second preamble energy value may be a short-time energy average of the second preamble samples, and the second subsequent energy value may be a short-time energy average of the second subsequent samples.

Therefore, the calculation method of the second preamble energy value and the second subsequent energy value is the same as step S310, and is not described herein again.

In addition, the second preamble energy value and the second subsequent energy value may also be other parameters characterizing the speech signal, which is not particularly limited in this exemplary embodiment.

In step S520, the termination energy ratio is calculated from the second preamble energy value and the second subsequent energy value.

After the second preamble energy value and the second subsequent energy value are calculated, a termination energy ratio corresponding to the termination detection point may be further calculated.

Specifically, the calculation can be performed with reference to formula (5):

In the exemplary embodiment, the termination energy ratio is calculated from the second preamble energy value and the second subsequent energy value, which provides a data basis for determining the speech termination point and makes the subsequent determination of the speech termination point more accurate and rigorous.

In step S140, the start detection point is determined as the voice start point of the long voice to be detected according to the start energy ratio, and the end detection point is determined as the voice end point of the long voice to be detected according to the end energy ratio.

In an exemplary embodiment of the present disclosure, after determining the start energy ratio and the end energy ratio, it may be further determined whether the start detection point is a voice start point of the long voice to be detected and the end detection point is a voice end point of the long voice to be detected.

Wherein, fig. 6 and fig. 7 show a method for determining a speech start point and a speech end point, respectively.

In an alternative embodiment, fig. 6 shows a flowchart of a method for determining a speech start point. As shown in fig. 6, the method comprises at least the following steps. In step S610, a start ratio threshold corresponding to the start energy ratio is determined, and the start energy ratio is compared with the start ratio threshold.

The start ratio threshold may be a threshold set for determining whether the start detection point is a voice start point according to the start energy ratio. The size of the start ratio threshold may be set according to actual conditions, and this exemplary embodiment is not particularly limited in this respect.

After determining the start ratio threshold, the start energy ratio may be compared with it. Specifically, when the start ratio threshold is R and the start energy ratio is R_a, the two may be compared.

In step S620, it is determined that the start detection point is the voice start point of the long voice to be detected according to the comparison result.

Specifically, when R > R_a, that is, when the start ratio threshold is greater than the start energy ratio, the start detection point corresponding to the start energy ratio can be determined as the speech start point of the long speech to be detected; otherwise, the next sampling point after the start detection point is selected for the next round of calculation and determination, until the speech start point of the long speech to be detected is determined.

In the present exemplary embodiment, it can be further determined whether the start detection point corresponding to the start energy ratio is a voice start point according to the start ratio threshold, and the determination is simple and logical, and is extremely practical.
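The comparison in steps S610 and S620 can be sketched as below. The function names and the example threshold of 0.1 are hypothetical; the text leaves the threshold to be set according to actual conditions.

```python
def is_speech_start(start_ratio, start_threshold):
    """A start detection point is accepted when the threshold exceeds its ratio."""
    return start_threshold > start_ratio


def first_speech_start(start_ratios, start_threshold):
    """Scan candidate points in order; return the index of the first accepted one."""
    for index, ratio in enumerate(start_ratios):
        if is_speech_start(ratio, start_threshold):
            return index
    return None  # no candidate passed; detection continues elsewhere


# Ratios of four consecutive candidate start detection points (made-up values).
found = first_speech_start([1.2, 0.8, 0.05, 0.3], start_threshold=0.1)
```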

In an alternative embodiment, fig. 7 shows a flowchart of a method for determining a speech termination point. As shown in fig. 7, the method comprises at least the following steps. In step S710, the reciprocal of the start ratio threshold is calculated and determined as the termination ratio threshold corresponding to the termination energy ratio.

Comparing formula (4) and formula (5) shows that the start energy ratio and the termination energy ratio are calculated in inverse ways, so the termination ratio threshold may likewise be a threshold that is the inverse of the start ratio threshold.

Wherein the termination ratio threshold may be a threshold set for determining whether the termination detection point is a voice termination point according to the termination energy ratio.

It should be noted that the ending ratio threshold may be a reciprocal value of the starting ratio threshold, or may be set according to actual situations, which is not particularly limited in this exemplary embodiment.

After determining the termination ratio threshold, the termination energy ratio may be compared with it. For example, when the termination ratio threshold is 1/R and the termination energy ratio is R_a, the two may be compared.

In step S720, the termination energy ratio is compared with the termination ratio threshold, and the termination detection point is determined to be the voice termination point of the long voice to be detected according to the comparison result.

Specifically, when 1/R < R_a, that is, when the termination ratio threshold is less than the termination energy ratio, the termination detection point corresponding to the termination energy ratio is determined to be the speech termination point of the long speech to be detected; otherwise, the next sampling point after the termination detection point is selected for the next round of calculation and determination, until the speech termination point of the long speech to be detected is determined.

In the present exemplary embodiment, whether the termination detection point corresponding to the termination energy ratio is a speech termination point or not can be further determined according to the termination ratio threshold, and the determination is simple and logical, and is extremely practical.
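Steps S710 and S720 can be sketched in the same way, deriving the termination ratio threshold as the reciprocal of the start ratio threshold. The function names and the example threshold of 0.25 are hypothetical, and the text also allows the termination threshold to be set independently.

```python
def end_ratio_threshold(start_threshold):
    """Termination ratio threshold taken as the reciprocal of the start threshold."""
    if start_threshold <= 0:
        raise ValueError("start threshold must be positive")
    return 1.0 / start_threshold


def is_speech_end(end_ratio, start_threshold):
    """A termination detection point is accepted when its ratio exceeds 1/R."""
    return end_ratio_threshold(start_threshold) < end_ratio


threshold = end_ratio_threshold(0.25)
accepted = is_speech_end(25.0, 0.25)
rejected = is_speech_end(2.0, 0.25)
```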

The following describes the long speech endpoint detection method in the embodiment of the present disclosure in detail with reference to an application scenario.

Endpoint detection is needed in the training, registration, and testing processes of a voiceprint recognition model. Collected audio and open-source dataset audio contain many non-speech segments, distributed before, after, and in the middle of the speech segments. During registration and testing, each audio clip is 3 s long; if the invalid audio segments are not removed, the share of each 3 s clip occupied by speech is greatly reduced, and the interference of the invalid segments also causes errors in the extraction and calculation of many acoustic features. During acoustic model training, excessive invalid segments waste computing power.

The most widely used method in the prior art is the double threshold method. The double-threshold method can distinguish the voice segment from the non-voice segment by using two characteristic parameters of short-time energy and short-time zero-crossing rate of the voice signal.

In particular, reference may be made to fig. 8, which is a schematic diagram illustrating the effect of performing voice endpoint detection with the double-threshold method. As shown in fig. 8, when a voice to be detected is received, its short-time energy and short-time zero-crossing rate can be obtained.

The short-time energy is the sum of squares of the amplitude values of the time-domain signal within one frame, and the short-time zero-crossing rate is the ratio of the number of times the time-domain signal crosses the zero level within a period of time to the length of the signal in that period. In the specific detection method, a high threshold and a low threshold are set empirically. When either of the above characteristic parameters of the sound signal exceeds the low threshold, a transition period is entered; if either characteristic parameter then exceeds the high threshold and remains above the low threshold for a subsequent user-defined duration, the section is considered a speech segment, and otherwise a non-speech segment. The thresholds of the double-threshold method are difficult to determine, so when the speaker's voice is quiet, that is, when the absolute energy value is small, the whole audio may be misjudged as a non-speech segment.
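The two characteristic parameters used by the double-threshold method can be sketched directly from the definitions above. The function names are hypothetical, and treating a sample of exactly zero as non-negative is an assumption of this sketch.

```python
def short_time_energy(frame):
    """Sum of the squared amplitudes in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between adjacent samples divided by the frame length."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / len(frame)


frame = [1.0, -1.0, 1.0, -1.0]
energy = short_time_energy(frame)  # 4.0
zcr = zero_crossing_rate(frame)    # 3 crossings over 4 samples = 0.75
```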

Besides the short-time energy and the short-time zero-crossing rate, other speech signal characteristic parameters include cepstrum, entropy and the like.

The cepstrum-based endpoint detection method is similar to the short-term energy-based detection method, and judgment is performed by replacing short-term energy with cepstrum distance. The principle of the entropy-based endpoint detection method is that the amplitude variation range of the non-speech segment is much smaller than that of the speech segment, so the signal value distribution of the non-speech segment is more concentrated, that is, the entropy of the non-speech segment is much smaller than that of the speech segment.

Distinguishing between speech and non-speech based on the difference in entropy of magnitude is an entropy-based endpoint detection method.
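As an illustration of the entropy idea, the sketch below estimates the Shannon entropy of a segment's amplitude distribution from a histogram. The bin count, the assumed amplitude range of [-1, 1], and the function name are hypothetical choices, not details given in the text.

```python
import math
from collections import Counter


def amplitude_entropy(samples, num_bins=8, lo=-1.0, hi=1.0):
    """Shannon entropy (in bits) of the amplitude histogram of a segment."""
    if not samples:
        raise ValueError("need at least one sample")
    width = (hi - lo) / num_bins
    # Map each amplitude to a bin index, clamping values outside [lo, hi].
    counts = Counter(
        min(num_bins - 1, max(0, int((s - lo) / width))) for s in samples
    )
    total = len(samples)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


flat = amplitude_entropy([0.0] * 10)          # all samples fall in one bin
spread = amplitude_entropy([-0.9, 0.9] * 5)   # two equally likely bins
```

A segment whose amplitudes cluster in one bin gives zero entropy, while amplitudes spread evenly over two bins give one bit, which matches the concentrated-versus-spread contrast described above.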

Deep learning has also been applied to endpoint detection, using the strong fitting and learning capabilities of neural networks to judge, from the preceding and following context, whether the current point is speech or non-speech. However, neural networks require a huge amount of training data, have complicated parameters, and are difficult to train, so the corresponding research still has considerable room for development.

In general, endpoint detection is a very important step in signal preprocessing of many speech analysis systems, and is also a very challenging problem.

Therefore, the long voice endpoint detection method of the present disclosure can detect speech segments with very low volume in actual application scenarios, which avoids missing valid speech segments, enriches the training corpus, and makes the model more robust.

Fig. 9 is a flowchart illustrating the long speech endpoint detection method in an application scenario. As shown in fig. 9, in step S910, a speech signal of a long speech to be detected, that is, an original time-series signal, is obtained.

The long voice to be detected may be a voice lasting more than 3 seconds and containing a speech signal, which provides a processing basis for subsequent endpoint detection.

In step S920, a detection window is obtained by performing windowing on the speech signal.
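A minimal sketch of step S920, assuming fixed-length, non-overlapping detection windows (the window layout and the function name are assumptions made for illustration):

```python
def split_into_windows(signal, window_length):
    """Cut the signal into consecutive windows; the last one may be shorter."""
    if window_length <= 0:
        raise ValueError("window length must be positive")
    return [
        signal[i:i + window_length]
        for i in range(0, len(signal), window_length)
    ]


windows = split_into_windows(list(range(7)), 3)
```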

In step S930, a sample point is determined as a starting detection point in the detection window.

Taking the first detection window as an example, the second sampling point of the first detection window can be selected as the initial detection point.

In step S940, the start energy ratio of the start detection point is calculated to determine whether there is a speech start point in the detection window.

Specifically, the first preamble amplitude may be first calculated to obtain a first preamble energy value, and the first subsequent amplitude may be calculated to obtain a first subsequent energy value. Then, the first preamble energy value and the first subsequent energy value are calculated to obtain a start energy ratio to determine a speech start point.

In step S941, when the start detection point is determined to be a speech start point in the detection window, another sampling point may be further determined as a termination detection point in the detection window.

The end detection points determined from the start detection points may be within the same detection window or may be in different detection windows. When there are no other sampling points in the current detection window, the end detection point of the current detection window can be determined in the next detection window, and then the voice starting point of the next round is determined.

In step S950, the termination energy ratio of the termination detection point is calculated to determine whether there is a voice termination point in the detection window.

Specifically, the second preamble amplitude may be first calculated to obtain a second preamble energy value, and the second subsequent amplitude may be calculated to obtain a second subsequent energy value. Then, the second preamble energy value and the second subsequent energy value are calculated to obtain a termination energy ratio to determine a speech termination point.

When the voice termination point is determined to exist in the detection window, the voice starting point and the voice termination point in the next detection window can be continuously detected until all the voice starting points and the voice termination points in the long voice to be detected are detected.

In step S942, when the start detection point is determined not to be a speech start point in the detection window, a start detection point may be determined in the next detection window to continue the detection, until all start detection points of the long speech to be detected have been examined.

In step S951, when it is determined that there is no speech end point in the detection window, a termination detection point may be determined in the next detection window of the detection window to continue detection until detection of all termination detection points of the long speech to be detected is finished.

In step S952, the termination detection points determined to be speech termination points are all marked as speech termination points of the long speech to be detected.

In step S960, after all the voice start points and voice end points in the long voice to be detected are detected, the end point detection process of the long voice to be detected is ended.
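The overall flow of fig. 9 can be sketched as an alternating search for start and termination points across the detection windows. Everything named here is hypothetical: `demo_ratio` is only a stand-in (mean energy before the point over mean energy after it, within one window) so that the sketch runs, and the actual start and termination ratios may be defined differently. The comparisons follow the text: a start point is accepted when the threshold R exceeds the ratio, and a termination point when the ratio exceeds 1/R.

```python
def demo_ratio(window, index, eps=1e-12):
    """Hypothetical stand-in ratio: mean energy before the point over mean energy after."""
    before, after = window[:index], window[index + 1:]
    if not before or not after:
        return 1.0  # neutral value at the window edges (an assumption)
    e_before = sum(a * a for a in before) / len(before)
    e_after = sum(a * a for a in after) / len(after)
    return e_before / (e_after + eps)


def detect_endpoints(windows, start_threshold,
                     start_ratio=demo_ratio, end_ratio=demo_ratio):
    """Return lists of (window, sample) positions for speech starts and ends."""
    end_threshold = 1.0 / start_threshold
    starts, ends = [], []
    seeking_start = True
    for w, window in enumerate(windows):
        for i in range(len(window)):
            if seeking_start:
                if start_threshold > start_ratio(window, i):
                    starts.append((w, i))
                    seeking_start = False
            elif end_threshold < end_ratio(window, i):
                ends.append((w, i))
                seeking_start = True
    return starts, ends


# Two quiet samples, three loud samples, three quiet samples, in one window.
signal = [0.01, 0.01, 1.0, 1.0, 1.0, 0.01, 0.01, 0.01]
starts, ends = detect_endpoints([signal], start_threshold=0.1)
```

With this toy signal, the sketch marks sample 1 (the last quiet sample before the loud run) as the start point and sample 4 (the last loud sample) as the termination point.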

It should be noted that there may be one or more voice starting points in the long voice to be detected, and this exemplary embodiment is not particularly limited to this.

Correspondingly, there may be one or more voice termination points in the long voice to be detected, which is not particularly limited in this exemplary embodiment.

After detecting the voice starting point and the voice ending point of the long voice to be detected, the two end points can be used as the voice end point detection result. In addition, the length of the voice segment in the long voice to be detected can be determined according to the actual requirement and two end points and used as a voice end point detection result. Specifically, the time span between the detected voice starting point and the detected voice ending point can be obtained. In addition, other voice endpoint detection results may also be obtained according to the detected voice starting point and voice ending point, which is not particularly limited in this exemplary embodiment.
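The time span between a detected speech start point and speech termination point follows from their sample indices once a sampling rate is known. In the sketch below, the function name and the 16 kHz rate are assumptions used only for the example.

```python
def segment_duration_seconds(start_index, end_index, sample_rate_hz):
    """Duration of the speech segment between two sample indices."""
    if end_index < start_index:
        raise ValueError("end point precedes start point")
    if sample_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return (end_index - start_index) / sample_rate_hz


# Start at sample 16000 and end at sample 40000, sampled at an assumed 16 kHz.
duration = segment_duration_seconds(16000, 40000, 16000)
```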

The method can determine all voice starting points and voice ending points in the long voice to be detected by calculating the starting detection point and the ending detection point of each detection window after windowing the long voice to be detected. On one hand, the length of the detection window can be adaptively set to be an ultra-long window suitable for the long voice to be detected, so that the isolation of short voice segments is reduced, the end point detection is more coherent, and the complexity caused by the subsequent merging processing work is further reduced; on the other hand, the method for determining the voice starting point and the voice ending point is more precise and meticulous, the condition that effective voice segments are missed is avoided, the accuracy of end point detection is better, and the method is more suitable for a collecting end and an analyzing end.

It should be noted that although the above exemplary embodiment implementations describe the various steps of the method in the present disclosure in a particular order, this does not require or imply that these steps must be performed in that particular order, or that all of the steps must be performed, to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

In addition, in the exemplary embodiment of the present disclosure, a long voice endpoint detection apparatus is also provided. Fig. 10 shows a schematic structural diagram of the long speech endpoint detection apparatus. As shown in fig. 10, the long speech endpoint detection apparatus 1000 may include: a windowing processing module 1010, a start detection module 1020, a termination detection module 1030, and a detection determination module 1040. Wherein:

the windowing processing module 1010 is configured to acquire a voice signal of a long voice to be detected, and perform windowing processing on the voice signal to obtain a detection window;

a start detection module 1020 configured to determine a sampling point in the detection window as a start detection point and calculate a start energy ratio of the start detection point;

a termination detection module 1030 configured to determine another sampling point as a termination detection point according to the start detection point and calculate a termination energy ratio of the termination detection point;

the detection determining module 1040 is configured to determine the start detection point as the speech start point of the long speech to be detected according to the start energy ratio, and determine the end detection point as the speech end point of the long speech to be detected according to the end energy ratio.

The specific details of the long voice endpoint detection apparatus have been described in detail in the corresponding long voice endpoint detection method, and therefore are not described herein again.

It should be noted that although several modules or units of the long speech endpoint detection apparatus 1000 are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

An electronic device 1100 according to such an embodiment of the invention is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and the scope of use of the embodiments of the present invention.

As shown in fig. 11, the electronic device 1100 is embodied in the form of a general-purpose computing device. The components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one storage unit 1120, a bus 1130 connecting different system components (including the storage unit 1120 and the processing unit 1110), and a display unit 1140.

The storage unit 1120 stores program code that can be executed by the processing unit 1110, so that the processing unit 1110 performs the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section of this specification.

The storage unit 1120 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 1121 and/or a cache memory unit 1122, and may further include a read-only memory unit (ROM) 1123.

The storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 1130 may be representative of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 1100 may also communicate with one or more external devices 1300 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1100, and/or with any devices (e.g., a router, a modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 via the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.

Referring to fig. 12, a program product 1200 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
