Voice end point detection method and device

Document No.: 1629497    Publication date: 2020-01-14

Reading note: This technology, a voice ending endpoint detection method and device, was designed and created by 龙嘉裕 on 2019-09-17. Its main content is as follows: an embodiment of the present application provides a method and a device for detecting a voice ending endpoint. The method includes: acquiring a voice signal input by a user and converting the voice signal into text information; determining a context type corresponding to the text information and/or non-language feature information in the voice signal; determining a detection duration according to the context type and/or the non-language feature information; identifying the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, taking the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint; and, when the semantic structure of the sentence in which the first word is located is determined to be complete, taking the first endpoint as the voice ending endpoint, in the voice signal, of the sentence in which the first word is located.

1. A method for detecting an end-of-speech endpoint, comprising:

acquiring a voice signal input by a user, and converting the voice signal into text information;

determining a context type corresponding to the text information and/or non-language feature information in the voice signal;

determining a detection duration according to the context type and/or the non-language feature information;

identifying the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, taking the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint; wherein the first word is any word in the text information, and the second word is located after and adjacent to the first word;

and when the semantic structure of the sentence in which the first word is located is determined to be complete, taking the first endpoint as the voice ending endpoint, in the voice signal, of the sentence in which the first word is located.

2. The method of claim 1, wherein the method further comprises:

when the semantic structure of the sentence in which the first word is located is determined to be incomplete, determining whether the sentence has ended by using a natural language understanding (NLU) technique;

if the sentence is determined to have ended, taking the first endpoint as the voice ending endpoint of the sentence in the voice signal; otherwise, requesting the user to re-input the voice signal or sending indication information to the user, wherein the indication information is used to prompt the user to confirm whether the sentence is finished.

3. The method according to claim 1 or 2, wherein the determining a detection duration according to the context type and/or the non-language feature information comprises:

when the context type is a question context, and/or when the non-language feature information includes at least one of trailing information, hesitation information, and delay information, taking a first duration as the detection duration;

and when the context type is not the question context and the non-language feature information includes none of the trailing information, the hesitation information and the delay information, taking a second duration as the detection duration, wherein the first duration is longer than the second duration.

4. The method of claim 3, wherein the first duration is greater than 200 milliseconds and less than 2 seconds;

the second duration is less than or equal to 200 milliseconds.

5. The method of claim 1, wherein said determining a context type to which the textual information corresponds comprises:

collecting different question sentences in advance; analyzing the linguistic composition of the question sentences, extracting phrases with interrogative characteristics, and storing the phrases as a phrase set; and, when the text information is acquired, identifying and determining the context type corresponding to the text information according to the phrase set.

6. An end-of-speech endpoint detection apparatus, comprising:

the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is configured to acquire a voice signal input by a user and convert the voice signal into text information;

the processing unit is configured to determine a context type corresponding to the text information and/or non-language feature information in the voice signal, and to determine a detection duration according to the context type and/or the non-language feature information;

the processing unit is further configured to identify the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, to take the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint, wherein the first word is any word in the text information and the second word is located after and adjacent to the first word; and, when the semantic structure of the sentence in which the first word is located is determined to be complete, to take the first endpoint as the voice ending endpoint, in the voice signal, of the sentence in which the first word is located.

7. The apparatus of claim 6, wherein the processing unit is further configured to:

when the semantic structure of the sentence in which the first word is located is determined to be incomplete, determine whether the sentence has ended by using a natural language understanding (NLU) technique;

if the sentence is determined to have ended, take the first endpoint as the voice ending endpoint of the sentence in the voice signal; otherwise, request the user to re-input the voice signal or send indication information to the user, wherein the indication information is used to prompt the user to confirm whether the sentence is finished.

8. The apparatus of claim 6 or 7, wherein the processing unit is further configured to:

when the context type is a question context, and/or when the non-language feature information includes at least one of trailing information, hesitation information, and delay information, take a first duration as the detection duration;

and, when the context type is not the question context and the non-language feature information includes none of the trailing information, the hesitation information and the delay information, take a second duration as the detection duration, wherein the first duration is longer than the second duration.

9. A computer program product comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 5.

10. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 5.

Technical Field

The present application relates to the field of voice detection technologies, and in particular, to a method and an apparatus for detecting a voice ending endpoint.

Background

With the progress of science and technology, computers and networks are used in work and daily life almost every day. To provide services for work and life more conveniently and efficiently, speech recognition is being applied ever more widely in various fields, for example in human-machine interactive speech recognition, in recording the content of conversations between people by means of speech recognition, or in recording one's thoughts by voice anytime and anywhere; recognition of this kind is gradually becoming the trend of speech application development. The speech recognition process mainly comprises four steps: voice signal acquisition, voice signal feature parameter extraction, acoustic model and pattern matching, and language model and language processing. When the voice signal is acquired in the first step, the voice signal input by the user must first be judged and its start point and end point accurately found, so as to know whether the user has finished speaking. This relies on voice activity (endpoint) detection (VAD), the first key technology encountered at the front end of a speech recognition and processing system; the accuracy of VAD to some extent directly determines the success or failure of the speech recognition system.

Disclosure of Invention

The embodiment of the application provides a method and a device for detecting a voice ending endpoint, which are used for solving the problem of low accuracy of endpoint detection in the prior art.

The embodiment of the application provides a method for detecting a voice ending endpoint, which comprises the following steps: acquiring a voice signal input by a user, and converting the voice signal into text information; determining a context type corresponding to the text information and/or non-language feature information in the voice signal; determining a detection duration according to the context type and/or the non-language feature information; identifying the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, taking the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint, wherein the first word is any word in the text information and the second word is located after and adjacent to the first word; and, when the semantic structure of the sentence in which the first word is located is determined to be complete, taking the first endpoint as the voice ending endpoint, in the voice signal, of the sentence in which the first word is located.

In this method, the end time point of the speech signal corresponding to a sentence, namely the first endpoint, is obtained first; the completeness of the sentence's semantic structure is then judged, and only then is it determined whether the first endpoint is the voice ending endpoint, which improves the accuracy of voice ending endpoint detection.

In one possible implementation, the method further includes: when the semantic structure of the sentence in which the first word is located is determined to be incomplete, determining whether the sentence has ended by means of natural language understanding (NLU); if the sentence is determined to have ended, taking the first endpoint as the voice ending endpoint of the sentence in the voice signal; otherwise, requesting the user to re-input the voice signal or sending indication information to the user, wherein the indication information is used to prompt the user to confirm whether the sentence is finished.

In one possible implementation, the determining a detection duration according to the context type and/or the non-language feature information includes: when the context type is a question context, and/or when the non-language feature information includes at least one of trailing information, hesitation information, and delay information, taking a first duration as the detection duration; and, when the context type is not the question context and the non-language feature information includes none of the trailing information, the hesitation information and the delay information, taking a second duration as the detection duration, wherein the first duration is longer than the second duration.

Illustratively, the first duration may be greater than 200 milliseconds and less than 2 seconds; the second duration is less than or equal to 200 milliseconds.

In one possible implementation, determining the context type corresponding to the text information includes: collecting different question sentences in advance; analyzing the linguistic composition of the question sentences, extracting phrases with interrogative characteristics, and storing the phrases as a phrase set; and, when the text information is acquired, identifying and determining the context type corresponding to the text information according to the phrase set.

The embodiment of the application provides a voice ending endpoint detection device, which comprises an acquisition unit and a processing unit. The acquisition unit is configured to acquire a voice signal input by a user and convert the voice signal into text information. The processing unit is configured to determine a context type corresponding to the text information and/or non-language feature information in the voice signal, and to determine a detection duration according to the context type and/or the non-language feature information. The processing unit is further configured to identify the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, to take the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint, wherein the first word is any word in the text information and the second word is located after and adjacent to the first word; and, when the semantic structure of the sentence in which the first word is located is determined to be complete, to take the first endpoint as the voice ending endpoint, in the voice signal, of the sentence in which the first word is located.

In a possible implementation, the processing unit is further configured to determine, when the semantic structure of the sentence in which the first word is located is determined to be incomplete, whether the sentence has ended by using a natural language understanding (NLU) technique; if the sentence is determined to have ended, to take the first endpoint as the voice ending endpoint of the sentence in the voice signal; otherwise, to request the user to re-input the voice signal or send indication information to the user, wherein the indication information is used to prompt the user to confirm whether the sentence is finished.

In one possible implementation, the processing unit is further configured to take a first duration as the detection duration when the context type is a question context and/or when the non-language feature information includes at least one of trailing information, hesitation information, and delay information; and to take a second duration as the detection duration when the context type is not the question context and the non-language feature information includes none of the trailing information, the hesitation information and the delay information, wherein the first duration is longer than the second duration.

Embodiments of the present application provide a computer-readable storage medium, which stores computer-readable instructions, and when the computer-readable instructions are read and executed by a computer, the computer-readable instructions cause the computer to perform the method in any one of the above possible designs.

The embodiments of the present application provide a computer program product, which when read and executed by a computer, causes the computer to perform the method of any one of the above possible designs.

The embodiment of the present application provides a chip, where the chip is connected to a memory, and is used to read and execute a software program stored in the memory, so as to implement the method in any one of the above possible designs.

Embodiments of the present application provide a chip system, which includes a processor and may further include a memory, and is used to implement the method in any one of the above possible designs. The chip system may be formed by a chip, and may also include a chip and other discrete devices.

The method and the device for detecting a voice ending endpoint provided by the embodiments of the present application have the following beneficial effects: for a voice signal input by a user, the corresponding context type and/or non-language feature information is determined, so that the detection duration of the voice ending endpoint is adjusted dynamically and the end time point of the pronunciation interval of a word in a sentence is determined reasonably; then, analysis of the completeness of the sentence's semantic structure is combined to judge whether this time point is the voice ending endpoint of the sentence, thereby ensuring the accuracy of detecting the voice ending endpoint of the sentence.

Drawings

Fig. 1 is a schematic structural diagram of a mobile phone according to an embodiment of the present application;

fig. 2 is a flowchart of a method for detecting an end point of speech according to an embodiment of the present application;

fig. 3 is a schematic diagram of determining a first endpoint according to an embodiment of the present disclosure;

FIG. 4 is a flowchart of shallow semantic analysis provided by an embodiment of the present application;

FIG. 5 is a flowchart of deep semantic analysis according to an embodiment of the present disclosure;

fig. 6 is a schematic structural diagram of a voice ending endpoint detection device according to an embodiment of the present application.

Detailed Description

The embodiments of the present application will be described in detail below with reference to the drawings attached hereto.

Before describing the embodiments of the present application, some terms in the present application are explained to facilitate understanding for those skilled in the art.

1) Voice activity detection (VAD, also called voice endpoint detection) is used to detect effective voice sections in a continuous voice stream. It covers two aspects: detecting the front endpoint, which is the start point of effective voice, and detecting the rear endpoint, which is the end point of effective voice; voice sections are distinguished from non-voice sections by using a short-time energy technique.
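As an illustration of the short-time energy idea only (a minimal sketch in Python, not the specific detector used by the embodiments; the frame length, hop and relative threshold are assumed values chosen for readability), a frame is marked as voice when its energy exceeds a threshold:

```python
import numpy as np

def short_time_energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_ratio=0.1):
    """Mark each frame of `signal` as voice (True) or non-voice (False) by
    comparing its short-time energy with a threshold relative to the maximum."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len].astype(np.float64)
        energies.append(float(np.sum(frame ** 2)))
    energies = np.array(energies)
    threshold = threshold_ratio * energies.max() if energies.size else 0.0
    return energies > threshold  # True marks frames inside an effective voice section

# Example: quiet noise with a louder burst in the middle stands in for speech.
sr = 16000
rng = np.random.default_rng(0)
audio = 0.01 * rng.standard_normal(sr)
audio[6000:10000] += 0.5 * np.sin(2 * np.pi * 200 * np.arange(4000) / sr)
flags = short_time_energy_vad(audio, sr)
print("voice frames:", int(flags.sum()), "of", len(flags))
```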

2) The Automatic Speech Recognition (ASR) technology is a technology for converting human speech into text.

3) The "speech turn" means a speech in which a speaker continuously utters a speech having a certain communication function at any time during a daily conversation, and the end mark indicates that the speaker and the listener exchange roles or both silence the speech as a signal for giving up the speech turn.

4) Speech-turn switching (turn-taking) means that the speech turn passes smoothly between the speaker and the listener.

5) Spoken-language transcription is a technique for annotating dialogue; it uses voiceprint recognition and short-time energy techniques to mark non-language interactive features on the written corpus, so as to obtain richer context information.

6) Shallow semantic analysis judges the completeness of a sentence according to the semantic role components of the predicates in the sentence; the semantic role components comprise core semantic roles and auxiliary semantic roles. For example, given a sentence, according to the semantic role components corresponding to its predicates, including core semantic roles (such as the agent and the patient) and auxiliary semantic roles (such as place, time, manner and reason), the relationship between each component of the sentence and the predicate is studied with the predicate at the center, in order to judge and analyze the completeness of the sentence.

7) Semantic role labeling analyzes the predicate-argument structure of a sentence, taking the sentence as the unit, and determines, for each core predicate in the sentence, its arguments and the roles those arguments play. A rule-based method is adopted, such as traversing the syntax tree or the syntactic dependency tree, to prune from the sentence the words that cannot be arguments, obtain the role chain of the predicate, and identify from the candidate arguments all arguments belonging to the predicate; the recognized arguments are then assigned semantic roles and labeled, and the labeling results are post-processed, for example by deleting semantically duplicated arguments.

8) Deep semantic analysis refers to analyzing the real semantics of a sentence to judge the completeness of the sentence.

9) Incremental indexing means that index files are generated one by one and merged according to a policy into one or several large index files. When new data is added, it is indexed to form an independent segment of index data; as more and more new data is indexed, an index data set and an index information set for managing the index data are generated, and the directory in which each index file is located, the name of the index file and the number of documents it contains are stored in the index information set. At query time, the index information set is accessed first, each piece of index data is accessed according to the information in the set, and the result sets obtained from the individual index data files are then merged to form a complete result set.

The embodiments of the present application may be applicable to any electronic device, such as a mobile phone, a tablet computer, a wearable device (e.g., a watch, a bracelet, a smart helmet, etc.), an in-vehicle device, a smart home, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), etc.

Taking a mobile phone as the example electronic device, fig. 1 is a schematic structural diagram of the mobile phone; for convenience of description, fig. 1 only shows its main components. As shown in fig. 1, the handset 100 includes at least one processor 101, an input unit 103, a touch panel 104, other input devices 105, a display unit 106, a display screen 107, a first camera 108, a second camera 109, audio circuitry 110, a microphone 111, and the like, which are coupled to a memory 102. The at least one processor 101 is mainly configured to process communication protocols and communication data, control the entire electronic device, execute software programs and process their data, for example to support the electronic device in performing voice ending endpoint detection on received speech information. The memory 102 is used primarily for storing software programs and data. The input unit 103 is mainly used for receiving the user's input on the touch screen of the touch panel 104 of the electronic device, text, and the like. The display screen 107 is mainly used to display the main interface of the electronic device, the application interface of each APP, and the like; in the embodiments of the present application, the display screen 107 may display the text information obtained by converting the speech signal. The microphone 111 is mainly used to acquire the voice signal input by the user. Those skilled in the art will appreciate that the handset architecture shown in fig. 1 is merely one example of an implementation and does not limit the handset, which may include more or fewer components than shown, combine some components, or arrange the components differently; this is not described further here.

It should be noted that the processor 101 in the embodiments of the present application may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors. The controller may be the neural center and command center of the electronic device: it can generate operation control signals according to the instruction operation code and the timing signal, so as to control instruction fetching and instruction execution. A memory may also be provided in the processor 101 for storing instructions and data. In some embodiments, the memory in the processor 101 is a cache; it may hold instructions or data that the processor 101 has just used or reused. If the processor 101 needs to reuse the instruction or data, it can be called directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 101, and increases the efficiency of the system.

In this embodiment, the processor 101 may invoke program instructions stored in the memory 102 to complete the detection of the end-of-speech endpoint of the received speech signal by the electronic device.

It will be appreciated that the memory 102 in the embodiments of the present application may be used to store computer-executable program code comprising instructions, as well as various functional applications and data processing of the electronic device. The processor 101 may perform the voice ending endpoint detection on a received speech signal proposed by the embodiments of the present application by executing instructions stored in the memory 102. The memory 102 may include a program storage area and a data storage area. The program storage area may store an operating system, the software code of at least one application program (such as an iQIYI application, a WeChat application, etc.), and the like. The data storage area may store data (e.g., images, video, etc.) generated during use of the electronic device, and the like. Furthermore, the memory 102 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, universal flash storage (UFS), and the like.

Currently, voice ending endpoint detection can be applied in many scenarios. The most typical one is that when a user uses an instant messaging application (e.g., WeChat, QQ, LINE, etc.), the user can, through the dialogue function provided by the instant messaging application, send his or her voice to a contact, or receive the contact's voice and play it through the speaker 112. When the user uses such instant messaging applications for voice chat or voice message communication, because the user's voice usually has continuity or a time sequence and environmental noise may be present, the electronic device needs to accurately find the start point and the end point in the voice signal picked up by the sound pickup, i.e. the microphone 111, in order to know whether the user has finished speaking; this is where the voice ending endpoint detection technology VAD is used. In the embodiments of the present invention, the voice ending endpoint detection program instructions may be pre-stored in the memory 102 or in another external memory, and the processor 101 may call and run the voice ending endpoint detection program instructions stored in the memory 102 or the external memory, so as to accurately identify the position of the ending endpoint of the voice signal received by the electronic device, thereby improving the user experience when the user uses an instant messaging application for voice communication or voice message communication.

In addition, the embodiments of the invention can also be applied to scenarios such as voice translation applications, conference recording applications and human-computer interaction applications in electronic equipment; further corresponding scenarios include voice recording on wearable devices, voice recording during meetings, voice recording during human-machine interaction such as voice search or consultation on a mobile terminal, converting speech into text records when investigating a case, voice recording on vehicle-mounted devices, and the like.

When the voice ending endpoint detection method is applied to the electronic equipment, the electronic equipment can acquire the voice signal uttered by the user through the microphone 111. With reference to the structure diagram of the mobile phone shown in fig. 1, fig. 2 is a flowchart of a method for detecting a voice ending endpoint according to an embodiment of the present application, which may include the following steps.

Step 201: when the processor 101 in the mobile phone detects that a function requiring voice end point detection is started, for example, a recording APP of the mobile phone is opened, or a voice trigger control in a WeChat dialog box in the mobile phone is triggered, the processor 101 starts the microphone 111, a voice signal input by a user is acquired by the microphone 111, and the processor 101 converts the voice signal acquired by the microphone 111 into text information;

specifically, for example, the voice signal of the user may be acquired through the microphone 111, or the voice signal of the user may also be acquired through other manners, for example, the mobile phone receives the voice signal sent by the third-party device through the RF circuit 114 or the WiFi module 113, which is not limited in this embodiment of the application. When acquiring a speech signal of a user, the processor 101 may synchronously convert the speech signal into text information through an Automatic Speech Recognition (ASR) technique. ASR is a technique that converts human speech into text.

Step 202: a processor 101 in the mobile phone determines a context type corresponding to the text information and/or non-language feature information in the voice signal;

as to a method how to determine a context type corresponding to the text information, reference may be made to the following description.

Context types are generally divided, according to the mood of the sentence, into the question context, the statement context and the imperative context. The question context contains general (yes-no) questions, special (wh-) questions, rhetorical questions and alternative questions; the statement context contains affirmative and negative declarative sentences; and the imperative context contains affirmative and negative sentences expressing commands, requests, advice, warnings and prohibitions.

The sentences contained in a context are analyzed: according to speech act theory, speech acts mainly fall into three types, namely statements, questions and demands, and the corresponding functional sentence types are declarative sentences, interrogative sentences and imperative sentences. Generally, in a scenario containing an interrogative sentence, the responder needs a pause to think; if the voice ending endpoint is detected with the prior-art VAD technique, the sentence is easily misjudged as finished because of this thinking pause within the detection time, which damages the completeness of the sentence. This is unlikely to occur with declarative and imperative sentences, and therefore, in the voice ending endpoint detection processing, the context types are classified into only two types: the question context and the non-question context.

Analyzing the linguistic composition of the questions in the question context shows that whether the phrases making up a question have an interrogative form can be determined. Based on this feature, phrases such as "why", "to the end", "how", "what" and "do" are extracted, collected and stored as a phrase set, which may be stored in the memory 102 in advance.

In step 201, after converting the voice signal into text information, the processor 101 compares the phrases making up the text information with the phrases in the phrase set. When any phrase of the text information is present in the phrase set, the context type corresponding to the text information is determined to be the question context; otherwise, the context type corresponding to the text information is determined to be the non-question context.

For example, a phrase set includes phrases such as "why", "to the end", "how", "what" and "do", and the text information converted from the voice signals input by the user includes: 1. How is the weather tomorrow? 2. The weather is good; what do you plan to do tomorrow? 3. Consider outdoor sports. Text information 1 is first split into phrases, for example into "tomorrow", "weather" and "how"; text information 2 is split, for example, into "weather", "good", "tomorrow", "what" and "plan"; and text information 3 is split, for example, into "consider", "outdoor" and "sports". The split phrases are then compared with the phrase set one by one. When text information 1 is compared with the phrase set, "how" in text information 1 is found in the phrase set; when text information 2 is compared with the phrase set, "what" in text information 2 is found in the phrase set; when text information 3 is compared with the phrase set, "consider", "outdoor" and "sports" are not found in the phrase set. It can thus be seen that text information 1 and text information 2 each contain a phrase that exists in the phrase set, while none of the phrases of text information 3 exists in the phrase set; therefore, the context types corresponding to text information 1 and text information 2 are determined to be the question context, and the context type corresponding to text information 3 is determined to be the non-question context.
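The phrase-set comparison described above can be sketched as follows (a minimal illustration; the English phrase set, the whitespace tokenizer and the example sentences are assumptions, since a real implementation would use a proper word segmenter and a phrase set extracted from collected question sentences):

```python
# Phrase set with interrogative characteristics (illustrative English stand-ins).
QUESTION_PHRASES = {"why", "how", "what", "where", "when", "do"}

def context_type(text):
    """Return 'question' if any phrase of the text appears in the phrase set,
    otherwise 'non-question'."""
    phrases = text.lower().replace(",", " ").replace("?", " ").replace(".", " ").split()
    return "question" if any(p in QUESTION_PHRASES for p in phrases) else "non-question"

for sentence in ("How is the weather tomorrow?",
                 "The weather is good, what do you plan to do tomorrow?",
                 "Consider outdoor sports."):
    print(sentence, "->", context_type(sentence))
# -> question, question, non-question, matching text information 1, 2 and 3 above.
```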

By determining the context type through the method, whether the context corresponding to the text information is a question context can be known, so that different processing is performed on the voice ending endpoint; another situation where the end point of the speech is processed differently is where the speech signal contains non-linguistic feature information, as described in detail below.

As to a method how to determine non-linguistic feature information in the speech signal, reference may be made to the following description.

Information contained in a speech signal is called acoustic information, and the acoustic information includes language information, paralinguistic information, non-language information, and silence information.

Language information is information that can be expressed in characters and that is added to the speech signal by the speaker when speaking. For example, the language information includes phonemes, syllables, and the characters grouped in units of syllables.

Paralinguistic information is information added to the speech signal by the speaker's manner of utterance that cannot be recovered from the language information alone. It includes fillers indicating that the speaker is thinking, and it also identifies, through the direction of the intonation, whether the speaker is asking the other party a question. For example, when the tone of the second half of the language information "yes" rises, the language information indicates a question; when the tone of the latter half is low, the language information indicates affirmation.

Non-language information is characteristic information about the speaker contained in the speech signal, and may also be status information about the speaker. For example, the non-language feature information includes the trailing (lingering-sound) information, hesitation information and delay (prolongation) information produced while speaking; the non-language status information includes the speaker's gender, age, physical characteristics, personality, and so on.

Silence information describes a state in which the speech signal contains none of the language information, the paralinguistic information or the non-language information, for example silence and noise.

As can be seen from these four types of information contained in the acoustic information, when the speech contains the trailing information, hesitation information or delay information of the non-language feature information, detecting the voice ending endpoint with the prior-art VAD technique makes it easy to misjudge the sentence as finished because such information falls within the detection time, which damages the completeness of the sentence. Therefore, it is necessary to recognize the non-language feature information in the speech signal and to apply different ending endpoint processing to the speech signal.

After the microphone acquires the voice signal input by the user in step 201, the processor uses a spoken-language transcription technique to identify whether the voice signal contains the trailing information, hesitation information or delay information. Spoken-language transcription is a technique for annotating dialogue; identifying a voice signal by spoken-language transcription is prior art, and the specific details are not repeated here.

For example, an example case of the non-language information is as follows.

Example 1, overlapped voices exist in the voice signal, and for example, text information corresponding to the acquired voice signal is as follows.

The user A: i do not know that is true;

and a user B: is true.

Analysis: the words spoken by user A and user B include the same speech, "true", so overlapping speech is present. Although the overlapping speech belongs to non-language information, it is not any of the trailing information, hesitation information or delay information, so the detection duration used for voice ending endpoint detection will be the second duration of step 203.

Example 2, there is low, subdued speech in the speech signal, and for example, the text information corresponding to the acquired speech signal is as follows.

The user A: we wait for the bar first;

and a user B: is good.

Analysis: the acquired voice signal shows that user A and user B are both speaking in a low, subdued state. This speaker-status information belongs to non-language information but is not any of the trailing information, hesitation information or delay information, so the detection duration used for voice ending endpoint detection will be the second duration of step 203.

Example 3, a voice pause exists in the voice signal, and for example, the text information corresponding to the acquired voice signal is as follows.

The user A: he drives (pauses 200 milliseconds) up the hill; (telephone ring switching 1.3 seconds)

And a user B: is it? How far (150 ms pause)?

And (3) analysis: the obtained voice signal shows that the user a has a pause phenomenon when saying "drive" and "go up a mountain" and the user B has a pause phenomenon when saying "yes" and "how far", and belongs to the delay information in the non-language feature information, so the detection duration when the voice end endpoint detection is performed will be processed according to the first duration in step 203.

Example 4, there is voice extension in the voice signal, for example, the text information corresponding to the acquired voice signal is as follows.

The user A: i do so well …;

and a user B: … is good.

And (3) analysis: the obtained voice signal shows that the user A has a lingering sound when speaking the 'good' word and the user B has a lingering sound when speaking the 'good' word, and belongs to the lingering sound information in the non-language feature information, so that the detection time length when the voice end endpoint is detected is processed according to the first time length in the step 203.

Example 5, there is a intonation change in the voice signal, for example, the text information corresponding to the acquired voice signal is as follows.

The user A: it has 4 stories? Too good! | A | A

And a user B: is.

And (3) analysis: when the obtained voice signal shows that the user a says "story" and "too good", the story is expressed in an ascending manner, belongs to the auxiliary language information, and does not belong to the non-language information, so that the detection time length during the voice end point detection is processed according to the second time length in the step 203.

Example 6, there is emphasis in the voice signal, and for example, text information corresponding to the acquired voice signal is as follows.

The user A: she has a number of books!

And (3) analysis: when the acquired voice signal shows that the user a speaks, the emphasis is highlighted when "there" and "much" are spoken, and the speech belongs to the sublingual information and does not belong to the non-linguistic information, so that the detection time length in the voice end endpoint detection is processed according to the second time length in the step 203.

Example 7, there is a case where the volume of the voice signal becomes high, and for example, text information corresponding to the acquired voice signal is as follows.

The user A: too good.

Analysis: the acquired voice signal shows that user A is in a cheerful state when speaking, which makes the volume rise. This belongs to non-language information but is not any of the trailing information, hesitation information or delay information, so the detection duration used for voice ending endpoint detection will be the second duration of step 203.

As can be seen from the above analysis, of the seven examples only examples 3 and 4 involve the trailing information, hesitation information or delay information, while examples 1, 2, 5, 6 and 7 do not; therefore, the detection duration used for voice ending endpoint detection in examples 3 and 4 may differ from the detection duration used in the other examples.

Step 203: and a processor in the mobile phone determines the detection duration according to the context type and/or the non-language feature information.

Specifically, the first duration is taken as the detection duration when the processor determines that the context type of the text information is the question context, or when the processor determines that the non-language feature information includes at least one of the trailing information, hesitation information and delay information, or when the processor determines that both of these conditions hold.

And the second duration is taken as the detection duration when the processor determines that the context type of the text information is not the question context and the non-language feature information includes none of the trailing information, hesitation information and delay information, the second duration being shorter than the first duration.

In the embodiments of the application, the specific values of the first duration and the second duration can be determined according to the actual situation. For example, in one possible implementation, the voice ending detection mechanism may be adaptively adjusted according to the linguistic rule of turn-taking (i.e., switching from the speaker to the listener): a typical turn switch takes about 200 milliseconds, and 2 seconds is regarded as the upper limit of a turn switch. The detection duration for the end of speech can therefore follow the turn-taking rule, and different values of the first duration and the second duration may be set with reference to the typical switching duration and its upper limit, thereby ensuring the accuracy of voice ending detection.

In another possible implementation manner, different values may be directly set for the first duration and the second duration, for example, the value of the first duration is greater than 200 milliseconds and less than 2 seconds; the second duration is less than or equal to 200 milliseconds.
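A minimal sketch of the duration selection in step 203, using illustrative values taken from the ranges above (1 second for the first duration, 200 milliseconds for the second; the feature labels are assumed names):

```python
FIRST_DURATION_MS = 1000   # assumed value: greater than 200 ms and less than 2 s
SECOND_DURATION_MS = 200   # assumed value: less than or equal to 200 ms

def detection_duration(context, features):
    """Choose the detection duration from the context type and the set of
    detected non-language features ('trailing', 'hesitation', 'delay')."""
    if context == "question" or features & {"trailing", "hesitation", "delay"}:
        return FIRST_DURATION_MS
    return SECOND_DURATION_MS

print(detection_duration("question", set()))              # 1000
print(detection_duration("non-question", {"trailing"}))   # 1000
print(detection_duration("non-question", set()))          # 200
```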

Step 204: the processor in the mobile phone identifies the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, takes the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint.

Wherein the first word is any word in the text information and the second word is located after and adjacent to the first word.

For example, fig. 3 is a schematic diagram of determining a first endpoint according to an embodiment of the present application. The text information corresponding to the speech signal shown in fig. 3 is "good. You".

The processor identifies the pronunciation interval of the first word "good" in the speech signal. Since the detection duration determined by the processor in step 203 for the interval after the pronunciation interval of "good" is the second duration, the detection duration is less than or equal to 200 milliseconds; because no pronunciation interval of the next word appears within this detection duration after the pronunciation interval of "good" ends, the end time point of the pronunciation interval of "good" is taken as the first endpoint.
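The interval-gap check of step 204 can be sketched as follows (a minimal illustration; the word-level timestamps are assumed to come from an ASR engine, and the example intervals are invented for the "good. You" case):

```python
def first_endpoint(intervals, detection_ms):
    """Given (word, start_ms, end_ms) pronunciation intervals in order, return
    the word and end time of the first interval that is not followed by another
    interval within `detection_ms` after it ends, i.e. the first endpoint."""
    for i, (word, start, end) in enumerate(intervals):
        nxt = intervals[i + 1] if i + 1 < len(intervals) else None
        if nxt is None or nxt[1] - end > detection_ms:
            return word, end
    return None

# "good. You": an 800 ms gap after "good" exceeds a 200 ms detection duration,
# so the end of the pronunciation interval of "good" becomes the first endpoint.
words = [("good", 0, 350), ("you", 1150, 1400)]
print(first_endpoint(words, 200))    # ('good', 350)
print(first_endpoint(words, 1000))   # ('you', 1400)
```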

Through the above steps 201 to 204, the processor may obtain the end time point of the speech signal corresponding to a sentence, i.e. the first endpoint; but whether this first endpoint is the voice ending endpoint still requires a further judgment of the completeness of the sentence's semantic structure.

Step 205: and when the processor determines that the semantic structure of the sentence in which the first word is positioned is complete, taking the first endpoint as a speech ending endpoint of the sentence in which the first word is positioned in the speech signal.

Fig. 4 is a flowchart of the shallow semantic analysis used to judge semantic structure completeness according to an embodiment of the present application. This method flow is used to judge the completeness of the semantic structure of the sentence in which the first endpoint obtained in step 204 is located, and may be applied to the mobile phone shown in fig. 1. When the mobile phone shown in fig. 1 executes the method flow, the flow may include the following steps.

Step 401: a processor in the mobile phone takes the interactive dialogue text as a corpus in advance to carry out training processing and semantic role labeling;

specifically, the processor takes various dialogue interaction texts as corpora in advance for training, extracts features from the trained corpora, and constructs corresponding semantic feature vectors; constructing a prototype mode on the basis of the semantic feature vector; solving a plurality of candidate roles for each dependency component according to the prototype mode, constructing a predicate library, and combining the candidate roles of all dependency components corresponding to each predicate to obtain a role chain of each predicate; labeling semantic roles in the role chain, wherein the labeling types comprise: predicates, paths, phrase types, locations, morphemes, core words, dependencies, combination features, and first and last words of an argument.

Step 402: a processor in the mobile phone carries out semantic role labeling processing on the current text, and identifies the semantic structural integrity of sentences;

specifically, the processor identifies the semantic role of the text at this time by using a neural network, and labels the semantic role. And judging whether the components of the sentence are complete or not through the core predicates of the sentence and the semantic roles labeled by the argument to which the predicate belongs, and identifying the semantic structural integrity of the sentence. The semantic recognition technology based on the neural network is the prior art, and specific contents are not described again.

Step 403: a processor in the mobile phone judges whether the semantic structure of the sentence is complete or not and correspondingly carries out different processing;

step 404: and when the processor in the mobile phone judges that the semantic structure of the sentence is complete, taking the first endpoint as a voice ending endpoint of the sentence in the voice signal.

Step 405: when the processor judges that the semantic structure of the sentence is incomplete, Natural Language Understanding (NLU) technology is adopted to determine whether the sentence is finished. The NLU-based statement analysis technology is the prior art, and specific contents are not described again.

Specifically, fig. 5 is a flowchart of the deep semantic analysis used to judge semantic structure completeness according to an embodiment of the present application; the method flow may be applied to the mobile phone shown in fig. 1, and when the mobile phone shown in fig. 1 executes the method flow, the flow may include the following steps.

Step 501: a processor in the mobile phone analyzes the intention of the sentence by adopting an NLU technology and predicts the correctness of the intention by adopting an increment processing technology;

specifically, the user intention can be recognized by the NLU, and by using an increment processing technology adopted by the idea of increment indexing in a search engine, various dialog texts are classified in advance according to the user intention recognized by the NLU, each dialog intention file is generated, the dialog intention files are merged according to a strategy to form one or a plurality of large dialog intention files, when a new text is added to require intention prediction, an independent dialog intention data is formed, when more and more new texts are subjected to intention prediction, a dialog intention data set and an intention information set for managing the dialog intention data are generated, and a directory where each dialog intention file is located, the name of the dialog intention file, and the number of documents contained in the dialog intention file are stored in the intention information set.

When the processor finds through shallow semantic analysis that the semantic structure of a sentence is incomplete, it accesses, according to the intention information set obtained in advance for sentences analyzed with the NLU technique, each piece of dialogue-intention data corresponding to the information in the set, and then merges the result sets obtained from the individual dialogue-intention data files to form a complete result set, which facilitates the subsequent intention prediction analysis.
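A minimal sketch of this merge step (the intention information set, segment paths and utterance-to-intent entries are invented stand-ins for the on-disk dialogue-intention files):

```python
# Intention information set: directory, file name and document count per segment.
intention_info_set = [
    {"directory": "/idx/seg1", "name": "greetings.idx", "doc_count": 2},
    {"directory": "/idx/seg2", "name": "weather.idx", "doc_count": 3},
]
# Dialogue-intention data, one dictionary standing in for each index segment.
segment_data = {
    "/idx/seg1/greetings.idx": {"hello there": "greeting", "good morning": "greeting"},
    "/idx/seg2/weather.idx": {"how is the weather": "ask_weather",
                              "will it rain": "ask_weather",
                              "weather tomorrow": "ask_weather"},
}

def predict_intentions(text):
    """Access every segment listed in the intention information set and merge
    the matching intention results into one complete result set."""
    merged = {}
    for info in intention_info_set:
        segment = segment_data[info["directory"] + "/" + info["name"]]
        for utterance, intent in segment.items():
            if utterance in text:
                merged[utterance] = intent
    return merged

print(predict_intentions("how is the weather tomorrow"))
# {'how is the weather': 'ask_weather', 'weather tomorrow': 'ask_weather'}
```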

Step 502: the processor divides the intention analysis result of the sentence into three categories according to the confidence, and processes each category differently;

step 503: the processor judges whether the confidence coefficient is greater than or equal to M%;

step 504: when the confidence coefficient is larger than or equal to M%, the processor determines that the sentence is ended, and takes the first endpoint as a speech ending endpoint of the sentence in the speech signal;

step 505: when the confidence coefficient is less than M%, the processor judges whether the confidence coefficient is also greater than or equal to N%;

step 506: when the confidence coefficient is less than M% and greater than or equal to N%, the processor sends indication information to the user to prompt the user to confirm whether the sentence is finished;

step 507: the user judges whether the confirmed sentence is correct or not;

specifically, when the user confirms that the sentence is correct, the processor determines that the sentence is ended, and takes the first endpoint as a speech ending endpoint of the sentence in the speech signal; when the user confirms that the sentence is incorrect, the processor in step 508 is turned to request the user to re-input the speech signal.

Step 508: when the confidence coefficient is less than N%, the processor requests the user to input the voice signal again;

step 509: the processor judges whether the number of times of re-inputting the voice signal by the user is greater than L times;

specifically, when the number of times of re-input by the user is less than or equal to L times, the process goes to step 201; the value of the set time threshold can be 3 times, and can also be adjusted according to the actual situation.

Step 510: when the number of inputs is greater than L, the processor stops the end-of-utterance end-point detection.

When the confidence is greater than or equal to M%, the text belongs to the high-confidence category; for example, the value of M may be set to 85%.

When the confidence is less than M% and greater than or equal to N%, the text belongs to the medium-confidence category; for example, the value of N may be set to 45%. This may correspond to the actual scenario in which the user has not finished speaking and is interrupted, or someone cuts in, while speaking. In this case, the processor confirms with the user whether the prediction is correct, for example by explicitly asking, "Did you mean (prediction)?". When the prediction result is confirmed to be correct, the first endpoint is taken as the voice ending endpoint of the sentence in the speech signal; when the prediction is confirmed to be incorrect, the processing is the same as for the third, low-confidence category: the processor prompts the user to re-enter the sentence, for example, "Sorry, it seems I just interrupted you; could you say it again?", and at the same time records the number of times the user re-enters the sentence.
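The three-way confidence handling of steps 502-510 can be sketched as follows (a minimal illustration using the example thresholds M = 85 and N = 45 and the retry limit of 3 mentioned above; the callback for user confirmation is an assumed placeholder):

```python
M, N, RETRY_LIMIT = 85, 45, 3   # example threshold and retry values from the text

def handle_intent(confidence, retries, user_confirms):
    """Decide how to treat the first endpoint from the NLU confidence (in %).
    `user_confirms` is a callable standing in for the confirmation dialogue."""
    if confidence >= M:
        return "sentence ended: first endpoint is the voice ending endpoint"
    if confidence >= N and user_confirms():       # e.g. "Did you mean ...?"
        return "sentence ended: first endpoint is the voice ending endpoint"
    if retries + 1 > RETRY_LIMIT:
        return "stop voice ending endpoint detection"
    return "request the user to re-input the voice signal (back to step 201)"

print(handle_intent(90, 0, lambda: True))    # high confidence
print(handle_intent(60, 0, lambda: False))   # medium confidence, user says no
print(handle_intent(30, 3, lambda: True))    # low confidence, retry limit reached
```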

Fig. 6 is a schematic structural diagram of a voice ending endpoint detection device provided in an embodiment of the present application; it includes an obtaining unit 601 and a processing unit 602, which are described in detail as follows.

An acquiring unit 601, configured to acquire a voice signal input by a user, and convert the voice signal into text information;

a processing unit 602, configured to determine a context type corresponding to the text information, and/or non-language feature information in the speech signal; determining a detection duration according to the context type and/or the non-language feature information;

the processing unit 602 is further configured to identify the pronunciation interval corresponding to each word of the text information in the voice signal, and, when no pronunciation interval of a second word falls within the detection duration following the pronunciation interval of a first word in the text information, to take the end time point of the pronunciation interval corresponding to the first word in the voice signal as a first endpoint; the first word is any word in the text information, and the second word is located after and adjacent to the first word; and, when the semantic structure of the sentence in which the first word is located is determined to be complete, to take the first endpoint as the voice ending endpoint, in the voice signal, of the sentence in which the first word is located.

In a possible implementation manner, the processing unit is further configured to determine whether the sentence in which the first word is located is ended by using a natural language understanding NLU technique when it is determined that the semantic structure of the sentence is incomplete; if the sentence is determined to be ended, taking the first endpoint as a voice ending endpoint of the sentence in the voice signal; otherwise, requesting the user to re-input the voice signal or sending indication information to the user, wherein the indication information is used for prompting the user to confirm whether the sentence is finished.

In one possible implementation manner, the processing unit is further configured to use the first duration as the detection duration when it is determined that the context type of the text information is a question context, and/or when the non-language feature information includes at least one of trailing information, hesitation information, and delay information; and when the context type of the text information is determined not to be the question context and the non-language feature information does not include the trailing information, the hesitation information and the delay information, taking a second duration as the detection duration, wherein the first duration is longer than the second duration.

For example, the first duration may be greater than 200 milliseconds and less than 2 seconds; the second duration is less than or equal to 200 milliseconds.

In a possible implementation, the processing unit is further configured to collect different question sentences in advance; analyze the linguistic composition of the question sentences, extract phrases with interrogative characteristics, and store the phrases as a phrase set; and, when the text information is acquired, identify and determine the context type corresponding to the text information according to the phrase set.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
