Method and device for speech sentence-breaking, computer equipment and storage medium

Document No.: 36619    Publication date: 2021-09-24

Note: This technology, "Method and device for speech sentence-breaking, computer equipment and storage medium" (语音断句方法、装置、计算机设备及存储介质), was created by Cao Lei and Li Junrong on 2021-06-29. Its main content is as follows. The invention relates to the technical field of artificial intelligence and provides a speech sentence-breaking method and related equipment. A silence time calculation model calculates a silence time from the speech rate and intonation of the user voice together with user parameters, and the user voice is segmented using the silence time as the breakpoint, realizing interruption judgment personalized to each user. After a plurality of first sentence-break voices are obtained, a vocabulary model identifies whether the end word of each first sentence-break voice is a target word. When a target end word is identified as a target word, the target first sentence-break voice containing it is further segmented into a plurality of second sentence-break voices; the second sentence-break voice containing the target end word is merged with the first sentence-break voice adjacent to the target first sentence-break voice to obtain a third sentence-break voice; finally, the first sentence-break voices are updated according to the third sentence-break voice to obtain the target sentence-break voice, realizing correct segmentation of the user voice.

1. A method for speech sentence-breaking, the method comprising:

acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

extracting the end word of each first sentence-break voice, and identifying whether each end word is a target word using a pre-trained vocabulary model;

when an end word is recognized as a target word, performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices;

acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and combining the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

2. The speech sentence-breaking method of claim 1, wherein after obtaining the target sentence-breaking speech, the method further comprises:

setting constraint conditions;

preprocessing the target sentence-break voice, including pre-emphasis and windowed framing;

performing a fast Fourier transform on the preprocessed target sentence-break voice to obtain a plurality of sub-bands;

linearly constraining each sub-band using the constraint conditions to obtain target sub-bands;

calculating the energy probability distribution density of each target sub-band and calculating the spectral entropy of the corresponding sub-band from the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold endpoint detection method;

and segmenting the target sentence-break voice according to the syllable start and end points.

3. The speech sentence-breaking method of claim 2, wherein the linearly constraining each sub-band using the constraint condition to obtain a target sub-band comprises:

acquiring the frequency spectrum and the spectral probability density of each sub-band;

setting spectrum components outside the preset target range to 0 and reserving those within the target range, and setting spectral probability densities greater than the preset target value to 0 and reserving those less than or equal to the preset target value, to obtain the target sub-band.

4. The speech sentence-breaking method according to any one of claims 1 to 3, characterized in that the method further comprises:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence breaking characteristics and the long-term memory sentence breaking model to obtain a sentence breaking text;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

5. The speech sentence-breaking method according to claim 4, wherein the comparing the sentence-break text with the target sentence-break voice to obtain a comparison result comprises:

converting the target sentence-breaking voice into a target sentence-breaking text;

calculating the similarity between the target sentence-break text and the corresponding sentence-break text;

when the similarity between the target sentence-break text and the corresponding sentence-break text is greater than a preset similarity threshold, the comparison result is that the sentence-break text is consistent with the target sentence-break voice;

and when the similarity between the target sentence-break text and the corresponding sentence-break text is smaller than the preset similarity threshold, the comparison result is that the sentence-break text is not consistent with the target sentence-break voice.

6. The speech sentence-breaking method of claim 5, wherein the method further comprises:

acquiring a first quantity of sentence-break texts whose comparison result is consistent with the target sentence-break voice;

acquiring a second quantity of the target sentence-breaking voices;

and calculating the accuracy of the target sentence-breaking voice according to the first quantity and the second quantity.

7. The speech sentence-breaking method according to any one of claims 1 to 3, characterized in that the method further comprises:

after the target sentence-break voice is obtained, displaying a voice text corresponding to the target sentence-break voice to a user; or

after the target sentence-break voice is obtained, adding a sentence-break mark at each position where a break is required, and displaying to the user the voice text corresponding to the sentence-break voice with the sentence-break marks added.

8. A speech sentence-breaking apparatus, characterized in that the apparatus comprises:

the time calculation module is used for acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

the first sentence-breaking module is used for performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

the word recognition module is used for extracting the end word of each first sentence-break voice and identifying whether each end word is a target word using a pre-trained vocabulary model;

the second sentence-breaking module is used for performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices when an end word is recognized as a target word;

the voice merging module is used for acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and merging the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and the voice arrangement module is used for arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

9. A computer device, characterized in that the computer device comprises a processor and a memory, the processor being configured to implement the speech sentence-breaking method according to any of the claims 1 to 7 when executing a computer program stored in the memory.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of speech sentence-breaking according to any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for sentence segmentation by voice, computer equipment and a storage medium.

Background

Currently, speech recognition robots on the market usually break sentences using pauses as the judgment basis, but in actual recognition this easily produces incorrect sentence breaks, mainly in the following situations:

1) when a user replies to the robot, the user may think while speaking: after expressing a short phrase, the user pauses to think or react, and this pause is easily mistaken for the end of a sentence, so that only the first half of the sentence is recognized;

2) when a user replies to the robot in a noisy environment or with background sound, sound continues after the user has finished speaking, so the robot cannot judge whether the sentence has ended and cannot respond in time;

3) when a user replies to the robot, because users speak with different speeds and intonations, a slow speech rate easily leads to a pause being mistaken for a sentence break, so that only the first half of the sentence is recognized.

Disclosure of Invention

In view of the foregoing, there is a need for a method, an apparatus, a computer device and a storage medium for speech sentence-breaking, which can improve the accuracy of speech sentence-breaking.

A first aspect of the present invention provides a method for speech sentence-breaking, the method comprising:

acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

extracting the end word of each first sentence-break voice, and identifying whether each end word is a target word using a pre-trained vocabulary model;

when an end word is recognized as a target word, performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices;

acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and combining the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

According to an optional embodiment of the present invention, after obtaining the target sentence-break speech, the method further comprises:

setting constraint conditions;

preprocessing the target sentence-break voice, including pre-emphasis and windowed framing;

performing a fast Fourier transform on the preprocessed target sentence-break voice to obtain a plurality of sub-bands;

linearly constraining each sub-band using the constraint conditions to obtain target sub-bands;

calculating the energy probability distribution density of each target sub-band and calculating the spectral entropy of the corresponding sub-band from the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold endpoint detection method;

and segmenting the target sentence-break voice according to the syllable start and end points.

According to an optional embodiment of the present invention, the linearly constraining each subband using the constraint condition to obtain a target subband includes:

acquiring the frequency spectrum and the spectral probability density of each sub-band;

setting spectrum components outside the preset target range to 0 and reserving those within the target range, and setting spectral probability densities greater than the preset target value to 0 and reserving those less than or equal to the preset target value, to obtain the target sub-band.

According to an alternative embodiment of the invention, the method further comprises:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence breaking characteristics and the long-term memory sentence breaking model to obtain a sentence breaking text;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

According to an optional embodiment of the present invention, the comparing the sentence-break text with the target sentence-break voice to obtain a comparison result includes:

converting the target sentence-breaking voice into a target sentence-breaking text;

calculating the similarity between the target sentence-break text and the corresponding sentence-break text;

when the similarity between the target sentence-break text and the corresponding sentence-break text is greater than a preset similarity threshold, the comparison result is that the sentence-break text is consistent with the target sentence-break voice;

and when the similarity between the target sentence-break text and the corresponding sentence-break text is smaller than the preset similarity threshold, the comparison result is that the sentence-break text is not consistent with the target sentence-break voice.

According to an alternative embodiment of the invention, the method further comprises:

acquiring a first quantity of sentence-break texts whose comparison result is consistent with the target sentence-break voice;

acquiring a second quantity of the target sentence-breaking voices;

and calculating the accuracy of the target sentence-breaking voice according to the first quantity and the second quantity.

According to an alternative embodiment of the invention, the method further comprises:

after the target sentence-break voice is obtained, displaying a voice text corresponding to the target sentence-break voice to a user; or

after the target sentence-break voice is obtained, adding a sentence-break mark at each position where a break is required, and displaying to the user the voice text corresponding to the sentence-break voice with the sentence-break marks added.

A second aspect of the present invention provides a speech sentence-breaking apparatus, comprising:

the time calculation module is used for acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

the first sentence-breaking module is used for performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

the word recognition module is used for extracting the end word of each first sentence-break voice and identifying whether each end word is a target word using a pre-trained vocabulary model;

the second sentence-breaking module is used for performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices when an end word is recognized as a target word;

the voice merging module is used for acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and merging the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and the voice arrangement module is used for arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

A third aspect of the invention provides a computer device comprising a processor and a memory, the processor being adapted to implement the speech sentence-breaking method when executing a computer program stored in the memory.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech sentence-breaking method.

In summary, in the speech sentence-breaking method, device, computer equipment and storage medium of the invention, a silence time calculation model calculates the silence time from the speech rate and intonation of the user voice and the user parameters, and the user voice is segmented according to the silence time, realizing interruption judgment personalized to each user. After a plurality of first sentence-break voices are obtained, the end word of each first sentence-break voice is extracted, and a pre-trained vocabulary model identifies whether each end word is a target word. When an end word is recognized as a target word, the first sentence-break voice containing the target word is further segmented into a plurality of second sentence-break voices, and the second sentence-break voice containing the target word is merged with the to-be-processed voice to obtain a third sentence-break voice, which effectively corrects the wrong sentence break between the target first sentence-break voice and its adjacent first sentence-break voice. Finally, the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice are arranged in sequence to obtain the target sentence-break voice, realizing correct segmentation of the user voice.

Drawings

Fig. 1 is a flowchart of a speech sentence-breaking method according to an embodiment of the present invention.

Fig. 2 is a structural diagram of a speech sentence-breaking device according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

The speech sentence-breaking method provided by the embodiment of the invention is executed by the computer equipment, and correspondingly, the speech sentence-breaking device runs in the computer equipment.

Fig. 1 is a flowchart of a speech sentence-breaking method according to an embodiment of the present invention. The method comprises the following steps; depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.

S11, obtaining user parameters and user voice, obtaining the speech rate and intonation from the user voice, and calling a silence time calculation model to obtain a silence time based on the speech rate and intonation and the user parameters.

User parameters may include, but are not limited to: age, gender, area of residence, educational level, etc.

The silence time calculation model is a machine learning model trained in advance for calculating the silence time. In some optional embodiments, user parameters of a plurality of known users and the user voice of each known user may be obtained, the speech rate and intonation extracted from each known user's voice, the user parameters and the corresponding speech rate and intonation used as training data, the silence time of each known user used as the training label, and a deep neural network trained to obtain the silence time calculation model. With the speech rate and intonation and the user parameters as input, the trained model outputs the silence time.
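The training described above can be sketched with a simple regression stand-in for the deep neural network; the feature layout (speech rate, pitch, age) and the training data below are illustrative assumptions, not values from the invention:

```python
import numpy as np

def train_silence_time_model(features, silence_times):
    """Fit a linear least-squares model as a stand-in for the deep neural network.

    features: (n_users, n_dims) array, e.g. [speech_rate, mean_pitch, age].
    silence_times: (n_users,) array of labelled per-user silence times (seconds).
    """
    # Append a bias column and solve ordinary least squares.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X, silence_times, rcond=None)
    return weights

def predict_silence_time(weights, feature_row):
    """Output the silence time for one user's speech-rate/intonation features."""
    x = np.append(feature_row, 1.0)
    return float(x @ weights)

# Hypothetical training data: slower speakers get longer silence times.
feats = np.array([[4.5, 180.0, 30.0],   # [syllables/sec, pitch in Hz, age]
                  [2.5, 150.0, 65.0],
                  [5.0, 200.0, 25.0],
                  [3.0, 160.0, 55.0]])
times = np.array([0.5, 1.2, 0.4, 1.0])  # labelled silence times (s)

w = train_silence_time_model(feats, times)
pred = predict_silence_time(w, np.array([2.8, 155.0, 60.0]))
```

A real implementation would replace the linear fit with the trained deep neural network, but the input/output contract is the same: per-user features in, a personalized silence time out.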

In some alternative embodiments, noise distorts the measured intonation of the speech, which reduces the accuracy of the subsequent sentence breaking. To reduce the influence of noise, denoising processing may be performed on the obtained user voice to obtain a denoised user voice.

S12, performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices.

Using the silence time as the breakpoint criterion, the user voice is segmented at each pause that reaches the silence time, obtaining a plurality of first sentence-break voices.
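Segmenting on pauses of at least the computed silence time can be sketched as follows; the amplitude threshold and the sample data are illustrative assumptions:

```python
import numpy as np

def split_on_silence(samples, sample_rate, silence_time, threshold=0.01):
    """Split a waveform wherever a quiet run lasts at least `silence_time` seconds.

    Returns a list of (start, end) sample indices, one per voiced segment.
    """
    min_gap = int(silence_time * sample_rate)   # pause length in samples
    quiet = np.abs(samples) < threshold
    segments, seg_start, gap = [], None, 0
    for i, q in enumerate(quiet):
        if not q:                                # voiced sample
            if seg_start is None:
                seg_start = i
            gap = 0
        else:                                    # quiet sample
            gap += 1
            # Close the current segment once the pause reaches the silence time.
            if seg_start is not None and gap >= min_gap:
                segments.append((seg_start, i - gap + 1))
                seg_start = None
    if seg_start is not None:                    # trailing voiced segment
        segments.append((seg_start, len(samples)))
    return segments
```

Each returned index pair corresponds to one first sentence-break voice.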

S13, extracting the end word of each first sentence-break voice, and identifying whether each end word is a target word using a pre-trained vocabulary model.

An end word may be a single character or a word of two or three characters.

A vocabulary model may be preset as a target word library recording target words, for example words that indicate the speaker is still thinking and therefore do not end a sentence.

After the end words are extracted, each end word is matched against the target words in the target word library to identify whether it is a target word.
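The end-word lookup against the target word library amounts to a set membership test; the English filler words standing in for the target words here are illustrative assumptions:

```python
# Hypothetical target word library: fillers that signal the speaker is still thinking.
TARGET_WORDS = {"um", "uh", "so", "and", "because"}

def end_word(segment_text):
    """Return the last word of a transcribed sentence-break segment."""
    words = segment_text.strip().lower().split()
    return words[-1] if words else ""

def is_target_word(word):
    """A segment whose end word is a target word is not a real sentence break."""
    return word in TARGET_WORDS
```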

When an end word is not recognized as a target word, it is a true sentence-final word; that is, the first sentence-break voice containing it is a genuine sentence break, and that first sentence-break voice is correct.

When an end word is recognized as a target word, it is not a sentence-final word; that is, the first sentence-break voice containing it is not a genuine sentence break, and that first sentence-break voice is wrong.

S14, when an end word is recognized as a target word, performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices.

When an end word is identified as a target word, the first sentence-break voice containing the target word needs to be further segmented, obtaining a second sentence-break voice containing the target word and second sentence-break voices not containing the target word.

S15, acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as the to-be-processed voice, and combining the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice.

For convenience of description, the first sentence-break voice containing the target word is called the target first sentence-break voice; the adjacent first sentence-break voice is the first sentence-break voice immediately following the target first sentence-break voice.

Merging the second sentence-break voice containing the target word with the first sentence-break voice adjacent to the target first sentence-break voice corrects the wrong sentence break between the target first sentence-break voice and its adjacent first sentence-break voice.

S16, arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

After the third sentence-break voice is obtained, the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice are arranged in their original order to obtain the target sentence-break voice. The target sentence-break voice is a correct segmentation, which is equivalent to updating the plurality of first sentence-break voices obtained by segmenting the user voice according to the silence time.
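The re-segmentation and merge of steps S13 to S16 can be sketched on transcribed text segments; splitting off only the trailing target word is a simplification of the further sentence-breaking step, and the segments and target word below are illustrative:

```python
def update_segments(first_segments, target_words):
    """Mirror steps S13 to S16 on text-labelled segments.

    Each element of `first_segments` stands for one first sentence-break voice
    (here, its transcription). When a segment ends in a target word, it is
    split; the trailing piece containing the target word is merged with the
    following first segment to form a third segment.
    """
    result, i = [], 0
    while i < len(first_segments):
        seg = first_segments[i]
        words = seg.split()
        if words and words[-1] in target_words and i + 1 < len(first_segments):
            head = " ".join(words[:-1])
            tail = words[-1]
            if head:
                result.append(head)                      # second segment, no target word
            result.append(tail + " " + first_segments[i + 1])  # third segment
            i += 2                                       # adjacent segment was consumed
        else:
            result.append(seg)                           # correct first segment kept as-is
            i += 1
    return result
```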

In an optional embodiment, after obtaining the target sentence-break speech, the method further comprises:

setting constraint conditions;

preprocessing the target sentence-break voice, including pre-emphasis and windowed framing;

performing a fast Fourier transform on the preprocessed target sentence-break voice to obtain a plurality of sub-bands;

linearly constraining each sub-band using the constraint conditions to obtain target sub-bands;

calculating the energy probability distribution density of each target sub-band and calculating the spectral entropy of the corresponding sub-band from the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold endpoint detection method;

and segmenting the target sentence-break voice according to the syllable start and end points.

The target sentence-break voice may be sampled at a rate of 8 kHz.
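The preprocessing chain (pre-emphasis, windowed framing, FFT, then grouping 4 spectral lines per sub-band, as the description assumes) might look as follows; the frame length, hop size, and pre-emphasis coefficient are illustrative choices:

```python
import numpy as np

def preprocess_and_subband_energies(samples, sample_rate=8000,
                                    frame_len=256, hop=128,
                                    pre_emphasis=0.97, lines_per_band=4):
    """Return sub-band energies E_b(m, i), shape (n_frames, n_subbands)."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1].
    emphasized = np.append(samples[0], samples[1:] - pre_emphasis * samples[:-1])
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    n_bins = frame_len // 2                      # positive-frequency bins only
    n_bands = n_bins // lines_per_band
    energies = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = emphasized[i * hop: i * hop + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))[:n_bins] ** 2
        # Sum each run of 4 spectral lines into one sub-band energy.
        energies[i] = spectrum[:n_bands * lines_per_band] \
            .reshape(n_bands, lines_per_band).sum(axis=1)
    return energies

t = np.arange(8000) / 8000.0                     # one second at 8 kHz
energies = preprocess_and_subband_energies(np.sin(2 * np.pi * 1000 * t))
```

With a 256-point FFT at 8 kHz, each spectral line spans 31.25 Hz, so a 1000 Hz tone falls on line 32 and hence in sub-band 8.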

The constraint conditions may include: the normalized frequency spectrum range is a preset target range, and the normalized energy probability distribution density has a preset upper-limit target value. The preset target range may be 250 Hz to 3500 Hz, and the preset target value may be 0.9. Setting these constraint conditions eliminates the influence of noise on voice segmentation and prevents syllables from being missed or wrongly segmented, thereby improving the accuracy of recognizing the voice as text.

Voice endpoint detection means accurately finding the start point and end point of the speech signal within a segment of signal, separating the effective speech signal from the useless noise signal. The double-threshold endpoint detection method extracts features of each segment of the signal and compares them with the set thresholds, exploiting the different characteristics of speech and noise signals to achieve endpoint detection.
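The double-threshold idea can be sketched on a per-frame feature curve (for example smoothed frame energy, or negated spectral entropy, since speech frames typically have lower entropy than noise); the thresholds and the curve below are illustrative:

```python
def double_threshold_detect(feature, high, low):
    """Double-threshold endpoint detection on a per-frame feature curve.

    A candidate segment must cross the strict (high) threshold; its boundaries
    are then extended outward to where the feature falls below the relaxed
    (low) threshold, capturing weak onsets and offsets.
    Returns (start, end) frame indices, or None if no speech is found.
    """
    n = len(feature)
    core = [i for i, v in enumerate(feature) if v > high]
    if not core:
        return None
    start, end = core[0], core[-1]
    while start > 0 and feature[start - 1] > low:
        start -= 1
    while end < n - 1 and feature[end + 1] > low:
        end += 1
    return start, end
```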

The energy probability distribution density P_b(m, i) of each target sub-band is calculated using the following formula:

P_b(m, i) = (E_b(m, i) + K) / Σ_{k=1}^{N_b} (E_b(k, i) + K),  1 ≤ m ≤ N_b,

where N_b is the number of sub-bands, K is an introduced positive constant, each sub-band comprises 4 spectral lines, and E_b(m, i) denotes the energy of the m-th sub-band in the i-th frame.

The spectral entropy H_b(i) of the i-th frame is calculated using the following formula:

H_b(i) = -Σ_{m=1}^{N_b} P_b(m, i) · ln P_b(m, i)
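The energy probability distribution density and the spectral entropy can be computed per frame as a direct transcription of the formulas (K = 1.0 is an illustrative choice for the positive constant):

```python
import math

def subband_probabilities(energies, K=1.0):
    """P_b(m, i) for one frame: normalized sub-band energies with constant K."""
    total = sum(e + K for e in energies)
    return [(e + K) / total for e in energies]

def spectral_entropy(energies, K=1.0):
    """H_b(i) = -sum_m P_b(m, i) * ln P_b(m, i) for one frame."""
    probs = subband_probabilities(energies, K)
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A frame with energy spread evenly over the sub-bands (noise-like) has maximal entropy ln N_b, while a frame with energy concentrated in a few sub-bands (speech-like) has lower entropy.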

in an optional embodiment, the linearly constraining each subband by using the constraint condition to obtain a target subband includes:

acquiring the frequency spectrum and the spectral probability density of each sub-band;

setting spectrum components outside the preset target range to 0 and reserving those within the target range, and setting spectral probability densities greater than the preset target value to 0 and reserving those less than or equal to the preset target value, to obtain the target sub-band.

The calculation of the frequency spectrum and the spectral probability density of a sub-band is prior art and is not described in detail here.

Since most of the spectrum of a voice signal lies within the target range, setting spectrum components outside the preset target range to 0 eliminates the influence of noise and improves discrimination between voice segments and noise segments. To further eliminate the effect of particular frequencies in some noise on the spectral entropy, spectral probability densities greater than the target value are set to 0, capping the spectral probability density at the target value.

In this optional embodiment, linearly constraining each sub-band with the constraint conditions to obtain the target sub-bands eliminates the influence of noise, including the effect of particular frequencies in some noise on the spectral entropy, with the noise normalized to 0. This not only ensures the accuracy of voice segmentation but also reduces the computation required for voice segmentation and improves its efficiency.
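Applying the two linear constraints (band-limiting the spectrum to 250 Hz to 3500 Hz and capping the probability density at 0.9) can be sketched with NumPy:

```python
import numpy as np

def apply_linear_constraints(freqs, spectrum, prob_density,
                             band=(250.0, 3500.0), cap=0.9):
    """Zero out spectrum bins outside the target band, and zero out
    probability-density values above the cap, per the constraint conditions."""
    spectrum = np.where((freqs >= band[0]) & (freqs <= band[1]), spectrum, 0.0)
    prob_density = np.where(prob_density <= cap, prob_density, 0.0)
    return spectrum, prob_density
```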

In an optional embodiment, the method further comprises:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence-breaking characteristics and a long short-term memory (LSTM) sentence-breaking model to obtain sentence-break texts;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

The long short-term memory (LSTM) sentence-breaking model is a machine learning model trained in advance to break a text into sentences according to its text sentence-breaking characteristics; the training process is prior art and is not elaborated here.

The user voice can be converted into user text by a speech-to-text technique, and the text sentence-breaking characteristics of the user text are extracted, so that the pre-trained LSTM sentence-breaking model breaks the user text according to these characteristics into a plurality of sentence-break texts.

And converting each target sentence-break voice into a target sentence-break text by adopting a voice-to-text technology, and calculating the similarity between the target sentence-break text and the corresponding sentence-break text so as to compare the target sentence-break voice with the sentence-break text. And when the similarity between the target sentence break text and the corresponding sentence break text is greater than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are compared consistently. And when the similarity between the target sentence break text and the corresponding sentence break text is smaller than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are not consistent.

A first quantity of sentence-break texts whose comparison result is consistent and a second quantity of target sentence-break voices are acquired, and the accuracy rate of the target sentence-break voices is calculated from the first quantity and the second quantity.
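A minimal sketch of this comparison and accuracy computation (the similarity measure is not specified in this description, so the standard-library `difflib.SequenceMatcher` stands in for it, and the 0.8 threshold is an assumed value):

```python
from difflib import SequenceMatcher

def punctuation_accuracy(target_texts, reference_texts, threshold=0.8):
    """Compare each target sentence-break text against the corresponding
    sentence-break text; a pair is 'consistent' when its similarity
    exceeds the threshold.  Accuracy = consistent pairs / total pairs."""
    consistent = sum(
        1 for t, r in zip(target_texts, reference_texts)
        if SequenceMatcher(None, t, r).ratio() > threshold
    )
    return consistent / len(target_texts) if target_texts else 0.0
```

Here the first quantity is `consistent` and the second quantity is `len(target_texts)`.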

It should be understood that the boundary case, where the similarity between the target sentence-break text and the corresponding sentence-break text equals the preset similarity threshold, may be assigned to either side: the method works whether that case is grouped with the greater-than (consistent) case or with the less-than (inconsistent) case.

In an optional embodiment, the method further comprises:

after the target sentence-break voice is obtained, displaying a voice text corresponding to the target sentence-break voice to a user; or

And after the target sentence-break voice is obtained, adding a sentence-break mark at the position where the sentence is required to be broken, and displaying the voice text corresponding to the sentence-break voice added with the sentence-break mark to the user.

According to the method provided by the invention, the silence time is calculated by the silence time calculation model from the speech-rate intonation of the user speech and the user parameters, and the user speech is sentence-broken according to that silence time, so that interruption judgment tailored to each individual user is realized. After the plurality of first sentence-break voices are obtained, the terminal word of each first sentence-break voice is extracted, and the pre-trained vocabulary model identifies whether each terminal word is a target word. When a terminal word is identified as a target word, the first sentence-break voice containing it is further sentence-broken into a plurality of second sentence-break voices, and the second sentence-break voice containing the target word is merged with the voice to be processed to obtain a third sentence-break voice, which effectively solves the problem that the target first sentence-break voice and its adjacent first sentence-break voice were wrongly broken. Finally, the first sentence-break voices not containing target words, the second sentence-break voices not containing target words, and the third sentence-break voices are arranged in sequence to obtain the target sentence-break voice, realizing correct sentence breaking of the user voice.

The method can be applied to outbound-call scenarios and can accurately identify genuine user interruptions, so that the user's intention is accurately acquired and the goal of the task-oriented outbound call is fulfilled, making the conversation between the robot and the user smoother and more anthropomorphic and improving the conversation experience.

It is emphasized that the user voice may be stored in a node of a blockchain in order to further ensure its privacy and security.

Fig. 2 is a structural diagram of a speech sentence-breaking device according to a second embodiment of the present invention.

In some embodiments, the speech sentence-breaking device 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the speech sentence-breaking apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of speech sentence-breaking (described in detail with reference to fig. 1).

In this embodiment, the speech sentence-breaking device 20 may be divided into a plurality of functional modules according to the functions performed by the speech sentence-breaking device. The functional module may include: the system comprises a time calculation module 201, a first sentence break module 202, a word recognition module 203, a second sentence break module 204, a voice combination module 205, a voice arrangement module 206, a voice segmentation module 207 and a sentence break comparison module 208. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.

The time calculation module 201 is configured to obtain a user parameter and a user voice, obtain a speech rate intonation according to the user voice, and call a silent time calculation model to obtain a silent time based on the speech rate intonation and the user parameter.

User parameters may include, but are not limited to: age, gender, area of residence, educational level, etc.

The silence time calculation model is a machine learning model trained in advance to calculate the silence time. In some optional embodiments, user parameters of a plurality of known users and the user voice of each known user may be obtained, the speech-rate intonation of each known user's voice is extracted, the user parameters and corresponding speech-rate intonations are used as training data, the silence time of each known user is used as the training label, and a deep neural network is trained to obtain the silence time calculation model. With the speech-rate intonation and the user parameters as input, the model outputs the silence time.
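The model described above is a deep neural network; as a hedged, self-contained stand-in, a small ridge-regression fit over hypothetical feature rows (speech rate, intonation, encoded user parameters such as age and gender) illustrates the same train-then-predict interface:

```python
import numpy as np

def train_silence_model(features, silence_times, reg=1e-3):
    """Fit a linear stand-in for the silence-time model.
    features: one row per known user (speech rate, intonation, ...);
    silence_times: label in seconds per known user.
    Uses regularized least squares with an appended bias column."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]),
                        X.T @ silence_times)
    return w

def predict_silence(w, feature_row):
    """Predict the personalized silence time for one user's features."""
    return float(np.append(feature_row, 1.0) @ w)
```

The feature encoding and the linear form are illustrative assumptions; the description itself only fixes the inputs (speech-rate intonation plus user parameters) and the output (silence time).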

In some alternative embodiments, noise alters the extracted speech-rate intonation, which decreases the accuracy of the subsequent speech sentence breaking. To reduce the influence of noise, denoising processing may be performed on the obtained user speech to obtain denoised user speech.

The first sentence-break module 202 is configured to perform sentence-break processing on the user speech according to the silence time to obtain a plurality of first sentence-break voices.

And with the silence time as a breakpoint, sentence breaking is carried out on the user voice, so that a plurality of first sentence breaking voices are obtained.
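A sketch of this breakpoint logic, assuming the user voice has already been reduced to per-frame energies and the silence time converted into a frame count (`silence_frames`); the energy floor is an illustrative assumption:

```python
def break_at_silence(frame_energies, silence_frames, energy_floor=0.01):
    """Cut the voice into first sentence-break segments, breaking only
    where a silent gap (energy below energy_floor) lasts at least
    silence_frames frames.  Returns (start, end) frame index pairs."""
    # 1) collect runs of voiced frames
    voiced, start = [], None
    for i, e in enumerate(frame_energies):
        if e >= energy_floor and start is None:
            start = i
        elif e < energy_floor and start is not None:
            voiced.append((start, i))
            start = None
    if start is not None:
        voiced.append((start, len(frame_energies)))
    # 2) merge runs whose gap is shorter than the silence time:
    #    a short pause is not a breakpoint
    segments = []
    for s, e in voiced:
        if segments and s - segments[-1][1] < silence_frames:
            segments[-1] = (segments[-1][0], e)
        else:
            segments.append((s, e))
    return segments
```

Only gaps at least as long as the personalized silence time become breakpoints, so a user who pauses briefly mid-sentence is not cut off.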

The word recognition module 203 is configured to extract a terminal word in each first sentence-breaking speech, and recognize whether each terminal word is a target word using a pre-trained vocabulary model.

The terminal word may be a single character, or a word composed of two or three characters.

A vocabulary model may be preset; it may be a target word library recording target words, for example, filler words that indicate the speaker is still thinking and therefore are not sentence-ending words.

And after extracting the terminal words, matching the terminal words with each target word in the target word library so as to identify whether the terminal words are the target words.
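In its simplest form the vocabulary-model lookup reduces to set membership against the target word library; the entries below are hypothetical English fillers standing in for words that represent thinking:

```python
# Hypothetical target-word library: filler words signalling that the
# speaker is still thinking, so the utterance has not really ended.
TARGET_WORDS = {"um", "uh", "well", "so", "and"}

def is_target_word(terminal_word, target_words=TARGET_WORDS):
    """Return True when the extracted terminal word matches an entry
    of the target word library, i.e. the first sentence-break speech
    was cut too early and needs further processing."""
    return terminal_word.lower() in target_words
```

A trained vocabulary model could replace the literal set with a classifier, but the decision it feeds into (correct break vs. premature break) is the same.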

When the target end word is not recognized as the target word, the target end word is indicated as the end word, namely the first sentence break voice containing the target end word is a real sentence break, and the first sentence break voice containing the target end word is correct.

When the target end word is identified as the target word, the target end word is not the end word, that is, the first sentence break voice containing the target end word is not a real sentence break, and the first sentence break voice containing the target end word is wrong.

The second sentence-breaking module 204 is configured to, when the end word is identified as the target word, perform sentence-breaking processing on the first sentence-breaking voice containing the target word to obtain a plurality of second sentence-breaking voices.

When the terminal word is identified as the target word, it is indicated that sentence break processing needs to be further performed on the first sentence break voice containing the target word, so that a second sentence break voice containing the target word and a second sentence break voice not containing the target word are obtained.

The speech merging module 205 is configured to acquire an adjacent sentence break speech of the first sentence break speech including the target word as a to-be-processed speech, and merge the second sentence break speech including the target word with the to-be-processed speech to obtain a third sentence break speech.

For convenience of description, the first sentence-break voice containing the target word is referred to as the target first sentence-break voice; its adjacent first sentence-break voice is the one located immediately to its right.

And merging the second sentence-break voice containing the target words and the first sentence-break voice adjacent to the target first sentence-break voice, so that the problem that the target first sentence-break voice and the first sentence-break voice adjacent to the target first sentence-break voice are wrongly sentence-broken can be solved.

The speech arrangement module 206 is configured to obtain the first sentence-breaking speech that does not include the target word; and arranging the first sentence break voice not containing the target words, the second sentence break voice not containing the target words and the third sentence break voice in sequence to obtain the target sentence break voice.

After the third sentence-break voice is obtained, the first sentence-break voice not containing the target words, the second sentence-break voice not containing the target words and the third sentence-break voice are arranged in sequence to obtain the target sentence-break voice, and the target sentence-break voice is a correct sentence-break, which is equivalent to the effect of updating a plurality of first sentence-break voices obtained by sentence-break processing of the user voice according to the silence time.
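Working with text stand-ins for the audio segments, the splice-and-merge update can be sketched as follows (`update_segments`, its arguments, and the example words are all illustrative assumptions):

```python
def update_segments(first_segs, target_idx, second_segs, target_word):
    """Rebuild the segment list after re-breaking first_segs[target_idx]:
    the second segment ending with the target word is merged with the
    adjacent segment on its right (the third segment), and the remaining
    second segments are spliced back in order."""
    keep = [s for s in second_segs if not s.endswith(target_word)]
    tail = next(s for s in second_segs if s.endswith(target_word))
    right = first_segs[target_idx + 1]      # adjacent first segment
    third = tail + " " + right              # merged third segment
    return first_segs[:target_idx] + keep + [third] \
        + first_segs[target_idx + 2:]
```

For example, if "that works and" was cut too early, the trailing "and" belongs with the continuation "we can meet", yielding the corrected sequence.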

In an optional embodiment, after obtaining the target sentence-break speech, the speech segmentation module 207 is configured to:

setting constraint conditions;

preprocessing the target sentence-breaking voice including pre-emphasis and windowing framing;

performing fast Fourier transform on the preprocessed target sentence-breaking voice to obtain a plurality of sub-bands;

carrying out linear constraint on each sub-band by using the constraint conditions to obtain a target sub-band;

calculating the energy probability distribution density of each target sub-band and calculating the spectrum entropy of the corresponding sub-band according to the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold end point detection method;

and carrying out voice segmentation on the target sentence-breaking voice according to the syllable starting point and the syllable ending point.

The target speech segment may be sampled at a sampling rate of 8 kHz.

The constraints may include: the normalized spectrum range is a preset target range, and the upper limit of the normalized energy probability distribution density is a preset target value. The preset target range may be 250-3500 Hz, and the preset target value may be 0.9. By setting these constraints, the influence of noise on voice segmentation can be eliminated and the omission of certain syllables during segmentation avoided, thereby improving the accuracy of speech-to-text recognition.

Voice endpoint detection means accurately finding the starting point and ending point of a voice signal within a segment of signal, separating the effective voice signal from the useless noise signal. The double-threshold endpoint detection method extracts features of each segment of the signal, exploiting the differing characteristics of voice and noise, and compares them with the set thresholds, thereby achieving endpoint detection.
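A sketch of double-threshold detection driven by the sub-band spectral-entropy track (speech lowers the entropy, so the thresholds are crossed downward); the 3-frame smoothing window and the two threshold ratios are assumed values, not taken from this description:

```python
import numpy as np

def detect_syllables(entropy, strict_ratio=0.8, loose_ratio=0.9):
    """Double-threshold endpoint detection on a spectral-entropy track.
    A syllable is triggered where the smoothed entropy falls below the
    strict threshold, then its start and end points are extended
    outward while the entropy stays below the loose threshold."""
    padded = np.pad(np.asarray(entropy, float), 1, mode="edge")
    smoothed = np.convolve(padded, np.ones(3) / 3, mode="valid")
    base = smoothed.mean()
    t_strict, t_loose = base * strict_ratio, base * loose_ratio
    syllables, i, n = [], 0, len(smoothed)
    while i < n:
        if smoothed[i] < t_strict:              # certainly a speech frame
            s = i
            while s > 0 and smoothed[s - 1] < t_loose:
                s -= 1                          # extend the start point
            e = i
            while e < n - 1 and smoothed[e + 1] < t_loose:
                e += 1                          # extend the end point
            syllables.append((s, e + 1))
            i = e + 1                           # frame e+1 is above t_loose
        else:
            i += 1
    return syllables
```

The strict threshold avoids false triggers from noise; the loose threshold recovers the quieter onset and tail of each syllable, which a single threshold would clip.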

The energy probability distribution density P_b(m, i) of the target sub-band is calculated using the following formula:

P_b(m, i) = (E_b(m, i) + K) / ∑_{k=1}^{N_b} (E_b(k, i) + K)

where 1 ≤ m ≤ N_b, N_b is the number of sub-bands, K is an introduced positive constant, each sub-band comprises 4 spectral lines, and E_b(m, i) denotes the energy of the m-th sub-band in the i-th frame.

The spectral entropy H_b(i) of the sub-bands is calculated using the following formula:

H_b(i) = -∑_{m=1}^{N_b} P_b(m, i) · log P_b(m, i)
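The energy probability density and spectral entropy described above can be sketched per frame as follows (the value of K and the use of the natural logarithm are assumptions):

```python
import numpy as np

def subband_spectral_entropy(subband_energies, K=0.5):
    """For one frame: P_b(m) = (E_b(m) + K) / sum_k (E_b(k) + K),
    then H_b = -sum_m P_b(m) * log P_b(m).  The small positive
    constant K keeps empty sub-bands from dominating the entropy."""
    e = np.asarray(subband_energies, dtype=float) + K
    p = e / e.sum()                    # energy probability density
    return float(-np.sum(p * np.log(p)))
```

Uniform sub-band energies (noise-like spectra) give the maximum entropy log(N_b), while energy concentrated in a few sub-bands (voiced speech) gives a lower value, which is exactly the contrast the double-threshold detector exploits.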

in an optional embodiment, the linearly constraining each subband by using the constraint condition to obtain a target subband includes:

acquiring the frequency spectrum and the frequency spectrum probability density of each sub-band;

setting spectrum components outside the preset target range to 0 while retaining those within the target range, and setting spectral probability density values greater than the preset target value to 0 while retaining those smaller than or equal to the preset target value, so as to obtain the target sub-band.

The calculation of the frequency spectrum and the spectral probability density of a sub-band is prior art and is not described in detail herein.

Since most of the spectrum of a voice signal lies within the target range, setting spectrum components outside the preset target range to 0 eliminates the influence of noise and improves the discrimination between voice segments and noise segments. To further eliminate the effect that a particular frequency in some noise has on the spectral entropy, spectral probability density values greater than the target value are set to 0, so that the spectral probability density is capped at the target value.

In the optional implementation, each sub-band is linearly constrained by setting constraint conditions to obtain a target sub-band, so that not only can the influence of noise be eliminated, but also the influence of certain frequency in certain noise on spectral entropy can be eliminated, and the noise is normalized to 0, thereby not only ensuring the accuracy of voice segmentation, but also reducing the calculation amount of voice segmentation and improving the calculation efficiency of voice segmentation.

In an optional embodiment, the sentence break comparison module 208 is configured to:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence-breaking characteristics and a long short-term memory (LSTM) sentence-breaking model to obtain sentence-break texts;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

The long short-term memory (LSTM) sentence-breaking model is a machine learning model trained in advance to break a text into sentences according to its text sentence-breaking characteristics; the training process is prior art and is not elaborated here.

The user voice can be converted into user text by a speech-to-text technique, and the text sentence-breaking characteristics of the user text are extracted, so that the pre-trained LSTM sentence-breaking model breaks the user text according to these characteristics into a plurality of sentence-break texts.

And converting each target sentence-break voice into a target sentence-break text by adopting a voice-to-text technology, and calculating the similarity between the target sentence-break text and the corresponding sentence-break text so as to compare the target sentence-break voice with the sentence-break text. And when the similarity between the target sentence break text and the corresponding sentence break text is greater than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are compared consistently. And when the similarity between the target sentence break text and the corresponding sentence break text is smaller than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are not consistent.

The first quantity of sentence-break texts whose comparison result is consistent and the second quantity of target sentence-break voices are counted, and the accuracy rate of the target sentence-break voices is calculated from the first quantity and the second quantity.

It should be understood that the boundary case, where the similarity between the target sentence-break text and the corresponding sentence-break text equals the preset similarity threshold, may be assigned to either side: the method works whether that case is grouped with the greater-than (consistent) case or with the less-than (inconsistent) case.

In an optional implementation manner, after the target sentence-break voice is obtained, the voice text corresponding to the target sentence-break voice may be displayed to the user; or after the target sentence-break voice is obtained, adding a sentence-break mark at the position where the sentence is required to be broken, and displaying the voice text corresponding to the sentence-break voice added with the sentence-break mark to the user.

The device provided by the invention calculates the silence time with the silence time calculation model from the speech-rate intonation of the user speech and the user parameters, and sentence-breaks the user speech according to that silence time, realizing interruption judgment tailored to each individual user. After the plurality of first sentence-break voices are obtained, the terminal word of each first sentence-break voice is extracted, and the pre-trained vocabulary model identifies whether each terminal word is a target word. When a terminal word is identified as a target word, the first sentence-break voice containing it is further sentence-broken into a plurality of second sentence-break voices, and the second sentence-break voice containing the target word is merged with the voice to be processed to obtain a third sentence-break voice, which effectively solves the problem that the target first sentence-break voice and its adjacent first sentence-break voice were wrongly broken. Finally, the first sentence-break voices not containing target words, the second sentence-break voices not containing target words, and the third sentence-break voices are arranged in sequence to obtain the target sentence-break voice, realizing correct sentence breaking of the user voice.

The device can be applied to outbound-call scenarios and can accurately identify genuine user interruptions, so that the user's intention is accurately acquired and the goal of the task-oriented outbound call is fulfilled, making the conversation between the robot and the user smoother and more anthropomorphic and improving the conversation experience.

It is emphasized that the user voice may be stored in a node of a blockchain in order to further ensure its privacy and security.

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.

It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not constitute a limitation of the embodiments of the present invention; it may be a bus-type or star-type configuration, and the computer device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.

In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.

It should be noted that the computer device 3 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.

In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the speech sentence-breaking method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In some embodiments, the at least one processor 32 is a Control Unit (Control Unit) of the computer device 3, connects various components of the entire computer device 3 by using various interfaces and lines, and executes various functions and processes data of the computer device 3 by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or a portion of the steps of the speech sentence-breaking method described in embodiments of the present invention; or to implement all or part of the functionality of the speech sentence-breaking device. The at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.

In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.

Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the claims can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not imply any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
