Method and device for speech sentence-breaking, computer equipment and storage medium

Document No.: 36619    Publication date: 2021-09-24

Note: This technology, "Method and device for speech sentence-breaking, computer equipment and storage medium" (语音断句方法、装置、计算机设备及存储介质), was created by Cao Lei and Li Junrong on 2021-06-29. Its main content is as follows. The invention relates to the technical field of artificial intelligence and provides a speech sentence-breaking method and related equipment. A silence time calculation model calculates a silence time from the speech rate and intonation of the user voice together with user parameters, and the user voice is segmented using the silence time as the breakpoint, realizing interruption judgment personalized to each user. After a plurality of first sentence-break voices are obtained, a vocabulary model identifies whether the end word of each first sentence-break voice is a target word. When a target end word is identified as a target word, the target first sentence-break voice containing it is further segmented into a plurality of second sentence-break voices; the second sentence-break voice containing the target end word is merged with the first sentence-break voice adjacent to the target first sentence-break voice to obtain a third sentence-break voice; finally, the first sentence-break voices are updated according to the third sentence-break voice to obtain the target sentence-break voice, realizing correct segmentation of the user voice.

1. A method for speech sentence-breaking, the method comprising:

acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

extracting the end word of each first sentence-break voice, and identifying whether each end word is a target word using a pre-trained vocabulary model;

when an end word is recognized as a target word, performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices;

acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and combining the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

2. The speech sentence-breaking method of claim 1, wherein after obtaining the target sentence-breaking speech, the method further comprises:

setting constraint conditions;

preprocessing the target sentence-break voice, including pre-emphasis and windowed framing;

performing a fast Fourier transform on the preprocessed target sentence-break voice to obtain a plurality of sub-bands;

linearly constraining each sub-band using the constraint conditions to obtain target sub-bands;

calculating the energy probability distribution density of each target sub-band and calculating the spectral entropy of the corresponding sub-band from the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold endpoint detection method;

and segmenting the target sentence-break voice according to the syllable start and end points.

3. The speech sentence-breaking method of claim 2, wherein the linearly constraining each sub-band using the constraint condition to obtain a target sub-band comprises:

acquiring the frequency spectrum and the spectral probability density of each sub-band;

setting spectrum components outside the preset target range to 0 and reserving those within the target range, and setting spectral probability densities greater than the preset target value to 0 and reserving those less than or equal to the preset target value, to obtain the target sub-band.

4. The speech sentence-breaking method according to any one of claims 1 to 3, characterized in that the method further comprises:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence breaking characteristics and the long-term memory sentence breaking model to obtain a sentence breaking text;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

5. The speech sentence-breaking method according to claim 4, wherein the comparing the sentence-break text with the target sentence-break voice to obtain a comparison result comprises:

converting the target sentence-breaking voice into a target sentence-breaking text;

calculating the similarity between the target sentence-break text and the corresponding sentence-break text;

when the similarity between the target sentence-break text and the corresponding sentence-break text is greater than a preset similarity threshold, the comparison result is that the sentence-break text is consistent with the target sentence-break voice;

and when the similarity between the target sentence-break text and the corresponding sentence-break text is smaller than the preset similarity threshold, the comparison result is that the sentence-break text is not consistent with the target sentence-break voice.

6. The speech sentence-breaking method of claim 5, wherein the method further comprises:

acquiring a first quantity of sentence-break texts whose comparison result is consistent with the target sentence-break voice;

acquiring a second quantity of the target sentence-breaking voices;

and calculating the accuracy of the target sentence-breaking voice according to the first quantity and the second quantity.

7. The speech sentence-breaking method according to any one of claims 1 to 3, characterized in that the method further comprises:

after the target sentence-break voice is obtained, displaying a voice text corresponding to the target sentence-break voice to a user; or

after the target sentence-break voice is obtained, adding a sentence-break mark at each position where a break is required, and displaying to the user the voice text corresponding to the sentence-break voice with the sentence-break marks added.

8. A speech sentence-breaking apparatus, characterized in that the apparatus comprises:

the time calculation module is used for acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

the first sentence-breaking module is used for performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

the word recognition module is used for extracting the end word of each first sentence-break voice and identifying whether each end word is a target word using a pre-trained vocabulary model;

the second sentence-breaking module is used for performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices when an end word is recognized as a target word;

the voice merging module is used for acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and merging the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and the voice arrangement module is used for arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

9. A computer device, characterized in that the computer device comprises a processor and a memory, the processor being configured to implement the speech sentence-breaking method according to any of the claims 1 to 7 when executing a computer program stored in the memory.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of speech sentence-breaking according to any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for sentence segmentation by voice, computer equipment and a storage medium.

Background

Currently, speech recognition robots on the market usually break sentences using pauses as the judgment basis, but in actual recognition this easily produces incorrect sentence breaks, mainly in the following situations:

1) when a user replies to the robot, the user may think while speaking: after expressing a short phrase, the user pauses to think or react, and this pause is easily mistaken for the end of a sentence, so that only the first half of the sentence is recognized;

2) when a user replies to the robot in a noisy environment or with background sound, sound continues after the user has finished speaking, so the robot cannot judge whether the sentence has ended and cannot respond in time;

3) when a user replies to the robot, because users speak with different speeds and intonations, a slow speech rate easily leads to a pause being mistaken for a sentence break, so that only the first half of the sentence is recognized.

Disclosure of Invention

In view of the foregoing, there is a need for a method, an apparatus, a computer device and a storage medium for speech sentence-breaking, which can improve the accuracy of speech sentence-breaking.

A first aspect of the present invention provides a method for speech sentence-breaking, the method comprising:

acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

extracting the end word of each first sentence-break voice, and identifying whether each end word is a target word using a pre-trained vocabulary model;

when an end word is recognized as a target word, performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices;

acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and combining the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

According to an optional embodiment of the present invention, after obtaining the target sentence-break speech, the method further comprises:

setting constraint conditions;

preprocessing the target sentence-break voice, including pre-emphasis and windowed framing;

performing a fast Fourier transform on the preprocessed target sentence-break voice to obtain a plurality of sub-bands;

linearly constraining each sub-band using the constraint conditions to obtain target sub-bands;

calculating the energy probability distribution density of each target sub-band and calculating the spectral entropy of the corresponding sub-band from the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold endpoint detection method;

and segmenting the target sentence-break voice according to the syllable start and end points.

According to an optional embodiment of the present invention, the linearly constraining each subband using the constraint condition to obtain a target subband includes:

acquiring the frequency spectrum and the spectral probability density of each sub-band;

setting spectrum components outside the preset target range to 0 and reserving those within the target range, and setting spectral probability densities greater than the preset target value to 0 and reserving those less than or equal to the preset target value, to obtain the target sub-band.

According to an alternative embodiment of the invention, the method further comprises:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence breaking characteristics and the long-term memory sentence breaking model to obtain a sentence breaking text;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

According to an optional embodiment of the present invention, the comparing the sentence-break text with the target sentence-break voice to obtain a comparison result includes:

converting the target sentence-breaking voice into a target sentence-breaking text;

calculating the similarity between the target sentence-break text and the corresponding sentence-break text;

when the similarity between the target sentence-break text and the corresponding sentence-break text is greater than a preset similarity threshold, the comparison result is that the sentence-break text is consistent with the target sentence-break voice;

and when the similarity between the target sentence-break text and the corresponding sentence-break text is smaller than the preset similarity threshold, the comparison result is that the sentence-break text is not consistent with the target sentence-break voice.

According to an alternative embodiment of the invention, the method further comprises:

acquiring a first quantity of sentence-break texts whose comparison result is consistent with the target sentence-break voice;

acquiring a second quantity of the target sentence-breaking voices;

and calculating the accuracy of the target sentence-breaking voice according to the first quantity and the second quantity.

According to an alternative embodiment of the invention, the method further comprises:

after the target sentence-break voice is obtained, displaying a voice text corresponding to the target sentence-break voice to a user; or

after the target sentence-break voice is obtained, adding a sentence-break mark at each position where a break is required, and displaying to the user the voice text corresponding to the sentence-break voice with the sentence-break marks added.

A second aspect of the present invention provides a speech sentence-breaking apparatus, comprising:

the time calculation module is used for acquiring user parameters and user voice, acquiring a speech rate and intonation from the user voice, and calling a silence time calculation model to acquire a silence time based on the speech rate and intonation and the user parameters;

the first sentence-breaking module is used for performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices;

the word recognition module is used for extracting the end word of each first sentence-break voice and identifying whether each end word is a target word using a pre-trained vocabulary model;

the second sentence-breaking module is used for performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices when an end word is recognized as a target word;

the voice merging module is used for acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as a to-be-processed voice, and merging the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice;

and the voice arrangement module is used for arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

A third aspect of the invention provides a computer device comprising a processor and a memory, the processor being adapted to implement the speech sentence-breaking method when executing a computer program stored in the memory.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech sentence-breaking method.

In summary, in the speech sentence-breaking method, device, computer equipment and storage medium of the invention, a silence time calculation model calculates the silence time from the speech rate and intonation of the user voice and the user parameters, and the user voice is segmented according to the silence time, realizing interruption judgment personalized to each user. After a plurality of first sentence-break voices are obtained, the end word of each first sentence-break voice is extracted, and a pre-trained vocabulary model identifies whether each end word is a target word. When an end word is recognized as a target word, the first sentence-break voice containing the target word is further segmented into a plurality of second sentence-break voices, and the second sentence-break voice containing the target word is merged with the to-be-processed voice to obtain a third sentence-break voice, which effectively corrects the wrong sentence break between the target first sentence-break voice and its adjacent first sentence-break voice. Finally, the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice are arranged in sequence to obtain the target sentence-break voice, realizing correct segmentation of the user voice.

Drawings

Fig. 1 is a flowchart of a speech sentence-breaking method according to an embodiment of the present invention.

Fig. 2 is a structural diagram of a speech sentence-breaking device according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

The speech sentence-breaking method provided by the embodiment of the invention is executed by the computer equipment, and correspondingly, the speech sentence-breaking device runs in the computer equipment.

Fig. 1 is a flowchart of a speech sentence-breaking method according to an embodiment of the present invention. The method comprises the following steps; depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.

S11, obtaining user parameters and user voice, obtaining the speech rate and intonation from the user voice, and calling a silence time calculation model to obtain a silence time based on the speech rate and intonation and the user parameters.

User parameters may include, but are not limited to: age, gender, area of residence, educational level, etc.

The silence time calculation model is a machine learning model trained in advance for calculating the silence time. In some optional embodiments, user parameters of a plurality of known users and the user voice of each known user may be obtained, the speech rate and intonation extracted from each known user's voice, the user parameters and the corresponding speech rate and intonation used as training data, the silence time of each known user used as the training label, and a deep neural network trained to obtain the silence time calculation model. With the speech rate and intonation and the user parameters as input, the trained model outputs the silence time.
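The training described above can be sketched with a simple regression stand-in for the deep neural network; the feature layout (speech rate, pitch, age) and the training data below are illustrative assumptions, not values from the invention:

```python
import numpy as np

def train_silence_time_model(features, silence_times):
    """Fit a linear least-squares model as a stand-in for the deep neural network.

    features: (n_users, n_dims) array, e.g. [speech_rate, mean_pitch, age].
    silence_times: (n_users,) array of labelled per-user silence times (seconds).
    """
    # Append a bias column and solve ordinary least squares.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X, silence_times, rcond=None)
    return weights

def predict_silence_time(weights, feature_row):
    """Output the silence time for one user's speech-rate/intonation features."""
    x = np.append(feature_row, 1.0)
    return float(x @ weights)

# Hypothetical training data: slower speakers get longer silence times.
feats = np.array([[4.5, 180.0, 30.0],   # [syllables/sec, pitch in Hz, age]
                  [2.5, 150.0, 65.0],
                  [5.0, 200.0, 25.0],
                  [3.0, 160.0, 55.0]])
times = np.array([0.5, 1.2, 0.4, 1.0])  # labelled silence times (s)

w = train_silence_time_model(feats, times)
pred = predict_silence_time(w, np.array([2.8, 155.0, 60.0]))
```

A real implementation would replace the linear fit with the trained deep neural network, but the input/output contract is the same: per-user features in, a personalized silence time out.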

In some alternative embodiments, noise distorts the measured intonation of the speech, which reduces the accuracy of the subsequent sentence breaking. To reduce the influence of noise, denoising processing may be performed on the obtained user voice to obtain a denoised user voice.

S12, performing sentence-breaking processing on the user voice according to the silence time to obtain a plurality of first sentence-break voices.

Using the silence time as the breakpoint criterion, the user voice is segmented at each pause that reaches the silence time, obtaining a plurality of first sentence-break voices.
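Segmenting on pauses of at least the computed silence time can be sketched as follows; the amplitude threshold and the sample data are illustrative assumptions:

```python
import numpy as np

def split_on_silence(samples, sample_rate, silence_time, threshold=0.01):
    """Split a waveform wherever a quiet run lasts at least `silence_time` seconds.

    Returns a list of (start, end) sample indices, one per voiced segment.
    """
    min_gap = int(silence_time * sample_rate)   # pause length in samples
    quiet = np.abs(samples) < threshold
    segments, seg_start, gap = [], None, 0
    for i, q in enumerate(quiet):
        if not q:                                # voiced sample
            if seg_start is None:
                seg_start = i
            gap = 0
        else:                                    # quiet sample
            gap += 1
            # Close the current segment once the pause reaches the silence time.
            if seg_start is not None and gap >= min_gap:
                segments.append((seg_start, i - gap + 1))
                seg_start = None
    if seg_start is not None:                    # trailing voiced segment
        segments.append((seg_start, len(samples)))
    return segments
```

Each returned index pair corresponds to one first sentence-break voice.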

S13, extracting the end word of each first sentence-break voice, and identifying whether each end word is a target word using a pre-trained vocabulary model.

An end word may be a single character or a word of two or three characters.

A vocabulary model may be preset as a target word library recording target words, for example words that indicate the speaker is still thinking and therefore do not end a sentence.

After the end words are extracted, each end word is matched against the target words in the target word library to identify whether it is a target word.
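The end-word lookup against the target word library amounts to a set membership test; the English filler words standing in for the target words here are illustrative assumptions:

```python
# Hypothetical target word library: fillers that signal the speaker is still thinking.
TARGET_WORDS = {"um", "uh", "so", "and", "because"}

def end_word(segment_text):
    """Return the last word of a transcribed sentence-break segment."""
    words = segment_text.strip().lower().split()
    return words[-1] if words else ""

def is_target_word(word):
    """A segment whose end word is a target word is not a real sentence break."""
    return word in TARGET_WORDS
```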

When an end word is not recognized as a target word, it is a true sentence-final word; that is, the first sentence-break voice containing it is a genuine sentence break, and that first sentence-break voice is correct.

When an end word is recognized as a target word, it is not a sentence-final word; that is, the first sentence-break voice containing it is not a genuine sentence break, and that first sentence-break voice is wrong.

S14, when an end word is recognized as a target word, performing sentence-breaking processing on the first sentence-break voice containing the target word to obtain a plurality of second sentence-break voices.

When an end word is identified as a target word, the first sentence-break voice containing the target word needs to be further segmented, obtaining a second sentence-break voice containing the target word and second sentence-break voices not containing the target word.

S15, acquiring the sentence-break voice adjacent to the first sentence-break voice containing the target word as the to-be-processed voice, and combining the second sentence-break voice containing the target word with the to-be-processed voice to obtain a third sentence-break voice.

For convenience of description, the first sentence-break voice containing the target word is called the target first sentence-break voice; the adjacent first sentence-break voice is the first sentence-break voice immediately following the target first sentence-break voice.

Merging the second sentence-break voice containing the target word with the first sentence-break voice adjacent to the target first sentence-break voice corrects the wrong sentence break between the target first sentence-break voice and its adjacent first sentence-break voice.

S16, arranging the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice in sequence to obtain the target sentence-break voice.

After the third sentence-break voice is obtained, the first sentence-break voices not containing the target word, the second sentence-break voices not containing the target word, and the third sentence-break voice are arranged in their original order to obtain the target sentence-break voice. The target sentence-break voice is a correct segmentation, which is equivalent to updating the plurality of first sentence-break voices obtained by segmenting the user voice according to the silence time.
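The re-segmentation and merge of steps S13 to S16 can be sketched on transcribed text segments; splitting off only the trailing target word is a simplification of the further sentence-breaking step, and the segments and target word below are illustrative:

```python
def update_segments(first_segments, target_words):
    """Mirror steps S13 to S16 on text-labelled segments.

    Each element of `first_segments` stands for one first sentence-break voice
    (here, its transcription). When a segment ends in a target word, it is
    split; the trailing piece containing the target word is merged with the
    following first segment to form a third segment.
    """
    result, i = [], 0
    while i < len(first_segments):
        seg = first_segments[i]
        words = seg.split()
        if words and words[-1] in target_words and i + 1 < len(first_segments):
            head = " ".join(words[:-1])
            tail = words[-1]
            if head:
                result.append(head)                      # second segment, no target word
            result.append(tail + " " + first_segments[i + 1])  # third segment
            i += 2                                       # adjacent segment was consumed
        else:
            result.append(seg)                           # correct first segment kept as-is
            i += 1
    return result
```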

In an optional embodiment, after obtaining the target sentence-break speech, the method further comprises:

setting constraint conditions;

preprocessing the target sentence-break voice, including pre-emphasis and windowed framing;

performing a fast Fourier transform on the preprocessed target sentence-break voice to obtain a plurality of sub-bands;

linearly constraining each sub-band using the constraint conditions to obtain target sub-bands;

calculating the energy probability distribution density of each target sub-band and calculating the spectral entropy of the corresponding sub-band from the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold endpoint detection method;

and segmenting the target sentence-break voice according to the syllable start and end points.

The target sentence-break voice may be sampled at a rate of 8 kHz.
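The preprocessing chain (pre-emphasis, windowed framing, FFT, then grouping 4 spectral lines per sub-band, as the description assumes) might look as follows; the frame length, hop size, and pre-emphasis coefficient are illustrative choices:

```python
import numpy as np

def preprocess_and_subband_energies(samples, sample_rate=8000,
                                    frame_len=256, hop=128,
                                    pre_emphasis=0.97, lines_per_band=4):
    """Return sub-band energies E_b(m, i), shape (n_frames, n_subbands)."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1].
    emphasized = np.append(samples[0], samples[1:] - pre_emphasis * samples[:-1])
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    n_bins = frame_len // 2                      # positive-frequency bins only
    n_bands = n_bins // lines_per_band
    energies = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = emphasized[i * hop: i * hop + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))[:n_bins] ** 2
        # Sum each run of 4 spectral lines into one sub-band energy.
        energies[i] = spectrum[:n_bands * lines_per_band] \
            .reshape(n_bands, lines_per_band).sum(axis=1)
    return energies

t = np.arange(8000) / 8000.0                     # one second at 8 kHz
energies = preprocess_and_subband_energies(np.sin(2 * np.pi * 1000 * t))
```

With a 256-point FFT at 8 kHz, each spectral line spans 31.25 Hz, so a 1000 Hz tone falls on line 32 and hence in sub-band 8.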

The constraint conditions may include: the normalized frequency spectrum range is a preset target range, and the normalized energy probability distribution density has a preset upper-limit target value. The preset target range may be 250 Hz to 3500 Hz, and the preset target value may be 0.9. Setting these constraint conditions eliminates the influence of noise on voice segmentation and prevents syllables from being missed or wrongly segmented, thereby improving the accuracy of recognizing the voice as text.

Voice endpoint detection means accurately finding the start point and end point of the speech signal within a segment of signal, separating the effective speech signal from the useless noise signal. The double-threshold endpoint detection method extracts features of each segment of the signal and compares them with the set thresholds, exploiting the different characteristics of speech and noise signals to achieve endpoint detection.
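The double-threshold idea can be sketched on a per-frame feature curve (for example smoothed frame energy, or negated spectral entropy, since speech frames typically have lower entropy than noise); the thresholds and the curve below are illustrative:

```python
def double_threshold_detect(feature, high, low):
    """Double-threshold endpoint detection on a per-frame feature curve.

    A candidate segment must cross the strict (high) threshold; its boundaries
    are then extended outward to where the feature falls below the relaxed
    (low) threshold, capturing weak onsets and offsets.
    Returns (start, end) frame indices, or None if no speech is found.
    """
    n = len(feature)
    core = [i for i, v in enumerate(feature) if v > high]
    if not core:
        return None
    start, end = core[0], core[-1]
    while start > 0 and feature[start - 1] > low:
        start -= 1
    while end < n - 1 and feature[end + 1] > low:
        end += 1
    return start, end
```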

The energy probability distribution density P_b(m, i) of each target sub-band is calculated using the following formula:

P_b(m, i) = (E_b(m, i) + K) / Σ_{k=1}^{N_b} (E_b(k, i) + K),  1 ≤ m ≤ N_b,

where N_b is the number of sub-bands, K is an introduced positive constant, each sub-band comprises 4 spectral lines, and E_b(m, i) denotes the energy of the m-th sub-band in the i-th frame.

The spectral entropy H_b(i) of the i-th frame is calculated using the following formula:

H_b(i) = -Σ_{m=1}^{N_b} P_b(m, i) · ln P_b(m, i)
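The energy probability distribution density and the spectral entropy can be computed per frame as a direct transcription of the formulas (K = 1.0 is an illustrative choice for the positive constant):

```python
import math

def subband_probabilities(energies, K=1.0):
    """P_b(m, i) for one frame: normalized sub-band energies with constant K."""
    total = sum(e + K for e in energies)
    return [(e + K) / total for e in energies]

def spectral_entropy(energies, K=1.0):
    """H_b(i) = -sum_m P_b(m, i) * ln P_b(m, i) for one frame."""
    probs = subband_probabilities(energies, K)
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A frame with energy spread evenly over the sub-bands (noise-like) has maximal entropy ln N_b, while a frame with energy concentrated in a few sub-bands (speech-like) has lower entropy.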

in an optional embodiment, the linearly constraining each subband by using the constraint condition to obtain a target subband includes:

acquiring the frequency spectrum and the spectral probability density of each sub-band;

setting spectrum components outside the preset target range to 0 and reserving those within the target range, and setting spectral probability densities greater than the preset target value to 0 and reserving those less than or equal to the preset target value, to obtain the target sub-band.

The calculation of the frequency spectrum and the spectral probability density of a sub-band is prior art and is not described in detail here.

Since most of the spectrum of a voice signal lies within the target range, setting spectrum components outside the preset target range to 0 eliminates the influence of noise and improves discrimination between voice segments and noise segments. To further eliminate the effect of particular frequencies in some noise on the spectral entropy, spectral probability densities greater than the target value are set to 0, capping the spectral probability density at the target value.

In this optional embodiment, linearly constraining each sub-band with the constraint conditions to obtain the target sub-bands eliminates the influence of noise, including the effect of particular frequencies in some noise on the spectral entropy, with the noise normalized to 0. This not only ensures the accuracy of voice segmentation but also reduces the computation required for voice segmentation and improves its efficiency.
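Applying the two linear constraints (band-limiting the spectrum to 250 Hz to 3500 Hz and capping the probability density at 0.9) can be sketched with NumPy:

```python
import numpy as np

def apply_linear_constraints(freqs, spectrum, prob_density,
                             band=(250.0, 3500.0), cap=0.9):
    """Zero out spectrum bins outside the target band, and zero out
    probability-density values above the cap, per the constraint conditions."""
    spectrum = np.where((freqs >= band[0]) & (freqs <= band[1]), spectrum, 0.0)
    prob_density = np.where(prob_density <= cap, prob_density, 0.0)
    return spectrum, prob_density
```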

In an optional embodiment, the method further comprises:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence-breaking characteristics and a long short-term memory (LSTM) sentence-breaking model to obtain sentence-break texts;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

The long short-term memory (LSTM) sentence-breaking model is a machine learning model trained in advance to break a text into sentences according to its text sentence-breaking characteristics; the training process is prior art and is not elaborated here.

The user voice can be converted into user text by a speech-to-text technique, and the text sentence-breaking characteristics of the user text are extracted, so that the pre-trained LSTM sentence-breaking model breaks the user text according to these characteristics into a plurality of sentence-break texts.

And converting each target sentence-break voice into a target sentence-break text by adopting a voice-to-text technology, and calculating the similarity between the target sentence-break text and the corresponding sentence-break text so as to compare the target sentence-break voice with the sentence-break text. And when the similarity between the target sentence break text and the corresponding sentence break text is greater than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are compared consistently. And when the similarity between the target sentence break text and the corresponding sentence break text is smaller than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are not consistent.

A first quantity of sentence-break texts whose comparison result is consistent and a second quantity of target sentence-break voices are acquired, and the accuracy rate of the target sentence-break voices is calculated from the first quantity and the second quantity.
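A minimal sketch of this comparison and accuracy computation (the similarity measure is not specified in this description, so the standard-library `difflib.SequenceMatcher` stands in for it, and the 0.8 threshold is an assumed value):

```python
from difflib import SequenceMatcher

def punctuation_accuracy(target_texts, reference_texts, threshold=0.8):
    """Compare each target sentence-break text against the corresponding
    sentence-break text; a pair is 'consistent' when its similarity
    exceeds the threshold.  Accuracy = consistent pairs / total pairs."""
    consistent = sum(
        1 for t, r in zip(target_texts, reference_texts)
        if SequenceMatcher(None, t, r).ratio() > threshold
    )
    return consistent / len(target_texts) if target_texts else 0.0
```

Here the first quantity is `consistent` and the second quantity is `len(target_texts)`.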

It should be understood that the boundary case, where the similarity between the target sentence-break text and the corresponding sentence-break text equals the preset similarity threshold, may be assigned to either side: the method works whether that case is grouped with the greater-than (consistent) case or with the less-than (inconsistent) case.

In an optional embodiment, the method further comprises:

after the target sentence-break voice is obtained, displaying a voice text corresponding to the target sentence-break voice to a user; or

And after the target sentence-break voice is obtained, adding a sentence-break mark at the position where the sentence is required to be broken, and displaying the voice text corresponding to the sentence-break voice added with the sentence-break mark to the user.

According to the method provided by the invention, the silence time is calculated by the silence time calculation model from the speech-rate intonation of the user speech and the user parameters, and the user speech is sentence-broken according to that silence time, so that interruption judgment tailored to each individual user is realized. After the plurality of first sentence-break voices are obtained, the terminal word of each first sentence-break voice is extracted, and the pre-trained vocabulary model identifies whether each terminal word is a target word. When a terminal word is identified as a target word, the first sentence-break voice containing it is further sentence-broken into a plurality of second sentence-break voices, and the second sentence-break voice containing the target word is merged with the voice to be processed to obtain a third sentence-break voice, which effectively solves the problem that the target first sentence-break voice and its adjacent first sentence-break voice were wrongly broken. Finally, the first sentence-break voices not containing target words, the second sentence-break voices not containing target words, and the third sentence-break voices are arranged in sequence to obtain the target sentence-break voice, realizing correct sentence breaking of the user voice.

The method can be applied to outbound-call scenarios and can accurately identify genuine user interruptions, so that the user's intention is accurately acquired and the goal of the task-oriented outbound call is fulfilled, making the conversation between the robot and the user smoother and more anthropomorphic and improving the conversation experience.

It is emphasized that the user voice may be stored in a node of a blockchain in order to further ensure its privacy and security.

Fig. 2 is a structural diagram of a speech sentence-breaking device according to a second embodiment of the present invention.

In some embodiments, the speech sentence-breaking device 20 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the speech sentence-breaking apparatus 20 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of speech sentence-breaking (described in detail with reference to fig. 1).

In this embodiment, the speech sentence-breaking device 20 may be divided into a plurality of functional modules according to the functions performed by the speech sentence-breaking device. The functional module may include: the system comprises a time calculation module 201, a first sentence break module 202, a word recognition module 203, a second sentence break module 204, a voice combination module 205, a voice arrangement module 206, a voice segmentation module 207 and a sentence break comparison module 208. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.

The time calculation module 201 is configured to obtain a user parameter and a user voice, obtain a speech rate intonation according to the user voice, and call a silent time calculation model to obtain a silent time based on the speech rate intonation and the user parameter.

User parameters may include, but are not limited to: age, gender, area of residence, educational level, etc.

The silence time calculation model is a machine learning model trained in advance to calculate the silence time. In some optional embodiments, user parameters of a plurality of known users and the user voice of each known user may be obtained, the speech-rate intonation of each known user's voice is extracted, the user parameters and corresponding speech-rate intonations are used as training data, the silence time of each known user is used as the training label, and a deep neural network is trained to obtain the silence time calculation model. With the speech-rate intonation and the user parameters as input, the model outputs the silence time.
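The model described above is a deep neural network; as a hedged, self-contained stand-in, a small ridge-regression fit over hypothetical feature rows (speech rate, intonation, encoded user parameters such as age and gender) illustrates the same train-then-predict interface:

```python
import numpy as np

def train_silence_model(features, silence_times, reg=1e-3):
    """Fit a linear stand-in for the silence-time model.
    features: one row per known user (speech rate, intonation, ...);
    silence_times: label in seconds per known user.
    Uses regularized least squares with an appended bias column."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]),
                        X.T @ silence_times)
    return w

def predict_silence(w, feature_row):
    """Predict the personalized silence time for one user's features."""
    return float(np.append(feature_row, 1.0) @ w)
```

The feature encoding and the linear form are illustrative assumptions; the description itself only fixes the inputs (speech-rate intonation plus user parameters) and the output (silence time).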

In some alternative embodiments, noise alters the extracted speech-rate intonation, which decreases the accuracy of the subsequent speech sentence breaking. To reduce the influence of noise, denoising processing may be performed on the obtained user speech to obtain denoised user speech.

The first sentence-break module 202 is configured to perform sentence-break processing on the user speech according to the silence time to obtain a plurality of first sentence-break voices.

And with the silence time as a breakpoint, sentence breaking is carried out on the user voice, so that a plurality of first sentence breaking voices are obtained.
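A sketch of this breakpoint logic, assuming the user voice has already been reduced to per-frame energies and the silence time converted into a frame count (`silence_frames`); the energy floor is an illustrative assumption:

```python
def break_at_silence(frame_energies, silence_frames, energy_floor=0.01):
    """Cut the voice into first sentence-break segments, breaking only
    where a silent gap (energy below energy_floor) lasts at least
    silence_frames frames.  Returns (start, end) frame index pairs."""
    # 1) collect runs of voiced frames
    voiced, start = [], None
    for i, e in enumerate(frame_energies):
        if e >= energy_floor and start is None:
            start = i
        elif e < energy_floor and start is not None:
            voiced.append((start, i))
            start = None
    if start is not None:
        voiced.append((start, len(frame_energies)))
    # 2) merge runs whose gap is shorter than the silence time:
    #    a short pause is not a breakpoint
    segments = []
    for s, e in voiced:
        if segments and s - segments[-1][1] < silence_frames:
            segments[-1] = (segments[-1][0], e)
        else:
            segments.append((s, e))
    return segments
```

Only gaps at least as long as the personalized silence time become breakpoints, so a user who pauses briefly mid-sentence is not cut off.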

The word recognition module 203 is configured to extract a terminal word in each first sentence-breaking speech, and recognize whether each terminal word is a target word using a pre-trained vocabulary model.

The terminal word may be a single character, or a word composed of two or three characters.

A vocabulary model may be preset; it may be a target word library recording target words, for example, filler words that indicate the speaker is still thinking and therefore are not sentence-ending words.

And after extracting the terminal words, matching the terminal words with each target word in the target word library so as to identify whether the terminal words are the target words.
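In its simplest form the vocabulary-model lookup reduces to set membership against the target word library; the entries below are hypothetical English fillers standing in for words that represent thinking:

```python
# Hypothetical target-word library: filler words signalling that the
# speaker is still thinking, so the utterance has not really ended.
TARGET_WORDS = {"um", "uh", "well", "so", "and"}

def is_target_word(terminal_word, target_words=TARGET_WORDS):
    """Return True when the extracted terminal word matches an entry
    of the target word library, i.e. the first sentence-break speech
    was cut too early and needs further processing."""
    return terminal_word.lower() in target_words
```

A trained vocabulary model could replace the literal set with a classifier, but the decision it feeds into (correct break vs. premature break) is the same.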

When the target end word is not recognized as the target word, the target end word is indicated as the end word, namely the first sentence break voice containing the target end word is a real sentence break, and the first sentence break voice containing the target end word is correct.

When the target end word is identified as the target word, the target end word is not the end word, that is, the first sentence break voice containing the target end word is not a real sentence break, and the first sentence break voice containing the target end word is wrong.

The second sentence-breaking module 204 is configured to, when the end word is identified as the target word, perform sentence-breaking processing on the first sentence-breaking voice containing the target word to obtain a plurality of second sentence-breaking voices.

When the terminal word is identified as the target word, it is indicated that sentence break processing needs to be further performed on the first sentence break voice containing the target word, so that a second sentence break voice containing the target word and a second sentence break voice not containing the target word are obtained.

The speech merging module 205 is configured to acquire an adjacent sentence break speech of the first sentence break speech including the target word as a to-be-processed speech, and merge the second sentence break speech including the target word with the to-be-processed speech to obtain a third sentence break speech.

For convenience of description, the first sentence-break voice containing the target word is referred to as the target first sentence-break voice; its adjacent first sentence-break voice is the one located immediately to its right.

And merging the second sentence-break voice containing the target words and the first sentence-break voice adjacent to the target first sentence-break voice, so that the problem that the target first sentence-break voice and the first sentence-break voice adjacent to the target first sentence-break voice are wrongly sentence-broken can be solved.

The speech arrangement module 206 is configured to obtain the first sentence-breaking speech that does not include the target word; and arranging the first sentence break voice not containing the target words, the second sentence break voice not containing the target words and the third sentence break voice in sequence to obtain the target sentence break voice.

After the third sentence-break voice is obtained, the first sentence-break voice not containing the target words, the second sentence-break voice not containing the target words and the third sentence-break voice are arranged in sequence to obtain the target sentence-break voice, and the target sentence-break voice is a correct sentence-break, which is equivalent to the effect of updating a plurality of first sentence-break voices obtained by sentence-break processing of the user voice according to the silence time.
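Working with text stand-ins for the audio segments, the splice-and-merge update can be sketched as follows (`update_segments`, its arguments, and the example words are all illustrative assumptions):

```python
def update_segments(first_segs, target_idx, second_segs, target_word):
    """Rebuild the segment list after re-breaking first_segs[target_idx]:
    the second segment ending with the target word is merged with the
    adjacent segment on its right (the third segment), and the remaining
    second segments are spliced back in order."""
    keep = [s for s in second_segs if not s.endswith(target_word)]
    tail = next(s for s in second_segs if s.endswith(target_word))
    right = first_segs[target_idx + 1]      # adjacent first segment
    third = tail + " " + right              # merged third segment
    return first_segs[:target_idx] + keep + [third] \
        + first_segs[target_idx + 2:]
```

For example, if "that works and" was cut too early, the trailing "and" belongs with the continuation "we can meet", yielding the corrected sequence.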

In an optional embodiment, after obtaining the target sentence-break speech, the speech segmentation module 207 is configured to:

setting constraint conditions;

preprocessing the target sentence-breaking voice including pre-emphasis and windowing framing;

performing fast Fourier transform on the preprocessed target sentence-breaking voice to obtain a plurality of sub-bands;

carrying out linear constraint on each sub-band by using the constraint conditions to obtain a target sub-band;

calculating the energy probability distribution density of each target sub-band and calculating the spectrum entropy of the corresponding sub-band according to the energy probability distribution density;

smoothing the spectral entropy of each sub-band to obtain a threshold value;

detecting syllable start and end points based on the threshold value by using a double-threshold end point detection method;

and carrying out voice segmentation on the target sentence-breaking voice according to the syllable starting point and the syllable ending point.

The target speech segment may be sampled at a sampling rate of 8 kHz.

The constraints may include: the normalized spectrum range is a preset target range, and the upper limit of the normalized energy probability distribution density is a preset target value. The preset target range may be 250-3500 Hz, and the preset target value may be 0.9. By setting these constraints, the influence of noise on voice segmentation can be eliminated and the omission of certain syllables during segmentation avoided, thereby improving the accuracy of speech-to-text recognition.

Voice endpoint detection means accurately finding the starting point and ending point of a voice signal within a segment of signal, separating the effective voice signal from the useless noise signal. The double-threshold endpoint detection method extracts features of each segment of the signal, exploiting the differing characteristics of voice and noise, and compares them with the set thresholds, thereby achieving endpoint detection.
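A sketch of double-threshold detection driven by the sub-band spectral-entropy track (speech lowers the entropy, so the thresholds are crossed downward); the 3-frame smoothing window and the two threshold ratios are assumed values, not taken from this description:

```python
import numpy as np

def detect_syllables(entropy, strict_ratio=0.8, loose_ratio=0.9):
    """Double-threshold endpoint detection on a spectral-entropy track.
    A syllable is triggered where the smoothed entropy falls below the
    strict threshold, then its start and end points are extended
    outward while the entropy stays below the loose threshold."""
    padded = np.pad(np.asarray(entropy, float), 1, mode="edge")
    smoothed = np.convolve(padded, np.ones(3) / 3, mode="valid")
    base = smoothed.mean()
    t_strict, t_loose = base * strict_ratio, base * loose_ratio
    syllables, i, n = [], 0, len(smoothed)
    while i < n:
        if smoothed[i] < t_strict:              # certainly a speech frame
            s = i
            while s > 0 and smoothed[s - 1] < t_loose:
                s -= 1                          # extend the start point
            e = i
            while e < n - 1 and smoothed[e + 1] < t_loose:
                e += 1                          # extend the end point
            syllables.append((s, e + 1))
            i = e + 1                           # frame e+1 is above t_loose
        else:
            i += 1
    return syllables
```

The strict threshold avoids false triggers from noise; the loose threshold recovers the quieter onset and tail of each syllable, which a single threshold would clip.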

The energy probability distribution density P_b(m, i) of the target sub-band is calculated using the following formula:

P_b(m, i) = (E_b(m, i) + K) / ∑_{k=1}^{N_b} (E_b(k, i) + K)

where 1 ≤ m ≤ N_b, N_b is the number of sub-bands, K is an introduced positive constant, each sub-band comprises 4 spectral lines, and E_b(m, i) denotes the energy of the m-th sub-band in the i-th frame.

The spectral entropy H_b(i) of the sub-bands is calculated using the following formula:

H_b(i) = -∑_{m=1}^{N_b} P_b(m, i) · log P_b(m, i)
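The energy probability density and spectral entropy described above can be sketched per frame as follows (the value of K and the use of the natural logarithm are assumptions):

```python
import numpy as np

def subband_spectral_entropy(subband_energies, K=0.5):
    """For one frame: P_b(m) = (E_b(m) + K) / sum_k (E_b(k) + K),
    then H_b = -sum_m P_b(m) * log P_b(m).  The small positive
    constant K keeps empty sub-bands from dominating the entropy."""
    e = np.asarray(subband_energies, dtype=float) + K
    p = e / e.sum()                    # energy probability density
    return float(-np.sum(p * np.log(p)))
```

Uniform sub-band energies (noise-like spectra) give the maximum entropy log(N_b), while energy concentrated in a few sub-bands (voiced speech) gives a lower value, which is exactly the contrast the double-threshold detector exploits.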

in an optional embodiment, the linearly constraining each subband by using the constraint condition to obtain a target subband includes:

acquiring the frequency spectrum and the frequency spectrum probability density of each sub-band;

setting spectrum components outside the preset target range to 0 while retaining those within the target range, and setting spectral probability density values greater than the preset target value to 0 while retaining those smaller than or equal to the preset target value, so as to obtain the target sub-band.

The calculation of the frequency spectrum and the spectral probability density of a sub-band is prior art and is not described in detail herein.

Since most of the spectrum of a voice signal lies within the target range, setting spectrum components outside the preset target range to 0 eliminates the influence of noise and improves the discrimination between voice segments and noise segments. To further eliminate the effect that a particular frequency in some noise has on the spectral entropy, spectral probability density values greater than the target value are set to 0, so that the spectral probability density is capped at the target value.

In the optional implementation, each sub-band is linearly constrained by setting constraint conditions to obtain a target sub-band, so that not only can the influence of noise be eliminated, but also the influence of certain frequency in certain noise on spectral entropy can be eliminated, and the noise is normalized to 0, thereby not only ensuring the accuracy of voice segmentation, but also reducing the calculation amount of voice segmentation and improving the calculation efficiency of voice segmentation.

In an optional embodiment, the sentence break comparison module 208 is configured to:

converting the user speech into user text;

performing word segmentation processing on the user text to obtain a plurality of keywords;

acquiring a word vector of each keyword;

generating text sentence breaking characteristics according to the word vectors;

carrying out sentence breaking on the user text according to the text sentence-breaking characteristics and a long short-term memory (LSTM) sentence-breaking model to obtain sentence-break texts;

and comparing the sentence break text with the target sentence break voice to obtain a comparison result.

The long short-term memory (LSTM) sentence-breaking model is a machine learning model trained in advance to break a text into sentences according to its text sentence-breaking characteristics; the training process is prior art and is not elaborated here.

The user voice can be converted into user text by a speech-to-text technique, and the text sentence-breaking characteristics of the user text are extracted, so that the pre-trained LSTM sentence-breaking model breaks the user text according to these characteristics into a plurality of sentence-break texts.

And converting each target sentence-break voice into a target sentence-break text by adopting a voice-to-text technology, and calculating the similarity between the target sentence-break text and the corresponding sentence-break text so as to compare the target sentence-break voice with the sentence-break text. And when the similarity between the target sentence break text and the corresponding sentence break text is greater than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are compared consistently. And when the similarity between the target sentence break text and the corresponding sentence break text is smaller than a preset similarity threshold, the comparison result is that the sentence break text and the target sentence break voice are not consistent.

The first quantity of sentence-break texts whose comparison result is consistent and the second quantity of target sentence-break voices are counted, and the accuracy rate of the target sentence-break voices is calculated from the first quantity and the second quantity.

It should be understood that the boundary case, where the similarity between the target sentence-break text and the corresponding sentence-break text equals the preset similarity threshold, may be assigned to either side: the method works whether that case is grouped with the greater-than (consistent) case or with the less-than (inconsistent) case.

In an optional implementation manner, after the target sentence-break voice is obtained, the voice text corresponding to the target sentence-break voice may be displayed to the user; or after the target sentence-break voice is obtained, adding a sentence-break mark at the position where the sentence is required to be broken, and displaying the voice text corresponding to the sentence-break voice added with the sentence-break mark to the user.

The device provided by the invention calculates the silence time with the silence time calculation model from the speech-rate intonation of the user speech and the user parameters, and sentence-breaks the user speech according to that silence time, realizing interruption judgment tailored to each individual user. After the plurality of first sentence-break voices are obtained, the terminal word of each first sentence-break voice is extracted, and the pre-trained vocabulary model identifies whether each terminal word is a target word. When a terminal word is identified as a target word, the first sentence-break voice containing it is further sentence-broken into a plurality of second sentence-break voices, and the second sentence-break voice containing the target word is merged with the voice to be processed to obtain a third sentence-break voice, which effectively solves the problem that the target first sentence-break voice and its adjacent first sentence-break voice were wrongly broken. Finally, the first sentence-break voices not containing target words, the second sentence-break voices not containing target words, and the third sentence-break voices are arranged in sequence to obtain the target sentence-break voice, realizing correct sentence breaking of the user voice.

The device can be applied to outbound-call scenarios and can accurately identify genuine user interruptions, so that the user's intention is accurately acquired and the goal of the task-oriented outbound call is fulfilled, making the conversation between the robot and the user smoother and more anthropomorphic and improving the conversation experience.

It is emphasized that the user voice may be stored in a node of a blockchain in order to further ensure its privacy and security.

Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.

It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 3 does not constitute a limitation of the embodiments of the present invention; it may be a bus-type or star-type configuration, and the computer device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.

In some embodiments, the computer device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 3 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.

It should be noted that the computer device 3 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.

In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the speech sentence-breaking method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In some embodiments, the at least one processor 32 is a Control Unit (Control Unit) of the computer device 3, connects various components of the entire computer device 3 by using various interfaces and lines, and executes various functions and processes data of the computer device 3 by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or a portion of the steps of the speech sentence-breaking method described in embodiments of the present invention; or to implement all or part of the functionality of the speech sentence-breaking device. The at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.

In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.

Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the claims can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not imply any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
