Voice recognition defect detection method and device


Reading note: This technique, "Voice recognition defect detection method and device" (一种语音识别缺陷检测方法和装置), was created by 韩传宇, 孙仿逊, 易晖 and 翁志伟 on 2021-06-24. Abstract: An embodiment of the present invention provides a speech recognition defect detection method and device. The method includes: receiving a voice request forwarded by a vehicle-mounted system; performing intent classification on the text of the voice request; performing confidence classification on the text according to the intent classification result; and determining candidate words from the text according to the confidence classification result and screening out keywords from the candidate words as speech recognition defects. With the embodiment of the present invention, speech recognition defects in the text of a voice request can be identified more accurately.

1. A method for detecting a voice recognition defect, comprising:

receiving a voice request forwarded by a vehicle-mounted system;

performing intent classification on text of the voice request;

performing confidence classification on the text according to the intention classification result;

and determining candidate words from the text according to the confidence classification result and screening out keywords from the candidate words as voice recognition defects.

2. The method of claim 1, wherein said confidence classifying the text according to the intent classification result comprises:

determining whether the text has a user intention according to the intention classification result;

and if the text is determined to have the user intention according to the intention classification result, performing confidence classification on the text.

3. The method of claim 1, wherein the determining candidate words from the text according to the confidence classification result and screening out keywords as voice recognition defects comprises:

and if it is determined according to the confidence classification result that the text has error words, determining candidate words from the text and screening out keywords from the candidate words as the voice recognition defects.

4. The method of claim 2, wherein the intent classification of the text of the voice request comprises:

predicting whether the text has the user intention by adopting a preset intention classification model to obtain a prediction result; the prediction result includes a probability that the text has the user intention and a probability that the text does not have the user intention.

5. The method of claim 4, wherein the determining whether the text has a user intent according to the intent classification result comprises:

judging whether a preset probability condition is met according to the probability that the text has the user intention and the probability that the text does not have the user intention, wherein the preset probability condition comprises: the probability of having the user intention being greater than the probability of not having the user intention, and/or the probability of having the user intention being greater than a preset probability threshold;

if the preset probability condition is met, determining that the text has the user intention;

and if the preset probability condition is not met, determining that the text does not have the user intention.

6. The method of claim 2, wherein said confidence classifying said text comprises:

obtaining confidence information of the text, wherein the confidence information comprises confidence of each word of the text obtained by performing voice recognition on the voice request;

classifying the text according to the confidence information to obtain a confidence classification result, wherein the confidence classification result is that the text has error words or that the text does not have error words.

7. The method of claim 6, wherein classifying the text according to the confidence information to obtain a confidence classification result comprises:

judging whether any word of the text has a confidence lower than a preset confidence threshold;

if the text has a word with a confidence lower than the preset confidence threshold, determining that error words exist in the text;

and if the text has no word with a confidence lower than the preset confidence threshold, determining that no error words exist in the text.

8. The method of claim 6, wherein classifying the text according to the confidence information to obtain a confidence classification result comprises:

determining the confidence of each character of the text and the word-formation position of each character;

predicting whether the text has erroneous characters according to the confidence of each character of the text and the word-formation position of each character;

and determining a confidence degree classification result according to the prediction result.

9. The method of claim 8, wherein predicting whether the text has an error word according to the confidence level of each word of the text and the word formation position of each word comprises:

and inputting each character of the text, the confidence corresponding to each character and the word forming position of each character into a preset error prediction model for processing to obtain a prediction result of whether each character is wrong or not.

10. The method of claim 1, wherein the determining candidate words from the text and screening out keywords as speech recognition defects comprises:

identifying a field corresponding to the text;

respectively determining the importance degree of each word of the text in the text of the field;

determining candidate words according to the importance degree of each word of the text in the text of the field;

and screening out keywords as voice recognition defects according to the parts of speech of the candidate words.

11. The method of claim 10, wherein the screening out keywords as speech recognition defects according to parts of speech of the candidate words comprises:

and screening out nouns and/or verbs from the candidate words to serve as voice recognition defects.

12. A speech recognition defect detecting apparatus, comprising:

the voice request receiving module is used for receiving a voice request forwarded by the vehicle-mounted system;

an intention classification module for performing intention classification on the text of the voice request;

the confidence classification module is used for carrying out confidence classification on the text according to the intention classification result;

and the screening module is used for determining candidate words from the text according to the confidence classification result and screening out keywords from the candidate words as the voice recognition defects.

13. An electronic device, comprising: processor, memory and computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the speech recognition defect detection method according to any one of claims 1 to 11.

14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the voice recognition defect detection method according to any one of claims 1 to 11.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a voice recognition defect detection method and a voice recognition defect detection device.

Background

Intelligent voice systems are widely used in products such as mobile phones, wristbands, smart speakers, televisions and vehicles, and support various voice interaction scenarios such as question answering and voice control. The accuracy of ASR (Automatic Speech Recognition) in a voice system is a key factor constraining the development of intelligent voice products: the speech recognition system converts the user's voice request into text, and the intelligent voice system performs subsequent NLP (Natural Language Processing) on that text. However, for technical reasons, the text generated by the speech recognition system may contain erroneous words, which in turn leads to deviations in the subsequent natural language processing.

Disclosure of Invention

In view of the above problems, embodiments of the present invention are proposed to provide a speech recognition defect detection method and a corresponding speech recognition defect detection apparatus that overcome or at least partially solve the above problems.

In order to solve the above problem, an embodiment of the present invention discloses a method for detecting a voice recognition defect, including:

receiving a voice request forwarded by a vehicle-mounted system;

performing intent classification on text of the voice request;

performing confidence classification on the text according to the intention classification result;

and determining candidate words from the text according to the confidence classification result and screening out keywords from the candidate words as voice recognition defects.

Optionally, the performing confidence classification on the text according to the intention classification result includes:

determining whether the text has a user intention according to the intention classification result;

and if the text is determined to have the user intention according to the intention classification result, performing confidence classification on the text.

Optionally, the determining candidate words from the text according to the confidence classification result and screening out keywords as the voice recognition defects includes:

and if it is determined according to the confidence classification result that the text has error words, determining candidate words from the text and screening out keywords from the candidate words as the voice recognition defects.

Optionally, the intent classification of the text of the voice request includes:

predicting whether the text has the user intention by adopting a preset intention classification model to obtain a prediction result; the prediction result includes a probability that the text has the user intention and a probability that the text does not have the user intention.

Optionally, the determining whether the text has the user intention according to the intention classification result includes:

judging whether a preset probability condition is met according to the probability that the text has the user intention and the probability that the text does not have the user intention, wherein the preset probability condition comprises: the probability of having the user intention being greater than the probability of not having the user intention, and/or the probability of having the user intention being greater than a preset probability threshold;

if the preset probability condition is met, determining that the text has the user intention;

and if the preset probability condition is not met, determining that the text does not have the user intention.

Optionally, the performing confidence classification on the text includes:

obtaining confidence information of the text, wherein the confidence information comprises confidence of each word of the text obtained by performing voice recognition on the voice request;

classifying the text according to the confidence information to obtain a confidence classification result, wherein the confidence classification result is that the text has error words or that the text does not have error words.

Optionally, the classifying the text according to the confidence information to obtain a confidence classification result includes:

judging whether any word of the text has a confidence lower than a preset confidence threshold;

if the text has a word with a confidence lower than the preset confidence threshold, determining that error words exist in the text;

and if the text has no word with a confidence lower than the preset confidence threshold, determining that no error words exist in the text.

Optionally, the classifying the text according to the confidence information to obtain a confidence classification result includes:

determining the confidence of each character of the text and the word-formation position of each character;

predicting whether the text has erroneous characters according to the confidence of each character of the text and the word-formation position of each character;

and determining a confidence degree classification result according to the prediction result.

Optionally, the predicting whether the text has an error word according to the confidence of each word of the text and the word formation position of each word includes:

and inputting each character of the text, the confidence corresponding to each character and the word forming position of each character into a preset error prediction model for processing to obtain a prediction result of whether each character is wrong or not.

Optionally, the determining candidate words from the text and screening out keywords as voice recognition defects includes:

identifying a field corresponding to the text;

respectively determining the importance degree of each word of the text in the text of the field;

determining candidate words according to the importance degree of each word of the text in the text of the field;

and screening out keywords as voice recognition defects according to the parts of speech of the candidate words.

Optionally, the screening out a keyword as a voice recognition defect according to the part of speech of the candidate word includes:

and screening out nouns and/or verbs from the candidate words to serve as voice recognition defects.

The embodiment of the invention also discloses a voice recognition defect detection device, which comprises:

the voice request receiving module is used for receiving a voice request forwarded by the vehicle-mounted system;

an intention classification module for performing intention classification on the text of the voice request;

the confidence classification module is used for carrying out confidence classification on the text according to the intention classification result;

and the screening module is used for determining candidate words from the text according to the confidence classification result and screening out keywords from the candidate words as the voice recognition defects.

Optionally, the confidence classification module comprises:

an intention determining submodule for determining whether the text has a user intention according to the intention classification result;

and the confidence degree classification submodule is used for carrying out confidence degree classification on the text if the text is determined to have the user intention according to the intention classification result.

Optionally, the screening module comprises:

and the screening submodule is used for, if it is determined according to the confidence classification result that the text has error words, determining candidate words from the text and screening out keywords from the candidate words as the voice recognition defects.

Optionally, the intent classification module includes:

the intention prediction submodule is used for predicting whether the text has the user intention by adopting a preset intention classification model to obtain a prediction result; the prediction result includes a probability that the text has the user intention and a probability that the text does not have the user intention.

Optionally, the intent determination sub-module comprises:

the probability condition judging unit is used for judging whether a preset probability condition is met according to the probability that the text has the user intention and the probability that the text does not have the user intention, wherein the preset probability condition comprises: the probability of having the user intention being greater than the probability of not having the user intention, and/or the probability of having the user intention being greater than a preset probability threshold;

a first intention determining unit, configured to determine that the text has a user intention if the preset probability condition is satisfied;

and the second intention determining unit is used for determining that the text does not have the user intention if the preset probability condition is not met.

Optionally, the confidence classification submodule includes:

a confidence information acquiring unit, configured to acquire confidence information of the text, where the confidence information includes a confidence of each word of the text obtained by performing voice recognition on the voice request;

the confidence classification unit is used for classifying the texts according to the confidence information to obtain confidence classification results; and the confidence degree classification result is that the text has error words or the text does not have error words.

Optionally, the confidence classification unit includes:

the confidence threshold comparison subunit is used for judging whether a word with a confidence lower than a preset confidence threshold exists in each word of the text;

the first error word determining subunit is used for determining that an error word exists in the text if the text has a word with a confidence coefficient lower than a preset confidence coefficient threshold;

and the second error word determining subunit is used for determining that no error word exists in the text if the text does not have a word with the confidence coefficient lower than a preset confidence coefficient threshold value.

Optionally, the confidence classification unit includes:

the character information determining subunit is used for determining the confidence coefficient of each character of the text and the word forming position of each character;

the wrong character prediction subunit is used for predicting whether the text has wrong characters or not according to the confidence coefficient of each character of the text and the word formation position of each character;

and the confidence degree classification subunit is used for determining a confidence degree classification result according to the prediction result.

Optionally, the incorrect word prediction subunit includes:

and the model prediction subunit is used for inputting each character of the text, the confidence coefficient corresponding to each character and the word formation position of each character into a preset error prediction model for processing to obtain a prediction result of whether each character is wrong.

Optionally, the screening module comprises:

the field identification submodule is used for identifying the field corresponding to the text;

the importance degree determining submodule is used for respectively determining the importance degree of each word of the text in the text of the field;

the candidate word determining submodule is used for determining candidate words according to the importance degree of each word of the text in the text of the field;

and the keyword screening submodule is used for screening out keywords as voice recognition defects according to the parts of speech of the candidate words.

Optionally, the keyword screening submodule includes:

and the part of speech screening unit is used for screening out nouns and/or verbs from the candidate words to serve as voice recognition defects.

The embodiment of the invention also discloses an electronic device, which comprises: a processor, a memory and a computer program stored on the memory and being executable on the processor, the computer program, when executed by the processor, implementing the steps of the speech recognition defect detection method as described above.

The embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the steps of the voice recognition defect detection method are realized.

The embodiment of the invention has the following advantages:

in the embodiment of the invention, the server can receive the voice request forwarded by the vehicle-mounted system; performing intention classification on a text obtained by performing voice recognition on the voice request; performing confidence classification on the text according to the intention classification result; and determining candidate words from the text according to the confidence classification result, screening, and screening out keywords as voice recognition defects. The embodiment of the invention can more accurately identify the voice identification defect in the text of the voice request.

Drawings

FIG. 1 is a flowchart illustrating the steps of a speech recognition defect detection method according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating steps of another method for detecting speech recognition defects according to an embodiment of the present invention;

FIG. 3 is a flowchart of mining ASR errors from query texts according to an embodiment of the present invention;

FIG. 4 is a block diagram of a speech recognition defect detection apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

A speech recognition defect mainly arises when the speech recognition system mistakes an utterance for a similar-sounding one, which then affects voice interaction in the actual application scenario. For example, in a vehicle-mounted scenario, the actual text corresponding to the user's voice request is "open defogging gear two" (turn the defogger to gear two), while the text recognized by the speech recognition system is "open window wind speed gear two". The speech recognition system has erroneously recognized "defogging" as "window wind speed".

In order to reduce the influence of speech recognition defects on voice interaction, the present invention provides a speech recognition defect detection method that can screen out keywords from the text of a voice request as speech recognition defects, so that the text can then be corrected for voice interaction.

Referring to fig. 1, a flowchart illustrating steps of a method for detecting a speech recognition defect according to an embodiment of the present invention is shown, where the method specifically includes the following steps:

and step 101, receiving a voice request forwarded by the vehicle-mounted system.

In the vehicle-mounted scene, the vehicle-mounted system can receive a voice request sent by a user and send the voice request to the server. The server may include a speech recognition system by which speech requests may be recognized as text.

The speech recognition system recognizes speech based on a speech recognition model obtained by pre-training. The speech recognition model can be trained on a large amount of collected speech; models trained on different speech data have different recognition performance, and different speech recognition systems may produce different texts for the same speech.

And 102, classifying the intentions of the text of the voice request.

The server can perform intention classification on the text of the voice request to obtain an intention classification result.

And 103, carrying out confidence degree classification on the text according to the intention classification result.

The intention classification result may represent the type of intention corresponding to each text. According to the intention type corresponding to the text, texts of a specific intention type can be selected for confidence classification to obtain a confidence classification result.

Confidence classification classifies the text using the confidence information of the text and determines whether the text was recognized accurately or erroneously. The confidence information may be the confidence output by the speech recognition system when it recognizes the speech, and indicates how reliable the recognition is: the higher the confidence, the lower the probability of a speech recognition defect, and the lower the confidence, the higher the probability of a speech recognition defect.

And 104, determining candidate words from the text according to the confidence classification result, screening, and screening out keywords as voice recognition defects.

For text classified as containing a speech recognition error, candidate words can be determined from the text and keywords can be screened out from the candidate words as speech recognition defects.

In the invention, the server can receive the voice request forwarded by the vehicle-mounted system; performing intention classification on a text obtained by performing voice recognition on the voice request; performing confidence classification on the text according to the intention classification result; and determining candidate words from the text according to the confidence classification result, screening, and screening out keywords as voice recognition defects. The invention more accurately identifies the voice recognition defects in the text of the voice request.

Referring to fig. 2, a flowchart illustrating the steps of another speech recognition defect detection method according to an embodiment of the present invention is shown, where the method specifically includes the following steps:

step 201, receiving a voice request forwarded by the vehicle-mounted system.

In the vehicle-mounted scene, the vehicle-mounted system can receive a voice request sent by a user and send the voice request to the server. The server may include a speech recognition system by which speech requests may be recognized as text.

Step 202, performing intent classification on the text of the voice request.

The server can perform intention classification on the text of the voice request to obtain an intention classification result. The intention classification result may indicate whether the text of the voice request has a user intention.

In a vehicle-mounted scenario, what the vehicle-mounted system collects is not only the user's voice request but may also include non-user sounds such as environmental noise, for example navigation voice played by the vehicle. The text recognized by the server through the speech recognition system may therefore include both text obtained by recognizing the user's voice request and text obtained by recognizing non-user speech.

Natural language processing is generally only concerned with voice requests that carry a user intention; it is not concerned with user speech that carries no user intention, nor with non-user speech. The text recognized by the ASR system from user speech can thus be further divided into text with a user intention and text without a user intention.

In the present invention, the step 202 may further include: predicting whether the text has the user intention by adopting a preset intention classification model to obtain a prediction result; the prediction result includes a probability that the text has the user intention and a probability that the text does not have the user intention.

The preset intention classification model may be a model trained in advance to predict whether the text has the user intention. The preset intention classification model does not need to predict what the user intention corresponding to the text is, but only whether the text has the user intention.

In one example, the intention classification model may be a multi-label classification model constructed from BERT (Bidirectional Encoder Representations from Transformers) + softmax. The training process of the intention classification model may include: labeling the sample texts, where the labels may be: having a user intention, not having a user intention, and not belonging to user speech (for example, a voice announcement of the vehicle); and inputting the sample texts and the corresponding labels into the intention classification model to train it. Using the intention classification model may include: inputting a text into the intention classification model to obtain the confidence of each label output by the model. For example, for the input text "I want to listen to a song", the intention classification model outputs: the probability of the label "having a user intention" is 0.9, the probability of the label "not having a user intention" is 0.1, and the probability of the label "not belonging to user speech" is 0. In practice, those skilled in the art can also use other machine-trained models to predict whether the text has a user intention, which is not limited by the present invention.
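As a rough illustration of this step, the following sketch loads a three-label BERT classifier and returns the per-label probabilities. It assumes the Hugging Face transformers library, the public "bert-base-chinese" tokenizer, a hypothetical fine-tuned checkpoint path and an assumed label order; none of these details are specified by this description.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed label order: 0 = has user intention, 1 = no user intention, 2 = not user speech.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "path/to/finetuned-intent-model",  # hypothetical fine-tuned checkpoint
    num_labels=3,
)
model.eval()

def classify_intent(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {
        "has_intent": probs[0].item(),
        "no_intent": probs[1].item(),
        "not_user_speech": probs[2].item(),
    }

For the example above, such a model would be expected to output roughly 0.9 / 0.1 / 0.0 for the text "I want to listen to a song".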

Step 203, determining whether the text has the user intention according to the intention classification result.

In the present invention, the step 203 may further include: judging whether a preset probability condition is met according to the probability that the text has the user intention and the probability that the text does not have the user intention, wherein the preset probability condition comprises: the probability of having the user intention being greater than the probability of not having the user intention, and/or the probability of having the user intention being greater than a preset probability threshold; if the preset probability condition is met, determining that the text has the user intention; and if the preset probability condition is not met, determining that the text does not have the user intention.

In one example, the preset probability condition is considered to be satisfied if the predicted probability that the text has the user intention is greater than the predicted probability that it does not. In another example, the preset probability condition is considered to be satisfied if the probability of having the user intention is greater than the probability of not having the user intention and is also greater than a preset probability threshold. In yet another example, the preset probability condition is considered to be satisfied if the probability that the text has the user intention is greater than a preset probability threshold (e.g., 0.8).
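A minimal sketch of the second variant of this check (the function name is illustrative, and the default threshold simply reuses the 0.8 figure from the example above):

def has_user_intent(p_has_intent, p_no_intent, threshold=0.8):
    # Combined condition: above the opposing probability AND above the threshold.
    # The other variants described above use either condition on its own.
    return p_has_intent > p_no_intent and p_has_intent > threshold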

Step 204, if the text is determined to have the user intention according to the intention classification result, obtaining confidence information of the text, wherein the confidence information comprises a confidence of each word of the text obtained by performing voice recognition on the voice request.

In the present invention, whether the text has the user's intention may be determined according to the intention classification result. For text with user intent, confidence classification may be further performed. For a text not having the user's intention, the subsequent natural language processing is not performed.

Performing confidence classification requires obtaining the confidence information of the text, where the confidence information includes the confidence of each word of the text obtained by performing speech recognition on the voice request. For example, the text obtained by speech recognition is "open window wind speed gear two", while the actual speech is "open defogging gear two"; the confidences of the words contained in the recognized text are: "open", confidence 0.99; "window", confidence 0.4; "wind speed", confidence 0.7; "two", confidence 1; "gear", confidence 1. Which words the text contains can be obtained through word segmentation.

Step 205, classifying the texts according to the confidence information to obtain a confidence classification result; and the confidence degree classification result is that the text has error words or the text does not have error words.

Error words are words that the speech recognition system has recognized incorrectly. The text is classified according to the confidence of each of its words, so as to determine whether it belongs to the type of text that has error words or the type that does not.

In an alternative example of the present invention, the step 205 may further include the following sub-steps:

and a substep S11, determining whether there is a word with a confidence level lower than a preset confidence level threshold value in the words of the text.

And a substep S12, determining that an error word exists in the text if the text has a word with a confidence lower than a preset confidence threshold.

And a substep S13, if the text does not have words with confidence lower than a preset confidence threshold, determining that no error word exists in the text.

If the confidence of any word in the text is lower than the preset confidence threshold, it is determined that the text has error words. If the confidence of every word in the text is not lower than the preset confidence threshold, it is determined that the text has no error words.
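A minimal sketch of this threshold check, applied to the worked example above (the threshold value and the data format are illustrative):

def has_error_word(word_confidences, threshold=0.8):
    # word_confidences: (word, confidence) pairs from the ASR output for one text.
    return any(confidence < threshold for _, confidence in word_confidences)

# Worked example: "window" (0.4) and "wind speed" (0.7) fall below the threshold,
# so the text is classified as having error words.
print(has_error_word([("open", 0.99), ("window", 0.4), ("wind speed", 0.7),
                      ("two", 1.0), ("gear", 1.0)]))  # True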

The preset confidence threshold may be obtained by statistical analysis of the confidences of a large number of words. Specifically, texts whose words fall into different confidence intervals can be analyzed for the proportion of error words they contain, and the confidence threshold is determined from the intervals with a higher proportion of error words. For example, the confidence range from 0 to 1 is divided into 10 intervals of 0.1 each, and 100 texts are sampled in each interval; annotators listen to the audio, mark whether what they hear matches the text, and label each case as normal, defective, or audio lost. The proportion of texts containing error words is then analyzed for each interval. If the proportion of error words in texts within the confidence range 0-0.8 is greater than a proportion threshold, 0.8 is taken as the confidence threshold.
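The following sketch mirrors that procedure under the assumption that annotator results are available as (confidence, is_error) pairs; the bin width follows the example above, while the proportion threshold of 0.5 is an assumed placeholder not given in this description:

def pick_confidence_threshold(annotated, bin_width=0.1, ratio_threshold=0.5):
    # annotated: (confidence, is_error) pairs collected from annotator review.
    # Returns the largest upper bound T such that the proportion of error words
    # among words with confidence below T exceeds ratio_threshold.
    threshold = 0.0
    upper = bin_width
    while upper <= 1.0 + 1e-9:
        below = [is_error for confidence, is_error in annotated if confidence < upper]
        if below and sum(below) / len(below) > ratio_threshold:
            threshold = upper
        upper += bin_width
    return threshold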

In another alternative example of the present invention, the step 205 may further include the following sub-steps:

and a substep S21 of determining a confidence level of each character of the text and a word formation position of each character.

In the present invention, word segmentation can be regarded as a classification problem over the word-formation position of each character. The word-formation position (also called the word position) of a character can be: the beginning of a word, the end of a word, the middle of a word, or a single-character word. For example, the text "the song being played is Norwegian Forest" is segmented as "played / song / is / Norwegian Forest"; in the original Chinese, each character is then tagged with its word-formation position, e.g. the two characters of the word for "play" are tagged B and E, the single-character words are tagged S, and the five characters of the word for "Norwegian Forest" are tagged B, I, I, I, E. Here B denotes the beginning of a word, I a character inside a word, E the end of a word, and S a single-character word.
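A small helper showing how a word segmentation maps to per-character B/I/E/S tags (a sketch following the tag scheme described above; the function name is illustrative):

def word_position_tags(words):
    # Map a list of segmented words to (character, position-tag) pairs.
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append((word, "S"))                      # single-character word
        else:
            tags.append((word[0], "B"))                   # beginning of word
            tags.extend((ch, "I") for ch in word[1:-1])   # inside of word
            tags.append((word[-1], "E"))                  # end of word
    return tags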

And a substep S22, predicting whether the text has wrong words according to the confidence coefficient of each word of the text and the word forming position of each word.

In the invention, each character of the text, the confidence corresponding to each character and the word forming position of each character are input into a preset error prediction model for processing to obtain a prediction result of whether each character is wrong or not.

The error prediction model can predict whether each individual character is an ASR error. In one example, the error prediction model may be a prediction model composed of a BiLSTM (Bidirectional Long Short-Term Memory network) + softmax. The training process of the error prediction model may include: adding to each character of the sample text an annotation indicating whether it was recognized incorrectly; and inputting each character of the sample text, the confidence corresponding to each character, the word-formation position of each character and the annotation corresponding to each character into the error prediction model to train it. For example, referring to Table 1, the input information for training the error prediction model is:

Character (gloss)         Word-formation position    Confidence    Annotation
"play" (1st character)    B                          0.355         W
"play" (2nd character)    E                          0.355         W
"of"                      S                          0.222         W
"is"                      S                          0.486         W
noun (1st character)      B                          0.99          R
noun (2nd character)      E                          0.99          R

Table 1. Each row is one Chinese character of the annotated sample text; W marks a character annotated as mis-recognized and R a character annotated as correctly recognized.

The prediction process of the error prediction model may include: inputting each character of the text, the confidence corresponding to each character and the word-formation position of each character into the error prediction model to obtain a prediction of whether the corresponding character was recognized incorrectly. For example, referring to Table 2, the input information for the prediction process of the error prediction model and the corresponding prediction results are:

Character (gloss)              Word-formation position    Confidence    Prediction
"open" (1st character)         B                          0.99          R
"open" (2nd character)         E                          0.99          R
"window" (1st character)       B                          0.4           W
"window" (2nd character)       E                          0.4           W
"wind speed" (1st character)   B                          0.739         W
"wind speed" (2nd character)   E                          0.739         W
"two"                          S                          1             R
"gear"                         S                          1             R

Table 2. Each row is one Chinese character of the recognized text "open window wind speed gear two"; the characters of "window" and "wind speed" are predicted to be recognition errors (W), the rest are predicted correct (R).

And a substep S23 of determining a confidence classification result according to the prediction result.

The prediction result is either that the text has erroneous characters or that it does not. If any character is predicted to be erroneous, it is determined that the text has error words; if no character is predicted to be erroneous, it is determined that the text has no error words.
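Putting sub-steps S21 to S23 together, the following is a minimal sketch of such a per-character error prediction model, assuming PyTorch; the embedding sizes, hidden size and feature encoding are illustrative choices, not specified by this description.

import torch
import torch.nn as nn

POSITION_TAGS = {"B": 0, "I": 1, "E": 2, "S": 3}

class ErrorPredictor(nn.Module):
    # Each character contributes its character embedding, its word-formation-position
    # embedding and its confidence; a BiLSTM then emits two logits per character
    # corresponding to the R (right) / W (wrong) prediction shown in Table 2.
    def __init__(self, vocab_size, char_dim=64, pos_dim=8, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        self.pos_emb = nn.Embedding(len(POSITION_TAGS), pos_dim)
        self.lstm = nn.LSTM(char_dim + pos_dim + 1, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)

    def forward(self, char_ids, pos_ids, confidences):
        # char_ids, pos_ids: (batch, seq_len); confidences: (batch, seq_len) floats.
        features = torch.cat([self.char_emb(char_ids),
                              self.pos_emb(pos_ids),
                              confidences.unsqueeze(-1)], dim=-1)
        hidden_states, _ = self.lstm(features)
        return self.out(hidden_states)  # per-character logits; apply softmax/cross-entropy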

And step 206, if the text is determined to have the wrong words according to the confidence classification result, determining candidate words from the text, and screening out keywords as the voice recognition defects.

In the present invention, the step 206 may further include the following sub-steps:

and a substep S31 of identifying a domain corresponding to the text.

In the present invention, the server can call the speech recognition system to convert the voice request into text, then call the natural language understanding system to perform natural language understanding on the text of the voice request to obtain the user intention, and then perform subsequent operations according to the recognized user intention. Natural language understanding may include two parts, domain recognition and intention recognition: the natural language understanding system first performs domain recognition on the text to obtain the corresponding field, and then performs intention recognition on the text to obtain the corresponding user intention. For example, the field corresponding to the text "open the window" may be control, and the field corresponding to the text "open music" (play music) may be music.

And a substep S32 of determining the degree of importance of each word of said text in the text of said domain, respectively.

The importance of a word in a text in a certain domain may represent the importance of the word to the semantic of the text. The greater the importance of a word in a text in a certain domain, the greater the importance of the word to the text semantics.

In the present invention, the degree of importance of a word in a text of a certain field may be represented using a word frequency tf (term frequency) -inverse Document frequency idf (inverse Document frequency) value. The substep S32 may specifically include: acquiring text data of the field; and calculating TF-IDF values of each word of the text in the text of the field by adopting the text data of the field. Text data of a certain field can be extracted from a corpus.

And a substep S33 of determining candidate words according to the degree of importance of each word of the text in the text of the field.

Words with a higher degree of importance in the text of the field may be used as candidate words. In one example, the sub-step S33 may include: sorting the words of the text according to their degree of importance in the text of the field; and selecting candidate words from the words of the text according to the sorting result.

The sorting may be in descending or ascending order of importance. In one example, the TF-IDF values of the words of the text in the field may be sorted in descending order. From the words of the text, the words whose rank falls within a preset range may be selected as candidate words; for example, when sorting in descending order, the top N words may be selected as candidate words, where N is a settable integer. For example, for the five words "open", "window", "wind speed", "two" and "gear", the top 3 words may be selected as candidate words, yielding "window", "wind speed" and "gear".
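A rough sketch of sub-steps S32 and S33 under the assumption that scikit-learn is available and that the field's corpus is supplied as whitespace-tokenized texts; the helper name and the default top_n are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

def top_candidate_words(query_words, domain_texts, top_n=3):
    # domain_texts: whitespace-tokenized texts of the same field, e.g. "open window".
    vectorizer = TfidfVectorizer(analyzer=lambda text: text.split())
    vectorizer.fit(domain_texts)
    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
    default_idf = float(max(vectorizer.idf_))  # unseen words treated as maximally rare
    counts = {word: query_words.count(word) for word in set(query_words)}
    scores = {word: (counts[word] / len(query_words)) * idf.get(word, default_idf)
              for word in counts}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

For the worked example, calling top_candidate_words(["open", "window", "wind speed", "two", "gear"], domain_texts) would be expected to favour the rarer content words such as "window" and "wind speed" over common ones like "open", depending on the field corpus used.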

And a substep S34, screening out keywords as voice recognition defects according to the parts of speech of the candidate words.

In the present invention, the sub-step S34 may include: and screening out nouns and/or verbs from the candidate words to serve as voice recognition defects.

Part-of-speech analysis is performed on the candidate words to determine the part of speech of each word. Since verbs and nouns usually occupy the core positions in a sentence and matter more to the semantic representation of the text, nouns and/or verbs are screened out as speech recognition defects.
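A small sketch of sub-step S34, assuming the jieba library is used for Chinese part-of-speech tagging (its noun flags start with "n" and its verb flags with "v"); any POS tagger with similar output would serve equally well.

import jieba.posseg as pseg

def keyword_defects(candidate_words):
    # Keep candidates whose part of speech is a noun or a verb.
    keywords = []
    for word in candidate_words:
        for pair in pseg.cut(word):
            if pair.flag.startswith(("n", "v")):
                keywords.append(word)
                break
    return keywords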

In the present invention, the server can receive the voice request forwarded by the vehicle-mounted system and perform intention classification on the text obtained by performing speech recognition on the voice request; if it is determined according to the intention classification result that the text has a user intention, the confidence information of the text is acquired; the text is classified according to the confidence information to obtain a confidence classification result; and if it is determined according to the confidence classification result that the text has error words, candidate words are determined from the text and keywords are screened out from the candidate words as speech recognition defects. The present invention can thus identify speech recognition defects in the text of a voice request more accurately.

On the one hand, the speech recognition defect detection method can identify speech recognition defects in voice requests in a real-time voice interaction scenario. On the other hand, the method can be used to mine speech recognition defects from large-scale voice requests; with the texts that contain speech recognition defects, the speech recognition model of the speech recognition system can be optimized, its recognition performance improved, and subsequent natural language processing performed on more accurate text. FIG. 3 is a flowchart illustrating a process of mining speech recognition defect texts according to an embodiment of the present invention.

And carrying out voice recognition on the large-scale voice request through a voice recognition system to obtain a corresponding text.

For the texts corresponding to all the voice requests, each text is first predicted by the intention classification model as having a user intention, having no user intention, or not being recognized user speech.

Texts that have no user intention or that are not recognized user speech are filtered out; their processing ends and no annotator review is required.

The query texts with a user intention are classified according to confidence. The output of the speech recognition system for a piece of speech may include the text and the confidence of each word of the text.

The classification according to confidence may be done in two ways. The first is to judge whether the confidence of each word of the text is smaller than a confidence threshold, where the confidence threshold may be obtained through statistical analysis and is used to judge whether a word is an ASR error; if the confidence of some word of the text is smaller than the confidence threshold, the text is considered to possibly contain error words and needs further processing by an annotator; if the confidences of all words of the text are not less than the confidence threshold, it is determined that the query text has no error words, the processing ends and no annotator is needed. The second way is to predict whether error words exist in the query text through the error prediction model, which has been trained to predict whether each word of the query text is an error word from each word of the query text, the confidence corresponding to each word and the word-formation position of each word; if any word of the text is predicted to be erroneous, the text is considered to have error words.

For texts confirmed to have no error words, the processing ends and no annotator review is required.

For texts confirmed to have error words, keywords can be further screened out as speech recognition defects; when an annotator processes such a text, the screened-out keywords can be shown as prompts to assist the annotation.
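Tying the stages of FIG. 3 together, a sketch of the offline mining flow that reuses the illustrative helpers defined earlier in this description (classify_intent, has_error_word, top_candidate_words and keyword_defects are the assumed helper names from the sketches above, not functions defined by the source):

def mine_asr_defects(asr_output, domain_texts):
    # asr_output: dict with the recognized "text" and per-word ("words") confidences.
    # Returns keyword defects for annotator review, or None when no review is needed.
    probs = classify_intent(asr_output["text"])             # intent filtering
    if probs["has_intent"] <= max(probs["no_intent"], probs["not_user_speech"]):
        return None                                         # no user intention: drop
    if not has_error_word(asr_output["words"]):             # confidence classification
        return None                                         # no suspect words: drop
    words = [word for word, _ in asr_output["words"]]
    candidates = top_candidate_words(words, domain_texts)   # importance ranking
    return keyword_defects(candidates)                      # part-of-speech screening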

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Referring to fig. 4, a block diagram of a structure of a speech recognition defect detecting apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:

the voice request receiving module 401 is configured to receive a voice request forwarded by the vehicle-mounted system;

an intent classification module 402 for intent classifying text of the voice request;

a confidence classification module 403, configured to perform confidence classification on the text according to the intention classification result;

and the screening module 404 is configured to determine candidate words from the text according to the confidence classification result, and screen out keywords as the voice recognition defects.

In the present invention, the confidence classification module 403 may include:

an intention determining submodule for determining whether the text has a user intention according to the intention classification result;

and the confidence degree classification submodule is used for carrying out confidence degree classification on the text if the text is determined to have the user intention according to the intention classification result.

In the present invention, the screening module 404 may include:

and the screening submodule is used for determining candidate words from the text to screen if the text is determined to have the wrong words according to the confidence classification result, and screening out keywords as the voice recognition defects.

In the present invention, the intention classification module 402 may include:

the intention prediction submodule is used for predicting whether the text has the user intention by adopting a preset intention classification model to obtain a prediction result; the prediction result includes a probability that the text has the user intention and a probability that the text does not have the user intention.

In the present invention, the intention determining submodule may include:

the probability condition judging unit is used for judging whether a preset probability condition is met or not according to the probability that the text has the user intention and the probability that the text does not have the user intention; the preset probability condition comprises the following steps: the probability of having the user intention is greater than the probability of not having the user intention, and/or the probability of having the user intention is greater than a preset probability threshold;

a first intention determining unit, configured to determine that the text has a user intention if the preset probability condition is satisfied;

and the second intention determining unit is used for determining that the text does not have the user intention if the preset probability condition is not met.

In the present invention, the confidence classification submodule may include:

a confidence information acquiring unit, configured to acquire confidence information of the text, where the confidence information includes a confidence of each word of the text obtained by performing voice recognition on the voice request;

the confidence classification unit is used for classifying the texts according to the confidence information to obtain confidence classification results; and the confidence degree classification result is that the text has error words or the text does not have error words.

In the present invention, the confidence classification unit may include:

the confidence threshold comparison subunit is used for judging whether a word with a confidence lower than a preset confidence threshold exists in each word of the text;

the first error word determining subunit is used for determining that an error word exists in the text if the text has a word with a confidence coefficient lower than a preset confidence coefficient threshold;

and the second error word determining subunit is used for determining that no error word exists in the text if the text does not have a word with the confidence coefficient lower than a preset confidence coefficient threshold value.

In the present invention, the confidence classification unit may include:

the character information determining subunit is used for determining the confidence coefficient of each character of the text and the word forming position of each character;

the wrong character prediction subunit is used for predicting whether the text has wrong characters or not according to the confidence coefficient of each character of the text and the word formation position of each character;

and the confidence degree classification subunit is used for determining a confidence degree classification result according to the prediction result.

In the present invention, the erroneous-word prediction subunit may include:

and the model prediction subunit is used for inputting each character of the text, the confidence coefficient corresponding to each character and the word formation position of each character into a preset error prediction model for processing to obtain a prediction result of whether each character is wrong.

In the present invention, the screening module 404 may include:

the field identification submodule is used for identifying the field corresponding to the text;

the importance degree determining submodule is used for respectively determining the importance degree of each word of the text in the text of the field;

the candidate word determining submodule is used for determining candidate words according to the importance degree of each word of the text in the text of the field;

and the keyword screening submodule is used for screening out keywords as voice recognition defects according to the parts of speech of the candidate words.

In the present invention, the keyword screening submodule may include:

and the part of speech screening unit is used for screening out nouns and/or verbs from the candidate words to serve as voice recognition defects.

In the invention, the server can receive the voice request forwarded by the vehicle-mounted system; performing intention classification on a text obtained by performing voice recognition on the voice request; performing confidence classification on the text according to the intention classification result; and determining candidate words from the text according to the confidence classification result, screening, and screening out keywords as voice recognition defects. The invention more accurately identifies the voice recognition defects in the text of the voice request.

For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

An embodiment of the present invention further provides an electronic device, including:

the method comprises a processor, a memory and a computer program which is stored on the memory and can run on the processor, wherein when the computer program is executed by the processor, each process of the voice recognition defect detection method embodiment is realized, the same technical effect can be achieved, and the details are not repeated here to avoid repetition.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the foregoing voice recognition defect detection method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The foregoing describes in detail a speech recognition defect detection method and a speech recognition defect detection apparatus provided by the present invention, and specific examples are applied herein to explain the principles and embodiments of the present invention, and the descriptions of the foregoing examples are only used to help understand the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
