Automatic speech recognition confidence classifier


Note: This technology, "Automatic speech recognition confidence classifier," was designed and created by K. Kumar, A. Anastasakos, and Yifan Gong on 2020-03-05.

1. A method of enhancing an automatic speech recognition confidence classifier, comprising:

receiving a set of baseline confidence features from one or more decoded words;

deriving word embedding confidence features from the baseline confidence features;

combining the baseline confidence features with the word embedding confidence features to create a feature vector; and

executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained using a set of training examples having labeled features corresponding to the feature vectors.

2. The method of claim 1, wherein the word embedding confidence features include character embedding and phoneme embedding.

3. The method of any of claims 1-2, wherein the word embedding confidence features comprise character embedding, wherein the character embedding comprises fewer than 26 embeddings corresponding to letters.

4. The method of claim 3, wherein the character embedding for a word comprises a vector having a value for each letter, the value consisting of a count of occurrences of that letter in the word.

5. The method of any of claims 1-2, wherein the word embedding confidence features comprise phoneme embedding, the phoneme embedding comprising monophones selected from a dictionary comprising 40 or fewer monophone units.

6. The method of any one of claims 1-2, wherein the feature vector further comprises GLOVE embedding.

7. The method of any of claims 1-2, wherein the confidence classifier is trained for word-level classification as well as utterance-level classification.

8. The method of any of claims 1-2, wherein the baseline features include two or more of: an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and a duration feature.

9. A machine-readable storage device having instructions that, when executed by a processor of a machine, cause the processor to perform operations to generate a confidence score for a word or utterance, the operations comprising:

receiving a set of baseline confidence features from one or more decoded words;

deriving word embedding confidence features from the baseline confidence features;

combining the baseline confidence features with word embedding confidence features to create a feature vector; and

executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained using a set of training examples having labeled features corresponding to the feature vectors.

10. The machine-readable storage device of claim 9, wherein the word embedding confidence features include character embedding and phoneme embedding.

11. The machine-readable storage device of any of claims 9-10, wherein the word embedding confidence features comprise character embedding, wherein the character embedding comprises 26 or fewer embeddings corresponding to letters of an alphabet, and wherein the character embedding for a word comprises a vector having a value for each letter, the value consisting of a count of occurrences of that letter in the word.

12. The machine-readable storage device of any of claims 9-10, wherein the word embedding confidence features comprise phoneme embedding, the phoneme embedding comprising monophones selected from a dictionary comprising 40 or fewer monophone units.

13. The machine-readable storage device of any of claims 9-10, wherein the confidence classifier is trained for word-level classification and utterance-level classification, and wherein the baseline features include two or more of: an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and a duration feature.

14. An apparatus, comprising:

a processor; and

a memory device coupled to the processor and having a program stored thereon for execution by the processor to perform operations comprising:

receiving a set of baseline confidence features from one or more decoded words;

deriving a word embedding confidence feature from the baseline confidence feature;

combining the baseline confidence features with word embedding confidence features to create a feature vector; and

executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained using a set of training examples having labeled features corresponding to the feature vectors.

15. The apparatus of claim 14, wherein the word embedding confidence features comprise one or more of character embedding and phoneme embedding comprising monophones, and wherein the confidence classifier is trained for word-level classification and utterance-level classification, and wherein the baseline features comprise two or more of: an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and a duration feature.

Background

A confidence classifier is an integral part of an Automatic Speech Recognition (ASR) system. The classifier predicts the accuracy of an ASR hypothesis by assigning it a confidence score in the [0,1] range, where a larger score means a higher probability that the hypothesis is correct. Although such classifiers work well for native speakers, speech with different accents may result in a higher false alarm rate. In other words, the confidence score for a predicted word may be too high, causing the application receiving the classifier output to believe that the correct word was provided.

Disclosure of Invention

A method of enhancing an automatic speech recognition confidence classifier includes receiving a set of baseline confidence features from one or more decoded words, deriving word embedding confidence features from the baseline confidence features, combining the baseline confidence features with the word embedding confidence features to create a feature vector, and executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector.

In another embodiment, a system is configured to perform the method. In yet another embodiment, a computer-readable medium has code stored thereon to cause a computer to perform the method when the code is executed.

Drawings

FIG. 1 is a block diagram illustrating an enhanced speech recognition system with added word embedding features according to an example embodiment.

FIG. 2 is a graph illustrating the dependency of words and associated acoustic scores according to an example embodiment.

FIG. 3 is a table illustrating character embedding of example words, according to an example embodiment.

FIG. 4 is a flow diagram illustrating a computer-implemented method for enhancing a speech recognition confidence classifier with word-embedded confidence features according to an example embodiment.

FIG. 5 is a schematic block diagram of a computer system implementing one or more example embodiments.

Detailed Description

In the following description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical and electrical changes may be made without departing from the scope of the present invention. The following description of exemplary embodiments is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

In one embodiment, the functions or algorithms described herein may be implemented in software. Software may be comprised of computer-executable instructions stored on a computer-readable medium or computer-readable storage device, such as one or more non-transitory memories or other types of hardware-based local or network storage devices. Further, these functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the described embodiments are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, to turn such computer system into a specially programmed machine.

The functions may be configured to perform operations using, for example, software, hardware, firmware, and so forth. For example, the phrase "configured to" may refer to a logical circuit arrangement of hardware elements that implements the associated function. The phrase "configured to" may also refer to a logical circuit configuration of hardware elements that implements the coded design of the associated function in firmware or software. The term "module" refers to a structural element that may be implemented using any suitable hardware (e.g., processor, etc.), software (e.g., application, etc.), firmware, or any combination of hardware, software, and firmware. The term "logic" includes any functionality for performing a task. For example, each operation illustrated in the flowcharts corresponds to logic for performing the operation. Operations may be performed using software, hardware, firmware, or the like. The terms "component," "system," and the like can refer to a computer-related entity: hardware, software in execution, firmware, or a combination thereof. A component may be a process running on a processor, an object, an executable, a program, a function, a subroutine, a computer, or a combination of software and hardware. The term "processor" may refer to a hardware component, such as a processing unit of a computer system.

Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computing device to implement the disclosed subject matter. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any computer-readable storage device or media. The computer-readable storage medium may include, but is not limited to, magnetic storage devices such as hard disks, floppy disks, magnetic strips, optical disks, Compact Disks (CDs), Digital Versatile Disks (DVDs), smart cards, flash memory devices, and the like. In contrast, computer-readable media, i.e., non-storage media, may additionally include communication media such as transmission media for wireless signals and the like.

Automatic Speech Recognition (ASR) has significantly increased hands-free communication with devices such as smartphones, tablets, game consoles, etc. ASR technology has been very successful in the past decade and is rapidly deployed from a laboratory environment into real life.

While one desires perfect recognition from ASR, actual decoded utterances always contain errors. A confidence measure for a recognized utterance therefore provides a quantitative representation of the reliability of the ASR decoding. This confidence measure is especially important for applications where ASR-enabled devices are always in an active listening mode under an application-constrained grammar. The application-constrained grammar in one example application may include game media commands such as play/pause. Background out-of-grammar (OOG) speech (speech not including a supported command) may sometimes trigger one of the in-grammar (IG) commands; that is, some OOG utterances may still be recognized as IG utterances, and the confidence measure may be used to evaluate the correctness of such recognitions.

The confidence classifier is trained to provide a measure of reliability of the decoded utterance to help reject OOG utterances. The confidence measures are also used to validate ASR decoding in the presence of background noise, reverberation, and other mismatched acoustic conditions. The confidence measures may be trained for word-based confidence and utterance-based confidence.

ASR confidence has many applications. Confidence is a key indicator that helps speech applications better handle responses to potentially incorrect ASR hypotheses. Confidence classifiers are used for push-to-talk devices such as handsets, but also for continuous-listening devices such as Xbox, where the speech engine runs in the background all the time. There, the ASR system hears not only speech intended for it but also unintended side speech, background noise, and other environmental sounds, and it may generate an in-grammar (IG) recognition for an unintended or out-of-grammar (OOG) utterance. ASR systems utilize confidence classifiers to detect such likely misrecognitions and avoid responding to them.

A multi-layer perceptron (MLP) or a deep learning model may be trained to produce confidence scores from a defined set of features. Many confidence features and training methods have been developed and used to predict confidence. Confidence scores may be calculated for both words and utterances. To improve the confidence score, word embedding confidence features may be derived and added to the feature set.
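As a rough illustration, the following Python sketch trains a small MLP on labeled confidence feature vectors; the 21-feature dimensionality, the network size, and the synthetic data are all assumptions for illustration, not taken from the embodiments described here.

```python
# Minimal sketch: training an MLP confidence classifier on labeled
# confidence-feature vectors. The 21-dim features and labels are synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))    # one confidence-feature vector per word
y = rng.integers(0, 2, size=1000)  # 1 = correct recognition, 0 = incorrect

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X, y)

confidence = clf.predict_proba(X[:1])[0, 1]  # score in [0, 1]
print(f"confidence score: {confidence:.3f}")
```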

FIG. 1 is a block flow diagram illustrating an enhanced speech recognition system 100. A voice utterance is received at the voice input 110. The input sequence may be word-level or utterance-level. Features are extracted in a known manner at feature extraction 120. The extracted features are provided to an ASR engine 130, which decodes the speech and produces a hypothesis and a set of features, such as lattice-based confidence features 140. A lattice is a representation of the alternative word sequences that are sufficiently likely for a particular utterance. The engine 130 provides scores based on an Acoustic Model (AM) and a Language Model (LM). The scores are normalized according to utterance length. In one embodiment, the lattice-based confidence features 140 are a baseline feature set (referred to as generic features) consisting of 16 predictors.

The features 140 are used to derive word embedding confidence features 150. The features 140 and word embedding confidence features 150 are concatenated and provided to a confidence classifier 160, which has been trained on both sets of features. The confidence classifier 160 determines whether the input 110 sequence is in-grammar (IG) or out-of-grammar (OOG). A multi-layer perceptron (MLP) may be used for the IG versus OOG decision. MLPs are trained for word-level classification and utterance-level classification. The output of the MLP is a confidence score for the input utterance.

In some embodiments, the confidence classifier 160 generates a confidence score 170 that is provided to the application 180. Application 180 may accept or reject the hypothesis. In other words, the application 180 may choose to accept or ignore the word or word sequence hypothesized by the engine 130 based on the confidence score.

The speech application 180 uses these scores and decides whether to accept a recognition event by comparing the score to a set threshold, such as 0.8 (in the [0,1] range, where a larger score means a higher probability that the hypothesis is correct), or another threshold that may depend on the application. The confidence score helps to mitigate unnecessary responses of the application to inputs such as background noise or television sounds.
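As a minimal sketch of this application-side decision, assuming a hypothetical accept_hypothesis helper:

```python
# Sketch: accepting or rejecting a recognition event against a fixed
# confidence threshold (0.8 here, as in the text); helper name is assumed.
def accept_hypothesis(confidence_score: float, threshold: float = 0.8) -> bool:
    """Return True when the application should act on the hypothesis."""
    return confidence_score >= threshold

print(accept_hypothesis(0.92))  # True: act on the recognized command
print(accept_hypothesis(0.41))  # False: likely background noise or OOG speech
```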

Word embedding confidence features 150 are used to improve the confidence classifier. Word characters and phoneme pronunciations are embedded to represent and factorize the acoustic confidence features.

Confidence classification can be expressed as a binary classification problem with two categories: (1) correct ASR recognitions, and (2) incorrect recognitions, which include misrecognitions of IG utterances and recognitions triggered by OOG utterances or background audio. Example confidence features, illustrated in the sketch after this list, may include:

1. Acoustic model score

2. Background model score

3. Silence model and noise model scores

4. Language model score

5. Duration features
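A hedged sketch of how a per-word record of such baseline features might look; the field names and values below are illustrative assumptions, not the exact feature definitions used here.

```python
# Hypothetical per-word baseline confidence features; names follow the
# list above, values are illustrative log-scores and frame counts only.
baseline_features = {
    "acoustic_model_score": -2.1,
    "background_model_score": -3.4,
    "silence_model_score": -5.0,
    "noise_model_score": -4.2,
    "language_model_score": -1.7,
    "duration_frames": 38,
}
```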

The baseline confidence features 140 in one embodiment include 21 features obtained from the ASR lattice during decoding. Confidence features may be obtained from the background, silence, and noise model scores. A set of Language Model (LM) features may be obtained to represent LM scores, perplexities, and fanouts. Duration-based features may include modeling of speech rate and absolute speech duration. These features may be normalized to be robust to speech of different durations and intensities.

New confidence features, such as the word embedding confidence features 150, may be added to further improve confidence performance. The acoustic score for an individual word in the ASR hypothesis is obtained as a set of frame-level acoustic scores for that particular word. A stronger acoustic score indicates a greater match between the constituent speech frames and the acoustic model, and therefore a greater probability that the word is correct. The ASR system uses context-dependent tied triphone states, namely senones, as the states representing a word. During decoding, the best path is found along these states under the language model constraints to predict the best hypothesis.

The acoustic score per frame represents a match between the speech frame and a particular acoustic state. Note that the baseline confidence features 140 include a duration feature that implicitly helps interpret the acoustic scores of shorter words versus longer words. Further, many normalizations of the engine scores are performed for the baseline confidence features. Even so, the acoustic scores in the baseline confidence features retain a significant dependence on the underlying acoustic states. Word embedding is used to account for this dependence by representing the acoustic scores together with their acoustic states.
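For concreteness, a minimal sketch of collapsing frame-level scores into a duration-normalized word-level score; simple averaging is an assumption here, not the engine's exact normalization scheme.

```python
# Sketch: aggregating frame-level acoustic log-scores into one
# duration-normalized word-level score (plain averaging is assumed).
def word_acoustic_score(frame_scores: list[float]) -> float:
    """Average per-frame scores so short and long words stay comparable."""
    return sum(frame_scores) / len(frame_scores)

print(word_acoustic_score([-2.1, -1.8, -2.4, -2.0]))  # -2.075
```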

Acoustic scores are often an important feature for ASR confidence classifiers. However, there is a dependency between an acoustic score and the underlying ASR states. A confidence classifier assigns higher confidence scores to words with stronger acoustic scores, but aggregated acoustic scores alone are not sufficient to accurately represent the acoustic match without accounting for the underlying acoustic states. Given that a large ASR task consists of data across acoustic conditions, speakers, and audio pipelines, acoustic scores can vary considerably even for correctly recognized words.

In one embodiment, the dependency between a number of words and their associated acoustic scores is shown graphically at 200 in FIG. 2. The distributions of acoustic scores for the three words "The", "Play", and "Game" are shown at 210, 220, and 230, respectively. A lower acoustic score indicates a stronger match. The distributions are obtained from words that are correctly recognized by the ASR system. Assuming the remaining confidence features are the same, this variance in acoustic scores will affect the confidence score. Because the confidence score indicates the probability that a word is correct, the different acoustic score distributions shown at 210, 220, and 230 imply different interpretations of any given confidence score across words. For example, a recognized word "The" with a confidence of 0.9 may have a higher or lower probability of correctness than the word "Play" with a confidence of 0.9.

Word embedding features are used to represent and factorize the acoustic scores over acoustic states. Two different types of word embedding features, word character embedding and phoneme pronunciation embedding, may be used. The two types may be used alone or in combination and added to the baseline confidence features 140 described above.

Word character embedding may be used to represent and decompose the acoustic scores. In one embodiment, word character embedding is simply a count of the letters in the language. For enUS (American English), a 26-dimensional character embedding is constructed, with one dimension per letter. Other languages may have a different number of letters and a correspondingly different number of character embedding dimensions. enUS is based on the Roman alphabet; other common alphabets with different numbers of letters include Arabic, Cyrillic, and Latin scripts. In yet another embodiment, fewer than all of the letters may be used for word character embedding. Some applications may select a subset of the letters of the corresponding language for word character embedding to minimize the added complexity or size of the model. For example, five embeddings for the vowels may be used.

Referring to table 1, shown as table 300 in FIG. 3, the character embedding of "cortana" at 310 is a vector having the values {2,1,1,1,1,1} at the positions corresponding to {'a','c','n','o','r','t'}. The remaining vector elements are 0. The "2" in the vector indicates that the letter "a" appears twice in "cortana".
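A minimal Python sketch of this letter-count embedding, computed on the fly as discussed below; the function name is hypothetical.

```python
from collections import Counter
import string

def char_embedding(word: str) -> list[int]:
    """26-dimensional letter-count embedding for enUS, computed on the fly."""
    counts = Counter(word.lower())
    return [counts.get(letter, 0) for letter in string.ascii_lowercase]

emb = char_embedding("cortana")
print(emb[:6])  # [2, 0, 1, 0, 0, 0] -> counts for a, b, c, d, e, f
```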

Character embedding provides several advantages: (a) the features have a small dimensionality, (b) they require little computational resources, and (c) they are easy to compute on the fly and require no memory or storage, since the character counts of a word can be computed when needed without storing the embeddings. As shown at 100 in FIG. 1, the existing confidence mechanism is extended by extracting the baseline confidence features from the ASR lattice (features 140) and computing the character embedding for the particular hypothesized word. Functionality embedded in the lattice generation or lattice post-processing steps computes the character embeddings of the words in the ASR hypotheses along with the features 140. ASR systems essentially model phonemes, however, so character embedding is at best a good approximation. Furthermore, "Cortana" pronounced in different ways will have the same character embedding even though the acoustic scores differ. In this regard, the phoneme embedding described below may also or alternatively be used to extend the features provided to the confidence classifier 160.

An ASR system essentially performs matching between speech frames and acoustic states under the constraints of a language model. In some examples, 9000 context-dependent triphones may be used to represent the acoustic states. A 9000-dimensional vector could be used to represent the counts of each triphone in a word, but that dimensionality is significantly larger than the 21 confidence features in an example set of baseline confidence features 140 and would likely overfit the task. Such large vectors also make training and maintaining confidence classifiers difficult due to sparsity issues, since only a few states in a word are non-zero.

In one embodiment, monophone units are used for word pronunciation embedding. The phoneme embedding of "cortana" is shown at 320 in table 300. In one embodiment, the enUS ASR model uses 40 monophone units, and a hand-crafted monophone dictionary is used to represent words in monophone units. The pronunciation of a word may be represented by a series of symbols corresponding to the individual sound units that make up the word; these are called "phones" or "phonemes". A monophone is a single, context-independent phoneme. For example, the word "translate" may correspond to the following monophone sequence: t r @ n s l e t. Monophones are well-known structures in speech recognition. As with word character embedding, in some embodiments fewer than 40 monophones may be used.

Phoneme embedding retains all the advantages of character embedding. Character embedding has the problem that different pronunciations of a word share the same embedding; phoneme embedding solves this problem by allowing words to have multiple pronunciations in the dictionary. The computation of phoneme embedding is similar to that of character embedding, except that the embedding unit is a phoneme. The embedding of a word with multiple pronunciations can be calculated as the average of the embeddings of its individual pronunciations. This calculation uses only the particular words and the monophone dictionary that the ASR decoding already has access to.
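A hedged sketch of monophone count embedding with pronunciation averaging; the tiny monophone inventory and the alternate /z/ pronunciation below are invented for illustration, a real enUS system would use its full ~40-unit dictionary.

```python
import numpy as np

# Hypothetical monophone inventory (illustration only).
MONOPHONES = ["t", "r", "@", "n", "s", "z", "l", "e"]

def phone_embedding(phones: list[str]) -> np.ndarray:
    """Count vector over the monophone inventory for one pronunciation."""
    vec = np.zeros(len(MONOPHONES))
    for p in phones:
        vec[MONOPHONES.index(p)] += 1.0
    return vec

def word_phone_embedding(pronunciations: list[list[str]]) -> np.ndarray:
    """Average the count embeddings of a word's pronunciations."""
    return np.mean([phone_embedding(p) for p in pronunciations], axis=0)

# "translate" as t r @ n s l e t, plus an invented /z/ variant
print(word_phone_embedding([list("tr@nslet"), list("tr@nzlet")]))
```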

In one example, OOG utterances may be prepared from a movie or meeting task. OOG data can also be simulated by decoding IG utterances against a mismatched grammar. The performance of the confidence classifier may be characterized in terms of the mean square error (MSE) on the training and validation tasks, and in terms of the correct-accept and false-accept rates:

CA = #(corrects exceeding the threshold) / #(all corrects)
FA = #(incorrects exceeding the threshold) / #(all incorrects)

where # indicates a count.
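A minimal sketch of these metrics under the stated definitions; the toy scores and labels are assumptions.

```python
import numpy as np

def ca_fa(scores, labels, threshold):
    """CA: fraction of correct recognitions scoring above the threshold.
    FA: fraction of incorrect recognitions scoring above the threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    ca = np.mean(scores[labels == 1] > threshold)
    fa = np.mean(scores[labels == 0] > threshold)
    return ca, fa

# label 1 = correctly recognized, 0 = misrecognition or OOG trigger
print(ca_fa([0.9, 0.85, 0.4, 0.7], [1, 1, 0, 0], threshold=0.8))  # (1.0, 0.0)
```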

To train the confidence classifier 160, labeled confidence training data covering over 1000 hours of speech from one or more applications (such as Xbox and server applications) may be used. In other embodiments, significantly smaller or larger amounts of training data may be used. In one evaluation, combining the acoustic features with character embedding brought the MSE to 0.199, and combining the embeddings with the full baseline feature set reduced the MSE from 0.188 to 0.183.

In one embodiment, the confidence classifier is an MLP classifier. The MLP classifier can be enhanced by using deep architectures such as DNNs and the K-DCN. Deep Neural Networks (DNNs) are widely used in state-of-the-art learning systems. A DNN extends the MLP with a larger number of hidden layers; the different hidden layers can model and learn local and higher-order structures in the data.

The kernel deep convex network (K-DCN) is a kernel version of the deep convex network (DCN). The DCN and K-DCN architectures concatenate the outputs of all previous layers with the original input data to form the input of the current layer. The K-DCN is built from kernel ridge regression modules, which can be expressed as:

y(x) = k(x)^T α (1)

where the sample x is evaluated against all training samples x_n, α is the vector of regression coefficients, and the vector k(x) has elements k_n(x) = k(x_n, x). The regression coefficients α have the closed-form solution:

α = (λI + K)^{-1} Y (2)

where λ is a regularization parameter, K is the kernel matrix with elements K_mn = k(x_m, x_n) for training samples x_m and x_n, and Y is the matrix of M-class label vectors for training.
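A small numpy sketch of one kernel ridge regression module under these equations; the RBF kernel choice, feature dimension, and toy data are assumptions, not specified by the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """k(a, b) = exp(-gamma * ||a - b||^2); the kernel choice is assumed."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 21))                # toy training features
Y_train = np.eye(2)[rng.integers(0, 2, size=50)]   # one-hot labels, M = 2

lam = 1.0                                          # regularization parameter
K = rbf_kernel(X_train, X_train)                   # K_mn = k(x_m, x_n)
alpha = np.linalg.solve(lam * np.eye(len(K)) + K, Y_train)  # eq. (2)

x_new = rng.normal(size=(1, 21))
y = rbf_kernel(X_train, x_new).T @ alpha           # eq. (1): y(x) = k(x)^T alpha
print(y)
```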

In yet another embodiment, in addition to one or more word embedding confidence features, GLOVE embedding may be added to the baseline features. The GLOVE embedding encodes context word information. Such embedding is different from character and phoneme embedding.

FIG. 4 is a flow diagram illustrating a computer-implemented method 400 of enhancing a speech recognition confidence classifier with word embedding confidence features. The method 400 begins at operation 410 with receiving a set of baseline confidence features from one or more decoded words. At operation 420, word embedding confidence features are derived from the baseline confidence features. The word embedding confidence features may include character embedding. For enUS, the character embedding includes 26 embeddings, one per letter of the English alphabet. The character embedding of a word may be a vector having a value for each letter, consisting of the count of that letter in the word. Values corresponding to letters not present in the word or utterance may be set to zero.

The word embedding confidence features may also or instead include phoneme embedding. The phoneme embedding may consist of monophones selected from a dictionary containing 40 monophone units. The word embedding confidence features may include both character embedding and phoneme embedding.

At operation 430, the baseline confidence features are combined with the word embedding confidence features to create a feature vector. The confidence classifier is then executed to generate a confidence score. The confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector. The feature vector may also include GLOVE embedding.

The confidence classifier may be trained for word-level classification as well as utterance-level classification. The baseline features may include two or more of: an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and a duration feature.
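Putting the operations together, a hedged end-to-end sketch of the feature-vector construction; the dimensions (21 baseline, 26 character, 40 monophone features) follow the examples above, and the helper name is hypothetical.

```python
import numpy as np

def build_feature_vector(baseline, char_emb, phone_emb):
    """Concatenate baseline confidence features with the word embedding
    confidence features for one decoded word (operation 430)."""
    return np.concatenate([baseline, char_emb, phone_emb])

fv = build_feature_vector(np.zeros(21), np.zeros(26), np.zeros(40))
print(fv.shape)  # (87,) -> input to the trained confidence classifier
```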

FIG. 5 is a schematic block diagram of a computer system 500 that implements an improved ASR confidence classifier algorithm in accordance with an example embodiment. Not all components need be used in various embodiments.

An example computing device in the form of a computer 500 may include a processing unit 502, memory 503, removable storage 510, and non-removable storage 512. While the example computing device is shown and described as computer 500, in different embodiments, the computing device may be in different forms. For example, the computing device may instead be a smartphone, tablet, smart watch, Smart Storage Device (SSD), or other computing device that includes the same or similar elements as shown and described in fig. 5. Equipment such as smartphones, tablets, smartwatches, etc. are commonly referred to collectively as mobile devices or user equipment.

Although various data storage elements are shown as part of computer 500, storage may also or alternatively include cloud-based storage, such as internet-or server-based storage, that is accessible via a network. It is also noted that the SSD may include a processor running a parser, allowing the parsed, filtered data to be transferred over an I/O channel between the SSD and the main memory.

Memory 503 may include volatile memory 514 and non-volatile memory 508. Computer 500 may include, or have access to, a computing environment that includes, a variety of computer-readable media, such as volatile memory 514 and non-volatile memory 508, removable storage 510, and non-removable storage 512. Computer memory includes Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM) or Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CDROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.

Computer 500 may include or have access to a computing environment that includes input interface 506, output interface 504, and communication interface 516. The output interface 504 may include a display device, such as a touch screen, that may also serve as an input device. Input interface 506 may include one or more of a touch screen, a touch pad, a mouse, a keyboard, a camera, one or more device-specific buttons, one or more sensors integrated within computer 500 or coupled to computer 500 via wired or wireless data, and other input devices. The computer may operate in a networked environment using communication connections to connect to one or more remote computers, such as a database server. The remote computers may include Personal Computers (PCs), servers, routers, network PCs, peer devices or other common network nodes, and the like. The communication connection may include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, Wi-Fi, Bluetooth, or other network. According to one embodiment, the various components of computer 500 are connected to a system bus 520.

Computer readable instructions stored on a computer readable medium, such as program 518, may be executed by processing unit 502 of computer 500. In some embodiments, the program 518 includes software that implements one or more confidence classifiers. Hard drives, CD-ROMs, and RAM are some examples of articles including a non-transitory computer-readable medium, such as a storage device. The terms computer-readable medium and storage do not include a carrier wave because a carrier wave is considered too transitory. Storage may also include network storage, such as a Storage Area Network (SAN). The computer programs 518, in conjunction with the workspace manager 522, may be used to cause the processing unit 502 to perform one or more of the methods or algorithms described herein.

Examples of the invention

1. A method of enhancing an automatic speech recognition confidence classifier includes receiving a set of baseline confidence features from one or more decoded words, deriving word embedding confidence features from the baseline confidence features, combining the baseline confidence features with the word embedding confidence features to create a feature vector, and executing the confidence classifier to generate a confidence score, wherein the confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector.

2. The method of example 1, wherein the word embedding confidence feature comprises character embedding.

3. The method of example 2, wherein the character embedding includes fewer than 26 embeddings corresponding to letters.

4. The method of any of examples 2 to 3, wherein the character embedding of the word comprises a vector having a value for each letter, consisting of a count of that letter in the word.

5. The method of any of examples 1 to 4, wherein the word embedding confidence features comprise phoneme embedding.

6. The method of example 5, wherein the phoneme embedding comprises monophones selected from a dictionary comprising 40 or fewer monophone units.

7. The method of any of examples 1 to 6, wherein the word embedding confidence features comprise character embedding and phoneme embedding.

8. The method according to any one of examples 1 to 7, wherein the feature vector further comprises GLOVE embedding.

9. The method according to any of examples 1 to 8, wherein the confidence classifier is trained for word-level classification as well as utterance-level classification.

10. The method of any of examples 1 to 9, wherein the baseline features comprise two or more of an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and duration features.

11. A machine-readable storage device has instructions for execution by a processor of a machine to cause the processor to perform operations to generate a confidence score for a word or utterance. The operations include receiving a set of baseline confidence features from one or more decoded words, deriving word embedding confidence features from the baseline confidence features, combining the baseline confidence features with the word embedding confidence features to create a feature vector, and executing a confidence classifier to generate a confidence score, wherein the confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector.

12. The machine-readable storage device of example 11, wherein the word embedding confidence feature comprises character embedding.

13. The machine-readable storage device of example 12, wherein the character embedding comprises 26 or fewer embeddings corresponding to letters of the alphabet.

14. The machine-readable storage device of any of examples 12 to 13, wherein the character embedding of the word comprises a vector having a value for each letter, consisting of a count of that letter in the word.

15. The machine-readable storage device of any of examples 11 to 14, wherein the word embedding confidence feature comprises phoneme embedding, the phoneme embedding comprising monophones selected from a dictionary comprising 40 or fewer monophone units.

16. The machine-readable storage device of any of examples 11 to 15, wherein the word embedding confidence features comprise character embedding and phoneme embedding.

17. The machine-readable storage device of any of examples 11 to 16, wherein the confidence classifier is trained for word-level classification and utterance-level classification, and wherein the baseline features include two or more of an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and a duration feature.

18. An apparatus includes a processor and a memory device coupled to the processor and having a program stored thereon for execution by the processor to perform operations. The operations include receiving a set of baseline confidence features from one or more decoded words, deriving word embedding confidence features from the baseline confidence features, combining the baseline confidence features with the word embedding confidence features to create a feature vector, and executing a confidence classifier to generate a confidence score, wherein the confidence classifier is trained with a set of training examples having labeled features corresponding to the feature vector.

19. The apparatus of example 18, wherein the word embedding confidence feature comprises one or more of a character embedding and a phoneme embedding comprising monophones.

20. The apparatus of any of examples 18 to 19, wherein the confidence classifier is trained for word-level classification and utterance-level classification, and wherein the baseline features include two or more of an acoustic model score, a background model score, a silence model score, a noise model score, a language model score, and a duration feature.

Although several embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be removed, from, the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.
