System and method for end-to-end speech recognition with triggered attention

Document No.: 174378    Publication date: 2021-10-29

Note: This invention, "System and method for end-to-end speech recognition with triggered attention," was created by N. Moritz, Takaaki Hori, and J. Le Roux on 2020-01-16. Its main content is summarized as follows: A speech recognition system includes an encoder for converting an input acoustic signal into a sequence of encoder states, an alignment decoder for identifying the positions of the encoder states in the sequence that encode transcription outputs, a partitioning module for partitioning the sequence of encoder states into a set of partitions based on the identified positions, and an attention-based decoder that determines a transcription output for each partition of encoder states submitted as input to the attention-based decoder. Upon receiving the acoustic signal, the system generates the sequence of encoder states using the encoder, partitions the sequence of encoder states into a set of partitions based on the positions of the encoder states identified by the alignment decoder, and sequentially submits the partitions to the attention-based decoder to produce a transcription output for each submitted partition.

1. A speech recognition system, the speech recognition system comprising:

computer memory configured to store:

an encoder configured to convert an input acoustic signal into a sequence of encoder states;

an alignment decoder configured to identify a position of an encoder state in the sequence of encoder states that encodes a transcription output;

a partitioning module configured to partition the sequence of encoder states into a set of partitions based on the identified locations of the encoder states; and

an attention-based decoder configured to determine a transcription output for each partition of encoder states submitted as input to the attention-based decoder;

an input interface configured to receive the acoustic signal representing at least a portion of a speech utterance;

a hardware processor configured to:

submit the received acoustic signal to the encoder to produce the sequence of encoder states;

submit the sequence of encoder states into the alignment decoder to identify a location of an encoder state encoding the transcription output;

based on the identified locations of the encoder states, partition the sequence of encoder states into the set of partitions using the partitioning module; and

sequentially submit the set of partitions into the attention-based decoder to produce a transcription output for each submitted partition; and

an output interface configured to output the transcription output.

2. The speech recognition system of claim 1, wherein the output interface is configured to output each transcription output individually as it is transcribed.

3. The speech recognition system of claim 1, wherein the output interface is configured to accumulate a set of transcription outputs to form words and output each word separately.

4. The speech recognition system of claim 1, wherein the processor partitions the sequence of encoder states for each identified position of an encoder state, such that the number of partitions is defined by the number of identified encoder states.

5. The speech recognition system of claim 4, wherein each partition comprises the encoder states from a start of the sequence of encoder states up to a look-ahead encoder state determined by shifting the identified position of the encoder state forward by a fixed displacement.

6. The speech recognition system of claim 4, wherein each partition corresponding to the identified position of the encoder state comprises a predetermined number of encoder states around the identified position of the encoder state.

7. The speech recognition system of claim 1, wherein the set of partitions includes a first partition and subsequent partitions, wherein the processor processes the first partition with the attention-based decoder to produce a first transcription output, and wherein, after the attention-based decoder completes processing of the first partition, which places the attention-based decoder in an updated internal state, the processor processes the subsequent partitions with the attention-based decoder without resetting that internal state, to produce transcription outputs for the subsequent partitions one after another.

8. The speech recognition system of claim 1, wherein the attention-based decoder is configured to process different partitions without resetting an internal state of the attention-based decoder, wherein the processor, when determining that the speech utterance ended, is configured to reset the internal state of the attention-based decoder.

9. The speech recognition system of claim 1, wherein the processor, upon receiving a subsequent acoustic signal representing a subsequent portion of the speech utterance, is configured to:

submit the subsequent acoustic signal to the encoder to generate a subsequent sequence of encoder states;

submit the subsequent sequence of encoder states to the alignment decoder to identify a position of an encoder state in the subsequent sequence of encoder states that encodes a transcription output;

concatenate the sequence of encoder states and the subsequent sequence of encoder states together to generate a concatenated sequence of encoder states; and

partition the concatenated sequence of encoder states based on the identified positions of the encoder states to update the set of partitions.

10. The speech recognition system of claim 9, further comprising:

a gate that divides the speech utterance into acoustic signal blocks such that the input interface receives one acoustic signal block at a time.

11. The speech recognition system of claim 1, wherein the encoder, the alignment decoder, and the attention-based decoder are jointly trained neural networks.

12. The speech recognition system of claim 11, wherein the alignment decoder comprises a neural network based on connectionist temporal classification (CTC) or a classifier based on a hidden Markov model (HMM).

13. The speech recognition system of claim 11, wherein the alignment decoder is a CTC-based neural network, i.e., a neural network based on connectionist temporal classification, and wherein the attention-based decoder is an attention-based neural network,

wherein the transcription output determined by the attention-based neural network comprises a probability of the transcription output,

wherein the CTC based neural network is further trained to determine a probability of a transcription output in the encoder state provided as an input to the CTC based neural network,

wherein the processor determines a first sequence of probabilities of transcription outputs in the acoustic signal by submitting the sequence of encoder states into the CTC-based neural network,

wherein the processor determines a second sequence of probabilities of transcription outputs in the acoustic signal by submitting the partitions of the sequence of encoder states into the attention-based neural network,

wherein the processor is configured to determine a transcription output in the acoustic signal based on a combination of the first and second sequences of probabilities of transcription outputs.

14. A method of speech recognition, wherein the method uses a processor coupled with stored instructions that implement the method, wherein the instructions, when executed by the processor, perform the steps of the method, the method comprising the steps of:

receiving an acoustic signal representing at least a portion of a speech utterance;

converting the acoustic signal into a sequence of encoder states;

identifying a position of an encoder state in the sequence of encoder states that encodes a transcription output;

dividing the sequence of encoder states into a set of partitions based on the identified locations of the encoder states;

sequentially submitting the set of partitions into an attention-based decoder to produce a transcription output for each submitted partition; and

outputting the transcription output.

15. A non-transitory computer readable storage medium having embodied thereon a program executable by a processor for performing a method comprising:

receiving an acoustic signal representing at least a portion of a speech utterance;

converting the acoustic signal into a sequence of encoder states;

identifying a position of an encoder state in the sequence of encoder states that encodes a transcription output;

dividing the sequence of encoder states into a set of partitions based on the identified locations of the encoder states;

sequentially submitting the set of partitions into an attention-based decoder to produce a transcription output for each submitted partition; and

outputting the transcription output.

Technical Field

The present invention relates generally to a system and method for speech recognition and, more particularly, to a method and system for end-to-end speech recognition.

Background

Automatic Speech Recognition (ASR) systems are widely used in a variety of interface applications, such as speech search. However, it is challenging to build a speech recognition system that achieves high recognition accuracy, because doing so requires in-depth linguistic knowledge of the target language accepted by the ASR system. For example, phone sets, vocabularies, and pronunciation dictionaries are essential for building such ASR systems. The phone set needs to be carefully defined by linguists of the language. A pronunciation dictionary needs to be created manually by assigning one or more phoneme sequences to each word in a vocabulary that includes more than 100,000 words. Furthermore, some languages do not have explicit word boundaries, so tokenization may be needed to create a vocabulary from a text corpus. Therefore, it is very difficult to develop a speech recognition system, especially for minor languages. Another problem is that a speech recognition system is factorized into a number of modules, including acoustic, lexicon, and language models, which are optimized separately. This architecture may result in local optima, although each model is trained to match the other models.

End-to-end and sequence-to-sequence neural network models have recently gained increasing attention and popularity in the ASR community. The output of an end-to-end ASR system is typically a grapheme sequence, which may consist of single letters or larger units, such as word pieces and entire words. The appeal of end-to-end ASR is that, in comparison to traditional ASR systems, it enables a simplified system architecture built of neural network components, avoiding the need for linguistic expertise. An end-to-end ASR system can directly learn all the components of a speech recognizer, including the pronunciation, acoustic, and language models, thereby avoiding the need for language-specific linguistic information and text normalization.

The goal of end-to-end speech recognition is to reduce the traditional architecture to a single neural network architecture within a deep learning framework. For example, some end-to-end ASR systems use attention-based neural networks, such as those proposed in 2015 by Chan et al. of Carnegie Mellon University and Google Brain and by Bahdanau et al. of Jacobs University and the University of Montreal. Attention-based neural networks (see, e.g., U.S. Patent 9,990,918) demonstrate the latest success of end-to-end speech recognition. However, attention-based neural networks have output delays and are not well suited for online/streaming ASR, where low delay is required.

Accordingly, there is a need to reduce the output delay caused by such attention-based model architectures for end-to-end and/or sequence-to-sequence speech recognition.

Disclosure of Invention

Automatic Speech Recognition (ASR) can be viewed as a sequence-to-sequence problem in which the input is a sequence of acoustic features extracted from audio frames at a given frame rate, and the output is a sequence of characters. It is an object of some embodiments to improve the performance of attention-based networks for end-to-end and/or sequence-to-sequence speech recognition. Additionally or alternatively, another object of some embodiments is to reduce the output delay introduced by attention-based model architectures and to adapt end-to-end attention-based ASR systems for recognition in a streaming/online fashion.

Some embodiments are based on the recognition that attention-based ASR systems need to observe an input sequence (which is typically an entire speech utterance delimited by speech pauses) to assign a weight to each input frame in order to identify each transcription output of the output sequence. For example, the transcription output may include a single alphabetic character or a series of characters, such as a word or sentence fragment. Attention-based networks typically need to handle large input sequences due to the lack of a priori knowledge about which portions of the input sequence are relevant to identifying the next transcription output and the need to assign weights to each input frame. This processing allows attention to focus on different parts of the utterance, but it also increases the output delay and is therefore impractical for speech recognition in a streaming/online manner.

As used herein, the output delay of an ASR system is the difference between the time when an acoustic frame of a speech utterance is received and the time when the received acoustic frame is recognized. For example, when an attention-based ASR system operates on an entire speech utterance, recognition of the words in the utterance is delayed until the last audio sample of the utterance is received. This delay in recognition results in an increased output delay.

Some embodiments are based on the following recognition: an example of a priori knowledge about the relevance of different parts of the input sequence to identifying the next transcription output is an indication of the location of the frame corresponding to the transcription segment to be identified in the input sequence. In fact, if the location of the transcribed fragments is known, attention-based networks can be constrained by limiting the input sequences to focus more on the area around them. In this way, for each transcription output, the attention-based network can focus on the region around the assumed position of the transcription segment in the input sequence. This directed attention reduces the need to process large input sequences, which in turn reduces output delays, making the attention-based network practical for recognition in a streaming/online manner.

Therefore, there is a need to determine the positional alignment of the inputs of the attention-based network with the outputs of the attention-based network to reduce output latency. Unfortunately, for ASR applications this alignment is far from straightforward due to irregularities of human pronunciation. For example, even within a single utterance, the pronunciation speed may vary, thereby introducing different numbers of silent segments between different words of the same utterance or even between different characters of a single word. Furthermore, most attention-based systems first convert the input features (such as acoustic features) through an encoder network into a different representation referred to herein as encoder states. For this reason, the desired alignment is performed on the encoder states rather than on the input acoustic features.

Some embodiments are based on the recognition that it is desirable to provide an alignment network that is trained to determine the locations of the encoder states that encode transcription outputs (such as characters, word pieces, words, etc.). For example, connectionist temporal classification (CTC) is a type of neural network output and associated scoring function used to train a recurrent neural network (RNN), such as a long short-term memory (LSTM) network, to solve sequence problems with variable timing. CTC-based ASR systems are an alternative to attention-based ASR systems. A CTC-based neural network generates an output for each frame of the input sequence, i.e., the inputs and outputs are synchronized, and collapses the neural network outputs into an output transcription using a beam search algorithm. The performance of attention-based ASR systems may be superior to that of CTC-based ASR systems. However, some embodiments are based on the following recognition: the input and output frame alignment used by intermediate operations of a CTC-based ASR system may be used by an attention-based ASR system to address its output latency deficiencies described above.

Additionally or alternatively, some embodiments are based on the following recognition: a hidden Markov model (HMM) based system may provide the desired alignment. In particular, the alignment information can be computed using traditional HMM-based ASR systems, such as HMM models based on hybrid deep neural networks (DNNs) or Gaussian mixture models (GMMs).

Accordingly, one embodiment discloses a speech recognition system that is trained to produce a transcription of an utterance from an acoustic signal. The speech recognition system includes: an encoder network configured to process the acoustic signal to produce an encoded acoustic signal comprising a sequence of encoder states; an alignment network, such as a connectionist temporal classification (CTC) based neural network and/or an HMM-based model, configured to process the sequence of encoder states to produce an alignment of the transcription outputs, so as to identify the locations of the encoder states that encode the most relevant information for generating the transcription outputs; and an attention-based neural network configured to determine a representation of a transcription of the utterance from a subsequence of the encoder states.

To this end, the speech recognition system submits the received acoustic signal into the encoder network to generate a sequence of encoder states; submits the sequence of encoder states into the alignment network to identify the positions of the encoder states in the sequence that encode transcription outputs; partitions the sequence of encoder states into a set of partitions based on the identified positions of the encoder states; and sequentially submits the set of partitions into the attention-based neural network to produce a transcription output for each submitted partition. Because of this sequential processing of partitions that include transcription outputs, the attention-based network adapts the end-to-end attention-based ASR system for recognition in a streaming/online manner.

In some implementations, the encoder, the alignment decoder, and the attention-based decoder are neural networks adapted for joint training. It is noted that the alignment decoder (such as a CTC-based neural network) may also operate not on the original acoustic feature frames, but on the encoder states generated by the encoder. Thus, the CTC-based neural network may be trained on the output of the same encoder used to train the attention-based neural network, to generate an alignment of the encoder states that are provided as inputs to the attention-based neural network. The alignment generated by the CTC-based neural network indicates the positions of the frames in the sequence of frames of the encoded acoustic signal that encode the transcription outputs of the utterance. Because of this alignment, the attention-based neural network may use this a priori knowledge as anchor points to find input frame sequences that include sufficient information to identify the next transcription output. This alignment, in turn, allows for reduced transcription errors, reduced computational complexity, and/or adaptation of the attention-based network for speech recognition in a streaming/online manner.

Another problem addressed by various embodiments is how to adjust the attention of an attention-based decoder in an efficient manner using the position alignment. For example, one embodiment modifies the structure of the attention-based decoder to accept the locations of the transcribed segments as side information, and trains the attention-based neural network to direct its attention using the side information. Another embodiment partitions the input to the attention-based neural network based on the location of the next transcription output detected by the alignment decoder. This partitioning forces the attention-based decoder to focus only on the desired input frames. Furthermore, such a partitioning reduces the need to wait until future input frames up to the end of the utterance are received, which reduces the output delay.

For example, in some implementations, the ASR system partitions a sequence of encoder states representing the encoded acoustic signal according to the indicated locations. These partitions of the encoded acoustic signal are iteratively processed by an attention-based decoder to produce a transcription of the utterance. In this way, different iterations process different portions of the overall input signal. This allows processing the input acoustic signal in a streaming/on-line manner.

For example, in one embodiment, an iteration of the attention-based decoder starts with the internal state resulting from a previous iteration to process a different partition than was processed during the previous iteration. Thus, the internal state of the attention-based neural network is preserved for processing not only the characters of the same input frame sequence, but also the characters of different input frame sequences. In this way, the attention-based decoder forwards its internal state to handle different partitions, i.e., different frame sequences. This allows the attention-based model to focus its attention on different parts of the utterance and to reduce errors caused by, for example, deleted or skipped transcription outputs.

For example, in one implementation, each partition corresponds to a location identified by the CTC-based neural network to include a portion of a sequence of frames from the beginning of the sequence up to some look-ahead frame. Such partitions incrementally add new information to the input frame sequence while retaining previously processed information. In fact, such partitioning follows the principles of attention-based models, allowing the same portion of an utterance to be processed multiple times and weights to be used to preferentially focus on different portions of the utterance. However, since the previous portion has already been decoded and the new portion added corresponds to the new transcription output to be decoded, the attention-based model may increase the attention to the newly added frame to improve the accuracy of the decoding.

Additionally or alternatively, some embodiments limit not only the processing of future input frames, but also the number of past frames processed by the attention-based decoder. For example, one embodiment partitions the encoded acoustic frames such that each partition includes a subsequence of the frame sequence with a fixed number of frames. The partitioning is performed such that the subsequence of frames includes the frame at the position identified by the alignment decoder. For example, a subsequence of frames may be centered around the frame at the respective identified location and/or include a subsequence of frames around the frame at the identified location. This embodiment reduces the size of the partitions processed by the attention-based neural network, which reduces the computational complexity.

In some embodiments, the encoder neural network, the CTC-based neural network, and the attention-based neural network are jointly trained to form a Triggered Attention (TA) neural network. In this way, the CTC-based neural network and the attention-based neural network are trained from the output of the same encoder neural network. This improves the accuracy of the collaboration between the different components of the TA network and allows the TA network to be trained in an end-to-end manner to produce an end-to-end ASR system.
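Although this disclosure does not state the training objective explicitly, a common way to realize such joint training, assumed here from the standard hybrid CTC/attention recipe rather than quoted from this text, is to minimize a weighted combination of the CTC loss and the attention (cross-entropy) loss computed on the output of the shared encoder:

$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{CTC}}\big(C, \mathrm{Encoder}(X)\big) + (1 - \lambda)\, \mathcal{L}_{\mathrm{att}}\big(C, \mathrm{Encoder}(X)\big), \qquad 0 \le \lambda \le 1,$

where $\lambda$ is an illustrative interpolation weight. Because both terms are driven by the same encoder output, the gradients of both decoders shape a single shared representation, which is what lets the CTC alignment and the attention decoder cooperate at inference time.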

Accordingly, one embodiment discloses a speech recognition system comprising: a computer memory configured to store an encoder configured to convert an input acoustic signal into a sequence of encoder states; an alignment decoder configured to identify a position of an encoder state in the sequence of encoder states that encodes a transcription output; a partitioning module configured to partition the sequence of encoder states into a set of partitions based on the identified locations of the encoder states; and an attention-based decoder configured to determine a transcription output for each partition of encoder states submitted as input to the attention-based decoder; an input interface configured to receive an acoustic signal representing at least a portion of a speech utterance; a hardware processor configured to submit the received acoustic signal to the encoder to generate the sequence of encoder states, submit the sequence of encoder states into the alignment decoder to identify the locations of the encoder states encoding the transcription outputs, partition the sequence of encoder states into the set of partitions using the partitioning module based on the identified locations of the encoder states, and sequentially submit the set of partitions into the attention-based decoder to generate a transcription output for each submitted partition; and an output interface configured to output the transcription output.

Another embodiment discloses a method of speech recognition, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor, perform the steps of the method, the method comprising the steps of: receiving an acoustic signal representing at least a portion of a speech utterance; converting the acoustic signal into a sequence of encoder states; identifying a position of an encoder state in the sequence of encoder states that encodes the transcription output; dividing the sequence of encoder states into a set of partitions based on the identified position of the encoder state; sequentially submitting the set of partitions into an attention-based decoder to generate a transcription output for each submitted partition; and outputting the transcription output.

Yet another embodiment discloses a non-transitory computer readable storage medium having embodied thereon a program executable by a processor to perform a method. The method comprises the following steps: receiving an acoustic signal representing at least a portion of a speech utterance; converting the acoustic signal into a sequence of encoder states; identifying a position of an encoder state in the sequence of encoder states that encodes the transcription output; dividing the sequence of encoder states into a set of partitions based on the identified position of the encoder state; sequentially submitting the set of partitions into an attention-based decoder to generate a transcription output for each submitted partition; and outputting the transcription output.

Drawings

[ FIG. 1]

FIG. 1 illustrates a schematic diagram of a speech recognition system (ASR) configured for end-to-end speech recognition, according to some embodiments.

[ FIG. 2A ]

Fig. 2A shows a schematic diagram of an alignment decoder according to some embodiments.

[ FIG. 2B ]

Fig. 2B illustrates an example of partitioning a sequence of encoder states, according to some embodiments.

[ FIG. 2C ]

Fig. 2C illustrates an example of partitioning a sequence of encoder states, according to some embodiments.

[ FIG. 3]

Fig. 3 illustrates an example of an attention-based decoder according to some embodiments.

[ FIG. 4]

FIG. 4 illustrates a block diagram of a speech recognition system according to some embodiments.

[ FIG. 5]

FIG. 5 illustrates a block diagram of a method performed by an ASR system upon receiving a subsequent acoustic signal representing a subsequent portion of a speech utterance, according to one embodiment.

[ FIG. 6]

FIG. 6 illustrates a block diagram of a triggered attention neural network of an end-to-end speech recognition system, according to one embodiment.

[ FIG. 7]

Figure 7 is a schematic diagram illustrating a combined neural network, according to some embodiments.

[ FIG. 8]

FIG. 8 illustrates a performance comparison graph of speech recognition according to some embodiments.

[ FIG. 9]

FIG. 9 is a block diagram illustrating some components that may be used in various configurations for implementing systems and methods, according to some embodiments.

Detailed Description

FIG. 1 illustrates a schematic diagram of a speech recognition system (ASR) 100 configured for end-to-end speech recognition, according to some embodiments. The speech recognition system 100 obtains an input acoustic sequence and processes it to generate a transcription output sequence. Each transcription output sequence is a transcription of the utterance, or of a part of the utterance, represented by the corresponding input acoustic signal. For example, the speech recognition system 100 may obtain the input acoustic signal 102 and generate a corresponding transcription output 110, which is a transcription of the utterance represented by the input acoustic signal 102.

The input acoustic signal 102 may include a multi-frame sequence of audio data, e.g., a continuous data stream, that is a digital representation of an utterance. The sequence of frames of audio data may correspond to a first set of time steps, e.g., where each frame of audio data is associated with 25 milliseconds of audio stream data, shifted in time by 10 milliseconds from the previous frame of audio data. Each frame of audio data in the multi-frame sequence may include feature values characterizing the portion of the utterance at the corresponding time step. For example, the multi-frame sequence of audio data may include filter bank spectral feature vectors.
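As a concrete illustration of the framing arithmetic above (25 millisecond windows shifted by 10 milliseconds), the following sketch splits a waveform into the frames from which feature vectors such as filter bank energies would then be computed; the function name and the 16 kHz sampling rate are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, win_ms=25.0, hop_ms=10.0):
    """Split a mono waveform into overlapping frames (25 ms window, 10 ms hop).

    Each returned row corresponds to one audio frame from which a feature
    vector (e.g., a log-mel filter bank vector) would be computed.
    """
    win = int(sample_rate * win_ms / 1000.0)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000.0)   # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(samples) - win) // hop)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return samples[idx]

# Example: one second of audio yields 98 frames of 400 samples each.
audio = np.random.randn(16000)
print(frame_signal(audio).shape)  # (98, 400)
```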

The transcription output 110 may include a sequence of transcription fragments of the utterance represented by the input acoustic signal 102. A transcription output may include one or more characters. For example, the transcription output may be a character or a sequence of characters from a Unicode character set. For example, the character set may include the alphabets of English, Asian languages, Cyrillic, and Arabic. The character set may also include Arabic numerals, space characters, and punctuation marks. Additionally or alternatively, the transcription output may include word pieces, words, and other linguistic structures.

The speech recognition system 100 includes an acoustic encoder 104 and an attention-based decoder 108. The acoustic encoder 104 processes the input acoustic signal 102 and generates a sequence of encoder states 106 that provides an alternative (e.g., higher-level) representation of the input acoustic signal 102. The sequence of encoder states may include an alternative representation of the multi-frame sequence of audio data corresponding to a second set of time steps. In some implementations, the alternative representation of the input acoustic sequence is subsampled to a lower frame rate, i.e., the second set of time steps of the alternative representation is smaller than the first set of time steps of the input acoustic sequence. The attention-based decoder 108 is trained to process the encoder states 106 representing the alternative representation of the input acoustic signal 102 and to generate the transcription output 110 from the sequence of encoder states provided to the attention-based decoder 108.

Some embodiments are based on the recognition that an attention-based ASR system may need to observe the entire speech utterance, delimited by speech pauses, to assign a weight to each input frame in order to identify each transcription output 110. Attention-based decoders typically need to handle large input sequences due to the lack of a priori knowledge about which part of the input acoustic signal is relevant for identifying the next transcription output and the need to assign a weight to each encoder state. This processing allows attention to focus on different parts of the utterance, but it also increases the output delay and is therefore impractical for speech recognition in a streaming/online manner.

As used herein, the output delay of an ASR system is the difference between the time when an acoustic frame of a speech utterance is received and the time when the received acoustic frame is recognized. For example, when an attention-based ASR system operates on an entire speech utterance, recognition of the words in the utterance is delayed until the last word of the utterance is received. This delay in recognition results in an increased output delay.

Some embodiments are based on the following recognition: an example of a priori knowledge about the relevance of different parts of the input sequence to identifying the next transcription output is an indication of the positions of the frames in the input sequence that correspond to the transcription outputs to be identified. In fact, if the transcription output locations are known, the attention-based decoder may be forced, by constraining the input sequence, to focus more on these locations and less, or not at all, on other locations. In this way, for each transcription output, the attention-based network may focus its attention around the corresponding position in the input sequence. This directed attention reduces the need to process large input sequences, which in turn reduces the output delay, enabling the attention-based decoder to recognize in a streaming/online manner.

To this end, the ASR system 100 includes an alignment decoder 120, which is trained to determine the positions 125 of the encoder states in the sequence 106 that encode transcription outputs, such as characters, word pieces, words, etc. For example, connectionist temporal classification (CTC) is an objective function and associated neural network output used to train a recurrent neural network (RNN), such as a long short-term memory (LSTM) network, to solve sequence problems with variable timing. CTC-based ASR systems are an alternative to attention-based ASR systems. A CTC-based neural network generates an output for each frame of the input sequence, i.e., the inputs and outputs are synchronized, and a beam search algorithm is used to find the best output sequence before collapsing the neural network outputs into an output transcription. The performance of attention-based ASR systems may be superior to that of CTC-based ASR systems. However, some embodiments are based on the following recognition: the input and output frame alignment used by intermediate operations of a CTC-based ASR system may be used by an attention-based ASR system to address its output latency deficiencies described above.

To utilize the alignment information 125 provided by the alignment decoder 120, the ASR system 100 includes a partitioning module 130 configured to partition the encoder state sequence 106 into a set of partitions 135. For example, the partitioning module 130 may partition the encoder state sequence for each identified position 125 of an encoder state, such that the number of partitions 135 is defined by, e.g., equal to, the number of identified encoder states 125 encoding transcription outputs. In this way, the attention-based decoder does not accept the entire sequence 106 as input, but rather portions 135 of the sequence, each of which may include a new transcription output, to form the transcription output sequence 110. In some embodiments, the combination of the alignment decoder, the attention-based decoder, and the partitioning module is referred to as a triggered attention decoder. In practice, the triggered attention decoder may process portions of the utterance as it is received, making the ASR system 100 practical for recognition in a streaming/online manner.

Fig. 2A shows a schematic diagram of an alignment decoder 120 according to some embodiments. One of the goals of the alignment decoder 120 is to decode the encoder state sequence 106 produced by the encoder 104. To this end, the alignment decoder is trained to decode the sequence 106 to produce a transcription output sequence 126. This is why the alignment decoder 120 is referred to as a decoder in this disclosure. However, at least some embodiments do not use the decoded transcription output of the alignment decoder. Rather, some embodiments use the intermediate alignment information generated by the alignment decoder while decoding the encoder state sequence 106. In other words, some embodiments ignore the transcription outputs decoded by the alignment decoder and instead use the positions 125 of the encoder states in the sequence 106 to improve the performance of the attention-based decoder 108. The rationale behind this approach is that the performance of the attention-based decoder 108 may be better than the performance of the alignment decoder 120. To this end, the intermediate alignment information generated by the alignment decoder 120 is used to further improve the performance of the attention-based decoder 108.

However, in some embodiments, the transcription outputs 126 decoded by the alignment decoder 120 are further combined with the transcription outputs decoded by the attention-based decoder 108 to further improve the accuracy of the recognition. In these embodiments, the alignment decoder 120 is used twice: the first time to help partition the encoder state sequence for the attention-based decoder 108, and the second time to further improve the accuracy of the transcription outputs decoded by the attention-based decoder 108.

Fig. 2A shows an example of the operation of the alignment decoder processing an example portion of an utterance with the word "dog". The boxes around the elements of the depicted sequence identify the positions 125 of the encoder states in the encoder state sequence 106 that encode the transcription outputs. For example, the encoder 104 converts an input acoustic sequence X of acoustic features (such as log-mel spectral energies) into a sequence H of T encoder states:

H = Encoder(X).

For example, in one implementation, the encoder output is subsampled to a four times lower frame rate compared to the feature matrix $X$, which has a sampling rate of 100 Hz. Let $Z = (z_1, \dots, z_T)$ denote the frame-wise sequence of encoder states 106, with $z_t \in \mathcal{U} \cup \{\epsilon\}$, where $\mathcal{U}$ denotes a set of distinct graphemes, which may be single characters or word fragments, and $\epsilon$ denotes a blank symbol. Let $C = (c_1, \dots, c_L)$, with $c_l \in \mathcal{U}$, denote a grapheme sequence of length $L$, such that the sequence $Z$ reduces to $C$ when repeated labels are collapsed into single occurrences and blank symbols are removed.
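The collapse rule that reduces $Z$ to $C$ can be illustrated in a few lines; this is only a sketch of the rule stated above, with the underscore character standing in for the blank symbol $\epsilon$.

```python
def collapse(z, blank="_"):
    """Reduce a frame-wise label sequence Z to a grapheme sequence C by folding
    repeated labels into single occurrences and removing blank symbols."""
    c, prev = [], None
    for label in z:
        if label != blank and label != prev:
            c.append(label)
        prev = label
    return c

# Frame-wise output for the word "dog" (blanks shown as '_'):
Z = ["_", "_", "d", "d", "_", "o", "o", "o", "_", "_", "g", "_"]
print(collapse(Z))  # ['d', 'o', 'g']
```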

In some embodiments, the alignment decoder decodes the sequence of encoder states probabilistically, where the probability is derived as

$p(C \mid H) \approx \sum_{Z} p(Z \mid C)\, p(Z \mid H),$

where $p(Z \mid C)$ represents the transition probabilities and $p(Z \mid H)$ represents the acoustic model.

In some embodiments, the alignment decoder identifies, as the identified encoder state, the frame with the highest probability within each subsequence of frames corresponding to the same grapheme in $Z$. For example, let the indices $i_l$ and $j_l$ denote the beginning and the end of the occurrence of the $l$-th label $c_l$ in $Z$, i.e., $z_t = c_l$ for $i_l \le t \le j_l$, and $z_t = \epsilon$ for all other indices $t$. The alignment decoder derives from the sequence $Z$ a sequence $Z'$ of the same length $T$ that contains the identified encoder states 125 encoding the transcription outputs with the highest probability, $Z' = (\epsilon^*, c_1, \epsilon^*, c_2, \epsilon^*, \dots, c_L, \epsilon^*)$, where $\epsilon^*$ denotes zero or more repetitions of $\epsilon$, and where each $c_l$ occurs only once, at the frame with the highest probability among the frames corresponding to $c_l$:

$t_l = \underset{i_l \le t \le j_l}{\operatorname{argmax}}\; p(z_t = c_l \mid H).$

Alternatively, the alignment decoder may identify the first or the last frame within each subsequence of frames corresponding to the same grapheme in $Z$ as the identified encoder state.
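A minimal sketch of this trigger-frame selection is shown below, assuming the alignment decoder exposes a frame-wise posterior matrix and that the frame-wise decisions are taken greedily (best path); the function and variable names are illustrative, not taken from the disclosure.

```python
import numpy as np

def trigger_frames(posteriors, labels, blank=0):
    """For each occurrence of a non-blank label on the greedy frame-wise path,
    pick the frame with the highest posterior for that label, i.e., the
    identified encoder-state position used to trigger the attention decoder.

    posteriors: (T, V) frame-wise label probabilities from the alignment decoder.
    labels:     list of V label symbols; index `blank` is the blank symbol.
    Returns a list of (frame_index, label) pairs.
    """
    path = posteriors.argmax(axis=1)   # greedy frame-wise decision z_t
    triggers, t, T = [], 0, len(posteriors)
    while t < T:
        if path[t] == blank:
            t += 1
            continue
        start, lab = t, path[t]        # span [i_l, j_l] of the same label c_l
        while t < T and path[t] == lab:
            t += 1
        best = start + int(np.argmax(posteriors[start:t, lab]))
        triggers.append((best, labels[lab]))
    return triggers

# Toy example with labels 0='_' (blank), 1='d', 2='o':
P = np.array([[0.9, 0.05, 0.05],
              [0.2, 0.7,  0.1],
              [0.1, 0.8,  0.1],
              [0.8, 0.1,  0.1],
              [0.1, 0.1,  0.8]])
print(trigger_frames(P, ["_", "d", "o"]))  # [(2, 'd'), (4, 'o')]
```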

Fig. 2B and 2C illustrate examples of partitioning the encoder state sequence, according to some embodiments. In various embodiments, the partitioning is performed by a partitioning module 130 operatively connected to the alignment decoder 120, the attention-based decoder 108, and the encoder 104. The partitioning module 130 is configured to access alignment information 125 of the alignment decoder 120, partition the encoder state sequence generated by the encoder 104, and submit portions 135 of the encoder state sequence sequentially to the attention-based decoder 108.

For example, in the embodiment of FIG. 2B, each partition 135b includes the encoder states from the beginning of the sequence of encoder states up to a look-ahead encoder state determined by shifting the position of the identified encoder state forward by a fixed displacement. FIG. 1 shows an example of a look-ahead encoder state 140. For example, if the value of the fixed displacement is 5 and the identified position of the encoder state is the 8th in the sequence of encoder states, partition 135b includes the first 13 encoder states. If the position of the subsequently identified encoder state is 11, the next partition 135b includes the first 16 encoder states. In effect, each partition includes the encoder state of a new transcription output, while the increasing length of the partitions allows the attention-based decoder to take advantage of the full sequence processed so far.

In the alternative embodiment of FIG. 2C, the partition 135c corresponding to the position of the identified encoder state includes a predetermined number of encoder states centered around that position. For example, if the predetermined number of encoder states is 7 and the identified position of the encoder state is the 15th in the sequence of encoder states, partition 135c includes the encoder states between the 12th and the 18th in the sequence of encoder states 106. In effect, each partition includes the encoder state of a new transcription output, while the fixed length of the partitions reduces the computational burden of the attention-based decoder. Additionally or alternatively, the partition 135c corresponding to the identified position of the encoder state includes a predetermined number of encoder states around the identified position, e.g., shifted from the center to provide off-center coverage.
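The two partitioning rules of FIGS. 2B and 2C can be sketched as follows; the parameter values mirror the numerical examples above (fixed displacement of 5, window of 7 encoder states), and the function names are illustrative.

```python
def lookahead_partitions(encoder_states, trigger_positions, lookahead=5):
    """FIG. 2B style: each partition runs from the start of the sequence up to
    the trigger position shifted forward by a fixed displacement."""
    T = len(encoder_states)
    return [encoder_states[:min(pos + lookahead + 1, T)]
            for pos in trigger_positions]

def windowed_partitions(encoder_states, trigger_positions, width=7):
    """FIG. 2C style: each partition is a fixed number of encoder states
    centered on (or shifted around) the trigger position."""
    T, half = len(encoder_states), width // 2
    parts = []
    for pos in trigger_positions:
        start = max(0, pos - half)
        parts.append(encoder_states[start:min(start + width, T)])
    return parts

states = list(range(20))                            # stand-in for 20 encoder states
print(lookahead_partitions(states, [7, 10])[0])     # states 0..12 (first 13 states)
print(windowed_partitions(states, [14])[0])         # states 11..17 (7 states)
```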

Fig. 3 illustrates an example of an attention-based decoder 108 according to some embodiments. The attention-based decoder 108 includes a context vector generator 304 and a decoder neural network 306. The context vector generator 304 receives as input the hidden decoder state 312 of the decoder neural network 306 from the previous time step, the attention weight distribution 310 of the context vector generator from the previous time step, and the alternative representation 106 (i.e., the alternative representation of the acoustic signal 102 described above with reference to FIG. 1). The context vector generator 304 processes the previous hidden decoder state 312, the previous attention weight distribution 310, and the alternative representation 106 to compute an attention weight distribution over the time frames of the alternative representation 106 and to generate a context vector 314 for the time step as output. The context vector generator 304 provides the context vector 314 for the time step to the decoder neural network 306.

For different iterations, the attention-based decoder 108 receives different partitions 331, 333, and 335. For example, the partition set includes a first partition 331 and subsequent partitions 333 and 335. The attention-based decoder 108 processes the first partition 331 to produce a first transcription output. After the attention-based neural network completes processing of the first partition, which places the attention-based network in an updated internal state, the attention-based decoder 108 processes the subsequent partitions using the attention-based network without resetting this internal state, to generate transcription outputs for the subsequent partitions one by one.

In effect, the attention-based decoder 108 processes the different partitions to utilize the previously decoded information without resetting the internal state of the attention-based network. Upon determining that the speech utterance ended, the attention-based decoder 108 is configured to reset its internal state.

The decoder neural network 306 receives as inputs the context vector 314 for the time step, as well as the transcription output 308 and the hidden decoder state 312 from the previous time step. The decoder neural network 306 initializes its internal hidden state with the previous hidden decoder state 312 before processing the context vector 314 for the time step and the transcription output 308 from the previous time step, to generate as output a set of transcription output scores 316 for the time step. In some implementations, the decoder neural network 306 is a recurrent neural network (RNN) with a softmax output layer. Each transcription output score corresponds to a respective transcription output from a set of transcription outputs. For example, as described above with reference to FIG. 1, the set of transcription outputs may be characters or sequences of characters from a Unicode character set used to write one or more natural languages, such as the alphabets of English, Asian languages, Cyrillic, and Arabic. The set of transcription outputs may also include Arabic numerals, space characters, and punctuation marks. The score for a given transcription output represents the likelihood that the corresponding transcription output is the current transcription fragment at that time step in the output sequence that forms the transcription of the utterance.

The speech recognition system processes the transcription output score 316 for each time step to determine a transcription output sequence that represents the transcription of the utterance. For example, for each time step, the speech recognition system may select the transcription output with the highest score from the set of transcription output scores to determine the transcription output sequence.
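The control flow of the attention-based decoder 108 can be mimicked with the sketch below, which uses a simplified content-based attention step and randomly initialized stand-ins for the trained parameters of the context vector generator 304 and the decoder neural network 306; it is meant to show how the context vector, the decoder state, and the output scores interact and how the state is carried across partitions, not to reproduce the trained networks (which, e.g., also condition on the previous attention weights 310).

```python
import numpy as np

rng = np.random.default_rng(0)
D_ENC, D_DEC, V = 8, 16, 5      # encoder dim, decoder dim, vocabulary size

# Random stand-ins for trained parameters (illustrative only).
W_att = rng.standard_normal((D_DEC, D_ENC))
W_in  = rng.standard_normal((D_DEC, D_ENC + V))
W_rec = rng.standard_normal((D_DEC, D_DEC))
W_out = rng.standard_normal((V, D_DEC))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(partition, state, prev_out):
    """One triggered-attention step: attend over the submitted partition of
    encoder states, update the decoder state, emit one set of output scores."""
    alpha = softmax(partition @ (W_att.T @ state))   # attention weights
    context = alpha @ partition                      # context vector 314
    inp = np.concatenate([context, np.eye(V)[prev_out]])
    state = np.tanh(W_in @ inp + W_rec @ state)      # hidden decoder state 312
    scores = softmax(W_out @ state)                  # transcription scores 316
    return int(scores.argmax()), state

# The decoder state is carried over from partition to partition and is reset
# only when the end of the utterance is detected.
encoder_states = rng.standard_normal((30, D_ENC))
state, prev_out, hypothesis = np.zeros(D_DEC), 0, []
for partition in (encoder_states[:13], encoder_states[:18], encoder_states[:25]):
    prev_out, state = decode_step(partition, state, prev_out)
    hypothesis.append(prev_out)
print(hypothesis)   # indices of the highest-scoring transcription outputs
```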

FIG. 4 illustrates a block diagram of a speech recognition system 400 according to some embodiments. The speech recognition system 400 may have multiple interfaces that connect the system 400 with other systems and devices. The network interface controller 450 is adapted to connect the system 400 via the bus 406 to a network 490 that connects the speech recognition system 400 with the sensing devices. For example, the speech recognition system 400 includes an audio interface 470 configured to accept input from an acoustic input device 475, such as a microphone. Through the input audio interface 470, the system 400 can accept acoustic signals representing at least a portion of a speech utterance.

Additionally or alternatively, the speech recognition system 400 may receive acoustic signals from various other types of input interfaces. Examples of input interfaces include a Network Interface Controller (NIC)450 configured to accept the acoustic sequence 495 over a network 490, which network 490 may be one or a combination of a wired network and a wireless network. Additionally or alternatively, the system 400 may include a human-machine interface 410. A human machine interface 410 within the system 400 connects the system to a keyboard 411 and a pointing device 412, where the pointing device 412 may include a mouse, trackball, touchpad, joystick, pointing stick, stylus or touch screen, or the like.

The speech recognition system 400 includes an output interface 460 configured to output a transcription output of the system 400. For example, output interface 460 may display the transcription output on display device 465, store the transcription output in a storage medium, and/or transmit the transcription output over a network. Examples of display device 465 include a computer monitor, camera, television, projector, mobile device, or the like. The system 400 may also be connected to an application interface 480, the application interface 480 being adapted to connect the system to external devices 485 to perform various tasks.

The system 400 includes a processor 420 configured to execute stored instructions 430, and a memory 440 storing instructions executable by the processor. Processor 420 may be a single core processor, a multi-core processor, a computing cluster, or any number of other configurations. Memory 440 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory system. Processor 420 may be connected to one or more input and output devices through bus 406.

According to some embodiments, instructions 430 may implement a method for end-to-end speech recognition. To this end, the computer memory 440 stores an encoder 104 trained to convert an input acoustic signal into a sequence of encoder states, an alignment decoder 120 trained to determine the position of the encoder states in the input sequence of encoder states that encode the transcription output, and an attention-based decoder 108 trained to determine the transcription output for each input subsequence of encoder states. In some implementations, the output of the attention-based decoder 108 is a transcription output of the system 400. In some other implementations, the outputs of the attention-based decoder 108 and the alignment decoder 120 are the transcription outputs of the system 400.

Upon accepting an acoustic sequence representing at least a portion of a speech utterance, the processor 420 is configured to submit the received acoustic sequence to the encoder network 104 to produce an encoder state sequence, submit the encoder state sequence produced by the encoder 104 to the alignment decoder 120 to identify a location of an encoder state in the encoder state sequence that encodes the transcription output, execute the partitioning module 130 to partition the encoder state sequence into partition sets based on the identified location of the encoder state, and sequentially submit the partition sets to the attention-based decoder 108 to generate the transcription output for each submitted partition.
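Putting these steps together, the control flow executed by the processor 420 can be sketched as follows; the lambdas are toy stand-ins wired up only to show the data flow between the stored modules, not real model behavior.

```python
def recognize(acoustic_signal, encoder, alignment_decoder, partition_fn,
              attention_decoder, output_fn):
    """Schematic control flow: encode, locate trigger positions, partition,
    then decode the partitions sequentially and emit each transcription output."""
    encoder_states = encoder(acoustic_signal)              # module 104
    positions = alignment_decoder(encoder_states)          # module 120
    partitions = partition_fn(encoder_states, positions)   # module 130
    state = None                                           # decoder 108 state
    for partition in partitions:
        transcription_output, state = attention_decoder(partition, state)
        output_fn(transcription_output)                    # output interface 460

recognize(
    acoustic_signal=list(range(100)),
    encoder=lambda x: x[::4],                              # 4x subsampled states
    alignment_decoder=lambda h: [3, 9, 17],                # trigger positions
    partition_fn=lambda h, pos: [h[:p + 1] for p in pos],  # look-ahead of 0
    attention_decoder=lambda part, st: (f"token@{len(part)}", st),
    output_fn=print,
)
```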

An output interface (e.g., interface 460) outputs the transcription output. For example, in one embodiment, the output interface is configured to output each transcription output separately. For example, if the transcription output represents a character, the output interface outputs character by character. Similarly, if the transcription output represents a word, the output interface outputs word by word. Additionally or alternatively, in one embodiment, the output interface is configured to accumulate the transcription output set to form words and output each word in the speech utterance separately. For example, the attention-based decoder 108 may be configured to detect end-of-word characters such that the output interface outputs an accumulated transcription output upon receipt of the end-of-word characters.
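A minimal sketch of this word-accumulation behavior of the output interface, assuming a space character acts as the end-of-word character, is shown below.

```python
def emit_words(transcription_outputs, end_of_word=" "):
    """Accumulate character-level transcription outputs and release one word
    each time an end-of-word character is produced."""
    word = []
    for ch in transcription_outputs:
        if ch == end_of_word:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(ch)
    if word:
        yield "".join(word)

print(list(emit_words("the dog barks")))  # ['the', 'dog', 'barks']
```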

In some implementations, the attention-based decoder 108 is configured to process the different partitions without resetting an internal state of the attention-based network, wherein the processor is configured to reset the internal state of the attention-based network upon determining that the speech utterance ended. To this end, in some implementations, the memory 440 also stores an utterance end module 436 configured to detect an end of a speech utterance. Different embodiments use different techniques to implement module 436. For example, some embodiments use a Speech Activity Detection (SAD) module to detect the end of an utterance or a combination of SAD and auxiliary endpoint detection systems.

In some implementations, the attention-based ASR system 100 is configured to recognize in a streaming/online manner. For example, the memory 440 may include a gate 434 to divide the speech utterance into a set of acoustic sequences. For example, in some embodiments, the gate is implemented as part of the audio interface 470, which divides the speech during its conversion. The length of each acoustic sequence in the set may be the same or may vary based on the characteristics of the pronounced speech. In this way, the ASR system 100 transcribes the input acoustic sequence in a streaming manner. In some implementations, the gate divides the speech utterance into acoustic signal blocks such that the input interface receives one acoustic signal block at a time. For example, the gate may be implemented by a sound card, and the block processing may be defined by the clock of the sound card, such that audio received from the sound card is sampled block by block.

FIG. 5 illustrates a block diagram of a method performed by the ASR system upon receiving a subsequent acoustic signal representing a subsequent portion of a speech utterance, according to one embodiment. The method submits 510 the subsequent acoustic signal to the encoder 104 to produce a subsequent sequence 515 of encoder states. The method submits 520 the subsequent sequence of encoder states 515 to the alignment decoder 120 to identify the positions 525 of the encoder states in the subsequent sequence that encode transcription outputs. The method concatenates 530 the sequence of encoder states obtained by processing the previous acoustic signal with the subsequent sequence of encoder states 515 to produce a concatenated sequence of encoder states 535. The method partitions 540 the concatenated sequence of encoder states 535 based on the identified positions 525 of the encoder states to update the partition sequence 545. In this way, the incoming acoustic signals are spliced together to achieve seamless online transcription.
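The concatenate-and-repartition behavior of FIG. 5 can be sketched as a small stateful helper; the look-ahead partitioning rule of FIG. 2B is reused here, and the class and parameter names are illustrative.

```python
import numpy as np

class StreamingPartitioner:
    """Keep a growing encoder-state buffer across acoustic-signal blocks and
    produce new partitions whenever new trigger positions are identified."""

    def __init__(self, lookahead=5):
        self.buffer = None
        self.lookahead = lookahead

    def update(self, new_encoder_states, new_trigger_positions):
        # Concatenate the previous sequence with the subsequent sequence.
        if self.buffer is None:
            self.buffer = new_encoder_states
        else:
            offset = len(self.buffer)
            self.buffer = np.concatenate([self.buffer, new_encoder_states])
            new_trigger_positions = [offset + p for p in new_trigger_positions]
        # One partition per newly identified encoder state.
        T = len(self.buffer)
        return [self.buffer[:min(p + self.lookahead + 1, T)]
                for p in new_trigger_positions]

sp = StreamingPartitioner()
parts = sp.update(np.random.randn(20, 8), [4, 12])    # first block
parts += sp.update(np.random.randn(20, 8), [3, 15])   # subsequent block
print([len(p) for p in parts])  # [10, 18, 29, 40] encoder states per partition
```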

In some implementations of the speech recognition system, the encoder, the alignment decoder, and the attention-based decoder are jointly trained neural networks. These embodiments utilize joint training in a cooperative manner of neural networks to improve the accuracy of speech recognition.

FIG. 6 illustrates a block diagram of a triggered attention neural network 600 of an end-to-end speech recognition system according to one embodiment. In this embodiment, the encoder, the alignment decoder, and the attention-based decoder are implemented as neural networks. For example, the alignment decoder 120 is a neural network based on connectionist temporal classification (CTC). To this end, the triggered attention neural network 600 includes an encoder network module 602, encoder network parameters 603, an attention decoder module 604, decoder network parameters 605, a partitioning module 606, a CTC module 608, and CTC network parameters 609. The encoder network parameters 603, the decoder network parameters 605, and the CTC network parameters 609 are stored in a storage device to provide parameters to the corresponding modules 602, 604, and 608, respectively. An acoustic feature sequence 601 is extracted from audio waveform data and may be stored in a storage device and provided to the encoder network module 602. The audio waveform data may be obtained using a digital signal processing module (not shown) that receives and processes speech sounds in the audio data via an input device.

The encoder network module 602 includes an encoder network that converts the acoustic feature sequence 601 into a sequence of encoder feature vectors, reading its parameters from the encoder network parameters 603. The CTC module 608 receives the hidden vector sequence from the encoder network module 602 and computes the CTC-based posterior probability distribution of the label sequence using the CTC network parameters 609 and a dynamic programming technique. After the computation, the CTC module 608 provides the locations of the most probable labels to the partitioning module 606.

The attention decoder network module 604 includes a decoder network. The attention decoder network module 604 receives partitions from the partitioning module 606, each partition comprising a portion of the sequence of encoder feature vectors, and then computes an attention-based posterior probability distribution of the label using a decoder network that reads its parameters from the decoder network parameters 605.

End-to-end speech recognition is generally defined as the problem of finding the most likely label sequence Ŷ given a sequence of input acoustic features X, that is,

Ŷ = argmax_{Y ∈ U*} p(Y|X),    (1)

where U* denotes the set of possible label sequences given a set of predefined labels U.

In end-to-end speech recognition, p(Y|X) is computed by a pre-trained neural network without a pronunciation dictionary and without a heavy WFST-based graph search. In related-art attention-based end-to-end speech recognition, the neural network is composed of an encoder network and a decoder network.

The encoder module 602 includes an encoder network that converts the acoustic feature sequence X = x_1, ..., x_T into a hidden vector sequence H = h_1, ..., h_T according to

H = Encoder(X).    (2)

The function Encoder(X) may include, among other things, a stack of one or more recurrent neural networks (RNNs) and convolutional neural networks (CNNs). An RNN may be implemented as a long short-term memory (LSTM) network, which has an input gate, a forget gate, an output gate, and a memory cell in each hidden unit. Another RNN may be a bidirectional RNN (BRNN) or a bidirectional LSTM (BLSTM). A BLSTM is a pair of LSTM RNNs, one a forward LSTM and the other a backward LSTM. A hidden vector of the BLSTM is obtained as a concatenation of the hidden vectors of the forward and backward LSTMs.

Using the forward LSTM, the forward t-th hidden vector h_t^F is computed as

i_t^F = σ(W_{xi}^F x_t + W_{hi}^F h_{t-1}^F + b_i^F)    (3)
f_t^F = σ(W_{xf}^F x_t + W_{hf}^F h_{t-1}^F + b_f^F)    (4)
o_t^F = σ(W_{xo}^F x_t + W_{ho}^F h_{t-1}^F + b_o^F)    (5)
c_t^F = f_t^F ⊙ c_{t-1}^F + i_t^F ⊙ tanh(W_{xc}^F x_t + W_{hc}^F h_{t-1}^F + b_c^F)    (6)
h_t^F = o_t^F ⊙ tanh(c_t^F),    (7)

where σ(·) is the element-wise sigmoid function, tanh(·) is the element-wise hyperbolic tangent function, and i_t^F, f_t^F, o_t^F, and c_t^F are the input gate, forget gate, output gate, and cell activation vectors for x_t, respectively. ⊙ denotes element-wise multiplication between vectors. The weight matrices W_{zz'}^F and the bias vectors b_z^F are the parameters of the LSTM, identified by the subscripts z ∈ {x, h, i, f, o, c}. For example, W_{hi}^F is the hidden-to-input-gate matrix and W_{xo}^F is the input-to-output-gate matrix. The hidden vector h_t^F is obtained recursively from the input vector x_t and the previous hidden vector h_{t-1}^F, where h_0^F is assumed to be a zero vector.

Using the backward LSTM, the backward t-th hidden vector h_t^B is computed as

i_t^B = σ(W_{xi}^B x_t + W_{hi}^B h_{t+1}^B + b_i^B)    (8)
f_t^B = σ(W_{xf}^B x_t + W_{hf}^B h_{t+1}^B + b_f^B)    (9)
o_t^B = σ(W_{xo}^B x_t + W_{ho}^B h_{t+1}^B + b_o^B)    (10)
c_t^B = f_t^B ⊙ c_{t+1}^B + i_t^B ⊙ tanh(W_{xc}^B x_t + W_{hc}^B h_{t+1}^B + b_c^B)    (11)
h_t^B = o_t^B ⊙ tanh(c_t^B),    (12)

where i_t^B, f_t^B, o_t^B, and c_t^B are the input gate, forget gate, output gate, and cell activation vectors for x_t, respectively. The weight matrices W_{zz'}^B and the bias vectors b_z^B are the parameters of the LSTM, identified by the subscripts in the same manner as for the forward LSTM. The hidden vector h_t^B is obtained recursively from the input vector x_t and the succeeding hidden vector h_{t+1}^B, where h_{T+1}^B is assumed to be a zero vector.

The hidden vector of the BLSTM is obtained by concatenating the forward and backward hidden vectors as

h_t = [(h_t^F)^T, (h_t^B)^T]^T,    (13)

where T denotes the transpose operation on vectors, assuming all vectors are column vectors. W_{zz'}^F, b_z^F, W_{zz'}^B, and b_z^B are considered the parameters of the BLSTM.

To obtain better hidden vectors, some implementations stack multiple BLSTMs by providing the hidden vectors of a first BLSTM to a second BLSTM, the hidden vectors of the second BLSTM to a third BLSTM, and so on. If h_{t'} is a hidden vector obtained by one BLSTM, we assume x_t = h_{t'} when it is provided to another BLSTM. To reduce the amount of computation, only every second hidden vector of one BLSTM may be provided to the next BLSTM. In this case, the length of the output hidden vector sequence becomes half the length of the input acoustic feature sequence.
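A minimal PyTorch sketch of such a stacked BLSTM encoder with frame subsampling is shown below; the class name and layer sizes are illustrative assumptions rather than the exact encoder of the embodiments.

```python
import torch
import torch.nn as nn

class StackedBLSTMEncoder(nn.Module):
    """Stacked BLSTMs; every second hidden vector is passed to the next layer."""

    def __init__(self, feat_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_dim, hidden_dim, batch_first=True,
                                       bidirectional=True))
            in_dim = 2 * hidden_dim       # forward and backward vectors concatenated

    def forward(self, x):                 # x: (batch, T, feat_dim)
        for i, lstm in enumerate(self.layers):
            x, _ = lstm(x)                # (batch, T_i, 2 * hidden_dim)
            if i < len(self.layers) - 1:
                x = x[:, ::2, :]          # keep every second hidden vector
        return x                          # with two layers, length is about T / 2
```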

The attention decoder module 604 includes a decoder network for computing the label sequence probability p(Y|X) using the hidden vector sequence H. Assume Y is a label sequence of length L, Y = y_1, y_2, ..., y_L. To compute p(Y|X) efficiently, the probability is factorized by the probability chain rule as

p(Y|X) = Π_{l=1}^{L} p(y_l | y_1, ..., y_{l-1}, X),    (14)

and each label probability p(y_l | y_1, ..., y_{l-1}, X) is obtained from a probability distribution over labels, which is estimated using the decoder network as

p(y | y_1, ..., y_{l-1}, X) = Decoder(r_l, q_{l-1}),    (15)

where y is a random variable representing a label, r_l is called a content vector, which contains the content information of H, and q_{l-1} is a decoder state vector, which contains the contextual information of the previous labels y_1, ..., y_{l-1} and the previous content vectors r_0, ..., r_{l-1}. Accordingly, the label probability is obtained as the probability of y = y_l given the context, i.e.,

p(y_l | y_1, ..., y_{l-1}, X) = p(y = y_l | y_1, ..., y_{l-1}, X).    (16)

The content vector r_l is usually given as a weighted sum of the hidden vectors of the encoder network, i.e.,

r_l = Σ_t a_{lt} h_t,    (17)

where a_{lt} is called an attention weight and satisfies Σ_t a_{lt} = 1. The attention weights can be computed using q_{l-1} and H as

e_{lt} = w^T tanh(W q_{l-1} + V h_t + U f_{lt} + b)    (18)

f_l = F * a_{l-1}    (19)

a_{lt} = exp(e_{lt}) / Σ_{t'} exp(e_{lt'}),    (20)

where W, V, F, and U are matrices, and w and b are vectors, which are trainable parameters of the decoder network. e_{lt} is a matching score between the (l-1)-th state vector q_{l-1} and the t-th hidden vector h_t, which is normalized in equation (20) to form the time-aligned distribution a_l = {a_{lt} | t = 1, ..., T}. a_{l-1} = {a_{(l-1)t} | t = 1, ..., T} denotes the previous alignment distribution used to predict the previous label y_{l-1}, and f_l = {f_{lt} | t = 1, ..., T} is the convolutional feature of a_{l-1}, which is used to reflect the previous alignment in the current alignment. "*" denotes a convolution operation.
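The following PyTorch sketch illustrates the location-aware attention computation of equations (17)-(20); the class name, projection sizes, and convolution kernel width are assumptions chosen for readability, not the patented configuration.

```python
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim, conv_channels=10, conv_kernel=31):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=True)          # W q_{l-1} + b
        self.V = nn.Linear(enc_dim, att_dim, bias=False)         # V h_t
        self.U = nn.Linear(conv_channels, att_dim, bias=False)   # U f_{lt}
        self.F = nn.Conv1d(1, conv_channels, conv_kernel,
                           padding=conv_kernel // 2, bias=False)  # f_l = F * a_{l-1}
        self.w = nn.Linear(att_dim, 1, bias=False)               # w^T tanh(...)

    def forward(self, H, q_prev, a_prev):
        # H: (T, enc_dim), q_prev: (dec_dim,), a_prev: (T,)
        f = self.F(a_prev.view(1, 1, -1))                  # (1, channels, ~T)
        f = f.squeeze(0).transpose(0, 1)[:H.size(0)]       # (T, channels), eq. (19)
        e = self.w(torch.tanh(self.W(q_prev) + self.V(H) + self.U(f))).squeeze(-1)  # eq. (18)
        a = torch.softmax(e, dim=0)                        # eq. (20)
        r = (a.unsqueeze(-1) * H).sum(dim=0)               # content vector, eq. (17)
        return r, a
```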

The label probability distribution is obtained from the state vector q_{l-1} and the content vector r_l as

Decoder(r_l, q_{l-1}) = softmax(W_{qy} q_{l-1} + W_{ry} r_l + b_y),    (21)

where W_{qy} and W_{ry} are matrices and b_y is a vector, which are trainable parameters of the decoder network. The softmax() function is computed as

softmax(v) = exp(v) / Σ_{j=1}^{K} exp(v[j])    (22)

for a K-dimensional vector v, where v[i] denotes the i-th element of v and the exponential is applied element-wise.

The decoder state vector q_{l-1} is then updated to q_l using an LSTM as

i_l^D = σ(W_{xi}^D x_l^D + W_{hi}^D q_{l-1} + b_i^D)    (23)
f_l^D = σ(W_{xf}^D x_l^D + W_{hf}^D q_{l-1} + b_f^D)    (24)
o_l^D = σ(W_{xo}^D x_l^D + W_{ho}^D q_{l-1} + b_o^D)    (25)
c_l^D = f_l^D ⊙ c_{l-1}^D + i_l^D ⊙ tanh(W_{xc}^D x_l^D + W_{hc}^D q_{l-1} + b_c^D)    (26)
q_l = o_l^D ⊙ tanh(c_l^D),    (27)

where i_l^D, f_l^D, o_l^D, and c_l^D are the input gate, forget gate, output gate, and cell activation vectors for the input vector x_l^D, respectively. The weight matrices W_{zz'}^D and the bias vectors b_z^D are the parameters of the LSTM, identified by the subscripts in the same manner as for the forward LSTM. The state vector q_l is obtained recursively from the input vector x_l^D and the previous state vector q_{l-1}, where q_{-1} = 0, y_0 = <sos>, and a_0 = 1/T are assumed in order to compute q_0. For the decoder network, the input vector x_l^D is given as the concatenation of the label y_l and the content vector r_l, which can be obtained as x_l^D = [Embed(y_l)^T, r_l^T]^T, where Embed(·) denotes label embedding, which converts a label into a fixed-dimensional vector.
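A minimal PyTorch sketch of one decoder step following equations (15) and (21)-(27) is given below: the label distribution is computed from q_{l-1} and r_l, and the decoder state is then updated from the embedded label and the content vector. The class and method names are illustrative assumptions, and the tensors are assumed to carry a batch dimension.

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, num_labels, emb_dim, enc_dim, dec_dim):
        super().__init__()
        self.embed = nn.Embedding(num_labels, emb_dim)          # Embed(.)
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, dec_dim)     # updates q_{l-1} -> q_l
        self.W_qy = nn.Linear(dec_dim, num_labels, bias=False)
        self.W_ry = nn.Linear(enc_dim, num_labels, bias=True)   # its bias plays the role of b_y

    def label_distribution(self, r, q_prev):
        # Decoder(r_l, q_{l-1}) = softmax(W_qy q_{l-1} + W_ry r_l + b_y), eq. (21)
        return torch.softmax(self.W_qy(q_prev) + self.W_ry(r), dim=-1)

    def update_state(self, y_l, r, q_prev, c_prev):
        # x_l is the concatenation of Embed(y_l) and r_l; the LSTM cell yields q_l
        x = torch.cat([self.embed(y_l), r], dim=-1)
        return self.lstm(x, (q_prev, c_prev))                   # eqs. (23)-(27)
```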

In attention-based speech recognition, estimating appropriate attention weights is very important for predicting correct labels, since the content vector r_l strongly depends on the alignment distribution a_l, as shown in equation (17). In speech recognition, the content vector represents the encoder hidden vectors around the peak of the alignment distribution, and this acoustic information is the most important clue for predicting the label y_l. However, the attention mechanism often provides irregular alignment distributions, because there is no explicit constraint that the peak of the distribution proceed monotonically over time as y_l is predicted incrementally. In speech recognition, the alignment between the input sequence and the output sequence should generally be monotonic. Although the convolutional feature f_{lt} mitigates the generation of irregular alignments, it cannot guarantee that they are avoided.

Given the hidden vector sequence H, the CTC module 608 computes the CTC forward probability of the label sequence Y. Note that the CTC formulation uses an L-length label sequence Y = (y_1, ..., y_L), where y_l ∈ U and U is a set of distinct labels. By introducing a frame-by-frame label sequence Z = (z_1, ..., z_T) with z_t ∈ U ∪ {ε}, where ε denotes an additional blank label, and by using the probability chain rule and a conditional independence assumption, the posterior distribution p(Y|X) is factorized as

p(Y|X) ≈ Σ_Z Π_{t=1}^{T} p(z_t | z_{t-1}, Y) p(z_t | X),    (28)

where p(z_t | z_{t-1}, Y) is considered a label transition probability including the blank label, and p(z_t | X) is the frame-by-frame posterior distribution conditioned on the input sequence X and modeled using bidirectional long short-term memory (BLSTM):

p(z_t | X) = softmax(W_{hy}^{CTC} h_t + b_y^{CTC}),    (29)

where h_t is obtained with the encoder network, and W_{hy}^{CTC} is a matrix and b_y^{CTC} a vector, which are trainable parameters of CTC. Although equation (28) requires a summation over all possible Z, it can be computed efficiently by using the forward algorithm and dynamic programming.

The forward algorithm for CTC is performed as follows. Some implementations use an extended label sequence Y' = y'_1, y'_2, ..., y'_{2L+1} = ε, y_1, ε, y_2, ..., ε, y_L, ε of length 2L + 1, in which a blank label ε is inserted between each pair of adjacent labels. Let α_t(s) be the forward probability, which represents the posterior probability of the label sequence y_1, ..., y_l over time frames 1, ..., t, where s denotes the position in the extended label sequence Y'.

For initialization, we set

α_1(1) = p(z_1 = ε | X)    (30)
α_1(2) = p(z_1 = y_1 | X).    (31)

For t = 2 to T, α_t(s) is computed recursively as

α_t(s) = ᾱ_t(s) p(z_t = y'_s | X)    if y'_s = ε or y'_s = y'_{s-2}    (32)
α_t(s) = (ᾱ_t(s) + α_{t-1}(s-2)) p(z_t = y'_s | X)    otherwise,    (33)

where

ᾱ_t(s) = α_{t-1}(s) + α_{t-1}(s-1).    (34)

Finally, the CTC-based probability of the label sequence is obtained as

p(Y|X) = α_T(2L+1) + α_T(2L).    (35)

The frame-by-frame label sequence Z represents an alignment between the input acoustic feature sequence X and the output label sequence Y. When the forward probability is computed, the recursion of equations (32) and (33) forces Z to be monotonic and does not allow loops or large jumps of s in the alignment Z, because obtaining α_t(s) only considers at most α_{t-1}(s), α_{t-1}(s-1), and α_{t-1}(s-2). This means that, as each time frame is processed, the label either changes from the previous label or blank, or remains the same. This constraint plays the role of the transition probability p(z_t | z_{t-1}, Y), which forces the alignment to be monotonic. Hence, p(Y|X) is zero or a very small value when it is computed based on an irregular (non-monotonic) alignment. The alignment between the input acoustic feature sequence X and the output label sequence Y is used by the partitioning module 606 to control the operation of the attention-based neural network 604.
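The CTC forward algorithm of equations (30)-(35) can be sketched compactly as follows. The sketch works with raw probabilities for clarity, whereas practical implementations typically operate in the log domain for numerical stability; frame_probs[t, k] is assumed to hold p(z_t = k | X) with index 0 reserved for the blank label, and the function name is illustrative.

```python
import numpy as np

def ctc_forward_probability(frame_probs, labels, blank=0):
    """frame_probs: (T, num_labels) array of p(z_t | X); labels: reference sequence."""
    T = frame_probs.shape[0]
    ext = [blank]
    for y in labels:                          # extended sequence: eps, y1, eps, ..., yL, eps
        ext += [y, blank]
    S = len(ext)                              # 2L + 1
    alpha = np.zeros((T, S))
    alpha[0, 0] = frame_probs[0, ext[0]]      # eq. (30)
    if S > 1:
        alpha[0, 1] = frame_probs[0, ext[1]]  # eq. (31)
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]      # alpha_bar of eq. (34)
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]      # extra term of eq. (33)
            alpha[t, s] = a * frame_probs[t, ext[s]]  # eqs. (32), (33)
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]  # eq. (35)
```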

Some embodiments are based on the recognition that the accuracy of recognition can be further improved by combining the decoder outputs of the CTC and attention-based decoders. For example, in one implementation of the end-to-end speech recognition 600, the CTC forward probability of equation (35) is combined with the attention-based probability of equation (14) to obtain a more accurate label sequence probability.

Figure 7 is a schematic diagram illustrating a combined neural network according to some embodiments. The combined neural network includes an encoder network module 602, an attention decoder network module 604, and a CTC module 608. Each arrow represents data transfer with or without transformation, and each square or circle node represents a vector or a predicted label. The acoustic feature sequence X = x_1, ..., x_T is provided to the encoder network module 602, in which two BLSTMs are stacked and every second hidden vector of the first BLSTM is provided to the second BLSTM. The output of the encoder module 602 is the hidden vector sequence H = h'_1, h'_2, ..., h'_{T'}, where T' = T/2. H is then fed to the CTC module 608 and the decoder network module 604. The CTC-based and attention-based sequence probabilities are computed with the CTC module 608 and the decoder network module 604, respectively, and combined to obtain the label sequence probability.

In some embodiments, the probabilities may be combined in the log domain as

log p(Y|X) = λ log p_ctc(Y|X) + (1 − λ) log p_att(Y|X),    (36)

where p_ctc(Y|X) is the CTC-based label sequence probability of equation (35) and p_att(Y|X) is the attention-based label sequence probability of equation (14). λ is a scaling factor that balances the CTC-based and attention-based probabilities.
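Equation (36) amounts to a single weighted sum in the log domain; a one-line sketch, with the CTC weight λ treated as a tuning assumption, is:

```python
def combined_log_prob(log_p_ctc, log_p_att, lam=0.3):
    # equation (36): log p(Y|X) = lam * log p_ctc(Y|X) + (1 - lam) * log p_att(Y|X)
    return lam * log_p_ctc + (1.0 - lam) * log_p_att
```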

Some embodiments perform a label sequence search to find the most likely label sequence Ŷ according to the label sequence probability distribution p(Y|X), i.e.,

Ŷ = argmax_{Y ∈ U*} p(Y|X).    (37)

In some attention-based speech recognition, p(Y|X) is p_att(Y|X). In some embodiments, however, p(Y|X) is computed by the combination of label sequence probabilities of equation (36), i.e., Ŷ is found according to

Ŷ = argmax_{Y ∈ U*} {λ log p_ctc(Y|X) + (1 − λ) log p_att(Y|X)}.    (38)

Some embodiments are based on the recognition that it is difficult to enumerate all possible label sequences Y and compute p(Y|X) for each of them, because the number of possible label sequences grows exponentially with the length of the sequence. Therefore, a beam search technique is typically used to find Ŷ: a limited number of hypotheses whose scores are higher than those of the other hypotheses is kept and extended, and the best label sequence hypothesis is finally selected from the complete hypotheses that have reached the end of the utterance.
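Below is a greatly simplified beam-search sketch in the spirit of equation (38): hypotheses are extended label by label with an attention decoder step, and the CTC and attention scores are combined only for completed hypotheses, whereas practical joint decoders also score CTC prefixes during the search. The helper functions att_step and ctc_log_prob are hypothetical stand-ins for the modules described above.

```python
def beam_search(att_step, ctc_log_prob, sos, eos, beam=5, max_len=100, lam=0.3):
    """att_step(seq) yields (label, log_p_att) extensions; ctc_log_prob(seq) scores a sequence."""
    hyps = [([sos], 0.0)]                      # (label sequence, accumulated attention log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in hyps:
            for label, logp in att_step(seq):  # candidate extensions and their log p_att
                if label == eos:
                    finished.append((seq + [label], score + logp))
                else:
                    candidates.append((seq + [label], score + logp))
        if not candidates:
            break
        hyps = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam]
    # combine attention and CTC scores as in equation (36) for the complete hypotheses
    scored = [(seq, lam * ctc_log_prob(seq) + (1 - lam) * att) for seq, att in finished]
    return max(scored, key=lambda h: h[1]) if scored else None
```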

FIG. 8 illustrates a performance comparison graph of speech recognition according to some embodiments. The character error rate (CER) of an end-to-end ASR system according to some embodiments is shown to evaluate the impact of the look-ahead parameter 140 on three different attention mechanisms used to compute the context vector 314, namely dot-product-based attention 810, content-based attention 820, and location-aware attention 830. However, end-to-end ASR systems according to some embodiments are not limited to these three attention mechanisms, which are merely examples. The results for dot-product-based attention 810, content-based attention 820, and location-aware attention 830 indicate that different look-ahead parameter settings may be advantageous depending on the type of attention mechanism. For example, the location-aware attention type achieves a lower CER for larger look-ahead values, whereas dot-product-based attention and content-based attention tend to prefer smaller look-ahead values for low error rates, which also reduces the processing latency.

FIG. 9 is a block diagram illustrating some components that may be used in various configurations for implementing the systems and methods, according to some embodiments. For example, the component 900 may include a hardware processor 11 in communication with a sensor 2 or a plurality of sensors (such as acoustic sensors) that collect data from the environment 1, including the acoustic signals 8. Further, the sensor 2 may convert an acoustic input into an acoustic signal. The hardware processor 11 is in communication with a computer storage memory (i.e., memory 9) such that the memory 9 stores data, including algorithms, instructions, and other data, that may be implemented by the hardware processor 11.

Alternatively, the hardware processor 11 may be connected to a network 7, which is in communication with the data source 3, the computer device 4, the mobile telephone device 5 and the storage device 6. Also optionally, the hardware processor 11 may be connected to a network enabled server 13 connected to the client device 15. The hardware processor 11 may optionally be connected to an external storage device 17 and/or a transmitter 19. Further, the text of the speaker may be output according to the particular user intended use 21, e.g., some types of user uses may include displaying the text on one or more display devices (such as a monitor or screen) or inputting the text of the speaker into a computer-related device for further analysis, and so forth.

It is contemplated that the hardware processor 11 may include one or more hardware processors, where a processor may be internal or external, depending on the requirements of a particular application. Of course, other components may be combined with the component 900, including an output interface and a transceiver, among other devices.

By way of non-limiting example, the network 7 may include one or more Local Area Networks (LANs) and/or Wide Area Networks (WANs). The network environment may be similar to, among other things, enterprise-wide computer networks, intranets, and the internet. There may be any number of client devices, storage components, and data sources used with the component 900 in view of all the noted components. Each may comprise a single device or multiple devices cooperating in a distributed environment. Further, the component 900 may include one or more data sources 3. The data source 3 comprises data resources for training the speech recognition network. The data provided by the data source 3 may include tagged data and untagged data, such as transcribed data and untranscribed data. For example, in one embodiment, the data includes one or more sounds and may also include corresponding transcription information or tags that may be used to initialize the speech recognition network.

Furthermore, the unlabeled data in the data source 3 may be provided by one or more feedback loops. For example, usage data from a spoken search query executed on a search engine may be provided as untranscribed data. Other examples of data sources may include, for example, but are not limited to, various spoken audio or image sources including streaming voice or video, Web queries, mobile device camera or audio information, Web camera feeds, smart glasses and smart-watch feeds, customer attention systems, security camera feeds, Web documents, directories, user feeds, SMS logs, instant message logs, spoken word transcriptions, game system user interactions such as voice commands or captured images (e.g., depth camera images), tweets, chat or video call recordings, or social networking media. The particular data source 3 used may be determined based on the application, including whether the data is of a particular class (e.g., data related only to a particular type of sound, including, for example, machine systems, entertainment systems) or generic in nature (non-class specific).

The component 900 may include or be connected to third party devices 4, 5, and the third party devices 4, 5 may include any type of computing device for which it may be of interest to have an automatic speech recognition (ASR) system on the computing device. For example, the third party device may comprise a computer device 4 or a mobile device 5. It is contemplated that the user device may be embodied as a personal digital assistant (PDA), a mobile device such as a smartphone, a smart watch, smart glasses (or other wearable smart device), an augmented reality headset, or a virtual reality headset. Further, the user device may be a laptop computer, a tablet computer, a remote control, an entertainment system, an in-vehicle computer system, an embedded system controller, an appliance, a home computer system, a security system, a consumer electronic device, or another similar electronic device. In one embodiment, the client device is capable of receiving input data, such as audio and image information usable by the ASR system described herein running on the device. For example, the third party device may have a microphone or line-in for receiving audio information, a camera for receiving video or image information, or a communication component (e.g., Wi-Fi functionality) for receiving such information from another source, such as the internet or the data source 3.

ASR models using speech recognition networks can process input data to determine computer-usable information. For example, a query spoken by a user into a microphone may be processed to determine the content of the query, e.g., whether a question is posed. Example third party devices 4, 5 are optionally included in the component 900 to illustrate an environment in which the deep neural network model may be deployed. Furthermore, some embodiments of the present disclosure may not include third party devices 4, 5. For example, the deep neural network model may be on a server or in a cloud network, system, or similar arrangement.

With respect to the storage 6, the storage 6 may store information including data, computer instructions (e.g., software program instructions, routines, or services), and/or models used in implementations of the techniques described herein. For example, the memory 6 may store data from one or more data sources 3, one or more deep neural network models, information used to generate and train the deep neural network models, and computer-usable information output by the one or more deep neural networks.

This description provides exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing one or more exemplary embodiments. Various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosed subject matter as set forth in the appended claims.

In the following description specific details are given to provide a thorough understanding of the embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements of the disclosed subject matter may be shown in block diagram form as components in order not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Moreover, like reference numbers and designations in the various drawings indicate like elements.

Furthermore, various embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may terminate when its operations are complete, but may have additional steps not discussed or included in the figure. Moreover, not all operations in a process described in detail may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, etc. When a process corresponds to a function, the termination of the function may correspond to a return of the function to the calling function or the main function.

Furthermore, embodiments of the disclosed subject matter may be implemented, at least in part, manually or automatically. Manual or automated implementations may be performed or at least assisted by the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine-readable medium. The processor may perform the necessary tasks.

Furthermore, embodiments of the disclosure and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, or in combinations of one or more of them (including the structures disclosed in this specification and their structural equivalents). Further embodiments of the disclosure may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Still further, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

According to embodiments of the present disclosure, the term "data processing apparatus" may encompass all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (also can be referred to or described as a program, software application, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, that can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that contains other programs or data, such as one or more scripts stored in a markup language document, a single file dedicated to the program in question, or multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. A computer suitable for executing a computer program comprises a central processing unit that may be, for example, based on a general or special purpose microprocessor or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for the fulfillment or execution of instructions and one or more storage devices for the storage of instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices (e.g., magneto-optical disks, or optical disks). However, a computer need not have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game player, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a Universal Serial Bus (USB) flash drive, to name a few.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending and receiving documents to and from the device used by the user; for example, in response to a request received from a web browser, a user is interacted with by sending a web page to the web browser on the user's client device.

Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), such as the Internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Although the present disclosure has been described with reference to certain preferred embodiments, it is to be understood that various other modifications and changes may be made within the spirit and scope of the present disclosure. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the disclosure.
