Method and system for training a multilingual speech recognition network and speech recognition system for performing multilingual speech recognition

Publication No.: 1246809    Publication date: 2020-08-18

Note: This technology, "Method and system for training a multilingual speech recognition network and speech recognition system for performing multilingual speech recognition", was created on 2018-05-31 by Shinji Watanabe, Takaaki Hori, Hiroshi Seki, J. Le Roux, and J. Hershey. Abstract: A method for training a multilingual speech recognition network comprises: providing utterance data sets corresponding to predetermined languages; inserting language identification (ID) tags into the utterance data sets, wherein each utterance data set is tagged with its language ID tag; concatenating the tagged utterance data sets; generating initial network parameters from the utterance data sets; selecting initial network parameters according to a predetermined sequence; and iteratively training an end-to-end network with the series of selected initial network parameters and the concatenated tagged utterance data sets until a training result reaches a threshold.

1. A method of training a multilingual speech recognition network, the method comprising the steps of:

providing an utterance data set corresponding to a predetermined language;

inserting language identification (ID) tags in the utterance data set, wherein each of the utterance data sets is tagged with a corresponding one of the language ID tags;

concatenating the tagged utterance data set;

generating initial network parameters from the utterance data set; and

iteratively training an end-to-end network with a series of the initial network parameters and the concatenated tagged utterance dataset until a training result reaches a threshold.

2. The method of claim 1, wherein each of the utterance data sets includes a pair of an acoustic data set and a truth label corresponding to the acoustic data set.

3. The method of claim 1, wherein the end-to-end network is a language independent model.

4. A method according to claim 3, wherein the language independent model uses a deep BLSTM encoder network.

5. The method of claim 4, wherein the deep BLSTM encoder network has 7 or more layers.

6. The method of claim 1, wherein the language ID tags are assigned to the utterance data sets according to an arrangement rule.

7. The method of claim 6, wherein the arrangement rule causes each of the language ID tags to be added at the head position of the corresponding utterance data set.

8. The method of claim 1, further comprising the steps of:

generating trained network parameters when the training result reaches the threshold.

9. The method of claim 1, wherein the end-to-end network jointly optimizes a series of the initial network parameters and the concatenated tagged utterance data set based on a predetermined method.

10. A speech recognition system for performing multilingual speech recognition, the speech recognition system comprising:

an interface that receives speech sounds;

one or more processors; and

one or more storage devices storing an end-to-end speech recognition network module that has been trained using trained network parameters obtained by the method of claim 1, wherein the end-to-end speech recognition network module comprises instructions that when executed cause the one or more processors to perform operations comprising:

extracting an acoustic feature sequence from audio waveform data converted from the speech sounds using an acoustic feature extraction module;

encoding the acoustic feature sequence into a hidden vector sequence using an encoder network having encoder network parameters;

predicting a first output tag sequence probability by feeding the hidden vector sequence to a decoder network having decoder network parameters;

predicting, by a connectionist temporal classification (CTC) module, a second output tag sequence probability using CTC network parameters and the hidden vector sequence from the encoder network; and

searching for an output tag sequence having a highest sequence probability by combining the first output tag sequence probability and the second output tag sequence probability provided from the decoder network and the CTC module using a tag sequence search module.

11. A multilingual speech recognition system for generating trained network parameters for use in multilingual speech recognition, the multilingual speech recognition system comprising:

one or more processors; and

one or more storage devices storing parameters and program modules comprising instructions executable by the one or more processors that, when executed, cause the one or more processors to perform operations comprising:

providing an utterance data set corresponding to a predetermined language;

inserting language identification (ID) tags in the utterance data set, wherein each of the utterance data sets is tagged with a corresponding one of the language ID tags;

concatenating the tagged utterance data set;

generating initial network parameters from the utterance data set;

selecting the initial network parameters according to a predetermined sequence; and

iteratively training an end-to-end network with a series of selected initial network parameters and the concatenated tagged utterance dataset until a training result reaches a threshold.

12. The system of claim 11, wherein each of the utterance data sets includes a pair of an acoustic data set and a truth label corresponding to the acoustic data set.

13. The system of claim 11, wherein the end-to-end network is a language independent model.

14. The system of claim 13, wherein the language independent model uses a deep BLSTM encoder network.

15. The system of claim 14, wherein the deep BLSTM encoder network has 7 or more layers.

16. The system of claim 11, wherein the language ID tags are assigned to the utterance data sets according to an arrangement rule.

17. The system of claim 16, wherein the arrangement rule causes each of the language ID tags to be added at the head position of the corresponding utterance data set.

18. The system of claim 11, further comprising:

generating trained network parameters when the training result reaches the threshold.

Technical Field

The present invention relates generally to an apparatus and method for multilingual end-to-end speech recognition, and more particularly to a method and system for training a neural network for joint language identification and speech recognition.

Background

End-to-end automatic speech recognition (ASR) has recently demonstrated its effectiveness by reaching the state-of-the-art performance of conventional hybrid ASR systems while surpassing them in ease of development. Conventional ASR systems require language-dependent resources such as pronunciation dictionaries and word segmentation, which are incorporated into models that use phonemes as an intermediate representation. These resources are developed by hand and therefore have two drawbacks: first, they may be error-prone or otherwise sub-optimal; second, they greatly increase the effort required to develop ASR systems, especially for new languages. The use of language-dependent resources thus makes the development of multilingual recognition systems particularly complex. In contrast, an end-to-end ASR system converts the input speech feature sequence directly into an output label sequence (in embodiments of the invention, a sequence of characters or of tokens consisting mainly of frequent character n-grams) without any explicit intermediate representation of phonetic/linguistic structure such as phonemes or words. Its main advantage is that it avoids the need for hand-crafted language-dependent resources.

There have been some prior studies on multilingual/language-independent ASR. In the context of deep neural network (DNN)-based multilingual systems, a DNN is used to compute language-independent bottleneck features. In that approach, language-dependent back-end systems such as pronunciation dictionaries and language models still need to be prepared, and the spoken language must be predicted in order to cascade the language-independent and language-dependent modules.

Disclosure of Invention

In the present invention, a system and method with a language-independent neural network architecture are disclosed that can jointly recognize speech and identify the language among a plurality of different languages. For example, the present invention enables us to automatically recognize utterances in English, Japanese, Mandarin, German, Spanish, French, Italian, Dutch, Portuguese, and Russian, and to jointly identify the language of each utterance.

According to an embodiment of the invention, the network shares all parameters including the softmax layer across languages.

For example, because the network shares all parameters, including the softmax layer, by concatenating the grapheme sets of multiple languages, the language-independent neural network architecture of the present invention can jointly recognize speech and identify its language among different languages such as English, Japanese, Mandarin Chinese, German, Spanish, French, Italian, Dutch, Portuguese, and Russian.

The language independent neural network of the present invention is capable of multilingual end-to-end speech recognition by the following steps: (1) creating a universal set of labels that is a union of grapheme sets and language IDs from multiple languages and constructing an initial network based thereon, (2) inserting language ID labels into the transcription of each utterance of multiple different language corpora, (3) generating the utterance by selecting one or more utterances from the multiple different language corpora and concatenating them in an arbitrary order, wherein the corresponding transcriptions are also concatenated in the same order, (4) training the initial network with the generated utterances and the transcriptions, and (5) recognizing speech with the trained network.
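As a concrete, non-limiting illustration of steps (1) and (2), the following Python sketch builds the universal label set as the union of per-language grapheme sets and language IDs, and prepends a language ID tag to each transcription. The corpus layout and helper names are assumptions made only for this example.

```python
# Minimal sketch of steps (1)-(2): universal label set and language ID insertion.
# The corpus structure (per-language lists of (audio, transcription) pairs) is assumed.

def build_universal_label_set(corpora):
    """Union of per-language grapheme sets plus the language ID tags."""
    labels = set()
    for lang, utterances in corpora.items():
        labels.add(f"[{lang}]")                      # language ID tag, e.g. "[EN]"
        for _, transcription in utterances:
            labels.update(transcription)             # per-character graphemes
    return sorted(labels)

def insert_language_ids(corpora):
    """Prepend the language ID tag to every transcription (step (2))."""
    return {lang: [(audio, [f"[{lang}]"] + list(transcription))
                   for audio, transcription in utterances]
            for lang, utterances in corpora.items()}

corpora = {
    "EN": [("utt_en_001.wav", "how are you?")],
    "FR": [("utt_fr_001.wav", "comment allez-vous?")],
}
labels = build_universal_label_set(corpora)
tagged = insert_language_ids(corpora)
print(tagged["EN"][0][1][:4])                        # ['[EN]', 'h', 'o', 'w']
```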

This integrated end-to-end ASR system for multilingual speech recognition has three advantages: first, the monolithic architecture removes the language-dependent ASR modules and the external language identification module; second, the end-to-end architecture makes it unnecessary to prepare hand-crafted pronunciation dictionaries; and third, the shared network makes it possible to learn better feature representations even for resource-poor languages.

Because the training data is augmented to include language switches, the present invention also enables the end-to-end ASR system to function properly even if there are language switches in the speech signal.

According to some embodiments of the invention, a method for training a multilingual speech recognition network comprises: providing an utterance data set corresponding to a predetermined language; inserting language identification (ID) tags in the utterance data set, wherein each of the utterance data sets is tagged with a corresponding one of the language ID tags; concatenating the tagged utterance data sets; generating initial network parameters from the utterance data sets; and iteratively training an end-to-end network with a series of initial network parameters and the concatenated tagged utterance data sets until a training result reaches a threshold.

Further, according to an embodiment of the present invention, a speech recognition system for performing multilingual speech recognition includes: an interface to receive speech sounds; one or more processors; and one or more storage devices storing an end-to-end speech recognition network module that has been trained with trained network parameters obtained by a method for training a multilingual speech recognition network, wherein the end-to-end speech recognition network module comprises instructions that, when executed, cause the one or more processors to perform operations comprising: extracting an acoustic feature sequence from audio waveform data converted from the speech sounds using an acoustic feature extraction module; encoding the acoustic feature sequence into a hidden vector sequence using an encoder network having encoder network parameters; predicting a first output tag sequence probability by feeding the hidden vector sequence to a decoder network having decoder network parameters; predicting, by a connectionist temporal classification (CTC) module, a second output tag sequence probability using CTC network parameters and the hidden vector sequence from the encoder network; and searching for an output tag sequence having the highest sequence probability by combining the first output tag sequence probability and the second output tag sequence probability provided from the decoder network and the CTC module using a tag sequence search module.

Still further, in accordance with an embodiment of the present invention, a multilingual speech recognition system for generating trained network parameters for multilingual speech recognition includes: one or more processors; and one or more storage devices storing parameters and program modules comprising instructions executable by the one or more processors, the instructions, when executed, causing the one or more processors to perform operations comprising: providing an utterance data set corresponding to a predetermined language; inserting language identification (ID) tags in the utterance data set, wherein each of the utterance data sets is tagged with a corresponding one of the language ID tags; concatenating the tagged utterance data sets; generating initial network parameters from the utterance data sets; selecting initial network parameters according to a predetermined sequence; and iteratively training the end-to-end network with the series of selected initial network parameters and the concatenated tagged utterance data sets until a training result reaches a threshold.

The presently disclosed embodiments will be further explained with reference to the drawings. The drawings shown are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.

Drawings

[ FIG. 1]

FIG. 1 is a block diagram illustrating a method for multilingual speech recognition according to an embodiment of the present invention.

[ FIG. 2]

FIG. 2 is a block diagram illustrating a speech recognition module using a multilingual end-to-end network according to an embodiment of the present invention.

[ FIG. 3]

FIG. 3 is a diagram illustrating a neural network in a multilingual speech recognition module according to an embodiment of the present invention.

[ FIG. 4]

FIG. 4 is a block diagram illustrating a multilingual speech-recognition system according to an embodiment of the present invention.

[ FIG. 5]

FIG. 5 is a diagram illustrating a neural network in a multilingual speech recognition module according to an embodiment of the present invention.

[ FIG. 6]

FIG. 6 is a data preparation process for training a multilingual speech recognition module according to an embodiment of the present invention.

[ FIG. 7]

FIG. 7 shows an evaluation of character error rates of multilingual speech recognition as a function of the number of languages in an utterance, according to an embodiment of the present invention.

[ FIG. 8]

FIG. 8 illustrates an example output of multi-lingual speech recognition according to an embodiment of the present invention.

Detailed Description

While the above-identified drawing figures set forth the presently disclosed embodiments, other embodiments are also contemplated, as noted in the discussion. The present disclosure presents exemplary embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.

In a preferred embodiment of the present invention, language independent neural networks are constructed using multiple speech corpuses of different languages.

Neural networks can be used to recognize spoken utterances and to jointly identify the languages of the recognized utterances. For example, the neural network may be used to automatically transcribe utterances in English, Japanese, Mandarin, German, Spanish, French, Italian, Dutch, Portuguese, and Russian while identifying the language of each utterance. If one says "How are you?" in English to a system constructed according to an embodiment of the invention, the system can output "[EN] how are you?". If another person says "comment allez-vous?" in French, it may output "[FR] comment allez-vous?". [EN] and [FR] represent the language ID tags corresponding to English and French, respectively. If one says "How are you? comment allez-vous?" in English and French to the system, the system may output "[EN] how are you? [FR] comment allez-vous?".

The following description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with a description that can be used to implement one or more exemplary embodiments. It is contemplated that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosed subject matter as set forth in the appended claims.

In the following description specific details are given to provide a thorough understanding of the embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements of the disclosed subject matter may be shown in block diagram form as components in order not to obscure the implementations in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Moreover, like reference numbers and designations in the various drawings indicate like elements.

Furthermore, various embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may terminate when its operations are completed, but may have other steps not discussed or included in the figures. Moreover, not all operations in any process specifically described may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When the procedure corresponds to a function, the termination of the function may correspond to a return of the function to the calling function or the main function.

Furthermore, implementations of the disclosed subject matter can be implemented, at least in part, either manually or automatically. Manual or automated implementations may be implemented or at least assisted by using machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine-readable medium. The processor may perform the necessary tasks.

The modules and networks exemplified in this disclosure may be computer programs, software, or instruction code that may be executed using one or more processors. The modules and networks may be stored in one or more storage devices or otherwise stored in a computer-readable medium (e.g., magnetic disk, optical disk, or tape) such as a storage medium, computer storage medium, or data storage device (removable and/or non-removable), where the computer-readable medium is accessible from one or more processors to execute the instructions.

Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules or other data. The computer storage media may be RAM, ROM, EEPROM or flash memory, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the application, module, or both using one or more processors. Any such computer storage media may be part of the device, or may be media accessible by or connectable to the device. Any of the applications or modules described herein may be implemented using computer-readable/executable instructions that may be stored or otherwise maintained by such computer-readable media.

FIG. 1 illustrates the multilingual speech recognition module 100 stored in the storage 430 of FIG. 4. The multilingual speech recognition module 100 is a processor (hardware processor) executable program including program modules (computer-executable instruction modules) such as a language ID insertion module 112, an utterance concatenation module 113, an initial network construction module 115, an end-to-end network training module 117, and an end-to-end speech recognition module 200. The program modules 112, 113, 115, 117, and 200 included in the multilingual speech recognition module 100 are also depicted in FIG. 4. In addition, the storage 430 includes an encoder network module 202, an attention decoder network module 204, and a CTC module 208, which will be discussed later.

The multilingual speech recognition module 100 constructs a language independent network according to the following steps:

(1) The initial network construction module 115 generates initial network parameters 116 using a universal label set obtained as the union of the grapheme sets and the language IDs of the speech corpora 110 of different languages.

(2) The language ID insertion module 112 inserts a language ID tag into the transcription of each utterance of the speech corpora 110 of different languages.

(3) The utterance concatenation module 113 generates an utterance by selecting one or more utterances from the speech corpus 110 in different languages and concatenating them in a random order, where the corresponding transcriptions are also concatenated in the same order as the concatenated utterance.

(4) The end-to-end network training module 117 uses the generated utterance and transcription to optimize the initial network parameters 116 and outputs trained network parameters 118.

In some cases, the speech corpora 110 of different languages may be referred to as acoustic data sets 110. Further, depending on the system design, the modules and network parameters indicated in the present disclosure may be stored in one storage device or in multiple storage devices, and the modules are programs executable by the hardware processor 420 shown in FIG. 4. The processor 420 may be one or more (hardware) processors (computers). Each module performs its predetermined process or processes by being executed by the processor or processors.

Using the language independent networks stored in the trained network parameters 118, the end-to-end speech recognition module 200 is able to jointly recognize speech and the language ID for speech input and output the recognition result.

FIG. 2 is a block diagram illustrating an end-to-end speech recognition module 200 according to an embodiment of the present invention.

The end-to-end speech recognition module 200 includes an encoder network module 202, encoder network parameters 203, an attention decoder module 204, decoder network parameters 205, a tag sequence search module 206, a CTC module 208, CTC network parameters 209. The encoder network parameters 203, decoder network parameters 205, and CTC network parameters 209 are stored in a storage device to provide parameters to the respective modules 202, 204, and 208, respectively. The acoustic feature extraction module 434 in fig. 4 is used to extract the acoustic feature sequence 201 from the audio waveform data or the spectrum data. The audio waveform data or spectral data may be stored in a storage device and provided to the encoder network module 202. The audio waveform data or spectral data may be obtained via the input device 475 in fig. 4 using a digital signal processing module (not shown) that receives speech sounds and converts the speech sounds into audio waveform or spectral data. In addition, audio waveforms or spectral data stored in the storage 430 or memory 440 may be provided to the encoder network module 202. The signal of the speech sounds may be provided via the network 490 in fig. 4, and the input device 475 may be a microphone device.

The encoder network module 202 includes an encoder network that converts the acoustic feature sequence 201 into a hidden vector sequence, reading its parameters from the encoder network parameters 203.

The attention mechanism using the attention decoder network 204 is described below. The attention decoder network module 204 includes a decoder network. The attention decoder network module 204 receives the hidden vector sequence from the encoder network module 202 and the previous label from the tag sequence search module 206, and then computes a first posterior probability distribution over the next label given the previous labels, using the decoder network that reads its parameters from the decoder network parameters 205. The attention decoder network module 204 provides the first posterior probability distribution to the tag sequence search module 206. The CTC module 208 receives the hidden vector sequence from the encoder network module 202 and the previous label from the tag sequence search module 206, and computes a second posterior probability distribution over the next label sequence using the CTC network parameters 209 and a dynamic programming technique. After the computation, the CTC module 208 provides the second posterior probability distribution to the tag sequence search module 206.

The tag sequence search module 206 uses the first a posteriori probability distribution and the second a posteriori probability distribution provided from the attention decoder network module 204 and the CTC module 208 to find the tag sequence with the highest sequence probability. The first a posteriori probability and the second a posteriori probability of the tag sequence calculated by the attention decoder network module 204 and the CTC module 208 are combined into one probability. In this case, the combination of the calculated posterior probabilities may be performed based on a linear combination. With the end-to-end speech recognition module 200, CTC probabilities can be considered to find better alignment hypotheses for the input acoustic feature sequence.

Neural network architecture for language independent end-to-end speech recognition

End-to-end speech recognition is generally defined as the problem of finding the most probable label sequence Ŷ given an input acoustic feature sequence X, namely,

Ŷ = arg max_{Y ∈ U*} p(Y|X),   (1)

where U* denotes the set of possible label sequences given a predefined label set U, and a label may be a character or a word. The label sequence probability p(Y|X) can be computed using a pre-trained neural network.

In an embodiment of the present invention, the language independent neural network may be a combination of different networks such as a Feed Forward Neural Network (FFNN), a Convolutional Neural Network (CNN), and a Recurrent Neural Network (RNN).

For example, a hybrid attention/CTC architecture may be used for language independent neural networks. Fig. 2 is a block diagram illustrating a speech recognition module 200 using a multilingual end-to-end network with a hybrid attention/CTC architecture, wherein tag sequence probabilities are computed as follows.

The encoder module 202 includes an encoder network used to convert the acoustic feature sequence X = x_1, …, x_T into a hidden vector sequence H = h_1, …, h_T as

H = Encoder(X),   (2)

where the function Encoder(X) may consist of one or more recurrent neural networks (RNNs), which may be stacked. An RNN may be implemented as a long short-term memory (LSTM), which has an input gate, a forget gate, an output gate, and a memory cell in each hidden unit. Another RNN may be a bidirectional RNN (BRNN) or a bidirectional LSTM (BLSTM). A BLSTM is a pair of LSTM RNNs, one of which is a forward LSTM and the other a backward LSTM. A hidden vector of the BLSTM is obtained as the concatenation of the hidden vectors of the forward and backward LSTMs.

With the forward LSTM, the forward t-th hidden vector h_t^F is computed as

h_t^F = o_t^F ⊙ tanh(c_t^F)   (3)
o_t^F = σ(W_xo^F x_t + W_ho^F h_{t-1}^F + b_o^F)   (4)
c_t^F = f_t^F ⊙ c_{t-1}^F + i_t^F ⊙ tanh(W_xc^F x_t + W_hc^F h_{t-1}^F + b_c^F)   (5)
f_t^F = σ(W_xf^F x_t + W_hf^F h_{t-1}^F + b_f^F)   (6)
i_t^F = σ(W_xi^F x_t + W_hi^F h_{t-1}^F + b_i^F),   (7)

where σ(·) is the element-wise sigmoid function, tanh(·) is the element-wise hyperbolic tangent function, and i_t^F, f_t^F, o_t^F, and c_t^F are the input gate, forget gate, output gate, and cell activation vectors for x_t, respectively. ⊙ denotes element-wise multiplication between vectors. The weight matrices W_zz^F and bias vectors b_z^F are the parameters of the LSTM, identified by the subscript z ∈ {x, h, i, f, o, c}. For example, W_hi^F is the hidden-to-input-gate matrix and W_xo^F is the input-to-output-gate matrix. The hidden vector h_t^F is obtained recursively from the input vector x_t and the previous hidden vector h_{t-1}^F, where h_0^F is assumed to be a zero vector.

With the backward LSTM, the backward t-th hidden vector h_t^B is computed as

h_t^B = o_t^B ⊙ tanh(c_t^B)   (8)
o_t^B = σ(W_xo^B x_t + W_ho^B h_{t+1}^B + b_o^B)   (9)
c_t^B = f_t^B ⊙ c_{t+1}^B + i_t^B ⊙ tanh(W_xc^B x_t + W_hc^B h_{t+1}^B + b_c^B)   (10)
f_t^B = σ(W_xf^B x_t + W_hf^B h_{t+1}^B + b_f^B)   (11)
i_t^B = σ(W_xi^B x_t + W_hi^B h_{t+1}^B + b_i^B),   (12)

where i_t^B, f_t^B, o_t^B, and c_t^B are the input gate, forget gate, output gate, and cell activation vectors for x_t, respectively. The weight matrices W_zz^B and bias vectors b_z^B are the parameters of the LSTM, identified by the subscripts in the same manner as the forward LSTM. The hidden vector h_t^B is obtained recursively from the input vector x_t and the succeeding hidden vector h_{t+1}^B, where h_{T+1}^B is assumed to be a zero vector.

The hidden vector of the BLSTM is obtained by concatenating the forward and backward hidden vectors as

h_t = [h_t^F⊤, h_t^B⊤]⊤,   (13)

where ⊤ denotes the transpose operation, assuming that all vectors are column vectors. W_zz^F, b_z^F, W_zz^B, and b_z^B are considered the parameters of the BLSTM.

To obtain better hidden vectors, we can stack multiple BLSTMs by feeding the hidden vectors of the first BLSTM to the second BLSTM, the hidden vectors of the second BLSTM to the third BLSTM, and so on. If h_t' is a hidden vector obtained by one BLSTM, we assume x_t = h_t' when feeding it to another BLSTM. To reduce the amount of computation, only every other hidden vector of one BLSTM may be fed to the next BLSTM. In this case, the length of the output hidden vector sequence becomes half the length of the input acoustic feature sequence.

All the parameters W_zz^F, b_z^F, W_zz^B, and b_z^B of the multiple BLSTMs, identified by the subscript z ∈ {x, h, i, f, o, c}, are stored in the encoder network parameters 203 and used to compute the hidden vector sequence H.
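As an illustrative, non-limiting sketch of such an encoder, the following PyTorch code stacks BLSTM layers and keeps every other hidden vector between layers, so the output sequence is roughly half the input length. The feature dimension (83), hidden size, and layer count are assumptions for this example only, not the parameterization of a specific embodiment.

```python
# Illustrative stacked-BLSTM encoder with 1/2 subsampling between layers
# (a sketch under assumed dimensions, not the encoder of a specific embodiment).
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    def __init__(self, feat_dim=83, hidden=320, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            # bidirectional LSTM: forward and backward hidden vectors are concatenated
            self.layers.append(nn.LSTM(in_dim, hidden, batch_first=True,
                                       bidirectional=True))
            in_dim = 2 * hidden
        self.out_dim = 2 * hidden

    def forward(self, x):                  # x: (batch, T, feat_dim)
        h = x
        for i, blstm in enumerate(self.layers):
            h, _ = blstm(h)                # (batch, T_i, 2 * hidden)
            if i < len(self.layers) - 1:
                h = h[:, ::2, :]           # keep every other hidden vector
        return h                           # length about T / 2**(num_layers - 1)

enc = BLSTMEncoder()
X = torch.randn(1, 100, 83)                # toy acoustic feature sequence
H = enc(X)
print(H.shape)                             # torch.Size([1, 50, 640])
```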

The attention decoder module 204 includes a decoder network used to compute the label sequence probability p_att(Y|X) using the hidden vector sequence H. Suppose Y is a label sequence y_1, y_2, …, y_L of length L. To compute p_att(Y|X) efficiently, the probability is factorized by the probabilistic chain rule as

p_att(Y|X) = Π_{l=1}^{L} p_att(y_l | y_1, …, y_{l-1}, X),   (14)

and each label probability p_att(y_l | y_1, …, y_{l-1}, X) is obtained from a probability distribution over labels, which is estimated using the decoder network as

p_att(y | y_1, …, y_{l-1}, X) = Decoder(r_l, q_{l-1}),   (15)

where y is a random variable representing a label, r_l is called the content vector, which contains the content information of H, and q_{l-1} is the decoder state vector, which contains the contextual information of the previous labels y_1, …, y_{l-1} and the previous content vectors r_0, …, r_{l-1}. The label probability is therefore obtained as the probability of y = y_l given this context, i.e.,

p_att(y_l | y_1, …, y_{l-1}, X) = p_att(y = y_l | y_1, …, y_{l-1}, X).   (16)

content vector rlUsually given as a weighted sum of the concealment vectors of the encoder network, i.e.,

wherein, altReferred to as attention weight, which satisfies ∑talt1. Q may be usedl-1And H the attention weight is calculated as follows:

e_{lt} = w⊤ tanh(W q_{l-1} + V h_t + U f_{lt} + b)   (18)
f_l = F * a_{l-1}   (19)
a_{lt} = exp(e_{lt}) / Σ_{τ=1}^{T} exp(e_{lτ}),   (20)

where W, V, F, and U are matrices, and w and b are vectors, which are trainable parameters of the decoder network. e_{lt} is a matching score between the (l-1)-th state vector q_{l-1} and the t-th hidden vector h_t, used to form the temporal alignment distribution a_l = {a_{lt} | t = 1, …, T}. a_{l-1} denotes the previous alignment distribution {a_{(l-1)t} | t = 1, …, T} used for predicting the previous label y_{l-1}. f_l = {f_{lt} | t = 1, …, T} is the convolution result of F with a_{l-1}, which is used to reflect the previous alignment in the current alignment. "*" denotes the convolution operation.

The label probability distribution is obtained with the state vector q_{l-1} and the content vector r_l as

Decoder(r_l, q_{l-1}) = softmax(W_qy q_{l-1} + W_ry r_l + b_y),   (21)

where W_qy and W_ry are matrices and b_y is a vector, which are trainable parameters of the decoder network. For a K-dimensional vector v, the softmax() function is computed as

softmax(v) = [ exp(v[i]) / Σ_{j=1}^{K} exp(v[j]) ]_{i=1,…,K},   (22)

where v[i] denotes the i-th element of v.

The decoder state vector q_{l-1} is then updated to q_l using an LSTM as

q_l = o_l^D ⊙ tanh(c_l^D)   (23)
o_l^D = σ(W_xo^D x_l^D + W_ho^D q_{l-1} + b_o^D)   (24)
c_l^D = f_l^D ⊙ c_{l-1}^D + i_l^D ⊙ tanh(W_xc^D x_l^D + W_hc^D q_{l-1} + b_c^D)   (25)
f_l^D = σ(W_xf^D x_l^D + W_hf^D q_{l-1} + b_f^D)   (26)
i_l^D = σ(W_xi^D x_l^D + W_hi^D q_{l-1} + b_i^D),   (27)

where i_l^D, f_l^D, o_l^D, and c_l^D are the input gate, forget gate, output gate, and cell activation vectors for the input vector x_l^D, respectively. The weight matrices W_zz^D and bias vectors b_z^D are the parameters of the LSTM, identified by the subscripts in the same manner as the forward LSTM. The state vector q_l is obtained recursively from the input vector x_l^D and the previous state vector q_{l-1}, where q_{-1} = 0, y_0 = <sos>, and a_0 = 1/T are assumed for computing q_0. For the decoder network, the input vector x_l^D is given as the concatenation of the label y_l and the content vector r_l, i.e., x_l^D = [Embed(y_l)⊤, r_l⊤]⊤, where Embed(·) denotes label embedding, which converts a label into a fixed-dimensional vector. For example, it can be computed as

Embed(y) = W_e^D OneHot(y),   (28)

where OneHot(y) denotes the 1-of-N encoding of label y, which converts the label index into a one-hot vector representation, and W_e^D is a matrix, which is a trainable parameter.

All the parameters W_zz^D and b_z^D identified by the subscript z ∈ {x, h, i, f, o, c}, as well as W_qy, W_ry, b_y, and W_e^D, are stored in the decoder network parameters 205 and used to compute the label probability distribution p_att(y = y_l | y_1, …, y_{l-1}, X).
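For illustration only, the following PyTorch sketch implements one step of a location-aware attention decoder following equations (17)-(21) and (23)-(28): the previous alignment is convolved (19), combined into matching scores (18), normalized into attention weights (20), used to form the content vector (17), fed with the decoder state into a softmax output layer (21), and finally the decoder state is updated with the embedded label and the content vector. All dimensions and the exact module layout are assumptions of this sketch, not the parameterization of a specific embodiment.

```python
# Sketch of one attention-decoder step with location-aware attention (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, enc_dim=640, dec_dim=320, att_dim=320, emb_dim=256,
                 n_labels=5520, conv_channels=10, conv_kernel=101):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)          # W q_{l-1}
        self.V = nn.Linear(enc_dim, att_dim, bias=False)          # V h_t
        self.U = nn.Linear(conv_channels, att_dim, bias=True)     # U f_lt + b
        self.F_conv = nn.Conv1d(1, conv_channels, conv_kernel,
                                padding=conv_kernel // 2)         # F * a_{l-1}, eq. (19)
        self.w = nn.Linear(att_dim, 1, bias=False)                # w^T tanh(.)
        self.embed = nn.Embedding(n_labels, emb_dim)              # Embed(y), eq. (28)
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, dec_dim)       # eqs. (23)-(27)
        self.out = nn.Linear(dec_dim + enc_dim, n_labels)         # eq. (21)

    def forward(self, H, q_prev, c_prev, a_prev, y_l):
        """H: (T, enc_dim); q_prev, c_prev: (dec_dim,); a_prev: (T,); y_l: label index."""
        f = self.F_conv(a_prev.view(1, 1, -1)).squeeze(0).t()     # eq. (19)
        e = self.w(torch.tanh(self.W(q_prev) + self.V(H) + self.U(f))).squeeze(-1)  # (18)
        a = F.softmax(e, dim=0)                                   # eq. (20)
        r = (a.unsqueeze(-1) * H).sum(dim=0)                      # content vector, eq. (17)
        log_p = F.log_softmax(self.out(torch.cat([q_prev, r])), dim=-1)  # eq. (21)
        x = torch.cat([self.embed(y_l), r]).unsqueeze(0)          # x_l = [Embed(y_l); r_l]
        q, c = self.lstm(x, (q_prev.unsqueeze(0), c_prev.unsqueeze(0)))
        return log_p, q.squeeze(0), c.squeeze(0), a

step = AttentionDecoderStep()
H = torch.randn(50, 640)
log_p, q, c, a = step(H, torch.zeros(320), torch.zeros(320),
                      torch.full((50,), 1.0 / 50), torch.tensor(3))
```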

The CTC module 208 computes the CTC forward probability of a label sequence Y given the hidden vector sequence H. Note that the CTC formulation uses a label sequence Y = {y_l ∈ U | l = 1, …, L} of length L with a set of distinct labels U. By introducing a frame-wise label sequence Z = {z_t ∈ U ∪ {b} | t = 1, …, T} with an additional "blank" label b, and by using the probabilistic chain rule and conditional independence assumptions, the posterior distribution p(Y|X) is factorized as

p(Y|X) ≈ Σ_Z Π_t p(z_t | z_{t-1}, Y) p(z_t | X),   (29)

where p(z_t | z_{t-1}, Y) is considered a label transition probability including the blank label, and p(z_t | X) is the frame-wise posterior distribution conditioned on the input sequence X, which is modeled using a bidirectional long short-term memory (BLSTM) network:

p(z_t | X) = softmax(W_hy^CTC h_t + b_y^CTC),   (30)

where h_t is obtained with the encoder network, and W_hy^CTC is a matrix and b_y^CTC a vector, which are trainable parameters of CTC and are stored in the CTC network parameters 209. Although equation (29) has to handle a summation over all possible Z, the computation can be performed efficiently using the forward algorithm.

The forward algorithm for CTC is performed as follows. We use an extended label sequence Y' = y'_1, y'_2, …, y'_{2L+1} = b, y_1, b, y_2, …, b, y_L, b of length 2L+1, in which a blank label "b" is inserted between each pair of adjacent labels. Let α_t(s) be the forward probability, which represents the posterior probability of the label sequence y_1, …, y_l within time frames 1, …, t, where s indicates the position in the extended label sequence Y'.

For initialization, we set

α_1(1) = p(z_1 = b | X)   (31)
α_1(2) = p(z_1 = y_1 | X)   (32)
α_1(s) = 0, ∀s > 2.   (33)

For t = 2 to T, α_t(s) is computed recursively as

α_t(s) = ᾱ_t(s) p(z_t = y'_s | X),   (34)

where

ᾱ_t(s) = α_{t-1}(s) + α_{t-1}(s-1)                  if y'_s = b or y'_s = y'_{s-2},
         α_{t-1}(s) + α_{t-1}(s-1) + α_{t-1}(s-2)   otherwise.   (35)

Finally, the CTC-based label sequence probability is obtained as

p_ctc(Y|X) = α_T(2L+1) + α_T(2L).   (36)

When computing the forward probabilities, the recursion of equation (34) enforces Z to be monotonic and does not allow looping or big jumps of s in the alignment Z, because the recursion that obtains α_t(s) considers at most α_{t-1}(s), α_{t-1}(s-1), and α_{t-1}(s-2). This means that, when the time frame proceeds by one frame, the label changes from the previous label or blank, or remains the same label. This constraint plays the role of the transition probability p(z_t | z_{t-1}, Y), which forces alignments to be monotonic. Hence, p_ctc(Y|X) can be 0 or a very small value when it is computed based on irregular (non-monotonic) alignments.
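The forward recursion of equations (31)-(36) can be written compactly in code. The NumPy sketch below computes p_ctc(Y|X) from a frame-wise posterior matrix; it is a minimal illustration under assumed inputs, not the CTC module's actual implementation.

```python
# NumPy sketch of the CTC forward algorithm, eqs. (31)-(36).
# probs[t, k] is the frame-wise posterior p(z_t = k | X); `blank` is the blank label index.
import numpy as np

def ctc_forward_probability(probs, label_seq, blank=0):
    T = probs.shape[0]
    # extended label sequence Y' = b, y1, b, y2, ..., b, yL, b  (length 2L + 1)
    ext = [blank]
    for y in label_seq:
        ext += [y, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]          # eq. (31)
    alpha[0, 1] = probs[0, ext[1]]         # eq. (32); alpha[0, s] = 0 for s > 2, eq. (33)
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s - 1 >= 0:
                a += alpha[t - 1, s - 1]
            # the jump over the preceding blank is allowed only between distinct labels
            if s - 2 >= 0 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]   # eqs. (34)-(35)
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]   # eq. (36)

# toy check: 3 frames, labels {blank, A, B}, target label sequence "A"
probs = np.full((3, 3), 1.0 / 3)
print(ctc_forward_probability(probs, [1]))
```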

Finally, we obtain the tag sequence probability by combining the CTC-based probability in equation (36) and the attention-based probability in equation (14) in the log domain as follows:

log p(Y|X) = λ log p_ctc(Y|X) + (1 − λ) log p_att(Y|X),   (37)

where λ is a scaling weight such that 0 ≦ λ ≦ 1, and may be determined manually.

FIG. 3 is a schematic diagram illustrating a combined neural network module 300 according to an embodiment of the present invention. The combined neural network 300 includes an encoder network module 202, an attention decoder network module 204, and a CTC module 208. Each arrow represents a data transfer with or without transformation, and each square or circular node represents a vector or a predicted label. The acoustic feature sequence X = x_1, …, x_T is fed to the encoder network module 202, in which two BLSTMs are stacked and every other hidden vector of the first BLSTM is fed to the second BLSTM. The output of the encoder module 202 is the hidden vector sequence H = h'_1, h'_2, …, h'_{T'}, where T' = T/2. H is then fed to the CTC module 208 and the decoder network module 204. The CTC-based and attention-based sequence probabilities are computed with the CTC module 208 and the attention decoder network module 204, respectively, and are combined to obtain the label sequence probability.

Joint language identification and speech recognition

The key idea of a language-independent end-to-end system is to consider as the set of output labels an augmented character set consisting of the union of the character sets appearing in all the target languages, i.e., U^union = U^EN ∪ U^JP ∪ …, where U^EN, U^JP, … are the character sets of the individual languages. By using such an augmented character set, probabilities of character sequences can be computed for any language without a separate language identification module. The network is trained to automatically predict the correct character sequence for the target language of each utterance. Using the union, as opposed to using a unique character set for each language, eliminates the duplication of output symbols that occur in multiple languages and yields a more compact model representation with reduced computational cost. The language-independent system repeats the prediction of the language ID and speech recognition given consecutive multilingual speech.

Furthermore, by further augmenting the set of output labels to include the language IDs, i.e., using U^final = U^union ∪ {[EN], [JP], …} as the label set for end-to-end speech recognition, the prediction of the language ID becomes an explicit part of the system. According to an embodiment of the present invention, the network first predicts a language ID k ∈ {[EN], [JP], …}. Instead of the posterior distribution p(Y|X), where Y = y_1, …, y_L is a character sequence in U^union and X is a sequence of acoustic feature vectors, the system models the joint distribution p(k, Y|X) of the language ID and the character sequence as the distribution of an augmented sequence Y' = (k, Y), where y'_1 = k and y'_{l+1} = y_l for l = 1, …, L. This is formulated using the probabilistic chain rule as

p(Y'|X) = p(k|X) Π_{l=1}^{L} p(y_l | k, y_1, …, y_{l-1}, X).   (38)

further, for the case where multiple languages are included in the utterance, the network is allowed to always output multiple language IDs. For theY 'of'1,...,y′LWe use l1,...,lNRepresentation as languageCharacter of ID, YIndex of (i.e., k)n∈{[EN],[JP]… }). The system now models the joint distribution of language IDs and characters as:

this is in contrast to the examples such as "[ EN ] how < space > are < space > you? [ FR ] comment < space > allez-vous? ", where < space > formally represents that the modeling of the distribution of the language mixture character sequence including the language ID for the space character is the same.

A hybrid attention/CTC architecture may be used to model such language-mixed character sequences. When recognizing a language-mixed utterance, the network can switch the language of the output sequence. FIG. 5 shows an example of character sequence prediction using the hybrid attention/CTC network. The encoder network computes the hidden vector sequence by taking as input acoustic features consisting of Japanese and English speech. Although in the example we assume that x_1, …, x_5 correspond to Japanese and x_6, …, x_T correspond to English, there is no indicator of the language boundary in the actual acoustic feature sequence. According to an embodiment of the invention, the attention decoder network is able to predict the language ID "[JP]" followed by a Japanese character sequence, and after decoding the first Japanese character sequence, the network can further predict the language ID that matches the following character sequence, here "[EN]".

Data generation for multilingual speech recognition

To predict language-mixed utterances, the hybrid attention/CTC network needs to be trained using a set of such language-mixed corpora. However, it is difficult to collect a sufficient amount of speech in which multiple languages appear within the same utterance; in practice, collecting and transcribing such utterances is very costly and time consuming. In an embodiment of the invention, such a corpus is therefore generated artificially from a collection of existing language-dependent corpora.

It is assumed that each utterance in the multiple corpora has a corresponding transcription as a sequence of characters. In the following, a method for generating such a language-mixed corpus is described according to an embodiment of the present invention with reference to FIG. 1. First, the language ID insertion module 112 inserts a language ID into the transcription of each utterance in the language-dependent corpora. The language ID may be located at the head of each character sequence. Next, the utterance concatenation module 113 selects utterances from the language-dependent corpora while paying attention to the coverage of the selected utterances and the variation of language transitions (as described further below). The selected utterances (and their transcriptions) are then concatenated and considered as a single utterance in the generated corpus. This process is repeated until the duration of the generated corpus reaches that of the union of the original corpora.

FIG. 6 shows details of the generation process. We first define the probabilities of sampling languages and utterances. The probability of sampling a language is proportional to the duration of its original corpus, with a constant term 1/N added to mitigate the selection bias caused by data size. In our experiments, we set the maximum number N_concat of utterances to be concatenated to 3. For each number n_concat between 1 and N_concat, we create a concatenated utterance consisting of n_concat utterances from the original corpora by sampling n_concat languages and utterances based on their sampling probabilities. To maximize the coverage of the original corpora, we impose a maximum usage count n_reuse (set to 5 for the training set and 2 for the development and evaluation sets) to prevent any utterance from being used too many times. We use this procedure to generate a training set, a development set, and an evaluation set.
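The following Python sketch mirrors the generation procedure just described: languages are sampled with weight proportional to their corpus duration plus a 1/N term, up to N_concat utterances are concatenated, and a per-utterance reuse counter caps how often any utterance is selected. The data layout (dicts with "dur" and "text" fields) and helper names are assumptions made only for this illustration.

```python
# Sketch of the mixed-language corpus generation described above (assumed data layout).
import random
from collections import defaultdict

def generate_corpus(corpora, n_concat_max=3, n_reuse=5, target_dur=None):
    """corpora[lang] is a list of utterances, each an assumed dict
    {"dur": seconds, "text": ID-tagged transcription (list of labels)}."""
    langs = list(corpora.keys())
    N = len(langs)
    durations = {l: sum(u["dur"] for u in corpora[l]) for l in langs}
    total = sum(durations.values())
    # language sampling weight: proportional to corpus duration, plus a 1/N term
    weights = [durations[l] / total + 1.0 / N for l in langs]
    use_count = defaultdict(int)
    target = total if target_dur is None else target_dur
    generated, generated_dur = [], 0.0
    while generated_dur < target:
        n_concat = random.randint(1, n_concat_max)
        utts, text, dur = [], [], 0.0
        for lang in random.choices(langs, weights=weights, k=n_concat):
            pool = [u for u in corpora[lang] if use_count[id(u)] < n_reuse]
            if not pool:
                continue
            utt = random.choice(pool)
            use_count[id(utt)] += 1
            utts.append(utt)
            text += utt["text"]            # transcriptions concatenated in the same order
            dur += utt["dur"]
        if utts:
            generated.append({"utts": utts, "text": text, "dur": dur})
            generated_dur += dur
        else:
            break                          # all utterances exhausted (n_reuse reached)
    return generated
```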

Training process

In the end-to-end network training module 117, the encoder network parameters 203, the decoder network parameters 205, and the CTC network parameters 209 are jointly optimized so as to reduce the loss function

L(X, Y, Θ) = −Σ_{n=1}^{N} { λ log p_ctc(Y_n | X_n, Θ) + (1 − λ) log p_att(Y_n | X_n, Θ) },   (40)

Where X and Y are training data including acoustic feature sequences and tag sequences. According to an embodiment of the present invention, training data (X, Y) is generated from an existing language dependent corpus using language ID insertion module 112 and utterance concatenation module 113 in FIG. 1.

Θ denotes the set of network parameters including the encoder network parameters 203, the decoder network parameters 205, and the CTC network parameters 209. N is the number of training samples, X_n is the n-th acoustic feature sequence in X, and Y_n is the n-th label sequence in Y. p_ctc(Y_n | X_n, Θ) is the CTC-based sequence probability of equation (36) computed with the parameter set Θ, and p_att(Y_n | X_n, Θ) is the attention-based sequence probability of equation (14) computed with the parameter set Θ.

The set of network parameters Θ may be optimized by stochastic gradient descent. First, the initial network construction module 115 obtains the initial network parameters 116, in which the sizes of the matrices and vectors and the initial value of each element are determined. The sizes of the matrices and vectors may be determined manually or automatically. For example, for parameters that depend on the label set U^final, the sizes are determined according to the label set size |U^final|: the number of rows of the matrices W_qy and W_ry in equation (21) should equal |U^final|, and the dimension of the vector b_y should also equal |U^final|, because these numbers must equal the dimension of the label probability distribution p_att(y | y_1, …, y_{l-1}, X). Each element of the matrices and vectors may be set to a random real number. U^final is determined by obtaining the unique characters and language IDs in the original speech corpora 110 of different languages.

Next, the end-to-end network training module 117 jointly optimizes the encoder, decoder, and CTC network parameters in the parameter set Θ. Based on a gradient descent method, each element of the parameter set Θ is repeatedly updated as

Θ ← Θ − η ∂L(X, Y, Θ)/∂Θ,   (41)

until L(X, Y, Θ) converges, where η is the learning rate.

X and Y may also be divided into M small subsets {X_m, Y_m}_{m=1,…,M} such that X = X_1 ∪ … ∪ X_M and Y = Y_1 ∪ … ∪ Y_M, and the parameters are updated by repeating, for m = 1, …, M,

Θ ← Θ − η ∂L(X_m, Y_m, Θ)/∂Θ.   (42)

by updating the parameters with a small subset, the parameters can be updated more frequently and the loss function can converge more quickly.

In our experiments, we consider two training procedures. In the flat-start procedure, the model is trained from scratch using only the generated corpus. In the retraining procedure, the model is trained in two steps using the original corpora and the generated corpus as follows: we first train the model using training data without language switching (i.e., the original corpora) and then continue training using the data with language switching (the generated corpus). We consider these two steps for the following reasons. First, a model trained with data without language switching is a good starting point for training on the more challenging data with language switching. Second, we allow the data generation algorithm to select duplicated utterances in order to increase the proportion of resource-poor languages; however, this property can reduce coverage, and the two-step training alleviates this problem.

Tag sequence search

The tag sequence search module 206 finds the most probable label sequence Ŷ based on the combined label sequence probability as

Ŷ = arg max_{Y ∈ U*} { λ log p_ctc(Y|X) + (1 − λ) log p_att(Y|X) },   (43)

where p_ctc(Y|X) is the CTC-based label sequence probability of equation (36), p_att(Y|X) is the attention-based label sequence probability of equation (14), and, according to an embodiment of the present invention, the label set U is the final label set U^final including the language IDs.

however, since the number of possible tag sequences increases exponentially with the length of the sequence, all possible tag sequences for Y are enumerated and λ log p is calculatedctc(Y|X)+(1-λ)log patt(Y | X) is difficult. Therefore, beam search techniques are typically used to findWhere shorter tag sequence hypotheses are first generated and then only a limited number of hypotheses with higher scores than others are expanded to obtain longer hypotheses.Finally, the best marker sequence hypothesis is selected among the complete hypotheses for reaching the end of the sequence.

Let Ω_l be a set of partial hypotheses of length l. At the beginning of the beam search, Ω_0 contains only one hypothesis with the starting symbol <sos>. For l = 1 to L_max, each partial hypothesis in Ω_{l-1} is expanded by appending possible single labels, and the new hypotheses are stored in Ω_l, where L_max is the maximum length of the hypotheses to be searched.

The score of each partial hypothesis h is computed as

ψ_joint(h) = λ ψ_ctc(h, X) + (1 − λ) ψ_att(h),   (44)

where the attention-based score ψ_att(h) is computed recursively: for a hypothesis h obtained by appending a single label y to a partial hypothesis g,

ψ_att(h) = ψ_att(g) + log p_att(y | g, X).   (45)

To compute ψ_ctc(h, X), we use the CTC prefix probability, defined as the cumulative probability of all label sequences having h as their prefix:

p_ctc(h, … | X) = Σ_{ν ∈ (U ∪ {<eos>})^+} p_ctc(h · ν | X),   (46)

and we define the CTC score as

ψ_ctc(h, X) = log p_ctc(h, … | X),   (47)

where ν represents all possible label sequences except the empty string. Unlike ψ_att(h) in equation (45), the CTC score cannot be obtained recursively, but it can be computed efficiently by keeping the forward probabilities over the input time frames for each partial hypothesis.

According to an embodiment of the present invention, the tag sequence search module 206 finds the most probable label sequence Ŷ according to the following procedure.

Input: X, L_max

Output: Ŷ

In this procedure, Ω_l and Ω̂ are implemented as queues that accept partial hypotheses of length l and complete hypotheses, respectively. In lines 1-2, Ω_0 and Ω̂ are initialized as empty queues. In line 3, the score ψ_att(<sos>) of the initial hypothesis <sos> is set to 0. In lines 4-24, each partial hypothesis g in Ω_{l-1} is extended by each label y in the label set, where the operation Head(Ω) returns the first hypothesis in queue Ω and Dequeue(Ω) removes the first hypothesis from the queue.

In line 11, each extended hypothesis h is scored using the attention decoder network, and this score is combined with the CTC score in line 12. If y = <eos>, the hypothesis h is assumed to be complete and is stored in Ω̂ in line 14, where Enqueue(Ω̂, h) is the operation of adding h to Ω̂. If y ≠ <eos>, then h is stored in Ω_l in line 16, where in line 17 the number of hypotheses in Ω_l (i.e., |Ω_l|) is compared with a predetermined number beamWidth. If |Ω_l| exceeds beamWidth, the hypothesis h_min with the minimum score in Ω_l is removed from Ω_l in lines 18-19, where Remove(Ω_l, h_min) is the operation of removing h_min from Ω_l. Finally, Ŷ is selected in line 25 as the best hypothesis.
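The beam search just described can be sketched in Python as follows. The two scoring interfaces, att_score (one attention-decoder expansion for a prefix g, returning log p_att(y | g, X) for every label y) and ctc_score (the CTC prefix score ψ_ctc(h, X)), are assumed to be provided by the attention decoder network module 204 and the CTC module 208; their exact signatures, like the batched pruning order, are simplifications of this sketch rather than the procedure itself.

```python
# Sketch of the joint attention/CTC label sequence search (assumed scoring APIs).
def label_sequence_search(att_score, ctc_score, labels, sos, eos,
                          L_max, beam_width, lam=0.5):
    """labels includes <eos>; att_score(g) -> {y: log p_att(y | g, X)};
    ctc_score(h) -> psi_ctc(h, X)."""
    omega = [((sos,), 0.0)]                      # Omega_0: only <sos>, psi_att = 0
    complete = []                                # complete hypotheses ending in <eos>
    for _ in range(L_max):
        new_omega = []
        for g, psi_att_g in omega:
            log_p = att_score(g)                 # one attention decoder step for prefix g
            for y in labels:
                h = g + (y,)
                psi_att = psi_att_g + log_p[y]                    # eq. (45)
                psi = lam * ctc_score(h) + (1 - lam) * psi_att    # eq. (44)
                if y == eos:
                    complete.append((h, psi))
                else:
                    new_omega.append((h, psi, psi_att))
        # keep only the beam_width best partial hypotheses (pruning)
        new_omega.sort(key=lambda t: t[1], reverse=True)
        omega = [(h, psi_att) for h, _, psi_att in new_omega[:beam_width]]
        if not omega:
            break
    return max(complete, key=lambda t: t[1]) if complete else None
```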

The CTC score ψ_ctc(h, X) can be computed using a modified forward algorithm. Let γ_t^(n)(h) and γ_t^(b)(h) be the forward probabilities of the hypothesis h over time frames 1, …, t, where the superscripts (n) and (b) denote the different cases in which all CTC paths end with a non-blank label or with a blank label, respectively. Before starting the beam search, γ_t^(n)(·) and γ_t^(b)(·) are initialized, for t = 1, …, T, as

γ_t^(n)(<sos>) = 0   (48)
γ_t^(b)(<sos>) = Π_{τ=1}^{t} p(z_τ = b | X),   (49)

where we assume γ_0^(b)(<sos>) = 1 and b is the blank label. Note that, due to the sub-sampling technique in the encoder, the time index t and the input length T may differ from those of the input utterance. The CTC score function may be implemented as follows.

Input: h, X

Output: ψ_ctc(h, X)

In this function, the given hypothesis h is first split in line 1 into the last label y and the remaining labels g. If y is <eos>, the logarithm of the forward probability of h viewed as a complete hypothesis is returned in line 3. According to the definitions of γ_t^(n)(g) and γ_t^(b)(g), this forward probability is given by

p_ctc(h | X) = γ_T^(n)(g) + γ_T^(b)(g).   (50)

If y is not <eos>, the function computes the forward probabilities γ_t^(n)(h) and γ_t^(b)(h) and the prefix probability Ψ = p_ctc(h, … | X), assuming that h is not a complete hypothesis. The initialization and recursion steps for these probabilities are described in lines 5-13. In this function, it is assumed that, whenever the probabilities γ_t^(n)(h) and γ_t^(b)(h) are computed in lines 10-12, the probabilities γ_{t-1}^(n)(g) and γ_{t-1}^(b)(g) used in line 9 have already been obtained through the beam search process, because g is a prefix of h such that |g| < |h|. Therefore, the prefix probability and the forward probabilities can be computed efficiently. Note that last(g) in line 9 is a function that returns the last label of g.

Multi-language end-to-end speech recognition device

FIG. 4 illustrates a block diagram of a multilingual end-to-end speech recognition system 400 according to some embodiments of the present invention. The end-to-end speech recognition system 400 includes a human machine interface (HMI) 410 connectable to a keyboard 411 and a pointing device/medium 412, one or more processors 420, a storage device 430, a memory 440, a network interface controller (NIC) 450 connectable to a network 490 including a local area network and the Internet, a display interface 460, an audio interface 470 connectable to a microphone device 475, and a printer interface 480 connectable to a printing device 485. The memory 440 may be one or more memory units. The end-to-end speech recognition system 400 can receive electronic audio waveform/spectrum data 495 via the network 490 connected to the NIC 450. The storage device 430 includes the language ID insertion module 112, the utterance concatenation module 113, the initial network construction module 115, the end-to-end network training module 117, the encoder network module 202, the attention decoder network module 204, the CTC module 208, the end-to-end speech recognition module 200, and the acoustic feature extraction module 434. In some cases, the modules 112, 113, 115, 117, 202, 204, and 208 may be independently arranged in the storage device 430, the memory 440, or an externally connectable memory (not shown), depending on the system design.

Other program modules, such as the tag sequence search module, the encoder network parameters, the decoder network parameters, and the CTC network parameters, are omitted from the drawing. Further, the pointing device/medium 412 may include modules that are computer-executable (processor-executable) programs stored on a computer-readable recording medium. The attention decoder network module 204, the encoder network module 202, and the CTC module 208 may be formed from neural network parameters. The acoustic feature extraction module 434 is a program module for extracting an acoustic feature sequence from the audio waveform or spectrum data. The acoustic feature sequence may be a sequence of Mel-scale filter bank coefficients with their first- and second-order temporal derivatives and/or pitch features.

To perform end-to-end speech recognition, instructions may be sent to the end-to-end speech recognition system 400 using the keyboard 411, the pointing device/medium 412, or via the network 490 connected to other computers (not shown). The system 400 receives the instructions via the HMI 410 and executes them to perform end-to-end speech recognition using the processor 420 connected to the memory 440, by loading the end-to-end speech recognition module 200, the attention decoder network module 204, the encoder network module 202, the CTC module 208, and the acoustic feature extraction module 434 stored in the storage device 430.

Experiment of

The original corpus is based on WSJ, CSJ (Maekawa et al, 2000), HKUST (Liu et al, 2006) and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian) ("Voxforge", nd).

We build language-dependent and language-independent end-to-end systems with the hybrid attention/CTC network architecture. The language-dependent models use a 4-layer BLSTM encoder network, while the language-independent model has a deep BLSTM encoder network, for example a 7-layer deep BLSTM encoder network. The number of layers in the BLSTM encoder network is not limited to 7; in some cases, it may be 2 or more, depending on the system design. We use 80-dimensional Mel filter bank features concatenated with 3-dimensional pitch features. For the language-independent model, the final softmax layers in both the CTC-based and attention-based branches have 5,520 dimensions (i.e., |U^final| = 5,520).

for English, to handle relatively long sentences in the WSJ corpus, we extend the alphabetic character set to 201 by adding tokens corresponding to character sequences up to 5-grams that often occur in the WSJ text corpus. This makes the output length L shorter to reduce computational cost and GPU memory usage.
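As an illustration of this character-set augmentation, the following sketch adds the most frequent character n-grams (up to 5-grams) from a text corpus to the base English character set until a target vocabulary size is reached. The selection by raw frequency and the target-size handling are assumptions of this sketch, not the exact selection criterion used in the experiments.

```python
# Sketch: augment the English character set with frequent character n-grams (<= 5-grams).
from collections import Counter

def augment_character_set(transcriptions, base_chars, max_n=5, target_size=201):
    counts = Counter()
    for text in transcriptions:
        for n in range(2, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1          # count overlapping character n-grams
    ngrams = [g for g, _ in counts.most_common(max(target_size - len(base_chars), 0))]
    return sorted(base_chars) + ngrams

labels = augment_character_set(["the cat sat on the mat"],
                               set("abcdefghijklmnopqrstuvwxyz '?."))
print(len(labels), labels[-3:])
```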

For each language, we train a language dependent ASR model, where the dimension of the final softmax layer is set to the number of different characters/labels for that language. The end-to-end ASR concept is strictly followed herein without the use of any pronunciation lexicon, word-based language model, GMM/HMM, or DNN/HMM. Our hybrid attention/CTC architecture is implemented with Chainer (Tokui et al, 2015).

Results

FIG. 7 shows the character error rates (CERs) of the trained language-dependent and language-independent end-to-end ASR systems on a multilingual evaluation set that includes language switching. The CERs are averaged over the 10 languages and shown separately according to the number of languages in each utterance.

In the case where only one language is included in each utterance (i.e., no language switching), the language independent model is significantly superior to the language dependent model. When the number of languages is two or three, the language independent model trained with the language switching data reduces the CER from 31.5% to 21.3% for the case of 2 languages and from 38.6% to 20.8% for the case of 3 languages. By retraining the flat-start language-independent model, we obtained further reductions in CER, i.e., 19.4%, 19.3%, and 18.6% in each case, respectively.

Thus, language independent ASR successfully reduces CER, and models trained with language switching data can properly switch between languages during the decoding process.

We also compute the language ID error rate by extracting the language IDs from the recognition results obtained with the language-independent model retrained on the language-switching data. In the case where only one language is included in each utterance, the language ID error rate is 2.0%. In the case where one to three languages are included in each utterance, the ID error rate is 8.5%. Therefore, the present invention can jointly recognize multilingual speech and language IDs with few errors.

FIG. 8 shows an example of the transcriptions generated by our models. The utterance is composed of Japanese, English, and Dutch. The model trained without language switching cannot predict either the correct language IDs or the use of the Japanese character set. We can observe that the model trained with language switching recognizes the multilingual speech with a low CER.

In some embodiments of the present disclosure, when the above-described end-to-end speech recognition system is installed in a computer system, speech recognition can be efficiently and accurately performed with less computing power, whereby use of the end-to-end speech recognition method or system of the present disclosure can reduce the use and power consumption of a central processing unit.

Furthermore, embodiments according to the present disclosure provide efficient methods for performing end-to-end speech recognition, and thus, the use of methods and systems using end-to-end speech recognition models can reduce Central Processing Unit (CPU) usage, power consumption, and/or network bandwidth usage.

The above-described embodiments of the present disclosure may be implemented in any of a variety of ways. For example, embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such a processor may be implemented as an integrated circuit with one or more processors in the integrated circuit assembly. However, a processor may be implemented using circuitry in any suitable format.

Additionally, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and may also be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

Furthermore, embodiments of the present disclosure may be embodied as a method, examples of which have been provided. The actions performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts concurrently, even though shown as sequential acts in exemplary embodiments. Furthermore, the use of ordinal terms such as first, second and the like in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for the use of the ordinal term) to distinguish the claim elements.
